data platform for metrics system
Topic: data platform for metrics system
Interviewer: Meng
Interviewee: MiracleGuardian
Level: L5 (Senior)
Additional Resources:
Meeting notes:
https://docs.google.com/document/d/1-l4aji0aG-or-ZoTXpf-zuUOHLXWo6Ckj2vCNNhogQI/edit#
System design event sign-up
Job application sign-up
https://commitway.com/job-refer
YouTube channel
https://www.youtube.com/channel/UCKzpuki3fHTfCCngJDCZ_Mg
Mock System Design Interview Summary
Interview Overview
Date: 2/5/2022
Target level: L5 - senior
Duration: 45 minutes
Topic covered: Data Platform for Metrics System
Drawing tool used:
Requirements
10k people with 1k products
Provide metric system for products
Granularity: not real time
Alarm, dashboard, data retention
API:
Scale estimate:
Write QPS: 10 million
1k products
Scaling
availability
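A quick back-of-the-envelope check of the scale figures above. This is only a sketch: it assumes the write load is spread evenly across products, which the notes do not state.

```python
# Back-of-the-envelope check of the write load from the requirements.
TOTAL_WRITE_QPS = 10_000_000  # from the notes: 10 million writes per second
NUM_PRODUCTS = 1_000          # from the notes: 1k products
SECONDS_PER_DAY = 86_400

# Assumption (not in the notes): load is evenly distributed across products.
per_product_qps = TOTAL_WRITE_QPS // NUM_PRODUCTS

# Raw data points per day if nothing is aggregated before storage.
writes_per_day = TOTAL_WRITE_QPS * SECONDS_PER_DAY

print(per_product_qps)  # 10000
print(writes_per_day)   # 864000000000
```

Under that assumption each product writes about 10k points per second, and the unaggregated volume is on the order of 10^11 to 10^12 points per day.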
Design
Who owns the business logic for alarm and actions?
Service team vs platform team
The first step is onboarding a new service; then add a UI for other teams to use
[32 minutes left]
NoSQL database > SQL
Scalability > consistency
Product service ID: partition key
Sort key:
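A minimal in-memory sketch of the key layout above. The notes leave the sort key blank; using the timestamp as the sort key is an assumption made here for illustration, and `MetricStore` is a hypothetical stand-in, not a real database client.

```python
import bisect
from collections import defaultdict


class MetricStore:
    """Toy store: one sorted list of rows per partition key."""

    def __init__(self):
        # partition key (product service ID) -> rows sorted by timestamp
        self._partitions = defaultdict(list)

    def put(self, product_service_id, timestamp, metric, value):
        # ASSUMPTION: timestamp acts as the sort key within a partition.
        bisect.insort(self._partitions[product_service_id],
                      (timestamp, metric, value))

    def query(self, product_service_id, start, end):
        """Return rows with start <= timestamp < end for one product."""
        rows = self._partitions.get(product_service_id, [])
        lo = bisect.bisect_left(rows, (start,))
        hi = bisect.bisect_left(rows, (end,))
        return rows[lo:hi]


store = MetricStore()
store.put("svc-a", 20, "latency_ms", 1.0)
store.put("svc-a", 10, "latency_ms", 2.0)
store.put("svc-b", 15, "latency_ms", 3.0)
result = store.query("svc-a", 10, 20)
```

A time-range query touches only one partition and reads a contiguous slice, which is the usual reason for pairing a per-product partition key with a time-ordered sort key.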
Push vs pull:
We can use a pull model to collect data. Because the collector decides when and how fast it reads, this reduces the risk of traffic spikes.
Optimization
Glacier to store old data
[26 minutes left]
Interviewer: Can we accommodate the needs of other teams?
Data access
Resource management
Interviewee: Add a permission service
Interviewer: 10M QPS is too high. We may need to aggregate the data
Add small data aggregator service
[19 minutes left]
Pre-aggregate data before sending it to the central service
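One way to picture the small aggregator: collapse raw points into one summary per service, metric, and time interval before anything is sent to the central service. This is a sketch only; the tuple layout, the 60-second default, and the function name are assumptions, not part of the interview design.

```python
from collections import defaultdict


def pre_aggregate(points, interval_seconds=60):
    """Collapse raw points into one summary per (service, metric, interval).

    points: iterable of (product_service_id, metric, timestamp_seconds, value).
    """
    summaries = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for service_id, metric, ts, value in points:
        bucket_start = ts - ts % interval_seconds  # start of the interval
        summary = summaries[(service_id, metric, bucket_start)]
        summary["count"] += 1
        summary["sum"] += value
    return dict(summaries)


raw = [
    ("svc-a", "latency_ms", 0, 10.0),
    ("svc-a", "latency_ms", 30, 20.0),
    ("svc-a", "latency_ms", 61, 5.0),
]
summaries = pre_aggregate(raw)
```

Three raw points become two summaries here. The central service then receives one record per interval instead of one per event, which is how aggregation brings the 10M QPS figure down.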
Interviewer: Will the data collection be on-premise or in the cloud?
Interviewee: prefer to put in the cloud.
Interviewer: What is the tradeoff?
Interviewee: easier to onboard and scale
Interviewer: how do we separate data from different teams in the pipeline?
Interviewee: Different teams will use separate data collection Lambdas?
Interviewer: how do we handle one service having too much data?
Interviewee:
Failure of data pulls: add ZooKeeper so that a failed Lambda gets restarted
Data collector service failure: data pushes will fail. Pushes will be retried after the data collector is back up
DB failures: ZooKeeper or a heartbeat will keep the DB alive
Interviewer: What if some component is down?
Interviewee: we will restart that component.
[11 minutes left]
Interviewer: how do we keep record from previous runs?
Interviewee: we will pull data at time intervals.
9pm pull succeeds
10pm pull fails
Zookeeper can restart the 10pm pull
10K TPS
Pull data every 5-10 minutes
We can pull at different intervals depending on the load.
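The 9pm/10pm example above amounts to keeping a status per pull window so that a failed window can be retried later. The sketch below shows that bookkeeping in memory. `PullLog`, `run_pull`, and `retry_failed` are hypothetical names, and no real ZooKeeper or Lambda API is involved.

```python
class PullLog:
    """Remembers the outcome of each pull window."""

    def __init__(self):
        self.status = {}  # window label -> "ok" or "failed"

    def record(self, window, ok):
        self.status[window] = "ok" if ok else "failed"

    def failed_windows(self):
        return sorted(w for w, s in self.status.items() if s == "failed")


def run_pull(log, window, fetch):
    """Run one pull and record whether it succeeded."""
    try:
        data = fetch(window)
    except Exception:
        log.record(window, ok=False)
        return None
    log.record(window, ok=True)
    return data


def retry_failed(log, fetch):
    """Re-run every window that is currently marked as failed."""
    return {w: run_pull(log, w, fetch) for w in log.failed_windows()}


outage = {"22:00"}  # simulated source outage for the 10pm window


def fetch(window):
    if window in outage:
        raise ConnectionError("source unavailable")
    return f"metrics for {window}"


log = PullLog()
run_pull(log, "21:00", fetch)    # 9pm pull succeeds
run_pull(log, "22:00", fetch)    # 10pm pull fails
pending = log.failed_windows()   # the 10pm window is waiting for a retry
outage.clear()                   # source recovers
recovered = retry_failed(log, fetch)
```

In a real deployment the log would have to live in durable storage so that a restarted collector can see which windows are still outstanding.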
Interviewer: which team creates the data collection rules?
Interviewee: the data platform team will create some template, then service teams will create rules for collecting data.
createDataAggregatorLambda(product_service_id, metrics, time_interval, tps)
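The call above could be backed by a small validated rule object that service teams fill in from the platform team's template. This is a sketch under assumptions: the field types, the unit of the interval (seconds), and the validation rules are illustrative and were not specified in the interview.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AggregatorRule:
    product_service_id: str
    metrics: tuple               # names of the metrics to collect
    time_interval_seconds: int   # ASSUMPTION: interval expressed in seconds
    tps: int                     # expected transactions per second


def create_data_aggregator_lambda(product_service_id, metrics,
                                  time_interval_seconds, tps):
    """Validate a collection rule and return it as an immutable record."""
    if not product_service_id:
        raise ValueError("product_service_id is required")
    if not metrics:
        raise ValueError("at least one metric is required")
    if time_interval_seconds <= 0 or tps <= 0:
        raise ValueError("time_interval_seconds and tps must be positive")
    return AggregatorRule(product_service_id, tuple(metrics),
                          time_interval_seconds, tps)


rule = create_data_aggregator_lambda(
    "svc-a", ["latency_ms", "error_count"], 300, 10_000)
```

Here the rule asks for a pull every 5 minutes at 10K TPS, matching the figures mentioned in the discussion.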
Interviewer: Do we need to use a message queue?
Interviewee: We can add a message queue.
We may only need an MQ for some services, but probably not for all services.
Interviewer: Some product teams don’t need real time. They may have dependencies on other teams: the build team may submit code submission data, and the test team may submit test data later.
How do you handle data aggregation across multiple teams?
Interviewee:
We may need a compute service to aggregate
Interviewer:
Flow and time control is good
Requirement:
Should ask more questions when gathering requirements, such as how teams interact
Ideally explain deeper, e.g.
cloud vs on-premise.
Pull and push
We probably need MQ, data aggregation, ETL pipeline
Monitoring system
Backup data: at this QPS the data volume is too large.
Interviewee:
Classic interview question
The interviewer asked a classic question, but some points differ from the classic solutions
On-premise vs cloud
Metrics system
Interviewer:
Cloud vs on-premise
Kafka:
Topics can be used for resource management
Consumer
5 aspects:
Data collection
Data operation
Data store
Data retrieval
Data alert/abnormal
Data collection:
Interviewee design: Lambda service
API, or log file
Cloud vs on-premise
Cloud is faster
Adding a server in the cloud is much faster
On-premise, the workflow is much slower (e.g. 1 month)
The cloud handles international teams
CDN
AWS can create replicas in different data centers. For example, the Texas data center went down; a distributed setup is more reliable.
On premise
security is higher
Serve local customers
Data service:
Data collection connects to Kafka
Data can be consumed by Spark
Later it feeds the dashboard
Service management is good
We should have a data validation/aggregation service
Different teams may have different styles (API vs log)
Data store:
Elasticsearch, Databricks
Need to process sensitive data
Audience:
Do we need to build the whole system end-to-end on premise?
Interviewer:
Teams cannot access other teams’ data
Security DL - distribution list
HR data access list
Kafka, S3 restrictions, AWS rules/IAM, distribution lists, permissions
Maybe not the first version
Priority - how to manage data, data resource management
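The distribution-list idea above can be sketched as a small permission service: a user may read a team's dataset only if they belong to a DL that was granted access to it. This is an in-memory illustration with hypothetical names. It does not call AWS IAM or any real directory service.

```python
class PermissionService:
    """Maps distribution lists (DLs) to the datasets they may read."""

    def __init__(self):
        self._members = {}  # DL name -> set of user names
        self._grants = {}   # dataset name -> set of DL names

    def add_member(self, dl, user):
        self._members.setdefault(dl, set()).add(user)

    def grant(self, dataset, dl):
        self._grants.setdefault(dataset, set()).add(dl)

    def can_read(self, user, dataset):
        # Access is allowed only through membership of a granted DL.
        return any(user in self._members.get(dl, set())
                   for dl in self._grants.get(dataset, set()))


perms = PermissionService()
perms.add_member("hr-data-dl", "alice")
perms.add_member("build-team-dl", "bob")
perms.grant("hr_metrics", "hr-data-dl")
```

With this setup `alice` can read `hr_metrics` and `bob` cannot. Access is denied by default, so a dataset with no grants is readable by nobody.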
Audience:
Do we need to change the database schema to add a field?
Kafka topic -> map to different data in database.
Different record
Schema
DL to IAM rule
Alert:
CloudWatch or other ways to alert
We can add ML to detect anomalies
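The notes suggest ML for anomaly detection. The sketch below is not ML: it is a plain z-score check, shown only as a simple baseline that an alerting path could start from. The threshold of 3 standard deviations and the function name are assumptions.

```python
import statistics


def is_anomalous(history, value, threshold=3.0):
    """Flag value if it is more than `threshold` standard deviations
    away from the mean of recent history."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean  # flat history: any change is unusual
    return abs(value - mean) / stdev > threshold


recent_latency_ms = [10, 11, 9, 10, 10]
spike = is_anomalous(recent_latency_ms, 50)
normal = is_anomalous(recent_latency_ms, 10)
```

A fixed threshold like this is easy to reason about but does not handle seasonality, which is one reason a team might later move to a learned model.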
Audience
Do we need to store raw data?
Interviewer:
Databricks has pros and cons
Data processing
By product team or by platform
Discussed with Google:
Platform - only for storage, and CRUD.
Product team is responsible for processing
Audience
Why do we need the small data aggregator?
Interviewer:
We can simplify by merging the data aggregator and data collector services
Interviewee:
The data collector does not store data; it is only for processing
Interviewer:
We should add validation and data processing, e.g. obfuscate/remove sensitive data
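A minimal sketch of such a validation and scrubbing step. The required fields and the list of sensitive field names are assumptions chosen for illustration, and hashing is just one possible way to obfuscate.

```python
import hashlib

# ASSUMPTION: illustrative field names, not taken from the interview.
REQUIRED_FIELDS = {"product_service_id", "metric", "timestamp", "value"}
SENSITIVE_FIELDS = {"user_email", "ip_address"}


def validate_and_scrub(record):
    """Reject malformed records and hash sensitive fields."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    value = record["value"]
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError("value must be numeric")
    cleaned = {}
    for key, field_value in record.items():
        if key in SENSITIVE_FIELDS:
            # Replace the raw value with a SHA-256 digest.
            digest = hashlib.sha256(str(field_value).encode("utf-8"))
            cleaned[key] = digest.hexdigest()
        else:
            cleaned[key] = field_value
    return cleaned


cleaned = validate_and_scrub({
    "product_service_id": "svc-a",
    "metric": "login_count",
    "timestamp": 1644019200,
    "value": 3,
    "user_email": "someone@example.com",
})
```

An unsalted hash still lets someone confirm a guessed value, so dropping the field entirely or using a keyed hash may be more appropriate for truly sensitive data.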
Audience:
Do we need to monitor the aggregator?
Interviewer:
It’s one of the ways to monitor
Audience:
For pulling, we need to use zookeeper to manage the upstream service
Audience:
From metrics to UI, there is a 1-hour delay
Should on-call engineers use the dashboard? The delay may be too long
Interviewer
The platform is for many teams
There is no absolute answer. Usually on-call is part of the system. We may monitor the overall health of the system.
The delay could be 10 to 15 minutes
Some product teams may want real-time aggregation
Interviewee:
Usually 1-5 minutes is manageable
Audience:
Do we need a real time path and a slow path?
Interviewer:
Streaming vs map-reduce
Lots of discussion
Audience:
Requirement: can tolerate some delay. Don’t make 1 hour a hard requirement
Audience:
To reduce granularity, we need to change the system
Audience:
We can have multi-level aggregation: start with 0.5-minute buckets, then aggregate to 1 minute, 5 minutes, etc.
Interviewer
This can be handled in ETL
ETL is a consumer of Kafka
Hybrid system
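Multi-level aggregation as discussed above can be sketched as a roll-up that merges fine-grained buckets into coarser ones. Each bucket keeps a count and a sum, so averages stay exact at every level. The bucket representation and function name are assumptions for illustration.

```python
def roll_up(buckets, coarse_seconds):
    """Merge buckets into coarser ones.

    buckets: dict mapping bucket start (seconds) -> (count, total).
    """
    coarse = {}
    for start, (count, total) in buckets.items():
        coarse_start = start - start % coarse_seconds
        prev_count, prev_total = coarse.get(coarse_start, (0, 0.0))
        coarse[coarse_start] = (prev_count + count, prev_total + total)
    return coarse


# 30-second (0.5-minute) buckets: start -> (count, sum of values)
half_minute = {0: (2, 10.0), 30: (1, 5.0), 60: (3, 9.0)}
one_minute = roll_up(half_minute, 60)
five_minute = roll_up(one_minute, 300)
```

Storing count and sum rather than a precomputed average is what makes the roll-up composable: the 5-minute level can be built from the 1-minute level without going back to raw data.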
Interviewer:
We use Elasticsearch
InfluxDB
Prometheus
Time series databases
Audience
Should clarify tradeoff for NoSQL database
Audience
Do all product teams share the same schema?
Interviewer:
If we don’t need data processing, then the data can be put in one table
If we need further data processing, then we will put it into Databricks/Spark. Each team will define its own schema.
Audience:
For ETL, do we use an existing product or implement it ourselves?
Interviewer:
We can use Databricks/Spark sigma, but we still need to write some code/configuration
Audience:
ZooKeeper is not for health checks
It tracks who is responsible for product A or product B
N vs M
For pull, we need ZooKeeper to decide which service each collector pulls data from
Audience:
Every time a Lambda comes up, it registers with ZooKeeper to learn where to get data from.
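The N-collectors-to-M-services assignment discussed above can be sketched with an in-memory registry standing in for ZooKeeper. `CollectorRegistry` is hypothetical, and the round-robin assignment is an assumption. A real setup would rely on ZooKeeper's own primitives for membership and coordination.

```python
class CollectorRegistry:
    """Tracks live collectors and splits services among them."""

    def __init__(self, services):
        self._services = sorted(services)  # the M upstream services
        self._collectors = []              # the N registered collectors

    def register(self, collector_id):
        if collector_id not in self._collectors:
            self._collectors.append(collector_id)
            self._collectors.sort()

    def deregister(self, collector_id):
        if collector_id in self._collectors:
            self._collectors.remove(collector_id)

    def assignments(self):
        """Round-robin services over the currently registered collectors."""
        result = {c: [] for c in self._collectors}
        for i, service in enumerate(self._services):
            if self._collectors:
                owner = self._collectors[i % len(self._collectors)]
                result[owner].append(service)
        return result


registry = CollectorRegistry(["svc-a", "svc-b", "svc-c"])
registry.register("collector-1")
registry.register("collector-2")
before = registry.assignments()
registry.deregister("collector-1")  # collector-1 goes away
after = registry.assignments()
```

When a collector disappears, its services are reassigned to the remaining ones on the next call. This is the ownership question ("who is responsible for product A or product B") rather than a health check.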
Interviewer:
We use push
We use agent pull
Data collection perspective: it’s all push
Data team can build an agent to push
Audience:
Prometheus uses pull
Interviewer:
Agent: product team can build their own agent
Agent: can collaborate with platform team to build agent
Audience:
Do we need further aggregation?
Interviewer:
Both usage patterns exist
Audience
For direct storage, do we still aggregate?
Interviewer
We can do aggregation in Spark