data platform for metrics system

Topic: data platform for metrics system

Interviewer: Meng

Interviewee: MiracleGuardian

Level: L5 (Senior)

Additional Resources:


Meeting notes:

https://docs.google.com/document/d/1-l4aji0aG-or-ZoTXpf-zuUOHLXWo6Ckj2vCNNhogQI/edit#

System design event sign-up

https://commitway.com/design

Job referral sign-up

https://commitway.com/job-refer

YouTube channel

https://www.youtube.com/channel/UCKzpuki3fHTfCCngJDCZ_Mg

Mock System Design Interview Summary

Interview Overview

Date: 2/5/2022

Target level: L5 - senior

Duration: 45 minutes

Topic covered: Data Platform for Metrics System

Drawing tool used:

Requirements

10k people with 1k products

Provide metric system for products

Granularity: not real time

Alarm, dashboard, data retention

API:

Scale estimate:

Write QPS: 10 million

1k products

Scaling

availability
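A rough back-of-the-envelope sketch of what 10 million writes per second implies for raw storage. The 100-byte point size is an assumption for illustration; it was not stated in the session.

```python
# Back-of-the-envelope sizing for the raw write stream.
WRITE_QPS = 10_000_000        # from the requirements above
BYTES_PER_POINT = 100         # assumption: not stated in the session
SECONDS_PER_DAY = 24 * 60 * 60


def raw_bytes_per_day(qps: int, bytes_per_point: int) -> int:
    """Bytes written per day if every data point is stored unaggregated."""
    return qps * bytes_per_point * SECONDS_PER_DAY


def to_terabytes(n_bytes: int) -> float:
    """Convert bytes to decimal terabytes (10**12 bytes)."""
    return n_bytes / 10**12


daily_tb = to_terabytes(raw_bytes_per_day(WRITE_QPS, BYTES_PER_POINT))
print(f"{daily_tb:.1f} TB/day")
```

Under that assumption the raw stream is on the order of tens of terabytes per day, which is why storing or backing up unaggregated data is hard to justify.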

Design

Who owns the business logic for alarm and actions?

Service team vs platform team

First step: onboard a new service and add a UI for other teams to use

[32 minutes left]

NoSQL database > SQL

Scalability > consistency

Product service ID: partition key

Sort key:

Push vs pull:

We can use pull model to collect data. This way we will reduce the risk of traffic spikes.

Optimization

Glacier (Amazon S3 Glacier) to store old data

[26 minutes left]

Interviewer: Can we accommodate needs for other teams

Use of Data access

Resource management

Interviewee: Add a permission service

Interviewer: 10M QPS is too high. We may need to aggregate the data

Add small data aggregator service

[19 minutes left]

Pre-aggregate data before it is collected into the central service
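A minimal sketch of what a small pre-aggregator could do before forwarding to the central service: collapse raw points into one summary per service, metric, and time window. The tuple layout, the 60-second window, and all names here are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Aggregate:
    """Running summary of the values seen in one window."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)


def pre_aggregate(points, window_seconds=60):
    """Group raw (service_id, metric, unix_ts, value) points into windows."""
    buckets = defaultdict(Aggregate)
    for service_id, metric, ts, value in points:
        window_start = ts - (ts % window_seconds)
        buckets[(service_id, metric, window_start)].add(value)
    return dict(buckets)


raw = [
    ("checkout", "latency_ms", 120, 10.0),
    ("checkout", "latency_ms", 150, 30.0),
    ("checkout", "latency_ms", 185, 20.0),
]
rollup = pre_aggregate(raw)
```

Three raw points become two window summaries here; at production volume the reduction depends on how many points fall into each window.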

Interviewer: Will the data collection be on-premise or in the cloud?

Interviewee: prefer to put in the cloud.

Interviewer: What is the tradeoff?

Interviewee: easier to onboard and scale

Interviewer: how do we separate data from different teams in the pipeline?

Interviewee: different teams will use different data collection Lambdas?

Interviewer: how do we handle one service having too much data?

Interviewee:

Failure of data pulls: add ZooKeeper so that a failed Lambda can be restarted

Data collector service failure: data push will fail. Will retry push after data collector is back up

DB failures: Zookeeper or heartbeat will keep DB alive

Interviewer: What if some component is down?

Interviewee: we will restart that component.

[11 minutes left]

Interviewer: how do we keep record from previous runs?

Interviewee: we will pull data at time intervals.

9pm pull succeeds

10pm pull fails

Zookeeper can restart the 10pm pull

10K TPS

Pull data every 5-10 minutes

We can pull at different intervals depending on the load.
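One way to keep a record of previous runs is a checkpoint of the last window that was pulled successfully, so a restarted pull resumes from the failed window. This is a sketch only: the in-memory dict stands in for a durable store, and `run_pulls` and `flaky_pull` are hypothetical names.

```python
from typing import Callable, Dict, List


def run_pulls(
    windows: List[int],
    pull: Callable[[int], bool],
    checkpoint: Dict[str, int],
) -> List[int]:
    """Pull each window newer than the checkpoint, stopping at the first failure.

    Returns the windows that were pulled successfully in this run.
    """
    done = []
    for window in sorted(windows):
        if window <= checkpoint.get("last_ok", -1):
            continue  # already collected in a previous run
        if not pull(window):
            break  # checkpoint is unchanged, so a later run retries this window
        checkpoint["last_ok"] = window
        done.append(window)
    return done


# Hours of the day stand in for pull windows; the 22:00 (10pm) pull fails once.
failures = {22}


def flaky_pull(hour: int) -> bool:
    if hour in failures:
        failures.discard(hour)
        return False
    return True


state: Dict[str, int] = {}
first_run = run_pulls([21, 22, 23], flaky_pull, state)
second_run = run_pulls([21, 22, 23], flaky_pull, state)
```

The first run collects only the 9pm window; the second run skips it and retries from 10pm.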

Interviewer: which team creates the data collection rules?

Interviewee: the data platform team will create some template, then service teams will create rules for collecting data.

createDataAggregatorLambda(product_service_id, metrics, time_interval, tps)

Interviewer: Do we need to use a message queue?

Interviewee: we can add message queue

We may only need MQ for some service, but probably not for all services.

Interviewer: some product teams don’t need real-time data. They may depend on other teams: the build team may submit code submission data first, and the test team may submit test data later.

How do you handle data aggregation across multiple teams?

Interviewee:

We may need compute service to aggregate

Interviewer:

Flow and time control is good

Requirement:

Should ask more questions when gathering requirements, such as how teams interact with each other

Ideally explain deeper, e.g.

cloud vs on-premise.

Pull and push

We probably need MQ, data aggregation, ETL pipeline

Monitoring system

Backing up data: at this QPS the raw data is too large to back up.

Interviewee:

Classic interview question

The interviewer asked a classic question, but some points differ from the classic solutions

On-premise vs cloud

Metrics system

Interviewer:

Cloud vs on-premise

Kafka:

topic can be used as resource management

Consumer

5 aspects:

Data collection

Data operation

Data store

Data retrieval

Data alerting / anomaly detection

Data collection:

Interviewee design: Lambda service

API, or log file

Cloud vs on-premise

Cloud is faster

To add a server on cloud, it’s much faster

For on-premise, workflow is much slower (e.g. 1 month)

On cloud, handles international teams

CDN

AWS can create replica in different data centers. The Texas data center was down. Distributed is more reliable.

On premise

security is higher

Serve local customers

Data service:

Data collection connects with Kafka

Data can be consumed by Spark

Later dashboard

Service management is good

We should have a data validation/aggregation service

Different teams may have different styles (API vs log)

Data store:

Elasticsearch, Databricks

Need to process sensitive data

Audience:

Do we need to build the whole system end-to-end on premise?

Interviewer:

Different teams cannot access other teams’ data

Security DL - distribution list

HR data access list

Kafka, S3 restrictions, AWS IAM rules, distribution lists, permissions

Maybe not the first version

Priority - how to manage data, data resource management
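A toy sketch of the distribution-list idea: access to a Kafka topic is granted to a DL, and a user can read a topic only through a DL they belong to. The DL names, topic names, and users are hypothetical, and a real deployment would enforce this in IAM or Kafka ACLs rather than in application code.

```python
from typing import Dict, Set

# Hypothetical mapping from distribution list (DL) to the topics it may read.
DL_TOPICS: Dict[str, Set[str]] = {
    "dl-build-team": {"metrics.build"},
    "dl-test-team": {"metrics.test"},
}

# Hypothetical DL membership.
DL_MEMBERS: Dict[str, Set[str]] = {
    "dl-build-team": {"alice"},
    "dl-test-team": {"bob"},
}


def can_read(user: str, topic: str) -> bool:
    """True if any DL the user belongs to grants access to the topic."""
    return any(
        user in members and topic in DL_TOPICS.get(dl, set())
        for dl, members in DL_MEMBERS.items()
    )
```

With this data, alice can read metrics.build but not metrics.test, which matches the requirement that teams cannot see each other's data.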

Audience:

Do we need to change database schema to add field?

Kafka topic -> map to different data in database.

Different record

Schema

DL to IAM rule

Alert:

CloudWatch or other ways to alert

We can add ML to detect anomalies
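Before reaching for ML, a simple statistical check can serve as a baseline for anomaly alerts. The sketch below uses a z-score against recent history; it is a stand-in for illustration, not the ML approach mentioned above, and the threshold of 3.0 is an assumption.

```python
from statistics import mean, pstdev
from typing import List


def is_anomaly(history: List[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it is more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return latest != mu  # flat history: any change is unusual
    return abs(latest - mu) / sigma > threshold


baseline = [100.0, 102.0, 98.0, 101.0, 99.0]
spike_flagged = is_anomaly(baseline, 150.0)
normal_flagged = is_anomaly(baseline, 101.0)
```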

Audience

Do we need to store raw data?

Interviewer:

Databricks has pros and cons

Data processing

By product team or by platform

Discussed with Google:

Platform - only for storage, and CRUD.

Product team is responsible for processing

Audience

Why do we need the small data aggregator?

Interviewer:

We can simplify to merge data aggregator and data collector service

Interviewee:

Data collector does not store the data, only for processing

Interviewer:

We should add validation and data processing, e.g. obfuscate/remove sensitive data
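A minimal sketch of a validation and scrubbing step in the collector path. The required and sensitive field names are assumptions chosen for the example; each team would define its own.

```python
from typing import Any, Dict, Optional

SENSITIVE_FIELDS = {"email", "user_id"}               # assumption: example names
REQUIRED_FIELDS = {"service_id", "metric", "value"}   # assumption: example names


def sanitize(record: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return a cleaned copy of the record, or None if it fails validation."""
    if not REQUIRED_FIELDS.issubset(record):
        return None  # missing a required field
    value = record["value"]
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None  # metric values must be numeric
    # Drop sensitive fields before the record leaves the collector.
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}


clean = sanitize(
    {"service_id": "hr", "metric": "logins", "value": 3, "email": "a@example.com"}
)
rejected = sanitize({"service_id": "hr", "metric": "logins"})
```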

Audience:

Do we need to monitor aggregator?

Interviewer:

It’s one of the ways to monitor

Audience:

For pulling, we need to use ZooKeeper to manage the upstream services

Audience:

Metrics -> UI: there is a 1-hour delay

Should on-call use the dashboard? The delay may be too long

Interviewer

The platform is for many teams

There is no absolute function. Usually oncall is part of the system. We may monitor the overall health of the system.

The delay could be 10 to 15 minutes

Some product teams may want real-time aggregation

Interviewee:

Usually 1-5 minutes is manageable

Audience:

Do we need a real time path and a slow path?

Interviewer:

Streaming vs map-reduce

Lots of discussion

Audience:

Requirement: can tolerate some delay. Don’t make 1 hour as a hard requirement

Audience:

To reduce granularity, we need to change the system

Audience:

We can have multi-level aggregation: start with 0.5-minute buckets, then aggregate to 1 minute, 5 minutes, etc.
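The multi-level idea can be sketched as repeated roll-ups of (count, total) pairs, so averages remain correct at every level. The bucket layout and sample values below are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Each bucket maps window_start_seconds -> (count, total).
Buckets = Dict[int, Tuple[int, float]]


def roll_up(buckets: Buckets, coarse_seconds: int) -> Buckets:
    """Merge fine-grained buckets into coarser windows of `coarse_seconds`."""
    merged = defaultdict(lambda: (0, 0.0))
    for start, (count, total) in buckets.items():
        coarse_start = start - (start % coarse_seconds)
        c, t = merged[coarse_start]
        merged[coarse_start] = (c + count, t + total)
    return dict(merged)


# 30-second buckets roll up to 1 minute, then to 5 minutes.
thirty_s: Buckets = {0: (3, 30.0), 30: (1, 10.0), 60: (2, 50.0), 300: (4, 8.0)}
one_min = roll_up(thirty_s, 60)
five_min = roll_up(one_min, 300)
```

Keeping count and total rather than a precomputed average is what lets each coarser level be derived from the one below it without going back to raw data.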

Interviewer

Can handle at ETL

ETL is a consumer of Kafka

Hybrid system

Interviewer:

We use Elasticsearch

InfluxDB,

Prometheus

Time series database

Audience

Should clarify tradeoff for NoSQL database

Audience

Do all product teams share the same schema?

Interviewer:

If we don’t need data processing, then the data can be put in one table

If we need further data processing, then we will put it into Databricks/Spark. Each team will define its own schema.

Audience:

For ETL, do we use an existing product or implement it ourselves?

Interviewer:

Can use Databricks/Spark sigma, but we still need to write some code/configuration

Audience:

Zookeeper is not for health check

Who is responsible for product A or product B

N vs M

For pull, we need ZooKeeper to track which service each collector pulls data from

Audience:

Every time a Lambda starts, it registers with ZooKeeper and learns where to get data from.

Interviewer:

We use push

We use agent pull

Data collection perspective: it’s all push

Data team can build an agent to push

Audience:

Prometheus uses pull

Interviewer:

Agent: product team can build their own agent

Agent: can collaborate with platform team to build agent

Audience:

Do we need further aggregation?

Interviewer:

Both usage patterns exist

Audience

For direct storage, do we still aggregate?

Interviewer

We can do aggregation in Spark