Design YouTube (used)

Topic: Design YouTube (used)

Interviewer: peijin

Interviewee: sinco

Level: L5 (Senior)

Topic

Mock System Design Interview Summary

Interview Overview

Date: 12/5/2021

Target level: L5 (senior)

Duration: 1 hour

Topic covered: Design YouTube

Drawing tool used: Whimsical

Requirements

Functional requirements

View the video, thumbnail

Upload the video

Search the video

Popular / comments on the video

Non-functional requirements

We should not lose the video

Watch the video smoothly

Ensure high availability

Try to reduce the network cost

Constraints:

DAU: 100 million

5 videos / day / person

View:

QPS: 5 * 100 million / 86,400 s (~0.1 million s) ≈ 5,000 queries per second

Upload:

Upload-to-download ratio: 1:200

QPS: 25 / second

Download bandwidth: 7 GB/s * 200 = 1,400 GB/s = 1.4 TB/s

Assumption: 300 MB / video

Upload bandwidth: 7 GB/s

Storage: 683T / day
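
The estimates above can be reproduced with a short script. This is a rough sketch, assuming decimal units (1 GB = 1,000 MB) and that the 1:200 ratio means one upload per 200 views; the constant names are illustrative. Because it does not round a day to 0.1 million seconds, its unrounded results come out somewhat higher than the rounded figures in the notes.

```python
# Back-of-envelope capacity estimate using the assumptions from the notes.
DAU = 100_000_000             # daily active users
VIEWS_PER_USER_PER_DAY = 5    # videos watched per user per day
SECONDS_PER_DAY = 86_400
UPLOAD_TO_VIEW_RATIO = 200    # assumed: 1 upload per 200 views
VIDEO_SIZE_MB = 300           # assumed average video size

views_per_day = DAU * VIEWS_PER_USER_PER_DAY
view_qps = views_per_day / SECONDS_PER_DAY
upload_qps = view_qps / UPLOAD_TO_VIEW_RATIO

# Bandwidth in GB/s and storage in TB/day (decimal units).
upload_gb_per_s = upload_qps * VIDEO_SIZE_MB / 1000
download_gb_per_s = upload_gb_per_s * UPLOAD_TO_VIEW_RATIO
storage_tb_per_day = upload_gb_per_s * SECONDS_PER_DAY / 1000

print(f"view QPS:           {view_qps:,.0f}")
print(f"upload QPS:         {upload_qps:,.1f}")
print(f"upload bandwidth:   {upload_gb_per_s:,.1f} GB/s")
print(f"download bandwidth: {download_gb_per_s:,.0f} GB/s")
print(f"storage:            {storage_tb_per_day:,.0f} TB/day")
```

As the feedback later in these notes says, the estimate does not need to be very accurate; the order of magnitude is what drives the design.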

System Design

External APIs

getVideo(video_id, user_id, offset)

The offset is needed because a large video is split into small pieces

uploadVideo(video_id, user_id, description, length, tags[], video_content)

System design diagram

Upload flow:

Changing API for upload

uploadVideo(video_id, user_id, description, length, tags[])

-> returns { presigned-url }
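
A minimal sketch of the two-step upload, assuming the server signs a time-limited URL with HMAC. This is not S3's actual signing scheme; SECRET_KEY, UPLOAD_HOST, the ttl_seconds parameter, and verify_upload_url are hypothetical names introduced only for illustration.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"demo-secret"                 # hypothetical signing key
UPLOAD_HOST = "https://upload.example.com"  # hypothetical storage endpoint


def _sign(video_id, expires):
    """Sign the video ID together with its expiry timestamp."""
    message = f"{video_id}:{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def upload_video(video_id, user_id, description, length, tags,
                 now=None, ttl_seconds=900):
    """Return a time-limited upload URL (metadata persistence is omitted)."""
    now = int(time.time()) if now is None else now
    expires = now + ttl_seconds
    query = urlencode({"expires": expires,
                       "signature": _sign(video_id, expires)})
    return {"presigned_url": f"{UPLOAD_HOST}/videos/{video_id}?{query}"}


def verify_upload_url(url, now=None):
    """Storage-side check: the URL is unexpired and the signature matches."""
    now = int(time.time()) if now is None else now
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    video_id = parsed.path.rsplit("/", 1)[-1]
    expires = int(params["expires"][0])
    expected = _sign(video_id, expires)
    return now <= expires and hmac.compare_digest(
        expected, params["signature"][0])


url = upload_video("v1", "u1", "demo", 60, ["test"], now=1000)["presigned_url"]
print(verify_upload_url(url, now=1000))
```

The client then sends the video bytes directly to the returned URL, so the large payload never passes through the metadata service.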

Q: Should we choose Google Cloud Storage or AWS S3 instead of maintaining our own distributed file system?

A: What are the tradeoffs of a pre-built solution vs. a self-maintained solution?

Q: We reduce maintenance if we use a pre-built solution.

Adding message queue to trigger encode service

Q from interviewer: Why message queue?

A from interviewee: Encoding service and upload service are time consuming

Q from interviewer: Why use S3 and not other types of storage?

A from interviewee: Because we need a large amount of storage; S3 scales without a practical limit.

Q from interviewer: Why not another database?

A from interviewee: We could also use Google Cloud or Azure.

Record status of the intermediate processing:

Download flow

Retrieve data from S3

Add CDN to cache popular videos

Q from interviewer: how do we know what are the most frequently watched videos?

A from interviewee: add TopK system and use the output to cache the popular videos
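
A minimal sketch of the TopK idea, assuming view events arrive as a list of video IDs that fits in memory. A production TopK system at this scale would more likely use streaming or approximate counting; the function name is illustrative.

```python
from collections import Counter


def top_k_videos(view_events, k):
    """Return the k most-viewed video IDs, most popular first."""
    counts = Counter(view_events)
    return [video_id for video_id, _ in counts.most_common(k)]


events = ["a", "b", "a", "c", "a", "b"]
# The resulting IDs are the candidates to push into the CDN cache.
print(top_k_videos(events, 2))
```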

Q from interviewer: any other optimization for watched videos?

A from interviewee: encode video at different resolution

Q from interviewer: how do we implement comment features?

Add API:

comment(video_id, user_id, content, comment_id)

like (video_id, comment_id)

Interviewee: DB schema first or comment first?

Interviewer: First comment at high level, then go to DB design

Interviewer: What will you store in Redis?

Interviewee: schema for redis

How to handle Redis failure?

Data is not very important

If Redis fails, we can sync up Redis with NoSQL every 1 or 5 minutes

Interviewer: How to make sure there is no loss due to redis crash?

Interviewee: another copy of redis as stand-by
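
The periodic sync idea can be sketched as follows. This assumes Redis holds like counts, which the notes do not confirm, and it uses plain dictionaries as stand-ins for Redis and the NoSQL store; the class and method names are hypothetical.

```python
class LikeCounter:
    """Like counts buffered in a cache and flushed to a durable store."""

    def __init__(self):
        self.cache = {}    # stand-in for Redis: like deltas not yet flushed
        self.durable = {}  # stand-in for the NoSQL store

    def like(self, video_id, comment_id):
        key = (video_id, comment_id)
        self.cache[key] = self.cache.get(key, 0) + 1

    def flush(self):
        """Move pending deltas to the durable store (run every 1-5 minutes)."""
        pending, self.cache = self.cache, {}
        for key, delta in pending.items():
            self.durable[key] = self.durable.get(key, 0) + delta

    def simulate_cache_crash(self):
        """Drop everything that has not been flushed yet."""
        self.cache = {}

    def count(self, video_id, comment_id):
        key = (video_id, comment_id)
        return self.durable.get(key, 0) + self.cache.get(key, 0)


counter = LikeCounter()
counter.like("v1", "c1")
counter.like("v1", "c1")
counter.flush()
counter.like("v1", "c1")        # not flushed yet
counter.simulate_cache_crash()  # only the unflushed like is lost
print(counter.count("v1", "c1"))
```

The flush interval bounds how much data a crash can lose, which is acceptable only because this data is not very important; a stand-by replica narrows the window further.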

Database schema

Additional design

Discussions during the Interview

Interviewer and Audience Feedback after the Interview

Requirement gathering - feature

L5 interviewee should drive the interview process

Tradeoffs - the interviewee sometimes missed the key point. For S3, the key point is that video files are large, not Amazon vs. Google.

Estimate QPS - no need to be very accurate

Some portion is not covered (e.g. database schema)

Project awareness, not enough time to cover

Audience

Should the interviewee propose and then follow up with additional

Or entirely ask the interviewer?

Interviewer

Prefer interviewee

Audience

We have not seen the landing page design

Audience

Yes. Probably based on recommendation system

Audience

Upload now is 2 APIs -> create metadata and URL

2 steps for upload, but one page may break

Audience

Upload service can be put in between client and S3

Gigabit / 10-gigabit network cards (NICs)

Upload

Dev tool

How do we view a video from S3?

S3: 4,000 small files for a 10-minute video

First get metadata from server for the chunks

Then retrieve each chunk

We should write the API for download video to deal with chunks
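
A minimal sketch of a chunk-aware download API, assuming the offset in getVideo is a chunk index. The dictionaries stand in for the metadata database and S3, and the chunk keys and contents are hypothetical.

```python
# Stand-ins for the metadata DB and S3; keys and contents are hypothetical.
CHUNK_INDEX = {"v1": ["v1/chunk-0", "v1/chunk-1", "v1/chunk-2"]}
CHUNK_STORE = {
    "v1/chunk-0": b"intro-",
    "v1/chunk-1": b"middle-",
    "v1/chunk-2": b"end",
}


def get_video_metadata(video_id):
    """Step 1: ask the server for the ordered list of chunk keys."""
    return list(CHUNK_INDEX[video_id])


def get_chunk(chunk_key):
    """Step 2: fetch a single chunk from storage (or the CDN)."""
    return CHUNK_STORE[chunk_key]


def get_video(video_id, user_id, offset=0):
    """Yield chunks in order, starting at chunk index `offset`."""
    for chunk_key in get_video_metadata(video_id)[offset:]:
        yield get_chunk(chunk_key)


print(b"".join(get_video("v1", "u1")))
print(list(get_video("v1", "u1", offset=2)))
```

Because chunks are yielded one at a time, the player can start on the first chunk while the rest are still being fetched.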

There are only 2 APIs, but there are so many services.

For example, how to save, there is no API

How to upload to original service => requires API

L5: should cover some additional questions

Non-functional: lots

10G per second -> cluster, or many instances. If one network supports x-K bytes, then how do we handle?

How do we handle upload error?

How do we handle popular and non-popular videos?

How do we pace it?

45 minute - may not have enough time to cover all features

Confirm the number of the features. Directly go to the important features

The interviewee seems to know most of the knowledge, but did not express it systematically.

Scalability, cache.

How to form the system?

Interviewer: “I expect you to drive the interview”

First write down an outline before drawing the diagram, such as the life cycle of the video: upload, download, review

Comment service. Interviewer: address the key points

Should discuss the tradeoffs

NoSQL: should talk about extensibility, high concurrency

SQL: query is good

Why Redis: high availability

Redis cluster, scale up

Upload/download worker? Kubernetes? AWS?

Lambda: small load, serverless. Upload/download clusters

Does everybody use S3?

Netflix uses AWS. Netflix builds its own CDN.

S3 streaming

Cut into small pieces

Can be on the client. Download pieces, then play the downloaded portion

Can S3 sustain such high QPS?

If 20% users download from S3. Most users should rely on CDN.

CDN and S3 should be connected.

Can put a lot of CDNs

IP - map to CDN

CDN and S3 should have a dedicated line to keep the CDN up to date

Is there another buffer between S3 and CDN?

Netflix: uses AWS

Self-built service - load balancer becomes the bottleneck

Bilibili - they will build their own infrastructure

Netflix - found lower cost in AWS

Instagram - photos - their own storage system

Depends on the scale of the company

Mock - interviewer - interviewee - synchronize

In a real interview, we don’t know the question ahead of time

It’s hard for the interviewer to drive the next step

Upload, download, or other things

Interviewer - hard to drive

Interviewer - should have some idea. MVP - critical journey - upload, download

Needs to be proactive, not reactive

Want to have an expectation of the key point for testing.

Limit the requirements.

Redis, NoSQL - but there is not enough time.

I don’t understand whether the interviewer or the interviewee should drive

The best is the interviewee should drive. Need to confirm with the interviewer

Interviewee should drive. If the interviewer disagrees, then interviewee can change direction.

NoSQL, MySQL. Can summarize the interviewer and make it as a keypoint.

Try to drive toward the part that you understand.

There are some usual tradeoffs - nosql/sql

Buffering

Synchronous vs. asynchronous (message queue)

2-3 minutes. Then confirm you are on the right track.

Is interviewer in partnership with interviewee?

At checkpoint: but you can double confirm at each checkpoint.

If the interviewer wants to ask. Then interviewee should take the hint

How does the interviewer score? What metrics?

Different interviewers are different.

We cannot expect all interviewers to be professional.

Soft skill / hard skill.

Communication skill. 2-4 minutes, confirm with interviewer.

Hard skill. Every question may have some points, scalability, reliability.

Why should we use message queue? Did I give the right answers?

It is right. There is async processing.

Alternative is synchronous. Then latency is much longer.

Decoupling, asynchrony, peak shaving, valley filling:

decouple

Asynchronous

Handle peak

Handle trough
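
A minimal sketch of these benefits using Python's in-process queue as a stand-in for a real message broker. The upload handler returns immediately, and a separate worker drains the queue at its own pace. The names handle_upload_complete and encode_worker are hypothetical, and the "encoding" is a placeholder string.

```python
import queue
import threading

encode_queue = queue.Queue()  # stand-in for a message broker
encoded = []                  # results produced by the encode worker


def handle_upload_complete(video_id):
    """Upload side: enqueue the job and return without waiting for encoding."""
    encode_queue.put(video_id)
    return "accepted"


def encode_worker():
    """Encode side: process jobs one at a time until a None sentinel arrives."""
    while True:
        video_id = encode_queue.get()
        if video_id is None:
            encode_queue.task_done()
            break
        encoded.append(f"{video_id}:encoded")  # placeholder for real encoding
        encode_queue.task_done()


worker = threading.Thread(target=encode_worker, daemon=True)
worker.start()

# A burst of uploads is absorbed by the queue (handling the peak).
for vid in ("v1", "v2", "v3"):
    handle_upload_complete(vid)

encode_queue.join()     # wait until every queued job has been processed
encode_queue.put(None)  # tell the worker to stop
worker.join()
print(encoded)
```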

Upload should be broken up.

Client side code will handle.

Some users may upload the same video

The service will compare uploads to detect duplicates.

Can reference existing ID.
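
A minimal sketch of duplicate detection, assuming duplicates are found by hashing the uploaded bytes with SHA-256. The notes do not say how the comparison is done, and this approach only catches byte-identical files, not re-encoded copies. The VideoStore class is hypothetical.

```python
import hashlib


class VideoStore:
    """Maps content digests to video IDs so duplicates reuse an existing ID."""

    def __init__(self):
        self._id_by_digest = {}
        self._next_id = 1

    def upload(self, content):
        digest = hashlib.sha256(content).hexdigest()
        existing = self._id_by_digest.get(digest)
        if existing is not None:
            return existing  # duplicate: reference the existing ID
        video_id = f"video-{self._next_id}"
        self._next_id += 1
        self._id_by_digest[digest] = video_id
        return video_id


store = VideoStore()
first = store.upload(b"same bytes")
second = store.upload(b"same bytes")
other = store.upload(b"different bytes")
print(first == second, first == other)
```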

If the client cuts the video into 10 pieces, will it be hard on the client?

4 cores, 4K video, cut into 20 pieces

Encoding is not done on the client side

Does the client compress before the upload?

May depend on upload format

Original source file can be saved in “original storage”

Transcode - output with compression.

Before upload

HTTP supports gzip as built-in compression

Video may not be compressed again

Chunk then upload

Upload is just through browser

Does the browser chunk it for you?

Javascript can chunk it for you before the upload.
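
The chunking step can be sketched as below. The notes describe this happening in browser JavaScript; the sketch uses Python only to show the logic, and the function name and chunk size are illustrative.

```python
def split_into_chunks(data, chunk_size):
    """Yield (offset, chunk) pairs covering `data` in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, len(data), chunk_size):
        yield offset, data[offset:offset + chunk_size]


# Each chunk can be uploaded (and retried) independently of the others.
for offset, chunk in split_into_chunks(b"abcdefghij", 4):
    print(offset, chunk)
```

Carrying the offset with each chunk lets the server reassemble the file and lets the client retry only the chunk that failed.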

ADP - on top of TCP

TCP will split the data into packets

When the client requires a different resolution, it will query the metadata

DASH (Dynamic Adaptive Streaming over HTTP) - streaming protocol.

Chunk - each chunk’s offset is saved in DB
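
A minimal sketch of how a client might use such metadata to choose a quality level, assuming the metadata lists one entry per rendition with its bitrate and chunk offsets. The manifest layout, bitrates, and offsets are hypothetical and do not follow the actual DASH manifest format.

```python
# Hypothetical metadata for one video: renditions with per-chunk byte offsets.
MANIFEST = {
    "video_id": "v1",
    "renditions": [
        {"name": "360p", "bitrate_kbps": 800, "chunk_offsets": [0, 100, 200]},
        {"name": "720p", "bitrate_kbps": 2500, "chunk_offsets": [0, 300, 600]},
        {"name": "1080p", "bitrate_kbps": 5000, "chunk_offsets": [0, 600, 1200]},
    ],
}


def pick_rendition(manifest, bandwidth_kbps):
    """Pick the highest-bitrate rendition that fits the measured bandwidth."""
    renditions = manifest["renditions"]
    affordable = [r for r in renditions if r["bitrate_kbps"] <= bandwidth_kbps]
    if not affordable:
        # Nothing fits: fall back to the lowest bitrate available.
        return min(renditions, key=lambda r: r["bitrate_kbps"])
    return max(affordable, key=lambda r: r["bitrate_kbps"])


print(pick_rendition(MANIFEST, 3000)["name"])
```

When bandwidth changes mid-playback, the client repeats the lookup and requests the next chunk from the newly chosen rendition.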

DB may not rely on MySQL.

Offset - put in blob

User-facing metadata can be put in MySQL

Technical metadata is handled separately

Can use Redis to handle progress (user progress to which point)