Design Google Drive
Topic: Design Google Drive
Interviewer: ken
Interviewee: 艾宇杰
Level: L4 (Experienced Individual Contributor)
Interview Notes:
Functional requirements
Support directories
Upload files
Download files
File sync - multiple clients
Out of scope
Permission
File sharing
Notification
Scale
File size < 1 GB
50M signed-up users, 10M DAU
10 GB free space per user
2 file uploads per user per day, average size 500 KB
1:1 read-to-write ratio
Average 1000 files per user
Estimates:
Storage
50M users * 10 GB = 500 PB (upper bound, if every user fills the quota)
QPS
2 file uploads per user per day
Upload QPS: 10M * 2 uploads / 10^5 s = 200 QPS (rounding 86,400 s/day up to 10^5)
Peak QPS = 200 QPS * 5 = 1000 QPS
Metadata DB storage
Max files per user at full quota: 10 GB / 500 KB = 20,000 files
1000 files * 10M users = 10 billion file entries
Per entry: file path, S3 path, user, date
10 billion entries * 200 bytes of metadata per file = 2 TB
===
Bandwidth
200 uploads per second * 0.5 MB per upload = 100 MB per second
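The estimates above can be rechecked with a few lines of arithmetic. This is a minimal sketch: every input comes from the Scale section, the constant names are made up for illustration, and decimal units (1 GB = 10^9 bytes) are assumed.

```python
# Back-of-envelope estimates; inputs are taken from the Scale section.
SIGNED_UP_USERS = 50_000_000
DAU = 10_000_000
QUOTA_BYTES = 10 * 10**9            # 10 GB free space per user
UPLOADS_PER_USER_PER_DAY = 2
AVG_FILE_BYTES = 500 * 10**3        # 500 KB average file
FILES_PER_USER = 1_000
METADATA_BYTES_PER_FILE = 200
SECONDS_PER_DAY = 86_400            # exact value; the notes round this to 10^5
PEAK_FACTOR = 5

total_storage_pb = SIGNED_UP_USERS * QUOTA_BYTES / 10**15
upload_qps = DAU * UPLOADS_PER_USER_PER_DAY / SECONDS_PER_DAY
peak_qps = upload_qps * PEAK_FACTOR
file_entries = FILES_PER_USER * DAU
metadata_tb = file_entries * METADATA_BYTES_PER_FILE / 10**12
bandwidth_mb_s = upload_qps * AVG_FILE_BYTES / 10**6

print(f"storage   : {total_storage_pb:.0f} PB")
print(f"upload QPS: {upload_qps:.0f} (peak {peak_qps:.0f})")
print(f"metadata  : {file_entries:,} entries, {metadata_tb:.0f} TB")
print(f"bandwidth : {bandwidth_mb_s:.0f} MB/s")
```

With the exact 86,400 seconds per day, the upload rate comes out near 231 QPS and the bandwidth near 116 MB/s, slightly above the rounded 200 QPS and 100 MB/s used in the notes.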
Non functional
Durable
Sync quickly
Minimize bandwidth
Scalable
Available
====
API
UploadFile
DownloadFile
GetFileDirectory
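To make the three calls concrete, here is a toy in-memory sketch. The class name, method signatures, and path convention are assumptions for illustration only; the notes name the calls but do not specify parameters.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InMemoryDrive:
    """Toy stand-in for the service behind the three calls; not a real backend."""
    files: Dict[str, bytes] = field(default_factory=dict)  # full path -> content

    def upload_file(self, path: str, content: bytes) -> None:
        self.files[path] = content

    def download_file(self, path: str) -> bytes:
        return self.files[path]

    def get_file_directory(self, directory: str) -> List[str]:
        """List the immediate children of a directory such as '/docs'."""
        prefix = directory.rstrip("/") + "/"
        children = set()
        for path in self.files:
            if path.startswith(prefix):
                # Keep only the first path component below the directory.
                children.add(path[len(prefix):].split("/", 1)[0])
        return sorted(children)


drive = InMemoryDrive()
drive.upload_file("/docs/a.txt", b"hello")
drive.upload_file("/docs/sub/b.txt", b"world")
print(drive.get_file_directory("/docs"))
```

A real service would return metadata (size, version, modified time) rather than bare names, and would not hold file content in the same store as the directory listing.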
[31:25]
Pull / push new changes
[Reversed arrow?]
[25:58]
User
Device
File
File_id
Block_id
[19:37]
10M * 2 / 3600 /24 = 231 write QPS
Transaction support
Relational tables
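One possible relational layout for the entities listed above (User, Device, File, Block) is sketched below with SQLite. All column names, the plural table names, and the use of a content hash as block_id are assumptions; the notes only list the entity names. The helper shows why transaction support matters: a file row and its block rows are written together or not at all.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users   (user_id   INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE devices (device_id INTEGER PRIMARY KEY,
                      user_id   INTEGER NOT NULL REFERENCES users(user_id));
CREATE TABLE files   (file_id   INTEGER PRIMARY KEY,
                      user_id   INTEGER NOT NULL REFERENCES users(user_id),
                      path      TEXT NOT NULL,
                      version   INTEGER NOT NULL DEFAULT 1,
                      UNIQUE (user_id, path));
CREATE TABLE blocks  (block_id  TEXT NOT NULL,      -- assumed: content hash
                      file_id   INTEGER NOT NULL REFERENCES files(file_id),
                      seq       INTEGER NOT NULL,   -- position within the file
                      PRIMARY KEY (file_id, seq));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO users (user_id, name) VALUES (1, 'alice')")
conn.commit()


def add_file(conn, user_id, path, block_ids):
    """Insert a file row and its block rows in one transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        cur = conn.execute(
            "INSERT INTO files (user_id, path) VALUES (?, ?)", (user_id, path))
        file_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO blocks (block_id, file_id, seq) VALUES (?, ?, ?)",
            [(b, file_id, i) for i, b in enumerate(block_ids)])
    return file_id


add_file(conn, 1, "/docs/a.txt", ["hash-1", "hash-2"])
```

If any block insert fails, the file row is rolled back too, so the metadata never points at a half-written file.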
[10:24]
How does the blob storage connect back to client?
Both ways can work
The API gateway can return an upload URL; the client then connects to the blob store directly
Or the blob store can connect back to the client
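The first option (the gateway returns a URL and the client talks to the blob store directly) is usually built on a signed, short-lived URL. The sketch below illustrates the idea with a plain HMAC. It is not the actual S3 or Google Cloud Storage signing scheme; the host name, secret, and query parameter names are hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical key shared by the API gateway and the blob store.
SECRET = b"shared-secret-between-gateway-and-blob-store"


def _sign(method: str, path: str, expires: int) -> str:
    msg = f"{method}\n{path}\n{expires}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()


def issue_upload_url(path: str, ttl_s: int = 300,
                     now: Optional[float] = None) -> str:
    """Gateway side: return a short-lived URL the client can PUT to directly."""
    expires = int(now if now is not None else time.time()) + ttl_s
    query = urlencode({"expires": expires, "sig": _sign("PUT", path, expires)})
    return f"https://blob.example.com{path}?{query}"


def verify_upload_url(url: str, now: Optional[float] = None) -> bool:
    """Blob-store side: accept only unexpired URLs with a valid signature."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    current = now if now is not None else time.time()
    if current > expires:
        return False
    return hmac.compare_digest(sig, _sign("PUT", parsed.path, expires))
```

The benefit of this route is that file bytes never pass through the application servers; the gateway only handles the small signing request.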
[6:28]
=====
Notification back to client
Client starts and connects
Long poll to the notification service
WebSocket, bidirectional
File has changed
Client initiates a download through the API gateway
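The long-poll variant of this flow can be sketched in-process with a condition variable: the client call blocks until the server-side version moves past what the client last saw, or until a timeout. The class and method names are hypothetical, and a real service would do this over HTTP per user or per device rather than in one process.

```python
import threading
from typing import Optional


class NotificationService:
    """Toy long-poll endpoint keyed by a monotonically increasing version."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._version = 0

    def publish_change(self) -> int:
        """Called when a file changes; wakes every waiting client."""
        with self._cond:
            self._version += 1
            self._cond.notify_all()
            return self._version

    def long_poll(self, last_seen: int, timeout_s: float) -> Optional[int]:
        """Block until the version exceeds last_seen; None means timeout."""
        with self._cond:
            changed = self._cond.wait_for(
                lambda: self._version > last_seen, timeout=timeout_s)
            return self._version if changed else None


service = NotificationService()
# Simulate a file change arriving shortly after the client starts waiting.
timer = threading.Timer(0.05, service.publish_change)
timer.start()
new_version = service.long_poll(last_seen=0, timeout_s=5.0)
timer.join()
if new_version is not None:
    print(f"version {new_version} available, start download via API gateway")
```

On timeout the client simply reconnects with the same last_seen value, which is what distinguishes long polling from a persistent WebSocket.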
[1:20]
Sharding
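The notes do not record which shard key was discussed. One common choice, shown here purely as an assumption, is to hash the user ID so that all of a user's file metadata lands on the same shard. The shard count and function name are hypothetical.

```python
import hashlib

NUM_SHARDS = 16  # hypothetical shard count


def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard index using a stable hash."""
    # hashlib is used instead of the built-in hash(), which is randomized
    # per process for strings and therefore unsuitable for routing.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


print(shard_for_user("user-42"))
```

Plain modulo hashing forces most keys to move when the shard count changes; consistent hashing is the usual follow-up if resharding is expected.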
=====
Highly
Points to cover:
API:
list, upload, download, uploadChunk, downloadChunk
Client notification when a file version is updated on the server; tradeoffs of poll vs push
Database Schema
Architecture choices:
If using S3/Google Cloud Storage: does the traffic go through the application server or not?
Push vs poll for propagating changes
Storage choices:
Cache
Database: SQL (MySQL) vs NoSQL (Cassandra, eventual consistency)
File storage: Amazon S3, Google Cloud Storage, HDFS
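The uploadChunk / downloadChunk calls in the API point above suggest block-level transfer, which also serves the "minimize bandwidth" requirement. A minimal sketch, assuming fixed-size blocks identified by their SHA-256 hash (the block size and function names are hypothetical, not from the interview):

```python
import hashlib
from typing import List, Set, Tuple

BLOCK_SIZE = 4 * 1024 * 1024  # hypothetical 4 MiB block size


def split_into_blocks(data: bytes,
                      block_size: int = BLOCK_SIZE) -> List[Tuple[str, bytes]]:
    """Return (block_id, payload) pairs; block_id is the payload's SHA-256."""
    blocks = []
    for offset in range(0, len(data), block_size):
        payload = data[offset:offset + block_size]
        blocks.append((hashlib.sha256(payload).hexdigest(), payload))
    return blocks


def blocks_to_upload(data: bytes, known_block_ids: Set[str],
                     block_size: int = BLOCK_SIZE) -> List[Tuple[str, bytes]]:
    """Keep only the blocks the server does not already store."""
    return [(block_id, payload)
            for block_id, payload in split_into_blocks(data, block_size)
            if block_id not in known_block_ids]


# Tiny block size so the example is easy to follow.
old_version = b"a" * 8 + b"b" * 8
new_version = b"a" * 8 + b"c" * 8
known = {block_id for block_id, _ in split_into_blocks(old_version, 8)}
changed = blocks_to_upload(new_version, known, 8)
print(len(changed), "block(s) need uploading")
```

Fixed-size blocks only help when edits do not shift later bytes; an insertion near the start of a file changes every subsequent block, which is the usual argument for content-defined chunking.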
Bar raiser:
Familiar with Amazon S3, Google Cloud Storage, or HDFS workflow
Tiered storage to save storage cost
Experienced IC
Soft Skills
Requirement gathering
Discuss tradeoffs
Clear presentation
Driving interview
Hard skills
Design quality
Knowledge of existing solutions/tradeoffs
Fit into larger context of project and product lifecycle
====
GFS
Soft skill
How to communicate and show my knowledge
Hard skill
API flow
Schema
===