Cloud Storage Service

Topic: Cloud Storage Service

Interviewer: Ken

Interviewee: 乔磊

Level: L5 (Senior)

Mock System Design Interview Summary

[10-15] requirement gathering

[longer portion] design

[10-15] drill down

Basic requirements

[43:00]

Functional [gathered]

Support directories

Upload files

Download files

File sync - multiple clients

Out of scope

Permission

File sharing

Notification

[41:47]

Scale

File < 1 GB

50M signed-up users, 10M DAU [gathered]

10 GB free space

Upload 2 files per day. Average size 500 KB [not gathered]

1:1 read to write ratio

Average 1000 files per user

My Estimates:

QPS [not computed]

50M * 10GB = 500 PB

QPS for upload: 10 million users * 2 uploads / 10^5 seconds per day = 100 * 2 = 200 QPS

Peak QPS = 200 QPS * 5 = 1000 QPS

Metadata DB storage

10GB / 500KB = 20,000 files

1000 files * 10M users = 10 billion file entries

Fields per entry: file path, S3 path, user, date

10 billion files * 200 bytes of metadata per file = 2 TB

Bandwidth

200 uploads per second * 0.5 MB per upload = 100 MB per second
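The estimates above can be reproduced with a few lines of arithmetic. This is only a sketch using the figures from these notes, with decimal units and the 10^5 seconds-per-day approximation; the variable names are illustrative.

```python
# Back-of-envelope numbers from the notes, using decimal units (1 KB = 10**3 bytes).
KB, MB, GB, TB, PB = 10**3, 10**6, 10**9, 10**12, 10**15

signed_up_users = 50_000_000
daily_active_users = 10_000_000
free_quota_bytes = 10 * GB
uploads_per_user_per_day = 2
avg_file_bytes = 500 * KB
files_per_user = 1000
metadata_bytes_per_file = 200   # file path, S3 path, user, date
seconds_per_day = 10**5         # rounded up from 86,400 to keep the math easy
peak_factor = 5

# Worst case: every signed-up user fills the free quota.
max_storage_pb = signed_up_users * free_quota_bytes / PB

upload_qps = daily_active_users * uploads_per_user_per_day / seconds_per_day
peak_qps = upload_qps * peak_factor

file_entries = files_per_user * daily_active_users
metadata_tb = file_entries * metadata_bytes_per_file / TB

upload_mb_per_s = upload_qps * avg_file_bytes / MB

print(max_storage_pb, upload_qps, peak_qps, file_entries, metadata_tb, upload_mb_per_s)
```

With the 1:1 read to write ratio, download QPS and bandwidth would come out the same as the upload figures.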

Non functional

Durable [99.99999%]: .01% disk failure rate, 3 data replicas, 99.99999% target

Availability [99.999%]: about 5 minutes of downtime per year

Sync quickly

Minimize bandwidth

Scalable

Highly available
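The availability and durability targets above can be sanity-checked with a small calculation. This is a rough sketch: the function names are illustrative, replica failures are assumed to be independent, and re-replication after a failure is ignored. The per-replica failure rate is a parameter because the notes write it both as .01% and as .01.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for a given availability fraction (e.g. 0.99999)."""
    return (1 - availability) * MINUTES_PER_YEAR


def loss_probability(per_replica_failure: float, replicas: int) -> float:
    """Probability that every replica is lost, assuming independent failures."""
    return per_replica_failure ** replicas


print(f"{downtime_minutes_per_year(0.99999):.2f} minutes of downtime per year")
print(f"{loss_probability(0.01, 3):.0e} chance of losing all 3 replicas")
```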

[37:30]

Points to cover:

API:

list, upload, download, uploadChunk, downloadChunk

Client notification when version updated on server. Tradeoffs of poll vs push

Database Schema

Architecture choices:

If using S3/Google Cloud Storage: does the traffic go through the application server or not?

Push vs poll for propagating changes

Storage choices:

Cache

Database: SQL (MySQL) vs NoSQL (Cassandra, eventual consistency)

File storage: Amazon S3, Google Cloud Storage, HDFS

Bar raiser:

Familiar with Amazon S3, Google Cloud Storage, or HDFS workflow

Tiered storage to save storage cost

Soft Skills

Requirement gathering

Discuss tradeoffs

Clear presentation

Driving interview

Hard skills

Design quality

Compare existing solutions

Fit into larger context of project and product lifecycle

[37:00]

Back of envelope estimation

10M users * 1,000 MB of data per user = 10,000 TB

1 GB per user per year, roughly 100 MB per month

10 PB of data per year

[self implement]

.01 failure rate for disk (SSD)

Compute nodes, disk farm, leverage cloud service

S3 service - reduce complexity

S3 as my main storage system

To investigate different providers

[33:50]

[32:36 - good timing, starting design]

Starting high level design

[IAM - auth]

Add API gateway, encryption, LB

Normal files

Small, large, medium files

Try to handle large and small files

[104]

[29:03]

May want to define APIs

Adding queue service

[I don’t understand the queue service]

Monitors the queue system

[what protocol does the message queue service use?]

APIs

[I feel fine with defining the API later]

Create, delete,

//filepath=adsfadsf&userToken=adf

/folder/object.txt (GET) - download

/folder/object.txt (PUT) - upload

Multipart upload API (PUT) - parts are addressed by part ID, not by object name.
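A multipart upload keyed by part ID could look roughly like the sketch below. The class and method names are hypothetical, everything is held in memory for illustration, and this is not the S3 API.

```python
class MultipartUpload:
    """Collects parts keyed by part ID and assembles them in part-ID order."""

    def __init__(self, object_path: str) -> None:
        self.object_path = object_path
        self._parts: dict[int, bytes] = {}

    def put_part(self, part_id: int, data: bytes) -> None:
        # Re-sending a part ID overwrites the earlier copy, so retries are safe.
        self._parts[part_id] = data

    def complete(self, expected_parts: int) -> bytes:
        """Join parts 1..expected_parts, or fail if any part is missing."""
        wanted = range(1, expected_parts + 1)
        missing = [i for i in wanted if i not in self._parts]
        if missing:
            raise ValueError(f"missing parts: {missing}")
        return b"".join(self._parts[i] for i in wanted)


upload = MultipartUpload("/folder/object.txt")
upload.put_part(2, b"world")   # parts may arrive out of order
upload.put_part(1, b"hello ")
print(upload.complete(expected_parts=2))
```

Because each part is addressed by its ID, parts can be uploaded in parallel and a failed part can be retried without resending the whole file.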

[Question: version control?]

No version control needed

[may think of chunking]

[may want a way]

[long-running channel: the client can always keep a connection open with the backend service]

Write or change is limited

Most of the time there is no change

The first time the app is installed, it does a full sync

[need a list directory API?]

/fullsync

[may think of system hook for accessing file, and download at that time]

[20:24]

[109]

Message content: (user id, file name, delete / new file)

[should optimize with checksum]
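One way to act on the checksum and chunking comments is to hash fixed-size chunks and upload only the chunks whose hash changed. The sketch below assumes SHA-256 and a 4 MiB chunk size; both choices and the function names are illustrative.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # hypothetical 4 MiB chunk size


def chunk_checksums(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """SHA-256 hex digest of each fixed-size chunk, in order."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def changed_chunks(old: list[str], new: list[str]) -> list[int]:
    """Indexes of chunks in `new` that differ from `old` and must be uploaded."""
    return [
        i for i, digest in enumerate(new)
        if i >= len(old) or old[i] != digest
    ]


before = chunk_checksums(b"a" * 10 + b"b" * 10, chunk_size=10)
after = chunk_checksums(b"a" * 10 + b"c" * 10 + b"d" * 3, chunk_size=10)
print(changed_chunks(before, after))  # chunk 0 is unchanged and is skipped
```

With fixed-size chunks, inserting bytes near the start of a file shifts every later chunk, so all of them show up as changed; this sketch does not address that case.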

API service talks to the message queue

[metadata database missing?]

Client app monitors folder

Detects change

Client app keeps the file list in memory

Client app sends files to API service

API service sends file to storage service

Optimize -> send multiple

API service generates a message and sends it to the message queue

Message content: user ID, file name, delete/new file

[What if API service crashes?]

Message content: user ID, file name, delete/new file, sender
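The message content listed above could be modeled as a small record with JSON encoding. This is a sketch only: the field names, the "new"/"delete" action strings, and the choice of JSON as the wire format are assumptions, not something decided in the interview.

```python
import json
from dataclasses import asdict, dataclass

VALID_ACTIONS = ("new", "delete")


@dataclass(frozen=True)
class ChangeMessage:
    user_id: str
    file_name: str
    action: str   # "new" or "delete"
    sender: str   # ID of the client that made the change

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(raw: str) -> "ChangeMessage":
        msg = ChangeMessage(**json.loads(raw))
        if msg.action not in VALID_ACTIONS:
            raise ValueError(f"unknown action: {msg.action}")
        return msg


original = ChangeMessage("user-1", "/folder/object.txt", "new", "client-a")
print(ChangeMessage.from_json(original.to_json()) == original)
```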

Q: message queue: multiple topics?

A: add API to list files?

Add API to get delta

Can use different attribute to do the filter

Can filter the messages to

[prefer message queue]

Messages to filter by userID

Use API to get delta:

Get all files. Compare with local files

Messages:

A cleaner way to compute the delta
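The "get all files, compare with local files" option can be sketched as a comparison of two listings. The sketch assumes each listing maps a file path to a checksum and that the server copy wins on any difference; both are assumptions made for illustration.

```python
def compute_delta(local: dict[str, str], remote: dict[str, str]) -> dict[str, list[str]]:
    """Compare path -> checksum listings, treating the server as the source of truth."""
    # Download anything missing locally or whose checksum differs from the server's.
    download = sorted(
        path for path, checksum in remote.items()
        if local.get(path) != checksum
    )
    # Delete anything that exists locally but no longer exists on the server.
    delete = sorted(path for path in local if path not in remote)
    return {"download": download, "delete": delete}


local_listing = {"/a.txt": "111", "/b.txt": "222", "/old.txt": "999"}
remote_listing = {"/a.txt": "111", "/b.txt": "333", "/new.txt": "444"}
print(compute_delta(local_listing, remote_listing))
```

This needs a full listing on every sync, which is the cost the message-based approach avoids.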

[8:53]

RabbitMQ

Can handle a large amount of data

Can handle the scenario

S3 as actual storage

May prefer cloud queue system

Filter by ID

[7:04]

Q: walk through 2nd client?

A: listen to changes. delete/new file

Issues a request to the API service to fetch the file from storage
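The second client's side of the walk-through might look like the sketch below. It assumes messages arrive as dictionaries with user_id, file_name, action, and sender keys, and that the sender field is used to skip a client's own changes; the function name and those details are illustrative.

```python
def apply_changes(messages, user_id, client_id, local_files):
    """Return file names to download; deleted files are removed from local_files in place."""
    to_download = []
    for msg in messages:
        if msg["user_id"] != user_id:
            continue  # filter by user ID: another user's change
        if msg["sender"] == client_id:
            continue  # this client made the change, nothing to sync
        name = msg["file_name"]
        if msg["action"] == "delete":
            local_files.discard(name)
            if name in to_download:
                to_download.remove(name)  # created and then deleted: skip the download
        elif msg["action"] == "new" and name not in to_download:
            to_download.append(name)
    return to_download


queue = [
    {"user_id": "u1", "file_name": "a.txt", "action": "new", "sender": "c1"},
    {"user_id": "u1", "file_name": "old.txt", "action": "delete", "sender": "c1"},
]
files_on_disk = {"old.txt", "keep.txt"}
print(apply_changes(queue, "u1", "c2", files_on_disk), files_on_disk)
```

Each name returned would then be requested from the API service, which reads the file from storage.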

[chunking missing]

[4:15]

Reliable

[106]

API service: stateless service - data is saved in memory - cluster of servers with high compute and memory capacity

We can scale up API service

[2:20]

High CPU/memory

Need a database - stores all metadata, including which file belongs to whom

Virtual folder structure

[should draw the database]
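A minimal sketch of what the metadata database could look like, using SQLite only so the example runs anywhere. The table and column names are assumptions built from the fields mentioned earlier in these notes (file path, S3 path, user, date), and the virtual folder structure is modeled as a path prefix.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE files (
    file_id    INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    file_path  TEXT NOT NULL,   -- virtual path, e.g. /folder/object.txt
    s3_path    TEXT NOT NULL,   -- where the bytes actually live
    updated_at TEXT NOT NULL,
    UNIQUE (user_id, file_path)
);
"""


def open_metadata_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_file(conn, user_id, file_path, s3_path, updated_at):
    conn.execute(
        "INSERT INTO files (user_id, file_path, s3_path, updated_at) VALUES (?, ?, ?, ?)",
        (user_id, file_path, s3_path, updated_at),
    )


def list_folder(conn, user_id, folder):
    """All of a user's file paths under a virtual folder, including subfolders."""
    prefix = folder.rstrip("/") + "/"
    rows = conn.execute(
        "SELECT file_path FROM files "
        "WHERE user_id = ? AND substr(file_path, 1, ?) = ? ORDER BY file_path",
        (user_id, len(prefix), prefix),
    )
    return [row[0] for row in rows]
```

There are no real directories here: a folder exists only as a shared prefix of file paths, which is one way to read "virtual folder structure".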

Read/write split for database

One write master database and several read replicas
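The read/write split can be sketched as a router that sends writes to the master and spreads reads across the replicas. The class name, the round-robin policy, and the "statement starts with SELECT" test are simplifications chosen for illustration.

```python
import itertools


class ReadWriteRouter:
    """Routes SELECT statements to read replicas and everything else to the master."""

    def __init__(self, master: str, replicas: list[str]) -> None:
        self.master = master
        # Round-robin over the replicas; fall back to the master if there are none.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql: str) -> str:
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.master


router = ReadWriteRouter("master", ["replica-1", "replica-2"])
print(router.route("SELECT * FROM files"))
print(router.route("INSERT INTO files VALUES (1)"))
```

Replicas can lag behind the master, so a client that reads right after writing may see stale metadata unless that read is sent to the master.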

[did not complete]