Distributed Key-Value Store

System DesignDatabases & StorageDistributed Systems

Materials — open to everyone, no sign-in

Topic: Distributed Key-Value Store

Interviewer: ken

Interviewee: ryandai

Level: L4 (Experienced Individual Contributor)

Additional Resources:


系统设计活动报名

https://commitway.com/design

职位申请报名

https://commitway.com/job-refer

YouTube channel

https://www.youtube.com/channel/UCKzpuki3fHTfCCngJDCZ_Mg

QRCode

Mock System Design Interview Summary

Interview Overview

Date: 2/26/2022

Target level: L4 - experienced IC

Duration: 45 minutes

Topic covered: Key-value store

Drawing tool used:

Requirements

[0:00]

Functional requirements

Design the next generation of distributed KV store that will support services:

authentication, authorization, messaging, etc

Won’t be used to support payment

Range query and batch query/update

Technical implications: B+ tree more ideal than hashtables

Eventual consistency

Technical implication: no need for transactions

in-memory?

Scaling requirements

K/V Scale linearly

Multi data center after 5 years

Latency: Users should have real time experience

Availability: 99.9%

Robustness: user data should not get lost

Technical implications: how to ensure multiple replicas

read/write ratio = 99/1

Technical implication: B+ tree is better than LSM tree

CAP: Availability > Partition > Consistency

Capacity planning

1 petabytes of data, Multi data center after 5 years

1M queries per second: 【-】

10k write QPS, 990k read QPS

Technical implications:

need to distribute to multiple nodes

Some mechanism to route requests to the right nodes.

How to reduce master pressure?

Key <= 1kb, Value <= 8kb 【-】

Average k+v size 2kb => 2kb * 10k / s

[39:40 - 102]

API

getDataRequest(user_token, key, version) - res, error return a string as result

Return a string as result, version_id

Error: error_code, error message

[range query: pagination?]

===

Architecture

Data model

Key-value storage

Key global unique ID, value: string/integer/json

Key methodology

Hash function to convert key to storage key

Handle hash collision

[seems to be offrail since we need to support range query]

Open address

Separate chain

[31:27]

Sharding key: storage key

To support range query:

0-100 in node 1

101-200 in node 2

[hash to node number] [then hash to storage key]

There may be hot-spot 【+】

Hot-spot

Alternative sharding strategy:

we can use consistent hashing

Will resolve hot-key issue

[26:52]

[hoped to walk through the “life of a query”]

【scale up】- too early

Storage is 1 PB

10 TB per node is reasonable

100 nodes

May expand to more data centers in the future

Data sharding in region, or globally

Depends on data capacity

Servers

If we only shard in one region, we can synchronize data between different regions. We can query close to the customer

[22:53]

Range query: node 1 - node 2

50-100

101-150

Config service, key range in each node

Q: range query in a node?

A: we will not hash the key into a storage key

Hash table

Disk only - list of data, without memory

In-memory (binary tree) + disk (append only) - LSM tree

Q: in-memory

A: list will be in memory. Key will be sort-key

Physical storage - sort key

Mongo or dynamoDB, key - string, value - json file

[15]

Access each row to access the value

Append to the result

A:

Range query (lowkey, high key, pagination, term)

===

Noticed the high QPS

write to master

multiple read replicas (data synced from master node, sync or async): sync write

===

Consider cache between storage and disk

===

Add Cache [ ok, but a bit too early ]

Add LRU cache

Independent service

[9:30]

How do we insert?

Data storage write to key-value

Find node 1

Node 1: Resort: Database will find the

B tree strategy, logN

No sq in file

Append to the end of file

==

[1:30]

Leader election to handle master node failure

Read replica failure

===

====

Soft skill: meet expectation

Hard skill:

====

===

Interviewee:

Needed to reiterate. Hashcode -> range query

Key value: not very clear how to get

是否一步到位

Interviewer:

===

Audience discussion

Possible drill down points:

Consistent hashing

Kafka queue

Read cache - local + global cache

Configuration service, zookeeper

Consensus algorithm

Key-value, noSQL database

Sharding: key as sharding

Discussion

R » W, so no need for queue

Discussion

Highly available, strong consistency

Dynamo or cassandra pattern

Availability:

Split brain, network partition.

Cassandra - no master, sloppy quorum

Leader election: supports strong consistency not as availability

Discussion

Strong or eventual consistency: can be solved by leader election

Discussion

Should add more estimation

Discussion

cache design

Should we use queue?

Depends on latency requirement

Should clarify consistency requirement

Discussion:

Similar to SQL database + cache. Write through

Range query can be handled by SQL

Range query vs get/set. We can add cache and SQL

Range query 多坑

Discussion:

Can we propose Redis + rockDB on database

We are designing non-sql database

Focus should be on distributed

Discussion

dynamoDB uses berkeleyDB (B+ tree based)

Cassandra uses LSM tree

Discussion

Bloom filter index tree

===

===

Reference:

https://docs.google.com/document/d/1pqsaPts9fLFt2x3YFmG6D0KK-u3-wPIP6Nq4JAinGhY/edit#

Reference Design

===

Grading Criteria

Soft skill

Clarify requirements

Discuss trade offs

Present clearly; right technical terms

Pace the interview

Hard skills

Design quality

Basic facts and tradeoffs

Project and product lifecycle awareness

Problem Specific Criteria

Basics

Knowledge of Hashtable, B+, LSM tree; Tradeoffs

Data partitioning / traffic routing

Achieving availability

Achieving durability

Bonus:

Knowledge of specific technology or algorithm

===

Areas covered in 1-1 system design training

Part 1: High throughput infrastructure

Notification

Rate limiter

TopK

Key value store

Typeahead suggestion

Distributed message queue

Part 2: High volume infrastructure

Cloud file system (e.g. design google drive, dropbox)

Distributed log collection

Ads logging

Part 3: Collaboration Applications

Multi-user chat

News Feed

News Feed Real Time Comments

Like-unlike

Calendar

Part 4: Distributed Transaction Applications

Ebay auction

inventory management

Ticketmaster

Uber payment

Part 5: Content sharing Applications

YouTube

Google Photos

TinyURL

Part 6: Geography Applications

Design Uber

Design Yelp