Distributed Key-value Store
Topic: Distributed Key-value Store
Interviewer: shihao (沉)
Interviewee: Tom (慢)
Level: L5 (Senior)
Group QR code
Topic
Mock System Design Interview Summary
Interview Overview
Date: 3/6/2022
Target level: L5
Duration: 45 minutes
Topic covered: Key-value store
Requirements
Functional requirements
Design the next generation of distributed KV store that will support core service like payment, authentication, authorization, etc
KV store need to scale linearly
We do not need to consider multi-data center deployment for now, but may need it after 5 years
Range query, batch inserts
Non functional requirements
Functional requirements
Support payment/authentication
Scale linearly
Multi data center after 5 years
Range query and batch query/update
Non functional requirements
Latency: Users should have real time experience
Availability: 99.9%
Robustness: user data should not get lost
Capacity planning
read/write ratio = 99/1
10k write QPS
=> 990k read QPS
Key <= 1kb
Value <= 8kb
Average k+v size 2kb
=> 2kb * 10k / s
System Design
Options:
Centralized LB => db shard (Google spanner) (recommended)
Pro: quick, easy to manage
Cons: the metadata table might be the bottleneck
Distributed hash table => Dynamodb
Pros: easy to scale
Latency - need some extra jumps - add more data in a single machine
Global LB try to find the best
External APIs
getDataRequest(user_token, key, version) - res, error return a string as result
Return a string as result, version_id
Error: error_code, error message
System design
Index: type of index for one cluster
How to implement batch read / batch write
Index:
B+ tree, log structured merge tree
B+ Look up tree is O(log(N))
Better for
LSM tree very good write performance
LSM is close to append
Write performance is good
Which one will you pick?
Read is heavy.
B+ tree can perform better
LSM read is not worse than B+ tree
LSM has good write performance
How do you perform range query?
LSM
2 batch writes, batch1, batch2
Batch 1 is successful
Transactions
Most strict level of isolation
Lock solution, optimistic, pessimistic lock.
Depends: Strong consistent read, or not
How to handle: High throughput?
Every write goes to master
Every read goes to master or slave, without consistency guarantee
Every read goes to master for consistency guarantee
Master - slave
Write comes
Write to master
If want robustness - master sync data to slaves
Master return succeed
Master + slave = N
Assume we read b copies
a + b > n => strong consistency read
Master can go down
Master election algorithm, e.g. paxo, raft
Does raft require read+write > N?
Read from master => strong consistency read
Paxo is only for election of master. This is the node to accept write.
If a cluster is down, can promote from another cluster
Zookeeper key1/cluster_1 => cluster_2
If read query needs to read across 2 different clusters?
Shard based on what?
Based on key range, or based on hash value
Read from 1-1000, 1-100 in shard 1, 101-1000 in shard 2
Need to query different shards
How to enforce ACID?
There can be an abstract layer on top of the shards
We can use multi-version control
Read timestamp <= my_timestamp
Time is not accurate
Spanner - true time
Computer network time protocol
Try to get pretty close clock
Same data center - there can be an atom clock
There can be a bottleneck
Interviewer and Audience Feedback
====
===
Interviewer
Hire
Describe tradeoff
Want to see more accurate technical terms
Range query
Forces B+ tree
Consistent model
Payment, authentication ACID
K/V size
Spanner uses B+ tree
Prefers B+ tree but interviewee is more familiar with LSM tree
Self implement B+ tree
Value in memory. Cluster index vs non cluster
Read-write conflict
Write-write conflict - ran out of time
Want to have snapshot
Distributed:
Consistent hashing
Range based - need separate meta data
Master may become a bottleneck
partition
WDR vs election based?
Distributed transaction
Interviewer’s design
Within shard we can put more meta data
Range
Timestamp 2PL
Get all locks
SAGA pattern?
How to reduce pressure for single master?
Only talk to single master: Need a new tablet / partition
Meta data is only for
CRUD for master?
Only for split and merge
Distributed B+ tree
QPS 10k
At the beginning when new client is created, will need to visit the root tablet
Then client can cache the root tablet content
Out of date metadata: then it will requery
For scale up: we can move the data gradually
How to handle hot tablet?
With range query.
Config table may create hot tablet
Change master slave into quorum based
Can use quorum read write. Read more
Interviewee:
Distributed system - how to implement consistency
Range query: If it’s a list then we may be issue queries in parallel
==
Soft skill:
Let the interviewer finish asking
===
Clarifying question:
Distributed key value store
How is it a distributed system?
Designing an infrastructure, will be a bit more detail
I gave the business requirement,
2 machines or more are distributed system
Key value store - is index the same as key?
Hashtable, want to use range query
Read heavy and range query heavy
Range query
If query contains range across multiple tablets
Reads: retrieve timestamp and then read from multiple tablets
Gossip:
Leaderless: no reliable timestamp
Similar to multi-range system
Design distributed key-value store
What does “next generation” mean?
Don’t want to use dynamo design.
2PL, 2 Phase Lock, SAGA
2 phase lock
Write-write conflict
Write 1-2-3
Write 2-3-4
We can lock 1-2-3
Then write
Then unlock, then commit
2 phase commit
Coordinator
Try pre-execute
Make all component ready for commit
Then coordinator can commit everything
Saga: try-catch confirm
Can I make a spanner system?
Interviewer: we don’t need people to memorize any particular paper.
Do keys need to be sorted?
I will probably use an order
WR - can guarantee strong consistency
Why is range query a transaction?
Single record is not a transaction
Split-merge
==
Distributed
Cache
Consistent
How to scale
Requirement gathering
API not important
Single node - indexing, business logic
Distributed - sharding, replication
==
Distributed transaction
==
2 phase commit is a basic protocol
Strong consistency
==
Can we use an existing system
Related to workflow
Everyone has different question