Distributed Key-Value Store
Materials — open to everyone, no sign-in
Topic: Distributed Key-Value Store
Interviewer: video from 2-13-2022
Interviewee: system
Level: L4 (Experienced Individual Contributor)
Additional Resources:
System Design Interview
Join Us on Wechat
Interview Notes
Requirements
Functional
Key value store
Data size: 10 TB/year
Value of the key: 1KB on average. Small key, value
Average size of value = 1KB. No need to handle very big key-values.
QPS
Eventual consistency
Key and value are strings
Scalaling
QPS 100K
Latency 50ms
Availability: 99.99%
Fault tolerant, no single point of failure
Scalability
High concurrency
[15:30]
Design for write intensive rather than read intensive
[17:02]
Diagram [ drawn with Miro]
[17:44]
[20:26]
[21:29]
[22:14]
Config server is for metadata
[23:33]
Client -> load balancer (http json) -> request manager
Request manager -> config server to find responsible storage
Request manager -> data server
Q: Why use JSON?
A: it is easier for frontend to use JSON
Q: Why HTTP?
A: HTTP: more widely used. TCP more for internal use
[JSON/HTTP]
Q: Does the request manager call the config server every time?
A: we can add a cache before the config server
Q: why do we add server side cache, not the client side
A: I need to compare the mapping version every time between config server and client
Q: do we invalidate the cache every time config mapping is updated?
A: yes
Config server can maintain heartbeat with the data servers for leader election
When client is initialized, it obtains the mapping request from the config server
Data server is responsible for data storage; responsible for data migration
Q: what happens if config server goes down
A: we can have a replica config server
Add a sentinel to monitor config server
Sentinel server will be switch config server
Q: how do we handle write-intensive requests? How to handle highly concurrent requests?
A: can use level DB as database engine
[May want to use log structured merge tree (LSM tree) ]
Q: why it can be fast?
A: because level DB uses LSM tree to support write-heavy nodes
[ https://en.wikipedia.org/wiki/Log-structured_merge-tree#/media/File:LSM_Tree.png ]
Sorted string tables
Divide table into segments
Q: how do you add a new data server? How do you update the client and meta data server
A: the config server will do this
We can use consistent hashing to distribute data
[may need a migration sequence to split data]
If one node is down, we will send requests to other nodes
Config server contains mapping table
Data server info of many rows similar to the following: 192.16, A
(ip address, node name)
This table is stored in config nodes and cache
If one node is down, we modify the mapping table and send it to the client
Q: How do you invalidate old cache?
A: we will store the version number
When new node goes down, it will regenerate the config table. The config server will sync the server to the client
Consistency
Q: eventual consistency. How do we guarantee strong consistency?
A: if we change data 2 nodes out of 3 nodes, we can send success to the client
Q: one key is added, there may be delay to the secondary. We may have inconsistency. What is the most straightforward way?
A: we can use RAFT
Q: we can just redirect the reads to the master server
Concurrency
Q: concurrent users changing the same key
A: we can store a version number.
V10 5->6
V11 5->25
V10 -> key=5->v=25
V10 -> key=5->v=30
Q: optimistic control? How do you compare this to locks?
A: We can just use the version number. The version number is stored with a key-value pair.
Durability
Q: how do we handle durability?
[53:00]
After we write data to memory, the server goes down?
[may want to write ahead the changes to disk]
A: durability is not a problem. We have backup for the config server and data store. Therefore durability is not a problem.
Write-ahead-log
Audience discussion
Elastic search - Consistency hash - no need to add secondary - 1 file will be split into 5.
Replica of the data can be put on different nodes.
Controller node makes sense
Does elastic search support strong consistency?
Master / slave - supports strong consistency.
Quorum based - may not support strong consistency
We can check if user reads immediately after write, will they get the latest version?
In elastic search: Primary shard vs replica shard
Is it equivalent to the diagram
Physical node, virtual node
We will add 3 together
Dynamo style: not easy to support strong consistency
Primary + secondary nodes: easier to support strong consistency
If there are 15 computers, we divide into 5 groups. Each group is a “super node” for consistent hashing.
===
HTTP is built on top of TCP
Key is what protocol is used on top of TCP
===
Optimistic locking vs pessimistic locking
Key-value where all records are independent: may be better to use optimistic lock
===
Order: distributed -> then data structure in single node
It may be better to reverse the order to discuss single node and then distributed