Distributed Key-Value Store

System DesignDatabases & StorageDistributed Systems

Materials — open to everyone, no sign-in

Topic: Distributed Key-Value Store

Interviewer: video from 2-13-2022

Interviewee: system

Level: L4 (Experienced Individual Contributor)

Additional Resources:


System Design Interview

Join Us on Wechat

Interview Notes

Requirements

Functional

Key value store

Data size: 10 TB/year

Value of the key: 1KB on average. Small key, value

Average size of value = 1KB. No need to handle very big key-values.

QPS

Eventual consistency

Key and value are strings

Scalaling

QPS 100K

Latency 50ms

Availability: 99.99%

Fault tolerant, no single point of failure

Scalability

High concurrency

[15:30]

Design for write intensive rather than read intensive

[17:02]

Diagram [ drawn with Miro]

[17:44]

[20:26]

[21:29]

[22:14]

Config server is for metadata

[23:33]

Client -> load balancer (http json) -> request manager

Request manager -> config server to find responsible storage

Request manager -> data server

Q: Why use JSON?

A: it is easier for frontend to use JSON

Q: Why HTTP?

A: HTTP: more widely used. TCP more for internal use

[JSON/HTTP]

Q: Does the request manager call the config server every time?

A: we can add a cache before the config server

Q: why do we add server side cache, not the client side

A: I need to compare the mapping version every time between config server and client

Q: do we invalidate the cache every time config mapping is updated?

A: yes

Config server can maintain heartbeat with the data servers for leader election

When client is initialized, it obtains the mapping request from the config server

Data server is responsible for data storage; responsible for data migration

Q: what happens if config server goes down

A: we can have a replica config server

Add a sentinel to monitor config server

Sentinel server will be switch config server

Q: how do we handle write-intensive requests? How to handle highly concurrent requests?

A: can use level DB as database engine

[May want to use log structured merge tree (LSM tree) ]

Q: why it can be fast?

A: because level DB uses LSM tree to support write-heavy nodes

[ https://en.wikipedia.org/wiki/Log-structured_merge-tree#/media/File:LSM_Tree.png ]

Sorted string tables

Divide table into segments

Q: how do you add a new data server? How do you update the client and meta data server

A: the config server will do this

We can use consistent hashing to distribute data

[may need a migration sequence to split data]

If one node is down, we will send requests to other nodes

Config server contains mapping table

Data server info of many rows similar to the following: 192.16, A

(ip address, node name)

This table is stored in config nodes and cache

If one node is down, we modify the mapping table and send it to the client

Q: How do you invalidate old cache?

A: we will store the version number

When new node goes down, it will regenerate the config table. The config server will sync the server to the client

Consistency

Q: eventual consistency. How do we guarantee strong consistency?

A: if we change data 2 nodes out of 3 nodes, we can send success to the client

Q: one key is added, there may be delay to the secondary. We may have inconsistency. What is the most straightforward way?

A: we can use RAFT

Q: we can just redirect the reads to the master server

Concurrency

Q: concurrent users changing the same key

A: we can store a version number.

V10 5->6

V11 5->25

V10 -> key=5->v=25

V10 -> key=5->v=30

Q: optimistic control? How do you compare this to locks?

A: We can just use the version number. The version number is stored with a key-value pair.

Durability

Q: how do we handle durability?

[53:00]

After we write data to memory, the server goes down?

[may want to write ahead the changes to disk]

A: durability is not a problem. We have backup for the config server and data store. Therefore durability is not a problem.

Write-ahead-log

Audience discussion

Elastic search - Consistency hash - no need to add secondary - 1 file will be split into 5.

Replica of the data can be put on different nodes.

Controller node makes sense

Does elastic search support strong consistency?

Master / slave - supports strong consistency.

Quorum based - may not support strong consistency

We can check if user reads immediately after write, will they get the latest version?

In elastic search: Primary shard vs replica shard

Is it equivalent to the diagram

Physical node, virtual node

We will add 3 together

Dynamo style: not easy to support strong consistency

Primary + secondary nodes: easier to support strong consistency

If there are 15 computers, we divide into 5 groups. Each group is a “super node” for consistent hashing.

===

HTTP is built on top of TCP

Key is what protocol is used on top of TCP

===

Optimistic locking vs pessimistic locking

Key-value where all records are independent: may be better to use optimistic lock

===

Order: distributed -> then data structure in single node

It may be better to reverse the order to discuss single node and then distributed