Consensus Algorithm
Materials — open to everyone, no sign-in
Topic: Consensus Algorithm
Presenter: Longwei
Additional Resources:
Sign up for future events:
System Design Presentation Summary
Date: 9/11/2022
Topic: Consensus Algorithms
Presenter: Leo
Link: https://docs.google.com/presentation/d/1hAqPI1Nc8GzJcoIVNIjwsDhC9Ak49Fs9lg6VqMqaITY/edit?usp=sharing
Fault and partial failure
Unreliable clocks, time of the day clock vs monotonic clock
Distributed system:
there is not a way to define the total order of events in a distributed system
there is only partial order
Truth is defined by the majority
Leader & lock
Perfect fault detector doesn’t exist. Good enough tool: time-out
Tolerating fault is hard. No shared memory. Only rely on messages
Consensus
Linearizability: total order operations
Causality: 2 events one before another
Sequence number
Timestamps are difficult
Global unique, causality, incremented number
Non-causal ordering: TinyURL: how to provide global unique key?
Single machine incremental counter
Odd-even, each sequence
Snowflake (timestamp + instance + sequence)
Key generation service.
Lamport timestamp
Increment the node’s own counter
When I receive a counter, I will use the max (local time, received counter + 1)
Guaranteed causality but not total order
Partial sync
7:43 distributed consensus
Fault-tolerant consensus:
Paxos (no leader. too many round trip to agree)
Raft (simplified Paxos)
7:45 Raft
Leader election
Log replication
Partial sync
Crash-recovery
Use
FLP - if an asynchronized system relies on messages, and there can be crashes it’s impossible to reach consensus, UNLESS we use timeout
Use majority to elect leader
Use majority to build consensus
All nodes agree on log ordering
Leader decides total order
No two leaders at the same time (we can have 0 leader but cannot have 2 leaders)
7:51
Raft term
Sending: <msg, term>
3 states: follower, candidate, leader
7:55
Leader election
of nodes
Periodic suppression through Heartbeat (from monkey kingdom)
Election timeout
Leader election 2
If followers don’t hear from a leader, they can become a candidate
When candidate gets majority vote, it becomes a leader
All nodes need to know all other nodes
7:59 Log replication
New writes: always go to the leader, write it into a log but not committed
Get majority to agree the change
Leader commits
Leader asks followers to commit [what happens if the followers crashes]
8:02: edge cases
What if no one is elected
Why we need odd number of nodes? A and C can get equal number of votes. Use random back-off to resolve.
Network partition
If the cluster has been partitioned into 2, majority wins
The client writes to the minority partition, but there is no way for the minority partition to get majority vote, so changes are not committed.
The client writes to majority partition, that partition can save and commit work
8:06
Network repair
Rely on term to agree on who is the real leader
Rollback on uncommitted changes
8:09 network roundtrip
Meeting is productivity
Consensus - cost is delay in network
8:16 does the election need to be fair
No. If there is a stable board, the board can appoint a leader
8:17: recap
CP (consistency + partition)
Raft: lock on leader
Paxos: every write requires a leader election
AP (availability + partition)
First write
Resolve conflict during read
8:20: building block: timeout (lease)
Time bound lease is the best, because we cannot get a perfect fault detector
Heartbeat, is expensive
Logical clock: monotonically increment number per term
Raft
Cassandra: generation number
Kafka: epoch number
8:25: Building clock: lamport clock
Attached logical timestamp in the message
[do all nodes using lamport clock have a view of all transactions]
8:25: 2 phase commit (2PC)
Prepare phase, commit phase
Each node needs to know if other nodes stored the data
Drawback: coordinator is a single point of failure. [what if coordinator fails]
8:28: building block: quorum
Read quorum, write quorum
Higher term wins (not majority wins]
R + W > total
You can pick R and W. DynamoDB uses low W, but high R
Raft always use majority as quorum
8:35: what if we don’t use consensus?
DynamoDB: want to have high write availability, so writes are faster
Reads are slow. Using a vector clock to merge: Default policy: last write wins.
Failure handling: sloppy quorum ( 朋友圈 )
Don’t use heartbeat. Uses gossip
Vector clock
8:40 Just like Git
8:41 discussions
Dynamo paper - cassandra is the most popular implementation.
DynamoDB - single master database not based on Dynamo
What if leader crashes after the prepare phase is successful. Can you ask other nodes to commit?
Discussion:
What projects uses Raft? Etcd and other smaller projects
Kafka: kRaft but still in beta