Consensus Algorithm

System DesignDistributed Systems

Materials — open to everyone, no sign-in

Topic: Consensus Algorithm

Presenter: Longwei

Additional Resources:


Sign up for future events:

System Design Presentation Summary

Date: 9/11/2022

Topic: Consensus Algorithms

Presenter: Leo

Link: https://docs.google.com/presentation/d/1hAqPI1Nc8GzJcoIVNIjwsDhC9Ak49Fs9lg6VqMqaITY/edit?usp=sharing

Fault and partial failure

Unreliable clocks, time of the day clock vs monotonic clock

Distributed system:

there is not a way to define the total order of events in a distributed system

there is only partial order

Truth is defined by the majority

Leader & lock

Perfect fault detector doesn’t exist. Good enough tool: time-out

Tolerating fault is hard. No shared memory. Only rely on messages

Consensus

Linearizability: total order operations

Causality: 2 events one before another

Sequence number

Timestamps are difficult

Global unique, causality, incremented number

Non-causal ordering: TinyURL: how to provide global unique key?

Single machine incremental counter

Odd-even, each sequence

Snowflake (timestamp + instance + sequence)

Key generation service.

Lamport timestamp

Increment the node’s own counter

When I receive a counter, I will use the max (local time, received counter + 1)

Guaranteed causality but not total order

Partial sync

7:43 distributed consensus

Fault-tolerant consensus:

Paxos (no leader. too many round trip to agree)

Raft (simplified Paxos)

7:45 Raft

Leader election

Log replication

Partial sync

Crash-recovery

Use

FLP - if an asynchronized system relies on messages, and there can be crashes it’s impossible to reach consensus, UNLESS we use timeout

Use majority to elect leader

Use majority to build consensus

All nodes agree on log ordering

Leader decides total order

No two leaders at the same time (we can have 0 leader but cannot have 2 leaders)

7:51

Raft term

Sending: <msg, term>

3 states: follower, candidate, leader

7:55

Leader election

of nodes

Periodic suppression through Heartbeat (from monkey kingdom)

Election timeout

Leader election 2

If followers don’t hear from a leader, they can become a candidate

When candidate gets majority vote, it becomes a leader

All nodes need to know all other nodes

7:59 Log replication

New writes: always go to the leader, write it into a log but not committed

Get majority to agree the change

Leader commits

Leader asks followers to commit [what happens if the followers crashes]

8:02: edge cases

What if no one is elected

Why we need odd number of nodes? A and C can get equal number of votes. Use random back-off to resolve.

Network partition

If the cluster has been partitioned into 2, majority wins

The client writes to the minority partition, but there is no way for the minority partition to get majority vote, so changes are not committed.

The client writes to majority partition, that partition can save and commit work

8:06

Network repair

Rely on term to agree on who is the real leader

Rollback on uncommitted changes

8:09 network roundtrip

Meeting is productivity

Consensus - cost is delay in network

8:16 does the election need to be fair

No. If there is a stable board, the board can appoint a leader

8:17: recap

CP (consistency + partition)

Raft: lock on leader

Paxos: every write requires a leader election

AP (availability + partition)

First write

Resolve conflict during read

8:20: building block: timeout (lease)

Time bound lease is the best, because we cannot get a perfect fault detector

Heartbeat, is expensive

Logical clock: monotonically increment number per term

Raft

Cassandra: generation number

Kafka: epoch number

8:25: Building clock: lamport clock

Attached logical timestamp in the message

[do all nodes using lamport clock have a view of all transactions]

8:25: 2 phase commit (2PC)

Prepare phase, commit phase

Each node needs to know if other nodes stored the data

Drawback: coordinator is a single point of failure. [what if coordinator fails]

8:28: building block: quorum

Read quorum, write quorum

Higher term wins (not majority wins]

R + W > total

You can pick R and W. DynamoDB uses low W, but high R

Raft always use majority as quorum

8:35: what if we don’t use consensus?

DynamoDB: want to have high write availability, so writes are faster

Reads are slow. Using a vector clock to merge: Default policy: last write wins.

Failure handling: sloppy quorum ( 朋友圈 )

Don’t use heartbeat. Uses gossip

Vector clock

8:40 Just like Git

8:41 discussions

Dynamo paper - cassandra is the most popular implementation.

DynamoDB - single master database not based on Dynamo

What if leader crashes after the prepare phase is successful. Can you ask other nodes to commit?

Discussion:

What projects uses Raft? Etcd and other smaller projects

Kafka: kRaft but still in beta