Ds foundations · Distributed Systems · Advanced

Consensus and Coordination

How nodes agree on one value despite failures — two-phase commit, Raft, Paxos, and the ownership-election primitives that make leaders safe.

Ds foundations Advanced ⏱ 5 min read Complete

🧭 Analogy

A jury must return a single verdict even if a juror falls ill mid-deliberation. They can’t poll everyone forever, so they agree on a rule: a majority decides, and any new foreperson must already know the case. Distributed consensus is that majority rule, formalized so the system keeps agreeing even as members fail and rejoin.

Consensus is the act of getting independent, possibly-failing nodes to agree on a single value or order. It underpins both flavours of strong consistency from communication and consistency: for transactions all participants must agree to commit or abort; for replicas all must agree on update order.

Two-phase commit (2PC)

The classic atomic-commit protocol uses a coordinator that assigns a globally unique transaction id:

sequenceDiagram
participant C as Coordinator
participant P1 as Participant 1
participant P2 as Participant 2
C->>P1: Prepare
C->>P2: Prepare
P1-->>C: Vote commit (can commit durably)
P2-->>C: Vote commit
C->>P1: Commit
C->>P2: Commit
Note over C,P2: If any vote is abort, coordinator tells all to abort

In the prepare phase, a participant that prepares guarantees it can commit durably and can no longer abort. In the resolve phase, if all can commit, all commit; otherwise all abort. Participant failure is recoverable (consult the coordinator’s log on restart). The serious weakness is coordinator failure: participants that voted commit must block, holding locks, until the coordinator recovers — hurting availability and risking cascading failures. 2PC is consensus, but it is not fault-tolerant.

Replicated state machines and the consensus properties

Replica consistency requires applying updates in the same order everywhere — built on atomic / total-order broadcast over replicated state machines. A correct consensus algorithm provides:

Safety — all replicas agree on the same value.
Liveness — a value is eventually chosen, even with failures (fault tolerance).
Validity — the chosen value was actually proposed.

Fault tolerance requires operating on a quorum (majority) and handling leader/follower failure via elections.

Raft

Raft is the deliberately “understandable” leader-based algorithm. Clusters are odd-sized (3 or 5); nodes are leader, follower, or candidate.

The leader sends heartbeats (~300–500 ms) and owns a monotonically increasing term (a logical clock — one leader per term, persisted by all).
Updates go to the leader, which orders them, appends them as uncommitted, sends AppendEntries, and marks them committed once a majority acknowledges.
Leader election: each follower runs a randomized election timer; on expiry it becomes a candidate, increments the term, votes for itself, and sends RequestVote. A follower votes only if the incoming term is greater and the candidate’s log is at least as up to date — guaranteeing any elected leader already holds all committed entries.

stateDiagram-v2
[*] --> Follower
Follower --> Candidate: election timer expires
Candidate --> Leader: wins majority vote
Candidate --> Follower: sees higher term
Candidate --> Candidate: split vote, new election
Leader --> Follower: sees higher term

Raft powers Neo4j, YugabyteDB, etcd, and Hazelcast. Paxos (Lamport) is the leaderless ancestor, notoriously hard; Multi-Paxos (leader-based, like Raft) underpins Spanner.

💡 Consensus makes 2PC safe

Google Spanner runs its multi-split 2PC itself as a Paxos group: the coordinator’s state is replicated, so if it fails a participant takes over — eliminating 2PC’s blocking weakness. The lesson generalizes: layer 2PC on top of a fault-tolerant consensus group rather than trusting a single coordinator.

Ownership election: the practical face of consensus

Brendan Burns frames a common need: exactly one replica owns a task (a scheduler, a shard’s master), with automatic re-election on failure. His advice is emphatic:

⚠️ Don't roll your own consensus

Implementing Paxos or Raft by hand is like implementing locks on raw compare-and-swap — interesting academically, rarely worthwhile, and easy to get subtly wrong. Use a key-value store that already implements consensus: etcd, ZooKeeper, or consul.

These stores expose two essential primitives:

Compare-and-swap — atomically write a new value only if the current value matches the expected one.
TTL — a key auto-clears when its time-to-live expires.

A distributed lock is compare-and-swap against nil, written with a TTL so a holder that crashes mid-lock doesn’t deadlock everyone. Ownership (a lease) is a renewable lock: the owner renews it every ttl/2 seconds, and on loss simply terminates so the orchestrator restarts it as a standby.

Two classic races and their fixes:

A slow holder can lose the lock on TTL expiry, then its late unlock releases the new owner’s lock. Fix: unlock/renew perform compare-and-swap against both the value and the resource version recorded at lock time.
A delayed request R1 from a deposed-then-reinstated master could be wrongly accepted after a newer R2. Fix: send the resource version with every request and have the called system double-check that the requester is still the current owner, rejecting on any version mismatch.

Do you even need it?

Burns is equally clear that consensus is often unnecessary. A single replica with orchestrator auto-restart yields roughly two-to-four nines depending on crash and reschedule times. “Deciding when the distributed part is unnecessarily complex is a key skill.” Reach for elected leaders only when you genuinely need four-plus nines.

When to use it — and when not

✅ Reach for it when

When replicas must apply updates in the same order or a transaction must commit atomically across partitions.
When exactly one replica must own a task (a leader/master) and survive failure.
When you need a distributed lock or lease that auto-releases if its holder dies.

⛔ Think twice when

When a single replica with orchestrator restart gives an acceptable SLA — don't add consensus you don't need.
Implementing Paxos/Raft yourself instead of using etcd, ZooKeeper, or consul.

ds-foundationsCommunication and Consistency: CAP and the Models

What CAP really forces you to choose, and the spectrum from eventual to strict serializability — so you pick a consistency model on purpose, not by accident.

ds-foundationsTime and Ordering in Distributed Systems

Why wall-clock timestamps can't order events across machines, and how logical clocks, version vectors, and anti-entropy reason about order instead.

ds-scalabilityDistributed Databases: Replication and Sharding

Scaling the data tier — read replicas, partitioning and sharding, leader-follower vs leaderless replication, NoSQL data models, and the consistency knobs real engines expose.

Check your understanding

Score: 0 / 4

1. What is the serious weakness of two-phase commit (2PC)?

2PC is consensus but not fault-tolerant: a coordinator crash leaves prepared participants blocked, hurting availability. The fix is to replicate coordinator state via a fault-tolerant consensus algorithm.

2. In Raft, when does a candidate win an election?

A follower votes only if the candidate's term is greater and its log is at least as up to date, and a majority is required — guaranteeing the elected leader holds all committed entries.

3. Which two primitives from a consensus key-value store make distributed locks safe?

Compare-and-swap atomically sets a value only if it matches the expected one; TTL auto-clears a key so a crashed lock holder doesn't deadlock everyone. Resource versions close the slow-holder race.

4. Why does Spanner run its two-phase commit *as a Paxos group*?

By replicating the coordinator's state across a Paxos group, a new coordinator can be elected if the original fails, so prepared participants are never permanently blocked.

Sync across devices

Consensus and Coordination

Two-phase commit (2PC)

Replicated state machines and the consensus properties

Raft

Ownership election: the practical face of consensus

Do you even need it?

See also

When to use it — and when not

✅ Reach for it when

⛔ Think twice when

Check your understanding

Comments

Two-phase commit (2PC)

Replicated state machines and the consensus properties

Raft

Ownership election: the practical face of consensus

Do you even need it?

See also

When to use it — and when not

✅ Reach for it when

⛔ Think twice when

Related topics

Check your understanding

Comments