🧭 Analogy
A jury must return a single verdict even if a juror falls ill mid-deliberation. They can’t poll everyone forever, so they agree on a rule: a majority decides, and any new foreperson must already know the case. Distributed consensus is that majority rule, formalized so the system keeps agreeing even as members fail and rejoin.
Consensus is the act of getting independent, possibly-failing nodes to agree on a single value or order. It underpins both flavours of strong consistency from communication and consistency: for transactions all participants must agree to commit or abort; for replicas all must agree on update order.
Two-phase commit (2PC)
The classic atomic-commit protocol uses a coordinator that assigns a globally unique transaction id:
sequenceDiagram participant C as Coordinator participant P1 as Participant 1 participant P2 as Participant 2 C->>P1: Prepare C->>P2: Prepare P1-->>C: Vote commit (can commit durably) P2-->>C: Vote commit C->>P1: Commit C->>P2: Commit Note over C,P2: If any vote is abort, coordinator tells all to abort
In the prepare phase, a participant that prepares guarantees it can commit durably and can no longer abort. In the resolve phase, if all can commit, all commit; otherwise all abort. Participant failure is recoverable (consult the coordinator’s log on restart). The serious weakness is coordinator failure: participants that voted commit must block, holding locks, until the coordinator recovers — hurting availability and risking cascading failures. 2PC is consensus, but it is not fault-tolerant.
Replicated state machines and the consensus properties
Replica consistency requires applying updates in the same order everywhere — built on atomic / total-order broadcast over replicated state machines. A correct consensus algorithm provides:
- Safety — all replicas agree on the same value.
- Liveness — a value is eventually chosen, even with failures (fault tolerance).
- Validity — the chosen value was actually proposed.
Fault tolerance requires operating on a quorum (majority) and handling leader/follower failure via elections.
Raft
Raft is the deliberately “understandable” leader-based algorithm. Clusters are odd-sized (3 or 5); nodes are leader, follower, or candidate.
- The leader sends heartbeats (~300–500 ms) and owns a monotonically increasing term (a logical clock — one leader per term, persisted by all).
- Updates go to the leader, which orders them, appends them as uncommitted, sends
AppendEntries, and marks them committed once a majority acknowledges. - Leader election: each follower runs a randomized election timer; on expiry it becomes a candidate, increments the term, votes for itself, and sends
RequestVote. A follower votes only if the incoming term is greater and the candidate’s log is at least as up to date — guaranteeing any elected leader already holds all committed entries.
stateDiagram-v2 [*] --> Follower Follower --> Candidate: election timer expires Candidate --> Leader: wins majority vote Candidate --> Follower: sees higher term Candidate --> Candidate: split vote, new election Leader --> Follower: sees higher term
Raft powers Neo4j, YugabyteDB, etcd, and Hazelcast. Paxos (Lamport) is the leaderless ancestor, notoriously hard; Multi-Paxos (leader-based, like Raft) underpins Spanner.
💡 Consensus makes 2PC safe
Google Spanner runs its multi-split 2PC itself as a Paxos group: the coordinator’s state is replicated, so if it fails a participant takes over — eliminating 2PC’s blocking weakness. The lesson generalizes: layer 2PC on top of a fault-tolerant consensus group rather than trusting a single coordinator.
Ownership election: the practical face of consensus
Brendan Burns frames a common need: exactly one replica owns a task (a scheduler, a shard’s master), with automatic re-election on failure. His advice is emphatic:
⚠️ Don't roll your own consensus
Implementing Paxos or Raft by hand is like implementing locks on raw compare-and-swap — interesting academically, rarely worthwhile, and easy to get subtly wrong. Use a key-value store that already implements consensus: etcd, ZooKeeper, or consul.
These stores expose two essential primitives:
- Compare-and-swap — atomically write a new value only if the current value matches the expected one.
- TTL — a key auto-clears when its time-to-live expires.
A distributed lock is compare-and-swap against nil, written with a TTL so a holder that crashes mid-lock doesn’t deadlock everyone. Ownership (a lease) is a renewable lock: the owner renews it every ttl/2 seconds, and on loss simply terminates so the orchestrator restarts it as a standby.
Two classic races and their fixes:
- A slow holder can lose the lock on TTL expiry, then its late
unlockreleases the new owner’s lock. Fix:unlock/renewperform compare-and-swap against both the value and the resource version recorded at lock time. - A delayed request
R1from a deposed-then-reinstated master could be wrongly accepted after a newerR2. Fix: send the resource version with every request and have the called system double-check that the requester is still the current owner, rejecting on any version mismatch.
Do you even need it?
Burns is equally clear that consensus is often unnecessary. A single replica with orchestrator auto-restart yields roughly two-to-four nines depending on crash and reschedule times. “Deciding when the distributed part is unnecessarily complex is a key skill.” Reach for elected leaders only when you genuinely need four-plus nines.
See also
- Communication and consistency — the consistency models consensus enables.
- Time and ordering — terms and logical clocks underpinning Raft.
- Distributed databases — Raft/Paxos inside real engines.
- Serving patterns — where ownership election fits the bigger picture.
When to use it — and when not
✅ Reach for it when
- When replicas must apply updates in the same order or a transaction must commit atomically across partitions.
- When exactly one replica must own a task (a leader/master) and survive failure.
- When you need a distributed lock or lease that auto-releases if its holder dies.
⛔ Think twice when
- When a single replica with orchestrator restart gives an acceptable SLA — don't add consensus you don't need.
- Implementing Paxos/Raft yourself instead of using etcd, ZooKeeper, or consul.
Related topics
What CAP really forces you to choose, and the spectrum from eventual to strict serializability — so you pick a consistency model on purpose, not by accident.
ds-foundationsTime and Ordering in Distributed SystemsWhy wall-clock timestamps can't order events across machines, and how logical clocks, version vectors, and anti-entropy reason about order instead.
ds-scalabilityDistributed Databases: Replication and ShardingScaling the data tier — read replicas, partitioning and sharding, leader-follower vs leaderless replication, NoSQL data models, and the consistency knobs real engines expose.
Check your understanding
Score: 0 / 41. What is the serious weakness of two-phase commit (2PC)?
2PC is consensus but not fault-tolerant: a coordinator crash leaves prepared participants blocked, hurting availability. The fix is to replicate coordinator state via a fault-tolerant consensus algorithm.
2. In Raft, when does a candidate win an election?
A follower votes only if the candidate's term is greater and its log is at least as up to date, and a majority is required — guaranteeing the elected leader holds all committed entries.
3. Which two primitives from a consensus key-value store make distributed locks safe?
Compare-and-swap atomically sets a value only if it matches the expected one; TTL auto-clears a key so a crashed lock holder doesn't deadlock everyone. Resource versions close the slow-holder race.
4. Why does Spanner run its two-phase commit *as a Paxos group*?
By replicating the coordinator's state across a Paxos group, a new coordinator can be elected if the original fails, so prepared participants are never permanently blocked.
Comments
Sign in with GitHub to join the discussion.