Skip to content

Consensus: Raft & Paxos

Many distributed systems need exactly one thing to be true: a set of independent, failure-prone machines must agree on a single value — who the leader is, what the next entry in the log is, what the current config says. This is the consensus problem, and it is the beating heart of etcd, ZooKeeper, Spanner, Kafka’s controller, and the control plane of nearly every platform you’ll touch. It sounds trivial (“just take a vote”) and is, in fact, one of the deepest results in computer science. This page builds it from why it’s hard up to how Raft and Paxos actually work.

On one machine, agreement is free: there’s one decider. Distribute it, and two demons appear.

First, nodes fail — they crash mid-decision, leaving others unsure whether a value was chosen. Second, the network is unreliable: messages drop, delay, and reorder, so a silent node is indistinguishable from a dead one (the same ambiguity from the overview). A correct consensus algorithm must guarantee, despite all this:

  • Agreement — no two nodes decide different values.
  • Validity — the decided value was actually proposed by someone (no inventing values).
  • Termination — every non-failing node eventually decides.

The mechanism that makes consensus possible is the quorum — requiring agreement from a majority of nodes (more than half). Majorities have one magical property:

Any two majorities of the same set must overlap in at least one node.

5-node cluster. A majority is 3.
Majority X: {A, B, C} Majority Y: {C, D, E}
└──── overlap at C ────┘
C "remembers" what X decided and tells Y. No two majorities
can decide conflicting things, because some node is in BOTH.

That overlap is how the system avoids two different decisions: any new majority is guaranteed to include at least one node that witnessed the previous one. It’s also why clusters are sized odd — 3, 5, 7. A 5-node cluster tolerates 2 failures (3 remain to form a majority); a 6-node cluster also only tolerates 2 (you still need 4 for majority of 6), so the sixth node buys you nothing but cost. The general rule: to tolerate f failures you need 2f + 1 nodes.

Raft: consensus you can actually understand

Section titled “Raft: consensus you can actually understand”

Paxos came first and is correct, but notoriously hard to follow. Raft (Ongaro & Ousterhout, 2014) was designed explicitly for understandability, and it now underlies etcd, Consul, and TiKV. It decomposes consensus into three digestible pieces.

Raft elects a single leader; all writes flow through it, which sidesteps the chaos of everyone proposing at once. Time is divided into numbered terms (a logical clock — see Time, Clocks & Ordering). Each node is a follower, candidate, or leader.

Follower hears nothing from a leader for a random timeout
│ (randomness prevents everyone timing out at once)
Becomes CANDIDATE → increments term → requests votes from all nodes
├─ wins a MAJORITY of votes ──────────────► becomes LEADER
├─ hears from a leader with ≥ its term ────► steps back to FOLLOWER
└─ split vote (no majority) ───────────────► new term, try again

A node votes for at most one candidate per term, and only for a candidate whose log is at least as up-to-date as its own. Combined with the majority requirement, this guarantees at most one leader per term — the overlap property again.

The leader appends each client command to its log and replicates it to followers via AppendEntries messages. An entry is committed once a majority have stored it — at which point it’s safe to apply, because any future leader is guaranteed (by the up-to-date-log voting rule) to already contain every committed entry. The followers’ logs converge to match the leader’s, in order.

The two rules above interlock to give Raft its core guarantee: if any node has applied an entry at a given log index, no other node will ever apply a different entry at that index. One log, one truth, replicated.

Paxos (Lamport, 1998) solves the same problem with a different shape. A proposer runs a two-phase protocol against a quorum of acceptors:

  1. Prepare — the proposer picks a monotonically increasing proposal number n and asks acceptors to promise not to accept anything numbered below n. If a majority promise (and report any value they’ve already accepted), the proposer proceeds.
  2. Accept — the proposer asks the majority to accept value v at number n. If a value was already reported in phase 1, the proposer is obligated to reuse it — this is the rule that preserves agreement across rounds.

Basic Paxos decides a single value; Multi-Paxos chains it to agree on a log of values (and, in practice, elects a stable leader to skip phase 1 on the common path — converging on the same leader-based shape as Raft). Raft and Multi-Paxos are, under the hood, close cousins; Raft just packages the moving parts so humans can hold them in their heads.

Consensus is the foundation that lets a cluster behave like a single linearizable machine for the data that must never disagree: etcd (Kubernetes’ brain), ZooKeeper (Kafka, HBase coordination), Spanner, CockroachDB, distributed locks, and leader election itself (next page: Leader Election & Coordination). What does it buy us? A trustworthy single source of truth that survives a minority of failures. What does it cost? Every committed decision requires a majority round-trip — so writes are slower than a single node, the cluster halts if it can’t reach a majority (it chooses CP), and you pay for 2f + 1 machines to tolerate f failures. That price is exactly why you run consensus only for the small, critical core of state, and keep the bulk of your data on cheaper, weaker stores.

  1. State the three properties any consensus algorithm must satisfy, and explain why FLP says you can’t guarantee all three in a fully asynchronous network.
  2. Why must any two majority quorums overlap, and how does that single fact prevent two conflicting decisions?
  3. Why does a 6-node cluster tolerate no more failures than a 5-node one? Give the 2f + 1 reasoning.
  4. In Raft, what two rules together guarantee at most one leader per term, and why is randomized election timeout important?
  5. In Paxos, why is a proposer sometimes forced to reuse a value reported during the prepare phase rather than proposing its own?