Leader Election & Coordination
Many distributed problems get dramatically simpler if exactly one node is “in charge” — one writer, one scheduler, one owner of a shard. But “pick one node” is itself a hard distributed problem: the moment you have a leader, you have to handle the leader dying, the leader being wrongly declared dead, and two nodes both believing they’re the leader. This page is about getting that right.
Why elect a leader at all?
Section titled “Why elect a leader at all?”Coordination is expensive. If every node can write, you need conflict resolution, consensus on every operation, or careful commutative data types. Funnel decisions through a single leader and a whole class of races disappears:
Leaderless: every node decides → conflicts, coordination on every writeLeader: one node decides → simple ordering, but a single point to replaceThe trade-off is stark — what does this buy us, and what does it cost? A leader buys simplicity and a natural ordering of operations; it costs you a failover problem and a potential bottleneck. Most systems decide that trade is worth it, then spend their complexity budget on making failover safe.
Electing the leader
Section titled “Electing the leader”You can’t just let nodes shout “I’m the leader” — partitions and crashes make that chaotic. Real systems delegate election to a consensus protocol (see Consensus: Raft & Paxos). The pattern almost everyone uses in practice:
- Run a small, strongly-consistent coordination service — ZooKeeper, etcd, or Consul — itself backed by Raft/ZAB across an odd number of nodes (3 or 5) so it can tolerate failures via majority quorum.
- A would-be leader takes a lease: a lock with a time-to-live that it must keep renewing (“heartbeating”). If it stops renewing — because it crashed or got partitioned away — the lease expires and another node can claim leadership.
etcd / ZooKeeper (Raft quorum: 3 or 5 nodes) │ holds the lease key + TTL ▼ Node A renews lease every 2s ──► A is leader (A crashes) ──► lease expires after TTL ──► Node B claims it ──► B is leaderThe danger: split-brain
Section titled “The danger: split-brain”The hard case isn’t a clean crash — it’s a node that’s slow or partitioned but still alive. Its lease expires, the cluster elects a new leader, and then the old leader wakes up still believing it’s in charge. Now two leaders are issuing writes. That’s split-brain, and it corrupts data.
The fix: fencing tokens
Section titled “The fix: fencing tokens”The robust solution is to make every lock grant a monotonically increasing number — a fencing token — and have the protected resource itself reject anything stale.
A acquires lock → token 33 → (A stalls) ...B acquires lock → token 34 → B writes with token 34 ✓ storage records "highest seen: 34"A wakes, writes with token 33 → storage sees 33 < 34 → REJECTEDThe storage layer only accepts writes whose token is ≥ the highest it has seen. Now a delayed
ex-leader cannot clobber the new one, even if its software still thinks it’s the boss. The
coordination service hands out ever-increasing tokens (e.g. ZooKeeper’s zxid/version), and the data
store enforces them. This is the difference between a lock that advises and one that guarantees.
What coordination services give you
Section titled “What coordination services give you”Beyond leader election, a coordination service is the Swiss-army tool for “we need the cluster to agree on a small, critical fact”:
| Use | What it provides |
|---|---|
| Leader election | Whoever holds the lease key is leader; auto-expiry on failure |
| Service discovery | A consistent registry of “who is alive and where” |
| Distributed locks | Mutual exclusion across nodes (with fencing tokens) |
| Configuration | A single source of truth nodes can watch for changes |
| Membership | Detecting nodes joining/leaving via ephemeral keys + heartbeats |
The thread
Section titled “The thread”What does this buy us, and what does it cost? Leader election buys a single point of decision that makes ordering and ownership trivial; it costs a failover protocol, and — done naïvely — risks split-brain. The deeper lesson is that time and liveness can’t be trusted in a distributed system: you can never be sure a node is dead, only that you haven’t heard from it. Fencing tokens accept that uncertainty and make correctness depend on monotonic order rather than on guessing who’s alive.
Check your understanding
Section titled “Check your understanding”- What class of problems does electing a single leader simplify, and what new problem does it create?
- Why do real systems lean on a service like etcd/ZooKeeper instead of nodes electing among themselves directly?
- Describe split-brain and the exact sequence (including a GC pause) that produces it.
- How does a fencing token prevent a stalled ex-leader from corrupting data, and which component must enforce it?
- Why should a coordination service be kept off the request hot path?