Leader Election & Coordination

Many distributed problems get dramatically simpler if exactly one node is “in charge” — one writer, one scheduler, one owner of a shard. But “pick one node” is itself a hard distributed problem: the moment you have a leader, you have to handle the leader dying, the leader being wrongly declared dead, and two nodes both believing they’re the leader. This page is about getting that right.

Why elect a leader at all?

Coordination is expensive. If every node can write, you need conflict resolution, consensus on every operation, or careful commutative data types. Funnel decisions through a single leader and a whole class of races disappears:

Leaderless: every node decides → conflicts, coordination on every write
Leader:     one node decides   → simple ordering, but a single point to replace

The trade-off is stark — what does this buy us, and what does it cost? A leader buys simplicity and a natural ordering of operations; it costs you a failover problem and a potential bottleneck. Most systems decide that trade is worth it, then spend their complexity budget on making failover safe.

Electing the leader

You can’t just let nodes shout “I’m the leader” — partitions and crashes make that chaotic. Real systems delegate election to a consensus protocol (see Consensus: Raft & Paxos). The pattern almost everyone uses in practice:

Run a small, strongly-consistent coordination service — ZooKeeper, etcd, or Consul — itself backed by Raft/ZAB across an odd number of nodes (3 or 5) so it can tolerate failures via majority quorum.
A would-be leader takes a lease: a lock with a time-to-live that it must keep renewing (“heartbeating”). If it stops renewing — because it crashed or got partitioned away — the lease expires and another node can claim leadership.

   etcd / ZooKeeper (Raft quorum: 3 or 5 nodes)
        │   holds the lease key + TTL
        ▼
   Node A renews lease every 2s ──► A is leader
   (A crashes) ──► lease expires after TTL ──► Node B claims it ──► B is leader

The danger: split-brain

The hard case isn’t a clean crash — it’s a node that’s slow or partitioned but still alive. Its lease expires, the cluster elects a new leader, and then the old leader wakes up still believing it’s in charge. Now two leaders are issuing writes. That’s split-brain, and it corrupts data.

The fix: fencing tokens

The robust solution is to make every lock grant a monotonically increasing number — a fencing token — and have the protected resource itself reject anything stale.

A acquires lock → token 33 → (A stalls) ...
B acquires lock → token 34 → B writes with token 34  ✓ storage records "highest seen: 34"
A wakes, writes with token 33 → storage sees 33 < 34 → REJECTED

The storage layer only accepts writes whose token is ≥ the highest it has seen. Now a delayed ex-leader cannot clobber the new one, even if its software still thinks it’s the boss. The coordination service hands out ever-increasing tokens (e.g. ZooKeeper’s zxid/version), and the data store enforces them. This is the difference between a lock that advises and one that guarantees.

What coordination services give you

Beyond leader election, a coordination service is the Swiss-army tool for “we need the cluster to agree on a small, critical fact”:

Use	What it provides
Leader election	Whoever holds the lease key is leader; auto-expiry on failure
Service discovery	A consistent registry of “who is alive and where”
Distributed locks	Mutual exclusion across nodes (with fencing tokens)
Configuration	A single source of truth nodes can watch for changes
Membership	Detecting nodes joining/leaving via ephemeral keys + heartbeats

The thread

What does this buy us, and what does it cost? Leader election buys a single point of decision that makes ordering and ownership trivial; it costs a failover protocol, and — done naïvely — risks split-brain. The deeper lesson is that time and liveness can’t be trusted in a distributed system: you can never be sure a node is dead, only that you haven’t heard from it. Fencing tokens accept that uncertainty and make correctness depend on monotonic order rather than on guessing who’s alive.

Check your understanding

What class of problems does electing a single leader simplify, and what new problem does it create?
Why do real systems lean on a service like etcd/ZooKeeper instead of nodes electing among themselves directly?
Describe split-brain and the exact sequence (including a GC pause) that produces it.
How does a fencing token prevent a stalled ex-leader from corrupting data, and which component must enforce it?
Why should a coordination service be kept off the request hot path?