Part 4 · Distributed Systems Theory

A single computer is a comfortable place to reason. Its clock is one clock. Its memory is either written or not. When you call a function, it either runs or your whole program crashes with it — there is no third state where the function “might have run, you’ll find out later.” A distributed system erases all of these comforts at once, and that erasure is not a temporary inconvenience to be engineered away. It is the permanent terrain. This page names the hard truths so the rest of Part 4 can build on them honestly.

What makes distribution different (not just bigger)

A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable. — Leslie Lamport

The temptation is to think of many machines as one big machine. They are not. Three things break the illusion, and they break it forever.

The network is not reliable, and you cannot tell why

Between any two machines sits a network that can drop, delay, duplicate, or reorder messages. When you send a request and hear nothing back, you face an irreducible ambiguity:

You sent a request and got no reply. Which happened?

  (a) the request never arrived          → it did NOT run
  (b) it arrived, ran, the reply was lost → it DID run
  (c) it's still in flight, slow          → it might run LATER

From where you stand, (a), (b), and (c) look IDENTICAL.

A timeout is a guess, not a fact. This single ambiguity is the seed of most of distributed systems: it forces retries, which create duplicates, which force idempotency. You design for duplicates because the network leaves you no choice.

There is no global “now”

Each machine has its own clock, and those clocks drift relative to each other — a few parts per million, which sounds tiny until it means tens of milliseconds of disagreement across a fleet. So “which event happened first?” has no universal answer you can simply read off a wall clock. Ordering must be constructed, not observed — the subject of Time, Clocks & Ordering.

Parts fail independently and partially

On one machine, a crash takes everything down together — clean, total, observable. Across machines, one node can die while its neighbours keep serving, a disk can rot while the CPU runs, a link can sever while both endpoints stay healthy. This partial failure is the genuinely new category. The system is never simply “up” or “down”; it is always some live nodes disagreeing with some dead or unreachable ones.

The trade-off that runs through everything

Here is the thread we will pull on every page: what does this buy us, and what does it cost?

Distribution buys you things a single machine can never offer — fault tolerance (survive a node death), scale (more work than one box can do), and low latency (serve users near them). But it demands payment in the only currencies the universe accepts: you cannot have perfect consistency, constant availability, and immunity to network partitions all at once. Strengthening one guarantee almost always weakens another. The art of distributed systems is not eliminating this tension — it is choosing where to spend, deliberately and per workload.

Roadmap: the six pillars ahead

This part builds in dependency order. Each page assumes the ones before it.

  overview (you are here)
      │
      ├─►  CAP & PACELC ............ the unavoidable choice under partition (and otherwise)
      │
      ├─►  Consistency Models ...... the spectrum from linearizable to eventual
      │
      ├─►  Consensus: Raft & Paxos . how independent nodes agree despite failures
      │
      ├─►  Time, Clocks & Ordering . constructing order without a global clock
      │
      ├─►  Idempotency ............. designing for the duplicates the network forces on you
      │
      └─►  Leader Election & Coordination . locks, leases, fencing, split-brain

CAP & PACELC — the precise statement of what you must give up during a partition, and the often-ignored trade-off (latency vs consistency) that applies even when the network is healthy.
Consistency Models — a menu, not a binary. Linearizable, sequential, causal, eventual — and the practical “session” guarantees like read-your-writes that sit between them.
Consensus: Raft & Paxos — how a cluster picks one truth using quorums and leaders, powering etcd, ZooKeeper, and the control plane of nearly every modern platform.
Time, Clocks & Ordering — why wall clocks lie, and how logical clocks and happens-before recover a usable notion of order.
Idempotency — the discipline that turns “the network delivers duplicates” from a bug source into a non-event.
Leader Election & Coordination — distributed locks, leases, fencing tokens, and how to avoid two nodes both believing they’re in charge.

By the end you should be able to look at any distributed design and ask the only questions that matter: What does it assume about the network? Where does it spend its consistency budget? And what happens the moment a partition cuts the cluster in half?

Check your understanding

Why is partial failure a genuinely new category compared to a single-machine crash, rather than just “more of the same”?
When a request times out, why are “it never ran,” “it ran and the reply was lost,” and “it’s still running” indistinguishable from the caller’s perspective — and what does that force you to build?
What is the single warning that all eight fallacies of distributed computing really restate?
Express the core distributed-systems trade-off in the form “this buys us X but costs us Y.”
Why can’t you simply read a wall clock to decide which of two events on different machines happened first?