Part 0 · Foundations — Why Systems Are Hard

A program running on one machine is, by the standards of this textbook, easy. The CPU executes your instructions in order. Memory you wrote a microsecond ago is still there. If a function returns, it succeeded; if the machine is up, your code is up. There is exactly one clock, one copy of the truth, and one thing that can fail — the whole box, all at once, in a way you’ll notice immediately. You can hold the entire system in your head.

System design is the study of what happens when you can no longer do that. The moment your problem outgrows one machine, you are forced to add a network between your components, scale by running many copies, and confront failure not as a rare crash but as a constant background condition. Each of these is individually manageable. Together they produce the difficulty that this entire site exists to explain.

Two functions in the same process talk by passing a pointer — instant, reliable, free. Two services on two machines talk by sending bytes across a wire that can be slow, can reorder, can duplicate, and can simply drop your message and never tell you. The call that looked like balance = account.read() is now a journey across switches and cables that might take a millisecond or might take forever. A whole page of this part — The Fallacies of Distributed Computing — is devoted to the comforting lies engineers tell themselves about that wire.
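The behaviour of that wire can be sketched as a toy simulation. This is only an illustration, not a model of any real network: the function name `unreliable_send` and the probabilities `drop_p` and `dup_p` are made up for this example.

```python
import random


def unreliable_send(messages, rng, drop_p=0.2, dup_p=0.1):
    """Simulate a wire that may drop, duplicate, and reorder messages.

    drop_p and dup_p are illustrative probabilities, not measured values.
    """
    delivered = []
    for msg in messages:
        if rng.random() < drop_p:
            continue               # silently dropped: the sender is never told
        delivered.append(msg)
        if rng.random() < dup_p:
            delivered.append(msg)  # the same message arrives twice
    rng.shuffle(delivered)         # arrival order is not send order
    return delivered


sent = list(range(10))
received = unreliable_send(sent, random.Random(42))
print("sent:    ", sent)
print("received:", received)
print("lost:    ", sorted(set(sent) - set(received)))
```

Nothing in `received` tells the sender which messages were lost. In a real system that knowledge has to come from somewhere else, such as acknowledgements and timeouts.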

One machine has a ceiling: so many requests per second, so many gigabytes of RAM, so much disk. When demand exceeds the ceiling you must spread the load across many machines — and now you have copies. Copies of your data that can disagree. Copies of your code that can be at different versions. Work that has to be split so each machine does a fair share, and combined so the user sees one coherent answer. Scale is what turns “store the value” into “store the value, in several places, and keep them in sync.”
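A minimal sketch of how copies come to disagree, assuming a naive scheme in which a write is applied to every replica that happens to be reachable. The `Replica` class and `write_to_replicas` helper are hypothetical names for this illustration, not a real replication protocol.

```python
class Replica:
    """One copy of the data, living on one (imaginary) machine."""

    def __init__(self, name):
        self.name = name
        self.data = {}


def write_to_replicas(replicas, key, value, unreachable=()):
    """Apply a write to every replica we can reach; return the names that missed it."""
    missed = []
    for replica in replicas:
        if replica.name in unreachable:
            missed.append(replica.name)   # this copy never hears about the write
        else:
            replica.data[key] = value
    return missed


replicas = [Replica("a"), Replica("b"), Replica("c")]
write_to_replicas(replicas, "value", 1)                       # everyone gets it
missed = write_to_replicas(replicas, "value", 2, unreachable={"c"})

answers = {r.name: r.data["value"] for r in replicas}
print(missed, answers)  # prints: ['c'] {'a': 2, 'b': 2, 'c': 1}
```

A reader who happens to ask replica `c` now gets a different answer from one who asks `a` or `b`. On a single machine that situation cannot arise.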

On one machine, failure is binary and loud. Across many machines, something is always broken. A disk dies, a process is paused for garbage collection, a network link flaps, a deploy goes sideways on 3 of your 300 servers. At scale, rare events become routine: if a component fails once every three years and you run a thousand of them, you’ll see a failure roughly every day. The system must keep working while parts of it are dead — and it must do so without a human in the loop.
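The arithmetic behind "roughly every day" can be checked in a few lines. This is a back-of-the-envelope sketch that assumes components fail independently at a constant average rate; the function name is made up for this example.

```python
DAYS_PER_YEAR = 365


def fleet_failures_per_day(component_count, years_between_failures):
    """Expected failures per day across a fleet.

    Assumes independent components with a constant average failure rate.
    """
    per_component_per_day = 1 / (years_between_failures * DAYS_PER_YEAR)
    return component_count * per_component_per_day


# Figures from the paragraph above: one failure per component every
# three years, and a thousand components.
rate = fleet_failures_per_day(1000, 3)
print(f"expected failures per day:      {rate:.2f}")
print(f"expected days between failures: {1 / rate:.2f}")
```

With these inputs the fleet sees a little under one failure per day. Ten thousand components would see about nine per day.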

ONE MACHINE                     MANY MACHINES + NETWORK
-----------                     -----------------------
one clock                       no shared clock
one copy of truth               copies that can disagree
failure = the box is down       failure = always, somewhere
calls are instant & reliable    calls are slow & may vanish
fits in your head               fits in nobody's head

The one thread that runs through everything

If you remember a single habit from this textbook, make it this question, asked of every design decision you ever encounter:

What does this buy us, and what does it cost?

There are no free wins in system design. Adding a cache buys you speed and costs you the risk of serving stale data. Adding a replica buys you durability and availability and costs you the problem of keeping replicas consistent. Choosing strong consistency buys you simple reasoning and costs you latency and, sometimes, availability during a partition. The engineers who look like wizards are not the ones who memorized the “right” answer — they are the ones who reflexively name both sides of the ledger and then choose on purpose for their actual constraints.
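The cache example can be made concrete with a small sketch. It assumes the simplest possible design, a read-through cache that never invalidates, chosen because it makes both sides of the ledger visible. `Database` and `CachedReader` are hypothetical classes for this illustration.

```python
class Database:
    """Stand-in for a slow source of truth; counts reads so the saving is visible."""

    def __init__(self):
        self._rows = {}
        self.reads = 0

    def write(self, key, value):
        self._rows[key] = value

    def read(self, key):
        self.reads += 1
        return self._rows[key]


class CachedReader:
    """Read-through cache with no invalidation (deliberately naive)."""

    def __init__(self, db):
        self._db = db
        self._cache = {}

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._db.read(key)
        return self._cache[key]


db = Database()
reader = CachedReader(db)

db.write("balance", 100)
first = reader.get("balance")     # miss: goes to the database
second = reader.get("balance")    # hit: served from memory
reads_after_two_gets = db.reads   # what it buys: only one database read

db.write("balance", 50)           # the truth changes behind the cache
stale = reader.get("balance")     # what it costs: the old value is still served

print(first, second, reads_after_two_gets, stale)  # prints: 100 100 1 100
```

Real caches reduce the cost with expiry times or explicit invalidation, and each of those is a further trade with its own price.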

This part builds the vocabulary and the instincts everything else relies on. Read it in order; each page assumes the last.

  1. The Fallacies of Distributed Computing — the eight false assumptions about networks that cause real outages. This page makes "why systems are hard" concrete.
  2. Latency, Throughput & the Numbers to Know — the difference between “how fast is one thing” and “how many things per second,” plus the order-of-magnitude latency numbers every engineer carries in their head.
  3. Back-of-the-Envelope Estimation — how to size a system in your head before building it: QPS, storage, bandwidth, and the round numbers that make the arithmetic fast.
  4. Availability, SLAs & the Nines — what “99.99%” actually promises, and the math of how availability multiplies in series and improves in parallel.
  5. The CAP Theorem (Intuition) — the single most-quoted and most-misunderstood result in distributed systems, explained as a forced choice during a network partition.

By the end you’ll have the mental ruler the rest of the textbook measures everything against. Then we go deeper — into distributed systems, reliability, and the building blocks of real architectures.

  1. Name the three forces that turn an “easy” single-machine program into a hard distributed system, and give one concrete difficulty each introduces.
  2. On one machine, failure is binary and loud. Why is failure continuous at scale, even if each individual component is highly reliable?
  3. State the recurring thread of this textbook in one sentence. Why is “it depends” a sign of seniority rather than indecision?
  4. Why does adding copies of your data (for scale or durability) create a brand-new problem you didn’t have on one machine?
  5. Pick any technique you already know (a cache, a load balancer, a backup). What does it buy, and what does it cost?