Part 3 · Scaling & Performance

Most systems do not “scale” in one heroic move. They scale the way you climb a wall in the dark: you reach forward, hit the next obstruction, and remove that one. Then you reach again. Scaling is not an architecture you choose up front; it is a sequence of bottleneck removals, each one buying headroom until the next limit appears. This part of the book is about seeing those limits clearly and choosing the cheapest move that clears them.

The one idea behind all scaling

Find the single resource that is saturating first, relieve only that resource, and repeat.

A system has many resources — CPU, memory, disk I/O, network, database connections, locks. At any given load, exactly one of them is the binding constraint; everything else has slack. Throwing effort at the resources that aren’t saturated is wasted motion. The art is measurement: knowing which wall you are actually hitting before you start tearing it down.
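As a minimal sketch of that measurement step, assume you already have a utilization reading (fraction of capacity in use) for each resource. The resource names and numbers below are invented for illustration, and the helper `binding_constraint` is hypothetical; real readings would come from whatever monitoring you run.

```python
# Hypothetical utilization readings: fraction of each resource's capacity in use.
# Names and numbers are made up for illustration.
utilization = {
    "cpu": 0.45,
    "memory": 0.60,
    "disk_io": 0.97,
    "network": 0.30,
    "db_connections": 0.72,
}


def binding_constraint(readings):
    """Return (resource, utilization) for the most heavily used resource."""
    resource = max(readings, key=readings.get)
    return resource, readings[resource]


resource, used = binding_constraint(utilization)
print(f"binding constraint: {resource} at {used:.0%}")
# prints "binding constraint: disk_io at 97%"
```

With these made-up numbers, extra CPU or network capacity would change nothing, because disk I/O is the wall. Raw utilization is only a first approximation; queueing and error rates matter too.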

This is the question we will weave through every page: what does this buy us, and what does it cost? Replicas buy read throughput and cost you consistency. Caches buy latency and cost you invalidation headaches. Sharding buys near-unlimited capacity and costs you cross-shard joins and operational pain. There is no free scaling — only trades you make with open eyes.
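To make the "buy" side of one of these trades concrete, here is a rough sketch of what a cache does to average read latency. It assumes a simplified model in which a hit costs only the cache lookup and a miss costs only the origin read; the function name and the 1 ms and 20 ms figures are hypothetical.

```python
def average_latency_ms(hit_ratio, cache_ms, origin_ms):
    """Expected read latency when a fraction hit_ratio of reads is served from cache.

    Simplifying assumption: a miss costs origin_ms only (cache lookup time ignored).
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cache_ms + (1.0 - hit_ratio) * origin_ms


# Hypothetical numbers: 1 ms for a cache read, 20 ms for a database read.
for hit_ratio in (0.0, 0.5, 0.9, 0.99):
    latency = average_latency_ms(hit_ratio, cache_ms=1.0, origin_ms=20.0)
    print(f"hit ratio {hit_ratio:.2f}: {latency:.2f} ms average")
```

The formula captures only what the cache buys. The cost, keeping cached entries correct when the underlying data changes, does not appear in it, which is why it is easy to underestimate.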

Why bottlenecks move

Relieve one constraint and the load simply flows to the next-weakest link. Add web servers and the database becomes the wall. Add read replicas and the single writer becomes the wall. Cache the hot keys and a cache stampede becomes the wall. This whack-a-mole property is not a failure of design; it is the expected shape of scaling. Each page in this part is one mole.

   load ──► [ web tier ]  (fix: add boxes)
                 │
                 ▼
            [ database ]   (fix: replicas, cache, partition, shard)
                 │
                 ▼
            [ hot key / lock / N+1 ]  (fix: measure, then target)
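The chain above can be modelled with a toy pipeline in which every request passes through every stage, so the slowest stage caps the whole system. This is a sketch under stated assumptions: the stage names, the capacities, and the "triple the capacity" fix are all invented for illustration.

```python
def throughput(capacities):
    """Every request crosses every stage, so the slowest stage caps the system."""
    return min(capacities.values())


def bottleneck(capacities):
    """Name of the stage with the lowest capacity."""
    return min(capacities, key=capacities.get)


# Hypothetical capacities in requests per second.
capacities = {"web tier": 500, "database reads": 800, "database writes": 1200}

# Each "fix" triples the capacity of whichever stage is currently the wall.
history = []
for _ in range(3):
    stage = bottleneck(capacities)
    history.append((stage, throughput(capacities)))
    capacities[stage] *= 3

for stage, rps in history:
    print(f"wall: {stage:<16} system throughput: {rps} req/s")

print(f"next wall: {bottleneck(capacities)} at {throughput(capacities)} req/s")
```

In this toy run the wall moves from the web tier to database reads to database writes, and after three fixes it is back at the web tier. Each fix raised total throughput only as far as the next-weakest stage, which is the whack-a-mole behaviour described above.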

The roadmap

This part walks the ladder roughly in the order you will climb it in real life:

Vertical vs Horizontal Scaling: the two fundamental directions, a bigger box or more boxes. Why “up” is simple but hits a ceiling, and why “out” is near-unlimited but expensive in complexity.
Statelessness & Sessions: horizontal scaling works cleanly only when any server can handle any request, which means no session state pinned to a single app server. Where session state goes once you evict it from the app server.
Caching Strategies: the highest-leverage latency move there is, plus the famously hard problem hiding inside it: invalidation.
Read Replicas & CQRS: scaling reads by copying data, the consistency cost of replication lag, and when splitting read and write models pays off.
Database Scaling Patterns: the usual ladder of moves, in order, and why the database is usually the first wall that adding boxes cannot clear.
Finding Performance Bottlenecks: the meta-skill of measuring before optimizing, the USE method (utilization, saturation, errors), and the usual suspects.

If you read only one page, read the last one. Every other technique here is a cure; finding the bottleneck is the diagnosis, and a cure applied to the wrong organ does nothing.

Where this connects

Scaling sits on top of the building blocks (load balancers, caches, queues) and is constrained by the realities of data and distributed systems: you cannot scale faster than consistency, coordination, and the speed of light allow. Keep those constraints in view; they are why scaling is a set of trades and not a set of upgrades.

Check your understanding

Why is it accurate to say that at any given load, exactly one resource is the binding constraint?
What does “bottlenecks move” mean, and why is it the expected shape of scaling rather than a design failure?
Restate the single idea behind all scaling in your own words.
Why is premature scaling dangerous even when the technique itself is sound?
Of the six pages in this part, which is the diagnosis and which are cures — and why does the distinction matter?