Latency, Throughput & the Numbers to Know

Two of the most important words in system design are routinely confused, and the confusion produces bad designs. Latency is how long one operation takes. Throughput is how many operations complete per unit of time. They sound like the same thing measured two ways, but they can move independently: you can improve one without changing the other, and the gap between them is where a lot of engineering lives.

Latency vs throughput: a highway

The cleanest intuition is a road.

   LATENCY      = how long it takes one car to drive end to end
   THROUGHPUT   = how many cars pass the exit per minute

A single sports car has low latency (it arrives quickly) but a one-lane road has low throughput (few cars per minute). A hundred-lane highway full of slow trucks has high throughput and high latency at once. You can improve throughput by adding lanes (parallelism) without making any single trip faster — and you can lower latency for one trip without moving more total traffic.
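The highway arithmetic can be sketched in a few lines. This is a rough illustration, not a traffic model: it assumes each lane carries one vehicle at a time, so throughput is simply concurrency divided by latency (Little's law rearranged). The function name and the trip times are hypothetical, chosen only to mirror the sports car and the truck highway.

```python
def throughput_per_minute(lanes: int, trip_minutes: float) -> float:
    """Steady-state cars per minute, assuming one car in flight per lane.

    Little's law rearranged: throughput = concurrency / latency.
    """
    return lanes / trip_minutes


# Hypothetical figures, chosen only to illustrate the analogy.
sports_car = throughput_per_minute(lanes=1, trip_minutes=2.0)         # fast trip, one lane
truck_highway = throughput_per_minute(lanes=100, trip_minutes=10.0)   # slow trip, many lanes

print(f"sports car road: latency  2 min, throughput {sports_car:.1f} cars/min")
print(f"truck highway:   latency 10 min, throughput {truck_highway:.1f} cars/min")
```

Under these assumed numbers the truck highway has five times the latency and twenty times the throughput, so neither figure predicts the other.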

This is why “make it faster” is an ambiguous request:

A user staring at a spinner cares about latency — their request.
A platform team paying for servers cares about throughput — total work per dollar.
A trading system cares about tail latency: the slowest requests (those at or beyond the 99th percentile), because the slow ones are the ones that lose money or time out.
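To make the tail concrete, here is a minimal sketch comparing the mean, the median (p50), and the 99th percentile (p99) of a latency sample. The sample is invented for illustration, and the nearest-rank percentile used here is one common definition among several; monitoring tools may interpolate differently.

```python
import math
import statistics


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 98 fast requests at 10 ms and 2 slow ones at 2000 ms.
latencies_ms = [10.0] * 98 + [2000.0] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean = {mean_ms:.1f} ms, p50 = {p50_ms:.0f} ms, p99 = {p99_ms:.0f} ms")
```

In this made-up sample the mean (about 50 ms) describes no request that actually happened, the median hides the slow requests entirely, and only the p99 exposes the 2-second outliers.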

The numbers every engineer should know

The most useful mental model in all of system design is a rough sense of how long things take, in orders of magnitude. You don’t need precision — you need to know that reading from RAM and reading across an ocean differ by a factor of a million. These figures are approximate and vary by hardware and year; the ratios are the point, not the digits.

Operation	Approx. time	Relative to L1 cache
L1 CPU cache reference	~1 ns	1×
L2 cache reference	~4 ns	a few ×
Main memory (RAM) reference	~100 ns	~100×
Read 1 MB sequentially from RAM	~10 µs	~10,000×
SSD random read	~100 µs	~100,000×
Round trip within a datacenter	~0.5 ms	~500,000×
Read 1 MB sequentially from SSD	~1 ms	~1,000,000×
Disk (HDD) seek	~10 ms	~10,000,000×
Round trip across regions/continent	~50–150 ms	~100,000,000×

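Read that table as a ladder of gaps. The rungs are uneven, from about 5× (SSD random read to datacenter round trip) to about 1,000× (RAM to SSD random read), but together they span roughly eight orders of magnitude: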

   cache (ns) → RAM (~100 ns) → SSD (~100 µs) → DC network (~0.5 ms)
        → disk seek (~10 ms) → cross-region (~tens-hundreds of ms)
   |---------------------- ~8 orders of magnitude ----------------------|
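The size of each rung can be checked directly from the table's figures. The sketch below is only arithmetic on those approximate numbers; the 100 ms used for the cross-region round trip is an assumed midpoint of the 50–150 ms range, and the names are illustrative.

```python
import math

# Approximate figures from the table, in nanoseconds.
# Cross-region uses 100 ms, an assumed midpoint of the 50-150 ms range.
LADDER_NS = [
    ("L1 cache", 1),
    ("RAM", 100),
    ("SSD random read", 100_000),
    ("datacenter round trip", 500_000),
    ("HDD seek", 10_000_000),
    ("cross-region round trip", 100_000_000),
]


def step_ratios(ladder):
    """Return (from_name, to_name, ratio) for each pair of adjacent rungs."""
    return [
        (a_name, b_name, b_ns / a_ns)
        for (a_name, a_ns), (b_name, b_ns) in zip(ladder, ladder[1:])
    ]


for a_name, b_name, ratio in step_ratios(LADDER_NS):
    print(f"{a_name} -> {b_name}: ~{ratio:,.0f}x")

total_orders = math.log10(LADDER_NS[-1][1] / LADDER_NS[0][1])
print(f"end to end: ~{total_orders:.0f} orders of magnitude")
```

With these figures the individual steps vary widely in size, while the end-to-end span comes out at about eight orders of magnitude.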

Why memory-vs-network is the dominant design tension

Look again at the ladder and you’ll see the central drama of system design. The fast operations (cache, RAM) live inside one machine. The slow ones (network, especially cross-region) are exactly what you incur the moment you distribute. So nearly every architecture is a negotiation between two opposing pulls:

Keep data close (in memory, on the same machine) and it’s blindingly fast — but one machine has limited RAM, and if it dies, the data dies with it.
Spread data across machines (over the network) and you get capacity and durability, but every access now pays the network tax: a round trip is roughly 5,000× (same datacenter) to 1,000,000× (cross-region) slower than a main-memory read.

   FAST but LIMITED & FRAGILE          SLOW but SCALABLE & DURABLE
   ------------------------            ---------------------------
   CPU cache, RAM, local SSD   <-----> other machines over the network
   (one box; dies with the box)        (many boxes; survives a box dying)
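To put a rough number on the network tax, the sketch below totals the cost of touching many items one at a time, first from local memory and then over a same-datacenter network, using the table's approximate figures. The item count is hypothetical, and the batched case deliberately ignores payload transfer and server-side work, so treat it as a lower bound rather than a prediction.

```python
RAM_READ_NS = 100              # main-memory reference, from the table
DC_ROUND_TRIP_NS = 500_000     # same-datacenter round trip, from the table


def total_ms(accesses: int, cost_ns: int) -> float:
    """Total time in milliseconds for `accesses` sequential operations."""
    return accesses * cost_ns / 1_000_000


items = 10_000  # hypothetical number of items to look up

in_memory_ms = total_ms(items, RAM_READ_NS)             # every lookup is local
per_item_calls_ms = total_ms(items, DC_ROUND_TRIP_NS)   # one network call per item
one_batched_call_ms = total_ms(1, DC_ROUND_TRIP_NS)     # one round trip; transfer time ignored

print(f"in memory:       {in_memory_ms:,.1f} ms")
print(f"per-item calls:  {per_item_calls_ms:,.1f} ms")
print(f"one batched call: {one_batched_call_ms:,.1f} ms (lower bound)")
```

Under these assumptions the same 10,000 lookups take about a millisecond in memory and about five seconds as sequential network calls, which is why batching or keeping data local tends to matter so much on a hot path.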

Caching, replication, sharding, CDNs, data locality, and “keep the working set in memory” are all moves in this one negotiation. A cache is literally a bet: pay the slow network/disk cost once, then serve from fast memory many times. That bet pays off only if you read far more often than the data changes — which is exactly the trade-off thread again.
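The cache bet can be written as a weighted average. This is a minimal sketch under simplifying assumptions: a hit costs one RAM reference, a miss costs one same-datacenter round trip (taken to include the failed cache lookup), and the hit ratio stands in for "how often reads find fresh data". The function name and the ratios tried are illustrative.

```python
def expected_latency_us(hit_ratio: float, hit_us: float, miss_us: float) -> float:
    """Average read latency for a cache in front of a slower store.

    A hit costs hit_us; a miss costs miss_us, which is assumed to already
    include the failed cache lookup and the trip to the slow store.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * hit_us + (1.0 - hit_ratio) * miss_us


MEMORY_HIT_US = 0.1       # ~100 ns RAM reference, from the table
NETWORK_MISS_US = 500.0   # ~0.5 ms datacenter round trip, from the table

for ratio in (0.0, 0.5, 0.9, 0.99):
    avg = expected_latency_us(ratio, MEMORY_HIT_US, NETWORK_MISS_US)
    print(f"hit ratio {ratio:>4.0%}: ~{avg:,.1f} µs average read")
```

Even at a 99% hit ratio the average is about 5 µs, and nearly all of it comes from the 1% of misses. Frequent invalidation lowers the hit ratio, so data that changes often erodes the bet quickly.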

The thread

What do these numbers buy us, and what do they cost? Carrying the latency ladder in your head buys you the ability to reject bad designs in seconds — to know, before writing a line, that a per-item network call in a loop or a cross-region read on the hot path will be too slow. The cost is intellectual humility: the numbers are approximate, hardware moves, and you must still measure the real system rather than trust the table. The table tells you where to look; the profiler tells you what’s true. Next we turn these intuitions into deliberate sizing in Back-of-the-Envelope Estimation.

Check your understanding

Define latency and throughput, and give an example where one is good while the other is bad.
Why is an average latency misleading? What does p99 tell you that p50 hides, and why does the tail matter more as a system fans out to many dependencies?
Roughly how many times slower than a main-memory read is a same-datacenter network round trip? A cross-region round trip?
Explain the “memory vs network” tension. Why is essentially every caching/replication/sharding technique a move within this single tension?
Using the table, argue in one sentence why making one network call per item inside a loop is a design smell.