Timeouts, Retries & Backoff

When one service calls another over a network, three things can happen: a fast success, a fast failure, or (the dangerous one) nothing at all. The call just hangs. The remote process might be dead, overloaded, or healthy but stuck behind a slow disk. From the caller’s side these are indistinguishable. This page exists to answer two questions that sound trivial and aren’t: how long do I wait before giving up, and should I try again?

Get these wrong and you don’t just fail — you build a machine that amplifies failure. This is one of the most common ways a small incident becomes a total outage.

Why you must have a timeout

A call with no timeout waits forever. That seems harmless until you trace the resource chain. Each in-flight request holds a thread (or a connection, a goroutine, or a coroutine slot). If the downstream service hangs and you have no timeout, every request that touches it piles up waiting, and your thread pool drains. Once the pool is empty, your service stops accepting any requests, including ones that never touch the slow dependency. A single slow dependency, with no timeout, takes your whole service down. This is the cascade from the overview, and the timeout is the first and cheapest defense against it.

What does a timeout buy us, and what does it cost? It buys bounded resource use: a guarantee that a stuck call releases its thread within a known time, so the pool can’t be held hostage. The cost is twofold: you risk giving up on work that would have succeeded a moment later, and you have to choose a value that has no perfect answer.
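As a minimal sketch of a bounded call, here is one way to do it in Python with asyncio. The names slow_dependency and call_with_timeout and the durations are illustrative assumptions, not part of any real service; the only point is that the caller regains control after a known time.

```python
import asyncio


async def slow_dependency(delay_s: float) -> str:
    """Hypothetical stand-in for a remote call that takes delay_s seconds."""
    await asyncio.sleep(delay_s)
    return "ok"


async def call_with_timeout(delay_s: float, timeout_s: float) -> str:
    """Wait at most timeout_s for the dependency, then give up."""
    try:
        return await asyncio.wait_for(slow_dependency(delay_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the pending call, so it no longer occupies
        # a slot in the caller. The remote side may still be working.
        return "timed out"
```

Running asyncio.run(call_with_timeout(0.5, 0.01)) gives up after roughly 10 ms instead of waiting the full half second. Note that the timeout frees the caller only; it says nothing about whether the remote work happened.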

Retries: only when it’s safe, only when it helps

A timeout tells you a call didn’t succeed in time. The tempting next move is to try again — many failures are transient (a dropped packet, a node restarting, a brief GC pause), and a retry often just works. But retries carry a landmine.

The idempotency precondition

When a call times out, you do not know whether it actually executed. The request may have reached the server and completed, with only the response lost on the way back. If you retry a non-idempotent operation (“charge this card,” “send this email,” “increment this counter”), you may do it twice. The bank charges the customer twice; the user gets two emails.

So the iron rule is: retry only idempotent operations. An operation is idempotent if doing it twice has the same effect as doing it once. Reads are naturally idempotent. Many writes are not, unless you make them idempotent, typically with an idempotency key the server uses to dedupe. This is so central it has its own treatment in Idempotency; the short version is that retries and idempotency are two halves of one mechanism: you cannot safely have the first without the second.

Caller sends "charge $50"  ──►  Server charges $50  ──►  ✓ done
                                       │
                            response lost in transit ╳
                                       │
Caller times out, RETRIES "charge $50" ──► Server charges $50 AGAIN
                                       = customer charged $100
            (unless the request carried an idempotency key)
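The scenario in the diagram can be sketched in a few lines of Python. Everything here is hypothetical (the PaymentServer class, its charge method, the in-memory dictionary); a real server would persist the keys and expire them, which this sketch ignores.

```python
class PaymentServer:
    """Toy server that dedupes charges by idempotency key (in memory only)."""

    def __init__(self):
        self.charged_total = 0
        self._results_by_key = {}

    def charge(self, amount, idempotency_key=None):
        # A key seen before means this is a retry: replay the stored
        # result instead of charging again.
        if idempotency_key is not None and idempotency_key in self._results_by_key:
            return self._results_by_key[idempotency_key]
        self.charged_total += amount
        result = {"status": "charged", "amount": amount}
        if idempotency_key is not None:
            self._results_by_key[idempotency_key] = result
        return result
```

Calling charge(50) twice without a key leaves charged_total at 100; calling charge(50, idempotency_key="order-123") twice leaves it at 50, because the second call is recognized as a repeat.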

Backoff: don’t retry immediately, and don’t retry in lockstep

Even for safe-to-retry operations, how you retry matters enormously. The naive approach — retry immediately, in a tight loop, a fixed number of times — is precisely how you kick a struggling service while it’s down.

Exponential backoff

Instead of retrying after a fixed delay, double the wait each time: 100 ms, then 200, 400, 800, 1600… This gives a temporarily overloaded dependency room to recover instead of being hammered the instant it shows a pulse. The exponential growth means a few quick retries handle transient blips, while persistent failure quickly backs off to a gentle trickle.
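The schedule above can be computed directly. This is a small sketch; the function name and the cap_ms ceiling are assumptions added for illustration (the text does not mention a cap, but an uncapped doubling grows without limit).

```python
from typing import List


def backoff_delays_ms(base_ms: int = 100, retries: int = 5,
                      cap_ms: int = 10_000) -> List[int]:
    """Delay before each retry: base, 2*base, 4*base, ... limited to cap_ms."""
    return [min(cap_ms, base_ms * 2 ** attempt) for attempt in range(retries)]
```

With the defaults, backoff_delays_ms() returns [100, 200, 400, 800, 1600], matching the sequence in the text.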

Jitter: the part everyone forgets

Backoff alone has a vicious failure mode. Imagine a dependency hiccups and 10,000 callers all time out at the same instant. With pure exponential backoff, all 10,000 retry at exactly 100 ms, then exactly 200 ms — synchronized waves that slam the recovering service in perfect unison and knock it over again. The fix is jitter: add randomness to each delay so the retries spread out over the window instead of arriving as a spike.

NO JITTER (synchronized retries)      WITH JITTER (spread out)
   ┃          ┃          ┃               ▎ ▎▎  ▎ ▎  ▎▎ ▎  ▎ ▎
   ┃          ┃          ┃              ▎  ▎ ▎▎  ▎  ▎ ▎ ▎  ▎ ▎
  spikes hammer the service           load is smeared across time
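One common way to add the randomness is to pick each delay uniformly between zero and the exponential ceiling, a variant often called full jitter. The sketch below assumes that variant; the function name, the cap, and the injectable rng parameter are illustrative choices, and other jitter strategies exist.

```python
import random


def jittered_delay_ms(attempt: int, base_ms: int = 100,
                      cap_ms: int = 10_000, rng=random) -> float:
    """Random delay in [0, min(cap_ms, base_ms * 2**attempt)]."""
    ceiling = min(cap_ms, base_ms * 2 ** attempt)
    # Each caller draws independently, so callers that failed at the
    # same instant no longer retry at the same instant.
    return rng.uniform(0, ceiling)
```

For attempt 2 the ceiling is 400 ms, so a crowd of callers lands anywhere in that window rather than all at exactly 400 ms.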

This same “synchronized herd hits a recovering resource” pathology shows up whenever many clients act in unison — it’s the sibling of the Cache Stampede & Thundering Herd problem, and jitter is the same cure in both places.

Retry storms and the budget that stops them

The deepest danger is the retry storm. Picture a dependency under load, returning errors. Every caller retries. Retries multiply the request rate — if everyone retries up to 3 times, a struggling service that was at 100% capacity now faces up to 4× the traffic at the exact moment it can least handle it. The retries cause the outage they were meant to survive. The service can never recover because the recovery attempt is what’s killing it.

Defenses, in layers:

Cap retries to a small number (often 1–2). Unlimited retries are a footgun.
Retry budgets: allow retries to consume only, say, 10% of total request volume — if the system is already drowning in retries, stop adding more.
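Don’t retry across every layer. If A→B→C and both A and B retry up to 3 times (4 attempts each), a single user request can become 4 × 4 = 16 calls to C; every additional retrying layer multiplies that by 4 again. Retry at one layer, usually the one closest to the failure, not at every hop.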
Combine with circuit breakers (next page) so that once a dependency is clearly down, you stop calling it entirely instead of retrying into the void.
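A retry budget from the list above can be sketched as a pair of counters. This is a deliberately simplified illustration: the RetryBudget class and its method names are hypothetical, and it counts over the lifetime of the object, whereas a production version would more likely use a sliding window or token bucket.

```python
class RetryBudget:
    """Allow retries only while they stay within percent of all requests."""

    def __init__(self, percent: int = 10):
        self.percent = percent
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        """Call once for every first-attempt request."""
        self.requests += 1

    def try_spend_retry(self) -> bool:
        """Return True and count the retry if the budget allows it."""
        # Integer arithmetic avoids floating-point edge cases at the limit.
        if (self.retries + 1) * 100 > self.percent * self.requests:
            return False
        self.retries += 1
        return True
```

After 100 recorded requests with the default 10% budget, the first ten calls to try_spend_retry() succeed and the eleventh is refused, which is the point at which callers should fail fast instead of adding load.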

The synthesis

Timeouts bound how long you suffer; retries recover transient faults; backoff and jitter keep recovery from becoming an attack; budgets and caps keep the whole thing from amplifying. What does this machinery buy us, and what does it cost? It buys resilience against the constant low-grade flakiness of real networks — and it costs you added latency on the unlucky path, the discipline of making operations idempotent, and the ever-present risk that a careless retry policy turns a hiccup into a stampede.
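Putting the pieces together, a capped, jittered retry loop might look like the sketch below. It assumes the wrapped operation is idempotent and already enforces its own timeout; TransientError, call_with_retries, and the default values are hypothetical names and numbers chosen for illustration, and the sleep and rng parameters exist only so the loop can be tested without real waiting.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (for example a timeout)."""


def call_with_retries(operation, max_retries=2, base_s=0.1, cap_s=5.0,
                      sleep=time.sleep, rng=random):
    """Run operation, retrying transient failures with jittered backoff."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # Cap reached: surface the failure to the caller.
            # Exponential ceiling with full jitter, limited to cap_s.
            sleep(rng.uniform(0, min(cap_s, base_s * 2 ** attempt)))
```

Only TransientError triggers a retry; any other exception propagates immediately, which keeps the loop from retrying failures that will never succeed.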

Check your understanding

Why does a call with no timeout endanger the entire calling service, not just that one request?
How should you choose a timeout value, and what goes wrong if it’s far too short or far too long?
Why is it unsafe to retry a non-idempotent operation, and what makes the retry/idempotency pair inseparable?
Exponential backoff without jitter still causes outages. Describe the failure mode and how jitter fixes it.
Explain a retry storm. Give two distinct mechanisms (besides backoff) that prevent retries from amplifying an outage.