Part 7 · Observability & Operations

A system you cannot see into is a system you cannot operate. The previous parts taught you to build things — caches, queues, replicas, retries. This part is about the moment after you ship, when the thing is live and something is slow, or wrong, or about to fall over, and your only window into the running machine is the signals it chose to emit. Observability is the discipline of making a system explain itself.

The word is borrowed from control theory: a system is observable if you can infer its internal state from its external outputs. For software that means: when a user reports “checkout is slow,” can you find out why — which service, which dependency, which line of code, for which users — without redeploying to add print statements? If the answer requires a code change, you have a monitoring system, not an observable one. The difference is whether you can ask questions you didn’t anticipate when you wrote the code.

Monitoring vs observability

These words get used interchangeably, but the distinction is worth keeping:

Monitoring watches for the failures you predicted. You define a dashboard and an alert in advance: “CPU over 90%,” “error rate over 1%.” It answers known questions.
Observability lets you investigate failures you didn’t predict. It answers unknown questions — the “why is this one customer in Sydney seeing 4-second loads on Tuesdays” kind that no pre-built dashboard anticipated.

You need both. Monitoring catches the routine; observability is what you reach for at 3 a.m. when the routine dashboards are all green and the system is still broken.
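To make the distinction concrete, here is a minimal sketch in Python. It assumes your request events are available as dictionaries; the field names (`customer`, `region`, `weekday`, `latency_ms`, `error`) and the sample values are invented for illustration. The first function is monitoring: a question fixed in advance. The other two are a toy version of observability: the grouping fields are chosen at query time, so you can ask about a slice nobody planned for.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical request events; field names and values are illustrative only.
EVENTS = [
    {"customer": "acme", "region": "syd", "weekday": "Tue", "latency_ms": 4100, "error": False},
    {"customer": "acme", "region": "syd", "weekday": "Tue", "latency_ms": 3900, "error": False},
    {"customer": "acme", "region": "syd", "weekday": "Wed", "latency_ms": 180, "error": False},
    {"customer": "globex", "region": "fra", "weekday": "Tue", "latency_ms": 120, "error": False},
    {"customer": "globex", "region": "fra", "weekday": "Tue", "latency_ms": 140, "error": False},
    {"customer": "initech", "region": "iad", "weekday": "Tue", "latency_ms": 160, "error": False},
]


def error_rate_alert(events, threshold=0.01):
    """Monitoring: a predefined question ("is the error rate over 1%?")."""
    errors = sum(1 for e in events if e["error"])
    return errors / len(events) > threshold


def mean_latency_by(events, *fields):
    """Observability: group by whatever fields the investigator picks now."""
    groups = defaultdict(list)
    for e in events:
        groups[tuple(e[f] for f in fields)].append(e["latency_ms"])
    return {key: mean(values) for key, values in groups.items()}


def slowest_slice(events, *fields):
    """Return the (group key, mean latency) pair with the highest mean."""
    return max(mean_latency_by(events, *fields).items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    print("alert firing:", error_rate_alert(EVENTS))
    print("slowest slice:", slowest_slice(EVENTS, "customer", "region", "weekday"))
```

On this sample data the alert stays quiet because nothing is erroring, while the ad-hoc query singles out one customer in one region on one weekday. The query is only possible because each event kept its detail instead of being pre-aggregated.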

The three pillars

Observability is conventionally built on three kinds of telemetry. They overlap, but each answers a different shape of question:

   LOGS                METRICS              TRACES
   "what happened"     "how much / how      "where did the time go,
   discrete events     often / how fast"    across services"
   high detail         aggregated numbers   one request's full path
   high volume/cost    cheap, summarized    sampled, structured

   debugging a         dashboards &         finding latency &
   specific event      alerting             failure across hops

Logs are timestamped records of discrete events — “user 42 logged in,” “payment failed: card declined.” They are the highest-fidelity signal and the most expensive to store. You read them when you already know roughly where to look.
Metrics are numbers aggregated over time — request rate, error count, p99 latency. They are cheap because they throw away detail, which makes them perfect for dashboards and alerts but useless for explaining a single request.
Distributed traces follow one request as it fans out across many services, stitching the hops together so you can see where the time went. They are the answer to “the whole thing is slow but no single service looks slow.”
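As a rough illustration of how the three signals differ, the sketch below emits all three for a single request using only the Python standard library. Everything here is a stand-in: `checkout`, the metric names, and the in-memory `METRICS`, `SPANS`, and `LOG_LINES` stores are hypothetical, and a real system would hand this work to a logging pipeline, a metrics backend, and a tracing library.

```python
import json
import time
import uuid
from collections import Counter

METRICS = Counter()  # metrics: aggregated numbers, no per-request detail
SPANS = []           # traces: timed steps that share one trace_id
LOG_LINES = []       # logs: one detailed record per discrete event


def log_event(trace_id, message, **fields):
    """Append one structured (JSON) log line carrying the trace_id."""
    record = {"ts": time.time(), "trace_id": trace_id, "msg": message, **fields}
    LOG_LINES.append(json.dumps(record))


class Span:
    """Context manager that records how long a named step took."""

    def __init__(self, trace_id, name):
        self.trace_id = trace_id
        self.name = name

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        SPANS.append({
            "trace_id": self.trace_id,
            "name": self.name,
            "duration_s": time.perf_counter() - self.start,
        })
        return False  # never swallow exceptions


def checkout(user_id, card_ok):
    """Hypothetical handler that emits a metric, spans, and a log line."""
    trace_id = uuid.uuid4().hex
    METRICS["checkout_requests_total"] += 1
    with Span(trace_id, "checkout"):
        with Span(trace_id, "charge_card"):
            if not card_ok:
                METRICS["checkout_errors_total"] += 1
                log_event(trace_id, "payment failed: card declined", user_id=user_id)
                return False
        log_event(trace_id, "checkout complete", user_id=user_id)
    return True
```

After a few calls, `METRICS` can tell you how many checkouts failed but not which ones. `LOG_LINES` can tell you exactly what happened to user 42. `SPANS`, filtered by one `trace_id`, can tell you which step took the time. The shared `trace_id` is what lets you move from one signal to the next.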

The recurring question of this book applies to telemetry itself: what does it buy us, and what does it cost? Every signal you emit buys you future insight and charges you in storage, network, and engineer attention. The art of observability is emitting the right signals — enough to answer tomorrow’s question, not so much that the answer drowns in noise (and the bill drowns the budget).

Operations: closing the loop

Seeing is only half of it. The second half of this part is what you do with what you see:

Alerting & On-Call turns telemetry into a human waking up — but only for the right reasons. The central skill is alerting on symptoms users feel rather than causes engineers fear, so the pager fires when it matters and stays silent when it doesn’t.
Deployment Strategies is how you change a running system without breaking it: rolling, blue-green, and canary releases, feature flags, and — above all — fast rollback. Observability and deployment are a feedback loop: you deploy a small change, watch the signals, and roll back the instant they turn red.
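The deploy, watch, roll back loop can be sketched as a single decision function. This is one possible shape, not a prescribed policy: the `Window` type, the `canary_decision` name, and the thresholds (`min_requests`, `max_ratio`, `floor`) are all assumptions made for the example. The signal it watches is the error rate, a symptom users feel, compared between the current version and the canary.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request and error counts for one version over a fixed time window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(baseline: Window, canary: Window,
                    min_requests: int = 100,
                    max_ratio: float = 2.0,
                    floor: float = 0.001) -> str:
    """Return "wait", "promote", or "rollback" for a canary release.

    All thresholds are illustrative defaults, not recommendations.
    """
    if canary.requests < min_requests:
        return "wait"  # too little traffic to judge either way
    # Tolerate up to max_ratio times the baseline error rate, with a small
    # floor so a baseline of zero errors does not make any error fatal.
    allowed = max(baseline.error_rate * max_ratio, floor)
    return "rollback" if canary.error_rate > allowed else "promote"


if __name__ == "__main__":
    baseline = Window(requests=10_000, errors=50)
    print(canary_decision(baseline, Window(requests=500, errors=2)))
    print(canary_decision(baseline, Window(requests=500, errors=20)))
```

Comparing against the live baseline rather than a fixed number means the canary is judged relative to how the system is behaving right now, which keeps an unrelated incident from being blamed on the new release.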

Roadmap

Read in order — each page builds on the last:

Logging — structured events, levels, centralization, and the correlation IDs that tie one request’s logs together.
Metrics — counters, gauges, histograms, the RED and USE methods, and the cardinality trap that blows up your bill.
Distributed Tracing — spans, context propagation, sampling, and finding latency across service hops.
Alerting & On-Call — symptom-based alerts, error budgets, alert fatigue, and runbooks.
Deployment Strategies — rolling, blue-green, canary, feature flags, and the rollback that saves you.

Check your understanding

In control-theory terms, what does it mean for a system to be observable, and how does that map to debugging production software?
Distinguish monitoring from observability. Which one do you reach for when every dashboard is green but the system is still broken?
Name the three pillars and the distinct shape of question each one answers best.
Apply “what does it buy us, and what does it cost?” to telemetry in general. What is the cost of emitting more signals?
Describe the operational feedback loop this part is organized around.