Metrics

A log tells you about one event. A metric tells you about all of them at once: “we served 12,000 requests this minute, 0.3% of them failed, and the slowest 1% took over 800 ms.” Metrics are the cheap pillar of observability — they get cheap precisely by throwing detail away and keeping only numbers aggregated over time. That trade is what makes them perfect for dashboards and alerts and useless for explaining any single request. This page builds the metric types from scratch, then gives you two ready-made frameworks for deciding which metrics to collect.

Why not just count logs?

You could answer “how many requests failed?” by querying logs. But aggregating billions of log lines on every dashboard refresh is slow and expensive. A metric pre-aggregates: instead of storing a line per event, you store a running number per time window. The system increments a counter in memory and periodically writes one compact data point, such as “errors at 14:03 = 37.” A dashboard that reads it is fetching a handful of numbers, not scanning a haystack.
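The pre-aggregation idea can be sketched in a few lines. This is a minimal illustration, not how any particular metrics library works: the class name `WindowedCounter`, the one-minute window, and the metric name `errors` are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timezone


class WindowedCounter:
    """Keeps one running total per (metric name, minute) instead of one record per event."""

    def __init__(self):
        self._totals = Counter()

    def increment(self, name, ts, amount=1):
        # Truncating the timestamp to its minute defines the time window.
        window = ts.replace(second=0, microsecond=0)
        self._totals[(name, window)] += amount

    def data_points(self):
        # What would be written to storage: one compact point per window.
        return [
            (name, window.strftime("%H:%M"), total)
            for (name, window), total in sorted(self._totals.items())
        ]


errors = WindowedCounter()
base = datetime(2024, 1, 1, 14, 3, tzinfo=timezone.utc)
for second in range(37):  # 37 error events inside the same minute
    errors.increment("errors", base.replace(second=second))

points = errors.data_points()
print(points)  # [('errors', '14:03', 37)]
```

Thirty-seven events collapse into a single stored point. The detail of each individual event is gone, which is exactly the trade described above.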

What does this buy us? Cheap storage and instant queries over long time ranges. What does it cost? You lose the detail — a metric can tell you the error rate spiked, never which user or why. That’s why metrics point you at a problem and logs and traces explain it. Metrics are the smoke alarm; the other pillars are the investigation.

The three metric types

Almost every metric is one of three shapes:

   COUNTER          GAUGE              HISTOGRAM
   only goes up     up and down        distribution of values
   (or resets)      a snapshot         bucketed
   ───────────      ──────────         ──────────────
   total requests   current memory     request latency
   total errors     queue depth        response sizes
   bytes sent       active connections  → p50, p90, p99

Counter: a monotonically increasing total, such as requests served, errors, or bytes sent. The raw value is rarely useful on its own; you read its rate of change (“requests per second”). Counters survive process restarts cleanly because a query only needs the deltas between samples, and a sudden drop in value is treated as a reset to zero rather than a negative rate.
Gauge: a value that goes up and down, sampled at a point in time, such as memory in use, queue depth, active connections, or temperature. A gauge is a snapshot, not a total.
Histogram: the most important and most subtle. It counts observations into buckets to capture a distribution, so you can ask for percentiles. This is the practical way to see tail latency in metrics, with the caveat that a percentile read from buckets is an estimate whose precision depends on where the bucket boundaries sit. An average latency of 100 ms can hide a p99 of 3 seconds, and the average is misleading precisely for the users who are suffering. Never alert on averages; alert on percentiles.
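To make the histogram point concrete, here is a small sketch of a bucketed histogram and a percentile lookup. The `Histogram` class, its bucket boundaries, and the sample latencies are assumptions chosen for illustration. Real systems usually interpolate inside a bucket, whereas this version simply reports the upper bound of the bucket that contains the requested percentile.

```python
import bisect


class Histogram:
    def __init__(self, bounds):
        self.bounds = sorted(bounds)                # bucket upper bounds, in ms
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the overflow bucket
        self.total = 0
        self.sum = 0.0

    def observe(self, value):
        # Index of the first bucket whose upper bound is >= value.
        idx = bisect.bisect_left(self.bounds, value)
        self.counts[idx] += 1
        self.total += 1
        self.sum += value

    def mean(self):
        return self.sum / self.total

    def quantile_upper_bound(self, q):
        """Upper bound of the bucket that contains the q-th quantile."""
        rank = q * self.total
        seen = 0
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            seen += count
            if seen >= rank:
                return bound
        return float("inf")


latency = Histogram([50, 100, 250, 500, 1000, 3000, 10000])
for _ in range(980):
    latency.observe(40)    # most requests are fast
for _ in range(20):
    latency.observe(3000)  # 2% of requests are very slow

mean_ms = latency.mean()
p50_ms = latency.quantile_upper_bound(0.50)
p99_ms = latency.quantile_upper_bound(0.99)
print(mean_ms, p50_ms, p99_ms)  # 99.2 50 3000
```

The mean sits just under 100 ms and looks healthy, while the p99 bucket is 3000 ms. Only the bucketed view exposes the slow tail.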

The RED method: instrument every service

Knowing the metric types doesn’t tell you which metrics to collect. The RED method gives a dead-simple default for any request-driven service. For every service, track three things:

   R — Rate       requests per second
   E — Errors     failed requests per second (or error %)
   D — Duration   latency distribution (p50/p90/p99 via histogram)

These three answer “is this service healthy from the caller’s point of view?” — which is exactly the symptom you want to alert on (more in Alerting & On-Call). RED is request-centric: it describes the experience of whoever is calling the service.
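Rate and Errors both come from counters, so they can be derived from periodic samples of two running totals. The sketch below assumes samples arrive as `(timestamp_seconds, value)` pairs and that a drop in value means the process restarted from zero; the function names `increase` and `rate` and the sample data are illustrative. Duration is left out because it needs a histogram rather than a counter.

```python
def increase(samples):
    """Total increase of a counter across samples, tolerating resets.

    samples: list of (timestamp_seconds, value) pairs in time order.
    A drop in value is treated as a restart from zero, so the whole
    post-reset value counts as new increase.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def rate(samples):
    """Average per-second rate over the sampled interval."""
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes every 15 s; the service restarts between t=30 and t=45.
requests_total = [(0, 1000), (15, 1300), (30, 1600), (45, 200), (60, 500)]
errors_total = [(0, 10), (15, 13), (30, 16), (45, 2), (60, 5)]

request_rate = rate(requests_total)                               # R
error_ratio = increase(errors_total) / increase(requests_total)   # E

print(round(request_rate, 2))  # 18.33
print(round(error_ratio, 4))   # 0.01
```

The restart does not produce a negative rate: the 1100 requests and 11 errors observed across the minute are still counted, which gives about 18.3 requests per second and a 1% error ratio.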

The USE method: instrument every resource

RED watches the work; the USE method watches the machine doing the work. For every resource — CPU, memory, disk, network, a connection pool — track:

   U — Utilization   % of time the resource was busy
   S — Saturation    how much work is queued/waiting (the backlog)
   E — Errors        error events from the resource

USE is resource-centric and catches a different failure: a service whose RED metrics still look okay but whose database connection pool is 100% utilized with a growing wait queue — saturation climbing before duration visibly degrades. Saturation is often the leading indicator; it goes red before users feel it. The two methods are complementary: RED tells you users are hurting, USE often tells you why, and often sooner.
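The connection-pool scenario can be sketched with a few hypothetical samples. `PoolSample` and `use_summary` are names invented for this example, and the numbers are made up to show the pattern: utilization is pinned at its ceiling, so only the wait queue reveals that things are getting worse.

```python
from dataclasses import dataclass


@dataclass
class PoolSample:
    in_use: int        # connections currently checked out
    size: int          # pool capacity
    waiting: int       # callers queued for a connection
    errors_total: int  # cumulative checkout failures (a counter)


def use_summary(samples):
    # Utilization: fraction of capacity in use, averaged over the samples.
    utilization = sum(s.in_use for s in samples) / sum(s.size for s in samples)
    # Saturation: the backlog at each sample, kept as a series to show the trend.
    saturation = [s.waiting for s in samples]
    # Errors: increase of the error counter over the interval.
    errors = samples[-1].errors_total - samples[0].errors_total
    return utilization, saturation, errors


samples = [
    PoolSample(in_use=10, size=10, waiting=0, errors_total=0),
    PoolSample(in_use=10, size=10, waiting=2, errors_total=0),
    PoolSample(in_use=10, size=10, waiting=5, errors_total=0),
    PoolSample(in_use=10, size=10, waiting=9, errors_total=0),
]

utilization, saturation, errors = use_summary(samples)
print(utilization)  # 1.0
print(saturation)   # [0, 2, 5, 9]
print(errors)       # 0
```

Utilization cannot rise above 100% and errors are still zero, yet the backlog grows at every sample. That growing queue is the early warning.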

Method	Watches	Best at
RED	request-driven services	“are callers being served well?” (symptoms)
USE	resources/hardware	“is the underlying machine choking?” (causes, early)

The cost: cardinality

Metrics are cheap — until you add labels. A label is a dimension you attach to a metric so you can slice it: http_requests_total{service="checkout", status="500"}. Labels are powerful: they let you break “total requests” down by service, region, status code. But each unique combination of label values creates a separate time series that must be stored and indexed. The number of combinations is the cardinality, and it multiplies:

   service (10) × status (15) × region (5)  = 750 series   ← fine
   ...add  user_id (1,000,000)              = 750,000,000 series   ← catastrophe

Putting a high-cardinality field — user_id, request_id, email, raw URL with IDs in it — into a metric label is the classic, budget-destroying mistake. It’s called a cardinality explosion, and it can make a metrics system slower and more expensive than the logs it was supposed to replace.

The rule: labels are for low-cardinality dimensions you’ll group by (status, region, endpoint template). Anything unbounded or per-request — the user, the exact ID, the full URL — belongs in a log or a trace, not a metric label. This is the precise mirror of logging’s cost: logs are expensive in volume, metrics are expensive in cardinality. Knowing which signal to reach for is what keeps both the system and the bill healthy.
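The multiplication and the “endpoint template” rule can both be shown in a short sketch. The helper names `worst_case_series` and `endpoint_template`, and the regular expression that decides what counts as an ID segment, are assumptions for illustration; a real service would normally take the template from its router rather than guess it from the path. `math.prod` requires Python 3.8 or later.

```python
import re
from math import prod


def worst_case_series(label_cardinalities):
    """Upper bound on time series: the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Path segments that are all digits, or 36 characters of hex digits and
# hyphens (UUID-shaped), are assumed to be IDs.
ID_SEGMENT = re.compile(r"/(\d+|[0-9a-fA-F-]{36})(?=/|$)")


def endpoint_template(path):
    """Collapse per-request IDs so the label has a bounded set of values."""
    return ID_SEGMENT.sub("/{id}", path)


labels = {"service": 10, "status": 15, "region": 5}
bounded = worst_case_series(labels)
exploded = worst_case_series({**labels, "user_id": 1_000_000})
print(bounded)   # 750
print(exploded)  # 750000000

template = endpoint_template("/users/48213/orders/7")
print(template)  # /users/{id}/orders/{id}
```

Every raw URL for a user's order maps to the same template, so the label contributes one value instead of one per request. The user and order IDs themselves still belong in the log line or the trace.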

Check your understanding

Why is a metric cheaper than counting log lines, and what exactly do you give up for that cheapness?
Match each to its metric type: total errors, current queue depth, request latency distribution. Why must latency be a histogram and not a gauge?
Why is alerting on average latency dangerous? What should you alert on instead, and why?
Contrast RED and USE: what does each watch, and why is saturation often a leading indicator that RED misses?
What is a cardinality explosion, and which kinds of fields must never become metric labels? Where do those fields belong instead?