Redundancy & Failover

A single point of failure (SPOF) is any one component whose death takes the whole system with it: the lone database, the one load balancer, the single power feed, the one engineer who knows how the deploy works. The entire game of availability is finding SPOFs and removing them — and the only known way to remove one is redundancy: have more than one of the thing, so the survivors carry on when one dies.

That sounds obvious, and the principle is. The subtlety — and the place real outages come from — is in how the spare takes over. This page works from “why redundancy at all” up to the surprisingly treacherous mechanics of failover.

Why redundancy works: the arithmetic of independence

Suppose one server is up 99% of the time, so it’s down about 3.65 days a year. Put two independent servers behind a load balancer and require only one to be up. The system is down only when both are down at once. If the failures are truly independent, the probability of that is 0.01 × 0.01 = 0.0001, which gives 99.99% availability: downtime falls from 3.65 days to about 53 minutes a year. Two cheap unreliable things composed into one reliable thing. This is the core magic, and it’s why redundancy is the foundation of every nines target in Availability, SLAs & the Nines.
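The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes fully independent failures and a group that counts as up while at least one node is up; the function names are illustrative, not from any library.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def parallel_availability(per_node: float, nodes: int) -> float:
    """Availability of a group that is up while at least one node is up.

    Assumes node failures are statistically independent.
    """
    return 1 - (1 - per_node) ** nodes


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year at a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


for n in (1, 2, 3):
    a = parallel_availability(0.99, n)
    print(f"{n} node(s): {a:.4%} available, "
          f"{downtime_minutes_per_year(a):.1f} min/year down")
```

With two 99% nodes the downtime comes out at roughly 52.6 minutes a year, matching the figure above. The result only holds if the independence assumption does.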

The redundancy spectrum: passive to active

Redundancy is not one design — it’s a spectrum trading cost against recovery speed.

COLD STANDBY    WARM STANDBY        HOT / ACTIVE-PASSIVE   ACTIVE-ACTIVE
spare is off    spare running,      spare fully synced,    all nodes serve
must boot &     data lagging,       idle, ready instantly  traffic at once
restore         promote in seconds
└── cheapest ────────────────────────────────────────── most expensive ──┘
└── slowest recovery ──────────────────────────────── fastest recovery ──┘

In active-passive, one node serves all traffic and a standby waits. On failure, traffic fails over to the standby, which is promoted to primary. This is the default for stateful systems: most relational databases run a primary with one or more replicas and promote a replica when the primary dies.

What does this buy us, and what does it cost? It buys simplicity: there’s exactly one writer, so you never have to reconcile conflicting writes — consistency is easy. It costs you idle capacity (you’re paying for a standby that does no useful work) and a recovery gap: there’s a window between “primary died” and “standby is serving” where you’re down.
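To get a feel for the recovery gap, it can be estimated as detection time plus promotion time. The sketch below is a rough upper bound under stated assumptions: failure is detected after a fixed number of consecutive failed health checks, and every number in the example (check interval, threshold, promotion time, failovers per year) is hypothetical.

```python
def recovery_gap_seconds(check_interval_s: float,
                         failures_to_trip: int,
                         promotion_s: float) -> float:
    """Rough upper bound on the window between 'primary died' and
    'standby is serving': time to detect plus time to promote."""
    detection_s = check_interval_s * failures_to_trip
    return detection_s + promotion_s


def yearly_failover_downtime_minutes(failovers_per_year: float,
                                     gap_s: float) -> float:
    """Downtime per year spent inside recovery gaps."""
    return failovers_per_year * gap_s / 60


# Hypothetical numbers: check every 5 s, trip after 3 misses, 30 s to promote.
gap = recovery_gap_seconds(check_interval_s=5, failures_to_trip=3, promotion_s=30)
print(gap)                                       # 45
print(yearly_failover_downtime_minutes(4, gap))  # 3.0
```

Even a handful of failovers a year can consume a meaningful slice of a tight downtime budget, which is why the gap matters as much as the spare.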

In active-active, every node serves live traffic simultaneously and load is shared. Lose one and the others absorb its share; there’s no promotion step because there’s no special primary.

What does this buy us, and what does it cost? It buys no idle capacity (every machine earns its keep) and near-instant failure absorption (the survivors were already serving). It costs you the hard problem of coordination: with multiple active writers you must handle conflicting concurrent writes, which drags you into consistency trade-offs, distributed locking, or conflict resolution. Active-active is straightforward for stateless services and genuinely difficult for stateful ones.
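"The others absorb its share" only works if the survivors have headroom. A minimal sketch of that check, assuming load is spread evenly and redistributes evenly after a failure (the function name and the numbers are illustrative):

```python
def utilization_after_loss(nodes: int, utilization: float, lost: int = 1) -> float:
    """Per-node utilization once `lost` nodes fail and the survivors
    share the same total load evenly."""
    survivors = nodes - lost
    if survivors <= 0:
        raise ValueError("no survivors left to absorb the load")
    return utilization * nodes / survivors


# Four nodes at 60% each: losing one pushes the rest to about 80%.
print(utilization_after_loss(4, 0.60))
# Two nodes at 60% each: losing one puts the survivor at 120%, i.e. overloaded.
print(utilization_after_loss(2, 0.60))
```

Under these assumptions, a two-node active-active pair must keep each node below 50% utilization to survive losing one, so "no idle capacity" is not quite free.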

Here’s the counterintuitive truth: the failover mechanism is itself a common cause of outages. The redundancy is there to make you safer, but the switch-over is a high-risk maneuver, for several reasons.

Detecting failure is ambiguous. How do you know the primary is dead and not just slow, or briefly unreachable? You use health checks and timeouts — but set them too aggressive and a momentary blip triggers an unnecessary failover; set them too lax and you stay down longer than needed. There’s no perfect threshold.
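One common way to trade off the two failure modes is to require several consecutive failed checks before declaring a node dead. The class below is a minimal sketch of that idea, not any particular load balancer's implementation; `FailureDetector` and `failures_to_trip` are hypothetical names.

```python
class FailureDetector:
    """Declares a node dead after N consecutive failed health checks."""

    def __init__(self, failures_to_trip: int = 3):
        if failures_to_trip < 1:
            raise ValueError("failures_to_trip must be at least 1")
        self.failures_to_trip = failures_to_trip
        self.consecutive_failures = 0

    def observe(self, check_passed: bool) -> bool:
        """Record one health-check result.

        Returns True when the node should be considered dead.
        """
        if check_passed:
            self.consecutive_failures = 0  # any success resets the count
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failures_to_trip


blips = [True, False, True, False, False, True]

lax = FailureDetector(failures_to_trip=3)
print(any(lax.observe(r) for r in blips))         # False: blips are ignored

aggressive = FailureDetector(failures_to_trip=1)
print(any(aggressive.observe(r) for r in blips))  # True: a single blip trips it
```

A higher threshold rides out blips but lengthens detection time by roughly one check interval per extra failure required.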

Split-brain. The nightmare scenario: the primary isn’t actually dead — the network between it and the standby failed. The standby, seeing no primary, promotes itself. Now you have two primaries, both accepting writes, diverging. When the network heals, you have two conflicting versions of the truth and no clean way to merge them. Systems prevent this with quorum (a majority must agree before promotion) or fencing (forcibly disabling the old primary — “STONITH”: shoot the other node in the head).

┌──────────┐  network partition  ┌──────────┐
│ PRIMARY  │ ╳ ────────────────╳ │ STANDBY  │
│ (alive!) │                     │ promotes │
│ accepts  │                     │ itself,  │
│ writes   │                     │ accepts  │
└──────────┘                     │ writes   │
     └──── both think they're it ──┘
       = SPLIT BRAIN, diverging data
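The quorum rule can be sketched in a few lines. This is an illustration under assumptions, not a real cluster manager: each node counts itself plus the peers it can still reach, and a node may act as primary only if that count is a strict majority of the configured cluster size. The three-node layout (primary, standby, and a tie-breaking witness) is a hypothetical example.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """True only when a strict majority of the full cluster is reachable
    (counting the node doing the asking)."""
    return reachable_nodes > cluster_size // 2


CLUSTER_SIZE = 3  # hypothetical: primary, standby, witness

# Partition isolates the standby: it sees only itself and must not promote.
print(has_quorum(1, CLUSTER_SIZE))  # False

# Partition isolates the old primary instead: standby + witness form a majority,
# so the standby may promote, and the lone primary (1 of 3) must stop taking writes.
print(has_quorum(2, CLUSTER_SIZE))  # True
```

Because two disjoint sides of a partition cannot both hold a strict majority, at most one side can be primary. Note that a plain two-node pair can never reach quorum after a partition, which is why such setups add a third voter or rely on fencing.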

The standby was never tested. A standby that has sat idle for a year may have drifted config, an expired certificate, or insufficient capacity to actually take the full load. Failover that’s never rehearsed often fails when it’s finally needed. This is why mature teams practice failover drills and even deliberately kill primaries in production (chaos engineering) — the only trustworthy failover is one you’ve watched work.

Redundancy buys availability by composing independent copies; failover is how you cash that in, but it’s a live maneuver with sharp edges. Spread copies across real failure domains, prefer stateless active-active where you can, accept the idle capacity and recovery gap of active-passive where state demands a single writer, and rehearse the switch-over so the spare you’re paying for actually works when the weather turns.

  1. Two servers are each up 99% of the time. Why is the pair not automatically 99.99% available, and what condition must hold for that math to work?
  2. Contrast active-passive and active-active on two axes: idle capacity and consistency difficulty. Which suits a stateless web tier, and why?
  3. What is split-brain, what causes it, and name two mechanisms that prevent it.
  4. Why is the failover itself a frequent source of outages, even when the redundancy is correctly provisioned?
  5. Your “redundant” database pair lives in the same rack. Explain precisely why that undermines the availability arithmetic.