Deployment Strategies

The leading cause of outages isn’t hardware failure; it’s that someone deployed something. A deployment strategy is how you ship change while keeping the blast radius small and the undo button fast. The goal isn’t to never break things: it’s to break them for few users, briefly, and to recover quickly.

“Stop the old version, start the new one” (a recreate deploy) means downtime during the swap and, worse, an all-or-nothing bet: if the new build is bad, 100% of users hit it at once and your only recovery is another full deploy. Everything below exists to shrink that bet.

A rolling deploy replaces instances gradually, a few at a time, so the service stays up throughout.

[v1][v1][v1][v1] → [v2][v1][v1][v1] → [v2][v2][v1][v1] → [v2][v2][v2][v2]
(old and new run side by side during the roll)
  • Buys: no downtime, no extra fleet cost.
  • Costs: v1 and v2 run simultaneously, so your code and DB schema must tolerate both versions at once, and rollback means rolling the old version back out batch by batch, which is slow.
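
The roll pictured above can be sketched as a loop over batches. This is a minimal simulation, not a real orchestrator: the fleet is a list of version strings, and `rolling_deploy`, `batch_size`, and the `is_healthy` callback are hypothetical names introduced for illustration.

```python
from typing import Callable, List


def rolling_deploy(
    fleet: List[str],
    new_version: str,
    batch_size: int,
    is_healthy: Callable[[int], bool],
) -> List[List[str]]:
    """Replace instances batch by batch and return the fleet state after each step.

    `is_healthy(i)` is a hypothetical health check for instance index `i`,
    called after that instance has been updated. If a batch fails, only that
    batch is restored and the roll halts, which can leave a mixed fleet.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    fleet = list(fleet)          # work on a copy, leave the caller's list alone
    history = [list(fleet)]      # snapshot of the starting state

    for start in range(0, len(fleet), batch_size):
        end = min(start + batch_size, len(fleet))
        previous = fleet[start:end]
        fleet[start:end] = [new_version] * (end - start)

        if not all(is_healthy(i) for i in range(start, end)):
            fleet[start:end] = previous   # undo just this batch
            history.append(list(fleet))
            break                         # stop rolling; earlier batches stay updated

        history.append(list(fleet))

    return history
```

Every intermediate snapshot contains both versions, which is why the mixed-version cost applies to any roll that takes more than one batch.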

A blue-green deploy runs two full environments. “Blue” serves production; “green” gets the new version. Test green, then flip the load balancer to it instantly.

LB ──► Blue (v1)  [live]      (deploy v2 to Green, test it)
LB ──► Green (v2) [live]  ◄── flip the LB; Blue stays warm as instant rollback
  • Buys: instant cutover, instant rollback (flip back to blue), test in a prod-identical env.
  • Costs: you pay for double the infrastructure during the deploy, and shared state (the database) still has to be compatible across both.
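
The flip can be modelled as a single pointer that decides which environment receives traffic. The sketch below is an in-memory stand-in for a load balancer; `BlueGreenRouter` and its methods are hypothetical names, not any real load balancer's API.

```python
from typing import Dict, Optional


class BlueGreenRouter:
    """Hypothetical load-balancer stand-in: one pointer selects the live environment."""

    def __init__(self, blue_version: str) -> None:
        self.environments: Dict[str, Optional[str]] = {"blue": blue_version, "green": None}
        self.live = "blue"

    @property
    def idle(self) -> str:
        """Name of the environment that is not serving traffic."""
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, version: str) -> None:
        """Install a version on the idle side; live traffic is untouched."""
        self.environments[self.idle] = version

    def flip(self) -> None:
        """Cut over to the idle environment. Calling flip() again is the rollback."""
        if self.environments[self.idle] is None:
            raise RuntimeError("idle environment has nothing deployed")
        self.live = self.idle

    def serve(self) -> Optional[str]:
        """Return the version that live traffic currently reaches."""
        return self.environments[self.live]
```

Because the old environment keeps its version after the flip, rolling back is the same operation as cutting over. Nothing in this sketch covers the shared database, which both environments would still read and write.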

A canary deploy routes a small slice of real traffic (1% → 5% → 25% → 100%) to the new version while watching metrics. If error rate or latency climbs, abort before most users are affected.

99% ──► v1
1% ──► v2 ◄── watch p99, error rate, business metrics
healthy? widen to 5%, 25%, 100%. bad? route back to 0%.
  • Buys: the smallest blast radius — real-traffic validation with most users protected.
  • Costs: needs solid metrics and alerting to decide automatically, plus traffic-splitting infrastructure.
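
The widen-or-abort decision can be sketched as a loop over traffic stages. This is an illustration only: `Metrics`, `run_canary`, the `observe` callback, and the default thresholds (1% error rate, 500 ms p99) are assumptions, and a real setup would read these values from its monitoring system.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Metrics:
    error_rate: float   # fraction of failed requests, e.g. 0.002 for 0.2%
    p99_ms: float       # 99th-percentile latency in milliseconds


def run_canary(
    stages: Sequence[int],
    observe: Callable[[int], Metrics],
    max_error_rate: float = 0.01,   # assumed threshold
    max_p99_ms: float = 500.0,      # assumed threshold
) -> int:
    """Walk the traffic stages and return the final percent sent to the new version.

    `observe(percent)` is a hypothetical hook that returns the canary's
    metrics while it receives `percent` of traffic. Any breach routes
    traffic back to 0%.
    """
    for percent in stages:
        metrics = observe(percent)
        if metrics.error_rate > max_error_rate or metrics.p99_ms > max_p99_ms:
            return 0   # abort: all traffic goes back to the old version
    return stages[-1] if stages else 0
```

For example, `run_canary([1, 5, 25, 100], observe)` returns 100 only if every stage stays within both thresholds.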

Feature flags: decouple deploy from release

The most important idea here: deploying code and releasing a feature don’t have to be the same event. Ship the code dark behind a flag, then turn it on for 1% of users — or off instantly if it misbehaves — without a redeploy.

deploy (code present, flag OFF) ──► flip flag ON for 1% ──► 100%
(kill switch = flip OFF, no deploy)

This turns “rollback” from a deploy into a config change measured in seconds, and lets you separate the engineering risk (the deploy) from the product decision (the release).
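
A percentage rollout with a kill switch might look like the sketch below. It assumes flags live in a plain dictionary and that users are bucketed by hashing their ID; `Flag`, `is_enabled`, and the flag name `new_checkout` are hypothetical, and a production system would typically load flags from a config service instead.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


@dataclass
class Flag:
    enabled: bool = False       # kill switch: False turns the feature off for everyone
    rollout_percent: int = 0    # share of users (0-100) who see the feature when enabled


def _bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flags: Dict[str, Flag], name: str, user_id: str) -> bool:
    """Decide at request time whether this user gets the flagged code path."""
    flag = flags.get(name)
    if flag is None or not flag.enabled:
        return False            # unknown or switched-off flags default to the old path
    return _bucket(name, user_id) < flag.rollout_percent


# The code ships dark: the flag exists but is off.
flags = {"new_checkout": Flag()}
```

Hashing keeps each user in the same bucket, so anyone who saw the feature at 1% still sees it at 25%. Setting `enabled = False` is the kill switch, and it changes only configuration.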

What does this buy us, and what does it cost? Each strategy trades infrastructure cost and complexity for a smaller blast radius and faster recovery. Recreate is cheap and dangerous; canary with feature flags is the gold standard but demands real observability and disciplined, compatible schema changes. Pick the cheapest strategy that makes your rollback fast enough for the damage a bad deploy could do.
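
One common way to keep schema changes compatible is expand → migrate → contract, the pattern named in the review questions below. The sketch assumes a column rename from `name` to `full_name` in an in-memory SQLite database; the table, the columns, and the function names are hypothetical, and the split into three steps is one reading of the pattern, not a prescribed procedure.

```python
import sqlite3
from typing import List


def expand(conn: sqlite3.Connection) -> None:
    """Add the new column next to the old one; v1 code keeps working unchanged."""
    conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")


def migrate(conn: sqlite3.Connection) -> None:
    """Backfill rows written by v1 so that v2 can read the new column everywhere."""
    conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")


def contract(conn: sqlite3.Connection) -> None:
    """Drop the old column once no v1 instance remains.

    The table is rebuilt because older SQLite versions lack DROP COLUMN.
    """
    conn.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT)")
    conn.execute("INSERT INTO users_new (id, full_name) SELECT id, full_name FROM users")
    conn.execute("DROP TABLE users")
    conn.execute("ALTER TABLE users_new RENAME TO users")


def demo() -> List[str]:
    """Run the three phases while old-style and new-style writes are interleaved."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")   # written by v1

    expand(conn)
    conn.execute("INSERT INTO users (name) VALUES ('Alan Turing')")    # v1 still writes fine
    conn.execute(                                                       # v2 writes both columns
        "INSERT INTO users (name, full_name) VALUES (?, ?)",
        ("Grace Hopper", "Grace Hopper"),
    )

    migrate(conn)
    contract(conn)
    return [row[0] for row in conn.execute("SELECT full_name FROM users ORDER BY id")]
```

Each phase ships as its own deploy, so at every point during a roll, flip, or canary both the running versions can read and write the schema they expect.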

  1. Why is a “recreate” deploy risky beyond just causing downtime?
  2. Contrast the cost and rollback speed of rolling vs blue-green.
  3. What makes canary the smallest-blast-radius option, and what capability is it entirely dependent on?
  4. How do feature flags separate “deploy” from “release,” and why does that speed up rollback?
  5. Why must schema changes be backward-compatible for these strategies, and what is the expand→migrate→contract pattern?