Load Balancers
A single server has a ceiling: it can only do so much, and when it dies, everything it served dies with it. The escape from both problems is the same — run many identical servers and put something in front of them that spreads requests across the fleet. That something is a load balancer (LB). It turns one virtual address that the world talks to into traffic distributed over N machines that the world never sees. This is the block that lets you scale out (add more machines) instead of scale up (buy a bigger machine), and the block that lets a server fail without taking the service down.
What does a load balancer buy us, and what does it cost? It buys horizontal scalability and fault tolerance. It costs you a new component to operate, a potential bottleneck/single-point-of-failure to make redundant, and — most importantly — it forces a discipline on your app servers: they should be interchangeable, which means stateless.
L4 vs L7: where the balancer looks
Load balancers come in two layers, named for the OSI model, and the difference is how deeply they read each connection.
| | L4 (transport) | L7 (application) |
|---|---|---|
| Sees | IP + TCP/UDP port | full HTTP: URL, headers, cookies |
| Decision | per connection | per request |
| Speed | very fast, cheap | slower, must parse and often terminate TLS |
| Content-based routing | no | yes: route /api → A, /img → B, by host, etc. |

- L4 load balancing operates at the transport layer. It sees a TCP/UDP connection (source and destination IP addresses and ports) and forwards the whole connection to a backend without looking inside. It’s fast and protocol-agnostic, but it can’t make decisions based on what the request is.
- L7 load balancing operates at the application layer. It parses the HTTP request, so it can route by URL path, hostname, headers, or cookies, terminate TLS, rewrite requests, and serve as a smart reverse proxy. The price is more CPU per request and a deeper coupling to the protocol.
Rule of thumb: use L4 when you need raw speed and don’t care about content; use L7 when routing decisions depend on the content of the request (which, for web APIs, is most of the time).
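To make "routing by content" concrete, here is a minimal sketch of the decision an L7 balancer makes for each request. The pool names, the `ROUTES` table, and the `choose_pool` helper are all hypothetical, invented for illustration; a real balancer expresses the same rules in its own configuration language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """The parts of an HTTP request an L7 balancer can inspect."""
    host: str
    path: str


# Hypothetical routing table: (host, or None for any host; path prefix; pool).
# Rules are checked in order and the first match wins.
ROUTES = [
    ("img.example.com", "/", "image-pool"),
    (None, "/api/", "api-pool"),
    (None, "/img/", "image-pool"),
]
DEFAULT_POOL = "web-pool"


def choose_pool(req: Request) -> str:
    """Return the name of the backend pool that should serve this request."""
    for host, prefix, pool in ROUTES:
        if host is not None and host != req.host:
            continue  # rule is pinned to a different hostname
        if req.path.startswith(prefix):
            return pool
    return DEFAULT_POOL


if __name__ == "__main__":
    print(choose_pool(Request("www.example.com", "/api/users")))     # api-pool
    print(choose_pool(Request("www.example.com", "/img/logo.png")))  # image-pool
    print(choose_pool(Request("www.example.com", "/about")))         # web-pool
```

An L4 balancer cannot implement `choose_pool` at all: it forwards the connection before any host or path has been parsed.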
Balancing algorithms
Given a healthy pool of servers, how does the LB choose one? A few common policies:
| Algorithm | How it picks | Best when |
|---|---|---|
| Round-robin | next server in rotation | requests are uniform and servers are equal |
| Weighted round-robin | rotation, but bigger servers get more | a heterogeneous fleet |
| Least-connections | the server with the fewest live connections | requests vary widely in duration |
| Least-response-time | fewest connections + lowest latency | latency-sensitive, mixed workloads |
| IP / consistent hashing | hash of client IP (or key) → server | you need the same client to hit the same server |
Round-robin is the default: simple, fair when work is uniform. Least-connections shines when some requests are slow and some are fast — round-robin would happily pile new requests on a server already stuck on a slow one. Hashing is the odd one out: it deliberately sends the same client to the same server, which is how you get sticky sessions (more on that below) or route a cache key to its owning node.
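The difference between these policies is easiest to see in a toy simulation. The sketch below is an illustration under simplifying assumptions, not how any particular balancer is implemented: one request arrives per tick, each holds its connection for a fixed number of ticks, and the class and function names are made up for the example. Note that `hash_pick` uses plain modulo hashing rather than true consistent hashing, so resizing the pool would remap most clients.

```python
import hashlib
from itertools import cycle


class RoundRobin:
    """Hand out servers in a fixed rotation, ignoring current load."""

    def __init__(self, servers):
        self._rotation = cycle(servers)

    def pick(self, active):
        return next(self._rotation)


class LeastConnections:
    """Pick the server with the fewest in-flight connections."""

    def __init__(self, servers):
        self._servers = list(servers)

    def pick(self, active):
        # Ties go to the server listed first.
        return min(self._servers, key=lambda s: active[s])


def hash_pick(servers, client_key):
    """Map a client key to a server deterministically (simple modulo hashing)."""
    digest = hashlib.sha256(client_key.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def simulate(policy, servers, durations):
    """Send one request per tick; return each server's peak open connections.

    durations[i] is how many ticks request i keeps its connection open.
    """
    finish_times = {s: [] for s in servers}  # tick at which each connection ends
    peak = {s: 0 for s in servers}
    for tick, duration in enumerate(durations):
        # Drop connections that have finished by this tick.
        for s in servers:
            finish_times[s] = [t for t in finish_times[s] if t > tick]
        active = {s: len(finish_times[s]) for s in servers}
        chosen = policy.pick(active)
        finish_times[chosen].append(tick + duration)
        peak[chosen] = max(peak[chosen], len(finish_times[chosen]))
    return peak


if __name__ == "__main__":
    servers = ["a", "b"]
    durations = [10, 1] * 6  # slow and fast requests alternate
    print("round-robin      :", simulate(RoundRobin(servers), servers, durations))
    print("least-connections:", simulate(LeastConnections(servers), servers, durations))
    print("hash of client IP:", hash_pick(servers, "203.0.113.7"))
```

With slow and fast requests alternating, round-robin sends every slow request to server `a`, which peaks at five open connections while `b` never holds more than one. Least-connections peaks at three on each server.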
Health checks: routing only to the living
A load balancer is only as good as its knowledge of which backends are alive. It continuously health-checks each server and removes failing ones from rotation:
- Passive checks watch real traffic — if a backend returns errors or stops responding, eject it.
- Active checks probe on a schedule: hit `GET /healthz` every few seconds; mark a server unhealthy after K consecutive failures, healthy again after K successes.
A good health endpoint checks that the server can actually do its job (e.g. reach its database), not merely that the process is up. This is what makes a server crash invisible to users: the LB simply stops sending it traffic. It is also why your servers must be interchangeable — the LB assumes any healthy backend can serve any request.
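The "K consecutive failures, K consecutive successes" rule for active checks can be sketched as a small state machine. This is a minimal illustration, assuming the probe itself happens elsewhere and only its boolean result is fed in; `BackendHealth`, `in_rotation`, and the backend names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BackendHealth:
    """Tracks one backend's health from a stream of probe results."""
    threshold: int = 3    # K: consecutive results needed to flip state
    healthy: bool = True
    _streak: int = 0      # consecutive probes that disagree with `healthy`

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the (possibly updated) state."""
        if probe_ok == self.healthy:
            # Probe agrees with the current state: reset the counter.
            self._streak = 0
        else:
            self._streak += 1
            if self._streak >= self.threshold:
                self.healthy = probe_ok
                self._streak = 0
        return self.healthy


def in_rotation(pool: dict[str, BackendHealth]) -> list[str]:
    """Backends the balancer may currently send traffic to."""
    return [name for name, state in pool.items() if state.healthy]


if __name__ == "__main__":
    pool = {"app-1": BackendHealth(), "app-2": BackendHealth()}
    for ok in [False, False, False]:  # three failed probes in a row
        pool["app-1"].record(ok)
    print(in_rotation(pool))  # ['app-2']
```

Requiring a streak in both directions means a single failed probe does not eject a server, and a single lucky success does not put a struggling one back.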
Sticky sessions, and why statelessness wins
Sometimes an LB is configured for sticky sessions (session affinity): once a user hits server A, the LB pins them to A for the rest of the session, usually via a cookie or by hashing their IP. Why? Because server A is holding that user’s session state in its local memory, and sending the next request to server B would lose it.
This works, but look at what it costs:
Sticky (state in the server):
- load can’t rebalance: a “hot” user is stuck on one box
- that server dies → the user’s session is GONE
- autoscaling is awkward: can’t drain a server without dropping sessions

Stateless (state in a shared store / token):
- ANY server can handle ANY request
- a server can die or be replaced freely
- the LB is free to use least-connections, round-robin, anything

The better answer is almost always to make servers stateless: push session state out to a shared store (Redis, a database) or into a signed token the client carries (a JWT). Then stickiness is unnecessary, any server can serve any request, and the LB regains full freedom to balance and to fail servers over invisibly. This is important enough to have its own page (Statelessness & Sessions), and it’s the single biggest reason the toolkit in this book keeps pushing state out of the request-handling tier.
The LB itself must not be a single point of failure
If everything flows through one load balancer, the LB’s availability is your service’s availability. So real deployments run the LB redundantly: multiple LB nodes behind a shared (often anycast or floating) IP, with health checks between them, so that the thing protecting your fleet is itself protected. DNS (the previous page) often does the first, coarse split across LB front doors; the LB does the fine split across servers.
Check your understanding
- What two distinct problems does a load balancer solve, and why are they really the same idea (“many identical servers”)?
- Contrast L4 and L7 load balancing by what each one can see. Give one routing decision only L7 can make.
- When does least-connections beat round-robin? Construct a scenario where round-robin overloads a server.
- Why do sticky sessions limit a load balancer’s freedom, and what does making servers stateless buy back?
- How can an overly aggressive health check turn a small blip into a full cascading outage?