Load Balancers

A single server has a ceiling: it can only do so much, and when it dies, everything it served dies with it. The escape from both problems is the same — run many identical servers and put something in front of them that spreads requests across the fleet. That something is a load balancer (LB). It turns one virtual address that the world talks to into traffic distributed over N machines that the world never sees. This is the block that lets you scale out (add more machines) instead of scale up (buy a bigger machine), and the block that lets a server fail without taking the service down.

What does a load balancer buy you, and what does it cost? It buys horizontal scalability and fault tolerance. It costs a new component to operate, a potential bottleneck and single point of failure that must itself be made redundant, and, most importantly, a discipline on your app servers: they should be interchangeable, which means stateless.

L4 vs L7: where the balancer looks

Load balancers come in two kinds, named for the OSI layer they operate at, and the difference is how deeply they read each connection.

   L4 (transport)                         L7 (application)
   ──────────────                         ────────────────
   sees: IP + TCP/UDP port                sees: full HTTP — URL, headers, cookies
   decision: per connection               decision: per request
   speed: very fast, cheap                speed: slower, must parse & often terminate TLS
   can't do: content-based routing        can: route /api → A, /img → B, by host, etc.

L4 load balancing operates at the transport layer. It sees a TCP/UDP connection (source and destination IP addresses and ports) and forwards the whole connection to a backend without looking inside. It’s fast and protocol-agnostic, but it can’t make decisions based on what the request is.
L7 load balancing operates at the application layer. It parses the HTTP request, so it can route by URL path, hostname, headers, or cookies, terminate TLS, rewrite requests, and serve as a smart reverse proxy. The price is more CPU per request and a deeper coupling to the protocol.

Rule of thumb: use L4 when you need raw speed and don’t care about content; use L7 when routing decisions depend on the content of the request (which, for web APIs, is most of the time).
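
To make the "route by content" idea concrete, here is a minimal sketch of the decision an L7 balancer makes once it has parsed a request. The Request type, the ROUTES table, and the pool names are all hypothetical, chosen to mirror the /api and /img example above; a real balancer expresses this in its own configuration language rather than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Only the fields this toy L7 decision reads (hypothetical type)."""
    host: str
    path: str


# Hypothetical routing table: (path prefix, pool name), checked in order.
ROUTES = [
    ("/api/", "api-pool"),
    ("/img/", "image-pool"),
]
DEFAULT_POOL = "web-pool"


def choose_pool(request: Request) -> str:
    """Pick a backend pool from the request's content (here, its URL path)."""
    for prefix, pool in ROUTES:
        if request.path.startswith(prefix):
            return pool
    return DEFAULT_POOL


print(choose_pool(Request(host="example.com", path="/api/users")))     # api-pool
print(choose_pool(Request(host="example.com", path="/img/logo.png")))  # image-pool
print(choose_pool(Request(host="example.com", path="/about")))         # web-pool
```

An L4 balancer cannot run anything like choose_pool, because the path is inside the byte stream it forwards without parsing.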

Balancing algorithms

Given a healthy pool of servers, how does the LB choose one? A few common policies:

Algorithm	How it picks	Best when
Round-robin	next server in rotation	requests are uniform and servers are equal
Weighted round-robin	rotation, but bigger servers get more	a heterogeneous fleet
Least-connections	the server with the fewest live connections	requests vary widely in duration
Least-response-time	fewest connections + lowest latency	latency-sensitive, mixed workloads
IP / consistent hashing	hash of client IP (or key) → server	you need the same client to hit the same server

Round-robin is the usual default: simple, and fair when work is uniform. Least-connections shines when some requests are slow and some are fast, because round-robin would happily pile new requests on a server already stuck on a slow one. Hashing is the odd one out: it deliberately sends the same client to the same server, which is how you get sticky sessions (more on that below) or route a cache key to its owning node. Consistent hashing is the variant that keeps most keys on the same server when the pool grows or shrinks, where a plain hash-modulo-N would remap most of them.
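
The three selection styles can be sketched in a few lines each. This is an illustration under simplifying assumptions, not how any particular balancer implements them: the function names are made up for this example, the connection counts are passed in by hand, and pick_by_hash uses plain hash-modulo-N rather than a consistent-hash ring.

```python
import hashlib
from itertools import cycle
from typing import Callable, Mapping, Sequence


def round_robin(pool: Sequence[str]) -> Callable[[], str]:
    """Return a picker that walks the pool in order, forever."""
    rotation = cycle(pool)
    return lambda: next(rotation)


def pick_least_connections(live: Mapping[str, int]) -> str:
    """Pick the server that currently has the fewest live connections."""
    return min(live, key=lambda server: live[server])


def pick_by_hash(client_ip: str, pool: Sequence[str]) -> str:
    """Same client IP -> same server, as long as the pool is unchanged.

    Plain modulo hashing: resizing the pool remaps most clients.
    """
    digest = hashlib.sha256(client_ip.encode()).digest()
    return pool[int.from_bytes(digest[:8], "big") % len(pool)]


servers = ["s1", "s2", "s3"]

# s1 is stuck on slow requests; round-robin sends it the next one anyway.
pick = round_robin(servers)
print([pick() for _ in range(4)])                            # ['s1', 's2', 's3', 's1']
print(pick_least_connections({"s1": 5, "s2": 1, "s3": 2}))   # s2

# Hashing is repeatable for a given client.
print(pick_by_hash("203.0.113.7", servers) == pick_by_hash("203.0.113.7", servers))  # True
```

Note that round-robin needs no knowledge of the backends, least-connections needs live counts, and hashing needs only the request key; that difference in required state is part of why round-robin is the simple default.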

Health checks: routing only to the living

A load balancer is only as good as its knowledge of which backends are alive. It continuously health-checks each server and removes failing ones from rotation:

Passive checks watch real traffic — if a backend returns errors or stops responding, eject it.
Active checks probe on a schedule — hit GET /healthz every few seconds; mark a server unhealthy after K consecutive failures, healthy again after K successes.

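A good health endpoint checks that the server can actually do its job (e.g. reach its database), not merely that the process is up. This is what makes a server crash invisible to users: the LB simply stops sending it traffic. It is also why your servers must be interchangeable: the LB assumes any healthy backend can serve any request. The caution is that a check which is too strict or too deep can misfire. If the probe depends on a shared database and that database has a brief blip, every backend fails its check at once and the LB ejects the whole pool, turning a small problem into a full outage.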
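
The "K consecutive failures, then K consecutive successes" rule for active checks is a small state machine. The sketch below tracks one backend; the class name BackendHealth and the default threshold of 3 are assumptions made for illustration, and the probe itself (the HTTP call to the health endpoint) is left out so that only the bookkeeping is shown.

```python
from dataclasses import dataclass


@dataclass
class BackendHealth:
    """Tracks one backend using the K-consecutive-results rule."""
    threshold: int = 3     # K; 3 is an arbitrary example value
    healthy: bool = True
    streak: int = 0        # consecutive probe results that disagree with `healthy`

    def record(self, probe_ok: bool) -> bool:
        """Feed in one probe result and return the (possibly updated) state."""
        if probe_ok == self.healthy:
            self.streak = 0               # result agrees with current state
        else:
            self.streak += 1
            if self.streak >= self.threshold:
                self.healthy = probe_ok   # flip only after K in a row
                self.streak = 0
        return self.healthy


backend = BackendHealth(threshold=3)
# Two failures followed by a success: the streak resets, still in rotation.
print([backend.record(ok) for ok in (False, False, True)])   # [True, True, True]
# Three failures in a row: ejected on the third.
print([backend.record(False) for _ in range(3)])             # [True, True, False]
# Three successes in a row: readmitted on the third.
print([backend.record(True) for _ in range(3)])              # [False, False, True]
```

Requiring a streak in both directions is what keeps a single dropped probe from ejecting a healthy server, and a single lucky probe from readmitting a broken one.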

Sticky sessions, and why statelessness wins

Sometimes an LB is configured for sticky sessions (session affinity): once a user hits server A, the LB pins them to A for the rest of the session, usually via a cookie or by hashing their IP. Why? Because server A is holding that user’s session state in its local memory, and sending the next request to server B would lose it.

This works, but look at what it costs:

   Sticky (state in the server):
     - load can't rebalance: a "hot" user is stuck on one box
     - that server dies → the user's session is GONE
     - autoscaling is awkward: can't drain a server without dropping sessions

   Stateless (state in a shared store / token):
     - ANY server can handle ANY request
     - a server can die or be replaced freely
     - the LB is free to use least-connections, round-robin, anything

The better answer is almost always to make servers stateless: push session state out to a shared store (Redis, a database) or into a signed token the client carries (a JWT). Then stickiness is unnecessary, any server can serve any request, and the LB regains full freedom to balance and to fail servers over invisibly. This is important enough to have its own page — Statelessness & Sessions — and it’s the single biggest reason the toolkit in this book keeps pushing state out of the request-handling tier.
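
As a rough illustration of the "signed token the client carries" option, the sketch below signs a session payload with an HMAC so that any server holding the same secret can verify it without a lookup. It is deliberately not a real JWT: there is no header, no expiry, and no key rotation, and the names issue_token, read_token, and SECRET are hypothetical. A production system would normally use an established JWT library instead.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

# Hypothetical shared secret; every app server must hold the same value.
SECRET = b"demo-secret-shared-by-all-servers"


def issue_token(session: dict) -> str:
    """Serialize the session and append an HMAC so tampering is detectable."""
    payload = base64.urlsafe_b64encode(json.dumps(session, sort_keys=True).encode())
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def read_token(token: str) -> Optional[dict]:
    """Return the session if the signature verifies, otherwise None."""
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes, so malformed input cannot raise here.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))


token = issue_token({"user_id": 42})   # issued by whichever server handled login
print(read_token(token))               # {'user_id': 42}, verifiable on any server
print(read_token("not-a-valid-token")) # None
```

Because read_token needs only the token and the shared secret, the request can land on any server, which is exactly the property that makes stickiness unnecessary.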

The LB itself must not be a single point of failure

If everything flows through one load balancer, that LB's availability caps the availability of the whole service. So real deployments run the LB redundantly, with multiple LB nodes behind a shared (often anycast or floating) IP and health checks between them, so that the thing protecting your fleet is itself protected. DNS (the previous page) often does the first, coarse split across LB front doors; the LB does the fine split across servers.

Check your understanding

What two distinct problems does a load balancer solve, and why are they really the same idea (“many identical servers”)?
Contrast L4 and L7 load balancing by what each one can see. Give one routing decision only L7 can make.
When does least-connections beat round-robin? Construct a scenario where round-robin overloads a server.
Why do sticky sessions limit a load balancer’s freedom, and what does making servers stateless buy back?
How can an overly aggressive health check turn a small blip into a full cascading outage?