🧭 Analogy
A supermarket with one queue and twelve tills hires a greeter who waves each shopper to the shortest open till, closes a till whose card reader breaks, and opens more at rush hour. The greeter is a load balancer: clients only ever talk to the greeter, never to a till directly — which is also why the tills can sit safely behind a perimeter.
A load balancer (LB) is a reverse proxy that distributes incoming requests across the replicas of a service. It is the keystone of scaling out: clients only ever contact the LB, which keeps stateless replicas equally busy and shields them behind a security perimeter. Because every request flows through it, the LB itself must add very little latency.
Layer 4 vs Layer 7
Load balancing happens at two levels:
- Network-level (Layer 4) — routes on TCP/UDP using NAT. Fast and simple.
- Application-level (Layer 7) — reassembles HTTP and routes on headers, paths, or body content. Richer, slightly slower.
An AWS experiment found the network LB ~20% faster at low and medium load — but equal at 256 clients once the 4 replicas saturated, because the bottleneck had moved to the replicas themselves.
The four feature areas
graph TD
    C["Clients"] --> LB["Load balancer"]
    LB -->|"distribution policy"| R1["Replica 1 (healthy)"]
    LB -->|"distribution policy"| R2["Replica 2 (healthy)"]
    LB -. "health check fails" .-> R3["Replica 3 (removed)"]
    LB -->|"autoscaling adds"| R4["Replica 4 (new)"]
- Distribution policies — round robin, least connections, HTTP header field, or HTTP operation. Replicas can be weighted by capacity.
- Health monitoring — unhealthy replicas are removed and reincorporated when healthy again.
- Elasticity — dynamically provision capacity. AWS Auto Scaling groups set min/max and scale schedule-based or dynamically on metrics, with warm-up periods and scale-in below thresholds. Essentially mandatory for fluctuating workloads.
- Session affinity (sticky sessions) — pins a client to one replica for stateful services.
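As a rough illustration of the first two feature areas, the sketch below combines two distribution policies (weighted round robin and least connections) with health-based removal. The `Replica` class, its fields, and the replica names are hypothetical and exist only for this example; production load balancers implement these policies internally.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    weight: int = 1               # relative capacity (hypothetical units)
    healthy: bool = True          # flipped by a health checker (not shown)
    active_connections: int = 0   # in-flight requests on this replica


class WeightedRoundRobin:
    """Rotate through replicas, giving each one `weight` turns per cycle."""

    def __init__(self, replicas):
        # A replica with weight 2 occupies two slots in the rotation.
        self._slots = [r for r in replicas for _ in range(r.weight)]
        self._next = 0

    def pick(self):
        # Try each slot at most once so an all-unhealthy pool cannot loop forever.
        for _ in range(len(self._slots)):
            replica = self._slots[self._next]
            self._next = (self._next + 1) % len(self._slots)
            if replica.healthy:   # unhealthy replicas are skipped, not deleted
                return replica
        raise RuntimeError("no healthy replicas")


def least_connections(replicas):
    """Pick the healthy replica with the fewest in-flight connections."""
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy replicas")
    return min(candidates, key=lambda r: r.active_connections)
```

Because unhealthy replicas stay in the pool and are only skipped, setting `healthy` back to `True` is enough to reincorporate one once its health checks pass again.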
⚠️ Sticky sessions guarantee imbalance
Session affinity is the workaround for stateful services, but sessions last varying durations, so some replicas overload while others idle — inevitable at millions of sessions. Statelessness avoids the whole problem: failures yield retries elsewhere, slow replicas are health-checked out, and state lives in an external store. Prefer stateless.
Behind the LB: the app server has its own limits
A widened highway that ends in one lane doesn’t help. Inside each replica, capacity is bounded too: listener threads accept connections (queuing in the OS sockets backlog), a thread pool of container threads serves requests (Tomcat defaults min 25 / max 200), and a smaller database connection pool can make threads block. Systems degrade before 100% utilization, so tune pool sizes to a target — and remember that scaling one tier just pushes the bottleneck downstream.
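A back-of-the-envelope way to see which pool bounds a replica, using Little's law (concurrency = throughput × time held). Only the 200-thread figure comes from the text above; the per-request times and the 20-connection database pool are assumptions for illustration, as is the simplification that every request holds exactly one thread and one connection.

```python
def max_throughput(pool_size, hold_time_ms):
    """Upper bound on requests/second for a pool whose slots are each
    held for `hold_time_ms` per request (Little's law rearranged)."""
    return pool_size * 1000 / hold_time_ms


def bottleneck(stages):
    """Return the name of the stage with the lowest throughput ceiling.

    `stages` maps a stage name to a (pool_size, hold_time_ms) pair.
    """
    return min(stages, key=lambda name: max_throughput(*stages[name]))


# Assumed figures: 200 container threads held 50 ms per request,
# 20 database connections held 20 ms per request.
stages = {
    "container_threads": (200, 50),
    "db_connections": (20, 20),
}

print(max_throughput(200, 50))   # 4000.0 requests/second
print(max_throughput(20, 20))    # 1000.0 requests/second
print(bottleneck(stages))        # db_connections
```

Under these assumptions the connection pool caps the replica at a quarter of what its threads could serve, so adding threads would only add blocked threads. These are ceilings at 100% utilization; a sizing target should sit below them.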
When the chain bends: cascading failures
Load balancing keeps a single tier alive, but services call services, and slow responses — not crashes — cause cascading failures. In a chain A→B→C, growing load slows C, creating back pressure: B’s fixed-size thread pool fills waiting on C, then A’s fills waiting on B, then timeouts and refused connections cascade upward. A crash returns an immediate error; a slow service keeps returning late results, denying recovery time. Immediate retries make it worse; use exponential backoff.
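A minimal sketch of retrying with exponential backoff. The function name, parameter names, and defaults are hypothetical. The random jitter is an addition not mentioned above: it is a common refinement that keeps many clients from retrying in lockstep.

```python
import random
import time


def call_with_backoff(operation, attempts=5, base=0.1, cap=5.0,
                      sleep=time.sleep, jitter=random.random):
    """Call `operation`, retrying on any exception with exponential backoff.

    The wait ceiling doubles after every failure (base, 2*base, 4*base, ...)
    up to `cap` seconds. `sleep` and `jitter` are injectable for testing.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise                      # out of retries: surface the error
            ceiling = min(cap, base * 2 ** attempt)
            sleep(ceiling * jitter())      # "full jitter": wait in [0, ceiling)
```

A call such as `call_with_backoff(fetch_order, attempts=4)` (where `fetch_order` is any zero-argument callable of yours) waits up to 0.1 s, 0.2 s, and 0.4 s between its four attempts, giving a slow downstream service progressively more room to recover.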
graph LR
    A["Service A<br/>pool fills waiting on B"] --> B["Service B<br/>pool fills waiting on C"]
    B --> C["Service C<br/>SLOW (not crashed)"]
    C -.->|"back pressure"| B
    B -.->|"back pressure"| A
🛠️ The defensive toolkit
Three patterns keep a load-balanced system standing under stress:
• Fail fast — client-side timeouts (set near the P99 latency so a stalled call releases its thread) plus server-side throttling (HTTP 503 past a threshold), often returning a canned default response.
• Circuit breaker — on a failure threshold (e.g. 25% errors) trip OPEN (fail immediately), then HALF_OPEN (trial calls), back to CLOSED on success — relieving the overwhelmed server and recovering automatically.
• Bulkhead — reserve a subset of threads per request type so lightweight “order status” always has capacity even when heavyweight “create order” is saturated.
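The circuit breaker's state machine can be sketched in a few lines. This is an illustrative, single-threaded version under stated assumptions: the class and parameter names are hypothetical, the error rate is measured over a fixed window of recent calls, and one successful trial call is enough to close the circuit. Library implementations differ in all three respects.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised instead of calling the downstream service while OPEN."""


class CircuitBreaker:
    def __init__(self, failure_rate=0.25, window=20, reset_after=30.0,
                 clock=time.monotonic):
        self.failure_rate = failure_rate      # e.g. 0.25 = trip at 25% errors
        self.reset_after = reset_after        # seconds to stay OPEN
        self.clock = clock                    # injectable for testing
        self.state = "CLOSED"
        self.opened_at = 0.0
        self.outcomes = deque(maxlen=window)  # recent calls, True = failure

    def call(self, operation):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open: failing fast")
            self.state = "HALF_OPEN"          # let one trial call through
        try:
            result = operation()
        except Exception:
            self._record(failed=True)
            raise
        self._record(failed=False)
        return result

    def _record(self, failed):
        if self.state == "HALF_OPEN":
            if failed:
                self._trip()                  # trial failed: back to OPEN
            else:
                self.state = "CLOSED"         # trial succeeded: resume traffic
                self.outcomes.clear()
            return
        self.outcomes.append(failed)
        window_full = len(self.outcomes) == self.outcomes.maxlen
        rate = sum(self.outcomes) / len(self.outcomes)
        if window_full and rate >= self.failure_rate:
            self._trip()

    def _trip(self):
        self.state = "OPEN"
        self.opened_at = self.clock()
        self.outcomes.clear()
```

While the breaker is OPEN, callers get `CircuitOpenError` immediately and can return a canned default, which combines the breaker with the fail-fast pattern and keeps caller threads from piling up behind a struggling service.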
Long-tail latency is the real metric
Real workloads have long-tail response times, so percentiles (P50/P95/P99) describe them far better than averages. At 200M requests/day, a P99 of 3 s means 2M requests/day exceed 3 s. A slow service holds resources: a 50 ms API stalled at 3 s loses ~59 potential requests on that thread. Size timeouts and capacity against the tail, not the mean.
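The arithmetic above, plus a small percentile calculation, written out as a sketch. The function names and the 100-sample latency list are made up for illustration; `percentile` uses the simple nearest-rank definition, which is one of several in common use.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def requests_beyond_percentile(requests_per_day, p):
    """How many requests per day are slower than the p-th percentile."""
    return requests_per_day * (100 - p) // 100


def requests_lost_to_stall(normal_ms, stalled_ms):
    """Extra requests a thread could have served during one stalled call."""
    return stalled_ms // normal_ms - 1


# Hypothetical sample: mostly fast, a few slow, one very slow.
latencies_ms = [50] * 95 + [400] * 4 + [3000]

print(sum(latencies_ms) / len(latencies_ms))        # 93.5 (the mean)
print(percentile(latencies_ms, 50))                 # 50
print(percentile(latencies_ms, 99))                 # 400
print(requests_beyond_percentile(200_000_000, 99))  # 2000000
print(requests_lost_to_stall(50, 3000))             # 59
```

The mean of 93.5 ms describes none of the requests in this sample: the typical one takes 50 ms, and the slowest takes 60 times that.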
See also
- Scalability fundamentals — why statelessness makes load balancing work.
- Serving patterns — replicated, sharded, and scatter/gather topologies.
- Asynchronous messaging at scale — smoothing peaks the LB can’t.
- Caching — reducing the load each replica must serve.
When to use it — and when not
✅ Reach for it when
- You are scaling a service out across multiple replicas behind a single entry point.
- Load fluctuates and you want capacity to grow and shrink automatically.
- You need to remove a single point of failure and survive replica crashes.
⛔ Think twice when
- A single-replica service already meets its SLA through orchestrator restarts alone.
- You are relying on sticky sessions to scale a stateful service; externalize the state instead.
Related topics
What scaling actually means — scale up vs out vs down, the twin principles of replication and optimization, statelessness, and why Amdahl's law caps your gains.
Serving Patterns: Replicated, Sharded, Scatter/Gather
The three multi-node serving topologies (replicate to scale requests, shard to scale data, scatter/gather to scale time), plus the readiness, hot-sharding, and straggler realities that govern them.
Asynchronous Messaging at Scale
Decoupling producers from consumers with queues and logs: persistence and delivery guarantees, pub/sub, competing consumers, dead-letter queues, and the event-log shift Kafka makes.
Check your understanding
1. What is the key difference between Layer 4 and Layer 7 load balancing?
Network-level (L4) NAT routing is faster at low/medium load; application-level (L7) can route on HTTP headers, paths, and content at slightly higher cost. Once replicas saturate, both perform equally.
2. Why do sticky sessions cause load imbalance?
With sticky (session-affinity) routing, each client is pinned to one replica for the life of its session; because sessions last for different durations, load drifts out of balance, which is inevitable at millions of sessions.
3. In a chain A→B→C, what triggers a cascading failure?
Slow responses, not crashes. A crash returns an immediate error, but a slow service keeps returning late results, so B's thread pool fills waiting on C, then A's fills waiting on B, and the exhaustion cascades upward.
4. What does a circuit breaker do when a downstream service is failing?
On a threshold (e.g. 25% errors) the breaker trips OPEN, failing fast; after a timeout it goes HALF_OPEN to trial calls, returning to CLOSED on success — giving the downstream time to recover.