The Architecture Reference


Resilience: Timeouts, Retries, Circuit Breakers, Bulkheads

Stability patterns for distributed systems — timeouts, retries, bulkheads, circuit breakers, and idempotency — plus the four aspects of resilience and why it's ultimately a people property.


🧭 Analogy

A ship survives a hull breach because watertight bulkheads keep one flooded compartment from sinking the whole vessel. Electrical circuit breakers trip before a fault burns the house down. Distributed systems borrow both: isolate failures into compartments, and cut off a failing dependency fast — because the alternative is the whole system going down together.

Resilience is bigger than software

Drawing on David Woods, Newman defines four aspects of resilience:

  • Robustness: absorb expected perturbation.
  • Rebound: recover after trauma (tested backups, playbook).
  • Graceful extensibility: handle the unexpected.
  • Sustained adaptability: keep adapting (learning culture).

Microservices mainly help with robustness (and that’s what adds complexity); the other three depend on flatter organizations, skilled people, tested backups, and a blame-free learning culture. Failure is everywhere, a statistical certainty at scale, and how much resilience you need is defined by users via cross-functional requirements (response-time percentiles under load, availability, durability), enshrined as SLOs. Degrading functionality is a business decision made per interface and per dependency.
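To make "response-time percentiles under load" concrete, here is a minimal sketch of checking a latency SLO against recorded samples. The function names, the nearest-rank percentile method, and the target numbers are illustrative assumptions, not something the source prescribes.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

def meets_latency_slo(samples_ms, p, target_ms):
    """True if the p-th percentile latency is within the target (hypothetical SLO shape)."""
    return percentile(samples_ms, p) <= target_ms
```

An SLO such as "p99 under 300 ms at peak load" would then be checked with `meets_latency_slo(samples, 99, 300)` over samples gathered during a load test.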

The stability patterns

Framed by the AdvertCorp case — a slow “turnip” service exhausted a shared connection pool and took down the whole site, because “in a distributed system, latency kills”:

  • Timeouts — put one on every out-of-process call, with an overall operation budget. (Wells: a few multiples of p99, not the library default of 10s or none.)
  • Retries — for transient failures only, within the budget. Use exponential backoff with jitter, and only retry idempotent requests on 5xx.
  • Bulkheads — separate connection pools per downstream. The most important pattern; it enables load shedding so one sick dependency can’t drown the rest.
  • Circuit breakers — fail fast once failures exceed a threshold, instead of letting calls pile up and exert back-pressure on callers. Newman mandates them for all synchronous downstream calls.
    States: Closed moves to Open when failures exceed the threshold; Open moves to Half-Open after a cool-down timer; Half-Open returns to Closed if a trial call succeeds, or to Open if it fails.
  • Isolation & redundancy — run more than one instance across availability zones.
  • Idempotency — make the underlying business operation safely replayable; “HTTP gives you nothing for free” (see data consistency).
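The circuit breaker state machine (Closed, Open, Half-Open) can be sketched in a few lines. This is a minimal, single-threaded illustration: the class name, the choice to count consecutive failures, and the default threshold and cool-down are all assumptions, and a production breaker would also need locking and metrics.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the downstream."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, cool_down_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.cool_down_s = cool_down_s              # how long to stay open
        self._clock = clock                         # injectable so tests can fake time
        self._failures = 0
        self._opened_at = None
        self.state = "closed"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.cool_down_s:
                raise CircuitOpenError("failing fast; downstream presumed unhealthy")
            self.state = "half_open"  # cool-down elapsed: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Any success resets the count and closes the breaker.
        self._failures = 0
        self.state = "closed"
        return result

    def _record_failure(self):
        self._failures += 1
        # A failed trial call reopens immediately; otherwise wait for the threshold.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

Callers wrap each synchronous downstream call, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a hypothetical client function, and treat `CircuitOpenError` as a signal to degrade or fall back.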

⚠️ A slow service is worse than a broken one

Wells: a broken dependency fails fast; a slow one ties up threads and triggers thundering-herd retries that take down healthy services too. The “small blast radius” only holds if the rest of the system keeps working. Default timeouts (often 10 seconds or infinite) are how one slow service silently consumes your whole thread pool.
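One way to keep retries from turning into a thundering herd is to combine an overall operation budget with exponential backoff and full jitter. The sketch below assumes a hypothetical `TransientError` marking retryable failures and a callable that accepts a per-attempt timeout; all names and default values are illustrative.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a 503 on an idempotent request)."""

def call_with_retries(fn, budget_s=2.0, per_try_timeout_s=0.5,
                      base_delay_s=0.05, max_delay_s=1.0,
                      rng=random.random, sleep=time.sleep, clock=time.monotonic):
    """Call fn(timeout_s) until it succeeds or the overall budget is spent."""
    deadline = clock() + budget_s
    attempt = 0
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("operation budget exhausted")
        try:
            # Never give one attempt more time than the budget has left.
            return fn(min(per_try_timeout_s, remaining))
        except TransientError:
            attempt += 1
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            delay = rng() * cap  # full jitter: uniform in [0, cap)
            if clock() + delay >= deadline:
                raise  # no time for another attempt: surface the last error
            sleep(delay)
```

Randomizing the whole delay, rather than adding a small random offset, spreads retries from many clients across the window so they do not arrive in synchronized waves. Non-transient errors propagate immediately because only `TransientError` is caught.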

Building resilient systems

Beyond per-call patterns: redundancy across AZs, fast startup and graceful shutdown (12-factor disposability), rate-limiting / load shedding (return 503s), fallback behaviour (serve most-popular instead of personalized), caching (even short caching helps for breaking news — but test cache clearing), and going asynchronous to remove temporal coupling.
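Load shedding and fallback can be combined with a per-downstream bulkhead. The following is a minimal sketch that uses a semaphore as the compartment; the class and function names are assumptions, and a real service would translate `BulkheadFullError` into a 503 at its edge.

```python
import threading

class BulkheadFullError(Exception):
    """Raised when a downstream's compartment has no free slots."""

class Bulkhead:
    """Caps concurrent calls to one downstream; sheds excess load instead of queueing."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject immediately rather than tie up a thread.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name}: no free slots")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()

def with_fallback(bulkhead, primary, fallback):
    """Try the primary call inside its bulkhead; on any failure serve the fallback."""
    try:
        return bulkhead.call(primary)
    except Exception:
        return fallback()
```

With one `Bulkhead` per downstream, a slow recommendations service can exhaust only its own slots, and `with_fallback(recs_bulkhead, fetch_personalized, fetch_most_popular)` (hypothetical functions) keeps the page rendering with the most-popular list.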

The CAP theorem applies: in a partition you choose AP (eventual consistency) or CP (sacrifice availability); mix per capability. Beware shared physical fate behind “virtual” hosts, and remember provider SLAs cap their liability, not your losses.

💡 Validate resilience — don't assume it

Chaos engineering runs experiments on the whole system (including people) to build confidence that it withstands turbulent conditions in production. Examples include Game Days (such as Google’s DiRT) and Netflix’s Simian Army (Chaos Monkey, Chaos Gorilla). Also test backups and failovers regularly, since “a failover process that isn’t tested will be found to have stopped working when you need it”, and reason about one failure at a time, watching for systems less independent than you assumed.

🔑 Key insight

Resilience is ultimately a property of the people building and running the system. Patterns help, but blameless post-mortems and a just culture matter more — the Telstra case shows blaming individuals creates fear and destroys learning. “If robustness relies on humans never making mistakes, you’re in for a rocky ride.”


When to use it — and when not

✅ Reach for it when

  • You make synchronous out-of-process calls that could fail or slow down.
  • You want to limit the blast radius of a failing dependency.
  • You are building robustness deliberately, not assuming it.

⛔ Think twice when

  • The interaction is fully asynchronous and already temporally decoupled.
  • Your real question is which consistency model you need, not how to handle failure; see data consistency.

Check your understanding


1. Which stability pattern does Newman call the most important?

Bulkheads isolate failures so one slow downstream can't exhaust a shared pool and take down everything (the AdvertCorp 'turnip' service).

2. What does a circuit breaker do?

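It fails fast once failures exceed a threshold, instead of letting calls to an unhealthy dependency pile up. Newman mandates circuit breakers for all synchronous downstream calls; in a distributed system, latency kills.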

3. What are David Woods's four aspects of resilience?

Robustness absorbs expected perturbation, rebound recovers after trauma, graceful extensibility handles the unexpected, and sustained adaptability keeps adapting over time.

4. What is chaos engineering really about?

It targets the whole sociotechnical system (Game Days, Netflix's Simian Army), not just the software.
