The Architecture Reference

Ds patterns · Distributed Systems · Advanced

Serving Patterns: Replicated, Sharded, Scatter/Gather

The three multi-node serving topologies — replicate to scale requests, shard to scale data, scatter/gather to scale time — plus the readiness, hot-sharding, and straggler realities that govern them.

🧭 Analogy

A restaurant scales three ways. To serve more diners it hires identical cooks (replication). When the menu outgrows one kitchen it splits stations — grill, pastry, salad — each owning part of the work (sharding). To plate one giant banquet order faster, every station works on it at once and a head chef assembles the plates (scatter/gather). Same kitchen, three different dimensions of “more.”

Serving patterns are multi-node, loosely coupled topologies for long-running serving systems. Built from single-node patterns, they decouple services behind formal APIs so each scales independently. Three patterns cover most needs, each scaling a different dimension.

Replicated load-balanced services — scale requests

The simplest and most familiar pattern: every server is identical and can serve every request, with a load balancer in front of a scalable number of replicas.

  • Stateless services keep no saved state, so consecutive requests from the same client can go to different instances. They replicate for redundancy and scale. Burns’ availability argument: a “three-nines” (99.9%) SLA leaves only ~1.4 minutes of downtime per day (24 × 60 × 0.001), so a single instance would have to complete every upgrade in under 1.4 minutes. Instead, run at least two replicas behind a load balancer so users are still served during rollouts and crashes.
  • Readiness probes are distinct from liveness probes: readiness tells the load balancer an instance is ready (databases connected, files loaded), not merely alive.
  • Session-tracked services route a user’s requests to the same replica (for in-memory caching), typically by hashing source+destination IP — but IP tracking fails across NAT, so use cookies or headers.
  • Caching layer — add a few large cache replicas (e.g. two with 5 GB) as a separate tier above many small web replicas, rather than a per-server sidecar (which would scale cache and web together and lower the hit rate). Reverse proxies like Varnish add rate limiting / DoS defense (returning HTTP 429) and, fronted by nginx, SSL termination — yielding a three-tier stack: nginx (SSL) → Varnish (cache) → web app.
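
The readiness point above can be sketched in a few lines. This is a minimal illustration, not a real load balancer: the `Replica` and `LoadBalancer` classes and their fields are hypothetical names, and round-robin is just one possible balancing policy. The key behavior is that a replica that is alive but not yet ready receives no traffic.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Replica:
    """Hypothetical replica with separate liveness and readiness flags."""
    name: str
    alive: bool = True    # the process is running (liveness)
    ready: bool = False   # dependencies connected, data loaded (readiness)


class LoadBalancer:
    """Round-robin over replicas that are both alive and ready."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self._counter = count()

    def pick(self) -> Replica:
        # Re-evaluate readiness on every request so a replica that is
        # mid-rollout is skipped until it reports ready.
        candidates = [r for r in self.replicas if r.alive and r.ready]
        if not candidates:
            raise RuntimeError("no ready replicas")
        return candidates[next(self._counter) % len(candidates)]


if __name__ == "__main__":
    a = Replica("a", ready=True)
    b = Replica("b")              # alive, but still starting up
    lb = LoadBalancer([a, b])
    print(lb.pick().name)         # only "a" is eligible here
    b.ready = True                # "b" finished loading
    print({lb.pick().name for _ in range(4)})
```

With two replicas, one can be taken out of rotation for an upgrade (by reporting not ready) while the other keeps serving, which is the rollout behavior the availability argument relies on.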

Sharded services — scale data

Where replicated services are homogeneous, in sharded services each replica (a shard) serves only a subset of requests, and a load-balancing root routes each request to the right shard. Replicated suits stateless workloads; sharded suits stateful ones, primarily because state is too large for one machine (see the caching discussion of effective cache size).

graph TD
ROOT["Root / shard router"] -->|"Shard = hash(key) % N"| S0["Shard 0 (+ replicas)"]
ROOT --> S1["Shard 1 (+ replicas)"]
ROOT --> S2["Shard 2 (+ replicas)"]
HOT["Hot shard A"] -. "autoscale independently" .-> ROOT
  • Sharding functions: Shard = ShardingFunction(Req), with two essential properties: determinism (same input → same shard) and uniformity (even load). Typically hash(Req) % N.
  • Selecting the key — hashing the whole request is usually wrong. For a cache, hash request.path (so requests differing only in time/IP share a shard); if locale matters, shard(country(request.ip), request.path) is the right granularity — neither too general nor too specific.
  • Replicated, sharded — implement each shard as a replicated service so it survives failures and rollouts, and supports hot sharding (a viral item makes its shard hot; replicate that shard and consolidate cool ones).
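
A sharding function with a chosen key might look like the sketch below. The names `stable_hash` and `shard_for` are hypothetical, and the country code is assumed to be looked up from the client IP elsewhere. SHA-256 is used instead of Python's built-in `hash()`, which is salted per process for strings and so would not give the same shard across processes. The modulo step is kept only to mirror hash(Req) % N.

```python
import hashlib
from typing import Optional


def stable_hash(text: str) -> int:
    """Deterministic 64-bit hash (same input gives the same value in every process)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shard_for(path: str, num_shards: int, country: Optional[str] = None) -> int:
    """Pick a shard from the request path, optionally scoped by country.

    Only the fields that affect the cached response go into the key, so
    requests that differ in timestamp or client IP land on the same shard.
    """
    key = path if country is None else f"{country}:{path}"
    return stable_hash(key) % num_shards


if __name__ == "__main__":
    print(shard_for("/cats.html", 10))
    print(shard_for("/cats.html", 10, country="DE"))
```

Determinism comes from the stable hash; uniformity depends on the key having enough distinct values to spread across shards.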

⚠️ Use consistent hashing or lose everything on resize

Naive hash % N re-sharding (10→11 shards changes %10 to %11) remaps most keys — effectively a total cache failure. Consistent hashing guarantees only ~#keys / #shards remap (under 10% going 10→11). Always shard with a consistent hash function (e.g. nginx hash $request_uri consistent).
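
The difference can be measured directly. The sketch below compares modulo sharding with a simple hash ring when growing from 10 to 11 shards. The `HashRing` class, the shard names, and the choice of 100 virtual nodes per shard are illustrative assumptions, not how any particular proxy implements it, and the exact percentage the ring reports varies with the hash and the number of virtual nodes.

```python
import bisect
import hashlib


def stable_hash(text: str) -> int:
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def modulo_shard(key: str, num_shards: int) -> int:
    return stable_hash(key) % num_shards


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, shards, vnodes: int = 100):
        # Each shard owns `vnodes` points on the ring to even out its share.
        self._points = sorted(
            (stable_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._hashes = [point for point, _ in self._points]

    def shard_for(self, key: str) -> str:
        # A key belongs to the first ring point clockwise from its hash.
        idx = bisect.bisect_right(self._hashes, stable_hash(key)) % len(self._hashes)
        return self._points[idx][1]


def moved_fraction(keys, before, after) -> float:
    """Fraction of keys whose shard changes between two sharding functions."""
    return sum(1 for k in keys if before(k) != after(k)) / len(keys)


keys = [f"/item/{i}" for i in range(10_000)]
ring10 = HashRing([f"shard-{i}" for i in range(10)])
ring11 = HashRing([f"shard-{i}" for i in range(11)])

naive_moved = moved_fraction(
    keys, lambda k: modulo_shard(k, 10), lambda k: modulo_shard(k, 11)
)
ring_moved = moved_fraction(keys, ring10.shard_for, ring11.shard_for)
print(f"modulo: {naive_moved:.0%} of keys moved; ring: {ring_moved:.0%} moved")
```

On the ring, the only keys that move are the ones the new shard takes over; every other key keeps its shard and its cached data.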

Scatter/gather — scale time

Where replication scales requests/second and sharding scales data size, scatter/gather scales time by parallelizing a single request. A tree (root + leaves) farms one request out to all leaves simultaneously; each returns a partial result, and the root combines them.

graph TD
REQ["Single request"] --> ROOT["Root"]
ROOT -->|"scatter"| L0["Leaf 0<br/>partial result"]
ROOT -->|"scatter"| L1["Leaf 1<br/>partial result"]
ROOT -->|"scatter"| L2["Leaf 2<br/>partial result"]
L0 -->|"gather"| ROOT
L1 -->|"gather"| ROOT
L2 -->|"gather"| ROOT
ROOT --> OUT["Combined result"]
  • Root distribution (replicated leaves) — for “embarrassingly parallel” work. The document-search example farms “cat” to one leaf and “dog” to another, then the root intersects the returned document sets.
  • Leaf sharding — when data exceeds one machine, shard documents across leaves; a request goes to all shards and the root takes the union of results.
  • Scale for reliability — replicate each shard so the root can load-balance each leaf across healthy replicas, masking failures and enabling under-load upgrades.
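
The leaf-sharding variant can be sketched with a thread pool standing in for network calls. The `Leaf` class and `scatter_gather` function are hypothetical names, the documents are made up for illustration, and a real root would also need timeouts and error handling for slow or failed leaves.

```python
from concurrent.futures import ThreadPoolExecutor


class Leaf:
    """Hypothetical leaf holding one shard of the document set."""

    def __init__(self, documents):
        self.documents = documents  # maps document id to its text

    def search(self, terms):
        # Partial result: ids of local documents containing every term.
        return {
            doc_id
            for doc_id, text in self.documents.items()
            if all(term in text.lower().split() for term in terms)
        }


def scatter_gather(leaves, terms):
    """Send the same query to every leaf in parallel, then union the partials."""
    with ThreadPoolExecutor(max_workers=len(leaves)) as pool:
        partials = list(pool.map(lambda leaf: leaf.search(terms), leaves))
    return set().union(*partials)


if __name__ == "__main__":
    leaves = [
        Leaf({"d1": "cat and dog", "d2": "cat only"}),
        Leaf({"d3": "dog cat park", "d4": "bird"}),
    ]
    print(sorted(scatter_gather(leaves, ["cat", "dog"])))
```

The root's total latency here is bounded below by the slowest leaf, since the union cannot be computed until every partial result has arrived.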

⚠️ More leaves is not always faster — the straggler problem

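The root waits for the slowest leaf. If each leaf has a 99th-percentile latency of 2 s, a request scattered to 5 leaves has a ~5% chance of hitting at least one 2 s leaf (1 − 0.99^5); scattered to 100 leaves, that rises to ~63% (1 − 0.99^100), so most requests take 2 s, assuming leaves are slow independently. Per-leaf failure compounds the same way: with 100 leaves each failing 1% of the time, ~63% of requests see at least one failed leaf. Conclusions: parallelism has diminishing returns because of per-leaf overhead, and leaf P99 performance matters far more than in a single-server system because each request fans out to many leaves.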
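
The compounding follows from a one-line formula: if each leaf is slow (or fails) independently with probability p, a request fanned out to n leaves is affected with probability 1 − (1 − p)^n. The sketch below evaluates it for a few fan-outs. The function name is hypothetical, and independence between leaves is an assumption that real systems only approximate.

```python
def p_any_slow(num_leaves: int, p_slow: float = 0.01) -> float:
    """Probability that at least one of `num_leaves` independent leaves is slow."""
    return 1.0 - (1.0 - p_slow) ** num_leaves


if __name__ == "__main__":
    for n in (1, 5, 20, 100, 500):
        print(f"{n:>4} leaves: {p_any_slow(n):.1%} of requests hit a P99 leaf")
```

At 5 leaves this is about 4.9%, at 100 leaves about 63%, and it keeps climbing toward certainty as fan-out grows, which is why tightening the leaf tail matters more than improving the median.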

💡 Real systems combine all three

These patterns aren’t mutually exclusive — a production system layers them: replicated SSL/cache/web tiers in front, sharded (and replicated) storage behind, scatter/gather for parallel queries. Picking patterns from a shared catalog makes such systems easier to design and debug because the lessons transfer.

When to use it — and when not

✅ Reach for it when

  • Replicated: stateless services that scale by adding identical replicas behind a load balancer.
  • Sharded: stateful services or caches whose data is too large for one machine.
  • Scatter/gather: a single request needing large, mostly independent parallel processing.

⛔ Think twice when

  • Scatter/gather with high fan-out when tail latency or per-leaf failure would dominate every request.
  • Sharding stateless workloads that replicate perfectly well, adding routing complexity for no gain.

Check your understanding

1. What does each serving pattern primarily scale?

Replication adds identical replicas to serve more requests; sharding splits data across machines; scatter/gather farms one request out to many leaves to finish it faster.

2. How does a readiness probe differ from a liveness/health probe?

An instance can be alive but not yet ready (still loading plugins or connecting to databases). The readiness probe gates traffic until it can actually serve.

3. Why does naive `hash % N` re-sharding behave like a total cache failure when N changes?

Going from %10 to %11 moves nearly every key, invalidating the cache. Consistent hashing limits movement to roughly #keys/#shards (e.g. under 10% going 10→11).

4. What is the straggler problem in scatter/gather?

The root waits for its slowest leaf, so rare slow responses become common as fan-out grows: a 99th-percentile leaf latency affects ~5% of requests at 5 leaves and ~63% (1 − 0.99^100) at 100 leaves. Per-leaf failure compounds the same way, so leaf P99 matters most.
