The Architecture Reference

API management · APIs & Communication · Intermediate

Rate Limiting and Quotas

Protect APIs from overuse and abuse: rate limiting rejects on per-request properties, load shedding rejects on system state — using fixed/sliding window or token/leaky bucket algorithms.

🎢 Analogy

Rate limiting is the theme-park ride operator counting riders per hour per ticket — no single guest can hog the queue. Load shedding is closing the gate entirely when the ride itself is over capacity, regardless of who is waiting. One looks at the person in front; the other looks at the state of the machine.

Two distinct controls

Protecting APIs from overuse and abuse uses two related but different mechanisms, central to mitigating the Denial of Service category of STRIDE:

  • Rate limiting rejects requests based on individual request properties — too many from a given user, app, or location.
  • Load shedding rejects requests based on overall system state — for example, the database is at capacity, so new work is turned away regardless of who sent it.

Both are best applied at the API gateway, the centralized point through which all ingress flows. Genuine volumetric DDoS is better handled by specialist providers or CDNs.

graph TD
R["Incoming request"] --> RL{"Rate limit?<br/>(per user/app/IP)"}
RL -->|over budget| Rej1["429 Too Many Requests"]
RL -->|within budget| LS{"System at capacity?<br/>(load shedding)"}
LS -->|overloaded| Rej2["503 / shed load"]
LS -->|healthy| Backend["Backend service"]
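The two checks in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not a gateway implementation: the `Gateway` class, its `budget_per_client` and `max_in_flight` parameters, and the in-memory counters are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Gateway:
    """Hypothetical gateway that applies rate limiting, then load shedding."""
    budget_per_client: int   # assumed per-client request budget
    max_in_flight: int       # assumed capacity threshold for the backend
    in_flight: int = 0
    used: dict = field(default_factory=dict)

    def admit(self, client_id: str) -> int:
        """Return the HTTP status the gateway would answer with."""
        # 1. Rate limiting: a decision about this caller only.
        spent = self.used.get(client_id, 0)
        if spent >= self.budget_per_client:
            return 429
        self.used[client_id] = spent + 1
        # 2. Load shedding: a decision about the system, whoever is calling.
        if self.in_flight >= self.max_in_flight:
            return 503
        self.in_flight += 1
        return 200

    def done(self) -> None:
        """Mark one backend request as finished."""
        self.in_flight = max(0, self.in_flight - 1)


gw = Gateway(budget_per_client=2, max_in_flight=2)
statuses = [gw.admit("a"), gw.admit("a"), gw.admit("a"), gw.admit("b")]
# statuses == [200, 200, 429, 503]
```

Client "a" is rejected for its own behaviour (429), while client "b" is rejected only because the backend is full (503). In this sketch a shed request still counts against the caller's budget; a real gateway might choose otherwise.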

The four algorithms

Mastering API Architecture names four rate-limiting strategies:

  • Fixed window — count requests in a fixed time bucket (e.g. per minute); simple but allows bursts at window edges.
  • Sliding window — a rolling window smooths the edge-burst problem.
  • Token bucket — tokens refill at a steady rate; each request spends one; allows controlled bursts up to the bucket size.
  • Leaky bucket — requests drain at a fixed rate, smoothing output regardless of input spikes.
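The edge-burst problem is easiest to see side by side. The sketch below assumes a limit of 5 requests per 60 seconds and uses a timestamp-log variant of the sliding window; the class names and the `admitted` helper are made up for the illustration, and time is passed in explicitly so the result is deterministic.

```python
from collections import deque


class FixedWindow:
    """Counts requests per fixed time bucket (here: per `window` seconds)."""

    def __init__(self, limit: int, window: float):
        self.limit, self.window = limit, window
        self.current_window = None
        self.count = 0

    def allow(self, now: float) -> bool:
        window_id = int(now // self.window)
        if window_id != self.current_window:
            # A new bucket starts: the counter resets to zero.
            self.current_window, self.count = window_id, 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


class SlidingWindowLog:
    """Keeps timestamps of admitted requests from the last `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit, self.window = limit, window
        self.stamps = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have fallen out of the rolling window.
        while self.stamps and self.stamps[0] <= now - self.window:
            self.stamps.popleft()
        if len(self.stamps) >= self.limit:
            return False
        self.stamps.append(now)
        return True


def admitted(limiter, times) -> int:
    """Number of requests the limiter lets through for the given timestamps."""
    return sum(limiter.allow(t) for t in times)


# Ten requests within two seconds, straddling the minute boundary at t=60.
burst = [59.0] * 5 + [61.0] * 5
fixed_total = admitted(FixedWindow(limit=5, window=60.0), burst)         # 10
sliding_total = admitted(SlidingWindowLog(limit=5, window=60.0), burst)  # 5
```

The fixed window admits all ten requests because the counter resets at the boundary, which is double the nominal limit in two seconds. The sliding window still remembers the first five and rejects the rest.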

For tiered plans, the gateway can use the service identity or request-header metadata as the fingerprint, so a free tier and a paying tier get different budgets. When a service mesh applies the same kind of control to east–west traffic, it is called traffic shaping.
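As a rough sketch of tiered budgets, the example below gives each client its own token bucket, sized by plan. The header names `X-Client-Id` and `X-Plan`, the `TIERS` table, and its numbers are hypothetical; a real gateway would take the fingerprint from whatever identity or metadata it trusts.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float      # maximum burst size
    refill_rate: float   # tokens added per second
    last: float = 0.0    # time of the previous check
    tokens: float = 0.0

    def __post_init__(self):
        self.tokens = self.capacity  # a new bucket starts full

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed, never beyond the bucket size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0   # each request spends one token
            return True
        return False


# Hypothetical plans: (bucket size, tokens per second).
TIERS = {"free": (2, 1.0), "paid": (10, 5.0)}
buckets = {}


def allow(headers: dict, now: float) -> bool:
    """Pick the caller's bucket from request headers (assumed names)."""
    tier = headers.get("X-Plan", "free")
    if tier not in TIERS:
        tier = "free"  # unknown plans get the smallest budget
    key = (headers.get("X-Client-Id", "anonymous"), tier)
    if key not in buckets:
        capacity, rate = TIERS[tier]
        buckets[key] = TokenBucket(capacity, rate, last=now)
    return buckets[key].allow(now)


free_client = {"X-Client-Id": "c1", "X-Plan": "free"}
free_burst = [allow(free_client, 0.0) for _ in range(3)]   # [True, True, False]
```

The free client can burst up to its bucket size of two, is then rejected, and regains one request per second as tokens refill.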

Fail open vs fail closed

When a protective control itself fails, its default behaviour matters:

  • Fail open permits access when the control fails: desirable when availability is paramount (for example, medical emergency services).
  • Fail closed blocks access when the control fails: the right default for most financial or government APIs.

graph TD
F["Rate-limit control fails"] --> Q{"What matters most?"}
Q -->|"availability paramount"| Open["Fail open<br/>permit access<br/>e.g. emergency services"]
Q -->|"security paramount"| Closed["Fail closed<br/>block access<br/>e.g. financial, government"]
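In code, the choice often comes down to what the caller does when the limiter itself cannot answer. The sketch below assumes a limiter that raises a made-up `LimiterUnavailable` exception when its backing store is unreachable; the `check` function and its `fail_open` flag are illustrative names only.

```python
class LimiterUnavailable(Exception):
    """Raised by the hypothetical limiter when it cannot reach its store."""


def check(limiter, client_id: str, fail_open: bool) -> bool:
    """Return True if the request may proceed."""
    try:
        return limiter(client_id)
    except LimiterUnavailable:
        # The control has failed: fall back to the configured default.
        return fail_open


def broken_limiter(client_id: str) -> bool:
    raise LimiterUnavailable("counter store unreachable")


emergency_api = check(broken_limiter, "caller-1", fail_open=True)    # True
payments_api = check(broken_limiter, "caller-1", fail_open=False)    # False
```

The flag only matters when the control is broken; while the limiter is healthy, its own decision is returned unchanged.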

Rate-limit internal calls too

Evolving systems can create circular dependencies — infinite internal call loops — causing a ‘friendly-fire’ denial of service. Rate-limit and monitor internal calls, not just external ones.
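One way to picture this is a per-edge budget on service-to-service calls. The sketch below is an assumption-laden toy: `InternalCallMonitor`, the service names `orders` and `billing`, and the budget of three calls per window are all invented for the example.

```python
from collections import Counter


class InternalCallMonitor:
    """Counts calls per (caller, callee) edge within one monitoring window."""

    def __init__(self, budget_per_window: int):
        self.budget = budget_per_window
        self.calls = Counter()

    def record(self, caller: str, callee: str) -> bool:
        """Count one call; return False once the edge exceeds its budget."""
        self.calls[(caller, callee)] += 1
        return self.calls[(caller, callee)] <= self.budget

    def over_budget(self) -> list:
        """Edges worth alerting on in the current window."""
        return sorted(edge for edge, n in self.calls.items() if n > self.budget)


def simulate_loop(monitor: InternalCallMonitor, hops: int) -> int:
    """Two hypothetical services that keep calling each other."""
    caller, callee = "orders", "billing"
    for hop in range(hops):
        if not monitor.record(caller, callee):
            return hop   # the limiter breaks the loop here
        caller, callee = callee, caller
    return hops


monitor = InternalCallMonitor(budget_per_window=3)
hops_completed = simulate_loop(monitor, hops=1000)   # 6
```

Without the budget the loop would run for all 1,000 hops; with it, the cycle is cut after six, and `over_budget()` names the edge to investigate.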

Context turns metrics into signals

Don’t apply RED metrics (rate, errors, duration) blindly. A burst of 403s could indicate a malicious actor; a spike of 401s could mean a stolen token or a compromised vendor. Establish a baseline, alert when a metric falls outside the expected range, and read these status-code patterns as security signals, not just noise.
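A baseline check can be as simple as comparing the current count against recent history. The sketch below is one possible approach, not a recommended detector: the three-standard-deviation threshold, the floor of 1.0 on the spread, and the sample counts are all assumptions.

```python
from statistics import mean, stdev


def out_of_range(history: list, current: int, k: float = 3.0) -> bool:
    """True if `current` is more than k standard deviations from the baseline.

    `history` holds counts from earlier windows (for example 403s per minute).
    The floor of 1.0 on the spread is an assumption that avoids alerting on
    tiny wobbles when the history is almost constant.
    """
    baseline = mean(history)
    spread = max(stdev(history), 1.0)
    return abs(current - baseline) > k * spread


forbidden_per_minute = [4, 6, 5, 7, 5, 6]   # hypothetical baseline of 403s
normal_minute = out_of_range(forbidden_per_minute, 7)     # False
suspicious_minute = out_of_range(forbidden_per_minute, 40)  # True
```

The same check can run separately for 401s and 403s, since the source treats them as different signals.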

A worked DREAD example

Mastering API Architecture scores a DDoS against a gateway with no rate limiting at Damage 8, Reproducibility 8, Exploitability 5, Affected Users 10, and Discoverability 10, giving 8.2. That makes it the highest priority, and it is mitigated precisely by rate limiting and load shedding.
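The 8.2 figure is consistent with taking the arithmetic mean of the five scores, which is the usual way a DREAD total is computed. A short check, with dictionary keys chosen for the example:

```python
# DREAD scores quoted above for a DDoS against a gateway with no rate limiting.
scores = {
    "damage": 8,
    "reproducibility": 8,
    "exploitability": 5,
    "affected_users": 10,
    "discoverability": 10,
}

# Assumed formula: the plain average of the five categories.
dread = sum(scores.values()) / len(scores)   # 41 / 5 = 8.2
```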

See also

  • API gateways — where rate limiting and load shedding are enforced.
  • API security — rate limiting as a DoS mitigation in STRIDE.
  • Service mesh — traffic shaping and policing for east–west traffic.

When to use it — and when not

✅ Reach for it when

  • You must protect backends from denial-of-service, runaway clients, or cost blowouts.
  • You offer tiered plans (free vs paying) needing different request budgets.
  • You need to shed load gracefully when a downstream resource is at capacity.

⛔ Think twice when

  • Genuine volumetric DDoS — offload that to a specialist provider or CDN, not just your gateway.
  • Internal trusted traffic where the overhead outweighs the risk (but still monitor it).

Check your understanding

1. What is the difference between rate limiting and load shedding?

Rate limiting acts on properties like too many requests from a user/app/location; load shedding acts when the system itself (e.g. the DB) is at capacity.

2. Which is NOT a named rate-limiting algorithm in the book?

The strategies covered are fixed window, sliding window, token bucket, and leaky bucket.

3. Where is rate limiting best applied?

The gateway is the single ingress point, making it the natural place to rate-limit; volumetric DDoS is best handled by specialist providers/CDNs.

4. What is the difference between 'fail open' and 'fail closed'?

The default behaviour under failure must match your requirements — availability-critical systems may fail open, security-critical ones fail closed.
