The Architecture Reference

Choreography and Async Communication

Coordinate independent services without a central conductor: stateless choreography, shared correlation IDs, progress resources, and fallbacks for the failures distributed systems guarantee.

💃 Analogy

Choreography is a flock of starlings. No bird is in charge, yet the murmuration moves as one because each bird reacts to its neighbours by simple shared rules. Take any bird away and the flock adapts. An orchestrated system is a marching band following a drum major — impressive, but if the drum major trips, everyone stops.

Workflow without a conductor

In stateless choreography, there is no central engine. The workflow is an emergent by-product of independent services reacting to one another — the style that inspired the language Ballerina, and the model behind cloud pub/sub services (Pub/Sub, Eventarc, SNS/SQS). Compared with orchestration, it is loosely coupled, resilient, and easy to modify in parts. The trade-off is observability: it is hard to monitor overall progress and ecosystem health.

When to choose it: few steps and branches favour orchestration; involved, branching workflows favour choreography.

sequenceDiagram
participant Cart as Cart service
participant Tax as Tax service
participant Pay as Payment service
participant Ship as Shipping service
Cart->>Tax: checkout event (correlation-id: J1)
Tax->>Pay: taxed event (correlation-id: J1)
Pay->>Ship: paid event (correlation-id: J1)
Ship-->>Cart: shipped event (correlation-id: J1)
Note over Cart,Ship: No central engine — each service reacts to the previous
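
The flow in the diagram can be sketched with a tiny in-memory event bus. This is only an illustration: the `Bus` class and the `reacts` helper are hypothetical stand-ins for a real pub/sub service, and each "service" is reduced to a single handler that emits the next event while carrying the correlation ID forward.

```python
from collections import defaultdict


class Bus:
    """Hypothetical in-memory stand-in for a pub/sub service."""

    def __init__(self):
        self.handlers = defaultdict(list)  # event type -> list of handlers
        self.log = []                      # (event type, correlation id) in publish order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event):
        self.log.append((event["type"], event["correlation_id"]))
        for handler in self.handlers[event["type"]]:
            handler(event)


def reacts(bus, listens_for, emits):
    """Register a service that reacts to one event type by emitting the next."""
    def handler(event):
        # The correlation id is passed along unchanged on every message.
        bus.publish({"type": emits, "correlation_id": event["correlation_id"]})
    bus.subscribe(listens_for, handler)


bus = Bus()
reacts(bus, "checkout", "taxed")   # Tax service
reacts(bus, "taxed", "paid")       # Payment service
reacts(bus, "paid", "shipped")     # Shipping service

# The Cart service starts the job; nothing else coordinates the steps.
bus.publish({"type": "checkout", "correlation_id": "J1"})
for event_type, correlation_id in bus.log:
    print(event_type, correlation_id)
```

No component in the sketch knows the whole workflow; removing or replacing one `reacts` registration changes the flow without touching the others.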

Shared state and correlation

Services in a choreography share state, not data models. Use a standalone HTTP shared-state resource per job (advertised via rel="sharedState"), read by each task and written back with idempotent PUT. Property names must be agreed via a shared vocabulary, and values must be expressed idempotently (an absolute updatedPrice, never a percentIncrease).
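
A minimal sketch of why absolute values matter, assuming an in-memory dictionary in place of the HTTP shared-state resource. The function names and the `updatedPriceCents` property are hypothetical; prices are kept in integer cents to avoid floating-point noise.

```python
shared_state = {}  # job id -> shared-state document (stand-in for the HTTP resource)


def put_shared_state(job_id, document):
    """Idempotent PUT: replace the stored document wholesale."""
    shared_state[job_id] = dict(document)
    return shared_state[job_id]


def apply_percent_increase(job_id, percent):
    """Anti-pattern: a relative update that compounds if the message is replayed."""
    state = shared_state[job_id]
    state["updatedPriceCents"] = state["updatedPriceCents"] * (100 + percent) // 100
    return state


# Absolute value: delivering the same PUT twice leaves the same state.
put_shared_state("J1", {"updatedPriceCents": 11000})
put_shared_state("J1", {"updatedPriceCents": 11000})
absolute_result = shared_state["J1"]["updatedPriceCents"]

# Relative value: delivering the same message twice applies the increase twice.
put_shared_state("J2", {"updatedPriceCents": 10000})
apply_percent_increase("J2", 10)
apply_percent_increase("J2", 10)
relative_result = shared_state["J2"]["updatedPriceCents"]

print(absolute_result, relative_result)
```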

Correlate the distributed work with IDs that travel on every message:

  • jobID → the correlation-id header (the whole job).
  • taskID → the request-id header (one step), in both request and response.
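
The header mapping above might look like this in practice. The helper names are hypothetical, and generating the IDs as UUIDs is an assumption of the sketch.

```python
import uuid


def new_id():
    """Generate an identifier for a job or a task (UUIDs assumed here)."""
    return str(uuid.uuid4())


def task_request_headers(job_id, task_id=None):
    """Headers for one step: the job ID is shared, the task ID is per step."""
    return {
        "correlation-id": job_id,
        "request-id": task_id or new_id(),
    }


def task_response_headers(request_headers):
    """Echo both IDs on the response so either side can match them up."""
    return {name: request_headers[name] for name in ("correlation-id", "request-id")}


job_id = new_id()
first = task_request_headers(job_id)
second = task_request_headers(job_id)
reply = task_response_headers(first)
```

Here `first` and `second` share a `correlation-id` but carry different `request-id` values, and `reply` repeats the IDs of the request it answers.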

Restore visibility with a progress resource

To compensate for choreography’s blind spot, expose a progress resource per job: a cacheable HTTP resource holding job and task metadata (jobStatus, taskStatus, start/stop times, taskMaxTTL) plus a refresh link to poll. Limit it to basic feedback rather than a trace log, and never include private or debugging data.
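
One possible shape for such a document is sketched below. Only jobStatus, taskStatus, and taskMaxTTL come from the description above; the other field names, the status values, the `links` layout, and the `api.example.org` URL are assumptions made for illustration. An allow-list of public fields keeps internal values out of the output.

```python
import json

# Only these task fields are ever exposed; everything else is dropped.
PUBLIC_TASK_FIELDS = (
    "taskID", "taskStatus", "taskStartDateTime", "taskStopDateTime", "taskMaxTTL",
)


def progress_resource(job_id, job_status, tasks, base_url="https://api.example.org"):
    """Build a minimal progress document for one job (hypothetical layout)."""
    return {
        "jobID": job_id,
        "jobStatus": job_status,
        "tasks": [
            {name: task[name] for name in PUBLIC_TASK_FIELDS if name in task}
            for task in tasks
        ],
        "links": [{"rel": "refresh", "href": f"{base_url}/jobs/{job_id}/progress"}],
    }


internal_tasks = [{
    "taskID": "T1",
    "taskStatus": "completed",
    "taskStartDateTime": "2024-01-01T10:00:00Z",
    "taskStopDateTime": "2024-01-01T10:00:05Z",
    "taskMaxTTL": 30,
    "dbConnection": "internal-only",  # private value that must not leak
}]

document = progress_resource("J1", "working", internal_tasks)
print(json.dumps(document, indent=2))
```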

Keep the progress resource minimal

A progress resource is for monitoring, not diagnostics. Don’t embed management actions (restart, rollback) in it — with code- or DSL-driven workflows those cause runtime conflicts — and never leak internal/private values, which is a security exposure.

Plan for failure — it is guaranteed

“Any large system is going to be operating most of the time in failure mode.” With several dependencies, failure likelihood grows fast, so plan a fallback per dependency:

  • Automatic retries — only on idempotent methods (GET, HEAD, PUT, DELETE), only for retryable cases (5xx and connection errors, never 4xx). Prefer exponential backoff, capped at ~3 attempts; mitigate locally so an external mitigation service does not become a fatal dependency.
  • Static fallback — a configured alternate location.
  • Dynamic fallback — a service-registry lookup.
  • Queue for later replay — respond 202 Accepted and replay when the dependency recovers (must be in the documented design).
  • Give up — stop and return 500, ideally with a wait time.

graph TD
F["Dependency failure"] --> Q{"Transient and idempotent?"}
Q -->|"yes, 5xx or connection"| Retry["Retry with backoff, cap ~3"]
Q -->|"no"| Alt{"Alternate available?"}
Alt -->|"static or dynamic"| Fallback["Use fallback location"]
Alt -->|"can defer"| Queue["202 Accepted, replay later"]
Alt -->|"none"| GiveUp["Return 500 with wait time"]
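
The retry rule can be sketched as follows. This is a minimal illustration under stated assumptions: `send` is a hypothetical callable that returns an HTTP status code or raises `ConnectionError`, the base delay of 0.5 seconds is arbitrary, and `sleep` is injectable so the logic can be exercised without waiting.

```python
import time

IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE"}


def is_retryable(method, status):
    """Retry only idempotent methods, on 5xx or a connection error (status None)."""
    if method.upper() not in IDEMPOTENT_METHODS:
        return False
    return status is None or 500 <= status <= 599


def call_with_retries(send, method, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call send() up to max_attempts times with exponential backoff.

    Returns the last status code, or None if every attempt failed to connect.
    """
    status = None
    for attempt in range(1, max_attempts + 1):
        try:
            status = send()
        except ConnectionError:
            status = None
        if status is not None and status < 500:
            return status  # success or a 4xx: never retried
        if not is_retryable(method, status) or attempt == max_attempts:
            break
        sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    return status
```

When `call_with_retries` still returns a 5xx status or `None`, the caller moves on to the next option in the list: a static or dynamic fallback, queueing for replay, or giving up.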

Idempotency makes choreography safe

Because messages cross unreliable networks and may be redelivered, every step must be safely repeatable. Make writes idempotent (replacement values, conditional ETags) so a retried or duplicated event never double-applies — the bedrock that lets a leaderless flow recover on its own.
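
A conditional write can be sketched like this, assuming an in-memory resource in place of a real HTTP server. The `Resource` class is hypothetical, and hashing the serialized body is only one way to derive an ETag.

```python
import hashlib
import json


class Resource:
    """In-memory stand-in for an HTTP resource that supports If-Match."""

    def __init__(self, body):
        self.body = body

    @property
    def etag(self):
        serialized = json.dumps(self.body, sort_keys=True).encode()
        return hashlib.sha256(serialized).hexdigest()

    def put(self, new_body, if_match):
        """Replace the body only if the caller saw the current version."""
        if if_match != self.etag:
            return 412  # Precondition Failed: stale or duplicated write
        self.body = new_body
        return 200


order = Resource({"status": "taxed"})
seen = order.etag

first_delivery = order.put({"status": "paid"}, if_match=seen)       # applied
stale_delivery = order.put({"status": "cancelled"}, if_match=seen)  # rejected
print(first_delivery, stale_delivery, order.body)
```

The second write carries an ETag that no longer matches, so it is rejected instead of silently overwriting the newer state.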

Time and human escalation

Bound each job with a maxTTL; exceeding it forces a cancel and, where state changed, a “Revert Them All” across tasks. When retries cannot resolve a halting error, call for help: alert a predetermined human (email/SMS) with links to the job, progress, shared state, and error report — because you cannot predict every runtime failure.
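
A minimal sketch of the time bound, assuming a job is a plain dictionary with times in seconds. The field names other than maxTTL, the status values, and the choice to revert tasks in reverse order are assumptions of this example, not part of the pattern as described.

```python
def enforce_max_ttl(job, now):
    """Cancel an overdue job and revert every task that changed state.

    Returns the IDs of the reverted tasks (empty if the job is still in time).
    """
    elapsed = now - job["startedAt"]
    if elapsed <= job["maxTTL"]:
        return []

    job["jobStatus"] = "cancelled"
    reverted = []
    # Assumption: undo the most recent work first.
    for task in reversed(job["tasks"]):
        if task["changedState"]:
            task["taskStatus"] = "reverted"
            reverted.append(task["taskID"])
    return reverted


job = {
    "jobID": "J1",
    "jobStatus": "working",
    "startedAt": 0,
    "maxTTL": 60,
    "tasks": [
        {"taskID": "tax", "taskStatus": "completed", "changedState": False},
        {"taskID": "pay", "taskStatus": "completed", "changedState": True},
        {"taskID": "ship", "taskStatus": "working", "changedState": True},
    ],
}

in_time = enforce_max_ttl(job, now=30)   # within maxTTL: nothing happens
overdue = enforce_max_ttl(job, now=90)   # past maxTTL: cancel and revert
print(in_time, overdue, job["jobStatus"])
```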

Watch compounding delays

Retry and undo delays compound across sequential steps: three steps each retrying three times at 10s is roughly 90s (3 × 3 × 10s). Only maxTTL-bounded, workflow-compliant services keep this in check; reserve heavy delayed-undo for batch work, never real-time.
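
The arithmetic can be checked with a short calculation. The fixed-delay case reproduces the 90s figure; the exponential-backoff case (10s, 20s, 40s) is an added assumption that shows how quickly the worst case grows when delays double.

```python
def worst_case_fixed(steps, retries, delay_seconds):
    """Total wait when every step uses all its retries at a fixed delay."""
    return steps * retries * delay_seconds


def worst_case_backoff(steps, retries, base_delay_seconds):
    """Total wait when each retry doubles the previous delay."""
    per_step = sum(base_delay_seconds * 2 ** n for n in range(retries))
    return steps * per_step


fixed_total = worst_case_fixed(steps=3, retries=3, delay_seconds=10)
backoff_total = worst_case_backoff(steps=3, retries=3, base_delay_seconds=10)
print(fixed_total, backoff_total)
```

Comparing either total against the job's maxTTL before deployment shows whether the retry budget can fit inside the time bound at all.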

When to use it — and when not

✅ Reach for it when

  • An involved, branching workflow spans services owned by different teams.
  • You want loose coupling and resilience over a central orchestration engine.
  • Independent services must coordinate while remaining individually deployable.

⛔ Think twice when

  • The workflow is short, with few steps and branches; central orchestration is simpler to reason about.
  • End-to-end progress monitoring is the dominant requirement and you cannot add a progress resource.

Check your understanding

1. What distinguishes choreography from orchestration?

In choreography the workflow is an emergent by-product of loosely coupled services; orchestration uses a central engine (the conductor).

2. What is the main drawback of choreography, and how is it mitigated?

Loose coupling makes the whole picture hard to see; a progress resource per job restores observability without a central engine.

3. How are individual tasks correlated across a distributed job?

Each job/task gets a UUID: jobID travels as correlation-id, taskID as request-id, in both request and response.

4. Which methods are safe to automatically retry on transient failure?

Only idempotent methods (GET, HEAD, PUT, DELETE), and only for retryable cases (5xx and connection errors, not 4xx). Never auto-retry non-idempotent POST or PATCH, and prefer exponential backoff.
