The Architecture Reference

Ms communication · Microservices · Intermediate

Communication Styles: Sync vs Async

Choose the communication style before the technology — request-response vs event-driven, synchronous vs asynchronous — and understand temporal coupling, cascading failures, and what goes in an event.


🧭 Analogy

A phone call is synchronous: both people must be available at once, and if the other end is slow you’re stuck holding the line. A letter or a posted notice is asynchronous: you drop it in the box and carry on; the recipient reads it whenever they’re free. Neither is “better” — you choose by whether you need an answer right now or can decouple in time.

Style before technology

Newman’s thesis: teams too often pick a technology (REST, gRPC, Kafka) before deciding what style of communication they need. Set technology aside first and build a vocabulary. Moving from in-process to inter-process calls changes three things:

  • Performance: network calls can’t be inlined, and payload size and serialization cost now matter.
  • Changing interfaces: provider and consumers deploy separately, so a backward-incompatible change forces either a lockstep deployment or a phased rollout.
  • Error handling: new failure modes appear (crash, omission, timing, response, and arbitrary or Byzantine failures). HTTP status codes carry useful semantics here: a 503 may be worth retrying, while retrying a 404 is pointless.
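The retryable 503 versus pointless-to-retry 404 distinction can be sketched as a small retry policy. This is a minimal illustration, not a standard: the set of status codes treated as retryable and the function name `should_retry` are assumptions chosen for the example.

```python
# Assumption: an illustrative policy. Which codes count as retryable
# depends on the API you are calling.
RETRYABLE_STATUS_CODES = {
    408,  # Request Timeout
    429,  # Too Many Requests
    502,  # Bad Gateway
    503,  # Service Unavailable
    504,  # Gateway Timeout
}


def should_retry(status_code: int) -> bool:
    """Return True when a later attempt might plausibly succeed."""
    return status_code in RETRYABLE_STATUS_CODES


retry_503 = should_retry(503)  # temporary condition, worth another attempt
retry_404 = should_retry(404)  # the resource is missing, retrying won't help
```

The point is that the caller decides what to do from the kind of failure rather than treating every error the same way.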

The model of styles

graph TD
Start{"What do you need?"}
Start -->|"a reply / something done"| RR["Request-response"]
Start -->|"emit a fact, don't care who acts"| EV["Event-driven"]
RR --> SB["Synchronous blocking<br/>(keep connection open)"]
RR --> AB["Asynchronous nonblocking<br/>(via queue/broker)"]
EV --> ASYNC["Nonblocking asynchronous only"]

First choose request-response vs event-driven; if request-response, choose synchronous vs asynchronous; event-driven is limited to nonblocking asynchronous. Mixing styles within one system is the norm.

A poison message shows why async needs a safety valve — without a retry cap a single bad message can take down the whole pool:

graph LR
Q["Queue"] --> W["Worker"]
W -->|"crash, requeue"| Q
W -->|"retries exceed limit"| DLQ["Dead letter queue<br/>(message hospital)"]
DLQ --> H["Human / automated triage"]

The styles trade off as follows:

  • Synchronous blocking: simple and familiar, but it creates temporal coupling (caller and callee must both be up at the same time), leaves the caller exposed to slow downstream services, and lets failures cascade through long call chains (MusicCorp’s fraud example). Remedies: restructure the interaction (move fraud detection off the critical path) or go nonblocking.
  • Asynchronous nonblocking: temporal decoupling, a good fit for long-running work (a warehouse dispatch taking hours or days), at the cost of added complexity and more choices to make. (Note: async/await-style code still blocks that code path until the result arrives, so from the code’s perspective it behaves like a synchronous call.)
  • Communication through common data: one service writes to a known location (a file or data store) that others consume later. A data lake takes raw data and keeps coupling loose; a data warehouse imposes a shared structure, so coupling is tighter. Simple, ubiquitous, and good for large volumes and interoperability, but usually high-latency because consumers typically poll for new data.
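Temporal decoupling can be sketched with an in-process queue standing in for a broker. This is only an illustration under stated assumptions: `place_order`, `warehouse_worker`, and the `None` stop sentinel are hypothetical names, and a real system would use a durable broker rather than `queue.Queue`.

```python
import queue
import threading

# Assumption: an in-memory queue stands in for a message broker.
dispatch_queue = queue.Queue()
dispatched = []


def place_order(order_id):
    """Caller side: hand the work off and return without waiting."""
    dispatch_queue.put(order_id)
    return "accepted"


def warehouse_worker():
    """Consumer side: drains the queue whenever it happens to be running."""
    while True:
        order_id = dispatch_queue.get()
        if order_id is None:  # hypothetical sentinel telling the worker to stop
            break
        dispatched.append(order_id)


# The caller succeeds even though no worker is running yet.
status = place_order("order-1")

# The worker starts later and picks up the waiting message.
worker = threading.Thread(target=warehouse_worker)
worker.start()
dispatch_queue.put(None)
worker.join()
```

The caller gets its "accepted" answer before the worker exists, which is exactly what a synchronous blocking call cannot offer.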

Request-response: a “request” beats a “command”

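When you need data back or need to make sure something gets done, frame the call as a request, which the receiver can examine and reject, rather than a command, which implies it must be obeyed. A synchronous request keeps a connection open until the response arrives; an asynchronous one routes through a queue or broker, which buffers load but means you must correlate each response with its request and route it back to the right caller. Every form needs time-out handling.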
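Asynchronous request-response with a correlation ID and a time-out might look like the sketch below. It assumes in-memory queues in place of a broker and a single caller; the names `request`, `responder`, `request_queue`, and `reply_queue` are hypothetical, and the "service" simply upper-cases the body.

```python
import queue
import threading
import time
import uuid

# Assumption: two in-memory queues stand in for broker destinations.
request_queue = queue.Queue()
reply_queue = queue.Queue()


def responder():
    """Hypothetical downstream service: handles one request, posts one reply."""
    msg = request_queue.get()
    reply_queue.put({"correlation_id": msg["correlation_id"],
                     "body": msg["body"].upper()})


def request(body, timeout_s=1.0):
    """Send a request and wait, up to timeout_s, for the matching reply."""
    correlation_id = str(uuid.uuid4())
    request_queue.put({"correlation_id": correlation_id, "body": body})
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"no reply for {correlation_id}")
        try:
            reply = reply_queue.get(timeout=remaining)
        except queue.Empty:
            raise TimeoutError(f"no reply for {correlation_id}") from None
        if reply["correlation_id"] == correlation_id:
            return reply["body"]
        # Single-caller sketch: a reply for some other request is ignored.
        # A real system would route it to whoever is waiting for it.


threading.Thread(target=responder, daemon=True).start()
answer = request("check stock")
```

Without the correlation ID the caller could not tell which reply belongs to which request, and without the deadline it would wait forever if the responder never answers.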

Event-driven: emit facts, don’t issue orders

A service emits events, factual statements that something happened, without knowing who consumes them, which greatly reduces coupling. The event is the fact (the payload); the message is the medium that carries it. Implement this with message brokers, following the principle “keep the middleware dumb, the smarts in the endpoints”.
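The emitter-does-not-know-its-consumers idea can be sketched with a toy in-process broker. Everything here is an assumption for illustration: the `Broker` class, the `customer_registered` event type, and the two subscribers are hypothetical, and delivery is synchronous only to keep the example short, whereas a real broker delivers asynchronously.

```python
from collections import defaultdict


class Broker:
    """Deliberately dumb middleware: it only routes events to subscribers."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # The emitter never sees this list; it only states what happened.
        for handler in list(self._handlers.get(event_type, [])):
            handler(payload)


broker = Broker()
actions = []

# Two independent consumers decide for themselves how to react.
broker.subscribe(
    "customer_registered",
    lambda event: actions.append(f"loyalty account for {event['customer_id']}"),
)
broker.subscribe(
    "customer_registered",
    lambda event: actions.append(f"welcome email to {event['email']}"),
)

# The emitting service publishes a fact and carries on.
broker.publish("customer_registered",
               {"customer_id": "c-42", "email": "ada@example.com"})
```

Adding a third consumer means one more `subscribe` call and no change to the emitter, which is where the reduced coupling comes from.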

What’s in an event?

  • Just an ID — causes a barrage of callbacks and re-adds coupling.
  • Fully detailed events — Newman’s preference: self-sufficient consumers and a historical record useful for event sourcing. But watch event size, PII leakage, and the fact that event data becomes part of your contract.
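The two payload options can be compared side by side. This is a sketch under assumptions: the event classes, the field names, and `fetch_customer` (standing in for a network call back to the emitting service) are all hypothetical.

```python
from dataclasses import dataclass

lookup_calls = []  # records every callback to the emitting service


def fetch_customer(customer_id):
    """Stands in for a network call back to the service that emitted the event."""
    lookup_calls.append(customer_id)
    return {"customer_id": customer_id, "name": "Ada", "email": "ada@example.com"}


@dataclass(frozen=True)
class CustomerRegisteredThin:
    customer_id: str  # just an ID


@dataclass(frozen=True)
class CustomerRegistered:
    customer_id: str
    name: str
    email: str  # PII: everything here becomes part of the contract


def welcome_from_thin(event):
    # The consumer must call back, so it now depends on the emitter being up.
    customer = fetch_customer(event.customer_id)
    return f"Welcome, {customer['name']}!"


def welcome_from_detailed(event):
    # The consumer is self-sufficient: everything it needs is in the event.
    return f"Welcome, {event.name}!"


thin_message = welcome_from_thin(CustomerRegisteredThin("c-42"))
detailed_message = welcome_from_detailed(
    CustomerRegistered("c-42", "Ada", "ada@example.com")
)
```

Both consumers produce the same message, but only the thin event triggers a lookup; with many consumers, that lookup is the barrage of callbacks. The detailed event avoids it at the price of carrying an email address that every subscriber can now see.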

💡 Choreography vs orchestration is next

Once services emit and react to events, you’re choosing between letting them react independently or having one service direct the flow. That trade-off is covered in choreography vs orchestration and applied to workflows in sagas.

⚠️ The poison message and catastrophic failover

A 2006 bank pricing system suffered a catastrophic failover: a crashing worker requeued a poison message that crashed the next worker that picked it up, repeatedly, taking the whole pool down. The fix was a maximum retry limit and a dead letter queue / message hospital. With async messaging you also need correlation IDs and good monitoring — the decoupling buys scalability at a real cost in operational complexity.
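A maximum retry limit with a dead letter queue can be sketched as follows. This is a simplified model, not the bank's actual fix: a worker crash is represented by an exception, the queue is an in-memory `deque`, and `MAX_ATTEMPTS`, `process`, and `drain` are hypothetical names.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumption: the cap is a tunable policy choice


def process(body):
    """Hypothetical handler; the poison message always fails."""
    if body == "poison":
        raise ValueError("cannot handle message")
    return body.upper()


def drain(bodies):
    """Process every message, parking repeat offenders in a dead letter list."""
    work = deque({"body": body, "attempts": 0} for body in bodies)
    done, dead_letters = [], []
    while work:
        message = work.popleft()
        message["attempts"] += 1
        try:
            done.append(process(message["body"]))
        except Exception:
            if message["attempts"] >= MAX_ATTEMPTS:
                dead_letters.append(message)  # stop the loop, keep the evidence
            else:
                work.append(message)  # requeue for another attempt
    return done, dead_letters


done, dead_letters = drain(["a", "poison", "b"])
```

Without the attempt counter the poison message would be requeued forever; with it, the healthy messages still get processed and the bad one is set aside for human or automated triage.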

🔑 Key insight

There is rarely one right option; expect a mix. Synchronous is simplest but couples you in time; asynchronous events give the loosest coupling but demand correlation, monitoring, and careful failure handling. Choose per interaction, deliberately.

See also

  • Choreography vs orchestration — who drives a multi-service process.
  • Sagas — coordinating state changes without distributed transactions.
  • Resilience — timeouts, retries, and circuit breakers for synchronous calls.

When to use it — and when not

✅ Reach for it when

  • You are deciding how two services should talk to each other.
  • You want to reduce temporal coupling or avoid cascading failures.
  • You need to decide what to put inside an event.

⛔ Think twice when

  • You have already picked a style and need a specific technology comparison.
  • You are designing a whole distributed workflow — see sagas.

Check your understanding


1. What is Newman's main thesis about communication?

Teams pick a technology before deciding what style they need; the chapter builds a vocabulary of styles first, deliberately setting technology aside.

2. What problem does synchronous blocking communication create?

A synchronous call requires the callee to be up at the same time; slow downstreams can cascade failures back up the chain.

3. What does Newman recommend putting in an event?

Fully detailed events make consumers self-sufficient and provide a historical record; 'just an ID' causes a barrage of callbacks and adds coupling — but beware size, PII, and contract implications.

4. What fixed the 2006 bank pricing system's catastrophic failover?

A poison message kept crashing each worker that picked it up; a max retry limit and dead letter queue stopped the loop.
