🧭 Analogy
Booking a holiday is not a single instant. You reserve a flight, wait for a confirmation email, book a hotel that needs manager approval, then wait days for a visa. You hold the whole thing in your head, chase the slow steps, and start over if a booking falls through. A long-running process is software doing exactly that — and the “head” that remembers where every booking is must be durable.
What “long-running” really means
A process is a series of tasks performed to reach a desired result. The instant any task has to wait — for a remote component to respond, for a human to act, or for a timer to elapse — the process becomes long-running: it spans seconds, hours, weeks, or months. Ruecker’s central claim is that handling long-running behaviour is the hardest, most under-appreciated problem in distributed systems, and that waiting is forced on you the moment logic crosses a boundary:
- a remote call crosses local / OS / machine boundaries (latency, unavailability);
- invoking another component crosses a transaction boundary (no shared ACID);
- another team crosses an organizational boundary;
- an external service crosses a company boundary;
- involving people crosses the boundary between automatable and non-automatable work.
Modern systems have more boundaries, not fewer, as they move from monoliths to fine-grained services and functions. So “long-running” is not an exotic case — it is the default once you make your first remote call.
The three problems you inherit: state, time, failure
flowchart LR
A["Start: order placed"] --> B["Charge card<br/>(remote, may fail)"]
B -->|"wait for response"| C{"Paid?"}
C -->|"yes"| D["Reserve goods"]
C -->|"timeout"| E["Wait 1 day, retry"]
E --> B
D --> F["Wait for human approval"]
F --> G["Ship goods"]
G --> H["End: order delivered"]

Every arrow that says wait or may fail drags in one of three concerns:
- State. Something must durably remember exactly where each instance is — which card charge succeeded, which is still pending — and survive crashes, restarts, and deploys. There is never a moment with zero instances running, because “running” includes “waiting.”
- Time. Steps need timers: retry after one day, escalate if a human has not acted in five days, cancel if a response never arrives. Scheduling must be reliable even when nothing else is happening.
- Failure. Remote calls are unreliable. You need retries, timeouts, and a way to restore consistency when you genuinely cannot tell whether the card was charged.
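One common way to handle the ambiguous case in the last bullet is an idempotency key, so that a retry after a timeout cannot charge the card twice. The following is a minimal sketch under that assumption, not a real payment API: `FakePaymentService`, `charge`, and `charge_with_retries` are hypothetical names, and the in-memory dict stands in for the provider's durable storage.

```python
class FakePaymentService:
    """Hypothetical stand-in for a remote payment provider."""

    def __init__(self, timeouts_before_success: int = 1):
        self.charges = {}  # idempotency key -> amount actually charged
        self._timeouts_left = timeouts_before_success

    def charge(self, idempotency_key: str, amount: int) -> str:
        # The provider records the charge at most once per key.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = amount
        # Simulate the ambiguous failure: the charge went through,
        # but the response never reached the caller.
        if self._timeouts_left > 0:
            self._timeouts_left -= 1
            raise TimeoutError("no response from payment service")
        return "charged"


def charge_with_retries(service, order_id: str, amount: int, max_attempts: int = 3) -> str:
    """Retry with the same key, so repeating the call is safe."""
    key = f"order-{order_id}"  # same key for every attempt of this order
    for attempt in range(1, max_attempts + 1):
        try:
            return service.charge(key, amount)
        except TimeoutError:
            if attempt == max_attempts:
                raise
            # A real engine would persist state and schedule a timer here
            # (for example "wait 1 day") instead of looping immediately.
    raise ValueError("max_attempts must be at least 1")
```

Because every attempt reuses the same key, the caller never needs to know whether the first call succeeded. It can simply ask again, and the provider decides whether there is anything left to do.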
Waiting is logical, not a blocked thread — the engine persists state, frees the thread, and reloads only when a trigger fires:
stateDiagram-v2
[*] --> Active: instance started
Active --> Waiting: reach a wait point
Waiting --> Active: trigger fires (response, timer, or human action)
note right of Waiting
State sits in the database
no thread blocked
end note
Active --> [*]: process completes
The “Wild West” trap
The Ash death-spiral
A developer (“Ash”) builds a credit-card payment backend. The external service is flaky, so Ash adds retries; the retries must survive hours, so Ash adds a payment table with a status column and a polling scheduler; the API becomes async (HTTP 202 + polling); a poisoned message crashes it; ad-hoc monitoring scripts appear; then SLA reporting is demanded. The result is an unmaintainable mess only Ash understands — with the business logic buried inside hand-written state-handling code.
This is Wild West integration: ungoverned, point-to-point integration where every team re-implements persistence, retries, scheduling, and reporting by hand — usually worse than a real tool would. The punchline is that all of those capabilities are built into a workflow engine. The two that matter most are persisting state (which enables waiting) and scheduling (which enables retries).
Key insight
Waiting is logical, not a blocked thread. The engine writes the instance’s current state to its database, returns the thread, and does nothing until a triggering event — a user pressing a button or a scheduler firing — reloads the state and resumes. That is how one engine can hold millions of “waiting” instances cheaply.
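The persist, return, and reload cycle can be sketched in a few lines. This is an illustrative toy, not how any particular engine is implemented: the in-memory SQLite database stands in for the engine's durable store, and the step names, `WAIT_POINTS`, `run_until_wait`, and `trigger` are all assumptions made for the example.

```python
import json
import sqlite3

# In-memory SQLite stands in for the engine's durable database (assumption).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE instances (id TEXT PRIMARY KEY, step TEXT, data TEXT)")

STEPS = ["charge_card", "await_approval", "ship_goods", "done"]
WAIT_POINTS = {"await_approval"}  # steps that need an external trigger


def save(instance_id: str, step: str, data: dict) -> None:
    db.execute(
        "INSERT OR REPLACE INTO instances (id, step, data) VALUES (?, ?, ?)",
        (instance_id, step, json.dumps(data)),
    )
    db.commit()


def load(instance_id: str):
    step, data = db.execute(
        "SELECT step, data FROM instances WHERE id = ?", (instance_id,)
    ).fetchone()
    return step, json.loads(data)


def run_until_wait(instance_id: str, step: str, data: dict) -> str:
    """Advance until a wait point or the end, persist, then return the thread."""
    while step != "done" and step not in WAIT_POINTS:
        data.setdefault("history", []).append(step)  # placeholder for real work
        step = STEPS[STEPS.index(step) + 1]
    save(instance_id, step, data)
    return step


def start(instance_id: str) -> str:
    return run_until_wait(instance_id, STEPS[0], {})


def trigger(instance_id: str, event: dict) -> str:
    """A user action or timer fires: reload the state and resume."""
    step, data = load(instance_id)
    if step not in WAIT_POINTS:
        return step  # nothing is waiting, so there is nothing to resume
    data.update(event)
    data.setdefault("history", []).append(step)
    return run_until_wait(instance_id, STEPS[STEPS.index(step) + 1], data)
```

Between `start("order-1")` returning and `trigger("order-1", {...})` being called, nothing runs and no thread is held. The only cost of a waiting instance is one row in the table.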
Don’t store the wrong things
A workflow engine tracks control flow — where each instance is and what happens next. It is not a database for your business entities. Keep the customer record, the order, and the invoice in your own service; let the engine hold only references (IDs) plus the small amount of data it needs to route. Mixing business data into the engine couples your domain to a tool you should be able to swap.
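As a rough illustration of that split, the sketch below keeps the full order in a service-owned store and hands the engine only IDs plus one routing flag. All names here (`Order`, `ORDER_STORE`, `ProcessVariables`, `start_variables`, `ship`) and the approval threshold are hypothetical, chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    """Business entity: lives in the order service's own store."""
    order_id: str
    customer_id: str
    total_cents: int
    shipping_address: str


# Hypothetical order service storage, owned by the domain, not the engine.
ORDER_STORE = {
    "o-1001": Order("o-1001", "c-77", 250_000, "1 Example Street"),
}


@dataclass(frozen=True)
class ProcessVariables:
    """What the engine holds: references plus minimal routing data."""
    order_id: str
    customer_id: str
    needs_approval: bool  # small derived flag used only to pick a path


def start_variables(order: Order, approval_threshold_cents: int = 100_000) -> ProcessVariables:
    return ProcessVariables(
        order_id=order.order_id,
        customer_id=order.customer_id,
        needs_approval=order.total_cents >= approval_threshold_cents,
    )


def ship(variables: ProcessVariables) -> str:
    # A task looks the entity up by ID when it actually needs the details.
    order = ORDER_STORE[variables.order_id]
    return f"ship {order.order_id} to {order.shipping_address}"
```

If the engine were replaced, only `ProcessVariables` would need to move. The order data and its schema stay where they are.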
Rule of thumb
Before hand-rolling a status column and a polling loop, give a workflow engine a second thought — even when the project “seems simple.” Simple processes are exactly where Wild West integration begins.
See also
- Where workflow engines fit — the engine as a persistent, scheduling state machine.
- Orchestration vs choreography — who owns the end-to-end flow.
- Handling failures and retries — idempotency, timeouts, and eventual consistency.
When to use it — and when not
✅ Reach for it when
- Logic must wait for a remote service, a human, or a timer before it can continue
- A flow spans minutes, hours, or weeks and must survive restarts and crashes
- You find yourself hand-rolling status columns, polling schedulers, and retry loops
- An automated process needs only occasional manual intervention for exceptions (straight-through processing)
- A flow crosses transaction, organizational, company, or human boundaries
⛔ Think twice when
- The whole operation completes in one synchronous, in-memory transaction with no waiting
- The logic is a pure decision that resolves in a single atomic step (use a decision engine instead)
- You only need fire-and-forget messaging with no end-to-end accountability
- The control flow is so trivial that a single function call captures it fully
- You are tempted to store business entities in the engine rather than referencing them by ID
Related topics
- Where workflow engines fit: A workflow engine is a persistent, scheduling state machine. Run it as a service, keep it decentralized, and judge its adoption by return on investment.
- Orchestration vs Choreography: Orchestration is command-driven, choreography is event-driven, and the real architecture choice is made link by link, based on who owns the outcome.
- Handling Failures and Retries: Remote calls are unreliable. Keep failures local with stateful retries, design every operation to be idempotent, and clean up after ambiguous failures.
Check your understanding
1. What are the two core capabilities that make a workflow engine good at long-running work?
Ruecker frames the engine as a state machine that is good at waiting and scheduling. Durable state lets instances wait; scheduling lets the engine become active to retry or escalate.
2. In a workflow engine, what does it mean that an instance is 'waiting' for a week?
Waiting is logical, not a blocked thread. The engine stores current state, returns the thread, and reloads state only when a user action or a scheduler timer fires.
3. Why does crossing a boundary force long-running behaviour?
Every boundary (local, transaction, organizational, company, or human) adds latency and availability problems, so the calling side must be able to wait and recover.
4. What is the 'Wild West integration' anti-pattern?
The 'Ash' story shows ungoverned point-to-point integration drifting into homemade status tables, polling schedulers, and monitoring scripts — exactly what an engine provides for free.
5. What is 'straight-through processing' (STP)?
STP combines automating the control flow with automating the tasks themselves, so people are involved only when something unusual happens.
6. Where should a workflow engine store your customer record and invoice?
The engine tracks control flow, not business entities. Keeping domain data in your service avoids coupling your domain to a tool you should be able to swap.