The Architecture Reference

Auto foundations · Guide · Start here

Why Long-Running Processes Are Hard

The moment your logic crosses a boundary it becomes long-running — and you inherit state, time, and failure problems that a workflow engine exists to solve.

🧭 Analogy

Booking a holiday is not a single instant. You reserve a flight, wait for a confirmation email, book a hotel that needs manager approval, then wait days for a visa. You hold the whole thing in your head, chase the slow steps, and start over if a booking falls through. A long-running process is software doing exactly that — and the “head” that remembers where every booking is must be durable.

What “long-running” really means

A process is a series of tasks performed to reach a desired result. The instant any task has to wait — for a remote component to respond, for a human to act, or for a timer to elapse — the process becomes long-running: it spans seconds, hours, weeks, or months. Ruecker’s central claim is that handling long-running behaviour is the hardest, most under-appreciated problem in distributed systems, and that waiting is forced on you the moment logic crosses a boundary:

  • a remote call crosses local / OS / machine boundaries (latency, unavailability);
  • invoking another component crosses a transaction boundary (no shared ACID);
  • involving another team crosses an organizational boundary;
  • calling an external service crosses a company boundary;
  • involving people crosses the boundary between automatable and non-automatable work.

Modern systems have more boundaries, not fewer, as they move from monoliths to fine-grained services and functions. So “long-running” is not an exotic case — it is the default once you make your first remote call.

The three problems you inherit: state, time, failure

flowchart LR
A["Start: order placed"] --> B["Charge card<br/>(remote, may fail)"]
B -->|"wait for response"| C{"Paid?"}
C -->|"yes"| D["Reserve goods"]
C -->|"timeout"| E["Wait 1 day, retry"]
E --> B
D --> F["Wait for human approval"]
F --> G["Ship goods"]
G --> H["End: order delivered"]

Every step or arrow labelled wait or may fail drags in one of three concerns:

  • State. Something must durably remember exactly where each instance is — which card charge succeeded, which is still pending — and survive crashes, restarts, and deploys. There is never a moment with zero instances running, because “running” includes “waiting.”
  • Time. Steps need timers: retry after one day, escalate if a human has not acted in five days, cancel if a response never arrives. Scheduling must be reliable even when nothing else is happening.
  • Failure. Remote calls are unreliable. You need retries, timeouts, and a way to restore consistency when you genuinely cannot tell whether the card was charged.
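
As a rough illustration of how the time and failure concerns meet in the payment step, here is a minimal Python sketch of the decision taken after one charge attempt. The outcome labels, the step names, and the limit of three attempts are assumptions made for the example; only the one-day retry delay comes from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

RETRY_DELAY = timedelta(days=1)  # "retry after one day", as in the text
MAX_ATTEMPTS = 3                 # assumption: the text gives no limit


@dataclass(frozen=True)
class Decision:
    next_step: str                # hypothetical step name
    wake_at: Optional[datetime]   # when a durable timer should fire, if any


def after_charge(outcome: str, attempts: int, now: datetime) -> Decision:
    """Decide what happens after one attempt to charge the card."""
    if outcome == "paid":
        return Decision("reserve_goods", None)
    if outcome == "declined":
        return Decision("payment_failed", None)
    if outcome == "timeout":
        # Unknown result: the card may or may not have been charged.
        if attempts < MAX_ATTEMPTS:
            return Decision("charge_card", now + RETRY_DELAY)
        # Out of retries: hand the ambiguous case to a person.
        return Decision("manual_review", None)
    raise ValueError(f"unknown outcome: {outcome!r}")
```

The function only decides; something durable still has to store `wake_at` and fire the timer later, which is the state concern from the first bullet.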

Waiting is logical, not a blocked thread — the engine persists state, frees the thread, and reloads only when a trigger fires:

stateDiagram-v2
[*] --> Active: instance started
Active --> Waiting: reach a wait point
Waiting --> Active: trigger fires (response, timer, or human action)
note right of Waiting
  State sits in the database
  no thread blocked
end note
Active --> [*]: process completes
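
The persist-then-resume cycle in the diagram can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any particular engine is implemented: the table layout, the function names, and the single wait point are invented for illustration, and an in-memory SQLite database stands in for the engine's durable store.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the (hypothetical) instance store and create its table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS instance ("
        " id TEXT PRIMARY KEY,"
        " state TEXT NOT NULL,"
        " waiting_for TEXT,"
        " variables TEXT NOT NULL)"
    )
    return db


def start(db: sqlite3.Connection, instance_id: str, variables: dict) -> None:
    """Run to the first wait point, persist the state, and return.

    Nothing keeps running afterwards: the row in the table is the instance.
    """
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO instance VALUES (?, 'Waiting', ?, ?)",
            (instance_id, "payment_response", json.dumps(variables)),
        )


def trigger(db: sqlite3.Connection, instance_id: str, event: str) -> bool:
    """Reload a waiting instance if this event is what it is waiting for."""
    row = db.execute(
        "SELECT waiting_for, variables FROM instance"
        " WHERE id = ? AND state = 'Waiting'",
        (instance_id,),
    ).fetchone()
    if row is None or row[0] != event:
        return False  # unknown instance, already finished, or wrong trigger
    variables = json.loads(row[1])
    variables["last_event"] = event
    with db:
        db.execute(
            "UPDATE instance SET state = 'Completed', waiting_for = NULL,"
            " variables = ? WHERE id = ?",
            (json.dumps(variables), instance_id),
        )
    return True
```

Between `start` and `trigger` no code is executing for the instance at all. Passing a file path to `open_store` instead of `:memory:` would let the waiting row outlive the Python process.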

The “Wild West” trap

The Ash death-spiral

A developer (“Ash”) builds a credit-card payment backend. The external service is flaky, so Ash adds retries; the retries must survive hours, so Ash adds a payment table with a status column and a polling scheduler; the API becomes async (HTTP 202 + polling); a poisoned message crashes it; ad-hoc monitoring scripts appear; then SLA reporting is demanded. The result is an unmaintainable mess only Ash understands — with the business logic buried inside hand-written state-handling code.

This is Wild West integration: ungoverned, point-to-point integration where every team re-implements persistence, retries, scheduling, and reporting by hand — usually worse than a real tool would. The punchline is that all of those capabilities are built into a workflow engine. The two that matter most are persisting state (which enables waiting) and scheduling (which enables retries).
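
To make the hand-rolled version concrete, the following Python sketch shows roughly what a payment table with a status column and one pass of a polling scheduler might look like. The schema, the `charge` callback, and the one-day retry delay are assumptions chosen for the example; they are not taken from the Ash story itself.

```python
import sqlite3
from typing import Callable


def make_payment_table() -> sqlite3.Connection:
    """Create the hypothetical hand-rolled payment table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE payment ("
        " id TEXT PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " attempts INTEGER NOT NULL DEFAULT 0,"
        " next_attempt_at REAL NOT NULL)"
    )
    return db


def poll_once(
    db: sqlite3.Connection,
    charge: Callable[[str], bool],
    now: float,
    retry_delay: float = 86400.0,  # one day, in seconds
) -> int:
    """One pass of the polling scheduler; returns how many payments were due."""
    due = db.execute(
        "SELECT id FROM payment"
        " WHERE status = 'pending' AND next_attempt_at <= ? ORDER BY id",
        (now,),
    ).fetchall()
    for (payment_id,) in due:
        try:
            ok = charge(payment_id)
        except Exception:
            ok = False  # flaky service: treat any error as "try again later"
        with db:
            if ok:
                db.execute(
                    "UPDATE payment SET status = 'paid', attempts = attempts + 1"
                    " WHERE id = ?",
                    (payment_id,),
                )
            else:
                db.execute(
                    "UPDATE payment SET attempts = attempts + 1,"
                    " next_attempt_at = ? WHERE id = ?",
                    (now + retry_delay, payment_id),
                )
    return len(due)
```

Even this small sketch already mixes persistence, scheduling, and error handling with the one line of business logic, the call to `charge`. It still has no retry limit, no monitoring, and no reporting, which are the next pieces a team in this position tends to bolt on.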

Key insight

Waiting is logical, not a blocked thread. The engine writes the instance’s current state to its database, returns the thread, and does nothing until a triggering event — a user pressing a button or a scheduler firing — reloads the state and resumes. That is how one engine can hold millions of “waiting” instances cheaply.

Don’t store the wrong things

A workflow engine tracks control flow — where each instance is and what happens next. It is not a database for your business entities. Keep the customer record, the order, and the invoice in your own service; let the engine hold only references (IDs) plus the small amount of data it needs to route. Mixing business data into the engine couples your domain to a tool you should be able to swap.
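
One way to picture the split is a small set of process variables that carries IDs plus a single routing flag, while the full order stays in the owning service. The sketch below is illustrative only: the `ORDERS` dictionary stands in for that service's own store, and the field names and the approval threshold of 1000 are invented for the example.

```python
from dataclasses import asdict, dataclass

# Stand-in for the order service's own database (hypothetical data).
ORDERS = {
    "o-1": {"customer_id": "c-7", "total": 1250, "lines": ["desk", "chair"]},
    "o-2": {"customer_id": "c-9", "total": 40, "lines": ["lamp"]},
}

APPROVAL_THRESHOLD = 1000  # assumption: a routing rule made up for this sketch


@dataclass(frozen=True)
class OrderProcessVariables:
    """Everything the engine is given: references plus routing data."""
    order_id: str
    customer_id: str
    needs_approval: bool


def start_variables(order_id: str) -> OrderProcessVariables:
    """Derive the process variables from an order without copying the order."""
    order = ORDERS[order_id]
    return OrderProcessVariables(
        order_id=order_id,
        customer_id=order["customer_id"],
        needs_approval=order["total"] > APPROVAL_THRESHOLD,
    )
```

Calling `asdict(start_variables("o-1"))` yields three small fields; the order lines and totals never leave the service that owns them, so any step that needs them looks them up by `order_id`.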

Rule of thumb

Before hand-rolling a status column and a polling loop, give a workflow engine a second thought — even when the project “seems simple.” Simple processes are exactly where Wild West integration begins.

When to use it — and when not

✅ Reach for it when

  • Logic must wait for a remote service, a human, or a timer before it can continue
  • A flow spans minutes, hours, or weeks and must survive restarts and crashes
  • You find yourself hand-rolling status columns, polling schedulers, and retry loops
  • An automated process needs only occasional manual intervention for exceptions (straight-through processing)
  • A flow crosses transaction, organizational, company, or human boundaries

⛔ Think twice when

  • The whole operation completes in one synchronous, in-memory transaction with no waiting
  • The work is a pure decision that resolves in a single atomic step (use a decision engine instead)
  • You only need fire-and-forget messaging with no end-to-end accountability
  • The control flow is so trivial that a single function call captures it fully
  • You are tempted to store business entities in the engine rather than referencing them by ID

Check your understanding

1. What are the two core capabilities that make a workflow engine good at long-running work?

Ruecker frames the engine as a state machine that is good at waiting and scheduling. Durable state lets instances wait; scheduling lets the engine become active to retry or escalate.

2. In a workflow engine, what does it mean that an instance is 'waiting' for a week?

Waiting is logical, not a blocked thread. The engine stores current state, returns the thread, and reloads state only when a user action or a scheduler timer fires.

3. Why does crossing a boundary force long-running behaviour?

Crossing any boundary (local, transaction, organizational, company, or human) adds availability and latency problems, so the calling side must be able to wait and recover.

4. What is the 'Wild West integration' anti-pattern?

The 'Ash' story shows ungoverned point-to-point integration drifting into homemade status tables, polling schedulers, and monitoring scripts — exactly what an engine provides for free.

5. What is 'straight-through processing' (STP)?

STP combines automating the control flow with automating the tasks themselves, so people are involved only when something unusual happens.

6. Where should a workflow engine store your customer record and invoice?

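It should not store them at all. The engine tracks control flow, not business entities, so the customer record and invoice stay in your own service and the engine holds only their IDs. Keeping domain data out of the engine avoids coupling your domain to a tool you should be able to swap.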
