Auto foundations · Guide · Start here

Why Long-Running Processes Are Hard

The moment your logic crosses a boundary it becomes long-running — and you inherit state, time, and failure problems that a workflow engine exists to solve.


🧭 Analogy

Booking a holiday is not a single instant. You reserve a flight, wait for a confirmation email, book a hotel that needs manager approval, then wait days for a visa. You hold the whole thing in your head, chase the slow steps, and start over if a booking falls through. A long-running process is software doing exactly that — and the “head” that remembers where every booking is must be durable.

What “long-running” really means

A process is a series of tasks performed to reach a desired result. The instant any task has to wait — for a remote component to respond, for a human to act, or for a timer to elapse — the process becomes long-running: it spans seconds, hours, weeks, or months. Ruecker’s central claim is that handling long-running behaviour is the hardest, most under-appreciated problem in distributed systems, and that waiting is forced on you the moment logic crosses a boundary:

a remote call crosses local, OS, or machine boundaries (latency, unavailability);
invoking another component crosses a transaction boundary (no shared ACID transaction);
depending on another team crosses an organizational boundary;
calling an external service crosses a company boundary;
involving people crosses the boundary between automatable and non-automatable work.

Modern systems have more boundaries, not fewer, as they move from monoliths to fine-grained services and functions. So “long-running” is not an exotic case — it is the default once you make your first remote call.

The three problems you inherit: state, time, failure

flowchart LR
A["Start: order placed"] --> B["Charge card<br/>(remote, may fail)"]
B -->|"wait for response"| C{"Paid?"}
C -->|"yes"| D["Reserve goods"]
C -->|"timeout"| E["Wait 1 day, retry"]
E --> B
D --> F["Wait for human approval"]
F --> G["Ship goods"]
G --> H["End: order delivered"]

Every arrow that says wait or may fail drags in one of three concerns:

State. Something must durably remember exactly where each instance is — which card charge succeeded, which is still pending — and survive crashes, restarts, and deploys. There is never a moment with zero instances running, because “running” includes “waiting.”
Time. Steps need timers: retry after one day, escalate if a human has not acted in five days, cancel if a response never arrives. Scheduling must be reliable even when nothing else is happening.
Failure. Remote calls are unreliable. You need retries, timeouts, and a way to restore consistency when you genuinely cannot tell whether the card was charged.
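
The three concerns can be made concrete with a small sketch of the record an engine might persist for the charge step in the flowchart. This is an illustration under stated assumptions, not how any particular engine stores its data: `OrderInstance`, `on_charge_result`, the outcome labels, and the retry limit of three are all hypothetical, while the one-day delay comes from the diagram.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta
from typing import Optional

RETRY_DELAY = timedelta(days=1)  # "Wait 1 day, retry" from the flowchart
MAX_ATTEMPTS = 3                 # assumption: the guide gives no retry limit


@dataclass(frozen=True)
class OrderInstance:
    order_id: str
    step: str = "charge_card"          # State: where this instance currently is
    attempts: int = 0                  # Failure: charge attempts made so far
    due_at: Optional[datetime] = None  # Time: when a scheduler should wake it


def on_charge_result(inst: OrderInstance, outcome: str, now: datetime) -> OrderInstance:
    """Return the next record to persist after one charge attempt.

    outcome is "paid" or "timeout" (hypothetical labels for this sketch).
    """
    if outcome not in ("paid", "timeout"):
        raise ValueError(f"unknown outcome: {outcome}")
    attempts = inst.attempts + 1
    if outcome == "paid":
        return replace(inst, step="reserve_goods", attempts=attempts, due_at=None)
    if attempts >= MAX_ATTEMPTS:
        # Out of retries: hand over to a person instead of looping forever.
        return replace(inst, step="needs_manual_review", attempts=attempts, due_at=None)
    # Timed out: stay on the same step and record when to try again.
    return replace(inst, step="charge_card", attempts=attempts, due_at=now + RETRY_DELAY)
```

Because the function only maps one record to the next, nothing has to stay in memory between attempts. A scheduler that periodically selects records whose `due_at` has passed would be enough to drive the retries.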

Waiting is logical, not a blocked thread — the engine persists state, frees the thread, and reloads only when a trigger fires:

stateDiagram-v2
[*] --> Active: instance started
Active --> Waiting: reach a wait point
Waiting --> Active: trigger fires (response, timer, or human action)
note right of Waiting
  State sits in the database
  no thread blocked
end note
Active --> [*]: process completes
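
The Active and Waiting cycle in the diagram can be sketched with Python's standard `sqlite3` module. This is a minimal sketch, assuming a single table and hypothetical step and event names (`await_payment`, `payment_confirmed`, and so on); a real engine would use a durable database rather than an in-memory one.

```python
import json
import sqlite3

# Hypothetical wait points and the trigger that releases each one.
NEXT_STEP = {
    ("await_payment", "payment_confirmed"): "await_approval",
    ("await_approval", "approved"): "done",
}


def open_store() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")  # in-memory only to keep the sketch self-contained
    db.execute("CREATE TABLE instance (id TEXT PRIMARY KEY, step TEXT, vars TEXT)")
    return db


def start(db: sqlite3.Connection, instance_id: str, variables: dict) -> None:
    """Run to the first wait point, persist, and return; nothing keeps running."""
    db.execute(
        "INSERT INTO instance VALUES (?, ?, ?)",
        (instance_id, "await_payment", json.dumps(variables)),
    )
    db.commit()


def trigger(db: sqlite3.Connection, instance_id: str, event: str) -> str:
    """Reload the stored step, advance if the event matches, and persist again."""
    row = db.execute(
        "SELECT step FROM instance WHERE id = ?", (instance_id,)
    ).fetchone()
    if row is None:
        raise KeyError(instance_id)
    # Events that do not match the current wait point leave the step unchanged.
    step = NEXT_STEP.get((row[0], event), row[0])
    db.execute("UPDATE instance SET step = ? WHERE id = ?", (step, instance_id))
    db.commit()
    return step
```

Between `start` and `trigger` there is no thread, loop, or sleep tied to the instance. The only thing that exists while it waits is one row in the table.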

The “Wild West” trap

The Ash death-spiral

A developer (“Ash”) builds a credit-card payment backend. The external service is flaky, so Ash adds retries. The retries must survive for hours, so Ash adds a payment table with a status column and a polling scheduler. The API becomes asynchronous (HTTP 202 plus polling), a poison message crashes the backend, ad-hoc monitoring scripts appear, and then SLA reporting is demanded. The result is an unmaintainable mess that only Ash understands, with the business logic buried inside hand-written state-handling code.

This is Wild West integration: ungoverned, point-to-point integration where every team re-implements persistence, retries, scheduling, and reporting by hand — usually worse than a real tool would. The punchline is that all of those capabilities are built into a workflow engine. The two that matter most are persisting state (which enables waiting) and scheduling (which enables retries).

Key insight

Waiting is logical, not a blocked thread. The engine writes the instance’s current state to its database, returns the thread, and does nothing until a triggering event — a user pressing a button or a scheduler firing — reloads the state and resumes. That is how one engine can hold millions of “waiting” instances cheaply.

Don’t store the wrong things

A workflow engine tracks control flow — where each instance is and what happens next. It is not a database for your business entities. Keep the customer record, the order, and the invoice in your own service; let the engine hold only references (IDs) plus the small amount of data it needs to route. Mixing business data into the engine couples your domain to a tool you should be able to swap.
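
One way to picture the split is a small mapping step that runs before an instance is started. This is a sketch under assumptions: the order record layout, the `OrderProcessVariables` name, and the approval threshold are hypothetical and only illustrate passing references plus routing data.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class OrderProcessVariables:
    """What the engine holds: references plus the little it needs to route."""
    order_id: str
    customer_id: str
    requires_approval: bool  # routing flag (hypothetical)


def to_process_variables(order: dict) -> dict:
    """Reduce a full order record, owned by the order service, to engine variables."""
    variables = OrderProcessVariables(
        order_id=order["id"],
        customer_id=order["customer"]["id"],
        requires_approval=order["total_cents"] > 100_000,  # assumed threshold
    )
    return asdict(variables)
```

The customer's name, address, and order lines never reach the engine. A task that needs them looks them up in the owning service by ID when it runs.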

Rule of thumb

Before hand-rolling a status column and a polling loop, give a workflow engine a second thought — even when the project “seems simple.” Simple processes are exactly where Wild West integration begins.

When to use it — and when not

✅ Reach for it when

Logic must wait for a remote service, a human, or a timer before it can continue
A flow spans minutes, hours, or weeks and must survive restarts and crashes
You find yourself hand-rolling status columns, polling schedulers, and retry loops
An automated process needs only occasional manual intervention for exceptions (straight-through processing)
A flow crosses transaction, organizational, company, or human boundaries

⛔ Think twice when

The whole operation completes in one synchronous, in-memory transaction with no waiting
The logic is a pure decision that resolves in a single atomic step (use a decision engine instead)
You only need fire-and-forget messaging with no end-to-end accountability
The control flow is so trivial that a single function call captures it fully
You are tempted to store business entities in the engine rather than referencing them by ID

Related topics

Where Workflow Engines Fit (auto-foundations)

A workflow engine is a persistent, scheduling state machine — run it as a service, keep it decentralized, and judge its adoption by return on investment.

Orchestration vs Choreography (auto-foundations)

Orchestration is command-driven, choreography is event-driven — and the real architecture choice is made link by link, based on who owns the outcome.

Handling Failures and Retries (auto-operating)

Remote calls are unreliable — keep failures local with stateful retries, design every operation to be idempotent, and clean up after ambiguous failures.

Check your understanding


1. What are the two core capabilities that make a workflow engine good at long-running work?

Ruecker frames the engine as a state machine that is good at waiting and scheduling. Durable state lets instances wait; scheduling lets the engine become active to retry or escalate.

2. In a workflow engine, what does it mean that an instance is 'waiting' for a week?

Waiting is logical, not a blocked thread. The engine stores current state, returns the thread, and reloads state only when a user action or a scheduler timer fires.

3. Why does crossing a boundary force long-running behaviour?

A remote call can cross local, transaction, organizational, company, or human boundaries. Each one adds availability and latency problems, so the calling side must be able to wait and recover.

4. What is the 'Wild West integration' anti-pattern?

The 'Ash' story shows ungoverned point-to-point integration drifting into homemade status tables, polling schedulers, and monitoring scripts — exactly what an engine provides for free.

5. What is 'straight-through processing' (STP)?

STP combines automating the control flow with automating the tasks themselves, so people are involved only when something unusual happens.

6. Where should a workflow engine store your customer record and invoice?

The engine tracks control flow, not business entities. Keeping domain data in your service avoids coupling your domain to a tool you should be able to swap.
