The Architecture Reference

Ed patterns · Event-Driven · Intermediate

Streams and the Log

The durable append-only log as the substrate of event-driven systems — partitions, offsets, retention, compaction, tombstones, and the Kappa architecture.

Ed patterns Intermediate ⏱ 4 min read Complete

🧭 Analogy

Think of the log as a ship’s logbook: entries are appended in order and never erased, each reader keeps their own bookmark (offset), and anyone can flip back to page one and re-read the whole voyage. The broker is the ship — it just keeps the book safe, replicated, and fast to read.

The log is the foundation

Underneath every event-driven system is a durable append-only log served by an event broker (Kafka is canonical; Pulsar is the other unlimited-retention option). Bellemare lists the minimal storage and serving requirements a broker must meet:

  • Partitioning — substreams that enable parallel consumption.
  • Strict ordering — guaranteed within a partition.
  • Immutability — events are never modified after they are written.
  • Indexing (offsets) — each consumer tracks its own position; consumer lag = tail index − consumer index.
  • Infinite retention and replayability — read the same history any number of times.

This is what separates an event broker from a message broker: a message broker hands each message to one consumer and deletes it, so it cannot communicate state to many consumers. The log can.

graph LR
K["Key"] -->|"hash → partition"| PA["Partition 0: e0 e1 e2 ..."]
K --> PB["Partition 1: e0 e1 e2 ..."]
PA -->|"offset 1"| CA["Consumer instance A"]
PB -->|"offset 0"| CB["Consumer instance B"]
PA -->|"offset 2 (replay)"| CR["Replay / new consumer"]

Partitions, keys, and data locality

A partitioner deterministically maps an event key to a partition, usually by hashing. Because same-key events always land on the same partition, a consumer that owns that partition holds a complete account of every key it sees — this is data locality. It lets a service scale to many instances, each owning a subset of partitions, while keeping per-key state coherent. The active instances in a consumer group are capped at the partition count, so partition count sets the ceiling on parallelism.

The key insight

Retention is the deal-breaker when choosing a broker. Only Kafka and Pulsar offer unlimited retention; Kinesis (365 days), Event Hubs Premium (90 days), and Pub/Sub (31 days) cap it — which makes them unsuitable as the durable source of truth.

Keeping the log affordable: compaction and tombstones

An append-only log grows forever. Two mechanisms tame it:

  • Compaction — the broker retains only the most recent event per key, deleting predecessors and tombstones. You trade history for size; current state stays cheap to materialize.
  • Tombstones — a keyed event with a null value, signaling deletion. The tombstone is itself eventually compacted away.
graph LR
Before["Before: A=1, B=1, A=2, C=1, A=3, B=null"] -->|"compaction"| After["After: A=3, C=1<br/>(B tombstoned away)"]

Tiered storage (Kafka KIP-405) transparently offloads old segments to cheap cold storage, giving effectively infinite retention without infinite hot-disk cost.

Kappa over Lambda

Because one stream can serve both historical and real-time needs, Jay Kreps’s Kappa architecture uses a single stream: a consumer sets its offset to the beginning, consumes forward, and builds its own state store. The Lambda architecture instead runs parallel batch and stream paths — and Bellemare rejects it: it forces dual code paths, batch and stream models must evolve in lockstep, results often fail to converge, and merging multiple Lambda products scales combinatorially (2 products → 4 relationships, 3 → 8). Kappa, supported by indefinite retention, tiered storage, and compaction, avoids all of it.

Don't infer lag from the last event time on a compacted stream

For a compacted state product, hours or days between events for a key is normal. Inferring “this consumer is behind” from wall-clock gaps will mislead you — rely on offsets, not timestamps, to judge how caught-up a consumer is.

See also

When to use it — and when not

✅ Reach for it when

  • You need durable, replayable history that many consumers read independently
  • Parallelism via partitions and data locality on a key matters
  • You want a single mechanism for both historical and real-time data (Kappa)

⛔ Think twice when

  • The broker caps retention below your replay needs (Kinesis, Event Hubs, Pub/Sub)
  • You only need a fire-and-forget, loss-tolerant notification (ephemeral messaging may suffice)
  • Strict global ordering across all partitions is required (ordering is per-partition only)

Check your understanding

Score: 0 / 4

1. Ordering guarantees in a log-based broker hold…

Strict ordering is guaranteed within a partition. Same-key events land on the same partition via the partitioner, which is the basis of data locality.

2. Consumer lag is computed as…

Lag = tail index − consumer index. It is the key signal for autoscaling and for waking FaaS functions.

3. Log compaction…

Compaction keeps current state cheap while discarding history; a tombstone (key with null value) signals deletion and is itself eventually compacted.

4. The Kappa architecture differs from Lambda in that it…

Kappa lets consumers set their offset to the beginning and consume forward, building their own state — avoiding Lambda's dual code paths and combinatorial merge problem.

Comments

Sign in with GitHub to join the discussion.