🧭 Analogy
A schema is the legal contract attached to every shipment. Without it, the receiver guesses what's in the box and how to handle it, and quietly mishandles it the day the sender changes the packing. With it, both sides agree in advance, and the rules for amending the contract are written down so neither party can surprise the other.
The event is the message; the schema is the contract
Bellemare frames communication with Claude Shannon: the fundamental problem is reproducing at one point a message selected at another, with content and meaning intact. The data contract is the format of the data plus the logic under which it is created. It has two parts:
- Data definition — what is produced: fields, types, structures.
- Triggering logic — why it is produced: the business logic that fired the event.
Changing the data definition is common and manageable; changing triggering logic is more dangerous because it alters the meaning of the original event.
The key insight
Any event communication lacking an explicit schema falls back to an implicit one — brittle, dependent on tribal knowledge, and prone to inconsistent interpretations of the single source of truth. Make the schema explicit so producers cannot serialize non-compliant data.
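As a minimal sketch of what "producers cannot serialize non-compliant data" can look like, the snippet below checks an event against its declared field types before encoding it. The `OrderPlaced` event, its fields, and the `serialize` helper are hypothetical names chosen for illustration; a real system would more likely rely on an Avro or Protobuf serializer for this check.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class OrderPlaced:
    """Hypothetical event; the field names are illustrative only."""
    order_id: str
    amount_cents: int
    currency: str = "USD"


def serialize(event: OrderPlaced) -> bytes:
    """Encode the event, refusing any value that violates its declared type."""
    for f in fields(event):
        value = getattr(event, f.name)
        if not isinstance(value, f.type):
            raise TypeError(
                f"{f.name}: expected {f.type.__name__}, got {type(value).__name__}"
            )
    return json.dumps(asdict(event)).encode("utf-8")


# A compliant event serializes normally.
ok = serialize(OrderPlaced(order_id="o-1", amount_cents=1200))

# A non-compliant event is stopped at the producer, before it reaches the stream.
try:
    serialize(OrderPlaced(order_id="o-2", amount_cents="12.00"))
    rejected = False
except TypeError:
    rejected = True
```

The point of the design is where the failure happens: at write time in the producer, not at read time in some consumer that has to guess what the field meant.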
Compatibility types: how schemas evolve safely
A schema format must support evolution so producer and consumer can update independently. The rules are compatibility types:
- Forward compatibility — data from a newer schema reads as if from an older one. Especially useful in EDA, where the producer typically updates first.
- Backward compatibility — data from an older schema reads with a newer one (consumer ahead of producer, or reprocessing old data). Missing fields are filled by default values applied at read time.
- Full compatibility — the union of both; the strongest guarantee. Use it whenever possible: you can always loosen later, but tightening is far harder. The data-mesh book adds full-transitive (any version to any version) as the ideal for data products.
graph TD
    S["Schema change"] --> F["Forward<br/>new producer, old consumer"]
    S --> B["Backward<br/>old producer, new consumer"]
    S --> FU["Full = forward AND backward"]
    FU --> FT["Full-transitive<br/>any version ↔ any version"]
    F -.->|"most common in EDA"| NOTE["Producer updates first"]
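The compatibility types above can be illustrated with a toy reader that resolves a record against the reader's own schema. This is a simplified sketch loosely modeled on how Avro-style schema resolution behaves, not any library's API: the schema representation (a dict of field name to default), the `_REQUIRED` sentinel, and the `read` function are all assumptions made for the example.

```python
_REQUIRED = object()  # sentinel: the field has no default

# V2 evolves V1 by adding an optional field with a default value.
V1 = {"order_id": _REQUIRED, "amount_cents": _REQUIRED}
V2 = {"order_id": _REQUIRED, "amount_cents": _REQUIRED, "currency": "USD"}
# V3 adds a field WITHOUT a default, which breaks backward compatibility.
V3 = {"order_id": _REQUIRED, "amount_cents": _REQUIRED, "region": _REQUIRED}


def read(record: dict, reader_schema: dict) -> dict:
    """Project a record onto the reader's schema.

    Unknown fields are ignored (forward compatibility); missing fields are
    filled from defaults at read time (backward compatibility).
    """
    out = {}
    for name, default in reader_schema.items():
        if name in record:
            out[name] = record[name]
        elif default is not _REQUIRED:
            out[name] = default
        else:
            raise KeyError(f"missing required field {name!r}")
    return out


old_event = {"order_id": "o-1", "amount_cents": 1200}                      # written with V1
new_event = {"order_id": "o-2", "amount_cents": 900, "currency": "EUR"}    # written with V2

forward = read(new_event, V1)   # old consumer, new producer
backward = read(old_event, V2)  # new consumer, old data

try:
    read(old_event, V3)
    v3_is_backward_compatible = True
except KeyError:
    v3_is_backward_compatible = False
```

Because the V1 to V2 change works in both directions it is fully compatible in this toy model, while V3 fails as soon as a new reader meets old data.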
The schema registry
Naively attaching the schema to every event is wasteful at scale. A schema registry stores each schema once, and each event carries only a short schema ID in its place (Confluent's wire format uses a 5-byte prefix: one magic byte plus a 4-byte schema ID). Beyond saving bandwidth, the registry provides data discovery, a write-path validation hook that rejects unauthorized evolution, auto-updated documentation, and downloadable schemas for code generation. Generated code turns a generic key/value map into a typed class with compiler checks and IDE support, which materially improves data quality.
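To make the "store once, ship an ID" idea concrete, here is a small sketch of framing a payload with a 5-byte prefix in the style of the Confluent wire format (a zero magic byte followed by a 4-byte big-endian schema ID). The `InMemoryRegistry` class and the `frame`/`unframe` helpers are hypothetical stand-ins written for this example; they are not the Schema Registry client API.

```python
import struct
from typing import Dict, Tuple

MAGIC_BYTE = 0  # first byte of the prefix


class InMemoryRegistry:
    """Hypothetical stand-in for a schema registry: one ID per distinct schema."""

    def __init__(self) -> None:
        self._id_by_schema: Dict[str, int] = {}
        self._schema_by_id: Dict[int, str] = {}

    def register(self, schema: str) -> int:
        if schema not in self._id_by_schema:
            schema_id = len(self._id_by_schema) + 1
            self._id_by_schema[schema] = schema_id
            self._schema_by_id[schema_id] = schema
        return self._id_by_schema[schema]

    def lookup(self, schema_id: int) -> str:
        return self._schema_by_id[schema_id]


def frame(schema_id: int, payload: bytes) -> bytes:
    """Prefix the payload with 1 magic byte + 4-byte big-endian schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload


def unframe(message: bytes) -> Tuple[int, bytes]:
    """Split a framed message back into (schema_id, payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte {magic}")
    return schema_id, message[5:]


registry = InMemoryRegistry()
schema_text = '{"type": "record", "name": "OrderPlaced", "fields": []}'
sid = registry.register(schema_text)

message = frame(sid, b"payload-bytes")
decoded_id, decoded_payload = unframe(message)
```

The consumer resolves `decoded_id` through the registry (and would normally cache the result), so the full schema text never travels with the event.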
Bellemare recommends Avro or Protobuf and explicitly does not recommend JSON as the contract, because schemaless JSON lacks full-compatibility evolution. (JSON Schema, the validated variant, is a reasonable third choice and uniquely supports data-quality keywords like value ranges.)
Breaking changes
Sometimes evolution isn’t enough — for example, a string address must become a structured Address. The technical deploy is easy; the hard part is social. The procedure:
- Design the new model (hardest part — redefining domain boundaries).
- Iterate with existing consumers and governance.
- Plan release, migration, and deprecation. Migrating history works when existing fields are merely remodeled, but falls short when the new model needs data that was never captured (e.g., splitting one address into home/work). Support both models for roughly 8–12 weeks.
- Execute: run old and new in parallel, mark the old deprecated (grandfather existing consumers, block new ones), and let retention purge it. Never mix incompatible event types in one stream — create a new stream instead.
graph LR
    P["Producer"] --> V1["Stream v1 (deprecated)<br/>existing consumers"]
    P --> V2["Stream v2 (new model)<br/>new + migrated consumers"]
    V1 -.->|"retention purges after 8-12 weeks"| Gone["removed"]
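The execute step can be sketched as a producer that writes the same fact to two separate streams during the deprecation window, plus a subscription check that grandfathers existing v1 consumers and blocks new ones. Everything here is an in-memory illustration under assumed names: the stream names, the consumer names, the `Address` fields, and the legacy string layout are hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Address:
    street: str
    city: str
    postal_code: str


# In-memory stand-ins for two separate streams (hypothetical names).
streams: Dict[str, List[dict]] = {"customer.v1": [], "customer.v2": []}

# Consumers already reading v1 are grandfathered; nobody new may join.
V1_GRANDFATHERED = frozenset({"billing", "crm-sync"})


def publish(customer_id: str, address: Address) -> None:
    """Write the same fact to both streams; each stream holds one event type."""
    legacy = f"{address.street}, {address.city} {address.postal_code}"
    streams["customer.v1"].append({"customer_id": customer_id, "address": legacy})
    streams["customer.v2"].append(
        {"customer_id": customer_id, "address": asdict(address)}
    )


def subscribe(consumer: str, stream: str) -> List[dict]:
    """Return the stream, refusing new subscriptions to the deprecated one."""
    if stream == "customer.v1" and consumer not in V1_GRANDFATHERED:
        raise PermissionError(f"{stream} is deprecated; use customer.v2")
    return streams[stream]


publish("c-1", Address("1 Main St", "Springfield", "12345"))

try:
    subscribe("new-service", "customer.v1")
    new_consumer_blocked = False
except PermissionError:
    new_consumer_blocked = True
```

Note that the producer, which holds the structured model, derives the legacy string itself; no consumer is asked to translate between the two shapes.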
Never push schema reconciliation onto consumers
For entities, the anti-pattern is producing both old and new schemas and letting consumers sort it out — they are never in a better position than the producer to resolve divergent schemas. Re-create entities in the new format via migration/replay, keeping the originals in their stream for forensics.
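A migration of this kind might look like the following sketch, which replays the original entity stream, keeps the latest event per key, and re-creates each entity in the new format without touching the originals. The record layout, the comma-separated legacy address format, and the `needs_review` bucket are assumptions made for the example; records that cannot be converted are set aside for the producing team instead of being guessed at.

```python
from typing import Dict, List, Tuple


def migrate_entities(old_stream: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Re-create entities in the new format; the old stream is only read."""
    latest: Dict[str, dict] = {}
    for event in old_stream:
        # Later events overwrite earlier ones for the same entity key.
        latest[event["customer_id"]] = event

    migrated: List[dict] = []
    needs_review: List[dict] = []
    for event in latest.values():
        # Assumed legacy layout: "street, city, postal_code".
        parts = [p.strip() for p in event["address"].split(",")]
        if len(parts) != 3:
            needs_review.append(event)  # do not guess; the producer resolves it
            continue
        street, city, postal_code = parts
        migrated.append(
            {
                "customer_id": event["customer_id"],
                "address": {
                    "street": street,
                    "city": city,
                    "postal_code": postal_code,
                },
            }
        )
    return migrated, needs_review


old_stream = [
    {"customer_id": "c-1", "address": "1 Main St, Springfield, 12345"},
    {"customer_id": "c-2", "address": "unknown"},
    {"customer_id": "c-1", "address": "9 Oak Ave, Shelbyville, 67890"},
]

new_stream, needs_review = migrate_entities(old_stream)
```

The original three events remain in `old_stream` for forensics, while consumers of the new stream see a single, consistent schema.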
See also
- Event notification vs. state transfer — the events the contract describes.
- Streams and the log — where schematized events live.
- Data products — the schema as a product’s API.
When to use it — and when not
✅ Reach for it when
- Producers and consumers evolve independently and must not break each other
- You want code generation, data discovery, and producer-side validation
- Multiple languages consume the same stream
⛔ Think twice when
- Using JSON (schemaless) as the contract — it lacks full-compatibility evolution
- Mixing incompatible event types in one stream
- Pushing schema reconciliation onto consumers instead of resolving at the producer
Related topics
- Event notification vs. state transfer. Notification, event-carried state transfer (ECST), state events, and delta events: what each carries, the coupling it implies, and why state events are the right default for shared data.
- Streams and the Log. The durable append-only log as the substrate of event-driven systems: partitions, offsets, retention, compaction, tombstones, and the Kappa architecture.
- Data Products. Treating data as a first-class product: its makeup (code, infrastructure, ports), the three alignment types, multimodal access, and the medallion quality model.
Check your understanding
1. A well-defined data contract has two components:
The contract is the format plus the business logic under which the event is created; altering triggering logic is more dangerous because it changes the event's meaning.
2. Forward compatibility means…
Forward compatibility suits EDA's common pattern — the producer updates first. Backward compatibility is the reverse; full is the union of both.
3. The primary benefit of a schema registry is…
The registry also provides discovery, write-path evolution validation, and downloadable schemas for code generation — but the headline win is not shipping the schema with every event.
4. When a breaking schema change is unavoidable, the most important step is…
Technically the deploy is easy; the hard part is renegotiation and migration. Create a new stream, support both during a deprecation window, and let retention purge the old.