Why CDC: The Dual-Write Problem and the Log as Truth

Module: Foundations | Duration: 20 min read | Lesson: 1 of 11


TheWorldShop's checkout service writes an order to Postgres, then publishes an OrderCreated event to Kafka so the warehouse, the email service, and analytics all hear about it. Clean, obvious, and what almost everyone builds first.

Then one afternoon the database write succeeds but the Kafka publish fails (a network blip, a broker restart). The order exists in Postgres. No event was sent. The warehouse never ships it. The customer's card is charged for a package that never moves. A week later, a different bug does the opposite: the event fires but the transaction rolls back, so analytics counts revenue for an order that doesn't exist.

This is the dual-write problem, and it has no clean application-level fix. Change Data Capture (CDC) exists to make it disappear by changing where events come from. This lesson explains why CDC is the answer, and why the answer was hiding in the database's log the whole time.


2. Concept Explanation

The Dual-Write Problem

A "dual write" is any operation that must update two systems to be correct: write to the database and publish to Kafka, write to the DB and update a cache, write to two databases. The trap is that you can't make two independent writes atomic without a distributed transaction, and distributed transactions across a database and Kafka are impractical at scale.

So you're left with an ordering, and every ordering breaks:

  • DB first, then publish: if the publish fails, the DB has the change but no event was emitted. Downstream is permanently behind. (Lost event.)
  • Publish first, then DB: if the DB write fails or rolls back, you emitted an event for a change that never happened. (Phantom event.)
  • Wrap in a try/catch and retry: retries can double-publish, and a crash between the two writes leaves you inconsistent with no record of where you were.

There is no application-code arrangement of two independent writes that's both atomic and durable. The bug isn't in your code; it's in the shape of the problem.
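
The three orderings above can be reproduced with a small simulation. This is a minimal sketch, not real driver code: FakeDB, FakeBroker, and LostAckBroker are hypothetical in-memory stand-ins for Postgres and Kafka, and failures are injected by hand.

```python
class FakeDB:
    """Hypothetical in-memory stand-in for a database."""

    def __init__(self, fail=False):
        self.rows = []
        self.fail = fail

    def insert(self, row):
        if self.fail:
            raise RuntimeError("db write failed")
        self.rows.append(row)


class FakeBroker:
    """Hypothetical in-memory stand-in for Kafka."""

    def __init__(self, fail=False):
        self.events = []
        self.fail = fail

    def publish(self, event):
        if self.fail:
            raise RuntimeError("publish failed")
        self.events.append(event)


class LostAckBroker(FakeBroker):
    """Stores every event, but loses the acknowledgement of the first one."""

    def __init__(self):
        super().__init__()
        self.calls = 0

    def publish(self, event):
        self.events.append(event)
        self.calls += 1
        if self.calls == 1:
            raise RuntimeError("ack lost")


def db_then_publish(db, broker, order):
    db.insert(order)
    broker.publish(order)


def publish_then_db(db, broker, order):
    broker.publish(order)
    db.insert(order)


def publish_with_retry(broker, event, attempts=3):
    for _ in range(attempts):
        try:
            broker.publish(event)
            return
        except RuntimeError:
            continue  # the caller cannot tell "never arrived" from "ack lost"
    raise RuntimeError("gave up")


def run(write_fn, db, broker, order):
    """Run one ordering and report what each system ended up holding."""
    try:
        write_fn(db, broker, order)
    except RuntimeError:
        pass  # the caller sees an error, but one write may already have landed
    return {"in_db": order in db.rows, "in_broker": order in broker.events}


order = {"id": 5001}
lost = run(db_then_publish, FakeDB(), FakeBroker(fail=True), order)
phantom = run(publish_then_db, FakeDB(fail=True), FakeBroker(), order)

dup_broker = LostAckBroker()
publish_with_retry(dup_broker, order)

print("lost event:   ", lost)
print("phantom event:", phantom)
print("copies after retry:", len(dup_broker.events))
```

In the first run the order is in the database but not the broker; in the second it is in the broker but not the database; in the third the retry leaves two copies of the same event on the broker.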

The Insight: The Database Already Has a Log

Here's the move. Your database is already, internally, a log-based system. Every change is recorded in a durable, ordered transaction log (Postgres calls it the WAL, or write-ahead log; MySQL has the binlog; MongoDB the oplog), and in Postgres that record is written before the change touches the table files. That log is how the database guarantees durability, replicates to standbys, or both.

So the database has already solved, for itself, the exact problem you're fighting: an ordered, durable, atomic record of every change that committed. Change Data Capture reads that log and turns each committed change into an event.

Now there's only one write. The application writes to the database. That's it. The event isn't a second write you have to coordinate; it's a derived consequence of the database commit, produced by tailing the log after the fact. If the transaction commits, the change is in the log, so the event will be emitted. If it rolls back, the log never records a commit for it, so the connector emits no event. The atomicity problem dissolves because there's nothing to keep in sync.
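
To make "events derive from the log" concrete, here is a toy model, assuming a much simpler log than any real database keeps. ToyDatabase and tail are hypothetical names. In this model a rolled-back change still reaches the log but is followed by an ABORT marker rather than a COMMIT, which is roughly how a write-ahead log behaves; either way, the tailer emits nothing for it.

```python
class ToyDatabase:
    """Toy model of a log-backed database; not how Postgres is implemented."""

    def __init__(self):
        self.log = []  # append-only list of (txid, kind, payload)
        self._next_txid = 1

    def begin(self):
        txid = self._next_txid
        self._next_txid += 1
        return txid

    def insert(self, txid, table, row):
        self.log.append((txid, "INSERT", {"table": table, "row": row}))

    def commit(self, txid):
        self.log.append((txid, "COMMIT", None))

    def rollback(self, txid):
        self.log.append((txid, "ABORT", None))


def tail(log):
    """Yield one event per change, but only for transactions that committed."""
    pending = {}  # txid -> changes seen so far for that transaction
    for txid, kind, payload in log:
        if kind == "INSERT":
            pending.setdefault(txid, []).append(payload)
        elif kind == "COMMIT":
            for change in pending.pop(txid, []):
                yield {"op": "c", "table": change["table"], "after": change["row"]}
        elif kind == "ABORT":
            pending.pop(txid, None)  # discard: nothing is emitted


db = ToyDatabase()

t1 = db.begin()
db.insert(t1, "orders", {"id": 5001, "status": "pending"})
db.commit(t1)

t2 = db.begin()
db.insert(t2, "orders", {"id": 5002, "status": "pending"})
db.rollback(t2)

events = list(tail(db.log))
print(events)
```

Only order 5001 produces an event. The application never calls a publish function; the event exists because the commit marker does.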

Why This Is "The Log as Truth"

CDC makes the database's commit log the source of truth for what happened, and Kafka the transport for that truth. Every downstream system (the warehouse, email, analytics) reads the same ordered stream of committed changes. They can't disagree about what happened, because there's exactly one record of it, and it's the same record the database trusts for its own durability.
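
A short sketch of why readers of one ordered stream cannot disagree: each consumer replays the same list of changes into its own local view. The change format and the names apply_change and replay are simplified assumptions for illustration, not Debezium's actual envelope.

```python
def apply_change(state, change):
    """Apply one committed row change to a consumer's local view."""
    key = change["id"]
    if change["op"] == "d":
        state.pop(key, None)
    else:  # "c" (create) and "u" (update) both carry the new row
        state[key] = change["after"]
    return state


def replay(log, upto=None):
    """Build a local view by applying the first `upto` changes in order."""
    state = {}
    for change in log[:upto]:
        apply_change(state, change)
    return state


log = [
    {"op": "c", "id": 5001, "after": {"status": "pending"}},
    {"op": "u", "id": 5001, "after": {"status": "shipped"}},
    {"op": "c", "id": 5002, "after": {"status": "pending"}},
    {"op": "d", "id": 5002, "after": None},
]

warehouse = replay(log)
analytics = replay(log)
lagging = replay(log, upto=3)  # a consumer that has not read the last change yet

print(warehouse == analytics)
print(lagging)
```

The two caught-up consumers hold identical views. The lagging one is behind, not wrong: it still sees order 5002, and it converges to the same view once it reads the final change.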

This is the same "log is the unifying abstraction" idea from the Kafka internals course, now applied one layer down: the database's log becomes the event stream.

What CDC Is Not

CDC captures changes that have already committed. It's not a way to prevent a write or to validate it; it's downstream of the commit. It also doesn't capture intent (the business meaning); it captures rows changing. A row update from status='pending' to status='shipped' is what CDC sees; "the order shipped" is meaning you layer on top. (This distinction drives the outbox pattern in the next course, so hold onto it.)
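
The gap between a row change and its business meaning can be sketched as a small translation function. This is an illustration under assumptions: to_business_event and the event type names are hypothetical, and the code assumes each update carries a full "before" image of the row, which depends on how the table and connector are configured.

```python
def to_business_event(change):
    """Map a Debezium-style row change on orders to a domain event, or None."""
    before = change.get("before") or {}
    after = change.get("after") or {}

    if change["op"] == "c":
        return {"type": "OrderCreated", "order_id": after["id"]}

    status_changed = before.get("status") != after.get("status")
    if change["op"] == "u" and status_changed and after.get("status") == "shipped":
        return {"type": "OrderShipped", "order_id": after["id"]}

    return None  # a row changed, but nothing we have assigned meaning to


shipped = to_business_event({
    "op": "u",
    "before": {"id": 5001, "total": 42.00, "status": "pending"},
    "after": {"id": 5001, "total": 42.00, "status": "shipped"},
})

price_fix = to_business_event({
    "op": "u",
    "before": {"id": 5001, "total": 42.00, "status": "pending"},
    "after": {"id": 5001, "total": 40.00, "status": "pending"},
})

print(shipped)
print(price_fix)
```

The second change is a perfectly valid CDC event, yet it maps to no business event here. Deciding what it means is a choice made by whoever writes the mapping, not by CDC.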

Aha: The dual-write problem isn't a bug you can fix in application code; it's a property of writing to two systems at once. CDC doesn't solve the dual write; it eliminates it by deleting the second write. The event stops being something your code emits and becomes something the database's own commit log produces. One write, one truth, no coordination.


3. Worked Example

Let's see the dual-write bug, then watch CDC make it impossible.

The broken dual-write (don't ship this):

def place_order(order):
    db.execute("INSERT INTO orders (...) VALUES (...)")   # write 1
    db.commit()
    kafka.publish("orders", order)                        # write 2  <-- can fail independently

If the process crashes between commit() and publish(), the order is in Postgres forever with no event. No amount of try/catch fixes the crash-in-the-gap case.

The CDC version, where the app just writes to the DB:

def place_order(order):
    db.execute("INSERT INTO orders (...) VALUES (...)")
    db.commit()        # that's the entire write path. No publish.

A CDC connector tails the WAL and emits the event for you, because the commit landed in the log:

WAL:  ... | COMMIT txid=8841: INSERT orders(id=5001, total=42.00, status='pending') | ...
                                  │
                                  ▼ Debezium reads the WAL
Kafka topic "theworldshop.public.orders":
  { "op": "c", "after": { "id": 5001, "total": 42.00, "status": "pending" }, ... }

No event can be lost, because the event derives from the same log entry that made the commit durable. No phantom event, because a rolled-back transaction never writes a COMMIT to the log.

Bring up the lab. The CDC course needs a real database with its log enabled plus Kafka Connect running Debezium. Clone the lab repo (shared across courses) and start the stack:

git clone https://github.com/petascalelabs/petascalelabs-lab-setup.git
cd petascalelabs-lab-setup/ingestion-and-transport/change-data-capture/cdc-fundamentals/
./scripts/setup.sh

Verify Postgres (with logical replication on), Kafka, and Kafka Connect with Debezium are reachable:

./scripts/verify.sh
# expected: "Postgres 16 (wal_level=logical) ready, Kafka ready, Kafka Connect + Debezium 2.x on :8083, sample theworldshop DB seeded"
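
If you want help running the lab, paste this prompt into an AI assistant and fill in your environment details:

You are helping me run the lab for the "CDC Fundamentals with Debezium"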
course. The lab is in
petascalelabs-lab-setup/ingestion-and-transport/change-data-capture/cdc-fundamentals/
and includes:
  - docker-compose.yml: Postgres 16 (wal_level=logical), Kafka, Kafka Connect with the Debezium connectors
  - a seeded "theworldshop" database (orders, customers, inventory tables)
  - scripts/setup.sh, scripts/verify.sh, scripts/teardown.sh

My environment:
  OS: <fill in>
  RAM: <fill in GB>
  Docker version: <fill in>

Walk me through:
1. Confirming Docker has enough memory for Postgres + Kafka + Connect.
2. Any OS-specific notes for my OS.
3. How to confirm Postgres is actually in logical replication mode (wal_level=logical).
4. The teardown command to reclaim resources.

Do not assume my OS; ask if unclear.

Later lessons assume this stack is running and reuse its scripts.
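
If you would rather check the Connect side by hand, the sketch below asks Kafka Connect's REST API which connector plugins are installed and looks for the Debezium Postgres connector. It assumes Connect is reachable at http://localhost:8083 (the verify output only states port 8083, so the host is an assumption); the function names are hypothetical.

```python
import json
import urllib.request

# Class name of the Debezium Postgres source connector.
DEBEZIUM_PG = "io.debezium.connector.postgresql.PostgresConnector"


def fetch_connector_plugins(base_url="http://localhost:8083"):
    """Return the plugin list from Kafka Connect's /connector-plugins endpoint."""
    with urllib.request.urlopen(f"{base_url}/connector-plugins", timeout=5) as resp:
        return json.load(resp)


def has_plugin(plugins, class_name=DEBEZIUM_PG):
    """True if any installed plugin reports the given connector class."""
    return any(plugin.get("class") == class_name for plugin in plugins)
```

With the stack running, has_plugin(fetch_connector_plugins()) should return True. If the request itself fails, Connect is probably not up yet or is listening on a different host or port.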


4. Your Turn

Exercise: TheWorldShop's loyalty service writes a points award to its database and publishes a PointsAwarded event to Kafka in the same function, a classic dual write.

  1. Describe two distinct failure scenarios (one for each ordering of the two writes) and the user-visible bug each causes.
  2. Explain why wrapping both writes in a retry loop doesn't fully fix it.
  3. Restate the loyalty flow as a CDC-based design. What does the application write, and where does the event come from?
  4. In the CDC design, what guarantees the event is emitted if and only if the DB change committed?
  5. CDC captures "a row in points changed." The business event is "points were awarded." Name one situation where that gap matters.

5. Real-World Application

Debezium was created at Red Hat specifically to give applications a way out of the dual-write trap, and it's now the dominant open-source CDC engine. Its entire design premise is "the database log already has the answer."

Companies such as Netflix, Airbnb, and Shopify, along with many other event-driven shops, run CDC for exactly this reason: keeping caches, search indexes, derived stores, and analytics in sync with operational databases without dual writes. "Materialize a read model from a CDC stream" is one of the most common modern data patterns.

The microservices "saga" and "outbox" patterns (next course) are direct descendants of recognizing the dual-write problem. Once teams internalized that two independent writes can't be made atomic, an entire family of log-based patterns followed.


6. Recap + Bridge

The dual-write problem, keeping a database and an event stream in sync with two independent writes, has no clean application-level fix, because two writes can't be made atomic. CDC eliminates it by deleting the second write: the application writes only the database, and events derive from the database's own commit log (WAL/binlog/oplog), making the log the single source of truth.

Next we get specific about how to read changes: the three capture methods (query-based, trigger-based, and log-based) and why log-based won.