Module: Foundations | Duration: 20 min read | Lesson: 1 of 10

TheWorldShop launches a new analytics table fed by a live stream. It works for today's orders. Then the analytics team asks the obvious question: "Can we see the last three years too?" The streaming pipeline only has data from launch day forward. Someone needs to load three years of history into the same table the live stream is writing to, right now, without corrupting it, double-counting, or taking the live pipeline down.

An engineer kicks off a bulk load of three years of order history. Halfway through, it collides with the live stream writing the same partitions, the table shows duplicated orders, the live consumers see inconsistent data, and the "quick backfill" becomes a multi-day incident.

Backfilling (loading historical data into a system that's also serving live data) is one of the most error-prone operations in data engineering, and one of the least taught. This course is about doing backfills safely. It opens with why they're hard: the present and the past are fighting over the same tables.

2. Concept Explanation

What a Backfill Is

A backfill loads historical data into a destination, usually because:

A new pipeline needs history (the launch scenario: the stream only has data from launch forward).
A bug corrupted past data and you must reprocess a date range correctly.
A schema change needs old data re-derived (a new column computed for history).
A new source must be loaded from its beginning.

The defining tension: you're loading the past into a system that's serving the present. If those were separate systems, it'd be easy. They're not: the backfill and the live pipeline write the same tables, and that overlap is where everything goes wrong.

Why Backfills Are Hard (The Four Hazards)

Collision with the live stream. The backfill writes historical partitions while the live pipeline writes current ones. If the two overlap at the boundary between "history" and "now", they can double-write, race, or produce inconsistent state. Coordinating the handoff is the core difficulty (Lesson 4).
Non-idempotency. A backfill is huge and will fail partway (network, OOM, a bad batch). If re-running isn't idempotent, the retry duplicates everything loaded so far. Backfills must be idempotent (Lesson 2), the same lesson as batch ingestion, now at history scale.
Scale and cost. Three years of data is orders of magnitude more than a daily increment. A naive backfill can take days, cost a fortune, and overwhelm the source and destination (Lesson 3 covers partition-aware chunking; Lesson 8 covers cost).
Correctness vs the live data. When the backfill meets the live stream at the boundary, you need exactly the right data: no gap (missing the handoff window) and no overlap (double-counting it). Getting the boundary right is subtle (Lessons 4, 5).
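
The non-idempotency hazard can be sketched in a few lines. This is a toy illustration in plain Python, not the lab's implementation: the `Order` record, `append_load`, and `merge_load` are hypothetical names, and a list and a dict stand in for a real table. It only shows why a retry doubles an append-style load but leaves a keyed load unchanged.

```python
from typing import Dict, List, Tuple

Order = Tuple[str, float]  # hypothetical minimal record: (order_id, amount)


def append_load(table: List[Order], batch: List[Order]) -> None:
    """Non-idempotent: every run appends, so a retry duplicates rows."""
    table.extend(batch)


def merge_load(table: Dict[str, float], batch: List[Order]) -> None:
    """Idempotent: rows are keyed by order_id, so a retry overwrites in place."""
    for order_id, amount in batch:
        table[order_id] = amount


batch = [("o-1", 10.0), ("o-2", 25.5), ("o-3", 7.25)]

appended: List[Order] = []
append_load(appended, batch)
append_load(appended, batch)  # simulated retry after a mid-run failure

merged: Dict[str, float] = {}
merge_load(merged, batch)
merge_load(merged, batch)  # the same retry, but with a keyed write

print(len(appended))  # 6: every order is counted twice
print(len(merged))    # 3: the retry changed nothing
```

The keyed write only helps if the key really identifies a row; the choice of key is an assumption here.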

The Core Principle: Idempotent, Bounded, Reconcilable

The safe-backfill philosophy, built over this course:

Idempotent (Lesson 2): re-running any part produces the same result. A backfill will be re-run; design for it.
Bounded (Lesson 3): processed in partition-aligned chunks, so you can run, resume, parallelize, and verify piece by piece, not one monolithic job.
Reconcilable (Lessons 5, 9): you can prove the backfill is correct. Counts and checksums match the source, with no gap or overlap with the live data.

A backfill that's idempotent, bounded, and reconcilable is safe (re-runnable, resumable, verifiable). One that's monolithic, non-idempotent, and unverified is the multi-day incident.
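
To make "bounded" concrete, here is a minimal sketch of a partition-aligned chunk plan with resume. It assumes monthly partitions and half-open date ranges; `monthly_chunks`, `pending`, and the `done` set are hypothetical helpers for illustration, not part of the lab tooling.

```python
from datetime import date
from typing import Iterator, List, Set, Tuple

Chunk = Tuple[date, date]  # half-open [start, end): start inclusive, end exclusive


def next_month(d: date) -> date:
    """First day of the month after d."""
    return date(d.year + 1, 1, 1) if d.month == 12 else date(d.year, d.month + 1, 1)


def monthly_chunks(first: date, last: date) -> Iterator[Chunk]:
    """Yield one chunk per calendar month, from first's month through last's month."""
    cur = date(first.year, first.month, 1)
    while cur <= last:
        nxt = next_month(cur)
        yield (cur, nxt)
        cur = nxt


def pending(chunks: List[Chunk], done: Set[Chunk]) -> List[Chunk]:
    """Chunks still to run, in order, given the set already committed."""
    return [c for c in chunks if c not in done]


chunks = list(monthly_chunks(date(2023, 1, 1), date(2025, 12, 1)))
done = set(chunks[:10])  # pretend the first 10 months committed before a failure
todo = pending(chunks, done)

print(len(chunks), len(todo), todo[0][0])  # 36 26 2023-11-01
```

Because each chunk is an independent unit, a failed run resumes from `todo` instead of starting over, and each chunk can be verified on its own.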

Backfills Are a Lakehouse Strength

Why "for lakehouses"? Because open table formats (Iceberg/Delta, Strata 3) make safe backfills possible in ways older systems didn't:

Atomic commits let a backfill batch land all-or-nothing (no half-written partitions).
Snapshot isolation lets the live stream read a consistent table while the backfill writes (no seeing half a backfill).
Partition-level operations let you overwrite/merge specific historical partitions without touching live ones.
Time travel lets you verify and roll back.

The lakehouse is what turns "backfilling is terrifying" into "backfilling is a managed operation." This course leans on those features throughout.
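
The idea behind atomic commits and snapshot isolation can be shown with a toy model. This is not the Iceberg or Delta API: `ToyTable` is a hypothetical class that keeps immutable snapshots (partition name to row count) and publishes each new one in a single step. It also ignores concurrent writers and conflict detection, which real table formats handle.

```python
from types import MappingProxyType
from typing import Dict, List, Mapping


class ToyTable:
    """Toy model: a list of immutable snapshots; the last one is 'current'."""

    def __init__(self) -> None:
        self._snapshots: List[Mapping[str, int]] = [MappingProxyType({})]

    def current_snapshot_id(self) -> int:
        return len(self._snapshots) - 1

    def read(self, snapshot_id: int) -> Mapping[str, int]:
        """Read a pinned snapshot; later commits never change it."""
        return self._snapshots[snapshot_id]

    def commit(self, partition_rows: Dict[str, int]) -> int:
        """Stage the next snapshot off to the side, then publish it in one step."""
        staged = dict(self._snapshots[-1])
        staged.update(partition_rows)  # only the named partitions are replaced
        self._snapshots.append(MappingProxyType(staged))
        return self.current_snapshot_id()


table = ToyTable()
live_id = table.commit({"2026-05": 1200})  # the live stream's partition
reader_view = table.read(live_id)          # a consumer pins this snapshot
backfill_id = table.commit({"2023-01": 900, "2023-02": 950})  # one backfill chunk

print(sorted(reader_view))              # ['2026-05']
print(sorted(table.read(backfill_id)))  # ['2023-01', '2023-02', '2026-05']
```

The pinned reader never sees a half-applied chunk, and reading an older snapshot id is the toy equivalent of time travel.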

Aha: Backfills are hard for one reason: the past and the present are fighting over the same tables. If history and live data lived in separate systems it'd be trivial, but they share partitions, so a bulk historical load can collide, double-write, and corrupt the live pipeline at the boundary. The safe-backfill answer is three properties: idempotent (re-runnable), bounded (partition-aligned chunks), and reconcilable (provably correct). The lakehouse's atomic commits and snapshot isolation are what make achieving them practical instead of terrifying.

3. Worked Example

See the collision, then preview the safe approach.

Bring up the lab. This course needs a lakehouse (Iceberg/Delta), a live stream writing to it, and a backfill harness. Clone the lab repo (shared) and start it:

git clone https://github.com/petascalelabs/petascalelabs-lab-setup.git
cd petascalelabs-lab-setup/ingestion-and-transport/batch-ingestion-and-backfills/backfill-strategies/
./scripts/setup.sh

Verify the lakehouse, the live stream, and the backfill tooling are reachable:

./scripts/verify.sh
# expected: "Iceberg REST catalog + warehouse ready, live order stream running, 3 years of history in source, backfill runner ready, Spark/Flink available"

You are helping me run the lab for the "Backfill Strategies for Lakehouses"
course. The lab is in
petascalelabs-lab-setup/ingestion-and-transport/batch-ingestion-and-backfills/backfill-strategies/
and includes:
  - docker-compose.yml: an Iceberg REST catalog + object store, a live order stream writing to an Iceberg table, a source with 3 years of history, and a backfill runner (Spark/Flink)
  - scripts/setup.sh, scripts/verify.sh, scripts/teardown.sh, scripts/backfill.sh, scripts/reconcile.sh

My environment:
  OS: <fill in>
  RAM: <fill in GB>  (a lakehouse + stream + Spark wants several GB)

Walk me through:
1. Confirming Docker has enough memory.
2. Bringing up the catalog, the live stream, and the source history.
3. Running a small sample backfill and reconciling it.
4. The teardown command.

Do not assume my OS; ask if unclear.

Step 1, the collision (the wrong way):

./scripts/live-stream.sh start                      # live orders writing to the Iceberg table
./scripts/backfill.sh --naive --range 2023-01..2025-12   # bulk-load 3 years, no coordination
./scripts/reconcile.sh orders

COLLISION at the boundary: backfill and live stream both wrote 2026-05 partitions
duplicates: 14,002 orders | live consumers saw inconsistent table mid-backfill
=> the "quick backfill" corrupted the live pipeline

Step 2, the four hazards made concrete:

./scripts/hazards.sh
# 1 collision: backfill overlapped live partitions
# 2 non-idempotent: a retry of the failed naive run would double everything again
# 3 scale: naive single job, 3 years, OOM-prone, days-long
# 4 boundary: gap OR overlap between history and live, both wrong
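
Hazard 4 comes down to comparing two timestamps. The sketch below assumes the backfill covers a half-open range ending at `backfill_end` (exclusive) and the live stream covers everything from `live_start` (inclusive); `classify_boundary` and the `cutover` value are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta


def classify_boundary(backfill_end: datetime, live_start: datetime) -> str:
    """Backfill covers [..., backfill_end); live covers [live_start, ...)."""
    if backfill_end < live_start:
        return "gap"      # events in between are loaded by neither side
    if backfill_end > live_start:
        return "overlap"  # events in between are loaded by both sides
    return "exact"


cutover = datetime(2026, 5, 1)  # hypothetical handoff instant
one_hour = timedelta(hours=1)

print(classify_boundary(cutover - one_hour, cutover))  # gap
print(classify_boundary(cutover + one_hour, cutover))  # overlap
print(classify_boundary(cutover, cutover))             # exact
```

Half-open ranges make the "exact" case unambiguous: an event stamped exactly at the cutover belongs to the live side only.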

Step 3, preview the safe properties:

./scripts/backfill.sh --safe --range 2023-01..2025-12 \
  --chunk monthly --idempotent merge --boundary explicit
./scripts/reconcile.sh orders
# bounded (monthly chunks), idempotent (merge), explicit boundary with live
# reconcile: source == table, NO duplicates, NO gap. Safe.

The same backfill, made idempotent + bounded + reconcilable, lands correctly without touching the live stream's integrity.
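
What a reconcile step checks can be sketched as a count plus an order-independent checksum. This is an assumption about the general technique, not a description of what `scripts/reconcile.sh` does; the `Row` shape and `fingerprint` function are hypothetical.

```python
import hashlib
from typing import Iterable, Tuple

Row = Tuple[str, str]  # hypothetical: (order_id, amount as a canonical string)


def fingerprint(rows: Iterable[Row]) -> Tuple[int, int]:
    """Return (row_count, checksum). The checksum ignores row order."""
    count = 0
    checksum = 0
    for order_id, amount in rows:
        digest = hashlib.sha256(f"{order_id}|{amount}".encode("utf-8")).digest()
        # Sum modulo 2**64 rather than XOR: XOR would cancel out exact duplicates.
        checksum = (checksum + int.from_bytes(digest[:8], "big")) % (1 << 64)
        count += 1
    return count, checksum


source = [("o-1", "10.00"), ("o-2", "25.50"), ("o-3", "7.25")]
table_ok = [source[2], source[0], source[1]]  # same rows, different order
table_dup = source + [source[0]]              # one order loaded twice

print(fingerprint(source) == fingerprint(table_ok))   # True
print(fingerprint(source) == fingerprint(table_dup))  # False
```

Run per chunk, a comparison like this localizes a mismatch to one partition instead of the whole three-year range.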

4. Your Turn

Exercise: TheWorldShop launched a streaming analytics table last month and now needs three years of history loaded into the same table while the stream keeps running.

Explain why this is fundamentally harder than loading history into a fresh, separate table.
Identify the four hazards this backfill faces, tied to the live stream and the data scale.
A teammate proposes one big bulk-load job over the full three years. Give two reasons that's dangerous.
State the three properties a safe backfill must have, and what each protects against.
Name two lakehouse (Iceberg/Delta) features that make this backfill safer than it would be on a plain warehouse, and what each provides.

5. Real-World Application

Backfills are a recurring, high-stakes operation: every new analytical table, every reprocessing after a bug, and every schema migration needs one. The "naive bulk load collided with live and corrupted the table" incident is a well-known data-engineering war story.

The lakehouse made safe backfills practical: Iceberg/Delta's atomic commits, snapshot isolation, and partition operations are precisely what let a backfill and a live stream share a table safely. This is a major reason the lakehouse (Strata 3) became the default analytical store, and why this course is framed around it.

Idempotent + bounded + reconcilable is the universal safe-backfill recipe: the same idempotency from batch ingestion (D1), now combined with partition-aligned bounding and explicit reconciliation. It reappears in every lesson of this course and in the CDC and platform tracks.

6. Recap + Bridge

A backfill loads history into a system serving the present. It's hard because the past and present share the same tables, which risks collision, duplication, and inconsistency, especially at the boundary. Non-idempotency, scale, and boundary correctness compound the problem. The safe recipe is idempotent (re-runnable), bounded (partition-aligned chunks), and reconcilable (provably correct), and the lakehouse's atomic commits and snapshot isolation make achieving it practical.

The first property, idempotency, is also the first technique. Next: idempotent writes into Iceberg/Delta, and how the lakehouse's MERGE and atomic commits make a backfill safe to re-run.