Module: Building Pipelines That Don't Page You | Duration: ~12 min | Lesson: 1 of 8

TheWorldShop's nightly revenue job failed halfway through on Tuesday. Standard stuff: Dev clears the failed tasks and reruns. Wednesday morning, revenue is reported 18% too high. The rerun didn't replace Tuesday's partial work, it appended to it. The half-written rows from the failed run were still there, and the rerun added a full day on top.

Dev didn't do anything wrong operationally. Clearing and rerunning is the standard recovery. The pipeline was the problem: it was built assuming it would only ever run once per day. The moment reality made it run twice, it double-counted, and the alert that fired said "task succeeded."

The retry button is going to get pressed. Backfills happen. Schedulers fire twice. The question isn't whether your task runs more than once, it's whether you designed for it.

2. Concept Explanation

Idempotency is a property you build in, not luck you have

A task is idempotent if running it twice produces the same result as running it once. Running it five times, same result. This is not a nice-to-have. In any orchestrator, retries, backfills, manual reruns, and scheduler double-fires are normal operations, not exceptions. A pipeline that's only correct when each task runs exactly once is a pipeline that's correct by luck, and luck runs out the first time someone presses the retry button.

The reframe is the whole lesson: design every task assuming it will run twice. Not "handle the error case." Assume the happy path includes re-execution. If that assumption makes a task uncomfortable, the task is wrong, not the runtime.

The append trap

Dev's bug is the most common non-idempotent pattern: INSERT INTO ... SELECT. Appending is non-idempotent by construction, because the second run adds to what the first run left. Every retry inflates the data.

The idempotent alternatives all share one idea: make the write replace a well-defined scope, not add to whatever's there.

Overwrite a partition. INSERT OVERWRITE TABLE t PARTITION (dt='2026-06-09') replaces exactly that day. Run it ten times, the day's data is identical. This is the workhorse pattern.
MERGE / upsert on a key. MERGE INTO target USING source ON target.id = source.id updates existing rows and inserts new ones. Re-running re-applies the same merge, no duplication, as long as the key is stable.
Delete-then-insert in one transaction. DELETE WHERE dt = X; INSERT ... WHERE dt = X inside a transaction. The scope is deleted and rewritten atomically.

The common thread: the task owns a scope (a partition, a key range, a date) and makes that scope match the computed result, regardless of what was there before.

Idempotency needs a deterministic scope key

Overwrite-a-partition only works if the task knows which partition deterministically, from its inputs, not from the wall clock. This is where the logical-date discipline from the vocabulary track pays off. The task must derive its scope from the run's logical date (ds), not NOW(). If the scope is "today as of when this happens to run," a retry the next morning writes a different partition and idempotency is lost. Scope must be a pure function of the run's parameters.

idempotent:     write partition dt = {{ ds }}          # same every rerun of that run
NOT idempotent: write partition dt = CURRENT_DATE      # changes with wall clock

Surrogate keys: the subtle non-idempotency

A trap from the vocabulary track worth repeating: generating surrogate keys with a sequence or random UUID at write time is non-idempotent even if you "overwrite," because the same logical row gets a different key each run. Downstream joins on that key then see "new" rows. Idempotent surrogate keys are deterministic: a hash of the natural key (md5(order_id || customer_id)), so the same input always yields the same surrogate. Then a rerun produces byte-identical output, and downstream is stable.

Side effects are the hard part

Database writes are easy to make idempotent. Side effects to the outside world are where it gets real:

Sending an email / Slack / webhook is inherently non-idempotent, the recipient gets two messages. Guard with a "have I already sent this for this run?" check (a dedupe table keyed on run id), or move the send behind an idempotency key the downstream service honors.
Calling a payment or external API. Use the provider's idempotency key feature (most payment APIs have one) so a retry doesn't double-charge.
Publishing to a queue. At-least-once delivery means consumers must dedupe, the exactly-once discussion from the vocabulary track. Idempotency moves to the consumer.

The rule: for every side effect, ask "what happens on the second run?" and design the guard before the retry forces the question.

Idempotency is testable

Because it's a precise property, you can test it. The canonical test: run the task twice against the same input and assert the output is identical (same row count, same checksum). Put it in CI. A task that passes "run once" but fails "run twice" is a latent 3am incident, and the test catches it before it ships.

3. Worked Example

Turning Dev's append-based revenue job idempotent.

Before (non-idempotent, the Hook's bug):

-- Appends. Every rerun adds another day on top of whatever's there.
INSERT INTO revenue_daily
SELECT order_date, region, SUM(amount_usd) AS revenue
FROM orders_usd
WHERE order_date = CURRENT_DATE        -- wall clock: also non-deterministic
GROUP BY order_date, region;

Two bugs: it appends, and its scope is CURRENT_DATE.

After (idempotent: overwrite a deterministic partition):

-- Scope keyed on the run's logical date, not the wall clock.
-- OVERWRITE replaces exactly this partition: same result every rerun.
INSERT OVERWRITE TABLE revenue_daily PARTITION (order_date = '{{ ds }}')
SELECT region, SUM(amount_usd) AS revenue
FROM orders_usd
WHERE order_date = '{{ ds }}'
GROUP BY region;

Now Tuesday's failed-then-rerun sequence is safe: the rerun overwrites Tuesday's partition with the correct full-day total, discarding the half-written rows. No double count. The retry button does what Dev expected.

Add a deterministic surrogate key, if downstream needs one:

-- md5 of the natural key, NOT a sequence/UUID -> stable across reruns.
SELECT
  md5(order_date || '|' || region) AS revenue_sk,
  region,
  SUM(amount_usd) AS revenue
FROM orders_usd WHERE order_date = '{{ ds }}'
GROUP BY region;

Guard a side effect (the daily revenue Slack post):

def post_revenue_summary(run_id, summary):
    if already_sent(run_id):        # dedupe table keyed on run_id
        return                      # second run: no duplicate message
    slack.post("#finance", summary)
    mark_sent(run_id)

Test it, in CI:

def test_revenue_is_idempotent():
    run_revenue_job(ds="2026-06-09")
    first = checksum("revenue_daily", partition="2026-06-09")
    run_revenue_job(ds="2026-06-09")          # run it AGAIN
    second = checksum("revenue_daily", partition="2026-06-09")
    assert first == second        # identical, or the design is broken

Aha: Design every task assuming it'll run twice. If that thought hurts, the design is wrong, not the runtime. Retries, backfills, and double-fires are normal operations, not edge cases, so idempotency is a property you build in (overwrite a deterministic scope, deterministic keys, guarded side effects), never luck you hope holds when someone presses retry.

4. Real-World Application

Idempotency is the single most important property of a production pipeline, and every orchestrator's design assumes you've provided it. Airflow retries tasks by default; its whole recovery model (clear and rerun) is safe only if tasks are idempotent. Dagster and Prefect make the same assumption. The frameworks give you retries for free and quietly hand you the responsibility for idempotency, which is why the "rerun double-counted" incident is a rite of passage.

The patterns are baked into modern data tooling. dbt's incremental models use merge / insert_overwrite strategies precisely to be idempotent on rerun. Spark's INSERT OVERWRITE and the table formats' MERGE (Iceberg, Delta) exist so a reprocessed partition replaces rather than duplicates. Payment and messaging APIs ship idempotency-key features because they learned the double-charge lesson at scale. When you reach for one of these, you're reaching for idempotency, often without naming it.

The design discipline that separates senior engineers is treating idempotency as a requirement at design time, not a bug fixed after the first double-count. They derive scope from the logical date, use deterministic keys, guard every side effect, and add the run-it-twice test to CI. The connection to the rest of this stratum is direct: the failure-modes track's "retry that made it worse" is this lesson's design choice not made. The whole point of the design-patterns track is to turn those failures into decisions, and idempotency is the first and most important one.

5. Your Turn

Exercise: TheWorldShop's daily_inventory_adjustments task does: (1) INSERT INTO adjustments SELECT ... for the day, (2) generates a txn_id via a database sequence, (3) emails the warehouse manager a summary, (4) decrements a stock_levels table with UPDATE stock_levels SET qty = qty - :delta.

Identify every non-idempotent operation and explain what goes wrong on a rerun.
Rewrite each to be idempotent. Be specific about scope keys and guards.
Write the assertion you'd put in CI to prove the whole task is idempotent.

6. Recap + Bridge

Idempotency is a design choice: build every task to produce the same result whether it runs once or ten times, because retries and backfills are normal, not exceptional. Overwrite a deterministic scope instead of appending, derive that scope from the logical date not the wall clock, use deterministic surrogate keys, guard every external side effect, and add a run-it-twice test to CI. Next lesson takes the operation idempotency makes safe, the rerun, and elevates it: backfill as a first-class API, so reprocessing history is a designed feature, not a favor from "the Airflow person."