Module: Idempotency and State | Duration: ~13 min | Lesson: 1 of 6

Priya's teammate Sam writes a task that appends each day's orders to a daily_orders table. He uses INSERT. He's tested it; it writes 1,200 rows per day; everything looks fine. Sam says "it's idempotent because each row has a unique surrogate_key from gen_random_uuid()."

Three months later, the on-call rotation has a recurring 11pm dedupe job. The dedupe job sometimes fails because two rows for the same order_id have different surrogate_keys and the dedupe rule can't pick one cleanly.

Sam wasn't wrong about uniqueness. Every row had a unique surrogate key. He was wrong about idempotency. Each retry inserted different rows representing the same logical fact. The dedupe job is now load-bearing for correctness, and it was never supposed to exist.

2. Concept Explanation

The actual definition

A task is idempotent if running it twice has the same effect on the world as running it once.

That's it. The definition says nothing about:

Whether the output is bit-identical.
Whether the same rows are written.
Whether the function is deterministic.

It's a statement about state convergence. After N runs, the world is in the same state as after 1 run. The path doesn't matter; the destination does.

Three shapes that look idempotent and aren't

INSERT without a unique constraint, but with surrogate keys. Each run inserts a row. Each row is unique by surrogate. The table has N copies of the same logical fact after N runs. Not idempotent.
UPDATE counters SET value = value + 1. Looks like it converges. It doesn't; each run increments. Not idempotent.
API calls that mutate. loyalty_service.credit(customer, points=10) adds 10 points each call. Three retries means 30 points credited. Not idempotent (unless the API supports idempotency keys; see below).

Three shapes that are idempotent

INSERT ... ON CONFLICT (natural_key) DO NOTHING. A unique constraint on the natural key. First run inserts. Subsequent runs see the conflict, do nothing. State converges.
INSERT OVERWRITE PARTITION (dt=...). Partition-level idempotency. Each run replaces the partition with its computed contents. Run it N times, the partition's final contents are the same.
UPDATE counters SET value = (SELECT COUNT(*) FROM source WHERE ...). The new value is computed from source state, not derived from the previous value. Run it N times, the counter is the same.

The shape: idempotent tasks compute their target state and assert it. Non-idempotent tasks describe a delta and apply it.

Why this matters more than it seems

Retries are not optional. Every modern orchestrator retries by default (Lesson 5 of Course 1.1). Every cloud retries. Every network blip causes a retry. The expected number of runs per task over its lifetime is greater than 1.

If your task is non-idempotent, you have three choices:

Make it idempotent (the right choice).
Tolerate duplicates downstream (the dedupe-job antipattern; load-bearing for correctness).
Set retries=0 and accept that any failure means manual intervention (the never-scale choice).

Most teams pick (2) by accident. They write INSERT tasks, ship dedupe jobs to "clean up," and discover years later that the dedupe job has subtle correctness bugs and they've been double-counting some category of revenue.

The natural-key discipline

Every table that receives writes from a pipeline should have a natural key: the set of columns that uniquely identify a logical record. For daily_orders, the natural key is probably (order_id, line_item_id). For customer_events, (customer_id, event_id, event_timestamp). For daily_revenue_summary, just (revenue_date).

The natural key is what makes two writes the same write. Declaring it explicitly (as a unique constraint or as a primary key) lets the database enforce idempotency for you:

CREATE TABLE daily_orders (
    order_id      BIGINT      NOT NULL,
    line_item_id  INT         NOT NULL,
    customer_id   BIGINT      NOT NULL,
    amount_cents  BIGINT      NOT NULL,
    ...,
    PRIMARY KEY (order_id, line_item_id)   -- natural key
);

INSERT INTO daily_orders (order_id, line_item_id, customer_id, amount_cents, ...)
VALUES (...)
ON CONFLICT (order_id, line_item_id) DO NOTHING;

Retry the task 100 times. The table has the same rows. The dedupe job doesn't exist.

External side effects: the idempotency-key protocol

Internal writes (Postgres, S3, warehouse) are a SQL exercise. External side effects (refund a card, send an email, post a webhook) are a protocol exercise. You need:

An idempotency key the external API accepts. A string you generate that uniquely identifies this logical operation. Examples: Idempotency-Key: refund:order-12345:v1.
A dedupe table you write to before calling the API. If the row exists with status succeeded, skip the call. If it exists with status pending, decide what to do (often: re-call with same key; the API dedupes server-side).

Stripe pioneered this; Twilio, Square, SendGrid, and most modern transactional APIs accept idempotency keys. The pattern lets your task retry safely without double-charging or double-emailing.

For APIs that don't accept idempotency keys, the only safe pattern is the dedupe table on your side, plus the assumption that you'd rather skip a possibly-duplicate call than risk doubling it. This is conservative; some failure modes will be undercounted. Pick conservative.

3. Worked Example

Sam's broken task:

import uuid

def load_daily_orders(**context):
    ds = context["ds"]
    orders = fetch_orders_for_date(ds)
    for o in orders:
        db.execute("""
            INSERT INTO daily_orders (surrogate_key, order_id, line_item_id, customer_id, amount_cents)
            VALUES (%s, %s, %s, %s, %s)
        """, (str(uuid.uuid4()), o.order_id, o.line_item_id, o.customer_id, o.amount_cents))

What goes wrong: a retry inserts the rows again with new surrogate_key values. After 3 retries, the table has 3 logical copies of each order, each with a different surrogate. Downstream SUM(amount_cents) is 3x correct.

The idempotent rewrite:

-- DDL: declare what "same row" means
ALTER TABLE daily_orders
ADD CONSTRAINT daily_orders_natural_key
PRIMARY KEY (order_id, line_item_id);

def load_daily_orders(**context):
    ds = context["ds"]
    orders = fetch_orders_for_date(ds)
    for o in orders:
        db.execute("""
            INSERT INTO daily_orders (order_id, line_item_id, customer_id, amount_cents)
            VALUES (%s, %s, %s, %s)
            ON CONFLICT (order_id, line_item_id) DO NOTHING
        """, (o.order_id, o.line_item_id, o.customer_id, o.amount_cents))

Two changes:

The natural key is declared (primary key).
The insert respects it (ON CONFLICT DO NOTHING).

Now: retry 100 times, the table has the same rows. The dedupe job is unnecessary. The downstream sum is correct.

The partition-level alternative

For tables that are partitioned (very common for analytics), the cleaner pattern is INSERT OVERWRITE:

INSERT OVERWRITE TABLE warehouse.daily_orders PARTITION (dt = '{{ ds }}')
SELECT order_id, line_item_id, customer_id, amount_cents
FROM   raw_orders
WHERE  order_date = DATE '{{ ds }}';

Each run replaces the whole dt='{{ ds }}' partition. Retries are safe by definition. The natural key inside the partition matters less because the partition replacement is atomic.

This is the dominant pattern for daily/hourly analytic loads. It's idempotent at the partition level, which is the unit of work for these pipelines.

The external side effect

def credit_loyalty(**context):
    ds = context["ds"]
    customers_to_credit = fetch_eligible(ds)
    for c in customers_to_credit:
        idem_key = f"loyalty:credit:{ds}:{c.customer_id}"
        # Pre-check our dedupe table
        already = db.execute("""
            SELECT 1 FROM loyalty_attempts
            WHERE  idempotency_key = %s AND status = 'succeeded'
        """, (idem_key,)).fetchone()
        if already:
            continue
        # API-side idempotency
        loyalty_service.credit(
            customer_id=c.customer_id,
            points=c.points_to_credit,
            idempotency_key=idem_key,
        )
        db.execute("""
            INSERT INTO loyalty_attempts (idempotency_key, status)
            VALUES (%s, 'succeeded')
            ON CONFLICT (idempotency_key) DO UPDATE SET status='succeeded'
        """, (idem_key,))

Two layers of defense: local dedupe table and API-side idempotency key. Either alone is fragile. Both together are bombproof. The task is now safe to retry as often as the orchestrator wants.

Aha: Idempotent means safe to retry, not writes the same thing. The two are not the same. A task that inserts unique-by-surrogate-key rows isn't idempotent. A task that converges to the same target state regardless of how many times it ran is idempotent. The natural key is the difference.

4. Real-World Application

Stripe's API is the canonical example of "every endpoint accepts Idempotency-Key." That single design decision is why financial integrations against Stripe rarely double-charge. Internally, every well-run platform applies the same shape to its own services.

In the data-pipeline world, dbt encodes idempotency assumptions in its incremental model semantics: by default, each incremental run filters source data and inserts the delta. To be truly idempotent, the model needs a unique_key declared, which dbt translates to MERGE semantics under the hood. Skip the unique_key and your incremental model is non-idempotent; backfill it and you get duplicates.

The cultural rule in mature data teams: every write task declares a natural key, and the database enforces it. The orchestrator retries. The task converges. Nobody writes a dedupe job. The "every write has a natural key" rule is one line in every PR template that pays for itself within a quarter.

5. Your Turn

Exercise: Look at this task that updates a customer_lifetime_value table:

def update_clv(**context):
    ds = context["ds"]
    purchases = fetch_purchases(ds)
    for p in purchases:
        db.execute("""
            UPDATE customer_lifetime_value
            SET    total_revenue = total_revenue + %s,
                   updated_at    = NOW()
            WHERE  customer_id   = %s
        """, (p.amount_cents, p.customer_id))

Is this task idempotent? Walk through what happens if it runs twice for the same ds.
Rewrite it as an idempotent task. State which natural key you're using and what guarantees it.
Argue against the alternative "wrap it in a transaction with a status table" approach in 2-3 sentences.

6. Recap + Bridge

Idempotent means safe to retry. The world converges to the same state regardless of how many times you ran the task. The shape: declare a natural key, write tasks that compute target state from source (not deltas), and let the database enforce the contract. For external side effects, use idempotency keys at the protocol layer plus a dedupe table on your side. Next lesson: a small Airflow feature that newcomers consistently misuse as a database. XCom.