The Green DAG That Wasn't

Module: Anatomy of a Broken Pipeline | Duration: ~15 min | Lesson: 1 of 8


Priya is on-call for TheWorldShop. At 9:02am the analytics lead pings her: "the loyalty dashboard says zero new signups yesterday. Is something broken?" Priya pulls up Airflow. The load_signups DAG is green. Every task is green. The scheduler is happy. The metadata table says success for every run in the last 30 days.

The data is wrong anyway.

The dashboard is right. Zero rows landed. The DAG ran. The DAG did nothing. Priya is about to learn that a green tick in an orchestrator is a statement about an exit code, not a statement about data.


2. Concept Explanation

What "success" means to an orchestrator

Every orchestrator (Airflow, Dagster, Prefect, Argo) tracks one thing per task: did the worker process exit with status 0? That's it. A Python task that catches every exception and returns None is a success. A Spark job that wrote 0 rows because the upstream S3 prefix was empty is a success. A dbt run against an empty staging table is a success.

The orchestrator is doing exactly what it was built to do. It's tracking process completion, not data completion. The conflation is on us.

The three flavors of a silent green run

  1. No-op success. The task ran. It received zero input. It wrote zero output. Every step returned cleanly. This is the most common silent failure in production data pipelines.
  2. Caught-and-swallowed success. A try/except Exception: pass somewhere on the hot path. The exception fired, got eaten, and the task returned. The logs even have the stack trace, if you scroll far enough.
  3. Partial-write success. The task wrote some of what it should have. The orchestrator has no opinion on "some." See Lesson 4 for the partition-level version of this.

Why this is the #1 on-call lesson

On-call dashboards almost universally show task status from the orchestrator's metadata DB. That's a pipeline health signal. It tells you nothing about data health. You need a second signal, computed against the data itself, to close the loop. Without that second signal, your runbook is "wait for someone in business to notice." Which, for TheWorldShop, was a marketing analyst at 9am.

The fix shape

Every task that writes data should publish a metric the on-call rotation can alert on:

  • rows_written (with a non-zero threshold, contextual to the table)
  • bytes_written
  • max(updated_at) of the table after the write

Then a second job, running after the pipeline, checks the metric. The orchestrator can't be the only thing watching the orchestrator.


3. Worked Example

Here's a real shape of a silent-success task. It looks fine. Read it twice before scrolling.

def load_signups(execution_date, **kwargs):
    s3_prefix = f"s3://theworldshop-raw/signups/dt={execution_date.strftime('%Y-%m-%d')}/"
    try:
        df = spark.read.parquet(s3_prefix)
        df.write.mode("overwrite").saveAsTable("warehouse.signups")
        return "ok"
    except Exception as e:
        logging.warning(f"load_signups soft-failed: {e}")
        return "ok"

Three independent silent-failure paths in fourteen lines:

  1. If s3_prefix is empty (no files dropped yesterday), spark.read.parquet returns an empty DataFrame. saveAsTable("warehouse.signups") happily overwrites the table with zero rows. Task is green.
  2. If the prefix itself doesn't exist, parquet raises. The except catches it. Task returns "ok". Task is green.
  3. If S3 throws a transient permission error mid-read, except catches it. Task returns "ok". Table left in whatever state Spark got to before the throw. Task is green.

Now the contract version, which fails loudly:

def load_signups(execution_date, **kwargs):
    s3_prefix = f"s3://theworldshop-raw/signups/dt={execution_date.strftime('%Y-%m-%d')}/"
    df = spark.read.parquet(s3_prefix)
    row_count = df.count()
    if row_count == 0:
        raise ValueError(f"load_signups got 0 rows from {s3_prefix}")
    df.write.mode("overwrite").saveAsTable("warehouse.signups")
    return {"rows_written": row_count, "source": s3_prefix}

Three differences:

  • No bare except. Real exceptions kill the task. The orchestrator now knows.
  • An explicit zero-row check. Zero is a suspicious number for a signup table at TheWorldShop's scale.
  • A structured return value. The metric rows_written is now visible to a downstream "is the dashboard going to lie?" check.

Aha: A task exiting 0 is a statement about a process, not a statement about data. If you want "the data is right" to be a tracked signal, you have to write the check. The orchestrator will never write it for you.


4. Real-World Application

Every mature data platform pairs an orchestrator with a data observability signal. At Airbnb, Wall (their lineage tool) tracks freshness and volume per table and pages independently of Airflow. At Netflix, Atlas does the same. At smaller shops, this is a dbt test job that runs after the main DAG, or a Soda/Great Expectations check, or just a SELECT COUNT(*) FROM target_table WHERE dt = current_date that pages when the count looks wrong.

The teams that don't do this learn the same lesson Priya did: the green tick is a floor, not a ceiling. Above it lives the question "is the data actually here, and is it right?" That's the question the rest of this course is about.


5. Your Turn

Exercise: You're reviewing a teammate's PR for a new daily DAG that loads vendor returns into warehouse.returns. The PR description says "all tasks pass in staging, ready to merge."

  1. Name three failure scenarios the green-task signal will hide if you merge as-is.
  2. For each scenario, write the one-line metric or check you'd add to catch it.
  3. Decide where the check runs: inside the task, as a separate downstream task, or as an external observability job. Justify each placement.

6. Recap + Bridge

A green task means a process exited cleanly. It does not mean the data is right. The fix is to publish data-shaped metrics from every write and check them independently. We've named the silent-success failure mode. Next lesson is its evil twin: a task that ran, wrote the right data, and is still responsible for yesterday's report being wrong. Welcome to late-arriving data.