Module: Pipeline Quality vs Data Quality | Duration: ~13 min | Lesson: 1 of 6

Priya's manager messages on Slack: "the pipeline health dashboard is all green. why is finance saying yesterday's revenue is wrong?"

Priya opens the pipeline-health dashboard. All 47 DAGs are green. Last run times are within SLA. Task durations are normal. Cluster utilization is healthy. There is no signal anywhere on this dashboard that suggests anything is wrong.

She opens the warehouse and runs SELECT SUM(gross_revenue) FROM daily_revenue WHERE revenue_date = yesterday. The number is $312k. Finance's source-of-truth payment processor says it should be $1.2M. A roughly $900k discrepancy in a table where every DAG that touches it is green.

The pipeline is healthy. The data is wrong. Both statements are true at the same time, and a dashboard that shows only the first one is half a dashboard.

2. Concept Explanation

Two independent axes

Pipeline quality and data quality are orthogonal. Picture a 2x2:

                 data wrong           data right
pipeline   ┌────────────────────┬────────────────────┐
   green   │ silent corruption  │ the dream          │
           │ (this lesson)      │                    │
           ├────────────────────┼────────────────────┤
pipeline   │ the easy case      │ weird, but ok      │
   red     │ red AND wrong      │ stale-on-purpose   │
           └────────────────────┴────────────────────┘

Most teams build monitoring for the bottom-left quadrant (pipeline red, data wrong). That's the easy case. Something failed loudly; on-call sees the red square; the data is fixed when the rerun succeeds.

The dangerous quadrant is the top-left. Pipeline green, data wrong. No alarm fires from the pipeline-health side because nothing about the pipeline is anomalous. The bug is in the data, and the only thing that catches it is a check against the data itself.
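The 2x2 can be written down as a tiny lookup. This is only a sketch: the function name quadrant and its two boolean parameters are hypothetical, and the labels are copied from the grid above.

```python
def quadrant(pipeline_green: bool, data_right: bool) -> str:
    """Map the two independent axes onto the four quadrants of the 2x2."""
    if pipeline_green and data_right:
        return "the dream"
    if pipeline_green and not data_right:
        return "silent corruption"
    if not pipeline_green and not data_right:
        return "the easy case"
    # Pipeline red, data right.
    return "weird, but ok"


# Priya's morning: every DAG green, revenue wrong.
print(quadrant(pipeline_green=True, data_right=False))  # silent corruption
```

The point of keeping two separate inputs is that neither can be derived from the other: a pipeline-only dashboard is effectively calling this function with data_right left unknown.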

Pipeline-health signals

Pipeline-health signals come from the orchestrator's metadata:

task status (succeeded/failed)
task duration
DAG run latency
worker queue depth
scheduler heartbeat

These tell you "is the work getting done on time?" They tell you nothing about whether the work produced the right output.

Data-quality signals

Data-quality signals come from the data itself:

row counts vs expected baseline
SUM(amount) reconciliation against an external source of truth
null rates per column
referential integrity (orphaned foreign keys)
distributional checks (mean/median/p99 within a band)
column-value range checks (no negative ages, no future birthdates)

These tell you "is the output of the work right?" They tell you nothing about how it got there.
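A few of the signals listed above, sketched in plain Python over in-memory values. The function names, the 0.5 baseline ratio, and the sample ages column are illustrative assumptions, not part of any particular tool; in practice the same logic usually runs as SQL against the warehouse.

```python
from typing import Optional, Sequence


def null_rate(values: Sequence[Optional[float]]) -> float:
    """Fraction of values that are None (0.0 for an empty column)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def out_of_range(values: Sequence[Optional[float]], low: float, high: float) -> int:
    """Count non-null values that fall outside the closed range [low, high]."""
    return sum(1 for v in values if v is not None and not (low <= v <= high))


def row_count_ok(actual: int, baseline: float, min_ratio: float = 0.5) -> bool:
    """True when the row count is at least min_ratio of the expected baseline."""
    return actual >= baseline * min_ratio


# Hypothetical column with two nulls and two impossible ages.
ages = [34, None, 27, -3, 51, None, 140]
print(round(null_rate(ages), 3))        # 0.286
print(out_of_range(ages, 0, 120))       # 2
print(row_count_ok(400, baseline=1000)) # False
```

None of these functions knows or cares whether the task that produced the column succeeded, which is the sense in which the two axes are independent.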

Why teams conflate the two

The orchestrator owns both running the work and recording whether work ran. The metadata DB is the obvious place to put a dashboard. Data-quality checks live somewhere else (Soda, Great Expectations, dbt tests, ad-hoc SQL). Building one dashboard from the orchestrator's metadata is one config file. Building a second dashboard from data-quality results is integration work.

Most teams do the easy part and don't do the hard part. They get away with it for a while. Then someone notices a $900k discrepancy.

The "two dashboards, one wall" rule

A working data org has two dashboards (or two halves of one dashboard):

Pipeline health	Data health
Tasks green/red	Tables fresh/stale
DAG durations	Row counts vs baseline
Worker queue depth	Critical metric reconciliations
Last successful run	Null/distinct/range checks per critical column

On-call is paged when either side goes red. The two halves are kept separate so people can see what kind of problem is happening at a glance.

The mistake is showing only the left column. The next mistake is hiding the right column behind a "data quality" link nobody clicks.
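One way to keep the two halves separate while still paging from a single queue is to tag every check result with its axis. The sketch below assumes a hypothetical CheckResult record and alerts_to_page helper; the check names are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CheckResult:
    axis: str      # "pipeline" or "data"
    name: str      # human-readable check name
    passed: bool


def alerts_to_page(results: List[CheckResult]) -> List[str]:
    """Return one alert line per failed check, prefixed with its axis."""
    return [f"[{r.axis}] {r.name}" for r in results if not r.passed]


results = [
    CheckResult("pipeline", "daily_revenue DAG succeeded", True),
    CheckResult("pipeline", "daily_revenue DAG within SLA", True),
    CheckResult("data", "daily_revenue reconciliation", False),
]
print(alerts_to_page(results))  # ['[data] daily_revenue reconciliation']
```

With the axis in the alert text, on-call can tell at a glance that the pipeline ran and the output is what needs investigating.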

3. Worked Example

Priya's $900k discrepancy traced back to a SQL bug in daily_revenue:

-- The bug
INSERT INTO daily_revenue (revenue_date, gross_revenue)
SELECT date(created_at) AS revenue_date,
       SUM(amount_cents) / 100.0
FROM   payments
WHERE  status = 'succeeded'
  AND  created_at >= '{{ ds }}'        -- bug: lower bound only, in a different time zone than date() below
  AND  date(created_at) = '{{ ds }}'
GROUP BY revenue_date;

The >= '{{ ds }}' filter is open-ended on its own; the date(created_at) = '{{ ds }}' filter narrows the result back to a single day. The bug is that the two filters disagree about where that day starts: date(created_at) is computed in UTC, while created_at >= '{{ ds }}' interprets {{ ds }} as midnight in the session's local time zone. That timezone slippage drops 8 hours of payments.
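The arithmetic behind the 8 dropped hours can be reproduced in a few lines. This is a sketch under stated assumptions: one filter is taken to use the UTC calendar day, the other to read {{ ds }} as midnight at a fixed UTC-8 offset. The offset, the sample date, and the one-payment-per-hour data are all hypothetical, chosen only to match the 8-hour gap described above.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
LOCAL = timezone(timedelta(hours=-8))  # hypothetical session offset

ds = "2024-03-05"  # hypothetical run date
day = datetime.strptime(ds, "%Y-%m-%d")

# Filter A: created_at >= '{{ ds }}', with ds read as midnight local time.
lower_bound = day.replace(tzinfo=LOCAL)  # 08:00 UTC on the same date


def passes_both_filters(created_at: datetime) -> bool:
    # Filter B: date(created_at) = '{{ ds }}', evaluated on the UTC calendar day.
    in_utc_day = created_at.astimezone(UTC).date() == day.date()
    return created_at >= lower_bound and in_utc_day


# One synthetic payment per hour across the UTC day.
payments = [day.replace(tzinfo=UTC) + timedelta(hours=h) for h in range(24)]
kept = [p for p in payments if passes_both_filters(p)]

print(len(kept))                  # 16
print(len(payments) - len(kept))  # 8
```

Each filter is reasonable in isolation. Only their intersection is wrong, which is why nothing fails and the task stays green.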

Every task in the DAG was green. Nothing about the pipeline metadata hinted at the bug. The only signal that could have caught this was a data-quality check.

Here are three data-quality checks that would have caught it:

-- Check 1: row count baseline
WITH baseline AS (
    SELECT AVG(daily_row_count) AS avg_rows
    FROM   daily_revenue_stats
    WHERE  revenue_date BETWEEN '{{ ds }}'::date - 30 AND '{{ ds }}'::date - 1
)
SELECT
    (SELECT COUNT(*) FROM payments
     WHERE  status = 'succeeded'
       AND  created_at >= '{{ ds }}'          -- same filters as the load
       AND  date(created_at) = '{{ ds }}')    AS actual_rows,
    (SELECT avg_rows FROM baseline)           AS expected_rows
;
-- alert if actual < expected * 0.5

-- Check 2: revenue reconciliation against source-of-truth processor
SELECT
    -- read the loaded table, not payments, so the check sees what finance sees
    (SELECT SUM(gross_revenue) FROM daily_revenue
     WHERE  revenue_date = '{{ ds }}')                               AS warehouse_revenue,
    (SELECT total_cents/100.0 FROM stripe_daily_settlement
     WHERE  settlement_date = '{{ ds }}')                            AS source_revenue
;
-- alert if |warehouse - source| / source > 0.01

-- Check 3: distinct-hour coverage
SELECT COUNT(DISTINCT EXTRACT(HOUR FROM created_at)) AS hours_covered
FROM   payments
WHERE  status = 'succeeded'
  AND  created_at >= '{{ ds }}'          -- same filters as the load
  AND  date(created_at) = '{{ ds }}'
;
-- alert if hours_covered < 20  (a normal day has 24)

Any one of these, run after the main pipeline, would have paged Priya before the dashboard lied to finance. Checks that scan a full day of payments are slow. The third would have caught this exact bug most directly (only 16 hours showed activity, not 24).
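The alert conditions in the SQL comments above are simple enough to sketch as code. The function names and default thresholds below are assumptions that mirror those comments (0.5, 0.01, and 20); the zero-source branch is an added guard, not something the lesson specifies.

```python
def row_count_alert(actual: int, expected: float, min_ratio: float = 0.5) -> bool:
    """Check 1: alert if actual < expected * 0.5."""
    return actual < expected * min_ratio


def reconciliation_alert(warehouse: float, source: float, tolerance: float = 0.01) -> bool:
    """Check 2: alert if |warehouse - source| / source > 0.01."""
    if source == 0:
        # Assumed guard: with no source revenue, any warehouse revenue is a mismatch.
        return warehouse != 0
    return abs(warehouse - source) / abs(source) > tolerance


def hour_coverage_alert(hours_covered: int, min_hours: int = 20) -> bool:
    """Check 3: alert if hours_covered < 20 (a normal day has 24)."""
    return hours_covered < min_hours


# Priya's numbers: $312k in the warehouse, $1.2M at the processor, 16 hours seen.
print(reconciliation_alert(312_000, 1_200_000))  # True (74% off)
print(hour_coverage_alert(16))                   # True
print(hour_coverage_alert(24))                   # False
```

The thresholds are the tunable part. A 1% reconciliation tolerance is strict enough to catch this incident many times over, and loose enough to absorb small timing differences between systems.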

Aha: The orchestrator's job is to get the work done. It is not to judge whether the work was right. That judgment requires running a check against the data itself, after the work. A green pipeline is usually necessary for the data to be right. It is never sufficient.

4. Real-World Application

Every mature data team eventually ships a "Data Reliability" function (the name varies: Data Observability, Data Trust, Data SRE). Its purpose is exactly the right-hand column above: continuous checks against the data, not against the metadata.

Tooling has converged: Monte Carlo, Bigeye, Soda, Acceldata, and dbt's data_tests block all produce data-quality signals separate from pipeline signals. The most useful ones land in the same paging system as pipeline alerts, so on-call sees a unified queue but knows from the alert which axis failed.

The teams that skip this layer don't realize they're skipping it. They have a pipeline dashboard that's been green for months and an annual ritual called "the consultant audit found a 7% revenue undercount in 2024." The bug was visible from day one; nothing was looking.

5. Your Turn

Exercise: TheWorldShop just hired you to "improve their data quality." The current state: one pipeline dashboard (all green), no data-quality checks, daily complaints from finance and ops about numbers being "off."

Sketch a minimum-viable Data Health dashboard that pairs with the existing Pipeline Health dashboard. Pick exactly 3 critical tables and propose 1 check per table.
For each check, name (a) the SQL shape, (b) the alert threshold, (c) the paging severity (P0/P1/P2 from Lesson 7).
The CTO asks "won't this just be more alerts?" Defend the trade in 2-3 sentences.

6. Recap + Bridge

Pipeline health and data health are orthogonal axes. Most teams ship the first dashboard and don't ship the second. The cure is a separate set of checks running against the data, with their own alerts and their own paging severity. Next lesson we look at one of the easiest data-quality checks to ship (row counts) and the false sense of security it gives you when it's the only one.