Module: Lineage for Orchestration | Duration: ~11 min | Lesson: 1 of 6

A vendor emails TheWorldShop at 16:00: the address-validation feed they've been sending for three years had a bug, and last week's files mislabeled country codes. Priya needs to answer one question before she can act: what did we build on top of that feed?

She opens the DAG list. There are 600 DAGs. She greps the codebase for the source table name. She finds 40 references, but some are views on views, and she can't tell which final tables actually reach a customer-facing surface. Three hours later she has a half-trusted list scrawled in a doc, and she's still not sure it's complete.

The question "what depends on this?" is the first question in every data incident. Why did it take Priya three hours, and what would have made it three seconds?

2. Concept Explanation

Lineage is the dependency graph of your data

Lineage is the directed graph of "this dataset was produced from those datasets." Nodes are tables, views, files, dashboards, and ML features. Edges mean "produced from." It has the same shape as the DAG you already write in your orchestrator, but it lives at the data level rather than the task level, and it usually spans many DAGs, tools, and teams.

The distinction matters. Your Airflow DAG knows that task B runs after task A. It does not inherently know that the table B writes is read by a dbt model in a different repo, which feeds a dashboard owned by another team. The orchestrator's DAG is one slice. Lineage is the whole graph, stitched across every slice.
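To make the task-level versus data-level distinction concrete, here is a minimal sketch. It assumes each task declares the datasets it reads and writes; the DAG names, task names, and the `reads`/`writes` fields are hypothetical, not something Airflow records for you by default. Data edges fall out of those declarations, and some of them cross DAG boundaries that no single orchestrator DAG can see.

```python
# Hypothetical task metadata: each task declares the datasets it reads and writes.
tasks = [
    {"dag": "ingest_orders", "task": "load_orders",
     "reads": ["orders_raw"], "writes": ["orders"]},
    {"dag": "dbt_finance", "task": "orders_usd",
     "reads": ["orders", "fx_rates"], "writes": ["orders_usd"]},
    {"dag": "bi_refresh", "task": "refresh_revenue",
     "reads": ["orders_usd"], "writes": ["revenue_dashboard"]},
]

# Data-level edges: every dataset a task reads is upstream of every dataset it writes.
data_edges = sorted(
    {(r, w) for t in tasks for r in t["reads"] for w in t["writes"]}
)

# Which DAG produces each dataset (datasets nobody writes are external sources).
writer_dag = {w: t["dag"] for t in tasks for w in t["writes"]}

# A data edge crosses slices when its upstream was written by a different DAG.
cross_dag = [
    (up, down)
    for up, down in data_edges
    if up in writer_dag and writer_dag[up] != writer_dag[down]
]

print(cross_dag)
# [('orders', 'orders_usd'), ('orders_usd', 'revenue_dashboard')]
```

Neither of those two edges appears in any one DAG's task graph; they only exist once the slices are stitched together.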

The two questions lineage answers

Almost every operational use of lineage reduces to walking the graph in one of two directions:

Downstream (impact): "what depends on this?" Walk forward from a node. This is Priya's question. When a source is wrong, broken, or about to change, you need the full set of things built on it. This is also the question behind blast radius (Lesson 4) and lineage-aware alerting (previous course).
Upstream (provenance): "where did this come from?" Walk backward from a node. When a number is wrong on a dashboard, you trace its inputs to find where the corruption entered. This is the debugging direction, and the compliance direction ("where did this PII column originate?").

A team without lineage answers both questions by grep, tribal knowledge, and guessing. A team with lineage answers them with a graph traversal. The difference is hours versus seconds, and "I think that's all of them" versus "that's all of them."
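Both directions are the same traversal run over different adjacency maps. The sketch below uses a small made-up edge list to illustrate that; the helper names `walk`, `downstream`, and `upstream` are illustrative, not taken from any lineage tool.

```python
from collections import defaultdict, deque

# (upstream, downstream) pairs; a small illustrative graph.
edges = [
    ("orders", "orders_usd"),
    ("fx_rates", "orders_usd"),
    ("orders_usd", "revenue_dashboard"),
]

children, parents = defaultdict(list), defaultdict(list)
for up, down in edges:
    children[up].append(down)   # forward adjacency: impact
    parents[down].append(up)    # reverse adjacency: provenance

def walk(start, neighbors):
    """Breadth-first walk; returns every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # only matters if the graph contains a cycle
    return seen

def downstream(node):
    return walk(node, children)

def upstream(node):
    return walk(node, parents)

print(sorted(downstream("fx_rates")))
# ['orders_usd', 'revenue_dashboard']
print(sorted(upstream("revenue_dashboard")))
# ['fx_rates', 'orders', 'orders_usd']
```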

Why orchestration specifically needs it

You might think lineage is a governance or analytics concern. It's an orchestration concern for three concrete reasons:

Incident response starts with impact. Every page in the previous course assumed you could answer "what's downstream." Without lineage, the on-call can't scope the blast radius, can't route the root cause, and can't tell stakeholders what's affected.
Backfills need blast radius. Rerunning a source table re-fires everything downstream. Without lineage you don't know whether a backfill touches 3 tables or 300 (Lesson 4).
Change safety. Before you rename a column or drop a table, lineage tells you who breaks. Without it, you find out from the people who break.

These are all things the orchestrator (or the engineer operating it) has to do, and all of them are guesswork without the graph.
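As a rough illustration of the backfill and change-safety cases, the sketch below answers two questions from an edge list: who reads a table directly (the first things to break on a rename or drop), and in what order the affected tables would need to rerun after a source is corrected. The table names are made up, and `graphlib.TopologicalSorter` from the Python 3.9+ standard library is one way to get the ordering, not what any particular orchestrator does internally.

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# (upstream, downstream) pairs; illustrative table names.
edges = [
    ("fx_rates", "orders_usd"),
    ("orders_usd", "revenue_daily"),
    ("orders_usd", "finance_dashboard"),
    ("revenue_daily", "finance_dashboard"),
]

children = defaultdict(set)
for up, down in edges:
    children[up].add(down)

def downstream(source):
    """Every table reachable from source, excluding source itself."""
    seen, stack = set(), [source]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def direct_consumers(table):
    """Change safety: the tables that read this one directly."""
    return sorted(children.get(table, ()))

def backfill_order(source):
    """Backfill: the source plus everything downstream, in dependency order."""
    affected = downstream(source) | {source}
    preds = {node: set() for node in affected}
    for up, down in edges:
        if up in affected and down in affected:
            preds[down].add(up)  # TopologicalSorter expects node -> predecessors
    return list(TopologicalSorter(preds).static_order())

print(direct_consumers("orders_usd"))
# ['finance_dashboard', 'revenue_daily']
print(backfill_order("fx_rates"))
# ['fx_rates', 'orders_usd', 'revenue_daily', 'finance_dashboard']
```

The length of `backfill_order(...)` is the number that tells you whether a rerun touches 3 tables or 300.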

Where lineage comes from

Lineage isn't free, and the source determines its quality:

Parsed from SQL. Tools read your CREATE TABLE AS SELECT / INSERT ... SELECT and infer "this table reads those tables." dbt does this for its whole project (the manifest.json is a lineage graph). This is cheap and broad but static: it reflects the code, not necessarily what ran.
Emitted at runtime. The pipeline reports "this run read A and B, wrote C" as it executes. OpenLineage (next lesson) is the standard for this. More accurate (it reflects what actually happened) and captures dynamic cases SQL parsing misses, but requires instrumentation.
Hand-maintained. A wiki page someone updates. Always wrong within a quarter. Don't rely on it for incidents.

Most mature setups combine parsed lineage (broad coverage) with runtime lineage (accuracy where it matters).
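One simple way to picture that combination is to union the two edge sets and remember where each edge came from. The sketch below assumes a simplified runtime record with `job`, `inputs`, and `outputs` fields; that shape is hypothetical and is not the OpenLineage event schema. The export job and its dated output table are also invented, standing in for a dynamically named table that a SQL parser would not see.

```python
# Edges inferred statically from SQL (for example, from a dbt project).
parsed_edges = {
    ("orders", "orders_usd"),
    ("fx_rates", "orders_usd"),
}

# Hypothetical runtime records: what each run actually read and wrote.
runtime_runs = [
    {"job": "orders_usd_build",
     "inputs": ["orders", "fx_rates"], "outputs": ["orders_usd"]},
    {"job": "export_partner",
     "inputs": ["orders_usd"], "outputs": ["partner_export_2024_06"]},
]
runtime_edges = {
    (i, o) for run in runtime_runs for i in run["inputs"] for o in run["outputs"]
}

# Merge both sources, tagging each edge with where it was observed.
sources = {}
for edge in parsed_edges | runtime_edges:
    tags = []
    if edge in parsed_edges:
        tags.append("parsed")
    if edge in runtime_edges:
        tags.append("runtime")
    sources[edge] = tags

# Edges the static graph missed entirely.
runtime_only = sorted(e for e, tags in sources.items() if tags == ["runtime"])
print(runtime_only)
# [('orders_usd', 'partner_export_2024_06')]
```

Keeping the tag per edge also lets you spot the opposite case: edges that exist in code but have never been observed in a run.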

Granularity: table vs column

Lineage has a resolution. Table-level lineage says "orders_usd reads fx_rates." Column-level says "orders_usd.amount_usd is computed from orders.amount and fx_rates.rate." Column-level is far more useful for some questions (a column rename, a PII trace) and far more expensive to maintain. Lesson 3 is entirely about when each is worth it. For now: know that lineage has a granularity dial, and table-level is the floor.
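The two resolutions can be sketched with the same example. Below, column-level edges are pairs of (table, column) tuples, and table-level lineage is what remains when the column part is dropped. The `order_id` passthrough edge is an added assumption for illustration; the `amount_usd` edges are the ones described above.

```python
# Column-level edges: ((table, column), (table, column)).
column_edges = [
    (("orders", "amount"), ("orders_usd", "amount_usd")),
    (("fx_rates", "rate"), ("orders_usd", "amount_usd")),
    (("orders", "order_id"), ("orders_usd", "order_id")),  # assumed passthrough
]

# Table-level lineage is the column-level graph with the column dropped.
table_edges = sorted({(src[0], dst[0]) for src, dst in column_edges})
print(table_edges)
# [('fx_rates', 'orders_usd'), ('orders', 'orders_usd')]

def columns_reading(table, column):
    """Columns computed directly from table.column."""
    return sorted(dst for src, dst in column_edges if src == (table, column))

# A rename of fx_rates.rate touches one downstream column, not the whole table.
print(columns_reading("fx_rates", "rate"))
# [('orders_usd', 'amount_usd')]
```

Going from column-level to table-level is a cheap projection; going the other way requires information the table-level graph never had, which is why column-level lineage costs more to collect.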

3. Worked Example

Answering Priya's question with a table-level lineage graph.

Step 1: represent lineage as edges (here, from a dbt manifest or a parser).

# (upstream, downstream) = "downstream is produced from upstream"
edges = [
    ("address_feed",      "customers_clean"),
    ("customers_clean",   "orders_enriched"),
    ("orders_enriched",   "revenue_by_region"),
    ("orders_enriched",   "shipping_dashboard"),
    ("customers_clean",   "marketing_segments"),
    ("loyalty_raw",       "loyalty_points"),       # unrelated branch
]

Step 2: walk downstream from the broken source.

from collections import defaultdict, deque

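# Adjacency map: dataset -> datasets produced directly from it.
children = defaultdict(list)
for up, down in edges:
    children[up].append(down)

def impacted_by(source):
    """Return every dataset reachable downstream of source (breadth-first)."""
    seen, q = set(), deque([source])
    while q:
        n = q.popleft()
        for c in children[n]:
            if c not in seen:   # visit each dataset once, even if reached by several paths
                seen.add(c)
                q.append(c)
    return seen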

# Sets are unordered, so sort for a stable, readable result.
print(sorted(impacted_by("address_feed")))
# ['customers_clean', 'marketing_segments', 'orders_enriched',
#  'revenue_by_region', 'shipping_dashboard']

In milliseconds, Priya has the exact, complete set. loyalty_points is correctly absent because it sits on a different branch. Her three-hour grep is now a graph walk.

Step 3: filter to what she actually needs to act on. Of the impacted set, which are customer-facing or business-critical?

critical = {"revenue_by_region", "shipping_dashboard"}
to_notify = impacted_by("address_feed") & critical
print(sorted(to_notify))   # ['revenue_by_region', 'shipping_dashboard']

Now she can tell stakeholders precisely: "the address-feed bug affects revenue_by_region and shipping_dashboard; here's the full downstream list for the record." Complete, fast, and trustworthy.
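Stakeholders often ask why a table is affected, not only whether it is. One possible extension, sketched below, records the parent of each node during the breadth-first walk so that a path from the broken source can be reconstructed. The edge list repeats the one from Step 1 so the block runs on its own; `paths_from` is an illustrative helper name.

```python
from collections import defaultdict, deque

# Same (upstream, downstream) edges as in Step 1.
edges = [
    ("address_feed", "customers_clean"),
    ("customers_clean", "orders_enriched"),
    ("orders_enriched", "revenue_by_region"),
    ("orders_enriched", "shipping_dashboard"),
    ("customers_clean", "marketing_segments"),
    ("loyalty_raw", "loyalty_points"),
]

children = defaultdict(list)
for up, down in edges:
    children[up].append(down)

def paths_from(source):
    """Map each impacted node to one shortest path from source."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in parent:
                parent[child] = node  # remember how we first reached child
                queue.append(child)
    paths = {}
    for node in parent:
        if node == source:
            continue
        hop, path = node, []
        while hop is not None:      # follow parent pointers back to the source
            path.append(hop)
            hop = parent[hop]
        paths[node] = path[::-1]
    return paths

paths = paths_from("address_feed")
print(" -> ".join(paths["shipping_dashboard"]))
# address_feed -> customers_clean -> orders_enriched -> shipping_dashboard
```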

Aha: "What depends on this?" is the first question in every data incident, and without lineage you answer it by grep and prayer in three hours. With a lineage graph it's a downstream traversal that returns in milliseconds and is actually complete. The orchestrator knows task order; lineage knows data dependency across every DAG, tool, and team.

4. Real-World Application

The "vendor sends a correction, what did we build on it" scenario is a real, recurring data-engineering crisis, and the teams that handle it in minutes are the ones with lineage. dbt gives most teams their first real lineage graph essentially for free: dbt docs renders the DAG from manifest.json, and you can query that manifest programmatically to get the impact set. That query is the same downstream graph walk used in the worked example.
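A minimal sketch of querying the manifest follows. It relies on the `child_map` key that dbt writes into manifest.json, which maps each node's unique ID to the IDs of its direct children. The project name `theworldshop`, the source name `vendor`, and every unique ID below are hypothetical, and the inline dict is a stand-in for a real manifest so the block runs without a dbt project.

```python
from collections import deque

# In a real project you would load the file dbt generates, for example:
#   import json
#   with open("target/manifest.json") as f:
#       manifest = json.load(f)
# Here a tiny stand-in with hypothetical unique IDs is used instead.
manifest = {
    "child_map": {
        "source.theworldshop.vendor.address_feed": [
            "model.theworldshop.customers_clean",
        ],
        "model.theworldshop.customers_clean": [
            "model.theworldshop.orders_enriched",
            "test.theworldshop.not_null_customers_clean_id",
        ],
        "model.theworldshop.orders_enriched": [],
        "test.theworldshop.not_null_customers_clean_id": [],
    }
}

def impacted_models(manifest, unique_id):
    """Walk child_map downstream from unique_id and keep only models."""
    child_map = manifest["child_map"]
    seen, queue = set(), deque([unique_id])
    while queue:
        for child in child_map.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    # child_map also lists tests and other node types; filter to models.
    return sorted(c for c in seen if c.startswith("model."))

print(impacted_models(manifest, "source.theworldshop.vendor.address_feed"))
# ['model.theworldshop.customers_clean', 'model.theworldshop.orders_enriched']
```

The prefix filter is a design choice for this sketch: depending on the question, you may want to keep exposures or tests in the result as well.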

Beyond dbt, the ecosystem splits into lineage producers and lineage stores. Producers emit lineage (dbt, Spark with the OpenLineage integration, Airflow with the OpenLineage provider). Stores collect and visualize it (DataHub, OpenMetadata, Marquez, plus the commercial observability platforms). The next lesson covers OpenLineage, the standard that lets a producer and a store from different vendors interoperate.

The maturity ladder is recognizable. Teams start with grep and tribal knowledge (Priya's three hours). They graduate to dbt's static graph for their dbt project. They eventually add runtime lineage to capture the non-dbt parts (raw ingests, Spark jobs, reverse-ETL) and stitch a cross-tool graph. Each rung turns a category of multi-hour incident into a multi-second query, and the first incident where lineage saves an afternoon usually pays for the whole investment.

5. Your Turn

Exercise: TheWorldShop's lineage includes: pii_raw to users_clean; users_clean to users_masked and to fraud_features; users_masked to analytics_dashboard; fraud_features to fraud_model. Legal asks: "we got a deletion request, where does this user's PII flow?"

Which direction do you walk (upstream or downstream) for each of these questions: (i) legal's PII-flow question, (ii) "an analyst says a number on analytics_dashboard is wrong, where did it come from?"
Walk legal's question over the graph and list every node the PII reaches. Why is missing one node a legal problem, not just an engineering one?
The graph above came from parsing dbt SQL. Name one kind of PII flow this static graph might miss that runtime lineage would catch.

6. Recap + Bridge

Lineage is the cross-tool dependency graph of your data, and it answers the two questions every incident starts with: downstream "what depends on this?" (impact) and upstream "where did this come from?" (provenance). Orchestration needs it for incident scoping, backfills, and change safety, all guesswork without the graph. It comes from parsed SQL (broad), runtime emission (accurate), or wikis (wrong). Next lesson: OpenLineage, the standard that makes runtime lineage portable across tools, explained in 60 seconds.