Module: Pipeline Specs & Quality Checks | Duration: ~16 min | Lesson: 1 of 4
Sam's pipeline is technically perfect. No nulls where nulls are forbidden, zero duplicates, lands on time every morning, has for months. And this quarter's planning deck, the one deciding TheWorldShop's warehouse expansion, was built from a hand-maintained spreadsheet instead, because the VP's analyst didn't know Sam's table existed, and the one analyst who did know didn't trust a number in it after last year's incident.
A pipeline nobody finds, or nobody believes, has the same business value as a pipeline that doesn't run. That stings, because Sam did the hard engineering right. This course is about the other half of the job, and it starts by pinning down what "data quality" even means, because most definitions stop at nulls and duplicates, and nulls and duplicates were never the whole story.
2. Concept Explanation
The definition that fits on a sticky note
Data quality = data trust + data impact. People believe the data is correct, and the data actually changes what the business does. Miss the first and your correct numbers get ignored; miss the second and your trusted numbers decorate dashboards nobody acts on. Consistently ship datasets with both properties and, bluntly, you get promoted; this course is the operating manual for "consistently."
Unpacking that into the six properties a genuinely high-quality dataset has:
1. It's discoverable
Someone with a question can find out the answer exists. Discoverability is partly a platform concern (catalogs, search, lineage tools), but the data engineer's share is real: names that follow conventions, descriptions that say what the table is for, and ownership metadata that isn't blank (the freshness-and-SLA course in this stratum has a whole chapter on the column nobody fills in). Sam's warehouse-planning table failed here first, before trust ever got a vote.
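The engineer's share of discoverability is mechanical enough to lint. Here is a minimal sketch, assuming catalog entries can be read as dictionaries; the field names (name, description, owner) and the sample entry are hypothetical, and a real catalog will expose its own schema.

```python
# Hypothetical field names; substitute whatever your catalog actually stores.
REQUIRED_FIELDS = ("name", "description", "owner")


def missing_metadata(entry):
    """Return the required fields that are absent or blank in a catalog entry."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(entry.get(field) or "").strip()
    ]


# Illustrative entry: named, but with a blank description and no owner.
entry = {"name": "fact_example", "description": "", "owner": None}
gaps = missing_metadata(entry)
```

A check like this only proves the fields are filled in, not that the description is any good; that part still needs a human reviewer.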
2. Its definitions are understood, and complete
The famous cautionary tale: Zillow's home-buying arm trusted its price model enough to buy houses at scale, and lost hundreds of millions when the model's inputs didn't capture the world it was betting on. The data wasn't "dirty"; the definition of "enough" was wrong. Completeness questions ("what is this metric blind to?") are quality questions, and some blindness is irreducible: rare, unprecedented events (a pandemic, a black-swan market turn) will beat any historical dataset. Quality work includes stating what the data cannot see.
3. It keeps the table-stakes guarantees
The classics: not-null columns are never null, dimension tables have no duplicate entities, enumerations hold only valid values, facts are deduped within their stated window. Necessary, cheap, and, per this lesson's whole argument, nowhere near sufficient. (Lesson 4 turns these into a systematic checklist.)
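To show how cheap these guarantees are, here is a minimal sketch of three of them as plain Python functions over rows held as dictionaries. The column names, the allowed status values, and the sample rows are all hypothetical; in practice these checks usually live in SQL or a data-testing framework.

```python
from collections import Counter

# Hypothetical enumeration, for illustration only.
VALID_STATUSES = {"pending", "paid", "failed"}


def check_not_null(rows, column):
    """Return indexes of rows where a required column is missing or None."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def check_unique(rows, key):
    """Return key values that appear more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def check_enum(rows, column, allowed):
    """Return indexes of rows whose value is outside the allowed set."""
    return [i for i, row in enumerate(rows) if row.get(column) not in allowed]


# Sample rows seeded with one violation of each kind.
rows = [
    {"payout_id": 1, "seller_id": "s1", "status": "paid"},
    {"payout_id": 2, "seller_id": None, "status": "paid"},
    {"payout_id": 2, "seller_id": "s3", "status": "refunded"},
]

null_violations = check_not_null(rows, "seller_id")
duplicate_keys = check_unique(rows, "payout_id")
enum_violations = check_enum(rows, "status", VALID_STATUSES)
```

Each function returns the offending rows or keys rather than a bare boolean, so a failing check can say what broke.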
4. It delivers business value
Every pipeline should trace to money, usually along a chain. Direct: revenue reporting, billing. One hop: signups don't earn a cent themselves, but they're the leading indicator of the ad impressions that do. Two hops: paid-click tracking feeds signups feeds revenue. Cost-side counts too: the pipeline that itemizes cloud spend so the company stops donating margin to its cloud provider is a money pipeline. And a rare few are strategic, feeding one large executive decision rather than a recurring process; they're legitimate, and they're maybe one or two in a career, so if every pipeline in a portfolio claims strategic value, most are claiming nothing. If you can't articulate the chain from your table to a dollar (earned or saved), that's not automatically a dead pipeline, but it is an unanswered quality question.
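One way to make the chain concrete is to write the "feeds" relationships down as a small graph and count hops to the nearest money node. The sketch below uses hypothetical node names modeled on the examples above and simplifies the signups link to a single edge. It is an illustration, not a lineage tool.

```python
from collections import deque

# Hypothetical "feeds" edges, modeled loosely on the examples in the text.
FEEDS = {
    "paid_click_tracking": ["signups"],
    "signups": ["revenue"],
    "orphan_table": [],
}
# Nodes that count as money; cost-saving nodes could be added the same way.
MONEY_NODES = {"revenue"}


def hops_to_money(start, feeds=FEEDS, money=MONEY_NODES):
    """Breadth-first search: fewest hops from `start` to any money node.

    Returns None when no chain to money exists.
    """
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, hops = queue.popleft()
        if node in money:
            return hops
        for downstream in feeds.get(node, []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append((downstream, hops + 1))
    return None
```

A None result is the "unanswered quality question" case: no chain has been written down, which is not the same as proving there isn't one.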
5. It's easy to use
Column names are obvious; conventions distinguish what you filter from what you aggregate (dim_ and m_ prefixes, or your house style); the schema doesn't make analysts guess. There's honest tension here with the modeling stratum's compact structures: arrays and structs are harder to use, deliberately, for master-data audiences, so "easy" is always relative to the declared consumer. What's not defensible is friction with no audience rationale: cryptic names, unparsed blobs, mystery timezones.
6. It arrives on an agreed schedule
The operative word is agreed. A pipeline landing at 9am is high quality if the contract says 9am and a crisis if consumers assumed 6am: same pipeline, different promises. The unit-economics pipeline in this course's source lineage refreshed every three days; once that interval was agreed, "the data is two days old" stopped being an incident and became Tuesday, and the pages stopped. Latency SLAs, their measurement, and their failure handling are the freshness-and-SLA course's turf; the spec (Lesson 3) is where the agreement gets written down.
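The difference between "late" and "later than agreed" fits in a few lines. Here is a minimal sketch, assuming timestamps are already in the contract's timezone; the function names and dates are hypothetical, while the 9am deadline and the three-day interval mirror the examples above.

```python
from datetime import datetime, time, timedelta


def is_contract_breached(landed_at, agreed_deadline):
    """True only when data landed after the agreed time of day.

    Assumes same-day delivery and that `landed_at` is in the
    contract's timezone.
    """
    return landed_at.time() > agreed_deadline


def is_stale(last_refresh, now, agreed_interval):
    """True only when the data is older than the agreed refresh interval."""
    return now - last_refresh > agreed_interval


agreed = time(9, 0)  # what consumers signed up for
breach_at_0840 = is_contract_breached(datetime(2024, 3, 5, 8, 40), agreed)
breach_at_0915 = is_contract_breached(datetime(2024, 3, 5, 9, 15), agreed)

# Two-day-old data under an agreed three-day refresh is not an incident.
two_days_old = is_stale(
    datetime(2024, 3, 3, 6, 0), datetime(2024, 3, 5, 6, 0), timedelta(days=3)
)
```

Neither function knows when the pipeline usually lands. The alert threshold comes from the agreement, not from habit.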
Trust is asymmetric
One more property of the trust half, and it's the one that explains Sam's quarter: trust builds linearly and collapses instantly. Months of correct data buy you a hearing; one confidently wrong number in an executive deck spends it all. Consumers don't return just because you fixed the bug; they return when they've watched the process that makes the bug unlikely. That's why this course is process-shaped: specs, reviews, staged backfills, and visible checks aren't bureaucracy; they're the only known technology for rebuilding trust at better-than-linear speed, and for not losing it in the first place.
3. Worked Example
Scoring a real table against all six properties, because the exercise of auditing quality is the skill, and it's never one query.
The subject: TheWorldShop's fact_seller_payouts, Sam's pride, the one the planning deck ignored.
| Property | Finding | Verdict |
|---|---|---|
| Discoverable | Not in the catalog; name says payouts, description blank; owner field empty | Fail |
| Definitions | "Payout" excludes clawbacks (documented nowhere); finance's spreadsheet includes them, so the two disagree by ~2% forever | Fail |
| Table stakes | Not-null, dedup, enum checks all present and passing | Pass |
| Business value | Direct money chain: feeds seller cost forecasting | Pass |
| Easy to use | dim_/m_ prefixes, typed columns, no blobs | Pass |
| Timely | Lands 6am daily... per Sam's intention; no consumer ever agreed to anything | Half-fail |
Three-and-a-half of six, and notice which half failed: every engineering property passed, every social property failed. The 2% clawback discrepancy is the trust-collapse mechanism in miniature: the analyst who found it couldn't know whose definition was right, so the rational move was to trust neither and keep the spreadsheet.
The repair plan, in priority order, is the rest of this course in preview:
- Write the definition down where consumers will collide with it: a spec section stating "payout excludes clawbacks; see m_clawbacks for the complement; finance's number = m_payout - m_clawbacks." Disagreement between two stated definitions is reconcilable in a meeting; disagreement between a stated one and a silent one is a trust incident. (Lesson 3: specs.)
- Make the latency a contract: propose 7am to the two consuming teams, get a thumbs-up in writing, alert only past the contract. (Freshness course for the machinery.)
- Register, describe, assign an owner. An hour of metadata against a quarter of invisibility.
- Only then, incrementally: the checks that would catch definitional drift, e.g., a reconciliation check against finance's ledger totals, because the table-stakes checks were never going to catch a meaning bug. (Lesson 4.)
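A reconciliation check of the kind named in the last item can be sketched as a comparison of totals within a tolerance. The sketch assumes the relationship stated in the first repair item (finance's number = m_payout - m_clawbacks). The function name, the tolerance, and the sample figures are hypothetical.

```python
from decimal import Decimal


def reconcile(payout_total, clawback_total, ledger_total,
              tolerance=Decimal("0.001")):
    """Compare (payouts - clawbacks) with finance's ledger total.

    Returns (ok, relative_diff). `tolerance` is a relative threshold;
    0.001 (0.1%) is an illustrative choice, not a recommendation.
    """
    if ledger_total == 0:
        raise ValueError("ledger total is zero; relative difference undefined")
    expected = payout_total - clawback_total
    relative_diff = abs(expected - ledger_total) / abs(ledger_total)
    return relative_diff <= tolerance, relative_diff


# Definitions aligned: payouts net of clawbacks match the ledger.
aligned_ok, aligned_diff = reconcile(
    Decimal("1000000"), Decimal("20000"), Decimal("980000")
)

# Definitional drift: clawbacks left out, so the totals disagree by about 2%.
drift_ok, drift_diff = reconcile(
    Decimal("1000000"), Decimal("0"), Decimal("980000")
)
```

Every not-null, dedup, and enum check would pass in both cases. Only the comparison against an independent total notices that the meaning has moved.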
Aha: Every property that failed in this audit fails silently. A failing check pages you; missing discoverability, unagreed latency, and undocumented definitions just quietly route the business around your pipeline, and the dashboards stay green the whole time. That's why quality can't be defined as "absence of alerts": half of quality lives in artifacts (specs, contracts, catalog entries) that no monitor ever evaluates.
4. Your Turn
Exercise: Audit and chain.
- For each scenario, name the quality property being violated (there are six to choose from):
  a. Analysts keep a shared doc titled "which revenue table is the real one."
  b. The churn model performed beautifully for two years, then face-planted during an unprecedented market event.
  c. A table's event_ts is local time in some rows, UTC in others, and the column comment says nothing.
  d. The exec dashboard broke because the upstream table landed at 11am; the pipeline owner says "it lands when it lands."
- Trace the business-value chain for TheWorldShop's fact_search_impressions (search result views). How many hops to money?
- Sam's teammate argues: "our checks all pass, so quality is fine." Give the two-sentence rebuttal using this lesson's definition.
5. Real-World Application
The trust-collapse asymmetry is why mature data organizations invest in visible process even when the engineering is already sound. Certification programs ("golden dataset" badges), spec-review rituals, and published SLA pages all exist to make trust inspectable: a consumer who can see the review trail and the passing checks extends trust before months of personal verification. Airbnb's Midas process, next lesson's subject, is the fully-industrialized version, and its name (the king whose touch turned things to gold) is a statement about exactly this: the certification is the product.
The business-value chain, meanwhile, is the most practical career tool in this lesson. Performance reviews, headcount debates, and deprecation decisions all reduce to "what does this pipeline do for the business," and engineers who answer with a chain, "this feeds the model that sets shipping prices; a day of downtime mis-prices roughly X orders", win those conversations. Engineers who answer "it populates a table" watch their pipeline land on the deprecation list, sometimes correctly. The chain also works in reverse as a personal filter: offered two projects, the one with the shorter, clearer chain to money is almost always the better bet.
6. Recap + Bridge
Quality is trust plus impact, unpacked into six properties: discoverable, definitionally honest, table-stakes clean, value-chained to money, usable by its declared audience, and timely by agreement. The failures that hurt most are the silent, social ones no check ever pages about. Trust collapses faster than it builds, which is why the remedy is visible process rather than heroic correctness. The most complete version of that process in the industry is Airbnb's nine-step Midas ritual for minting gold datasets, and the next lesson walks all nine steps, then asks the equally important question: when is all that ceremony worth it, and when is it a waste of everyone's quarter?