7 Data Modeling Mistakes That Silently Corrupt Your Metrics
The scariest class of bug in a data warehouse is the one that doesn't throw an
error. A NullPointerException wakes you up. A failed DAG pages you. But a query
that parses cleanly, runs in two seconds, draws a chart, and returns a number
that was never true - that one ships to the board deck. Nobody finds it until
someone in finance runs the same metric a second way and the two numbers don't
match.
Almost every one of these "wrong-without-an-error" bugs traces back to a data modeling decision, not a coding mistake. The SQL is usually fine. The problem is the shape of the data the SQL ran against: the wrong grain, a join that duplicates rows, a measure summed across an axis it can't be summed across, a ratio divided in the wrong order, a dimension that forgot its own history.
This is a field guide to the big ones. For each: the symptom you'll actually see in production, why it happens, and the fix. If you want the full treatment behind any of these, the Semantic & Metrics Layer curriculum walks each one as its own lesson, on a running example.
1. Mixed grain: the bug that lives in the schema
The symptom. A finance analyst pings you: "Average order value shows $312 in
the dashboard, but I just ran it in a notebook and got $89. Which one is right?"
Both queries are correct. Both ran against the same fact_sale table. They
disagree anyway.
Why it happens. The grain of a fact table is the answer to one question: "what does one row mean?" A good grain statement is one short sentence - "one row per order line," "one row per daily inventory snapshot per warehouse per SKU." If you can't say it in a sentence, the table doesn't have a grain; it has contents, which is a worse thing.
The classic failure is mixing grains in one table - order-level rows and line-level rows sitting together:
| order_id | line_id | amount | shipping |
|---|---|---|---|
| 1001 | 1 | 50 | NULL |
| 1001 | 2 | 80 | NULL |
| 1001 | NULL | NULL | 12 |
Ask for SUM(amount + shipping) and you get a meaningless number: amount is at
line grain, shipping is at order grain. Every analyst who touches this table
gets a different answer depending on how they COALESCE, GROUP BY, or filter.
The bug isn't in anyone's SQL. It's in the schema.
The fix. Declare the grain before you write a single column, and never let
two grains share a table. If you need both order-level and line-level facts, that
is two fact tables (fact_order and fact_order_line), each with one row meaning
one thing. Grain is not a style choice you can fix with a GROUP BY afterward -
it's the one decision that defines the table. More:
Grain: the one decision.
2. The fan-out join trap (and the symmetric-aggregate fix)
The symptom. You join orders to order_items to get product details, then
SUM(order_total). The number comes back almost exactly double what finance
reports. You check the join, the data, the filter - everything looks right. The
number is still wrong.
Why it happens. Each order has, on average, two line items. Joining orders
to order_items turns one order row into two. Now order_total, which lives on
the order, appears twice, and SUM counts it twice. The join didn't add wrong
data - it duplicated a coarse-grain measure across the finer-grain rows it
created.
Here it is with tiny numbers. Two orders worth $150 in total:
orders
| order_id | order_total |
|---|---|
| 1 | $100 |
| 2 | $50 |
order_items
| order_id | item |
|---|---|
| 1 | A |
| 1 | B |
| 2 | C |
Join the two tables and order 1's $100 now sits on two rows (one for item A,
one for item B). So SUM(order_total) adds 100 + 100 + 50 = $250 instead of
$150. The $100 got counted twice because the join copied it onto every line
item - that is the entire bug.
This is the single most common way a correctly-defined metric returns a wrong number, and it's sneaky for three reasons:
- The join is correct. You genuinely needed
order_items. The bug is the aggregation after it, not the join itself. - The wrong number is plausible. 2x, 1.7x, 3x - these look like real numbers. Without finance's figure to compare against, you'd never know.
- You can't just divide by two. The inflation is weighted by each order's line-item count, so a few big orders skew it unpredictably.
It gets worse when you sum measures from both grains in one query - say
SUM(order_total) and SUM(item_quantity) together. After the join,
item_quantity is correct (it's at the join's grain) and order_total is
inflated, in the same SELECT. You can't reason about fan-out per query; you have
to reason about it per measure.
The fix. Two defenses:
- Aggregate before joining. Roll each fact up to a common grain first, then join the aggregates. No row multiplication, because both sides are now at the same grain. Use this whenever you can.
- Symmetric aggregates. When you can't pre-aggregate - because you need to filter on a finer-grain attribute (say, "orders containing a clearance item") while measuring a coarser one - you keep the fanned-out join and dedupe the coarse measure during aggregation, typically by hashing the row's primary key so each order's revenue counts exactly once.
This is also a warning about your BI tool: some tools (famously Looker) do
symmetric aggregates automatically; some silently SUM the inflated rows and
return the wrong number with no warning. And a tool might do SUM symmetrically
but botch AVG or COUNT(DISTINCT). Knowing exactly what your tool does per
aggregation function is the difference between trusting a fanned-out dashboard
and shipping a number that's double. A metrics layer makes "is this fan-out-safe?"
a non-question by picking the right technique every time. Deep dives:
the fan-out join trap
and
symmetric aggregates.
3. Semi-additive measures summed across time
The symptom. It's quarter-end. Finance reconciles the warehouse against the
general ledger. Cash balances are off by $84M. You trace it: a dashboard is
summing account_balance from a daily-snapshot fact across 90 days and across
10,000 accounts. The cross-account sum is fine. The cross-day sum just added the
same dollar to itself 90 times.
Why it happens. Picture one bank account holding $1,000, snapshotted three days in a row:
| day | account_balance |
|---|---|
| Mon | $1,000 |
| Tue | $1,000 |
| Wed | $1,000 |
SUM(account_balance) returns $3,000 - but the account never held $3,000. It
held $1,000 the whole time. Summing a balance across days counts the same money
once per day. (Summing across accounts on a single day is fine: two accounts
with $1,000 each really is $2,000.)
That is the heart of it: not every measure can be summed across every axis. There are three classes:
- Additive - sum across anything.
revenue,quantity_sold,clicks. If your fact table is pure-additive, BI tools can't get the aggregation wrong. - Semi-additive - sum across some axes, not others. These are balances and
inventories: a level at a moment, not an amount over a period.
account_balance,inventory_on_hand,headcount,open_tickets. You can sum them across customers or warehouses, but never across time. - Non-additive - never sum. Ratios, percentages, prices. A sum of percentages isn't a percentage.
The trap is that BI tools default to SUM, and a semi-additive measure summed
across time produces a confident, catastrophically wrong number - like the $84M
above.
The fix. For semi-additive measures, the correct time aggregation is an
end-of-period snapshot (closing balance), a period-start snapshot (opening
balance), or an average over the period - never a sum. Encode this in the
semantic layer or the measure definition so a consumer who drags "balance" onto a
time axis can't accidentally SUM it. Full lesson:
additive, semi-additive, non-additive measures.
4. Ratio metrics and the order-of-aggregation trap
The symptom. One engineer defines "average order value" as it reads in
English - average of (amount / items) per order, AVG(amount / item_count) - and
ships it. AOV reads $23. Finance computes it their way - total revenue over total
items, SUM(amount) / SUM(item_count) - and gets $31. Eight dollars apart, on a
metric that's one division. Neither is a typo.
Why it happens. There are two orders in which you can compute a ratio, and they are not the same number:
- Aggregate, then divide:
SUM(numerator) / SUM(denominator). Weights by size - a big order counts more. - Divide, then aggregate:
AVG(numerator / denominator). Weights every row equally - a $5 order and a $5,000 order each contribute one ratio.
A tiny example shows how far apart they get:
| order | amount | items | amount / items |
|---|---|---|---|
| A | $100 | 1 | $100 |
| B | $10 | 10 | $1 |
- Aggregate, then divide:
(100 + 10) / (1 + 10)=110 / 11= $10 per item. - Divide, then aggregate: average of (
$100,$1) = $50.50.
Same two rows, a 5x gap, no bug in sight. The second number is dominated by order A's lopsided $100-per-item; the first correctly spreads it across all the items actually sold, which is almost always what the business means.
This is the most-hit metric bug because English is ambiguous ("average order
value," "conversion rate," "cost per click" all just say "divide"), both formulas
run without erroring, and - worst - the gap changes when you slice. The
AVG-of-per-row version disagrees differently in every group, so it breaks the
moment someone adds a new GROUP BY you didn't test.
The fix. Stop hand-rolling the division. Declare the numerator and denominator as separate measures and use a dedicated ratio type, so the engine always aggregates each first and divides last - at whatever grain and slice the consumer requests:
The win isn't correctness once - it's correctness under slicing. More: ratio metrics and their order-of-aggregation trap.
5. SCD Type 1's silent lie: as-is vs as-was
The symptom. A customer placed $10,000 of orders last quarter while living in
the West region. This quarter they move to East. You overwrite their
region in dim_customer, then ask "what was revenue by region last quarter."
That $10,000 now shows up under East - a region the customer wasn't even in when
those orders happened. The dashboard reports a number that was never true, and
nothing flags it.
Why it happens. This is the question every Slowly Changing Dimension type is really answering: when a fact joins to a dimension, which version of the dimension does it attach to? A fact should join to the dimension as it was at the moment the fact happened (as-was), not as it is now (as-is). SCD Type 1 overwrites the old value, so it throws away exactly the information you'd need to honor that - silently.
Closely related, and the single most common SCD anti-pattern in production: joining facts to the current dimension version even when you've gone to the trouble of keeping SCD2 history.
The schema looks as-was; the query is silently as-is. It happens because the natural-key join "just works" and looks correct on the surface.
The fix. Resolve the version-aware surrogate key on the fact at ETL time and
join USING (customer_key) - never join an SCD2 dimension by natural key in the
BI tool. And reserve Type 1 for attributes you will never report on
historically; default to Type 2 for anything that feeds financial or regulatory
reporting. You can replay this entire decision in the browser with the
SCD Playground - push a change timeline through Types 0-6
and watch which version each fact lands on - and the deeper write-up is in
Slowly Changing Dimensions, actually explained.
6. The SCD2 operational anti-patterns
Type 2 done right at design time can still bite you in operations. The ones that take a senior engineer days to diagnose:
- The
is_currentview race. A view defined asWHERE is_current = TRUEworks fine - except during the ETL window. If the expire-old and insert-new steps aren't atomic, there's a moment where a customer has two current rows (or zero), and an analyst querying mid-load gets duplicates. Fix: do the update + insert in a singleMERGE/transaction. dbt snapshots get this right by default. - Version explosion. SCD2-ing a column that changes constantly (a
last_login_at) generates 365 dim rows per customer per year. A 1M-customer dim becomes 1.8 billion rows in five years. Fix: SCD2 only the columns that need history; SCD1 the rest, or split the high-churn ones into a mini-dimension. With dbt snapshots, setcheck_cols=['segment','region','tier']explicitly - thecheckstrategy defaults to all columns. - Late-arriving rows. A fact references a customer whose dim row doesn't exist
yet (the CDC stream is behind). Dropping it is silent data loss; bucketing it to
-1is visible but uncorrelated. Fix: insert an inferred placeholder dim row immediately, point the fact at its surrogate, and update the placeholder in place when the real data lands. For late-arriving facts, resolve the surrogate against the historical dim version, not the current one. - Inferred dims that never get cleaned up. The placeholders from the previous
fix are supposed to be filled in later. The CDC stream stalls, they never are,
and a year on, 12% of your dim is
customer_name IS NULLrendering a giant "NULL" bucket on every dashboard. Fix: monitor the inferred-dim count and alert if it doesn't decay. - Time-zone-naive validity windows.
valid_fromstored as a naive date while facts are stamped in UTC means a window of hours where facts resolve to the wrong dim version. Fix: storevalid_from/valid_toasTIMESTAMP WITH TIME ZONE(or always UTC) and use half-open intervals so a single instant never matches two versions.
Full catalog: SCD anti-patterns in the wild and late-arriving dimensions and facts.
7. Metric drift: the bug that's really an org chart
The symptom. Monday, the growth dashboard says revenue grew 12% last quarter. Tuesday, the finance deck says 9%. Wednesday, a data scientist's notebook says 10.5%. Nobody made a mistake. The growth query counts refunds in the refund month; finance counts them in the sale month; the notebook excludes test accounts. Three "correct" queries, three answers, one reconciliation meeting that eats a Thursday - and recurs next quarter.
Why it happens. A metric isn't a number; it's a number plus a pile of buried decisions - what rows count, what window, what grain, what filters. When the definition of "revenue" is duplicated across a LookML model, a dbt model, a notebook, and a spreadsheet some analyst left behind in 2023, you have four chances to diverge, and they will, because each is patched independently. You can't out-engineer this with more careful SQL; the SQL in each place is fine. The disease is duplication; disagreeing numbers are just the symptom.
This shows up structurally as conformance forks - a second dim_customer
called dim_customer_v2_marketing that someone built because getting an hourly
refresh added to the central dim "would take two quarters." The fork looks like a
technical problem (a duplicate dimension) but the root cause is almost always
organizational: coordination cost was higher than fork cost, so forking was the
rational move.
The fix. Define each metric once, in one place, and make every consumer read that definition instead of re-implementing it - that single place is the metrics layer. And fix the coordination cost that drives forks: same-day PR review for additive dim changes, an extension/satellite mechanism for team-specific columns, tiered dims (a nightly conformed dim plus an hourly latest-snapshot view sharing the same surrogates). Patch the technical artifact without the cause and the same fork reappears in six months under a new name. More: the metric drift problem and when conformance fails.
Bonus: the cost mistake - One Big Table update amplification
The seven above all corrupt a number. This last one is worth knowing because it corrupts a bill instead - same silent-failure shape, different blast radius.
The symptom. Support updates one customer's region in the CRM. Six minutes later Snowflake has billed you for rewriting 12,000 rows. A marketing batch updates 8,000 customers' segments and rewrites 400,000 rows; the Slack channel lights up with "why did our bill spike on Tuesday?"
Why it happens. In a star schema, a customer attribute lives in one row of
dim_customer - update it, one row written. In a denormalized One Big Table, that
attribute is copied onto every order and event the customer ever touched -
updating it rewrites N rows, where N is the customer's entire footprint. Columnar
storage makes it worse: files are immutable, so "update one row" means "rewrite
the whole file that contains it." A change touching 50,000 rows scattered across
hundreds of files can rewrite gigabytes to alter a fraction of a percent.
The fix. Match the model to the workload. OBT earns its keep for read-heavy, append-mostly, self-serve analytics; it's a poor fit for high-churn attributes that mutate often. Keep mutable attributes in a dimension and join, or keep the OBT but accept that attribute updates are expensive and batch them. Delta/Iceberg delete files soften the blow but accumulate and need compaction. More: the cost of OBT update amplification.
A diagnostic checklist
When a number is disputed and nothing errored, walk this list before you blame the SQL:
- What is the grain of every table in the query? If two tables in one
FROM/JOINhave different grains and you're summing across them, suspect fan-out or mixed grain first. - Is every summed measure additive across the axes you're summing? Balances and inventories across time are the classic landmine.
- Is any ratio computed as
AVG(a/b)instead ofSUM(a)/SUM(b)? Check whether it survives a newGROUP BY. - Does the fact join to the dimension version that was true when the fact happened? Or is it silently joining to "current"?
- Is the disputed metric defined in more than one place? If so, the bug is duplication, not arithmetic.
- Did the cost spike, not the number? Look for update amplification on a denormalized table.
The one thing to keep
Every bug here shares a signature: it runs clean and lies quietly. No exception, no failed job - just a confident number that was never true. That's why they survive code review and surface only at reconciliation, and it's why data modeling is worth taking as seriously as the application code that everyone already lints, tests, and reviews.
The defenses are knowable and mostly mechanical: state the grain in one sentence, aggregate before you join (or trust a layer that knows when to), classify every measure's additivity, declare ratios instead of dividing by hand, join facts to the dimension version that was true at the time, and define each metric exactly once. None of it is exotic. It just has to be deliberate, because the failure mode is silence.
If you want to build the instinct rather than memorize the list, the Semantic & Metrics Layer track teaches each of these as a running war story with worked SQL, and the SCD Playground lets you reproduce the as-of-join bug with your own change timeline and watch the facts land on the right - or wrong - version. The fastest way to stop shipping these is to make one of them fail in front of you, on purpose, once.