← Back to all posts
#data-modeling#dimensional-modeling#metrics-layer#warehouse#data-quality

7 Data Modeling Mistakes That Silently Corrupt Your Metrics

By Petascale Labs ·

The scariest class of bug in a data warehouse is the one that doesn't throw an error. A NullPointerException wakes you up. A failed DAG pages you. But a query that parses cleanly, runs in two seconds, draws a chart, and returns a number that was never true - that one ships to the board deck. Nobody finds it until someone in finance runs the same metric a second way and the two numbers don't match.

Almost every one of these "wrong-without-an-error" bugs traces back to a data modeling decision, not a coding mistake. The SQL is usually fine. The problem is the shape of the data the SQL ran against: the wrong grain, a join that duplicates rows, a measure summed across an axis it can't be summed across, a ratio divided in the wrong order, a dimension that forgot its own history.

This is a field guide to the big ones. For each: the symptom you'll actually see in production, why it happens, and the fix. If you want the full treatment behind any of these, the Semantic & Metrics Layer curriculum walks each one as its own lesson, on a running example.

1. Mixed grain: the bug that lives in the schema

The symptom. A finance analyst pings you: "Average order value shows $312 in the dashboard, but I just ran it in a notebook and got $89. Which one is right?" Both queries are correct. Both ran against the same fact_sale table. They disagree anyway.

Why it happens. The grain of a fact table is the answer to one question: "what does one row mean?" A good grain statement is one short sentence - "one row per order line," "one row per daily inventory snapshot per warehouse per SKU." If you can't say it in a sentence, the table doesn't have a grain; it has contents, which is a worse thing.

The classic failure is mixing grains in one table - order-level rows and line-level rows sitting together:

order_idline_idamountshipping
1001150NULL
1001280NULL
1001NULLNULL12

Ask for SUM(amount + shipping) and you get a meaningless number: amount is at line grain, shipping is at order grain. Every analyst who touches this table gets a different answer depending on how they COALESCE, GROUP BY, or filter. The bug isn't in anyone's SQL. It's in the schema.

The fix. Declare the grain before you write a single column, and never let two grains share a table. If you need both order-level and line-level facts, that is two fact tables (fact_order and fact_order_line), each with one row meaning one thing. Grain is not a style choice you can fix with a GROUP BY afterward - it's the one decision that defines the table. More: Grain: the one decision.

2. The fan-out join trap (and the symmetric-aggregate fix)

The symptom. You join orders to order_items to get product details, then SUM(order_total). The number comes back almost exactly double what finance reports. You check the join, the data, the filter - everything looks right. The number is still wrong.

Why it happens. Each order has, on average, two line items. Joining orders to order_items turns one order row into two. Now order_total, which lives on the order, appears twice, and SUM counts it twice. The join didn't add wrong data - it duplicated a coarse-grain measure across the finer-grain rows it created.

Here it is with tiny numbers. Two orders worth $150 in total:

orders

order_idorder_total
1$100
2$50

order_items

order_iditem
1A
1B
2C

Join the two tables and order 1's $100 now sits on two rows (one for item A, one for item B). So SUM(order_total) adds 100 + 100 + 50 = $250 instead of $150. The $100 got counted twice because the join copied it onto every line item - that is the entire bug.

This is the single most common way a correctly-defined metric returns a wrong number, and it's sneaky for three reasons:

  • The join is correct. You genuinely needed order_items. The bug is the aggregation after it, not the join itself.
  • The wrong number is plausible. 2x, 1.7x, 3x - these look like real numbers. Without finance's figure to compare against, you'd never know.
  • You can't just divide by two. The inflation is weighted by each order's line-item count, so a few big orders skew it unpredictably.

It gets worse when you sum measures from both grains in one query - say SUM(order_total) and SUM(item_quantity) together. After the join, item_quantity is correct (it's at the join's grain) and order_total is inflated, in the same SELECT. You can't reason about fan-out per query; you have to reason about it per measure.

The fix. Two defenses:

  1. Aggregate before joining. Roll each fact up to a common grain first, then join the aggregates. No row multiplication, because both sides are now at the same grain. Use this whenever you can.
  2. Symmetric aggregates. When you can't pre-aggregate - because you need to filter on a finer-grain attribute (say, "orders containing a clearance item") while measuring a coarser one - you keep the fanned-out join and dedupe the coarse measure during aggregation, typically by hashing the row's primary key so each order's revenue counts exactly once.

This is also a warning about your BI tool: some tools (famously Looker) do symmetric aggregates automatically; some silently SUM the inflated rows and return the wrong number with no warning. And a tool might do SUM symmetrically but botch AVG or COUNT(DISTINCT). Knowing exactly what your tool does per aggregation function is the difference between trusting a fanned-out dashboard and shipping a number that's double. A metrics layer makes "is this fan-out-safe?" a non-question by picking the right technique every time. Deep dives: the fan-out join trap and symmetric aggregates.

3. Semi-additive measures summed across time

The symptom. It's quarter-end. Finance reconciles the warehouse against the general ledger. Cash balances are off by $84M. You trace it: a dashboard is summing account_balance from a daily-snapshot fact across 90 days and across 10,000 accounts. The cross-account sum is fine. The cross-day sum just added the same dollar to itself 90 times.

Why it happens. Picture one bank account holding $1,000, snapshotted three days in a row:

dayaccount_balance
Mon$1,000
Tue$1,000
Wed$1,000

SUM(account_balance) returns $3,000 - but the account never held $3,000. It held $1,000 the whole time. Summing a balance across days counts the same money once per day. (Summing across accounts on a single day is fine: two accounts with $1,000 each really is $2,000.)

That is the heart of it: not every measure can be summed across every axis. There are three classes:

  • Additive - sum across anything. revenue, quantity_sold, clicks. If your fact table is pure-additive, BI tools can't get the aggregation wrong.
  • Semi-additive - sum across some axes, not others. These are balances and inventories: a level at a moment, not an amount over a period. account_balance, inventory_on_hand, headcount, open_tickets. You can sum them across customers or warehouses, but never across time.
  • Non-additive - never sum. Ratios, percentages, prices. A sum of percentages isn't a percentage.

The trap is that BI tools default to SUM, and a semi-additive measure summed across time produces a confident, catastrophically wrong number - like the $84M above.

The fix. For semi-additive measures, the correct time aggregation is an end-of-period snapshot (closing balance), a period-start snapshot (opening balance), or an average over the period - never a sum. Encode this in the semantic layer or the measure definition so a consumer who drags "balance" onto a time axis can't accidentally SUM it. Full lesson: additive, semi-additive, non-additive measures.

4. Ratio metrics and the order-of-aggregation trap

The symptom. One engineer defines "average order value" as it reads in English - average of (amount / items) per order, AVG(amount / item_count) - and ships it. AOV reads $23. Finance computes it their way - total revenue over total items, SUM(amount) / SUM(item_count) - and gets $31. Eight dollars apart, on a metric that's one division. Neither is a typo.

Why it happens. There are two orders in which you can compute a ratio, and they are not the same number:

  1. Aggregate, then divide: SUM(numerator) / SUM(denominator). Weights by size - a big order counts more.
  2. Divide, then aggregate: AVG(numerator / denominator). Weights every row equally - a $5 order and a $5,000 order each contribute one ratio.

A tiny example shows how far apart they get:

orderamountitemsamount / items
A$1001$100
B$1010$1
  • Aggregate, then divide: (100 + 10) / (1 + 10) = 110 / 11 = $10 per item.
  • Divide, then aggregate: average of ($100, $1) = $50.50.

Same two rows, a 5x gap, no bug in sight. The second number is dominated by order A's lopsided $100-per-item; the first correctly spreads it across all the items actually sold, which is almost always what the business means.

This is the most-hit metric bug because English is ambiguous ("average order value," "conversion rate," "cost per click" all just say "divide"), both formulas run without erroring, and - worst - the gap changes when you slice. The AVG-of-per-row version disagrees differently in every group, so it breaks the moment someone adds a new GROUP BY you didn't test.

The fix. Stop hand-rolling the division. Declare the numerator and denominator as separate measures and use a dedicated ratio type, so the engine always aggregates each first and divides last - at whatever grain and slice the consumer requests:

metric:
  name: aov
  type: ratio
  numerator: revenue        # a measure: SUM(amount)
  denominator: item_count   # a measure: SUM(item_count)

The win isn't correctness once - it's correctness under slicing. More: ratio metrics and their order-of-aggregation trap.

5. SCD Type 1's silent lie: as-is vs as-was

The symptom. A customer placed $10,000 of orders last quarter while living in the West region. This quarter they move to East. You overwrite their region in dim_customer, then ask "what was revenue by region last quarter." That $10,000 now shows up under East - a region the customer wasn't even in when those orders happened. The dashboard reports a number that was never true, and nothing flags it.

Why it happens. This is the question every Slowly Changing Dimension type is really answering: when a fact joins to a dimension, which version of the dimension does it attach to? A fact should join to the dimension as it was at the moment the fact happened (as-was), not as it is now (as-is). SCD Type 1 overwrites the old value, so it throws away exactly the information you'd need to honor that - silently.

Closely related, and the single most common SCD anti-pattern in production: joining facts to the current dimension version even when you've gone to the trouble of keeping SCD2 history.

SELECT c.region, SUM(f.amount)
FROM   fact_sale f JOIN dim_customer c
  ON   c.customer_id_src = f.customer_id_src   -- natural-key join
 WHERE c.is_current = TRUE
GROUP BY 1;

The schema looks as-was; the query is silently as-is. It happens because the natural-key join "just works" and looks correct on the surface.

The fix. Resolve the version-aware surrogate key on the fact at ETL time and join USING (customer_key) - never join an SCD2 dimension by natural key in the BI tool. And reserve Type 1 for attributes you will never report on historically; default to Type 2 for anything that feeds financial or regulatory reporting. You can replay this entire decision in the browser with the SCD Playground - push a change timeline through Types 0-6 and watch which version each fact lands on - and the deeper write-up is in Slowly Changing Dimensions, actually explained.

6. The SCD2 operational anti-patterns

Type 2 done right at design time can still bite you in operations. The ones that take a senior engineer days to diagnose:

  • The is_current view race. A view defined as WHERE is_current = TRUE works fine - except during the ETL window. If the expire-old and insert-new steps aren't atomic, there's a moment where a customer has two current rows (or zero), and an analyst querying mid-load gets duplicates. Fix: do the update + insert in a single MERGE/transaction. dbt snapshots get this right by default.
  • Version explosion. SCD2-ing a column that changes constantly (a last_login_at) generates 365 dim rows per customer per year. A 1M-customer dim becomes 1.8 billion rows in five years. Fix: SCD2 only the columns that need history; SCD1 the rest, or split the high-churn ones into a mini-dimension. With dbt snapshots, set check_cols=['segment','region','tier'] explicitly - the check strategy defaults to all columns.
  • Late-arriving rows. A fact references a customer whose dim row doesn't exist yet (the CDC stream is behind). Dropping it is silent data loss; bucketing it to -1 is visible but uncorrelated. Fix: insert an inferred placeholder dim row immediately, point the fact at its surrogate, and update the placeholder in place when the real data lands. For late-arriving facts, resolve the surrogate against the historical dim version, not the current one.
  • Inferred dims that never get cleaned up. The placeholders from the previous fix are supposed to be filled in later. The CDC stream stalls, they never are, and a year on, 12% of your dim is customer_name IS NULL rendering a giant "NULL" bucket on every dashboard. Fix: monitor the inferred-dim count and alert if it doesn't decay.
  • Time-zone-naive validity windows. valid_from stored as a naive date while facts are stamped in UTC means a window of hours where facts resolve to the wrong dim version. Fix: store valid_from/valid_to as TIMESTAMP WITH TIME ZONE (or always UTC) and use half-open intervals so a single instant never matches two versions.

Full catalog: SCD anti-patterns in the wild and late-arriving dimensions and facts.

7. Metric drift: the bug that's really an org chart

The symptom. Monday, the growth dashboard says revenue grew 12% last quarter. Tuesday, the finance deck says 9%. Wednesday, a data scientist's notebook says 10.5%. Nobody made a mistake. The growth query counts refunds in the refund month; finance counts them in the sale month; the notebook excludes test accounts. Three "correct" queries, three answers, one reconciliation meeting that eats a Thursday - and recurs next quarter.

Why it happens. A metric isn't a number; it's a number plus a pile of buried decisions - what rows count, what window, what grain, what filters. When the definition of "revenue" is duplicated across a LookML model, a dbt model, a notebook, and a spreadsheet some analyst left behind in 2023, you have four chances to diverge, and they will, because each is patched independently. You can't out-engineer this with more careful SQL; the SQL in each place is fine. The disease is duplication; disagreeing numbers are just the symptom.

This shows up structurally as conformance forks - a second dim_customer called dim_customer_v2_marketing that someone built because getting an hourly refresh added to the central dim "would take two quarters." The fork looks like a technical problem (a duplicate dimension) but the root cause is almost always organizational: coordination cost was higher than fork cost, so forking was the rational move.

The fix. Define each metric once, in one place, and make every consumer read that definition instead of re-implementing it - that single place is the metrics layer. And fix the coordination cost that drives forks: same-day PR review for additive dim changes, an extension/satellite mechanism for team-specific columns, tiered dims (a nightly conformed dim plus an hourly latest-snapshot view sharing the same surrogates). Patch the technical artifact without the cause and the same fork reappears in six months under a new name. More: the metric drift problem and when conformance fails.

Bonus: the cost mistake - One Big Table update amplification

The seven above all corrupt a number. This last one is worth knowing because it corrupts a bill instead - same silent-failure shape, different blast radius.

The symptom. Support updates one customer's region in the CRM. Six minutes later Snowflake has billed you for rewriting 12,000 rows. A marketing batch updates 8,000 customers' segments and rewrites 400,000 rows; the Slack channel lights up with "why did our bill spike on Tuesday?"

Why it happens. In a star schema, a customer attribute lives in one row of dim_customer - update it, one row written. In a denormalized One Big Table, that attribute is copied onto every order and event the customer ever touched - updating it rewrites N rows, where N is the customer's entire footprint. Columnar storage makes it worse: files are immutable, so "update one row" means "rewrite the whole file that contains it." A change touching 50,000 rows scattered across hundreds of files can rewrite gigabytes to alter a fraction of a percent.

The fix. Match the model to the workload. OBT earns its keep for read-heavy, append-mostly, self-serve analytics; it's a poor fit for high-churn attributes that mutate often. Keep mutable attributes in a dimension and join, or keep the OBT but accept that attribute updates are expensive and batch them. Delta/Iceberg delete files soften the blow but accumulate and need compaction. More: the cost of OBT update amplification.

A diagnostic checklist

When a number is disputed and nothing errored, walk this list before you blame the SQL:

  1. What is the grain of every table in the query? If two tables in one FROM/JOIN have different grains and you're summing across them, suspect fan-out or mixed grain first.
  2. Is every summed measure additive across the axes you're summing? Balances and inventories across time are the classic landmine.
  3. Is any ratio computed as AVG(a/b) instead of SUM(a)/SUM(b)? Check whether it survives a new GROUP BY.
  4. Does the fact join to the dimension version that was true when the fact happened? Or is it silently joining to "current"?
  5. Is the disputed metric defined in more than one place? If so, the bug is duplication, not arithmetic.
  6. Did the cost spike, not the number? Look for update amplification on a denormalized table.

The one thing to keep

Every bug here shares a signature: it runs clean and lies quietly. No exception, no failed job - just a confident number that was never true. That's why they survive code review and surface only at reconciliation, and it's why data modeling is worth taking as seriously as the application code that everyone already lints, tests, and reviews.

The defenses are knowable and mostly mechanical: state the grain in one sentence, aggregate before you join (or trust a layer that knows when to), classify every measure's additivity, declare ratios instead of dividing by hand, join facts to the dimension version that was true at the time, and define each metric exactly once. None of it is exotic. It just has to be deliberate, because the failure mode is silence.

If you want to build the instinct rather than memorize the list, the Semantic & Metrics Layer track teaches each of these as a running war story with worked SQL, and the SCD Playground lets you reproduce the as-of-join bug with your own change timeline and watch the facts land on the right - or wrong - version. The fastest way to stop shipping these is to make one of them fail in front of you, on purpose, once.

Found this useful? Give it a like.