Module: Fact Data Modeling | Duration: ~18 min | Lesson: 1 of 6
Dev's first week owning TheWorldShop's notifications pipeline starts with a storage alert. The sellers dimension he worked with last quarter fits comfortably in one Postgres table. The notification events table does not fit comfortably anywhere: every customer gets a couple dozen pushes a day, and multiplied out that's tens of billions of rows, daily. The dimension table grew by a few thousand rows overnight. The fact table grew by more rows than the dimension table has accumulated in its entire life.
Same platform, same customers, wildly different physics. Dimensions describe who things are. Facts record what happened, and things happen a lot. Before you can model that firehose, you need a precise answer to a deceptively simple question: what exactly counts as a fact?
2. Concept Explanation
A fact is atomic
A fact is something that happened: a login, a payment, a click, a step counted by your watch. The test for a well-modeled fact is atomicity: you can't break it down into anything smaller.
"Ran a mile" sounds like a fact, but it isn't atomic. A mile decomposes into thousands of individual steps, and each step is the real fact. The mile is an aggregation of facts. Same with "threw a three-day sale event": the atomic facts are the individual page views, carts, and purchases inside it. The event is a rollup, and at that point it starts behaving like a dimension (all those facts are tied to one entity).
Facts appear at every layer of granularity, and you can always aggregate up. What you can't do is disaggregate down. So model and store the atomic level when the business needs it, and treat everything coarser as derived.
A fact is immutable
Here's the gift facts give you: they don't change. You logged in at 9:14am. Nothing that happens later can make that untrue. Dimensions carry all the slowly-changing, effective-dated, "which version was true when" machinery you saw earlier in this track. Facts carry none of it. The past is fixed.
That's one whole class of problems you never have to solve for facts. In exchange, facts hand you two different problems: volume and context.
The volume problem
Facts run 10x to 100x the size of your dimensions, and often worse. One person, one dimension row. That same person takes 10,000 steps a day, gets 25 notifications, fires hundreds of client events. The multiplier is actions per entity per day, and it's what turns two billion users into 50 billion notification rows daily. At Netflix, one network-request fact stream measured about two petabytes of brand new data per day.
This is why fact modeling decisions carry real money. A sloppy dimension table wastes a few gigabytes. A sloppy fact table at this scale moves your cloud bill by whole percentage points. Every column you keep, every duplicate you fail to remove, gets multiplied by billions.
The context problem
A fact in isolation is nearly worthless. "We sent a notification" tells you nothing. But chain three facts together: sent at 2:00pm, clicked at 2:20pm, purchased at 2:25pm, and you have a conversion funnel, which is the economic engine of every ads-and-commerce business on the internet. Google once tested about forty shades of blue for its links because a barely-better shade converted measurably more; that experiment only works if the see-it, click-it, buy-it facts can be stitched together.
So when you model a fact, always ask: what other facts or dimensions does this need to sit next to before it's useful? Sometimes it's other facts in a funnel. Sometimes it's a dimension join, "customers in Canada click pushes more than customers in the US", which turns a flat rate into a question worth investigating.
The anatomy: who, where, when, what, how
Well-modeled fact rows answer five questions:
- Who did it. Almost always pushed down to IDs:
customer_id,device_id,session_id. The names and attributes live in dimensions; the fact carries the keys. - Where it happened. Sometimes geography (country, city), just as often where in the product: the home page, the checkout page, which button. Often modeled as plain low-cardinality strings rather than IDs.
- What happened. The event itself, usually an enumeration:
generated,sent,delivered,clicked,convertedfor a notification funnel. This is the heart of the fact. - When it happened. A timestamp, non-negotiable.
- How they did it. The mechanism: "on an iPhone", "with the Android app". How and where blur together in virtual products, and how and who blur in device-heavy domains (is a car the actor or the mechanism?). Don't agonize; pick a convention and document it.
The what and when fields should carry a hard not-null guarantee. A fact with no event or no timestamp isn't analyzable; it's noise wearing a schema.
The when field has a trap in it
As companies go mobile-first, logging moves to the client, and client-side logging is better: it captures every interaction, not just the ones that trigger a server request. But clients live in local time zones. If each device logs its own wall-clock time, your global event stream is smeared across 24+ offsets and every daily rollup is quietly wrong.
The rule: clients log UTC, always. Convert to local time at query or presentation time, when you actually know which locale you want. This is a one-line decision when the logging library is written and a multi-quarter cleanup when it isn't.
3. Worked Example
TheWorldShop's push-notification funnel, modeled as atomic facts. Each notification moves through a lifecycle, and each transition is one immutable row:
Notice what's not here: no customer name, no seller tier, no country. Those are dimension attributes reachable by joining on customer_id. The fact carries keys, the event, and the time, and the NOT NULLs enforce the anatomy.
Now the context payoff. Each row alone is useless. Together, the funnel falls out of one aggregation:
And the atomicity test in action. A product manager asks Dev to "log a row when a customer finishes onboarding." Onboarding is five screens, each with its own completion. "Finished onboarding" is a mile, not a step. Dev logs the five atomic screen-completion facts instead, and derives "finished onboarding" as has all five. Six months later, when the PM asks "where in onboarding do people bail?", the answer is already in the table. Had he logged only the rollup, that question would need a schema change and a backfill.
Aha: You can always aggregate atomic facts up, but you can never split an aggregate back down. Every time you log the rollup instead of the atom, you're deciding, permanently, which questions the business will never get to ask about the past. Immutability makes facts easy to store and unforgiving to under-model.
4. Your Turn
Exercise: Classify and anatomize.
TheWorldShop's checkout team proposes logging these four things as facts. For each, say whether it's an atomic fact, an aggregate wearing a fact costume, or a dimension, and one sentence why:
customer completed checkout(checkout has 3 steps: cart review, payment, confirmation)payment_authorized at 2026-07-03 14:02:11 UTC for order 998customer's preferred payment method is now Apple Paydaily_gmv for seller 4471 on 2026-07-02 was $1,840
Then, for #2 only: label its who / where / when / what / how fields, and name the two fields that must be NOT NULL.
5. Real-World Application
The volume math drives real org design. At Facebook scale, two billion users times dozens of daily actions is on the order of 100 billion fact rows per day; a month of that is trillions of rows. Teams that internalize "facts are 10-100x dimensions" make different choices early: they push who-fields down to IDs instead of denormalizing names everywhere, they enforce not-null what/when at the logging library, and they design retention policies before the table exists rather than after the first storage incident.
The context principle is just as commercial. Notification systems, ads platforms, and growth teams are all funnel businesses: sent, clicked, converted. The individual fact streams are cheap; the joined funnel is the product. When TheWorldShop's growth team asks why click-through differs by country, the answer comes from exactly the pattern above, fact stream plus a dimension join on the who-field.
And the UTC rule earns its keep the first time a company expands internationally. A mobile app that logged device-local time for two years has a data archaeology problem no pipeline can fully fix: you can't reliably reconstruct which UTC instant "2024-03-09 02:30" meant in a time zone that skipped that half hour for daylight saving. Teams that got this wrong don't fix it; they draw a line in the sand, fix the client, and annotate every historical chart with an asterisk.
6. Recap + Bridge
A fact is an atomic, immutable record of something that happened, carrying who / where / when / what / how, with what and when never null and always in UTC. Facts trade the dimension world's mutability problems for two of their own: volume that multiplies every modeling mistake by billions, and the need for context, because a fact alone is worthless and a funnel of facts is a business. Next: where facts actually come from. Raw logs and fact data are inextricably linked, but they are not the same thing, and turning one into the other is most of the job.