Module: Datelists & Reduced Facts | Duration: ~26 min hands-on | Lesson: 1 of 4
The growth team asks Dev for monthly active customers, refreshed daily. The naive query is easy to write: scan the last 30 days of events, group by customer, done. Then Dev does the volume math from the last course. TheWorldShop's event stream at Facebook scale would be about 100 billion rows a day, so every daily MAU refresh re-scans three trillion rows, 29 of the 30 days of which haven't changed since yesterday's run.
Three trillion rows a day to learn something you mostly knew yesterday. The cumulative pattern from earlier in this track was built for exactly this: carry history forward on one row per entity, touch only today's slice. This lesson applies it to activity facts, and builds the table that the next lesson will compress into a single integer.
2. Concept Explanation
From "scan the window" to "carry the window"
The MAU question is a 30-day window question, and window questions have two implementations:
- Re-scan: every day, aggregate the last 30 days of facts. Cost: 30 days of fact volume, daily, forever. Correct, simple, and ruinous at scale.
- Cumulate: keep one row per customer holding their activity history, and each day fold in only today's data. Cost: yesterday's compact table plus one day of facts. This is cumulative table design (the earlier course covers the pattern's mechanics and its backfill/PII costs) applied to fact-derived activity rather than dimensions.
The cumulated shape for activity is:
One row per customer per as-of date. The dates_active array is the whole 30-day (or 300-day) window, pre-assembled. "Were they active in the last week?" stops being a scan and becomes array inspection on one row.
The daily fold
Each day's build is the familiar two-CTE cumulation:
- yesterday = the cumulated table as of yesterday.
- today = only today's events, aggregated to one row per active customer. This is also where you define "active": any event? a page view? an engagement action? The definition is a business decision (Lesson 5 of the fact course), and it's baked in right here.
FULL OUTER JOINthe two on customer id, then the three-case logic: brand-new customer seeds a one-element array; active-today customer gets today prepended; absent customer carries yesterday's array forward unchanged.
Two conventions matter more than they look:
- Prepend, don't append. Keep the newest date at index 1. Every downstream consumer ("last 30 days") reads a prefix of the array instead of computing offsets from the tail, and the datelist conversion in the next lesson depends on this orientation.
- Advance the as-of date for everyone. Absent customers still get a row with today's
ds(yesterday.ds + 1). The table is a daily snapshot of all known customers, not just today's visitors; a customer with no row is a customer you've never seen, which is a different thing from a customer who stayed home.
The gotchas the raw data will throw at you
Two field notes from building this on real logs, both fact-course lessons come home to roost:
- Null who-fields. Logged-out traffic and instrumentation bugs produce events with no customer id. In a plain aggregation they'd quietly vanish; in a
FULL OUTER JOINcumulation a null key is poison, it never matches itself, so null rows multiply. Filtercustomer_id IS NOT NULLin the today CTE, explicitly, every time. - Trust the source's types, verify the range. A "numeric id" column can carry values that overflow
BIGINT(real systems hash ids into huge numerics). If the source is genuinely numeric-but-huge, store the id asTEXTin the cumulated table. An id's only job here is equality; you never do math on it, so the string costs you nothing and the overflow error costs you a 2am page.
What this table earns you
Once users_cumulated exists, every activity question becomes a one-row lookup against the latest partition:
- MAU:
cardinality(dates_active) filtered to the last 30 days > 0, no scan of the fact table at all. - Weekly active, daily active: same array, shorter prefix.
- Activity streaks, days-since-last-active, "came back after a gap": all sitting on the row, the same wins the cumulative dimensions table delivered, now for behavior.
The residual cost is the array itself: 30-plus dates per customer, stored as full 4-byte dates, on billions of rows. That's the fat the next lesson burns off, the entire array collapses into one 32-bit integer, and the queries get faster at the same time.
3. Worked Example
Building users_cumulated for TheWorldShop, January 2026, from the raw event stream. Lab environment is the same Postgres-in-Docker from the cumulative-table-design course.
The table. Note TEXT for the id (the overflow lesson) and the comment discipline:
The daily fold. Seed run: yesterday (Dec 31) is empty, today is Jan 1.
The three-case CASE is character-for-character the cumulative pattern from the earlier course; the only news is the prepend (ARRAY[today] || yesterday, not the reverse) and that "today" is an aggregation of facts, with the active-definition and the null filter living inside it.
Run it forward. Change the two dates to Jan 1/Jan 2 and rerun; then Jan 2/Jan 3; and so on (in production this is one parameterized Airflow task, and yes, the backfill is sequential, cumulation's standing tax). After a week:
And the growth team's actual question, answered without touching the events table:
Because the array is newest-first, "active in the last week" needs only element 1. No 7-day scan, no GROUP BY, no shuffle: one partition, one array index per row.
Aha: The cumulated activity table redefines what a day of MAU computation is: not "aggregate 30 days of behavior" but "fold one day of behavior into a running summary." The 30-day window still exists, but it's stored, not recomputed, and a stored window is paid for once instead of re-derived three-trillion-rows at a time.
4. Your Turn
Exercise: Extend the fold.
The growth team wants two more columns on users_cumulated: first_active_date (the earliest date ever seen for this customer) and is_new (true only on the row where the customer first appears).
- Write the expression for
first_active_datein the fold query. Which side of the coalesce wins, and why is that the opposite of thedscolumn's logic? - Write the expression for
is_new. - A teammate proposes computing
first_active_datelater, asdates_active[cardinality(dates_active)](the oldest element). Name the scenario from this course's production notes where that silently gives the wrong answer.
5. Real-World Application
This table shape, one row per user per day, activity history on the row, is the direct ancestor of Facebook's growth-accounting infrastructure. The daily/weekly/monthly active user numbers that appeared in earnings reports were computed from cumulated activity tables, not from re-scans of raw event streams, because at 100-billion-events-a-day the re-scan simply doesn't fit in a nightly window. The same shape powers the state-transition analytics (new, retained, churned, resurrected) that the analytical-patterns course builds, which is why first_active_date and is_new from the exercise aren't hypothetical: they're the first two columns that course will reach for.
The two production gotchas are also worth their scar tissue. The null-key filter is the difference between a table of customers and a table of customers plus one monstrous null-row lineage that grows without bound. And the id-type lesson generalizes: identifiers are for equality, not arithmetic, so when in doubt, strings. Teams migrating between id schemes (int to snowflake ids, region-prefixed ids) bless every table that made that call early.
6. Recap + Bridge
Window questions have a re-scan shape and a carry shape, and at fact volume only the carry shape survives: users_cumulated folds one day of events into a per-customer date array, newest first, nulls filtered, ids stored as text, with the three-case cumulative logic doing the work. Activity questions become one-row array lookups. The remaining waste is the array itself, thirty 4-byte dates per customer that are really just offsets from today. Next lesson replaces the whole array with one 32-bit integer, and turns "monthly active" into a single CPU instruction.