Module: Datelists & Reduced Facts | Duration: ~26 min hands-on | Lesson: 1 of 4

The growth team asks Dev for monthly active customers, refreshed daily. The naive query is easy to write: scan the last 30 days of events, group by customer, done. Then Dev does the volume math from the last course. TheWorldShop's event stream at Facebook scale would be about 100 billion rows a day, so every daily MAU refresh re-scans three trillion rows, 29 of the 30 days of which haven't changed since yesterday's run.

Three trillion rows a day to learn something you mostly knew yesterday. The cumulative pattern from earlier in this track was built for exactly this: carry history forward on one row per entity, touch only today's slice. This lesson applies it to activity facts, and builds the table that the next lesson will compress into a single integer.

2. Concept Explanation

From "scan the window" to "carry the window"

The MAU question is a 30-day window question, and window questions have two implementations:

Re-scan: every day, aggregate the last 30 days of facts. Cost: 30 days of fact volume, daily, forever. Correct, simple, and ruinous at scale.
Cumulate: keep one row per customer holding their activity history, and each day fold in only today's data. Cost: yesterday's compact table plus one day of facts. This is cumulative table design (the earlier course covers the pattern's mechanics and its backfill/PII costs) applied to fact-derived activity rather than dimensions.

The cumulated shape for activity is:

users_cumulated(
  customer_id   TEXT,
  dates_active  DATE[],   -- every date this customer was active, newest first
  ds            DATE      -- the as-of date of this row
)

One row per customer per as-of date. The dates_active array is the whole 30-day (or 300-day) window, pre-assembled. "Were they active in the last week?" stops being a scan and becomes array inspection on one row.

The daily fold

Each day's build is the familiar two-CTE cumulation:

yesterday = the cumulated table as of yesterday.
today = only today's events, aggregated to one row per active customer. This is also where you define "active": any event? a page view? an engagement action? The definition is a business decision (Lesson 5 of the fact course), and it's baked in right here.
FULL OUTER JOIN the two on customer id, then the three-case logic: brand-new customer seeds a one-element array; active-today customer gets today prepended; absent customer carries yesterday's array forward unchanged.

Two conventions matter more than they look:

Prepend, don't append. Keep the newest date at index 1. Every downstream consumer ("last 30 days") reads a prefix of the array instead of computing offsets from the tail, and the datelist conversion in the next lesson depends on this orientation.
Advance the as-of date for everyone. Absent customers still get a row with today's ds (yesterday.ds + 1). The table is a daily snapshot of all known customers, not just today's visitors; a customer with no row is a customer you've never seen, which is a different thing from a customer who stayed home.

The gotchas the raw data will throw at you

Two field notes from building this on real logs, both fact-course lessons come home to roost:

Null who-fields. Logged-out traffic and instrumentation bugs produce events with no customer id. In a plain aggregation they'd quietly vanish; in a FULL OUTER JOIN cumulation a null key is poison, it never matches itself, so null rows multiply. Filter customer_id IS NOT NULL in the today CTE, explicitly, every time.
Trust the source's types, verify the range. A "numeric id" column can carry values that overflow BIGINT (real systems hash ids into huge numerics). If the source is genuinely numeric-but-huge, store the id as TEXT in the cumulated table. An id's only job here is equality; you never do math on it, so the string costs you nothing and the overflow error costs you a 2am page.

What this table earns you

Once users_cumulated exists, every activity question becomes a one-row lookup against the latest partition:

MAU: cardinality(dates_active) filtered to the last 30 days > 0, no scan of the fact table at all.
Weekly active, daily active: same array, shorter prefix.
Activity streaks, days-since-last-active, "came back after a gap": all sitting on the row, the same wins the cumulative dimensions table delivered, now for behavior.

The residual cost is the array itself: 30-plus dates per customer, stored as full 4-byte dates, on billions of rows. That's the fat the next lesson burns off, the entire array collapses into one 32-bit integer, and the queries get faster at the same time.

3. Worked Example

Building users_cumulated for TheWorldShop, January 2026, from the raw event stream. Lab environment is the same Postgres-in-Docker from the cumulative-table-design course.

The table. Note TEXT for the id (the overflow lesson) and the comment discipline:

CREATE TABLE users_cumulated (
  customer_id   TEXT,
  -- list of past dates on which this customer was active, newest first
  dates_active  DATE[],
  -- the as-of date for this row
  ds            DATE,
  PRIMARY KEY (customer_id, ds)
);

The daily fold. Seed run: yesterday (Dec 31) is empty, today is Jan 1.

INSERT INTO users_cumulated
WITH yesterday AS (
  SELECT * FROM users_cumulated
  WHERE  ds = DATE '2025-12-31'
),
today AS (
  SELECT CAST(customer_id AS TEXT)          AS customer_id,
         DATE(CAST(event_time AS TIMESTAMP)) AS date_active
  FROM   events
  WHERE  DATE(CAST(event_time AS TIMESTAMP)) = DATE '2026-01-01'
    AND  customer_id IS NOT NULL              -- the null-key filter, always
  GROUP  BY 1, 2                              -- one row per active customer
)
SELECT
  COALESCE(t.customer_id, y.customer_id)      AS customer_id,
  CASE
    WHEN y.dates_active IS NULL
      THEN ARRAY[t.date_active]                       -- first sighting: seed
    WHEN t.date_active IS NOT NULL
      THEN ARRAY[t.date_active] || y.dates_active     -- active: PREPEND today
    ELSE y.dates_active                                -- absent: carry forward
  END                                          AS dates_active,
  COALESCE(t.date_active, y.ds + 1)            AS ds
FROM today t
FULL OUTER JOIN yesterday y USING (customer_id);

The three-case CASE is character-for-character the cumulative pattern from the earlier course; the only news is the prepend (ARRAY[today] || yesterday, not the reverse) and that "today" is an aggregation of facts, with the active-definition and the null filter living inside it.

Run it forward. Change the two dates to Jan 1/Jan 2 and rerun; then Jan 2/Jan 3; and so on (in production this is one parameterized Airflow task, and yes, the backfill is sequential, cumulation's standing tax). After a week:

SELECT customer_id, cardinality(dates_active) AS days_seen, dates_active
FROM   users_cumulated
WHERE  ds = DATE '2026-01-07'
ORDER  BY days_seen DESC
LIMIT  5;
-- top rows: daily regulars with 7-element arrays, newest date first
-- long tail: one-and-done visitors whose array froze at length 1

And the growth team's actual question, answered without touching the events table:

SELECT COUNT(*) FILTER (
         WHERE dates_active[1] > DATE '2026-01-07' - 7
       )                       AS weekly_active,
       COUNT(*)                AS all_known_customers
FROM   users_cumulated
WHERE  ds = DATE '2026-01-07';

Because the array is newest-first, "active in the last week" needs only element 1. No 7-day scan, no GROUP BY, no shuffle: one partition, one array index per row.

Aha: The cumulated activity table redefines what a day of MAU computation is: not "aggregate 30 days of behavior" but "fold one day of behavior into a running summary." The 30-day window still exists, but it's stored, not recomputed, and a stored window is paid for once instead of re-derived three-trillion-rows at a time.

4. Your Turn

Exercise: Extend the fold.

The growth team wants two more columns on users_cumulated: first_active_date (the earliest date ever seen for this customer) and is_new (true only on the row where the customer first appears).

Write the expression for first_active_date in the fold query. Which side of the coalesce wins, and why is that the opposite of the ds column's logic?
Write the expression for is_new.
A teammate proposes computing first_active_date later, as dates_active[cardinality(dates_active)] (the oldest element). Name the scenario from this course's production notes where that silently gives the wrong answer.

5. Real-World Application

This table shape, one row per user per day, activity history on the row, is the direct ancestor of Facebook's growth-accounting infrastructure. The daily/weekly/monthly active user numbers that appeared in earnings reports were computed from cumulated activity tables, not from re-scans of raw event streams, because at 100-billion-events-a-day the re-scan simply doesn't fit in a nightly window. The same shape powers the state-transition analytics (new, retained, churned, resurrected) that the analytical-patterns course builds, which is why first_active_date and is_new from the exercise aren't hypothetical: they're the first two columns that course will reach for.

The two production gotchas are also worth their scar tissue. The null-key filter is the difference between a table of customers and a table of customers plus one monstrous null-row lineage that grows without bound. And the id-type lesson generalizes: identifiers are for equality, not arithmetic, so when in doubt, strings. Teams migrating between id schemes (int to snowflake ids, region-prefixed ids) bless every table that made that call early.

6. Recap + Bridge

Window questions have a re-scan shape and a carry shape, and at fact volume only the carry shape survives: users_cumulated folds one day of events into a per-customer date array, newest first, nulls filtered, ids stored as text, with the three-case cumulative logic doing the work. Activity questions become one-row array lookups. The remaining waste is the array itself, thirty 4-byte dates per customer that are really just offsets from today. Next lesson replaces the whole array with one 32-bit integer, and turns "monthly active" into a single CPU instruction.