Module: Cumulative Table Design | Duration: ~18 min | Lesson: 1 of 4
Maya, a data engineer at TheWorldShop, gets two messages in the same hour. An analyst wants "sellers by tier by region, quick, it's for the board deck." A backend engineer wants a seller lookup that returns in under 10ms because it sits behind the storefront. Maya points both of them at the same beautiful sellers table she built last quarter.
The analyst's dashboard times out. The storefront call pulls back a 4KB blob of nested arrays for every request and the on-call pages at 2am.
Maya didn't build a bad table. She built one table for two consumers who needed opposite things. The biggest data engineering mistakes don't come from bad SQL. They come from modeling data for the wrong customer.
2. Concept Explanation
Before you model anything, answer one question: who reads this? The answer changes the shape of the table more than any other decision you'll make.
Four consumers, four shapes
- Analysts and data scientists want flat tables. Identifier plus decimals plus a few strings. Every column ready to
SUM,AVG,WHERE, andGROUP BYwith no unpacking. Nested types are a tax on them. - Other data engineers are the one audience that's fine with complex types. If your table feeds ten downstream pipelines built by people who unpack structs for a living, nesting is a feature, not a burden.
- ML models want the identifier and a wide row of flat features, mostly decimals, some categoricals. They don't care what you named the columns.
- Customers and executives should almost never see a table. They see a chart with good annotations. If you're handing a VP an Excel export, something upstream was modeled wrong.
Same underlying facts, four different correct shapes. Model for the wrong one and you either make analysts unpack arrays or make an online service scan a table it should point-lookup.
OLTP and OLAP are the two ends
There are two classic ways to model, and they optimize for opposite things.
OLTP (online transaction processing) is how software engineers model for apps. Third normal form, minimal duplication, foreign keys, constraints, linker tables. It's tuned to touch one entity fast: one seller, one order. Postgres and MySQL live here.
OLAP (online analytical processing) is how most data engineers model for analysis. Duplication is fine. What matters is scanning a population (the whole table or a big slice) without joining a pile of tables together. A well-shaped OLAP table lets you GROUP BY instead of JOIN.
The failure is using one where you need the other:
- Model OLTP data like OLAP and your app slows down: it drags in wide, duplicated columns to answer a single-row lookup.
- Model OLAP over raw OLTP and every analytical query becomes a join storm. Shuffle everywhere, slow, expensive, painful.
The continuum: production to metrics
OLTP and OLAP aren't a binary. They're the two ends of a continuum, and the interesting engineering lives in the middle.
- Production database snapshots. The raw OLTP tables, dumped to the lake. At TheWorldShop that's roughly 40 normalized tables describing sellers: profile, payout account, catalog, ratings, fulfillment SLAs, and so on.
- Master data. The middle ground. Still one row per entity (deduped, complete), but assembled from all those production tables into one coherent definition. This is where
sellersshould live. It's normalized enough to be a source of truth and complete enough that nobody has to join 40 tables to answer "what do we know about this seller." - OLAP cubes. Flatten master data out. Now you might have many rows per entity so you can slice and dice: seller by region by tier by month. Analysts love this layer.
- Metrics. Aggregate the cube down to a handful of numbers. "Average GMV per active seller." One row, one number.
Read it as a distillation. You take 40 messy production tables, put Humpty Dumpty back together into master data, split that back out cleanly into a cube, then smash the cube down into a single metric.
Why the middle matters
Without a master data layer, the analyst's first task is "join these 40 tables and good luck." At TheWorldShop, Maya's seller-pricing pipeline pulls from about 40 production tables. Collapsing them into one master sellers table is the difference between a five-minute query and a five-hour one for everyone downstream.
Master data is also where complex types earn their keep. Your consumers here are mostly other data engineers, so a struct or an array is fair game. That's exactly the freedom you don't have one layer over in the OLAP cube, where analysts want everything flat.
3. Worked Example
Maya inventories what "a seller" actually is at TheWorldShop. The production side is normalized across many tables:
An analyst asking "GMV by seller tier by region" against this has to join five to forty tables and aggregate. Slow, and every analyst re-derives the same joins.
Maya builds a master table instead: one row per seller, complete, deduped.
Notice year_stats is an array of struct. That's a deliberate middle-ground choice: it keeps the table at one-row-per-seller (great for engineers joining to it) while still carrying every year of history. If the main consumer were analysts, Maya would explode it flat into an OLAP cube. If the main consumer were the storefront, she'd hand the app a tiny compressed blob and nothing else.
Same facts, three shapes, three consumers:
Aha: Master data is not "a bigger OLAP cube" and it's not "a copy of production." It's the deduped, complete definition of an entity that sits between them, so nobody downstream has to choose between joining 40 tables and inheriting your join mistakes. Build the middle once and every layer above it gets cheap.
4. Your Turn
Exercise: Place the table, then reshape it.
TheWorldShop has a raw orders stream (one row per order line, OLTP-shaped) and needs three things built on top of it:
- A
customersmaster table that any pipeline can join to for "everything we know about a customer." Which layer of the continuum is this, and what's the grain (rows per customer)? - A dataset an analyst will use to chart "revenue by customer segment by month." Which layer, and what's the grain now?
- A single tile on the exec dashboard: "monthly active customers this month." Which layer?
5. Real-World Application
This continuum is not a textbook abstraction, it's the org chart of a mature data platform.
At Airbnb, pricing-and-availability master data was assembled from roughly 40 production tables into a single table describing every listing's price and availability. Without it, every downstream analyst and every pricing model would have re-implemented the same 40-way join, at different times, with subtly different logic, and the numbers would have disagreed. The master layer made "what is this listing's price" have exactly one answer.
At Facebook, dim_all_users was master data for every user in the company. It had on the order of 10,000 downstream pipelines. It wasn't just growth analytics, it was the user table everyone joined to. That's the leverage of the middle layer: build it once, correctly, and thousands of consumers inherit a coherent definition instead of re-deriving one badly.
The anti-pattern shows up as "just dump production to the lake and let analysts query it." It feels cheap because you skip the modeling. You pay for it forever in join storms, disagreeing metrics, and the slow rot of a warehouse where no two dashboards define "active seller" the same way. The master layer is where you spend modeling effort so that everyone above you doesn't have to.
6. Recap + Bridge
Who reads the table decides its shape. OLTP and OLAP are two ends of a continuum, and master data is the deduped, complete middle that keeps every layer above it cheap. Next we'll build that master table for real: the cumulative pattern that full-outer-joins yesterday to today, coalesces the unchanging bits, and carries all of history forward in one row, with no GROUP BY in sight.