The Three Pattern Families: Thinking Above the SQL

Module: Applying Analytical Patterns | Duration: ~18 min | Lesson: 1 of 5


Priya has written a lot of pipelines at TheWorldShop. One afternoon, procrastinating on a design doc, she opens the last two years of her work and tries to sort it into piles. Growth reporting, seller-quality tracking, funnel dashboards, experiment readouts, fraud-label monitoring: dozens of pipelines, each one bespoke at the time, each one a week of thought.

The piles refuse to number more than three. Every pipeline she's ever shipped is either counting things in groups, comparing today's state to yesterday's, or sliding a window along time. Three shapes, wearing dozens of costumes. The week of thought per pipeline was mostly re-deriving which shape she was in. This course is about never paying that week again.


2. Concept Explanation

Why patterns beat SQL

SQL is already an abstraction: your query compiles into a plan (a tree of scans, filters, joins, aggregations) that some engine executes in a language you never see. Nobody mourns not writing the C++. The move this course proposes is one rung higher: recognize which analysis you're doing before writing any SQL at all, because once the pattern is named, the SQL mostly writes itself.

Two reasons this rung matters more every year:

  • Cognitive budget. The consumers of your work think in pictures and verdicts, not clauses. Priya once built a dashboard for TheWorldShop's CEO; the value wasn't the SQL, it was that the CEO never had to know SQL existed. The higher you go in scope (one pipeline, a family, a warehouse), the more your job is choosing shapes, not typing syntax.
  • The syntax layer is commoditizing. LLMs write competent SQL from a clear description. What they can't do is know that your seller-quality question is a state-transition problem, which makes the engineer who can name the pattern more valuable, not less: the pattern is the prompt.

The three families

Aggregation patterns. The keyword is GROUP BY; the verbs are count, sum, average, occasionally percentile (and median is just the 50th percentile; the mean is its own thing and skew-sensitive). This family is at least half of all analytics: trends over time, composition ("how many customers per country"), and root-cause analysis, decomposing a metric's movement along dimensions until the mover falls out. When TheWorldShop's weekly orders jump by a million, the aggregation answer isn't "up a million," it's "up 1.5 million in NA, down 500k in EU," and now there's a thread to pull. The pattern's real-world pedigree includes the day a national internet shutdown showed up as a single country's active users dropping to zero: the total dipped mysteriously; the dimensional cut named the cause in one query.

Cumulation patterns. The keyword is FULL OUTER JOIN; the subject is today versus yesterday. These patterns care about state and its transitions, and they have one property aggregation never has: no data is data. A customer who did nothing today doesn't appear in any GROUP BY, but their absence is exactly what churn is, so cumulation carries every known entity forward (the cumulative-table-design machinery) and treats the missing row as a signal. Growth accounting (Lesson 2) and survivorship curves (Lesson 3) live here.

Window patterns. The keyword is OVER (PARTITION BY ... ORDER BY ... ROWS BETWEEN ...). Two moves, and they're inverses: derivatives (day-over-day, week-over-week, year-over-year deltas, which amplify change and add noise) and integrals (rolling sums and moving averages, which smooth noise and reveal trend). Plus ranking, when order itself is the answer. The mechanics of window functions live in the Spark stratum's window-functions chapter; this course cares about when to reach for them (Lesson 4's funnels flirt with them; the smoothing tradecraft shows up throughout).

Is there a fourth family, enrichment, joining columns on? Deliberately no: these patterns operate at the master data layer, where the modeling courses upstream already put every column you need on the row. If you're doing heavy enrichment inside an analysis, the gap is in your data model, not your pattern vocabulary.

The gotchas that transfer across all three

Because patterns repeat, their failure modes repeat too. Three worth branding onto the inside of your eyelids:

  • Enough dimensions equals no aggregation. Group by country, and device, and tier, and signup week, and dog's name, and each group converges toward one person. A GROUP BY whose groups have one member is the raw data with extra steps, and it's statistically void: distributional behavior needs on the order of 30 members per group. Fewer, better-chosen dimensions; bucketize the rest (fact course, Lesson 5).
  • Percentages lie without their counts. "100% drop in Kyrgyzstan" means the one customer there left. Ship every rate with its numerator and denominator, or someone will page you over a rounding error wearing a percent sign.
  • Time is a dimension too, and it's high-cardinality. Ninety days times 200 countries is 18,000 cells before you've added anything else. Long-horizon analyses should coarsen time (weekly, monthly grain) as deliberately as they bucketize any other dimension, this is half of why the reduced-fact tier exists.

And one structural rule: run these patterns against the daily-aggregate layer, not raw facts. One row per entity per metric per day joins one-to-one with entity dimensions (no fan-out, minimal shuffle), and the same shape feeds experimentation for free: assign entities to test/control, join on the entity key, difference the metrics. Aggregating raw facts with "spicy" dimension joins mid-query is how analyses become incidents.


3. Worked Example

One business question, walked through the family-selection decision, because choosing the shape is the skill.

TheWorldShop leadership asks: "Seller listings grew 40% this year, but GMV only grew 12%. What's going on?"

Name the pattern before touching a keyboard. "What's going on" with a metric's movement is root-cause analysis: the aggregation family, decomposition along dimensions. Not cumulation (nobody asked about state transitions yet), not window (no smoothing or deltas requested).

Pick the layer. Daily aggregates: seller_daily_metrics(seller_id, ds, m_listings, m_gmv) joined one-to-one to dim_sellers. Never the raw order-lines table; the question is annual and dimensional, exactly what the middle rung is for.

Decompose. Cut the two growth rates by the dimensions most likely to explain a divergence:

SELECT d.dim_tier,
       SUM(m.m_listings) FILTER (WHERE m.ds >= DATE '2026-01-01') AS listings_now,
       SUM(m.m_listings) FILTER (WHERE m.ds <  DATE '2026-01-01') AS listings_prior,
       SUM(m.m_gmv)      FILTER (WHERE m.ds >= DATE '2026-01-01') AS gmv_now,
       SUM(m.m_gmv)      FILTER (WHERE m.ds <  DATE '2026-01-01') AS gmv_prior,
       COUNT(DISTINCT m.seller_id)                                AS sellers  -- the count beside the rate
FROM   seller_daily_metrics m
JOIN   dim_sellers d USING (seller_id)
GROUP  BY 1;

The cut shows minor-tier listings exploding while their GMV barely moves: the 40% was mostly zero-sales listings from a signup promotion. One more cut (by signup cohort month) confirms it. The divergence has a name, a segment, and an owner.

Notice what happened next, though. The natural follow-ups are: "are those promo sellers staying?" That's not aggregation anymore, it's survivorship (Lesson 3). "How many go dormant each month and do any come back?" State transitions: growth accounting (Lesson 2). "Where in onboarding do they stall?" Funnel (Lesson 4). One business thread, four patterns, and the fluency being built in this course is hearing which one each question is asking for.

Aha: The pattern is decided by the shape of the question, not the shape of the data: "why did it move" is aggregation, "who changed state" is cumulation, "how is it trending" is window. Engineers who route questions to patterns before writing SQL don't just work faster, they stop building bespoke pipelines for questions that differ only in costume, which is where whole years of team capacity quietly go.


4. Your Turn

Exercise: Route six questions.

For each TheWorldShop request, name the family (aggregation / cumulation / window), the one keyword or clause that will anchor the SQL, and any transferable gotcha you'd guard against:

  1. "Refund rate by product category and payment method, last quarter."
  2. "Of sellers active in June, how many went dormant in July, and how many dormant ones came back?"
  3. "Smooth the daily GMV chart, the weekday swings are drowning the trend."
  4. "Rank warehouses by fulfillment speed."
  5. "Refund rate by category, payment method, device, app version, and customer signup week."
  6. "Is our year-over-year growth accelerating or decelerating?"

5. Real-World Application

Root-cause-analysis frameworks are this lesson productized. The internal tools at large companies that "explain" a metric's week-over-week move are, mechanically, automated dimensional decomposition: take the delta, cut it along every registered dimension, rank the cuts by contribution, and present "the move is 80% attributable to segment X." Every one of them is the aggregation family running in a loop, and knowing that demystifies both the tool and its failure modes (it can only explain along dimensions someone registered, and it inherits the groups-of-one problem when segments get thin).

The pattern vocabulary also changes how teams scope work. "Build a seller-health pipeline" is a quarter of ambiguity; "seller health is growth accounting on the seller entity plus a 28-day rolling quality score" is a two-week ticket, because both halves name machinery that already exists. Teams that talk in patterns write less SQL, reuse more infrastructure, and, when the LLM writes the first draft of the query, actually know how to review it.


6. Recap + Bridge

Three families cover nearly everything: aggregation (GROUP BY; trends, composition, root-cause), cumulation (FULL OUTER JOIN; state, transitions, absence-as-data), and window (OVER; derivatives, integrals, ranks). The transferable gotchas, groups of one, rates without counts, time's cardinality, run against the daily-aggregate layer, apply to all three. The rest of the course is the greatest hits, one pattern at a time, starting with the most consequential: growth accounting, the five-state machine that tells a company whether it's actually growing, and doubles as a monitoring rig for any classifier you'll ever ship.