Module: Slowly Changing Dimensions | Duration: ~12 min | Lesson: 1 of 8

Q: "What was the customer's region when they placed the order?"

If your answer is "look up dim_customer.region for that customer_key", your warehouse is lying. The region you'd return is today's region, not the region at the time of the order. If the customer moved from EU to APAC last quarter, every order they placed in Q1 is now reported as APAC. Quarterly revenue-by-region is wrong, and nobody can tell.

This is the SCD problem. It's almost every reporting question in disguise. "How much did we sell in segment X last year?", segment is a dim attribute, attribute changed mid-year, the answer depends on which-segment-when. "How many active accounts did we have at the end of Q3?", active is a dim attribute that changes. Any time a dim attribute can change AND a downstream question is historical, you have an SCD problem.

2. Concept Explanation

Why "slowly"?

Kimball coined the term "slowly changing" because the contrast was facts (high-volume, fast-changing) vs dimensions (low-volume, mostly stable). Dimensions don't churn, customers don't update their email hourly, but they do change occasionally, and "occasional" is the dangerous regime. If every dim row changed every day you'd notice. Because changes are rare, they slip through unmodeled.

The two reporting modes

Every reporting question that touches a dim attribute is implicitly one of two modes:

"As-was" reporting: the attribute value at the time of the event. "What was the customer's region when the order was placed?" This is the default expectation of finance, marketing, and most business stakeholders.
"As-is" reporting: the attribute value as of today. "How much have customers currently in EU spent over the last year (including when they were elsewhere)?" Useful for segmentation analysis.

The trap: most warehouses default to as-is (because the dim only stores current values) while users assume as-was. The result is a quietly wrong number that confidence-displays in a dashboard.

The four SCD types (preview)

Kimball enumerated four (later six) handling strategies. The three you'll use 95% of the time:

Type 1, Overwrite. The dim row is updated in place. History is destroyed. "As-is" reporting only. Cheap, lossy.
Type 2, New row per change. Each change inserts a new dim row with a new surrogate and a validity window. Facts join to the version current at the event time. "As-was" reporting, full history. Expensive, accurate.
Type 3, Limited history. Add a previous_X column for one prior value. "Was the customer recently in EU?" cheaply answerable. Doesn't scale beyond one prior value.

We'll cover each in detail in lessons 2–4, then Type 6 (hybrid), bitemporal, late-arriving, and the anti-patterns.

The cost-vs-correctness tradeoff

At the heart of SCD design is a cost-correctness curve:

Type	Storage cost	Query complexity	Historical accuracy
1	Lowest	Lowest	None (only current)
3	Low	Low	One prior value
2	Highest	Medium	Full
6	Highest	Highest	Full + recent

The naive instinct "let's just SCD2 everything" works for small dims but is wasteful for large or rarely-queried-historically ones. The right answer is per-attribute: which attributes need history, which don't?

The per-attribute decision

For a single dim, different columns can have different SCD treatment:

customer_email, SCD1 (overwrite). Nobody asks "what was the email at the time of the order?"
customer_region, SCD2. Reporting needs as-was.
marketing_segment, SCD2. Same reason.
customer_name, SCD1 typically (corrections), but could be SCD3 if you want to keep one prior name visible for audit.
birth_date, SCD1 (corrections only).
customer_lifetime_value, SCD1, snapshotted in a periodic-snapshot fact instead (it's a derived measure, not a dim attribute).

This per-attribute analysis is the most-skipped exercise in modeling. Most teams pick one SCD type for the whole dim and live with the consequences.

3. Worked Example

Bring up the lab. You cloned the lab repo in an earlier course on this track. From the repo root, bring up this course's sandbox:

cd petascalelabs-lab-setup/semantic-and-metrics-layer/dimensional-modeling-foundations/slowly-changing-dimensions/
./scripts/setup.sh

Verify the seed tables are reachable:

./scripts/verify.sh
# expected: "SCD lab ready: base star tables + 335 customer change events in DuckDB"

Whenever a lesson says "open the lab", drop into an interactive DuckDB shell with the seed CSVs mounted read-only at /seed:

docker compose exec duckdb duckdb

You are helping me bring up a local DuckDB lab on my machine.

Course: Slowly Changing Dimensions (TheWorldShop scenarios).
Lab artifacts (already in petascalelabs-lab-setup/semantic-and-metrics-layer/dimensional-modeling-foundations/slowly-changing-dimensions/):
  - docker-compose.yml: a single DuckDB container with the seed CSVs mounted read-only at /seed
  - data/: raw_customers.csv, raw_products.csv, raw_dates.csv, raw_orders.csv, customer_changes.csv (synthetic)
  - scripts/setup.sh, scripts/verify.sh, scripts/teardown.sh

My machine:
  OS: <fill in>
  RAM: <fill in GB>
  Docker version: <fill in>

Please walk me through:
1. Any OS-specific prerequisites (Rosetta, WSL2, virtualization flags) for my OS.
2. Confirming Docker is running with enough memory (>=2 GB is plenty here).
3. How to confirm ./scripts/verify.sh worked, and what to check if the container never reports healthy.
4. The teardown command when I want to reclaim disk.

Do not assume my OS, ask if it's not clear.

Let's demonstrate the SCD bug in the lab, concretely.

-- Load the base data
CREATE TABLE raw_customers     AS SELECT * FROM read_csv('/seed/raw_customers.csv', header=true);
CREATE TABLE raw_orders        AS SELECT * FROM read_csv('/seed/raw_orders.csv', header=true);
CREATE TABLE customer_changes  AS SELECT * FROM read_csv('/seed/customer_changes.csv', header=true);

-- Build the (current-state-only) dim, the natural mistake
CREATE TABLE dim_customer_v1 AS
SELECT customer_id AS customer_key, customer_name, region, segment
FROM   raw_customers;

-- A 'revenue by region' query. Looks innocent.
SELECT c.region, SUM(o.quantity * o.unit_price) AS revenue
FROM   raw_orders o
JOIN   dim_customer_v1 c ON c.customer_key = o.customer_id
GROUP  BY 1
ORDER  BY 2 DESC;

The result: revenue attributed to whatever region the customer is in today. Customers who moved are misattributed. The same customer's orders from Jan (when they were in EU) and Aug (when they were in APAC) are all attributed to APAC.

Now load the change log and find the customers who moved:

-- Find customers whose region changed during the year
SELECT customer_id, COUNT(*) AS region_changes
FROM   customer_changes
WHERE  column_name = 'region'
GROUP  BY 1
ORDER  BY 2 DESC
LIMIT 10;

For each of those customers, their full year of orders is currently being attributed to a single (current) region in the dashboard. That's the bug.

A quick sanity check on the scale:

-- How many orders are affected (placed before the customer's region change)?
SELECT COUNT(*) AS affected_orders
FROM   raw_orders o
JOIN   customer_changes c
  ON   c.customer_id = o.customer_id
 WHERE c.column_name = 'region'
   AND o.order_date  < c.changed_at;

If that's, say, 600 orders out of 5000, then 12% of your revenue-by-region report is on the wrong row. Most teams discover this not via a test but via a stakeholder asking why a number changed quarter-over-quarter "unexpectedly".

Aha: The phrase "the customer's region" hides a temporal ambiguity that SQL doesn't let you express. There is no syntactic way to say "the customer's region as of the order date" against a current-only dim, the join is silently as-is. SCD modeling exists specifically to make this temporal aspect expressible in the schema, so the join becomes obvious instead of misleading.

4. Your Turn

Exercise: Audit a dim for SCD risk.

List every column in raw_customers. For each, decide: SCD1, SCD2, or SCD3? Justify in one phrase each.
For the columns you marked SCD2, name one reporting question that would be silently wrong if you used a current-only dim.
Bonus: are there any columns where the right answer is "this shouldn't be on the dim at all"? (Hint: derived measures.)

5. Real-World Application

The single most common production SCD failure mode is one we'll meet in Lesson 8 explicitly, but it's worth previewing: a team uses SCD1 (overwrite) and discovers six months later that the CFO is comparing quarter-over-quarter numbers and the prior-quarter numbers are silently changing. The fix is to retrofit SCD2, which requires reconstructing history from source-system change logs, often imperfectly, and re-keying every fact table to the new versioned dim. This is a multi-month project that retroactive design discipline would have prevented.

The defensive practice: when in doubt, default to SCD2 on dims that participate in financial or regulatory reporting. Storage is cheap. Backfilling history isn't. The teams that absorb dim changes gracefully are the ones who SCD2'd region, segment, tier on day one and didn't have to retrofit later.

We'll spend the next three lessons on the mechanics of each type, then build up to the production anti-patterns. By the end of this course you'll be able to look at any dim and say "SCD2 these three columns, SCD1 these four, and customer_lifetime_value doesn't belong here at all", and that judgment will be the single most valuable modeling skill you carry forward.

6. Recap + Bridge

When a dimension attribute changes, say a customer moves region, it quietly corrupts every historical report unless you decide, on purpose, how to handle it. That decision is the SCD type. Next: Type 1, the overwrite, and exactly what history you throw away when you use it.