Module: Setup | Duration: ~7 min | Lesson: 0 of 6
The Course 1.1 / 1.2 lab — extended with two additional fact tables that each reference customers and dates but were built by different teams in slightly different ways. You'll be the engineer tasked with conforming them.
The new tables:
fact_support_ticket — built by the support team, customer joined by email, date stored as created_at TIMESTAMP.
fact_web_session — built by the analytics team, customer joined by web_user_id (different surrogate, namespace-conflict-y), date stored as a millisecond epoch.
These are the kind of fact tables every real company accumulates as different teams ship their own pipelines. Course 1.3 is about reconciling them.
- Course 1.1 lab (DuckDB + base seed) is running.
- ~30 MB additional disk for the new fact CSVs.
No new tooling.
cd ~/s7-lab
curl -L -o fact_support_ticket.csv https://github.com/data-learning-course/s7-seed/releases/download/v1/fact_support_ticket.csv
curl -L -o fact_web_session.csv https://github.com/data-learning-course/s7-seed/releases/download/v1/fact_web_session.csv
mv fact_support_ticket.csv fact_web_session.csv seed/
ls seed/fact_*
Fallback: seed/gen_aux_facts.py synthesizes both from the existing data.
docker run --rm -it -v $(pwd)/seed:/seed datacatering/duckdb:latest -c \
"CREATE TABLE raw_support AS SELECT * FROM read_csv('/seed/fact_support_ticket.csv', header=true);
CREATE TABLE raw_sessions AS SELECT * FROM read_csv('/seed/fact_web_session.csv', header=true);
SELECT 'support' AS table, COUNT(*) FROM raw_support
UNION ALL
SELECT 'sessions', COUNT(*) FROM raw_sessions;"
Expected output: ~1200 support tickets, ~8000 sessions.
I'm setting up the lab environment for the "Conformed Dimensions & the Bus Matrix" course on data-learning. I already have the base lab from Course 1.1 (DuckDB-in-Docker + seed CSVs) plus the change log from Course 1.2.
This course adds two fact CSVs to seed/: fact_support_ticket.csv (~1200 rows, customer keyed by email, timestamp in ISO format) and fact_web_session.csv (~8000 rows, customer keyed by web_user_id, timestamp in millisecond epoch).
Verification: both tables load and have plausible row counts.
My machine:
- OS: <I will fill in>
- Existing tools: <I will fill in>
Walk me through downloading and verifying. If the download fails, help me synthesize equivalent data from raw_customers and dim_date using SQL — describe how to generate ~1200 ticket events with email keys and ~8000 session events with synthetic web_user_id keys (~5% overlap with our customer set).