Module: Setup | Duration: ~8 min | Lesson: 0 of 10
A local Data Vault sandbox in DuckDB-on-Docker. You'll start from the same raw_customers, raw_orders, and raw_change_log CSVs used in Track 7.1, plus two new raw extracts from a second source system (raw_crm_contacts.csv, raw_crm_accounts.csv) so you have something realistic to integrate. Vault's whole point is multi-source absorption — a single-source lab can't show why Vault exists.
By the end you'll have:
- DuckDB-in-Docker running interactively.
- 5 raw source CSVs loaded as
raw_* tables.
- A
vault/ SQL directory pre-wired with empty hub/link/sat DDLs you'll fill in during the course.
- Docker Desktop ≥ 4.x (or any Docker engine).
- A shell (bash/zsh/PowerShell).
- ~80 MB free disk.
- macOS / Linux / Windows (WSL2).
No Python, no JVM, no warehouse account.
mkdir -p ~/s7-vault-lab/seed ~/s7-vault-lab/vault
cd ~/s7-vault-lab
curl -L -o seed/raw_customers.csv https://github.com/data-learning-course/s7-seed/releases/download/v1/raw_customers.csv
curl -L -o seed/raw_orders.csv https://github.com/data-learning-course/s7-seed/releases/download/v1/raw_orders.csv
curl -L -o seed/raw_change_log.csv https://github.com/data-learning-course/s7-seed/releases/download/v1/raw_change_log.csv
curl -L -o seed/raw_crm_contacts.csv https://github.com/data-learning-course/s7-seed/releases/download/v1/raw_crm_contacts.csv
curl -L -o seed/raw_crm_accounts.csv https://github.com/data-learning-course/s7-seed/releases/download/v1/raw_crm_accounts.csv
- Open Ubuntu (WSL2) shell.
- Run the macOS/Linux commands above verbatim.
If the download fails, the fallback synthesizer at seed/gen_vault_seed.py (in the course repo) reproduces the CSVs from a fixed seed — instructions are in the Copy Prompt below.
docker run --rm -it -v $(pwd)/seed:/seed datacatering/duckdb:latest -c \
"CREATE TABLE raw_customers AS SELECT * FROM read_csv('/seed/raw_customers.csv', header=true);
CREATE TABLE raw_orders AS SELECT * FROM read_csv('/seed/raw_orders.csv', header=true);
CREATE TABLE raw_crm_contacts AS SELECT * FROM read_csv('/seed/raw_crm_contacts.csv', header=true);
CREATE TABLE raw_crm_accounts AS SELECT * FROM read_csv('/seed/raw_crm_accounts.csv', header=true);
SELECT 'customers' AS tbl, COUNT(*) FROM raw_customers UNION ALL
SELECT 'orders', COUNT(*) FROM raw_orders UNION ALL
SELECT 'crm_contacts', COUNT(*) FROM raw_crm_contacts UNION ALL
SELECT 'crm_accounts', COUNT(*) FROM raw_crm_accounts;"
Expected output (counts approximate):
customers 500
orders 4200
crm_contacts 620
crm_accounts 410
Note the crm_contacts row count is higher than customers — the CRM has people the order system doesn't, and vice versa. That mismatch is the whole reason Vault exists.
I'm setting up the lab for the "Data Vault 2.0" course on data-learning. The lab needs DuckDB-in-Docker with five seed CSVs in seed/: raw_customers.csv (~500 rows), raw_orders.csv (~4200 rows), raw_change_log.csv (a customer-attribute change history), raw_crm_contacts.csv (~620 rows of contacts from a second source CRM), raw_crm_accounts.csv (~410 rows of B2B accounts).
The verification is: run a DuckDB container that mounts seed/, load all five CSVs, and confirm row counts roughly match the expected numbers (the CRM contact count should slightly exceed the customer count — that's intentional, it represents identity mismatch across source systems).
My machine:
- OS: <I will fill in>
- RAM: <I will fill in>
- Existing tools (Docker? Python? jq?): <I will fill in>
Walk me through downloading, mounting, and verifying. If the curl downloads fail (network blocked, GitHub down), help me synthesize equivalent CSVs locally — describe a 60-line Python script that generates 500 customers with email/name/region, 4200 orders linking to those customers, a CRM contacts table with ~80% overlap on email (some new, some missing), and a CRM accounts table for the B2B subset. Do not include secrets or API keys.