Lab Setup

Module: Setup | Duration: ~8 min | Lesson: 0 of 10


1. What You'll Build

A local Data Vault sandbox in DuckDB-on-Docker. You'll start from the same raw_customers, raw_orders, and raw_change_log CSVs used in Track 7.1, plus two new raw extracts from a second source system (raw_crm_contacts.csv, raw_crm_accounts.csv) so you have something realistic to integrate. Vault's whole point is multi-source absorption — a single-source lab can't show why Vault exists.

By the end you'll have:

  • DuckDB-in-Docker running interactively.
  • 5 raw source CSVs loaded as raw_* tables.
  • A vault/ SQL directory pre-wired with empty hub/link/sat DDLs you'll fill in during the course.

2. Prerequisites

  • Docker Desktop ≥ 4.x (or any Docker engine).
  • A shell (bash/zsh/PowerShell).
  • ~80 MB free disk.
  • macOS / Linux / Windows (WSL2).

No Python, no JVM, no warehouse account.


3. Installation

macOS / Linux

mkdir -p ~/s7-vault-lab/seed ~/s7-vault-lab/vault
cd ~/s7-vault-lab

# Reuse Track 7.1 CSVs if you have them; otherwise pull a fresh copy.
curl -L -o seed/raw_customers.csv     https://github.com/data-learning-course/s7-seed/releases/download/v1/raw_customers.csv
curl -L -o seed/raw_orders.csv        https://github.com/data-learning-course/s7-seed/releases/download/v1/raw_orders.csv
curl -L -o seed/raw_change_log.csv    https://github.com/data-learning-course/s7-seed/releases/download/v1/raw_change_log.csv
curl -L -o seed/raw_crm_contacts.csv  https://github.com/data-learning-course/s7-seed/releases/download/v1/raw_crm_contacts.csv
curl -L -o seed/raw_crm_accounts.csv  https://github.com/data-learning-course/s7-seed/releases/download/v1/raw_crm_accounts.csv

Windows (WSL2)

  1. Open Ubuntu (WSL2) shell.
  2. Run the macOS/Linux commands above verbatim.

If the download fails, the fallback synthesizer at seed/gen_vault_seed.py (in the course repo) reproduces the CSVs from a fixed seed — instructions are in the Copy Prompt below.


4. Verify Your Setup

docker run --rm -it -v $(pwd)/seed:/seed datacatering/duckdb:latest -c \
  "CREATE TABLE raw_customers   AS SELECT * FROM read_csv('/seed/raw_customers.csv',    header=true);
   CREATE TABLE raw_orders      AS SELECT * FROM read_csv('/seed/raw_orders.csv',       header=true);
   CREATE TABLE raw_crm_contacts AS SELECT * FROM read_csv('/seed/raw_crm_contacts.csv', header=true);
   CREATE TABLE raw_crm_accounts AS SELECT * FROM read_csv('/seed/raw_crm_accounts.csv', header=true);
   SELECT 'customers'    AS tbl, COUNT(*) FROM raw_customers   UNION ALL
   SELECT 'orders',              COUNT(*) FROM raw_orders      UNION ALL
   SELECT 'crm_contacts',        COUNT(*) FROM raw_crm_contacts UNION ALL
   SELECT 'crm_accounts',        COUNT(*) FROM raw_crm_accounts;"

Expected output (counts approximate):

customers     500
orders        4200
crm_contacts  620
crm_accounts  410

Note the crm_contacts row count is higher than customers — the CRM has people the order system doesn't, and vice versa. That mismatch is the whole reason Vault exists.


5. Copy Prompt

I'm setting up the lab for the "Data Vault 2.0" course on data-learning. The lab needs DuckDB-in-Docker with five seed CSVs in seed/: raw_customers.csv (~500 rows), raw_orders.csv (~4200 rows), raw_change_log.csv (a customer-attribute change history), raw_crm_contacts.csv (~620 rows of contacts from a second source CRM), raw_crm_accounts.csv (~410 rows of B2B accounts).

The verification is: run a DuckDB container that mounts seed/, load all five CSVs, and confirm row counts roughly match the expected numbers (the CRM contact count should slightly exceed the customer count — that's intentional, it represents identity mismatch across source systems).

My machine:
- OS: <I will fill in>
- RAM: <I will fill in>
- Existing tools (Docker? Python? jq?): <I will fill in>

Walk me through downloading, mounting, and verifying. If the curl downloads fail (network blocked, GitHub down), help me synthesize equivalent CSVs locally — describe a 60-line Python script that generates 500 customers with email/name/region, 4200 orders linking to those customers, a CRM contacts table with ~80% overlap on email (some new, some missing), and a CRM accounts table for the B2B subset. Do not include secrets or API keys.