Lab Setup

Module: Setup | Duration: ~10 min | Lesson: 0 of 10


1. What You'll Build

A local lab for working through dimensional modeling without standing up a warehouse. You'll run DuckDB inside Docker against a small, deliberately messy retail dataset (orders, customers, products, dates). DuckDB suits this course for two reasons: nothing you learn here depends on engine-specific features, so every concept ports cleanly to Snowflake, BigQuery, ClickHouse, or Postgres, and it is small enough to run on a laptop.

At the end of this lesson you should be able to:

  • Open a SQL shell into DuckDB and run SELECT * FROM raw_orders LIMIT 5.
  • See the four raw tables (raw_orders, raw_customers, raw_products, raw_dates) loaded from CSV.
  • Be ready to model them — that's what Lessons 1–10 are about.

2. Prerequisites

  • ~2 GB free RAM and 1 GB disk
  • Docker Desktop ≥ 4.x (macOS/Windows) or Docker Engine + Compose (Linux)
  • A shell: zsh or bash (on Windows, use the WSL2 shell; the commands in this lesson are not written for PowerShell)
  • Optional: a SQL client you like (DBeaver, TablePlus). The course itself uses DuckDB's built-in CLI.

No cloud account, no warehouse, no credit card.
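
If you'd rather check the tooling before you start, a small shell helper can do it. This is a minimal sketch, not part of the course materials: missing_tools is a hypothetical name, and the tools to check (docker, curl, unzip) are simply the commands used in the installation steps of this lesson.

```shell
# missing_tools TOOL...: print each tool that is not on PATH, one per line.
# Hypothetical helper for illustration; it only checks PATH, not versions.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

After pasting the function into your shell, missing_tools docker curl unzip prints nothing when all three are on your PATH, and otherwise prints the names you still need to install.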


3. Installation

macOS

  1. Install Docker Desktop from https://www.docker.com/products/docker-desktop, launch it, wait for the whale icon to settle.
  2. Make a lab directory and pull the image:
    mkdir -p ~/s7-lab && cd ~/s7-lab
    docker pull datacatering/duckdb:latest
    
  3. Grab the seed CSVs (a tiny synthetic retail dataset shipped with this course):
    curl -L -o seed.zip https://github.com/data-learning-course/s7-seed/releases/download/v1/seed.zip
    unzip seed.zip
    ls seed/
    # raw_customers.csv raw_dates.csv raw_orders.csv raw_products.csv
    
    If that URL is unreachable, generate equivalent data with the fallback script in seed/gen.py (included).
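
To confirm the archive unpacked correctly, you can check for the four file names listed above. The helper below is a sketch under that assumption; missing_seed_files is a hypothetical name, not something shipped in seed/.

```shell
# missing_seed_files DIR: print the name of each expected CSV that is absent from DIR.
# The four names come from the ls output shown in this lesson.
missing_seed_files() {
  for name in raw_orders raw_customers raw_products raw_dates; do
    [ -f "$1/$name.csv" ] || printf '%s.csv\n' "$name"
  done
}
```

From ~/s7-lab, missing_seed_files seed prints nothing if all four CSVs are present, and otherwise lists the ones that are missing.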

Linux

  1. sudo apt-get install -y docker.io (or your distro's equivalent), then sudo systemctl enable --now docker.
  2. Add your user to the docker group so you don't need sudo: sudo usermod -aG docker $USER && newgrp docker. newgrp only affects the current shell; log out and back in to apply the group everywhere.
  3. Run the same docker pull, curl, and unzip steps as macOS (steps 2 and 3 there).
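
One way to confirm the docker group change took effect is to ask the Docker daemon for its status without sudo. A minimal sketch: docker_ready is a hypothetical helper name, and docker info is used only because it exits non-zero when the daemon is unreachable or permission is denied.

```shell
# docker_ready: succeed if the current user can talk to the Docker daemon without sudo.
# Hypothetical helper; it does not distinguish a stopped service from a permissions problem.
docker_ready() {
  if docker info >/dev/null 2>&1; then
    echo "docker: ok"
  else
    echo "docker: cannot reach the daemon as $(id -un); check the service and your group membership" >&2
    return 1
  fi
}
```

If docker_ready fails right after usermod, the usual cause is a shell that has not picked up the new group yet.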

Windows (WSL2)

  1. Install WSL2 with Ubuntu, then Docker Desktop with the WSL2 backend.
  2. From a WSL2 shell, follow macOS steps 2 and 3 verbatim; paths, curl, and docker all work the same.

4. Verify Your Setup

From inside ~/s7-lab, start a DuckDB shell with the seed CSVs mounted:

docker run --rm -it -v "$(pwd)/seed:/seed" datacatering/duckdb:latest \
  -c "CREATE TABLE raw_orders AS SELECT * FROM read_csv('/seed/raw_orders.csv', header=true); SELECT COUNT(*) AS n FROM raw_orders;"

Expected output:

┌───────┐
│   n   │
│ int64 │
├───────┤
│  5000 │
└───────┘

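If you see 5000, the lab is ready. If you instead get an error from read_csv saying it cannot find /seed/raw_orders.csv, the volume mount is wrong: either you are not running the command from inside ~/s7-lab, or the host path did not resolve as expected, which is the most common Windows/WSL pitfall.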
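
If you want the same check as a pass/fail script, something like the following could work. It is a sketch with assumptions: count_raw_orders and lab_ready are hypothetical names, the image is assumed to start the DuckDB CLI as its entrypoint (as the command above implies), and -csv and -noheader are DuckDB CLI output flags used here to print the bare number instead of a table.

```shell
# count_raw_orders SEED_DIR: read raw_orders.csv in a throwaway container and print its row count.
# SEED_DIR must be an absolute host path. No -it flags, since nothing here is interactive.
count_raw_orders() {
  docker run --rm -v "$1:/seed" datacatering/duckdb:latest -csv -noheader \
    -c "SELECT COUNT(*) FROM read_csv('/seed/raw_orders.csv', header=true);"
}

# lab_ready SEED_DIR: succeed only if the count matches the 5000 rows this lesson expects.
lab_ready() {
  n=$(count_raw_orders "$1") || return 1
  [ "$n" = "5000" ]
}
```

Call it with an absolute path, for example lab_ready "$(pwd)/seed" && echo ready. A failure means either the container could not read the file or the row count differs from 5000.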

Throughout the course, when a lesson asks you to "open the lab", run:

docker run --rm -it -v "$(pwd)/seed:/seed" datacatering/duckdb:latest

Each container starts with an empty in-memory database, so at the D prompt, recreate the four tables (or save them to a persistent .duckdb file; see the README in seed/).
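
Retyping four CREATE TABLE statements every session gets tedious, so one option is to generate them once into a file and replay it. A minimal sketch, assuming the same read_csv call used in the verification step; write_load_sql and load.sql are hypothetical names, not part of the seed archive.

```shell
# write_load_sql DIR: write DIR/load.sql with one CREATE TABLE statement per raw table.
# Paths inside the SQL use /seed, the mount point the lab container sees.
write_load_sql() {
  for t in raw_orders raw_customers raw_products raw_dates; do
    printf "CREATE OR REPLACE TABLE %s AS SELECT * FROM read_csv('/seed/%s.csv', header=true);\n" "$t" "$t"
  done > "$1/load.sql"
}
```

Run write_load_sql seed from ~/s7-lab, open the lab, and at the D prompt type .read /seed/load.sql. The file is visible inside the container because seed/ is the mounted directory.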


5. Copy Prompt

I'm setting up the lab environment for the "Dimensional Modeling Fundamentals" course on data-learning. Here's what the course expects me to have running locally:

- DuckDB latest, running inside a Docker container
- A `seed/` directory containing four CSVs: raw_orders.csv, raw_customers.csv, raw_products.csv, raw_dates.csv
- The ability to run an interactive DuckDB shell with the seed directory mounted at /seed
- The verification query `SELECT COUNT(*) FROM raw_orders` returning 5000

My machine:
- OS: <I will fill in>
- RAM: <I will fill in>
- Existing tools: <I will fill in>

Walk me through the install step-by-step for my OS. If a step fails, diagnose from the error message before suggesting reinstalls. At the end, give me the exact verification command and the expected output so I know the lab is ready.