Module: Data Vault 2.0 | Duration: ~14 min | Lesson: 1 of 10

A telco onboards its 14th billing-system acquisition this decade. Each acquired company's billing system has its own concept of "customer", different IDs, different attributes, different update cadences, different retention rules.

The Kimball warehouse was built three years ago against five source systems. Adding a sixth required reshaping dim_customer to accommodate a new SCD axis. Adding the seventh required renaming columns. By the tenth, the team stopped adding fact tables and started running parallel warehouses per acquisition, the thing Kimball was supposed to prevent.

A consultant suggests rebuilding on Data Vault. The team groans ("another rewrite?") but listens because they have no idea how to onboard the 14th source without breaking the 13 that came before.

Vault's pitch: the model absorbs new sources without restructuring the existing ones. Hubs, links, and satellites are not a query-performance answer. They're an architecture of change.

2. Concept Explanation

What Vault is actually optimizing for

Kimball optimizes for query performance and analyst ergonomics. The star schema is shaped for the BI tool to write fast joins; the dim is shaped for an analyst to filter on a human-readable column.

Vault optimizes for two completely different things:

Source-system volatility absorption: when a source renames a column, splits a table, or adds a new attribute, Vault adds a new satellite without touching anything that already exists. No existing query breaks.
Auditability: every row carries the source system it came from, the load timestamp, and the hash of the business key. The warehouse can reconstruct "what did our customer record say on 2026-01-15 according to Salesforce vs SAP?", a question Kimball forces you to answer with SCD2 plus discipline plus luck.

The price you pay: the query layer is unusable raw. Analysts cannot query Vault tables directly without losing their minds. You must materialize a query layer on top (PIT tables, bridge tables, or a Kimball-style serving mart), Lesson 7 will cover this.

This is the most-misunderstood thing about Vault: Vault is not a competitor to Kimball at the consumption layer. Vault sits behind Kimball, feeding it. The competition is at the integration layer, where do you stage and integrate 14 source systems before they become a star?

The three Vault primitives, briefly

Vault decomposes every entity into three table types. We'll build each in Lessons 2-4; here's the one-paragraph version for orientation:

Hub = the business key only. hub_customer has columns (customer_hk, customer_business_key, load_ts, record_source). That's it. No attributes. Nothing changes about a hub except new rows getting added.
Link = a many-to-many relationship between hubs. link_customer_order carries (link_hk, customer_hk, order_hk, load_ts, record_source). Even if today the relationship is one-to-one (one order has one customer), the link models it as many-to-many because tomorrow's source might change that.
Satellite = the attributes for a hub or a link, with full history. sat_customer_details carries (customer_hk, load_ts, name, email, region, hash_diff, record_source). New row inserted whenever any attribute changes, never an update.

The decomposition feels wasteful ("one entity = three tables?") until you watch a new source land. Adding a 15th source's customer data is: insert any new business keys into the existing hub_customer, then add sat_customer_details_source15 (a new satellite, parallel to the existing 14, sharing the hub). Zero existing tables touched. Zero existing queries broken.

Why "insert-only" is the load story

Vault tables are insert-only. There is no UPDATE in canonical Vault. Every change is a new row with a new load_ts. This has three implications:

Load parallelism is trivial. Different sources load to different satellites; they never contend for the same row. dbt incrementals can run all sats in parallel without locking.
Audit is structural. "What did the warehouse know on 2026-03-15?" is a WHERE load_ts <= '2026-03-15' query, not a special process.
Storage grows. A chatty source that updates the same row 100x/day inflates the satellite. Vault practitioners use hash_diff columns (Lesson 5) to dedupe "no real change" updates, but the model still grows faster than Kimball SCD2 would.

When Vault is the right tool

Vault is the right tool when at least two of the following are true:

High source volatility: multiple source systems, frequent schema changes, mergers/acquisitions in your future.
Regulatory audit pressure: finance, healthcare, telecom, insurance; auditor asks "prove this data was correct on Q3 close day".
Multi-source identity resolution: same customer represented in 5+ systems with different IDs.
Long-term data retention requirements: 7+ year retention with full point-in-time queryability.

Vault is not the right tool when none of these apply. A startup with three sources and no compliance pressure adopting Vault is cosplay, you're paying the 3x table count and 5x load complexity for properties you don't need. We'll quantify this in Lesson 10.

Vault's relationship to Kimball

The most useful mental model: Vault is to integration what Kimball is to consumption. A mature warehouse runs both:

Source systems  →  Raw Vault  →  Business Vault  →  Star schema marts  →  BI tools
                   (per-source) (cross-source)    (Kimball)

Vault absorbs change at the front; Kimball serves queries at the back. Lesson 6 covers Raw vs Business Vault; Track 2.3 Lesson 5 covers the full layered architecture.

If you read a blog post claiming "we replaced our Kimball warehouse with Data Vault and everything's faster". They didn't. They either kept Kimball serving on top of Vault, or they have unhappy analysts. Both are common; neither is "Vault replaces Kimball".

3. Worked Example

Bring up the lab. If you haven't cloned the lab repo yet, do it now (one repo, shared across every Strata 7 course), then bring up this course's sandbox:

git clone https://github.com/petascalelabs/petascalelabs-lab-setup.git
cd petascalelabs-lab-setup/semantic-and-metrics-layer/modeling-alternatives-vault-obt/data-vault-2/
./scripts/setup.sh

Verify the seed tables are reachable:

./scripts/verify.sh
# expected: "Data Vault lab ready: 5 raw source tables (orders + CRM) in DuckDB"

Whenever a lesson says "open the lab", drop into an interactive DuckDB shell with the seed CSVs mounted read-only at /seed:

docker compose exec duckdb duckdb

You are helping me bring up a local DuckDB lab on my machine.

Course: Data Vault 2.0 (TheWorldShop scenarios).
Lab artifacts (already in petascalelabs-lab-setup/semantic-and-metrics-layer/modeling-alternatives-vault-obt/data-vault-2/):
  - docker-compose.yml: a single DuckDB container with the seed CSVs mounted read-only at /seed
  - data/: raw_customers.csv, raw_orders.csv, raw_change_log.csv, raw_crm_contacts.csv, raw_crm_accounts.csv (synthetic)
  - scripts/setup.sh, scripts/verify.sh, scripts/teardown.sh

My machine:
  OS: <fill in>
  RAM: <fill in GB>
  Docker version: <fill in>

Please walk me through:
1. Any OS-specific prerequisites (Rosetta, WSL2, virtualization flags) for my OS.
2. Confirming Docker is running with enough memory (>=2 GB is plenty here).
3. How to confirm ./scripts/verify.sh worked, and what to check if the container never reports healthy.
4. The teardown command when I want to reclaim disk.

Do not assume my OS, ask if it's not clear.

Let's see why our lab needs Vault.

Look at the two source systems we just loaded:

SELECT 'orders_system' AS src, customer_id::TEXT AS key, name, email, region
FROM raw_customers LIMIT 3
UNION ALL
SELECT 'crm_system',          contact_id,           full_name AS name, primary_email AS email, country AS region
FROM raw_crm_contacts LIMIT 3;

Result (illustrative):

src	key	name	email	region
orders_system	142	Mae Park	mae@example.com	us-west
orders_system	143	Lin Chen	lin@example.com	apac
orders_system	144	Anil Roy	anil@example.com	eu
crm_system	C-7281	Mae M. Park	mae@example.com	United States
crm_system	C-7282	Linda Chen	linda.c@corp.example	apac
crm_system	C-7283	(null)	anil@example.com	EU

Seven observations from six rows:

The two systems use different business keys: integer customer_id vs string contact_id.
Names disagree: "Mae Park" vs "Mae M. Park"; "Lin Chen" vs "Linda Chen".
Emails sometimes match, sometimes don't: Lin's email is different across systems.
Region encoding differs: "us-west" vs "United States", "eu" vs "EU".
Nullability differs: CRM has a null name where orders doesn't.
Counts differ: CRM has ~620 rows, orders has ~500. Some CRM contacts aren't yet customers; some customers were never in the CRM.
No single source is authoritative. Orders is authoritative for last_order_date. CRM is authoritative for account_owner and b2b_account_id. The truth is the union.

A Kimball dim_customer has to pick, which name wins, which region encoding, which IDs survive? Every pick is a fight between the teams that own the two sources, and the fight is permanent because reversing the decision means rewriting history.

A Vault approach:

One hub_customer keyed on a business key (we'll use email as a starting point, fix it in Lesson 8 with same-as links).
One sat_customer_from_orders carrying name/region/etc. as seen by the orders system.
One sat_customer_from_crm carrying name/region/etc. as seen by the CRM.

Downstream, the serving star schema's dim_customer chooses (e.g., "name from CRM if not null, else orders") and the Vault preserves both raw histories forever. When the CRM team renames full_name to display_name, that's a new satellite (sat_customer_from_crm_v2), the existing satellite stays, existing queries keep working, the choice of which to read is a serving-layer decision.

We'll build the first hub in Lesson 2. For now, sketch the conceptual model:

hub_customer (one row per distinct customer-business-key, ever seen)
  ├─ sat_customer_from_orders (history of attributes as told by orders)
  ├─ sat_customer_from_crm    (history of attributes as told by CRM)
  └─ link_customer_order      (relationship: which customer placed which order)
     └─ sat_order_details_from_orders

Four tables to model a customer-and-order universe that Kimball would model in two (dim_customer, fact_sale). The price of absorbing source change.

Aha: Data Vault is not a query model. It is a change-absorption model. The reason you split one customer into a hub + N satellites is not that hubs are queryable; it's that the next source system you onboard adds a satellite without breaking the previous N satellites or any query that reads them. Vault is the architecture you pick when you know your sources will keep changing and your existing dashboards must never blink.

4. Your Turn

Exercise: Decide whether Vault is the right tool for three scenarios.

For each, write one sentence: would you recommend Data Vault, Kimball, or a hybrid, and why?

A 12-person fintech startup with one Postgres production database and one Stripe integration. Compliance: SOC 2 in progress. Analytics: 6 dashboards in Looker.
A regional bank with 9 legacy core-banking systems (3 mainframe COBOL, 4 different SQL Servers, 2 SaaS), a regulator who requires 10-year auditability of customer attribute history, and a recent acquisition of a fintech that will bring system #10.
A 200-person B2B SaaS with one Snowflake-based data lake, three operational sources (Postgres, Salesforce, HubSpot), and a dbt-based Kimball warehouse that works fine but takes 2 weeks to onboard a new source.

5. Real-World Application

Data Vault was created by Dan Linstedt in the 1990s for U.S. Department of Defense intelligence integration, explicitly designed for the case where sources are unstable, audit is non-negotiable, and the cost of breaking a downstream query is measured in real-world consequences. The DoD origin story explains a lot of the design: insert-only loads, hash-key independence, structural audit. The model isn't about speed, it's about not lying when a source changes underneath you.

The industries that adopt Vault most aggressively today are the ones whose pain matches DoD's: banking, insurance, telecom, healthcare. A typical fact pattern: a multi-decade-old company has accumulated 10–50 source systems through acquisitions and platform migrations. They have a regulator who can subpoena "the state of the customer database on March 15, 2018". They have multiple ETL teams who cannot coordinate schema changes. Kimball can be made to work here, but every new source costs months of dim_customer reshaping and an angst-filled regression test. Vault costs more tables and more loaders, but the per-source onboarding cost is bounded, and the auditor query is one timestamp filter.

A cautionary counterpoint: Vault is over-marketed. There is an active consulting industry that sells Vault as the answer to every warehouse problem, and the literature can be self-reinforcing. Most modern data teams (especially those born after 2015, on cloud warehouses, with ~5 sources) genuinely do not need it. The Snowflake/BigQuery era's cheap storage and elastic compute means the cost problem Vault was designed for (storage was expensive in the 90s, so Kimball's denormalization felt extravagant) is now mostly solved by columnar engines. What remains is the integration problem, which Vault solves elegantly, but only when integration is your dominant pain.

A fair summary: if you read about Vault and feel "this is overengineered for me", you're probably right. If you read about Vault and feel "this is exactly the architecture for the chaos I'm drowning in", you're also probably right. Vault is one of the rare architectural choices where the first instinct of an experienced engineer is usually correct, and where the literature's enthusiasm should be calibrated against the size of the problem you actually have.

6. Recap + Bridge

Data Vault exists for environments with many volatile sources and hard audit requirements, where Kimball's upfront conformance is too rigid. It trades query simplicity for load flexibility and full lineage. Next: the foundation of the model, hubs that store business keys and nothing else.