Module: Data Vault 2.0 | Duration: ~14 min | Lesson: 1 of 10
A telco onboards its 14th billing-system acquisition this decade. Each acquired company's billing system has its own concept of "customer" — different IDs, different attributes, different update cadences, different retention rules.
The Kimball warehouse was built three years ago against five source systems. Adding a sixth required reshaping dim_customer to accommodate a new SCD axis. Adding the seventh required renaming columns. By the tenth, the team stopped adding fact tables and started running parallel warehouses per acquisition — the thing Kimball was supposed to prevent.
A consultant suggests rebuilding on Data Vault. The team groans ("another rewrite?") but listens because they have no idea how to onboard the 14th source without breaking the 13 that came before.
Vault's pitch: the model absorbs new sources without restructuring the existing ones. Hubs, links, and satellites are not a query-performance answer. They're an architecture of change.
2. Concept Explanation
What Vault is actually optimizing for
Kimball optimizes for query performance and analyst ergonomics. The star schema is shaped for the BI tool to write fast joins; the dim is shaped for an analyst to filter on a human-readable column.
Vault optimizes for two completely different things:
- Source-system volatility absorption — when a source renames a column, splits a table, or adds a new attribute, Vault adds a new satellite without touching anything that already exists. No existing query breaks.
- Auditability — every row carries the source system it came from, the load timestamp, and the hash of the business key. The warehouse can reconstruct "what did our customer record say on 2026-01-15 according to Salesforce vs SAP?" — a question Kimball forces you to answer with SCD2 plus discipline plus luck.
The price you pay: the query layer is unusable raw. Analysts cannot query Vault tables directly without losing their minds. You must materialize a query layer on top (PIT tables, bridge tables, or a Kimball-style serving mart) — Lesson 7 will cover this.
This is the most-misunderstood thing about Vault: Vault is not a competitor to Kimball at the consumption layer. Vault sits behind Kimball, feeding it. The competition is at the integration layer — where do you stage and integrate 14 source systems before they become a star?
The three Vault primitives, briefly
Vault decomposes every entity into three table types. We'll build each in Lessons 2-4; here's the one-paragraph version for orientation:
- Hub = the business key only.
hub_customerhas columns(customer_hk, customer_business_key, load_ts, record_source). That's it. No attributes. Nothing changes about a hub except new rows getting added. - Link = a many-to-many relationship between hubs.
link_customer_ordercarries(link_hk, customer_hk, order_hk, load_ts, record_source). Even if today the relationship is one-to-one (one order has one customer), the link models it as many-to-many because tomorrow's source might change that. - Satellite = the attributes for a hub or a link, with full history.
sat_customer_detailscarries(customer_hk, load_ts, name, email, region, hash_diff, record_source). New row inserted whenever any attribute changes — never an update.
The decomposition feels wasteful ("one entity = three tables?") until you watch a new source land. Adding a 15th source's customer data is: insert any new business keys into the existing hub_customer, then add sat_customer_details_source15 (a new satellite, parallel to the existing 14, sharing the hub). Zero existing tables touched. Zero existing queries broken.
Why "insert-only" is the load story
Vault tables are insert-only. There is no UPDATE in canonical Vault. Every change is a new row with a new load_ts. This has three implications:
- Load parallelism is trivial. Different sources load to different satellites; they never contend for the same row. dbt incrementals can run all sats in parallel without locking.
- Audit is structural. "What did the warehouse know on 2026-03-15?" is a
WHERE load_ts <= '2026-03-15'query, not a special process. - Storage grows. A chatty source that updates the same row 100x/day inflates the satellite. Vault practitioners use
hash_diffcolumns (Lesson 5) to dedupe "no real change" updates — but the model still grows faster than Kimball SCD2 would.
When Vault is the right tool
Vault is the right tool when at least two of the following are true:
- High source volatility — multiple source systems, frequent schema changes, mergers/acquisitions in your future.
- Regulatory audit pressure — finance, healthcare, telecom, insurance; auditor asks "prove this data was correct on Q3 close day".
- Multi-source identity resolution — same customer represented in 5+ systems with different IDs.
- Long-term data retention requirements — 7+ year retention with full point-in-time queryability.
Vault is not the right tool when none of these apply. A startup with three sources and no compliance pressure adopting Vault is cosplay — you're paying the 3x table count and 5x load complexity for properties you don't need. We'll quantify this in Lesson 10.
Vault's relationship to Kimball
The most useful mental model: Vault is to integration what Kimball is to consumption. A mature warehouse runs both:
Vault absorbs change at the front; Kimball serves queries at the back. Lesson 6 covers Raw vs Business Vault; Track 2.3 Lesson 5 covers the full layered architecture.
If you read a blog post claiming "we replaced our Kimball warehouse with Data Vault and everything's faster" — they didn't. They either kept Kimball serving on top of Vault, or they have unhappy analysts. Both are common; neither is "Vault replaces Kimball".
3. Worked Example
Let's see why our lab needs Vault.
Look at the two source systems we just loaded:
Result (illustrative):
| src | key | name | region | |
|---|---|---|---|---|
| orders_system | 142 | Mae Park | mae@example.com | us-west |
| orders_system | 143 | Lin Chen | lin@example.com | apac |
| orders_system | 144 | Anil Roy | anil@example.com | eu |
| crm_system | C-7281 | Mae M. Park | mae@example.com | United States |
| crm_system | C-7282 | Linda Chen | linda.c@corp.example | apac |
| crm_system | C-7283 | (null) | anil@example.com | EU |
Seven observations from six rows:
- The two systems use different business keys — integer
customer_idvs stringcontact_id. - Names disagree — "Mae Park" vs "Mae M. Park"; "Lin Chen" vs "Linda Chen".
- Emails sometimes match, sometimes don't — Lin's email is different across systems.
- Region encoding differs — "us-west" vs "United States", "eu" vs "EU".
- Nullability differs — CRM has a null name where orders doesn't.
- Counts differ — CRM has ~620 rows, orders has ~500. Some CRM contacts aren't yet customers; some customers were never in the CRM.
- No single source is authoritative. Orders is authoritative for
last_order_date. CRM is authoritative foraccount_ownerandb2b_account_id. The truth is the union.
A Kimball dim_customer has to pick — which name wins, which region encoding, which IDs survive? Every pick is a fight between the teams that own the two sources, and the fight is permanent because reversing the decision means rewriting history.
A Vault approach:
- One
hub_customerkeyed on a business key (we'll use email as a starting point, fix it in Lesson 8 with same-as links). - One
sat_customer_from_orderscarrying name/region/etc. as seen by the orders system. - One
sat_customer_from_crmcarrying name/region/etc. as seen by the CRM.
Downstream, the serving star schema's dim_customer chooses (e.g., "name from CRM if not null, else orders") and the Vault preserves both raw histories forever. When the CRM team renames full_name to display_name, that's a new satellite (sat_customer_from_crm_v2) — the existing satellite stays, existing queries keep working, the choice of which to read is a serving-layer decision.
We'll build the first hub in Lesson 2. For now, sketch the conceptual model:
Four tables to model a customer-and-order universe that Kimball would model in two (dim_customer, fact_sale). The price of absorbing source change.
Aha: Data Vault is not a query model — it is a change-absorption model. The reason you split one customer into a hub + N satellites is not that hubs are queryable; it's that the next source system you onboard adds a satellite without breaking the previous N satellites or any query that reads them. Vault is the architecture you pick when you know your sources will keep changing and your existing dashboards must never blink.
4. Your Turn
Exercise: Decide whether Vault is the right tool for three scenarios.
For each, write one sentence: would you recommend Data Vault, Kimball, or a hybrid — and why?
- A 12-person fintech startup with one Postgres production database and one Stripe integration. Compliance: SOC 2 in progress. Analytics: 6 dashboards in Looker.
- A regional bank with 9 legacy core-banking systems (3 mainframe COBOL, 4 different SQL Servers, 2 SaaS), a regulator who requires 10-year auditability of customer attribute history, and a recent acquisition of a fintech that will bring system #10.
- A 200-person B2B SaaS with one Snowflake-based data lake, three operational sources (Postgres, Salesforce, HubSpot), and a dbt-based Kimball warehouse that works fine but takes 2 weeks to onboard a new source.
5. Real-World Application
Data Vault was created by Dan Linstedt in the 1990s for U.S. Department of Defense intelligence integration — explicitly designed for the case where sources are unstable, audit is non-negotiable, and the cost of breaking a downstream query is measured in real-world consequences. The DoD origin story explains a lot of the design: insert-only loads, hash-key independence, structural audit. The model isn't about speed — it's about not lying when a source changes underneath you.
The industries that adopt Vault most aggressively today are the ones whose pain matches DoD's: banking, insurance, telecom, healthcare. A typical fact pattern: a multi-decade-old company has accumulated 10–50 source systems through acquisitions and platform migrations. They have a regulator who can subpoena "the state of the customer database on March 15, 2018". They have multiple ETL teams who cannot coordinate schema changes. Kimball can be made to work here, but every new source costs months of dim_customer reshaping and an angst-filled regression test. Vault costs more tables and more loaders, but the per-source onboarding cost is bounded — and the auditor query is one timestamp filter.
A cautionary counterpoint: Vault is over-marketed. There is an active consulting industry that sells Vault as the answer to every warehouse problem, and the literature can be self-reinforcing. Most modern data teams (especially those born after 2015, on cloud warehouses, with ~5 sources) genuinely do not need it. The Snowflake/BigQuery era's cheap storage and elastic compute means the cost problem Vault was designed for (storage was expensive in the 90s, so Kimball's denormalization felt extravagant) is now mostly solved by columnar engines. What remains is the integration problem, which Vault solves elegantly — but only when integration is your dominant pain.
A fair summary: if you read about Vault and feel "this is overengineered for me", you're probably right. If you read about Vault and feel "this is exactly the architecture for the chaos I'm drowning in", you're also probably right. Vault is one of the rare architectural choices where the first instinct of an experienced engineer is usually correct — and where the literature's enthusiasm should be calibrated against the size of the problem you actually have.