Module: Choosing a Modeling Style | Duration: ~12 min | Lesson: 1 of 6
A senior data engineer joins a new company. On day three, the CTO asks: "Should we move to a Data Vault architecture? I've been reading about it."
The engineer's answer determines the next three months of work. They could nod and start designing hubs. They could push back hard and recommend Kimball-with-discipline. They could propose a hybrid layered architecture. They could even recommend OBT-only on Snowflake.
The right answer depends on four inputs the engineer hasn't been told yet:
- Team — who's going to build it, who's going to use it, what skills exist.
- Sources — how many, how stable, how chaotic.
- Engine — Snowflake, BigQuery, ClickHouse, Postgres, something else.
- Change rate — how often do business definitions move, how often do sources break, how often does the warehouse need re-shaping.
Decide without these inputs and you're picking by fashion. Decide with them and the choice is usually obvious — or, when it isn't, the uncertainty itself is informative.
This lesson formalizes the four inputs and shows how to extract them in a 30-minute conversation. The rest of the course applies them.
2. Concept Explanation
Input 1: Team
Three sub-dimensions:
- Headcount. A 2-person data team has fundamentally different options than a 50-person team. Vault requires multiple full-time engineers; a 2-person team adopting Vault is choosing to spend their entire bandwidth on infrastructure rather than analysis.
- Skill mix. Are your engineers SQL-native? Python-native? Spark-native? dbt-native? The closer your team is to dbt, the smoother the dbt-SL / dbt-Vault / dbt-OBT paths. Less dbt familiarity raises the cost of every modern modeling style.
- Consumers. Who queries the warehouse? Analysts? PMs? ML engineers? Self-serve business users? Each consumer profile has a preferred shape (analysts can handle star; PMs need OBT; ML needs wide feature tables).
A 5-engineer team with strong SQL skills serving 30 analysts has very different options than a 20-engineer team serving 500 business users in Tableau. Both might be "medium" data orgs by headcount — but the modeling answer differs.
Input 2: Sources
Three sub-dimensions:
- Count. Single Postgres + Stripe is 2 sources. A bank with 14 acquired billing systems is 14. Count is the first driver of Vault's relevance: as a rule of thumb, Vault's break-even is around 10-15 sources.
- Volatility. Do sources change their schemas? A Postgres OLTP under your own engineering team's control rarely surprises you. An external SaaS API surprises you whenever the vendor ships a release. A mainframe COBOL feed file surprises you when someone retires.
- Heterogeneity of identity. Do the same real-world entities (customers, products) live in multiple sources with different IDs? Multi-source identity integration is one of the strongest signals for Vault.
Three sources that are stable, identity-consistent, and under your team's control: Kimball is fine. Twenty sources, volatile, with cross-source identity confusion: Vault's specific properties earn their cost.
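The three source sub-dimensions can be sketched as a small checklist. This is a minimal illustration, not a scoring model: `SourceProfile`, `vault_signals`, and the `break_even` default (the low end of the 10-15 range above) are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SourceProfile:
    count: int                   # distinct source systems feeding the warehouse
    volatile: bool               # schemas change without warning
    cross_source_identity: bool  # same entities carry different IDs across sources


def vault_signals(profile: SourceProfile, break_even: int = 10) -> list[str]:
    """Return the source-side signals that point toward Vault.

    break_even is an assumed rule-of-thumb threshold, not a measured constant.
    """
    signals = []
    if profile.count >= break_even:
        signals.append("source count at or above break-even")
    if profile.volatile:
        signals.append("volatile source schemas")
    if profile.cross_source_identity:
        signals.append("cross-source identity integration")
    return signals


stable_trio = SourceProfile(count=3, volatile=False, cross_source_identity=False)
chaotic_twenty = SourceProfile(count=20, volatile=True, cross_source_identity=True)

print(vault_signals(stable_trio))     # empty list: Kimball is fine
print(vault_signals(chaotic_twenty))  # all three signals present
```

An empty list does not rule Vault out on its own; team, engine, and change rate still have to be weighed.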
Input 3: Engine
The warehouse engine constrains your options more than people realize:
- Row-store OLTP databases (Postgres without columnar extensions, MySQL, SQL Server without columnstore): OBT is punished by the row-oriented I/O model, because a query touching a few columns of a wide table still reads whole rows; Kimball is the dominant choice; Vault is feasible but harder.
- Columnar warehouses (Snowflake, BigQuery, Redshift): OBT is rewarded; joins are reasonable; Vault is feasible; all three modeling styles work.
- Columnar lakehouses (Databricks-on-Spark, Iceberg-on-anything): same as columnar warehouses, plus Iceberg's schema-evolution properties favor OBT.
- Specialized columnar engines (ClickHouse, Druid, Pinot): OBT is the native pattern; star and Vault are grudgingly supported; performance depends heavily on the data layout matching OBT.
- Hybrid HTAP engines (TiDB, CockroachDB, AlloyDB): depends on the use case; analytical workloads should still be OBT or star.
The engine choice often pre-decides the modeling choice. ClickHouse + Kimball-star is a strange combination; you almost certainly want OBT. Snowflake + Vault-only is also strange; you almost certainly want Vault-with-Kimball-marts.
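The engine list above reads naturally as a lookup table. The sketch below transcribes it; the dictionary names, the family keys, and the short fit labels are assumptions made for the example, and HTAP engines are left out because their fit depends on the use case.

```python
# Engine family -> how each modeling style fares, paraphrased from the list above.
ENGINE_FIT = {
    "row_store": {"obt": "punished", "kimball": "dominant", "vault": "feasible but harder"},
    "columnar_warehouse": {"obt": "rewarded", "kimball": "works", "vault": "feasible"},
    "columnar_lakehouse": {"obt": "rewarded", "kimball": "works", "vault": "feasible"},
    "specialized_columnar": {"obt": "native", "kimball": "grudging", "vault": "grudging"},
}

# Product -> family. Postgres here means Postgres without columnar extensions.
ENGINE_FAMILY = {
    "postgres": "row_store",
    "mysql": "row_store",
    "snowflake": "columnar_warehouse",
    "bigquery": "columnar_warehouse",
    "redshift": "columnar_warehouse",
    "clickhouse": "specialized_columnar",
    "druid": "specialized_columnar",
    "pinot": "specialized_columnar",
}


def style_fit(engine: str) -> dict[str, str]:
    """Look up how OBT, Kimball, and Vault fare on the given engine."""
    return ENGINE_FIT[ENGINE_FAMILY[engine.lower()]]


print(style_fit("ClickHouse"))
print(style_fit("Postgres"))
```

An unknown engine raises `KeyError`, which is deliberate in this sketch: an engine outside the list needs a judgment call, not a default.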
Input 4: Change rate
Three separate change rates that matter:
- Source schema change rate. How often do sources rename columns, add tables, change types? High rate → Vault's absorption helps.
- Business definition change rate. How often does "active customer" or "revenue" get re-defined? High rate → semantic layer (Tracks 7.3-7.5) helps more than any modeling choice.
- Consumer requirement change rate. How often does the BI team need a new dashboard shape? High rate → OBT marts give optionality (build new OBT, leave existing ones alone); Kimball stars are more rigid.
The three rates are independent. A bank might have low business-definition change ("checking account" has been the same for 50 years) but high source change (acquisitions every quarter). A consumer-product startup might have the opposite — sources stable, business definitions in flux every quarter.
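Because the three rates are independent, each one maps to its own lever. A minimal sketch, assuming each rate is summarized as "low" or "high"; the class, the function, and the consumer-requirement values for the two examples are hypothetical, since the text does not state them.

```python
from dataclasses import dataclass


@dataclass
class ChangeRates:
    source_schema: str         # "low" or "high"
    business_definition: str   # "low" or "high"
    consumer_requirement: str  # "low" or "high"


def levers(rates: ChangeRates) -> list[str]:
    """Map each high change rate to the lever the lesson associates with it."""
    result = []
    if rates.source_schema == "high":
        result.append("Vault integration layer")
    if rates.business_definition == "high":
        result.append("semantic layer")
    if rates.consumer_requirement == "high":
        result.append("OBT marts")
    return result


# consumer_requirement="low" is assumed for both examples.
bank = ChangeRates(source_schema="high", business_definition="low", consumer_requirement="low")
startup = ChangeRates(source_schema="low", business_definition="high", consumer_requirement="low")

print(levers(bank))     # ['Vault integration layer']
print(levers(startup))  # ['semantic layer']
```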
How to extract these in a 30-minute conversation
Going in cold to a new company, you can extract the four inputs with about a dozen targeted questions:
Team:
1. How many people on the data team? Engineering vs analytics split?
2. What's the consumer profile? Analysts, PMs, business users, ML?
3. What tools is the team comfortable with: dbt, SQL, Spark, Python?
Sources:
4. How many distinct source systems feed the warehouse?
5. In the last 12 months, how often have sources changed schema unexpectedly?
6. Do the same entities (customer, product, account) exist in multiple sources with different IDs?
Engine:
7. What's the primary warehouse? Snowflake, BigQuery, Redshift, ClickHouse, Postgres, something else?
8. Is the warehouse cost a known concern, or is it comfortably within budget?
Change rate:
9. How often do you onboard a new source? Monthly? Quarterly? Annually?
10. How often does the analytics team redefine a metric (active user, revenue, etc.)?
11. Are there imminent (next 6-12 months) acquisitions, platform changes, or regulatory shifts?
Audit / compliance:
12. Are there regulators, auditors, or compliance officers who ask point-in-time questions?
Twelve questions, ~2.5 minutes each, 30 minutes total. With these answers, the modeling choice is usually obvious within an hour of follow-up.
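The question list can double as a checklist during the conversation. The sketch below abbreviates each question to a short label and tracks which areas still have gaps; the labels and the `missing_inputs` helper are illustrative assumptions, not part of the lesson's material.

```python
# Abbreviated labels for the twelve questions, grouped as in the list above.
QUESTIONS = {
    "team": ["headcount and split", "consumer profile", "tool comfort"],
    "sources": ["source count", "unexpected schema changes", "cross-source IDs"],
    "engine": ["primary warehouse", "cost concern"],
    "change rate": ["new-source cadence", "metric redefinition cadence", "imminent shifts"],
    "audit": ["point-in-time questions"],
}

MINUTES_PER_QUESTION = 2.5

total_questions = sum(len(qs) for qs in QUESTIONS.values())
total_minutes = total_questions * MINUTES_PER_QUESTION


def missing_inputs(answers: dict[str, dict[str, str]]) -> list[str]:
    """Return the areas that still have at least one unanswered question."""
    return [
        area
        for area, questions in QUESTIONS.items()
        if any(q not in answers.get(area, {}) for q in questions)
    ]


print(total_questions, total_minutes)  # 12 30.0

answers = {"engine": {"primary warehouse": "Snowflake", "cost concern": "monitored"}}
print(missing_inputs(answers))  # every area except "engine"
```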
The decision is rarely binary
A naïve framing: "Vault or Kimball or OBT?" — pick one.
A mature framing: "What layer does each style sit in?" — usually multiple, often hybrid.
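The most common 2026 mature answer is some flavor of a layered stack: an integration layer (Vault, where the sources and audit needs justify it), a Kimball star core for analysts, and OBT marts for the highest-traffic dashboards.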
The four inputs determine which layers you actually need and how heavy each layer is, not just which single style to adopt. We'll formalize this in Lesson 5.
3. Worked Example
Let's work through extracting the four inputs and recommending a modeling style for three scenarios.
Scenario 1: A 40-person fintech, 5 sources, Snowflake
Inputs extracted:
- Team: 4 data engineers, 8 analysts, 30 business users on Looker. dbt-native.
- Sources: 5 (Postgres prod, Stripe, Plaid, HubSpot, Mixpanel). Stable APIs except Postgres (3 schema changes/quarter).
- Engine: Snowflake. Cost is monitored but not a critical concern.
- Change rate: Sources stable; metric definitions evolve quarterly with product launches. SOC 2 audit annually.
Recommendation: Kimball star schema core + OBT marts for Looker. Skip Vault — 5 sources is below threshold, audit is shallow. dbt-native team is comfortable with star + mart architecture. Snowflake favors OBT-on-Kimball for the Looker users (30 self-serve users justify the wide-table marts). The metric-definition churn suggests adopting dbt Semantic Layer (Track 7.4) as an additional layer over time — but that's a different decision.
Scenario 2: A 12-source regional insurance carrier with regulator pressure
Inputs extracted:
- Team: 15 data engineers, 25 analysts, ~100 business users. Mixed skills — some dbt, some legacy ETL tools, some Spark.
- Sources: 12 systems. 4 are legacy mainframe feeds (schemas change on no schedule). 2 acquisitions in last 18 months added more.
- Engine: Snowflake. Bill is significant; cost is monitored carefully.
- Change rate: Sources change often (mainframe surprises 6-8 times/year). Business definitions stable. Regulator runs audits requiring point-in-time customer-state reproducibility.
Recommendation: Vault for integration + Kimball serving marts + OBT for high-traffic dashboards. This is the layered architecture's sweet spot. Vault absorbs the mainframe schema churn and the audit requirement. Kimball star serves the analyst tier. OBT marts for the highest-traffic business-user dashboards. The 15-person engineering team has the bandwidth; the regulator pressure justifies the Vault overhead. This is a multi-year build but the right one.
Scenario 3: A 6-person YC-backed startup with one Postgres source
Inputs extracted:
- Team: 1 part-time data engineer, 2 analysts (one technical, one not). dbt is in use but lightly.
- Sources: 1 (Postgres OLTP) + occasional Stripe extract.
- Engine: Snowflake (on free trial credits for now; small data).
- Change rate: Product team ships new features weekly — schema changes constant. Metric definitions in flux. No audit pressure.
Recommendation: dbt staging models + thin marts. No Kimball, no Vault, no OBT formalism. This team is too small to justify any architectural sophistication. The right answer is "write straightforward dbt models that join staging tables for each dashboard, refactor as the team grows". Adopting Vault here would be the cosplay described in Lesson 10 of Course 2.1. Adopting OBT marts is overkill at this scale. The right call is to ship dashboards and revisit modeling discipline at 20+ engineers / 5+ sources.
Three scenarios, three different answers, all driven by the same four inputs.
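The three recommendations can be reproduced by a small rule function. This is a sketch under loud assumptions: the `Inputs` fields are a reduced subset of the four inputs, and the thresholds (10 sources for Vault, 30 self-serve users for OBT marts, fewer than 2 data engineers and fewer than 3 sources for "too small") are illustrative values fitted to these scenarios, not rules from the course.

```python
from dataclasses import dataclass


@dataclass
class Inputs:
    data_engineers: float      # full-time equivalents on the data team
    source_count: int
    volatile_sources: bool
    point_in_time_audit: bool  # regulator needs reproducible historical state
    self_serve_users: int


def recommend(inputs: Inputs) -> list[str]:
    """Return the layers to build, from integration to serving."""
    # Too small for formal architecture (assumed cut-off).
    if inputs.data_engineers < 2 and inputs.source_count < 3:
        return ["dbt staging", "thin marts"]
    layers = []
    # Vault earns its cost with many sources plus churn or audit pressure.
    if inputs.source_count >= 10 and (inputs.volatile_sources or inputs.point_in_time_audit):
        layers.append("Vault integration")
    layers.append("Kimball star")
    # Wide-table marts for a sizeable self-serve audience (assumed cut-off).
    if inputs.self_serve_users >= 30:
        layers.append("OBT marts")
    return layers


fintech = Inputs(4, 5, False, False, 30)
insurer = Inputs(15, 12, True, True, 100)
startup = Inputs(0.5, 2, True, False, 0)

for name, scenario in [("fintech", fintech), ("insurer", insurer), ("startup", startup)]:
    print(name, recommend(scenario))
```

The engine input is omitted because all three scenarios run on Snowflake; a fuller version would branch on it first.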
Aha: The choice between Kimball, Vault, and OBT is not a matter of preference or philosophy — it's a function of four measurable inputs (team, sources, engine, change rate). A senior data engineer's value isn't in advocating for one style; it's in extracting the inputs in a 30-minute conversation and deriving the recommendation. The styles themselves are well-understood; the senior-engineering skill is the diagnosis.
4. Your Turn
Exercise: Extract inputs from a scenario.
A 200-person B2B SaaS company asks you to recommend a modeling style. Here's what they've told you so far (a typical underspecified initial brief):
"Our analytics has gotten complicated. We have Snowflake, dbt, and Looker. The dashboards are slow. We're considering rebuilding on a 'real warehouse architecture'. We've heard about Data Vault. What do you think?"
Write down the follow-up questions you'd ask to extract the four inputs. Aim for 8-12 questions, each tied to a specific input dimension. Then sketch the kind of recommendation that would emerge from different plausible answers.
5. Real-World Application
In consulting practice, the four-input framework is what experienced data architects use whether they articulate it or not. A 30-minute conversation that extracts these inputs produces a recommendation that is rarely overturned by deeper investigation. The reason is that modeling style is a function of constraints, and once the constraints are named, the function is mostly mechanical.
Three common failure modes when these inputs are not extracted:
- The architect recommends what they're comfortable with. A Vault practitioner sees Vault opportunities everywhere. A Kimball traditionalist sees star schemas. A modern-stack devotee sees dbt + OBT for everything. The choice is biased by experience rather than by inputs.
- The team recommends what's currently fashionable. "We should adopt OBT because that's what's modern" is the same shape of argument as "we should adopt Kimball because that's what's traditional". Both ignore the inputs.
- The team rebuilds because the dashboards are slow. As the brief in the Your Turn exercise illustrates, slow dashboards are a symptom that might have nothing to do with the modeling style. Migration is expensive; diagnosing the actual bottleneck first is cheaper.
A few real-world signals of senior data-architecture judgment:
- Asking the four-input questions before the diagnostic conversation gets to design. A senior architect spends the first 30 minutes extracting; a junior architect spends the first 30 minutes drawing a star schema.
- Recommending NOT to migrate. Often the right answer is "your current architecture is fine; the problem is X" where X is a specific tactical issue (missing index, query rewrite, warehouse sizing). Recommending against a rebuild is harder than recommending one.
- Quantifying the trade. "Vault would cost 12 engineering months and would save you 8 engineering weeks per acquisition. You're projecting 1 acquisition every 18 months. Break-even takes 6.5 acquisitions, so the payback is close to 10 years." That kind of math is what makes the four-input framework actionable.
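The payback arithmetic is worth writing down, because the answer is very sensitive to units. The sketch below assumes a month is 52/12 weeks and counts savings in engineering weeks; `engineers_freed` is a hypothetical parameter that scales the saving when the weeks quoted are calendar weeks for a whole team.

```python
WEEKS_PER_MONTH = 52 / 12


def payback_years(
    build_cost_eng_months: float,
    weeks_saved_per_event: float,
    months_between_events: float,
    engineers_freed: int = 1,
) -> float:
    """Years until cumulative savings equal the build cost."""
    saved_eng_months_per_event = weeks_saved_per_event * engineers_freed / WEEKS_PER_MONTH
    events_to_break_even = build_cost_eng_months / saved_eng_months_per_event
    return events_to_break_even * months_between_events / 12


# 12 engineering months to build, 8 weeks saved per acquisition, one every 18 months.
single = payback_years(12, 8, 18)
team_of_four = payback_years(12, 8, 18, engineers_freed=4)

print(f"one engineer's weeks: {single:.2f} years")
print(f"four engineers' weeks: {team_of_four:.2f} years")
```

If the 8 weeks are one engineer's time, break-even takes 6.5 acquisitions and close to ten years; if they are 8 calendar weeks for a four-engineer team, it takes about two and a half years. Naming the unit is part of quantifying the trade.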
The rest of this course applies the framework to specific style choices. Lesson 2 covers when Kimball is undefeated; Lesson 3 covers Vault's specific signal; Lesson 4 covers OBT's; Lesson 5 covers hybrids; Lesson 6 covers the cost of changing your mind. By the end you'll have a concrete decision framework that survives contact with real-world conversations.