Data Warehousing & Modeling

01 · Why analytics needs its own store 4 min

The database that runs the app
is the wrong place to analyze it.

Your app's database is tuned to read and write a few rows at a time, fast — one customer, one order. Analytics asks the opposite kind of question: scan millions of rows and summarize them. Run that on the live app database and you slow down real users while getting slow answers. The fix is a second, purpose-built store.

OLTP vs OLAP — two opposite database workloads. OLTP (Online Transaction Processing) handles the app's live reads and writes — small, frequent, point lookups. OLAP (Online Analytical Processing) answers analytical questions — large scans and aggregations over history. A data warehouse is an OLAP store built for the second job.

We covered the row-store vs column-store internals in Databases & SQL; here we build on it. The short version: analytics warehouses are columnar — they store each column together, so a query that touches 3 of 200 columns reads only those 3, and compresses them hard.

Two stores, two jobs. A pipeline copies data from the app DB into the warehouse so analytics never touches production.

Why not just one database?

Different access pattern. OLTP reads a handful of rows; OLAP scans millions. One index layout can't be great at both.
Isolation. A heavy report shouldn't be able to slow down checkout.
History. The app keeps current state; the warehouse keeps all of history to compare over time.
One place, many sources. The warehouse joins the app DB, the CRM, payments and ads — the app DB only knows itself.

Analytics on the live app DB

-- run straight against production, row-store SELECT region, SUM(amount) FROM orders -- 5 years, 400M rows GROUP BY region; -- full table scan -- reads every column of every row · locks · slows checkout

Same query on a columnar warehouse

-- run against the warehouse, column-store SELECT region, SUM(amount) FROM fact_sales -- same 400M rows GROUP BY region; -- reads only 2 columns -- scans region + amount · compressed · users untouched

Like a busy restaurant kitchen (OLTP) versus the accountant's back office (OLAP): you don't do the end-of-year books on the line during the dinner rush.

02 · The storage choices 5 min

Three answers to "where
does the data live?"

Before modeling anything, you choose where analytical data is stored. The three options aren't rival religions — they're points on a line from structured & managed to open & flexible, and the lakehouse is the modern attempt to get both.

Warehouse · lake · lakehouse — a data warehouse stores clean, structured tables optimized for SQL analytics. A data lake stores raw files of any shape (JSON, CSV, Parquet, images) cheaply in object storage. A lakehouse adds a table layer (open formats like Apache Iceberg or Delta Lake) on top of lake files so you get warehouse-like tables and transactions on open storage.

Left to right: clean tables you must load, raw files you can dump anything into, and the lakehouse — a table layer that brings structure and transactions to open lake files.

Data warehouse — structured, managed, SQL-first

Data is loaded into defined tables and columns before you query it (schema-on-write). The engine owns the storage format, so SQL is fast and governance is easy.

Strengths

Fast SQL, strong governance, easy for analysts. Snowflake, BigQuery and Redshift live here.

Limits

Structured data mostly; you load before you can query, and the vendor owns the storage format.

Use when your analytics is mostly tabular and you want the least operational fuss — the common starting point.

Data lake — raw files, any shape, dirt cheap

Dump anything into object storage (S3, GCS, Azure Blob) and decide the structure later, when you read it (schema-on-read). Great for volume and variety; weaker for ad-hoc SQL and consistency.

Strengths

Cheapest storage, holds any format (logs, images, JSON), no upfront modeling. Feeds ML well.

Limits

No tables/transactions by default — easy to create a "data swamp" that nobody can query reliably.

Use when you have huge or semi-structured data (events, files) and ML/exploration workloads.

Lakehouse — tables on top of lake files

An open table format — Apache Iceberg or Delta Lake — sits over Parquet files in the lake and tracks which files belong to a table, plus a transaction log. You get ACID-like transactions, schema evolution and time travel on open storage that many engines can read.

Strengths

Warehouse-style tables + transactions on cheap open files; no vendor lock-in on the data; one copy for BI and ML.

Limits

More moving parts to operate, and the tooling is younger than classic warehouses.

Use when you want one open copy of data serving both SQL analytics and ML, without paying to duplicate it.

The trend — the warehouse and lake are converging. Warehouses now read open lake formats, and lakehouses now run fast SQL. For most teams the practical answer in 2026 is: start with a managed warehouse, and adopt an open table format (Iceberg / Delta) once you need one copy of data shared across many engines.

03 · Facts, dimensions & the star schema 7 min

Model analytics around
what happened and by what.

App databases are modeled to avoid duplication. Analytics is modeled for understanding and speed: split the world into events you measure and the things you slice them by. That split is dimensional modeling, and its shape is the star schema — the Kimball method that still dominates BI.

Fact vs dimension — a fact table records measurable events, one row per event, with numeric measures (amount, quantity) plus foreign keys. A dimension table holds the descriptive context you filter and group by (the product, customer, store, date). Facts are the verbs; dimensions are the nouns.

The star: one fact table in the middle, dimension tables as the points. Every join is fact → dimension, one hop.

Why this shape wins for BI

Simple joins. Every query is the fact joined directly to a few dimensions — no deep chains to reason about.
Readable. "Sales by region by month" maps straight onto dimensions; analysts find it intuitive.
Fast. Big fact scan + small dimension lookups is exactly what columnar engines optimize for.
Stable. Add a new way to slice data by adding a dimension — the fact table barely changes.

Grain — decide it first

The grain is what one row of the fact table means — "one line item on one order" or "one product's sales per store per day". Pick it explicitly and early: everything else (which dimensions attach, which measures make sense) follows from the grain, and mixing grains in one table is the classic modeling bug.

Kimball's four-step design

Pick the business process

What activity are we measuring?

Start from a real process the business cares about — orders, shipments, sign-ups, support tickets. Each process becomes a fact table. Don't model "the whole company" at once; model one process well.

Declare the grain

What does one fact row represent?

State the grain in one sentence: one row per order line. The finest practical grain is usually best — you can always sum up to coarser views, but you can never split a pre-aggregated row back apart.

Identify the dimensions

How will people slice it?

List the ways people describe and filter the event: by date, by product, by customer, by store, by channel. Each becomes a dimension table keyed into the fact. If it answers "by what?", it's a dimension.

Identify the facts (measures)

What numbers do we add up?

List the numeric measures at that grain — quantity, amount, discount, tax. Good measures are additive: they sum correctly across every dimension. Ratios and percentages are computed at query time, not stored.

04 · Wide tables, on purpose 5 min

Analytics denormalizes
where the app normalizes.

In Databases & SQL we said: normalize to kill redundancy, denormalize only when a measured read problem forces it. Analytics flips the default. Reads are the whole point and writes are batch-controlled, so we widen tables deliberately to make queries simple and fast.

Star vs snowflake — a star schema keeps each dimension as one wide, denormalized table (everything about a product in dim_product). A snowflake schema normalizes dimensions into sub-tables (product → category → department). Star trades some redundancy for fewer joins; in analytics, that trade is usually worth it.

Snowflake — normalized dimensions

No duplicated category text — but every query re-joins three tables to read one product's category.

Star — denormalized dimension

Category and department live right in the dimension. One join, and columnar compression makes the repetition nearly free.

The wide extreme — One Big Table

Some teams go further and pre-join the fact with its dimensions into a single One Big Table (OBT / "wide table") — one row per event with every descriptive column already attached. Zero joins at query time; BI tools love it. The cost is redundancy and a rebuild when a dimension changes — which is fine because the transform layer (next section) owns rebuilding it.

Stale attributes. A renamed category must be re-stamped onto millions of rows, not one dimension row.
Storage & rebuilds. Very wide tables cost more to store and recompute on every run.
Lost history nuance. Tracking how an attribute changed over time is easier in a real dimension (an SCD) than in a flattened blob.

-- one big table: event + all its context, pre-joined SELECT * FROM wide_sales WHERE order_date >= '2026-01-01' AND category = 'Footwear' AND region = 'EMEA'; -- no joins · every filter column is already here

An OBT pushes the joins into the build step so the query has none.

Rule of thumb — model a star by default: denormalized dimensions, simple joins. Snowflake only the dimensions that are huge or genuinely shared. Flatten to a wide table for a specific BI use case where every join hurts. The redundancy you fear in OLTP is cheap in a columnar warehouse — and rebuilt automatically, not edited by hand.

05 · Ingest → warehouse → transform → BI 5 min

How data gets from the source
to a chart someone trusts.

The warehouse is the center, not the whole picture. Around it sits a now-standard set of layers — the "modern data stack". We covered the moving-data mechanics in ETL & ELT Pipelines; here's how those pieces line up around the warehouse.

Modern data stack — a layered set of specialized tools around a cloud warehouse: ingest (connectors copy sources in), store (the warehouse / lakehouse), transform (dbt models raw into clean marts with SQL), and serve (BI tools and ML read the result). It's ELT: load raw first, model in place.

Connectors load raw in; dbt turns raw → staging → marts inside the warehouse; BI and ML read the marts. Everything after "raw" is SQL in version control.

The transform layer (dbt) does the modeling

dbt is where the star schema from Part 3 actually gets built. You write SELECT statements; dbt manages them as a dependency graph of tables/views, with tests and docs. A common layering: staging (lightly cleaned source mirrors) → marts (the facts and dimensions analysts query). Some teams name the tiers bronze / silver / gold — same idea.

Models are SQL files; ref() wires dependencies so dbt knows the build order.
Tests (unique, not-null, relationships) gate the build — see the quality checks in the pipelines deck.
Docs & lineage are generated from the graph, so every table is traceable to its sources.

-- a dbt model: staging refs build the star with p as ( SELECT * FROM {{ ref('stg_products') }} ) SELECT product_id, name, category, -- denormalized into the dim department FROM p; -- dbt builds this as a table, after stg_products

ref() makes dependencies explicit; dbt computes the build order and can test the output.

06 · Warehouses, engines & the stack 6 min

The tooling landscape —
what each is good and bad at.

The center of the stack is the warehouse/engine. Five names cover most of the market in 2026, each with a clear sweet spot. Then dbt for modeling and a BI tool to put it in front of people.

Cloud warehouse

Snowflake

Pro — separates storage from compute, easy to run, strong sharing and governance; a safe default for SQL analytics.
Con — usage-based pricing climbs with careless queries; it's a managed service you don't self-host.

Cloud warehouse

BigQuery

Pro — fully serverless: no clusters to size, scales instantly, deep Google Cloud and ML integration.
Con — on-demand pricing is per-byte-scanned, so SELECT * habits get expensive; it's GCP-centric.

Cloud warehouse

Amazon Redshift

Pro — mature, tight AWS integration; good when your data and team already live in AWS.
Con — more cluster tuning than the serverless options; historically more knobs to manage.

Lakehouse

Databricks

Pro — the lakehouse leader (Delta Lake + Spark): one platform for SQL, big data and ML on open files.
Con — broader and more complex than a pure warehouse; more concepts for a SQL-only team to learn.

Columnar engine

ClickHouse

Pro — extremely fast open-source columnar engine; superb for high-volume, low-latency analytics and user-facing dashboards.
Con — leaner on transforms/joins and ecosystem; more of a query engine than a full managed warehouse.

Transform

dbt

Pro — the standard for in-warehouse modeling: version-controlled SQL with tests, docs and lineage; analysts own it.
Con — transform only — it doesn't ingest, store or schedule; it sits on top of the warehouse.

Putting it in front of people — BI

A BI tool turns warehouse tables into dashboards and self-serve exploration. The common choices: Looker (a governed semantic model — central metric definitions, strong for big orgs), Metabase (lightweight, open-source, fast for teams to self-serve), and Tableau / Power BI (rich visual analysis, widely adopted in enterprises).

Already on a cloud? Default to its native warehouse — BigQuery on GCP, Redshift on AWS — unless you want cloud-neutral, then Snowflake.
Heavy ML + big-data + open files? Lean Databricks (lakehouse) so SQL and ML share one copy.
Need sub-second, high-concurrency dashboards? Add or use ClickHouse for that serving layer.
Whatever you pick, model with dbt and serve with one BI tool — and don't run more tools than you can operate.

How to choose — for most teams the decision is the warehouse, and the rest follows. Start with a managed warehouse that matches your cloud (or Snowflake to stay neutral); reach for a lakehouse (Databricks + Iceberg/Delta) when ML and one open copy of data matter; reach for ClickHouse when raw query speed and concurrency are the product. Then it's dbt for modeling and a single BI tool on top.

07 · A worked star schema & takeaways 4 min

From scattered tables
to one answerable star.

Let's build the retail sales star end to end: define the fact, attach the dimensions, then ask a real business question with one clean query.

-- grain: one row per order line CREATE TABLE fact_sales ( date_key INT, -- → dim_date product_key INT, -- → dim_product customer_key INT, -- → dim_customer store_key INT, -- → dim_store quantity INT, -- additive measure amount DECIMAL -- additive measure );

-- "Footwear revenue by region, last quarter" SELECT s.region, SUM(f.amount) AS revenue FROM fact_sales f JOIN dim_store s ON s.store_key = f.store_key JOIN dim_product p ON p.product_key = f.product_key JOIN dim_date d ON d.date_key = f.date_key WHERE p.category = 'Footwear' AND d.quarter = '2026-Q1' GROUP BY s.region;

One fact, three single-hop joins, two filters, one group-by. That readability is the whole point of dimensional modeling.

1Analytics gets its own store. OLAP / columnar warehouse, separate from the app's OLTP database.

2Warehouse, lake, lakehouse sit on one line from structured-and-managed to open-and-flexible — the lakehouse aims for both.

3Model dimensionally. Facts (events you measure) + dimensions (how you slice), in a star — pick the grain first.

4Denormalize on purpose. Wide dimensions and even one-big- tables trade cheap redundancy for simple, fast reads.

5The stack is ELT. Ingest raw, transform in-warehouse with dbt, serve with BI — pick the warehouse first; the rest follows.

Keep going

The Data Warehouse Toolkit — Kimball & Ross (the dimensional modeling reference)
Fundamentals of Data Engineering — Reis & Housley
dbt docs — modeling, tests and lineage in practice
Siblings: Databases & SQL · ETL & ELT

One sentence to remember

"Model the data the way people ask questions, not the way the app stores it."

— the whole talk, compressed

Knowledge check

Did it stick?

Five quick questions on OLAP, storage choices, dimensional modeling and the modern stack — instant feedback, no sign-in.

Rate this deck

be the first

Navigate with ← → or scroll · back to library

Data Warehousing& Modeling — designingthe store the whole company queries.

The database that runs the appis the wrong place to analyze it.

Why not just one database?

Three answers to "wheredoes the data live?"

Data warehouse — structured, managed, SQL-first

Data lake — raw files, any shape, dirt cheap

Lakehouse — tables on top of lake files

Model analytics aroundwhat happened and by what.

Why this shape wins for BI

Grain — decide it first

Kimball's four-step design

Analytics denormalizeswhere the app normalizes.

The wide extreme — One Big Table

How data gets from the sourceto a chart someone trusts.

The transform layer (dbt) does the modeling

The tooling landscape —what each is good and bad at.

Snowflake

BigQuery

Amazon Redshift

Databricks

ClickHouse

dbt

Putting it in front of people — BI

From scattered tablesto one answerable star.

Keep going

One sentence to remember

Did it stick?

Data Warehousing
& Modeling — designing
the store the whole company queries.

The database that runs the app
is the wrong place to analyze it.

Three answers to "where
does the data live?"

Model analytics around
what happened and by what.

Analytics denormalizes
where the app normalizes.

How data gets from the source
to a chart someone trusts.

The tooling landscape —
what each is good and bad at.

From scattered tables
to one answerable star.