A 36-minute working session on the analytics side of data: why analytics needs its own store, warehouse vs lake vs lakehouse, dimensional modeling and the star schema, why analysts denormalize on purpose, and the modern stack that ingests, transforms and serves it.
Your app's database is tuned to read and write a few rows at a time, fast — one customer, one order. Analytics asks the opposite kind of question: scan millions of rows and summarize them. Run that on the live app database and you slow down real users while getting slow answers. The fix is a second, purpose-built store.
We covered the row-store vs column-store internals in Databases & SQL; here we build on it. The short version: analytics warehouses are columnar — they store each column together, so a query that touches 3 of 200 columns reads only those 3, and compresses them hard.
Two stores, two jobs. A pipeline copies data from the app DB into the warehouse so analytics never touches production.
Like a busy restaurant kitchen (OLTP) versus the accountant's back office (OLAP): you don't do the end-of-year books on the line during the dinner rush.
Before modeling anything, you choose where analytical data is stored. The three options aren't rival religions — they're points on a line from structured & managed to open & flexible, and the lakehouse is the modern attempt to get both.
Left to right: clean tables you must load, raw files you can dump anything into, and the lakehouse — a table layer that brings structure and transactions to open lake files.
Data is loaded into defined tables and columns before you query it (schema-on-write). The engine owns the storage format, so SQL is fast and governance is easy.
Use when your analytics is mostly tabular and you want the least operational fuss — the common starting point.
Dump anything into object storage (S3, GCS, Azure Blob) and decide the structure later, when you read it (schema-on-read). Great for volume and variety; weaker for ad-hoc SQL and consistency.
Use when you have huge or semi-structured data (events, files) and ML/exploration workloads.
An open table format — Apache Iceberg or Delta Lake — sits over Parquet files in the lake and tracks which files belong to a table, plus a transaction log. You get ACID-like transactions, schema evolution and time travel on open storage that many engines can read.
Use when you want one open copy of data serving both SQL analytics and ML, without paying to duplicate it.
App databases are modeled to avoid duplication. Analytics is modeled for understanding and speed: split the world into events you measure and the things you slice them by. That split is dimensional modeling, and its shape is the star schema — the Kimball method that still dominates BI.
The star: one fact table in the middle, dimension tables as the points. Every join is fact → dimension, one hop.
The grain is what one row of the fact table means — "one line item on one order" or "one product's sales per store per day". Pick it explicitly and early: everything else (which dimensions attach, which measures make sense) follows from the grain, and mixing grains in one table is the classic modeling bug.
Start from a real process the business cares about — orders, shipments, sign-ups, support tickets. Each process becomes a fact table. Don't model "the whole company" at once; model one process well.
State the grain in one sentence: one row per order line. The finest practical grain is usually best — you can always sum up to coarser views, but you can never split a pre-aggregated row back apart.
List the ways people describe and filter the event: by date, by product, by customer, by store, by channel. Each becomes a dimension table keyed into the fact. If it answers "by what?", it's a dimension.
List the numeric measures at that grain — quantity, amount, discount, tax. Good measures are additive: they sum correctly across every dimension. Ratios and percentages are computed at query time, not stored.
In Databases & SQL we said: normalize to kill redundancy, denormalize only when a measured read problem forces it. Analytics flips the default. Reads are the whole point and writes are batch-controlled, so we widen tables deliberately to make queries simple and fast.
dim_product). A snowflake schema normalizes dimensions into sub-tables (product → category → department). Star trades some redundancy for fewer joins; in analytics, that trade is usually worth it.No duplicated category text — but every query re-joins three tables to read one product's category.
Category and department live right in the dimension. One join, and columnar compression makes the repetition nearly free.
Some teams go further and pre-join the fact with its dimensions into a single One Big Table (OBT / "wide table") — one row per event with every descriptive column already attached. Zero joins at query time; BI tools love it. The cost is redundancy and a rebuild when a dimension changes — which is fine because the transform layer (next section) owns rebuilding it.
An OBT pushes the joins into the build step so the query has none.
The warehouse is the center, not the whole picture. Around it sits a now-standard set of layers — the "modern data stack". We covered the moving-data mechanics in ETL & ELT Pipelines; here's how those pieces line up around the warehouse.
Connectors load raw in; dbt turns raw → staging → marts inside the warehouse; BI and ML read the marts. Everything after "raw" is SQL in version control.
dbt is where the star schema from Part 3 actually gets built. You write SELECT statements; dbt manages them as a dependency graph of tables/views, with tests and docs. A common layering: staging (lightly cleaned source mirrors) → marts (the facts and dimensions analysts query). Some teams name the tiers bronze / silver / gold — same idea.
ref() wires dependencies so dbt knows the build order.ref() makes dependencies explicit; dbt computes the build order and can test the output.
The center of the stack is the warehouse/engine. Five names cover most of the market in 2026, each with a clear sweet spot. Then dbt for modeling and a BI tool to put it in front of people.
Pro — separates storage from compute, easy to run, strong sharing and governance; a safe default for SQL analytics.
Con — usage-based pricing climbs with careless queries; it's a managed service you don't self-host.
Pro — fully serverless: no clusters to size, scales instantly, deep Google Cloud and ML integration.
Con — on-demand pricing is per-byte-scanned, so SELECT * habits get expensive; it's GCP-centric.
Pro — mature, tight AWS integration; good when your data and team already live in AWS.
Con — more cluster tuning than the serverless options; historically more knobs to manage.
Pro — the lakehouse leader (Delta Lake + Spark): one platform for SQL, big data and ML on open files.
Con — broader and more complex than a pure warehouse; more concepts for a SQL-only team to learn.
Pro — extremely fast open-source columnar engine; superb for high-volume, low-latency analytics and user-facing dashboards.
Con — leaner on transforms/joins and ecosystem; more of a query engine than a full managed warehouse.
Pro — the standard for in-warehouse modeling: version-controlled SQL with tests, docs and lineage; analysts own it.
Con — transform only — it doesn't ingest, store or schedule; it sits on top of the warehouse.
A BI tool turns warehouse tables into dashboards and self-serve exploration. The common choices: Looker (a governed semantic model — central metric definitions, strong for big orgs), Metabase (lightweight, open-source, fast for teams to self-serve), and Tableau / Power BI (rich visual analysis, widely adopted in enterprises).
Let's build the retail sales star end to end: define the fact, attach the dimensions, then ask a real business question with one clean query.
One fact, three single-hop joins, two filters, one group-by. That readability is the whole point of dimensional modeling.
"Model the data the way people ask questions, not the way the app stores it."
— the whole talk, compressed
Five quick questions on OLAP, storage choices, dimensional modeling and the modern stack — instant feedback, no sign-in.
Navigate with ← → or scroll · back to library