A 38-minute deep session that picks up where the intro deck left off. We assume you know joins, keys, ACID and what an index is — this is the layer below: analytics in SQL, reading EXPLAIN ANALYZE, index strategy, MVCC and isolation, and scaling a table past one machine. If the fundamentals are hazy, do Databases & SQL first.
This is Part 2 of Databases & SQL — so we skip GROUP BY basics and go straight to the tool that turns SQL into an analytics engine. A window function computes a value across a set of rows related to the current row while still returning every row — running totals, rankings, and row-to-row deltas without a self-join.
window: a frame of rows defined by OVER (PARTITION BY … ORDER BY …). Unlike GROUP BY, which folds many rows into one, a window leaves the rows intact and adds the computed column alongside them. The partition slices the data into independent groups; the frame picks which rows inside the partition the function sees.Each user is its own partition; the frame grows from the first row to the current one, so the sum accumulates and resets per user.
ROW_NUMBER is always unique; RANK leaves gaps after ties (1,1,3); DENSE_RANKdoesn't (1,1,2). NTILE(4) buckets into quartiles.
LAG(x) reads the previous row, LEAD(x) the next — perfect for deltas (this month vs last) without a correlated subquery.
SUM, AVG, COUNT, MAX all take an OVER clause — running totals, moving averages, share-of-group.
FIRST_VALUE, LAST_VALUE, NTH_VALUE pull a row from a position in the frame — watch the frame, or LAST_VALUE surprises you.
WHERE, GROUP BY and HAVING, so you can't filter on one in the same query — wrap it in a CTE or subquery and filter outside. Second: with ORDER BY but no explicit frame, the default is RANGE, not ROWS — which lumps all peer rows with the same sort key into the current frame. For a true row-by-row running total, spell out ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.A CTE (Common Table Expression — the WITH clause) gives a subquery a name so a complex query reads top-to-bottom instead of inside-out. The advanced payoff is recursion: a CTE that references itself, which is how you walk trees and graphs — org charts, category hierarchies, dependency DAGs — in one statement.
WITH RECURSIVE t AS (anchor UNION ALL recursive-member). The anchor seeds the result; the recursive member re-runs against the rows the previous step produced, until it adds no new rows. UNION (vs UNION ALL) de-duplicates and is a cheap guard against infinite loops on cyclic data.Each iteration joins the table to the rows found one level up — the query fans out a layer at a time until nobody else reports in.
Pre-Postgres 12 a CTE was an optimization fence — always materialised. Modern Postgres inlines a simple CTE into the main plan; force either way with AS MATERIALIZED / AS NOT MATERIALIZED.
Graphs can have cycles. Use UNION to drop repeats, or the SQL-standard CYCLE … SET … USINGclause (Postgres 14+) to detect and stop on a revisited node.
WITH RECURSIVEis standard SQL — Postgres, SQL Server, Oracle, modern MySQL 8+ and SQLite all support it. The CYCLE/SEARCH sugar varies by engine.
The intro deck showed Seq Scan vs Index Scan. Now go deeper: which join algorithmthe planner chose, why, and how to spot a plan that's about to fall over. The planner is a cost model — it guesses; your job is to catch it guessing wrong.
EXPLAIN prints the plan with estimated rows and cost; EXPLAIN ANALYZE actually runs it and prints actual rows and time. Add BUFFERSto see page hits vs reads (Postgres 18 includes it by default with ANALYZE). Read the tree bottom-up, inside-out.Loops the outer rows and looks each up in the inner — ideally via an index. Wins when the outer side is small. Dies when both sides are large (rows × rows).
Builds a hash table on the smaller input, streams the larger past it. Wins on big unsorted equi-joins. Costs memory — past work_mem it spills to disk in batches.
Walks both inputs in sorted order, merging like a zipper. Wins when inputs are already sorted (an index or a range). Paysfor a sort if they aren't.
ANALYZE to refresh them.480 000 times.placed_at.loops, a big actual-rows number, and a Sort or hash that spilled to disk.Postgres-flavoured; most engines have an equivalent. One PRO, one CON, and when to reach for each.
Pro: ground truth — actual rows, timing, buffers for a single statement. Con: ANALYZE really runs the query (wrap writes in a rolled-back transaction).
Reach for it when you already know which query is slow.
Pro: aggregates total time, calls and mean per normalised query across the whole server. Con: an extension you must enable; showswhat is slow, not why.
Reach for it to find the top offenders before you tune anything.
Pro: logs the actual plan of any query slower than a threshold, in production. Con: log volume and a little overhead; tune the threshold carefully.
Reach for it to catch intermittent slow plans you can't reproduce.
Pro: turn a wall of text into a tree, flag the expensive node, suggest fixes. Con: only as good as the plan you paste; no server-wide view.
Reach for it when a plan is too big to read in your head.
You know what a B-tree index is. The craft is choosing the shape: which columns, in which order, with what extras — and knowing that every index you add is a tax on every write. Three patterns cover most of the wins.
In (a, b, c) only a left-prefix is usable. Put the columns you test with = first, then a single range column, then the ORDER BY column. A filter on balone can't use it.
Add non-key columns with INCLUDE so the index holds everything the query selects. The planner does an index-only scan — it never touches the heap (when the visibility map is fresh).
A WHERE on the index itself keeps it small and cheap. Classic for a hot subset — open orders, soft-deleted rows, a single tenant — where most rows are irrelevant.
Every index is a second structure the engine must keep in sync — so each one slows inserts, updates and deletes.
INSERT becomes one heap write plus a write to every index on the table.pg_stat_user_indexes).GIN suits jsonb/arrays/full-text and BRIN suits huge naturally-ordered tables like time-series.EXPLAIN— an index that isn't used is worse than none.The intro deck gave you ACID. Now the hard part of the I: what concurrent transactions are actually allowed to see, how Postgres delivers it without locking readers, and the failure modes — lost updates, write skew, deadlocks — that bite at scale.
xmin/xmax); a transaction sees only versions valid for its snapshot. The upshot: readers never block writers and writers never block readers. The cost: dead versions pile up, and VACUUM must reclaim them.Higher levels prevent more anomalies — and cost more aborts and contention. Pick the lowest level that's still correct.
SELECT … FOR UPDATE to take a row lock, or detect via an optimistic version column.FOR UPDATE SKIP LOCKED lets many workers grab different rows without blocking each other.VACUUM, and bloat the table.Two different cuts of the same idea: split the data. Partitioning splits one table into pieces on one machine; sharding splits it across many machines. They solve different problems and have very different costs.
RANGE (by date), LIST (by region) and HASH. The planner does partition pruning — skipping partitions the WHEREcan't match.DELETE); smaller per-partition indexes; prune irrelevant partitions at plan time.Partitioning = child tables under one server. Sharding = rows spread across independent servers behind a shard key.
A dashboard query times out. Walk the loop you'll use for the rest of your career: measure → read the plan → change one thing → re-measure.
(user_id, placed_at DESC) serves both the join and the sort, killing the disk sort.INCLUDE (amount) turns it into an index-only scan, so it never visits the heap.active users; the index is a fraction of the size.ANALYZE so the planner stops guessing and picks the right join.EXPLAIN ANALYZE again; trust the numbers, not the hunch.Five questions on windows, plans, indexes, isolation, and scale — instant feedback, no sign-in.
Navigate with ← → or scroll · back to library