Library
00/07 · ~38 min
GUIDEDECK · PART 2 · for queries that have to be fast

Advanced SQL
window functions, plans,
and the engine underneath.

A 38-minute deep session that picks up where the intro deck left off. We assume you know joins, keys, ACID and what an index is — this is the layer below: analytics in SQL, reading EXPLAIN ANALYZE, index strategy, MVCC and isolation, and scaling a table past one machine. If the fundamentals are hazy, do Databases & SQL first.

~38 MININTERMEDIATE → ADVANCEDPOSTGRES-FLAVOURED
SCROLL
01 · Window functions 6 min

Aggregate without
collapsing the rows.

This is Part 2 of Databases & SQL — so we skip GROUP BY basics and go straight to the tool that turns SQL into an analytics engine. A window function computes a value across a set of rows related to the current row while still returning every row — running totals, rankings, and row-to-row deltas without a self-join.

Window function an aggregate or ranking computed over a window: a frame of rows defined by OVER (PARTITION BY … ORDER BY …). Unlike GROUP BY, which folds many rows into one, a window leaves the rows intact and adds the computed column alongside them. The partition slices the data into independent groups; the frame picks which rows inside the partition the function sees.
SELECT user_id, placed_at, amount, SUM(amount) OVER ( PARTITION BY user_id -- one running total per user ORDER BY placed_at -- ordered within the user ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW -- the frame ) AS running_spend, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY placed_at) AS nth_order FROM orders;
PARTITION user=42 10 → 10 25 → 35 8 → 43 ◂ current 15 (not yet) frame = start … current PARTITION user=99 resets → 0

Each user is its own partition; the frame grows from the first row to the current one, so the sum accumulates and resets per user.

The functions worth memorising

ranking

Position in a group

ROW_NUMBER is always unique; RANK leaves gaps after ties (1,1,3); DENSE_RANKdoesn't (1,1,2). NTILE(4) buckets into quartiles.

offset

Look at a neighbour

LAG(x) reads the previous row, LEAD(x) the next — perfect for deltas (this month vs last) without a correlated subquery.

aggregate

Any aggregate, windowed

SUM, AVG, COUNT, MAX all take an OVER clause — running totals, moving averages, share-of-group.

value

Edges of the frame

FIRST_VALUE, LAST_VALUE, NTH_VALUE pull a row from a position in the frame — watch the frame, or LAST_VALUE surprises you.

Two traps the intro deck didn't cover. First: window functions run after WHERE, GROUP BY and HAVING, so you can't filter on one in the same query — wrap it in a CTE or subquery and filter outside. Second: with ORDER BY but no explicit frame, the default is RANGE, not ROWS — which lumps all peer rows with the same sort key into the current frame. For a true row-by-row running total, spell out ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
02 · CTEs & recursion 5 min

Name a subquery — then
let it call itself.

A CTE (Common Table Expression — the WITH clause) gives a subquery a name so a complex query reads top-to-bottom instead of inside-out. The advanced payoff is recursion: a CTE that references itself, which is how you walk trees and graphs — org charts, category hierarchies, dependency DAGs — in one statement.

Recursive CTE WITH RECURSIVE t AS (anchor UNION ALL recursive-member). The anchor seeds the result; the recursive member re-runs against the rows the previous step produced, until it adds no new rows. UNION (vs UNION ALL) de-duplicates and is a cheap guard against infinite loops on cyclic data.
WITH RECURSIVE reports AS ( -- anchor: the boss, depth 0 SELECT id, name, manager_id, 0 AS depth FROM employees WHERE manager_id IS NULL UNION ALL -- recursive: everyone reporting to a known row SELECT e.id, e.name, e.manager_id, r.depth + 1 FROM employees e JOIN reports r ON e.manager_id = r.id ) SELECT * FROM reports ORDER BY depth;
CEO · d0 VP · d1 VP · d1 Eng · d2 Eng · d2 anchor step 2

Each iteration joins the table to the rows found one level up — the query fans out a layer at a time until nobody else reports in.

readability

The fence is gone

Pre-Postgres 12 a CTE was an optimization fence — always materialised. Modern Postgres inlines a simple CTE into the main plan; force either way with AS MATERIALIZED / AS NOT MATERIALIZED.

cycles

Guard against loops

Graphs can have cycles. Use UNION to drop repeats, or the SQL-standard CYCLE … SET … USINGclause (Postgres 14+) to detect and stop on a revisited node.

engines

Mostly portable

WITH RECURSIVEis standard SQL — Postgres, SQL Server, Oracle, modern MySQL 8+ and SQLite all support it. The CYCLE/SEARCH sugar varies by engine.

03 · Reading query plans 6 min

EXPLAIN ANALYZE is the
only opinion that counts.

The intro deck showed Seq Scan vs Index Scan. Now go deeper: which join algorithmthe planner chose, why, and how to spot a plan that's about to fall over. The planner is a cost model — it guesses; your job is to catch it guessing wrong.

Query plan the tree of physical operators the planner picks to execute your SQL. EXPLAIN prints the plan with estimated rows and cost; EXPLAIN ANALYZE actually runs it and prints actual rows and time. Add BUFFERSto see page hits vs reads (Postgres 18 includes it by default with ANALYZE). Read the tree bottom-up, inside-out.

Three ways to join two tables

nested loop

For each outer, probe inner

Loops the outer rows and looks each up in the inner — ideally via an index. Wins when the outer side is small. Dies when both sides are large (rows × rows).

hash join

Build a hash, then probe

Builds a hash table on the smaller input, streams the larger past it. Wins on big unsorted equi-joins. Costs memory — past work_mem it spills to disk in batches.

merge join

Zip two sorted streams

Walks both inputs in sorted order, merging like a zipper. Wins when inputs are already sorted (an index or a range). Paysfor a sort if they aren't.

EXPLAIN (ANALYZE, BUFFERS) SELECT ... FROM orders o JOIN users u ON u.id = o.user_id WHERE o.placed_at > now() - '30 days'; Nested Loop (cost=.. rows=12) -- estimate: 12 -> Seq Scan on orders -- no index used rows=480000 actual rows=480000 -- est wildly off -> Index Scan on users (loops=480000) Planning 0.3 ms · Execution 3 200 ms

How to read it as broken

  • Estimate vs actual — planner expected 12 rows, got 480,000. Stale statistics; run ANALYZE to refresh them.
  • Wrong join— because it believed "12 rows", it chose a nested loop and probed the inner index 480 000 times.
  • Seq Scan on a filtered date — a missing index on placed_at.
  • The tell: high loops, a big actual-rows number, and a Sort or hash that spilled to disk.

The tooling landscape

Postgres-flavoured; most engines have an equivalent. One PRO, one CON, and when to reach for each.

EXPLAIN / EXPLAIN ANALYZE

The one query in front of you

Pro: ground truth — actual rows, timing, buffers for a single statement. Con: ANALYZE really runs the query (wrap writes in a rolled-back transaction).

Reach for it when you already know which query is slow.

pg_stat_statements

Which query is slow

Pro: aggregates total time, calls and mean per normalised query across the whole server. Con: an extension you must enable; showswhat is slow, not why.

Reach for it to find the top offenders before you tune anything.

auto_explain

Plans you never ran by hand

Pro: logs the actual plan of any query slower than a threshold, in production. Con: log volume and a little overhead; tune the threshold carefully.

Reach for it to catch intermittent slow plans you can't reproduce.

pgMustard · explain.dalibo

Visualise & grade a plan

Pro: turn a wall of text into a tree, flag the expensive node, suggest fixes. Con: only as good as the plan you paste; no server-wide view.

Reach for it when a plan is too big to read in your head.

04 · Index strategy 6 min

The right index is shaped
like your query.

You know what a B-tree index is. The craft is choosing the shape: which columns, in which order, with what extras — and knowing that every index you add is a tax on every write. Three patterns cover most of the wins.

composite

Order: equality → range → sort

In (a, b, c) only a left-prefix is usable. Put the columns you test with = first, then a single range column, then the ORDER BY column. A filter on balone can't use it.

CREATE INDEX ON orders (user_id, status, placed_at); -- serves WHERE user_id=? AND status=? -- ORDER BY placed_at
covering

Answer from the index alone

Add non-key columns with INCLUDE so the index holds everything the query selects. The planner does an index-only scan — it never touches the heap (when the visibility map is fresh).

CREATE INDEX ON orders (user_id) INCLUDE (amount, status); -- SELECT amount,status WHERE user_id=? -- → Index Only Scan, no heap fetch
partial

Index only the rows you query

A WHERE on the index itself keeps it small and cheap. Classic for a hot subset — open orders, soft-deleted rows, a single tenant — where most rows are irrelevant.

CREATE INDEX ON orders (placed_at) WHERE status = 'open'; -- tiny: indexes ~2% of the table
INSERT row heap idx (a,b,c)+write idx (status)+write one row = N+1 writes

Every index is a second structure the engine must keep in sync — so each one slows inserts, updates and deletes.

When an index hurts

  • Write amplification — one INSERT becomes one heap write plus a write to every index on the table.
  • It defeats HOT updates — Postgres can do a cheap heap-only-tuple update only if no indexed column changed. Index a hot-updated column and you force index churn + bloat.
  • Dead weight — an index nothing queries still costs disk, cache and write time. Drop the unused ones (check pg_stat_user_indexes).
  • Right type for the data — B-tree is the default, but GIN suits jsonb/arrays/full-text and BRIN suits huge naturally-ordered tables like time-series.
  • Equality columns first (the planner can seek straight to them), then one range/inequality column, then the sort column.
  • A range column "uses up" the index — columns after it can't be seeked, only filtered.
  • Don't mirror columns across many overlapping indexes; one well-ordered composite usually beats three single-column ones.
  • Always confirm with EXPLAIN— an index that isn't used is worse than none.
05 · Transactions & concurrency 6 min

Isolation is a dial between
correctness and contention.

The intro deck gave you ACID. Now the hard part of the I: what concurrent transactions are actually allowed to see, how Postgres delivers it without locking readers, and the failure modes — lost updates, write skew, deadlocks — that bite at scale.

MVCC Multi-Version Concurrency Control: every write creates a new row version rather than overwriting in place. Each version carries the transaction IDs that created and deleted it (xmin/xmax); a transaction sees only versions valid for its snapshot. The upshot: readers never block writers and writers never block readers. The cost: dead versions pile up, and VACUUM must reclaim them.

The four isolation levels

  • Read Committed (Postgres default) — each statement sees the latest committed data. Stops dirty reads; still allows non-repeatable and phantom reads.
  • Repeatable Read — a snapshot frozen at the first statement. In Postgres this is true snapshot isolation: no phantoms, but write skew slips through.
  • Serializable — behaves as if transactions ran one at a time. Postgres uses SSI and may abort a transaction with a serialization error you must retry.
  • Read Uncommitted — allows dirty reads in the standard; Postgres has no true dirty read, so it behaves as Read Committed.
PREVENTS → Read Committeddirty read ✓ Repeatable Read+ non-repeat, phantom ✓ Serializable+ write skew ✓ write skew slips past Repeatable Read — only Serializable catches it

Higher levels prevent more anomalies — and cost more aborts and contention. Pick the lowest level that's still correct.

-- Txn A -- Txn B UPDATE acct SET ... UPDATE acct SET ... WHERE id=1; WHERE id=2; -- A now wants id=2 -- B now wants id=1 UPDATE ... WHERE id=2; UPDATE ... WHERE id=1; -- each holds what the other needs → deadlock -- Postgres detects it and aborts one txn

Locks, deadlocks & queues

  • Deadlock — two transactions grab locks in opposite order. The fix is discipline: acquire rows in a consistent order (e.g. ascending id). Postgres breaks the cycle by aborting one.
  • Lost update — read-modify-write without a lock. Use SELECT … FOR UPDATE to take a row lock, or detect via an optimistic version column.
  • Job queuesFOR UPDATE SKIP LOCKED lets many workers grab different rows without blocking each other.
  • Keep transactions short — long ones hold locks, block VACUUM, and bloat the table.
06 · Partitioning & sharding 5 min

One big table is fine —
until it isn't.

Two different cuts of the same idea: split the data. Partitioning splits one table into pieces on one machine; sharding splits it across many machines. They solve different problems and have very different costs.

Partitioning one logical table physically stored as several child tables, split by a key. Declarative partitioning (Postgres 10+) supports RANGE (by date), LIST (by region) and HASH. The planner does partition pruning — skipping partitions the WHEREcan't match.
  • Why: drop old data by dropping a partition (instant, no DELETE); smaller per-partition indexes; prune irrelevant partitions at plan time.
  • Watch: the partition key must be in most queries, or you scan every partition. It is still one server.
Sharding horizontal partitioning across independent database servers, each holding a subset of rows. A shard keydecides which node owns a row. It's how you scale writes and storage past one box.
  • Why: no single machine can hold the data or take the write volume; you need horizontal scale.
  • Cost: cross-shard joins and transactions are hard, rebalancing is painful, and a bad shard key creates hotspots. Don't shard until you must.
PARTITIONING · one server orders (logical) 2024 2025 2026 SHARDING · many servers node Aid % 3 = 0 node Bid % 3 = 1 node Cid % 3 = 2

Partitioning = child tables under one server. Sharding = rows spread across independent servers behind a shard key.

Engine landscape (2026)

  • Postgres — native declarative partitioning; sharding via the Citus extension or application-level routing.
  • MySQL — built-in partitioning; Vitess is the standard sharding layer (it powers very large MySQL fleets).
  • Distributed-by-default — CockroachDB, YugabyteDB and Spanner shard and replicate for you behind a SQL interface, trading some latency for automatic horizontal scale.
07 · A worked optimization + recap 4 min

Put it together:
one slow query, fixed.

A dashboard query times out. Walk the loop you'll use for the rest of your career: measure → read the plan → change one thing → re-measure.

Before — 3.2 s
-- "latest order per active user, last 30d" SELECT u.id, o.amount, o.placed_at FROM users u JOIN orders o ON o.user_id = u.id WHERE u.status = 'active' AND o.placed_at > now() - '30 days' ORDER BY o.placed_at DESC; Seq Scan on orders · Sort spills to disk
After — 18 ms
-- 1. partial composite index, sorted CREATE INDEX ON orders (user_id, placed_at DESC) INCLUDE (amount); -- 2. partial index for the active subset CREATE INDEX ON users (id) WHERE status = 'active'; -- 3. ANALYZE so estimates are fresh ANALYZE orders, users; -- → Index Only Scan, no sort, ~180x faster

The five moves

  • Index for the access pattern — a composite (user_id, placed_at DESC) serves both the join and the sort, killing the disk sort.
  • Cover the queryINCLUDE (amount) turns it into an index-only scan, so it never visits the heap.
  • Narrow with a partial index — index only active users; the index is a fraction of the size.
  • Refresh statisticsANALYZE so the planner stops guessing and picks the right join.
  • Re-measureEXPLAIN ANALYZE again; trust the numbers, not the hunch.
Knowledge check

Did it stick?

Five questions on windows, plans, indexes, isolation, and scale — instant feedback, no sign-in.

Rate this deck
be the first

Navigate with ← → or scroll · back to library