Library
00/07 · ~34 min
GUIDEDECK · for moving data you can trust

ETL & ELT
Pipelines — getting data
from where it lives to where it's used.

A 34-minute working session on data pipelines: extract, transform and load; why ELT changed the game; batch vs streaming; and the idempotency, orchestration and quality checks that make a pipeline something you can sleep through.

~34 MINDATA TEAMTOOL-AGNOSTIC
SCROLL
01 · Why pipelines exist 4 min

Your data is scattered.
The questions need it together.

Orders live in the app database, customers in the CRM, charges in the payment processor, ad spend in a dozen SaaS dashboards. Every useful question — "what's last week's margin by region?" — needs those joined, cleaned and trustworthy. A pipeline is how that happens on purpose instead of by hand.

Data pipeline a repeatable, automated flow that moves data from source systems to a destination(usually a data warehouse), reshaping it along the way so it's ready to query. The point isn't the one-time copy — it's that it runs again tomorrow, on a schedule, observable and re-runnable.
N

source systems, each with its own schema, format and quirks.

1

place analysts and dashboards actually query — the warehouse.

24/7

new rows keep arriving; yesterday's export is already stale.

schemas drift, APIs change. Design for breakage, not for a perfect snapshot.

The hand-rolled trap

  • A one-off script run from someone's laptop has no history — re-run it and last week is gone.
  • It hits production directly, competing with real users for the database.
  • When a column is renamed it fails silently, and the numbers in the deck are quietly wrong.
App DB · orders CRM · customers Payments API Ad spend CSV pipeline E · T · L Warehouse one source of truth → dashboards · ML · reports

Many messy sources fan into one pipeline and land in a single warehouse the whole company can trust.

By hand — fragile & invisible
// Monday morning: finance needs last week's revenue const rows = await prod.query("SELECT * FROM orders") // on the live DB sendEmail("finance@co", toCSV(rows)) // no history · no schema check · breaks on a rename · run by a human
A pipeline — defined once, runs itself
extract("orders", { from: "replica" }) // not prod .transform(cleanRevenue) // typed + tested .load("warehouse.revenue") // versioned, full history // scheduled · monitored · re-runnable · alerts on failure
02 · The three stages 6 min

Three jobs, in order:
extract, transform, load.

Every pipeline — whatever the tool — is doing these three things. Get the words straight and the rest of the talk is just where and when you run them.

ETL Extract, Transform, Load — names the three stages of moving data: pull it out of the source (extract), reshape and clean it (transform), then write it to the destination (load). The order of those words is exactly the argument we'll have in Part 3.
E
Extract
read from the source
T
Transform
clean, shape, enrich
L
Load
write to the destination
Stage 1 · Extract

Read it out, gently

Pull records from databases, APIs, files or event streams. The two hard parts: don't hammer the source, and only grab what's new — track a watermark (last id or timestamp) so you read the delta, not the whole table.

// incremental: only rows since last run extract({ table: "orders", since: lastRun.watermark // e.g. updated_at })
Stage 2 · Transform

Make it correct & usable

Cast types, fix encodings, drop duplicates, join reference data, apply business rules. This is where data goes from raw to trustworthy — and where most bugs live, so it should be typed and tested.

clean(o) { amount: Money.parse(o.amount), placedAt: Date.parse(o.created), region: regionOf(o.country) // enrich }
Stage 3 · Load

Write it where it's queried

Land the result in the warehouse. Prefer upserts over blind inserts, write in batches, and make the write idempotentso a retry can't double-count (Part 5).

load({ into: "warehouse.orders", key: "order_id", // upsert key mode: "merge" })

Transform is where it goes wrong

Implicit & lossy
// transform tangled into the read, no types rows.forEach(r => { r.amount = r.amount + "" // stringify money → lose precision r.date = r.created.slice(0,10) // assumes a format that'll change }) // bad rows pass through silently and poison the warehouse
Explicit, typed, fail-loud
function clean(o: RawOrder): Order { return { amount: Money.parse(o.amount), // typed money placedAt: Date.parse(o.created_at), // throws if invalid region: regionOf(o.country), // validated lookup } }

Like a warehouse receiving dock: extract is the truck arriving, transform is inspecting and labelling the goods, load is putting them on the right shelf.

03 · The order that changed 6 min

Move the T after the L,
and the economics flip.

ETL and ELT use the same three stages — they just disagree on whento transform. That one swap is the biggest shift in modern data engineering, and it's driven entirely by what the warehouse can now do.

ELT Extract, Load, Transform — loads raw data into the warehouse first, then transforms it inside the warehouse using SQL. ETL transforms before loading, on a separate machine. Same letters, reordered — but the consequences are large.
ETL — transform before load Source Transformown server Warehouse clean only ELT — load raw, transform in place Source Warehouse raw (landing) modelled (SQL) transform = SQL the warehouse runs at scale

ETL transforms on a separate box and loads only the finished shape. ELT lands the raw data, then lets the warehouse do the transform in SQL.

Why ELT won

The warehouse got cheap and huge

  • These warehouses are columnar / OLAP stores — cloud (Snowflake, BigQuery, Redshift) or self-hosted (ClickHouse) — that separate storage from compute, so keeping raw data is nearly free and scans over billions of rows stay fast.
  • Their query engines transform faster than a hand-built ETL box ever could, and scale on demand.
  • Keeping the raw copy means you can re-transform when requirements change — no re-extract.
  • Transforms become plain SQL in version control (the dbt model), so analysts own them, not just engineers.
When ETL still wins

Transform-before-load earns its keep

  • Compliance — you must mask or drop PII before it ever lands in the warehouse.
  • Heavy reduction— aggregating 1 TB of logs down to a summary so you never store the firehose.
  • Legacy targets — an on-prem warehouse with no spare compute to run transforms.
  • Streaming — transforming events in flight, before they hit any store (Part 4).

The modern default is ELT: extract, load raw, then transform in the warehouse with version-controlled SQL. Reach for ETL when privacy, volume or the target forces the transform earlier.

04 · When data moves 5 min

Process it in chunks,
or one event at a time.

ETL vs ELT was about where you transform. This is a separate question: how often data moves. Do you wait and move a big pile on a schedule (say, every night), or handle each new record the moment it shows up? Most companies end up needing both — different questions can tolerate different delays.

Batch processes a bounded set of records on a schedule (every hour, every night). Streaming processes an unbounded flow of events continuously, seconds after each one happens. Micro-batch splits the difference — tiny batches every few seconds.
BATCH — scheduled windows run 02:00 whole window collect all day… STREAMING — per event act now handled within seconds

Batch waits, then processes a whole window at once. Streaming acts on each event as it lands.

Pick by the question, not the hype

  • Batch — daily revenue, monthly churn, ML training sets. Simpler, cheaper, easy to reprocess. Latency: minutes to hours.
  • Streaming— fraud blocks, live dashboards, "notify me now". Harder to operate. Latency: seconds.
  • Micro-batch— the pragmatic middle most "real-time" systems actually run.
  • Default to batch; add streaming only where the business pays for the extra seconds.

Don't reprocess the whole world every run

Full reload every night
// re-read and rewrite the entire history, daily TRUNCATE warehouse.orders; INSERT INTO warehouse.orders SELECT * FROM source.orders; // 5 years of rows, every time // slow, expensive, hammers the source — and grows forever
Incremental by watermark
// only the rows that changed since last run MERGE INTO warehouse.orders AS t USING ( SELECT * FROM source.orders WHERE updated_at > :last_watermark // the delta ) s ON t.order_id = s.order_id WHEN MATCHED ... WHEN NOT MATCHED ... // upsert
05 · Pipelines that survive reruns 6 min

A pipeline will fail mid-flight.
Running it again must be safe.

Networks drop, nodes die, a deploy kills a job at 90%. The question isn't whether to retry — it's whether a retry corrupts your data. That property has a name: idempotency.

Idempotency running the same step twice produces the same result as running it once. An idempotent load can be retried, re-scheduled or backfilled without double-counting. It's the single most important property of a production pipeline.
INSERT + retry → duplicates load #1 retry order 42 · $30 order 42 · $30 counted twice ✕ MERGE on id + retry → same load #1 order 42 · $30 one row ✓

A retried INSERT double-counts order 42. A MERGE keyed on order_id lands the same single row every time.

How to make a step idempotent

  • Key your writes. Upsert/MERGE on a business key instead of blind INSERT.
  • Partition by time. Re-running "2026-06-27" should replacethat day's partition, not append to it.
  • Carry a run id. Tag rows so a failed run can be deleted cleanly before retry.
  • Deterministic transforms. No now() or random ids baked into output.
Append-only — retry doubles it
async function load(batch) { for (const row of batch) await db.insert("orders", row) // blind append } // crash after row 800 of 1000 → retry re-inserts 1–800
Keyed upsert — retry is a no-op
async function load(batch) { await db.upsert("orders", batch, { key: "order_id" // same id ⇒ same row }) } // retry from row 1 → rows 1–800 just overwrite themselves

Two operations idempotency unlocks

Retries

On a transient failure the orchestrator just runs the task again. Because the load is idempotent, automatic retries are safe — no human paged at 3 a.m. to "clean up duplicates" first.

Backfills

A backfill re-runs the pipeline over past dates — to fix a bug, add a new column, or load history. Idempotent, time-partitioned steps let you replay last March without disturbing anything else.

06 · Running it for real 5 min

Order the steps, then
prove the data is right.

A pipeline is many tasks with dependencies, run on a schedule, that must be retried, monitored and trusted. Two disciplines make that sane: orchestration and data quality.

Orchestration scheduling and coordinating tasks as a DAG (directed acyclic graph) of dependencies, so each step runs only after its inputs are ready, retries on failure, and the whole run is observable. Airflow, Dagster and Prefect are the common tools.
extract_orders extract_crm transform check publish fail → stop, alert pass → publish

Each node is a task; edges are dependencies. The quality check gates publish — bad data never reaches users.

What the orchestrator gives you

  • Dependencies — transform runs only after both extracts succeed.
  • Scheduling & retries — cron-like triggers with automatic, bounded retries.
  • Observability — run history, logs, durations, and alerts when a task fails or runs late.
  • Backfills — re-run a date range from one command (Part 5).

Data quality — test the data, not just the code

Code tests prove your transform logic. Quality checks prove the data is sane on every run — and gate the pipeline so bad data fails loudly instead of leaking into dashboards.

S
Schema
The shape is what we expect.
+

Columns exist, types match, no surprise renames. A source adding or dropping a field should fail the run, not silently null out a report.

EXPECT columns(orders) = [order_id, amount, placed_at, region] EXPECT type(amount) = DECIMAL
F
Freshness
The data is recent enough.
+

The newest row is younger than the SLA. Stale data is a silent failure — the dashboard still renders, it's just lying.

EXPECT max(placed_at) > now() - INTERVAL '2 hours'
V
Volume
The row count is in a sane range.
+

Today's batch is within a tolerance of normal. A run that loads 0 rows (or 50×) usually means a broken upstream, not a record sales day.

EXPECT row_count BETWEEN 0.5 × avg_7d AND 2 × avg_7d
U
Uniqueness & nulls
Keys are unique; required fields are present.
+

The primary key is actually unique (catches the duplicate bug from Part 5) and not-null columns are never null. The cheapest checks, the highest catch rate.

EXPECT unique(order_id) EXPECT not_null(order_id, amount)
R
Referential
Joins won't silently drop rows.
+

Every customer_id in orders exists in the customers table. Orphaned keys quietly disappear in an inner join — and the revenue disappears with them.

EXPECT orders.customer_id IN customers.id // no orphans

Lineage — know where a number came from

Lineage is the map of how each table and column was derived from its sources. When a dashboard looks wrong, lineage lets you trace it back through every transform to the offending source — and see what else breaks if you change a column.

  • Debugging — trace a bad metric to its root in minutes, not days.
  • Impact analysis — see every downstream report before you rename a column.
  • Trust & compliance — prove where a regulated number came from.

The tooling landscape — who does what

You almost never build all of this by hand. A modern stack is usually three jobs handed to specialized tools: a connector tool to extract and load, dbt to transform, and one orchestratorto run the whole thing on a schedule. Here's where the leading tools sit and what each is good (and bad) at.

Sourcesapps · APIs Ingest · E+LAirbyte / Fivetran Warehouseraw Transformdbt · SQL Modelleddashboards + quality checks (dbt tests · Great Expectations) Orchestration Airflow · Dagster · Prefect — schedule, retry, backfill, monitor one conductor drives every stage above

The orchestrator (bottom) conducts; connectors handle E+L, dbt handles the T inside the warehouse, and quality checks gate what reaches dashboards.

Orchestrator

Apache Airflow

Pro — the industry default: huge community, runs almost anywhere, an integration for everything.
Con — heavy Python boilerplate, and it schedules tasks without really understanding your data.

Orchestrator

Dagster

Pro — thinks in data assets (the tables you produce), with typed inputs/outputs and great local testing.
Con — younger and smaller than Airflow, and a newer way of thinking to learn.

Orchestrator

Prefect

Pro — lightweight and pure-Python; turning an existing script into a managed flow is quick.
Con — fewer batteries included; you assemble more of the surrounding stack yourself.

Ingestion · E+L

Airbyte

Pro — hundreds of ready-made connectors (pre-built source readers), open-source and self-hostable.
Con — can be heavy to run yourself, and connector quality varies from source to source.

Ingestion · E+L

Fivetran

Pro— fully managed connectors: set it up once and there's essentially no maintenance.
Con — priced by how much data it moves, so the bill can climb fast at high volume.

Transform

dbt

Pro — transforms as version-controlled SQL with built-in tests and docs, so analysts can own them.
Con — transform only: it doesn't extract, load, or schedule itself — it needs the tools above.

How to choose— Most teams don't pick one tool; they combine three layers. Use a connectortool for E+L (Fivetran if you'd rather pay than maintain, Airbyte if you want open-source and control), dbt for the transform, and one orchestrator to tie it together: Airflow for the safe, well-staffed default, Dagster when data-asset awareness and testing matter most, Prefectwhen you mainly want to schedule Python with little ceremony. The real rule: don't run more tools than your team can actually operate.
07 · Putting it together 2 min

From a fragile script
to a pipeline you can trust.

Same job — move orders into the warehouse — written the way it usually starts, then the way it should end up.

The 2 a.m. pager pipeline
// runs from a laptop, no schedule, no checks rows = prod.query("SELECT * FROM orders") // full table, on prod rows.forEach(r => warehouse.insert(r)) // blind append // crash midway ⇒ dups · rename ⇒ silent nulls // no history · no alert · no idea if it's right
The sleep-through-it pipeline
extract("orders", { from: "replica", since: watermark }) .transform(clean) // typed · tested .check([schema, freshness, unique]) // gate .load({ key: "order_id", mode: "merge" }) // idempotent // orchestrated · incremental · re-runnable · alerts on fail
1A pipeline is a product. Scheduled, observable, re-runnable — not a script someone runs by hand.
2E, T, L — then argue about order. ELT (load raw, transform in the warehouse) is the modern default; ETL when privacy or volume forces the transform earlier.
3Default to batch & incremental. Read deltas by watermark; add streaming only where seconds are worth the cost.
4Make every step idempotent. Keyed upserts and time-partitions turn retries and backfills into non-events.
5Gate on data quality. Test the data, not just the code, and keep lineage so a wrong number is traceable.

Keep going

  • dbt — the standard for ELT transforms as version-controlled SQL
  • Apache Airflow / Dagster— orchestration & scheduling
  • The Data Engineering Cookbook & Fundamentals of Data Engineering(Reis & Housley)
  • Great Expectations — data quality checks as code

One sentence to remember

"Move data once, on purpose, in a way you can safely run again."

— the whole talk, compressed

Knowledge check

Did it stick?

Five quick questions on ETL vs ELT, batch vs streaming, idempotency and data quality — instant feedback, no sign-in.

Rate this deck
be the first

Navigate with ← → or scroll · back to library