A 34-minute working session on data pipelines: extract, transform and load; why ELT changed the game; batch vs streaming; and the idempotency, orchestration and quality checks that make a pipeline something you can sleep through.
Orders live in the app database, customers in the CRM, charges in the payment processor, ad spend in a dozen SaaS dashboards. Every useful question — "what's last week's margin by region?" — needs those joined, cleaned and trustworthy. A pipeline is how that happens on purpose instead of by hand.
source systems, each with its own schema, format and quirks.
place analysts and dashboards actually query — the warehouse.
new rows keep arriving; yesterday's export is already stale.
schemas drift, APIs change. Design for breakage, not for a perfect snapshot.
Many messy sources fan into one pipeline and land in a single warehouse the whole company can trust.
Every pipeline — whatever the tool — is doing these three things. Get the words straight and the rest of the talk is just where and when you run them.
Pull records from databases, APIs, files or event streams. The two hard parts: don't hammer the source, and only grab what's new — track a watermark (last id or timestamp) so you read the delta, not the whole table.
Cast types, fix encodings, drop duplicates, join reference data, apply business rules. This is where data goes from raw to trustworthy — and where most bugs live, so it should be typed and tested.
Land the result in the warehouse. Prefer upserts over blind inserts, write in batches, and make the write idempotentso a retry can't double-count (Part 5).
Like a warehouse receiving dock: extract is the truck arriving, transform is inspecting and labelling the goods, load is putting them on the right shelf.
ETL and ELT use the same three stages — they just disagree on whento transform. That one swap is the biggest shift in modern data engineering, and it's driven entirely by what the warehouse can now do.
ETL transforms on a separate box and loads only the finished shape. ELT lands the raw data, then lets the warehouse do the transform in SQL.
The modern default is ELT: extract, load raw, then transform in the warehouse with version-controlled SQL. Reach for ETL when privacy, volume or the target forces the transform earlier.
ETL vs ELT was about where you transform. This is a separate question: how often data moves. Do you wait and move a big pile on a schedule (say, every night), or handle each new record the moment it shows up? Most companies end up needing both — different questions can tolerate different delays.
Batch waits, then processes a whole window at once. Streaming acts on each event as it lands.
Networks drop, nodes die, a deploy kills a job at 90%. The question isn't whether to retry — it's whether a retry corrupts your data. That property has a name: idempotency.
A retried INSERT double-counts order 42. A MERGE keyed on order_id lands the same single row every time.
INSERT.now() or random ids baked into output.On a transient failure the orchestrator just runs the task again. Because the load is idempotent, automatic retries are safe — no human paged at 3 a.m. to "clean up duplicates" first.
A backfill re-runs the pipeline over past dates — to fix a bug, add a new column, or load history. Idempotent, time-partitioned steps let you replay last March without disturbing anything else.
A pipeline is many tasks with dependencies, run on a schedule, that must be retried, monitored and trusted. Two disciplines make that sane: orchestration and data quality.
Each node is a task; edges are dependencies. The quality check gates publish — bad data never reaches users.
Code tests prove your transform logic. Quality checks prove the data is sane on every run — and gate the pipeline so bad data fails loudly instead of leaking into dashboards.
Columns exist, types match, no surprise renames. A source adding or dropping a field should fail the run, not silently null out a report.
The newest row is younger than the SLA. Stale data is a silent failure — the dashboard still renders, it's just lying.
Today's batch is within a tolerance of normal. A run that loads 0 rows (or 50×) usually means a broken upstream, not a record sales day.
The primary key is actually unique (catches the duplicate bug from Part 5) and not-null columns are never null. The cheapest checks, the highest catch rate.
Every customer_id in orders exists in the customers table. Orphaned keys quietly disappear in an inner join — and the revenue disappears with them.
Lineage is the map of how each table and column was derived from its sources. When a dashboard looks wrong, lineage lets you trace it back through every transform to the offending source — and see what else breaks if you change a column.
You almost never build all of this by hand. A modern stack is usually three jobs handed to specialized tools: a connector tool to extract and load, dbt to transform, and one orchestratorto run the whole thing on a schedule. Here's where the leading tools sit and what each is good (and bad) at.
The orchestrator (bottom) conducts; connectors handle E+L, dbt handles the T inside the warehouse, and quality checks gate what reaches dashboards.
Pro — the industry default: huge community, runs almost anywhere, an integration for everything.
Con — heavy Python boilerplate, and it schedules tasks without really understanding your data.
Pro — thinks in data assets (the tables you produce), with typed inputs/outputs and great local testing.
Con — younger and smaller than Airflow, and a newer way of thinking to learn.
Pro — lightweight and pure-Python; turning an existing script into a managed flow is quick.
Con — fewer batteries included; you assemble more of the surrounding stack yourself.
Pro — hundreds of ready-made connectors (pre-built source readers), open-source and self-hostable.
Con — can be heavy to run yourself, and connector quality varies from source to source.
Pro— fully managed connectors: set it up once and there's essentially no maintenance.
Con — priced by how much data it moves, so the bill can climb fast at high volume.
Pro — transforms as version-controlled SQL with built-in tests and docs, so analysts can own them.
Con — transform only: it doesn't extract, load, or schedule itself — it needs the tools above.
Same job — move orders into the warehouse — written the way it usually starts, then the way it should end up.
"Move data once, on purpose, in a way you can safely run again."
— the whole talk, compressed
Five quick questions on ETL vs ELT, batch vs streaming, idempotency and data quality — instant feedback, no sign-in.
Navigate with ← → or scroll · back to library