GUIDEDECK · shipping models, not just training them

MLOps & the path
from notebook to production.

A 34-minute working session on what happens after the model works in a notebook — pipelines, experiment tracking, serving, feature stores, monitoring for drift, and the tools that tie it all together. MLOps is really CI/CD plus data-and-model monitoring.

~34 MIN · BEGINNER → INTERMEDIATE · TOOL-AGNOSTIC
01 · Why MLOps 4 min

A model that works in a notebook
is maybe 10% of the job.

Training a model that scores well on your laptop is the easy part. The hard part is everything around it: getting fresh data in, reproducing the result months later, serving predictions reliably, and noticing when the model quietly goes stale. This deck is about that other 90% — the engineering that keeps a model useful in production. For the models themselves, see Machine Learning Fundamentals and Deep Learning.

MLOps (Machine Learning Operations) is the set of practices for taking ML models to production and keeping them healthy there. Think of it as DevOps for ML: the same CI/CD, versioning, and automation discipline you know from Developer Tooling, plus two things software alone never had to track: data and model quality over time.

Why ML breaks differently than code

  • Code has logic; ML has logic + data + weights. Three moving parts, each of which can change the output.
  • It fails silently. A broken API throws a 500. A stale model just returns slightly worse predictions for weeks — no error, only lost money.
  • The world drifts. Code behaves the same tomorrow; a model trained on last year's data slowly stops matching reality.
  • Reproducibility is hard. "It worked when I ran it" needs the exact data, code, params, and environment.
[Figure: the gap between notebook.ipynb (model.fit(X, y), AUC 0.92, "works on my machine") and production (serves 5k req/s, monitored, reproducible). The gap is bridged by pipelines, versioning, serving, feature stores, monitoring, and retraining.]

MLOps is the bridge across the gap — the unglamorous plumbing that turns a one-off result into a dependable service.

~90%

of the effort in a real ML product is everything except the model code.

3

the things that can change (code, data, and learned weights), so three things to version.

0

error messages when a model silently degrades. You only find out if you watch for it.

02 · The ML lifecycle & pipelines 5 min

ML is a loop,
not a one-way trip.

Software ships and is mostly done. A model ships, decays, and has to be retrained on fresh data — over and over. The job of MLOps is to turn that loop into automated, repeatable pipelines instead of a human running cells by hand.

ML pipeline: an automated, ordered sequence of steps that turns raw data into a deployed model: ingest & validate data → engineer features → train → evaluate → deploy → monitor. Each step is code, version-controlled and re-runnable, so the whole thing can run on a schedule or a trigger without anyone babysitting a notebook.
[Figure: the lifecycle loop. Data (ingest, validate) → Train (features, fit) → Evaluate (metrics, gate) → Deploy (serve) → Monitor (drift, quality). When drift is detected, retrain on fresh data.]

The lifecycle closes on itself: monitoring in production is what triggers the next round of training.

From notebook cells to a pipeline

  • Each step is a function with typed inputs/outputs — not a cell that depends on hidden notebook state.
  • Steps are cached and resumable. If training fails, re-running skips the data step that already succeeded.
  • An orchestrator runs the DAG — Airflow, Kubeflow Pipelines, Dagster, Prefect, or Metaflow schedule and retry the steps for you.
    # a pipeline = small, ordered, re-runnable steps
    # (pseudocode: @step and >> stand in for your orchestrator's API)
    @step
    def ingest() -> DataFrame: ...

    @step
    def train(df: DataFrame) -> Model: ...

    @step
    def evaluate(m: Model) -> Metrics: ...

    # deploy is a fourth step, defined like the ones above
    # the orchestrator runs the DAG on a schedule
    pipeline = ingest >> train >> evaluate >> deploy

The same logic as the notebook — but as wired, testable steps an orchestrator can run unattended.
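To make "cached and resumable" concrete, here is a minimal sketch in plain Python. It is not any orchestrator's real API: `run_step`, the in-memory `_cache`, and the toy `ingest`/`train`/`evaluate` functions are all hypothetical stand-ins. The idea is only that a step's output is keyed by its name and inputs, so a re-run skips work that already succeeded.

```python
import hashlib
import json
from typing import Any, Callable

# Hypothetical in-memory cache; a real orchestrator would persist step outputs.
_cache: dict[str, Any] = {}
calls: list[str] = []  # records which steps actually executed


def run_step(fn: Callable[..., Any], *inputs: Any) -> Any:
    """Run fn(*inputs), reusing the cached output when the inputs are unchanged."""
    key_src = json.dumps([fn.__name__, inputs], sort_keys=True, default=str)
    key = hashlib.sha256(key_src.encode()).hexdigest()
    if key not in _cache:
        calls.append(fn.__name__)
        _cache[key] = fn(*inputs)
    return _cache[key]


def ingest() -> list[float]:
    return [1.0, 2.0, 3.0, 4.0]


def train(rows: list[float]) -> dict[str, float]:
    # Stand-in "model": just the mean of the data.
    return {"mean": sum(rows) / len(rows)}


def evaluate(model: dict[str, float], rows: list[float]) -> float:
    # Mean absolute error of predicting the mean for every row.
    return sum(abs(r - model["mean"]) for r in rows) / len(rows)


def run_pipeline() -> float:
    rows = run_step(ingest)
    model = run_step(train, rows)
    return run_step(evaluate, model, rows)


first = run_pipeline()
second = run_pipeline()  # every step is served from the cache
```

After the second call, `calls` still lists each step once. Real orchestrators add persistence, retries, and scheduling on top of the same idea.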

03 · Experiment tracking & the model registry 5 min

If you can't reproduce it,
you don't really have it.

Data scientists run hundreds of experiments. Without a system, "the good one" is a model file on someone's laptop and nobody remembers which params produced it. Two tools fix this: experiment tracking for the search, and a model registry for the winners.

Experiment tracking: logging every training run's inputs and outputs (params, code version, dataset version, metrics, and the resulting model artifact) so any result can be compared and reproduced. The model registry is the next step: a versioned catalog of the models you actually promote, each with a stage like Staging or Production and a clear lineage back to the run that made it.
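    import mlflow
    import mlflow.sklearn

    with mlflow.start_run():
        mlflow.log_param("max_depth", 8)
        mlflow.set_tag("data_version", data_version)   # e.g. a DVC hash or table version
        model = train(X_train, y_train)                # train() is your own function
        mlflow.log_metric("auc", 0.91)                 # in practice, log the computed value
        mlflow.sklearn.log_model(model, "model")
    # run is now reproducible: params + data + artifact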
[Figure: tracked runs (AUC 0.88, 0.91 ★, 0.86) on the left; the registry on the right, with creditModel v7 in Staging and creditModel v6 in Production. The best run is promoted into a versioned stage.]

Track every run; promote only the winner into the registry with a version and a stage.
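The promote-the-winner flow can be sketched without any tracking library. Everything below is hypothetical: `Run`, `Registry`, and the "Archived" stage name are illustrative, not MLflow's API. It shows only the shape of the idea: many runs, one registered version per promoted run, and a single version in Production at a time.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    run_id: str
    params: dict
    auc: float


@dataclass
class Registry:
    """Hypothetical in-memory registry: version number -> run id and stage."""
    versions: dict = field(default_factory=dict)

    def register(self, run: Run, stage: str = "Staging") -> int:
        version = len(self.versions) + 1
        self.versions[version] = {"run_id": run.run_id, "stage": stage}
        return version

    def promote(self, version: int) -> None:
        # Only one version serves Production at a time; archive the old one.
        for meta in self.versions.values():
            if meta["stage"] == "Production":
                meta["stage"] = "Archived"
        self.versions[version]["stage"] = "Production"


runs = [
    Run("run-a", {"max_depth": 6}, auc=0.88),
    Run("run-b", {"max_depth": 8}, auc=0.91),
    Run("run-c", {"max_depth": 12}, auc=0.86),
]
best = max(runs, key=lambda r: r.auc)

registry = Registry()
v1 = registry.register(runs[0])
registry.promote(v1)
v2 = registry.register(best)   # lands in Staging first
registry.promote(v2)           # then replaces v1 in Production
```

Because each version stores its `run_id`, the registry entry always points back to the run that produced it.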

version this

Three things, not one

Reproducibility needs the code (git), the data (a snapshot/hash, e.g. via DVC or a table version), and the model + params (the tracked run). Miss one and "it worked yesterday" comes back.

lineage

Answer "where did this come from?"

A registered model points back to its run, which points to the code commit and dataset version. When a prediction is questioned, you can trace the whole chain.

tools

What people use

MLflow (open-source, the common default), Weights & Biases, Neptune, and Comet for tracking; MLflow, SageMaker, and Vertex AI all ship a registry.
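The "three things" and the lineage chain above can be sketched as a single fingerprint. This is an illustration under assumptions, not how DVC or MLflow compute their ids: `dataset_hash`, `run_fingerprint`, and the commit string "abc1234" are made up for the example.

```python
import hashlib
import json


def dataset_hash(rows: list[dict]) -> str:
    """Content hash of a dataset snapshot (order-sensitive, for simplicity)."""
    payload = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]


def run_fingerprint(code_commit: str, data_version: str, params: dict) -> str:
    """One id that changes whenever code, data, or params change."""
    payload = json.dumps(
        {"code": code_commit, "data": data_version, "params": params},
        sort_keys=True,
    ).encode()
    return hashlib.sha256(payload).hexdigest()[:12]


rows = [{"user": 1, "spend": 20.0}, {"user": 2, "spend": 35.5}]
params = {"max_depth": 8, "learning_rate": 0.1}

# "abc1234" stands in for a real git commit hash.
fp = run_fingerprint("abc1234", dataset_hash(rows), params)

# Lineage record: a registered model version points back to all three.
lineage = {
    "model_version": 7,
    "run_fingerprint": fp,
    "code_commit": "abc1234",
    "data_version": dataset_hash(rows),
    "params": params,
}
```

If any one of the three inputs changes, the fingerprint changes, which is exactly the property that lets you say two results came from the same recipe.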

04 · Serving models 5 min

A trained model is useless
until it can answer requests.

Inference is using a trained model to make predictions on new data. The big architectural choice is when you run it: ahead of time in batch, or on demand in real time. That one decision drives your latency budget, cost, and infrastructure.

Inference serving: making a model available to produce predictions. Batch scores a large set on a schedule and stores the results for later lookup. Online (real-time) wraps the model in an API that answers one request at a time, in milliseconds. Pick batch when predictions can be slightly stale; pick online when they must reflect the request happening right now.

Batch — score ahead of time, look up later

    # runs nightly; latency doesn't matter
    df = read_table("users")
    df["score"] = model.predict(df[FEATURES])
    write_table("user_scores", df)
    # the app just looks up a precomputed row
Use when
Predictions can be hours old — churn scores, daily recommendations, lead ranking, demand forecasts.
Wins
Simple, cheap, high throughput. No always-on service; reuse your data warehouse and a scheduler.
Cost
Predictions are stale between runs and you can't score brand-new entities until the next job.

Online — answer one live request in milliseconds

    @app.post("/predict")
    def predict(req: Features):
        x = featurize(req)              # MUST match training
        score = model.predict(x)[0]     # warm, in-memory; predict returns an array
        return {"score": float(score)}
    # target: p99 < 50ms · autoscale on RPS
Use when
The prediction must reflect the live request — fraud checks, search ranking, pricing, recommendations on the page.
Wins
Fresh, per-request answers. Scales horizontally behind a load balancer.
Cost
Always-on infra, tail latency (p99) to chase, and the model must be loaded and warm.

Streaming — predictions on a flow of events

A middle ground: the model consumes an event stream (Kafka, Kinesis, Pub/Sub) and emits predictions as events arrive — used for things like real-time anomaly detection or enriching a clickstream. It's online inference driven by events instead of synchronous HTTP calls.

  • Near-real-time, but decoupled — the producer doesn't block on the prediction.
  • Great when many events need scoring continuously and a few hundred ms of lag is fine.
[Figure: BATCH: nightly job → scores table → app lookup. ONLINE: client → model API (warm model) → prediction, with p99 in milliseconds.]

Batch hides latency by precomputing; online pays latency per request to stay fresh.

Serving infrastructure & latency levers

  • Where it runs: dedicated servers like NVIDIA Triton, TensorFlow Serving, TorchServe, BentoML, or KServe/Seldon on Kubernetes for autoscaling and rollouts.
  • Cut latency with request batching, caching, model quantization or distillation, and GPUs for big models.
  • Roll out safely with canary or shadow deploys — send a slice of traffic to the new model before the full switch.
  • Large language models are the latency-heavy extreme of this — see LLM Evals & LLMOps.
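The canary and shadow rollouts mentioned above can be sketched in a few lines. This is a hedged illustration, not the routing logic of KServe, Seldon, or any load balancer: `route` and `shadow_predict` are hypothetical helpers, and the models are plain functions.

```python
import hashlib


def route(request_id: str, canary_percent: int) -> str:
    """Deterministically send about canary_percent% of requests to the candidate."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100   # stable bucket in 0..99
    return "candidate" if bucket < canary_percent else "production"


def shadow_predict(features, production_model, candidate_model, log):
    """Shadow deploy: serve production's answer, record the candidate's for comparison."""
    served = production_model(features)
    log.append({"served": served, "shadow": candidate_model(features)})
    return served


# Usage with stand-in models (plain functions here).
ids = [f"req-{i}" for i in range(10_000)]
share = sum(route(i, 5) == "candidate" for i in ids) / len(ids)

shadow_log = []
answer = shadow_predict(3.0, lambda x: x * 2, lambda x: x * 2.1, shadow_log)
```

Hashing the request id keeps routing sticky: the same id always lands on the same model, which makes canary results easier to compare. In shadow mode the candidate never affects what the caller sees.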

05 · Feature stores & training/serving skew 5 min

The bug that scores great offline
and fails in production.

The single most common production ML bug isn't the model. It's that the features fed to it at serving time don't match the ones it saw at training time. This is training/serving skew, and a feature store exists largely to prevent it.

Feature store: a central system for defining, computing, and serving features with two synced halves: an offline store (large history, used to build training sets) and an online store (low-latency lookups for serving). Both are fed from one feature definition, so the number the model trains on is computed the same way as the number it sees in production.
Skew — two code paths
    # training (Python, batch over history)
    avg = df.groupby("user").spend.mean()

    # serving (different code, different window!)
    avg = sum(last_30_txns) / 30
    # subtly different → model sees inputs it never trained on

Two implementations of "average spend" drift apart — the model degrades and no test catches it.

Feature store — one definition
    # define the feature ONCE
    # (pseudocode: @feature_view is a placeholder, not a specific library's decorator)
    @feature_view(source=transactions)
    def avg_spend_7d(df):
        return df.rolling("7D").mean()

    # training reads OFFLINE store · serving reads ONLINE
    # same logic, both sides → no skew

One definition feeds both stores, so training and serving compute the feature identically.

[Figure: one feature definition (avg_spend_7d) feeds both the offline store (history → training sets → training pipeline) and the online store (low-latency lookups → model API). One source of truth → no skew.]

The feature store is the shared spine between training and serving.
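Stripped of infrastructure, "one definition" just means both code paths call the same function. A minimal sketch, assuming toy in-memory transactions; `build_training_row` and `serve_features` are hypothetical names, and a real feature store would materialize the values into offline and online stores instead of recomputing them.

```python
from datetime import datetime, timedelta

# Each transaction: (user_id, timestamp, amount). Toy data for illustration.
TXNS = [
    ("u1", datetime(2024, 1, 1), 10.0),
    ("u1", datetime(2024, 1, 5), 30.0),
    ("u1", datetime(2024, 1, 20), 50.0),
]


def avg_spend_7d(txns, user_id, as_of):
    """The single feature definition: mean spend in the 7 days up to as_of."""
    window_start = as_of - timedelta(days=7)
    amounts = [amt for uid, ts, amt in txns
               if uid == user_id and window_start < ts <= as_of]
    return sum(amounts) / len(amounts) if amounts else 0.0


def build_training_row(txns, user_id, label_time):
    # Offline path: compute the feature as of each historical label time.
    return {"avg_spend_7d": avg_spend_7d(txns, user_id, label_time)}


def serve_features(txns, user_id, now):
    # Online path: same function, evaluated at request time.
    return {"avg_spend_7d": avg_spend_7d(txns, user_id, now)}


offline = build_training_row(TXNS, "u1", datetime(2024, 1, 6))
online = serve_features(TXNS, "u1", datetime(2024, 1, 6))
```

For the same user and timestamp the two paths cannot disagree, because there is only one implementation of the window and the average.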

Do you actually need one?

  • Be honest: most early projects don't. A feature store is real operational weight. A single batch model with one code path has no skew to prevent.
  • It pays off when many models or teams reuse the same features, or when you serve online and must guarantee the offline/online numbers match.
  • Tools: Feast (open-source), Tecton (commercial), and the managed feature stores in SageMaker, Vertex AI, and Databricks.
  • Watch for point-in-time correctness too: training sets must use only data that existed at prediction time, or you leak the future and over-estimate accuracy.
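Point-in-time correctness from the last bullet is easiest to see as an "as-of" join. The sketch below assumes integer timestamps and a single user's feature history; `feature_as_of` is a hypothetical helper, not a feature-store API.

```python
from bisect import bisect_right

# Feature history for one user: (timestamp, value), sorted by timestamp.
# Integer timestamps keep the example short; real systems use datetimes.
FEATURE_HISTORY = [(1, 10.0), (5, 20.0), (9, 40.0)]


def feature_as_of(history, t):
    """Latest feature value with timestamp <= t, or None if none existed yet."""
    times = [ts for ts, _ in history]
    idx = bisect_right(times, t)
    return history[idx - 1][1] if idx else None


# Labeled events: (prediction_time, label).
events = [(0, 0), (5, 1), (7, 0), (12, 1)]

training_rows = [
    {"t": t, "feature": feature_as_of(FEATURE_HISTORY, t), "label": y}
    for t, y in events
]

# The leaky alternative: always join the newest value, whatever the event time.
leaky_rows = [
    {"t": t, "feature": FEATURE_HISTORY[-1][1], "label": y} for t, y in events
]
```

The event at t=7 gets the value written at t=5, not the one from t=9. The leaky join hands every row the t=9 value, which is information the model would not have had at prediction time.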

06 · Monitoring in production 5 min

Models don't crash —
they quietly rot.

The model that was 91% accurate at launch is not 91% accurate forever. The world it learned from keeps changing. Production monitoring is how you catch that decay before your users (or your revenue) do. It builds directly on classic Observability & Monitoring — same dashboards and alerts, plus two ML-specific signals.

Drift: the live world moving away from the training data. Data drift (covariate shift) is the inputs changing, for example a new user segment or a renamed category. Concept drift is the relationship between inputs and the target changing: what made a transaction "fraud" last year no longer holds. Both quietly erode accuracy.
1 · Operational metrics: is the service healthy?
Latency, throughput, and errors, the same as any API.

The first layer is plain service monitoring: request rate, p99 latency, error rates, CPU/GPU and memory. A model that times out is broken regardless of its accuracy. This is exactly the observability you'd put on any service.

2 · Data quality & data drift: are the inputs sane?
Schemas, ranges, and shifting input distributions.

Watch incoming features for broken schemas, nulls, and out-of-range values first — a renamed column silently feeding zeros is more common than true drift. Then track the input distribution versus a training reference. Standard measures: the Population Stability Index (PSI), the Kolmogorov–Smirnov test, and KL / Jensen–Shannon divergence.
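The "schemas, nulls, and ranges first" check can be very small. A minimal sketch, assuming a hand-written schema; `SCHEMA`, `validate_row`, and `null_rate` are hypothetical, and tools in this space offer far richer checks.

```python
import math

# Hypothetical schema: feature name -> (type, min, max).
SCHEMA = {
    "age": (int, 0, 120),
    "spend_7d": (float, 0.0, 1_000_000.0),
}


def validate_row(row: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the row is clean."""
    problems = []
    for name, (typ, lo, hi) in SCHEMA.items():
        if name not in row:
            problems.append(f"{name}: missing column")
            continue
        value = row[name]
        if value is None or (isinstance(value, float) and math.isnan(value)):
            problems.append(f"{name}: null")
        elif not isinstance(value, typ):
            problems.append(f"{name}: expected {typ.__name__}, got {type(value).__name__}")
        elif not lo <= value <= hi:
            problems.append(f"{name}: {value} outside [{lo}, {hi}]")
    return problems


def null_rate(rows: list[dict], name: str) -> float:
    """Share of rows where a feature is missing or None, a cheap health signal."""
    return sum(r.get(name) is None for r in rows) / len(rows)


good = validate_row({"age": 34, "spend_7d": 120.5})
bad = validate_row({"age": 150, "spend_7d": None})
renamed = validate_row({"age": 34, "spend_last_7d": 120.5})
```

The `renamed` case is the silent failure described above: an upstream column rename shows up here as a missing column instead of as zeros quietly fed to the model.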

    # Population Stability Index per feature
    psi = population_stability_index(ref=train_dist, cur=live_dist)
    if psi > 0.2:   # rule of thumb: 0.1–0.2 minor · >0.2 major shift
        alert("data drift")
        trigger_retrain()
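The snippet above calls `population_stability_index` without defining it. One possible pure-Python version is sketched here, assuming quantile bins taken from the reference sample and a small `eps` to avoid log(0); the bin count and `eps` are illustrative choices, not standard values.

```python
import math
from bisect import bisect_right


def bin_fractions(values, inner_edges):
    """Share of values per bin; values beyond the edges fall into the end bins."""
    counts = [0] * (len(inner_edges) + 1)
    for v in values:
        counts[bisect_right(inner_edges, v)] += 1
    return [c / len(values) for c in counts]


def population_stability_index(ref, cur, n_bins=10, eps=1e-6):
    """PSI = sum((cur_i - ref_i) * ln(cur_i / ref_i)) over bins.

    Bin edges are quantiles of the reference sample, a common choice.
    eps avoids log(0) for empty bins; it is an assumption, not a standard value.
    """
    ref_sorted = sorted(ref)
    inner_edges = [ref_sorted[len(ref_sorted) * i // n_bins] for i in range(1, n_bins)]
    ref_frac = bin_fractions(ref, inner_edges)
    cur_frac = bin_fractions(cur, inner_edges)
    return sum(
        (c - r) * math.log((c + eps) / (r + eps))
        for r, c in zip(ref_frac, cur_frac)
    )


train_dist = [i / 100 for i in range(1000)]           # roughly uniform on [0, 10)
same_dist = [i / 100 + 0.005 for i in range(1000)]    # same shape, new sample
shifted = [i / 100 + 3.0 for i in range(1000)]        # shifted by +3

psi_same = population_stability_index(train_dist, same_dist)
psi_shifted = population_stability_index(train_dist, shifted)
```

A sample with the same shape scores near zero, while the shifted one lands far above the 0.2 alert line. Empty bins dominate the score, so the exact value depends on `eps` and the number of bins.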
3 · Model quality & concept drift: are predictions still good?
The real goal, but the labels arrive late.

What you actually care about is accuracy on live data. The catch: ground-truth labels arrive late (did the loan default? did the user churn?), so you often can't score quality immediately. Until labels land, watch proxy signals (the distribution of the predictions themselves, confidence scores, and business KPIs), then compute true metrics once labels arrive.

4 · Retraining triggers: when to refresh the model
Scheduled, performance-based, or drift-based.

Three ways to decide it's time to retrain: scheduled (e.g. weekly — simple, predictable), performance-based (a metric drops below a threshold — ideal, but needs timely labels), and drift-based (PSI/KS crosses a limit — a useful early warning). Most teams start with a schedule and add drift triggers as they mature. Whichever fires, it should kick off the same automated pipeline from Part 2.
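The three triggers can live in one small decision function. A sketch under assumptions: the `Thresholds` values (7 days, AUC 0.85, PSI 0.2) are illustrative, and checking performance before drift before schedule is one reasonable ordering, not a rule.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Thresholds:
    # All values are illustrative assumptions; tune them per model.
    max_age: timedelta = timedelta(days=7)
    min_auc: float = 0.85
    max_psi: float = 0.2


def retrain_reason(
    last_trained: datetime,
    now: datetime,
    live_auc: Optional[float],   # None while labels have not arrived yet
    psi: float,
    t: Thresholds = Thresholds(),
) -> Optional[str]:
    """Return why to retrain, or None. Performance is checked first when available."""
    if live_auc is not None and live_auc < t.min_auc:
        return "performance"
    if psi > t.max_psi:
        return "drift"
    if now - last_trained >= t.max_age:
        return "schedule"
    return None


now = datetime(2024, 1, 10)
reason = retrain_reason(datetime(2024, 1, 9), now, live_auc=None, psi=0.3)
```

Here `reason` is "drift": labels have not landed yet, so the drift signal is what fires. Whatever the reason, the caller would start the same pipeline.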

[Figure: two overlaid distributions, training vs live (drifted). The distribution has shifted → PSI > 0.2.]

When the live input distribution drifts away from the training reference, the model is predicting on data it never really saw.

What teams actually run

  • Ops layer: Prometheus + Grafana, or your cloud's APM, reused straight from observability.
  • ML layer: Evidently (open-source), NannyML, WhyLabs, Arize, or Fiddler for drift and quality reports; SageMaker Model Monitor and Vertex AI Model Monitoring are the managed options.
  • Golden rule: log every prediction with its inputs and a request id, so you can join in the real outcome later and compute true accuracy.
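The golden rule is worth a sketch, because the join is what makes delayed labels usable. Assumptions: an in-memory dict stands in for a durable prediction log, and `log_prediction` and `live_accuracy` are hypothetical helpers.

```python
import uuid

prediction_log: dict[str, dict] = {}   # stand-in for a durable log or table


def log_prediction(features: dict, score: float, threshold: float = 0.5) -> str:
    """Store inputs and output under a fresh request id, and return the id."""
    request_id = str(uuid.uuid4())
    prediction_log[request_id] = {
        "features": features,
        "score": score,
        "predicted": int(score >= threshold),
    }
    return request_id


def live_accuracy(outcomes: dict[str, int]) -> float:
    """Join late-arriving labels on request id and score only the matched rows."""
    matched = [(prediction_log[rid]["predicted"], y)
               for rid, y in outcomes.items() if rid in prediction_log]
    if not matched:
        raise ValueError("no labels joined yet")
    return sum(p == y for p, y in matched) / len(matched)


r1 = log_prediction({"amount": 900.0}, score=0.92)
r2 = log_prediction({"amount": 12.0}, score=0.10)
r3 = log_prediction({"amount": 300.0}, score=0.70)

# Weeks later, ground truth arrives for two of the three requests.
accuracy = live_accuracy({r1: 1, r3: 0})
```

Only the requests with known outcomes are scored, so the metric firms up as more labels arrive. Keeping the inputs in the log also lets you replay them through a candidate model later.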

07 · Tooling, CI/CD-for-ML & recap 5 min

The landscape, and the one idea
under all of it.

The tool list is long, but they slot into the lifecycle you already know. Below: how the four big platforms compare, then the unifying frame — MLOps = CI/CD + data/model monitoring.

CI/CD for ML extends the developer-tooling pipeline you know. CI tests code and data, then trains and evaluates a candidate. CD promotes it only if it beats the current production model on a held-out set. Running that retraining automatically, on a schedule or a trigger, is often called continuous training (CT). The model is a build artifact that has to pass a quality gate, just like any release.
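    # GitHub Actions style workflow; dependency setup omitted
    on: [push]
    jobs:
      train-and-gate:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: pytest tests/          # code + data checks
          - run: dvc repro              # reproduce pipeline
          - run: python evaluate.py     # gate vs production; non-zero exit stops the job
          - run: python register.py     # promote if it wins (hypothetical script that calls mlflow.register_model)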
[Figure: test code + data → train → evaluate → gate ("beats prod?") → promote, or reject.]

The model only ships if it beats production on the gate — a build that fails its tests doesn't release.
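What a gate script like `evaluate.py` might contain, as a hedged sketch: the function names, the `min_gain` margin of 0.002, and the metric values are all assumptions. The essential part is that the script turns a metric comparison into an exit code the CI job can act on.

```python
def passes_gate(candidate: dict, production: dict,
                metric: str = "auc", min_gain: float = 0.002) -> bool:
    """True only if the candidate beats production by at least min_gain.

    min_gain is an assumed noise margin so a coin-flip improvement does not ship.
    """
    return candidate[metric] >= production[metric] + min_gain


def main(candidate: dict, production: dict) -> int:
    """Return a process exit code: 0 lets CI continue, 1 fails the job."""
    if passes_gate(candidate, production):
        print(f"promote: {candidate['auc']:.3f} vs {production['auc']:.3f}")
        return 0
    print(f"reject: {candidate['auc']:.3f} vs {production['auc']:.3f}")
    return 1


# In a real job both metrics would be computed on the same held-out set;
# the numbers here are placeholders.
exit_code = main({"auc": 0.915}, {"auc": 0.905})
# A CI script would end with: sys.exit(exit_code)
```

Both models must be scored on the same held-out data, otherwise the comparison says more about the data split than about the models.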

The tooling landscape — how to choose

MLflow

The lightweight default

Open-source tracking, model registry, and packaging — framework- and cloud-agnostic.

Pro — easy to adopt, runs anywhere, no lock-in; great at the experiment/registry core.

Con — not an orchestrator or a serving platform; you bring your own pipelines and infra.

Choose when you want tracking + a registry without committing to a cloud.

Kubeflow

The Kubernetes-native platform

Open-source pipelines, training operators, and serving (KServe) that run on your own Kubernetes cluster.

Pro — portable, end-to-end orchestration; no vendor lock-in if you already run K8s.

Con — heavy to operate; you own the cluster, upgrades, and the complexity.

Choose when you have a platform team and want cloud-neutral pipelines.

Amazon SageMaker

Managed, end-to-end on AWS

Training, hosted endpoints, pipelines, feature store, registry, and model monitor — all managed services.

Pro — covers the whole lifecycle; minimal infra to run yourself; deep AWS integration.

Con — AWS lock-in and cost; lots of surface area to learn.

Choose when you're committed to AWS and want managed over assembled.

Google Vertex AI

Managed, end-to-end on GCP

The GCP counterpart — pipelines (built on Kubeflow Pipelines), prediction, feature store, registry, and model monitoring; tight BigQuery integration.

Pro — full lifecycle managed; strong data/analytics ties; KFP pipelines are fairly portable.

Con — GCP lock-in and cost.

Choose when your data lives in GCP/BigQuery and you want managed.

A simpler decision than the matrix suggests

  • Start small. A scheduled batch job + MLflow + a cron trigger solves a surprising number of real problems. Don't buy a platform for one model.
  • Follow your data's gravity. If everything is in AWS, SageMaker; in GCP, Vertex AI. Fighting your cloud rarely pays.
  • Reach for Kubeflow only with a platform team and a multi-cloud or on-prem requirement that justifies running it.
  • The mature LLM cousin — evals, prompt versioning, guardrails — lives in LLM Evals & LLMOps.

Five rules to walk out with

1 · The model is 10% of the work. MLOps is the other 90%: pipelines, serving, and monitoring.
2 · Version code, data, and model so every result is reproducible and traceable.
3 · Pick batch vs online deliberately: it sets your latency, cost, and infra.
4 · Kill training/serving skew with one feature definition for both sides.
5 · Monitor for drift and retrain: models rot silently, so the loop never ends.