A 34-minute working session on what happens after the model works in a notebook — pipelines, experiment tracking, serving, feature stores, monitoring for drift, and the tools that tie it all together. MLOps is really CI/CD plus data-and-model monitoring.
Training a model that scores well on your laptop is the easy part. The hard part is everything around it: getting fresh data in, reproducing the result months later, serving predictions reliably, and noticing when the model quietly goes stale. This deck is about that other 90% — the engineering that keeps a model useful in production. For the models themselves, see Machine Learning Fundamentals and Deep Learning.
MLOps is the bridge across the gap — the unglamorous plumbing that turns a one-off result into a dependable service.
About 90% of the effort in a real ML product is everything except the model code.
Three things can change (code, data, and learned weights), so there are three things to version.
There are no error messages when a model silently degrades. You only find out if you watch for it.
Software ships and is mostly done. A model ships, decays, and has to be retrained on fresh data — over and over. The job of MLOps is to turn that loop into automated, repeatable pipelines instead of a human running cells by hand.
The lifecycle closes on itself: monitoring in production is what triggers the next round of training.
The same logic as the notebook — but as wired, testable steps an orchestrator can run unattended.
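A minimal sketch of what "wired, testable steps" can look like, in plain Python with no orchestrator. Every name here (`load_data`, `validate`, `train`, `evaluate`, `run_pipeline`) is hypothetical, and the toy threshold "model" only stands in for real training; the point is that each step has explicit inputs and outputs, so it can be unit-tested and scheduled.

```python
from statistics import mean

def load_data():
    # Hypothetical stand-in for an ingestion step: (feature, label) pairs.
    return [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1)]

def validate(rows):
    # Fail fast on bad input instead of silently training on it.
    if not rows:
        raise ValueError("no rows to train on")
    for _, label in rows:
        if label not in (0, 1):
            raise ValueError(f"unexpected label: {label}")
    return rows

def train(rows):
    # Toy "model": a threshold halfway between the two class means.
    pos = mean(x for x, y in rows if y == 1)
    neg = mean(x for x, y in rows if y == 0)
    return {"threshold": (pos + neg) / 2}

def evaluate(model, rows):
    # Fraction of rows where the thresholded prediction matches the label.
    # (Scored on the training rows here only to keep the sketch short.)
    hits = sum((x > model["threshold"]) == bool(y) for x, y in rows)
    return hits / len(rows)

def run_pipeline():
    # The wiring: each step consumes the previous step's output.
    rows = validate(load_data())
    model = train(rows)
    return model, evaluate(model, rows)
```

An orchestrator would call something like `run_pipeline()` on a schedule or trigger; the steps themselves stay ordinary functions you can test one at a time.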
Data scientists run hundreds of experiments. Without a system, "the good one" is a model file on someone's laptop and nobody remembers which params produced it. Two tools fix this: experiment tracking for the search, and a model registry for the winners.
Each registered model carries a stage (Staging or Production) and a clear lineage back to the run that made it. Track every run; promote only the winner into the registry with a version and a stage.
Reproducibility needs the code (git), the data (a snapshot/hash, e.g. via DVC or a table version), and the model + params (the tracked run). Miss one and "it worked yesterday" comes back.
A registered model points back to its run, which points to the code commit and dataset version. When a prediction is questioned, you can trace the whole chain.
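To make that chain concrete, here is a rough sketch using plain dictionaries rather than any real tracking tool. The helpers `dataset_hash`, `record_run`, `register`, and `trace` are illustrative names, not MLflow or DVC APIs, and the commit string is a placeholder you would read from git.

```python
import hashlib
import json

def dataset_hash(rows):
    # Content hash of a data snapshot: change any row and the hash changes.
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def record_run(runs, run_id, commit, rows, params, metric):
    # One tracked run = code version + data version + params + result.
    runs[run_id] = {
        "commit": commit,
        "data_hash": dataset_hash(rows),
        "params": params,
        "metric": metric,
    }
    return runs[run_id]

def register(registry, name, run_id, stage="Staging"):
    # Promote a run into the registry with a version number and a stage.
    versions = registry.setdefault(name, [])
    entry = {"version": len(versions) + 1, "run_id": run_id, "stage": stage}
    versions.append(entry)
    return entry

def trace(registry, runs, name, version):
    # Walk the chain: registered model -> run -> commit and data hash.
    entry = registry[name][version - 1]
    return {"model": f"{name}:v{version}", "stage": entry["stage"], **runs[entry["run_id"]]}
```

Given a model name and version, `trace` returns the commit, data hash, params, and metric in one record, which is the question you need answered when a prediction is challenged.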
Common tools include MLflow (open-source, the common default), Weights & Biases, Neptune, and Comet for tracking; MLflow, SageMaker, and Vertex AI all ship a registry.
Inference is using a trained model to make predictions on new data. The big architectural choice is when you run it: ahead of time in batch, or on demand in real time. That one decision drives your latency budget, cost, and infrastructure.
A middle ground: the model consumes an event stream (Kafka, Kinesis, Pub/Sub) and emits predictions as events arrive — used for things like real-time anomaly detection or enriching a clickstream. It's online inference driven by events instead of synchronous HTTP calls.
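A small sketch of the streaming shape, with an ordinary Python iterable standing in for a Kafka, Kinesis, or Pub/Sub consumer. The `stream_predictions` function, the event fields `id` and `amount`, and the "more than 3x the running mean" rule are all assumptions chosen for illustration, not a recommended detector.

```python
def stream_predictions(events, ratio=3.0):
    # Consume events one at a time and emit a prediction per event.
    # The baseline is the running mean of all amounts seen so far.
    count, total = 0, 0.0
    for event in events:
        amount = event["amount"]
        baseline = total / count if count else None
        is_anomaly = baseline is not None and amount > ratio * baseline
        yield {"id": event["id"], "anomaly": is_anomaly}
        # Update state after scoring, so an event is never its own baseline.
        count += 1
        total += amount
```

Because it is a generator, predictions come out as events arrive; nothing waits for a batch to fill and no caller blocks on an HTTP response.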
Batch hides latency by precomputing; online pays latency per request to stay fresh.
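The trade-off can be sketched in a few lines. The `score` function is a made-up linear model and the feature names are hypothetical; what matters is where the model call happens.

```python
def score(features):
    # Hypothetical model: a fixed linear score over two features.
    return 0.5 * features["visits"] + 2.0 * features["purchases"]

def batch_job(feature_table):
    # Runs on a schedule: score every known user ahead of time.
    return {user_id: score(f) for user_id, f in feature_table.items()}

def serve_batch(predictions, user_id, default=0.0):
    # Request time is only a lookup: fast, but as stale as the last job.
    return predictions.get(user_id, default)

def serve_online(feature_table, user_id):
    # Request time runs the model: slower per call, always current.
    return score(feature_table[user_id])

table = {"u1": {"visits": 4, "purchases": 1}}
precomputed = batch_job(table)       # nightly job
table["u1"]["purchases"] = 3         # user buys more after the job ran
stale = serve_batch(precomputed, "u1")
fresh = serve_online(table, "u1")
```

Here `stale` still reflects the features from when the job ran, while `fresh` sees the new purchases, at the cost of doing the scoring inside the request.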
The single most common production ML bug isn't the model; it's that the features fed to it at serving time don't match the ones it saw at training time. This is training/serving skew, and a feature store exists largely to prevent it.
Two implementations of "average spend" drift apart — the model degrades and no test catches it.
One definition feeds both stores, so training and serving compute the feature identically.
The feature store is the shared spine between training and serving.
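Stripped of infrastructure, the core idea is one feature definition called from both paths. The sketch below assumes a hypothetical `avg_spend` feature that ignores refunds; `build_offline_rows` and `online_features` are illustrative names, not a real feature-store API.

```python
def avg_spend(transactions):
    # The single definition of the feature. Refunds (non-positive amounts)
    # are excluded here, and therefore excluded everywhere.
    amounts = [t["amount"] for t in transactions if t["amount"] > 0]
    return sum(amounts) / len(amounts) if amounts else 0.0

def build_offline_rows(history):
    # Training path: compute the feature for every user in the history.
    return {user_id: {"avg_spend": avg_spend(txns)} for user_id, txns in history.items()}

def online_features(recent_transactions):
    # Serving path: same function, so the value cannot drift from training.
    return {"avg_spend": avg_spend(recent_transactions)}
```

If the refund rule ever changes, it changes in one place, and both training and serving pick it up together.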
The model that was 91% accurate at launch is not 91% accurate forever. The world it learned from keeps changing. Production monitoring is how you catch that decay before your users (or your revenue) do. It builds directly on classic Observability & Monitoring — same dashboards and alerts, plus two ML-specific signals.
The first layer is plain service monitoring: request rate, p99 latency, error rates, CPU/GPU and memory. A model that times out is broken regardless of its accuracy. This is exactly the observability you'd put on any service.
Watch incoming features for broken schemas, nulls, and out-of-range values first — a renamed column silently feeding zeros is more common than true drift. Then track the input distribution versus a training reference. Standard measures: the Population Stability Index (PSI), the Kolmogorov–Smirnov test, and KL / Jensen–Shannon divergence.
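As an illustration of one of those measures, here is a dependency-free sketch of PSI for a single numeric feature. It assumes equal-width bins taken from the reference range, clamps out-of-range live values into the edge bins, and floors empty bins at a small `eps` to avoid dividing by zero; production tools often bin by quantiles instead.

```python
import math

def psi(reference, live, bins=10, eps=1e-6):
    # Population Stability Index: sum over bins of (a - e) * ln(a / e),
    # where e and a are the reference and live proportions in each bin.
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp into edge bins
        return [max(c / len(values), eps) for c in counts]

    ref_p, live_p = proportions(reference), proportions(live)
    return sum((a - e) * math.log(a / e) for e, a in zip(ref_p, live_p))
```

Identical distributions give a PSI of zero, and the value grows as the live data moves away from the reference. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but that cutoff is a convention to tune, not something this deck prescribes.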
What you actually care about is accuracy on live data. The catch: ground-truth labels arrive late (did the loan default? did the user churn?), so you often can't score quality immediately. Until labels land, watch proxy signals (the distribution of the predictions themselves, confidence scores, and business KPIs), then compute true metrics once labels arrive.
Three ways to decide it's time to retrain: scheduled (e.g. weekly — simple, predictable), performance-based (a metric drops below a threshold — ideal, but needs timely labels), and drift-based (PSI/KS crosses a limit — a useful early warning). Most teams start with a schedule and add drift triggers as they mature. Whichever fires, it should kick off the same automated pipeline from Part 2.
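The three triggers can sit behind one small decision function. This is a sketch only: `retrain_reasons` is a hypothetical name, and the defaults (7 days, a 0.85 metric floor, a 0.25 PSI limit) are placeholder thresholds to replace with your own.

```python
from datetime import datetime, timedelta

def retrain_reasons(now, last_trained, live_metric=None, psi_value=None,
                    max_age=timedelta(days=7), min_metric=0.85, max_psi=0.25):
    # Returns every trigger that fired; an empty list means "leave it alone".
    reasons = []
    if now - last_trained >= max_age:
        reasons.append("scheduled")
    # The live metric is optional because labels may not have arrived yet.
    if live_metric is not None and live_metric < min_metric:
        reasons.append("performance")
    if psi_value is not None and psi_value > max_psi:
        reasons.append("drift")
    return reasons
```

Whatever the list contains, the response is the same: if it is non-empty, start the one automated training pipeline and log which trigger fired.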
When the live input distribution drifts away from the training reference, the model is predicting on data it never really saw.
The tool list is long, but they slot into the lifecycle you already know. Below: how the four big platforms compare, then the unifying frame — MLOps = CI/CD + data/model monitoring.
The model only ships if it beats production on the gate — a build that fails its tests doesn't release.
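Such a gate can be as small as a comparison that the CI job turns into pass or fail. In this sketch the function name `should_promote`, the `auc` metric, and the `min_gain` margin are assumptions for illustration.

```python
def should_promote(candidate, production, metric="auc", min_gain=0.0):
    # Ship only if the candidate beats production on the agreed metric.
    # With no production model yet, the first candidate passes by default.
    if production is None:
        return True
    return candidate[metric] > production[metric] + min_gain

def gate_exit_code(candidate, production, **kwargs):
    # CI convention: 0 lets the release proceed, non-zero fails the build.
    return 0 if should_promote(candidate, production, **kwargs) else 1
```

A `min_gain` above zero keeps noise-level improvements from churning production; a tie does not promote.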
MLflow: open-source tracking, model registry, and packaging; framework- and cloud-agnostic.
Pro — easy to adopt, runs anywhere, no lock-in; great at the experiment/registry core.
Con — not an orchestrator or a serving platform; you bring your own pipelines and infra.
Choose when you want tracking + a registry without committing to a cloud.
Kubeflow: open-source pipelines, training operators, and serving (KServe) that run on your own Kubernetes cluster.
Pro — portable, end-to-end orchestration; no vendor lock-in if you already run K8s.
Con — heavy to operate; you own the cluster, upgrades, and the complexity.
Choose when you have a platform team and want cloud-neutral pipelines.
Amazon SageMaker: training, hosted endpoints, pipelines, feature store, registry, and model monitor, all as managed services.
Pro — covers the whole lifecycle; minimal infra to run yourself; deep AWS integration.
Con — AWS lock-in and cost; lots of surface area to learn.
Choose when you're committed to AWS and want managed over assembled.
Vertex AI: the GCP counterpart, with pipelines (built on Kubeflow Pipelines), prediction, feature store, registry, and model monitoring; tight BigQuery integration.
Pro — full lifecycle managed; strong data/analytics ties; KFP pipelines are fairly portable.
Con — GCP lock-in and cost.
Choose when your data lives in GCP/BigQuery and you want managed.