Ch 1 — Why MLOps Matters

Technical debt in ML, the MLOps lifecycle, and maturity levels
High Level: Notebook → Debt → Lifecycle → Maturity → Culture → MLOps
The Notebook-to-Production Gap
Why most ML models never make it to production
The Problem
Data scientists build models in Jupyter notebooks. The model works. Accuracy is great. Then comes the question: “How do we put this in production?” And everything falls apart. According to Gartner, only about 53% of AI projects make it from prototype to production (as of 2023). The gap between “model works in a notebook” and “model reliably serves predictions at scale” is enormous. MLOps exists to bridge this gap — it’s the set of practices, tools, and culture that make ML systems reliable, reproducible, and maintainable in production.
The Gap
// The notebook-to-production gap
Notebook:
  ✓ Single dataset, static
  ✓ Manual feature engineering
  ✓ Train once, evaluate once
  ✓ “It works on my machine”
Production:
  ✗ Data changes daily
  ✗ Features must be computed in real-time
  ✗ Model must retrain automatically
  ✗ Must handle 10K+ requests/sec
  ✗ Must monitor for drift
  ✗ Must roll back if broken
  ✗ Must be auditable & reproducible
Key insight: ML code is typically only 5–10% of a production ML system. The rest is data pipelines, serving infrastructure, monitoring, configuration, and testing. This is the central finding of Google’s “Hidden Technical Debt” paper.
Hidden Technical Debt in ML Systems
Sculley et al. (NeurIPS 2015) — the paper that launched MLOps
The Google Paper
In 2015, D. Sculley and colleagues at Google published “Hidden Technical Debt in Machine Learning Systems” at NeurIPS. The paper argued that ML systems have a special capacity for incurring technical debt because they have all the maintenance problems of traditional software plus a set of ML-specific issues. Key debt categories: boundary erosion (no strict API contracts between components), entanglement (changing one feature affects all others — CACE: “Changing Anything Changes Everything”), hidden feedback loops (model predictions influence future training data), undeclared consumers (other systems silently depend on your model), and data dependency debt (harder to track than code dependencies).
ML-Specific Debt
// Hidden Technical Debt (Sculley et al., 2015)
CACE Principle: Changing Anything Changes Everything
  → Add 1 feature → all predictions shift
Feedback Loops:
  Model predicts → user acts → new data
  → Model trains on influenced data
  → Predictions drift silently
Data Dependencies:
  - Unstable data sources
  - Underutilized features
  - Legacy features no one removes
  - Correlated features masking bugs
Configuration Debt:
  Hyperparams, thresholds, feature flags
  → Often more lines than model code
Key insight: The paper’s most famous diagram shows that ML code is a tiny box surrounded by massive boxes for data collection, verification, feature extraction, serving infrastructure, monitoring, and configuration. MLOps addresses every one of those surrounding boxes.
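The feedback-loop debt described above can be made concrete with a toy simulation. This is a deliberately simplified sketch (hypothetical numbers, standard library only): a model whose own decisions filter the data it later trains on will drift without a single line of code changing.

```python
# Toy hidden-feedback-loop simulation: the model's threshold decides which
# examples get logged, so each retraining sees a biased sample of the world.
import random

random.seed(0)

def retrain(scores):
    """'Train' by setting the threshold to the mean of observed scores."""
    return sum(scores) / len(scores)

threshold = 0.5
for round_num in range(5):
    # True population: scores uniform in [0, 1] — it never changes.
    population = [random.random() for _ in range(1000)]
    # Only examples the model approved get logged -> biased training set.
    observed = [s for s in population if s >= threshold]
    threshold = retrain(observed)
    print(f"round {round_num}: threshold drifted to {threshold:.3f}")
```

The population is stationary, yet the threshold ratchets upward every round because the model only ever sees data it already approved — exactly the silent drift Sculley et al. warn about.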
The MLOps Lifecycle
A continuous loop, not a one-time pipeline
The Continuous Loop
MLOps is not a linear pipeline — it’s a continuous loop. The lifecycle has roughly 7 stages: Data management (collection, labeling, versioning), experimentation (feature engineering, model selection, hyperparameter tuning), model development (training, evaluation, validation), deployment (packaging, serving, A/B testing), monitoring (performance tracking, drift detection), retraining (triggered by drift or schedule), and governance (audit trails, compliance, model cards). The key difference from DevOps: in MLOps, data is a first-class citizen. Code changes are only half the story — data changes can break a model just as easily.
MLOps Lifecycle
// The MLOps continuous loop
┌─────────────────────────┐
│ 1. Data Management      │
│  collect, label, version│
└─────────┬───────────────┘
          ▼
┌─────────────────────────┐
│ 2. Experimentation      │
│  features, tuning       │
└─────────┬───────────────┘
          ▼
┌─────────────────────────┐
│ 3. Training             │
│  train, evaluate        │
└─────────┬───────────────┘
          ▼
┌─────────────────────────┐
│ 4. Deployment           │
│  serve, A/B test        │
└─────────┬───────────────┘
          ▼
┌─────────────────────────┐
│ 5. Monitoring           │◄─┐
│  drift, performance     │  │
└─────────┬───────────────┘  │
          ▼                  │
┌─────────────────────────┐  │
│ 6. Retrain              │──┘
└─────────────────────────┘
Key insight: In traditional software, you deploy and you’re done (until the next feature). In ML, deployment is just the beginning — the model will degrade over time as the world changes, and the loop must keep spinning.
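The monitor-then-retrain step that keeps the loop spinning can be sketched in a few lines. This is a stand-in, not a real system — the `evaluate`/`retrain` functions and the 0.90 accuracy floor are illustrative assumptions; production setups would use a scheduler, a model registry, and real training code.

```python
# Minimal sketch of the monitoring -> continuous-training step of the loop.
ACCURACY_FLOOR = 0.90  # assumed alert threshold

def evaluate(model, batch):
    """Fraction of correct predictions on a labeled batch."""
    return sum(model(x) == y for x, y in batch) / len(batch)

def retrain(batch):
    """Stand-in 'training': memorize the latest labels."""
    lookup = {x: y for x, y in batch}
    return lambda x: lookup.get(x, 0)

model = lambda x: 0                       # stale model: always predicts 0
batch = [(i, i % 2) for i in range(10)]   # the world changed: half the labels are 1

if evaluate(model, batch) < ACCURACY_FLOOR:  # monitoring detects degradation...
    model = retrain(batch)                   # ...and triggers continuous training

print(f"post-retrain accuracy: {evaluate(model, batch):.2f}")
```

The stale model scores 0.50 on the shifted batch, the monitor trips the floor, and retraining restores accuracy — the loop closed once; in production it never stops.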
MLOps vs. DevOps vs. DataOps
Related but distinct disciplines
Three Ops Compared
DevOps automates the software delivery lifecycle: build, test, deploy, monitor. DataOps automates data pipelines: ingestion, transformation, quality, delivery. MLOps combines both and adds ML-specific concerns: experiment tracking, model versioning, feature stores, drift detection, and retraining triggers. MLOps inherits CI/CD from DevOps and data quality from DataOps, but adds CT (Continuous Training) — the automated retraining of models when data or performance changes. This is the unique “third pillar” that doesn’t exist in traditional software.
Comparison
// DevOps vs DataOps vs MLOps
DevOps:
  CI (build/test) + CD (deploy)
  Artifact: code → binary/container
  Trigger: code commit
DataOps:
  Data pipelines + quality checks
  Artifact: datasets, transforms
  Trigger: new data arrives
MLOps:
  CI + CD + CT (Continuous Training)
  Artifacts: code + data + model + config
  Triggers: code commit, data change,
            performance drift, schedule
Key insight: MLOps has three triggers for change (code, data, model performance), while DevOps has one (code). This tripling of complexity is why ML systems need specialized operational practices.
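The multi-trigger logic is simple to express in code. This sketch is illustrative (the function name, thresholds, and seven-day schedule are assumptions, not from any specific tool), but it shows why an ML pipeline can fire for several reasons at once while a DevOps pipeline has exactly one.

```python
# Sketch of MLOps's change triggers vs DevOps's single trigger (code commit).
from datetime import datetime, timedelta

def retrain_reasons(code_changed, data_changed, live_metric, metric_floor,
                    last_trained, max_age=timedelta(days=7)):
    """Return every reason the pipeline should run (may be several at once)."""
    reasons = []
    if code_changed:
        reasons.append("code commit")        # the only trigger DevOps has
    if data_changed:
        reasons.append("data change")
    if live_metric < metric_floor:
        reasons.append("performance drift")
    if datetime.now() - last_trained > max_age:
        reasons.append("schedule")
    return reasons

print(retrain_reasons(code_changed=False, data_changed=True,
                      live_metric=0.81, metric_floor=0.85,
                      last_trained=datetime.now() - timedelta(days=10)))
```

Here three of the four triggers fire simultaneously with no code commit at all — a situation traditional CI/CD has no vocabulary for.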
MLOps Maturity Levels
From manual chaos to fully automated ML
The Maturity Model
Google’s MLOps maturity model (from their “MLOps: Continuous delivery and automation pipelines in ML” guide) defines three levels: Level 0 (Manual): Data scientists train models manually, hand off artifacts to engineers, no automation, no monitoring. Level 1 (ML Pipeline Automation): Training is automated and reproducible, feature stores exist, continuous training is triggered by data changes, but deployment is still semi-manual. Level 2 (CI/CD Pipeline Automation): Full automation — code changes trigger CI, model changes trigger CD, data changes trigger CT. Automated testing, monitoring, and rollback. Most organizations are at Level 0 or early Level 1.
Maturity Levels
// Google MLOps Maturity Model
Level 0 — Manual Process
  • Jupyter notebooks, manual steps
  • No pipeline, no versioning
  • “Works on my machine”
  • Deploy once, pray it holds
Level 1 — ML Pipeline Automation
  • Automated training pipeline
  • Feature store for consistency
  • Continuous Training (CT)
  • Experiment tracking (MLflow/W&B)
Level 2 — CI/CD + CT Automation
  • Full CI/CD for ML code
  • Automated model validation
  • Canary/shadow deployments
  • Automated drift → retrain → deploy
  • Model registry + governance
Key insight: Most organizations are at Level 0. Getting to Level 1 is the highest-ROI investment — it eliminates the most painful manual steps. Level 2 is where mature ML organizations operate, but it requires significant infrastructure investment.
The MLOps Team
Roles, responsibilities, and the collaboration challenge
Key Roles
MLOps requires collaboration across multiple roles: Data Scientists build and evaluate models. ML Engineers productionize models — they bridge the gap between data science and software engineering. Data Engineers build and maintain data pipelines and feature stores. Platform/MLOps Engineers build and maintain the ML platform (training infrastructure, serving, monitoring). Product Managers define success metrics and business requirements. The biggest organizational challenge is the handoff problem: data scientists throw models “over the wall” to engineers. MLOps aims to eliminate this wall through shared tools, shared code, and shared responsibility.
Team Structure
// MLOps team roles
Data Scientist:
  Experiments, model selection, evaluation
  Tools: notebooks, pandas, sklearn, PyTorch
ML Engineer:
  Productionize, optimize, deploy
  Tools: Docker, K8s, TorchServe, vLLM
Data Engineer:
  Pipelines, feature stores, data quality
  Tools: Airflow, Spark, dbt, Feast
Platform Engineer:
  ML platform, infra, CI/CD, monitoring
  Tools: Kubeflow, MLflow, Terraform
Anti-pattern: “throw over the wall”
Goal: shared ownership, shared tools
Key insight: The most successful ML teams have ML Engineers who can speak both languages — they understand model architecture and loss functions, but also know Docker, Kubernetes, and CI/CD. This hybrid role is the linchpin of MLOps.
MLOps vs. LLMOps
How large language models change the game
The LLM Difference
LLMs introduced new operational challenges that traditional MLOps didn’t anticipate: Prompt management replaces feature engineering — prompts are the new “code” that needs versioning and testing. Model routing replaces model selection — you route requests to different LLMs based on cost, latency, and capability. Evaluation is harder — there’s no single metric like accuracy; you need human evaluation, LLM-as-judge, and behavioral tests. Cost management is critical — LLM inference costs scale with token usage, not just request count. Guardrails are essential — you need to prevent harmful outputs, hallucinations, and prompt injection.
MLOps vs LLMOps
// Traditional MLOps vs LLMOps
Training:
  MLOps:  Train your own model
  LLMOps: Use pre-trained, fine-tune, or prompt
Features:
  MLOps:  Feature engineering + feature store
  LLMOps: Prompt engineering + RAG
Versioning:
  MLOps:  Model weights + code + data
  LLMOps: Prompts + RAG docs + model version
Evaluation:
  MLOps:  Accuracy, F1, AUC
  LLMOps: Human eval, LLM-as-judge, behavioral
Cost:
  MLOps:  Compute (training + inference)
  LLMOps: Per-token pricing, caching critical
Key insight: LLMOps doesn’t replace MLOps — it extends it. Organizations running both traditional ML models and LLM-based applications need both sets of practices. Chapters 7–8 of this course dive deep into LLMOps-specific tooling.
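"Prompts are the new code that needs versioning" can be sketched concretely. The registry shape below is an assumption, not any specific tool's API; the idea is simply to content-address each prompt template so every wording change produces a new, auditable version id.

```python
# Sketch of prompt versioning via content-addressing (hypothetical registry).
import hashlib

PROMPTS = {}  # version id -> prompt template

def register_prompt(template):
    """Hash the template so any edit, however small, yields a new version."""
    version = hashlib.sha256(template.encode()).hexdigest()[:8]
    PROMPTS[version] = template
    return version

v1 = register_prompt("Summarize the ticket in one sentence: {ticket}")
v2 = register_prompt("Summarize the ticket in one sentence, in English: {ticket}")

# Two different wordings -> two versions you can A/B test and roll back between.
print(v1, "!=", v2)
print(PROMPTS[v1].format(ticket="Login page returns 500"))
```

With versioned prompts you can pin a deployment to `v1`, canary `v2`, and attribute any quality regression to the exact wording change — the same discipline Git brings to code.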
Getting Started with MLOps
Practical first steps for any team
Start Here
You don’t need to buy a platform on day one. Start with the highest-ROI practices: (1) Version everything — code (Git), data (DVC), models (MLflow), and configs. (2) Track experiments — log every training run with hyperparameters, metrics, and artifacts. (3) Automate training — make training reproducible with a single command. (4) Monitor in production — track prediction distributions, latency, and error rates. (5) Test your data — add data validation checks (Great Expectations, Pandera) to catch data issues before they reach the model. These five practices alone will move most teams from Level 0 to Level 1.
MLOps Starter Checklist
// Minimum viable MLOps
1. Version Control:
   Code   → Git
   Data   → DVC or LakeFS
   Models → MLflow Model Registry
2. Experiment Tracking:
   MLflow or Weights & Biases
   Log: params, metrics, artifacts
3. Reproducible Training:
   Dockerfile + requirements.txt
   Single command: make train
4. Basic Monitoring:
   Prediction distribution shifts
   Latency p50/p95/p99
   Error rate dashboards
5. Data Validation:
   Schema checks on input data
   Distribution checks before training
Key insight: The rest of this course covers each of these areas in depth. Start simple, automate incrementally, and resist the urge to buy an end-to-end platform before you understand your actual needs.
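To make item 5 of the checklist concrete, here is a minimal standard-library sketch of the schema and distribution checks that tools like Great Expectations and Pandera formalize. The column names and ranges are invented for illustration.

```python
# Hand-rolled data validation sketch; real projects would use
# Great Expectations or Pandera instead of asserts like these.
import statistics

EXPECTED_COLUMNS = {"age": (0, 120), "income": (0, 1_000_000)}  # assumed schema

def validate(rows):
    """Raise before bad data ever reaches training."""
    for col, (lo, hi) in EXPECTED_COLUMNS.items():
        values = [row[col] for row in rows]   # KeyError here = schema break
        if not all(lo <= v <= hi for v in values):
            raise ValueError(f"{col}: value out of range [{lo}, {hi}]")
        if statistics.pstdev(values) == 0:
            raise ValueError(f"{col}: constant column, likely a broken feed")

validate([{"age": 34, "income": 52_000}, {"age": 45, "income": 61_000}])
print("data checks passed")
```

The point is where the check runs: before training, in the pipeline, so a broken upstream feed fails loudly instead of silently degrading the next model.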