Ch 1 — Why MLOps Matters

Technical debt in ML, the MLOps lifecycle, and maturity levels
High Level: Notebook → Debt → Lifecycle → Maturity → Culture → MLOps
The Notebook-to-Production Gap
Why most ML models never make it to production
The Problem
Data scientists build models in Jupyter notebooks. The model works. Accuracy is great. Then comes the question: “How do we put this in production?” And everything falls apart. According to Gartner, only about 53% of AI projects make it from prototype to production (as of 2023). The gap between “model works in a notebook” and “model reliably serves predictions at scale” is enormous. MLOps exists to bridge this gap — it’s the set of practices, tools, and culture that make ML systems reliable, reproducible, and maintainable in production.
The Gap
// The notebook-to-production gap
Notebook:
  ✓ Single dataset, static
  ✓ Manual feature engineering
  ✓ Train once, evaluate once
  ✓ “It works on my machine”
Production:
  ✗ Data changes daily
  ✗ Features must be computed in real-time
  ✗ Model must retrain automatically
  ✗ Must handle 10K+ requests/sec
  ✗ Must monitor for drift
  ✗ Must roll back if broken
  ✗ Must be auditable & reproducible
Key insight: ML code is typically only 5–10% of a production ML system. The rest is data pipelines, serving infrastructure, monitoring, configuration, and testing. This is the central finding of Google’s “Hidden Technical Debt” paper.
Hidden Technical Debt in ML Systems
Sculley et al. (NeurIPS 2015) — the paper that launched MLOps
The Google Paper
In 2015, D. Sculley and colleagues at Google published “Hidden Technical Debt in Machine Learning Systems” at NeurIPS. The paper argued that ML systems have a special capacity for incurring technical debt because they have all the maintenance problems of traditional software plus a set of ML-specific issues. Key debt categories: boundary erosion (no strict API contracts between components), entanglement (changing one feature affects all others — CACE: “Changing Anything Changes Everything”), hidden feedback loops (model predictions influence future training data), undeclared consumers (other systems silently depend on your model), and data dependency debt (harder to track than code dependencies).
ML-Specific Debt
// Hidden Technical Debt (Sculley et al., 2015)
CACE Principle: Changing Anything Changes Everything
  → Add 1 feature → all predictions shift
Feedback Loops:
  Model predicts → user acts → new data
  → Model trains on influenced data
  → Predictions drift silently
Data Dependencies:
  - Unstable data sources
  - Underutilized features
  - Legacy features no one removes
  - Correlated features masking bugs
Configuration Debt:
  Hyperparams, thresholds, feature flags
  → Often more lines than model code
Key insight: The paper’s most famous diagram shows that ML code is a tiny box surrounded by massive boxes for data collection, verification, feature extraction, serving infrastructure, monitoring, and configuration. MLOps addresses every one of those surrounding boxes.
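The feedback-loop debt described above can be made concrete with a toy simulation. This is a deliberately simplified sketch (hypothetical numbers, standard library only): a model whose own decisions filter the data it later trains on will drift without a single line of code changing.

```python
# Toy hidden-feedback-loop simulation: the model's threshold decides which
# examples get logged, so each retraining sees a biased sample of the world.
import random

random.seed(0)

def retrain(scores):
    """'Train' by setting the threshold to the mean of observed scores."""
    return sum(scores) / len(scores)

threshold = 0.5
for round_num in range(5):
    # True population: scores uniform in [0, 1] — it never changes.
    population = [random.random() for _ in range(1000)]
    # Only examples the model approved get logged -> biased training set.
    observed = [s for s in population if s >= threshold]
    threshold = retrain(observed)
    print(f"round {round_num}: threshold drifted to {threshold:.3f}")
```

The population is stationary, yet the threshold ratchets upward every round because the model only ever sees data it already approved — exactly the silent drift Sculley et al. warn about.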
The MLOps Lifecycle
A continuous loop, not a one-time pipeline
The Continuous Loop
MLOps is not a linear pipeline — it’s a continuous loop. The lifecycle has roughly 7 stages: Data management (collection, labeling, versioning), experimentation (feature engineering, model selection, hyperparameter tuning), model development (training, evaluation, validation), deployment (packaging, serving, A/B testing), monitoring (performance tracking, drift detection), retraining (triggered by drift or schedule), and governance (audit trails, compliance, model cards). The key difference from DevOps: in MLOps, data is a first-class citizen. Code changes are only half the story — data changes can break a model just as easily.
MLOps Lifecycle
// The MLOps continuous loop
┌─────────────────────────┐
│ 1. Data Management      │
│  collect, label, version│
└─────────┬───────────────┘
          ▼
┌─────────────────────────┐
│ 2. Experimentation      │
│  features, tuning       │
└─────────┬───────────────┘
          ▼
┌─────────────────────────┐
│ 3. Training             │
│  train, evaluate        │
└─────────┬───────────────┘
          ▼
┌─────────────────────────┐
│ 4. Deployment           │
│  serve, A/B test        │
└─────────┬───────────────┘
          ▼
┌─────────────────────────┐
│ 5. Monitoring           │◄─┐
│  drift, performance     │  │
└─────────┬───────────────┘  │
          ▼                  │
┌─────────────────────────┐  │
│ 6. Retrain              │──┘
└─────────────────────────┘
Key insight: In traditional software, you deploy and you’re done (until the next feature). In ML, deployment is just the beginning — the model will degrade over time as the world changes, and the loop must keep spinning.
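The monitor-then-retrain step that keeps the loop spinning can be sketched in a few lines. This is a stand-in, not a real system — the `evaluate`/`retrain` functions and the 0.90 accuracy floor are illustrative assumptions; production setups would use a scheduler, a model registry, and real training code.

```python
# Minimal sketch of the monitoring -> continuous-training step of the loop.
ACCURACY_FLOOR = 0.90  # assumed alert threshold

def evaluate(model, batch):
    """Fraction of correct predictions on a labeled batch."""
    return sum(model(x) == y for x, y in batch) / len(batch)

def retrain(batch):
    """Stand-in 'training': memorize the latest labels."""
    lookup = {x: y for x, y in batch}
    return lambda x: lookup.get(x, 0)

model = lambda x: 0                       # stale model: always predicts 0
batch = [(i, i % 2) for i in range(10)]   # the world changed: half the labels are 1

if evaluate(model, batch) < ACCURACY_FLOOR:  # monitoring detects degradation...
    model = retrain(batch)                   # ...and triggers continuous training

print(f"post-retrain accuracy: {evaluate(model, batch):.2f}")
```

The stale model scores 0.50 on the shifted batch, the monitor trips the floor, and retraining restores accuracy — the loop closed once; in production it never stops.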
MLOps vs. DevOps vs. DataOps
Related but distinct disciplines
Three Ops Compared
DevOps automates the software delivery lifecycle: build, test, deploy, monitor. DataOps automates data pipelines: ingestion, transformation, quality, delivery. MLOps combines both and adds ML-specific concerns: experiment tracking, model versioning, feature stores, drift detection, and retraining triggers. MLOps inherits CI/CD from DevOps and data quality from DataOps, but adds CT (Continuous Training) — the automated retraining of models when data or performance changes. This is the unique “third pillar” that doesn’t exist in traditional software.
Comparison
// DevOps vs DataOps vs MLOps
DevOps:
  CI (build/test) + CD (deploy)
  Artifact: code → binary/container
  Trigger: code commit
DataOps:
  Data pipelines + quality checks
  Artifact: datasets, transforms
  Trigger: new data arrives
MLOps:
  CI + CD + CT (Continuous Training)
  Artifacts: code + data + model + config
  Triggers: code commit, data change,
            performance drift, schedule
Key insight: MLOps has three triggers for change (code, data, model performance), while DevOps has one (code). This tripling of complexity is why ML systems need specialized operational practices.
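The multi-trigger logic is simple to express in code. This sketch is illustrative (the function name, thresholds, and seven-day schedule are assumptions, not from any specific tool), but it shows why an ML pipeline can fire for several reasons at once while a DevOps pipeline has exactly one.

```python
# Sketch of MLOps's change triggers vs DevOps's single trigger (code commit).
from datetime import datetime, timedelta

def retrain_reasons(code_changed, data_changed, live_metric, metric_floor,
                    last_trained, max_age=timedelta(days=7)):
    """Return every reason the pipeline should run (may be several at once)."""
    reasons = []
    if code_changed:
        reasons.append("code commit")        # the only trigger DevOps has
    if data_changed:
        reasons.append("data change")
    if live_metric < metric_floor:
        reasons.append("performance drift")
    if datetime.now() - last_trained > max_age:
        reasons.append("schedule")
    return reasons

print(retrain_reasons(code_changed=False, data_changed=True,
                      live_metric=0.81, metric_floor=0.85,
                      last_trained=datetime.now() - timedelta(days=10)))
```

Here three of the four triggers fire simultaneously with no code commit at all — a situation traditional CI/CD has no vocabulary for.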
MLOps Maturity Levels
From manual chaos to fully automated ML
The Maturity Model
Google’s MLOps maturity model (from their “MLOps: Continuous delivery and automation pipelines in ML” guide) defines three levels: Level 0 (Manual): Data scientists train models manually, hand off artifacts to engineers, no automation, no monitoring. Level 1 (ML Pipeline Automation): Training is automated and reproducible, feature stores exist, continuous training is triggered by data changes, but deployment is still semi-manual. Level 2 (CI/CD Pipeline Automation): Full automation — code changes trigger CI, model changes trigger CD, data changes trigger CT. Automated testing, monitoring, and rollback. Most organizations are at Level 0 or early Level 1.
Maturity Levels
// Google MLOps Maturity Model
Level 0 — Manual Process
  • Jupyter notebooks, manual steps
  • No pipeline, no versioning
  • “Works on my machine”
  • Deploy once, pray it holds
Level 1 — ML Pipeline Automation
  • Automated training pipeline
  • Feature store for consistency
  • Continuous Training (CT)
  • Experiment tracking (MLflow/W&B)
Level 2 — CI/CD + CT Automation
  • Full CI/CD for ML code
  • Automated model validation
  • Canary/shadow deployments
  • Automated drift → retrain → deploy
  • Model registry + governance
Key insight: Most organizations are at Level 0. Getting to Level 1 is the highest-ROI investment — it eliminates the most painful manual steps. Level 2 is where mature ML organizations operate, but it requires significant infrastructure investment.
The MLOps Team
Roles, responsibilities, and the collaboration challenge
Key Roles
MLOps requires collaboration across multiple roles: Data Scientists build and evaluate models. ML Engineers productionize models — they bridge the gap between data science and software engineering. Data Engineers build and maintain data pipelines and feature stores. Platform/MLOps Engineers build and maintain the ML platform (training infrastructure, serving, monitoring). Product Managers define success metrics and business requirements. The biggest organizational challenge is the handoff problem: data scientists throw models “over the wall” to engineers. MLOps aims to eliminate this wall through shared tools, shared code, and shared responsibility.
Team Structure
// MLOps team roles
Data Scientist:
  Experiments, model selection, evaluation
  Tools: notebooks, pandas, sklearn, PyTorch
ML Engineer:
  Productionize, optimize, deploy
  Tools: Docker, K8s, TorchServe, vLLM
Data Engineer:
  Pipelines, feature stores, data quality
  Tools: Airflow, Spark, dbt, Feast
Platform Engineer:
  ML platform, infra, CI/CD, monitoring
  Tools: Kubeflow, MLflow, Terraform
Anti-pattern: “throw over the wall”
Goal: shared ownership, shared tools
Key insight: The most successful ML teams have ML Engineers who can speak both languages — they understand model architecture and loss functions, but also know Docker, Kubernetes, and CI/CD. This hybrid role is the linchpin of MLOps.
MLOps vs. LLMOps
How large language models change the game
The LLM Difference
LLMs introduced new operational challenges that traditional MLOps didn’t anticipate: Prompt management replaces feature engineering — prompts are the new “code” that needs versioning and testing. Model routing replaces model selection — you route requests to different LLMs based on cost, latency, and capability. Evaluation is harder — there’s no single metric like accuracy; you need human evaluation, LLM-as-judge, and behavioral tests. Cost management is critical — LLM inference costs scale with token usage, not just request count. Guardrails are essential — you need to prevent harmful outputs, hallucinations, and prompt injection.
MLOps vs LLMOps
// Traditional MLOps vs LLMOps
Training:
  MLOps:  Train your own model
  LLMOps: Use pre-trained, fine-tune, or prompt
Features:
  MLOps:  Feature engineering + feature store
  LLMOps: Prompt engineering + RAG
Versioning:
  MLOps:  Model weights + code + data
  LLMOps: Prompts + RAG docs + model version
Evaluation:
  MLOps:  Accuracy, F1, AUC
  LLMOps: Human eval, LLM-as-judge, behavioral
Cost:
  MLOps:  Compute (training + inference)
  LLMOps: Per-token pricing, caching critical
Key insight: LLMOps doesn’t replace MLOps — it extends it. Organizations running both traditional ML models and LLM-based applications need both sets of practices. Chapters 7–8 of this course dive deep into LLMOps-specific tooling.
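"Prompts are the new code that needs versioning" can be sketched concretely. The registry shape below is an assumption, not any specific tool's API; the idea is simply to content-address each prompt template so every wording change produces a new, auditable version id.

```python
# Sketch of prompt versioning via content-addressing (hypothetical registry).
import hashlib

PROMPTS = {}  # version id -> prompt template

def register_prompt(template):
    """Hash the template so any edit, however small, yields a new version."""
    version = hashlib.sha256(template.encode()).hexdigest()[:8]
    PROMPTS[version] = template
    return version

v1 = register_prompt("Summarize the ticket in one sentence: {ticket}")
v2 = register_prompt("Summarize the ticket in one sentence, in English: {ticket}")

# Two different wordings -> two versions you can A/B test and roll back between.
print(v1, "!=", v2)
print(PROMPTS[v1].format(ticket="Login page returns 500"))
```

With versioned prompts you can pin a deployment to `v1`, canary `v2`, and attribute any quality regression to the exact wording change — the same discipline Git brings to code.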
Getting Started with MLOps
Practical first steps for any team
Start Here
You don’t need to buy a platform on day one. Start with the highest-ROI practices: (1) Version everything — code (Git), data (DVC), models (MLflow), and configs. (2) Track experiments — log every training run with hyperparameters, metrics, and artifacts. (3) Automate training — make training reproducible with a single command. (4) Monitor in production — track prediction distributions, latency, and error rates. (5) Test your data — add data validation checks (Great Expectations, Pandera) to catch data issues before they reach the model. These five practices alone will move most teams from Level 0 to Level 1.
MLOps Starter Checklist
// Minimum viable MLOps
1. Version Control:
   Code   → Git
   Data   → DVC or LakeFS
   Models → MLflow Model Registry
2. Experiment Tracking:
   MLflow or Weights & Biases
   Log: params, metrics, artifacts
3. Reproducible Training:
   Dockerfile + requirements.txt
   Single command: make train
4. Basic Monitoring:
   Prediction distribution shifts
   Latency p50/p95/p99
   Error rate dashboards
5. Data Validation:
   Schema checks on input data
   Distribution checks before training
Key insight: The rest of this course covers each of these areas in depth. Start simple, automate incrementally, and resist the urge to buy an end-to-end platform before you understand your actual needs.
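To make item 5 of the checklist concrete, here is a minimal standard-library sketch of the schema and distribution checks that tools like Great Expectations and Pandera formalize. The column names and ranges are invented for illustration.

```python
# Hand-rolled data validation sketch; real projects would use
# Great Expectations or Pandera instead of asserts like these.
import statistics

EXPECTED_COLUMNS = {"age": (0, 120), "income": (0, 1_000_000)}  # assumed schema

def validate(rows):
    """Raise before bad data ever reaches training."""
    for col, (lo, hi) in EXPECTED_COLUMNS.items():
        values = [row[col] for row in rows]   # KeyError here = schema break
        if not all(lo <= v <= hi for v in values):
            raise ValueError(f"{col}: value out of range [{lo}, {hi}]")
        if statistics.pstdev(values) == 0:
            raise ValueError(f"{col}: constant column, likely a broken feed")

validate([{"age": 34, "income": 52_000}, {"age": 45, "income": 61_000}])
print("data checks passed")
```

The point is where the check runs: before training, in the pipeline, so a broken upstream feed fails loudly instead of silently degrading the next model.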