Key Insights — MLOps & LLMOps

A high-level summary of the core concepts across all 10 chapters.
Foundations
Why MLOps & Core Infrastructure
Chapters 1 – 4
1
“Only a small fraction of real-world ML systems is composed of the ML code. The required surrounding infrastructure is vast and complex.” — Google (2015)
  • Technical debt in ML systems is real and grows silently — Google’s 2015 paper identified it as the dominant cost of production ML.
  • MLOps maturity progresses from manual (Level 0) through ML pipeline automation (Level 1) to CI/CD for ML (Level 2).
  • The gap between a notebook prototype and a production system is where most ML projects fail.
2
“If you can’t reproduce it, you can’t improve it.”
  • MLflow is the de facto default for experiment tracking: open-source, vendor-neutral, and integrated with every major ML framework.
  • Track parameters, metrics, artifacts, and code versions for every run. Log the Git commit hash automatically.
  • Weights & Biases excels at real-time dashboards and team collaboration; MLflow excels at self-hosted, open-source flexibility.
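Concretely, every run reduces to one reproducible record. Below is a minimal pure-Python sketch of what a tracker like MLflow or W&B captures per run; `RunRecord` and `run_id` are illustrative names, not any library's API:

```python
import dataclasses
import hashlib
import json
import time

@dataclasses.dataclass
class RunRecord:
    """One experiment run: everything needed to reproduce it."""
    params: dict        # hyperparameters (lr, batch_size, ...)
    metrics: dict       # final metrics (accuracy, loss, ...)
    git_commit: str     # code version the run was launched from
    artifacts: list     # paths to saved models, plots, etc.
    timestamp: float = dataclasses.field(default_factory=time.time)

    def run_id(self) -> str:
        # Deterministic id from params + commit, so identical configs collide visibly.
        key = json.dumps({"params": self.params, "commit": self.git_commit},
                         sort_keys=True)
        return hashlib.sha256(key.encode()).hexdigest()[:12]

run = RunRecord(params={"lr": 3e-4, "epochs": 10},
                metrics={"val_acc": 0.91},
                git_commit="a1b2c3d",
                artifacts=["model.pkl"])
```

The point of hashing params plus commit is the "if you can't reproduce it" rule: two runs with the same id but different metrics signal non-determinism you need to chase down.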
3
“A model without a registry is like code without version control.”
  • A model registry is the single source of truth for which model version is in production, staging, or archived.
  • Model cards document what the model does, how it was trained, its limitations, and ethical considerations.
  • DVC and LakeFS bring Git-like versioning to datasets — essential for reproducibility.
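The registry idea fits in a few lines. This is a toy with hypothetical names, not the MLflow Model Registry API, but it enforces the one invariant that matters: exactly one production version per model at a time.

```python
class ModelRegistry:
    """Toy registry: one production version per model name, full history kept."""

    def __init__(self):
        self._versions = {}  # model name -> {version: stage}

    def register(self, name: str, version: int):
        """New versions always enter as 'staging'."""
        self._versions.setdefault(name, {})[version] = "staging"

    def promote(self, name: str, version: int):
        """Promote a staging version; the old production version is archived."""
        versions = self._versions[name]
        if versions.get(version) != "staging":
            raise ValueError("only staging versions can be promoted")
        for v, stage in versions.items():
            if stage == "production":
                versions[v] = "archived"
        versions[version] = "production"

    def production_version(self, name: str):
        for v, stage in self._versions.get(name, {}).items():
            if stage == "production":
                return v
        return None
```

Deployment tooling then asks the registry (never a file path or a Slack thread) which version to serve.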
4
“Training-serving skew is the silent killer of ML systems.”
  • Feature stores (Feast, Tecton) solve training-serving skew by ensuring the same feature computation is used in both training and inference.
  • Great Expectations validates data quality with declarative expectations — catch schema changes and distribution shifts before they reach the model.
  • Separate offline (batch, historical) and online (low-latency, real-time) feature serving.
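A declarative expectation is just a named predicate over rows. A minimal sketch in the spirit of Great Expectations (the function names here are hypothetical, not its real API):

```python
def expect(name, predicate):
    """A declarative expectation: a name plus a row-level predicate."""
    return (name, predicate)

def validate(rows, expectations):
    """Return the names of expectations violated by any row."""
    failures = []
    for name, predicate in expectations:
        if not all(predicate(row) for row in rows):
            failures.append(name)
    return failures

# A tiny suite for a hypothetical payments table.
suite = [
    expect("amount_is_positive", lambda r: r["amount"] > 0),
    expect("currency_in_set", lambda r: r["currency"] in {"USD", "EUR"}),
]
```

Run the suite at the pipeline boundary: a non-empty failure list blocks the batch before it ever reaches training or the feature store.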
Bottom line: MLOps exists because production ML is 90% infrastructure and 10% model code. Track experiments, version models and data, validate data quality, and use feature stores to prevent training-serving skew.
Deployment
CI/CD, Serving & LLMOps
Chapters 5 – 8
5
“If deploying a model requires a hero, you don’t have MLOps.”
  • ML CI/CD has three loops: CI (code + data validation), CD (model deployment), and CT (continuous training on new data).
  • Model validation gates block deployment if the new model doesn’t beat the current production model on key metrics.
  • Deployment strategies: canary (gradual rollout), shadow (parallel comparison), blue-green (instant switch), A/B testing (user-level split).
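A validation gate is a small pure function. A sketch under the assumption that higher metric values are better; `min_improvement` guards against promoting noise-level wins:

```python
def passes_gate(challenger: dict, champion: dict,
                required: tuple = ("auc",),
                min_improvement: float = 0.0) -> bool:
    """Block deployment unless the challenger beats the current production
    model (the champion) on every required metric by at least min_improvement."""
    return all(
        challenger[m] >= champion[m] + min_improvement
        for m in required
    )
```

Wire this into CD so a failed gate fails the pipeline; promotion should never be a human judgment call made under deadline pressure.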
6
“The best model is useless if it can’t serve predictions fast enough.”
  • Triton Inference Server (NVIDIA) is the industry standard for GPU-accelerated serving with dynamic batching and multi-model support.
  • vLLM with PagedAttention and continuous batching is the go-to for LLM serving — 2–4x throughput improvement over naive serving.
  • Optimize the serving stack: ONNX Runtime for cross-framework portability, quantization for smaller models, batching for throughput.
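Why batching helps: per-batch overhead (kernel launch, host-to-device transfer) is amortized across items. A toy cost model with made-up constants, not measurements from any real server:

```python
def throughput(batch_size: int,
               overhead_ms: float = 5.0,    # fixed cost paid once per batch
               per_item_ms: float = 1.0) -> float:
    """Requests served per second under a fixed-overhead-per-batch cost model."""
    batch_latency_ms = overhead_ms + per_item_ms * batch_size
    return batch_size / batch_latency_ms * 1000.0
```

Throughput rises steeply at first, then saturates near `1000 / per_item_ms` requests per second. That curve is why dynamic batchers (Triton) and continuous batchers (vLLM) accept a few milliseconds of queueing latency in exchange for large throughput gains.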
7
“An LLM gateway is to LLMs what an API gateway is to microservices.”
  • LiteLLM (open-source, MIT) provides a unified OpenAI-compatible API for 100+ LLM providers with fallbacks, load balancing, and virtual keys.
  • Complexity-based routing is the highest-ROI optimization: route the roughly 70% of requests that are simple to a cheap model, for ~90% cost savings on that traffic.
  • Semantic caching (Portkey) recognizes paraphrased queries and returns cached responses, reducing costs by up to 40%.
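The arithmetic behind the ~90% figure, with hypothetical prices (a cheap tier at a tenth of the frontier price) and a deliberately naive length-based heuristic; real routers use trained classifiers:

```python
# Hypothetical prices, $ per 1K tokens: cheap tier at one tenth of frontier.
PRICE_PER_1K = {"cheap": 0.0005, "frontier": 0.005}

def route(prompt: str) -> str:
    """Naive complexity heuristic: short prompts are 'simple'."""
    return "cheap" if len(prompt.split()) < 50 else "frontier"

def savings_on_routed() -> float:
    """Cost reduction on the traffic sent to the cheap tier."""
    return 1 - PRICE_PER_1K["cheap"] / PRICE_PER_1K["frontier"]

def blended_savings(share_simple: float = 0.7) -> float:
    """Overall bill reduction when share_simple of traffic moves to the cheap tier."""
    after = (share_simple * PRICE_PER_1K["cheap"]
             + (1 - share_simple) * PRICE_PER_1K["frontier"])
    return 1 - after / PRICE_PER_1K["frontier"]
```

With these assumed prices, routed traffic gets 90% cheaper and the overall bill drops about 63% at a 70% simple-request share.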
8
“Prompts are the new code — they deserve the same rigor.”
  • Prompt registries (MLflow, Langfuse) decouple prompts from code, enabling instant updates without full deploys.
  • LLM-as-judge uses a powerful model to score outputs on correctness, helpfulness, and safety — 75–85% agreement with human evaluators.
  • Guardrails validate inputs (prompt injection, PII) and outputs (hallucination, toxicity, format) before they reach users.
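Guardrails are checks on both sides of the model call. A minimal sketch with a crude email regex, a toy injection pattern, and a JSON format check; production systems use dedicated classifiers and PII detectors:

```python
import json
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def check_input(prompt: str) -> list:
    """Input guardrails: return a list of violations (PII, crude injection pattern)."""
    violations = []
    if EMAIL.search(prompt):
        violations.append("pii:email")
    if re.search(r"ignore (all )?previous instructions", prompt, re.IGNORECASE):
        violations.append("prompt_injection")
    return violations

def check_output(text: str) -> list:
    """Output guardrails: here, the response must be valid JSON."""
    try:
        json.loads(text)
        return []
    except ValueError:
        return ["format:not_json"]
```

Any non-empty violation list either blocks the request or triggers a retry, so malformed or unsafe content never reaches users.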
Bottom line: Automate the path from code to production with CI/CD. Use the right serving infrastructure (Triton, vLLM). For LLMs, add a gateway for routing and cost control, manage prompts like code, and guard every input and output.
Operations
Monitoring & The Full Stack
Chapters 9 – 10
9
“A model with perfect infrastructure metrics can still produce terrible predictions.”
  • Data drift (input distributions shift) is the most common and easiest to detect; concept drift (relationships change) is the most dangerous.
  • Evidently AI is the leading open-source tool for drift detection, data quality checks, and model performance monitoring.
  • Monitor at three levels: infrastructure (Prometheus/Grafana), model (Evidently/Arize), and business (custom dashboards).
  • Most “drift” alerts are actually upstream data issues — always check data quality before retraining.
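Data drift can be quantified with the Population Stability Index, one of the metrics tools like Evidently report. A self-contained sketch; the 0.1 / 0.25 thresholds are the common rule of thumb, not universal constants:

```python
import math

def psi(reference: list, production: list, bins: int = 10,
        eps: float = 1e-4) -> float:
    """Population Stability Index between two numeric samples.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant drift."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0

    def hist(sample):
        counts = [0] * bins
        for x in sample:
            i = min(int((x - lo) / width), bins - 1)  # clip above reference range
            counts[max(i, 0)] += 1                    # clip below reference range
        return [c / len(sample) + eps for c in counts]

    p, q = hist(reference), hist(production)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

Binning is fixed on the reference (training) sample, so production values falling outside it pile up in the edge bins, exactly the signal you want to see.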
10
“The most successful MLOps teams aren’t the ones with the most tools — they’re the ones that close the feedback loop fastest.”
  • SageMaker for AWS-native teams, Vertex AI for GCP/BigQuery users, Databricks for lakehouse architectures, open-source for multi-cloud.
  • Build incrementally: start with MLflow + CI/CD (covers 80% of needs), then add monitoring, feature stores, and LLMOps as pain points emerge.
  • Choose based on where your data already lives and your team’s engineering capacity — not feature comparisons.
Bottom line: Close the feedback loop: production → monitoring → drift detection → retraining. Start simple, add layers as you mature. Speed of iteration beats perfection of infrastructure.