5
“If deploying a model requires a hero, you don’t have MLOps.”
- ML CI/CD has three loops: CI (code + data validation), CD (model deployment), and CT (continuous training on new data).
- Model validation gates block deployment if the new model doesn’t beat the current production model on key metrics.
- Deployment strategies: canary (gradual rollout), shadow (parallel comparison), blue-green (instant switch), A/B testing (user-level split).
6
“The best model is useless if it can’t serve predictions fast enough.”
- Triton Inference Server (NVIDIA) is the industry standard for GPU-accelerated serving with dynamic batching and multi-model support.
- vLLM with PagedAttention and continuous batching is the go-to for LLM serving — 2–4x throughput improvement over naive serving.
- Optimize the serving stack: ONNX Runtime for cross-framework portability, quantization for smaller models, batching for throughput.
7
“An LLM gateway is to LLMs what an API gateway is to microservices.”
- LiteLLM (open-source, MIT) provides a unified OpenAI-compatible API for 100+ LLM providers with fallbacks, load balancing, and virtual keys.
- Complexity-based routing is the highest-ROI optimization: route 70% of simple requests to a cheap model for ~90% cost reduction on those requests.
- Semantic caching (Portkey) recognizes paraphrased queries and returns cached responses, reducing costs by up to 40%.
8
“Prompts are the new code — they deserve the same rigor.”
- Prompt registries (MLflow, Langfuse) decouple prompts from code, enabling instant updates without full deploys.
- LLM-as-judge uses a powerful model to score outputs on correctness, helpfulness, and safety — 75–85% agreement with human evaluators.
- Guardrails validate inputs (prompt injection, PII) and outputs (hallucination, toxicity, format) before they reach users.
Bottom line: Automate the path from code to production with CI/CD. Use the right serving infrastructure (Triton, vLLM). For LLMs, add a gateway for routing and cost control, manage prompts like code, and guard every input and output.