Ch 5 — CI/CD for Machine Learning

Testing ML code, model validation gates, GitHub Actions for ML, CML, and automated retraining
High Level
Commit → Test → Train → Validate → Deploy → Monitor
CI/CD for ML Is Different
Why traditional CI/CD pipelines don’t work for ML
The ML CI/CD Challenge
Traditional CI/CD tests code: does it compile? Do unit tests pass? Does the API return 200? ML CI/CD must test three additional things: (1) Data quality — is the training data valid and consistent? (2) Model quality — does the model meet performance thresholds? (3) Model behavior — does it handle edge cases correctly? ML tests are also statistical, not binary — “accuracy ≥ 0.92” is a threshold, not a pass/fail. And ML pipelines are expensive — a training run can take hours on GPUs, so you can’t run the full pipeline on every commit. You need a tiered testing strategy.
Traditional vs ML CI/CD
// Traditional CI/CD Trigger: code commit Test: unit tests, integration tests Build: compile → binary/container Deploy: push to production Time: minutes // ML CI/CD Trigger: code commit, data change, drift Test: code tests + data tests + model tests Build: train model (hours on GPUs) Validate: performance gates, bias checks Deploy: canary → shadow → production Time: hours to days // Key difference: tests are statistical // "accuracy ≥ 0.92" not "test passed"
Key insight: ML CI/CD has three pipelines, not one: CI (test code quality), CT (continuous training — retrain on new data), and CD (deploy validated models). CT is the unique ML addition.
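One way to implement the tiered strategy is with pytest markers, so the slow tiers only run when explicitly selected. A minimal sketch (the marker names `data` and `model` are illustrative, not a pytest convention):

```python
import pytest

# conftest.py (sketch): register the tier markers so pytest
# does not warn about unknown marks.
def pytest_configure(config):
    config.addinivalue_line("markers", "data: data validation tests (run on data changes)")
    config.addinivalue_line("markers", "model: model quality tests (run after training)")

# Unmarked tests form the fast tier and run on every commit.
def test_fast_unit():
    assert 1 + 1 == 2

# Expensive tiers hide behind markers and are opt-in.
@pytest.mark.model
def test_expensive_model_quality():
    ...  # hours of GPU work lives behind this marker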
bug_report
Testing ML Code
Unit tests, data tests, and model tests
Three Layers of ML Testing
Layer 1: Code tests (run on every commit, seconds). Unit tests for data preprocessing functions, feature engineering logic, and model inference code. Use pytest. Layer 2: Data tests (run on data changes, minutes). Schema validation, distribution checks, null/duplicate checks using Great Expectations or Pandera. Layer 3: Model tests (run on training completion, minutes to hours). Performance thresholds (accuracy, F1 above minimum), regression tests (no worse than current production model), behavioral tests (specific inputs → expected outputs), and bias/fairness checks on protected attributes.
ML Test Examples
# Layer 1: Code tests (every commit) def test_preprocess_handles_nulls(): df = pd.DataFrame({"age": [25, None, 30]}) result = preprocess(df) assert result["age"].isna().sum() == 0 def test_model_output_shape(): model = load_model() x = torch.randn(1, 10) y = model(x) assert y.shape == (1, 2) # Layer 2: Data tests (data changes) def test_no_future_leakage(): assert df["event_date"].max() < df["label_date"].min() # Layer 3: Model tests (after training) def test_model_accuracy(): metrics = evaluate(model, test_data) assert metrics["accuracy"] >= 0.92 def test_no_regression(): new = evaluate(new_model, test_data) old = evaluate(prod_model, test_data) assert new["f1"] >= old["f1"] - 0.01
Key insight: The regression test (new_model ≥ prod_model - tolerance) is the most important ML test. It prevents deploying a model that’s worse than what’s currently in production.
sync
GitHub Actions for ML
Automating ML workflows with GitHub’s CI/CD
GitHub Actions for ML
GitHub Actions is the most popular CI/CD platform for ML teams. Key patterns: On push — run code tests, linting, and type checking (fast, every commit). On PR — run data validation, train a small model on a subset, generate a CML report with metrics and plots in the PR. On schedule — run full training pipeline nightly or weekly. On data change — trigger retraining when DVC detects new data. For GPU workloads, use self-hosted runners (your own GPU machines) or cloud-provisioned runners. GitHub’s standard runners are CPU-only and not suitable for training.
GitHub Actions Workflow
# .github/workflows/ml-ci.yml name: ML CI/CD on: push: branches: [main] pull_request: branches: [main] jobs: test: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: actions/setup-python@v5 with: python-version: "3.11" - run: pip install -r requirements.txt - run: pytest tests/ -v - run: black --check src/ train: needs: test runs-on: self-hosted # GPU runner if: github.event_name == 'pull_request' steps: - uses: actions/checkout@v4 - run: dvc pull - run: python train.py --subset 0.1 - uses: iterative/setup-cml@v2 - run: cml comment create report.md
Key insight: Use a tiered approach: fast tests on every commit (CPU), subset training on PRs (GPU), full training on merge to main or on schedule. This balances speed with thoroughness.
description
CML: Continuous Machine Learning
ML reports in pull requests
What Is CML?
CML (Continuous Machine Learning) by Iterative (the DVC team) is an open-source tool that brings ML context into your Git workflow. It auto-generates reports in pull requests with: training metrics comparison (this branch vs. main), plots (loss curves, confusion matrices, ROC curves), dataset statistics, and model size/latency benchmarks. CML integrates with GitHub Actions and GitLab CI. It also supports cloud provisioning — CML can spin up GPU instances on AWS/GCP/Azure for training and tear them down when done, so you only pay for what you use.
CML Report Script
#!/bin/bash # generate-report.sh (runs in CI) # Train model and save metrics python train.py python evaluate.py > metrics.json # Generate CML report cat << EOF > report.md # Model Training Report ## Metrics | Metric | Value | |--------|-------| | Accuracy | $(jq .accuracy metrics.json) | | F1 Score | $(jq .f1 metrics.json) | | Latency | $(jq .latency_ms metrics.json)ms | ## Training Curves ![](./plots/loss_curve.png) ## Confusion Matrix ![](./plots/confusion_matrix.png) EOF # Post as PR comment cml comment create report.md
Key insight: CML makes model performance visible in code review. Reviewers can see metrics, plots, and comparisons directly in the PR — no need to switch to a separate dashboard. This dramatically improves review quality.
verified
Model Validation Gates
Automated quality checks before deployment
Gate Types
Model validation gates are automated checks that must pass before a model can be promoted to production. Performance gates: accuracy, F1, AUC above minimum thresholds. Regression gates: new model must not be worse than current production model on any critical metric. Latency gates: inference time must be within SLA (e.g., p99 < 50ms). Size gates: model file size within deployment limits. Fairness gates: performance must be equitable across protected groups (gender, age, ethnicity). Behavioral gates: specific test cases that must produce expected outputs (e.g., “this known fraud case must be flagged”).
Validation Gate Config
# model_gates.yaml performance: accuracy: min: 0.92 f1_score: min: 0.88 auc_roc: min: 0.95 regression: max_degradation: 0.01 # vs prod model latency: p50_ms: 10 p95_ms: 30 p99_ms: 50 fairness: max_disparity: 0.05 # across groups protected_attributes: - gender - age_group behavioral: test_cases: tests/critical_cases.csv min_pass_rate: 1.0 # 100% must pass
Key insight: Behavioral tests are the most underused but most valuable gate. They encode domain knowledge: “this specific input should always produce this output.” They catch subtle regressions that aggregate metrics miss.
update
Continuous Training (CT)
Automated retraining when the world changes
Retraining Triggers
Continuous Training is the ML-specific addition to CI/CD. Models degrade over time as the world changes (data drift, concept drift). CT automates retraining based on triggers: Schedule-based (retrain daily/weekly regardless — simplest, good starting point). Performance-based (retrain when monitoring detects accuracy dropping below a threshold). Data-based (retrain when new data arrives or data distribution shifts significantly). Hybrid (scheduled baseline + triggered on drift). The retraining pipeline should be identical to the initial training pipeline — same code, same validation gates, same promotion workflow.
CT Pipeline
// Continuous Training triggers 1. Schedule: Cron: "0 2 * * 1" # Monday 2am Simple, predictable Good default starting point 2. Performance: Monitor: accuracy < 0.90 for 3 days → Trigger retrain Requires production monitoring 3. Data: New data volume > 10K rows OR distribution shift detected → Trigger retrain CT Pipeline: Trigger → fetch latest data → validate data (gates) → train model → validate model (gates) → register in model registry → promote to staging → [human approval or auto] → promote to production
Key insight: Start with scheduled retraining (weekly). Only add drift-triggered retraining when you have monitoring in place and evidence that your model degrades between scheduled retrains.
rocket_launch
Deployment Strategies
Canary, shadow, blue-green, and A/B testing
Deployment Patterns
Canary deployment: Route 5% of traffic to the new model, monitor metrics, gradually increase to 100%. Safest option — limits blast radius. Shadow deployment: Run the new model alongside production, compare predictions, but only serve the old model’s predictions. Zero risk to users. Blue-green deployment: Run two identical environments; switch all traffic at once. Fast rollback (just switch back). A/B testing: Split traffic between old and new model, measure business metrics (not just ML metrics). Requires statistical significance testing. Most teams start with canary, add shadow for high-risk models, and A/B test for business-critical decisions.
Deployment Strategies
// ML deployment strategies Canary: 5% → 25% → 50% → 100% Monitor at each stage Rollback if metrics drop Best for: most models Shadow: 100% to old model (serves) 100% to new model (logs only) Compare predictions offline Best for: high-risk, first deploy Blue-Green: Old env (blue) serves traffic New env (green) ready Switch DNS → instant cutover Best for: simple, fast rollback A/B Test: 50/50 split (or other ratio) Measure business KPIs Statistical significance test Best for: business-critical models
Key insight: Shadow deployment is the safest way to validate a new model in production. It sees real traffic and real data, but users are never affected. Use it for the first deployment of any high-stakes model.
checklist
CI/CD Best Practices for ML
Practical guidelines for ML teams
Best Practices
1. Tier your tests — fast tests on every commit, medium tests on PRs, full training on merge. 2. Make training reproduciblemake train should produce the same model from the same data. 3. Version your pipeline — the training pipeline is code; treat it like code. 4. Automate validation gates — never rely solely on human judgment for model promotion. 5. Keep rollback fast — you should be able to roll back to the previous model in under 5 minutes. 6. Test on real data distributions — synthetic test data misses distribution-specific bugs. 7. Monitor after deploy — CI/CD doesn’t end at deployment; monitoring is the final stage.
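Practice 5 (fast rollback) usually means pointing a production alias back at the previous model version rather than redeploying. A sketch using an in-memory stand-in for a model registry (a real registry such as MLflow offers similar alias operations; this class is illustrative):

```python
# Sketch: fast rollback by flipping a "production" pointer between
# registered model versions. In-memory stand-in for a real registry.
class ModelRegistry:
    def __init__(self):
        self.versions: list[str] = []
        self.production: str | None = None

    def promote(self, version: str) -> None:
        """Register a version and point production at it."""
        self.versions.append(version)
        self.production = version

    def rollback(self) -> str:
        """Point production back at the previous version (seconds, no retrain)."""
        if len(self.versions) < 2:
            raise RuntimeError("no previous version to roll back to")
        self.versions.pop()             # drop the bad version
        self.production = self.versions[-1]
        return self.production
```

Because rollback is just a pointer flip, it completes in seconds regardless of model size, which is what makes the under-5-minutes target realistic.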
ML CI/CD Maturity
// ML CI/CD maturity levels

Level 0: No CI/CD
  Manual training, manual deploy
  "It works on my machine"

Level 1: Basic CI
  Code tests on every commit
  Manual training + deploy

Level 2: CI + CT
  Code tests + data tests
  Automated training pipeline
  Manual model validation + deploy

Level 3: Full CI/CD/CT
  Automated tests (code + data + model)
  Automated training (scheduled + triggered)
  Automated validation gates
  Automated canary deployment
  Production monitoring → retrain loop
Key insight: Most teams should aim for Level 2 first: automated tests and automated training, with human-in-the-loop for validation and deployment. Level 3 (full automation) requires mature monitoring and high confidence in your validation gates.