Ch 9 — Monitoring & Drift Detection

Data drift, concept drift, model performance monitoring, Evidently AI, and alerting strategies
High Level

Model → Predict → Monitor → Drift → Alert → Retrain
Why Models Degrade
The world changes, but your model doesn’t
Model Decay
A model that performs well at deployment will inevitably degrade over time. The world changes, but the model is frozen at training time. Common causes:

- Data drift: the statistical distribution of input features shifts. A retail demand model trained on in-store sales data performs poorly when online sales surge.
- Concept drift: the relationship between inputs and outputs changes. A spam filter becomes less accurate as spammers adapt their techniques.
- Upstream data changes: a feature pipeline breaks, a data source changes format, or a third-party API returns different values.

Without monitoring, you won't know your model is degrading until users complain or business metrics drop, which could be weeks or months later.
Types of Drift
// Types of model degradation

Data Drift (input distribution shifts):
  Training:   avg_income = $55K, std = $20K
  Production: avg_income = $72K, std = $35K
  // Features look different from training

Concept Drift (relationship changes):
  Training: high_income → low_default_risk
  Reality:  high_income → moderate_risk
  // Same features, different outcomes

Upstream Changes (data pipeline breaks):
  Before: age = 35 (integer)
  After:  age = "35-44" (categorical)
  // Schema change breaks the model

Label Drift (target distribution shifts):
  Training:   5% fraud rate
  Production: 12% fraud rate
  // Class balance changed
Key insight: Data drift is the most common and easiest to detect (you can measure it without labels). Concept drift is the most dangerous (model accuracy drops silently) and hardest to detect (requires ground truth labels, which are often delayed).
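Because data drift can be measured without labels, it reduces to comparing two samples of a feature. A minimal pure-Python sketch of the two-sample Kolmogorov-Smirnov statistic (real pipelines would use scipy.stats.ks_2samp, which also returns a p-value; the sample values here are illustrative):

```python
def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic:
    the maximum gap between the two empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    values = sorted(set(ref) | set(cur))

    def ecdf(sample, x):
        # Fraction of the sample at or below x
        return sum(1 for v in sample if v <= x) / len(sample)

    return max(abs(ecdf(ref, x) - ecdf(cur, x)) for x in values)

# Identical samples score 0; fully disjoint samples score 1
baseline = [50, 52, 55, 57, 60, 61, 63, 65]   # e.g., training avg_income
shifted  = [70, 72, 75, 78, 80, 82, 85, 90]   # e.g., production avg_income
print(ks_statistic(baseline, baseline))  # 0.0
print(ks_statistic(baseline, shifted))   # 1.0
```

The statistic alone is not a verdict; in practice you compare it against a critical value or p-value that depends on sample size.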
Data Drift Detection
Statistical tests for distribution shifts
Detection Methods
Data drift detection compares the distribution of production data against a reference dataset (typically training data or a recent baseline). The right statistical test depends on data type and size:

- Kolmogorov-Smirnov (KS) test: compares cumulative distributions of numerical features (best for small datasets, ≤1,000 samples).
- Jensen-Shannon divergence: measures the distance between two probability distributions (works well for larger datasets).
- Wasserstein distance: measures the "earth mover's distance" between distributions (sensitive to shape changes).
- Chi-squared test: compares categorical feature distributions.
- Population Stability Index (PSI): popular in finance; bins the distribution and measures divergence.

A feature is flagged as "drifted" if the test statistic exceeds a threshold (e.g., p-value < 0.05).
Drift Test Selection
// Statistical tests for drift detection

Numerical features:
  Small data (≤1K): KS test (p < 0.05)
  Large data (>1K): Jensen-Shannon div
  Shape-sensitive:  Wasserstein distance

Categorical features:
  Small data (≤1K): Chi-squared test
  Large data (>1K): Jensen-Shannon div

PSI (Population Stability Index):
  PSI < 0.1   → No significant drift
  PSI 0.1-0.2 → Moderate drift (investigate)
  PSI > 0.2   → Significant drift (action!)

Evidently AI auto-selects:
  Numerical + ≤1K → KS test
  Numerical + >1K → Wasserstein
  Categorical     → Chi-squared / JS div
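The PSI thresholds above come from a simple formula over binned frequencies: sum over bins of (ref% - cur%) * ln(ref% / cur%). A minimal sketch, where the bin edges and the small epsilon guarding against log(0) are illustrative choices:

```python
import math

def psi(reference, current, bin_edges, eps=1e-4):
    """Population Stability Index over pre-defined bins."""
    def bin_fractions(sample):
        counts = [0] * (len(bin_edges) + 1)
        for v in sample:
            # Index of the bin this value falls into
            i = sum(1 for edge in bin_edges if v >= edge)
            counts[i] += 1
        # eps keeps empty bins from producing log(0)
        return [max(c / len(sample), eps) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((r - c) * math.log(r / c) for r, c in zip(ref, cur))

ref = [40, 45, 50, 55, 60, 65, 70, 75]   # training incomes ($K)
cur = [60, 65, 70, 75, 80, 85, 90, 95]   # production incomes ($K)
edges = [50, 65, 80]
print(psi(ref, ref, edges))  # 0.0 → no drift
print(psi(ref, cur, edges))  # well above 0.2 → significant drift
```

Note that PSI is symmetric in the two samples, so it measures divergence without caring which direction the distribution moved.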
Key insight: Don’t test every feature individually — you’ll get false positives. Focus on the features with the highest feature importance in your model. If a low-importance feature drifts, it may not affect predictions at all.
Concept Drift Detection
When the rules of the game change
Detection Strategies
Concept drift is harder to detect because it requires ground truth labels, which are often delayed (fraud labels arrive weeks later, medical outcomes take months). Strategies:

- Direct monitoring: when labels are available, track accuracy, precision, recall, and F1 over time. A sustained drop signals concept drift.
- Prediction drift: when labels are unavailable, monitor the distribution of model predictions. If the model suddenly predicts "fraud" 3x more often, something changed.
- Proxy metrics: use business metrics as proxies (conversion rate, customer complaints, escalation rate).
- Window comparison: compare model performance on recent data vs. older data using a sliding window.
Concept Drift Types
// Types of concept drift

Sudden drift:
  ┌──────┐
  │ ████ │ ████████
  │ ████ │ ████████
  └──────┘
  // Abrupt change (e.g., new regulation)

Gradual drift:
  ████████
    ████████
      ████████
        ████████
  // Slow transition (e.g., user behavior)

Recurring drift:
  ████    ████    ████
      ████    ████
  // Seasonal patterns (e.g., holiday sales)

Detection without labels:
  - Monitor prediction distribution
  - Track confidence scores over time
  - Use business metrics as proxies
  - Compare recent vs. baseline windows
Key insight: Recurring drift (seasonal patterns) is often mistaken for concept drift. Before retraining, check if the pattern is seasonal. If your model sees this pattern every December, it’s not drift — it’s a known cycle that should be modeled explicitly.
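The label-free "prediction drift" signal can be approximated by comparing positive-prediction rates between a baseline window and a recent window. A minimal sketch, where the 3x ratio trigger and the label names are illustrative assumptions:

```python
def prediction_drift_alert(baseline_preds, recent_preds,
                           positive_label="fraud", max_ratio=3.0):
    """Flag drift if the positive-prediction rate in the recent
    window exceeds max_ratio times the baseline rate."""
    def rate(preds):
        return sum(1 for p in preds if p == positive_label) / len(preds)

    base_rate = rate(baseline_preds)
    recent_rate = rate(recent_preds)
    if base_rate == 0:
        return recent_rate > 0  # any positives are new behavior
    return recent_rate / base_rate > max_ratio

baseline = ["ok"] * 95 + ["fraud"] * 5    # 5% flagged at deployment
recent   = ["ok"] * 80 + ["fraud"] * 20   # suddenly 20% flagged
print(prediction_drift_alert(baseline, recent))  # True (4x jump)
```

A rate shift like this tells you something changed, not what: it could be concept drift, data drift, or a seasonal cycle, which is why it routes to investigation rather than automatic retraining.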
Evidently AI
Open-source ML monitoring and testing
Evidently Overview
Evidently AI is the most popular open-source tool for ML monitoring. It provides pre-built reports (visual dashboards for data drift, model quality, target drift) and test suites (automated checks that pass/fail). Key features:

- Data Drift Report: compares reference and current data distributions for all features, auto-selecting appropriate statistical tests.
- Data Quality Report: checks for missing values, duplicates, out-of-range values, and new categories.
- Model Performance Report: tracks accuracy, precision, recall, F1, and AUC over time.
- Test Suites: define pass/fail conditions (e.g., "no more than 30% of features should drift") for CI/CD integration.

Evidently works as a Python library, generates HTML reports, and integrates with Grafana for real-time dashboards.
Evidently Usage
# Evidently AI data drift detection
from evidently.report import Report
from evidently.metric_preset import (
    DataDriftPreset,
    DataQualityPreset,
)

# Compare reference vs current data
report = Report(metrics=[
    DataDriftPreset(),
    DataQualityPreset(),
])
report.run(
    reference_data=train_df,
    current_data=production_df,
)

# HTML dashboard
report.save_html("drift_report.html")

# Programmatic access
result = report.as_dict()
drift_share = result["metrics"][0]["result"]["share_of_drifted_columns"]
if drift_share > 0.3:
    trigger_alert("30%+ features drifted!")
Key insight: Evidently’s test suites are the key to CI/CD integration. Run them as part of your data pipeline: if the test suite fails (too much drift, data quality issues), block the pipeline and alert the team before the model makes bad predictions.
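A CI/CD gate can consume the output of report.as_dict() directly. This sketch assumes the result-dict shape shown in the usage example (a share_of_drifted_columns field in the first metric's result); the 30% threshold and the blocking behavior are illustrative:

```python
def drift_gate(report_dict, max_drift_share=0.30):
    """Return (passed, share) for a pipeline gate over an
    Evidently-style data drift report dictionary."""
    share = report_dict["metrics"][0]["result"]["share_of_drifted_columns"]
    return share <= max_drift_share, share

# Simulated report output (shape assumed from the example above)
report_dict = {
    "metrics": [
        {"result": {"share_of_drifted_columns": 0.45}},
    ]
}
passed, share = drift_gate(report_dict)
if not passed:
    print(f"Blocking pipeline: {share:.0%} of features drifted")
```

In a real pipeline you would use Evidently's TestSuite for this, since it encodes the pass/fail logic itself; the point here is that the gate is just a threshold check on the report output.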
Model Performance Monitoring
Tracking accuracy, latency, and business metrics
What to Monitor
Monitor at three levels:

- Infrastructure metrics: CPU/GPU utilization, memory, request latency (p50, p95, p99), throughput (requests/second), error rates. Use Prometheus + Grafana.
- Model metrics: prediction distribution, confidence scores, feature distributions, and model accuracy when labels are available. Use Evidently, Arize, or Fiddler.
- Business metrics: conversion rate, revenue impact, user satisfaction, escalation rate. These are the ultimate measure of model value; connect model predictions to business outcomes.

Set up dashboards for each level and alerts for anomalies. The most common mistake is monitoring only infrastructure while ignoring model and business metrics.
Monitoring Stack
// Three-level monitoring stack

Level 1: Infrastructure
  Tool: Prometheus + Grafana
  Metrics:
    - Request latency (p50, p95, p99)
    - Throughput (req/sec)
    - GPU utilization (%)
    - Error rate (4xx, 5xx)
    - Memory usage

Level 2: Model
  Tool: Evidently AI / Arize
  Metrics:
    - Prediction distribution
    - Feature drift (per feature)
    - Accuracy / F1 (when labels arrive)
    - Confidence score distribution

Level 3: Business
  Tool: Custom dashboards / Looker
  Metrics:
    - Conversion rate
    - Revenue per prediction
    - User satisfaction (NPS)
    - Escalation rate
Key insight: A model can have perfect infrastructure metrics (low latency, zero errors) while producing terrible predictions. Always monitor at the model and business level, not just infrastructure.
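The p50/p95/p99 latencies above are just percentiles over a window of request timings. A minimal sketch using nearest-rank percentiles (production systems like Prometheus estimate these from histogram buckets rather than storing raw samples; the latency values are made up):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at
    least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Simulated request latencies in ms: mostly fast, a slow tail
latencies = [12, 14, 15, 15, 16, 18, 20, 22, 95, 240]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies, p)} ms")
```

The example also shows why tail percentiles matter: the median is 16 ms, but the p95 is dominated by the slow outliers that averages would hide.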
Alerting Strategies
When to page, when to ticket, when to ignore
Alert Design
Bad alerting is worse than no alerting: alert fatigue causes teams to ignore real problems. Design alerts with severity tiers:

- P0 (page immediately): model serving is down, error rate > 5%, latency > 10x baseline.
- P1 (ticket, fix today): significant data drift (> 30% of features), model accuracy dropped > 10%, a data quality check failed.
- P2 (review this week): moderate drift (10-30% of features), slight accuracy decline, cost anomaly.
- P3 (informational): minor drift, new feature values observed, usage patterns changed.

Use anomaly detection on metrics rather than static thresholds: a 2% accuracy drop might be normal variance or a catastrophic signal depending on context.
Alert Configuration
// Alerting tiers

P0 — Page (PagerDuty/Opsgenie):
  error_rate > 5%
  latency_p99 > 10s
  serving_down == true
  // Action: wake someone up

P1 — Ticket (Jira/Linear):
  drift_share > 0.30
  accuracy_drop > 10%
  data_quality_fail == true
  // Action: fix today

P2 — Review (Slack):
  drift_share > 0.10
  accuracy_drop > 3%
  cost_anomaly == true
  // Action: review this week

P3 — Log (dashboard only):
  new_category_observed
  minor_distribution_shift
  // Action: informational
Key insight: Start with fewer, high-confidence alerts and add more as you learn your system’s behavior. It’s better to miss a P2 alert than to have the team ignore all alerts because of noise.
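The tiers above amount to an ordered series of threshold checks, evaluated worst-first so a metric snapshot maps to exactly one severity. A minimal sketch; the metric names and thresholds mirror the illustrative values in the tier list:

```python
def classify_alert(metrics):
    """Map a metrics snapshot (a dict) to a severity tier, P0 worst."""
    if (metrics.get("serving_down")
            or metrics.get("error_rate", 0) > 0.05
            or metrics.get("latency_p99_s", 0) > 10):
        return "P0"  # page immediately
    if (metrics.get("drift_share", 0) > 0.30
            or metrics.get("accuracy_drop", 0) > 0.10
            or metrics.get("data_quality_fail")):
        return "P1"  # ticket, fix today
    if (metrics.get("drift_share", 0) > 0.10
            or metrics.get("accuracy_drop", 0) > 0.03
            or metrics.get("cost_anomaly")):
        return "P2"  # review this week
    return "P3"      # informational / dashboard only

print(classify_alert({"error_rate": 0.08}))   # P0
print(classify_alert({"drift_share": 0.35}))  # P1
print(classify_alert({"drift_share": 0.15}))  # P2
print(classify_alert({}))                     # P3
```

Checking tiers in order from P0 down means a snapshot that trips multiple conditions always gets the most severe routing.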
Responding to Drift
Retrain, recalibrate, or replace
Response Playbook
When drift is detected, follow a structured response:

1. Diagnose: is it data drift, concept drift, or an upstream data issue? Check data quality first (a broken pipeline is more common than real drift).
2. Assess impact: is the model still performing acceptably? Check business metrics. Minor drift with no accuracy impact may not need action.
3. Choose a response:
   - Retrain (most common): retrain on recent data. Use continuous training if drift is frequent.
   - Recalibrate: adjust decision thresholds without retraining (e.g., raise the fraud threshold from 0.5 to 0.6).
   - Fallback: switch to a simpler, more robust model or a rule-based system.
   - Pause: stop serving predictions if the risk is too high.
4. Validate: confirm the fix works on recent data before deploying.
Response Decision Tree
// Drift response decision tree

Drift detected
│
├─ Data quality issue?
│   └─ Yes → Fix pipeline, no retrain
│
├─ Upstream schema change?
│   └─ Yes → Fix feature engineering
│
├─ Accuracy still acceptable?
│   ├─ Yes → Monitor closely, no action
│   └─ No  → Continue ↓
│
├─ Labels available for retrain?
│   ├─ Yes → Retrain on recent data
│   │        Validate → Deploy
│   └─ No  → Continue ↓
│
├─ Can recalibrate thresholds?
│   ├─ Yes → Adjust thresholds
│   └─ No  → Continue ↓
│
└─ High risk?
    ├─ Yes → Pause model, fallback
    └─ No  → Monitor, schedule retrain
Key insight: Most “drift” alerts are actually upstream data issues (broken pipelines, schema changes, missing values). Always check data quality before assuming the model needs retraining.
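The decision tree maps directly onto a top-to-bottom series of checks, with the cheapest diagnoses (data quality, schema) evaluated before any retraining is considered. A minimal sketch with boolean inputs for each branch (parameter names are illustrative):

```python
def drift_response(data_quality_issue, schema_change,
                   accuracy_acceptable, labels_available,
                   can_recalibrate, high_risk):
    """Walk the drift-response decision tree, returning the
    first action whose condition matches."""
    if data_quality_issue:
        return "fix pipeline, no retrain"
    if schema_change:
        return "fix feature engineering"
    if accuracy_acceptable:
        return "monitor closely, no action"
    if labels_available:
        return "retrain on recent data, validate, deploy"
    if can_recalibrate:
        return "adjust thresholds"
    if high_risk:
        return "pause model, fallback"
    return "monitor, schedule retrain"

# A broken pipeline masquerading as drift: fix the data, not the model
print(drift_response(True, False, False, False, False, False))
```

Encoding the playbook as code also makes the ordering explicit: a data quality issue short-circuits everything else, which matches the "check data quality first" rule.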
LLM-Specific Monitoring
Monitoring challenges unique to LLM applications
LLM Monitoring Challenges
LLMs present unique monitoring challenges:

- No ground truth: there is no "correct answer" for most LLM outputs, so traditional accuracy metrics don't apply.
- Non-determinism: the same input can produce different outputs.
- Provider changes: OpenAI and Anthropic update models without notice, changing behavior.
- Prompt sensitivity: small prompt changes cause large output changes.

What to monitor: output quality (LLM-as-judge scores, user feedback), safety (toxicity scores, PII leakage), hallucination rate (claims not grounded in source documents), cost per interaction, latency (time to first token, total generation time), and user engagement (completion rate, follow-up questions, thumbs up/down).
LLM Monitoring Metrics
// LLM-specific monitoring

Quality metrics:
  - LLM-as-judge score (1-5)
  - User thumbs up/down ratio
  - Task completion rate
  - Hallucination rate (vs source docs)

Safety metrics:
  - Toxicity score (per response)
  - PII leakage incidents
  - Prompt injection attempts
  - Guardrail trigger rate

Operational metrics:
  - Time to first token (TTFT)
  - Total generation time
  - Tokens per response
  - Cost per interaction
  - Cache hit rate

Engagement metrics:
  - Conversation length
  - Follow-up question rate
  - Escalation to human rate
  - Session abandonment rate
Key insight: For LLMs, user feedback is the most reliable quality signal. Implement thumbs up/down on every response, and use low-rated responses to build your evaluation dataset. This creates a virtuous cycle: monitoring feeds evaluation feeds improvement.
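The thumbs up/down feedback loop can be sketched as a collector that tracks the approval ratio and routes low-rated responses into the evaluation dataset. A minimal sketch; the class and method names are illustrative, not a specific library's API:

```python
class FeedbackCollector:
    """Aggregate per-response ratings and harvest failures
    for the evaluation dataset."""
    def __init__(self):
        self.records = []   # (prompt, response, thumbs_up) tuples

    def record(self, prompt, response, thumbs_up):
        self.records.append((prompt, response, thumbs_up))

    def approval_ratio(self):
        # Share of responses rated thumbs-up; None before any feedback
        if not self.records:
            return None
        ups = sum(1 for _, _, up in self.records if up)
        return ups / len(self.records)

    def eval_candidates(self):
        # Low-rated responses seed the next evaluation round
        return [(p, r) for p, r, up in self.records if not up]

fb = FeedbackCollector()
fb.record("summarize doc A", "a fine summary", thumbs_up=True)
fb.record("summarize doc B", "a wrong summary", thumbs_up=False)
print(fb.approval_ratio())        # 0.5
print(len(fb.eval_candidates()))  # 1
```

The approval ratio becomes a dashboard metric to alert on, while the harvested failures close the loop: monitoring feeds evaluation feeds improvement.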