Ch 11 — Drift, Debugging & Alerts

Detecting silent degradation, diagnosing root causes, and building alerting systems
High Level
Drift → Detect → Debug → Root Cause → Alerts → Recover
Understanding Drift in LLM Systems
Your system degrades silently — here’s why and how
What Is Drift?
Drift is the gradual, silent degradation of your LLM system’s performance over time. Unlike a crash or error, drift doesn’t trigger alarms. Your system keeps running, returning 200 OK, while quality slowly erodes. Users notice before your metrics do — unless you’re specifically watching for it.
Types of Drift
Data drift: The distribution of user queries changes over time. Users start asking questions your system wasn’t designed for
Model drift: The LLM provider silently updates the model. Output style, accuracy, and behavior shift without notice
Concept drift: The world changes and your system’s knowledge becomes stale. Facts that were correct become wrong
Retrieval drift: Your document corpus grows, ages, or develops contradictions. RAG quality degrades as the knowledge base evolves
Why Drift Is Dangerous
Drift is the most common cause of production LLM failures and the hardest to detect because:

No error signals: HTTP 200, no exceptions, no crashes
Gradual: 0.5% quality drop per week is invisible day-to-day but devastating over months
Multi-causal: Often caused by multiple factors interacting, making root cause analysis hard
Delayed feedback: Users may not report degradation for weeks
Critical: OpenAI, Anthropic, and Google update their models without notice. A “minor improvement” to GPT-4o can change your system’s behavior. Run your eval suite weekly even when you haven’t changed anything on your end.
Drift Detection Methods
Catching degradation before users notice
Statistical Detection
Rolling averages: Compare this week’s quality scores to the 4-week rolling average. Flag if deviation exceeds 2σ
Distribution comparison: Use KL divergence or Jensen-Shannon divergence to compare input/output distributions across time windows
Trend analysis: Fit a regression line to weekly quality scores. A negative slope indicates drift even if individual weeks look normal
Embedding drift: Track the centroid of query embeddings over time. Movement indicates changing user behavior
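The rolling-average check above can be sketched in a few lines. This is a minimal illustration, assuming weekly quality scores in [0, 1]; the function name and inputs are hypothetical, not from any specific library:

```python
# Minimal sketch of rolling-average drift detection: flag the current
# score if it deviates from the recent mean by more than 2 sigma.
from statistics import mean, stdev

def is_drifting(history: list[float], current: float, sigmas: float = 2.0) -> bool:
    """Compare `current` to the rolling window in `history`."""
    mu = mean(history)
    sd = stdev(history)
    return abs(current - mu) > sigmas * sd

# Four stable weeks, then a candidate score:
baseline = [0.91, 0.90, 0.92, 0.91]
print(is_drifting(baseline, 0.90))  # within 2 sigma -> False
print(is_drifting(baseline, 0.80))  # well outside 2 sigma -> True
```

Note that a very stable history shrinks the standard deviation, making the check more sensitive; in practice you may want a minimum absolute threshold as well.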
Scheduled Eval Runs
The simplest and most effective drift detection: run your eval suite on a fixed schedule.

Daily: Run a quick subset (20 examples) against production
Weekly: Run the full eval suite (200+ examples)
Monthly: Run extended eval with human review of a sample

Compare each run to the baseline. Any metric drop >2% triggers investigation.
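The baseline comparison can be as simple as a dict diff. A sketch, assuming each eval run produces a metric-name-to-score dict (the metric names here are illustrative; the 2% threshold is from the schedule above):

```python
# Flag any metric that dropped more than `threshold` vs. the baseline run.
def regressed_metrics(baseline: dict[str, float],
                      current: dict[str, float],
                      threshold: float = 0.02) -> list[str]:
    """Return metric names whose score fell by more than the threshold."""
    return [name for name, base in baseline.items()
            if base - current.get(name, 0.0) > threshold]

baseline_run = {"faithfulness": 0.92, "relevance": 0.88}
weekly_run   = {"faithfulness": 0.85, "relevance": 0.88}
print(regressed_metrics(baseline_run, weekly_run))  # ['faithfulness']
```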
Canary Queries
Canary queries are fixed test inputs with known-good outputs that you run against production continuously. Like a canary in a coal mine — if the canary dies, something changed. Run 10–20 canary queries every hour. If any canary response changes significantly, alert immediately.
Key insight: Canary queries are the cheapest and fastest drift detection method. 20 queries/hour costs ~$1/day with GPT-4o-mini but catches model-level drift within an hour of it happening.
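One cheap way to decide whether a canary response "changed significantly" is a string-similarity ratio against the recorded baseline. This is a hypothetical sketch using stdlib `difflib`; in practice embedding similarity or an LLM judge would be more robust:

```python
# Flag canaries whose fresh response drifted from the known-good baseline.
from difflib import SequenceMatcher

def changed_canaries(baselines: dict[str, str],
                     responses: dict[str, str],
                     min_similarity: float = 0.8) -> list[str]:
    """Return canary ids whose response similarity fell below threshold."""
    return [cid for cid, base in baselines.items()
            if SequenceMatcher(None, base, responses[cid]).ratio() < min_similarity]

baselines = {"refund-policy": "Refunds are issued within 14 days."}
fresh     = {"refund-policy": "We no longer offer refunds."}
print(changed_canaries(baselines, fresh))  # ['refund-policy']
```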
Debugging LLM Systems
A systematic approach to finding what went wrong
The Debugging Workflow
LLM debugging is fundamentally different from traditional software debugging. There’s no stack trace, no line number, no deterministic reproduction. Instead, use this systematic approach:

1. Reproduce: Can you trigger the same failure with the same input?
2. Isolate: Which component is failing? Retrieval? Generation? Guardrails?
3. Compare: What’s different between working and failing cases?
4. Hypothesize: Form a theory about the root cause
5. Test: Modify one variable and re-run to confirm the hypothesis
Trace-Based Debugging
Traces are your primary debugging tool. For a failing request, examine:

Input: Was the user query unusual or ambiguous?
Retrieval: Were the right documents retrieved? Were they relevant?
Prompt: What did the full prompt look like with context injected?
Model response: What exactly did the model output?
Post-processing: Did guardrails modify or block the response?

Most failures become obvious once you can see the full trace.
Pro tip: Build a “debug mode” that lets you replay any production trace locally with full visibility. This means storing the complete input, retrieved documents, and model output for every request (or a sample). The storage cost is minimal compared to the debugging time saved.
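A trace record for replay can be a small serializable structure. The schema below is an assumption for illustration, keyed to the five items listed above:

```python
# Hypothetical trace record: everything needed to replay a request locally.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class Trace:
    request_id: str
    user_query: str            # Input: was the query unusual?
    retrieved_docs: list[str]  # Retrieval: which documents were used?
    full_prompt: str           # Prompt: exactly what the model saw
    model_output: str          # Model response: raw output
    guardrail_actions: list[str] = field(default_factory=list)  # Post-processing

    def to_json(self) -> str:
        return json.dumps(asdict(self))

trace = Trace("req-42", "What is the refund window?",
              ["policy.md#refunds"], "<system>...</system><context>...</context>",
              "Refunds are issued within 14 days.")
# Persist to_json() per request (or a sample); later, reload for replay:
restored = Trace(**json.loads(trace.to_json()))
```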
Root Cause Analysis
The five most common failure categories and how to diagnose them
Failure Category Map
// Symptom → Likely root cause
Wrong facts
  → Retrieval failure (wrong docs)
  → Stale knowledge base
  → Hallucination (no grounding)
Off-topic response
  → Query misclassification
  → System prompt ignored
  → Prompt injection success
Quality drop (gradual)
  → Model provider update
  → Data drift (new query types)
  → Retrieval corpus degradation
Quality drop (sudden)
  → Prompt change regression
  → Config change (temperature, etc.)
  → API version change
Inconsistent behavior
  → High temperature setting
  → Non-deterministic retrieval
  → Load balancer routing to different models
The Isolation Method
When you can’t identify the root cause, isolate each component:

1. Test retrieval alone: Are the right documents being retrieved for the failing query? If not, the problem is retrieval
2. Test generation with known-good context: Give the model perfect context manually. If it still fails, the problem is the model or prompt
3. Test with a different model: Same prompt, different model. If the other model works, the problem is model-specific
4. Test with the previous prompt: Revert to the last known-good prompt. If the failure disappears, the problem is a recent prompt change
Key insight: 70% of LLM production issues trace back to one of three causes: (1) retrieval returning wrong documents, (2) prompt changes that regressed other cases, or (3) model provider silent updates. Check these three first.
Building an Alerting System
The right alerts at the right time to the right people
Alert Tiers
Tier 1 — Critical (page immediately):
• Safety violation rate > 0.1%
• System completely down (error rate > 50%)
• Cost exceeding 2x daily budget

Tier 2 — Warning (Slack notification):
• Quality score drops > 5% from baseline
• Latency p95 exceeds SLA
• Hallucination rate > 10%

Tier 3 — Informational (daily digest):
• Quality score drops 1–5%
• Cost trending 20% above forecast
• New query patterns detected
Avoiding Alert Fatigue
Alert fatigue is when teams get so many alerts they start ignoring them. It’s the #1 reason alerting systems fail. Prevention strategies:

Fewer, smarter alerts: 5 well-tuned alerts beat 50 noisy ones
Deduplication: Group related alerts into a single notification
Cooldown periods: Don’t re-alert for the same issue within 4 hours
Actionable only: Every alert should have a clear action. If you can’t act on it, it’s not an alert — it’s a metric
Regular review: Monthly review of alert frequency. Tune or remove alerts that fire too often without action
Pro tip: Start with just 3 alerts: (1) safety violation rate above threshold, (2) quality score below threshold, (3) cost above budget. Add more only when you have a specific incident that would have been caught by a new alert.
Recovery Strategies
What to do when things go wrong in production
Immediate Response
Rollback: Revert to the last known-good configuration (prompt, model version, retrieval config). This is your fastest recovery option — have a rollback procedure documented and tested
Feature flag: Disable the affected feature while keeping the rest of the system running
Fallback model: Switch to a backup model (e.g., GPT-4o-mini as fallback for GPT-4o) that may be less capable but more stable
Human-in-the-loop: Route affected queries to human agents while investigating
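The fallback-model option can be wired as an ordered chain of callables. A minimal sketch under the assumption that each model is wrapped as a function taking a prompt; the stand-in functions are hypothetical:

```python
# Try each model in order; return the first successful response.
def generate_with_fallback(prompt: str, models: list) -> str:
    errors = []
    for call in models:
        try:
            return call(prompt)
        except Exception as exc:  # in production, catch specific API errors
            errors.append(exc)
    raise RuntimeError(f"all models failed: {errors}")

def flaky_primary(prompt):    # stand-in for the primary model (e.g. GPT-4o)
    raise TimeoutError("primary down")

def stable_fallback(prompt):  # stand-in for the backup (e.g. GPT-4o-mini)
    return f"fallback answer to: {prompt}"

print(generate_with_fallback("refund policy?", [flaky_primary, stable_fallback]))
```

Catching only specific, retryable API errors (timeouts, rate limits, 5xx) matters in practice; a bare `except Exception` would also swallow bugs in your own code.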
Post-Incident Process
After every significant incident:

1. Timeline: When did it start? When was it detected? When was it resolved?
2. Impact: How many users were affected? What was the severity?
3. Root cause: What specifically caused the issue?
4. Detection gap: Why didn’t we catch it sooner? What monitoring was missing?
5. Prevention: What changes prevent recurrence?
6. Eval update: Add the failure case to your eval dataset
Key insight: The most important recovery action is adding the failure to your eval dataset. This ensures the same failure is caught automatically in the future. Over time, your eval dataset becomes a comprehensive record of every failure mode your system has encountered.
A/B Testing LLM Changes
Safely deploying changes with controlled experiments
Why A/B Test LLM Changes
Offline eval tells you if a change is likely better. A/B testing tells you if it’s actually better in production. The gap between offline and online performance can be significant because:

• Eval datasets don’t perfectly represent production traffic
• User behavior changes in response to different outputs
• Edge cases in production aren’t captured in eval sets
• Latency and cost differences affect real user experience
A/B Testing Flow
1. Offline eval passes: New config meets quality thresholds
2. Deploy to 5% of traffic: Small exposure limits blast radius
3. Monitor for 48 hours: Compare quality, safety, latency, and cost
4. Statistical significance: Wait until you have enough data to be confident
5. Ramp to 50%, then 100%: Gradually increase exposure
6. Keep the old config ready: Instant rollback if issues emerge at scale
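Step 2's traffic split is commonly done with deterministic hash-based bucketing, so a user always sees the same variant. A sketch under that assumption (the 5% figure comes from the flow above):

```python
# Stable user-to-variant assignment via hashing: no per-user state needed.
import hashlib

def variant_for(user_id: str, treatment_pct: float = 5.0) -> str:
    """Same user always lands in the same bucket."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 100.0  # uniform in 0.00-99.99
    return "treatment" if bucket < treatment_pct else "control"

# Deterministic: repeated calls for one user agree.
print(variant_for("user-123") == variant_for("user-123"))  # True
```

Ramping from 5% to 50% to 100% is then just raising `treatment_pct`, and users already in treatment stay there.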
What to Measure
Quality metrics: LLM judge scores on a sample of both variants
User signals: Thumbs up/down, regeneration rate, session length
Safety: Violation rate in both variants
Operational: Latency, cost, error rate
Business metrics: Conversion, retention, task completion rate
Key insight: A/B testing is especially important for prompt changes, which are the most common and most impactful changes to LLM systems. A prompt that improves one use case often regresses another. A/B testing catches this in production before full rollout.
The Production Readiness Checklist
Everything you need before going live — and staying live
Pre-Launch Checklist
// Before going to production
[✓] Eval dataset: 50+ examples, all passing
[✓] CI/CD eval gate: blocks bad deployments
[✓] Input guardrails: injection, PII, content
[✓] Output guardrails: safety, grounding, PII
[✓] Tracing: every request logged with full context
[✓] Cost tracking: per-query, per-feature, daily budget
[✓] Latency monitoring: p50, p95, p99, TTFT
[✓] Quality monitoring: LLM judge on 5% sample
[✓] Alerting: safety, quality, cost (3 alerts min)
[✓] Rollback plan: documented and tested
[✓] Canary queries: 10+ running hourly
Ongoing Operations
Weekly: Full eval suite run, review quality trends, check canary results
Monthly: Human evaluation of 100 production samples, alert tuning review, eval dataset expansion from production failures
Quarterly: Full system audit, model comparison (should we switch?), cost optimization review
On every change: CI/CD eval, A/B test for significant changes, rollback plan ready
Next up: Chapter 12 brings it all together with the Eval-First Mindset — building evaluation into your team’s culture, not just your CI/CD pipeline.