Ch 11 — Drift, Debugging & Alerts

Detecting silent degradation, diagnosing root causes, and building alerting systems
High Level
Drift → Detect → Debug → Root Cause → Alerts → Recover
Understanding Drift in LLM Systems
Your system degrades silently — here’s why and how
What Is Drift?
Drift is the gradual, silent degradation of your LLM system’s performance over time. Unlike a crash or error, drift doesn’t trigger alarms. Your system keeps running, returning 200 OK, while quality slowly erodes. Users notice before your metrics do — unless you’re specifically watching for it.
Types of Drift
Data drift: The distribution of user queries changes over time. Users start asking questions your system wasn’t designed for
Model drift: The LLM provider silently updates the model. Output style, accuracy, and behavior shift without notice
Concept drift: The world changes and your system’s knowledge becomes stale. Facts that were correct become wrong
Retrieval drift: Your document corpus grows, ages, or develops contradictions. RAG quality degrades as the knowledge base evolves
Why Drift Is Dangerous
Drift is the most common cause of production LLM failures and the hardest to detect because:

No error signals: HTTP 200, no exceptions, no crashes
Gradual: 0.5% quality drop per week is invisible day-to-day but devastating over months
Multi-causal: Often caused by multiple factors interacting, making root cause analysis hard
Delayed feedback: Users may not report degradation for weeks
Critical: OpenAI, Anthropic, and Google update their models without notice. A “minor improvement” to GPT-4o can change your system’s behavior. Run your eval suite weekly even when you haven’t changed anything on your end.
Drift Detection Methods
Catching degradation before users notice
Statistical Detection
Rolling averages: Compare this week’s quality scores to the 4-week rolling average. Flag if deviation exceeds 2σ
Distribution comparison: Use KL divergence or Jensen-Shannon divergence to compare input/output distributions across time windows
Trend analysis: Fit a regression line to weekly quality scores. A negative slope indicates drift even if individual weeks look normal
Embedding drift: Track the centroid of query embeddings over time. Movement indicates changing user behavior
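The rolling-average check above can be sketched in a few lines. This is a minimal illustration, assuming weekly quality scores in [0, 1]; the function name and inputs are hypothetical, not from any specific library:

```python
# Minimal sketch of rolling-average drift detection: flag the current
# score if it deviates from the recent mean by more than 2 sigma.
from statistics import mean, stdev

def is_drifting(history: list[float], current: float, sigmas: float = 2.0) -> bool:
    """Compare `current` to the rolling window in `history`."""
    mu = mean(history)
    sd = stdev(history)
    return abs(current - mu) > sigmas * sd

# Four stable weeks, then a candidate score:
baseline = [0.91, 0.90, 0.92, 0.91]
print(is_drifting(baseline, 0.90))  # within 2 sigma -> False
print(is_drifting(baseline, 0.80))  # well outside 2 sigma -> True
```

Note that a very stable history shrinks the standard deviation, making the check more sensitive; in practice you may want a minimum absolute threshold as well.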
Scheduled Eval Runs
The simplest and most effective drift detection: run your eval suite on a fixed schedule.

Daily: Run a quick subset (20 examples) against production
Weekly: Run the full eval suite (200+ examples)
Monthly: Run extended eval with human review of a sample

Compare each run to the baseline. Any metric drop >2% triggers investigation.
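The baseline comparison can be as simple as a dict diff. A sketch, assuming each eval run produces a metric-name-to-score dict (the metric names here are illustrative; the 2% threshold is from the schedule above):

```python
# Flag any metric that dropped more than `threshold` vs. the baseline run.
def regressed_metrics(baseline: dict[str, float],
                      current: dict[str, float],
                      threshold: float = 0.02) -> list[str]:
    """Return metric names whose score fell by more than the threshold."""
    return [name for name, base in baseline.items()
            if base - current.get(name, 0.0) > threshold]

baseline_run = {"faithfulness": 0.92, "relevance": 0.88}
weekly_run   = {"faithfulness": 0.85, "relevance": 0.88}
print(regressed_metrics(baseline_run, weekly_run))  # ['faithfulness']
```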
Canary Queries
Canary queries are fixed test inputs with known-good outputs that you run against production continuously. Like a canary in a coal mine — if the canary dies, something changed. Run 10–20 canary queries every hour. If any canary response changes significantly, alert immediately.
Key insight: Canary queries are the cheapest and fastest drift detection method. 20 queries/hour costs ~$1/day with GPT-4o-mini but catches model-level drift within an hour of it happening.
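One cheap way to decide whether a canary response "changed significantly" is a string-similarity ratio against the recorded baseline. This is a hypothetical sketch using stdlib `difflib`; in practice embedding similarity or an LLM judge would be more robust:

```python
# Flag canaries whose fresh response drifted from the known-good baseline.
from difflib import SequenceMatcher

def changed_canaries(baselines: dict[str, str],
                     responses: dict[str, str],
                     min_similarity: float = 0.8) -> list[str]:
    """Return canary ids whose response similarity fell below threshold."""
    return [cid for cid, base in baselines.items()
            if SequenceMatcher(None, base, responses[cid]).ratio() < min_similarity]

baselines = {"refund-policy": "Refunds are issued within 14 days."}
fresh     = {"refund-policy": "We no longer offer refunds."}
print(changed_canaries(baselines, fresh))  # ['refund-policy']
```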
Debugging LLM Systems
A systematic approach to finding what went wrong
The Debugging Workflow
LLM debugging is fundamentally different from traditional software debugging. There’s no stack trace, no line number, no deterministic reproduction. Instead, use this systematic approach:

1. Reproduce: Can you trigger the same failure with the same input?
2. Isolate: Which component is failing? Retrieval? Generation? Guardrails?
3. Compare: What’s different between working and failing cases?
4. Hypothesize: Form a theory about the root cause
5. Test: Modify one variable and re-run to confirm the hypothesis
Trace-Based Debugging
Traces are your primary debugging tool. For a failing request, examine:

Input: Was the user query unusual or ambiguous?
Retrieval: Were the right documents retrieved? Were they relevant?
Prompt: What did the full prompt look like with context injected?
Model response: What exactly did the model output?
Post-processing: Did guardrails modify or block the response?

Most failures become obvious once you can see the full trace.
Pro tip: Build a “debug mode” that lets you replay any production trace locally with full visibility. This means storing the complete input, retrieved documents, and model output for every request (or a sample). The storage cost is minimal compared to the debugging time saved.
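A trace record for replay can be a small serializable structure. The schema below is an assumption for illustration, keyed to the five items listed above:

```python
# Hypothetical trace record: everything needed to replay a request locally.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class Trace:
    request_id: str
    user_query: str            # Input: was the query unusual?
    retrieved_docs: list[str]  # Retrieval: which documents were used?
    full_prompt: str           # Prompt: exactly what the model saw
    model_output: str          # Model response: raw output
    guardrail_actions: list[str] = field(default_factory=list)  # Post-processing

    def to_json(self) -> str:
        return json.dumps(asdict(self))

trace = Trace("req-42", "What is the refund window?",
              ["policy.md#refunds"], "<system>...</system><context>...</context>",
              "Refunds are issued within 14 days.")
# Persist to_json() per request (or a sample); later, reload for replay:
restored = Trace(**json.loads(trace.to_json()))
```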
Root Cause Analysis
The five most common failure categories and how to diagnose them
Failure Category Map
// Symptom → Likely root cause
Wrong facts
  → Retrieval failure (wrong docs)
  → Stale knowledge base
  → Hallucination (no grounding)
Off-topic response
  → Query misclassification
  → System prompt ignored
  → Prompt injection success
Quality drop (gradual)
  → Model provider update
  → Data drift (new query types)
  → Retrieval corpus degradation
Quality drop (sudden)
  → Prompt change regression
  → Config change (temperature, etc.)
  → API version change
Inconsistent behavior
  → High temperature setting
  → Non-deterministic retrieval
  → Load balancer routing to different models
The Isolation Method
When you can’t identify the root cause, isolate each component:

1. Test retrieval alone: Are the right documents being retrieved for the failing query? If not, the problem is retrieval
2. Test generation with known-good context: Give the model perfect context manually. If it still fails, the problem is the model or prompt
3. Test with a different model: Same prompt, different model. If the other model works, the problem is model-specific
4. Test with the previous prompt: Revert to the last known-good prompt. If the failure disappears, the problem is a recent prompt change
Key insight: 70% of LLM production issues trace back to one of three causes: (1) retrieval returning wrong documents, (2) prompt changes that regressed other cases, or (3) model provider silent updates. Check these three first.
Building an Alerting System
The right alerts at the right time to the right people
Alert Tiers
Tier 1 — Critical (page immediately):
• Safety violation rate > 0.1%
• System completely down (error rate > 50%)
• Cost exceeding 2x daily budget

Tier 2 — Warning (Slack notification):
• Quality score drops > 5% from baseline
• Latency p95 exceeds SLA
• Hallucination rate > 10%

Tier 3 — Informational (daily digest):
• Quality score drops 1–5%
• Cost trending 20% above forecast
• New query patterns detected
Avoiding Alert Fatigue
Alert fatigue is when teams get so many alerts they start ignoring them. It’s the #1 reason alerting systems fail. Prevention strategies:

Fewer, smarter alerts: 5 well-tuned alerts beat 50 noisy ones
Deduplication: Group related alerts into a single notification
Cooldown periods: Don’t re-alert for the same issue within 4 hours
Actionable only: Every alert should have a clear action. If you can’t act on it, it’s not an alert — it’s a metric
Regular review: Monthly review of alert frequency. Tune or remove alerts that fire too often without action
Pro tip: Start with just 3 alerts: (1) safety violation rate above threshold, (2) quality score below threshold, (3) cost above budget. Add more only when you have a specific incident that would have been caught by a new alert.
Recovery Strategies
What to do when things go wrong in production
Immediate Response
Rollback: Revert to the last known-good configuration (prompt, model version, retrieval config). This is your fastest recovery option — have a rollback procedure documented and tested
Feature flag: Disable the affected feature while keeping the rest of the system running
Fallback model: Switch to a backup model (e.g., GPT-4o-mini as fallback for GPT-4o) that may be less capable but more stable
Human-in-the-loop: Route affected queries to human agents while investigating
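The fallback-model option can be wired as an ordered chain of callables. A minimal sketch under the assumption that each model is wrapped as a function taking a prompt; the stand-in functions are hypothetical:

```python
# Try each model in order; return the first successful response.
def generate_with_fallback(prompt: str, models: list) -> str:
    errors = []
    for call in models:
        try:
            return call(prompt)
        except Exception as exc:  # in production, catch specific API errors
            errors.append(exc)
    raise RuntimeError(f"all models failed: {errors}")

def flaky_primary(prompt):    # stand-in for the primary model (e.g. GPT-4o)
    raise TimeoutError("primary down")

def stable_fallback(prompt):  # stand-in for the backup (e.g. GPT-4o-mini)
    return f"fallback answer to: {prompt}"

print(generate_with_fallback("refund policy?", [flaky_primary, stable_fallback]))
```

Catching only specific, retryable API errors (timeouts, rate limits, 5xx) matters in practice; a bare `except Exception` would also swallow bugs in your own code.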
Post-Incident Process
After every significant incident:

1. Timeline: When did it start? When was it detected? When was it resolved?
2. Impact: How many users were affected? What was the severity?
3. Root cause: What specifically caused the issue?
4. Detection gap: Why didn’t we catch it sooner? What monitoring was missing?
5. Prevention: What changes prevent recurrence?
6. Eval update: Add the failure case to your eval dataset
Key insight: The most important recovery action is adding the failure to your eval dataset. This ensures the same failure is caught automatically in the future. Over time, your eval dataset becomes a comprehensive record of every failure mode your system has encountered.
A/B Testing LLM Changes
Safely deploying changes with controlled experiments
Why A/B Test LLM Changes
Offline eval tells you if a change is likely better. A/B testing tells you if it’s actually better in production. The gap between offline and online performance can be significant because:

• Eval datasets don’t perfectly represent production traffic
• User behavior changes in response to different outputs
• Edge cases in production aren’t captured in eval sets
• Latency and cost differences affect real user experience
A/B Testing Flow
1. Offline eval passes: New config meets quality thresholds
2. Deploy to 5% of traffic: Small exposure limits blast radius
3. Monitor for 48 hours: Compare quality, safety, latency, and cost
4. Statistical significance: Wait until you have enough data to be confident
5. Ramp to 50%, then 100%: Gradually increase exposure
6. Keep the old config ready: Instant rollback if issues emerge at scale
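Step 2's traffic split is commonly done with deterministic hash-based bucketing, so a user always sees the same variant. A sketch under that assumption (the 5% figure comes from the flow above):

```python
# Stable user-to-variant assignment via hashing: no per-user state needed.
import hashlib

def variant_for(user_id: str, treatment_pct: float = 5.0) -> str:
    """Same user always lands in the same bucket."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 100.0  # uniform in 0.00-99.99
    return "treatment" if bucket < treatment_pct else "control"

# Deterministic: repeated calls for one user agree.
print(variant_for("user-123") == variant_for("user-123"))  # True
```

Ramping from 5% to 50% to 100% is then just raising `treatment_pct`, and users already in treatment stay there.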
What to Measure
Quality metrics: LLM judge scores on a sample of both variants
User signals: Thumbs up/down, regeneration rate, session length
Safety: Violation rate in both variants
Operational: Latency, cost, error rate
Business metrics: Conversion, retention, task completion rate
Key insight: A/B testing is especially important for prompt changes, which are the most common and most impactful changes to LLM systems. A prompt that improves one use case often regresses another. A/B testing catches this in production before full rollout.
The Production Readiness Checklist
Everything you need before going live — and staying live
Pre-Launch Checklist
// Before going to production
[✓] Eval dataset: 50+ examples, all passing
[✓] CI/CD eval gate: blocks bad deployments
[✓] Input guardrails: injection, PII, content
[✓] Output guardrails: safety, grounding, PII
[✓] Tracing: every request logged with full context
[✓] Cost tracking: per-query, per-feature, daily budget
[✓] Latency monitoring: p50, p95, p99, TTFT
[✓] Quality monitoring: LLM judge on 5% sample
[✓] Alerting: safety, quality, cost (3 alerts min)
[✓] Rollback plan: documented and tested
[✓] Canary queries: 10+ running hourly
Ongoing Operations
Weekly: Full eval suite run, review quality trends, check canary results
Monthly: Human evaluation of 100 production samples, alert tuning review, eval dataset expansion from production failures
Quarterly: Full system audit, model comparison (should we switch?), cost optimization review
On every change: CI/CD eval, A/B test for significant changes, rollback plan ready
Next up: Chapter 12 brings it all together with the Eval-First Mindset — building evaluation into your team’s culture, not just your CI/CD pipeline.