Ch 9 — Production Observability

The 5 pillars of monitoring AI systems — cost, latency, quality, safety, and hallucination detection
High Level: Cost → Latency → Quality → Safety → Hallucination → Dashboard
Why Production Observability Is Different
Traditional monitoring doesn’t work for AI systems
The Observability Gap
Traditional software monitoring tracks uptime, error rates, and response times. For LLM systems, the server can return 200 OK while delivering a completely wrong, hallucinated, or harmful response. Your HTTP metrics will show green while your users are getting garbage. You need a fundamentally different monitoring approach.
The Five Pillars
Production LLM observability requires monitoring five dimensions simultaneously:

1. Cost: How much are you spending per query, per feature, per user?
2. Latency: How fast are responses? Where are the bottlenecks?
3. Quality: Are responses actually good? Is quality stable over time?
4. Safety: Are harmful outputs reaching users?
5. Hallucination: Is the system fabricating information?
Traces: The Foundation
The building block of LLM observability is the trace — a complete record of what happened for a single request. A trace captures:

Input: The user query and system prompt
Retrieval: Which documents were fetched (for RAG)
LLM call: Model, tokens in/out, latency, cost
Output: The generated response
Post-processing: Guardrail checks, formatting

Tools like Langfuse, Phoenix, and LangSmith capture traces automatically with OpenTelemetry-compatible instrumentation.
Key insight: Without traces, debugging a production issue is like debugging a program without logs. Traces are the single most important investment in LLM observability. Instrument from day one.
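At its core, a trace is just a structured record per request. A minimal sketch of the fields listed above (the schema is illustrative, not any specific tool's format):

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Trace:
    """One record per request: input, retrieval, LLM call, output, post-processing."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    user_query: str = ""
    system_prompt: str = ""
    retrieved_doc_ids: list = field(default_factory=list)  # RAG only
    model: str = ""
    tokens_in: int = 0
    tokens_out: int = 0
    latency_ms: float = 0.0
    cost_usd: float = 0.0
    output: str = ""
    guardrail_flags: list = field(default_factory=list)
    timestamp: float = field(default_factory=time.time)

trace = Trace(user_query="What is our refund policy?", model="gpt-4o",
              tokens_in=1200, tokens_out=250, latency_ms=1850.0)
```

Even this flat record is enough to answer "what happened on request X?"; dedicated tools add nesting (spans per pipeline stage) and a query UI on top.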
Pillar 1: Cost Tracking
Know exactly where every dollar goes
Why Cost Tracking Matters
LLM costs are unpredictable and can spike without warning. A runaway agent loop can burn $1,000 in minutes. A prompt change that adds 500 tokens can double your monthly bill. Without cost tracking, you discover budget overruns from your finance team, not your dashboard.
What to Track
Cost per query: Average and p95 cost per user request
Cost per model: Breakdown by GPT-4o vs Claude vs Haiku
Cost per feature: Which product features are most expensive?
Cost per user: Are power users driving disproportionate costs?
Daily/weekly budget: Set limits and alert when approaching
Cost Attribution
Break down costs by the dimensions that matter for your business. A typical attribution model:

By model: 60% GPT-4o, 25% embeddings, 15% Haiku
By pipeline stage: 40% generation, 35% retrieval, 25% guardrails
By feature: 50% chat, 30% search, 20% summarization

This tells you where to optimize. If embeddings are 25% of cost, switching to a cheaper embedding model has high ROI.
Budget rule: Set daily cost alerts at 80% of your budget. Set hard limits that automatically disable features or switch to cheaper models when exceeded. A $500/day budget with no alerting is a $15,000/month surprise waiting to happen.
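The 80% alert plus hard-limit rule can be sketched as a check run against accumulated daily spend (the thresholds and action names are illustrative):

```python
def budget_action(spent_usd: float, daily_budget_usd: float) -> str:
    """Return the action to take for the current daily spend."""
    if spent_usd >= daily_budget_usd:
        return "hard_limit"  # disable the feature or switch to a cheaper model
    if spent_usd >= 0.8 * daily_budget_usd:
        return "alert"       # notify on-call before the budget is blown
    return "ok"

assert budget_action(342.0, 500.0) == "ok"
assert budget_action(410.0, 500.0) == "alert"       # 82% of budget
assert budget_action(505.0, 500.0) == "hard_limit"
```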
Pillar 2: Latency Profiling
Users abandon after 3 seconds — know where your time goes
What to Measure
End-to-end latency: Total time from user query to response delivered
Time to first token (TTFT): How long before the user sees anything? Critical for streaming UX
Percentiles: Track p50 (median), p95 (most users), and p99 (worst case). Averages hide tail latency problems
Per-component breakdown: How much time in retrieval vs generation vs guardrails?
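Percentiles are cheap to compute from a window of recorded latencies; a sketch using the nearest-rank method:

```python
import math

def percentile(samples_ms, p):
    """Nearest-rank percentile: p in (0, 100]; samples need not be sorted."""
    ordered = sorted(samples_ms)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

latencies = [120, 150, 180, 200, 250, 300, 400, 900, 1500, 4000]
p50, p95 = percentile(latencies, 50), percentile(latencies, 95)
# The mean here (~800ms) hides the 4s tail that p95/p99 expose.
```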
Latency Decomposition
// Typical RAG pipeline latency
Embedding query     ~50ms
Vector search       ~30ms
Reranking          ~100ms
LLM generation    ~1500ms  ← bottleneck
Guardrail checks   ~200ms
Total             ~1880ms

// With streaming: TTFT ~300ms
// Users perceive this as fast
Optimization Levers
Streaming: Stream tokens to the user as they’re generated. TTFT drops from 2s to 300ms even though total time is the same
Parallel execution: Run retrieval and guardrail checks in parallel where possible
Model selection: Use faster models (GPT-4o-mini, Haiku) for simple queries, stronger models for complex ones
Caching: Cache embeddings, retrieval results, and even full responses for repeated queries
Prompt optimization: Shorter prompts = fewer tokens = faster generation
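Caching full responses for repeated queries can be as simple as a keyed store with a TTL. A minimal in-memory sketch (in production this would typically be Redis or similar, and semantic caching would embed the query instead of hashing it):

```python
import hashlib
import time

class ResponseCache:
    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (response, stored_at)

    def _key(self, query: str) -> str:
        # Exact-match keying after trivial normalization.
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get(self, query: str):
        entry = self._store.get(self._key(query))
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]
        return None  # miss or expired

    def put(self, query: str, response: str):
        self._store[self._key(query)] = (response, time.time())

cache = ResponseCache()
cache.put("What is your refund policy?", "Refunds within 30 days...")
hit = cache.get("what is your refund policy?  ")  # normalization catches trivial variants
```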
Key insight: Latency perception matters more than actual latency. Streaming makes a 3-second response feel instant because the user sees content appearing immediately. Always implement streaming for user-facing LLM applications.
Pillar 3: Quality Monitoring
Tracking response quality in real-time
Continuous Quality Scoring
Run an LLM judge on 5–10% of production responses to continuously score quality. Track multiple dimensions:

Relevancy: Does the response address the user’s actual question?
Helpfulness: Is it actually useful, or just technically correct but unhelpful?
Coherence: Does it make logical sense, or contradict itself?
Completeness: Does it cover all aspects of the question?

Plot these scores over time. A declining trend signals a problem before users complain.
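The sampling-plus-scoring loop can be sketched as follows, with a stubbed judge standing in for a real LLM call (the sample rate, dimension names, and `judge` signature are illustrative):

```python
import random

SAMPLE_RATE = 0.05  # score ~5% of production traffic
DIMENSIONS = ["relevancy", "helpfulness", "coherence", "completeness"]

def maybe_score(trace, judge, rng=random.random):
    """Send a sampled response to an LLM judge; return per-dimension scores or None."""
    if rng() >= SAMPLE_RATE:
        return None  # not sampled; most traffic skips scoring
    return {dim: judge(trace, dim) for dim in DIMENSIONS}

# Stub judge for illustration; a real judge would call an LLM with a rubric per dimension.
scores = maybe_score({"output": "..."}, judge=lambda t, d: 0.9, rng=lambda: 0.01)
```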
User Signals
Supplement automated scoring with implicit user feedback:

Thumbs up/down: Direct quality signal (if your UI supports it)
Regeneration rate: Users clicking “try again” indicates dissatisfaction
Follow-up questions: Indicate the first answer was incomplete
Session abandonment: User leaves without resolution
Quality Degradation Patterns
Learn to recognize these patterns in your quality dashboards:

Sudden cliff: Quality drops sharply on a specific date → model update or prompt change
Gradual decline: Quality slowly erodes over weeks → data drift or concept drift
Time-of-day patterns: Quality worse during peak hours → load-related timeouts or truncation
Topic-specific drops: Quality fine overall but terrible for one topic → missing or outdated documents in RAG
Pro tip: When automated quality scores and user feedback signals diverge (high automated score but high regeneration rate), your automated metric needs recalibration. User behavior is the ultimate ground truth.
Pillar 4: Safety Monitoring
Detecting harmful outputs before they cause damage
What to Detect
PII leakage: Names, emails, SSNs, credit card numbers appearing in responses
Toxic content: Hate speech, harassment, explicit material, violent content
Policy violations: Responses that violate your organization’s usage policies
Prompt injection success: Users who successfully manipulated the model
Jailbreak attempts: Users trying to bypass safety guardrails (even if unsuccessful — track the attempts)
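PII checks often start with pattern matching before heavier classifiers run. A sketch for two common patterns (these regexes are deliberately simplified, not production-grade; real detectors such as Presidio use many more patterns plus context):

```python
import re

# Simplified patterns for illustration only.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def detect_pii(text: str) -> list:
    """Return the PII types found in a model response."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]

flags = detect_pii("Contact jane.doe@example.com, SSN 123-45-6789.")
```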
Real-Time vs Batch Monitoring
Real-time (pre-response): Block harmful outputs before they reach users. Adds 100–500ms latency but prevents harm. Essential for safety-critical applications
Batch (post-response): Analyze responses after delivery. No latency impact but harm may have already occurred. Good for trend analysis and improving guardrails

Use both: Real-time for critical safety checks, batch for comprehensive analysis and guardrail improvement.
Safety Metrics to Track
Safety violation rate: % of responses flagged by safety classifiers
PII leak rate: % of responses containing detected PII
Prompt injection attempt rate: % of inputs classified as injection attempts
Jailbreak success rate: % of injection attempts that bypassed guardrails
False positive rate: % of safe responses incorrectly blocked (impacts UX)
Critical: Safety monitoring is non-negotiable for user-facing systems. A single harmful response can trigger regulatory action, lawsuits, or brand damage costing millions. The cost of safety monitoring ($100–$500/month) is trivial compared to the cost of a safety incident.
Pillar 5: Hallucination Detection
Catching fabricated facts before users trust them
Detection Methods
Source verification (RAG): Compare every claim in the response against the retrieved documents. Flag claims not supported by any source
Self-consistency: Generate multiple responses to the same query. If they disagree on key facts, flag for review
Confidence scoring: Low token-level probabilities correlate with hallucination. Models are less “confident” when fabricating
LLM-as-judge: Ask a judge model: “Are all claims in this response supported by the provided context?”
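Self-consistency can be sketched by sampling the same query several times and checking agreement. Here agreement is naive exact-match voting over whole answers (a real check would extract and compare individual claims); the generator stub and threshold are illustrative:

```python
from collections import Counter

def self_consistency(generate, query: str, n: int = 5, threshold: float = 0.6):
    """Sample n answers; flag for review if no answer reaches the agreement threshold."""
    answers = [generate(query) for _ in range(n)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    agreement = top_count / n
    return {"answer": top_answer, "agreement": agreement,
            "flagged": agreement < threshold}

# Stub generator for illustration: 4 of 5 samples agree.
samples = iter(["Paris", "Paris", "Lyon", "Paris", "Paris"])
result = self_consistency(lambda q: next(samples), "Capital of France?")
```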
Hallucination Metrics
Hallucination rate: % of responses containing at least one unsupported claim
Severity classification: Minor (wrong date) vs major (wrong medical advice) vs critical (fabricated citation)
Category tracking: Fabricated facts, wrong attributions, invented citations, contradictions with source

Track these over time. A rising hallucination rate signals model degradation, data quality issues, or retrieval problems.
Key insight: No hallucination detection method catches 100% of hallucinations. Layer multiple methods and accept that some will slip through. The goal is to minimize the rate and severity, not achieve perfection. Target: <5% hallucination rate for most applications, <1% for safety-critical ones.
Alerting & Incident Response
Know when things go wrong, fast
Alert Design
Threshold alerts: Metric crosses a predefined boundary (e.g., cost > $500/day)
Anomaly alerts: Metric deviates from its normal pattern (e.g., 2σ from 7-day rolling average)
Rate-of-change alerts: Metric changing faster than expected (e.g., quality dropping 1%/hour)
Composite alerts: Multiple metrics degrading simultaneously (strongest signal of a real problem)

Prefer anomaly detection over fixed thresholds when possible — it adapts to natural variation and seasonal patterns.
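The 2σ anomaly rule above can be sketched directly from a rolling window of recent values:

```python
import statistics

def is_anomaly(history, current, sigmas: float = 2.0) -> bool:
    """Flag `current` if it deviates more than `sigmas` std devs from the window mean."""
    mean = statistics.mean(history)
    std = statistics.stdev(history)
    return abs(current - mean) > sigmas * std

week = [0.84, 0.85, 0.83, 0.86, 0.85, 0.84, 0.85]  # 7-day rolling quality scores
alert = is_anomaly(week, 0.78)  # well outside normal variation
```

Because the threshold is derived from the window itself, the rule adapts as the metric's normal range shifts, which is exactly why it beats a fixed threshold for noisy metrics.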
Incident Response Playbook
1. Detect: Alert fires automatically
2. Triage: Is this a real issue? Check the dashboard for context
3. Diagnose: Which pillar is affected? Examine traces for the affected time window
4. Mitigate: Rollback to previous config, disable the affected feature, or add an emergency guardrail
5. Fix: Root cause analysis and permanent fix
6. Learn: Add the failure case to your eval dataset so it’s caught automatically next time
Pro tip: Every production incident should result in a new eval example. Your eval dataset should grow from production failures, making it increasingly representative over time. This is the flywheel from Chapter 7 in action.
The Observability Dashboard
One view, five pillars, traffic-light status
Dashboard Layout
// Production Health Dashboard
Cost      [GREEN]   $342 / $500 daily budget
Latency   [GREEN]   p95: 1.8s (target: <3s)
Quality   [YELLOW]  0.82 relevancy (target: 0.85)
Safety    [GREEN]   0 violations today
Hallucin. [GREEN]   2.1% rate (target: <5%)

// Trend: Quality declining 0.5%/week
// Action: Investigate query distribution shift
Building Your Dashboard
Start with the 5 pillars as top-level traffic-light indicators. Green/yellow/red for each metric. Then add drill-down capability:

Time-series plots: Spot trends and anomalies over days/weeks
Per-feature breakdown: Which features need attention?
Recent incidents: What went wrong recently and was it resolved?
Eval suite status: Last CI/CD eval run results and trends
Cost projection: At current rate, what’s the monthly bill?
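The traffic-light roll-up itself can be a pure function from a metric and its target to a status; a sketch with illustrative thresholds (the warn margin is an absolute gap here, so pick it per metric):

```python
def traffic_light(value: float, target: float, warn_margin: float = 0.05,
                  higher_is_better: bool = True) -> str:
    """GREEN if the target is met, YELLOW if within warn_margin of it, else RED."""
    gap = (target - value) if higher_is_better else (value - target)
    if gap <= 0:
        return "GREEN"
    if gap <= warn_margin:
        return "YELLOW"
    return "RED"

assert traffic_light(0.82, target=0.85) == "YELLOW"      # relevancy slightly below target
assert traffic_light(1.8, target=3.0, warn_margin=0.5,
                     higher_is_better=False) == "GREEN"  # p95 latency under 3s
```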
Next up: Chapter 10 covers guardrails and safety in depth — input/output filtering, PII detection, prompt injection defense, content moderation, and the frameworks that make implementation easier.