Ch 9 — Production Observability

The 5 pillars of monitoring AI systems — cost, latency, quality, safety, and hallucination detection
High Level: Cost → Latency → Quality → Safety → Hallucination → Dashboard
Why Production Observability Is Different
Traditional monitoring doesn’t work for AI systems
The Observability Gap
Traditional software monitoring tracks uptime, error rates, and response times. For LLM systems, the server can return 200 OK while delivering a completely wrong, hallucinated, or harmful response. Your HTTP metrics will show green while your users are getting garbage. You need a fundamentally different monitoring approach.
The Five Pillars
Production LLM observability requires monitoring five dimensions simultaneously:

1. Cost: How much are you spending per query, per feature, per user?
2. Latency: How fast are responses? Where are the bottlenecks?
3. Quality: Are responses actually good? Is quality stable over time?
4. Safety: Are harmful outputs reaching users?
5. Hallucination: Is the system fabricating information?
Traces: The Foundation
The building block of LLM observability is the trace — a complete record of what happened for a single request. A trace captures:

Input: The user query and system prompt
Retrieval: Which documents were fetched (for RAG)
LLM call: Model, tokens in/out, latency, cost
Output: The generated response
Post-processing: Guardrail checks, formatting

Tools like Langfuse, Phoenix, and LangSmith capture traces automatically with OpenTelemetry-compatible instrumentation.
Key insight: Without traces, debugging a production issue is like debugging a program without logs. Traces are the single most important investment in LLM observability. Instrument from day one.
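At its core, a trace is just a structured record per request. A minimal sketch of the fields listed above (the schema is illustrative, not any specific tool's format):

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Trace:
    """One record per request: input, retrieval, LLM call, output, post-processing."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    user_query: str = ""
    system_prompt: str = ""
    retrieved_doc_ids: list = field(default_factory=list)  # RAG only
    model: str = ""
    tokens_in: int = 0
    tokens_out: int = 0
    latency_ms: float = 0.0
    cost_usd: float = 0.0
    output: str = ""
    guardrail_flags: list = field(default_factory=list)
    timestamp: float = field(default_factory=time.time)

trace = Trace(user_query="What is our refund policy?", model="gpt-4o",
              tokens_in=1200, tokens_out=250, latency_ms=1850.0)
```

Even this flat record is enough to answer "what happened on request X?"; dedicated tools add nesting (spans per pipeline stage) and a query UI on top.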
Pillar 1: Cost Tracking
Know exactly where every dollar goes
Why Cost Tracking Matters
LLM costs are unpredictable and can spike without warning. A runaway agent loop can burn $1,000 in minutes. A prompt change that adds 500 tokens can double your monthly bill. Without cost tracking, you discover budget overruns from your finance team, not your dashboard.
What to Track
Cost per query: Average and p95 cost per user request
Cost per model: Breakdown by GPT-4o vs Claude vs Haiku
Cost per feature: Which product features are most expensive?
Cost per user: Are power users driving disproportionate costs?
Daily/weekly budget: Set limits and alert when approaching
Cost Attribution
Break down costs by the dimensions that matter for your business. A typical attribution model:

By model: 60% GPT-4o, 25% embeddings, 15% Haiku
By pipeline stage: 40% generation, 35% retrieval, 25% guardrails
By feature: 50% chat, 30% search, 20% summarization

This tells you where to optimize. If embeddings are 25% of cost, switching to a cheaper embedding model has high ROI.
Budget rule: Set daily cost alerts at 80% of your budget. Set hard limits that automatically disable features or switch to cheaper models when exceeded. A $500/day budget with no alerting is a $15,000/month surprise waiting to happen.
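The 80% alert plus hard-limit rule can be sketched as a check run against accumulated daily spend (the thresholds and action names are illustrative):

```python
def budget_action(spent_usd: float, daily_budget_usd: float) -> str:
    """Return the action to take for the current daily spend."""
    if spent_usd >= daily_budget_usd:
        return "hard_limit"  # disable the feature or switch to a cheaper model
    if spent_usd >= 0.8 * daily_budget_usd:
        return "alert"       # notify on-call before the budget is blown
    return "ok"

assert budget_action(342.0, 500.0) == "ok"
assert budget_action(410.0, 500.0) == "alert"       # 82% of budget
assert budget_action(505.0, 500.0) == "hard_limit"
```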
Pillar 2: Latency Profiling
Users abandon after 3 seconds — know where your time goes
What to Measure
End-to-end latency: Total time from user query to response delivered
Time to first token (TTFT): How long before the user sees anything? Critical for streaming UX
Percentiles: Track p50 (median), p95 (most users), and p99 (worst case). Averages hide tail latency problems
Per-component breakdown: How much time in retrieval vs generation vs guardrails?
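Percentiles are cheap to compute from a window of recorded latencies; a sketch using the nearest-rank method:

```python
import math

def percentile(samples_ms, p):
    """Nearest-rank percentile: p in (0, 100]; samples need not be sorted."""
    ordered = sorted(samples_ms)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

latencies = [120, 150, 180, 200, 250, 300, 400, 900, 1500, 4000]
p50, p95 = percentile(latencies, 50), percentile(latencies, 95)
# The mean here (~800ms) hides the 4s tail that p95/p99 expose.
```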
Latency Decomposition
// Typical RAG pipeline latency
Embedding query     ~50ms
Vector search       ~30ms
Reranking          ~100ms
LLM generation    ~1500ms  ← bottleneck
Guardrail checks   ~200ms
Total             ~1880ms

// With streaming: TTFT ~300ms
// Users perceive this as fast
Optimization Levers
Streaming: Stream tokens to the user as they’re generated. TTFT drops from 2s to 300ms even though total time is the same
Parallel execution: Run retrieval and guardrail checks in parallel where possible
Model selection: Use faster models (GPT-4o-mini, Haiku) for simple queries, stronger models for complex ones
Caching: Cache embeddings, retrieval results, and even full responses for repeated queries
Prompt optimization: Shorter prompts = fewer tokens = faster generation
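Caching full responses for repeated queries can be as simple as a keyed store with a TTL. A minimal in-memory sketch (in production this would typically be Redis or similar, and semantic caching would embed the query instead of hashing it):

```python
import hashlib
import time

class ResponseCache:
    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (response, stored_at)

    def _key(self, query: str) -> str:
        # Exact-match keying after trivial normalization.
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get(self, query: str):
        entry = self._store.get(self._key(query))
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]
        return None  # miss or expired

    def put(self, query: str, response: str):
        self._store[self._key(query)] = (response, time.time())

cache = ResponseCache()
cache.put("What is your refund policy?", "Refunds within 30 days...")
hit = cache.get("what is your refund policy?  ")  # normalization catches trivial variants
```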
Key insight: Latency perception matters more than actual latency. Streaming makes a 3-second response feel instant because the user sees content appearing immediately. Always implement streaming for user-facing LLM applications.
Pillar 3: Quality Monitoring
Tracking response quality in real-time
Continuous Quality Scoring
Run an LLM judge on 5–10% of production responses to continuously score quality. Track multiple dimensions:

Relevancy: Does the response address the user’s actual question?
Helpfulness: Is it actually useful, or just technically correct but unhelpful?
Coherence: Does it make logical sense, or contradict itself?
Completeness: Does it cover all aspects of the question?

Plot these scores over time. A declining trend signals a problem before users complain.
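The sampling-plus-scoring loop can be sketched as follows, with a stubbed judge standing in for a real LLM call (the sample rate, dimension names, and `judge` signature are illustrative):

```python
import random

SAMPLE_RATE = 0.05  # score ~5% of production traffic
DIMENSIONS = ["relevancy", "helpfulness", "coherence", "completeness"]

def maybe_score(trace, judge, rng=random.random):
    """Send a sampled response to an LLM judge; return per-dimension scores or None."""
    if rng() >= SAMPLE_RATE:
        return None  # not sampled; most traffic skips scoring
    return {dim: judge(trace, dim) for dim in DIMENSIONS}

# Stub judge for illustration; a real judge would call an LLM with a rubric per dimension.
scores = maybe_score({"output": "..."}, judge=lambda t, d: 0.9, rng=lambda: 0.01)
```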
User Signals
Supplement automated scoring with implicit user feedback:

Thumbs up/down: Direct quality signal (if your UI supports it)
Regeneration rate: Users clicking “try again” indicates dissatisfaction
Follow-up questions: Indicate the first answer was incomplete
Session abandonment: User leaves without resolution
Quality Degradation Patterns
Learn to recognize these patterns in your quality dashboards:

Sudden cliff: Quality drops sharply on a specific date → model update or prompt change
Gradual decline: Quality slowly erodes over weeks → data drift or concept drift
Time-of-day patterns: Quality worse during peak hours → load-related timeouts or truncation
Topic-specific drops: Quality fine overall but terrible for one topic → missing or outdated documents in RAG
Pro tip: When automated quality scores and user feedback signals diverge (high automated score but high regeneration rate), your automated metric needs recalibration. User behavior is the ultimate ground truth.
Pillar 4: Safety Monitoring
Detecting harmful outputs before they cause damage
What to Detect
PII leakage: Names, emails, SSNs, credit card numbers appearing in responses
Toxic content: Hate speech, harassment, explicit material, violent content
Policy violations: Responses that violate your organization’s usage policies
Prompt injection success: Users who successfully manipulated the model
Jailbreak attempts: Users trying to bypass safety guardrails (even if unsuccessful — track the attempts)
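PII checks often start with pattern matching before heavier classifiers run. A sketch for two common patterns (these regexes are deliberately simplified, not production-grade; real detectors such as Presidio use many more patterns plus context):

```python
import re

# Simplified patterns for illustration only.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def detect_pii(text: str) -> list:
    """Return the PII types found in a model response."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]

flags = detect_pii("Contact jane.doe@example.com, SSN 123-45-6789.")
```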
Real-Time vs Batch Monitoring
Real-time (pre-response): Block harmful outputs before they reach users. Adds 100–500ms latency but prevents harm. Essential for safety-critical applications
Batch (post-response): Analyze responses after delivery. No latency impact but harm may have already occurred. Good for trend analysis and improving guardrails

Use both: Real-time for critical safety checks, batch for comprehensive analysis and guardrail improvement.
Safety Metrics to Track
Safety violation rate: % of responses flagged by safety classifiers
PII leak rate: % of responses containing detected PII
Prompt injection attempt rate: % of inputs classified as injection attempts
Jailbreak success rate: % of injection attempts that bypassed guardrails
False positive rate: % of safe responses incorrectly blocked (impacts UX)
Critical: Safety monitoring is non-negotiable for user-facing systems. A single harmful response can trigger regulatory action, lawsuits, or brand damage costing millions. The cost of safety monitoring ($100–$500/month) is trivial compared to the cost of a safety incident.
Pillar 5: Hallucination Detection
Catching fabricated facts before users trust them
Detection Methods
Source verification (RAG): Compare every claim in the response against the retrieved documents. Flag claims not supported by any source
Self-consistency: Generate multiple responses to the same query. If they disagree on key facts, flag for review
Confidence scoring: Low token-level probabilities correlate with hallucination. Models are less “confident” when fabricating
LLM-as-judge: Ask a judge model: “Are all claims in this response supported by the provided context?”
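Self-consistency can be sketched by sampling the same query several times and checking agreement. Here agreement is naive exact-match voting over whole answers (a real check would extract and compare individual claims); the generator stub and threshold are illustrative:

```python
from collections import Counter

def self_consistency(generate, query: str, n: int = 5, threshold: float = 0.6):
    """Sample n answers; flag for review if no answer reaches the agreement threshold."""
    answers = [generate(query) for _ in range(n)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    agreement = top_count / n
    return {"answer": top_answer, "agreement": agreement,
            "flagged": agreement < threshold}

# Stub generator for illustration: 4 of 5 samples agree.
samples = iter(["Paris", "Paris", "Lyon", "Paris", "Paris"])
result = self_consistency(lambda q: next(samples), "Capital of France?")
```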
Hallucination Metrics
Hallucination rate: % of responses containing at least one unsupported claim
Severity classification: Minor (wrong date) vs major (wrong medical advice) vs critical (fabricated citation)
Category tracking: Fabricated facts, wrong attributions, invented citations, contradictions with source

Track these over time. A rising hallucination rate signals model degradation, data quality issues, or retrieval problems.
Key insight: No hallucination detection method catches 100% of hallucinations. Layer multiple methods and accept that some will slip through. The goal is to minimize the rate and severity, not achieve perfection. Target: <5% hallucination rate for most applications, <1% for safety-critical ones.
Alerting & Incident Response
Know when things go wrong, fast
Alert Design
Threshold alerts: Metric crosses a predefined boundary (e.g., cost > $500/day)
Anomaly alerts: Metric deviates from its normal pattern (e.g., 2σ from 7-day rolling average)
Rate-of-change alerts: Metric changing faster than expected (e.g., quality dropping 1%/hour)
Composite alerts: Multiple metrics degrading simultaneously (strongest signal of a real problem)

Prefer anomaly detection over fixed thresholds when possible — it adapts to natural variation and seasonal patterns.
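The 2σ anomaly rule above can be sketched directly from a rolling window of recent values:

```python
import statistics

def is_anomaly(history, current, sigmas: float = 2.0) -> bool:
    """Flag `current` if it deviates more than `sigmas` std devs from the window mean."""
    mean = statistics.mean(history)
    std = statistics.stdev(history)
    return abs(current - mean) > sigmas * std

week = [0.84, 0.85, 0.83, 0.86, 0.85, 0.84, 0.85]  # 7-day rolling quality scores
alert = is_anomaly(week, 0.78)  # well outside normal variation
```

Because the threshold is derived from the window itself, the rule adapts as the metric's normal range shifts, which is exactly why it beats a fixed threshold for noisy metrics.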
Incident Response Playbook
1. Detect: Alert fires automatically
2. Triage: Is this a real issue? Check the dashboard for context
3. Diagnose: Which pillar is affected? Examine traces for the affected time window
4. Mitigate: Rollback to previous config, disable the affected feature, or add an emergency guardrail
5. Fix: Root cause analysis and permanent fix
6. Learn: Add the failure case to your eval dataset so it’s caught automatically next time
Pro tip: Every production incident should result in a new eval example. Your eval dataset should grow from production failures, making it increasingly representative over time. This is the flywheel from Chapter 7 in action.
The Observability Dashboard
One view, five pillars, traffic-light status
Dashboard Layout
// Production Health Dashboard
Cost      [GREEN]   $342 / $500 daily budget
Latency   [GREEN]   p95: 1.8s (target: <3s)
Quality   [YELLOW]  0.82 relevancy (target: 0.85)
Safety    [GREEN]   0 violations today
Hallucin. [GREEN]   2.1% rate (target: <5%)

// Trend: Quality declining 0.5%/week
// Action: Investigate query distribution shift
Building Your Dashboard
Start with the 5 pillars as top-level traffic-light indicators. Green/yellow/red for each metric. Then add drill-down capability:

Time-series plots: Spot trends and anomalies over days/weeks
Per-feature breakdown: Which features need attention?
Recent incidents: What went wrong recently and was it resolved?
Eval suite status: Last CI/CD eval run results and trends
Cost projection: At current rate, what’s the monthly bill?
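The traffic-light roll-up itself can be a pure function from a metric and its target to a status; a sketch with illustrative thresholds (the warn margin is an absolute gap here, so pick it per metric):

```python
def traffic_light(value: float, target: float, warn_margin: float = 0.05,
                  higher_is_better: bool = True) -> str:
    """GREEN if the target is met, YELLOW if within warn_margin of it, else RED."""
    gap = (target - value) if higher_is_better else (value - target)
    if gap <= 0:
        return "GREEN"
    if gap <= warn_margin:
        return "YELLOW"
    return "RED"

assert traffic_light(0.82, target=0.85) == "YELLOW"      # relevancy slightly below target
assert traffic_light(1.8, target=3.0, warn_margin=0.5,
                     higher_is_better=False) == "GREEN"  # p95 latency under 3s
```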
Next up: Chapter 10 covers guardrails and safety in depth — input/output filtering, PII detection, prompt injection defense, content moderation, and the frameworks that make implementation easier.