Ch 16 — Monitoring & Observability

You can’t improve what you can’t see. Building the nervous system for your AI product.
Why AI Needs Different Monitoring
Traditional APM tells you the system is up. AI observability tells you the system is good.
The Observability Gap
Traditional application monitoring answers: “Is the system running?” Uptime, error rates, response times. If the API returns 200 OK, everything looks fine.

But for AI products, the API can return 200 OK while delivering a hallucinated answer, a biased recommendation, or a response that violates your safety guidelines. The system is “up” but the output is wrong.

AI observability must answer a fundamentally different question: “Is the system producing good outputs?” This requires monitoring dimensions that traditional APM doesn’t cover: output quality, factual accuracy, safety compliance, cost efficiency, and user satisfaction.
Silent Degradation
The most dangerous characteristic of AI systems is silent degradation. Quality can erode gradually without any system alerts:

• A model provider updates the underlying model — your prompts now produce subtly different outputs
• User query patterns shift — the model encounters more out-of-distribution inputs
• Knowledge base content becomes stale — answers are technically grounded but factually outdated
• A data pipeline breaks — new documents stop being indexed, but old ones still work

Without quality monitoring, you discover these issues when users complain — or worse, when they silently leave.
The monitoring principle: For AI products, “the system is up” is necessary but not sufficient. You need three layers of monitoring: infrastructure (is it running?), performance (is it fast and cheap enough?), and quality (is it producing good outputs?). Most teams have the first two. The third is what separates good AI products from bad ones.
Performance Monitoring
Latency, throughput, errors — the foundation layer
Key Performance Metrics
Latency (response time):
p50: Median response time. What most users experience.
p95: 95th percentile. What slow users experience.
p99: 99th percentile. Worst-case experience.
Time to first token: For streaming responses, how quickly the user sees the first word.

Track each separately. A p50 of 1.5s with a p99 of 12s means 1% of users wait 12 seconds — that’s thousands of bad experiences per day at scale.
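The tail math above can be reproduced with the standard library; a sketch (batch computation over one day's samples, where production systems would typically use a streaming estimator instead):

```python
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """p50/p95/p99 from a batch of per-request latencies in milliseconds."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# 97% of requests are fast, 3% are slow outliers: the median hides them,
# the tail percentiles expose them
samples = [1500.0] * 97 + [9000.0, 11000.0, 12000.0]
print(latency_percentiles(samples))
```

Note how a healthy-looking p50 of 1.5s coexists with a p99 above 10s, exactly the situation the text warns about.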

Throughput:
• Requests per second (current vs. capacity)
• Concurrent users
• Queue depth (if requests are queued)

Error rates:
• API errors (timeouts, rate limits, 5xx errors)
• Model errors (empty responses, malformed outputs, refusals)
• Pipeline errors (retrieval failures, parsing errors)
End-to-End Tracing
A single user query passes through multiple components: query processing → embedding → retrieval → re-ranking → prompt assembly → LLM call → output parsing → response delivery.

End-to-end tracing records the time and status of each step for every request. When latency spikes, you can pinpoint exactly which component slowed down:

• Was retrieval slow? (Vector database issue)
• Was the LLM call slow? (Provider issue or prompt too long)
• Was output parsing slow? (Structured output validation)

Modern observability platforms (MLflow, Langfuse, Arize Phoenix) provide this tracing with minimal code changes — often a single-line integration.
The latency budget: Define a latency budget for each component. Example: retrieval ≤200ms, re-ranking ≤100ms, LLM call ≤2s, total ≤3s. When any component exceeds its budget, investigate. This prevents the “death by a thousand cuts” where each component adds 50ms until the total is unacceptable.
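The budget check itself is trivial once per-component timings are traced; a minimal sketch using the example budgets from the callout (component names are illustrative):

```python
# Per-component latency budgets in ms, taken from the example in the callout
BUDGET_MS = {"retrieval": 200, "rerank": 100, "llm_call": 2000, "total": 3000}

def over_budget(timings_ms: dict[str, float]) -> list[str]:
    """Return the names of components (or 'total') that exceeded their budget."""
    timings = dict(timings_ms)
    timings["total"] = sum(timings_ms.values())
    return [name for name, budget in BUDGET_MS.items()
            if timings.get(name, 0) > budget]

# One traced request: retrieval is slow, everything else is within budget
print(over_budget({"retrieval": 450, "rerank": 80, "llm_call": 1800}))
```

Running this check on every trace turns the "death by a thousand cuts" problem into an immediate, attributable signal.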
Quality Monitoring
The layer that most teams miss — and the one that matters most
Automated Quality Signals
Hallucination detection:
For RAG products, compare the response against the retrieved context. Does the response contain claims not supported by the source documents? Automated faithfulness scoring catches the most dangerous quality failures.
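Real faithfulness scoring typically uses an LLM judge or an NLI model; as an illustration of the shape of the check, here is a deliberately naive lexical-overlap proxy (the 0.6 support threshold and the word-length filter are arbitrary choices for this sketch):

```python
import re

def faithfulness_score(response: str, context: str) -> float:
    """Toy faithfulness proxy: fraction of response sentences whose content
    words mostly appear in the retrieved context. A stand-in for an LLM
    judge or NLI model, shown only to illustrate the structure of the check."""
    ctx_words = set(re.findall(r"[a-z']+", context.lower()))
    sentences = [s for s in re.split(r"[.!?]+", response) if s.strip()]
    supported = 0
    for s in sentences:
        words = [w for w in re.findall(r"[a-z']+", s.lower()) if len(w) > 3]
        if words and sum(w in ctx_words for w in words) / len(words) >= 0.6:
            supported += 1
    return supported / len(sentences) if sentences else 1.0

ctx = "The warranty covers parts for two years."
# Second sentence is unsupported by the context, so the score drops to 0.5
print(faithfulness_score(
    "The warranty covers parts for two years. Shipping is always free worldwide.",
    ctx))
```

A score well below 1.0 on sampled production traffic is the early-warning signal for hallucination.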

Safety filter triggers:
Track how often content filters activate. A spike in safety triggers may indicate adversarial attacks, a prompt regression, or a model behavior change.

Response characteristics:
• Average response length (sudden changes indicate prompt issues)
• Refusal rate (model declining to answer — too high means over-cautious, too low means under-guarded)
• Language distribution (are responses appearing in unexpected languages?)
• Format compliance (are structured outputs valid JSON/markdown?)
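Format compliance is the easiest of these signals to automate; for example, a JSON-validity rate over a batch of structured outputs:

```python
import json

def json_valid_rate(outputs: list[str]) -> float:
    """Fraction of structured outputs that parse as valid JSON."""
    ok = 0
    for raw in outputs:
        try:
            json.loads(raw)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(outputs) if outputs else 1.0

print(json_valid_rate(['{"answer": 42}', 'not json', '{"a": [1, 2]}']))
```

A sudden dip in this rate usually means a prompt regression or a provider-side model change, not a user-behavior change.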
User-Driven Quality Signals
Explicit feedback:
• Thumbs up/down ratio (track daily, alert on drops)
• Correction frequency (users editing AI outputs)
• “Flag as incorrect” rate

Implicit feedback:
• Regeneration rate (users requesting a new response)
• Escalation rate (users requesting human help)
• Abandonment rate (users leaving mid-conversation)
• Copy rate (users copying AI output — positive signal)

Retrieval quality (for RAG):
• Average retrieval relevance score
• “No results found” rate
• Source diversity (are answers always coming from the same few documents?)
Drift Detection
Input drift: Are user queries changing over time? New topics emerging? Seasonal shifts?
Output drift: Are model responses changing even without prompt changes? (Provider model updates.)
Quality drift: Are quality scores trending downward over days/weeks?
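A minimal quality-drift check compares a recent window of daily scores against the preceding baseline; the window size and drop threshold here are illustrative:

```python
import statistics

def quality_drift(daily_scores: list[float], window: int = 7,
                  drop_threshold: float = 0.05) -> bool:
    """Flag drift when the recent window's mean score falls more than
    `drop_threshold` (absolute) below the preceding baseline window's mean."""
    if len(daily_scores) < 2 * window:
        return False  # not enough history to compare two full windows
    recent = statistics.mean(daily_scores[-window:])
    baseline = statistics.mean(daily_scores[-2 * window:-window])
    return baseline - recent > drop_threshold

# Quality slid from 0.90 to 0.82 over a week: flagged
print(quality_drift([0.90] * 7 + [0.82] * 7))
```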
The quality monitoring rule: Sample and evaluate at least 1% of production responses daily using automated quality metrics. For high-stakes products, evaluate 5–10%. This gives you early warning of quality degradation before it reaches a level that users notice and complain about.
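One way to implement the daily sample at a fixed rate is to hash each request ID into [0, 1) and keep requests that fall below the rate, so a given request is always deterministically in or out of the evaluation set (a sketch):

```python
import hashlib

def sample_for_eval(request_id: str, rate: float = 0.01) -> bool:
    """Deterministically select ~`rate` of requests for quality evaluation,
    keyed on the request ID so the decision is reproducible across systems."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # uniform in [0, 1)
    return bucket < rate
```

Deterministic sampling means the evaluation pipeline and the logging pipeline always agree on which requests are in the sample, with no coordination needed.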
Cost Monitoring
AI costs are variable and can spike without warning — track them in real time
Why Cost Monitoring Is Critical
Unlike traditional software where infrastructure costs are relatively fixed, AI costs are directly proportional to usage and highly variable:

• A viral feature can 10x your API spend overnight
• A prompt change that adds 500 tokens increases cost per query by 20–40%
• A retrieval bug that returns too many documents inflates context length and cost
• Users who discover they can have long conversations drive up per-session costs

Without real-time cost monitoring, you discover budget overruns at the end of the month — when it’s too late to fix them.
Key Cost Metrics
Per-query costs:
• Input tokens per query (system prompt + context + user input)
• Output tokens per query (model response)
• Total cost per query (input + output at model pricing)
• Embedding cost per query (for RAG retrieval)

Aggregate costs:
• Daily/weekly/monthly total spend
• Cost by feature (which AI features cost the most?)
• Cost by user segment (are power users 10x more expensive?)
• Cost by model (if using multiple models)

Efficiency metrics:
• Cost per successful resolution (not just per query)
• Cost per unit of value created (cost per sale assisted, cost per ticket resolved)
• Token waste rate (tokens spent on failed or abandoned interactions)
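Per-query cost is straightforward arithmetic once token counts are logged; a sketch with assumed, illustrative pricing (check your provider's current rate card, these numbers are not real quotes):

```python
# Assumed example pricing in USD per 1M tokens -- illustrative only
PRICE_PER_M = {"input": 3.00, "output": 15.00}

def query_cost(input_tokens: int, output_tokens: int) -> float:
    """Total cost of one query in USD at the assumed pricing above."""
    return (input_tokens * PRICE_PER_M["input"]
            + output_tokens * PRICE_PER_M["output"]) / 1_000_000

# 2,000 input tokens (system prompt + context + user input), 500 output tokens
print(f"${query_cost(2_000, 500):.4f}")
```

Logging this per request is what makes "cost by feature" and "cost by user segment" possible later; you cannot reconstruct it from the monthly invoice.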
Cost guardrails: Set three levels of cost alerts. Warning (80% of daily budget): investigate and optimize. Critical (100% of daily budget): implement rate limiting or model downgrade. Emergency (150% of daily budget): activate cost circuit breaker — automatically switch to a cheaper model or temporarily disable the feature. Define these before launch, not during a cost crisis.
Alerting Strategy
The right alerts at the right time — without alert fatigue
Alert Design Principles
1. Alert on anomalies, not thresholds alone.
A fixed threshold (“alert if latency >3s”) misses gradual degradation and triggers false alarms during known spikes. Anomaly detection (“alert if latency is 2 standard deviations above the 7-day average”) adapts to normal patterns.
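A rolling z-score captures the "2 standard deviations above the 7-day average" rule; a sketch over trailing daily samples:

```python
import statistics

def is_anomalous(current: float, history: list[float],
                 n_sigma: float = 2.0) -> bool:
    """Flag `current` when it sits more than `n_sigma` standard deviations
    above the mean of the trailing history (e.g., 7 days of daily values)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)  # requires at least 2 samples
    return current > mean + n_sigma * stdev

week = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0]  # e.g., p95 latency in seconds
print(is_anomalous(2.5, week))  # well outside normal variation
print(is_anomalous(1.1, week))  # inside normal variation
```

Because the threshold adapts to the trailing window, a known busy period widens it automatically instead of paging the on-call.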

2. Alert on leading indicators.
Don’t wait for user complaints. Alert on metrics that predict user-facing problems: retrieval relevance dropping, hallucination rate rising, cost per query increasing.

3. Every alert must have a runbook.
An alert without a response procedure is just noise. For each alert, document: what it means, who responds, what to check first, and what actions to take.

4. Minimize alert fatigue.
Too many alerts and the team ignores them all. Ruthlessly prune alerts that don’t lead to action. If an alert fires and nobody does anything, delete it.
The Alert Hierarchy
P0 — Page immediately (any time):
• System down (API errors >50%)
• Safety violation detected (harmful content in production)
• Cost emergency (150%+ of daily budget)

P1 — Respond within 1 hour (business hours):
• Quality degradation (>20% drop in satisfaction score)
• Latency spike (p95 >2x normal for >15 minutes)
• Hallucination rate spike (>2x baseline)

P2 — Investigate within 24 hours:
• Gradual quality drift (5–10% decline over a week)
• Cost trending above budget
• Escalation rate increasing
• New query patterns emerging (potential coverage gap)

P3 — Review in weekly meeting:
• Minor metric fluctuations
• Feature usage changes
• Feedback pattern shifts
The alert test: For every alert, ask: “If this fires at 3am, is it worth waking someone up?” If yes, it’s P0. If no, it’s not a page — it’s a notification. Getting this wrong in either direction is costly: missed P0s cause outages; false P0s cause burnout.
The PM Dashboard
What the PM should see every morning — and what to do about it
The Daily View
Health summary (green/yellow/red):
One-glance status for quality, performance, cost, and safety. Green = within targets. Yellow = approaching thresholds. Red = action needed.
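The traffic-light mapping can be a single helper applied to each metric; a sketch where the target and limit values are illustrative:

```python
def health_status(value: float, target: float, limit: float,
                  higher_is_better: bool = True) -> str:
    """Traffic light: green within target, yellow between target and the
    hard limit, red beyond the limit."""
    if not higher_is_better:
        value, target, limit = -value, -target, -limit
    if value >= target:
        return "green"
    return "yellow" if value >= limit else "red"

# Satisfaction (higher is better): target 90% thumbs-up, hard limit 80%
print(health_status(0.86, target=0.90, limit=0.80))
# p95 latency (lower is better): target 3.0s, hard limit 5.0s
print(health_status(6.2, target=3.0, limit=5.0, higher_is_better=False))
```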

Key metrics (last 24h vs. 7-day average):
• Total queries served
• User satisfaction score (thumbs up %)
• Escalation rate
• Hallucination rate (if measurable)
• Average latency (p50, p95)
• Total cost / cost per query
• Top 5 failure queries (most negative feedback)

Trend lines (last 30 days):
Quality, satisfaction, cost, and usage trends. Are things getting better or worse? Flat lines are fine. Downward quality trends need investigation.
The Weekly Deep Dive
Failure analysis:
Review the top 20 worst-performing queries of the week. Categorize by failure type (retrieval miss, hallucination, safety, out-of-scope). Prioritize fixes by frequency and severity.

User feedback review:
Read a sample of free-text feedback. Look for patterns that quantitative metrics miss. Are users frustrated about something specific?

Cost analysis:
Cost by feature, by user segment, by model. Identify optimization opportunities. Are there queries that could use a cheaper model?

Drift check:
Have input patterns changed? Are new topics emerging that the AI doesn’t handle well? Is the knowledge base keeping up with real-world changes?
The PM’s monitoring habit: Check the daily dashboard every morning (5 minutes). Do the weekly deep dive every Friday (30 minutes). Present the monthly report to leadership (quality trends, cost trends, improvement plan). This cadence catches problems early, drives continuous improvement, and keeps leadership informed.
Tracing Multi-Step AI Systems
When your AI is an agent with tools, chains, and reasoning steps
The Complexity Challenge
Modern AI products are not single model calls. They’re multi-step workflows:

• A customer support agent that searches the knowledge base, looks up order status via API, reasons about the answer, and generates a response
• A document analyzer that extracts entities, classifies sections, summarizes content, and generates recommendations
• An agentic system that plans a sequence of actions, executes tools, evaluates results, and iterates

Each step can fail independently. A single user query might involve 5–15 internal operations, any of which can introduce latency, errors, or quality issues.
Distributed Tracing for AI
Trace structure:
Each user request creates a trace containing multiple spans. Each span represents one operation (retrieval, LLM call, tool execution). Spans record: start time, duration, input, output, status, and metadata.

What to capture per span:
LLM calls: Model, prompt (or hash), response, tokens used, latency, cost
Retrieval: Query, documents returned, relevance scores, latency
Tool calls: Tool name, input, output, success/failure, latency
Reasoning steps: Agent’s chain-of-thought, decision points, selected actions
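A trace is just structured data; a sketch of the span and trace records described above (field names are illustrative, real platforms define their own schemas):

```python
from dataclasses import dataclass, field
import time
import uuid

@dataclass
class Span:
    """One operation inside a trace (LLM call, retrieval, tool call, ...)."""
    name: str
    kind: str                # "llm" | "retrieval" | "tool" | "reasoning"
    input: str
    output: str = ""
    status: str = "ok"
    started_at: float = field(default_factory=time.time)
    duration_ms: float = 0.0
    metadata: dict = field(default_factory=dict)  # tokens, cost, scores, ...

@dataclass
class Trace:
    """All spans produced while serving one user request."""
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    spans: list[Span] = field(default_factory=list)

    def slowest_span(self) -> Span:
        return max(self.spans, key=lambda s: s.duration_ms)

trace = Trace(spans=[
    Span(name="retrieve", kind="retrieval", input="warranty query",
         duration_ms=180.0),
    Span(name="llm_call", kind="llm", input="assembled prompt",
         duration_ms=1900.0),
])
print(trace.slowest_span().name)
```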

Why it matters:
When a user reports a bad response, you can pull the full trace and see exactly what happened at each step. Was the retrieval wrong? Did the tool return an error? Did the LLM ignore the context? Without tracing, debugging is guesswork.
The tracing investment: Implement tracing from day one, not after the first production incident. Retrofitting tracing into a complex AI system is 5–10x harder than building it in from the start. Modern frameworks (OpenTelemetry, Langfuse, MLflow) make this straightforward with minimal code overhead.
The Observability Checklist
What to have in place before launch — and what to build in the first 90 days
Before Launch (Must-Have)
□ Performance monitoring live
Latency (p50/p95/p99), error rate, throughput tracked in real time.

□ Cost tracking active
Per-query cost, daily spend, budget alerts configured.

□ Basic quality signals
Thumbs up/down collection, escalation rate tracking, response length monitoring.

□ Safety monitoring
Content filter trigger rate, refusal rate, flagged response logging.

□ P0/P1 alerts configured
System down, safety violations, cost emergencies, and quality degradation alerts with runbooks.

□ End-to-end tracing
Full request traces with per-component latency and status.
First 90 Days (Build Incrementally)
□ Automated quality evaluation
Daily sampling and scoring of production responses (hallucination, relevance, faithfulness).

□ Drift detection
Input distribution monitoring, output characteristic tracking, quality trend analysis.

□ PM dashboard
Daily health summary, weekly deep dive views, monthly leadership report.

□ Cost optimization pipeline
Token usage analysis, model routing optimization, caching for common queries.

□ Feedback analytics
Aggregated feedback patterns, failure categorization, improvement prioritization.

□ Full alert hierarchy
P0 through P3 alerts tuned based on actual production patterns (reduce false positives).
The bottom line: Observability is not a nice-to-have — it’s the nervous system of your AI product. Without it, you’re operating blind: you don’t know if quality is degrading, costs are spiking, or users are frustrated until the damage is done. With it, you detect issues in minutes, diagnose root causes in hours, and continuously improve based on data. Build the must-haves before launch. Build the rest in the first 90 days. Never stop refining.