Ch 12 — Production Hardening & Scaling

Circuit breakers, observability, cost controls, and scaling from pilot to enterprise-wide without the wheels falling off
Why Agents Fail Differently
LLM APIs fail in ways traditional APIs never do — and your infrastructure isn't ready
Novel Failure Modes
LLM-powered agents fail differently than traditional software. A 2026 study found agents achieve only 60% success on single runs, dropping to 25% across eight consecutive runs without resilience engineering. The failure modes are unique: partial/malformed responses (the model returns half an answer), model version drift (behavior changes silently when the provider updates), timeout/latency spikes (P95 latency can be 10x P50), content policy rejections (valid business queries blocked by safety filters), context window overflow (conversation exceeds token limits), and rate limiting (HTTP 429 under load). Top models now achieve hallucination rates below 1%, but agents still face infinite loops, context drift, and cascading tool failures that traditional monitoring won't catch.
Agent Failure Modes
Success rates without resilience:
- Single run: 60%
- 8 consecutive runs: 25%

LLM-specific failures:
- Partial/malformed responses
- Model version drift (silent)
- Timeout spikes (P95 = 10x P50)
- Content policy rejections
- Context window overflow
- Rate limiting (HTTP 429)
- Infinite loops
- Context drift
- Cascading tool failures

Hallucination rates (2026):
- Best models: < 1%
- Gemini-2.0-Flash: 0.7%
Why it matters: A 60% single-run success rate means 4 out of 10 agent actions fail. Without resilience patterns, your agent is a coin flip with slightly better odds. Production requires engineering these failures away.
The Five Pillars of Observability
89% of organizations have implemented agent observability — here's what to monitor
Beyond Traditional Logging
Agent observability requires five pillars beyond traditional application monitoring. Traces: capture every step, prompt, tool call, and model invocation as a connected trace. Metrics: monitor latency (P50/P95/P99), token usage, cost per interaction, and throughput tied to SLAs. Logs & payloads: persist raw prompts, completions, and tool responses for debugging and audit. Online evaluations: run real-time automated evaluators for faithfulness, safety, and PII leakage. Human review loops: incorporate subject matter experts for risky outputs. Organizations with mature monitoring report 80% faster incident resolution, 50% reduction in production issues, and 30% cost savings. By 2026, 89% of organizations have implemented observability, with quality issues as the top production barrier at 32%.
Five Pillars
1. Traces: every step as a connected trace (prompt → tool call → response)
2. Metrics: latency (P50 / P95 / P99), token usage & cost per interaction, throughput vs SLA
3. Logs & payloads: raw prompts & completions, tool call inputs & outputs
4. Online evaluations: faithfulness scoring, safety & PII detection
5. Human review loops: SME review for risky outputs

Impact: 80% faster resolution, 50% fewer issues, 30% cost savings
Key insight: The most critical metric is P95 end-to-end latency, not average latency. An agent with 500ms average but 8-second P95 will frustrate 1 in 20 users — and those users will be the loudest critics.
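The percentile arithmetic behind that insight is easy to wire into any metrics pipeline. A minimal sketch in Python using only the standard library (the function name and the simulated latency distribution are illustrative, not from any particular monitoring tool):

```python
# Sketch: computing P50/P95/P99 latency from raw samples collected
# per agent interaction. Latencies are in seconds.
import random
import statistics

def latency_percentiles(samples):
    """Return P50/P95/P99 from a list of latency samples."""
    # quantiles(n=100) yields the 1st..99th percentile cut points
    q = statistics.quantiles(samples, n=100)
    return {"p50": q[49], "p95": q[94], "p99": q[98]}

# Simulated agent latencies: mostly fast, with a heavy tail
random.seed(0)
samples = [random.expovariate(1 / 0.5) for _ in range(10_000)]
p = latency_percentiles(samples)
print(f"P50={p['p50']:.2f}s  P95={p['p95']:.2f}s  P99={p['p99']:.2f}s")
```

On heavy-tailed distributions like this one, P95 lands at several multiples of P50, which is exactly why alerting on the average hides the experience of the slowest users.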
Circuit Breakers
Prevent cascading failures with state-machine protection
The Pattern
Circuit breakers prevent cascading failures by monitoring error rates and temporarily stopping requests to failing services. The pattern uses three states: CLOSED (normal operation, requests pass through), OPEN (too many failures detected, requests are immediately rejected with a fallback response), and HALF-OPEN (after a recovery timeout, a limited number of test requests are allowed through to check if the service has recovered). Configure with a failure threshold (e.g., 5 failures in 60 seconds triggers OPEN), a recovery timeout (e.g., 30 seconds before trying HALF-OPEN), and a success threshold (e.g., 3 consecutive successes in HALF-OPEN returns to CLOSED). Without circuit breakers, a failing LLM API will consume your entire request budget while returning errors.
Circuit Breaker States
CLOSED (normal):
- Requests pass through
- Monitor error rate
- If failures > threshold → OPEN

OPEN (protecting):
- Reject requests immediately
- Return fallback response
- Wait recovery timeout

HALF-OPEN (testing):
- Allow limited test requests
- If success → CLOSED
- If failure → OPEN

Configuration:
- Failure threshold: 5 in 60s
- Recovery timeout: 30s
- Success threshold: 3 consecutive
Key insight: Circuit breakers are especially critical for AI agents because LLM API failures are expensive. Each failed request still consumes tokens (partial responses), network bandwidth, and user patience. Failing fast saves all three.
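The state machine above fits in a few dozen lines. A minimal Python sketch assuming a synchronous call path; class and method names are illustrative, and the defaults mirror the example configuration (5 failures in 60s, 30s recovery, 3 consecutive successes):

```python
import time
from enum import Enum

class State(Enum):
    CLOSED = "closed"        # normal operation
    OPEN = "open"            # failing fast, protecting the backend
    HALF_OPEN = "half_open"  # probing for recovery

class CircuitBreaker:
    def __init__(self, failure_threshold=5, window=60.0,
                 recovery_timeout=30.0, success_threshold=3):
        self.failure_threshold = failure_threshold
        self.window = window
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.state = State.CLOSED
        self.failures = []   # timestamps of recent failures
        self.successes = 0   # consecutive successes while HALF_OPEN
        self.opened_at = 0.0

    def call(self, fn, fallback):
        now = time.monotonic()
        if self.state is State.OPEN:
            if now - self.opened_at >= self.recovery_timeout:
                self.state = State.HALF_OPEN   # start probing
                self.successes = 0
            else:
                return fallback()              # fail fast while protecting
        try:
            result = fn()
        except Exception:
            self._on_failure(now)
            return fallback()
        self._on_success()
        return result

    def _on_failure(self, now):
        if self.state is State.HALF_OPEN:
            self.state = State.OPEN            # probe failed: reopen
            self.opened_at = now
            return
        # count failures inside the sliding window only
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = now

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = State.CLOSED      # recovered
                self.failures = []
        elif self.state is State.CLOSED:
            self.failures = []                 # reset on healthy traffic
```

Wrapping every LLM call in `breaker.call(make_request, cached_answer)` gives the fail-fast behavior described above; production libraries add thread safety and per-endpoint breakers on top of this core.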
Retry & Fallback Strategies
Exponential backoff, error classification, and multi-tier model degradation
Retry Engineering
Not all errors deserve retries. Transient failures (rate limits, timeouts, API connection errors) warrant retries with exponential backoff. Permanent failures (authentication errors, content policy rejections, invalid inputs) require fast failure — retrying will never succeed. The retry formula: delay = base_delay × 2^attempt + jitter, with a maximum of 3–5 retries. Beyond retries, implement fallback chains: multi-tier model degradation from full functionality (primary model) to core functionality (smaller/cheaper model) to basic responses (cached/template responses). This ensures the user always gets something useful, even when the primary model is unavailable.
Retry & Fallback
Error classification:
- Retryable: rate limit, timeout, connection error, 5xx
- Non-retryable: auth error, 4xx, content policy, invalid input

Exponential backoff:
- delay = base × 2^attempt + jitter
- Max retries: 3-5
- Base delay: 1s
- Attempts: 1s, 2s, 4s, 8s, 16s

Fallback chain:
- Tier 1: Primary model (full functionality)
- Tier 2: Smaller model (core functionality)
- Tier 3: Cached/template (basic responses)
- User always gets something useful
Key insight: The fallback chain is a product decision, not just an engineering decision. Define with product managers what "degraded but acceptable" looks like for each agent capability. A slow, partial answer is often better than no answer.
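Both halves of this section, error-classified retries and the fallback chain, can be sketched briefly. A minimal Python version assuming the retryable errors map to standard exception types (in practice you would match on HTTP status codes too; all names here are illustrative):

```python
import random
import time

# Transient failures worth retrying; auth/policy/validation errors are not
RETRYABLE = (TimeoutError, ConnectionError)

def retry_with_backoff(fn, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """delay = base × 2^attempt + jitter; non-retryable errors fail fast."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RETRYABLE:
            if attempt == max_retries:
                raise                        # retries exhausted
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            sleep(delay)                     # 1s, 2s, 4s, 8s (+ jitter)
        # any other exception (auth, content policy, invalid input)
        # propagates immediately: retrying would never succeed

def with_fallbacks(*tiers):
    """Try each tier in order; the last tier must never fail (cached/template)."""
    for tier in tiers[:-1]:
        try:
            return tier()
        except Exception:
            continue                         # degrade to the next tier
    return tiers[-1]()
```

Composing the two, `with_fallbacks(lambda: retry_with_backoff(primary), small_model, cached_answer)`, gives the multi-tier degradation described above: retries absorb transient failures, and the chain absorbs sustained outages.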
Guardrails & Safety
Input validation, output filtering, and the guardrails that prevent catastrophic failures
Defense in Depth
Production guardrails operate at three layers. Input guardrails: validate and sanitize all inputs before they reach the model — check for prompt injection, PII in queries that shouldn't contain it, and inputs that exceed context windows. Output guardrails: validate all model outputs before they reach the user — check for hallucinated data, PII leakage, off-topic responses, and outputs that violate business rules. Execution guardrails: limit what the agent can do — restrict tool access by role, enforce spending limits per action, set maximum loop iterations, and require approval for irreversible actions. These three layers create defense in depth: if one layer fails, the others catch the problem. No single guardrail is sufficient; the combination is what makes the system safe.
Three-Layer Guardrails
Input guardrails:
□ Prompt injection detection
□ PII scanning
□ Context window check
□ Input sanitization

Output guardrails:
□ Hallucination detection
□ PII leakage check
□ Business rule validation
□ Topic boundary enforcement

Execution guardrails:
□ Tool access by role
□ Spending limits per action
□ Max loop iterations
□ Approval for irreversible actions

Defense in depth: three layers; no single guardrail is sufficient.
Key insight: The most dangerous guardrail gap is execution limits. An agent without a maximum loop iteration count can enter an infinite loop that burns through your entire API budget in minutes. Set hard limits on every dimension.
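The execution limits called out above are the simplest guardrail to implement and the costliest to omit. A minimal Python sketch of a per-action guard enforcing a loop cap and a spend cap (class name and limits are illustrative):

```python
class ExecutionGuard:
    """Hard limits on loop iterations and per-action spend."""
    def __init__(self, max_iterations=10, max_cost_usd=0.50):
        self.max_iterations = max_iterations
        self.max_cost_usd = max_cost_usd
        self.iterations = 0
        self.cost_usd = 0.0

    def check(self, step_cost_usd):
        """Call once per agent step, before executing it."""
        self.iterations += 1
        self.cost_usd += step_cost_usd
        if self.iterations > self.max_iterations:
            raise RuntimeError(f"loop limit exceeded ({self.max_iterations})")
        if self.cost_usd > self.max_cost_usd:
            raise RuntimeError(f"spend limit exceeded (${self.max_cost_usd:.2f})")

# A buggy agent loop that never terminates on its own: the guard halts it
guard = ExecutionGuard(max_iterations=3, max_cost_usd=1.00)
try:
    while True:
        guard.check(step_cost_usd=0.01)
except RuntimeError as e:
    print("halted:", e)   # halted: loop limit exceeded (3)
```

The same pattern extends to the other execution guardrails: a tool-access check and an approval gate are just additional conditions raised from `check` before a step runs.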
Cost Controls
46% of AI budgets go to inference — controlling costs at scale
Cost Management
With 46% of AI budgets spent on inference, cost control is a production requirement, not an optimization. Implement controls at four levels. Per-request budgets: set maximum token limits per agent action to prevent runaway costs from verbose prompts or infinite loops. Per-user budgets: cap daily/monthly usage per user to prevent abuse and ensure fair distribution. Per-agent budgets: set monthly spending limits per agent with alerts at 80% and hard stops at 100%. Model routing: use cheaper models for simple tasks and expensive models only when needed — a well-designed router can cut inference costs by 40–60% with minimal quality impact. Provisioned throughput from cloud providers offers 30–50% savings for predictable workloads.
Cost Control Layers
Per-request:
- Max tokens per action
- Prevent runaway prompts

Per-user:
- Daily / monthly caps
- Fair distribution

Per-agent:
- Monthly spending limit
- Alert at 80%, stop at 100%

Model routing:
- Simple tasks → cheap model
- Complex tasks → capable model
- Savings: 40-60% cost reduction

Provisioned throughput:
- Predictable workloads
- Savings: 30-50%
Key insight: Model routing is the highest-leverage cost optimization. Most enterprise agent actions (classification, extraction, simple Q&A) don't need the most expensive model. Route intelligently and save 40–60% without users noticing.
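A router can start as a simple rule table before graduating to a learned classifier. A minimal Python sketch; the model names, prices, and the complexity heuristic are placeholders, not a real provider's catalog:

```python
# Complexity-based model routing: cheap model for routine work,
# capable model only when the task or prompt demands it.
ROUTES = {
    "simple":  {"model": "small-model",   "cost_per_1k_tokens": 0.0002},
    "complex": {"model": "capable-model", "cost_per_1k_tokens": 0.0100},
}

# Task types known to be handled well by the cheap tier (illustrative)
SIMPLE_TASKS = {"classify", "extract", "faq"}

def route(task_type, prompt):
    """Route to the cheap tier for routine tasks with short prompts."""
    simple = task_type in SIMPLE_TASKS and len(prompt) < 2000
    return ROUTES["simple" if simple else "complex"]

print(route("classify", "Is this email spam?")["model"])      # small-model
print(route("plan", "Draft a data migration plan")["model"])  # capable-model
```

Even this crude heuristic captures the core economics: if most traffic is classification and extraction, the bulk of requests land on the tier that costs a fraction as much per token.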
Scaling Patterns
From pilot to enterprise-wide: the scaling playbook
Scaling Strategy
Scaling from pilot to enterprise-wide deployment requires a deliberate strategy, not just "turn it on for everyone." The scaling playbook has four phases. Single workflow: prove the agent works for one team on one process with full monitoring. Team-wide: expand to the full team, adding load testing and performance baselines. Department-wide: expand across the department, adding role-based access, cost allocation, and cross-team monitoring. Enterprise-wide: full rollout with centralized governance, federated management, and executive dashboards. At each phase, validate that latency stays within SLA, costs scale linearly (not exponentially), error rates don't increase, and human oversight capacity matches demand. The most common scaling failure is overwhelming the human review pipeline.
Scaling Phases
Phase 1: Single workflow
- 1 team, 1 process
- Full monitoring, daily review

Phase 2: Team-wide
- Full team adoption
- Load testing, performance baselines

Phase 3: Department-wide
- Cross-team rollout
- Role-based access, cost allocation

Phase 4: Enterprise-wide
- Centralized governance
- Federated management
- Executive dashboards

Validate at each phase:
□ Latency within SLA
□ Costs scale linearly
□ Error rates stable
□ Human review capacity OK
Key insight: The most common scaling failure is overwhelming the human review pipeline. If your agent escalates 15% of actions and you scale 10x, your review team needs 10x capacity too. Plan human scaling alongside agent scaling.
The Production Readiness Checklist
The final gate before your agent goes live
Go/No-Go Criteria
Before any agent goes to production, it must pass a production readiness review covering all the patterns from this chapter. This isn't a formality — it's the gate that separates pilots that impress in demos from agents that survive real-world conditions. The checklist covers observability (all five pillars implemented), resilience (circuit breakers, retries, fallbacks tested), guardrails (input, output, and execution guardrails active), cost controls (per-request, per-user, per-agent budgets set), scaling plan (load tested at 3x expected peak), compliance (audit trails, documentation, human oversight), and runbook (incident response procedures documented and rehearsed). An agent that passes all seven categories is production-ready. Anything less is a risk.
Readiness Checklist
Production readiness review:

Observability:
□ Traces, metrics, logs, evals, review

Resilience:
□ Circuit breakers configured
□ Retries with backoff
□ Fallback chain defined

Guardrails:
□ Input, output, execution guards

Cost controls:
□ Per-request/user/agent budgets

Scaling:
□ Load tested at 3x peak

Compliance:
□ Audit trails, docs, oversight

Runbook:
□ Incident response documented

All 7 categories = production ready
Key insight: The production readiness review should be a recurring event, not a one-time gate. Run it quarterly, because the agent's environment changes (model updates, new integrations, scaling) even when the agent's code doesn't.