Ch 12 — Production Hardening & Scaling

Circuit breakers, observability, cost controls, and scaling from pilot to enterprise-wide without the wheels falling off
Why Agents Fail Differently
LLM APIs fail in ways traditional APIs never do — and your infrastructure isn't ready
Novel Failure Modes
LLM-powered agents fail differently than traditional software. A 2026 study found agents achieve only 60% success on single runs, dropping to 25% across eight consecutive runs without resilience engineering. The failure modes are unique: partial/malformed responses (the model returns half an answer), model version drift (behavior changes silently when the provider updates), timeout/latency spikes (P95 latency can be 10x P50), content policy rejections (valid business queries blocked by safety filters), context window overflow (conversation exceeds token limits), and rate limiting (HTTP 429 under load). Top models now achieve hallucination rates below 1%, but agents still face infinite loops, context drift, and cascading tool failures that traditional monitoring won't catch.
Agent Failure Modes
Success rates without resilience:
- Single run: 60%
- 8 consecutive runs: 25%

LLM-specific failures:
- Partial/malformed responses
- Model version drift (silent)
- Timeout spikes (P95 = 10x P50)
- Content policy rejections
- Context window overflow
- Rate limiting (HTTP 429)
- Infinite loops
- Context drift
- Cascading tool failures

Hallucination rates (2026):
- Best models: < 1%
- Gemini-2.0-Flash: 0.7%
Why it matters: A 60% single-run success rate means 4 out of 10 agent actions fail. Without resilience patterns, your agent is a coin flip with slightly better odds. Production requires engineering these failures away.
The Five Pillars of Observability
89% of organizations have implemented agent observability — here's what to monitor
Beyond Traditional Logging
Agent observability requires five pillars beyond traditional application monitoring. Traces: capture every step, prompt, tool call, and model invocation as a connected trace. Metrics: monitor latency (P50/P95/P99), token usage, cost per interaction, and throughput tied to SLAs. Logs & payloads: persist raw prompts, completions, and tool responses for debugging and audit. Online evaluations: run real-time automated evaluators for faithfulness, safety, and PII leakage. Human review loops: incorporate subject matter experts for risky outputs. Organizations with mature monitoring report 80% faster incident resolution, 50% reduction in production issues, and 30% cost savings. By 2026, 89% of organizations have implemented observability, with quality issues as the top production barrier at 32%.
Five Pillars
1. Traces: every step as a connected trace (prompt → tool call → response)
2. Metrics: latency (P50 / P95 / P99), token usage & cost per interaction, throughput vs SLA
3. Logs & payloads: raw prompts & completions, tool call inputs & outputs
4. Online evaluations: faithfulness scoring, safety & PII detection
5. Human review loops: SME review for risky outputs

Impact: 80% faster resolution, 50% fewer issues, 30% cost savings
Key insight: The most critical metric is P95 end-to-end latency, not average latency. An agent with 500ms average but 8-second P95 will frustrate 1 in 20 users — and those users will be the loudest critics.
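The percentile arithmetic behind that insight is easy to wire into any metrics pipeline. A minimal sketch in Python using only the standard library (the function name and the simulated latency distribution are illustrative, not from any particular monitoring tool):

```python
# Sketch: computing P50/P95/P99 latency from raw samples collected
# per agent interaction. Latencies are in seconds.
import random
import statistics

def latency_percentiles(samples):
    """Return P50/P95/P99 from a list of latency samples."""
    # quantiles(n=100) yields the 1st..99th percentile cut points
    q = statistics.quantiles(samples, n=100)
    return {"p50": q[49], "p95": q[94], "p99": q[98]}

# Simulated agent latencies: mostly fast, with a heavy tail
random.seed(0)
samples = [random.expovariate(1 / 0.5) for _ in range(10_000)]
p = latency_percentiles(samples)
print(f"P50={p['p50']:.2f}s  P95={p['p95']:.2f}s  P99={p['p99']:.2f}s")
```

On heavy-tailed distributions like this one, P95 lands at several multiples of P50, which is exactly why alerting on the average hides the experience of the slowest users.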
Circuit Breakers
Prevent cascading failures with state-machine protection
The Pattern
Circuit breakers prevent cascading failures by monitoring error rates and temporarily stopping requests to failing services. The pattern uses three states: CLOSED (normal operation, requests pass through), OPEN (too many failures detected, requests are immediately rejected with a fallback response), and HALF-OPEN (after a recovery timeout, a limited number of test requests are allowed through to check if the service has recovered). Configure with a failure threshold (e.g., 5 failures in 60 seconds triggers OPEN), a recovery timeout (e.g., 30 seconds before trying HALF-OPEN), and a success threshold (e.g., 3 consecutive successes in HALF-OPEN returns to CLOSED). Without circuit breakers, a failing LLM API will consume your entire request budget while returning errors.
Circuit Breaker States
CLOSED (normal):
- Requests pass through
- Monitor error rate
- If failures > threshold → OPEN

OPEN (protecting):
- Reject requests immediately
- Return fallback response
- Wait recovery timeout

HALF-OPEN (testing):
- Allow limited test requests
- If success → CLOSED
- If failure → OPEN

Configuration:
- Failure threshold: 5 in 60s
- Recovery timeout: 30s
- Success threshold: 3 consecutive
Key insight: Circuit breakers are especially critical for AI agents because LLM API failures are expensive. Each failed request still consumes tokens (partial responses), network bandwidth, and user patience. Failing fast saves all three.
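The state machine above fits in a few dozen lines. A minimal Python sketch assuming a synchronous call path; class and method names are illustrative, and the defaults mirror the example configuration (5 failures in 60s, 30s recovery, 3 consecutive successes):

```python
import time
from enum import Enum

class State(Enum):
    CLOSED = "closed"        # normal operation
    OPEN = "open"            # failing fast, protecting the backend
    HALF_OPEN = "half_open"  # probing for recovery

class CircuitBreaker:
    def __init__(self, failure_threshold=5, window=60.0,
                 recovery_timeout=30.0, success_threshold=3):
        self.failure_threshold = failure_threshold
        self.window = window
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.state = State.CLOSED
        self.failures = []   # timestamps of recent failures
        self.successes = 0   # consecutive successes while HALF_OPEN
        self.opened_at = 0.0

    def call(self, fn, fallback):
        now = time.monotonic()
        if self.state is State.OPEN:
            if now - self.opened_at >= self.recovery_timeout:
                self.state = State.HALF_OPEN   # start probing
                self.successes = 0
            else:
                return fallback()              # fail fast while protecting
        try:
            result = fn()
        except Exception:
            self._on_failure(now)
            return fallback()
        self._on_success()
        return result

    def _on_failure(self, now):
        if self.state is State.HALF_OPEN:
            self.state = State.OPEN            # probe failed: reopen
            self.opened_at = now
            return
        # count failures inside the sliding window only
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = now

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = State.CLOSED      # recovered
                self.failures = []
        elif self.state is State.CLOSED:
            self.failures = []                 # reset on healthy traffic
```

Wrapping every LLM call in `breaker.call(make_request, cached_answer)` gives the fail-fast behavior described above; production libraries add thread safety and per-endpoint breakers on top of this core.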
Retry & Fallback Strategies
Exponential backoff, error classification, and multi-tier model degradation
Retry Engineering
Not all errors deserve retries. Transient failures (rate limits, timeouts, API connection errors) warrant retries with exponential backoff. Permanent failures (authentication errors, content policy rejections, invalid inputs) require fast failure — retrying will never succeed. The retry formula: delay = base_delay × 2^attempt + jitter, with a maximum of 3–5 retries. Beyond retries, implement fallback chains: multi-tier model degradation from full functionality (primary model) to core functionality (smaller/cheaper model) to basic responses (cached/template responses). This ensures the user always gets something useful, even when the primary model is unavailable.
Retry & Fallback
Error classification:
- Retryable: rate limit, timeout, connection error, 5xx
- Non-retryable: auth error, 4xx, content policy, invalid input

Exponential backoff:
- delay = base × 2^attempt + jitter
- Max retries: 3-5
- Base delay: 1s
- Attempts: 1s, 2s, 4s, 8s, 16s

Fallback chain:
- Tier 1: Primary model (full functionality)
- Tier 2: Smaller model (core functionality)
- Tier 3: Cached/template (basic responses)
- User always gets something useful
Key insight: The fallback chain is a product decision, not just an engineering decision. Define with product managers what "degraded but acceptable" looks like for each agent capability. A slow, partial answer is often better than no answer.
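Both halves of this section, error-classified retries and the fallback chain, can be sketched briefly. A minimal Python version assuming the retryable errors map to standard exception types (in practice you would match on HTTP status codes too; all names here are illustrative):

```python
import random
import time

# Transient failures worth retrying; auth/policy/validation errors are not
RETRYABLE = (TimeoutError, ConnectionError)

def retry_with_backoff(fn, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """delay = base × 2^attempt + jitter; non-retryable errors fail fast."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RETRYABLE:
            if attempt == max_retries:
                raise                        # retries exhausted
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            sleep(delay)                     # 1s, 2s, 4s, 8s (+ jitter)
        # any other exception (auth, content policy, invalid input)
        # propagates immediately: retrying would never succeed

def with_fallbacks(*tiers):
    """Try each tier in order; the last tier must never fail (cached/template)."""
    for tier in tiers[:-1]:
        try:
            return tier()
        except Exception:
            continue                         # degrade to the next tier
    return tiers[-1]()
```

Composing the two, `with_fallbacks(lambda: retry_with_backoff(primary), small_model, cached_answer)`, gives the multi-tier degradation described above: retries absorb transient failures, and the chain absorbs sustained outages.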
Guardrails & Safety
Input validation, output filtering, and the guardrails that prevent catastrophic failures
Defense in Depth
Production guardrails operate at three layers. Input guardrails: validate and sanitize all inputs before they reach the model — check for prompt injection, PII in queries that shouldn't contain it, and inputs that exceed context windows. Output guardrails: validate all model outputs before they reach the user — check for hallucinated data, PII leakage, off-topic responses, and outputs that violate business rules. Execution guardrails: limit what the agent can do — restrict tool access by role, enforce spending limits per action, set maximum loop iterations, and require approval for irreversible actions. These three layers create defense in depth: if one layer fails, the others catch the problem. No single guardrail is sufficient; the combination is what makes the system safe.
Three-Layer Guardrails
Input guardrails:
□ Prompt injection detection
□ PII scanning
□ Context window check
□ Input sanitization

Output guardrails:
□ Hallucination detection
□ PII leakage check
□ Business rule validation
□ Topic boundary enforcement

Execution guardrails:
□ Tool access by role
□ Spending limits per action
□ Max loop iterations
□ Approval for irreversible actions

Defense in depth: three layers; no single guardrail is sufficient.
Key insight: The most dangerous guardrail gap is execution limits. An agent without a maximum loop iteration count can enter an infinite loop that burns through your entire API budget in minutes. Set hard limits on every dimension.
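The execution limits called out above are the simplest guardrail to implement and the costliest to omit. A minimal Python sketch of a per-action guard enforcing a loop cap and a spend cap (class name and limits are illustrative):

```python
class ExecutionGuard:
    """Hard limits on loop iterations and per-action spend."""
    def __init__(self, max_iterations=10, max_cost_usd=0.50):
        self.max_iterations = max_iterations
        self.max_cost_usd = max_cost_usd
        self.iterations = 0
        self.cost_usd = 0.0

    def check(self, step_cost_usd):
        """Call once per agent step, before executing it."""
        self.iterations += 1
        self.cost_usd += step_cost_usd
        if self.iterations > self.max_iterations:
            raise RuntimeError(f"loop limit exceeded ({self.max_iterations})")
        if self.cost_usd > self.max_cost_usd:
            raise RuntimeError(f"spend limit exceeded (${self.max_cost_usd:.2f})")

# A buggy agent loop that never terminates on its own: the guard halts it
guard = ExecutionGuard(max_iterations=3, max_cost_usd=1.00)
try:
    while True:
        guard.check(step_cost_usd=0.01)
except RuntimeError as e:
    print("halted:", e)   # halted: loop limit exceeded (3)
```

The same pattern extends to the other execution guardrails: a tool-access check and an approval gate are just additional conditions raised from `check` before a step runs.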
Cost Controls
46% of AI budgets go to inference — controlling costs at scale
Cost Management
With 46% of AI budgets spent on inference, cost control is a production requirement, not an optimization. Implement controls at four levels. Per-request budgets: set maximum token limits per agent action to prevent runaway costs from verbose prompts or infinite loops. Per-user budgets: cap daily/monthly usage per user to prevent abuse and ensure fair distribution. Per-agent budgets: set monthly spending limits per agent with alerts at 80% and hard stops at 100%. Model routing: use cheaper models for simple tasks and expensive models only when needed — a well-designed router can cut inference costs by 40–60% with minimal quality impact. Provisioned throughput from cloud providers offers 30–50% savings for predictable workloads.
Cost Control Layers
Per-request:
- Max tokens per action
- Prevent runaway prompts

Per-user:
- Daily / monthly caps
- Fair distribution

Per-agent:
- Monthly spending limit
- Alert at 80%, stop at 100%

Model routing:
- Simple tasks → cheap model
- Complex tasks → capable model
- Savings: 40-60% cost reduction

Provisioned throughput:
- Predictable workloads
- Savings: 30-50%
Key insight: Model routing is the highest-leverage cost optimization. Most enterprise agent actions (classification, extraction, simple Q&A) don't need the most expensive model. Route intelligently and save 40–60% without users noticing.
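A router can start as a simple rule table before graduating to a learned classifier. A minimal Python sketch; the model names, prices, and the complexity heuristic are placeholders, not a real provider's catalog:

```python
# Complexity-based model routing: cheap model for routine work,
# capable model only when the task or prompt demands it.
ROUTES = {
    "simple":  {"model": "small-model",   "cost_per_1k_tokens": 0.0002},
    "complex": {"model": "capable-model", "cost_per_1k_tokens": 0.0100},
}

# Task types known to be handled well by the cheap tier (illustrative)
SIMPLE_TASKS = {"classify", "extract", "faq"}

def route(task_type, prompt):
    """Route to the cheap tier for routine tasks with short prompts."""
    simple = task_type in SIMPLE_TASKS and len(prompt) < 2000
    return ROUTES["simple" if simple else "complex"]

print(route("classify", "Is this email spam?")["model"])      # small-model
print(route("plan", "Draft a data migration plan")["model"])  # capable-model
```

Even this crude heuristic captures the core economics: if most traffic is classification and extraction, the bulk of requests land on the tier that costs a fraction as much per token.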
Scaling Patterns
From pilot to enterprise-wide: the scaling playbook
Scaling Strategy
Scaling from pilot to enterprise-wide deployment requires a deliberate strategy, not just "turn it on for everyone." The scaling playbook has four phases. Single workflow: prove the agent works for one team on one process with full monitoring. Team-wide: expand to the full team, adding load testing and performance baselines. Department-wide: expand across the department, adding role-based access, cost allocation, and cross-team monitoring. Enterprise-wide: full rollout with centralized governance, federated management, and executive dashboards. At each phase, validate that latency stays within SLA, costs scale linearly (not exponentially), error rates don't increase, and human oversight capacity matches demand. The most common scaling failure is overwhelming the human review pipeline.
Scaling Phases
Phase 1: Single workflow
- 1 team, 1 process
- Full monitoring, daily review

Phase 2: Team-wide
- Full team adoption
- Load testing, performance baselines

Phase 3: Department-wide
- Cross-team rollout
- Role-based access, cost allocation

Phase 4: Enterprise-wide
- Centralized governance
- Federated management
- Executive dashboards

Validate at each phase:
□ Latency within SLA
□ Costs scale linearly
□ Error rates stable
□ Human review capacity OK
Key insight: The most common scaling failure is overwhelming the human review pipeline. If your agent escalates 15% of actions and you scale 10x, your review team needs 10x capacity too. Plan human scaling alongside agent scaling.
The Production Readiness Checklist
The final gate before your agent goes live
Go/No-Go Criteria
Before any agent goes to production, it must pass a production readiness review covering all the patterns from this chapter. This isn't a formality — it's the gate that separates pilots that impress in demos from agents that survive real-world conditions. The checklist covers observability (all five pillars implemented), resilience (circuit breakers, retries, fallbacks tested), guardrails (input, output, and execution guardrails active), cost controls (per-request, per-user, per-agent budgets set), scaling plan (load tested at 3x expected peak), compliance (audit trails, documentation, human oversight), and runbook (incident response procedures documented and rehearsed). An agent that passes all seven categories is production-ready. Anything less is a risk.
Readiness Checklist
Production readiness review:

Observability:
□ Traces, metrics, logs, evals, review

Resilience:
□ Circuit breakers configured
□ Retries with backoff
□ Fallback chain defined

Guardrails:
□ Input, output, execution guards

Cost controls:
□ Per-request/user/agent budgets

Scaling:
□ Load tested at 3x peak

Compliance:
□ Audit trails, docs, oversight

Runbook:
□ Incident response documented

All 7 categories = production ready
Key insight: The production readiness review should be a recurring event, not a one-time gate. Run it quarterly, because the agent's environment changes (model updates, new integrations, scaling) even when the agent's code doesn't.