Ch 7 — Agent Cost Engineering

The employee analogy — agents bill by the minute and sometimes spin in circles
High-level flow: Employee → Drivers → Doom Loop → Monitor → Guardrails → Tools
The Employee Analogy
An agent is like hiring a worker who bills by the minute
Why Agents Are Expensive
Imagine hiring a contractor who bills by the minute, works autonomously, and sometimes gets stuck in circles doing the same thing over and over. That’s an AI agent. Unlike a chatbot (one question, one answer), an agent makes 3–10 LLM calls per task: planning what to do, selecting tools, executing actions, checking results, and sometimes retrying when things go wrong.
The Cost Multiplier
For software engineering work, a single unconstrained agent task typically costs $5–8 in API fees. That’s because each task triggers multiple LLM calls, each with growing context (previous steps are appended to the context window). By the 8th call in a task, the agent might be sending 30,000+ input tokens per call, paying the quadratic scaling tax from Chapter 3.
Key insight: Agents are the most expensive AI workload because they combine all the cost multipliers: multi-turn context growth, output-heavy generation (planning and code), and unpredictable execution paths. Cost engineering for agents is not optional — it’s survival.
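To make the multiplier concrete, here is a minimal sketch of how per-task cost compounds as context accumulates across calls. All prices and token counts are illustrative assumptions (frontier-model-tier pricing), not quotes from any provider.

```python
# Illustrative sketch: agent task cost compounds as context grows.
# Prices and token counts are assumptions for demonstration only.
INPUT_PRICE = 15.00 / 1_000_000   # $ per input token (assumed frontier model)
OUTPUT_PRICE = 75.00 / 1_000_000  # $ per output token (assumed)

def task_cost(num_calls=8, base_context=4_000, tool_output=3_000, llm_output=1_500):
    """Every call re-sends all accumulated context, so input cost compounds."""
    total, context = 0.0, base_context
    for _ in range(num_calls):
        total += context * INPUT_PRICE + llm_output * OUTPUT_PRICE
        # prior LLM output and tool results carry forward into the next call
        context += llm_output + tool_output
    return context, total

final_context, cost = task_cost()
print(f"final context: {final_context:,} tokens, task cost: ${cost:.2f}")
# → final context: 40,000 tokens, task cost: $3.27
```

Note the superlinearity: eight calls cost roughly 19x what one call does under these assumptions, because later calls re-send everything the earlier calls produced.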
The Four Cost Drivers
Planning, tool selection, execution, and verification
Cost Driver Breakdown
// Where agent tokens go (typical task)
1. Planning        15–25% of total tokens  (analyzing task, creating execution plan)
2. Tool Selection  10–15% of total tokens  (choosing which tools/APIs to call)
3. Execution       40–55% of total tokens  (running tools, processing results)
4. Verification    15–25% of total tokens  (checking results, error handling)
The Context Growth Problem
Each step in an agent task adds to the context window. The planning output becomes input for tool selection. Tool results become input for the next step. By step 8, the agent is carrying the entire conversation history — 20,000–50,000 tokens of accumulated context. This triggers both quadratic scaling costs and potential context surcharges.
Key insight: Execution is the biggest cost driver (40–55%), but it’s also the hardest to optimize because it depends on the task. The most impactful optimization is constraining the number of iterations (max steps) and compressing context between steps.
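One way to implement the compression mentioned above is to keep only the most recent steps verbatim and collapse older ones into short stubs. This is a hypothetical sketch: in production the stub would usually be an LLM-written summary rather than a simple truncation.

```python
def compress_history(steps, keep_recent=2, summary_width=80):
    """Collapse all but the last `keep_recent` steps into one-line stubs.

    `steps` is a list of step transcripts (plain strings). Truncation stands
    in for a real summarizer here, purely for illustration.
    """
    if len(steps) <= keep_recent:
        return steps
    older, recent = steps[:-keep_recent], steps[-keep_recent:]
    stubs = [s.replace("\n", " ")[:summary_width] for s in older]
    header = f"[compressed {len(older)} earlier steps]"
    return [header] + stubs + recent
```

The trade-off: compression cuts the quadratic input-token growth, at the risk of discarding a detail the agent later needs, which is why recent steps are kept intact.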
Doom Loops: The Cost Disaster
When agents get stuck and burn through your budget
What Is a Doom Loop?
A doom loop occurs when an agent gets stuck in a retry cycle — attempting the same action, failing, and trying again without making progress. The agent keeps consuming tokens with each attempt, and the context window grows with each failed attempt’s error messages. Without guardrails, a doom loop can run for hours.
Real-World Horror Stories
A documented incident: two LangChain agents entered an infinite conversation cycle that generated a $47,000 bill over 11 days. Another case: an agent generated 2.3 million unintended API calls over a weekend. A third: parallel agents spawning sub-calls burned through $10,000 in 4.5 days.
Prevention
Max iterations: hard cap on the number of LLM calls per task (typically 10–25).
Budget caps: maximum dollar amount per task ($10–50) and per agent per day ($100–500).
Timeout: maximum wall-clock time per task (5–30 minutes).
Error detection: if the agent repeats the same action 3 times, abort and escalate to a human.
Key insight: Every production agent system needs hard limits. The question is not “will a doom loop happen?” but “when it happens, how much will it cost before the guardrails kick in?” Set limits before deploying, not after the first incident.
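The prevention measures above can be sketched as a single guardrail object that the agent loop consults on every call. Class and exception names are my own; the limits match the ranges given in the text.

```python
from collections import deque

class DoomLoopAbort(Exception):
    """Raised when a hard limit is breached; the caller should escalate."""

class Guardrails:
    """Per-task limits: max iterations, budget cap, repeated-action detection."""

    def __init__(self, max_iterations=25, max_cost_usd=10.0, repeat_limit=3):
        self.max_iterations = max_iterations
        self.max_cost_usd = max_cost_usd
        self.repeat_limit = repeat_limit
        self.iterations = 0
        self.spent = 0.0
        self.recent_actions = deque(maxlen=repeat_limit)

    def check(self, action, call_cost_usd):
        """Record one LLM call and abort if any hard limit is breached."""
        self.iterations += 1
        self.spent += call_cost_usd
        self.recent_actions.append(action)
        if self.iterations > self.max_iterations:
            raise DoomLoopAbort(f"exceeded {self.max_iterations} iterations")
        if self.spent > self.max_cost_usd:
            raise DoomLoopAbort(f"budget ${self.max_cost_usd:.2f} exceeded")
        if (len(self.recent_actions) == self.repeat_limit
                and len(set(self.recent_actions)) == 1):
            raise DoomLoopAbort(f"same action repeated {self.repeat_limit} times")
```

A wall-clock timeout (the third measure) would wrap the whole task externally, e.g. with a process-level deadline, since it cannot be enforced from inside the loop alone.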
The Monitoring Gap
96% of enterprises exceed AI cost projections, but only 44% have guardrails
The Problem
96% of enterprises report AI costs exceeding initial projections. Yet only 44% have financial guardrails in place. Fewer than 1 in 5 agent deployments have token-level cost monitoring at launch. Teams without monitoring lose $8,000–23,000/month in undetected waste — agents running unnecessary tasks, doom loops, and inefficient model selection.
Why Traditional Monitoring Fails
Traditional cloud monitoring (CPU, memory, network) doesn’t capture AI costs because agents are autonomous and concurrent. By the time a dashboard updates showing high spend, the agent has already consumed the tokens. Effective agent cost monitoring requires pre-execution policy enforcement — checking the budget before each LLM call, not after.
Key insight: Cost monitoring for agents must be real-time and pre-emptive. Post-hoc dashboards that show yesterday’s spend are useful for analysis but useless for prevention. You need per-call budget checks that can abort a task before it exceeds limits.
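A pre-emptive budget check estimates the worst case of the next call before sending it, rather than tallying damage afterwards. The function name and prices here are illustrative assumptions.

```python
def precheck_call(spent_usd, budget_usd, est_input_tokens, max_output_tokens,
                  input_price=15.0 / 1e6, output_price=75.0 / 1e6):
    """Return True only if the worst-case cost of the NEXT call fits the budget.

    Prices are assumed; `max_output_tokens` should match the request's
    output cap so the estimate is a true upper bound.
    """
    worst_case = est_input_tokens * input_price + max_output_tokens * output_price
    return spent_usd + worst_case <= budget_usd
```

An agent loop calls this before every LLM request and aborts or escalates on False, so a task can never overshoot its budget mid-flight, which is exactly what a post-hoc dashboard cannot guarantee.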
The 4-Layer Cost Governance Framework
Token tracking, anomaly alerts, budget limits, and attribution
The Four Layers
// 4-layer agent cost governance
Layer 1: Token Tracking
  Log every LLM call: model, tokens, cost
  Per-task and per-agent attribution
Layer 2: Anomaly Alerts
  Alert when a task exceeds 2x expected cost
  Alert when an agent exceeds its daily budget
Layer 3: Budget Hard Limits
  Per-task: abort at $10–50
  Per-agent/day: throttle at $100–500
  Fleet/month: escalate at threshold
Layer 4: Weekly Attribution
  Cost per task type, per agent, per team
  Identify optimization opportunities
Three-Tier Enforcement
Effective cost governance requires three enforcement tiers: Per-action limits (max tokens per single LLM call), per-agent budgets (max spend per agent per day), and fleet-level throttling (max total spend across all agents per billing period). Each tier catches different failure modes.
Key insight: The 4-layer framework is not overhead — it’s the minimum viable cost infrastructure for production agents. Teams that implement it from day one avoid the $47,000 surprise bills that teams without it inevitably face.
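The three enforcement tiers can be sketched as one gate consulted before every call. Class name, limits, and the tuple return shape are illustrative choices, not from any specific tool.

```python
from collections import defaultdict

class TieredBudget:
    """Three tiers: per-call token cap, per-agent daily budget, fleet budget.

    All default limits are illustrative; real values come from your policy.
    """

    def __init__(self, max_tokens_per_call=8_000,
                 agent_daily_usd=200.0, fleet_period_usd=5_000.0):
        self.max_tokens_per_call = max_tokens_per_call
        self.agent_daily_usd = agent_daily_usd
        self.fleet_period_usd = fleet_period_usd
        self.agent_spend = defaultdict(float)  # agent_id -> spend today
        self.fleet_spend = 0.0

    def allow(self, agent_id, call_tokens, call_cost_usd):
        """Check all three tiers; record spend only if the call is allowed."""
        if call_tokens > self.max_tokens_per_call:
            return False, "per-call token cap"
        if self.agent_spend[agent_id] + call_cost_usd > self.agent_daily_usd:
            return False, "agent daily budget"
        if self.fleet_spend + call_cost_usd > self.fleet_period_usd:
            return False, "fleet period budget"
        self.agent_spend[agent_id] += call_cost_usd
        self.fleet_spend += call_cost_usd
        return True, "ok"
```

Each tier catches a different failure mode: the per-call cap stops a single runaway prompt, the agent budget stops one doom-looping agent, and the fleet budget stops many agents failing quietly at once.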
Cost Monitoring Tools
AgentCost, LangSmith, Helicone, and custom solutions
Purpose-Built Tools
Helicone: open-source LLM observability platform. Logs every API call with cost tracking, latency metrics, and user attribution. Integrates with OpenAI, Anthropic, and most providers via a proxy.
LangSmith: LangChain’s observability platform. Deep integration with LangChain/LangGraph agent frameworks. Traces multi-step agent executions with per-step cost breakdown.
What to Track
At minimum, log for every LLM call: model used, input tokens, output tokens, thinking tokens (if reasoning model), cost, latency, task ID, and agent ID. This data enables cost attribution (which tasks are expensive?), anomaly detection (is this task unusually costly?), and optimization targeting (which tasks should we route to cheaper models?).
Key insight: You can’t optimize what you can’t measure. The first step in agent cost engineering is always visibility — knowing exactly where every dollar goes. Most teams are shocked by what they find when they first enable per-task cost tracking.
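The minimum fields listed above map naturally onto a structured log record, one JSON line per LLM call. Field names here are my own, not from any specific observability tool.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class LLMCallRecord:
    """One record per LLM call, covering the minimum fields in the text."""
    task_id: str
    agent_id: str
    model: str
    input_tokens: int
    output_tokens: int
    thinking_tokens: int  # 0 for non-reasoning models
    cost_usd: float
    latency_ms: float

def to_log_line(rec: LLMCallRecord) -> str:
    """Serialize as one JSON line, ready for any log pipeline."""
    return json.dumps(asdict(rec))
```

Because every record carries task_id and agent_id, cost attribution, anomaly detection, and model-routing decisions all become simple group-by queries over the log.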
Connection to Harness Engineering
Cost constraints as part of the agent harness
Cost as a Constraint
In harness engineering (the discipline of building systems to control AI agents), cost constraints are a core component of the harness. Just as you set architectural constraints (allowed file paths, forbidden operations) and review pipelines (human approval for risky actions), you set cost constraints: max iterations, budget caps, model routing rules, and escalation triggers.
The Harness Cost Components
// Cost constraints in an agent harness
max_iterations: 25
max_cost_per_task: $10.00
max_cost_per_day: $200.00
default_model: "gpt-4o-mini"
upgrade_model: "gpt-4.1"
upgrade_threshold: "complexity > 7"
doom_loop_detect: "3 identical actions"
escalate_to_human: "on budget exceed"
Key insight: Cost engineering and harness engineering are two sides of the same coin. A well-designed harness naturally controls costs by constraining agent behavior. See the Harness Engineering course for the full framework.
The Agent Cost Playbook
Putting it all together
The Checklist
Before deploying:
- Set max iterations, budget caps, and doom loop detection.
- Enable per-task cost logging.
- Choose a default model (budget tier) and an upgrade model (mid-tier).
- Define escalation rules.
After deploying:
- Monitor daily spend vs. projections.
- Review weekly cost attribution.
- Identify tasks that consistently exceed budget.
- Route high-volume simple tasks to cheaper models.
- Compress context between agent steps.
Key insight: Agent cost engineering is not a one-time setup. It’s an ongoing practice of monitoring, optimizing, and adjusting as workloads evolve. The teams that treat it as a continuous discipline save 40–60% compared to those who set and forget.
What’s Next
Chapter 8 zooms out to the business level — AI FinOps, ROI measurement, and the future of AI economics. The portfolio analogy: treat AI investments like a financial portfolio, not individual bets. Why 95% of AI pilots fail to show measurable returns, and how to be in the 5% that succeed.
Chapter Summary
Agents are the most expensive AI workload: $5–8 per task, 3–10x more LLM calls than chatbots. Four cost drivers: planning, tool selection, execution, verification. Doom loops can cost $10K–47K. 96% of enterprises exceed projections but only 44% have guardrails. The 4-layer governance framework (tracking, alerts, limits, attribution) is the minimum viable cost infrastructure. Cost constraints are a core component of the agent harness.