Ch 7 — Agent Cost Engineering

The employee analogy — agents bill by the minute and sometimes spin in circles
High-level flow: Employee → Drivers → Doom Loop → Monitor → Guardrails → Tools
The Employee Analogy
An agent is like hiring a worker who bills by the minute
Why Agents Are Expensive
Imagine hiring a contractor who bills by the minute, works autonomously, and sometimes gets stuck in circles doing the same thing over and over. That’s an AI agent. Unlike a chatbot (one question, one answer), an agent makes 3–10 LLM calls per task: planning what to do, selecting tools, executing actions, checking results, and sometimes retrying when things go wrong.
The Cost Multiplier
For software engineering work, a single unconstrained agent task typically costs $5–8 in API fees. That’s because each task triggers multiple LLM calls, each with growing context (previous steps are appended to the context window). By the 8th call in a task, the agent might be sending 30,000+ input tokens per call, paying the quadratic scaling tax from Chapter 3.
Key insight: Agents are the most expensive AI workload because they combine all the cost multipliers: multi-turn context growth, output-heavy generation (planning and code), and unpredictable execution paths. Cost engineering for agents is not optional — it’s survival.
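To make the multiplier concrete, here is a minimal sketch of how per-task cost compounds as context accumulates across calls. All prices and token counts are illustrative assumptions (frontier-model-tier pricing), not quotes from any provider.

```python
# Illustrative sketch: agent task cost compounds as context grows.
# Prices and token counts are assumptions for demonstration only.
INPUT_PRICE = 15.00 / 1_000_000   # $ per input token (assumed frontier model)
OUTPUT_PRICE = 75.00 / 1_000_000  # $ per output token (assumed)

def task_cost(num_calls=8, base_context=4_000, tool_output=3_000, llm_output=1_500):
    """Every call re-sends all accumulated context, so input cost compounds."""
    total, context = 0.0, base_context
    for _ in range(num_calls):
        total += context * INPUT_PRICE + llm_output * OUTPUT_PRICE
        # prior LLM output and tool results carry forward into the next call
        context += llm_output + tool_output
    return context, total

final_context, cost = task_cost()
print(f"final context: {final_context:,} tokens, task cost: ${cost:.2f}")
# → final context: 40,000 tokens, task cost: $3.27
```

Note the superlinearity: eight calls cost roughly 19x what one call does under these assumptions, because later calls re-send everything the earlier calls produced.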
The Four Cost Drivers
Planning, tool selection, execution, and verification
Cost Driver Breakdown
// Where agent tokens go (typical task)
1. Planning        15–25% of total tokens  (analyzing task, creating execution plan)
2. Tool Selection  10–15% of total tokens  (choosing which tools/APIs to call)
3. Execution       40–55% of total tokens  (running tools, processing results)
4. Verification    15–25% of total tokens  (checking results, error handling)
The Context Growth Problem
Each step in an agent task adds to the context window. The planning output becomes input for tool selection. Tool results become input for the next step. By step 8, the agent is carrying the entire conversation history — 20,000–50,000 tokens of accumulated context. This triggers both quadratic scaling costs and potential context surcharges.
Key insight: Execution is the biggest cost driver (40–55%), but it’s also the hardest to optimize because it depends on the task. The most impactful optimization is constraining the number of iterations (max steps) and compressing context between steps.
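One way to implement the compression mentioned above is to keep only the most recent steps verbatim and collapse older ones into short stubs. This is a hypothetical sketch: in production the stub would usually be an LLM-written summary rather than a simple truncation.

```python
def compress_history(steps, keep_recent=2, summary_width=80):
    """Collapse all but the last `keep_recent` steps into one-line stubs.

    `steps` is a list of step transcripts (plain strings). Truncation stands
    in for a real summarizer here, purely for illustration.
    """
    if len(steps) <= keep_recent:
        return steps
    older, recent = steps[:-keep_recent], steps[-keep_recent:]
    stubs = [s.replace("\n", " ")[:summary_width] for s in older]
    header = f"[compressed {len(older)} earlier steps]"
    return [header] + stubs + recent
```

The trade-off: compression cuts the quadratic input-token growth, at the risk of discarding a detail the agent later needs, which is why recent steps are kept intact.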
Doom Loops: The Cost Disaster
When agents get stuck and burn through your budget
What Is a Doom Loop?
A doom loop occurs when an agent gets stuck in a retry cycle — attempting the same action, failing, and trying again without making progress. The agent keeps consuming tokens with each attempt, and the context window grows with each failed attempt’s error messages. Without guardrails, a doom loop can run for hours.
Real-World Horror Stories
A documented incident: two LangChain agents entered an infinite conversation cycle that generated a $47,000 bill over 11 days. Another case: an agent generated 2.3 million unintended API calls over a weekend. A third: parallel agents spawning sub-calls burned through $10,000 in 4.5 days.
Prevention
Max iterations: hard cap on the number of LLM calls per task (typically 10–25).
Budget caps: maximum dollar amount per task ($10–50) and per agent per day ($100–500).
Timeout: maximum wall-clock time per task (5–30 minutes).
Error detection: if the agent repeats the same action 3 times, abort and escalate to a human.
Key insight: Every production agent system needs hard limits. The question is not “will a doom loop happen?” but “when it happens, how much will it cost before the guardrails kick in?” Set limits before deploying, not after the first incident.
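The prevention measures above can be sketched as a single guardrail object that the agent loop consults on every call. Class and exception names are my own; the limits match the ranges given in the text.

```python
from collections import deque

class DoomLoopAbort(Exception):
    """Raised when a hard limit is breached; the caller should escalate."""

class Guardrails:
    """Per-task limits: max iterations, budget cap, repeated-action detection."""

    def __init__(self, max_iterations=25, max_cost_usd=10.0, repeat_limit=3):
        self.max_iterations = max_iterations
        self.max_cost_usd = max_cost_usd
        self.repeat_limit = repeat_limit
        self.iterations = 0
        self.spent = 0.0
        self.recent_actions = deque(maxlen=repeat_limit)

    def check(self, action, call_cost_usd):
        """Record one LLM call and abort if any hard limit is breached."""
        self.iterations += 1
        self.spent += call_cost_usd
        self.recent_actions.append(action)
        if self.iterations > self.max_iterations:
            raise DoomLoopAbort(f"exceeded {self.max_iterations} iterations")
        if self.spent > self.max_cost_usd:
            raise DoomLoopAbort(f"budget ${self.max_cost_usd:.2f} exceeded")
        if (len(self.recent_actions) == self.repeat_limit
                and len(set(self.recent_actions)) == 1):
            raise DoomLoopAbort(f"same action repeated {self.repeat_limit} times")
```

A wall-clock timeout (the third measure) would wrap the whole task externally, e.g. with a process-level deadline, since it cannot be enforced from inside the loop alone.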
The Monitoring Gap
96% of enterprises exceed AI cost projections, but only 44% have guardrails
The Problem
96% of enterprises report AI costs exceeding initial projections. Yet only 44% have financial guardrails in place. Fewer than 1 in 5 agent deployments have token-level cost monitoring at launch. Teams without monitoring lose $8,000–23,000/month in undetected waste — agents running unnecessary tasks, doom loops, and inefficient model selection.
Why Traditional Monitoring Fails
Traditional cloud monitoring (CPU, memory, network) doesn’t capture AI costs because agents are autonomous and concurrent. By the time a dashboard updates showing high spend, the agent has already consumed the tokens. Effective agent cost monitoring requires pre-execution policy enforcement — checking the budget before each LLM call, not after.
Key insight: Cost monitoring for agents must be real-time and pre-emptive. Post-hoc dashboards that show yesterday’s spend are useful for analysis but useless for prevention. You need per-call budget checks that can abort a task before it exceeds limits.
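A pre-emptive budget check estimates the worst case of the next call before sending it, rather than tallying damage afterwards. The function name and prices here are illustrative assumptions.

```python
def precheck_call(spent_usd, budget_usd, est_input_tokens, max_output_tokens,
                  input_price=15.0 / 1e6, output_price=75.0 / 1e6):
    """Return True only if the worst-case cost of the NEXT call fits the budget.

    Prices are assumed; `max_output_tokens` should match the request's
    output cap so the estimate is a true upper bound.
    """
    worst_case = est_input_tokens * input_price + max_output_tokens * output_price
    return spent_usd + worst_case <= budget_usd
```

An agent loop calls this before every LLM request and aborts or escalates on False, so a task can never overshoot its budget mid-flight, which is exactly what a post-hoc dashboard cannot guarantee.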
The 4-Layer Cost Governance Framework
Token tracking, anomaly alerts, budget limits, and attribution
The Four Layers
// 4-layer agent cost governance
Layer 1: Token Tracking
  Log every LLM call: model, tokens, cost
  Per-task and per-agent attribution
Layer 2: Anomaly Alerts
  Alert when a task exceeds 2x expected cost
  Alert when an agent exceeds its daily budget
Layer 3: Budget Hard Limits
  Per-task: abort at $10–50
  Per-agent/day: throttle at $100–500
  Fleet/month: escalate at threshold
Layer 4: Weekly Attribution
  Cost per task type, per agent, per team
  Identify optimization opportunities
Three-Tier Enforcement
Effective cost governance requires three enforcement tiers: Per-action limits (max tokens per single LLM call), per-agent budgets (max spend per agent per day), and fleet-level throttling (max total spend across all agents per billing period). Each tier catches different failure modes.
Key insight: The 4-layer framework is not overhead — it’s the minimum viable cost infrastructure for production agents. Teams that implement it from day one avoid the $47,000 surprise bills that teams without it inevitably face.
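The three enforcement tiers can be sketched as one gate consulted before every call. Class name, limits, and the tuple return shape are illustrative choices, not from any specific tool.

```python
from collections import defaultdict

class TieredBudget:
    """Three tiers: per-call token cap, per-agent daily budget, fleet budget.

    All default limits are illustrative; real values come from your policy.
    """

    def __init__(self, max_tokens_per_call=8_000,
                 agent_daily_usd=200.0, fleet_period_usd=5_000.0):
        self.max_tokens_per_call = max_tokens_per_call
        self.agent_daily_usd = agent_daily_usd
        self.fleet_period_usd = fleet_period_usd
        self.agent_spend = defaultdict(float)  # agent_id -> spend today
        self.fleet_spend = 0.0

    def allow(self, agent_id, call_tokens, call_cost_usd):
        """Check all three tiers; record spend only if the call is allowed."""
        if call_tokens > self.max_tokens_per_call:
            return False, "per-call token cap"
        if self.agent_spend[agent_id] + call_cost_usd > self.agent_daily_usd:
            return False, "agent daily budget"
        if self.fleet_spend + call_cost_usd > self.fleet_period_usd:
            return False, "fleet period budget"
        self.agent_spend[agent_id] += call_cost_usd
        self.fleet_spend += call_cost_usd
        return True, "ok"
```

Each tier catches a different failure mode: the per-call cap stops a single runaway prompt, the agent budget stops one doom-looping agent, and the fleet budget stops many agents failing quietly at once.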
Cost Monitoring Tools
AgentCost, LangSmith, Helicone, and custom solutions
Purpose-Built Tools
Helicone: open-source LLM observability platform. Logs every API call with cost tracking, latency metrics, and user attribution. Integrates with OpenAI, Anthropic, and most providers via a proxy.
LangSmith: LangChain’s observability platform. Deep integration with LangChain/LangGraph agent frameworks. Traces multi-step agent executions with per-step cost breakdown.
What to Track
At minimum, log for every LLM call: model used, input tokens, output tokens, thinking tokens (if reasoning model), cost, latency, task ID, and agent ID. This data enables cost attribution (which tasks are expensive?), anomaly detection (is this task unusually costly?), and optimization targeting (which tasks should we route to cheaper models?).
Key insight: You can’t optimize what you can’t measure. The first step in agent cost engineering is always visibility — knowing exactly where every dollar goes. Most teams are shocked by what they find when they first enable per-task cost tracking.
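The minimum fields listed above map naturally onto a structured log record, one JSON line per LLM call. Field names here are my own, not from any specific observability tool.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class LLMCallRecord:
    """One record per LLM call, covering the minimum fields in the text."""
    task_id: str
    agent_id: str
    model: str
    input_tokens: int
    output_tokens: int
    thinking_tokens: int  # 0 for non-reasoning models
    cost_usd: float
    latency_ms: float

def to_log_line(rec: LLMCallRecord) -> str:
    """Serialize as one JSON line, ready for any log pipeline."""
    return json.dumps(asdict(rec))
```

Because every record carries task_id and agent_id, cost attribution, anomaly detection, and model-routing decisions all become simple group-by queries over the log.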
Connection to Harness Engineering
Cost constraints as part of the agent harness
Cost as a Constraint
In harness engineering (the discipline of building systems to control AI agents), cost constraints are a core component of the harness. Just as you set architectural constraints (allowed file paths, forbidden operations) and review pipelines (human approval for risky actions), you set cost constraints: max iterations, budget caps, model routing rules, and escalation triggers.
The Harness Cost Components
// Cost constraints in an agent harness
max_iterations: 25
max_cost_per_task: $10.00
max_cost_per_day: $200.00
default_model: "gpt-4o-mini"
upgrade_model: "gpt-4.1"
upgrade_threshold: "complexity > 7"
doom_loop_detect: "3 identical actions"
escalate_to_human: "on budget exceed"
Key insight: Cost engineering and harness engineering are two sides of the same coin. A well-designed harness naturally controls costs by constraining agent behavior. See the Harness Engineering course for the full framework.
The Agent Cost Playbook
Putting it all together
The Checklist
Before deploying:
- Set max iterations, budget caps, and doom loop detection.
- Enable per-task cost logging.
- Choose a default model (budget tier) and an upgrade model (mid-tier).
- Define escalation rules.
After deploying:
- Monitor daily spend vs. projections.
- Review weekly cost attribution.
- Identify tasks that consistently exceed budget.
- Route high-volume simple tasks to cheaper models.
- Compress context between agent steps.
Key insight: Agent cost engineering is not a one-time setup. It’s an ongoing practice of monitoring, optimizing, and adjusting as workloads evolve. The teams that treat it as a continuous discipline save 40–60% compared to those who set and forget.
What’s Next
Chapter 8 zooms out to the business level — AI FinOps, ROI measurement, and the future of AI economics. The portfolio analogy: treat AI investments like a financial portfolio, not individual bets. Why 95% of AI pilots fail to show measurable returns, and how to be in the 5% that succeed.
Chapter Summary
Agents are the most expensive AI workload: $5–8 per task, 3–10x more LLM calls than chatbots. Four cost drivers: planning, tool selection, execution, verification. Doom loops can cost $10K–47K. 96% of enterprises exceed projections but only 44% have guardrails. The 4-layer governance framework (tracking, alerts, limits, attribution) is the minimum viable cost infrastructure. Cost constraints are a core component of the agent harness.