Ch 8 — Token Budgeting & Production Patterns

Layering all context engineering patterns into a production system
High Level
Budget → Cache → Disclose → Route → Compress → Measure
Token Budget Allocation
Spending each token where it matters most
The Principle
Token budgeting is the practice of allocating your context window like a financial budget. Every component gets a token allowance based on its importance to the current task. The skill is spending each token where it matters most, not filling the window. A well-budgeted 30K-token context outperforms a carelessly filled 128K-token context.
Sample Budget
// Token budget for 128K window
System prompt:     2,000  (1.5%)
Tool schemas:      8,000  (6%)
Few-shot examples: 3,000  (2%)
Retrieved docs:   15,000  (12%)
Conversation:     10,000  (8%)
Memory/metadata:   2,000  (1.5%)
Output reserve:    8,000  (6%)
Safety margin:    30,000  (23%)
// Total allocated: ~78K of 128K (61%)
// Targeting ~60% utilization for quality
Why Budget?
Without explicit budgets, context grows organically until quality degrades. Teams discover the problem only when users report worse answers. Token budgets make context management proactive instead of reactive — you decide in advance how much space each component gets, and enforce those limits programmatically.
Key insight: The safety margin is not wasted space. Keeping 30–40% of the window unused ensures the model has room for its own reasoning and output, and provides buffer for unexpectedly large tool responses or retrieved documents.
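Enforcing limits programmatically can be as simple as a guard that flags components over their allowance. A minimal sketch, with component names and limits borrowed from the sample budget (all illustrative):

```python
from dataclasses import dataclass

@dataclass
class TokenBudget:
    limits: dict  # component name -> max tokens

    def check(self, usage: dict) -> list:
        """Return the components that exceed their allowance."""
        return [name for name, used in usage.items()
                if used > self.limits.get(name, 0)]

budget = TokenBudget(limits={
    "system_prompt": 2_000, "tool_schemas": 8_000, "few_shot": 3_000,
    "retrieved_docs": 15_000, "conversation": 10_000, "memory": 2_000,
})

usage = {"system_prompt": 1_800, "retrieved_docs": 19_500, "conversation": 9_200}
over = budget.check(usage)
print(over)  # retrieved_docs blew its 15K allowance
```

In practice the check runs before generation, so an over-budget component triggers compression or tighter retrieval rather than a degraded response.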
KV-Cache Optimization
The single most important metric for production agents
Cache Hit Rate
Industry leaders describe KV-cache hit rate as “the single most important metric” for production AI agents. A high hit rate can cut inference latency (time to first token, TTFT) and cost by up to 10×. The optimization goal: maximize the stable prefix that remains identical across requests.
Optimization Techniques
Multi-Query Attention (MQA): Shares key-value heads across query heads, reducing cache memory.

Grouped-Query Attention (GQA): Groups query heads to share KV pairs, balancing quality and efficiency.

PagedAttention: Manages KV-cache memory like virtual memory pages, eliminating fragmentation.

FlashAttention: Optimizes the attention computation itself for GPU memory hierarchy.
Practical Optimization
The most impactful optimization for application developers: deterministic serialization of the stable prefix. Ensure your system prompt, tool definitions, and few-shot examples are serialized identically on every request. Even whitespace differences break cache hits. Use frozen templates, not dynamically generated prompts, for the stable portion.
Why it matters: A 90% cache hit rate means you’re paying for only 10% of the stable prefix on each request. For a system prompt + tools consuming 10K tokens, that’s 9K tokens free on every call. At scale, this is the difference between viable and unviable economics.
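A minimal sketch of deterministic serialization, assuming tool schemas are assembled from dicts: sorting keys and fixing separators keeps the serialized prefix byte-identical no matter what order the schemas were built in.

```python
import json

def serialize_prefix(system_prompt: str, tools: dict) -> str:
    # sort_keys + fixed separators make the output deterministic
    # regardless of dict construction order; any byte difference
    # here would break the cache hit.
    tool_block = json.dumps(tools, sort_keys=True, separators=(",", ":"))
    return f"{system_prompt}\n{tool_block}"

tools_a = {"search": {"desc": "web search"}, "calc": {"desc": "math"}}
tools_b = {"calc": {"desc": "math"}, "search": {"desc": "web search"}}

# Same logical content, different insertion order: identical bytes.
assert serialize_prefix("You are a helpful agent.", tools_a) == \
       serialize_prefix("You are a helpful agent.", tools_b)
```

The same discipline applies to timestamps, request IDs, and locale-dependent formatting: none of it belongs in the stable prefix.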
Prompt Caching
90% cost savings on repeated content
How It Works
Prompt caching is a provider-level feature (available in Anthropic, Google, and others) that caches the KV states for shared prefixes across requests. When your system prompt is the same across requests, it’s computed once and reused. System prompts and few-shot examples become nearly free on repeat calls.
Savings Example
// Prompt caching economics
System prompt:  2,000 tokens
Tool schemas:   8,000 tokens
Few-shot:       3,000 tokens
Total prefix:  13,000 tokens

// Without caching (10K requests/day):
13K × 10K × $2.50/1M = $325/day

// With 90% cache hit rate:
1.3K × 10K × $2.50/1M = $32.50/day

Annual savings: $106,762
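The arithmetic can be reproduced in a few lines; the $2.50/1M price is the example's illustrative rate, and cache hits are treated as free, matching the simplification above.

```python
PREFIX_TOKENS = 13_000       # system prompt + tools + few-shot
REQUESTS_PER_DAY = 10_000
PRICE_PER_M = 2.50           # illustrative $ per 1M input tokens

def daily_prefix_cost(cache_hit_rate: float) -> float:
    # Only the cache misses are billed in this simplified model.
    billed_tokens = PREFIX_TOKENS * (1 - cache_hit_rate)
    return billed_tokens * REQUESTS_PER_DAY * PRICE_PER_M / 1_000_000

no_cache = daily_prefix_cost(0.0)   # $325.00/day
cached = daily_prefix_cost(0.9)     # $32.50/day
annual_savings = (no_cache - cached) * 365
print(no_cache, cached, annual_savings)
```

Real providers typically bill cached reads at a reduced (not zero) rate, so actual savings land somewhat below this figure.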
Maximizing Cache Hits
Freeze the prefix: System prompt, tool schemas, and few-shot examples should be identical on every request. No dynamic content in the cached portion.

Order matters: Put the most stable content first. Any change to an early token invalidates the cache for everything after it.

Batch similar requests: Requests with the same prefix share cache entries. Group similar tasks to maximize reuse.
Key insight: Prompt caching makes the architectural decision of what goes in the stable prefix vs. dynamic content a direct cost optimization lever. Every token you can move into the stable prefix is a token that becomes nearly free.
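Concretely, marking the cache breakpoint looks roughly like the request body below. The field shape follows Anthropic's documented prompt-caching API (`cache_control` on a system content block); treat the exact names and the model ID as assumptions to verify against your provider's docs.

```python
def build_request(system_prompt: str, tools: list, user_msg: str) -> dict:
    return {
        "model": "<your-model-id>",  # placeholder
        "max_tokens": 1024,
        "tools": tools,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Cache breakpoint: everything up to and including this
                # block is reused across requests with an identical prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Dynamic content goes after the cached prefix.
        "messages": [{"role": "user", "content": user_msg}],
    }

req = build_request("You are a support agent.", [], "Where is my order?")
```

Note the ordering: model, tools, and system prompt come before the breakpoint; the per-request user message comes after it.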
The Layered Architecture
How all patterns work together
The Full Stack
In a production context engineering system, the patterns layer together. Each addresses a different failure mode and operates at a different stage of the request lifecycle:

Layer 1 — Progressive disclosure (Ch 3) and tool management (Ch 7) define what can enter the context window.

Layer 2 — Routing (Ch 5) and retrieval (Ch 6) manage what enters during execution.

Layer 3 — Compression (Ch 4) manages what stays as context accumulates.

Layer 4 — Token budgeting (this chapter) ties it all together economically.
The Flow
// Production context engineering stack
Request arrives
  → Progressive disclosure: load skills
  → Tool management: load relevant tools
  → Routing: select knowledge domain
  → Retrieval: fetch relevant documents
  → Budget check: within token limits?
  → Generate response

Context accumulates
  → Compression: summarize older turns
  → Memory: archive to long-term store
  → Budget check: still within limits?
  → Continue or compress
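The request path can be sketched as a chain of stages with a budget gate at the end. Every stage function here is a hypothetical stub standing in for the pattern from its chapter:

```python
# Stubs for each layer; real implementations come from Chs 3-7.
def load_skills(ctx):   ctx["skills"] = ["refunds"]; return ctx
def load_tools(ctx):    ctx["tools"] = ["lookup_order"]; return ctx
def route_domain(ctx):  ctx["domain"] = "billing"; return ctx
def retrieve_docs(ctx): ctx["docs"] = ["refund-policy.md"]; return ctx

def within_budget(ctx, limit=78_000):
    # Stand-in size estimate; use your model's tokenizer in practice.
    return sum(len(str(v)) for v in ctx.values()) <= limit

def handle_request(query):
    ctx = {"query": query}
    for stage in (load_skills, load_tools, route_domain, retrieve_docs):
        ctx = stage(ctx)
    if not within_budget(ctx):
        raise RuntimeError("over budget: compress before generating")
    return ctx  # would be handed to the model for generation

ctx = handle_request("Can I get a refund?")
```

Because each layer is a plain function over a shared context dict, stages can be added, reordered, or instrumented independently.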
Real-World Case Studies
Documented cost reductions from context engineering
Fintech Document Analysis
A fintech startup reduced document analysis costs from $30,600 to $4,100 monthly (87% reduction) through three techniques: extracting only relevant document sections via RAG instead of including full documents, compressing conversation history with sliding window summarization, and caching system prompts using provider-level prompt caching.
Enterprise Support Agent
An enterprise deploying a multi-domain support agent reduced context costs by 73% by adding routing (queries go to the right domain instead of loading all domains) and progressive disclosure (agent skills load on demand instead of all at startup). Quality improved simultaneously because the model’s attention was focused on relevant context.
Aggregate Industry Data
Companies implementing effective context management report:

35–60% accuracy improvements in enterprise AI systems
60–80% cost reduction on long-running agent tasks
50–90% cost reduction through strategic caching and compression
87% reduction in document analysis costs (best documented case)
Key insight: Context engineering consistently improves both quality and cost simultaneously. This is rare in engineering — most optimizations trade one for the other. Better context means better answers AND lower bills.
Monitoring & Observability
Measuring whether your context engineering is working
Key Metrics
KV-cache hit rate: The percentage of prefix tokens served from cache. Target: >85%.

Context utilization: What percentage of the window is used on average. Target: 40–60%.

Token cost per request: Average input tokens per API call. Track trends over time.

Compression ratio: Tokens before vs. after compression. Measures compression effectiveness.

Routing accuracy: Percentage of queries routed to the correct domain. Measure via user feedback or manual review.
Evaluation
Probe-based evaluation is the current best practice for measuring context quality. After compression or routing, ask the model specific questions about the context to verify that critical information was preserved. If the model can’t answer questions about information that should be in context, your compression or routing is too aggressive.
Rule of thumb: If you can’t measure it, you can’t optimize it. Instrument your context pipeline with token counters at each stage (pre-routing, post-routing, pre-compression, post-compression) to identify where tokens are being spent and where savings are possible.
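A minimal instrumentation sketch along those lines, using a whitespace split as a stand-in tokenizer (swap in your model's real tokenizer):

```python
def count_tokens(text: str) -> int:
    # Crude stand-in: real systems should use the model's tokenizer.
    return len(text.split())

class ContextMeter:
    def __init__(self):
        self.stages = {}

    def record(self, stage: str, text: str):
        self.stages[stage] = count_tokens(text)

    def compression_ratio(self) -> float:
        return self.stages["pre-compression"] / self.stages["post-compression"]

meter = ContextMeter()
meter.record("pre-routing", "full multi-domain context " * 200)
meter.record("post-routing", "billing domain context " * 80)
meter.record("pre-compression", "turn " * 500)
meter.record("post-compression", "summary of older turns " + "turn " * 100)
print(meter.stages, round(meter.compression_ratio(), 2))
```

Logging these four numbers per request is enough to spot where tokens are being spent and whether compression is pulling its weight.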
Getting Started Checklist
Practical steps ordered by impact
Week 1: Quick Wins
1. Audit your tool token cost. Count how many tokens your tool schemas consume before any user interaction. This number is usually higher than expected.

2. Enable prompt caching. If your provider supports it, ensure your stable prefix (system prompt, tools, few-shot) is cached. This alone can cut costs by 50–90% on the prefix portion.

3. Set a token budget. Define explicit limits for each context component. Enforce them programmatically.
Month 1: Core Patterns
4. Add compression. If your agents run long tasks, implement sliding window + summarization. Keep the latest 5 turns raw, summarize older ones.

5. Add routing. If your agents serve multiple domains, add keyword-based routing. Even simple rules cut context bloat significantly.

6. Implement progressive disclosure. Move from loading all instructions upfront to tiered loading with Agent Skills.
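Step 4's sliding window can be sketched as follows; `summarize()` is a hypothetical stub standing in for an LLM summarization call:

```python
def summarize(turns):
    # Stub: in practice, an LLM call that condenses the older turns.
    return {"role": "system", "content": f"[summary of {len(turns)} earlier turns]"}

def compress_history(turns, keep_raw=5):
    """Keep the newest turns verbatim; fold older ones into one summary."""
    if len(turns) <= keep_raw:
        return turns
    older, recent = turns[:-keep_raw], turns[-keep_raw:]
    return [summarize(older)] + recent

history = [{"role": "user", "content": f"turn {i}"} for i in range(12)]
compressed = compress_history(history)
print(len(compressed))  # 6: one summary turn + the 5 newest turns
```

Running this on every turn keeps conversation cost roughly constant instead of growing linearly with session length.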
Quarter 1: Advanced
7. Upgrade to agentic RAG. Replace fixed retrieval pipelines with agent-controlled loops.

8. Add monitoring. Instrument your pipeline with token counters and quality probes at each stage.
The Future of Context Engineering
Where the discipline is heading
Near-Term (2026–2027)
Larger windows, same problems: Context windows will continue growing (Gemini already offers 2M+), but the attention degradation and cost scaling problems remain. Bigger windows make context engineering more important, not less.

Standardized tooling: Expect frameworks and libraries that implement the patterns from this course as composable middleware — routing, compression, progressive disclosure as plug-and-play components.
The Broader Picture
Context engineering is one pillar of the broader harness engineering movement — the design of complete systems that make AI agents reliable. Context engineering controls what the model sees; harness engineering controls the entire environment (constraints, feedback loops, documentation, linting, review pipelines) that the agent operates in. Together, they represent the shift from “using AI tools” to “engineering AI systems.”
Key insight: Context engineering has gone from a niche concern to the core discipline of AI engineering in under a year. The patterns in this course — progressive disclosure, compression, routing, retrieval, tool management, and token budgeting — are now table stakes for any production AI system. The teams that master them will build better products for less money.