Ch 6 — The Optimization Playbook

The water conservation analogy — use less, reuse, switch sources
Chapter flow: Analogy → Routing → Caching → Compression → Distillation → Combined savings
The Water Conservation Analogy
Three ways to reduce your water bill — and your AI bill
Three Levers
You can reduce your water bill three ways: (1) Use less — shorter showers, fix leaks (prompt compression, output length control). (2) Reuse — collect rainwater, recycle greywater (prompt caching, semantic caching). (3) Switch to a cheaper source — use well water instead of city water (model routing, distillation). AI cost optimization works the same way.
The Three Levers Framework
AI cost optimization = water conservation:
Use Less (compression, output control): 15–30% savings
Reuse (prompt caching, semantic caching): 30–90% savings
Cheaper Source (model routing, distillation): 40–60% savings
Combined (all three together): 60–70% typical savings
Key insight: In most LLM applications, 60–80% of spend is reducible without touching product quality. The techniques in this chapter are not theoretical — they’re production-proven and can be implemented in hours, not weeks.
Model Routing
The highest-leverage optimization: 40–60% savings
The Concept
Model routing sends each request to the cheapest model that can handle it well. A classifier (itself a cheap model) evaluates incoming requests and routes simple tasks to budget models and complex tasks to premium models. Research shows 62% of agent tasks can route to budget models with zero quality loss.
Routing rules example:
Classification/extraction → GPT-4o mini ($0.15/M)
Simple Q&A, formatting → GPT-5 Nano ($0.05/M)
Code generation, analysis → GPT-4.1 ($2.00/M)
Complex reasoning → Claude Opus ($5.00/M)
Hard math/science → o3 ($2.00/M)
Implementation
The simplest router is a keyword/heuristic classifier that checks request length, presence of code blocks, or explicit complexity markers. More sophisticated routers use a small LLM (GPT-4o mini at $0.15/M) to classify the request before routing. The router cost is negligible compared to the savings from routing 60%+ of traffic to budget models.
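A keyword/heuristic router of the kind described above can be sketched in a few lines. The model names come from the routing table earlier in this section; the specific thresholds and keyword lists here are illustrative assumptions, not tuned values.

```python
import re

# Illustrative tiers from the routing table above; thresholds are assumptions.
BUDGET, MID, PREMIUM = "gpt-4o-mini", "gpt-4.1", "claude-opus"

def route(request: str) -> str:
    """Pick the cheapest tier that plausibly handles the request."""
    has_code = "```" in request or bool(re.search(r"\bdef |\bclass |\bfunction\b", request))
    long_request = len(request.split()) > 300
    complexity_markers = any(w in request.lower()
                             for w in ("prove", "step by step", "analyze", "architecture"))
    if has_code or long_request:
        return MID
    if complexity_markers:
        return PREMIUM
    return BUDGET  # simple Q&A, classification, formatting

print(route("Classify this ticket as billing or support."))  # budget tier
```

A production router would add a fallback path (escalate to the premium model when the budget model's answer fails validation), but even this crude version captures the bulk of the routing savings.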
Key insight: Model routing is the single highest-leverage optimization because it addresses the 3,000x price range across models. Even a crude router that correctly classifies 70% of requests saves 40–50% of total spend.
Prompt Caching
45–90% savings on repeated prefixes — 10 minutes to implement
How It Works
When your requests share the same prefix (system prompt, tool definitions, few-shot examples), the provider can cache the processed KV-cache and reuse it across requests. Instead of reprocessing 2,000 tokens of system prompt for every request, the cached version is loaded instantly. Anthropic offers ~90% discount on cached reads. OpenAI offers ~50% discount.
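Marking the stable prefix as cacheable is often a one-field change. The sketch below builds an Anthropic-style request with a `cache_control` block on the system prompt — check the current API docs before relying on it, and note the model name is illustrative.

```python
# Sketch of Anthropic-style prompt caching; the `cache_control` block format
# follows their docs at the time of writing — verify against current docs.
SYSTEM_PROMPT = "You are a support assistant..."  # the stable ~2,000-token prefix

request = {
    "model": "claude-sonnet-4",   # illustrative model name
    "max_tokens": 512,
    "system": [
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # mark the prefix as cacheable
        }
    ],
    "messages": [{"role": "user", "content": "Where is my order?"}],
}
# client.messages.create(**request)  # cached reads of the prefix are billed ~10%
print(request["system"][0]["cache_control"]["type"])
```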
Latency Bonus
Prompt caching also improves time-to-first-token by 13–31% because the model skips the prefill phase for cached tokens. You get both cost savings and faster responses — a rare win-win in optimization.
Semantic Caching
Semantic caching goes further: it caches entire responses for similar queries. When a new query is semantically similar to a previously answered one (measured by embedding cosine similarity), the cached response is returned without making an LLM call at all. This captures the 35–45% of queries that are repeated or near-identical in most production systems.
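The mechanism can be shown with a toy implementation. Here `embed` is a stand-in for a real embedding model (a word-count vector over a tiny vocabulary), and the 0.9 threshold is an assumption — real systems tune it against false-hit rates.

```python
import math
from typing import Optional

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class SemanticCache:
    """Toy semantic cache: return a stored answer when a new query's
    embedding is close enough to a cached one."""
    def __init__(self, embed, threshold=0.9):
        self.embed, self.threshold = embed, threshold
        self.entries = []  # list of (embedding, response)

    def get(self, query: str) -> Optional[str]:
        q = self.embed(query)
        for emb, response in self.entries:
            if cosine(q, emb) >= self.threshold:
                return response  # cache hit: no LLM call at all
        return None

    def put(self, query: str, response: str):
        self.entries.append((self.embed(query), response))

# Toy embedding: word counts over a tiny vocabulary (stand-in for a real model).
VOCAB = ["refund", "order", "status", "cancel"]
embed = lambda s: [s.lower().count(w) for w in VOCAB]

cache = SemanticCache(embed)
cache.put("What is my order status?", "Your order ships Tuesday.")
print(cache.get("order status?"))  # near-identical query -> cached response
```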
Key insight: Prompt caching is the highest-ROI optimization in terms of effort-to-savings ratio. It takes 10 minutes to enable (often just a configuration flag) and saves 45–90% on the cached portion of every request. If you do nothing else from this chapter, enable prompt caching.
Prompt Compression & Output Control
Using less water — 15–30% savings
Prompt Compression
Remove redundancy from prompts. Most system prompts contain verbose instructions that can be condensed 30–50% without quality loss. Remove filler phrases, consolidate overlapping instructions, and use structured formats instead of prose. A 2,000-token system prompt compressed to 1,200 tokens saves 40% on that prefix across every request.
Output Length Control
Since output tokens cost 3–8x more than input, controlling output length has outsized impact. Use max_tokens to cap responses. Add explicit length instructions: “Respond in 2–3 sentences” instead of “Explain thoroughly.” For structured output, use JSON mode to prevent verbose prose.
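Combining all three controls looks like this in practice. Parameter names follow the OpenAI-style chat API (`max_tokens`, `response_format`); adapt them to your provider, and treat the specific cap of 150 as an illustrative choice.

```python
# Sketch: cap output both hard (max_tokens) and soft (explicit instruction),
# and request structured output so the model skips prose padding.
request = {
    "model": "gpt-4o-mini",
    "max_tokens": 150,  # hard cap on the expensive output tokens
    "response_format": {"type": "json_object"},  # structured output, no prose
    "messages": [
        {"role": "system", "content": "Respond in 2-3 sentences. Output JSON only."},
        {"role": "user", "content": "Summarize the outage report."},
    ],
}
```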
Context Window Management
For multi-turn conversations, implement sliding window + summarization. Keep the last 3–5 turns in full, summarize older turns, and drop irrelevant history. This prevents the context from growing unboundedly and triggering the quadratic scaling costs from Chapter 3. Companies report 60–80% token cost reduction from context compression alone.
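A minimal sketch of sliding window + summarization, where `summarize` is a stand-in for a cheap LLM summarization call and `keep_last=4` is an illustrative window size:

```python
def compact_history(turns, keep_last=4, summarize=None):
    """Keep the last few turns verbatim; collapse everything older
    into a single summary turn so context stops growing unboundedly."""
    if len(turns) <= keep_last:
        return turns
    old, recent = turns[:-keep_last], turns[-keep_last:]
    summary = summarize(old) if summarize else f"[summary of {len(old)} earlier turns]"
    return [{"role": "system", "content": summary}] + recent

history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
compacted = compact_history(history)
print(len(compacted))  # 5: one summary turn + the last 4 turns
```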
Key insight: Compression is the “fix the leaks” strategy. It won’t transform your bill on its own, but combined with routing and caching, it contributes a reliable 15–30% additional savings with minimal risk.
Batch Processing
50% off for workloads that can wait
How Batch APIs Work
Batch APIs let you submit a collection of requests that are processed asynchronously within a 24-hour window. In exchange for accepting latency, you get 50% off the standard price. OpenAI, Anthropic, and Google all offer batch APIs. Implementation takes about 30 minutes — you submit a JSONL file of requests and poll for results.
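Building the JSONL input is most of the work. The field names below follow the OpenAI batch format as documented at the time of writing (`custom_id`, `method`, `url`, `body`) — verify against current docs before use.

```python
import json

# Sketch of an OpenAI-style batch input file: one JSON request per line.
def batch_line(custom_id: str, prompt: str) -> str:
    return json.dumps({
        "custom_id": custom_id,             # your key for matching results back
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": prompt}],
        },
    })

docs = ["Invoice #1 text...", "Invoice #2 text..."]
jsonl = "\n".join(batch_line(f"doc-{i}", f"Classify: {d}") for i, d in enumerate(docs))
# Upload this file, then create the batch with a 24h completion window
# and poll for results (see your provider's batch API docs).
print(jsonl.count("\n") + 1)  # 2 requests
```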
Ideal Workloads
Data processing (classify 10,000 documents overnight). Content generation (generate product descriptions in bulk). Evaluation (run test suites against model outputs). Embedding generation (batch-embed a document corpus). Any workload where results aren’t needed in real-time.
Batch + Cache Stacking
Batch discounts stack with prompt caching. If your batch requests share the same system prompt, you get both the 50% batch discount and the caching discount. On Anthropic, this means paying roughly 5% of the standard price for cached tokens in batch mode (50% batch × 10% cache rate).
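The stacking arithmetic, using the rates quoted above:

```python
# Batch and cache discounts multiply on the cached portion of a request.
batch_rate = 0.5   # batch API: pay 50% of standard price
cache_rate = 0.1   # Anthropic cached reads: pay ~10% of standard price
print(batch_rate * cache_rate)  # 0.05 -> ~5% of list price for cached tokens in batch
```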
Key insight: Look at your workload and ask: “Does this need a response in under 1 second, or can it wait a few hours?” Every workload that can wait should go through the batch API. It’s free money.
Distillation
5–30x cost reduction, 95–97% quality retention
What Distillation Is
Distillation trains a smaller, cheaper model to mimic the behavior of a larger, expensive model on your specific task. You generate training data by running your task through the expensive model, then fine-tune a small model on those outputs. The result: a model that performs 95–97% as well on your specific task at 5–30x lower cost.
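Step one of this pipeline — labeling task data with the teacher and saving the outputs as a fine-tuning set — can be sketched as below. `teacher` is a stand-in for a premium-model call, and the message-pair JSONL shape is the common chat fine-tuning format; check your provider's fine-tuning docs for the exact schema.

```python
import json

def build_distillation_set(inputs, teacher):
    """Run each input through the expensive teacher model once, offline,
    and record (input, teacher output) pairs as fine-tuning rows."""
    rows = []
    for text in inputs:
        label = teacher(text)  # the expensive call — done once, not per request
        rows.append({"messages": [
            {"role": "user", "content": text},
            {"role": "assistant", "content": label},
        ]})
    return "\n".join(json.dumps(r) for r in rows)  # JSONL for fine-tuning

# Stand-in teacher for illustration; in practice this is the premium model.
fake_teacher = lambda t: "billing" if "invoice" in t else "support"
print(build_distillation_set(["invoice overdue", "app crashes"], fake_teacher))
```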
When to Distill
Distillation makes sense when you have a well-defined, high-volume task that a premium model handles well. Customer support classification, code review, content moderation, and entity extraction are ideal candidates. It does not work well for open-ended creative tasks or tasks that require broad general knowledge.
The DeepSeek Effect
DeepSeek demonstrated that aggressive distillation and efficiency optimization can produce models that compete with frontier models at 20–100x lower cost. DeepSeek V3.2 at $0.27/$0.42 per M tokens performs comparably to models costing 10–20x more on many benchmarks. This competitive pressure has forced all providers to improve efficiency.
Key insight: Distillation is the most powerful long-term optimization, but it requires upfront investment (data generation, fine-tuning, evaluation). Start with routing and caching for immediate wins, then distill your highest-volume tasks once you have stable workload patterns.
Combined Savings: The Full Stack
Layering all optimizations for 60–70% total reduction
Optimization Priority Order
Implementation order (effort vs impact):
1. Prompt caching: 10 min, 45–90% on cached tokens
2. Batch API: 30 min, 50% on async workloads
3. Model routing: 2–4 hrs, 40–60% overall
4. Prompt compression: 1–2 hrs, 15–30% additional
5. Semantic caching: 4–8 hrs, 15–30% on repeated queries
6. Distillation: 1–2 wks, 5–30x on high-volume tasks
Real-World Combined Impact
A production AI gateway implementing routing + caching + batching achieves 47–80% total cost reduction. The first three optimizations (caching, batching, routing) can be implemented in a single day and typically deliver 40–60% savings. Adding compression and semantic caching pushes to 60–70%. Distillation, when applicable, can push individual task costs down by an additional 5–30x.
Key insight: The optimizations compound. Routing sends 60% of traffic to a model that’s 10x cheaper. Caching saves 50% on the remaining traffic’s prefixes. Batching halves the cost of async work. Together, a $10,000/month bill becomes $3,000–4,000/month.
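The compounding math can be made concrete with a toy model. The routing numbers come from the text above; the cached-prefix share (40%) and async share (20%) are illustrative assumptions, not measurements.

```python
# Toy model of compounding savings on a $10,000/month bill.
bill = 10_000.0

# Routing: 60% of traffic moves to a model ~10x cheaper.
bill = 0.4 * bill + 0.6 * bill * 0.1   # -> 4,600

# Caching: assume prefixes are ~40% of remaining spend, cached at 50% off.
bill *= 1 - 0.4 * 0.5                  # -> 3,680

# Batching: assume ~20% of the workload is async, at 50% off.
bill *= 1 - 0.2 * 0.5                  # -> 3,312

print(round(bill))  # lands in the $3,000-4,000/month range
```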
The KV-Cache Hit Rate
The key production metric for cost optimization
What to Measure
The KV-cache hit rate measures what percentage of your input tokens are served from cache vs recomputed from scratch. A high hit rate (80%+) means your caching strategy is working — most requests reuse previously computed token representations. A low hit rate (<30%) means you’re paying full price for every request.
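The metric itself is a simple ratio over a reporting window:

```python
def kv_cache_hit_rate(cached_input_tokens: int, total_input_tokens: int) -> float:
    """Fraction of input tokens served from cache rather than recomputed."""
    return cached_input_tokens / total_input_tokens if total_input_tokens else 0.0

# e.g. over one day: 8.4M of 10M input tokens were cache reads
rate = kv_cache_hit_rate(8_400_000, 10_000_000)
print(f"{rate:.0%}")  # 84% -> the caching strategy is working
```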
How to Improve It
Stabilize your prompt prefix. Don’t dynamically change system prompts, tool definitions, or few-shot examples between requests. Every change invalidates the cache. Order matters — put the most stable content (system prompt) first, variable content (user query) last. This maximizes the cacheable prefix length.
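The ordering rule can be enforced in code by assembling every request the same way: stable content first, the user's query last. The prompt and examples here are illustrative placeholders.

```python
# Stable content (changes rarely -> stays cacheable across requests):
SYSTEM_PROMPT = "You are a support assistant..."
FEW_SHOT = [{"role": "user", "content": "example in"},
            {"role": "assistant", "content": "example out"}]

def build_messages(user_query: str):
    """Stable prefix first, variable suffix last -> maximal cacheable prefix."""
    return ([{"role": "system", "content": SYSTEM_PROMPT}]
            + FEW_SHOT
            + [{"role": "user", "content": user_query}])

msgs = build_messages("Where is order 123?")
print(msgs[0]["role"], "->", msgs[-1]["content"])
```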
What’s Next
Chapter 7 applies everything we’ve learned to AI agents — the most expensive and hardest-to-optimize AI workload. Using the employee analogy (agents bill by the minute and sometimes spin in circles), it covers doom loops, monitoring, and the 4-layer cost governance framework.
Chapter Summary
The water conservation analogy: use less (compression), reuse (caching), switch sources (routing). Model routing saves 40–60%. Prompt caching saves 45–90% on prefixes in 10 minutes. Batch APIs give 50% off for async work. Distillation delivers 5–30x reduction on high-volume tasks. Combined: 60–70% typical savings. KV-cache hit rate is the key production metric.