Ch 2 — The Context Window

Anatomy, limits, and the attention budget that governs everything
High Level
[Interactive diagram: the context window assembled step by step from System → User → History → RAG Docs → Tools → Memory]
What Is the Context Window?
The fixed-size buffer that defines what the model knows
Definition
The context window is the maximum amount of text (measured in tokens) that an LLM can process in a single request. It includes everything: system instructions, user input, conversation history, retrieved documents, tool definitions, and the model’s own output. Think of it as the model’s working memory — anything outside the window simply doesn’t exist to the model.
Current Sizes (2026)
// Context window sizes by model
GPT-4o:          128K tokens
GPT-4.1:         1M tokens
Claude 3.5/4:    200K tokens
Gemini 3.0 Pro:  2M+ tokens
Llama 3.1 405B:  128K tokens

// 1 token ≈ 0.75 English words
// 128K tokens ≈ 96,000 words ≈ a 300-page book
Key insight: Bigger windows don’t automatically mean better results. The skill is spending each token where it matters most, not filling the window.
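The word-to-token arithmetic above can be sketched as a quick estimator. The 0.75 ratio is the rough English-text average quoted above, not an exact tokenizer count:

```python
def tokens_to_words(tokens: int) -> int:
    """Rough estimate: 1 token ≈ 0.75 English words."""
    return round(tokens * 0.75)

def words_to_tokens(words: int) -> int:
    """Inverse estimate: ~1.33 tokens per English word."""
    return round(words / 0.75)

# 128K tokens ≈ 96,000 words, per the figures above
print(tokens_to_words(128_000))
```

For real budgeting, replace the ratio with an actual tokenizer count; the ratio is only good for order-of-magnitude planning.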
The Eight Components
Everything that competes for space in the context window
Components 1–4
1. System prompt — Instructions that define the model’s role, behavior, and constraints. Typically 200–2,000 tokens.

2. User prompt — The actual question or task from the user. Usually the smallest component in production systems.

3. Conversation history — Prior turns in the dialogue. Grows linearly with conversation length and is the primary source of context bloat.

4. Retrieved documents (RAG) — External knowledge fetched via search or retrieval. Can be the largest single component, often 10,000–50,000+ tokens.
Components 5–8
5. Tool definitions — JSON schemas describing available tools/APIs. A single complex schema can consume 500+ tokens. 90 tools means 50K+ tokens before any user interaction.

6. Few-shot examples — Input/output pairs that demonstrate desired behavior. High-quality examples improve consistency but consume significant space.

7. Memory stores — Persisted facts from previous sessions (user preferences, past decisions). Enables continuity across conversations.

8. Metadata — Timestamps, user IDs, session state, environment variables. Small individually but adds up across complex agent systems.
Key insight: Traditional prompt engineering addresses only component #2 (user prompt). Context engineering manages all eight. In production, the user prompt is typically less than 5% of the total context.
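As a back-of-the-envelope illustration of that budget, here is a hypothetical breakdown across the eight components. All token counts are invented for illustration; only the under-5% conclusion comes from the text:

```python
# Illustrative token counts for the eight components (hypothetical values)
context = {
    "system_prompt": 1_500,
    "user_prompt": 300,
    "conversation_history": 12_000,
    "retrieved_documents": 30_000,
    "tool_definitions": 8_000,
    "few_shot_examples": 2_500,
    "memory_stores": 1_200,
    "metadata": 400,
}

total = sum(context.values())
user_share = context["user_prompt"] / total
print(f"total: {total} tokens, user prompt share: {user_share:.1%}")
```

Even with generous rounding, the user prompt is a rounding error next to retrieval and history, which is the point of managing all eight components rather than one.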
The “Lost in the Middle” Problem
Why models miss information buried in long contexts
The U-Shaped Curve
Research has consistently shown that LLMs exhibit a U-shaped performance curve across the context window. Models attend strongly to content at the beginning (primacy effect) and end (recency effect) of the context, but struggle with information placed in the middle. This is the “lost in the middle” phenomenon, first documented in 2023 and still present in 2026 models despite larger windows.
Critical in AI: If a critical piece of information (like an updated refund policy) is buried in the middle of 50 retrieved documents, the model may ignore it entirely — even though it’s technically “in context.” Position matters as much as presence.
The “Needle in a Haystack” Test
The standard evaluation: insert a specific fact at various positions in a long context and test whether the model can retrieve it. Results show systematic blind spots in the middle third of the context window. Even models advertising 200K+ token windows show degraded retrieval accuracy for middle-positioned information.
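A minimal harness for this kind of test might look like the following sketch. The `query_model` call is a commented-out placeholder for a real LLM API, and the filler and needle text are invented:

```python
def build_haystack(filler: list[str], needle: str, depth: float) -> str:
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    i = round(depth * len(filler))
    return "\n".join(filler[:i] + [needle] + filler[i:])

filler = [f"Background sentence {n}." for n in range(100)]
needle = "The refund window is 45 days."

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    haystack = build_haystack(filler, needle, depth)
    # found = query_model(haystack, "What is the refund window?")  # placeholder
    # Record retrieval accuracy per depth; the middle depths are where it drops.
```

Sweeping `depth` and plotting accuracy reproduces the U-shaped curve described above.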
Practical Implication
Context engineering addresses this by controlling information placement: put critical instructions at the beginning (system prompt), put the most relevant retrieved content near the end (close to the query), and use compression to eliminate middle-section noise. Systematic context management can prevent 30% of information loss from this effect.
Effective vs. Advertised Limits
Why real-world performance breaks before the technical ceiling
The 30–40% Gap
Most models experience meaningful performance degradation 30–40% before their advertised context limit. A model with a 128K token window may start producing noticeably worse output around 80K–90K tokens. This isn’t a bug — it’s a fundamental property of how attention mechanisms scale with sequence length.
Attention Scales Quadratically
The computational cost of attention scales quadratically with context length. Doubling the context doesn’t just double the cost — it quadruples it. This means cost escalation is geometric, and strategic compression and caching become essential at scale.
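The quadratic scaling can be made concrete with two lines of arithmetic; the unit cost here is arbitrary, only the ratio matters:

```python
def attention_cost(n_tokens: int, unit_cost: float = 1.0) -> float:
    """Self-attention over n tokens touches ~n^2 token pairs."""
    return unit_cost * n_tokens ** 2

base = attention_cost(4_000)
doubled = attention_cost(8_000)
print(doubled / base)  # doubling the context quadruples the attention cost
```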
The Paradox of More
Including irrelevant information actively degrades model attention on essential tokens. It’s not neutral — it’s harmful. Every irrelevant document, every stale conversation turn, every unused tool definition is competing for the model’s finite attention budget. The core principle of context engineering is “Select, Don’t Dump” — include only what’s necessary for the next step.
Rule of thumb: If you’re using more than 60% of a model’s advertised context window, you should be actively managing what’s in there. Beyond 80%, expect measurable quality degradation regardless of how relevant the content is.
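That rule of thumb translates directly into a guard. The 60% and 80% thresholds come from the text above; the function name and return labels are illustrative:

```python
def context_health(used_tokens: int, advertised_limit: int) -> str:
    """Apply the 60% / 80% rule of thumb for context utilization."""
    utilization = used_tokens / advertised_limit
    if utilization > 0.80:
        return "degrading"   # expect measurable quality loss
    if utilization > 0.60:
        return "manage"      # actively curate what's in context
    return "ok"

print(context_health(50_000, 128_000))   # ~39% of a 128K window
print(context_health(110_000, 128_000))  # ~86% of a 128K window
```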
The KV-Cache
The hidden mechanism that makes context engineering practical
What It Is
The KV-cache (Key-Value cache) stores pre-computed attention states from the transformer’s self-attention mechanism. When generating tokens, the model doesn’t need to recompute attention for the entire context from scratch — it reuses cached key-value pairs from previous tokens. This is what makes autoregressive generation fast enough to be practical.
Two Levels of Caching
Intra-request caching: Within a single generation, previously computed KV pairs are reused for each new token. This is automatic and universal.

Prefix/prompt caching: Across requests, server-side features (available in Anthropic, Groq, and vLLM) cache the KV states for shared prefixes. If your system prompt is the same across requests, it’s computed once and reused.
Why It Matters for Context Engineering
Industry leaders describe KV-cache hit rate as “the single most important metric” for production AI agents. High cache hit rates can reduce inference latency and costs by 10×. The key optimization: separate stable prefix content (system instructions, tool definitions) from dynamic content (user queries, retrieved docs) to maximize cache reuse.
Key insight: Manus discovered that dynamically adding or removing tools mid-iteration invalidates the KV-cache for all subsequent tokens, because tool definitions sit near the front of the context. This single finding changed how production agent systems manage their tool registries.
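One way to see why prefix stability matters: a prefix cache keys on exact content, so any edit near the front invalidates everything after it. This sketch uses a hash as a stand-in for the provider's internal cache key; the prompt strings are invented:

```python
import hashlib

def cache_key(prefix: str) -> str:
    """Stand-in for a provider's prefix-cache key: exact bytes in, key out."""
    return hashlib.sha256(prefix.encode()).hexdigest()

SYSTEM = "You are a support agent."
TOOLS = '{"name": "lookup_order", "parameters": {}}'  # stable tool schemas

stable_prefix = SYSTEM + "\n" + TOOLS   # identical across requests
key_a = cache_key(stable_prefix)
key_b = cache_key(stable_prefix)        # second request: same key, cache hit

# Swapping one tool mid-iteration changes the prefix bytes, so every
# subsequent token's cached state is invalidated.
mutated = cache_key(SYSTEM + "\n" + TOOLS.replace("lookup_order", "new_tool"))
```

This is why production systems keep the tool registry fixed within a task and mask unavailable tools instead of removing their definitions.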
Token Economics
The cost math that makes context engineering non-optional
The Numbers
// GPT-4o pricing example
Input cost: $2.50 / 1M tokens
128K window: $0.32 per request

// At 10,000 requests/day:
Monthly cost: $96,000

// With context engineering (50% reduction):
Monthly cost: $48,000
Annual savings: $576,000
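The same math in executable form, using the example price and request volume above:

```python
PRICE_PER_TOKEN = 2.50 / 1_000_000  # GPT-4o input pricing from the example

def monthly_cost(tokens_per_request: int, requests_per_day: int,
                 days: int = 30) -> float:
    """Input-token spend for a month at a steady request rate."""
    return tokens_per_request * PRICE_PER_TOKEN * requests_per_day * days

full = monthly_cost(128_000, 10_000)      # full window on every request
trimmed = monthly_cost(64_000, 10_000)    # 50% context reduction
print(full, trimmed, (full - trimmed) * 12)
```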
Real-World Case Study
A fintech startup reduced document analysis costs from $30,600 to $4,100 monthly (87% reduction) through three context engineering techniques: extracting only relevant document sections via RAG instead of including full documents, compressing conversation history with sliding window summarization, and caching system prompts using provider-level prompt caching.
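A sliding-window summarizer like the one described might be sketched as follows. The `summarize` callable is a stub standing in for a real summarization call, and the turn strings are placeholders:

```python
def compress_history(turns: list[str], keep_recent: int = 4,
                     summarize=lambda old: f"[summary of {len(old)} earlier turns]"
                     ) -> list[str]:
    """Keep the newest turns verbatim; collapse older ones into one summary."""
    if len(turns) <= keep_recent:
        return turns
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    return [summarize(old)] + recent

history = [f"turn {i}" for i in range(10)]
compressed = compress_history(history)
print(compressed)  # 10 turns -> 1 summary line + 4 verbatim turns
```

In production the stub becomes an LLM call, and the summary is regenerated as the window slides rather than re-summarizing the whole history each turn.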
Prompt Caching Savings
Newer model providers offer prompt caching that provides up to 90% cost savings on repeated content. System prompts and few-shot examples become nearly free on repeat calls. This makes the architectural decision of what goes in the stable prefix vs. dynamic content a direct cost optimization lever.
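Assuming the 90% discount on cached input tokens mentioned above (the exact discount and eligibility rules are provider-specific), the savings are easy to model:

```python
def input_cost(tokens: int, cached_fraction: float,
               price: float = 2.50 / 1_000_000,
               cache_discount: float = 0.90) -> float:
    """Per-request input cost when a fraction of tokens hit the prompt cache."""
    cached = tokens * cached_fraction
    uncached = tokens - cached
    return uncached * price + cached * price * (1 - cache_discount)

cold = input_cost(128_000, cached_fraction=0.0)   # no cache hits
warm = input_cost(128_000, cached_fraction=0.75)  # stable prefix = 75% of context
print(cold, warm)
```

The larger the stable prefix, the closer `warm` gets to the fully discounted floor, which is why prefix design is a direct cost lever.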
Context Window Architecture
How to structure the window for maximum effectiveness
The Stable Prefix
The first portion of the context window should contain stable, rarely-changing content: system instructions, tool definitions, and few-shot examples. This maximizes KV-cache hit rates across requests and ensures the model always has its core instructions in the high-attention “primacy” zone.
The Dynamic Middle
Conversation history, retrieved documents, and memory stores occupy the middle. This is where compression and routing do their work — keeping only the most relevant turns, selecting only the most pertinent documents, and summarizing older context to preserve space.
The Active Tail
The most recent user query and the most relevant retrieved content should be placed near the end of the context, in the high-attention “recency” zone. This exploits the U-shaped attention curve to ensure the model focuses on what matters most for the current task.
Key insight: This three-zone architecture (stable prefix, dynamic middle, active tail) is not just a best practice — it’s an optimization for both attention quality and KV-cache economics. Getting the structure right often matters more than getting the content right.
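A minimal assembler for the three zones might look like this sketch. The function and its signature are illustrative, not a real library API, and it assumes `ranked_docs` is sorted most-relevant first:

```python
def assemble_context(system: str, tools: list[str],
                     history: list[str], ranked_docs: list[str],
                     query: str) -> str:
    """Three-zone layout: stable prefix, dynamic middle, active tail."""
    prefix = [system, *tools]              # stable: maximizes KV-cache reuse
    middle = [*history, *ranked_docs[1:]]  # dynamic: compressed/routed content
    tail = ranked_docs[:1] + [query]       # active: best doc right next to the query
    return "\n\n".join(prefix + middle + tail)

ctx = assemble_context("You are a support agent.", ["tool: lookup_order"],
                       ["turn 1"], ["best doc", "other doc"], "Where is my order?")
```

Note that the most relevant document lands in the recency zone while lower-ranked documents absorb the middle, matching the U-shaped attention curve.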
The Agent Context Challenge
Why agents make context engineering exponentially harder
The ReAct Loop Problem
AI agents using the ReAct pattern (Reason + Act) accumulate context with every tool call. Each cycle adds the tool result (hundreds or thousands of tokens), the model’s reasoning, and the action taken. Without intervention, a 10-step agent task can consume the entire context window — pushing out the system instructions and early task context the model needs to reason well.
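The accumulation math is easy to make concrete. The per-cycle token figure below is a hypothetical average for a heavy tool call plus reasoning, not a measured value:

```python
def steps_until_full(window: int, base_tokens: int,
                     tokens_per_step: int = 8_000) -> int:
    """How many ReAct cycles fit before the context window overflows."""
    steps, used = 0, base_tokens
    while used + tokens_per_step <= window:
        used += tokens_per_step
        steps += 1
    return steps

# A 128K window with 50K of prefix/tools leaves room for only ~9 heavy cycles
print(steps_until_full(128_000, 50_000))
```

With a large tool registry eating the base budget, even a modest task runs out of room, which is why agents need compression between cycles rather than only at the end.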
Critical in AI: OpenAI recommends fewer than 20 tools per agent, with accuracy degrading past 10. Connect a few MCP servers and you might reach 90+ tool definitions — over 50,000 tokens of schemas before the model starts reasoning. This is the tool management crisis that Chapter 7 addresses.
Multi-Agent Amplification
In multi-agent systems, each agent has its own context window. Information must be selectively routed between agents — a billing agent doesn’t need the onboarding knowledge base, and a code review agent doesn’t need the deployment logs. Without routing, every agent carries every piece of context, multiplying waste.
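Selective routing can be as simple as an allow-list per agent. The agent names and context keys below are hypothetical:

```python
# Hypothetical routing table: each agent sees only its slice of shared context
ROUTES = {
    "billing": {"invoices", "payment_history"},
    "code_review": {"diff", "style_guide"},
    "onboarding": {"kb_articles", "user_profile"},
}

def route_context(agent: str, shared: dict[str, str]) -> dict[str, str]:
    """Forward only the context keys this agent actually needs."""
    allowed = ROUTES.get(agent, set())
    return {k: v for k, v in shared.items() if k in allowed}

shared = {"invoices": "...", "diff": "...", "kb_articles": "...",
          "deployment_logs": "..."}
print(route_context("code_review", shared))
```

Real systems route on relevance scores rather than static keys, but the invariant is the same: no agent carries context it cannot use.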
What’s Next
The remaining chapters of this course address each dimension of this challenge: progressive disclosure (Ch 3) controls what loads and when, compression (Ch 4) shrinks accumulated history, routing (Ch 5) directs queries to the right source, retrieval (Ch 6) fetches external knowledge on demand, tool management (Ch 7) controls the capability surface, and token budgeting (Ch 8) ties it all together economically.