Ch 4 — Context Compression

Shrinking accumulated history while preserving the information the model needs
High Level
[Overview animation: Bloat → Truncate → Summarize → Sliding Window → Memory → Result]
The Context Bloat Problem
How agent loops fill the context window
The ReAct Accumulation Loop
Every tool call in a ReAct agent adds to the context: the model’s reasoning, the action taken, and the tool’s result. Each tool result can be hundreds or thousands of tokens — API responses, file contents, search results, error traces. A 10-step agent task can easily consume 30,000–50,000 tokens of accumulated action history alone.
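The arithmetic behind that growth can be sketched in a few lines. The per-step token counts below are illustrative assumptions chosen to match the ranges above, not measurements:

```python
# Sketch of how a ReAct loop accumulates context. The per-part token
# counts are assumptions for illustration, not measured values.
REASONING_TOKENS = 150     # model's thought per step
ACTION_TOKENS = 50         # tool call per step
OBSERVATION_TOKENS = 3000  # tool result per step (API response, file, trace)

def accumulated_history_tokens(steps: int) -> int:
    """Total tokens of action history after `steps` ReAct iterations."""
    per_step = REASONING_TOKENS + ACTION_TOKENS + OBSERVATION_TOKENS
    return steps * per_step

# A 10-step task accumulates 32,000 tokens of history alone,
# squarely in the 30K-50K range described above.
print(accumulated_history_tokens(10))  # 32000
```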
What Gets Pushed Out
Without intervention, the accumulated history fills the context window and pushes out the system instructions, tool definitions, and early task context that the model actually needs to reason well. The model loses access to its own instructions while drowning in the history of its own actions.
Critical insight: This is not a theoretical problem. Long-running agent tasks routinely fail not because the model can’t solve the problem, but because critical context has been displaced by accumulated history from earlier steps.
Naive Approaches
Simple truncation and its limitations
Keep Top N Turns
The simplest approach: keep only the most recent N turns of interaction and discard the rest. Fast and predictable, but brutal — critical early decisions, error patterns, and task context are permanently lost. The model has no memory of what it already tried.
Head/Tail Truncation
Keep the first few turns (task setup) and the last few turns (recent context), discard the middle. Better than pure recency, but still loses the narrative thread of how the agent arrived at its current state.
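Both strategies are a few lines of list slicing, which is exactly why they are so tempting. A minimal sketch, operating on a list of chat turns; the function names are illustrative:

```python
# Minimal sketches of the two naive truncation strategies. Turns are
# dicts with "role"/"content"; names here are illustrative.

def keep_top_n(turns: list[dict], n: int) -> list[dict]:
    """Keep only the most recent n turns; everything older is lost."""
    return turns[-n:]

def head_tail(turns: list[dict], head: int, tail: int) -> list[dict]:
    """Keep task setup (first `head` turns) and recent context (last
    `tail` turns); the middle of the narrative is discarded."""
    if len(turns) <= head + tail:
        return turns
    return turns[:head] + turns[-tail:]

turns = [{"role": "user", "content": f"turn {i}"} for i in range(20)]
assert len(keep_top_n(turns, 5)) == 5                 # turns 0-14 gone forever
assert len(head_tail(turns, head=2, tail=5)) == 7     # turns 2-14 gone forever
```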
Why Simple Truncation Fails
Truncation is binary: information is either fully present or completely gone, with no middle ground. It loses error traces, successful patterns, and decision rationale, so the model repeats mistakes it already made.

Compression is graduated: it preserves the gist of older context while keeping recent turns in full detail. It retains error patterns and key decisions, so the model builds on prior work instead of repeating it.
Sliding Window + Summarization
The dominant approach in 2026
How It Works
The field has converged on sliding window plus summarization hybrids as the dominant compression approach. The pattern: keep the most recent N turns in full detail (the “window”), and compress older context through LLM-based summarization. The summary preserves key decisions, outcomes, and error patterns while dramatically reducing token count.
The Compression Flow
// Sliding window compression
Context Window Layout:

[System Prompt]       // Always preserved
[Tool Definitions]    // Always preserved
[Compressed Summary]  // Turns 1..N-5
[Raw Turn N-4]        // Recent window
[Raw Turn N-3]        // Recent window
[Raw Turn N-2]        // Recent window
[Raw Turn N-1]        // Recent window
[Raw Turn N]          // Current turn

// Token reduction: 8K → 2K while maintaining coherence
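A minimal sketch of that layout in Python. The `summarize` parameter is a caller-supplied function (in practice an LLM call); all names are illustrative, not any specific library's API:

```python
# Hybrid sliding-window compression: old turns collapse into one
# summary message, the recent window stays raw.

def compress_context(system: str, tools: str, turns: list[str],
                     window: int, summarize) -> list[str]:
    """Return the message list with old turns collapsed into one summary."""
    if len(turns) <= window:
        return [system, tools] + turns            # nothing to compress yet
    old, recent = turns[:-window], turns[-window:]
    summary = summarize(old)                      # one LLM call over old turns
    return [system, tools, f"[Summary of earlier turns] {summary}"] + recent

# Stub summarizer for demonstration; a real one would call a model.
fake_summarize = lambda old: f"{len(old)} earlier turns condensed"
msgs = compress_context("sys", "tools", [f"t{i}" for i in range(10)],
                        window=5, summarize=fake_summarize)
print(msgs[2])  # [Summary of earlier turns] 5 earlier turns condensed
```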
When to Compress
Compression can be triggered periodically (every N turns), threshold-based (when context exceeds X% of the window), or event-based (after a task phase completes). Periodic compression is simplest; threshold-based is most efficient. Each compression step requires an LLM call, so amortizing the cost by compressing periodically rather than every turn is standard practice.
Key insight: Conversation summarization can reduce token usage from 8,000 to 2,000 tokens while maintaining coherence. Companies implementing this pattern report 60–80% cost reduction on long-running agent tasks.
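The three trigger policies compose naturally into a single check. The specific numbers (every 10 turns, 80% of the window) are assumptions for illustration:

```python
# Sketch of the three compression triggers; N=10 and the 80% threshold
# are illustrative defaults, not recommendations from a library.

def should_compress(turn_count: int, context_tokens: int,
                    window_limit: int, phase_done: bool) -> bool:
    periodic  = turn_count > 0 and turn_count % 10 == 0  # every N=10 turns
    threshold = context_tokens > 0.8 * window_limit      # >80% of window
    event     = phase_done                               # task phase completed
    return periodic or threshold or event

assert should_compress(10, 5_000, 128_000, False)     # periodic fires
assert should_compress(7, 110_000, 128_000, False)    # threshold fires
assert should_compress(7, 5_000, 128_000, True)       # event fires
assert not should_compress(7, 5_000, 128_000, False)  # nothing fires
```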
Manus’s Rhythm Preservation
Why raw recent turns matter more than you think
The Rhythm Effect
Manus made a subtle but important discovery: keep the most recent tool calls in raw format so the model maintains its “rhythm” and formatting style. When recent tool interactions are summarized instead of kept raw, the model loses its formatting consistency — it starts producing outputs in slightly different structures, breaking downstream parsers and tool integrations.
The Error Trace Rule
Manus’s second critical finding: do not compress away error traces. When a tool call fails, leaving the error and stack trace in context helps the model avoid repeating the same mistake. This technique is well-established — libraries like Instructor use it for structured output retries — and it applies broadly to any agent that calls tools.
Key insight: Compression is lossy by definition. The art is knowing what to lose. Manus’s rules provide clear guidance: never compress recent tool calls (rhythm), never compress error traces (learning), and always compress older successful interactions first (they’re the safest to summarize).
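Those rules reduce to a simple filter over the turn history. A sketch, assuming turns carry an `error` flag; the turn shape and function name are illustrative, not Manus's actual implementation:

```python
# Manus-style selection: never summarize the recent window (rhythm)
# or failed tool calls (error traces). Only what remains is safe to
# hand to the summarizer.

def select_for_compression(turns: list[dict], window: int) -> list[dict]:
    """Return only the turns that are safe to summarize."""
    recent = set(range(max(0, len(turns) - window), len(turns)))
    return [t for i, t in enumerate(turns)
            if i not in recent           # rule 1: keep recent turns raw
            and not t.get("error")]      # rule 2: never compress error traces

turns = [
    {"content": "read file", "error": False},
    {"content": "call API",  "error": True},   # kept raw: error trace
    {"content": "parse",     "error": False},
    {"content": "write",     "error": False},  # kept raw: recent window
]
compressible = select_for_compression(turns, window=1)
print([t["content"] for t in compressible])  # ['read file', 'parse']
```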
Long-Term Memory Approach
Moving history to durable storage and retrieving on demand
The Architecture
Instead of compressing older turns into a summary, the long-term memory approach moves them to a durable storage system (typically a vector database) and retrieves only relevant actions on demand. The context window carries only the recent window plus any retrieved historical context that’s relevant to the current task.
Three-Tier Memory
Working memory: Active context window — immediate reasoning.

Recall storage: Searchable database of recent interactions — retrieved when relevant.

Archival storage: Vector-based long-term memory — accumulated knowledge across all sessions.
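The working/recall split above can be sketched with eviction and retrieval. A real system would use a vector database with embedding similarity; keyword overlap stands in for that here, and all class and method names are illustrative:

```python
# Two of the three tiers: working memory with a fixed capacity, and
# recall storage that receives evicted turns and serves retrieval.

class TieredMemory:
    def __init__(self, working_limit: int):
        self.working: list[str] = []   # active context window
        self.recall: list[str] = []    # searchable store of evicted turns
        self.working_limit = working_limit

    def add(self, turn: str) -> None:
        self.working.append(turn)
        if len(self.working) > self.working_limit:
            # Evict the oldest working turn to recall storage, not the void.
            self.recall.append(self.working.pop(0))

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Pull the k most relevant evicted turns back into view.
        Word overlap is a stand-in for embedding similarity."""
        q = set(query.lower().split())
        score = lambda t: len(q & set(t.lower().split()))
        return sorted(self.recall, key=score, reverse=True)[:k]

mem = TieredMemory(working_limit=3)
for t in ["open config file", "fix parser bug", "run tests", "deploy service"]:
    mem.add(t)
print(mem.working)                       # ['fix parser bug', 'run tests', 'deploy service']
print(mem.retrieve("config file", k=1))  # ['open config file']
```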
When to Use Each Approach
// Decision matrix
Short tasks (5-10 turns):
  → No compression needed          // Context fits comfortably
Medium tasks (10-50 turns):
  → Sliding window + summary       // Best cost/quality tradeoff
Long tasks (50+ turns):
  → Long-term memory + retrieval   // Summary alone loses too much
Cross-session:
  → Archival memory                // Persistent knowledge store
Advanced Compression Techniques
Beyond basic summarization
Prompt Caching
Prompt caching is a server-side feature (available in Anthropic, Groq, and vLLM) that caches the KV states for shared prefixes across requests. If your system prompt and tool definitions are the same across requests, they’re computed once and reused. This provides up to 90% cost savings on repeated content and is the most impactful “compression” technique for the stable prefix.
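Concretely, caching is opted into by marking the end of the stable prefix. The sketch below builds (but does not send) a request in the shape of Anthropic's Messages API `cache_control` blocks; check the current API docs for exact fields, and treat the model name as a placeholder:

```python
# Marking a stable prefix for server-side prompt caching. The payload
# shape follows Anthropic's cache_control convention; everything up to
# and including the marked block is cached and reused across requests
# that share the identical prefix.

def build_request(system_prompt: str, tools: list, user_msg: str) -> dict:
    return {
        "model": "claude-sonnet-4-20250514",  # placeholder model id
        "tools": tools,
        "system": [{
            "type": "text",
            "text": system_prompt,
            # Cache breakpoint: tools + system prompt form the stable
            # prefix, computed once and reused on subsequent requests.
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": [{"role": "user", "content": user_msg}],
    }

req = build_request("You are a coding agent.", [], "List the repo files.")
print(req["system"][0]["cache_control"])  # {'type': 'ephemeral'}
```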
Test-Time Training
An emerging technique: test-time training temporarily fine-tunes the model on the current context, achieving a 35× speedup for 2M-token contexts compared to processing the full context directly. Still experimental, but it points toward a future where compression happens at the model level rather than the prompt level.
Selective Inclusion
The most fundamental compression technique is also the simplest: don’t include what you don’t need. Before adding any content to the context, ask: does the model need this for the next step? If not, leave it out. This “Select, Don’t Dump” principle prevents bloat before compression is even needed.
Key insight: The best compression is prevention. Every token that never enters the context window is a token that never needs to be compressed, cached, or paid for. Selective inclusion is the cheapest and most effective optimization.
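In code, “Select, Don’t Dump” is often nothing more than projecting a raw tool result down to the fields the next step actually needs. A sketch with illustrative field names:

```python
# Selective inclusion: filter a raw tool result before it ever enters
# the context window. Field names below are hypothetical.

def select_fields(api_response: dict, needed: list[str]) -> dict:
    """Keep only the keys the model needs for its next step."""
    return {k: api_response[k] for k in needed if k in api_response}

raw = {"id": 42, "status": "open", "title": "Fix login bug",
       "body": "...", "html": "<div>...</div>", "audit_log": ["..."] * 50}
slim = select_fields(raw, needed=["id", "status", "title"])
print(slim)  # {'id': 42, 'status': 'open', 'title': 'Fix login bug'}
```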
Cost Impact
Real-world savings from compression
Before & After
// Long-running agent task (50 turns)

Without compression:
  Avg context size: 95,000 tokens
  Cost per request: $0.24
  Total task cost:  $12.00

With sliding window + summary:
  Avg context size: 25,000 tokens
  Cost per request: $0.06
  Total task cost:  $3.00   // + ~$0.50 for compression calls

Net savings: 71%
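The net-savings figure follows directly from those numbers once the compression calls are added back in:

```python
# Reproducing the arithmetic from the example above.
without = 50 * 0.24            # 50 requests x $0.24 = $12.00
with_summary = 50 * 0.06       # 50 requests x $0.06 = $3.00
compression_calls = 0.50       # ~$0.50 of extra summarization LLM calls

net = 1 - (with_summary + compression_calls) / without
print(f"net savings: {net:.0%}")  # net savings: 71%
```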
Enterprise Scale
Companies implementing effective context compression report 60–80% cost reduction on long-running agent tasks. At enterprise scale (thousands of agent tasks per day), this translates to hundreds of thousands of dollars in annual savings. The compression LLM calls add cost, but the net savings are overwhelmingly positive.
Key insight: Compression improves both cost and quality. Smaller, more focused contexts produce better model outputs (less noise, more signal) while costing less. This is one of the rare optimizations where you get better results for less money.
Tradeoffs and Open Questions
What compression loses and what remains unsolved
The Lossy Nature
All compression is lossy. Summarization preserves the gist but loses details. The question is always: which details can you afford to lose? Compression works well for long-horizon tasks where early steps are contextual background, but poorly when critical early details get summarized away — like a specific error message or a user’s exact phrasing of a requirement.
Compression Quality
The quality of the compression depends on the quality of the summarization model. Using a cheaper, faster model for compression saves money but may lose important nuance. Using the same model for compression and reasoning is more accurate but doubles the cost of each compression step.
Open Questions
When to compress? Too early loses detail; too late wastes tokens on bloated context.

What detail level? Aggressive compression saves more tokens but loses more information.

How to evaluate? Measuring whether a summary preserved “enough” information is itself an unsolved problem. Probe-based evaluation (asking the model questions about compressed content) is the current best practice.
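A probe check can be sketched as asking known-answer questions against the summary. Here plain substring matching stands in for an LLM judge, and all names and the example facts are illustrative:

```python
# Probe-based evaluation: facts known from the raw history should
# still be recoverable from the compressed summary.

def probe_summary(summary: str, probes: dict[str, str]) -> float:
    """Fraction of known facts still present in the summary."""
    hits = sum(1 for answer in probes.values()
               if answer.lower() in summary.lower())
    return hits / len(probes)

summary = "Agent fixed the parser after a TimeoutError from the fetch tool."
probes = {
    "What error occurred?": "TimeoutError",
    "What was fixed?": "parser",
    "Which port was used?": "8080",   # this detail was lost in compression
}
print(round(probe_summary(summary, probes), 2))  # 0.67
```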
Rule of thumb: Start with hybrid sliding window (keep the latest 5 turns raw, summarize older ones). Use probe-based evaluation to test whether your summaries preserve what matters. Adjust the window size and summary detail level based on task-specific results.