Ch 4 — Context Compression

Shrinking accumulated history while preserving the information the model needs
High Level
[Overview animation: Bloat → Truncate → Summarize → Sliding Window → Memory → Result]
The Context Bloat Problem
How agent loops fill the context window
The ReAct Accumulation Loop
Every tool call in a ReAct agent adds to the context: the model’s reasoning, the action taken, and the tool’s result. Each tool result can be hundreds or thousands of tokens — API responses, file contents, search results, error traces. A 10-step agent task can easily consume 30,000–50,000 tokens of accumulated action history alone.
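The arithmetic behind that growth can be sketched in a few lines. The per-step token counts below are illustrative assumptions chosen to match the ranges above, not measurements:

```python
# Sketch of how a ReAct loop accumulates context. The per-part token
# counts are assumptions for illustration, not measured values.
REASONING_TOKENS = 150     # model's thought per step
ACTION_TOKENS = 50         # tool call per step
OBSERVATION_TOKENS = 3000  # tool result per step (API response, file, trace)

def accumulated_history_tokens(steps: int) -> int:
    """Total tokens of action history after `steps` ReAct iterations."""
    per_step = REASONING_TOKENS + ACTION_TOKENS + OBSERVATION_TOKENS
    return steps * per_step

# A 10-step task accumulates 32,000 tokens of history alone,
# squarely in the 30K-50K range described above.
print(accumulated_history_tokens(10))  # 32000
```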
What Gets Pushed Out
Without intervention, the accumulated history fills the context window and pushes out the system instructions, tool definitions, and early task context that the model actually needs to reason well. The model loses access to its own instructions while drowning in the history of its own actions.
Critical insight: This is not a theoretical problem. Long-running agent tasks routinely fail not because the model can’t solve the problem, but because critical context has been displaced by accumulated history from earlier steps.
Naive Approaches
Simple truncation and its limitations
Keep Top N Turns
The simplest approach: keep only the most recent N turns of interaction and discard the rest. Fast and predictable, but brutal — critical early decisions, error patterns, and task context are permanently lost. The model has no memory of what it already tried.
Head/Tail Truncation
Keep the first few turns (task setup) and the last few turns (recent context), discard the middle. Better than pure recency, but still loses the narrative thread of how the agent arrived at its current state.
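Both strategies are a few lines of list slicing, which is exactly why they are so tempting. A minimal sketch, operating on a list of chat turns; the function names are illustrative:

```python
# Minimal sketches of the two naive truncation strategies. Turns are
# dicts with "role"/"content"; names here are illustrative.

def keep_top_n(turns: list[dict], n: int) -> list[dict]:
    """Keep only the most recent n turns; everything older is lost."""
    return turns[-n:]

def head_tail(turns: list[dict], head: int, tail: int) -> list[dict]:
    """Keep task setup (first `head` turns) and recent context (last
    `tail` turns); the middle of the narrative is discarded."""
    if len(turns) <= head + tail:
        return turns
    return turns[:head] + turns[-tail:]

turns = [{"role": "user", "content": f"turn {i}"} for i in range(20)]
assert len(keep_top_n(turns, 5)) == 5                 # turns 0-14 gone forever
assert len(head_tail(turns, head=2, tail=5)) == 7     # turns 2-14 gone forever
```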
Why Simple Truncation Fails
Truncation is binary: information is either fully present or completely gone, with no middle ground. It loses error traces, successful patterns, and decision rationale, so the model repeats mistakes it already made.

Compression is graduated: it preserves the gist of older context while keeping recent turns in full detail. It retains error patterns and key decisions, so the model builds on prior work instead of repeating it.
Sliding Window + Summarization
The dominant approach in 2026
How It Works
The field has converged on sliding window plus summarization hybrids as the dominant compression approach. The pattern: keep the most recent N turns in full detail (the “window”), and compress older context through LLM-based summarization. The summary preserves key decisions, outcomes, and error patterns while dramatically reducing token count.
The Compression Flow
// Sliding window compression
Context Window Layout:

[System Prompt]       // Always preserved
[Tool Definitions]    // Always preserved
[Compressed Summary]  // Turns 1..N-5
[Raw Turn N-4]        // Recent window
[Raw Turn N-3]        // Recent window
[Raw Turn N-2]        // Recent window
[Raw Turn N-1]        // Recent window
[Raw Turn N]          // Current turn

// Token reduction: 8K → 2K while maintaining coherence
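A minimal sketch of that layout in Python. The `summarize` parameter is a caller-supplied function (in practice an LLM call); all names are illustrative, not any specific library's API:

```python
# Hybrid sliding-window compression: old turns collapse into one
# summary message, the recent window stays raw.

def compress_context(system: str, tools: str, turns: list[str],
                     window: int, summarize) -> list[str]:
    """Return the message list with old turns collapsed into one summary."""
    if len(turns) <= window:
        return [system, tools] + turns            # nothing to compress yet
    old, recent = turns[:-window], turns[-window:]
    summary = summarize(old)                      # one LLM call over old turns
    return [system, tools, f"[Summary of earlier turns] {summary}"] + recent

# Stub summarizer for demonstration; a real one would call a model.
fake_summarize = lambda old: f"{len(old)} earlier turns condensed"
msgs = compress_context("sys", "tools", [f"t{i}" for i in range(10)],
                        window=5, summarize=fake_summarize)
print(msgs[2])  # [Summary of earlier turns] 5 earlier turns condensed
```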
When to Compress
Compression can be triggered periodically (every N turns), threshold-based (when context exceeds X% of the window), or event-based (after a task phase completes). Periodic compression is simplest; threshold-based is most efficient. Each compression step requires an LLM call, so amortizing the cost by compressing periodically rather than every turn is standard practice.
Key insight: Conversation summarization can reduce token usage from 8,000 to 2,000 tokens while maintaining coherence. Companies implementing this pattern report 60–80% cost reduction on long-running agent tasks.
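The three trigger policies compose naturally into a single check. The specific numbers (every 10 turns, 80% of the window) are assumptions for illustration:

```python
# Sketch of the three compression triggers; N=10 and the 80% threshold
# are illustrative defaults, not recommendations from a library.

def should_compress(turn_count: int, context_tokens: int,
                    window_limit: int, phase_done: bool) -> bool:
    periodic  = turn_count > 0 and turn_count % 10 == 0  # every N=10 turns
    threshold = context_tokens > 0.8 * window_limit      # >80% of window
    event     = phase_done                               # task phase completed
    return periodic or threshold or event

assert should_compress(10, 5_000, 128_000, False)     # periodic fires
assert should_compress(7, 110_000, 128_000, False)    # threshold fires
assert should_compress(7, 5_000, 128_000, True)       # event fires
assert not should_compress(7, 5_000, 128_000, False)  # nothing fires
```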
Manus’s Rhythm Preservation
Why raw recent turns matter more than you think
The Rhythm Effect
Manus made a subtle but important discovery: keep the most recent tool calls in raw format so the model maintains its “rhythm” and formatting style. When recent tool interactions are summarized instead of kept raw, the model loses its formatting consistency — it starts producing outputs in slightly different structures, breaking downstream parsers and tool integrations.
The Error Trace Rule
Manus’s second critical finding: do not compress away error traces. When a tool call fails, leaving the error and stack trace in context helps the model avoid repeating the same mistake. This technique is well-established — libraries like Instructor use it for structured output retries — and it applies broadly to any agent that calls tools.
Key insight: Compression is lossy by definition. The art is knowing what to lose. Manus’s rules provide clear guidance: never compress recent tool calls (rhythm), never compress error traces (learning), and always compress older successful interactions first (they’re the safest to summarize).
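Those rules reduce to a simple filter over the turn history. A sketch, assuming turns carry an `error` flag; the turn shape and function name are illustrative, not Manus's actual implementation:

```python
# Manus-style selection: never summarize the recent window (rhythm)
# or failed tool calls (error traces). Only what remains is safe to
# hand to the summarizer.

def select_for_compression(turns: list[dict], window: int) -> list[dict]:
    """Return only the turns that are safe to summarize."""
    recent = set(range(max(0, len(turns) - window), len(turns)))
    return [t for i, t in enumerate(turns)
            if i not in recent           # rule 1: keep recent turns raw
            and not t.get("error")]      # rule 2: never compress error traces

turns = [
    {"content": "read file", "error": False},
    {"content": "call API",  "error": True},   # kept raw: error trace
    {"content": "parse",     "error": False},
    {"content": "write",     "error": False},  # kept raw: recent window
]
compressible = select_for_compression(turns, window=1)
print([t["content"] for t in compressible])  # ['read file', 'parse']
```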
Long-Term Memory Approach
Moving history to durable storage and retrieving on demand
The Architecture
Instead of compressing older turns into a summary, the long-term memory approach moves them to a durable storage system (typically a vector database) and retrieves only relevant actions on demand. The context window carries only the recent window plus any retrieved historical context that’s relevant to the current task.
Three-Tier Memory
Working memory: Active context window — immediate reasoning.

Recall storage: Searchable database of recent interactions — retrieved when relevant.

Archival storage: Vector-based long-term memory — accumulated knowledge across all sessions.
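The working/recall split above can be sketched with eviction and retrieval. A real system would use a vector database with embedding similarity; keyword overlap stands in for that here, and all class and method names are illustrative:

```python
# Two of the three tiers: working memory with a fixed capacity, and
# recall storage that receives evicted turns and serves retrieval.

class TieredMemory:
    def __init__(self, working_limit: int):
        self.working: list[str] = []   # active context window
        self.recall: list[str] = []    # searchable store of evicted turns
        self.working_limit = working_limit

    def add(self, turn: str) -> None:
        self.working.append(turn)
        if len(self.working) > self.working_limit:
            # Evict the oldest working turn to recall storage, not the void.
            self.recall.append(self.working.pop(0))

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Pull the k most relevant evicted turns back into view.
        Word overlap is a stand-in for embedding similarity."""
        q = set(query.lower().split())
        score = lambda t: len(q & set(t.lower().split()))
        return sorted(self.recall, key=score, reverse=True)[:k]

mem = TieredMemory(working_limit=3)
for t in ["open config file", "fix parser bug", "run tests", "deploy service"]:
    mem.add(t)
print(mem.working)                       # ['fix parser bug', 'run tests', 'deploy service']
print(mem.retrieve("config file", k=1))  # ['open config file']
```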
When to Use Each Approach
// Decision matrix
Short tasks (5-10 turns):
  → No compression needed          // Context fits comfortably
Medium tasks (10-50 turns):
  → Sliding window + summary       // Best cost/quality tradeoff
Long tasks (50+ turns):
  → Long-term memory + retrieval   // Summary alone loses too much
Cross-session:
  → Archival memory                // Persistent knowledge store
Advanced Compression Techniques
Beyond basic summarization
Prompt Caching
Prompt caching is a server-side feature (available in Anthropic, Groq, and vLLM) that caches the KV states for shared prefixes across requests. If your system prompt and tool definitions are the same across requests, they’re computed once and reused. This provides up to 90% cost savings on repeated content and is the most impactful “compression” technique for the stable prefix.
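Concretely, caching is opted into by marking the end of the stable prefix. The sketch below builds (but does not send) a request in the shape of Anthropic's Messages API `cache_control` blocks; check the current API docs for exact fields, and treat the model name as a placeholder:

```python
# Marking a stable prefix for server-side prompt caching. The payload
# shape follows Anthropic's cache_control convention; everything up to
# and including the marked block is cached and reused across requests
# that share the identical prefix.

def build_request(system_prompt: str, tools: list, user_msg: str) -> dict:
    return {
        "model": "claude-sonnet-4-20250514",  # placeholder model id
        "tools": tools,
        "system": [{
            "type": "text",
            "text": system_prompt,
            # Cache breakpoint: tools + system prompt form the stable
            # prefix, computed once and reused on subsequent requests.
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": [{"role": "user", "content": user_msg}],
    }

req = build_request("You are a coding agent.", [], "List the repo files.")
print(req["system"][0]["cache_control"])  # {'type': 'ephemeral'}
```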
Test-Time Training
An emerging technique: test-time training temporarily fine-tunes the model on the current context, achieving a 35× speedup for 2M-token contexts compared to processing the full context directly. Still experimental, but it points toward a future where compression happens at the model level rather than the prompt level.
Selective Inclusion
The most fundamental compression technique is also the simplest: don’t include what you don’t need. Before adding any content to the context, ask: does the model need this for the next step? If not, leave it out. This “Select, Don’t Dump” principle prevents bloat before compression is even needed.
Key insight: The best compression is prevention. Every token that never enters the context window is a token that never needs to be compressed, cached, or paid for. Selective inclusion is the cheapest and most effective optimization.
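In code, “Select, Don’t Dump” is often nothing more than projecting a raw tool result down to the fields the next step actually needs. A sketch with illustrative field names:

```python
# Selective inclusion: filter a raw tool result before it ever enters
# the context window. Field names below are hypothetical.

def select_fields(api_response: dict, needed: list[str]) -> dict:
    """Keep only the keys the model needs for its next step."""
    return {k: api_response[k] for k in needed if k in api_response}

raw = {"id": 42, "status": "open", "title": "Fix login bug",
       "body": "...", "html": "<div>...</div>", "audit_log": ["..."] * 50}
slim = select_fields(raw, needed=["id", "status", "title"])
print(slim)  # {'id': 42, 'status': 'open', 'title': 'Fix login bug'}
```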
Cost Impact
Real-world savings from compression
Before & After
// Long-running agent task (50 turns)

Without compression:
  Avg context size: 95,000 tokens
  Cost per request: $0.24
  Total task cost:  $12.00

With sliding window + summary:
  Avg context size: 25,000 tokens
  Cost per request: $0.06
  Total task cost:  $3.00   // + ~$0.50 for compression calls

Net savings: 71%
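The net-savings figure follows directly from those numbers once the compression calls are added back in:

```python
# Reproducing the arithmetic from the example above.
without = 50 * 0.24            # 50 requests x $0.24 = $12.00
with_summary = 50 * 0.06       # 50 requests x $0.06 = $3.00
compression_calls = 0.50       # ~$0.50 of extra summarization LLM calls

net = 1 - (with_summary + compression_calls) / without
print(f"net savings: {net:.0%}")  # net savings: 71%
```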
Enterprise Scale
Companies implementing effective context compression report 60–80% cost reduction on long-running agent tasks. At enterprise scale (thousands of agent tasks per day), this translates to hundreds of thousands of dollars in annual savings. The compression LLM calls add cost, but the net savings are overwhelmingly positive.
Key insight: Compression improves both cost and quality. Smaller, more focused contexts produce better model outputs (less noise, more signal) while costing less. This is one of the rare optimizations where you get better results for less money.
Tradeoffs and Open Questions
What compression loses and what remains unsolved
The Lossy Nature
All compression is lossy. Summarization preserves the gist but loses details. The question is always: which details can you afford to lose? Compression works well for long-horizon tasks where early steps are contextual background, but poorly when critical early details get summarized away — like a specific error message or a user’s exact phrasing of a requirement.
Compression Quality
The quality of the compression depends on the quality of the summarization model. Using a cheaper, faster model for compression saves money but may lose important nuance. Using the same model for compression and reasoning is more accurate but doubles the cost of each compression step.
Open Questions
When to compress? Too early loses detail; too late wastes tokens on bloated context.

What detail level? Aggressive compression saves more tokens but loses more information.

How to evaluate? Measuring whether a summary preserved “enough” information is itself an unsolved problem. Probe-based evaluation (asking the model questions about compressed content) is the current best practice.
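A probe check can be sketched as asking known-answer questions against the summary. Here plain substring matching stands in for an LLM judge, and all names and the example facts are illustrative:

```python
# Probe-based evaluation: facts known from the raw history should
# still be recoverable from the compressed summary.

def probe_summary(summary: str, probes: dict[str, str]) -> float:
    """Fraction of known facts still present in the summary."""
    hits = sum(1 for answer in probes.values()
               if answer.lower() in summary.lower())
    return hits / len(probes)

summary = "Agent fixed the parser after a TimeoutError from the fetch tool."
probes = {
    "What error occurred?": "TimeoutError",
    "What was fixed?": "parser",
    "Which port was used?": "8080",   # this detail was lost in compression
}
print(round(probe_summary(summary, probes), 2))  # 0.67
```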
Rule of thumb: Start with hybrid sliding window (keep the latest 5 turns raw, summarize older ones). Use probe-based evaluation to test whether your summaries preserve what matters. Adjust the window size and summary detail level based on task-specific results.