Key Insights — Context Engineering

A high-level summary of the core concepts across all 8 chapters.
Foundations
The Paradigm Shift
Chapters 1 – 3
Chapter 1
“The hottest new programming language is English — but the real skill is what you put in the context window.”
  • Context engineering replaced prompt engineering as the core AI skill in mid-2025, championed by Andrej Karpathy and Shopify CEO Tobi Lütke.
  • Formal definition: the discipline of managing the complete information environment an LLM sees — what enters, when, and how it’s structured.
  • Five core pillars: progressive disclosure, compression, routing, retrieval, and token budgeting — all working together as a layered system.
Chapter 2
“A 200K context window doesn’t mean you should use 200K tokens.”
  • The context window contains 8 components: system prompt, user prompt, conversation history, RAG documents, tool schemas, few-shot examples, memory stores, and metadata.
  • The “lost in the middle” problem: LLMs attend strongly to the beginning and end of context but have blind spots in the middle. Place critical information at the edges.
  • KV-cache stores pre-computed attention states and is the single most impactful optimization for production AI agents — reducing latency and cost by up to 10x.
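The "lost in the middle" point above can be sketched in a few lines. This is a hypothetical helper, not from any particular framework: it ranks context blocks by an assumed `priority` field and places the two most critical blocks at the edges of the window, where attention is strongest.

```python
# Hypothetical sketch: order context blocks so critical items sit at the
# edges of the window (strong attention) and low-priority items fill the
# middle (the blind spot). Block names and priorities are illustrative.

def arrange_context(blocks):
    """Put the two highest-priority blocks first and last;
    everything else goes in the middle."""
    ranked = sorted(blocks, key=lambda b: b["priority"], reverse=True)
    if len(ranked) < 3:
        return ranked
    first, second, *rest = ranked
    return [first, *rest, second]

blocks = [
    {"name": "few_shot_examples", "priority": 2},
    {"name": "system_prompt", "priority": 10},
    {"name": "rag_documents", "priority": 4},
    {"name": "user_prompt", "priority": 9},
]

layout = arrange_context(blocks)
print([b["name"] for b in layout])
# system_prompt first, user_prompt last, low-priority blocks in the middle
```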
Chapter 3
“Don’t load everything at once — load what’s needed, when it’s needed.”
  • Three-tier loading: discovery (agent sees skill exists), activation (agent reads the skill), execution (agent follows the instructions). Each tier adds tokens only when relevant.
  • Agent Skills are markdown files with YAML frontmatter that agents load on demand. Anthropic released the pattern in Dec 2025; adopted by OpenAI, Google, and Cursor.
  • Progressive disclosure can reduce baseline context usage by 60–80% compared to loading all instructions upfront.
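The three tiers above can be illustrated with a minimal loader. The file format (markdown with YAML-style frontmatter) follows the pattern described in the text, but the skill content and the parsing code here are a simplified stand-in, not Anthropic's implementation.

```python
# Minimal sketch of tiered skill loading: discovery reads only the
# frontmatter (a handful of tokens); the instruction body enters context
# only on activation. Skill content is made up for illustration.

SKILL_FILE = """---
name: csv-analysis
description: Analyze CSV files and summarize columns
---
## Instructions
1. Load the CSV with a streaming reader.
2. Report column names, types, and null counts.
"""

def discover(skill_text):
    """Tier 1: expose only name/description, so the agent knows the skill exists."""
    _, frontmatter, _ = skill_text.split("---", 2)
    meta = {}
    for line in frontmatter.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta

def activate(skill_text):
    """Tiers 2-3: load the full instruction body only when the skill is relevant."""
    return skill_text.split("---", 2)[2].strip()

meta = discover(SKILL_FILE)
print(meta["name"])          # csv-analysis
body = activate(SKILL_FILE)  # full instructions enter context only now
```

The token saving comes from the asymmetry: dozens of skills can sit at the discovery tier for the cost of their frontmatter alone.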
Bottom line: Context engineering is the shift from “write a better prompt” to “design the entire information environment.” The context window is a finite resource with attention blind spots. Progressive disclosure is the first line of defense — load only what’s needed, when it’s needed.
Core Techniques
Compression, Routing & Retrieval
Chapters 4 – 6
Chapter 4
“Never compress error traces — the agent needs them verbatim to avoid repeating mistakes.”
  • The dominant pattern is sliding window + summarization: keep recent turns raw, summarize older turns, preserve critical elements like tool call sequences and error traces.
  • Manus’s key lesson: preserve the “rhythm” of tool calls in summaries. Agents that lose their action history repeat failed approaches.
  • Effective compression achieves 60–80% token cost reduction while maintaining task completion quality.
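A sliding window with selective summarization might look like the sketch below. The turn structure (a `kind` field per turn) is an assumption for illustration; the key behavior is that recent turns stay raw, older turns are summarized, and error traces are always preserved verbatim.

```python
# Sketch of sliding-window compression. Assumes each turn is a dict with a
# "kind" field (hypothetical structure). Older turns are replaced by stub
# summaries; error traces are never compressed, so the agent can see
# exactly which approaches already failed.

def compress_history(turns, window=3):
    keep_raw = turns[-window:]          # recent turns stay verbatim
    compressed = []
    for turn in turns[:-window]:
        if turn["kind"] == "error":
            compressed.append(turn)     # verbatim: needed to avoid repeats
        else:
            compressed.append({"kind": "summary",
                               "text": f"[summarized: {turn['kind']}]"})
    return compressed + keep_raw

history = [
    {"kind": "tool_call", "text": "read_file(a.py)"},
    {"kind": "error", "text": "Traceback: FileNotFoundError ..."},
    {"kind": "tool_call", "text": "list_dir(.)"},
    {"kind": "assistant", "text": "Found the file."},
    {"kind": "user", "text": "Now edit it."},
]

out = compress_history(history, window=2)
```

A production version would replace the stub summaries with an LLM-generated digest that keeps the sequence of tool calls intact, per the Manus lesson above.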
Chapter 5
“The best token is the one you never load.”
  • Context routing classifies queries and directs them to the right context source before anything enters the window — preventing irrelevant tokens from consuming budget.
  • Four routing strategies: rule-based (keyword matching), LLM-based (model classifies), hierarchical (lead agent triages), and hybrid (combining methods).
  • Good routing reduces downstream token usage by 40–70% because only relevant context enters the window.
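The simplest of the four strategies, rule-based routing, can be sketched in a few lines. The keyword tables and source names below are invented for illustration.

```python
# Toy rule-based router: keyword matching decides which context source a
# query is sent to, before any documents enter the window. Routes and
# keywords are hypothetical.

ROUTES = {
    "billing_docs": ("invoice", "refund", "charge"),
    "code_search": ("function", "bug", "stack trace"),
}

def route(query, default="general_docs"):
    q = query.lower()
    for source, keywords in ROUTES.items():
        if any(k in q for k in keywords):
            return source
    return default

print(route("Why was I charged twice?"))   # billing_docs
print(route("What is your mission?"))      # general_docs
```

In a hybrid setup, queries that match no rule would fall through to an LLM-based classifier instead of a static default.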
Chapter 6
“Traditional RAG is a fixed pipeline. Agentic RAG is a reasoning loop.”
  • Agentic RAG: the agent plans its search strategy, evaluates results, and iterates — achieving 42% higher faithfulness than traditional fixed-pipeline RAG.
  • Graph RAG adds relational reasoning (entity relationships); Self-RAG lets models decide when to retrieve and critique their own outputs.
  • Guardrails are essential: max retrieval rounds, confidence thresholds, and fallback strategies prevent runaway retrieval loops.
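The reasoning loop plus guardrails can be sketched as below. `search` and its confidence score are hypothetical stand-ins for a real retriever and grader; the point is the control flow: iterate, check confidence, cap the rounds, fall back.

```python
# Sketch of an agentic retrieval loop with the guardrails listed above:
# a max-rounds cap, a confidence threshold, and a fallback strategy.
# The retriever is a canned stand-in for illustration.

MAX_ROUNDS = 3
CONFIDENCE_THRESHOLD = 0.8

def search(query):
    """Stand-in retriever: returns (documents, confidence)."""
    canned = {"alpha": (["doc about alpha"], 0.9)}
    return canned.get(query, ([], 0.1))

def agentic_retrieve(query):
    for _ in range(MAX_ROUNDS):          # guardrail 1: bounded iteration
        docs, confidence = search(query)
        if confidence >= CONFIDENCE_THRESHOLD:
            return docs                  # guardrail 2: stop when good enough
        query = query + " (refined)"     # a real agent would rewrite the query
    # guardrail 3: fallback instead of looping forever
    return ["FALLBACK: answer from model knowledge, flag low confidence"]

print(agentic_retrieve("alpha"))
print(agentic_retrieve("unknown topic"))
```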
Bottom line: Compression shrinks what stays, routing directs what enters, and retrieval fetches what’s needed. Together they form the runtime context management layer. The key principle: every token in the window should earn its place.
Production
Tools & Token Economics
Chapters 7 – 8
Chapter 7
“Each tool schema costs 500+ tokens. With 90 tools, that’s 45K tokens before the user says anything.”
  • MCP (Model Context Protocol) is the emerging standard for tool integration. Tool schemas consume significant context — 20–40% of the window in production systems.
  • KV-cache invalidation: Manus discovered that dynamically changing available tools invalidates the cache, causing massive latency spikes. Keep tool sets stable.
  • Progressive tool disclosure — loading tool schemas only when relevant — is the primary mitigation for tool token bloat.
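The arithmetic and the mitigation above can be sketched together. The ~500-tokens-per-schema figure comes from the text; the catalog and the naive keyword-overlap relevance filter are invented for illustration.

```python
# Back-of-envelope sketch of tool token bloat and progressive tool
# disclosure. Tool names/descriptions are hypothetical; the relevance
# filter is a naive word-overlap match, not a real implementation.

TOKENS_PER_SCHEMA = 500  # rough per-schema cost cited in the text

tools = {
    "read_file": "read the contents of a file",
    "send_email": "send an email to a recipient",
    "query_db": "run a sql query against the database",
}

def schema_cost(tool_names):
    return len(tool_names) * TOKENS_PER_SCHEMA

def relevant_tools(task, catalog):
    """Load only schemas whose description overlaps the task wording."""
    words = {w for w in task.lower().split() if len(w) > 3}
    return [name for name, desc in catalog.items()
            if words & set(desc.split())]

task = "query the sales database"
selected = relevant_tools(task, tools)
print(selected, schema_cost(selected), "vs", schema_cost(tools))
```

Note that per the KV-cache lesson above, the selected tool set should stay stable within a session; disclosure decisions belong at session start, not mid-conversation.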
Chapter 8
“Treat your context window like a financial budget — every token has a cost and a return.”
  • Target 40–60% context utilization with a 30–40% safety margin. KV-cache hit rate is the single most important production metric.
  • Prompt caching (stable prefix reuse) can achieve 90% cost savings on the cached portion. Deterministic serialization maximizes cache hits.
  • The layered architecture combines all patterns: disclosure → tools → routing → retrieval → compression → budgeting. Real-world case studies show 73–87% cost reductions.
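Deterministic serialization, mentioned above as a cache-hit maximizer, is easy to demonstrate with the standard library. The config dicts are illustrative; the mechanism (sorted keys, fixed separators) is standard `json` behavior.

```python
# Deterministic serialization for cache-friendly prompts: two dicts with
# the same content but different insertion order serialize to
# byte-identical strings when keys are sorted, keeping the prompt prefix
# stable so the cached portion can be reused.

import json

config_a = {"model": "m1", "tools": ["read", "write"], "temperature": 0}
config_b = {"temperature": 0, "model": "m1", "tools": ["read", "write"]}

stable_a = json.dumps(config_a, sort_keys=True, separators=(",", ":"))
stable_b = json.dumps(config_b, sort_keys=True, separators=(",", ":"))

print(stable_a == stable_b)  # True: identical prefix, so the cache hits
print(stable_a)
```

Without `sort_keys=True`, the two configs would serialize differently, the prefix would change between requests, and every request would pay full price.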
Bottom line: Context engineering is not one technique — it’s a layered system. Start with progressive disclosure (cheapest), add routing and compression, optimize retrieval, manage tools carefully, and budget every token. The compound effect of all layers is transformative.