Ch 6 — Retrieval Evolution (Agentic RAG)

From fixed pipelines to agent-controlled retrieval loops
High Level
Query → Retrieve → Evaluate → Iterate → Graph → Generate
Traditional RAG: The Fixed Pipeline
Query → vector search → inject → generate
How Traditional RAG Works
Traditional RAG follows a fixed pipeline: take the user’s query, convert it to an embedding, search a vector database for similar chunks, inject the top-K results into the context, and generate a response. The pipeline is deterministic — the same query always produces the same retrieval, regardless of whether the results are sufficient.
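The fixed pipeline can be sketched in a few lines. Here `embed`, `vector_search`, and `generate` are hypothetical stand-ins for an embedding model, a vector store, and an LLM call — the point is the shape, not the specific libraries:

```python
def traditional_rag(query, embed, vector_search, generate, top_k=3):
    """Fixed pipeline: embed -> search -> inject -> generate.

    Runs exactly once regardless of result quality -- there is no
    evaluation step and no second retrieval attempt.
    """
    query_vec = embed(query)                    # 1. convert query to embedding
    chunks = vector_search(query_vec, top_k)    # 2. fetch top-K similar chunks
    context = "\n\n".join(chunks)               # 3. inject into the prompt
    return generate(f"Context:\n{context}\n\nQuestion: {query}")  # 4. generate
```

Note that nothing in this function inspects `chunks` before generating — that blind spot is exactly what the rest of this chapter addresses.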
Where It Breaks
A question like “What themes emerge across this quarter’s customer feedback?” requires connecting information across multiple documents — something vector similarity search cannot do. A question where the first retrieval returns insufficient results needs a second attempt with a reformulated query — something a fixed pipeline cannot do.
Key insight: Traditional RAG treats retrieval as a single-shot operation. But real-world questions often require multiple retrieval attempts, query reformulation, and cross-document reasoning. The pipeline needs to become a loop.
Agentic RAG
Putting retrieval under agent control
The Shift
Agentic RAG puts retrieval under agent control. Instead of a fixed pipeline, the agent decides its own search strategy, can reformulate queries when results are insufficient, and iterates until confident. The retrieval loop replaces the retrieval pipeline. The agent becomes the orchestrator of its own knowledge gathering.
The Agentic Loop
// Agentic RAG loop
1. Analyze query → plan retrieval strategy
2. Execute search with initial query
3. Evaluate results:
   → Sufficient? Generate answer.
   → Insufficient? Reformulate query.
   → Contradictory? Search for resolution.
4. Repeat until confident (max N rounds)
5. Generate response with curated context
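The loop above can be sketched as follows. The `search`, `evaluate`, `reformulate`, and `generate` callables are hypothetical stand-ins for the agent's tools; a real implementation would back them with a vector store and LLM calls:

```python
def agentic_rag(query, search, evaluate, reformulate, generate, max_rounds=4):
    """Retrieval as a loop: search, evaluate, reformulate, repeat."""
    current_query, context = query, []
    for _ in range(max_rounds):                # cap prevents runaway loops
        context.extend(search(current_query))
        # evaluate returns "sufficient", "insufficient", or "contradictory"
        verdict = evaluate(query, context)
        if verdict == "sufficient":
            break
        # Insufficient or contradictory results: rewrite the query and retry.
        current_query = reformulate(query, context, verdict)
    return generate(query, context)
```

The key difference from the fixed pipeline is the `evaluate` step: retrieval only stops when the results are judged sufficient or the round cap is hit.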
Impact
Agentic RAG improves faithfulness metrics by 42% versus traditional RAG, according to research on enterprise AI systems. The improvement comes from the agent’s ability to recognize when initial results are insufficient and take corrective action — something a fixed pipeline cannot do.
Why it matters: Agentic RAG transforms retrieval from a passive lookup into an active reasoning process. The agent doesn’t just find similar text — it evaluates relevance, identifies gaps, and strategically fills them.
Graph RAG
Adding relational reasoning to retrieval
The Problem with Vector Search
Standard vector search finds similar text but cannot connect entities across documents. If Document A mentions “Project Alpha was led by Sarah” and Document B mentions “Sarah’s team delivered 40% cost reduction,” vector search treats these as unrelated. The connection between Sarah, Project Alpha, and the cost reduction is invisible.
How Graph RAG Works
Graph RAG builds entity-relationship graphs over the corpus. Entities (people, projects, concepts) become nodes; relationships (led, delivered, caused) become edges. This enables thematic and relational questions that require connecting information across multiple sources — like “What were the outcomes of all projects Sarah led?”
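A toy version of the Sarah example makes the mechanism concrete. Here the graph is a plain list of (subject, relation, object) triples; a production system would use a graph database or triple store, and the extraction itself would be done by an LLM or NER pipeline:

```python
# Triples extracted from the corpus: (subject, relation, object).
triples = [
    ("Sarah", "led", "Project Alpha"),
    ("Project Alpha", "delivered", "40% cost reduction"),
    ("Sarah", "led", "Project Beta"),
    ("Project Beta", "delivered", "faster onboarding"),
]

def outcomes_of_projects_led_by(person, triples):
    """Two-hop traversal: person -led-> project -delivered-> outcome."""
    projects = {obj for subj, rel, obj in triples
                if subj == person and rel == "led"}
    return sorted(obj for subj, rel, obj in triples
                  if subj in projects and rel == "delivered")
```

Vector search over the raw documents cannot answer this query, because no single chunk mentions both Sarah and the outcomes; the two-hop traversal is what connects them.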
Key insight: Vector search answers “what text is similar to my query?” Graph RAG answers “what entities and relationships are relevant to my query?” The two are complementary — vector search for content similarity, graph search for relational reasoning.
Self-RAG
Models that decide when to retrieve and critique their own outputs
The Concept
Self-RAG trains models to make three decisions autonomously: (1) Assess whether they have enough information before answering — triggering retrieval only when needed. (2) Evaluate the quality of retrieved results before using them. (3) Critique their own outputs for faithfulness to the retrieved context.
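The three decision points can be sketched as follows. In the actual Self-RAG work these decisions come from trained reflection tokens; here `needs_retrieval`, `grade_chunk`, and `is_faithful` are hypothetical predicates standing in for those predictions:

```python
def self_rag(query, needs_retrieval, search, grade_chunk, generate, is_faithful):
    """Three autonomous decisions: retrieve at all? keep this chunk? faithful answer?"""
    context = []
    if needs_retrieval(query):               # (1) retrieve only when needed
        # (2) filter low-quality chunks before they dilute attention
        context = [c for c in search(query) if grade_chunk(query, c)]
    answer = generate(query, context)
    if context and not is_faithful(answer, context):  # (3) self-critique
        answer = generate(query, context)    # regenerate (or flag for review)
    return answer
```

Decision (1) is where the latency and cost savings come from: for queries the model can already answer, `search` is never called.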
Why It Matters
Traditional RAG retrieves for every query, even when the model already knows the answer. Self-RAG eliminates unnecessary retrievals, reducing latency and cost. When it does retrieve, it evaluates quality before injecting into context — preventing low-quality chunks from diluting the model’s attention.
Key insight: Self-RAG adds metacognition to retrieval. The model doesn’t just retrieve and generate — it reasons about whether retrieval is needed, whether the results are good enough, and whether its answer is faithful to the evidence.
Combining All Three
The most advanced retrieval architecture
The Combined Architecture
The most advanced work combines all three approaches: Agentic RAG for iterative, strategy-driven retrieval. Graph RAG for relational reasoning across documents. Self-RAG for self-assessment and quality control. The agent plans its retrieval strategy, uses both vector and graph search, evaluates results, and iterates until confident.
The Flow
// Combined retrieval architecture
Agent receives query
→ Self-assess: Do I need retrieval?
→ Plan: Vector search, graph search, or both?
→ Execute: Run planned searches
→ Evaluate: Are results sufficient?
→ Iterate: Reformulate if needed
→ Generate: Answer with curated context
→ Critique: Is answer faithful?
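The combined flow is essentially a thin orchestrator over the three layers. A sketch, with every callable a hypothetical stand-in: Self-RAG decides *whether* to retrieve, the plan decides *how* (vector, graph, or both), the agentic loop decides *how long*, and a final critique checks faithfulness:

```python
def combined_rag(query, needs_retrieval, plan, vector_search, graph_search,
                 evaluate, reformulate, generate, critique, max_rounds=3):
    """Agentic + Graph + Self-RAG composed into one retrieval loop."""
    if not needs_retrieval(query):                   # Self-RAG: skip retrieval
        return generate(query, [])
    context, q = [], query
    for _ in range(max_rounds):                      # agentic loop with round cap
        strategy = plan(q)                           # "vector" | "graph" | "both"
        if strategy in ("vector", "both"):
            context.extend(vector_search(q))
        if strategy in ("graph", "both"):            # Graph RAG: relational hops
            context.extend(graph_search(q))
        if evaluate(query, context):                 # sufficient?
            break
        q = reformulate(query, context)              # else try a new query
    answer = generate(query, context)
    return answer if critique(answer, context) else generate(query, context)
```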
Key insight: Each layer addresses a different failure mode. Agentic RAG handles insufficient results. Graph RAG handles relational questions. Self-RAG handles unnecessary retrieval and unfaithful generation. Together, they cover the full spectrum of retrieval failures.
Chunking Strategy
How you split documents determines retrieval quality
Why Chunking Matters
Before any retrieval happens, documents must be split into chunks for indexing. Chunking strategy is one of the most impactful decisions in a RAG system. Chunks too small lose context; chunks too large dilute relevance. The optimal size depends on the content type, the embedding model, and the query patterns.
Common Strategies
Fixed-size: Split every N tokens. Simple but ignores semantic boundaries.

Semantic: Split at paragraph or section boundaries. Preserves meaning but produces uneven sizes.

Hierarchical: Create chunks at multiple granularities (paragraph, section, document) and retrieve at the appropriate level.

Overlap: Include N tokens of overlap between adjacent chunks to preserve context at boundaries.
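The fixed-size and overlap strategies combine naturally. A minimal sketch, operating on a pre-tokenized list (real systems would use the embedding model's own tokenizer):

```python
def chunk_fixed(tokens, size=200, overlap=20):
    """Fixed-size chunking with overlap: each chunk shares `overlap`
    tokens with its predecessor, preserving context at boundaries."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    chunks = []
    for i in range(0, len(tokens), step):
        chunks.append(tokens[i:i + size])
        if i + size >= len(tokens):   # last chunk reached the end
            break
    return chunks
```

Semantic and hierarchical chunking replace the fixed `step` with boundary detection (paragraphs, headings) but keep the same overlap idea at the seams.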
Embedding Choice
The embedding model converts chunks into vectors for similarity search. Different models have different strengths: general-purpose embeddings (OpenAI text-embedding-3, Cohere embed-v3) work well for broad content. Domain-specific embeddings (fine-tuned on legal, medical, or technical text) outperform general models in specialized domains.
Reranking
Reranking adds a second pass after initial retrieval: a cross-encoder model scores each retrieved chunk against the original query for precise relevance. This catches chunks that are semantically similar but not actually relevant, and promotes chunks that are relevant but weren’t top-ranked by vector similarity alone.
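The two-pass shape can be sketched as follows, with `cross_encoder_score` a hypothetical stand-in for a cross-encoder model (e.g. a sentence-pair classifier) scoring each (query, chunk) pair:

```python
def retrieve_and_rerank(query, vector_search, cross_encoder_score,
                        k_retrieve=50, k_final=5):
    """First pass: cheap vector search over many candidates.
    Second pass: precise cross-encoder scoring, keep the best few."""
    candidates = vector_search(query, k_retrieve)
    scored = sorted(candidates,
                    key=lambda chunk: cross_encoder_score(query, chunk),
                    reverse=True)
    return scored[:k_final]
```

The design rationale: cross-encoders are far more accurate than embedding similarity but too slow to run over the whole corpus, so the cheap first pass narrows the field and the expensive second pass reorders only the survivors.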
Tradeoffs
Cost, latency, and reliability concerns
Latency
Agentic RAG is significantly slower than traditional RAG. A single question might trigger 3–5 retrieval cycles, each adding search time plus the agent’s reasoning about strategy. For real-time applications (chatbots, live support), this latency may be unacceptable. For background tasks (research, analysis), it’s a worthwhile tradeoff.
Token Cost
Each retrieval cycle adds retrieved chunks to context plus the agent’s reasoning about strategy. Cost scales with question complexity. A simple factual lookup might need one cycle; a thematic analysis across a corpus might need five. Without guardrails, agentic RAG can over-retrieve on simple questions.
Required Guardrails
Maximum retrieval rounds: Cap iterations to prevent runaway loops.

Confidence thresholds: Stop iterating when the agent is confident enough, not when it’s perfect.

Fallback to direct generation: If retrieval consistently fails, let the model answer from its training data rather than looping indefinitely.

Cost budgets: Set per-query token budgets that include retrieval overhead.
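The four guardrails above slot directly into the retrieval loop. A sketch, with hypothetical `search`, `confidence`, and `generate` callables and a crude word-count standing in for real token accounting:

```python
def guarded_retrieval(query, search, confidence, generate,
                      max_rounds=3, conf_threshold=0.8, token_budget=4000):
    """Agentic retrieval with all four guardrails applied."""
    context, spent = [], 0
    for _ in range(max_rounds):                       # guardrail: round cap
        chunks = search(query)
        spent += sum(len(c.split()) for c in chunks)  # crude token count
        context.extend(chunks)
        if confidence(query, context) >= conf_threshold:
            return generate(query, context)           # guardrail: good enough, stop
        if spent >= token_budget:                     # guardrail: cost budget
            break
    if not context:
        return generate(query, None)                  # guardrail: fall back to parametric knowledge
    return generate(query, context)                   # best effort with what we have
```

Note the threshold check is "confident enough", not "perfect" — the loop exits on the first acceptable result rather than chasing marginal improvements.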
Critical in AI: Agentic RAG without guardrails can enter infinite retrieval loops on unanswerable questions. Maximum rounds and confidence thresholds are not optional — they’re safety requirements.
Retrieval as Context Engineering
How retrieval fits the broader discipline
The Context Engineering Lens
From a context engineering perspective, retrieval is the mechanism for bringing external knowledge into the context window on demand. It’s the bridge between the model’s training data (static, potentially outdated) and the organization’s current knowledge (dynamic, up-to-date). The quality of retrieval directly determines the quality of the context.
Integration with Other Patterns
Routing (Ch 5) determines which knowledge base to search. Retrieval (this chapter) fetches specific documents from that knowledge base. Compression (Ch 4) shrinks the retrieved content to fit the budget. Progressive disclosure (Ch 3) determines whether retrieval instructions are even loaded. Each pattern handles a different dimension of the same problem: getting the right information to the model.
Key insight: RAG has matured from a simple “retrieve and stuff” pattern into a sophisticated, agent-controlled knowledge gathering system. The evolution mirrors the broader shift from static prompt engineering to dynamic context engineering.