Ch 6 — Retrieval Evolution (Agentic RAG)

From fixed pipelines to agent-controlled retrieval loops
High Level
Query → Retrieve → Evaluate → Iterate → Graph → Generate
Traditional RAG: The Fixed Pipeline
Query → vector search → inject → generate
How Traditional RAG Works
Traditional RAG follows a fixed pipeline: take the user’s query, convert it to an embedding, search a vector database for similar chunks, inject the top-K results into the context, and generate a response. The pipeline is deterministic — the same query always produces the same retrieval, regardless of whether the results are sufficient.
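The fixed pipeline can be sketched in a few lines. Here `embed`, `vector_search`, and `generate` are hypothetical stand-ins for an embedding model, a vector store, and an LLM call — the point is the shape, not the specific libraries:

```python
def traditional_rag(query, embed, vector_search, generate, top_k=3):
    """Fixed pipeline: embed -> search -> inject -> generate.

    Runs exactly once regardless of result quality -- there is no
    evaluation step and no second retrieval attempt.
    """
    query_vec = embed(query)                    # 1. convert query to embedding
    chunks = vector_search(query_vec, top_k)    # 2. fetch top-K similar chunks
    context = "\n\n".join(chunks)               # 3. inject into the prompt
    return generate(f"Context:\n{context}\n\nQuestion: {query}")  # 4. generate
```

Note that nothing in this function inspects `chunks` before generating — that blind spot is exactly what the rest of this chapter addresses.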
Where It Breaks
A question like “What themes emerge across this quarter’s customer feedback?” requires connecting information across multiple documents — something vector similarity search cannot do. A question where the first retrieval returns insufficient results needs a second attempt with a reformulated query — something a fixed pipeline cannot do.
Key insight: Traditional RAG treats retrieval as a single-shot operation. But real-world questions often require multiple retrieval attempts, query reformulation, and cross-document reasoning. The pipeline needs to become a loop.
Agentic RAG
Putting retrieval under agent control
The Shift
Agentic RAG puts retrieval under agent control. Instead of a fixed pipeline, the agent decides its own search strategy, can reformulate queries when results are insufficient, and iterates until confident. The retrieval loop replaces the retrieval pipeline. The agent becomes the orchestrator of its own knowledge gathering.
The Agentic Loop
// Agentic RAG loop
1. Analyze query → plan retrieval strategy
2. Execute search with initial query
3. Evaluate results:
   → Sufficient? Generate answer.
   → Insufficient? Reformulate query.
   → Contradictory? Search for resolution.
4. Repeat until confident (max N rounds)
5. Generate response with curated context
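The loop above can be sketched as follows. The `search`, `evaluate`, `reformulate`, and `generate` callables are hypothetical stand-ins for the agent's tools; a real implementation would back them with a vector store and LLM calls:

```python
def agentic_rag(query, search, evaluate, reformulate, generate, max_rounds=4):
    """Retrieval as a loop: search, evaluate, reformulate, repeat."""
    current_query, context = query, []
    for _ in range(max_rounds):                # cap prevents runaway loops
        context.extend(search(current_query))
        # evaluate returns "sufficient", "insufficient", or "contradictory"
        verdict = evaluate(query, context)
        if verdict == "sufficient":
            break
        # Insufficient or contradictory results: rewrite the query and retry.
        current_query = reformulate(query, context, verdict)
    return generate(query, context)
```

The key difference from the fixed pipeline is the `evaluate` step: retrieval only stops when the results are judged sufficient or the round cap is hit.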
Impact
Agentic RAG improves faithfulness metrics by 42% versus traditional RAG, according to research on enterprise AI systems. The improvement comes from the agent’s ability to recognize when initial results are insufficient and take corrective action — something a fixed pipeline cannot do.
Why it matters: Agentic RAG transforms retrieval from a passive lookup into an active reasoning process. The agent doesn’t just find similar text — it evaluates relevance, identifies gaps, and strategically fills them.
Graph RAG
Adding relational reasoning to retrieval
The Problem with Vector Search
Standard vector search finds similar text but cannot connect entities across documents. If Document A mentions “Project Alpha was led by Sarah” and Document B mentions “Sarah’s team delivered 40% cost reduction,” vector search treats these as unrelated. The connection between Sarah, Project Alpha, and the cost reduction is invisible.
How Graph RAG Works
Graph RAG builds entity-relationship graphs over the corpus. Entities (people, projects, concepts) become nodes; relationships (led, delivered, caused) become edges. This enables thematic and relational questions that require connecting information across multiple sources — like “What were the outcomes of all projects Sarah led?”
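A toy version of the Sarah example makes the mechanism concrete. Here the graph is a plain list of (subject, relation, object) triples; a production system would use a graph database or triple store, and the extraction itself would be done by an LLM or NER pipeline:

```python
# Triples extracted from the corpus: (subject, relation, object).
triples = [
    ("Sarah", "led", "Project Alpha"),
    ("Project Alpha", "delivered", "40% cost reduction"),
    ("Sarah", "led", "Project Beta"),
    ("Project Beta", "delivered", "faster onboarding"),
]

def outcomes_of_projects_led_by(person, triples):
    """Two-hop traversal: person -led-> project -delivered-> outcome."""
    projects = {obj for subj, rel, obj in triples
                if subj == person and rel == "led"}
    return sorted(obj for subj, rel, obj in triples
                  if subj in projects and rel == "delivered")
```

Vector search over the raw documents cannot answer this query, because no single chunk mentions both Sarah and the outcomes; the two-hop traversal is what connects them.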
Key insight: Vector search answers “what text is similar to my query?” Graph RAG answers “what entities and relationships are relevant to my query?” The two are complementary — vector search for content similarity, graph search for relational reasoning.
Self-RAG
Models that decide when to retrieve and critique their own outputs
The Concept
Self-RAG trains models to make three decisions autonomously: (1) Assess whether they have enough information before answering — triggering retrieval only when needed. (2) Evaluate the quality of retrieved results before using them. (3) Critique their own outputs for faithfulness to the retrieved context.
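The three decision points can be sketched as follows. In the actual Self-RAG work these decisions come from trained reflection tokens; here `needs_retrieval`, `grade_chunk`, and `is_faithful` are hypothetical predicates standing in for those predictions:

```python
def self_rag(query, needs_retrieval, search, grade_chunk, generate, is_faithful):
    """Three autonomous decisions: retrieve at all? keep this chunk? faithful answer?"""
    context = []
    if needs_retrieval(query):               # (1) retrieve only when needed
        # (2) filter low-quality chunks before they dilute attention
        context = [c for c in search(query) if grade_chunk(query, c)]
    answer = generate(query, context)
    if context and not is_faithful(answer, context):  # (3) self-critique
        answer = generate(query, context)    # regenerate (or flag for review)
    return answer
```

Decision (1) is where the latency and cost savings come from: for queries the model can already answer, `search` is never called.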
Why It Matters
Traditional RAG retrieves for every query, even when the model already knows the answer. Self-RAG eliminates unnecessary retrievals, reducing latency and cost. When it does retrieve, it evaluates quality before injecting into context — preventing low-quality chunks from diluting the model’s attention.
Key insight: Self-RAG adds metacognition to retrieval. The model doesn’t just retrieve and generate — it reasons about whether retrieval is needed, whether the results are good enough, and whether its answer is faithful to the evidence.
Combining All Three
The most advanced retrieval architecture
The Combined Architecture
The most advanced work combines all three approaches: Agentic RAG for iterative, strategy-driven retrieval. Graph RAG for relational reasoning across documents. Self-RAG for self-assessment and quality control. The agent plans its retrieval strategy, uses both vector and graph search, evaluates results, and iterates until confident.
The Flow
// Combined retrieval architecture
Agent receives query
→ Self-assess: Do I need retrieval?
→ Plan: Vector search, graph search, or both?
→ Execute: Run planned searches
→ Evaluate: Are results sufficient?
→ Iterate: Reformulate if needed
→ Generate: Answer with curated context
→ Critique: Is answer faithful?
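The combined flow is essentially a thin orchestrator over the three layers. A sketch, with every callable a hypothetical stand-in: Self-RAG decides *whether* to retrieve, the plan decides *how* (vector, graph, or both), the agentic loop decides *how long*, and a final critique checks faithfulness:

```python
def combined_rag(query, needs_retrieval, plan, vector_search, graph_search,
                 evaluate, reformulate, generate, critique, max_rounds=3):
    """Agentic + Graph + Self-RAG composed into one retrieval loop."""
    if not needs_retrieval(query):                   # Self-RAG: skip retrieval
        return generate(query, [])
    context, q = [], query
    for _ in range(max_rounds):                      # agentic loop with round cap
        strategy = plan(q)                           # "vector" | "graph" | "both"
        if strategy in ("vector", "both"):
            context.extend(vector_search(q))
        if strategy in ("graph", "both"):            # Graph RAG: relational hops
            context.extend(graph_search(q))
        if evaluate(query, context):                 # sufficient?
            break
        q = reformulate(query, context)              # else try a new query
    answer = generate(query, context)
    return answer if critique(answer, context) else generate(query, context)
```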
Key insight: Each layer addresses a different failure mode. Agentic RAG handles insufficient results. Graph RAG handles relational questions. Self-RAG handles unnecessary retrieval and unfaithful generation. Together, they cover the full spectrum of retrieval failures.
Chunking Strategy
How you split documents determines retrieval quality
Why Chunking Matters
Before any retrieval happens, documents must be split into chunks for indexing. Chunking strategy is one of the most impactful decisions in a RAG system. Chunks too small lose context; chunks too large dilute relevance. The optimal size depends on the content type, the embedding model, and the query patterns.
Common Strategies
Fixed-size: Split every N tokens. Simple but ignores semantic boundaries.

Semantic: Split at paragraph or section boundaries. Preserves meaning but produces uneven sizes.

Hierarchical: Create chunks at multiple granularities (paragraph, section, document) and retrieve at the appropriate level.

Overlap: Include N tokens of overlap between adjacent chunks to preserve context at boundaries.
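The fixed-size and overlap strategies combine naturally. A minimal sketch, operating on a pre-tokenized list (real systems would use the embedding model's own tokenizer):

```python
def chunk_fixed(tokens, size=200, overlap=20):
    """Fixed-size chunking with overlap: each chunk shares `overlap`
    tokens with its predecessor, preserving context at boundaries."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    chunks = []
    for i in range(0, len(tokens), step):
        chunks.append(tokens[i:i + size])
        if i + size >= len(tokens):   # last chunk reached the end
            break
    return chunks
```

Semantic and hierarchical chunking replace the fixed `step` with boundary detection (paragraphs, headings) but keep the same overlap idea at the seams.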
Embedding Choice
The embedding model converts chunks into vectors for similarity search. Different models have different strengths: general-purpose embeddings (OpenAI text-embedding-3, Cohere embed-v3) work well for broad content. Domain-specific embeddings (fine-tuned on legal, medical, or technical text) outperform general models in specialized domains.
Reranking
Reranking adds a second pass after initial retrieval: a cross-encoder model scores each retrieved chunk against the original query for precise relevance. This catches chunks that are semantically similar but not actually relevant, and promotes chunks that are relevant but weren’t top-ranked by vector similarity alone.
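The two-pass shape can be sketched as follows, with `cross_encoder_score` a hypothetical stand-in for a cross-encoder model (e.g. a sentence-pair classifier) scoring each (query, chunk) pair:

```python
def retrieve_and_rerank(query, vector_search, cross_encoder_score,
                        k_retrieve=50, k_final=5):
    """First pass: cheap vector search over many candidates.
    Second pass: precise cross-encoder scoring, keep the best few."""
    candidates = vector_search(query, k_retrieve)
    scored = sorted(candidates,
                    key=lambda chunk: cross_encoder_score(query, chunk),
                    reverse=True)
    return scored[:k_final]
```

The design rationale: cross-encoders are far more accurate than embedding similarity but too slow to run over the whole corpus, so the cheap first pass narrows the field and the expensive second pass reorders only the survivors.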
Tradeoffs
Cost, latency, and reliability concerns
Latency
Agentic RAG is significantly slower than traditional RAG. A single question might trigger 3–5 retrieval cycles, each adding search time plus the agent’s reasoning about strategy. For real-time applications (chatbots, live support), this latency may be unacceptable. For background tasks (research, analysis), it’s a worthwhile tradeoff.
Token Cost
Each retrieval cycle adds retrieved chunks to context plus the agent’s reasoning about strategy. Cost scales with question complexity. A simple factual lookup might need one cycle; a thematic analysis across a corpus might need five. Without guardrails, agentic RAG can over-retrieve on simple questions.
Required Guardrails
Maximum retrieval rounds: Cap iterations to prevent runaway loops.

Confidence thresholds: Stop iterating when the agent is confident enough, not when it’s perfect.

Fallback to direct generation: If retrieval consistently fails, let the model answer from its training data rather than looping indefinitely.

Cost budgets: Set per-query token budgets that include retrieval overhead.
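The four guardrails above slot directly into the retrieval loop. A sketch, with hypothetical `search`, `confidence`, and `generate` callables and a crude word-count standing in for real token accounting:

```python
def guarded_retrieval(query, search, confidence, generate,
                      max_rounds=3, conf_threshold=0.8, token_budget=4000):
    """Agentic retrieval with all four guardrails applied."""
    context, spent = [], 0
    for _ in range(max_rounds):                       # guardrail: round cap
        chunks = search(query)
        spent += sum(len(c.split()) for c in chunks)  # crude token count
        context.extend(chunks)
        if confidence(query, context) >= conf_threshold:
            return generate(query, context)           # guardrail: good enough, stop
        if spent >= token_budget:                     # guardrail: cost budget
            break
    if not context:
        return generate(query, None)                  # guardrail: fall back to parametric knowledge
    return generate(query, context)                   # best effort with what we have
```

Note the threshold check is "confident enough", not "perfect" — the loop exits on the first acceptable result rather than chasing marginal improvements.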
Critical in AI: Agentic RAG without guardrails can enter infinite retrieval loops on unanswerable questions. Maximum rounds and confidence thresholds are not optional — they’re safety requirements.
Retrieval as Context Engineering
How retrieval fits the broader discipline
The Context Engineering Lens
From a context engineering perspective, retrieval is the mechanism for bringing external knowledge into the context window on demand. It’s the bridge between the model’s training data (static, potentially outdated) and the organization’s current knowledge (dynamic, up-to-date). The quality of retrieval directly determines the quality of the context.
Integration with Other Patterns
Routing (Ch 5) determines which knowledge base to search. Retrieval (this chapter) fetches specific documents from that knowledge base. Compression (Ch 4) shrinks the retrieved content to fit the budget. Progressive disclosure (Ch 3) determines whether retrieval instructions are even loaded. Each pattern handles a different dimension of the same problem: getting the right information to the model.
Key insight: RAG has matured from a simple “retrieve and stuff” pattern into a sophisticated, agent-controlled knowledge gathering system. The evolution mirrors the broader shift from static prompt engineering to dynamic context engineering.