Ch 9 — Advanced RAG Patterns

Beyond naive RAG — graphs, agents, multi-modal, and more
High-Level Overview: Graph RAG → Agentic RAG → Multi-Modal RAG → Structured Data (Text-to-SQL) → Conversational RAG → Caching & Performance → Choosing the Right Pattern
Graph RAG
Combining knowledge graphs with vector retrieval
The Problem with Flat Chunks
Standard RAG treats documents as isolated chunks. But real knowledge has relationships: Person X works at Company Y, Product A depends on Component B, Policy C references Regulation D. Flat vector search misses these connections.
How Graph RAG Works
Step 1: Extract entities and relationships from documents using an LLM. Build a knowledge graph (nodes = entities, edges = relationships).

Step 2: At query time, retrieve relevant entities from the graph and traverse their connections to find related context.

Step 3: Combine graph-retrieved context with vector-retrieved chunks for a richer answer.
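The three steps can be sketched in plain Python. This is a toy: the hand-written triples below stand in for what the Step 1 LLM extractor would produce, and `traverse` is a hypothetical helper, not a library call.

```python
from collections import defaultdict

# Toy triples, standing in for the Step 1 LLM extractor's output
triples = [
    ("Alice", "works_at", "AcmeCorp"),
    ("AcmeCorp", "makes", "WidgetX"),
    ("WidgetX", "depends_on", "LibFoo"),
]

# Step 1: build the graph (nodes = entities, edges = relationships)
graph = defaultdict(list)
for subject, relation, obj in triples:
    graph[subject].append((relation, obj))

# Step 2: traverse outward from an entity mentioned in the query
def traverse(entity, hops=2):
    """Collect facts reachable within `hops` edges of an entity."""
    facts, frontier = [], [entity]
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for relation, obj in graph[node]:
                facts.append(f"{node} {relation} {obj}")
                next_frontier.append(obj)
        frontier = next_frontier
    return facts

# Step 3 would prepend these facts to the vector-retrieved chunks
traverse("Alice")  # → ["Alice works_at AcmeCorp", "AcmeCorp makes WidgetX"]
```

A two-hop traversal from "Alice" surfaces the WidgetX fact that a flat vector search over chunks mentioning only Alice would miss.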
When to Use
Multi-hop questions: "Who manages the team that built the product mentioned in the Q3 report?" — requires traversing multiple relationships.

Entity-centric queries: "Tell me everything about Project Alpha" — gather all related entities and their connections.

Summarization over large corpora: Microsoft's GraphRAG (Edge et al., 2024) builds community summaries from the graph for high-level questions.
Microsoft GraphRAG (open-source) is the leading implementation. It uses an LLM to extract entities and relationships, builds a graph, detects communities using the Leiden algorithm, and generates community summaries. For "global" questions about the entire corpus, it queries community summaries instead of individual chunks.
Agentic RAG
LLM agents that decide how and when to retrieve
Beyond Fixed Pipelines
Standard RAG follows a fixed pipeline: retrieve then generate. Agentic RAG gives the LLM agency to decide its own retrieval strategy. The agent can choose which tools to use, when to retrieve, what to search for, and whether to retrieve again if the first results are insufficient.
How It Works
The LLM is given access to tools: vector search, SQL query, web search, calculator, API calls. For each question, the agent reasons about which tools to use and in what order. It can chain multiple retrievals, combine results from different sources, and self-correct.
```python
# LangGraph Agentic RAG
from langgraph.prebuilt import create_react_agent

tools = [
    vector_search_tool,  # search docs
    sql_query_tool,      # query database
    web_search_tool,     # search the web
    calculator_tool,     # math operations
]

agent = create_react_agent(
    model=llm,
    tools=tools,
    prompt="You are a research assistant...",
)

# Agent decides: search docs first,
# then query SQL for numbers,
# then calculate the final answer
result = agent.invoke({
    "messages": [("user", "What was our revenue growth rate "
                          "compared to the industry average?")]
})
```
Agentic RAG is the most powerful but least predictable pattern. The agent may take unexpected paths, make unnecessary tool calls, or get stuck in loops. Use it for complex research queries where flexibility matters. For simple Q&A, stick with fixed pipelines — they are faster, cheaper, and more predictable.
Multi-Modal RAG
Retrieving and reasoning over images, tables, and diagrams
The Challenge
Many documents contain critical information in images, charts, tables, and diagrams that text-only RAG completely misses. A financial report's key insights might be in a bar chart. A technical manual's wiring diagram cannot be chunked as text.
Approaches
1. Text extraction: Use OCR or vision models to convert images/tables to text, then index the text normally. Simplest but loses visual structure.

2. Multi-modal embeddings: Use models like CLIP or Nomic Embed Vision to embed images directly into the same vector space as text. Retrieve images by text query.

3. Vision LLM generation: Pass retrieved images directly to a vision-capable LLM (GPT-4o, Claude 3.5 Sonnet, Gemini) for analysis. Most powerful approach.
```python
# Multi-modal RAG with vision LLM
# 1. Extract images from documents
images = extract_images(pdf_path)

# 2. Generate text descriptions
for img in images:
    description = vision_llm.describe(img)
    # Index the description as a chunk;
    # store the image path in metadata
    chunks.append(Document(
        page_content=description,
        metadata={"image_path": img.path},
    ))

# 3. At query time, retrieve the description
# 4. Pass the original image to the vision LLM
answer = vision_llm.invoke([
    {"type": "text", "text": question},
    {"type": "image_url", "url": img_path},
])
```
The practical approach: Extract images, generate text summaries with a vision LLM, index the summaries. At retrieval time, if a summary matches, pass the original image to the generation LLM. This gives you text-based retrieval with image-aware generation. LlamaIndex's MultiModalVectorStoreIndex implements this pattern.
Structured Data RAG (Text-to-SQL)
Querying databases with natural language
When Documents Aren't Enough
Some questions need precise, structured data: "What was our revenue last quarter?", "How many orders were placed in March?", "Show me the top 10 customers by spend." This data lives in databases, not documents. Text-to-SQL lets the LLM write SQL queries to answer these questions.
How It Works
Step 1: Provide the LLM with the database schema (table names, columns, types).
Step 2: The LLM generates a SQL query from the natural language question.
Step 3: Execute the SQL query against the database.
Step 4: The LLM interprets the results and generates a natural language answer.
```python
# LangChain Text-to-SQL
from langchain_community.utilities import SQLDatabase
from langchain.chains import create_sql_query_chain

db = SQLDatabase.from_uri("sqlite:///sales.db")
chain = create_sql_query_chain(llm, db)

# User: "What was total revenue in Q3?"
# LLM generates:
#   SELECT SUM(amount) FROM orders
#   WHERE date BETWEEN '2024-07-01'
#     AND '2024-09-30'
sql = chain.invoke({
    "question": "Total revenue in Q3 2024?"
})
result = db.run(sql)
```
Combine Text-to-SQL with document RAG. Use a router to decide: numerical/analytical questions go to SQL, conceptual/policy questions go to vector search. LlamaIndex's RouterQueryEngine and LangChain's routing chains handle this. This gives users a single interface for both structured and unstructured data.
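The routing idea can be illustrated with a deliberately naive keyword heuristic. The `route` function here is hypothetical and purely for illustration; a production system would use an LLM classifier or one of the framework routers named above.

```python
def route(question: str) -> str:
    """Toy router: send analytical questions to SQL, the rest to vector search.

    Production routers use an LLM or a trained classifier, not keywords.
    """
    analytical_keywords = (
        "how many", "total", "average", "revenue",
        "top", "count", "sum",
    )
    q = question.lower()
    return "sql" if any(kw in q for kw in analytical_keywords) else "vector"

route("Total revenue in Q3 2024?")         # → "sql"
route("What does the refund policy say?")  # → "vector"
```

The returned label would then select either the Text-to-SQL chain or the standard retrieval chain.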
Conversational RAG
Multi-turn chat with memory and context tracking
The Challenge
Users don't ask one question and leave. They have conversations: follow-up questions, clarifications, topic shifts. Each turn depends on the previous context. "What's the refund policy?" followed by "How long do I have?" — the second question only makes sense with the first.
Key Components
Chat history: Store the full conversation. Pass it to the LLM for context.

History-aware retrieval: Rewrite follow-up questions using chat history before retrieval (covered in Ch 7).

Memory management: As conversations grow, summarize older turns to stay within the token budget. Keep recent turns verbatim, compress older ones.
```python
# LangChain Conversational RAG
from langchain.chains import (
    create_history_aware_retriever,
    create_retrieval_chain,
)
from langchain_community.chat_message_histories import (
    ChatMessageHistory,
)
from langchain_core.runnables.history import (
    RunnableWithMessageHistory,
)

# 1. History-aware retriever (rewrites the query)
hist_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_prompt
)

# 2. Full RAG chain
rag_chain = create_retrieval_chain(
    hist_retriever, stuff_chain
)

# 3. Add message history
conversational_rag = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,  # returns ChatMessageHistory
    input_messages_key="input",
    history_messages_key="chat_history",
)
```
Session management is critical. Each user/conversation needs its own chat history. Use Redis, PostgreSQL, or DynamoDB to persist histories across requests. LangChain's RunnableWithMessageHistory handles this with pluggable backends. Set a maximum history length (e.g., last 20 turns) to control costs.
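The "last 20 turns" cap can be as simple as a slice over the stored message list. `trim_history` here is a hypothetical helper for illustration; recent LangChain releases also ship a `trim_messages` utility in `langchain_core.messages` for the same job.

```python
def trim_history(messages, max_turns=20):
    """Keep the most recent `max_turns` (user, assistant) exchanges.

    `messages` is a flat, alternating list of (role, text) tuples.
    """
    return messages[-2 * max_turns:]

# 50 full exchanges = 100 messages
history = [
    ("user" if i % 2 == 0 else "assistant", f"msg{i}")
    for i in range(100)
]
trimmed = trim_history(history, max_turns=20)  # keeps the last 40 messages
```

For token-accurate budgeting you would count tokens per message rather than turns, but the principle is the same.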
Caching & Performance Patterns
Reducing latency and cost with intelligent caching
Semantic Caching
If a user asks a question very similar to one already answered, return the cached answer instead of running the full pipeline. Semantic caching uses embedding similarity to match queries — "What's the refund policy?" and "How do returns work?" might hit the same cache entry.
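The mechanism is easy to show end to end. In this sketch the embedding function is a stand-in character-frequency vector so the example runs without an API; a real cache would embed with the same model used for retrieval. `SemanticCache` is a hypothetical class, not a library API.

```python
import math

def embed(text):
    """Stand-in embedding: 26-dim character-frequency vector.
    A real system would call an embedding model here."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isascii() and ch.isalpha():
            vec[ord(ch) - ord("a")] += 1
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached_answer)

    def get(self, query):
        qv = embed(query)
        for ev, answer in self.entries:
            if cosine(qv, ev) >= self.threshold:
                return answer  # hit: skip the full pipeline
        return None            # miss: caller runs retrieval + generation

    def put(self, query, answer):
        self.entries.append((embed(query), answer))

cache = SemanticCache(threshold=0.9)
cache.put("What's the refund policy?", "30 days, full refund.")
cache.get("Whats the refund policy")     # near-duplicate query → hit
cache.get("How do I reset my password")  # different topic → None
```

A production cache replaces the linear scan with a vector index and adds entry expiry, but the hit/miss logic is exactly this.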
Other Caching Layers
Embedding cache: Cache embedding API calls. Same text = same vector. Saves API costs on repeated documents.

LLM response cache: Cache exact prompt → response pairs. Useful for identical queries.

Retrieval cache: Cache query → retrieved documents. Avoids repeated vector searches for the same query.
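The embedding cache in particular is plain memoization keyed on the text. This sketch uses a hypothetical `cached_embed` wrapper and a fake embedding function standing in for a real API client.

```python
import hashlib

_embedding_cache = {}

def cached_embed(text, embed_fn):
    """Return a cached vector if this exact text was embedded before."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_fn(text)  # only called on a miss
    return _embedding_cache[key]

api_calls = []
def fake_embed(text):
    api_calls.append(text)     # count would-be API calls
    return [float(len(text))]  # stand-in vector

cached_embed("same document chunk", fake_embed)
cached_embed("same document chunk", fake_embed)  # served from cache
```

The same memoization shape works for the retrieval cache, keyed on the query string instead of the document text.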
```python
# LangChain semantic cache
from langchain.cache import InMemoryCache
from langchain.globals import set_llm_cache

# Exact-match cache
set_llm_cache(InMemoryCache())

# Semantic cache (GPTCache)
from gptcache import Cache
from gptcache.adapter.langchain_models import (
    LangChainLLMs,
)

cache = Cache()
cache.init(
    similarity_threshold=0.95,
    # Uses embedding similarity to match
)

# Cache hit: <50ms response
# Cache miss: full pipeline (1-5s)
```
Set the similarity threshold carefully. Too low (0.8) and you'll return wrong cached answers for different questions. Too high (0.99) and the cache rarely hits. Start at 0.95 and tune based on your query distribution. Monitor cache hit rate and answer quality.
Choosing the Right Pattern
Match the pattern to your use case
Decision Guide
Simple Q&A over documents:
Standard RAG (Ch 1-8). No advanced patterns needed.

Questions about relationships between entities:
Graph RAG. Build a knowledge graph from your documents.

Complex research requiring multiple data sources:
Agentic RAG. Give the LLM tools for each data source.

Documents with images, charts, diagrams:
Multi-modal RAG. Extract and index visual content.

Numerical/analytical questions:
Text-to-SQL + document RAG with routing.

Chat-based interface:
Conversational RAG with history-aware retrieval.

High traffic, repeated questions:
Add semantic caching to any pattern above.
Complexity vs Value
Each advanced pattern adds significant complexity. Before adopting one, ask:

1. Does standard RAG fail? If basic retrieval + generation works, stop there.

2. Can I identify the failure mode? Graph RAG fixes relationship queries. Multi-modal fixes image-heavy docs. Don't add complexity without a specific problem to solve.

3. Do I have the infrastructure? Graph RAG needs a graph database. Agentic RAG needs careful tool design. Multi-modal needs vision models.
Most production RAG systems use standard RAG with hybrid search and reranking. Advanced patterns are for specific, identified failure modes. Build the simple version first, measure where it fails, then add the specific pattern that addresses that failure. Premature complexity is the enemy of shipping.