Ch 1 — What Is RAG & Why It Matters

The retrieve-then-generate pattern that grounds LLMs in real data
The pipeline at a glance: Documents → Chunk → Embed → Store → Retrieve → Generate → Answer
The Problem: LLMs Don't Know Your Data
Knowledge cutoffs, hallucination, and the limits of training data
Knowledge Cutoff
Every LLM has a training cutoff date. GPT-4o's training data ends in late 2023. Claude's in early 2025. Anything after that date — your company's latest docs, today's stock prices, last week's incident report — the model simply doesn't know.
Hallucination
When an LLM doesn't know something, it doesn't say "I don't know." It generates plausible-sounding text that may be completely wrong. It invents citations, fabricates statistics, and confidently states falsehoods. This is hallucination — the core problem RAG solves.
Without RAG
"What's our refund policy?"

LLM: "Your refund policy allows returns within 30 days..." (made up — your actual policy is 14 days)
With RAG
"What's our refund policy?"

LLM retrieves your policy doc, then: "According to your policy document, refunds are available within 14 days of purchase." (grounded in real data)
RAG: Retrieve, Then Generate
The core idea in one sentence
The Pattern
Retrieval-Augmented Generation means: before the LLM generates an answer, first retrieve relevant documents from your own data, then include those documents in the prompt so the LLM can base its answer on real information.
The Name
The term comes from the 2020 paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" by Patrick Lewis et al. at Meta AI. The paper showed that combining a retriever (DPR) with a generator (BART) outperformed pure generation on knowledge-intensive tasks like open-domain QA.
```python
# RAG in pseudocode — it's this simple
def rag(question):
    # Step 1: Retrieve
    docs = vector_store.search(question, top_k=5)
    # Step 2: Augment the prompt
    prompt = f"""Answer based on these docs: {docs}

Question: {question}"""
    # Step 3: Generate
    answer = llm.generate(prompt)
    return answer
```
That's it. RAG is fundamentally just "search your docs, paste them into the prompt, ask the LLM." Everything else — chunking, embeddings, reranking — is about making that search better.
The RAG Pipeline: Two Phases
Indexing (offline) and Querying (online)
Phase 1: Indexing (Offline)
Done once (or periodically) before any user asks a question:

1. Load — Read documents from files, databases, APIs
2. Chunk — Split documents into smaller passages
3. Embed — Convert each chunk into a vector (array of numbers)
4. Store — Save vectors in a vector database with metadata
Phase 2: Querying (Online)
Happens in real-time when a user asks a question:

5. Embed the query — Convert the question into a vector
6. Retrieve — Find the most similar chunks in the vector store
7. Augment — Insert retrieved chunks into the LLM prompt
8. Generate — LLM produces an answer grounded in the retrieved context
The pipeline in the header shows this flow. Chapters 2–8 of this learning path go deep on each stage.
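The eight steps above can be sketched end to end in a few dozen lines. Everything here is a toy stand-in: the bag-of-words `embed` function replaces a real embedding model, the in-memory `VectorStore` replaces a real vector database, and the LLM call is left as a commented placeholder.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class VectorStore:
    def __init__(self):
        self.entries = []  # list of (vector, chunk) pairs

    def add(self, chunk):
        self.entries.append((embed(chunk), chunk))

    def search(self, query, top_k=5):
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [chunk for _, chunk in ranked[:top_k]]

# Phase 1: Indexing (offline): load, chunk, embed, store
store = VectorStore()
for chunk in ["Refunds are available within 14 days.",
              "Shipping takes 3 to 5 business days."]:
    store.add(chunk)

# Phase 2: Querying (online): embed the query, retrieve, augment, generate
docs = store.search("when are refunds available?", top_k=1)
prompt = f"Answer based on these docs: {docs}\nQuestion: when are refunds available?"
# answer = llm.generate(prompt)  # hypothetical LLM client, not shown
```

Swapping the toy pieces for a real embedding model and vector database changes the quality of the search, but not the shape of the pipeline.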
RAG vs Fine-Tuning vs Long Context
Three approaches to giving LLMs new knowledge
Fine-Tuning
Bakes knowledge into model weights. Expensive to train, hard to update, can cause catastrophic forgetting. Best for teaching the model a new style or behavior, not for injecting facts. A fine-tuned model still hallucinates — it just hallucinates in your company's voice.
Long Context Windows
Paste everything into the prompt. Gemini 1.5 Pro supports 1M tokens. But: cost scales linearly with context size, latency increases, and models struggle with information in the middle of very long contexts (the "lost-in-the-middle" effect, Liu et al. 2023).
RAG
Retrieves only what's relevant. No retraining needed. Data can be updated instantly (just re-index). Cost is predictable — you only pay for the retrieved chunks, not the entire corpus. The model sees exactly the context it needs, nothing more.
They're not mutually exclusive. In practice, teams combine approaches: RAG for factual grounding, fine-tuning for tone/format, and long context for complex multi-document reasoning. But RAG is the default starting point for most knowledge-intensive applications.
The Key Insight: Semantic Search
Finding relevant documents by meaning, not keywords
Beyond Keyword Search
Traditional search (like Elasticsearch with BM25) matches keywords. If you search "how to cancel my subscription" but the doc says "steps to terminate your account," keyword search might miss it. Semantic search understands that "cancel subscription" and "terminate account" mean the same thing.
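A toy keyword matcher makes the gap concrete. This is illustrative only; real lexical scoring like BM25 is far more sophisticated, but it shares the same blind spot: no shared terms means no score.

```python
# Naive keyword overlap, standing in for lexical (keyword-based) scoring.
STOPWORDS = {"how", "to", "my", "your", "steps", "the", "a"}

def keywords(text):
    return {w for w in text.lower().split() if w not in STOPWORDS}

query = keywords("how to cancel my subscription")    # {'cancel', 'subscription'}
doc   = keywords("steps to terminate your account")  # {'terminate', 'account'}

print(query & doc)  # empty set: zero keyword overlap, so keyword search misses the doc
```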
How It Works
An embedding model (like OpenAI's text-embedding-3-small or the open-source BGE-large) converts text into a vector — an array of numbers that captures meaning. Similar meanings produce similar vectors. You find relevant documents by finding the nearest vectors to your query.
```
# Semantic similarity in action
embed("cancel my subscription")   → [0.23, -0.41, 0.87, ...]
embed("terminate your account")   → [0.21, -0.39, 0.85, ...]
cosine_similarity = 0.94  # very similar!

embed("today's weather forecast") → [-0.65, 0.12, -0.33, ...]
cosine_similarity = 0.11  # not similar
```
This is the magic of RAG. Embedding models compress the meaning of text into vectors, and vector databases find the nearest neighbors in milliseconds — even across millions of documents.
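Cosine similarity itself is just a few lines: the dot product of two vectors divided by the product of their lengths. A minimal sketch in plain Python, no vector library required:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    # Near 1.0: same direction (similar meaning); near 0 or negative: unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Truncated 3-dimensional vectors from the example above (real embeddings
# have hundreds or thousands of dimensions):
sim_close = cosine_similarity([0.23, -0.41, 0.87], [0.21, -0.39, 0.85])   # near 1.0
sim_far   = cosine_similarity([0.23, -0.41, 0.87], [-0.65, 0.12, -0.33])  # much lower
```

Vector databases avoid computing this against every stored vector by using approximate nearest-neighbor indexes (covered in Chapter 5).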
Naive RAG vs Advanced RAG
The spectrum of sophistication
Naive RAG
The simplest implementation: chunk documents, embed them, retrieve top-K, stuff them into the prompt. Works surprisingly well for many use cases. This is where you should start — don't over-engineer before you have a baseline.
Where Naive RAG Fails
Poor chunking — splits mid-sentence, loses context
Irrelevant retrieval — top-K results aren't always the best
No query understanding — ambiguous queries get bad results
Lost in the middle — LLM ignores context in the middle of the prompt
No verification — answer might not actually use the retrieved docs
Advanced RAG
Adds techniques at every stage to improve quality:

Pre-retrieval: Query rewriting, HyDE, multi-query expansion
Retrieval: Hybrid search, reranking, MMR for diversity
Post-retrieval: Context compression, reordering, citation
Generation: Self-RAG (self-reflection), CRAG (corrective retrieval)
This learning path covers the full spectrum. Chapters 2–8 build up the pipeline stage by stage. Chapter 9 covers the advanced patterns (GraphRAG, Self-RAG, CRAG). Start naive, measure, then add complexity where it helps.
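As one concrete taste of the retrieval-stage techniques, MMR (maximal marginal relevance) re-scores candidates so each pick balances relevance to the query against redundancy with chunks already selected. A minimal sketch over precomputed similarity scores (`relevance` and `pairwise_sim` are assumed inputs here, e.g. cosine similarities from the vector store):

```python
def mmr(relevance, pairwise_sim, k, lam=0.5):
    # relevance[i]: similarity of candidate i to the query
    # pairwise_sim[i][j]: similarity between candidates i and j
    # lam = 1.0 means pure relevance; lam = 0.0 means pure diversity
    selected = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((pairwise_sim[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Candidates 0 and 1 are near-duplicates (similarity 0.95); MMR picks
# candidate 0, then skips 1 in favor of the more diverse candidate 2.
picks = mmr(relevance=[0.9, 0.85, 0.3],
            pairwise_sim=[[1.0, 0.95, 0.1],
                          [0.95, 1.0, 0.1],
                          [0.1, 0.1, 1.0]],
            k=2)
```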
Where RAG Shines
Real-world applications across industries
Common Use Cases
Customer support bots — Answer questions from help docs, knowledge bases, and past tickets

Internal knowledge search — "What's our policy on X?" across thousands of company documents

Code assistants — Retrieve relevant code, docs, and examples from your codebase (this is how Cursor works)

Legal/medical research — Search case law, clinical guidelines, or regulatory documents with natural language
What They All Share
Every RAG application has the same core need: a large corpus of domain-specific knowledge that the LLM wasn't trained on, and users who ask natural language questions about that knowledge. The RAG pipeline is the bridge.
RAG is the most deployed LLM pattern in production today. It's simpler than fine-tuning, cheaper than retraining, and gives the LLM access to up-to-date, domain-specific knowledge. If you're building an AI application that needs to answer questions about specific data, RAG is almost certainly your starting point.
The Journey Ahead
What you'll learn in this deep dive
The Pipeline, Stage by Stage
Ch 2: Document Loading — Getting data in (PDFs, web, DBs)
Ch 3: Chunking — Breaking docs into retrieval units
Ch 4: Embeddings — Text to vectors (models, math, benchmarks)
Ch 5: Vector Stores — Where vectors live (HNSW, FAISS, Pinecone)
Ch 6: Retrieval — Finding the right chunks (dense, sparse, hybrid)
Ch 7: Query Transformation — Making queries smarter
Ch 8: Generation — Synthesizing grounded answers
Beyond the Pipeline
Ch 9: Advanced Patterns — GraphRAG, Self-RAG, CRAG, agentic RAG
Ch 10: Solutions Landscape — LlamaIndex, LangChain, Haystack, Vectara, and more
Ch 11: Production & Eval — RAGAS metrics, caching, monitoring, A/B testing
Each chapter has two views. The High Level view (like this page) gives you the visual journey: watch the pipeline build up. The Under the Hood view goes deep on the technical details, algorithms, and code. Start with the High Level, then dive Under the Hood when you're ready.