Ch 1 — What Is RAG & Why It Matters

The retrieve-then-generate pattern that grounds LLMs in real data
The pipeline at a glance: Documents → Chunk → Embed → Store → Retrieve → Generate → Answer
The Problem: LLMs Don't Know Your Data
Knowledge cutoffs, hallucination, and the limits of training data
Knowledge Cutoff
Every LLM has a training cutoff date. GPT-4o's training data ends in late 2023. Claude's in early 2025. Anything after that date — your company's latest docs, today's stock prices, last week's incident report — the model simply doesn't know.
Hallucination
When an LLM doesn't know something, it doesn't say "I don't know." It generates plausible-sounding text that may be completely wrong. It invents citations, fabricates statistics, and confidently states falsehoods. This is hallucination — the core problem RAG solves.
Without RAG
"What's our refund policy?"

LLM: "Your refund policy allows returns within 30 days..." (made up — your actual policy is 14 days)
With RAG
"What's our refund policy?"

LLM retrieves your policy doc, then: "According to your policy document, refunds are available within 14 days of purchase." (grounded in real data)
RAG: Retrieve, Then Generate
The core idea in one sentence
The Pattern
Retrieval-Augmented Generation means: before the LLM generates an answer, first retrieve relevant documents from your own data, then include those documents in the prompt so the LLM can base its answer on real information.
The Name
The term comes from the 2020 paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" by Patrick Lewis et al. at Meta AI. The paper showed that combining a retriever (DPR) with a generator (BART) outperformed pure generation on knowledge-intensive tasks like open-domain QA.
```python
# RAG in pseudocode — it's this simple
def rag(question):
    # Step 1: Retrieve
    docs = vector_store.search(question, top_k=5)
    # Step 2: Augment the prompt
    prompt = f"""Answer based on these docs: {docs}

Question: {question}"""
    # Step 3: Generate
    answer = llm.generate(prompt)
    return answer
```
That's it. RAG is fundamentally just "search your docs, paste them into the prompt, ask the LLM." Everything else — chunking, embeddings, reranking — is about making that search better.
The RAG Pipeline: Two Phases
Indexing (offline) and Querying (online)
Phase 1: Indexing (Offline)
Done once (or periodically) before any user asks a question:

1. Load — Read documents from files, databases, APIs
2. Chunk — Split documents into smaller passages
3. Embed — Convert each chunk into a vector (array of numbers)
4. Store — Save vectors in a vector database with metadata
Phase 2: Querying (Online)
Happens in real-time when a user asks a question:

5. Embed the query — Convert the question into a vector
6. Retrieve — Find the most similar chunks in the vector store
7. Augment — Insert retrieved chunks into the LLM prompt
8. Generate — LLM produces an answer grounded in the retrieved context
The pipeline in the header shows this flow. Chapters 2–8 of this learning path go deep on each stage.
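The eight steps above can be sketched end to end in a few dozen lines. Everything here is a toy stand-in: the bag-of-words `embed` function replaces a real embedding model, the in-memory `VectorStore` replaces a real vector database, and the LLM call is left as a commented placeholder.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class VectorStore:
    def __init__(self):
        self.entries = []  # list of (vector, chunk) pairs

    def add(self, chunk):
        self.entries.append((embed(chunk), chunk))

    def search(self, query, top_k=5):
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [chunk for _, chunk in ranked[:top_k]]

# Phase 1: Indexing (offline): load, chunk, embed, store
store = VectorStore()
for chunk in ["Refunds are available within 14 days.",
              "Shipping takes 3 to 5 business days."]:
    store.add(chunk)

# Phase 2: Querying (online): embed the query, retrieve, augment, generate
docs = store.search("when are refunds available?", top_k=1)
prompt = f"Answer based on these docs: {docs}\nQuestion: when are refunds available?"
# answer = llm.generate(prompt)  # hypothetical LLM client, not shown
```

Swapping the toy pieces for a real embedding model and vector database changes the quality of the search, but not the shape of the pipeline.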
RAG vs Fine-Tuning vs Long Context
Three approaches to giving LLMs new knowledge
Fine-Tuning
Bakes knowledge into model weights. Expensive to train, hard to update, can cause catastrophic forgetting. Best for teaching the model a new style or behavior, not for injecting facts. A fine-tuned model still hallucinates — it just hallucinates in your company's voice.
Long Context Windows
Paste everything into the prompt. Gemini 1.5 Pro supports 1M tokens. But: cost scales linearly with context size, latency increases, and models struggle with information in the middle of very long contexts (the "lost-in-the-middle" effect, Liu et al. 2023).
RAG
Retrieves only what's relevant. No retraining needed. Data can be updated instantly (just re-index). Cost is predictable — you only pay for the retrieved chunks, not the entire corpus. The model sees exactly the context it needs, nothing more.
They're not mutually exclusive. In practice, teams combine approaches: RAG for factual grounding, fine-tuning for tone/format, and long context for complex multi-document reasoning. But RAG is the default starting point for most knowledge-intensive applications.
The Key Insight: Semantic Search
Finding relevant documents by meaning, not keywords
Beyond Keyword Search
Traditional search (like Elasticsearch with BM25) matches keywords. If you search "how to cancel my subscription" but the doc says "steps to terminate your account," keyword search might miss it. Semantic search understands that "cancel subscription" and "terminate account" mean the same thing.
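A toy keyword matcher makes the gap concrete. This is illustrative only; real lexical scoring like BM25 is far more sophisticated, but it shares the same blind spot: no shared terms means no score.

```python
# Naive keyword overlap, standing in for lexical (keyword-based) scoring.
STOPWORDS = {"how", "to", "my", "your", "steps", "the", "a"}

def keywords(text):
    return {w for w in text.lower().split() if w not in STOPWORDS}

query = keywords("how to cancel my subscription")    # {'cancel', 'subscription'}
doc   = keywords("steps to terminate your account")  # {'terminate', 'account'}

print(query & doc)  # empty set: zero keyword overlap, so keyword search misses the doc
```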
How It Works
An embedding model (like OpenAI's text-embedding-3-small or the open-source BGE-large) converts text into a vector — an array of numbers that captures meaning. Similar meanings produce similar vectors. You find relevant documents by finding the nearest vectors to your query.
```
# Semantic similarity in action
embed("cancel my subscription")   → [0.23, -0.41, 0.87, ...]
embed("terminate your account")   → [0.21, -0.39, 0.85, ...]
cosine_similarity = 0.94  # very similar!

embed("today's weather forecast") → [-0.65, 0.12, -0.33, ...]
cosine_similarity = 0.11  # not similar
```
This is the magic of RAG. Embedding models compress the meaning of text into vectors, and vector databases find the nearest neighbors in milliseconds — even across millions of documents.
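Cosine similarity itself is just a few lines: the dot product of two vectors divided by the product of their lengths. A minimal sketch in plain Python, no vector library required:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    # Near 1.0: same direction (similar meaning); near 0 or negative: unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Truncated 3-dimensional vectors from the example above (real embeddings
# have hundreds or thousands of dimensions):
sim_close = cosine_similarity([0.23, -0.41, 0.87], [0.21, -0.39, 0.85])   # near 1.0
sim_far   = cosine_similarity([0.23, -0.41, 0.87], [-0.65, 0.12, -0.33])  # much lower
```

Vector databases avoid computing this against every stored vector by using approximate nearest-neighbor indexes (covered in Chapter 5).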
Naive RAG vs Advanced RAG
The spectrum of sophistication
Naive RAG
The simplest implementation: chunk documents, embed them, retrieve top-K, stuff them into the prompt. Works surprisingly well for many use cases. This is where you should start — don't over-engineer before you have a baseline.
Where Naive RAG Fails
Poor chunking — splits mid-sentence, loses context
Irrelevant retrieval — top-K results aren't always the best
No query understanding — ambiguous queries get bad results
Lost in the middle — LLM ignores context in the middle of the prompt
No verification — answer might not actually use the retrieved docs
Advanced RAG
Adds techniques at every stage to improve quality:

Pre-retrieval: Query rewriting, HyDE, multi-query expansion
Retrieval: Hybrid search, reranking, MMR for diversity
Post-retrieval: Context compression, reordering, citation
Generation: Self-RAG (self-reflection), CRAG (corrective retrieval)
This learning path covers the full spectrum. Chapters 2–8 build up the pipeline stage by stage. Chapter 9 covers the advanced patterns (GraphRAG, Self-RAG, CRAG). Start naive, measure, then add complexity where it helps.
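As one concrete taste of the retrieval-stage techniques, MMR (maximal marginal relevance) re-scores candidates so each pick balances relevance to the query against redundancy with chunks already selected. A minimal sketch over precomputed similarity scores (`relevance` and `pairwise_sim` are assumed inputs here, e.g. cosine similarities from the vector store):

```python
def mmr(relevance, pairwise_sim, k, lam=0.5):
    # relevance[i]: similarity of candidate i to the query
    # pairwise_sim[i][j]: similarity between candidates i and j
    # lam = 1.0 means pure relevance; lam = 0.0 means pure diversity
    selected = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((pairwise_sim[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Candidates 0 and 1 are near-duplicates (similarity 0.95); MMR picks
# candidate 0, then skips 1 in favor of the more diverse candidate 2.
picks = mmr(relevance=[0.9, 0.85, 0.3],
            pairwise_sim=[[1.0, 0.95, 0.1],
                          [0.95, 1.0, 0.1],
                          [0.1, 0.1, 1.0]],
            k=2)
```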
Where RAG Shines
Real-world applications across industries
Common Use Cases
Customer support bots — Answer questions from help docs, knowledge bases, and past tickets

Internal knowledge search — "What's our policy on X?" across thousands of company documents

Code assistants — Retrieve relevant code, docs, and examples from your codebase (this is how Cursor works)

Legal/medical research — Search case law, clinical guidelines, or regulatory documents with natural language
What They All Share
Every RAG application has the same core need: a large corpus of domain-specific knowledge that the LLM wasn't trained on, and users who ask natural language questions about that knowledge. The RAG pipeline is the bridge.
RAG is the most deployed LLM pattern in production today. It's simpler than fine-tuning, cheaper than retraining, and gives the LLM access to up-to-date, domain-specific knowledge. If you're building an AI application that needs to answer questions about specific data, RAG is almost certainly your starting point.
The Journey Ahead
What you'll learn in this deep dive
The Pipeline, Stage by Stage
Ch 2: Document Loading — Getting data in (PDFs, web, DBs)
Ch 3: Chunking — Breaking docs into retrieval units
Ch 4: Embeddings — Text to vectors (models, math, benchmarks)
Ch 5: Vector Stores — Where vectors live (HNSW, FAISS, Pinecone)
Ch 6: Retrieval — Finding the right chunks (dense, sparse, hybrid)
Ch 7: Query Transformation — Making queries smarter
Ch 8: Generation — Synthesizing grounded answers
Beyond the Pipeline
Ch 9: Advanced Patterns — GraphRAG, Self-RAG, CRAG, agentic RAG
Ch 10: Solutions Landscape — LlamaIndex, LangChain, Haystack, Vectara, and more
Ch 11: Production & Eval — RAGAS metrics, caching, monitoring, A/B testing
Each chapter has two views. The High Level view (like this page) gives you the visual journey: watch the pipeline build up. The Under the Hood view goes deep on the technical details, algorithms, and code. Start with the High Level, then dive Under the Hood when you're ready.