
Key Insights — RAG

A high-level summary of the core concepts across all 11 chapters.
Ingestion
Data Prep & Indexing
Chapters 1-5
1
RAG solves the two biggest problems with LLMs: hallucinations and outdated knowledge.
  • Retrieve-Then-Generate: Instead of relying on the model's internal memory, RAG searches an external knowledge store for relevant facts, pastes them into the prompt, and asks the model to ground its answer in them.
2
Garbage in, garbage out. The quality of your RAG system depends entirely on how well you extract text from PDFs, websites, and databases.
  • Metadata is Crucial: Extracting the text isn't enough. You must attach metadata (date, author, source URL) so you can filter results later.
3
You can't feed a 500-page book into an embedding model. You must break it into chunks.
  • Semantic Chunking: Instead of cutting text arbitrarily every 500 words, split it at natural boundaries (paragraphs, sections) to preserve meaning.
  • Overlap: Always overlap chunks (e.g., by 50 words) so you don't accidentally cut a crucial sentence in half.
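The overlap rule above can be sketched in a few lines. This is a minimal word-based splitter for illustration only; real pipelines typically split on tokens or semantic boundaries, and the sizes here are just the chapter's example numbers:

```python
def chunk_words(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size word chunks whose edges overlap,
    so no sentence is lost entirely at a chunk boundary."""
    words = text.split()
    step = size - overlap  # advance less than a full chunk each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # last chunk reached the end of the text
    return chunks
```

The last 50 words of each chunk reappear as the first 50 words of the next, which is exactly the safety margin the bullet describes.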
4
Embeddings convert text into high-dimensional mathematical coordinates.
  • Semantic Similarity: If two chunks of text have similar meanings, their vectors will be close together in space, even if they share zero exact keywords.
5
Standard databases search by exact keyword match. Vector databases search by mathematical proximity.
  • HNSW (Hierarchical Navigable Small World): The underlying algorithm used by most vector databases to find the "nearest neighbors" across millions of vectors in milliseconds.
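What HNSW approximates is the exact nearest-neighbor search below. This brute-force version is a sketch to show the target behavior, not how HNSW itself works; HNSW builds a layered graph to get roughly the same top-k in sub-linear time:

```python
import heapq
import math

def top_k(query: list[float], vectors: list[list[float]], k: int = 5) -> list[tuple[float, int]]:
    """Exact nearest-neighbor search: score every vector by cosine
    similarity to the query and keep the k best (score, index) pairs."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    scored = [(cos(query, v), i) for i, v in enumerate(vectors)]
    return heapq.nlargest(k, scored)
```

Scanning millions of vectors this way is too slow for real-time search, which is precisely why approximate indexes like HNSW exist.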
The Bottom Line: The ingestion pipeline is the foundation of RAG. If your chunking is sloppy or your embeddings are weak, no amount of prompt engineering will save the final answer.
Retrieval
Search & Generation
Chapters 6-8
6
Vector search (dense) is great for concepts, but terrible for exact names or IDs.
  • Hybrid Search: The industry standard. Combine vector search (for meaning) with keyword search (BM25, for exact terms) to get the best of both worlds.
  • Reranking: Retrieve 50 documents quickly, then use a slower, highly accurate Cross-Encoder model to re-sort them and pick the top 5.
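The chapter doesn't specify how the vector and keyword result lists are merged; one common choice (an assumption here, not the text's prescription) is Reciprocal Rank Fusion, which rewards documents that rank well in either list:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each ranked list contributes 1/(k + rank)
    per document; k=60 is the commonly used damping constant."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document near the top of both the BM25 list and the vector list beats one that only appears in a single list, which is the "best of both worlds" the bullet describes.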
7
Users write terrible search queries. Don't search using what they typed.
  • HyDE (Hypothetical Document Embeddings): Ask the LLM to write a fake answer to the user's question, then search the vector database using that fake answer. It works shockingly well.
  • Query Expansion: Have the LLM rewrite the user's query into 3-4 different variations, search for all of them, and combine the results.
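The HyDE flow above is small enough to sketch end-to-end. The `llm_generate`, `embed`, and `index_search` callables below are hypothetical placeholders standing in for whatever LLM, embedding model, and vector index you actually use:

```python
def hyde_search(question, llm_generate, embed, index_search, top_k=5):
    """HyDE: embed a hypothetical answer rather than the raw question,
    since a fake answer 'looks like' the documents we want to find."""
    fake_answer = llm_generate(
        f"Write a short passage that plausibly answers: {question}"
    )
    return index_search(embed(fake_answer), top_k)
```

The insight is that user questions and stored documents live in different regions of embedding space; a generated answer, even a wrong one, lands in the document region.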
8
Once you have the documents, you must force the LLM to stick to the script.
  • Lost in the Middle: LLMs attend most strongly to the beginning and end of a long prompt and often miss information buried in the middle. Put your most important retrieved documents at the very top or very bottom.
  • Strict Citations: Force the model to append `[Doc 1]` citations to every claim it makes, ensuring traceability.
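One way to apply both bullets is in the prompt template itself: number each document so `[Doc N]` citations are meaningful, and state the grounding rule explicitly. The exact wording below is illustrative, not prescribed by the chapter:

```python
def build_grounded_prompt(question: str, docs: list[str]) -> str:
    """Number each retrieved document and demand per-claim citations."""
    context = "\n\n".join(f"[Doc {i}] {d}" for i, d in enumerate(docs, 1))
    return (
        "Answer using ONLY the documents below. "
        "Append a [Doc N] citation after every claim. "
        "If the documents do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```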
The Bottom Line: Naive RAG (just embedding the user's query) fails in production. Advanced retrieval requires query rewriting, hybrid search, and reranking to find the right context.
Advanced
Patterns & Production
Chapters 9-11
9
Modern RAG systems are moving from linear pipelines to stateful agents.
  • Self-RAG: The model evaluates its own retrieved documents. If they aren't helpful, it rewrites the query and searches again before answering.
  • GraphRAG: Extracting entities and relationships from documents into a Knowledge Graph, allowing the system to answer complex "connect the dots" questions that vector search fails at.
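A toy sketch of why a knowledge graph answers "connect the dots" questions that single-shot vector search cannot: once entities and relationships are stored as triples, multi-hop questions become graph traversal. The triple format and BFS below are illustrative assumptions, not GraphRAG's actual implementation:

```python
from collections import deque

def hops_between(triples, start, goal):
    """BFS over an entity graph built from (subject, relation, object)
    triples; returns the hop count from start to goal, or -1."""
    adj: dict[str, set[str]] = {}
    for s, _, o in triples:
        adj.setdefault(s, set()).add(o)
        adj.setdefault(o, set()).add(s)
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return -1
```

"Where is Alice's employer based?" requires two hops (Alice → Acme → Berlin); no single chunk may mention Alice and Berlin together, so vector search alone misses it.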
10
The ecosystem is split between DIY frameworks and managed services.
  • LlamaIndex vs LangChain: LlamaIndex is deeply specialized for data ingestion and RAG. LangChain is broader, focusing on general agent orchestration.
11
You cannot improve a RAG system without automated evaluation metrics.
  • RAGAS Framework: Evaluates RAG across multiple dimensions: Faithfulness (did it hallucinate?), Answer Relevance (did it answer the question?), and Context Precision (did it retrieve garbage?).
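RAGAS scores these dimensions with LLM judges; a drastically simplified, label-based version of Context Precision (assuming you have gold relevance labels, which RAGAS does not require) looks like:

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that are actually relevant.
    Low precision means the retriever is padding the context with garbage."""
    if not retrieved:
        return 0.0
    return sum(doc in relevant for doc in retrieved) / len(retrieved)
```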
The Bottom Line: Building a RAG prototype takes a weekend. Getting it to 95% accuracy in production takes months of tuning chunk sizes, reranking models, and automated evaluation.