Ch 12 — RAG & Knowledge Integration

How to ground LLM products in your organization’s data — the architecture that makes AI actually useful.
Why RAG Exists
The fundamental problem RAG solves — and why fine-tuning alone isn’t enough
The Knowledge Problem
LLMs are trained on public internet data with a knowledge cutoff date. They don’t know about your company’s products, your internal policies, your customer data, or anything that happened after training ended.

Ask GPT-4 about your company’s Q3 earnings, your return policy, or your latest product release — it will either hallucinate an answer or admit it doesn’t know.

For most enterprise AI products, the value comes from answering questions about your specific data. A customer support bot needs your knowledge base. A legal assistant needs your contracts. An internal tool needs your documentation.
RAG vs. Fine-Tuning
Fine-tuning bakes knowledge into the model’s weights. It’s expensive, slow to update, and the model can still hallucinate. You can’t easily verify where an answer came from.

RAG (Retrieval-Augmented Generation) keeps knowledge external. At query time, the system retrieves relevant documents and includes them in the prompt. The LLM generates an answer grounded in those documents.

RAG advantages:
Always current: Update the knowledge base, and answers change immediately
Citable: Every answer can point to its source documents
Cheaper: No retraining required when data changes
Controllable: You decide exactly what knowledge is available

RAG is the default architecture for enterprise LLM products in 2025–2026. Fine-tuning is used for behavior and style; RAG is used for knowledge.
The mental model: Think of the LLM as a brilliant new employee with excellent reasoning skills but zero knowledge of your company. RAG is the process of handing them the right documents before they answer a question. The quality of the answer depends on the quality of the documents you hand them.
The RAG Pipeline
Two phases, ten components — the architecture behind every enterprise AI product
Phase 1: Offline Indexing
This happens before any user query. You prepare your knowledge base for fast retrieval:

1. Load documents.
Ingest raw data from all sources: PDFs, web pages, databases, Confluence, Notion, Slack, email archives, support tickets. Both batch backfills (historical data) and incremental updates (new data as it arrives).

2. Chunk documents.
Split large documents into smaller pieces (chunks). A 50-page PDF becomes dozens of 512-token chunks. Each chunk should be a self-contained unit of information.

3. Embed chunks.
Convert each text chunk into a numerical vector (embedding) that captures its meaning. Similar texts produce similar vectors.

4. Store in vector database.
Save the embeddings in a specialized database (Pinecone, Weaviate, pgvector, Qdrant) that supports fast similarity search across millions of vectors.
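The four indexing steps can be sketched as a minimal in-memory pipeline. This is an illustrative toy, not production code: `embed` is a hash-based stand-in for a real embedding model (e.g. text-embedding-3-small), the naive paragraph split stands in for a real chunker, and the list of dicts stands in for a vector database.

```python
import hashlib
import math

def embed(text: str, dim: int = 8) -> list[float]:
    """Toy stand-in for a real embedding model: hashes words into a
    fixed-size, unit-length vector. Production systems call an
    embedding API or model here instead."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def index_documents(docs: list[dict]) -> list[dict]:
    """Steps 1-4: load, chunk, embed, store. Chunking here is a naive
    paragraph split; real strategies are discussed in the next section."""
    index = []
    for doc in docs:
        for i, chunk in enumerate(doc["text"].split("\n\n")):
            index.append({
                "doc_id": doc["id"],
                "chunk_id": f'{doc["id"]}-{i}',
                "text": chunk,
                "vector": embed(chunk),
            })
    return index

docs = [{"id": "policy", "text": "Returns accepted within 30 days."
                                 "\n\nRefunds go to the original payment method."}]
index = index_documents(docs)  # two chunks, each with a vector and metadata
```

Each stored record carries its source ID, which is what later makes filtering and citation possible.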
Phase 2: Online Retrieval
This happens for every user query in real time:

5. Embed the query.
Convert the user’s question into a vector using the same embedding model.

6. Search the vector database.
Find the top-K chunks whose vectors are most similar to the query vector. Typically K = 5–20.

7. Re-rank results.
Use a cross-encoder model to re-score the retrieved chunks for relevance. This boosts precision by 18–42% over vector search alone.

8. Build the prompt.
Combine the system prompt + retrieved chunks + user query into a single prompt for the LLM.

9. Generate the answer.
The LLM reads the retrieved context and generates an answer grounded in those documents.

10. Cite sources.
Include references to the source documents so users can verify the answer.
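Steps 5–8 can be sketched in a few lines. The hand-written two-dimensional vectors below are stand-ins for real embeddings, and `retrieve` is a brute-force cosine scan where a vector database would use ANN search:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def retrieve(query_vec: list[float], index: list[dict], k: int = 5) -> list[dict]:
    """Steps 5-6: rank stored chunks by similarity to the query vector."""
    ranked = sorted(index, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return ranked[:k]

def build_prompt(system: str, chunks: list[dict], question: str) -> str:
    """Step 8: system prompt + retrieved context + user question."""
    context = "\n\n".join(f'[{c["id"]}] {c["text"]}' for c in chunks)
    return f"{system}\n\nContext:\n{context}\n\nQuestion: {question}"

# Hand-made 2-D vectors stand in for real embeddings.
index = [
    {"id": "returns-1", "text": "Returns accepted within 30 days.", "vector": [0.9, 0.1]},
    {"id": "ship-1", "text": "Shipping takes 3-5 business days.", "vector": [0.1, 0.9]},
]
top = retrieve([0.8, 0.2], index, k=1)
prompt = build_prompt("Answer only from the context. Cite sources.",
                      top, "What is the return window?")
```

The bracketed chunk IDs in the context are what lets the model (and the UI) cite sources in step 10.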
PM takeaway: The pipeline has many failure points. Retrieval quality determines answer quality. If the system retrieves the wrong documents, the LLM will confidently generate a wrong answer based on irrelevant context. The PM must understand each stage to diagnose quality issues.
content_cut
Chunking & Embedding Decisions
The unglamorous decisions that determine 80% of RAG quality
Chunking Strategy
How you split documents into chunks has an outsized impact on retrieval quality:

Fixed-size chunking (recommended start):
Split at 512 tokens with 10–15% overlap between chunks. Simple, predictable, and surprisingly effective. Production data from 2026 shows this outperforms more complex approaches in most cases.

Semantic chunking:
Split at natural boundaries (paragraphs, sections, topic shifts). Sounds better in theory but creates 3–5x more fragments, increasing embedding costs and retrieval noise. Use only when fixed-size demonstrably fails.

Document-aware chunking:
Respect document structure (headers, tables, code blocks). Important for structured content like technical documentation or legal contracts where splitting mid-table destroys meaning.

The overlap matters: Without overlap, information that spans a chunk boundary is lost. A 10–15% overlap ensures continuity.
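A minimal sketch of fixed-size chunking with overlap, assuming tokens have already been produced by a tokenizer (whitespace-split words are a rough stand-in for real tokens here):

```python
def chunk_fixed(tokens: list[str], size: int = 512,
                overlap_pct: float = 0.125) -> list[list[str]]:
    """Fixed-size chunking with overlap. `tokens` would come from a real
    tokenizer in practice; the window advances by (size - overlap) so
    consecutive chunks share a band of tokens."""
    overlap = int(size * overlap_pct)   # e.g. 64 tokens at 12.5%
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

words = [f"w{i}" for i in range(1000)]
chunks = chunk_fixed(words, size=512)  # 3 chunks; adjacent chunks share 64 tokens
```

The shared 64-token band is the overlap that keeps boundary-spanning information retrievable from at least one chunk.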
Embedding Models
What embeddings do:
Convert text into vectors (lists of numbers) that capture semantic meaning. “How do I return a product?” and “What is your refund policy?” produce similar vectors even though they share few words.

Model selection:
• text-embedding-3-small: Best cost/quality balance for most use cases ($0.02 per million tokens)
• text-embedding-3-large: Higher quality, roughly 6.5x more expensive ($0.13 per million tokens)
• Open-source alternatives (e.g., BGE, E5): Free to run but require hosting infrastructure

Cost at scale:
Embedding 1 million documents at 512 tokens each costs $10–$90 depending on the model. Re-embedding when you change models costs the same again. Choose carefully upfront.
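The arithmetic behind that range, using the per-million-token list prices above as assumptions (always verify against current pricing):

```python
def embedding_cost(num_docs: int, tokens_per_doc: int,
                   price_per_million: float) -> float:
    """Total embedding cost in dollars for a corpus."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million

# 1M documents x 512 tokens = 512M tokens to embed.
small = embedding_cost(1_000_000, 512, 0.02)  # text-embedding-3-small
large = embedding_cost(1_000_000, 512, 0.13)  # text-embedding-3-large
# small ~= $10.24, large ~= $66.56 -- and re-embedding after a model
# or chunking change costs the full amount again.
```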
Metadata Preservation
Store metadata alongside each chunk: source URL, document title, author, last updated date, section heading, document type. This enables filtering (“only search HR policies”), temporal weighting (“prefer recent documents”), and source citation in answers.
Start simple, measure, then optimize: Begin with 512-token fixed chunks, text-embedding-3-small, and basic metadata. Measure retrieval quality on 200+ test queries. Only add complexity (semantic chunking, larger embeddings, re-ranking) when you have evidence that simpler approaches aren’t meeting your quality threshold.
search
Retrieval Strategies
Vector search alone isn’t enough — how production systems find the right documents
Vector Search
How it works: Convert the query to a vector, find the nearest vectors in the database using approximate nearest-neighbor (ANN) algorithms like HNSW.

Strengths: Captures semantic meaning. “How do I cancel my subscription?” matches a document about “account termination procedures” even without shared keywords.

Weaknesses: Misses exact keyword matches. Searching for “error code E-4021” might return documents about errors in general rather than the specific code. Also struggles with ambiguous queries (“cellular” could mean biology or phone plans).
Keyword Search (BM25)
How it works: Traditional text search based on word frequency and document length. The same technology behind search engines.

Strengths: Exact matches. “Error code E-4021” finds exactly the document containing that code. Fast and well-understood.

Weaknesses: No semantic understanding. “How do I cancel?” won’t match “account termination” because the words are different.
Hybrid Search (The Standard)
Combine both. Run vector search and keyword search in parallel, then merge results using Reciprocal Rank Fusion (RRF). Documents that rank highly in both searches get boosted to the top.

Hybrid search is now the production standard for RAG systems. It captures both semantic meaning and exact matches, covering each approach’s blind spots.
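Reciprocal Rank Fusion itself is only a few lines. Each input list ranks document IDs best-first; a document's fused score is the sum of 1/(k + rank) across lists, with k = 60 as the customary damping constant:

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge several best-first rankings.
    Documents that rank highly in multiple lists accumulate the
    largest fused scores and rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]  # semantic ranking
bm25_hits = ["d1", "d9", "d3"]    # keyword ranking
merged = rrf_merge([vector_hits, bm25_hits])
# d1 appears near the top of both lists, so it wins the fused ranking.
```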
Re-Ranking
After initial retrieval, pass the top 20–50 results through a cross-encoder re-ranker. Unlike embedding models (which encode query and document separately), cross-encoders process the query and document together, producing more accurate relevance scores.

Re-ranking boosts precision by 18–42% in production systems. It’s slower (can’t search the full database this way) but dramatically improves the final set of documents sent to the LLM.
Temporal Filtering
Apply time-aware weighting so recent documents outrank stale ones. Critical for knowledge bases where policies change, products update, and old information becomes misleading.
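One simple way to implement temporal weighting is exponential decay on document age; the 180-day half-life below is an illustrative assumption you would tune per knowledge base:

```python
def recency_weight(relevance: float, age_days: float,
                   half_life_days: float = 180.0) -> float:
    """Exponential decay: a document's score halves every
    `half_life_days`. Applied after retrieval scoring."""
    return relevance * 0.5 ** (age_days / half_life_days)

fresh = recency_weight(0.80, age_days=0)    # unchanged: 0.80
stale = recency_weight(0.90, age_days=720)  # four half-lives: 0.9 * 0.0625
# The stale document started with a higher raw score but now ranks below
# the fresh one -- exactly the "2023 policy outranks 2026 policy" fix.
```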
The retrieval stack: Hybrid search (vector + BM25) → re-ranking → temporal filtering → top-K selection. Each layer improves precision. The PM should track retrieval quality separately from generation quality — if retrieval is broken, no amount of prompt engineering fixes the output.
RAG Failure Modes
The “vibe check” problem — RAG systems that pass demos but fail in production
Retrieval Failures
1. Wrong documents retrieved.
The query “cellular data plan” retrieves biology documents about cellular structures. Ambiguous queries are the most common retrieval failure. Mitigation: hybrid search, query expansion, metadata filtering.

2. Relevant documents not in the knowledge base.
The user asks about a product that was launched last week, but the knowledge base hasn’t been updated. Mitigation: automated ingestion pipelines with freshness monitoring.

3. Stale documents outranking current ones.
An outdated return policy from 2023 ranks higher than the current 2026 policy because it has more keyword matches. Mitigation: temporal filtering, document versioning, explicit deprecation.

4. Noisy retrieval.
The top-K results include 3 relevant and 7 irrelevant chunks. The irrelevant chunks confuse the LLM and trigger hallucinations. Mitigation: re-ranking, stricter relevance thresholds, fewer but higher-quality chunks.
Generation Failures
5. Hallucination despite context.
The LLM ignores the retrieved documents and generates information from its training data. The answer sounds authoritative but isn’t grounded in your knowledge base. Mitigation: strong grounding instructions in the system prompt, faithfulness evaluation.

6. Synthesizing across contradictory sources.
Two retrieved documents give conflicting information (old policy vs. new policy). The LLM blends them into a confidently wrong answer. Mitigation: document versioning, conflict detection, preferring the most recent source.

7. Missing the answer in the context.
The relevant information is in the retrieved chunks, but the LLM fails to find it — especially in long contexts. LLMs have a “lost in the middle” problem where they attend more to the beginning and end of the context. Mitigation: put the most relevant chunks first, limit context length.
The cascading failure: RAG failures cascade across stages. Bad chunking → bad embeddings → bad retrieval → bad generation. When the output is wrong, the PM must diagnose which stage failed. Was the right document in the knowledge base? Was it retrieved? Was it ranked highly? Did the LLM use it? Each stage needs its own monitoring.
Evaluating RAG Systems
Separate metrics for retrieval and generation — because they fail independently
Retrieval Metrics
Recall@K: Of all relevant documents, how many appear in the top K results? Target: >85% at K=10 for production systems.

Precision@K: Of the top K retrieved documents, how many are actually relevant? Higher precision means less noise for the LLM.

Mean Reciprocal Rank (MRR): How early does the first relevant result appear? Important because LLMs pay more attention to earlier context (positional bias).

Context relevance: Semantic alignment between the query and retrieved passages. Measured using embedding similarity or cross-encoder scores.
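The first three retrieval metrics are straightforward to compute once you have labeled the relevant documents for each test query; a minimal sketch:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Of all relevant docs, what fraction appear in the top k?"""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Of the top k retrieved docs, what fraction are relevant?"""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def mrr(queries: list[tuple[list[str], set[str]]]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)

retrieved = ["d2", "d5", "d1", "d9"]
relevant = {"d1", "d5"}
r = recall_at_k(retrieved, relevant, k=3)     # both relevant docs in top 3 -> 1.0
p = precision_at_k(retrieved, relevant, k=3)  # 2 of 3 -> 0.667
m = mrr([(retrieved, relevant)])              # first hit at rank 2 -> 0.5
```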
Generation Metrics
Faithfulness / Groundedness: Does the answer stay grounded in the retrieved context? Or does it introduce information not present in the documents? This is the hallucination metric for RAG.

Answer relevancy: Does the answer actually address the user’s question? A grounded but off-topic answer is still a failure.

Completeness: Does the answer cover all relevant aspects from the retrieved context? Partial answers frustrate users.
The Three-Layer Evaluation
Executive layer: Manual accuracy checks on 200–1,000 representative questions, with inter-rater agreement (Cohen’s kappa) to validate the human labels. Simple, defensible KPIs.

Team layer: Standardized frameworks like RAGAS for reproducible, automated evaluation across retrieval and generation dimensions.

Developer layer: Fine-grained telemetry for debugging. Per-query retrieval scores, chunk-level relevance, latency breakdowns, hallucination source tracing.
The evaluation rule: Always evaluate retrieval and generation separately. If the answer is wrong, first check: were the right documents retrieved? If yes, the generation failed. If no, the retrieval failed. Different problems require different fixes. Conflating them wastes engineering time.
Knowledge Base Operations
RAG is only as good as the data behind it — keeping the knowledge base healthy
Ingestion Pipeline
The knowledge base needs two ingestion lanes:

Batch backfill: Initial load of all historical documents. Run once at setup, then periodically for full re-indexing (e.g., when you change the embedding model or chunking strategy).

Incremental updates: New and updated documents ingested continuously. A new support article published at 2pm should be retrievable by 2:15pm. This requires automated pipelines that detect changes, re-chunk, re-embed, and update the vector database.

Deletion and deprecation: When documents are outdated or removed, the corresponding chunks must be deleted from the vector database. Stale chunks are a common source of wrong answers.
Data Quality for RAG
Source quality matters most. RAG can’t fix bad source documents. If your knowledge base contains contradictory, outdated, or poorly written content, the AI will faithfully reproduce those problems.

Common data quality issues:
• Duplicate documents (same info, different versions) → conflicting answers
• Outdated content not marked as deprecated → stale answers
• Poorly structured documents (no headings, no sections) → bad chunking
• Missing metadata (no dates, no authors) → can’t filter or prioritize
• Inconsistent terminology across documents → retrieval gaps

The PM’s role: Champion a knowledge base hygiene program. Regular audits, clear ownership of content areas, deprecation workflows, and quality standards for new content. This is often the highest-leverage investment in RAG quality.
The 80/20 of RAG quality: Teams spend 80% of their time optimizing embeddings, re-rankers, and prompts. But 80% of quality issues come from the source data: missing documents, stale content, poor structure, and contradictions. Fix the data first. Then optimize the pipeline.
The PM’s RAG Playbook
Practical decisions and trade-offs for shipping a RAG product
Key PM Decisions
1. Scope the knowledge base.
What data sources are included? What’s excluded? Start narrow (one document collection) and expand. Every new source adds complexity and potential failure modes.

2. Define freshness requirements.
How quickly must new information be available? Real-time (minutes), near-real-time (hours), or batch (daily)? Freshness requirements drive infrastructure cost.

3. Set the citation standard.
Must every answer cite its sources? Can the AI say “I don’t know”? How are citations displayed? Source citation is the primary trust mechanism in RAG products.

4. Define the “I don’t know” behavior.
When no relevant documents are found, the AI should say so rather than hallucinate. This requires a retrieval confidence threshold: below it, the system admits uncertainty rather than guessing.
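A sketch of that abstention gate. The 0.55 threshold and the `ANSWER_FROM` placeholder are made up for illustration; in practice you calibrate the threshold on your evaluation set and the placeholder is replaced by the actual LLM call:

```python
# Hypothetical threshold -- calibrate on your own evaluation queries.
CONFIDENCE_THRESHOLD = 0.55

def answer_or_abstain(scored_chunks: list[tuple[str, float]]) -> str:
    """If no retrieved chunk clears the relevance threshold, refuse to
    answer instead of letting the LLM guess from weak context."""
    if not scored_chunks or max(s for _, s in scored_chunks) < CONFIDENCE_THRESHOLD:
        return "I don't have enough information to answer that."
    context = [text for text, score in scored_chunks
               if score >= CONFIDENCE_THRESHOLD]
    return f"ANSWER_FROM:{len(context)}_chunks"  # placeholder for the LLM call

weak = answer_or_abstain([("chunk about shipping", 0.31)])       # abstains
strong = answer_or_abstain([("return policy chunk", 0.82),
                            ("irrelevant noise", 0.12)])         # answers from 1 chunk
```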
The RAG Maturity Ladder
Level 1: Basic RAG
Vector search, fixed chunking, single data source. Good for internal tools and proofs of concept. 2–4 weeks to build.

Level 2: Production RAG
Hybrid search, re-ranking, multiple data sources, automated ingestion, source citations. Good for customer-facing products. 2–3 months to build.

Level 3: Advanced RAG
Agentic RAG (the AI decides what to search and when), multi-step retrieval, query decomposition, graph-based knowledge structures. For complex domains with interconnected information. 4–6+ months.

Level 4: Enterprise RAG
Multi-tenant access control, compliance audit trails, real-time ingestion, cross-language retrieval, feedback-driven re-indexing. For regulated industries at scale. Ongoing investment.
The bottom line: RAG is the most common architecture for enterprise AI products because it solves the knowledge problem without retraining models. But “just add RAG” is deceptively simple. The quality depends on chunking strategy, retrieval approach, source data quality, and continuous evaluation. The PM who understands the full pipeline can diagnose issues, set realistic expectations, and make informed trade-offs between cost, quality, and freshness.