Ch 10 — RAG & Context Injection

How to feed external documents into prompts effectively — structure, label, cite, and prevent hallucination
What Is RAG? (You’re Already Doing It)
Every time you paste a document into a prompt, you’re doing manual Retrieval-Augmented Generation
RAG in 30 Seconds
Retrieval-Augmented Generation means: instead of relying on the model’s training data (which is frozen and potentially outdated), you inject relevant documents into the prompt so the model can answer based on your data.

The automated version uses vector databases and embedding search to find relevant chunks. But the prompting part — how you present the context to the model — is the same whether you retrieve chunks automatically or paste them manually.
The RAG Pipeline
# Simplified RAG flow
1. User asks a question
2. Retrieve relevant document chunks (vector search, keyword search, etc.)
3. Inject those chunks into the prompt alongside the question
4. Generate an answer grounded in the provided context

# This chapter focuses on step 3:
# HOW you inject context matters as much as WHAT you inject.
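The flow above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the `score` function is a naive keyword-overlap stand-in for real embedding/vector search, and the document strings are invented examples.

```python
def score(chunk: str, question: str) -> int:
    """Naive relevance score: how many question words appear in the chunk.
    A stand-in for embedding/vector search in a real retriever."""
    q_words = set(question.lower().split())
    return sum(1 for w in q_words if w in chunk.lower())

def retrieve(chunks: list[str], question: str, k: int = 3) -> list[str]:
    """Step 2: return the k chunks most relevant to the question."""
    return sorted(chunks, key=lambda c: score(c, question), reverse=True)[:k]

def build_prompt(chunks: list[str], question: str) -> str:
    """Step 3: inject the retrieved chunks into the prompt."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Hypothetical document chunks for illustration
docs = [
    "Enterprise plan: $79/user/month on annual billing.",
    "Pro plan: $29/user/month.",
    "Support hours: Mon-Fri 9am-5pm.",
]
question = "What is the Enterprise annual price?"
prompt = build_prompt(retrieve(docs, question, k=2), question)
```

Step 4 would then send `prompt` to whatever model API you use; everything after this point in the chapter is about making that prompt string trustworthy.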
Why Context Presentation Matters
You can retrieve the perfect document chunks, but if you present them poorly, the model will:

Ignore relevant context buried in a wall of text
Hallucinate details that aren’t in the context
Blend context with training data (mixing your facts with its “knowledge”)
Fail to cite sources, making answers unverifiable

The prompt engineering around context injection is what separates a “sometimes useful” RAG system from a “reliably accurate” one.
Key insight: RAG has two halves: retrieval (finding the right chunks) and generation (prompting the model to use them correctly). Most tutorials focus on retrieval. This chapter focuses on the generation side — the prompting techniques that make or break answer quality.
The “Dump & Pray” Anti-Pattern
Pasting a 10-page document and hoping the model finds the answer
Superficial Approach
Prompt:

Here's our pricing documentation:

[10 pages of text pasted verbatim, including table of contents,
headers, footers, page numbers, legal disclaimers, and 47 pricing
scenarios]

What's the price for the Enterprise plan with annual billing?
What Goes Wrong
1. Needle in a haystack: The answer is in paragraph 23 of 47. The model has to scan everything to find it.

2. Conflicting information: The document mentions “Enterprise: $99/mo” in one section and “Enterprise annual: $79/mo (billed annually)” in another. The model might pick either.

3. Hallucination risk: If the model can’t find the exact answer, it will confidently make one up based on patterns in the document.

4. Token waste: You’re paying for 10 pages of input tokens when only 2 paragraphs are relevant.
Model Output (Dump & Pray)
The Enterprise plan with annual billing is $89 per user per month, billed annually at $1,068 per user per year.
Is This Right?
Maybe. Maybe not. You can’t tell because:

• No citation — which section did this come from?
• The $89 figure might be from an outdated section
• “Per user” might be wrong — maybe it’s per seat
• The annual calculation might be the model doing math, not quoting the document

Without citations, every RAG answer is unverifiable.
The pattern: Dumping raw documents produces answers that look confident but are often wrong or unverifiable. The model treats the entire document as “context” and freely blends, summarizes, and interpolates. You need structure, labels, and citation requirements.
The Fix: Label, Cite, Constrain
Structure your context with labels and require citations in every answer
Deliberate RAG Prompt
Answer the user's question using ONLY the provided context excerpts.
Cite which excerpt supports each claim using [A], [B], or [C].
If the excerpts don't contain the answer, say "Not found in provided context."

[A] Enterprise Pricing (2025 rate card):
"Enterprise plan: $99/user/month on monthly billing. $79/user/month on
annual billing (billed as $948/user/year). Minimum 50 seats. Includes
SSO, audit logs, and dedicated support."

[B] Discount Policy:
"Enterprise annual contracts over 200 seats qualify for volume discount:
200-499 seats: 10% off, 500+ seats: 15% off. Applied to the annual rate."

[C] Plan Comparison:
"Enterprise includes everything in Pro, plus: SSO/SAML, audit logging,
custom roles, 99.99% SLA, dedicated CSM."

Question: What's the price for the Enterprise plan with annual billing?
Model Output (Structured)
The Enterprise plan with annual billing is $79/user/month, billed annually at $948/user/year [A]. There is a minimum of 50 seats [A]. If you have 200+ seats, volume discounts apply: 10% off for 200-499 seats, 15% off for 500+ seats [B].
Why This Is Better
1. Verifiable: Every claim has a citation. You can check [A] to confirm $79/user/month.

2. No hallucination: The “only use provided context” constraint prevents the model from inventing details.

3. Transparent gaps: If the answer isn’t in the excerpts, the model says so instead of guessing.

4. Efficient: Only relevant excerpts are included, saving tokens and reducing noise.
Key insight: Three rules for RAG prompts: (1) Label every context chunk with a reference ID, (2) Require citations in the answer, (3) Constrain the model to only use provided context. These three rules eliminate most hallucination in RAG systems.
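The three rules are mechanical enough to automate. Here is a hedged sketch of a prompt builder that applies them; the `source`/`text` dict shape and the wording of the instructions are assumptions for illustration, not a fixed API.

```python
import string

def build_rag_prompt(chunks: list[dict], question: str) -> str:
    """Apply the three RAG rules:
    (1) label every chunk with a reference ID,
    (2) require citations in the answer,
    (3) constrain the model to the provided context only.
    Each chunk is a dict with 'source' and 'text' keys (assumed shape).
    """
    labels = string.ascii_uppercase  # [A], [B], [C], ...
    context = "\n\n".join(
        f"[{labels[i]}] {c['source']}:\n\"{c['text']}\""
        for i, c in enumerate(chunks)
    )
    return (
        "Answer the question using ONLY the provided context excerpts.\n"
        "Cite the excerpt supporting each claim using its [letter] label.\n"
        'If the excerpts don\'t contain the answer, say '
        '"Not found in provided context."\n\n'
        f"{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical chunks for illustration
chunks = [
    {"source": "Enterprise Pricing (2025 rate card)",
     "text": "$79/user/month on annual billing."},
    {"source": "Discount Policy",
     "text": "200-499 seats: 10% off the annual rate."},
]
prompt = build_rag_prompt(chunks, "What's the annual price?")
```

Because the labels are generated in order, the answer's citations ([A], [B]) always map back to a specific retrieved chunk, which is what makes spot-checking possible.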
Domain Example: Legal Contract Review
Where hallucination isn’t just annoying — it’s dangerous
The Scenario
Your legal team needs to review a vendor contract. They want to know: What are the termination conditions? What’s the liability cap? Are there auto-renewal clauses?
Contract Review Prompt
You are a legal document analyst. Answer questions about this contract
using ONLY the provided clauses. For each answer:
1. Quote the exact clause text
2. Cite the clause number
3. Flag any ambiguities or missing info
4. If the contract doesn't address the question, say "NOT ADDRESSED"

[Clause 8.1 — Termination for Cause]:
"Either party may terminate this Agreement for cause upon 30 days written
notice if the other party materially breaches any obligation and fails
to cure within the notice period."

[Clause 8.3 — Termination for Convenience]:
"Client may terminate for convenience upon 90 days written notice.
Early termination fee: remaining contract value × 25%."

[Clause 12.1 — Limitation of Liability]:
"Neither party's aggregate liability shall exceed the total fees paid
in the 12 months preceding the claim."

Questions:
1. Can we terminate without cause?
2. What's the liability cap?
3. Is there an auto-renewal clause?
Model Output
1. Termination without cause: Yes. Per Clause 8.3, Client may terminate
for convenience with 90 days written notice. However, an early termination
fee applies: "remaining contract value × 25%" [Clause 8.3].
⚠ Note: Only the Client has this right. The vendor cannot terminate for
convenience — only for cause [Clause 8.1].

2. Liability cap: "Neither party's aggregate liability shall exceed the
total fees paid in the 12 months preceding the claim" [Clause 12.1].
⚠ Ambiguity: This doesn't specify whether the cap applies to
indirect/consequential damages or only direct damages. Recommend clarifying.

3. Auto-renewal: NOT ADDRESSED in the provided clauses. The contract
excerpts do not contain an auto-renewal or renewal clause. Recommend
reviewing the full agreement for renewal terms.
Key insight: The “NOT ADDRESSED” response for question 3 is the most valuable output. Without the constraint, the model would have invented a plausible auto-renewal clause based on its training data. In legal contexts, a confident hallucination is worse than no answer.
Anti-Hallucination Techniques
Specific prompt patterns that prevent the model from making things up
Technique 1: The Grounding Constraint
"Answer using ONLY the provided context. Do not use any outside knowledge. If the context doesn't contain the answer, say 'Information not found in provided documents.'"
Technique 2: Mandatory Citations
"Every factual claim must cite its source using [Source ID]. Claims without citations are not allowed."
Technique 3: Confidence Flagging
"For each answer, rate your confidence:
HIGH = directly stated in context
MEDIUM = inferred from context
LOW = partially supported
NONE = not in context (don't answer)"
Technique 4: The Extraction-Only Pattern
"Extract and quote the relevant passages that answer this question. Do not summarize, paraphrase, or interpret. Return the exact text from the document with the source reference."
Technique 5: Negative Instructions
"NEVER:
- Make up information not in the context
- Combine information from context with your training knowledge
- Provide an answer if you're not sure it's supported by the context
- Guess dates, numbers, or names"
Which to Use When
Low stakes (FAQ bot): Grounding constraint + citations
Medium stakes (internal docs): Add confidence flagging
High stakes (legal, medical, financial): All five techniques combined
Key insight: No single technique eliminates hallucination completely. Layer them: grounding constraint as the baseline, citations for verifiability, confidence flagging for transparency, and negative instructions for the most critical applications. Defense in depth.
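The layering can be expressed directly in code. The sketch below composes the five techniques by stakes level; the rule wordings are condensed from this section, and the three-tier mapping follows the "Which to Use When" guidance above. Names like `constraint_block` are hypothetical.

```python
# Condensed versions of the five techniques from this section
GROUNDING = ("Answer using ONLY the provided context. "
             "If the context doesn't contain the answer, say so.")
CITATIONS = "Every factual claim must cite its source using [Source ID]."
CONFIDENCE = "Rate each answer's confidence: HIGH / MEDIUM / LOW / NONE."
EXTRACTION = "Quote the exact relevant text; do not summarize or interpret."
NEGATIVES = ("NEVER invent information, blend in training knowledge, "
             "or guess dates, numbers, or names.")

# Layered defense: more stakes, more constraints
STAKES = {
    "low":    [GROUNDING, CITATIONS],
    "medium": [GROUNDING, CITATIONS, CONFIDENCE],
    "high":   [GROUNDING, CITATIONS, CONFIDENCE, EXTRACTION, NEGATIVES],
}

def constraint_block(stakes: str) -> str:
    """Build the instruction block for a given stakes level."""
    return "\n".join(f"- {rule}" for rule in STAKES[stakes])
```

A FAQ bot would prepend `constraint_block("low")` to its prompts; a contract-review tool would use `"high"`. Keeping the rules as named constants makes it easy to audit which defenses a given prompt actually carries.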
Context Chunking: Less Is More
How to select and trim context for maximum relevance and minimum noise
The Chunking Principle
More context ≠ better answers. Research consistently shows that 3–5 highly relevant chunks outperform 20 loosely relevant ones. Why?

1. Signal-to-noise ratio: Irrelevant context dilutes the relevant parts
2. Lost in the middle: LLMs pay more attention to the beginning and end of the context (Liu et al., 2023). Important info in the middle gets overlooked.
3. Contradictions: More chunks = higher chance of conflicting information
4. Token cost: Every irrelevant chunk costs money and latency
Optimal Chunk Strategy
# For a question-answering RAG system:

Retrieve:  Top 5-10 chunks by relevance
Re-rank:   Score by relevance to the specific question (not just topic)
Select:    Top 3-5 after re-ranking
Order:     Most relevant first and last (exploit primacy/recency bias)
Label:     [A], [B], [C] with source info
The “Lost in the Middle” Problem
Liu et al. (2023) showed that LLMs perform best when the answer is at the beginning or end of the context, and worst when it’s in the middle. For a 20-chunk context:

• Answer in chunk 1: ~90% accuracy
• Answer in chunk 10: ~60% accuracy
• Answer in chunk 20: ~85% accuracy

Practical fix: Put the most relevant chunk first. If you have many chunks, put the second-most-relevant last.
Key insight: Context injection is a quality game, not a quantity game. Retrieve broadly, then aggressively filter down to the 3–5 most relevant chunks. Order them strategically (best first, second-best last). Your RAG system’s accuracy depends more on chunk selection than on the model you use.
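The "best first, second-best last" ordering is a one-line list manipulation. A minimal sketch, assuming the input is already sorted most-relevant-first by your re-ranker:

```python
def order_for_attention(chunks_by_relevance: list[str]) -> list[str]:
    """Reorder chunks (already sorted most-relevant-first) so the top
    chunk leads and the runner-up closes the context, exploiting the
    primacy/recency bias described by Liu et al. (2023)."""
    if len(chunks_by_relevance) < 3:
        return chunks_by_relevance  # nothing to gain from reordering
    best, second, *rest = chunks_by_relevance
    return [best, *rest, second]
```

For example, chunks ranked `[A, B, C, D]` become `[A, C, D, B]`: the least relevant material lands in the middle, where attention is weakest.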
Multi-Source Context: Handling Conflicting Information
When different documents say different things, tell the model how to resolve conflicts
The Problem
Real-world RAG systems pull from multiple sources: product docs, support tickets, internal wikis, Slack messages. These sources often contradict each other because they were written at different times by different people.
Conflict Resolution Prompt
Answer using the provided sources. Each source has a date and type.
When sources conflict, follow these rules:

Priority order:
1. Official documentation (highest)
2. Release notes
3. Support tickets
4. Internal wiki
5. Slack messages (lowest)

When same-priority sources conflict: Use the most recent source.

Always flag conflicts: "Note: [Source A] says X, but [Source B] says Y.
Using [Source A] because [reason]."

[A] Official Docs (2025-02-15):
"Rate limit: 1000 requests/minute per API key."

[B] Internal Wiki (2024-11-03):
"Rate limit: 500 requests/minute."

[C] Release Notes (2025-01-20):
"Increased rate limits to 1000 req/min for all tiers."
Model Output
The current rate limit is 1000 requests per minute per API key [A].

Note: The internal wiki [B] states 500 req/min, but this is outdated.
Release notes from 2025-01-20 [C] confirm the limit was increased to
1000 req/min, and the official docs [A] reflect this updated value.
Using official docs [A] as the authoritative source.
Why This Matters
Without conflict resolution rules, the model would either:

• Pick one source randomly
• Average them (“the rate limit is 500–1000 req/min”)
• Hallucinate a compromise

Explicit priority rules and conflict flagging make the answer trustworthy and auditable.
Key insight: Multi-source RAG without conflict resolution is a hallucination factory. Always define: (1) source priority hierarchy, (2) recency rules for same-priority conflicts, and (3) mandatory conflict flagging. The model should never silently choose between contradicting sources.
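You can also resolve conflicts deterministically before the chunks ever reach the prompt. A hedged sketch, using the priority hierarchy from this section; the `type`/`date`/`text` dict shape and the `PRIORITY` key names are assumptions for illustration:

```python
from datetime import date

# Lower number = higher authority (hierarchy from this section)
PRIORITY = {
    "official_docs": 1,
    "release_notes": 2,
    "support_ticket": 3,
    "internal_wiki": 4,
    "slack": 5,
}

def authoritative(sources: list[dict]) -> dict:
    """Pick the winning source: highest priority first, then most recent.
    Each source is a dict with 'type', 'date' (ISO string), 'text'."""
    return min(
        sources,
        key=lambda s: (PRIORITY[s["type"]],
                       -date.fromisoformat(s["date"]).toordinal()),
    )

# The rate-limit example from this section
sources = [
    {"type": "official_docs", "date": "2025-02-15", "text": "1000 req/min"},
    {"type": "internal_wiki", "date": "2024-11-03", "text": "500 req/min"},
    {"type": "release_notes", "date": "2025-01-20", "text": "1000 req/min"},
]
winner = authoritative(sources)
```

Pre-resolving in code is cheaper than asking the model to do it, but you lose the conflict flag in the answer; in practice many systems do both: resolve in code, and still pass the losing source so the model can note the discrepancy.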
The RAG Prompt Template
A production-ready template for context-grounded question answering
Universal RAG Prompt Template
# INSTRUCTIONS
Answer the user's question using ONLY the provided context documents.
Follow these rules strictly:
1. Base every claim on the provided context
2. Cite sources using [Source ID] after each claim
3. If the context doesn't contain the answer, say "Not found in provided context"
4. Never combine context with outside knowledge
5. Flag any ambiguities or contradictions between sources

# CONTEXT DOCUMENTS
[A] {source_name} ({date}): "{chunk_text}"
[B] {source_name} ({date}): "{chunk_text}"
[C] {source_name} ({date}): "{chunk_text}"

# USER QUESTION
{question}
RAG Prompt Checklist
□ Grounding constraint: "ONLY the provided context"
□ Citation requirement: "Cite using [Source ID]"
□ Fallback instruction: "If not found, say so"
□ Anti-hallucination: "Never use outside knowledge"
□ Labeled chunks: [A], [B], [C] with source metadata
□ Chunk ordering: most relevant first and last
□ Conflict resolution: priority rules if multi-source
□ 3-5 chunks maximum: quality over quantity
Key insight: RAG prompt engineering is about trust engineering. Every technique in this chapter — labeling, citations, grounding constraints, conflict resolution — exists to make the answer verifiable and trustworthy. An uncited RAG answer is no better than a guess. A cited, grounded, conflict-flagged answer is a reliable tool.