Ch 8 — Generation: Synthesizing Answers

Turning retrieved context into grounded, cited answers
High Level
Pipeline: Context → Prompt → LLM → Citations → Stream → Guardrails → Answer
The Generation Step
Where retrieved chunks become a coherent answer
What Happens Here
Generation is the final step of the RAG pipeline. The LLM receives the user's question plus the retrieved context chunks and produces a natural language answer. This is where all the work from chunking, embedding, retrieval, and reranking pays off.

The quality of the answer depends on:
1. Context quality — Are the right chunks in the prompt?
2. Prompt design — How you instruct the LLM to use the context
3. Model choice — GPT-4o, Claude, Gemini, open-source models
4. Output controls — Citations, guardrails, formatting
The Basic Pattern
Every RAG generation follows the same structure: a system prompt that sets the rules, the retrieved context inserted into the prompt, and the user question. The LLM is instructed to answer based only on the provided context.
# The fundamental RAG prompt pattern

System: You are a helpful assistant. Answer the question based
ONLY on the provided context. If the context does not contain
the answer, say "I don't have enough information."

Context:
{retrieved_chunks}

User: {question}
The "answer only from context" instruction is critical. Without it, the LLM will happily fill gaps with its training data, which may be outdated or wrong. This instruction grounds the answer in your actual documents.
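The pattern above can be sketched as a small prompt builder. This is a framework-agnostic illustration; the function name, message layout, and the `retrieved_chunks` list are assumptions, not part of any specific API.

```python
# Minimal sketch: assemble a grounded RAG prompt as chat messages.
SYSTEM = (
    "You are a helpful assistant. Answer the question based ONLY on the "
    "provided context. If the context does not contain the answer, say "
    "\"I don't have enough information.\""
)

def build_prompt(question: str, retrieved_chunks: list[str]) -> list[dict]:
    """Return a chat-style message list: system rules, then context + question."""
    context = "\n\n".join(retrieved_chunks)
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

messages = build_prompt(
    "What is the refund policy?",
    ["Customers may request a full refund within 30 days..."],
)
```

The same message list can then be passed to any chat-completion API.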
Prompt Engineering for RAG
Designing prompts that produce grounded, useful answers
Key Prompt Principles
Be explicit about grounding: "Answer ONLY based on the provided context." This prevents the LLM from using its training data.

Handle missing information: "If the context does not contain the answer, say so clearly." This prevents hallucination on gaps.

Request citations: "Cite the source document for each claim." This enables verification.

Set the tone: "Be concise and professional." Match the output style to your use case.

Specify format: "Respond in bullet points" or "Respond in JSON." Structure the output for downstream processing.
# Production RAG prompt template

System: You are a knowledge assistant for Acme Corp.
Answer questions using ONLY the provided context.

Rules:
- If the context doesn't contain the answer, say "I don't have information about that."
- Cite sources using [Source: filename, page X]
- Be concise. Use bullet points for lists.
- Never make up information.

Context:
---
Source: policies/refunds.pdf, Page 7
Customers may request a full refund within 30 days of purchase...
---
Source: policies/refunds.pdf, Page 8
After 30 days, only store credit is available...
---

User: What is the refund policy?
Include source metadata in the context. When you format the retrieved chunks, include the source filename, page number, and any other metadata. This enables the LLM to cite specific sources in its answer, which users can verify.
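One way to produce that context string is a small formatting helper. The chunk structure here (a `text` field plus a `metadata` dict with `source` and `page` keys) is an assumption for illustration; adapt it to whatever your retriever returns.

```python
# Format retrieved chunks with source metadata so the LLM can cite them.
def format_context(chunks: list[dict]) -> str:
    """Render chunks as '---'-separated blocks with a Source header each."""
    blocks = []
    for c in chunks:
        meta = c["metadata"]
        blocks.append(
            f"---\nSource: {meta['source']}, Page {meta['page']}\n{c['text']}"
        )
    return "\n".join(blocks) + "\n---"

context = format_context([
    {"text": "Customers may request a full refund within 30 days of purchase...",
     "metadata": {"source": "policies/refunds.pdf", "page": 7}},
    {"text": "After 30 days, only store credit is available...",
     "metadata": {"source": "policies/refunds.pdf", "page": 8}},
])
```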
Choosing the Right LLM
Balancing quality, speed, cost, and context window
Key Factors
Context window: How many tokens can the model process? GPT-4o supports 128K tokens. Claude 3.5 Sonnet supports 200K. Gemini 1.5 Pro supports 1M+. Larger windows let you include more chunks.

Instruction following: How well does the model stick to "answer only from context"? Larger models are better at this.

Latency: Time to first token and tokens per second. Smaller models (GPT-4o-mini, Claude 3.5 Haiku) are faster.

Cost: Per-token pricing varies 100x between models. GPT-4o-mini is ~$0.15/1M input tokens; GPT-4o is ~$2.50/1M.
Practical Recommendations
Prototyping: GPT-4o-mini or Claude 3.5 Haiku. Fast, cheap, good enough to validate your pipeline.

Production (quality-focused): GPT-4o or Claude 3.5 Sonnet. Best instruction following and reasoning.

Production (cost-focused): GPT-4o-mini, Gemini 1.5 Flash, or Claude 3.5 Haiku. 10-20x cheaper than flagship models with 80-90% of the quality.

Self-hosted: Llama 3.1 70B or Mixtral 8x7B via vLLM or TGI. Full data control, no API dependency.
For most RAG applications, GPT-4o-mini or Claude 3.5 Haiku is sufficient. The context provides the knowledge; the model just needs to synthesize it. You rarely need the full reasoning power of flagship models. Save the budget for more retrieval calls or reranking instead.
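The cost trade-off is easy to quantify per query. A rough sketch using the input price quoted above for GPT-4o-mini (~$0.15/1M) and an assumed output price of ~$0.60/1M; actual pricing changes over time, so treat the numbers as illustrative:

```python
# Rough per-query cost estimate. Prices are illustrative USD per 1M tokens.
def query_cost(input_tokens: int, output_tokens: int,
               price_in_per_m: float, price_out_per_m: float) -> float:
    """Cost of one query = input and output tokens priced per million."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

# e.g. 8K tokens of retrieved context + question, 500 output tokens
cost = query_cost(8_000, 500, price_in_per_m=0.15, price_out_per_m=0.60)
# → 0.0015, i.e. about $1.50 per thousand queries
```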
Citations & Source Attribution
Enabling users to verify and trust the answer
Why Citations Matter
Citations transform RAG from "trust me" to "here's the proof." Users can click through to the source document and verify the answer. This is essential for enterprise, legal, medical, and financial use cases where accuracy is critical.
Citation Approaches
Inline citations: The LLM adds [1], [2] references in the text, with a source list at the bottom. Simple and familiar.

Per-sentence attribution: Each sentence is tagged with its source chunk. More granular but harder to implement.

Structured output: The LLM returns JSON with answer text and source references. Best for programmatic processing.

Highlighted excerpts: Show the exact passage from the source that supports each claim. Highest trust but most complex.
# Structured citation output
from pydantic import BaseModel
from typing import List

class Citation(BaseModel):
    text: str    # claim from the answer
    source: str  # document name
    page: int    # page number
    quote: str   # exact supporting quote

class RAGAnswer(BaseModel):
    answer: str
    citations: List[Citation]
    confidence: str  # "high", "medium", "low"

# Use with structured output (OpenAI, LangChain)
llm_with_structure = llm.with_structured_output(RAGAnswer)
Structured output with Pydantic models is the most reliable way to get citations. OpenAI's function calling / structured output and LangChain's with_structured_output() guarantee the response matches your schema. No regex parsing needed.
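Once you have a structured answer, rendering the familiar inline-citation style is a few lines of code. A minimal sketch, using plain dataclasses (rather than the Pydantic models) so it stands alone; the field names and rendering format are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Citation:
    source: str  # document name
    page: int    # page number

@dataclass
class RAGAnswer:
    answer: str
    citations: list[Citation]

def render_with_sources(result: RAGAnswer) -> str:
    """Append a numbered source list ([1] filename, page N) to the answer."""
    lines = [result.answer, "", "Sources:"]
    for i, c in enumerate(result.citations, start=1):
        lines.append(f"[{i}] {c.source}, page {c.page}")
    return "\n".join(lines)

out = render_with_sources(RAGAnswer(
    answer="Full refunds are available within 30 days. [1]",
    citations=[Citation(source="policies/refunds.pdf", page=7)],
))
```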
Streaming Responses
Showing the answer as it is generated, token by token
Why Stream?
RAG generation can take 2-10 seconds. Without streaming, the user stares at a blank screen. With streaming, they see the answer building word by word — the perceived latency drops dramatically. Time to first token (TTFT) becomes more important than total generation time.
How It Works
The LLM API returns a stream of token chunks via Server-Sent Events (SSE). Your backend forwards these chunks to the frontend in real time. The UI appends each chunk as it arrives. Most LLM APIs (OpenAI, Anthropic, Google) support streaming natively.
# LangChain streaming
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", streaming=True)

# Stream tokens
for chunk in rag_chain.stream({"question": "What is the refund policy?"}):
    print(chunk.content, end="", flush=True)

# FastAPI streaming endpoint
from fastapi.responses import StreamingResponse

@app.post("/chat")
async def chat(request: ChatRequest):
    async def generate():
        async for chunk in rag_chain.astream({"question": request.question}):
            yield chunk.content
    return StreamingResponse(generate())
Streaming and structured output are often incompatible. If you need JSON citations, you typically cannot stream. A common pattern: stream the answer text first, then append citations at the end. Or use two LLM calls — one streaming for the answer, one non-streaming for structured citations.
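The stream-first, cite-later pattern can be sketched with plain generators. The two stub functions below stand in for a streaming LLM call and a second structured-output call; their names and return shapes are assumptions for illustration:

```python
from typing import Iterator

def fake_answer_stream() -> Iterator[str]:
    """Stand-in for a streaming LLM call yielding token chunks."""
    yield from ["Full ", "refunds ", "within ", "30 days."]

def fetch_citations() -> list[str]:
    """Stand-in for a second, non-streaming structured-citation call."""
    return ["[1] policies/refunds.pdf, page 7"]

def chat_response() -> Iterator[str]:
    # Phase 1: forward answer tokens to the client as they arrive
    yield from fake_answer_stream()
    # Phase 2: append citations once the answer text is complete
    yield "\n\nSources:\n" + "\n".join(fetch_citations())

full = "".join(chat_response())
```

In a real endpoint, `chat_response` would be the generator handed to your streaming response.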
Guardrails & Safety
Preventing hallucination, off-topic answers, and harmful content
Common Guardrails
Grounding check: Verify the answer is supported by the retrieved context. If the LLM's answer contains claims not in any chunk, flag or remove them.

Topic guardrails: Restrict the system to only answer questions about your domain. Reject off-topic queries ("Write me a poem").

PII filtering: Detect and redact personal information (emails, phone numbers, SSNs) from the output.

Content safety: Block harmful, biased, or inappropriate content using moderation APIs (OpenAI Moderation, Azure Content Safety).
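A grounding check can be approximated with lexical overlap: flag answer sentences whose content words barely appear in any retrieved chunk. This is a crude heuristic sketch under that assumption, not a substitute for an NLI-based or LLM-based checker:

```python
import re

def is_grounded(sentence: str, chunks: list[str], threshold: float = 0.5) -> bool:
    """True if at least `threshold` of the sentence's words appear in one chunk."""
    words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    if not words:
        return True
    for chunk in chunks:
        chunk_words = set(re.findall(r"[a-z0-9]+", chunk.lower()))
        if len(words & chunk_words) / len(words) >= threshold:
            return True
    return False

chunks = ["Customers may request a full refund within 30 days of purchase."]
ok = is_grounded("A full refund is available within 30 days.", chunks)
bad = is_grounded("Shipping is free on all orders.", chunks)
```

Sentences that fail the check can be dropped, flagged in the UI, or sent to a second LLM call for verification.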
Implementation Approaches
Prompt-based: Include guardrail instructions in the system prompt. Simplest but least reliable.

Post-processing: Run a second LLM call or classifier on the output to check for violations.

Guardrails frameworks: NeMo Guardrails (NVIDIA), Guardrails AI, LangChain output parsers. These provide structured validation pipelines.
The "I don't know" response is a feature, not a bug. A RAG system that says "I don't have information about that" when the context is insufficient is far more trustworthy than one that confidently generates a wrong answer. Train your users to expect and trust this response.
Synthesis Strategies
Different ways to combine chunks into answers
Stuff (Default)
Put all retrieved chunks into a single prompt. Simple and works when the total context fits in the model's window. This is the default in LangChain and LlamaIndex.
Map-Reduce
Generate a partial answer from each chunk independently ("map"), then combine all partial answers into a final answer ("reduce"). Useful when you have too many chunks for a single prompt.
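The map-reduce flow can be sketched with a stubbed `llm` callable standing in for a real model call; the prompt wording and helper name are illustrative, not a framework API:

```python
def map_reduce_answer(question: str, chunks: list[str], llm) -> str:
    """llm(prompt) -> str is any LLM call; here it is injected for testability."""
    # Map: answer the question from each chunk independently
    partials = [
        llm(f"Context: {chunk}\nQuestion: {question}\n"
            "Answer from this context only:")
        for chunk in chunks
    ]
    # Reduce: combine the partial answers into one final answer
    combined = "\n".join(partials)
    return llm(f"Partial answers:\n{combined}\nCombine into one final answer:")

# Toy stub so the sketch runs without an API; it just counts calls.
calls = []
def stub_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"answer-{len(calls)}"

answer = map_reduce_answer(
    "What is the refund policy?",
    ["Full refunds within 30 days.", "Store credit after 30 days."],
    llm=stub_llm,
)
# Two map calls plus one reduce call
```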
Refine
Start with the first chunk, generate an initial answer. Then iteratively refine the answer by adding one chunk at a time. Each step can update, expand, or correct the previous answer. Good for building comprehensive answers from many sources.
# LangChain synthesis strategies
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# Stuff (default) - all docs in one prompt
stuff_chain = create_stuff_documents_chain(llm, prompt)

# Full RAG chain
rag_chain = create_retrieval_chain(retriever, stuff_chain)

result = rag_chain.invoke({"input": "What is the refund policy?"})
print(result["answer"])
print(result["context"])  # retrieved docs
Use "stuff" unless you have a reason not to. With modern 128K+ context windows, you can fit 50-100 chunks in a single prompt. Map-reduce and refine add complexity and multiple LLM calls. Only use them when your context genuinely exceeds the model's window.