Ch 8 — Generation: Synthesizing Answers

Turning retrieved context into grounded, cited answers
High Level
Pipeline: Context → Prompt → LLM → Citations → Stream → Guardrails → Answer
The Generation Step
Where retrieved chunks become a coherent answer
What Happens Here
Generation is the final step of the RAG pipeline. The LLM receives the user's question plus the retrieved context chunks and produces a natural language answer. This is where all the work from chunking, embedding, retrieval, and reranking pays off.

The quality of the answer depends on:
1. Context quality — Are the right chunks in the prompt?
2. Prompt design — How you instruct the LLM to use the context
3. Model choice — GPT-4o, Claude, Gemini, open-source models
4. Output controls — Citations, guardrails, formatting
The Basic Pattern
Every RAG generation follows the same structure: a system prompt that sets the rules, the retrieved context inserted into the prompt, and the user question. The LLM is instructed to answer based only on the provided context.
# The fundamental RAG prompt pattern

System: You are a helpful assistant. Answer the question based
ONLY on the provided context. If the context does not contain
the answer, say "I don't have enough information."

Context:
{retrieved_chunks}

User: {question}
The "answer only from context" instruction is critical. Without it, the LLM will happily fill gaps with its training data, which may be outdated or wrong. This instruction grounds the answer in your actual documents.
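The pattern above can be sketched as a small prompt builder. This is a framework-agnostic illustration; the function name, message layout, and the `retrieved_chunks` list are assumptions, not part of any specific API.

```python
# Minimal sketch: assemble a grounded RAG prompt as chat messages.
SYSTEM = (
    "You are a helpful assistant. Answer the question based ONLY on the "
    "provided context. If the context does not contain the answer, say "
    "\"I don't have enough information.\""
)

def build_prompt(question: str, retrieved_chunks: list[str]) -> list[dict]:
    """Return a chat-style message list: system rules, then context + question."""
    context = "\n\n".join(retrieved_chunks)
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

messages = build_prompt(
    "What is the refund policy?",
    ["Customers may request a full refund within 30 days..."],
)
```

The same message list can then be passed to any chat-completion API.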
Prompt Engineering for RAG
Designing prompts that produce grounded, useful answers
Key Prompt Principles
Be explicit about grounding: "Answer ONLY based on the provided context." This prevents the LLM from using its training data.

Handle missing information: "If the context does not contain the answer, say so clearly." This prevents hallucination on gaps.

Request citations: "Cite the source document for each claim." This enables verification.

Set the tone: "Be concise and professional." Match the output style to your use case.

Specify format: "Respond in bullet points" or "Respond in JSON." Structure the output for downstream processing.
# Production RAG prompt template

System: You are a knowledge assistant for Acme Corp.
Answer questions using ONLY the provided context.

Rules:
- If the context doesn't contain the answer, say "I don't have information about that."
- Cite sources using [Source: filename, page X]
- Be concise. Use bullet points for lists.
- Never make up information.

Context:
---
Source: policies/refunds.pdf, Page 7
Customers may request a full refund within 30 days of purchase...
---
Source: policies/refunds.pdf, Page 8
After 30 days, only store credit is available...
---

User: What is the refund policy?
Include source metadata in the context. When you format the retrieved chunks, include the source filename, page number, and any other metadata. This enables the LLM to cite specific sources in its answer, which users can verify.
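One way to produce that context string is a small formatting helper. The chunk structure here (a `text` field plus a `metadata` dict with `source` and `page` keys) is an assumption for illustration; adapt it to whatever your retriever returns.

```python
# Format retrieved chunks with source metadata so the LLM can cite them.
def format_context(chunks: list[dict]) -> str:
    """Render chunks as '---'-separated blocks with a Source header each."""
    blocks = []
    for c in chunks:
        meta = c["metadata"]
        blocks.append(
            f"---\nSource: {meta['source']}, Page {meta['page']}\n{c['text']}"
        )
    return "\n".join(blocks) + "\n---"

context = format_context([
    {"text": "Customers may request a full refund within 30 days of purchase...",
     "metadata": {"source": "policies/refunds.pdf", "page": 7}},
    {"text": "After 30 days, only store credit is available...",
     "metadata": {"source": "policies/refunds.pdf", "page": 8}},
])
```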
Choosing the Right LLM
Balancing quality, speed, cost, and context window
Key Factors
Context window: How many tokens can the model process? GPT-4o supports 128K tokens. Claude 3.5 Sonnet supports 200K. Gemini 1.5 Pro supports 1M+. Larger windows let you include more chunks.

Instruction following: How well does the model stick to "answer only from context"? Larger models are better at this.

Latency: Time to first token and tokens per second. Smaller models (GPT-4o-mini, Claude 3.5 Haiku) are faster.

Cost: Per-token pricing varies 100x between models. GPT-4o-mini is ~$0.15/1M input tokens; GPT-4o is ~$2.50/1M.
Practical Recommendations
Prototyping: GPT-4o-mini or Claude 3.5 Haiku. Fast, cheap, good enough to validate your pipeline.

Production (quality-focused): GPT-4o or Claude 3.5 Sonnet. Best instruction following and reasoning.

Production (cost-focused): GPT-4o-mini, Gemini 1.5 Flash, or Claude 3.5 Haiku. 10-20x cheaper than flagship models with 80-90% of the quality.

Self-hosted: Llama 3.1 70B or Mixtral 8x7B via vLLM or TGI. Full data control, no API dependency.
For most RAG applications, GPT-4o-mini or Claude 3.5 Haiku is sufficient. The context provides the knowledge; the model just needs to synthesize it. You rarely need the full reasoning power of flagship models. Save the budget for more retrieval calls or reranking instead.
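The cost trade-off is easy to quantify per query. A rough sketch using the input price quoted above for GPT-4o-mini (~$0.15/1M) and an assumed output price of ~$0.60/1M; actual pricing changes over time, so treat the numbers as illustrative:

```python
# Rough per-query cost estimate. Prices are illustrative USD per 1M tokens.
def query_cost(input_tokens: int, output_tokens: int,
               price_in_per_m: float, price_out_per_m: float) -> float:
    """Cost of one query = input and output tokens priced per million."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

# e.g. 8K tokens of retrieved context + question, 500 output tokens
cost = query_cost(8_000, 500, price_in_per_m=0.15, price_out_per_m=0.60)
# → 0.0015, i.e. about $1.50 per thousand queries
```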
Citations & Source Attribution
Enabling users to verify and trust the answer
Why Citations Matter
Citations transform RAG from "trust me" to "here's the proof." Users can click through to the source document and verify the answer. This is essential for enterprise, legal, medical, and financial use cases where accuracy is critical.
Citation Approaches
Inline citations: The LLM adds [1], [2] references in the text, with a source list at the bottom. Simple and familiar.

Per-sentence attribution: Each sentence is tagged with its source chunk. More granular but harder to implement.

Structured output: The LLM returns JSON with answer text and source references. Best for programmatic processing.

Highlighted excerpts: Show the exact passage from the source that supports each claim. Highest trust but most complex.
# Structured citation output
from pydantic import BaseModel
from typing import List

class Citation(BaseModel):
    text: str    # claim from the answer
    source: str  # document name
    page: int    # page number
    quote: str   # exact supporting quote

class RAGAnswer(BaseModel):
    answer: str
    citations: List[Citation]
    confidence: str  # "high", "medium", "low"

# Use with structured output (OpenAI, LangChain)
llm_with_structure = llm.with_structured_output(RAGAnswer)
Structured output with Pydantic models is the most reliable way to get citations. OpenAI's function calling / structured output and LangChain's with_structured_output() guarantee the response matches your schema. No regex parsing needed.
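Once you have a structured answer, rendering the familiar inline-citation style is a few lines of code. A minimal sketch, using plain dataclasses (rather than the Pydantic models) so it stands alone; the field names and rendering format are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Citation:
    source: str  # document name
    page: int    # page number

@dataclass
class RAGAnswer:
    answer: str
    citations: list[Citation]

def render_with_sources(result: RAGAnswer) -> str:
    """Append a numbered source list ([1] filename, page N) to the answer."""
    lines = [result.answer, "", "Sources:"]
    for i, c in enumerate(result.citations, start=1):
        lines.append(f"[{i}] {c.source}, page {c.page}")
    return "\n".join(lines)

out = render_with_sources(RAGAnswer(
    answer="Full refunds are available within 30 days. [1]",
    citations=[Citation(source="policies/refunds.pdf", page=7)],
))
```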
Streaming Responses
Showing the answer as it is generated, token by token
Why Stream?
RAG generation can take 2-10 seconds. Without streaming, the user stares at a blank screen. With streaming, they see the answer building word by word — the perceived latency drops dramatically. Time to first token (TTFT) becomes more important than total generation time.
How It Works
The LLM API returns a stream of token chunks via Server-Sent Events (SSE). Your backend forwards these chunks to the frontend in real time. The UI appends each chunk as it arrives. Most LLM APIs (OpenAI, Anthropic, Google) support streaming natively.
# LangChain streaming
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", streaming=True)

# Stream tokens
for chunk in rag_chain.stream({"question": "What is the refund policy?"}):
    print(chunk.content, end="", flush=True)

# FastAPI streaming endpoint
from fastapi.responses import StreamingResponse

@app.post("/chat")
async def chat(request: ChatRequest):
    async def generate():
        async for chunk in rag_chain.astream({"question": request.question}):
            yield chunk.content
    return StreamingResponse(generate())
Streaming and structured output are often incompatible. If you need JSON citations, you typically cannot stream. A common pattern: stream the answer text first, then append citations at the end. Or use two LLM calls — one streaming for the answer, one non-streaming for structured citations.
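The stream-first, cite-later pattern can be sketched with plain generators. The two stub functions below stand in for a streaming LLM call and a second structured-output call; their names and return shapes are assumptions for illustration:

```python
from typing import Iterator

def fake_answer_stream() -> Iterator[str]:
    """Stand-in for a streaming LLM call yielding token chunks."""
    yield from ["Full ", "refunds ", "within ", "30 days."]

def fetch_citations() -> list[str]:
    """Stand-in for a second, non-streaming structured-citation call."""
    return ["[1] policies/refunds.pdf, page 7"]

def chat_response() -> Iterator[str]:
    # Phase 1: forward answer tokens to the client as they arrive
    yield from fake_answer_stream()
    # Phase 2: append citations once the answer text is complete
    yield "\n\nSources:\n" + "\n".join(fetch_citations())

full = "".join(chat_response())
```

In a real endpoint, `chat_response` would be the generator handed to your streaming response.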
Guardrails & Safety
Preventing hallucination, off-topic answers, and harmful content
Common Guardrails
Grounding check: Verify the answer is supported by the retrieved context. If the LLM's answer contains claims not in any chunk, flag or remove them.

Topic guardrails: Restrict the system to only answer questions about your domain. Reject off-topic queries ("Write me a poem").

PII filtering: Detect and redact personal information (emails, phone numbers, SSNs) from the output.

Content safety: Block harmful, biased, or inappropriate content using moderation APIs (OpenAI Moderation, Azure Content Safety).
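A grounding check can be approximated with lexical overlap: flag answer sentences whose content words barely appear in any retrieved chunk. This is a crude heuristic sketch under that assumption, not a substitute for an NLI-based or LLM-based checker:

```python
import re

def is_grounded(sentence: str, chunks: list[str], threshold: float = 0.5) -> bool:
    """True if at least `threshold` of the sentence's words appear in one chunk."""
    words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    if not words:
        return True
    for chunk in chunks:
        chunk_words = set(re.findall(r"[a-z0-9]+", chunk.lower()))
        if len(words & chunk_words) / len(words) >= threshold:
            return True
    return False

chunks = ["Customers may request a full refund within 30 days of purchase."]
ok = is_grounded("A full refund is available within 30 days.", chunks)
bad = is_grounded("Shipping is free on all orders.", chunks)
```

Sentences that fail the check can be dropped, flagged in the UI, or sent to a second LLM call for verification.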
Implementation Approaches
Prompt-based: Include guardrail instructions in the system prompt. Simplest but least reliable.

Post-processing: Run a second LLM call or classifier on the output to check for violations.

Guardrails frameworks: NeMo Guardrails (NVIDIA), Guardrails AI, LangChain output parsers. These provide structured validation pipelines.
The "I don't know" response is a feature, not a bug. A RAG system that says "I don't have information about that" when the context is insufficient is far more trustworthy than one that confidently generates a wrong answer. Train your users to expect and trust this response.
Synthesis Strategies
Different ways to combine chunks into answers
Stuff (Default)
Put all retrieved chunks into a single prompt. Simple and works when the total context fits in the model's window. This is the default in LangChain and LlamaIndex.
Map-Reduce
Generate a partial answer from each chunk independently ("map"), then combine all partial answers into a final answer ("reduce"). Useful when you have too many chunks for a single prompt.
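The map-reduce flow can be sketched with a stubbed `llm` callable standing in for a real model call; the prompt wording and helper name are illustrative, not a framework API:

```python
def map_reduce_answer(question: str, chunks: list[str], llm) -> str:
    """llm(prompt) -> str is any LLM call; here it is injected for testability."""
    # Map: answer the question from each chunk independently
    partials = [
        llm(f"Context: {chunk}\nQuestion: {question}\n"
            "Answer from this context only:")
        for chunk in chunks
    ]
    # Reduce: combine the partial answers into one final answer
    combined = "\n".join(partials)
    return llm(f"Partial answers:\n{combined}\nCombine into one final answer:")

# Toy stub so the sketch runs without an API; it just counts calls.
calls = []
def stub_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"answer-{len(calls)}"

answer = map_reduce_answer(
    "What is the refund policy?",
    ["Full refunds within 30 days.", "Store credit after 30 days."],
    llm=stub_llm,
)
# Two map calls plus one reduce call
```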
Refine
Start with the first chunk, generate an initial answer. Then iteratively refine the answer by adding one chunk at a time. Each step can update, expand, or correct the previous answer. Good for building comprehensive answers from many sources.
# LangChain synthesis strategies
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# Stuff (default) - all docs in one prompt
stuff_chain = create_stuff_documents_chain(llm, prompt)

# Full RAG chain
rag_chain = create_retrieval_chain(retriever, stuff_chain)

result = rag_chain.invoke({"input": "What is the refund policy?"})
print(result["answer"])
print(result["context"])  # retrieved docs
Use "stuff" unless you have a reason not to. With modern 128K+ context windows, you can fit 50-100 chunks in a single prompt. Map-reduce and refine add complexity and multiple LLM calls. Only use them when your context genuinely exceeds the model's window.