Ch 4 — Evaluating RAG Systems

RAGAS metrics: faithfulness, answer relevancy, context precision & recall
High Level
Query → Retrieve → Context → Generate → Evaluate → Improve
Why RAG Evaluation Is Different
Two systems to evaluate, not one
The RAG Pipeline
A RAG system has two components that can fail independently: the retriever (finding relevant documents) and the generator (producing an answer from those documents). A perfect retriever with a bad generator gives wrong answers. A perfect generator with a bad retriever gives confident answers based on irrelevant context.
Failure Modes
Retrieval miss: Relevant documents not found
Retrieval noise: Irrelevant documents dilute context
Hallucination: Generator invents facts not in retrieved context
Incomplete answer: Generator ignores relevant retrieved information
Wrong attribution: Answer cites wrong source document
The Evaluation Challenge
You need metrics that isolate each component. If the final answer is wrong, is it because the retriever found the wrong documents, or because the generator misinterpreted the right documents? Without component-level metrics, you can’t diagnose or fix the problem.
Key insight: End-to-end accuracy alone is insufficient for RAG. You need separate metrics for retrieval quality and generation quality to know where to invest improvement effort.
Faithfulness
Is the answer grounded in the retrieved context?
What It Measures
Faithfulness checks whether every claim in the generated answer can be traced back to the retrieved context. A faithfulness score of 0.8 means 80% of the claims in the answer are supported by the retrieved documents. The remaining 20% are hallucinated.
How It Works
// Faithfulness evaluation
Step 1: Extract claims from the answer
  "Revenue grew 15% in Q3"
  "The CEO announced layoffs"
  "Stock price hit $200"
Step 2: Check each claim against context
  Claim 1: Supported (doc #2, para 3)
  Claim 2: Supported (doc #1, para 1)
  Claim 3: NOT FOUND (hallucinated)
Score: 2/3 = 0.67
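The scoring step then reduces to a simple ratio. A minimal Python sketch, where `claim_supported` is a hypothetical stand-in for the per-claim LLM or NLI entailment check a real evaluator would run:

```python
def faithfulness_score(claims, context, claim_supported):
    """Fraction of answer claims supported by the retrieved context.

    claim_supported(claim, context) -> bool stands in for the LLM/NLI
    entailment check a production evaluator performs per claim.
    """
    if not claims:
        return 0.0
    supported = sum(1 for c in claims if claim_supported(c, context))
    return supported / len(claims)

# Toy check: substring match as a crude stand-in for entailment
claims = ["Revenue grew 15% in Q3", "The CEO announced layoffs", "Stock price hit $200"]
context = "Revenue grew 15% in Q3. The CEO announced layoffs."
score = faithfulness_score(claims, context, lambda c, ctx: c in ctx)
print(round(score, 2))  # → 0.67
```

Substring matching is only for illustration; real faithfulness evaluators judge semantic entailment, not exact strings.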
Why It’s the Most Important Metric
Faithfulness is the anti-hallucination metric. A RAG system that hallucinates defeats the entire purpose of retrieval-augmented generation. If users can’t trust that answers come from your documents, they won’t trust the system at all.
Target: Aim for faithfulness > 0.90 in production. Below 0.80, users will encounter hallucinations frequently enough to lose trust. Below 0.70, the system is actively harmful.
Answer Relevancy
Does the answer actually address the question?
What It Measures
Answer relevancy scores how well the generated answer addresses the original question. A faithful answer that doesn’t address the question is useless. This metric catches cases where the model generates accurate but off-topic responses.
How It Works
The evaluator generates synthetic questions from the answer, then measures the semantic similarity between the synthetic questions and the original question. If the answer is relevant, the synthetic questions should be similar to the original. Score: 0.0 (completely irrelevant) to 1.0 (perfectly relevant).
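A minimal sketch of that similarity step, using toy 2-d vectors in place of a real embedding model's output:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def answer_relevancy(original_emb, synthetic_embs):
    """Mean cosine similarity between the original question's embedding
    and embeddings of questions generated back from the answer."""
    if not synthetic_embs:
        return 0.0
    return sum(cosine(original_emb, e) for e in synthetic_embs) / len(synthetic_embs)

# Two synthetic questions close to the original, one off-topic
original = [1.0, 0.0]
synthetic = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0]]
print(round(answer_relevancy(original, synthetic), 2))
```

In practice the embeddings would come from the same embedding model your retriever uses, and the synthetic questions from an LLM prompted with the answer.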
Common Failure Patterns
Topic drift: Answer starts relevant but wanders off-topic
Over-generalization: Answer is too broad to be useful
Wrong aspect: Answer addresses a different facet of the topic
Padding: Relevant core buried in irrelevant filler text
Key insight: Faithfulness and relevancy are independent dimensions. An answer can be perfectly faithful (all claims from context) but completely irrelevant (doesn’t answer the question). You need both.
Context Precision & Recall
Evaluating the retriever, not the generator
Context Precision
What fraction of retrieved documents are actually relevant? If you retrieve 10 documents and only 3 are relevant, precision is 0.30. Low precision means the generator is drowning in noise, which increases hallucination risk and wastes context window tokens.
Context Recall
What fraction of relevant documents were actually retrieved? If 5 documents in your corpus are relevant but you only retrieved 2, recall is 0.40. Low recall means the generator is missing critical information, leading to incomplete or wrong answers.
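Both definitions reduce to set arithmetic over document IDs. A simple sketch (note that some evaluators, RAGAS among them, may compute rank-aware variants of context precision):

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved documents that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for d in retrieved_ids if d in relevant) / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Fraction of relevant documents that were actually retrieved."""
    if not relevant_ids:
        return 0.0
    retrieved = set(retrieved_ids)
    return sum(1 for d in relevant_ids if d in retrieved) / len(relevant_ids)

retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d9"]
print(context_precision(retrieved, relevant))  # → 0.5 (2 of 4 retrieved are relevant)
print(round(context_recall(retrieved, relevant), 2))  # 2 of 3 relevant were retrieved
```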
The Precision-Recall Tradeoff
// Retrieval tuning
High precision, low recall: few docs, all relevant. Risk: missing information
Low precision, high recall: many docs, some irrelevant. Risk: noise & hallucination
Sweet spot: k=5-10 docs, reranked. Precision > 0.70, Recall > 0.80
Practical tip: Start with top-k=10, add a reranker to boost precision, then measure both metrics. Most teams over-retrieve (low precision) rather than under-retrieve (low recall). A reranker typically improves precision by 20–40%.
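The rerank-then-truncate step can be sketched as follows, with `score_fn` as a hypothetical stand-in for a cross-encoder reranker:

```python
def rerank(query, docs, score_fn, k=5):
    """Re-order retrieved docs by a relevance score and keep the top k.

    score_fn(query, doc) -> float stands in for a cross-encoder reranker.
    """
    ranked = sorted(docs, key=lambda d: score_fn(query, d), reverse=True)
    return ranked[:k]

# Toy scorer for illustration: term overlap between query and document
def overlap(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

docs = ["q3 revenue report", "office party photos", "revenue guidance for q3"]
print(rerank("q3 revenue", docs, overlap, k=2))
# → ['q3 revenue report', 'revenue guidance for q3']
```

Because reranking only drops documents, it can raise precision without touching the retriever; recall is capped by what the first-stage retrieval found.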
Groundedness
The bridge between retrieval and generation
What It Measures
Groundedness is closely related to faithfulness but focuses on whether the answer is derivable from the context, not just whether individual claims are supported. A grounded answer could be logically inferred from the context, even if it synthesizes information across multiple documents.
Groundedness vs Faithfulness
Faithfulness: “Is every claim in the answer stated in the context?” (strict)
Groundedness: “Could the answer be reasonably derived from the context?” (allows inference)

Groundedness is more permissive — it allows the model to synthesize and reason, as long as the reasoning is supported by the evidence.
When to Use Which
Legal/medical/financial: Use strict faithfulness (no inference allowed)
Research/analysis: Use groundedness (synthesis is valuable)
Customer support: Use faithfulness for facts, groundedness for recommendations
Education: Use groundedness (explaining concepts requires synthesis)
Key insight: The right metric depends on your risk tolerance. High-stakes domains need strict faithfulness. Domains where synthesis adds value can use the more permissive groundedness metric.
The RAG Evaluation Dashboard
Putting all metrics together
The Complete Metric Set
// RAG evaluation scorecard
Retrieval Quality
  Context Precision: 0.82 (target: >0.70)
  Context Recall: 0.91 (target: >0.80)
Generation Quality
  Faithfulness: 0.94 (target: >0.90)
  Answer Relevancy: 0.87 (target: >0.80)
  Groundedness: 0.92 (target: >0.85)
End-to-End
  Answer Correctness: 0.85 (target: >0.80)
  Latency (p95): 2.3s (target: <3s)
Diagnostic Decision Tree
When the end-to-end score drops, use component metrics to diagnose:

Low context recall? → Improve chunking, embeddings, or retrieval strategy
Low context precision? → Add a reranker or reduce top-k
Low faithfulness? → Improve prompt, add “only use provided context” instruction
Low relevancy? → Improve query understanding or prompt structure
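The decision tree above can be encoded directly. A small sketch, with illustrative metric names and thresholds:

```python
# Remediation hints mirroring the diagnostic decision tree
REMEDIES = {
    "context_recall": "improve chunking, embeddings, or retrieval strategy",
    "context_precision": "add a reranker or reduce top-k",
    "faithfulness": "tighten the prompt: 'only use the provided context'",
    "answer_relevancy": "improve query understanding or prompt structure",
}

def diagnose(scores, targets):
    """Return a remediation hint for every metric at or below its target."""
    return {m: REMEDIES[m] for m, t in targets.items()
            if m in REMEDIES and scores.get(m, 0.0) <= t}

targets = {"context_precision": 0.70, "context_recall": 0.80,
           "faithfulness": 0.90, "answer_relevancy": 0.80}
scores = {"context_precision": 0.55, "context_recall": 0.91,
          "faithfulness": 0.94, "answer_relevancy": 0.87}
print(diagnose(scores, targets))
# → {'context_precision': 'add a reranker or reduce top-k'}
```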
Pro tip: Track these metrics over time, not just at launch. RAG quality degrades as your document corpus grows, as embedding models change, and as user queries evolve. Weekly evaluation catches drift early.
Building a RAG Eval Dataset
The foundation of reliable evaluation
What You Need
A RAG eval dataset contains:

1. Questions (50–200 representative queries)
2. Ground truth answers (human-written correct answers)
3. Ground truth contexts (which documents should be retrieved)

Creating this is the hardest part of RAG evaluation, but it’s a one-time investment that pays dividends every time you change your pipeline.
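One way to represent a row of such a dataset; the field names here are illustrative, not a library schema:

```python
from dataclasses import dataclass, field

@dataclass
class RagEvalExample:
    """One eval row: the three pieces a RAG eval dataset needs."""
    question: str
    ground_truth_answer: str
    ground_truth_context_ids: list = field(default_factory=list)

ex = RagEvalExample(
    question="How much did revenue grow in Q3?",
    ground_truth_answer="Revenue grew 15% in Q3.",
    ground_truth_context_ids=["10-q-2024-q3#p3"],
)
print(ex.question)
```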
Synthetic Dataset Generation
Use an LLM to generate question-answer pairs from your documents:

1. Feed each document chunk to an LLM
2. Ask it to generate 3–5 questions that the chunk answers
3. Use the chunk as the ground truth context
4. Have a human review 20% for quality

This bootstraps a dataset in hours instead of weeks.
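The loop above, sketched with a stubbed LLM call; `generate_questions` is a hypothetical hook for whatever model you use:

```python
def synthesize_examples(chunks, generate_questions, per_chunk=3):
    """Build (question, ground-truth-context) eval rows from document chunks.

    generate_questions(text, n) -> list[str] stands in for an LLM call;
    the source chunk itself becomes the ground-truth context.
    """
    rows = []
    for chunk_id, text in chunks:
        for q in generate_questions(text, per_chunk):
            rows.append({"question": q, "ground_truth_context": chunk_id})
    return rows

# Stub LLM for illustration; real usage would prompt your model here
fake_llm = lambda text, n: [f"Question {i+1} about: {text[:20]}" for i in range(n)]
rows = synthesize_examples([("doc1#c0", "Revenue grew 15% in Q3.")], fake_llm)
print(len(rows))  # → 3
```

The human-review pass (step 4) then samples from `rows` before the set is frozen.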
Key insight: A small, high-quality eval dataset (50 examples) is more valuable than a large, noisy one (1000 examples). Focus on covering your key use cases and edge cases, not on volume.
RAG Eval Best Practices
Lessons from production RAG systems
The RAG Eval Checklist
1. Evaluate retrieval and generation separately — always
2. Use faithfulness as your primary metric — hallucination is the #1 RAG failure
3. Test with adversarial queries — questions your corpus can’t answer
4. Measure latency alongside quality — a perfect answer in 30 seconds is useless
5. Re-evaluate after every pipeline change — new embeddings, new chunking, new model
Common Mistakes
Only testing happy path: Real users ask ambiguous, multi-hop, and unanswerable questions
Ignoring “I don’t know”: A good RAG system should refuse to answer when context is insufficient
Evaluating once at launch: RAG quality degrades over time as documents change
Using only automated metrics: Sample 5–10% for human review to calibrate your metrics
Next up: In Chapter 5, we’ll tackle the even harder problem of evaluating AI agents — systems that take actions, use tools, and make multi-step decisions.