Ch 4 — Evaluating RAG Systems

RAGAS metrics: faithfulness, answer relevancy, context precision & recall
High Level
Query → Retrieve → Context → Generate → Evaluate → Improve
Why RAG Evaluation Is Different
Two systems to evaluate, not one
The RAG Pipeline
A RAG system has two components that can fail independently: the retriever (finding relevant documents) and the generator (producing an answer from those documents). A perfect retriever with a bad generator gives wrong answers. A perfect generator with a bad retriever gives confident answers based on irrelevant context.
Failure Modes
Retrieval miss: Relevant documents not found
Retrieval noise: Irrelevant documents dilute context
Hallucination: Generator invents facts not in retrieved context
Incomplete answer: Generator ignores relevant retrieved information
Wrong attribution: Answer cites wrong source document
The Evaluation Challenge
You need metrics that isolate each component. If the final answer is wrong, is it because the retriever found the wrong documents, or because the generator misinterpreted the right documents? Without component-level metrics, you can’t diagnose or fix the problem.
Key insight: End-to-end accuracy alone is insufficient for RAG. You need separate metrics for retrieval quality and generation quality to know where to invest improvement effort.
Faithfulness
Is the answer grounded in the retrieved context?
What It Measures
Faithfulness checks whether every claim in the generated answer can be traced back to the retrieved context. A faithfulness score of 0.8 means 80% of the claims in the answer are supported by the retrieved documents. The remaining 20% are hallucinated.
How It Works
// Faithfulness evaluation
Step 1: Extract claims from the answer
  "Revenue grew 15% in Q3"
  "The CEO announced layoffs"
  "Stock price hit $200"
Step 2: Check each claim against context
  Claim 1: Supported (doc #2, para 3)
  Claim 2: Supported (doc #1, para 1)
  Claim 3: NOT FOUND (hallucinated)
Score: 2/3 = 0.67
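The scoring step then reduces to a simple ratio. A minimal Python sketch, where `claim_supported` is a hypothetical stand-in for the per-claim LLM or NLI entailment check a real evaluator would run:

```python
def faithfulness_score(claims, context, claim_supported):
    """Fraction of answer claims supported by the retrieved context.

    claim_supported(claim, context) -> bool stands in for the LLM/NLI
    entailment check a production evaluator performs per claim.
    """
    if not claims:
        return 0.0
    supported = sum(1 for c in claims if claim_supported(c, context))
    return supported / len(claims)

# Toy check: substring match as a crude stand-in for entailment
claims = ["Revenue grew 15% in Q3", "The CEO announced layoffs", "Stock price hit $200"]
context = "Revenue grew 15% in Q3. The CEO announced layoffs."
score = faithfulness_score(claims, context, lambda c, ctx: c in ctx)
print(round(score, 2))  # → 0.67
```

Substring matching is only for illustration; real faithfulness evaluators judge semantic entailment, not exact strings.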
Why It’s the Most Important Metric
Faithfulness is the anti-hallucination metric. A RAG system that hallucinates defeats the entire purpose of retrieval-augmented generation. If users can’t trust that answers come from your documents, they won’t trust the system at all.
Target: Aim for faithfulness > 0.90 in production. Below 0.80, users will encounter hallucinations frequently enough to lose trust. Below 0.70, the system is actively harmful.
Answer Relevancy
Does the answer actually address the question?
What It Measures
Answer relevancy scores how well the generated answer addresses the original question. A faithful answer that doesn’t address the question is useless. This metric catches cases where the model generates accurate but off-topic responses.
How It Works
The evaluator generates synthetic questions from the answer, then measures the semantic similarity between the synthetic questions and the original question. If the answer is relevant, the synthetic questions should be similar to the original. Score: 0.0 (completely irrelevant) to 1.0 (perfectly relevant).
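A minimal sketch of that similarity step, using toy 2-d vectors in place of a real embedding model's output:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def answer_relevancy(original_emb, synthetic_embs):
    """Mean cosine similarity between the original question's embedding
    and embeddings of questions generated back from the answer."""
    if not synthetic_embs:
        return 0.0
    return sum(cosine(original_emb, e) for e in synthetic_embs) / len(synthetic_embs)

# Two synthetic questions close to the original, one off-topic
original = [1.0, 0.0]
synthetic = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0]]
print(round(answer_relevancy(original, synthetic), 2))
```

In practice the embeddings would come from the same embedding model your retriever uses, and the synthetic questions from an LLM prompted with the answer.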
Common Failure Patterns
Topic drift: Answer starts relevant but wanders off-topic
Over-generalization: Answer is too broad to be useful
Wrong aspect: Answer addresses a different facet of the topic
Padding: Relevant core buried in irrelevant filler text
Key insight: Faithfulness and relevancy are independent dimensions. An answer can be perfectly faithful (all claims from context) but completely irrelevant (doesn’t answer the question). You need both.
Context Precision & Recall
Evaluating the retriever, not the generator
Context Precision
What fraction of retrieved documents are actually relevant? If you retrieve 10 documents and only 3 are relevant, precision is 0.30. Low precision means the generator is drowning in noise, which increases hallucination risk and wastes context window tokens.
Context Recall
What fraction of relevant documents were actually retrieved? If 5 documents in your corpus are relevant but you only retrieved 2, recall is 0.40. Low recall means the generator is missing critical information, leading to incomplete or wrong answers.
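Both definitions reduce to set arithmetic over document IDs. A simple sketch (note that some evaluators, RAGAS among them, may compute rank-aware variants of context precision):

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved documents that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for d in retrieved_ids if d in relevant) / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Fraction of relevant documents that were actually retrieved."""
    if not relevant_ids:
        return 0.0
    retrieved = set(retrieved_ids)
    return sum(1 for d in relevant_ids if d in retrieved) / len(relevant_ids)

retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d9"]
print(context_precision(retrieved, relevant))  # → 0.5 (2 of 4 retrieved are relevant)
print(round(context_recall(retrieved, relevant), 2))  # 2 of 3 relevant were retrieved
```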
The Precision-Recall Tradeoff
// Retrieval tuning
High precision, low recall: few docs, all relevant. Risk: missing information
Low precision, high recall: many docs, some irrelevant. Risk: noise & hallucination
Sweet spot: k=5-10 docs, reranked. Precision > 0.70, Recall > 0.80
Practical tip: Start with top-k=10, add a reranker to boost precision, then measure both metrics. Most teams over-retrieve (low precision) rather than under-retrieve (low recall). A reranker typically improves precision by 20–40%.
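The rerank-then-truncate step can be sketched as follows, with `score_fn` as a hypothetical stand-in for a cross-encoder reranker:

```python
def rerank(query, docs, score_fn, k=5):
    """Re-order retrieved docs by a relevance score and keep the top k.

    score_fn(query, doc) -> float stands in for a cross-encoder reranker.
    """
    ranked = sorted(docs, key=lambda d: score_fn(query, d), reverse=True)
    return ranked[:k]

# Toy scorer for illustration: term overlap between query and document
def overlap(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

docs = ["q3 revenue report", "office party photos", "revenue guidance for q3"]
print(rerank("q3 revenue", docs, overlap, k=2))
# → ['q3 revenue report', 'revenue guidance for q3']
```

Because reranking only drops documents, it can raise precision without touching the retriever; recall is capped by what the first-stage retrieval found.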
Groundedness
The bridge between retrieval and generation
What It Measures
Groundedness is closely related to faithfulness but focuses on whether the answer is derivable from the context, not just whether individual claims are supported. A grounded answer could be logically inferred from the context, even if it synthesizes information across multiple documents.
Groundedness vs Faithfulness
Faithfulness: “Is every claim in the answer stated in the context?” (strict)
Groundedness: “Could the answer be reasonably derived from the context?” (allows inference)

Groundedness is more permissive — it allows the model to synthesize and reason, as long as the reasoning is supported by the evidence.
When to Use Which
Legal/medical/financial: Use strict faithfulness (no inference allowed)
Research/analysis: Use groundedness (synthesis is valuable)
Customer support: Use faithfulness for facts, groundedness for recommendations
Education: Use groundedness (explaining concepts requires synthesis)
Key insight: The right metric depends on your risk tolerance. High-stakes domains need strict faithfulness. Domains where synthesis adds value can use the more permissive groundedness metric.
The RAG Evaluation Dashboard
Putting all metrics together
The Complete Metric Set
// RAG evaluation scorecard
Retrieval Quality
  Context Precision: 0.82 (target: >0.70)
  Context Recall: 0.91 (target: >0.80)
Generation Quality
  Faithfulness: 0.94 (target: >0.90)
  Answer Relevancy: 0.87 (target: >0.80)
  Groundedness: 0.92 (target: >0.85)
End-to-End
  Answer Correctness: 0.85 (target: >0.80)
  Latency (p95): 2.3s (target: <3s)
Diagnostic Decision Tree
When the end-to-end score drops, use component metrics to diagnose:

Low context recall? → Improve chunking, embeddings, or retrieval strategy
Low context precision? → Add a reranker or reduce top-k
Low faithfulness? → Improve prompt, add “only use provided context” instruction
Low relevancy? → Improve query understanding or prompt structure
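The decision tree above can be encoded directly. A small sketch, with illustrative metric names and thresholds:

```python
# Remediation hints mirroring the diagnostic decision tree
REMEDIES = {
    "context_recall": "improve chunking, embeddings, or retrieval strategy",
    "context_precision": "add a reranker or reduce top-k",
    "faithfulness": "tighten the prompt: 'only use the provided context'",
    "answer_relevancy": "improve query understanding or prompt structure",
}

def diagnose(scores, targets):
    """Return a remediation hint for every metric at or below its target."""
    return {m: REMEDIES[m] for m, t in targets.items()
            if m in REMEDIES and scores.get(m, 0.0) <= t}

targets = {"context_precision": 0.70, "context_recall": 0.80,
           "faithfulness": 0.90, "answer_relevancy": 0.80}
scores = {"context_precision": 0.55, "context_recall": 0.91,
          "faithfulness": 0.94, "answer_relevancy": 0.87}
print(diagnose(scores, targets))
# → {'context_precision': 'add a reranker or reduce top-k'}
```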
Pro tip: Track these metrics over time, not just at launch. RAG quality degrades as your document corpus grows, as embedding models change, and as user queries evolve. Weekly evaluation catches drift early.
Building a RAG Eval Dataset
The foundation of reliable evaluation
What You Need
A RAG eval dataset contains:

1. Questions (50–200 representative queries)
2. Ground truth answers (human-written correct answers)
3. Ground truth contexts (which documents should be retrieved)

Creating this is the hardest part of RAG evaluation, but it’s a one-time investment that pays dividends every time you change your pipeline.
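One way to represent a row of such a dataset; the field names here are illustrative, not a library schema:

```python
from dataclasses import dataclass, field

@dataclass
class RagEvalExample:
    """One eval row: the three pieces a RAG eval dataset needs."""
    question: str
    ground_truth_answer: str
    ground_truth_context_ids: list = field(default_factory=list)

ex = RagEvalExample(
    question="How much did revenue grow in Q3?",
    ground_truth_answer="Revenue grew 15% in Q3.",
    ground_truth_context_ids=["10-q-2024-q3#p3"],
)
print(ex.question)
```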
Synthetic Dataset Generation
Use an LLM to generate question-answer pairs from your documents:

1. Feed each document chunk to an LLM
2. Ask it to generate 3–5 questions that the chunk answers
3. Use the chunk as the ground truth context
4. Have a human review 20% for quality

This bootstraps a dataset in hours instead of weeks.
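The loop above, sketched with a stubbed LLM call; `generate_questions` is a hypothetical hook for whatever model you use:

```python
def synthesize_examples(chunks, generate_questions, per_chunk=3):
    """Build (question, ground-truth-context) eval rows from document chunks.

    generate_questions(text, n) -> list[str] stands in for an LLM call;
    the source chunk itself becomes the ground-truth context.
    """
    rows = []
    for chunk_id, text in chunks:
        for q in generate_questions(text, per_chunk):
            rows.append({"question": q, "ground_truth_context": chunk_id})
    return rows

# Stub LLM for illustration; real usage would prompt your model here
fake_llm = lambda text, n: [f"Question {i+1} about: {text[:20]}" for i in range(n)]
rows = synthesize_examples([("doc1#c0", "Revenue grew 15% in Q3.")], fake_llm)
print(len(rows))  # → 3
```

The human-review pass (step 4) then samples from `rows` before the set is frozen.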
Key insight: A small, high-quality eval dataset (50 examples) is more valuable than a large, noisy one (1000 examples). Focus on covering your key use cases and edge cases, not on volume.
RAG Eval Best Practices
Lessons from production RAG systems
The RAG Eval Checklist
1. Evaluate retrieval and generation separately — always
2. Use faithfulness as your primary metric — hallucination is the #1 RAG failure
3. Test with adversarial queries — questions your corpus can’t answer
4. Measure latency alongside quality — a perfect answer in 30 seconds is useless
5. Re-evaluate after every pipeline change — new embeddings, new chunking, new model
Common Mistakes
Only testing happy path: Real users ask ambiguous, multi-hop, and unanswerable questions
Ignoring “I don’t know”: A good RAG system should refuse to answer when context is insufficient
Evaluating once at launch: RAG quality degrades over time as documents change
Using only automated metrics: Sample 5–10% for human review to calibrate your metrics
Next up: In Chapter 5, we’ll tackle the even harder problem of evaluating AI agents — systems that take actions, use tools, and make multi-step decisions.