The Complete Metric Set
```
// RAG evaluation scorecard

Retrieval Quality
  Context Precision:   0.82   (target: >0.70)
  Context Recall:      0.91   (target: >0.80)

Generation Quality
  Faithfulness:        0.94   (target: >0.90)
  Answer Relevancy:    0.87   (target: >0.80)
  Groundedness:        0.92   (target: >0.85)

End-to-End
  Answer Correctness:  0.85   (target: >0.80)
  Latency (p95):       2.3s   (target: <3s)
```
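The two retrieval metrics above can be sketched in a few lines. This is a simplified illustration, not a framework API: it assumes you already have binary relevance labels per retrieved chunk and a ground-truth set of chunk IDs (production tools such as RAGAS typically use an LLM judge, and weight precision by rank).

```python
# Simplified, unweighted versions of the two retrieval metrics.
# Assumes relevance labels / ground-truth chunk IDs are already available.

def context_precision(relevance: list[bool]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    return sum(relevance) / len(relevance) if relevance else 0.0

def context_recall(retrieved_ids: set[str], required_ids: set[str]) -> float:
    """Fraction of the ground-truth chunks that made it into the context."""
    if not required_ids:
        return 1.0
    return len(retrieved_ids & required_ids) / len(required_ids)

print(context_precision([True, True, False, True]))               # 0.75
print(context_recall({"c1", "c2", "c3"}, {"c1", "c2", "c4", "c5"}))  # 0.5
```

High recall with low precision, as in the scorecard above, usually means the retriever finds the right chunks but pads them with noise.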
Diagnostic Decision Tree
When end-to-end answer correctness drops, use the component metrics to localize the failure:
• Low context recall? → Improve chunking, embeddings, or retrieval strategy
• Low context precision? → Add a reranker or reduce top-k
• Low faithfulness? → Improve prompt, add “only use provided context” instruction
• Low relevancy? → Improve query understanding or prompt structure
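The decision tree above is mechanical enough to encode directly. A minimal sketch, using the targets and remedies from this scorecard (the dictionaries and threshold values are illustrative, not a standard API):

```python
# Map each component metric to its target from the scorecard
# and to the suggested fix from the decision tree.
TARGETS = {
    "context_recall": 0.80,
    "context_precision": 0.70,
    "faithfulness": 0.90,
    "answer_relevancy": 0.80,
}

FIXES = {
    "context_recall": "Improve chunking, embeddings, or retrieval strategy",
    "context_precision": "Add a reranker or reduce top-k",
    "faithfulness": "Improve prompt; add an 'only use provided context' instruction",
    "answer_relevancy": "Improve query understanding or prompt structure",
}

def diagnose(scores: dict[str, float]) -> list[str]:
    """Return the suggested fix for every metric below its target."""
    return [FIXES[m] for m, target in TARGETS.items()
            if scores.get(m, 1.0) < target]

print(diagnose({"context_recall": 0.72, "faithfulness": 0.95}))
# ['Improve chunking, embeddings, or retrieval strategy']
```

Running this on every evaluation report turns the triage step into a one-liner in your CI pipeline.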
Pro tip: Track these metrics over time, not just at launch. RAG quality degrades as your document corpus grows, as embedding models change, and as user queries evolve. Weekly evaluation catches drift early.
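One way to catch that drift automatically: keep a history of per-run scores and flag any metric that falls well below its recent baseline. A minimal sketch, where the window size and drift margin are illustrative defaults, not recommended values:

```python
from statistics import mean

def flag_drift(history: list[dict[str, float]], metric: str,
               window: int = 4, margin: float = 0.05) -> bool:
    """True if the latest score is `margin` below the mean of the prior `window` runs."""
    if len(history) <= window:
        return False  # not enough history to establish a baseline
    baseline = mean(run[metric] for run in history[-window - 1:-1])
    return history[-1][metric] < baseline - margin

# Five stable weeks of faithfulness, then a sudden drop.
runs = [{"faithfulness": s} for s in (0.94, 0.93, 0.94, 0.95, 0.94, 0.86)]
print(flag_drift(runs, "faithfulness"))  # True
```

Wire this into the weekly evaluation job and the regression shows up as an alert, not as a user complaint.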