Ch 11 — RAG in Production & Evaluation

Deploying, monitoring, evaluating, and continuously improving RAG systems
High Level
Deploy → Evaluate → Monitor → Feedback → Optimize → Iterate → Mature
Deployment Architecture
How to serve a RAG system in production
Typical Production Stack
A production RAG system is a multi-service architecture:

API layer: FastAPI or Flask serving the RAG endpoint. Handles authentication, rate limiting, request validation.

Ingestion pipeline: Separate service (or batch job) that processes new documents, chunks, embeds, and upserts to the vector store. Runs on a schedule or triggered by events.

Vector store: Managed service (Pinecone, Qdrant Cloud) or self-hosted (Qdrant, Weaviate on Kubernetes).

LLM: API call to OpenAI/Anthropic/Azure OpenAI, or self-hosted via vLLM/TGI.
```python
# Minimal production RAG API
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str
    session_id: str = "default"

@app.post("/ask")
async def ask(q: Query):
    # retriever and rag_chain are built at application startup (not shown)
    docs = await retriever.ainvoke(q.question)
    answer = await rag_chain.ainvoke({"input": q.question, "context": docs})
    return {"answer": answer, "sources": [d.metadata for d in docs]}
```
Separate ingestion from serving. The ingestion pipeline (chunking, embedding, indexing) is CPU/GPU-intensive and runs in batch. The serving API is I/O-bound (waiting for vector search and LLM responses). Different scaling characteristics mean different infrastructure. Use a message queue (SQS, RabbitMQ) to decouple them.
Evaluation Metrics
Measuring retrieval quality and answer correctness
Retrieval Metrics
Context Precision: Of the retrieved documents, how many are actually relevant? High precision = less noise in context.

Context Recall: Of all relevant documents in the corpus, how many were retrieved? High recall = don't miss important information.

Hit Rate (Recall@k): Does the correct document appear in the top-k results? The simplest and most important retrieval metric.

Mean Reciprocal Rank (MRR): How high does the first relevant document rank? MRR = 1 means it's always first.
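Hit Rate and MRR need no framework to compute. A minimal sketch, assuming each test query has a single known relevant document ID (function names are illustrative):

```python
def hit_rate_at_k(ranked_ids, relevant_id, k=5):
    """1 if the relevant document appears in the top-k results, else 0."""
    return int(relevant_id in ranked_ids[:k])

def mean_reciprocal_rank(all_ranked_ids, all_relevant_ids):
    """Average of 1/rank of the first relevant document per query (0 if absent)."""
    total = 0.0
    for ranked, relevant in zip(all_ranked_ids, all_relevant_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id == relevant:
                total += 1.0 / rank
                break
    return total / len(all_ranked_ids)
```

Averaging hit_rate_at_k over your test queries gives Recall@k; tracking both after every index or chunking change catches retrieval regressions early.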
Generation Metrics
Faithfulness: Is the answer supported by the retrieved context? Measures hallucination. The most critical metric for RAG.

Answer Relevancy: Does the answer actually address the question? An answer can be faithful but off-topic.

Answer Correctness: Is the answer factually correct? Requires ground truth labels. The gold standard but expensive to create.

Answer Completeness: Does the answer cover all aspects of the question? Important for complex, multi-part queries.
Start with three metrics: Hit Rate (retrieval), Faithfulness (generation), and Answer Relevancy (end-to-end). These catch the three most common failures: wrong documents retrieved, hallucinated answers, and off-topic responses. Add more metrics as your system matures.
Monitoring & Observability
Tracking performance, quality, and cost in real time
What to Monitor
Latency: End-to-end response time. Break down by: embedding (10-50ms), retrieval (20-100ms), LLM generation (500-3000ms). Set SLOs (e.g., p95 < 3s).

Token usage & cost: Track tokens per request (input + output). Set budgets and alerts. A single runaway prompt can cost $100+.

Error rates: LLM API failures, vector store timeouts, rate limit hits. Set up retries with exponential backoff.

Quality drift: Run Ragas evaluation on a sample of production queries weekly. Track faithfulness and relevancy over time. Alert if they drop below thresholds.
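The retry-with-exponential-backoff pattern mentioned above can be sketched as a generic helper (parameter names and defaults are illustrative, not from any particular library):

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0):
    """Call fn(), retrying on exception with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids thundering herds
```

In production you would catch only transient errors (timeouts, HTTP 429s) rather than every exception, so genuine bugs still fail fast.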
Observability Tools
LangSmith: Traces every step of your LangChain pipeline. Inputs, outputs, latency, tokens, errors. The most popular choice for LangChain apps.

LangFuse: Open-source alternative to LangSmith. Self-hostable. Traces, scores, prompt management. Works with any framework.

Phoenix (Arize): Open-source observability for LLM apps. Embedding visualization, trace analysis, evaluation. Good for debugging retrieval quality.

OpenTelemetry: Standard observability framework. Integrate RAG metrics into your existing monitoring stack (Datadog, Grafana, etc.).
Log every production query. Store the question, retrieved documents, generated answer, latency, and token count. This dataset becomes your evaluation set, debugging tool, and fine-tuning data. LangSmith and LangFuse do this automatically. If you build custom, log to a structured store (PostgreSQL, BigQuery).
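A custom query log can start as simple JSONL appends. A sketch with a hypothetical schema (the field names here are assumptions, not a standard):

```python
import json
import time
import uuid

def log_rag_query(sink, question, sources, answer, latency_ms, tokens):
    """Write one structured record per production query to a line-oriented sink."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "question": question,
        "sources": sources,      # retrieved-document metadata
        "answer": answer,
        "latency_ms": latency_ms,
        "tokens": tokens,        # e.g. {"input": ..., "output": ...}
    }
    sink.write(json.dumps(record) + "\n")  # JSONL; swap for a PostgreSQL/BigQuery insert
    return record
```

The sink can be a local file during development and a database writer in production; the record shape stays the same, which keeps your evaluation tooling stable.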
User Feedback Loops
Collecting and using human signals to improve quality
Types of Feedback
Explicit feedback: Thumbs up/down buttons on answers. Simple to implement, low response rate (5-15% of users). But the signal is strong and unambiguous.

Implicit feedback: Did the user ask a follow-up? Did they rephrase the same question? Did they copy the answer? These signals are noisy but abundant.

Correction feedback: Let users edit or correct the answer. The most valuable signal but requires UI investment. Great for internal tools where users are motivated.
Using Feedback
1. Identify failure patterns: Cluster thumbs-down queries. Are they about a specific topic? A specific document type? This tells you where to focus improvement efforts.

2. Build evaluation sets: Thumbs-up queries with their answers become ground truth for automated evaluation. Over time, you build a comprehensive test suite from real user queries.

3. Fine-tune retrieval: Use feedback to identify queries where the wrong documents were retrieved. Adjust chunking, add metadata, or create synthetic training pairs for a custom reranker.

4. Improve prompts: Analyze thumbs-down answers to find prompt weaknesses. Iterate on system prompts to address common failure modes.
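Step 1 above (clustering thumbs-down queries) can start very crudely. A sketch using word frequency as a stand-in for real clustering (the class and method names are hypothetical; production systems would persist to a database and use embedding-based clustering):

```python
from collections import Counter

class FeedbackStore:
    """Minimal in-memory feedback log for illustrating failure-pattern analysis."""

    def __init__(self):
        self.records = []

    def record(self, question, answer, thumbs_up):
        self.records.append({"question": question, "answer": answer, "up": thumbs_up})

    def failure_terms(self, top_n=3):
        """Crude failure clustering: most common words in thumbs-down questions."""
        words = Counter()
        for r in self.records:
            if not r["up"]:
                words.update(
                    w.strip("?!.,").lower()
                    for w in r["question"].split()
                    if len(w.strip("?!.,")) > 3
                )
        return [word for word, _ in words.most_common(top_n)]
```

Even this naive version surfaces "every thumbs-down mentions passwords" style patterns, which tells you which documents or chunks to inspect first.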
The feedback flywheel: Users provide feedback → you identify failures → you fix the pipeline → quality improves → users trust the system more → more usage → more feedback. This virtuous cycle is what separates good RAG systems from great ones. Start collecting feedback from day one.
Optimization Strategies
Improving latency, cost, and quality
Latency Optimization
Streaming: Stream LLM responses to the user. Time-to-first-token (TTFT) drops from 2-3s to 200-500ms. Users perceive the system as much faster.

Parallel retrieval: If you search multiple indexes or use hybrid search, run queries in parallel (asyncio). Don't wait for one to finish before starting the next.

Semantic caching: Cache answers for semantically similar queries. Cache hit = <50ms response. Start with exact match caching, then add semantic similarity.

Smaller models: Use GPT-4o-mini or Claude 3.5 Haiku for simple questions. Route complex questions to larger models. Save 80-90% on LLM costs for simple queries.
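Parallel retrieval with asyncio is a few lines. A sketch assuming each retriever exposes an async callable that returns a list of documents:

```python
import asyncio

async def parallel_retrieve(query, retrievers):
    """Fan one query out to several retrievers concurrently; flatten the results."""
    batches = await asyncio.gather(*(retrieve(query) for retrieve in retrievers))
    return [doc for batch in batches for doc in batch]
```

With asyncio.gather, total retrieval latency is roughly that of the slowest retriever rather than the sum of all of them.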
Quality Optimization
Better chunking: The #1 lever for retrieval quality. Experiment with chunk sizes (256-1024 tokens), overlap (10-20%), and strategies (recursive, semantic).

Add reranking: A cross-encoder reranker (Cohere Rerank, BGE Reranker) dramatically improves precision. Retrieve 20 documents, rerank to top 5. Adds 100-200ms latency but significantly better context.

Hybrid search: Combine vector search with BM25 keyword search. Catches exact matches that embedding models miss. Use Reciprocal Rank Fusion to merge results.

Contextual Retrieval: Prepend document-level context to each chunk before embedding (an Anthropic technique). Anthropic reports up to a 49% reduction in retrieval failure rate when combined with contextual BM25.
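Reciprocal Rank Fusion, mentioned under hybrid search above, is simple to implement. A sketch using the conventional smoothing constant k=60:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of doc IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Feed it the vector results and the BM25 results: documents ranked highly in both lists float to the top, without needing to calibrate the two systems' raw scores against each other.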
Optimization order: (1) Fix chunking. (2) Add hybrid search. (3) Add reranking. (4) Add streaming. (5) Add caching. (6) Add model routing. Each step has diminishing returns. Measure with Ragas after each change to confirm improvement. Don't optimize what you haven't measured.
Data Freshness & Continuous Improvement
Keeping your knowledge base current and pipeline evolving
Data Freshness
Incremental ingestion: Don't re-index everything when a document changes. Track document hashes, only re-process changed/new documents. LlamaIndex's IngestionPipeline with caching handles this.

Scheduled sync: Set up a cron job or event trigger to sync your data sources. Frequency depends on how often data changes: real-time for chat support, daily for documentation, weekly for research papers.

Stale document handling: Delete vectors for removed documents. Update metadata (e.g., "last_updated") so you can filter by freshness at query time. Some queries should prefer recent documents.
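Hash-based change detection, the core of incremental ingestion, fits in a few lines. A sketch: in production the hash index would live in a database alongside the vector store, not in memory:

```python
import hashlib

def changed_documents(docs, hash_index):
    """Return only new or changed (doc_id, text) pairs; update hash_index in place."""
    to_process = []
    for doc_id, text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if hash_index.get(doc_id) != digest:
            hash_index[doc_id] = digest
            to_process.append((doc_id, text))
    return to_process
```

Only the documents this returns need re-chunking, re-embedding, and upserting; doc IDs present in the index but missing from the source feed are candidates for vector deletion.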
Continuous Improvement Cycle
Week 1-2: Deploy basic RAG. Collect queries and feedback. Identify top failure modes.

Week 3-4: Fix the biggest failure mode (usually chunking or retrieval). Re-evaluate with Ragas. Deploy improvement.

Month 2: Add hybrid search and reranking. Build a golden evaluation set from production queries. Set up automated evaluation in CI/CD.

Month 3+: Add advanced patterns as needed (agentic, multi-modal, graph). Optimize latency and cost. The system gets better every week because you're measuring and iterating.
RAG is never "done." Your data changes, user needs evolve, better models are released, and new techniques emerge. The teams that win are the ones with the tightest feedback loop: deploy, measure, learn, improve, repeat. Invest in evaluation infrastructure early because it pays dividends forever.
RAG Maturity Model
From prototype to production-grade system
Level 1: Prototype
Basic RAG with LangChain + Chroma + OpenAI. Works for demos. No evaluation, no monitoring, no feedback. Breaks on edge cases. Good for validating the use case.
Level 2: MVP
Deployed API with a managed vector store. Basic evaluation (Ragas on 50 test queries). Logging and error tracking. Thumbs up/down feedback. Handles 80% of queries well.
Level 3: Production
Hybrid search + reranking. Automated evaluation in CI/CD. LangSmith/LangFuse tracing. Incremental ingestion. Streaming responses. Handles 95% of queries well. Cost-optimized.
Level 4: Advanced
Agentic or multi-modal RAG. A/B testing pipeline changes. Custom reranker fine-tuned on your data. Semantic caching. Model routing. Sub-second TTFT. Handles 99% of queries well.
Common Pitfalls
1. Skipping evaluation: "It seems to work" is not a metric. You need numbers. Build a test set and run Ragas before every change.

2. Over-engineering early: Don't start with Graph RAG and agentic patterns. Start with basic RAG, measure where it fails, then add complexity.

3. Ignoring chunking: Teams spend weeks on prompt engineering when the real problem is bad chunking. Fix retrieval first, then fix generation.

4. No feedback loop: If you're not collecting user feedback, you're flying blind. Even simple thumbs up/down is invaluable.

5. Treating RAG as a one-time project: RAG is a living system. Budget for ongoing maintenance, data updates, and continuous improvement.
Most teams should aim for Level 3 within 2-3 months. Level 2 is achievable in 2-4 weeks. Level 4 is only needed for high-stakes, high-traffic applications. The jump from Level 1 to Level 2 (adding evaluation and feedback) delivers the most value per effort. Don't skip it.