Ch 8 — The Eval Tools Landscape

RAGAS, DeepEval, Braintrust, LangSmith, Phoenix, Langfuse — choosing the right tools for your stack
Start With the Problem, Not the Tool
What are you actually trying to evaluate?
The Tool Trap
Teams often start by picking a tool and then figuring out what to evaluate. This is backwards. The eval dataset is the hard part — any tool can run your evals once you have one. Start by defining what you need to measure, then choose the tool that fits.
Three Categories of Need
The eval tools landscape breaks into three categories, each solving a different problem:

1. Offline evaluation frameworks: Run eval suites during development and CI/CD. Measure quality before deployment
2. Managed experiment platforms: Track experiments, compare variants, manage datasets. Collaboration and workflow
3. Production observability: Monitor live systems, trace requests, detect drift. Real-time visibility
The Decision Framework
// What's your primary need?
RAG evaluation        → RAGAS (specialized metrics)
General LLM testing   → DeepEval (broad metric coverage)
Experiment tracking   → Braintrust or LangSmith
Production monitoring → Arize Phoenix or Langfuse
Self-hosted requirement → Langfuse or Phoenix
LangChain ecosystem   → LangSmith
Key insight: Most mature teams use 2–3 tools: one for offline eval (RAGAS or DeepEval), one for production monitoring (Phoenix or Langfuse), and optionally one for experiment tracking (Braintrust or LangSmith).
RAGAS & DeepEval
Open-source offline evaluation frameworks
RAGAS
Retrieval-Augmented Generation Assessment. The go-to framework for evaluating RAG systems. Provides specialized metrics that decompose RAG quality into its components:

Faithfulness: Are claims grounded in retrieved context?
Answer relevancy: Does the answer address the question?
Context precision: Are retrieved documents relevant?
Context recall: Were all relevant documents retrieved?

Open-source, Python-based. You pay only for the LLM API calls used to compute metrics.
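To make the last two metrics concrete, here is a toy illustration of what context precision and recall measure. This is not the RAGAS API (RAGAS computes these with LLM calls); it treats relevance as a known set purely for illustration:

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved documents that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(retrieved)

def context_recall(retrieved, relevant):
    """Fraction of relevant documents that were actually retrieved."""
    if not relevant:
        return 1.0  # nothing to recall
    return sum(1 for doc in relevant if doc in retrieved) / len(relevant)

retrieved = ["doc_a", "doc_b", "doc_c", "doc_d"]
relevant = {"doc_a", "doc_c", "doc_e"}
print(context_precision(retrieved, relevant))  # 0.5 — 2 of 4 retrieved are relevant
print(context_recall(retrieved, relevant))     # 2 of 3 relevant docs were retrieved
```

In RAGAS itself, an LLM judge replaces the set membership test, but the precision/recall framing is the same.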
DeepEval
Comprehensive LLM evaluation framework with 14+ metrics covering hallucination, bias, toxicity, relevancy, faithfulness, coherence, and more. Key differentiators:

Pytest integration: Write evals as test cases that run in CI/CD
Conversational eval: Evaluate multi-turn conversations, not just single responses
Custom metrics: Define your own evaluation criteria with natural language rubrics
Benchmarking: Run standard benchmarks (MMLU, HumanEval) locally against your models
Choose RAGAS if RAG evaluation is your primary need. Choose DeepEval if you need broad LLM evaluation beyond RAG, especially if your team uses pytest. Both are open-source and free (you pay only for LLM API calls).
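The pytest pattern can be sketched with plain pytest and a stubbed scorer. Everything here except pytest itself is hypothetical — DeepEval's real metric classes call an LLM judge instead of the word-overlap stand-in below:

```python
import re
import pytest

def relevancy_score(question: str, answer: str) -> float:
    """Toy stand-in for an LLM-judge relevancy metric (returns [0, 1])."""
    q_words = set(re.findall(r"\w+", question.lower()))
    a_words = set(re.findall(r"\w+", answer.lower()))
    return len(q_words & a_words) / max(len(q_words), 1)

CASES = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("Does RAGAS evaluate RAG?", "Yes, RAGAS does evaluate RAG systems."),
]

@pytest.mark.parametrize("question,answer", CASES)
def test_answer_relevancy(question, answer):
    # Gate: fail the test suite if relevancy drops below threshold
    assert relevancy_score(question, answer) >= 0.5
```

Because these are ordinary pytest tests, they run in CI alongside the rest of your suite — the key workflow benefit of the pytest integration.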
Braintrust & LangSmith
Managed platforms for experiments, tracing, and collaboration
Braintrust
Hosted evaluation platform focused on experiment tracking and prompt optimization. Key strengths:

Experiment comparison: Run A/B tests on prompts and models with statistical significance testing
Dataset management: Version-controlled eval datasets with collaborative editing
Scoring: Built-in LLM judges plus custom scoring functions
Logging: Production request logging with replay and evaluation
Generous free tier: 50K logs/month free
LangSmith
Built by LangChain. End-to-end tracing, evaluation, monitoring, and prompt management. Key strengths:

Deep LangChain/LangGraph integration: Automatic tracing of chains and agents
End-to-end traces: See every step of complex pipelines with latency and cost breakdown
Prompt management: Version, test, and deploy prompts from a central hub
Annotation queues: Built-in human evaluation workflows
Free tier: 5K traces/month
Choose Braintrust for experiment tracking and prompt optimization. Choose LangSmith if you’re in the LangChain/LangGraph ecosystem and want tight integration with your orchestration layer.
Arize Phoenix & Langfuse
Open-source observability and LLM engineering
Arize Phoenix
Open-source LLM observability from Arize AI. Designed for production monitoring with strong visualization capabilities:

Trace viewer: See every step of your LLM pipeline with timing and cost
Embedding visualization: Visualize query and document embeddings to spot clustering issues
LLM judge evaluation: Built-in judges for relevance, hallucination, and toxicity
Dataset curation: Create eval datasets from production traces
Deployment: Run locally, self-hosted, or use Arize cloud
Langfuse
Open-source LLM engineering platform. Strong community and self-hosting story:

Tracing: OpenTelemetry-compatible traces for any LLM framework
Evaluation: Score traces with custom metrics and LLM judges
Prompt management: Version and deploy prompts with A/B testing
Cost tracking: Detailed cost attribution per model, feature, and user
Self-hostable: Full control over your data — critical for regulated industries
Choose Phoenix for embedding visualization and production monitoring. Choose Langfuse for self-hosted deployments and integrated prompt management. Both are open-source and free to self-host.
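Neither tool's SDK is shown here; this toy dataclass (all names hypothetical) just illustrates the fields a production trace typically carries — per-step latency, token counts, and cost:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str            # e.g. "retrieval", "generation"
    latency_ms: float
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0

@dataclass
class Trace:
    request_id: str
    spans: list = field(default_factory=list)

    @property
    def total_latency_ms(self) -> float:
        return sum(s.latency_ms for s in self.spans)

    @property
    def total_cost_usd(self) -> float:
        return sum(s.cost_usd for s in self.spans)

trace = Trace("req-42", [
    Span("retrieval", latency_ms=85.0),
    Span("generation", latency_ms=1200.0, input_tokens=1800,
         output_tokens=250, cost_usd=0.0031),
])
print(trace.total_latency_ms)  # 1285.0
print(trace.total_cost_usd)    # 0.0031
```

Phoenix and Langfuse capture this structure automatically via their instrumentation; the value of a trace viewer is seeing these spans nested per request.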
Head-to-Head Comparison
Strengths, weaknesses, and sweet spots for each tool
Offline Eval Frameworks
RAGAS
  Strength: RAG-specific metrics
  Weakness: Limited beyond RAG
  Best for: RAG evaluation specialists
  Cost: Free + LLM API calls
DeepEval
  Strength: Broad metrics, pytest integration
  Weakness: Less RAG depth than RAGAS
  Best for: General LLM testing teams
  Cost: Free + LLM API calls
Platforms & Observability
Braintrust
  Strength: Experiments, A/B testing
  Best for: Prompt optimization teams
  Cost: Free tier, then $25/seat/mo
LangSmith
  Strength: LangChain integration
  Best for: LangChain/LangGraph users
  Cost: Free tier, then $39/seat/mo
Phoenix
  Strength: Embedding viz, monitoring
  Best for: Production observability
  Cost: Free (open-source)
Langfuse
  Strength: Self-hosted, prompt mgmt
  Best for: Regulated industries
  Cost: Free (self-hosted)
Integration Patterns
Wiring eval tools into your development workflow
Development Time
Jupyter notebooks: Use RAGAS or DeepEval for exploratory evaluation during development. Quick iteration on prompts and retrieval strategies
Experiments: Use Braintrust or LangSmith to compare prompt variants with statistical rigor. Track which changes actually improve metrics
Local testing: DeepEval’s pytest integration lets you run evals as unit tests during development
CI/CD Time
GitHub Actions / GitLab CI: Run RAGAS or DeepEval on every PR that touches prompts or model config
Gate deployments: Block merge if quality metrics drop below threshold
Post results: Comment eval report on the PR with metric diffs against baseline
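The deployment gate itself is simple: compare current metrics against a stored baseline and fail the CI job on regression. A minimal sketch — the metric names, threshold, and hard-coded scores are illustrative; in practice you would load both from JSON artifacts:

```python
import sys

# Hypothetical baseline, e.g. loaded from the last passing run's artifact
BASELINE = {"faithfulness": 0.90, "answer_relevancy": 0.85}
MAX_DROP = 0.02  # tolerate up to a 2-point regression before blocking merge

def gate(current: dict, baseline: dict, max_drop: float = MAX_DROP) -> bool:
    """Return True if every baseline metric stayed within max_drop."""
    failures = [
        f"{name}: {current.get(name, 0.0):.2f} < {score - max_drop:.2f}"
        for name, score in baseline.items()
        if current.get(name, 0.0) < score - max_drop
    ]
    for line in failures:
        print(f"FAIL {line}")
    return not failures

if __name__ == "__main__":
    current = {"faithfulness": 0.91, "answer_relevancy": 0.84}
    sys.exit(0 if gate(current, BASELINE) else 1)
```

A nonzero exit code is all GitHub Actions or GitLab CI needs to block the merge.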
Production Time
Tracing: Phoenix or Langfuse trace every production request with latency, cost, and token counts
Sampling: Run LLM judge on 5–10% of production responses for continuous quality scoring
Alerting: Trigger alerts when quality metrics drop, costs spike, or error rates increase
Dashboards: Track cost, latency, quality, and safety trends over time
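The 5–10% sampling above can be done deterministically by hashing the request ID, so a given request is always in or out of the sample no matter which service sees it. A sketch (the judge call itself is what you would attach downstream):

```python
import hashlib

def in_sample(request_id: str, rate: float = 0.10) -> bool:
    """Deterministically sample `rate` of requests by hashing their ID."""
    digest = hashlib.sha256(request_id.encode()).digest()
    # Map the first 8 bytes to [0, 1) and compare against the sampling rate
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

# Over many requests, close to 10% land in the sample
sampled = sum(in_sample(f"req-{i}") for i in range(10_000))
print(sampled / 10_000)
```

Deterministic sampling also makes results reproducible: re-running the judge over yesterday's traffic scores the exact same requests.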
Pro tip: Start with one tool per category. Don’t try to adopt all six at once. Pick RAGAS or DeepEval for offline eval, add Phoenix or Langfuse for production monitoring, and add a managed platform only when you need experiment collaboration.
Cost Reality Check
What eval tools actually cost in practice
Tool Licensing Costs
// Monthly costs (2026)
RAGAS       Free (open-source)
DeepEval    Free (open-source)
Phoenix     Free (open-source)
Langfuse    Free (self-hosted)
Braintrust  Free tier: 50K logs/mo; Pro: $25/seat/mo
LangSmith   Free tier: 5K traces/mo; Plus: $39/seat/mo
The Hidden Cost: LLM API Calls
The biggest cost isn’t the tool — it’s the LLM API calls for judging. Every LLM-judged metric requires an API call to a strong model:

1,000 evaluations with GPT-4o: ~$5–$20
1,000 evaluations with GPT-4o-mini: ~$0.50–$2
Daily production sampling (500 evals): ~$75–$300/month

Factor this into your budget alongside tool licensing. For most teams, LLM judge costs are 5–10x the tool costs.
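The production-sampling figure above is straightforward arithmetic; a quick estimator makes it easy to plug in your own numbers (the per-eval costs are assumptions — tune them to your judge model's actual pricing):

```python
def monthly_judge_cost(evals_per_day: int, cost_per_eval: float,
                       days: int = 30) -> float:
    """Estimated monthly spend on LLM-judge API calls."""
    return evals_per_day * cost_per_eval * days

# 500 evals/day at $0.005–$0.02 per eval (roughly GPT-4o-class judging)
low = monthly_judge_cost(500, 0.005)   # 75.0
high = monthly_judge_cost(500, 0.02)   # 300.0
print(f"${low:.0f}-${high:.0f}/month")  # $75-$300/month
```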
Budget tip: Start with free/open-source tools (RAGAS + Phoenix or Langfuse). Add managed platforms only when you need experiment tracking or team collaboration features. Total cost for a small team: $100–$500/month including LLM judge API calls.
The Recommended Stack
A practical starting point for most teams
Starter Stack (Free)
For teams just starting with eval:

Offline eval: RAGAS (if RAG) or DeepEval (if general LLM)
Production monitoring: Langfuse (self-hosted) or Phoenix (local)
Cost: $0 for tools + $50–$200/month for LLM judge API calls

This covers 80% of evaluation needs. You can run eval in CI/CD, monitor production, and track quality over time.
Growth Stack (Paid)
For teams that need collaboration and experiment tracking:

Everything in the starter stack, plus:
Experiment platform: Braintrust or LangSmith for A/B testing and prompt management
Cost: $25–$39/seat/month + LLM judge costs

Add this when you have multiple people iterating on prompts and need shared visibility into what’s working.
What Matters More Than Tools
The tools are the easy part. What actually matters:

1. A good eval dataset — 50+ examples from real production data
2. The right metrics — 3–5 that cover quality, safety, and operations
3. The habit of running evals — before every deploy, weekly for drift
4. Acting on results — eval data that nobody looks at is worthless

A team with a spreadsheet and 50 good eval examples outperforms a team with every tool but no dataset.
Next up: Chapter 9 dives into production observability — the 5 pillars of monitoring AI systems in the real world: cost, latency, quality, safety, and hallucination detection.