Ch 8 — The Eval Tools Landscape

RAGAS, DeepEval, Braintrust, LangSmith, Phoenix, Langfuse — choosing the right tools for your stack
Start With the Problem, Not the Tool
What are you actually trying to evaluate?
The Tool Trap
Teams often start by picking a tool and then figuring out what to evaluate. This is backwards. The eval dataset is the hard part — any tool can run your evals once you have one. Start by defining what you need to measure, then choose the tool that fits.
Three Categories of Need
The eval tools landscape breaks into three categories, each solving a different problem:

1. Offline evaluation frameworks: Run eval suites during development and CI/CD. Measure quality before deployment
2. Managed experiment platforms: Track experiments, compare variants, manage datasets. Collaboration and workflow
3. Production observability: Monitor live systems, trace requests, detect drift. Real-time visibility
The Decision Framework
// What's your primary need?
RAG evaluation        → RAGAS (specialized metrics)
General LLM testing   → DeepEval (broad metric coverage)
Experiment tracking   → Braintrust or LangSmith
Production monitoring → Arize Phoenix or Langfuse
Self-hosted requirement → Langfuse or Phoenix
LangChain ecosystem   → LangSmith
Key insight: Most mature teams use 2–3 tools: one for offline eval (RAGAS or DeepEval), one for production monitoring (Phoenix or Langfuse), and optionally one for experiment tracking (Braintrust or LangSmith).
RAGAS & DeepEval
Open-source offline evaluation frameworks
RAGAS
Retrieval-Augmented Generation Assessment. The go-to framework for evaluating RAG systems. Provides specialized metrics that decompose RAG quality into its components:

Faithfulness: Are claims grounded in retrieved context?
Answer relevancy: Does the answer address the question?
Context precision: Are retrieved documents relevant?
Context recall: Were all relevant documents retrieved?

Open-source, Python-based. You pay only for the LLM API calls used to compute metrics.
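To make the last two metrics concrete, here is a toy illustration of what context precision and recall measure. This is not the RAGAS API (RAGAS computes these with LLM calls); it treats relevance as a known set purely for illustration:

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved documents that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(retrieved)

def context_recall(retrieved, relevant):
    """Fraction of relevant documents that were actually retrieved."""
    if not relevant:
        return 1.0  # nothing to recall
    return sum(1 for doc in relevant if doc in retrieved) / len(relevant)

retrieved = ["doc_a", "doc_b", "doc_c", "doc_d"]
relevant = {"doc_a", "doc_c", "doc_e"}
print(context_precision(retrieved, relevant))  # 0.5 — 2 of 4 retrieved are relevant
print(context_recall(retrieved, relevant))     # 2 of 3 relevant docs were retrieved
```

In RAGAS itself, an LLM judge replaces the set membership test, but the precision/recall framing is the same.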
DeepEval
Comprehensive LLM evaluation framework with 14+ metrics covering hallucination, bias, toxicity, relevancy, faithfulness, coherence, and more. Key differentiators:

Pytest integration: Write evals as test cases that run in CI/CD
Conversational eval: Evaluate multi-turn conversations, not just single responses
Custom metrics: Define your own evaluation criteria with natural language rubrics
Benchmarking: Run standard benchmarks (MMLU, HumanEval) locally against your models
Choose RAGAS if RAG evaluation is your primary need. Choose DeepEval if you need broad LLM evaluation beyond RAG, especially if your team uses pytest. Both are open-source and free (you pay only for LLM API calls).
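The pytest pattern can be sketched with plain pytest and a stubbed scorer. Everything here except pytest itself is hypothetical — DeepEval's real metric classes call an LLM judge instead of the word-overlap stand-in below:

```python
import re
import pytest

def relevancy_score(question: str, answer: str) -> float:
    """Toy stand-in for an LLM-judge relevancy metric (returns [0, 1])."""
    q_words = set(re.findall(r"\w+", question.lower()))
    a_words = set(re.findall(r"\w+", answer.lower()))
    return len(q_words & a_words) / max(len(q_words), 1)

CASES = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("Does RAGAS evaluate RAG?", "Yes, RAGAS does evaluate RAG systems."),
]

@pytest.mark.parametrize("question,answer", CASES)
def test_answer_relevancy(question, answer):
    # Gate: fail the test suite if relevancy drops below threshold
    assert relevancy_score(question, answer) >= 0.5
```

Because these are ordinary pytest tests, they run in CI alongside the rest of your suite — the key workflow benefit of the pytest integration.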
Braintrust & LangSmith
Managed platforms for experiments, tracing, and collaboration
Braintrust
Hosted evaluation platform focused on experiment tracking and prompt optimization. Key strengths:

Experiment comparison: Run A/B tests on prompts and models with statistical significance testing
Dataset management: Version-controlled eval datasets with collaborative editing
Scoring: Built-in LLM judges plus custom scoring functions
Logging: Production request logging with replay and evaluation
Generous free tier: 50K logs/month free
LangSmith
Built by LangChain. End-to-end tracing, evaluation, monitoring, and prompt management. Key strengths:

Deep LangChain/LangGraph integration: Automatic tracing of chains and agents
End-to-end traces: See every step of complex pipelines with latency and cost breakdown
Prompt management: Version, test, and deploy prompts from a central hub
Annotation queues: Built-in human evaluation workflows
Free tier: 5K traces/month
Choose Braintrust for experiment tracking and prompt optimization. Choose LangSmith if you’re in the LangChain/LangGraph ecosystem and want tight integration with your orchestration layer.
Arize Phoenix & Langfuse
Open-source observability and LLM engineering
Arize Phoenix
Open-source LLM observability from Arize AI. Designed for production monitoring with strong visualization capabilities:

Trace viewer: See every step of your LLM pipeline with timing and cost
Embedding visualization: Visualize query and document embeddings to spot clustering issues
LLM judge evaluation: Built-in judges for relevance, hallucination, and toxicity
Dataset curation: Create eval datasets from production traces
Deployment: Run locally, self-hosted, or use Arize cloud
Langfuse
Open-source LLM engineering platform. Strong community and self-hosting story:

Tracing: OpenTelemetry-compatible traces for any LLM framework
Evaluation: Score traces with custom metrics and LLM judges
Prompt management: Version and deploy prompts with A/B testing
Cost tracking: Detailed cost attribution per model, feature, and user
Self-hostable: Full control over your data — critical for regulated industries
Choose Phoenix for embedding visualization and production monitoring. Choose Langfuse for self-hosted deployments and integrated prompt management. Both are open-source and free to self-host.
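Neither tool's SDK is shown here; this toy dataclass (all names hypothetical) just illustrates the fields a production trace typically carries — per-step latency, token counts, and cost:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str            # e.g. "retrieval", "generation"
    latency_ms: float
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0

@dataclass
class Trace:
    request_id: str
    spans: list = field(default_factory=list)

    @property
    def total_latency_ms(self) -> float:
        return sum(s.latency_ms for s in self.spans)

    @property
    def total_cost_usd(self) -> float:
        return sum(s.cost_usd for s in self.spans)

trace = Trace("req-42", [
    Span("retrieval", latency_ms=85.0),
    Span("generation", latency_ms=1200.0, input_tokens=1800,
         output_tokens=250, cost_usd=0.0031),
])
print(trace.total_latency_ms)  # 1285.0
print(trace.total_cost_usd)    # 0.0031
```

Phoenix and Langfuse capture this structure automatically via their instrumentation; the value of a trace viewer is seeing these spans nested per request.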
Head-to-Head Comparison
Strengths, weaknesses, and sweet spots for each tool
Offline Eval Frameworks
RAGAS
  Strength: RAG-specific metrics
  Weakness: Limited beyond RAG
  Best for: RAG evaluation specialists
  Cost: Free + LLM API calls
DeepEval
  Strength: Broad metrics, pytest integration
  Weakness: Less RAG depth than RAGAS
  Best for: General LLM testing teams
  Cost: Free + LLM API calls
Platforms & Observability
Braintrust
  Strength: Experiments, A/B testing
  Best for: Prompt optimization teams
  Cost: Free tier, then $25/seat/mo
LangSmith
  Strength: LangChain integration
  Best for: LangChain/LangGraph users
  Cost: Free tier, then $39/seat/mo
Phoenix
  Strength: Embedding viz, monitoring
  Best for: Production observability
  Cost: Free (open-source)
Langfuse
  Strength: Self-hosted, prompt mgmt
  Best for: Regulated industries
  Cost: Free (self-hosted)
Integration Patterns
Wiring eval tools into your development workflow
Development Time
Jupyter notebooks: Use RAGAS or DeepEval for exploratory evaluation during development. Quick iteration on prompts and retrieval strategies
Experiments: Use Braintrust or LangSmith to compare prompt variants with statistical rigor. Track which changes actually improve metrics
Local testing: DeepEval’s pytest integration lets you run evals as unit tests during development
CI/CD Time
GitHub Actions / GitLab CI: Run RAGAS or DeepEval on every PR that touches prompts or model config
Gate deployments: Block merge if quality metrics drop below threshold
Post results: Comment eval report on the PR with metric diffs against baseline
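The deployment gate itself is simple: compare current metrics against a stored baseline and fail the CI job on regression. A minimal sketch — the metric names, threshold, and hard-coded scores are illustrative; in practice you would load both from JSON artifacts:

```python
import sys

# Hypothetical baseline, e.g. loaded from the last passing run's artifact
BASELINE = {"faithfulness": 0.90, "answer_relevancy": 0.85}
MAX_DROP = 0.02  # tolerate up to a 2-point regression before blocking merge

def gate(current: dict, baseline: dict, max_drop: float = MAX_DROP) -> bool:
    """Return True if every baseline metric stayed within max_drop."""
    failures = [
        f"{name}: {current.get(name, 0.0):.2f} < {score - max_drop:.2f}"
        for name, score in baseline.items()
        if current.get(name, 0.0) < score - max_drop
    ]
    for line in failures:
        print(f"FAIL {line}")
    return not failures

if __name__ == "__main__":
    current = {"faithfulness": 0.91, "answer_relevancy": 0.84}
    sys.exit(0 if gate(current, BASELINE) else 1)
```

A nonzero exit code is all GitHub Actions or GitLab CI needs to block the merge.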
Production Time
Tracing: Phoenix or Langfuse trace every production request with latency, cost, and token counts
Sampling: Run LLM judge on 5–10% of production responses for continuous quality scoring
Alerting: Trigger alerts when quality metrics drop, costs spike, or error rates increase
Dashboards: Track cost, latency, quality, and safety trends over time
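The 5–10% sampling above can be done deterministically by hashing the request ID, so a given request is always in or out of the sample no matter which service sees it. A sketch (the judge call itself is what you would attach downstream):

```python
import hashlib

def in_sample(request_id: str, rate: float = 0.10) -> bool:
    """Deterministically sample `rate` of requests by hashing their ID."""
    digest = hashlib.sha256(request_id.encode()).digest()
    # Map the first 8 bytes to [0, 1) and compare against the sampling rate
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

# Over many requests, close to 10% land in the sample
sampled = sum(in_sample(f"req-{i}") for i in range(10_000))
print(sampled / 10_000)
```

Deterministic sampling also makes results reproducible: re-running the judge over yesterday's traffic scores the exact same requests.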
Pro tip: Start with one tool per category. Don’t try to adopt all six at once. Pick RAGAS or DeepEval for offline eval, add Phoenix or Langfuse for production monitoring, and add a managed platform only when you need experiment collaboration.
Cost Reality Check
What eval tools actually cost in practice
Tool Licensing Costs
// Monthly costs (2026)
RAGAS       Free (open-source)
DeepEval    Free (open-source)
Phoenix     Free (open-source)
Langfuse    Free (self-hosted)
Braintrust  Free tier: 50K logs/mo; Pro: $25/seat/mo
LangSmith   Free tier: 5K traces/mo; Plus: $39/seat/mo
The Hidden Cost: LLM API Calls
The biggest cost isn’t the tool — it’s the LLM API calls for judging. Every LLM-judged metric requires an API call to a strong model:

1,000 evaluations with GPT-4o: ~$5–$20
1,000 evaluations with GPT-4o-mini: ~$0.50–$2
Daily production sampling (500 evals): ~$75–$300/month

Factor this into your budget alongside tool licensing. For most teams, LLM judge costs are 5–10x the tool costs.
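The production-sampling figure above is straightforward arithmetic; a quick estimator makes it easy to plug in your own numbers (the per-eval costs are assumptions — tune them to your judge model's actual pricing):

```python
def monthly_judge_cost(evals_per_day: int, cost_per_eval: float,
                       days: int = 30) -> float:
    """Estimated monthly spend on LLM-judge API calls."""
    return evals_per_day * cost_per_eval * days

# 500 evals/day at $0.005–$0.02 per eval (roughly GPT-4o-class judging)
low = monthly_judge_cost(500, 0.005)   # 75.0
high = monthly_judge_cost(500, 0.02)   # 300.0
print(f"${low:.0f}-${high:.0f}/month")  # $75-$300/month
```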
Budget tip: Start with free/open-source tools (RAGAS + Phoenix or Langfuse). Add managed platforms only when you need experiment tracking or team collaboration features. Total cost for a small team: $100–$500/month including LLM judge API calls.
The Recommended Stack
A practical starting point for most teams
Starter Stack (Free)
For teams just starting with eval:

Offline eval: RAGAS (if RAG) or DeepEval (if general LLM)
Production monitoring: Langfuse (self-hosted) or Phoenix (local)
Cost: $0 for tools + $50–$200/month for LLM judge API calls

This covers 80% of evaluation needs. You can run eval in CI/CD, monitor production, and track quality over time.
Growth Stack (Paid)
For teams that need collaboration and experiment tracking:

Everything in the starter stack, plus:
Experiment platform: Braintrust or LangSmith for A/B testing and prompt management
Cost: $25–$39/seat/month + LLM judge costs

Add this when you have multiple people iterating on prompts and need shared visibility into what’s working.
What Matters More Than Tools
The tools are the easy part. What actually matters:

1. A good eval dataset — 50+ examples from real production data
2. The right metrics — 3–5 that cover quality, safety, and operations
3. The habit of running evals — before every deploy, weekly for drift
4. Acting on results — eval data that nobody looks at is worthless

A team with a spreadsheet and 50 good eval examples outperforms a team with every tool but no dataset.
Next up: Chapter 9 dives into production observability — the 5 pillars of monitoring AI systems in the real world: cost, latency, quality, safety, and hallucination detection.