Starter Stack (Free)
For teams just starting with eval:
• Offline eval: RAGAS (if RAG) or DeepEval (if general LLM)
• Production monitoring: Langfuse (self-hosted) or Phoenix (local)
• Cost: $0 for tools + $50–$200/month for LLM judge API calls
This covers 80% of evaluation needs. You can run eval in CI/CD, monitor production, and track quality over time.
Growth Stack (Paid)
For teams that need collaboration and experiment tracking:
• Everything in the starter stack, plus:
• Experiment platform: Braintrust or LangSmith for A/B testing and prompt management
• Cost: $25–$39/seat/month + LLM judge costs
Add this when you have multiple people iterating on prompts and need shared visibility into what’s working.
What Matters More Than Tools
The tools are the easy part. What actually matters:
1. A good eval dataset — 50+ examples from real production data
2. The right metrics — 3–5 that cover quality, safety, and operations
3. The habit of running evals — before every deploy, weekly for drift
4. Acting on results — eval data that nobody looks at is worthless
A team with a spreadsheet and 50 good eval examples outperforms a team with every tool but no dataset.
Next up: Chapter 9 dives into production observability — the 5 pillars of monitoring AI systems in the real world: cost, latency, quality, safety, and hallucination detection.