Ch 7 — Building an Eval Pipeline

From ad-hoc testing to systematic, CI/CD-integrated evaluation that blocks bad deployments
High Level
dataset
Dataset
arrow_forward
analytics
Metrics
arrow_forward
compare
Baseline
arrow_forward
rule
CI/CD
arrow_forward
history
Regression
arrow_forward
trending_up
Maturity
-
Click play or press Space to begin...
Step- / 8
dataset
Building Your Eval Dataset
The foundation everything else depends on
The Golden Dataset
Your eval dataset is the single most important asset in your evaluation system. It’s a curated collection of inputs paired with expected behaviors — the definition of what “good” looks like for your specific use case. Start with 50–200 examples covering the full range of your system’s expected inputs.
What to Include
Happy path (40%): Common queries your system handles daily
Edge cases (20%): Unusual inputs, long queries, ambiguous requests
Adversarial (15%): Prompt injections, jailbreak attempts, deliberately confusing inputs
Out-of-scope (15%): Questions the system should refuse or redirect
Regression cases (10%): Previously-fixed bugs that must never recur
Defining Expected Behavior
For each example, specify what “correct” means. This varies by task:

Exact match: The answer must be precisely X (factual QA)
Contains: Must include specific key points (summarization)
Quality criteria: Must score ≥4 on helpfulness (chatbot)
Behavioral: Must refuse, cite sources, or ask for clarification (safety)
Synthetic Augmentation
Don’t have enough real data? Use an LLM to generate synthetic test cases from your existing examples. Provide 10 real examples and ask the model to generate 50 variations. Then human-review the synthetics to filter out low-quality ones. This gets you to 200 examples in a day instead of a month.
Pro tip: Start with 50 examples. A small, high-quality dataset you actually use beats a large dataset you never built. You can always grow it later from production failures.
analytics
Choosing Your Metrics
Match the metric to the task — no single metric rules them all
The Metric Stack
Every system needs at least three types of metrics:

1. Quality metrics: Is the output good? (accuracy, faithfulness, relevancy, helpfulness)
2. Safety metrics: Is the output safe? (toxicity rate, PII leakage, policy violations)
3. Operational metrics: Is it practical? (latency p95, cost per query, error rate)

Optimizing for a single metric is a trap. A system that’s accurate but slow, or helpful but unsafe, fails in production.
Metric Selection by Use Case
// Match metric to task Factual QA → Accuracy, F1, exact match RAG system → Faithfulness, relevancy, recall Chatbot → Helpfulness, safety, tone Code generation → pass@k, test pass rate Agents → Task completion, cost/task Summarization → Coverage, faithfulness, conciseness
Deterministic vs LLM-Judged
Use deterministic metrics wherever possible — they’re fast, free, and reproducible:

• Format compliance (JSON schema validation)
• Length constraints (token count)
• Exact match / regex patterns
• Code test pass rates

Reserve LLM-judged metrics for subjective dimensions: helpfulness, coherence, tone, and completeness. These cost $0.005–$0.02 per evaluation but capture nuance that deterministic checks cannot.
Key insight: Don’t over-engineer metrics at the start. Pick 3–5 metrics that cover quality, safety, and operations. You can always add more as you learn which dimensions matter most for your system.
compare
Establishing Baselines
Scores without context are meaningless
Types of Baselines
Current production: The model and prompt currently serving users — the most important baseline
Previous version: Last known-good configuration for regression detection
Human performance: How well do humans do on the same task? Sets the ceiling
Random/naive: What does chance or a simple heuristic achieve? Sets the floor
Best known: Best score ever achieved on your eval set — the target to beat
Why Baselines Matter
“82% accuracy” means nothing alone. But “82%, up from 76%, human baseline 91%” tells a complete story: improving, room to grow, and a clear target. Baselines turn numbers into decisions. Without them, you’re flying blind even with metrics.
Running Baselines Correctly
Always run baseline alongside candidate: Don’t compare against cached scores from last month. Model providers update APIs without notice, and your data distribution shifts over time
Same eval set, same conditions: Temperature, system prompt, and retrieval configuration must be identical
Statistical significance: On small eval sets (50–100 examples), a 2% difference might be noise. Use bootstrap confidence intervals to know if a change is real
Practical tip: Store every eval run with its full configuration (model version, prompt, temperature, eval set hash). When something breaks in production, you need to trace back to exactly what changed.
rule
CI/CD Integration
Block bad deployments automatically
Eval in Your Deployment Pipeline
The most impactful thing you can do is run your eval suite on every pull request that touches prompts, model configuration, or retrieval logic. The eval results are posted as a PR comment, and the merge is blocked if any metric drops below threshold. This catches regressions before they reach production.
The CI/CD Eval Flow
// Triggered on: pull request 1. Load eval dataset 50-200 examples from version-controlled file 2. Run candidate Execute system with PR changes on all examples 3. Run baseline Execute current production config on same examples 4. Compare BLOCK if any metric drops >2% WARN if any metric drops >1% PASS if all metrics stable or improved 5. Report Post results as PR comment with diffs
What to Gate On
Hard gates (block merge): Safety score below threshold, hallucination rate above 5%, format compliance below 95%
Soft gates (require approval): Quality drop >2%, cost increase >20%
Informational (flag only): Latency increase >10%, minor metric fluctuations

Start with soft gates until you trust your eval suite. Overly aggressive hard gates frustrate developers and get bypassed.
Key insight: A 5-minute eval run in CI/CD prevents days of production firefighting. The ROI is enormous. Teams with eval gates ship faster, not slower, because they catch problems early.
history
Regression Testing
Catching silent degradation before users notice
What Causes Regressions
Prompt changes: Improving one case inadvertently breaks another. The most common cause
Model updates: Provider silently updates the API — output format, tone, and accuracy can all shift
Data changes: New RAG documents introduce noise or contradictions
Dependency updates: New embedding model changes retrieval quality
Scale effects: Performance degrades under production load due to timeouts or truncation
The Bug Bank
Every production failure should become an eval example. When a user reports a bad response, add that input-output pair to your eval dataset with the correct expected behavior. Over time, your eval set becomes a comprehensive record of everything that’s ever gone wrong — the ultimate regression test.
Detection Strategy
1. Pin your eval dataset: Same inputs, same expectations, version-controlled
2. Run after every change: Prompt edits, model swaps, config changes
3. Run weekly even without changes: Model providers update APIs silently
4. Compare to baseline: Flag any metric drop >1%
5. Track trends: Plot weekly scores to catch gradual drift that per-run comparisons miss
Critical: Run your eval suite weekly even when nothing has changed on your end. Model providers update their APIs without notice. A “minor improvement” to GPT-4o can change your system’s behavior in ways that only your eval suite will catch.
edit_note
Eval-Driven Development
Write the eval first, then iterate — like TDD for AI
The EDD Workflow
Eval-Driven Development (EDD) is the AI equivalent of Test-Driven Development:

1. Define the eval: Write 10–20 examples that define what success looks like
2. Run the eval: See where the current system fails
3. Iterate: Change prompts, models, or retrieval until the eval passes
4. Ship: Deploy with confidence because the eval proves it works
5. Monitor: Continue running the eval in production to catch drift
Why EDD Works
EDD forces you to define success before building. Without it, teams fall into endless prompt tweaking based on vibes — trying a few examples, eyeballing the output, and hoping for the best. With EDD, when the eval passes, you’re done. When it doesn’t, you know exactly what to fix.
EDD in Practice
// Example: Adding a new feature Day 1: Write 15 eval examples 5 happy path, 5 edge cases, 5 adversarial Time: 2 hours Day 2: Run eval, see 40% pass rate Iterate on prompt: 40% → 65% → 78% Time: 4 hours Day 3: Switch model, add guardrail 78% → 91% — meets threshold Time: 3 hours Ship: Confident deploy with eval proof Total: ~9 hours vs days of vibes-tweaking
Pro tip: Start every new LLM feature by writing 10 eval examples. It takes 30 minutes and saves days of undirected prompt tweaking. The examples also serve as documentation of expected behavior.
trending_up
Pipeline Maturity Model
Start small, grow systematically — Level 0 to Level 5
The Five Levels
Level 0: Vibes only No eval. "Looks good to me." Level 1: Basic (1 day to set up) 50 examples, 3 metrics, manual run Level 2: Automated (1 week) 100 examples, 5 metrics, CI/CD Level 3: Gated (2-4 weeks) 200+ examples, eval gates block deploys Level 4: Monitored (1-2 months) Continuous production monitoring + alerts Level 5: Eval-Driven (ongoing) Eval-first development culture
Where Most Teams Are
Research shows most AI teams are at Level 0–1. They know evaluation matters but haven’t invested in systematic approaches. The teams shipping the best AI products are at Level 3+. The gap between Level 1 and Level 3 is often the difference between a demo and a product.
Getting Started Today
Go from Level 0 to Level 1 in one day:

1. Write 50 eval examples (2 hours)
2. Pick 3 metrics (30 minutes)
3. Write a script that runs them (2 hours)
4. Run it before your next deployment

That’s it. You now have more evaluation rigor than 80% of AI teams.
Key insight: The biggest jump in value is from Level 0 to Level 1. This takes one day and provides 80% of the benefit. Don’t wait for the perfect eval system — start with 50 examples today.
architecture
The Complete Pipeline Architecture
Putting all the pieces together
End-to-End Flow
// Development time 1. EDD: Write eval examples first 2. Iterate: Prompt/model changes 3. Local eval: Quick check before PR // CI/CD time 4. PR eval: Full suite, compare baseline 5. Gate: Block/warn/pass based on results 6. Deploy: Ship with confidence // Production time 7. Monitor: Continuous quality scoring 8. Alert: Notify on degradation 9. Learn: Failures → new eval examples
The Flywheel Effect
A well-built eval pipeline creates a virtuous cycle:

• Production failures become eval examples
• More eval examples catch more regressions
• Fewer regressions mean higher quality
• Higher quality means fewer production failures

Each cycle makes your system more robust. After 6 months, your eval dataset is a comprehensive record of every failure mode your system has ever encountered.
Next up: Chapter 8 surveys the eval tools landscape — RAGAS, DeepEval, Braintrust, LangSmith, Arize Phoenix, and Langfuse — and helps you choose the right tools for your stack.