Ch 7: Building an Eval Pipeline

dataset

Building Your Eval Dataset

The foundation everything else depends on

The Golden Dataset

Your eval dataset is the single most important asset in your evaluation system. It’s a curated collection of inputs paired with expected behaviors — the definition of what “good” looks like for your specific use case. Start with 50–200 examples covering the full range of your system’s expected inputs.

What to Include

• Happy path (40%): Common queries your system handles daily
• Edge cases (20%): Unusual inputs, long queries, ambiguous requests
• Adversarial (15%): Prompt injections, jailbreak attempts, deliberately confusing inputs
• Out-of-scope (15%): Questions the system should refuse or redirect
• Regression cases (10%): Previously-fixed bugs that must never recur

Defining Expected Behavior

For each example, specify what “correct” means. This varies by task:

• Exact match: The answer must be precisely X (factual QA)
• Contains: Must include specific key points (summarization)
• Quality criteria: Must score ≥4 on helpfulness (chatbot)
• Behavioral: Must refuse, cite sources, or ask for clarification (safety)

Synthetic Augmentation

Don’t have enough real data? Use an LLM to generate synthetic test cases from your existing examples. Provide 10 real examples and ask the model to generate 50 variations. Then human-review the synthetics to filter out low-quality ones. This gets you to 200 examples in a day instead of a month.

Pro tip: Start with 50 examples. A small, high-quality dataset you actually use beats a large dataset you never built. You can always grow it later from production failures.

analytics

Choosing Your Metrics

Match the metric to the task — no single metric rules them all

The Metric Stack

Every system needs at least three types of metrics:

1. Quality metrics: Is the output good? (accuracy, faithfulness, relevancy, helpfulness)
2. Safety metrics: Is the output safe? (toxicity rate, PII leakage, policy violations)
3. Operational metrics: Is it practical? (latency p95, cost per query, error rate)

Optimizing for a single metric is a trap. A system that’s accurate but slow, or helpful but unsafe, fails in production.

Metric Selection by Use Case

// Match metric to task Factual QA → Accuracy, F1, exact match RAG system → Faithfulness, relevancy, recall Chatbot → Helpfulness, safety, tone Code generation → pass@k, test pass rate Agents → Task completion, cost/task Summarization → Coverage, faithfulness, conciseness

Deterministic vs LLM-Judged

Use deterministic metrics wherever possible — they’re fast, free, and reproducible:

• Format compliance (JSON schema validation)
• Length constraints (token count)
• Exact match / regex patterns
• Code test pass rates

Reserve LLM-judged metrics for subjective dimensions: helpfulness, coherence, tone, and completeness. These cost $0.005–$0.02 per evaluation but capture nuance that deterministic checks cannot.

Key insight: Don’t over-engineer metrics at the start. Pick 3–5 metrics that cover quality, safety, and operations. You can always add more as you learn which dimensions matter most for your system.

compare

Establishing Baselines

Scores without context are meaningless

Types of Baselines

• Current production: The model and prompt currently serving users — the most important baseline
• Previous version: Last known-good configuration for regression detection
• Human performance: How well do humans do on the same task? Sets the ceiling
• Random/naive: What does chance or a simple heuristic achieve? Sets the floor
• Best known: Best score ever achieved on your eval set — the target to beat

Why Baselines Matter

“82% accuracy” means nothing alone. But “82%, up from 76%, human baseline 91%” tells a complete story: improving, room to grow, and a clear target. Baselines turn numbers into decisions. Without them, you’re flying blind even with metrics.

Running Baselines Correctly

• Always run baseline alongside candidate: Don’t compare against cached scores from last month. Model providers update APIs without notice, and your data distribution shifts over time
• Same eval set, same conditions: Temperature, system prompt, and retrieval configuration must be identical
• Statistical significance: On small eval sets (50–100 examples), a 2% difference might be noise. Use bootstrap confidence intervals to know if a change is real

Practical tip: Store every eval run with its full configuration (model version, prompt, temperature, eval set hash). When something breaks in production, you need to trace back to exactly what changed.

rule

CI/CD Integration

Block bad deployments automatically

Eval in Your Deployment Pipeline

The most impactful thing you can do is run your eval suite on every pull request that touches prompts, model configuration, or retrieval logic. The eval results are posted as a PR comment, and the merge is blocked if any metric drops below threshold. This catches regressions before they reach production.

The CI/CD Eval Flow

// Triggered on: pull request 1. Load eval dataset 50-200 examples from version-controlled file 2. Run candidate Execute system with PR changes on all examples 3. Run baseline Execute current production config on same examples 4. Compare BLOCK if any metric drops >2% WARN if any metric drops >1% PASS if all metrics stable or improved 5. Report Post results as PR comment with diffs

What to Gate On

• Hard gates (block merge): Safety score below threshold, hallucination rate above 5%, format compliance below 95%
• Soft gates (require approval): Quality drop >2%, cost increase >20%
• Informational (flag only): Latency increase >10%, minor metric fluctuations

Start with soft gates until you trust your eval suite. Overly aggressive hard gates frustrate developers and get bypassed.

Key insight: A 5-minute eval run in CI/CD prevents days of production firefighting. The ROI is enormous. Teams with eval gates ship faster, not slower, because they catch problems early.

history

Regression Testing

Catching silent degradation before users notice

What Causes Regressions

• Prompt changes: Improving one case inadvertently breaks another. The most common cause
• Model updates: Provider silently updates the API — output format, tone, and accuracy can all shift
• Data changes: New RAG documents introduce noise or contradictions
• Dependency updates: New embedding model changes retrieval quality
• Scale effects: Performance degrades under production load due to timeouts or truncation

The Bug Bank

Every production failure should become an eval example. When a user reports a bad response, add that input-output pair to your eval dataset with the correct expected behavior. Over time, your eval set becomes a comprehensive record of everything that’s ever gone wrong — the ultimate regression test.

Detection Strategy

1. Pin your eval dataset: Same inputs, same expectations, version-controlled
2. Run after every change: Prompt edits, model swaps, config changes
3. Run weekly even without changes: Model providers update APIs silently
4. Compare to baseline: Flag any metric drop >1%
5. Track trends: Plot weekly scores to catch gradual drift that per-run comparisons miss

Critical: Run your eval suite weekly even when nothing has changed on your end. Model providers update their APIs without notice. A “minor improvement” to GPT-4o can change your system’s behavior in ways that only your eval suite will catch.

edit_note

Eval-Driven Development

Write the eval first, then iterate — like TDD for AI

The EDD Workflow

Eval-Driven Development (EDD) is the AI equivalent of Test-Driven Development:

1. Define the eval: Write 10–20 examples that define what success looks like
2. Run the eval: See where the current system fails
3. Iterate: Change prompts, models, or retrieval until the eval passes
4. Ship: Deploy with confidence because the eval proves it works
5. Monitor: Continue running the eval in production to catch drift

Why EDD Works

EDD forces you to define success before building. Without it, teams fall into endless prompt tweaking based on vibes — trying a few examples, eyeballing the output, and hoping for the best. With EDD, when the eval passes, you’re done. When it doesn’t, you know exactly what to fix.

EDD in Practice

// Example: Adding a new feature Day 1: Write 15 eval examples 5 happy path, 5 edge cases, 5 adversarial Time: 2 hours Day 2: Run eval, see 40% pass rate Iterate on prompt: 40% → 65% → 78% Time: 4 hours Day 3: Switch model, add guardrail 78% → 91% — meets threshold Time: 3 hours Ship: Confident deploy with eval proof Total: ~9 hours vs days of vibes-tweaking

Pro tip: Start every new LLM feature by writing 10 eval examples. It takes 30 minutes and saves days of undirected prompt tweaking. The examples also serve as documentation of expected behavior.

trending_up

Pipeline Maturity Model

Start small, grow systematically — Level 0 to Level 5

The Five Levels

Level 0: Vibes only No eval. "Looks good to me." Level 1: Basic (1 day to set up) 50 examples, 3 metrics, manual run Level 2: Automated (1 week) 100 examples, 5 metrics, CI/CD Level 3: Gated (2-4 weeks) 200+ examples, eval gates block deploys Level 4: Monitored (1-2 months) Continuous production monitoring + alerts Level 5: Eval-Driven (ongoing) Eval-first development culture

Where Most Teams Are

Research shows most AI teams are at Level 0–1. They know evaluation matters but haven’t invested in systematic approaches. The teams shipping the best AI products are at Level 3+. The gap between Level 1 and Level 3 is often the difference between a demo and a product.

Getting Started Today

Go from Level 0 to Level 1 in one day:

1. Write 50 eval examples (2 hours)
2. Pick 3 metrics (30 minutes)
3. Write a script that runs them (2 hours)
4. Run it before your next deployment

That’s it. You now have more evaluation rigor than 80% of AI teams.

Key insight: The biggest jump in value is from Level 0 to Level 1. This takes one day and provides 80% of the benefit. Don’t wait for the perfect eval system — start with 50 examples today.

architecture

The Complete Pipeline Architecture

Putting all the pieces together

End-to-End Flow

// Development time 1. EDD: Write eval examples first 2. Iterate: Prompt/model changes 3. Local eval: Quick check before PR // CI/CD time 4. PR eval: Full suite, compare baseline 5. Gate: Block/warn/pass based on results 6. Deploy: Ship with confidence // Production time 7. Monitor: Continuous quality scoring 8. Alert: Notify on degradation 9. Learn: Failures → new eval examples

The Flywheel Effect

A well-built eval pipeline creates a virtuous cycle:

• Production failures become eval examples
• More eval examples catch more regressions
• Fewer regressions mean higher quality
• Higher quality means fewer production failures

Each cycle makes your system more robust. After 6 months, your eval dataset is a comprehensive record of every failure mode your system has ever encountered.

Next up: Chapter 8 surveys the eval tools landscape — RAGAS, DeepEval, Braintrust, LangSmith, Arize Phoenix, and Langfuse — and helps you choose the right tools for your stack.

Ch 7 — Building an Eval Pipeline