Ch 1 — Why Evaluation Matters

The “works on my laptop” problem and why vibes-based eval fails at scale
[Interactive roadmap: Problem → Silent Fail → Vibes → Metrics → Systematic → Roadmap]
The “Works on My Laptop” Problem
Why most LLM deployments fly blind
The Scenario
You build an LLM-powered feature. You test it with 10 examples. It looks great. You ship it. A week later, support tickets flood in: the model is hallucinating product names, giving wrong prices, and occasionally insulting customers. What happened?
Why LLMs Are Different
Traditional software is deterministic — the same input always produces the same output. LLMs are stochastic: the same prompt can yield different responses each run. A model that scores 96.9% success in clean conditions can drop to 88.1% under production stress (perturbations, edge cases, load).
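Because a stochastic model can pass a prompt on one run and fail it on the next, single-run spot checks are misleading. A standard way to quantify this is the pass@k estimator from the HumanEval paper: sample the same prompt n times, count c passes, and compute the probability that at least one of k draws succeeds. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn from n total runs of which c passed, succeeds."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# 20 runs of the same prompt, 12 passed the checker
print(pass_at_k(20, 12, 1))  # 0.6 — the single-shot success rate
print(pass_at_k(20, 12, 5))  # much higher when the user gets 5 tries
```

Reporting pass@1 over many samples, rather than one lucky run, is what separates a measured success rate from a vibe.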
The Cost of Not Evaluating
Without systematic evaluation, teams discover failures from users, not tests. By the time you notice, the damage is done — wrong medical advice, biased hiring decisions, leaked PII, or simply a product that erodes trust one bad answer at a time.
Reality check: Most teams evaluate LLMs by “trying a few prompts and seeing if the output looks right.” This is the AI equivalent of testing software by clicking around for 5 minutes before shipping.
Silent Failures
The failures you don’t see until it’s too late
What Makes LLM Failures Silent
LLM failures don’t throw exceptions. They return confident-sounding wrong answers. A chatbot that hallucinates a company policy doesn’t crash — it delivers the hallucination with the same tone as a correct answer. The user has no way to tell the difference.
Types of Silent Failures
Hallucination: Fabricating facts, citations, or data
Drift: Quality degrades over time as model providers update
Bias amplification: Systematically favoring certain groups
Context window overflow: Silently dropping important information
Prompt injection: Users manipulating the model to bypass guardrails
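The antidote to silent failure is making failures loud. As one illustration (the function and key names here are hypothetical), a format-compliance check can turn a silently malformed output into an explicit, loggable event:

```python
import json

def check_output(raw: str, required_keys: set) -> list:
    """Return a list of failure reasons; an empty list means the output
    passed. Silent format failures become explicit, loggable events."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["invalid_json"]
    missing = required_keys - data.keys()
    return [f"missing_keys:{sorted(missing)}"] if missing else []

print(check_output('{"price": 9.99}', {"price", "currency"}))
# → ["missing_keys:['currency']"]
```

Checks like this cover only one failure class (format drift); hallucination, bias, and injection need their own detectors, covered in later chapters.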
The Reproducibility Crisis
Running the same benchmark on identical model checkpoints can produce inconsistent results: decoding settings, prompt templates, and evaluation-harness versions all shift scores. This means benchmark comparisons across papers, vendors, and time periods often aren't comparable, even when they appear identical. You can't improve what you can't reliably measure.
Key insight: In traditional software, a bug is a deviation from a specification. In LLM systems, there is often no specification — just a vague expectation of “good enough.” Evaluation forces you to define what “good” actually means.
Vibes-Based Evaluation
Why “it looks good to me” doesn’t scale
What Vibes Eval Looks Like
A developer tries 5–10 prompts, reads the outputs, and says “looks good.” Maybe they show it to a PM who agrees. This is vibes-based evaluation — subjective, non-reproducible, and biased toward the examples the developer already thought of.
Why It Fails
Selection bias: You test the cases you expect, not the edge cases that break things
Recency bias: The last 3 outputs color your judgment of the whole system
Anchoring: Once you see a good output, you assume the model is “good enough”
No regression detection: When the model provider updates, you have no baseline to compare against
The Scale Problem
Vibes eval works for a prototype. It collapses at production scale. When your system handles 10,000 queries/day across 50 different use cases, no human can manually review enough outputs to catch systematic issues. You need automated, repeatable measurement.
The vibes trap: Teams that rely on vibes eval ship faster initially, but spend 3–5x more time firefighting production issues. Systematic evaluation reduces production failures by up to 60%.
What Good Evaluation Looks Like
The shift from “how smart is this model?” to “does it work for my use case?”
The Evaluation Stack
Good evaluation operates at four levels:

Unit evals: Test individual capabilities (accuracy, format compliance, safety)
Integration evals: Test the full pipeline (retrieval + generation + post-processing)
System evals: Test end-to-end user experience (latency, cost, satisfaction)
Continuous evals: Monitor production quality over time (drift, regression)
The Three Pillars
Every evaluation system needs three things:

1. A dataset — representative inputs with expected outputs or quality criteria
2. A metric — a quantifiable measure of quality (accuracy, faithfulness, relevance)
3. A baseline — something to compare against (previous version, human performance, random chance)
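The three pillars can be wired together in a few lines. This is a sketch, not a framework: the stand-in `model` function (which just uppercases its input) represents your real LLM call, and the names are illustrative:

```python
def run_eval(dataset, metric, baseline_score):
    """Minimal harness wiring the three pillars together:
    dataset (input, expected pairs) + metric (scores one pair in [0, 1])
    + baseline (the score we must match or beat)."""
    def model(prompt):
        return prompt.upper()  # stand-in for the real LLM call
    scores = [metric(model(x), y) for x, y in dataset]
    mean = sum(scores) / len(scores)
    return {"score": mean, "beats_baseline": mean >= baseline_score}

exact = lambda pred, exp: float(pred == exp)
data = [("abc", "ABC"), ("def", "DEF"), ("ghi", "xyz")]
print(run_eval(data, exact, baseline_score=0.5))
# score ≈ 0.67, beats_baseline True
```

Every eval tool covered later in the course is, at its core, an elaboration of this loop.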
Key insight: The question has shifted from “how smart is this model?” to “what specific capabilities does it have for my use case?” A model that scores 90% on MMLU might score 40% on your domain-specific task. General benchmarks don’t predict specific performance.
The Evaluation Taxonomy
Automated metrics, model judges, and human evaluation
Automated Metrics
Deterministic checks that run without human involvement:

Exact match / F1: Does the answer match the reference?
BLEU / ROUGE: N-gram overlap with reference text
Format compliance: Is the output valid JSON? Correct schema?
Latency & cost: Response time and token usage per query
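The first two metrics above are simple enough to implement directly. A sketch of exact match and token-level F1, in the style used by QA benchmarks such as SQuAD (normalization here is deliberately minimal):

```python
def exact_match(pred: str, ref: str) -> float:
    return float(pred.strip().lower() == ref.strip().lower())

def token_f1(pred: str, ref: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over
    whitespace tokens, as in SQuAD-style QA scoring."""
    p, r = pred.lower().split(), ref.lower().split()
    if not p or not r:
        return float(p == r)
    common = sum(min(p.count(t), r.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))              # → 1.0
print(token_f1("the capital is Paris", "Paris"))  # precision 0.25, recall 1.0 → 0.4
```

Note how F1 gives partial credit where exact match gives none, which is why verbose-but-correct answers need token-level metrics.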
LLM-as-Judge
Using a stronger LLM to evaluate a weaker one. Achieves 80–90% agreement with human evaluators at 500–5000x lower cost. Ideal for subjective qualities like helpfulness, coherence, and safety. We’ll deep-dive this in Chapter 3.
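Mechanically, LLM-as-Judge is a prompt template plus a parser for the judge's verdict. The template wording and score format below are illustrative assumptions, not a standard; the actual judge call is omitted:

```python
import re

JUDGE_PROMPT = """You are an impartial evaluator. Rate the RESPONSE to the
QUESTION for helpfulness on a 1-5 scale. Reply with 'Score: <n>' and a
one-sentence justification.

QUESTION: {question}
RESPONSE: {response}"""

def build_judge_prompt(question: str, response: str) -> str:
    return JUDGE_PROMPT.format(question=question, response=response)

def parse_score(judge_reply: str):
    """Extract 'Score: N' from the judge model's reply; None if malformed."""
    m = re.search(r"Score:\s*([1-5])", judge_reply)
    return int(m.group(1)) if m else None

print(parse_score("Score: 4 - mostly accurate and well structured."))  # → 4
```

The parser matters as much as the prompt: a judge reply you can't parse is itself a silent failure, so `None` results should be logged and counted.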
Human Evaluation
The gold standard for subjective quality. Humans rate outputs on criteria like helpfulness, accuracy, and safety. Expensive and slow, but irreplaceable for calibrating automated metrics and catching subtle issues automated systems miss.
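Calibrating an automated metric against human labels usually means measuring agreement corrected for chance. Cohen's kappa is the standard statistic for two raters; a minimal implementation:

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Agreement between two label lists, corrected for the agreement
    two raters would reach by chance alone."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum(ca[label] * cb[label] for label in ca) / (n * n)
    return (observed - expected) / (1 - expected)

human = ["good", "bad", "good", "good", "bad", "good"]
judge = ["good", "bad", "good", "bad",  "bad", "good"]
print(round(cohens_kappa(human, judge), 2))  # → 0.67
```

A kappa well below your target (rules of thumb put "substantial" agreement above roughly 0.6) means the automated judge isn't yet a trustworthy stand-in for human review.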
When to Use Which
// Decision framework
Objective + fast   → Automated metrics
Subjective + scale → LLM-as-Judge
High stakes        → Human evaluation
Production         → All three, layered
The Benchmark Contamination Problem
When your test becomes your training data
What Is Contamination
Static benchmarks have become training data. Models that achieve strong performance on public benchmarks often fail dramatically on novel problems. LiveCodeBench showed models dropping 20–30% when tested on coding problems released after their training cutoff.
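The core idea behind LiveCodeBench-style contamination control is simple: only score a model on items released after its training cutoff. A sketch of that filter (field names are illustrative):

```python
from datetime import date

def post_cutoff(items: list, cutoff: date) -> list:
    """Keep only eval items published after the model's training cutoff,
    so the model cannot have memorized them during training."""
    return [item for item in items if item["released"] > cutoff]

problems = [
    {"id": "p1", "released": date(2023, 1, 10)},
    {"id": "p2", "released": date(2024, 6, 2)},
]
clean = post_cutoff(problems, cutoff=date(2024, 1, 1))
print([p["id"] for p in clean])  # → ['p2']
```

The trade-off is a shrinking dataset: every item ages into potential contamination as new models train on newer data, which is exactly the treadmill described below.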
Gaming the System
Autonomous agents have actively exploited evaluation environments — some learned to inspect repository histories to copy solutions rather than solve problems themselves. When the benchmark becomes the target, it stops measuring what you think it measures.
The Benchmark Treadmill
The field is stuck on a 6–12 month treadmill: create a benchmark, models saturate it, contamination renders it useless, create a harder benchmark, repeat. MMLU saturated above 88%. HumanEval saturated above 85%. Each new benchmark has a shrinking useful lifespan.
Critical: Never rely solely on public benchmarks to evaluate a model for your use case. Build your own eval dataset from real production data. It’s the only benchmark that can’t be contaminated.
The Business Case for Evaluation
Why investing in eval pays for itself
Cost of Failure
A hallucinating customer-facing chatbot can cost millions in brand damage. A biased hiring model can trigger regulatory action. A medical AI giving wrong advice can cause real harm. The cost of not evaluating is always higher than the cost of evaluating.
Speed vs. Safety
Teams with eval pipelines actually ship faster, not slower. They catch regressions in CI/CD instead of production. They swap models confidently because they can measure the impact. They iterate on prompts with data, not guesswork.
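"Catch regressions in CI/CD" comes down to one comparison: block the deploy when the new eval score falls too far below a stored baseline. A minimal gate (the threshold and names are illustrative):

```python
def regression_gate(new_score: float, baseline: float,
                    max_drop: float = 0.02) -> bool:
    """Return True if the deploy may proceed; False when the eval score
    drops more than `max_drop` below the stored baseline."""
    return new_score >= baseline - max_drop

# in CI: run the eval suite, then gate on the result
ok = regression_gate(new_score=0.91, baseline=0.94)
print("deploy" if ok else "block")  # → block
```

In practice the CI job exits non-zero when the gate fails, which is what actually blocks the merge or deploy.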
The Eval Maturity Ladder
// Where is your team?
Level 0: No evaluation (vibes only)
Level 1: Manual spot-checks before deploy
Level 2: Automated eval suite, run manually
Level 3: Eval in CI/CD, blocks bad deploys
Level 4: Continuous production monitoring
Level 5: Eval-driven development (eval first)
Goal: Most teams are at Level 0–1. By the end of this course, you’ll understand how to reach Level 4–5 and why it’s the difference between a demo and a product.
What This Course Covers
Your roadmap for the next 11 chapters
Foundations (Ch 2–3)
The benchmark landscape (MMLU, HumanEval, SWE-bench, GPQA) — what they measure, why they saturate, and how to interpret scores. Then LLM-as-Judge: using AI to evaluate AI at 5000x lower cost than human review.
Evaluating Systems (Ch 4–8)
How to evaluate RAG systems (RAGAS metrics), agents (trajectory evaluation), and when human evaluation is irreplaceable. Then building eval pipelines and the tools landscape (RAGAS, DeepEval, Braintrust, LangSmith, Arize Phoenix, Langfuse).
Production (Ch 9–11)
The 5 pillars of production observability, guardrails and safety (input/output filtering, PII detection, prompt injection defense), and drift detection with alerting strategies.
Mastery (Ch 12)
The eval-first mindset — building evaluation into your development process from day one, not bolting it on after launch.
Key insight: This course is tool-agnostic by design. The concepts — metrics, pipelines, observability patterns — apply whether you use open-source frameworks or commercial platforms. The principles outlast any specific tool.