Ch 10 — Evaluation & Metrics That Matter

Accuracy is not enough. Precision, recall, F1 — what they mean in product terms.
The Accuracy Trap
Why the most intuitive metric is often the most misleading
The Problem with Accuracy
Accuracy = (correct predictions) / (total predictions). It’s the first metric everyone asks about, and it’s often the worst one to rely on.

The classic example: Fraud occurs in 0.1% of transactions. A model that predicts “not fraud” for every single transaction achieves 99.9% accuracy. It catches zero fraud. The metric looks perfect; the model is useless.

This happens whenever the classes are imbalanced — when one outcome is far more common than the other. Spam detection (95% of email is not spam), disease diagnosis (99% of patients don’t have the disease), defect detection (99.5% of products pass inspection).

Accuracy hides the failures in the minority class — which is usually the class you care about most.
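The fraud example above can be reproduced in a few lines. A sketch, with illustrative dataset sizes matching the 0.1% fraud rate:

```python
# The accuracy trap: a do-nothing model on imbalanced data.

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# 100,000 transactions, 100 fraudulent (0.1%)
y_true = [1] * 100 + [0] * 99_900
y_pred = [0] * 100_000          # model predicts "not fraud" for everything

print(f"accuracy: {accuracy(y_true, y_pred):.3%}")   # 99.900%
fraud_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
print(f"fraud caught: {fraud_caught}")               # 0
```

The metric looks excellent; the model catches nothing.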
What to Use Instead
Accuracy is fine as a sanity check but should never be your primary metric. Instead, use metrics that separately measure performance on the things you care about:

Precision: When the model says “yes,” how often is it right?
Recall: Of all actual “yes” cases, how many did the model catch?
F1 Score: The balanced combination of precision and recall

These metrics force you to think about which types of errors matter rather than lumping all predictions together. They’re the language of AI product quality.
PM rule: When someone reports “95% accuracy,” your first question should be: “What’s the class distribution?” If one class is 95% of the data, 95% accuracy means the model may be doing nothing useful. Always ask for precision and recall alongside accuracy.
The Confusion Matrix
The foundation of all classification metrics — four numbers that tell the whole story
Four Outcomes
Every prediction falls into one of four categories:

True Positive (TP): Model said “yes” and the answer was yes.
• Fraud detector correctly flags a fraudulent transaction.
• Spam filter correctly catches a spam email.

False Positive (FP): Model said “yes” but the answer was no. (Type I error)
• Fraud detector flags a legitimate transaction as fraud.
• Spam filter sends a real email to the spam folder.

True Negative (TN): Model said “no” and the answer was no.
• Fraud detector correctly allows a legitimate transaction.

False Negative (FN): Model said “no” but the answer was yes. (Type II error)
• Fraud detector misses an actual fraudulent transaction.
• Spam filter lets spam through to the inbox.
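The four outcomes are just a tally over paired labels and predictions. A minimal sketch, using 1 = positive ("fraud") and 0 = negative:

```python
# Tally the four confusion-matrix outcomes from labels and predictions.

def confusion_counts(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn

y_true = [1, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)  # 2 1 4 1
```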
The Product Translation
Every AI product has its own version of these four outcomes. The PM’s job is to translate them into business impact:

Fraud detection:
• FP = Customer friction (blocked legitimate purchase) → lost revenue, angry customer
• FN = Missed fraud → financial loss, regulatory risk

Medical screening:
• FP = Healthy patient told they might be sick → unnecessary tests, anxiety
• FN = Sick patient told they’re healthy → delayed treatment, potential death

Content moderation:
• FP = Legitimate content removed → user frustration, censorship complaints
• FN = Harmful content allowed through → user safety risk, brand damage

For your product, which error is more expensive? This determines which metric you optimize.
The asymmetry decision: In almost every AI product, false positives and false negatives have different costs. The PM must quantify this asymmetry: “A false positive costs us $X. A false negative costs us $Y.” This single decision drives the model’s threshold, the evaluation criteria, and the product design.
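Once the asymmetry is quantified, comparing models becomes arithmetic. A sketch with hypothetical dollar figures (the counts and costs below are placeholders, not benchmarks):

```python
# Translate error counts into business cost using the PM's asymmetry numbers.

COST_FP = 15.0     # e.g. support contact after a blocked legitimate purchase
COST_FN = 500.0    # e.g. average loss from a missed fraudulent transaction

def error_cost(fp, fn, cost_fp=COST_FP, cost_fn=COST_FN):
    return fp * cost_fp + fn * cost_fn

# Two candidate models with different error profiles:
model_a = {"fp": 400, "fn": 20}   # aggressive: many false alarms, few misses
model_b = {"fp": 50,  "fn": 60}   # cautious: few false alarms, more misses

print(error_cost(**model_a))  # 16000.0 -> cheaper despite more total errors
print(error_cost(**model_b))  # 30750.0
```

Note that the model with more total errors wins: the cost asymmetry, not the error count, decides.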
Precision, Recall & F1
The three metrics every AI PM must understand cold
Precision
Precision = TP / (TP + FP)

“When the model says ‘yes,’ how often is it right?”

High precision means few false alarms. When the model flags something, you can trust it.

Optimize precision when false positives are expensive:
• Email spam filter — Sending a real email to spam is worse than letting some spam through
• Content recommendations — Bad recommendations annoy users and erode trust
• Automated actions — If the AI auto-deletes files it thinks are duplicates, false positives are catastrophic
Recall
Recall = TP / (TP + FN)

“Of all actual positive cases, how many did the model catch?”

High recall means few missed cases. The model catches most of what it should.

Optimize recall when false negatives are expensive:
• Cancer screening — Missing a tumor is far worse than a false alarm
• Fraud detection — Missing actual fraud costs more than investigating false flags
• Security threats — Missing an intrusion is catastrophic; false alerts are annoying but survivable
F1 Score
F1 = 2 × (Precision × Recall) / (Precision + Recall)

The harmonic mean of precision and recall. F1 is high only when both precision and recall are high. It penalizes models that sacrifice one for the other.

Use F1 when:
• False positives and false negatives are roughly equally costly
• You need a single number to compare models
• The dataset is imbalanced (accuracy is misleading)
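The three formulas above, computed directly from confusion counts (the counts here are made up for illustration):

```python
# Precision, recall, and F1 from TP/FP/FN, guarding against division by zero.

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

tp, fp, fn = 80, 20, 40
p, r = precision(tp, fp), recall(tp, fn)
print(f"precision={p:.2f} recall={r:.2f} f1={f1(p, r):.2f}")
# precision=0.80 recall=0.67 f1=0.73
```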
The Precision-Recall Trade-Off
Precision and recall are in tension. Increasing one typically decreases the other:

• Make the model more cautious (higher threshold) → precision goes up, recall goes down
• Make the model more aggressive (lower threshold) → recall goes up, precision goes down

The PM sets the threshold based on error cost asymmetry. There is no universally “right” balance — it depends entirely on your product and users.
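The trade-off is easiest to see by sweeping the threshold over a scored dataset. A sketch with made-up model scores:

```python
# Sweep the decision threshold and watch precision rise as recall falls.

scores = [0.95, 0.90, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    1,    0,    1,    0,    0,    1,    0,    0]

for thr in (0.3, 0.5, 0.8):
    preds = [1 if s >= thr else 0 for s in scores]
    tp = sum(p and t for p, t in zip(preds, labels))
    fp = sum(p and not t for p, t in zip(preds, labels))
    fn = sum(not p and t for p, t in zip(preds, labels))
    print(f"thr={thr:.1f} precision={tp/(tp+fp):.2f} recall={tp/(tp+fn):.2f}")
# thr=0.3 precision=0.62 recall=1.00
# thr=0.5 precision=0.67 recall=0.80
# thr=0.8 precision=1.00 recall=0.60
```

Raising the threshold (more cautious) trades recall for precision, exactly as described above.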
The PM cheat sheet: Precision = “trust the alarm.” Recall = “catch everything.” F1 = “balance both.” When someone says “we improved F1 from 0.82 to 0.87,” you should immediately ask: “Did precision or recall improve? Which matters more for our users?”
Evaluating LLM Products
When there’s no single “correct answer” — how to measure quality for generative AI
Why LLM Evaluation Is Harder
Classification metrics work when there’s a clear right/wrong answer. LLM products often don’t have one:

• “Summarize this document” — Many valid summaries exist
• “Write a marketing email” — Quality is subjective
• “Answer this question” — The answer might be correct but poorly worded, or well-worded but incomplete

You can’t compute precision/recall on a generated paragraph. You need different evaluation approaches.
Automated LLM Metrics
Reference-based metrics:
BLEU / ROUGE: Compare generated text against reference texts. Useful for translation and summarization. Limited because they measure word overlap, not meaning.
Exact match: For factual Q&A, does the answer match the expected answer exactly?

LLM-as-judge:
Use a stronger LLM to evaluate a weaker one. “Rate this response on accuracy (1–5), helpfulness (1–5), and safety (1–5).” Increasingly common but introduces its own biases (LLMs tend to prefer longer, more verbose responses).

Factual grounding:
For RAG products, measure whether the response is supported by the retrieved documents. Can the AI cite its sources? Are the citations accurate?
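To make the word-overlap limitation concrete, here is a minimal sketch in the spirit of ROUGE-1 recall: what fraction of the reference's words appear in the candidate? Real ROUGE implementations also handle stemming, higher-order n-grams, and multiple references.

```python
# Unigram-overlap recall between a generated candidate and a reference text.
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / sum(ref.values())

ref = "the model flags fraudulent transactions for review"
cand = "the model flags suspicious transactions"
print(f"{rouge1_recall(cand, ref):.2f}")  # 0.57
```

A paraphrase with the same meaning but different words would score poorly, which is exactly why overlap metrics can't stand alone for generative quality.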
The Evaluation Rubric
For LLM products, create a scoring rubric that evaluators (human or LLM) use consistently:

Example rubric for a customer support bot:

Accuracy (1–5):
5 = Factually correct, complete answer
3 = Mostly correct, minor omissions
1 = Incorrect or fabricated information

Helpfulness (1–5):
5 = Fully resolves the customer’s issue
3 = Partially helpful, needs follow-up
1 = Unhelpful or irrelevant

Safety (Pass/Fail):
Pass = No harmful, biased, or inappropriate content
Fail = Contains any safety violation

Tone (1–5):
5 = Professional, empathetic, on-brand
3 = Acceptable but generic
1 = Rude, confusing, or off-brand
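A rubric like the one above can be encoded as data so that human and LLM scores are aggregated the same way every cycle. A sketch; the field names and the equal-weight average are illustrative choices, not a prescription:

```python
# The support-bot rubric as data, with safety as a gate rather than an average.

RUBRIC = {
    "accuracy":    {"scale": (1, 5)},
    "helpfulness": {"scale": (1, 5)},
    "tone":        {"scale": (1, 5)},
    "safety":      {"scale": ("pass", "fail")},
}

def score_response(ratings: dict) -> dict:
    # Any safety failure sinks the response regardless of other scores.
    passed = ratings["safety"] == "pass"
    scored = [ratings[k] for k in ("accuracy", "helpfulness", "tone")]
    return {"safety_pass": passed,
            "mean_score": sum(scored) / len(scored) if passed else 0.0}

print(score_response({"accuracy": 5, "helpfulness": 4, "tone": 4,
                      "safety": "pass"}))
```

Treating safety as pass/fail rather than a fourth number to average is the design choice that keeps one great answer from "buying back" a safety violation.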
PM ownership: The evaluation rubric is a product artifact, not an engineering artifact. The PM defines what “good” looks like for the user. The ML team builds systems to measure it. If you don’t own the rubric, you don’t own the quality bar.
Human Evaluation
When automated metrics aren’t enough — and they often aren’t
When You Need Human Evaluation
Automated metrics are fast and scalable but miss nuance. Human evaluation is slow and expensive but catches what metrics miss:

Tone and style: Is the response on-brand? Empathetic? Professional?
Factual accuracy: Is the information correct? (Automated metrics can’t reliably check facts)
Harmful content: Is there subtle bias, toxicity, or inappropriate content?
User experience: Would a real user find this helpful? Confusing? Frustrating?
Edge cases: How does the model handle unusual or adversarial inputs?

For most AI products, you need both: automated metrics for continuous monitoring and human evaluation for periodic deep assessment.
Running Human Evaluation
Who evaluates:
Domain experts for accuracy (doctors for medical AI, lawyers for legal AI)
Target users for helpfulness and usability
Red team for safety and adversarial robustness
Internal team for quick iteration during development

How many evaluations:
• Minimum 200–500 examples per evaluation cycle
• Each example rated by 2–3 evaluators (to measure agreement)
• Include a mix of easy cases, hard cases, and edge cases

Measuring agreement:
If evaluators disagree frequently, either the rubric is ambiguous or the task is genuinely subjective. Inter-annotator agreement (Cohen’s kappa or percentage agreement) tells you how reliable your evaluation is. Below 70% agreement, refine the rubric before trusting the results.
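Cohen's kappa corrects raw agreement for the agreement two raters would reach by chance. A minimal sketch for two evaluators making pass/fail judgments (the ratings are made up):

```python
# Cohen's kappa for two raters: (observed - expected) / (1 - expected).

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    labels = set(rater_a) | set(rater_b)
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n)
                   for l in labels)
    return (observed - expected) / (1 - expected)

a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
print(f"{cohens_kappa(a, b):.2f}")  # 0.47
```

Here the raters agree 75% of the time, but kappa is only 0.47 because much of that agreement is expected by chance; this is why kappa is a sterner test than raw percentage agreement.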
The evaluation cadence: Run human evaluation at three points: before launch (go/no-go decision), monthly after launch (quality tracking), and after every major model change (regression check). Between human evaluations, automated metrics provide continuous monitoring. This combination gives you both depth and coverage.
Business Metrics
Connecting model metrics to the numbers that matter to the business
The Metrics Gap
Model metrics (precision, recall, F1) measure how well the model performs. Business metrics measure how much value the model creates. They’re related but not the same.

A model with 92% F1 that users ignore creates zero business value. A model with 78% F1 that saves users 2 hours per day is transformative.

The PM must bridge this gap by defining how model performance translates to business outcomes.
Common Business Metrics for AI
Efficiency metrics:
• Time saved per task (AI drafts reduce writing time by 40%)
• Automation rate (% of cases handled without human intervention)
• Cost per resolution (AI support costs $0.50 vs. human at $12)

Quality metrics:
• Task completion rate (% of users who achieve their goal with AI assistance)
• Error escalation rate (% of AI outputs that require human correction)
• User satisfaction (CSAT, NPS on AI-assisted interactions)

Revenue metrics:
• Conversion lift (AI recommendations increase purchase rate by X%)
• Retention impact (users with AI features retain X% better)
• Revenue per AI interaction (direct monetization of AI features)
The Metrics Stack
Build a three-layer metrics stack that connects model performance to business value:

Layer 1: Model metrics (technical team tracks daily)
Precision, recall, F1, latency, cost per query, hallucination rate

Layer 2: Product metrics (PM tracks weekly)
Task completion rate, user acceptance rate, escalation rate, regeneration rate, user satisfaction

Layer 3: Business metrics (leadership tracks monthly/quarterly)
Revenue impact, cost savings, efficiency gains, retention impact, competitive differentiation

Each layer should have clear leading indicators that predict the layer above. If model precision drops (Layer 1), expect escalation rate to rise (Layer 2), which will increase support costs (Layer 3).
The translation exercise: For every model metric improvement, articulate the business impact. “We improved recall from 85% to 92%. This means we catch 7% more fraud, preventing an estimated $340K in annual losses.” This translation is how you justify continued investment in AI to leadership.
A/B Testing AI Products
The gold standard for measuring real-world impact — with AI-specific complications
Why A/B Testing Matters for AI
Offline evaluation (test set metrics) tells you how the model performs on historical data. A/B testing tells you how it performs on real users. The gap can be significant:

• A model with higher F1 might produce outputs that users find less helpful
• A faster model with lower accuracy might have higher user satisfaction (speed matters)
• A model that’s technically better might confuse users who were accustomed to the old behavior

A/B testing measures what actually matters: does the new model make the product better for users?
AI-Specific A/B Challenges
Non-determinism: The same user might get different outputs from the same model, making it harder to attribute differences to the model vs. randomness.

Learning effects: Users adapt to AI behavior over time. A model that seems worse initially might be better once users learn its patterns.

Feedback contamination: If both variants feed into the same training pipeline, they can influence each other’s future performance.
A/B Testing Best Practices for AI
1. Test on business metrics, not model metrics.
The A/B test should measure task completion rate, user satisfaction, or revenue — not F1 score. Model metrics are for offline evaluation.

2. Run longer than traditional A/B tests.
AI behavior varies more than deterministic features. Run for at least 2 weeks to capture variance and user adaptation.

3. Segment results.
The new model might be better for power users but worse for new users. Check performance across user segments, not just overall.

4. Monitor safety metrics separately.
Even if the new model wins on business metrics, check that safety metrics (hallucination rate, harmful content) haven’t degraded.

5. Have a rollback plan.
If the new model causes unexpected issues, you need to switch back instantly. This requires infrastructure support.
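Reading out an A/B test on a business metric (practice 1 above) often reduces to comparing two proportions. A sketch using a standard two-proportion z-test; the traffic counts and completion numbers are made up:

```python
# Compare task completion rates between control and treatment variants.
from math import sqrt, erf

def two_proportion_z(success_a, n_a, success_b, n_b):
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
    return p_a, p_b, z, p_value

# Control (current model) vs. treatment (new model)
p_a, p_b, z, p = two_proportion_z(success_a=4100, n_a=10000,
                                  success_b=4350, n_b=10000)
print(f"completion {p_a:.1%} -> {p_b:.1%}, z={z:.2f}, p={p:.4f}")
```

Statistical significance is only the first gate; the segmentation and safety checks above still apply before shipping the winner.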
The A/B testing trap: Don’t A/B test too early. If the model hasn’t passed offline evaluation, it’s not ready for real users. A/B testing is for comparing models that are both “good enough” — not for discovering that a model is broken. Use offline evaluation as the gate; use A/B testing as the tiebreaker.
The Evaluation Playbook
Putting it all together — a systematic approach to AI product evaluation
The Evaluation Stack
Level 1: Automated metrics (continuous)
Run on every model change. Precision, recall, F1 for classification. ROUGE, factual accuracy for generation. Latency, cost, safety scores. This is your first gate — if automated metrics regress, don’t proceed.

Level 2: Human evaluation (periodic)
Run before launch, monthly after launch, and after major changes. Domain experts rate outputs using your rubric. 200–500 examples per cycle. This catches what automated metrics miss.

Level 3: A/B testing (for major changes)
Run when comparing meaningfully different models or approaches. Measure business metrics on real users. 2+ weeks duration. This is the final arbiter of real-world value.

Level 4: User feedback (continuous)
Thumbs up/down, corrections, satisfaction surveys. Low-cost, high-volume signal. Use to identify emerging issues and prioritize improvements.
The PM’s Evaluation Checklist
□ Primary metric defined — The one number that defines model quality for your product

□ Guardrail metrics defined — Metrics that must not degrade (safety, latency, cost)

□ Business metric linked — How model performance translates to business value

□ Evaluation dataset curated — 200–500 examples covering normal, hard, and edge cases

□ Rubric written — Clear scoring criteria for human evaluation

□ Automated pipeline built — Metrics computed on every model change

□ Human evaluation scheduled — Regular cadence with qualified evaluators

□ A/B testing infrastructure ready — Can split traffic and measure business metrics

□ Feedback mechanism live — Users can signal quality (thumbs up/down, corrections)
The bottom line: Evaluation is not a one-time activity — it’s a continuous system. The PM who builds a robust evaluation stack has a superpower: they can measure quality, detect degradation, compare alternatives, and justify investment with data. Without evaluation, you’re flying blind. With it, you’re making informed product decisions at every stage.