The Evaluation Stack
Level 1: Automated metrics (continuous)
Run on every model change. Track precision, recall, and F1 for classification; ROUGE and factual accuracy for generation; and latency, cost, and safety scores throughout. This is your first gate: if automated metrics regress, don't proceed.
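A Level 1 gate can be a few dozen lines of code in CI. The sketch below is illustrative: the metric names, baseline values, and thresholds are all hypothetical placeholders you would replace with your own.

```python
# Minimal sketch of an automated metric gate (hypothetical metrics/thresholds).
BASELINE = {"f1": 0.87}
GUARDRAILS = {"latency_ms": 500, "safety": 0.99}  # hard limits, never violated
REGRESSION_TOLERANCE = 0.01  # allowable drop in quality metrics

def passes_gate(candidate: dict) -> tuple[bool, list[str]]:
    """Return (ok, reasons) for a candidate model's metric dict."""
    failures = []
    # Quality metric must not regress beyond tolerance.
    if candidate["f1"] < BASELINE["f1"] - REGRESSION_TOLERANCE:
        failures.append(f"f1 regressed to {candidate['f1']:.3f}")
    # Guardrails are hard caps/floors regardless of quality gains.
    if candidate["latency_ms"] > GUARDRAILS["latency_ms"]:
        failures.append(f"latency {candidate['latency_ms']}ms over cap")
    if candidate["safety"] < GUARDRAILS["safety"]:
        failures.append(f"safety {candidate['safety']:.3f} below floor")
    return (not failures, failures)

# A candidate inside tolerance and within guardrails passes:
ok, reasons = passes_gate({"f1": 0.88, "latency_ms": 430, "safety": 0.996})
# → (True, [])
```

The key design point is that guardrails are checked independently of quality: a model that improves F1 but blows the latency budget still fails the gate.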
Level 2: Human evaluation (periodic)
Run before launch, monthly after launch, and after major changes. Domain experts rate outputs using your rubric. 200–500 examples per cycle. This catches what automated metrics miss.
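Once rubric scores come back, you need to aggregate them per criterion and surface the examples that scored poorly. A minimal sketch, assuming a 1–5 rubric with hypothetical criteria ("accuracy", "tone") and sample data:

```python
from statistics import mean

# Hypothetical rubric ratings: one dict per (evaluator, example) pair.
ratings = [
    {"example": 1, "accuracy": 5, "tone": 4},
    {"example": 1, "accuracy": 4, "tone": 4},
    {"example": 2, "accuracy": 2, "tone": 5},
]

def criterion_means(ratings, criteria=("accuracy", "tone")):
    """Mean score per rubric criterion across all ratings."""
    return {c: round(mean(r[c] for r in ratings), 2) for c in criteria}

def flag_low_examples(ratings, criterion="accuracy", threshold=3.0):
    """Examples whose mean score on a criterion falls below threshold."""
    by_example: dict[int, list[int]] = {}
    for r in ratings:
        by_example.setdefault(r["example"], []).append(r[criterion])
    return [ex for ex, scores in by_example.items() if mean(scores) < threshold]
```

Flagged examples feed back into the evaluation dataset and the next round of prompt or model fixes, which is exactly the loop Level 1 metrics can't close on their own.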
Level 3: A/B testing (for major changes)
Run when comparing meaningfully different models or approaches. Measure business metrics on real users, and run the test for at least two weeks. This is the final arbiter of real-world value.
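For a conversion-style business metric, the standard significance check is a two-proportion z-test. A self-contained sketch using only the standard library (sample counts below are made up for illustration):

```python
from math import sqrt, erf

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test comparing conversion rates of arms A and B.

    Returns (z, p_value). Positive z means arm B converts better.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# e.g. control converts 500/10000, treatment 600/10000:
z, p = two_proportion_z(500, 10_000, 600, 10_000)
```

Decide the sample size and test duration before launching the experiment; peeking at p-values mid-test and stopping early inflates false positives.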
Level 4: User feedback (continuous)
Thumbs up/down, corrections, satisfaction surveys. Low-cost, high-volume signal. Use to identify emerging issues and prioritize improvements.
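Because feedback is high-volume, the useful signal is the trend, not individual events. One simple pattern is a rolling thumbs-down rate with an alert threshold; the window size and threshold below are illustrative assumptions, not recommendations.

```python
from collections import deque

class FeedbackMonitor:
    """Rolling thumbs-down rate over the last `window` feedback events."""

    def __init__(self, window: int = 500, alert_rate: float = 0.15):
        self.events: deque[bool] = deque(maxlen=window)  # True = thumbs up
        self.alert_rate = alert_rate

    def record(self, thumbs_up: bool) -> None:
        self.events.append(thumbs_up)

    @property
    def down_rate(self) -> float:
        if not self.events:
            return 0.0
        return 1 - sum(self.events) / len(self.events)

    def alerting(self) -> bool:
        # Only alert once the window is full, to avoid noisy early readings.
        full = len(self.events) == self.events.maxlen
        return full and self.down_rate > self.alert_rate
```

Segmenting the same counter by feature, locale, or model version turns a raw satisfaction signal into a prioritized list of where quality is slipping.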
The PM’s Evaluation Checklist
□ Primary metric defined — The one number that defines model quality for your product
□ Guardrail metrics defined — Metrics that must not degrade (safety, latency, cost)
□ Business metric linked — How model performance translates to business value
□ Evaluation dataset curated — 200–500 examples covering normal, hard, and edge cases
□ Rubric written — Clear scoring criteria for human evaluation
□ Automated pipeline built — Metrics computed on every model change
□ Human evaluation scheduled — Regular cadence with qualified evaluators
□ A/B testing infrastructure ready — Can split traffic and measure business metrics
□ Feedback mechanism live — Users can signal quality (thumbs up/down, corrections)
The bottom line: Evaluation is not a one-time activity — it’s a continuous system. The PM who builds a robust evaluation stack has a superpower: they can measure quality, detect degradation, compare alternatives, and justify investment with data. Without evaluation, you’re flying blind. With it, you’re making informed product decisions at every stage.