Eval Anti-Patterns
• Eval theater: Running an eval suite that nobody looks at and whose results nobody acts on. This is worse than no eval at all because it creates false confidence
• Overfitting to eval: Tuning prompts specifically against your eval examples until they pass, without checking whether the gains generalize. Your eval set should be representative of production, not a target to game
• Stale eval data: An eval dataset that hasn’t been updated in 6 months. Production traffic evolves; your eval must evolve with it (see the refresh sketch after this list)
• Metric worship: Optimizing a single metric (e.g., accuracy) while ignoring safety, latency, and cost. Multi-dimensional evaluation is essential (see the gate sketch after this list)
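A minimal sketch of what acting on eval results across several dimensions can look like: a gate that checks quality, latency, cost, and safety together and exits non-zero so CI actually blocks the merge. The metric names, thresholds, and results-file format are illustrative assumptions, not any specific tool's API.

```python
"""Hypothetical multi-dimensional eval gate. Thresholds and the
results-file format are illustrative assumptions, not a real tool's API."""
import json
import sys

# Assumed format: one JSON object per eval case, e.g.
# {"correct": true, "latency_ms": 820, "cost_usd": 0.004, "unsafe": false}
THRESHOLDS = {
    "accuracy_min": 0.90,        # task quality
    "p95_latency_ms_max": 2000,  # responsiveness
    "mean_cost_usd_max": 0.01,   # unit economics
    "unsafe_rate_max": 0.0,      # safety: zero tolerance
}

def gate(results_path: str) -> int:
    with open(results_path) as f:
        rows = [json.loads(line) for line in f]
    n = len(rows)
    latencies = sorted(r["latency_ms"] for r in rows)
    metrics = {
        "accuracy": sum(r["correct"] for r in rows) / n,
        "p95_latency_ms": latencies[int(0.95 * (n - 1))],
        "mean_cost_usd": sum(r["cost_usd"] for r in rows) / n,
        "unsafe_rate": sum(r["unsafe"] for r in rows) / n,
    }
    failures = []
    if metrics["accuracy"] < THRESHOLDS["accuracy_min"]:
        failures.append("accuracy")
    if metrics["p95_latency_ms"] > THRESHOLDS["p95_latency_ms_max"]:
        failures.append("p95_latency_ms")
    if metrics["mean_cost_usd"] > THRESHOLDS["mean_cost_usd_max"]:
        failures.append("mean_cost_usd")
    if metrics["unsafe_rate"] > THRESHOLDS["unsafe_rate_max"]:
        failures.append("unsafe_rate")
    print(json.dumps(metrics, indent=2))
    if failures:
        print(f"EVAL GATE FAILED: {failures}", file=sys.stderr)
        return 1  # non-zero exit blocks the CI job: no silent overrides
    return 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))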
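The specific thresholds matter less than the fact that a failure is a hard stop; a gate the team routinely overrides is eval theater with extra steps. A second sketch, for keeping the dataset fresh without turning it into a target: periodically sample recent production traffic into the eval set, and keep a held-out slice that prompt tuning never touches. The traffic-log format, field names, and split ratio are assumptions for illustration.

```python
"""Hypothetical eval-set refresh. The traffic log format and the
80/20 dev/holdout split are illustrative assumptions."""
import json
import random

def refresh_eval_set(traffic_log: str, dev_path: str, holdout_path: str,
                     sample_size: int = 100, holdout_frac: float = 0.2,
                     seed: int = 42) -> None:
    with open(traffic_log) as f:
        recent = [json.loads(line) for line in f]  # recent production requests
    rng = random.Random(seed)
    sampled = rng.sample(recent, min(sample_size, len(recent)))
    rng.shuffle(sampled)
    cut = int(len(sampled) * holdout_frac)
    # Holdout slice: used only for final checks, never for prompt tuning,
    # so passing it still says something about generalization.
    splits = {holdout_path: sampled[:cut], dev_path: sampled[cut:]}
    for path, rows in splits.items():
        with open(path, "a") as f:  # append: grow the set, don't replace it
            for row in rows:
                # Note: sampled cases still need expected outputs labeled
                # by a human before they become usable eval examples.
                f.write(json.dumps(row) + "\n")
```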
More Anti-Patterns
• Tool-first thinking: Spending weeks evaluating eval tools instead of writing eval examples. The tool doesn’t matter if you don’t have data
• Perfectionism: Waiting for the perfect eval dataset before starting. 50 imperfect examples today beat 500 perfect examples next quarter
• Siloed evaluation: Only the ML team runs evals. Product, design, and QA should contribute examples from their unique perspectives
• Ignoring human eval: Relying entirely on automated metrics without periodic human review. Automated metrics drift; human judgment calibrates them (see the drift-check sketch below)
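A sketch of using periodic human review to calibrate an automated metric: sample a small batch of scored cases, collect human labels, and flag drift when agreement falls. The record format and the agreement threshold are assumptions for illustration.

```python
"""Hypothetical drift check between an automated scorer and human review.
The record format and 0.8 agreement threshold are illustrative assumptions."""
import random

def sample_for_review(scored_cases: list[dict], k: int = 20,
                      seed: int | None = None) -> list[dict]:
    """Pick a small random batch for, say, weekly human review."""
    return random.Random(seed).sample(scored_cases, min(k, len(scored_cases)))

def agreement(reviewed: list[dict]) -> float:
    """Fraction of cases where the automated verdict matches the human one.
    Each record is assumed to carry boolean 'auto_pass' and 'human_pass'."""
    matches = sum(r["auto_pass"] == r["human_pass"] for r in reviewed)
    return matches / len(reviewed)

def check_drift(reviewed: list[dict], min_agreement: float = 0.8) -> None:
    score = agreement(reviewed)
    if score < min_agreement:
        # The automated metric no longer tracks human judgment:
        # recalibrate the scorer before trusting its numbers again.
        raise RuntimeError(
            f"metric drift: agreement {score:.0%} < {min_agreement:.0%}")
```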
Warning: The most dangerous anti-pattern is eval theater — having a green CI/CD badge that nobody trusts. If your team routinely overrides eval failures to ship, your eval system has lost credibility. Fix the eval or fix the process, but never normalize ignoring eval results.