Ch 9 — NLP Evaluation

Accuracy, F1, BLEU, ROUGE, BERTScore, human evaluation — and why measuring NLP is harder than it looks
High Level
Classify → Generate → Learned → Human → Benchmark → Best Practice
Why NLP Evaluation Is Hard
Language has no single correct answer — and that breaks simple metrics
The Challenge
Evaluating NLP systems is fundamentally harder than evaluating other ML tasks. In image classification, a cat is a cat. In NLP, there are many correct ways to express the same meaning. "The cat sat on the mat" and "A feline rested upon the rug" mean the same thing but share almost no words. This means simple word-overlap metrics miss correct answers that use different vocabulary. Fluency, coherence, factuality, and relevance are separate dimensions of quality that no single metric captures. A summary can be fluent but factually wrong. A translation can be accurate but awkward. Human evaluation remains the gold standard, but it's expensive, slow, and subjective — annotators disagree 10–30% of the time on quality judgments. The field has developed dozens of metrics, each capturing a different aspect of quality, and the art of NLP evaluation is knowing which metrics matter for your task.
The Evaluation Problem
Reference: "The cat sat on the mat"
Prediction A: "A feline rested on the rug"
- Word overlap: 2/6 = 33% (low); meaning: correct
Prediction B: "The cat the on mat sat"
- Word overlap: 6/6 = 100% (high); meaning: nonsense
Quality dimensions:
- Fluency: grammatically correct?
- Coherence: logically consistent?
- Factuality: factually accurate?
- Relevance: answers the question?
- Adequacy: captures the meaning?
No single metric captures all dimensions.
Key insight: The fundamental challenge of NLP evaluation is that language is many-to-many: many different texts can express the same meaning, and the same text can have many valid interpretations. Any metric that relies on exact matching will underestimate quality.
Classification Metrics
Precision, recall, F1 — the foundation for all NLP evaluation
The Basics
For classification tasks, the core metrics are precision (of all items predicted positive, how many are actually positive?), recall (of all actually positive items, how many did we find?), and F1 score (harmonic mean of precision and recall). Accuracy is misleading with imbalanced classes — a spam detector that never flags spam gets 95% accuracy if only 5% of emails are spam. F1 balances precision and recall: high F1 requires both to be high. For multi-class problems, there are three averaging strategies: macro F1 (average F1 across classes, treating all classes equally), micro F1 (aggregate TP/FP/FN across classes, equivalent to accuracy), and weighted F1 (weight by class frequency). Macro F1 is preferred when all classes matter equally; weighted F1 when you care more about frequent classes.
Precision, Recall, F1
Precision = TP / (TP + FP): "Of predicted positives, how many are correct?"
Recall = TP / (TP + FN): "Of actual positives, how many were found?"
F1 = 2 × (P × R) / (P + R): harmonic mean, requires both to be high
Example (spam detection):
- 100 emails: 5 spam, 95 not spam
- Model predicts 10 as spam: 4 correct, 6 false positives
- Precision: 4/10 = 40%; Recall: 4/5 = 80%; F1: 53%
- Accuracy: 93/100 = 93% (misleading!)
Multi-class averaging:
- Macro: average F1 per class (equal weight)
- Micro: aggregate TP/FP/FN across classes
- Weighted: weight by class frequency
Key insight: Always report F1 instead of accuracy for NLP tasks. Accuracy hides poor performance on minority classes, which are often the classes you care about most (spam, toxic content, rare entities).
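The spam example above can be worked through in a few lines of Python (the helper name `precision_recall_f1` is illustrative, not from a library):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Spam example from the text: 100 emails, 5 spam.
# Model flags 10 as spam: 4 true positives, 6 false positives, 1 missed spam.
p, r, f1 = precision_recall_f1(tp=4, fp=6, fn=1)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # 0.40, 0.80, 0.53

# Accuracy also counts the 89 correctly ignored non-spam emails,
# which is why it looks deceptively high on imbalanced data.
accuracy = (4 + 89) / 100  # 0.93
```

In practice, `sklearn.metrics.f1_score` with `average="macro"` or `average="weighted"` handles the multi-class averaging strategies described above.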
BLEU Score
The workhorse metric for machine translation — and its many limitations
How BLEU Works
BLEU (Bilingual Evaluation Understudy, Papineni et al., 2002) measures n-gram overlap between a generated translation and one or more reference translations. It computes modified precision for 1-grams through 4-grams, clips counts to avoid gaming by repetition, and applies a brevity penalty for translations that are too short. BLEU scores range from 0 to 1 (often reported as 0–100). A BLEU score of 30–40 is considered good for machine translation. BLEU became the standard MT metric because it correlates reasonably with human judgment at the corpus level. But it has significant limitations: it penalizes valid synonyms ("big" vs "large"), ignores word order beyond n-grams, and correlates poorly with human judgment at the sentence level. Despite these flaws, BLEU remains widely used because it's fast, reproducible, and well-understood.
BLEU Calculation
BLEU = BP × exp(∑ w_n × log p_n)
- p_n = modified n-gram precision (counts clipped to the reference maximum)
- BP = brevity penalty (penalizes output that is too short)
- w_n = uniform weights (typically 1/4 each for n = 1..4)
Example:
- Ref: "The cat is on the mat"
- Hyp: "The the the the"
- Unclipped 1-gram precision: 4/4 = 100%
- Clipped: "the" appears only 2× in the reference, so precision = 2/4 = 50%
Strengths: fast, reproducible, standard
Weaknesses:
- Penalizes valid synonyms
- Ignores meaning (word overlap only)
- Poor sentence-level correlation with human judgment
Key insight: BLEU measures surface similarity, not meaning. Two translations can have the same meaning but very different BLEU scores. Use BLEU for comparing systems on the same test set, not for judging absolute translation quality.
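The clipping step that defeats the "the the the the" exploit can be sketched as follows (the function name and whitespace tokenization are illustrative; for reported results, use a standard implementation such as sacreBLEU):

```python
from collections import Counter

def clipped_ngram_precision(hyp_tokens, ref_tokens, n=1):
    """Modified n-gram precision: each hypothesis n-gram is counted at most
    as many times as it occurs in the reference (Papineni et al.'s clipping)."""
    hyp_ngrams = Counter(tuple(hyp_tokens[i:i + n])
                         for i in range(len(hyp_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n])
                         for i in range(len(ref_tokens) - n + 1))
    clipped = sum(min(count, ref_ngrams[ng]) for ng, count in hyp_ngrams.items())
    total = sum(hyp_ngrams.values())
    return clipped / total if total else 0.0

ref = "the cat is on the mat".split()
hyp = "the the the the".split()
# Unclipped precision would be 4/4; "the" is clipped to its 2 reference
# occurrences, giving 2/4 = 0.5.
print(clipped_ngram_precision(hyp, ref))  # 0.5
```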
ROUGE and METEOR
Recall-oriented metrics for summarization and improved translation evaluation
Beyond BLEU
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the standard metric for summarization. While BLEU focuses on precision (how much of the output matches the reference), ROUGE focuses on recall (how much of the reference is captured in the output). ROUGE-1 measures unigram recall, ROUGE-2 measures bigram recall, and ROUGE-L measures the longest common subsequence. For summarization, recall matters more than precision: you want to capture all the important information, even if you include some extra words. METEOR improves on both BLEU and ROUGE by incorporating stemming (matching "running" with "ran"), synonym matching (matching "big" with "large"), and word order penalties. METEOR correlates better with human judgment than BLEU, especially at the sentence level.
ROUGE and METEOR
ROUGE-1 (unigram recall):
- Ref: "The cat sat on the mat"; Hyp: "The cat sat"
- Recall: 3/6 = 50% (3 of 6 reference words found)
ROUGE-2 (bigram recall):
- Ref bigrams: "the cat", "cat sat", "sat on", ...
- Measures how many reference bigrams appear in the output
ROUGE-L (longest common subsequence):
- Captures word order without requiring contiguous n-grams
METEOR improvements over BLEU:
- Stemming: "running" matches "ran"
- Synonyms: "big" matches "large"
- Word order: penalizes scrambled output
- Better correlation with human judgment
Typical use:
- Translation: BLEU + METEOR
- Summarization: ROUGE-1, ROUGE-2, ROUGE-L
Key insight: The choice between precision-oriented (BLEU) and recall-oriented (ROUGE) metrics depends on your task. For translation, you want precision (don't add wrong words). For summarization, you want recall (don't miss important information).
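The ROUGE-1 example above can be reproduced with a short sketch (function name and tokenization are illustrative; the `rouge-score` package is the usual choice in practice):

```python
from collections import Counter

def rouge1_recall(hyp_tokens, ref_tokens):
    """ROUGE-1 recall: fraction of reference unigrams recovered by the
    hypothesis, with counts clipped so repeated words can't inflate the score."""
    ref_counts = Counter(ref_tokens)
    hyp_counts = Counter(hyp_tokens)
    overlap = sum(min(c, hyp_counts[tok]) for tok, c in ref_counts.items())
    return overlap / sum(ref_counts.values())

ref = "the cat sat on the mat".split()
hyp = "the cat sat".split()
print(rouge1_recall(hyp, ref))  # 3 of 6 reference tokens found -> 0.5
```

Note the asymmetry with BLEU: the denominator is the reference length (recall), not the hypothesis length (precision).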
BERTScore and Learned Metrics
Using neural models to evaluate neural models
Semantic Evaluation
BERTScore (Zhang et al., 2020) addresses the fundamental limitation of n-gram metrics: they can't recognize paraphrases. Instead of matching exact words, BERTScore computes cosine similarity between contextual embeddings of tokens in the prediction and reference. "Big" and "large" get high similarity because BERT knows they're synonymous. BERTScore computes precision (each predicted token's best match in the reference), recall (each reference token's best match in the prediction), and F1. It correlates significantly better with human judgment than BLEU or ROUGE, especially for creative and paraphrastic text. Other learned metrics include COMET (trained on human translation quality judgments), BLEURT (fine-tuned BERT for quality estimation), and UniEval (multi-dimensional evaluation). The trend is clear: learned metrics are replacing hand-crafted ones.
BERTScore
BERTScore procedure:
1. Encode ref and hyp with BERT
2. Compute pairwise cosine similarity between token embeddings
3. Greedily match each token to its best counterpart
4. Compute precision, recall, F1
Example:
- Ref: "The cat sat on the mat"; Hyp: "A feline rested on the rug"
- BLEU: low (few exact matches); BERTScore: high (semantic match)
- "cat" ~ "feline": cosine = 0.85
- "sat" ~ "rested": cosine = 0.78
- "mat" ~ "rug": cosine = 0.82
Other learned metrics:
- COMET: trained on human MT judgments
- BLEURT: fine-tuned BERT for quality estimation
- UniEval: multi-dimensional evaluation
Key insight: BERTScore represents a paradigm shift in NLP evaluation: from counting word overlaps to measuring semantic similarity. This aligns evaluation with what we actually care about — meaning, not surface form.
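The greedy-matching core of BERTScore can be sketched with toy 2-d vectors standing in for contextual embeddings (the vectors, token pairing, and function names are invented for illustration; real BERTScore obtains embeddings from a pretrained model, e.g. via the `bert-score` package):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def bertscore_f1(hyp_vecs, ref_vecs):
    """BERTScore-style greedy matching over token embeddings:
    precision = mean best match for each hypothesis token,
    recall    = mean best match for each reference token."""
    sims = [[cosine(h, r) for r in ref_vecs] for h in hyp_vecs]
    precision = sum(max(row) for row in sims) / len(hyp_vecs)
    recall = sum(max(sims[i][j] for i in range(len(hyp_vecs)))
                 for j in range(len(ref_vecs))) / len(ref_vecs)
    return 2 * precision * recall / (precision + recall)

# Toy embeddings: "feline" points nearly the same direction as "cat",
# "rested" nearly the same as "sat", so the score is high despite
# zero exact word overlap.
ref_vecs = [[1.0, 0.0], [0.0, 1.0]]   # stand-ins for "cat", "sat"
hyp_vecs = [[0.9, 0.1], [0.1, 0.9]]   # stand-ins for "feline", "rested"
print(round(bertscore_f1(hyp_vecs, ref_vecs), 3))
```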
Human Evaluation
The gold standard — expensive, subjective, and irreplaceable
Why Humans Are Still Needed
Human evaluation remains the gold standard for NLP because automated metrics can't fully capture fluency, coherence, factuality, and usefulness. Common human evaluation protocols include: Likert scale rating (rate quality 1–5), pairwise comparison (which output is better, A or B?), error annotation (mark specific errors in the output), and adequacy/fluency (separate scores for meaning preservation and grammatical quality). The challenges are significant: human evaluation is expensive ($0.10–$1.00 per judgment), slow (days to weeks for large evaluations), and subjective (inter-annotator agreement is typically 60–80%). Pairwise comparison is the most reliable protocol because it's easier for humans to compare two outputs than to assign absolute scores. This is why LLM leaderboards like Chatbot Arena use pairwise human preferences.
Human Evaluation Protocols
Likert scale (1-5):
- Rate fluency, coherence, factuality
- Problem: calibration varies by annotator
Pairwise comparison:
- "Which is better, A or B?"
- Most reliable protocol; used by Chatbot Arena (LMSYS)
Error annotation:
- Mark specific errors (factual, grammatical)
- Most informative but most expensive
Challenges:
- Cost: $0.10-$1.00 per judgment
- Speed: days to weeks
- Agreement: 60-80% inter-annotator
- Bias: position bias, length bias
Best practice:
- 3+ annotators per item
- Clear guidelines with examples
- Measure inter-annotator agreement
Key insight: Human evaluation is not optional for production NLP systems. Automated metrics are useful for development iteration, but only human evaluation tells you whether the system actually works for users. Budget for it from the start.
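The "measure inter-annotator agreement" step above is commonly done with Cohen's kappa, which corrects raw agreement for chance; a minimal sketch for two annotators (the labels and function name are illustrative):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators: (observed - expected) agreement,
    normalized by (1 - expected), where 'expected' is chance agreement
    implied by each annotator's label distribution."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[lab] * counts_b[lab] for lab in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators rating 10 outputs as good/bad: 80% raw agreement,
# but kappa is lower because much of it could occur by chance.
a = ["good", "good", "bad", "good", "bad", "good", "bad", "bad", "good", "good"]
b = ["good", "bad", "bad", "good", "bad", "good", "good", "bad", "good", "good"]
print(round(cohens_kappa(a, b), 2))
```

With three or more annotators, Fleiss' kappa or Krippendorff's alpha are the usual generalizations.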
Benchmarks and Leaderboards
GLUE, SuperGLUE, and the standardized evaluation landscape
Standard Benchmarks
NLP benchmarks provide standardized test sets for comparing models. GLUE (General Language Understanding Evaluation) and its harder successor SuperGLUE test models on a suite of understanding tasks: sentiment, textual entailment, paraphrase detection, and question answering. Models are scored on each task and given an aggregate score. BERT achieved 80.5 on GLUE; human performance is 87.1. Modern models have saturated both benchmarks, scoring above human performance. This led to harder benchmarks: MMLU (massive multitask language understanding, 57 subjects), BIG-Bench (204 diverse tasks), and Chatbot Arena (live human pairwise comparisons). The benchmark treadmill — models saturate benchmarks faster than new ones are created — is a persistent challenge. Benchmarks also suffer from data contamination: test data leaking into training corpora.
Key Benchmarks
GLUE / SuperGLUE:
- Language understanding suite: sentiment, entailment, QA, similarity
- Saturated: models exceed human scores
MMLU:
- 57 subjects (math, history, law, medicine)
- Tests broad knowledge + reasoning
- GPT-4: ~87%; human expert: ~90%
Chatbot Arena (LMSYS):
- Live human pairwise comparisons
- Elo rating system for LLMs
- Among the most trusted LLM rankings
Benchmark problems:
- Saturation: models exceed human scores
- Contamination: test data leaks into training
- Narrowness: benchmarks test specific skills
- Gaming: optimizing for the benchmark, not quality
Key insight: Benchmarks are useful for relative comparison but dangerous as absolute measures. A model that scores 90% on MMLU may still fail on your specific use case. Always evaluate on data that represents your actual deployment scenario.
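A single Elo update of the kind Chatbot Arena uses to turn pairwise human preferences into a ranking can be sketched as follows (the K-factor, starting ratings, and function name are illustrative assumptions; the live leaderboard uses a more robust statistical fit over all battles):

```python
def elo_update(r_a, r_b, winner_a, k=32):
    """One Elo update after a pairwise comparison between models A and B.
    expected_a is A's predicted win probability given the rating gap."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if winner_a else 0.0
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

# Two models start equal at 1000; model A wins one human pairwise vote.
a, b = elo_update(1000, 1000, winner_a=True)
print(a, b)  # 1016.0 984.0
```

Because the expected score depends on the rating gap, beating a much stronger model moves the ratings more than beating a weaker one.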
Evaluation Best Practices
Building a robust evaluation strategy for your NLP system
A Practical Framework
A robust NLP evaluation strategy combines multiple approaches. During development: use automated metrics (F1, BLEU, ROUGE, BERTScore) for fast iteration. Run evaluations after every experiment. Track metrics over time to catch regressions. Before deployment: conduct human evaluation on a representative sample. Use pairwise comparison against the current system. Test on edge cases, adversarial inputs, and out-of-distribution data. After deployment: monitor real-world performance through user feedback, implicit signals (click-through, task completion), and periodic human evaluation. The most important practice is error analysis: manually examine failures to understand systematic patterns. A confusion matrix shows what the model gets wrong; error analysis shows why. The second most important practice: evaluate on data that looks like production, not on clean academic benchmarks.
Evaluation Checklist
During development:
- Automated metrics per experiment
- Track metrics over time
- Error analysis on failures
- Test on edge cases
Before deployment:
- Human evaluation (50-200 samples)
- Pairwise comparison vs. current system
- Adversarial testing
- Out-of-distribution testing
After deployment:
- User feedback collection
- Implicit signals (clicks, completion)
- Periodic human evaluation
- Drift monitoring
Golden rules:
1. No single metric is sufficient
2. Evaluate on production-like data
3. Error analysis > hyperparameter tuning
4. Human eval is not optional
Key insight: The best evaluation strategy is layered: automated metrics for speed, human evaluation for quality, and production monitoring for real-world performance. Each layer catches failures the others miss.
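The pre-deployment pairwise comparison against the current system boils down to a simple tally of human votes (the function name and vote counts are illustrative):

```python
def pairwise_win_rate(judgments):
    """Summarize pairwise human judgments of a new system vs. the current one.
    judgments: list of 'new', 'current', or 'tie' votes."""
    n = len(judgments)
    return {
        "win_rate": judgments.count("new") / n,
        "loss_rate": judgments.count("current") / n,
        "tie_rate": judgments.count("tie") / n,
    }

# Hypothetical 100-vote evaluation: the new system wins 60, loses 30, ties 10.
votes = ["new"] * 60 + ["current"] * 30 + ["tie"] * 10
print(pairwise_win_rate(votes))
```

Before shipping on such numbers, check that the margin is larger than what annotator noise (the 60-80% agreement rates above) could produce on its own.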