The Basics
For classification tasks, the core metrics are precision (of all items predicted positive, how many are actually positive?), recall (of all actually positive items, how many did we find?), and the F1 score (the harmonic mean of precision and recall). Accuracy is misleading with imbalanced classes: a spam detector that never flags spam gets 95% accuracy if only 5% of emails are spam. F1 balances precision and recall; a high F1 requires both to be high. For multi-class problems, there are three averaging strategies: macro F1 (average the per-class F1 scores, treating all classes equally), micro F1 (aggregate TP/FP/FN across classes, which for single-label classification equals accuracy), and weighted F1 (average per-class F1 weighted by class frequency). Macro F1 is preferred when all classes matter equally; weighted F1 when frequent classes matter more.
Precision, Recall, F1
Precision = TP / (TP + FP)
"Of predicted positives, how many correct?"
Recall = TP / (TP + FN)
"Of actual positives, how many found?"
F1 = 2 × (P × R) / (P + R)
Harmonic mean: requires both to be high
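The three formulas above can be sketched as plain functions (a minimal illustration from the definitions, not tied to any library):

```python
def precision(tp, fp):
    # Of predicted positives, how many are correct?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of actual positives, how many were found?
    return tp / (tp + fn)

def f1(p, r):
    # Harmonic mean: drops sharply if either P or R is low
    return 2 * p * r / (p + r)
```

Note that the harmonic mean is dominated by the smaller of the two values: f1(0.9, 0.1) is 0.18, not 0.5.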
Example (spam detection):
100 emails: 5 spam, 95 not spam
Model predicts 10 as spam
4 correct spam, 6 false positives
Precision: 4/10 = 40%
Recall: 4/5 = 80%
F1: 53%
Accuracy: 93% (misleading!)
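The spam example can be checked in a few lines; the counts come directly from the example above (TN = 89 follows from 95 non-spam emails minus 6 false positives):

```python
# Counts from the spam-detection example: 100 emails, 5 actually spam
tp, fp, fn, tn = 4, 6, 1, 89

precision = tp / (tp + fp)                          # 4/10 = 0.40
recall = tp / (tp + fn)                             # 4/5  = 0.80
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"P={precision:.0%}  R={recall:.0%}  F1={f1:.0%}  Acc={accuracy:.0%}")
```

Accuracy looks strong only because the 89 true negatives dominate; F1 ignores true negatives and exposes the weak precision.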
Multi-class averaging:
Macro: average F1 per class (equal weight)
Micro: aggregate TP/FP/FN across classes (equals accuracy for single-label tasks)
Weighted: weight by class frequency
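A sketch of the averaging strategies using scikit-learn (assumes scikit-learn is installed; the toy labels and the rare class 2 are made up for illustration):

```python
from sklearn.metrics import f1_score, accuracy_score

# Toy 3-class labels with a rare class (2) that the model never predicts
y_true = [0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# zero_division=0 gives class 2 an F1 of 0 instead of a warning
print("macro:   ", f1_score(y_true, y_pred, average="macro", zero_division=0))
print("micro:   ", f1_score(y_true, y_pred, average="micro", zero_division=0))
print("weighted:", f1_score(y_true, y_pred, average="weighted", zero_division=0))
print("accuracy:", accuracy_score(y_true, y_pred))
```

Macro F1 is dragged down by the missed rare class, while micro F1 matches accuracy exactly; that gap is the signal macro averaging exists to surface.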
Key insight: Always report F1 instead of accuracy for NLP tasks. Accuracy hides poor performance on minority classes, which are often the classes you care about most (spam, toxic content, rare entities).