Position Bias
In pairwise comparisons, judges tend to prefer the first response (or sometimes the second, depending on the model). Mitigation: run each comparison twice with swapped positions and average the results.
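The swap-and-average mitigation can be sketched as follows. This is a minimal illustration, not a specific library's API: `judge` stands in for any pairwise judge call that returns 1.0 when the first-listed response wins, 0.0 when the second wins, and 0.5 for a tie.

```python
def debiased_compare(judge, prompt, resp_a, resp_b):
    """Run a pairwise judgment twice with positions swapped and
    average the two verdicts, so a pure position preference cancels.

    Returns the averaged win score for resp_a (1.0 = clear win,
    0.5 = tie, 0.0 = clear loss).
    """
    forward = judge(prompt, resp_a, resp_b)        # resp_a shown first
    reverse = 1.0 - judge(prompt, resp_b, resp_a)  # resp_a shown second
    return (forward + reverse) / 2.0

# A deliberately position-biased stub judge that always prefers
# whichever response appears first:
biased_judge = lambda prompt, first, second: 1.0
print(debiased_compare(biased_judge, "q", "answer A", "answer B"))  # 0.5
```

With a maximally position-biased judge the two orderings cancel exactly to a tie (0.5), which is the intended behavior: only a preference that survives the swap counts as a real win.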
Verbosity Bias
LLM judges prefer longer, more detailed responses even when shorter answers are more accurate and appropriate. A concise correct answer often scores lower than a verbose partially-correct one. Mitigation: instruct the judge explicitly that length is not a virtue, or apply length-controlled scoring that adjusts for response length.
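A cheap diagnostic for verbosity bias is to check whether judge scores correlate with response length across an evaluation set. The sketch below computes a Pearson correlation with only the standard library; the sample lengths and scores are hypothetical.

```python
def length_score_correlation(lengths, scores):
    """Pearson correlation between response length and judge score.

    A strong positive value suggests the judge is rewarding verbosity
    rather than quality. `lengths` and `scores` are parallel lists of
    numbers (e.g. token counts and 1-10 judge ratings).
    """
    n = len(lengths)
    mean_l = sum(lengths) / n
    mean_s = sum(scores) / n
    cov = sum((l - mean_l) * (s - mean_s) for l, s in zip(lengths, scores))
    var_l = sum((l - mean_l) ** 2 for l in lengths)
    var_s = sum((s - mean_s) ** 2 for s in scores)
    return cov / (var_l * var_s) ** 0.5

# Hypothetical data where scores track length almost perfectly --
# a correlation near 1.0 is a red flag for verbosity bias:
print(length_score_correlation([120, 300, 450, 900], [3, 5, 6, 9]))
```

A near-zero correlation does not prove the judge is unbiased, but a strong positive one on responses of comparable quality is a clear warning sign.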
Self-Enhancement Bias
Models tend to rate their own outputs higher than outputs from other models. GPT-4 rates GPT-4 outputs more favorably than Claude outputs, and vice versa. Mitigation: use a different model family as judge than the one being evaluated.
Factual Blindness
JudgeBench research found that even advanced judges perform only slightly better than random guessing on tasks requiring factual verification, logical reasoning, and mathematical correctness. In short, LLM judges are better at judging style than verifying substance.
Critical: LLM judges are excellent for subjective quality (helpfulness, tone, coherence) but unreliable for objective correctness (factual accuracy, math, code correctness). Use deterministic checks for objective criteria.
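One form a deterministic check can take is sketched below: extract the final number from a response and compare it to a reference answer, with no LLM in the loop. The helper name and the numeric-extraction approach are illustrative assumptions; real harnesses also run unit tests for code and compare against answer keys for facts.

```python
import re

def check_numeric_answer(response: str, expected: float, tol: float = 1e-6) -> bool:
    """Deterministic correctness check: pull the last number out of the
    response text and compare it to the expected value within a
    tolerance. Reproducible, cheap, and immune to judge biases.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return False
    return abs(float(numbers[-1]) - expected) < tol

print(check_numeric_answer("The total is therefore 42.", 42))     # True
print(check_numeric_answer("Roughly 40, maybe 41 at most.", 42))  # False
```

A practical pattern is to split the rubric: route objective criteria (math, facts, code) to deterministic checks like this, and reserve the LLM judge for the subjective criteria it handles well.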