Annotator Biases
• Fatigue: Quality drops measurably after 50–100 evaluations per session. Accuracy falls 10–15% in the last hour of a long shift
• Anchoring: The first few examples set the baseline for all subsequent judgments, skewing scores
• Verbosity preference: Longer responses are rated higher regardless of accuracy — the same bias LLM judges have
• Recency bias: The last-read response in a pairwise comparison is rated more favorably
• Demographic bias: Annotator background, culture, and language fluency affect judgment on subjective tasks
Mitigation Strategies
• Session limits: Cap at 2 hours or 100 evaluations per session, whichever comes first
• Randomize order: Shuffle response positions in pairwise comparisons to counter position bias (see the sketch after this list)
• Blind evaluation: Never reveal model names, versions, or any metadata to annotators
• Diverse annotators: Mix demographics, expertise levels, and cultural backgrounds
• Gold questions: Embed known-answer items to detect annotator drift and disengagement
• Regular audits: Check for annotator drift monthly by comparing agreement on a fixed calibration set
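The randomization, blinding, and gold-question bullets compose naturally into one batch-building step. Below is a minimal Python sketch under assumed, illustrative conventions: the dict field names and the build_annotation_batch helper are not a fixed schema, just one way to wire these mitigations together.

```python
import random

def build_annotation_batch(pairs, gold_items, gold_rate=0.1, seed=None):
    """Assemble a blinded pairwise batch: strip model metadata, randomize
    left/right positions, and interleave known-answer gold items.

    Item shapes are illustrative: each pair carries "prompt", "response_a",
    "response_b"; gold items add "expected_winner" ("a" or "b").
    """
    rng = random.Random(seed)

    def to_task(item, is_gold):
        flipped = rng.random() < 0.5  # counter position/recency bias
        task = {
            "prompt": item["prompt"],
            "left": item["response_b"] if flipped else item["response_a"],
            "right": item["response_a"] if flipped else item["response_b"],
            # Mapping stays server-side; annotators never see model names or versions.
            "_key": {"left": "b", "right": "a"} if flipped else {"left": "a", "right": "b"},
            "_is_gold": is_gold,
        }
        if is_gold:
            task["_expected"] = item["expected_winner"]
        return task

    tasks = [to_task(p, False) for p in pairs]
    n_gold = min(len(gold_items), max(1, int(gold_rate * len(tasks))))
    tasks += [to_task(g, True) for g in rng.sample(gold_items, n_gold)]
    rng.shuffle(tasks)  # so gold items aren't clustered or predictable
    return tasks
```

Keeping the position key and gold flags server-side is the point: the annotator UI only ever renders the prompt plus a left and right response.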
Reality check: Human-to-human agreement on subjective tasks typically lands at a Cohen's kappa of 0.60–0.80. If your annotators agree above 0.85, they may be rubber-stamping; if below 0.50, your guidelines need work. The sweet spot is 0.65–0.80.
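For the monthly audit and the thresholds above, here is a small sketch of Cohen's kappa computed by hand for two annotators on a shared calibration set. The labels and the flagging bands mirror the numbers in this section; the sample data is made up purely for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same calibration items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each annotator's marginals.
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both annotators used one identical label
    return (p_o - p_e) / (1 - p_e)

# Illustrative audit on a fixed calibration set: flag annotator pairs
# whose agreement falls outside the 0.50-0.85 band.
a = ["win_a", "win_b", "tie", "win_a", "win_b", "win_a"]
b = ["win_a", "win_b", "win_a", "win_a", "win_b", "tie"]
kappa = cohens_kappa(a, b)
print(f"kappa = {kappa:.2f}")
if kappa > 0.85:
    print("suspiciously high agreement: check for rubber-stamping")
elif kappa < 0.50:
    print("low agreement: revisit the guidelines")
```

In practice you would run this pairwise (or annotator-vs-pool) over the fixed calibration set each month and track the trend, not just the point value.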