Ch 3 — LLM-as-Judge

Using LLMs to evaluate LLMs — 80–90% human agreement at up to 5000x lower cost
High Level
[Diagram: Prompt → Model → Response → Judge → Score → Improve]
The Core Idea
Why use an LLM to judge another LLM?
The Problem
Human evaluation is the gold standard, but it’s expensive ($5–25 per evaluation), slow (hours to days), and doesn’t scale. If you need to evaluate 10,000 responses across 20 dimensions, human review would cost $50K–$500K and take weeks.
The Solution
Use a strong LLM as an automated judge. Give it a rubric, the prompt, and the response, and ask it to score the output. Research shows LLM judges achieve 80–90% agreement with human evaluators at 500–5000x lower cost. The same 10,000 evaluations cost $10–$100 and finish in minutes.
How It Works
// Basic LLM-as-Judge pattern

System: You are an expert evaluator. Rate the response
on a scale of 1-5 for: accuracy, helpfulness, safety.
Provide a brief justification.

User: [Original prompt]
      [Model response]

Judge output:
accuracy: 4/5 - Mostly correct but...
helpfulness: 5/5 - Directly addresses...
safety: 5/5 - No harmful content
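The pattern above can be sketched in code. This is a minimal illustration: `build_judge_prompt` and `parse_judge_scores` are hypothetical helpers, and in practice the prompt would be sent to a real LLM API rather than parsed from a hard-coded string.

```python
import re

def build_judge_prompt(original_prompt: str, response: str) -> str:
    """Assemble the system + user text for a pointwise judge call."""
    system = (
        "You are an expert evaluator. Rate the response on a scale of 1-5 "
        "for: accuracy, helpfulness, safety. Provide a brief justification."
    )
    user = f"[Original prompt]\n{original_prompt}\n\n[Model response]\n{response}"
    return system + "\n\n" + user

def parse_judge_scores(judge_output: str) -> dict:
    """Extract 'criterion: N/5' lines from the judge's raw text."""
    scores = {}
    for criterion, score in re.findall(r"(\w+):\s*(\d)/5", judge_output):
        scores[criterion.lower()] = int(score)
    return scores

raw = "accuracy: 4/5 - Mostly correct\nhelpfulness: 5/5 - Direct\nsafety: 5/5 - Clean"
print(parse_judge_scores(raw))  # {'accuracy': 4, 'helpfulness': 5, 'safety': 5}
```

Parsing free text like this is brittle; later in the chapter the rubric switches to structured JSON output, which is the more robust choice for automation.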
Judge Patterns
Four ways to structure LLM evaluation
Pointwise Scoring
Rate a single response on a rubric (1–5 scale). Simplest pattern. Works well for absolute quality assessment. Risk: scores can drift without a reference point.
Pairwise Comparison
Show the judge two responses and ask which is better. More reliable than pointwise because relative judgments are easier than absolute ones. This is how Chatbot Arena works.
Reference-Based
Provide a gold-standard answer and ask the judge how close the response is. Best for factual tasks where a correct answer exists. Requires maintaining a reference dataset.
Multi-Aspect
Score on multiple dimensions separately (accuracy, tone, completeness, safety). More expensive per evaluation but gives granular insight into where a model fails, not just that it fails.
Best practice: Start with pairwise comparison for model selection, then use multi-aspect pointwise scoring for ongoing monitoring. Pairwise is more reliable; pointwise scales better.
Known Biases
Where LLM judges systematically get it wrong
Position Bias
In pairwise comparisons, judges tend to prefer the first response (or sometimes the second, depending on the model). Mitigation: run each comparison twice with swapped positions and average the results.
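The swap-and-average mitigation can be sketched as follows. `judge_fn` is a hypothetical callable standing in for a real pairwise judge call; a verdict that flips when positions are swapped is treated as inconclusive.

```python
def debiased_comparison(judge_fn, prompt, resp_a, resp_b):
    """Run a pairwise judge in both orders; only count consistent verdicts.

    judge_fn(prompt, first, second) returns "first" or "second"
    (a hypothetical interface -- wire in your own judge call).
    """
    v1 = judge_fn(prompt, resp_a, resp_b)   # A shown first
    v2 = judge_fn(prompt, resp_b, resp_a)   # B shown first
    a_wins_first_order = (v1 == "first")
    a_wins_second_order = (v2 == "second")
    if a_wins_first_order == a_wins_second_order:
        return "A" if a_wins_first_order else "B"
    return "tie"  # verdict flipped with position -> inconclusive

# A toy judge with pure position bias always yields a tie:
biased = lambda p, first, second: "first"
print(debiased_comparison(biased, "q", "ans1", "ans2"))  # "tie"
```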
Verbosity Bias
LLM judges prefer longer, more detailed responses even when shorter answers are more accurate and appropriate. A concise correct answer often scores lower than a verbose partially-correct one. Mitigation: state in the rubric that conciseness is acceptable and that length alone should not raise the score.
Self-Enhancement Bias
Models tend to rate their own outputs higher than outputs from other models. GPT-4 rates GPT-4 outputs more favorably than Claude outputs, and vice versa. Mitigation: use a different model family as judge than the one being evaluated.
Factual Blindness
JudgeBench research found that even advanced judges perform only slightly better than random on tasks requiring factual verification, logical reasoning, and mathematical correctness. LLM judges are better at style than substance.
Critical: LLM judges are excellent for subjective quality (helpfulness, tone, coherence) but unreliable for objective correctness (factual accuracy, math, code correctness). Use deterministic checks for objective criteria.
Building Effective Judge Prompts
The rubric is everything
Anatomy of a Good Rubric
A judge prompt needs four components:

1. Role: “You are an expert evaluator for customer support responses”
2. Criteria: Specific, measurable dimensions to score
3. Scale: Clear definitions for each score level (what does a 3 vs 4 look like?)
4. Output format: Structured JSON for automated parsing
Example Rubric
Criterion: Factual Accuracy
5 = All claims verifiable and correct
4 = Minor inaccuracy, doesn't affect answer
3 = One significant error, core is correct
2 = Multiple errors, misleading
1 = Fundamentally wrong or fabricated

Output: {"score": N, "reason": "..."}
Pro tip: Ask the judge to provide reasoning before the score (chain-of-thought). This improves accuracy by 10–15% because the model commits to an analysis before assigning a number.
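Putting the pro tip into practice: ask for the `reason` key before the `score` key so the model writes its analysis first, and parse defensively, since judges sometimes wrap JSON in prose. A minimal sketch (`parse_judge_json` is a hypothetical helper):

```python
import json

JUDGE_INSTRUCTIONS = (
    "Reply with JSON. Put your reasoning FIRST so the analysis "
    'precedes the number: {"reason": "...", "score": N}'
)

def parse_judge_json(raw: str) -> dict:
    """Extract the JSON object even if the judge wrapped it in prose."""
    start, end = raw.find("{"), raw.rfind("}")
    obj = json.loads(raw[start:end + 1])
    if not 1 <= obj["score"] <= 5:
        raise ValueError("score outside the 1-5 rubric range")
    return obj

raw = 'Here is my evaluation:\n{"reason": "One minor inaccuracy.", "score": 4}'
print(parse_judge_json(raw))  # {'reason': 'One minor inaccuracy.', 'score': 4}
```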
Calibrating Your Judge
How to know if your judge is trustworthy
The Calibration Process
1. Create a calibration set of 50–100 examples with human labels
2. Run the LLM judge on the same examples
3. Measure Cohen’s Kappa (agreement beyond chance) — aim for >0.6
4. Analyze disagreements to refine the rubric
5. Repeat until judge-human agreement stabilizes
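Step 3 is the only part that needs code. Cohen's Kappa compares observed agreement against the agreement two raters would reach by chance alone, given their label distributions. A minimal implementation for categorical labels:

```python
from collections import Counter

def cohens_kappa(human: list, judge: list) -> float:
    """Cohen's kappa: agreement beyond what chance alone would produce."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    return (observed - expected) / (1 - expected)

human_labels = [1, 2, 3, 4, 5, 3, 2, 4]
judge_labels = [1, 2, 3, 4, 5, 3, 2, 3]   # disagrees on one item
print(round(cohens_kappa(human_labels, judge_labels), 3))  # 0.84
```

For ordinal 1-5 scores, a weighted kappa (which penalizes a 1-vs-5 disagreement more than a 3-vs-4) is often more appropriate; libraries such as scikit-learn provide `cohen_kappa_score` with a `weights` option.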
Measuring Agreement
Don’t use simple correlation (Pearson’s r) — a judge can show perfect correlation while exhibiting systematic bias. Cohen’s Kappa measures true agreement, accounting for chance. Research on 54 LLMs found 23 models achieved “human-like” judgment patterns.
Agreement Benchmarks
// Cohen's Kappa interpretation
< 0.20       Poor agreement
0.21 - 0.40  Fair
0.41 - 0.60  Moderate
0.61 - 0.80  Substantial  ← target
0.81 - 1.00  Almost perfect

// Human-to-human agreement is
// typically 0.60-0.80 on subjective tasks
Key insight: Your LLM judge doesn’t need to be perfect — it needs to be as reliable as a human evaluator. Human-to-human agreement on subjective tasks is typically 0.60–0.80. Match that and you have a useful judge.
Cost Optimization
Getting more signal per dollar
The Cost Landscape
Using GPT-4o as a judge costs roughly $0.005–$0.02 per evaluation (depending on prompt length). Claude Sonnet is similar. Smaller models (GPT-4o-mini, Claude Haiku) cost 10–20x less but with lower reliability on complex judgments.
Tiered Judging
Use a cheap model for easy cases and an expensive model for hard ones:

1. Run a fast/cheap judge on all responses
2. Flag low-confidence scores (near decision boundaries)
3. Re-evaluate flagged items with a stronger judge
4. This cuts costs by 60–80% with minimal accuracy loss
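The routing logic above fits in a few lines. In this sketch, `cheap_judge` and `strong_judge` are hypothetical callables returning 1-5 scores, and the "low-confidence" band is simply scores near the middle of the scale; real systems might use the judge's stated confidence or score variance instead.

```python
def tiered_judge(responses, cheap_judge, strong_judge, low=2.5, high=3.5):
    """Score everything with the cheap judge; re-judge borderline cases."""
    results = []
    for r in responses:
        score = cheap_judge(r)
        if low <= score <= high:          # near the decision boundary
            score = strong_judge(r)       # escalate to the stronger judge
        results.append(score)
    return results

cheap = lambda r: len(r) % 5 + 1          # stand-in scoring for the demo
strong = lambda r: 4
print(tiered_judge(["ok", "maybe", "bad"], cheap, strong))  # [4, 1, 4]
```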
Variance-Adaptive Allocation
Recent research (2026) proposes dynamically allocating judge queries based on estimated score variance. Items with high variance (ambiguous quality) get more judge evaluations; clear-cut items get fewer. This achieves significantly better accuracy under fixed budgets.
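The idea can be illustrated with a simple allocation rule: run a small pilot round, then split the remaining judge-call budget proportionally to each item's observed score variance. This is an illustrative sketch, not the cited paper's exact method.

```python
import statistics

def allocate_budget(pilot_scores: dict, total_budget: int, min_calls: int = 1):
    """Split a fixed judge-call budget proportionally to per-item variance.

    pilot_scores maps item id -> list of scores from a small pilot round.
    """
    variances = {k: statistics.pvariance(v) for k, v in pilot_scores.items()}
    total_var = sum(variances.values()) or 1.0
    extra = total_budget - min_calls * len(pilot_scores)
    return {k: min_calls + round(extra * v / total_var)
            for k, v in variances.items()}

# "a" is clear-cut (zero variance); "b" is ambiguous and gets most of the budget.
pilot = {"a": [4, 4, 4], "b": [1, 5, 3], "c": [2, 4, 3]}
print(allocate_budget(pilot, total_budget=12))  # {'a': 1, 'b': 8, 'c': 3}
```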
Practical tip: For most teams, GPT-4o-mini or Claude Haiku as a first-pass judge, with GPT-4o or Claude Sonnet for flagged items, provides the best cost/accuracy tradeoff. Budget ~$50–$200/month for continuous evaluation of a production system.
When to Use (and Not Use) LLM Judges
Matching the technique to the task
Use LLM-as-Judge For
Helpfulness & relevance: Is the response useful?
Tone & style: Does it match brand guidelines?
Completeness: Did it address all parts of the question?
Safety screening: Is the content appropriate?
Coherence: Does the response make logical sense?
Comparative ranking: Which of two responses is better?
Don’t Use LLM-as-Judge For
Factual accuracy: Use retrieval + verification instead
Mathematical correctness: Use code execution
Code correctness: Run the code against tests
Format compliance: Use regex/schema validation
Latency measurement: Use instrumentation
High-stakes decisions: Use human review
Rule of thumb: If a deterministic check can answer the question, use it. LLM judges are for the subjective, nuanced dimensions that can’t be captured by a regex or a unit test.
Putting It All Together
A practical LLM-as-Judge architecture
The Hybrid Evaluation Stack
// Layer 1: Deterministic checks
Format  → JSON schema validation
Length  → Token count check
Safety  → Keyword + regex filters

// Layer 2: LLM-as-Judge (cheap)
Relevance → Haiku/Mini judge
Tone      → Haiku/Mini judge

// Layer 3: LLM-as-Judge (strong)
Flagged → GPT-4o/Sonnet re-judge

// Layer 4: Human review (sample)
Random 2% → Human calibration check
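The control flow of the stack can be sketched as a single function. Here `schema_ok`, `cheap_judge`, and `strong_judge` are hypothetical callables for layers 1-3; the key property is that deterministic checks fail fast before any LLM cost is incurred, and only borderline scores reach the strong judge.

```python
def evaluate(response: str, schema_ok, cheap_judge, strong_judge, threshold=3):
    """Run the layered stack: deterministic gate first, judges only if needed."""
    # Layer 1: deterministic check -- no LLM cost on failure.
    if not schema_ok(response):
        return {"pass": False, "layer": 1, "reason": "format"}
    # Layer 2: cheap judge on everything.
    score = cheap_judge(response)
    layer = 2
    # Layer 3: escalate scores at the decision boundary.
    if score == threshold:
        score, layer = strong_judge(response), 3
    return {"pass": score > threshold, "layer": layer, "score": score}

result = evaluate('{"answer": 42}',
                  schema_ok=lambda r: r.startswith("{"),
                  cheap_judge=lambda r: 3,        # borderline -> escalated
                  strong_judge=lambda r: 4)
print(result)  # {'pass': True, 'layer': 3, 'score': 4}
```

Layer 4 (the 2% human sample) runs offline on logged results, feeding the weekly calibration checks from the checklist below.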
Implementation Checklist
1. Define your evaluation criteria (3–5 dimensions)
2. Write detailed rubrics with score-level definitions
3. Create a calibration set with human labels (50–100 examples)
4. Test judge agreement (target Cohen’s Kappa > 0.6)
5. Implement tiered judging for cost efficiency
6. Run calibration checks weekly to detect judge drift
Next up: In Chapter 4, we’ll apply these evaluation techniques specifically to RAG systems — measuring faithfulness, context precision, and answer relevancy with the RAGAS framework.