Ch 3 — LLM-as-Judge

Using LLMs to evaluate LLMs — 80–90% human agreement at up to 5000x lower cost
High Level
[Diagram: Prompt → Model → Response → Judge → Score → Improve]
The Core Idea
Why use an LLM to judge another LLM?
The Problem
Human evaluation is the gold standard, but it’s expensive ($5–25 per evaluation), slow (hours to days), and doesn’t scale. If you need to evaluate 10,000 responses across 20 dimensions, human review would cost $50K–$500K and take weeks.
The Solution
Use a strong LLM as an automated judge. Give it a rubric, the prompt, and the response, and ask it to score the output. Research shows LLM judges achieve 80–90% agreement with human evaluators at 500–5000x lower cost. The same 10,000 evaluations cost $10–$100 and finish in minutes.
How It Works
// Basic LLM-as-Judge pattern

System: You are an expert evaluator. Rate the response
on a scale of 1-5 for: accuracy, helpfulness, safety.
Provide a brief justification.

User: [Original prompt]
      [Model response]

Judge output:
accuracy: 4/5 - Mostly correct but...
helpfulness: 5/5 - Directly addresses...
safety: 5/5 - No harmful content
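The pattern above can be sketched in code. This is a minimal illustration: `build_judge_prompt` and `parse_judge_scores` are hypothetical helpers, and in practice the prompt would be sent to a real LLM API rather than parsed from a hard-coded string.

```python
import re

def build_judge_prompt(original_prompt: str, response: str) -> str:
    """Assemble the system + user text for a pointwise judge call."""
    system = (
        "You are an expert evaluator. Rate the response on a scale of 1-5 "
        "for: accuracy, helpfulness, safety. Provide a brief justification."
    )
    user = f"[Original prompt]\n{original_prompt}\n\n[Model response]\n{response}"
    return system + "\n\n" + user

def parse_judge_scores(judge_output: str) -> dict:
    """Extract 'criterion: N/5' lines from the judge's raw text."""
    scores = {}
    for criterion, score in re.findall(r"(\w+):\s*(\d)/5", judge_output):
        scores[criterion.lower()] = int(score)
    return scores

raw = "accuracy: 4/5 - Mostly correct\nhelpfulness: 5/5 - Direct\nsafety: 5/5 - Clean"
print(parse_judge_scores(raw))  # {'accuracy': 4, 'helpfulness': 5, 'safety': 5}
```

Parsing free text like this is brittle; later in the chapter the rubric switches to structured JSON output, which is the more robust choice for automation.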
Judge Patterns
Four ways to structure LLM evaluation
Pointwise Scoring
Rate a single response on a rubric (1–5 scale). Simplest pattern. Works well for absolute quality assessment. Risk: scores can drift without a reference point.
Pairwise Comparison
Show the judge two responses and ask which is better. More reliable than pointwise because relative judgments are easier than absolute ones. This is how Chatbot Arena works.
Reference-Based
Provide a gold-standard answer and ask the judge how close the response is. Best for factual tasks where a correct answer exists. Requires maintaining a reference dataset.
Multi-Aspect
Score on multiple dimensions separately (accuracy, tone, completeness, safety). More expensive per evaluation but gives granular insight into where a model fails, not just that it fails.
Best practice: Start with pairwise comparison for model selection, then use multi-aspect pointwise scoring for ongoing monitoring. Pairwise is more reliable; pointwise scales better.
Known Biases
Where LLM judges systematically get it wrong
Position Bias
In pairwise comparisons, judges tend to prefer the first response (or sometimes the second, depending on the model). Mitigation: run each comparison twice with swapped positions and average the results.
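The swap-and-average mitigation can be sketched as follows. `judge_fn` is a hypothetical callable standing in for a real pairwise judge call; a verdict that flips when positions are swapped is treated as inconclusive.

```python
def debiased_comparison(judge_fn, prompt, resp_a, resp_b):
    """Run a pairwise judge in both orders; only count consistent verdicts.

    judge_fn(prompt, first, second) returns "first" or "second"
    (a hypothetical interface -- wire in your own judge call).
    """
    v1 = judge_fn(prompt, resp_a, resp_b)   # A shown first
    v2 = judge_fn(prompt, resp_b, resp_a)   # B shown first
    a_wins_first_order = (v1 == "first")
    a_wins_second_order = (v2 == "second")
    if a_wins_first_order == a_wins_second_order:
        return "A" if a_wins_first_order else "B"
    return "tie"  # verdict flipped with position -> inconclusive

# A toy judge with pure position bias always yields a tie:
biased = lambda p, first, second: "first"
print(debiased_comparison(biased, "q", "ans1", "ans2"))  # "tie"
```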
Verbosity Bias
LLM judges prefer longer, more detailed responses even when shorter answers are more accurate and appropriate. A concise correct answer often scores lower than a verbose partially-correct one. Mitigation: state in the rubric that conciseness is acceptable and that length alone should not raise the score.
Self-Enhancement Bias
Models tend to rate their own outputs higher than outputs from other models. GPT-4 rates GPT-4 outputs more favorably than Claude outputs, and vice versa. Mitigation: use a different model family as judge than the one being evaluated.
Factual Blindness
JudgeBench research found that even advanced judges perform only slightly better than random on tasks requiring factual verification, logical reasoning, and mathematical correctness. LLM judges are better at style than substance.
Critical: LLM judges are excellent for subjective quality (helpfulness, tone, coherence) but unreliable for objective correctness (factual accuracy, math, code correctness). Use deterministic checks for objective criteria.
Building Effective Judge Prompts
The rubric is everything
Anatomy of a Good Rubric
A judge prompt needs four components:

1. Role: “You are an expert evaluator for customer support responses”
2. Criteria: Specific, measurable dimensions to score
3. Scale: Clear definitions for each score level (what does a 3 vs 4 look like?)
4. Output format: Structured JSON for automated parsing
Example Rubric
Criterion: Factual Accuracy
5 = All claims verifiable and correct
4 = Minor inaccuracy, doesn't affect answer
3 = One significant error, core is correct
2 = Multiple errors, misleading
1 = Fundamentally wrong or fabricated

Output: {"score": N, "reason": "..."}
Pro tip: Ask the judge to provide reasoning before the score (chain-of-thought). This improves accuracy by 10–15% because the model commits to an analysis before assigning a number.
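Putting the pro tip into practice: ask for the `reason` key before the `score` key so the model writes its analysis first, and parse defensively, since judges sometimes wrap JSON in prose. A minimal sketch (`parse_judge_json` is a hypothetical helper):

```python
import json

JUDGE_INSTRUCTIONS = (
    "Reply with JSON. Put your reasoning FIRST so the analysis "
    'precedes the number: {"reason": "...", "score": N}'
)

def parse_judge_json(raw: str) -> dict:
    """Extract the JSON object even if the judge wrapped it in prose."""
    start, end = raw.find("{"), raw.rfind("}")
    obj = json.loads(raw[start:end + 1])
    if not 1 <= obj["score"] <= 5:
        raise ValueError("score outside the 1-5 rubric range")
    return obj

raw = 'Here is my evaluation:\n{"reason": "One minor inaccuracy.", "score": 4}'
print(parse_judge_json(raw))  # {'reason': 'One minor inaccuracy.', 'score': 4}
```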
Calibrating Your Judge
How to know if your judge is trustworthy
The Calibration Process
1. Create a calibration set of 50–100 examples with human labels
2. Run the LLM judge on the same examples
3. Measure Cohen’s Kappa (agreement beyond chance) — aim for >0.6
4. Analyze disagreements to refine the rubric
5. Repeat until judge-human agreement stabilizes
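Step 3 is the only part that needs code. Cohen's Kappa compares observed agreement against the agreement two raters would reach by chance alone, given their label distributions. A minimal implementation for categorical labels:

```python
from collections import Counter

def cohens_kappa(human: list, judge: list) -> float:
    """Cohen's kappa: agreement beyond what chance alone would produce."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    return (observed - expected) / (1 - expected)

human_labels = [1, 2, 3, 4, 5, 3, 2, 4]
judge_labels = [1, 2, 3, 4, 5, 3, 2, 3]   # disagrees on one item
print(round(cohens_kappa(human_labels, judge_labels), 3))  # 0.84
```

For ordinal 1-5 scores, a weighted kappa (which penalizes a 1-vs-5 disagreement more than a 3-vs-4) is often more appropriate; libraries such as scikit-learn provide `cohen_kappa_score` with a `weights` option.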
Measuring Agreement
Don’t use simple correlation (Pearson’s r) — a judge can show perfect correlation while exhibiting systematic bias. Cohen’s Kappa measures true agreement, accounting for chance. Research on 54 LLMs found 23 models achieved “human-like” judgment patterns.
Agreement Benchmarks
// Cohen's Kappa interpretation
< 0.20       Poor agreement
0.21 - 0.40  Fair
0.41 - 0.60  Moderate
0.61 - 0.80  Substantial  ← target
0.81 - 1.00  Almost perfect

// Human-to-human agreement is
// typically 0.60-0.80 on subjective tasks
Key insight: Your LLM judge doesn’t need to be perfect — it needs to be as reliable as a human evaluator. Human-to-human agreement on subjective tasks is typically 0.60–0.80. Match that and you have a useful judge.
Cost Optimization
Getting more signal per dollar
The Cost Landscape
Using GPT-4o as a judge costs roughly $0.005–$0.02 per evaluation (depending on prompt length). Claude Sonnet is similar. Smaller models (GPT-4o-mini, Claude Haiku) cost 10–20x less but with lower reliability on complex judgments.
Tiered Judging
Use a cheap model for easy cases and an expensive model for hard ones:

1. Run a fast/cheap judge on all responses
2. Flag low-confidence scores (near decision boundaries)
3. Re-evaluate flagged items with a stronger judge
4. This cuts costs by 60–80% with minimal accuracy loss
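The routing logic above fits in a few lines. In this sketch, `cheap_judge` and `strong_judge` are hypothetical callables returning 1-5 scores, and the "low-confidence" band is simply scores near the middle of the scale; real systems might use the judge's stated confidence or score variance instead.

```python
def tiered_judge(responses, cheap_judge, strong_judge, low=2.5, high=3.5):
    """Score everything with the cheap judge; re-judge borderline cases."""
    results = []
    for r in responses:
        score = cheap_judge(r)
        if low <= score <= high:          # near the decision boundary
            score = strong_judge(r)       # escalate to the stronger judge
        results.append(score)
    return results

cheap = lambda r: len(r) % 5 + 1          # stand-in scoring for the demo
strong = lambda r: 4
print(tiered_judge(["ok", "maybe", "bad"], cheap, strong))  # [4, 1, 4]
```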
Variance-Adaptive Allocation
Recent research (2026) proposes dynamically allocating judge queries based on estimated score variance. Items with high variance (ambiguous quality) get more judge evaluations; clear-cut items get fewer. This achieves significantly better accuracy under fixed budgets.
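The idea can be illustrated with a simple allocation rule: run a small pilot round, then split the remaining judge-call budget proportionally to each item's observed score variance. This is an illustrative sketch, not the cited paper's exact method.

```python
import statistics

def allocate_budget(pilot_scores: dict, total_budget: int, min_calls: int = 1):
    """Split a fixed judge-call budget proportionally to per-item variance.

    pilot_scores maps item id -> list of scores from a small pilot round.
    """
    variances = {k: statistics.pvariance(v) for k, v in pilot_scores.items()}
    total_var = sum(variances.values()) or 1.0
    extra = total_budget - min_calls * len(pilot_scores)
    return {k: min_calls + round(extra * v / total_var)
            for k, v in variances.items()}

# "a" is clear-cut (zero variance); "b" is ambiguous and gets most of the budget.
pilot = {"a": [4, 4, 4], "b": [1, 5, 3], "c": [2, 4, 3]}
print(allocate_budget(pilot, total_budget=12))  # {'a': 1, 'b': 8, 'c': 3}
```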
Practical tip: For most teams, GPT-4o-mini or Claude Haiku as a first-pass judge, with GPT-4o or Claude Sonnet for flagged items, provides the best cost/accuracy tradeoff. Budget ~$50–$200/month for continuous evaluation of a production system.
When to Use (and Not Use) LLM Judges
Matching the technique to the task
Use LLM-as-Judge For
Helpfulness & relevance: Is the response useful?
Tone & style: Does it match brand guidelines?
Completeness: Did it address all parts of the question?
Safety screening: Is the content appropriate?
Coherence: Does the response make logical sense?
Comparative ranking: Which of two responses is better?
Don’t Use LLM-as-Judge For
Factual accuracy: Use retrieval + verification instead
Mathematical correctness: Use code execution
Code correctness: Run the code against tests
Format compliance: Use regex/schema validation
Latency measurement: Use instrumentation
High-stakes decisions: Use human review
Rule of thumb: If a deterministic check can answer the question, use it. LLM judges are for the subjective, nuanced dimensions that can’t be captured by a regex or a unit test.
Putting It All Together
A practical LLM-as-Judge architecture
The Hybrid Evaluation Stack
// Layer 1: Deterministic checks
Format  → JSON schema validation
Length  → Token count check
Safety  → Keyword + regex filters

// Layer 2: LLM-as-Judge (cheap)
Relevance → Haiku/Mini judge
Tone      → Haiku/Mini judge

// Layer 3: LLM-as-Judge (strong)
Flagged → GPT-4o/Sonnet re-judge

// Layer 4: Human review (sample)
Random 2% → Human calibration check
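The control flow of the stack can be sketched as a single function. Here `schema_ok`, `cheap_judge`, and `strong_judge` are hypothetical callables for layers 1-3; the key property is that deterministic checks fail fast before any LLM cost is incurred, and only borderline scores reach the strong judge.

```python
def evaluate(response: str, schema_ok, cheap_judge, strong_judge, threshold=3):
    """Run the layered stack: deterministic gate first, judges only if needed."""
    # Layer 1: deterministic check -- no LLM cost on failure.
    if not schema_ok(response):
        return {"pass": False, "layer": 1, "reason": "format"}
    # Layer 2: cheap judge on everything.
    score = cheap_judge(response)
    layer = 2
    # Layer 3: escalate scores at the decision boundary.
    if score == threshold:
        score, layer = strong_judge(response), 3
    return {"pass": score > threshold, "layer": layer, "score": score}

result = evaluate('{"answer": 42}',
                  schema_ok=lambda r: r.startswith("{"),
                  cheap_judge=lambda r: 3,        # borderline -> escalated
                  strong_judge=lambda r: 4)
print(result)  # {'pass': True, 'layer': 3, 'score': 4}
```

Layer 4 (the 2% human sample) runs offline on logged results, feeding the weekly calibration checks from the checklist below.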
Implementation Checklist
1. Define your evaluation criteria (3–5 dimensions)
2. Write detailed rubrics with score-level definitions
3. Create a calibration set with human labels (50–100 examples)
4. Test judge agreement (target Cohen’s Kappa > 0.6)
5. Implement tiered judging for cost efficiency
6. Run calibration checks weekly to detect judge drift
Next up: In Chapter 4, we’ll apply these evaluation techniques specifically to RAG systems — measuring faithfulness, context precision, and answer relevancy with the RAGAS framework.