Ch 9 — Evaluation & Benchmarks — Under the Hood

lm-evaluation-harness, MMLU internals, HumanEval pass@k, MT-Bench judge, custom evaluators, and evaluation pipelines

A. lm-evaluation-harness: the standard framework for running benchmarks

1. Install & Run: CLI + YAML config
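
A minimal sketch of running the harness from Python, assuming the v0.4-style `simple_evaluate` API; `gpt2` is just a placeholder model id, and the CLI equivalent is shown in the comment.

```python
# Sketch, assuming lm-eval v0.4+ (pip install lm-eval). CLI equivalent:
#   lm_eval --model hf --model_args pretrained=gpt2 --tasks mmlu --num_fewshot 5
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # Hugging Face transformers backend
    model_args="pretrained=gpt2",  # placeholder model id
    tasks=["mmlu"],
    num_fewshot=5,
)
print(results["results"]["mmlu"])
```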

2. MMLU Internals: 5-shot prompts, log-prob scoring
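
Under the hood, MMLU-style multiple choice is scored by comparing the log-probability the model assigns to each answer letter as a continuation, not by free-form generation. A minimal sketch with `transformers` (`gpt2` is a placeholder; the real harness batches this and handles tokenizer edge cases):

```python
# Sketch of harness-style multiple-choice scoring: score each answer letter by
# its total log-probability as a continuation, predict the argmax.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def continuation_logprob(context: str, continuation: str) -> float:
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    # Assumes tokenizing context+continuation preserves the context token boundary.
    cont_tokens = full_ids[0, ctx_len:]
    positions = range(ctx_len - 1, full_ids.shape[1] - 1)
    return sum(logprobs[i, t].item() for i, t in zip(positions, cont_tokens))

prompt = (
    "The following are multiple choice questions (with answers).\n\n"
    "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"
)
scores = {c: continuation_logprob(prompt, f" {c}") for c in "ABCD"}
print(max(scores, key=scores.get))
```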

B. Execution-Based Evaluation: HumanEval pass@k and functional correctness

3. HumanEval: generate code, execute tests, compute pass@k
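
The pass@k number comes from the unbiased estimator in the HumanEval paper (Chen et al., 2021): generate n samples per problem, count the c that pass all tests, and estimate pass@k = 1 - C(n-c, k) / C(n, k), computed as a numerically stable product:

```python
# Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples generated per problem, c: samples passing all tests, k: budget."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(n=200, c=37, k=10))  # estimate pass@10 from 200 samples
```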

C. LLM-as-Judge Implementation: building your own judge evaluator

4. MT-Bench: GPT-4 judge
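
A sketch of MT-Bench-style single-answer grading, assuming the OpenAI Python SDK. The prompt below paraphrases the idea rather than reproducing the exact MT-Bench template, though the "Rating: [[N]]" verdict format matches what MT-Bench parses.

```python
# Sketch of MT-Bench-style single-answer grading (paraphrased prompt).
import re
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = """Please act as an impartial judge and evaluate the quality of
the response provided by an AI assistant to the user question displayed below.
After your explanation, rate the response on a scale of 1 to 10 by strictly
following this format: "Rating: [[5]]".

[Question]
{question}

[The Start of Assistant's Answer]
{answer}
[The End of Assistant's Answer]"""

def judge_score(question: str, answer: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4",  # MT-Bench's reference judge
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(question=question, answer=answer),
        }],
    )
    m = re.search(r"Rating:\s*\[\[(\d+)\]\]", resp.choices[0].message.content)
    return int(m.group(1)) if m else -1  # -1 flags an unparseable verdict
```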

5. AlpacaEval 2: length-controlled (LC) win rate
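
The length-controlled win rate exists because judges prefer longer answers, so AlpacaEval 2 regresses that bias out. A toy sketch of the idea only, not the official implementation (which fits a richer generalized linear model): fit "model beats baseline" against the length difference, then read off the predicted win rate at zero difference.

```python
# Toy sketch of the length-controlled win-rate idea, NOT the official
# AlpacaEval 2 implementation (which fits a richer GLM).
import numpy as np
from sklearn.linear_model import LogisticRegression

def lc_win_rate(wins: np.ndarray, len_diff: np.ndarray) -> float:
    """wins: 1 if the model beat the baseline on an instruction, else 0.
    len_diff: model response length minus baseline response length."""
    clf = LogisticRegression().fit(len_diff.reshape(-1, 1), wins)
    # Counterfactual win rate with the length advantage removed.
    return float(clf.predict_proba([[0.0]])[0, 1])
```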

6. Custom Judge: your own evaluator
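
A sketch of a custom pairwise judge, again assuming the OpenAI SDK; the prompt wording and model id are assumptions. Judging both orderings and requiring the verdicts to agree is a common control for the judge's position bias.

```python
# Sketch of a custom pairwise judge with a position-bias control: judge both
# orderings and count a win only when the two verdicts agree.
from openai import OpenAI

client = OpenAI()

PAIRWISE_TEMPLATE = """Compare the two responses to the question below.
Reply with exactly one token: "A", "B", or "tie".

Question: {q}

Response A:
{a}

Response B:
{b}"""

def _verdict(q: str, a: str, b: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": PAIRWISE_TEMPLATE.format(q=q, a=a, b=b)}],
    )
    return resp.choices[0].message.content.strip().strip('"')

def pairwise_judge(q: str, ans_a: str, ans_b: str) -> str:
    first = _verdict(q, ans_a, ans_b)   # ans_a shown in position A
    second = _verdict(q, ans_b, ans_a)  # answers swapped
    if first == "A" and second == "B":
        return "A"                      # consistent win for ans_a
    if first == "B" and second == "A":
        return "B"                      # consistent win for ans_b
    return "tie"                        # explicit ties or position-biased verdicts
```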

D. Statistics & Training-Time Evaluation: confidence intervals, TRL callbacks, and monitoring

7. Statistics: statistical significance via bootstrap CIs, paired tests, effect sizes
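
Because both models answer the same questions, a paired bootstrap over examples is the natural significance test: resample items with replacement and look at the distribution of the accuracy difference. A minimal sketch with toy data:

```python
# Paired bootstrap 95% CI for the accuracy difference between two models
# evaluated on the same examples (toy data; plug in real 0/1 correctness arrays).
import numpy as np

rng = np.random.default_rng(0)

def paired_bootstrap_ci(a, b, n_boot=10_000, alpha=0.05):
    a, b = np.asarray(a, float), np.asarray(b, float)
    n = len(a)
    idx = rng.integers(0, n, size=(n_boot, n))  # resample examples, not models
    diffs = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    return np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])

a = rng.integers(0, 2, size=500)                 # toy correctness, model A
b = np.clip(a - (rng.random(500) < 0.05), 0, 1)  # toy: slightly worse model B
lo, hi = paired_bootstrap_ci(a, b)
print(f"95% CI for accuracy difference: [{lo:.3f}, {hi:.3f}]")  # excludes 0 here
```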

8. Eval in Training: TRL callbacks
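
TRL trainers build on the Hugging Face `Trainer`, so they accept the same `TrainerCallback` hooks. A sketch, where `run_my_eval` is a hypothetical stand-in for your own benchmark function:

```python
# Sketch: periodic evaluation during training via a TrainerCallback.
# run_my_eval is hypothetical: substitute your own benchmark function.
from transformers import TrainerCallback

class BenchmarkCallback(TrainerCallback):
    def on_evaluate(self, args, state, control, model=None, **kwargs):
        score = run_my_eval(model)  # hypothetical benchmark call
        print(f"step {state.global_step}: benchmark score = {score:.3f}")

# Attach to any TRL trainer, e.g.:
# trainer = SFTTrainer(..., callbacks=[BenchmarkCallback()])
```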

E. End-to-End Evaluation Pipeline: putting it all together

9. Compare Models: base vs fine-tuned (see the combined sketch after step 10)

10. Full Pipeline: complete script
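
A combined sketch for steps 9 and 10: evaluate a base and a fine-tuned checkpoint on the same tasks, then report deltas. It assumes the v0.4-style `simple_evaluate` API; the metric key `"acc,none"` and the model paths are placeholders to adjust.

```python
# Combined sketch for steps 9-10: base vs fine-tuned on the same task set.
import lm_eval

TASKS = ["mmlu", "hellaswag"]

def run(pretrained: str) -> dict:
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={pretrained}",
        tasks=TASKS,
        num_fewshot=5,
    )
    return out["results"]

base = run("gpt2")                  # placeholder base model
tuned = run("./checkpoints/final")  # placeholder fine-tuned checkpoint

for task in TASKS:
    b, t = base[task]["acc,none"], tuned[task]["acc,none"]
    print(f"{task}: base={b:.3f}  fine-tuned={t:.3f}  delta={t - b:+.3f}")
```

In a real pipeline you would also keep the per-example correctness arrays so the step-7 bootstrap can test whether the delta is significant.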