Ch 9 — Evaluation & Benchmarks — Under the Hood

lm-evaluation-harness, MMLU internals, HumanEval pass@k, MT-Bench judge, custom evaluators, and evaluation pipelines

A. lm-evaluation-harness: the standard framework for running benchmarks

1. Install & Run: CLI + YAML config
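
A minimal sketch of running the harness from Python, assuming the v0.4-style `simple_evaluate` API; `gpt2` is just a placeholder model id, and the CLI equivalent is shown in the comment.

```python
# Sketch, assuming lm-eval v0.4+ (pip install lm-eval). CLI equivalent:
#   lm_eval --model hf --model_args pretrained=gpt2 --tasks mmlu --num_fewshot 5
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # Hugging Face transformers backend
    model_args="pretrained=gpt2",  # placeholder model id
    tasks=["mmlu"],
    num_fewshot=5,
)
print(results["results"]["mmlu"])
```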

2. MMLU Internals: 5-shot prompts, log-prob scoring
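
Under the hood, MMLU-style multiple choice is scored by comparing the log-probability the model assigns to each answer letter as a continuation, not by free-form generation. A minimal sketch with `transformers` (`gpt2` is a placeholder; the real harness batches this and handles tokenizer edge cases):

```python
# Sketch of harness-style multiple-choice scoring: score each answer letter by
# its total log-probability as a continuation, predict the argmax.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def continuation_logprob(context: str, continuation: str) -> float:
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    # Assumes tokenizing context+continuation preserves the context token boundary.
    cont_tokens = full_ids[0, ctx_len:]
    positions = range(ctx_len - 1, full_ids.shape[1] - 1)
    return sum(logprobs[i, t].item() for i, t in zip(positions, cont_tokens))

prompt = (
    "The following are multiple choice questions (with answers).\n\n"
    "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"
)
scores = {c: continuation_logprob(prompt, f" {c}") for c in "ABCD"}
print(max(scores, key=scores.get))
```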

B. Execution-Based Evaluation: HumanEval pass@k and functional correctness

3. HumanEval: generate code, execute tests, compute pass@k
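
The pass@k number comes from the unbiased estimator in the HumanEval paper (Chen et al., 2021): generate n samples per problem, count the c that pass all tests, and estimate pass@k = 1 - C(n-c, k) / C(n, k), computed as a numerically stable product:

```python
# Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples generated per problem, c: samples passing all tests, k: budget."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(n=200, c=37, k=10))  # estimate pass@10 from 200 samples
```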

C. LLM-as-Judge Implementation: building your own judge evaluator

4. MT-Bench: GPT-4 judge
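
A sketch of MT-Bench-style single-answer grading, assuming the OpenAI Python SDK. The prompt below paraphrases the idea rather than reproducing the exact MT-Bench template, though the "Rating: [[N]]" verdict format matches what MT-Bench parses.

```python
# Sketch of MT-Bench-style single-answer grading (paraphrased prompt).
import re
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = """Please act as an impartial judge and evaluate the quality of
the response provided by an AI assistant to the user question displayed below.
After your explanation, rate the response on a scale of 1 to 10 by strictly
following this format: "Rating: [[5]]".

[Question]
{question}

[The Start of Assistant's Answer]
{answer}
[The End of Assistant's Answer]"""

def judge_score(question: str, answer: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4",  # MT-Bench's reference judge
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(question=question, answer=answer),
        }],
    )
    m = re.search(r"Rating:\s*\[\[(\d+)\]\]", resp.choices[0].message.content)
    return int(m.group(1)) if m else -1  # -1 flags an unparseable verdict
```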

5. AlpacaEval 2: length-controlled (LC) win rate
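
The length-controlled win rate exists because judges prefer longer answers, so AlpacaEval 2 regresses that bias out. A toy sketch of the idea only, not the official implementation (which fits a richer generalized linear model): fit "model beats baseline" against the length difference, then read off the predicted win rate at zero difference.

```python
# Toy sketch of the length-controlled win-rate idea, NOT the official
# AlpacaEval 2 implementation (which fits a richer GLM).
import numpy as np
from sklearn.linear_model import LogisticRegression

def lc_win_rate(wins: np.ndarray, len_diff: np.ndarray) -> float:
    """wins: 1 if the model beat the baseline on an instruction, else 0.
    len_diff: model response length minus baseline response length."""
    clf = LogisticRegression().fit(len_diff.reshape(-1, 1), wins)
    # Counterfactual win rate with the length advantage removed.
    return float(clf.predict_proba([[0.0]])[0, 1])
```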

6. Custom Judge: your own evaluator
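
A sketch of a custom pairwise judge, again assuming the OpenAI SDK; the prompt wording and model id are assumptions. Judging both orderings and requiring the verdicts to agree is a common control for the judge's position bias.

```python
# Sketch of a custom pairwise judge with a position-bias control: judge both
# orderings and count a win only when the two verdicts agree.
from openai import OpenAI

client = OpenAI()

PAIRWISE_TEMPLATE = """Compare the two responses to the question below.
Reply with exactly one token: "A", "B", or "tie".

Question: {q}

Response A:
{a}

Response B:
{b}"""

def _verdict(q: str, a: str, b: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": PAIRWISE_TEMPLATE.format(q=q, a=a, b=b)}],
    )
    return resp.choices[0].message.content.strip().strip('"')

def pairwise_judge(q: str, ans_a: str, ans_b: str) -> str:
    first = _verdict(q, ans_a, ans_b)   # ans_a shown in position A
    second = _verdict(q, ans_b, ans_a)  # answers swapped
    if first == "A" and second == "B":
        return "A"                      # consistent win for ans_a
    if first == "B" and second == "A":
        return "B"                      # consistent win for ans_b
    return "tie"                        # explicit ties or position-biased verdicts
```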

D. Statistics & Training-Time Evaluation: confidence intervals, TRL callbacks, and monitoring

7. Statistics: statistical significance via bootstrap CIs, paired tests, effect sizes
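
Because both models answer the same questions, a paired bootstrap over examples is the natural significance test: resample items with replacement and look at the distribution of the accuracy difference. A minimal sketch with toy data:

```python
# Paired bootstrap 95% CI for the accuracy difference between two models
# evaluated on the same examples (toy data; plug in real 0/1 correctness arrays).
import numpy as np

rng = np.random.default_rng(0)

def paired_bootstrap_ci(a, b, n_boot=10_000, alpha=0.05):
    a, b = np.asarray(a, float), np.asarray(b, float)
    n = len(a)
    idx = rng.integers(0, n, size=(n_boot, n))  # resample examples, not models
    diffs = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    return np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])

a = rng.integers(0, 2, size=500)                 # toy correctness, model A
b = np.clip(a - (rng.random(500) < 0.05), 0, 1)  # toy: slightly worse model B
lo, hi = paired_bootstrap_ci(a, b)
print(f"95% CI for accuracy difference: [{lo:.3f}, {hi:.3f}]")  # excludes 0 here
```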

8. Eval in Training: TRL callbacks
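
TRL trainers build on the Hugging Face `Trainer`, so they accept the same `TrainerCallback` hooks. A sketch, where `run_my_eval` is a hypothetical stand-in for your own benchmark function:

```python
# Sketch: periodic evaluation during training via a TrainerCallback.
# run_my_eval is hypothetical: substitute your own benchmark function.
from transformers import TrainerCallback

class BenchmarkCallback(TrainerCallback):
    def on_evaluate(self, args, state, control, model=None, **kwargs):
        score = run_my_eval(model)  # hypothetical benchmark call
        print(f"step {state.global_step}: benchmark score = {score:.3f}")

# Attach to any TRL trainer, e.g.:
# trainer = SFTTrainer(..., callbacks=[BenchmarkCallback()])
```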

E. End-to-End Evaluation Pipeline: putting it all together

9. Compare Models: base vs fine-tuned (see the combined sketch after step 10)

10. Full Pipeline: complete script
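
A combined sketch for steps 9 and 10: evaluate a base and a fine-tuned checkpoint on the same tasks, then report deltas. It assumes the v0.4-style `simple_evaluate` API; the metric key `"acc,none"` and the model paths are placeholders to adjust.

```python
# Combined sketch for steps 9-10: base vs fine-tuned on the same task set.
import lm_eval

TASKS = ["mmlu", "hellaswag"]

def run(pretrained: str) -> dict:
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={pretrained}",
        tasks=TASKS,
        num_fewshot=5,
    )
    return out["results"]

base = run("gpt2")                  # placeholder base model
tuned = run("./checkpoints/final")  # placeholder fine-tuned checkpoint

for task in TASKS:
    b, t = base[task]["acc,none"], tuned[task]["acc,none"]
    print(f"{task}: base={b:.3f}  fine-tuned={t:.3f}  delta={t - b:+.3f}")
```

In a real pipeline you would also keep the per-example correctness arrays so the step-7 bootstrap can test whether the delta is significant.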