Ch 16 — Evaluation for Multimodal

Benchmarks, metrics, human evaluation, and building eval pipelines for multimodal systems
Why Multimodal Eval Is Hard
Unique challenges in evaluating systems that see, hear, and generate
The Challenge
Multimodal evaluation is harder than text evaluation because:

Subjectivity: “Is this generated image good?” has no single right answer
Multiple dimensions: Accuracy, relevance, visual quality, safety, bias — all matter simultaneously
Cross-modal alignment: Does the text description match the image? How do you measure “match”?
Hallucination detection: Verifying visual claims requires looking at the image, not just reading text
Scale: Human evaluation of images and video is several times more expensive than evaluation of text
What to Evaluate
Understanding: Can the model correctly describe images? Can it answer questions about visual content? Does it understand spatial relationships?
Generation: Is the generated image or video high quality? Does it match the text prompt? Is it free of artifacts and distortions?
Safety: Does it refuse harmful requests? Does it hallucinate visual content? Is it biased across demographics?
Practical: Latency, cost, throughput, reliability
Key insight: The biggest mistake in multimodal eval is only measuring accuracy. A model that’s 95% accurate but hallucinates confidently on the other 5% is more dangerous than one that’s 90% accurate but says “I’m not sure” when uncertain.
Metrics for Understanding
Measuring how well models understand visual content
VLM Understanding Metrics
Accuracy: Correct answers on visual Q&A (multiple choice or exact match)
BLEU/ROUGE/CIDEr: Text similarity between generated captions and reference captions. Useful for quick comparisons, but they correlate only loosely with human judgments of caption quality.
CLIPScore: CLIP similarity between generated text and image. Measures text-image alignment without reference.
Hallucination rate: % of responses containing objects/facts not present in the image
Spatial accuracy: Correct identification of positions, sizes, and relationships
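The first of these, exact-match accuracy on visual Q&A, is easy to compute once answers are normalized. The sketch below is a simplified version: official benchmarks (e.g. the VQA metric) additionally average over multiple reference answers per question, and the function names here are illustrative.

```python
import re
import string

def normalize_answer(answer: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparing."""
    answer = answer.lower().strip()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", answer)

def exact_match_accuracy(predictions, references) -> float:
    """Fraction of predictions that match their reference after normalization."""
    pairs = list(zip(predictions, references))
    hits = sum(normalize_answer(p) == normalize_answer(r) for p, r in pairs)
    return hits / len(pairs)
```

For multiple-choice questions the same idea applies to the option letters; free-form answers usually need the looser reference-based matching described above.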
Metrics for Generation
FID (Fréchet Inception Distance): Measures quality and diversity of generated images vs. real images. Lower is better.
CLIPScore: How well does the generated image match the text prompt?
Aesthetic score: Predicted human preference for visual quality
IS (Inception Score): Quality and diversity of generated images
Human preference: Side-by-side comparison rated by humans (gold standard)
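FID compares the Gaussian statistics (mean and covariance) of real versus generated image features; in practice those features come from an Inception-v3 network, which this sketch omits. A minimal numpy version of the Fréchet distance itself, assuming you already have two feature arrays, using the symmetric-PSD trick to avoid a general matrix square root:

```python
import numpy as np

def _sqrtm_psd(mat: np.ndarray) -> np.ndarray:
    """Matrix square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # clamp tiny negative eigenvalues from rounding
    return (vecs * np.sqrt(vals)) @ vecs.T

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Frechet distance between Gaussian fits of two (n, d) feature arrays.

    FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * sqrtm(S1 @ S2))
    """
    mu1, mu2 = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    s1 = np.cov(real_feats, rowvar=False)
    s2 = np.cov(gen_feats, rowvar=False)
    s1_half = _sqrtm_psd(s1)
    # Tr(sqrtm(S1 @ S2)) equals Tr(sqrtm(S1^1/2 @ S2 @ S1^1/2)),
    # and the latter matrix is symmetric PSD, so _sqrtm_psd applies.
    covmean = _sqrtm_psd(s1_half @ s2 @ s1_half)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))
```

Identical feature sets give a distance near zero; shifting the generated features' mean raises it by the squared shift, which is a useful sanity check when wiring up a real FID pipeline.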
Key insight: Automated metrics (FID, CLIPScore) are useful for rapid iteration but don’t fully capture human preferences. Always validate with human evaluation for high-stakes decisions. Use automated metrics for daily monitoring, human eval for milestone decisions.
Key Benchmarks
Standard benchmarks for multimodal model comparison
VLM Benchmarks
MMMU: College-level visual reasoning; 30 subjects, 11.5K questions
MMBench: Comprehensive VLM evaluation; 3,000 questions covering 20 abilities
DocVQA: Document understanding; 50K questions on real documents
ChartQA: Chart and graph understanding
MathVista: Math reasoning with visual inputs
RealWorldQA: Real-world spatial understanding
POPE: Object hallucination detection
MM-Vet: Integrated visual understanding
Generation Benchmarks
GenEval: Compositional text-to-image generation (objects, attributes, relations)
T2I-CompBench: Compositional generation with attribute binding, spatial relations
DrawBench: Google’s text-to-image benchmark with complex prompts
PartiPrompts: 1,600 prompts testing various generation capabilities
Chatbot Arena (Vision): Human preference ranking via blind comparisons
Key insight: Chatbot Arena (Vision) is the most reliable benchmark because it uses real users making blind comparisons. Public benchmarks are useful for initial screening but can be gamed. Always build a custom eval set for your specific use case.
Human Evaluation
The gold standard for multimodal quality
Human Eval Approaches
Side-by-side comparison: Show outputs from two models, ask which is better. Most reliable for ranking models.
Likert scale rating: Rate quality on 1–5 scale across dimensions (accuracy, relevance, completeness). Good for tracking improvement.
Error annotation: Mark specific errors (hallucination, wrong object, spatial error). Best for understanding failure modes.
Task completion: Can a human complete a downstream task using the model’s output? Most realistic.
Practical Setup
Sample size: 100–500 examples
Annotators: 3 per example (majority vote)
Cost: $0.50–2.00 per judgment
Total: $150–3,000 per eval round
Dimensions to rate:
Accuracy: Is the visual description correct?
Completeness: Are all important details covered?
Hallucination: Any fabricated visual claims?
Relevance: Does the response address the query?
Safety: Any harmful or biased content?
Key insight: Human evaluation for multimodal is 3–5x more expensive than for text because annotators must carefully examine images/videos. Invest in clear annotation guidelines and inter-annotator agreement metrics to ensure quality.
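The majority vote and inter-annotator agreement mentioned above are both a few lines of code. This sketch uses Cohen's kappa, which covers a pair of annotators; Fleiss' kappa generalizes to three or more, and the function names here are illustrative.

```python
from collections import Counter

def majority_vote(labels):
    """Aggregate one example's annotator labels (e.g. 3 per example) by majority."""
    return Counter(labels).most_common(1)[0][0]

def cohens_kappa(rater_a, rater_b) -> float:
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    # Expected agreement if both annotators labeled at random
    # with their observed label frequencies.
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    return 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)
```

A kappa near zero means your annotators agree no better than chance, which is a signal to tighten the annotation guidelines before trusting the ratings.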
Hallucination Detection
Finding and measuring visual hallucinations
Types of Visual Hallucination
Object hallucination: “There is a cat on the table” when there’s no cat
Attribute hallucination: “The red car” when the car is blue
Relation hallucination: “The cup is on the left” when it’s on the right
Count hallucination: “Three people” when there are five
Text hallucination: Misreading text in images (OCR errors presented as facts)
Fabrication: Inventing entire scenes or details not present
Detection Methods
POPE benchmark: Yes/no questions about object presence. Simple but effective for object hallucination.
CHAIR metric: Caption Hallucination Assessment with Image Relevance. Measures % of caption objects not in the image.
VLM-as-judge: Use a stronger VLM to verify claims made by the model under test
Grounding verification: Ask model for bounding boxes, verify they contain the claimed objects
Consistency checks: Ask the same question multiple ways — inconsistent answers suggest hallucination
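Once objects have been extracted from a caption, the per-instance CHAIR score reduces to a set comparison. This is a simplified sketch: the original metric also maps caption words to COCO categories through a synonym list, which is omitted here.

```python
def chair_i(caption_objects, image_objects) -> float:
    """Per-instance CHAIR: fraction of mentioned objects absent from the image."""
    mentioned = set(caption_objects)
    if not mentioned:
        return 0.0  # nothing mentioned, nothing hallucinated
    hallucinated = mentioned - set(image_objects)
    return len(hallucinated) / len(mentioned)
```

Averaging this over a caption set gives a cheap hallucination-rate signal you can track on every model change.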
Key insight: Visual hallucination is the #1 reliability risk in VLM applications. Build hallucination detection into your eval pipeline from day one. The POPE benchmark + VLM-as-judge combination catches most hallucinations at low cost.
Building an Eval Pipeline
Automated evaluation for continuous improvement
Pipeline Architecture
1. Eval dataset: 50–500 curated examples from your domain (images + questions + expected answers), tagged by difficulty, category, and edge case
2. Automated metrics: Accuracy, CLIPScore, hallucination rate; run on every model change (CI/CD)
3. VLM-as-judge: GPT-4o or Gemini rates quality 1–5; cheaper than human eval, roughly 80% correlation with humans
4. Human eval (periodic): Monthly deep eval on 100–200 examples to calibrate the automated metrics
5. Regression tests: Known failure cases that must pass; this set grows over time as you find bugs
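A minimal runner for the automated-metrics stage of such a pipeline might look like the sketch below. Note that `model_fn`, its return shape, and the eval-example fields are hypothetical interfaces chosen for illustration, not a real API.

```python
def run_eval(model_fn, eval_set):
    """Compute accuracy and hallucination rate over a curated eval set.

    model_fn(image, question) is assumed to return
    {"answer": str, "mentioned_objects": list[str]}; each eval example
    carries the gold answer plus the objects actually in the image.
    """
    correct = hallucinated = 0
    for example in eval_set:
        output = model_fn(example["image"], example["question"])
        if output["answer"].strip().lower() == example["answer"].strip().lower():
            correct += 1
        extra = set(output["mentioned_objects"]) - set(example["image_objects"])
        if extra:  # model mentioned objects that are not in the image
            hallucinated += 1
    n = len(eval_set)
    return {"accuracy": correct / n, "hallucination_rate": hallucinated / n}
```

Wiring this into CI means every model change produces a comparable metrics dict, which is what makes regression tracking possible.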
Best Practices
Start with 50 examples: A small, high-quality eval set beats a large noisy one
Include edge cases: Low-light photos, handwritten text, complex charts, ambiguous images
Version your eval set: Track changes to eval data alongside model changes
Measure what matters: Align metrics with actual user satisfaction
Automate everything: Eval should run automatically on every deployment
Track trends: Dashboard showing metrics over time, not just point-in-time
Key insight: The eval pipeline is your most important infrastructure investment. Teams that build robust eval pipelines iterate 5x faster because they can confidently measure the impact of every change.
Production Monitoring
Continuous evaluation in production
What to Monitor
Quality metrics: Sample 1–5% of production outputs for automated quality scoring
Hallucination rate: Track hallucination detection scores over time
User feedback: Thumbs up/down, explicit corrections, task completion rate
Latency: P50, P95, P99 by input type (image size, resolution mode)
Cost: Tokens per request, daily spend, cost per successful task
Error patterns: Cluster failures to identify systematic issues
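Sampling 1–5% of production outputs for scoring is simplest to do deterministically, so a given request always gets the same decision and can be re-scored or audited later. A sketch, where the request-ID format is an assumption:

```python
import hashlib

def sampled_for_eval(request_id: str, rate: float = 0.02) -> bool:
    """Deterministically decide whether to quality-score this request.

    Hashing the request ID (rather than calling random()) means the same
    request always yields the same decision across services and re-runs.
    """
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    return int(digest, 16) % 10_000 < rate * 10_000
```

Because the decision is a pure function of the ID, you can replay yesterday's traffic and recover exactly the sampled subset.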
Alerting Rules
Quality drop: Alert if average quality score drops >10% over 24 hours
Hallucination spike: Alert if hallucination rate exceeds baseline by 2x
Latency degradation: Alert if P95 latency exceeds SLA
Cost anomaly: Alert if daily cost exceeds budget by >20%
New failure mode: Alert when a new cluster of similar failures appears
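The first four rules above are plain threshold checks against a baseline; only failure-mode clustering needs heavier machinery. A sketch, with the SLA and budget defaults purely illustrative and the 24-hour windowing of metrics assumed to happen upstream:

```python
def check_alerts(current, baseline, sla_p95_ms=2000.0, daily_budget=100.0):
    """Evaluate the threshold-based alerting rules against current metrics."""
    alerts = []
    if current["quality"] < baseline["quality"] * 0.9:
        alerts.append("quality dropped more than 10% from baseline")
    if current["hallucination_rate"] > baseline["hallucination_rate"] * 2:
        alerts.append("hallucination rate exceeds 2x baseline")
    if current["p95_latency_ms"] > sla_p95_ms:
        alerts.append("P95 latency over SLA")
    if current["daily_cost"] > daily_budget * 1.2:
        alerts.append("daily cost more than 20% over budget")
    return alerts
```

Returning a list of triggered rules (rather than a single boolean) makes it easy to route each alert to a different channel or severity.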
Pro tip: Build a “data flywheel”: log production inputs/outputs → identify failures → add to eval set → fix model → verify fix → deploy. This continuous improvement loop is how the best multimodal systems get better over time.
Key Takeaways
Evaluating multimodal AI systems
Essential Concepts
1. Multi-dimensional: Evaluate accuracy, hallucination, safety, bias, latency, and cost simultaneously

2. Automated + human: Use automated metrics for daily monitoring, human eval for milestone decisions

3. Hallucination is #1 risk: Build hallucination detection (POPE, VLM-as-judge) into your pipeline from day one

4. Custom eval > benchmarks: 50 domain-specific examples beat any public benchmark

5. Data flywheel: Production failures become eval examples become model improvements
Quick Start
Day 1: Create 50 eval examples from your domain
Week 1: Add automated accuracy + hallucination metrics
Week 2: Set up VLM-as-judge for quality scoring
Month 1: First human eval round, calibrate automated metrics
Ongoing: Grow eval set from production failures, monitor trends
Next up: Chapter 17 looks ahead to the future of multimodal AI — where the technology is heading, what breakthroughs to expect, and how to prepare for what’s coming.