Ch 16 — Evaluation for Multimodal

Benchmarks, metrics, human evaluation, and building eval pipelines for multimodal systems
Why Multimodal Eval Is Hard
Unique challenges in evaluating systems that see, hear, and generate
The Challenge
Multimodal evaluation is harder than text evaluation because:

Subjectivity: “Is this generated image good?” has no single right answer
Multiple dimensions: Accuracy, relevance, visual quality, safety, bias — all matter simultaneously
Cross-modal alignment: Does the text description match the image? How do you measure “match”?
Hallucination detection: Verifying visual claims requires looking at the image, not just reading text
Scale: Human evaluation of images and video is several times more expensive than evaluation of text
What to Evaluate
Understanding: Can the model correctly describe images? Can it answer questions about visual content? Does it understand spatial relationships?
Generation: Is the generated image or video high quality? Does it match the text prompt? Is it free of artifacts and distortions?
Safety: Does it refuse harmful requests? Does it hallucinate visual content? Is it biased across demographics?
Practical: Latency, cost, throughput, reliability
Key insight: The biggest mistake in multimodal eval is only measuring accuracy. A model that’s 95% accurate but hallucinates confidently on the other 5% is more dangerous than one that’s 90% accurate but says “I’m not sure” when uncertain.
Metrics for Understanding
Measuring how well models understand visual content
VLM Understanding Metrics
Accuracy: Correct answers on visual Q&A (multiple choice or exact match)
BLEU/ROUGE/CIDEr: Text similarity between generated captions and reference captions. Useful for quick comparisons, but they correlate only loosely with human judgments of caption quality.
CLIPScore: CLIP similarity between generated text and image. Measures text-image alignment without reference.
Hallucination rate: % of responses containing objects/facts not present in the image
Spatial accuracy: Correct identification of positions, sizes, and relationships
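The first of these, exact-match accuracy on visual Q&A, is easy to compute once answers are normalized. The sketch below is a simplified version: official benchmarks (e.g. the VQA metric) additionally average over multiple reference answers per question, and the function names here are illustrative.

```python
import re
import string

def normalize_answer(answer: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparing."""
    answer = answer.lower().strip()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", answer)

def exact_match_accuracy(predictions, references) -> float:
    """Fraction of predictions that match their reference after normalization."""
    pairs = list(zip(predictions, references))
    hits = sum(normalize_answer(p) == normalize_answer(r) for p, r in pairs)
    return hits / len(pairs)
```

For multiple-choice questions the same idea applies to the option letters; free-form answers usually need the looser reference-based matching described above.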
Metrics for Generation
FID (Fréchet Inception Distance): Measures quality and diversity of generated images vs. real images. Lower is better.
CLIPScore: How well does the generated image match the text prompt?
Aesthetic score: Predicted human preference for visual quality
IS (Inception Score): Quality and diversity of generated images
Human preference: Side-by-side comparison rated by humans (gold standard)
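FID compares the Gaussian statistics (mean and covariance) of real versus generated image features; in practice those features come from an Inception-v3 network, which this sketch omits. A minimal numpy version of the Fréchet distance itself, assuming you already have two feature arrays, using the symmetric-PSD trick to avoid a general matrix square root:

```python
import numpy as np

def _sqrtm_psd(mat: np.ndarray) -> np.ndarray:
    """Matrix square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # clamp tiny negative eigenvalues from rounding
    return (vecs * np.sqrt(vals)) @ vecs.T

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Frechet distance between Gaussian fits of two (n, d) feature arrays.

    FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * sqrtm(S1 @ S2))
    """
    mu1, mu2 = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    s1 = np.cov(real_feats, rowvar=False)
    s2 = np.cov(gen_feats, rowvar=False)
    s1_half = _sqrtm_psd(s1)
    # Tr(sqrtm(S1 @ S2)) equals Tr(sqrtm(S1^1/2 @ S2 @ S1^1/2)),
    # and the latter matrix is symmetric PSD, so _sqrtm_psd applies.
    covmean = _sqrtm_psd(s1_half @ s2 @ s1_half)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))
```

Identical feature sets give a distance near zero; shifting the generated features' mean raises it by the squared shift, which is a useful sanity check when wiring up a real FID pipeline.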
Key insight: Automated metrics (FID, CLIPScore) are useful for rapid iteration but don’t fully capture human preferences. Always validate with human evaluation for high-stakes decisions. Use automated metrics for daily monitoring, human eval for milestone decisions.
Key Benchmarks
Standard benchmarks for multimodal model comparison
VLM Benchmarks
MMMU: College-level visual reasoning; 30 subjects, 11.5K questions
MMBench: Comprehensive VLM evaluation; 3,000 questions covering 20 abilities
DocVQA: Document understanding; 50K questions on real documents
ChartQA: Chart and graph understanding
MathVista: Math reasoning with visual inputs
RealWorldQA: Real-world spatial understanding
POPE: Object hallucination detection
MM-Vet: Integrated visual understanding
Generation Benchmarks
GenEval: Compositional text-to-image generation (objects, attributes, relations)
T2I-CompBench: Compositional generation with attribute binding, spatial relations
DrawBench: Google’s text-to-image benchmark with complex prompts
PartiPrompts: 1,600 prompts testing various generation capabilities
Chatbot Arena (Vision): Human preference ranking via blind comparisons
Key insight: Chatbot Arena (Vision) is the most reliable benchmark because it uses real users making blind comparisons. Public benchmarks are useful for initial screening but can be gamed. Always build a custom eval set for your specific use case.
Human Evaluation
The gold standard for multimodal quality
Human Eval Approaches
Side-by-side comparison: Show outputs from two models, ask which is better. Most reliable for ranking models.
Likert scale rating: Rate quality on 1–5 scale across dimensions (accuracy, relevance, completeness). Good for tracking improvement.
Error annotation: Mark specific errors (hallucination, wrong object, spatial error). Best for understanding failure modes.
Task completion: Can a human complete a downstream task using the model’s output? Most realistic.
Practical Setup
Sample size: 100–500 examples
Annotators: 3 per example (majority vote)
Cost: $0.50–2.00 per judgment
Total: $150–3,000 per eval round
Dimensions to rate:
Accuracy: Is the visual description correct?
Completeness: Are all important details covered?
Hallucination: Any fabricated visual claims?
Relevance: Does the response address the query?
Safety: Any harmful or biased content?
Key insight: Human evaluation for multimodal is 3–5x more expensive than for text because annotators must carefully examine images/videos. Invest in clear annotation guidelines and inter-annotator agreement metrics to ensure quality.
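The majority vote and inter-annotator agreement mentioned above are both a few lines of code. This sketch uses Cohen's kappa, which covers a pair of annotators; Fleiss' kappa generalizes to three or more, and the function names here are illustrative.

```python
from collections import Counter

def majority_vote(labels):
    """Aggregate one example's annotator labels (e.g. 3 per example) by majority."""
    return Counter(labels).most_common(1)[0][0]

def cohens_kappa(rater_a, rater_b) -> float:
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    # Expected agreement if both annotators labeled at random
    # with their observed label frequencies.
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    return 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)
```

A kappa near zero means your annotators agree no better than chance, which is a signal to tighten the annotation guidelines before trusting the ratings.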
Hallucination Detection
Finding and measuring visual hallucinations
Types of Visual Hallucination
Object hallucination: “There is a cat on the table” when there’s no cat
Attribute hallucination: “The red car” when the car is blue
Relation hallucination: “The cup is on the left” when it’s on the right
Count hallucination: “Three people” when there are five
Text hallucination: Misreading text in images (OCR errors presented as facts)
Fabrication: Inventing entire scenes or details not present
Detection Methods
POPE benchmark: Yes/no questions about object presence. Simple but effective for object hallucination.
CHAIR metric: Caption Hallucination Assessment with Image Relevance. Measures % of caption objects not in the image.
VLM-as-judge: Use a stronger VLM to verify claims made by the model under test
Grounding verification: Ask model for bounding boxes, verify they contain the claimed objects
Consistency checks: Ask the same question multiple ways — inconsistent answers suggest hallucination
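Once objects have been extracted from a caption, the per-instance CHAIR score reduces to a set comparison. This is a simplified sketch: the original metric also maps caption words to COCO categories through a synonym list, which is omitted here.

```python
def chair_i(caption_objects, image_objects) -> float:
    """Per-instance CHAIR: fraction of mentioned objects absent from the image."""
    mentioned = set(caption_objects)
    if not mentioned:
        return 0.0  # nothing mentioned, nothing hallucinated
    hallucinated = mentioned - set(image_objects)
    return len(hallucinated) / len(mentioned)
```

Averaging this over a caption set gives a cheap hallucination-rate signal you can track on every model change.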
Key insight: Visual hallucination is the #1 reliability risk in VLM applications. Build hallucination detection into your eval pipeline from day one. The POPE benchmark + VLM-as-judge combination catches most hallucinations at low cost.
Building an Eval Pipeline
Automated evaluation for continuous improvement
Pipeline Architecture
1. Eval dataset: 50–500 curated examples from your domain (images + questions + expected answers), tagged by difficulty, category, and edge case
2. Automated metrics: Accuracy, CLIPScore, hallucination rate; run on every model change (CI/CD)
3. VLM-as-judge: GPT-4o or Gemini rates quality 1–5; cheaper than human eval, roughly 80% correlation with humans
4. Human eval (periodic): Monthly deep eval on 100–200 examples to calibrate the automated metrics
5. Regression tests: Known failure cases that must pass; this set grows over time as you find bugs
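A minimal runner for the automated-metrics stage of such a pipeline might look like the sketch below. Note that `model_fn`, its return shape, and the eval-example fields are hypothetical interfaces chosen for illustration, not a real API.

```python
def run_eval(model_fn, eval_set):
    """Compute accuracy and hallucination rate over a curated eval set.

    model_fn(image, question) is assumed to return
    {"answer": str, "mentioned_objects": list[str]}; each eval example
    carries the gold answer plus the objects actually in the image.
    """
    correct = hallucinated = 0
    for example in eval_set:
        output = model_fn(example["image"], example["question"])
        if output["answer"].strip().lower() == example["answer"].strip().lower():
            correct += 1
        extra = set(output["mentioned_objects"]) - set(example["image_objects"])
        if extra:  # model mentioned objects that are not in the image
            hallucinated += 1
    n = len(eval_set)
    return {"accuracy": correct / n, "hallucination_rate": hallucinated / n}
```

Wiring this into CI means every model change produces a comparable metrics dict, which is what makes regression tracking possible.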
Best Practices
Start with 50 examples: A small, high-quality eval set beats a large noisy one
Include edge cases: Low-light photos, handwritten text, complex charts, ambiguous images
Version your eval set: Track changes to eval data alongside model changes
Measure what matters: Align metrics with actual user satisfaction
Automate everything: Eval should run automatically on every deployment
Track trends: Dashboard showing metrics over time, not just point-in-time
Key insight: The eval pipeline is your most important infrastructure investment. Teams that build robust eval pipelines iterate 5x faster because they can confidently measure the impact of every change.
Production Monitoring
Continuous evaluation in production
What to Monitor
Quality metrics: Sample 1–5% of production outputs for automated quality scoring
Hallucination rate: Track hallucination detection scores over time
User feedback: Thumbs up/down, explicit corrections, task completion rate
Latency: P50, P95, P99 by input type (image size, resolution mode)
Cost: Tokens per request, daily spend, cost per successful task
Error patterns: Cluster failures to identify systematic issues
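Sampling 1–5% of production outputs for scoring is simplest to do deterministically, so a given request always gets the same decision and can be re-scored or audited later. A sketch, where the request-ID format is an assumption:

```python
import hashlib

def sampled_for_eval(request_id: str, rate: float = 0.02) -> bool:
    """Deterministically decide whether to quality-score this request.

    Hashing the request ID (rather than calling random()) means the same
    request always yields the same decision across services and re-runs.
    """
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    return int(digest, 16) % 10_000 < rate * 10_000
```

Because the decision is a pure function of the ID, you can replay yesterday's traffic and recover exactly the sampled subset.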
Alerting Rules
Quality drop: Alert if average quality score drops >10% over 24 hours
Hallucination spike: Alert if hallucination rate exceeds baseline by 2x
Latency degradation: Alert if P95 latency exceeds SLA
Cost anomaly: Alert if daily cost exceeds budget by >20%
New failure mode: Alert when a new cluster of similar failures appears
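The first four rules above are plain threshold checks against a baseline; only failure-mode clustering needs heavier machinery. A sketch, with the SLA and budget defaults purely illustrative and the 24-hour windowing of metrics assumed to happen upstream:

```python
def check_alerts(current, baseline, sla_p95_ms=2000.0, daily_budget=100.0):
    """Evaluate the threshold-based alerting rules against current metrics."""
    alerts = []
    if current["quality"] < baseline["quality"] * 0.9:
        alerts.append("quality dropped more than 10% from baseline")
    if current["hallucination_rate"] > baseline["hallucination_rate"] * 2:
        alerts.append("hallucination rate exceeds 2x baseline")
    if current["p95_latency_ms"] > sla_p95_ms:
        alerts.append("P95 latency over SLA")
    if current["daily_cost"] > daily_budget * 1.2:
        alerts.append("daily cost more than 20% over budget")
    return alerts
```

Returning a list of triggered rules (rather than a single boolean) makes it easy to route each alert to a different channel or severity.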
Pro tip: Build a “data flywheel”: log production inputs/outputs → identify failures → add to eval set → fix model → verify fix → deploy. This continuous improvement loop is how the best multimodal systems get better over time.
Key Takeaways
Evaluating multimodal AI systems
Essential Concepts
1. Multi-dimensional: Evaluate accuracy, hallucination, safety, bias, latency, and cost simultaneously

2. Automated + human: Use automated metrics for daily monitoring, human eval for milestone decisions

3. Hallucination is #1 risk: Build hallucination detection (POPE, VLM-as-judge) into your pipeline from day one

4. Custom eval > benchmarks: 50 domain-specific examples beat any public benchmark

5. Data flywheel: Production failures become eval examples become model improvements
Quick Start
Day 1: Create 50 eval examples from your domain
Week 1: Add automated accuracy + hallucination metrics
Week 2: Set up VLM-as-judge for quality scoring
Month 1: First human eval round, calibrate automated metrics
Ongoing: Grow eval set from production failures, monitor trends
Next up: Chapter 17 looks ahead to the future of multimodal AI — where the technology is heading, what breakthroughs to expect, and how to prepare for what’s coming.