Ch 10 — The Multimodal Model Landscape

GPT-4o, Gemini, Claude, open-source — comparing capabilities across all modalities
The Closed-Source Leaders
GPT-4o, Gemini 2.5, Claude 3.5 — capabilities and tradeoffs
GPT-4o (OpenAI)
Modalities: Text, images, audio (native), video (limited)
Strengths: Best overall reasoning, strong OCR, native audio with emotion, tool use
Weaknesses: Expensive at scale, image generation via DALL-E (separate), video understanding limited
Pricing: $2.50/M input, $10/M output tokens
Gemini 2.5 Pro (Google)
Modalities: Text, images, audio, video (native, up to 1hr)
Strengths: Best video understanding, 1M+ token context, native multimodal from training, strong reasoning
Weaknesses: Availability varies by region, occasional safety over-filtering
Pricing: $1.25/M input, $5/M output tokens
Claude 3.5 Sonnet (Anthropic)
Modalities: Text, images (strong vision), PDF
Strengths: Best at following complex instructions, strong document understanding, safety-focused, excellent coding
Weaknesses: No audio or video, no image generation
Pricing: $3/M input, $15/M output tokens
Key insight: No single model wins across all modalities. GPT-4o leads in audio, Gemini in video, Claude in instruction-following and documents. The best choice depends entirely on your specific use case and which modalities matter most.
The Open-Source Ecosystem
LLaMA, Qwen, InternVL — open models closing the gap
Open VLMs
// Top open-source multimodal models (2025)

InternVL 2.5 (Shanghai AI Lab)
  Best overall open VLM, rivals GPT-4V
  8B to 78B params, strong OCR

Qwen2-VL (Alibaba)
  Strong multilingual, video understanding
  2B to 72B params, any-resolution

LLaVA-NeXT (Community)
  Dynamic resolution, efficient
  7B to 34B params, easy to fine-tune

Phi-3-Vision (Microsoft)
  Small but capable (4.2B params)
  Runs on mobile devices

Pixtral Large (Mistral)
  124B params, strong reasoning
  Efficient architecture
Why Open-Source Matters
Cost: No per-token API fees — just compute costs. At scale, 5–20x cheaper than APIs.
Privacy: Data never leaves your infrastructure
Customization: Fine-tune on your domain data with LoRA
No rate limits: Scale to millions of requests without API throttling
Latency: Self-hosted models can achieve sub-100ms latency
Control: No content filtering restrictions
Key insight: Open-source VLMs in 2025 match GPT-4V (2023) on most benchmarks. The gap with GPT-4o and Gemini 2.5 is 12–18 months. For many production use cases, open models are already good enough — and dramatically cheaper.
Capability Matrix
Comparing models across modalities and tasks
Modality Support
// Input modalities supported
//              Text  Image  Audio  Video
GPT-4o           ✓     ✓      ✓      ~
Gemini 2.5       ✓     ✓      ✓      ✓
Claude 3.5       ✓     ✓      ×      ×
Qwen2-VL         ✓     ✓      ×      ✓
InternVL 2.5     ✓     ✓      ×      ~

// Output modalities
//              Text  Image  Audio
GPT-4o           ✓     ~*     ✓
Gemini 2.5       ✓     ✓      ×
Claude 3.5       ✓     ×      ×

// * via DALL-E integration
// ✓ = native, ~ = limited, × = no
Task Performance Leaders
OCR & documents: Gemini 2.5 > GPT-4o > Claude
Visual reasoning: GPT-4o ≈ Gemini 2.5 > Claude
Video understanding: Gemini 2.5 >> others
Audio/speech: GPT-4o >> others
Instruction following: Claude 3.5 > GPT-4o > Gemini
Coding from screenshots: Claude 3.5 > GPT-4o
Multilingual vision: Qwen2-VL > Gemini > GPT-4o
Key insight: There is no “best” multimodal model — only the best model for your specific task and constraints. Build your evaluation around your actual use case, not generic benchmarks.
Cost Analysis
API pricing, self-hosting economics, and optimization
API Pricing Comparison
// Cost per 1,000 image analyses
// (1 image + short prompt + response)
GPT-4o         $3.50 - $12.00
Gemini 2.5     $1.75 - $6.00
Claude 3.5     $4.50 - $18.00
GPT-4o-mini    $0.35 - $1.20
Gemini Flash   $0.10 - $0.40

// Self-hosted (InternVL on A100):
InternVL 8B    ~$0.05 per 1,000 images
Qwen2-VL 7B    ~$0.04 per 1,000 images

// Self-hosted is 10-100x cheaper at scale
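The crossover point between API and self-hosted pricing is simple arithmetic. The sketch below uses the illustrative per-1,000-image rates above; the GPU hourly rate and throughput figures are assumptions for the example, not quoted prices.

```python
def monthly_cost_api(images: int, cost_per_1k: float) -> float:
    """API cost scales linearly with volume."""
    return images / 1000 * cost_per_1k

def monthly_cost_selfhost(images: int, gpu_hourly: float = 2.0,
                          images_per_hour: int = 20_000) -> float:
    """Self-hosting pays for GPU-hours actually used
    (assumed A100 rental rate and throughput)."""
    hours = images / images_per_hour
    return hours * gpu_hourly

volume = 500_000                         # images per month
api = monthly_cost_api(volume, 3.50)     # GPT-4o, low end of range
hosted = monthly_cost_selfhost(volume)   # e.g. InternVL 8B on one A100
print(f"API: ${api:,.0f}/mo  self-hosted: ${hosted:,.0f}/mo")
```

At this volume the gap is two orders of magnitude, which is why the >100K images/month threshold below is where self-hosting typically starts to pay off.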
When to Self-Host
>100K images/month: Self-hosting becomes cost-effective
Privacy requirements: Data can’t leave your infrastructure
Latency-critical: Need sub-100ms response times
Custom fine-tuning: Domain-specific accuracy requirements
No content filtering: Need to process sensitive content
Cost Optimization Tips
Use mini/flash models for simple tasks (classification, basic OCR)
Use low-res mode when fine detail isn’t needed
Batch requests to amortize overhead
Cache results for repeated images
Route easy tasks to cheap models, hard tasks to expensive ones
Model Selection Framework
A decision tree for choosing the right multimodal model
Decision Tree
// Model selection decision tree
Need video understanding?              → Gemini 2.5 Pro (best video, 1hr+)
Need native audio?                     → GPT-4o (native voice mode)
Need best instruction following?       → Claude 3.5 Sonnet
Need lowest cost at scale?             → Self-host Qwen2-VL or InternVL
Need mobile/edge deployment?           → Phi-3-Vision (4.2B params)
Need image generation + understanding? → Gemini 2.5 (native image output)
Need privacy / no data sharing?        → Self-host any open model
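A decision tree like this is just an ordered rule list: first match wins, and rule order encodes priority. A minimal sketch, assuming the requirement labels and the `GPT-4o-mini` fallback (both illustrative choices, not from any vendor SDK):

```python
def pick_model(needs: set[str]) -> str:
    """First matching rule wins; rule order encodes priority."""
    rules = [
        ("video", "Gemini 2.5 Pro"),
        ("audio", "GPT-4o"),
        ("instructions", "Claude 3.5 Sonnet"),
        ("low_cost", "Self-hosted Qwen2-VL / InternVL"),
        ("edge", "Phi-3-Vision"),
        ("image_gen", "Gemini 2.5"),
        ("privacy", "Self-hosted open model"),
    ]
    for need, model in rules:
        if need in needs:
            return model
    return "GPT-4o-mini"  # assumed default for simple tasks

print(pick_model({"video", "privacy"}))  # video rule outranks privacy here
```

Note that rule order matters when needs conflict: if privacy must always win, move it to the top of the list.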
The Multi-Model Strategy
Most production systems use multiple models:

Triage model (cheap): GPT-4o-mini or Gemini Flash classifies and routes
Specialist model (expensive): GPT-4o or Gemini Pro handles complex cases
Batch model (self-hosted): Open-source model processes high-volume, low-urgency tasks

This “model router” pattern reduces costs by 60–80% while maintaining quality on hard cases.
Key insight: The model router pattern is the most important cost optimization in multimodal AI. Route 80% of requests to cheap models and 20% to expensive ones. Your average cost drops dramatically while quality stays high.
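The triage-then-route flow above can be sketched in a few lines. Here a keyword heuristic stands in for the real triage call to a cheap model, and the model names in `ROUTES` are illustrative:

```python
def triage(request: dict) -> str:
    """Classify request difficulty. In production this would be a
    GPT-4o-mini / Gemini Flash call; a keyword heuristic stands in here."""
    hard_markers = ("handwriting", "chart", "multi-page", "legal")
    if any(m in request["task"] for m in hard_markers):
        return "specialist"
    if not request.get("urgent", True):
        return "batch"
    return "cheap"

ROUTES = {
    "cheap": "gpt-4o-mini",           # ~80% of traffic
    "specialist": "gpt-4o",           # hard cases
    "batch": "self-hosted-internvl",  # high-volume, low-urgency
}

def route(request: dict) -> str:
    return ROUTES[triage(request)]

print(route({"task": "read this chart", "urgent": True}))  # gpt-4o
```

The average cost per request becomes a weighted blend: mostly cheap-model pricing, with expensive-model pricing only on the minority of hard cases.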
Benchmarks & Evaluation
How multimodal models are measured and compared
Key Benchmarks
MMMU: Massive Multi-discipline Multimodal Understanding — college-level visual reasoning across 30 subjects
MMBench: Comprehensive VLM evaluation with fine-grained ability assessment
DocVQA: Document visual question answering — reading and understanding documents
ChartQA: Understanding and extracting data from charts and graphs
MathVista: Mathematical reasoning with visual inputs
RealWorldQA: Real-world spatial understanding from photos
Benchmark Limitations
Contamination: Models may have seen benchmark data during training
Narrow scope: Benchmarks test specific skills, not real-world performance
Static: Don’t capture model behavior on your specific data distribution
Gaming: Models can be optimized for benchmarks without improving real utility
Key insight: Public benchmarks are useful for initial model selection but insufficient for production decisions. Always build a custom eval set from your actual data and use cases. A model that scores 5% lower on MMMU might score 20% higher on your specific task.
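A custom eval set need not be elaborate: a list of (input, expected) pairs and a scoring loop is enough to start. A minimal sketch (exact-match scoring; the toy eval set and dummy model are placeholders for your real data and API calls):

```python
from typing import Callable

def evaluate(model: Callable[[str], str],
             eval_set: list[tuple[str, str]]) -> float:
    """Score a model on your own (input, expected) pairs.
    Exact match is the simplest metric; swap in fuzzy matching
    or an LLM judge for free-form outputs."""
    correct = sum(model(x) == y for x, y in eval_set)
    return correct / len(eval_set)

# Toy eval set and a trivial "model" for illustration
evals = [("2+2", "4"), ("capital of France", "Paris")]
dummy = {"2+2": "4", "capital of France": "Paris"}.get
print(evaluate(dummy, evals))  # 1.0
```

Run the same `evaluate` over each candidate model and compare scores on your data, not on MMMU.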
Landscape Trends
Where the multimodal model landscape is heading
Current Trends
Native multimodal: All new frontier models are trained multimodal from scratch, not bolt-on
Longer context: 1M+ tokens enables processing entire videos and document collections
Smaller, faster: Mini/flash models achieve 80% of flagship quality at 10% of the cost
Open-source acceleration: Gap between open and closed shrinking from 18 months to 6–12 months
Specialization: Domain-specific models (medical, legal, scientific) outperforming generalists
Predictions for 2026
Universal models: Single model handles text, image, audio, video, and 3D natively
Real-time video: Process live video streams with sub-second latency
On-device multimodal: Capable VLMs running on phones and laptops
Multimodal agents: AI agents that can see, hear, and interact with the physical world
Commoditization: Basic multimodal capabilities become commodity; differentiation shifts to domain expertise
Key insight: The multimodal landscape is consolidating around a few architectural patterns (native multimodal Transformers) while diversifying in deployment (cloud APIs, self-hosted, on-device). The winning strategy is flexibility — design systems that can swap models as the landscape evolves.
Key Takeaways
Navigating the multimodal model landscape
Essential Concepts
1. No single best model: GPT-4o (audio), Gemini (video), Claude (instructions), open-source (cost/privacy)

2. Open-source is production-ready: InternVL, Qwen2-VL match GPT-4V on most tasks at 10–100x lower cost

3. Model router pattern: Route 80% to cheap models, 20% to expensive ones for 60–80% cost reduction

4. Custom evals beat benchmarks: Always test on your actual data and use cases

5. Design for flexibility: Abstract model calls so you can swap providers as the landscape evolves
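Abstracting model calls can be as light as one interface that every provider adapter implements. A minimal sketch using structural typing; the `describe` method and `FakeProvider` are illustrative, not any vendor's SDK:

```python
from typing import Protocol

class VisionModel(Protocol):
    """The interface your application codes against."""
    def describe(self, image_bytes: bytes, prompt: str) -> str: ...

class FakeProvider:
    """Stand-in adapter; a real one would wrap an OpenAI, Google,
    or self-hosted endpoint behind the same method."""
    def describe(self, image_bytes: bytes, prompt: str) -> str:
        return f"[fake answer to: {prompt}]"

def analyze(model: VisionModel, image: bytes) -> str:
    # Swapping providers means changing one constructor, not call sites.
    return model.describe(image, "What is in this image?")

print(analyze(FakeProvider(), b""))
```

With this shape, replacing the model behind `analyze` is a one-line change at the composition root, which is what makes the "swap models as the landscape evolves" advice practical.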
Action Items
Identify your modalities: Which inputs/outputs do you actually need?
Build a custom eval: 50–100 examples from your real data
Test 3–4 models: Don’t assume the most expensive is best for your task
Implement model routing: Use cheap models for easy cases
Plan for evolution: The best model today won’t be the best model in 6 months
Next up: Chapter 11 dives into multimodal embeddings and search — how to build systems that search across text, images, and audio using shared embedding spaces.