Ch 10 — The Multimodal Model Landscape

GPT-4o, Gemini, Claude, open-source — comparing capabilities across all modalities
The Closed-Source Leaders
GPT-4o, Gemini 2.5, Claude 3.5 — capabilities and tradeoffs
GPT-4o (OpenAI)
Modalities: Text, images, audio (native), video (limited)
Strengths: Best overall reasoning, strong OCR, native audio with emotion, tool use
Weaknesses: Expensive at scale, image generation via DALL-E (separate), video understanding limited
Pricing: $2.50/M input, $10/M output tokens
Gemini 2.5 Pro (Google)
Modalities: Text, images, audio, video (native, up to 1hr)
Strengths: Best video understanding, 1M+ token context, native multimodal from training, strong reasoning
Weaknesses: Availability varies by region, occasional safety over-filtering
Pricing: $1.25/M input, $5/M output tokens
Claude 3.5 Sonnet (Anthropic)
Modalities: Text, images (strong vision), PDF
Strengths: Best at following complex instructions, strong document understanding, safety-focused, excellent coding
Weaknesses: No audio or video, no image generation
Pricing: $3/M input, $15/M output tokens
Key insight: No single model wins across all modalities. GPT-4o leads in audio, Gemini in video, Claude in instruction-following and documents. The best choice depends entirely on your specific use case and which modalities matter most.
The Open-Source Ecosystem
LLaMA, Qwen, InternVL — open models closing the gap
Open VLMs
// Top open-source multimodal models (2025)

InternVL 2.5 (Shanghai AI Lab)
  Best overall open VLM, rivals GPT-4V
  8B to 78B params, strong OCR

Qwen2-VL (Alibaba)
  Strong multilingual, video understanding
  2B to 72B params, any-resolution

LLaVA-NeXT (Community)
  Dynamic resolution, efficient
  7B to 34B params, easy to fine-tune

Phi-3-Vision (Microsoft)
  Small but capable (4.2B params)
  Runs on mobile devices

Pixtral Large (Mistral)
  124B params, strong reasoning
  Efficient architecture
Why Open-Source Matters
Cost: No per-token API fees — just compute costs. At scale, 5–20x cheaper than APIs.
Privacy: Data never leaves your infrastructure
Customization: Fine-tune on your domain data with LoRA
No rate limits: Scale to millions of requests without API throttling
Latency: Self-hosted models can achieve sub-100ms latency
Control: No content filtering restrictions
Key insight: Open-source VLMs in 2025 match GPT-4V (2023) on most benchmarks. The gap with GPT-4o and Gemini 2.5 is 12–18 months. For many production use cases, open models are already good enough — and dramatically cheaper.
Capability Matrix
Comparing models across modalities and tasks
Modality Support
// Input modalities supported
//              Text  Image  Audio  Video
GPT-4o           ✓     ✓      ✓      ~
Gemini 2.5       ✓     ✓      ✓      ✓
Claude 3.5       ✓     ✓      ×      ×
Qwen2-VL         ✓     ✓      ×      ✓
InternVL 2.5     ✓     ✓      ×      ~

// Output modalities
//              Text  Image  Audio
GPT-4o           ✓     ~*     ✓
Gemini 2.5       ✓     ✓      ×
Claude 3.5       ✓     ×      ×

// * via DALL-E integration
// ✓ = native, ~ = limited, × = no
Task Performance Leaders
OCR & documents: Gemini 2.5 > GPT-4o > Claude
Visual reasoning: GPT-4o ≈ Gemini 2.5 > Claude
Video understanding: Gemini 2.5 >> others
Audio/speech: GPT-4o >> others
Instruction following: Claude 3.5 > GPT-4o > Gemini
Coding from screenshots: Claude 3.5 > GPT-4o
Multilingual vision: Qwen2-VL > Gemini > GPT-4o
Key insight: There is no “best” multimodal model — only the best model for your specific task and constraints. Build your evaluation around your actual use case, not generic benchmarks.
Cost Analysis
API pricing, self-hosting economics, and optimization
API Pricing Comparison
// Cost per 1,000 image analyses
// (1 image + short prompt + response)
GPT-4o         $3.50 - $12.00
Gemini 2.5     $1.75 - $6.00
Claude 3.5     $4.50 - $18.00
GPT-4o-mini    $0.35 - $1.20
Gemini Flash   $0.10 - $0.40

// Self-hosted (InternVL on A100):
InternVL 8B    ~$0.05 per 1,000 images
Qwen2-VL 7B    ~$0.04 per 1,000 images

// Self-hosted is 10-100x cheaper at scale
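The crossover point between API and self-hosted pricing is simple arithmetic. The sketch below uses the illustrative per-1,000-image rates above; the GPU hourly rate and throughput figures are assumptions for the example, not quoted prices.

```python
def monthly_cost_api(images: int, cost_per_1k: float) -> float:
    """API cost scales linearly with volume."""
    return images / 1000 * cost_per_1k

def monthly_cost_selfhost(images: int, gpu_hourly: float = 2.0,
                          images_per_hour: int = 20_000) -> float:
    """Self-hosting pays for GPU-hours actually used
    (assumed A100 rental rate and throughput)."""
    hours = images / images_per_hour
    return hours * gpu_hourly

volume = 500_000                         # images per month
api = monthly_cost_api(volume, 3.50)     # GPT-4o, low end of range
hosted = monthly_cost_selfhost(volume)   # e.g. InternVL 8B on one A100
print(f"API: ${api:,.0f}/mo  self-hosted: ${hosted:,.0f}/mo")
```

At this volume the gap is two orders of magnitude, which is why the >100K images/month threshold below is where self-hosting typically starts to pay off.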
When to Self-Host
>100K images/month: Self-hosting becomes cost-effective
Privacy requirements: Data can’t leave your infrastructure
Latency-critical: Need sub-100ms response times
Custom fine-tuning: Domain-specific accuracy requirements
No content filtering: Need to process sensitive content
Cost Optimization Tips
Use mini/flash models for simple tasks (classification, basic OCR)
Use low-res mode when fine detail isn’t needed
Batch requests to amortize overhead
Cache results for repeated images
Route easy tasks to cheap models, hard tasks to expensive ones
Model Selection Framework
A decision tree for choosing the right multimodal model
Decision Tree
// Model selection decision tree
Need video understanding?              → Gemini 2.5 Pro (best video, 1hr+)
Need native audio?                     → GPT-4o (native voice mode)
Need best instruction following?       → Claude 3.5 Sonnet
Need lowest cost at scale?             → Self-host Qwen2-VL or InternVL
Need mobile/edge deployment?           → Phi-3-Vision (4.2B params)
Need image generation + understanding? → Gemini 2.5 (native image output)
Need privacy / no data sharing?        → Self-host any open model
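A decision tree like this is just an ordered rule list: first match wins, and rule order encodes priority. A minimal sketch, assuming the requirement labels and the `GPT-4o-mini` fallback (both illustrative choices, not from any vendor SDK):

```python
def pick_model(needs: set[str]) -> str:
    """First matching rule wins; rule order encodes priority."""
    rules = [
        ("video", "Gemini 2.5 Pro"),
        ("audio", "GPT-4o"),
        ("instructions", "Claude 3.5 Sonnet"),
        ("low_cost", "Self-hosted Qwen2-VL / InternVL"),
        ("edge", "Phi-3-Vision"),
        ("image_gen", "Gemini 2.5"),
        ("privacy", "Self-hosted open model"),
    ]
    for need, model in rules:
        if need in needs:
            return model
    return "GPT-4o-mini"  # assumed default for simple tasks

print(pick_model({"video", "privacy"}))  # video rule outranks privacy here
```

Note that rule order matters when needs conflict: if privacy must always win, move it to the top of the list.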
The Multi-Model Strategy
Most production systems use multiple models:

Triage model (cheap): GPT-4o-mini or Gemini Flash classifies and routes
Specialist model (expensive): GPT-4o or Gemini Pro handles complex cases
Batch model (self-hosted): Open-source model processes high-volume, low-urgency tasks

This “model router” pattern reduces costs by 60–80% while maintaining quality on hard cases.
Key insight: The model router pattern is the most important cost optimization in multimodal AI. Route 80% of requests to cheap models and 20% to expensive ones. Your average cost drops dramatically while quality stays high.
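The triage-then-route flow above can be sketched in a few lines. Here a keyword heuristic stands in for the real triage call to a cheap model, and the model names in `ROUTES` are illustrative:

```python
def triage(request: dict) -> str:
    """Classify request difficulty. In production this would be a
    GPT-4o-mini / Gemini Flash call; a keyword heuristic stands in here."""
    hard_markers = ("handwriting", "chart", "multi-page", "legal")
    if any(m in request["task"] for m in hard_markers):
        return "specialist"
    if not request.get("urgent", True):
        return "batch"
    return "cheap"

ROUTES = {
    "cheap": "gpt-4o-mini",           # ~80% of traffic
    "specialist": "gpt-4o",           # hard cases
    "batch": "self-hosted-internvl",  # high-volume, low-urgency
}

def route(request: dict) -> str:
    return ROUTES[triage(request)]

print(route({"task": "read this chart", "urgent": True}))  # gpt-4o
```

The average cost per request becomes a weighted blend: mostly cheap-model pricing, with expensive-model pricing only on the minority of hard cases.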
Benchmarks & Evaluation
How multimodal models are measured and compared
Key Benchmarks
MMMU: Massive Multi-discipline Multimodal Understanding — college-level visual reasoning across 30 subjects
MMBench: Comprehensive VLM evaluation with fine-grained ability assessment
DocVQA: Document visual question answering — reading and understanding documents
ChartQA: Understanding and extracting data from charts and graphs
MathVista: Mathematical reasoning with visual inputs
RealWorldQA: Real-world spatial understanding from photos
Benchmark Limitations
Contamination: Models may have seen benchmark data during training
Narrow scope: Benchmarks test specific skills, not real-world performance
Static: Don’t capture model behavior on your specific data distribution
Gaming: Models can be optimized for benchmarks without improving real utility
Key insight: Public benchmarks are useful for initial model selection but insufficient for production decisions. Always build a custom eval set from your actual data and use cases. A model that scores 5% lower on MMMU might score 20% higher on your specific task.
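A custom eval set need not be elaborate: a list of (input, expected) pairs and a scoring loop is enough to start. A minimal sketch (exact-match scoring; the toy eval set and dummy model are placeholders for your real data and API calls):

```python
from typing import Callable

def evaluate(model: Callable[[str], str],
             eval_set: list[tuple[str, str]]) -> float:
    """Score a model on your own (input, expected) pairs.
    Exact match is the simplest metric; swap in fuzzy matching
    or an LLM judge for free-form outputs."""
    correct = sum(model(x) == y for x, y in eval_set)
    return correct / len(eval_set)

# Toy eval set and a trivial "model" for illustration
evals = [("2+2", "4"), ("capital of France", "Paris")]
dummy = {"2+2": "4", "capital of France": "Paris"}.get
print(evaluate(dummy, evals))  # 1.0
```

Run the same `evaluate` over each candidate model and compare scores on your data, not on MMMU.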
Landscape Trends
Where the multimodal model landscape is heading
Current Trends
Native multimodal: All new frontier models are trained multimodal from scratch, not bolt-on
Longer context: 1M+ tokens enables processing entire videos and document collections
Smaller, faster: Mini/flash models achieve 80% of flagship quality at 10% of the cost
Open-source acceleration: Gap between open and closed shrinking from 18 months to 6–12 months
Specialization: Domain-specific models (medical, legal, scientific) outperforming generalists
Predictions for 2026
Universal models: Single model handles text, image, audio, video, and 3D natively
Real-time video: Process live video streams with sub-second latency
On-device multimodal: Capable VLMs running on phones and laptops
Multimodal agents: AI agents that can see, hear, and interact with the physical world
Commoditization: Basic multimodal capabilities become commodity; differentiation shifts to domain expertise
Key insight: The multimodal landscape is consolidating around a few architectural patterns (native multimodal Transformers) while diversifying in deployment (cloud APIs, self-hosted, on-device). The winning strategy is flexibility — design systems that can swap models as the landscape evolves.
Key Takeaways
Navigating the multimodal model landscape
Essential Concepts
1. No single best model: GPT-4o (audio), Gemini (video), Claude (instructions), open-source (cost/privacy)

2. Open-source is production-ready: InternVL, Qwen2-VL match GPT-4V on most tasks at 10–100x lower cost

3. Model router pattern: Route 80% to cheap models, 20% to expensive ones for 60–80% cost reduction

4. Custom evals beat benchmarks: Always test on your actual data and use cases

5. Design for flexibility: Abstract model calls so you can swap providers as the landscape evolves
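Abstracting model calls can be as light as one interface that every provider adapter implements. A minimal sketch using structural typing; the `describe` method and `FakeProvider` are illustrative, not any vendor's SDK:

```python
from typing import Protocol

class VisionModel(Protocol):
    """The interface your application codes against."""
    def describe(self, image_bytes: bytes, prompt: str) -> str: ...

class FakeProvider:
    """Stand-in adapter; a real one would wrap an OpenAI, Google,
    or self-hosted endpoint behind the same method."""
    def describe(self, image_bytes: bytes, prompt: str) -> str:
        return f"[fake answer to: {prompt}]"

def analyze(model: VisionModel, image: bytes) -> str:
    # Swapping providers means changing one constructor, not call sites.
    return model.describe(image, "What is in this image?")

print(analyze(FakeProvider(), b""))
```

With this shape, replacing the model behind `analyze` is a one-line change at the composition root, which is what makes the "swap models as the landscape evolves" advice practical.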
Action Items
Identify your modalities: Which inputs/outputs do you actually need?
Build a custom eval: 50–100 examples from your real data
Test 3–4 models: Don’t assume the most expensive is best for your task
Implement model routing: Use cheap models for easy cases
Plan for evolution: The best model today won’t be the best model in 6 months
Next up: Chapter 11 dives into multimodal embeddings and search — how to build systems that search across text, images, and audio using shared embedding spaces.