Ch 9 — How Vision-Language Models Work

GPT-4V, Gemini, Claude Vision, LLaVA — architecture, visual reasoning, and capabilities
High Level

Image → Encode → Project → Reason → Respond → Apply
The VLM Architecture
Three components: vision encoder, projector, and LLM
How VLMs Work
A Vision-Language Model combines three components:

1. Vision Encoder (ViT): Converts the image into patch embeddings — a sequence of visual feature vectors
2. Projector: Maps visual embeddings into the LLM’s token space (linear layer, MLP, or Q-Former)
3. LLM Backbone: Processes the combined sequence of [visual tokens] + [text tokens] and generates a text response

The image becomes “virtual text tokens” that the LLM processes alongside the actual text prompt.
The Pipeline
```
// VLM processing pipeline
Input: [image] + "What's in this photo?"

1. Vision Encoder
   Image → ViT → 576 patch embeddings

2. Projector
   576 visual embeddings → 576 virtual tokens
   (mapped to LLM's embedding dimension)

3. LLM
   [576 visual tokens] + [text tokens]
   → Self-attention across ALL tokens
   → Generate text response autoregressively

Output: "A golden retriever playing fetch in a park on a sunny afternoon."
```
Key insight: The LLM doesn’t “know” which tokens are visual and which are text — they’re all just vectors in the same embedding space. The magic is in the projector that makes visual information “look like” text to the LLM.
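The flow above can be traced with a toy sketch. The dimensions here are deliberately tiny and the "encoder" is random numbers; real models use hundreds of patches, ~1024-d ViT features, and ~4096-d LLM embeddings. The point is only the shape bookkeeping: after projection, visual and text tokens live in one flat sequence.

```python
import random

# Toy dimensions (illustrative only; e.g. LLaVA uses 576 patches,
# 1024-d ViT features, and a 4096-d LLM embedding space).
NUM_PATCHES, VIT_DIM, LLM_DIM = 6, 4, 8

def encode_image():
    """Stand-in for a ViT: one feature vector per image patch."""
    return [[random.random() for _ in range(VIT_DIM)] for _ in range(NUM_PATCHES)]

def project(visual_feats, weights):
    """Linear projector: map each ViT vector into the LLM embedding space."""
    return [[sum(v[i] * weights[i][j] for i in range(VIT_DIM))
             for j in range(LLM_DIM)] for v in visual_feats]

# Projector weights (learned in a real model, random here).
W = [[random.random() for _ in range(LLM_DIM)] for _ in range(VIT_DIM)]

visual_tokens = project(encode_image(), W)
text_tokens = [[random.random() for _ in range(LLM_DIM)] for _ in range(3)]

# The LLM sees one flat sequence; visual and text tokens are
# indistinguishable once they share the embedding dimension.
sequence = visual_tokens + text_tokens
print(len(sequence), len(sequence[0]))  # 9 tokens, each 8-dimensional
```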
Bolt-On vs Native Multimodal
Two fundamentally different approaches
Bolt-On (LLaVA, Early GPT-4V)
Take a pre-trained LLM and add vision capabilities by training a projector while keeping the LLM mostly frozen:

Faster to build: Leverage existing LLM capabilities
Modular: Swap vision encoder or LLM independently
Limitation: Visual understanding is “grafted on” — the LLM wasn’t designed for visual reasoning
Examples: LLaVA, InternVL, early Claude Vision
Native Multimodal (Gemini, GPT-4o)
Train the model from scratch on interleaved text, images, audio, and video:

Deeper integration: Visual reasoning is fundamental, not added on
Better performance: Stronger spatial reasoning, counting, OCR
Cross-modal transfer: Understanding in one modality improves others
Limitation: Requires massive compute and data to train from scratch
Examples: Gemini 2.5, GPT-4o
Key insight: The industry is moving from bolt-on to native multimodal. Gemini was trained multimodal from the start, which is why it excels at tasks requiring deep visual reasoning. Expect the next generation of frontier models to be natively multimodal.
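The bolt-on recipe comes down to which parameters get gradients at each stage. A minimal sketch, with hypothetical parameter names (the two-stage split mirrors LLaVA-style training; exact recipes vary by model):

```python
# Hypothetical parameter names for a bolt-on VLM (illustrative only).
param_names = [
    "vision_encoder.layer0.weight",
    "projector.fc1.weight",
    "projector.fc2.weight",
    "llm.block0.attn.weight",
    "llm.block0.mlp.weight",
]

def trainable(names, stage):
    """Stage 1 (alignment): train only the projector; ViT and LLM stay frozen.
    Stage 2 (instruction tuning): unfreeze the LLM as well (LLaVA-style).
    The vision encoder typically stays frozen throughout."""
    prefixes = ("projector.",) if stage == 1 else ("projector.", "llm.")
    return [n for n in names if n.startswith(prefixes)]

print(trainable(param_names, 1))  # only the projector weights
```

A native multimodal model has no such partition: every parameter trains on interleaved multimodal data from the start.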
Visual Reasoning Capabilities
What VLMs can actually do with images
Core Capabilities
Image description: Detailed captions with context and relationships
OCR & text extraction: Read text in images, documents, screenshots
Visual Q&A: Answer specific questions about image content
Chart/graph analysis: Extract data and trends from visualizations
Spatial reasoning: Understand positions, sizes, and relationships
Multi-image comparison: Spot differences, track changes
Code from screenshots: Generate code from UI mockups
Emergent Abilities
Humor understanding: Explain why a meme is funny
Safety analysis: Identify hazards in workplace photos
Medical reasoning: Preliminary analysis of X-rays, skin conditions
Scientific diagrams: Explain circuit diagrams, molecular structures
Cultural context: Recognize cultural references, historical context
Key insight: VLMs don’t just “see” images — they reason about them using the LLM’s world knowledge. A VLM can explain why a circuit diagram is wrong because it combines visual pattern recognition with electrical engineering knowledge from its text training.
Resolution & Token Strategies
How models handle different image sizes and detail levels
Resolution Handling
Different models handle resolution differently:

Fixed resolution: Resize all images to 224×224 or 336×336 (simple but loses detail)
Tiled/multi-crop: Split large images into tiles, process each separately, combine (GPT-4V high-res mode)
Dynamic resolution: Vary the number of visual tokens with image size and aspect ratio (Gemini, InternVL)
Any-resolution: Handle arbitrary aspect ratios without cropping (emerging)
Token Budget Tradeoffs
```
// GPT-4V resolution modes

Low detail:  Resize to 512×512 → 85 tokens
             Cost: ~$0.001 per image
             Good for: classification, general Q&A

High detail: Tile into 512×512 crops → 765+ tokens
             Cost: ~$0.01 per image
             Good for: OCR, fine details, documents

Auto:        Model decides based on image content
             Recommended for most use cases
```
Key insight: Resolution mode is the most impactful cost lever in VLM applications. A document OCR app needs high-res (10x cost) while a content moderation app works fine with low-res. Always match resolution to your actual needs.
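The token math behind those modes can be made concrete. This follows the GPT-4V rules as published at launch (fit within 2048×2048, downscale the shortest side to 768, count 512×512 tiles at 170 tokens each plus an 85-token base); the constants can change, so treat it as a cost-estimation sketch, not a billing source.

```python
import math

def gpt4v_image_tokens(width, height, detail="high"):
    """Estimate image token count under GPT-4V's published rules
    (verify constants against current provider docs)."""
    if detail == "low":
        return 85  # flat cost: image is resized to 512x512
    # High detail: first scale to fit within a 2048x2048 square...
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # ...then downscale so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Each 512x512 tile costs 170 tokens, plus an 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 170 * tiles + 85

print(gpt4v_image_tokens(1024, 1024))        # 765  (2x2 tiles)
print(gpt4v_image_tokens(2048, 4096))        # 1105 (2x3 tiles)
print(gpt4v_image_tokens(640, 480, "low"))   # 85
```

At roughly 9x the tokens, high detail is only worth paying for when the task actually needs fine detail.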
VLM Limitations & Failure Modes
Where vision-language models still struggle
Known Weaknesses
Hallucination: Confidently describing objects that aren’t in the image
Counting: Unreliable beyond ~5 objects (“How many people?”)
Spatial precision: “Is the red ball to the left or right of the blue one?” often wrong
Small text: Misreading fine print, especially at low resolution
Overlapping objects: Confusion when objects occlude each other
Adversarial images: Optical illusions and adversarial patches fool models
Mitigation Strategies
Use high-res mode for detail-critical tasks (OCR, small objects)
Crop and zoom into regions of interest before sending
Ask specific questions rather than open-ended “describe this image”
Chain-of-thought: Ask the model to reason step by step about what it sees
Multi-image verification: Send multiple views/angles for critical tasks
Grounding: Ask for bounding box coordinates to verify the model is looking at the right thing
Key insight: VLM hallucination is the visual equivalent of text hallucination — the model generates plausible-sounding descriptions of things it doesn’t actually see. Always verify critical visual claims, especially for medical, legal, or safety applications.
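The grounding strategy above lends itself to a cheap automated check: if a model reports a bounding box that is degenerate, out of frame, or implausibly tiny, treat the claim as suspect. A minimal sketch, assuming pixel-coordinate `[x0, y0, x1, y1]` boxes (coordinate conventions vary by model, and the area threshold is an arbitrary illustrative choice):

```python
def check_bbox(bbox, image_w, image_h, min_area_frac=0.0005):
    """Sanity-check a model-reported bounding box [x0, y0, x1, y1] in pixels.
    A degenerate or out-of-frame box is a strong hallucination signal."""
    x0, y0, x1, y1 = bbox
    # Corners must be ordered and inside the image.
    in_frame = 0 <= x0 < x1 <= image_w and 0 <= y0 < y1 <= image_h
    # Reject boxes too small to plausibly contain the claimed object.
    big_enough = (x1 - x0) * (y1 - y0) >= min_area_frac * image_w * image_h
    return in_frame and big_enough

print(check_bbox([100, 50, 300, 200], 640, 480))  # True: plausible box
print(check_bbox([500, 50, 300, 200], 640, 480))  # False: x0 > x1
```

Boxes that fail the check can be routed to a retry prompt or human review rather than trusted blindly.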
LLaVA: The Open-Source VLM
How the open-source community built competitive VLMs
LLaVA Architecture
LLaVA (Large Language and Vision Assistant) showed that competitive VLMs can be built with surprisingly simple ingredients:

1. Vision encoder: CLIP ViT-L/14 (frozen)
2. Projector: Simple 2-layer MLP
3. LLM: Vicuna/LLaMA (fine-tuned)

Training: first align vision-language with image-caption pairs, then instruction-tune with visual Q&A data. Total training cost: ~$100 on 8 A100 GPUs.
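LLaVA's projector really is just a small MLP. Here is a toy-dimension sketch of the two-layer GELU projector (LLaVA-1.5's `mlp2x_gelu` design); the real model maps 1024-d CLIP features to the LLM's embedding size, e.g. 4096-d for Vicuna-7B, and the weights below are random stand-ins for trained ones.

```python
import math
import random

# Toy dimensions, illustrative only.
VIT_DIM, HIDDEN, LLM_DIM = 4, 8, 6

def gelu(x):
    """Tanh-approximate GELU activation."""
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

def linear(vec, W):
    """Matrix-vector product: W has one row per input dim, one column per output dim."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*W)]

def mlp_projector(patch_feat, W1, W2):
    """Two-layer MLP: ViT feature -> hidden (GELU) -> LLM embedding."""
    hidden = [gelu(h) for h in linear(patch_feat, W1)]
    return linear(hidden, W2)

W1 = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(VIT_DIM)]
W2 = [[random.uniform(-1, 1) for _ in range(LLM_DIM)] for _ in range(HIDDEN)]
out = mlp_projector([0.1, -0.2, 0.3, 0.4], W1, W2)
print(len(out))  # one LLM-dimension vector per patch
```

That this small module is the only new component between two pre-trained networks is exactly why LLaVA was so cheap to train.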
Open-Source VLM Ecosystem
```
// Open-source VLMs (2025)

LLaVA-NeXT     Dynamic resolution, strong OCR
InternVL 2.5   Best open VLM, rivals GPT-4V
Qwen2-VL       Alibaba, strong multilingual
Phi-3-Vision   Microsoft, small but capable
Idefics 3      HuggingFace, multi-image
Pixtral        Mistral, efficient architecture

// Open VLMs now match GPT-4V (2023)
// on most benchmarks. Gap with GPT-4o
// and Gemini 2.5 is narrowing fast.
```
Key insight: LLaVA proved that VLMs don’t need billions in compute. A simple projector between a frozen vision encoder and a fine-tuned LLM produces surprisingly capable models. This democratized VLM research and enabled rapid iteration.
Practical VLM Patterns
How to use VLMs effectively in applications
Common Application Patterns
```
// Pattern 1: Document Understanding
Input:  [invoice image] + "Extract all line items"
Output: Structured JSON with items, prices, totals

// Pattern 2: Visual QA Pipeline
Input:  [product photo] + "Any defects visible?"
Output: "Scratch on upper-left corner, ~2cm long"

// Pattern 3: Multi-Image Comparison
Input:  [before] + [after] + "What changed?"
Output: Detailed diff of visual changes

// Pattern 4: Screenshot to Code
Input:  [UI mockup] + "Generate React component"
Output: Working JSX matching the design
```
Best Practices
Be specific in prompts: “Count the red cars in the parking lot” not “What do you see?”
Use system prompts: Define the model’s role and expected output format
Provide examples: Few-shot with example image-response pairs
Chain multiple calls: First identify regions of interest, then analyze each
Validate outputs: Cross-check extracted data against known constraints
Pro tip: For structured extraction (invoices, forms, tables), ask the VLM to output JSON and validate the schema. For subjective tasks (quality inspection, content moderation), use confidence scores and human review for edge cases.
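The "output JSON and validate" advice can be sketched concretely. This uses a hypothetical invoice schema (an `items` list of `description`/`price` objects plus a `total`) and checks both structure and an arithmetic constraint: the line items must sum to the stated total. Adapt the fields and checks to your own documents.

```python
import json

def validate_invoice(raw, tolerance=0.01):
    """Validate VLM-extracted invoice JSON: required keys, well-formed
    line items, and line items summing to the stated total."""
    data = json.loads(raw)
    if not isinstance(data.get("items"), list) or "total" not in data:
        return False, "missing 'items' list or 'total'"
    for item in data["items"]:
        if not isinstance(item, dict) or not {"description", "price"} <= item.keys():
            return False, f"malformed item: {item!r}"
    # Cross-check: extracted line items must add up to the extracted total.
    computed = sum(item["price"] for item in data["items"])
    if abs(computed - data["total"]) > tolerance:
        return False, f"items sum to {computed}, total says {data['total']}"
    return True, "ok"

raw = ('{"items": [{"description": "Widget", "price": 9.5},'
       ' {"description": "Gadget", "price": 2.5}], "total": 12.0}')
print(validate_invoice(raw))  # (True, 'ok')
```

A failed check is a cue to re-prompt (often in high-detail mode) or escalate to human review, rather than silently accepting the extraction.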
Key Takeaways
What to remember about vision-language models
Essential Concepts
1. Three components: Vision encoder (ViT) + Projector (MLP) + LLM backbone

2. Bolt-on vs native: Adding vision to an LLM vs training multimodal from scratch. Native is better but more expensive.

3. Visual reasoning: VLMs combine visual perception with LLM world knowledge for emergent capabilities

4. Resolution tradeoff: More tokens = better detail but higher cost. Match resolution to your task.

5. Open-source parity: InternVL, Qwen2-VL rival GPT-4V on most benchmarks
Practical Implications
• VLMs unlock document understanding, visual QA, and screenshot-to-code at scale
• Hallucination is the biggest risk — always verify critical visual claims
• Resolution mode is the most impactful cost lever
• Specific prompts dramatically outperform vague ones for visual tasks
• Open-source VLMs are production-ready for many use cases
Next up: Chapter 10 maps the full multimodal model landscape — comparing GPT-4o, Gemini, Claude, and open-source alternatives across all modalities, with guidance on choosing the right model for your use case.