Ch 9 — How Vision-Language Models Work

GPT-4V, Gemini, Claude Vision, LLaVA — architecture, visual reasoning, and capabilities
High Level

Image → Encode → Project → Reason → Respond → Apply
The VLM Architecture
Three components: vision encoder, projector, and LLM
How VLMs Work
A Vision-Language Model combines three components:

1. Vision Encoder (ViT): Converts the image into patch embeddings — a sequence of visual feature vectors
2. Projector: Maps visual embeddings into the LLM’s token space (linear layer, MLP, or Q-Former)
3. LLM Backbone: Processes the combined sequence of [visual tokens] + [text tokens] and generates a text response

The image becomes “virtual text tokens” that the LLM processes alongside the actual text prompt.
The Pipeline
```
// VLM processing pipeline
Input: [image] + "What's in this photo?"

1. Vision Encoder
   Image → ViT → 576 patch embeddings

2. Projector
   576 visual embeddings → 576 virtual tokens
   (mapped to LLM's embedding dimension)

3. LLM
   [576 visual tokens] + [text tokens]
   → Self-attention across ALL tokens
   → Generate text response autoregressively

Output: "A golden retriever playing fetch in a park on a sunny afternoon."
```
Key insight: The LLM doesn’t “know” which tokens are visual and which are text — they’re all just vectors in the same embedding space. The magic is in the projector that makes visual information “look like” text to the LLM.
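The flow above can be traced with a toy sketch. The dimensions here are deliberately tiny and the "encoder" is random numbers; real models use hundreds of patches, ~1024-d ViT features, and ~4096-d LLM embeddings. The point is only the shape bookkeeping: after projection, visual and text tokens live in one flat sequence.

```python
import random

# Toy dimensions (illustrative only; e.g. LLaVA uses 576 patches,
# 1024-d ViT features, and a 4096-d LLM embedding space).
NUM_PATCHES, VIT_DIM, LLM_DIM = 6, 4, 8

def encode_image():
    """Stand-in for a ViT: one feature vector per image patch."""
    return [[random.random() for _ in range(VIT_DIM)] for _ in range(NUM_PATCHES)]

def project(visual_feats, weights):
    """Linear projector: map each ViT vector into the LLM embedding space."""
    return [[sum(v[i] * weights[i][j] for i in range(VIT_DIM))
             for j in range(LLM_DIM)] for v in visual_feats]

# Projector weights (learned in a real model, random here).
W = [[random.random() for _ in range(LLM_DIM)] for _ in range(VIT_DIM)]

visual_tokens = project(encode_image(), W)
text_tokens = [[random.random() for _ in range(LLM_DIM)] for _ in range(3)]

# The LLM sees one flat sequence; visual and text tokens are
# indistinguishable once they share the embedding dimension.
sequence = visual_tokens + text_tokens
print(len(sequence), len(sequence[0]))  # 9 tokens, each 8-dimensional
```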
Bolt-On vs Native Multimodal
Two fundamentally different approaches
Bolt-On (LLaVA, Early GPT-4V)
Take a pre-trained LLM and add vision capabilities by training a projector while keeping the LLM mostly frozen:

Faster to build: Leverage existing LLM capabilities
Modular: Swap vision encoder or LLM independently
Limitation: Visual understanding is “grafted on” — the LLM wasn’t designed for visual reasoning
Examples: LLaVA, InternVL, early Claude Vision
Native Multimodal (Gemini, GPT-4o)
Train the model from scratch on interleaved text, images, audio, and video:

Deeper integration: Visual reasoning is fundamental, not added on
Better performance: Stronger spatial reasoning, counting, OCR
Cross-modal transfer: Understanding in one modality improves others
Limitation: Requires massive compute and data to train from scratch
Examples: Gemini 2.5, GPT-4o
Key insight: The industry is moving from bolt-on to native multimodal. Gemini was trained multimodal from the start, which is why it excels at tasks requiring deep visual reasoning. Expect the next generation of frontier models to be natively multimodal.
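The bolt-on recipe comes down to which parameters get gradients at each stage. A minimal sketch, with hypothetical parameter names (the two-stage split mirrors LLaVA-style training; exact recipes vary by model):

```python
# Hypothetical parameter names for a bolt-on VLM (illustrative only).
param_names = [
    "vision_encoder.layer0.weight",
    "projector.fc1.weight",
    "projector.fc2.weight",
    "llm.block0.attn.weight",
    "llm.block0.mlp.weight",
]

def trainable(names, stage):
    """Stage 1 (alignment): train only the projector; ViT and LLM stay frozen.
    Stage 2 (instruction tuning): unfreeze the LLM as well (LLaVA-style).
    The vision encoder typically stays frozen throughout."""
    prefixes = ("projector.",) if stage == 1 else ("projector.", "llm.")
    return [n for n in names if n.startswith(prefixes)]

print(trainable(param_names, 1))  # only the projector weights
```

A native multimodal model has no such partition: every parameter trains on interleaved multimodal data from the start.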
Visual Reasoning Capabilities
What VLMs can actually do with images
Core Capabilities
Image description: Detailed captions with context and relationships
OCR & text extraction: Read text in images, documents, screenshots
Visual Q&A: Answer specific questions about image content
Chart/graph analysis: Extract data and trends from visualizations
Spatial reasoning: Understand positions, sizes, and relationships
Multi-image comparison: Spot differences, track changes
Code from screenshots: Generate code from UI mockups
Emergent Abilities
Humor understanding: Explain why a meme is funny
Safety analysis: Identify hazards in workplace photos
Medical reasoning: Preliminary analysis of X-rays, skin conditions
Scientific diagrams: Explain circuit diagrams, molecular structures
Cultural context: Recognize cultural references, historical context
Key insight: VLMs don’t just “see” images — they reason about them using the LLM’s world knowledge. A VLM can explain why a circuit diagram is wrong because it combines visual pattern recognition with electrical engineering knowledge from its text training.
Resolution & Token Strategies
How models handle different image sizes and detail levels
Resolution Handling
Different models handle resolution differently:

Fixed resolution: Resize all images to 224×224 or 336×336 (simple but loses detail)
Tiled/multi-crop: Split large images into tiles, process each separately, combine (GPT-4V high-res mode)
Dynamic resolution: Vary the number of visual tokens with image size and aspect ratio (Gemini, InternVL)
Any-resolution: Handle arbitrary aspect ratios without cropping (emerging)
Token Budget Tradeoffs
```
// GPT-4V resolution modes

Low detail:  Resize to 512×512 → 85 tokens
             Cost: ~$0.001 per image
             Good for: classification, general Q&A

High detail: Tile into 512×512 crops → 765+ tokens
             Cost: ~$0.01 per image
             Good for: OCR, fine details, documents

Auto:        Model decides based on image content
             Recommended for most use cases
```
Key insight: Resolution mode is the most impactful cost lever in VLM applications. A document OCR app needs high-res (10x cost) while a content moderation app works fine with low-res. Always match resolution to your actual needs.
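The token math behind those modes can be made concrete. This follows the GPT-4V rules as published at launch (fit within 2048×2048, downscale the shortest side to 768, count 512×512 tiles at 170 tokens each plus an 85-token base); the constants can change, so treat it as a cost-estimation sketch, not a billing source.

```python
import math

def gpt4v_image_tokens(width, height, detail="high"):
    """Estimate image token count under GPT-4V's published rules
    (verify constants against current provider docs)."""
    if detail == "low":
        return 85  # flat cost: image is resized to 512x512
    # High detail: first scale to fit within a 2048x2048 square...
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # ...then downscale so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Each 512x512 tile costs 170 tokens, plus an 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 170 * tiles + 85

print(gpt4v_image_tokens(1024, 1024))        # 765  (2x2 tiles)
print(gpt4v_image_tokens(2048, 4096))        # 1105 (2x3 tiles)
print(gpt4v_image_tokens(640, 480, "low"))   # 85
```

At roughly 9x the tokens, high detail is only worth paying for when the task actually needs fine detail.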
VLM Limitations & Failure Modes
Where vision-language models still struggle
Known Weaknesses
Hallucination: Confidently describing objects that aren’t in the image
Counting: Unreliable beyond ~5 objects (“How many people?”)
Spatial precision: “Is the red ball to the left or right of the blue one?” often wrong
Small text: Misreading fine print, especially at low resolution
Overlapping objects: Confusion when objects occlude each other
Adversarial images: Optical illusions and adversarial patches fool models
Mitigation Strategies
Use high-res mode for detail-critical tasks (OCR, small objects)
Crop and zoom into regions of interest before sending
Ask specific questions rather than open-ended “describe this image”
Chain-of-thought: Ask the model to reason step by step about what it sees
Multi-image verification: Send multiple views/angles for critical tasks
Grounding: Ask for bounding box coordinates to verify the model is looking at the right thing
Key insight: VLM hallucination is the visual equivalent of text hallucination — the model generates plausible-sounding descriptions of things it doesn’t actually see. Always verify critical visual claims, especially for medical, legal, or safety applications.
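The grounding strategy above lends itself to a cheap automated check: if a model reports a bounding box that is degenerate, out of frame, or implausibly tiny, treat the claim as suspect. A minimal sketch, assuming pixel-coordinate `[x0, y0, x1, y1]` boxes (coordinate conventions vary by model, and the area threshold is an arbitrary illustrative choice):

```python
def check_bbox(bbox, image_w, image_h, min_area_frac=0.0005):
    """Sanity-check a model-reported bounding box [x0, y0, x1, y1] in pixels.
    A degenerate or out-of-frame box is a strong hallucination signal."""
    x0, y0, x1, y1 = bbox
    # Corners must be ordered and inside the image.
    in_frame = 0 <= x0 < x1 <= image_w and 0 <= y0 < y1 <= image_h
    # Reject boxes too small to plausibly contain the claimed object.
    big_enough = (x1 - x0) * (y1 - y0) >= min_area_frac * image_w * image_h
    return in_frame and big_enough

print(check_bbox([100, 50, 300, 200], 640, 480))  # True: plausible box
print(check_bbox([500, 50, 300, 200], 640, 480))  # False: x0 > x1
```

Boxes that fail the check can be routed to a retry prompt or human review rather than trusted blindly.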
LLaVA: The Open-Source VLM
How the open-source community built competitive VLMs
LLaVA Architecture
LLaVA (Large Language and Vision Assistant) showed that competitive VLMs can be built with surprisingly simple ingredients:

1. Vision encoder: CLIP ViT-L/14 (frozen)
2. Projector: Simple 2-layer MLP
3. LLM: Vicuna/LLaMA (fine-tuned)

Training: first align vision-language with image-caption pairs, then instruction-tune with visual Q&A data. Total training cost: ~$100 on 8 A100 GPUs.
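LLaVA's projector really is just a small MLP. Here is a toy-dimension sketch of the two-layer GELU projector (LLaVA-1.5's `mlp2x_gelu` design); the real model maps 1024-d CLIP features to the LLM's embedding size, e.g. 4096-d for Vicuna-7B, and the weights below are random stand-ins for trained ones.

```python
import math
import random

# Toy dimensions, illustrative only.
VIT_DIM, HIDDEN, LLM_DIM = 4, 8, 6

def gelu(x):
    """Tanh-approximate GELU activation."""
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

def linear(vec, W):
    """Matrix-vector product: W has one row per input dim, one column per output dim."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*W)]

def mlp_projector(patch_feat, W1, W2):
    """Two-layer MLP: ViT feature -> hidden (GELU) -> LLM embedding."""
    hidden = [gelu(h) for h in linear(patch_feat, W1)]
    return linear(hidden, W2)

W1 = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(VIT_DIM)]
W2 = [[random.uniform(-1, 1) for _ in range(LLM_DIM)] for _ in range(HIDDEN)]
out = mlp_projector([0.1, -0.2, 0.3, 0.4], W1, W2)
print(len(out))  # one LLM-dimension vector per patch
```

That this small module is the only new component between two pre-trained networks is exactly why LLaVA was so cheap to train.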
Open-Source VLM Ecosystem
```
// Open-source VLMs (2025)

LLaVA-NeXT     Dynamic resolution, strong OCR
InternVL 2.5   Best open VLM, rivals GPT-4V
Qwen2-VL       Alibaba, strong multilingual
Phi-3-Vision   Microsoft, small but capable
Idefics 3      HuggingFace, multi-image
Pixtral        Mistral, efficient architecture

// Open VLMs now match GPT-4V (2023)
// on most benchmarks. Gap with GPT-4o
// and Gemini 2.5 is narrowing fast.
```
Key insight: LLaVA proved that VLMs don’t need billions in compute. A simple projector between a frozen vision encoder and a fine-tuned LLM produces surprisingly capable models. This democratized VLM research and enabled rapid iteration.
Practical VLM Patterns
How to use VLMs effectively in applications
Common Application Patterns
```
// Pattern 1: Document Understanding
Input:  [invoice image] + "Extract all line items"
Output: Structured JSON with items, prices, totals

// Pattern 2: Visual QA Pipeline
Input:  [product photo] + "Any defects visible?"
Output: "Scratch on upper-left corner, ~2cm long"

// Pattern 3: Multi-Image Comparison
Input:  [before] + [after] + "What changed?"
Output: Detailed diff of visual changes

// Pattern 4: Screenshot to Code
Input:  [UI mockup] + "Generate React component"
Output: Working JSX matching the design
```
Best Practices
Be specific in prompts: “Count the red cars in the parking lot” not “What do you see?”
Use system prompts: Define the model’s role and expected output format
Provide examples: Few-shot with example image-response pairs
Chain multiple calls: First identify regions of interest, then analyze each
Validate outputs: Cross-check extracted data against known constraints
Pro tip: For structured extraction (invoices, forms, tables), ask the VLM to output JSON and validate the schema. For subjective tasks (quality inspection, content moderation), use confidence scores and human review for edge cases.
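The "output JSON and validate" advice can be sketched concretely. This uses a hypothetical invoice schema (an `items` list of `description`/`price` objects plus a `total`) and checks both structure and an arithmetic constraint: the line items must sum to the stated total. Adapt the fields and checks to your own documents.

```python
import json

def validate_invoice(raw, tolerance=0.01):
    """Validate VLM-extracted invoice JSON: required keys, well-formed
    line items, and line items summing to the stated total."""
    data = json.loads(raw)
    if not isinstance(data.get("items"), list) or "total" not in data:
        return False, "missing 'items' list or 'total'"
    for item in data["items"]:
        if not isinstance(item, dict) or not {"description", "price"} <= item.keys():
            return False, f"malformed item: {item!r}"
    # Cross-check: extracted line items must add up to the extracted total.
    computed = sum(item["price"] for item in data["items"])
    if abs(computed - data["total"]) > tolerance:
        return False, f"items sum to {computed}, total says {data['total']}"
    return True, "ok"

raw = ('{"items": [{"description": "Widget", "price": 9.5},'
       ' {"description": "Gadget", "price": 2.5}], "total": 12.0}')
print(validate_invoice(raw))  # (True, 'ok')
```

A failed check is a cue to re-prompt (often in high-detail mode) or escalate to human review, rather than silently accepting the extraction.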
Key Takeaways
What to remember about vision-language models
Essential Concepts
1. Three components: Vision encoder (ViT) + Projector (MLP) + LLM backbone

2. Bolt-on vs native: Adding vision to an LLM vs training multimodal from scratch. Native is better but more expensive.

3. Visual reasoning: VLMs combine visual perception with LLM world knowledge for emergent capabilities

4. Resolution tradeoff: More tokens = better detail but higher cost. Match resolution to your task.

5. Open-source parity: InternVL, Qwen2-VL rival GPT-4V on most benchmarks
Practical Implications
• VLMs unlock document understanding, visual QA, and screenshot-to-code at scale
• Hallucination is the biggest risk — always verify critical visual claims
• Resolution mode is the most impactful cost lever
• Specific prompts dramatically outperform vague ones for visual tasks
• Open-source VLMs are production-ready for many use cases
Next up: Chapter 10 maps the full multimodal model landscape — comparing GPT-4o, Gemini, Claude, and open-source alternatives across all modalities, with guidance on choosing the right model for your use case.