Ch 12 — Multimodal LLMs

When LLMs learn to see, hear, and generate images — vision encoders, CLIP, and native multimodality
Teaching LLMs to See: Vision Encoders
Converting images into token-like representations
The Analogy
An LLM only understands tokens (numbers). To “see” an image, you need a translator that converts pixels into the same kind of vectors the LLM already understands. A Vision Transformer (ViT) does this: it splits an image into patches (like tokens), processes them through transformer layers, and outputs a sequence of visual embeddings. These visual tokens are then fed into the LLM alongside text tokens.
Key insight: A typical image becomes 256-576 visual tokens. Each token is a d_model-dimensional vector, just like a text token embedding. The LLM’s attention mechanism (Ch 3) can then attend to visual tokens exactly like text tokens. This is why the transformer architecture is so powerful — it’s modality-agnostic. Anything that can be represented as a sequence of vectors can be processed.
Vision Transformer
# ViT: Vision Transformer

# 1. Split image into patches
#    224×224 image → 14×14 grid of 16×16 patches
#    = 196 patches (visual tokens)

# 2. Linear projection: patch → vector
#    Each 16×16×3 patch → 1024-dim vector

# 3. Transformer layers (same as LLM!)
#    Self-attention between all patches
#    Output: 196 visual embedding vectors

# Popular vision encoders:
#    CLIP ViT-L/14: 304M params, 256 tokens
#    SigLIP: improved CLIP training
#    InternViT: 6B params (among the largest)
#    DINOv2: self-supervised, no text

# Higher resolution → more tokens (14×14 patches):
#    224×224: 256 tokens
#    336×336: 576 tokens
#    672×672: 2304 tokens
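The patchify-and-project step above can be sketched in a few lines of NumPy. This is an illustrative toy, not a real ViT: the projection weights are random stand-ins for trained parameters, and the transformer layers that follow are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

image = rng.standard_normal((224, 224, 3))    # H × W × C input image
patch, d_model = 16, 1024

# 1. Split into non-overlapping 16×16 patches
n = 224 // patch                              # 14 patches per side
patches = (image.reshape(n, patch, n, patch, 3)
                .transpose(0, 2, 1, 3, 4)     # group the two grid axes
                .reshape(n * n, patch * patch * 3))  # (196, 768)

# 2. Linear projection of each flattened patch to d_model dims
W = rng.standard_normal((patch * patch * 3, d_model)) * 0.02
visual_tokens = patches @ W                   # (196, 1024)

print(visual_tokens.shape)                    # (196, 1024)
```

The result is a sequence of 196 vectors, which the transformer treats exactly like a sequence of text token embeddings.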
CLIP: Connecting Vision and Language
The model that taught AI to understand images through text
The Analogy
CLIP (Contrastive Language-Image Pre-training, OpenAI 2021) is like a bilingual translator who learned both “image language” and “text language” by seeing 400 million image-caption pairs from the internet. It maps images and text into the same embedding space: a photo of a dog and the text “a photo of a dog” end up as nearby vectors. This shared space is what makes multimodal LLMs possible.
Key insight: CLIP’s training is elegant: given a batch of (image, caption) pairs, maximize the similarity between matching pairs and minimize it for non-matching pairs (contrastive learning). The result: a vision encoder whose output vectors are already semantically meaningful in a way that aligns with language. Most multimodal LLMs (LLaVA, GPT-4V) use CLIP or its successors as the vision backbone.
CLIP Architecture
# CLIP training (simplified):
#    Image encoder: ViT → image embedding
#    Text encoder: Transformer → text embedding
#    Loss: maximize cosine(img, matching_text)
#          minimize cosine(img, wrong_text)

# Trained on 400M image-text pairs
# from the internet (WebImageText)

# Result: shared embedding space
#    "dog" text vector ≈ dog photo vector
#    "sunset" text ≈ sunset photo vector

# Zero-shot classification:
#    Embed image + embed all class names
#    Pick class with highest cosine similarity
#    No task-specific training needed!

# Successors:
#    SigLIP (Google): sigmoid loss, scales better
#    EVA-CLIP: scaled up, stronger
#    MetaCLIP: curated training data
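The symmetric contrastive objective can be sketched in NumPy. Random vectors stand in for real encoder outputs; the temperature of 0.07 follows the commonly cited CLIP default. This is a sketch of the loss, not CLIP's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d = 4, 8                                   # batch size, embedding dim

img = rng.standard_normal((B, d))             # stand-in image embeddings
txt = rng.standard_normal((B, d))             # stand-in text embeddings

# Normalize so dot products are cosine similarities
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

logits = img @ txt.T / 0.07                   # B×B similarities / temperature

def cross_entropy(logits, targets):
    # mean negative log-softmax probability of the correct column
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

targets = np.arange(B)                        # matching pairs lie on the diagonal
loss = (cross_entropy(logits, targets)        # image → text direction
        + cross_entropy(logits.T, targets)) / 2   # text → image direction
```

Minimizing this loss pulls each image embedding toward its own caption's embedding and pushes it away from every other caption in the batch.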
The Adapter Approach: Bolting Vision onto LLMs
LLaVA, GPT-4V, and the “vision encoder + projection + LLM” pattern
The Analogy
The most common approach is like adding a camera to a brain: take a pre-trained vision encoder (CLIP), a pre-trained LLM (Llama), and connect them with a small projection layer (adapter). The adapter translates visual tokens into the LLM’s embedding space. Train only the adapter (and maybe fine-tune the LLM) on image-text data. This is how LLaVA, Llama 3.2 Vision, and likely GPT-4V work.
Key insight: LLaVA (Liu et al., 2023) showed this works remarkably well with just a 2-layer MLP as the adapter. The training is two-stage: (1) pre-train the adapter on image-caption pairs (align visual and text spaces), (2) fine-tune on visual instruction data (teach the model to answer questions about images). The entire process takes hours, not months.
Architecture
# Adapter approach (LLaVA-style):

# Image → [CLIP ViT] → visual tokens
#                           ↓
#                  [Projection MLP]
#                           ↓
# Text tokens → [ LLM (Llama) ] → output
#               (visual + text tokens)

# Projection: maps CLIP dim → LLM dim
#    e.g., 1024 (CLIP) → 4096 (Llama)
#    Just a 2-layer MLP!

# Models using this approach:
#    LLaVA 1.5/1.6: CLIP + Vicuna/Llama
#    Llama 3.2 Vision: ViT + Llama 3.2 (cross-attention adapter)
#    GPT-4V (likely): ViT + GPT-4
#    Qwen-VL: ViT + Qwen
#    InternVL: InternViT + InternLM
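A minimal NumPy sketch of the projector: random weights stand in for the trained adapter, and ReLU stands in for the GELU activation LLaVA 1.5 uses. The point is how small the bridge between the two pre-trained models is.

```python
import numpy as np

rng = np.random.default_rng(0)
clip_dim, llm_dim = 1024, 4096

# Two-layer MLP projector (random weights stand in for the trained adapter)
W1 = rng.standard_normal((clip_dim, llm_dim)) * 0.02
W2 = rng.standard_normal((llm_dim, llm_dim)) * 0.02

def project(v):
    h = np.maximum(v @ W1, 0.0)               # linear + ReLU (LLaVA 1.5 uses GELU)
    return h @ W2                             # linear into the LLM embedding space

visual = rng.standard_normal((196, clip_dim)) # 196 CLIP patch tokens
text = rng.standard_normal((12, llm_dim))     # 12 text token embeddings

# Projected visual tokens are prepended to the text tokens for the LLM
llm_input = np.concatenate([project(visual), text], axis=0)
print(llm_input.shape)                        # (208, 4096)
```

From the LLM's point of view, the 196 projected vectors are just 196 more tokens in its input sequence.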
Native Multimodal: Gemini’s Approach
Training on all modalities from the start
The Analogy
The adapter approach is like a person who learned English first, then learned to interpret sign language through a translator. Native multimodal (Gemini) is like a person who grew up bilingual — both languages are deeply integrated from birth. Gemini was trained from scratch on interleaved text, images, audio, and video as a single token stream. No separate vision encoder bolted on afterward.
Key insight: Gemini’s native approach means visual and textual understanding are deeply intertwined. It can reason about spatial relationships, read text in images, understand charts, and process video natively. Gemini 1.5 Pro processes up to 1 million tokens of mixed modality input — equivalent to a 1-hour video or 700,000 words. The MoE architecture (Ch 5) helps manage the compute cost of processing multiple modalities.
Comparison
# Adapter approach (GPT-4V, LLaVA):
#    ✓ Reuse existing LLM + vision encoder
#    ✓ Cheaper to train
#    ✗ Vision and language somewhat separate
#    ✗ Limited cross-modal reasoning

# Native multimodal (Gemini):
#    ✓ Deep integration of all modalities
#    ✓ Better cross-modal reasoning
#    ✓ Handles video/audio natively
#    ✗ Must train from scratch (expensive)
#    ✗ Can't easily swap components

# Modalities supported (Gemini 1.5):
#    Text: standard token stream
#    Images: patch tokens (like ViT)
#    Audio: spectrogram tokens
#    Video: frame tokens (sampled)
#    All interleaved in one sequence
Audio & Speech: Voice-Native LLMs
From speech-to-text pipelines to end-to-end voice models
The Evolution
Early voice assistants used a pipeline: speech-to-text (Whisper) → LLM → text-to-speech. Modern models process audio directly. GPT-4o handles voice natively with ~300ms average latency — fast enough for natural conversation. It can detect emotion, handle interruptions, and even sing. The audio is tokenized (like text) using a codec model that compresses waveforms into discrete tokens.
Key insight: Audio tokenization uses neural codecs like EnCodec (Meta) or SoundStream (Google) that compress audio into discrete tokens at ~50 tokens/second. A 10-second audio clip becomes ~500 tokens — the same format as text. This means the same transformer architecture processes speech, music, and sound effects. The unification of all modalities into tokens is the key architectural insight.
Audio Processing
# Pipeline approach (old):
#    Audio → [Whisper] → text → [LLM] → text
#          → [TTS] → audio
#    Latency: ~2-5 seconds (too slow)

# Native approach (GPT-4o):
#    Audio → [Codec] → audio tokens
#                          ↓
#    Text tokens → [Multimodal LLM] → output
#    Latency: ~300ms (conversational)

# Audio codecs:
#    EnCodec (Meta): ~50 tokens/sec
#    SoundStream (Google): ~50 tokens/sec
#    Whisper (for ASR): ~25 tokens/sec

# Voice-native models:
#    GPT-4o: text + vision + audio
#    Gemini 2.0: all modalities
#    Moshi (Kyutai): open-source voice LLM
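Codec tokenization can be illustrated with a toy nearest-neighbor vector quantizer. Real codecs like EnCodec use learned residual codebooks over encoder features; here the codebook, frame dimension, and codebook size are arbitrary random stand-ins, chosen only to show how continuous frames become discrete tokens at ~50 tokens/second.

```python
import numpy as np

rng = np.random.default_rng(0)

frames_per_sec, seconds = 50, 10              # ~50 tokens/sec, as in the text
d, K = 64, 1024                               # frame dim, codebook size (toy values)

frames = rng.standard_normal((frames_per_sec * seconds, d))
codebook = rng.standard_normal((K, d))        # random stand-in for a learned codebook

# Squared distance from every frame to every codebook entry
d2 = ((frames ** 2).sum(1, keepdims=True)
      - 2 * frames @ codebook.T
      + (codebook ** 2).sum(1))

tokens = d2.argmin(axis=1)                    # nearest entry → one token per frame

print(tokens.shape)                           # (500,): 10 s of audio → 500 tokens
```

Each token is now just an integer in [0, K), the same form as a text token id, so the same transformer can consume it.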
Image Generation: From Understanding to Creating
Diffusion models, DALL-E, and unified generation
The Landscape
Image generation uses a different architecture: diffusion models (DALL-E 3, Midjourney, Stable Diffusion) that start with noise and gradually denoise into an image, guided by text embeddings. The latest frontier is unified models that both understand and generate images in one model. Gemini 2.0 and GPT-4o can generate images natively. This unification means one model can see an image, reason about it, and create a modified version.
Key insight: The trend is toward any-to-any models: text→text, text→image, image→text, audio→text, text→audio, and every combination. A single model that handles all modalities as input and output. This is the vision behind GPT-4o (“o” for omni) and Gemini. We’re moving from specialized models to unified multimodal systems.
Generation Approaches
# Image generation methods:

# 1. Diffusion (DALL-E 3, SD, Midjourney)
#    Start with noise → iteratively denoise
#    Guided by CLIP text embeddings
#    ~50 denoising steps per image
#    Separate from the LLM

# 2. Autoregressive (Parti, Chameleon)
#    Tokenize images into discrete tokens
#    Generate image tokens like text tokens
#    Same architecture as LLM

# 3. Unified (Gemini 2.0, GPT-4o)
#    One model: understand + generate
#    Input: any modality
#    Output: any modality
#    The holy grail of multimodal AI

# Modality matrix (GPT-4o):
#    Text → Text   ✓ (chat)
#    Image → Text  ✓ (describe)
#    Text → Image  ✓ (generate)
#    Audio → Text  ✓ (transcribe)
#    Text → Audio  ✓ (speak)
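The iterative-denoising loop at the heart of diffusion can be caricatured in a few lines. A real model predicts the noise with a trained network conditioned on text; here an "oracle" that already knows the target stands in, purely to show the structure of the ~50-step refinement loop.

```python
import numpy as np

rng = np.random.default_rng(0)

target = rng.uniform(0, 1, size=(8, 8))       # pretend "image" to recover
x = rng.standard_normal((8, 8))               # start from pure noise

steps = 50
for t in range(steps):
    predicted_noise = x - target              # oracle stands in for the trained network
    x = x - predicted_noise / (steps - t)     # take one small denoising step

error = float(np.abs(x - target).max())
print(error)                                  # ≈ 0: the noise has been removed
```

The key structural point survives the caricature: generation is many small corrections toward a clean image, not a single forward pass.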
The Multimodal Future
Where multimodal AI is heading
What’s Next
The trajectory is clear: everything becomes tokens. Text, images, audio, video, 3D objects, actions, sensor data — all tokenized and processed by the same transformer architecture. Future models will seamlessly switch between modalities: see a diagram, explain it verbally, generate an improved version, and write code to implement it. The transformer’s modality-agnostic attention mechanism makes this possible.
Key insight: The unifying principle across all 12 chapters so far: everything is a sequence of vectors processed by attention. Text tokens, image patches, audio frames, video frames — they all become vectors in the same high-dimensional space. The transformer doesn’t care what the vectors represent. This architectural universality is why the same basic design (Ch 4) powers text, vision, audio, code, and multimodal AI.
The Unifying Principle
# Everything is tokens:
#    Text: "Hello" → [9906] → embed → vector
#    Image: 16×16 patch → linear → vector
#    Audio: 20ms frame → codec → vector
#    Video: frame → ViT → vectors
#    Code: "def f" → [755, 282] → vectors

# All fed into the same transformer:
#    [text, image, audio, text, image, ...]
#    Attention connects everything
#    The model learns cross-modal relationships

# Future capabilities:
#    - Real-time video understanding
#    - Robotic action generation
#    - 3D scene understanding
#    - Scientific data (proteins, molecules)
#    - Any modality that can be tokenized
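The unifying principle above can be made concrete with a toy mixed-modality sequence builder. All sizes, token ids, and embedding tables here are arbitrary random stand-ins for learned components; the point is that three very different inputs end up as rows of one matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512

# Random stand-ins for learned tables/projections (toy sizes)
text_embed = rng.standard_normal((1000, d_model)) * 0.02   # tiny text vocab
audio_embed = rng.standard_normal((1024, d_model)) * 0.02  # codec codebook
patch_proj = rng.standard_normal((16 * 16 * 3, d_model)) * 0.02

text_ids = np.array([42, 7, 99])                   # toy text token ids
patches = rng.standard_normal((196, 16 * 16 * 3))  # one image's flattened patches
audio_ids = rng.integers(0, 1024, size=100)        # 2 s of audio @ 50 tokens/sec

sequence = np.concatenate([
    text_embed[text_ids],     # (3, 512)   text token vectors
    patches @ patch_proj,     # (196, 512) image patch vectors
    audio_embed[audio_ids],   # (100, 512) audio token vectors
], axis=0)

print(sequence.shape)         # (299, 512): one stream, attention sees everything
```

Once everything is rows in the same matrix, the transformer's attention can relate a word to a patch to an audio frame with no special machinery.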