Ch 12 — Training Multimodal Models

Data collection, pre-training, alignment, fine-tuning, and compute requirements
High Level
Data → Pre-train → Align → Fine-tune → Compute → Deploy
Multimodal Training Data
The fuel that powers multimodal models
Data Types & Sources
Image-text pairs: LAION-5B (5.85B pairs), DataComp (12.8B), CC12M — scraped from internet alt-text
Interleaved documents: Web pages with images inline — teaches models to reason about images in context
Video-text: WebVid (10M), InternVid (234M) — video clips with descriptions
Audio-text: LibriSpeech, Common Voice, WavCaps — speech and audio with transcriptions
Instruction data: Visual Q&A, image-based conversations, chart reasoning — for instruction tuning
Data Quality Pipeline
Data curation pipeline:
1. Collect: scrape the web, license existing datasets
2. Filter: remove NSFW, low-quality, and duplicate content
3. Score: CLIP similarity (text-image alignment), aesthetic score (image quality)
4. Deduplicate: perceptual hashing, embedding clustering
5. Balance: ensure diversity across concepts
6. Caption: re-caption with a VLM for better text

Data quality > data quantity: 1B high-quality pairs beat 5B noisy pairs.
Key insight: Data curation is the most impactful and least glamorous part of training multimodal models. Teams that invest in data quality consistently outperform those with more compute but worse data. “Garbage in, garbage out” applies 10x to multimodal training.
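Steps 2-4 of the pipeline above can be sketched in a few lines of plain Python. This is a toy sketch: the `clip_score` field and the 0.28 threshold stand in for a real CLIP similarity model, and caption hashing stands in for perceptual image hashing.

```python
import hashlib

def curate(pairs, min_clip_score=0.28):
    """Filter image-text pairs by alignment score, then deduplicate.

    `pairs` is a list of dicts with keys: url, caption, clip_score.
    The threshold and hash-based dedup are illustrative stand-ins for
    a real CLIP model and perceptual image hashing.
    """
    seen = set()
    kept = []
    for p in pairs:
        if p["clip_score"] < min_clip_score:   # drop poorly aligned pairs
            continue
        key = hashlib.sha256(p["caption"].lower().encode()).hexdigest()
        if key in seen:                        # drop duplicate captions
            continue
        seen.add(key)
        kept.append(p)
    return kept

pairs = [
    {"url": "a.jpg", "caption": "A red bicycle", "clip_score": 0.35},
    {"url": "b.jpg", "caption": "IMG_0042.JPG",  "clip_score": 0.11},  # bad alt-text
    {"url": "c.jpg", "caption": "a red bicycle", "clip_score": 0.33},  # duplicate
]
print(len(curate(pairs)))  # 1: only the first pair survives
```

A production pipeline runs the same logic over billions of pairs, typically as a distributed batch job with the scoring model on GPUs.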
Pre-Training Strategies
How models learn to connect modalities
Contrastive Pre-Training (CLIP-style)
Train separate encoders for each modality, align them with contrastive loss. Pull matching pairs together, push non-matching apart. Produces embedding models for search and classification. Fast to train, scales well, but doesn’t generate text.
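The contrastive objective can be sketched as a symmetric InfoNCE loss over a toy batch. This is pure Python for clarity; real implementations operate on GPU tensors, and the embeddings and temperature here are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss on a batch of matched (image, text) embeddings.

    Row i of img_emb matches row i of txt_emb; all other rows in the
    batch act as negatives. Embeddings are assumed L2-normalized.
    """
    n = len(img_emb)
    # similarity matrix scaled by temperature
    logits = [[dot(i, t) / temperature for t in txt_emb] for i in img_emb]

    def ce(row, target):  # cross-entropy of a softmax row against its match
        z = sum(math.exp(x) for x in row)
        return -math.log(math.exp(row[target]) / z)

    i2t = sum(ce(logits[i], i) for i in range(n)) / n  # image -> text
    t2i = sum(ce([logits[j][i] for j in range(n)], i) for i in range(n)) / n
    return (i2t + t2i) / 2

# Matching pairs point the same way; mismatched pairs are near-orthogonal.
imgs = [[1.0, 0.0], [0.0, 1.0]]
txts = [[0.98, 0.2], [0.2, 0.98]]
print(clip_loss(imgs, txts))  # small: the pairs are already aligned
```

Training minimizes this loss over huge batches (tens of thousands of pairs), which is why contrastive pre-training scales so well.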
Generative Pre-Training (LLaVA-style)
Freeze a pre-trained vision encoder and LLM, train a projector to connect them. Then fine-tune the LLM on visual instruction data. Produces VLMs that can reason about images and generate text responses. Cheaper than training from scratch.
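The projector that connects the frozen pieces is typically just a small MLP. Below is a toy pure-Python sketch; real dimensions are on the order of 1024 → 4096, and the activation is usually GELU rather than the ReLU used here.

```python
import random

random.seed(0)

def linear(x, w, b):
    """y = W x + b, with W stored as a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def make_projector(d_vision, d_llm, d_hidden=8):
    """Two-layer MLP projector (LLaVA-1.5 style): vision dim -> LLM dim.

    Only these weights are trained in stage 1; the vision encoder and
    LLM stay frozen. Dimensions here are toy values.
    """
    def rand_mat(rows, cols):
        return [[random.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(rows)]
    w1, b1 = rand_mat(d_hidden, d_vision), [0.0] * d_hidden
    w2, b2 = rand_mat(d_llm, d_hidden), [0.0] * d_llm

    def project(vision_feature):
        h = [max(0.0, v) for v in linear(vision_feature, w1, b1)]  # ReLU here; GELU in practice
        return linear(h, w2, b2)
    return project

project = make_projector(d_vision=4, d_llm=6)
visual_token = project([0.2, -0.5, 0.1, 0.9])  # one patch feature -> one LLM "soft token"
print(len(visual_token))  # 6
```

Each image patch feature becomes one "soft token" in the LLM's embedding space, which the LLM then attends over exactly like text tokens.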
Native Multimodal (Gemini-style)
Train the entire model from scratch on interleaved multimodal data. Text, images, audio, and video are all tokenized and processed by a single Transformer. Produces the most capable models but requires enormous compute ($50M–$500M per training run).
Key insight: The three strategies represent a cost-capability tradeoff. Contrastive (cheapest, embeddings only) → Generative (moderate, bolt-on VLM) → Native (most expensive, best capability). Most practitioners use the generative approach — it’s the sweet spot of cost and capability.
Vision-Language Alignment
Teaching the LLM to understand visual tokens
Two-Stage Training
Stage 1 — Alignment pre-training:
Train only the projector on image-caption pairs. The vision encoder and LLM are frozen. This teaches the projector to map visual features into the LLM’s embedding space. ~600K image-caption pairs, ~5 hours on 8 GPUs.

Stage 2 — Visual instruction tuning:
Unfreeze the LLM (or use LoRA) and train on visual instruction data: image-based Q&A, conversations, reasoning tasks. ~150K instruction examples, ~20 hours on 8 GPUs.
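A quick calculation shows why stage 1 is so cheap: only the projector trains. The component sizes below are illustrative (a ViT-L-scale vision encoder, a 7B LLM, and a ~20M-parameter MLP projector).

```python
def trainable_fraction(stage):
    """Rough trainable-parameter fraction for each LLaVA-style stage.

    Component sizes are illustrative: ViT-L vision encoder (~304M),
    7B LLM, ~20M-parameter MLP projector.
    """
    params = {"vision_encoder": 304e6, "projector": 20e6, "llm": 7e9}
    trainable = {
        "stage1_alignment": {"projector"},           # encoder + LLM frozen
        "stage2_instruction": {"projector", "llm"},  # unfreeze LLM (or use LoRA)
    }[stage]
    total = sum(params.values())
    return sum(params[k] for k in trainable) / total

print(f"{trainable_fraction('stage1_alignment'):.3%}")   # well under 1%
print(f"{trainable_fraction('stage2_instruction'):.1%}")
```

Stage 1 updates well under 1% of the parameters, which is why it fits in a few GPU-hours; stage 2 is where most of the compute goes.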
Instruction Data Generation
How to create visual instruction data:

Method 1: GPT-4V bootstrapping. Send images to GPT-4V with diverse prompts and collect high-quality Q&A pairs. Cost: ~$0.01 per example.
Method 2: Template-based. Use existing annotations (bounding boxes, captions) to fill Q&A templates, e.g. "What object is at [x,y]?" → "[label]". Cost: nearly free.
Method 3: Human annotation. Highest quality, most expensive (~$0.50–2.00 per example). Best for domain-specific data.
Key insight: The LLaVA recipe (frozen ViT + MLP projector + fine-tuned LLM) can be replicated for ~$100 in compute. This democratized VLM training — you don’t need Google-scale resources to build a capable vision-language model.
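Method 2 above (template-based generation from existing annotations) can be sketched directly; the annotation format and templates here are illustrative, not from any specific dataset.

```python
def qa_from_boxes(annotations):
    """Turn bounding-box annotations into Q&A pairs via templates.

    `annotations` maps an image id to a list of (label, x, y) detections;
    the templates are illustrative stand-ins.
    """
    templates = [
        ("What object is at ({x}, {y})?", "{label}"),
        ("Is there a {label} in the image?", "Yes"),
    ]
    examples = []
    for image_id, boxes in annotations.items():
        for label, x, y in boxes:
            for q, a in templates:
                examples.append({
                    "image": image_id,
                    "question": q.format(label=label, x=x, y=y),
                    "answer": a.format(label=label),
                })
    return examples

data = qa_from_boxes({"img_001": [("dog", 120, 80), ("ball", 40, 200)]})
print(len(data))  # 2 boxes x 2 templates = 4 examples
print(data[0]["question"])
```

Because the answers come from ground-truth annotations, this data is essentially free and never hallucinated, at the cost of limited linguistic variety.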
Fine-Tuning for Your Domain
Adapting multimodal models to specific use cases
Fine-Tuning Approaches
Full fine-tuning: Update all parameters. Best quality but expensive and risks catastrophic forgetting.
LoRA: Add small trainable matrices to attention layers. 10–100x cheaper, minimal quality loss. The default choice.
QLoRA: LoRA on a 4-bit quantized model. Enables fine-tuning on consumer GPUs (24GB VRAM).
Adapter tuning: Add small adapter modules between layers. Similar to LoRA but different architecture.
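The "10–100x cheaper" claim for LoRA falls out of simple parameter counting. The sketch below assumes LoRA on the four attention projections of a 7B-scale model; the dimensions are typical but illustrative.

```python
def lora_params(d_model, rank, n_layers, matrices_per_layer=4):
    """Trainable parameters added by LoRA: two low-rank factors
    (d x r and r x d) per adapted weight matrix. Assumes LoRA on the
    four attention projections (Q, K, V, O) of every layer.
    """
    return n_layers * matrices_per_layer * 2 * d_model * rank

full = 7e9  # full fine-tune: all 7B parameters
lora = lora_params(d_model=4096, rank=64, n_layers=32)
print(f"LoRA trains {lora/1e6:.0f}M params ({lora/full:.2%} of full fine-tuning)")
```

At rank 64 that is roughly 67M trainable parameters, under 1% of the model, which is what lets a single consumer GPU hold the optimizer state.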
Domain Fine-Tuning Recipe
Fine-tuning a VLM for your domain:
1. Collect data: 500–5,000 domain-specific image-text pairs (image + question + answer)
2. Choose a base model: LLaVA-NeXT, InternVL, or Qwen2-VL
3. Fine-tune with QLoRA: rank 64, alpha 128, learning rate 2e-5, 3–5 epochs, on 1x A100 or 1x RTX 4090
4. Evaluate: test on held-out domain examples and compare against the base model

Time: 2–8 hours. Cost: $5–50 (cloud GPU).
Key insight: Domain fine-tuning is the highest-leverage activity for most teams. A fine-tuned 7B model often outperforms GPT-4V on domain-specific tasks. Medical imaging, satellite analysis, product inspection — fine-tuning makes the difference between “interesting demo” and “production-ready.”
Compute Requirements
What it actually costs to train multimodal models
Training Costs by Scale
Approximate training costs (2025):

LoRA fine-tune (7B VLM): 1x A100 (80GB), 2–8 hours, $5–50
Full VLM training (LLaVA-style, 7B): 8x A100, 1–3 days, $500–3,000
CLIP training (ViT-L): 256x A100, 2–4 weeks, $100K–500K
Frontier model (Gemini-scale): 10,000+ TPUs/GPUs, 2–6 months, $50M–500M
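All of these figures reduce to the same back-of-envelope formula: GPU count times wall-clock hours times hourly rate. A sketch, using an illustrative $2/GPU-hour on-demand A100 price:

```python
def training_cost(n_gpus, hours, usd_per_gpu_hour=2.0):
    """Back-of-envelope cloud cost: GPU count x wall-clock hours x hourly rate.

    $2/hr is an illustrative on-demand A100 price; real rates vary widely
    and exclude storage, networking, and failed runs.
    """
    return n_gpus * hours * usd_per_gpu_hour

# Rough sanity checks against the table above
print(training_cost(1, 8))          # LoRA fine-tune: $16
print(training_cost(8, 72))         # LLaVA-style run: $1,152
print(training_cost(256, 21 * 24))  # CLIP ViT-L: $258,048
```

Real budgets add a healthy margin for restarts, hyperparameter sweeps, and evaluation runs, which often double the headline number.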
Practical Guidance
Most teams: Fine-tune existing open-source VLMs with LoRA ($5–50)
Startups: Train a LLaVA-style VLM on domain data ($500–3K)
Large companies: Train custom CLIP or VLM from scratch ($100K–500K)
AI labs: Train frontier multimodal models ($50M+)

The key insight: you almost never need to train from scratch. Fine-tuning open-source models gets you 90% of the way at 0.01% of the cost.
Key insight: The cost curve for multimodal training is steep: going from “good enough” to “state of the art” costs roughly 1000x more. For most applications, a fine-tuned open-source model at $50 outperforms a frontier model API at $50K/month in total cost.
RLHF & Safety Alignment
Making multimodal models safe and helpful
Multimodal RLHF
RLHF (Reinforcement Learning from Human Feedback) for multimodal models follows the same pattern as text-only:

1. Supervised fine-tuning: Train on high-quality visual instruction data
2. Reward model: Train a model to predict human preferences for visual responses
3. PPO/DPO: Optimize the VLM toward human preferences (PPO maximizes the reward model’s score; DPO trains directly on preference pairs, skipping the explicit reward model)

The challenge: multimodal RLHF requires human annotators who can evaluate both visual accuracy and text quality simultaneously.
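The DPO variant of step 3 can be written down in a few lines, given sequence log-probabilities from the policy and a frozen reference model. The values below are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    logp_* come from the policy (the VLM being tuned), ref_* from a
    frozen reference copy. Minimizing this pushes the policy to prefer
    the human-chosen response over the rejected one, relative to the
    reference.
    """
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))  # -log(sigmoid(beta * margin))

# Policy already prefers the chosen response more than the reference does
print(round(dpo_loss(-12.0, -20.0, -14.0, -18.0), 3))  # ≈ 0.513
```

For multimodal DPO the preference pairs are (image, prompt, chosen response, rejected response) tuples; the loss itself is unchanged.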
Safety Considerations
Visual hallucination: Model describes objects not in the image — requires visual grounding training
Harmful content: Model generates or describes NSFW/violent content from images
Bias: Model makes stereotypical assumptions based on visual appearance
Privacy: Model identifies real people or extracts personal information from images
Jailbreaks: Adversarial images that bypass safety filters
Key insight: Safety alignment for multimodal models is harder than for text-only models because the attack surface is larger. An adversarial image can bypass text-based safety filters. This is an active research area with no complete solutions yet.
Deployment & Optimization
Getting multimodal models into production
Optimization Techniques
Quantization: INT8 or INT4 reduces memory 2–4x with minimal quality loss. Essential for deployment.
KV-cache optimization: Visual tokens consume KV-cache space — compress or evict old visual tokens
Flash Attention: 2–4x faster attention computation, essential for long visual sequences
Speculative decoding: Use a small draft model to speed up generation 2–3x
Batching: Process multiple images simultaneously for throughput
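The KV-cache pressure from visual tokens is easy to quantify. A sketch with illustrative 7B-model dimensions (grouped-query attention, fp16 cache values):

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_val=2):
    """KV-cache memory for one sequence: 2 (K and V) x layers x kv heads
    x head_dim x tokens x bytes per value. Defaults are illustrative for
    a 7B model with grouped-query attention and an fp16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_val

per_image = kv_cache_bytes(576)  # e.g. a 24x24 patch grid -> 576 visual tokens
print(f"{per_image / 2**20:.0f} MiB of KV-cache per image")  # 72 MiB
```

At tens of MiB per image, a modest batch of image-heavy conversations can exhaust GPU memory, which is why visual-token compression and eviction matter.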
Serving Infrastructure
Production serving stack:

vLLM: best for VLM serving (PagedAttention)
TGI: Hugging Face, easy deployment
TensorRT: NVIDIA, maximum GPU efficiency
Ollama: simple local deployment
SGLang: optimized for multimodal

Typical latency targets: time to first token <500ms, 30–60 tokens per second, image processing <200ms.
Pro tip: vLLM with INT4 quantization is the default production stack for self-hosted VLMs. It handles batching, KV-cache management, and continuous batching automatically. Start here unless you have specific requirements.
Key Takeaways
What to remember about training multimodal models
Essential Concepts
1. Data quality > quantity: Curated data with good text-image alignment beats raw scale

2. Three training strategies: Contrastive (embeddings), Generative (bolt-on VLM), Native (from scratch)

3. Two-stage alignment: Projector pre-training + visual instruction tuning

4. Fine-tuning is king: LoRA/QLoRA on open-source VLMs costs $5–50 and often beats API models on domain tasks

5. You almost never need to train from scratch: Fine-tuning gets 90% of the way at 0.01% of the cost
For Practitioners
Start with API models (GPT-4V, Gemini) to validate your use case
Switch to open-source when you need cost reduction, privacy, or customization
Fine-tune with LoRA on 500–5,000 domain-specific examples
Deploy with vLLM + INT4 quantization for production serving
Monitor and iterate: Collect production data for continuous improvement
Next up: Chapter 13 puts it all together — building multimodal applications end-to-end, from architecture patterns to production deployment.