Ch 12 — Training Multimodal Models

Data collection, pre-training, alignment, fine-tuning, and compute requirements
High Level
Data → Pre-train → Align → Fine-tune → Compute → Deploy
Multimodal Training Data
The fuel that powers multimodal models
Data Types & Sources
Image-text pairs: LAION-5B (5.85B pairs), DataComp (12.8B), CC12M — scraped from internet alt-text
Interleaved documents: Web pages with images inline — teaches models to reason about images in context
Video-text: WebVid (10M), InternVid (234M) — video clips with descriptions
Audio-text: LibriSpeech, Common Voice, WavCaps — speech and audio with transcriptions
Instruction data: Visual Q&A, image-based conversations, chart reasoning — for instruction tuning
Data Quality Pipeline
Data curation pipeline:
1. Collect: scrape the web, license existing datasets
2. Filter: remove NSFW, low-quality, and duplicate content
3. Score: CLIP similarity (text-image alignment), aesthetic score (image quality)
4. Deduplicate: perceptual hashing, embedding clustering
5. Balance: ensure diversity across concepts
6. Caption: re-caption with a VLM for better text

Data quality > data quantity: 1B high-quality pairs beat 5B noisy pairs.
Key insight: Data curation is the most impactful and least glamorous part of training multimodal models. Teams that invest in data quality consistently outperform those with more compute but worse data. “Garbage in, garbage out” applies 10x to multimodal training.
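Steps 2-4 of the pipeline above can be sketched in a few lines of plain Python. This is a toy sketch: the `clip_score` field and the 0.28 threshold stand in for a real CLIP similarity model, and caption hashing stands in for perceptual image hashing.

```python
import hashlib

def curate(pairs, min_clip_score=0.28):
    """Filter image-text pairs by alignment score, then deduplicate.

    `pairs` is a list of dicts with keys: url, caption, clip_score.
    The threshold and hash-based dedup are illustrative stand-ins for
    a real CLIP model and perceptual image hashing.
    """
    seen = set()
    kept = []
    for p in pairs:
        if p["clip_score"] < min_clip_score:   # drop poorly aligned pairs
            continue
        key = hashlib.sha256(p["caption"].lower().encode()).hexdigest()
        if key in seen:                        # drop duplicate captions
            continue
        seen.add(key)
        kept.append(p)
    return kept

pairs = [
    {"url": "a.jpg", "caption": "A red bicycle", "clip_score": 0.35},
    {"url": "b.jpg", "caption": "IMG_0042.JPG",  "clip_score": 0.11},  # bad alt-text
    {"url": "c.jpg", "caption": "a red bicycle", "clip_score": 0.33},  # duplicate
]
print(len(curate(pairs)))  # 1: only the first pair survives
```

A production pipeline runs the same logic over billions of pairs, typically as a distributed batch job with the scoring model on GPUs.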
Pre-Training Strategies
How models learn to connect modalities
Contrastive Pre-Training (CLIP-style)
Train separate encoders for each modality, align them with contrastive loss. Pull matching pairs together, push non-matching apart. Produces embedding models for search and classification. Fast to train, scales well, but doesn’t generate text.
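The contrastive objective can be sketched as a symmetric InfoNCE loss over a toy batch. This is pure Python for clarity; real implementations operate on GPU tensors, and the embeddings and temperature here are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss on a batch of matched (image, text) embeddings.

    Row i of img_emb matches row i of txt_emb; all other rows in the
    batch act as negatives. Embeddings are assumed L2-normalized.
    """
    n = len(img_emb)
    # similarity matrix scaled by temperature
    logits = [[dot(i, t) / temperature for t in txt_emb] for i in img_emb]

    def ce(row, target):  # cross-entropy of a softmax row against its match
        z = sum(math.exp(x) for x in row)
        return -math.log(math.exp(row[target]) / z)

    i2t = sum(ce(logits[i], i) for i in range(n)) / n  # image -> text
    t2i = sum(ce([logits[j][i] for j in range(n)], i) for i in range(n)) / n
    return (i2t + t2i) / 2

# Matching pairs point the same way; mismatched pairs are near-orthogonal.
imgs = [[1.0, 0.0], [0.0, 1.0]]
txts = [[0.98, 0.2], [0.2, 0.98]]
print(clip_loss(imgs, txts))  # small: the pairs are already aligned
```

Training minimizes this loss over huge batches (tens of thousands of pairs), which is why contrastive pre-training scales so well.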
Generative Pre-Training (LLaVA-style)
Freeze a pre-trained vision encoder and LLM, train a projector to connect them. Then fine-tune the LLM on visual instruction data. Produces VLMs that can reason about images and generate text responses. Cheaper than training from scratch.
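The projector that connects the frozen pieces is typically just a small MLP. Below is a toy pure-Python sketch; real dimensions are on the order of 1024 → 4096, and the activation is usually GELU rather than the ReLU used here.

```python
import random

random.seed(0)

def linear(x, w, b):
    """y = W x + b, with W stored as a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def make_projector(d_vision, d_llm, d_hidden=8):
    """Two-layer MLP projector (LLaVA-1.5 style): vision dim -> LLM dim.

    Only these weights are trained in stage 1; the vision encoder and
    LLM stay frozen. Dimensions here are toy values.
    """
    def rand_mat(rows, cols):
        return [[random.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(rows)]
    w1, b1 = rand_mat(d_hidden, d_vision), [0.0] * d_hidden
    w2, b2 = rand_mat(d_llm, d_hidden), [0.0] * d_llm

    def project(vision_feature):
        h = [max(0.0, v) for v in linear(vision_feature, w1, b1)]  # ReLU here; GELU in practice
        return linear(h, w2, b2)
    return project

project = make_projector(d_vision=4, d_llm=6)
visual_token = project([0.2, -0.5, 0.1, 0.9])  # one patch feature -> one LLM "soft token"
print(len(visual_token))  # 6
```

Each image patch feature becomes one "soft token" in the LLM's embedding space, which the LLM then attends over exactly like text tokens.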
Native Multimodal (Gemini-style)
Train the entire model from scratch on interleaved multimodal data. Text, images, audio, and video are all tokenized and processed by a single Transformer. Produces the most capable models but requires enormous compute ($50M–$500M per training run).
Key insight: The three strategies represent a cost-capability tradeoff. Contrastive (cheapest, embeddings only) → Generative (moderate, bolt-on VLM) → Native (most expensive, best capability). Most practitioners use the generative approach — it’s the sweet spot of cost and capability.
Vision-Language Alignment
Teaching the LLM to understand visual tokens
Two-Stage Training
Stage 1 — Alignment pre-training:
Train only the projector on image-caption pairs. The vision encoder and LLM are frozen. This teaches the projector to map visual features into the LLM’s embedding space. ~600K image-caption pairs, ~5 hours on 8 GPUs.

Stage 2 — Visual instruction tuning:
Unfreeze the LLM (or use LoRA) and train on visual instruction data: image-based Q&A, conversations, reasoning tasks. ~150K instruction examples, ~20 hours on 8 GPUs.
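A quick calculation shows why stage 1 is so cheap: only the projector trains. The component sizes below are illustrative (a ViT-L-scale vision encoder, a 7B LLM, and a ~20M-parameter MLP projector).

```python
def trainable_fraction(stage):
    """Rough trainable-parameter fraction for each LLaVA-style stage.

    Component sizes are illustrative: ViT-L vision encoder (~304M),
    7B LLM, ~20M-parameter MLP projector.
    """
    params = {"vision_encoder": 304e6, "projector": 20e6, "llm": 7e9}
    trainable = {
        "stage1_alignment": {"projector"},           # encoder + LLM frozen
        "stage2_instruction": {"projector", "llm"},  # unfreeze LLM (or use LoRA)
    }[stage]
    total = sum(params.values())
    return sum(params[k] for k in trainable) / total

print(f"{trainable_fraction('stage1_alignment'):.3%}")   # well under 1%
print(f"{trainable_fraction('stage2_instruction'):.1%}")
```

Stage 1 updates well under 1% of the parameters, which is why it fits in a few GPU-hours; stage 2 is where most of the compute goes.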
Instruction Data Generation
How to create visual instruction data:

Method 1: GPT-4V bootstrapping. Send images to GPT-4V with diverse prompts and collect high-quality Q&A pairs. Cost: ~$0.01 per example.
Method 2: Template-based. Use existing annotations (bounding boxes, captions) to fill Q&A templates, e.g. "What object is at [x,y]?" → "[label]". Cost: nearly free.
Method 3: Human annotation. Highest quality, most expensive (~$0.50–2.00 per example). Best for domain-specific data.
Key insight: The LLaVA recipe (frozen ViT + MLP projector + fine-tuned LLM) can be replicated for ~$100 in compute. This democratized VLM training — you don’t need Google-scale resources to build a capable vision-language model.
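Method 2 above (template-based generation from existing annotations) can be sketched directly; the annotation format and templates here are illustrative, not from any specific dataset.

```python
def qa_from_boxes(annotations):
    """Turn bounding-box annotations into Q&A pairs via templates.

    `annotations` maps an image id to a list of (label, x, y) detections;
    the templates are illustrative stand-ins.
    """
    templates = [
        ("What object is at ({x}, {y})?", "{label}"),
        ("Is there a {label} in the image?", "Yes"),
    ]
    examples = []
    for image_id, boxes in annotations.items():
        for label, x, y in boxes:
            for q, a in templates:
                examples.append({
                    "image": image_id,
                    "question": q.format(label=label, x=x, y=y),
                    "answer": a.format(label=label),
                })
    return examples

data = qa_from_boxes({"img_001": [("dog", 120, 80), ("ball", 40, 200)]})
print(len(data))  # 2 boxes x 2 templates = 4 examples
print(data[0]["question"])
```

Because the answers come from ground-truth annotations, this data is essentially free and never hallucinated, at the cost of limited linguistic variety.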
Fine-Tuning for Your Domain
Adapting multimodal models to specific use cases
Fine-Tuning Approaches
Full fine-tuning: Update all parameters. Best quality but expensive and risks catastrophic forgetting.
LoRA: Add small trainable matrices to attention layers. 10–100x cheaper, minimal quality loss. The default choice.
QLoRA: LoRA on a 4-bit quantized model. Enables fine-tuning on consumer GPUs (24GB VRAM).
Adapter tuning: Add small adapter modules between layers. Similar to LoRA but different architecture.
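The "10–100x cheaper" claim for LoRA falls out of simple parameter counting. The sketch below assumes LoRA on the four attention projections of a 7B-scale model; the dimensions are typical but illustrative.

```python
def lora_params(d_model, rank, n_layers, matrices_per_layer=4):
    """Trainable parameters added by LoRA: two low-rank factors
    (d x r and r x d) per adapted weight matrix. Assumes LoRA on the
    four attention projections (Q, K, V, O) of every layer.
    """
    return n_layers * matrices_per_layer * 2 * d_model * rank

full = 7e9  # full fine-tune: all 7B parameters
lora = lora_params(d_model=4096, rank=64, n_layers=32)
print(f"LoRA trains {lora/1e6:.0f}M params ({lora/full:.2%} of full fine-tuning)")
```

At rank 64 that is roughly 67M trainable parameters, under 1% of the model, which is what lets a single consumer GPU hold the optimizer state.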
Domain Fine-Tuning Recipe
Fine-tuning a VLM for your domain:
1. Collect data: 500–5,000 domain-specific image-text pairs (image + question + answer)
2. Choose a base model: LLaVA-NeXT, InternVL, or Qwen2-VL
3. Fine-tune with QLoRA: rank 64, alpha 128, learning rate 2e-5, 3–5 epochs, on 1x A100 or 1x RTX 4090
4. Evaluate: test on held-out domain examples and compare against the base model

Time: 2–8 hours. Cost: $5–50 (cloud GPU).
Key insight: Domain fine-tuning is the highest-leverage activity for most teams. A fine-tuned 7B model often outperforms GPT-4V on domain-specific tasks. Medical imaging, satellite analysis, product inspection — fine-tuning makes the difference between “interesting demo” and “production-ready.”
Compute Requirements
What it actually costs to train multimodal models
Training Costs by Scale
Approximate training costs (2025):

LoRA fine-tune (7B VLM): 1x A100 (80GB), 2–8 hours, $5–50
Full VLM training (LLaVA-style, 7B): 8x A100, 1–3 days, $500–3,000
CLIP training (ViT-L): 256x A100, 2–4 weeks, $100K–500K
Frontier model (Gemini-scale): 10,000+ TPUs/GPUs, 2–6 months, $50M–500M
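All of these figures reduce to the same back-of-envelope formula: GPU count times wall-clock hours times hourly rate. A sketch, using an illustrative $2/GPU-hour on-demand A100 price:

```python
def training_cost(n_gpus, hours, usd_per_gpu_hour=2.0):
    """Back-of-envelope cloud cost: GPU count x wall-clock hours x hourly rate.

    $2/hr is an illustrative on-demand A100 price; real rates vary widely
    and exclude storage, networking, and failed runs.
    """
    return n_gpus * hours * usd_per_gpu_hour

# Rough sanity checks against the table above
print(training_cost(1, 8))          # LoRA fine-tune: $16
print(training_cost(8, 72))         # LLaVA-style run: $1,152
print(training_cost(256, 21 * 24))  # CLIP ViT-L: $258,048
```

Real budgets add a healthy margin for restarts, hyperparameter sweeps, and evaluation runs, which often double the headline number.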
Practical Guidance
Most teams: Fine-tune existing open-source VLMs with LoRA ($5–50)
Startups: Train a LLaVA-style VLM on domain data ($500–3K)
Large companies: Train custom CLIP or VLM from scratch ($100K–500K)
AI labs: Train frontier multimodal models ($50M+)

The key insight: you almost never need to train from scratch. Fine-tuning open-source models gets you 90% of the way at 0.01% of the cost.
Key insight: The cost curve for multimodal training is steep: going from “good enough” to “state of the art” costs roughly 1000x more. For most applications, a fine-tuned open-source model at $50 outperforms a frontier model API at $50K/month in total cost.
RLHF & Safety Alignment
Making multimodal models safe and helpful
Multimodal RLHF
RLHF (Reinforcement Learning from Human Feedback) for multimodal models follows the same pattern as text-only:

1. Supervised fine-tuning: Train on high-quality visual instruction data
2. Reward model: Train a model to predict human preferences for visual responses
3. PPO/DPO: Optimize the VLM toward human preferences (PPO maximizes the reward model’s score; DPO trains directly on preference pairs, skipping the explicit reward model)

The challenge: multimodal RLHF requires human annotators who can evaluate both visual accuracy and text quality simultaneously.
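The DPO variant of step 3 can be written down in a few lines, given sequence log-probabilities from the policy and a frozen reference model. The values below are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    logp_* come from the policy (the VLM being tuned), ref_* from a
    frozen reference copy. Minimizing this pushes the policy to prefer
    the human-chosen response over the rejected one, relative to the
    reference.
    """
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))  # -log(sigmoid(beta * margin))

# Policy already prefers the chosen response more than the reference does
print(round(dpo_loss(-12.0, -20.0, -14.0, -18.0), 3))  # ≈ 0.513
```

For multimodal DPO the preference pairs are (image, prompt, chosen response, rejected response) tuples; the loss itself is unchanged.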
Safety Considerations
Visual hallucination: Model describes objects not in the image — requires visual grounding training
Harmful content: Model generates or describes NSFW/violent content from images
Bias: Model makes stereotypical assumptions based on visual appearance
Privacy: Model identifies real people or extracts personal information from images
Jailbreaks: Adversarial images that bypass safety filters
Key insight: Safety alignment for multimodal models is harder than for text-only models because the attack surface is larger. An adversarial image can bypass text-based safety filters. This is an active research area with no complete solutions yet.
Deployment & Optimization
Getting multimodal models into production
Optimization Techniques
Quantization: INT8 or INT4 reduces memory 2–4x with minimal quality loss. Essential for deployment.
KV-cache optimization: Visual tokens consume KV-cache space — compress or evict old visual tokens
Flash Attention: 2–4x faster attention computation, essential for long visual sequences
Speculative decoding: Use a small draft model to speed up generation 2–3x
Batching: Process multiple images simultaneously for throughput
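The KV-cache pressure from visual tokens is easy to quantify. A sketch with illustrative 7B-model dimensions (grouped-query attention, fp16 cache values):

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_val=2):
    """KV-cache memory for one sequence: 2 (K and V) x layers x kv heads
    x head_dim x tokens x bytes per value. Defaults are illustrative for
    a 7B model with grouped-query attention and an fp16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_val

per_image = kv_cache_bytes(576)  # e.g. a 24x24 patch grid -> 576 visual tokens
print(f"{per_image / 2**20:.0f} MiB of KV-cache per image")  # 72 MiB
```

At tens of MiB per image, a modest batch of image-heavy conversations can exhaust GPU memory, which is why visual-token compression and eviction matter.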
Serving Infrastructure
Production serving stack:

vLLM: best for VLM serving (PagedAttention)
TGI: Hugging Face, easy deployment
TensorRT: NVIDIA, maximum GPU efficiency
Ollama: simple local deployment
SGLang: optimized for multimodal

Typical latency targets: time to first token <500ms, 30–60 tokens per second, image processing <200ms.
Pro tip: vLLM with INT4 quantization is the default production stack for self-hosted VLMs. It handles batching, KV-cache management, and continuous batching automatically. Start here unless you have specific requirements.
Key Takeaways
What to remember about training multimodal models
Essential Concepts
1. Data quality > quantity: Curated data with good text-image alignment beats raw scale

2. Three training strategies: Contrastive (embeddings), Generative (bolt-on VLM), Native (from scratch)

3. Two-stage alignment: Projector pre-training + visual instruction tuning

4. Fine-tuning is king: LoRA/QLoRA on open-source VLMs costs $5–50 and often beats API models on domain tasks

5. You almost never need to train from scratch: Fine-tuning gets 90% of the way at 0.01% of the cost
For Practitioners
Start with API models (GPT-4V, Gemini) to validate your use case
Switch to open-source when you need cost reduction, privacy, or customization
Fine-tune with LoRA on 500–5,000 domain-specific examples
Deploy with vLLM + INT4 quantization for production serving
Monitor and iterate: Collect production data for continuous improvement
Next up: Chapter 13 puts it all together — building multimodal applications end-to-end, from architecture patterns to production deployment.