Contrastive Pre-Training (CLIP-style)
Train a separate encoder per modality and align them with a contrastive loss: pull matching (image, text) pairs together in a shared embedding space, push non-matching pairs apart. Produces embedding models for search and classification. Fast to train and scales well, but doesn't generate text.
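The alignment objective can be sketched as a symmetric InfoNCE loss over a batch, where matching pairs sit on the diagonal of the similarity matrix. This is a minimal NumPy sketch; the batch size, embedding dimension, and temperature value are illustrative, not taken from any particular model.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of (image, text) pairs."""
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix
    n = logits.shape[0]

    def xent(l):
        # Cross-entropy with the diagonal (the matching pair) as the target class
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[np.arange(n), np.arange(n)].mean()

    # Average the image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
aligned = rng.normal(size=(8, 64))
loss_matched = clip_contrastive_loss(aligned, aligned)  # perfectly matched pairs
loss_random = clip_contrastive_loss(aligned, rng.normal(size=(8, 64)))
```

Minimizing this loss drives the diagonal (matched) similarities up and the off-diagonal ones down, which is exactly the "pull together, push apart" behavior: `loss_matched` here comes out far lower than `loss_random`.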
Generative Pre-Training (LLaVA-style)
Start from a pre-trained vision encoder and a pre-trained LLM. First freeze both and train only a lightweight projector that maps visual features into the LLM's token space; then unfreeze the LLM and fine-tune it on visual instruction data. Produces VLMs that can reason about images and generate text responses, far cheaper than training from scratch.
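The trainable piece in stage one is just the projector. Below is a sketch of a two-layer MLP projector forward pass in NumPy; the dimensions (768-d vision features, 4096-d LLM embeddings, 196 patches) and the hidden size are assumptions for illustration, not any model's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dims: frozen vision encoder emits 768-d patch features,
# frozen LLM consumes 4096-d token embeddings.
VISION_DIM, HIDDEN, LLM_DIM = 768, 1024, 4096

# Only these projector weights are trained in stage one;
# the vision encoder and LLM stay frozen.
W1 = rng.normal(0.0, 0.02, (VISION_DIM, HIDDEN))
W2 = rng.normal(0.0, 0.02, (HIDDEN, LLM_DIM))

def project(patch_features):
    """Two-layer MLP projector: vision feature space -> LLM token-embedding space."""
    h = np.maximum(patch_features @ W1, 0.0)  # ReLU nonlinearity (illustrative choice)
    return h @ W2

patches = rng.normal(size=(196, VISION_DIM))  # e.g. a 14x14 grid of ViT patches
visual_tokens = project(patches)              # shape (196, LLM_DIM)
```

The projected patch vectors are then prepended to the text token embeddings, so the frozen LLM treats the image as a run of ordinary "soft" tokens.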
Native Multimodal (Gemini-style)
Train the entire model from scratch on interleaved multimodal data. Text, images, audio, and video are all tokenized and processed by a single Transformer. Produces the most capable models but requires enormous compute ($50M–$500M per training run).
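The "everything is tokens in one stream" idea can be sketched as follows: each modality's tokenizer gets its own slice of one shared vocabulary, and segments are interleaved into a single id sequence for one Transformer. All vocabulary sizes and offsets here are made up for the sketch.

```python
# Toy illustration: every modality becomes ids in one shared token sequence.
# Vocab sizes and offsets are invented for this sketch.
TEXT_VOCAB = 32_000
IMAGE_VOCAB = 8_192   # e.g. codes from a discrete image tokenizer
AUDIO_VOCAB = 4_096

IMAGE_OFFSET = TEXT_VOCAB                # image ids live above the text range
AUDIO_OFFSET = TEXT_VOCAB + IMAGE_VOCAB  # audio ids above the image range

def interleave(segments):
    """Flatten (modality, ids) segments into one id stream for a single Transformer."""
    offsets = {"text": 0, "image": IMAGE_OFFSET, "audio": AUDIO_OFFSET}
    stream = []
    for modality, ids in segments:
        stream.extend(offsets[modality] + i for i in ids)
    return stream

seq = interleave([("text", [5, 17]), ("image", [3, 900]), ("text", [42])])
# seq == [5, 17, 32003, 32900, 42]
```

Because every modality shares one sequence and one set of Transformer weights, the same next-token objective covers text generation, image understanding, and cross-modal reasoning, which is where the capability (and the compute bill) comes from.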
Key insight: The three strategies represent a cost-capability tradeoff. Contrastive (cheapest, embeddings only) → Generative (moderate, bolt-on VLM) → Native (most expensive, best capability). Most practitioners use the generative approach — it’s the sweet spot of cost and capability.