Ch 4 — Contrastive Learning & CLIP

How CLIP connects images and text, contrastive loss, LAION-5B, and zero-shot classification
High Level
[Animated overview: Image → Text → Pair → Train → Zero-shot → Backbone]
The Core Idea
Learning to connect images and text in a shared space
What CLIP Does
CLIP (Contrastive Language-Image Pre-training) learns a shared embedding space where images and text live together. A photo of a cat and the text “a photo of a cat” are mapped to nearby points in this space. A photo of a dog and “a photo of a cat” are mapped to distant points. This simple idea — pulling matching pairs together and pushing non-matching pairs apart — is the foundation of modern multimodal AI.
How It Trains
Given a batch of N image-text pairs:
1. Encode all N images with a vision encoder (ViT)
2. Encode all N texts with a text encoder (Transformer)
3. Compute cosine similarity between all N×N pairs
4. Maximize similarity for matching pairs (the diagonal)
5. Minimize similarity for non-matching pairs (off-diagonal)
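The five steps above can be sketched in plain Python. This is a minimal toy version: 3-dimensional hand-written vectors stand in for real encoder outputs, and the temperature value is illustrative (CLIP learns it during training).

```python
import math

def cosine_similarity(a, b):
    """Dot product of the two vectors after L2 normalization."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over N matched pairs.

    Matching pair i sits on the diagonal of the N x N similarity
    matrix; the loss is cross-entropy toward that diagonal, computed
    in both directions (image -> text and text -> image).
    """
    sims = [[cosine_similarity(img, txt) / temperature
             for txt in text_embs] for img in image_embs]

    def cross_entropy_rows(matrix):
        total = 0.0
        for i, row in enumerate(matrix):
            log_sum = math.log(sum(math.exp(s) for s in row))
            total += log_sum - row[i]   # -log softmax at the diagonal
        return total / len(matrix)

    transposed = [list(col) for col in zip(*sims)]
    return 0.5 * (cross_entropy_rows(sims) + cross_entropy_rows(transposed))

# Toy batch: matching pairs point in nearly the same direction
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
texts  = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
loss_aligned  = clip_loss(images, texts)
loss_shuffled = clip_loss(images, texts[::-1])  # wrong pairing
```

Shuffling the texts breaks the diagonal pairing, so the loss rises sharply; training pushes the model toward the aligned case.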
Contrastive Loss Visualized
// Batch of N=4 image-text pairs
// Similarity matrix (higher = more similar)

          Text1  Text2  Text3  Text4
Image1  [ 0.95   0.12   0.08   0.15 ]
Image2  [ 0.10   0.91   0.05   0.11 ]
Image3  [ 0.07   0.09   0.93   0.06 ]
Image4  [ 0.13   0.08   0.04   0.89 ]

// Goal: maximize diagonal (matching pairs)
//       minimize everything else (non-matching)
// In practice: batch size = 32,768
Key insight: CLIP doesn’t need labeled categories. It learns from raw image-text pairs scraped from the internet. This is why it generalizes so well — it’s seen millions of concepts described in natural language, not a fixed taxonomy of 1,000 ImageNet classes.
Zero-Shot Classification
Classify images without any task-specific training
How It Works
1. Define categories as text prompts: [“a photo of a cat”, “a photo of a dog”, “a photo of a car”]
2. Encode each category with CLIP’s text encoder
3. Encode the input image with CLIP’s image encoder
4. Compute cosine similarity between the image and each text embedding
5. The text with highest similarity = predicted category

No fine-tuning, no labeled training data for the specific task. You can add new categories at any time just by describing them in text.
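The recipe above can be sketched in plain Python, treating CLIP's encoders as black boxes. The toy 3-d vectors below stand in for real embeddings; in practice they would come from the model's text and image encoders.

```python
import math

def cosine_similarity(a, b):
    """Dot product of the two vectors after L2 normalization."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

# Toy stand-ins for CLIP text-encoder outputs, one per category prompt
text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}

def zero_shot_classify(image_embedding, prompts):
    """Return the prompt whose text embedding is closest to the image."""
    return max(prompts,
               key=lambda p: cosine_similarity(image_embedding, prompts[p]))

cat_image = [0.85, 0.15, 0.05]   # toy image-encoder output
prediction = zero_shot_classify(cat_image, text_embeddings)
# prediction == "a photo of a cat"
```

Adding a new category is just adding one more entry to the dictionary, which is exactly why no retraining is needed.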
Performance
CLIP achieves competitive accuracy on ImageNet (76.2% top-1) without ever seeing ImageNet training data. On some specialized tasks, it matches models trained specifically for that task. This “zero-shot” capability was unprecedented in 2021.
Prompt Engineering for CLIP
// Better prompts = better classification
Bad:  "cat"
Good: "a photo of a cat"
Best: "a photo of a cat, a type of pet"

// Ensemble multiple prompts per class:
"a photo of a {class}"
"a blurry photo of a {class}"
"a drawing of a {class}"
// Average embeddings for robust classification
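The ensembling trick reduces to averaging each class's prompt embeddings and re-normalizing before classification. A minimal sketch, with toy vectors in place of real CLIP text embeddings:

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def ensemble_embedding(prompt_embeddings):
    """Average several prompt embeddings for one class, then re-normalize.

    Averaging templates like "a photo of a {class}" and
    "a drawing of a {class}" smooths out wording-specific noise.
    """
    dim = len(prompt_embeddings[0])
    mean = [sum(e[i] for e in prompt_embeddings) / len(prompt_embeddings)
            for i in range(dim)]
    return normalize(mean)

# Toy embeddings for three prompt templates of the class "cat"
cat_prompts = [
    [0.90, 0.10, 0.00],   # "a photo of a cat"
    [0.80, 0.20, 0.10],   # "a blurry photo of a cat"
    [0.85, 0.05, 0.20],   # "a drawing of a cat"
]
cat_class_embedding = ensemble_embedding(cat_prompts)
```

The averaged vector then replaces the single-prompt embedding in the zero-shot classification step.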
Key insight: Zero-shot classification means you can deploy a classifier for any set of categories without collecting training data. Need to classify 50 new product types? Just describe them in text. This is why CLIP is so versatile in production.
LAION-5B & Training Data
5.85 billion image-text pairs that democratized multimodal AI
The Dataset
OpenAI’s CLIP was trained on 400 million image-text pairs from a proprietary dataset called WIT (WebImageText). This data was never released.

LAION-5B (Large-scale Artificial Intelligence Open Network) is the open-source response: 5.85 billion image-text pairs scraped from the internet, filtered for quality. It enabled the entire open-source multimodal ecosystem.
Scale Comparison
// Training data scale
ImageNet     1.2M images, 1K classes
CLIP (WIT)   400M image-text pairs
LAION-400M   400M pairs (open)
LAION-5B     5.85B pairs (open)
DataComp     12.8B pairs (filtered)

// More data = better generalization
// But also more noise, bias, and risk
Data Quality Challenges
Noisy: Many pairs have weak or incorrect text-image associations
Biased: Reflects internet demographics, stereotypes, and content distribution
NSFW content: Requires filtering (LAION includes safety scores)
Copyright: Contains copyrighted images — legal battles ongoing
Privacy: Contains personal photos scraped without consent
Key insight: LAION-5B democratized multimodal AI. Before it, only companies with proprietary web-scale datasets (OpenAI, Google) could train CLIP-like models. After LAION, anyone could train competitive models. Stable Diffusion was trained on LAION data.
CLIP in Stable Diffusion
The text conditioning backbone that makes text-to-image work
How CLIP Powers Image Generation
In Stable Diffusion, CLIP’s text encoder converts your prompt into embeddings that guide the diffusion process. The diffusion model (U-Net) doesn’t understand text directly — it understands CLIP embeddings. CLIP is the translator between your words and the generated image.
The Pipeline
// Text-to-Image via CLIP + Diffusion
1. User writes: "a sunset over mountains, oil painting"
2. CLIP text encoder → text embeddings (77 tokens × 768d)
3. Text embeddings injected via cross-attention
   into the U-Net at every denoising step
4. U-Net denoises latent space conditioned on text
5. VAE decoder → final image
   (512×512 in SD 1.x, 1024×1024 in SDXL)

// The text embeddings "steer" the denoising
// toward images matching the description
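The cross-attention step in the pipeline can be sketched in miniature: queries come from the image latent, keys and values from the text embeddings, so each latent position gets pulled toward a prompt-weighted mixture of text content. This toy single-head version uses identity projections; real U-Nets use learned Q/K/V matrices and many heads.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(latent_tokens, text_tokens):
    """Single-head cross-attention: image latents attend over text tokens."""
    dim = len(text_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in latent_tokens:
        # Similarity of this latent position to each text token
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in text_tokens]
        weights = softmax(scores)
        # Output = attention-weighted mixture of the text embeddings
        out.append([sum(w * v[i] for w, v in zip(weights, text_tokens))
                    for i in range(dim)])
    return out

# Toy: 2 latent positions attending over 3 text-token embeddings
latents = [[1.0, 0.0], [0.0, 1.0]]
text    = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
attended = cross_attention(latents, text)
```

Because the output at every latent position is a convex combination of text embeddings, the prompt literally mixes into the denoising signal at each step.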
Why CLIP Quality Matters
The quality of CLIP’s text understanding directly affects image generation quality:

• If CLIP doesn’t understand “to the left of”, Stable Diffusion can’t reliably place objects
• If CLIP conflates “red car” and “car red”, color assignment becomes unreliable
• If CLIP can’t count, “three apples” might generate two or five

This is why newer models use better text encoders: SDXL uses dual CLIP + OpenCLIP, SD3 and Flux use T5-XXL.
Key insight: When your text-to-image prompt doesn’t work as expected, the bottleneck is often the text encoder (CLIP), not the image generator (diffusion model). Understanding this helps you debug prompt failures and choose better models.
CLIP’s Limitations
Where contrastive learning struggles
Known Weaknesses
Spatial relationships: “cat on top of dog” vs “dog on top of cat” — CLIP treats these as nearly identical because it’s a bag-of-concepts model
Counting: “three apples” vs “five apples” — CLIP doesn’t encode quantity well
Negation: “no people in the image” — CLIP often ignores “no”
Fine-grained attributes: Specific textures, materials, subtle color differences
Compositionality: Complex multi-object scenes with specific arrangements
Improvements & Successors
SigLIP (Google): Replaces softmax with sigmoid loss — better scaling and performance
EVA-CLIP: Larger, better-trained vision encoder with masked image modeling
OpenCLIP: Open-source reproductions trained on LAION with improvements
T5/FLAN text encoders: Using language models instead of CLIP for richer text understanding (used in SD3, Flux)
LLM-based encoders: Using the LLM itself as the text encoder for better compositionality
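SigLIP's change to the loss can be sketched next to CLIP's softmax version: instead of normalizing each row of the similarity matrix, every (image, text) cell gets an independent binary label. This toy version omits SigLIP's learned bias and temperature terms.

```python
import math

def sigmoid_loss(sims, labels):
    """Pairwise binary cross-entropy over the similarity matrix.

    labels[i][j] is +1 for matching pairs (the diagonal), -1 otherwise.
    Each cell is scored independently, so no batch-wide softmax
    normalization is needed -- the property that lets the sigmoid
    loss scale gracefully to very large batches.
    """
    total, count = 0.0, 0
    for row_s, row_l in zip(sims, labels):
        for s, l in zip(row_s, row_l):
            # -log sigmoid(l * s) == log(1 + exp(-l * s))
            total += math.log1p(math.exp(-l * s))
            count += 1
    return total / count

sims   = [[4.0, -3.0], [-2.0, 5.0]]   # toy similarity logits
labels = [[1, -1], [-1, 1]]           # diagonal = matching pairs
loss_good = sigmoid_loss(sims, labels)
loss_bad  = sigmoid_loss([[-4.0, 3.0], [2.0, -5.0]], labels)
```

When matching pairs score high and mismatches score low, the loss is near zero; flipping the logits drives it up.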
Key insight: CLIP’s limitations explain many of the “failures” of text-to-image models. When Stable Diffusion generates the wrong number of fingers or puts objects in the wrong position, it’s partly because CLIP doesn’t encode counting or spatial relationships well.
CLIP as the Universal Backbone
Why CLIP is everywhere in multimodal AI
Where CLIP Is Used
Text-to-image: Stable Diffusion, DALL-E (text conditioning via cross-attention)
Image search: Encode query text + database images, find nearest neighbors
Content moderation: Classify images by text descriptions at scale
Multimodal RAG: Retrieve images relevant to text queries in vector databases
Zero-shot classification: Classify without task-specific training data
Image captioning: As the vision encoder in vision-language models
CLIP Score: Metric for evaluating text-image alignment in generated images
The CLIP Ecosystem
CLIP spawned an entire ecosystem of tools, models, and datasets:

OpenCLIP: Open-source implementation with multiple model sizes
LAION: Open datasets for training (400M, 2B, 5B)
Hugging Face: Pre-trained models, fine-tuning recipes, model hub
CLIP-based metrics: CLIP Score, CLIP-IQA for quality assessment
CLIP retrieval: Billion-scale image search engines
Key insight: CLIP is to multimodal AI what word2vec was to NLP — a foundational embedding technique that everything else builds on. Understanding CLIP is understanding the backbone of modern multimodal systems. It’s the glue between text and images.
Using CLIP in Practice
Practical applications and integration patterns
Common Patterns
// Pattern 1: Image Search
index:  encode all images → store in vector DB
query:  encode text query → nearest neighbor search
result: top-K most similar images

// Pattern 2: Zero-Shot Classification
classes: encode class descriptions as text
input:   encode image
predict: argmax(cosine_similarity)

// Pattern 3: Content Filtering
filter: encode "NSFW content" as text
score:  similarity(image, filter_text)
flag:   score > threshold
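Pattern 1 reduces to nearest-neighbor search over stored embeddings. A minimal in-memory sketch with toy vectors; a real deployment would store CLIP embeddings in a vector database (e.g. with a library such as FAISS) rather than a Python dict.

```python
import math

def cosine_similarity(a, b):
    """Dot product of the two vectors after L2 normalization."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

# "Index": image id -> toy embedding (would come from CLIP's image encoder)
index = {
    "cat.jpg":    [0.9, 0.1, 0.0],
    "dog.jpg":    [0.1, 0.9, 0.0],
    "sunset.jpg": [0.0, 0.2, 0.9],
}

def search(query_embedding, index, k=2):
    """Return the top-k image ids most similar to the query embedding."""
    ranked = sorted(index,
                    key=lambda img: cosine_similarity(query_embedding,
                                                      index[img]),
                    reverse=True)
    return ranked[:k]

# Toy text embedding for the query "a photo of a cat"
results = search([0.85, 0.15, 0.05], index)
# results[0] == "cat.jpg"
```

Patterns 2 and 3 reuse the same similarity call: classification takes the argmax over class prompts, filtering thresholds the score against a single prompt.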
Fine-Tuning CLIP
Pre-trained CLIP works well for general tasks, but fine-tuning on domain data can dramatically improve performance:

Medical imaging: Fine-tune on radiology reports + X-rays
E-commerce: Fine-tune on product descriptions + photos
Fashion: Fine-tune on style descriptions + outfit images
Satellite imagery: Fine-tune on geographic descriptions + aerial photos

Even a few thousand domain-specific pairs can significantly boost accuracy.
Pro tip: When building multimodal search or classification, start with pre-trained CLIP. If accuracy isn’t sufficient, fine-tune on your domain data. If you need spatial understanding or counting, consider augmenting CLIP with a VLM like GPT-4V or Gemini.
Key Takeaways
What to remember about contrastive learning and CLIP
Essential Concepts
1. Contrastive learning: Pull matching image-text pairs together, push non-matching apart

2. Shared embedding space: Images and text live in the same vector space — enabling cross-modal operations

3. Zero-shot transfer: Classify by describing categories in text — no task-specific training needed

4. Text conditioning: CLIP embeddings guide image generation in Stable Diffusion via cross-attention

5. Scale matters: 400M+ image-text pairs needed for good generalization
Practical Implications
• CLIP embeddings are your go-to for multimodal search and retrieval
• Text-to-image quality depends heavily on text encoder quality
• Zero-shot classification is good enough for many production use cases
• Fine-tuning CLIP on domain data can dramatically improve performance
• CLIP’s limitations (spatial, counting) explain many text-to-image failures
Next up: Chapter 5 dives deep into how diffusion models actually work — the forward noise process, reverse denoising, the U-Net architecture, classifier-free guidance, and the math made intuitive.