Ch 4 — Contrastive Learning & CLIP

How CLIP connects images and text, contrastive loss, LAION-5B, and zero-shot classification
High Level
[Animated overview: Image → Text → Pair → Train → Zero-shot → Backbone]
The Core Idea
Learning to connect images and text in a shared space
What CLIP Does
CLIP (Contrastive Language-Image Pre-training) learns a shared embedding space where images and text live together. A photo of a cat and the text “a photo of a cat” are mapped to nearby points in this space. A photo of a dog and “a photo of a cat” are mapped to distant points. This simple idea — pulling matching pairs together and pushing non-matching pairs apart — is the foundation of modern multimodal AI.
How It Trains
Given a batch of N image-text pairs:
1. Encode all N images with a vision encoder (ViT)
2. Encode all N texts with a text encoder (Transformer)
3. Compute cosine similarity between all N×N pairs
4. Maximize similarity for matching pairs (the diagonal)
5. Minimize similarity for non-matching pairs (off-diagonal)
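The five steps above can be sketched in plain Python. This is a minimal toy version: 3-dimensional hand-written vectors stand in for real encoder outputs, and the temperature value is illustrative (CLIP learns it during training).

```python
import math

def cosine_similarity(a, b):
    """Dot product of the two vectors after L2 normalization."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over N matched pairs.

    Matching pair i sits on the diagonal of the N x N similarity
    matrix; the loss is cross-entropy toward that diagonal, computed
    in both directions (image -> text and text -> image).
    """
    sims = [[cosine_similarity(img, txt) / temperature
             for txt in text_embs] for img in image_embs]

    def cross_entropy_rows(matrix):
        total = 0.0
        for i, row in enumerate(matrix):
            log_sum = math.log(sum(math.exp(s) for s in row))
            total += log_sum - row[i]   # -log softmax at the diagonal
        return total / len(matrix)

    transposed = [list(col) for col in zip(*sims)]
    return 0.5 * (cross_entropy_rows(sims) + cross_entropy_rows(transposed))

# Toy batch: matching pairs point in nearly the same direction
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
texts  = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
loss_aligned  = clip_loss(images, texts)
loss_shuffled = clip_loss(images, texts[::-1])  # wrong pairing
```

Shuffling the texts breaks the diagonal pairing, so the loss rises sharply; training pushes the model toward the aligned case.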
Contrastive Loss Visualized
// Batch of N=4 image-text pairs
// Similarity matrix (higher = more similar)

          Text1  Text2  Text3  Text4
Image1  [ 0.95   0.12   0.08   0.15 ]
Image2  [ 0.10   0.91   0.05   0.11 ]
Image3  [ 0.07   0.09   0.93   0.06 ]
Image4  [ 0.13   0.08   0.04   0.89 ]

// Goal: maximize diagonal (matching pairs)
//       minimize everything else (non-matching)
// In practice: batch size = 32,768
Key insight: CLIP doesn’t need labeled categories. It learns from raw image-text pairs scraped from the internet. This is why it generalizes so well — it’s seen millions of concepts described in natural language, not a fixed taxonomy of 1,000 ImageNet classes.
Zero-Shot Classification
Classify images without any task-specific training
How It Works
1. Define categories as text prompts: [“a photo of a cat”, “a photo of a dog”, “a photo of a car”]
2. Encode each category with CLIP’s text encoder
3. Encode the input image with CLIP’s image encoder
4. Compute cosine similarity between the image and each text embedding
5. The text with highest similarity = predicted category

No fine-tuning, no labeled training data for the specific task. You can add new categories at any time just by describing them in text.
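The recipe above can be sketched in plain Python, treating CLIP's encoders as black boxes. The toy 3-d vectors below stand in for real embeddings; in practice they would come from the model's text and image encoders.

```python
import math

def cosine_similarity(a, b):
    """Dot product of the two vectors after L2 normalization."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

# Toy stand-ins for CLIP text-encoder outputs, one per category prompt
text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}

def zero_shot_classify(image_embedding, prompts):
    """Return the prompt whose text embedding is closest to the image."""
    return max(prompts,
               key=lambda p: cosine_similarity(image_embedding, prompts[p]))

cat_image = [0.85, 0.15, 0.05]   # toy image-encoder output
prediction = zero_shot_classify(cat_image, text_embeddings)
# prediction == "a photo of a cat"
```

Adding a new category is just adding one more entry to the dictionary, which is exactly why no retraining is needed.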
Performance
CLIP achieves competitive accuracy on ImageNet (76.2% top-1) without ever seeing ImageNet training data. On some specialized tasks, it matches models trained specifically for that task. This “zero-shot” capability was unprecedented in 2021.
Prompt Engineering for CLIP
// Better prompts = better classification
Bad:  "cat"
Good: "a photo of a cat"
Best: "a photo of a cat, a type of pet"

// Ensemble multiple prompts per class:
"a photo of a {class}"
"a blurry photo of a {class}"
"a drawing of a {class}"
// Average embeddings for robust classification
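The ensembling trick reduces to averaging each class's prompt embeddings and re-normalizing before classification. A minimal sketch, with toy vectors in place of real CLIP text embeddings:

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def ensemble_embedding(prompt_embeddings):
    """Average several prompt embeddings for one class, then re-normalize.

    Averaging templates like "a photo of a {class}" and
    "a drawing of a {class}" smooths out wording-specific noise.
    """
    dim = len(prompt_embeddings[0])
    mean = [sum(e[i] for e in prompt_embeddings) / len(prompt_embeddings)
            for i in range(dim)]
    return normalize(mean)

# Toy embeddings for three prompt templates of the class "cat"
cat_prompts = [
    [0.90, 0.10, 0.00],   # "a photo of a cat"
    [0.80, 0.20, 0.10],   # "a blurry photo of a cat"
    [0.85, 0.05, 0.20],   # "a drawing of a cat"
]
cat_class_embedding = ensemble_embedding(cat_prompts)
```

The averaged vector then replaces the single-prompt embedding in the zero-shot classification step.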
Key insight: Zero-shot classification means you can deploy a classifier for any set of categories without collecting training data. Need to classify 50 new product types? Just describe them in text. This is why CLIP is so versatile in production.
LAION-5B & Training Data
5.85 billion image-text pairs that democratized multimodal AI
The Dataset
OpenAI’s CLIP was trained on 400 million image-text pairs from a proprietary dataset called WIT (WebImageText). This data was never released.

LAION-5B (Large-scale Artificial Intelligence Open Network) is the open-source response: 5.85 billion image-text pairs scraped from the internet, filtered for quality. It enabled the entire open-source multimodal ecosystem.
Scale Comparison
// Training data scale
ImageNet     1.2M images, 1K classes
CLIP (WIT)   400M image-text pairs
LAION-400M   400M pairs (open)
LAION-5B     5.85B pairs (open)
DataComp     12.8B pairs (filtered)

// More data = better generalization
// But also more noise, bias, and risk
Data Quality Challenges
Noisy: Many pairs have weak or incorrect text-image associations
Biased: Reflects internet demographics, stereotypes, and content distribution
NSFW content: Requires filtering (LAION includes safety scores)
Copyright: Contains copyrighted images — legal battles ongoing
Privacy: Contains personal photos scraped without consent
Key insight: LAION-5B democratized multimodal AI. Before it, only companies with proprietary web-scale datasets (OpenAI, Google) could train CLIP-like models. After LAION, anyone could train competitive models. Stable Diffusion was trained on LAION data.
CLIP in Stable Diffusion
The text conditioning backbone that makes text-to-image work
How CLIP Powers Image Generation
In Stable Diffusion, CLIP’s text encoder converts your prompt into embeddings that guide the diffusion process. The diffusion model (U-Net) doesn’t understand text directly — it understands CLIP embeddings. CLIP is the translator between your words and the generated image.
The Pipeline
// Text-to-Image via CLIP + Diffusion
1. User writes: "a sunset over mountains, oil painting"
2. CLIP text encoder → text embeddings (77 tokens × 768d)
3. Text embeddings injected via cross-attention
   into the U-Net at every denoising step
4. U-Net denoises latent space conditioned on text
5. VAE decoder → final image
   (512×512 in SD 1.x, 1024×1024 in SDXL)

// The text embeddings "steer" the denoising
// toward images matching the description
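The cross-attention step in the pipeline can be sketched in miniature: queries come from the image latent, keys and values from the text embeddings, so each latent position gets pulled toward a prompt-weighted mixture of text content. This toy single-head version uses identity projections; real U-Nets use learned Q/K/V matrices and many heads.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(latent_tokens, text_tokens):
    """Single-head cross-attention: image latents attend over text tokens."""
    dim = len(text_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in latent_tokens:
        # Similarity of this latent position to each text token
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in text_tokens]
        weights = softmax(scores)
        # Output = attention-weighted mixture of the text embeddings
        out.append([sum(w * v[i] for w, v in zip(weights, text_tokens))
                    for i in range(dim)])
    return out

# Toy: 2 latent positions attending over 3 text-token embeddings
latents = [[1.0, 0.0], [0.0, 1.0]]
text    = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
attended = cross_attention(latents, text)
```

Because the output at every latent position is a convex combination of text embeddings, the prompt literally mixes into the denoising signal at each step.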
Why CLIP Quality Matters
The quality of CLIP’s text understanding directly affects image generation quality:

• If CLIP doesn’t understand “to the left of”, Stable Diffusion can’t reliably place objects
• If CLIP conflates “red car” and “car red”, color assignment becomes unreliable
• If CLIP can’t count, “three apples” might generate two or five

This is why newer models use better text encoders: SDXL uses dual CLIP + OpenCLIP, SD3 and Flux use T5-XXL.
Key insight: When your text-to-image prompt doesn’t work as expected, the bottleneck is often the text encoder (CLIP), not the image generator (diffusion model). Understanding this helps you debug prompt failures and choose better models.
CLIP’s Limitations
Where contrastive learning struggles
Known Weaknesses
Spatial relationships: “cat on top of dog” vs “dog on top of cat” — CLIP treats these as nearly identical because it’s a bag-of-concepts model
Counting: “three apples” vs “five apples” — CLIP doesn’t encode quantity well
Negation: “no people in the image” — CLIP often ignores “no”
Fine-grained attributes: Specific textures, materials, subtle color differences
Compositionality: Complex multi-object scenes with specific arrangements
Improvements & Successors
SigLIP (Google): Replaces softmax with sigmoid loss — better scaling and performance
EVA-CLIP: Larger, better-trained vision encoder with masked image modeling
OpenCLIP: Open-source reproductions trained on LAION with improvements
T5/FLAN text encoders: Using language models instead of CLIP for richer text understanding (used in SD3, Flux)
LLM-based encoders: Using the LLM itself as the text encoder for better compositionality
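SigLIP's change to the loss can be sketched next to CLIP's softmax version: instead of normalizing each row of the similarity matrix, every (image, text) cell gets an independent binary label. This toy version omits SigLIP's learned bias and temperature terms.

```python
import math

def sigmoid_loss(sims, labels):
    """Pairwise binary cross-entropy over the similarity matrix.

    labels[i][j] is +1 for matching pairs (the diagonal), -1 otherwise.
    Each cell is scored independently, so no batch-wide softmax
    normalization is needed -- the property that lets the sigmoid
    loss scale gracefully to very large batches.
    """
    total, count = 0.0, 0
    for row_s, row_l in zip(sims, labels):
        for s, l in zip(row_s, row_l):
            # -log sigmoid(l * s) == log(1 + exp(-l * s))
            total += math.log1p(math.exp(-l * s))
            count += 1
    return total / count

sims   = [[4.0, -3.0], [-2.0, 5.0]]   # toy similarity logits
labels = [[1, -1], [-1, 1]]           # diagonal = matching pairs
loss_good = sigmoid_loss(sims, labels)
loss_bad  = sigmoid_loss([[-4.0, 3.0], [2.0, -5.0]], labels)
```

When matching pairs score high and mismatches score low, the loss is near zero; flipping the logits drives it up.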
Key insight: CLIP’s limitations explain many of the “failures” of text-to-image models. When Stable Diffusion generates the wrong number of fingers or puts objects in the wrong position, it’s partly because CLIP doesn’t encode counting or spatial relationships well.
CLIP as the Universal Backbone
Why CLIP is everywhere in multimodal AI
Where CLIP Is Used
Text-to-image: Stable Diffusion, DALL-E (text conditioning via cross-attention)
Image search: Encode query text + database images, find nearest neighbors
Content moderation: Classify images by text descriptions at scale
Multimodal RAG: Retrieve images relevant to text queries in vector databases
Zero-shot classification: Classify without task-specific training data
Image captioning: As the vision encoder in vision-language models
CLIP Score: Metric for evaluating text-image alignment in generated images
The CLIP Ecosystem
CLIP spawned an entire ecosystem of tools, models, and datasets:

OpenCLIP: Open-source implementation with multiple model sizes
LAION: Open datasets for training (400M, 2B, 5B)
Hugging Face: Pre-trained models, fine-tuning recipes, model hub
CLIP-based metrics: CLIP Score, CLIP-IQA for quality assessment
CLIP retrieval: Billion-scale image search engines
Key insight: CLIP is to multimodal AI what word2vec was to NLP — a foundational embedding technique that everything else builds on. Understanding CLIP is understanding the backbone of modern multimodal systems. It’s the glue between text and images.
Using CLIP in Practice
Practical applications and integration patterns
Common Patterns
// Pattern 1: Image Search
index:  encode all images → store in vector DB
query:  encode text query → nearest neighbor search
result: top-K most similar images

// Pattern 2: Zero-Shot Classification
classes: encode class descriptions as text
input:   encode image
predict: argmax(cosine_similarity)

// Pattern 3: Content Filtering
filter: encode "NSFW content" as text
score:  similarity(image, filter_text)
flag:   score > threshold
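Pattern 1 reduces to nearest-neighbor search over stored embeddings. A minimal in-memory sketch with toy vectors; a real deployment would store CLIP embeddings in a vector database (e.g. with a library such as FAISS) rather than a Python dict.

```python
import math

def cosine_similarity(a, b):
    """Dot product of the two vectors after L2 normalization."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

# "Index": image id -> toy embedding (would come from CLIP's image encoder)
index = {
    "cat.jpg":    [0.9, 0.1, 0.0],
    "dog.jpg":    [0.1, 0.9, 0.0],
    "sunset.jpg": [0.0, 0.2, 0.9],
}

def search(query_embedding, index, k=2):
    """Return the top-k image ids most similar to the query embedding."""
    ranked = sorted(index,
                    key=lambda img: cosine_similarity(query_embedding,
                                                      index[img]),
                    reverse=True)
    return ranked[:k]

# Toy text embedding for the query "a photo of a cat"
results = search([0.85, 0.15, 0.05], index)
# results[0] == "cat.jpg"
```

Patterns 2 and 3 reuse the same similarity call: classification takes the argmax over class prompts, filtering thresholds the score against a single prompt.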
Fine-Tuning CLIP
Pre-trained CLIP works well for general tasks, but fine-tuning on domain data can dramatically improve performance:

Medical imaging: Fine-tune on radiology reports + X-rays
E-commerce: Fine-tune on product descriptions + photos
Fashion: Fine-tune on style descriptions + outfit images
Satellite imagery: Fine-tune on geographic descriptions + aerial photos

Even a few thousand domain-specific pairs can significantly boost accuracy.
Pro tip: When building multimodal search or classification, start with pre-trained CLIP. If accuracy isn’t sufficient, fine-tune on your domain data. If you need spatial understanding or counting, consider augmenting CLIP with a VLM like GPT-4V or Gemini.
Key Takeaways
What to remember about contrastive learning and CLIP
Essential Concepts
1. Contrastive learning: Pull matching image-text pairs together, push non-matching apart

2. Shared embedding space: Images and text live in the same vector space — enabling cross-modal operations

3. Zero-shot transfer: Classify by describing categories in text — no task-specific training needed

4. Text conditioning: CLIP embeddings guide image generation in Stable Diffusion via cross-attention

5. Scale matters: 400M+ image-text pairs needed for good generalization
Practical Implications
• CLIP embeddings are your go-to for multimodal search and retrieval
• Text-to-image quality depends heavily on text encoder quality
• Zero-shot classification is good enough for many production use cases
• Fine-tuning CLIP on domain data can dramatically improve performance
• CLIP’s limitations (spatial, counting) explain many text-to-image failures
Next up: Chapter 5 dives deep into how diffusion models actually work — the forward noise process, reverse denoising, the U-Net architecture, classifier-free guidance, and the math made intuitive.