Ch 11 — Multimodal Embeddings & Search

Shared embedding spaces, vector databases, cross-modal retrieval, and multimodal RAG
High Level
Embed → Store → Search → Cross-Modal → RAG → Build
Shared Embedding Spaces
One vector space for text, images, audio, and video
The Core Idea
A shared embedding space maps different modalities into the same vector space. A photo of a sunset, the text “beautiful sunset over the ocean,” and an audio clip of waves crashing all map to nearby vectors. This enables cross-modal operations: search images with text, find similar audio for a video, or cluster content across modalities.
Key Embedding Models
// Multimodal embedding models
CLIP       Text + Image (512-768d)
SigLIP     Text + Image (better scaling)
CLAP       Text + Audio (512d)
ImageBind  6 modalities (Meta, 1024d)
ONE-PEACE  Text + Image + Audio (1536d)
Jina CLIP  Text + Image (optimized for search)
How Cross-Modal Search Works
1. Index time: Encode all images/audio/video into vectors using the embedding model. Store in a vector database.
2. Query time: Encode the query (text, image, or audio) into the same vector space.
3. Search: Find the K nearest neighbors using cosine similarity or dot product.
4. Return: Results from any modality, ranked by similarity.
Key insight: Shared embedding spaces are the foundation of multimodal search. Without them, you’d need separate search systems for each modality. With them, a single text query can find relevant images, audio clips, and video segments simultaneously.
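The four steps above reduce to nearest-neighbor search in one vector space. A minimal sketch, using toy hand-written vectors in place of a real multimodal encoder such as CLIP (the vectors and item labels are illustrative, not real model outputs):

```python
import numpy as np

def cosine_top_k(query_vec, index_vecs, k=2):
    """Return (indices, scores) of the k nearest vectors by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    m = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    sims = m @ q
    order = np.argsort(-sims)[:k]
    return order, sims[order]

# Toy "shared space": in practice each row would come from the embedding
# model at index time, one entry per item of any modality.
index = np.array([
    [0.90, 0.10, 0.00],  # image: sunset photo
    [0.10, 0.90, 0.00],  # audio: waves crashing
    [0.85, 0.15, 0.10],  # image: another sunset
])
query = np.array([0.88, 0.12, 0.05])  # text query: "beautiful sunset"

ids, scores = cosine_top_k(query, index, k=2)
print(list(ids))  # → [0, 2]: both sunset images rank above the audio clip
```

Because every item lives in the same space, the text query retrieves the two sunset images without any keyword tags on them.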
Vector Databases for Multimodal
Storing and searching billions of multimodal embeddings
Vector DB Options
// Vector databases for multimodal search
Pinecone   Managed, serverless, easy scaling
Weaviate   Open-source, built-in multimodal support
Qdrant     Open-source, fast, Rust-based
Milvus     Open-source, billion-scale
ChromaDB   Simple, developer-friendly
pgvector   PostgreSQL extension (simple)

// For multimodal: store embedding + metadata
// Metadata: modality type, source URL,
// timestamp, dimensions, content hash
Indexing Strategy
HNSW (Hierarchical Navigable Small World): Best recall, higher memory. Default for most use cases.
IVF (Inverted File Index): Lower memory, good for billion-scale. Requires training.
Product Quantization: Compress vectors 4–8x with minimal recall loss. Essential at scale.
Hybrid search: Combine vector similarity with keyword/metadata filters for precision.
Key insight: For multimodal search, the vector database choice matters less than the embedding model quality. A great embedding model with a simple database outperforms a mediocre embedding model with the fanciest database.
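Hybrid search, the last strategy above, can be sketched in a few lines: filter by metadata first, then run cosine similarity over the survivors. This is a toy in-memory version, assuming the metadata layout from the panel above (real databases push the filter into the index instead of scanning):

```python
import numpy as np

def hybrid_search(query_vec, vecs, metadata, k=5, **filters):
    """Cosine search restricted to items whose metadata matches every filter."""
    keep = [i for i, meta in enumerate(metadata)
            if all(meta.get(f) == v for f, v in filters.items())]
    if not keep:
        return []
    q = query_vec / np.linalg.norm(query_vec)
    sub = vecs[keep] / np.linalg.norm(vecs[keep], axis=1, keepdims=True)
    sims = sub @ q
    order = np.argsort(-sims)[:k]
    return [(keep[i], float(sims[i])) for i in order]

vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
metadata = [{"modality": "audio"}, {"modality": "image"}, {"modality": "image"}]
query = np.array([1.0, 0.0])

# Unfiltered, item 0 (audio) wins; the filter restricts results to images.
hits = hybrid_search(query, vecs, metadata, k=1, modality="image")
print(hits[0][0])  # → 1
```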
Cross-Modal Retrieval Patterns
Text-to-image, image-to-image, and beyond
Search Patterns
Text → Image: “red sports car on a mountain road” → matching photos
Image → Image: Upload a photo, find visually similar images
Text → Audio: “jazz piano with rain sounds” → matching audio clips
Image → Text: Find documents that describe a given image
Audio → Video: Find video clips matching an audio snippet
Any → Any: With unified models like ImageBind, search across all 6 modalities
Real-World Applications
E-commerce: Upload a photo of a dress, find similar products for sale
Stock media: Search millions of images/videos/audio with natural language
Content moderation: Find images similar to known harmful content
Digital asset management: Organize and search corporate media libraries
Duplicate detection: Find near-duplicate images across large collections
Recommendation: “More like this” across content types
Key insight: Cross-modal search enables entirely new user experiences. Instead of tagging every image with keywords, users can search with natural language. Instead of browsing categories, they can upload a reference image and find similar content instantly.
Multimodal RAG
Retrieval-Augmented Generation with images, documents, and more
How Multimodal RAG Works
Standard RAG retrieves text chunks. Multimodal RAG retrieves text, images, charts, tables, and audio to provide richer context to the LLM:

1. Index: Embed documents (text + images + tables) into vector DB
2. Query: User asks a question (text or image)
3. Retrieve: Find relevant text chunks AND images/charts
4. Generate: VLM processes retrieved text + images together
5. Answer: Response grounded in both textual and visual evidence
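Steps 3 and 4 hinge on assembling mixed retrieved chunks into one VLM input. A minimal sketch; the chunk records, scores, and output shape here are illustrative assumptions, not a specific VLM API:

```python
# Mixed chunks as a multimodal retriever might return them (toy data).
chunks = [
    {"type": "text",  "content": "Revenue grew 12% YoY.", "score": 0.91},
    {"type": "image", "content": "chart_page3.png",       "score": 0.87},
    {"type": "table", "content": "table_page5.png",       "score": 0.40},
]

def build_vlm_input(question, retrieved, min_score=0.6):
    """Drop low-relevance chunks, then split context into text and image parts."""
    kept = [c for c in retrieved if c["score"] >= min_score]
    text_context = "\n".join(c["content"] for c in kept if c["type"] == "text")
    image_refs = [c["content"] for c in kept if c["type"] in ("image", "table")]
    return {"question": question, "context": text_context, "images": image_refs}

vlm_input = build_vlm_input("How did revenue change?", chunks)
print(vlm_input["images"])  # → ['chart_page3.png']
```

The text context and image references would then be passed together to the VLM so the answer can cite both.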
Document Processing Pipeline
// Multimodal RAG for documents
1. Parse     PDF → text + images + tables
2. Chunk     Text into semantic chunks
3. Embed     Text chunks with text embedder
             Images with CLIP/SigLIP
             Tables as screenshots with CLIP
4. Store     All embeddings in vector DB
             with metadata (page, type, source)
5. Query     Retrieve text + visual chunks
6. Generate  Feed to VLM for grounded answer
Key insight: Many documents contain critical information in charts, diagrams, and tables that text-only RAG completely misses. Multimodal RAG captures this visual information, dramatically improving answer quality for technical documents, financial reports, and scientific papers.
Embedding Quality & Fine-Tuning
Improving retrieval accuracy for your domain
Measuring Embedding Quality
Recall@K: What fraction of relevant items appear in top-K results?
Precision@K: What fraction of top-K results are relevant?
MRR (Mean Reciprocal Rank): How high does the first relevant result rank?
NDCG: Normalized Discounted Cumulative Gain — accounts for ranking quality

Always measure on your data, not generic benchmarks.
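The first three metrics are a few lines each, which makes building that domain eval set the only hard part. A sketch over toy retrieval results:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear in the top-k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return len(set(retrieved[:k]) & set(relevant)) / k

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant result (0 if none retrieved)."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["img7", "img2", "img9", "img4"]   # system output, best first
relevant = {"img2", "img4"}                    # ground truth for this query

print(recall_at_k(retrieved, relevant, 3))  # → 0.5 (1 of 2 relevant in top-3)
print(mrr(retrieved, relevant))             # → 0.5 (first hit at rank 2)
```

In practice you would average each metric over every query in the eval set.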
Fine-Tuning Embeddings
Contrastive fine-tuning: Provide positive/negative pairs from your domain. “This image matches this text” / “This image doesn’t match this text.”
Hard negative mining: Use the most confusing negative examples for better discrimination
Domain adaptation: Fine-tune CLIP on medical images, satellite imagery, or product photos
Matryoshka embeddings: Train embeddings that work at multiple dimensions (256, 512, 768) for flexible cost/quality tradeoff
Key insight: Fine-tuning embeddings on even 1,000 domain-specific pairs can improve retrieval accuracy by 15–30%. This is the highest-ROI optimization for multimodal search — more impactful than changing vector databases or indexing strategies.
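Contrastive fine-tuning typically optimizes a symmetric InfoNCE objective, where each matched image-text pair is the positive and every other pair in the batch is a negative. A NumPy sketch of the loss (the embeddings here are toy identity vectors, not real model outputs):

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: pair (i, i) is the positive, all others negatives."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) similarity matrix

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))     # targets are the diagonal

    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

# Aligned pairs score a much lower loss than mismatched ones.
aligned = clip_style_loss(np.eye(4), np.eye(4))
shuffled = clip_style_loss(np.eye(4), np.eye(4)[::-1])
print(aligned < shuffled)  # → True
```

Hard negative mining changes which rows end up in the batch; the loss itself stays the same.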
Scaling Multimodal Search
From thousands to billions of items
Scale Considerations
// Scaling multimodal search
1K items    Brute force works fine
            Any vector DB, any hardware
100K items  HNSW index, single machine
            ~1 GB memory for 768d vectors
10M items   HNSW + quantization
            ~10 GB memory, fast queries
1B items    Distributed index (Milvus/Qdrant)
            IVF + PQ, multiple nodes
            ~100 GB compressed

// Embedding generation is the bottleneck
// CLIP: ~1000 images/sec on A100
// 1B images = ~12 days on a single GPU
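These capacity figures are simple arithmetic you can redo for your own corpus. A sketch, assuming float32 vectors and a rough 1.5x index-overhead factor (the overhead factor and quantized bytes-per-dimension are assumptions, not measured values):

```python
def index_memory_gb(n_items, dim=768, bytes_per_value=4, overhead=1.5):
    """Rough RAM estimate: raw vector bytes times an index-overhead factor."""
    return n_items * dim * bytes_per_value * overhead / 1e9

def embed_days(n_items, items_per_sec=1000):
    """Wall-clock days to embed a corpus at a given single-GPU throughput."""
    return n_items / items_per_sec / 86_400

print(round(index_memory_gb(100_000), 2))      # 100K float32 768d vectors
print(round(index_memory_gb(10_000_000,
                            bytes_per_value=1), 1))  # ~1 byte/dim after quantization
print(round(embed_days(1_000_000_000), 1))     # → 11.6 days at 1000 img/sec
```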
Production Architecture
Batch embedding pipeline: Process new content asynchronously, update index periodically
Caching: Cache frequent queries and their results
Tiered search: Fast approximate search first, then re-rank top-100 with a more expensive model
Hybrid search: Combine vector similarity with metadata filters (date, category, source)
Multi-index: Separate indexes for different content types, merge results at query time
Key insight: The re-ranking pattern is crucial at scale: use a fast, cheap embedding model for initial retrieval (top-1000), then a more expensive model (or VLM) to re-rank the top-100. This gives you the speed of simple embeddings with the accuracy of sophisticated models.
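The re-ranking pattern can be sketched as two stages over one candidate pool. Here the expensive re-ranker is a stand-in function (negative Euclidean distance on toy vectors); in production it would be a cross-encoder or VLM call:

```python
import numpy as np

def two_stage_search(query, vecs, rerank_fn, k1=1000, k2=100):
    """Stage 1: cheap cosine retrieval of k1 candidates.
    Stage 2: re-rank those candidates with a more expensive scorer."""
    q = query / np.linalg.norm(query)
    m = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    candidates = np.argsort(-(m @ q))[:k1]
    reranked = sorted(candidates,
                      key=lambda i: rerank_fn(query, vecs[i]),
                      reverse=True)
    return reranked[:k2]

vecs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]])
query = np.array([1.0, 0.05, 0.0])

# Stand-in re-ranker; higher score = better match.
top = two_stage_search(query, vecs,
                       lambda q, v: -np.linalg.norm(q - v), k1=3, k2=1)
print(top[0])  # → 0
```

The expensive scorer only ever sees k1 items per query, so its cost is bounded regardless of corpus size.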
Building a Multimodal Search System
End-to-end architecture for production
Architecture
// Production multimodal search stack
Ingestion  Upload → Extract modalities → Embed → Index
Search     Query → Embed → ANN search → Re-rank → Return

Components
  Embedding:  CLIP/SigLIP (self-hosted or API)
  Vector DB:  Qdrant / Pinecone / Weaviate
  Re-ranker:  Cross-encoder or VLM
  API:        FastAPI / Express
  Cache:      Redis for frequent queries
  Queue:      Kafka/SQS for async embedding
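The search path of the stack above fits in a small service class. This is an in-process sketch where the embedder, index, and cache are toy stand-ins for CLIP/SigLIP, a vector DB, and Redis respectively; every name here is illustrative:

```python
class SearchService:
    def __init__(self, embed_fn, search_fn, rerank_fn=None):
        self.embed_fn = embed_fn      # query → vector (CLIP/SigLIP stand-in)
        self.search_fn = search_fn    # vector, k → candidate ids (vector DB stand-in)
        self.rerank_fn = rerank_fn    # optional expensive scorer
        self.cache = {}               # Redis stand-in for frequent queries

    def search(self, query, k=10):
        if query in self.cache:
            return self.cache[query]
        vec = self.embed_fn(query)
        hits = self.search_fn(vec, k * 10)      # over-fetch for re-ranking
        if self.rerank_fn:
            hits = sorted(hits, key=self.rerank_fn, reverse=True)
        results = hits[:k]
        self.cache[query] = results
        return results

# Toy wiring: a dictionary "embedder" and a trivial "index".
service = SearchService(
    embed_fn=lambda q: {"cat": [1, 0], "dog": [0, 1]}[q],
    search_fn=lambda v, k: (["item_a", "item_b"] if v == [1, 0] else ["item_c"])[:k],
)
print(service.search("cat", k=1))  # → ['item_a']
```

The ingestion path (queue, batch embedding, index updates) runs asynchronously and never blocks this query path.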
Implementation Tips
Start simple: CLIP + pgvector is enough for <100K items
Normalize vectors: Always L2-normalize before storing for consistent cosine similarity
Store originals: Keep original content alongside embeddings for display
Version embeddings: When you upgrade the model, you need to re-embed everything
Monitor quality: Track search relevance metrics in production
A/B test: Compare embedding models on real user queries
Pro tip: The biggest mistake in multimodal search is over-engineering the infrastructure and under-investing in embedding quality. Start with the simplest stack that works, then invest in better embeddings and fine-tuning before scaling the infrastructure.
Key Takeaways
What to remember about multimodal embeddings and search
Essential Concepts
1. Shared embedding spaces (CLIP, CLAP, ImageBind) map different modalities into one vector space

2. Cross-modal search: Text → image, image → image, any → any using nearest neighbor search

3. Multimodal RAG: Retrieve text + images + charts for richer, grounded LLM responses

4. Fine-tuning embeddings on 1K+ domain pairs improves accuracy 15–30%

5. Re-ranking: Fast retrieval + expensive re-ranking = best cost/quality tradeoff
Action Items
• Start with CLIP + simple vector DB for prototyping
• Build a domain-specific eval set (query + expected results)
• Fine-tune embeddings if generic CLIP isn’t accurate enough
• Implement hybrid search (vector + metadata filters) for production
• Use multimodal RAG when documents contain charts, diagrams, or tables
Next up: Chapter 12 covers training multimodal models — data collection, pre-training strategies, alignment techniques, and the compute requirements for building your own multimodal AI.