Ch 11 — Multimodal Embeddings & Search

Shared embedding spaces, vector databases, cross-modal retrieval, and multimodal RAG
High Level
Embed → Store → Search → Cross-Modal → RAG → Build
Shared Embedding Spaces
One vector space for text, images, audio, and video
The Core Idea
A shared embedding space maps different modalities into the same vector space. A photo of a sunset, the text “beautiful sunset over the ocean,” and an audio clip of waves crashing all map to nearby vectors. This enables cross-modal operations: search images with text, find similar audio for a video, or cluster content across modalities.
Key Embedding Models
// Multimodal embedding models
CLIP       Text + Image (512-768d)
SigLIP     Text + Image (better scaling)
CLAP       Text + Audio (512d)
ImageBind  6 modalities (Meta, 1024d)
ONE-PEACE  Text + Image + Audio (1536d)
Jina CLIP  Text + Image (optimized for search)
How Cross-Modal Search Works
1. Index time: Encode all images/audio/video into vectors using the embedding model. Store in a vector database.
2. Query time: Encode the query (text, image, or audio) into the same vector space.
3. Search: Find the K nearest neighbors using cosine similarity or dot product.
4. Return: Results from any modality, ranked by similarity.
Key insight: Shared embedding spaces are the foundation of multimodal search. Without them, you’d need separate search systems for each modality. With them, a single text query can find relevant images, audio clips, and video segments simultaneously.
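The four steps above reduce to nearest-neighbor search in one vector space. A minimal sketch, using toy hand-written vectors in place of a real multimodal encoder such as CLIP (the vectors and item labels are illustrative, not real model outputs):

```python
import numpy as np

def cosine_top_k(query_vec, index_vecs, k=2):
    """Return (indices, scores) of the k nearest vectors by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    m = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    sims = m @ q
    order = np.argsort(-sims)[:k]
    return order, sims[order]

# Toy "shared space": in practice each row would come from the embedding
# model at index time, one entry per item of any modality.
index = np.array([
    [0.90, 0.10, 0.00],  # image: sunset photo
    [0.10, 0.90, 0.00],  # audio: waves crashing
    [0.85, 0.15, 0.10],  # image: another sunset
])
query = np.array([0.88, 0.12, 0.05])  # text query: "beautiful sunset"

ids, scores = cosine_top_k(query, index, k=2)
print(list(ids))  # → [0, 2]: both sunset images rank above the audio clip
```

Because every item lives in the same space, the text query retrieves the two sunset images without any keyword tags on them.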
Vector Databases for Multimodal
Storing and searching billions of multimodal embeddings
Vector DB Options
// Vector databases for multimodal search
Pinecone   Managed, serverless, easy scaling
Weaviate   Open-source, built-in multimodal support
Qdrant     Open-source, fast, Rust-based
Milvus     Open-source, billion-scale
ChromaDB   Simple, developer-friendly
pgvector   PostgreSQL extension (simple)

// For multimodal: store embedding + metadata
// Metadata: modality type, source URL,
// timestamp, dimensions, content hash
Indexing Strategy
HNSW (Hierarchical Navigable Small World): Best recall, higher memory. Default for most use cases.
IVF (Inverted File Index): Lower memory, good for billion-scale. Requires training.
Product Quantization: Compress vectors 4–8x with minimal recall loss. Essential at scale.
Hybrid search: Combine vector similarity with keyword/metadata filters for precision.
Key insight: For multimodal search, the vector database choice matters less than the embedding model quality. A great embedding model with a simple database outperforms a mediocre embedding model with the fanciest database.
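Hybrid search, the last strategy above, can be sketched in a few lines: filter by metadata first, then run cosine similarity over the survivors. This is a toy in-memory version, assuming the metadata layout from the panel above (real databases push the filter into the index instead of scanning):

```python
import numpy as np

def hybrid_search(query_vec, vecs, metadata, k=5, **filters):
    """Cosine search restricted to items whose metadata matches every filter."""
    keep = [i for i, meta in enumerate(metadata)
            if all(meta.get(f) == v for f, v in filters.items())]
    if not keep:
        return []
    q = query_vec / np.linalg.norm(query_vec)
    sub = vecs[keep] / np.linalg.norm(vecs[keep], axis=1, keepdims=True)
    sims = sub @ q
    order = np.argsort(-sims)[:k]
    return [(keep[i], float(sims[i])) for i in order]

vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
metadata = [{"modality": "audio"}, {"modality": "image"}, {"modality": "image"}]
query = np.array([1.0, 0.0])

# Unfiltered, item 0 (audio) wins; the filter restricts results to images.
hits = hybrid_search(query, vecs, metadata, k=1, modality="image")
print(hits[0][0])  # → 1
```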
Cross-Modal Retrieval Patterns
Text-to-image, image-to-image, and beyond
Search Patterns
Text → Image: “red sports car on a mountain road” → matching photos
Image → Image: Upload a photo, find visually similar images
Text → Audio: “jazz piano with rain sounds” → matching audio clips
Image → Text: Find documents that describe a given image
Audio → Video: Find video clips matching an audio snippet
Any → Any: With unified models like ImageBind, search across all 6 modalities
Real-World Applications
E-commerce: Upload a photo of a dress, find similar products for sale
Stock media: Search millions of images/videos/audio with natural language
Content moderation: Find images similar to known harmful content
Digital asset management: Organize and search corporate media libraries
Duplicate detection: Find near-duplicate images across large collections
Recommendation: “More like this” across content types
Key insight: Cross-modal search enables entirely new user experiences. Instead of tagging every image with keywords, users can search with natural language. Instead of browsing categories, they can upload a reference image and find similar content instantly.
Multimodal RAG
Retrieval-Augmented Generation with images, documents, and more
How Multimodal RAG Works
Standard RAG retrieves text chunks. Multimodal RAG retrieves text, images, charts, tables, and audio to provide richer context to the LLM:

1. Index: Embed documents (text + images + tables) into vector DB
2. Query: User asks a question (text or image)
3. Retrieve: Find relevant text chunks AND images/charts
4. Generate: VLM processes retrieved text + images together
5. Answer: Response grounded in both textual and visual evidence
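Steps 3 and 4 hinge on assembling mixed retrieved chunks into one VLM input. A minimal sketch; the chunk records, scores, and output shape here are illustrative assumptions, not a specific VLM API:

```python
# Mixed chunks as a multimodal retriever might return them (toy data).
chunks = [
    {"type": "text",  "content": "Revenue grew 12% YoY.", "score": 0.91},
    {"type": "image", "content": "chart_page3.png",       "score": 0.87},
    {"type": "table", "content": "table_page5.png",       "score": 0.40},
]

def build_vlm_input(question, retrieved, min_score=0.6):
    """Drop low-relevance chunks, then split context into text and image parts."""
    kept = [c for c in retrieved if c["score"] >= min_score]
    text_context = "\n".join(c["content"] for c in kept if c["type"] == "text")
    image_refs = [c["content"] for c in kept if c["type"] in ("image", "table")]
    return {"question": question, "context": text_context, "images": image_refs}

vlm_input = build_vlm_input("How did revenue change?", chunks)
print(vlm_input["images"])  # → ['chart_page3.png']
```

The text context and image references would then be passed together to the VLM so the answer can cite both.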
Document Processing Pipeline
// Multimodal RAG for documents
1. Parse     PDF → text + images + tables
2. Chunk     Text into semantic chunks
3. Embed     Text chunks with text embedder
             Images with CLIP/SigLIP
             Tables as screenshots with CLIP
4. Store     All embeddings in vector DB
             with metadata (page, type, source)
5. Query     Retrieve text + visual chunks
6. Generate  Feed to VLM for grounded answer
Key insight: Many documents contain critical information in charts, diagrams, and tables that text-only RAG completely misses. Multimodal RAG captures this visual information, dramatically improving answer quality for technical documents, financial reports, and scientific papers.
Embedding Quality & Fine-Tuning
Improving retrieval accuracy for your domain
Measuring Embedding Quality
Recall@K: What fraction of relevant items appear in top-K results?
Precision@K: What fraction of top-K results are relevant?
MRR (Mean Reciprocal Rank): How high does the first relevant result rank?
NDCG: Normalized Discounted Cumulative Gain — accounts for ranking quality

Always measure on your data, not generic benchmarks.
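The first three metrics are a few lines each, which makes building that domain eval set the only hard part. A sketch over toy retrieval results:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear in the top-k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return len(set(retrieved[:k]) & set(relevant)) / k

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant result (0 if none retrieved)."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["img7", "img2", "img9", "img4"]   # system output, best first
relevant = {"img2", "img4"}                    # ground truth for this query

print(recall_at_k(retrieved, relevant, 3))  # → 0.5 (1 of 2 relevant in top-3)
print(mrr(retrieved, relevant))             # → 0.5 (first hit at rank 2)
```

In practice you would average each metric over every query in the eval set.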
Fine-Tuning Embeddings
Contrastive fine-tuning: Provide positive/negative pairs from your domain. “This image matches this text” / “This image doesn’t match this text.”
Hard negative mining: Use the most confusing negative examples for better discrimination
Domain adaptation: Fine-tune CLIP on medical images, satellite imagery, or product photos
Matryoshka embeddings: Train embeddings that work at multiple dimensions (256, 512, 768) for flexible cost/quality tradeoff
Key insight: Fine-tuning embeddings on even 1,000 domain-specific pairs can improve retrieval accuracy by 15–30%. This is the highest-ROI optimization for multimodal search — more impactful than changing vector databases or indexing strategies.
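Contrastive fine-tuning typically optimizes a symmetric InfoNCE objective, where each matched image-text pair is the positive and every other pair in the batch is a negative. A NumPy sketch of the loss (the embeddings here are toy identity vectors, not real model outputs):

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: pair (i, i) is the positive, all others negatives."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) similarity matrix

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))     # targets are the diagonal

    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

# Aligned pairs score a much lower loss than mismatched ones.
aligned = clip_style_loss(np.eye(4), np.eye(4))
shuffled = clip_style_loss(np.eye(4), np.eye(4)[::-1])
print(aligned < shuffled)  # → True
```

Hard negative mining changes which rows end up in the batch; the loss itself stays the same.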
Scaling Multimodal Search
From thousands to billions of items
Scale Considerations
// Scaling multimodal search
1K items    Brute force works fine
            Any vector DB, any hardware
100K items  HNSW index, single machine
            ~1 GB memory for 768d vectors
10M items   HNSW + quantization
            ~10 GB memory, fast queries
1B items    Distributed index (Milvus/Qdrant)
            IVF + PQ, multiple nodes
            ~100 GB compressed

// Embedding generation is the bottleneck
// CLIP: ~1000 images/sec on A100
// 1B images = ~12 days on a single GPU
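These capacity figures are simple arithmetic you can redo for your own corpus. A sketch, assuming float32 vectors and a rough 1.5x index-overhead factor (the overhead factor and quantized bytes-per-dimension are assumptions, not measured values):

```python
def index_memory_gb(n_items, dim=768, bytes_per_value=4, overhead=1.5):
    """Rough RAM estimate: raw vector bytes times an index-overhead factor."""
    return n_items * dim * bytes_per_value * overhead / 1e9

def embed_days(n_items, items_per_sec=1000):
    """Wall-clock days to embed a corpus at a given single-GPU throughput."""
    return n_items / items_per_sec / 86_400

print(round(index_memory_gb(100_000), 2))      # 100K float32 768d vectors
print(round(index_memory_gb(10_000_000,
                            bytes_per_value=1), 1))  # ~1 byte/dim after quantization
print(round(embed_days(1_000_000_000), 1))     # → 11.6 days at 1000 img/sec
```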
Production Architecture
Batch embedding pipeline: Process new content asynchronously, update index periodically
Caching: Cache frequent queries and their results
Tiered search: Fast approximate search first, then re-rank top-100 with a more expensive model
Hybrid search: Combine vector similarity with metadata filters (date, category, source)
Multi-index: Separate indexes for different content types, merge results at query time
Key insight: The re-ranking pattern is crucial at scale: use a fast, cheap embedding model for initial retrieval (top-1000), then a more expensive model (or VLM) to re-rank the top-100. This gives you the speed of simple embeddings with the accuracy of sophisticated models.
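The re-ranking pattern can be sketched as two stages over one candidate pool. Here the expensive re-ranker is a stand-in function (negative Euclidean distance on toy vectors); in production it would be a cross-encoder or VLM call:

```python
import numpy as np

def two_stage_search(query, vecs, rerank_fn, k1=1000, k2=100):
    """Stage 1: cheap cosine retrieval of k1 candidates.
    Stage 2: re-rank those candidates with a more expensive scorer."""
    q = query / np.linalg.norm(query)
    m = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    candidates = np.argsort(-(m @ q))[:k1]
    reranked = sorted(candidates,
                      key=lambda i: rerank_fn(query, vecs[i]),
                      reverse=True)
    return reranked[:k2]

vecs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]])
query = np.array([1.0, 0.05, 0.0])

# Stand-in re-ranker; higher score = better match.
top = two_stage_search(query, vecs,
                       lambda q, v: -np.linalg.norm(q - v), k1=3, k2=1)
print(top[0])  # → 0
```

The expensive scorer only ever sees k1 items per query, so its cost is bounded regardless of corpus size.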
Building a Multimodal Search System
End-to-end architecture for production
Architecture
// Production multimodal search stack
Ingestion  Upload → Extract modalities → Embed → Index
Search     Query → Embed → ANN search → Re-rank → Return

Components
  Embedding:  CLIP/SigLIP (self-hosted or API)
  Vector DB:  Qdrant / Pinecone / Weaviate
  Re-ranker:  Cross-encoder or VLM
  API:        FastAPI / Express
  Cache:      Redis for frequent queries
  Queue:      Kafka/SQS for async embedding
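The search path of the stack above fits in a small service class. This is an in-process sketch where the embedder, index, and cache are toy stand-ins for CLIP/SigLIP, a vector DB, and Redis respectively; every name here is illustrative:

```python
class SearchService:
    def __init__(self, embed_fn, search_fn, rerank_fn=None):
        self.embed_fn = embed_fn      # query → vector (CLIP/SigLIP stand-in)
        self.search_fn = search_fn    # vector, k → candidate ids (vector DB stand-in)
        self.rerank_fn = rerank_fn    # optional expensive scorer
        self.cache = {}               # Redis stand-in for frequent queries

    def search(self, query, k=10):
        if query in self.cache:
            return self.cache[query]
        vec = self.embed_fn(query)
        hits = self.search_fn(vec, k * 10)      # over-fetch for re-ranking
        if self.rerank_fn:
            hits = sorted(hits, key=self.rerank_fn, reverse=True)
        results = hits[:k]
        self.cache[query] = results
        return results

# Toy wiring: a dictionary "embedder" and a trivial "index".
service = SearchService(
    embed_fn=lambda q: {"cat": [1, 0], "dog": [0, 1]}[q],
    search_fn=lambda v, k: (["item_a", "item_b"] if v == [1, 0] else ["item_c"])[:k],
)
print(service.search("cat", k=1))  # → ['item_a']
```

The ingestion path (queue, batch embedding, index updates) runs asynchronously and never blocks this query path.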
Implementation Tips
Start simple: CLIP + pgvector is enough for <100K items
Normalize vectors: Always L2-normalize before storing for consistent cosine similarity
Store originals: Keep original content alongside embeddings for display
Version embeddings: When you upgrade the model, you need to re-embed everything
Monitor quality: Track search relevance metrics in production
A/B test: Compare embedding models on real user queries
Pro tip: The biggest mistake in multimodal search is over-engineering the infrastructure and under-investing in embedding quality. Start with the simplest stack that works, then invest in better embeddings and fine-tuning before scaling the infrastructure.
Key Takeaways
What to remember about multimodal embeddings and search
Essential Concepts
1. Shared embedding spaces (CLIP, CLAP, ImageBind) map different modalities into one vector space

2. Cross-modal search: Text → image, image → image, any → any using nearest neighbor search

3. Multimodal RAG: Retrieve text + images + charts for richer, grounded LLM responses

4. Fine-tuning embeddings on 1K+ domain pairs improves accuracy 15–30%

5. Re-ranking: Fast retrieval + expensive re-ranking = best cost/quality tradeoff
Action Items
• Start with CLIP + simple vector DB for prototyping
• Build a domain-specific eval set (query + expected results)
• Fine-tune embeddings if generic CLIP isn’t accurate enough
• Implement hybrid search (vector + metadata filters) for production
• Use multimodal RAG when documents contain charts, diagrams, or tables
Next up: Chapter 12 covers training multimodal models — data collection, pre-training strategies, alignment techniques, and the compute requirements for building your own multimodal AI.