Ch 4 — Embeddings: Text to Vectors

How text becomes searchable numbers
High Level
The pipeline: Text → Tokenize → Model → Vector → Similarity → Benchmark → Choose
What Are Embeddings?
Converting text into numerical vectors that capture meaning
The Core Idea
An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Texts with similar meanings get similar vectors. "How do I return a product?" and "What is the refund process?" would have vectors that are close together, even though they share few words.
Why RAG Needs Embeddings
Traditional keyword search fails when users use different words than the documents. Embeddings enable semantic search — finding text by meaning, not exact word matches. This is the foundation of the entire retrieval step in RAG.
# What an embedding looks like
text = "How do I return a product?"
vector = embed(text)
# → [0.023, -0.041, 0.087, ..., -0.012]
#   1536 numbers (for text-embedding-3-small)
#   3072 numbers (for text-embedding-3-large)

# Similar texts → similar vectors
# "refund process" → cosine_sim = 0.92
# "weather today"  → cosine_sim = 0.11
Embeddings are the bridge between text and math. They let you use mathematical operations (distance, similarity) on natural language. Without embeddings, there is no semantic search, and without semantic search, RAG falls back to keyword matching.
How Embedding Models Work
Transformer encoders trained on massive text pairs
The Architecture
Embedding models are transformer encoders (like BERT, not like GPT). They read the entire input at once and produce a single vector that summarizes the meaning. The model is trained on millions of text pairs — "this sentence is similar to that sentence" — using contrastive learning.
The Training Process
Contrastive learning: Show the model pairs of similar texts (positive pairs) and dissimilar texts (negative pairs). Train it to push similar pairs closer together in vector space and dissimilar pairs further apart. The result: a model that maps meaning to geometry.
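The training loop above can be sketched in a few lines of numpy. This is an illustrative InfoNCE-style objective with in-batch negatives, not any specific model's recipe; the batch size, temperature, and synthetic "texts" are all made up:

```python
# Toy in-batch contrastive (InfoNCE-style) loss in numpy.
# Rows of `anchors` and `positives` are paired; every other row in the
# batch serves as a negative for that anchor. All values are illustrative.
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def contrastive_loss(anchors, positives, temperature=0.05):
    a = normalize(anchors)
    p = normalize(positives)
    logits = a @ p.T / temperature  # scaled cosine similarities
    # The correct match for row i is column i (the diagonal)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
positives = anchors + 0.1 * rng.normal(size=(4, 8))  # "similar" pairs
random_texts = rng.normal(size=(4, 8))               # unrelated "texts"

# Aligned pairs yield a much lower loss than random pairings
print(contrastive_loss(anchors, positives) < contrastive_loss(anchors, random_texts))  # → True
```

Training pushes the loss down, which geometrically means pulling each positive pair together and pushing the in-batch negatives apart.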
Key Properties
Fixed-size output: Regardless of input length (5 words or 500 words), the output is always the same number of dimensions (e.g., 1536).

Dense vectors: Every dimension has a non-zero value, unlike sparse representations (like TF-IDF) where most values are zero.

Semantic similarity = geometric proximity: Texts about similar topics cluster together in vector space. You can literally measure meaning with cosine similarity.
Embedding models are NOT generative LLMs. GPT-4o generates text token by token. Embedding models read the full input and output a single vector. They are smaller, faster, and cheaper. A typical embedding API call costs ~$0.02 per million tokens.
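To make the dense-vs-sparse property concrete, here is a toy comparison; the vocabulary size, word indices, and the random stand-in for an embedding are all illustrative:

```python
# Sparse bag-of-words vs dense embedding: fraction of nonzero entries.
# Vocabulary size, indices, and the stand-in vector are illustrative.
import numpy as np

vocab_size = 10_000
sparse = np.zeros(vocab_size)
sparse[[12, 480, 3301, 7777, 9002]] = 1.0   # only the words that appear

rng = np.random.default_rng(0)
dense = rng.normal(size=1536)               # stand-in for a real embedding

print(np.count_nonzero(sparse) / sparse.size)  # → 0.0005
print(np.count_nonzero(dense) / dense.size)    # → 1.0
```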
Popular Embedding Models
The landscape of commercial and open-source options
Commercial APIs
OpenAI text-embedding-3-small — 1536 dims, best price/performance. ~$0.02/1M tokens.

OpenAI text-embedding-3-large — 3072 dims, higher accuracy. ~$0.13/1M tokens. Supports Matryoshka (truncate to fewer dims).

Cohere embed-v3 — 1024 dims. Supports separate input types for documents vs queries. Multilingual (100+ languages).

Google text-embedding-004 — 768 dims. Free tier available via Vertex AI. Good multilingual support.
Open-Source (Run Locally)
BGE (BAAI) — bge-large-en-v1.5, bge-m3 (multilingual). Top MTEB scores. Run via sentence-transformers or HuggingFace.

E5 (Microsoft) — e5-large-v2, multilingual-e5-large. Requires "query: " and "passage: " prefixes.

GTE (Alibaba) — gte-large-en-v1.5. Strong performance, no prefix needed.

Nomic Embed — nomic-embed-text-v1.5. 8192 token context. Open weights + open training data.
Open-source models match or beat commercial APIs on benchmarks. BGE-m3 and GTE-large score competitively with OpenAI on MTEB. The trade-off: you host the model yourself (GPU required for speed) vs paying per API call.
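The E5 prefix convention above is easy to forget, so here is a tiny sketch of what it amounts to before the text reaches the encoder (the helper names are ours, not part of any library):

```python
# E5-family models expect a role prefix on the raw text before encoding;
# these helper functions are illustrative, not part of any library.
def as_query(text):
    return "query: " + text

def as_passage(text):
    return "passage: " + text

print(as_query("how do I return a product?"))
print(as_passage("Returns are accepted within 30 days."))
```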
Similarity Measures
How to compare two vectors
Cosine Similarity
The most common measure for embeddings. Measures the angle between two vectors, ignoring their magnitude. Returns a value from -1 (opposite) to 1 (identical). In practice, most embedding similarities fall between 0.3 and 0.95.

Formula: cos(θ) = (A · B) / (||A|| × ||B||)
Other Measures
Dot product: Like cosine but affected by vector magnitude. Faster to compute. Used when vectors are already normalized (magnitude = 1).

Euclidean distance: Straight-line distance between vector endpoints. Less common for embeddings because it is sensitive to magnitude differences.
# Computing cosine similarity in Python
import numpy as np

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example
v1 = embed("refund policy")
v2 = embed("return process")
v3 = embed("pizza recipe")
cosine_sim(v1, v2)  # → 0.91 (similar)
cosine_sim(v1, v3)  # → 0.14 (unrelated)
Use cosine similarity unless your vector DB says otherwise. Most embedding models are trained with cosine similarity in mind. Some vector databases (Pinecone, Weaviate) let you choose the metric at index creation time. OpenAI embeddings are normalized, so cosine = dot product.
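The relationships between these measures are easy to verify numerically. For unit-length vectors, cosine equals the dot product, and squared Euclidean distance reduces to 2 − 2·cos; a quick check with random stand-in vectors:

```python
# For normalized (unit-length) vectors:
#   cosine == dot product
#   squared Euclidean distance == 2 - 2 * cosine
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=1536)
b = rng.normal(size=1536)
a /= np.linalg.norm(a)   # normalize to unit length
b /= np.linalg.norm(b)

cos = float(a @ b)
dot = float(np.dot(a, b))
euclid_sq = float(np.sum((a - b) ** 2))

assert np.isclose(cos, dot)
assert np.isclose(euclid_sq, 2 - 2 * cos)
```

This is why ranking by cosine, dot product, or Euclidean distance gives identical orderings once vectors are normalized.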
Dimensionality & Matryoshka Embeddings
More dimensions = more accuracy, but at a cost
Dimensions Matter
Each dimension captures some aspect of meaning, though individual dimensions are not directly interpretable. More dimensions = more nuance = usually better retrieval. But more dimensions also mean more storage (4 bytes per float32 × dimensions × number of chunks) and slower search.

Common sizes: 384 (small), 768 (medium), 1024 (Cohere), 1536 (OpenAI small), 3072 (OpenAI large).
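The storage arithmetic can be written out directly; the one-million-chunk corpus here is a hypothetical example:

```python
# Raw vector storage = 4 bytes (float32) × dimensions × number of chunks
# (index overhead not included). Corpus size is hypothetical.
def storage_gb(dims, num_chunks, bytes_per_float=4):
    return dims * num_chunks * bytes_per_float / 1e9

for dims in (384, 768, 1536, 3072):
    print(dims, storage_gb(dims, 1_000_000))
# 384 → 1.536 GB, 768 → 3.072 GB, 1536 → 6.144 GB, 3072 → 12.288 GB
```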
Matryoshka Representation Learning
Matryoshka embeddings (Kusupati et al., 2022) are trained so that the first N dimensions are a valid, useful embedding on their own. You can truncate a 3072-dim vector to 256 dims and still get good results — just slightly less accurate. This lets you trade accuracy for speed and storage at query time.
# OpenAI text-embedding-3-large supports Matryoshka
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="refund policy",
    dimensions=256,  # truncate from 3072 to 256
)
# Full:      3072 dims → 12 KB per vector
# Truncated:  256 dims →  1 KB per vector
# 12x less storage, ~5% accuracy drop
Matryoshka is a game-changer for large corpora. If you have 10M chunks at 3072 dims, that is 120 GB of vectors. Truncating to 256 dims drops it to 10 GB with minimal accuracy loss. Start with full dimensions, then truncate if storage or latency becomes a problem.
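If you hold the vectors yourself (e.g., from a locally run Matryoshka-trained model) rather than asking the API to truncate, truncation is just slicing plus re-normalizing. A sketch with a random stand-in vector:

```python
# Truncate a Matryoshka embedding locally: keep the first N dimensions,
# then re-normalize so cosine similarity still behaves.
import numpy as np

def truncate(vec, dims):
    v = np.asarray(vec)[:dims]
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
full = rng.normal(size=3072)   # stand-in for a real embedding
short = truncate(full, 256)

print(short.shape)             # → (256,)
print(np.linalg.norm(short))   # unit length after re-normalizing
```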
The MTEB Benchmark
How to compare embedding models objectively
What Is MTEB
The Massive Text Embedding Benchmark (Muennighoff et al., 2023) evaluates embedding models across 8 task types and 58+ datasets. It is the standard benchmark for comparing embedding models. The leaderboard is hosted on HuggingFace.
Task Types
Retrieval — The most relevant for RAG. How well does the model find relevant passages?
STS — Semantic Textual Similarity. How well does cosine similarity correlate with human judgments?
Classification — Using embeddings as features for text classification.
Clustering — Grouping similar texts together.
Reranking, Pair Classification, Summarization, Bitext Mining — the remaining task types, less directly relevant to RAG.
How to Read the Leaderboard
Focus on the Retrieval column for RAG use cases. The overall average includes tasks (like classification) that may not matter for your application.

Check the model size. A 7B parameter model may score 2% higher than a 330M model, but it is 20x slower and more expensive to run.

Check the context length. Some models support 512 tokens max, others support 8192. Make sure it covers your chunk sizes.
MTEB is a starting point, not the final answer. Benchmark performance on academic datasets may not reflect performance on your specific domain. Always test the top 2-3 models on your actual data before committing.
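Testing on your own data can be as simple as a recall@k check over a handful of (query, known-relevant-chunk) pairs. Everything here is placeholder data; in practice you would plug in vectors from each candidate model:

```python
# Tiny retrieval eval: for each query, is the known-relevant chunk among
# the top-k nearest chunks by cosine similarity?
import numpy as np

def recall_at_k(query_vecs, chunk_vecs, relevant_idx, k=3):
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = q @ c.T                           # cosine similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]  # best k chunks per query
    hits = [relevant_idx[i] in topk[i] for i in range(len(relevant_idx))]
    return sum(hits) / len(hits)

# Toy data: 3 queries, 5 chunks; query i should retrieve chunk i.
rng = np.random.default_rng(0)
chunks = rng.normal(size=(5, 8))
queries = chunks[:3] + 0.05 * rng.normal(size=(3, 8))  # near-duplicates
print(recall_at_k(queries, chunks, relevant_idx=[0, 1, 2], k=1))  # → 1.0
```

Run the same labeled pairs through each candidate model's embeddings and compare the scores.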
Choosing the Right Embedding Model
A practical decision framework
Decision Framework
Starting out / prototyping?
Use text-embedding-3-small (OpenAI). Cheap, fast, good enough. No infrastructure needed.

Need best accuracy?
Use text-embedding-3-large (OpenAI) or embed-v3 (Cohere). Check MTEB Retrieval scores.

Need to run locally / data privacy?
Use BGE-m3, GTE-large, or Nomic Embed via sentence-transformers. Requires a GPU for production throughput.

Multilingual corpus?
Use Cohere embed-v3, BGE-m3, or multilingual-e5-large. Test on your specific languages.
# OpenAI (simplest)
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["chunk 1", "chunk 2"],
)
vectors = [e.embedding for e in resp.data]

# Open-source (local)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
vectors = model.encode(["chunk 1", "chunk 2"])
You can always switch models later. Changing your embedding model means re-embedding all your chunks (a batch job). It is not free, but it is straightforward. Start with the simplest option, measure retrieval quality, and upgrade only if needed.
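The re-embedding job itself is mostly batching. A minimal sketch, where `embed_batch` is a stand-in for whichever model or API you migrate to:

```python
# Re-embed a corpus in batches; `embed_batch` is a placeholder for the
# new model or API call (batch size is also illustrative).
def batched(items, batch_size=100):
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def embed_batch(texts):
    # placeholder: a real implementation would call the new model here
    return [[float(len(t))] for t in texts]

chunks = [f"chunk {i}" for i in range(250)]
vectors = []
for batch in batched(chunks):
    vectors.extend(embed_batch(batch))

print(len(vectors))  # → 250
```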