Ch 4 — Embeddings: Text to Vectors

How text becomes searchable numbers
High Level
The pipeline: Text → Tokenize → Model → Vector → Similarity → Benchmark → Choose
What Are Embeddings?
Converting text into numerical vectors that capture meaning
The Core Idea
An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Texts with similar meanings get similar vectors. "How do I return a product?" and "What is the refund process?" would have vectors that are close together, even though they share few words.
Why RAG Needs Embeddings
Traditional keyword search fails when users use different words than the documents. Embeddings enable semantic search — finding text by meaning, not exact word matches. This is the foundation of the entire retrieval step in RAG.
# What an embedding looks like
text = "How do I return a product?"
vector = embed(text)
# → [0.023, -0.041, 0.087, ..., -0.012]
#   1536 numbers (for text-embedding-3-small)
#   3072 numbers (for text-embedding-3-large)

# Similar texts → similar vectors
# "refund process" → cosine_sim = 0.92
# "weather today"  → cosine_sim = 0.11
Embeddings are the bridge between text and math. They let you use mathematical operations (distance, similarity) on natural language. Without embeddings, there is no semantic search, and without semantic search, RAG falls back to keyword matching.
How Embedding Models Work
Transformer encoders trained on massive text pairs
The Architecture
Embedding models are transformer encoders (like BERT, not like GPT). They read the entire input at once and produce a single vector that summarizes the meaning. The model is trained on millions of text pairs — "this sentence is similar to that sentence" — using contrastive learning.
The Training Process
Contrastive learning: Show the model pairs of similar texts (positive pairs) and dissimilar texts (negative pairs). Train it to push similar pairs closer together in vector space and dissimilar pairs further apart. The result: a model that maps meaning to geometry.
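The training loop above can be sketched in a few lines of numpy. This is an illustrative InfoNCE-style objective with in-batch negatives, not any specific model's recipe; the batch size, temperature, and synthetic "texts" are all made up:

```python
# Toy in-batch contrastive (InfoNCE-style) loss in numpy.
# Rows of `anchors` and `positives` are paired; every other row in the
# batch serves as a negative for that anchor. All values are illustrative.
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def contrastive_loss(anchors, positives, temperature=0.05):
    a = normalize(anchors)
    p = normalize(positives)
    logits = a @ p.T / temperature  # scaled cosine similarities
    # The correct match for row i is column i (the diagonal)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
positives = anchors + 0.1 * rng.normal(size=(4, 8))  # "similar" pairs
random_texts = rng.normal(size=(4, 8))               # unrelated "texts"

# Aligned pairs yield a much lower loss than random pairings
print(contrastive_loss(anchors, positives) < contrastive_loss(anchors, random_texts))  # → True
```

Training pushes the loss down, which geometrically means pulling each positive pair together and pushing the in-batch negatives apart.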
Key Properties
Fixed-size output: Regardless of input length (5 words or 500 words), the output is always the same number of dimensions (e.g., 1536).

Dense vectors: Every dimension has a non-zero value, unlike sparse representations (like TF-IDF) where most values are zero.

Semantic similarity = geometric proximity: Texts about similar topics cluster together in vector space. You can literally measure meaning with cosine similarity.
Embedding models are NOT generative LLMs. GPT-4o generates text token by token. Embedding models read the full input and output a single vector. They are smaller, faster, and cheaper. A typical embedding API call costs ~$0.02 per million tokens.
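To make the dense-vs-sparse property concrete, here is a toy comparison; the vocabulary size, word indices, and the random stand-in for an embedding are all illustrative:

```python
# Sparse bag-of-words vs dense embedding: fraction of nonzero entries.
# Vocabulary size, indices, and the stand-in vector are illustrative.
import numpy as np

vocab_size = 10_000
sparse = np.zeros(vocab_size)
sparse[[12, 480, 3301, 7777, 9002]] = 1.0   # only the words that appear

rng = np.random.default_rng(0)
dense = rng.normal(size=1536)               # stand-in for a real embedding

print(np.count_nonzero(sparse) / sparse.size)  # → 0.0005
print(np.count_nonzero(dense) / dense.size)    # → 1.0
```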
Popular Embedding Models
The landscape of commercial and open-source options
Commercial APIs
OpenAI text-embedding-3-small — 1536 dims, best price/performance. ~$0.02/1M tokens.

OpenAI text-embedding-3-large — 3072 dims, higher accuracy. ~$0.13/1M tokens. Supports Matryoshka (truncate to fewer dims).

Cohere embed-v3 — 1024 dims. Supports separate input types for documents vs queries. Multilingual (100+ languages).

Google text-embedding-004 — 768 dims. Free tier available via Vertex AI. Good multilingual support.
Open-Source (Run Locally)
BGE (BAAI) — bge-large-en-v1.5, bge-m3 (multilingual). Top MTEB scores. Run via sentence-transformers or HuggingFace.

E5 (Microsoft) — e5-large-v2, multilingual-e5-large. Requires "query: " and "passage: " prefixes.

GTE (Alibaba) — gte-large-en-v1.5. Strong performance, no prefix needed.

Nomic Embed — nomic-embed-text-v1.5. 8192 token context. Open weights + open training data.
Open-source models match or beat commercial APIs on benchmarks. BGE-m3 and GTE-large score competitively with OpenAI on MTEB. The trade-off: you host the model yourself (GPU required for speed) vs paying per API call.
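The E5 prefix convention above is easy to forget, so here is a tiny sketch of what it amounts to before the text reaches the encoder (the helper names are ours, not part of any library):

```python
# E5-family models expect a role prefix on the raw text before encoding;
# these helper functions are illustrative, not part of any library.
def as_query(text):
    return "query: " + text

def as_passage(text):
    return "passage: " + text

print(as_query("how do I return a product?"))
print(as_passage("Returns are accepted within 30 days."))
```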
Similarity Measures
How to compare two vectors
Cosine Similarity
The most common measure for embeddings. Measures the angle between two vectors, ignoring their magnitude. Returns a value from -1 (opposite) to 1 (identical). In practice, most embedding similarities fall between 0.3 and 0.95.

Formula: cos(θ) = (A · B) / (||A|| × ||B||)
Other Measures
Dot product: Like cosine but affected by vector magnitude. Faster to compute. Used when vectors are already normalized (magnitude = 1).

Euclidean distance: Straight-line distance between vector endpoints. Less common for embeddings because it is sensitive to magnitude differences.
# Computing cosine similarity in Python
import numpy as np

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example
v1 = embed("refund policy")
v2 = embed("return process")
v3 = embed("pizza recipe")
cosine_sim(v1, v2)  # → 0.91 (similar)
cosine_sim(v1, v3)  # → 0.14 (unrelated)
Use cosine similarity unless your vector DB says otherwise. Most embedding models are trained with cosine similarity in mind. Some vector databases (Pinecone, Weaviate) let you choose the metric at index creation time. OpenAI embeddings are normalized, so cosine = dot product.
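The relationships between these measures are easy to verify numerically. For unit-length vectors, cosine equals the dot product, and squared Euclidean distance reduces to 2 − 2·cos; a quick check with random stand-in vectors:

```python
# For normalized (unit-length) vectors:
#   cosine == dot product
#   squared Euclidean distance == 2 - 2 * cosine
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=1536)
b = rng.normal(size=1536)
a /= np.linalg.norm(a)   # normalize to unit length
b /= np.linalg.norm(b)

cos = float(a @ b)
dot = float(np.dot(a, b))
euclid_sq = float(np.sum((a - b) ** 2))

assert np.isclose(cos, dot)
assert np.isclose(euclid_sq, 2 - 2 * cos)
```

This is why ranking by cosine, dot product, or Euclidean distance gives identical orderings once vectors are normalized.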
Dimensionality & Matryoshka Embeddings
More dimensions = more accuracy, but at a cost
Dimensions Matter
Each dimension captures some aspect of meaning, though individual dimensions are not directly interpretable. More dimensions = more nuance = usually better retrieval. But more dimensions also mean more storage (4 bytes per float32 × dimensions × number of chunks) and slower search.

Common sizes: 384 (small), 768 (medium), 1024 (Cohere), 1536 (OpenAI small), 3072 (OpenAI large).
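The storage arithmetic can be written out directly; the one-million-chunk corpus here is a hypothetical example:

```python
# Raw vector storage = 4 bytes (float32) × dimensions × number of chunks
# (index overhead not included). Corpus size is hypothetical.
def storage_gb(dims, num_chunks, bytes_per_float=4):
    return dims * num_chunks * bytes_per_float / 1e9

for dims in (384, 768, 1536, 3072):
    print(dims, storage_gb(dims, 1_000_000))
# 384 → 1.536 GB, 768 → 3.072 GB, 1536 → 6.144 GB, 3072 → 12.288 GB
```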
Matryoshka Representation Learning
Matryoshka embeddings (Kusupati et al., 2022) are trained so that the first N dimensions are a valid, useful embedding on their own. You can truncate a 3072-dim vector to 256 dims and still get good results — just slightly less accurate. This lets you trade accuracy for speed and storage at query time.
# OpenAI text-embedding-3-large supports Matryoshka
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="refund policy",
    dimensions=256,  # truncate from 3072 to 256
)
# Full:      3072 dims → 12 KB per vector
# Truncated:  256 dims →  1 KB per vector
# 12x less storage, ~5% accuracy drop
Matryoshka is a game-changer for large corpora. If you have 10M chunks at 3072 dims, that is 120 GB of vectors. Truncating to 256 dims drops it to 10 GB with minimal accuracy loss. Start with full dimensions, then truncate if storage or latency becomes a problem.
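If you hold the vectors yourself (e.g., from a locally run Matryoshka-trained model) rather than asking the API to truncate, truncation is just slicing plus re-normalizing. A sketch with a random stand-in vector:

```python
# Truncate a Matryoshka embedding locally: keep the first N dimensions,
# then re-normalize so cosine similarity still behaves.
import numpy as np

def truncate(vec, dims):
    v = np.asarray(vec)[:dims]
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
full = rng.normal(size=3072)   # stand-in for a real embedding
short = truncate(full, 256)

print(short.shape)             # → (256,)
print(np.linalg.norm(short))   # unit length after re-normalizing
```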
The MTEB Benchmark
How to compare embedding models objectively
What Is MTEB
The Massive Text Embedding Benchmark (Muennighoff et al., 2023) evaluates embedding models across 8 task types and 58+ datasets. It is the standard benchmark for comparing embedding models. The leaderboard is hosted on HuggingFace.
Task Types
Retrieval — The most relevant for RAG. How well does the model find relevant passages?
STS — Semantic Textual Similarity. How well does cosine similarity correlate with human judgments?
Classification — Using embeddings as features for text classification.
Clustering — Grouping similar texts together.
Reranking, Pair Classification, Summarization, Bitext Mining — the remaining task types, less directly relevant to RAG.
How to Read the Leaderboard
Focus on the Retrieval column for RAG use cases. The overall average includes tasks (like classification) that may not matter for your application.

Check the model size. A 7B parameter model may score 2% higher than a 330M model, but it is 20x slower and more expensive to run.

Check the context length. Some models support 512 tokens max, others support 8192. Make sure it covers your chunk sizes.
MTEB is a starting point, not the final answer. Benchmark performance on academic datasets may not reflect performance on your specific domain. Always test the top 2-3 models on your actual data before committing.
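Testing on your own data can be as simple as a recall@k check over a handful of (query, known-relevant-chunk) pairs. Everything here is placeholder data; in practice you would plug in vectors from each candidate model:

```python
# Tiny retrieval eval: for each query, is the known-relevant chunk among
# the top-k nearest chunks by cosine similarity?
import numpy as np

def recall_at_k(query_vecs, chunk_vecs, relevant_idx, k=3):
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = q @ c.T                           # cosine similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]  # best k chunks per query
    hits = [relevant_idx[i] in topk[i] for i in range(len(relevant_idx))]
    return sum(hits) / len(hits)

# Toy data: 3 queries, 5 chunks; query i should retrieve chunk i.
rng = np.random.default_rng(0)
chunks = rng.normal(size=(5, 8))
queries = chunks[:3] + 0.05 * rng.normal(size=(3, 8))  # near-duplicates
print(recall_at_k(queries, chunks, relevant_idx=[0, 1, 2], k=1))  # → 1.0
```

Run the same labeled pairs through each candidate model's embeddings and compare the scores.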
Choosing the Right Embedding Model
A practical decision framework
Decision Framework
Starting out / prototyping?
Use text-embedding-3-small (OpenAI). Cheap, fast, good enough. No infrastructure needed.

Need best accuracy?
Use text-embedding-3-large (OpenAI) or embed-v3 (Cohere). Check MTEB Retrieval scores.

Need to run locally / data privacy?
Use BGE-m3, GTE-large, or Nomic Embed via sentence-transformers. Requires a GPU for production throughput.

Multilingual corpus?
Use Cohere embed-v3, BGE-m3, or multilingual-e5-large. Test on your specific languages.
# OpenAI (simplest)
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["chunk 1", "chunk 2"],
)
vectors = [e.embedding for e in resp.data]

# Open-source (local)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
vectors = model.encode(["chunk 1", "chunk 2"])
You can always switch models later. Changing your embedding model means re-embedding all your chunks (a batch job). It is not free, but it is straightforward. Start with the simplest option, measure retrieval quality, and upgrade only if needed.
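The re-embedding job itself is mostly batching. A minimal sketch, where `embed_batch` is a stand-in for whichever model or API you migrate to:

```python
# Re-embed a corpus in batches; `embed_batch` is a placeholder for the
# new model or API call (batch size is also illustrative).
def batched(items, batch_size=100):
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def embed_batch(texts):
    # placeholder: a real implementation would call the new model here
    return [[float(len(t))] for t in texts]

chunks = [f"chunk {i}" for i in range(250)]
vectors = []
for batch in batched(chunks):
    vectors.extend(embed_batch(batch))

print(len(vectors))  # → 250
```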