Ch 2 — Embeddings: Meaning as Math

How tokens become vectors with meaning — the representation layer of every LLM
Foundation
Token ID → Lookup → Meaning → Position → Arithmetic → Static vs. Contextual → Similarity → Uses
From Token IDs to Vectors
Why a raw number isn’t enough
The Analogy
After tokenization, each token is an integer — like a student ID number. But student #4521 tells you nothing about the student. You need a profile: their interests, skills, personality. An embedding is that profile — a list of numbers (a vector) that captures the meaning of a token. Token 9906 (“Hello”) becomes a vector of 4096 numbers that encode everything the model knows about “Hello.”
Key insight: Token IDs are arbitrary — “cat” being token 2368 and “dog” being token 5765 tells the model nothing about their relationship. Embeddings fix this: “cat” and “dog” get similar vectors because they appear in similar contexts. The model learns meaning from patterns in text.
What Happens
# Token ID: just a number
token_id = 9906  # "Hello"

# Embedding: a rich vector of meaning
embedding = [0.023, -0.117, 0.891, 0.045, ...]
# 12288 numbers for GPT-3 (d_model=12288)
# 4096 numbers for Llama 2 7B (d_model=4096)

# The embedding table is a giant matrix:
# Shape: (vocab_size, d_model)
# GPT-4 (est.): (100258, 12288) ≈ 1.2B parameters
# Just for the embedding layer!
Real World
Student ID #4521 → Profile: [math: 9, art: 3, sports: 7, music: 5, ...]
In LLMs
Token 9906 → Embedding: [0.023, -0.117, 0.891, 0.045, ...] (4096 dims)
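To make the key insight concrete — IDs carry no similarity signal, but vectors do — here is a small NumPy sketch. The 3-dimensional vectors are hand-made for illustration, not from a trained model; real embeddings have thousands of learned dimensions.

```python
import numpy as np

# Token IDs are arbitrary labels: "cat" = 2368, "dog" = 5765.
# |2368 - 5765| tells the model nothing about how related cats and dogs are.

# Toy 3-D embeddings (hand-made for illustration; real ones are learned):
cat = np.array([0.9, 0.8, 0.1])   # high on "animal-ish" dimensions
dog = np.array([0.8, 0.9, 0.2])
mat = np.array([0.1, 0.0, 0.9])   # high on an "object-ish" dimension

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# With vectors, relatedness becomes computable:
assert cosine(cat, dog) > cosine(cat, mat)
```

The geometry does the work the raw IDs cannot: "cat" and "dog" point in nearly the same direction, while "mat" points elsewhere.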
The Embedding Table: A Giant Lookup
One row per token in the vocabulary
The Analogy
The embedding table is like a massive spreadsheet. Each row is one token from the vocabulary. Each column is one dimension of meaning. To get a token’s embedding, you simply look up its row. Token 9906 (“Hello”) → go to row 9906, read all 4096 columns. That’s it — no computation, just a table lookup.
Key insight: The embedding table is a learned parameter. It starts random and gets refined during training. After seeing billions of sentences, the model adjusts each row so that tokens with similar meanings end up with similar vectors. The table IS the model’s vocabulary knowledge.
In PyTorch
import torch
import torch.nn as nn

# Create embedding table
vocab_size = 100258  # GPT-4 vocab
d_model = 4096       # embedding dim
embed = nn.Embedding(vocab_size, d_model)
# embed.weight.shape = (100258, 4096)
# = 410M parameters (just embeddings!)

# Look up token IDs
token_ids = torch.tensor([9906, 11, 1268])  # "Hello", ",", " how"
vectors = embed(token_ids)
# Shape: (3, 4096)
# 3 tokens, each a 4096-dim vector

# This is literally just:
# vectors[0] = embed.weight[9906]
# vectors[1] = embed.weight[11]
# vectors[2] = embed.weight[1268]
Meaning Lives in Geometry
Similar words end up near each other in vector space
The Analogy
Imagine a map where cities are placed by culture instead of geography. Paris and Rome would be close (both European capitals, romantic, historic). Tokyo and Seoul would be close. New York and London would be close. Embedding space is this kind of map for words — tokens are positioned by meaning, not alphabetical order. “King” and “queen” are neighbors. “Python” (language) and “Java” are neighbors.
Key insight: This isn’t hand-coded — it emerges from training. The model learns that “cat” and “dog” appear in similar contexts (“The ___ sat on the mat”, “I fed my ___”) so their embeddings converge. This is the distributional hypothesis: “You shall know a word by the company it keeps” (Firth, 1957).
Embedding Dimensions Over Time
# How embedding sizes have grown:
# Word2Vec (2013):    300 dimensions
# GloVe (2014):       300 dimensions
# BERT-base (2018):   768 dimensions
# BERT-large (2018):  1024 dimensions
# GPT-2 (2019):       768 dimensions
# GPT-3 (2020):       12288 dimensions
# Llama 2 7B (2023):  4096 dimensions
# Llama 2 70B (2023): 8192 dimensions
# Llama 3 8B (2024):  4096 dimensions
# GPT-4 (est.):       12288 dimensions

# More dimensions = richer representation,
# but more parameters and compute
Positional Encoding: Where Am I?
Embeddings alone don’t know word order
The Analogy
Imagine receiving a bag of Scrabble tiles with no board. You know what letters you have, but not their order. “Dog bites man” and “Man bites dog” would look identical! Positional encoding adds a “seat number” to each token — position 0, position 1, position 2, etc. The final input to the transformer is: token embedding + position embedding.
Key insight: The original transformer (Vaswani et al., 2017) used fixed sinusoidal patterns. Modern LLMs use learned position embeddings (GPT-2/3) or RoPE (Rotary Position Embedding, used by Llama, Mistral, and most open models). RoPE encodes position as a rotation in embedding space, which generalizes better to longer sequences.
How It Works
# Token embeddings (from lookup table):
# "The" → [0.1, 0.3, -0.2, ...]
# "cat" → [0.8, -0.1, 0.5, ...]
# "sat" → [0.2, 0.7, -0.3, ...]

# Position embeddings:
# pos 0 → [0.01, 0.02, -0.01, ...]
# pos 1 → [0.03, -0.01, 0.02, ...]
# pos 2 → [-0.02, 0.04, 0.01, ...]

# Final input = token + position:
# "The" at pos 0 → [0.11, 0.32, -0.21, ...]
# "cat" at pos 1 → [0.83, -0.11, 0.52, ...]
# "sat" at pos 2 → [0.18, 0.74, -0.29, ...]

# In PyTorch:
pos_embed = nn.Embedding(max_seq_len, d_model)
x = token_embed(ids) + pos_embed(positions)
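The fixed sinusoidal patterns from the original transformer (mentioned in the key insight) can be generated in a few lines. This is a sketch of the Vaswani et al. (2017) formula — even dimensions get sin(pos / 10000^(2i/d_model)), odd dimensions get the matching cos — not the learned-table or RoPE variants used by newer models.

```python
import numpy as np

def sinusoidal_positions(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / (10000 ** (i / d_model))    # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims: sine
    pe[:, 1::2] = np.cos(angles)               # odd dims: cosine
    return pe

pe = sinusoidal_positions(max_len=16, d_model=8)
# Row 0 (position 0): sin(0) = 0 in even dims, cos(0) = 1 in odd dims
```

Each position gets a unique, deterministic pattern, so no parameters need to be learned for it — one reason the original authors chose it.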
Vector Arithmetic: King − Man + Woman = Queen
The most famous result in embedding research
The Analogy
If embeddings truly capture meaning, then meaning should be computable. Take the vector for “king,” subtract “man,” add “woman” — you get a vector closest to “queen.” The “man→woman” direction is a consistent shift in embedding space. This works for countries→capitals, verbs→past tense, and more. Mikolov et al. (2013) demonstrated this with Word2Vec.
Key insight: This works because embeddings encode relationships as directions. The direction from “man” to “woman” is roughly the same as from “king” to “queen,” from “uncle” to “aunt,” from “he” to “she.” Gender, tense, plurality — these abstract concepts become geometric directions in vector space.
Worked Example
import gensim.downloader as api

# Load pre-trained Word2Vec
model = api.load("word2vec-google-news-300")

# king - man + woman = ?
result = model.most_similar(
    positive=["king", "woman"],
    negative=["man"],
    topn=3,
)
# [('queen', 0.71), ('monarch', 0.62), ...]

# More examples that work:
# Paris - France + Italy = Rome
# walking - walk + swim = swimming
# bigger - big + small = smaller

# The math:
# vec("king") - vec("man") + vec("woman") ≈ vec("queen")
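Why the arithmetic works can be seen with hand-constructed 2-D vectors where the axes are explicit concepts. This is a toy illustration, not learned embeddings — in real models the "royalty" and "gender" directions emerge from training and are spread across thousands of dimensions.

```python
import numpy as np

# Hand-constructed 2-D vectors, axes = [royalty, gender] (toy illustration)
vocab = {
    "man":   np.array([0.0, -1.0]),
    "woman": np.array([0.0,  1.0]),
    "king":  np.array([1.0, -1.0]),
    "queen": np.array([1.0,  1.0]),
    "apple": np.array([-1.0, 0.0]),
}

# king - man + woman: remove "male", add "female", keep "royalty"
target = vocab["king"] - vocab["man"] + vocab["woman"]

# Nearest remaining word to the target vector
candidates = [w for w in vocab if w not in ("king", "man", "woman")]
nearest = min(candidates, key=lambda w: np.linalg.norm(vocab[w] - target))
# nearest == "queen": the target lands exactly on vocab["queen"] = [1, 1]
```

Because the man→woman offset is the same vector as king→queen, subtracting one and adding the other moves you along the "gender" direction while leaving "royalty" untouched.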
Static vs. Contextual Embeddings
The revolution that made transformers possible
The Analogy
In a dictionary, “bank” has one entry with multiple definitions. That’s a static embedding (Word2Vec, GloVe) — one vector per word, regardless of context. But in conversation, you instantly know whether “bank” means a financial institution or a river bank from context. Contextual embeddings (BERT, GPT) give “bank” a different vector in each sentence.
Key insight: This is the fundamental difference between old NLP and modern LLMs. In Word2Vec, “bank” always has the same vector. In GPT, “bank” in “I went to the bank to deposit money” gets a completely different vector than “bank” in “I sat on the river bank.” The transformer layers transform the initial embedding based on surrounding context.
The Evolution
# Static embeddings (2013-2017):
# Word2Vec, GloVe, FastText
# "bank" → always [0.3, -0.1, 0.7, ...]
# Same vector in every context!

# Contextual embeddings (2018+):
# ELMo, BERT, GPT, Llama
# "I deposited money at the bank"
#   "bank" → [0.8, 0.2, -0.1, ...] (financial)
# "I sat on the river bank"
#   "bank" → [-0.3, 0.6, 0.4, ...] (nature)

# How it works in a transformer:
# 1. Look up static embedding (same for "bank")
# 2. Pass through 96 transformer layers
# 3. Each layer mixes info from other tokens
# 4. Output: context-aware embedding
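The mixing step can be demonstrated with a toy single layer of dot-product self-attention over random, untrained vectors — a sketch of "each layer mixes info from other tokens," not a real model (no learned weights, no multiple heads).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding size

# One static vector per word: "bank" starts out identical in both sentences
words = ["I", "deposited", "money", "at", "the", "bank", "sat", "on", "river"]
emb = {w: rng.standard_normal(d) for w in words}

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(tokens):
    X = np.stack([emb[t] for t in tokens])   # (seq_len, d) static embeddings
    scores = X @ X.T / np.sqrt(d)            # dot-product attention scores
    return softmax(scores) @ X               # each position mixes in its context

out1 = self_attention(["I", "deposited", "money", "at", "the", "bank"])[-1]
out2 = self_attention(["I", "sat", "on", "the", "river", "bank"])[-1]

# Same static input vector for "bank", different context-mixed outputs:
assert not np.allclose(out1, out2)
```

Even this one crude layer makes the output for "bank" depend on its neighbors; a real transformer repeats this mixing dozens of times with learned projections.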
Cosine Similarity: Measuring Closeness
How to tell if two embeddings are “similar”
The Analogy
Two arrows pointing in the same direction are similar, even if one is longer. Cosine similarity measures the angle between two vectors, ignoring their length. cos(θ) = 1 means identical direction (same meaning). cos(θ) = 0 means perpendicular (unrelated). cos(θ) = −1 means opposite. This is the standard metric for comparing embeddings.
Key insight: When you use semantic search, RAG, or recommendation systems, cosine similarity is doing the heavy lifting. Your query gets embedded, every document gets embedded, and the system returns documents with the highest cosine similarity to your query. Vector databases (Pinecone, Weaviate, FAISS) are optimized for exactly this operation.
In Practice
from openai import OpenAI
import numpy as np

client = OpenAI()

def get_embedding(text):
    resp = client.embeddings.create(
        input=text,
        model="text-embedding-3-small",
    )
    return np.array(resp.data[0].embedding)

a = get_embedding("The cat sat on the mat")
b = get_embedding("A kitten rested on the rug")
c = get_embedding("Stock prices rose sharply")

# Cosine similarity:
def cosine(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

cosine(a, b)  # ≈ 0.85 (very similar!)
cosine(a, c)  # ≈ 0.12 (unrelated)
Embeddings Power Everything
From search to recommendations to RAG — embeddings are everywhere
Where Embeddings Are Used
Semantic search: Find documents by meaning, not keywords.
RAG: Retrieve relevant context for LLM answers.
Recommendations: “Users who liked X also liked Y” via embedding similarity.
Clustering: Group similar documents automatically.
Classification: Use embeddings as features for downstream models.
Anomaly detection: Flag items far from any cluster.
The complete picture: Embeddings are the bridge between human language and machine computation. Token IDs are arbitrary labels. Embeddings give those labels meaning — geometric, computable meaning. Every LLM starts by converting tokens into embeddings, and every downstream application (search, RAG, classification) relies on the quality of those embeddings.
OpenAI Embedding Models
# Current OpenAI embedding models:
# text-embedding-3-small: 1536 dims
#   $0.02 per 1M tokens
# text-embedding-3-large: 3072 dims
#   $0.13 per 1M tokens

# Both support dimension reduction:
resp = client.embeddings.create(
    input="Hello world",
    model="text-embedding-3-large",
    dimensions=256,  # truncate to 256
)
# Smaller = faster search, less storage
# Larger = more accurate similarity

# Open-source alternatives:
# sentence-transformers (HuggingFace)
# E5, BGE, GTE models
# Nomic Embed, Jina Embeddings
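The retrieval pipeline behind semantic search and RAG has the same shape regardless of the embedder: embed every document, embed the query, rank by cosine similarity. The sketch below uses a crude bag-of-words counter as a hypothetical stand-in for a real embedding model, so it only captures word overlap — a real model would also rank "kitten rested on a rug" near "cat sat on the mat".

```python
import numpy as np

docs = [
    "the cat sat on the mat",
    "a kitten rested on a rug",
    "stock prices rose sharply",
]

# Stand-in embedder: word counts over a tiny fixed vocabulary.
# (A real system would call an embedding model here instead.)
VOCAB = sorted({w for d in docs for w in d.split()})

def embed(text):
    words = text.lower().split()
    return np.array([words.count(w) for w in VOCAB], dtype=float)

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# The retrieval step: rank all documents by similarity to the query
query = "the cat on the rug"
ranked = sorted(docs, key=lambda d: cosine(embed(d), embed(query)), reverse=True)
# ranked[0] is the cat/mat document; the stock document scores 0
```

Vector databases implement exactly this ranking, with approximate nearest-neighbor indexes so it scales past brute-force comparison.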
Real World
A GPS coordinate tells you where a place is and how close it is to other places
In LLMs
An embedding tells you what a word means and how related it is to other words