Ch 2 — Embeddings: Meaning as Math

How tokens become vectors with meaning — the representation layer of every LLM
Foundation
Token ID → Lookup → Meaning → Position → Arithmetic → Static vs. Contextual → Similarity → Uses
From Token IDs to Vectors
Why a raw number isn’t enough
The Analogy
After tokenization, each token is an integer — like a student ID number. But student #4521 tells you nothing about the student. You need a profile: their interests, skills, personality. An embedding is that profile — a list of numbers (a vector) that captures the meaning of a token. Token 9906 (“Hello”) becomes a vector of 4096 numbers that encode everything the model knows about “Hello.”
Key insight: Token IDs are arbitrary — “cat” being token 2368 and “dog” being token 5765 tells the model nothing about their relationship. Embeddings fix this: “cat” and “dog” get similar vectors because they appear in similar contexts. The model learns meaning from patterns in text.
What Happens
# Token ID: just a number
token_id = 9906  # "Hello"

# Embedding: a rich vector of meaning
embedding = [0.023, -0.117, 0.891, 0.045, ...]
# 12288 numbers for GPT-3 (d_model=12288)
# 4096 numbers for Llama 2 7B (d_model=4096)

# The embedding table is a giant matrix:
# Shape: (vocab_size, d_model)
# GPT-4 (est.): (100258, 12288) ≈ 1.2B parameters
# Just for the embedding layer!
Real World
Student ID #4521 → Profile: [math: 9, art: 3, sports: 7, music: 5, ...]
In LLMs
Token 9906 → Embedding: [0.023, -0.117, 0.891, 0.045, ...] (4096 dims)
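To make the key insight concrete — IDs carry no similarity signal, but vectors do — here is a small NumPy sketch. The 3-dimensional vectors are hand-made for illustration, not from a trained model; real embeddings have thousands of learned dimensions.

```python
import numpy as np

# Token IDs are arbitrary labels: "cat" = 2368, "dog" = 5765.
# |2368 - 5765| tells the model nothing about how related cats and dogs are.

# Toy 3-D embeddings (hand-made for illustration; real ones are learned):
cat = np.array([0.9, 0.8, 0.1])   # high on "animal-ish" dimensions
dog = np.array([0.8, 0.9, 0.2])
mat = np.array([0.1, 0.0, 0.9])   # high on an "object-ish" dimension

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# With vectors, relatedness becomes computable:
assert cosine(cat, dog) > cosine(cat, mat)
```

The geometry does the work the raw IDs cannot: "cat" and "dog" point in nearly the same direction, while "mat" points elsewhere.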
The Embedding Table: A Giant Lookup
One row per token in the vocabulary
The Analogy
The embedding table is like a massive spreadsheet. Each row is one token from the vocabulary. Each column is one dimension of meaning. To get a token’s embedding, you simply look up its row. Token 9906 (“Hello”) → go to row 9906, read all 4096 columns. That’s it — no computation, just a table lookup.
Key insight: The embedding table is a learned parameter. It starts random and gets refined during training. After seeing billions of sentences, the model adjusts each row so that tokens with similar meanings end up with similar vectors. The table IS the model’s vocabulary knowledge.
In PyTorch
import torch
import torch.nn as nn

# Create embedding table
vocab_size = 100258  # GPT-4 vocab
d_model = 4096       # embedding dim
embed = nn.Embedding(vocab_size, d_model)
# embed.weight.shape = (100258, 4096)
# = 410M parameters (just embeddings!)

# Look up token IDs
token_ids = torch.tensor([9906, 11, 1268])  # "Hello", ",", " how"
vectors = embed(token_ids)
# Shape: (3, 4096)
# 3 tokens, each a 4096-dim vector

# This is literally just:
# vectors[0] = embed.weight[9906]
# vectors[1] = embed.weight[11]
# vectors[2] = embed.weight[1268]
Meaning Lives in Geometry
Similar words end up near each other in vector space
The Analogy
Imagine a map where cities are placed by culture instead of geography. Paris and Rome would be close (both European capitals, romantic, historic). Tokyo and Seoul would be close. New York and London would be close. Embedding space is this kind of map for words — tokens are positioned by meaning, not alphabetical order. “King” and “queen” are neighbors. “Python” (language) and “Java” are neighbors.
Key insight: This isn’t hand-coded — it emerges from training. The model learns that “cat” and “dog” appear in similar contexts (“The ___ sat on the mat”, “I fed my ___”) so their embeddings converge. This is the distributional hypothesis: “You shall know a word by the company it keeps” (Firth, 1957).
Embedding Dimensions Over Time
# How embedding sizes have grown:
# Word2Vec (2013):    300 dimensions
# GloVe (2014):       300 dimensions
# BERT-base (2018):   768 dimensions
# BERT-large (2018):  1024 dimensions
# GPT-2 (2019):       768 dimensions
# GPT-3 (2020):       12288 dimensions
# Llama 2 7B (2023):  4096 dimensions
# Llama 2 70B (2023): 8192 dimensions
# Llama 3 8B (2024):  4096 dimensions
# GPT-4 (est.):       12288 dimensions

# More dimensions = richer representation,
# but more parameters and compute
Positional Encoding: Where Am I?
Embeddings alone don’t know word order
The Analogy
Imagine receiving a bag of Scrabble tiles with no board. You know what letters you have, but not their order. “Dog bites man” and “Man bites dog” would look identical! Positional encoding adds a “seat number” to each token — position 0, position 1, position 2, etc. The final input to the transformer is: token embedding + position embedding.
Key insight: The original transformer (Vaswani et al., 2017) used fixed sinusoidal patterns. Modern LLMs use learned position embeddings (GPT-2/3) or RoPE (Rotary Position Embedding, used by Llama, Mistral, and most open models). RoPE encodes position as a rotation in embedding space, which generalizes better to longer sequences.
How It Works
# Token embeddings (from lookup table):
# "The" → [0.1, 0.3, -0.2, ...]
# "cat" → [0.8, -0.1, 0.5, ...]
# "sat" → [0.2, 0.7, -0.3, ...]

# Position embeddings:
# pos 0 → [0.01, 0.02, -0.01, ...]
# pos 1 → [0.03, -0.01, 0.02, ...]
# pos 2 → [-0.02, 0.04, 0.01, ...]

# Final input = token + position:
# "The" at pos 0 → [0.11, 0.32, -0.21, ...]
# "cat" at pos 1 → [0.83, -0.11, 0.52, ...]
# "sat" at pos 2 → [0.18, 0.74, -0.29, ...]

# In PyTorch:
pos_embed = nn.Embedding(max_seq_len, d_model)
x = token_embed(ids) + pos_embed(positions)
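The fixed sinusoidal patterns from the original transformer (mentioned in the key insight) can be generated in a few lines. This is a sketch of the Vaswani et al. (2017) formula — even dimensions get sin(pos / 10000^(2i/d_model)), odd dimensions get the matching cos — not the learned-table or RoPE variants used by newer models.

```python
import numpy as np

def sinusoidal_positions(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / (10000 ** (i / d_model))    # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims: sine
    pe[:, 1::2] = np.cos(angles)               # odd dims: cosine
    return pe

pe = sinusoidal_positions(max_len=16, d_model=8)
# Row 0 (position 0): sin(0) = 0 in even dims, cos(0) = 1 in odd dims
```

Each position gets a unique, deterministic pattern, so no parameters need to be learned for it — one reason the original authors chose it.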
Vector Arithmetic: King − Man + Woman = Queen
The most famous result in embedding research
The Analogy
If embeddings truly capture meaning, then meaning should be computable. Take the vector for “king,” subtract “man,” add “woman” — you get a vector closest to “queen.” The “man→woman” direction is a consistent shift in embedding space. This works for countries→capitals, verbs→past tense, and more. Mikolov et al. (2013) demonstrated this with Word2Vec.
Key insight: This works because embeddings encode relationships as directions. The direction from “man” to “woman” is roughly the same as from “king” to “queen,” from “uncle” to “aunt,” from “he” to “she.” Gender, tense, plurality — these abstract concepts become geometric directions in vector space.
Worked Example
import gensim.downloader as api

# Load pre-trained Word2Vec
model = api.load("word2vec-google-news-300")

# king - man + woman = ?
result = model.most_similar(
    positive=["king", "woman"],
    negative=["man"],
    topn=3,
)
# [('queen', 0.71), ('monarch', 0.62), ...]

# More examples that work:
# Paris - France + Italy = Rome
# walking - walk + swim = swimming
# bigger - big + small = smaller

# The math:
# vec("king") - vec("man") + vec("woman") ≈ vec("queen")
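Why the arithmetic works can be seen with hand-constructed 2-D vectors where the axes are explicit concepts. This is a toy illustration, not learned embeddings — in real models the "royalty" and "gender" directions emerge from training and are spread across thousands of dimensions.

```python
import numpy as np

# Hand-constructed 2-D vectors, axes = [royalty, gender] (toy illustration)
vocab = {
    "man":   np.array([0.0, -1.0]),
    "woman": np.array([0.0,  1.0]),
    "king":  np.array([1.0, -1.0]),
    "queen": np.array([1.0,  1.0]),
    "apple": np.array([-1.0, 0.0]),
}

# king - man + woman: remove "male", add "female", keep "royalty"
target = vocab["king"] - vocab["man"] + vocab["woman"]

# Nearest remaining word to the target vector
candidates = [w for w in vocab if w not in ("king", "man", "woman")]
nearest = min(candidates, key=lambda w: np.linalg.norm(vocab[w] - target))
# nearest == "queen": the target lands exactly on vocab["queen"] = [1, 1]
```

Because the man→woman offset is the same vector as king→queen, subtracting one and adding the other moves you along the "gender" direction while leaving "royalty" untouched.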
Static vs. Contextual Embeddings
The revolution that made transformers possible
The Analogy
In a dictionary, “bank” has one entry with multiple definitions. That’s a static embedding (Word2Vec, GloVe) — one vector per word, regardless of context. But in conversation, you instantly know whether “bank” means a financial institution or a river bank from context. Contextual embeddings (BERT, GPT) give “bank” a different vector in each sentence.
Key insight: This is the fundamental difference between old NLP and modern LLMs. In Word2Vec, “bank” always has the same vector. In GPT, “bank” in “I went to the bank to deposit money” gets a completely different vector than “bank” in “I sat on the river bank.” The transformer layers transform the initial embedding based on surrounding context.
The Evolution
# Static embeddings (2013-2017):
# Word2Vec, GloVe, FastText
# "bank" → always [0.3, -0.1, 0.7, ...]
# Same vector in every context!

# Contextual embeddings (2018+):
# ELMo, BERT, GPT, Llama
# "I deposited money at the bank"
#   "bank" → [0.8, 0.2, -0.1, ...] (financial)
# "I sat on the river bank"
#   "bank" → [-0.3, 0.6, 0.4, ...] (nature)

# How it works in a transformer:
# 1. Look up static embedding (same for "bank")
# 2. Pass through 96 transformer layers
# 3. Each layer mixes info from other tokens
# 4. Output: context-aware embedding
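The mixing step can be demonstrated with a toy single layer of dot-product self-attention over random, untrained vectors — a sketch of "each layer mixes info from other tokens," not a real model (no learned weights, no multiple heads).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding size

# One static vector per word: "bank" starts out identical in both sentences
words = ["I", "deposited", "money", "at", "the", "bank", "sat", "on", "river"]
emb = {w: rng.standard_normal(d) for w in words}

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(tokens):
    X = np.stack([emb[t] for t in tokens])   # (seq_len, d) static embeddings
    scores = X @ X.T / np.sqrt(d)            # dot-product attention scores
    return softmax(scores) @ X               # each position mixes in its context

out1 = self_attention(["I", "deposited", "money", "at", "the", "bank"])[-1]
out2 = self_attention(["I", "sat", "on", "the", "river", "bank"])[-1]

# Same static input vector for "bank", different context-mixed outputs:
assert not np.allclose(out1, out2)
```

Even this one crude layer makes the output for "bank" depend on its neighbors; a real transformer repeats this mixing dozens of times with learned projections.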
Cosine Similarity: Measuring Closeness
How to tell if two embeddings are “similar”
The Analogy
Two arrows pointing in the same direction are similar, even if one is longer. Cosine similarity measures the angle between two vectors, ignoring their length. cos(θ) = 1 means identical direction (same meaning). cos(θ) = 0 means perpendicular (unrelated). cos(θ) = −1 means opposite. This is the standard metric for comparing embeddings.
Key insight: When you use semantic search, RAG, or recommendation systems, cosine similarity is doing the heavy lifting. Your query gets embedded, every document gets embedded, and the system returns documents with the highest cosine similarity to your query. Vector databases (Pinecone, Weaviate, FAISS) are optimized for exactly this operation.
In Practice
from openai import OpenAI
import numpy as np

client = OpenAI()

def get_embedding(text):
    resp = client.embeddings.create(
        input=text,
        model="text-embedding-3-small",
    )
    return np.array(resp.data[0].embedding)

a = get_embedding("The cat sat on the mat")
b = get_embedding("A kitten rested on the rug")
c = get_embedding("Stock prices rose sharply")

# Cosine similarity:
def cosine(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

cosine(a, b)  # ≈ 0.85 (very similar!)
cosine(a, c)  # ≈ 0.12 (unrelated)
Embeddings Power Everything
From search to recommendations to RAG — embeddings are everywhere
Where Embeddings Are Used
Semantic search: Find documents by meaning, not keywords.
RAG: Retrieve relevant context for LLM answers.
Recommendations: “Users who liked X also liked Y” via embedding similarity.
Clustering: Group similar documents automatically.
Classification: Use embeddings as features for downstream models.
Anomaly detection: Flag items far from any cluster.
The complete picture: Embeddings are the bridge between human language and machine computation. Token IDs are arbitrary labels. Embeddings give those labels meaning — geometric, computable meaning. Every LLM starts by converting tokens into embeddings, and every downstream application (search, RAG, classification) relies on the quality of those embeddings.
OpenAI Embedding Models
# Current OpenAI embedding models:
# text-embedding-3-small: 1536 dims
#   $0.02 per 1M tokens
# text-embedding-3-large: 3072 dims
#   $0.13 per 1M tokens

# Both support dimension reduction:
resp = client.embeddings.create(
    input="Hello world",
    model="text-embedding-3-large",
    dimensions=256,  # truncate to 256
)
# Smaller = faster search, less storage
# Larger = more accurate similarity

# Open-source alternatives:
# sentence-transformers (HuggingFace)
# E5, BGE, GTE models
# Nomic Embed, Jina Embeddings
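The retrieval pipeline behind semantic search and RAG has the same shape regardless of the embedder: embed every document, embed the query, rank by cosine similarity. The sketch below uses a crude bag-of-words counter as a hypothetical stand-in for a real embedding model, so it only captures word overlap — a real model would also rank "kitten rested on a rug" near "cat sat on the mat".

```python
import numpy as np

docs = [
    "the cat sat on the mat",
    "a kitten rested on a rug",
    "stock prices rose sharply",
]

# Stand-in embedder: word counts over a tiny fixed vocabulary.
# (A real system would call an embedding model here instead.)
VOCAB = sorted({w for d in docs for w in d.split()})

def embed(text):
    words = text.lower().split()
    return np.array([words.count(w) for w in VOCAB], dtype=float)

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# The retrieval step: rank all documents by similarity to the query
query = "the cat on the rug"
ranked = sorted(docs, key=lambda d: cosine(embed(d), embed(query)), reverse=True)
# ranked[0] is the cat/mat document; the stock document scores 0
```

Vector databases implement exactly this ranking, with approximate nearest-neighbor indexes so it scales past brute-force comparison.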
Real World
A GPS coordinate tells you where a place is and how close it is to other places
In LLMs
An embedding tells you what a word means and how related it is to other words