Ch 3 — The Embedding Layer: Words to Vectors

The model's dictionary — where token IDs become high-dimensional vectors
High Level
Token ID → Lookup → Vector → Clustering → Tying → Size Impact
The Embedding Matrix: A Giant Lookup Table
model.embed_tokens.weight — shape [vocab_size, hidden_size]
What It Is
The embedding layer is a 2D matrix stored as the tensor model.embed_tokens.weight. Its shape is [vocab_size, hidden_size] — for Llama 3.1 8B, that's [128,256 × 4,096]. Each of the 128,256 rows is a learned vector of 4,096 floating-point numbers representing one token. The entire matrix is the model's "dictionary" — but instead of definitions, each word gets a coordinate in 4,096-dimensional space.
Tensor Entry
"model.embed_tokens.weight": {
  "dtype": "BF16",
  "shape": [128256, 4096],
  "data_offsets": [0, 1050673152]
}
// 128,256 × 4,096 × 2 bytes = ~1 GB
// One of the largest individual tensors
Key insight: The embedding matrix is one of only two tensors whose first dimension is vocab_size rather than hidden_size. The other is lm_head.weight. Together, they're the model's "on-ramp" and "off-ramp" between human language and the internal representation.
How Embedding Lookup Works
Token ID 42 = Row 42 of the matrix — no multiplication needed
The Lookup Operation
Unlike most layers in the transformer, the embedding layer does no matrix multiplication. It's a pure table lookup: given a token ID, it returns the corresponding row of the matrix. If the tokenizer converts "Hello" to ID 9906, the embedding layer simply reads row 9906 from the matrix — a 4,096-dimensional vector. This vector is the model's internal representation of "Hello" and gets passed into the first transformer layer.
Lookup in Action
// Pseudocode for embedding lookup:
input_ids = [9906, 11, 1917]   // "Hello, world"

// Each ID selects a row:
embed[0] = embed_tokens[9906]  // → [0.12, -0.34, ...]
embed[1] = embed_tokens[11]    // → [0.05, 0.89, ...]
embed[2] = embed_tokens[1917]  // → [-0.71, 0.22, ...]

// Output shape: [3, 4096]
// 3 tokens, each represented as 4096 floats
Key insight: This is just fancy indexing. In PyTorch, it's F.embedding(input_ids, weight) which is equivalent to weight[input_ids]. No gradients flow through the index operation itself — they flow into the selected rows during backpropagation.
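The lookup really is just fancy indexing, which you can see with a toy matrix in plain numpy (a minimal sketch using a 10×4 matrix instead of the real [128256, 4096]; the IDs and values here are made up):

```python
import numpy as np

# Toy embedding matrix: vocab of 10 tokens, hidden size 4.
# Real Llama 3.1 8B would be [128256, 4096] in BF16.
rng = np.random.default_rng(0)
embed_tokens = rng.standard_normal((10, 4)).astype(np.float32)

input_ids = [9, 1, 7]             # hypothetical token IDs
hidden = embed_tokens[input_ids]  # pure row selection, no matmul

print(hidden.shape)  # (3, 4): one row per input token
```

Row 0 of `hidden` is bit-for-bit identical to row 9 of the matrix; nothing is computed, only copied.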
What the Vectors Mean: Semantic Space
Similar words cluster together in 4,096-dimensional space
Learned Geometry
During training, the embedding matrix learns to place tokens with similar meaning or usage near each other in the 4,096-dimensional space. "dog" and "cat" end up close together. "Python" and "JavaScript" cluster near each other. "happy" and "joyful" are neighbors. This isn't programmed — it emerges from the training data as the model learns to predict next tokens. The geometry can capture semantic relationships, the classic example (made famous by word2vec-style embeddings) being king - man + woman ≈ queen.
Conceptual Neighborhoods
// Conceptual embedding neighborhoods:
Animals: dog, cat, fish, bird  // nearby
Code:    python, java, rust    // nearby
Numbers: one, two, three       // nearby

// Each token has 4,096 coordinates
// Different dimensions capture different
// features: formality, topic, syntax role...
// No single dimension has a clean meaning
Key insight: The 4,096 dimensions don't individually correspond to understandable features like "animal-ness" or "verb-ness." The meaning is distributed across all dimensions. This is why you need all 4,096 values — losing even a few degrades the representation.
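"Near each other" is usually measured with cosine similarity between rows. A minimal sketch, using made-up 3-dimensional stand-ins for embedding rows (in a real model you would slice rows out of model.embed_tokens.weight by token ID):

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction, 0 = unrelated, -1.0 = opposite
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical vectors standing in for the rows of "dog", "cat", "rust"
dog  = np.array([0.90, 0.80, 0.10], dtype=np.float32)
cat  = np.array([0.85, 0.75, 0.20], dtype=np.float32)
rust = np.array([-0.70, 0.10, 0.90], dtype=np.float32)

print(cosine_similarity(dog, cat))   # close to 1: semantic neighbors
print(cosine_similarity(dog, rust))  # much lower: different region
```

The same function works unchanged on real 4,096-dimensional rows; only the slicing differs.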
Vocabulary Size and Its Impact
32K vs 128K tokens — what changes and what doesn't
Vocab Size Tradeoffs
Vocabulary size directly controls the embedding matrix's first dimension. Llama 2 had 32,000 tokens; Llama 3 expanded to 128,256. This 4× increase in vocab size meant a 4× increase in embedding parameters — from ~131M to ~525M parameters just for the embedding layer. Larger vocabularies encode text more efficiently (fewer tokens per sentence) but increase memory for the embedding and output matrices.
Vocab Size Comparison
// Embedding size by vocabulary:
Llama 2 (32K vocab, 4096 dim):
  32,000 × 4,096 × 2B = ~256 MB
Llama 3 (128K vocab, 4096 dim):
  128,256 × 4,096 × 2B = ~1 GB

// 4× more vocab = 4× larger embedding
// But: fewer tokens per sentence
// "Hello world" = 2 tokens vs 3
// Faster inference per character
Why it matters: Larger vocabulary = bigger embedding + lm_head tensors, but shorter sequences. This is a tradeoff: you spend more memory on the dictionary but less compute per character because each token covers more text. For multilingual models, larger vocabs are essential.
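The size arithmetic above is worth wrapping in a helper so you can plug in any model's config values (a minimal sketch; `embedding_bytes` is a hypothetical name, and the 2 bytes per parameter assumes BF16):

```python
def embedding_bytes(vocab_size, hidden_size, bytes_per_param=2):
    # bytes_per_param = 2 for BF16/FP16, 4 for FP32
    return vocab_size * hidden_size * bytes_per_param

llama2 = embedding_bytes(32_000, 4_096)    # Llama 2
llama3 = embedding_bytes(128_256, 4_096)   # Llama 3

print(f"Llama 2: {llama2 / 2**20:.0f} MiB")  # 250 MiB
print(f"Llama 3: {llama3 / 2**30:.2f} GiB")  # 0.98 GiB
print(f"Ratio:   {llama3 / llama2:.2f}x")    # 4.01x
```

Double the result when tie_word_embeddings is false, since lm_head.weight has the same shape.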
The Output Head: lm_head.weight
The inverse embedding — vectors back to token probabilities
What lm_head Does
At the output end, the model needs to convert its internal 4,096-dimensional representation back into a probability distribution over the vocabulary. The lm_head.weight tensor does this: it's a matrix of shape [vocab_size, hidden_size] — the same shape as the embedding. The model multiplies the final hidden state by this matrix to produce a "logit" score for each of the 128,256 tokens; softmax then converts these logits into probabilities.
Output Projection
// Final hidden state → token probabilities:
hidden = transformer_output  // [1, 4096]
logits = hidden @ lm_head.T  // [1, 128256]
probs  = softmax(logits)     // [1, 128256]

// Highest probability → predicted next token
// lm_head.weight shape: [128256, 4096]
// Same shape as embed_tokens.weight!
Key insight: The embedding converts IDs → vectors. The lm_head converts vectors → probabilities over IDs. They're mirror images of each other. This symmetry is why weight tying works — the same matrix can serve both roles.
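The output projection can be sketched end to end in numpy (toy sizes and random weights; real shapes would be hidden_size=4096 and vocab_size=128256):

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_size, vocab_size = 4, 10  # toy stand-ins for 4096 and 128256

lm_head = rng.standard_normal((vocab_size, hidden_size)).astype(np.float32)
hidden  = rng.standard_normal((1, hidden_size)).astype(np.float32)

logits = hidden @ lm_head.T              # [1, vocab_size]: one score per token
exp = np.exp(logits - logits.max())      # subtract max for numerical stability
probs = exp / exp.sum()                  # softmax: scores → probabilities

print(probs.shape)                       # (1, 10)
next_token_id = int(probs.argmax())      # greedy decoding picks the top row
```

Note the contrast with the input side: the embedding is a lookup, while lm_head is a genuine matrix multiply against every vocabulary row.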
Weight Tying: Shared vs. Separate
When embed_tokens and lm_head share the same matrix
Tied Embeddings
Weight tying means the embedding matrix and the output head share the same physical parameters — they point to the same memory. This was introduced by Press & Wolf (2017) and used in GPT-2 and BERT. The benefit: you save vocab_size × hidden_size parameters. For Llama 3.1 8B, that would save ~525M parameters (~1 GB in BF16). The config field tie_word_embeddings controls this.
Current Practice
Modern open-source LLMs like Llama and Mistral do NOT tie embeddings — they use tie_word_embeddings: false. The embedding and lm_head are separate tensors. This gives the model more capacity: the input embedding can specialize in encoding meaning, while the output head can specialize in predicting the next token.
Config Examples
// Llama 3.1 8B config.json:
"tie_word_embeddings": false
// → Two separate tensors in the file:
//   model.embed_tokens.weight [128256, 4096]
//   lm_head.weight            [128256, 4096]
//   Total: ~2 GB for both

// GPT-2 / smaller models:
"tie_word_embeddings": true
// → One tensor, shared:
//   model.embed_tokens.weight [50257, 768]
//   lm_head.weight → same memory
Key insight: When you see tie_word_embeddings: false, expect two large vocab-sized tensors in the file. When it's true, the safetensors header will only contain one, and the framework creates a reference for the other.
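You can check which case a checkpoint is in by reading the safetensors header directly (a minimal sketch; `safetensors_tensor_names` is a hypothetical helper, but the layout it parses — an 8-byte little-endian length followed by that many bytes of JSON — is the documented safetensors format):

```python
import json
import struct

def safetensors_tensor_names(path):
    """List tensor names stored in a .safetensors file header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64 header length
        header = json.loads(f.read(header_len))         # JSON metadata
    # "__metadata__" is an optional free-form entry, not a tensor
    return sorted(k for k in header if k != "__metadata__")
```

For an untied model you should see both "model.embed_tokens.weight" and "lm_head.weight" in the returned list; for a tied one, only the embedding appears on disk.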
File Size Impact
How much of the model file is just the embedding?
Embedding as % of Total
For Llama 3.1 8B with untied embeddings, the embedding + lm_head together contain about 1.05 billion parameters out of 8.03 billion total — roughly 13% of the model. In BF16, that's ~2.1 GB out of ~16 GB. For models with smaller vocabularies (32K), this drops to about 3%. For models with larger hidden sizes (like Llama 3.1 70B with hidden_size=8192), the absolute size doubles but the percentage drops because the per-layer tensors grow even more.
Embedding Sizes Across Models
// embed_tokens + lm_head (BF16, untied):
Llama 3.1 8B  (128K × 4096): 2 × 1.05 GB = ~2.1 GB  // 13% of model
Llama 3.1 70B (128K × 8192): 2 × 2.1 GB  = ~4.2 GB  // 3% of model
Mistral 7B    (32K × 4096):  2 × 0.26 GB = ~0.5 GB  // 3.5% of model

// Larger vocab → embedding dominates more
// Larger model → embedding % shrinks
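The percentages above fall out of a one-line calculation (a sketch; `embed_fraction` is a hypothetical helper, and the total-parameter counts are the approximate published sizes):

```python
def embed_fraction(vocab, hidden, total_params, tied=False):
    # Parameters in embed_tokens, plus lm_head when untied
    embed_params = vocab * hidden * (1 if tied else 2)
    return embed_params / total_params

# Approximate totals: 8.03B for Llama 3.1 8B, ~70.6B for 70B
print(f"{embed_fraction(128_256, 4_096, 8.03e9):.1%}")   # 13.1%
print(f"{embed_fraction(128_256, 8_192, 70.6e9):.1%}")   # 3.0%
```

This is why vocabulary choices matter most for small models: the dictionary is a fixed cost, while per-layer tensors scale with depth and width.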
Practical Takeaways
What knowing the embedding layer helps you do
Debugging & Analysis
Inspect token representations: Extract a row from the embedding matrix to see how the model "thinks about" a word. Compare cosine similarity between rows to find semantic neighbors.

Detect tokenizer mismatches: If the embedding matrix has 128,256 rows but your tokenizer has 32,000 tokens, you have a mismatch — loading will fail or produce garbage.

Estimate overhead: Switching from 32K to 128K vocabulary adds ~1.5 GB to the model (with untied embeddings in BF16). Know this before choosing a model.
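The tokenizer-mismatch check above can be reduced to a single shape comparison (a trivial sketch; `check_vocab_match` is a hypothetical name for a sanity check you would run before loading):

```python
def check_vocab_match(embed_rows: int, tokenizer_size: int) -> bool:
    # The embedding's first dimension must match the tokenizer's
    # vocabulary, or token IDs will index the wrong (or no) rows.
    return embed_rows == tokenizer_size

assert check_vocab_match(128_256, 128_256)      # Llama 3 pair: OK
assert not check_vocab_match(128_256, 32_000)   # Llama 3 model, Llama 2 tokenizer
```

Read `embed_rows` from the safetensors header's shape field and `tokenizer_size` from the tokenizer's vocabulary; the two numbers come from different files, which is exactly why they can drift apart.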
Quick Reference
// Embedding layer cheat sheet:
Tensor name:  model.embed_tokens.weight
Shape:        [vocab_size, hidden_size]
Operation:    Table lookup (no matmul)
Mirror:       lm_head.weight (output)
Tied?:        Check config.tie_word_embeddings
Size formula: vocab × hidden × bytes_per_param
Llama 3.1 8B: 128256 × 4096 × 2 = ~1 GB
Key insight: The embedding layer is the bridge between human language and the model's internal math. Everything that happens inside the transformer operates on these 4,096-dimensional vectors. Next up: the attention weights that let the model relate these vectors to each other.