Ch 2 — Transformer Architecture for Fine-Tuning

Model anatomy: where the billions of parameters live and what changes during fine-tuning
High Level
Tokenizer → Embedding → Attention → FFN → Layers → LM Head → Formats
Tokenizer: Text to Numbers
The first step before any model computation
What a Tokenizer Does
LLMs don't see text. They see sequences of integers (token IDs). The tokenizer converts text into these IDs and back. Each model has its own tokenizer with a fixed vocabulary. Llama 3 uses a vocabulary of 128,256 tokens. GPT-4o uses approximately 200,000 tokens.

Tokenizers use Byte-Pair Encoding (BPE) or SentencePiece. Common words become single tokens; rare words are split into subword pieces. "unhappiness" might become ["un", "happiness"] or ["un", "happ", "iness"].
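The merge process can be sketched in a few lines. The merge rules below are invented for illustration (real tokenizers learn tens of thousands of merges from a corpus, and repeatedly merge the highest-priority pair present; this sketch applies each rule once in priority order, which is equivalent for this example):

```python
# Toy BPE: greedily apply learned merge rules to a character sequence.
# The merge list here is made up for demonstration.

def bpe_tokenize(word, merges):
    """Apply merge rules in priority order, merging adjacent pairs."""
    tokens = list(word)
    for pair in merges:  # ordered by priority (learned frequency)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Hypothetical merge rules, highest priority first
merges = [("u", "n"), ("h", "a"), ("p", "p"), ("ha", "pp"), ("i", "n"),
          ("e", "s"), ("happ", "in"), ("es", "s"), ("happin", "ess")]
print(bpe_tokenize("unhappiness", merges))  # ['un', 'happiness']
```

Note how the common prefix "un" survives as its own token, so it can be reused for "unfair", "unseen", and so on.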
Special Tokens
Models use special tokens to mark structure:

Llama 3: <|begin_of_text|>, <|start_header_id|>, <|end_header_id|>, <|eot_id|>
ChatML: <|im_start|>, <|im_end|>
Mistral: [INST], [/INST]

These tokens tell the model where system prompts, user messages, and assistant responses begin and end. Getting them wrong during fine-tuning causes garbled output.
Why It Matters for Fine-Tuning
1. Chat templates: Your training data must use the exact same special token format the model expects. Mismatched templates are the #1 cause of bad fine-tuning results.

2. Vocabulary size: The tokenizer's vocabulary is fixed. You can't add new tokens easily (though it's possible with embedding resizing). If your domain has specialized terms, the tokenizer will split them into subwords.

3. Token count: Tokenization determines training cost. A 1,000-word document might be 1,200-1,500 tokens. Training cost is proportional to total tokens processed.
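To make the chat-template point concrete, this is roughly what a single rendered Llama 3 training example looks like, built by hand from the special tokens listed above. In practice you should call tokenizer.apply_chat_template rather than assembling strings yourself, since the template ships with the model:

```python
# Hand-rolled sketch of the Llama 3 chat format (illustrative only;
# prefer tokenizer.apply_chat_template in real training pipelines).

def llama3_format(system, user, assistant):
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n" + system + "<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n" + user + "<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n" + assistant + "<|eot_id|>"
    )

text = llama3_format("You are a helpful assistant.",
                     "What is LoRA?",
                     "LoRA is a parameter-efficient fine-tuning method.")
print(text[:60])
```

If your training data renders with a different template (say, ChatML's <|im_start|> markers) than the base model expects, the model never sees the boundaries it was pre-trained on, which is exactly the mismatch described above.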
# Tokenization example
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokens = tok.encode("Fine-tuning is powerful")
# [34, 500, 12, 83, 7926, 374, 8147]
# 4 words → 7 tokens

# Vocabulary size
len(tok)  # 128256
Embedding Layer: IDs to Vectors
Turning discrete tokens into continuous representations
Token Embeddings
The embedding layer is a lookup table: a matrix of shape (vocab_size, hidden_dim). For Llama 3 8B: (128256, 4096) = 525 million parameters. Each token ID maps to a 4096-dimensional vector.

These vectors encode semantic meaning. Similar words have similar vectors. "king" and "queen" are close in embedding space; "king" and "bicycle" are far apart.
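"Close" here usually means cosine similarity. A minimal sketch with tiny made-up 4-d vectors (real embeddings are e.g. 4096-d rows of the embedding matrix):

```python
# Cosine similarity between toy embedding vectors. The vectors are
# invented for illustration; real ones come from the embedding matrix.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

king    = [0.9, 0.8, 0.1, 0.2]
queen   = [0.8, 0.9, 0.2, 0.1]
bicycle = [0.1, 0.0, 0.9, 0.8]
print(cosine(king, queen) > cosine(king, bicycle))  # True
```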
Positional Information
Transformers have no built-in sense of word order. Positional encoding adds position information to the embeddings.

RoPE (Rotary Position Embedding) is used by Llama 3, Mistral, and most modern models. It encodes position by rotating the embedding vectors. This allows the model to generalize to longer sequences than it was trained on (with some quality loss).
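A minimal sketch of the rotation idea for one 2-d pair of dimensions (real RoPE rotates many such pairs, each with its own frequency):

```python
# RoPE intuition: rotate each pair of embedding dims by an angle
# proportional to the token's position. Single pair, freq=1 for clarity.
import math

def rotate_pair(x, y, pos, freq=1.0):
    theta = pos * freq
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

q = rotate_pair(1.0, 0.0, pos=3)
k = rotate_pair(1.0, 0.0, pos=7)
# Rotation preserves length, and the dot product of the two rotated
# vectors depends only on the relative distance (7 - 3 = 4 positions):
print(round(math.hypot(*q), 6))             # 1.0
print(round(q[0] * k[0] + q[1] * k[1], 6))  # == cos(4)
```

That relative-distance property is why RoPE lets attention scores depend on how far apart two tokens are, rather than on their absolute positions.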
Fine-Tuning Impact
The embedding layer is typically not fine-tuned with LoRA (it's already well-trained during pre-training). In full fine-tuning, embeddings change slightly but they represent a small fraction of total parameters (~6% for Llama 3 8B).

If you add new special tokens (e.g., for a custom chat template), you need to resize the embedding matrix and initialize the new token embeddings. The new embeddings start random and need training data to learn meaningful representations.
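Conceptually, resizing just appends freshly initialized rows to the embedding matrix. A NumPy sketch of the operation (in transformers you would call tokenizer.add_special_tokens followed by model.resize_token_embeddings(len(tokenizer)) instead):

```python
# What resizing the embedding matrix does, conceptually: append new
# randomly initialized rows for the new token IDs. Toy dimensions.
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden, new_tokens = 100, 16, 2
emb = rng.standard_normal((vocab, hidden)) * 0.02            # pretrained table
new_rows = rng.standard_normal((new_tokens, hidden)) * 0.02  # random init
emb = np.vstack([emb, new_rows])                             # IDs 100 and 101
print(emb.shape)  # (102, 16)
```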
# Embedding dimensions for popular models
# Llama 3 8B:   128256 x 4096 = 525M params
# Llama 3 70B:  128256 x 8192 = 1.05B params
# Mistral 7B:   32000 x 4096  = 131M params
# GPT-2:        50257 x 768   = 38.6M params

# Some models (e.g. GPT-2, Gemma) tie the embedding matrix
# to the LM head, so it is stored once but used twice.
# Llama 3 8B/70B and Mistral 7B keep them separate.
Weight tying: Some LLMs share the embedding matrix with the output (LM head) layer: the same matrix that converts token IDs to vectors also converts final hidden states back to vocabulary logits, saving memory. GPT-2 and Gemma tie these weights; Llama 3 8B/70B and Mistral 7B do not (check tie_word_embeddings in the model config).
Self-Attention: The Core Mechanism
How the model relates tokens to each other
How Attention Works
For each token, attention asks: "Which other tokens should I pay attention to?" It computes three vectors from each token's representation:

Query (Q): "What am I looking for?"
Key (K): "What do I contain?"
Value (V): "What information do I provide?"

Attention score = softmax(Q · Kᵀ / √d). High scores mean strong relevance. The output is a weighted sum of V vectors, where weights are the attention scores.
Multi-Head Attention
Instead of one attention computation, the model runs multiple attention heads in parallel. Each head learns different relationship patterns: one head might track subject-verb agreement, another tracks coreference, another tracks semantic similarity.

Llama 3 8B: 32 attention heads, each with dimension 128 (32 × 128 = 4096).
Llama 3 70B: 64 attention heads, each with dimension 128 (64 × 128 = 8192).
Grouped-Query Attention (GQA)
Standard multi-head attention has separate Q, K, V projections per head. GQA (Ainslie et al., 2023) shares K and V across groups of query heads. This reduces memory during inference (smaller KV cache) with minimal quality loss.

Llama 3 8B: 32 query heads, 8 KV heads (4:1 ratio).
Llama 3 70B: 64 query heads, 8 KV heads (8:1 ratio).
Mistral 7B: 32 query heads, 8 KV heads.
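The memory saving is easy to quantify. A back-of-envelope sketch for Llama 3 8B at fp16 with an 8,192-token context (32 layers, head_dim 128; per layer the cache stores one K and one V vector per KV head per token):

```python
# KV cache size, with and without GQA, for Llama 3 8B at fp16.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per=2):
    # factor of 2 for K and V; bytes_per=2 for fp16/bf16
    return layers * seq_len * kv_heads * head_dim * 2 * bytes_per

mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(layers=32, kv_heads=8,  head_dim=128, seq_len=8192)
print(mha / 2**30, gqa / 2**30)  # 4.0 GiB vs 1.0 GiB: GQA is 4x smaller
```

At long contexts or large batch sizes, this 4x reduction is often the difference between fitting on one GPU or not.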
Causal Masking
For autoregressive generation (predicting the next token), each token can only attend to previous tokens, not future ones. This is enforced by a causal mask: a triangular matrix that sets future attention scores to negative infinity before softmax.
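Putting the pieces together, a minimal single-head causal attention in NumPy (toy shapes; real models add multi-head reshaping, GQA, and RoPE on top of this):

```python
# Single-head causal attention: softmax(Q·Kᵀ/√d) with future positions
# masked to -inf before the softmax.
import numpy as np

def causal_attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # (seq, seq)
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf                              # hide future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((3, 4)) for _ in range(3))
out, w = causal_attention(Q, K, V)
print(w[0])  # token 0 can only attend to itself: weight 1 on position 0
```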
Attention is where LoRA is applied by default. The Q, K, V, and O (output) projection matrices are the standard targets for LoRA fine-tuning. These four matrices per layer encode how tokens relate to one another. By adding small low-rank adapters to these projections, LoRA can modify the model's behavior with minimal parameter changes.
Feed-Forward Network (FFN)
The "thinking" layer that processes each token independently
What the FFN Does
After attention mixes information between tokens, the FFN processes each token independently. It's a two-layer neural network with a non-linear activation in between:

FFN(x) = W2 · activation(W1 · x)

The FFN expands the hidden dimension (4096 → 14336 for Llama 3 8B), applies a non-linearity, then projects back down (14336 → 4096). This expansion gives the model more capacity to learn complex transformations.
SwiGLU Activation
Modern LLMs use SwiGLU (Shazeer, 2020) instead of ReLU. SwiGLU uses a gating mechanism: it multiplies two linear projections element-wise, with one passed through a Swish activation. This requires three weight matrices instead of two (gate_proj, up_proj, down_proj) but produces better results.

Llama 3 8B FFN: gate_proj (4096 × 14336), up_proj (4096 × 14336), down_proj (14336 × 4096) = 176M parameters per layer.
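The gating can be sketched in NumPy with toy dimensions (Llama 3 8B uses 4096 → 14336 → 4096):

```python
# SwiGLU FFN sketch: down(silu(x @ gate) * (x @ up)). Toy sizes.
import numpy as np

def silu(x):
    """Swish with beta=1 (SiLU): x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d, d_ff = 8, 32                                  # toy: real is 4096, 14336
x = rng.standard_normal((3, d))                  # 3 tokens
W_gate, W_up = rng.standard_normal((2, d, d_ff)) * 0.1
W_down = rng.standard_normal((d_ff, d)) * 0.1
y = swiglu_ffn(x, W_gate, W_up, W_down)
print(y.shape)  # (3, 8): each token processed independently
```

The element-wise product of the gate path and the up path is what distinguishes SwiGLU from a plain two-matrix FFN.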
Where Parameters Live
The FFN layers contain the majority of parameters in a transformer. For Llama 3 8B:

Attention per layer: Q (4096×4096) + K (4096×1024) + V (4096×1024) + O (4096×4096) = 41.9M params

FFN per layer: gate (4096×14336) + up (4096×14336) + down (14336×4096) = 176.2M params

FFN is 4.2x larger than attention per layer. Across 32 layers: attention = 1.34B, FFN = 5.64B. The FFN is where most of the model's "knowledge" is stored.
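These counts are easy to verify with plain arithmetic:

```python
# Per-layer parameter counts for Llama 3 8B (hidden 4096; the 8 KV
# heads project to 8 x 128 = 1024 dims; FFN inner dim 14336).
h, kv, ff, layers = 4096, 1024, 14336, 32

attn = h*h + h*kv + h*kv + h*h   # Q, K, V, O projections
ffn = 2 * (h * ff) + ff * h      # gate, up, down

print(round(attn / 1e6, 1))      # 41.9
print(round(ffn / 1e6, 1))       # 176.2
print(round(ffn / attn, 1))      # 4.2
print(round(attn * layers / 1e9, 2), round(ffn * layers / 1e9, 2))  # 1.34 5.64
```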
For fine-tuning: LoRA typically targets the attention projections (q_proj, k_proj, v_proj, o_proj). Some practitioners also target the FFN projections (gate_proj, up_proj, down_proj) for more capacity. Targeting all seven matrices per layer gives the best results but uses more memory. A common middle ground: target Q, K, V, O + gate_proj and up_proj.
Stacking Transformer Layers
How depth creates capability
Layer Structure
Each transformer layer contains:

1. RMSNorm (pre-normalization)
2. Multi-Head Attention (with GQA)
3. Residual connection (add input back)
4. RMSNorm (pre-normalization)
5. Feed-Forward Network (SwiGLU)
6. Residual connection (add input back)

This block repeats N times. The residual connections are critical: they allow gradients to flow through the network without vanishing, enabling very deep models.
Model Sizes
Llama 3 8B: 32 layers, hidden_dim=4096
Llama 3 70B: 80 layers, hidden_dim=8192
Llama 3 405B: 126 layers, hidden_dim=16384
Mistral 7B: 32 layers, hidden_dim=4096
Phi-3 Mini (3.8B): 32 layers, hidden_dim=3072
Qwen 2.5 72B: 80 layers, hidden_dim=8192

More layers = more capacity but more memory and slower inference.
What Each Layer Learns
Research shows different layers specialize:

Early layers (1-10): Low-level features. Syntax, grammar, local word relationships. These change least during fine-tuning.

Middle layers (11-22): Semantic understanding. Meaning, context, reasoning patterns. These change moderately.

Late layers (23-32): Task-specific features. Output formatting, style, domain-specific patterns. These change most during fine-tuning.

This is why LoRA often works well: you only need to modify the upper layers' behavior, and small adapters are sufficient for that.
RMSNorm vs LayerNorm: Modern LLMs use RMSNorm (Root Mean Square Normalization) instead of the original LayerNorm. RMSNorm is simpler (no mean subtraction, just variance normalization) and 10-15% faster. It has very few parameters (just a scaling vector per layer, ~4096 params). Pre-normalization (normalizing before attention/FFN, not after) is now standard and improves training stability.
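A NumPy sketch of RMSNorm (g is the learned per-dimension scale vector; contrast with LayerNorm, which also subtracts the mean before scaling):

```python
# RMSNorm: divide by the root-mean-square of the vector, then scale.
# No mean subtraction and no bias, unlike the original LayerNorm.
import numpy as np

def rmsnorm(x, g, eps=1e-6):
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * g

x = np.array([1.0, 2.0, 3.0, 4.0])
g = np.ones(4)                 # learned scale, initialized to 1
y = rmsnorm(x, g)
print(np.mean(y * y))          # ~1.0: unit mean square after normalizing
```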
LM Head: Vectors to Words
The final projection that produces token probabilities
How the LM Head Works
The LM head takes the final hidden state (a 4096-dimensional vector for Llama 3 8B) and projects it to vocabulary size (128,256 dimensions). Each dimension represents the logit (unnormalized score) for one token in the vocabulary.

Apply softmax to convert logits to probabilities. The token with the highest probability is the model's prediction for the next token. During generation, this process repeats autoregressively.
Temperature & Sampling
Temperature: Divides logits before softmax. As T→0, sampling approaches greedy decoding (always pick the highest-logit token; most APIs treat T=0 as exactly greedy). T=1.0 samples from the trained distribution. T>1.0 makes output more random.

Top-p (nucleus sampling): Only sample from the smallest set of tokens whose cumulative probability exceeds p (e.g., 0.9).

Top-k: Only consider the k most likely tokens.

These are inference-time parameters. They don't affect fine-tuning, but understanding them helps you evaluate your fine-tuned model.
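A sketch of temperature and top-p applied to a toy logit vector (the logit values are invented for illustration):

```python
# Temperature scaling and top-p (nucleus) filtering of LM head logits.
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sample_probs(logits, temperature=1.0, top_p=1.0):
    probs = softmax(np.asarray(logits, dtype=float) / temperature)
    order = np.argsort(probs)[::-1]                  # most likely first
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]  # smallest set with cum ≥ p
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()                           # renormalize the nucleus

logits = [2.0, 1.0, 0.5, -1.0]
print(sample_probs(logits, temperature=0.7).round(3))  # sharper than T=1.0
print(sample_probs(logits, top_p=0.9))                 # tail token zeroed out
```

When evaluating a fine-tuned model, it helps to test with greedy or low-temperature decoding first, so you see the model's most likely behavior before layering sampling randomness on top.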
Parameter Count
The LM head is a single linear layer: (hidden_dim, vocab_size). For Llama 3 8B: (4096, 128256) = 525M parameters.

Some models tie this matrix to the embedding matrix (transposed), so the first and last layers share parameters; GPT-2 and Gemma do this. Llama 3 8B/70B and Mistral 7B keep a separate LM head, which is why Llama 3 8B's total includes a 525M-parameter head in addition to the 525M-parameter embedding.

The LM head is typically not targeted by LoRA: it's already well-trained, and adapting the attention and FFN projections is usually sufficient.
# Full parameter breakdown: Llama 3 8B
# Embedding:    128256 x 4096  = 525M
# Per layer (x32):
#   Attention: Q+K+V+O         = 42M
#   FFN: gate+up+down          = 176M
#   Norms: 2 x 4096            = ~8K
#   Layer total:               = 218M
# All 32 layers:               = 6.98B
# Final norm:   4096           = ~4K
# LM Head:      4096 x 128256  = 525M (not tied)
# ─────────────────────────────────────
# Total:                       ~8.03B parameters
Model Formats & Precision
How models are stored, loaded, and quantized
File Formats
safetensors: The modern standard (HuggingFace). Safe, fast, memory-mapped. Used by Llama 3, Mistral, and most open models. Files are named model-00001-of-00004.safetensors for sharded models.

PyTorch (.bin): Legacy format using Python pickle. Can execute arbitrary code on load (security risk). Being phased out in favor of safetensors.

GGUF: Format used by llama.cpp and Ollama. Optimized for CPU inference and quantized models. Single file containing model + metadata + tokenizer.
Numerical Precision
fp32 (32-bit float): Full precision. 4 bytes per parameter. 7B model = 28 GB. Used for optimizer states.

fp16 (16-bit float): Half precision. 2 bytes per parameter. 7B = 14 GB. Standard for training and inference.

bf16 (bfloat16): Same size as fp16 but with fp32's exponent range. Better for training (less overflow/underflow). Preferred on A100/H100 GPUs.

int8 (8-bit integer): 1 byte per parameter. 7B = 7 GB. Quantized inference with minimal quality loss.

int4 (4-bit integer): 0.5 bytes per parameter. 7B = 3.5 GB. Used by QLoRA (NF4 format). Some quality loss but enables fine-tuning large models on consumer GPUs.
Quantization Methods
GPTQ: Post-training quantization. Calibrates on a small dataset. Good quality at 4-bit. Used for inference deployment.

AWQ (Activation-Aware Weight Quantization): Preserves salient weights based on activation patterns. Often slightly better than GPTQ.

NF4 (Normal Float 4-bit): Used by QLoRA. Quantization-aware format optimized for normally-distributed weights. Enables 4-bit base model loading during fine-tuning.

GGUF quantization: Multiple levels (Q2_K through Q8_0). Q4_K_M is the most popular balance of quality and size for local inference.
Format        Bytes/param   7B model   Use
fp16 / bf16   2             14 GB      Training + inference
int8          1             7 GB       Inference only
int4 (NF4)    0.5           3.5 GB     QLoRA training
GGUF Q4_K_M   ~0.56         ~4.1 GB    Local inference
For fine-tuning: Train in bf16 (if GPU supports it) or fp16. Use NF4 quantization for QLoRA. For deployment, quantize to GPTQ/AWQ (4-bit) or GGUF for local inference. The precision you train in and the precision you deploy in are often different.