Ch 5 — The Feed-Forward Network: Where Knowledge Lives

SwiGLU gating, gate/up/down projections, and why the MLP owns ~70% of all parameters
High Level
[Interactive animation: Input → Gate and Up (two parallel projections) → element-wise Multiply → Down projection → Output]
Three Matrices Per Layer: The SwiGLU MLP
gate_proj, up_proj, down_proj — the knowledge storage system
The MLP Block
After attention mixes information between tokens, the feed-forward network (FFN/MLP) processes each token independently: two parallel expansion projections followed by one compression projection. Modern LLMs use the SwiGLU activation (Shazeer, 2020) instead of simple ReLU, which requires three weight matrices instead of two: gate_proj (controls information flow), up_proj (carries the data), and down_proj (compresses back to hidden_size).
Tensor Names (Layer 0)
model.layers.0.mlp.gate_proj.weight   // Shape: [14336, 4096] — "w1"
model.layers.0.mlp.up_proj.weight     // Shape: [14336, 4096] — "w3"
model.layers.0.mlp.down_proj.weight   // Shape: [4096, 14336] — "w2"

// 3 tensors × 32 layers = 96 MLP tensors
Key insight: The MLP's three matrices are the largest individual tensors in the file. Each gate_proj and up_proj is [14336, 4096] = 58.7M params ≈ 112 MB in BF16. They dwarf the attention tensors.
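The size arithmetic is easy to verify with a few lines (a quick sketch; 2 bytes per BF16 parameter, binary MiB):

```python
# Sanity-check the per-tensor size of one gate_proj / up_proj matrix.
intermediate, hidden = 14336, 4096

params = intermediate * hidden   # one [14336, 4096] weight matrix
mib = params * 2 / 2**20         # BF16 = 2 bytes per parameter

print(f"{params / 1e6:.1f}M params, {mib:.0f} MiB")  # 58.7M params, 112 MiB
```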
The Intermediate Dimension: 3.5× Expansion
Why intermediate_size is 14,336 when hidden_size is 4,096
Expansion Ratio
The FFN projects up to a larger dimension and back down. For Llama 3.1 8B, hidden_size is 4,096 but intermediate_size is 14,336, a ratio of about 3.5×. In standard FFNs (without SwiGLU), the ratio is typically 4× (which would give 16,384). Because SwiGLU adds a third matrix, the convention is to use 2/3 of that (≈10,922) so the total parameter cost stays comparable; Llama then scales this up and rounds to a GPU-friendly multiple of 1,024 (14,336) to hit its 8B parameter budget. The expansion gives the model a wider "workspace" to transform representations.
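A sketch of how 14,336 falls out, modeled on the sizing logic in the public Llama reference code (the values ffn_dim_multiplier=1.3 and multiple_of=1024 are assumptions taken from the Llama 3 8B config):

```python
def ffn_hidden_dim(dim: int, multiple_of: int = 1024,
                   ffn_dim_multiplier: float = 1.3) -> int:
    """Sketch of Llama-style intermediate_size derivation (assumed config)."""
    hidden = int(2 * (4 * dim) / 3)            # SwiGLU's 2/3 rule: ~10,922
    hidden = int(ffn_dim_multiplier * hidden)  # model-specific scaling
    # Round UP to the nearest multiple (GPU-friendly tile sizes)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

print(ffn_hidden_dim(4096))  # -> 14336
```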
Dimension Flow
// Data flow through the MLP:
Input:     [seq_len, 4096]
//   ↓ gate_proj: × [14336, 4096].T
Gate path: [seq_len, 14336]   // expand
//   ↓ up_proj: × [14336, 4096].T
Up path:   [seq_len, 14336]   // expand
//   ↓ element-wise multiply
Combined:  [seq_len, 14336]
//   ↓ down_proj: × [4096, 14336].T
Output:    [seq_len, 4096]    // compress
Why it matters: The 14,336-dimensional intermediate space is where the model does its "thinking" for each token — pattern matching, factual recall, reasoning. A wider intermediate dimension means more capacity for learned knowledge.
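The dimension flow above can be traced with random weights in NumPy. This is a shape-checking sketch with toy dimensions (the real model uses 4096 and 14336; the 4/8/28 sizes here are arbitrary):

```python
import numpy as np

# Toy dimensions; real model: hidden=4096, inter=14336 (same 3.5x ratio)
seq_len, hidden, inter = 4, 8, 28
rng = np.random.default_rng(0)

x         = rng.standard_normal((seq_len, hidden))
gate_proj = rng.standard_normal((inter, hidden))
up_proj   = rng.standard_normal((inter, hidden))
down_proj = rng.standard_normal((hidden, inter))

def swish(v):
    return v / (1.0 + np.exp(-v))   # v * sigmoid(v)

gate = swish(x @ gate_proj.T)       # [seq, inter]   expand
up   = x @ up_proj.T                # [seq, inter]   expand
out  = (gate * up) @ down_proj.T    # [seq, hidden]  compress

print(x.shape, gate.shape, out.shape)  # (4, 8) (4, 28) (4, 8)
```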
SwiGLU: How Gating Works
Swish activation × gated linear unit = selective information flow
The SwiGLU Formula
SwiGLU processes input x through two parallel paths: the gate path projects x with gate_proj, then applies a Swish activation (z · σ(z)), creating a learned gate that decides which dimensions to "let through." The up path projects x with up_proj, with no activation. The two are multiplied element-wise, so the gate selectively amplifies or suppresses each dimension. Finally, down_proj compresses back to hidden_size. This gating mechanism outperforms a plain ReLU FFN on most benchmarks.
Pseudocode
// SwiGLU forward pass:
gate   = swish(x @ gate_proj.T)   // [14336]
up     = x @ up_proj.T            // [14336]
hidden = gate * up                // element-wise
output = hidden @ down_proj.T     // [4096]

// Where swish(x) = x × sigmoid(x)
// The gate learns WHICH dimensions matter
// for each input token's representation
Key insight: The gate acts like a learned filter. For each of the 14,336 dimensions, Swish squashes negative pre-activations toward zero (bottoming out near −0.28) while letting large positive ones pass almost unchanged, selectively allowing information through. This is why SwiGLU uses three matrices instead of two: the third matrix (gate_proj) is the "filter control."
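Swish's filtering behavior shows up directly in its values; a stdlib-only sketch:

```python
import math

def swish(v: float) -> float:
    return v / (1.0 + math.exp(-v))   # v * sigmoid(v)

# Large negatives are squashed toward zero, positives pass nearly
# unchanged, and the curve dips to a minimum of about -0.28 near v=-1.28.
for v in (-6.0, -1.278, 0.0, 1.0, 6.0):
    print(f"swish({v:+.3f}) = {swish(v):+.4f}")
# swish(-6) ≈ -0.015, swish(-1.278) ≈ -0.278, swish(6) ≈ +5.985
```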
The MLP Dominates: ~70% of All Parameters
Why the feed-forward network is the biggest part of the file
Parameter Count
Each layer's MLP has three large matrices: gate [14336, 4096], up [14336, 4096], and down [4096, 14336]. That's 3 × 14336 × 4096 ≈ 176.2 million parameters per layer. Across 32 layers: ~5.6 billion MLP parameters out of 8.03B total. The MLP alone is about 70% of the entire model. Compare to attention at roughly 17% (GQA keeps the K/V projections small): the MLP is where most of the model's "knowledge" — facts, patterns, language rules — is stored.
MLP Per-Layer Budget
// MLP tensors per layer (BF16):
gate_proj: 14336 × 4096 = 58.7M  (112 MB)
up_proj:   14336 × 4096 = 58.7M  (112 MB)
down_proj: 4096 × 14336 = 58.7M  (112 MB)
ffn_norm:  [4096]       = 4K     (8 KB)
────────────────────────────────────
Total: ~176M params/layer (~336 MB)
× 32 = ~5.6B (~10.5 GB)
Key insight: The MLP consumes 10.5 GB of a 16 GB model. When you quantize from BF16 to 4-bit, you're primarily compressing these MLP tensors. Understanding this helps predict where quality loss will be most noticeable.
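The per-layer and whole-model budget can be reproduced in a few lines (a sketch; binary MiB/GiB, 2 bytes per BF16 parameter):

```python
hidden, inter, layers = 4096, 14336, 32

per_layer = 3 * inter * hidden + hidden   # three projections + ffn_norm
total     = per_layer * layers

print(f"per layer:  {per_layer / 1e6:.1f}M params "
      f"({per_layer * 2 / 2**20:.0f} MiB)")
print(f"all layers: {total / 1e9:.2f}B params "
      f"({total * 2 / 2**30:.1f} GiB)")
# per layer:  176.2M params (336 MiB)
# all layers: 5.64B params (10.5 GiB)
```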
What the FFN Actually "Knows"
Factual recall, pattern recognition, and learned heuristics
Knowledge Storage
Research has shown that the FFN layers store factual knowledge (e.g., "Paris is the capital of France") as specific patterns in their weight matrices. Individual neurons in the intermediate layer can activate for specific concepts. While attention determines which tokens to relate, the FFN determines what to do with that information — it's where the model's "reasoning shortcuts," language patterns, and world knowledge live.
Attention vs FFN Roles
// The transformer layer pipeline:
1. Attention: "What should I look at?"
   // Mixes info BETWEEN tokens
   // Token 5 reads from tokens 1-4
2. FFN/MLP: "What do I know about this?"
   // Processes each token INDEPENDENTLY
   // Applies learned knowledge to transform
   // the representation

// Analogy: attention = reading comprehension
//          FFN = applying expertise
Key insight: The FFN processes each token position independently — no cross-token communication. This is why it can be parallelized perfectly across sequence positions. The attention layer is what creates dependencies between tokens.
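That independence can be checked directly: running the MLP over the whole sequence at once gives the same result as running it one position at a time (NumPy sketch with toy dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)
seq, hidden, inter = 5, 8, 28   # toy sizes; real model: 4096 / 14336
x  = rng.standard_normal((seq, hidden))
w1 = rng.standard_normal((inter, hidden))   # gate_proj
w3 = rng.standard_normal((inter, hidden))   # up_proj
w2 = rng.standard_normal((hidden, inter))   # down_proj

def mlp(t):
    g = t @ w1.T
    g = g / (1.0 + np.exp(-g))              # swish
    return (g * (t @ w3.T)) @ w2.T

batched = mlp(x)                            # all tokens at once
per_tok = np.stack([mlp(row) for row in x]) # one position at a time
assert np.allclose(batched, per_tok)        # identical: no cross-token mixing
```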
SwiGLU vs. ReLU vs. GELU
Why modern LLMs switched to gated activations
Activation Evolution
ReLU (original): max(0, x). Simple but creates "dead neurons" — once a neuron outputs 0, it may never recover. Uses 2 matrices.

GELU (GPT-2/BERT era): Smooth approximation of ReLU with probabilistic gating. Better gradient flow. Still 2 matrices.

SwiGLU (Llama/Mistral): Explicit gating mechanism with 3 matrices. ~1-3% better on benchmarks than GELU at equivalent parameter count. The extra matrix cost is offset by reducing the intermediate dimension from 4× to ~3.5×.
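The three activations compared numerically (stdlib sketch; GELU in its exact erf form):

```python
import math

def relu(v):  return max(0.0, v)
def gelu(v):  return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))
def swish(v): return v / (1.0 + math.exp(-v))   # v * sigmoid(v)

for v in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(f"x={v:+.1f}  relu={relu(v):+.3f}  "
          f"gelu={gelu(v):+.3f}  swish={swish(v):+.3f}")
# ReLU zeroes ALL negatives (dead-neuron risk);
# GELU and Swish leak small negative values through,
# keeping gradients alive.
```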
Matrix Count Comparison
// Standard FFN (ReLU/GELU):
up_proj:   [16384, 4096]   // 4× expansion
down_proj: [4096, 16384]
// 2 matrices, 134M params/layer

// SwiGLU FFN (Llama):
gate_proj: [14336, 4096]   // ~3.5×
up_proj:   [14336, 4096]
down_proj: [4096, 14336]
// 3 matrices, 176M params/layer
// More params but better quality per param
The post_attention_layernorm
RMSNorm between attention and FFN
FFN Normalization
Just as input_layernorm normalizes before attention, post_attention_layernorm normalizes before the FFN. The tensor model.layers.{N}.post_attention_layernorm.weight has shape [4096] — another tiny but critical tensor. Llama uses Pre-Norm architecture where normalization happens before each sublayer (attention and FFN), not after. This produces more stable gradients during training.
Layer Data Flow
// Complete transformer layer flow:
x = input
x = x + attention(input_layernorm(x))
//  ↑ residual      ↑ norm before attn
x = x + ffn(post_attention_layernorm(x))
//  ↑ residual      ↑ norm before FFN

// The "+" is the residual connection
// It lets gradients flow directly through
// all 32 layers without vanishing
Key insight: The residual connection (the + operator) is why deep transformers work. Without it, signals would degrade across 32 layers. Each layer only needs to learn a small "correction" to the residual stream — not the full representation.
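The pre-norm residual pattern as a minimal runnable sketch; the attention and FFN stand-ins here are hypothetical placeholders (simple scalings), not real sublayers:

```python
import numpy as np

rng = np.random.default_rng(2)
hidden = 8                       # toy size; real model uses 4096
x = rng.standard_normal((4, hidden))

def rmsnorm(t, w, eps=1e-6):
    # RMSNorm: scale each row to unit RMS, then apply learned weight
    return t / np.sqrt((t * t).mean(-1, keepdims=True) + eps) * w

norm1_w = np.ones(hidden)        # input_layernorm.weight
norm2_w = np.ones(hidden)        # post_attention_layernorm.weight
attn = lambda t: 0.1 * t         # placeholder for the attention sublayer
ffn  = lambda t: 0.1 * t         # placeholder for the MLP sublayer

# Pre-norm layer: normalize, transform, add back onto the residual stream
x = x + attn(rmsnorm(x, norm1_w))
x = x + ffn(rmsnorm(x, norm2_w))
print(x.shape)  # (4, 8)
```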
Practical Takeaways
What knowing the FFN architecture tells you
Practical Applications
Quantization strategy: Since the FFN is ~70% of the model, per-tensor quantization precision here matters most. Some quantization schemes apply higher precision to attention and lower to FFN.

LoRA targeting: Fine-tuning with LoRA on gate_proj/up_proj/down_proj can modify factual knowledge. Targeting attention (q_proj/v_proj) modifies attention patterns instead.

Memory estimation: MLP = ~10.5 GB for 8B model in BF16. Quick formula: 3 × intermediate_size × hidden_size × num_layers × bytes_per_param.
FFN Cheat Sheet
// Feed-Forward Network quick reference:
Tensors:    gate_proj, up_proj, down_proj
Shapes:     [intermediate, hidden] × 2
            [hidden, intermediate] × 1
Activation: SwiGLU (swish gating)
Expansion:  ~3.5× (14336/4096)
% of model: ~70%
Per layer:  176M params (336 MB BF16)
All layers: 5.6B params (10.5 GB BF16)
Key insight: The FFN is the "knowledge store" of the model. Attention decides what to focus on; the FFN applies what the model has learned. Next: the special tensors that tie everything together — normalization, positional encoding, and the output head.