Ch 5 — The Feed-Forward Network: Where Knowledge Lives

SwiGLU gating, gate/up/down projections, and why the MLP owns ~70% of all parameters
High Level
[Interactive animation: Input → Gate and Up (two parallel projections) → element-wise Multiply → Down projection → Output]
Three Matrices Per Layer: The SwiGLU MLP
gate_proj, up_proj, down_proj — the knowledge storage system
The MLP Block
After attention mixes information between tokens, the feed-forward network (FFN/MLP) processes each token independently: two parallel expansion projections followed by one compression projection. Modern LLMs use the SwiGLU activation (Shazeer, 2020) instead of simple ReLU, which requires three weight matrices instead of two: gate_proj (controls information flow), up_proj (carries the data), and down_proj (compresses back to hidden_size).
Tensor Names (Layer 0)
model.layers.0.mlp.gate_proj.weight   // Shape: [14336, 4096] — "w1"
model.layers.0.mlp.up_proj.weight     // Shape: [14336, 4096] — "w3"
model.layers.0.mlp.down_proj.weight   // Shape: [4096, 14336] — "w2"

// 3 tensors × 32 layers = 96 MLP tensors
Key insight: The MLP's three matrices are the largest individual tensors in the file. Each gate_proj and up_proj is [14336, 4096] = 58.7M params ≈ 112 MB in BF16. They dwarf the attention tensors.
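The size arithmetic is easy to verify with a few lines (a quick sketch; 2 bytes per BF16 parameter, binary MiB):

```python
# Sanity-check the per-tensor size of one gate_proj / up_proj matrix.
intermediate, hidden = 14336, 4096

params = intermediate * hidden   # one [14336, 4096] weight matrix
mib = params * 2 / 2**20         # BF16 = 2 bytes per parameter

print(f"{params / 1e6:.1f}M params, {mib:.0f} MiB")  # 58.7M params, 112 MiB
```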
The Intermediate Dimension: 3.5× Expansion
Why intermediate_size is 14,336 when hidden_size is 4,096
Expansion Ratio
The FFN projects up to a larger dimension and back down. For Llama 3.1 8B, hidden_size is 4,096 but intermediate_size is 14,336, a ratio of about 3.5×. In standard FFNs (without SwiGLU), the ratio is typically 4× (which would give 16,384). Because SwiGLU adds a third matrix, the convention is to use 2/3 of that (≈10,922) so the total parameter cost stays comparable; Llama then scales this up and rounds to a GPU-friendly multiple of 1,024 (14,336) to hit its 8B parameter budget. The expansion gives the model a wider "workspace" to transform representations.
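A sketch of how 14,336 falls out, modeled on the sizing logic in the public Llama reference code (the values ffn_dim_multiplier=1.3 and multiple_of=1024 are assumptions taken from the Llama 3 8B config):

```python
def ffn_hidden_dim(dim: int, multiple_of: int = 1024,
                   ffn_dim_multiplier: float = 1.3) -> int:
    """Sketch of Llama-style intermediate_size derivation (assumed config)."""
    hidden = int(2 * (4 * dim) / 3)            # SwiGLU's 2/3 rule: ~10,922
    hidden = int(ffn_dim_multiplier * hidden)  # model-specific scaling
    # Round UP to the nearest multiple (GPU-friendly tile sizes)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

print(ffn_hidden_dim(4096))  # -> 14336
```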
Dimension Flow
// Data flow through the MLP:
Input:     [seq_len, 4096]
//   ↓ gate_proj: × [14336, 4096].T
Gate path: [seq_len, 14336]   // expand
//   ↓ up_proj: × [14336, 4096].T
Up path:   [seq_len, 14336]   // expand
//   ↓ element-wise multiply
Combined:  [seq_len, 14336]
//   ↓ down_proj: × [4096, 14336].T
Output:    [seq_len, 4096]    // compress
Why it matters: The 14,336-dimensional intermediate space is where the model does its "thinking" for each token — pattern matching, factual recall, reasoning. A wider intermediate dimension means more capacity for learned knowledge.
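The dimension flow above can be traced with random weights in NumPy. This is a shape-checking sketch with toy dimensions (the real model uses 4096 and 14336; the 4/8/28 sizes here are arbitrary):

```python
import numpy as np

# Toy dimensions; real model: hidden=4096, inter=14336 (same 3.5x ratio)
seq_len, hidden, inter = 4, 8, 28
rng = np.random.default_rng(0)

x         = rng.standard_normal((seq_len, hidden))
gate_proj = rng.standard_normal((inter, hidden))
up_proj   = rng.standard_normal((inter, hidden))
down_proj = rng.standard_normal((hidden, inter))

def swish(v):
    return v / (1.0 + np.exp(-v))   # v * sigmoid(v)

gate = swish(x @ gate_proj.T)       # [seq, inter]   expand
up   = x @ up_proj.T                # [seq, inter]   expand
out  = (gate * up) @ down_proj.T    # [seq, hidden]  compress

print(x.shape, gate.shape, out.shape)  # (4, 8) (4, 28) (4, 8)
```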
SwiGLU: How Gating Works
Swish activation × gated linear unit = selective information flow
The SwiGLU Formula
SwiGLU processes input x through two parallel paths: the gate path projects x with gate_proj, then applies a Swish activation (z · σ(z)), creating a learned gate that decides which dimensions to "let through." The up path projects x with up_proj, with no activation. The two are multiplied element-wise, so the gate selectively amplifies or suppresses each dimension. Finally, down_proj compresses back to hidden_size. This gating mechanism outperforms a plain ReLU FFN on most benchmarks.
Pseudocode
// SwiGLU forward pass:
gate   = swish(x @ gate_proj.T)   // [14336]
up     = x @ up_proj.T            // [14336]
hidden = gate * up                // element-wise
output = hidden @ down_proj.T     // [4096]

// Where swish(x) = x × sigmoid(x)
// The gate learns WHICH dimensions matter
// for each input token's representation
Key insight: The gate acts like a learned filter. For each of the 14,336 dimensions, Swish squashes negative pre-activations toward zero (bottoming out near −0.28) while letting large positive ones pass almost unchanged, selectively allowing information through. This is why SwiGLU uses three matrices instead of two: the third matrix (gate_proj) is the "filter control."
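Swish's filtering behavior shows up directly in its values; a stdlib-only sketch:

```python
import math

def swish(v: float) -> float:
    return v / (1.0 + math.exp(-v))   # v * sigmoid(v)

# Large negatives are squashed toward zero, positives pass nearly
# unchanged, and the curve dips to a minimum of about -0.28 near v=-1.28.
for v in (-6.0, -1.278, 0.0, 1.0, 6.0):
    print(f"swish({v:+.3f}) = {swish(v):+.4f}")
# swish(-6) ≈ -0.015, swish(-1.278) ≈ -0.278, swish(6) ≈ +5.985
```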
The MLP Dominates: ~70% of All Parameters
Why the feed-forward network is the biggest part of the file
Parameter Count
Each layer's MLP has three large matrices: gate [14336, 4096], up [14336, 4096], and down [4096, 14336]. That's 3 × 14336 × 4096 ≈ 176.2 million parameters per layer. Across 32 layers: ~5.6 billion MLP parameters out of 8.03B total. The MLP alone is about 70% of the entire model. Compare to attention at roughly 17% (GQA keeps the K/V projections small): the MLP is where most of the model's "knowledge" — facts, patterns, language rules — is stored.
MLP Per-Layer Budget
// MLP tensors per layer (BF16):
gate_proj: 14336 × 4096 = 58.7M  (112 MB)
up_proj:   14336 × 4096 = 58.7M  (112 MB)
down_proj: 4096 × 14336 = 58.7M  (112 MB)
ffn_norm:  [4096]       = 4K     (8 KB)
────────────────────────────────────
Total: ~176M params/layer (~336 MB)
× 32 = ~5.6B (~10.5 GB)
Key insight: The MLP consumes 10.5 GB of a 16 GB model. When you quantize from BF16 to 4-bit, you're primarily compressing these MLP tensors. Understanding this helps predict where quality loss will be most noticeable.
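The per-layer and whole-model budget can be reproduced in a few lines (a sketch; binary MiB/GiB, 2 bytes per BF16 parameter):

```python
hidden, inter, layers = 4096, 14336, 32

per_layer = 3 * inter * hidden + hidden   # three projections + ffn_norm
total     = per_layer * layers

print(f"per layer:  {per_layer / 1e6:.1f}M params "
      f"({per_layer * 2 / 2**20:.0f} MiB)")
print(f"all layers: {total / 1e9:.2f}B params "
      f"({total * 2 / 2**30:.1f} GiB)")
# per layer:  176.2M params (336 MiB)
# all layers: 5.64B params (10.5 GiB)
```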
What the FFN Actually "Knows"
Factual recall, pattern recognition, and learned heuristics
Knowledge Storage
Research has shown that the FFN layers store factual knowledge (e.g., "Paris is the capital of France") as specific patterns in their weight matrices. Individual neurons in the intermediate layer can activate for specific concepts. While attention determines which tokens to relate, the FFN determines what to do with that information — it's where the model's "reasoning shortcuts," language patterns, and world knowledge live.
Attention vs FFN Roles
// The transformer layer pipeline:
1. Attention: "What should I look at?"
   // Mixes info BETWEEN tokens
   // Token 5 reads from tokens 1-4
2. FFN/MLP: "What do I know about this?"
   // Processes each token INDEPENDENTLY
   // Applies learned knowledge to transform
   // the representation

// Analogy: attention = reading comprehension
//          FFN = applying expertise
Key insight: The FFN processes each token position independently — no cross-token communication. This is why it can be parallelized perfectly across sequence positions. The attention layer is what creates dependencies between tokens.
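That independence can be checked directly: running the MLP over the whole sequence at once gives the same result as running it one position at a time (NumPy sketch with toy dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)
seq, hidden, inter = 5, 8, 28   # toy sizes; real model: 4096 / 14336
x  = rng.standard_normal((seq, hidden))
w1 = rng.standard_normal((inter, hidden))   # gate_proj
w3 = rng.standard_normal((inter, hidden))   # up_proj
w2 = rng.standard_normal((hidden, inter))   # down_proj

def mlp(t):
    g = t @ w1.T
    g = g / (1.0 + np.exp(-g))              # swish
    return (g * (t @ w3.T)) @ w2.T

batched = mlp(x)                            # all tokens at once
per_tok = np.stack([mlp(row) for row in x]) # one position at a time
assert np.allclose(batched, per_tok)        # identical: no cross-token mixing
```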
SwiGLU vs. ReLU vs. GELU
Why modern LLMs switched to gated activations
Activation Evolution
ReLU (original): max(0, x). Simple but creates "dead neurons" — once a neuron outputs 0, it may never recover. Uses 2 matrices.

GELU (GPT-2/BERT era): Smooth approximation of ReLU with probabilistic gating. Better gradient flow. Still 2 matrices.

SwiGLU (Llama/Mistral): Explicit gating mechanism with 3 matrices. ~1-3% better on benchmarks than GELU at equivalent parameter count. The extra matrix cost is offset by reducing the intermediate dimension from 4× to ~3.5×.
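The three activations compared numerically (stdlib sketch; GELU in its exact erf form):

```python
import math

def relu(v):  return max(0.0, v)
def gelu(v):  return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))
def swish(v): return v / (1.0 + math.exp(-v))   # v * sigmoid(v)

for v in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(f"x={v:+.1f}  relu={relu(v):+.3f}  "
          f"gelu={gelu(v):+.3f}  swish={swish(v):+.3f}")
# ReLU zeroes ALL negatives (dead-neuron risk);
# GELU and Swish leak small negative values through,
# keeping gradients alive.
```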
Matrix Count Comparison
// Standard FFN (ReLU/GELU):
up_proj:   [16384, 4096]   // 4× expansion
down_proj: [4096, 16384]
// 2 matrices, 134M params/layer

// SwiGLU FFN (Llama):
gate_proj: [14336, 4096]   // ~3.5×
up_proj:   [14336, 4096]
down_proj: [4096, 14336]
// 3 matrices, 176M params/layer
// More params but better quality per param
The post_attention_layernorm
RMSNorm between attention and FFN
FFN Normalization
Just as input_layernorm normalizes before attention, post_attention_layernorm normalizes before the FFN. The tensor model.layers.{N}.post_attention_layernorm.weight has shape [4096] — another tiny but critical tensor. Llama uses Pre-Norm architecture where normalization happens before each sublayer (attention and FFN), not after. This produces more stable gradients during training.
Layer Data Flow
// Complete transformer layer flow:
x = input
x = x + attention(input_layernorm(x))
//  ↑ residual      ↑ norm before attn
x = x + ffn(post_attention_layernorm(x))
//  ↑ residual      ↑ norm before FFN

// The "+" is the residual connection
// It lets gradients flow directly through
// all 32 layers without vanishing
Key insight: The residual connection (the + operator) is why deep transformers work. Without it, signals would degrade across 32 layers. Each layer only needs to learn a small "correction" to the residual stream — not the full representation.
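The pre-norm residual pattern as a minimal runnable sketch; the attention and FFN stand-ins here are hypothetical placeholders (simple scalings), not real sublayers:

```python
import numpy as np

rng = np.random.default_rng(2)
hidden = 8                       # toy size; real model uses 4096
x = rng.standard_normal((4, hidden))

def rmsnorm(t, w, eps=1e-6):
    # RMSNorm: scale each row to unit RMS, then apply learned weight
    return t / np.sqrt((t * t).mean(-1, keepdims=True) + eps) * w

norm1_w = np.ones(hidden)        # input_layernorm.weight
norm2_w = np.ones(hidden)        # post_attention_layernorm.weight
attn = lambda t: 0.1 * t         # placeholder for the attention sublayer
ffn  = lambda t: 0.1 * t         # placeholder for the MLP sublayer

# Pre-norm layer: normalize, transform, add back onto the residual stream
x = x + attn(rmsnorm(x, norm1_w))
x = x + ffn(rmsnorm(x, norm2_w))
print(x.shape)  # (4, 8)
```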
Practical Takeaways
What knowing the FFN architecture tells you
Practical Applications
Quantization strategy: Since the FFN is ~70% of the model, per-tensor quantization precision here matters most. Some quantization schemes apply higher precision to attention and lower to FFN.

LoRA targeting: Fine-tuning with LoRA on gate_proj/up_proj/down_proj can modify factual knowledge. Targeting attention (q_proj/v_proj) modifies attention patterns instead.

Memory estimation: MLP = ~10.5 GB for 8B model in BF16. Quick formula: 3 × intermediate_size × hidden_size × num_layers × bytes_per_param.
FFN Cheat Sheet
// Feed-Forward Network quick reference:
Tensors:    gate_proj, up_proj, down_proj
Shapes:     [intermediate, hidden] × 2
            [hidden, intermediate] × 1
Activation: SwiGLU (swish gating)
Expansion:  ~3.5× (14336/4096)
% of model: ~70%
Per layer:  176M params (336 MB BF16)
All layers: 5.6B params (10.5 GB BF16)
Key insight: The FFN is the "knowledge store" of the model. Attention decides what to focus on; the FFN applies what the model has learned. Next: the special tensors that tie everything together — normalization, positional encoding, and the output head.