What the FFN Does
After attention mixes information between tokens, the FFN processes each token position independently, applying the same weights at every position. It's a two-layer neural network with a non-linear activation in between:
FFN(x) = W2 · activation(W1 · x)
The FFN expands the hidden dimension (4096 → 14336 for Llama 3 8B), applies a non-linearity, then projects back down (14336 → 4096). This expansion gives the model more capacity to learn complex transformations.
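The expand-activate-project pattern can be sketched in a few lines. This is a minimal numpy version with ReLU as the placeholder activation and toy dimensions standing in for 4096 → 14336 → 4096; the weight values are random, not real model weights:

```python
import numpy as np

def ffn(x, w1, w2):
    # Expand: (d_model,) @ (d_model, d_ff) -> (d_ff,)
    h = x @ w1
    # Non-linearity (ReLU here; modern LLMs use a gated variant instead)
    h = np.maximum(h, 0.0)
    # Project back down: (d_ff,) @ (d_ff, d_model) -> (d_model,)
    return h @ w2

# Toy dimensions standing in for Llama 3 8B's 4096 -> 14336 -> 4096
d_model, d_ff = 8, 28
rng = np.random.default_rng(0)
x = rng.standard_normal(d_model)
w1 = rng.standard_normal((d_model, d_ff))
w2 = rng.standard_normal((d_ff, d_model))
print(ffn(x, w1, w2).shape)  # (8,)
```

Note the output has the same shape as the input, so FFN blocks can be stacked with residual connections; all the extra capacity lives in the intermediate d_ff dimension.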
SwiGLU Activation
Modern LLMs use SwiGLU (Shazeer, 2020) instead of ReLU. SwiGLU uses a gating mechanism: it multiplies two linear projections element-wise, with one of them passed through a Swish (SiLU) activation:

FFN(x) = W_down · (Swish(W_gate · x) ⊙ (W_up · x))

This requires three weight matrices instead of two (gate_proj, up_proj, down_proj), but in Shazeer's experiments the gated variants outperform plain ReLU/GELU FFNs at matched parameter counts.
Llama 3 8B FFN: gate_proj (4096 × 14336), up_proj (4096 × 14336), down_proj (14336 × 4096) = 176M parameters per layer.
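A minimal numpy sketch of the SwiGLU FFN, using the same toy dimensions as above (random weights, row-vector convention `x @ W` rather than the `W · x` of the formula):

```python
import numpy as np

def swish(x):
    # Swish / SiLU: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # Gate path goes through Swish, up path stays linear;
    # multiply element-wise, then project back to d_model
    return (swish(x @ w_gate) * (x @ w_up)) @ w_down

d_model, d_ff = 8, 28
rng = np.random.default_rng(0)
x = rng.standard_normal(d_model)
w_gate = rng.standard_normal((d_model, d_ff))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))
print(swiglu_ffn(x, w_gate, w_up, w_down).shape)  # (8,)
```

The gate path lets the network learn, per intermediate unit, how much of the up-projection to let through, which is where the extra expressiveness over a plain activation comes from.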
Where Parameters Live
The FFN layers contain the majority of parameters in a transformer. For Llama 3 8B:
Attention per layer: Q (4096×4096) + K (4096×1024) + V (4096×1024) + O (4096×4096) = 41.9M params
FFN per layer: gate (4096×14336) + up (4096×14336) + down (14336×4096) = 176.2M params
FFN is 4.2x larger than attention per layer. Across 32 layers: attention = 1.34B, FFN = 5.64B. The FFN is where most of the model's "knowledge" is stored.
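The counts above are straightforward to verify from the matrix shapes (Llama 3 8B uses no bias terms in these projections):

```python
# Per-layer parameter counts for Llama 3 8B
attn_params = 4096*4096 + 4096*1024 + 4096*1024 + 4096*4096  # Q + K + V + O
ffn_params = 3 * 4096 * 14336                                # gate + up + down

print(f"attention per layer: {attn_params / 1e6:.1f}M")  # 41.9M
print(f"FFN per layer:       {ffn_params / 1e6:.1f}M")   # 176.2M
print(f"ratio:               {ffn_params / attn_params:.1f}x")  # 4.2x
print(f"32 layers: attn {attn_params * 32 / 1e9:.2f}B, "
      f"ffn {ffn_params * 32 / 1e9:.2f}B")  # 1.34B / 5.64B
```

The K and V projections are smaller than Q because Llama 3 uses grouped-query attention (8 KV heads of dimension 128 instead of 32), which is what pushes the FFN share of parameters even higher.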
For fine-tuning: LoRA typically targets the attention projections (q_proj, k_proj, v_proj, o_proj). Some practitioners also target the FFN projections (gate_proj, up_proj, down_proj) for more adapter capacity. Targeting all seven matrices per layer typically gives the best quality but trains more parameters and uses more optimizer memory. A common middle ground: target Q, K, V, O plus gate_proj and up_proj.
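The memory trade-off between these target choices is easy to quantify: a rank-r LoRA adapter on a (d_in × d_out) matrix adds r · (d_in + d_out) trainable parameters. A sketch for the Llama 3 8B shapes above, assuming an illustrative rank of 16:

```python
# Extra trainable parameters added by rank-r LoRA adapters,
# r * (d_in + d_out) per targeted matrix, for Llama 3 8B shapes.
# r = 16 is an assumed example rank, not a recommendation.
r = 16
shapes = {
    "q_proj": (4096, 4096), "k_proj": (4096, 1024),
    "v_proj": (4096, 1024), "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336), "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

def lora_params(targets, layers=32):
    return sum(r * (shapes[t][0] + shapes[t][1]) for t in targets) * layers

attn_only = ["q_proj", "k_proj", "v_proj", "o_proj"]
middle = attn_only + ["gate_proj", "up_proj"]
all_seven = list(shapes)
for name, targets in [("attention only ", attn_only),
                      ("attn + gate/up ", middle),
                      ("all seven      ", all_seven)]:
    print(f"{name}: {lora_params(targets) / 1e6:.1f}M trainable")
```

Even the all-seven configuration trains well under 1% of the 8B base parameters, which is why extending LoRA to the FFN projections is often an affordable upgrade.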