Ch 9 — Attention & Transformers
Scaled dot-product math, multi-head mechanics, positional encoding, and the full architecture
Zone A: Scaled Dot-Product Attention (Steps 1–2)
1. QKV Projections: X·W_Q, X·W_K, X·W_V
2. Attention Scores: softmax(QKᵀ/√dₖ)·V (Steps 1–2 sketched in code below)
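A minimal PyTorch sketch of Steps 1–2 for a single head; the function and variable names (scaled_dot_product_attention, d_model, d_k) are illustrative choices, not names from the chapter:

```python
import math
import torch

def scaled_dot_product_attention(X, W_Q, W_K, W_V):
    # Step 1: project the input X (seq_len, d_model) into queries, keys, values.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # each (seq_len, d_k)
    # Step 2: scores scaled by sqrt(d_k) so logits stay well-conditioned as d_k grows.
    scores = Q @ K.transpose(-2, -1) / math.sqrt(K.shape[-1])   # (seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)      # each row sums to 1
    return weights @ V                           # (seq_len, d_k)

seq_len, d_model, d_k = 8, 16, 4
X = torch.randn(seq_len, d_model)
W_Q, W_K, W_V = (torch.randn(d_model, d_k) for _ in range(3))
print(scaled_dot_product_attention(X, W_Q, W_K, W_V).shape)   # torch.Size([8, 4])
```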
↓ Single head → multiple heads
Zone B: Multi-Head Attention & Transformer Block (Steps 3–5)
3. Multi-Head: concat + project (code sketch below)
4. FFN + Residual: LayerNorm + skip (code sketch below)
5. Parameter Count: full-model FLOPs (code sketch below)
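Step 3 in code: split d_model across n_heads, attend per head, then concatenate and project back. A sketch assuming a fused QKV projection (one common layout); all names are illustrative:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projection
        self.proj = nn.Linear(d_model, d_model)      # the "concat + project" step

    def forward(self, x):                            # x: (batch, seq, d_model)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, d_head) so each head attends independently.
        shape = lambda t: t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = shape(q), shape(k), shape(v)
        att = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_head), dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, -1)  # concat heads
        return self.proj(y)                          # project back to d_model
```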
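Step 4, continuing the sketch above (it reuses MultiHeadAttention and the same imports): a transformer block wraps attention and a position-wise FFN in residual ("skip") connections. Pre-norm LayerNorm placement is assumed here; post-norm, as in the original paper, is the other common choice:

```python
class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                    # position-wise feed-forward
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # residual/skip around attention
        x = x + self.ffn(self.ln2(x))    # residual/skip around the FFN
        return x
```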
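Step 5 as back-of-envelope arithmetic. Per block, the four attention projections cost 4·d_model² weights and the FFN 2·d_model·d_ff; with the usual d_ff = 4·d_model that is about 12·d_model² per block. The GPT-2-small-like config below is an assumed example, and the FLOPs line uses the common ~2 FLOPs per parameter per token rule of thumb (ignoring the seq²-dependent attention-score term):

```python
def params_per_block(d_model: int, d_ff: int) -> int:
    attn = 4 * d_model * d_model     # W_Q, W_K, W_V, W_O (biases/LayerNorm omitted)
    ffn = 2 * d_model * d_ff         # up-projection + down-projection
    return attn + ffn                # ≈ 12·d_model² when d_ff = 4·d_model

d_model, d_ff, n_layers, vocab = 768, 3072, 12, 50257   # GPT-2-small-like config
total = n_layers * params_per_block(d_model, d_ff) + vocab * d_model  # + embeddings
print(f"≈{total / 1e6:.0f}M parameters")     # ≈124M
flops_per_token = 2 * total                  # forward pass, rule of thumb
```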
↓ Inject position information
Zone C: Positional Encoding & Masking (Steps 6–7)
6. Sinusoidal & RoPE: position-encoding math (code sketch below)
7. Causal Mask: autoregressive decoding (code sketch below)
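Step 6: the sinusoidal scheme from the original Transformer assigns PE(pos, 2i) = sin(pos/10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos/10000^(2i/d_model)); RoPE instead rotates each (even, odd) feature pair of Q and K by a position-dependent angle. A sketch of both; the interleaved pairing in rope() is one convention, rotate-half implementations pair dimensions differently:

```python
import math
import torch

def sinusoidal_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(seq_len).unsqueeze(1)               # (seq, 1)
    freq = torch.exp(-math.log(10000.0) * torch.arange(0, d_model, 2) / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)    # PE(pos, 2i)   = sin(pos · ω_i)
    pe[:, 1::2] = torch.cos(pos * freq)    # PE(pos, 2i+1) = cos(pos · ω_i)
    return pe                              # added to token embeddings

def rope(x: torch.Tensor) -> torch.Tensor: # x: (seq, d), applied to Q and K
    seq_len, d = x.shape
    angle = torch.arange(seq_len).unsqueeze(1) * \
            torch.exp(-math.log(10000.0) * torch.arange(0, d, 2) / d)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)              # rotate each (x1, x2) pair by `angle`
    out[:, 0::2] = x1 * torch.cos(angle) - x2 * torch.sin(angle)
    out[:, 1::2] = x1 * torch.sin(angle) + x2 * torch.cos(angle)
    return out
```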
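Step 7: causal masking keeps decoding autoregressive by setting logits for future positions (j > i) to -inf before the softmax, so they receive zero weight. A single-head sketch; batched code applies the same boolean mask to the last two score dimensions:

```python
import math
import torch

def causal_attention(Q, K, V):             # each (seq_len, d_k)
    T = Q.shape[0]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.shape[-1])
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))   # hide positions j > i
    return torch.softmax(scores, dim=-1) @ V             # rows still sum to 1
```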
↓ Efficiency optimizations
Zone D: Efficiency — FlashAttention, KV Cache, GQA (Steps 8–9)
8. FlashAttention: tiling & kernel fusion (code sketch below)
9. KV Cache & GQA: inference optimization (code sketch below)
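Step 8: FlashAttention's core trick is tiling plus an online softmax, so the full T×T score matrix never materializes in slow memory, combined with kernel fusion on the GPU. The recurrence below is a readable single-head PyTorch sketch of that math, not the fused CUDA kernel itself:

```python
import math
import torch

def flash_attention_sketch(Q, K, V, tile: int = 64):
    T, d = Q.shape
    scale = 1.0 / math.sqrt(d)
    out = torch.zeros_like(Q)                    # running (unnormalized) output
    row_max = torch.full((T, 1), float("-inf"))  # running max logit per query
    row_sum = torch.zeros(T, 1)                  # running softmax denominator
    for j in range(0, K.shape[0], tile):         # stream over key/value tiles
        S = Q @ K[j:j + tile].T * scale          # scores for this tile only
        new_max = torch.maximum(row_max, S.max(dim=-1, keepdim=True).values)
        alpha = torch.exp(row_max - new_max)     # rescale older accumulators
        P = torch.exp(S - new_max)
        row_sum = alpha * row_sum + P.sum(dim=-1, keepdim=True)
        out = alpha * out + P @ V[j:j + tile]
        row_max = new_max
    return out / row_sum                         # normalize once at the end
```

The result matches plain softmax attention exactly; only the evaluation order (and the memory traffic) changes.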
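Step 9: at decode time each new token needs only its own query, so the keys and values of earlier tokens are cached instead of recomputed; GQA (grouped-query attention) then shrinks that cache by letting a group of query heads share one K/V head. Both ideas sketched with illustrative names:

```python
import math
import torch

def decode_step(q_new, k_new, v_new, cache):     # q/k/v_new: (1, d_head)
    cache["k"] = torch.cat([cache["k"], k_new])  # cache grows one row per token
    cache["v"] = torch.cat([cache["v"], v_new])
    scores = q_new @ cache["k"].T / math.sqrt(q_new.shape[-1])   # (1, t)
    return torch.softmax(scores, dim=-1) @ cache["v"]            # (1, d_head)

def expand_kv_for_gqa(kv, n_q_heads):            # kv: (n_kv_heads, seq, d_head)
    group = n_q_heads // kv.shape[0]             # query heads per shared K/V head
    return kv.repeat_interleave(group, dim=0)    # (n_q_heads, seq, d_head)
```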
↓ Architecture variants
Zone E: Architecture Variants & Modern Innovations (Step 10)
10. BERT / GPT / T5: encoder vs decoder (code sketch below)
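Step 10: much of the encoder/decoder split is visible in the attention mask alone. BERT-style encoders attend bidirectionally, GPT-style decoders causally, and T5 pairs a bidirectional encoder with a causal decoder plus cross-attention. A schematic comparison (masks only, not full models):

```python
import torch

T = 6  # toy sequence length
encoder_mask = torch.ones(T, T, dtype=torch.bool)              # BERT: all positions visible
decoder_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))  # GPT: position i sees j <= i
# T5: encoder_mask on the source, decoder_mask on the target, plus a cross-attention
# mask letting every decoder position attend to all encoder outputs.
cross_attention_mask = torch.ones(T, T, dtype=torch.bool)
```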