Ch 9 — Attention & Transformers
Scaled dot-product math, multi-head mechanics, positional encoding, and the full architecture
Zone A: Scaled Dot-Product Attention (Steps 1–2)
1. QKV Projections: X·W_Q, X·W_K, X·W_V
2. Attention Scores: softmax(QKᵀ/√dₖ)·V (Steps 1–2 sketched in code below)
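A minimal PyTorch sketch of Steps 1–2 for a single head; the function and variable names (scaled_dot_product_attention, d_model, d_k) are illustrative choices, not names from the chapter:

```python
import math
import torch

def scaled_dot_product_attention(X, W_Q, W_K, W_V):
    # Step 1: project the input X (seq_len, d_model) into queries, keys, values.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # each (seq_len, d_k)
    # Step 2: scores scaled by sqrt(d_k) so logits stay well-conditioned as d_k grows.
    scores = Q @ K.transpose(-2, -1) / math.sqrt(K.shape[-1])   # (seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)      # each row sums to 1
    return weights @ V                           # (seq_len, d_k)

seq_len, d_model, d_k = 8, 16, 4
X = torch.randn(seq_len, d_model)
W_Q, W_K, W_V = (torch.randn(d_model, d_k) for _ in range(3))
print(scaled_dot_product_attention(X, W_Q, W_K, W_V).shape)   # torch.Size([8, 4])
```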
↓ Single head → multiple heads
Zone B: Multi-Head Attention & Transformer Block (Steps 3–5)
3. Multi-Head: concat + project (code sketch below)
4. FFN + Residual: LayerNorm + skip (code sketch below)
5. Parameter Count: full-model FLOPs (code sketch below)
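Step 3 in code: split d_model across n_heads, attend per head, then concatenate and project back. A sketch assuming a fused QKV projection (one common layout); all names are illustrative:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projection
        self.proj = nn.Linear(d_model, d_model)      # the "concat + project" step

    def forward(self, x):                            # x: (batch, seq, d_model)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, d_head) so each head attends independently.
        shape = lambda t: t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = shape(q), shape(k), shape(v)
        att = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_head), dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, -1)  # concat heads
        return self.proj(y)                          # project back to d_model
```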
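Step 4, continuing the sketch above (it reuses MultiHeadAttention and the same imports): a transformer block wraps attention and a position-wise FFN in residual ("skip") connections. Pre-norm LayerNorm placement is assumed here; post-norm, as in the original paper, is the other common choice:

```python
class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                    # position-wise feed-forward
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # residual/skip around attention
        x = x + self.ffn(self.ln2(x))    # residual/skip around the FFN
        return x
```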
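Step 5 as back-of-envelope arithmetic. Per block, the four attention projections cost 4·d_model² weights and the FFN 2·d_model·d_ff; with the usual d_ff = 4·d_model that is about 12·d_model² per block. The GPT-2-small-like config below is an assumed example, and the FLOPs line uses the common ~2 FLOPs per parameter per token rule of thumb (ignoring the seq²-dependent attention-score term):

```python
def params_per_block(d_model: int, d_ff: int) -> int:
    attn = 4 * d_model * d_model     # W_Q, W_K, W_V, W_O (biases/LayerNorm omitted)
    ffn = 2 * d_model * d_ff         # up-projection + down-projection
    return attn + ffn                # ≈ 12·d_model² when d_ff = 4·d_model

d_model, d_ff, n_layers, vocab = 768, 3072, 12, 50257   # GPT-2-small-like config
total = n_layers * params_per_block(d_model, d_ff) + vocab * d_model  # + embeddings
print(f"≈{total / 1e6:.0f}M parameters")     # ≈124M
flops_per_token = 2 * total                  # forward pass, rule of thumb
```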
↓ Inject position information
Zone C: Positional Encoding & Masking (Steps 6–7)
6. Sinusoidal & RoPE: position-encoding math (code sketch below)
7. Causal Mask: autoregressive decoding (code sketch below)
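Step 6: the sinusoidal scheme from the original Transformer assigns PE(pos, 2i) = sin(pos/10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos/10000^(2i/d_model)); RoPE instead rotates each (even, odd) feature pair of Q and K by a position-dependent angle. A sketch of both; the interleaved pairing in rope() is one convention, rotate-half implementations pair dimensions differently:

```python
import math
import torch

def sinusoidal_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(seq_len).unsqueeze(1)               # (seq, 1)
    freq = torch.exp(-math.log(10000.0) * torch.arange(0, d_model, 2) / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)    # PE(pos, 2i)   = sin(pos · ω_i)
    pe[:, 1::2] = torch.cos(pos * freq)    # PE(pos, 2i+1) = cos(pos · ω_i)
    return pe                              # added to token embeddings

def rope(x: torch.Tensor) -> torch.Tensor: # x: (seq, d), applied to Q and K
    seq_len, d = x.shape
    angle = torch.arange(seq_len).unsqueeze(1) * \
            torch.exp(-math.log(10000.0) * torch.arange(0, d, 2) / d)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)              # rotate each (x1, x2) pair by `angle`
    out[:, 0::2] = x1 * torch.cos(angle) - x2 * torch.sin(angle)
    out[:, 1::2] = x1 * torch.sin(angle) + x2 * torch.cos(angle)
    return out
```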
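Step 7: causal masking keeps decoding autoregressive by setting logits for future positions (j > i) to -inf before the softmax, so they receive zero weight. A single-head sketch; batched code applies the same boolean mask to the last two score dimensions:

```python
import math
import torch

def causal_attention(Q, K, V):             # each (seq_len, d_k)
    T = Q.shape[0]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.shape[-1])
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))   # hide positions j > i
    return torch.softmax(scores, dim=-1) @ V             # rows still sum to 1
```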
↓ Efficiency optimizations
Zone D: Efficiency — FlashAttention, KV Cache, GQA (Steps 8–9)
8. FlashAttention: tiling & kernel fusion (code sketch below)
9. KV Cache & GQA: inference optimization (code sketch below)
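Step 8: FlashAttention's core trick is tiling plus an online softmax, so the full T×T score matrix never materializes in slow memory, combined with kernel fusion on the GPU. The recurrence below is a readable single-head PyTorch sketch of that math, not the fused CUDA kernel itself:

```python
import math
import torch

def flash_attention_sketch(Q, K, V, tile: int = 64):
    T, d = Q.shape
    scale = 1.0 / math.sqrt(d)
    out = torch.zeros_like(Q)                    # running (unnormalized) output
    row_max = torch.full((T, 1), float("-inf"))  # running max logit per query
    row_sum = torch.zeros(T, 1)                  # running softmax denominator
    for j in range(0, K.shape[0], tile):         # stream over key/value tiles
        S = Q @ K[j:j + tile].T * scale          # scores for this tile only
        new_max = torch.maximum(row_max, S.max(dim=-1, keepdim=True).values)
        alpha = torch.exp(row_max - new_max)     # rescale older accumulators
        P = torch.exp(S - new_max)
        row_sum = alpha * row_sum + P.sum(dim=-1, keepdim=True)
        out = alpha * out + P @ V[j:j + tile]
        row_max = new_max
    return out / row_sum                         # normalize once at the end
```

The result matches plain softmax attention exactly; only the evaluation order (and the memory traffic) changes.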
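Step 9: at decode time each new token needs only its own query, so the keys and values of earlier tokens are cached instead of recomputed; GQA (grouped-query attention) then shrinks that cache by letting a group of query heads share one K/V head. Both ideas sketched with illustrative names:

```python
import math
import torch

def decode_step(q_new, k_new, v_new, cache):     # q/k/v_new: (1, d_head)
    cache["k"] = torch.cat([cache["k"], k_new])  # cache grows one row per token
    cache["v"] = torch.cat([cache["v"], v_new])
    scores = q_new @ cache["k"].T / math.sqrt(q_new.shape[-1])   # (1, t)
    return torch.softmax(scores, dim=-1) @ cache["v"]            # (1, d_head)

def expand_kv_for_gqa(kv, n_q_heads):            # kv: (n_kv_heads, seq, d_head)
    group = n_q_heads // kv.shape[0]             # query heads per shared K/V head
    return kv.repeat_interleave(group, dim=0)    # (n_q_heads, seq, d_head)
```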
↓ Architecture variants
Zone E: Architecture Variants & Modern Innovations (Step 10)
10. BERT / GPT / T5: encoder vs decoder (code sketch below)
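Step 10: much of the encoder/decoder split is visible in the attention mask alone. BERT-style encoders attend bidirectionally, GPT-style decoders causally, and T5 pairs a bidirectional encoder with a causal decoder plus cross-attention. A schematic comparison (masks only, not full models):

```python
import torch

T = 6  # toy sequence length
encoder_mask = torch.ones(T, T, dtype=torch.bool)              # BERT: all positions visible
decoder_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))  # GPT: position i sees j <= i
# T5: encoder_mask on the source, decoder_mask on the target, plus a cross-attention
# mask letting every decoder position attend to all encoder outputs.
cross_attention_mask = torch.ones(T, T, dtype=torch.bool)
```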