Ch 2 — Transformer Architecture — Under the Hood
Q/K/V projections, RoPE, GQA, KV cache, SwiGLU, and model inspection with code
A. Tokenizer & Embedding Internals: BPE, chat templates, and the embedding lookup

Pipeline: Raw Text (Unicode string) --[BPE]--> Token IDs (integer sequence) --[lookup]--> Embeddings (4096-dim vectors) --[+ RoPE]--> Positioned embeddings (rotated by position)

Chat templates: Jinja2 strings that format messages into model-specific special token sequences.
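The lookup and RoPE steps above can be sketched in a few lines of numpy. This is a toy illustration, not a real model: the vocabulary size, dimensions, and random embedding table are placeholders (a production model uses ~128k tokens and 4096 dims), and the rotation-pair convention shown is one common RoPE layout.

```python
import numpy as np

# Toy vocab and d_model = 8 stand in for the real ~128k x 4096 table.
rng = np.random.default_rng(0)
vocab_size, d_model = 16, 8
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([3, 7, 1])        # integer sequence from the tokenizer
x = embedding_table[token_ids]         # embedding lookup: (seq, d_model)

def rope(x, base=10000.0):
    """Rotate each (even, odd) dim pair by a position-dependent angle."""
    seq, d = x.shape
    pos = np.arange(seq)[:, None]                 # (seq, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)     # (d/2,) per-pair frequencies
    angles = pos * freqs                          # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

positioned = rope(x)   # same shape; a pure rotation, so vector norms are preserved
```

Because RoPE is a rotation, it encodes position without changing embedding magnitudes, and relative position falls out of the dot product between rotated queries and keys.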
B. Multi-Head Attention with GQA: Q/K/V projections, scaled dot-product, causal mask

Projections: Q (32 heads x 128d), K (8 heads x 128d), V (8 heads x 128d); with GQA, each group of 4 query heads shares one K/V head.
Attention: softmax(QK^T / sqrt(d_k)) V, applied per head under a causal mask.
Concat heads, then O projection: 4096 -> 4096.
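A minimal numpy sketch of GQA scaled-dot-product attention, using toy sizes (4 query heads sharing 2 KV heads, head_dim 8) in place of the chapter's 32 Q / 8 KV heads x 128d; the repeat-then-attend structure is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, n_q, n_kv, hd = 5, 4, 2, 8
q = rng.normal(size=(n_q, seq, hd))
k = rng.normal(size=(n_kv, seq, hd))
v = rng.normal(size=(n_kv, seq, hd))

group = n_q // n_kv               # query heads per shared KV head
k = np.repeat(k, group, axis=0)   # broadcast KV heads to match the Q heads
v = np.repeat(v, group, axis=0)

scores = q @ k.transpose(0, 2, 1) / np.sqrt(hd)          # (n_q, seq, seq)
causal = np.triu(np.ones((seq, seq), dtype=bool), k=1)   # True above the diagonal
scores = np.where(causal, -np.inf, scores)               # mask future positions

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v    # (n_q, seq, hd); concat heads + O projection would follow
```

Repeating K/V along the head axis makes the GQA memory saving explicit: the model stores 8 KV heads but serves 32 query heads from them.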
KV Cache: store computed K and V tensors to avoid recomputation during autoregressive generation.
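The cache idea reduces each decode step to one new K/V row appended to a growing buffer, rather than recomputing K and V for the whole prefix. A single-head toy sketch (the `decode_step` helper and dimensions are illustrative, not a library API):

```python
import numpy as np

rng = np.random.default_rng(0)
hd = 8
k_cache = np.empty((0, hd))   # grows by one row per generated token
v_cache = np.empty((0, hd))

def decode_step(q_new, k_new, v_new):
    """Attend one new query against all cached keys/values."""
    global k_cache, v_cache
    k_cache = np.vstack([k_cache, k_new])     # append instead of recompute
    v_cache = np.vstack([v_cache, v_new])
    scores = (k_cache @ q_new) / np.sqrt(hd)  # (t,) for t cached positions
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v_cache                        # (hd,) attention output

for _ in range(4):                            # 4 decode steps -> cache of 4 entries
    out = decode_step(*rng.normal(size=(3, hd)))
```

This turns per-step attention cost from quadratic in the full sequence into linear in the cached length, at the price of holding K/V for every layer in memory.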
C. FFN with SwiGLU & RMSNorm: the knowledge storage layer and normalization

RMSNorm (pre-normalization) -> Gate + Up projections (4096 -> 14336 each, SwiGLU) -> Down projection (14336 -> 4096)
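The RMSNorm-then-SwiGLU path can be written directly from the shapes above; here with toy sizes (8 -> 28 -> 8 instead of 4096 -> 14336 -> 4096) and random weights for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff = 8, 28
w_gate = rng.normal(size=(d, d_ff)) * 0.1
w_up   = rng.normal(size=(d, d_ff)) * 0.1
w_down = rng.normal(size=(d_ff, d)) * 0.1
gamma  = np.ones(d)   # learned RMSNorm scale vector

def rmsnorm(x, eps=1e-6):
    # Normalize by root-mean-square; unlike LayerNorm, no mean subtraction.
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps) * gamma

def silu(x):
    return x / (1.0 + np.exp(-x))

def ffn(x):
    h = rmsnorm(x)   # pre-normalization before the sublayer
    # SwiGLU: SiLU-activated gate elementwise-multiplies the up projection.
    return (silu(h @ w_gate) * (h @ w_up)) @ w_down

y = ffn(rng.normal(size=(3, d)))   # (3, 8)
```

Note the three weight matrices (gate, up, down): this is why SwiGLU FFNs cost 3 x d x d_ff parameters rather than the 2 x d x d_ff of a plain two-layer MLP.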
Residual connections (with pre-norm): output = input + attention(rmsnorm(input)) and output = input + ffn(rmsnorm(input)).
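Wiring both sublayers with their skip connections gives the full block. The sketch below uses single matrix multiplies as stand-ins for the real attention and FFN sublayers; only the residual structure is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
w_attn = rng.normal(size=(d, d)) * 0.1   # stand-in for the attention sublayer
w_ffn  = rng.normal(size=(d, d)) * 0.1   # stand-in for the SwiGLU FFN sublayer

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def block(x):
    x = x + rmsnorm(x) @ w_attn   # residual around the attention sublayer
    x = x + rmsnorm(x) @ w_ffn    # residual around the FFN sublayer
    return x

x = rng.normal(size=(3, d))
y = block(x)   # same shape as x; the identity path carries gradients directly
```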
D. Model Inspection & Parameter Counting: how to examine a model's architecture in code

Load the model (from_pretrained), print its layers (named modules), then count parameters per layer and in total.
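The same count you would get from iterating a loaded model's parameters (e.g. `sum(p.numel() for p in model.parameters())` after `from_pretrained`) can be done by hand from this chapter's shapes. The layer count (32) and vocabulary size (128256) below are assumptions, chosen to match a Llama-3-8B-like model; the text only fixes the per-layer shapes.

```python
# Back-of-envelope parameter count from the chapter's dimensions.
d, d_ff = 4096, 14336
n_q_heads, n_kv_heads, head_dim = 32, 8, 128
n_layers, vocab = 32, 128256   # assumed, not stated in the text

attn = d * (n_q_heads * head_dim)        # Q: 4096 x 4096
attn += 2 * d * (n_kv_heads * head_dim)  # K and V: 4096 x 1024 each (GQA)
attn += (n_q_heads * head_dim) * d       # O: 4096 x 4096

ffn = 3 * d * d_ff        # gate + up (4096 x 14336 each) + down (14336 x 4096)
norms = 2 * d             # two RMSNorm scale vectors per layer

per_layer = attn + ffn + norms
total = n_layers * per_layer + d + 2 * vocab * d  # + final norm, embed, lm_head
print(f"per layer: {per_layer:,}  total: {total:,}")
```

Under these assumptions the total lands at roughly 8.0B parameters, with the FFN's three matrices accounting for the large majority of each layer.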
E. Precision, Quantization & Memory: loading models in different precisions for fine-tuning

bf16 load: 2 bytes/param, or 4-bit NF4: 0.5 bytes/param, configured via a BitsAndBytes quantization config.
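The bytes-per-parameter figures translate directly into weight memory. The parameter count below assumes the ~8B total implied by this chapter's shapes (with an assumed 32 layers and 128k vocab); note this covers weights only, with activations, optimizer state, and the KV cache on top.

```python
n_params = 8_030_261_248   # assumed ~8B-parameter model

bf16_gb = n_params * 2 / 1024**3     # bf16: 2 bytes per parameter
nf4_gb  = n_params * 0.5 / 1024**3   # 4-bit NF4: 0.5 bytes per parameter
print(f"bf16: {bf16_gb:.1f} GiB, 4-bit NF4: {nf4_gb:.1f} GiB")

# Typical BitsAndBytes usage looks like the following (hedged: exact
# arguments depend on your transformers/bitsandbytes versions):
# from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
# model = AutoModelForCausalLM.from_pretrained(name, quantization_config=bnb)
```

Roughly 15 GiB in bf16 versus under 4 GiB in NF4 is the difference between needing a large accelerator and fitting on a consumer GPU for fine-tuning.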