Ch 2 — Transformer Architecture — Under the Hood
Q/K/V projections, RoPE, GQA, KV cache, SwiGLU, and model inspection with code
A. Tokenizer & Embedding Internals: BPE, chat templates, and the embedding lookup

Pipeline: Raw Text (Unicode string) --[BPE]--> Token IDs (integer sequence) --[lookup]--> Embeddings (4096-dim vectors) --[+ RoPE]--> Positioned embeddings (rotated by position)

Chat templates: Jinja2 strings that format messages into model-specific special token sequences.
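The lookup and RoPE steps above can be sketched in a few lines of numpy. This is a toy illustration, not a real model: the vocabulary size, dimensions, and random embedding table are placeholders (a production model uses ~128k tokens and 4096 dims), and the rotation-pair convention shown is one common RoPE layout.

```python
import numpy as np

# Toy vocab and d_model = 8 stand in for the real ~128k x 4096 table.
rng = np.random.default_rng(0)
vocab_size, d_model = 16, 8
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([3, 7, 1])        # integer sequence from the tokenizer
x = embedding_table[token_ids]         # embedding lookup: (seq, d_model)

def rope(x, base=10000.0):
    """Rotate each (even, odd) dim pair by a position-dependent angle."""
    seq, d = x.shape
    pos = np.arange(seq)[:, None]                 # (seq, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)     # (d/2,) per-pair frequencies
    angles = pos * freqs                          # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

positioned = rope(x)   # same shape; a pure rotation, so vector norms are preserved
```

Because RoPE is a rotation, it encodes position without changing embedding magnitudes, and relative position falls out of the dot product between rotated queries and keys.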
B. Multi-Head Attention with GQA: Q/K/V projections, scaled dot-product, causal mask

Projections: Q (32 heads x 128d), K (8 heads x 128d), V (8 heads x 128d); with GQA, each group of 4 query heads shares one K/V head.
Attention: softmax(QK^T / sqrt(d_k)) V, applied per head under a causal mask.
Concat heads, then O projection: 4096 -> 4096.
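A minimal numpy sketch of GQA scaled-dot-product attention, using toy sizes (4 query heads sharing 2 KV heads, head_dim 8) in place of the chapter's 32 Q / 8 KV heads x 128d; the repeat-then-attend structure is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, n_q, n_kv, hd = 5, 4, 2, 8
q = rng.normal(size=(n_q, seq, hd))
k = rng.normal(size=(n_kv, seq, hd))
v = rng.normal(size=(n_kv, seq, hd))

group = n_q // n_kv               # query heads per shared KV head
k = np.repeat(k, group, axis=0)   # broadcast KV heads to match the Q heads
v = np.repeat(v, group, axis=0)

scores = q @ k.transpose(0, 2, 1) / np.sqrt(hd)          # (n_q, seq, seq)
causal = np.triu(np.ones((seq, seq), dtype=bool), k=1)   # True above the diagonal
scores = np.where(causal, -np.inf, scores)               # mask future positions

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v    # (n_q, seq, hd); concat heads + O projection would follow
```

Repeating K/V along the head axis makes the GQA memory saving explicit: the model stores 8 KV heads but serves 32 query heads from them.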
KV Cache: store computed K and V tensors to avoid recomputation during autoregressive generation.
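The cache idea reduces each decode step to one new K/V row appended to a growing buffer, rather than recomputing K and V for the whole prefix. A single-head toy sketch (the `decode_step` helper and dimensions are illustrative, not a library API):

```python
import numpy as np

rng = np.random.default_rng(0)
hd = 8
k_cache = np.empty((0, hd))   # grows by one row per generated token
v_cache = np.empty((0, hd))

def decode_step(q_new, k_new, v_new):
    """Attend one new query against all cached keys/values."""
    global k_cache, v_cache
    k_cache = np.vstack([k_cache, k_new])     # append instead of recompute
    v_cache = np.vstack([v_cache, v_new])
    scores = (k_cache @ q_new) / np.sqrt(hd)  # (t,) for t cached positions
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v_cache                        # (hd,) attention output

for _ in range(4):                            # 4 decode steps -> cache of 4 entries
    out = decode_step(*rng.normal(size=(3, hd)))
```

This turns per-step attention cost from quadratic in the full sequence into linear in the cached length, at the price of holding K/V for every layer in memory.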
C. FFN with SwiGLU & RMSNorm: the knowledge storage layer and normalization

RMSNorm (pre-normalization) -> Gate + Up projections (4096 -> 14336 each, SwiGLU) -> Down projection (14336 -> 4096)
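The RMSNorm-then-SwiGLU path can be written directly from the shapes above; here with toy sizes (8 -> 28 -> 8 instead of 4096 -> 14336 -> 4096) and random weights for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff = 8, 28
w_gate = rng.normal(size=(d, d_ff)) * 0.1
w_up   = rng.normal(size=(d, d_ff)) * 0.1
w_down = rng.normal(size=(d_ff, d)) * 0.1
gamma  = np.ones(d)   # learned RMSNorm scale vector

def rmsnorm(x, eps=1e-6):
    # Normalize by root-mean-square; unlike LayerNorm, no mean subtraction.
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps) * gamma

def silu(x):
    return x / (1.0 + np.exp(-x))

def ffn(x):
    h = rmsnorm(x)   # pre-normalization before the sublayer
    # SwiGLU: SiLU-activated gate elementwise-multiplies the up projection.
    return (silu(h @ w_gate) * (h @ w_up)) @ w_down

y = ffn(rng.normal(size=(3, d)))   # (3, 8)
```

Note the three weight matrices (gate, up, down): this is why SwiGLU FFNs cost 3 x d x d_ff parameters rather than the 2 x d x d_ff of a plain two-layer MLP.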
Residual connections (with pre-norm): output = input + attention(rmsnorm(input)) and output = input + ffn(rmsnorm(input)).
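Wiring both sublayers with their skip connections gives the full block. The sketch below uses single matrix multiplies as stand-ins for the real attention and FFN sublayers; only the residual structure is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
w_attn = rng.normal(size=(d, d)) * 0.1   # stand-in for the attention sublayer
w_ffn  = rng.normal(size=(d, d)) * 0.1   # stand-in for the SwiGLU FFN sublayer

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def block(x):
    x = x + rmsnorm(x) @ w_attn   # residual around the attention sublayer
    x = x + rmsnorm(x) @ w_ffn    # residual around the FFN sublayer
    return x

x = rng.normal(size=(3, d))
y = block(x)   # same shape as x; the identity path carries gradients directly
```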
D. Model Inspection & Parameter Counting: how to examine a model's architecture in code

Load the model (from_pretrained), print its layers (named modules), then count parameters per layer and in total.
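The same count you would get from iterating a loaded model's parameters (e.g. `sum(p.numel() for p in model.parameters())` after `from_pretrained`) can be done by hand from this chapter's shapes. The layer count (32) and vocabulary size (128256) below are assumptions, chosen to match a Llama-3-8B-like model; the text only fixes the per-layer shapes.

```python
# Back-of-envelope parameter count from the chapter's dimensions.
d, d_ff = 4096, 14336
n_q_heads, n_kv_heads, head_dim = 32, 8, 128
n_layers, vocab = 32, 128256   # assumed, not stated in the text

attn = d * (n_q_heads * head_dim)        # Q: 4096 x 4096
attn += 2 * d * (n_kv_heads * head_dim)  # K and V: 4096 x 1024 each (GQA)
attn += (n_q_heads * head_dim) * d       # O: 4096 x 4096

ffn = 3 * d * d_ff        # gate + up (4096 x 14336 each) + down (14336 x 4096)
norms = 2 * d             # two RMSNorm scale vectors per layer

per_layer = attn + ffn + norms
total = n_layers * per_layer + d + 2 * vocab * d  # + final norm, embed, lm_head
print(f"per layer: {per_layer:,}  total: {total:,}")
```

Under these assumptions the total lands at roughly 8.0B parameters, with the FFN's three matrices accounting for the large majority of each layer.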
E. Precision, Quantization & Memory: loading models in different precisions for fine-tuning

bf16 load: 2 bytes/param, or 4-bit NF4: 0.5 bytes/param, configured via a BitsAndBytes quantization config.
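The bytes-per-parameter figures translate directly into weight memory. The parameter count below assumes the ~8B total implied by this chapter's shapes (with an assumed 32 layers and 128k vocab); note this covers weights only, with activations, optimizer state, and the KV cache on top.

```python
n_params = 8_030_261_248   # assumed ~8B-parameter model

bf16_gb = n_params * 2 / 1024**3     # bf16: 2 bytes per parameter
nf4_gb  = n_params * 0.5 / 1024**3   # 4-bit NF4: 0.5 bytes per parameter
print(f"bf16: {bf16_gb:.1f} GiB, 4-bit NF4: {nf4_gb:.1f} GiB")

# Typical BitsAndBytes usage looks like the following (hedged: exact
# arguments depend on your transformers/bitsandbytes versions):
# from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
# model = AutoModelForCausalLM.from_pretrained(name, quantization_config=bnb)
```

Roughly 15 GiB in bf16 versus under 4 GiB in NF4 is the difference between needing a large accelerator and fitting on a consumer GPU for fine-tuning.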