What Each Layer Does
Research (Tenney et al., 2019; Elhage et al., 2022) reveals a progression:
- Early layers (1-10): syntax, part-of-speech, simple surface patterns.
- Middle layers (10-60): semantics, entity relationships, factual knowledge.
- Late layers (60+): task-specific reasoning, output formatting, next-token prediction.
The representation grows progressively more abstract and task-relevant.
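This layer-wise progression is typically studied by probing the hidden state after each layer. A minimal sketch of the capture step, using PyTorch forward hooks on a hypothetical toy stack of layers (the same technique applies to any ModuleList of transformer blocks):

```python
import torch
import torch.nn as nn

# Toy stand-in for a stack of transformer blocks; hooks work the same
# way on real blocks.
layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])

captured = []  # (layer index, hidden state after that layer)

def make_hook(idx):
    def hook(module, inputs, output):
        captured.append((idx, output.detach()))
    return hook

for i, layer in enumerate(layers):
    layer.register_forward_hook(make_hook(i))

x = torch.randn(2, 8)
for layer in layers:
    x = layer(x)

print(len(captured))  # one snapshot per layer
```

A probe (e.g. a linear classifier for part-of-speech tags) is then trained on each captured snapshot; the depth at which probe accuracy peaks indicates where that information lives.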
The complete picture: A transformer is: embeddings → N × (norm → attention → residual → norm → FFN → residual) → final norm → output projection. That’s the entire architecture. The original 2017 design was elegant; modern refinements (RMSNorm, SwiGLU, RoPE, GQA) made it practical at scale. Every LLM you use — GPT-4, Claude, Llama, Gemini — is this block, repeated.
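The per-block recipe (norm → attention → residual → norm → FFN → residual) can be sketched directly in code. This is a minimal sketch, not Llama's exact implementation: it uses PyTorch's standard multi-head attention and a plain GELU FFN in place of GQA and SwiGLU, and omits the causal mask and RoPE for brevity.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescale by 1/rms(x), with a learned gain."""
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class TransformerBlock(nn.Module):
    """Pre-norm block: x = x + Attn(norm(x)); x = x + FFN(norm(x))."""
    def __init__(self, d, n_heads, d_ff):
        super().__init__()
        self.norm1 = RMSNorm(d)
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.norm2 = RMSNorm(d)
        self.ffn = nn.Sequential(
            nn.Linear(d, d_ff), nn.GELU(), nn.Linear(d_ff, d)
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual 1
        x = x + self.ffn(self.norm2(x))                    # residual 2
        return x
```

Note the pre-norm placement: normalization happens inside each residual branch, so the residual stream itself is never normalized until the final norm. This is what lets gradients flow cleanly through deep stacks.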
The Full Model
import torch.nn as nn

# Assumes TransformerBlock and RMSNorm are defined as above.
class LLM(nn.Module):
    def __init__(self, vocab, d, n_heads, d_ff, n_layers):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)       # token IDs -> d-dim vectors
        self.blocks = nn.ModuleList([
            TransformerBlock(d, n_heads, d_ff)
            for _ in range(n_layers)
        ])
        self.norm = RMSNorm(d)                    # final norm
        self.head = nn.Linear(d, vocab)           # project back to vocab logits

    def forward(self, token_ids):
        x = self.embed(token_ids)
        for block in self.blocks:
            x = block(x)
        x = self.norm(x)
        logits = self.head(x)                     # (batch, seq, vocab)
        return logits

# Llama 3 8B configuration:
model = LLM(
    vocab=128256, d=4096,
    n_heads=32, d_ff=14336,
    n_layers=32
)
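You can sanity-check that this configuration really is "8B" with back-of-the-envelope arithmetic. The count below uses two published Llama 3 details not shown in the sketch above: grouped-query attention with 8 KV heads, and an output projection that is not weight-tied to the input embedding.

```python
vocab, d, d_ff, n_layers = 128256, 4096, 14336, 32
n_heads, n_kv_heads = 32, 8              # GQA: 8 KV heads in Llama 3 8B
head_dim = d // n_heads                  # 128
kv_dim = n_kv_heads * head_dim           # 1024

embed = vocab * d                        # input embedding table
head = vocab * d                         # output projection (untied)
attn = d * d + 2 * d * kv_dim + d * d    # Wq, Wk, Wv, Wo
ffn = 3 * d * d_ff                       # SwiGLU: gate, up, down projections
norms = 2 * d                            # two RMSNorm gains per block
per_layer = attn + ffn + norms

total = embed + head + n_layers * per_layer + d  # + final norm
print(f"{total / 1e9:.2f}B")             # ≈ 8.03B
```

The FFN dominates: roughly 176M of each layer's ~218M parameters sit in the three SwiGLU projections, which is why d_ff (14336 ≈ 3.5 × d) matters so much for model size.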