How LLMs Work
From text to tokens to transformers — the complete story of large language models
Co-Created by Kiran Shirol and Claude
Topics
Tokenization
Transformers
Training
Alignment
Inference
14 chapters · 5 parts
Part 1: From Text to Numbers
Tokenization, embeddings, and the attention mechanism.

1. Text to Tokens
BPE, WordPiece, SentencePiece — how raw text becomes numbers.

2. Embeddings: Meaning as Math
Word2Vec, contextual embeddings, and why king − man + woman = queen.

3. Attention: The Core Innovation
Q/K/V, multi-head attention, masks — how every token decides who to listen to.
Part 2: The Transformer Architecture
Blocks, scaling, and the training recipe for frontier models.

4. The Transformer Block
LayerNorm, FFN, residual connections — the repeating unit of every LLM.

5. Scaling Up: From Transformer to LLM
Parameter counts, MoE, context windows, and scaling laws.

6. The Training Recipe
Next-token prediction, data mixtures, infrastructure, and training costs.
Part 3: Making LLMs Useful
Fine-tuning, alignment, and text generation mechanics.

7. Fine-Tuning & Instruction Following
SFT, instruction tuning, LoRA, QLoRA — from text predictor to assistant.

8. RLHF & Alignment
Reward models, PPO, DPO — the step that turns a predictor into ChatGPT.

9. How LLMs Generate Text
Temperature, top-k, top-p, beam search, and one-token-at-a-time decoding.
Part 4: Under the Hood in Practice
Context windows, inference optimization, and multimodal capabilities.

10. Context Windows & Memory
RoPE, ALiBi, KV cache, RAG as external memory, and infinite context.

11. Making LLMs Fast
Quantization, speculative decoding, Flash Attention, and continuous batching.

12. Multimodal LLMs
Vision encoders, CLIP, image tokens, audio, video, and tool use.
Part 5: The Bigger Picture
Emergent abilities, limitations, and the LLM landscape.

13. Emergent Abilities & Limitations
In-context learning, CoT reasoning, hallucinations, and what LLMs can’t do.

14. The LLM Landscape (Capstone)
GPT, Claude, Llama, Gemini, Mistral — open vs closed and where it’s heading.