Ch 1 — Why CPUs Aren’t Enough

Sequential vs parallel — why AI needs a fundamentally different processor
The CPU: Brilliant at One Thing at a Time
4–128 powerful cores, each a Swiss Army knife of computing
The Highway Analogy
Think of a CPU as a 4-lane highway. Each lane is wide, fast, and can handle any vehicle — sports cars, trucks, buses. The lanes have traffic lights that adapt in real time, ramps that predict where cars are going, and emergency shoulders for unexpected situations.

A modern CPU like the Intel Core i9-14900K has 24 cores running at up to 6.0 GHz. Each core has branch prediction, out-of-order execution, speculative execution, and deep cache hierarchies. These features make each core incredibly smart — it can handle complex, unpredictable tasks with minimal wasted cycles.

This design is perfect for tasks like running your operating system, compiling code, or serving web requests — workloads where each step depends on the result of the previous one.
CPU Architecture at a Glance
Intel Core i9-14900K (2023)
- Cores: 24 (8P + 16E)
- Clock: up to 6.0 GHz
- L3 Cache: 36 MB
- Memory BW: ~90 GB/s (DDR5-5600)
- FP32 TFLOPS: ~1.5

AMD EPYC 9754 (Server, 2023)
- Cores: 128
- Clock: up to 3.1 GHz
- L3 Cache: 256 MB
- Memory BW: ~460 GB/s (12-ch DDR5)
- FP32 TFLOPS: ~5

Even the biggest server CPU: ~5 TFLOPS. An H100 GPU: ~67 TFLOPS (FP32). That's 13x more raw compute.
Key insight: A CPU is like a team of 24 expert surgeons — each one can handle any operation brilliantly. But when you need to stitch 10 million wounds at once, you don’t need surgeons. You need 10,000 medics who each know one simple stitch.
AI Is Just Matrix Multiplication
Every neural network forward pass boils down to multiply-and-add
What Happens Inside a Neural Network
When you send a prompt to an LLM, here’s what actually happens at the hardware level:

1. Token embedding: Your words become vectors (arrays of numbers). Each token is a vector of 4,096–12,288 floating-point numbers.

2. Attention layers: Each layer multiplies your token vectors by three huge weight matrices (Q, K, V) — each one is [hidden_dim × hidden_dim]. For a 70B model, that’s [8192 × 8192] matrices.

3. Feed-forward layers: Another pair of matrix multiplications, typically [hidden_dim × 4×hidden_dim].

4. Repeat: A 70B model has ~80 layers. Each layer does 4–6 matrix multiplications. That’s 320–480 matrix multiplications per token.
The Math Behind One Token
Llama 3 70B — per token:
- Hidden dim: 8,192
- Layers: 80
- Attention heads: 64

Per-layer operations:
- Q projection: 8,192 × 8,192 ≈ 67M multiply-adds
- K projection: ≈ 67M multiply-adds
- V projection: ≈ 67M multiply-adds
- Output projection: ≈ 67M multiply-adds
- FFN up: 8,192 × 28,672 ≈ 235M
- FFN down: ≈ 235M
- Per-layer total: ~738M multiply-adds

All 80 layers: ~59 billion operations. Per. Single. Token.
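The per-token count can be reproduced with a few lines of arithmetic. This is a sketch that follows the simplified accounting above (full-size K/V projections, ignoring grouped-query attention and the attention-score math itself):

```python
# Back-of-envelope multiply-add count per token for a Llama-3-70B-like
# model, using the simplified accounting from the text (full-size K/V
# projections; attention-score math and normalizations ignored).

HIDDEN = 8192    # hidden dimension
FFN = 28672      # feed-forward inner dimension
LAYERS = 80      # transformer layers

def per_layer_madds(hidden: int, ffn: int) -> int:
    """Multiply-adds for one layer's weight matmuls, per token."""
    projections = 4 * hidden * hidden   # Q, K, V, and output projections
    ffn_mms = 2 * hidden * ffn          # FFN up- and down-projections
    return projections + ffn_mms

per_layer = per_layer_madds(HIDDEN, FFN)
total = per_layer * LAYERS

print(f"per layer: ~{per_layer / 1e6:.0f}M multiply-adds")              # ~738M
print(f"{LAYERS} layers: ~{total / 1e9:.0f}B multiply-adds per token")  # ~59B
```

Every one of those operations is the same trivial pattern: multiply two numbers, add to a running sum.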
Key insight: AI doesn’t need the CPU’s ability to handle complex branching logic. It needs to do the same simple operation — multiply two numbers and add — billions of times. This is the fundamental mismatch between CPU architecture and AI workloads.
The GPU: Thousands of Simple Workers
Trade individual smarts for massive parallelism
The Stadium Analogy
If a CPU is a 4-lane highway, a GPU is a 10,000-lane road. Each lane is narrow — it can only handle bicycles, not trucks. But when your job is to move 10,000 identical packages from point A to point B, 10,000 bicycles beat 4 trucks every time.

An NVIDIA H100 GPU has 16,896 CUDA cores and 528 Tensor Cores. Each CUDA core is simple — no branch prediction, no speculative execution, no out-of-order logic. It does one thing: take two numbers, multiply them, add to an accumulator.

The magic is that all 16,896 cores do this simultaneously. In a single clock cycle, the H100 performs thousands of multiply-add operations in parallel. This is the SIMT (Single Instruction, Multiple Threads) execution model.
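The per-lane job really is that small. As a CPU-side stand-in for the SIMT pattern, NumPy's vectorized operations apply the same multiply-add to every element of an array at once (illustrative only; this runs on a CPU, not a GPU):

```python
# Each simple core's job: acc += a * b. SIMT runs thousands of such
# lanes in lockstep on different data. NumPy's vectorized ops mimic
# the "same instruction, many elements" pattern on the CPU.
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(16_896).astype(np.float32)  # one element per "lane"
b = rng.standard_normal(16_896).astype(np.float32)
acc = np.zeros_like(a)

acc += a * b   # one "cycle": every lane does its own multiply-add

# The scalar operation a single core performs:
def fma(acc_i: float, a_i: float, b_i: float) -> float:
    return acc_i + a_i * b_i

assert np.isclose(acc[7], fma(0.0, a[7], b[7]))
```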
GPU vs CPU Architecture
NVIDIA H100 SXM (2023)
- CUDA Cores: 16,896
- Tensor Cores: 528
- Clock: 1.98 GHz (boost)
- Memory: 80 GB HBM3
- Memory BW: 3,350 GB/s
- FP32 TFLOPS: 67
- FP16 TFLOPS: 990 (Tensor Cores, with sparsity)

Compare to the best server CPU:
- EPYC 9754: ~5 TFLOPS FP32
- H100: 67 TFLOPS FP32

That's 13x on FP32. On FP16 with Tensor Cores: ~200x.
Key insight: A GPU trades the CPU’s ability to handle complex, branching logic for raw throughput on simple, repetitive operations. Since AI is almost entirely simple, repetitive operations (multiply-add), this trade-off is enormously favorable.
The Throughput Gap: 593x Faster
Real benchmarks on matrix multiplication — the core operation of AI
Matrix Multiplication Benchmark
Researchers benchmarked a 4096×4096 matrix multiplication — the kind of operation that happens hundreds of times per token in an LLM:

Sequential CPU (single core): The baseline. One core multiplying row by column, element by element. Painfully slow.

Parallel CPU (all cores): Using OpenMP to spread the work across all CPU cores gives a 12–14x speedup. Significant, but still limited by core count.

GPU (CUDA): The same matrix multiply on a GPU achieves a 593x speedup over sequential CPU and 45x over optimized parallel CPU.

For larger matrices (8192×8192, 16384×16384), the GPU advantage grows even further because GPUs scale better with problem size.
Benchmark Numbers
4096 × 4096 matrix multiply:
- Sequential CPU (1 core): ~120 seconds (1x baseline)
- Parallel CPU (all cores): ~9 seconds (~13x)
- GPU (CUDA optimized): ~0.2 seconds (~593x)

Why the gap grows with size:
- 2048 × 2048: GPU ~100x faster
- 4096 × 4096: GPU ~593x faster
- 8192 × 8192: GPU ~1,000x+ faster

GPUs get relatively faster as matrices get bigger — exactly the trend in modern AI models.
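You can get a feel for the "naive vs optimized" side of this gap without a GPU. The sketch below shrinks the experiment to a CPU-only comparison of a triple-loop matmul against an optimized BLAS call through NumPy; sizes are tiny so it runs in seconds, and the measured speedup is machine-dependent and far below the GPU figures:

```python
# Naive triple-loop matrix multiply vs optimized BLAS (numpy's @).
# Shrunk to 64x64 so pure-Python loops finish quickly; the measured
# speedup is machine-dependent and much smaller than the GPU numbers.
import time
import numpy as np

N = 64
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)

def naive_matmul(A, B):
    n = A.shape[0]
    C = np.zeros((n, n), dtype=np.float32)
    for i in range(n):           # row of A
        for j in range(n):       # column of B
            s = 0.0
            for k in range(n):   # dot product, element by element
                s += A[i, k] * B[k, j]
            C[i, j] = s
    return C

t0 = time.perf_counter(); C_naive = naive_matmul(A, B); t_naive = time.perf_counter() - t0
t0 = time.perf_counter(); C_blas = A @ B; t_blas = time.perf_counter() - t0

assert np.allclose(C_naive, C_blas, atol=1e-2)
print(f"naive: {t_naive:.3f}s, BLAS: {t_blas:.6f}s")
```

The GPU benchmark in the text takes the same idea one step further: once the work is expressed as one big data-parallel operation, thousands of cores can attack it at once.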
Key insight: The GPU advantage isn’t just “a bit faster.” It’s 100–1000x faster for the exact operation AI needs most. This is why training GPT-4 on CPUs would take decades instead of months. The performance gap is the entire reason the AI revolution is happening now.
The Memory Wall
Compute is fast — but feeding data to the compute is the real bottleneck
The Restaurant Kitchen Analogy
Imagine a kitchen with 100 chefs (GPU cores), but only one narrow door to the pantry (memory bus). The chefs can cook incredibly fast, but they spend most of their time waiting for ingredients.

This is the memory wall — the gap between how fast processors can compute and how fast memory can deliver data. It’s the single biggest bottleneck in AI hardware.

CPU memory bandwidth: A server CPU gets ~90–460 GB/s from DDR5 RAM. That’s the equivalent of a single-lane road to the pantry.

GPU memory bandwidth: An H100 gets 3,350 GB/s from HBM3 memory. That’s a 37-lane highway to the pantry. The newer B200 pushes this to 8,000 GB/s.
Memory Bandwidth Comparison
Memory bandwidth (GB/s):
- Consumer CPU (DDR5-5600): ~90
- Server CPU (12-ch DDR5): ~460
- NVIDIA A100 (HBM2e): 2,039
- NVIDIA H100 (HBM3): 3,350
- NVIDIA B200 (HBM3e): 8,000
- AMD MI350X (HBM3e): 8,000

The B200 has 89x the bandwidth of a consumer CPU. This is why AI models run on GPUs.
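The kitchen-door problem can be put in numbers with a rough roofline-style estimate, using the H100 figures from this section (~67 FP32 TFLOPS, ~3,350 GB/s). A big matrix-matrix multiply reuses each weight thousands of times and is compute-bound; a matrix-vector multiply, which is what one LLM decoding step looks like, reads every weight once and is memory-bound:

```python
# Roofline-style sketch: time to move the data vs time to do the math,
# using H100-ish spec numbers from the text. Illustrative only.

FLOPS = 67e12   # ~67 TFLOPS FP32
BW = 3.35e12    # ~3,350 GB/s HBM3

def times_ms(flops: float, bytes_moved: float) -> tuple[float, float]:
    """(compute time, memory time) in milliseconds."""
    return flops / FLOPS * 1e3, bytes_moved / BW * 1e3

n = 8192
# Matrix-matrix: 2n^3 FLOPs over ~3n^2 FP32 values (read A, B; write C)
mm_compute, mm_memory = times_ms(2 * n**3, 3 * n * n * 4)
# Matrix-vector: only 2n^2 FLOPs, but still ~n^2 FP32 weights to read
mv_compute, mv_memory = times_ms(2 * n * n, n * n * 4)

print(f"matmul: compute {mm_compute:.1f} ms vs memory {mm_memory:.2f} ms")
print(f"matvec: compute {mv_compute*1e3:.1f} us vs memory {mv_memory*1e3:.1f} us")
# matvec stalls on memory: moving the weights costs ~40x more than the math
```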
Key insight: Raw compute (FLOPS) gets the headlines, but memory bandwidth often determines real-world AI performance. A model that fits in GPU memory with enough bandwidth to feed the cores will run fast. A model that doesn’t will stall, no matter how many TFLOPS you have.
The Numbers: Training GPT-4 on CPUs
A thought experiment that shows why GPUs aren’t optional
The Scale of Modern Training
Let’s do some back-of-envelope math to understand why CPUs simply cannot train modern AI models:

Training Llama 3 70B required approximately 6.4 million GPU-hours on H100 GPUs. On a cluster of 16,384 H100s, that works out to roughly 16 days of wall-clock time.

Each H100 delivers ~990 TFLOPS at FP16 with Tensor Cores. A top server CPU delivers ~5 TFLOPS at FP32 (and less at FP16 since CPUs lack native FP16 acceleration).

Even being generous and assuming the CPU could match its FP32 rate at FP16, you’d need ~200x more time on CPUs. That ~16-day training run becomes roughly nine years on the same number of chips. And you’d need 16,384 of the most expensive server CPUs on the planet.
The Math
Llama 3 70B training:
- GPU setup: 16,384 × H100
- GPU time: ~6.4M GPU-hours ≈ 16 days of wall-clock
- GPU FLOPS: 990 TFLOPS (FP16)

If we used CPUs instead:
- CPU FLOPS: ~5 TFLOPS (FP32)
- Ratio: 990 / 5 = 198x slower
- CPU time: 16 days × 198 ≈ 3,200 days ≈ 9 years

Cost comparison:
- H100 cloud: ~$3–4/hr/GPU
- 6.4M GPU-hours ≈ $19–26M
- CPUs for 9 years? Effectively impossible.

And GPT-4 used several times more compute than Llama 3 70B.
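The thought experiment is easy to check as arithmetic. This sketch derives wall-clock time from the reported 6.4M GPU-hours; real training never hits peak FLOPS, but that inefficiency applies to both chips, so the ratio is the interesting number:

```python
# The CPU-vs-GPU thought experiment as arithmetic, using figures from
# the text. Peak FLOPS are never reached in practice, but the shortfall
# affects both sides, so the ratio still tells the story.

GPU_HOURS = 6.4e6     # reported H100 GPU-hours for Llama 3 70B
N_CHIPS = 16_384      # chips in the cluster
GPU_TFLOPS = 990      # H100, FP16 with Tensor Cores
CPU_TFLOPS = 5        # top server CPU, FP32 (generous for FP16)

gpu_days = GPU_HOURS / N_CHIPS / 24
slowdown = GPU_TFLOPS / CPU_TFLOPS
cpu_years = gpu_days * slowdown / 365

print(f"GPU wall-clock: ~{gpu_days:.0f} days on {N_CHIPS:,} chips")
print(f"CPU slowdown:   {slowdown:.0f}x")
print(f"CPU wall-clock: ~{cpu_years:.0f} years on {N_CHIPS:,} chips")
```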
Key insight: This isn’t about GPUs being “nice to have.” Without GPUs, modern AI simply would not exist. The transformer architecture was published in 2017, but it took GPU compute scaling to make models like GPT-3 (2020) and GPT-4 (2023) possible. Hardware enables the science.
Real-World Comparison: Training vs Inference
GPUs dominate training, but inference has more nuance
Training vs Inference
Training is like building a factory — massive upfront investment, enormous compute, done once (or a few times). GPUs are absolutely essential here. No debate.

Inference is like running the factory — serving predictions to users. Here, the picture is more nuanced:

For large models (70B+), GPUs are still essential for inference. The model weights alone need 140+ GB of memory, and you need high bandwidth to generate tokens fast.

For small models (1B–7B), CPUs can actually work. A quantized 7B model runs at 10–30 tokens/sec on a modern CPU. Not fast, but usable for batch processing. For models under 1.5B parameters, a CPU can even be around 1.3x faster than a budget GPU.
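The dividing line is mostly memory traffic: during decoding, each generated token has to stream roughly all of the model's weights through the processor once. A crude estimate (a sketch assuming batch size 1 and that weight reads dominate; the numbers are round figures from this chapter) lands close to the tokens/sec range quoted above:

```python
# Upper bound on single-stream decode speed, assuming every weight is
# read once per token and memory bandwidth is the only limit.

def decode_tokens_per_s(params_billions: float, bytes_per_param: float,
                        bandwidth_gb_s: float) -> float:
    model_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / model_gb

# 7B model, 4-bit quantized (~0.5 bytes/param), consumer CPU (~90 GB/s)
print(f"7B int4, CPU:   ~{decode_tokens_per_s(7, 0.5, 90):.0f} tok/s")
# 70B model, FP16 (2 bytes/param), H100 (3,350 GB/s)
print(f"70B fp16, H100: ~{decode_tokens_per_s(70, 2.0, 3350):.0f} tok/s")
```

The same formula shows why a 70B model is hopeless on a CPU: 140 GB of FP16 weights over a ~90 GB/s bus is well under one token per second.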
When CPUs Still Make Sense
CPU Struggles
Training any model — too slow by 100–1000x

Large model inference — not enough memory or bandwidth

Real-time serving — latency too high for interactive use

Batch inference at scale — cost per token is much higher
CPU Works Fine
Tiny models (<1.5B) — CPU can match budget GPUs

Low-volume inference — a few requests per minute

Preprocessing — tokenization, data loading, ETL

Orchestration — managing GPU jobs, serving APIs, logging
Key insight: Modern AI infrastructure uses CPUs and GPUs together. CPUs handle the “brain” work (orchestration, data prep, networking), while GPUs handle the “muscle” work (matrix multiplication). Neither replaces the other — they’re complementary.
Why This Matters for Everything Ahead
The GPU compute gap shapes every decision in AI infrastructure
The Cascade Effect
The CPU-GPU performance gap doesn’t just affect training speed. It cascades through every infrastructure decision:

GPU scarcity → GPU pricing: H100s cost $25,000–$40,000 each. Demand far exceeds supply. This drives the entire cloud GPU market.

GPU power → cooling crisis: An H100 draws 700W. A B200 draws 1,000W. A single 8-GPU server needs up to 8,000W for the accelerators alone. This is why data centers are moving to liquid cooling.

GPU memory limits → distributed training: A 70B model doesn’t fit on one GPU. You need 4–8 GPUs connected by fast interconnects. This drives NVLink, InfiniBand, and cluster networking.

GPU efficiency → cost optimization: At $3–4/hr per GPU, keeping GPUs idle is burning money. This drives scheduling, orchestration, and utilization optimization.
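That last point is simple arithmetic: at cloud rental rates, every percentage point of idle time on a large cluster carries a visible price tag. A sketch using the $3–4/hr figure from the text, with a hypothetical cluster size and utilization:

```python
# Cost of idle GPU time at cloud rental rates (~$3-4/hr per H100, from
# the text). Cluster size and utilization here are hypothetical.

def idle_cost_usd(n_gpus: int, rate_per_hr: float,
                  utilization: float, hours: float) -> float:
    return n_gpus * rate_per_hr * (1 - utilization) * hours

# 1,024 GPUs at $3.50/hr, 60% utilized, over a 30-day month
monthly_waste = idle_cost_usd(1024, 3.50, 0.60, 30 * 24)
print(f"~${monthly_waste:,.0f}/month burned on idle time")
```

At this scale, a 10-point utilization improvement pays for a serious investment in schedulers and orchestration.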
Course Roadmap
What we'll cover next:
- Ch 2: GPU architecture deep dive (CUDA cores, Tensor Cores, SMs)
- Ch 3: The accelerator landscape (NVIDIA, AMD, Google TPU, AWS)
- Ch 4: Memory — the real bottleneck (HBM, bandwidth, KV cache)
- Ch 5: Interconnects — how GPUs talk (NVLink, InfiniBand, PCIe)
- Ch 6: Network topologies (fat-tree, rail-optimized)
- Ch 7: Distributed training (data/tensor/pipeline parallelism)
- Ch 8–14: Training clusters, inference, storage, power, cloud, and more
Key insight: Every chapter in this course exists because of the fundamental truth we covered here: AI needs parallel compute that CPUs can’t provide. GPUs fill that gap, but they bring their own constraints — power, cooling, memory, networking, cost — and the entire field of AI infrastructure exists to manage those constraints.