Ch 2 — Inside a GPU: Architecture That Powers AI

Streaming Multiprocessors, CUDA cores, Tensor Cores, and the memory hierarchy
Streaming Multiprocessors: The Building Blocks
A GPU is a city of identical factories, each one an SM
The Factory Floor Analogy
Think of a GPU as a massive industrial park with dozens of identical factories. Each factory is a Streaming Multiprocessor (SM) — a self-contained processing unit with its own workers, tools, and local storage.

The NVIDIA A100 has 108 SMs. The H100 has 132 SMs. The B200 has 160 SMs. Each SM operates independently, running its own set of threads on its own data.

Inside each SM, you’ll find:
CUDA cores (the general workers)
Tensor Cores (the specialized matrix machines)
Warp schedulers (the foremen who assign work)
Register file + shared memory (the local stockroom)

The GPU’s job is to keep all these factories running at full capacity simultaneously.
SM Count Across Generations
Architecture       GPU    SMs
Ampere (2020)      A100   108
Hopper (2022)      H100   132
Hopper (2023)      H200   132
Blackwell (2024)   B200   160

Inside each Hopper SM:
  CUDA Cores:       128
  Tensor Cores:     4
  Warp Schedulers:  4
  Register File:    256 KB
  Shared Memory:    228 KB (configurable)

H100 totals:
  132 SMs × 128 = 16,896 CUDA cores
  132 SMs × 4   = 528 Tensor Cores
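These per-SM figures multiply straight out to chip totals. A quick sanity check in Python, using the numbers from the table above:

```python
# Hopper (H100) per-SM resources, from the table above
SMS = 132
CUDA_CORES_PER_SM = 128
TENSOR_CORES_PER_SM = 4

total_cuda_cores = SMS * CUDA_CORES_PER_SM      # 132 × 128 = 16,896
total_tensor_cores = SMS * TENSOR_CORES_PER_SM  # 132 × 4 = 528
print(total_cuda_cores, total_tensor_cores)
```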
Key insight: The SM is the fundamental unit of GPU compute. When NVIDIA says “more SMs,” they mean more parallel factories. When they say “bigger SMs,” they mean more workers per factory. Both increase total throughput, but in different ways.
CUDA Cores: The General Workers
Simple, fast, and there are thousands of them
The Screwdriver Analogy
A CUDA core is like a worker with a single screwdriver. They can do exactly one thing per clock cycle: take two numbers, multiply them (or add them), and produce a result.

No branch prediction. No speculative execution. No out-of-order logic. Just: multiply, add, done. Next.

This simplicity is the point. Because each core is so simple, you can fit thousands of them on a single chip. The H100 has 16,896 CUDA cores, each running at ~1.98 GHz.

Each CUDA core handles either a floating-point (FP32, FP64) or integer (INT32) operation. Modern architectures can execute FP32 and INT32 operations simultaneously on separate datapaths within the same SM, effectively doubling throughput for mixed workloads.
CUDA Core Math
H100 CUDA core throughput:
  Cores:        16,896
  Clock:        1.98 GHz (boost)
  Ops per core: 2 (FMA = multiply + add)
  FP32 TFLOPS:  16,896 × 1.98 GHz × 2 = 66.9 TFLOPS

For comparison (FP32):
  A100: 19.5 TFLOPS
  H100: 66.9 TFLOPS
  B200: ~90 TFLOPS

FMA (Fused Multiply-Add): one instruction computes a × b + c — exactly what matrix multiplication needs.
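The throughput arithmetic in the panel generalizes to any GPU. A small sketch (the `fp32_tflops` helper is ours, not a CUDA API; FMA counts as two floating-point operations per cycle):

```python
def fp32_tflops(cores: int, clock_ghz: float, ops_per_cycle: int = 2) -> float:
    """Peak FP32 throughput in TFLOPS: cores × clock (GHz) × ops/cycle.
    ops_per_cycle = 2 because a fused multiply-add (a × b + c)
    counts as two floating-point operations."""
    return cores * clock_ghz * ops_per_cycle / 1000.0

h100 = fp32_tflops(16_896, 1.98)  # ≈ 66.9
a100 = fp32_tflops(6_912, 1.41)   # ≈ 19.5 (A100 boost clock ~1.41 GHz)
```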
Key insight: CUDA cores handle general-purpose parallel compute. They’re the workhorses for operations like element-wise addition, activation functions (ReLU, GELU), and normalization layers. But for the heavy matrix multiplications that dominate AI, Tensor Cores are the real stars.
Tensor Cores: The Matrix Machines
One instruction does what takes CUDA cores hundreds of cycles
The Power Drill Analogy
If a CUDA core is a worker with a screwdriver (one screw at a time), a Tensor Core is a worker with a power drill that drives 64 screws simultaneously.

A Tensor Core performs an entire 4×4 matrix multiply-and-accumulate in a single instruction. That’s 64 multiply-add operations at once — roughly 8x the throughput of CUDA cores for matrix work.

Introduced in NVIDIA’s Volta architecture (2017), Tensor Cores have evolved dramatically:

Volta (2017): FP16 input, FP32 accumulate
Ampere (2020): Added BF16, TF32, INT8, INT4
Hopper (2022): Added FP8, larger matrix sizes
Blackwell (2024): Added FP4, 2nd-gen Transformer Engine
Tensor Core Performance
H100 Tensor Core TFLOPS:
  FP64 Tensor:  67
  FP32 (TF32):  495
  FP16 / BF16:  990
  FP8:          1,979

B200 Tensor Core TFLOPS:
  FP64 Tensor:  90
  FP32 (TF32):  ~1,100
  FP16 / BF16:  ~2,250
  FP8:          ~4,500
  FP4:          ~9,000

Notice the pattern: halving precision roughly doubles TFLOPS. FP4 on B200 is ~100x the FP32 CUDA-core performance.
Key insight: Tensor Cores are why NVIDIA dominates AI. They’re purpose-built silicon for the exact operation neural networks need most: matrix multiply-accumulate. Every generation makes them faster and supports more precision formats. The jump from FP16 to FP8 to FP4 doubles throughput each time with minimal quality loss for inference.
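The halve-the-precision, double-the-throughput pattern can be checked directly from the H100 figures above (ratios are approximate because the published numbers are rounded):

```python
# H100 peak Tensor Core TFLOPS by input format (from the panel above)
h100_tensor_tflops = {"TF32": 495, "FP16": 990, "FP8": 1979}

# Each step down in precision roughly doubles peak throughput
fp16_vs_tf32 = h100_tensor_tflops["FP16"] / h100_tensor_tflops["TF32"]  # 2.0
fp8_vs_fp16 = h100_tensor_tflops["FP8"] / h100_tensor_tflops["FP16"]    # ~2.0
```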
Warps and Thread Scheduling
How the GPU keeps thousands of cores busy without wasting cycles
The Assembly Line Analogy
Imagine a factory where workers operate in teams of 32. Every team member does the exact same task, but on different pieces of material. If one team member needs to wait for a delivery (memory fetch), the foreman instantly switches to a different team that’s ready to work. No idle time.

This is how GPU warps work. A warp is a group of 32 threads that execute the same instruction simultaneously (SIMT model). Each SM has 4 warp schedulers that can issue instructions to different warps every cycle.

The magic trick: when a warp stalls waiting for memory (which takes 200–400 cycles), the scheduler instantly switches to another ready warp. With enough warps in flight, the GPU hides memory latency entirely — there’s always a ready warp to execute.
Warp Scheduling in Numbers
H100 SM warp capacity:
  Warp schedulers per SM: 4
  Max warps per SM:       64
  Threads per warp:       32
  Max threads per SM:     2,048

Latency hiding example:
  Memory fetch latency:               ~300 cycles
  Instructions per warp before stall: ~20
  Warps needed to hide latency:       300 / 20 = 15

With 64 warps available, the SM has 4x the warps needed to completely hide memory latency. This is zero-overhead context switching — no saving or restoring state: all warps' registers live on-chip simultaneously.
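The latency-hiding arithmetic above fits in a tiny helper (a sketch, not a CUDA API; the real occupancy calculation also accounts for registers and shared memory per thread block):

```python
import math

def warps_to_hide_latency(mem_latency_cycles: int,
                          instrs_before_stall: int) -> int:
    """Minimum resident warps so the scheduler always has a ready warp
    to issue while the others wait on memory."""
    return math.ceil(mem_latency_cycles / instrs_before_stall)

needed = warps_to_hide_latency(300, 20)  # 15
headroom = 64 / needed                   # H100 allows 64 warps/SM → ~4.3x
```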
Key insight: CPUs hide latency with complex hardware (branch prediction, speculative execution, deep caches). GPUs hide latency with massive parallelism — just switch to another group of threads. This is simpler, cheaper in silicon, and scales better. It’s why GPUs can dedicate more transistors to compute instead of control logic.
The Memory Hierarchy
Registers → Shared Memory → L2 Cache → HBM — each level trades speed for size
The Storage Analogy
Think of GPU memory like a series of storage locations, each farther from the worker but larger:

Registers (your pockets): Fastest access, ~0 cycles. Each thread has its own registers. The H100 has 256 KB of register file per SM — that’s 33 MB total across all SMs.

Shared Memory (your desk): ~20 cycles. Shared among all threads in a block. 228 KB per SM on H100. Used for data that multiple threads need to access.

L2 Cache (the filing cabinet): ~200 cycles. 50 MB on H100. Shared across all SMs. Caches frequently accessed data from HBM.

HBM (the warehouse): ~300–400 cycles. 80 GB on H100. This is where model weights and activations live. High capacity but relatively slow to access.
Memory Hierarchy Numbers
H100 memory hierarchy:
  Registers (per SM):
    Size:    256 KB
    Latency: ~0 cycles
    BW:      ~20 TB/s (aggregate)
  Shared Memory (per SM):
    Size:    228 KB
    Latency: ~20 cycles
    BW:      ~15 TB/s (aggregate)
  L2 Cache (shared):
    Size:    50 MB
    Latency: ~200 cycles
    BW:      ~12 TB/s
  HBM3 (global):
    Size:    80 GB
    Latency: ~300–400 cycles
    BW:      3,350 GB/s

80 GB sounds like a lot, but a 70B-parameter FP16 model needs 140 GB. That's why multi-GPU setups are needed.
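One practical consequence of the HBM numbers: during token-by-token inference every weight is read roughly once per generated token, so HBM bandwidth sets a hard ceiling on decode speed. A back-of-the-envelope sketch (the helper is illustrative, and it ignores that 140 GB does not actually fit on one 80 GB card):

```python
def weight_stream_seconds(params_billion: float, bytes_per_param: float,
                          hbm_bw_gb_per_s: float) -> float:
    """Lower bound on the time to read all model weights once from HBM."""
    total_gb = params_billion * bytes_per_param
    return total_gb / hbm_bw_gb_per_s

t = weight_stream_seconds(70, 2, 3350)  # 140 GB / 3,350 GB/s ≈ 0.042 s
max_tokens_per_s = 1 / t                # ≈ 24 tokens/s ceiling per request
```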
Key insight: The art of GPU programming is keeping data as close to the cores as possible. A well-optimized kernel loads data from HBM once into shared memory, then reuses it many times. This is why libraries like cuBLAS and FlashAttention are so important — they maximize data reuse at every level of the hierarchy.
Precision Formats: Trading Accuracy for Speed
FP32, FP16, BF16, TF32, FP8, INT8, FP4 — each halving doubles throughput
The Resolution Analogy
Precision formats are like photo resolution. A 4K photo (FP32) captures every detail but takes lots of storage and bandwidth. A 1080p photo (FP16) looks almost identical to the human eye but uses half the space. A thumbnail (FP8) is fine for previewing but you wouldn’t print it.

FP32 (32 bits): Full precision. 1 sign + 8 exponent + 23 mantissa bits. The gold standard, but slow and memory-hungry.

BF16 (16 bits): bfloat16, developed at Google Brain. Same exponent range as FP32 (so same number range) but fewer mantissa bits. Ideal for training — maintains dynamic range while halving memory.

FP16 (16 bits): IEEE half-precision. Smaller range than BF16 but more mantissa precision. Good for inference.

FP8 (8 bits): Introduced with Hopper. Two variants: E4M3 (more precision) and E5M2 (more range). Doubles throughput again with minimal quality loss for inference.
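The range/precision trade-off between FP16 and BF16 follows directly from the bit layouts. A sketch under standard IEEE conventions (the `max_finite` helper is ours; note that FP8 E4M3 bends these rules, reclaiming the Inf encodings to reach a maximum of 448):

```python
def max_finite(exp_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-style float: bias = 2^(e-1) - 1,
    with the top exponent code reserved for Inf/NaN."""
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2 ** -mantissa_bits) * 2.0 ** bias

fp16_max = max_finite(5, 10)  # 65504.0 — overflows easily during training
bf16_max = max_finite(8, 7)   # ≈ 3.39e38 — same ballpark as FP32's 3.40e38
```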
Precision Format Comparison
Format  Bits  H100 TFLOPS  Use Case
FP64    64    34           Scientific
FP32    32    67           Legacy training
TF32    19    495          Training (default)
BF16    16    990          Training
FP16    16    990          Inference
FP8     8     1,979        Inference
INT8    8     1,979        Quantized inference

Memory per parameter:
  FP32: 4 bytes   → 70B = 280 GB
  FP16: 2 bytes   → 70B = 140 GB
  FP8:  1 byte    → 70B = 70 GB
  INT4: 0.5 byte  → 70B = 35 GB

Lower precision = less memory + more TFLOPS + faster inference. The trick is doing it without losing too much quality.
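The memory-per-parameter column reduces to one multiplication, which makes for a quick feasibility check against a GPU's capacity:

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight memory only; activations, KV cache, and (for training)
    optimizer state add substantially more on top."""
    return params_billion * bytes_per_param

for fmt, nbytes in [("FP32", 4), ("FP16", 2), ("FP8", 1), ("INT4", 0.5)]:
    gb = weight_memory_gb(70, nbytes)
    fits = "fits" if gb <= 80 else "does not fit"
    print(f"{fmt}: 70B -> {gb:.0f} GB ({fits} on one 80 GB H100)")
```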
Key insight: The shift from FP32 to lower precision formats is one of the most impactful trends in AI infrastructure. Training in BF16 (instead of FP32) halves memory usage and doubles throughput with negligible quality loss. Inference in FP8 or INT4 makes models 4–8x more efficient. This is why quantization (covered in the Small Models course) is so powerful.
How a Matrix Multiply Flows Through the GPU
From HBM to Tensor Core and back — the journey of one operation
The Full Data Flow
Let’s trace a single matrix multiplication (the core of every attention layer) through the GPU:

1. Load from HBM: Weight matrix tiles and input activations are loaded from HBM3 into L2 cache, then into shared memory. This is the slowest step (~300 cycles).

2. Tile into shared memory: The matrices are split into small tiles (e.g., 128×128) that fit in shared memory. Each SM works on one tile.

3. Feed Tensor Cores: Data moves from shared memory into registers, then into Tensor Cores. Each Tensor Core computes a 4×4 (or larger) matrix multiply-accumulate.

4. Accumulate results: Partial results are accumulated in registers (FP32 precision for accuracy, even if inputs are FP16).

5. Write back: Final results go from registers → shared memory → L2 → HBM for the next layer to consume.
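The five steps map onto the loop structure of a tiled matrix multiply. A toy pure-Python version (illustrative only; real kernels express this with CUDA shared memory and Tensor Core instructions):

```python
def tiled_matmul(A, B, tile=2):
    """C = A × B computed tile by tile: the k0 loop mimics streaming
    tiles from HBM into shared memory, and the innermost loop mimics
    the multiply-accumulate a Tensor Core performs in hardware."""
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tile):              # each (i0, j0) output tile
        for j0 in range(0, N, tile):          #   would go to one SM
            for k0 in range(0, K, tile):      # stream K-tiles ("HBM loads")
                for i in range(i0, min(i0 + tile, M)):
                    for j in range(j0, min(j0 + tile, N)):
                        acc = C[i][j]         # accumulate in a "register"
                        for k in range(k0, min(k0 + tile, K)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc         # write back
    return C

# tiled_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]
```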
The Tiling Strategy
Matrix multiply: A[M×K] × B[K×N]

Step 1: Tile the matrices
  Split A into M/128 row tiles
  Split B into N/128 column tiles
  Each SM gets one output tile

Step 2: Load tiles to shared memory
  Load A_tile[128×K] from HBM
  Load B_tile[K×128] from HBM
  ~300 cycles per load

Step 3: Compute in Tensor Cores
  Sub-tile into 4×4 blocks
  Tensor Core: C += A_sub × B_sub
  ~1 cycle per 4×4 multiply

Step 4: Reuse data
  Each loaded tile is reused 128 times (once per output column/row)
  This reuse is the key to efficiency

The ratio of compute to memory access (arithmetic intensity) determines whether you're compute-bound or memory-bound.
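Arithmetic intensity can be computed for one output tile of the scheme above. A sketch (our helper; it counts only the A and B tile loads in FP16 and ignores L2 reuse and the C write-back):

```python
def arithmetic_intensity(tile: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for one tile×tile output tile of A[M×K] × B[K×N]:
    2·tile²·K FLOPs after loading tile·K elements each of A and B.
    The K terms cancel, leaving tile / bytes_per_elem."""
    flops = 2 * tile * tile * k
    bytes_moved = 2 * tile * k * bytes_per_elem
    return flops / bytes_moved

ai = arithmetic_intensity(128, 4096)  # 64 FLOPs/byte
ridge = 990e12 / 3.35e12              # H100 FP16: ~295 FLOPs/byte to be
                                      # compute-bound — so a 128-wide tile
                                      # alone isn't enough; L2 reuse and
                                      # larger tiles close the gap
```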
Key insight: The reason libraries like cuBLAS and FlashAttention are so critical is that they optimize this tiling strategy. FlashAttention, for example, fuses the entire attention computation into a single kernel that keeps data in shared memory, avoiding expensive HBM round-trips. This can speed up attention by 2–4x.
GPU Architecture Evolution: Volta to Blackwell
Each generation brings more SMs, faster Tensor Cores, and more memory bandwidth
The Generational Leap
NVIDIA releases a new GPU architecture roughly every two years. Each generation brings major improvements:

Volta (2017, V100): Introduced Tensor Cores. The GPU that made modern deep learning training practical. 125 TFLOPS FP16.

Ampere (2020, A100): Added BF16 and TF32 support. Structural sparsity for 2x inference boost. 312 TFLOPS FP16. The workhorse of the GPT-3 era.

Hopper (2022, H100): Added FP8 and the Transformer Engine that automatically manages precision. 990 TFLOPS FP16. The GPU that trained GPT-4 and Llama 3.

Blackwell (2024, B200): Dual-die chiplet design with 208 billion transistors. Added FP4. 2nd-gen Transformer Engine. ~2,250 TFLOPS FP16. The current flagship.
Generation Comparison
                V100    A100    H100     B200
Year:           2017    2020    2022     2024
SMs:            80      108     132      160
CUDA cores:     5,120   6,912   16,896   ~20K
Tensor Cores:   640     432     528      ~640
HBM:            16 GB   80 GB   80 GB    192 GB
BW (GB/s):      900     2,039   3,350    8,000
FP16 TFLOPS:    125     312     990      ~2,250
TDP:            300 W   400 W   700 W    1,000 W

7-year improvement (V100 → B200):
  FP16 TFLOPS: 18x
  Memory:      12x
  Bandwidth:   9x
  Power:       3.3x

Performance grows much faster than power: overall efficiency (TFLOPS per watt) improves roughly 5x across these four generations.
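The efficiency claim at the bottom of the table can be checked from its own rows, dividing the FP16 line by the TDP line:

```python
gens = {  # FP16 Tensor TFLOPS, TDP in watts — from the table above
    "V100": (125, 300),
    "A100": (312, 400),
    "H100": (990, 700),
    "B200": (2250, 1000),
}
tflops_per_watt = {g: tf / w for g, (tf, w) in gens.items()}
# V100 ≈ 0.42, A100 ≈ 0.78, H100 ≈ 1.41, B200 ≈ 2.25
overall_gain = tflops_per_watt["B200"] / tflops_per_watt["V100"]  # ≈ 5.4x
```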
Key insight: Each GPU generation roughly doubles AI performance while power grows more slowly. This means each generation is significantly more energy-efficient. But the absolute power numbers (1,000W for B200) are driving the entire data center industry toward liquid cooling — a topic we’ll cover in Chapter 11.