The Assembly Line Analogy
Imagine a factory where workers operate in teams of 32. Every team member does the exact same task, but on different pieces of material. If one team member needs to wait for a delivery (memory fetch), the foreman instantly switches to a different team that’s ready to work. No idle time.
This is how GPU warps work. A warp is a group of 32 threads that execute the same instruction in lockstep (the SIMT model). On an H100, each streaming multiprocessor (SM) has 4 warp schedulers, each able to issue an instruction to a different warp every cycle.
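The grouping is purely positional. A minimal sketch (plain Python, not CUDA; the names `warp_and_lane` and `WARP_SIZE` are illustrative) of how consecutive thread IDs in a 1-D block pack into warps — the same values a CUDA kernel would get from `threadIdx.x / 32` and `threadIdx.x % 32`:

```python
# Sketch of SIMT thread grouping: 32 consecutive threads form one warp.
WARP_SIZE = 32

def warp_and_lane(tid):
    """Return (warp id, lane within warp) for a 1-D thread index."""
    return tid // WARP_SIZE, tid % WARP_SIZE

# Threads 0..31 form warp 0, threads 32..63 form warp 1, and so on.
print(warp_and_lane(0))    # (0, 0)
print(warp_and_lane(31))   # (0, 31)
print(warp_and_lane(32))   # (1, 0)
```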
The magic trick: when a warp stalls waiting for memory (which takes 200–400 cycles), the scheduler instantly switches to another ready warp. With enough warps in flight, the GPU hides memory latency entirely — there’s always a ready warp to execute.
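A toy model makes the effect concrete. This is not how the hardware arbitrates (the scheduler here just picks the lowest-numbered ready warp), and `run_len` and `latency` are the illustrative numbers from this article, but it shows idle cycles vanishing as warps are added:

```python
# Toy latency-hiding model: each warp issues one instruction per cycle,
# and after run_len instructions it stalls for `latency` cycles (a memory
# fetch). Each cycle the scheduler issues from any ready warp; if none is
# ready, the cycle is wasted.
def idle_fraction(num_warps, run_len=20, latency=300, total_cycles=100_000):
    ready_at = [0] * num_warps   # first cycle each warp can issue again
    issued = [0] * num_warps     # instructions issued since last stall
    idle = 0
    for cycle in range(total_cycles):
        for w in range(num_warps):
            if ready_at[w] <= cycle:
                issued[w] += 1
                if issued[w] == run_len:      # memory access: warp stalls
                    issued[w] = 0
                    ready_at[w] = cycle + 1 + latency
                break
        else:
            idle += 1                          # no warp was ready this cycle
    return idle / total_cycles

print(f"1 warp:   {idle_fraction(1):.0%} idle")
print(f"4 warps:  {idle_fraction(4):.0%} idle")
print(f"16 warps: {idle_fraction(16):.0%} idle")
```

With a single warp the model sits idle roughly 94% of the time (20 busy cycles, then 300 stalled); by 16 warps there is always a ready warp and idle time drops to zero.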
Warp Scheduling in Numbers
H100 SM warp capacity:
Warp schedulers per SM: 4
Max warps per SM: 64
Threads per warp: 32
Max threads per SM: 2,048
Latency hiding example:
Memory fetch latency: ~300 cycles
Instructions per warp before stalling: ~20
Warps needed to hide latency: 300 / 20 = 15

With 64 warps available, the SM has roughly 4x the warps needed to completely hide memory latency. This is zero-overhead context switching — no saving or restoring of state; all resident warps' registers live on-chip simultaneously.
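The arithmetic above can be checked in a few lines. This is back-of-envelope only (real stall behavior varies by kernel), and the variable names are mine, not CUDA terminology:

```python
import math

# Back-of-envelope latency-hiding arithmetic using the numbers above.
latency = 300      # cycles for a memory fetch
run_len = 20       # instructions a warp issues before stalling
max_warps = 64     # max resident warps per H100 SM

warps_needed = math.ceil(latency / run_len)   # warps to cover one stall
headroom = max_warps / warps_needed           # spare capacity vs. that minimum
min_occupancy = warps_needed / max_warps      # fraction of warp slots required

print(f"warps needed: {warps_needed}")        # 15
print(f"headroom: {headroom:.1f}x")           # ~4.3x
print(f"occupancy needed: {min_occupancy:.0%}")
```

Note the implication: the SM needs only about a quarter of its warp slots occupied to hide this latency, which is why even kernels with heavy register usage (fewer resident warps) can still run at full throughput.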
Key insight: CPUs hide latency with complex hardware (branch prediction, speculative execution, deep caches). GPUs hide latency with massive parallelism — just switch to another group of threads. This is simpler, cheaper in silicon, and scales better. It’s why GPUs can dedicate more transistors to compute instead of control logic.