Why Batching Matters
GPUs are massively parallel processors: they are designed to process many inputs simultaneously. Processing one request at a time can waste 90%+ of GPU capacity. Batching groups multiple requests together for parallel processing. There are three common strategies:

- Static batching: fixed batch size; wait until the batch is full. Simple, but adds latency.
- Dynamic batching: collect requests for a short window, then batch whatever arrived (Triton's approach).
- Continuous batching: for LLMs; don't wait for all sequences in a batch to finish, add new requests as old ones complete.

The trade-off is always latency vs. throughput: larger batches mean higher throughput but higher latency per request.
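The dynamic-batching idea above can be sketched in a few lines. This is a minimal, single-threaded illustration, not Triton's implementation; the class and parameter names (`DynamicBatcher`, `max_batch_size`, `max_delay_s`) are invented for the example:

```python
import time

class DynamicBatcher:
    """Toy dynamic batcher: flush when the batch is full OR the
    collection window (max_delay_s) expires, whichever comes first."""

    def __init__(self, max_batch_size=8, max_delay_s=0.005):
        self.max_batch_size = max_batch_size
        self.max_delay_s = max_delay_s
        self.pending = []

    def submit(self, request):
        # In a real server this is called from request-handler threads.
        self.pending.append(request)

    def collect_batch(self):
        # Wait at most max_delay_s, so per-request latency is bounded
        # by the window even when traffic is light.
        deadline = time.monotonic() + self.max_delay_s
        batch = []
        while len(batch) < self.max_batch_size and time.monotonic() < deadline:
            if self.pending:
                batch.append(self.pending.pop(0))
            else:
                time.sleep(0.0001)  # yield briefly while waiting for arrivals
        return batch
```

The key property is the bounded wait: a full batch flushes immediately, while a partial batch flushes as soon as the window closes.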
Batching Comparison
Batching strategies:

No Batching (batch_size=1):
    GPU utilization: ~5-10%
    Latency: lowest per request
    Throughput: very low
    Waste: enormous

Static Batching:
    Wait for N requests → process together
    GPU utilization: ~60-80%
    Latency: variable (waiting time)
    Throughput: good

Dynamic Batching (Triton):
    Collect for up to max_delay μs → batch
    GPU utilization: ~70-90%
    Latency: bounded by max_delay
    Throughput: very good

Continuous Batching (vLLM):
    New requests join mid-batch
    GPU utilization: ~85-95%
    Latency: lowest for LLMs
    Throughput: best for LLMs
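In Triton, dynamic batching is enabled per model in its `config.pbtxt`. The field names below are Triton's real configuration options, but the values are illustrative, not recommendations:

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```

`max_queue_delay_microseconds` is the collection window: it caps how long a request can sit in the queue waiting for batch-mates, which is what bounds the added latency.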
Key insight: Continuous batching (used by vLLM and TGI) was a breakthrough for LLM serving. In static batching, a batch waits for the longest sequence to finish. In continuous batching, short sequences leave and new ones join immediately, keeping the GPU saturated.
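To make that concrete, here is a toy simulation (assumed for illustration, not vLLM or TGI code) comparing the two schemes. Each "step" decodes one token for every active sequence; `lengths` is the number of tokens each request still needs:

```python
from collections import deque

def simulate_continuous(lengths, max_batch=4):
    """Continuous batching: finished sequences leave immediately and
    queued requests join at once. Returns (steps, slot_steps_used)."""
    queue, active = deque(lengths), []
    steps = used = 0
    while queue or active:
        while queue and len(active) < max_batch:
            active.append(queue.popleft())   # new request joins mid-batch
        used += len(active)                  # slots doing useful work
        active = [r - 1 for r in active if r > 1]
        steps += 1
    return steps, used

def simulate_static(lengths, max_batch=4):
    """Static batching: each batch runs until its LONGEST sequence
    finishes; short sequences idle in their slots until then."""
    steps = used = 0
    for i in range(0, len(lengths), max_batch):
        batch = lengths[i:i + max_batch]
        for s in range(max(batch)):
            used += sum(1 for r in batch if r > s)
            steps += 1
    return steps, used
```

With mixed sequence lengths (e.g. `[2, 8, 3, 8, 2, 8]`), both schemes do the same total token work, but the continuous simulation finishes in fewer steps because slots freed by short sequences are refilled instead of sitting idle.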