Inference Sizing
Rule of thumb for inference:
Model_Size = Parameters × Bytes_per_Param (FP16 = 2 bytes/param)
GPUs needed = ⌈Model_Size ÷ GPU_Memory⌉
(leave 20-30% of GPU memory as headroom for the KV cache)
7B model (FP16 = 14 GB):
H100 (80GB): 1 GPU ✓
RTX 4090 (24GB): 1 GPU ✓
13B model (FP16 = 26 GB):
H100: 1 GPU ✓
RTX 4090: 2 GPUs (or 1 with INT4 quantization, ~6.5 GB)
70B model (FP16 = 140 GB):
H100: 2 GPUs
B200 (192GB): 1 GPU ✓
MI300X (192GB): 1 GPU ✓
405B model (FP16 = 810 GB):
H100: ~12 GPUs
B200: ~5 GPUs
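The rule above can be sketched as a small Python helper. The function name and defaults are my own choices; this is a rough sizing sketch that counts weight memory only, not a serving calculator:

```python
import math

def gpus_for_inference(params_b, gpu_mem_gb, bytes_per_param=2):
    """Minimum GPUs needed to hold the model weights alone.

    params_b       : parameter count in billions
    gpu_mem_gb     : memory per GPU in GB
    bytes_per_param: 2 for FP16/BF16, 1 for INT8, 0.5 for INT4

    Note: this ignores the 20-30% KV-cache headroom, so treat the
    result as a floor, not a deployment plan.
    """
    model_gb = params_b * bytes_per_param        # e.g. 70 x 2 = 140 GB
    return math.ceil(model_gb / gpu_mem_gb)

print(gpus_for_inference(70, 80))                       # 70B FP16 on H100s -> 2
print(gpus_for_inference(13, 24, bytes_per_param=0.5))  # 13B INT4 on a 4090 -> 1
```

Because the headroom is excluded, a result that barely fits (like 70B on 2 H100s) leaves little room for long-context KV caches.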
Training Sizing
Rule of thumb for training (mixed-precision Adam):
Memory ≈ 16 bytes × parameters
(2 FP16 weights + 2 FP16 gradients + 12 FP32 optimizer states)
+ activations (varies with batch size and sequence length)
7B model training:
Memory: ~130-170 GB
H100s needed: 2-4
(with FSDP sharding)
13B model training:
Memory: ~250-350 GB
H100s needed: 4-8
70B model training:
Memory: ~1,300-1,700 GB
H100s needed: 16-32
405B model training:
Memory: ~7,500-10,000 GB
H100s needed: 128+
These are minimums. Production
training often uses 2-4x more
GPUs for faster throughput.
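The training estimates above can be reproduced with a short sketch. The `activation_factor` and per-GPU `usable_fraction` are assumed placeholders, not measured values; real activation memory depends on batch size, sequence length, and checkpointing:

```python
import math

def training_memory_gb(params_b, activation_factor=1.2):
    """~16 bytes/param: 2 (FP16 weights) + 2 (FP16 gradients)
    + 12 (Adam FP32 master weights, momentum, variance).
    activation_factor (~20% extra) is an assumed stand-in for
    activation memory."""
    return params_b * 16 * activation_factor

def h100s_for_training(params_b, gpu_mem_gb=80, usable_fraction=0.85):
    # Assumes FSDP shards weights/grads/optimizer states evenly across
    # GPUs and reserves ~15% per GPU for fragmentation and buffers.
    return math.ceil(training_memory_gb(params_b) /
                     (gpu_mem_gb * usable_fraction))

print(training_memory_gb(70))   # ~1,344 GB, inside the 1,300-1,700 range above
print(h100s_for_training(7))    # lands in the 2-4 H100 range
```

These land at the low end of the ranges listed, which is consistent with the note that production runs use 2-4x more GPUs for throughput.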
Key insight: Memory is the first constraint to check when planning any AI workload. Before worrying about TFLOPS, interconnects, or networking, ask: “Does my model fit?” If the model doesn’t fit in GPU memory, nothing else matters until you solve that problem with more GPUs, quantization, or a smaller model.