Ch 13 — Numerical Methods & Stability

When computers lie about math — floating point, overflow, and tricks that save training
Floating-Point Numbers
Why 0.1 + 0.2 ≠ 0.3 on a computer
The Analogy
Imagine measuring with a ruler that only has markings every millimeter. You can’t represent 1.5mm exactly — you round to 1mm or 2mm. Floating-point numbers are like a ruler with uneven spacing: very precise near zero, increasingly imprecise for large numbers. Computers store numbers with ~7 decimal digits of precision (float32) or ~16 digits (float64). Everything else gets rounded.
Key insight: In Python, 0.1 + 0.2 = 0.30000000000000004, not 0.3. This isn’t a bug — it’s fundamental to how computers represent numbers. In deep learning, these tiny errors accumulate over billions of operations. A model with 175 billion parameters doing millions of multiply-adds can amplify rounding errors into NaN (Not a Number) crashes.
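You can verify both claims directly — plain Python plus NumPy, nothing else assumed:

```python
import math

import numpy as np

# float64: 0.1 and 0.2 are each rounded on entry, so their sum misses 0.3
a = 0.1 + 0.2
print(a)                      # 0.30000000000000004
print(a == 0.3)               # False

# The right comparison: allow a small tolerance
print(math.isclose(a, 0.3))   # True

# float32 keeps only ~7 decimal digits:
# adding 1e-8 to 1.0 falls below its resolution and rounds away
print(np.float32(1.0) + np.float32(1e-8) == np.float32(1.0))  # True
```

This is why equality tests on floats are almost always a bug — compare with a tolerance instead.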
Float Formats in AI
# IEEE 754 floating-point formats:

# float64 (FP64): 1 sign + 11 exp + 52 mantissa
#   → ~16 decimal digits precision
#   → Range: ±10^308

# float32 (FP32): 1 sign + 8 exp + 23 mantissa
#   → ~7 decimal digits precision
#   → Range: ±3.4 × 10^38

# float16 (FP16): 1 sign + 5 exp + 10 mantissa
#   → ~3.3 decimal digits precision
#   → Range: ±65504 (very limited!)

# bfloat16 (BF16): 1 sign + 8 exp + 7 mantissa
#   → Same range as FP32, less precision
#   → Google's format, great for training

# Tradeoff: less precision = less memory
# FP32: 4 bytes/param → 700GB for 175B params
# FP16: 2 bytes/param → 350GB for 175B params
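The numbers in this table can be checked with NumPy's np.finfo (stock NumPy has no bfloat16, so only the three IEEE formats appear here):

```python
import numpy as np

# np.finfo reports the precision and range of each IEEE format
for dtype in (np.float64, np.float32, np.float16):
    info = np.finfo(dtype)
    print(dtype.__name__, "eps:", info.eps, "max:", info.max)

# float16 overflows where float32 is still comfortable
x16 = np.float16(60000) * np.float16(2)   # exceeds float16's max of 65504
x32 = np.float32(60000) * np.float32(2)
print(x16)   # inf
print(x32)   # 120000.0
```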
Overflow & Underflow
When numbers get too big or too small
The Analogy
Overflow is like a car odometer that rolls past 999,999 and resets to 000,000. The number is too big to represent, so it becomes “infinity” (Inf). Underflow is like trying to measure an atom with a kitchen scale — the number is so small it rounds to zero. Both are catastrophic in training: Inf or 0 in a gradient means your model stops learning.
Critical in AI: Softmax computes e^x. If x = 1000, e^1000 = Inf (overflow!). If x = -1000, e^-1000 = 0 (underflow!). Multiplying many probabilities: 0.01 × 0.01 × ... (100 times) = 10^-200, which underflows to 0 in float32 (float64 holds out only until ~10^-308). These aren’t edge cases — they happen in every training run without proper numerical tricks.
Worked Example
import numpy as np

# Overflow: number too large
np.float32(1e38) * np.float32(10)     # → inf (overflow!)

# Underflow: number too small
np.float32(1e-38) * np.float32(1e-8)  # → 0.0 (underflow!)

# Real problem: multiplying probabilities
probs = [0.01] * 200
product = np.prod(probs)  # → 0.0 (underflows even in float64)

# Solution: work in log space!
log_product = np.sum(np.log(probs))  # → -921.03 (perfectly fine!)

# NaN: 0/0, inf - inf, sqrt(-1)
# Once NaN enters, it infects everything:
# NaN + anything = NaN
The Log-Sum-Exp Trick
The single most important numerical trick in ML
The Analogy
Imagine comparing skyscrapers by height. Instead of measuring from sea level (huge numbers, hard to compare), you measure from the shortest building (small, manageable numbers). The log-sum-exp trick does the same: subtract the maximum value before exponentiating. This keeps numbers in a safe range without changing the mathematical result.
Key insight: Every time you call torch.logsumexp(), nn.CrossEntropyLoss(), or F.log_softmax(), PyTorch uses this trick internally. Without it, training any model with softmax (which is every classifier and every transformer) would crash with overflow/underflow. This one trick makes modern deep learning possible.
The Math & Code
# Problem: log(Σ e^xᵢ) overflows
# e.g., x = [1000, 1001, 999]
# e^1000 = inf → crash!

# Solution: subtract the max first
# log(Σ e^xᵢ) = M + log(Σ e^(xᵢ - M)), where M = max(x)

import numpy as np

x = np.array([1000.0, 1001.0, 999.0])

# Naive (overflows):
# np.log(np.sum(np.exp(x)))  # → inf

# Safe (log-sum-exp trick):
M = np.max(x)  # 1001
result = M + np.log(np.sum(np.exp(x - M)))
# x - M = [-1, 0, -2]
# e^[-1, 0, -2] ≈ [0.37, 1.0, 0.14]
# sum ≈ 1.50, log ≈ 0.41
# result ≈ 1001.41 ✓

# PyTorch does this automatically:
import torch
torch.logsumexp(torch.tensor(x), dim=0)  # → 1001.41
Numerically Stable Softmax
Why PyTorch combines softmax + cross-entropy into one function
The Analogy
Computing softmax naively is like converting currencies through 5 intermediate currencies — you lose precision at each step. Fused operations combine multiple steps into one, like converting directly. PyTorch’s nn.CrossEntropyLoss fuses softmax + log + negative into one numerically stable operation. That’s why you pass raw logits, not softmax probabilities.
Key insight: If you see code that does loss = -log(softmax(logits)) in two steps, it’s a bug waiting to happen. The softmax might produce 0 for some classes (underflow), then log(0) = -Inf. PyTorch’s fused CrossEntropyLoss(logits, targets) avoids this entirely using the log-sum-exp trick internally.
Safe vs. Unsafe
import torch
import torch.nn.functional as F

logits = torch.tensor([100.0, 200.0, 300.0])

# UNSAFE: two-step softmax + log
probs = torch.softmax(logits, dim=0)
# → [0.0, ~0.0, 1.0] — the first entry underflows to exactly 0
log_probs = torch.log(probs)
# → [-inf, -100.0, 0.0] — log(0) = -inf (NaN risk downstream!)

# SAFE: fused log_softmax
log_probs = F.log_softmax(logits, dim=0)
# → [-200.0, -100.0, 0.0] (correct!)

# SAFEST: fused cross-entropy loss
loss = F.cross_entropy(
    logits.unsqueeze(0),  # shape (1, 3): a batch of one
    torch.tensor([2]),    # target class index
)
# Handles everything internally
Unsafe
log(softmax(x)) — underflow then -Inf
Safe
log_softmax(x) — fused, numerically stable
Mixed Precision Training
Use less precision to train faster — without losing quality
The Analogy
Imagine writing a rough draft in pencil (fast, imprecise) and the final version in pen (slow, precise). Mixed precision does the same: forward and backward passes use FP16/BF16 (fast, 2× less memory) while the master weights stay in FP32 (precise). The gradients are computed cheaply in low precision, then applied precisely to the high-precision weights.
Key insight: Mixed precision training is why modern LLMs are trainable at all. Training GPT-3 in pure FP32 would need ~700GB just for parameters. BF16 halves that to ~350GB. NVIDIA’s Tensor Cores are 8× faster in FP16 than FP32. Mixed precision gives you 2× memory savings and 2-3× speed boost with virtually no quality loss.
In Practice
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for batch in dataloader:
    optimizer.zero_grad()

    # Forward pass in FP16 (fast!)
    with autocast():
        output = model(batch)
        loss = criterion(output, targets)

    # Scale loss to prevent FP16 gradient underflow
    scaler.scale(loss).backward()

    # Unscale gradients and update FP32 master weights
    scaler.step(optimizer)
    scaler.update()

# GradScaler: multiplies the loss by a large factor
# before backward (prevents FP16 underflow),
# then divides gradients back before the update
Conditioning & Numerical Sensitivity
Some problems amplify errors, others absorb them
The Analogy
A well-conditioned problem is like a sturdy bridge — small vibrations don’t cause collapse. An ill-conditioned problem is like a house of cards — a tiny breeze topples everything. The condition number measures this sensitivity. Matrix inversion with a high condition number amplifies tiny rounding errors into huge mistakes.
Key insight: This is why batch normalization and layer normalization work so well. They keep the condition number of the weight matrices low by normalizing activations. Without normalization, deep networks become ill-conditioned — gradients either explode or vanish because tiny errors get amplified through 100+ layers.
Condition Number
# Condition number: κ(A) = σ_max / σ_min
# (ratio of largest to smallest singular value)
# κ ≈ 1:  well-conditioned (stable)
# κ >> 1: ill-conditioned (unstable)

import torch

A_good = torch.tensor([[2.0, 0.0], [0.0, 1.0]])
# κ = 2/1 = 2 (well-conditioned)

A_bad = torch.tensor([[1.0, 1.0], [1.0, 1.001]])
# κ ≈ 4002 (ill-conditioned!)
# Tiny change in input → huge change in output

# Why normalization helps:
# BatchNorm keeps activations ~ N(0, 1)
# → Weight matrices stay well-conditioned
# → Gradients flow smoothly through layers
# → Training is stable and fast
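To see the amplification concretely, here is a small sketch that solves A_bad·x = b twice using torch.linalg.cond and torch.linalg.solve; the right-hand sides are illustrative values chosen so the exact answers are simple:

```python
import torch

A_bad = torch.tensor([[1.0, 1.0], [1.0, 1.001]], dtype=torch.float64)
print(torch.linalg.cond(A_bad))   # ≈ 4002

# Solve A x = b, then nudge b by ~0.05% and solve again
b = torch.tensor([2.0, 2.001], dtype=torch.float64)
x1 = torch.linalg.solve(A_bad, b)
x2 = torch.linalg.solve(A_bad, b + torch.tensor([0.0, 0.001], dtype=torch.float64))
print(x1)   # [1., 1.]
print(x2)   # [0., 2.] — a tiny input change flipped the answer
```

A 0.001 nudge in one entry of b moved each component of the solution by a full 1.0 — roughly the ×4000 amplification the condition number predicts.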
Essential Numerical Tricks
The toolkit every ML engineer needs
Gradient Clipping
Gradient clipping caps the gradient magnitude to prevent exploding gradients. If ||g|| > threshold, scale it down: g → g × (threshold / ||g||). Used in virtually every RNN and transformer training. Without it, a single bad batch can send gradients to Inf and destroy the model.
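The clipping rule can be checked by hand against PyTorch's built-in; the [3, 4] gradient is just an illustrative value with norm 5:

```python
import torch

# The rule by hand: if ||g|| > threshold, rescale g by threshold / ||g||
g = torch.tensor([3.0, 4.0])          # ||g|| = 5
threshold = 1.0
norm = g.norm()
if norm > threshold:
    g = g * (threshold / norm)
print(g)          # [0.6, 0.8] — direction preserved, norm capped at 1.0

# The built-in applies the same rule across all parameters at once
w = torch.nn.Parameter(torch.zeros(2))
w.grad = torch.tensor([3.0, 4.0])
torch.nn.utils.clip_grad_norm_([w], max_norm=1.0)
print(w.grad)     # ≈ [0.6, 0.8]
```

Note that clipping rescales the whole gradient vector, so its direction is unchanged — only its length is capped.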
Epsilon Values
Adding a tiny ε = 1e-8 prevents division by zero. Adam optimizer: update = m / (√v + ε). Layer norm: x / (σ + ε). Without ε, a zero variance or zero denominator crashes training. It’s the smallest trick with the biggest impact.
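A minimal sketch of why the ε matters, using made-up Adam moment estimates m and v with one zero entry:

```python
import torch

m = torch.tensor([1.0, 1.0])     # first-moment estimate
v = torch.tensor([0.0, 4.0])     # second-moment estimate; first entry is zero
eps = 1e-8

# Without epsilon: 1 / sqrt(0) → inf, and the update destroys the weight
print(m / v.sqrt())          # [inf, 0.5]

# With epsilon: the update stays finite even when v = 0
print(m / (v.sqrt() + eps))  # [1e8, 0.5]
```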
All The Tricks
# 1. Gradient clipping
torch.nn.utils.clip_grad_norm_(
    model.parameters(), max_norm=1.0
)

# 2. Epsilon in Adam
optimizer = torch.optim.Adam(
    params, lr=3e-4, eps=1e-8
)

# 3. Log-space for probabilities
# Never multiply probs; add log-probs
log_prob = log_p1 + log_p2 + log_p3

# 4. Numerically stable sigmoid
# σ(x) = 1/(1+e^-x) can overflow
# Use torch.sigmoid() (handles it)

# 5. Weight initialization
# Xavier/He init keeps activations
# in a numerically stable range

# 6. Gradient checkpointing
# Recompute activations instead of storing them
# Trades compute for memory
Numerical Stability Checklist
The rules that prevent NaN disasters
The Big Picture
Numerical stability is the unsung hero of deep learning. Without the log-sum-exp trick, no softmax. Without mixed precision, no LLMs. Without gradient clipping, no RNNs. Without epsilon, no Adam optimizer. These aren’t optional optimizations — they’re requirements for training to work at all.
The complete checklist: (1) Use fused ops (CrossEntropyLoss, not softmax+log). (2) Work in log-space for probabilities. (3) Clip gradients. (4) Add epsilon to denominators. (5) Use mixed precision with GradScaler. (6) Initialize weights properly. (7) Use normalization layers. (8) Monitor for NaN/Inf during training.
Debugging NaN
# When you see NaN in training:

# 1. Check the loss for NaN
if torch.isnan(loss):
    print("NaN detected!")

# 2. Enable anomaly detection
torch.autograd.set_detect_anomaly(True)

# 3. Common causes:
#    - Learning rate too high → exploding grads
#    - log(0) or 0/0 somewhere
#    - Missing epsilon in a normalization
#    - FP16 overflow without GradScaler
#    - Bad data (NaN in the inputs)

# 4. Prevention:
#    - Use torch.nan_to_num() as a safety net
#    - Log training metrics every N steps
#    - Save checkpoints frequently
#    - Start with a small learning rate
Real World
Building a bridge: use safety factors, test materials, inspect regularly
In AI
Training a model: use epsilon, clip gradients, monitor for NaN, checkpoint