Ch 13 — Numerical Methods & Stability

When computers lie about math — floating point, overflow, and tricks that save training
Floating-Point Numbers
Why 0.1 + 0.2 ≠ 0.3 on a computer
The Analogy
Imagine measuring with a ruler that only has markings every millimeter. You can’t represent 1.5mm exactly — you round to 1mm or 2mm. Floating-point numbers are like a ruler with uneven spacing: very precise near zero, increasingly imprecise for large numbers. Computers store numbers with ~7 decimal digits of precision (float32) or ~16 digits (float64). Everything else gets rounded.
Key insight: In Python, 0.1 + 0.2 = 0.30000000000000004, not 0.3. This isn’t a bug — it’s fundamental to how computers represent numbers. In deep learning, these tiny errors accumulate over billions of operations. A model with 175 billion parameters doing millions of multiply-adds can amplify rounding errors into NaN (Not a Number) crashes.
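You can verify both claims directly — plain Python plus NumPy, nothing else assumed:

```python
import math

import numpy as np

# float64: 0.1 and 0.2 are each rounded on entry, so their sum misses 0.3
a = 0.1 + 0.2
print(a)                      # 0.30000000000000004
print(a == 0.3)               # False

# The right comparison: allow a small tolerance
print(math.isclose(a, 0.3))   # True

# float32 keeps only ~7 decimal digits:
# adding 1e-8 to 1.0 falls below its resolution and rounds away
print(np.float32(1.0) + np.float32(1e-8) == np.float32(1.0))  # True
```

This is why equality tests on floats are almost always a bug — compare with a tolerance instead.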
Float Formats in AI
# IEEE 754 floating-point formats:

# float64 (FP64): 1 sign + 11 exp + 52 mantissa
#   → ~16 decimal digits precision
#   → Range: ±10^308

# float32 (FP32): 1 sign + 8 exp + 23 mantissa
#   → ~7 decimal digits precision
#   → Range: ±3.4 × 10^38

# float16 (FP16): 1 sign + 5 exp + 10 mantissa
#   → ~3.3 decimal digits precision
#   → Range: ±65504 (very limited!)

# bfloat16 (BF16): 1 sign + 8 exp + 7 mantissa
#   → Same range as FP32, less precision
#   → Google's format, great for training

# Tradeoff: less precision = less memory
# FP32: 4 bytes/param → 700GB for 175B params
# FP16: 2 bytes/param → 350GB for 175B params
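The numbers in this table can be checked with NumPy's np.finfo (stock NumPy has no bfloat16, so only the three IEEE formats appear here):

```python
import numpy as np

# np.finfo reports the precision and range of each IEEE format
for dtype in (np.float64, np.float32, np.float16):
    info = np.finfo(dtype)
    print(dtype.__name__, "eps:", info.eps, "max:", info.max)

# float16 overflows where float32 is still comfortable
x16 = np.float16(60000) * np.float16(2)   # exceeds float16's max of 65504
x32 = np.float32(60000) * np.float32(2)
print(x16)   # inf
print(x32)   # 120000.0
```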
Overflow & Underflow
When numbers get too big or too small
The Analogy
Overflow is like a car odometer that rolls past 999,999 and resets to 000,000. The number is too big to represent, so it becomes “infinity” (Inf). Underflow is like trying to measure an atom with a kitchen scale — the number is so small it rounds to zero. Both are catastrophic in training: Inf or 0 in a gradient means your model stops learning.
Critical in AI: Softmax computes e^x. If x = 1000, e^1000 = Inf (overflow!). If x = -1000, e^-1000 = 0 (underflow!). Multiplying many probabilities: 0.01 × 0.01 × ... (100 times) = 10^-200, which underflows to 0 in float32 (float64 holds out only until ~10^-308). These aren’t edge cases — they happen in every training run without proper numerical tricks.
Worked Example
import numpy as np

# Overflow: number too large
np.float32(1e38) * np.float32(10)     # → inf (overflow!)

# Underflow: number too small
np.float32(1e-38) * np.float32(1e-8)  # → 0.0 (underflow!)

# Real problem: multiplying probabilities
probs = [0.01] * 200
product = np.prod(probs)  # → 0.0 (underflows even in float64)

# Solution: work in log space!
log_product = np.sum(np.log(probs))  # → -921.03 (perfectly fine!)

# NaN: 0/0, inf - inf, sqrt(-1)
# Once NaN enters, it infects everything:
# NaN + anything = NaN
The Log-Sum-Exp Trick
The single most important numerical trick in ML
The Analogy
Imagine comparing skyscrapers by height. Instead of measuring from sea level (huge numbers, hard to compare), you measure from the shortest building (small, manageable numbers). The log-sum-exp trick does the same: subtract the maximum value before exponentiating. This keeps numbers in a safe range without changing the mathematical result.
Key insight: Every time you call torch.logsumexp(), nn.CrossEntropyLoss(), or F.log_softmax(), PyTorch uses this trick internally. Without it, training any model with softmax (which is every classifier and every transformer) would crash with overflow/underflow. This one trick makes modern deep learning possible.
The Math & Code
# Problem: log(Σ e^xᵢ) overflows
# e.g., x = [1000, 1001, 999]
# e^1000 = inf → crash!

# Solution: subtract the max first
# log(Σ e^xᵢ) = M + log(Σ e^(xᵢ - M)), where M = max(x)

import numpy as np

x = np.array([1000.0, 1001.0, 999.0])

# Naive (overflows):
# np.log(np.sum(np.exp(x)))  # → inf

# Safe (log-sum-exp trick):
M = np.max(x)  # 1001
result = M + np.log(np.sum(np.exp(x - M)))
# x - M = [-1, 0, -2]
# e^[-1, 0, -2] ≈ [0.37, 1.0, 0.14]
# sum ≈ 1.50, log ≈ 0.41
# result ≈ 1001.41 ✓

# PyTorch does this automatically:
import torch
torch.logsumexp(torch.tensor(x), dim=0)  # → 1001.41
Numerically Stable Softmax
Why PyTorch combines softmax + cross-entropy into one function
The Analogy
Computing softmax naively is like converting currencies through 5 intermediate currencies — you lose precision at each step. Fused operations combine multiple steps into one, like converting directly. PyTorch’s nn.CrossEntropyLoss fuses softmax + log + negative into one numerically stable operation. That’s why you pass raw logits, not softmax probabilities.
Key insight: If you see code that does loss = -log(softmax(logits)) in two steps, it’s a bug waiting to happen. The softmax might produce 0 for some classes (underflow), then log(0) = -Inf. PyTorch’s fused CrossEntropyLoss(logits, targets) avoids this entirely using the log-sum-exp trick internally.
Safe vs. Unsafe
import torch
import torch.nn.functional as F

logits = torch.tensor([100.0, 200.0, 300.0])

# UNSAFE: two-step softmax + log
probs = torch.softmax(logits, dim=0)
# → [0.0, ~0.0, 1.0] — the first entry underflows to exactly 0
log_probs = torch.log(probs)
# → [-inf, -100.0, 0.0] — log(0) = -inf (NaN risk downstream!)

# SAFE: fused log_softmax
log_probs = F.log_softmax(logits, dim=0)
# → [-200.0, -100.0, 0.0] (correct!)

# SAFEST: fused cross-entropy loss
loss = F.cross_entropy(
    logits.unsqueeze(0),  # shape (1, 3): a batch of one
    torch.tensor([2]),    # target class index
)
# Handles everything internally
Unsafe
log(softmax(x)) — underflow then -Inf
Safe
log_softmax(x) — fused, numerically stable
Mixed Precision Training
Use less precision to train faster — without losing quality
The Analogy
Imagine writing a rough draft in pencil (fast, imprecise) and the final version in pen (slow, precise). Mixed precision does the same: forward and backward passes use FP16/BF16 (fast, 2× less memory) while the master weights stay in FP32 (precise). The gradients are computed cheaply in low precision, then applied precisely to the high-precision weights.
Key insight: Mixed precision training is why modern LLMs are trainable at all. Training GPT-3 in pure FP32 would need ~700GB just for parameters. BF16 halves that to ~350GB. NVIDIA’s Tensor Cores are 8× faster in FP16 than FP32. Mixed precision gives you 2× memory savings and 2-3× speed boost with virtually no quality loss.
In Practice
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for batch in dataloader:
    optimizer.zero_grad()

    # Forward pass in FP16 (fast!)
    with autocast():
        output = model(batch)
        loss = criterion(output, targets)

    # Scale loss to prevent FP16 gradient underflow
    scaler.scale(loss).backward()

    # Unscale gradients and update FP32 master weights
    scaler.step(optimizer)
    scaler.update()

# GradScaler: multiplies the loss by a large factor
# before backward (prevents FP16 underflow),
# then divides gradients back before the update
Conditioning & Numerical Sensitivity
Some problems amplify errors, others absorb them
The Analogy
A well-conditioned problem is like a sturdy bridge — small vibrations don’t cause collapse. An ill-conditioned problem is like a house of cards — a tiny breeze topples everything. The condition number measures this sensitivity. Matrix inversion with a high condition number amplifies tiny rounding errors into huge mistakes.
Key insight: This is why batch normalization and layer normalization work so well. They keep the condition number of the weight matrices low by normalizing activations. Without normalization, deep networks become ill-conditioned — gradients either explode or vanish because tiny errors get amplified through 100+ layers.
Condition Number
# Condition number: κ(A) = σ_max / σ_min
# (ratio of largest to smallest singular value)
# κ ≈ 1:  well-conditioned (stable)
# κ >> 1: ill-conditioned (unstable)

import torch

A_good = torch.tensor([[2.0, 0.0], [0.0, 1.0]])
# κ = 2/1 = 2 (well-conditioned)

A_bad = torch.tensor([[1.0, 1.0], [1.0, 1.001]])
# κ ≈ 4002 (ill-conditioned!)
# Tiny change in input → huge change in output

# Why normalization helps:
# BatchNorm keeps activations ~ N(0, 1)
# → Weight matrices stay well-conditioned
# → Gradients flow smoothly through layers
# → Training is stable and fast
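To see the amplification concretely, here is a small sketch that solves A_bad·x = b twice using torch.linalg.cond and torch.linalg.solve; the right-hand sides are illustrative values chosen so the exact answers are simple:

```python
import torch

A_bad = torch.tensor([[1.0, 1.0], [1.0, 1.001]], dtype=torch.float64)
print(torch.linalg.cond(A_bad))   # ≈ 4002

# Solve A x = b, then nudge b by ~0.05% and solve again
b = torch.tensor([2.0, 2.001], dtype=torch.float64)
x1 = torch.linalg.solve(A_bad, b)
x2 = torch.linalg.solve(A_bad, b + torch.tensor([0.0, 0.001], dtype=torch.float64))
print(x1)   # [1., 1.]
print(x2)   # [0., 2.] — a tiny input change flipped the answer
```

A 0.001 nudge in one entry of b moved each component of the solution by a full 1.0 — roughly the ×4000 amplification the condition number predicts.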
Essential Numerical Tricks
The toolkit every ML engineer needs
Gradient Clipping
Gradient clipping caps the gradient magnitude to prevent exploding gradients. If ||g|| > threshold, scale it down: g → g × (threshold / ||g||). Used in virtually every RNN and transformer training. Without it, a single bad batch can send gradients to Inf and destroy the model.
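The clipping rule can be checked by hand against PyTorch's built-in; the [3, 4] gradient is just an illustrative value with norm 5:

```python
import torch

# The rule by hand: if ||g|| > threshold, rescale g by threshold / ||g||
g = torch.tensor([3.0, 4.0])          # ||g|| = 5
threshold = 1.0
norm = g.norm()
if norm > threshold:
    g = g * (threshold / norm)
print(g)          # [0.6, 0.8] — direction preserved, norm capped at 1.0

# The built-in applies the same rule across all parameters at once
w = torch.nn.Parameter(torch.zeros(2))
w.grad = torch.tensor([3.0, 4.0])
torch.nn.utils.clip_grad_norm_([w], max_norm=1.0)
print(w.grad)     # ≈ [0.6, 0.8]
```

Note that clipping rescales the whole gradient vector, so its direction is unchanged — only its length is capped.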
Epsilon Values
Adding a tiny ε = 1e-8 prevents division by zero. Adam optimizer: update = m / (√v + ε). Layer norm: x / (σ + ε). Without ε, a zero variance or zero denominator crashes training. It’s the smallest trick with the biggest impact.
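A minimal sketch of why the ε matters, using made-up Adam moment estimates m and v with one zero entry:

```python
import torch

m = torch.tensor([1.0, 1.0])     # first-moment estimate
v = torch.tensor([0.0, 4.0])     # second-moment estimate; first entry is zero
eps = 1e-8

# Without epsilon: 1 / sqrt(0) → inf, and the update destroys the weight
print(m / v.sqrt())          # [inf, 0.5]

# With epsilon: the update stays finite even when v = 0
print(m / (v.sqrt() + eps))  # [1e8, 0.5]
```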
All The Tricks
# 1. Gradient clipping
torch.nn.utils.clip_grad_norm_(
    model.parameters(), max_norm=1.0
)

# 2. Epsilon in Adam
optimizer = torch.optim.Adam(
    params, lr=3e-4, eps=1e-8
)

# 3. Log-space for probabilities
# Never multiply probs; add log-probs
log_prob = log_p1 + log_p2 + log_p3

# 4. Numerically stable sigmoid
# σ(x) = 1/(1+e^-x) can overflow
# Use torch.sigmoid() (handles it)

# 5. Weight initialization
# Xavier/He init keeps activations
# in a numerically stable range

# 6. Gradient checkpointing
# Recompute activations instead of storing them
# Trades compute for memory
Numerical Stability Checklist
The rules that prevent NaN disasters
The Big Picture
Numerical stability is the unsung hero of deep learning. Without the log-sum-exp trick, no softmax. Without mixed precision, no LLMs. Without gradient clipping, no RNNs. Without epsilon, no Adam optimizer. These aren’t optional optimizations — they’re requirements for training to work at all.
The complete checklist: (1) Use fused ops (CrossEntropyLoss, not softmax+log). (2) Work in log-space for probabilities. (3) Clip gradients. (4) Add epsilon to denominators. (5) Use mixed precision with GradScaler. (6) Initialize weights properly. (7) Use normalization layers. (8) Monitor for NaN/Inf during training.
Debugging NaN
# When you see NaN in training:

# 1. Check the loss for NaN
if torch.isnan(loss):
    print("NaN detected!")

# 2. Enable anomaly detection
torch.autograd.set_detect_anomaly(True)

# 3. Common causes:
#    - Learning rate too high → exploding grads
#    - log(0) or 0/0 somewhere
#    - Missing epsilon in a normalization
#    - FP16 overflow without GradScaler
#    - Bad data (NaN in the inputs)

# 4. Prevention:
#    - Use torch.nan_to_num() as a safety net
#    - Log training metrics every N steps
#    - Save checkpoints frequently
#    - Start with a small learning rate
Real World
Building a bridge: use safety factors, test materials, inspect regularly
In AI
Training a model: use epsilon, clip gradients, monitor for NaN, checkpoint