Ch 6 — Training Neural Networks — Under the Hood

Backprop calculus, Adam internals, batch norm math, residual gradient flow, and mixed-precision training
A. Backpropagation Calculus: chain rule · computational graphs · gradient flow

- Forward pass: compute activations layer by layer
- Backward pass: chain-rule gradients, from the output back to the input
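The two passes can be sketched on a hypothetical two-layer scalar network, y = w2 * tanh(w1 * x) with squared-error loss. The forward pass stores intermediates; the backward pass applies the chain rule to them, step by step, from the loss back to each weight:

```python
import math

def forward(x, w1, w2):
    z = w1 * x          # pre-activation
    h = math.tanh(z)    # hidden activation
    y = w2 * h          # output
    return z, h, y

def backward(x, t, w1, w2):
    z, h, y = forward(x, w1, w2)
    loss = (y - t) ** 2
    dy = 2 * (y - t)          # dL/dy
    dw2 = dy * h              # dL/dw2 = dL/dy * dy/dw2
    dh = dy * w2              # chain back through the output layer
    dz = dh * (1 - h ** 2)    # tanh'(z) = 1 - tanh(z)^2
    dw1 = dz * x              # dL/dw1 = dL/dz * dz/dw1
    return loss, dw1, dw2

loss, dw1, dw2 = backward(x=0.5, t=1.0, w1=0.3, w2=-0.7)
```

Every quantity the backward pass needs (h, z) was saved during the forward pass, which is why training requires storing activations.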
From gradients to weight updates: optimizer internals.
B. Optimizer Internals: SGD + momentum · Adam math · learning rate schedules

- SGD + momentum: velocity accumulation via an exponential moving average of gradients
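A minimal sketch of the momentum update (illustrative values lr = 0.1, beta = 0.9): the velocity is an exponentially decaying accumulation of past gradients, so consistent gradient directions build up speed while noisy components cancel.

```python
def sgd_momentum_step(w, v, grad, lr=0.1, beta=0.9):
    v = beta * v + grad    # accumulate velocity from past gradients
    w = w - lr * v         # step along the velocity, not the raw gradient
    return w, v

# Minimizing f(w) = w^2, whose gradient is 2w:
w, v = 5.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, v, 2 * w)
```

This is the accumulation form PyTorch's SGD uses; some texts instead write v = beta * v + (1 - beta) * grad, which only rescales the effective learning rate.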
- Adam: adaptive moments with bias correction
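One Adam step, assuming the standard defaults (beta1 = 0.9, beta2 = 0.999, eps = 1e-8): m and v are exponential moving averages of the gradient and squared gradient, and the bias-correction terms compensate for their zero initialization, which otherwise shrinks early updates.

```python
import math

def adam_step(w, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad         # first moment: EMA of gradients
    v = b2 * v + (1 - b2) * grad ** 2    # second moment: EMA of squared gradients
    m_hat = m / (1 - b1 ** t)            # bias correction (t is 1-indexed)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

At t = 1 the correction exactly undoes the zero initialization, so the first step has magnitude close to lr regardless of the gradient's scale; that per-parameter normalization is what "adaptive" means here.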
- LR schedule: warmup and cosine decay strategies
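The common warmup-then-cosine shape, with illustrative values (peak LR 3e-4, 100 warmup steps, 1000 total steps): the rate ramps linearly from zero, then follows a half-cosine down to zero.

```python
import math

def lr_at(step, peak=3e-4, warmup=100, total=1000):
    if step < warmup:
        return peak * step / warmup                  # linear warmup from 0
    progress = (step - warmup) / (total - warmup)    # 0 -> 1 after warmup
    return peak * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay
```

Warmup avoids large early steps while Adam's moment estimates are still noisy; cosine decay gives a smooth approach to zero at the end of training.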
Stabilizing training: batch norm and layer norm.
C. Normalization Techniques: batch norm math · layer norm · when to use each

- Batch norm: normalize across the batch dimension
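Batch norm in training mode, sketched over a (batch, features) activation matrix: each feature is standardized using statistics computed across the batch, then rescaled by the learnable affine parameters (identity gamma/beta assumed here).

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mean = x.mean(axis=0)    # one mean per feature, over the batch
    var = x.var(axis=0)      # one variance per feature
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

At inference, running averages collected during training replace the batch statistics, which is why batch norm behaves differently in train and eval modes and degrades with very small batches.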
- Layer norm: normalize across the feature dimension
Residual connections and gradient highways.
D. Skip Connections & Regularization: residual gradient flow · dropout math · weight decay

- Residual block: y = F(x) + x, a gradient highway
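The "highway" is visible in the derivative: for y = F(x) + x, dy/dx = F'(x) + 1, so even when the transform branch's gradient vanishes, the identity path still carries gradient 1. A scalar sketch with a hypothetical F(x) = w * relu(x):

```python
def residual_grad(x, w):
    relu_grad = 1.0 if x > 0 else 0.0
    f_grad = w * relu_grad    # gradient through the transform branch F
    return f_grad + 1.0       # the identity skip adds 1 to the gradient
```

With x < 0 the ReLU branch is dead (gradient 0), yet the block's gradient is still 1; stacked across many layers, these +1 terms keep deep networks trainable.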
8
blur_on
DropoutRandom masking
inverted scaling
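Inverted dropout in one line: zero each activation with probability p at train time and divide survivors by (1 - p), so the expected activation matches test time and inference needs no rescaling. Values here (p = 0.5, a fixed seed) are illustrative.

```python
import random

def dropout(xs, p=0.5, seed=0):
    rng = random.Random(seed)
    # Drop with probability p; scale survivors by 1/(1-p) to keep E[x] unchanged.
    return [0.0 if rng.random() < p else x / (1 - p) for x in xs]
```

The older "vanilla" formulation skipped the scaling at train time and multiplied by (1 - p) at test time instead; inverted scaling is what modern frameworks implement.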
Scaling training: distributed and mixed precision.
E. Training at Scale: mixed precision · data parallelism · gradient accumulation

- Mixed precision: FP16 forward/backward with FP32 master weights
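Why loss scaling is needed, sketched with NumPy's float16 (scale factor 65536 assumed): FP16's smallest subnormal is about 6e-8, so a tiny gradient underflows to zero. Multiplying the loss, and hence every gradient, by the scale keeps it representable; the optimizer unscales in FP32 before updating the master weights.

```python
import numpy as np

SCALE = 65536.0
true_grad = 2e-8                                 # below FP16's representable range

naive_fp16 = np.float16(true_grad)               # underflows to 0.0 in FP16
scaled_fp16 = np.float16(true_grad * SCALE)      # survives after loss scaling
recovered = float(scaled_fp16) / SCALE           # unscale in full precision
```

Dynamic loss scaling automates the choice of SCALE: grow it while gradients stay finite, shrink it when an overflow (inf/nan) appears, and skip that step's update.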
- Distributed training: data parallelism and model parallelism
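Data parallelism in miniature: each of N workers holds a replica of the weights and computes gradients on its own shard of the batch, then an all-reduce averages the gradients so every replica applies the same update. The loss and shard values below are illustrative, and the helper stands in for a real ring all-reduce.

```python
def allreduce_mean(grads):
    # Stand-in for a ring all-reduce: every worker ends up with the mean.
    return sum(grads) / len(grads)

# Per-example loss (w - x)^2, gradient 2(w - x); 4 workers, one example each.
w = 0.0
shards = [1.0, 2.0, 3.0, 4.0]
local_grads = [2 * (w - x) for x in shards]   # computed independently per worker
g = allreduce_mean(local_grads)               # identical result on all workers
w -= 0.1 * g                                  # every replica takes the same step
```

Model parallelism splits the network itself (layers or tensor shards) across devices instead of the data; the two are routinely combined at large scale.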