Ch 6 — Training Neural Networks — Under the Hood

Backprop calculus, Adam internals, batch norm math, residual gradient flow, and mixed-precision training
A. Backpropagation Calculus: chain rule · computational graphs · gradient flow

- Forward pass: compute activations layer by layer
- Backward pass: chain-rule gradients, from the output back to the input
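The two passes can be sketched on a hypothetical two-layer scalar network, y = w2 * tanh(w1 * x) with squared-error loss. The forward pass stores intermediates; the backward pass applies the chain rule to them, step by step, from the loss back to each weight:

```python
import math

def forward(x, w1, w2):
    z = w1 * x          # pre-activation
    h = math.tanh(z)    # hidden activation
    y = w2 * h          # output
    return z, h, y

def backward(x, t, w1, w2):
    z, h, y = forward(x, w1, w2)
    loss = (y - t) ** 2
    dy = 2 * (y - t)          # dL/dy
    dw2 = dy * h              # dL/dw2 = dL/dy * dy/dw2
    dh = dy * w2              # chain back through the output layer
    dz = dh * (1 - h ** 2)    # tanh'(z) = 1 - tanh(z)^2
    dw1 = dz * x              # dL/dw1 = dL/dz * dz/dw1
    return loss, dw1, dw2

loss, dw1, dw2 = backward(x=0.5, t=1.0, w1=0.3, w2=-0.7)
```

Every quantity the backward pass needs (h, z) was saved during the forward pass, which is why training requires storing activations.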
From gradients to weight updates: optimizer internals.
B. Optimizer Internals: SGD + momentum · Adam math · learning rate schedules

- SGD + momentum: velocity accumulation via an exponential moving average of gradients
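A minimal sketch of the momentum update (illustrative values lr = 0.1, beta = 0.9): the velocity is an exponentially decaying accumulation of past gradients, so consistent gradient directions build up speed while noisy components cancel.

```python
def sgd_momentum_step(w, v, grad, lr=0.1, beta=0.9):
    v = beta * v + grad    # accumulate velocity from past gradients
    w = w - lr * v         # step along the velocity, not the raw gradient
    return w, v

# Minimizing f(w) = w^2, whose gradient is 2w:
w, v = 5.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, v, 2 * w)
```

This is the accumulation form PyTorch's SGD uses; some texts instead write v = beta * v + (1 - beta) * grad, which only rescales the effective learning rate.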
- Adam: adaptive moments with bias correction
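One Adam step, assuming the standard defaults (beta1 = 0.9, beta2 = 0.999, eps = 1e-8): m and v are exponential moving averages of the gradient and squared gradient, and the bias-correction terms compensate for their zero initialization, which otherwise shrinks early updates.

```python
import math

def adam_step(w, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad         # first moment: EMA of gradients
    v = b2 * v + (1 - b2) * grad ** 2    # second moment: EMA of squared gradients
    m_hat = m / (1 - b1 ** t)            # bias correction (t is 1-indexed)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

At t = 1 the correction exactly undoes the zero initialization, so the first step has magnitude close to lr regardless of the gradient's scale; that per-parameter normalization is what "adaptive" means here.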
- LR schedule: warmup and cosine decay strategies
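The common warmup-then-cosine shape, with illustrative values (peak LR 3e-4, 100 warmup steps, 1000 total steps): the rate ramps linearly from zero, then follows a half-cosine down to zero.

```python
import math

def lr_at(step, peak=3e-4, warmup=100, total=1000):
    if step < warmup:
        return peak * step / warmup                  # linear warmup from 0
    progress = (step - warmup) / (total - warmup)    # 0 -> 1 after warmup
    return peak * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay
```

Warmup avoids large early steps while Adam's moment estimates are still noisy; cosine decay gives a smooth approach to zero at the end of training.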
Stabilizing training: batch norm and layer norm.
C. Normalization Techniques: batch norm math · layer norm · when to use each

- Batch norm: normalize across the batch dimension
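Batch norm in training mode, sketched over a (batch, features) activation matrix: each feature is standardized using statistics computed across the batch, then rescaled by the learnable affine parameters (identity gamma/beta assumed here).

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mean = x.mean(axis=0)    # one mean per feature, over the batch
    var = x.var(axis=0)      # one variance per feature
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

At inference, running averages collected during training replace the batch statistics, which is why batch norm behaves differently in train and eval modes and degrades with very small batches.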
- Layer norm: normalize across the feature dimension
Residual connections and gradient highways.
D. Skip Connections & Regularization: residual gradient flow · dropout math · weight decay

- Residual block: y = F(x) + x, a gradient highway
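The "highway" is visible in the derivative: for y = F(x) + x, dy/dx = F'(x) + 1, so even when the transform branch's gradient vanishes, the identity path still carries gradient 1. A scalar sketch with a hypothetical F(x) = w * relu(x):

```python
def residual_grad(x, w):
    relu_grad = 1.0 if x > 0 else 0.0
    f_grad = w * relu_grad    # gradient through the transform branch F
    return f_grad + 1.0       # the identity skip adds 1 to the gradient
```

With x < 0 the ReLU branch is dead (gradient 0), yet the block's gradient is still 1; stacked across many layers, these +1 terms keep deep networks trainable.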
8
blur_on
DropoutRandom masking
inverted scaling
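Inverted dropout in one line: zero each activation with probability p at train time and divide survivors by (1 - p), so the expected activation matches test time and inference needs no rescaling. Values here (p = 0.5, a fixed seed) are illustrative.

```python
import random

def dropout(xs, p=0.5, seed=0):
    rng = random.Random(seed)
    # Drop with probability p; scale survivors by 1/(1-p) to keep E[x] unchanged.
    return [0.0 if rng.random() < p else x / (1 - p) for x in xs]
```

The older "vanilla" formulation skipped the scaling at train time and multiplied by (1 - p) at test time instead; inverted scaling is what modern frameworks implement.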
Scaling training: distributed and mixed precision.
E. Training at Scale: mixed precision · data parallelism · gradient accumulation

- Mixed precision: FP16 forward/backward with FP32 master weights
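Why loss scaling is needed, sketched with NumPy's float16 (scale factor 65536 assumed): FP16's smallest subnormal is about 6e-8, so a tiny gradient underflows to zero. Multiplying the loss, and hence every gradient, by the scale keeps it representable; the optimizer unscales in FP32 before updating the master weights.

```python
import numpy as np

SCALE = 65536.0
true_grad = 2e-8                                 # below FP16's representable range

naive_fp16 = np.float16(true_grad)               # underflows to 0.0 in FP16
scaled_fp16 = np.float16(true_grad * SCALE)      # survives after loss scaling
recovered = float(scaled_fp16) / SCALE           # unscale in full precision
```

Dynamic loss scaling automates the choice of SCALE: grow it while gradients stay finite, shrink it when an overflow (inf/nan) appears, and skip that step's update.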
- Distributed training: data parallelism and model parallelism
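Data parallelism in miniature: each of N workers holds a replica of the weights and computes gradients on its own shard of the batch, then an all-reduce averages the gradients so every replica applies the same update. The loss and shard values below are illustrative, and the helper stands in for a real ring all-reduce.

```python
def allreduce_mean(grads):
    # Stand-in for a ring all-reduce: every worker ends up with the mean.
    return sum(grads) / len(grads)

# Per-example loss (w - x)^2, gradient 2(w - x); 4 workers, one example each.
w = 0.0
shards = [1.0, 2.0, 3.0, 4.0]
local_grads = [2 * (w - x) for x in shards]   # computed independently per worker
g = allreduce_mean(local_grads)               # identical result on all workers
w -= 0.1 * g                                  # every replica takes the same step
```

Model parallelism splits the network itself (layers or tensor shards) across devices instead of the data; the two are routinely combined at large scale.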