Ch 12 — Reinforcement Learning Under the Hood

Bellman equations, DQN internals, PPO clipping, MCTS, and RLHF math
Contents
A · Bellman & Q-Learning
B · DQN Internals
C · Policy Gradient & PPO
D · AlphaZero & MCTS
E · RLHF Math
A · Bellman Equations & Q-Learning
Bellman Optimality Equation
V*(s), Q*(s,a), recursive value definition
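The recursive definitions this step covers are the standard Bellman optimality equations (with γ the discount factor and P the transition kernel); V*(s) = maxₐ Q*(s, a) links the two:

```latex
V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma\, V^*(s')\bigr]
\qquad
Q^*(s, a) = \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\bigr]
```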
grid_on
Q-Learning Algorithm
TD update, convergence, worked example
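The TD update covered here, Q(s,a) ← Q(s,a) + α[r + γ maxₐ′ Q(s′,a′) − Q(s,a)], can be sketched in plain Python on a toy two-state MDP (all names and the example MDP are illustrative, not from the chapter):

```python
def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular TD update: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    td_target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q

# Toy MDP: "go" from s0 reaches absorbing s1 with reward 1; s1 yields nothing.
Q = {"s0": {"go": 0.0, "stay": 0.0}, "s1": {"go": 0.0, "stay": 0.0}}
for _ in range(200):
    q_learning_step(Q, "s0", "go", 1.0, "s1")
# Q["s0"]["go"] converges geometrically toward the true value 1.0.
```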
B · DQN Internals
Experience Replay & Target Network
Buffer sampling, target stabilization, loss function
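A minimal sketch of the two mechanisms this step names, with illustrative names and a uniform-sampling buffer (prioritized variants come later in the chapter):

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay: store (s, a, r, s', done) tuples and sample i.i.d.
    minibatches, breaking the temporal correlation of consecutive transitions."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # old transitions evicted first

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

# The target network is a periodically-synced copy of the online net; the DQN loss
# (r + gamma * max_a' Q_target(s', a') - Q_online(s, a))**2 uses it so the
# regression target stays fixed between syncs, stabilizing training.
buf = ReplayBuffer(capacity=100)
for t in range(150):
    buf.push((t, 0, 0.0, t + 1, False))
batch = buf.sample(32)
```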
DQN Improvements
Double DQN, Dueling DQN, Prioritized Replay, Rainbow
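Of the improvements listed, Double DQN has the shortest sketch: decouple action *selection* (online net) from action *evaluation* (target net). Function and variable names here are illustrative:

```python
def double_dqn_target(r, q_online_next, q_target_next, gamma=0.99, done=False):
    """Double DQN: pick argmax with the online net, score it with the target net.
    Vanilla DQN maxes over the same (target) estimates, which is upward-biased."""
    if done:
        return r
    a_star = max(q_online_next, key=q_online_next.get)  # selection: online net
    return r + gamma * q_target_next[a_star]            # evaluation: target net

# Illustrative numbers: the online net prefers "left"; the target net scores it 1.0.
y = double_dqn_target(0.5, {"left": 2.0, "right": 1.0},
                           {"left": 1.0, "right": 3.0}, gamma=0.9)
# A vanilla max over the target values would have used 3.0 instead of 1.0.
```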
C · Policy Gradient & PPO
Policy Gradient Theorem
REINFORCE, log-probability trick, variance reduction
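The theorem and log-probability trick this step refers to, in the standard REINFORCE-with-baseline form (Gₜ is the return from step t; any state-dependent baseline b reduces variance without biasing the gradient):

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\Bigl[\,\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\bigl(G_t - b(s_t)\bigr)\Bigr]
```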
PPO Clipped Objective
Probability ratio, clipping, GAE, full algorithm
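The clipped surrogate this step builds up can be sketched per-sample in plain Python (names are illustrative; a real implementation would vectorize this over a batch and add the GAE advantage estimate):

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO objective: min(r*A, clip(r, 1-eps, 1+eps)*A),
    where r = pi_new(a|s) / pi_old(a|s) is the probability ratio."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)

# Ratio 1.5 with positive advantage: the clip caps the incentive at 1.2 * A,
# so the policy gains nothing from moving further than the trust region.
obj = ppo_clip_objective(logp_new=math.log(1.5), logp_old=0.0, advantage=2.0)
```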
D · AlphaZero & MCTS
Monte Carlo Tree Search
UCB1, selection, expansion, simulation, backpropagation
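The UCB1 rule used in the selection phase can be sketched as follows (names and the exploration constant are illustrative):

```python
import math

def ucb1_select(children, c=1.4):
    """Pick the child maximizing Q + c * sqrt(ln(N_parent) / n_child).
    children maps action -> (total_value, visit_count)."""
    n_parent = sum(n for _, n in children.values())

    def score(item):
        _, (w, n) = item
        if n == 0:
            return float("inf")  # always try unvisited actions first
        return w / n + c * math.sqrt(math.log(n_parent) / n)

    return max(children.items(), key=score)[0]

# Exploitation vs. exploration: "b" has a lower mean value (0.4 vs 0.5) but far
# fewer visits, so its exploration bonus makes it the selected child.
best = ucb1_select({"a": (50.0, 100), "b": (4.0, 10)})
```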
AlphaZero Training Loop
Self-play, network targets, loss function
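The loss this step covers combines both self-play targets (as in the AlphaZero paper): z is the game outcome, v the value head, π the MCTS visit distribution, p the policy head, and c an L2 regularization coefficient:

```latex
\ell = (z - v)^2 \;-\; \boldsymbol{\pi}^{\top} \log \mathbf{p} \;+\; c\,\lVert \theta \rVert^2
```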
E · RLHF Math
Reward Model & RLHF Objective
Bradley-Terry model, KL-constrained PPO
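The Bradley-Terry preference model gives P(y_w ≻ y_l | x) = σ(r(x, y_w) − r(x, y_l)); the reward model is trained on its negative log-likelihood, and the resulting reward is then maximized under a KL penalty against the reference policy. A minimal per-pair sketch of the loss (illustrative names; real code batches logits from the reward model):

```python
import math

def bradley_terry_nll(r_chosen, r_rejected):
    """Reward-model loss for one preference pair: -log sigmoid(r_w - r_l).
    Written directly as log(1 + exp(-delta)); a production version would use a
    numerically-stabilized softplus."""
    delta = r_chosen - r_rejected
    return math.log(1.0 + math.exp(-delta))

# Equal rewards mean the model is indifferent: loss = -log(0.5) = ln 2.
loss = bradley_terry_nll(1.0, 1.0)
```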
DPO & Modern Alternatives
DPO derivation, GRPO, comparison
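The DPO objective derived in this step replaces the explicit reward model with the policy's own log-ratios against a frozen reference model (σ is the logistic function, β the KL-penalty strength, y_w/y_l the chosen/rejected completions):

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\Bigl(\beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Bigr)\right]
```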