Ch 12 — Reinforcement Learning Under the Hood

Bellman equations, DQN internals, PPO clipping, MCTS, and RLHF math
Contents
A · Bellman & Q-Learning
B · DQN Internals
C · Policy Gradient & PPO
D · AlphaZero & MCTS
E · RLHF Math
A · Bellman Equations & Q-Learning
Bellman Optimality Equation
V*(s), Q*(s,a), recursive value definition
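The recursive definitions this step covers are the standard Bellman optimality equations (with γ the discount factor and P the transition kernel); V*(s) = maxₐ Q*(s, a) links the two:

```latex
V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma\, V^*(s')\bigr]
\qquad
Q^*(s, a) = \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\bigr]
```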
grid_on
Q-Learning Algorithm
TD update, convergence, worked example
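The TD update covered here, Q(s,a) ← Q(s,a) + α[r + γ maxₐ′ Q(s′,a′) − Q(s,a)], can be sketched in plain Python on a toy two-state MDP (all names and the example MDP are illustrative, not from the chapter):

```python
def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular TD update: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    td_target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q

# Toy MDP: "go" from s0 reaches absorbing s1 with reward 1; s1 yields nothing.
Q = {"s0": {"go": 0.0, "stay": 0.0}, "s1": {"go": 0.0, "stay": 0.0}}
for _ in range(200):
    q_learning_step(Q, "s0", "go", 1.0, "s1")
# Q["s0"]["go"] converges geometrically toward the true value 1.0.
```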
B · DQN Internals
Experience Replay & Target Network
Buffer sampling, target stabilization, loss function
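A minimal sketch of the two mechanisms this step names, with illustrative names and a uniform-sampling buffer (prioritized variants come later in the chapter):

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay: store (s, a, r, s', done) tuples and sample i.i.d.
    minibatches, breaking the temporal correlation of consecutive transitions."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # old transitions evicted first

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

# The target network is a periodically-synced copy of the online net; the DQN loss
# (r + gamma * max_a' Q_target(s', a') - Q_online(s, a))**2 uses it so the
# regression target stays fixed between syncs, stabilizing training.
buf = ReplayBuffer(capacity=100)
for t in range(150):
    buf.push((t, 0, 0.0, t + 1, False))
batch = buf.sample(32)
```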
DQN Improvements
Double DQN, Dueling DQN, Prioritized Replay, Rainbow
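Of the improvements listed, Double DQN has the shortest sketch: decouple action *selection* (online net) from action *evaluation* (target net). Function and variable names here are illustrative:

```python
def double_dqn_target(r, q_online_next, q_target_next, gamma=0.99, done=False):
    """Double DQN: pick argmax with the online net, score it with the target net.
    Vanilla DQN maxes over the same (target) estimates, which is upward-biased."""
    if done:
        return r
    a_star = max(q_online_next, key=q_online_next.get)  # selection: online net
    return r + gamma * q_target_next[a_star]            # evaluation: target net

# Illustrative numbers: the online net prefers "left"; the target net scores it 1.0.
y = double_dqn_target(0.5, {"left": 2.0, "right": 1.0},
                           {"left": 1.0, "right": 3.0}, gamma=0.9)
# A vanilla max over the target values would have used 3.0 instead of 1.0.
```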
C · Policy Gradient & PPO
Policy Gradient Theorem
REINFORCE, log-probability trick, variance reduction
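The theorem and log-probability trick this step refers to, in the standard REINFORCE-with-baseline form (Gₜ is the return from step t; any state-dependent baseline b reduces variance without biasing the gradient):

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\Bigl[\,\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\bigl(G_t - b(s_t)\bigr)\Bigr]
```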
PPO Clipped Objective
Probability ratio, clipping, GAE, full algorithm
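The clipped surrogate this step builds up can be sketched per-sample in plain Python (names are illustrative; a real implementation would vectorize this over a batch and add the GAE advantage estimate):

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO objective: min(r*A, clip(r, 1-eps, 1+eps)*A),
    where r = pi_new(a|s) / pi_old(a|s) is the probability ratio."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)

# Ratio 1.5 with positive advantage: the clip caps the incentive at 1.2 * A,
# so the policy gains nothing from moving further than the trust region.
obj = ppo_clip_objective(logp_new=math.log(1.5), logp_old=0.0, advantage=2.0)
```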
D · AlphaZero & MCTS
Monte Carlo Tree Search
UCB1, selection, expansion, simulation, backpropagation
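The UCB1 rule used in the selection phase can be sketched as follows (names and the exploration constant are illustrative):

```python
import math

def ucb1_select(children, c=1.4):
    """Pick the child maximizing Q + c * sqrt(ln(N_parent) / n_child).
    children maps action -> (total_value, visit_count)."""
    n_parent = sum(n for _, n in children.values())

    def score(item):
        _, (w, n) = item
        if n == 0:
            return float("inf")  # always try unvisited actions first
        return w / n + c * math.sqrt(math.log(n_parent) / n)

    return max(children.items(), key=score)[0]

# Exploitation vs. exploration: "b" has a lower mean value (0.4 vs 0.5) but far
# fewer visits, so its exploration bonus makes it the selected child.
best = ucb1_select({"a": (50.0, 100), "b": (4.0, 10)})
```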
AlphaZero Training Loop
Self-play, network targets, loss function
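The loss this step covers combines both self-play targets (as in the AlphaZero paper): z is the game outcome, v the value head, π the MCTS visit distribution, p the policy head, and c an L2 regularization coefficient:

```latex
\ell = (z - v)^2 \;-\; \boldsymbol{\pi}^{\top} \log \mathbf{p} \;+\; c\,\lVert \theta \rVert^2
```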
E · RLHF Math
Reward Model & RLHF Objective
Bradley-Terry model, KL-constrained PPO
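The Bradley-Terry preference model gives P(y_w ≻ y_l | x) = σ(r(x, y_w) − r(x, y_l)); the reward model is trained on its negative log-likelihood, and the resulting reward is then maximized under a KL penalty against the reference policy. A minimal per-pair sketch of the loss (illustrative names; real code batches logits from the reward model):

```python
import math

def bradley_terry_nll(r_chosen, r_rejected):
    """Reward-model loss for one preference pair: -log sigmoid(r_w - r_l).
    Written directly as log(1 + exp(-delta)); a production version would use a
    numerically-stabilized softplus."""
    delta = r_chosen - r_rejected
    return math.log(1.0 + math.exp(-delta))

# Equal rewards mean the model is indifferent: loss = -log(0.5) = ln 2.
loss = bradley_terry_nll(1.0, 1.0)
```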
DPO & Modern Alternatives
DPO derivation, GRPO, comparison
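The DPO objective derived in this step replaces the explicit reward model with the policy's own log-ratios against a frozen reference model (σ is the logistic function, β the KL-penalty strength, y_w/y_l the chosen/rejected completions):

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\Bigl(\beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Bigr)\right]
```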