Ch 12 — Reinforcement Learning

Learning from rewards — from Q-tables to AlphaGo to RLHF
High Level
Agent → Q-Learning → Deep RL → Policy Gradient → AlphaGo → RLHF
The RL Framework
Agent, environment, state, action, reward
How RL Differs
Unlike supervised learning (labeled data) or unsupervised learning (find patterns), RL learns from interaction. An agent takes actions in an environment, receives rewards, and learns a policy — a strategy that maximizes cumulative reward over time. No teacher provides correct answers; the agent discovers them through trial and error.
# The RL loop
Agent observes state s
Agent chooses action a based on policy π
Environment returns reward r and next state s′
Agent updates policy to maximize total reward
Repeat forever

Goal: maximize E[∑ γ^t · r_t]
# γ = discount factor (0.99 typical)
# Future rewards worth less than immediate
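The loop above can be sketched as runnable Python. The two-state environment, its rewards, and the random policy below are illustrative stand-ins, not a real task:

```python
import random

# Toy two-state environment: action 1 in state 0 earns a reward
# and moves to state 1; state 1 resets to state 0.
def toy_env_step(state, action):
    """Return (reward, next_state)."""
    if state == 0:
        return (1.0, 1) if action == 1 else (0.0, 0)
    return (0.0, 0)

def discounted_return(rewards, gamma=0.99):
    """G = sum_t gamma^t * r_t, the quantity the agent maximizes."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

state, rewards = 0, []
for _ in range(10):                    # the agent-environment loop
    action = random.choice([0, 1])     # a random policy, for illustration
    reward, state = toy_env_step(state, action)
    rewards.append(reward)

print(discounted_return(rewards))
```

A real agent would replace `random.choice` with a learned policy and use the observed rewards to improve it, which is exactly what the rest of this chapter covers.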
Key Challenges
Exploration vs Exploitation: Try new actions (explore) or stick with what works (exploit)?
Credit Assignment: Which past action caused the reward? A goal is scored 100 moves later; which move was the good one?
Sparse Rewards: In chess, the reward arrives only at game end. Thousands of moves, one reward signal.
Sample Efficiency: RL needs millions of interactions, far more than supervised learning.
RL is the third paradigm of machine learning (alongside supervised and unsupervised). It’s how AlphaGo learned to beat world champions, how robots learn to walk, and how ChatGPT was aligned with human preferences via RLHF.
Q-Learning & Value Functions
Learning the value of state-action pairs
Value-Based RL
The Q-function Q(s,a) estimates the total future reward of taking action a in state s, then following the optimal policy. Q-learning updates this estimate after each experience using the Bellman equation. Once Q is learned, the optimal policy is simply: pick the action with the highest Q-value.
# Q-learning update rule
Q(s,a) ← Q(s,a) + α · [r + γ · max Q(s′,a′) − Q(s,a)]
# TD target = r + γ · max Q(s′,a′)
# TD error  = TD target − current Q(s,a)
# α = learning rate
# γ = discount factor
# r = immediate reward
# max Q(s′,a′) = best future value
ε-Greedy Exploration
With probability ε, take a random action (explore). With probability 1−ε, take the best known action (exploit). Start with high ε (lots of exploration), gradually decrease it as the agent learns. This balances discovering new strategies with using proven ones.
Q-table limitation: Q-learning stores a value for every (state, action) pair. For Atari games with 210×160 pixel screens, the state space is astronomical. Solution: Deep Q-Networks — replace the Q-table with a neural network that generalizes across similar states.
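The pieces of this section, the Bellman update, ε-greedy exploration, and a Q-table, fit together in a short tabular Q-learning loop. The 4-state chain environment and the hyperparameter values below are illustrative:

```python
import random
from collections import defaultdict

random.seed(0)
N_STATES, ACTIONS = 4, [0, 1]        # action 1 moves right, action 0 stays

def step(s, a):
    """Deterministic chain: reward 1 only on reaching the last state."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else s
    done = s2 == N_STATES - 1
    return (1.0 if done else 0.0), s2, done

Q = defaultdict(float)               # Q[(state, action)], defaults to 0
alpha, gamma, eps = 0.5, 0.9, 0.2

for episode in range(200):
    s = 0
    while True:
        # epsilon-greedy: explore with prob eps, else exploit best Q
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a_: Q[(s, a_)])
        r, s2, done = step(s, a)
        best_next = max(Q[(s2, a_)] for a_ in ACTIONS)
        # Bellman update: move Q toward the TD target r + gamma * max Q(s',a')
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2
        if done:
            break

# The greedy policy from the start state should be "move right"
print(max(ACTIONS, key=lambda a_: Q[(0, a_)]))
```

A fixed ε is used here for brevity; in practice ε is annealed from near 1 toward a small floor as the agent learns.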
Deep Q-Networks (DQN)
Mnih et al. (2013/2015) — human-level Atari from pixels
The Breakthrough
DeepMind’s DQN used a CNN to approximate Q(s,a) directly from raw game pixels. Two key innovations made it stable: experience replay (store transitions in a buffer, sample randomly to break correlations) and a target network (frozen copy of Q, updated periodically, to stabilize the moving target).
# DQN architecture
Input: 4 stacked frames (84×84×4)
Conv1: 32 filters, 8×8, stride 4
Conv2: 64 filters, 4×4, stride 2
Conv3: 64 filters, 3×3, stride 1
FC:    512 units
Output: one Q-value per action (e.g., 18)
# Achieved superhuman on 29/49 Atari games
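The two stabilizers described above can be sketched in plain Python. The buffer capacity, sync interval, and parameter dictionaries below are illustrative stand-ins for real network weights:

```python
import random
from collections import deque

class ReplayBuffer:
    """Store transitions; sample random minibatches so consecutive,
    highly correlated frames do not appear in the same gradient step."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old transitions fall off

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

def maybe_sync_target(step, q_params, target_params, every=10_000):
    """Target network: a periodically-synced copy of the Q-network,
    so the TD target r + gamma * max Q_target(s') is not a moving
    target on every single update."""
    if step % every == 0:
        target_params.update(q_params)         # hard copy of the weights
    return target_params

buf = ReplayBuffer()
for i in range(64):
    buf.push(i, 0, 0.0, i + 1, False)
batch = buf.sample(32)
print(len(batch))
```

In a full DQN, each sampled batch feeds a gradient step on the Q-network while the target copy stays frozen between syncs.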
DQN Limitations
Discrete actions only: DQN outputs Q-values for each possible action. Works for Atari (18 actions) but not for robotics (continuous joint angles).

Overestimation: max operator biases Q-values upward. Double DQN fixes this.

Sample inefficiency: Needs millions of frames to learn. Humans learn Atari in minutes.
DQN was a watershed moment (2013): the first time a single algorithm learned multiple tasks from raw sensory input at human level. It proved deep learning + RL could solve complex sequential decision problems, launching the deep RL revolution.
Policy Gradient & Actor-Critic
Directly optimizing the policy — PPO and continuous control
Policy Gradient Methods
Instead of learning Q-values, directly optimize the policy π(a|s). The policy is a neural network that outputs action probabilities. Gradient: increase probability of actions that led to high rewards, decrease probability of actions that led to low rewards.
REINFORCE (Williams, 1992):
∇J = E[∑ ∇log π(a_t|s_t) · G_t]
# G_t = total return from time t
# High variance, slow convergence

Actor-Critic:
Actor: policy π(a|s), chooses actions
Critic: V(s), estimates state value
Advantage: A = r + γ · V(s′) − V(s)
# Lower variance, faster learning

PPO (Schulman et al., 2017):
Clips policy updates to stay close to the old policy
Stable, simple, works everywhere
# Default algorithm for most RL tasks
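The REINFORCE gradient can be demonstrated on the simplest possible case: a two-armed bandit with a softmax policy. The arm rewards and learning rate below are made-up numbers for illustration:

```python
import math, random

random.seed(0)
logits = [0.0, 0.0]            # policy parameters (one logit per arm)
true_reward = [0.2, 1.0]       # arm 1 pays more
lr = 0.1

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

for _ in range(2000):
    p = softmax(logits)
    a = 0 if random.random() < p[0] else 1   # sample an action from pi
    G = true_reward[a]                       # one-step return, no discounting
    # grad of log pi(a) w.r.t. logit i under softmax: 1[i == a] - p_i.
    # REINFORCE scales this by the return G.
    for i in range(2):
        grad = (1.0 if i == a else 0.0) - p[i]
        logits[i] += lr * G * grad

print(softmax(logits))   # probability mass concentrates on arm 1
```

Actor-critic methods replace the raw return `G` with the advantage `A = G - V(s)`, which shrinks the variance of exactly this update.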
Value-Based (DQN)
Learns Q(s,a), derives policy. Discrete actions only. Stable but limited. Good for games.
Policy Gradient (PPO)
Learns π(a|s) directly. Continuous actions. Robotics, locomotion, LLM alignment (RLHF).
PPO is everywhere: OpenAI Dota 2 (PPO), ChatGPT alignment (PPO), robotic manipulation (PPO), game AI (PPO). Its simplicity and stability made it the default RL algorithm. The key idea: limit how much the policy can change per update to prevent catastrophic performance drops.
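The clipping idea fits in a few lines. The probabilities and advantage values below are made-up numbers; a real implementation works with log-probabilities over batches:

```python
def ppo_clip_objective(new_prob, old_prob, advantage, eps=0.2):
    """PPO's clipped surrogate for one (action, advantage) sample.
    eps = 0.2 is the commonly used default clip range."""
    ratio = new_prob / old_prob                  # pi_new(a|s) / pi_old(a|s)
    clipped = max(min(ratio, 1 + eps), 1 - eps)  # clamp ratio to [0.8, 1.2]
    # Taking the min means moving the policy beyond the clip range earns
    # no extra objective, so there is no incentive for large policy jumps.
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: ratio 1.8 is clipped to 1.2
print(ppo_clip_objective(0.9, 0.5, advantage=2.0))   # 1.2 * 2.0 = 2.4
```

Note the asymmetry: for a negative advantage the `min` keeps the *worse* (unclipped) value, so the objective still punishes moves that increase the probability of bad actions.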
AlphaGo & AlphaZero
Self-play + MCTS + deep RL = superhuman game AI
AlphaGo (2016):
Policy network: predict human expert moves
Value network: predict win probability
MCTS: search guided by both networks
Beat Lee Sedol 4-1 in Go

AlphaGo Zero (2017):
No human data at all, pure self-play
Single network with policy + value heads
Surpassed AlphaGo in 3 days
Discovered novel strategies

AlphaZero (2018):
Same algorithm for Go, chess, and shogi
Beat Stockfish (chess) after 4 hours
No game-specific knowledge
Tabula rasa learning
The Self-Play Loop
1. Play games against yourself using MCTS + current network

2. Store (state, MCTS policy, outcome) as training data

3. Train network to predict MCTS policy and game outcome

4. Repeat with improved network

Each iteration, the agent plays better opponents (itself), creating an ever-improving curriculum.
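The four-step loop above can be sketched structurally. Every component below is an illustrative simplification: the game is last-stone-wins Nim(7), a preference table stands in for the neural network, and a naive score update stands in for MCTS-guided training:

```python
import random

random.seed(0)
prefs = {}                                  # state -> {action: score}

def legal(n):
    return [a for a in (1, 2) if a <= n]    # take 1 or 2 stones

def choose(n, explore=0.3):
    acts = legal(n)
    if n not in prefs or random.random() < explore:
        return random.choice(acts)
    return max(acts, key=lambda a: prefs[n].get(a, 0.0))

def self_play_game(n=7):
    history, player = [], 0
    while n > 0:
        a = choose(n)
        history.append((player, n, a))
        n -= a
        player = 1 - player
    return history, 1 - player              # last mover took the last stone

for _ in range(3000):                       # 1. play games against yourself
    history, winner = self_play_game()      # 2. store (state, action, outcome)
    for player, state, action in history:
        outcome = 1.0 if player == winner else -1.0
        prefs.setdefault(state, {})
        # 3. "train": nudge the chosen move's score toward the outcome
        prefs[state][action] = prefs[state].get(action, 0.0) + 0.1 * outcome
    # 4. repeat: later games are played by the improved table

# Game theory: from 7 stones, taking 1 (leaving a multiple of 3) wins
print(max(legal(7), key=lambda a: prefs[7].get(a, 0.0)))
```

The curriculum effect shows up even here: as the table improves, both "players" improve together, because they are the same policy.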
Beyond games: The AlphaZero approach has been applied to protein structure (AlphaFold), chip design (AlphaChip), mathematics (AlphaTensor for matrix multiplication), and weather forecasting. Self-play + search + neural networks is a general framework for optimization problems.
RLHF: RL for LLM Alignment
The bridge between RL and language models
How RLHF Works
RLHF treats the LLM as an RL agent, the prompt as the state, the generated response as the action, and the reward model score as the reward. PPO optimizes the LLM to generate responses that score high on the reward model while staying close to the original SFT model (KL penalty).
# RLHF as an RL problem
Agent:  LLM (policy π)
State:  prompt + tokens generated so far
Action: next token to generate
Reward: RM(full response), given at end of sequence
Constraint: KL(π || π_ref) < δ
Objective: max E[RM(response)] − β · KL(π || π_ref)
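The objective can be computed for a single response. The token log-probabilities and reward-model score below are made-up numbers, and the summed per-token log-ratio is one common estimator of the KL term:

```python
import math

def kl_penalized_reward(rm_score, pi_logprobs, ref_logprobs, beta=0.1):
    """RLHF reward for one response: reward-model score minus a
    KL penalty keeping the policy close to the reference (SFT) model."""
    # Per-token log-ratio log pi(t) - log pi_ref(t), summed over the
    # generated tokens: a standard per-sample KL estimate.
    kl = sum(p - r for p, r in zip(pi_logprobs, ref_logprobs))
    return rm_score - beta * kl

pi  = [math.log(0.5), math.log(0.4)]   # policy's log-probs for its tokens
ref = [math.log(0.4), math.log(0.4)]   # reference model's log-probs
print(kl_penalized_reward(rm_score=2.0, pi_logprobs=pi, ref_logprobs=ref))
```

When the policy drifts from the reference (here on the first token), the penalty eats into the reward-model score, which is what keeps the optimized LLM from reward-hacking into degenerate text.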
Why RLHF Matters
RLHF is what makes LLMs useful. A base model just predicts tokens; RLHF teaches it to be helpful, harmless, and honest. InstructGPT showed a 1.3B RLHF model was preferred over a 175B base model. ChatGPT’s success was largely due to RLHF alignment.
DPO and beyond: DPO (Ch 10) eliminates the RL step entirely by directly optimizing preferences. Newer methods like GRPO (DeepSeek) use group-relative rewards without a separate reward model. The trend: simpler alignment methods that achieve RLHF-level quality with less complexity.
Modern RL Applications
Robotics, games, science, and real-world deployment
Robotics:
Dexterous manipulation (OpenAI hand)
Locomotion (Boston Dynamics, Agility)
Sim-to-real transfer (train in simulation)

Games:
OpenAI Five (Dota 2, beat world champions)
AlphaStar (StarCraft II, Grandmaster level)
GT Sophy (Gran Turismo, beat top drivers)

Science:
AlphaFold (protein structure)
Plasma control for nuclear fusion (DeepMind)
Drug discovery (molecular optimization)

Industry:
Recommendation systems (YouTube, TikTok)
Data center cooling (Google, 40% savings)
Autonomous driving (Waymo, Tesla)
Sim-to-Real Transfer
Training RL in the real world is slow and dangerous. The solution: train in simulation (millions of episodes in hours), then transfer the policy to a real robot. Domain randomization (vary physics, textures, lighting in sim) makes policies robust to the reality gap.
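Domain randomization amounts to resampling simulator parameters every episode. The parameter names and ranges below are illustrative, and `make_sim_env` is a hypothetical constructor:

```python
import random

def sample_sim_params(rng):
    """Draw a fresh simulator configuration so the policy cannot
    overfit to any single physics/appearance setting."""
    return {
        "friction":    rng.uniform(0.5, 1.5),   # contact physics
        "mass_scale":  rng.uniform(0.8, 1.2),   # body masses
        "latency_ms":  rng.uniform(0.0, 40.0),  # actuation delay
        "light_level": rng.uniform(0.3, 1.0),   # visual appearance
    }

rng = random.Random(0)
for episode in range(3):
    params = sample_sim_params(rng)   # re-randomize each episode
    # env = make_sim_env(**params)    # hypothetical env constructor
    print(params)
```

A policy that succeeds across all these sampled worlds treats the real robot as just one more draw from the distribution, which is the whole point of closing the reality gap this way.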
The sample efficiency problem: RL still needs orders of magnitude more experience than humans. A human learns Atari in minutes; DQN needs millions of frames. Model-based RL (learn a world model, plan in imagination) and foundation models for RL are active research areas addressing this gap.
Key Takeaways
The RL paradigm and where it’s heading
Summary
1. RL learns from rewards through trial and error, not labeled data

2. Q-learning estimates state-action values; DQN scales it with neural networks

3. Policy gradient methods (PPO) directly optimize the policy for continuous control

4. AlphaZero: self-play + MCTS + neural network = superhuman game AI

5. RLHF uses RL to align LLMs with human preferences

6. DPO simplifies alignment by removing the RL step

7. Sim-to-real transfer enables robotic RL training
The Future of RL
Foundation models for RL: Pretrain on diverse environments, fine-tune for specific tasks (like LLMs for language).

World models: Learn environment dynamics, plan in imagination (Dreamer, IRIS).

RL + LLMs: Use LLMs as planners, reward models, or world simulators for RL agents.
Coming up: Ch 13 covers Ethics & Bias in AI — fairness, accountability, transparency, and the societal implications of the systems we’ve studied throughout this course.