Ch 12 — Reinforcement Learning

Learning from rewards — from Q-tables to AlphaGo to RLHF
High Level
Agent → Q-Learning → Deep RL → Policy Gradient → AlphaGo → RLHF
The RL Framework
Agent, environment, state, action, reward
How RL Differs
Unlike supervised learning (labeled data) or unsupervised learning (find patterns), RL learns from interaction. An agent takes actions in an environment, receives rewards, and learns a policy — a strategy that maximizes cumulative reward over time. No teacher provides correct answers; the agent discovers them through trial and error.
# The RL loop
Agent observes state s
Agent chooses action a based on policy π
Environment returns reward r and next state s′
Agent updates policy to maximize total reward
Repeat forever

Goal: maximize E[∑ γ^t · r_t]
# γ = discount factor (0.99 typical)
# Future rewards worth less than immediate
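The loop above can be sketched as runnable Python. The two-state environment, its rewards, and the random policy below are illustrative stand-ins, not a real task:

```python
import random

# Toy two-state environment: action 1 in state 0 earns a reward
# and moves to state 1; state 1 resets to state 0.
def toy_env_step(state, action):
    """Return (reward, next_state)."""
    if state == 0:
        return (1.0, 1) if action == 1 else (0.0, 0)
    return (0.0, 0)

def discounted_return(rewards, gamma=0.99):
    """G = sum_t gamma^t * r_t, the quantity the agent maximizes."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

state, rewards = 0, []
for _ in range(10):                    # the agent-environment loop
    action = random.choice([0, 1])     # a random policy, for illustration
    reward, state = toy_env_step(state, action)
    rewards.append(reward)

print(discounted_return(rewards))
```

A real agent would replace `random.choice` with a learned policy and use the observed rewards to improve it, which is exactly what the rest of this chapter covers.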
Key Challenges
Exploration vs Exploitation: Try new actions (explore) or stick with what works (exploit)?
Credit Assignment: Which past action caused the reward? A goal is scored 100 moves later; which move was the good one?
Sparse Rewards: In chess, the reward arrives only at game end. Thousands of moves, one reward signal.
Sample Efficiency: RL needs millions of interactions, far more than supervised learning.
RL is the third paradigm of machine learning (alongside supervised and unsupervised). It’s how AlphaGo learned to beat world champions, how robots learn to walk, and how ChatGPT was aligned with human preferences via RLHF.
Q-Learning & Value Functions
Learning the value of state-action pairs
Value-Based RL
The Q-function Q(s,a) estimates the total future reward of taking action a in state s, then following the optimal policy. Q-learning updates this estimate after each experience using the Bellman equation. Once Q is learned, the optimal policy is simply: pick the action with the highest Q-value.
# Q-learning update rule
Q(s,a) ← Q(s,a) + α · [r + γ · max Q(s′,a′) − Q(s,a)]
# TD target = r + γ · max Q(s′,a′)
# TD error  = TD target − current Q(s,a)
# α = learning rate
# γ = discount factor
# r = immediate reward
# max Q(s′,a′) = best future value
ε-Greedy Exploration
With probability ε, take a random action (explore). With probability 1−ε, take the best known action (exploit). Start with high ε (lots of exploration), gradually decrease it as the agent learns. This balances discovering new strategies with using proven ones.
Q-table limitation: Q-learning stores a value for every (state, action) pair. For Atari games with 210×160 pixel screens, the state space is astronomical. Solution: Deep Q-Networks — replace the Q-table with a neural network that generalizes across similar states.
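The pieces of this section, the Bellman update, ε-greedy exploration, and a Q-table, fit together in a short tabular Q-learning loop. The 4-state chain environment and the hyperparameter values below are illustrative:

```python
import random
from collections import defaultdict

random.seed(0)
N_STATES, ACTIONS = 4, [0, 1]        # action 1 moves right, action 0 stays

def step(s, a):
    """Deterministic chain: reward 1 only on reaching the last state."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else s
    done = s2 == N_STATES - 1
    return (1.0 if done else 0.0), s2, done

Q = defaultdict(float)               # Q[(state, action)], defaults to 0
alpha, gamma, eps = 0.5, 0.9, 0.2

for episode in range(200):
    s = 0
    while True:
        # epsilon-greedy: explore with prob eps, else exploit best Q
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a_: Q[(s, a_)])
        r, s2, done = step(s, a)
        best_next = max(Q[(s2, a_)] for a_ in ACTIONS)
        # Bellman update: move Q toward the TD target r + gamma * max Q(s',a')
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2
        if done:
            break

# The greedy policy from the start state should be "move right"
print(max(ACTIONS, key=lambda a_: Q[(0, a_)]))
```

A fixed ε is used here for brevity; in practice ε is annealed from near 1 toward a small floor as the agent learns.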
Deep Q-Networks (DQN)
Mnih et al. (2013/2015) — human-level Atari from pixels
The Breakthrough
DeepMind’s DQN used a CNN to approximate Q(s,a) directly from raw game pixels. Two key innovations made it stable: experience replay (store transitions in a buffer, sample randomly to break correlations) and a target network (frozen copy of Q, updated periodically, to stabilize the moving target).
# DQN architecture
Input: 4 stacked frames (84×84×4)
Conv1: 32 filters, 8×8, stride 4
Conv2: 64 filters, 4×4, stride 2
Conv3: 64 filters, 3×3, stride 1
FC:    512 units
Output: one Q-value per action (e.g., 18)
# Achieved superhuman on 29/49 Atari games
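The two stabilizers described above can be sketched in plain Python. The buffer capacity, sync interval, and parameter dictionaries below are illustrative stand-ins for real network weights:

```python
import random
from collections import deque

class ReplayBuffer:
    """Store transitions; sample random minibatches so consecutive,
    highly correlated frames do not appear in the same gradient step."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old transitions fall off

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

def maybe_sync_target(step, q_params, target_params, every=10_000):
    """Target network: a periodically-synced copy of the Q-network,
    so the TD target r + gamma * max Q_target(s') is not a moving
    target on every single update."""
    if step % every == 0:
        target_params.update(q_params)         # hard copy of the weights
    return target_params

buf = ReplayBuffer()
for i in range(64):
    buf.push(i, 0, 0.0, i + 1, False)
batch = buf.sample(32)
print(len(batch))
```

In a full DQN, each sampled batch feeds a gradient step on the Q-network while the target copy stays frozen between syncs.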
DQN Limitations
Discrete actions only: DQN outputs Q-values for each possible action. Works for Atari (18 actions) but not for robotics (continuous joint angles).

Overestimation: max operator biases Q-values upward. Double DQN fixes this.

Sample inefficiency: Needs millions of frames to learn. Humans learn Atari in minutes.
DQN was a watershed moment (2013): the first time a single algorithm learned multiple tasks from raw sensory input at human level. It proved deep learning + RL could solve complex sequential decision problems, launching the deep RL revolution.
Policy Gradient & Actor-Critic
Directly optimizing the policy — PPO and continuous control
Policy Gradient Methods
Instead of learning Q-values, directly optimize the policy π(a|s). The policy is a neural network that outputs action probabilities. Gradient: increase probability of actions that led to high rewards, decrease probability of actions that led to low rewards.
REINFORCE (Williams, 1992):
∇J = E[∑ ∇log π(a_t|s_t) · G_t]
# G_t = total return from time t
# High variance, slow convergence

Actor-Critic:
Actor: policy π(a|s), chooses actions
Critic: V(s), estimates state value
Advantage: A = r + γ · V(s′) − V(s)
# Lower variance, faster learning

PPO (Schulman et al., 2017):
Clips policy updates to stay close to the old policy
Stable, simple, works everywhere
# Default algorithm for most RL tasks
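The REINFORCE gradient can be demonstrated on the simplest possible case: a two-armed bandit with a softmax policy. The arm rewards and learning rate below are made-up numbers for illustration:

```python
import math, random

random.seed(0)
logits = [0.0, 0.0]            # policy parameters (one logit per arm)
true_reward = [0.2, 1.0]       # arm 1 pays more
lr = 0.1

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

for _ in range(2000):
    p = softmax(logits)
    a = 0 if random.random() < p[0] else 1   # sample an action from pi
    G = true_reward[a]                       # one-step return, no discounting
    # grad of log pi(a) w.r.t. logit i under softmax: 1[i == a] - p_i.
    # REINFORCE scales this by the return G.
    for i in range(2):
        grad = (1.0 if i == a else 0.0) - p[i]
        logits[i] += lr * G * grad

print(softmax(logits))   # probability mass concentrates on arm 1
```

Actor-critic methods replace the raw return `G` with the advantage `A = G - V(s)`, which shrinks the variance of exactly this update.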
Value-Based (DQN)
Learns Q(s,a), derives policy. Discrete actions only. Stable but limited. Good for games.
Policy Gradient (PPO)
Learns π(a|s) directly. Continuous actions. Robotics, locomotion, LLM alignment (RLHF).
PPO is everywhere: OpenAI Dota 2 (PPO), ChatGPT alignment (PPO), robotic manipulation (PPO), game AI (PPO). Its simplicity and stability made it the default RL algorithm. The key idea: limit how much the policy can change per update to prevent catastrophic performance drops.
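The clipping idea fits in a few lines. The probabilities and advantage values below are made-up numbers; a real implementation works with log-probabilities over batches:

```python
def ppo_clip_objective(new_prob, old_prob, advantage, eps=0.2):
    """PPO's clipped surrogate for one (action, advantage) sample.
    eps = 0.2 is the commonly used default clip range."""
    ratio = new_prob / old_prob                  # pi_new(a|s) / pi_old(a|s)
    clipped = max(min(ratio, 1 + eps), 1 - eps)  # clamp ratio to [0.8, 1.2]
    # Taking the min means moving the policy beyond the clip range earns
    # no extra objective, so there is no incentive for large policy jumps.
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: ratio 1.8 is clipped to 1.2
print(ppo_clip_objective(0.9, 0.5, advantage=2.0))   # 1.2 * 2.0 = 2.4
```

Note the asymmetry: for a negative advantage the `min` keeps the *worse* (unclipped) value, so the objective still punishes moves that increase the probability of bad actions.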
AlphaGo & AlphaZero
Self-play + MCTS + deep RL = superhuman game AI
AlphaGo (2016):
Policy network: predict human expert moves
Value network: predict win probability
MCTS: search guided by both networks
Beat Lee Sedol 4-1 in Go

AlphaGo Zero (2017):
No human data at all, pure self-play
Single network with policy + value heads
Surpassed AlphaGo in 3 days
Discovered novel strategies

AlphaZero (2018):
Same algorithm for Go, chess, and shogi
Beat Stockfish (chess) after 4 hours
No game-specific knowledge
Tabula rasa learning
The Self-Play Loop
1. Play games against yourself using MCTS + current network

2. Store (state, MCTS policy, outcome) as training data

3. Train network to predict MCTS policy and game outcome

4. Repeat with improved network

Each iteration, the agent plays better opponents (itself), creating an ever-improving curriculum.
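The four-step loop above can be sketched structurally. Every component below is an illustrative simplification: the game is last-stone-wins Nim(7), a preference table stands in for the neural network, and a naive score update stands in for MCTS-guided training:

```python
import random

random.seed(0)
prefs = {}                                  # state -> {action: score}

def legal(n):
    return [a for a in (1, 2) if a <= n]    # take 1 or 2 stones

def choose(n, explore=0.3):
    acts = legal(n)
    if n not in prefs or random.random() < explore:
        return random.choice(acts)
    return max(acts, key=lambda a: prefs[n].get(a, 0.0))

def self_play_game(n=7):
    history, player = [], 0
    while n > 0:
        a = choose(n)
        history.append((player, n, a))
        n -= a
        player = 1 - player
    return history, 1 - player              # last mover took the last stone

for _ in range(3000):                       # 1. play games against yourself
    history, winner = self_play_game()      # 2. store (state, action, outcome)
    for player, state, action in history:
        outcome = 1.0 if player == winner else -1.0
        prefs.setdefault(state, {})
        # 3. "train": nudge the chosen move's score toward the outcome
        prefs[state][action] = prefs[state].get(action, 0.0) + 0.1 * outcome
    # 4. repeat: later games are played by the improved table

# Game theory: from 7 stones, taking 1 (leaving a multiple of 3) wins
print(max(legal(7), key=lambda a: prefs[7].get(a, 0.0)))
```

The curriculum effect shows up even here: as the table improves, both "players" improve together, because they are the same policy.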
Beyond games: The AlphaZero approach has been applied to protein structure (AlphaFold), chip design (AlphaChip), mathematics (AlphaTensor for matrix multiplication), and weather forecasting. Self-play + search + neural networks is a general framework for optimization problems.
RLHF: RL for LLM Alignment
The bridge between RL and language models
How RLHF Works
RLHF treats the LLM as an RL agent, the prompt as the state, the generated response as the action, and the reward model score as the reward. PPO optimizes the LLM to generate responses that score high on the reward model while staying close to the original SFT model (KL penalty).
# RLHF as an RL problem
Agent:  LLM (policy π)
State:  prompt + tokens generated so far
Action: next token to generate
Reward: RM(full response), given at end of sequence
Constraint: KL(π || π_ref) < δ
Objective: max E[RM(response)] − β · KL(π || π_ref)
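The objective can be computed for a single response. The token log-probabilities and reward-model score below are made-up numbers, and the summed per-token log-ratio is one common estimator of the KL term:

```python
import math

def kl_penalized_reward(rm_score, pi_logprobs, ref_logprobs, beta=0.1):
    """RLHF reward for one response: reward-model score minus a
    KL penalty keeping the policy close to the reference (SFT) model."""
    # Per-token log-ratio log pi(t) - log pi_ref(t), summed over the
    # generated tokens: a standard per-sample KL estimate.
    kl = sum(p - r for p, r in zip(pi_logprobs, ref_logprobs))
    return rm_score - beta * kl

pi  = [math.log(0.5), math.log(0.4)]   # policy's log-probs for its tokens
ref = [math.log(0.4), math.log(0.4)]   # reference model's log-probs
print(kl_penalized_reward(rm_score=2.0, pi_logprobs=pi, ref_logprobs=ref))
```

When the policy drifts from the reference (here on the first token), the penalty eats into the reward-model score, which is what keeps the optimized LLM from reward-hacking into degenerate text.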
Why RLHF Matters
RLHF is what makes LLMs useful. A base model just predicts tokens; RLHF teaches it to be helpful, harmless, and honest. InstructGPT showed a 1.3B RLHF model was preferred over a 175B base model. ChatGPT’s success was largely due to RLHF alignment.
DPO and beyond: DPO (Ch 10) eliminates the RL step entirely by directly optimizing preferences. Newer methods like GRPO (DeepSeek) use group-relative rewards without a separate reward model. The trend: simpler alignment methods that achieve RLHF-level quality with less complexity.
Modern RL Applications
Robotics, games, science, and real-world deployment
Robotics:
Dexterous manipulation (OpenAI hand)
Locomotion (Boston Dynamics, Agility)
Sim-to-real transfer (train in simulation)

Games:
OpenAI Five (Dota 2, beat world champions)
AlphaStar (StarCraft II, Grandmaster level)
GT Sophy (Gran Turismo, beat top drivers)

Science:
AlphaFold (protein structure)
Plasma control for nuclear fusion (DeepMind)
Drug discovery (molecular optimization)

Industry:
Recommendation systems (YouTube, TikTok)
Data center cooling (Google, 40% savings)
Autonomous driving (Waymo, Tesla)
Sim-to-Real Transfer
Training RL in the real world is slow and dangerous. The solution: train in simulation (millions of episodes in hours), then transfer the policy to a real robot. Domain randomization (vary physics, textures, lighting in sim) makes policies robust to the reality gap.
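Domain randomization amounts to resampling simulator parameters every episode. The parameter names and ranges below are illustrative, and `make_sim_env` is a hypothetical constructor:

```python
import random

def sample_sim_params(rng):
    """Draw a fresh simulator configuration so the policy cannot
    overfit to any single physics/appearance setting."""
    return {
        "friction":    rng.uniform(0.5, 1.5),   # contact physics
        "mass_scale":  rng.uniform(0.8, 1.2),   # body masses
        "latency_ms":  rng.uniform(0.0, 40.0),  # actuation delay
        "light_level": rng.uniform(0.3, 1.0),   # visual appearance
    }

rng = random.Random(0)
for episode in range(3):
    params = sample_sim_params(rng)   # re-randomize each episode
    # env = make_sim_env(**params)    # hypothetical env constructor
    print(params)
```

A policy that succeeds across all these sampled worlds treats the real robot as just one more draw from the distribution, which is the whole point of closing the reality gap this way.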
The sample efficiency problem: RL still needs orders of magnitude more experience than humans. A human learns Atari in minutes; DQN needs millions of frames. Model-based RL (learn a world model, plan in imagination) and foundation models for RL are active research areas addressing this gap.
Key Takeaways
The RL paradigm and where it’s heading
Summary
1. RL learns from rewards through trial and error, not labeled data

2. Q-learning estimates state-action values; DQN scales it with neural networks

3. Policy gradient methods (PPO) directly optimize the policy for continuous control

4. AlphaZero: self-play + MCTS + neural network = superhuman game AI

5. RLHF uses RL to align LLMs with human preferences

6. DPO simplifies alignment by removing the RL step

7. Sim-to-real transfer enables robotic RL training
The Future of RL
Foundation models for RL: Pretrain on diverse environments, fine-tune for specific tasks (like LLMs for language).

World models: Learn environment dynamics, plan in imagination (Dreamer, IRIS).

RL + LLMs: Use LLMs as planners, reward models, or world simulators for RL agents.
Coming up: Ch 13 covers Ethics & Bias in AI — fairness, accountability, transparency, and the societal implications of the systems we’ve studied throughout this course.