Week 06 — Policy Gradient Methods
Directly optimize the policy, not the value function. The REINFORCE algorithm and its descendants.
Lecture
The policy gradient theorem · REINFORCE · baseline subtraction for variance reduction · trust region methods (TRPO) · the clipped surrogate objective of proximal policy optimization (PPO).
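As a quick reference for the topics above, the main objects of the lecture can be written as follows. This is a sketch in one common notation, which is an assumption here and may differ from the lecture's: $J(\theta)$ is the expected return, $d^{\pi}$ the on-policy state distribution, $G_t$ the return from step $t$, $b(s)$ a baseline, $\hat{A}_t$ an advantage estimate, and $\epsilon$ the clipping parameter.

```latex
% Policy gradient theorem (Sutton et al. 1999), up to a proportionality constant
\nabla_\theta J(\theta) \;\propto\; \sum_{s} d^{\pi}(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi}(s, a)

% REINFORCE estimator with a state-dependent baseline
\hat{g} \;=\; \sum_{t} \bigl(G_t - b(s_t)\bigr)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)

% PPO clipped surrogate objective (Schulman et al. 2017)
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}, \qquad
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\bigl( r_t(\theta)\, \hat{A}_t,\; \operatorname{clip}\bigl(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\bigr)\, \hat{A}_t \bigr) \right]
```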
Read before the lecture
- Sutton et al., *Policy Gradient Methods for Reinforcement Learning with Function Approximation* (NeurIPS 1999)
- Schulman et al., *Proximal Policy Optimization Algorithms* (2017)
Problem set
PS4 — Policy gradient theorem
- Prove the policy gradient theorem from first principles, following the approach of Sutton et al. (1999).
- Show that subtracting a state-dependent baseline does not change the gradient in expectation.
Reference text for this week: chapter 06 of the bilingual notes — EN PDF · FR PDF.