Week 06 — Policy Gradient Methods

Optimize the policy directly instead of deriving it from a learned value function. Covers the REINFORCE algorithm and its descendants.

Lecture

The policy gradient theorem · REINFORCE · baseline subtraction for variance reduction · trust region methods (TRPO) · the clipped surrogate objective of proximal policy optimization (PPO).

Read before the lecture

Sutton et al., *Policy Gradient Methods for Reinforcement Learning with Function Approximation* (NeurIPS 1999)
Schulman et al., *Proximal Policy Optimization Algorithms* (2017)

Problem set

PS4 — Policy gradient theorem

Prove the policy gradient theorem from first principles, in the style of Sutton et al. (1999).
Show that subtracting a state-dependent baseline does not change the gradient in expectation.
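
Before writing the baseline proof, it can help to check the claim numerically. The sketch below is only a sanity check, not a proof or a reference solution. It assumes a single state with three actions, a softmax policy over logits, and made-up action values; all names (`softmax`, `grad_log_pi`, `expected_gradient`) and numbers are illustrative. With one state, a state-dependent baseline reduces to a constant.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_pi(probs, action):
    # For a softmax policy, d/d logit_k of log pi(action) is 1[k == action] - pi(k).
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def expected_gradient(logits, q_values, baseline=0.0):
    # Exact expectation over actions of grad log pi(a) * (Q(a) - baseline),
    # computed by enumeration rather than sampling.
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, p in enumerate(probs):
        g = grad_log_pi(probs, a)
        for k in range(len(logits)):
            grad[k] += p * g[k] * (q_values[a] - baseline)
    return grad

# Illustrative (assumed) policy parameters and action values for one state.
logits = [0.2, -1.0, 0.5]
q_values = [1.0, 3.0, -2.0]

no_baseline = expected_gradient(logits, q_values)
with_baseline = expected_gradient(logits, q_values, baseline=7.5)

# Largest component-wise difference between the two expected gradients.
max_gap = max(abs(x - y) for x, y in zip(no_baseline, with_baseline))
print(max_gap)
```

The two expected gradients should agree up to floating-point error for any baseline value. The proof asked for above explains why: the baseline term multiplies the expected score function, which is the quantity to examine.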

Reference text for this week: chapter 06 of the bilingual notes — EN PDF · FR PDF.