Week 06 — Policy Gradient Methods

Directly optimize the policy, not the value function. The REINFORCE algorithm and its descendants.

Lecture

The policy gradient theorem · REINFORCE · baseline subtraction for variance reduction · trust region methods (TRPO) · the clipped surrogate objective of proximal policy optimization (PPO).
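
As a rough illustration of the first three topics, here is a minimal sketch of REINFORCE with a running-average baseline. It is not taken from the lecture: the setting (a multi-armed bandit with a softmax policy), the function name `reinforce_bandit`, the reward means, the learning rate, and the noise level are all assumptions chosen to keep the example short. TRPO and PPO are not covered by this sketch.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_bandit(mean_rewards, steps=3000, lr=0.1, noise_std=0.5,
                     use_baseline=True, seed=0):
    """REINFORCE on a bandit (hypothetical toy setting, one state, one step)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(mean_rewards))   # softmax policy parameters
    baseline = 0.0                        # running mean of past rewards
    for t in range(1, steps + 1):
        probs = softmax(theta)
        a = rng.choice(len(probs), p=probs)
        r = mean_rewards[a] + rng.normal(0.0, noise_std)
        # Score function of a softmax policy: grad log pi(a) = one_hot(a) - probs.
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        # The baseline depends only on earlier rewards, not on the current action.
        advantage = r - baseline if use_baseline else r
        theta += lr * advantage * grad_log_pi
        baseline += (r - baseline) / t
    return softmax(theta)

final_probs = reinforce_bandit(np.array([0.0, 0.5, 2.0]))
print(final_probs)
```

With these assumed settings, most of the probability mass should end up on the arm with the highest mean reward. Setting `use_baseline=False` runs the plain REINFORCE update for comparison.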

Read before the lecture

Problem set

PS4 — Policy gradient theorem

  1. Prove the policy gradient theorem from first principles, in the style of Sutton et al. (1999).
  2. Show that subtracting a state-dependent baseline does not change the gradient in expectation.
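
Before attempting question 2, it may help to check the claim numerically. The sketch below is only a sanity check, not a proof: it assumes a single state, a softmax policy over four actions, and hypothetical names (`expected_gradient`, `theta`, `q`, `b`). It computes the expected gradient exactly by summing over actions, once without a baseline and once with an arbitrary constant.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def expected_gradient(theta, q, b):
    """Exact E_a[(q[a] - b) * grad log pi(a)] for a softmax policy in one state."""
    probs = softmax(theta)
    # Row a of `score` is grad_theta log pi(a) = one_hot(a) - probs.
    score = np.eye(len(theta)) - probs
    # Weight each row by pi(a) * (q[a] - b) and sum over actions.
    return (probs * (q - b)) @ score

theta = np.array([0.3, -1.2, 0.8, 0.0])   # arbitrary policy parameters
q = np.array([1.0, 4.0, -2.0, 0.5])       # arbitrary action values
g_no_baseline = expected_gradient(theta, q, b=0.0)
g_with_baseline = expected_gradient(theta, q, b=3.7)
print(np.allclose(g_no_baseline, g_with_baseline))
```

In this one-state case any constant plays the role of the state-dependent baseline. The two gradients agree because the probability-weighted scores sum to zero, which is the identity the written proof needs to establish for every state.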

Reference text for this week: chapter 06 of the bilingual notes — EN PDF · FR PDF.