Week 07 — Actor-Critic Methods

Combine policy gradients with a learned value function: the critic lowers the variance of the policy-gradient estimate, and the pairing is the foundation of most modern deep RL algorithms.


Lecture

The actor-critic family · A2C and A3C · GAE (generalized advantage estimation, Schulman et al. 2016) · soft actor-critic (SAC, Haarnoja et al. 2018) · entropy-regularized RL.
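As a rough illustration of the GAE topic above, the sketch below computes advantages for a single rollout in plain Python. It assumes the usual formulation: a TD residual delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), accumulated backwards with weight gamma * lambda. The function name `compute_gae`, its argument layout, and the default values of `gamma` and `lam` are illustrative choices, not taken from the lecture materials.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one rollout of length T.

    rewards[t], values[t], and dones[t] describe step t; last_value is the
    critic's estimate for the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    # Walk the rollout backwards so each step can reuse the running sum.
    for t in reversed(range(len(rewards))):
        # A terminal step cuts off both the bootstrap and the running sum.
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Value targets for the critic: advantage plus the baseline it was measured against.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

Setting `lam=0` reduces this to the one-step TD advantage, while `lam=1` recovers the Monte Carlo return minus the baseline; values in between trade bias against variance.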

Read before the lecture

Code lab

PPO on CartPole and BipedalWalker

Train PPO with Stable-Baselines3 on CartPole-v1 (easy) and BipedalWalker-v3 (hard). Tune the learning rate, clip range, and entropy coefficient, then report your results for both environments.

Notebook: lab03-ppo.ipynb  ·  Dataset: Gymnasium environments.
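One possible way to organize the tuning runs is sketched below; it is not the contents of the lab notebook. The grid values, the helper names `make_configs` and `train_and_evaluate`, and the default `total_timesteps` are all hypothetical. The Stable-Baselines3 calls used (`PPO`, `learn`, `evaluate_policy`) and the keyword arguments `learning_rate`, `clip_range`, and `ent_coef` are part of its public API.

```python
from itertools import product

# Hypothetical search grid; the values are illustrative, not tuned.
HYPERPARAM_GRID = {
    "learning_rate": [3e-4, 1e-3],
    "clip_range": [0.1, 0.2],
    "ent_coef": [0.0, 0.01],
}


def make_configs(grid=HYPERPARAM_GRID):
    """Expand the grid into a list of keyword-argument dicts for PPO."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


def train_and_evaluate(env_id, config, total_timesteps=50_000, seed=0):
    """Train PPO on env_id with one config; return (mean, std) episode reward."""
    # Imported here so the grid helpers work without Stable-Baselines3 installed.
    from stable_baselines3 import PPO
    from stable_baselines3.common.evaluation import evaluate_policy

    model = PPO("MlpPolicy", env_id, seed=seed, verbose=0, **config)
    model.learn(total_timesteps=total_timesteps)
    # Evaluate on the (monitored) environment the model was trained with.
    return evaluate_policy(model, model.get_env(), n_eval_episodes=10)
```

For instance, `train_and_evaluate("CartPole-v1", make_configs()[0])` would train and score a single configuration. BipedalWalker-v3 will probably need a far larger `total_timesteps` than the placeholder default, and it requires Gymnasium's Box2D extra to be installed.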


Reference text for this week: chapter 07 of the bilingual notes — EN PDF · FR PDF.