Week 07 — Actor-Critic Methods
Combine policy gradients with a learned value function: the critic lowers the variance of the policy-gradient estimate, and the combination is the foundation of most modern RL algorithms.
Lecture
The actor-critic family · A2C and A3C · GAE (generalized advantage estimation, Schulman et al. 2016) · soft actor-critic (SAC, Haarnoja et al. 2018) · entropy-regularized RL.
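As a warm-up for the GAE part of the lecture, here is a minimal sketch of the advantage computation in NumPy. The function name `compute_gae`, its argument layout, and the default values of `gamma` and `lam` are assumptions made for illustration, not the course's reference implementation. It applies the backward recursion A_t = delta_t + gamma * lam * A_{t+1}, with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), and cuts the recursion at episode boundaries.

```python
import numpy as np


def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one rollout of length T.

    rewards, values, dones: sequences of length T, where values[t] is the
    critic's estimate V(s_t) and dones[t] is 1.0 if the episode ended
    after step t (0.0 otherwise).
    last_value: critic estimate V(s_T), used to bootstrap the final step.
    Returns (advantages, returns), both arrays of length T.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)

    T = len(rewards)
    advantages = np.zeros(T)
    next_value = float(last_value)
    running = 0.0

    # Walk the rollout backwards so each step can reuse the next step's estimate.
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        # One-step TD error; no bootstrapping across an episode boundary.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]

    # Value-function regression targets.
    returns = advantages + values
    return advantages, returns
```

Setting `lam=0` reduces the estimate to the one-step TD error (low variance, more bias), while `lam=1` gives the Monte Carlo return minus the baseline (no bootstrap bias, higher variance).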
Read before the lecture
Code lab
PPO on CartPole and BipedalWalker
Train PPO with Stable-Baselines3 on CartPole-v1 (easy) and BipedalWalker-v3 (hard). Tune learning rate, clip range, and entropy coefficient. Report results.
Notebook: lab03-ppo.ipynb · Dataset: Gymnasium environments.
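One possible starting point for the tuning exercise is sketched below. It is not the content of the lab notebook: the helper names `hyperparameter_grid` and `train_and_evaluate`, the timestep budget, and the number of evaluation episodes are all assumptions. The sketch relies on the Stable-Baselines3 `PPO` constructor arguments `learning_rate`, `clip_range`, and `ent_coef` and on `evaluate_policy`. BipedalWalker-v3 additionally needs the Box2D extra for Gymnasium.

```python
from itertools import product


def hyperparameter_grid(learning_rates, clip_ranges, ent_coefs):
    """Return one PPO keyword-argument dict per combination of the three settings."""
    return [
        {"learning_rate": lr, "clip_range": cr, "ent_coef": ec}
        for lr, cr, ec in product(learning_rates, clip_ranges, ent_coefs)
    ]


def train_and_evaluate(env_id, config, total_timesteps=100_000, seed=0):
    """Train PPO on env_id with the given config and return (mean, std) episode reward.

    The budget of 100_000 steps is an illustrative default; BipedalWalker-v3
    will likely need far more than CartPole-v1.
    """
    # Imported lazily so the grid helper works without the RL dependencies.
    import gymnasium as gym
    from stable_baselines3 import PPO
    from stable_baselines3.common.evaluation import evaluate_policy

    env = gym.make(env_id)
    model = PPO("MlpPolicy", env, seed=seed, verbose=0, **config)
    model.learn(total_timesteps=total_timesteps)
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
    env.close()
    return mean_reward, std_reward


# Example sweep (values chosen for illustration only):
# for cfg in hyperparameter_grid([3e-4, 1e-3], [0.1, 0.2], [0.0, 0.01]):
#     print(cfg, train_and_evaluate("CartPole-v1", cfg))
```

Keeping the grid separate from the training call makes it easy to log each configuration next to its score when writing up the results.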
Reference text for this week: chapter 07 of the bilingual notes — EN PDF · FR PDF.