Week 07 — Actor-Critic Methods

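Combine policy gradients with a learned value function: the critic's value estimates reduce the variance of the actor's gradient updates, and this pairing underlies many modern deep RL algorithms, including A2C, PPO, and SAC.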

Lecture

The actor-critic family · A2C and A3C · GAE (generalized advantage estimation, Schulman et al. 2016) · soft actor-critic (SAC, Haarnoja et al. 2018) · entropy-regularized RL.
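
As a warm-up for the GAE part of the lecture, the estimator can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's reference code: the function name `gae_advantages`, its argument layout, and the default values of `gamma` and `lam` are assumptions made for this example. It computes the TD residual delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) and accumulates A_t = delta_t + gamma * lam * A_{t+1} backwards through one rollout.

```python
def gae_advantages(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one rollout (illustrative sketch).

    rewards[t]  reward received after acting in state t
    values[t]   critic estimate V(s_t)
    last_value  critic estimate for the state after the final step
    dones[t]    True if the episode ended after step t
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the next step.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD residual; the bootstrap term is dropped at episode ends.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

Setting `lam=0` reduces every estimate to the one-step TD residual (low variance, more bias), while `lam=1` gives the Monte Carlo return minus the value baseline (less bias, higher variance).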

Read before the lecture

Haarnoja et al., *Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor* (ICML 2018)

Code lab

PPO on CartPole and BipedalWalker

Train PPO with Stable-Baselines3 on CartPole-v1 (easy) and BipedalWalker-v3 (hard). Tune learning rate, clip range, and entropy coefficient. Report results.
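
One possible way to organize the tuning is a small grid over the three hyperparameters. The sketch below is not taken from the lab notebook: the grid values, the helper names `configs` and `train_and_evaluate`, the 50,000-step budget, and the 10 evaluation episodes are all assumptions chosen for illustration. The `learning_rate`, `clip_range`, and `ent_coef` keyword arguments are those of the Stable-Baselines3 `PPO` constructor.

```python
from itertools import product

# Hypothetical search grid: illustrative starting points, not tuned values.
GRID = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "clip_range": [0.1, 0.2, 0.3],
    "ent_coef": [0.0, 0.01],
}


def configs(grid):
    """Expand a dict of value lists into a list of keyword-argument dicts."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


def train_and_evaluate(env_id, params, total_timesteps=50_000, seed=0):
    """Train PPO with one configuration and return (mean, std) episode reward."""
    # Imported here so the grid can be inspected without Stable-Baselines3 installed.
    from stable_baselines3 import PPO
    from stable_baselines3.common.evaluation import evaluate_policy

    model = PPO("MlpPolicy", env_id, seed=seed, verbose=0, **params)
    model.learn(total_timesteps=total_timesteps)
    return evaluate_policy(model, model.get_env(), n_eval_episodes=10)
```

Calling `train_and_evaluate("CartPole-v1", params)` for each `params` in `configs(GRID)` gives one row of results per configuration. BipedalWalker-v3 typically needs a far larger step budget than CartPole-v1, plus the Box2D extra for Gymnasium, so the same loop will take much longer there.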

Notebook: lab03-ppo.ipynb · Dataset: Gymnasium environments.

Reference text for this week: chapter 07 of the bilingual notes — EN PDF · FR PDF.