Week 04 — Q-Learning and Temporal Difference Methods
The core off-policy algorithm of RL: learn the optimal action-value function from experience generated by any sufficiently exploratory behavior policy.
Lecture
SARSA (on-policy) vs Q-learning (off-policy) · convergence theorems (Watkins-Dayan 1992) · the importance of the contraction property · $n$-step bootstrapping · double Q-learning.
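As a reminder of the first lecture topic, the two one-step updates differ only in their bootstrap target. The notation below (step size $\alpha$, discount factor $\gamma$, transition $S_t, A_t, R_{t+1}, S_{t+1}$) follows Sutton and Barto and is an assumption about the conventions used in the lecture:

$$
\begin{aligned}
\text{SARSA:} \quad & Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \big] \\
\text{Q-learning:} \quad & Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \big]
\end{aligned}
$$

SARSA bootstraps from the action $A_{t+1}$ that the behavior policy actually takes next, which makes it on-policy. Q-learning bootstraps from the greedy action regardless of what the behavior policy does next, which makes it off-policy.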
Read before the lecture
- Sutton and Barto, chapters 6 and 7
- Watkins and Dayan, *Q-learning* (Machine Learning, 1992)
Code lab
Q-learning on FrozenLake and Taxi
Train a tabular Q-learning agent on FrozenLake-v1 and Taxi-v3 (Gymnasium). Report learning curves and final reward.
Notebook: lab02-q-learning.ipynb · Dataset: Gymnasium built-in environments.
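To give an idea of what the lab involves, here is a minimal sketch of tabular Q-learning with an epsilon-greedy behavior policy. It is not the notebook's code. `ChainEnv` is a hypothetical toy corridor that only mimics the Gymnasium `reset`/`step` return signature so the example runs without extra dependencies, and the hyperparameter values are illustrative assumptions.

```python
import random


class ChainEnv:
    """Hypothetical 5-state corridor with a Gymnasium-style reset/step API.

    The agent starts in state 0. Action 1 moves right, action 0 moves left.
    Reaching the last state ends the episode with reward 1.
    """

    n_states = 5
    n_actions = 2

    def reset(self):
        self.state = 0
        return self.state, {}

    def step(self, action):
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        terminated = self.state == self.n_states - 1
        reward = 1.0 if terminated else 0.0
        # Same 5-tuple as Gymnasium: obs, reward, terminated, truncated, info.
        return self.state, reward, terminated, False, {}


def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1,
               gamma=0.99, epsilon=0.1, max_steps=100, seed=0):
    """Tabular Q-learning. Returns the Q-table and the per-episode returns."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    returns = []
    for _ in range(episodes):
        state, _ = env.reset()
        total = 0.0
        for _ in range(max_steps):
            # Epsilon-greedy behavior policy with random tie-breaking.
            if rng.random() < epsilon:
                action = rng.randrange(n_actions)
            else:
                best = max(q[state])
                action = rng.choice(
                    [a for a in range(n_actions) if q[state][a] == best]
                )
            next_state, reward, terminated, truncated, _ = env.step(action)
            # Off-policy target: bootstrap from the greedy value of the next
            # state, and do not bootstrap past a terminal state.
            bootstrap = 0.0 if terminated else gamma * max(q[next_state])
            q[state][action] += alpha * (reward + bootstrap - q[state][action])
            total += reward
            state = next_state
            if terminated or truncated:
                break
        returns.append(total)
    return q, returns


q_table, episode_returns = q_learning(
    ChainEnv(), ChainEnv.n_states, ChainEnv.n_actions
)
```

The same function should work with a discrete Gymnasium environment such as `gym.make("FrozenLake-v1")`, passing `env.observation_space.n` and `env.action_space.n` as the table dimensions. The list `episode_returns` is the raw material for a learning curve.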
Reference text for this week: chapter 04 of the bilingual notes — EN PDF · FR PDF.