
Week 04 — Q-Learning and Temporal Difference Methods

Q-learning, the off-policy TD algorithm: learn the optimal action-value function from experience collected by any behavior policy that keeps exploring.

Lecture

SARSA (on-policy) vs Q-learning (off-policy) · convergence theorems (Watkins-Dayan 1992) · why the contraction property of the Bellman optimality operator matters · $n$-step bootstrapping · double Q-learning.
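As a preview of the first lecture topic, here is a minimal sketch of the two one-step tabular updates. The function names, the list-of-lists table layout `Q[state][action]`, and the default `alpha` and `gamma` values are assumptions made for illustration, not the course's reference code. The only difference between the two is the bootstrap target: SARSA uses the value of the action the behavior policy actually takes next, while Q-learning uses the greedy (maximum) value in the next state.

```python
from typing import List

Table = List[List[float]]  # Q[state][action], illustrative layout


def sarsa_update(Q: Table, s: int, a: int, r: float, s_next: int, a_next: int,
                 alpha: float = 0.1, gamma: float = 0.99,
                 terminal: bool = False) -> None:
    """On-policy TD(0) control: bootstrap from the action actually taken next."""
    target = r if terminal else r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


def q_learning_update(Q: Table, s: int, a: int, r: float, s_next: int,
                      alpha: float = 0.1, gamma: float = 0.99,
                      terminal: bool = False) -> None:
    """Off-policy TD(0) control: bootstrap from the greedy action in s_next."""
    target = r if terminal else r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


# Same transition, different targets.
Q_sarsa = [[0.0, 0.0], [1.0, 5.0]]
Q_qlearn = [[0.0, 0.0], [1.0, 5.0]]
sarsa_update(Q_sarsa, s=0, a=0, r=1.0, s_next=1, a_next=0, alpha=0.5, gamma=0.9)
q_learning_update(Q_qlearn, s=0, a=0, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(Q_sarsa[0][0])   # about 0.95: target is 1 + 0.9 * Q[1][0]
print(Q_qlearn[0][0])  # 2.75: target is 1 + 0.9 * max(Q[1])
```

When the behavior policy happens to pick the greedy action in the next state, the two updates coincide; they diverge only on exploratory steps.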

Read before the lecture

Sutton and Barto, chapters 6 and 7
Watkins and Dayan, *Q-learning* (Machine Learning, 1992)

Code lab

Q-learning on FrozenLake and Taxi

Train a tabular Q-learning agent on FrozenLake-v1 and Taxi-v3 (Gymnasium). Report learning curves and final reward.
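The training loop for this lab can be sketched roughly as follows. This is not the contents of the lab notebook: `ChainEnv` is a hypothetical five-state corridor that only mimics the Gymnasium `reset()` / `step()` return shapes so the example runs without dependencies, and `train_q_learning` with its hyperparameter defaults is likewise an illustrative assumption.

```python
import random


class ChainEnv:
    """Hypothetical 5-state corridor (stand-in for a Gymnasium environment).

    Action 1 moves right, action 0 moves left; reaching the right end
    terminates the episode with reward 1.
    """
    n_states = 5
    n_actions = 2

    def reset(self, seed=None):
        self.s = 0
        return self.s, {}

    def step(self, action):
        if action == 1:
            self.s = min(self.s + 1, self.n_states - 1)
        else:
            self.s = max(self.s - 1, 0)
        terminated = self.s == self.n_states - 1
        reward = 1.0 if terminated else 0.0
        return self.s, reward, terminated, False, {}


def train_q_learning(env, n_states, n_actions, episodes=500, alpha=0.1,
                     gamma=0.95, epsilon=0.1, max_steps=100, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    returns = []  # undiscounted return per episode, for the learning curve
    for _ in range(episodes):
        s, _ = env.reset()
        total = 0.0
        for _ in range(max_steps):
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)  # explore
            else:
                best = max(Q[s])
                # exploit, breaking ties at random
                a = rng.choice([i for i, q in enumerate(Q[s]) if q == best])
            s_next, r, terminated, truncated, _ = env.step(a)
            # Off-policy target: greedy value of the next state,
            # with no bootstrapping past a terminal state.
            target = r if terminated else r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            total += r
            s = s_next
            if terminated or truncated:
                break
        returns.append(total)
    return Q, returns


Q, returns = train_q_learning(ChainEnv(), ChainEnv.n_states, ChainEnv.n_actions)
greedy = [row.index(max(row)) for row in Q]
print(greedy[:4])  # greedy action in each non-terminal state
```

For the lab itself, the stand-in would be replaced by `gymnasium.make("FrozenLake-v1")` or `gymnasium.make("Taxi-v3")`, with the table sized from `env.observation_space.n` and `env.action_space.n`. The `returns` list, smoothed with a moving average, gives the learning curve to report.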

Notebook: lab02-q-learning.ipynb · Dataset: Gymnasium built-in environments.

Reference text for this week: chapter 04 of the bilingual notes — EN PDF · FR PDF.