Week 04 — Q-Learning and Temporal Difference Methods

The off-policy algorithm that unlocks RL: learn the optimal action-value function from any behavior policy that keeps exploring every state-action pair.

Lecture

SARSA (on-policy) vs Q-learning (off-policy) · convergence theorems (Watkins-Dayan 1992) · the importance of the contraction property · $n$-step bootstrapping · double Q-learning.
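
The on-policy/off-policy distinction in the lecture topics comes down to which value the one-step target bootstraps from. The sketch below is illustrative only and is not taken from the course materials: the function names, the dictionary-based table `Q` keyed by `(state, action)`, and the default step size and discount are all assumptions chosen for brevity.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, done=False):
    """On-policy TD(0): bootstrap from the action the agent actually takes next."""
    target = r if done else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """Off-policy TD(0): bootstrap from the greedy action, whatever is taken next."""
    target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def double_q_update(QA, QB, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """Double Q-learning: QA selects the next action, QB evaluates it.

    The caller is assumed to swap the roles of QA and QB at random on each step.
    """
    if done:
        target = r
    else:
        best = max(actions, key=lambda b: QA[(s_next, b)])
        target = r + gamma * QB[(s_next, best)]
    QA[(s, a)] += alpha * (target - QA[(s, a)])


# Same transition, different targets: the next state has one poor and one good action,
# and the behavior policy happens to pick the poor one (a_next = 0).
q_sarsa = defaultdict(float, {(1, 0): 1.0, (1, 1): 5.0})
q_learn = defaultdict(float, {(1, 0): 1.0, (1, 1): 5.0})
sarsa_update(q_sarsa, s=0, a=0, r=0.0, s_next=1, a_next=0, alpha=0.5, gamma=0.9)
q_learning_update(q_learn, s=0, a=0, r=0.0, s_next=1, actions=(0, 1), alpha=0.5, gamma=0.9)
```

After this single step, `q_sarsa[(0, 0)]` reflects the value of the action that was actually chosen, while `q_learn[(0, 0)]` reflects the best available action. That difference is why Q-learning can estimate the optimal values while following an exploratory policy.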

Read before the lecture

  • Sutton and Barto, chapters 6 and 7
  • Watkins and Dayan, *Q-learning* (Machine Learning, 1992)

Code lab

Q-learning on FrozenLake and Taxi

Train a tabular Q-learning agent on FrozenLake-v1 and Taxi-v3 (Gymnasium). Report learning curves and final reward.

Notebook: lab02-q-learning.ipynb  ·  Dataset: Gymnasium built-in environments.
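
As a starting point for the lab, here is a minimal sketch of a tabular, epsilon-greedy Q-learning loop. It is not the contents of the lab notebook. To stay runnable without extra dependencies it trains on `ChainEnv`, a hypothetical five-state corridor written to mimic the Gymnasium interface (`reset()` returns `(obs, info)`, `step()` returns `(obs, reward, terminated, truncated, info)`). The attribute `n_actions`, the hyperparameters, and the 50-step time limit are assumptions; with a real Gymnasium environment the action count would come from `env.action_space.n` instead.

```python
import random
from collections import defaultdict


class ChainEnv:
    """Hypothetical 5-state corridor with a Gymnasium-style reset/step interface.

    Action 1 moves right, action 0 moves left. Reaching the last state
    pays +1 and terminates; episodes are truncated after 50 steps.
    """
    n_states, n_actions = 5, 2

    def reset(self, seed=None):
        self.s, self.t = 0, 0
        return self.s, {}

    def step(self, action):
        self.t += 1
        if action == 1:
            self.s = min(self.s + 1, self.n_states - 1)
        else:
            self.s = max(self.s - 1, 0)
        terminated = self.s == self.n_states - 1
        truncated = self.t >= 50 and not terminated
        return self.s, (1.0 if terminated else 0.0), terminated, truncated, {}


def train_q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Run tabular Q-learning; return the Q-table and the per-episode returns."""
    rng = random.Random(seed)
    Q = defaultdict(float)          # unseen (state, action) pairs start at 0
    actions = range(env.n_actions)
    returns = []
    for _ in range(episodes):
        s, _info = env.reset()
        done, total = False, 0.0
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(env.n_actions)              # explore
            else:
                best = max(Q[(s, b)] for b in actions)        # exploit, random tie-break
                a = rng.choice([b for b in actions if Q[(s, b)] == best])
            s_next, r, terminated, truncated, _info = env.step(a)
            # Bootstrap unless the episode truly terminated (not merely truncated).
            target = r if terminated else r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, total = s_next, total + r
            done = terminated or truncated
        returns.append(total)
    return Q, returns


Q, returns = train_q_learning(ChainEnv())
greedy_policy = [max(range(2), key=lambda b: Q[(s, b)]) for s in range(4)]
```

A moving average of `returns` gives a learning curve of the kind the lab asks for. Note that the target bootstraps through truncation but not through termination; this matters on time-limited environments, where a timeout is not a true terminal state.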


Reference text for this week: chapter 04 of the bilingual notes — EN PDF · FR PDF.