
Week 03 — Monte Carlo and Temporal Difference Methods

When the model isn't known: learn from sampled trajectories. The MC / TD spectrum is the conceptual backbone of model-free RL.

Lecture

Monte Carlo policy evaluation · the exploration-exploitation problem · the $\varepsilon$-greedy and softmax strategies · TD(0) and the TD update · why TD has lower variance than MC, at the cost of bias from bootstrapping.
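
For orientation, the two updates named above can be written side by side. This is a sketch in the notation of Sutton and Barto (step size $\alpha$, discount $\gamma$, return $G_t$); the lecture may use different symbols.

$$
\begin{aligned}
\text{MC:}\quad & V(S_t) \leftarrow V(S_t) + \alpha \bigl[ G_t - V(S_t) \bigr] \\
\text{TD(0):}\quad & V(S_t) \leftarrow V(S_t) + \alpha \bigl[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \bigr]
\end{aligned}
$$

The MC target $G_t$ depends on every random reward and transition until the end of the episode, while the TD(0) target depends on a single step plus the current estimate $V(S_{t+1})$. That is the source of TD's lower variance and of its bias.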

Read before the lecture

Sutton and Barto, chapters 5–6

Problem set

PS2 — MC vs TD

Implement first-visit MC and every-visit MC on a simple MDP. Show empirically that both estimates converge to the true value function.
Implement TD(0). Compare convergence speed and variance against MC on the random-walk problem.
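
A possible starting point for the comparison, not a reference solution. It is a minimal sketch that assumes a five-state random walk in the style of Sutton and Barto's Example 6.2 (start in the middle, reward 1 only on exiting to the right, $\gamma = 1$). The names `run_episode`, `td0`, `first_visit_mc`, and `rmse` are hypothetical, and the step size and initial values are illustrative choices.

```python
import random

# Assumption: 5 non-terminal states 0..4, terminals just outside each end.
N_STATES = 5
# With gamma = 1, the true value of state i is the probability of exiting right.
TRUE_VALUES = [(i + 1) / (N_STATES + 1) for i in range(N_STATES)]


def run_episode(rng):
    """Return a list of (state, reward, next_state); next_state is None at termination."""
    state = N_STATES // 2
    steps = []
    while state is not None:
        nxt = state + rng.choice((-1, 1))
        reward = 1.0 if nxt == N_STATES else 0.0
        if nxt < 0 or nxt >= N_STATES:
            nxt = None  # stepped off either end: episode terminates
        steps.append((state, reward, nxt))
        state = nxt
    return steps


def td0(episodes, alpha=0.1, gamma=1.0, seed=0):
    """Tabular TD(0) with a constant step size, updating after every step."""
    rng = random.Random(seed)
    v = [0.5] * N_STATES
    for _ in range(episodes):
        for s, r, s2 in run_episode(rng):
            bootstrap = gamma * v[s2] if s2 is not None else 0.0
            v[s] += alpha * (r + bootstrap - v[s])
    return v


def first_visit_mc(episodes, gamma=1.0, seed=0):
    """First-visit Monte Carlo: average the return from each state's first visit."""
    rng = random.Random(seed)
    totals = [0.0] * N_STATES
    counts = [0] * N_STATES
    for _ in range(episodes):
        steps = run_episode(rng)
        # Compute the return from every time step by sweeping backwards.
        returns = [0.0] * len(steps)
        g = 0.0
        for t in reversed(range(len(steps))):
            g = steps[t][1] + gamma * g
            returns[t] = g
        seen = set()
        for t, (s, _, _) in enumerate(steps):
            if s not in seen:  # only the first visit in this episode counts
                seen.add(s)
                totals[s] += returns[t]
                counts[s] += 1
    return [totals[s] / counts[s] if counts[s] else 0.0 for s in range(N_STATES)]


def rmse(v):
    """Root-mean-square error against the known true values."""
    return (sum((a - b) ** 2 for a, b in zip(v, TRUE_VALUES)) / N_STATES) ** 0.5
```

To compare variance, one option is to call `rmse(td0(n, seed=k))` and `rmse(first_visit_mc(n, seed=k))` for several seeds `k` and a range of episode counts `n`, then plot the mean and spread of the error. Every-visit MC follows from `first_visit_mc` by dropping the `seen` check.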

Reference text for this week: chapter 03 of the bilingual notes — EN PDF · FR PDF.