Week 03 — Monte Carlo and Temporal Difference Methods
When the model isn't known: learn from sampled trajectories. The MC / TD spectrum is the conceptual backbone of model-free RL.
Lecture
Monte Carlo policy evaluation · the exploration-exploitation problem · the $\varepsilon$-greedy and softmax strategies · TD(0) and the TD update · why TD has lower variance than MC, at the cost of bias from bootstrapping.
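As a preview of the lecture topics, here is a minimal Python sketch of the two exploration strategies and of a single TD(0) update. The function names (`epsilon_greedy`, `softmax_probs`, `td0_update`), the list-based value tables, and the `temperature` parameter are illustrative assumptions, not the notation used in the lecture or in the notes.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    return q_values.index(best)  # ties are broken by the lowest index

def softmax_probs(q_values, temperature=1.0):
    """Boltzmann action probabilities; the max is subtracted for numerical stability."""
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """One TD(0) step: move V[s] toward the bootstrapped target r + gamma * V[s_next]."""
    target = r if terminal else r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V[s]
```

The TD target uses the current estimate `V[s_next]` in place of the full sampled return that Monte Carlo would wait for. That substitution is what trades variance for bias.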
Read before the lecture
- Sutton and Barto, chapters 5–6
Problem set
PS2 — MC vs TD
- Implement first-visit MC and every-visit MC on a simple MDP. Show empirically that both converge to the true value function of the evaluated policy.
- Implement TD(0). Compare convergence speed and variance against MC on the random-walk problem.
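One possible starting point for the comparison is sketched below. It assumes the five-state random walk from Sutton and Barto (start in the centre, reward 1 only on exiting to the right, $\gamma = 1$, true values $1/6, \dots, 5/6$); the problem set may specify a different variant. All names here (`run_episode`, `td0`, `first_visit_mc`, `rms_error`) and the step sizes are hypothetical choices for illustration, not a reference solution.

```python
import random

N_STATES = 5                              # non-terminal states 1..5; 0 and 6 are terminal
TRUE_V = [i / 6 for i in range(1, 6)]     # known values under the random policy

def run_episode(rng):
    """Return (state, reward, next_state) transitions, starting from the centre state."""
    s, transitions = 3, []
    while 0 < s < N_STATES + 1:
        s_next = s + rng.choice((-1, 1))
        r = 1.0 if s_next == N_STATES + 1 else 0.0
        transitions.append((s, r, s_next))
        s = s_next
    return transitions

def td0(episodes, alpha=0.1, seed=0):
    """TD(0) with gamma = 1; terminal values stay fixed at 0."""
    rng = random.Random(seed)
    V = [0.0] + [0.5] * N_STATES + [0.0]
    for _ in range(episodes):
        for s, r, s_next in run_episode(rng):
            V[s] += alpha * (r + V[s_next] - V[s])
    return V[1:-1]

def first_visit_mc(episodes, seed=0):
    """First-visit MC with incremental sample averages of the return (gamma = 1)."""
    rng = random.Random(seed)
    V = [0.0] * (N_STATES + 2)
    counts = [0] * (N_STATES + 2)
    for _ in range(episodes):
        G, returns = 0.0, []
        for s, r, _ in reversed(run_episode(rng)):   # accumulate returns backward
            G = r + G
            returns.append((s, G))
        seen = set()
        for s, G in reversed(returns):               # replay forward in time
            if s not in seen:                        # only the first visit counts
                seen.add(s)
                counts[s] += 1
                V[s] += (G - V[s]) / counts[s]
    return V[1:-1]

def rms_error(V):
    """Root-mean-square error against the true values, averaged over states."""
    return (sum((v - t) ** 2 for v, t in zip(V, TRUE_V)) / N_STATES) ** 0.5
```

Calling `rms_error(td0(n))` and `rms_error(first_visit_mc(n))` for increasing `n`, and across several seeds, gives the raw numbers for a convergence-speed and variance comparison. Dropping the `seen` check turns the second function into every-visit MC.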
Reference text for this week: chapter 03 of the bilingual notes — EN PDF · FR PDF.