Week 11 — Applications
Three case studies that drove RL into the mainstream: games (AlphaGo line), robotics (Sim2Real), and RLHF for LLMs.
Lecture
AlphaGo, AlphaZero, MuZero — the architectural arc · Sim2Real for robotics and the domain-randomization trick · RLHF (Christiano 2017 → Ouyang 2022) and DPO (Rafailov 2023) as the LLM-alignment workhorses · the new direct-preference family.
Read before the lecture
Recitation — paper discussion
Rafailov et al., *Direct Preference Optimization: Your Language Model is Secretly a Reward Model* (NeurIPS 2023) (paper)
Come ready to argue one side of each:
- DPO removes the explicit reward model — does it remove RL from RLHF, or just hide it?
- What evidence would settle the DPO-vs-PPO debate?
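
As a concrete anchor for the first question, here is a minimal sketch of the per-example DPO objective from the paper, written with the standard library only. It assumes you already have summed log-probabilities of the chosen and rejected responses under the trainable policy and under the frozen reference model. The function name `dpo_loss`, its argument names, and the numbers in the usage lines are hypothetical and chosen for illustration; this is not a reference implementation.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss for one (chosen, rejected) preference pair.

    All inputs are log-probabilities of a full response given the prompt.
    Returns (loss, implicit_reward_chosen, implicit_reward_rejected).
    """
    # Implicit rewards: beta * log(pi(y|x) / pi_ref(y|x)).
    # No separate reward model is trained; the policy itself defines the reward.
    reward_chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # computed in a numerically stable form.
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return loss, reward_chosen, reward_rejected


# Hypothetical log-probabilities, for illustration only.
loss, r_chosen, r_rejected = dpo_loss(-4.0, -8.0, -5.0, -7.0, beta=0.5)
print(loss, r_chosen, r_rejected)
```

When the policy equals the reference, both implicit rewards are zero and the loss is log 2; the loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference. Whether that implicit reward still counts as "RL" is the point to argue.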
Reference text for this week: chapter 11 of the bilingual notes — EN PDF · FR PDF.