Week 11 — Applications

Three case studies that drove RL into the mainstream: games (AlphaGo line), robotics (Sim2Real), and RLHF for LLMs.

Lecture

AlphaGo, AlphaZero, MuZero — the architectural arc · Sim2Real for robotics and the domain-randomization trick · RLHF (Christiano 2017 → Ouyang 2022) and DPO (Rafailov 2023) as the LLM-alignment workhorses · the new direct-preference family.

Read before the lecture

Ouyang et al., *Training language models to follow instructions with human feedback* (NeurIPS 2022, the InstructGPT paper)

Recitation — paper discussion

Rafailov et al., *Direct Preference Optimization: Your Language Model is Secretly a Reward Model* (NeurIPS 2023)

Come ready to argue one side of each:

DPO removes the explicit reward model — does it remove RL from RLHF, or just hide it?
What evidence would settle the DPO-vs-PPO debate?
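
For the first question, it may help to have the DPO objective in front of you. The sketch below computes the per-pair loss from Rafailov et al., assuming the summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model are already available. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative choices, not part of the course materials.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(implicit reward margin).

    All inputs are sequence-level log-probabilities (floats).
    beta is an assumed default, chosen only for illustration.
    """
    # Implicit rewards: beta * log(pi(y|x) / pi_ref(y|x)).
    # No separate reward model is trained; the policy defines the reward.
    reward_chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected

    # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, so the loss is log 2.
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Raising the chosen response's log-probability lowers the loss.
loss_after_update = dpo_loss(-4.0, -7.0, -5.0, -7.0)
```

The reward model has not vanished here: it appears as the log-ratio between policy and reference, which is one way to frame the "remove or hide" argument.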

Reference text for this week: chapter 11 of the bilingual notes — EN PDF · FR PDF.