Week 11 — Applications

Three case studies that drove RL into the mainstream: games (AlphaGo line), robotics (Sim2Real), and RLHF for LLMs.

Lecture

AlphaGo, AlphaZero, MuZero — the architectural arc · Sim2Real for robotics and the domain-randomization trick · RLHF (Christiano 2017 → Ouyang 2022) and DPO (Rafailov 2023) as the LLM-alignment workhorses · the new direct-preference family.

Read before the lecture

Recitation — paper discussion

Rafailov et al., *Direct Preference Optimization: Your Language Model is Secretly a Reward Model* (NeurIPS 2023) (paper)

Come ready to argue one side of each:

  • DPO removes the explicit reward model — does it remove RL from RLHF, or just hide it?
  • What evidence would settle the DPO-vs-PPO debate?
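
As a warm-up for the first question, here is a minimal sketch of the per-pair DPO loss from the paper. It assumes you already have summed sequence log-probabilities for the chosen and rejected responses under both the trained policy and the frozen reference model. The function and argument names and the default `beta=0.1` are illustrative choices, not taken from any particular library.

```python
import math


def implicit_reward(policy_logp: float, ref_logp: float, beta: float) -> float:
    """Reward implied by the policy: beta * log(pi(y|x) / pi_ref(y|x))."""
    return beta * (policy_logp - ref_logp)


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative default, an assumption
) -> float:
    """Per-pair DPO loss: -log sigmoid(r_chosen - r_rejected)."""
    margin = implicit_reward(policy_logp_chosen, ref_logp_chosen, beta) - (
        implicit_reward(policy_logp_rejected, ref_logp_rejected, beta)
    )
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, loss is log(2).
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy shifts probability toward the chosen response: loss drops.
loss_after_shift = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

No reward network appears anywhere, yet `implicit_reward` plays exactly the role a learned reward would play in a Bradley-Terry preference model. Whether that counts as removing RL or hiding it is the point to argue.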

Reference text for this week: chapter 11 of the bilingual notes — EN PDF · FR PDF.