Week 08 — Evaluation, Safety, and Alignment
How do you know your generative model is actually good? Spoiler: we don't entirely.
Lecture
Capability benchmarks (MMLU, HumanEval, GSM8K, HELM) · safety benchmarks (TruthfulQA, BBQ, RealToxicityPrompts) · LLM-as-judge · human eval methodology · red-teaming · the alignment problem at survey level (RLHF, constitutional AI, debate, scalable oversight).
Read before the lecture
Recitation — paper discussion
Bai et al., *Constitutional AI: Harmlessness from AI Feedback* (Anthropic, 2022)
Come ready to argue one side of each:
- Does constitutional AI replace RLHF or supplement it?
- What's the right benchmark for harmlessness in 2026?
Reference text for this week: chapter 08 of the bilingual notes — EN PDF · FR PDF.