GenAI · Week 08 of 10

Week 08 — Evaluation, Safety, and Alignment

How do we know a generative model is actually good? Spoiler: we don't, not entirely.

Lecture

Capability benchmarks (MMLU, HumanEval, GSM8K, HELM) · safety benchmarks (TruthfulQA, BBQ, RealToxicityPrompts) · LLM-as-judge · human eval methodology · red-teaming · the alignment problem at survey level (RLHF, constitutional AI, debate, scalable oversight).

Read before the lecture

Lin et al., *TruthfulQA: Measuring How Models Mimic Human Falsehoods* (ACL 2022)

Recitation — paper discussion

Bai et al., *Constitutional AI: Harmlessness from AI Feedback* (Anthropic 2022)

Come ready to argue one side of each:

Does constitutional AI replace RLHF or supplement it?
What's the right benchmark for harmlessness in 2026?

Reference text for this week: chapter 08 of the bilingual notes — EN PDF · FR PDF.