Week 08 — Evaluation, Safety, and Alignment

How do we know a generative model is actually good? Spoiler: we don't, entirely.

Lecture

Capability benchmarks (MMLU, HumanEval, GSM8K, HELM) · safety benchmarks (TruthfulQA, BBQ, RealToxicityPrompts) · LLM-as-judge · human evaluation methodology · red-teaming · the alignment problem at survey level (RLHF, constitutional AI, debate, scalable oversight).

Read before the lecture

Recitation — paper discussion

Bai et al., *Constitutional AI: Harmlessness from AI Feedback* (Anthropic 2022) (paper)

Come ready to argue one side of each:

  • Does constitutional AI replace RLHF or supplement it?
  • What's the right benchmark for harmlessness in 2026?

Reference text for this week: chapter 08 of the bilingual notes — EN PDF · FR PDF.