GenAI · Week 02 of 10

Week 02 — The Transformer Architecture

Vaswani et al. (2017) in eight pages, and what its successors have changed about it.

Lecture

Attention as a content-addressable lookup · multi-head attention · positional encodings (absolute, RoPE, ALiBi) · encoder-decoder vs decoder-only vs encoder-only · the quadratic-in-sequence-length problem · FlashAttention and its descendants.
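To make the "content-addressable lookup" framing concrete before the lecture, here is a minimal sketch of single-query scaled dot-product attention in plain Python. The function names (`softmax`, `attend`) and the toy keys, values, and query are illustrative assumptions, not taken from the course notes. The query is compared with every key, the similarities are turned into weights by a softmax, and the output is the weighted average of the values: a soft version of a dictionary lookup.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Soft lookup: score the query against each key, then average the values."""
    d_k = len(query)
    # Scaled dot product between the query and each key.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted average of the value vectors, one output dimension at a time.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

# Toy "memory" (illustrative numbers): three key/value pairs.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

# This query points along the first key and away from the second.
query = [4.0, -4.0]
output, weights = attend(query, keys, values)
print(weights)  # the first key receives most of the weight
print(output)   # so the output lies close to the first value
```

In self-attention every one of the n tokens issues its own query against all n keys, so the score table has n × n entries. That table is the source of the quadratic-in-sequence-length cost listed above.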

Read before the lecture

Vaswani et al., *Attention Is All You Need* (NeurIPS 2017)

Code lab

Lab 1 — Implement a small transformer from scratch

Implement a 4-layer decoder-only transformer in pure PyTorch (no nn.Transformer). Train it on TinyStories or a Shakespeare corpus, then generate samples from the trained model.

Notebook: lab01-mini-transformer.ipynb · Dataset: TinyStories or Shakespeare corpus.
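As a possible starting point for the lab, the sketch below shows one way to lay out a 4-layer decoder-only model without `nn.Transformer`. It is not the notebook's reference solution. The class names (`CausalSelfAttention`, `Block`, `MiniGPT`), the hyperparameters, the vocabulary size of 65, the pre-norm layout, and the learned positional embeddings are all assumptions chosen for brevity. The original paper uses post-norm and sinusoidal encodings, and the lecture covers RoPE and ALiBi as alternatives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # joint Q, K, V projection
        self.proj = nn.Linear(d_model, d_model)      # output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_heads, T, head_dim)
        q, k, v = (
            t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
            for t in (q, k, v)
        )
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # (B, h, T, T)
        # Causal mask: position i may only attend to positions <= i.
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~mask, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)

class Block(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Pre-norm residual connections (an assumption, see the note above).
        x = x + self.attn(self.ln1(x))
        return x + self.mlp(self.ln2(x))

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, block_size=64, d_model=64, n_heads=4, n_layers=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(block_size, d_model)  # learned absolute positions
        self.blocks = nn.ModuleList([Block(d_model, n_heads) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        # idx: (B, T) integer token ids, with T <= block_size
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))  # (B, T, vocab_size) next-token logits

torch.manual_seed(0)
model = MiniGPT(vocab_size=65)          # 65 is a placeholder vocabulary size
idx = torch.randint(0, 65, (2, 16))     # random token ids standing in for real data
logits = model(idx)
print(logits.shape)
```

Training would typically minimize cross-entropy between `logits[:, :-1]` and the input shifted by one token (`idx[:, 1:]`). Sampling then repeatedly feeds the sequence back in and draws the next token from the softmax of the last position's logits.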

Reference text for this week: chapter 02 of the bilingual notes — EN PDF · FR PDF.