Week 02 — The Transformer Architecture
Vaswani et al. 2017 in eight pages — and what every successor since has changed about it.
Lecture
Attention as a content-addressable lookup · multi-head attention · positional encodings (absolute, RoPE, ALiBi) · encoder-decoder vs decoder-only vs encoder-only · the quadratic-in-sequence-length problem · FlashAttention and its descendants.
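To make "attention as a content-addressable lookup" concrete, here is a minimal single-head sketch in PyTorch. It assumes unbatched inputs of shape (seq_len, d_head) and a causal mask as used in decoder-only models; the function name `causal_attention` and the toy sizes are illustrative choices, not taken from the course materials. Note that the score matrix is seq_len × seq_len, which is where the quadratic-in-sequence-length cost comes from.

```python
import math
import torch


def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask.

    q, k, v: tensors of shape (seq_len, d_head).
    Returns the attended values and the attention weights.
    """
    seq_len, d_head = q.shape
    # Each query is compared against every key: a (seq_len, seq_len) matrix.
    scores = q @ k.T / math.sqrt(d_head)
    # True above the diagonal marks "future" positions a query must not see.
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    # Softmax turns match scores into lookup weights that sum to 1 per query.
    weights = torch.softmax(scores, dim=-1)
    # The "lookup result" is a weighted average of the values.
    return weights @ v, weights


torch.manual_seed(0)
x = torch.randn(5, 8)            # toy sequence: 5 positions, 8 dimensions
out, weights = causal_attention(x, x, x)
print(out.shape)                 # shape of the attended values
print(weights[0])                # first position can only attend to itself
```

Multi-head attention runs several such lookups in parallel on smaller slices of the model dimension and concatenates the results.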
Read before the lecture
Code lab
Lab 1 — Implement a small transformer from scratch
Implement a 4-layer decoder-only transformer in pure PyTorch (no nn.Transformer). Train on TinyStories or a Shakespeare corpus. Generate samples.
Notebook: lab01-mini-transformer.ipynb · Dataset: TinyStories or Shakespeare corpus.
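As a starting point for the lab, the overall shape of such a model might look like the sketch below. It is one possible layout under stated assumptions, not the reference solution from the notebook: the class names `DecoderBlock` and `MiniTransformer`, the pre-norm residual arrangement, learned absolute position embeddings, and every hyperparameter (vocabulary size 100, d_model 64, 4 heads, context length 128) are hypothetical choices made for illustration. Only the 4-layer depth and the "no nn.Transformer" constraint come from the lab description.

```python
import math
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    """One pre-norm decoder block: causal self-attention, then an MLP."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.ln1 = nn.LayerNorm(d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projection
        self.proj = nn.Linear(d_model, d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(self.ln1(x)).chunk(3, dim=-1)

        def split_heads(t):
            # (B, T, C) -> (B, n_heads, T, d_head)
            return t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        future = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        scores = scores.masked_fill(future, float("-inf"))
        attn = torch.softmax(scores, dim=-1) @ v      # (B, n_heads, T, d_head)
        attn = attn.transpose(1, 2).contiguous().view(B, T, C)
        x = x + self.proj(attn)                       # residual around attention
        return x + self.mlp(self.ln2(x))              # residual around the MLP


class MiniTransformer(nn.Module):
    """Decoder-only language model with learned absolute positions (assumed)."""

    def __init__(self, vocab_size=100, d_model=64, n_heads=4, n_layers=4, max_len=128):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList(
            [DecoderBlock(d_model, n_heads) for _ in range(n_layers)]
        )
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        B, T = idx.shape
        positions = torch.arange(T, device=idx.device)
        x = self.tok(idx) + self.pos(positions)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))                # (B, T, vocab_size) logits


model = MiniTransformer()
logits = model(torch.randint(0, 100, (2, 16)))
print(logits.shape)
```

Training then reduces to a cross-entropy loss between `logits[:, :-1]` and the input tokens shifted by one position; sampling feeds the model its own output one token at a time.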
Reference text for this week: chapter 02 of the bilingual notes — EN PDF · FR PDF.