Week 02 — The Transformer Architecture

Vaswani et al. 2017 in eight pages — and what every successor since has changed about it.

Lecture

Attention as a content-addressable lookup · multi-head attention · positional encodings (absolute, RoPE, ALiBi) · encoder-decoder vs decoder-only vs encoder-only · the quadratic-in-sequence-length problem · FlashAttention and its descendants.
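To make the "content-addressable lookup" framing concrete, here is a minimal sketch of scaled dot-product attention using only the Python standard library. The function names (softmax, attention) and the list-of-lists representation are illustrative assumptions, not code from the course notebook. Each query is scored against every key, the scores are normalised with a softmax, and the output is the weighted average of the values. Because every query is compared with every key, the score table has n × n entries for a sequence of length n, which is where the quadratic cost comes from.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of vectors.

    Returns (outputs, weights). weights[i][j] is how much query i
    attends to key j; each row of weights sums to 1.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        # Similarity of this query to every key: one row of the n x n table.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if causal:
            # Decoder-style mask: position i may not look at positions j > i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        # Soft lookup: a weighted average of the values.
        out = [sum(w_j * v[c] for w_j, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# A query that resembles the first key retrieves mostly the first value.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out, w = attention([[5.0, 0.0]], keys, values)
print(w[0], out[0])
```

Multi-head attention runs several such lookups in parallel on lower-dimensional projections of the same inputs and concatenates the results.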

Read before the lecture

Code lab

Lab 1 — Implement a small transformer from scratch

Implement a 4-layer decoder-only transformer in pure PyTorch (no nn.Transformer). Train it on TinyStories or a Shakespeare corpus, then generate samples from the trained model.
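As a starting point, the skeleton below is one possible shape for such a model. It is a sketch under stated assumptions, not the notebook's reference solution. The class names (CausalSelfAttention, Block, MiniGPT), the hyperparameters, the pre-norm layout, the learned absolute position embeddings and the vocabulary size of 65 are all illustrative choices. Tokenisation, the training loop and sampling are left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projection
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_heads, T, head_dim)
        q, k, v = (t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        # Causal mask: True above the diagonal marks future positions.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device),
                          diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)


class Block(nn.Module):
    """Pre-norm block (an assumption; Vaswani et al. 2017 used post-norm)."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        return x + self.mlp(self.ln2(x))


class MiniGPT(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=4, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learned absolute positions
        self.blocks = nn.ModuleList(
            [Block(d_model, n_heads) for _ in range(n_layers)]
        )
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))   # (B, T, vocab_size) logits


# Hypothetical usage: vocabulary of 65 symbols, batch of 2 sequences of length 16.
model = MiniGPT(vocab_size=65)
idx = torch.randint(0, 65, (2, 16))
logits = model(idx)
# Next-token loss: predict token t+1 from positions up to t.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, 65), idx[:, 1:].reshape(-1))
```

A useful sanity check before training is that changing the last input token leaves the logits at all earlier positions unchanged. If it does not, the causal mask is wrong.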

Notebook: lab01-mini-transformer.ipynb  ·  Dataset: TinyStories or Shakespeare corpus.


Reference text for this week: chapter 02 of the bilingual notes — EN PDF · FR PDF.