Week 06 — Large-Scale Training — Distributed Training
When one GPU stops being enough: data parallelism, model parallelism, and the engineering of training at scale.
Lecture
Data parallel (DDP, ZeRO) · tensor parallel · pipeline parallel · model parallel · gradient accumulation and gradient checkpointing · mixed precision (FP16, BF16, FP8) · FSDP and DeepSpeed · NCCL and the GPU-interconnect layer.
Read before the lecture
- Rajbhandari et al., *ZeRO: Memory Optimizations Toward Training Trillion Parameter Models* (SC 2020)
Problem set
Memo 1 — Pick a training topology
- For a 7B-parameter model trained on 8 A100s, design the parallelism topology and justify the choice. Memo, 800 words.
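A possible starting point for the memo is a back-of-envelope memory estimate. The sketch below follows the accounting in the ZeRO paper from the reading list: with mixed-precision Adam, each parameter costs 2 bytes (FP16 weights) + 2 bytes (FP16 gradients) + K = 12 bytes (FP32 master weights, momentum, and variance). The function name `zero_memory_gb` is a hypothetical helper, not part of any library. The numbers cover model states only; activations, communication buffers, and fragmentation are ignored, and the assignment does not say whether the A100s are the 40 GB or 80 GB variant.

```python
def zero_memory_gb(num_params, num_gpus, k=12):
    """Per-GPU model-state memory in GB (1e9 bytes), ZeRO-paper accounting.

    Assumes mixed-precision Adam: 2 bytes/param for FP16 weights,
    2 bytes/param for FP16 gradients, k bytes/param for optimizer states.
    Activations and temporary buffers are NOT included.
    """
    params = 2 * num_params   # FP16 weights
    grads = 2 * num_params    # FP16 gradients
    optim = k * num_params    # FP32 master weights + Adam momentum + variance
    to_gb = 1e-9
    return {
        # Plain data parallelism: every GPU holds a full replica.
        "baseline_ddp": (params + grads + optim) * to_gb,
        # ZeRO stage 1: optimizer states are partitioned across GPUs.
        "zero_1": (params + grads + optim / num_gpus) * to_gb,
        # ZeRO stage 2: gradients are partitioned as well.
        "zero_2": (params + (grads + optim) / num_gpus) * to_gb,
        # ZeRO stage 3: weights, gradients, and optimizer states all partitioned.
        "zero_3": ((params + grads + optim) / num_gpus) * to_gb,
    }


# The problem-set setting: 7B parameters on 8 GPUs.
estimates = zero_memory_gb(7e9, 8)
for stage, gb in estimates.items():
    print(f"{stage}: {gb:.2f} GB per GPU")
```

Under these assumptions, a full replica needs about 112 GB of model state, which exceeds either A100 variant, so plain DDP alone does not fit. The partitioned stages come out at roughly 38.5 GB, 26.25 GB, and 14 GB per GPU. How much headroom remains for activations is part of what the memo should argue.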
Reference text for this week: chapter 06 of the bilingual notes — EN PDF · FR PDF.