MLOps · Week 06 of 12

Week 06 — Large-Scale Training — Distributed Training

When one GPU stops being enough: data parallelism, model parallelism, and the engineering of training at scale.

Lecture

Data parallel (DDP, ZeRO) · tensor parallel · pipeline parallel · model parallel · gradient accumulation and gradient checkpointing · mixed precision (FP16, BF16, FP8) · FSDP and DeepSpeed · NCCL and the GPU-interconnect layer.

Read before the lecture

Rajbhandari et al., *ZeRO: Memory Optimizations Toward Training Trillion Parameter Models* (SC 2020)

Problem set

Memo 1 — Pick a training topology

For a 7B-parameter model trained on 8 A100s, design the parallelism topology and justify the choice. Memo, 800 words.
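A useful first step for the memo is a memory budget. The sketch below follows the model-state accounting in the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for optimizer state (FP32 master weights, momentum, and variance). The function name `zero_model_state_gb` is a hypothetical helper, not part of any library, and the estimate deliberately ignores activations, temporary buffers, and fragmentation, so treat it as a lower bound rather than a sizing guarantee.

```python
def zero_model_state_gb(params, n_gpus, stage):
    """Per-GPU model-state memory in GB (1 GB = 1e9 bytes).

    Assumes mixed-precision Adam as accounted for in the ZeRO paper.
    stage 0 = plain data parallel (full replica on every GPU),
    stage 1 = optimizer state partitioned,
    stage 2 = optimizer state + gradients partitioned,
    stage 3 = optimizer state + gradients + weights partitioned.
    Activations and temporary buffers are NOT included.
    """
    weights = 2 * params   # FP16 parameters
    grads = 2 * params     # FP16 gradients
    optim = 12 * params    # FP32 master weights + Adam momentum + variance
    if stage >= 1:
        optim /= n_gpus
    if stage >= 2:
        grads /= n_gpus
    if stage >= 3:
        weights /= n_gpus
    return (weights + grads + optim) / 1e9


if __name__ == "__main__":
    PARAMS = 7e9   # 7B-parameter model, from the problem statement
    N_GPUS = 8     # 8 A100s, from the problem statement
    for stage in range(4):
        gb = zero_model_state_gb(PARAMS, N_GPUS, stage)
        print(f"ZeRO stage {stage}: {gb:.2f} GB of model state per GPU")
```

Under these assumptions the model state alone comes to 112 GB per GPU with plain data parallelism, 38.5 GB at stage 1, 26.25 GB at stage 2, and 14 GB at stage 3. The problem statement does not say whether the A100s are the 40 GB or 80 GB variant, so the memo should state which one it assumes and leave headroom for activations.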

Reference text for this week: chapter 06 of the bilingual notes — EN PDF · FR PDF.