Week 06 — Large-Scale Training — Distributed Training

When one GPU stops being enough: data parallelism, model parallelism, and the engineering of training at scale.

MLOps · Week 06 of 12

Lecture

Data parallel (DDP, ZeRO) · tensor parallel · pipeline parallel · model parallel · gradient accumulation and gradient checkpointing · mixed precision (FP16, BF16, FP8) · FSDP and DeepSpeed · NCCL and the GPU-interconnect layer.

Read before the lecture

Problem set

Memo 1 — Pick a training topology

  1. For a 7B-parameter model trained on 8 A100 GPUs, design the parallelism topology and justify your choice. Deliverable: an 800-word memo.
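
A useful first step for the memo is a rough memory budget. The sketch below is a starting point, not a recommended answer. It assumes mixed-precision training with Adam at the commonly cited 16 bytes per parameter (2 for half-precision weights, 2 for gradients, 12 for FP32 master weights plus the two Adam moments), and it assumes the 80 GB A100 variant, which the assignment does not specify. The function name `model_state_gb` and the constants are hypothetical, introduced only for illustration. Activations, communication buffers, and fragmentation are deliberately left out, so real usage will be higher.

```python
# Back-of-the-envelope model-state memory per GPU under ZeRO-style sharding.
# Assumptions (not from the assignment): mixed-precision Adam, 80 GB A100s.

PARAMS = 7e9            # 7B parameters
NUM_GPUS = 8
GPU_MEMORY_GB = 80      # assumption: 80 GB A100; the 40 GB variant also exists

BYTES_WEIGHTS = 2       # FP16/BF16 working copy of the weights
BYTES_GRADS = 2         # FP16/BF16 gradients
BYTES_OPTIMIZER = 12    # FP32 master weights + Adam momentum + Adam variance


def model_state_gb(params, num_gpus, zero_stage):
    """Per-GPU memory for weights, gradients, and optimizer state, in GB (1e9 bytes).

    zero_stage 0 replicates everything (plain data parallel);
    stage 1 shards optimizer state, stage 2 also shards gradients,
    stage 3 also shards the weights. Activations are not counted.
    """
    weights = params * BYTES_WEIGHTS
    grads = params * BYTES_GRADS
    optim = params * BYTES_OPTIMIZER
    if zero_stage >= 1:
        optim /= num_gpus
    if zero_stage >= 2:
        grads /= num_gpus
    if zero_stage >= 3:
        weights /= num_gpus
    return (weights + grads + optim) / 1e9


for stage in range(4):
    need = model_state_gb(PARAMS, NUM_GPUS, stage)
    verdict = "fits" if need < GPU_MEMORY_GB else "does not fit"
    print(f"ZeRO stage {stage}: {need:6.2f} GB per GPU ({verdict} before activations)")
```

Under these assumptions, fully replicated data parallelism needs about 112 GB of model state per GPU, which exceeds a single A100, while sharding the optimizer state alone brings it to about 38.5 GB. The memo still has to argue the remaining trade-offs (activation memory, interconnect bandwidth, and whether tensor or pipeline parallelism is needed at all), which this estimate does not settle.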

Reference text for this week: chapter 06 of the bilingual notes — EN PDF · FR PDF.