Reinforcement Learning


The operational schedule for Reinforcement Learning. Per-cohort dates are filled in at intake; the week-by-week structure below is stable across cohorts.

The single source of truth is _data/apprentissage-renforcement.yml; edits there propagate to this page automatically.

Week	Title	Pitch	Detail
01	Markov Decision Processes	The mathematical scaffold under everything in RL: states, actions, transitions, rewards, and the principle of optimality.	week 01 →
02	Dynamic Programming	When you have the model, you don't need to learn — you compute. The reference algorithms every learner approximates.	week 02 →
03	Monte Carlo and Temporal Difference Methods	When the model isn't known: learn from sampled trajectories. The MC / TD spectrum is the conceptual backbone of model-free RL.	week 03 →
04	Q-Learning and Temporal Difference Methods	The off-policy algorithm that unlocks RL: learn the optimal action-value function from data generated by any sufficiently exploratory behavior policy.	week 04 →
05	Deep Q-Networks and Variants	When the state space is too large for a table: DQN, Atari, and the deep-learning era of RL.	week 05 →
06	Policy Gradient Methods	Directly optimize the policy, not the value function. The REINFORCE algorithm and its descendants.	week 06 →
07	Actor-Critic Methods	Combine policy gradients with a learned value function — the best of both worlds, and the foundation of most modern RL.	week 07 →
08	State-of-the-Art Algorithms	What labs and industry are using in 2026: PPO, SAC, MuZero, IMPALA, and a tour of recent improvements.	week 08 →
09	Multi-Agent Reinforcement Learning	When more than one agent learns at once: cooperation, competition, and the non-stationarity of every other agent.	week 09 →
10	Constrained and Safe RL	When the agent must respect constraints (safety, fairness, resource limits) even during exploration.	week 10 →
11	Applications	Three case studies that drove RL into the mainstream: games (AlphaGo line), robotics (Sim2Real), and RLHF for LLMs.	week 11 →
12	Final project presentations	Each participant trains an RL agent on an environment of their choice and presents results.	week 12 →
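Weeks 02 and 04 fit together: dynamic programming computes the optimal action-value function Q* when the model is known, and Q-learning approximates the same Q* from sampled transitions alone. The sketch below is a minimal illustration under stated assumptions. The five-state chain, the `step` function, and the hyperparameters (`gamma=0.9`, `alpha=0.1`, 2000 episodes) are hypothetical choices made for this example and are not course environments or lab settings. Value iteration produces the reference Q*. Tabular Q-learning is then driven by a uniformly random behavior policy, which shows the off-policy property.

```python
import random

# Hypothetical toy environment (not a course environment): a 5-state chain.
# States 0..3 are non-terminal; state 4 is terminal. Action 0 moves left
# (clamped at 0), action 1 moves right. Entering state 4 pays reward 1.
N_STATES = 5
TERMINAL = 4
ACTIONS = (0, 1)


def step(s, a):
    """Deterministic transition: returns (next_state, reward, done)."""
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, TERMINAL)
    reward = 1.0 if s2 == TERMINAL else 0.0
    return s2, reward, s2 == TERMINAL


def value_iteration(gamma=0.9, tol=1e-10):
    """Week 02 style: the model (step) is known, so Q* is computed by sweeping
    the Bellman optimality backup until no entry changes by more than tol."""
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    while True:
        delta = 0.0
        for s in range(TERMINAL):
            for a in ACTIONS:
                s2, r, done = step(s, a)
                target = r if done else r + gamma * max(q[s2])
                delta = max(delta, abs(target - q[s][a]))
                q[s][a] = target
        if delta < tol:
            return q


def q_learning(episodes=2000, alpha=0.1, gamma=0.9, seed=0):
    """Week 04 style: only sampled transitions are used. Actions come from a
    uniformly random behavior policy, while the update bootstraps from the
    greedy max, so the estimate targets Q* (off-policy)."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice(ACTIONS)  # behavior policy ignores q entirely
            s2, r, done = step(s, a)
            target = r if done else r + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])
            s = s2
    return q


def greedy_policy(q):
    """Greedy action per non-terminal state (ties go to the lower action)."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(TERMINAL)]


if __name__ == "__main__":
    q_star = value_iteration()
    q_hat = q_learning()
    gap = max(abs(q_hat[s][a] - q_star[s][a])
              for s in range(TERMINAL) for a in ACTIONS)
    print("greedy policy from Q-learning:", greedy_policy(q_hat))
    print("max |Q_hat - Q*|:", gap)
```

On this chain, Q* prefers moving right in every non-terminal state. For example, Q*(0, right) = 0.9^3 = 0.729 and Q*(0, left) = 0.6561. Q-learning recovers this greedy policy even though it never follows it while collecting data. This works only because the random behavior policy keeps visiting every state-action pair. A behavior policy that never tried some action could not learn that action's value.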

Operational notes

Default timezone: Africa/Lagos (UTC+1). Per-cohort timing negotiated at intake.
Lab notebooks and problem-set repos live in the cohort GitHub organization.
The bilingual lecture notes remain the reference text.