Reinforcement Learning: From Theory to Practice

From MDPs and fixed-point theory to DQN, PPO, and topological perspectives on convergence — implemented from scratch.

Program Overview

This workshop provides a rigorous yet practical introduction to reinforcement learning (RL), connecting mathematical foundations — including topological and metric-space perspectives — to modern deep RL algorithms. Participants implement classic and deep RL methods from scratch, apply them to real-world problems, and gain insight into the mathematical structures underlying convergence and optimality.

Software Requirements

Python 3.10+
Libraries: numpy, gymnasium (the maintained successor to OpenAI Gym), matplotlib, torch (PyTorch), stable-baselines3
Optional: tensorboard, wandb

Day 1: Foundations — MDPs & Dynamic Programming

Objectives: Formalize RL problems and solve small MDPs exactly.

What is RL? — The agent-environment loop, reward hypothesis, comparison with supervised/unsupervised learning, applications overview
Markov Decision Processes — States, actions, transitions, rewards, discount factor \(\gamma\), policies, value functions \(V^\pi(s)\) and \(Q^\pi(s,a)\)
Bellman Equations — Bellman expectation equation, Bellman optimality equation, the Bellman operator \(\mathcal{T}\) as a \(\gamma\)-contraction in the sup norm
Dynamic Programming — Policy evaluation (iterative), policy improvement, policy iteration, value iteration. Convergence proofs via the Banach fixed-point theorem
Implementation — Code a GridWorld environment and solve it with policy iteration and value iteration. Visualize value functions and optimal policies
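
As a rough preview of the dynamic programming material above, here is a minimal value iteration sketch. The environment is an assumption made for illustration only: a hypothetical four-state corridor (not the workshop's GridWorld) with deterministic moves and a reward of 1 for entering the last state. The names `step`, `value_iteration`, and `greedy_policy` are likewise illustrative.

```python
GAMMA = 0.9
N_STATES = 4        # hypothetical corridor: states 0..3, state 3 is terminal
ACTIONS = (-1, +1)  # move left or move right


def step(state, action):
    """Deterministic transition; returns (next_state, reward)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def value_iteration(tol=1e-10, max_sweeps=10_000):
    """Apply the Bellman optimality backup until the largest change is below tol."""
    values = [0.0] * N_STATES  # the terminal state keeps value 0
    for _ in range(max_sweeps):
        new_values = values[:]
        delta = 0.0
        for s in range(N_STATES - 1):  # skip the terminal state
            backups = []
            for a in ACTIONS:
                s2, r = step(s, a)
                backups.append(r + GAMMA * values[s2])
            new_values[s] = max(backups)
            delta = max(delta, abs(new_values[s] - values[s]))
        values = new_values
        if delta < tol:
            break
    return values


def greedy_policy(values):
    """Pick, in each non-terminal state, the action with the best one-step backup."""
    def backup(s, a):
        s2, r = step(s, a)
        return r + GAMMA * values[s2]
    return [max(ACTIONS, key=lambda a, s=s: backup(s, a)) for s in range(N_STATES - 1)]


if __name__ == "__main__":
    v = value_iteration()
    print(v, greedy_policy(v))
```

On this corridor the values settle at roughly 0.81, 0.9, and 1.0 for states 0 to 2, and the greedy policy moves right in every state. A stochastic GridWorld would replace `step` with an expectation over next states.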

Lab 1: Implement a complete MDP solver: define a GridWorld with obstacles and rewards, implement policy evaluation, policy improvement, and value iteration. Visualize the optimal policy as arrows on the grid.

Homework: Solve a different MDP (e.g., FrozenLake from Gymnasium) using your implementations.

Day 2: Tabular Methods — MC, TD, Q-Learning

Objectives: Learn model-free RL methods that work without knowing the environment dynamics.

Monte Carlo Methods — First-visit vs. every-visit MC, MC prediction, MC control with ε-greedy exploration, importance sampling
Temporal Difference Learning — TD(0) prediction, the TD error \(\delta_t\), SARSA (on-policy TD control), Q-Learning (off-policy TD control), convergence comparison
Exploration vs. Exploitation — ε-greedy, softmax, UCB, optimistic initialization. The exploration-exploitation dilemma. Multi-armed bandits as a special case
Fixed-Point Perspective — The Bellman operator as a contraction in metric spaces, convergence rates, connections to quasi-metric structures, why tabular Q-learning converges under standard step-size and exploration conditions
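
The contraction claim in the fixed-point item can be written out in a few lines. The sketch below assumes a finite MDP with bounded rewards. The symbols \(r(s,a)\) for expected reward and \(P(s' \mid s,a)\) for the transition kernel are notation introduced here, not taken from the workshop materials. It uses the elementary bound \(|\max_x f(x) - \max_x g(x)| \le \max_x |f(x) - g(x)|\).

\[
\begin{aligned}
(\mathcal{T}Q)(s,a) &= r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q(s',a'), \\
\bigl|(\mathcal{T}Q_1)(s,a) - (\mathcal{T}Q_2)(s,a)\bigr|
  &= \gamma \Bigl|\sum_{s'} P(s' \mid s,a)\bigl[\max_{a'} Q_1(s',a') - \max_{a'} Q_2(s',a')\bigr]\Bigr| \\
  &\le \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} \bigl|Q_1(s',a') - Q_2(s',a')\bigr| \\
  &\le \gamma\, \|Q_1 - Q_2\|_\infty .
\end{aligned}
\]

Taking the supremum over \((s,a)\) gives \(\|\mathcal{T}Q_1 - \mathcal{T}Q_2\|_\infty \le \gamma \|Q_1 - Q_2\|_\infty\). For \(\gamma < 1\), the Banach fixed-point theorem then yields a unique fixed point \(Q^*\) and the geometric rate \(\|\mathcal{T}^k Q_0 - Q^*\|_\infty \le \gamma^k \|Q_0 - Q^*\|_\infty\). Tabular Q-learning can be read as a noisy, sampled version of this iteration.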

Lab 2: Implement Q-Learning and SARSA from scratch. Train agents on Gymnasium environments (Taxi-v3, CliffWalking). Compare learning curves and explore the effect of ε, α, and γ on convergence.
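
To make the difference between the two update rules concrete before the lab, here is a minimal sketch of tabular SARSA and Q-learning sharing one training loop. It is not the lab solution. It assumes a hypothetical five-state corridor instead of a Gymnasium environment, and the names `step`, `epsilon_greedy`, and `train` are illustrative. The only difference between the methods is the bootstrap term: SARSA uses the action actually selected next, while Q-learning uses the greedy maximum.

```python
import random

N_STATES = 5        # hypothetical corridor: states 0..4, state 4 is terminal
ACTIONS = (-1, +1)  # action index 0 = left, 1 = right


def step(state, action_index):
    """Deterministic transition; returns (next_state, reward, done)."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def epsilon_greedy(q_row, epsilon, rng):
    """Random action with probability epsilon, else greedy with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    return rng.choice([i for i, q in enumerate(q_row) if q == best])


def train(method, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Train a Q-table with method 'sarsa' or 'q_learning'."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _episode in range(episodes):
        state = 0
        action = epsilon_greedy(q[state], epsilon, rng)
        for _t in range(200):  # step cap so an episode always ends
            next_state, reward, done = step(state, action)
            # Sampled before the update so both methods share the loop;
            # for Q-learning this affects behavior only, not the target.
            next_action = epsilon_greedy(q[next_state], epsilon, rng)
            if done:
                bootstrap = 0.0                         # no bootstrap from terminal
            elif method == "sarsa":
                bootstrap = q[next_state][next_action]  # on-policy target
            else:
                bootstrap = max(q[next_state])          # off-policy greedy target
            td_error = reward + gamma * bootstrap - q[state][action]
            q[state][action] += alpha * td_error
            if done:
                break
            state, action = next_state, next_action
    return q


if __name__ == "__main__":
    for name in ("sarsa", "q_learning"):
        print(name, [[round(x, 3) for x in row] for row in train(name)])
```

With these settings, the Q-learning table prefers moving right in every non-terminal state. The SARSA values reflect the ε-greedy behavior policy, so they generally sit below the Q-learning values.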

Homework: Implement n-step TD and compare with 1-step TD on the same environment.

Day 3: Deep RL — DQN & Extensions

Objectives: Scale RL to high-dimensional problems with neural network function approximation.

Function Approximation — Why tabular methods don’t scale, linear function approximation, the deadly triad (function approximation + bootstrapping + off-policy), neural networks as approximators
Deep Q-Networks (DQN) — Experience replay, target networks, the DQN loss, ε-decay schedules. Implementation with PyTorch
DQN Extensions — Double DQN, Dueling DQN, Prioritized Experience Replay, Noisy Nets, Rainbow DQN (overview)
Practical DQN — Hyperparameter tuning, debugging tips, common failure modes, when DQN works well and when it doesn’t
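
Two of the ingredients listed above, experience replay and the bootstrapped target, can be sketched without any deep learning framework. The block below is a simplified illustration under stated assumptions: Q-values are plain Python lists standing in for network outputs, and the names `ReplayBuffer`, `dqn_target`, and `double_dqn_target` are hypothetical. A PyTorch version would compute the same quantities on batched tensors.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s', done) transitions."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Uniform sample without replacement."""
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def dqn_target(reward, done, q_target_next, gamma=0.99):
    """DQN target: r + gamma * max_a' Q_target(s', a'); no bootstrap if terminal."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN target: the online net selects the action, the target net scores it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]


if __name__ == "__main__":
    online_next = [1.0, 2.0]  # made-up Q_online(s', .)
    target_next = [3.0, 0.5]  # made-up Q_target(s', .)
    print(dqn_target(1.0, False, target_next, gamma=0.9))
    print(double_dqn_target(1.0, False, online_next, target_next, gamma=0.9))
```

With the made-up numbers in the usage block, the DQN target takes the maximum of the target network's estimates. The Double DQN target instead scores the action that the online network prefers, which gives a smaller value here. This decoupling of selection from evaluation is what reduces overestimation.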

Lab 3: Implement DQN from scratch in PyTorch. Train on CartPole-v1 and LunarLander-v2 (registered as LunarLander-v3 in Gymnasium 1.0 and later). Implement Double DQN and compare performance. Log training curves with TensorBoard.

Homework: Train DQN on a new environment and analyze the learned Q-values.

Day 4: Policy Gradient & Actor-Critic Methods

Objectives: Learn policy-based methods and modern actor-critic algorithms.

Policy Gradient Methods — Why optimize policies directly, the policy gradient theorem, REINFORCE algorithm, variance reduction with baselines
Actor-Critic Methods — Advantage function \(A(s,a)\), A2C (Advantage Actor-Critic), GAE (Generalized Advantage Estimation), entropy regularization
Proximal Policy Optimization (PPO) — Clipped surrogate objective, trust regions (intuition), PPO implementation, why PPO is the workhorse of modern RL
Stable-Baselines3 — Using SB3 for rapid prototyping: PPO, A2C, SAC. Custom environments, callbacks, evaluation, hyperparameter tuning with Optuna
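
Two quantities from the list above, GAE and the PPO clipped surrogate, reduce to short numerical recipes. The sketch below works on plain Python lists for a single trajectory segment and assumes no episode boundary inside the segment. The function names and default hyperparameters are illustrative and are not taken from Stable-Baselines3.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one segment.

    values[t] is the critic's estimate V(s_t); last_value is V of the state
    after the final step (use 0.0 if that state is terminal).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * lam * running              # discounted sum of TD errors
        advantages[t] = running
        next_value = values[t]
    return advantages


def ppo_clip_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate; ratios[t] = pi_new(a_t|s_t) / pi_old(a_t|s_t)."""
    terms = []
    for ratio, adv in zip(ratios, advantages):
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * adv, clipped * adv))  # pessimistic of the two
    return sum(terms) / len(terms)


if __name__ == "__main__":
    adv = gae_advantages([1.0, 1.0], [0.0, 0.0], last_value=0.0, gamma=1.0, lam=1.0)
    print(adv)
    print(ppo_clip_objective([1.5, 0.5], [1.0, -1.0]))
```

Setting `lam=0.0` reduces the estimator to one-step TD errors, while `lam=1.0` recovers Monte Carlo returns minus the value baseline. The objective is maximized, so an implementation based on gradient descent would minimize its negative.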

Lab 4: Implement REINFORCE from scratch. Then use Stable-Baselines3 to train PPO on continuous control tasks (MountainCarContinuous, Pendulum). Compare sample efficiency and stability across algorithms.
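
As a starting point for the REINFORCE part of the lab, the return computation and a simple baseline can be sketched as follows. This is only the bookkeeping around the policy gradient, not a full agent. The function names are hypothetical. Using the mean return of the same episode as the baseline is a common heuristic assumed here for simplicity, and it is not strictly independent of the sampled actions.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return-to-go G_t = r_t + gamma * r_{t+1} + ... for each step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def reinforce_weights(returns):
    """Subtract the mean return as a constant baseline to reduce variance."""
    baseline = sum(returns) / len(returns)
    return [g - baseline for g in returns]


if __name__ == "__main__":
    g = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
    print(g, reinforce_weights(g))
```

In a full implementation, each weight would multiply the gradient of \(\log \pi(a_t \mid s_t)\) for the corresponding step.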

Homework: Train a PPO agent on a custom environment relevant to your research.

Day 5: Applications & Advanced Topics

Objectives: Apply RL to real-world problems and explore cutting-edge directions.

RL for Resource Allocation — Wireless network optimization (DQN for channel allocation), energy grid management, scheduling problems. Connection to the instructor’s research
Multi-Agent RL — Independent learners, centralized training with decentralized execution (CTDE), communication, cooperative vs. competitive settings
Topological Perspectives on RL — Topology of state/action/policy spaces, how topological structure affects convergence, connections to the instructor’s research on RL foundations
Advanced Topics Survey — Model-based RL, offline RL, reward shaping, inverse RL, RL from Human Feedback (RLHF), safe RL
Capstone Project Work — Complete and polish final projects
Presentations & Wrap-Up — Project demos, discussion, resources for continued learning, certificates

Lab 5 (Capstone): Choose one project:

Resource allocator: DQN agent for wireless network channel allocation
Game agent: Train an agent to play a classic Atari game using DQN or PPO
Control system: PPO agent for a continuous control task with custom reward shaping
Custom application: Apply RL to a problem from your own research domain

Assessment

Daily labs (40%) — Working implementations and analysis
Capstone project (40%) — Complete RL application with evaluation
Participation (20%) — Engagement, homework, and discussions

Learning Outcomes

By the end of this workshop, participants will be able to:

Formalize sequential decision problems as Markov Decision Processes (MDPs)
Implement tabular RL algorithms (dynamic programming, Q-learning, SARSA)
Explain convergence guarantees through the lens of fixed-point theory
Build deep RL agents (DQN, policy gradient, actor-critic)
Apply RL to practical problems (resource allocation, game playing, optimization)
Evaluate and debug RL systems

Who Should Attend

ML practitioners and researchers who want a rigorous grounding in RL theory alongside hands-on implementation experience
Graduate students working on sequential decision problems, control, or optimization
Engineers building agents for games, robotics, scheduling, or resource allocation
Researchers interested in the mathematical foundations (fixed-point theory, topology) of modern RL

Prerequisites:

Python programming (comfortable with NumPy, classes, basic OOP)
Linear algebra basics (vectors, matrices, eigenvalues)
Probability and statistics (distributions, expectation, conditional probability)
Familiarity with neural networks (forward pass, backpropagation concepts)

Brochure

Lecture notes and lab notebooks are linked in the sidebar.

For a printable one-page brochure suitable for forwarding to a program committee, conference organizer, or corporate L&D team, write to gabayae2@gmail.com with the audience size and intended delivery dates.