Lab 3 — PPO on CartPole and BipedalWalker

Goal. Train PPO with Stable-Baselines3 on CartPole-v1 (easy) and BipedalWalker-v3 (hard). Tune the learning rate, clip range, and entropy coefficient.

What you ship. A notebook with two trained agents, training curves, evaluation videos (or screenshots), and a 200-word memo on which hyperparameter mattered most for the harder task.

Setup

Install the dependencies (one-time).

In [ ]:
# BipedalWalker needs the Box2D extra; building box2d-py may also require swig.
# !pip install "gymnasium[box2d]" stable-baselines3 numpy matplotlib
In [ ]:
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
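from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.utils import set_random_seed

# Seeds Python, NumPy, and PyTorch; np.random.seed alone does not seed PyTorch.
# Pass seed=42 to PPO as well so the environment is seeded too.
set_random_seed(42)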

Environments

In [ ]:
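env_cart = gym.make('CartPole-v1')        # discrete action space
env_walk = gym.make('BipedalWalker-v3')   # continuous (Box) action space
# CartPole reports the number of discrete actions; BipedalWalker reports the action vector shape.
print('CartPole obs dims / n actions:', env_cart.observation_space.shape, env_cart.action_space.n)
print('BipedalWalker obs/act dims:', env_walk.observation_space.shape, env_walk.action_space.shape)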

Exercise 1 — PPO on CartPole

In [ ]:
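# Train with SB3's default PPO hyperparameters; the seed is fixed for reproducibility.
model = PPO('MlpPolicy', env_cart, seed=42, verbose=0)
model.learn(total_timesteps=50_000)
# evaluate_policy returns the mean and standard deviation of the episode returns.
mean, std = evaluate_policy(model, env_cart, n_eval_episodes=20)
print(f'CartPole: {mean:.1f} +/- {std:.1f}')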

Exercise 2 — PPO on BipedalWalker

In [ ]:
# YOUR TURN
# Train PPO on BipedalWalker for 500_000 steps. Tune learning_rate, clip_range, ent_coef, and n_steps.
# Report mean reward over 20 eval episodes.
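
One possible starting point for this exercise, offered as a sketch rather than a reference solution. SB3 accepts a callable for `learning_rate` that receives `progress_remaining` (which runs from 1 at the start of training to 0 at the end), so a linearly decaying rate is easy to try. The names `linear_schedule`, `walker_kwargs`, and `train_walker` are hypothetical helpers, and the numeric values are untuned assumptions (they match SB3's PPO defaults) meant only as a baseline to vary.

```python
from typing import Callable


def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """Return a schedule that decays linearly from initial_value to 0.

    SB3 calls the schedule with progress_remaining, which starts at 1.0
    and reaches 0.0 at the end of training.
    """
    def schedule(progress_remaining: float) -> float:
        return progress_remaining * initial_value
    return schedule


# Illustrative baseline values (assumptions, not recommended settings).
walker_kwargs = {
    "learning_rate": linear_schedule(3e-4),
    "n_steps": 2048,
    "clip_range": 0.2,
    "ent_coef": 0.0,
    "seed": 42,
}


def train_walker(total_timesteps: int = 500_000):
    """Train PPO on BipedalWalker-v3 and return (model, mean, std)."""
    # Imported lazily so the helpers above work without SB3 installed.
    import gymnasium as gym
    from stable_baselines3 import PPO
    from stable_baselines3.common.evaluation import evaluate_policy
    from stable_baselines3.common.monitor import Monitor

    # Monitor records per-episode returns, which evaluate_policy expects.
    env = Monitor(gym.make("BipedalWalker-v3"))
    model = PPO("MlpPolicy", env, verbose=0, **walker_kwargs)
    model.learn(total_timesteps=total_timesteps)
    mean, std = evaluate_policy(model, env, n_eval_episodes=20)
    return model, mean, std
```

Calling `model, mean, std = train_walker()` runs the full 500,000-step budget; passing a smaller `total_timesteps` first is a cheap way to check that the pipeline works before committing to a long run.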

Exercise 3 — Hyperparameter sensitivity

In [ ]:
# YOUR TURN
# On BipedalWalker, sweep ent_coef in {0.0, 0.01, 0.1}. Plot learning curves.
# Write a 200-word memo on which hyperparameter mattered most and why.
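
The sweep could be organized along the following lines. This is a minimal sketch, assuming SB3's `Monitor` wrapper is used to record the return and length of every training episode. `moving_average`, `run_sweep`, and `plot_curves` are hypothetical helper names, and the smoothing window of 20 episodes is an arbitrary choice.

```python
import numpy as np


def moving_average(values, window: int = 20) -> np.ndarray:
    """Smooth a sequence of episode returns with a sliding mean."""
    values = np.asarray(values, dtype=float)
    if len(values) < window:
        return values  # too short to smooth; return unchanged
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")


def run_sweep(ent_coefs=(0.0, 0.01, 0.1), total_timesteps: int = 500_000, seed: int = 42):
    """Train one PPO agent per ent_coef; return {ent_coef: (steps, returns)}."""
    # Imported lazily so moving_average can be used without SB3 installed.
    import gymnasium as gym
    from stable_baselines3 import PPO
    from stable_baselines3.common.monitor import Monitor

    curves = {}
    for ent_coef in ent_coefs:
        env = Monitor(gym.make("BipedalWalker-v3"))
        model = PPO("MlpPolicy", env, ent_coef=ent_coef, seed=seed, verbose=0)
        model.learn(total_timesteps=total_timesteps)
        # Monitor stores the return and length of every finished episode.
        returns = np.array(env.get_episode_rewards())
        steps = np.cumsum(env.get_episode_lengths())
        curves[ent_coef] = (steps, returns)
        env.close()
    return curves


def plot_curves(curves, window: int = 20):
    """Plot smoothed episode return against environment steps."""
    import matplotlib.pyplot as plt

    for ent_coef, (steps, returns) in curves.items():
        smoothed = moving_average(returns, window)
        # Align each smoothed point with the end of its window.
        x = steps[len(steps) - len(smoothed):]
        plt.plot(x, smoothed, label=f"ent_coef={ent_coef}")
    plt.xlabel("Environment steps")
    plt.ylabel("Episode return (moving average)")
    plt.legend()
    plt.show()
```

A typical call would be `plot_curves(run_sweep())`. A single seed per setting gives noisy curves, so repeating each setting with a few seeds makes the comparison in the memo easier to defend.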

Done?

Submit per the cohort schedule. Peer-review pairings will be announced the following Monday.