Lab 1 — Value iteration on a 50-state inventory MDP¶
Goal. Solve a 50-state inventory-management MDP by value iteration, then compare the resulting policy with a hand-designed heuristic policy.
What you ship. A notebook with the optimal value function, the optimal policy, and a comparison against an 'order-up-to-S' heuristic across 1000 simulated rollouts.
Setup¶
Install the dependencies (one-time).
In [ ]:
# !pip install numpy scipy matplotlib
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(42)
Inventory MDP setup¶
In [ ]:
# State: current stock in {0, ..., 49}. Action: order quantity in {0, ..., 49-s}.
# Demand: Poisson(lambda=3). Holding cost 0.5/unit/period. Stockout cost 5/unit.
# Selling price 4/unit. Ordering cost 2/unit + fixed 10 if order > 0.
S = 50  # number of states (stock levels 0..49)
GAMMA = 0.95
DEMAND_LAMBDA = 3
HOLD = 0.5
STOCKOUT = 5.0
PRICE = 4.0
ORDER_VAR = 2.0
ORDER_FIX = 10.0
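from scipy.stats import poisson
# Demand pmf over d = 0..S; the last bin absorbs the tail mass P(D >= S) so the pmf sums to 1.
demand_pmf = poisson.pmf(np.arange(S+1), DEMAND_LAMBDA)
demand_pmf[-1] = max(0.0, 1.0 - demand_pmf[:-1].sum())  # clamp guards against negative round-off
print('demand pmf sum:', demand_pmf.sum())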
Exercise 1 — Build the reward and transition tables¶
In [ ]:
# YOUR TURN
# For each (state s, action a), compute:
# - expected immediate reward r(s, a)
# - transition probabilities P(s' | s, a)
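One possible way to lay out the two tables is sketched below. It is not the reference solution, and it rests on modelling assumptions the lab text does not spell out: the order arrives before demand is realized, revenue is earned on units actually sold, holding cost is charged on end-of-period stock, stockout cost is charged on unmet demand, and unmet demand is lost. The function name `build_tables` and the use of `-inf` to mark infeasible actions are illustrative choices.

```python
import numpy as np
from scipy.stats import poisson

# Constants repeated from the setup cell so the sketch runs on its own.
S = 50
DEMAND_LAMBDA = 3
HOLD, STOCKOUT, PRICE = 0.5, 5.0, 4.0
ORDER_VAR, ORDER_FIX = 2.0, 10.0

demand = np.arange(S + 1)
demand_pmf = poisson.pmf(demand, DEMAND_LAMBDA)
demand_pmf[-1] = max(0.0, 1.0 - demand_pmf[:-1].sum())  # tail mass in last bin


def build_tables():
    """Return (R, P) with R[s, a] = expected reward and P[s, a, s'] = P(s' | s, a).

    Infeasible pairs (s + a > S - 1) keep R = -inf and an all-zero P row.
    """
    R = np.full((S, S), -np.inf)
    P = np.zeros((S, S, S))
    for s in range(S):
        for a in range(S - s):
            stock = s + a                      # assumed: order arrives before demand
            sold = np.minimum(stock, demand)   # units sold for each demand level
            left = stock - sold                # end-of-period stock = next state
            unmet = demand - sold              # assumed: unmet demand is lost
            order_cost = ORDER_VAR * a + (ORDER_FIX if a > 0 else 0.0)
            per_demand = PRICE * sold - HOLD * left - STOCKOUT * unmet
            R[s, a] = demand_pmf @ per_demand - order_cost
            # Sum the probability of every demand level that leads to each s'.
            P[s, a] = np.bincount(left, weights=demand_pmf, minlength=S)
    return R, P


R, P = build_tables()
feasible = np.isfinite(R)
print('feasible (s, a) pairs:', feasible.sum())
print('max |row sum - 1|:', np.abs(P[feasible].sum(axis=1) - 1.0).max())
```

Checking that every feasible row of `P` sums to 1 is a cheap way to catch indexing mistakes before moving on to value iteration.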
Exercise 2 — Value iteration¶
In [ ]:
# YOUR TURN
# Initialize V = 0. Iterate V_{k+1}(s) = max_a [r(s, a) + gamma * sum_{s'} P(s'|s,a) V_k(s')]
# until ||V_{k+1} - V_k||_inf < 1e-6. Plot V and the policy.
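The update rule above can be sketched as a small generic function. To keep the example self-contained it is run on a hypothetical two-state toy MDP whose optimal values can be checked by hand, rather than on the inventory tables. The function name `value_iteration`, the `max_iter` cap, and the `(S, A)` / `(S, A, S)` array layout are assumptions of this sketch.

```python
import numpy as np


def value_iteration(R, P, gamma=0.95, tol=1e-6, max_iter=10_000):
    """Return (V, policy) for rewards R[s, a] and transitions P[s, a, s'].

    Infeasible actions may be marked with R = -inf and a zero P row;
    they are never selected as long as each state has a feasible action.
    """
    V = np.zeros(R.shape[0])
    for _ in range(max_iter):
        Q = R + gamma * (P @ V)        # (S, A, S) @ (S,) -> (S, A)
        V_new = Q.max(axis=1)
        done = np.max(np.abs(V_new - V)) < tol   # sup-norm stopping rule
        V = V_new
        if done:
            break
    Q = R + gamma * (P @ V)
    return V, Q.argmax(axis=1)


# Toy MDP (hypothetical): action 0 = stay, action 1 = move to the other state.
R_toy = np.array([[0.0, 1.0],
                  [2.0, 0.0]])
P_toy = np.zeros((2, 2, 2))
P_toy[0, 0, 0] = 1.0   # state 0, stay
P_toy[0, 1, 1] = 1.0   # state 0, move to 1
P_toy[1, 0, 1] = 1.0   # state 1, stay
P_toy[1, 1, 0] = 1.0   # state 1, move to 0

V_toy, pi_toy = value_iteration(R_toy, P_toy, gamma=0.5)
print(V_toy, pi_toy)
```

By hand, staying in state 1 forever is worth 2 / (1 - 0.5) = 4, and moving there from state 0 is worth 1 + 0.5 * 4 = 3, so the printed values should be close to 3 and 4. The same function should accept inventory tables with the same layout, using `gamma=GAMMA`.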
Exercise 3 — Compare against the order-up-to-S heuristic¶
In [ ]:
# YOUR TURN
# Simulate both the optimal policy and an order-up-to-S heuristic for 1000 episodes.
# Report mean total reward, standard error, and the gap.
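A rollout comparison along these lines might look like the sketch below. Several things here are assumptions rather than part of the lab specification: each episode runs for 100 periods starting from empty stock, rewards are discounted by `GAMMA`, the per-period reward uses the same timing as a lost-sales model (order arrives, demand is served, leftover stock is held), and demand is drawn with `rng.poisson` instead of the truncated pmf. The helper names `order_up_to`, `simulate`, and `summarize` are hypothetical, and the level 10 is an arbitrary illustration, not a tuned value.

```python
import numpy as np

S = 50
GAMMA = 0.95
DEMAND_LAMBDA = 3
HOLD, STOCKOUT, PRICE = 0.5, 5.0, 4.0
ORDER_VAR, ORDER_FIX = 2.0, 10.0


def order_up_to(level):
    """Policy array: in state s, order max(level - s, 0) units (level <= S - 1)."""
    return np.maximum(level - np.arange(S), 0)


def simulate(policy, n_episodes=1000, horizon=100, start_stock=0, seed=42):
    """Return the discounted return of each episode under policy[s] -> order size."""
    rng = np.random.default_rng(seed)
    stock = np.full(n_episodes, start_stock)
    returns = np.zeros(n_episodes)
    for t in range(horizon):
        order = policy[stock]
        on_hand = stock + order
        d = rng.poisson(DEMAND_LAMBDA, size=n_episodes)
        sold = np.minimum(on_hand, d)
        reward = (PRICE * sold
                  - HOLD * (on_hand - sold)
                  - STOCKOUT * (d - sold)
                  - ORDER_VAR * order
                  - ORDER_FIX * (order > 0))
        returns += GAMMA**t * reward
        stock = on_hand - sold
    return returns


def summarize(returns):
    """Mean and standard error of the mean."""
    return returns.mean(), returns.std(ddof=1) / np.sqrt(len(returns))


ret_heur = simulate(order_up_to(10))   # illustrative heuristic level
ret_none = simulate(order_up_to(0))    # baseline: never order
gap = ret_heur - ret_none              # paired, since both runs share a seed

for name, r in [('order-up-to-10', ret_heur), ('never order', ret_none), ('gap', gap)]:
    mean, se = summarize(r)
    print(f'{name:>15}: mean={mean:9.2f}  se={se:6.2f}')
```

Because both calls use the same seed and draw demand in the same order, the two policies face identical demand paths, so the standard error of the paired gap is typically much smaller than that of either mean. A policy array produced by value iteration can be passed to `simulate` in place of `order_up_to(...)`.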
Done?¶
Submit per the cohort schedule. Peer-review pairings are announced the following Monday.