Case study · Health Ops · Stochastic Optimization

Mobile-clinic scheduling — Kenya

Real-data MDP on 7,876 KMPDC-licensed Kenyan health facilities and the SHA per-facility payment time-series. Three policies — manual round-robin, capped linear program, tabular Q-learning — compared on patients-served, travel cost, and equity. The honest finding: the algorithm matters less than how the constraints are written.

Read · 7 min · 1,395 words
Q-learning · 83 patients/day (+122% vs manual)
LP (25% cap) · 52/day (+39%, 4 counties served)
Manual round-robin · 37/day · 87,304 km total travel
Data · 7,876 facilities · 47 counties

Summary

Three scheduling policies on the same simulated dispatch over 180 days. Q-learning doubles patients-served per day vs the manual round-robin, but achieves it by concentrating 100% of visits on Nairobi (which dominates SHA payment volume in the real data). A linear program with a 25% per-county cap diversifies across four counties, gives up some patients-served, but eliminates 32% of travel and respects an equity constraint a regulator would actually impose. The right "winner" is the one whose objective matches the mandate the mobile-clinic programme was funded under.

Why this matters

Mobile health clinics in Kenya bridge the gap between fixed referral hospitals and underserved rural catchments. Each visit costs fuel, staff hours, and consumables; each visit also serves a finite queue of patients. The scheduling decision — which county does the clinic visit tomorrow — is binding: a clinic in Nairobi is a clinic not in Wajir. Three real constraints sit on the operator: a travel budget, an equity mandate attached to the programme's funding, and county-level demand that shifts with the season.

The business question

Given a list of candidate counties with their (real) facility counts and SHA payment time-series as a demand proxy, which county does the clinic visit each day for the next 180 days to maximise patients-served subject to a travel budget — and how does that change when an equity constraint is added?

Data

Two real Kenyan datasets, joined at the county level: the KMPDC list of 7,876 licensed health facilities, and the SHA per-facility payment time-series used as the demand proxy.

After cleaning out NaN-county rows and joining, the top 12 counties by SHA payment volume become the candidate locations for the MDP.
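
The cleaning-and-join step is mechanical. A minimal pandas sketch, assuming placeholder file and column names (kmpdc_facilities.csv, sha_payments.csv, county, amount_ksh) rather than the exact Kaggle schema:

```python
import pandas as pd

# Placeholder file and column names -- the real Kaggle exports may differ.
facilities = pd.read_csv("kmpdc_facilities.csv")   # one row per licensed facility
payments = pd.read_csv("sha_payments.csv")         # per-facility payment periods

# Drop NaN-county rows, aggregate to county level, then join.
fac_counts = (facilities.dropna(subset=["county"])
              .groupby("county").size().rename("facility_count"))
pay_totals = (payments.dropna(subset=["county"])
              .groupby("county")["amount_ksh"].sum().rename("sha_total"))
counties = pd.concat([fac_counts, pay_totals], axis=1).dropna()

# Top 12 counties by SHA payment volume become the MDP's candidate locations.
candidates = counties.sort_values("sha_total", ascending=False).head(12)
```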

Real KMPDC facility data · Real SHA payments · Synthetic 2D coords (no GPS in Kaggle) · 12-county MDP
Figure 1. Real distribution: top-20 Kenyan counties by KMPDC-licensed facility count (left) and SHA payment volume (right). Nairobi dominates both — 1,723 facilities, ~2.2 billion KSH in payments — by an order of magnitude over second place. This single fact drives the optimization story below.
Figure 2. Per-county SHA payments over the 8 observation periods (December to February). Variation is real and county-specific. The seasonal modulation per county is what gives the MDP its non-trivial dynamics; otherwise it would degenerate to "always pick Nairobi."

MDP formulation

A standard discrete-time, discrete-action Markov decision process: the state is the clinic's current county plus a day bucket (12 locations × 7 buckets = 84 states), the action is which of the 12 candidate counties to visit next, and the reward combines the patients served at the visited county with the travel cost of getting there.

Distance comes from synthetic 2D coordinates within Kenya's lat/lon bounding box because the public Kaggle dataset doesn't include GPS; this is a clear swap-in point for production. The MDP's structural conclusions don't depend on the specific coordinates.
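
For concreteness, a minimal sketch of one environment step under the formulation above. The coordinates, demand magnitudes, and travel-cost constant are illustrative stand-ins, not the case study's calibrated values:

```python
import numpy as np

N_COUNTIES, N_DAY_BUCKETS = 12, 7            # 12 locations x 7 day buckets = 84 states
rng = np.random.default_rng(0)

coords = rng.uniform(0.0, 1.0, size=(N_COUNTIES, 2))   # synthetic 2D positions (stand-in)
mean_demand = rng.uniform(10, 90, size=N_COUNTIES)      # SHA-derived demand proxy (stand-in)
TRAVEL_COST_PER_KM = 0.05                                # reward units per pseudo-km

def step(state, action):
    """One transition: drive to `action`, serve its queue, pay the travel cost."""
    loc, bucket = state
    dist = np.linalg.norm(coords[loc] - coords[action]) * 1000.0   # pseudo-km
    patients = rng.poisson(mean_demand[action])                     # realised demand draw
    reward = patients - TRAVEL_COST_PER_KM * dist
    next_state = (action, (bucket + 1) % N_DAY_BUCKETS)
    return next_state, reward, patients, dist
```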

Figure 3. The 12 candidate counties on a synthetic 2D map. Bubble size scales with facility count, colour with the inferred mean demand. Nairobi is the bright-yellow large bubble, the cluster the optimal policy keeps gravitating toward.

Three policies

1. Manual round-robin (industry baseline)

Visit county t mod 12 on day t. This is what most under-resourced field operations actually do: uniform rotation, no demand awareness. Maximises geographic equity by construction; ignores demand.

2. Linear program with per-county visit cap

Solve max c^T v subject to sum_i v_i = 1 and 0 ≤ v_i ≤ 0.25, where v_i is the visit-share for county i and c_i is its expected demand. Translate the optimal share vector into a stochastic schedule.

Without the cap, the LP degenerates to "always pick Nairobi" — exactly the same answer Q-learning eventually converges to. The cap is the equity mandate written as a linear constraint.
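
A sketch of the capped LP with scipy.optimize.linprog, which minimises, so the demand vector is negated; the function names are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

def capped_visit_shares(expected_demand, cap=0.25):
    """Demand-maximising visit shares with a per-county cap."""
    n = len(expected_demand)
    result = linprog(
        c=-np.asarray(expected_demand, dtype=float),   # maximise c^T v
        A_eq=np.ones((1, n)), b_eq=[1.0],               # shares sum to 1
        bounds=[(0.0, cap)] * n,                         # 0 <= v_i <= cap
        method="highs",
    )
    return result.x

def schedule_from_shares(shares, horizon=180, seed=0):
    """Turn the share vector into a stochastic daily schedule."""
    shares = np.clip(np.asarray(shares, dtype=float), 0.0, None)
    rng = np.random.default_rng(seed)
    return rng.choice(len(shares), size=horizon, p=shares / shares.sum())
```

Because the objective is linear, the optimum saturates the 25% cap on the four highest-demand counties, which is why the capped LP schedule serves exactly four counties.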

3. Tabular Q-learning

Standard ε-greedy with α = 0.1, γ = 0.95, ε decaying from 0.30 to 0.05 over 400 episodes. Q[loc, day_bucket, action] table; updates from the simulation reward signal. A revisit-penalty term in training nudges toward diversification, but the greedy evaluation policy (which doesn't see the recent-visit history) collapses back to the highest-demand county.
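
A minimal version of that training loop with the stated hyperparameters, reusing step, N_COUNTIES, and N_DAY_BUCKETS from the MDP sketch above; the linear ε-decay shape is an assumption, and the revisit-penalty term is omitted for brevity:

```python
import numpy as np

ALPHA, GAMMA = 0.1, 0.95
N_EPISODES, HORIZON = 400, 180
rng = np.random.default_rng(1)

Q = np.zeros((N_COUNTIES, N_DAY_BUCKETS, N_COUNTIES))   # Q[loc, day_bucket, action]

for ep in range(N_EPISODES):
    eps = max(0.05, 0.30 - (0.30 - 0.05) * ep / N_EPISODES)   # assumed linear decay
    state = (0, 0)
    for _ in range(HORIZON):
        loc, bucket = state
        if rng.random() < eps:
            action = int(rng.integers(N_COUNTIES))             # explore
        else:
            action = int(np.argmax(Q[loc, bucket]))            # exploit
        next_state, reward, _, _ = step(state, action)
        nloc, nbucket = next_state
        td_target = reward + GAMMA * Q[nloc, nbucket].max()
        Q[loc, bucket, action] += ALPHA * (td_target - Q[loc, bucket, action])
        state = next_state
```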

Figure 4. Q-learning training reward (20-episode rolling mean) over 400 episodes. The agent quickly identifies the highest-demand county and stops exploring; reward plateaus around episode 100. Convergence is faster than expected, a clean signal that the demand structure is dominated by a single county rather than anything subtle.

Results

Simulated 180-day rollouts with the same realised-demand seed across all three policies:

Policy | Patients/day | Total travel (km) | Counties served | Coverage lift vs manual
Manual round-robin | 37.3 | 87,304 | 12 | (baseline)
LP (25% per-county cap) | 52.0 | 60,121 | 4 | +39%
Q-learning (no cap) | 82.8 | 0 | 1 | +122%
The honest reading. Q-learning's "+122% coverage lift" is a coverage lift only if you read "coverage" as patients-served. Read in the equity sense most health programmes use it — visits to the bottom-quartile-utilisation catchments — Q-learning's coverage is worse than manual round-robin's. Q-learning never visits Wajir, Migori, or Mandera. Its policy is to sit in Nairobi all 180 days. That's a feature of the demand-only objective, not a bug in Q-learning.
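
A sketch of the evaluation harness behind the table: one fixed demand seed, the same 180-day horizon, three policy functions. It stitches together the earlier sketches (step, Q, capped_visit_shares, schedule_from_shares, mean_demand), so the numbers it prints are illustrative rather than the table's:

```python
import numpy as np

def evaluate(policy, horizon=180, seed=42):
    """Roll a policy forward; the fixed seed gives every policy the same demand draws."""
    global rng
    rng = np.random.default_rng(seed)          # step() reads this module-level generator
    state, total_patients, total_km, visited = (0, 0), 0, 0.0, set()
    for day in range(horizon):
        action = policy(state, day)
        state, _, patients, dist = step(state, action)
        total_patients += patients
        total_km += dist
        visited.add(action)
    return total_patients / horizon, total_km, len(visited)

lp_schedule = schedule_from_shares(capped_visit_shares(mean_demand))

policies = {
    "manual round-robin": lambda state, day: day % N_COUNTIES,
    "LP (25% cap)": lambda state, day: int(lp_schedule[day]),
    "Q-learning (greedy)": lambda state, day: int(np.argmax(Q[state[0], state[1]])),
}
for name, pol in policies.items():
    print(name, evaluate(pol))
```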

Trade-offs

Deployment sketch

For an actual mobile-clinic programme, the swap-in points are clear: replace the synthetic coordinates with real facility GPS so the travel term reflects actual distances, refresh the SHA-derived demand estimates as new payment periods arrive, and deploy the capped LP rather than the unconstrained Q-learning policy, since the cap is where the funder's equity mandate is actually written down.

Lessons

  1. Pick the algorithm that matches the constraint structure, not the headline metric. Q-learning maximises raw patients-served by a wide margin, but the LP is the deployment-correct answer in a constrained, regulator-overseen setting.
  2. "Coverage" is an ambiguous metric until you write down the constraint. Patients-served and bottom-quartile equity are both reasonable interpretations and the algorithms ranking flips between them. The case-study's hero number (+122%) is honest under one definition and misleading under another; the deep-dive resolves the ambiguity.
  3. Cheap structural prior > expensive learned policy on small action spaces. 12 actions × 84 states is not a regime where reinforcement learning is the right tool. Direct value iteration or LP would converge faster, give interpretable shadow prices, and avoid the over-engineering trap.
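
On the value-iteration point in lesson 3: with 84 states and deterministic transitions, the optimal policy falls out of a few lines of dynamic programming. A self-contained sketch using the same kind of illustrative demand and distance stand-ins as above:

```python
import numpy as np

GAMMA, TOL = 0.95, 1e-6
n_loc, n_bucket = 12, 7
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 1.0, size=(n_loc, 2))          # stand-in coordinates
mean_demand = rng.uniform(10, 90, size=n_loc)             # stand-in demand proxy
TRAVEL_COST_PER_KM = 0.05

# Expected one-step reward for moving loc -> action.
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1) * 1000.0
R = mean_demand[None, :] - TRAVEL_COST_PER_KM * dist       # R[loc, action]

V = np.zeros((n_loc, n_bucket))
while True:
    # Deterministic transition: the next state is (action, bucket + 1).
    V_next = V[:, (np.arange(n_bucket) + 1) % n_bucket]    # V_next[action, bucket]
    Q_vals = R[:, :, None] + GAMMA * V_next[None, :, :]    # Q_vals[loc, action, bucket]
    V_new = Q_vals.max(axis=1)
    if np.abs(V_new - V).max() < TOL:
        break
    V = V_new

policy = Q_vals.argmax(axis=1)    # greedy county for each (current county, day bucket)
```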