Case study · Hydropower · Time-series

River flow forecasting — Lake Kariba

Daily lake-level and outflow forecasting for the world's largest man-made reservoir, on real Zambezi River Authority data. Over a 30-day horizon, a gradient-boosted regressor with turbine discharge as an exogenous covariate reaches 7 cm RMSE on a level that drifts within a 7 m operational band.

Read · 6 min · 1,354 words
Best model · GBM with exogenous covariates
RMSE · 0.07 m (vs SARIMA 0.18 m, naive 0.50 m)
SARIMA 95% PI coverage · 100%
Data · 1,155 daily obs · 2020–2023

Summary

A gradient-boosted regressor that uses real-time turbine discharge as an exogenous covariate forecasts Lake Kariba's daily lake level with 7 cm RMSE across a 30-day horizon. SARIMA reaches 18 cm; an unobserved-components state-space model, 24 cm; the naive last-observation baseline, 50 cm. SARIMA and the state-space model also deliver 95% prediction intervals, with 100% and 90% empirical coverage on the hold-out respectively, making them the better choice for risk-aware dispatch decisions even though their point forecasts trail.

The same trade-off as in the PJM load-forecasting study, in a higher-stakes setting: ML wins on point accuracy, structural models win on uncertainty quantification. Ship the ensemble.

Why this matters

Lake Kariba is the largest man-made reservoir by water volume on the planet, sitting on the Zambezi River between Zambia and Zimbabwe. Its level drives roughly 2,100 MW of installed hydroelectric capacity across the Kariba South (Zimbabwe) and Kariba North (Zambia) power stations. The operational band is narrow:

The 2015–2016 and 2019–2020 droughts pushed the lake within a metre of its minimum operational level, forcing rolling blackouts in both countries. A forecast that is accurate at the centimetre level on a 30-day horizon directly informs turbine dispatch, downstream coordination with Cahora Bassa (Mozambique), and water-sharing negotiations between the Zambezi River Authority's two member states.

The business question

Two operational decisions consume the forecast:

The first wants the most accurate point forecast; the second wants honest uncertainty bands. Same forecast input, different downstream decisions: the same pattern as the PJM load-forecasting study, with much higher stakes per percentage-point of error.

Data

Lake Kariba reservoir data from the public Kaggle dataset marbin/lake-kariba-reservoir-data: 1,155 daily observations from 1 Jan 2020 to 28 Feb 2023, covering:

CSV · Kaggle · Daily resolution · ~3 years · Multivariate

EDA

The lake-level series is dominated by a slow annual cycle (rainy season Nov–Apr fills the lake; dry season May–Oct draws it down) and a long-term recovery trend through 2022 after the 2019–2020 drought trough. Turbine discharge follows the lake level with a lag on the seasonal scale (operators discharge harder once the lake is high) and shows weekly variation tied to grid demand patterns.

Daily lake level, turbine discharge, and total outflow from 2020 to 2023
Figure 1. Top: daily lake level showing the 2020–2023 recovery from the drought trough, modulated by an annual cycle. Middle: turbine discharge with operational variability. Bottom: total outflow tracks turbine discharge closely (spillage was near-zero across this window).
Annual cycle: monthly mean lake level vs turbine discharge
Figure 2. Annual cycle. Mean lake level (blue) peaks May–July (post-rains), troughs Nov–Dec (end of dry season). Turbine discharge (orange) lags the level; operators capitalise on high water through the dry months.
Correlation matrix of reservoir variables
Figure 3. Correlation matrix. Lake level, usable storage, and live storage move as one (r > 0.99): the storage volumes are mechanical functions of level. Turbine discharge correlates weakly with level on the daily scale; the relationship is mostly seasonal and lagged.
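The "seasonal and lagged" relationship noted in Figure 3 can be checked by correlating the level with discharge shifted forward in time. The sketch below is illustrative only: the two series are synthetic sine waves with an assumed 60-day offset, not the Kariba data, and `lagged_corr` is a hypothetical helper name.

```python
import numpy as np

def lagged_corr(level, discharge, lag):
    """Pearson correlation between level[t] and discharge[t + lag], lag >= 0."""
    level = np.asarray(level, dtype=float)
    discharge = np.asarray(discharge, dtype=float)
    if lag > 0:
        # Drop the unmatched ends so both arrays stay the same length.
        level, discharge = level[:-lag], discharge[lag:]
    return float(np.corrcoef(level, discharge)[0, 1])

# Synthetic stand-in (assumption): discharge repeats the annual level
# cycle 60 days later.
t = np.arange(3 * 365)
level = np.sin(2 * np.pi * t / 365.25)
discharge = np.sin(2 * np.pi * (t - 60) / 365.25)

# Scan lags of 0, 10, ..., 120 days and keep the strongest one.
corr_by_lag = {lag: lagged_corr(level, discharge, lag) for lag in range(0, 121, 10)}
best_lag = max(corr_by_lag, key=corr_by_lag.get)
```

On real data the peak is flatter and noisier, so it helps to plot `corr_by_lag` rather than rely on the single best lag.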

Modelling approach

Three candidates, all forecasting the daily lake_level series 30 days ahead. The held-out window is the last 30 days of the dataset.
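The hold-out split and the two naive baselines reported in the results are simple enough to sketch directly. The function names and the synthetic series below are illustrative assumptions, not code from the study.

```python
def holdout_split(series, horizon=30):
    """Split a daily series into training history and the final `horizon` days."""
    if len(series) <= horizon:
        raise ValueError("series must be longer than the forecast horizon")
    return series[:-horizon], series[-horizon:]

def naive_last(train, horizon=30):
    """Repeat the last observed level for every day of the horizon."""
    return [train[-1]] * horizon

def naive_seasonal(train, horizon=30, season=365):
    """Use the level observed `season` days before each target day."""
    if len(train) < season or horizon > season:
        raise ValueError("need at least one full season of history")
    start = len(train) - season
    return [train[start + h] for h in range(horizon)]

# Synthetic stand-in (assumption): a level rising 1 cm per day from 478 m.
series = [478.0 + 0.01 * i for i in range(400)]
train, test = holdout_split(series)
last_forecast = naive_last(train)
seasonal_forecast = naive_seasonal(train)
```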

1. SARIMA baseline

SARIMAX(1,1,1)(1,1,1)[7], i.e. a seasonal period of 7 days. Captures short-range autocorrelation and weekly cycles. The annual cycle has to be absorbed implicitly by the differencing (integration) terms, a known weakness on a series this strongly seasonal.

2. State-space — UnobservedComponents + Fourier exog

UnobservedComponents with a local-linear-trend level component, stochastic level and slope, and three pairs of annual Fourier harmonics passed as exogenous regressors. The Kalman filter delivers the prediction intervals.
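The three pairs of annual Fourier harmonics can be built as a six-column regressor matrix. This is a sketch under assumptions: the 365.25-day period and the use of a plain day counter as the time index are choices made here, and `annual_fourier` is a hypothetical helper name.

```python
import numpy as np

def annual_fourier(day_index, n_pairs=3, period=365.25):
    """Return sin/cos columns for harmonics 1..n_pairs of the annual cycle."""
    day_index = np.asarray(day_index, dtype=float)
    columns = []
    for k in range(1, n_pairs + 1):
        angle = 2 * np.pi * k * day_index / period
        columns.append(np.sin(angle))
        columns.append(np.cos(angle))
    return np.column_stack(columns)

# 1,155 observations minus the 30-day hold-out leaves 1,125 training days.
n_train, horizon = 1125, 30
X_train = annual_fourier(np.arange(n_train))
# The future regressors continue the same day counter past the training end.
X_future = annual_fourier(np.arange(n_train, n_train + horizon))
```

In statsmodels, matrices like these would be passed as `exog` when constructing `UnobservedComponents` and again when calling `get_forecast`, since the future harmonics are known in advance.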

3. ML challenger — gradient-boosted regressor with exogenous covariates

GradientBoostingRegressor(n_estimators=400, max_depth=3, learning_rate=0.05) on engineered features:

This is the lever that drops RMSE from ~18 cm (SARIMA, lake-level alone) to 7 cm (GBM, with discharge as exog). The structural relationship "tomorrow's lake level = today's level + (inflow − outflow)" is something the GBM can learn directly when given outflow data; SARIMA, working only on the lake-level history, has to infer it.
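A rough sketch of the idea, using scikit-learn's `GradientBoostingRegressor` with the hyperparameters quoted above. Everything else is an assumption made for illustration: the data are synthetic, the features (current and previous discharge plus annual sin/cos terms) are not the study's feature set, and the target here is the one-day change in level, which is then accumulated into a 30-day path. The sketch also assumes discharge over the horizon is known, for example from a dispatch plan.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

HORIZON = 30
rng = np.random.default_rng(0)
n = 600
day = np.arange(n)

# Toy water balance (assumption): daily level change is proportional to
# inflow minus turbine discharge.
inflow = 1.0 + 0.5 * np.sin(2 * np.pi * day / 365.25) + rng.normal(0, 0.05, n)
discharge = 1.0 + rng.normal(0, 0.2, n)
level = 478.0 + np.cumsum(0.01 * (inflow - discharge))

# Features for days 1..n-1: discharge today and yesterday, plus the
# position in the annual cycle as a proxy for unobserved inflow.
angle = 2 * np.pi * day[1:] / 365.25
X = np.column_stack([discharge[1:], discharge[:-1], np.sin(angle), np.cos(angle)])
y = np.diff(level)  # level[t] - level[t-1]

X_train, y_train = X[:-HORIZON], y[:-HORIZON]
X_test = X[-HORIZON:]
actual = level[-HORIZON:]
last_level = level[-HORIZON - 1]  # final training-day level

gbm = GradientBoostingRegressor(n_estimators=400, max_depth=3,
                                learning_rate=0.05, random_state=0)
gbm.fit(X_train, y_train)

# Accumulate predicted daily changes into a 30-day level path.
forecast = last_level + np.cumsum(gbm.predict(X_test))

gbm_rmse = float(np.sqrt(np.mean((forecast - actual) ** 2)))
naive_rmse = float(np.sqrt(np.mean((last_level - actual) ** 2)))
```

Predicting the daily change rather than the level itself is a design choice of this sketch: tree ensembles cannot extrapolate beyond the target range seen in training, and a trending reservoir level can leave that range.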

Results

30-day held-out test. RMSE is in metres; MAPE is computed on the lake level itself, which sits near 478 m, so the percentages are tiny:

Model                                  MAPE      RMSE (m)   95% PI coverage
GBM (with exog: discharge, outflow)    0.013%    0.07       -
SARIMA(1,1,1)(1,1,1)[7]                0.027%    0.18       100%
UC + Fourier annual exog               0.046%    0.24       90%
Naive-last                             0.085%    0.50       -
Naive-seasonal (365-day lag)           0.39%     1.90       -
30-day forecast comparison: SARIMA, UC, and GBM against actual lake level
Figure 4. 30-day held-out forecast comparison. The black line is realised level. GBM (with discharge as exog) tracks the realised level so closely it's hard to separate visually. SARIMA's 95% PI (shaded) is wide enough that the actual line stays comfortably inside. UC drifts below the realised level by a few centimetres.
The headline number in context. 7 cm RMSE on a level that varies within a ~7 m operational band is a relative error of ~1%. On a reservoir whose 1 m drop costs hundreds of GWh of generation per year, that's the difference between scheduling six weeks of generation confidently versus going hand-to-mouth on inflow telemetry.
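The three metrics in the results table, and the arithmetic behind the "~1%" figure, can be written out in a few lines. This is a generic sketch of the standard definitions, not the study's evaluation code; the function names are chosen here.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error, in the units of the series (metres here)."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def pi_coverage(actual, lower, upper):
    """Share of observations inside the prediction interval, in percent."""
    inside = sum(lo <= a <= hi for a, lo, hi in zip(actual, lower, upper))
    return 100.0 * inside / len(actual)

# Headline arithmetic: 7 cm against a ~7 m operational band, and the same
# 7 cm expressed as a percentage of a ~478 m level.
relative_to_band = 100.0 * 0.07 / 7.0
relative_to_level = 100.0 * 0.07 / 478.0
```

With a 30-day hold-out, each day that falls outside the interval moves empirical coverage by about 3.3 percentage points, so 90% corresponds to three misses and 100% to none.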

Trade-offs

Deployment sketch

For the Zambezi River Authority and Kariba power-station operators:

Lessons

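  1. Exogenous covariates dominate when they exist. The roughly 60% RMSE reduction from SARIMA to GBM (0.18 m to 0.07 m) most plausibly comes from having turbine discharge in the feature set, although the comparison changes the algorithm and the inputs at the same time. Picking the right inputs tends to matter more than picking the right algorithm.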
  2. Slow-moving, high-stakes targets need calibrated PIs, not just low MAPE. A 7 cm point error is tight, but on this kind of asset the question that actually drives decisions is "what's the probability we breach the operational threshold in the next 30 days?". SARIMA's prediction interval is often more valuable than GBM's more accurate point forecast.
  3. Real African open data is good enough. Lake Kariba is a transboundary reservoir between two African countries; the daily data exists and is publicly accessible on Kaggle. The model would extend cleanly to Cahora Bassa and other African dams once equivalent data is published.
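The threshold question in lesson 2 can be answered from a forecast mean and its 95% interval if one is willing to assume a Gaussian forecast distribution. The sketch below makes that assumption explicit; the numbers are hypothetical, and `breach_probability` is a name chosen here.

```python
from statistics import NormalDist

def breach_probability(mean, lower95, upper95, threshold):
    """P(level < threshold) on a single day, assuming a Gaussian forecast
    whose central 95% interval is [lower95, upper95]."""
    z = NormalDist().inv_cdf(0.975)  # about 1.96
    sigma = (upper95 - lower95) / (2 * z)
    return NormalDist(mu=mean, sigma=sigma).cdf(threshold)

# Hypothetical day-30 forecast: 478.0 m with a 95% interval of +/- 0.5 m,
# checked against a hypothetical threshold at the lower interval bound.
p_low = breach_probability(478.0, 477.5, 478.5, threshold=477.5)
p_mid = breach_probability(478.0, 477.5, 478.5, threshold=478.0)
```

This gives a per-day probability. The chance of a breach on any day within the 30-day window depends on how the daily errors move together, which would need simulated forecast paths rather than marginal intervals.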