Week 03 — Classical ML: regression, classification, clustering

Module 3: regularized regression, ensembles (RF, gradient boosting), clustering, dimensionality reduction, SHAP and its failure modes.


The pre-deep-learning toolkit. Still the right answer for most tabular problems.

What you ship this week

Credit-scoring pipeline on an African bank dataset, with EDA notebook, model comparison (logistic + XGBoost), calibration, fairness audit across at least two demographic slices, and a deployable scoring function.
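
A concrete way to read "deployable scoring function": one entry point that wraps the fitted pipeline and returns a probability. A minimal sketch, assuming a scikit-learn pipeline persisted with joblib; the path, column names, and function name are illustrative, not part of the spec.

```python
# Minimal scoring-function sketch (illustrative names throughout).
import joblib
import pandas as pd

# Hypothetical path to the pipeline fitted in Lab 1.
_pipeline = joblib.load("models/credit_pipeline.joblib")

def score(application: dict) -> float:
    """Return the predicted default probability for one loan application."""
    row = pd.DataFrame([application])  # single-row frame with named columns
    return float(_pipeline.predict_proba(row)[0, 1])
```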

Due: Friday 18:00, Africa/Lagos (UTC+1).
Submission: Drop the repo URL into the week's cohort channel. Peer-review pairing is announced the following Monday.
Rubric: Pass / revise. A pass requires green CI, tests covering the public API, and a README a stranger can follow to install and run the code.

Live sessions and labs

Default weekly cadence below. Cohort-specific dates and Zoom links are filled in at intake.

| Day | Time        | Block                                  | Recording    |
|-----|-------------|----------------------------------------|--------------|
| Mon | 09:00-12:00 | Live instruction + code-along          | post-session |
| Mon | 14:00-16:00 | Independent lab work + TA office hours | post-session |
| Tue | 09:00-12:00 | Live instruction + code-along          | post-session |
| Tue | 14:00-16:00 | Independent lab work + TA office hours | post-session |
| Wed | 09:00-12:00 | Live instruction + code-along          | post-session |
| Wed | 14:00-16:00 | Independent lab work + TA office hours | post-session |
| Thu | 09:00-12:00 | Live instruction + code-along          | post-session |
| Thu | 14:00-16:00 | Independent lab work + TA office hours | post-session |
| Fri | 10:00-11:00 | Industry speaker                       | post-session |
| Fri | 11:30-12:30 | Lab review                             | post-session |
| Fri | 14:00-15:00 | Cohort retrospective                   | post-session |

Learning outcomes

By the end of the week, every participant will:

  1. Fit and tune linear and regularized regression (ridge, lasso, elastic net); see the sketch after this list.
  2. Build and interpret tree-based ensembles (random forests, gradient boosting).
  3. Apply unsupervised methods (k-means, hierarchical, DBSCAN, GMMs, PCA, UMAP).
  4. Diagnose feature importance and partial dependence honestly, without overclaiming causality.
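
For outcome 1, a minimal sketch of the tuning loop, using synthetic stand-in data; in the labs you would substitute the credit-scoring features.

```python
# Compare ridge, lasso, and elastic net over a shared penalty grid.
# Synthetic regression data stands in for the lab features.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

alphas = np.logspace(-3, 3, 25)  # penalty strengths to search
models = {
    "ridge": make_pipeline(StandardScaler(), RidgeCV(alphas=alphas)),
    "lasso": make_pipeline(StandardScaler(), LassoCV(alphas=alphas, cv=5)),
    "elastic net": make_pipeline(
        StandardScaler(),
        ElasticNetCV(alphas=alphas, l1_ratio=[0.1, 0.5, 0.9], cv=5),
    ),
}

for name, model in models.items():
    model.fit(X_train, y_train)  # CV over the grid happens inside each *CV estimator
    print(f"{name}: held-out R^2 = {model.score(X_test, y_test):.3f}")
```

Scaling sits inside the pipeline so the penalty treats all coefficients on a comparable footing; fitting the scaler outside the pipeline would leak test statistics into training.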

Topics covered

Linear and logistic regression · regularization (ridge, lasso, elastic net) · SVMs and the kernel trick · decision trees, random forests, gradient boosting (XGBoost, LightGBM) · clustering (k-means, hierarchical, DBSCAN, GMM) · dimensionality reduction (PCA, UMAP, t-SNE) · model interpretation (permutation importance, SHAP, partial dependence) · what these methods can and cannot tell you about causation.

Labs

Lab 1 — Credit scoring with fairness audit

Full pipeline from EDA to deployable scoring function on a Kaggle African-bank dataset. Compare logistic baseline against XGBoost. Audit calibration and group fairness across at least two demographic slices.

Dataset: Kaggle: *Bank loan default prediction* (Cameroon subset).
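
A minimal sketch of the calibration and group-fairness checks, assuming a fitted classifier `model`, a held-out frame `test_df` with a 0/1 label column `"default"` and a demographic column `"gender"`, and a list `feature_cols`; all of these names are illustrative, not part of the dataset spec.

```python
# Calibration + per-slice error rates for a binary credit model.
import numpy as np
from sklearn.calibration import calibration_curve

proba = model.predict_proba(test_df[feature_cols])[:, 1]
y = test_df["default"].to_numpy()

# Calibration: observed default rate vs. mean predicted probability per bin.
frac_pos, mean_pred = calibration_curve(y, proba, n_bins=10)
print(np.c_[mean_pred, frac_pos])

# Group audit: hard decisions at a 0.5 threshold, error rates per slice.
for group in test_df["gender"].unique():
    mask = (test_df["gender"] == group).to_numpy()
    g_y, g_pred = y[mask], (proba[mask] >= 0.5).astype(int)
    tpr = g_pred[g_y == 1].mean()  # true-positive rate on this slice
    fpr = g_pred[g_y == 0].mean()  # false-positive rate on this slice
    print(group, f"TPR={tpr:.2f}", f"FPR={fpr:.2f}")
```

Large TPR or FPR gaps between slices at the same threshold are exactly what the audit should surface and discuss.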

Lab 2 — Customer segmentation

Cluster customers by mobile-money transaction patterns. Justify $k$, characterize each cluster, write a 300-word memo for a non-technical product manager.

Dataset: Public anonymized mobile-money transaction sample (Orange/MTN open-data initiative).
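
A minimal sketch for justifying $k$, using synthetic stand-in data; in the lab you would use the scaled transaction-pattern features.

```python
# Scan candidate k values and score each clustering by silhouette.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)
X = StandardScaler().fit_transform(X)  # k-means assumes comparable scales

for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
# Pick the k where the silhouette peaks, then profile each cluster
# (e.g. mean transaction count and value) to ground the memo.
```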

Lab 3 — SHAP interpretation and its failure modes

SHAP-explain the XGBoost model from Lab 1. Then deliberately construct three cases where SHAP gives misleading explanations and document them.

Dataset: Same as Lab 1.
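
A minimal sketch of the SHAP workflow, assuming the fitted XGBoost model and held-out features from Lab 1 are in scope as `model` and `X_test` (illustrative names), plus one failure mode worth constructing.

```python
# Explain the Lab 1 XGBoost model with TreeSHAP.
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)  # one attribution per feature per row
shap.summary_plot(shap_values, X_test)       # global importance overview

# One failure mode to construct deliberately: correlated or duplicated
# features. Retrain the model on a frame with a copy of a strong feature;
# the trees can split on either copy, so SHAP divides the credit and each
# copy looks roughly half as important as the original did.
X_dup = X_test.copy()
X_dup["income_dup"] = X_dup["income"]  # "income" is a hypothetical column name
```

The attributions are not wrong in this case so much as easy to misread, which is the distinction the write-up should draw for each of the three constructed cases.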

Readings

Mandatory

Optional deepening

Builds on (course catalogue)