Week 03 — Classical ML: regression, classification, clustering
Module 3: regularized regression, ensembles (RF, gradient boosting), clustering, dimensionality reduction, SHAP and its failure modes.
The pre-deep-learning toolkit. Still the right answer for most tabular problems.
What you ship this week
Credit-scoring pipeline on an African bank dataset, with EDA notebook, model comparison (logistic regression vs. XGBoost), calibration, fairness audit across at least two demographic slices, and a deployable scoring function.
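One way the deployable scoring function can look is a thin wrapper around a saved pipeline; a minimal sketch, assuming a fitted scikit-learn pipeline (preprocessing + classifier) persisted with joblib — the artifact path and column layout are placeholders, not part of the brief:

```python
# Minimal scoring entry point. Assumes the training notebook saved a fitted
# scikit-learn pipeline to "credit_model.joblib" (hypothetical file name) and
# that incoming rows carry the same columns the pipeline was trained on.
import joblib
import pandas as pd

_MODEL = joblib.load("credit_model.joblib")  # loaded once at import time

def score(applicants: pd.DataFrame) -> pd.Series:
    """Return the predicted probability of default for each applicant row."""
    proba = _MODEL.predict_proba(applicants)[:, 1]  # column 1 = P(default)
    return pd.Series(proba, index=applicants.index, name="default_probability")
```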
| Due | Friday 18:00 (Africa/Lagos, UTC+1) |
|---|---|
| Submission | Drop the repo URL into the week's cohort channel. Peer-review pairing announced Monday of next week. |
| Rubric | Pass / revise. Pass requires green CI, tests covering the public API, and a README a stranger can follow to install and run the code. |
Live sessions and labs
The default weekly cadence is below. Cohort-specific dates and Zoom links are filled in at intake.
| Day | Time | Block | Recording |
|---|---|---|---|
| Mon | 09:00-12:00 | Live instruction + code-along | (post-session) |
| Mon | 14:00-16:00 | Independent lab work + TA office hours | (post-session) |
| Tue | 09:00-12:00 | Live instruction + code-along | (post-session) |
| Tue | 14:00-16:00 | Independent lab work + TA office hours | (post-session) |
| Wed | 09:00-12:00 | Live instruction + code-along | (post-session) |
| Wed | 14:00-16:00 | Independent lab work + TA office hours | (post-session) |
| Thu | 09:00-12:00 | Live instruction + code-along | (post-session) |
| Thu | 14:00-16:00 | Independent lab work + TA office hours | (post-session) |
| Fri | 10:00-11:00 | Industry speaker | (post-session) |
| Fri | 11:30-12:30 | Lab review | (post-session) |
| Fri | 14:00-15:00 | Cohort retrospective | (post-session) |
Learning outcomes
By the end of the week, every participant will:
- Fit and tune linear and regularized regression (ridge, lasso, elastic net); see the sketch after this list.
- Build and interpret tree-based ensembles (random forests, gradient boosting).
- Apply unsupervised methods (k-means, hierarchical, DBSCAN, GMMs, PCA, UMAP).
- Diagnose feature importance and partial dependence honestly, without overclaiming causality.
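For the first outcome, a minimal sketch of cross-validated regularized regression with scikit-learn, on synthetic data; the alpha grid and `l1_ratio` values are illustrative, not prescribed:

```python
# Fit an elastic net with cross-validated penalty strength on synthetic data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=30, noise=10.0, random_state=0)

# l1_ratio close to 0 behaves like ridge, 1.0 is lasso; intermediate values
# give the elastic net. Penalties are scale-sensitive, so standardize first.
model = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0],
                 alphas=np.logspace(-3, 1, 30), cv=5),
)
model.fit(X, y)

enet = model[-1]
print("chosen alpha:", enet.alpha_, "chosen l1_ratio:", enet.l1_ratio_)
print("non-zero coefficients:", int(np.sum(enet.coef_ != 0)))
```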
Topics covered
Linear and logistic regression · regularization (ridge, lasso, elastic net) · SVMs and the kernel trick · decision trees, random forests, gradient boosting (XGBoost, LightGBM) · clustering (k-means, hierarchical, DBSCAN, GMM) · dimensionality reduction (PCA, UMAP, t-SNE) · model interpretation (permutation importance, SHAP, partial dependence) · what these methods can and cannot tell you about causation.
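A short illustrative sketch of the interpretation tools named above, using scikit-learn's built-in gradient boosting so it runs without extra dependencies; XGBoost or LightGBM slot in the same way. These tools surface associations in the fitted model, not causal effects:

```python
# Gradient-boosted classifier, permutation importance, and partial dependence
# on synthetic data (requires matplotlib for the partial dependence plot).
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay, permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = HistGradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Permutation importance: drop in held-out score when one feature is shuffled.
imp = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
print(imp.importances_mean.round(3))

# Partial dependence of the predicted probability on the first two features.
PartialDependenceDisplay.from_estimator(clf, X_test, features=[0, 1])
```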
Labs
Lab 1 — Credit scoring with fairness audit
Full pipeline from EDA to a deployable scoring function on a Kaggle African-bank dataset. Compare a logistic-regression baseline against XGBoost. Audit calibration and group fairness across at least two demographic slices.
Dataset: Kaggle: *Bank loan default prediction* (Cameroon subset).
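A minimal sketch of the calibration and fairness checks, written as a helper that works for any fitted binary classifier with `predict_proba`; the demographic column name and the per-slice metrics are illustrative choices, not requirements of the brief:

```python
# Calibration + group-wise audit helper for a fitted binary classifier.
import pandas as pd
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss, roc_auc_score

def audit_model(model, X_test: pd.DataFrame, y_test: pd.Series,
                group_col: str) -> pd.DataFrame:
    proba = model.predict_proba(X_test)[:, 1]  # predicted probability of default

    # Overall calibration: reliability-curve bins plus Brier score.
    frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)
    print("Brier score:", brier_score_loss(y_test, proba))
    print("reliability curve:", list(zip(mean_pred.round(2), frac_pos.round(2))))

    # Per-slice audit: discrimination, observed default rate, mean predicted risk.
    frame = pd.DataFrame({"y": y_test.values, "p": proba,
                          "g": X_test[group_col].values})
    rows = []
    for g, sub in frame.groupby("g"):
        rows.append({"group": g,
                     "auc": roc_auc_score(sub["y"], sub["p"]),
                     "base_rate": sub["y"].mean(),
                     "mean_score": sub["p"].mean()})
    return pd.DataFrame(rows)
```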
Lab 2 — Customer segmentation
Cluster customers by mobile-money transaction patterns. Justify the choice of $k$, characterize each cluster, and write a 300-word memo for a non-technical product manager.
Dataset: Public anonymized mobile-money transaction sample (Orange/MTN open-data initiative).
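One defensible way to justify $k$ is a silhouette sweep over scaled features; a minimal sketch on stand-in data (swap in the per-customer mobile-money features):

```python
# Sweep k for k-means and report silhouette scores on scaled features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))             # placeholder per-customer features
X_scaled = StandardScaler().fit_transform(X)

for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    print(k, round(silhouette_score(X_scaled, labels), 3))
```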
Lab 3 — SHAP interpretation and its failure modes
SHAP-explain the XGBoost model from Lab 1. Then deliberately construct three cases where SHAP gives misleading explanations and document them.
Dataset: Same as Lab 1.
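A minimal sketch of the starting point, plus one classic failure mode to reproduce: two near-duplicate features can split the attribution between them, so per-feature mean |SHAP| can understate how much the underlying quantity matters. The data here is synthetic and the model settings are illustrative:

```python
# SHAP values for a tree ensemble, with a correlated-feature failure mode.
import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # near-duplicate of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = (x1 + 0.2 * x3 + rng.normal(scale=0.5, size=n) > 0).astype(int)

model = xgb.XGBClassifier(n_estimators=200, max_depth=3).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)     # shape (n_samples, n_features)

# The signal lives in x1, but the trees can spread credit across x1 and its
# near-copy x2, so neither column alone reflects the full underlying effect.
print(np.abs(shap_values).mean(axis=0).round(3))
```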
Readings
Mandatory
- Before Tuesday. Hastie, Tibshirani, Friedman, *The Elements of Statistical Learning* (ESL), chapters 3 (linear methods), 9 (trees), 10 (boosting)
- Before Wednesday. Christoph Molnar, *Interpretable Machine Learning*, chapters 5 (SHAP) and 8 (limitations)
Optional deepening
- Tianqi Chen and Carlos Guestrin, *XGBoost: A Scalable Tree Boosting System* (KDD 2016)
- Cynthia Rudin, *Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead* (Nature Machine Intelligence, 2019)