Week 03 — Classical ML: regression, classification, clustering
Module 3: regularized regression, ensembles (RF, gradient boosting), clustering, dimensionality reduction, SHAP and its failure modes.
The pre-deep-learning toolkit. Still the right answer for most tabular problems.
What you ship this week
Credit-scoring pipeline on an African bank dataset, with EDA notebook, model comparison (logistic regression vs. XGBoost), calibration, fairness audit across at least two demographic slices, and a deployable scoring function.
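One way the deployable scoring function can look is a thin wrapper around a saved pipeline; a minimal sketch, assuming a fitted scikit-learn pipeline (preprocessing + classifier) persisted with joblib — the artifact path and column layout are placeholders, not part of the brief:

```python
# Minimal scoring entry point. Assumes the training notebook saved a fitted
# scikit-learn pipeline to "credit_model.joblib" (hypothetical file name) and
# that incoming rows carry the same columns the pipeline was trained on.
import joblib
import pandas as pd

_MODEL = joblib.load("credit_model.joblib")  # loaded once at import time

def score(applicants: pd.DataFrame) -> pd.Series:
    """Return the predicted probability of default for each applicant row."""
    proba = _MODEL.predict_proba(applicants)[:, 1]  # column 1 = P(default)
    return pd.Series(proba, index=applicants.index, name="default_probability")
```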
| Due | Friday 18:00 (Africa/Lagos, UTC+1) |
|---|---|
| Submission | Drop the repo URL into the week's cohort channel. Peer-review pairing announced Monday of next week. |
| Rubric | Pass / revise. Pass requires green CI, tests covering the public API, and a README a stranger can follow to install and run the code. |
Live sessions and labs
The default weekly cadence is below. Cohort-specific dates and Zoom links are filled in at intake.
| Day | Time | Block | Recording |
|---|---|---|---|
| Mon | 09:00-12:00 | Live instruction + code-along | (post-session) |
| Mon | 14:00-16:00 | Independent lab work + TA office hours | (post-session) |
| Tue | 09:00-12:00 | Live instruction + code-along | (post-session) |
| Tue | 14:00-16:00 | Independent lab work + TA office hours | (post-session) |
| Wed | 09:00-12:00 | Live instruction + code-along | (post-session) |
| Wed | 14:00-16:00 | Independent lab work + TA office hours | (post-session) |
| Thu | 09:00-12:00 | Live instruction + code-along | (post-session) |
| Thu | 14:00-16:00 | Independent lab work + TA office hours | (post-session) |
| Fri | 10:00-11:00 | Industry speaker | (post-session) |
| Fri | 11:30-12:30 | Lab review | (post-session) |
| Fri | 14:00-15:00 | Cohort retrospective | (post-session) |
Learning outcomes
By the end of the week, every participant will:
- Fit and tune linear and regularized regression (ridge, lasso, elastic net); see the sketch after this list.
- Build and interpret tree-based ensembles (random forests, gradient boosting).
- Apply unsupervised methods (k-means, hierarchical, DBSCAN, GMMs, PCA, UMAP).
- Diagnose feature importance and partial dependence honestly, without overclaiming causality.
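For the first outcome, a minimal sketch of cross-validated regularized regression with scikit-learn, on synthetic data; the alpha grid and `l1_ratio` values are illustrative, not prescribed:

```python
# Fit an elastic net with cross-validated penalty strength on synthetic data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=30, noise=10.0, random_state=0)

# l1_ratio close to 0 behaves like ridge, 1.0 is lasso; intermediate values
# give the elastic net. Penalties are scale-sensitive, so standardize first.
model = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0],
                 alphas=np.logspace(-3, 1, 30), cv=5),
)
model.fit(X, y)

enet = model[-1]
print("chosen alpha:", enet.alpha_, "chosen l1_ratio:", enet.l1_ratio_)
print("non-zero coefficients:", int(np.sum(enet.coef_ != 0)))
```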
Topics covered
Linear and logistic regression · regularization (ridge, lasso, elastic net) · SVMs and the kernel trick · decision trees, random forests, gradient boosting (XGBoost, LightGBM) · clustering (k-means, hierarchical, DBSCAN, GMM) · dimensionality reduction (PCA, UMAP, t-SNE) · model interpretation (permutation importance, SHAP, partial dependence) · what these methods can and cannot tell you about causation.
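A short illustrative sketch of the interpretation tools named above, using scikit-learn's built-in gradient boosting so it runs without extra dependencies; XGBoost or LightGBM slot in the same way. These tools surface associations in the fitted model, not causal effects:

```python
# Gradient-boosted classifier, permutation importance, and partial dependence
# on synthetic data (requires matplotlib for the partial dependence plot).
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay, permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = HistGradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Permutation importance: drop in held-out score when one feature is shuffled.
imp = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
print(imp.importances_mean.round(3))

# Partial dependence of the predicted probability on the first two features.
PartialDependenceDisplay.from_estimator(clf, X_test, features=[0, 1])
```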
Labs
Lab 1 — Credit scoring with fairness audit
Full pipeline from EDA to a deployable scoring function on a Kaggle African-bank dataset. Compare a logistic-regression baseline against XGBoost. Audit calibration and group fairness across at least two demographic slices.
Dataset: Kaggle: *Bank loan default prediction* (Cameroon subset).
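A minimal sketch of the calibration and fairness checks, written as a helper that works for any fitted binary classifier with `predict_proba`; the demographic column name and the per-slice metrics are illustrative choices, not requirements of the brief:

```python
# Calibration + group-wise audit helper for a fitted binary classifier.
import pandas as pd
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss, roc_auc_score

def audit_model(model, X_test: pd.DataFrame, y_test: pd.Series,
                group_col: str) -> pd.DataFrame:
    proba = model.predict_proba(X_test)[:, 1]  # predicted probability of default

    # Overall calibration: reliability-curve bins plus Brier score.
    frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)
    print("Brier score:", brier_score_loss(y_test, proba))
    print("reliability curve:", list(zip(mean_pred.round(2), frac_pos.round(2))))

    # Per-slice audit: discrimination, observed default rate, mean predicted risk.
    frame = pd.DataFrame({"y": y_test.values, "p": proba,
                          "g": X_test[group_col].values})
    rows = []
    for g, sub in frame.groupby("g"):
        rows.append({"group": g,
                     "auc": roc_auc_score(sub["y"], sub["p"]),
                     "base_rate": sub["y"].mean(),
                     "mean_score": sub["p"].mean()})
    return pd.DataFrame(rows)
```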
Lab 2 — Customer segmentation
Cluster customers by mobile-money transaction patterns. Justify the choice of $k$, characterize each cluster, and write a 300-word memo for a non-technical product manager.
Dataset: Public anonymized mobile-money transaction sample (Orange/MTN open-data initiative).
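One defensible way to justify $k$ is a silhouette sweep over scaled features; a minimal sketch on stand-in data (swap in the per-customer mobile-money features):

```python
# Sweep k for k-means and report silhouette scores on scaled features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))             # placeholder per-customer features
X_scaled = StandardScaler().fit_transform(X)

for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    print(k, round(silhouette_score(X_scaled, labels), 3))
```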
Lab 3 — SHAP interpretation and its failure modes
SHAP-explain the XGBoost model from Lab 1. Then deliberately construct three cases where SHAP gives misleading explanations and document them.
Dataset: Same as Lab 1.
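A minimal sketch of the starting point, plus one classic failure mode to reproduce: two near-duplicate features can split the attribution between them, so per-feature mean |SHAP| can understate how much the underlying quantity matters. The data here is synthetic and the model settings are illustrative:

```python
# SHAP values for a tree ensemble, with a correlated-feature failure mode.
import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # near-duplicate of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = (x1 + 0.2 * x3 + rng.normal(scale=0.5, size=n) > 0).astype(int)

model = xgb.XGBClassifier(n_estimators=200, max_depth=3).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)     # shape (n_samples, n_features)

# The signal lives in x1, but the trees can spread credit across x1 and its
# near-copy x2, so neither column alone reflects the full underlying effect.
print(np.abs(shap_values).mean(axis=0).round(3))
```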
Readings
Mandatory
- Before Tuesday. Hastie, Tibshirani, Friedman, *The Elements of Statistical Learning* (ESL), chapters 3 (linear methods), 9 (trees), 10 (boosting)
- Before Wednesday. Christoph Molnar, *Interpretable Machine Learning*, chapters 5 (SHAP) and 8 (limitations)
Optional deepening
- Tianqi Chen and Carlos Guestrin, *XGBoost: A Scalable Tree Boosting System* (KDD 2016)
- Cynthia Rudin, *Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead* (Nature Machine Intelligence, 2019)