Lab 1 — Classification on a clinical dataset

Goal. Predict 30-day hospital readmission. Compare logistic regression, k-NN, Naive Bayes, plus a calibration-aware variant.

What you ship. A notebook with four models; accuracy, AUC, and Brier score for each; and a 200-word memo on why calibration matters in clinical ML.
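
To make the Brier score concrete: it is the mean squared difference between the predicted probability and the 0/1 outcome, so lower is better. The sketch below uses made-up numbers (the helper `brier_score` and the arrays `y_demo`, `p_sharp`, `p_hedged` are illustrative, not part of the lab data). Both hypothetical models rank the four patients identically, so AUC cannot tell them apart, but their Brier scores differ.

```python
import numpy as np

def brier_score(y_true, p_pred):
    """Mean squared gap between predicted probabilities and 0/1 outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(np.mean((p_pred - y_true) ** 2))

# Illustrative values only: four patients, two of whom were positive.
y_demo = [0, 0, 1, 1]
p_sharp = [0.1, 0.2, 0.8, 0.9]     # probabilities close to the outcomes
p_hedged = [0.4, 0.45, 0.55, 0.6]  # same ordering, probabilities near 0.5

print("sharp :", brier_score(y_demo, p_sharp))
print("hedged:", brier_score(y_demo, p_hedged))
```

In the notebook itself, `sklearn.metrics.brier_score_loss` (already imported in Setup) computes the same quantity for binary labels.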

Setup

Install the dependencies once; uncomment the line below to run it.

In [ ]:
# !pip install scikit-learn pandas matplotlib numpy
In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, brier_score_loss
from sklearn.datasets import fetch_openml

np.random.seed(42)

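A stand-in diabetes dataset

This notebook uses the OpenML `diabetes` dataset (the Pima Indians diabetes data) as a public alternative to MIMIC-IV, which requires PhysioNet credentialing. Its label is a positive diabetes test rather than 30-day readmission, so treat it as a proxy task. The actual cohort uses MIMIC-IV.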

In [ ]:
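# Download the OpenML 'diabetes' dataset as a DataFrame (features) and Series (label).
X, y = fetch_openml('diabetes', version=1, as_frame=True, return_X_y=True)
# Encode the label as 1 for 'tested_positive' and 0 otherwise.
y = (y == 'tested_positive').astype(int)
# Hold out 25% of the rows for testing; the fixed seed keeps the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print('train/test:', X_train.shape, X_test.shape)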

Exercise 1 — Three baselines

In [ ]:
# YOUR TURN — fit logistic, k-NN, and Naive Bayes. Report accuracy and AUC.
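
One possible starting point for the three baselines, offered as a sketch rather than a reference solution. It assumes a synthetic dataset from `make_classification` so that it runs offline; in the notebook, substitute `X_train`, `X_test`, `y_train`, `y_test` from the split above. The scaling step, `n_neighbors=15`, and `max_iter=1000` are assumptions chosen for illustration, not values prescribed by the lab.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): replace with the notebook's split.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, random_state=42
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Logistic regression and k-NN are scale-sensitive, so they get a scaler.
models = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15)),
    "naive_bayes": GaussianNB(),
}

results = {}
for name, model in models.items():
    model.fit(Xtr, ytr)
    proba = model.predict_proba(Xte)[:, 1]  # probability of the positive class
    results[name] = {
        "accuracy": accuracy_score(yte, model.predict(Xte)),
        "auc": roc_auc_score(yte, proba),
    }
    print(f"{name:12s} acc={results[name]['accuracy']:.3f} auc={results[name]['auc']:.3f}")
```

Note that AUC is computed from predicted probabilities, not from hard class labels.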

Exercise 2 — Calibration

In [ ]:
# YOUR TURN — compute Brier scores. Plot reliability diagrams for each.
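
A minimal sketch of a Brier score and reliability diagram for one model, again on synthetic stand-in data (an assumption so the block runs offline). `GaussianNB` is used only as an example; repeat the same steps for each fitted model from Exercise 1. The choice of 5 quantile bins is illustrative and worth revisiting on a test set of this size.

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption): replace with the notebook's split.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=5, random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

model = GaussianNB().fit(Xtr, ytr)
proba = model.predict_proba(Xte)[:, 1]

# Brier score: mean squared error of the predicted probabilities (lower is better).
brier = brier_score_loss(yte, proba)

# frac_pos: observed positive rate per bin; mean_pred: mean predicted probability per bin.
frac_pos, mean_pred = calibration_curve(yte, proba, n_bins=5, strategy="quantile")

fig, ax = plt.subplots(figsize=(4, 4))
ax.plot([0, 1], [0, 1], linestyle="--", color="gray", label="perfect calibration")
ax.plot(mean_pred, frac_pos, marker="o", label=f"GaussianNB (Brier={brier:.3f})")
ax.set_xlabel("Mean predicted probability")
ax.set_ylabel("Observed fraction of positives")
ax.legend()
```

Points below the diagonal indicate over-prediction (predicted risk higher than observed); points above it indicate under-prediction.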

Exercise 3 — Calibrated variant

In [ ]:
# YOUR TURN — Wrap the best uncalibrated model in CalibratedClassifierCV.
# Recompute Brier score and AUC.
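
A hedged sketch of the wrapping step, assuming `GaussianNB` as the base model and synthetic stand-in data so it runs offline. In the notebook, use whichever model scored best in Exercise 1 together with the real split. `method="sigmoid"` (Platt scaling) and `cv=5` are illustrative choices; `method="isotonic"` is the other built-in option and tends to need more data.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption): replace with the notebook's split.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=5, random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Uncalibrated baseline.
raw = GaussianNB().fit(Xtr, ytr)

# Calibrated variant: the calibrator is fit with internal cross-validation
# on the training data only, so the test set stays untouched.
calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5).fit(Xtr, ytr)

scores = {}
for name, model in {"raw": raw, "calibrated": calibrated}.items():
    proba = model.predict_proba(Xte)[:, 1]
    scores[name] = {
        "brier": brier_score_loss(yte, proba),
        "auc": roc_auc_score(yte, proba),
    }
    print(f"{name:10s} brier={scores[name]['brier']:.3f} auc={scores[name]['auc']:.3f}")
```

Calibration mainly reshapes the probabilities rather than the ranking, so AUC often changes little while the Brier score may improve; whether it does on a given dataset is something to check, not assume.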

Done?

Submit per the cohort schedule. Peer-review pairings will be announced the following Monday.