Lab 1: Classification on a clinical dataset
Goal. Predict 30-day hospital readmission. Compare logistic regression, k-NN, and Naive Bayes, plus a calibration-aware variant.
What you ship. A notebook with four models, reporting accuracy, AUC, and Brier score (a calibration-sensitive metric) for each, plus a 200-word memo on why calibration matters in clinical ML.
Setup
Install the dependencies (one-time). Uncomment the line in the next cell and run it.
In [ ]:
# !pip install scikit-learn pandas matplotlib numpy
In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, brier_score_loss
from sklearn.datasets import fetch_openml
np.random.seed(42)
A diabetes-readmission dataset
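This is a public alternative to the MIMIC-IV demo (which requires PhysioNet credentialing); the actual cohort uses MIMIC-IV. Note that the OpenML 'diabetes' dataset (version 1) loaded here is the Pima Indians Diabetes data: its label is a positive diabetes test, not 30-day readmission, so treat it as a stand-in binary outcome for practicing the workflow.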
In [ ]:
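# Download OpenML 'diabetes' (version 1) as a pandas DataFrame (features) and Series (label).
X, y = fetch_openml('diabetes', version=1, as_frame=True, return_X_y=True)
# Encode the label as 1 = tested_positive, 0 = tested_negative.
y = (y == 'tested_positive').astype(int)
# Hold out 25% for testing; the fixed random_state keeps the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print('train/test:', X_train.shape, X_test.shape)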
Exercise 1: Three baselines
In [ ]:
# YOUR TURN — fit logistic, k-NN, and Naive Bayes. Report accuracy and AUC.
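If the shape of the loop is unclear, the following is a minimal sketch on synthetic data rather than the lab dataset, so it does not replace your own solution. The names `X_demo`, `Xd_train`, `demo_models`, and `demo_results` are hypothetical, and the choices of `n_neighbors=15` and feature scaling for the distance-based and linear models are assumptions, not requirements of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 numeric features, binary label.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5,
    n_clusters_per_class=1, random_state=0,
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Scaling matters for k-NN (distance-based) and helps logistic regression converge.
demo_models = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15)),
    "naive_bayes": GaussianNB(),
}

demo_results = {}
for name, model in demo_models.items():
    model.fit(Xd_train, yd_train)
    pred = model.predict(Xd_test)               # hard labels, used for accuracy
    proba = model.predict_proba(Xd_test)[:, 1]  # P(class 1), used for AUC
    demo_results[name] = {
        "accuracy": accuracy_score(yd_test, pred),
        "auc": roc_auc_score(yd_test, proba),
    }
    print(f"{name:12s} accuracy={demo_results[name]['accuracy']:.3f} "
          f"AUC={demo_results[name]['auc']:.3f}")
```

Note that AUC is computed from predicted probabilities, not from hard labels; passing `predict` output to `roc_auc_score` is a common mistake that understates ranking quality.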
Exercise 2: Calibration
In [ ]:
# YOUR TURN — compute Brier scores. Plot reliability diagrams for each.
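As a hedged illustration of the two pieces this exercise asks for, the sketch below computes a Brier score and draws one reliability diagram for a single model on synthetic data. The variable names (`nb_demo`, `proba_demo`, `frac_pos`, `mean_pred`) are hypothetical, and the choice of 5 quantile bins is an assumption; you will need to repeat the idea for each of your models on the lab data.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption), not the lab dataset.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5,
    n_clusters_per_class=1, random_state=0,
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

nb_demo = GaussianNB().fit(Xd_train, yd_train)
proba_demo = nb_demo.predict_proba(Xd_test)[:, 1]

# Brier score: mean squared gap between predicted probability and the 0/1 outcome.
brier_demo = brier_score_loss(yd_test, proba_demo)
brier_manual = np.mean((proba_demo - yd_test) ** 2)  # same quantity, by hand
print(f"Brier score: {brier_demo:.4f}")

# calibration_curve returns (observed fraction of positives, mean predicted probability) per bin.
frac_pos, mean_pred = calibration_curve(yd_test, proba_demo, n_bins=5, strategy="quantile")

# Reliability diagram: points on the diagonal mean predicted risk matches observed risk.
fig, ax = plt.subplots(figsize=(4, 4))
ax.plot([0, 1], [0, 1], linestyle="--", color="gray", label="perfectly calibrated")
ax.plot(mean_pred, frac_pos, marker="o", label="GaussianNB (demo)")
ax.set_xlabel("Mean predicted probability")
ax.set_ylabel("Observed fraction of positives")
ax.legend()
```

Lower Brier scores are better. A curve below the diagonal indicates over-predicted risk in that bin, and a curve above it indicates under-predicted risk.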
Exercise 3: Calibrated variant
In [ ]:
# YOUR TURN — Wrap the best uncalibrated model in CalibratedClassifierCV.
# Recompute Brier score and AUC.
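The wrapping step can be sketched as follows, again on synthetic data. GaussianNB is used here only for illustration and is not a claim about which of your models is best; `method="sigmoid"` (Platt scaling) and `cv=5` are assumptions, and `method="isotonic"` is the other built-in option. The names `raw_nb`, `calibrated_nb`, and `demo_scores` are hypothetical.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption), not the lab dataset.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5,
    n_clusters_per_class=1, random_state=0,
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Uncalibrated baseline.
raw_nb = GaussianNB().fit(Xd_train, yd_train)

# Calibrated wrapper: internally cross-validates on the training set only,
# fitting the base model and a sigmoid calibrator on separate folds.
calibrated_nb = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
calibrated_nb.fit(Xd_train, yd_train)

demo_scores = {}
for name, model in [("raw", raw_nb), ("calibrated", calibrated_nb)]:
    proba = model.predict_proba(Xd_test)[:, 1]
    demo_scores[name] = {
        "brier": brier_score_loss(yd_test, proba),
        "auc": roc_auc_score(yd_test, proba),
    }
    print(f"{name:10s} Brier={demo_scores[name]['brier']:.4f} "
          f"AUC={demo_scores[name]['auc']:.3f}")
```

Calibration is fitted on the training data only, so the test set stays untouched for the final comparison. Expect AUC to change little, since calibration mostly rescales probabilities rather than reordering them, while the Brier score may or may not improve depending on the model and data.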
Done?
Submit per the cohort schedule. Peer-review pairings will be announced the following Monday.