Lab 2 — Versioning the full ML project

Goal. Take an existing ML notebook and version every part of it: the code in Git, the dataset in DVC, and the trained model artifact in the MLflow Model Registry. Then tag a v1.0 release that can be reproduced from scratch.

What you ship. Public repo with code + DVC remote pointer + MLflow registry pointer. README documenting the reproduction steps.
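
As a rough sketch of what the README's reproduction steps might look like, the helper below assembles the shell commands a reviewer would run. The repository URL, the `requirements.txt` file, and the `train.py` script name are all hypothetical placeholders, not part of the lab; substitute whatever your project actually uses.

```python
def reproduction_commands(
    repo_url: str,
    tag: str = "v1.0",
    script: str = "train.py",  # hypothetical training entry point
) -> list[str]:
    """Return the shell commands that rebuild the project from a clean machine."""
    return [
        f"git clone {repo_url} lab2",       # code and .dvc pointer files
        "cd lab2",
        f"git checkout {tag}",              # pin the exact release
        "pip install -r requirements.txt",  # hypothetical dependency file
        "dvc pull",                         # fetch the dataset from the DVC remote
        f"python {script}",                 # retrain and log to MLflow
    ]


# Placeholder URL for illustration only.
for line in reproduction_commands("https://example.com/you/lab2.git"):
    print(line)
```

If a fresh clone cannot get through these steps without manual fixes, the release is not yet reproducible.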

Setup

Install the dependencies (one-time).

In [ ]:
# !pip install dvc mlflow scikit-learn
In [ ]:
import pathlib
import subprocess
import mlflow
import mlflow.sklearn

Pick a model from a prior lab and version it end-to-end

In [ ]:
# Set the MLflow tracking URI (a local SQLite file for the lab; a remote
# tracking server or Databricks for production)
import os
os.environ.setdefault('MLFLOW_TRACKING_URI', 'sqlite:///mlflow.db')
print('MLflow URI:', os.environ['MLFLOW_TRACKING_URI'])

Exercise 1 — Version the data with DVC

In [ ]:
# !pip install dvc dvc-s3
# !dvc init
# !dvc add data/training.csv
# !git add data/training.csv.dvc data/.gitignore
#
# YOUR TURN — point a DVC remote at the storage of your choice (S3, GCS, Azure, local).
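
One possible shape for the remote setup, shown as a minimal sketch rather than the expected solution. It assumes a local directory remote at the hypothetical path `/tmp/dvc-storage` and a remote named `storage`; both are illustrative choices. The helper only builds the commands, and `run_all` executes them when called from inside the repository.

```python
import subprocess


def dvc_remote_commands(url: str, name: str = "storage") -> list[list[str]]:
    """Build the commands that register a default DVC remote and upload the data."""
    return [
        ["dvc", "remote", "add", "-d", name, url],  # -d makes it the default remote
        ["git", "add", ".dvc/config"],              # the remote is recorded in .dvc/config
        ["dvc", "push"],                            # upload tracked data to the remote
    ]


def run_all(commands: list[list[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Hypothetical local remote; swap in an s3://, gs://, or azure:// URL as needed.
for cmd in dvc_remote_commands("/tmp/dvc-storage"):
    print(" ".join(cmd))
```

Calling `run_all(dvc_remote_commands("/tmp/dvc-storage"))` from the repository root would apply the commands. Cloud remotes typically also need the matching extra package (for example `dvc-s3`) and credentials.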

Exercise 2 — Log model training with MLflow

In [ ]:
# YOUR TURN
# Wrap your training script in mlflow.start_run(). Log params, metrics, and
# the trained model with mlflow.sklearn.log_model.
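
A minimal sketch of the pattern described above, under stated assumptions: the Iris dataset stands in for `data/training.csv`, logistic regression stands in for your own model, and the experiment name `lab2-versioning` and the hyperparameters are hypothetical. The model is logged under the artifact path `model`, passed positionally.

```python
from typing import Any, Optional

# Hypothetical hyperparameters for illustration; use your model's own.
DEFAULT_PARAMS: dict[str, Any] = {"C": 1.0, "max_iter": 200}


def train_and_log(params: Optional[dict[str, Any]] = None) -> str:
    """Train a toy classifier inside an MLflow run and return the run ID."""
    # Imported lazily so the function can be defined before packages are installed.
    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    params = dict(DEFAULT_PARAMS if params is None else params)

    X, y = load_iris(return_X_y=True)  # stand-in for data/training.csv
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    mlflow.set_experiment("lab2-versioning")  # hypothetical experiment name
    with mlflow.start_run() as run:
        mlflow.log_params(params)                    # hyperparameters
        model = LogisticRegression(**params).fit(X_train, y_train)
        mlflow.log_metric("accuracy", model.score(X_test, y_test))
        mlflow.sklearn.log_model(model, "model")     # artifact path "model"
        return run.info.run_id
```

Calling `run_id = train_and_log()` gives you the run ID needed to register the model in the next exercise.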

Exercise 3 — Promote to Model Registry

In [ ]:
# YOUR TURN
# Register the model. Transition stage from None -> Staging -> Production.
# Tag a v1.0 release in Git.
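
The sketch below shows one way these steps could fit together, assuming the model was logged under the artifact path `model` and using the hypothetical registry name `lab2-model`. Recent MLflow releases deprecate stage transitions in favor of aliases (`MlflowClient.set_registered_model_alias`), so treat the stage calls as matching the exercise wording rather than as current best practice.

```python
def register_and_promote(run_id: str, name: str = "lab2-model") -> int:
    """Register the run's model, then move it through Staging to Production."""
    # Imported lazily so the function can be defined without MLflow installed.
    import mlflow
    from mlflow.tracking import MlflowClient

    # Assumes the model was logged under the artifact path "model".
    mv = mlflow.register_model(model_uri=f"runs:/{run_id}/model", name=name)

    client = MlflowClient()
    for stage in ("Staging", "Production"):
        client.transition_model_version_stage(
            name=name, version=mv.version, stage=stage
        )
    return int(mv.version)


def git_tag_commands(tag: str = "v1.0") -> list[list[str]]:
    """Build the commands that create and publish an annotated Git tag."""
    return [
        ["git", "tag", "-a", tag, "-m", f"Release {tag}"],
        ["git", "push", "origin", tag],  # assumes the remote is named "origin"
    ]


for cmd in git_tag_commands():
    print(" ".join(cmd))
```

Tagging only after the model reaches Production keeps the Git release and the registry version pointing at the same state of the project.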

Done?

Submit per the cohort schedule. Peer-review pairings will be announced the following Monday.