Week 05 — Data Pipelines and Feature Stores
Orchestration: the chain of transformations from raw data to model-ready features, running on a schedule, observable, idempotent.
Lecture
DAGs for data pipelines · Airflow (Beauchemin 2014) · Prefect, Dagster, Metaflow, Kubeflow Pipelines · feature stores (Feast, Hopsworks) · idempotence, backfill, late-arriving data · the training-serving feature parity problem.
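To make the idempotence and backfill items concrete, here is a minimal sketch in plain Python. It is not Airflow, Prefect, or any of the tools listed above. The in-memory `warehouse` dict, the three task functions, and the `run_pipeline` and `backfill` helpers are all hypothetical names chosen for illustration. The idea it tries to show: each task overwrites the partition for one day instead of appending to it, so a rerun for the same day leaves the same state, and a backfill is just the same run repeated over a date range.

```python
from datetime import date, timedelta
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical in-memory "warehouse": table name -> {partition day -> data}.
warehouse = {"raw_events": {}, "clean_events": {}, "daily_features": {}}


def extract(day):
    # Stand-in for reading a source system. Overwrites the day's partition.
    warehouse["raw_events"][day] = [
        {"user": "u1", "amount": 10.0},
        {"user": "u1", "amount": None},  # a bad record to be filtered out
        {"user": "u2", "amount": 5.0},
    ]


def clean(day):
    # Drop rows with a missing amount. Overwrites, never appends.
    rows = warehouse["raw_events"][day]
    warehouse["clean_events"][day] = [r for r in rows if r["amount"] is not None]


def featurize(day):
    # Per-user daily total, written as a whole partition.
    totals = {}
    for r in warehouse["clean_events"][day]:
        totals[r["user"]] = totals.get(r["user"], 0.0) + r["amount"]
    warehouse["daily_features"][day] = totals


# The DAG: each task maps to the set of upstream tasks it depends on.
DAG = {extract: set(), clean: {extract}, featurize: {clean}}


def run_pipeline(day):
    # Run every task for one day, upstream tasks first.
    for task in TopologicalSorter(DAG).static_order():
        task(day)


def backfill(start, end):
    # Re-run the pipeline for every day in [start, end], inclusive.
    day = start
    while day <= end:
        run_pipeline(day)
        day += timedelta(days=1)
```

Calling `run_pipeline(date(2024, 1, 1))` twice leaves `warehouse["daily_features"]` with a single, unchanged partition for that day. If the tasks appended instead of overwriting, the second run would double the totals, which is the failure idempotence is meant to rule out. Late-arriving data can then be handled by calling `backfill` over the affected days.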
Read before the lecture
Recitation — paper discussion
Hermann and Del Balso, *Meet Michelangelo: Uber's Machine Learning Platform* (Uber Engineering, 2017) (paper)
Come ready to argue one side of each:
- What does Michelangelo solve that a notebook + Airflow doesn't?
- What's the smallest team for which Michelangelo's complexity makes sense?
Reference text for this week: chapter 05 of the bilingual notes — EN PDF · FR PDF.