Week 01 — Python for data work
Module 1 of the AIRINA Labs ML & AI Bootcamp. Idiomatic Python, NumPy, Pandas, reproducible projects, Jupyter discipline.
Module 1 of 10.
Take the Python you already half-know and make it precise enough to ship.
This first week is partly a calibration and partly a hard reset on tooling. Almost every participant arrives with some Python — a few Coursera notebooks, a couple of Kaggle attempts, a research project’s script. Almost every participant also arrives with at least one bad habit that will block them in week 5 if we don’t fix it now. The week’s labs are aggressive about exposing those habits: messy notebooks, hidden state, untyped functions, missing tests, broken environments.
By Friday, every participant ships a clean, reproducible Python project published on their personal GitHub. That artifact becomes the foundation that every later module builds on.
What you ship this week
| Deliverable | A public GitHub repository containing the cleaned-up version of Lab 1 (the 300-line refactor) with a README.md, a pyproject.toml, a working pytest suite, and CI passing on a single push. |
| Due | Friday end of week 1, 18:00 Africa/Lagos. |
| Submission | Drop the repo URL into the #cohort-week-01 channel. Peer review pairing is announced Monday of week 2. |
| Rubric | The deliverable is pass / revise. Pass requires green CI, tests covering the public API, and a README a stranger can follow to install and run the code. |
Live sessions
| Day | Time | Topic | Link | Recording |
|---|---|---|---|---|
| Monday | 17:00 | Kickoff and cohort introductions | (filled in at intake) | (pending) |
| Tuesday | 17:00 | Idiomatic Python: comprehensions, generators, decorators | (pending) | (pending) |
| Wednesday | 17:00 | NumPy and vectorization, with worked examples on real data | (pending) | (pending) |
| Thursday | 17:00 | Pandas in depth: joins, group-by, reshape, the index | (pending) | (pending) |
| Friday | 17:00 | Reproducible projects: virtualenv, pyproject, lockfile, pre-commit | (pending) | (pending) |
Total live time this week: about 7.5 hours (five 90-minute sessions). Office hours add another 90 minutes on Wednesday.
Learning outcomes
By the end of the week, every participant will be able to:
- Write idiomatic Python. Use comprehensions, generators, context managers, and decorators where they pay off and not where they don’t. Recognize and refactor non-idiomatic patterns.
- Use NumPy and Pandas fluently for vectorized data work, without falling back to Python-level loops on arrays of more than a few hundred elements.
- Build a reproducible Python project. Virtual environment, pinned dependencies, pyproject.toml, pre-commit hooks, a working pytest suite, a CI pipeline that runs on each push.
- Read and write Jupyter notebooks without losing reproducibility. Hidden-state hygiene, output management, conversion to scripts, version control of notebooks.
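As a small illustration of the vectorization outcome above, the sketch below computes the same z-score normalization twice: once with a Python-level loop and once with whole-array NumPy operations. The function names and the sample values are made up for illustration and are not taken from the lab material.

```python
import numpy as np


def zscore_loop(values: list[float]) -> list[float]:
    """Python-level loop: fine for tiny inputs, slow for large arrays."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = variance ** 0.5
    return [(v - mean) / std for v in values]


def zscore_vectorized(values: np.ndarray) -> np.ndarray:
    """Same computation expressed as whole-array operations."""
    # ndarray.std() defaults to the population standard deviation (ddof=0),
    # which matches the division by n in the loop version.
    return (values - values.mean()) / values.std()


# Hypothetical sample data, for illustration only.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
loop_result = zscore_loop(sample)
vector_result = zscore_vectorized(np.array(sample))
```

Both versions return the same numbers. The vectorized one pushes the per-element work into NumPy's compiled code, which is where the speed difference on large arrays comes from.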
Readings and prep
Before Tuesday. Hitchhiker’s Guide to Python chapters on "Writing great Python code" (idioms, structuring projects, vendorization). Skim, not deep-read — we’ll work through the same material in the live session.
Before Wednesday. Wes McKinney, Python for Data Analysis (3rd ed., O’Reilly 2022), chapters 4 (NumPy basics) and 5 (Getting started with Pandas). Available open-access at wesmckinney.com/book.
Before Thursday. Same book, chapters 8 and 10 (data wrangling, group-by mechanics).
Optional deepening. Luciano Ramalho, Fluent Python (2nd ed., O’Reilly 2022) for any participant who already knows Pandas well and wants the language-level deep dive. Especially chapters 17 (iterators, generators, and classic coroutines), 18 (with, match, and else blocks), and 24 (class metaprogramming).
Lab notebooks
Lab 1 — The 300-line refactor
A deliberately ugly Python script (messy_data_pipeline.py) is provided. It loads three CSV files, joins them, computes a handful of metrics, and writes outputs to disk. The script works. It is also unreadable, untested, full of hidden state, and has two latent bugs that don’t fire on the example data.
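To make "hidden state" concrete, here is a minimal sketch of the pattern and one way to remove it. All names are hypothetical and do not come from messy_data_pipeline.py.

```python
# Hidden state: the result depends on how many times the function has run,
# which is exactly what makes re-executed notebook cells unreliable.
_running_total = 0


def add_facility_count_hidden(count: int) -> int:
    global _running_total
    _running_total += count
    return _running_total


# Pure alternative: every input is an argument and the only output is the
# return value, so the same input always gives the same answer.
def total_facility_count(counts: list[int]) -> int:
    return sum(counts)


first_call = add_facility_count_hidden(5)
second_call = add_facility_count_hidden(5)  # same argument, different result
pure_first = total_facility_count([5, 5])
pure_second = total_facility_count([5, 5])  # same argument, same result
```

The pure version is also the one that is straightforward to test, because a test does not need to reset any module-level variable before calling it.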
Your task. Refactor it into a clean Python package with:
- src/<your_project>/ module layout
- One pure function per responsibility, typed with mypy --strict
- A pytest suite that catches both latent bugs
- A pyproject.toml that pins all dependencies
- A README.md with install + run instructions a stranger can follow
- A passing GitHub Actions workflow (Python 3.11, lint + type-check + test)
Dataset. The lab dataset is the public Kenya Health Facilities Registry export from December 2024 (~12,000 facilities). The notebook in the lab repo has the download link.
Notebook. week-01-lab-01-refactor.ipynb (link finalized at intake).
Lab 2 — Pandas wrangling on a real dataset
Starting from the raw Kenya health facility data, produce a clean analytical DataFrame that answers three real questions: (1) What fraction of facilities in each county is public vs private? (2) Where are the underserved areas, measured as population per facility? (3) Does the picture change if you weight by facility capacity rather than headcount? Hand in a notebook and a short writeup.
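The first two questions can be sketched on toy data as follows. The column names (county, ownership), the county labels, and the population figures are assumptions made for this example; the real registry export will use its own schema and values.

```python
import pandas as pd

# Toy rows standing in for the registry export; the real column names
# and values will differ.
facilities = pd.DataFrame(
    {
        "county": ["A", "A", "A", "A", "B", "B"],
        "ownership": ["public", "public", "public", "private", "public", "private"],
    }
)
# Hypothetical county populations, for illustration only.
population = pd.Series({"A": 8_000, "B": 3_000}, name="population")

# Question 1: share of public vs private facilities within each county.
# normalize="index" makes every row (county) sum to 1.
ownership_share = pd.crosstab(
    facilities["county"], facilities["ownership"], normalize="index"
)

# Question 2: residents per facility. The division aligns the two Series
# on their shared county labels.
facility_count = facilities.groupby("county").size()
residents_per_facility = population / facility_count
```

Because the division aligns on the index, a county that appears in only one of the two Series yields NaN instead of a silently wrong number, which is worth checking for on the real data.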
Notebook. week-01-lab-02-pandas.ipynb.
Lab 3 — Publish your project
Take the refactored Lab 1 package and publish it to a public GitHub repository under your own account, with a release tag, a versioned pyproject.toml, and a CI badge in the README. Add it to your bootcamp profile.
This is the artifact you reference every subsequent week — and the one that recruiters end up looking at first.
Office hours
| When | TA | Topic |
|---|---|---|
| Wednesday 19:00 (Africa/Lagos) | (announced at intake) | Open: bring your refactor, your environment problems, anything blocked. |
Async questions go in the #cohort-week-01-questions channel; expected response time is under 24 hours during the week.
Builds on / connects to
This week pulls from two existing course-catalogue volumes:
- Programmation Scientifique — the Python-as-scientific-language perspective.
- Introduction to Data Science — chapters 1-2 of the bilingual notes.
The bootcamp condenses the relevant chapters into the week’s reading list rather than re-deriving the material.