Week 05 — Natural language processing

Module 5: text preprocessing, embeddings, transformers, BERT-family fine-tuning, multilingual / low-resource NLP with attention to African languages.


From linguistics-aware classical methods to the transformer-era pipeline.

What you ship this week

Three-deliverable pack: a sentiment classifier (logistic baseline + DistilBERT), an NER tagger on a multilingual dataset including an African language, and a fine-tuned summarization model.

Due: Friday 18:00 Africa/Lagos (UTC+1).
Submission: Drop the repo URL into the week's cohort channel. Peer-review pairing is announced the following Monday.
Rubric: Pass / revise. A pass requires green CI, tests covering the public API, and a README a stranger can follow to install and run the code.

Live sessions and labs

Default weekly cadence below. Cohort-specific dates and Zoom links fill in at intake.

| Day | Time        | Block                                  | Recording    |
| --- | ----------- | -------------------------------------- | ------------ |
| Mon | 09:00-12:00 | Live instruction + code-along          | Post-session |
| Mon | 14:00-16:00 | Independent lab work + TA office hours | Post-session |
| Tue | 09:00-12:00 | Live instruction + code-along          | Post-session |
| Tue | 14:00-16:00 | Independent lab work + TA office hours | Post-session |
| Wed | 09:00-12:00 | Live instruction + code-along          | Post-session |
| Wed | 14:00-16:00 | Independent lab work + TA office hours | Post-session |
| Thu | 09:00-12:00 | Live instruction + code-along          | Post-session |
| Thu | 14:00-16:00 | Independent lab work + TA office hours | Post-session |
| Fri | 10:00-11:00 | Industry speaker                       | Post-session |
| Fri | 11:30-12:30 | Lab review                             | Post-session |
| Fri | 14:00-15:00 | Cohort retrospective                   | Post-session |

Learning outcomes

By the end of the week, every participant will:

  1. Build a working text classification pipeline (cleaning, tokenization, vectorization, training, evaluation).
  2. Fine-tune a pretrained transformer on a domain-specific task.
  3. Apply NLP to a multilingual or low-resource setting (with attention to African languages).
  4. Understand the limitations of current NLP systems: hallucination, bias, and evaluation difficulty.

Topics covered

Text preprocessing and tokenization (BPE, WordPiece) · word embeddings (Word2Vec, GloVe, FastText) · sequence models (RNN, LSTM, GRU) · the Transformer architecture · BERT-family models and fine-tuning · NER, sentiment, classification, summarization · multilingual and low-resource NLP · evaluation: BLEU, ROUGE, exact match, human evaluation.
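To make the tokenization topic concrete, here is a minimal sketch contrasting BPE and WordPiece on the same sentence. It assumes the HuggingFace `transformers` package; the checkpoint names are illustrative choices, not course requirements.

```python
# Contrast byte-level BPE (GPT-2) with WordPiece (BERT) on one sentence.
# Checkpoint names are illustrative; any BPE/WordPiece tokenizers behave alike.
from transformers import AutoTokenizer

bpe = AutoTokenizer.from_pretrained("gpt2")                   # byte-level BPE
wordpiece = AutoTokenizer.from_pretrained("bert-base-cased")  # WordPiece

sentence = "Preprocessing rarely survives out-of-vocabulary words intact."
print(bpe.tokenize(sentence))        # 'Ġ' marks the start of a new word
print(wordpiece.tokenize(sentence))  # '##' marks a continuation subword
```

Running both on the same input shows how each scheme splits rare words into subword pieces instead of collapsing them to a single unknown token.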

Labs

Lab 1 — Sentiment from baseline to BERT

Two pipelines on the same customer-review corpus: a TF-IDF + logistic-regression baseline and a fine-tuned DistilBERT. Report accuracy, macro F1, and the confusion matrix for each.

Dataset: Amazon multilingual reviews (HuggingFace `amazon_reviews_multi`, using the French and English configurations).
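A minimal sketch of the baseline half of the lab, assuming scikit-learn; the toy `train_texts`/`train_labels` below are placeholders for the review corpus, not part of the assignment.

```python
# TF-IDF + logistic-regression sentiment baseline, reporting the lab's metrics.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Placeholder data; swap in the train/test splits of the review corpus.
train_texts = ["great battery life", "arrived broken", "does the job", "waste of money"]
train_labels = [1, 0, 1, 0]
test_texts = ["broken on arrival", "excellent value"]
test_labels = [0, 1]

baseline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(train_texts, train_labels)

preds = baseline.predict(test_texts)
print("accuracy:", accuracy_score(test_labels, preds))
print("macro F1:", f1_score(test_labels, preds, average="macro"))
print(confusion_matrix(test_labels, preds))
```

Getting this baseline right first gives you a reference point for judging whether the DistilBERT fine-tune is worth its extra cost.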

Lab 2 — Multilingual NER

Train a token-classification head on a multilingual dataset that includes at least one African language. Report per-language F1.

Dataset: MasakhaNER 2.0 (20 African languages, including Wolof, Yoruba, Swahili, and Hausa).
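The step that usually trips people up here is re-aligning word-level NER tags after subword tokenization. A sketch of that alignment, assuming the `transformers` package; `xlm-roberta-base` and the integer tags are illustrative choices, and -100 is the index PyTorch's cross-entropy loss ignores by default.

```python
# Align word-level NER labels to subword tokens; -100 marks positions the loss
# should skip (special tokens and non-initial subwords of a word).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")  # multilingual encoder

def tokenize_and_align(words, word_labels):
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    labels, previous = [], None
    for word_id in enc.word_ids():
        if word_id is None:          # <s>, </s>, padding
            labels.append(-100)
        elif word_id != previous:    # first subword carries the word's tag
            labels.append(word_labels[word_id])
        else:                        # remaining subwords are ignored
            labels.append(-100)
        previous = word_id
    enc["labels"] = labels
    return enc

# Illustrative sentence with made-up integer tags (e.g. 1 = B-PER, 5 = B-LOC).
enc = tokenize_and_align(["Adé", "travelled", "to", "Lagos"], [1, 0, 0, 5])
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
print(enc["labels"])
```

Map this function over the dataset before training, then compute per-language F1 on the word-level tags rather than the subword positions.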

Lab 3 — Fine-tuned summarization

Fine-tune T5-base or BART-base on a news-summarization corpus. Evaluate with ROUGE-L and with a 50-example human-evaluation rubric you design.

Dataset: XSum or CNN/DailyMail (English baseline).
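For the automatic half of the evaluation, a minimal ROUGE sketch using the HuggingFace `evaluate` package; the prediction and reference strings below are stand-ins for your model's outputs and the gold summaries.

```python
# Score generated summaries against reference summaries with ROUGE.
# rougeLsum is the variant usually reported for multi-sentence news summaries.
import evaluate

rouge = evaluate.load("rouge")
predictions = ["The minister announced new funding for rural schools."]
references = ["New funding for rural schools was announced by the minister on Tuesday."]

scores = rouge.compute(predictions=predictions, references=references)
print({k: round(v, 4) for k, v in scores.items()})  # rouge1, rouge2, rougeL, rougeLsum
```

Treat the ROUGE numbers as a sanity check rather than ground truth; the 50-example human-eval rubric is what tells you whether the summaries are actually faithful.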

Readings

Mandatory

Optional deepening

Builds on (course catalogue)