Week 05 — Natural language processing
Module 5: text preprocessing, embeddings, transformers, BERT-family fine-tuning, and multilingual/low-resource NLP with attention to African languages.
From linguistics-aware classical methods to the transformer-era pipeline.
What you ship this week
Three-deliverable pack: a sentiment classifier (logistic-regression baseline + DistilBERT), an NER tagger on a multilingual dataset that includes an African language, and a fine-tuned summarization model.
| Due | Friday 18:00 Africa/Lagos (UTC+1) |
|---|---|
| Submission | Drop the repo URL into the week's cohort channel. Peer-review pairing announced Monday of next week. |
| Rubric | Pass / revise. Pass requires green CI, tests covering the public API, and a README a stranger can follow to install and run the code. |
Live sessions and labs
Default weekly cadence below. Cohort-specific dates and Zoom links are filled in at intake.
| Day | Time | Block | Recording |
|---|---|---|---|
| Mon | 09:00-12:00 | Live instruction + code-along | (post-session) |
| Mon | 14:00-16:00 | Independent lab work + TA office hours | (post-session) |
| Tue | 09:00-12:00 | Live instruction + code-along | (post-session) |
| Tue | 14:00-16:00 | Independent lab work + TA office hours | (post-session) |
| Wed | 09:00-12:00 | Live instruction + code-along | (post-session) |
| Wed | 14:00-16:00 | Independent lab work + TA office hours | (post-session) |
| Thu | 09:00-12:00 | Live instruction + code-along | (post-session) |
| Thu | 14:00-16:00 | Independent lab work + TA office hours | (post-session) |
| Fri | 10:00-11:00 | Industry speaker | (post-session) |
| Fri | 11:30-12:30 | Lab review | (post-session) |
| Fri | 14:00-15:00 | Cohort retrospective | (post-session) |
Learning outcomes
By the end of the week, every participant will:
- Build a working text classification pipeline (cleaning, tokenization, vectorization, training, evaluation); see the sketch after this list.
- Fine-tune a pretrained transformer on a domain-specific task.
- Apply NLP to a multilingual or low-resource setting (with attention to African languages).
- Understand the limitations: hallucination, bias, evaluation difficulty.
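The first outcome end to end, as a minimal scikit-learn sketch. The file `reviews.csv` and its `text`/`label` columns are placeholders, not a prescribed dataset:

```python
# End-to-end text classification: load, split, vectorize, train, evaluate.
# Sketch only: "reviews.csv" and its "text"/"label" columns are placeholders.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("reviews.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

pipe = Pipeline([
    # Lowercasing and word tokenization happen inside the vectorizer.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=50_000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print(classification_report(y_test, pipe.predict(X_test)))
```

Keeping the baseline inside a `Pipeline` means swapping the vectorizer or the classifier is a one-line change, which is the shape the labs reuse all week.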
Topics covered
Text preprocessing and tokenization (BPE, WordPiece) · word embeddings (Word2Vec, GloVe, FastText) · sequence models (RNNs, LSTMs, GRUs) · the Transformer architecture · BERT-family models and fine-tuning · NER, sentiment analysis, classification, summarization · multilingual and low-resource NLP · evaluation: BLEU, ROUGE, exact match, human evaluation.
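To make the BPE/WordPiece contrast concrete before the labs, a quick comparison with `transformers` tokenizers; both checkpoint names are standard Hub IDs chosen here for illustration:

```python
# Compare WordPiece (BERT) with byte-level BPE (GPT-2) on the same sentence.
from transformers import AutoTokenizer

text = "Tokenization splits unseen words into subwords."

wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece
bpe = AutoTokenizer.from_pretrained("gpt2")                     # byte-level BPE

print(wordpiece.tokenize(text))  # e.g. ['token', '##ization', 'splits', ...]
print(bpe.tokenize(text))        # e.g. ['Token', 'ization', 'Ġsplits', ...]
```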
Labs
Lab 1 — Sentiment from baseline to BERT
Two pipelines on the same customer-review corpus: a TF-IDF + logistic-regression baseline and a fine-tuned DistilBERT. Report accuracy, macro F1, and the per-class confusion matrix.
Dataset: Amazon multilingual reviews (HuggingFace `amazon_reviews_multi`, English and French subsets).
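One plausible shape for the DistilBERT half of the lab, sketched on IMDB as a stand-in corpus (swap in the review data above); the checkpoint and hyperparameters are illustrative, not required:

```python
# Fine-tune DistilBERT for sentiment; report accuracy, macro F1, and the
# confusion matrix. IMDB is a stand-in corpus; swap in the lab's review data.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

ds = load_dataset("imdb").map(
    lambda b: tokenizer(b["text"], truncation=True, padding="max_length", max_length=256),
    batched=True,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    print(confusion_matrix(labels, preds))  # per-class confusion matrix
    return {"accuracy": accuracy_score(labels, preds),
            "macro_f1": f1_score(labels, preds, average="macro")}

args = TrainingArguments(
    output_dir="distilbert-sentiment",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    eval_strategy="epoch",  # named `evaluation_strategy` in older transformers releases
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds["train"].shuffle(seed=42).select(range(2000)),  # subsample for speed
    eval_dataset=ds["test"].shuffle(seed=42).select(range(1000)),
    compute_metrics=compute_metrics,
)
trainer.train()
```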
Lab 2 — Multilingual NER
Train a token-classification head on a multilingual dataset that includes at least one African language. Report per-language F1.
Dataset: MasakhaNER 2.0 (20 African languages, including Wolof, Yoruba, Swahili, and Hausa).
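A sketch of the per-language reporting loop with `seqeval`; `predict_tags` is a hypothetical stand-in for your fine-tuned tagger, and the dataset ID and config codes follow the `masakhane/masakhaner2` naming on the Hub:

```python
# Report per-language F1 for a token-classification model with seqeval.
from datasets import load_dataset
from seqeval.metrics import f1_score

LANGS = ["wol", "yor", "swa", "hau"]  # Wolof, Yoruba, Swahili, Hausa

def predict_tags(tokens):
    # Placeholder baseline that predicts no entities; replace with the
    # predictions of your fine-tuned token-classification model.
    return ["O"] * len(tokens)

for lang in LANGS:
    test = load_dataset("masakhane/masakhaner2", lang, split="test")
    names = test.features["ner_tags"].feature.names  # index -> BIO tag string
    gold = [[names[i] for i in ex["ner_tags"]] for ex in test]
    pred = [predict_tags(ex["tokens"]) for ex in test]
    print(f"{lang}: F1 = {f1_score(gold, pred):.3f}")
```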
Lab 3 — Fine-tuned summarization
Fine-tune T5-base or BART-base on a news-summarization corpus. Evaluate with ROUGE-L and with a 50-example human-eval rubric you design.
Dataset: XSum or CNN/DailyMail (English baseline).
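For the automatic half of the evaluation, a minimal ROUGE-L computation with the `evaluate` library; the two lists are toy strings standing in for real model output and references:

```python
# Score generated summaries against references with ROUGE-L.
import evaluate

rouge = evaluate.load("rouge")
predictions = ["the court upheld the ruling on appeal"]
references = ["the appeals court upheld the lower court's ruling"]

scores = rouge.compute(predictions=predictions, references=references)
print(f"ROUGE-L: {scores['rougeL']:.3f}")  # F-measure, averaged over examples
```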
Readings
Mandatory
- Before Tuesday. Jurafsky and Martin, *Speech and Language Processing* (3rd ed. draft), chapters 6-9 (vector semantics, neural language models, RNNs, transformers)
- Before Wednesday. Vaswani et al., *Attention Is All You Need* (NeurIPS 2017)
- Before Thursday. HuggingFace NLP Course, chapters 1-3 (fine-tuning a pretrained model)