Week 05 — Natural language processing
Module 5: text preprocessing, embeddings, transformers, BERT-family fine-tuning, and multilingual/low-resource NLP with attention to African languages.
From linguistics-aware classical methods to the transformer-era pipeline.
What you ship this week
Three-deliverable pack: a sentiment classifier (logistic-regression baseline + DistilBERT), an NER tagger on a multilingual dataset that includes an African language, and a fine-tuned summarization model.
| Due | Friday 18:00 Africa/Lagos (UTC+1) |
|---|---|
| Submission | Drop the repo URL into the week's cohort channel. Peer-review pairing announced Monday of next week. |
| Rubric | Pass / revise. Pass requires green CI, tests covering the public API, and a README a stranger can follow to install and run the code. |
Live sessions and labs
Default weekly cadence below. Cohort-specific dates and Zoom links are filled in at intake.
| Day | Time | Block | Recording |
|---|---|---|---|
| Mon | 09:00-12:00 | Live instruction + code-along | (post-session) |
| Mon | 14:00-16:00 | Independent lab work + TA office hours | (post-session) |
| Tue | 09:00-12:00 | Live instruction + code-along | (post-session) |
| Tue | 14:00-16:00 | Independent lab work + TA office hours | (post-session) |
| Wed | 09:00-12:00 | Live instruction + code-along | (post-session) |
| Wed | 14:00-16:00 | Independent lab work + TA office hours | (post-session) |
| Thu | 09:00-12:00 | Live instruction + code-along | (post-session) |
| Thu | 14:00-16:00 | Independent lab work + TA office hours | (post-session) |
| Fri | 10:00-11:00 | Industry speaker | (post-session) |
| Fri | 11:30-12:30 | Lab review | (post-session) |
| Fri | 14:00-15:00 | Cohort retrospective | (post-session) |
Learning outcomes
By the end of the week, every participant will:
- Build a working text classification pipeline (cleaning, tokenization, vectorization, training, evaluation); see the sketch after this list.
- Fine-tune a pretrained transformer on a domain-specific task.
- Apply NLP to a multilingual or low-resource setting (with attention to African languages).
- Understand the limitations: hallucination, bias, evaluation difficulty.
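The first outcome end to end, as a minimal scikit-learn sketch. The file `reviews.csv` and its `text`/`label` columns are placeholders, not a prescribed dataset:

```python
# End-to-end text classification: load, split, vectorize, train, evaluate.
# Sketch only: "reviews.csv" and its "text"/"label" columns are placeholders.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("reviews.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

pipe = Pipeline([
    # Lowercasing and word tokenization happen inside the vectorizer.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=50_000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print(classification_report(y_test, pipe.predict(X_test)))
```

Keeping the baseline inside a `Pipeline` means swapping the vectorizer or the classifier is a one-line change, which is the shape the labs reuse all week.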
Topics covered
Text preprocessing and tokenization (BPE, WordPiece) · word embeddings (Word2Vec, GloVe, FastText) · sequence models (RNNs, LSTMs, GRUs) · the Transformer architecture · BERT-family models and fine-tuning · NER, sentiment analysis, classification, summarization · multilingual and low-resource NLP · evaluation: BLEU, ROUGE, exact match, human evaluation.
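To make the BPE/WordPiece contrast concrete before the labs, a quick comparison with `transformers` tokenizers; both checkpoint names are standard Hub IDs chosen here for illustration:

```python
# Compare WordPiece (BERT) with byte-level BPE (GPT-2) on the same sentence.
from transformers import AutoTokenizer

text = "Tokenization splits unseen words into subwords."

wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece
bpe = AutoTokenizer.from_pretrained("gpt2")                     # byte-level BPE

print(wordpiece.tokenize(text))  # e.g. ['token', '##ization', 'splits', ...]
print(bpe.tokenize(text))        # e.g. ['Token', 'ization', 'Ġsplits', ...]
```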
Labs
Lab 1 — Sentiment from baseline to BERT
Two pipelines on the same customer-review corpus: a TF-IDF + logistic-regression baseline and a fine-tuned DistilBERT. Report accuracy, macro F1, and the per-class confusion matrix.
Dataset: Amazon multilingual reviews (HuggingFace `amazon_reviews_multi`, English and French subsets).
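One plausible shape for the DistilBERT half of the lab, sketched on IMDB as a stand-in corpus (swap in the review data above); the checkpoint and hyperparameters are illustrative, not required:

```python
# Fine-tune DistilBERT for sentiment; report accuracy, macro F1, and the
# confusion matrix. IMDB is a stand-in corpus; swap in the lab's review data.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

ds = load_dataset("imdb").map(
    lambda b: tokenizer(b["text"], truncation=True, padding="max_length", max_length=256),
    batched=True,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    print(confusion_matrix(labels, preds))  # per-class confusion matrix
    return {"accuracy": accuracy_score(labels, preds),
            "macro_f1": f1_score(labels, preds, average="macro")}

args = TrainingArguments(
    output_dir="distilbert-sentiment",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    eval_strategy="epoch",  # named `evaluation_strategy` in older transformers releases
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds["train"].shuffle(seed=42).select(range(2000)),  # subsample for speed
    eval_dataset=ds["test"].shuffle(seed=42).select(range(1000)),
    compute_metrics=compute_metrics,
)
trainer.train()
```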
Lab 2 — Multilingual NER
Train a token-classification head on a multilingual dataset that includes at least one African language. Report per-language F1.
Dataset: MasakhaNER 2.0 (20 African languages, including Wolof, Yoruba, Swahili, and Hausa).
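A sketch of the per-language reporting loop with `seqeval`; `predict_tags` is a hypothetical stand-in for your fine-tuned tagger, and the dataset ID and config codes follow the `masakhane/masakhaner2` naming on the Hub:

```python
# Report per-language F1 for a token-classification model with seqeval.
from datasets import load_dataset
from seqeval.metrics import f1_score

LANGS = ["wol", "yor", "swa", "hau"]  # Wolof, Yoruba, Swahili, Hausa

def predict_tags(tokens):
    # Placeholder baseline that predicts no entities; replace with the
    # predictions of your fine-tuned token-classification model.
    return ["O"] * len(tokens)

for lang in LANGS:
    test = load_dataset("masakhane/masakhaner2", lang, split="test")
    names = test.features["ner_tags"].feature.names  # index -> BIO tag string
    gold = [[names[i] for i in ex["ner_tags"]] for ex in test]
    pred = [predict_tags(ex["tokens"]) for ex in test]
    print(f"{lang}: F1 = {f1_score(gold, pred):.3f}")
```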
Lab 3 — Fine-tuned summarization
Fine-tune T5-base or BART-base on a news-summarization corpus. Evaluate with ROUGE-L and with a 50-example human-eval rubric you design.
Dataset: XSum or CNN/DailyMail (English baseline).
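For the automatic half of the evaluation, a minimal ROUGE-L computation with the `evaluate` library; the two lists are toy strings standing in for real model output and references:

```python
# Score generated summaries against references with ROUGE-L.
import evaluate

rouge = evaluate.load("rouge")
predictions = ["the court upheld the ruling on appeal"]
references = ["the appeals court upheld the lower court's ruling"]

scores = rouge.compute(predictions=predictions, references=references)
print(f"ROUGE-L: {scores['rougeL']:.3f}")  # F-measure, averaged over examples
```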
Readings
Mandatory
- Before Tuesday. Jurafsky and Martin, *Speech and Language Processing* (3rd ed. draft), chapters 6-9 (vector semantics, neural language models, RNNs, transformers)
- Before Wednesday. Vaswani et al., *Attention Is All You Need* (NeurIPS 2017)
- Before Thursday. HuggingFace NLP Course, chapters 1-3 (fine-tuning a pretrained model)