Lab 4 — RAG over a domain corpus¶

Goal. Index a public-domain corpus in Chroma. Build retrieval + reranking + grounded generation. Evaluate on 30 questions.

What you ship. Notebook with the indexed corpus, retrieval pipeline, and an eval table for faithfulness, context relevance, answer correctness on 30 questions.

Setup¶

Install the dependencies (one-time).

In [ ]:
# !pip install langchain chromadb sentence-transformers transformers torch pypdf
In [ ]:
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
import pathlib
import urllib.request

Build a small corpus from open-access PDFs¶

Replace these with your domain corpus. The defaults below are 3 WHO/AFRO PDFs.

In [ ]:
URLS = [
  # Replace with your real WHO/AFRO URLs — placeholders below.
  'https://example.org/who-afro-1.pdf',
  'https://example.org/who-afro-2.pdf',
  'https://example.org/who-afro-3.pdf',
]
pathlib.Path('corpus').mkdir(exist_ok=True)
for u in URLS:
    fn = pathlib.Path('corpus') / u.rsplit('/', 1)[-1]
    if not fn.exists():
        print('skip (placeholder URL):', u)
print('corpus dir:', list(pathlib.Path('corpus').glob('*.pdf')))

Exercise 1 — Load, chunk, and embed¶

In [ ]:
# YOUR TURN
# Load each PDF with PyPDFLoader. Split with RecursiveCharacterTextSplitter
# (chunk_size=600, chunk_overlap=80). Embed with sentence-transformers (all-MiniLM-L6-v2).
# Persist to Chroma.

Exercise 2 — Retrieval + grounded generation¶

In [ ]:
# YOUR TURN
# Build a chain: query -> top-k=5 retrieval -> prompt with context -> LLM.
# Use any open LLM (Mistral, Llama, Gemma) or an API.

Exercise 3 — Evaluate on 30 held-out questions¶

In [ ]:
# YOUR TURN
# Write 30 questions where you know the answer is in the corpus.
# Evaluate by LLM-as-judge on three axes: faithfulness, context relevance, answer correctness.

Done?¶

Submit per the cohort schedule. Peer review pairing announced the following Monday.