Lab 4 — Visualizing single-cell genomic data

Goal. Apply PCA, t-SNE, and UMAP to a public single-cell RNA-seq dataset, compare which structure each method preserves, and discuss what nonlinear methods cost you in downstream interpretation.

What you ship. A notebook with three 2-D embeddings of the same data, shown side by side, plus a 200-word memo on when each method is the right tool.

Setup

Install the dependencies once by uncommenting and running the cell below.

In [ ]:
# Uncomment on first run. The `louvain` package is needed by sc.tl.louvain in Exercise 4.
# !pip install scanpy umap-learn scikit-learn matplotlib louvain
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import scanpy as sc
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap

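# Seed NumPy's global RNG. Also pass random_state=42 to TSNE and UMAP so the embeddings are repeatable.
np.random.seed(42)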

10x Genomics PBMC 3K (a canonical single-cell dataset)

In [ ]:
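# Load the raw PBMC 3K count matrix (downloaded and cached on first call).
adata = sc.datasets.pbmc3k()
# Scale each cell to 10,000 total counts, then log-transform.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# Keep the 2,000 most variable genes; .copy() avoids working on a view.
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var.highly_variable].copy()
# Densify the sparse matrix for scikit-learn and umap-learn.
X = adata.X.toarray()
print('X shape:', X.shape)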
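
To make the two preprocessing steps above concrete, here is a small NumPy sketch of what library-size normalization followed by log1p does to a count matrix. The toy `counts` array and its values are hypothetical and only illustrate the arithmetic; this is not how scanpy implements the steps internally.

```python
import numpy as np

# Hypothetical toy count matrix: 3 cells (rows) x 4 genes (columns).
counts = np.array([
    [10, 0, 5, 5],      # shallowly sequenced cell
    [100, 0, 50, 50],   # same expression profile, 10x deeper sequencing
    [1, 1, 1, 1],
], dtype=float)

target_sum = 1e4

# Library-size normalization: rescale each cell so its counts sum to target_sum.
per_cell_total = counts.sum(axis=1, keepdims=True)
normalized = counts / per_cell_total * target_sum

# log1p compresses the dynamic range and keeps zeros at zero.
logged = np.log1p(normalized)

print(normalized.sum(axis=1))              # every cell now sums to 10,000
print(np.allclose(logged[0], logged[1]))   # True: depth differences are removed
```

Cells 0 and 1 differ only in sequencing depth, so they become identical after normalization. That is the property the embeddings rely on: distances should reflect expression profile rather than how deeply each cell was sequenced.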

Exercise 1 — PCA

In [ ]:
# YOUR TURN
# Compute first 50 PCs. Plot PC1 vs PC2.
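
If you are unsure where to start, the following is one possible sketch, not the reference solution. It runs on a synthetic stand-in matrix (`X_demo`, a hypothetical name) so that it is self-contained; in the lab you would pass the real `X` instead.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Synthetic stand-in for the lab's X (cells x genes); swap in the real matrix.
rng = np.random.default_rng(42)
X_demo = np.log1p(rng.poisson(lam=1.0, size=(300, 200)).astype(float))

# Fit PCA and project every cell onto the first 50 components.
pca = PCA(n_components=50, random_state=42)
X_pca = pca.fit_transform(X_demo)   # shape: (n_cells, 50)

# Plot the first two components, labelling each axis with its variance share.
var_ratio = pca.explained_variance_ratio_
fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(X_pca[:, 0], X_pca[:, 1], s=5)
ax.set_xlabel(f"PC1 ({var_ratio[0]:.1%} of variance)")
ax.set_ylabel(f"PC2 ({var_ratio[1]:.1%} of variance)")
ax.set_title("PCA")
```

Putting the explained-variance share on the axes is a design choice worth keeping: it reminds the reader how much of the data the 2-D view leaves out, which neither t-SNE nor UMAP can report.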

Exercise 2 — t-SNE on PCA features

In [ ]:
# YOUR TURN
# Run t-SNE (perplexity=30) on the first 50 PCs.
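
A minimal sketch of the call, assuming scikit-learn's `TSNE`. The input `X_pca_demo` is a hypothetical random stand-in for your 50 PCs so the snippet runs on its own; `init="pca"` and `random_state=42` are choices made here for repeatability, not requirements of the exercise.

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-in for the 50-PC matrix from Exercise 1; swap in your own.
rng = np.random.default_rng(42)
X_pca_demo = rng.normal(size=(300, 50))

# perplexity must be smaller than the number of cells.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
X_tsne = tsne.fit_transform(X_pca_demo)   # shape: (n_cells, 2)

print(X_tsne.shape)
```

Note that `TSNE` has no `transform` method for new points: the embedding exists only for the cells it was fitted on, which is part of the interpretability cost the memo asks you to discuss.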

Exercise 3 — UMAP on PCA features

In [ ]:
# YOUR TURN
# Run UMAP (n_neighbors=15, min_dist=0.1) on the first 50 PCs.

Exercise 4 — Compare visually

In [ ]:
# YOUR TURN
# Plot all three side by side. Color by Louvain cluster
# (run sc.pp.neighbors first, then sc.tl.louvain; labels land in adata.obs['louvain']).
# Write 200 words on which is best for which downstream task.
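
The plotting part can be sketched as a small helper. `plot_embeddings` is a hypothetical function written for this illustration, and the embeddings and labels below are random placeholders so the snippet is self-contained. In the lab you would pass your three real embeddings and the integer codes of the Louvain labels, for example `adata.obs['louvain'].cat.codes`.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_embeddings(embeddings, labels):
    """Scatter each 2-D embedding in its own panel, colored by cluster label."""
    n = len(embeddings)
    fig, axes = plt.subplots(1, n, figsize=(5 * n, 4.5), squeeze=False)
    for ax, (name, coords) in zip(axes[0], embeddings.items()):
        ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=5)
        ax.set_title(name)
        # Axis units are not comparable across methods, so hide the ticks.
        ax.set_xticks([])
        ax.set_yticks([])
    fig.tight_layout()
    return fig, axes[0]

# Placeholder data: replace with your real embeddings and Louvain labels.
rng = np.random.default_rng(42)
n_cells = 300
labels_demo = rng.integers(0, 5, size=n_cells)
embeddings_demo = {
    name: rng.normal(size=(n_cells, 2)) for name in ("PCA", "t-SNE", "UMAP")
}

fig, axes = plot_embeddings(embeddings_demo, labels_demo)
```

Using the same labels and the same color map in every panel is what makes the comparison fair: any difference you see comes from the embedding, not from the clustering.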

Done?

Submit per the cohort schedule. Peer-review pairings will be announced the following Monday.