Lab 4 — Visualizing single-cell genomic data
Goal. Apply PCA, t-SNE, and UMAP to a public single-cell RNA-seq dataset. Compare what each method preserves (global variance structure versus local neighborhoods). Discuss what the nonlinear methods cost you when interpreting the embedding downstream.
What you ship. A notebook with three 2-D embeddings of the same data, plotted side by side, plus a 200-word memo on when each method is the right tool.
Setup
Install the dependencies (one-time).
In [ ]:
# igraph and louvain are required by sc.tl.louvain (Exercise 4).
# !pip install scanpy umap-learn scikit-learn matplotlib igraph louvain
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import scanpy as sc
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap
np.random.seed(42)
10x Genomics PBMC 3K (a canonical single-cell dataset)
In [ ]:
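# Download (and cache) the raw count matrix for the PBMC 3K dataset.
adata = sc.datasets.pbmc3k()
# Scale each cell to 10,000 total counts, then apply log(1 + x).
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# Keep the 2,000 most variable genes; .copy() turns the view into a standalone object.
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var.highly_variable].copy()
# Densify the sparse matrix so scikit-learn and umap-learn can use it directly.
X = adata.X.toarray()
print('X shape:', X.shape)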
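To make the two normalization calls less of a black box, here is a minimal NumPy sketch of the arithmetic they perform with default settings. The toy count matrix is made up for illustration, and scanpy itself works on sparse AnnData objects rather than dense arrays, so treat this as the idea only.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 4 genes (columns). Values are invented.
counts = np.array([
    [10, 0, 5, 5],
    [100, 50, 25, 25],
    [1, 1, 1, 1],
], dtype=float)

target_sum = 1e4

# Step 1: rescale each cell so that its counts sum to target_sum.
cell_totals = counts.sum(axis=1, keepdims=True)
normalized = counts / cell_totals * target_sum

# Step 2: log(1 + x) compresses the dynamic range and leaves zeros at zero.
logged = np.log1p(normalized)

print(normalized.sum(axis=1))
print(logged.round(2))
```

After step 1, cells with very different sequencing depth (20 versus 200 total counts here) become directly comparable.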
Exercise 1 — PCA
In [ ]:
# YOUR TURN
# Compute first 50 PCs. Plot PC1 vs PC2.
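If the scikit-learn API is unfamiliar, the sketch below shows one possible shape for this step on synthetic stand-in data. The names `X_demo` and `X_pca_demo` are hypothetical; in the lab you would pass the real matrix `X` instead.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Synthetic stand-in for the expression matrix: 300 "cells" x 100 "genes".
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 100))

# Fit PCA and project every cell onto the first 50 components.
pca = PCA(n_components=50, random_state=42)
X_pca_demo = pca.fit_transform(X_demo)

# Scatter the first two components, labelling axes with variance explained.
var = pca.explained_variance_ratio_
fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(X_pca_demo[:, 0], X_pca_demo[:, 1], s=5)
ax.set_xlabel(f"PC1 ({var[0]:.1%} of variance)")
ax.set_ylabel(f"PC2 ({var[1]:.1%} of variance)")

print(X_pca_demo.shape)
```

Reporting the explained variance on the axes is a design choice, not a requirement: it makes clear how much of the data a 2-D PCA view actually shows.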
Exercise 2 — t-SNE on PCA features
In [ ]:
# YOUR TURN
# Run t-SNE (perplexity=30) on the first 50 PCs.
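As a rough guide to the call pattern, the following sketch runs scikit-learn's `TSNE` on synthetic 50-dimensional blobs that stand in for PCA scores. The blob data and the `_demo` names are assumptions for illustration, and `init="pca"` is one reasonable choice rather than something the lab prescribes.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)

# Three well-separated synthetic blobs in 50-D, standing in for PCA scores.
centers = rng.normal(scale=10.0, size=(3, 50))
X_pca_demo = np.vstack([c + rng.normal(size=(100, 50)) for c in centers])

# perplexity must be smaller than the number of samples (300 here).
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
X_tsne_demo = tsne.fit_transform(X_pca_demo)

print(X_tsne_demo.shape)
```

Passing `random_state` makes the layout reproducible across runs, which helps when comparing embeddings later.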
Exercise 3 — UMAP on PCA features
In [ ]:
# YOUR TURN
# Run UMAP (n_neighbors=15, min_dist=0.1) on the first 50 PCs.
Exercise 4 — Compare visually
In [ ]:
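# YOUR TURN
# Plot all three embeddings side by side, colored by Louvain cluster.
# sc.tl.louvain needs a neighbor graph, so run sc.pp.neighbors(adata) first;
# the labels are written to adata.obs['louvain'].
# Write 200 words on which method is best for which downstream task.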
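For the side-by-side figure, one possible layout is a single row of matplotlib axes sharing the same cluster colors. The helper `plot_side_by_side` and the random embeddings below are hypothetical placeholders, not part of the lab's starter code.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_side_by_side(embeddings, labels):
    """Scatter each 2-D embedding in its own panel, colored by integer labels.

    embeddings: dict mapping a panel title to an (n_cells, 2) array.
    labels: (n_cells,) array of integer cluster ids.
    """
    n = len(embeddings)
    fig, axes = plt.subplots(1, n, figsize=(4 * n, 4), squeeze=False)
    for ax, (title, emb) in zip(axes[0], embeddings.items()):
        ax.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=5)
        ax.set_title(title)
        ax.set_xticks([])
        ax.set_yticks([])
    fig.tight_layout()
    return fig, axes[0]


# Placeholder data: random 2-D points shifted by cluster id.
rng = np.random.default_rng(42)
labels_demo = rng.integers(0, 3, size=300)
embeddings_demo = {
    name: rng.normal(size=(300, 2)) + 3.0 * labels_demo[:, None]
    for name in ("PCA", "t-SNE", "UMAP")
}

fig, axes = plot_side_by_side(embeddings_demo, labels_demo)
```

With the real data, the labels could come from `adata.obs['louvain'].cat.codes`, which converts the categorical cluster names into integers that matplotlib can map to colors.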
Done?
Submit per the cohort schedule. Peer-review pairings will be announced the following Monday.