Persistence Landscapes as ML Features: A Complete Pipeline
Beyond Persistence Diagrams
In the TDA Python tutorial, we computed persistence diagrams and used basic summary statistics as ML features. But persistence diagrams live in an awkward mathematical space — they are multisets of points, not vectors — which makes them difficult to use directly with standard ML algorithms.
Persistence landscapes, introduced by Peter Bubenik, solve this problem elegantly. They transform persistence diagrams into functions in a Banach space, giving us a vector representation that is:
- Stable: small changes in data produce small changes in the landscape
- Statistical: we can compute means, variances, and perform hypothesis tests
- Compatible: the output plugs directly into any ML pipeline expecting feature vectors
The Mathematics
Given a persistence diagram \(D = \{(b_i, d_i)\}\), we first define tent functions for each point:
\[\Lambda_i(t) = \max\left(0, \min(t - b_i, d_i - t)\right)\]Each tent function is a triangle centered at \(\frac{b_i + d_i}{2}\) with height \(\frac{d_i - b_i}{2}\).
The \(k\)-th persistence landscape is the \(k\)-th largest value of these tent functions at each point:
\[\lambda_k(t) = k\text{-th largest value of } \{\Lambda_i(t)\}_i\]The first landscape \(\lambda_1\) captures the most persistent features, \(\lambda_2\) the second most persistent, and so on.
Implementation
Let’s build a complete pipeline. First, install the dependencies:
pip install ripser persim scikit-learn matplotlib
Step 1: Compute Persistence Diagrams
import numpy as np
from ripser import ripser
def compute_persistence(X, maxdim=1):
"""Compute persistence diagrams up to dimension maxdim."""
result = ripser(X, maxdim=maxdim)
return result['dgms']
Step 2: Compute Persistence Landscapes
def persistence_landscape(dgm, num_landscapes=5, resolution=100):
"""
Compute persistence landscapes from a persistence diagram.
Parameters
----------
dgm : np.ndarray
Persistence diagram (n x 2 array of birth-death pairs).
num_landscapes : int
Number of landscape functions to compute.
resolution : int
Number of sample points for discretization.
Returns
-------
landscapes : np.ndarray
Array of shape (num_landscapes, resolution).
"""
# Remove infinite death times
finite = dgm[np.isfinite(dgm[:, 1])]
if len(finite) == 0:
return np.zeros((num_landscapes, resolution))
# Define the grid
birth_min = finite[:, 0].min()
death_max = finite[:, 1].max()
grid = np.linspace(birth_min, death_max, resolution)
# Compute tent functions for each point
tents = np.zeros((len(finite), resolution))
for i, (b, d) in enumerate(finite):
tents[i] = np.maximum(0, np.minimum(grid - b, d - grid))
# Sort at each grid point to get landscapes
sorted_tents = np.sort(tents, axis=0)[::-1]
# Take top-k landscapes
k = min(num_landscapes, len(finite))
landscapes = np.zeros((num_landscapes, resolution))
landscapes[:k] = sorted_tents[:k]
return landscapes
Step 3: Build the Feature Vector
def landscape_features(X, maxdim=1, num_landscapes=5, resolution=100):
"""
Full pipeline: point cloud -> persistence -> landscapes -> feature vector.
"""
dgms = compute_persistence(X, maxdim=maxdim)
features = []
for dim in range(maxdim + 1):
ls = persistence_landscape(dgms[dim], num_landscapes, resolution)
# Flatten into a single feature vector
features.append(ls.flatten())
return np.concatenate(features)
Step 4: Classification Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
def tda_classification(datasets, labels):
"""
Classify point clouds using persistence landscape features.
Parameters
----------
datasets : list of np.ndarray
List of point clouds.
labels : np.ndarray
Class labels.
"""
# Compute features for all datasets
X = np.array([landscape_features(data) for data in datasets])
# Build pipeline
clf = Pipeline([
('scaler', StandardScaler()),
('rf', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Cross-validate
scores = cross_val_score(clf, X, labels, cv=5, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
return clf.fit(X, labels)
Example: Classifying Shapes
Let’s test this on a synthetic dataset — distinguishing circles from figure-eights:
from sklearn.datasets import make_circles
def make_figure_eight(n_points=100, noise=0.05):
"""Generate points on a figure-eight curve."""
t = np.linspace(0, 2 * np.pi, n_points)
x = np.sin(t) + np.random.normal(0, noise, n_points)
y = np.sin(2 * t) / 2 + np.random.normal(0, noise, n_points)
return np.column_stack([x, y])
# Generate datasets
n_samples = 50
datasets = []
labels = []
for _ in range(n_samples):
X_circle, _ = make_circles(n_samples=100, noise=0.05, factor=0.5)
datasets.append(X_circle)
labels.append(0)
X_eight = make_figure_eight(100, noise=0.05)
datasets.append(X_eight)
labels.append(1)
labels = np.array(labels)
# Classify
model = tda_classification(datasets, labels)
A circle has one \(H_1\) feature (one loop), while a figure-eight has two. Persistence landscapes capture this distinction naturally, typically achieving 95%+ accuracy on this task.
Tips for Real-World Use
-
Normalize your data before computing persistence. TDA is scale-sensitive — the persistence values depend on the metric.
-
Choose resolution and num_landscapes carefully. Start with
resolution=100andnum_landscapes=5. Increase if your diagrams have many points. -
Combine with standard features. Landscape features are complementary — concatenating them with traditional features (PCA components, statistical summaries) often improves performance.
-
Consider persistence images as an alternative vectorization. They use a Gaussian kernel density estimate instead of tent functions and can work better for certain datasets.
-
For large datasets, use approximate persistence (e.g., subsampling or landmark-based methods) to keep computation tractable.
Further Reading
- Bubenik, P. (2015). Statistical Topological Data Analysis using Persistence Landscapes. JMLR.
- For a comprehensive treatment of these methods, see The Shape of Data.
- The scikit-tda ecosystem provides production-ready implementations.
Enjoy reading this article?
Here are some more articles you might like to read next: