Persistence landscapes as ML features: a complete pipeline |

Beyond Persistence Diagrams

In the TDA Python tutorial, we computed persistence diagrams and used basic summary statistics as ML features. But persistence diagrams live in an awkward mathematical space — they are multisets of points, not vectors — which makes them difficult to use directly with standard ML algorithms.

Persistence landscapes, introduced by Peter Bubenik in 2015, solve this problem by transforming persistence diagrams into functions in a Banach space. The resulting vector representation is stable (small changes in data produce small changes in the landscape), supports the usual statistics (means, variances, hypothesis tests), and plugs directly into any ML pipeline that takes feature vectors.

The Mathematics

Given a persistence diagram \(D = \{(b_i, d_i)\}\), we first define tent functions for each point:

\[\Lambda_i(t) = \max\left(0, \min(t - b_i, d_i - t)\right)\]

Each tent function is a triangle centered at \(\frac{b_i + d_i}{2}\) with height \(\frac{d_i - b_i}{2}\).

The \(k\)-th persistence landscape is the \(k\)-th largest value of these tent functions at each point:

\[\lambda_k(t) = k\text{-th largest value of } \{\Lambda_i(t)\}_i\]

The first landscape \(\lambda_1\) captures the most persistent features, \(\lambda_2\) the second most persistent, and so on.

Implementation

The pipeline in full. First, install the dependencies:

pip install ripser persim scikit-learn matplotlib

Step 1: Compute Persistence Diagrams

import numpy as np
from ripser import ripser

def compute_persistence(X, maxdim=1):
    """Compute persistence diagrams up to dimension maxdim."""
    result = ripser(X, maxdim=maxdim)
    return result['dgms']

Step 2: Compute Persistence Landscapes

def persistence_landscape(dgm, num_landscapes=5, resolution=100):
    """
    Compute persistence landscapes from a persistence diagram.

    Parameters
    ----------
    dgm : np.ndarray
        Persistence diagram (n x 2 array of birth-death pairs).
    num_landscapes : int
        Number of landscape functions to compute.
    resolution : int
        Number of sample points for discretization.

    Returns
    -------
    landscapes : np.ndarray
        Array of shape (num_landscapes, resolution).
    """
    # Remove infinite death times
    finite = dgm[np.isfinite(dgm[:, 1])]
    if len(finite) == 0:
        return np.zeros((num_landscapes, resolution))

    # Define the grid
    birth_min = finite[:, 0].min()
    death_max = finite[:, 1].max()
    grid = np.linspace(birth_min, death_max, resolution)

    # Compute tent functions for each point
    tents = np.zeros((len(finite), resolution))
    for i, (b, d) in enumerate(finite):
        tents[i] = np.maximum(0, np.minimum(grid - b, d - grid))

    # Sort at each grid point to get landscapes
    sorted_tents = np.sort(tents, axis=0)[::-1]

    # Take top-k landscapes
    k = min(num_landscapes, len(finite))
    landscapes = np.zeros((num_landscapes, resolution))
    landscapes[:k] = sorted_tents[:k]

    return landscapes

Step 3: Build the Feature Vector

def landscape_features(X, maxdim=1, num_landscapes=5, resolution=100):
    """
    Full pipeline: point cloud -> persistence -> landscapes -> feature vector.
    """
    dgms = compute_persistence(X, maxdim=maxdim)

    features = []
    for dim in range(maxdim + 1):
        ls = persistence_landscape(dgms[dim], num_landscapes, resolution)
        # Flatten into a single feature vector
        features.append(ls.flatten())

    return np.concatenate(features)

Step 4: Classification Pipeline

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def tda_classification(datasets, labels):
    """
    Classify point clouds using persistence landscape features.

    Parameters
    ----------
    datasets : list of np.ndarray
        List of point clouds.
    labels : np.ndarray
        Class labels.
    """
    # Compute features for all datasets
    X = np.array([landscape_features(data) for data in datasets])

    # Build pipeline
    clf = Pipeline([
        ('scaler', StandardScaler()),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42))
    ])

    # Cross-validate
    scores = cross_val_score(clf, X, labels, cv=5, scoring='accuracy')
    print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

    return clf.fit(X, labels)

Example: Classifying Shapes

A synthetic stress test for the pipeline: distinguishing circles from figure-eights.

from sklearn.datasets import make_circles

def make_figure_eight(n_points=100, noise=0.05):
    """Generate points on a figure-eight curve."""
    t = np.linspace(0, 2 * np.pi, n_points)
    x = np.sin(t) + np.random.normal(0, noise, n_points)
    y = np.sin(2 * t) / 2 + np.random.normal(0, noise, n_points)
    return np.column_stack([x, y])

# Generate datasets
n_samples = 50
datasets = []
labels = []

for _ in range(n_samples):
    X_circle, _ = make_circles(n_samples=100, noise=0.05, factor=0.5)
    datasets.append(X_circle)
    labels.append(0)

    X_eight = make_figure_eight(100, noise=0.05)
    datasets.append(X_eight)
    labels.append(1)

labels = np.array(labels)

# Classify
model = tda_classification(datasets, labels)

A circle has one \(H_1\) feature (one loop), while a figure-eight has two. Persistence landscapes capture this distinction naturally, typically achieving 95%+ accuracy on this task.

Tips for Real-World Use

Normalize your data before computing persistence. TDA is scale-sensitive — the persistence values depend on the metric.
Choose resolution and num_landscapes carefully. Start with resolution=100 and num_landscapes=5. Increase if your diagrams have many points.
Combine with standard features. Landscape features are complementary — concatenating them with traditional features (PCA components, statistical summaries) often improves performance.
Consider persistence images as an alternative vectorization. They use a Gaussian kernel density estimate instead of tent functions and can work better for certain datasets.
For large datasets, use approximate persistence (e.g., subsampling or landmark-based methods) to keep computation tractable.

Persistence landscapes as ML features: a complete pipeline