TOPOLOGICAL DATA ANALYSIS DRIVEN FEATURE GENERATION IN MACHINE LEARNING MODELS

By Danielle Barnes

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computational Mathematics, Science, and Engineeringβ€”Doctor of Philosophy

2024

ABSTRACT

Topological data analysis (TDA) is an emerging field in data science, with origins in algebraic topology. I focus on two main disciplines of topological data analysis, mapper and persistent homology. Mapper is an algorithm to construct a graph that is similar to a Reeb graph [1], allowing for the abstraction of shape from data. Persistent homology is a way to measure features in a dataset, and returns a set of points (a persistence diagram) that represents the structure of the dataset. While both mapper and persistent homology are effective tools in their own right, a significant area of research includes using features created from these topological tools in machine learning algorithms. In this dissertation, I focus on advancing both theoretical and computational methods that allow the use of topological data analysis in machine learning algorithms. I have developed an extension to the mapper algorithm, named predictive mapper, that uses the eigenvectors of the graph Laplacian of the geometric realization of a mapper graph, allowing features to be created from a linear combination of the eigenvector values for use in machine learning. I have also contributed to teaspoon [2], an open source package for topological signal processing, by including new datasets and expanded functionality for featurization methods. Lastly, I have started the development of ceREEBerus, a Python package for working with Reeb graphs, while implementing a standardized software development framework for the Munch Lab.

Copyright by DANIELLE BARNES 2024

ACKNOWLEDGEMENTS

Many people provided the support, guidance, and humor required to complete a dissertation. Many thanks to my committee members, Dr. Brian O’Shea and Dr. H. Metin Aktulga, and special thanks to my chairs Dr. Liz Munch and Dr. Jose Perea. Special thanks to my running group, which has taken many iterations over the years, for keeping me sane by very insanely hopping in a van (sometimes with strangers) and running 200 miles across the country. And the most thanks to my husband, Adi, for always laughing and taking care of our dogs over the years when I was working, either at work or on my degree. Chapter 4 was originally co-authored by Dr. Luis Polanco and Dr. Jose Perea. Chapter 4 and 5 work was supported in part by Michigan State University through computational resources provided by the Institute for Cyber-Enabled Research, with assistance from Dr. Elizabeth Munch in providing background information for computing persistence diagrams using a directional transform. Chapter 5 work was also supported by and collaborative with many others in the Munch Lab and at Michigan State, including Dr. Luis Polanco, Dr. Jose Perea, Elena Wang, Pat Bills, Doug Krum, Morgan Patterson, Max Chumley, Sunia Tanweer, and Dr. Firas Khasawneh. Thank you to those outside of my committee who also read my dissertation and provided critical feedback, Dr. Tara Kilbride and Dr. Dakota Crisp. Additionally, the completion of this dissertation would not have been possible without the support of my supervisor, Jonathan Prantner. And finally, special thanks to the unsung heroes of this dissertation - Keurig, Nespresso, Starbucks, and my favorite coffee maker, Spinn.
TABLE OF CONTENTS

LIST OF SYMBOLS
CHAPTER 1 INTRODUCTION
CHAPTER 2 BACKGROUND
  2.1 Homology
  2.2 Persistent Homology
  2.3 Mapper
  2.4 Reeb Graphs
CHAPTER 3 PREDICTIVE MAPPER
  3.1 Spectral theory and motivation
  3.2 Optimized mapper graph
  3.3 Mapper embedding
  3.4 Mapping from 𝑋 to | ¯𝑀 |
  3.5 Predictive Mapper Pipeline
  3.6 Methods and Results
  3.7 Computation
  3.8 Discussion and Limitations
  3.9 Conclusion
CHAPTER 4 FEATURE GENERATION FOR MACHINE LEARNING
  4.1 Introduction
  4.2 Background
  4.3 Featurization Methods
  4.4 Datasets
  4.5 User Guide
  4.6 Results
  4.7 Computation Time of Features
  4.8 Discussion
  4.9 Conclusion
CHAPTER 5 OPEN SOURCE SOFTWARE CONTRIBUTIONS
  5.1 Open Source Software Development Framework
  5.2 Teaspoon
  5.3 CeREEBerus
CHAPTER 6 CONCLUSION
BIBLIOGRAPHY
APPENDIX
LIST OF SYMBOLS

𝑋 : Dataset, 𝑋 ∈ R^{𝑝×π‘š}
𝑋(𝑣𝑖) : Subset of data 𝑋 in each node 𝑣𝑖 ∈ 𝑉
𝐻(𝑀, π‘Œ) : Measure of how optimal a mapper graph 𝑀 is with respect to labels π‘Œ
|𝑋(𝑣𝑖)| : Number of points in 𝑋(𝑣𝑖)
𝑀 : A mapper graph
𝑉(𝑀) : Set of nodes in mapper graph 𝑀
¯𝑋 : Training set, ¯𝑋 βŠ‚ 𝑋
ˆ𝑋 : Testing set, ˆ𝑋 βŠ‚ 𝑋, ¯𝑋 ∩ ˆ𝑋 = βˆ…
¯𝑋(𝑣𝑖) : Subset of training data ¯𝑋 in each node 𝑣𝑖 ∈ 𝑉
¯𝑀 : Optimized mapper graph
𝑉( ¯𝑀) : Set of nodes in optimized mapper graph ¯𝑀
𝐴( ¯𝑀) : Adjacency matrix of ¯𝑀
¯𝑣𝑖 ∈ 𝑉( ¯𝑀) : Node 𝑖 of optimized mapper graph ¯𝑀
¯𝑋( ¯𝑣𝑖) : Test dataset points in node ¯𝑣𝑖
| ¯𝑋( ¯𝑣𝑖) ∩ ¯𝑋( ¯𝑣𝑗)| : Number of data points in ¯𝑋( ¯𝑣𝑖) ∩ ¯𝑋( ¯𝑣𝑗)
y : Eigenvectors of 𝐿y = πœ†π·y
U = {π‘ˆπ›Ό}π›Όβˆˆπ΄ : Finite open covering of 𝑋
𝑁(U) : Nerve of U
𝑓 : 𝑋 β†’ R : Continuous map from 𝑋 to R
𝑓^{βˆ’1}(π‘ˆπ›Ό) : Pullback cover of π‘ˆπ›Ό
Β―U : Refined pullback cover of π‘ˆπ›Ό
𝑁( Β―U) : Nerve of Β―U
|𝑁( Β―U)| : Geometric realization of the nerve of Β―U
𝜌(π‘₯) : Map from 𝑋 β†’ |𝑁(U)|. Also refers to the map from 𝑋 β†’ | ¯𝑀 |
Ξ¦ : Family of Gaussian distributions
𝛽 : Normalization constant
Β―Ξ¦ : Family of Gaussian distributions, normalized and a partition of unity
{𝑣0, 𝑣1, ..., π‘£π‘˜} = 𝑉 : Set of vertices of 𝑁( Β―U)
( Β―πœ‘0(π‘₯), Β―πœ‘1(π‘₯), ..., Β―πœ‘π‘˜(π‘₯)) : Barycentric coordinates defined by Β―Ξ¦
| ¯𝑀 | : Geometric realization of the nerve of ¯𝑀

CHAPTER 1 INTRODUCTION

This dissertation contributes to outstanding areas of Topological Data Analysis by extending theory for mapper [1], providing additional examples and datasets for use in persistent homology, and contributing to open source software to enable collaborative research. For mapper, the algorithmic result is an abstract simplicial complex with no direct translation into a real-valued vector space, which is a requirement for use in machine learning algorithms. I propose an extension to the current mapper algorithm that allows features to be computed from it, extending from an abstract graph into a real-valued vector space. For persistent homology, translation needs to similarly be done from the space of persistence diagrams into a real-valued vector space, and I focus on computational aspects of existing methods. Some methods can be costly from a computational perspective, limiting the ability to easily analyze large or complex datasets. To address this, I have extended the teaspoon [2] python software package to include vectorized and parallel options for specific machine learning features. This work was published as part of a review paper [3] that provided additional code for researchers to use for feature creation from persistence diagrams and example datasets for analysis.

Researchers commonly access these methods through open source software, so developing and maintaining easy-to-use and purpose-built software is an important challenge to support more widespread adoption of useful topological methods. I address these issues through contributions to open source software, and the development of additional packages and machine learning pipelines focused on using topological data analysis. This work includes moving to a standardized software development framework, with documentation and tutorials for collaborators to contribute.

Through this dissertation, I have made a meaningful contribution to topological methods, computational efficiency, and the ability for researchers to implement topological methods across different disciplines.
All code is available as open source, and many principles of standardized software development are applied, allowing for ease of maintenance and appropriate documentation, vital to researchers continuing to make contributions and improvements in topological methods.

CHAPTER 2 BACKGROUND

Intuitively, topology is the study of shape, and of specific interest is identifying various properties such as holes and connected components. These features allow differentiation between item types. For example, in the widely used MNIST [4] dataset, the number "8" has different topological features than the number "0". Specifically, the number "8" has one connected component and two holes, while the number "0" has one connected component and one hole. These correspond to homology groups, and specifically the persistence of these homology groups through filtrations gives a way to identify relevant topological features.

2.1 Homology

Homology concerns itself with useful properties of a simplicial complex and can provide information about the underlying space. We follow [5] in this section. First we define simplices. Let 𝑒0, 𝑒1, ..., π‘’π‘˜ be points in R^𝑑. A point π‘₯ = βˆ‘_{𝑖=0}^{π‘˜} πœ†π‘–π‘’π‘–, with each πœ†π‘– ∈ R, is an affine combination of the 𝑒𝑖 if βˆ‘_{𝑖=0}^{π‘˜} πœ†π‘– = 1. The affine hull is the set of affine combinations. It is a π‘˜-plane if the π‘˜ + 1 points are affinely independent, by which we mean that any two affine combinations, π‘₯ = βˆ‘ πœ†π‘–π‘’π‘– and 𝑦 = βˆ‘ 𝛿𝑖𝑒𝑖, are the same iff πœ†π‘– = 𝛿𝑖 for all 𝑖. The π‘˜ + 1 points are affinely independent iff the π‘˜ vectors 𝑒𝑖 βˆ’ 𝑒0, for 1 ≀ 𝑖 ≀ π‘˜, are linearly independent. In R^𝑑 we can have at most 𝑑 linearly independent vectors and therefore at most 𝑑 + 1 affinely independent points. An affine combination π‘₯ = βˆ‘ πœ†π‘–π‘’π‘– is a convex combination if all πœ†π‘– are non-negative. The convex hull is the set of convex combinations. A π‘˜-simplex is the convex hull of π‘˜ + 1 affinely independent points, 𝜎 = conv{𝑒0, 𝑒1, ..., π‘’π‘˜}. Example simplices are shown in Figure 2.1. A face of 𝜎 is the convex hull of a non-empty subset of the 𝑒𝑖, and is denoted as 𝜏, with 𝜏 ≀ 𝜎 meaning 𝜏 is a face of 𝜎. Using this, we next define a simplicial complex as a finite collection of simplices 𝐾 such that 𝜎 ∈ 𝐾 and 𝜏 ≀ 𝜎 implies 𝜏 ∈ 𝐾, and 𝜎, 𝜎0 ∈ 𝐾 implies 𝜎 ∩ 𝜎0 is either empty or a face of both. An example simplicial complex is in Figure 2.2.

Figure 2.1 Illustration of simplices: (a) 0-simplex, (b) 1-simplex, (c) 2-simplex.

The former definitions refer to a geometric simplicial complex, which is a simplicial complex in Euclidean space. This definition can be generalized to an abstract simplicial complex, which is a finite collection of sets 𝐴 such that 𝛼 ∈ 𝐴 and 𝛽 βŠ† 𝛼 implies 𝛽 ∈ 𝐴.

Figure 2.2 Example of a simplicial complex (left) and not a simplicial complex (right).

To understand similarities and differences of shapes, we define some additional terminology to understand homology groups, still following [5]. Let 𝐾 be a simplicial complex and 𝑝 a dimension. A 𝑝-chain is a formal sum of 𝑝-simplices in 𝐾, denoted 𝑐, with coefficients π‘Žπ‘– and 𝑝-simplices πœŽπ‘–, represented as 𝑐 = βˆ‘ π‘Žπ‘–πœŽπ‘–. The 𝑝-chains, along with addition, form chain groups denoted 𝐢𝑝.
The boundary of a 𝑝-simplex is the sum of its (𝑝 βˆ’ 1)-dimensional faces, so that π›Ώπ‘πœŽ = βˆ‘_{𝑗=0}^{𝑝} [𝑒0, ..., ˆ𝑒𝑗, ..., 𝑒𝑝], with 𝜎 = [𝑒0, 𝑒1, ..., 𝑒𝑝] and the hat denoting that 𝑒𝑗 is omitted. This definition extends to a 𝑝-chain so that 𝛿𝑝𝑐 = βˆ‘ π‘Žπ‘– π›Ώπ‘πœŽπ‘–. This is called the boundary map, 𝛿𝑝 : 𝐢𝑝 β†’ 𝐢_{π‘βˆ’1}. A chain complex is a sequence of chain groups connected by boundary maps, ... β†’ 𝐢_{𝑝+1} β†’ 𝐢𝑝 β†’ 𝐢_{π‘βˆ’1} β†’ ..., where the map from 𝐢𝑝 to 𝐢_{π‘βˆ’1} is 𝛿𝑝. A cycle is a 𝑝-chain with an empty boundary, 𝛿𝑝𝑐 = 0. The set of cycles is the kernel of 𝛿𝑝, ker 𝛿𝑝 = 𝑍𝑝. A boundary is a 𝑝-chain that is the boundary of a (𝑝 + 1)-chain, and the boundary group is denoted 𝐡𝑝. Finally, the 𝑝th homology group, denoted 𝐻𝑝, provides useful information in describing topological features. Each 𝑝 corresponds to a dimension, with 𝐻0 representing the number of connected components and 𝐻1 representing 1-dimensional holes. Formally, the 𝑝th homology group is the quotient group 𝐻𝑝 = ker 𝛿𝑝 / im 𝛿_{𝑝+1}. The rank of 𝐻𝑝 is called the 𝑝th Betti number, 𝛽𝑝, and provides the number of features in the 𝑝th homology group. This is important because homeomorphic spaces have the same Betti numbers. For example, the number "8" can be identified as the same type with different handwriting since 𝛽0 = 1 and 𝛽1 = 2. An illustration of equivalent homology groups is in Figure 2.3.

Figure 2.3 Homology groups 𝐻0 and 𝐻1 across different spaces.

2.2 Persistent Homology

Extending the idea from a single simplicial complex to an ordered sequence of simplicial complexes provides a way to quantify how prominent, or persistent, specific features are for a given data set. This provides motivation for the concept of persistent homology, which takes a function defined on a simplicial complex and quantifies the changes in homology classes as the sublevel sets grow with increasing value of the function [6]. Features that are more persistent are more likely to be attributed to features of the underlying space from which the data was sampled. For this section, we follow [6].

To measure the persistence of features, we start with a filtration. Following Definition 3.1 of [6], a filtration F = F(𝐾) of a simplicial complex 𝐾 is a nested sequence of its subcomplexes F : βˆ… = 𝐾0 βŠ† 𝐾1 βŠ† ... βŠ† 𝐾𝑛 = 𝐾. I focus on creating this filtration on real data sets in two ways. From Definition 2.10 of [6], let (𝑃, 𝑑) be a finite metric space. Given a real π‘Ÿ > 0, the Vietoris-Rips complex is the abstract simplicial complex VRπ‘Ÿ(𝑃) where a simplex 𝜎 ∈ VRπ‘Ÿ(𝑃) if and only if 𝑑(𝑝, π‘ž) ≀ 2π‘Ÿ for every pair of vertices of 𝜎. This is illustrated in Figure 2.4, where π‘Ÿ increases through the image. Another way to construct a filtration is through a sublevel set filtration, where if 𝑋 is a topological space and 𝑓 : 𝑋 β†’ R is a function, then the sublevel sets are π‘‹π‘Ž = 𝑓^{βˆ’1}(βˆ’βˆž, π‘Ž], π‘Ž ∈ R. Then for π‘Žπ‘– ≀ π‘Žπ‘— ≀ π‘Žπ‘˜ ≀ π‘Žπ‘™, the sublevel set filtration would be 𝑋0 βŠ† ... βŠ† 𝑋_{π‘Žπ‘–} βŠ† 𝑋_{π‘Žπ‘—} βŠ† 𝑋_{π‘Žπ‘˜} βŠ† 𝑋_{π‘Žπ‘™} βŠ† ... βŠ† 𝑋. This type of filtration is commonly used on image data, and an example is in Figure 2.5.

Following [5], we start with a filtration F(𝐾) with βˆ… = 𝐾0 βŠ† 𝐾1 βŠ† ... βŠ† 𝐾𝑛 = 𝐾. For every 𝑖 ≀ 𝑗 there is an inclusion map from the underlying space of 𝐾𝑖 to that of 𝐾𝑗, hence an induced homomorphism, 𝑓_𝑝^{𝑖,𝑗} : 𝐻𝑝(𝐾𝑖) β†’ 𝐻𝑝(𝐾𝑗), for each dimension 𝑝.
This filtration corresponds to a sequence of homology groups connected by homomorphisms, 0 = 𝐻𝑝(𝐾0) β†’ 𝐻𝑝(𝐾1) β†’ ... β†’ 𝐻𝑝(𝐾𝑛) = 𝐻𝑝(𝐾). The 𝑝-th persistent homology groups are the images of the homomorphisms, 𝐻_𝑝^{𝑖,𝑗} = im 𝑓_𝑝^{𝑖,𝑗} for 0 ≀ 𝑖 ≀ 𝑗 ≀ 𝑛. The ranks of these groups are the 𝑝-th persistent Betti numbers, 𝛽_𝑝^{𝑖,𝑗} = rank 𝐻_𝑝^{𝑖,𝑗}. The collection of persistent Betti numbers is then represented as a persistence diagram, 𝐷(𝑓) βŠ‚ RΒ², where 𝐷(𝑓) is the set of points (π‘Žπ‘–, π‘Žπ‘—), π‘Žπ‘– < π‘Žπ‘—, counted with multiplicity πœ‡_𝑝^{𝑖,𝑗} for 0 ≀ 𝑖 < 𝑗 ≀ 𝑛. The time at which the feature appears, π‘Žπ‘–, is known as the birth, and the time at which the feature disappears, π‘Žπ‘—, is the death. The assumed coordinate representation for a persistence diagram is in birth-death coordinates, which is the ordered pair (π‘Žπ‘–, π‘Žπ‘—). Each feature has a corresponding coordinate pair, with Figure 2.6 showing an example persistence diagram. Finally, by the Fundamental Lemma of Persistent Homology, the persistence diagrams encode the necessary information about persistent homology groups.

Figure 2.4 Given 𝑖 ≀ 𝑗 ≀ π‘˜ ≀ 𝑙, the Vietoris-Rips filtration VR_{π‘Ÿπ‘–}(𝑃) βŠ† VR_{π‘Ÿπ‘—}(𝑃) βŠ† VR_{π‘Ÿπ‘˜}(𝑃) βŠ† VR_{π‘Ÿπ‘™}(𝑃) is demonstrated in this image, with panels (a)-(d) showing VR_{π‘Ÿπ‘–}(𝑃) through VR_{π‘Ÿπ‘™}(𝑃). Note that in the last step, VR_{π‘Ÿπ‘™}(𝑃), the hole in 𝐻1 no longer persists.

Figure 2.5 Example MNIST observation after applying directional transforms, providing an application for sublevel set persistence.

Figure 2.6 The Rips persistence diagrams in dimensions 0 (blue) and 1 (orange), for a point cloud sampled around the unit circle.

Lemma 1 (Fundamental Lemma of Persistent Homology) Let βˆ… = 𝐾0 βŠ† 𝐾1 βŠ† ... βŠ† 𝐾𝑛 = 𝐾 be a filtration. For every pair of indices 0 ≀ π‘˜ ≀ 𝑙 ≀ 𝑛 and every dimension 𝑝, the 𝑝-th persistent Betti number is

𝛽_𝑝^{π‘˜,𝑙} = βˆ‘_{π‘–β‰€π‘˜} βˆ‘_{𝑗>𝑙} πœ‡_𝑝^{𝑖,𝑗}.

The sequence of homomorphisms can also be generalized to vector spaces, as adapted from [7]:

Definition 2.2.1 (Persistence modules) Let V denote a sequence of vector spaces and linear maps, of length 𝑛: 𝑉1 β†’ 𝑉2 β†’ ... β†’ 𝑉𝑛, where each arrow 𝑝𝑖 represents a map 𝑉𝑖 β†’ 𝑉_{𝑖+1}. The object V is called a persistence diagram of vector spaces, or simply a persistence module.

We can also define a persistence diagram for a persistence module, analogous to the above definition. Then a given persistence module decomposes uniquely into interval modules when the index set is finite. Specifically, Proposition 2 (Gabriel’s Theorem) from [6] guarantees a unique representation of persistence modules as birth-death pairs.

Proposition 2 (Gabriel’s Theorem) Any persistence module over a finite index set decomposes uniquely up to isomorphism into closed-closed interval modules.

In the context of features for use in machine learning models, the types of features required need to identify relevant differences and similarities between data points, and persistence diagrams provide a way to do so. These diagrams are a set of points in the plane, allowing many different computations and transformations into the required data structures for use in machine learning applications. Many featurization methods exist for working with persistence diagrams, as noted in [3]. Lastly, we follow [7] for zigzag persistence.
Zigzag persistence differs from sublevel set persistence in that instead of using a monotonically increasing sequence to create a sublevel set filtration, 𝑋0 βŠ† ... βŠ† 𝑋_{π‘Žπ‘–} βŠ† 𝑋_{π‘Žπ‘—} βŠ† 𝑋_{π‘Žπ‘˜} βŠ† 𝑋_{π‘Žπ‘™} βŠ† ... βŠ† 𝑋, we consider filtrations that β€˜zigzag’ to create zigzag modules. Assuming {𝐾𝑗} is a sequence of simplicial complexes in 𝐾, the input to zigzag persistence would be 𝐾0 ↔ 𝐾1 ↔ ... ↔ 𝐾𝑛, with ↔ being an inclusion to either the left or the right. An example of this is in Figure 2.7a, along with the corresponding persistence diagram in Figure 2.7b. The persistence diagram is computed using the Vietoris-Rips complex for a fixed radius π‘Ÿ, with the zigzag filtration varying.

Definition 2.2.2 (Zigzag modules) Let V denote a sequence of vector spaces and linear maps, of length 𝑛: 𝑉1 ↔ 𝑉2 ↔ ... ↔ 𝑉𝑛. Each arrow 𝑝𝑖 represents either a forward map 𝑓𝑖 β†’ or a backward map 𝑔𝑖 ←. The object V is called a zigzag diagram of vector spaces, or simply a zigzag module.

Similar to Proposition 2 for persistence modules, the following theorem guarantees a unique decomposition into birth-death coordinates of a zigzag module. A zigzag persistence module is a quiver representation, denoted V(𝑄), with 𝑄 = (𝑁, 𝐸) a quiver, or directed graph. When the graph is finite and linear shaped (like a zigzag module) it is known as 𝐴𝑛-type.

Figure 2.7 Example zigzag filtration (a) with the corresponding persistence diagram (b).

Proposition 3 Every quiver representation V(𝑄) for an 𝐴𝑛-type quiver 𝑄 has an interval decomposition. Furthermore, this decomposition is unique up to isomorphism and permutation of the intervals.

This then guarantees the unique existence of birth-death pairs for zigzag persistence modules.

2.3 Mapper

Mapper, first introduced in 2007 by Singh, Memoli, and Carlsson [1], is another tool for Topological Data Analysis, commonly used for visualization and data exploration. This original paper identified diabetes subtypes using data from the Miller-Reaven diabetes study as a proof of concept. The Miller-Reaven diabetes study used a dimensionality reduction method (projection pursuit [8]) to identify juvenile and adult onset diabetes as essentially different diseases, and using Mapper a similar result was observed; see Figure 2.8. The mapper algorithm has since been used widely as a data visualization tool across various domains, including for clustering and feature selection. The identification of loops and flares can be used to identify interesting clusters and help select variables that stratify data [9]. Some applications include breast cancer [10], voting [11], and sports [11]. While Mapper provides a way to explore, visualize, and reveal a lower dimensional structure of the data, there is no obvious and direct way to use information from Mapper in machine learning models.

Figure 2.8 Diabetes data from the Miller-Reaven study visualized using different mapper filters and different resolutions: (a) visualized using the first principal components as a filter; (b) visualized using the first principal components as a filter at a more granular resolution; (c) visualized using t-SNE as a filter.

Next we review the topological motivation and theoretical framework for Mapper from [1]. Intuitively, the goal is for mapper to produce a graph that resembles the original space.
For example, applying the mapper algorithm to a point cloud sampled from an annulus should result in a graph that is circular. Assume we are given a continuous map 𝑓 : 𝑋 β†’ R and a finite covering U = {π‘ˆπ›Ό}π›Όβˆˆπ΄; we note that the sets 𝑓^{βˆ’1}(π‘ˆπ›Ό) form an open cover of 𝑋. We then decompose each 𝑓^{βˆ’1}(π‘ˆπ›Ό) into path connected components, giving a covering of 𝑋 referred to as Β―U. Mapper is then defined as the nerve of this cover, 𝑁( Β―U) [12]. This is illustrated in Figure 2.9a, and discussed further in Section 3.4.

For mapper to be applied to real data, we use a statistical construction that is analogous to the topologically-motivated algorithm. First, we assume we have a sample of 𝑛 data points 𝑋 from a metric space, a function 𝑓 : 𝑋 β†’ R known for all 𝑛 points, and a distance matrix 𝐷. The function 𝑓 : 𝑋 β†’ R is called a filter, and is analogous to the function from the topological version. After applying the filter function, a covering U of 𝑓(𝑋) is determined. Intervals are computed for the covering, specified by parameters β„“ and 𝑝, where β„“ is the length and 𝑝 is the percentage overlap of the intervals of R. We can then construct a covering, U = {π‘ˆπ›Ό}π›Όβˆˆπ΄. For each π‘ˆπ›Ό ∈ U, we take the set {𝑓^{βˆ’1}(π‘ˆπ›Ό)}π›Όβˆˆπ΄, which defines a cover of 𝑋. We then have subsets of 𝑋 such that 𝑋𝛼 = 𝑓^{βˆ’1}(π‘ˆπ›Ό) and subsets of 𝐷, denoted 𝐷𝛼, corresponding to each 𝑋𝛼.

Figure 2.9 The topological (a) vs. statistical (b) version of Mapper.

Once the data and distance matrix are partitioned into subsets, a clustering algorithm, using each 𝐷𝛼, is used locally to determine the number of clusters. The chosen clustering algorithm should also not require specifying the number of clusters in advance [1]. Some examples of appropriate clustering algorithms are single-linkage clustering [13] and HDBScan [14]. These clusters construct a new covering, a refined pullback cover, denoted Β―U. A node is created for each cluster, and edges are added between nodes which share a data point in their respective clusters. The statistical version of mapper is then the nerve of the refined pullback cover. The algorithm to construct the statistical version of mapper is then as follows:

1: Given a dataset 𝑋, filter function 𝑓, distance matrix 𝐷, interval length β„“, partition overlap 𝑝, and local clustering algorithm π‘Ž
2: Take 𝑓(𝑋) βŠ‚ R and, using β„“ and 𝑝, create a covering of 𝑓(𝑋), U = {π‘ˆπ›Ό}π›Όβˆˆπ΄
3: Take the pullback cover, {𝑓^{βˆ’1}(π‘ˆπ›Ό)}π›Όβˆˆπ΄, to get 𝑋𝛼 = 𝑓^{βˆ’1}(π‘ˆπ›Ό) and subsets of 𝐷, denoted 𝐷𝛼, corresponding to each 𝑋𝛼
4: Using 𝐷𝛼, perform local clustering using π‘Ž to create a refined pullback cover, Β―U
5: For each cluster, create a vertex (0-dimensional simplex), which is designated as a node
6: For each node, if the intersection of the data points with a neighboring node is non-empty, add an edge (1-simplex) between that node and the neighboring node

The statistical construction takes point cloud data as input, and returns a graph that shows the shape and features of the underlying data with respect to the filter function. Mapper also extends to a d-dimensional version, with details available in the original paper [1]. We refer to the statistical version of Mapper as 𝑀(𝑋). This is illustrated in Figure 2.9b.
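To make the construction above concrete, the following is a minimal Python sketch of the statistical mapper algorithm, not the implementation used in this dissertation. The function name mapper_graph and its parameters are illustrative; scikit-learn's single-linkage AgglomerativeClustering stands in for any clustering algorithm that does not require the number of clusters in advance (such as single-linkage clustering or HDBScan), and the cover is simplified to evenly spaced overlapping intervals on the filter values.

```python
import itertools
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def mapper_graph(X, filter_values, n_intervals=8, overlap=0.5, distance_threshold=1.0):
    """Minimal sketch of the statistical mapper construction (illustrative only).

    X              : (n, m) point cloud
    filter_values  : (n,) values of the filter f on the points of X
    n_intervals    : number of cover intervals on the range of f
    overlap        : fractional overlap p between consecutive intervals (0 <= p < 1)
    """
    fmin, fmax = filter_values.min(), filter_values.max()
    length = (fmax - fmin) / n_intervals        # interval length l
    step = length * (1.0 - overlap)             # spacing between interval starts

    nodes = []                                  # each node is a set of point indices
    start = fmin
    while start < fmax:
        # pull back one cover interval U_alpha to the data
        idx = np.where((filter_values >= start) & (filter_values <= start + length))[0]
        if len(idx) > 1:
            # local clustering; single linkage stands in for any suitable clusterer
            clusterer = AgglomerativeClustering(
                n_clusters=None, linkage="single", distance_threshold=distance_threshold
            )
            labels = clusterer.fit_predict(X[idx])
            for lab in np.unique(labels):
                nodes.append(set(idx[labels == lab]))
        elif len(idx) == 1:
            nodes.append(set(idx))
        start += step

    # add an edge (1-simplex) whenever two nodes share at least one data point
    edges = [
        (i, j)
        for i, j in itertools.combinations(range(len(nodes)), 2)
        if nodes[i] & nodes[j]
    ]
    return nodes, edges
```

In practice the filter, interval length, overlap, and clusterer are all dataset-dependent choices, as discussed above.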
2.4 Reeb Graphs

Another important tool for topological data analysis is the Reeb graph, which provides a simplified view of potentially complex shapes while retaining important information on connected components. The use of Reeb graphs has been popularized in computer science [15] and provides utility for 3D shape estimation. Additionally, mapper is a discretized version of the Reeb graph. With the existence of metrics that calculate distances between Reeb graphs, such as the bottleneck distance and interleaving distance [16], similarity of Reeb graphs, and hence of the underlying spaces, can be quantified. To better understand the representation Reeb graphs provide, consider the torus, shown in Figure 2.10a, and a Reeb graph of the height function on the torus, shown in Figure 2.10b. Following [6], the Reeb graph is defined formally as follows, with an example in Figure 2.11.

Figure 2.10 Reeb graph representation of a torus: (a) torus plotted in 3 dimensions; (b) Reeb graph of the torus.

Definition 2.4.1 (Reeb Graph) Let 𝑋 be a topological space with a function 𝑓 : 𝑋 β†’ R. Define an equivalence relation ∼ on 𝑋 by asserting π‘₯ ∼ 𝑦 iff (i) 𝑓(π‘₯) = 𝑓(𝑦) = 𝛼 and (ii) π‘₯ and 𝑦 belong to the same connected component of the level set 𝑓^{βˆ’1}(𝛼). Let [π‘₯] denote the equivalence class of π‘₯ ∈ 𝑋. The Reeb graph R_𝑓 of 𝑓 : 𝑋 β†’ R is the quotient space 𝑋/∼, i.e. the set of equivalence classes equipped with the quotient topology. Let Ξ¦ : 𝑋 β†’ R_𝑓, π‘₯ ↦ [π‘₯], denote the quotient map.

Through this construction, the Reeb graph preserves relationships between the connected components of level sets [17]. The Mapper algorithm is a generalization of the Reeb graph construction for real-valued datasets; however, algorithms for their construction have only become available recently [18], allowing additional applications for Reeb graphs and computations on Reeb graphs. In this dissertation, I focus on the basics of working with Reeb graphs, and assume the Reeb graph has been provided as input.

Figure 2.11 Reeb graph R_𝑓 of the function 𝑓 : 𝑋 β†’ R. Figure inspired by [6].

CHAPTER 3 PREDICTIVE MAPPER

A primary area of research for my dissertation is work on the mapper algorithm, both in extending available software and in the definition and development of predictive mapper. Predictive mapper extends the mapper algorithm to create geometrically-motivated features for use in machine learning models. For a review of the original mapper algorithm, refer to Section 2.3. When creating the predictive mapper algorithm, there are multiple items that must be developed to create useful features for machine learning. The mapper graph needs to be structured to represent the different classes to be predicted, and must be optimal in a sense to be defined later. There needs to be a way to create real-valued vectors from the mapper graph, and finally a mapping must exist from the original dataset to the mapper graph. Using predictive mapper as an extension to the original mapper algorithm, each of these areas is addressed, allowing for prediction using information from mapper output. Before I discuss the algorithm, I review some spectral theory and motivation. Next I discuss each of the three components of the algorithm, and end with two examples on datasets and a discussion.

3.1 Spectral theory and motivation

To provide a theoretical justification for predictive mapper, I will first review a proposition and fact from spectral graph theory, and then discuss prior use of a related algorithm, Laplacian eigenmaps.
This background will be important when I discuss the mapper graph optimization and mapper embedding in Sections 3.2 and 3.3, respectively. For this, I follow [19] until noted otherwise, and start with the notation required for this section. Let 𝐺 = (𝑉, 𝐸) be a similarity graph with vertex set 𝑉 = {𝑣1, ..., 𝑣𝑛} and edge set 𝐸, with weights 𝑀𝑖𝑗 β‰₯ 0 representing the similarity between the vertices connected by each edge. The weighted adjacency matrix of the graph is π‘Š = (𝑀𝑖𝑗)_{𝑖,𝑗=1,...,𝑛}. Assume 𝐺 is undirected, so that 𝑀𝑖𝑗 = 𝑀𝑗𝑖. The degree of a vertex 𝑣𝑖 ∈ 𝑉 is

𝑑𝑖 = βˆ‘_{𝑗=1}^{𝑛} 𝑀𝑖𝑗.   (3.1)

The degree matrix 𝐷 is the diagonal matrix with degrees 𝑑𝑖 on the diagonal. Given a subset of vertices 𝐴 βŠ‚ 𝑉, the complement is ¯𝐴 = 𝑉 \ 𝐴. One can measure the size of a subset 𝐴 in two ways:

|𝐴| ≔ the number of vertices in 𝐴,   (3.2)
vol(𝐴) ≔ βˆ‘_{π‘–βˆˆπ΄} 𝑑𝑖.   (3.3)

3.1.1 Spectral Theory

Moving into the spectral theory, disconnected and weakly connected components of a graph 𝐺 can be separated by the eigenvectors of the graph Laplacian. The Laplacian matrix is defined as 𝐿 = 𝐷 βˆ’ π‘Š, where 𝐷 is the degree matrix and π‘Š is the weighted adjacency matrix.

Proposition 4 (Number of connected components and the spectrum of 𝐿) Let 𝐺 be an undirected graph with non-negative weights. Then the multiplicity π‘˜ of the eigenvalue 0 of the Laplacian 𝐿 equals the number of connected components 𝐴1, ..., π΄π‘˜ in the graph. The eigenspace of eigenvalue 0 is spanned by the indicator vectors 1_{𝐴1}, ..., 1_{π΄π‘˜} of those components.

From the proof of Proposition 4, the Laplacian matrix 𝐿 has block diagonal form. Assume there are π‘˜ connected components in 𝐺; then the Laplacian matrix is

𝐿 = diag(𝐿1, 𝐿2, ..., πΏπ‘˜),   (3.4)

a block diagonal matrix in which each block 𝐿𝑖 is the graph Laplacian corresponding to the subgraph of the 𝑖-th connected component. The Laplacian 𝐿 as a whole provides the indicator functions for connected components in a graph, using the first eigenvector of each block 𝐿𝑖.

Beyond the Laplacian matrix 𝐿 = 𝐷 βˆ’ π‘Š, there are other variants of this matrix called normalized graph Laplacians, with the former referred to as the unnormalized graph Laplacian. For our purposes, the random walk Laplacian is

πΏπ‘Ÿπ‘€ = 𝐷^{βˆ’1}𝐿.   (3.5)

For completeness, there is another normalized Laplacian, called πΏπ‘ π‘¦π‘š, with details available in [19]. A property of πΏπ‘Ÿπ‘€ I use is that πœ† is an eigenvalue of πΏπ‘Ÿπ‘€ with eigenvector 𝑒 if and only if πœ† and 𝑒 solve the generalized eigenvalue problem 𝐿𝑒 = πœ†π·π‘’. It is notable that the authors of [19] provide a few arguments for using the random walk Laplacian, and it is the Laplacian used throughout this work. From the graph partitioning point of view (the mincut problem), using πΏπ‘Ÿπ‘€ maximizes within-cluster similarity in contrast to the other Laplacians. Additionally, the normalized Laplacians do not have the issues with convergence that are seen in the unnormalized Laplacian. There is also a proposition analogous to Proposition 4.

Proposition 5 (Number of connected components and the spectrum of πΏπ‘Ÿπ‘€) Let 𝐺 be an undirected graph with non-negative weights. Then the multiplicity π‘˜ of the eigenvalue 0 of πΏπ‘Ÿπ‘€ equals the number of connected components 𝐴1, ..., π΄π‘˜ in the graph.
The eigenspace of eigenvalue 0 is spanned by the indicator vectors 1_{𝐴1}, ..., 1_{π΄π‘˜} of those components.

To demonstrate this, note in Figure 3.1 a graph with 2 connected components and the corresponding eigenvectors of the graph Laplacian. This shows the correspondence to the block structure of the eigenvectors of the graph Laplacian and how these vectors can be used as indicator functions on the graph.

Figure 3.1 The eigenvectors of the graph Laplacian provide indicator functions for the different components of the graph: (a) graph with 2 components; (b) eigenvectors of 𝐿 corresponding to (a).

3.1.2 Laplacian eigenmaps

We see related work in [20]. The authors use a Laplacian eigenmap embedding as a lower dimensional representation of the original dataset. A graph 𝐺 is constructed from the original dataset using either πœ–-neighborhoods or 𝑛 nearest neighbors. Given points π‘₯1, ..., π‘₯π‘˜ ∈ R^𝑙, an πœ–-neighborhood connects nodes 𝑖, 𝑗 when ||π‘₯𝑖 βˆ’ π‘₯𝑗||Β² < πœ–, πœ– ∈ R. Alternatively, using 𝑛 nearest neighbors, nodes 𝑖, 𝑗 are connected by an edge if 𝑖 is among the 𝑛 ∈ N nearest neighbors of 𝑗 (or 𝑗 is among the 𝑛 nearest neighbors of 𝑖). An edge weight is selected, for example from the heat kernel, π‘Šπ‘–π‘— = 𝑒^{βˆ’||π‘₯𝑖 βˆ’ π‘₯𝑗||Β²/𝑑}, or by setting π‘Šπ‘–π‘— = 1 if vertices 𝑖 and 𝑗 are connected. Then for each connected component of 𝐺 the eigenvalues and eigenvectors for 𝐿y = πœ†π·y are computed, where 𝐷 is the diagonal weight matrix, 𝐷𝑖𝑖 = βˆ‘_𝑗 π‘Šπ‘—π‘–, and 𝐿 = 𝐷 βˆ’ π‘Š is the Laplacian matrix. The data is then embedded into a lower dimensional space given by (y1(𝑖), ..., yπ‘š(𝑖)), where each y𝑗 is an eigenvector, ordered by eigenvalue, with y1 having the smallest eigenvalue.

3.1.3 Mincut Problem

Additionally, the eigenvectors of each 𝐿𝑖 provide a solution to the mincut problem. The mincut problem is a way to partition a graph into separate components by removing the edge(s) with the minimum weight. Given a graph with weighted adjacency matrix π‘Š, define π‘Š(𝐴, 𝐡) ≔ βˆ‘_{π‘–βˆˆπ΄, π‘—βˆˆπ΅} 𝑀𝑖𝑗, with ¯𝐴 the complement of 𝐴. For π‘˜ subsets, the mincut problem chooses a partition 𝐴1, ..., π΄π‘˜ to minimize

cut(𝐴1, ..., π΄π‘˜) ≔ (1/2) βˆ‘_{𝑖=1}^{π‘˜} π‘Š(𝐴𝑖, ¯𝐴𝑖).   (3.6)

This can also be modified for other problems, with the Ncut problem having a direct relationship with the random walk Laplacian. The Ncut problem minimizes

Ncut(𝐴1, ..., π΄π‘˜) ≔ (1/2) βˆ‘_{𝑖=1}^{π‘˜} π‘Š(𝐴𝑖, ¯𝐴𝑖)/vol(𝐴𝑖) = βˆ‘_{𝑖=1}^{π‘˜} cut(𝐴𝑖, ¯𝐴𝑖)/vol(𝐴𝑖).   (3.7)

When π‘˜ = 2, the second eigenvector of πΏπ‘Ÿπ‘€ provides a solution to a relaxed version of the Ncut problem. When π‘˜ > 2, the first π‘˜ eigenvectors provide a solution. In Figure 3.2, note that the second eigenvector separates the graph into two distinct regions, where each region is strongly connected to other nodes in the region and only weakly connected to the other region. This also translates to Figure 3.3, where the second eigenvector for the block matrix of each connected component differentiates areas of that component of the graph with the minimal edge weights.

3.2 Optimized mapper graph

The overall goal of optimizing the mapper object as defined earlier is to have weak connections between areas of the graph that represent different classes, and strong connections between areas of heavy overlap.
Figure 3.2 The eigenvectors of the graph Laplacian demonstrate the solution to the mincut problem, as the second eigenvector separates the graph by minimal edge weight: (a) connected graph colored by the second eigenvector of the graph Laplacian; (b) eigenvalues of 𝐿 corresponding to (a).

Figure 3.3 The eigenvectors of the graph Laplacian demonstrate the solution to the mincut problem, as the eigenvectors separate based on minimum edge weights: (a) eigenvalues of 𝐿 for the graph in (b); (b) graph colored by eigenvector 1 of 𝐿; (c) graph colored by eigenvector 5 of 𝐿.

Very recent related work to optimize a mapper graph is available in [21], where the authors use an algorithm named 𝐺-mapper to determine an optimal cover of a mapper graph. The variant I use is a bit different and is as follows:

Definition 3.2.1 Let 𝑋(𝑣𝑖) be the points associated with node 𝑣𝑖 ∈ 𝑉(𝑀), with 𝑉(𝑀) the set of nodes in mapper graph 𝑀. Let |𝑋(𝑣𝑖)𝑦| be the number of data points in 𝑣𝑖 with label 𝑦, and let 𝑓(𝑣𝑖, 𝑦) = |𝑋(𝑣𝑖)𝑦|/|𝑋(𝑣𝑖)|. Let 𝑁𝑦 be the number of nodes in 𝑀 which contain points with label 𝑦. Then

𝐻(𝑀, π‘Œ) = βˆ‘_{π‘£βˆˆπ‘‰(𝑀)} βˆ‘_{π‘¦βˆˆπ‘Œ} 𝑓(𝑣, 𝑦)/𝑁𝑦   (3.8)

provides a measure on a mapper graph, and the mapper graph is considered optimal when 𝐻(𝑀, π‘Œ) = |π‘Œ|.

In the prior definition, if each node 𝑣𝑖 ∈ 𝑉(𝑀) only contains one label, then 𝐻(𝑀, π‘Œ) is optimal. 𝐻(𝑀, π‘Œ) can be thought of as an overall quality measure on 𝑀, representing the percentage of outcomes in each node. Consider Figure 3.4 as an example. Each color and shape corresponds to a different outcome variable, 𝑦 ∈ π‘Œ. In all three graphs |π‘Œ| = 3, so to check whether each graph is optimal, compute 𝐻(𝑀, π‘Œ). For Figure 3.4a, 𝐻(𝑀, π‘Œ) = 3, so this is an optimal graph. For Figure 3.4b, 𝐻(𝑀, π‘Œ) = 2.5, so this graph is not optimal. Lastly, for Figure 3.4c, 𝐻(𝑀, π‘Œ) = 1.93. Visually it is possible to distinguish that Figure 3.4c is less optimal than Figure 3.4b, and that information is reflected in the metric 𝐻(𝑀, π‘Œ).

Figure 3.4 Example of (a) an optimal mapper graph where 𝐻(𝑀, π‘Œ) = |π‘Œ| = 3, and (b)-(c) two suboptimal mapper graphs with 𝐻(𝑀, π‘Œ) = 2.5 for (b) and 𝐻(𝑀, π‘Œ) = 1.93 for (c).

To create an optimal mapper graph in practice, construction relies on various parameter choices during the initial mapper construction, and 𝐻(𝑀, π‘Œ) was computed on the mapper graph before optimization and after optimization as a comparison. This could be used as a metric during initial mapper graph creation; however, I did not use it at that part of the pipeline. Figure 3.5 shows the mapper algorithm applied to the iris dataset [22], highlighted by the proportion of each species of iris in each node. In Figure 3.5a there is complete separation of the Setosa class from the others, but the separation between Versicolor and Virginica is not as obvious.

Figure 3.5 Example mapper graph using the iris dataset with the percentage of classifications in each node highlighted for (a) Setosa, (b) Versicolor, and (c) Virginica. 𝐻(𝑀, π‘Œ) = 2.53.

To create an optimal mapper graph, and for utility in a predictive pipeline, I first build a mapper graph, 𝑀, using ¯𝑋 βŠ‚ 𝑋 ∈ R^{𝑝×π‘š}, with 𝑋 a dataset with 𝑝 observations and π‘š features. I also split 𝑋 into a training and test set, denoted ¯𝑋 and ˆ𝑋 respectively.
Then using mapper graph 𝑀, the set of nodes 𝑉(𝑀), and the set of datasets associated with each node, ¯𝑋(𝑣𝑖) for 𝑣𝑖 ∈ 𝑉(𝑀), a clustering algorithm is applied to the subset of the data associated with each node. A new mapper object, ¯𝑀, is then constructed using the new clusters to construct a larger node set, 𝑉( ¯𝑀). The adjacency matrix, 𝐴( ¯𝑀), is constructed by forming an edge between two nodes when points are in the intersection; that is, for nodes ¯𝑣𝑖, ¯𝑣𝑗 ∈ 𝑉( ¯𝑀) with 𝑖 β‰  𝑗, 𝐴_{𝑖,𝑗}( ¯𝑀) = | ¯𝑋( ¯𝑣𝑖) ∩ ¯𝑋( ¯𝑣𝑗)|. This provides a weighted adjacency matrix with weights equal to the size of the intersection of points. What this means is that after building a mapper graph through the normal pipeline, better separation can be achieved via localized clustering that targets larger nodes with more variation in data distributions, and specifically variation in a desired outcome variable. Using the iris dataset as an example, this is illustrated in Figure 3.6.

Figure 3.6 Before and after applying localized clustering to a mapper graph to achieve better separation between classes: (a) percentage of Versicolor in each node before additional clustering, 𝐻(𝑀, π‘Œ) = 2.53; (b) percentage of Versicolor in each node after additional clustering, 𝐻(𝑀, π‘Œ) = 3.

While this type of result may be achievable through the use of various mapper parameters, this additional way to cluster allows for more targeted post-processing of a mapper graph when the usual pipeline is not yielding a desirable result. An example of how the usual mapper process may fail is in Figure 3.7.

Figure 3.7 This example shows (a) a point cloud sampled from 2 annuli, (b) the mapper graph using the original pipeline, (c) the mapper graph after optimizing, and (d) the mapper graph with additional clusters in the original pipeline to create more separation. Note the two annuli were not recovered in the original mapper pipeline.

It should be noted that while using different filter functions may recover a desirable mapper graph, many trials of mapper graph creation with the iris dataset did not result in separation between the Versicolor and Virginica classes. The outcome variable was also used in the original mapper graph creation in an attempt to create additional separation, without much success.

3.3 Mapper embedding

Using both spectral theory and Laplacian eigenmaps as motivation for predictive mapper, I construct the mapper embedding portion of the algorithm. Through optimization of the mapper graph, the goal is to have each node in the mapper graph contain a single class, and be disconnected or weakly connected to nodes of different classes. Then the eigenvectors of the graph Laplacian of the optimized mapper graph represent the different components, and hence the different classes. Using the optimized mapper graph ¯𝑀, the generalized eigenvalue problem 𝐿y = πœ†π·y is solved for y, resulting in an embedding of ¯𝑀 given by [y0, ..., y𝑛]. This embedding represents the classes through connected components and the separation of weakly connected components.

3.4 Mapping from 𝑋 to | ¯𝑀 |

The last item needed is a mapping from the dataset to the mapper graph so that these eigenvectors can be used as an embedding for both ¯𝑋 and ˆ𝑋. The goal of this construction is shown in Figure 3.8. To understand the following construction, I quickly review [1] for the topological motivation, and then construct the statistical version.
The nerve of an open covering U of a space 𝑋, denoted 𝑁(U), is the abstract simplicial complex whose vertices are the elements of U and whose simplices are the finite subcollections {π‘ˆ1, ..., π‘ˆπ‘›} of U such that π‘ˆ1 ∩ π‘ˆ2 ∩ ... ∩ π‘ˆπ‘› β‰  βˆ….

Topological Motivation. Given a finite covering U = {π‘ˆπ›Ό}π›Όβˆˆπ΄ of a space 𝑋 and a continuous map 𝑓 : 𝑋 β†’ R, the refinement of the pullback cover {𝑓^{βˆ’1}(π‘ˆπ›Ό)} into path connected components gives a covering of 𝑋 referred to as Β―U. Mapper is then defined as the nerve of this cover, 𝑁( Β―U). Also recall a simplicial complex from Section 2.1. The nerve 𝑁( Β―U) can be embedded in Euclidean space, and this is the geometric realization of the nerve, denoted |𝑁( Β―U)|. I will construct a map 𝜌 : 𝑋 β†’ |𝑁( Β―U)|.

Figure 3.8 Topological vs. statistical version of Mapper with the additional mapping 𝜌(π‘₯) required for predictive mapper: (a) the topological version of mapper with an additional mapping 𝜌(π‘₯); (b) the statistical version of mapper with an additional mapping 𝜌(π‘₯).

Definition 3.4.1 (Partition of unity) A partition of unity subordinate to the finite open covering U is a family of real valued functions {πœ‘π›Ό : 𝑋 β†’ R}π›Όβˆˆπ΄ with the following properties:
β€’ 0 ≀ πœ‘π›Ό(π‘₯) ≀ 1 for all 𝛼 ∈ 𝐴 and π‘₯ ∈ 𝑋.
β€’ βˆ‘_{π›Όβˆˆπ΄} πœ‘π›Ό(π‘₯) = 1 for all π‘₯ ∈ 𝑋.
β€’ The closure of the set {π‘₯ ∈ 𝑋 | πœ‘π›Ό(π‘₯) > 0} is contained in the open set π‘ˆπ›Ό.

An example partition of unity is shown in Figure 3.9b. A family of probability density functions could be used to construct a partition of unity. Consider the Gaussian probability density function πœ™_{πœ‡,𝜎}(π‘₯) = (1/𝛽) exp(βˆ’(1/2)((π‘₯ βˆ’ πœ‡)/𝜎)Β²), with mean πœ‡, standard deviation 𝜎, and 𝛽 a normalization constant. Next define a family of functions

Ξ¦ = {πœ‘0(π‘₯), πœ‘1(π‘₯), πœ‘2(π‘₯), πœ‘3(π‘₯)}   (3.9)

with

πœ‘0(π‘₯) = exp(βˆ’(1/2)((βˆ’4 βˆ’ π‘₯)/1)Β²)   (3.10)
πœ‘1(π‘₯) = exp(βˆ’(1/2)((2 βˆ’ π‘₯)/2)Β²)   (3.11)
πœ‘2(π‘₯) = exp(βˆ’(1/2)((8 βˆ’ π‘₯)/2)Β²)   (3.12)
πœ‘3(π‘₯) = exp(βˆ’(1/2)((βˆ’10 βˆ’ π‘₯)/2)Β²)   (3.13)

Let 𝛽 = πœ‘0(π‘₯) + πœ‘1(π‘₯) + πœ‘2(π‘₯) + πœ‘3(π‘₯), and take U to cover the interval (βˆ’18, 16) with π‘ˆ0 = (βˆ’8, 0), π‘ˆ1 = (βˆ’6, 10), π‘ˆ2 = (0, 16), and π‘ˆ3 = (βˆ’18, 2). Then the family of functions Β―Ξ¦ given by

Β―πœ‘0 = πœ‘0(π‘₯)/𝛽   (3.14)
Β―πœ‘1 = πœ‘1(π‘₯)/𝛽   (3.15)
Β―πœ‘2 = πœ‘2(π‘₯)/𝛽   (3.16)
Β―πœ‘3 = πœ‘3(π‘₯)/𝛽   (3.17)

is a partition of unity on (βˆ’18, 16). It can be checked that this construction satisfies the definition of a partition of unity.

Using {𝑣0, 𝑣1, ..., π‘£π‘˜} = 𝑉 to denote the vertices of 𝑁( Β―U), there exists the barycentric coordinatization, which is a bijection of the points of the geometric simplex to the set of ordered tuples (π‘Ÿ0, π‘Ÿ1, ..., π‘Ÿπ‘˜) ∈ R^{π‘˜+1} such that 0 ≀ π‘Ÿπ‘– ≀ 1 and βˆ‘_{𝑖=0}^{π‘˜} π‘Ÿπ‘– = 1. Each π‘Ÿπ‘– is a barycentric coordinate. For a given partition of unity Β―Ξ¦, define 𝜌(π‘₯) ∈ |𝑁(U)| as the point in the simplex spanned by the vertices 𝑉, with barycentric coordinates ( Β―πœ‘0(π‘₯), Β―πœ‘1(π‘₯), ..., Β―πœ‘π‘˜(π‘₯)). This 𝜌(π‘₯) is the mapping required from 𝑋 β†’ |𝑁( Β―U)|.

Figure 3.9 A family of functions Ξ¦ (a) before and (b) after normalization to create a partition of unity on the range [βˆ’17.99, 15.9].
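The construction above can be checked numerically. The following short Python sketch builds the normalized family Φ̄ and verifies the partition-of-unity property over the range shown in Figure 3.9; the means and widths are read off Equations 3.9-3.13 and should be treated as illustrative rather than exact.

```python
import numpy as np

# Means and widths of the Gaussian bumps, read off Equations 3.9-3.13
# (illustrative values; treat them as an assumption of this sketch).
means = np.array([-4.0, 2.0, 8.0, -10.0])
sigmas = np.array([1.0, 2.0, 2.0, 2.0])


def bump(x, mu, sigma):
    """Unnormalized Gaussian bump phi_i(x)."""
    return np.exp(-0.5 * ((mu - x) / sigma) ** 2)


def partition_of_unity(x):
    """Return the normalized family (phi_bar_0(x), ..., phi_bar_3(x))."""
    phi = np.array([bump(x, m, s) for m, s in zip(means, sigmas)])
    beta = phi.sum(axis=0)          # the normalization constant beta
    return phi / beta


xs = np.linspace(-17.99, 15.9, 200)           # the range used in Figure 3.9
phi_bar = partition_of_unity(xs)
assert np.allclose(phi_bar.sum(axis=0), 1.0)  # the functions sum to 1 everywhere
```

The same pattern, with multivariate Gaussians centered on the data in each mapper node, is used in the statistical construction that follows.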
Statistical Construction. Using this, I construct a partition of unity to build the map 𝑃 : 𝑋 β†’ | ¯𝑀 |, where | ¯𝑀 | is the geometric realization of ¯𝑀. Given ¯𝑋(𝑣𝑖), the subset of data ¯𝑋 in each node 𝑣𝑖 ∈ 𝑉( ¯𝑀) with 0 ≀ 𝑖 ≀ π‘˜, summary statistics are computed on each set ¯𝑋( ¯𝑣𝑖). Let Ξ¦ = {πœ‘0, πœ‘1, ..., πœ‘π‘˜} be a family of probability distributions with

πœ‘π‘–(π‘₯) = exp(βˆ’(1/2)(π‘₯ βˆ’ πœ‡_{ ¯𝑣𝑖})^𝑇 Ξ£_{ ¯𝑣𝑖}^{βˆ’1}(π‘₯ βˆ’ πœ‡_{ ¯𝑣𝑖}))   (3.18)
𝛽 = βˆ‘_𝑖 πœ‘π‘–(π‘₯)   (3.19)
Β―πœ‘π‘–(π‘₯) = πœ‘π‘–(π‘₯)/𝛽   (3.20)

with πœ‡_{ ¯𝑣𝑖} the mean of ¯𝑋( ¯𝑣𝑖) and Ξ£_{ ¯𝑣𝑖} the covariance of ¯𝑋( ¯𝑣𝑖). Then Β―Ξ¦ = { Β―πœ‘0, Β―πœ‘1, ..., Β―πœ‘π‘˜} is a partition of unity that can provide a barycentric coordinatization of points onto the mapper graph, i.e. a point Λ†π‘₯ ∈ ˆ𝑋 has a barycentric coordinatization in | ¯𝑀 | given by ( Β―πœ‘0( Λ†π‘₯), Β―πœ‘1( Λ†π‘₯), ..., Β―πœ‘π‘˜( Λ†π‘₯)).

Finally, using this barycentric coordinatization together with the eigenvectors y, features can be created that are a linear combination of the values for each eigenvector. Taking a single eigenvector, y𝑗 ∈ R^π‘˜, the eigenvector feature for Λ†π‘₯ is

βˆ‘_{𝑖=0}^{π‘˜} y_{𝑗𝑖} Β―πœ‘π‘–( Λ†π‘₯).   (3.21)

This is the eigenvector value associated with each node, multiplied by the probability of inclusion for Λ†π‘₯ in each node. This has the effect that for nodes with a small or zero probability of inclusion for Λ†π‘₯, the value of the eigenvector at that node provides minimal contribution to the feature, whereas higher probability nodes provide higher contributions to the feature. Note that ¯𝑋 is a random sample of 𝑋, and 𝑋 is a sample from some underlying distribution. One can approximate the probability distribution function with any of our random samples: 𝑋, ¯𝑋, ˆ𝑋. Error bounds will differ, but this construction then holds for any random sample of 𝑋.

3.5 Predictive Mapper Pipeline

The optimized mapper graph, mapper embedding, and mapping from 𝑋 to ¯𝑀 provide the predictive mapper pipeline, which I summarize here. The following inputs are required to create a mapper graph and consequently the predictive mapper pipeline: training data ¯𝑋, testing data ˆ𝑋, filter function 𝑓, and distance matrix or metric 𝑑. Additionally, the following parameters are required: a clustering algorithm, partition length β„“, percent overlap in the partition 𝑝, and a clustering algorithm for mapper optimization. The predictive mapper construction is then as follows:

β€’ Construct 𝑀, a mapper graph, using ¯𝑋, 𝑓, 𝑑, the clustering algorithm, 𝑝, and β„“
β€’ Optimize 𝑀 with localized clustering, using the clustering algorithm for mapper optimization on nodes, so that if ¯𝑀 is the optimized version of 𝑀, then 𝐻(𝑀, π‘Œ) < 𝐻( ¯𝑀, π‘Œ)
β€’ Construct the weighted adjacency matrix of ¯𝑀 such that for ¯𝑣𝑖, ¯𝑣𝑗 with 𝑖 β‰  𝑗, 𝐴_{𝑖,𝑗}( ¯𝑀) = | ¯𝑋( ¯𝑣𝑖) ∩ ¯𝑋( ¯𝑣𝑗)|
β€’ Compute the eigenvectors y of the generalized eigenvalue problem 𝐿y = πœ†π·y
β€’ Taking ¯𝑋(𝑣𝑖), the subset of data ¯𝑋 in each node 𝑣𝑖 ∈ 𝑉, compute summary statistics and the partition of unity functions Β―Ξ¦
β€’ For each π‘₯ ∈ ¯𝑋 and π‘₯ ∈ ˆ𝑋, compute the probability of inclusion in each node using Β―Ξ¦
β€’ Create features as a linear combination of eigenvectors and the partition of unity, i.e. for each 𝑗, βˆ‘_{𝑖=0}^{π‘˜} y_{𝑗𝑖} Β―πœ‘π‘–( Λ†π‘₯)

This pipeline provides an approach to creating features from a mapper graph that can be used in machine learning models. An example of the full pipeline is shown in Figure 3.10.
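As a companion to the pipeline above, the following is a compact Python sketch of the feature construction in Equations 3.18-3.21, assuming the optimized mapper graph and its per-node summary statistics have already been computed. The function and argument names are illustrative, each node covariance is assumed invertible, and the degree matrix is assumed nonsingular (every node has at least one edge).

```python
import numpy as np
from scipy.linalg import eigh


def eigenvector_features(x_hat, A, node_means, node_covs, n_features=3):
    """Sketch of the eigenvector features in Equation 3.21 for one point x_hat.

    x_hat      : (m,) a single data point (training or testing)
    A          : (k, k) weighted adjacency matrix of the optimized mapper graph
    node_means : length-k list of node means mu_i
    node_covs  : length-k list of node covariances Sigma_i (assumed invertible)
    """
    k = A.shape[0]
    D = np.diag(A.sum(axis=1))      # degree matrix (assumed nonsingular)
    L = D - A                       # unnormalized graph Laplacian

    # generalized eigenvalue problem L y = lambda D y, eigenvalues ascending
    _, Y = eigh(L, D)

    # partition of unity: unnormalized Gaussian membership for each node
    phi = np.empty(k)
    for i in range(k):
        diff = x_hat - node_means[i]
        phi[i] = np.exp(-0.5 * diff @ np.linalg.solve(node_covs[i], diff))
    phi_bar = phi / phi.sum()       # barycentric coordinates of x_hat

    # Equation 3.21: one feature per eigenvector, sum_i y_{ji} * phi_bar_i(x_hat)
    return Y[:, :n_features].T @ phi_bar
```

Mapping every point of ¯𝑋 and ˆ𝑋 through a function like this yields the eigenvector feature matrices evaluated in the experiments below.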
Figure 3.10 A simulated example of the predictive mapper pipeline: (a) point cloud sampled from two annuli; (b) construction of the pullback cover; (c) refined pullback cover; (d) mapper graph; (e) cover from the mapper optimization step; (f) optimized mapper graph; (g) two test points in 𝑋; (h) probability of inclusion in each node for the point on the right; (i) probability of inclusion in each node for the point on the left; (j) eigenvectors of ¯𝑀; (k) eigenvector features of the point on the right represented on the mapper graph; (l) eigenvector features of the point on the left represented on the mapper graph.

Figure 3.11 Mapper experiment design.

3.6 Methods and Results

Using two datasets, the famous iris [23] dataset and a student performance [24] dataset, features created from the predictive mapper algorithm were assessed for accuracy in different machine learning models. The iris dataset was used as a familiar dataset during algorithm development and to identify potential areas to test, while the student performance dataset was used to explore potential use cases for predictive mapper features. To create a comparison for the performance of the predictive mapper features, the same machine learning models were evaluated on a feature set created from predictive mapper and on the feature set from the original data. To create a true comparison for performance, 50% of the data was used to build a mapper model, and 50% was used as a testing set to validate model performance. The dataset structure for building the mapper model and building and testing the model is shown in Figure 3.11.

For both datasets, the clustering algorithm used during the initial mapper fit was HDBScan [14], and the chosen clustering algorithm to optimize mapper was Bayesian Gaussian mixture modeling [25]. HDBScan was chosen as it meets the requirements of [1], and Bayesian Gaussian mixture modeling was chosen as computation of means and standard deviations is part of the algorithm, the distributions are assumed normal, and the number of clusters can be inferred from the data. Both datasets also report performance in terms of original data, eigenvector features, and augmented features. The augmented data uses both the original data and the eigenvector features to assess model performance.

3.6.1 Iris

The iris [23] dataset is one of the earliest known datasets used in machine learning models, and can be traced back to scientist R.A. Fisher in 1936. The dataset has 4 features that measure properties of the iris plant, with a classification of plant type. There are three classes with 50 instances each, making the dataset perfectly balanced for testing machine learning tasks. One of the classes is also easily separable from the other two, making iris an excellent candidate for evaluating and constructing the predictive mapper algorithm.

The predictive mapper algorithm was run on the iris dataset using three different mapper graphs, all with the same parameters in mapper graph creation, using a principal component analysis filter. The original mapper graph is in Figure 3.12, colored by the percentage of the outcome variable, and complete separation between Setosa and the other two classes is observed in the original graph.

Figure 3.12 Original iris mapper graphs for mapper graph 0, colored by the percentage of each node in the (a) Setosa, (b) Versicolor, and (c) Virginica classes.
To optimize the mapper graph, Bayesian Gaussian mixture modeling was applied to the original mapper graph; the mapper graphs after optimization are shown in Figure 3.13, and each class is associated with a different connected component.

Figure 3.13 Optimized iris mapper graphs for mapper graph 0, colored by the percentage of each node in the (a) Setosa, (b) Versicolor, and (c) Virginica classes.

The probability of inclusion in each node for all data points was computed, and then the features were created using the partition of unity and the eigenvector basis. Example graphs of the probability of inclusion and eigenvector features are shown for a selected point in Figure 3.14. For both images, the graph on the left highlights three nodes with a non-zero probability of inclusion for a sample data point. This is then reflected in the eigenvector features, with variation in the same nodes and the nodes in the same connected component. Components with a zero probability of inclusion for the data point do not show variation in the eigenvector features.

Three different optimized mapper graphs were assessed, with the results included in Table 3.2. 𝐻(𝑀, π‘Œ) was measured before and after optimization for each and is reported in Table 3.1. In each of the three trials a more optimal mapper graph was created through the predictive mapper pipeline.

Trial            |𝑉(𝑀)|   𝐻(𝑀, π‘Œ)   |𝑉( ¯𝑀)|   𝐻( ¯𝑀, π‘Œ)
Mapper Graph 0   14       2.53       30        3.00
Mapper Graph 1   16       2.55       33        3.00
Mapper Graph 2   15       2.43       33        3.00

Table 3.1 Iris mapper graph comparison before and after optimization.
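For reference, the before-and-after values of 𝐻(𝑀, π‘Œ) reported in Table 3.1 can be computed directly from Definition 3.2.1. The short Python sketch below assumes the labels of the training points in each node are available as Python lists; the function name mapper_quality and its inputs are illustrative.

```python
from collections import Counter


def mapper_quality(node_labels, classes):
    """Sketch of H(M, Y) from Definition 3.2.1.

    node_labels : list where node_labels[i] holds the outcome labels of the
                  training points in node v_i
    classes     : the set of possible labels Y
    """
    # N_y: number of nodes containing at least one point with label y
    N = {y: sum(1 for labels in node_labels if y in labels) for y in classes}
    H = 0.0
    for labels in node_labels:
        counts = Counter(labels)
        for y, count in counts.items():
            f_vy = count / len(labels)   # f(v, y): fraction of the node with label y
            H += f_vy / N[y]
    return H                             # equals |Y| when every node is pure
```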
The dataset was originally evaluated in [26] where the original authors predicted three different outcomes 34 Model Random Forest (n=5) SVM (linear, c=1) SVM (linear, c=10) SVM (rbf) KNN (n=5) Ridge (alpha=1) Ridge (alpha=10) Logistic Regression Random Forest (n=5) SVM (linear, c=1) SVM (linear, c=10) SVM (rbf) KNN (n=5) Ridge (alpha=1) Ridge (alpha=10) Logistic Regression Model Random Forest (n=5) SVM (linear, c=1) SVM (linear, c=10) SVM (rbf) KNN (n=5) Ridge (alpha=1) Ridge (alpha=10) Logistic Regression Iris Mapper Object 0 Original Data 92.5% 95.6 95.8 95.4 94.7 82.9% 80.8% 95.9% Iris Mapper Object 1 94.3% 96.6 95.1 95.7 95.2 82.4% 83.6% 93.5% Iris Mapper Object 2 Original Data 96.2% 97.3 97.3 97.4% 97.2% 79.7% 74.9% 97.1% Eigenvector 93.5% 93.5 92.9 93.3 93.5 93.1% 93.2% 93.1% 94.1% 93.6 93.1 92.4 84.0% 94.0% 93.9% 94.1% Eigenvector 94.8% 94.7 94.5% 96.5% 89.7% 94.6% 94.8% 94.4% Augmented 93.6% 93.8 93.5 93.3 93.5 93.3% 93.4% 93.2% 94.7% 93.6 92.2 92.8 92.2% 94.1% 93.9% 93.6% Augmented 95.1% 94.6 94.7% 96.5% 95.5% 94.6% 94.8% 94.5% Table 3.2 Accuracy from Iris mapper models. using a variation of machine learning methods. The variables are in Table A.1 taken from [26] and important to understanding both the original classification methods used and those explored in this paper. The original analysis predicted G3 using all variables, using all variables except G2, and using all variables except G1 and G2. The third problem, predicting the final grade (G3) without first (G1) and second (G2) period grades was much harder, and the focus of the predictive mapper analysis. Data was prepared for analysis by creating indicator 35 variables for each of the binary outcomes, creating a binary indicator for pass or fail for G3, and oversampling the failed class. In the original dataset of 649 observations, 100 of the observations were classified as fail for G3, or 15.4 % of the sample having failed. Data was split into a training and testing of 50% each, and then the failed class was oversampled in the training set to create a balanced dependent variable. The testing set was not modified. The 50% of the data used as the training set was used to build a mapper object and was used to train various models. To create some variability, data from the held out testing set was included to train the model (50%) and not used to report on model accuracy, and the average of 100 runs is reported. To also create variation in the mapper graphs, there were 12 different mapper graphs generated and evaluated, all with different datasets. For direct comparisons, mapper graphs were created with the same parameters from different data. The parameters of each experiment is outlined in Table 3.3. Full results are included here for trials 0-5. Trial Filter PCA 0 PCA 1 PCA 2 Eccentricity (5) 3 Eccentricity (5) 4 Eccentricity (5) 5 tsneX (5) 6 tsneX (5) 7 tsneX (5) 8 tsneX (5) 9 tsneX (5) 10 tsneX (5) 11 Partitions Overlap 8 8 8 8 8 8 8 8 8 12 12 12 50% 50% 50% 25% 25% 25% 50% 50% 50% 30% 30% 30% Original Nodes 20 (H = 1.29) 16 (H = 1.14) 20 (H = 1.28) 20 (H = 1.27) 21 (H = 1.32) 22 (H = 1.18) 24 (H = 1.34) 22(H = 1.24) 18 (H = 1.26) 20 (H = 1.07) 14 (H = 1.03) 16 (H = 1.02) Optimized Nodes 28 (H = 1.25) 26 (H = 1.09) 26 (H = 1.22) 28 (H = 1.26) 29 (H = 1.26) 28 (H = 1.14) 38 (H = 1.30) 33 (H = 1.21) 25 (H = 1.21) 49 (H = 1.20) 37 (H = 1.10) 44 (H = 1.08) Table 3.3 Parameters for student performance data mapper graph comparison. 
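For reference, the 50/50 split with oversampling of the failed class described above could be implemented along the following lines; the binary target column name and the random seed are illustrative assumptions, not the thesis code.

```python
# Sketch of the 50/50 split with oversampling of the failed class described
# above; the target column name "G3_fail" and the seed are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

def split_and_oversample(df: pd.DataFrame, target: str = "G3_fail", seed: int = 0):
    # 50% of the data is held out for testing and left untouched.
    train, test = train_test_split(df, test_size=0.5, stratify=df[target],
                                   random_state=seed)
    majority = train[train[target] == 0]
    minority = train[train[target] == 1]
    # Oversample the failed class in the training set only.
    minority_up = resample(minority, replace=True, n_samples=len(majority),
                           random_state=seed)
    balanced_train = pd.concat([majority, minority_up]).sample(frac=1,
                                                               random_state=seed)
    return balanced_train, test
```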
The standard predictive mapper pipeline was followed for the student performance dataset, and the optimality of each graph was measured with 𝐻(𝑀, π‘Œ). The same optimization process followed for the iris dataset was used, and for many of the graphs it did not result in a more optimal mapper graph. To create a direct comparison to the iris dataset, one can normalize by the number of classes. The iris mapper graphs all had a normalized optimal score of 1. The score for the student performance dataset in any mapper graph, optimized or not, was not close to that range: the minimum normalized score was .5 and the maximum was .67. Based on the normalized optimal scores, it is also doubtful there is a relevant difference between features created from the original and the optimized mapper graphs for this dataset.

Using the optimized mapper graphs, a total of 8 machine learning models were run, and the partition of unity matrix before the eigenvector change of basis was also used as a feature set. "Eigenvector +" denotes the original dataset augmented with eigenvector features, and "Partition +" denotes the original dataset augmented with the features from the partition of unity. While evaluating this dataset initially, the eigenvector features performed intermittently in terms of accuracy, but very consistently predicted that all students passed. When this was observed, the partition of unity features were also evaluated to better understand whether they could be useful in prediction. Based on the lack of performance from the eigenvector features for this dataset, the parameters used to compute the partition of unity were modified to see if better-performing features could be obtained. Given this, the student dataset results should absolutely be treated as exploratory in nature.

Trials 0 - 2 For the first three trials (trials 0, 1, and 2) the first principal component was used as a filter for creation of the mapper object. Results for these trials are shown in Table 3.4. In two of the three mapper graphs, some combination involving the eigenvector features or partition of unity features outperformed the original dataset features alone for all models. In the results table, βˆ— denotes no recall for the failed class in the dataset (𝐺3 = 1). I see this happening specifically with the eigenvector features.
This suggests there may be 37 Student performance Mapper Graph 0 Eigenvector Partition Eigenvector+ Partition+ Model Random Forest (n=5) Random Forest (n=50) Random Forest (n=100) SVM (rbf) KNN (n=5) Ridge (alpha=1) Ridge (alpha=10) Logistic Regression Original Data 82.7% 84.1% 84.3% 83.2% 82.7% 81.4% 81.5% 81.7% 78.3% 79.0% 80.0% 84.6%βˆ— 78.3% 84.6%βˆ— 84.6%βˆ— 84.6%βˆ— 83.1% 81.0% 84.4% 81.5% 81.4% 84.7% 85.6% 82.3% 81.0% 83.1% 85.6% 81.5% 85.5% 81.6% 85.3% 81.7% Student performance Mapper Graph 1 Random Forest (n=5) Random Forest (n=50) Random Forest (n=100) SVM (rbf) KNN (n=5) Ridge (alpha=1) Ridge (alpha=10) Logistic Regression 83.5% 83.4% 85.7% 83.6% 85.8% 83.7% 88.0% 84.6%βˆ— 83.6% 83.4% 85.0 81.2% 85.0% 81.4% 84.9% 81.4% 83.0% 83.1% 83.2% 84.6%βˆ— 83.0% 84.9% 84.9% 84.9% Student performance Mapper Graph 2 Random Forest (n=5) Random Forest (n=50) Random Forest (n=100) SVM (rbf) KNN (n=5) Ridge (alpha=1) Ridge (alpha=10) Logistic Regression 81.7% 84.1% 84.3% 83.5% 81.7% 77.4% 77.4% 77.9% 80.6% 80.9% 80.8% 82.8% 80.6% 84.3 84.3% 84.3% 80.9% 81.2% 81.2% 83.1% 80.9% 84.3% 84.3% 84.3% 84.5% 85.2% 85.0% 85.5% 84.5% 84.8% 85.2% 84.2% 83.3% 84.3% 84.3% 85.0% 83.3% 85.4% 85.4% 83.6% 83.5% 84.5% 84.5% 85.1% 83.5% 82.5% 82.6% 81.1% 84.3% 85.4% 85.3% 85.4% 84.3% 84.8% 85.2% 84.2% 83.3% 84.5% 84.5% 85.0% 83.3% 85.4% 85.4% 83.6% Table 3.4 Results from trials 0, 1, and 2 for the student performance dataset models using mapper. information loss in the predictive mapper pipeline that is useful for this machine learning task. Each of the mapper graphs had slightly different features, and the original mapper graph for a representative graph along with the optimized mapper graphs is shown in Figure 3.15, with the other two for this trial in the Appendix as Figure A.1 and Figure A.2. Trial 3 - 5 For the next trials, eccentricity was used as a filter for creation of the mapper 38 (a) Original mapper graph. (b) Optimized mapper graph. Figure 3.15 Mapper graphs for Experiment 0 before and after optimization. For this example, the typical optimization work flow was not successful. object. A representative optimized mapper graph is shown in Figure 3.16, with the other two in the Appendix as Figure A.3 and Figure A.4. (a) Original mapper graph. (b) Optimized mapper graph. Figure 3.16 Mapper graphs for Experiment 3 before and after optimization. For this example, the typical optimization work flow was not successful. Results are shown in Table 3.5. Note the βˆ— designated denotes the feature set and model were not able to identify any failed observations in the data. For the third mapper graph random forest models, the partition+ features seem to perform better than the original dataset. This may be due to the original dataset overfitting the models, and the partition features underfitting the models. I also observed the eigenvector features for the third mapper graph is not able to recall any of the failed class, further supporting this graph may 39 Student performance Mapper Graph 3 Eigenvector Partition Eigenvector+ Partition+ be underfitting the data. 
Model Random Forest (n=5) Random Forest (n=50) Random Forest (n=100) SVM (rbf) KNN (n=5) Ridge (alpha=1) Ridge (alpha=10) Logistic Regression Original Data 82.7% 84.5%βˆ— 84.5%βˆ— 83.8% 84.5%βˆ— 83.9% 84.6%βˆ— 82.7% 84.6%βˆ— 82.1% 84.6%βˆ— 78.2% 84.6%βˆ— 78.1% 84.6%βˆ— 78.9% 71.6% 71.3% 71.3% 81.2% 71.6% 81.1% 81.1% 80.6% 82.7% 83.8% 83.8% 81.8% 82.7% 80.5% 80.5% 81.2% Student performance Mapper Graph 4 Random Forest (n=5) Random Forest (n=50) Random Forest (n=100) SVM (rbf) KNN (n=5) Ridge (alpha=1) Ridge (alpha=10) Logistic Regression 82.7% 83.4% 85.9% 83.2% 86.0% 83.4% 82.7% 88.1% 83.6% 83.4% 84.6%βˆ— 81.3% 84.6%βˆ— 81.4% 84.6%βˆ— 81.4% 83.4% 81.3% 85.2% 85.8% 86.0% 81.4% 88.6% 75.7% 81.3% 83.4% 83.4% 83.2% 83.4% 83.2% 83.5% 82.0% Student performance Mapper Graph 5 Random Forest (n=5) Random Forest (n=50) Random Forest (n=100) SVM (rbf) KNN (n=5) Ridge (alpha=1) Ridge (alpha=10) Logistic Regression 84.6%βˆ— 82.7% 85.6% 84.6%βˆ— 84.6%βˆ— 85.9% 84.6%βˆ— 85.4% 84.6%βˆ— 82.7% 84.6%βˆ— 80.0% 84.6%βˆ— 80.0% 84.6%βˆ— 78.6% 81.5% 81.5% 81.4% 84.6%βˆ— 81.5% 70.2% 84.6%βˆ— 84.4% 83.2% 85.6% 85.8% 85.5% 83.2% 81.6% 81.2% 81.4% 82.7% 84.2% 84.3% 84.1% 82.7% 84.1% 84.0% 82.2% 84.1% 86.0% 86.0% 88.5% 84.1% 83.4% 84.3% 83.4% 83.3% 85.2% 85.3% 85.4% 83.3% 84.5% 84.2% 84.2% Table 3.5 Results from trials 3, 4, and 5 for the student performance dataset models using mapper. Trial 6 - 12 Additional models were run with this set of mapper graphs to allow for more extensive evaluation of performance. The mapper graphs were created using tSne as a filter, and the focus was on creating nodes in the mapper graph that are more heavily weighted to either passing or failing. Before and after optimization of an example mapper graph is shown in Figure 3.17. 40 (a) Original mapper graph. (b) Optimized mapper graph. Figure 3.17 Mapper graphs for Experiment 6 before and after optimization. For this example, the typical optimization work flow was not successful. Additional figures (Figure A.5 and Figure A.6 and results from these trials can be found in the Appendix. Figures from mapper graphs 9, 10, 11 are available as Figure A.7, Figure A.8 and Figure A.9 respectively. These results do not differ from the mapper graphs created using other filters, and the eigenvector features are unable to identify the failing class. Results for these are not formally reported as they do not change the interpretation or outcome. 3.7 Computation Work for predictive mapper was completed in python. This was done through a combi- nation of custom functions for prediction and an extension of Kepler mapper [27]. Mapper graphs and notebooks are available for models run for the Iris dataset. Future work could include the incorporation of these functions into the Kepler mapper software. 3.8 Discussion and Limitations Results of predictive mapper for the Iris dataset show promise, and further research is needed into similar use cases where predictive mapper may have usefulness. Based on the student performance dataset, predictive mapper may not be useful with unbalanced datasets and machine learning models where recall is a difficult task. The student performance dataset may also require specialized filters that extract the relevant information from the 41 dataset. Even with significant modification of features in an attempt to extract better performance, predictive overall using eigenvector or partition of unity features was not useful. 
It is important to note that creating an apples-to-apples comparison of model performance between predictive mapper features and the original dataset features has the potential for bias, based on the way the mapper models are created and the need to have a similar amount of information available for each model. To minimize this issue, multiple mapper models were built using the same parameters for each dataset from different random samples of data. Model parameters were not optimized during model fitting beyond running with a few different parameters, and accuracy for each model is reported using the same parameters for both the original dataset and the predictive mapper features. This could significantly impact the performance of both the original and predictive mapper models.

A main area with significant impact on the performance of the predictive mapper models is the creation of the mapper graph used to create features, along with the parameters for feature creation. For the development of each mapper graph used to assess model performance, subjective decisions were made to optimize model performance, with the end goal of developing useful features for machine learning. To create predictive features, decisions on mapper graph parameters and optimization, as well as a suitable way to map test observations onto the mapper object, must be made; these decisions are subjective and depend heavily on the dataset. An average or stable mapper graph could be a potential avenue to explore to mitigate current issues with individual mapper graphs in predictive pipelines.

Related work on Laplacian eigenvectors was done in [28]. This work crafts an algorithm for eigenvector cascading, and shows that important dataset features persist over different scales. The idea would need to be adapted across samples of data rather than across scale, although scale could also be a relevant feature.

Outstanding work in the predictive mapper algorithm includes additional work on creating an optimal mapper graph through filters and additional localized clustering, and exploration of performance with different probability distribution functions and additional data sets. The ideal goal is to find specific areas where mapper features can provide useful predictive information that is not available in the original feature set, and high dimensional datasets may benefit from an additional feature reduction technique. Based on prior work with Laplacian eigenmaps, machine learning problems requiring a reduced feature set may be a setting where predictive mapper would be a good solution. Further research into how optimal a mapper graph can be before overfitting becomes an issue for modeling is also needed. A formal assessment of how well a particular mapping from 𝑋 β†’ | ¯𝑀 | performs for a dataset would be a final area of investigation to increase the utility of predictive mapper. Further notes for another potential application are below; this is an addition that was discovered near the completion of this dissertation.

3.8.1 Potential for Clustering Applications

An additional, barely explored avenue for predictive mapper could be as an algorithm for clustering and assigning new data points to existing clusters. Although in a very preliminary stage, predictive mapper seems to outperform spectral clustering in identifying classes for noisy half-moon shapes. An example is shown in Figure 3.18 of both the original data and the mapper graph with complete separation.
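The spectral clustering baseline used in the comparison reported next could be set up roughly as follows; the sample size and noise level below are illustrative assumptions, not the data shown in Figure 3.18.

```python
# Sketch of a spectral clustering baseline on noisy half moons (sample size
# and noise level are assumptions, not the data shown in Figure 3.18).
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score

X, y = make_moons(n_samples=500, noise=0.35, random_state=0)

labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                            n_neighbors=10, random_state=0).fit_predict(X)

# Cluster labels are arbitrary, so report the better of the two label matchings.
acc = max(accuracy_score(y, labels), accuracy_score(y, 1 - labels))
print(f"spectral clustering accuracy: {acc:.2f}")
```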
Using spectral clustering (nearest neighbors = 10), 81% of the points in Figure 3.18a were identified correctly, while predictive mapper with 50% of the data for mapper graph creation and an additional 25% of the data as training data provided an accuracy of 84% (depending on the method used). It should be noted that this work is very preliminary, and it is not assumed that different clustering parameters would not yield better results than predictive mapper. Additional work is needed here and could be of interest.

(a) Point cloud of half moon. (b) Mapper graph of half moon point cloud.

Figure 3.18 Example of half moon.

3.9 Conclusion

Some areas and applications of predictive mapper may be useful in machine learning models, and in the not-yet-explored avenue of an extension to clustering algorithms. The development of the predictive mapper pipeline for use with the statistical construction of mapper provides another avenue for researchers to explore in topological data analysis, and may be improved by other work in mapper graph optimization.

CHAPTER 4
FEATURE GENERATION FOR MACHINE LEARNING

Many and varied methods currently exist for featurization, which is the process of mapping persistence diagrams to Euclidean space with the goal of maximally preserving structure. However, to our knowledge, there are presently no methodical comparisons of existing approaches, nor a standardized collection of test data sets. This chapter provides a comparative study of several such methods, and is taken largely from my paper, A Comparative Study of Machine Learning Methods for Persistence Diagrams [3]. This paper was co-authored by Dr. Luis Polanco and Dr. Jose Perea.

In this chapter we review, evaluate and compare the stable multi-scale kernel, persistence landscapes, persistence images, Carlsson Coordinates - the ring of algebraic functions, template functions, and adaptive template systems. Using these approaches for feature extraction, we apply and compare popular machine learning methods on five data sets: MNIST, Shape Retrieval of Non-Rigid 3D Human Models (SHREC14), extracts from the Protein Classification Benchmark Collection (Protein), MPEG7 shape matching, and the HAM10000 skin lesion data set. These data sets are commonly used with the above featurization methods, and we use them to evaluate predictive utility in real-world applications.

4.1 Introduction

Persistence diagrams are a popular tool in Topological Data Analysis, with recent interest in feature creation for machine learning tasks and success in several applications to science and engineering. Many methods exist for the vectorization of persistence diagrams, but a comprehensive evaluation of these methods is largely unexplored. Our goal here is to contribute to the comparative analysis of machine learning methods with persistence diagrams.

Starting with topological descriptors of datasets, in the form of persistence diagrams, we provide examples and methodology to create features from these diagrams to be used in machine learning algorithms. We provide the necessary background and mathematical justification for six different methods (in chronological order): the Multi-Scale Kernel, Persistence Landscapes, Persistence Images, Adcock-Carlsson Coordinates, Template Systems, and Adaptive Template Systems. To thoroughly evaluate these methods, we have reviewed five different data sets along with the relevant methods to compute persistence diagrams from them.
The datasets, persistence diagrams, and code to compute the persistence diagrams are readily available for academic use. As part of this review, we also provide a user guide for these methods, including comparisons and evaluations across the different types of datasets. After computing the six types of features, we compared the predictive accuracy of a ridge regression, random forest, and support vector machine model to assess the type of featurization that is most useful in predictive models. The code developed for this analysis is available, with some functions developed specifically for use in machine learning applications, and easy-to-use jupyter notebooks showing examples of each function with multiple dataset types.

Of these methods, Persistence Landscapes, Adcock-Carlsson Coordinates, and Template Systems are quite accurate and create features for large datasets quickly. Adaptive Template Systems and Persistence Images took somewhat longer to run; however, the Adaptive Template Systems featurization method did improve accuracy over other methods. The Multi-Scale Kernel was the most computationally intensive, and during our evaluation we did not observe instances of it outperforming other methods.

4.2 Background

In order to develop the mathematical foundations needed for doing machine learning with persistence diagrams, it has been informative to first study the structure of the space they form. Additional background pertinent to this chapter and important for understanding properties of persistence methods is included here. For this background we follow [29].

Definition 4.2.1 (Wasserstein distance) The 𝑝th Wasserstein distance between two persistence diagrams 𝑑1 and 𝑑2 is defined as

W_p(d_1, d_2) = \left( \inf_{\gamma} \sum_{x \in d_1} \lVert x - \gamma(x) \rVert_\infty^p \right)^{1/p}     (4.1)

where 𝛾 ranges over all bijections from 𝑑1 to 𝑑2. The set of bijections is nonempty because of the diagonal. A related distance is the bottleneck distance, which is the Wasserstein distance with 𝑝 = ∞.

The space of persistence diagrams also needs to be complete to be appropriate for statistical inference, so the following definitions are key.

Definition 4.2.2 (Persistence Diagram) A generalized persistence diagram is a countable multiset of points in R2 along with the diagonal β–³ = {(π‘₯, 𝑦) ∈ R2 | π‘₯ = 𝑦}.

Definition 4.2.3 (Total persistence) The degree-𝑝 total persistence of a diagram 𝑑 is defined as

\mathrm{Pers}_p(d) = \sum_{x \in d} \mathrm{pers}(x)^p     (4.2)

Definition 4.2.4 (Space of persistence diagrams) We define the space of persistence diagrams as

D_p = \{ d \mid W_p(d, d_\emptyset) < \infty \} = \{ d \mid \mathrm{Pers}_p(d) < \infty \}     (4.3)

The space of persistence diagrams is complete and separable. Two key theorems also provide stability for both sublevel set persistence and persistence computed using Ripser. The stability theorem of [30] for sublevel set persistence states that if X is a finitely triangulated space and 𝑓, 𝑔 : X β†’ R are tame and continuous, then

d_B(\mathrm{dgm}_n(f), \mathrm{dgm}_n(g)) \leq \lVert f - g \rVert_\infty

for every integer 𝑛 β‰₯ 0. We note that the theorem is still true if continuous is replaced by piecewise linear. Similarly, if (𝑋, d𝑋) and (π‘Œ, dπ‘Œ) are finite metric spaces, then the stability of Rips persistent homology [31, Theorem 5.2] says that

d_B(\mathrm{dgm}^R_n(X), \mathrm{dgm}^R_n(Y)) \leq 2\, d_{GH}(X, Y)

where 𝑑𝐺𝐻(Β·, Β·) denotes the Gromov-Hausdorff distance [32]. When we refer to featurization methods as stable, this is usually in regards to the Wasserstein distance or bottleneck distance.
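In practice these distances can be computed directly from diagram arrays; the toy example below uses the persim package, which is an assumption here and not the code used elsewhere in this chapter.

```python
# Toy example of computing bottleneck and 1-Wasserstein distances between two
# small diagrams with persim (persim is an assumption; it is not the code used
# elsewhere in this chapter).
import numpy as np
import persim

d1 = np.array([[0.0, 1.0], [0.2, 1.5]])   # birth-death pairs
d2 = np.array([[0.0, 1.1], [0.3, 0.9]])

print("bottleneck distance:", persim.bottleneck(d1, d2))
print("1-Wasserstein distance:", persim.wasserstein(d1, d2))
```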
Doing statistics and machine learning directly on the space of persistence diagrams turns out to be quite difficult. Indeed, (D, 𝑑𝐡) does not have unique geodesics, and thus the FrΓ©chet mean of general collections of persistence diagrams is not unique [33]. Since computing averages, and doing linear algebra on persistence diagrams in general, is not available, several authors have proposed mapping (D, 𝑑𝐡) to topological vector spaces where further analysis can be done. These methods are the main focus of this review. The theory of vectorization of persistence diagrams is an active area of research, with recent results showing the impossibility of full embeddability. Indeed, even though the space of persistence diagrams with exactly 𝑛 points can be coarsely embedded in a Hilbert space [34], this ceases to be true if the number of points is allowed to vary (see [35] and [36]). That said, partial featurization is still useful, as we will demonstrate here.

4.3 Featurization Methods

For each of the methods below, we start with a collection of persistence diagrams. A persistence diagram can be represented in either the birth-death plane or the birth-lifetime plane; some methods will require birth-death coordinates and others will require birth-lifetime coordinates. The birth-death plane is the representation pair (π‘₯, 𝑦) where π‘₯ is the time of birth, and 𝑦 is the time of death of the feature in the persistence diagram. The birth-lifetime plane can be defined as the collection of points (π‘₯, 𝑦 βˆ’ π‘₯), where (π‘₯, 𝑦) is in birth-death coordinates. In this manner, we define lifetime as the persistence 𝑦 βˆ’ π‘₯ of a feature (π‘₯, 𝑦). The persistence diagrams of a particular geometric object can be calculated in a variety of ways, which will be made explicit for each dataset at the time of evaluation.

4.3.1 Multi-Scale Kernel

The Multi-Scale Kernel of [37] defines a kernel over the space of persistence diagrams, which can then be used in various types of kernel learning methods. In general, a kernel π‘˜ is by definition a symmetric and positive definite function of two variables. Mathematically, from [37], given a set 𝑋, a function π‘˜ : 𝑋 Γ— 𝑋 β†’ R is a kernel if there exists a Hilbert space 𝐻, called the feature space, and a map Ξ¦ : 𝑋 β†’ 𝐻, called the feature map, such that π‘˜(π‘₯, 𝑦) = ⟨Φ(π‘₯), Ξ¦(𝑦)⟩𝐻 for all π‘₯, 𝑦 ∈ 𝑋. The kernel induces a distance on 𝑋 defined as

d_k(x, y) = \left( k(x, x) + k(y, y) - 2k(x, y) \right)^{1/2} = \lVert \Phi(x) - \Phi(y) \rVert_H.

The authors of [37] propose a multi-scale kernel on D as follows. Given 𝐹, 𝐺 ∈ D, the persistence scale space kernel π‘˜πœŽ is

k_\sigma(F, G) = \langle \Phi_\sigma(F), \Phi_\sigma(G) \rangle_{L^2(\Omega)}     (4.4)

where Φ𝜎 : D β†’ 𝐿2(Ξ©) is the associated feature map, and Ξ© βŠ‚ R2 is the closed half-plane above the diagonal. Deriving the solution of a distribution-analogue of the heat equation with the boundary conditions in Definition 1 of [37], the closed form expression of the multi-scale kernel is

k_\sigma(F, G) = \frac{1}{8\pi\sigma} \sum_{p \in F,\, q \in G} \left( e^{-\frac{\lVert p - q \rVert^2}{8\sigma}} - e^{-\frac{\lVert p - \bar{q} \rVert^2}{8\sigma}} \right)

where if π‘ž = (π‘Ž, 𝑏), then Β―π‘ž = (𝑏, π‘Ž). The multi-scale kernel is shown to be stable with respect to the 1-Wasserstein distance by Theorem 2 of [37], which is a desirable property for classification algorithms. However, by Theorem 3 of [37], the multi-scale kernel is not stable in the Wasserstein sense for 1 < 𝑝 ≀ ∞.
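The closed form above translates directly into a few lines of numpy; the following is a naive sketch rather than the optimized implementation used later in this chapter.

```python
# Naive numpy implementation of the closed-form persistence scale space kernel
# above (an O(|F||G|) sketch, not the optimized code used for the results).
import numpy as np

def scale_space_kernel(F: np.ndarray, G: np.ndarray, sigma: float = 4.0) -> float:
    """F, G: (n, 2) arrays of birth-death pairs."""
    total = 0.0
    for p in F:
        for q in G:
            q_bar = q[::-1]  # reflection of q = (a, b) across the diagonal: (b, a)
            total += (np.exp(-np.sum((p - q) ** 2) / (8 * sigma))
                      - np.exp(-np.sum((p - q_bar) ** 2) / (8 * sigma)))
    return total / (8 * np.pi * sigma)
```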
4.3.2 Persistence Landscapes

Persistence landscapes are a mapping of persistence diagrams into a function space that is either a Banach space or a Hilbert space ([38]). Advantages of persistence landscapes are that they are invertible, stable, parameter-free, and nonlinear. Persistence landscapes can be computed from a persistence diagram as follows. From [38], for a persistence diagram 𝐷 = (π‘Žπ‘–, 𝑏𝑖)_{π‘–βˆˆπΌ} and for π‘Ž < 𝑏, let

f_{(a,b)}(t) = \max(0, \min(a + t, b - t))     (4.5)

and

\lambda_k(t) = \operatorname{kmax} \{ f_{(a_i, b_i)}(t) \}_{i \in I}     (4.6)

with kmax denoting the π‘˜th largest element. The persistence landscape is the sequence of piecewise linear functions πœ†1, πœ†2, ... : R β†’ R. Bubenik shows desirable properties for working with persistence landscapes in statistical modeling, in particular that even if unique means do not exist in the set of persistence diagrams, persistence landscapes do have unique means and the mean landscape converges to the expected persistence landscape. Figure 4.1 shows an example of persistence landscapes from the MPEG7 dataset, described in the data section.

Figure 4.1 Persistence Landscapes from the MPEG7 dataset to show differences in features. Each color corresponds to a different landscape, i.e. πœ†π‘˜ for π‘˜ = 1, 2, 3.

4.3.3 Persistence Images

From [39], persistence images are a mapping sending a persistence diagram to an integrable function, called a persistence surface. Fixing a grid on R2, the integral over this grid yields pixel values forming the persistence image. Advantages of persistence images include a representation in R𝑛, stability, and ease of computation. When calculating the persistence image, a resolution, a distribution, and a weighting function are required as parameters. It is worth noting that the resolution (i.e., number of pixels) determines the number of features computed by the persistence image.

More explicitly, let 𝐷 be a persistence diagram in birth-lifetime coordinates. We take πœ™π‘’ : R2 β†’ R to be a differentiable probability distribution. Using, for instance, the Gaussian distribution with mean 𝑒 and variance 𝜎2 we have

\phi_u(x, y) = \frac{1}{2\pi\sigma^2} e^{-[(x - u_x)^2 + (y - u_y)^2]/2\sigma^2}

The persistence surface 𝜌𝐷 : R2 β†’ R is the function

\rho_D(z) = \sum_{u = (x,\, y - x) \in D} f(u)\, \phi_u(z)

with 𝑓 : R2 β†’ R a nonnegative weighting function that is zero along the horizontal axis, continuous, and piecewise differentiable. For example, 𝑓(π‘₯, 𝑦) = 𝑦 satisfies this requirement and is frequently used. The persistence image is then the collection of pixel values

I(\rho_D)_p = \iint_p \rho_D \, dy \, dx,

where the integration is over each pixel 𝑝 of the fixed grid on R2. This creates an image depicting high and low density areas in the defined grid, which is represented as a high-dimensional vector for use in machine learning algorithms. An example is shown in Figure 4.2, taken from the MNIST dataset.

4.3.4 Adcock-Carlsson Coordinates: The ring of algebraic functions on persistence diagrams

This method is explored by [40], where the authors highlight the fact that any persistence diagram with exactly 𝑛 points can be described by a vector of the form (π‘₯1, 𝑦1, π‘₯2, 𝑦2, . . . , π‘₯𝑛, 𝑦𝑛)
This fea- turization method also appears in other applications, such as brain imaging [41], chatter identification [42], image classification [40], and also led to related work with additional stability properties [43]. From [40] the general form below is used for feature creation for Adcock-Carlsson coordinates. π‘π‘Ž,𝑏 = βˆ‘οΈ 𝑖 (π‘₯𝑖 + 𝑦𝑖)π‘Ž (𝑦𝑖 βˆ’ π‘₯𝑖)𝑏 for integers π‘Ž β‰₯ 0 and 𝑏 β‰₯ 1. Adcock-Carlsson coordinates that both emphasized more persistent features, and took into account all types of features were desired. Based on this, the following features for both the 0-dimensional and 1-dimensional persistence diagrams meet this criteria, as suggested 52 in [40]: π‘₯𝑖 (𝑦𝑖 βˆ’ π‘₯𝑖) βˆ‘οΈ 𝑖 (π‘¦π‘šπ‘Žπ‘₯ βˆ’ 𝑦𝑖) (𝑦𝑖 βˆ’ π‘₯𝑖) βˆ‘οΈ 𝑖 βˆ‘οΈ 𝑖 𝑖 (𝑦𝑖 βˆ’ π‘₯𝑖)4 π‘₯2 (π‘¦π‘šπ‘Žπ‘₯ βˆ’ 𝑦𝑖)2(𝑦𝑖 βˆ’ π‘₯𝑖)4. βˆ‘οΈ 𝑖 (4.7) (4.8) (4.9) (4.10) The theoretical motivation for Adcock-Carlsson Coordinates is as follows. Define 𝑆 𝑝𝑛 (R2) as the 𝑛-fold symmetric product of R2 (unordered 𝑛-tuples). We note the set of persistence diagrams with exactly n points of the form {(π‘₯1, 𝑦1), ...., (π‘₯𝑛, 𝑦𝑛)} is a subset of 𝑆 𝑝𝑛 (R2). The inclusions 𝑆 𝑝𝑛 (R2) ↩→ 𝑆 𝑝𝑛+1(R2) thus produce an inverse system of affine coordinate rings Β· Β· Β· β†’ 𝐴[𝑆 𝑝𝑛+1(R2)] β†’ 𝐴[𝑆 𝑝𝑛 (R2)] β†’ Β· Β· Β· which provide the basis for studying algebraic functions on the space of persistence dia- grams. With this setting in mind, the main goal of [40] is to determine free generating sets for the subalgebra of 𝐴[𝑆 π‘βˆž(R2)] comprised of elements which are invariant under adjoining a point of zero persistence to a persistence diagram. The following theorem is an answer to this question (see Theorem 1 [40]). Theorem 6 The subalgebra of 2-multisymmetric functions invariant under adding points with zero persistence, is freely generated over R by the set of elements of the form π‘π‘Ž,𝑏 = βˆ‘οΈ 𝑖 (π‘₯𝑖 + 𝑦𝑖)π‘Ž (𝑦𝑖 βˆ’ π‘₯𝑖)𝑏 for integers π‘Ž β‰₯ 0 and 𝑏 β‰₯ 1. These are the features we call Adcock-Carlsson coordinates. 53 4.3.5 Template Systems As with other methods, the goal of template systems as a featurization method is to embed the space of persistence diagrams into a real-valued vector space since persistence diagrams cannot be used directly in machine learning models. Template systems additionally separate the space of persistence diagrams, which is desired for classification accuracy. Template systems first start with a template function, which is any function on R2 that is continuous, and has compact support contained within the upper half plane, W := R Γ— R>0. Given a template function 𝑓 : W β†’ R and a persistence diagram, 𝐷, the template function becomes a function on persistence diagrams [44]: 𝜈 𝑓 (𝐷) := 𝑓 (x). βˆ‘οΈ x∈𝐷 A template system is a collection of template functions that separates points on the per- sistence diagrams. The specific template functions I used for the review are tent functions, where given a point a = (π‘Ž, 𝑏), and a radius 𝛿, define 𝑔a,𝛿 (π‘₯, 𝑦) = |1 βˆ’ 1 𝛿 max{|π‘₯ βˆ’ π‘Ž|, |𝑦 βˆ’ 𝑏|}|+ so that with a persistence diagram 𝐷 = x = (π‘₯𝑖, 𝑦𝑖), the value of the tent function is 𝐺a,𝛿 (𝐷) = 𝑔a,𝛿 (x). βˆ‘οΈ x∈𝐷 To compute the values, a grid size and associated parameters are chosen so that 𝑔 is compactly supported in W. This is illustrated in Figure 4.3. 
The advantage of working with these template systems is that they can be used to approximate real-valued functions on the space of persistence diagrams as proven by the following theorem (see Theorem 29 [45]). Denote 𝐢𝑐(W) to be the set of compactly supported continuous functions, D the space of persistence diagrams and 𝐢(D, R) the set of continuous functions from D to R.

(a) Birth-Death Coordinates. (b) Birth-Persistence. (c) Adaptive Template Systems.

Figure 4.3 Figures 4.3a and 4.3b show the grid required for the computation of tent functions in different coordinates, and Figure 4.3c shows adaptive template systems.

Theorem 7 Let T βŠ‚ 𝐢𝑐(W) be a template system for D, let C βŠ‚ D be compact, and let 𝐹 : C β†’ R be continuous. Then for every πœ– > 0 there exist 𝑁 ∈ N, a polynomial 𝑝 ∈ R[π‘₯1, ..., π‘₯𝑁] and template functions 𝑓1, . . . , 𝑓𝑁 ∈ T so that

\left| p\big( \nu(D, f_1), \ldots, \nu(D, f_N) \big) - F(D) \right| < \epsilon

for every 𝐷 ∈ C. That is, the collection of functions of the form 𝐷 β†’ 𝑝(𝜈(𝐷, 𝑓1), . . . , 𝜈(𝐷, 𝑓𝑁)) is dense in 𝐢(D, R) with respect to the compact-open topology.

4.3.6 Adaptive Template Systems

The Adaptive Template Systems methodology of [46] concerns itself with improving and furthering some of the work presented in [45] and [47]. The goal is to produce template systems that are adaptive to the input data set and the supervised classification problem at hand. One shortcoming of template systems, like tent functions, when applied to Theorem 7 is that without prior knowledge about the compact set C βŠ‚ D, the number of template functions that carry no information relevant to the problem can be high. By reducing this overhead, adaptive templates improve the computation times and accuracy in some specific problems.

The relationship between template systems and adaptive template systems is demonstrated in Figure 4.3, showing the adaptive template systems depend on density of data. To do so, given a compact set C βŠ‚ D we consider the set

S = \bigcup_{D \in \mathcal{C}} D \subset \mathbb{W}

along with different algorithms such as Gaussian mixture models (GMM) ([48]), Hierarchical density-based spatial clustering of applications with noise (HDBSCAN) ([14]) and Cover-Tree Entropy Reduction (CDER) ([49]) to define a family of ellipsoidal domains

\{ \mathbf{z} \in \mathbb{R}^2 : (\mathbf{z} - \mathbf{x})^* A (\mathbf{z} - \mathbf{x}) \leq 1 \}

in W, fitting the density distribution of 𝑆. Here 𝐴 is a 2 Γ— 2 symmetric matrix and x ∈ R2. Once this family of ellipsoidal domains is computed, we use them to define the following adaptive template functions:

f_A(\mathbf{z}) = \begin{cases} 1 - (\mathbf{z} - \mathbf{x})^* A (\mathbf{z} - \mathbf{x}) & \text{if } (\mathbf{z} - \mathbf{x})^* A (\mathbf{z} - \mathbf{x}) < 1 \\ 0 & \text{if } (\mathbf{z} - \mathbf{x})^* A (\mathbf{z} - \mathbf{x}) \geq 1. \end{cases}

4.4 Datasets

The five different datasets considered in this work were chosen from a collection of experiments presented in the literature of topological methods for machine learning. We acknowledge that this selection is inherently biased towards datasets with favorable performance with regards to specific topological methods. Nevertheless, we counterbalance this by applying all the evaluated featurization methods to all the data sets here considered and compare the classification results across all the presented methodologies. This comparative work showcases how the variation between methods results in the need for the user to find a suitable combination of featurization methods and parameter tuning to obtain optimal results in a given dataset.
As such, readers should view this as a resource for their own analysis, and not as a recommendation for specific techniques.

For all datasets and methods, parameter tuning was done using a grid search method on a subset of data that was not used to report final results, and parameters were chosen based on the performance of a ridge regression model, a random forest, and a support-vector machine (SVM) model. It is worth noting a weakness of the analysis: the same parameters were used in the feature set calculation for all reported models, and were run with a single split. This was due to the time required for feature calculation. The ridge regression and random forest classifier were run with default parameters, and the support-vector machine was run using the radial basis function (RBF) kernel with some tuning on the cost parameter (C). The exception is the Multi-Scale Kernel feature set, for which we only fit a support-vector machine model. Due to the computation times for the Multi-Scale Kernel and the direct use of the features in a kernel support-vector machine model, this was the most appropriate choice for model fitting. Preliminary results also showed that ridge regression did not provide additional accuracy for kernel methods. Each dataset was sampled to create training and testing sets for 10 or 100 trials depending on size, with the exception of the Protein Classification Dataset, which included indices for machine learning tasks. Random forest classifiers as presented in [50] are used to solve the same classification problems presented for each data set. Parameters such as the number of trees in each forest and the size of each tree are chosen based on performance and tuned on the testing set.

4.4.1 MNIST

The MNIST dataset from [4] is a database of 70,000 handwritten single digits of the numbers 0 to 9. An example image from the MNIST database is shown in Figure 4.4. The calculation of persistence diagrams for the MNIST dataset is as in [40]. The persistence diagrams are calculated using a "sweeping" motion in one of four directions: left to right, right to left, top to bottom, or bottom to top, with 0-dimensional and 1-dimensional persistence diagrams computed for each direction. This method creates a base of 8 different persistence diagrams to use in the application of methods. To compute this filtration, pixels are converted to a binary response with intensity calculated based on position. This has the effect that, depending on the direction of sweep, features will have different birth and death times, providing distinct information for each direction. Figure 4.4 shows the various calculations of persistence diagrams for an example number eight. Both 0-dimensional and 1-dimensional persistence diagrams were used for the MNIST dataset, noting that some observations did not have 1-dimensional persistence diagrams; these observations were filled with a single diagram with birth-death coordinates [0, .01].

Figure 4.4 A) Example number 8 from the MNIST dataset. B) The same number 8 after computing each of the 4 types of coordinate transforms to compute the persistence diagrams.

The number of topological features available for model fitting is dependent on the method. For the Persistence Images, Persistence Landscapes, and Template Systems there are 8 features each. The Multi-Scale Kernel produces 8 different kernel matrices, and for Adcock-Carlsson Coordinates, 32 different features were computed from these persistence diagrams.
For the MNIST dataset, a random sample of 1000 images was used to tune parameters, with 80% used for the training portion, and 20% used for the testing portion. We used the set of 60,000 images corresponding to the training set of MNIST to create our own training and testing sets for model fitting and evaluation. For this set of 60,000, 10 trials were run with an 80% training and 20% testing split to determine model performance. 4.4.2 SHREC14 We evaluated the SHREC 2014 dataset ([51]) in the same manner as the authors of [46]. To compute the topological features, the authors of [37] describe using a heat kernel signature to compute persistence diagrams for both the 0-dimensional and 1-dimensional 58 persistence diagrams. The kernel is computed, and then the persistence diagrams were computed using Dipha [52], which can directly handle meshes. We were provided the dataset as a set of persistence diagrams. The dataset consists of 15 different labels, corresponding to 5 different poses for the three classes of male, female, and child figures. As noted in [46], parameters in the dataset define different machine learning tasks due to a different calculation of the heat kernel signature, and for this evaluation we focused problem 6, which has the highest accuracy ranging from 89% - 92% as reported in[46]. For the SHREC14 dataset, a random sample of 90 images (30% of the data) was used to tune the model and determine appropriate model parameters. The remaining 210 observations were split into 80% training and 20% testing for 100 trials to report final model fit, which is listed below. Persistence diagrams for 0-dimensional homology and 1-dimensional homology were computed for this dataset. Table 4.3 shows complete results for the SHREC 2014 dataset. 4.4.3 Protein Classification We use the Protein Classification Benchmark dataset PCB00019 [53] as another type of data to evaluate the topological methods above. This specific set contains information for 1357 proteins corresponding to 55 classification problems, and we reported on 54 of the problems using one to tune parameters. The training and testing index were provided, and the mean accuracy was reported for both training and testing sets using these indices. Table 4.4 shows results from our experiments using the training and testing indices provided in the original dataset. Persistence diagrams for this dataset were computed for each protein by considering the 3-D structure (provided in [54]) as a point cloud in R3. This point cloud was built using the π‘₯, 𝑦 and 𝑧 position of each atom in the molecule at hand. With this information the persistent 0-dimensional and 1-dimensional homology is computed using Ripser [55]. 59 4.4.4 MPEG7 The mpeg-7 dataset from [56] is a database of object shapes in black and white, with 1400 shapes in 70 classes. An example from the original dataset is shown in Figure 4.5 along with the contour as described below. Figure 4.5 (A) An example apple from MPEG7 data. (B) An example image contour used for MPEG7. (C) The distance to mean point calculation used for sublevel set persistence. (D) Persistence diagrams from lower star persistence. (E) Persistence diagrams from the contour. To compute persistence diagrams, first the image contour is computed by placing ob- servations from the point cloud into a sequence. The distance curve is computed as the distance from the center of the sequence. Sublevel set persistence is taken using the com- puted distance curves as point cloud data. 
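A minimal sketch of how sublevel set persistence of such a distance curve could be computed, here treating the curve as a path and using gudhi's SimplexTree (gudhi is an assumption for illustration; the diagrams reported in this chapter were not necessarily produced this way):

```python
# Illustrative sketch of sublevel set persistence for a 1-D distance curve,
# treating the curve as a path and using gudhi's SimplexTree (an assumption;
# the reported diagrams were not necessarily computed this way).
import numpy as np
import gudhi

def sublevel_persistence(curve: np.ndarray):
    st = gudhi.SimplexTree()
    # Each vertex enters the filtration at its function value...
    for i, v in enumerate(curve):
        st.insert([i], filtration=float(v))
    # ...and each edge enters when both of its endpoints have appeared.
    for i in range(len(curve) - 1):
        st.insert([i, i + 1], filtration=float(max(curve[i], curve[i + 1])))
    return st.persistence()  # list of (dimension, (birth, death)) pairs

curve = np.abs(np.sin(np.linspace(0, 4 * np.pi, 200))) + 0.1
print(sublevel_persistence(curve)[:5])
```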
Persistence diagrams for both 0-dimensional and 1-dimensional homology were computed for this dataset. An example notebook of MPEG7 is provided using only 4 shapes - apple, children, dog, and bone. This approach is due to the initial difficulty in getting accurate models for the full 60 dataset. Due to the small number of samples (80 total) and lack of repeated sampling, the estimates provided for this dataset are not stable and are not reported. We used this dataset for a timing comparison of featurization methods from persistence diagrams. We do not report on the results of this dataset. 4.4.5 HAM10000 The HAM10000 dataset provided by [57] is a collection of 10000 images of skin lesions with one 7 potential classifications: (i) Actinic Keratoses and Intraepithelial Carcionma, (ii) Basal cell carcinoma, (iii) Benign keratosis, (iv) Dermatofibroma, (v) Melanocytic nevi, (vi) Melanoma, (vii) Vascular skin lesions. A total of 18 persistence diagrams for this set were calculated using the methods outlined in [58], 9 corresponding to the 0-dimensional homology and 9 corresponding to the 1-dimensional homology. To obtain such diagrams, first a mask is computed by implementing the methodology proposed in [58]. The mask encodes the image with a binary response of white if that region (pixel) of the image if part of the lesion, and black if it is part of the healthy tissue. Once the mask is computed it is applied to the original image to identify the lesion. The area recognized as the lesion is converted into 3 different color models: RGB, HSV and XYZ. Each color model is split into their corresponding channels, and for each channel we use sublevel set filtration to obtain 0-dimensional and 1-dimensional persistence diagrams. In total, for each image on the data set we obtain 18 persistence diagrams, 9 in homological dimension 0 and 9 for homological dimension 1. An example image and this process is shown in Figure 4.6. To tune the models, a random sample of 250 images were taken and a ridge regression, random forest and support vector machine model were fit to determine parameters. The remaining 9750 images were split into an 80% training and 20% testing set to report final results. To evaluate the HAM10000 dataset, due to the large number of birth and death pairs in 61 Figure 4.6 (A) Example of skin lesion in HAM10000 (B) Skin lesion with mask (C) Mask only dataset. each persistence diagram, subsampling of persistent features was required. Each observation in a data set, for example an image, will yield 18 persistence diagrams corresponding to homological features in that observation. In the HAM10000 dataset, there was an average of 5,783 birth-death pairs in each persistence diagram. This was an issue to complete computation for the vectorization methods, even for adaptive templates, so each persistence diagram was subsampled as follows. The method of subsampling is two steps: Highly persistent features were always in- cluded, and a uniform random sampling method (without replacement) was used to sample the remaining points. The threshold for feature lifetime and number of points to sample was determined by using parameters that preserve the distribution of points in each persistence diagram. As a result, features in each persistence diagram with a lifetime of 5 or more were automatically included, and 5% of the rest of the points were also included. This resulted in sampled persistence diagrams with an average of 290 points each (see Table 4.1). 
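The two-step subsampling just described comes down to a few lines of numpy; the function below is an illustrative sketch rather than the exact code used.

```python
# Sketch of the two-step subsampling described above: keep every point with
# lifetime >= 5, then sample 5% of the remaining points without replacement.
import numpy as np

def subsample_diagram(dgm: np.ndarray, lifetime_cut: float = 5.0,
                      frac: float = 0.05, seed: int = 0) -> np.ndarray:
    """dgm: (n, 2) array of birth-death pairs."""
    rng = np.random.default_rng(seed)
    lifetimes = dgm[:, 1] - dgm[:, 0]
    keep = dgm[lifetimes >= lifetime_cut]      # highly persistent features
    rest = dgm[lifetimes < lifetime_cut]
    n_sample = int(np.ceil(frac * len(rest)))
    if n_sample > 0:
        idx = rng.choice(len(rest), size=n_sample, replace=False)
        keep = np.vstack([keep, rest[idx]])
    return keep
```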
4.5 User Guide 4.5.1 Available functions As part of the available code, a function for each method is included. Each function requires two sets of persistence diagrams, a training set and a testing set, and parameters specific to the function. The function returns two feature sets for that method, corresponding to the training and test set respectively. Each function also prints the time in seconds taken 62 at the end of each run. In this section of the user guide each function is described, along with the required parameters for the function. The Multi-Scale Kernel feature matrix can be computed using the function ker- nel_features or fast_kernel_features. It is recommended to use fast_kernel_features due to computation time. Both functions require a parameter sigma, denoted as s in the function with a default value of 4. In [37] this parameter is referred to as the scale parameter. From the definition, we note that as sigma, 𝜎, increases the function decreases. Increasing sigma results in a less diffuse kernel matrix, while decreasing sigma results in a more diffuse kernel matrix. Due to time required for the Multi-Scale Kernel, there are two additional sets of functions that use Numba ([59]) for significantly faster computation. In the current implementation, these are not able to be combined with multi-core processing (MPI for example), and have a different format than the other functions included. These functions are provided in the github repository for this project, and were used to compute results for the Multi-Scale Kernel for the MNIST dataset. The Persistence Landscapes features can be computed using landscape_features. The Multi-Scale Kernel function, landscape_features requires two parameters: the number of landscapes, n and resolution, r. The number of landscapes parameter, n, controls the number of landscapes used, and the resolution, r, controls the number of samples used to compute the landscapes. The default parameters for n is 5 and r is 100. The Persistence Images can be computed using the function persistence_image_features. The persistence_image_features function requires two parameters, pixels and spread. The pixels, p is a pair that defines the number of pixels along each axis. The spread, s, is the standard deviation of the Gaussian kernel used to generate the persistence image. It is worth noting that the implementation here uses the Gaussian kernel, however, other distributions could be chosen so that s would correspond to parameters specific to the chosen distribution. 63 Additionally, the weighting function is constant for this implementation. Increasing spread increases the variance for each distribution, resulting in larger "hot spots". Increasing pixels provides a smoother distribution, whereas decreasing pixels yields a less smooth distri- bution. Note that increasing pixels increases computation time. This is demonstrated in Figure 4.2 in the methods section. The Adcock-Carlsson Coordinates features can be computed using the function carls- son_coordinates, does not require any parameters. This function returns four different features for every type of persistence diagram provided. So for datasets that have persis- tence diagrams corresponding to 0-dimensional and 1-dimensional homology, 8 features are returned for machine learning. The features returned correspond to the four coordinates calculated in [40]. 
The Template Systems features can be computed using the function tent_features, and has a choice of two parameters: d, which defines the number of bins in each axis and padding, which controls the padding around the collection of persistence diagrams. This function returns a training and testing set. This function computes the tent features from [45]. The Adaptive Template Systems features can be called with the function adap- tive_features, and requires the labels for the training set. Users can choose three different types of Adaptive Templates: Gaussian Mixture Modeling (GMM), Cover-Tree Entropy Reduction (CDER), and Hierarchical density-based clustering of applications with noise (HDBSCAN). The parameter d refers to the number of components when using the GMM model type. This would be minimally the number of classes in your data, and ideally repre- sents closer to the number of distributions in the data that correspond to each observation. Details on these methods can be found in [46], as well as the original references linked in the methods section. During this evaluation, we evaluated adaptive templates using both GMM and CDER methods, but did not formally evaluate HDBSCAN. HDBSCAN was 64 difficult to formally assess as we had difficulty with completion of the algorithm for some datasets. For those datasets we were able to complete, we did not notice an improvement over other adaptive methods. 4.6 Results One consideration we must make before analysing the results comes from the compu- tation of Multi-Scale Kernel features. As explained for each dataset in Section 4.4, more often than not we will compute multiple persistent diagrams per data point in a given data set. Since this multi-scale kernel provides a notion of similarity between persistent dia- grams (see [37]) we require it to be computed between diagrams corresponding to the same dimension homology and method type. For example, the kernel matrix that corresponds to the 0-dimensional homology of a data set is computed using the persistence scale space kernel between two sets of persistence diagrams that represent the 0-dimensional homology. This means that for a dataset that has sets of 0-dimensional homology persistence diagrams and 1-dimensional homology persistence diagrams, two kernel matrices were returned (one per each dimension). The kernel matrix used in our models is the sum of available kernels, and differs based on the persistence diagrams available for each dataset. While this does improve accuracy significantly over individual kernel matrices, other methods of combining kernel features were not explored in this dissertation, but is available in [60] for the interested reader. The available parameter, sigma, is consistent across all types of diagrams for our evaluation. For each of the other methods, Persistence Landscapes, Persistence Images, Adcock- Carlsson Coordinates, Template Systems, and Adaptive Template Systems, each feature matrix was constructed for the relevant set of diagrams, and all topological types were used in fitting the same model. The datasets used in this analysis were of varying size, both in terms of observations and the size of sets of persistence diagrams. As noted in the descriptions of data, the types of 65 Dataset Characteristics Dataset MNIST SHREC14 Protein MPEG7 HAM10000 Observations Diagrams 70,000 300 1357 1400 10,000 8 2 2 2 18 Average Pairs Min/Max Pairs 1.15 14 346 205 5783 0/7 1/29 3/500 1/500 13/32610 Table 4.1 Characteristics of each dataset. 
The column headings can be explained as such: Observations - number of observations in the dataset, Diagrams - the number of homological types used to compute persistence diagrams, Average Pairs - the average number of birth/death pairs across the set of persistence diagrams for a single observation in the original dataset, and Min/Max Pairs - the minimum and maximum number of birth/death pairs across the set of persistence diagrams for a single observation in the original dataset. persistence diagrams calculated also differs. A summary of characteristics for each dataset is included in Table 4.1. 4.6.1 MNIST The Multi-Scale Kernel features calculated yielded 8 different kernel matrices, and the final kernel matrix was calculated using the unweighted summation of these kernels as in [60]. Due to the time needed for computation of the Multi-Scale Kernel, a smaller set of 12,000 observations was used to report final results. Table 4.2 shows complete results for the MNIST analysis. Four different methods (highlighted on the table) provided similar results for the MNIST dataset, and we note the SVM model had higher accuracy in each case. This table, and all subsequent results tables, include the method used to construct topological features, training and test accuracy, and model and parameters used for evaluation. 4.6.2 SHREC14 Results are reported in Table 4.3. Adaptive Template Systems and Persistence Land- scapes were the two methods with highest classification accuracy on the test dataset, with Template Systems and the Multi-Scale Kernel performing nearly as well. 66 Topological Method Multi-Scale Ker- nel (sum of Ker- nels for 12,000 observations) Persistence Land- scapes Persistent Images Adcock-Carlsson Coordinates Template tems Sys- Adaptive plate Systems Tem- Full Results for MNIST Dataset Train Test Model .6895 Β± .0035 .6932 Β± .0117 SVM .8844 Β± .0004 .8786 Β± .0019 Ridge Regression .9231 Β± .0004 .5814 Β± .0098 .8997 Β± .0005 .9368 Β± .0004 .6889 Β± .0036 .8590 Β± .0010 .9525 Β± .0004 .7214 Β± .0092 .896 Β± .0005 .9638 Β± .0003 .6967 Β± .0035 .8819 Β± .0016 .9180 Β± .0018 .5828 Β± .0098 .8934 Β± .0021 .9199 Β± .0023 .6953 Β± .0123 .8547 Β± .0030 .9356 Β± .0018 .7170 Β± .0097 .8959 Β± .0017 .9477 Β± .0015 .6973 Β± .0031 .8817 Β± .0027 .9515 Β± .0021 .9363 Β± .0021 .6914 Β± .0188 .6932 Β± .0209 SVM(RBF) Random Forest Ridge Regression SVM (RBF) Random Forest Ridge Regression SVM (RBF) Random Forest Ridge Regression SVM(RBF) Random Forest Ridge Regression (GMM) SVM(RBF) (GMM) Random Forest Table 4.2 Results from the MNIST Dataset using the average model classification accuracy Β± standard deviation over 10 trials. 4.6.3 Protein Classification Nearly all of the topological methods in this thesis provided similar classification accu- racy for this dataset. We observe the testing accuracy as higher than the training accuracy for this dataset, and the results are similar to those in [61]. The Multi-Scale Kernel though did not perform as well and as shown in Figure 4.7 is the most computationally intensive. Results are reported in Table 4.4. 
Full Results for SHREC14 Dataset
Method                                       Train            Test             Model
Multi-Scale Kernel (sigma = .5,
sum of kernels)                              .8942 ± .0142    .8938 ± .0464    Kernel SVM
Persistence Landscapes (n = 5, r = 200)      .9968 ± .0037    .9312 ± .0336    Ridge Regression
                                             .9302 ± .0098    .8876 ± .0479    SVM (RBF, c = 10)
                                             .9186 ± .0417    .865 ± .0764     Random Forest
Persistence Images (p = 40, s = .5)          .9739 ± .0190    .7124 ± .0814    Ridge Regression
                                             .7243 ± .0387    .8671 ± .0183    SVM (RBF, c = 1)
                                             .9114 ± .0441    .6928 ± .0599    Random Forest
Adcock-Carlsson Coordinates                  .7048 ± .0588    .9147 ± .0299    Ridge Regression
                                             .9067 ± .0147    .9442 ± .0087    SVM (RBF, c = 50)
                                             .9855 ± .0092    .6976 ± .0899    Random Forest
Template Systems (d = 12, p = 1.1)           .85 ± .0199      .9100 ± .0405    Ridge Regression
                                             .9350 ± .0079    .9159 ± .0383    SVM (RBF, c = 1)
                                             .9483 ± .0214    .8874 ± .0481    Random Forest
Adaptive Template Systems (CDER)             .9937 ± .0078    .9169 ± .0395    Ridge Regression
                                             .9929 ± .0083    .9064 ± .0397    SVM (RBF, c = 10)
                                             .9729 ± .0200    .9164 ± .0422    Random Forest

Table 4.3 Results from the SHREC14 Dataset using the average model classification accuracy ± standard deviation over 100 trials.

Figure 4.7 Comparison of timing by method. The legend is the same for all plots. The x-axis represents the size of the dataset, and the y-axis represents the time in seconds required for calculation of all of the persistence diagrams associated with the dataset of the given size. A) Timings for the MPEG7 Dataset including the Multi-Scale Kernel. B) Timings for the MPEG7 Dataset excluding the Multi-Scale Kernel. C) Timings for the Protein Dataset including all features. D) Timings for the Protein Dataset excluding Multi-Scale Kernel and Adaptive Templates (GMM).

4.6.4 HAM10000

Due to the run time for the large number of points in each persistence diagram, even after subsampling, results were not reported for the Multi-Scale Kernel or Adaptive Template Systems.

Full Results for Protein Dataset
Method                            Train            Test             Model
Multi-Scale Kernel
(sum of kernels)                  .8294 ± .1063    .8803 ± .0702    Kernel SVM
Persistence Landscapes            .9108 ± .0615    .9620 ± .0204    Ridge Regression
                                  .9012 ± .0682    .9782 ± .0151    SVM (RBF)
                                  .9011 ± .0686    .9782 ± .0152    Random Forest
Persistence Images                .9011 ± .0682    .9758 ± .0165    Ridge Regression
                                  .9007 ± .0684    .9782 ± .0151    SVM (RBF)
                                  .9008 ± .0685    .9782 ± .0151    Random Forest
Adcock-Carlsson Coordinates       .9008 ± .0685    .9780 ± .0151    Ridge Regression
                                  .9009 ± .0685    .9782 ± .0151    SVM (RBF)
                                  .9015 ± .0677    .9779 ± .0151    Random Forest
Template Systems                  .9008 ± .0684    .9780 ± .0151    Ridge Regression
                                  .9020 ± .0678    .9782 ± .0151    SVM (RBF)
                                  .9016 ± .0678    .9775 ± .0152    Random Forest
Adaptive Template Systems         .9008 ± .0685    .9782 ± .0151    Ridge Regression (CDER)
                                  .9007 ± .0684    .9782 ± .0151    SVM (CDER)
                                  .9100 ± .0685    .9800 ± .0151    Random Forest (HDB)

Table 4.4 Results from the Protein Dataset using the average model classification accuracy ± standard deviation over 54 trials corresponding to the predefined indices of the dataset.

Results are listed in Table 4.5. The HAM10000 dataset presented the largest computational challenge during this review, and is a continued area of research.

4.7 Computation Time of Features

Formal timings were captured for all features for the 0-dimensional persistence diagrams for the MPEG7 and Protein Datasets. A comparison of timings is in Figure 4.7. The timing reported is for the generation of features from one type of persistence diagram for a dataset of that size.
This means that when computing a training feature set and a testing feature set for multiple types of persistence diagrams, the expected time to generate features can be significantly longer. For example, in the MNIST dataset we compute 4 different types of persistence diagrams with both 0-dimensional and 1-dimensional homology, giving 8 sets of features that can be generated for the sets of persistence diagrams for that dataset. Specific to the multi-scale kernel method, the timing reported is for a symmetric feature matrix that is n x n, where n is the number of observations in the dataset. This means the training feature set requires less computation time than a testing feature set of comparable size. Additionally, during the review of these methods we did not encounter significant issues with model fitting, hence formal timings were not recorded for this portion of the analysis. Conclusions from these timings are addressed in the discussion section.

Results for HAM10000 Dataset
Topological Method                 Train            Test                  Model
Multi-Scale Kernel                 Did Not Run
Persistence Landscapes             .8347 ± .0022    .6881 ± .0074         Ridge Regression
                                   .6695 ± 0        .6692 ± 1.2e-16       SVM (RBF, c = 1)
                                   .6695 ± 0        .6692 ± 1.2e-16       Random Forest
Persistence Images
(pixels = 20, spread = 1)          .7417 ± .0017    .6371 ± .0671         Ridge Regression
                                   .7122 ± .0012    .6895 ± .0031         SVM (RBF, c = 1)
                                   .6695 ± 0        .6692 ± 1.2e-16       Random Forest
Adcock-Carlsson Coordinates        .6719 ± .0007    .6696 ± .0025         Ridge Regression
                                   .6801 ± .0009    .6710 ± .0019         SVM (RBF)
                                   .6695 ± 0        .6692 ± 1.12e-16      Random Forest
Template Systems (d = 10, p = 1.5) .7193 ± .0015    .6987 ± .0041         Ridge Regression
                                   .7830 ± .0024    .7303 ± .0054         SVM (RBF, c = 5)
                                   .6695 ± 0        .6692 ± 1.2e-16       Random Forest
Adaptive Template Systems          Did Not Run

Table 4.5 Results from the HAM10000 Dataset using the average model classification accuracy ± standard deviation over 10 trials.

Data Availability

The datasets, persistence diagrams (or code to compute the diagrams), and all other associated code for this study can be found in the machine learning methods for persistence diagrams GitHub repository https://github.com/barnesd8/machine_learning_for_persistence. For each of the five datasets, the following code is available:

• A jupyter notebook that loads and formats the persistence diagrams, including images, and does a preliminary model fitting on a subset of the data

• A python script that calculates the persistence diagrams from the original dataset - some of these are written using MPI depending on the size of the dataset

• A python script that fits models for random samples of the data to get mean estimates of accuracy for both the training and test dataset

These scripts reference modules included in the GitHub repository, including a persistence methods script that calculates features from persistence diagrams for a training and test dataset. This uses a combination of other available functions and functions written specifically for this thesis. The Template Systems and Adaptive Template Systems methods use functions from https://github.com/lucho8908/adaptive_template_systems, which is the corresponding code repository for [46]. The available methods in our extension include Tent Functions and Adaptive Template Functions (GMM, CDER, and HDBSCAN methods). The Adcock-Carlsson Coordinates method is a function developed specifically for this thesis, and includes the calculation of the 4 different features used in our analysis.
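As a rough illustration of what such coordinates look like, the following is a minimal sketch of one common choice of four Adcock-Carlsson polynomial features from [40], computed from a single persistence diagram given as an array of (birth, death) pairs. The exact four features computed by the thesis code may differ; this sketch is illustrative only.

import numpy as np

def adcock_carlsson_coordinates(dgm):
    # dgm: array of shape (m, 2) with columns (birth, death)
    b, d = dgm[:, 0], dgm[:, 1]
    dmax = d.max()
    life = d - b
    return np.array([
        np.sum(b * life),                    # emphasizes points born late with long lives
        np.sum((dmax - d) * life),           # emphasizes points dying early with long lives
        np.sum(b**2 * life**4),
        np.sum((dmax - d)**2 * life**4),
    ])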
The Persistence Landscape method uses the persistence landscape calculation from the Gudhi Package [62]. The Multi-Scale Kernel Method has two included implementations: one is from [63] and is slower to compute, while the other is a faster implementation that can be used on larger datasets. All of the results in this thesis were reported using the implementation written specifically for this thesis. The Persistence Images features are also from [63]. Additionally, many functions from [64] are used throughout.

4.8 Discussion

Adcock-Carlsson Coordinates, Tent Functions, and Persistence Landscapes scale well, and perform well even for large datasets. It is of note, though, that parameter choice will affect computation time. This was especially notable in the Template Features (Tent Functions) computation time. As the number of tent functions is increased, the time to compute features also increases. We observed a superlinear increase; however, even with this increase, computation time was not a barrier for analysis.

Persistence Images and Adaptive Template Functions do not scale or perform as well; however, they do provide good featurizations for accurate models and should be considered depending on the dataset. Specifically, the Adaptive Template Functions method was not completed for the full HAM10000 dataset due to computation time. When using these methods, it should be noted that the Multi-Scale Kernel method is computationally intensive and does not scale well. Additionally, the accuracy achieved is not better than that of other methods for the datasets in this thesis.

4.9 Conclusion

This chapter reviews 6 methods for topological feature extraction for use in machine learning algorithms. Persistence Landscapes, Adcock-Carlsson Coordinates, Template Systems, and Adaptive Template Systems perform consistently, with minimal differences between datasets and types of persistence diagrams. These methods are also less expensive in terms of execution time. A main contribution of this chapter is the availability of datasets, persistence diagrams, and code for others to use and contribute to the research community.

CHAPTER 5
OPEN SOURCE SOFTWARE CONTRIBUTIONS

5.1 Open Source Software Development Framework

Topological data analysis is a multi-disciplinary analysis approach, with applications in an extensive number of domains. To enable the use of TDA tools, well-documented and maintained open source software is needed. Multiple software packages are available as part of this dissertation, either developed collaboratively as part of Liz Munch's lab or as part of my individual contributions to topological data analysis.

Three different open source tools are included in this section: teaspoon [2], ceREEBerus, and an extension to the Kepler Mapper [65] software. teaspoon is a python package developed for topological signal processing, with modules for specific applications including parameter selection, persistent homology, and machine learning. I have contributed to the machine learning module by optimizing already available feature creation functions and contributing an additional method from prior work. Additionally, I have implemented a software development process for automated releases to PyPI for teaspoon. This standardized software development framework was used when creating ceREEBerus, a python package for working with Reeb graphs. Users can load Reeb graphs, customize plots, compute merge trees, and perform basic distance calculations using ceREEBerus.
ceREEBerus includes code from collaborators in the Munch Lab, and I wrote the base functions and created the overall structure and initial package for ceREEBerus. This section focuses on the available open source software, and begins with the software development process implemented to promote collaboration.

5.1.1 Broader Impact

While much of the work in this section focuses on the implementation of already existing methods, a significant motivation for this work is to enable collaboration and multidisciplinary research with topological tools. Each of the sections in this chapter outlines a contribution to the broader impact of topological data analysis. Through the standardization and automation of the development pipeline of teaspoon and ceREEBerus, students and collaborators from disciplines outside of computer science and software engineering can readily contribute and learn standards along the way. I have also provided training to other graduate students, as well as led a discussion-based responsible conduct of research seminar focusing on version control software. Additionally, having automation for deployment and testing provides a process where individuals just learning software development can feel more comfortable with the process of contributing to software packages.

The focus of teaspoon has been very similar, with the additional endeavor of simplifying the user experience and providing additional examples for individuals wishing to learn and use topological methods for data analysis through teaspoon. To this end, I simplified the install process and integrated notebooks, datasets, and code from [3] into teaspoon. This further simplifies the process for users to get started, as persistence diagrams are readily available from familiar datasets for use in featurization methods and subsequently in machine learning models.

Finally, through my work on ceREEBerus, I initialized a software package and process for collaborators in the Munch Lab (and more broadly) to be used as they continue to expand on both the theory and computation of Reeb graphs. The structure and examples I provided through this initial work set a solid foundation for collaboration on Reeb graph computation.

5.1.2 Automation of development pipeline

A major component of the software development framework was standardizing and automating software releases for any existing packages in the Munch Lab, and providing guidelines for future packages for the Munch Lab. This provides a framework for packages developed primarily in python and released via PyPI. This framework helps support a consistent development practice and allows for maintenance by graduate students in limited terms. The process consists of two main components: 1) a process for merging, testing releases, and approval to the master/main branch, and 2) automation of deployment to PyPI.

To support an automated deployment with multiple contributors, it is imperative to first establish a collaborative process that supports updates from multiple developers that are tested and reviewed. This process was designed and documented, and provided via GitHub and as workshops to graduate students contributing to software development projects. As such, the CMSE value to collaborate across disciplines was supported and enabled.
The process is visualized in Figure 5.1, and consists of creating a feature branch from the test release branch, implementing the required features, merging into the test release branch with appropriate review, and merging into the master/main branch after review and testing.

Figure 5.1 Illustration of the collaborative release process implemented.

To further support the development pipeline, workflows via GitHub were configured to allow for automated releases and to include the criteria that must be met for an automated release. Upon approval of the release, the following workflow is executed: 1) create a container (ubuntu), 2) install python, 3) install the package, 4) run unit tests, 5) build the package, and 6) publish to PyPI. Workflow monitoring is enabled, and failure at any point in the workflow will halt the process and will not publish to PyPI. The set of standards for a successful build is provided in further detail below.

Must have passed a successful deployment to TestPyPI: Changes are first implemented into a release process on the test release branch, which upon successful completion deploys to the TestPyPI site. This is the first check required for code to be merged into the master/main branch for release.

Must have passed a code review: Before a feature branch can be integrated into a release, a code review by a project collaborator is required. The collaborator can approve the changes or suggest changes to further refine the code.

Must pass unit tests: Each module has a series of unit tests that are required to pass before deployment. Once the code is reviewed, and the merge to the main/master branch is accepted, the automated deployment process runs the unit tests.

Must have package dependencies in pyproject.toml: For a successful install, build, and deployment, package dependencies must be listed in the dependencies section of the pyproject.toml file.

Must have a successful package build: Before the version can be automatically deployed, it must build successfully.

Must have the version incremented for release: When the release is pushed to PyPI, the version must be incremented.

5.2 Teaspoon

The information in this section on teaspoon follows [2]. teaspoon is a comprehensive python package for topological signal processing, combining techniques from topological data analysis to create tools for signal processing that use shape for inference. teaspoon is open source with extensive documentation and follows the collaborative development process outlined in Section 5.1.2, making the code both easy to use and easy to contribute to. teaspoon is comprised of five main modules: dynamical systems, machine learning, complex networks, information, and parameter selection. I focus on the machine learning module, which historically included the featurization and classification modules, and which has been extended to include both a datasets module and notebook examples for machine learning pipelines. Additionally, some functions in the featurization module have been added or updated for computational efficiency. Source code and notebook examples are adapted from [3].

As part of the focus on usability for teaspoon, a considerable number of users continued to have difficulties with package installation due to the system dependencies CMake [66] and Boost [67]. Installation issues were limited to those related to zigzag persistence. Currently, computation of zigzag persistence has limited implementations, all reliant on underlying C/C++ libraries to ensure reasonable computation times.
As such, these teaspoon modules have been moved to their own repository, spork. The main goals of my work with teaspoon have been to improve ease of use, enable additional collaboration, and expand the machine learning modules by including work done in [3] to allow for easy distribution. Since featurization of persistence diagrams is an active area of research, providing example datasets and efficient code for feature computation is a contribution to the topological data analysis community.

5.2.1 spork

spork is the companion to teaspoon and was developed to create a simpler install for the majority of users utilizing the teaspoon package. As additional software for TDA has become available, teaspoon has expanded to include additional features for method applications, many of which are used for specific purposes. Many users would not necessarily need all functions when first using teaspoon, so to improve the install experience a subset of functions was moved to a companion package, spork. The additional install currently required for spork is Dionysus 2 [68], and spork includes the features required to compute zigzag persistence. The install is available on the Python Package Index (PyPI) and builds locally, requiring a compiler for the C++-to-python bindings as well as other system dependencies.

5.2.2 Featurization

The featurization module in teaspoon originally contained six different featurization methods: persistence landscapes, persistence images, Carlsson Coordinates, template functions, the kernel method, and signature of paths. As part of my work I have extended the featurization module to include adaptive template systems as an additional feature, and included a faster version of the kernel method and of the persistence images method. Initial code for adaptive template systems was provided by Luis Polanco, and I have included a subset of these functions.

Each feature type was optimized in a different way specific to the method and required libraries. Two types of optimization were explored during this analysis: parallelization and vectorization. The discussion on parallelization follows [69]. Parallelization can refer to either functional parallelism or data parallelism. Functional parallelism is running multiple sets of instructions concurrently, and data parallelism is running the same instructions on different sets of data concurrently. Data parallelism was used to optimize the persistence images code, and an illustration of this is in Figure 5.2. Vectorization can refer to a few things: the process of turning persistence diagrams into features, a type of data parallelism where computation is done simultaneously on elements of a vector, or the process of restructuring code to remove loops and offload them to a lower level language (C) for better performance [70]. When I refer to vectorization, I mean the last definition.

Figure 5.2 Illustration of data parallelism where the addition operation is performed on blocks of a dataset at a time concurrently.

To determine the most appropriate implementation in the teaspoon software, a number of methods were developed and timed using a simulated dataset. Development and timing were performed on an iMac using the Apple M1 chip with 16 GB of RAM and 8 total cores. Initial timing was run locally, as many users of teaspoon will be working on personal computers. Timing was also assessed on Michigan State's High Performance Computing Center.

5.2.3 Kernel Method

The first method discussed is the Multi-scale Heat Kernel Method from [37].
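For reference, and up to notational differences, the kernel from [37] between two persistence diagrams F and G has the form

\[
  k_\sigma(F, G) \;=\; \frac{1}{8\pi\sigma} \sum_{p \in F,\; q \in G}
  \exp\!\left(-\frac{\lVert p - q \rVert^2}{8\sigma}\right)
  - \exp\!\left(-\frac{\lVert p - \bar{q} \rVert^2}{8\sigma}\right),
\]

where \(\bar{q} = (q_2, q_1)\) denotes the mirror image of \(q = (q_1, q_2)\) across the diagonal. Every birth-death pair of one diagram is compared against every pair of the other, which drives the computational cost discussed below.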
To optimize the kernel method, code was developed and assessed using Numba [59], which translates Python to optimized machine code and compiles upon the first run. Numba provides many different ways to improve performance, including automatic just-in-time compiling, automatic parallelization, and the creation of NumPy universal functions [70]. Numba's just-in-time compiling (jit) is simply a decorator applied to a Python function that directs Numba to automatically optimize and generate machine code with performance similar to C, C++, and Fortran. This same decorator can also be designated to compile code to run in parallel. While this is simple to use, initial benchmarking tests showed much better performance using guvectorize. Guvectorize (and vectorize) is a way to create NumPy universal functions (ufuncs) with Numba, avoiding the traditional process required to create a NumPy ufunc. A NumPy ufunc operates element-by-element and is a wrapper for C functions. Functions can be written using guvectorize in Numba with cpu, parallel, and gpu targets, and can also broadcast, returning different size outputs based on input size. Any loops inside the function do not parallelize, but any additional dimensions will parallelize.

To better understand this mechanism, consider a dataset comprised of persistence diagrams. For each observation in the dataset, there is a set of persistence diagrams. Each diagram is comprised of birth and death pairs. Consider the example below, with 4 sets of length 3 birth and death pairs. From the python array perspective, the shape of this set would be (4, 3, 2). To compute features using the kernel method, each data point is compared to each other data point. Writing a function to take a set of persistence diagrams and compare it to a single persistence diagram, so an array of shape (4, 3, 2) and an array of shape (3, 2), will then broadcast across the first dimension for the calculation.

df = [[[0,1], [0,2], [0,4]],
      [[1,3], [0,4], [0,6]],
      [[2,4], [3,5], [0,7]],
      [[0,1], [2,4], [3,7]]]

To allow for core-based parallelization or gpu processing, writing a function using guvectorize to compare an entire array to a single observation will parallelize the additional dimension. In this case, providing two arrays of size (4, 3, 2) will parallelize along the first dimension for one of the arrays using a scatter operation, sending observations of the second array to different processes to run in parallel. The implicit assumption in this example is that each persistence diagram is the same length, i.e. the second dimension will always be 3. That is not the case, so restructuring is required first. Any timing reported includes the time required to restructure the data. The code example below shows both the before and after of the restructuring function that pads the array to create the same second dimension, enabling the creation of a NumPy ufunc using guvectorize in Numba.

df = [[[0,1], [0,2]],
      [[1,3], [0,4], [0,6]],
      [[2,4], [3,5], [0,7]],
      [[0,1]]]

df = reshape_to_array(df)
print(df)

[[[0,1], [0,2], [-1,-1]],
 [[1,3], [0,4], [0,6]],
 [[2,4], [3,5], [0,7]],
 [[0,1], [-1,-1], [-1,-1]]]

Results of the local run are shown in Figure 5.3 and results of the HPCC run are shown in Figure 5.4. For both, note a speed-up that scales well as additional processors are used. For the kernel computation the level of complexity is O(N^2 · |F| · |G|), where |F| and |G| are the cardinalities of the multisets F and G and N is the number of multisets. In other words, for a dataset with N observations, a computation needs to happen for each set of birth-death pairs for every pair of observations.
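To make the broadcasting pattern concrete, the following is a minimal sketch, not the teaspoon implementation, of a guvectorize'd function that compares every padded diagram in an (n, k, 2) array against a single (k, 2) diagram, skipping the (-1, -1) padding rows introduced by the restructuring step above. The function name and the use of a length-1 sigma array are illustrative assumptions.

import numpy as np
from numba import guvectorize

@guvectorize(["void(float64[:,:], float64[:,:], float64[:], float64[:])"],
             "(k,d),(k,d),(s)->(s)", target="parallel")
def pss_kernel_row(F, G, sigma, out):
    # Accumulate the multi-scale heat kernel between two padded diagrams.
    total = 0.0
    for i in range(F.shape[0]):
        if F[i, 0] < 0:          # skip (-1, -1) padding rows
            continue
        for j in range(G.shape[0]):
            if G[j, 0] < 0:
                continue
            d2 = (F[i, 0] - G[j, 0])**2 + (F[i, 1] - G[j, 1])**2
            m2 = (F[i, 0] - G[j, 1])**2 + (F[i, 1] - G[j, 0])**2   # mirrored point
            total += np.exp(-d2 / (8.0 * sigma[0])) - np.exp(-m2 / (8.0 * sigma[0]))
    out[0] = total / (8.0 * np.pi * sigma[0])

# A = reshape_to_array(df)                              # padded array of shape (n, k, 2)
# column = pss_kernel_row(A, A[0], np.array([0.5]))     # broadcasts over the n diagrams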
Further optimization could be possible through a fast kernel implementation [71], but the implementation would need to be adapted to account for the specific structure of persistence diagrams.

Figure 5.3 Comparison of timing for the Multiscale Heat Kernel using a simulated dataset running locally.

Timing for the multiscale kernel method was run on MSU's HPCC on an amd20-v100 node with 20 Intel(R) Xeon(R) Platinum 8260 CPUs with a clock speed of 2.40 GHz. The GPU used was an NVIDIA V100 with 32768 MB of memory.

Figure 5.4 Comparison of timing for the Multiscale Heat Kernel using a simulated dataset running on MSU's HPCC with 16 cores.

5.2.4 Persistence Images

Similar methods were explored for the optimization of persistence images. Ultimately, the implementation from the persim [72] package was used because of the availability of the parallel option. Due to the need for integration of a gaussian density function over a grid, the error function from the SciPy [73] library is required, and it is not supported by Numba. An attempt was made to recreate the integration in NumPy for use with Numba, but it was ultimately neither as efficient nor as accurate. Timing from running locally is shown in Figure 5.5 and timing from using the HPCC is shown in Figure 5.6. Timing on MSU's HPCC for persistence images was run on an intel18-v100 node with Intel(R) Xeon(R) Gold 6148 CPUs with a clock speed of 2.40 GHz using 16 cores.

Figure 5.5 Comparison of timing for persistence images using a simulated dataset running locally.

Figure 5.6 Comparison of timing for persistence images using a simulated dataset running on MSU's HPCC.

5.2.5 Datasets

Datasets used in [3] were integrated into teaspoon as part of a new module within the machine learning section, load_datasets. This module includes the persistence diagrams computed from the MNIST dataset and the MPEG7 dataset. The following describes how to load each dataset for use.

The MNIST persistence diagrams can be loaded using the function load_datasets.mnist(). No parameters are required for this function. This dataset includes the persistence diagrams for the 60,000 observations considered the training set for the MNIST dataset. The dataset is loaded as a pandas dataframe and has 9 columns, 8 of which correspond to persistence diagrams and one column of dataset labels. The 8 columns of persistence diagrams can be defined as:

1. zero_dim_rtl: persistence diagram for 0-dimensional features computed using the right to left directional transform

2. zero_dim_ltr: persistence diagram for 0-dimensional features computed using the left to right directional transform

3. zero_dim_btt: persistence diagram for 0-dimensional features computed using the bottom to top directional transform

4. zero_dim_ttb: persistence diagram for 0-dimensional features computed using the top to bottom directional transform

5. one_dim_rtl: persistence diagram for 1-dimensional features computed using the right to left directional transform

6. one_dim_ltr: persistence diagram for 1-dimensional features computed using the left to right directional transform

7. one_dim_btt: persistence diagram for 1-dimensional features computed using the bottom to top directional transform

8. one_dim_ttb: persistence diagram for 1-dimensional features computed using the top to bottom directional transform
Each of the persistence diagrams was computed first using a directional transform on pixel values, and then sublevel set persistence. This is demonstrated in Figure 5.7, with additional details available in [3] for MNIST and for all other datasets listed here.

Figure 5.7 The number 8 after computing each of the 4 types of coordinate transforms used to compute the persistence diagrams.

The MPEG7 persistence diagrams can be loaded using the function load_datasets.mpeg7(). No parameters are required for this function. This dataset includes the persistence diagrams for the 1400 observations in the MPEG7 dataset. The dataset is loaded as a pandas dataframe and has 5 columns, 4 of which correspond to persistence diagrams or related computations and one column of dataset labels. The columns of the MPEG7 persistence diagrams dataset can be defined as:

1. Name: label of the dataset

2. Outline: outline of the image represented as a 2-dimensional point cloud dataset

3. DistCurve: distance from the center of the outline

4. persistence_outline: persistence diagrams for 0-dimensional features and 1-dimensional features computed from the outline (contour)

5. lower_star_persistence: persistence diagram for 0-dimensional features computed using sublevel set persistence of the distance curve

Persistence diagrams were computed using both the shape outline and the distance curve. Ripser was used to compute the 0-dimensional and 1-dimensional features from the outline, as well as the 0-dimensional sublevel set persistence from the distance curve. The available data for the outline and distance curve is plotted in Figure 5.8.

Figure 5.8 Example output from the MPEG7 dataset for the columns Outline and DistCurve.

The shape retrieval of non-rigid 3D human models (SHREC14) persistence diagrams can be loaded using the function load_datasets.shrec14(). No parameters are required for this function. This dataset includes the persistence diagrams for the 3000 observations in the SHREC14 dataset. The dataset is loaded as a pandas dataframe and has 6 columns, 3 of which correspond to persistence diagrams or related computations and one column of dataset labels. Persistence diagrams were computed using the Heat Kernel Signature with a range of parameter values (corresponding to trial) [74] on 300 images, each corresponding to a specific trial value in the dataset. The columns of this persistence diagrams dataset can be defined as:

1. freq: index of the machine learning problem (300 observations per problem)

2. trial: corresponds to a machine learning problem (10 total, corresponding to a parameter used in computing the persistence diagram)

3. CollectedDgm: set of all persistence diagrams

4. Dgm1: persistence diagrams for 1-dimensional features

5. Dgm0: persistence diagrams for 0-dimensional features

6. TrainingLabel: one of 15 labels corresponding to five different poses for the three classes of male, female, and child figures
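The dataset-loading functions described above can be used as in the following minimal sketch. The import path for the load_datasets module within teaspoon's machine learning section is an assumption and should be checked against the teaspoon documentation.

from teaspoon.ML import load_datasets   # assumed location of the datasets module

# Each loader returns a pandas DataFrame of precomputed persistence diagrams.
mnist_dgms = load_datasets.mnist()      # 9 columns: 8 diagram columns plus labels
mpeg7_dgms = load_datasets.mpeg7()      # Outline, DistCurve, and diagram columns
shrec14_dgms = load_datasets.shrec14()  # Heat Kernel Signature diagrams per trial

print(mnist_dgms.columns)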
5.3 CeREEBerus

Another area of focus for open source software development for topological data analysis is a package called ceREEBerus, which focuses on computation related to Reeb graphs. Working with Reeb graphs is an active area of research, with provably correct and efficient algorithms for computing Reeb graphs available in recent years [18]. This has provided renewed interest in the use of Reeb graphs, along with the availability of metrics defined on Reeb graphs [16]. The goal of ceREEBerus, then, is to start with Reeb graphs and basic functions on Reeb graphs and build a comprehensive package to compute a wide variety of metrics on Reeb graphs. The initial and current utility of ceREEBerus allows users to load example Reeb graphs, generate random merge trees, plot graphs and merge trees, generate merge trees from Reeb graphs, and compute distances between merge trees. This package is available via PyPI and follows the open source software framework implemented for the Munch Lab.

5.3.1 Available functions

As part of the package, extensive user documentation is available at the ceREEBerus site, and a summary is included here. The ceREEBerus package is comprised of three modules: example data and graphs (data), Reeb graphs (reeb), and compute (compute). The sample data and graphs module provides both networkx graphs and Reeb graphs for multiple example graphs, including the dancing man, simple loops, and torus graphs. Each of these is plotted in Figure 5.9 using the plotting function available in ceREEBerus. An example Reeb graph is in Figure 5.10a and the corresponding ceREEBerus plot is in Figure 5.10b.

(a) Dancing Man. (b) Simple Loops. (c) Torus.
Figure 5.9 Output from ceREEBerus plotting capabilities for the dancing man, simple loops, and torus graphs.

(a) A Reeb graph of the dancing man. (b) Output from ceREEBerus: a graph of the dancing man.
Figure 5.10 ceREEBerus output (right) for the Reeb graph shown on the left.

There are two main ways to load the example graphs, either as a networkx graph or as a Reeb graph, which is a custom class created for ceREEBerus. These can be loaded and accessed by using from cereeberus.data import graphs for the networkx version, or from cereeberus.data import reeb for the Reeb graph version. An example code block for loading the relevant libraries, loading the Reeb graph version of the torus, and plotting is shown below. The relevant output for this code is shown in Figure 5.9c.

from cereeberus.data import reeb
reeb_torus = reeb.torus()
reeb_torus.plot_reeb()

When plotting Reeb graph objects, additional parameters allow you to control the position and display of the plots. The available parameters, with their default values and definitions, are:

• position = {}: an optional argument to update positions of nodes for plotting

• resetSpring = False: a boolean flag that when true assigns new node positions for the underlying Reeb graph using the spring layout from networkx

• horizontalDrawing = False: a boolean flag that when true swaps the x and y axes for plotting and does not update node positions

• verbose = False: a boolean flag that when true provides logging output

• cpx = .1: a float that is only used for graphs where the degree of connectivity between two nodes is two, and controls the radius of the curve used for plotting relative to the x position

• cpy = .1: a float that is only used for graphs where the degree of connectivity between two nodes is two, and controls the radius of the curve used for plotting relative to the y position

An example of horizontal drawing, updating position, and resetting the spring layout is shown in the following code block, with example output in Figure 5.11.
### Import relevant module
from cereeberus.data import reeb
reeb_torus = reeb.torus()

### Plotting the Reeb graph using horizontalDrawing = True
reeb_torus.plot_reeb(horizontalDrawing=True)

### Plotting the Reeb graph with an updated position
reeb_torus.plot_reeb(position = {0:(1,1), 1:(2,2), 2:(2,3), 3:(4,4), 4:(2,5), 5:(6,6)},
                     horizontalDrawing = False, verbose = False)

### Plotting the Reeb graph with the resetSpring parameter = True
reeb_torus.plot_reeb(resetSpring = True)

(a) Example torus plotted horizontally. (b) Example torus with updated positions. (c) Example torus using the spring layout.
Figure 5.11 Output from ceREEBerus plotting capabilities for the torus using different parameters.

Lastly, as part of the plotting function, the ability to refine how multiple edges between two nodes are plotted is shown below, with the corresponding output in Figure 5.12.

### Import relevant module
from cereeberus.data import reeb
sl = reeb.simple_loops()

### Plotting the simple loops graph with cpx = .1
sl.plot_reeb(cpx=.1)

### Plotting the simple loops graph with cpx = .5
sl.plot_reeb(cpx=.5)

### Plotting the simple loops graph with cpx = 1
sl.plot_reeb(cpx=1)

(a) Example simple loops with cpx = .1. (b) Example simple loops with cpx = .5. (c) Example simple loops with cpx = 1.
Figure 5.12 Output from ceREEBerus plotting capabilities of simple loops using different cpx parameters.

Another capability in the compute.degree module is adding nodes, removing isolates, and removing any unnecessary nodes to create a minimal Reeb graph. Recalling the definition from Section 2.4, the nodes with both an up degree and a down degree equal to 1 can be removed to compute the minimal Reeb graph. I use the following algorithm to construct the minimal Reeb graph:

Require: R, a Reeb graph
1: n = length(nodes)
2: for each node i with i <= n do
3:     if up degree(i) == down degree(i) == 1 then
4:         add edge(up node, down node)
5:         remove edge(up node, i)
6:         remove edge(down node, i)
7:         remove node(i)
8:     end if
9: end for

The following code example shows first adding nodes to the torus Reeb graph, and then constructing the minimal Reeb graph of the torus. Plots of each are available in Figure 5.13.

import cereeberus.compute.degree as degree
import networkx as nx

reeb_torus = reeb.torus()

### Add node and plot
reeb_torus_add = degree.add_nodes(reeb_torus, fx=4.5, x = 1)
reeb_torus_add.plot_reeb()

### Add another node and plot
reeb_torus_add = degree.add_nodes(reeb_torus_add, fx=3.5, x=1)
reeb_torus_add.plot_reeb()

### Compute minimal Reeb graph
reeb_torus_min = degree.minimal_reeb(reeb_torus_add)
reeb_torus_min.plot_reeb(cpx=1)

(a) Torus Reeb graph after adding one node. (b) Torus Reeb graph after adding two nodes. (c) Minimal Reeb graph of the torus.
Figure 5.13 Output from ceREEBerus degree functions.

As part of the compute.degree module, one can also compute distances between graphs and remove isolates, or nodes with up degree and down degree equal to zero. The remove isolates function is a necessary step in the union find algorithm implemented for computing merge trees. Code is shown below and images are available in Figure 5.14. Interested readers are directed to the ceREEBerus documentation for details on distances, as graph edit distance and directed merge tree distance are both available.
from cereeberus.data import reeb
import cereeberus.compute.degree as degree

jm = reeb.juggling_man()
jm.plot_reeb()

### Remove isolates and plot
jm_im = degree.remove_isolates(jm)
jm_im.plot_reeb()

(a) Reeb graph with isolates. (b) Removed isolates.
Figure 5.14 Plots of a Reeb graph before and after removing isolates.

The final and core functionality of ceREEBerus is the ability to compute merge trees from Reeb graphs. The main code was provided by Elena Wang, and I structured and implemented it into the overall python package. The computeMergeTree function uses an implementation of the union find algorithm. Example code and the associated output are below and shown in Figure 5.15.

from cereeberus.data import reeb
import cereeberus.compute.degree as degree

jm = reeb.juggling_man()

### Remove isolates and plot
jm_im = degree.remove_isolates(jm)
jm_im.plot_reeb()

### Import merge tree, compute, and plot
from cereeberus.compute.merge import computeMergeTree
cmt = computeMergeTree(jm_im, verbose=False, precision=1, filter=False)
cmt.plot_reeb(position = {7:(0,1), 5:(1,4), 2:(0,5), 'inf':(0,7)})

(a) Dancing man before merge tree. (b) Merge tree of dancing man.
Figure 5.15 Plot of a Reeb graph and its associated merge tree.

5.3.2 Predictive Mapper

Through a seed grant from the Center for Business and Social Analytics at Michigan State University, I was able to work with a software developer, Doug Krum, to take a prototype for an interactive Mapper application built by Pat Bills and transition the work to python, taking advantage of the full suite of functions in scikit-learn [64], including parallelism. We started with the mapper implementation from KeplerMapper [27] and [65], and expanded the package to include additional features that allowed the development of predictive mapper. This variant included a web interface to allow users to interactively create a mapper graph and choose specific parameters for the mapper graph creation. During the development of predictive mapper this tool was used, and additional functions were created to enable the predictive pipeline. Based on the final predictive mapper pipeline, the associated release is still a work in progress, requiring additional code and data restructuring to support integration into the already existing Kepler Mapper. As mentioned in Section 3.4, example notebooks for the work done on the iris dataset are available along with the mapper graphs generated. Similarly, an example notebook used to create images for the predictive mapper portion of this dissertation is available and is an independent mapper pipeline.

5.3.3 Conclusion

As part of my work in the Computational Mathematics, Science, and Engineering department I have contributed to many computational projects supporting work in topological data analysis. These contributions include a standardized process and framework for collaboration in the Munch Lab, the inclusion of additional and optimized methods in teaspoon, and the reorganization of teaspoon to have a companion package, spork. I have additionally used this framework for the creation of ceREEBerus. As part of a review paper I have additionally contributed a set of notebooks and datasets for machine learning use of persistence diagrams.

CHAPTER 6
CONCLUSION

Topological data analysis is a powerful tool for extracting the shape of data. Two primary tools, persistent homology and mapper, require additional transformations to be used directly in machine learning models.
In this dissertation, I have contributed to the discipline of TDA by developing another method for feature creation and providing additional computational tools for working with persistence diagrams. I have also contributed to the available open source software in TDA through package development and the implementation of standards for the Munch Lab.

To extend the use of mapper, I have developed predictive mapper as a way to create an optimal version of a mapper graph, compute probabilities of inclusion of data points in mapper nodes, and then, using the eigenvectors of the graph Laplacian of the mapper graph, represent each data point as a linear combination of the probability of inclusion in each node and the eigenvector values. This featurization provides real-valued vectors for use in machine learning models that represent the connectivity of the mapper graph weighted by a meaningful probability for each data point.

In the field of persistent homology, I have contributed additional datasets consisting of persistence diagrams for some popular datasets, along with the inclusion of a far more computationally efficient function to compute the Multi-Scale Heat Kernel [37]. Example notebooks of 6 different featurization methods are available for academic use, furthering the availability of TDA resources. I have also contributed meaningfully to available tools for researchers in TDA, with the creation of ceREEBerus to work with Reeb graphs, additional contributions to teaspoon, and the additional code required for the predictive mapper pipeline. This code is available as open source with a standardized development process to enable collaboration.

Through this dissertation I have contributed both to the field of Topological Data Analysis and undertaken a combination of theoretical and applied work in the spirit of the CMSE program.

BIBLIOGRAPHY

[1] Gurjeet Singh, Facundo Memoli, and Gunnar Carlsson. Topological methods for the analysis of high dimensional data sets and 3d object recognition. Eurographics Symposium on Point-Based Graphics, 2007.

[2] Audun D Myers, Melih Yesilli, Sarah Tymochko, Firas Khasawneh, and Elizabeth Munch. Teaspoon: A comprehensive python package for topological signal processing. In NeurIPS 2020 Workshop on Topological Data Analysis and Beyond, 2020.

[3] Danielle Barnes, Luis Polanco, and Jose A. Perea. A comparative study of machine learning methods for persistence diagrams. Frontiers in Artificial Intelligence, 4:91, 2021.

[4] Y. LeCun and C. Cortes. The mnist database, 1999.

[5] Herbert Edelsbrunner and John Harer. Computational topology: an introduction, volume 69. American Mathematical Soc., 2010.

[6] Tamal Krishna Dey and Yusu Wang. Computational Topology for Data Analysis. Cambridge Univ. Press, Cambridge, 2006-2021.

[7] Gunnar E. Carlsson and Vin de Silva. Zigzag persistence. CoRR, abs/0812.0197, 2008.

[8] J. H. Friedman and J. W. Tukey. A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, C-23:881–890, September 1974.

[9] Frédéric Chazal and Bertrand Michel. An introduction to topological data analysis: Fundamental and practical aspects for data scientists. Frontiers in Artificial Intelligence, 4:108, 2021.

[10] Monica Nicolau, Arnold J. Levine, and Gunnar Carlsson. Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. Proceedings of the National Academy of Sciences of the United States of America, pages 7265–7270, 2011.

[11] P.Y. Lum, G. Singh, A. Lehman, T. Ishkanov, M. Vejdemo-Johansson, M. Alagappan, J. Carlsson, and G. Carlsson. Extracting insights from the shape of complex data using topology. Scientific Reports, 3:1236, 2013.
[12] Elizabeth Munch and Bei Wang. Convergence between categorical representations of Reeb space and mapper. 32nd International Symposium on Computational Geometry, 53:53:1–53:15, 2016.

[13] Stephen C. Johnson. Hierarchical clustering schemes. Psychometrika, 32:241–254, 1967.

[14] Ricardo J. G. B. Campello, Davoud Moulavi, and Joerg Sander. Density-based clustering based on hierarchical density estimates. In Jian Pei, Vincent S. Tseng, Longbing Cao, Hiroshi Motoda, and Guandong Xu, editors, Advances in Knowledge Discovery and Data Mining, pages 160–172, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.

[15] Silvia Biasotti, Daniela Giorgi, Michela Spagnuolo, and Bianca Falcidieno. Reeb graphs for shape analysis and applications. Theoretical Computer Science, 392:5–22, 02 2008.

[16] Brian Bollen, Erin W. Chambers, Joshua A. Levine, and Elizabeth Munch. Reeb graph metrics from the ground up. CoRR, abs/2110.05631, 2021.

[17] Ulrich Bauer, Elizabeth Munch, and Yusu Wang. Strong Equivalence of the Interleaving and Functional Distortion Metrics for Reeb Graphs. In Lars Arge and János Pach, editors, 31st International Symposium on Computational Geometry (SoCG 2015), volume 34 of Leibniz International Proceedings in Informatics (LIPIcs), pages 461–475, Dagstuhl, Germany, 2015. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.

[18] Harish Doraiswamy and Vijay Natarajan. Efficient algorithms for computing Reeb graphs. Computational Geometry, 42(6):606–616, 2009.

[19] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4), 2007.

[20] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering, 2001.

[21] Enrique Alvarado, Robin Belton, Emily Fischer, Kang-Ju Lee, Sourabh Palande, Sarah Percival, and Emilie Purvine. g-mapper: Learning a cover in the mapper construction, 2024.

[22] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.

[23] R. A. Fisher. Iris. UCI Machine Learning Repository, 1988. DOI: https://doi.org/10.24432/C56C76.

[24] Paulo Cortez. Student Performance. UCI Machine Learning Repository, 2014. DOI: https://doi.org/10.24432/C5TG7T.

[25] S.J. Roberts, D. Husmeier, I. Rezek, and W. Penny. Bayesian approaches to gaussian mixture modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1133–1142, 1998.

[26] P. Cortez and A. M. Gonçalves Silva. Using data mining to predict secondary school student performance. 2008.

[27] Hendrik Jacob van Veen, Nathaniel Saul, David Eargle, and Sam W. Mangham. Kepler mapper: A flexible python implementation of the mapper algorithm. Journal of Open Source Software, 4(42):1315, 2019.

[28] Joshua L. Mike and Jose A. Perea. Geometric data analysis across scales via laplacian eigenvector cascading, 2019.

[29] Yuriy Mileyko, Sayan Mukherjee, and John Harer. Probability measures on the space of persistence diagrams. Inverse Problems, 27(12):124007, 2011.

[30] David Cohen-Steiner, Herbert Edelsbrunner, and John Harer. Stability of persistence diagrams. Discrete & Computational Geometry, 37(1):103–120, 2007.

[31] Frédéric Chazal, Vin De Silva, and Steve Oudot. Persistence stability for geometric complexes. Geometriae Dedicata, 173(1):193–214, 2014.

[32] Mikhail Gromov. Metric structures for Riemannian and non-Riemannian spaces. Creative Commons, 2007.
[33] Katharine Turner, Yuriy Mileyko, Sayan Mukherjee, and John Harer. Fréchet means for distributions of persistence diagrams. Discrete & Computational Geometry, 52(1):44–70, 2014.

[34] Atish Mitra and Žiga Virk. The space of persistence diagrams on n points coarsely embeds into Hilbert space. Proceedings of the American Mathematical Society, 149(6):2693–2703, 2021.

[35] Alexander Wagner. Nonembeddability of persistence diagrams with p > 2 Wasserstein metric. arXiv preprint arXiv:1910.13935, 2019.

[36] Peter Bubenik and Alexander Wagner. Embeddings of persistence diagrams into Hilbert spaces. Journal of Applied and Computational Topology, 4(3):339–351, 2020.

[37] Jan Reininghaus, Stefan Huber, Ulrich Bauer, and Roland Kwitt. A stable multi-scale kernel for topological machine learning. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4741–4748, 2015.

[38] Peter Bubenik. The persistence landscape and some of its properties. Topological Data Analysis, 15:77–102, 2020.

[39] Henry Adams, Sofya Chepushtanova, Tegan Emerson, Eric Hanson, Michael Kirby, Francis Motta, Rachel Neville, Chris Peterson, Patrick Shipman, and Lori Ziegelmeier. Persistence images: A stable vector representation of persistent homology, 2015.

[40] Aaron Adcock, Erik Carlsson, and Gunnar Carlsson. The ring of algebraic functions on persistence bar codes. Homology, Homotopy and Applications, 18(1):381–402, 2016.

[41] Luigi Caputi, Anna Pidnebesna, and Jaroslav Hlinka. Promises and pitfalls of topological data analysis for brain connectivity analysis. NeuroImage, 238:118245, 2021.

[42] Firas A. Khasawneh, Elizabeth Munch, and Jose A. Perea. Chatter classification in turning using machine learning and topological data analysis. IFAC-PapersOnLine, 51(14):195–200, 2018. 14th IFAC Workshop on Time Delay Systems TDS 2018.

[43] Sara Kališnik. Tropical coordinates on the space of persistence barcodes. Foundations of Computational Mathematics, 19:101–129, 2019.

[44] Sarah Tymochko, Elizabeth Munch, and Firas A. Khasawneh. Adaptive partitioning for template functions on persistence diagrams. In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA). IEEE, Dec 2019.

[45] J. A. Perea, E. Munch, and F. A. Khasawneh. Approximating continuous functions on persistence diagrams using template functions. CoRR, abs/1902.07190, 2019.

[46] Luis Polanco and J. A. Perea. Adaptive template systems: Data-driven feature selection for learning with persistence diagrams. IEEE ICMLA, 2019.

[47] Sarah Tymochko, Elizabeth Munch, and Firas A. Khasawneh. Adaptive partitioning for template functions on persistence diagrams. In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pages 1227–1234, 2019.

[48] Douglas Reynolds. Gaussian Mixture Models, pages 659–663. Springer US, Boston, MA, 2009.

[49] Abraham Smith, Paul Bendich, John Harer, and Jay Hineman. Supervised learning of labeled pointcloud differences via cover-tree entropy reduction. CoRR, abs/1702.07959, 2017.

[50] Leo Breiman. Random forests. Machine Learning, 45:5–32, 2001.

[51] D. Pickup, X. Sun, P. L. Rosin, R. R. Martin, Z. Cheng, Z. Lian, M. Aono, A. Ben Hamza, A. Bronstein, M. Bronstein, S. Bu, U. Castellani, S. Cheng, V. Garro, A. Giachetti, A. Godil, J. Han, H. Johan, L. Lai, B. Li, C. Li, H. Li, R. Litman, X. Liu, Z. Liu, Y. Lu, A. Tatsuma, and J. Ye. SHREC'14 track: Shape retrieval of non-rigid 3d human models. In Proceedings of the 7th Eurographics workshop on 3D Object Retrieval, EG 3DOR'14, pages 1–10. Eurographics Association, 2014.
[52] Ulrich Bauer, Michael Kerber, and Jan Reininghaus. Distributed computation of persistent homology, 2013.

[53] Paolo Sonego, Mircea Pacurar, Somdutta Dhir, Attila Kertész-Farkas, András Kocsor, Zoltán Gáspári, Jack A. M. Leunissen, and Sándor Pongor. A protein classification benchmark collection for machine learning. Nucleic Acids Research, 35:D232–D236, 2007.

[54] wwPDB consortium. Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Research, 47(D1):D520–D528, 10 2018.

[55] Christopher Tralie, Nathaniel Saul, and Rann Bar-On. Ripser.py: A lean persistent homology library for python. The Journal of Open Source Software, 3(29):925, Sep 2018.

[56] Xiang Bai, Xingwei Yang, Longin Jan Latecki, Wenyu Liu, and Zhuowen Tu. Learning context sensitive shape similarity by graph transduction. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 2009.

[57] Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci Data, 5(180161), 2018.

[58] Y. Chung, C. Hu, A. Lawson, and C. Smyth. Topological approaches to skin disease image analysis. In 2018 IEEE International Conference on Big Data (Big Data), pages 100–105, 2018.

[59] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: a llvm-based python jit compiler. In LLVM '15: Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, pages 1–6, Nov 2015.

[60] Mehmet Gönen and Ethem Alpaydin. Multiple kernel learning algorithms. Journal of Machine Learning Research, (12):2211–2268, 2011.

[61] L. Polanco and J. A. Perea. Coordinatizing data with lens spaces and persistent cohomology. Proceedings of the 31st Canadian Conference on Computational Geometry (CCCG), pages 49–57, 2019.

[62] Pawel Dlotko. Persistence representations. In GUDHI User and Reference Manual. GUDHI Editorial Board, 3.4.1 edition, 2021.

[63] Nathaniel Saul and Chris Tralie. Scikit-tda: Topological data analysis for python, 2019.

[64] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[65] Hendrik Jacob van Veen, Nathaniel Saul, David Eargle, and Sam W. Mangham. Kepler Mapper: A flexible Python implementation of the Mapper algorithm, October 2020.

[66] Bill Hoffman. CMake.

[67] Beman Dawes, David Abrahams, and Rene Rivera. Boost.

[68] Dmitry Morozov. Dionysus 2, 2023. https://www.mrzv.org/software/dionysus2/.

[69] Victor Eijkhout, Edmond Chow, and Robert van de Geijn. Introduction to High Performance Scientific Computing. Springer Science & Business Media, 2020.

[70] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, September 2020.
[71] John Paul Ryan, Sebastian Ament, Carla P. Gomes, and Anil Damle. The fast kernel transform. CoRR, abs/2106.04487, 2021.

[72] Nathaniel Saul. Persim 0.3.6: a scikit-tda project, 2019.

[73] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020.

[74] Jian Sun, Maks Ovsjanikov, and Leonidas Guibas. A concise and provably informative multi-scale signature based on heat diffusion. Eurographics Symposium on Geometry Processing, 28(5), 2009.

APPENDIX

(a) Original mapper graph (b) Optimized mapper graph
Figure A.1 Mapper graphs for the grades dataset for Experiment 1 before and after optimization. For this example, the typical optimization workflow was not successful.

(a) Original mapper graph (b) Optimized mapper graph
Figure A.2 Mapper graphs for the grades dataset for Experiment 2 before and after optimization. For this example, the typical optimization workflow was not successful.
Attribute: Description (Domain)
sex: student's sex (binary: female or male)
age: student's age (numeric: from 15 to 22)
school: student's school (binary: Gabriel Pereira or Mousinho da Silveira)
address: student's home address type (binary: urban or rural)
Pstatus: parent's cohabitation status (binary: living together or apart)
Medu: mother's education (numeric: from 0 to 4)
Mjob: mother's job (nominal)
Fedu: father's education (numeric: from 0 to 4)
Fjob: father's job (nominal)
guardian: student's guardian (nominal: mother, father or other)
famsize: family size (binary: ≤ 3 or > 3)
famrel: quality of family relationships (numeric: from 1 – very bad to 5 – excellent)
reason: reason to choose this school (nominal: close to home, school reputation, course preference or other)
traveltime: home to school travel time (numeric: 1 – < 15 min, 2 – 15 to 30 min, 3 – 30 min to 1 hour or 4 – > 1 hour)
studytime: weekly study time (numeric: 1 – < 2 hours, 2 – 2 to 5 hours, 3 – 5 to 10 hours or 4 – > 10 hours)
failures: number of past class failures (numeric: n if 1 ≤ n < 3, else 4)
schoolsup: extra educational school support (binary: yes or no)
famsup: family educational support (binary: yes or no)
activities: extra-curricular activities (binary: yes or no)
paidclass: extra paid classes (binary: yes or no)
internet: Internet access at home (binary: yes or no)
nursery: attended nursery school (binary: yes or no)
higher: wants to take higher education (binary: yes or no)
romantic: with a romantic relationship (binary: yes or no)
freetime: free time after school (numeric: from 1 – very low to 5 – very high)
goout: going out with friends (numeric: from 1 – very low to 5 – very high)
Walc: weekend alcohol consumption (numeric: from 1 – very low to 5 – very high)
Dalc: workday alcohol consumption (numeric: from 1 – very low to 5 – very high)
health: current health status (numeric: from 1 – very bad to 5 – very good)
absences: number of school absences (numeric: from 0 to 93)
G1: first period grade (numeric: from 0 to 20)
G2: second period grade (numeric: from 0 to 20)
G3: final grade (numeric: from 0 to 20)
Table A.1 Student-related variables from the grades dataset [24].
Figure A.3 Mapper graphs for the grades dataset for Experiment 4 before and after optimization: (a) original mapper graph; (b) optimized mapper graph. For this example, the typical optimization workflow was not successful.
Figure A.4 Mapper graphs for the grades dataset for Experiment 5 before and after optimization: (a) original mapper graph; (b) optimized mapper graph. For this example, the typical optimization workflow was not successful.
Figure A.5 Mapper graphs for Experiment 7 before and after optimization: (a) original mapper graph; (b) optimized mapper graph. For this example, the typical optimization workflow was not successful.
Figure A.6 Mapper graphs for Experiment 8 before and after optimization: (a) original mapper graph; (b) optimized mapper graph. For this example, the typical optimization workflow was not successful.
Figure A.7 Mapper graphs for Experiment 9 before and after optimization: (a) original mapper graph; (b) optimized mapper graph. For this example, the typical optimization workflow was not successful.
Figure A.8 Mapper graphs for Experiment 10 before and after optimization: (a) original mapper graph; (b) optimized mapper graph. For this example, the typical optimization workflow was not successful.
Figure A.9 Mapper graphs for Experiment 11 before and after optimization: (a) original mapper graph; (b) optimized mapper graph. For this example, the typical optimization workflow was not successful.
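For reference, the sketch below shows one way the attributes in Table A.1 might be loaded and prepared for a machine learning model. It is a minimal illustration and not the pipeline used in the experiments: it assumes the grades data is available as a semicolon-delimited CSV (the file name student.csv is a placeholder) with columns named exactly as in Table A.1, and it simply one-hot encodes the binary and nominal attributes while leaving numeric attributes unchanged.

# Minimal sketch for preparing the Table A.1 attributes; not the dissertation's pipeline.
# Assumptions: "student.csv" is a hypothetical semicolon-delimited file whose columns
# match the attribute names in Table A.1.
import pandas as pd

df = pd.read_csv("student.csv", sep=";")  # hypothetical file name and delimiter

# Binary and nominal attributes from Table A.1; numeric attributes are left as-is.
binary_cols = ["sex", "school", "address", "Pstatus", "famsize", "schoolsup",
               "famsup", "activities", "paidclass", "internet", "nursery",
               "higher", "romantic"]
nominal_cols = ["Mjob", "Fjob", "guardian", "reason"]

# One-hot encode the categorical attributes; drop_first avoids a redundant
# indicator column for each binary variable.
X = pd.get_dummies(df.drop(columns=["G1", "G2", "G3"]),
                   columns=binary_cols + nominal_cols, drop_first=True)

# G3 (final grade) is treated here as the prediction target.
y = df["G3"]
print(X.shape, y.shape)

Whether the interim grades G1 and G2 are kept as predictors or held out, as done above, is a modeling choice that should match the experiment being reproduced.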