TOPOLOGICAL DATA ANALYSIS DRIVEN FEATURE GENERATION IN MACHINE LEARNING MODELS

By Danielle Barnes

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computational Mathematics, Science, and Engineeringβ€”Doctor of Philosophy

2024

ABSTRACT

Topological data analysis (TDA) is an emerging field in data science, with origins in algebraic topology. I focus on two main disciplines of topological data analysis, mapper and persistent homology. Mapper is an algorithm to construct a graph that is similar to a Reeb graph [1], allowing for the abstraction of shape from data. Persistent homology is a way to measure features in a dataset, and returns a set of points (a persistence diagram) that represents the structure of the dataset. While both mapper and persistent homology are effective tools in their own right, a significant area of research includes using features created from these topological tools in machine learning algorithms. In this dissertation, I focus on advancing both theoretical and computational methods that allow the use of topological data analysis in machine learning algorithms. I have developed an extension to the mapper algorithm, named predictive mapper, that uses the eigenvectors of the graph Laplacian of the geometric realization of a mapper graph, allowing features to be created from a linear combination of the eigenvector values for use in machine learning. I have also contributed to teaspoon [2], an open source package for topological signal processing, by including new datasets and expanded functionality for featurization methods. Lastly, I have started the development of ceREEBerus, a Python package for working with Reeb graphs, while implementing a standardized software development framework for the Munch Lab.

Copyright by DANIELLE BARNES 2024

ACKNOWLEDGEMENTS

Many people provided the support, guidance, and humor required to complete a dissertation. Many thanks to my committee members, Dr. Brian O’Shea and Dr. H. Metin Aktulga, and special thanks to my chairs Dr. Liz Munch and Dr. Jose Perea. Special thanks to my running group, which has taken many iterations over the years, for keeping me sane by very insanely hopping in a van (sometimes with strangers) and running 200 miles across the country. And the most thanks to my husband, Adi, for always laughing and taking care of our dogs over the years when I was working, either at work or on my degree. Chapter 4 was originally co-authored by Dr. Luis Polanco and Dr. Jose Perea. Chapter 4 and 5 work was supported in part by Michigan State University through computational resources provided by the Institute for Cyber-Enabled Research, with assistance from Dr. Elizabeth Munch in providing background information for computing persistence diagrams using a directional transform. Chapter 5 work was also supported by and collaborative with many others in the Munch Lab and at Michigan State, including Dr. Luis Polanco, Dr. Jose Perea, Elena Wang, Pat Bills, Doug Krum, Morgan Patterson, Max Chumley, Sunia Tanweer, and Dr. Firas Khasawneh. Thank you to those outside of my committee who also read my dissertation and provided critical feedback, Dr. Tara Kilbride and Dr. Dakota Crisp. Additionally, the completion of this dissertation would not have been possible without the support of my supervisor, Jonathan Prantner. And finally, special thanks to the unsung heroes of this dissertation - Keurig, Nespresso, Starbucks, and my favorite coffee maker, Spinn.
TABLE OF CONTENTS

LIST OF SYMBOLS
CHAPTER 1 INTRODUCTION
CHAPTER 2 BACKGROUND
  2.1 Homology
  2.2 Persistent Homology
  2.3 Mapper
  2.4 Reeb Graphs
CHAPTER 3 PREDICTIVE MAPPER
  3.1 Spectral theory and motivation
  3.2 Optimized mapper graph
  3.3 Mapper embedding
  3.4 Mapping from 𝑋 to | ¯𝑀 |
  3.5 Predictive Mapper Pipeline
  3.6 Methods and Results
  3.7 Computation
  3.8 Discussion and Limitations
  3.9 Conclusion
CHAPTER 4 FEATURE GENERATION FOR MACHINE LEARNING
  4.1 Introduction
  4.2 Background
  4.3 Featurization Methods
  4.4 Datasets
  4.5 User Guide
  4.6 Results
  4.7 Computation Time of Features
  4.8 Discussion
  4.9 Conclusion
CHAPTER 5 OPEN SOURCE SOFTWARE CONTRIBUTIONS
  5.1 Open Source Software Development Framework
  5.2 Teaspoon
  5.3 CeREEBerus
CHAPTER 6 CONCLUSION
BIBLIOGRAPHY
APPENDIX
LIST OF SYMBOLS

𝑋 : Dataset, 𝑋 ∈ R^{𝑝×π‘š}
𝑋(𝑣𝑖) : Subset of data 𝑋 in each node 𝑣𝑖 ∈ 𝑉
𝐻(𝑀, π‘Œ) : Measure of how optimal a mapper graph 𝑀 is with respect to labels π‘Œ
|𝑋(𝑣𝑖)| : Number of points in 𝑋(𝑣𝑖)
𝑀 : A mapper graph
𝑉(𝑀) : Set of nodes in mapper graph 𝑀
¯𝑋 : Training set, ¯𝑋 βŠ‚ 𝑋
ˆ𝑋 : Testing set, ˆ𝑋 βŠ‚ 𝑋, ¯𝑋 ∩ ˆ𝑋 = βˆ…
¯𝑋(𝑣𝑖) : Subset of training data ¯𝑋 in each node 𝑣𝑖 ∈ 𝑉
¯𝑀 : Optimized mapper graph
𝑉( ¯𝑀) : Set of nodes in optimized mapper graph ¯𝑀
𝐴( ¯𝑀) : Adjacency matrix of ¯𝑀
¯𝑣𝑖 ∈ 𝑉( ¯𝑀) : Node 𝑖 of optimized mapper graph ¯𝑀
¯𝑋( ¯𝑣𝑖) : Test dataset points in node ¯𝑣𝑖
| ¯𝑋( ¯𝑣𝑖) ∩ ¯𝑋( ¯𝑣𝑗)| : Number of data points in ¯𝑋( ¯𝑣𝑖) ∩ ¯𝑋( ¯𝑣𝑗)
y : Eigenvectors of 𝐿y = πœ†π·y
U = {π‘ˆπ›Ό}π›Όβˆˆπ΄ : Finite open covering of 𝑋
𝑁(U) : Nerve of U
𝑓 : 𝑋 β†’ R : Continuous map from 𝑋 to R
𝑓^{βˆ’1}(π‘ˆπ›Ό) : Pullback cover of π‘ˆπ›Ό
Β―U : Refined pullback cover of π‘ˆπ›Ό
𝑁( Β―U) : Nerve of Β―U
|𝑁( Β―U)| : Geometric realization of the nerve of Β―U
𝜌(π‘₯) : Map from 𝑋 β†’ |𝑁(U)|. Also refers to the map from 𝑋 β†’ | ¯𝑀 |
Ξ¦ : Family of Gaussian distributions
𝛽 : Normalization constant
Β―Ξ¦ : Family of Gaussian distributions, normalized and a partition of unity
{𝑣0, 𝑣1, ..., π‘£π‘˜} = 𝑉 : Set of vertices of 𝑁( Β―U)
( Β―πœ‘0(π‘₯), Β―πœ‘1(π‘₯), ..., Β―πœ‘π‘˜(π‘₯)) : Barycentric coordinates defined by Β―Ξ¦
| ¯𝑀 | : Geometric realization of the nerve of ¯𝑀

CHAPTER 1 INTRODUCTION

This dissertation contributes to outstanding areas of Topological Data Analysis by extending theory for mapper [1], providing additional examples and datasets for use in persistent homology, and contributing to open source software to enable collaborative research. For mapper, the algorithmic result is an abstract simplicial complex with no direct translation into a real-valued vector space, which is a requirement for use in machine learning algorithms. I propose an extension to the current mapper algorithm that allows features to be computed from it, extending from an abstract graph into a real-valued vector space. For persistent homology, translation needs to similarly be done from the space of persistence diagrams into a real-valued vector space, and I focus on computational aspects of existing methods. Some methods can be costly from a computational perspective, limiting the ability to easily analyze large or complex datasets. To address this, I have extended the teaspoon [2] python software package to include vectorized and parallel options for specific machine learning features. This work was published as part of a review paper [3] that provided additional code for researchers to use for feature creation from persistence diagrams and example datasets for analysis.

Researchers commonly access these methods through open source software, so developing and maintaining easy-to-use and purpose-built software is an important challenge to support more widespread adoption of useful topological methods. I address these issues through contributions to open source software, and the development of additional packages and machine learning pipelines focused on using topological data analysis. This work includes moving to a standardized software development framework, with documentation and tutorials for collaborators to contribute.

Through this dissertation, I have made a meaningful contribution to topological methods, computational efficiency, and the ability for researchers to implement topological methods across different disciplines.
All code is available as open source, and many principles of standardized software development are applied, allowing for ease of maintenance and appropriate documentation, vital to researchers continuing to make contributions and improvements in topological methods.

CHAPTER 2 BACKGROUND

Intuitively, topology is the study of shape, and of specific interest is identifying various properties such as holes and connected components. These features allow differentiation between item types. For example, in the widely used MNIST [4] dataset, the number "8" has different topological features than the number "0". Specifically, the number "8" has one connected component and two holes, while the number "0" has one connected component and one hole. These correspond to homology groups, and specifically the persistence of these homology groups through filtrations gives a way to identify relevant topological features.

2.1 Homology

Homology concerns itself with useful properties of a simplicial complex and can provide information about the underlying space. We follow [5] in this section. First we define simplices. Let 𝑒0, 𝑒1, ..., π‘’π‘˜ be points in R^𝑑. A point π‘₯ = βˆ‘_{𝑖=0}^{π‘˜} πœ†π‘–π‘’π‘–, with each πœ†π‘– ∈ R, is an affine combination of the 𝑒𝑖 if βˆ‘_{𝑖=0}^{π‘˜} πœ†π‘– = 1. The affine hull is the set of affine combinations. It is a π‘˜-plane if the π‘˜ + 1 points are affinely independent, by which we mean that any two affine combinations, π‘₯ = βˆ‘ πœ†π‘–π‘’π‘– and 𝑦 = βˆ‘ 𝛿𝑖𝑒𝑖, are the same iff πœ†π‘– = 𝛿𝑖 for all 𝑖. The π‘˜ + 1 points are affinely independent iff the π‘˜ vectors 𝑒𝑖 βˆ’ 𝑒0, for 1 ≀ 𝑖 ≀ π‘˜, are linearly independent. In R^𝑑 we can have at most 𝑑 linearly independent vectors and therefore at most 𝑑 + 1 affinely independent points. An affine combination π‘₯ = βˆ‘ πœ†π‘–π‘’π‘– is a convex combination if all πœ†π‘– are non-negative. The convex hull is the set of convex combinations. A π‘˜-simplex is the convex hull of π‘˜ + 1 affinely independent points, 𝜎 = conv{𝑒0, 𝑒1, ..., π‘’π‘˜}. Example simplices are shown in Figure 2.1. A face of 𝜎 is the convex hull of a non-empty subset of the 𝑒𝑖, and is denoted as 𝜏, with 𝜏 ≀ 𝜎 meaning 𝜏 is a face of 𝜎. Using this, we next define a simplicial complex as a finite collection of simplices 𝐾 such that 𝜎 ∈ 𝐾 and 𝜏 ≀ 𝜎 implies 𝜏 ∈ 𝐾, and 𝜎, 𝜎0 ∈ 𝐾 implies 𝜎 ∩ 𝜎0 is either empty or a face of both. An example simplicial complex is in Figure 2.2.

Figure 2.1 Illustration of simplices: (a) 0-simplex, (b) 1-simplex, (c) 2-simplex.

The former definitions refer to a geometric simplicial complex, which is a simplicial complex in Euclidean space. This definition can be generalized to an abstract simplicial complex, which is a finite collection of sets 𝐴 such that 𝛼 ∈ 𝐴 and 𝛽 βŠ† 𝛼 implies 𝛽 ∈ 𝐴.

Figure 2.2 Example of a simplicial complex (left) and not a simplicial complex (right).

To understand similarities and differences of shapes, we define some additional terminology to understand homology groups, still following [5]. Let 𝐾 be a simplicial complex and 𝑝 a dimension. A 𝑝-chain is a formal sum of 𝑝-simplices in 𝐾, denoted 𝑐, with coefficients π‘Žπ‘– and 𝑝-simplices πœŽπ‘–, represented as 𝑐 = βˆ‘ π‘Žπ‘–πœŽπ‘–. The 𝑝-chains, along with addition, form chain groups denoted 𝐢𝑝.
The boundary of a 𝑝-simplex is the sum of its (𝑝 βˆ’ 1)-dimensional faces, so that π›Ώπ‘πœŽ = βˆ‘_{𝑗=0}^{𝑝} [𝑒0, ..., ˆ𝑒𝑗, ..., 𝑒𝑝], with 𝜎 = [𝑒0, 𝑒1, ..., 𝑒𝑝] and the hat denoting that 𝑒𝑗 is omitted. This definition extends to a 𝑝-chain so that 𝛿𝑝𝑐 = βˆ‘ π‘Žπ‘– π›Ώπ‘πœŽπ‘–. This is called the boundary map, 𝛿𝑝 : 𝐢𝑝 β†’ 𝐢_{π‘βˆ’1}. A chain complex is a sequence of chain groups connected by boundary maps, ... β†’ 𝐢_{𝑝+1} β†’ 𝐢𝑝 β†’ 𝐢_{π‘βˆ’1} β†’ ..., where the map from 𝐢𝑝 to 𝐢_{π‘βˆ’1} is 𝛿𝑝. A cycle is a 𝑝-chain with an empty boundary, 𝛿𝑝𝑐 = 0. The set of cycles is the kernel of 𝛿𝑝, ker 𝛿𝑝 = 𝑍𝑝. A boundary is a 𝑝-chain that is the boundary of a (𝑝 + 1)-chain, and the boundary group is denoted 𝐡𝑝. Finally, the 𝑝th homology group, denoted 𝐻𝑝, provides useful information in describing topological features. Each 𝑝 corresponds to a dimension, with 𝐻0 representing the number of connected components and 𝐻1 representing 1-dimensional holes. Formally, the 𝑝th homology group is the quotient group 𝐻𝑝 = ker 𝛿𝑝 / im 𝛿_{𝑝+1}. The rank of 𝐻𝑝 is called the 𝑝th Betti number, 𝛽𝑝, and provides the number of features in the 𝑝th homology group. This is important because homeomorphic spaces have the same Betti numbers. For example, the number "8" can be identified as the same type with different handwriting since 𝛽0 = 1 and 𝛽1 = 2. An illustration of equivalent homology groups is in Figure 2.3.

Figure 2.3 Homology groups 𝐻0 and 𝐻1 across different spaces.

2.2 Persistent Homology

Extending the idea from a single simplicial complex to an ordered sequence of simplicial complexes provides a way to quantify how prominent, or persistent, specific features are for a given data set. This provides motivation for the concept of persistent homology, which takes a function defined on a simplicial complex and quantifies the changes in homology classes as the sublevel sets grow with increasing value of the function [6]. Features that are more persistent are more likely to be attributed to features of the underlying space from which the data was sampled. For this section, we follow [6].

To measure the persistence of features, we start with a filtration. Following Definition 3.1 of [6], a filtration F = F(𝐾) of a simplicial complex 𝐾 is a nested sequence of its subcomplexes F : βˆ… = 𝐾0 βŠ† 𝐾1 βŠ† ... βŠ† 𝐾𝑛 = 𝐾. I focus on creating this filtration on real data sets in two ways. From Definition 2.10 of [6], let (𝑃, 𝑑) be a finite metric space. Given a real π‘Ÿ > 0, the Vietoris-Rips complex is the abstract simplicial complex VRπ‘Ÿ(𝑃) where a simplex 𝜎 ∈ VRπ‘Ÿ(𝑃) if and only if 𝑑(𝑝, π‘ž) ≀ 2π‘Ÿ for every pair of vertices of 𝜎. This is illustrated in Figure 2.4, where π‘Ÿ increases through the image. Another way to construct a filtration is through a sublevel set filtration, where if 𝑋 is a topological space and 𝑓 : 𝑋 β†’ R is a function, then the sublevel sets are π‘‹π‘Ž = 𝑓^{βˆ’1}(βˆ’βˆž, π‘Ž], π‘Ž ∈ R. Then for π‘Žπ‘– ≀ π‘Žπ‘— ≀ π‘Žπ‘˜ ≀ π‘Žπ‘™, the sublevel set filtration would be 𝑋0 βŠ† ... βŠ† 𝑋_{π‘Žπ‘–} βŠ† 𝑋_{π‘Žπ‘—} βŠ† 𝑋_{π‘Žπ‘˜} βŠ† 𝑋_{π‘Žπ‘™} βŠ† ... βŠ† 𝑋. This type of filtration is commonly used on image data, and an example is in Figure 2.5.

Following [5], we start with a filtration F(𝐾) with βˆ… = 𝐾0 βŠ† 𝐾1 βŠ† ... βŠ† 𝐾𝑛 = 𝐾. For every 𝑖 ≀ 𝑗 there is an inclusion map from the underlying space of 𝐾𝑖 to that of 𝐾𝑗, hence an induced homomorphism, 𝑓_𝑝^{𝑖,𝑗} : 𝐻𝑝(𝐾𝑖) β†’ 𝐻𝑝(𝐾𝑗), for each dimension 𝑝.
This filtration corresponds to a sequence of homology groups connected by homomorphisms, 0 = 𝐻𝑝(𝐾0) β†’ 𝐻𝑝(𝐾1) β†’ ... β†’ 𝐻𝑝(𝐾𝑛) = 𝐻𝑝(𝐾). The 𝑝-th persistent homology groups are the images of the homomorphisms, 𝐻_𝑝^{𝑖,𝑗} = im 𝑓_𝑝^{𝑖,𝑗} for 0 ≀ 𝑖 ≀ 𝑗 ≀ 𝑛. The ranks of these groups are the 𝑝-th persistent Betti numbers, 𝛽_𝑝^{𝑖,𝑗} = rank 𝐻_𝑝^{𝑖,𝑗}. The collection of persistent Betti numbers is then represented as a persistence diagram, 𝐷(𝑓) βŠ‚ RΒ², where 𝐷(𝑓) is the set of points (π‘Žπ‘–, π‘Žπ‘—), π‘Žπ‘– < π‘Žπ‘—, counted with multiplicity πœ‡_𝑝^{𝑖,𝑗} for 0 ≀ 𝑖 < 𝑗 ≀ 𝑛. The time at which the feature appears, π‘Žπ‘–, is known as the birth, and the time at which the feature disappears, π‘Žπ‘—, is the death. The assumed coordinate representation for a persistence diagram is in birth-death coordinates, which is the ordered pair (π‘Žπ‘–, π‘Žπ‘—). Each feature has a corresponding coordinate pair, with Figure 2.6 showing an example persistence diagram. Finally, by the Fundamental Lemma of Persistent Homology, the persistence diagrams encode the necessary information about persistent homology groups.

Figure 2.4 Given 𝑖 ≀ 𝑗 ≀ π‘˜ ≀ 𝑙, the Vietoris-Rips filtration VR_{π‘Ÿπ‘–}(𝑃) βŠ† VR_{π‘Ÿπ‘—}(𝑃) βŠ† VR_{π‘Ÿπ‘˜}(𝑃) βŠ† VR_{π‘Ÿπ‘™}(𝑃) is demonstrated in this image, with panels (a)-(d) showing VR_{π‘Ÿπ‘–}(𝑃) through VR_{π‘Ÿπ‘™}(𝑃). Note that in the last step, VR_{π‘Ÿπ‘™}(𝑃), the hole in 𝐻1 no longer persists.

Figure 2.5 Example MNIST observation after applying directional transforms, providing an application for sublevel set persistence.

Figure 2.6 The Rips persistence diagrams in dimensions 0 (blue) and 1 (orange), for a point cloud sampled around the unit circle.

Lemma 1 (Fundamental Lemma of Persistent Homology) Let βˆ… = 𝐾0 βŠ† 𝐾1 βŠ† ... βŠ† 𝐾𝑛 = 𝐾 be a filtration. For every pair of indices 0 ≀ π‘˜ ≀ 𝑙 ≀ 𝑛 and every dimension 𝑝, the 𝑝-th persistent Betti number is

𝛽_𝑝^{π‘˜,𝑙} = βˆ‘_{π‘–β‰€π‘˜} βˆ‘_{𝑗>𝑙} πœ‡_𝑝^{𝑖,𝑗}.

The sequence of homomorphisms can also be generalized to vector spaces, as adapted from [7]:

Definition 2.2.1 (Persistence modules) Let V denote a sequence of vector spaces and linear maps, of length 𝑛: 𝑉1 β†’ 𝑉2 β†’ ... β†’ 𝑉𝑛, where each arrow 𝑝𝑖 represents a map 𝑉𝑖 β†’ 𝑉_{𝑖+1}. The object V is called a persistence diagram of vector spaces, or simply a persistence module.

We can also define a persistence diagram for a persistence module, analogous to the above definition. Then a given persistence module decomposes uniquely into interval modules when the index set is finite. Specifically, Proposition 2 (Gabriel’s Theorem) from [6] guarantees a unique representation of persistence modules as birth-death pairs.

Proposition 2 (Gabriel’s Theorem) Any persistence module over a finite index set decomposes uniquely up to isomorphism into closed-closed interval modules.

In the context of features for use in machine learning models, the types of features required need to identify relevant differences and similarities between data points, and persistence diagrams provide a way to do so. These diagrams are a set of points in the plane, allowing many different computations and transformations into the required data structures for use in machine learning applications. Many featurization methods exist for working with persistence diagrams, as noted in [3]. Lastly, we follow [7] for zigzag persistence.
Zigzag persistence differs from sublevel set persistence in that instead of using a monotonically increasing sequence to create a sublevel set filtration, 𝑋0 βŠ† ... βŠ† 𝑋_{π‘Žπ‘–} βŠ† 𝑋_{π‘Žπ‘—} βŠ† 𝑋_{π‘Žπ‘˜} βŠ† 𝑋_{π‘Žπ‘™} βŠ† ... βŠ† 𝑋, we consider filtrations that β€˜zigzag’ to create zigzag modules. Assuming {𝐾𝑗} is a sequence of simplicial complexes in 𝐾, the input to zigzag persistence would be 𝐾0 ↔ 𝐾1 ↔ ... ↔ 𝐾𝑛, with ↔ being an inclusion to either the left or the right. An example of this is in Figure 2.7a, along with the corresponding persistence diagram in Figure 2.7b. The persistence diagram is computed using the Vietoris-Rips complex for a fixed radius π‘Ÿ, with the zigzag filtration varying.

Definition 2.2.2 (Zigzag modules) Let V denote a sequence of vector spaces and linear maps, of length 𝑛: 𝑉1 ↔ 𝑉2 ↔ ... ↔ 𝑉𝑛. Each arrow 𝑝𝑖 represents either a forward map 𝑓𝑖 β†’ or a backward map 𝑔𝑖 ←. The object V is called a zigzag diagram of vector spaces, or simply a zigzag module.

Similar to Proposition 2 for persistence modules, the following theorem guarantees a unique decomposition into birth-death coordinates of a zigzag module. A zigzag persistence module is a quiver representation, denoted V(𝑄), with 𝑄 = (𝑁, 𝐸) a quiver, or directed graph. When the graph is finite and linear shaped (like a zigzag module) it is known as 𝐴𝑛-type.

Figure 2.7 Example zigzag filtration (a) with the corresponding persistence diagram (b).

Proposition 3 Every quiver representation V(𝑄) for an 𝐴𝑛-type quiver 𝑄 has an interval decomposition. Furthermore, this decomposition is unique up to isomorphism and permutation of the intervals.

This then guarantees the unique existence of birth-death pairs for zigzag persistence modules.

2.3 Mapper

Mapper, first introduced in 2007 by Singh, Memoli, and Carlsson [1], is another tool for Topological Data Analysis, commonly used for visualization and data exploration. This original paper identified diabetes subtypes using data from the Miller-Reaven diabetes study as a proof of concept. The Miller-Reaven diabetes study used a dimensionality reduction method (projection pursuit [8]) to identify juvenile and adult onset diabetes as essentially different diseases, and using Mapper a similar result was observed; see Figure 2.8. The mapper algorithm has since been used widely as a data visualization tool across various domains, including for clustering and feature selection. The identification of loops and flares can be used to identify interesting clusters and help select variables that stratify data [9]. Some applications include breast cancer [10], voting [11], and sports [11]. While Mapper provides a way to explore, visualize, and reveal a lower dimensional structure of the data, there is no obvious and direct way to use information from Mapper in machine learning models.

Figure 2.8 Diabetes data from the Miller-Reaven study visualized using different mapper filters and different resolutions: (a) visualized using the first principal components as a filter; (b) visualized using the first principal components as a filter at a more granular resolution; (c) visualized using t-SNE as a filter.

Next we review the topological motivation and theoretical framework for Mapper from [1]. Intuitively, the goal is for mapper to produce a graph that resembles the original space.
For example, applying the mapper algorithm to a point cloud sampled from an annulus should result in a graph that is circular. Assume we are given a continuous map 𝑓 : 𝑋 β†’ R and a finite covering U = {π‘ˆπ›Ό}π›Όβˆˆπ΄; we note that the sets 𝑓^{βˆ’1}(π‘ˆπ›Ό) form an open cover of 𝑋. We then decompose each 𝑓^{βˆ’1}(π‘ˆπ›Ό) into path connected components, giving a covering of 𝑋 referred to as Β―U. Mapper is then defined as the nerve of this cover, 𝑁( Β―U) [12]. This is illustrated in Figure 2.9a, and discussed further in Section 3.4.

For mapper to be applied to real data, we use a statistical construction that is analogous to the topologically-motivated algorithm. First, we assume we have a sample of 𝑛 data points 𝑋 from a metric space, a function 𝑓 : 𝑋 β†’ R known for all 𝑛 points, and a distance matrix 𝐷. The function 𝑓 : 𝑋 β†’ R is called a filter, and is analogous to the function from the topological version. After applying the filter function, a covering U of 𝑓(𝑋) is determined. Intervals are computed for the covering, specified by parameters β„“ and 𝑝, where β„“ is the length and 𝑝 is the percentage overlap of the intervals of R. We can then construct a covering, U = {π‘ˆπ›Ό}π›Όβˆˆπ΄. For each π‘ˆπ›Ό ∈ U, we take the set {𝑓^{βˆ’1}(π‘ˆπ›Ό)}π›Όβˆˆπ΄, which defines a cover of 𝑋. We then have subsets of 𝑋 such that 𝑋𝛼 = 𝑓^{βˆ’1}(π‘ˆπ›Ό) and subsets of 𝐷, denoted 𝐷𝛼, corresponding to each 𝑋𝛼.

Figure 2.9 The topological (a) vs. statistical (b) version of Mapper.

Once the data and distance matrix are partitioned into subsets, a clustering algorithm, using each 𝐷𝛼, is used locally to determine the number of clusters. The chosen clustering algorithm should also not require specifying the number of clusters in advance [1]. Some examples of appropriate clustering algorithms are single-linkage clustering [13] and HDBScan [14]. These clusters construct a new covering, a refined pullback cover, denoted Β―U. A node is created for each cluster, and edges are added between nodes which share a data point in their respective clusters. The statistical version of mapper is then the nerve of the refined pullback cover. The algorithm to construct the statistical version of mapper is then as follows:

1: Given a dataset 𝑋, filter function 𝑓, distance matrix 𝐷, interval length β„“, partition overlap 𝑝, and local clustering algorithm π‘Ž
2: Take 𝑓(𝑋) βŠ‚ R and, using β„“ and 𝑝, create a covering of 𝑓(𝑋), U = {π‘ˆπ›Ό}π›Όβˆˆπ΄
3: Take the pullback cover, {𝑓^{βˆ’1}(π‘ˆπ›Ό)}π›Όβˆˆπ΄, to get 𝑋𝛼 = 𝑓^{βˆ’1}(π‘ˆπ›Ό) and subsets of 𝐷, denoted 𝐷𝛼, corresponding to each 𝑋𝛼
4: Using 𝐷𝛼, perform local clustering using π‘Ž to create a refined pullback cover, Β―U
5: For each cluster, create a vertex (0-dimensional simplex), which is designated as a node
6: For each node, if the intersection of the data points with a neighboring node is non-empty, add an edge (1-simplex) between that node and the neighboring node

The statistical construction takes point cloud data as input, and returns a graph that shows the shape and features of the underlying data with respect to the filter function. Mapper also extends to a d-dimensional version, with details available in the original paper [1]. We refer to the statistical version of Mapper as 𝑀(𝑋). This is illustrated in Figure 2.9b.
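To make the construction above concrete, the following is a minimal Python sketch of the statistical mapper algorithm, not the implementation used in this dissertation. The function name mapper_graph and its parameters are illustrative; scikit-learn's single-linkage AgglomerativeClustering stands in for any clustering algorithm that does not require the number of clusters in advance (such as single-linkage clustering or HDBScan), and the cover is simplified to evenly spaced overlapping intervals on the filter values.

```python
import itertools
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def mapper_graph(X, filter_values, n_intervals=8, overlap=0.5, distance_threshold=1.0):
    """Minimal sketch of the statistical mapper construction (illustrative only).

    X              : (n, m) point cloud
    filter_values  : (n,) values of the filter f on the points of X
    n_intervals    : number of cover intervals on the range of f
    overlap        : fractional overlap p between consecutive intervals (0 <= p < 1)
    """
    fmin, fmax = filter_values.min(), filter_values.max()
    length = (fmax - fmin) / n_intervals        # interval length l
    step = length * (1.0 - overlap)             # spacing between interval starts

    nodes = []                                  # each node is a set of point indices
    start = fmin
    while start < fmax:
        # pull back one cover interval U_alpha to the data
        idx = np.where((filter_values >= start) & (filter_values <= start + length))[0]
        if len(idx) > 1:
            # local clustering; single linkage stands in for any suitable clusterer
            clusterer = AgglomerativeClustering(
                n_clusters=None, linkage="single", distance_threshold=distance_threshold
            )
            labels = clusterer.fit_predict(X[idx])
            for lab in np.unique(labels):
                nodes.append(set(idx[labels == lab]))
        elif len(idx) == 1:
            nodes.append(set(idx))
        start += step

    # add an edge (1-simplex) whenever two nodes share at least one data point
    edges = [
        (i, j)
        for i, j in itertools.combinations(range(len(nodes)), 2)
        if nodes[i] & nodes[j]
    ]
    return nodes, edges
```

In practice the filter, interval length, overlap, and clusterer are all dataset-dependent choices, as discussed above.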
2.4 Reeb Graphs

Another important tool for topological data analysis is the Reeb graph, which provides a simplified view of potentially complex shapes while retaining important information on connected components. The use of Reeb graphs has been popularized in computer science [15] and provides utility for 3D shape estimation. Additionally, mapper is a discretized version of the Reeb graph. With the existence of metrics that calculate distances between Reeb graphs, such as the bottleneck distance and interleaving distance [16], similarity of Reeb graphs, and hence of the underlying spaces, can be quantified. To better understand the representation Reeb graphs provide, consider the torus, shown in Figure 2.10a, and a Reeb graph of the height function on the torus, shown in Figure 2.10b. Following [6], the Reeb graph is defined formally as follows, with an example in Figure 2.11.

Figure 2.10 Reeb graph representation of a torus: (a) torus plotted in 3 dimensions; (b) Reeb graph of the torus.

Definition 2.4.1 (Reeb Graph) Let 𝑋 be a topological space with a function 𝑓 : 𝑋 β†’ R. Define an equivalence relation ∼ on 𝑋 by asserting π‘₯ ∼ 𝑦 iff (i) 𝑓(π‘₯) = 𝑓(𝑦) = 𝛼 and (ii) π‘₯ and 𝑦 belong to the same connected component of the level set 𝑓^{βˆ’1}(𝛼). Let [π‘₯] denote the equivalence class of π‘₯ ∈ 𝑋. The Reeb graph R_𝑓 of 𝑓 : 𝑋 β†’ R is the quotient space 𝑋/∼, i.e. the set of equivalence classes equipped with the quotient topology. Let Ξ¦ : 𝑋 β†’ R_𝑓, π‘₯ ↦ [π‘₯], denote the quotient map.

Through this construction, the Reeb graph preserves relationships between the connected components of level sets [17]. The Mapper algorithm is a generalization of the Reeb graph construction for real-valued datasets; however, algorithms for their construction have only become available recently [18], allowing additional applications for Reeb graphs and computations on Reeb graphs. In this dissertation, I focus on the basics of working with Reeb graphs, and assume the Reeb graph has been provided as input.

Figure 2.11 Reeb graph R_𝑓 of the function 𝑓 : 𝑋 β†’ R. Figure inspired by [6].

CHAPTER 3 PREDICTIVE MAPPER

A primary area of research for my dissertation is work on the mapper algorithm, both in extending available software and in the definition and development of predictive mapper. Predictive mapper extends the mapper algorithm to create geometrically-motivated features for use in machine learning models. For a review of the original mapper algorithm, refer to Section 2.3. When creating the predictive mapper algorithm, there are multiple items that must be developed to create useful features for machine learning. The mapper graph needs to be structured to represent the different classes to be predicted, and must be optimal in a sense to be defined later. There needs to be a way to create real-valued vectors from the mapper graph, and finally a mapping must exist from the original dataset to the mapper graph. Using predictive mapper as an extension to the original mapper algorithm, each of these areas is addressed, allowing for prediction using information from mapper output. Before I discuss the algorithm, I review some spectral theory and motivation. Next I discuss each of the three components of the algorithm, and end with two examples on datasets and a discussion.

3.1 Spectral theory and motivation

To provide a theoretical justification for predictive mapper, I will first review a proposition and fact from spectral graph theory, and then discuss prior use of a related algorithm, Laplacian eigenmaps.
This background will be important when I discuss the mapper graph optimization and mapper embedding in Sections 3.2 and 3.3, respectively. For this, I follow [19] until noted otherwise, and start with the notation required for this section. Let 𝐺 = (𝑉, 𝐸) be a similarity graph with vertex set 𝑉 = {𝑣1, ..., 𝑣𝑛} and edge set 𝐸, with weights 𝑀𝑖𝑗 β‰₯ 0 representing the similarity between the vertices connected by each edge. The weighted adjacency matrix of the graph is π‘Š = (𝑀𝑖𝑗)_{𝑖,𝑗=1,...,𝑛}. Assume 𝐺 is undirected, so that 𝑀𝑖𝑗 = 𝑀𝑗𝑖. The degree of a vertex 𝑣𝑖 ∈ 𝑉 is

𝑑𝑖 = βˆ‘_{𝑗=1}^{𝑛} 𝑀𝑖𝑗.   (3.1)

The degree matrix 𝐷 is the diagonal matrix with degrees 𝑑𝑖 on the diagonal. Given a subset of vertices 𝐴 βŠ‚ 𝑉, the complement is ¯𝐴 = 𝑉 \ 𝐴. One can measure the size of a subset 𝐴 in two ways:

|𝐴| ≔ the number of vertices in 𝐴,   (3.2)
vol(𝐴) ≔ βˆ‘_{π‘–βˆˆπ΄} 𝑑𝑖.   (3.3)

3.1.1 Spectral Theory

Moving into the spectral theory, disconnected and weakly connected components of a graph 𝐺 can be separated by the eigenvectors of the graph Laplacian. The Laplacian matrix is defined as 𝐿 = 𝐷 βˆ’ π‘Š, where 𝐷 is the degree matrix and π‘Š is the weighted adjacency matrix.

Proposition 4 (Number of connected components and the spectrum of 𝐿) Let 𝐺 be an undirected graph with non-negative weights. Then the multiplicity π‘˜ of the eigenvalue 0 of the Laplacian 𝐿 equals the number of connected components 𝐴1, ..., π΄π‘˜ in the graph. The eigenspace of eigenvalue 0 is spanned by the indicator vectors 1_{𝐴1}, ..., 1_{π΄π‘˜} of those components.

From the proof of Proposition 4, the Laplacian matrix 𝐿 has block diagonal form. Assume there are π‘˜ connected components in 𝐺; then the Laplacian matrix is

𝐿 = diag(𝐿1, 𝐿2, ..., πΏπ‘˜),   (3.4)

a block diagonal matrix in which each block 𝐿𝑖 is the graph Laplacian corresponding to the subgraph of the 𝑖-th connected component. The Laplacian 𝐿 as a whole provides the indicator functions for connected components in a graph, using the first eigenvector of each block 𝐿𝑖.

Beyond the Laplacian matrix 𝐿 = 𝐷 βˆ’ π‘Š, there are other variants of this matrix called normalized graph Laplacians, with the former referred to as the unnormalized graph Laplacian. For our purposes, the random walk Laplacian is

πΏπ‘Ÿπ‘€ = 𝐷^{βˆ’1}𝐿.   (3.5)

For completeness, there is another normalized Laplacian, called πΏπ‘ π‘¦π‘š, with details available in [19]. A property of πΏπ‘Ÿπ‘€ I use is that πœ† is an eigenvalue of πΏπ‘Ÿπ‘€ with eigenvector 𝑒 if and only if πœ† and 𝑒 solve the generalized eigenvalue problem 𝐿𝑒 = πœ†π·π‘’. It is notable that the authors of [19] provide a few arguments for using the random walk Laplacian, and it is the Laplacian used throughout this work. From the graph partitioning point of view (the mincut problem), using πΏπ‘Ÿπ‘€ maximizes within-cluster similarity in contrast to the other Laplacians. Additionally, the normalized Laplacians do not have the issues with convergence that are seen in the unnormalized Laplacian. There is also a proposition analogous to Proposition 4.

Proposition 5 (Number of connected components and the spectrum of πΏπ‘Ÿπ‘€) Let 𝐺 be an undirected graph with non-negative weights. Then the multiplicity π‘˜ of the eigenvalue 0 of πΏπ‘Ÿπ‘€ equals the number of connected components 𝐴1, ..., π΄π‘˜ in the graph.
The eigenspace of eigenvalue 0 is spanned by the indicator vectors 1_{𝐴1}, ..., 1_{π΄π‘˜} of those components.

To demonstrate this, note in Figure 3.1 a graph with 2 connected components and the corresponding eigenvectors of the graph Laplacian. This shows the correspondence to the block structure of the eigenvectors of the graph Laplacian and how these vectors can be used as indicator functions on the graph.

Figure 3.1 The eigenvectors of the graph Laplacian provide indicator functions for the different components of the graph: (a) graph with 2 components; (b) eigenvectors of 𝐿 corresponding to (a).

3.1.2 Laplacian eigenmaps

We see related work in [20]. The authors use a Laplacian eigenmap embedding as a lower dimensional representation of the original dataset. A graph 𝐺 is constructed from the original dataset using either πœ–-neighborhoods or 𝑛 nearest neighbors. Given points π‘₯1, ..., π‘₯π‘˜ ∈ R^𝑙, an πœ–-neighborhood connects nodes 𝑖, 𝑗 when ||π‘₯𝑖 βˆ’ π‘₯𝑗||Β² < πœ–, πœ– ∈ R. Alternatively, using 𝑛 nearest neighbors, nodes 𝑖, 𝑗 are connected by an edge if 𝑖 is among the 𝑛 ∈ N nearest neighbors of 𝑗 (or 𝑗 is among the 𝑛 nearest neighbors of 𝑖). An edge weight is selected, for example from the heat kernel, π‘Šπ‘–π‘— = 𝑒^{βˆ’||π‘₯𝑖 βˆ’ π‘₯𝑗||Β²/𝑑}, or by setting π‘Šπ‘–π‘— = 1 if vertices 𝑖 and 𝑗 are connected. Then for each connected component of 𝐺 the eigenvalues and eigenvectors for 𝐿y = πœ†π·y are computed, where 𝐷 is the diagonal weight matrix, 𝐷𝑖𝑖 = βˆ‘_𝑗 π‘Šπ‘—π‘–, and 𝐿 = 𝐷 βˆ’ π‘Š is the Laplacian matrix. The data is then embedded into a lower dimensional space given by (y1(𝑖), ..., yπ‘š(𝑖)), where each y𝑗 is an eigenvector, ordered by eigenvalue, with y1 having the smallest eigenvalue.

3.1.3 Mincut Problem

Additionally, the eigenvectors of each 𝐿𝑖 provide a solution to the mincut problem. The mincut problem is a way to partition a graph into separate components by removing the edge(s) with the minimum weight. Given a graph with weighted adjacency matrix π‘Š, define π‘Š(𝐴, 𝐡) ≔ βˆ‘_{π‘–βˆˆπ΄, π‘—βˆˆπ΅} 𝑀𝑖𝑗, with ¯𝐴 the complement of 𝐴. For π‘˜ subsets, the mincut problem chooses a partition 𝐴1, ..., π΄π‘˜ to minimize

cut(𝐴1, ..., π΄π‘˜) ≔ (1/2) βˆ‘_{𝑖=1}^{π‘˜} π‘Š(𝐴𝑖, ¯𝐴𝑖).   (3.6)

This can also be modified for other problems, with the Ncut problem having a direct relationship with the random walk Laplacian. The Ncut problem minimizes

Ncut(𝐴1, ..., π΄π‘˜) ≔ (1/2) βˆ‘_{𝑖=1}^{π‘˜} π‘Š(𝐴𝑖, ¯𝐴𝑖)/vol(𝐴𝑖) = βˆ‘_{𝑖=1}^{π‘˜} cut(𝐴𝑖, ¯𝐴𝑖)/vol(𝐴𝑖).   (3.7)

When π‘˜ = 2, the second eigenvector of πΏπ‘Ÿπ‘€ provides a solution to a relaxed version of the Ncut problem. When π‘˜ > 2, the first π‘˜ eigenvectors provide a solution. In Figure 3.2, note that the second eigenvector separates the graph into two distinct regions, where each region is strongly connected to other nodes in the region and only weakly connected to the other region. This also translates to Figure 3.3, where the second eigenvector for the block matrix of each connected component differentiates areas of that component of the graph with the minimal edge weights.

3.2 Optimized mapper graph

The overall goal of optimizing the mapper object as defined earlier is to have weak connections between areas of the graph that represent different classes, and strong connections between areas of heavy overlap.
Figure 3.2 The eigenvectors of the graph Laplacian demonstrate the solution to the mincut problem, as the second eigenvector separates the graph by minimal edge weight: (a) connected graph colored by the second eigenvector of the graph Laplacian; (b) eigenvalues of 𝐿 corresponding to (a).

Figure 3.3 The eigenvectors of the graph Laplacian demonstrate the solution to the mincut problem, as the eigenvectors separate based on minimum edge weights: (a) eigenvalues of 𝐿 for the graph in (b); (b) graph colored by eigenvector 1 of 𝐿; (c) graph colored by eigenvector 5 of 𝐿.

Very recent related work to optimize a mapper graph is available in [21], where the authors use an algorithm named 𝐺-mapper to determine an optimal cover of a mapper graph. The variant I use is a bit different and is as follows:

Definition 3.2.1 Let 𝑋(𝑣𝑖) be the points associated with node 𝑣𝑖 ∈ 𝑉(𝑀), with 𝑉(𝑀) the set of nodes in mapper graph 𝑀. Let |𝑋(𝑣𝑖)𝑦| be the number of data points in 𝑣𝑖 with label 𝑦, and let 𝑓(𝑣𝑖, 𝑦) = |𝑋(𝑣𝑖)𝑦|/|𝑋(𝑣𝑖)|. Let 𝑁𝑦 be the number of nodes in 𝑀 which contain points with label 𝑦. Then

𝐻(𝑀, π‘Œ) = βˆ‘_{π‘£βˆˆπ‘‰(𝑀)} βˆ‘_{π‘¦βˆˆπ‘Œ} 𝑓(𝑣, 𝑦)/𝑁𝑦   (3.8)

provides a measure on a mapper graph, and the mapper graph is considered optimal when 𝐻(𝑀, π‘Œ) = |π‘Œ|.

In the prior definition, if each node 𝑣𝑖 ∈ 𝑉(𝑀) only contains one label, then 𝐻(𝑀, π‘Œ) is optimal. 𝐻(𝑀, π‘Œ) can be thought of as an overall quality measure on 𝑀, representing the percentage of outcomes in each node. Consider Figure 3.4 as an example. Each color and shape corresponds to a different outcome variable, 𝑦 ∈ π‘Œ. In all three graphs |π‘Œ| = 3, so to check whether each graph is optimal, compute 𝐻(𝑀, π‘Œ). For Figure 3.4a, 𝐻(𝑀, π‘Œ) = 3, so this is an optimal graph. For Figure 3.4b, 𝐻(𝑀, π‘Œ) = 2.5, so this graph is not optimal. Lastly, for Figure 3.4c, 𝐻(𝑀, π‘Œ) = 1.93. Visually it is possible to distinguish that Figure 3.4c is less optimal than Figure 3.4b, and that information is reflected in the metric 𝐻(𝑀, π‘Œ).

Figure 3.4 Example of (a) an optimal mapper graph where 𝐻(𝑀, π‘Œ) = |π‘Œ| = 3, and (b)-(c) two suboptimal mapper graphs with 𝐻(𝑀, π‘Œ) = 2.5 for (b) and 𝐻(𝑀, π‘Œ) = 1.93 for (c).

To create an optimal mapper graph in practice, construction relies on various parameter choices during the initial mapper construction, and 𝐻(𝑀, π‘Œ) was computed on the mapper graph before optimization and after optimization as a comparison. This could be used as a metric during initial mapper graph creation; however, I did not use it at that part of the pipeline. Figure 3.5 shows the mapper algorithm applied to the iris dataset [22], highlighted by the proportion of each species of iris in each node. In Figure 3.5a there is complete separation of the Setosa class from the others, but the separation between Versicolor and Virginica is not as obvious.

Figure 3.5 Example mapper graph using the iris dataset with the percentage of classifications in each node highlighted for (a) Setosa, (b) Versicolor, and (c) Virginica. 𝐻(𝑀, π‘Œ) = 2.53.

To create an optimal mapper graph, and for utility in a predictive pipeline, I first build a mapper graph, 𝑀, using ¯𝑋 βŠ‚ 𝑋 ∈ R^{𝑝×π‘š}, with 𝑋 a dataset with 𝑝 observations and π‘š features. I also split 𝑋 into a training and test set, denoted ¯𝑋 and ˆ𝑋 respectively.
Then using mapper graph 𝑀, the set of nodes 𝑉(𝑀), and the set of datasets associated with each node, ¯𝑋(𝑣𝑖) for 𝑣𝑖 ∈ 𝑉(𝑀), a clustering algorithm is applied to the subset of the data associated with each node. A new mapper object, ¯𝑀, is then constructed using the new clusters to construct a larger node set, 𝑉( ¯𝑀). The adjacency matrix, 𝐴( ¯𝑀), is constructed by forming an edge between two nodes when points are in the intersection; that is, for nodes ¯𝑣𝑖, ¯𝑣𝑗 ∈ 𝑉( ¯𝑀) with 𝑖 β‰  𝑗, 𝐴_{𝑖,𝑗}( ¯𝑀) = | ¯𝑋( ¯𝑣𝑖) ∩ ¯𝑋( ¯𝑣𝑗)|. This provides a weighted adjacency matrix with weights equal to the size of the intersection of points. What this means is that after building a mapper graph through the normal pipeline, better separation can be achieved via localized clustering that targets larger nodes with more variation in data distributions, and specifically variation in a desired outcome variable. Using the iris dataset as an example, this is illustrated in Figure 3.6.

Figure 3.6 Before and after applying localized clustering to a mapper graph to achieve better separation between classes: (a) percentage of Versicolor in each node before additional clustering, 𝐻(𝑀, π‘Œ) = 2.53; (b) percentage of Versicolor in each node after additional clustering, 𝐻(𝑀, π‘Œ) = 3.

While this type of result may be achievable through the use of various mapper parameters, this additional way to cluster allows for more targeted post-processing of a mapper graph when the usual pipeline is not yielding a desirable result. An example of how the usual mapper process may fail is in Figure 3.7.

Figure 3.7 This example shows (a) a point cloud sampled from 2 annuli, (b) the mapper graph using the original pipeline, (c) the mapper graph after optimizing, and (d) the mapper graph with additional clusters in the original pipeline to create more separation. Note the two annuli were not recovered in the original mapper pipeline.

It should be noted that while using different filter functions may recover a desirable mapper graph, many trials of mapper graph creation with the iris dataset did not result in separation between the Versicolor and Virginica classes. The outcome variable was also used in the original mapper graph creation in an attempt to create additional separation, without much success.

3.3 Mapper embedding

Using both spectral theory and Laplacian eigenmaps as motivation for predictive mapper, I construct the mapper embedding portion of the algorithm. Through optimization of the mapper graph, the goal is to have each node in the mapper graph contain a single class, and be disconnected or weakly connected to nodes of different classes. Then the eigenvectors of the graph Laplacian of the optimized mapper graph represent the different components, and hence the different classes. Using the optimized mapper graph ¯𝑀, the generalized eigenvalue problem 𝐿y = πœ†π·y is solved for y, resulting in an embedding of ¯𝑀 given by [y0, ..., y𝑛]. This embedding represents the classes through connected components and the separation of weakly connected components.

3.4 Mapping from 𝑋 to | ¯𝑀 |

The last item needed is a mapping from the dataset to the mapper graph so that these eigenvectors can be used as an embedding for both ¯𝑋 and ˆ𝑋. The goal of this construction is shown in Figure 3.8. To understand the following construction, I quickly review [1] for the topological motivation, and then construct the statistical version.
The nerve of an open covering U of a space 𝑋, denoted 𝑁(U), is the abstract simplicial complex whose vertices are the elements of U and whose simplices are the finite subcollections {π‘ˆ1, ..., π‘ˆπ‘›} of U such that π‘ˆ1 ∩ π‘ˆ2 ∩ ... ∩ π‘ˆπ‘› β‰  βˆ….

Topological Motivation. Given a finite covering U = {π‘ˆπ›Ό}π›Όβˆˆπ΄ of a space 𝑋 and a continuous map 𝑓 : 𝑋 β†’ R, the refinement of the pullback cover {𝑓^{βˆ’1}(π‘ˆπ›Ό)} into path connected components gives a covering of 𝑋 referred to as Β―U. Mapper is then defined as the nerve of this cover, 𝑁( Β―U). Also recall a simplicial complex from Section 2.1. The nerve 𝑁( Β―U) can be embedded in Euclidean space, and this is the geometric realization of the nerve, denoted |𝑁( Β―U)|. I will construct a map 𝜌 : 𝑋 β†’ |𝑁( Β―U)|.

Figure 3.8 Topological vs. statistical version of Mapper with the additional mapping 𝜌(π‘₯) required for predictive mapper: (a) the topological version of mapper with an additional mapping 𝜌(π‘₯); (b) the statistical version of mapper with an additional mapping 𝜌(π‘₯).

Definition 3.4.1 (Partition of unity) A partition of unity subordinate to the finite open covering U is a family of real valued functions {πœ‘π›Ό : 𝑋 β†’ R}π›Όβˆˆπ΄ with the following properties:
β€’ 0 ≀ πœ‘π›Ό(π‘₯) ≀ 1 for all 𝛼 ∈ 𝐴 and π‘₯ ∈ 𝑋.
β€’ βˆ‘_{π›Όβˆˆπ΄} πœ‘π›Ό(π‘₯) = 1 for all π‘₯ ∈ 𝑋.
β€’ The closure of the set {π‘₯ ∈ 𝑋 | πœ‘π›Ό(π‘₯) > 0} is contained in the open set π‘ˆπ›Ό.

An example partition of unity is shown in Figure 3.9b. A family of probability density functions could be used to construct a partition of unity. Consider the Gaussian probability density function πœ™_{πœ‡,𝜎}(π‘₯) = (1/𝛽) exp(βˆ’(1/2)((π‘₯ βˆ’ πœ‡)/𝜎)Β²), with mean πœ‡, standard deviation 𝜎, and 𝛽 a normalization constant. Next define a family of functions

Ξ¦ = {πœ‘0(π‘₯), πœ‘1(π‘₯), πœ‘2(π‘₯), πœ‘3(π‘₯)}   (3.9)

with

πœ‘0(π‘₯) = exp(βˆ’(1/2)((βˆ’4 βˆ’ π‘₯)/1)Β²)   (3.10)
πœ‘1(π‘₯) = exp(βˆ’(1/2)((2 βˆ’ π‘₯)/2)Β²)   (3.11)
πœ‘2(π‘₯) = exp(βˆ’(1/2)((8 βˆ’ π‘₯)/2)Β²)   (3.12)
πœ‘3(π‘₯) = exp(βˆ’(1/2)((βˆ’10 βˆ’ π‘₯)/2)Β²)   (3.13)

Let 𝛽 = πœ‘0(π‘₯) + πœ‘1(π‘₯) + πœ‘2(π‘₯) + πœ‘3(π‘₯), and take U to cover the interval (βˆ’18, 16) with π‘ˆ0 = (βˆ’8, 0), π‘ˆ1 = (βˆ’6, 10), π‘ˆ2 = (0, 16), and π‘ˆ3 = (βˆ’18, 2). Then the family of functions Β―Ξ¦ given by

Β―πœ‘0 = πœ‘0(π‘₯)/𝛽   (3.14)
Β―πœ‘1 = πœ‘1(π‘₯)/𝛽   (3.15)
Β―πœ‘2 = πœ‘2(π‘₯)/𝛽   (3.16)
Β―πœ‘3 = πœ‘3(π‘₯)/𝛽   (3.17)

is a partition of unity on (βˆ’18, 16). It can be checked that this construction satisfies the definition of a partition of unity.

Using {𝑣0, 𝑣1, ..., π‘£π‘˜} = 𝑉 to denote the vertices of 𝑁( Β―U), there exists the barycentric coordinatization, which is a bijection of the points of the geometric simplex to the set of ordered tuples (π‘Ÿ0, π‘Ÿ1, ..., π‘Ÿπ‘˜) ∈ R^{π‘˜+1} such that 0 ≀ π‘Ÿπ‘– ≀ 1 and βˆ‘_{𝑖=0}^{π‘˜} π‘Ÿπ‘– = 1. Each π‘Ÿπ‘– is a barycentric coordinate. For a given partition of unity Β―Ξ¦, define 𝜌(π‘₯) ∈ |𝑁(U)| as the point in the simplex spanned by the vertices 𝑉, with barycentric coordinates ( Β―πœ‘0(π‘₯), Β―πœ‘1(π‘₯), ..., Β―πœ‘π‘˜(π‘₯)). This 𝜌(π‘₯) is the mapping required from 𝑋 β†’ |𝑁( Β―U)|.

Figure 3.9 A family of functions Ξ¦ (a) before and (b) after normalization to create a partition of unity on the range [βˆ’17.99, 15.9].
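The construction above can be checked numerically. The following short Python sketch builds the normalized family Φ̄ and verifies the partition-of-unity property over the range shown in Figure 3.9; the means and widths are read off Equations 3.9-3.13 and should be treated as illustrative rather than exact.

```python
import numpy as np

# Means and widths of the Gaussian bumps, read off Equations 3.9-3.13
# (illustrative values; treat them as an assumption of this sketch).
means = np.array([-4.0, 2.0, 8.0, -10.0])
sigmas = np.array([1.0, 2.0, 2.0, 2.0])


def bump(x, mu, sigma):
    """Unnormalized Gaussian bump phi_i(x)."""
    return np.exp(-0.5 * ((mu - x) / sigma) ** 2)


def partition_of_unity(x):
    """Return the normalized family (phi_bar_0(x), ..., phi_bar_3(x))."""
    phi = np.array([bump(x, m, s) for m, s in zip(means, sigmas)])
    beta = phi.sum(axis=0)          # the normalization constant beta
    return phi / beta


xs = np.linspace(-17.99, 15.9, 200)           # the range used in Figure 3.9
phi_bar = partition_of_unity(xs)
assert np.allclose(phi_bar.sum(axis=0), 1.0)  # the functions sum to 1 everywhere
```

The same pattern, with multivariate Gaussians centered on the data in each mapper node, is used in the statistical construction that follows.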
Statistical Construction. Using this, I construct a partition of unity to build the map 𝑃 : 𝑋 β†’ | ¯𝑀 |, where | ¯𝑀 | is the geometric realization of ¯𝑀. Given ¯𝑋(𝑣𝑖), the subset of data ¯𝑋 in each node 𝑣𝑖 ∈ 𝑉( ¯𝑀) with 0 ≀ 𝑖 ≀ π‘˜, summary statistics are computed on each set ¯𝑋( ¯𝑣𝑖). Let Ξ¦ = {πœ‘0, πœ‘1, ..., πœ‘π‘˜} be a family of probability distributions with

πœ‘π‘–(π‘₯) = exp(βˆ’(1/2)(π‘₯ βˆ’ πœ‡_{ ¯𝑣𝑖})^𝑇 Ξ£_{ ¯𝑣𝑖}^{βˆ’1}(π‘₯ βˆ’ πœ‡_{ ¯𝑣𝑖}))   (3.18)
𝛽 = βˆ‘_𝑖 πœ‘π‘–(π‘₯)   (3.19)
Β―πœ‘π‘–(π‘₯) = πœ‘π‘–(π‘₯)/𝛽   (3.20)

with πœ‡_{ ¯𝑣𝑖} the mean of ¯𝑋( ¯𝑣𝑖) and Ξ£_{ ¯𝑣𝑖} the covariance of ¯𝑋( ¯𝑣𝑖). Then Β―Ξ¦ = { Β―πœ‘0, Β―πœ‘1, ..., Β―πœ‘π‘˜} is a partition of unity that can provide a barycentric coordinatization of points onto the mapper graph, i.e. a point Λ†π‘₯ ∈ ˆ𝑋 has a barycentric coordinatization in | ¯𝑀 | given by ( Β―πœ‘0( Λ†π‘₯), Β―πœ‘1( Λ†π‘₯), ..., Β―πœ‘π‘˜( Λ†π‘₯)).

Finally, using this barycentric coordinatization together with the eigenvectors y, features can be created that are a linear combination of the values for each eigenvector. Taking a single eigenvector, y𝑗 ∈ R^π‘˜, the eigenvector feature for Λ†π‘₯ is

βˆ‘_{𝑖=0}^{π‘˜} y_{𝑗𝑖} Β―πœ‘π‘–( Λ†π‘₯).   (3.21)

This is the eigenvector value associated with each node, multiplied by the probability of inclusion for Λ†π‘₯ in each node. This has the effect that for nodes with a small or zero probability of inclusion for Λ†π‘₯, the value of the eigenvector at that node provides minimal contribution to the feature, whereas higher probability nodes provide higher contributions to the feature. Note that ¯𝑋 is a random sample of 𝑋, and 𝑋 is a sample from some underlying distribution. One can approximate the probability distribution function with any of our random samples: 𝑋, ¯𝑋, ˆ𝑋. Error bounds will differ, but this construction then holds for any random sample of 𝑋.

3.5 Predictive Mapper Pipeline

The optimized mapper graph, mapper embedding, and mapping from 𝑋 to ¯𝑀 provide the predictive mapper pipeline, which I summarize here. The following inputs are required to create a mapper graph and consequently the predictive mapper pipeline: training data ¯𝑋, testing data ˆ𝑋, filter function 𝑓, and distance matrix or metric 𝑑. Additionally, the following parameters are required: a clustering algorithm, partition length β„“, percent overlap in the partition 𝑝, and a clustering algorithm for mapper optimization. The predictive mapper construction is then as follows:

β€’ Construct 𝑀, a mapper graph, using ¯𝑋, 𝑓, 𝑑, the clustering algorithm, 𝑝, and β„“
β€’ Optimize 𝑀 with localized clustering, using the clustering algorithm for mapper optimization on nodes, so that if ¯𝑀 is the optimized version of 𝑀, then 𝐻(𝑀, π‘Œ) < 𝐻( ¯𝑀, π‘Œ)
β€’ Construct the weighted adjacency matrix of ¯𝑀 such that for ¯𝑣𝑖, ¯𝑣𝑗 with 𝑖 β‰  𝑗, 𝐴_{𝑖,𝑗}( ¯𝑀) = | ¯𝑋( ¯𝑣𝑖) ∩ ¯𝑋( ¯𝑣𝑗)|
β€’ Compute the eigenvectors y of the generalized eigenvalue problem 𝐿y = πœ†π·y
β€’ Taking ¯𝑋(𝑣𝑖), the subset of data ¯𝑋 in each node 𝑣𝑖 ∈ 𝑉, compute summary statistics and the partition of unity functions Β―Ξ¦
β€’ For each π‘₯ ∈ ¯𝑋 and π‘₯ ∈ ˆ𝑋, compute the probability of inclusion in each node using Β―Ξ¦
β€’ Create features as a linear combination of eigenvectors and the partition of unity, i.e. for each 𝑗, βˆ‘_{𝑖=0}^{π‘˜} y_{𝑗𝑖} Β―πœ‘π‘–( Λ†π‘₯)

This pipeline provides an approach to creating features from a mapper graph that can be used in machine learning models. An example of the full pipeline is shown in Figure 3.10.
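As a companion to the pipeline above, the following is a compact Python sketch of the feature construction in Equations 3.18-3.21, assuming the optimized mapper graph and its per-node summary statistics have already been computed. The function and argument names are illustrative, each node covariance is assumed invertible, and the degree matrix is assumed nonsingular (every node has at least one edge).

```python
import numpy as np
from scipy.linalg import eigh


def eigenvector_features(x_hat, A, node_means, node_covs, n_features=3):
    """Sketch of the eigenvector features in Equation 3.21 for one point x_hat.

    x_hat      : (m,) a single data point (training or testing)
    A          : (k, k) weighted adjacency matrix of the optimized mapper graph
    node_means : length-k list of node means mu_i
    node_covs  : length-k list of node covariances Sigma_i (assumed invertible)
    """
    k = A.shape[0]
    D = np.diag(A.sum(axis=1))      # degree matrix (assumed nonsingular)
    L = D - A                       # unnormalized graph Laplacian

    # generalized eigenvalue problem L y = lambda D y, eigenvalues ascending
    _, Y = eigh(L, D)

    # partition of unity: unnormalized Gaussian membership for each node
    phi = np.empty(k)
    for i in range(k):
        diff = x_hat - node_means[i]
        phi[i] = np.exp(-0.5 * diff @ np.linalg.solve(node_covs[i], diff))
    phi_bar = phi / phi.sum()       # barycentric coordinates of x_hat

    # Equation 3.21: one feature per eigenvector, sum_i y_{ji} * phi_bar_i(x_hat)
    return Y[:, :n_features].T @ phi_bar
```

Mapping every point of ¯𝑋 and ˆ𝑋 through a function like this yields the eigenvector feature matrices evaluated in the experiments below.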
Figure 3.10 A simulated example of the predictive mapper pipeline: (a) point cloud sampled from two annuli; (b) construction of the pullback cover; (c) refined pullback cover; (d) mapper graph; (e) cover from the mapper optimization step; (f) optimized mapper graph; (g) two test points in 𝑋; (h) probability of inclusion in each node for the point on the right; (i) probability of inclusion in each node for the point on the left; (j) eigenvectors of ¯𝑀; (k) eigenvector features of the point on the right represented on the mapper graph; (l) eigenvector features of the point on the left represented on the mapper graph.

Figure 3.11 Mapper experiment design.

3.6 Methods and Results

Using two datasets, the famous iris [23] dataset and a student performance [24] dataset, features created from the predictive mapper algorithm were assessed for accuracy in different machine learning models. The iris dataset was used as a familiar dataset during algorithm development and to identify potential areas to test, while the student performance dataset was used to explore potential use cases for predictive mapper features. To create a comparison for the performance of the predictive mapper features, the same machine learning models were evaluated on a feature set created from predictive mapper and on the feature set from the original data. To create a true comparison for performance, 50% of the data was used to build a mapper model, and 50% was used as a testing set to validate model performance. The dataset structure for building the mapper model and building and testing the model is shown in Figure 3.11.

For both datasets, the clustering algorithm used during the initial mapper fit was HDBScan [14], and the chosen clustering algorithm to optimize mapper was Bayesian Gaussian mixture modeling [25]. HDBScan was chosen as it meets the requirements of [1], and Bayesian Gaussian mixture modeling was chosen as computation of means and standard deviations is part of the algorithm, the distributions are assumed normal, and the number of clusters can be inferred from the data. Both datasets also report performance in terms of original data, eigenvector features, and augmented features. The augmented data uses both the original data and the eigenvector features to assess model performance.

3.6.1 Iris

The iris [23] dataset is one of the earliest known datasets used in machine learning models, and can be traced back to scientist R.A. Fisher in 1936. The dataset has 4 features that measure properties of the iris plant, with a classification of plant type. There are three classes with 50 instances each, making the dataset perfectly balanced for testing machine learning tasks. One of the classes is also easily separable from the other two, making iris an excellent candidate for evaluating and constructing the predictive mapper algorithm.

The predictive mapper algorithm was run on the iris dataset using three different mapper graphs, all with the same parameters in mapper graph creation, using a principal component analysis filter. The original mapper graph is in Figure 3.12, colored by the percentage of the outcome variable, and complete separation between Setosa and the other two classes is observed in the original graph.

Figure 3.12 Original iris mapper graphs for mapper graph 0, colored by the percentage of each node in the (a) Setosa, (b) Versicolor, and (c) Virginica classes.
To optimize the mapper graph, Bayesian Gaussian mixture modeling was applied to the original mapper graph; the mapper graphs after optimization are shown in Figure 3.13, and each class is associated with a different connected component.

Figure 3.13 Optimized iris mapper graphs for mapper graph 0, colored by the percentage of each node in the (a) Setosa, (b) Versicolor, and (c) Virginica classes.

The probability of inclusion in each node for all data points was computed, and then the features were created using the partition of unity and the eigenvector basis. Example graphs of the probability of inclusion and eigenvector features are shown for a selected point in Figure 3.14. For both images, the graph on the left highlights three nodes with a non-zero probability of inclusion for a sample data point. This is then reflected in the eigenvector features, with variation in the same nodes and the nodes in the same connected component. Components with a zero probability of inclusion for the data point do not show variation in the eigenvector features.

Three different optimized mapper graphs were assessed, with the results included in Table 3.2. 𝐻(𝑀, π‘Œ) was measured before and after optimization for each and is reported in Table 3.1. In each of the three trials a more optimal mapper graph was created through the predictive mapper pipeline.

Trial            |𝑉(𝑀)|   𝐻(𝑀, π‘Œ)   |𝑉( ¯𝑀)|   𝐻( ¯𝑀, π‘Œ)
Mapper Graph 0   14       2.53       30        3.00
Mapper Graph 1   16       2.55       33        3.00
Mapper Graph 2   15       2.43       33        3.00

Table 3.1 Iris mapper graph comparison before and after optimization.
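For reference, the before-and-after values of 𝐻(𝑀, π‘Œ) reported in Table 3.1 can be computed directly from Definition 3.2.1. The short Python sketch below assumes the labels of the training points in each node are available as Python lists; the function name mapper_quality and its inputs are illustrative.

```python
from collections import Counter


def mapper_quality(node_labels, classes):
    """Sketch of H(M, Y) from Definition 3.2.1.

    node_labels : list where node_labels[i] holds the outcome labels of the
                  training points in node v_i
    classes     : the set of possible labels Y
    """
    # N_y: number of nodes containing at least one point with label y
    N = {y: sum(1 for labels in node_labels if y in labels) for y in classes}
    H = 0.0
    for labels in node_labels:
        counts = Counter(labels)
        for y, count in counts.items():
            f_vy = count / len(labels)   # f(v, y): fraction of the node with label y
            H += f_vy / N[y]
    return H                             # equals |Y| when every node is pure
```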
The dataset was originally evaluated in [26] where the original authors predicted three different outcomes 34 Model Random Forest (n=5) SVM (linear, c=1) SVM (linear, c=10) SVM (rbf) KNN (n=5) Ridge (alpha=1) Ridge (alpha=10) Logistic Regression Random Forest (n=5) SVM (linear, c=1) SVM (linear, c=10) SVM (rbf) KNN (n=5) Ridge (alpha=1) Ridge (alpha=10) Logistic Regression Model Random Forest (n=5) SVM (linear, c=1) SVM (linear, c=10) SVM (rbf) KNN (n=5) Ridge (alpha=1) Ridge (alpha=10) Logistic Regression Iris Mapper Object 0 Original Data 92.5% 95.6 95.8 95.4 94.7 82.9% 80.8% 95.9% Iris Mapper Object 1 94.3% 96.6 95.1 95.7 95.2 82.4% 83.6% 93.5% Iris Mapper Object 2 Original Data 96.2% 97.3 97.3 97.4% 97.2% 79.7% 74.9% 97.1% Eigenvector 93.5% 93.5 92.9 93.3 93.5 93.1% 93.2% 93.1% 94.1% 93.6 93.1 92.4 84.0% 94.0% 93.9% 94.1% Eigenvector 94.8% 94.7 94.5% 96.5% 89.7% 94.6% 94.8% 94.4% Augmented 93.6% 93.8 93.5 93.3 93.5 93.3% 93.4% 93.2% 94.7% 93.6 92.2 92.8 92.2% 94.1% 93.9% 93.6% Augmented 95.1% 94.6 94.7% 96.5% 95.5% 94.6% 94.8% 94.5% Table 3.2 Accuracy from Iris mapper models. using a variation of machine learning methods. The variables are in Table A.1 taken from [26] and important to understanding both the original classification methods used and those explored in this paper. The original analysis predicted G3 using all variables, using all variables except G2, and using all variables except G1 and G2. The third problem, predicting the final grade (G3) without first (G1) and second (G2) period grades was much harder, and the focus of the predictive mapper analysis. Data was prepared for analysis by creating indicator 35 variables for each of the binary outcomes, creating a binary indicator for pass or fail for G3, and oversampling the failed class. In the original dataset of 649 observations, 100 of the observations were classified as fail for G3, or 15.4 % of the sample having failed. Data was split into a training and testing of 50% each, and then the failed class was oversampled in the training set to create a balanced dependent variable. The testing set was not modified. The 50% of the data used as the training set was used to build a mapper object and was used to train various models. To create some variability, data from the held out testing set was included to train the model (50%) and not used to report on model accuracy, and the average of 100 runs is reported. To also create variation in the mapper graphs, there were 12 different mapper graphs generated and evaluated, all with different datasets. For direct comparisons, mapper graphs were created with the same parameters from different data. The parameters of each experiment is outlined in Table 3.3. Full results are included here for trials 0-5. Trial Filter PCA 0 PCA 1 PCA 2 Eccentricity (5) 3 Eccentricity (5) 4 Eccentricity (5) 5 tsneX (5) 6 tsneX (5) 7 tsneX (5) 8 tsneX (5) 9 tsneX (5) 10 tsneX (5) 11 Partitions Overlap 8 8 8 8 8 8 8 8 8 12 12 12 50% 50% 50% 25% 25% 25% 50% 50% 50% 30% 30% 30% Original Nodes 20 (H = 1.29) 16 (H = 1.14) 20 (H = 1.28) 20 (H = 1.27) 21 (H = 1.32) 22 (H = 1.18) 24 (H = 1.34) 22(H = 1.24) 18 (H = 1.26) 20 (H = 1.07) 14 (H = 1.03) 16 (H = 1.02) Optimized Nodes 28 (H = 1.25) 26 (H = 1.09) 26 (H = 1.22) 28 (H = 1.26) 29 (H = 1.26) 28 (H = 1.14) 38 (H = 1.30) 33 (H = 1.21) 25 (H = 1.21) 49 (H = 1.20) 37 (H = 1.10) 44 (H = 1.08) Table 3.3 Parameters for student performance data mapper graph comparison. 
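For reference, the 50/50 split with oversampling of the failed class described above could be implemented along the following lines; the binary target column name and the random seed are illustrative assumptions, not the thesis code.

```python
# Sketch of the 50/50 split with oversampling of the failed class described
# above; the target column name "G3_fail" and the seed are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

def split_and_oversample(df: pd.DataFrame, target: str = "G3_fail", seed: int = 0):
    # 50% of the data is held out for testing and left untouched.
    train, test = train_test_split(df, test_size=0.5, stratify=df[target],
                                   random_state=seed)
    majority = train[train[target] == 0]
    minority = train[train[target] == 1]
    # Oversample the failed class in the training set only.
    minority_up = resample(minority, replace=True, n_samples=len(majority),
                           random_state=seed)
    balanced_train = pd.concat([majority, minority_up]).sample(frac=1,
                                                               random_state=seed)
    return balanced_train, test
```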
The standard predictive mapper pipeline was followed for the student performance dataset, and the optimality of each graph was measured with 𝐻(𝑀, π‘Œ). The same optimization process followed for the iris dataset was used, and for many of the graphs it did not result in a more optimal mapper graph. To create a direct comparison to the iris dataset, one can normalize by the number of classes. The iris mapper graphs all had a normalized optimal score of 1. The score for the student performance dataset in any mapper graph, optimized or not, was not close to that range: the minimum normalized score was .5 and the maximum was .67. Based on the normalized optimal scores, it is also doubtful there is a relevant difference between features created from the original and the optimized mapper graphs for this dataset.

Using the optimized mapper graphs, a total of 8 machine learning models were run, and the partition of unity matrix before the eigenvector change of basis was also used as a feature set. "Eigenvector +" denotes the original dataset augmented with eigenvector features, and "Partition +" denotes the original dataset augmented with the features from the partition of unity. While evaluating this dataset initially, the eigenvector features performed intermittently in terms of accuracy, but very consistently predicted that all students passed. When this was observed, the partition of unity features were also evaluated to better understand whether they could be useful in prediction. Based on the lack of performance from the eigenvector features for this dataset, the parameters used to compute the partition of unity were modified to see if better-performing features could be obtained. Given this, the student dataset results should absolutely be treated as exploratory in nature.

Trials 0 - 2 For the first three trials (trials 0, 1, and 2) the first principal component was used as a filter for creation of the mapper object. Results for these trials are shown in Table 3.4. In two of the three mapper graphs, some combination involving the eigenvector features or partition of unity features outperformed the original dataset features alone for all models. In the results table, βˆ— denotes no recall for the failed class in the dataset (𝐺3 = 1). I see this happening specifically with the eigenvector features.
This suggests there may be 37 Student performance Mapper Graph 0 Eigenvector Partition Eigenvector+ Partition+ Model Random Forest (n=5) Random Forest (n=50) Random Forest (n=100) SVM (rbf) KNN (n=5) Ridge (alpha=1) Ridge (alpha=10) Logistic Regression Original Data 82.7% 84.1% 84.3% 83.2% 82.7% 81.4% 81.5% 81.7% 78.3% 79.0% 80.0% 84.6%βˆ— 78.3% 84.6%βˆ— 84.6%βˆ— 84.6%βˆ— 83.1% 81.0% 84.4% 81.5% 81.4% 84.7% 85.6% 82.3% 81.0% 83.1% 85.6% 81.5% 85.5% 81.6% 85.3% 81.7% Student performance Mapper Graph 1 Random Forest (n=5) Random Forest (n=50) Random Forest (n=100) SVM (rbf) KNN (n=5) Ridge (alpha=1) Ridge (alpha=10) Logistic Regression 83.5% 83.4% 85.7% 83.6% 85.8% 83.7% 88.0% 84.6%βˆ— 83.6% 83.4% 85.0 81.2% 85.0% 81.4% 84.9% 81.4% 83.0% 83.1% 83.2% 84.6%βˆ— 83.0% 84.9% 84.9% 84.9% Student performance Mapper Graph 2 Random Forest (n=5) Random Forest (n=50) Random Forest (n=100) SVM (rbf) KNN (n=5) Ridge (alpha=1) Ridge (alpha=10) Logistic Regression 81.7% 84.1% 84.3% 83.5% 81.7% 77.4% 77.4% 77.9% 80.6% 80.9% 80.8% 82.8% 80.6% 84.3 84.3% 84.3% 80.9% 81.2% 81.2% 83.1% 80.9% 84.3% 84.3% 84.3% 84.5% 85.2% 85.0% 85.5% 84.5% 84.8% 85.2% 84.2% 83.3% 84.3% 84.3% 85.0% 83.3% 85.4% 85.4% 83.6% 83.5% 84.5% 84.5% 85.1% 83.5% 82.5% 82.6% 81.1% 84.3% 85.4% 85.3% 85.4% 84.3% 84.8% 85.2% 84.2% 83.3% 84.5% 84.5% 85.0% 83.3% 85.4% 85.4% 83.6% Table 3.4 Results from trials 0, 1, and 2 for the student performance dataset models using mapper. information loss in the predictive mapper pipeline that is useful for this machine learning task. Each of the mapper graphs had slightly different features, and the original mapper graph for a representative graph along with the optimized mapper graphs is shown in Figure 3.15, with the other two for this trial in the Appendix as Figure A.1 and Figure A.2. Trial 3 - 5 For the next trials, eccentricity was used as a filter for creation of the mapper 38 (a) Original mapper graph. (b) Optimized mapper graph. Figure 3.15 Mapper graphs for Experiment 0 before and after optimization. For this example, the typical optimization work flow was not successful. object. A representative optimized mapper graph is shown in Figure 3.16, with the other two in the Appendix as Figure A.3 and Figure A.4. (a) Original mapper graph. (b) Optimized mapper graph. Figure 3.16 Mapper graphs for Experiment 3 before and after optimization. For this example, the typical optimization work flow was not successful. Results are shown in Table 3.5. Note the βˆ— designated denotes the feature set and model were not able to identify any failed observations in the data. For the third mapper graph random forest models, the partition+ features seem to perform better than the original dataset. This may be due to the original dataset overfitting the models, and the partition features underfitting the models. I also observed the eigenvector features for the third mapper graph is not able to recall any of the failed class, further supporting this graph may 39 Student performance Mapper Graph 3 Eigenvector Partition Eigenvector+ Partition+ be underfitting the data. 
Model Random Forest (n=5) Random Forest (n=50) Random Forest (n=100) SVM (rbf) KNN (n=5) Ridge (alpha=1) Ridge (alpha=10) Logistic Regression Original Data 82.7% 84.5%βˆ— 84.5%βˆ— 83.8% 84.5%βˆ— 83.9% 84.6%βˆ— 82.7% 84.6%βˆ— 82.1% 84.6%βˆ— 78.2% 84.6%βˆ— 78.1% 84.6%βˆ— 78.9% 71.6% 71.3% 71.3% 81.2% 71.6% 81.1% 81.1% 80.6% 82.7% 83.8% 83.8% 81.8% 82.7% 80.5% 80.5% 81.2% Student performance Mapper Graph 4 Random Forest (n=5) Random Forest (n=50) Random Forest (n=100) SVM (rbf) KNN (n=5) Ridge (alpha=1) Ridge (alpha=10) Logistic Regression 82.7% 83.4% 85.9% 83.2% 86.0% 83.4% 82.7% 88.1% 83.6% 83.4% 84.6%βˆ— 81.3% 84.6%βˆ— 81.4% 84.6%βˆ— 81.4% 83.4% 81.3% 85.2% 85.8% 86.0% 81.4% 88.6% 75.7% 81.3% 83.4% 83.4% 83.2% 83.4% 83.2% 83.5% 82.0% Student performance Mapper Graph 5 Random Forest (n=5) Random Forest (n=50) Random Forest (n=100) SVM (rbf) KNN (n=5) Ridge (alpha=1) Ridge (alpha=10) Logistic Regression 84.6%βˆ— 82.7% 85.6% 84.6%βˆ— 84.6%βˆ— 85.9% 84.6%βˆ— 85.4% 84.6%βˆ— 82.7% 84.6%βˆ— 80.0% 84.6%βˆ— 80.0% 84.6%βˆ— 78.6% 81.5% 81.5% 81.4% 84.6%βˆ— 81.5% 70.2% 84.6%βˆ— 84.4% 83.2% 85.6% 85.8% 85.5% 83.2% 81.6% 81.2% 81.4% 82.7% 84.2% 84.3% 84.1% 82.7% 84.1% 84.0% 82.2% 84.1% 86.0% 86.0% 88.5% 84.1% 83.4% 84.3% 83.4% 83.3% 85.2% 85.3% 85.4% 83.3% 84.5% 84.2% 84.2% Table 3.5 Results from trials 3, 4, and 5 for the student performance dataset models using mapper. Trial 6 - 12 Additional models were run with this set of mapper graphs to allow for more extensive evaluation of performance. The mapper graphs were created using tSne as a filter, and the focus was on creating nodes in the mapper graph that are more heavily weighted to either passing or failing. Before and after optimization of an example mapper graph is shown in Figure 3.17. 40 (a) Original mapper graph. (b) Optimized mapper graph. Figure 3.17 Mapper graphs for Experiment 6 before and after optimization. For this example, the typical optimization work flow was not successful. Additional figures (Figure A.5 and Figure A.6 and results from these trials can be found in the Appendix. Figures from mapper graphs 9, 10, 11 are available as Figure A.7, Figure A.8 and Figure A.9 respectively. These results do not differ from the mapper graphs created using other filters, and the eigenvector features are unable to identify the failing class. Results for these are not formally reported as they do not change the interpretation or outcome. 3.7 Computation Work for predictive mapper was completed in python. This was done through a combi- nation of custom functions for prediction and an extension of Kepler mapper [27]. Mapper graphs and notebooks are available for models run for the Iris dataset. Future work could include the incorporation of these functions into the Kepler mapper software. 3.8 Discussion and Limitations Results of predictive mapper for the Iris dataset show promise, and further research is needed into similar use cases where predictive mapper may have usefulness. Based on the student performance dataset, predictive mapper may not be useful with unbalanced datasets and machine learning models where recall is a difficult task. The student performance dataset may also require specialized filters that extract the relevant information from the 41 dataset. Even with significant modification of features in an attempt to extract better performance, predictive overall using eigenvector or partition of unity features was not useful. 
It is important to note that creating an apples-to-apples comparison of model performance between predictive mapper features and the original dataset features has the potential for bias, based on the way the mapper models are created and the need to have a similar amount of information available for each model. To minimize this issue, multiple mapper models were built using the same parameters for each dataset from different random samples of data. Model parameters were not optimized during model fitting beyond running with a few different parameters, and accuracy for each model is reported using the same parameters for both the original dataset and the predictive mapper features. This could significantly impact the performance of both the original and predictive mapper models.

A main area with significant impact on the performance of the predictive mapper models is the creation of the mapper graph used to create features, along with the parameters for feature creation. For the development of each mapper graph used to assess model performance, subjective decisions were made to optimize model performance, with the end goal of developing useful features for machine learning. To create predictive features, decisions on mapper graph parameters and optimization, as well as a suitable way to map test observations onto the mapper object, must be made; these decisions are subjective and depend heavily on the dataset. An average or stable mapper graph could be a potential avenue to explore to mitigate current issues with individual mapper graphs in predictive pipelines.

Related work on Laplacian eigenvectors was done in [28]. This work crafts an algorithm for eigenvector cascading, and shows that important dataset features persist over different scales. The idea would need to be adapted across samples of data rather than across scale, although scale could also be a relevant feature.

Outstanding work in the predictive mapper algorithm includes additional work on creating an optimal mapper graph through filters and additional localized clustering, and exploration of performance with different probability distribution functions and additional data sets. The ideal goal is to find specific areas where mapper features can provide useful predictive information that is not available in the original feature set, and high dimensional datasets may benefit from an additional feature reduction technique. Based on prior work with Laplacian eigenmaps, machine learning problems requiring a reduced feature set may be a setting where predictive mapper would be a good solution. Further research into how optimal a mapper graph can be before overfitting becomes an issue for modeling is also needed. A formal assessment of how well a particular mapping from 𝑋 β†’ | ¯𝑀 | performs for a dataset would be a final area of investigation to increase the utility of predictive mapper. Further notes for another potential application are below; this is an addition that was discovered near the completion of this dissertation.

3.8.1 Potential for Clustering Applications

An additional, barely explored avenue for predictive mapper could be as an algorithm for clustering and assigning new data points to existing clusters. Although in a very preliminary stage, predictive mapper seems to outperform spectral clustering in identifying classes for noisy half-moon shapes. An example is shown in Figure 3.18 of both the original data and the mapper graph with complete separation.
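The spectral clustering baseline used in the comparison reported next could be set up roughly as follows; the sample size and noise level below are illustrative assumptions, not the data shown in Figure 3.18.

```python
# Sketch of a spectral clustering baseline on noisy half moons (sample size
# and noise level are assumptions, not the data shown in Figure 3.18).
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score

X, y = make_moons(n_samples=500, noise=0.35, random_state=0)

labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                            n_neighbors=10, random_state=0).fit_predict(X)

# Cluster labels are arbitrary, so report the better of the two label matchings.
acc = max(accuracy_score(y, labels), accuracy_score(y, 1 - labels))
print(f"spectral clustering accuracy: {acc:.2f}")
```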
Using spectral clustering (nearest neighbors = 10), 81% of the points in Figure 3.18a were identified correctly, while predictive mapper with 50% of the data for mapper graph creation and an additional 25% of the data as training data provided an accuracy of 84% (depending on the method used). It should be noted that this work is very preliminary, and it is not assumed that different clustering parameters would not yield better results than predictive mapper. Additional work is needed here and could be of interest.

(a) Point cloud of half moon. (b) Mapper graph of half moon point cloud.

Figure 3.18 Example of half moon.

3.9 Conclusion

Some areas and applications of predictive mapper may be useful in machine learning models, and in the not-yet-explored avenue of an extension to clustering algorithms. The development of the predictive mapper pipeline for use with the statistical construction of mapper provides another avenue for researchers to explore in topological data analysis, and may be improved by other work in mapper graph optimization.

CHAPTER 4
FEATURE GENERATION FOR MACHINE LEARNING

Many and varied methods currently exist for featurization, which is the process of mapping persistence diagrams to Euclidean space with the goal of maximally preserving structure. However, to our knowledge, there are presently no methodical comparisons of existing approaches, nor a standardized collection of test data sets. This chapter provides a comparative study of several such methods, and is taken largely from my paper, A Comparative Study of Machine Learning Methods for Persistence Diagrams [3]. This paper was co-authored by Dr. Luis Polanco and Dr. Jose Perea.

In this chapter we review, evaluate and compare the stable multi-scale kernel, persistence landscapes, persistence images, Carlsson Coordinates - the ring of algebraic functions, template functions, and adaptive template systems. Using these approaches for feature extraction, we apply and compare popular machine learning methods on five data sets: MNIST, Shape Retrieval of Non-Rigid 3D Human Models (SHREC14), extracts from the Protein Classification Benchmark Collection (Protein), MPEG7 shape matching, and the HAM10000 skin lesion data set. These data sets are commonly used with the above featurization methods, and we use them to evaluate predictive utility in real-world applications.

4.1 Introduction

Persistence diagrams are a popular tool in Topological Data Analysis, with recent interest in feature creation for machine learning tasks and success in several applications to science and engineering. Many methods exist for the vectorization of persistence diagrams, but a comprehensive evaluation of these methods is largely unexplored. Our goal here is to contribute to the comparative analysis of machine learning methods with persistence diagrams.

Starting with topological descriptors of datasets, in the form of persistence diagrams, we provide examples and methodology to create features from these diagrams to be used in machine learning algorithms. We provide the necessary background and mathematical justification for six different methods (in chronological order): the Multi-Scale Kernel, Persistence Landscapes, Persistence Images, Adcock-Carlsson Coordinates, Template Systems, and Adaptive Template Systems. To thoroughly evaluate these methods, we have reviewed five different data sets along with the relevant methods to compute persistence diagrams from them.
The datasets, persistence diagrams, and code to compute the persistence diagrams are readily available for academic use. As part of this review, we also provide a user guide for these methods, including comparisons and evaluations across the different types of datasets. After computing the six types of features, we compared the predictive accuracy of a ridge regression, random forest, and support vector machine model to assess the type of featurization that is most useful in predictive models. The code developed for this analysis is available, with some functions developed specifically for use in machine learning applications, and easy-to-use jupyter notebooks showing examples of each function with multiple dataset types.

Of these methods, Persistence Landscapes, Adcock-Carlsson Coordinates, and Template Systems are quite accurate and create features for large datasets quickly. Adaptive Template Systems and Persistence Images took somewhat longer to run; however, the Adaptive Template Systems featurization method did improve accuracy over other methods. The Multi-Scale Kernel was the most computationally intensive, and during our evaluation we did not observe instances of it outperforming other methods.

4.2 Background

In order to develop the mathematical foundations needed for doing machine learning with persistence diagrams, it has been informative to first study the structure of the space they form. Additional background pertinent to this chapter and important for understanding properties of persistence methods is included here. For this background we follow [29].

Definition 4.2.1 (Wasserstein distance) The 𝑝th Wasserstein distance between two persistence diagrams 𝑑1 and 𝑑2 is defined as

W_p(d_1, d_2) = \left( \inf_{\gamma} \sum_{x \in d_1} \lVert x - \gamma(x) \rVert_\infty^p \right)^{1/p}     (4.1)

where 𝛾 ranges over all bijections from 𝑑1 to 𝑑2. The set of bijections is nonempty because of the diagonal. A related distance is the bottleneck distance, which is the Wasserstein distance with 𝑝 = ∞.

The space of persistence diagrams also needs to be complete to be appropriate for statistical inference, so the following definitions are key.

Definition 4.2.2 (Persistence Diagram) A generalized persistence diagram is a countable multiset of points in R2 along with the diagonal β–³ = {(π‘₯, 𝑦) ∈ R2 | π‘₯ = 𝑦}.

Definition 4.2.3 (Total persistence) The degree-𝑝 total persistence of a diagram 𝑑 is defined as

\mathrm{Pers}_p(d) = \sum_{x \in d} \mathrm{pers}(x)^p     (4.2)

Definition 4.2.4 (Space of persistence diagrams) We define the space of persistence diagrams as

D_p = \{ d \mid W_p(d, d_\emptyset) < \infty \} = \{ d \mid \mathrm{Pers}_p(d) < \infty \}     (4.3)

The space of persistence diagrams is complete and separable. Two key theorems also provide stability for both sublevel set persistence and persistence computed using Ripser. The stability theorem of [30] for sublevel set persistence states that if X is a finitely triangulated space and 𝑓, 𝑔 : X β†’ R are tame and continuous, then

d_B(\mathrm{dgm}_n(f), \mathrm{dgm}_n(g)) \leq \lVert f - g \rVert_\infty

for every integer 𝑛 β‰₯ 0. We note that the theorem is still true if continuous is replaced by piecewise linear. Similarly, if (𝑋, d𝑋) and (π‘Œ, dπ‘Œ) are finite metric spaces, then the stability of Rips persistent homology [31, Theorem 5.2] says that

d_B(\mathrm{dgm}^R_n(X), \mathrm{dgm}^R_n(Y)) \leq 2\, d_{GH}(X, Y)

where 𝑑𝐺𝐻(Β·, Β·) denotes the Gromov-Hausdorff distance [32]. When we refer to featurization methods as stable, this is usually in regards to the Wasserstein distance or bottleneck distance.
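In practice these distances can be computed directly from diagram arrays; the toy example below uses the persim package, which is an assumption here and not the code used elsewhere in this chapter.

```python
# Toy example of computing bottleneck and 1-Wasserstein distances between two
# small diagrams with persim (persim is an assumption; it is not the code used
# elsewhere in this chapter).
import numpy as np
import persim

d1 = np.array([[0.0, 1.0], [0.2, 1.5]])   # birth-death pairs
d2 = np.array([[0.0, 1.1], [0.3, 0.9]])

print("bottleneck distance:", persim.bottleneck(d1, d2))
print("1-Wasserstein distance:", persim.wasserstein(d1, d2))
```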
Doing statistics and machine learning directly on the space of persistence diagrams turns out to be quite difficult. Indeed, (D, 𝑑𝐡) does not have unique geodesics, and thus the FrΓ©chet mean of general collections of persistence diagrams is not unique [33]. Since computing averages, and doing linear algebra on persistence diagrams in general, is not available, several authors have proposed mapping (D, 𝑑𝐡) to topological vector spaces where further analysis can be done. These methods are the main focus of this review. The theory of vectorization of persistence diagrams is an active area of research, with recent results showing the impossibility of full embeddability. Indeed, even though the space of persistence diagrams with exactly 𝑛 points can be coarsely embedded in a Hilbert space [34], this ceases to be true if the number of points is allowed to vary (see [35] and [36]). That said, partial featurization is still useful, as we will demonstrate here.

4.3 Featurization Methods

For each of the methods below, we start with a collection of persistence diagrams. A persistence diagram can be represented in either the birth-death plane or the birth-lifetime plane; some methods will require birth-death coordinates and others will require birth-lifetime coordinates. The birth-death plane is the representation pair (π‘₯, 𝑦) where π‘₯ is the time of birth, and 𝑦 is the time of death of the feature in the persistence diagram. The birth-lifetime plane can be defined as the collection of points (π‘₯, 𝑦 βˆ’ π‘₯), where (π‘₯, 𝑦) is in birth-death coordinates. In this manner, we define lifetime as the persistence 𝑦 βˆ’ π‘₯ of a feature (π‘₯, 𝑦). The persistence diagrams of a particular geometric object can be calculated in a variety of ways, which will be made explicit for each dataset at the time of evaluation.

4.3.1 Multi-Scale Kernel

The Multi-Scale Kernel of [37] defines a kernel over the space of persistence diagrams, which can then be used in various types of kernel learning methods. In general, a kernel π‘˜ is by definition a symmetric and positive definite function of two variables. Mathematically, from [37], given a set 𝑋, a function π‘˜ : 𝑋 Γ— 𝑋 β†’ R is a kernel if there exists a Hilbert space 𝐻, called the feature space, and a map Ξ¦ : 𝑋 β†’ 𝐻, called the feature map, such that π‘˜(π‘₯, 𝑦) = ⟨Φ(π‘₯), Ξ¦(𝑦)⟩𝐻 for all π‘₯, 𝑦 ∈ 𝑋. The kernel induces a distance on 𝑋 defined as

d_k(x, y) = \left( k(x, x) + k(y, y) - 2k(x, y) \right)^{1/2} = \lVert \Phi(x) - \Phi(y) \rVert_H.

The authors of [37] propose a multi-scale kernel on D as follows. Given 𝐹, 𝐺 ∈ D, the persistence scale space kernel π‘˜πœŽ is

k_\sigma(F, G) = \langle \Phi_\sigma(F), \Phi_\sigma(G) \rangle_{L^2(\Omega)}     (4.4)

where Φ𝜎 : D β†’ 𝐿2(Ξ©) is the associated feature map, and Ξ© βŠ‚ R2 is the closed half-plane above the diagonal. Deriving the solution of a distribution-analogue of the heat equation with the boundary conditions in Definition 1 of [37], the closed form expression of the multi-scale kernel is

k_\sigma(F, G) = \frac{1}{8\pi\sigma} \sum_{p \in F,\, q \in G} \left( e^{-\frac{\lVert p - q \rVert^2}{8\sigma}} - e^{-\frac{\lVert p - \bar{q} \rVert^2}{8\sigma}} \right)

where if π‘ž = (π‘Ž, 𝑏), then Β―π‘ž = (𝑏, π‘Ž). The multi-scale kernel is shown to be stable with respect to the 1-Wasserstein distance by Theorem 2 of [37], which is a desirable property for classification algorithms. However, by Theorem 3 of [37], the multi-scale kernel is not stable in the Wasserstein sense for 1 < 𝑝 ≀ ∞.
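The closed form above translates directly into a few lines of numpy; the following is a naive sketch rather than the optimized implementation used later in this chapter.

```python
# Naive numpy implementation of the closed-form persistence scale space kernel
# above (an O(|F||G|) sketch, not the optimized code used for the results).
import numpy as np

def scale_space_kernel(F: np.ndarray, G: np.ndarray, sigma: float = 4.0) -> float:
    """F, G: (n, 2) arrays of birth-death pairs."""
    total = 0.0
    for p in F:
        for q in G:
            q_bar = q[::-1]  # reflection of q = (a, b) across the diagonal: (b, a)
            total += (np.exp(-np.sum((p - q) ** 2) / (8 * sigma))
                      - np.exp(-np.sum((p - q_bar) ** 2) / (8 * sigma)))
    return total / (8 * np.pi * sigma)
```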
4.3.2 Persistence Landscapes

Persistence landscapes are a mapping of persistence diagrams into a function space that is either a Banach space or a Hilbert space ([38]). Advantages of persistence landscapes are that they are invertible, stable, parameter-free, and nonlinear. Persistence landscapes can be computed from a persistence diagram as follows. From [38], for a persistence diagram 𝐷 = (π‘Žπ‘–, 𝑏𝑖)_{π‘–βˆˆπΌ} and for π‘Ž < 𝑏, let

f_{(a,b)}(t) = \max(0, \min(a + t, b - t))     (4.5)

and

\lambda_k(t) = \operatorname{kmax} \{ f_{(a_i, b_i)}(t) \}_{i \in I}     (4.6)

with kmax denoting the π‘˜th largest element. The persistence landscape is the sequence of piecewise linear functions πœ†1, πœ†2, ... : R β†’ R. Bubenik shows desirable properties for working with persistence landscapes in statistical modeling, in particular that even if unique means do not exist in the set of persistence diagrams, persistence landscapes do have unique means and the mean landscape converges to the expected persistence landscape. Figure 4.1 shows an example of persistence landscapes from the MPEG7 dataset, described in the data section.

Figure 4.1 Persistence Landscapes from the MPEG7 dataset to show differences in features. Each color corresponds to a different landscape, i.e. πœ†π‘˜ for π‘˜ = 1, 2, 3.

4.3.3 Persistence Images

From [39], persistence images are a mapping sending a persistence diagram to an integrable function, called a persistence surface. Fixing a grid on R2, the integral over this grid yields pixel values forming the persistence image. Advantages of persistence images include a representation in R𝑛, stability, and ease of computation. When calculating the persistence image, a resolution, a distribution, and a weighting function are required as parameters. It is worth noting that the resolution (i.e., number of pixels) determines the number of features computed by the persistence image.

More explicitly, let 𝐷 be a persistence diagram in birth-lifetime coordinates. We take πœ™π‘’ : R2 β†’ R to be a differentiable probability distribution. Using, for instance, the Gaussian distribution with mean 𝑒 and variance 𝜎2 we have

\phi_u(x, y) = \frac{1}{2\pi\sigma^2} e^{-[(x - u_x)^2 + (y - u_y)^2]/2\sigma^2}

The persistence surface 𝜌𝐷 : R2 β†’ R is the function

\rho_D(z) = \sum_{u = (x,\, y - x) \in D} f(u)\, \phi_u(z)

with 𝑓 : R2 β†’ R a nonnegative weighting function that is zero along the horizontal axis, continuous, and piecewise differentiable. For example, 𝑓(π‘₯, 𝑦) = 𝑦 satisfies this requirement and is frequently used. The persistence image is then the collection of pixel values

I(\rho_D)_p = \iint_p \rho_D \, dy \, dx,

where the integration is over each pixel 𝑝 of the fixed grid on R2. This creates an image depicting high and low density areas in the defined grid, which is represented as a high-dimensional vector for use in machine learning algorithms. An example is shown in Figure 4.2, taken from the MNIST dataset.

4.3.4 Adcock-Carlsson Coordinates: The ring of algebraic functions on persistence diagrams

This method is explored by [40], where the authors highlight the fact that any persistence diagram with exactly 𝑛 points can be described by a vector of the form (π‘₯1, 𝑦1, π‘₯2, 𝑦2, . . . , π‘₯𝑛, 𝑦𝑛)
This fea- turization method also appears in other applications, such as brain imaging [41], chatter identification [42], image classification [40], and also led to related work with additional stability properties [43]. From [40] the general form below is used for feature creation for Adcock-Carlsson coordinates. π‘π‘Ž,𝑏 = βˆ‘οΈ 𝑖 (π‘₯𝑖 + 𝑦𝑖)π‘Ž (𝑦𝑖 βˆ’ π‘₯𝑖)𝑏 for integers π‘Ž β‰₯ 0 and 𝑏 β‰₯ 1. Adcock-Carlsson coordinates that both emphasized more persistent features, and took into account all types of features were desired. Based on this, the following features for both the 0-dimensional and 1-dimensional persistence diagrams meet this criteria, as suggested 52 in [40]: π‘₯𝑖 (𝑦𝑖 βˆ’ π‘₯𝑖) βˆ‘οΈ 𝑖 (π‘¦π‘šπ‘Žπ‘₯ βˆ’ 𝑦𝑖) (𝑦𝑖 βˆ’ π‘₯𝑖) βˆ‘οΈ 𝑖 βˆ‘οΈ 𝑖 𝑖 (𝑦𝑖 βˆ’ π‘₯𝑖)4 π‘₯2 (π‘¦π‘šπ‘Žπ‘₯ βˆ’ 𝑦𝑖)2(𝑦𝑖 βˆ’ π‘₯𝑖)4. βˆ‘οΈ 𝑖 (4.7) (4.8) (4.9) (4.10) The theoretical motivation for Adcock-Carlsson Coordinates is as follows. Define 𝑆 𝑝𝑛 (R2) as the 𝑛-fold symmetric product of R2 (unordered 𝑛-tuples). We note the set of persistence diagrams with exactly n points of the form {(π‘₯1, 𝑦1), ...., (π‘₯𝑛, 𝑦𝑛)} is a subset of 𝑆 𝑝𝑛 (R2). The inclusions 𝑆 𝑝𝑛 (R2) ↩→ 𝑆 𝑝𝑛+1(R2) thus produce an inverse system of affine coordinate rings Β· Β· Β· β†’ 𝐴[𝑆 𝑝𝑛+1(R2)] β†’ 𝐴[𝑆 𝑝𝑛 (R2)] β†’ Β· Β· Β· which provide the basis for studying algebraic functions on the space of persistence dia- grams. With this setting in mind, the main goal of [40] is to determine free generating sets for the subalgebra of 𝐴[𝑆 π‘βˆž(R2)] comprised of elements which are invariant under adjoining a point of zero persistence to a persistence diagram. The following theorem is an answer to this question (see Theorem 1 [40]). Theorem 6 The subalgebra of 2-multisymmetric functions invariant under adding points with zero persistence, is freely generated over R by the set of elements of the form π‘π‘Ž,𝑏 = βˆ‘οΈ 𝑖 (π‘₯𝑖 + 𝑦𝑖)π‘Ž (𝑦𝑖 βˆ’ π‘₯𝑖)𝑏 for integers π‘Ž β‰₯ 0 and 𝑏 β‰₯ 1. These are the features we call Adcock-Carlsson coordinates. 53 4.3.5 Template Systems As with other methods, the goal of template systems as a featurization method is to embed the space of persistence diagrams into a real-valued vector space since persistence diagrams cannot be used directly in machine learning models. Template systems additionally separate the space of persistence diagrams, which is desired for classification accuracy. Template systems first start with a template function, which is any function on R2 that is continuous, and has compact support contained within the upper half plane, W := R Γ— R>0. Given a template function 𝑓 : W β†’ R and a persistence diagram, 𝐷, the template function becomes a function on persistence diagrams [44]: 𝜈 𝑓 (𝐷) := 𝑓 (x). βˆ‘οΈ x∈𝐷 A template system is a collection of template functions that separates points on the per- sistence diagrams. The specific template functions I used for the review are tent functions, where given a point a = (π‘Ž, 𝑏), and a radius 𝛿, define 𝑔a,𝛿 (π‘₯, 𝑦) = |1 βˆ’ 1 𝛿 max{|π‘₯ βˆ’ π‘Ž|, |𝑦 βˆ’ 𝑏|}|+ so that with a persistence diagram 𝐷 = x = (π‘₯𝑖, 𝑦𝑖), the value of the tent function is 𝐺a,𝛿 (𝐷) = 𝑔a,𝛿 (x). βˆ‘οΈ x∈𝐷 To compute the values, a grid size and associated parameters are chosen so that 𝑔 is compactly supported in W. This is illustrated in Figure 4.3. 
The advantage of working with these template systems is that they can be used to approximate real-valued functions on the space of persistence diagrams as proven by the following theorem (see Theorem 29 [45]). Denote 𝐢𝑐(W) to be the set of compactly supported continuous functions, D the space of persistence diagrams and 𝐢(D, R) the set of continuous functions from D to R.

(a) Birth-Death Coordinates. (b) Birth-Persistence. (c) Adaptive Template Systems.

Figure 4.3 Figures 4.3a and 4.3b show the grid required for the computation of tent functions in different coordinates, and Figure 4.3c shows adaptive template systems.

Theorem 7 Let T βŠ‚ 𝐢𝑐(W) be a template system for D, let C βŠ‚ D be compact, and let 𝐹 : C β†’ R be continuous. Then for every πœ– > 0 there exist 𝑁 ∈ N, a polynomial 𝑝 ∈ R[π‘₯1, ..., π‘₯𝑁] and template functions 𝑓1, . . . , 𝑓𝑁 ∈ T so that

\left| p\big( \nu(D, f_1), \ldots, \nu(D, f_N) \big) - F(D) \right| < \epsilon

for every 𝐷 ∈ C. That is, the collection of functions of the form 𝐷 β†’ 𝑝(𝜈(𝐷, 𝑓1), . . . , 𝜈(𝐷, 𝑓𝑁)) is dense in 𝐢(D, R) with respect to the compact-open topology.

4.3.6 Adaptive Template Systems

The Adaptive Template Systems methodology of [46] concerns itself with improving and furthering some of the work presented in [45] and [47]. The goal is to produce template systems that are adaptive to the input data set and the supervised classification problem at hand. One shortcoming of template systems, like tent functions, when applied to Theorem 7 is that without prior knowledge about the compact set C βŠ‚ D, the number of template functions that carry no information relevant to the problem can be high. By reducing this overhead, adaptive templates improve the computation times and accuracy in some specific problems.

The relationship between template systems and adaptive template systems is demonstrated in Figure 4.3, showing the adaptive template systems depend on density of data. To do so, given a compact set C βŠ‚ D we consider the set

S = \bigcup_{D \in \mathcal{C}} D \subset \mathbb{W}

along with different algorithms such as Gaussian mixture models (GMM) ([48]), Hierarchical density-based spatial clustering of applications with noise (HDBSCAN) ([14]) and Cover-Tree Entropy Reduction (CDER) ([49]) to define a family of ellipsoidal domains

\{ \mathbf{z} \in \mathbb{R}^2 : (\mathbf{z} - \mathbf{x})^* A (\mathbf{z} - \mathbf{x}) \leq 1 \}

in W, fitting the density distribution of 𝑆. Here 𝐴 is a 2 Γ— 2 symmetric matrix and x ∈ R2. Once this family of ellipsoidal domains is computed, we use them to define the following adaptive template functions:

f_A(\mathbf{z}) = \begin{cases} 1 - (\mathbf{z} - \mathbf{x})^* A (\mathbf{z} - \mathbf{x}) & \text{if } (\mathbf{z} - \mathbf{x})^* A (\mathbf{z} - \mathbf{x}) < 1 \\ 0 & \text{if } (\mathbf{z} - \mathbf{x})^* A (\mathbf{z} - \mathbf{x}) \geq 1. \end{cases}

4.4 Datasets

The five different datasets considered in this work were chosen from a collection of experiments presented in the literature of topological methods for machine learning. We acknowledge that this selection is inherently biased towards datasets with favorable performance with regards to specific topological methods. Nevertheless, we counterbalance this by applying all the evaluated featurization methods to all the data sets here considered and compare the classification results across all the presented methodologies. This comparative work showcases how the variation between methods results in the need for the user to find a suitable combination of featurization methods and parameter tuning to obtain optimal results in a given dataset.
As such, readers should view this as a resource for their own analysis, and not as a recommendation for specific techniques.

For all datasets and methods, parameter tuning was done using a grid search method on a subset of data that was not used to report final results, and parameters were chosen based on the performance of a ridge regression model, a random forest, and a support-vector machine (SVM) model. It is worth noting a weakness of the analysis: the same parameters were used in the feature set calculation for all reported models, and were run with a single split. This was due to the time required for feature calculation. The ridge regression and random forest classifier were run with default parameters, and the support-vector machine was run using the radial basis function (RBF) kernel with some tuning on the cost parameter (C). The exception is the Multi-Scale Kernel feature set, for which we only fit a support-vector machine model. Due to the computation times for the Multi-Scale Kernel and the direct use of the features in a kernel support-vector machine model, this was the most appropriate choice for model fitting. Preliminary results also showed that ridge regression did not provide additional accuracy for kernel methods. Each dataset was sampled to create training and testing sets for 10 or 100 trials depending on size, with the exception of the Protein Classification Dataset, which included indices for machine learning tasks. Random forest classifiers as presented in [50] are used to solve the same classification problems presented for each data set. Parameters such as the number of trees in each forest and the size of each tree are chosen based on performance and tuned on the testing set.

4.4.1 MNIST

The MNIST dataset from [4] is a database of 70,000 handwritten single digits of the numbers 0 to 9. An example image from the MNIST database is shown in Figure 4.4. The calculation of persistence diagrams for the MNIST dataset is as in [40]. The persistence diagrams are calculated using a "sweeping" motion in one of four directions: left to right, right to left, top to bottom, or bottom to top, with 0-dimensional and 1-dimensional persistence diagrams computed for each direction. This method creates a base of 8 different persistence diagrams to use in the application of methods. To compute this filtration, pixels are converted to a binary response with intensity calculated based on position. This has the effect that, depending on the direction of sweep, features will have different birth and death times, providing distinct information for each direction. Figure 4.4 shows the various calculations of persistence diagrams for an example number eight. Both 0-dimensional and 1-dimensional persistence diagrams were used for the MNIST dataset, noting that some observations did not have 1-dimensional persistence diagrams; these observations were filled with a single diagram with birth-death coordinates [0, .01].

Figure 4.4 A) Example number 8 from the MNIST dataset. B) The same number 8 after computing each of the 4 types of coordinate transforms to compute the persistence diagrams.

The number of topological features available for model fitting is dependent on the method. For the Persistence Images, Persistence Landscapes, and Template Systems there are 8 features each. The Multi-Scale Kernel produces 8 different kernel matrices, and for Adcock-Carlsson Coordinates, 32 different features were computed from these persistence diagrams.
For the MNIST dataset, a random sample of 1000 images was used to tune parameters, with 80% used for the training portion, and 20% used for the testing portion. We used the set of 60,000 images corresponding to the training set of MNIST to create our own training and testing sets for model fitting and evaluation. For this set of 60,000, 10 trials were run with an 80% training and 20% testing split to determine model performance. 4.4.2 SHREC14 We evaluated the SHREC 2014 dataset ([51]) in the same manner as the authors of [46]. To compute the topological features, the authors of [37] describe using a heat kernel signature to compute persistence diagrams for both the 0-dimensional and 1-dimensional 58 persistence diagrams. The kernel is computed, and then the persistence diagrams were computed using Dipha [52], which can directly handle meshes. We were provided the dataset as a set of persistence diagrams. The dataset consists of 15 different labels, corresponding to 5 different poses for the three classes of male, female, and child figures. As noted in [46], parameters in the dataset define different machine learning tasks due to a different calculation of the heat kernel signature, and for this evaluation we focused problem 6, which has the highest accuracy ranging from 89% - 92% as reported in[46]. For the SHREC14 dataset, a random sample of 90 images (30% of the data) was used to tune the model and determine appropriate model parameters. The remaining 210 observations were split into 80% training and 20% testing for 100 trials to report final model fit, which is listed below. Persistence diagrams for 0-dimensional homology and 1-dimensional homology were computed for this dataset. Table 4.3 shows complete results for the SHREC 2014 dataset. 4.4.3 Protein Classification We use the Protein Classification Benchmark dataset PCB00019 [53] as another type of data to evaluate the topological methods above. This specific set contains information for 1357 proteins corresponding to 55 classification problems, and we reported on 54 of the problems using one to tune parameters. The training and testing index were provided, and the mean accuracy was reported for both training and testing sets using these indices. Table 4.4 shows results from our experiments using the training and testing indices provided in the original dataset. Persistence diagrams for this dataset were computed for each protein by considering the 3-D structure (provided in [54]) as a point cloud in R3. This point cloud was built using the π‘₯, 𝑦 and 𝑧 position of each atom in the molecule at hand. With this information the persistent 0-dimensional and 1-dimensional homology is computed using Ripser [55]. 59 4.4.4 MPEG7 The mpeg-7 dataset from [56] is a database of object shapes in black and white, with 1400 shapes in 70 classes. An example from the original dataset is shown in Figure 4.5 along with the contour as described below. Figure 4.5 (A) An example apple from MPEG7 data. (B) An example image contour used for MPEG7. (C) The distance to mean point calculation used for sublevel set persistence. (D) Persistence diagrams from lower star persistence. (E) Persistence diagrams from the contour. To compute persistence diagrams, first the image contour is computed by placing ob- servations from the point cloud into a sequence. The distance curve is computed as the distance from the center of the sequence. Sublevel set persistence is taken using the com- puted distance curves as point cloud data. 
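A minimal sketch of how sublevel set persistence of such a distance curve could be computed, here treating the curve as a path and using gudhi's SimplexTree (gudhi is an assumption for illustration; the diagrams reported in this chapter were not necessarily produced this way):

```python
# Illustrative sketch of sublevel set persistence for a 1-D distance curve,
# treating the curve as a path and using gudhi's SimplexTree (an assumption;
# the reported diagrams were not necessarily computed this way).
import numpy as np
import gudhi

def sublevel_persistence(curve: np.ndarray):
    st = gudhi.SimplexTree()
    # Each vertex enters the filtration at its function value...
    for i, v in enumerate(curve):
        st.insert([i], filtration=float(v))
    # ...and each edge enters when both of its endpoints have appeared.
    for i in range(len(curve) - 1):
        st.insert([i, i + 1], filtration=float(max(curve[i], curve[i + 1])))
    return st.persistence()  # list of (dimension, (birth, death)) pairs

curve = np.abs(np.sin(np.linspace(0, 4 * np.pi, 200))) + 0.1
print(sublevel_persistence(curve)[:5])
```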
Persistence diagrams for both 0-dimensional and 1-dimensional homology were computed for this dataset. An example notebook of MPEG7 is provided using only 4 shapes - apple, children, dog, and bone. This approach is due to the initial difficulty in getting accurate models for the full 60 dataset. Due to the small number of samples (80 total) and lack of repeated sampling, the estimates provided for this dataset are not stable and are not reported. We used this dataset for a timing comparison of featurization methods from persistence diagrams. We do not report on the results of this dataset. 4.4.5 HAM10000 The HAM10000 dataset provided by [57] is a collection of 10000 images of skin lesions with one 7 potential classifications: (i) Actinic Keratoses and Intraepithelial Carcionma, (ii) Basal cell carcinoma, (iii) Benign keratosis, (iv) Dermatofibroma, (v) Melanocytic nevi, (vi) Melanoma, (vii) Vascular skin lesions. A total of 18 persistence diagrams for this set were calculated using the methods outlined in [58], 9 corresponding to the 0-dimensional homology and 9 corresponding to the 1-dimensional homology. To obtain such diagrams, first a mask is computed by implementing the methodology proposed in [58]. The mask encodes the image with a binary response of white if that region (pixel) of the image if part of the lesion, and black if it is part of the healthy tissue. Once the mask is computed it is applied to the original image to identify the lesion. The area recognized as the lesion is converted into 3 different color models: RGB, HSV and XYZ. Each color model is split into their corresponding channels, and for each channel we use sublevel set filtration to obtain 0-dimensional and 1-dimensional persistence diagrams. In total, for each image on the data set we obtain 18 persistence diagrams, 9 in homological dimension 0 and 9 for homological dimension 1. An example image and this process is shown in Figure 4.6. To tune the models, a random sample of 250 images were taken and a ridge regression, random forest and support vector machine model were fit to determine parameters. The remaining 9750 images were split into an 80% training and 20% testing set to report final results. To evaluate the HAM10000 dataset, due to the large number of birth and death pairs in 61 Figure 4.6 (A) Example of skin lesion in HAM10000 (B) Skin lesion with mask (C) Mask only dataset. each persistence diagram, subsampling of persistent features was required. Each observation in a data set, for example an image, will yield 18 persistence diagrams corresponding to homological features in that observation. In the HAM10000 dataset, there was an average of 5,783 birth-death pairs in each persistence diagram. This was an issue to complete computation for the vectorization methods, even for adaptive templates, so each persistence diagram was subsampled as follows. The method of subsampling is two steps: Highly persistent features were always in- cluded, and a uniform random sampling method (without replacement) was used to sample the remaining points. The threshold for feature lifetime and number of points to sample was determined by using parameters that preserve the distribution of points in each persistence diagram. As a result, features in each persistence diagram with a lifetime of 5 or more were automatically included, and 5% of the rest of the points were also included. This resulted in sampled persistence diagrams with an average of 290 points each (see Table 4.1). 
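The two-step subsampling just described comes down to a few lines of numpy; the function below is an illustrative sketch rather than the exact code used.

```python
# Sketch of the two-step subsampling described above: keep every point with
# lifetime >= 5, then sample 5% of the remaining points without replacement.
import numpy as np

def subsample_diagram(dgm: np.ndarray, lifetime_cut: float = 5.0,
                      frac: float = 0.05, seed: int = 0) -> np.ndarray:
    """dgm: (n, 2) array of birth-death pairs."""
    rng = np.random.default_rng(seed)
    lifetimes = dgm[:, 1] - dgm[:, 0]
    keep = dgm[lifetimes >= lifetime_cut]      # highly persistent features
    rest = dgm[lifetimes < lifetime_cut]
    n_sample = int(np.ceil(frac * len(rest)))
    if n_sample > 0:
        idx = rng.choice(len(rest), size=n_sample, replace=False)
        keep = np.vstack([keep, rest[idx]])
    return keep
```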
4.5 User Guide 4.5.1 Available functions As part of the available code, a function for each method is included. Each function requires two sets of persistence diagrams, a training set and a testing set, and parameters specific to the function. The function returns two feature sets for that method, corresponding to the training and test set respectively. Each function also prints the time in seconds taken 62 at the end of each run. In this section of the user guide each function is described, along with the required parameters for the function. The Multi-Scale Kernel feature matrix can be computed using the function ker- nel_features or fast_kernel_features. It is recommended to use fast_kernel_features due to computation time. Both functions require a parameter sigma, denoted as s in the function with a default value of 4. In [37] this parameter is referred to as the scale parameter. From the definition, we note that as sigma, 𝜎, increases the function decreases. Increasing sigma results in a less diffuse kernel matrix, while decreasing sigma results in a more diffuse kernel matrix. Due to time required for the Multi-Scale Kernel, there are two additional sets of functions that use Numba ([59]) for significantly faster computation. In the current implementation, these are not able to be combined with multi-core processing (MPI for example), and have a different format than the other functions included. These functions are provided in the github repository for this project, and were used to compute results for the Multi-Scale Kernel for the MNIST dataset. The Persistence Landscapes features can be computed using landscape_features. The Multi-Scale Kernel function, landscape_features requires two parameters: the number of landscapes, n and resolution, r. The number of landscapes parameter, n, controls the number of landscapes used, and the resolution, r, controls the number of samples used to compute the landscapes. The default parameters for n is 5 and r is 100. The Persistence Images can be computed using the function persistence_image_features. The persistence_image_features function requires two parameters, pixels and spread. The pixels, p is a pair that defines the number of pixels along each axis. The spread, s, is the standard deviation of the Gaussian kernel used to generate the persistence image. It is worth noting that the implementation here uses the Gaussian kernel, however, other distributions could be chosen so that s would correspond to parameters specific to the chosen distribution. 63 Additionally, the weighting function is constant for this implementation. Increasing spread increases the variance for each distribution, resulting in larger "hot spots". Increasing pixels provides a smoother distribution, whereas decreasing pixels yields a less smooth distri- bution. Note that increasing pixels increases computation time. This is demonstrated in Figure 4.2 in the methods section. The Adcock-Carlsson Coordinates features can be computed using the function carls- son_coordinates, does not require any parameters. This function returns four different features for every type of persistence diagram provided. So for datasets that have persis- tence diagrams corresponding to 0-dimensional and 1-dimensional homology, 8 features are returned for machine learning. The features returned correspond to the four coordinates calculated in [40]. 
The Template Systems features can be computed using the function tent_features, and has a choice of two parameters: d, which defines the number of bins in each axis and padding, which controls the padding around the collection of persistence diagrams. This function returns a training and testing set. This function computes the tent features from [45]. The Adaptive Template Systems features can be called with the function adap- tive_features, and requires the labels for the training set. Users can choose three different types of Adaptive Templates: Gaussian Mixture Modeling (GMM), Cover-Tree Entropy Reduction (CDER), and Hierarchical density-based clustering of applications with noise (HDBSCAN). The parameter d refers to the number of components when using the GMM model type. This would be minimally the number of classes in your data, and ideally repre- sents closer to the number of distributions in the data that correspond to each observation. Details on these methods can be found in [46], as well as the original references linked in the methods section. During this evaluation, we evaluated adaptive templates using both GMM and CDER methods, but did not formally evaluate HDBSCAN. HDBSCAN was 64 difficult to formally assess as we had difficulty with completion of the algorithm for some datasets. For those datasets we were able to complete, we did not notice an improvement over other adaptive methods. 4.6 Results One consideration we must make before analysing the results comes from the compu- tation of Multi-Scale Kernel features. As explained for each dataset in Section 4.4, more often than not we will compute multiple persistent diagrams per data point in a given data set. Since this multi-scale kernel provides a notion of similarity between persistent dia- grams (see [37]) we require it to be computed between diagrams corresponding to the same dimension homology and method type. For example, the kernel matrix that corresponds to the 0-dimensional homology of a data set is computed using the persistence scale space kernel between two sets of persistence diagrams that represent the 0-dimensional homology. This means that for a dataset that has sets of 0-dimensional homology persistence diagrams and 1-dimensional homology persistence diagrams, two kernel matrices were returned (one per each dimension). The kernel matrix used in our models is the sum of available kernels, and differs based on the persistence diagrams available for each dataset. While this does improve accuracy significantly over individual kernel matrices, other methods of combining kernel features were not explored in this dissertation, but is available in [60] for the interested reader. The available parameter, sigma, is consistent across all types of diagrams for our evaluation. For each of the other methods, Persistence Landscapes, Persistence Images, Adcock- Carlsson Coordinates, Template Systems, and Adaptive Template Systems, each feature matrix was constructed for the relevant set of diagrams, and all topological types were used in fitting the same model. The datasets used in this analysis were of varying size, both in terms of observations and the size of sets of persistence diagrams. As noted in the descriptions of data, the types of 65 Dataset Characteristics Dataset MNIST SHREC14 Protein MPEG7 HAM10000 Observations Diagrams 70,000 300 1357 1400 10,000 8 2 2 2 18 Average Pairs Min/Max Pairs 1.15 14 346 205 5783 0/7 1/29 3/500 1/500 13/32610 Table 4.1 Characteristics of each dataset. 
The column headings can be explained as such: Observations - number of observations in the dataset, Diagrams - the number of homological types used to compute persistence diagrams, Average Pairs - the average number of birth/death pairs across the set of persistence diagrams for a single observation in the original dataset, and Min/Max Pairs - the minimum and maximum number of birth/death pairs across the set of persistence diagrams for a single observation in the original dataset. persistence diagrams calculated also differs. A summary of characteristics for each dataset is included in Table 4.1. 4.6.1 MNIST The Multi-Scale Kernel features calculated yielded 8 different kernel matrices, and the final kernel matrix was calculated using the unweighted summation of these kernels as in [60]. Due to the time needed for computation of the Multi-Scale Kernel, a smaller set of 12,000 observations was used to report final results. Table 4.2 shows complete results for the MNIST analysis. Four different methods (highlighted on the table) provided similar results for the MNIST dataset, and we note the SVM model had higher accuracy in each case. This table, and all subsequent results tables, include the method used to construct topological features, training and test accuracy, and model and parameters used for evaluation. 4.6.2 SHREC14 Results are reported in Table 4.3. Adaptive Template Systems and Persistence Land- scapes were the two methods with highest classification accuracy on the test dataset, with Template Systems and the Multi-Scale Kernel performing nearly as well. 66 Topological Method Multi-Scale Ker- nel (sum of Ker- nels for 12,000 observations) Persistence Land- scapes Persistent Images Adcock-Carlsson Coordinates Template tems Sys- Adaptive plate Systems Tem- Full Results for MNIST Dataset Train Test Model .6895 Β± .0035 .6932 Β± .0117 SVM .8844 Β± .0004 .8786 Β± .0019 Ridge Regression .9231 Β± .0004 .5814 Β± .0098 .8997 Β± .0005 .9368 Β± .0004 .6889 Β± .0036 .8590 Β± .0010 .9525 Β± .0004 .7214 Β± .0092 .896 Β± .0005 .9638 Β± .0003 .6967 Β± .0035 .8819 Β± .0016 .9180 Β± .0018 .5828 Β± .0098 .8934 Β± .0021 .9199 Β± .0023 .6953 Β± .0123 .8547 Β± .0030 .9356 Β± .0018 .7170 Β± .0097 .8959 Β± .0017 .9477 Β± .0015 .6973 Β± .0031 .8817 Β± .0027 .9515 Β± .0021 .9363 Β± .0021 .6914 Β± .0188 .6932 Β± .0209 SVM(RBF) Random Forest Ridge Regression SVM (RBF) Random Forest Ridge Regression SVM (RBF) Random Forest Ridge Regression SVM(RBF) Random Forest Ridge Regression (GMM) SVM(RBF) (GMM) Random Forest Table 4.2 Results from the MNIST Dataset using the average model classification accuracy Β± standard deviation over 10 trials. 4.6.3 Protein Classification Nearly all of the topological methods in this thesis provided similar classification accu- racy for this dataset. We observe the testing accuracy as higher than the training accuracy for this dataset, and the results are similar to those in [61]. The Multi-Scale Kernel though did not perform as well and as shown in Figure 4.7 is the most computationally intensive. Results are reported in Table 4.4. 
Full Results for SHREC14 Dataset
Method                                       Train            Test             Model
Multi-Scale Kernel (sigma = .5,
sum of kernels)                              .8942 ± .0142    .8938 ± .0464    Kernel SVM
Persistence Landscapes (n = 5, r = 200)      .9968 ± .0037    .9312 ± .0336    Ridge Regression
                                             .9302 ± .0098    .8876 ± .0479    SVM (RBF, c = 10)
                                             .9186 ± .0417    .865 ± .0764     Random Forest
Persistence Images (p = 40, s = .5)          .9739 ± .0190    .7124 ± .0814    Ridge Regression
                                             .7243 ± .0387    .8671 ± .0183    SVM (RBF, c = 1)
                                             .9114 ± .0441    .6928 ± .0599    Random Forest
Adcock-Carlsson Coordinates                  .7048 ± .0588    .9147 ± .0299    Ridge Regression
                                             .9067 ± .0147    .9442 ± .0087    SVM (RBF, c = 50)
                                             .9855 ± .0092    .6976 ± .0899    Random Forest
Template Systems (d = 12, p = 1.1)           .85 ± .0199      .9100 ± .0405    Ridge Regression
                                             .9350 ± .0079    .9159 ± .0383    SVM (RBF, c = 1)
                                             .9483 ± .0214    .8874 ± .0481    Random Forest
Adaptive Template Systems (CDER)             .9937 ± .0078    .9169 ± .0395    Ridge Regression
                                             .9929 ± .0083    .9064 ± .0397    SVM (RBF, c = 10)
                                             .9729 ± .0200    .9164 ± .0422    Random Forest

Table 4.3 Results from the SHREC14 Dataset using the average model classification accuracy ± standard deviation over 100 trials.

Figure 4.7 Comparison of timing by method. The legend is the same for all plots. The x-axis represents the size of the dataset, and the y-axis represents the time in seconds required for calculation of all of the persistence diagrams associated with the dataset of the given size. A) Timings for the MPEG7 Dataset including the Multi-Scale Kernel. B) Timings for the MPEG7 Dataset excluding the Multi-Scale Kernel. C) Timings for the Protein Dataset including all features. D) Timings for the Protein Dataset excluding Multi-Scale Kernel and Adaptive Templates (GMM).

4.6.4 HAM10000

Due to the run time for the large number of points in each persistence diagram, even after subsampling, results were not reported for the Multi-Scale Kernel or Adaptive Template Systems.

Full Results for Protein Dataset
Method                            Train            Test             Model
Multi-Scale Kernel
(sum of kernels)                  .8294 ± .1063    .8803 ± .0702    Kernel SVM
Persistence Landscapes            .9108 ± .0615    .9620 ± .0204    Ridge Regression
                                  .9012 ± .0682    .9782 ± .0151    SVM (RBF)
                                  .9011 ± .0686    .9782 ± .0152    Random Forest
Persistence Images                .9011 ± .0682    .9758 ± .0165    Ridge Regression
                                  .9007 ± .0684    .9782 ± .0151    SVM (RBF)
                                  .9008 ± .0685    .9782 ± .0151    Random Forest
Adcock-Carlsson Coordinates       .9008 ± .0685    .9780 ± .0151    Ridge Regression
                                  .9009 ± .0685    .9782 ± .0151    SVM (RBF)
                                  .9015 ± .0677    .9779 ± .0151    Random Forest
Template Systems                  .9008 ± .0684    .9780 ± .0151    Ridge Regression
                                  .9020 ± .0678    .9782 ± .0151    SVM (RBF)
                                  .9016 ± .0678    .9775 ± .0152    Random Forest
Adaptive Template Systems         .9008 ± .0685    .9782 ± .0151    Ridge Regression (CDER)
                                  .9007 ± .0684    .9782 ± .0151    SVM (CDER)
                                  .9100 ± .0685    .9800 ± .0151    Random Forest (HDB)

Table 4.4 Results from the Protein Dataset using the average model classification accuracy ± standard deviation over 54 trials corresponding to the predefined indices of the dataset.

Results are listed in Table 4.5. The HAM10000 dataset presented the largest computational challenge during this review, and is a continued area of research.

4.7 Computation Time of Features

Formal timings were captured for all features for the 0-dimensional persistence diagrams for the MPEG7 and Protein Datasets. A comparison of timings is in Figure 4.7. The timing reported is for the generation of features from one type of persistence diagram for a dataset of that size.
This means that when computing a training feature set and a testing feature set for multiple types of persistence diagrams, the expected time to generate features can be significantly longer. For example, in the MNIST dataset we compute 4 different types of persistence diagrams with both 0-dimensional and 1-dimensional homology, giving 8 sets of features that can be generated for the sets of persistence diagrams for that dataset. Specific to the multi-scale kernel method, the timing reported is for a symmetric feature matrix that is n x n, where n is the number of observations in the dataset. This means the training feature set requires less computation time than a testing feature set of comparable size. Additionally, during the review of these methods we did not encounter significant issues with model fitting, hence formal timings were not recorded for this portion of the analysis. Conclusions from these timings are addressed in the discussion section.

Results for HAM10000 Dataset
Topological Method                 Train            Test                  Model
Multi-Scale Kernel                 Did Not Run
Persistence Landscapes             .8347 ± .0022    .6881 ± .0074         Ridge Regression
                                   .6695 ± 0        .6692 ± 1.2e-16       SVM (RBF, c = 1)
                                   .6695 ± 0        .6692 ± 1.2e-16       Random Forest
Persistence Images
(pixels = 20, spread = 1)          .7417 ± .0017    .6371 ± .0671         Ridge Regression
                                   .7122 ± .0012    .6895 ± .0031         SVM (RBF, c = 1)
                                   .6695 ± 0        .6692 ± 1.2e-16       Random Forest
Adcock-Carlsson Coordinates        .6719 ± .0007    .6696 ± .0025         Ridge Regression
                                   .6801 ± .0009    .6710 ± .0019         SVM (RBF)
                                   .6695 ± 0        .6692 ± 1.12e-16      Random Forest
Template Systems (d = 10, p = 1.5) .7193 ± .0015    .6987 ± .0041         Ridge Regression
                                   .7830 ± .0024    .7303 ± .0054         SVM (RBF, c = 5)
                                   .6695 ± 0        .6692 ± 1.2e-16       Random Forest
Adaptive Template Systems          Did Not Run

Table 4.5 Results from the HAM10000 Dataset using the average model classification accuracy ± standard deviation over 10 trials.

Data Availability

The datasets, persistence diagrams (or code to compute the diagrams), and all other associated code for this study can be found in the machine learning methods for persistence diagrams GitHub repository https://github.com/barnesd8/machine_learning_for_persistence. For each of the five datasets, the following code is available:

• A jupyter notebook that loads and formats the persistence diagrams, including images, and does a preliminary model fitting on a subset of the data

• A python script that calculates the persistence diagrams from the original dataset - some of these are written using MPI depending on the size of the dataset

• A python script that fits models for random samples of the data to get mean estimates of accuracy for both the training and test dataset

These scripts reference modules included in the GitHub repository, including a persistence methods script that calculates features from persistence diagrams for a training and test dataset. This uses a combination of other available functions and functions written specifically for this thesis. The Template Systems and Adaptive Template Systems methods use functions from https://github.com/lucho8908/adaptive_template_systems, which is the corresponding code repository for [46]. The available methods in our extension include Tent Functions and Adaptive Template Functions (GMM, CDER, and HDBSCAN methods). The Adcock-Carlsson Coordinates method is a function developed specifically for this thesis, and includes the calculation of the 4 different features used in our analysis.
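As a rough illustration of what such coordinates look like, the following is a minimal sketch of one common choice of four Adcock-Carlsson polynomial features from [40], computed from a single persistence diagram given as an array of (birth, death) pairs. The exact four features computed by the thesis code may differ; this sketch is illustrative only.

import numpy as np

def adcock_carlsson_coordinates(dgm):
    # dgm: array of shape (m, 2) with columns (birth, death)
    b, d = dgm[:, 0], dgm[:, 1]
    dmax = d.max()
    life = d - b
    return np.array([
        np.sum(b * life),                    # emphasizes points born late with long lives
        np.sum((dmax - d) * life),           # emphasizes points dying early with long lives
        np.sum(b**2 * life**4),
        np.sum((dmax - d)**2 * life**4),
    ])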
The Persistence Landscape method uses the persistence landscape calculation from the Gudhi Package [62]. The Multi-Scale Kernel Method has two included implementations: one is from [63] and is slower to compute, while the other is a faster implementation that can be used on larger datasets. All of the results in this thesis were reported using the implementation written specifically for this thesis. The Persistence Images features are also from [63]. Additionally, many functions from [64] are used throughout.

4.8 Discussion

Adcock-Carlsson Coordinates, Tent Functions, and Persistence Landscapes scale well, and perform well even for large datasets. It is of note, though, that parameter choice will affect computation time. This was especially notable in the Template Features (Tent Functions) computation time. As the number of tent functions is increased, the time to compute features also increases. We observed a superlinear increase; however, even with this increase, computation time was not a barrier for analysis.

Persistence Images and Adaptive Template Functions do not scale or perform as well; however, they do provide good featurizations for accurate models and should be considered depending on the dataset. Specifically, the Adaptive Template Functions method was not completed for the full HAM10000 dataset due to computation time. When using these methods, it should be noted that the Multi-Scale Kernel method is computationally intensive and does not scale well. Additionally, the accuracy achieved is not better than that of other methods for the datasets in this thesis.

4.9 Conclusion

This chapter reviews 6 methods for topological feature extraction for use in machine learning algorithms. Persistence Landscapes, Adcock-Carlsson Coordinates, Template Systems, and Adaptive Template Systems perform consistently, with minimal differences between datasets and types of persistence diagrams. These methods are also less expensive in terms of execution time. A main contribution of this chapter is the availability of datasets, persistence diagrams, and code for others to use and contribute to the research community.

CHAPTER 5
OPEN SOURCE SOFTWARE CONTRIBUTIONS

5.1 Open Source Software Development Framework

Topological data analysis is a multi-disciplinary analysis approach, with applications in an extensive number of domains. To enable the use of TDA tools, well-documented and maintained open source software is needed. Multiple software packages are available as part of this dissertation, either developed collaboratively as part of Liz Munch's lab or as part of my individual contributions to topological data analysis.

Three different open source tools are included in this section: teaspoon [2], ceREEBerus, and an extension to the Kepler Mapper [65] software. teaspoon is a python package developed for topological signal processing, with modules for specific applications including parameter selection, persistent homology, and machine learning. I have contributed to the machine learning module by optimizing already available feature creation functions and contributing an additional method from prior work. Additionally, I have implemented a software development process for automated releases to PyPI for teaspoon. This standardized software development framework was used when creating ceREEBerus, a python package for working with Reeb graphs. Users can load Reeb graphs, customize plots, compute merge trees, and perform basic distance calculations using ceREEBerus.
ceREEBerus includes code from collaborators in the Munch Lab, and I wrote the base functions and created the overall structure and initial package for ceREEBerus. This section focuses on the available open source software, and begins with the software development process implemented to promote collaboration.

5.1.1 Broader Impact

While much of the work in this section focuses on the implementation of already existing methods, a significant motivation for this work is to enable collaboration and multidisciplinary research with topological tools. Each of the sections in this chapter outlines a contribution to the broader impact of topological data analysis. Through the standardization and automation of the development pipeline of teaspoon and ceREEBerus, students and collaborators from disciplines outside of computer science and software engineering can readily contribute and learn standards along the way. I have also provided training to other graduate students, as well as led a discussion-based responsible conduct of research seminar focusing on version control software. Additionally, having automation for deployment and testing provides a process where individuals just learning software development can feel more comfortable with the process of contributing to software packages.

The focus of teaspoon has been very similar, with the additional endeavor of simplifying the user experience and providing additional examples for individuals wishing to learn and use topological methods for data analysis through teaspoon. To this end, I simplified the install process and integrated notebooks, datasets, and code from [3] into teaspoon. This further simplifies the process for users to get started, as persistence diagrams are readily available from familiar datasets for use in featurization methods and subsequently in machine learning models.

Finally, through my work on ceREEBerus, I initialized a software package and process for collaborators in the Munch Lab (and more broadly) to be used as they continue to expand on both the theory and computation of Reeb graphs. The structure and examples I provided through this initial work set a solid foundation for collaboration on Reeb graph computation.

5.1.2 Automation of development pipeline

A major component of the software development framework was standardizing and automating software releases for any existing packages in the Munch Lab, and providing guidelines for future packages for the Munch Lab. This provides a framework for packages developed primarily in python and released via PyPI. This framework helps support a consistent development practice and allows for maintenance by graduate students in limited terms. The process consists of two main components: 1) a process for merging, testing releases, and approval to the master/main branch, and 2) automation of deployment to PyPI.

To support an automated deployment with multiple contributors, it is imperative to first establish a collaborative process that supports updates from multiple developers that are tested and reviewed. This process was designed and documented, and provided via GitHub and as workshops to graduate students contributing to software development projects. As such, the CMSE value to collaborate across disciplines was supported and enabled.
The process is visualized in Figure 5.1, and consists of creating a feature branch from the test release branch, implementing the required features, merging into the test release branch with appropriate review, and merging into the master/main branch after review and testing.

Figure 5.1 Illustration of the collaborative release process implemented.

To further support the development pipeline, workflows via GitHub were configured to allow for automated releases and to include the criteria that must be met for an automated release. Upon approval of the release, the following workflow is executed: 1) create a container (ubuntu), 2) install python, 3) install the package, 4) run unit tests, 5) build the package, and 6) publish to PyPI. Workflow monitoring is enabled, and failure at any point in the workflow will halt the process and will not publish to PyPI. The set of standards for a successful build is provided in further detail below.

Must have passed a successful deployment to TestPyPI: Changes are first implemented into a release process on the test release branch, which upon successful completion deploys to the TestPyPI site. This is the first check required for code to be merged into the master/main branch for release.

Must have passed a code review: Before a feature branch can be integrated into a release, a code review by a project collaborator is required. The collaborator can approve the changes or suggest changes to further refine the code.

Must pass unit tests: Each module has a series of unit tests that are required to pass before deployment. Once the code is reviewed, and the merge to the main/master branch is accepted, the automated deployment process runs the unit tests.

Must have package dependencies in pyproject.toml: For a successful install, build, and deployment, package dependencies must be listed in the dependencies section of the pyproject.toml file.

Must have a successful package build: Before the version can be automatically deployed, it must build successfully.

Must have the version incremented for release: When the release is pushed to PyPI, the version must be incremented.

5.2 Teaspoon

The information in this section on teaspoon follows [2]. teaspoon is a comprehensive python package for topological signal processing, combining techniques from topological data analysis to create tools for signal processing that use shape for inference. teaspoon is open source with extensive documentation and follows the collaborative development process outlined in Section 5.1.2, making the code both easy to use and easy to contribute to. teaspoon is comprised of five main modules: dynamical systems, machine learning, complex networks, information, and parameter selection. I focus on the machine learning module, which historically included the featurization and classification modules, and which has been extended to include both a datasets module and notebook examples for machine learning pipelines. Additionally, some functions in the featurization module have been added or updated for computational efficiency. Source code and notebook examples are adapted from [3].

As part of the focus on usability for teaspoon, a considerable number of users continued to have difficulties with package installation due to the system dependencies CMake [66] and Boost [67]. Installation issues were limited to those related to zigzag persistence. Currently, computation of zigzag persistence has limited implementations, all reliant on underlying C/C++ libraries to ensure reasonable computation times.
As such, these teaspoon modules have been moved to their own repository, spork. The main goals of my work with teaspoon have been to improve ease of use, enable additional collaboration, and expand the machine learning modules by including work done in [3] to allow for easy distribution. Since featurization of persistence diagrams is an active area of research, providing example datasets and efficient code for feature computation is a contribution to the topological data analysis community.

5.2.1 spork

spork is the companion to teaspoon and was developed to create a simpler install for the majority of users utilizing the teaspoon package. As additional software for TDA has become available, teaspoon has expanded to include additional features for method applications, many of which are used for specific purposes. Many users would not necessarily need all functions when first using teaspoon, so to improve the install experience a subset of functions was moved to a companion package, spork. The additional install currently required for spork is Dionysus 2 [68], and spork includes the features required to compute zigzag persistence. The install is available on the Python Package Index (PyPI) and builds locally, requiring a compiler for the C++-to-python bindings as well as other system dependencies.

5.2.2 Featurization

The featurization module in teaspoon originally contained six different featurization methods: persistence landscapes, persistence images, Carlsson Coordinates, template functions, the kernel method, and signature of paths. As part of my work I have extended the featurization module to include adaptive template systems as an additional feature, and included a faster version of the kernel method and of the persistence images method. Initial code for adaptive template systems was provided by Luis Polanco, and I have included a subset of these functions.

Each feature type was optimized in a different way specific to the method and required libraries. Two types of optimization were explored during this analysis: parallelization and vectorization. The discussion on parallelization follows [69]. Parallelization can refer to either functional parallelism or data parallelism. Functional parallelism is running multiple sets of instructions concurrently, and data parallelism is running the same instructions on different sets of data concurrently. Data parallelism was used to optimize the persistence images code, and an illustration of this is in Figure 5.2. Vectorization can refer to a few things: the process of turning persistence diagrams into features, a type of data parallelism where computation is done simultaneously on elements of a vector, or the process of restructuring code to remove loops and offload them to a lower level language (C) for better performance [70]. When I refer to vectorization, I mean the last definition.

Figure 5.2 Illustration of data parallelism where the addition operation is performed on blocks of a dataset at a time concurrently.

To determine the most appropriate implementation in the teaspoon software, a number of methods were developed and timed using a simulated dataset. Development and timing were performed on an iMac using the Apple M1 chip with 16 GB of RAM and 8 total cores. Initial timing was run locally, as many users of teaspoon will be working on personal computers. Timing was also assessed on Michigan State's High Performance Computing Center.

5.2.3 Kernel Method

The first method discussed is the Multi-scale Heat Kernel Method from [37].
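For reference, and up to notational differences, the kernel from [37] between two persistence diagrams F and G has the form

\[
  k_\sigma(F, G) \;=\; \frac{1}{8\pi\sigma} \sum_{p \in F,\; q \in G}
  \exp\!\left(-\frac{\lVert p - q \rVert^2}{8\sigma}\right)
  - \exp\!\left(-\frac{\lVert p - \bar{q} \rVert^2}{8\sigma}\right),
\]

where \(\bar{q} = (q_2, q_1)\) denotes the mirror image of \(q = (q_1, q_2)\) across the diagonal. Every birth-death pair of one diagram is compared against every pair of the other, which drives the computational cost discussed below.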
To optimize the kernel method, code was developed and assessed using Numba [59], which translates Python to optimized machine code and compiles upon the first run. Numba provides many different ways to improve performance, including automatic just-in-time compiling, automatic parallelization, and the creation of NumPy universal functions [70]. Numba's just-in-time compiling (jit) is simply a decorator applied to a Python function that directs Numba to automatically optimize and generate machine code with performance similar to C, C++, and Fortran. This same decorator can also be designated to compile code to run in parallel. While this is simple to use, initial benchmarking tests showed much better performance using guvectorize. Guvectorize (and vectorize) is a way to create NumPy universal functions (ufuncs) with Numba, avoiding the traditional process required to create a NumPy ufunc. A NumPy ufunc operates element-by-element and is a wrapper for C functions. Functions can be written using guvectorize in Numba with cpu, parallel, and gpu targets, and can also broadcast, returning different size outputs based on input size. Any loops inside the function do not parallelize, but any additional dimensions will parallelize.

To better understand this mechanism, consider a dataset comprised of persistence diagrams. For each observation in the dataset, there is a set of persistence diagrams. Each diagram is comprised of birth and death pairs. Consider the example below, with 4 sets of length 3 birth and death pairs. From the python array perspective, the shape of this set would be (4, 3, 2). To compute features using the kernel method, each data point is compared to each other data point. Writing a function to take a set of persistence diagrams and compare it to a single persistence diagram, so an array of shape (4, 3, 2) and an array of shape (3, 2), will then broadcast across the first dimension for the calculation.

df = [[[0,1], [0,2], [0,4]],
      [[1,3], [0,4], [0,6]],
      [[2,4], [3,5], [0,7]],
      [[0,1], [2,4], [3,7]]]

To allow for core-based parallelization or gpu processing, writing a function using guvectorize to compare an entire array to a single observation will parallelize the additional dimension. In this case, providing two arrays of size (4, 3, 2) will parallelize along the first dimension for one of the arrays using a scatter operation, sending observations of the second array to different processes to run in parallel. The implicit assumption in this example is that each persistence diagram is the same length, i.e. the second dimension will always be 3. That is not the case, so restructuring is required first. Any timing reported includes the time required to restructure the data. The code example below shows both the before and after of the restructuring function that pads the array to create the same second dimension, enabling the creation of a NumPy ufunc using guvectorize in Numba.

df = [[[0,1], [0,2]],
      [[1,3], [0,4], [0,6]],
      [[2,4], [3,5], [0,7]],
      [[0,1]]]

df = reshape_to_array(df)
print(df)

[[[0,1], [0,2], [-1,-1]],
 [[1,3], [0,4], [0,6]],
 [[2,4], [3,5], [0,7]],
 [[0,1], [-1,-1], [-1,-1]]]

Results of the local run are shown in Figure 5.3 and results of the HPCC run are shown in Figure 5.4. For both, note a speed-up that scales well as additional processors are used. For the kernel computation the level of complexity is O(N^2 · |F| · |G|), where |F| and |G| are the cardinalities of the multisets F and G and N is the number of multisets. In other words, for a dataset with N observations, a computation needs to happen for each set of birth-death pairs for every pair of observations.
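To make the broadcasting pattern concrete, the following is a minimal sketch, not the teaspoon implementation, of a guvectorize'd function that compares every padded diagram in an (n, k, 2) array against a single (k, 2) diagram, skipping the (-1, -1) padding rows introduced by the restructuring step above. The function name and the use of a length-1 sigma array are illustrative assumptions.

import numpy as np
from numba import guvectorize

@guvectorize(["void(float64[:,:], float64[:,:], float64[:], float64[:])"],
             "(k,d),(k,d),(s)->(s)", target="parallel")
def pss_kernel_row(F, G, sigma, out):
    # Accumulate the multi-scale heat kernel between two padded diagrams.
    total = 0.0
    for i in range(F.shape[0]):
        if F[i, 0] < 0:          # skip (-1, -1) padding rows
            continue
        for j in range(G.shape[0]):
            if G[j, 0] < 0:
                continue
            d2 = (F[i, 0] - G[j, 0])**2 + (F[i, 1] - G[j, 1])**2
            m2 = (F[i, 0] - G[j, 1])**2 + (F[i, 1] - G[j, 0])**2   # mirrored point
            total += np.exp(-d2 / (8.0 * sigma[0])) - np.exp(-m2 / (8.0 * sigma[0]))
    out[0] = total / (8.0 * np.pi * sigma[0])

# A = reshape_to_array(df)                              # padded array of shape (n, k, 2)
# column = pss_kernel_row(A, A[0], np.array([0.5]))     # broadcasts over the n diagrams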
Further optimization could be possible through a fast kernel implementation [71], but the implementation would need to be adapted to account for the specific structure of persistence diagrams.

Figure 5.3 Comparison of timing for the Multiscale Heat Kernel using a simulated dataset running locally.

Timing for the multiscale kernel method was run on MSU's HPCC on an amd20-v100 node with 20 Intel(R) Xeon(R) Platinum 8260 CPUs with a clock speed of 2.40 GHz. The GPU used was an NVIDIA V100 with 32768 MB of memory.

Figure 5.4 Comparison of timing for the Multiscale Heat Kernel using a simulated dataset running on MSU's HPCC with 16 cores.

5.2.4 Persistence Images

Similar methods were explored for the optimization of persistence images. Ultimately, the implementation from the persim [72] package was used because of the availability of the parallel option. Due to the need for integration of a gaussian density function over a grid, the error function from the SciPy [73] library is required, and it is not supported by Numba. An attempt was made to recreate the integration in NumPy for use with Numba, but it was ultimately neither as efficient nor as accurate. Timing from running locally is shown in Figure 5.5 and timing from using the HPCC is shown in Figure 5.6. Timing on MSU's HPCC for persistence images was run on an intel18-v100 node with Intel(R) Xeon(R) Gold 6148 CPUs with a clock speed of 2.40 GHz using 16 cores.

Figure 5.5 Comparison of timing for persistence images using a simulated dataset running locally.

Figure 5.6 Comparison of timing for persistence images using a simulated dataset running on MSU's HPCC.

5.2.5 Datasets

Datasets used in [3] were integrated into teaspoon as part of a new module within the machine learning section, load_datasets. This module includes the persistence diagrams computed from the MNIST dataset and the MPEG7 dataset. The following describes how to load each dataset for use.

The MNIST persistence diagrams can be loaded using the function load_datasets.mnist(). No parameters are required for this function. This dataset includes the persistence diagrams for the 60,000 observations considered the training set for the MNIST dataset. The dataset is loaded as a pandas dataframe and has 9 columns, 8 of which correspond to persistence diagrams and one column of dataset labels. The 8 columns of persistence diagrams can be defined as:

1. zero_dim_rtl: persistence diagram for 0-dimensional features computed using the right to left directional transform

2. zero_dim_ltr: persistence diagram for 0-dimensional features computed using the left to right directional transform

3. zero_dim_btt: persistence diagram for 0-dimensional features computed using the bottom to top directional transform

4. zero_dim_ttb: persistence diagram for 0-dimensional features computed using the top to bottom directional transform

5. one_dim_rtl: persistence diagram for 1-dimensional features computed using the right to left directional transform

6. one_dim_ltr: persistence diagram for 1-dimensional features computed using the left to right directional transform

7. one_dim_btt: persistence diagram for 1-dimensional features computed using the bottom to top directional transform

8. one_dim_ttb: persistence diagram for 1-dimensional features computed using the top to bottom directional transform
Each of the persistence diagrams was computed first using a directional transform on pixel values, and then sublevel set persistence. This is demonstrated in Figure 5.7, with additional details available in [3] for MNIST and for all other datasets listed here.

Figure 5.7 The number 8 after computing each of the 4 types of coordinate transforms used to compute the persistence diagrams.

The MPEG7 persistence diagrams can be loaded using the function load_datasets.mpeg7(). No parameters are required for this function. This dataset includes the persistence diagrams for the 1400 observations in the MPEG7 dataset. The dataset is loaded as a pandas dataframe and has 5 columns, 4 of which correspond to persistence diagrams or related computations and one column of dataset labels. The columns of the MPEG7 persistence diagrams dataset can be defined as:

1. Name: label of the dataset

2. Outline: outline of the image represented as a 2-dimensional point cloud dataset

3. DistCurve: distance from the center of the outline

4. persistence_outline: persistence diagrams for 0-dimensional features and 1-dimensional features computed from the outline (contour)

5. lower_star_persistence: persistence diagram for 0-dimensional features computed using sublevel set persistence of the distance curve

Persistence diagrams were computed using both the shape outline and the distance curve. Ripser was used to compute the 0-dimensional and 1-dimensional features from the outline, as well as the 0-dimensional sublevel set persistence from the distance curve. The available data for the outline and distance curve is plotted in Figure 5.8.

Figure 5.8 Example output from the MPEG7 dataset for the columns Outline and DistCurve.

The shape retrieval of non-rigid 3D human models (SHREC14) persistence diagrams can be loaded using the function load_datasets.shrec14(). No parameters are required for this function. This dataset includes the persistence diagrams for the 3000 observations in the SHREC14 dataset. The dataset is loaded as a pandas dataframe and has 6 columns, 3 of which correspond to persistence diagrams or related computations and one column of dataset labels. Persistence diagrams were computed using the Heat Kernel Signature with a range of parameter values (corresponding to trial) [74] on 300 images, each corresponding to a specific trial value in the dataset. The columns of this persistence diagrams dataset can be defined as:

1. freq: index of the machine learning problem (300 observations per problem)

2. trial: corresponds to a machine learning problem (10 total, corresponding to a parameter used in computing the persistence diagram)

3. CollectedDgm: set of all persistence diagrams

4. Dgm1: persistence diagrams for 1-dimensional features

5. Dgm0: persistence diagrams for 0-dimensional features

6. TrainingLabel: one of 15 labels corresponding to five different poses for the three classes of male, female, and child figures
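The dataset-loading functions described above can be used as in the following minimal sketch. The import path for the load_datasets module within teaspoon's machine learning section is an assumption and should be checked against the teaspoon documentation.

from teaspoon.ML import load_datasets   # assumed location of the datasets module

# Each loader returns a pandas DataFrame of precomputed persistence diagrams.
mnist_dgms = load_datasets.mnist()      # 9 columns: 8 diagram columns plus labels
mpeg7_dgms = load_datasets.mpeg7()      # Outline, DistCurve, and diagram columns
shrec14_dgms = load_datasets.shrec14()  # Heat Kernel Signature diagrams per trial

print(mnist_dgms.columns)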
5.3 CeREEBerus

Another area of focus for open source software development for topological data analysis is a package called ceREEBerus, which focuses on computation related to Reeb graphs. Working with Reeb graphs is an active area of research, with provably correct and efficient algorithms for computing Reeb graphs available in recent years [18]. This has provided renewed interest in the use of Reeb graphs, along with the availability of metrics defined on Reeb graphs [16]. The goal of ceREEBerus, then, is to start with Reeb graphs and basic functions on Reeb graphs and build a comprehensive package to compute a wide variety of metrics on Reeb graphs. The initial and current utility of ceREEBerus allows users to load example Reeb graphs, generate random merge trees, plot graphs and merge trees, generate merge trees from Reeb graphs, and compute distances between merge trees. This package is available via PyPI and follows the open source software framework implemented for the Munch Lab.

5.3.1 Available functions

As part of the package, extensive user documentation is available at the ceREEBerus site, and a summary is included here. The ceREEBerus package is comprised of three modules: example data and graphs (data), Reeb graphs (reeb), and compute (compute). The sample data and graphs module provides both networkx graphs and Reeb graphs for multiple example graphs, including the dancing man, simple loops, and torus graphs. Each of these is plotted in Figure 5.9 using the plotting function available in ceREEBerus. An example Reeb graph is in Figure 5.10a and the corresponding ceREEBerus plot is in Figure 5.10b.

(a) Dancing Man. (b) Simple Loops. (c) Torus.
Figure 5.9 Output from ceREEBerus plotting capabilities for the dancing man, simple loops, and torus graphs.

(a) A Reeb graph of the dancing man. (b) Output from ceREEBerus: a graph of the dancing man.
Figure 5.10 ceREEBerus output (right) for the Reeb graph shown on the left.

There are two main ways to load the example graphs, either as a networkx graph or as a Reeb graph, which is a custom class created for ceREEBerus. These can be loaded and accessed by using from cereeberus.data import graphs for the networkx version, or from cereeberus.data import reeb for the Reeb graph version. An example code block for loading the relevant libraries, loading the Reeb graph version of the torus, and plotting is shown below. The relevant output for this code is shown in Figure 5.9c.

from cereeberus.data import reeb
reeb_torus = reeb.torus()
reeb_torus.plot_reeb()

When plotting Reeb graph objects, additional parameters allow you to control the position and display of the plots. The available parameters, with their default values and definitions, are:

• position = {}: an optional argument to update positions of nodes for plotting

• resetSpring = False: a boolean flag that when true assigns new node positions for the underlying Reeb graph using the spring layout from networkx

• horizontalDrawing = False: a boolean flag that when true swaps the x and y axes for plotting and does not update node positions

• verbose = False: a boolean flag that when true provides logging output

• cpx = .1: a float that is only used for graphs where the degree of connectivity between two nodes is two, and controls the radius of the curve used for plotting relative to the x position

• cpy = .1: a float that is only used for graphs where the degree of connectivity between two nodes is two, and controls the radius of the curve used for plotting relative to the y position

An example of horizontal drawing, updating position, and resetting the spring layout is shown in the following code block, with example output in Figure 5.11.
### Import relevant module
from cereeberus.data import reeb
reeb_torus = reeb.torus()

### Plotting the Reeb graph using horizontalDrawing = True
reeb_torus.plot_reeb(horizontalDrawing=True)

### Plotting the Reeb graph with an updated position
reeb_torus.plot_reeb(position = {0:(1,1), 1:(2,2), 2:(2,3), 3:(4,4), 4:(2,5), 5:(6,6)},
                     horizontalDrawing = False, verbose = False)

### Plotting the Reeb graph with the resetSpring parameter = True
reeb_torus.plot_reeb(resetSpring = True)

(a) Example torus plotted horizontally. (b) Example torus with updated positions. (c) Example torus using the spring layout.
Figure 5.11 Output from ceREEBerus plotting capabilities for the torus using different parameters.

Lastly, as part of the plotting function, the ability to refine how multiple edges between two nodes are plotted is shown below, with the corresponding output in Figure 5.12.

### Import relevant module
from cereeberus.data import reeb
sl = reeb.simple_loops()

### Plotting the simple loops graph with cpx = .1
sl.plot_reeb(cpx=.1)

### Plotting the simple loops graph with cpx = .5
sl.plot_reeb(cpx=.5)

### Plotting the simple loops graph with cpx = 1
sl.plot_reeb(cpx=1)

(a) Example simple loops with cpx = .1. (b) Example simple loops with cpx = .5. (c) Example simple loops with cpx = 1.
Figure 5.12 Output from ceREEBerus plotting capabilities of simple loops using different cpx parameters.

Another capability in the compute.degree module is adding nodes, removing isolates, and removing any unnecessary nodes to create a minimal Reeb graph. Recalling the definition from Section 2.4, the nodes with both an up degree and a down degree equal to 1 can be removed to compute the minimal Reeb graph. I use the following algorithm to construct the minimal Reeb graph:

Require: R, a Reeb graph
1: n = length(nodes)
2: for each node i with i <= n do
3:     if up degree(i) == down degree(i) == 1 then
4:         add edge(up node, down node)
5:         remove edge(up node, i)
6:         remove edge(down node, i)
7:         remove node(i)
8:     end if
9: end for

The following code example shows first adding nodes to the torus Reeb graph, and then constructing the minimal Reeb graph of the torus. Plots of each are available in Figure 5.13.

import cereeberus.compute.degree as degree
import networkx as nx

reeb_torus = reeb.torus()

### Add node and plot
reeb_torus_add = degree.add_nodes(reeb_torus, fx=4.5, x = 1)
reeb_torus_add.plot_reeb()

### Add another node and plot
reeb_torus_add = degree.add_nodes(reeb_torus_add, fx=3.5, x=1)
reeb_torus_add.plot_reeb()

### Compute minimal Reeb graph
reeb_torus_min = degree.minimal_reeb(reeb_torus_add)
reeb_torus_min.plot_reeb(cpx=1)

(a) Torus Reeb graph after adding one node. (b) Torus Reeb graph after adding two nodes. (c) Minimal Reeb graph of the torus.
Figure 5.13 Output from ceREEBerus degree functions.

As part of the compute.degree module, one can also compute distances between graphs and remove isolates, or nodes with up degree and down degree equal to zero. The remove isolates function is a necessary step in the union find algorithm implemented for computing merge trees. Code is shown below and images are available in Figure 5.14. Interested readers are directed to the ceREEBerus documentation for details on distances, as graph edit distance and directed merge tree distance are both available.
from cereeberus.data import reeb
import cereeberus.compute.degree as degree

jm = reeb.juggling_man()
jm.plot_reeb()

### Remove isolates and plot
jm_im = degree.remove_isolates(jm)
jm_im.plot_reeb()

(a) Reeb graph with isolates. (b) Removed isolates.
Figure 5.14 Plots of a Reeb graph before and after removing isolates.

The final and core functionality of ceREEBerus is the ability to compute merge trees from Reeb graphs. The main code was provided by Elena Wang, and I structured and implemented it into the overall python package. The computeMergeTree function uses an implementation of the union find algorithm. Example code and the associated output are below and shown in Figure 5.15.

from cereeberus.data import reeb
import cereeberus.compute.degree as degree

jm = reeb.juggling_man()

### Remove isolates and plot
jm_im = degree.remove_isolates(jm)
jm_im.plot_reeb()

### Import merge tree, compute, and plot
from cereeberus.compute.merge import computeMergeTree
cmt = computeMergeTree(jm_im, verbose=False, precision=1, filter=False)
cmt.plot_reeb(position = {7:(0,1), 5:(1,4), 2:(0,5), 'inf':(0,7)})

(a) Dancing man before merge tree. (b) Merge tree of dancing man.
Figure 5.15 Plot of a Reeb graph and its associated merge tree.

5.3.2 Predictive Mapper

Through a seed grant from the Center for Business and Social Analytics at Michigan State University, I was able to work with a software developer, Doug Krum, to take a prototype for an interactive Mapper application built by Pat Bills and transition the work to python, taking advantage of the full suite of functions in scikit-learn [64], including parallelism. We started with the mapper implementation from KeplerMapper [27] and [65], and expanded the package to include additional features that allowed the development of predictive mapper. This variant included a web interface to allow users to interactively create a mapper graph and choose specific parameters for the mapper graph creation. During the development of predictive mapper this tool was used, and additional functions were created to enable the predictive pipeline. Based on the final predictive mapper pipeline, the associated release is still a work in progress, requiring additional code and data restructuring to support integration into the already existing Kepler Mapper. As mentioned in Section 3.4, example notebooks for the work done on the iris dataset are available along with the mapper graphs generated. Similarly, an example notebook used to create images for the predictive mapper portion of this dissertation is available and is an independent mapper pipeline.

5.3.3 Conclusion

As part of my work in the Computational Mathematics, Science, and Engineering department I have contributed to many computational projects supporting work in topological data analysis. These contributions include a standardized process and framework for collaboration in the Munch Lab, the inclusion of additional and optimized methods in teaspoon, and the reorganization of teaspoon to have a companion package, spork. I have additionally used this framework for the creation of ceREEBerus. As part of a review paper I have additionally contributed a set of notebooks and datasets for machine learning use of persistence diagrams.

CHAPTER 6
CONCLUSION

Topological data analysis is a powerful tool for extracting the shape of data. Two primary tools, persistent homology and mapper, require additional transformations to be used directly in machine learning models.
In this dissertation, I have contributed to the discipline of TDA by developing another method for feature creation and providing additional computational tools for working with persistence diagrams. I have also contributed to the available open source software in TDA through package development and the implementation of standards for the Munch Lab.

To extend the use of mapper, I have developed predictive mapper as a way to create an optimal version of a mapper graph, compute probabilities of inclusion of data points in mapper nodes, and then, using the eigenvectors of the graph Laplacian of the mapper graph, represent each data point as a linear combination of the probability of inclusion in each node and the eigenvector values. This featurization provides real-valued vectors for use in machine learning models that represent the connectivity of the mapper graph weighted by a meaningful probability for each data point.

In the field of persistent homology, I have contributed additional datasets consisting of persistence diagrams for some popular datasets, along with the inclusion of a far more computationally efficient function to compute the Multi-Scale Heat Kernel [37]. Example notebooks of 6 different featurization methods are available for academic use, furthering the availability of TDA resources. I have also contributed meaningfully to available tools for researchers in TDA, with the creation of ceREEBerus to work with Reeb graphs, additional contributions to teaspoon, and the additional code required for the predictive mapper pipeline. This code is available as open source with a standardized development process to enable collaboration.

Through this dissertation I have contributed both to the field of Topological Data Analysis and undertaken a combination of theoretical and applied work in the spirit of the CMSE program.

BIBLIOGRAPHY

[1] Gurjeet Singh, Facundo Memoli, and Gunnar Carlsson. Topological methods for the analysis of high dimensional data sets and 3d object recognition. Eurographics Symposium on Point-Based Graphics, 2007.

[2] Audun D Myers, Melih Yesilli, Sarah Tymochko, Firas Khasawneh, and Elizabeth Munch. Teaspoon: A comprehensive python package for topological signal processing. In NeurIPS 2020 Workshop on Topological Data Analysis and Beyond, 2020.

[3] Danielle Barnes, Luis Polanco, and Jose A. Perea. A comparative study of machine learning methods for persistence diagrams. Frontiers in Artificial Intelligence, 4:91, 2021.

[4] Y. LeCun and C. Cortes. The mnist database, 1999.

[5] Herbert Edelsbrunner and John Harer. Computational topology: an introduction, volume 69. American Mathematical Soc., 2010.

[6] Tamal Krishna Dey and Yusu Wang. Computational Topology for Data Analysis. Cambridge Univ. Press, Cambridge, 2006-2021.

[7] Gunnar E. Carlsson and Vin de Silva. Zigzag persistence. CoRR, abs/0812.0197, 2008.

[8] J. H. Friedman and J. W. Tukey. A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, C-23:881–890, September 1974.

[9] Frédéric Chazal and Bertrand Michel. An introduction to topological data analysis: Fundamental and practical aspects for data scientists. Frontiers in Artificial Intelligence, 4:108, 2021.

[10] Monica Nicolau, Arnold J. Levine, and Gunnar Carlsson. Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. Proceedings of the National Academy of Sciences of the United States of America, pages 7265–7270, 2011.

[11] P.Y. Lum, G. Singh, A. Lehman, T. Ishkanov, M. Vejdemo-Johansson, M. Alagappan, J. Carlsson, and G. Carlsson. Extracting insights from the shape of complex data using topology. Scientific Reports, 3:1236, 2013.
[12] Elizabeth Munch and Bei Wang. Convergence between categorical representations of Reeb space and mapper. 32nd International Symposium on Computational Geometry, 53:53:1–53:15, 2016.

[13] Stephen C. Johnson. Hierarchical clustering schemes. Psychometrika, 32:241–254, 1967.

[14] Ricardo J. G. B. Campello, Davoud Moulavi, and Joerg Sander. Density-based clustering based on hierarchical density estimates. In Jian Pei, Vincent S. Tseng, Longbing Cao, Hiroshi Motoda, and Guandong Xu, editors, Advances in Knowledge Discovery and Data Mining, pages 160–172, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.

[15] Silvia Biasotti, Daniela Giorgi, Michela Spagnuolo, and Bianca Falcidieno. Reeb graphs for shape analysis and applications. Theoretical Computer Science, 392:5–22, 02 2008.

[16] Brian Bollen, Erin W. Chambers, Joshua A. Levine, and Elizabeth Munch. Reeb graph metrics from the ground up. CoRR, abs/2110.05631, 2021.

[17] Ulrich Bauer, Elizabeth Munch, and Yusu Wang. Strong Equivalence of the Interleaving and Functional Distortion Metrics for Reeb Graphs. In Lars Arge and János Pach, editors, 31st International Symposium on Computational Geometry (SoCG 2015), volume 34 of Leibniz International Proceedings in Informatics (LIPIcs), pages 461–475, Dagstuhl, Germany, 2015. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.

[18] Harish Doraiswamy and Vijay Natarajan. Efficient algorithms for computing Reeb graphs. Computational Geometry, 42(6):606–616, 2009.

[19] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4), 2007.

[20] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering, 2001.

[21] Enrique Alvarado, Robin Belton, Emily Fischer, Kang-Ju Lee, Sourabh Palande, Sarah Percival, and Emilie Purvine. g-mapper: Learning a cover in the mapper construction, 2024.

[22] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.

[23] R. A. Fisher. Iris. UCI Machine Learning Repository, 1988. DOI: https://doi.org/10.24432/C56C76.

[24] Paulo Cortez. Student Performance. UCI Machine Learning Repository, 2014. DOI: https://doi.org/10.24432/C5TG7T.

[25] S.J. Roberts, D. Husmeier, I. Rezek, and W. Penny. Bayesian approaches to gaussian mixture modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1133–1142, 1998.

[26] P. Cortez and A. M. Gonçalves Silva. Using data mining to predict secondary school student performance. 2008.

[27] Hendrik Jacob van Veen, Nathaniel Saul, David Eargle, and Sam W. Mangham. Kepler mapper: A flexible python implementation of the mapper algorithm. Journal of Open Source Software, 4(42):1315, 2019.

[28] Joshua L. Mike and Jose A. Perea. Geometric data analysis across scales via laplacian eigenvector cascading, 2019.

[29] Yuriy Mileyko, Sayan Mukherjee, and John Harer. Probability measures on the space of persistence diagrams. Inverse Problems, 27(12):124007, 2011.

[30] David Cohen-Steiner, Herbert Edelsbrunner, and John Harer. Stability of persistence diagrams. Discrete & Computational Geometry, 37(1):103–120, 2007.

[31] Frédéric Chazal, Vin De Silva, and Steve Oudot. Persistence stability for geometric complexes. Geometriae Dedicata, 173(1):193–214, 2014.

[32] Mikhail Gromov. Metric structures for Riemannian and non-Riemannian spaces. Creative Commons, 2007.
[33] Katharine Turner, Yuriy Mileyko, Sayan Mukherjee, and John Harer. Fréchet means for distributions of persistence diagrams. Discrete & Computational Geometry, 52(1):44–70, 2014.

[34] Atish Mitra and Žiga Virk. The space of persistence diagrams on n points coarsely embeds into Hilbert space. Proceedings of the American Mathematical Society, 149(6):2693–2703, 2021.

[35] Alexander Wagner. Nonembeddability of persistence diagrams with p > 2 Wasserstein metric. arXiv preprint arXiv:1910.13935, 2019.

[36] Peter Bubenik and Alexander Wagner. Embeddings of persistence diagrams into Hilbert spaces. Journal of Applied and Computational Topology, 4(3):339–351, 2020.

[37] Jan Reininghaus, Stefan Huber, Ulrich Bauer, and Roland Kwitt. A stable multi-scale kernel for topological machine learning. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4741–4748, 2015.

[38] Peter Bubenik. The persistence landscape and some of its properties. Topological Data Analysis, 15:77–102, 2020.

[39] Henry Adams, Sofya Chepushtanova, Tegan Emerson, Eric Hanson, Michael Kirby, Francis Motta, Rachel Neville, Chris Peterson, Patrick Shipman, and Lori Ziegelmeier. Persistence images: A stable vector representation of persistent homology, 2015.

[40] Aaron Adcock, Erik Carlsson, and Gunnar Carlsson. The ring of algebraic functions on persistence bar codes. Homology, Homotopy and Applications, 18(1):381–402, 2016.

[41] Luigi Caputi, Anna Pidnebesna, and Jaroslav Hlinka. Promises and pitfalls of topological data analysis for brain connectivity analysis. NeuroImage, 238:118245, 2021.

[42] Firas A. Khasawneh, Elizabeth Munch, and Jose A. Perea. Chatter classification in turning using machine learning and topological data analysis. IFAC-PapersOnLine, 51(14):195–200, 2018. 14th IFAC Workshop on Time Delay Systems TDS 2018.

[43] Sara Kališnik. Tropical coordinates on the space of persistence barcodes. Foundations of Computational Mathematics, 19:101–129, 2019.

[44] Sarah Tymochko, Elizabeth Munch, and Firas A. Khasawneh. Adaptive partitioning for template functions on persistence diagrams. In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA). IEEE, Dec 2019.

[45] J. A. Perea, E. Munch, and F. A. Khasawneh. Approximating continuous functions on persistence diagrams using template functions. CoRR, abs/1902.07190, 2019.

[46] Luis Polanco and J. A. Perea. Adaptive template systems: Data-driven feature selection for learning with persistence diagrams. IEEE ICMLA, 2019.

[47] Sarah Tymochko, Elizabeth Munch, and Firas A. Khasawneh. Adaptive partitioning for template functions on persistence diagrams. In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pages 1227–1234, 2019.

[48] Douglas Reynolds. Gaussian Mixture Models, pages 659–663. Springer US, Boston, MA, 2009.

[49] Abraham Smith, Paul Bendich, John Harer, and Jay Hineman. Supervised learning of labeled pointcloud differences via cover-tree entropy reduction. CoRR, abs/1702.07959, 2017.

[50] Leo Breiman. Random forests. Machine Learning, 45:5–32, 2001.

[51] D. Pickup, X. Sun, P. L. Rosin, R. R. Martin, Z. Cheng, Z. Lian, M. Aono, A. Ben Hamza, A. Bronstein, M. Bronstein, S. Bu, U. Castellani, S. Cheng, V. Garro, A. Giachetti, A. Godil, J. Han, H. Johan, L. Lai, B. Li, C. Li, H. Li, R. Litman, X. Liu, Z. Liu, Y. Lu, A. Tatsuma, and J. Ye. SHREC'14 track: Shape retrieval of non-rigid 3d human models. In Proceedings of the 7th Eurographics workshop on 3D Object Retrieval, EG 3DOR'14, pages 1–10. Eurographics Association, 2014.
[52] Ulrich Bauer, Michael Kerber, and Jan Reininghaus. Distributed computation of persistent homology, 2013.

[53] Paolo Sonego, Mircea Pacurar, Somdutta Dhir, Attila Kertész-Farkas, András Kocsor, Zoltán Gáspári, Jack A. M. Leunissen, and Sándor Pongor. A protein classification benchmark collection for machine learning. Nucleic Acids Research, 35:D232–D236, 2007.

[54] wwPDB consortium. Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Research, 47(D1):D520–D528, 10 2018.

[55] Christopher Tralie, Nathaniel Saul, and Rann Bar-On. Ripser.py: A lean persistent homology library for python. The Journal of Open Source Software, 3(29):925, Sep 2018.

[56] Xiang Bai, Xingwei Yang, Longin Jan Latecki, Wenyu Liu, and Zhuowen Tu. Learning context sensitive shape similarity by graph transduction. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 2009.

[57] Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci Data, 5(180161), 2018.

[58] Y. Chung, C. Hu, A. Lawson, and C. Smyth. Topological approaches to skin disease image analysis. In 2018 IEEE International Conference on Big Data (Big Data), pages 100–105, 2018.

[59] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: a llvm-based python jit compiler. In LLVM '15: Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, pages 1–6, Nov 2015.

[60] Mehmet Gönen and Ethem Alpaydin. Multiple kernel learning algorithms. Journal of Machine Learning Research, (12):2211–2268, 2011.

[61] L. Polanco and J. A. Perea. Coordinatizing data with lens spaces and persistent cohomology. Proceedings of the 31st Canadian Conference on Computational Geometry (CCCG), pages 49–57, 2019.

[62] Pawel Dlotko. Persistence representations. In GUDHI User and Reference Manual. GUDHI Editorial Board, 3.4.1 edition, 2021.

[63] Nathaniel Saul and Chris Tralie. Scikit-tda: Topological data analysis for python, 2019.

[64] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[65] Hendrik Jacob van Veen, Nathaniel Saul, David Eargle, and Sam W. Mangham. Kepler Mapper: A flexible Python implementation of the Mapper algorithm, October 2020.

[66] Bill Hoffman. CMake.

[67] Beman Dawes, David Abrahams, and Rene Rivera. Boost.

[68] Dmitry Morozov. Dionysus 2, 2023. https://www.mrzv.org/software/dionysus2/.

[69] Victor Eijkhout, Edmond Chow, and Robert van de Geijn. Introduction to High Performance Scientific Computing. Springer Science & Business Media, 2020.

[70] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, September 2020.
[71] John Paul Ryan, Sebastian Ament, Carla P. Gomes, and Anil Damle. The fast kernel transform. CoRR, abs/2106.04487, 2021.

[72] Nathaniel Saul. Persim 0.3.6: a scikit-tda project, 2019.

[73] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020.

[74] Jian Sun, Maks Ovsjanikov, and Leonidas Guibas. A concise and provably informative multi-scale signature based on heat diffusion. Eurographics Symposium on Geometry Processing, 28(5), 2009.

APPENDIX

(a) Original mapper graph (b) Optimized mapper graph
Figure A.1 Mapper graphs for the grades dataset for Experiment 1 before and after optimization. For this example, the typical optimization workflow was not successful.

(a) Original mapper graph (b) Optimized mapper graph
Figure A.2 Mapper graphs for the grades dataset for Experiment 2 before and after optimization. For this example, the typical optimization workflow was not successful.
Attribute: Description (Domain)
sex: student's sex (binary: female or male)
age: student's age (numeric: from 15 to 22)
school: student's school (binary: Gabriel Pereira or Mousinho da Silveira)
address: student's home address type (binary: urban or rural)
Pstatus: parent's cohabitation status (binary: living together or apart)
Medu: mother's education (numeric: from 0 to 4)
Mjob: mother's job (nominal)
Fedu: father's education (numeric: from 0 to 4)
Fjob: father's job (nominal)
guardian: student's guardian (nominal: mother, father or other)
famsize: family size (binary: ≤ 3 or > 3)
famrel: quality of family relationships (numeric: from 1 – very bad to 5 – excellent)
reason: reason to choose this school (nominal: close to home, school reputation, course preference or other)
traveltime: home to school travel time (numeric: 1 – < 15 min, 2 – 15 to 30 min, 3 – 30 min to 1 hour or 4 – > 1 hour)
studytime: weekly study time (numeric: 1 – < 2 hours, 2 – 2 to 5 hours, 3 – 5 to 10 hours or 4 – > 10 hours)
failures: number of past class failures (numeric: n if 1 ≤ n < 3, else 4)
schoolsup: extra educational school support (binary: yes or no)
famsup: family educational support (binary: yes or no)
activities: extra-curricular activities (binary: yes or no)
paidclass: extra paid classes (binary: yes or no)
internet: Internet access at home (binary: yes or no)
nursery: attended nursery school (binary: yes or no)
higher: wants to take higher education (binary: yes or no)
romantic: with a romantic relationship (binary: yes or no)
freetime: free time after school (numeric: from 1 – very low to 5 – very high)
goout: going out with friends (numeric: from 1 – very low to 5 – very high)
Walc: weekend alcohol consumption (numeric: from 1 – very low to 5 – very high)
Dalc: workday alcohol consumption (numeric: from 1 – very low to 5 – very high)
health: current health status (numeric: from 1 – very bad to 5 – very good)
absences: number of school absences (numeric: from 0 to 93)
G1: first period grade (numeric: from 0 to 20)
G2: second period grade (numeric: from 0 to 20)
G3: final grade (numeric: from 0 to 20)
Table A.1 Student-related variables from the grades dataset [24].
Figure A.3 Mapper graphs for the grades dataset for Experiment 4 before and after optimization: (a) original mapper graph; (b) optimized mapper graph. For this example, the typical optimization workflow was not successful.
Figure A.4 Mapper graphs for the grades dataset for Experiment 5 before and after optimization: (a) original mapper graph; (b) optimized mapper graph. For this example, the typical optimization workflow was not successful.
Figure A.5 Mapper graphs for Experiment 7 before and after optimization: (a) original mapper graph; (b) optimized mapper graph. For this example, the typical optimization workflow was not successful.
Figure A.6 Mapper graphs for Experiment 8 before and after optimization: (a) original mapper graph; (b) optimized mapper graph. For this example, the typical optimization workflow was not successful.
Figure A.7 Mapper graphs for Experiment 9 before and after optimization: (a) original mapper graph; (b) optimized mapper graph. For this example, the typical optimization workflow was not successful.
Figure A.8 Mapper graphs for Experiment 10 before and after optimization: (a) original mapper graph; (b) optimized mapper graph. For this example, the typical optimization workflow was not successful.
Figure A.9 Mapper graphs for Experiment 11 before and after optimization: (a) original mapper graph; (b) optimized mapper graph. For this example, the typical optimization workflow was not successful.
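For reference, the sketch below shows one way the attributes in Table A.1 might be loaded and prepared for a machine learning model. It is a minimal illustration and not the pipeline used in the experiments: it assumes the grades data is available as a semicolon-delimited CSV (the file name student.csv is a placeholder) with columns named exactly as in Table A.1, and it simply one-hot encodes the binary and nominal attributes while leaving numeric attributes unchanged.

# Minimal sketch for preparing the Table A.1 attributes; not the dissertation's pipeline.
# Assumptions: "student.csv" is a hypothetical semicolon-delimited file whose columns
# match the attribute names in Table A.1.
import pandas as pd

df = pd.read_csv("student.csv", sep=";")  # hypothetical file name and delimiter

# Binary and nominal attributes from Table A.1; numeric attributes are left as-is.
binary_cols = ["sex", "school", "address", "Pstatus", "famsize", "schoolsup",
               "famsup", "activities", "paidclass", "internet", "nursery",
               "higher", "romantic"]
nominal_cols = ["Mjob", "Fjob", "guardian", "reason"]

# One-hot encode the categorical attributes; drop_first avoids a redundant
# indicator column for each binary variable.
X = pd.get_dummies(df.drop(columns=["G1", "G2", "G3"]),
                   columns=binary_cols + nominal_cols, drop_first=True)

# G3 (final grade) is treated here as the prediction target.
y = df["G3"]
print(X.shape, y.shape)

Whether the interim grades G1 and G2 are kept as predictors or held out, as done above, is a modeling choice that should match the experiment being reproduced.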