TOPOLOGICAL REPRESENTATION AND DIMENSIONALITY REDUCTION OF SINGLE CELL RNA SEQUENCING AND VIRAL PHYLOGENETIC DATA By Yuta Hozumi A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Applied Mathematics—Doctor of Philosophy 2024

ABSTRACT

With the development and improvement of technology, we are now able to observe and collect large volumes of data, especially in the field of biology. Sequencing technology has rapidly advanced, allowing for large-scale sequencing with increased precision. This has resulted in large datasets with high dimensionality. One trade-off of high dimensionality is the increase in sparsity. There are multiple causes of sparsity, including sequencing errors, nonuniform noise, low read depth, etc. It is therefore paramount to find effective ways to reduce the dimensionality of the data and to select important features for downstream analysis.

DNA sequencing is a vital aspect of computational biology. Analyzing the sequences gives insight into relationships between species as well as within species, such as evolution, genetic drift, mutation, common ancestry, and more. We gathered over 3.6 million SARS-CoV-2 sequences from all over the world and performed multiple sequence alignments to extract mutations. Utilizing the mutation data, we performed UMAP dimensionality reduction followed by clustering to analyze mutational trends in the US and the world. However, such alignment methods come with limitations, namely their computational cost and underlying assumptions. To this end, we developed k-mer topology, an alignment-free DNA sequence analysis method that utilizes techniques from topological data analysis. This results in a uniform set of features for all sequences, regardless of their sequence length.

Another advance in sequencing technology is single-cell RNA sequencing (scRNA-seq), in which gene expression profiles for each cell can be extracted. With current technology, roughly 10,000 cells and over 20,000 genes can be sequenced, which causes problems due to the high dimensionality of the data. Most dimensionality reduction methods employ frequency domain representations obtained from matrix diagonalization and may not be efficient for large datasets with relatively high intrinsic dimensions. To address this challenge, Correlated Clustering and Projection (CCP) offers a novel data domain strategy that does not rely on matrix diagonalization. CCP partitions high-dimensional features into correlated clusters and then projects the correlated features in each cluster into a one-dimensional representation based on sample correlations. Additionally, we propose topological nonnegative matrix factorization (TNMF), a matrix decomposition algorithm that incorporates multiscale geometric and topological information in the form of the persistent Laplacian. Due to the nonnegativity constraint of the method, the reduced features are interpretable. Both CCP and TNMF are validated on scRNA-seq data, demonstrating superior performance in clustering, classification, and visualization.

Traditional visualization methods require dimensionality reduction to two or three dimensions. However, such aggressive reduction can lead to misleading conclusions. We therefore introduce the Residue-Similarity (RS) plot, a visualization tool for data with dimensions greater than 3. The RS plot is constructed from two measures, the residue (R) and similarity (S) scores.
The R-score measures the interclass difference, and the S-score measures the intraclass similarity. The Residue-Similarity Index (RSI) is introduced to evaluate the R-score and S-score jointly and is verified to correlate with clustering and classification accuracies across a variety of benchmark datasets.

Copyright by YUTA HOZUMI 2024

This dissertation is dedicated to my parents, Eiji and Momoko. Thank you for always believing in me.

ACKNOWLEDGEMENTS

I would like to begin by thanking Prof. Guo-Wei Wei for being the most supportive and encouraging advisor throughout my Ph.D. journey. During my first year, when I was struggling through the exams, Prof. Wei was the first one to reach out to me to offer advice. After I passed the exams, right when I was getting accustomed to research, we were hit by the COVID-19 pandemic. During this time, Prof. Wei called me regularly, making sure I was doing well, and always made himself available whenever I needed help. Even though I was still inexperienced, he invited me to join the COVID research group, which kick-started my research career. I was able to observe the motivation and brilliance of researchers at the forefront of research. Furthermore, Prof. Wei taught me the importance of the 'big picture,' which many students fail to see, and it has opened my eyes to mathematical biology. I genuinely appreciate him for providing me with numerous opportunities. Without Prof. Wei, I do not think my current goal of pursuing research would exist.

I would also like to thank my dissertation committee, Prof. Mark Iwen, Prof. Ekaterina Rapinchuk (Merkurjev), and Prof. Longxiu Huang, for their encouragement, feedback, and comments. Every time I saw them, they offered encouragement and advice. Additionally, I would like to thank Prof. Changchuan Yin for introducing me to genomic analysis during the COVID projects. He introduced me to the field of bioinformatics, which became the basis of my dissertation. I would also like to thank the senior members of our group, Dr. Rui Wang, Prof. Jiahui Chen, and Prof. Duc Nguyen, for mentoring me throughout my Ph.D. I am especially thankful to Dr. Rui Wang for her guidance throughout my second and third years at MSU. I learned all kinds of skills, including coding, writing papers, topological data analysis, and more. I am also thankful to my cohort. Without their support, especially during my first year and the COVID lockdown, I may not have been able to finish my Ph.D. I am especially thankful to Cullen Haselby, Luis Suarez, Quinn Minich, and Chris Potvin for their encouragement. We are all in different fields of mathematics, but I am thankful for all the advice they have given me.

Lastly, I give my biggest thanks to my father, Eiji Hozumi, and my mother, Momoko Hozumi. They have been my biggest supporters from day one at MSU. I sincerely thank them for their endless and unconditional support and love. They were the first ones I would reach out to if anything happened, good or bad, and they would always listen till the very end.

Finally, I would like to acknowledge that all of the work included in this thesis was supported in part by the following grants: NIH grants R01GM126189, R01AI164266, and R35GM148196; NSF grants DMS-1721024, DMS-1761320, IIS-1900473, and DMS-2052983; NASA grant 80NSSC21M0023; Michigan State University Research Foundation; Bristol-Myers Squibb 65109; and Pfizer.
Additionally, I would like to thank MSU's high performance computing cluster, the IBM TJ Watson Research Center, the COVID-19 High Performance Computing Consortium, and NVIDIA for computational assistance.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
1.1 Dimensionality Reduction
1.2 Topological Data Analysis
1.3 Phylogenetic Analysis
1.4 Single Cell RNA Sequencing (scRNA-seq)
1.5 Outline
BIBLIOGRAPHY

CHAPTER 2 BACKGROUND
2.1 Overview of Dimensionality Reduction
2.2 Clustering Algorithm
2.3 Evaluation metrics for clustering and classification
2.4 Topological Data Analysis
2.5 Phylogenetic Analysis
BIBLIOGRAPHY

CHAPTER 3 PHYLOGENETIC ANALYSIS
3.1 UMAP-assisted k-means clustering of large-scale SARS-CoV-2 mutation datasets
3.2 K-mer Topology for Whole Genome Analysis
BIBLIOGRAPHY
APPENDIX 3A ADDITIONAL MATERIALS FOR UMAP-ASSISTED K-MEANS CLUSTERING OF LARGE-SCALE SARS-COV-2 MUTATION DATASETS
APPENDIX 3B ADDITIONAL MATERIALS FOR K-MERS TOPOLOGY FOR ALIGNMENT-FREE SEQUENCE ANALYSIS

CHAPTER 4 RESIDUE-SIMILARITY SCORES AND INDEXES
4.1 Methods
4.2 Results
4.3 Discussion
BIBLIOGRAPHY

CHAPTER 5 CORRELATED CLUSTERING AND PROJECTION
5.1 Introduction
5.2 Methods and Algorithms
5.3 Results
5.4 Discussion
5.5 Concluding remarks
BIBLIOGRAPHY

CHAPTER 6 TOPOLOGICAL NONNEGATIVE MATRIX FACTORIZATION
6.1 Introduction
6.2 Prior Work
6.3 Topological NMF
BIBLIOGRAPHY
CHAPTER 7 APPLICATION IN SINGLE CELL RNA SEQUENCING
7.1 Preprocessing of Single Cell RNA Sequencing data using Correlated Clustering and Projection
7.2 Analyzing scRNA-seq data by CCP-assisted UMAP and t-SNE
7.3 Topological Non-Negative Matrix Factorization for Single Cell RNA Sequencing Data
BIBLIOGRAPHY
APPENDIX 7A ADDITIONAL RESULTS FOR PREPROCESSING SINGLE CELL RNA SEQUENCING DATA USING CORRELATED CLUSTERING AND PROJECTION
APPENDIX 7B ADDITIONAL MATERIALS FOR ANALYZING SCRNA-SEQ DATA BY CCP-ASSISTED UMAP AND T-SNE
APPENDIX 7C ALGEBRAIC CONNECTIVITY OF FRI IN CCP

CHAPTER 8 DISSERTATION CONTRIBUTIONS AND FUTURE DIRECTIONS

CHAPTER 1 INTRODUCTION

1.1 Dimensionality Reduction

Technological advances have fueled exponential growth in high-dimensional data. In biological science, high-dimensional data are ubiquitous in genomics, epigenomics, transcriptomics, proteomics, metabolomics, and phenomics. In image science, an image of moderate size, i.e., 1024 × 1024, gives rise to a 1,048,576-dimensional vector. The rapid increase in the size and complexity of scientific data has made the problem of the 'curse of dimensionality' more challenging than ever before in data sciences [1]. In machine learning, this problem is associated with the phenomenon that the average predictive power of a well-trained model first increases as the feature size increases but starts to deteriorate beyond a certain dimensionality [2, 3]. Moreover, data in an enormously large feature space become sparse, posing challenges for statistical analysis in determining statistical significance and principal variables. Furthermore, it is challenging to visualize data in high dimensions unless one can reduce the dimension to two or three. Therefore, it is desirable to reduce the dimensionality of high-dimensional data for the sake of prediction, analysis, and visualization. These challenges have been driving the development of many dimensionality reduction methods that can capture the intrinsic correlations of the original data in a low-dimensional representation [4]. Dimensionality reduction can be achieved through various deep neural networks (DNNs), such as graph neural networks, autoencoders, transformers, etc. However, most DNN methods may not work well with excessively high-dimensional data. Commonly used dimensionality reduction algorithms fall into two categories: linear and nonlinear with respect to a certain distance metric. Principal component analysis (PCA) [5] is a basic linear dimensionality reduction (DR) algorithm that focuses on finding the principal components by creating new uncorrelated variables that successively maximize variance [6]. Specifically, the first principal component is a vector that maximizes the variance of the projected data, while the ith principal component is a vector that is orthogonal to the first (i − 1) principal components and maximizes the variance of the projected data. Linear discriminant analysis (LDA) is another linear dimensionality reduction method, proposed by Sir Ronald Fisher in 1936 [7].
As a generalization of Fisher's linear discriminant, LDA aims to find a linear combination of features that maximizes the separability of classes and minimizes the intra-class variance for the multi-class classification problem [8]. Nonnegative matrix factorization (NMF) is another linear dimensionality reduction method that is often employed for data with nonnegative entries, such as the count matrix in single-cell RNA sequencing (scRNA-seq) data. NMF decomposes a nonnegative matrix into two nonnegative factor matrices, and the basis is a nonnegative combination of the original features, leading to a more interpretable result [9]. Another category of dimensionality reduction methods contains many nonlinear algorithms, which can be classified into two groups: those that favor the preservation of the global pairwise distances and those that seek to retain local distances instead of global distances. Algorithms such as kernel principal component analysis (kernel PCA), Sammon mapping, and spectral embedding fall within the former category, while Isomap, LargeVis, Laplacian eigenmaps, locally linear embedding (LLE), diffusion maps [10, 11], t-SNE, and UMAP fall into the latter category. Kernel PCA [12] is an extension of PCA. Standard PCA typically performs poorly if the data have complicated algebraic structures that cannot be well represented in a linear space. Kernel PCA, introduced in 1998, addresses this by applying kernel functions in a reproducing kernel Hilbert space. In 1969, John W. Sammon first proposed the Sammon mapping [13], which aims to conserve the structure of inter-point distances by minimizing Sammon's error and attempts to ensure that the mapping does not affect the underlying topology [14]. Spectral embedding computes the full graph Laplacian and uses its eigenvectors, which allows for the preservation of the original global graph structure in the lower-dimensional space. Although kernel PCA, Sammon mapping, and spectral embedding preserve the pairwise distance structure among all the data, they fail to capture the local relationship between data points. Therefore, nonlinear algorithms are essential to incorporate the local structure in the low-dimensional space and better describe the local information of the original data. A quantitative survey of dimensionality reduction techniques is provided in Ref. [15]. Several widely used nonlinear DR algorithms are briefly discussed in the following. Isomap [16] is a nonlinear method aiming to preserve the geodesic distance between samples while reducing dimensionality. It is an extension of multidimensional scaling (MDS) [17], replacing the Euclidean distance in MDS with the geodesic distance (estimated by Dijkstra's shortest-path distance in graph theory). Isomap is a local method, estimating the intrinsic geometry of a data manifold by estimating each sample's neighbors, ensuring its efficiency [18]. Laplacian Eigenmap (LE), introduced in 2003 [19], is another unsupervised nonlinear algorithm that preserves the local properties of a weighted graph Laplacian. LE constructs a neighborhood graph in which each data point is linked to its nearest neighbors. Then, edge weights are estimated using the Gaussian kernel function. After solving the generalized eigenvalue problem of the weighted neighborhood graph Laplacian, the eigenvector associated with the zero eigenvalue is discarded, and the eigenvectors corresponding to the smallest nonzero eigenvalues are used for embedding in a k-dimensional space.
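As an illustration of the Laplacian eigenmap construction just described (a nearest-neighbor graph with Gaussian edge weights followed by a generalized eigenvalue problem), the following NumPy/SciPy sketch gives a minimal, simplified implementation. It is not the code used in this dissertation; the function name, parameter choices, and toy data are assumptions made purely for exposition, and a connected neighborhood graph is assumed.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def laplacian_eigenmap(X, n_neighbors=10, sigma=1.0, n_components=2):
    """Embed the rows of X (samples x features) into n_components dimensions."""
    D = cdist(X, X)                               # pairwise Euclidean distances
    W = np.exp(-D**2 / (2.0 * sigma**2))          # Gaussian (heat kernel) weights
    # Keep only each point's k nearest neighbors and symmetrize the graph.
    order = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
    mask = np.zeros_like(W, dtype=bool)
    rows = np.repeat(np.arange(X.shape[0]), n_neighbors)
    mask[rows, order.ravel()] = True
    W = np.where(mask | mask.T, W, 0.0)
    deg = np.diag(W.sum(axis=1))                  # degree matrix
    L = deg - W                                   # unnormalized graph Laplacian
    # Generalized eigenproblem L v = lambda * deg * v (assumes a connected graph).
    _, vecs = eigh(L, deg)
    # Discard the trivial eigenvector of the zero eigenvalue; keep the next ones.
    return vecs[:, 1:n_components + 1]

# Toy usage: embed 200 noisy points sampled near a circle into 2D.
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(t), np.sin(t), 0.05 * rng.standard_normal(200)]
Y = laplacian_eigenmap(X, n_neighbors=10, sigma=0.5, n_components=2)
print(Y.shape)  # (200, 2)
```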
Moreover, t-Distributed Stochastic Neighbor Embedding (t-SNE) [20, 21] is a nonlinear, manifold-based method well-suited for reducing high-dimensional data into a two- or three-dimensional space for visualization. t-SNE represents the similarity of every pair of data points by constructing a conditional probability distribution over pairs of points. It then applies the Student t-distribution to obtain the probability distribution in the embedded space. By minimizing the Kullback-Leibler (KL) divergence between these two distributions in the original and embedded spaces, t-SNE preserves the significant structure of the data for analysis and visualization [22]. Furthermore, a state-of-the-art nonlinear dimensionality reduction algorithm is uniform manifold approximation and projection (UMAP) [23], a graph-based algorithm that builds on Laplacian eigenmaps and performs well for visualization and feature extraction. Three assumptions make UMAP stand out among the other dimensionality reduction algorithms: (1) the data are uniformly distributed on a Riemannian manifold, (2) the Riemannian metric is locally constant, and (3) the manifold is locally connected. UMAP creates a weighted graph representation based on a k-nearest-neighbor search and then minimizes the edge-wise cross-entropy between this graph and the embedded low-dimensional weighted graph representation, in terms of a fuzzy set cross-entropy loss function, via stochastic gradient descent. Specifically, UMAP constructs a weighted directed adjacency matrix A, where A(i, j) represents the connection between the ith node and the jth node when the jth node is one of the k nearest neighbors. Next, a normalized sparse Laplacian matrix can be derived from A with the cross-entropy loss incorporated, and the k-dimensional eigenvectors of this normalized Laplacian are used to represent each of the original data points in a low-dimensional space. All of the dimensionality reduction algorithms mentioned above have broad applications in science and technology. However, they often rely on frequency domain representations obtained from matrix diagonalization. Typically, the computational complexity of eigenvalue decomposition for a full matrix is O(M³), where M is the number of samples forming an M × M matrix. While fast solvers exist, they may sacrifice accuracy, especially for datasets with relatively high intrinsic dimensions [15]. Moreover, for datasets with a large number of features I, where M ≪ I, the reliance on matrix diagonalization limits the performance of these algorithms. Additionally, many methods depend on computing distances between data entries (samples), which can be problematic in high dimensions. In particular, methods that utilize nearest neighbors, such as UMAP and t-SNE, may suffer from instability in datasets with moderately high intrinsic dimensions, as outlined by the 'curse of dimensionality' [1]. A somewhat related but distinct problem is tensor-based dimensionality reduction [24, 25], which addresses data with internal structures of geometric, topological, algebraic, and/or physical origins. Methods dealing with tensorial structures, such as the Tucker decomposition [26, 27], are often employed in addition to the aforementioned dimensionality reduction approaches. These methods find applications in various domains such as videos, X-ray Computed Tomography (X-ray CT), and Magnetic Resonance Imaging (MRI) data.
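For concreteness, the short sketch below shows how the t-SNE and UMAP embeddings discussed above are typically invoked in practice. It assumes the third-party scikit-learn and umap-learn packages are available, and the dataset and parameter values are illustrative assumptions rather than settings used in this dissertation.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import umap  # provided by the umap-learn package

X, y = load_digits(return_X_y=True)          # 1797 samples, 64 features

# t-SNE: minimizes the KL divergence between pairwise similarity distributions.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# UMAP: builds a fuzzy k-nearest-neighbor graph and optimizes a cross-entropy loss.
X_umap = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2,
                   random_state=0).fit_transform(X)

print(X_tsne.shape, X_umap.shape)            # (1797, 2) (1797, 2)
```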
Other related issues pertain to feature evaluation, ranking, clustering, extraction, and selection, particularly for unlabeled data. Feature evaluation and ranking can be accomplished through filtering or embedding techniques, while feature clustering and selection can be carried out using methods such as k-means, k-means++, and k-medoids, among others. These methods often serve as preprocessing steps for dimensionality reduction. For labeled data, various supervised learning methods can be employed for feature selection or extraction.

In this dissertation, we introduce two novel dimensionality reduction methods: Correlated Clustering and Projection (CCP) and Topological Nonnegative Matrix Factorization (TNMF). CCP is a data domain dimensionality reduction method consisting of two steps. First, features are clustered based on their similarities. Then, the clustered features are nonlinearly projected into a single descriptor. TNMF, on the other hand, is a persistent Laplacian regularized NMF method. Unlike standard manifold regularization, TNMF can capture the geometric and topological shape of the data at multiple scales. We evaluate CCP and TNMF on standard benchmark datasets and apply these methods to single-cell RNA sequencing (scRNA-seq) data. Additionally, we propose a novel visualization method called the Residue-Similarity (RS) plot. The RS plot comprises two components. Given a set of labels for each sample, the Residue (R) score measures the interclass difference, while the Similarity (S) score measures the intraclass similarity. We find that the RS plot can effectively visualize data with dimensions greater than 3, and the RS score correlates with the accuracy of classification tasks.

1.2 Topological Data Analysis

In recent years, there has been an explosion of data across multiple disciplines, with biology experiencing a significant influx of new data from technological advances such as genomics, single-cell RNA sequencing (scRNA-seq) [28], spatial transcriptomics (ST) [29], epigenomics [30], and DNA sequencing [31]. The data in many fields are often high-dimensional and noisy and lack uniformity in size or features. However, traditional data analysis tools, including dimensionality reduction techniques, often struggle to capture the intricate structures and relationships within such complex datasets. Moreover, traditional approaches are ill-equipped to handle nonuniform data distributions. To address these limitations, topological data analysis (TDA) has gained popularity in recent years due to its ability to extract meaningful insights from complex data using properties from algebraic topology [32, 33, 34, 35]. At the heart of TDA lies persistent homology, a branch of algebraic topology that serves as a powerful tool for characterizing the intrinsic structure in datasets, regardless of the inherent noise present in the data [36, 37, 38]. Persistent homology differs from traditional dimensionality reduction techniques because it relies on the topology of the data rather than on its geometric properties and traditional metrics. In essence, persistent homology represents the dataset as a topological space and uses tools from algebraic topology to extract meaningful features, namely the topological invariants. Additionally, by introducing a filtration, we can keep track of the changes in the topological invariants to understand the geometric and topological shape of the data, which is termed persistence.
Through the lens of persistent homology, researchers can uncover hidden patterns, identify relevant structures, and gain a deeper understanding of the datasets' intrinsic characteristics. Spectral graph theory has also been introduced to TDA in recent years, with the introduction of topological Laplacians. Eckmann [39] introduced simplicial complexes to the graph Laplacian defined on point cloud data, leading to the combinatorial Laplacian. This can be viewed as a discrete counterpart of the de Rham-Hodge Laplacian on manifolds. Both the Hodge Laplacian and the combinatorial Laplacian are topological Laplacians that give rise to topological invariants in their kernel space, specifically the harmonic spectra. However, the nonharmonic spectra contain algebraic connectivity that cannot be revealed by the topological invariants from standard persistent homology [40]. A significant development in topological Laplacians occurred in 2019 with the introduction of persistent topological Laplacians. Specifically, an evolutionary de Rham theory was introduced to obtain persistent Hodge Laplacians on manifolds [41]. Meanwhile, the persistent combinatorial Laplacian [42], also known as the persistent spectral graph or persistent Laplacian (PL), was introduced for point cloud data. These methods have spurred numerous theoretical developments [43, 44, 45, 46, 47] and a software package [48], as well as remarkable applications in various fields, including protein engineering [49], forecasting the emerging SARS-CoV-2 variants BA.4/BA.5 [50], and predicting protein-ligand binding affinity [51]. Recently, PL has been shown to improve PCA performance [52]. In this dissertation, we utilize TDA as a visualization tool for high-dimensional data. Additionally, we utilize persistent homology to develop a novel algorithm called k-mers topology, which is used to analyze viral nucleotide sequences. Lastly, we develop topological nonnegative matrix factorization (TNMF), utilizing PL to analyze single-cell RNA sequencing data.

1.3 Phylogenetic Analysis

Phylogenetic analysis of genetic sequences is crucial for elucidating the evolutionary relationships both between and within species [53]. This analysis entails the comparison of two or more DNA or RNA sequences to discern similarities, differences, and patterns within the genetic code. By identifying similarities and differences among sequences, researchers can gain insights into evolutionary relationships, mutations, genetic drift, genome assembly, gene annotation, and more. Having a robust and scalable method that enables comparisons within species and across species is essential for laying the groundwork of genomic analysis. Traditional phylogenetic analysis utilizes sequence alignment, where pairs of sequences are aligned to obtain similarity scores. These methods include sequence similarity scores, like BLAST [54] and FASTA [55]; sequence profile searches, like PSI-BLAST [54] and HMMER [56]; whole-genome comparison, like Mauve [57] and TBA [58]; and multiple sequence alignment, like ClustalW [59], MAFFT [60], and Muscle [61]. In recent years, the sequence of SARS-CoV-2 has been heavily studied using alignment-based methods to analyze mutational trends and predict future mutations [62, 63, 50, 64, 65]. An advantage of sequence alignment methods is their ability to extract mutation sites, which can then be utilized for downstream analysis.
These alignment-based methods assume that the segments of the sequences can be categorized into conserved and non-conserved segments. The effectiveness of the alignment can then be measured based on the amount of conserved regions, penalized by the amount of non-conserved segments, or gaps. However, such methods can fail if these conserved segments are not properly arranged in the sequence or if there is not a sufficient amount of conserved segments, which often occurs in real-world data [66, 67, 68]. Additionally, alignment-based methods are time-consuming and require a large amount of memory, which makes large-scale comparison difficult. Other cases where alignment-based methods can fail can be found in [69]. To combat these issues, alignment-free methods have been developed, which do not make such assumptions about the sequences. One of the most common alignment-free methods is the k-mers-based approach. In the original k-mers method developed by Blaisdell in 1986 [70], the frequency of words, or motifs, is counted, and each sequence is represented as a count-based vector. Then, a metric can be defined to compare sequences for phylogenetic analysis. Numerous extensions have been proposed to improve the k-mers method, such as those by Wu et al. [71, 72, 73], Korf et al. [74], and Jun et al. [75]. Because the counting is linear with respect to the sequence length, it is extremely computationally efficient, allowing for whole-genome analysis. However, these methods do not capture the relationship between the k-mers or the positional information of the sequence. Several alternative alignment-free methods have been proposed as well. The Natural Vector Method (NVM) computes the moments of the k-mers' positions [76, 77], incorporating the relationship between the k-mers within the sequence, thus overcoming limitations of the k-mers-based method. In information theory-based methods, the distance between sequences is obtained by computing the amount of information shared between them [78, 79, 80, 81, 82]. The Chaos Game Representation (CGR), originally proposed by Jeffrey in 1990 [83], represents the sequence as an iterated function, allowing the DNA sequence to be visualized as an image [84]. The CGR representation was further extended in [85, 86, 87, 88] for sequence analysis. Other methods, such as the discrete Fourier power spectrum method [89, 90], the fuzzy integral similarity method [91], and more [92], have been developed for phylogenetic analysis. In this dissertation, we employ alignment-based methods to study the mutational patterns in SARS-CoV-2. After aligning the sequences, we extract mutations and utilize UMAP to reduce the dimensionality for clustering. Additionally, we introduce a novel alignment-free method called k-mer topology, which utilizes tools from persistent homology to extract patterns within the sequence. We benchmark our method on virus classification tasks and apply it to standard phylogenetic analysis.

1.4 Single Cell RNA Sequencing (scRNA-seq)

Single-cell RNA sequencing (scRNA-seq) reveals heterogeneity within cell types, leading to an understanding of cell-cell communication, cell differentiation, and differential gene expression. With current technology and protocols, more than 20,000 genes can be identified. Additionally, numerous data analysis pipelines have been developed to assist in analyzing such complex data [93, 94, 95, 96, 97, 98]. Despite improvements in technology that enable more accurate gene readings, analyzing these readings remains challenging.
Causes of this challenge include dropout event-induced zero expression counts, low sequencing depth resulting in fewer readings, general noise, and the high dimensionality of the original data [28]. As a result, dimensionality reduction and feature selection become important for downstream analysis. Numerous dimensionality reduction and feature selection methods have been proposed for scRNA-seq data. One such method is SinNLRR, a subspace clustering method based on non-negative and low-rank representation, which assumes that scRNA-seq data have an inherently low rank and attempts to find the smallest-rank matrix that captures the original data [99]. Additionally, various non-negative matrix factorization (NMF) methods with different constraints have been developed. In these methods, the low-dimensional representation of scRNA-seq data is achieved through a linear combination of the original genes, which are called meta-genes [100, 101, 102, 103, 104, 105]. Single-cell interpretation via multikernel learning (SIMLR) utilizes multiple kernels to learn a cell-cell similarity metric that generalizes to different biological experiments and experimental procedures [106]. In addition, more traditional approaches, such as principal component analysis (PCA) [107] and its derivatives [108, 109], and visualization techniques, such as uniform manifold approximation and projection (UMAP) [23] and t-distributed stochastic neighbor embedding (t-SNE) [110], have been heavily utilized for scRNA-seq data. Furthermore, deep learning has also been used for dimensionality reduction [111, 112, 113, 114, 115, 116]. Deep learning and ensemble methods are another class of approaches that have become popular for single-cell RNA-seq analysis. Single-cell variational inference (scVI) utilizes deep neural networks to obtain information from similar cells and genes to approximate the distribution of the underlying gene expression values [111]. Single-cell clustering using marker genes (SCMcluster) utilizes known marker genes to guide feature selection and perform ensemble clustering [117]. AutoCell [118] utilizes a variational autoencoding network that combines a Gaussian mixture model and graph embedding to model high-dimensional scRNA-seq data. Diffusion models [119, 120, 121, 122], generative adversarial networks (GANs) [121], language models [123, 124, 125], transformers [126, 127, 128], ensemble methods [106, 129, 130], and more [131, 132] have also been used for scRNA-seq analysis. Though these methods perform well, they rely on careful curation of data and often require a large amount of data for pretraining. Although numerous techniques have been developed, PCA is the most commonly used method for downstream analysis of scRNA-seq data [133]. PCA is a linear dimensionality reduction method whose goal is to compute the principal components as new features that maximize the variance. The first principal component is a feature that maximizes the variance of the projected data, and the ith principal component is orthogonal to the first (i − 1) principal components and maximizes the variance of the projected data [6]. Single-cell consensus clustering (SC3) [129] utilizes PCA and the eigenvectors of the graph Laplacian induced by Euclidean, Pearson, and Spearman distances, and performs a consensus on k-means results obtained from different dimensions using the CSPA algorithm to obtain the final cell clustering result.
CellChat [134] utilizes the low-dimensional representation of scRNA-seq data alongside known interactions between ligands, receptors, and cofactors to predict cell-cell communication, and the user can perform dimensionality reduction prior to utilizing CellChat. DEEPsc [135] is a deep learning method that predicts the probability of a cell belonging to a reference atlas by projecting scRNA-seq data to the PCA space of the reference atlas, which can then be used to predict cell types. The popular package Seurat [136] utilizes supervised PCA (sPCA), which finds the projection that captures the weighted nearest neighbor graph of the reference dataset, for its downstream analysis. In addition to cell clustering, semi-supervised and supervised learning methods have been used to classify cell types according to their reference cells by projecting unknown cells to the PCA space of the reference cells [137, 138]. Utilizing PCA has many advantages, such as computational efficiency and ease of projecting new data onto the principal components. However, PCA lacks concrete interpretability and loses the non-negativity of read-count data. In contrast, the components of NMF are all positive and can be considered as metagenes, where metagenes are linear combinations of the original genes. Nonlinear dimensionality reduction methods, such as UMAP, t-SNE, and Isomap, perform well at low dimensions and can capture the local structure of the data, but they also lack interpretability due to matrix diagonalization. Moreover, both PCA and traditional nonlinear reduction methods are unstable when the data are reduced to higher dimensions, which is unfavorable for machine learning and deep learning tasks that typically require a large number of features. In this dissertation, we apply Correlated Clustering and Projection (CCP) and Topological Nonnegative Matrix Factorization (TNMF) to scRNA-seq data. We found that CCP outperforms PCA in both classification and clustering tasks when the number of components is higher than 50. Additionally, CCP improves both UMAP and t-SNE visualization of scRNA-seq data. Lastly, TNMF outperforms other NMF methods in clustering tasks on scRNA-seq data.

1.5 Outline

The outline of the dissertation is as follows. In Chapter 2, we introduce background information that will be utilized throughout this dissertation, such as evaluation metrics, clustering algorithms, and dimensionality reduction algorithms. In Chapter 3, we introduce a large-scale clustering algorithm, UMAP-assisted k-means, and apply the technique to SARS-CoV-2 single nucleotide polymorphism (SNP) data. Furthermore, we introduce a novel alignment-free DNA sequence analysis method called k-mers topology. In Chapter 4, we introduce Residue-Similarity scores and indexes, which present a novel approach to evaluating classification and clustering problems. In Chapter 5, we introduce Correlated Clustering and Projection (CCP), a nonlinear data-domain dimensionality reduction method, and we compare its performance on standard benchmark data. In Chapter 6, we introduce topological nonnegative matrix factorization (TNMF), a persistent Laplacian (PL) regularized nonnegative matrix factorization (NMF). Compared to the traditional graph Laplacian, PL utilizes filtration to capture multiscale interactions, which a traditional Laplacian cannot. Additionally, PL captures standard topological information and homotopic shape evolution, which gives a comprehensive view of the overall shape of the data.
In Chapter 7, CCP and TNMF are applied to single-cell RNA sequencing (scRNA-seq) data. CCP improves upon other dimensionality reduction methods not only in clustering and classification tasks but also in visualization. Additionally, RS analysis is performed on scRNA-seq data and is shown to be effective for both visualization and performance evaluation. TNMF is shown to improve the adjusted Rand index (ARI), normalized mutual information (NMI), purity, and accuracy (ACC) over other NMF methods for clustering.

BIBLIOGRAPHY

[1] Richard Bellman. Dynamic programming. Science, 153(3731):34–37, 1966.
[2] Gerard V Trunk. A problem of dimensionality: A simple example. IEEE Transactions on pattern analysis and machine intelligence, (3):306–307, 1979.
[3] B Chandrasekaran and Anil K Jain. Quantization complexity and independent measurements. IEEE Transactions on Computers, 100(1):102–106, 1974.
[4] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
[5] George H Dunteman. Principal components analysis. Number 69. Sage, 1989.
[6] Ian T Jolliffe and Jorge Cadima. Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065):20150202, 2016.
[7] Ronald A Fisher. The use of multiple measurements in taxonomic problems. Annals of eugenics, 7(2):179–188, 1936.
[8] Petros Xanthopoulos, Panos M Pardalos, and Theodore B Trafalis. Linear discriminant analysis. In Robust data mining, pages 27–33. Springer, 2013.
[9] Yu-Xiong Wang and Yu-Jin Zhang. Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on knowledge and data engineering, 25(6):1336–1353, 2012.
[10] Ronald R Coifman, Stephane Lafon, Ann B Lee, Mauro Maggioni, Boaz Nadler, Frederick Warner, and Steven W Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proceedings of the national academy of sciences, 102(21):7426–7431, 2005.
[11] Ketson R Dos Santos, Dimitrios G Giovanis, and Michael D Shields. Grassmannian diffusion maps–based dimension reduction and classification for high-dimensional data. SIAM Journal on Scientific Computing, 44(2):B250–B274, 2022.
[12] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural computation, 10(5):1299–1319, 1998.
[13] John W Sammon. A nonlinear mapping for data structure analysis. IEEE Transactions on computers, 100(5):401–409, 1969.
[14] Paul Henderson. Sammon mapping. Pattern Recognit. Lett, 18(11-13):1307–1316, 1997.
[15] Mateus Espadoto, Rafael M Martins, Andreas Kerren, Nina ST Hirata, and Alexandru C Telea. Toward a quantitative survey of dimension reduction techniques. IEEE transactions on visualization and computer graphics, 27(3):2153–2173, 2019.
[16] Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[17] Al Mead. Review of the development of multidimensional scaling methods. Journal of the Royal Statistical Society: Series D (The Statistician), 41(1):27–39, 1992.
[18] Farzana Anowar, Samira Sadaoui, and Bassant Selim. Conceptual and empirical comparison of dimensionality reduction algorithms (pca, kpca, lda, mds, svd, lle, isomap, le, ica, t-sne). Computer Science Review, 40:100378, 2021.
[19] Mikhail Belkin and Partha Niyogi.
Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396, 2003. [20] Geoffrey Hinton and Sam T Roweis. Stochastic neighbor embedding. In NIPS, volume 15, pages 833–840. Citeseer, 2002. [21] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008. [22] Bo Li, Yan-Rui Li, and Xiao-Long Zhang. A survey on laplacian eigenmaps based manifold learning methods. Neurocomputing, 335:336–351, 2019. [23] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018. [24] Jiun-Hung Chen and Linda G Shapiro. Pca vs. tensor-based dimension reduction methods: An empirical comparison on active shape models of organs. In 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pages 5838–5841. IEEE, 2009. [25] Hongcheng Wang and Narendra Ahuja. A tensor approximation approach to dimensionality reduction. International Journal of Computer Vision, 76(3):217–229, 2008. [26] Ledyard R Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966. [27] Xutao Li, Michael K Ng, Gao Cong, Yunming Ye, and Qingyao Wu. Mr-ntd: Manifold regularization nonnegative tucker decomposition for tensor data dimension reduction and representation. IEEE transactions on neural networks and learning systems, 28(8):1787– 1800, 2016. [28] David Lähnemann, Johannes Köster, Ewa Szczurek, Davis J McCarthy, Stephanie C Hicks, 14 Mark D Robinson, Catalina A Vallejos, Kieran R Campbell, Niko Beerenwinkel, Ahmed Mahfouz, et al. Eleven grand challenges in single-cell data science. Genome biology, 21(1):1–35, 2020. [29] Cameron G Williams, Hyun Jae Lee, Takahiro Asatsuma, Roser Vento-Tormo, and Ashra- ful Haque. An introduction to spatial transcriptomics for biomedical research. Genome Medicine, 14(1):68, 2022. [30] Christoph Bock and Thomas Lengauer. Computational epigenetics. Bioinformatics, 24(1):1– 10, 2008. [31] Susana Vinga and Jonas Almeida. Alignment-free sequence comparison—a review. Bioin- formatics, 19(4):513–523, 2003. [32] David Cohen-Steiner, Herbert Edelsbrunner, John Harer, and Yuriy Mileyko. Lipschitz functions have l p-stable persistence. Foundations of computational mathematics, 10(2):127– 139, 2010. [33] Paul Bendich, Herbert Edelsbrunner, and Michael Kerber. Computing robustness and per- sistence for images. IEEE transactions on visualization and computer graphics, 16(6):1251– 1260, 2010. [34] Robert Ghrist. Barcodes: the persistent topology of data. Bulletin of the American Mathe- matical Society, 45(1):61–75, 2008. [35] Gunnar Carlsson. Topology and data. Bulletin of the American Mathematical Society, 46(2):255–308, 2009. [36] Afra Zomorodian and Gunnar Carlsson. Localized homology. Computational Geometry, 41(3):126–148, 2008. [37] Afra Zomorodian and Gunnar Carlsson. Computing persistent homology. In Proceedings of the twentieth annual symposium on Computational geometry, pages 347–356, 2004. [38] Edelsbrunner, Letscher, and Zomorodian. Topological persistence and simplification. Dis- crete & computational geometry, 28:511–533, 2002. [39] Beno Eckmann. Harmonische funktionen und randwertaufgaben in einem komplex. Com- mentarii Mathematici Helvetici, 17(1):240–255, 1944. [40] Danijela Horak and Jürgen Jost. Spectra of combinatorial laplace operators on simplicial complexes. Advances in Mathematics, 244:303–336, 2013. 
[41] Jiahui Chen, Rundong Zhao, Yiying Tong, and Guo-Wei Wei. Evolutionary de rham-hodge method. Discrete and continuous dynamical systems. Series B, 26(7):3785, 2021. 15 [42] Rui Wang, Duc Duy Nguyen, and Guo-Wei Wei. Persistent spectral graph. International journal for numerical methods in biomedical engineering, 36(9):e3376, 2020. [43] Facundo Mémoli, Zhengchao Wan, and Yusu Wang. Persistent laplacians: Properties, algorithms and implications. SIAM Journal on Mathematics of Data Science, 4(2):858–884, 2022. [44] Jian Liu, Jingyan Li, and Jie Wu. The algebraic stability for persistent laplacians. arXiv preprint arXiv:2302.03902, 2023. [45] Xiaoqi Wei and Guo-Wei Wei. Persistent sheaf laplacians. arXiv preprint arXiv:2112.10906, 2021. [46] Rui Wang and Guo-Wei Wei. Persistent path laplacian. Foundations of Data Science, 5:26–55, 2023. [47] Dong Chen, Jian Liu, Jie Wu, and Guo-Wei Wei. Persistent hyperdigraph homology and per- sistent hyperdigraph laplacians. Foundations of Data Science, doi: 10.3934/fods.2023010, 2023. [48] Rui Wang, Rundong Zhao, Emily Ribando-Gros, Jiahui Chen, Yiying Tong, and Guo-Wei Wei. Hermes: Persistent spectral graph software. Foundations of data science (Springfield, Mo.), 3(1):67, 2021. [49] Yuchi Qiu and Guo-Wei Wei. Persistent spectral theory-guided protein engineering. Nature Computational Science, 3(2):149–163, 2023. [50] Jiahui Chen, Yuchi Qiu, Rui Wang, and Guo-Wei Wei. Persistent laplacian projected omicron ba. 4 and ba. 5 to become new dominating variants. Computers in Biology and Medicine, 151:106262, 2022. [51] Zhenyu Meng and Kelin Xia. Persistent spectral–based machine learning (perspect ml) for protein-ligand binding affinity prediction. Science advances, 7(19):eabc5329, 2021. [52] Sean Cottrell, Rui Wang, and Guowei Wei. PLPCA: Persistent Laplacian enhanced- Journal of Chemical Information and Modeling, PCA for microarray data analysis. doi.org/10.1021/acs.jcim.3c01023, 2023. [53] Masatoshi Nei. Phylogenetic analysis in molecular evolutionary genetics. Annual review of genetics, 30(1):371–403, 1996. [54] Stephen F Altschul, Thomas L Madden, Alejandro A Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J Lipman. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic acids research, 25(17):3389–3402, 1997. 16 [55] William R Pearson and David J Lipman. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences, 85(8):2444–2448, 1988. [56] Robert D Finn, Alex Bateman, Jody Clements, Penelope Coggill, Ruth Y Eberhardt, Sean R Eddy, Andreas Heger, Kirstie Hetherington, Liisa Holm, Jaina Mistry, et al. Pfam: the protein families database. Nucleic acids research, 42(D1):D222–D230, 2014. [57] Aaron E Darling, Bob Mau, and Nicole T Perna. progressivemauve: multiple genome alignment with gene gain, loss and rearrangement. PloS one, 5(6):e11147, 2010. [58] Mathieu Blanchette, W James Kent, Cathy Riemer, Laura Elnitski, Arian FA Smit, Kr- ishna M Roskin, Robert Baertsch, Kate Rosenbloom, Hiram Clawson, Eric D Green, et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome research, 14(4):708–715, 2004. [59] Julie D Thompson, Desmond G Higgins, and Toby J Gibson. Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position- specific gap penalties and weight matrix choice. Nucleic acids research, 22(22):4673–4680, 1994. [60] Kazutaka Katoh, Kazuharu Misawa, Kei-ichi Kuma, and Takashi Miyata. 
Mafft: a novel method for rapid multiple sequence alignment based on fast fourier transform. Nucleic acids research, 30(14):3059–3066, 2002. [61] Robert C Edgar. Muscle: multiple sequence alignment with high accuracy and high through- put. Nucleic acids research, 32(5):1792–1797, 2004. [62] Yuta Hozumi, Rui Wang, Changchuan Yin, and Guo-Wei Wei. Umap-assisted k-means clustering of large-scale sars-cov-2 mutation datasets. Computers in biology and medicine, 131:104264, 2021. [63] Jiahui Chen and Guo-Wei Wei. Omicron ba. 2 (b. 1.1. 529.2): high potential for becoming the next dominant variant. The journal of physical chemistry letters, 13(17):3840–3849, 2022. [64] Michael Bleher, Lukas Hahn, Maximilian Neumann, Juan Angel Patino-Galindo, Mathieu Carriere, Ulrich Bauer, Raul Rabadan, and Andreas Ott. Topological data analysis identifies emerging adaptive mutations in sars-cov-2. arXiv preprint arXiv:2106.07292, 2021. [65] Juan Ángel Patiño-Galindo, Ioan Filip, Ratul Chowdhury, Costas D Maranas, Peter K Sorger, Mohammed AlQuraishi, and Raul Rabadan. Recombination and lineage-specific mutations linked to the emergence of sars-cov-2. Genome Medicine, 13:1–14, 2021. [66] Sean R Eddy. Where did the blosum62 alignment score matrix come from? Nature biotechnology, 22(8):1035–1036, 2004. 17 [67] Paul P Gardner, Andreas Wilm, and Stefan Washietl. A benchmark of multiple sequence alignment programs upon structural rnas. Nucleic acids research, 33(8):2433–2439, 2005. [68] Emidio Capriotti and Marc A Marti-Renom. Quantifying the relationship between sequence and three-dimensional structure conservation in rna. BMC bioinformatics, 11:1–10, 2010. [69] Andrzej Zielezinski, Susana Vinga, Jonas Almeida, and Wojciech M Karlowski. Alignment- free sequence comparison: benefits, applications, and tools. Genome biology, 18:1–17, 2017. [70] B Edwin Blaisdell. A measure of the similarity of sets of sequences not requiring sequence alignment. Proceedings of the National Academy of Sciences, 83(14):5155–5159, 1986. [71] Tiee-Jian Wu, John P Burke, and Daniel B Davison. A measure of dna sequence dissimilarity based on mahalanobis distance between frequencies of words. Biometrics, pages 1431–1439, 1997. [72] Tiee-Jian Wu, Ya-Ching Hsieh, and Lung-An Li. Statistical measures of dna sequence dissimilarity under markov chain models of base composition. Biometrics, 57(2):441–448, 2001. [73] Tiee-Jian Wu, Ying-Hsueh Huang, and Lung-An Li. Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between dna sequences. Bioinformat- ics, 21(22):4125–4132, 2005. [74] Ian F Korf and Alan B Rose. Applying word-based algorithms: the imeter. Plant Systems Biology, pages 287–301, 2009. [75] Se-Ran Jun, Gregory E Sims, Guohong A Wu, and Sung-Hou Kim. Whole-proteome phylogeny of prokaryotes by feature frequency profiles: An alignment-free method with optimal feature resolution. Proceedings of the National Academy of Sciences, 107(1):133– 138, 2010. [76] Chenglong Yu, Troy Hernandez, Hui Zheng, Shek-Chung Yau, Hsin-Hsiung Huang, Rong Lucy He, Jie Yang, and Stephen S-T Yau. Real time classification of viruses in 12 dimensions. PloS one, 8(5):e64328, 2013. [77] Mo Deng, Chenglong Yu, Qian Liang, Rong L He, and Stephen S-T Yau. A novel method of characterizing genetic sequences: genome space with biological distance and applications. PloS one, 6(3):e17293, 2011. [78] Igor Ulitsky, David Burstein, Tamir Tuller, and Benny Chor. The average common substring approach to phylogenomic reconstruction. 
Journal of Computational Biology, 13(2):336– 350, 2006. 18 [79] Chris-Andre Leimeister and Burkhard Morgenstern. Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics, 30(14):2000–2008, 2014. [80] Lianping Yang, Xiangde Zhang, Haoyue Fu, and Chenhui Yang. An estimator for local analysis of genome based on the minimal absent word. Journal of Theoretical Biology, 395:23–30, 2016. [81] Lianping Yang, Xiangde Zhang, and Hegui Zhu. Alignment free comparison: similarity distribution between the dna primary sequences based on the shortest absent word. Journal of theoretical biology, 295:125–131, 2012. [82] Susana Vinga. Information theory applications for biological sequence analysis. Briefings in bioinformatics, 15(3):376–389, 2014. [83] H Joel Jeffrey. Chaos game representation of gene structure. Nucleic acids research, 18(8):2163–2170, 1990. [84] Milan Randić, Marjana Novič, and Dejan Plavšić. Milestones in graphical bioinformatics. International Journal of Quantum Chemistry, 113(22):2413–2446, 2013. [85] Pradeep Kumar Burma, Alok Raj, Jayant K Deb, and Samir K Brahmachari. Genome analysis: a new approach for visualization of sequence organization in genomes. Journal of biosciences, 17:395–411, 1992. [86] Jonas S Almeida, Joao A Carrico, Antonio Maretzek, Peter A Noble, and Madilyn Fletcher. Analysis of genomic sequences by chaos game representation. Bioinformatics, 17(5):429– 437, 2001. [87] Patrick J Deschavanne, Alain Giron, Joseph Vilain, Guillaume Fagot, and Bernard Fertil. Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Molecular biology and evolution, 16(10):1391–1399, 1999. [88] Bai-Lin Hao. Fractals from genomes–exact solutions of a biology-inspired problem. Physica A: Statistical Mechanics and its Applications, 282(1-2):225–246, 2000. [89] Tung Hoang, Changchuan Yin, Hui Zheng, Chenglong Yu, Rong Lucy He, and Stephen S-T Yau. A new method to cluster dna sequences using fourier power spectrum. Journal of theoretical biology, 372:135–145, 2015. [90] Changchuan Yin, Ying Chen, and Stephen S-T Yau. A measure of dna sequence similarity by fourier transform with applications on hierarchical clustering. Journal of theoretical biology, 359:18–28, 2014. [91] Ajay Kumar Saw, Garima Raj, Manashi Das, Narayan Chandra Talukdar, Binod Chandra 19 Tripathy, and Soumyadeep Nandi. Alignment-free method for dna sequence clustering using fuzzy integral similarity. Scientific reports, 9(1):3753, 2019. [92] Chenglong Yu, Qian Liang, Changchuan Yin, Rong L He, and Stephen S-T Yau. A novel construction of genome space with biological geometry. DNA research, 17(3):155–168, 2010. [93] Byungjin Hwang, Ji Hyun Lee, and Duhee Bang. Single-cell rna sequencing technologies and bioinformatics pipelines. Experimental & molecular medicine, 50(8):1–14, 2018. [94] Tallulah S Andrews, Vladimir Yu Kiselev, Davis McCarthy, and Martin Hemberg. Tutorial: guidelines for the computational analysis of single-cell rna sequencing data. Nature protocols, 16(1):1–9, 2021. [95] Malte D Luecken and Fabian J Theis. Current best practices in single-cell rna-seq analysis: a tutorial. Molecular systems biology, 15(6):e8746, 2019. [96] Geng Chen, Baitang Ning, and Tieliu Shi. Single-cell rna-seq technologies and related computational data analysis. Frontiers in genetics, page 317, 2019. [97] Raphael Petegrosso, Zhuliu Li, and Rui Kuang. 
Machine learning and statistical methods for clustering single-cell rna-sequencing data. Briefings in bioinformatics, 21(4):1209–1223, 2020. [98] Wei Vivian Li and Jingyi Jessica Li. A statistical simulator scdesign for rational scrna-seq experimental design. Bioinformatics, 35(14):i41–i50, 2019. [99] Ruiqing Zheng, Min Li, Zhenlan Liang, Fang-Xiang Wu, Yi Pan, and Jianxin Wang. Sinnlrr: a robust subspace clustering method for cell type detection by non-negative and low-rank representation. Bioinformatics, 35(19):3642–3650, 2019. [100] Zhenqiu Shu, Qinghan Long, Luping Zhang, Zhengtao Yu, and Xiao-Jun Wu. Robust graph regularized nmf with dissimilarity and similarity constraints for scrna-seq data clustering. Journal of Chemical Information and Modeling, 62(23):6271–6286, 2022. [101] Peng Wu, Mo An, Hai-Ren Zou, Cai-Ying Zhong, Wei Wang, and Chang-Peng Wu. A robust semi-supervised nmf model for single cell rna-seq data. PeerJ, 8:e10091, 2020. [102] Jianwei Chen. Detecting cell type from single cell rna sequencing based on deep bi-stochastic graph regularized matrix factorization. bioRxiv, 2022. [103] Qiu Xiao, Jiawei Luo, Cheng Liang, Jie Cai, and Pingjian Ding. A graph regularized non- negative matrix factorization method for identifying microrna-disease associations. Bioin- formatics, 34(2):239–248, 2018. 20 [104] Na Yu, Ying-Lian Gao, Jin-Xing Liu, Juan Wang, and Junliang Shang. Robust hypergraph regularized non-negative matrix factorization for sample clustering and feature selection in multi-view gene expression data. Human genomics, 13(1):1–10, 2019. [105] Jin-Xing Liu, Dong Wang, Ying-Lian Gao, Chun-Hou Zheng, Jun-Liang Shang, Feng Liu, and Yong Xu. A joint-l2, 1-norm-constraint-based semi-supervised feature extraction for rna-seq data analysis. Neurocomputing, 228:263–269, 2017. [106] Bo Wang, Junjie Zhu, Emma Pierson, Daniele Ramazzotti, and Serafim Batzoglou. Visual- ization and analysis of single-cell rna-seq data by kernel-based similarity learning. Nature methods, 14(4):414–416, 2017. [107] Shuangge Ma and Ying Dai. Principal component analysis based methods in bioinformatics studies. Briefings in bioinformatics, 12(6):714–722, 2011. [108] Seyoung Park and Hongyu Zhao. Sparse principal component analysis with missing obser- vations. The Annals of Applied Statistics, 13(2):1016–1042, 2019. [109] F William Townes, Stephanie C Hicks, Martin J Aryee, and Rafael A Irizarry. Feature selection and dimension reduction for single-cell rna-seq based on a multinomial model. Genome biology, 20:1–16, 2019. [110] Dmitry Kobak and Philipp Berens. The art of using t-sne for single-cell transcriptomics. Nature communications, 10(1):1–14, 2019. [111] Romain Lopez, Jeffrey Regier, Michael B Cole, Michael I Jordan, and Nir Yosef. Deep generative modeling for single-cell transcriptomics. Nature methods, 15(12):1053–1058, 2018. [112] Carlos Torroja and Fatima Sanchez-Cabo. Digitaldlsorter: deep-learning on scrna-seq to deconvolute gene expression data. Frontiers in Genetics, 10:978, 2019. [113] Musu Yuan, Liang Chen, and Minghua Deng. scmra: a robust deep learning method to annotate scrna-seq data with multiple reference datasets. Bioinformatics, 38(3):738–745, 2022. [114] Zixiang Luo, Chenyu Xu, Zhen Zhang, and Wenfei Jin. A topology-preserving dimensional- ity reduction method for single-cell rna-seq data using graph autoencoder. Scientific reports, 11(1):20028, 2021. [115] Dongfang Wang and Jin Gu. 
Vasc: dimension reduction and visualization of single-cell rna-seq data by deep variational autoencoder. Genomics, proteomics & bioinformatics, 16(5):320–331, 2018. [116] Eugene Lin, Sudipto Mukherjee, and Sreeram Kannan. A deep adversarial variational 21 autoencoder model for dimensionality reduction in single-cell rna sequencing analysis. BMC bioinformatics, 21(1):1–11, 2020. [117] Hao Wu, Haoru Zhou, Bing Zhou, and Meili Wang. Scmcluster: a high-precision cell clus- tering algorithm integrating marker gene set with single-cell rna sequencing data. Briefings in Functional Genomics, page elad004, 2023. [118] Junlin Xu, Jielin Xu, Yajie Meng, Changcheng Lu, Lijun Cai, Xiangxiang Zeng, Ruth Nussi- nov, and Feixiong Cheng. Graph embedding and gaussian mixture variational autoencoder network for end-to-end analysis of single-cell rna sequencing data. Cell Reports methods, 3(1), 2023. [119] Mehrshad Sadria and Anita Layton. The power of two: integrating deep diffusion models and variational autoencoders for single-cell transcriptomics analysis. bioRxiv, pages 2023–04, 2023. [120] Alessandro Palma, Fabian J Theis, and Mohammad Lotfollahi. Predicting cell morphological responses to perturbations using generative modeling. bioRxiv, pages 2023–07, 2023. [121] Valentina Giansanti, Francesca Giannese, Oronza A Botrugno, Giorgia Gandolfi, Chiara Balestrieri, Marco Antoniotti, Giovanni Tonon, and Davide Cittaro. Scalable integration of multiomic single cell data using generative adversarial networks. bioRxiv, pages 2023–06, 2023. [122] Julius B Kirkegaard. Spontanously breaking of symmetry in overlapping cell instance segmentation using diffusion models. bioRxiv, pages 2023–07, 2023. [123] Hongru Shen, Jilei Liu, Jiani Hu, Xilin Shen, Chao Zhang, Dan Wu, Mengyao Feng, Meng Yang, Yang Li, Yichen Yang, et al. Generative pretraining from large-scale transcriptomes for single-cell deciphering. Iscience, 26(5), 2023. [124] William Connell, Umair Khan, and Michael J Keiser. A single-cell gene expression language model. arXiv preprint arXiv:2210.14330, 2022. [125] Fan Yang, Wenchuan Wang, Fang Wang, Yuan Fang, Duyu Tang, Junzhou Huang, Hui Lu, and Jianhua Yao. scbert as a large-scale pretrained deep language model for cell type annotation of single-cell rna-seq data. Nature Machine Intelligence, 4(10):852–866, 2022. [126] Jiawei Chen, Hao Xu, Wanyu Tao, Zhaoxiong Chen, Yuxuan Zhao, and Jing-Dong J Han. Transformer for one stop interpretable cell type annotation. Nature Communications, 14(1):223, 2023. [127] Jing Xu, Aidi Zhang, Fang Liu, Liang Chen, and Xiujun Zhang. Ciform as a transformer- based model for cell-type annotation of large-scale single-cell rna-seq data. Briefings in Bioinformatics, page bbad195, 2023. 22 [128] Linfang Jiao, Gan Wang, Huanhuan Dai, Xue Li, Shuang Wang, and Tao Song. sctranssort: Transformers for intelligent annotation of cell types by gene embeddings. Biomolecules, 13(4):611, 2023. [129] Vladimir Yu Kiselev, Kristina Kirschner, Michael T Schaub, Tallulah Andrews, Andrew Yiu, Tamir Chandra, Kedar N Natarajan, Wolf Reik, Mauricio Barahona, Anthony R Green, et al. Sc3: consensus clustering of single-cell rna-seq data. Nature methods, 14(5):483–486, 2017. [130] Pengyu Zhang, Hongming Zhang, and Hao Wu. ipro-wael: a comprehensive and ro- bust framework for identifying promoters in multiple species. Nucleic Acids Research, 50(18):10278–10289, 2022. [131] Pengyu Zhang, Yingfu Wu, Haoru Zhou, Bing Zhou, Hongming Zhang, and Hao Wu. 
Clnn-loop: a deep learning model to predict ctcf-mediated chromatin loops in the different cell lines and ctcf-binding sites (cbs) pair types. Bioinformatics, 38(19):4497–4504, 2022.

[132] Pengyu Zhang and Hao Wu. Ichrom-deep: An attention-based deep learning model for identifying chromatin interactions. IEEE Journal of Biomedical and Health Informatics, 2023.

[133] Heather J Zhou, Lei Li, Yumei Li, Wei Li, and Jingyi Jessica Li. Pca outperforms popular hidden variable inference methods for molecular qtl mapping. Genome Biology, 23(1):1–17, 2022.

[134] Suoqin Jin, Christian F Guerrero-Juarez, Lihua Zhang, Ivan Chang, Raul Ramos, Chen-Hsiang Kuan, Peggy Myung, Maksim V Plikus, and Qing Nie. Inference and analysis of cell-cell communication using cellchat. Nature communications, 12(1):1088, 2021.

[135] Floyd Maseda, Zixuan Cang, and Qing Nie. Deepsc: a deep learning-based map connecting single-cell transcriptomics and spatial imaging data. Frontiers in Genetics, 12:636743, 2021.

[136] Yuhan Hao, Stephanie Hao, Erica Andersen-Nissen, William M Mauck, Shiwei Zheng, Andrew Butler, Maddie J Lee, Aaron J Wilk, Charlotte Darby, Michael Zager, et al. Integrated analysis of multimodal single-cell data. Cell, 184(13):3573–3587, 2021.

[137] Hannah A Pliner, Jay Shendure, and Cole Trapnell. Supervised classification enables rapid annotation of cell atlases. Nature methods, 16(10):983–986, 2019.

[138] Ze Zhang, Danni Luo, Xue Zhong, Jin Huk Choi, Yuanqing Ma, Stacy Wang, Elena Mahrt, Wei Guo, Eric W Stawiski, Zora Modrusan, et al. Scina: a semi-supervised subtyping algorithm of single cells and bulk samples. Genes, 10(7):531, 2019.

CHAPTER 2

BACKGROUND

2.1 Overview of Dimensionality Reduction

2.1.1 Principal Component Analysis

Principal component analysis (PCA) is one of the most commonly used dimensional reduction techniques for the exploratory analysis of high-dimensional data [1]. The first component, termed the principal component, is the direction that maximizes the variance, while the subsequent components are orthogonal to the earlier ones. Let X = {x_i}_{i=1}^N be the input dataset, with N being the number of samples or data points. For each x_i, let x_i ∈ R^M, where M is the number of features, or the data dimension. PCA seeks to find a linear combination of the columns of X with maximum variance,
\[ \sum_{j=1}^{n} a_j x_j = Xa, \tag{2.1} \]
where a_1, a_2, ..., a_n are constants and a is the vector collecting a_1, a_2, ..., a_n. The variance of this linear combination is defined as
\[ \mathrm{var}(Xa) = a^T S a, \tag{2.2} \]
where S is the covariance matrix of the dataset; in practice, the principal directions are obtained from the eigendecomposition of this covariance matrix. The maximum variance can be computed iteratively using Rayleigh's quotient,
\[ a^{(1)} = \arg\max_{a} \frac{a^T X^T X a}{a^T a}. \tag{2.3} \]
The subsequent components can be computed by maximizing the variance of
\[ \hat{X}_k = X - \sum_{j=1}^{k-1} X a_j a_j^T, \tag{2.4} \]
where k represents the kth principal component. Here, the first k − 1 principal components are subtracted from the original matrix X. Therefore, the complexity of the method scales linearly with the number of components one seeks to find. In applications, we hope that the first few components give rise to a good PCA representation of the original data matrix X.

2.1.2 Nonnegative matrix factorization

Nonnegative matrix factorization (NMF) decomposes the original data matrix into two submatrices that are both nonnegative. Unlike PCA, which imposes orthogonality on its components, NMF instead requires the original matrix and its factors to be nonnegative [2]. More formally, NMF solves the following optimization problem
\[ \min_{W,H} \|X - WH\|_F^2, \quad \text{s.t.}\ W, H \ge 0, \tag{2.5} \]
where \(\|A\|_F^2 = \sum_{i,j} a_{ij}^2\) is the Frobenius norm. Lee et al. [3] proposed a multiplicative updating scheme, which preserves nonnegativity. For the (t + 1)-th iteration,
\[ w_{ij}^{t+1} = w_{ij}^{t}\, \frac{(XH^T)_{ij}}{(WHH^T)_{ij}}, \tag{2.6} \]
\[ h_{ij}^{t+1} = h_{ij}^{t}\, \frac{(W^T X)_{ij}}{(W^T W H)_{ij}}. \tag{2.7} \]
Note that the objective is convex in W with H fixed and convex in H with W fixed, but it is not jointly convex in W and H. (A brief code sketch of these multiplicative updates is given at the end of Section 2.1.4.)

2.1.3 T-Distributed Stochastic Neighbor Embedding

The t-distributed stochastic neighbor embedding (t-SNE) is a nonlinear dimensional reduction algorithm that is well suited for reducing high-dimensional data into two- or three-dimensional space. There are two main stages in t-SNE. First, it constructs a probability distribution over pairs of data points such that nearby points are assigned a high probability, while distant points are given a low probability. Second, t-SNE defines a probability distribution in the embedded space that is similar to that in the original high-dimensional space, and aims to minimize the Kullback-Leibler (KL) divergence between them [4]. Let {x_1, x_2, ..., x_N | x_i ∈ R^M} be a high-dimensional input dataset. Our goal is to find an optimal low-dimensional representation {y_1, ..., y_N | y_i ∈ R^k}, such that k << M. The first step in t-SNE is to compute a pairwise distribution between x_i and x_j, denoted p_ij. First, we find the conditional probability of x_j given x_i:
\[ p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{m \neq i} \exp\left(-\|x_i - x_m\|^2 / 2\sigma_i^2\right)}, \quad i \neq j, \tag{2.8} \]
setting p_{i|i} = 0; the denominator normalizes the probability. Here, σ_i is a point-dependent bandwidth, chosen so that the conditional distribution matches a predefined hyperparameter called the perplexity; a smaller σ_i is used in denser regions of the dataset. Notice that the conditional probabilities are not symmetric in general, i.e., p_{i|j} ≠ p_{j|i}, which motivates the symmetrized pairwise probability
\[ p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}. \tag{2.9} \]
In the second step, we learn a k-dimensional embedding {y_1, ..., y_N | y_i ∈ R^k}. To this end, t-SNE calculates a similar probability distribution q_ij defined as
\[ q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{m} \sum_{l \neq m} \left(1 + \|y_m - y_l\|^2\right)^{-1}}, \quad i \neq j, \tag{2.10} \]
setting q_{ii} = 0. Finally, the low-dimensional embedding {y_1, ..., y_N | y_i ∈ R^k} is found by minimizing the KL divergence via a standard gradient descent method,
\[ KL(P \| Q) = \sum_{i,j} p_{ij} \log \frac{p_{ij}}{q_{ij}}, \tag{2.11} \]
where P and Q are the distributions of p_ij and q_ij, respectively. Note that the probability distributions in Equation 2.8 and Equation 2.10 can be replaced by many other delta sequence kernels of positive type [5].

2.1.4 Uniform Manifold Approximation and Projection

Uniform manifold approximation and projection (UMAP) is a nonlinear dimensional reduction method built on three assumptions: the data is uniformly distributed on a Riemannian manifold, the Riemannian metric is locally constant, and the manifold is locally connected. Unlike t-SNE, which relies on a probabilistic model, UMAP is a graph-based algorithm. The essential idea is to create a predefined k-dimensional weighted UMAP graph representation of the original high-dimensional data points, with the aim of minimizing the edge-wise cross-entropy between the weighted graph and the original data. Finally, the k-dimensional eigenvectors of the UMAP graph are used to represent each of the original data points. In this section, a computational view of UMAP is presented. For a more theoretical account, the reader is referred to Ref. [6]. Similar to t-SNE, UMAP considers the input data X = {x_1, x_2, ..., x_N}, x_i ∈ R^M, and looks for an optimal low-dimensional representation {y_1, ..., y_N | y_i ∈ R^k}, such that k < M.
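Before detailing the UMAP graph construction, it is worth pausing on the NMF updates of Eqs. (2.6)–(2.7), which are simple enough to sketch directly. The snippet below is a minimal NumPy illustration, not the implementation used later in this dissertation; the data layout (samples as columns), the number of iterations, and the small constant eps added to the denominators are illustrative assumptions.

```python
import numpy as np

def nmf_multiplicative(X, r, n_iter=200, eps=1e-10, seed=0):
    """Minimal NMF sketch following the multiplicative updates (2.6)-(2.7).

    X : (M, N) nonnegative data matrix, r : target rank.
    Returns W (M, r) and H (r, N) with X ~ W @ H.
    """
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.random((M, r))
    H = rng.random((r, N))
    for _ in range(n_iter):
        # Eq. (2.6): elementwise update of W; eps avoids division by zero.
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        # Eq. (2.7): elementwise update of H.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
    return W, H

# Usage: factor a random nonnegative matrix and check the reconstruction error.
X = np.abs(np.random.default_rng(1).normal(size=(50, 30)))
W, H = nmf_multiplicative(X, r=5)
print(np.linalg.norm(X - W @ H, "fro"))
```

Returning to UMAP, the construction proceeds in two stages.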
The first stage is the 26 construction of weighted k-neighbor graphs. Define a metric d : X × X → R+. Let k << M be a hyperparemeter, and compute the k-nearest neighbors of each xi under a given metric d. For each xi, let where σi is defined via ρi = min{d(xi, xj)|1 ≤ j ≤ k, d(xi, xj) > 0} k ∑︁ j=1 exp (cid:18) – max(0, d(xi, xj) – ρi) σi (cid:19) = log2 k. (2.12) (2.13) Such choice of ρi ensure at least one data point is connected to xi and having edge weight of 1, and set σi as a scaling parameter. Then, define a weighted directed graph ¯G = (V, E, ω), where V is the set of vertices (in this case, the data X), E is the set of edges E = {(xi, xj)|1 ≤ j ≤ k, 1 ≤ i ≤ N}, and ω is the weight for edges ω(xi, xj) = exp (cid:18) – max(0, d(xi, xj) – ρi) σi (cid:19) . (2.14) UMAP tries to define an undirected weighted graph G from directed graph ¯G via symmetrization. Let A be the adjacency matrix of the graph ¯G. A symmetric matrix can be obtained B = A + AT – A ⊗ AT , (2.15) where T is the transpose and ⊗ denotes the Hadamard product. Then, the undirected weighted Laplacian G (the UMAP graph) is defined by its adjacency matrix B. In its realization, UMAP evolves an equivalent weighted graph H with a set of points {yi}N i=1, utilizing attractive and repulsive forces. The attractive and repulsive forces at coordinate yi and yj are given by ω(xi, xj)(yi – yj), and –2ab∥yi – yj ∥2(b–1) 2 1 + ∥yi – yj ∥2 2 2b 2)(1 + a∥yi – yj ∥2b 2 ) (ε + ∥yi – yj ∥2 (1 – ω(xi, xj))(yi – yj) (2.16) (2.17) where a, b are hyperparemeters, and ε is taken to be a small value such that the denominator does i=1, yi ∈ Rk, that not become 0. The goal is to find the optimal low-dimensional coordinates {yi}N 27 minimizes the edge-wise cross entropy with the original data at each point. The evolution of the UMAP graph Laplacian G can be regarded as a discrete approximation of the Laplace-Beltrami operator on a manifold defined by the data [7]. Implementation and further detail of UMAP can be found in Ref. [6]. UMAP may not work well if the data points are non-uniform. If part of the data points have k important neighbors while other part of the data points have k′ >> k important neighbors, the k-dimensional UMAP will not work efficiently. Currently, there is no algorithm to automatically determine the critic minimal kmin for a given dataset. Additionally, weights ω(xi, xj) and force terms can be replaced by other functions that are easier to evaluate [5]. The metric d can be selected as Euclidean distance, Manhattan distance, Minkowski distance, and Chebyshev distance, depending on applications. 2.2 Clustering Algorithm 2.2.1 K-means Clustering K-means is the most commonly used unsupervised clustering method, where the aim is to partition {xj|xj ∈ Rm, 1 ≤ j ≤ N} into K clusters {C1, ..., Ck}, where K << N [8]. K-means begins by randomly selecting K points to be the cluster centers, called centroids. Here, centroids are denoted as µ1, ..., µk, ..., µK, and the centroid µk is the center of cluster Ck. Then, each sample is assigned to the nearest centroid, forming the initial clusters. Centroids are then updated by minimizing the within-cluster sum of squares (WCSS), which is defined as: which gives the updating scheme K ∑︁ ∑︁ k=1 xj∈Ck ∥xj – µk ∥2, µk = 1 |Ck| ∑︁ xj∈Ck xj. (2.18) This process is repeated until convergence or until the maximal number of iterations is reached. 28 2.2.2 K-medoids Clustering K-medoids clustering is similar to k-means clustering. 
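The K-means loop just described is short enough to sketch directly, which also makes the contrast with K-medoids easier to see. The snippet below is a minimal NumPy sketch, not the implementation used in this dissertation: it performs random initialization, nearest-centroid assignment, and the centroid update of Eq. (2.18); the tolerance tol is an illustrative stopping criterion rather than part of the algorithm's definition.

```python
import numpy as np

def kmeans(X, K, n_iter=100, tol=1e-6, seed=0):
    """Minimal K-means sketch: assign each sample to its nearest centroid,
    then update each centroid as the mean of its cluster (Eq. 2.18)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: distance of every sample to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: centroid = mean of the samples assigned to it
        # (keep the old centroid if a cluster becomes empty).
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids
```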
The main difference is that in k-means clustering, the centroid, or center of the cluster, is not necessarily a sample of the data, because the centroid is computed by averaging the samples in the cluster. K-medoids, on the other hand, chooses the medoid, or 'cluster center', to be a sample of the data [9]. For a pre-selected K, we begin by randomly selecting K medoids {m_k}_{k=1}^K and assigning each vector to its nearest medoid, which gives rise to the initial partition {C_k}_{k=1}^K. Second, we take the vector closest to the center of the kth partition C_k as the new medoid m_k ∈ C_k. We then reassign each vector to its nearest medoid, resulting in a new partition {C_k}_{k=1}^K that decreases the loss function, or accumulated distance. The process is repeated until {C_k}_{k=1}^K is optimized with respect to a specific distance definition,
\[ \arg\min_{\{C_1, \ldots, C_K\}} \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - m_k\|, \tag{2.19} \]
where ∥·∥ is some distance. Any distance can be used in k-medoids, such as the Euclidean, Manhattan, covariance, or correlation distance.

2.3 Evaluation metrics for clustering and classification

2.3.1 Adjusted Rand Index (ARI)

The Adjusted Rand Index (ARI) measures the agreement between two clustering results by considering all pairs of data points and comparing their assignments in the two results [10]. Let {T_i}_{i=1}^p and {S_j}_{j=1}^q be two partitionings of the data. Often, T_i are the clustering results and S_j are the true labels. Let n_ij = |T_i ∩ S_j| be the number of samples that belong to cluster label i and true label j, and define a_i = Σ_j n_ij and b_j = Σ_i n_ij. Then, the ARI is defined as
\[ \mathrm{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \Big/ \binom{N}{2}}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \Big/ \binom{N}{2}}. \tag{2.20} \]
The ARI takes on a value between -0.5 and 1, where 1 indicates a perfect match between the two clusterings and 0 corresponds to a completely random assignment of labels.

2.3.2 Normalized Mutual Information (NMI)

The normalized mutual information (NMI) measures the mutual information between two clustering results and normalizes it according to cluster size [11]. We fix the true labels Y as one of the clustering results and use the predicted labels C as the other to calculate NMI. The NMI is calculated as
\[ \mathrm{NMI} = \frac{2\, I(Y; C)}{H(Y) + H(C)}, \tag{2.21} \]
where H(·) is the entropy and I(Y; C) is the mutual information between the true labels Y and the predicted labels C. NMI ranges from 0 to 1, where 1 indicates perfect mutual information between the two sets of labels.

2.3.3 Accuracy (ACC)

Accuracy (ACC) calculates the percentage of correctly predicted class labels.
The accuracy is given by
\[ \mathrm{ACC} = \frac{1}{N} \sum_{i=1}^{N} \delta\left(y_i, f(c_i)\right), \tag{2.22} \]
where δ(a, b) is the indicator function; that is, δ(a, b) = 1 if a = b, and 0 otherwise. Here f : C → Y maps the cluster labels to the true labels, where the mapping is the optimal permutation between cluster labels and true labels obtained from the Hungarian algorithm [12].

2.3.4 Purity

Let {C_i}_{i=1}^p be the cluster labels and {Y_j}_{j=1}^q be the true labels. For the purity calculation, each predicted label C_i is assigned to the true label Y_j such that |C_i ∩ Y_j| is maximized [13]. Taking the average over all predicted labels, we obtain
\[ \mathrm{Purity} = \frac{1}{N} \sum_{i=1}^{p} \max_{j} |C_i \cap Y_j|. \tag{2.23} \]
Note that, unlike accuracy, purity does not map the predicted labels to the true labels.

2.3.5 Silhouette Score

The Silhouette score is a common metric used to determine the number of clusters for k-means clustering, and it assumes that the clusters are well separated [14]. Let x_i ∈ A ⊂ X be a sample in cluster A from the clustering algorithm. We define a(x_i) as the dissimilarity of sample x_i to all the other samples of A, that is,
\[ a(x_i) = \frac{1}{|A|} \sum_{x_j \in A} \|x_i - x_j\|, \tag{2.24} \]
where |A| is the number of samples in cluster A. Now, let C ≠ A be another cluster from the clustering algorithm. Then b(x_i) is the minimum average distance from x_i to the closest cluster that is not A, that is,
\[ b(x_i) = \min_{C \neq A} \frac{1}{|C|} \sum_{x_j \in C} \|x_i - x_j\|. \tag{2.25} \]
The score for sample x_i, denoted s(x_i), is then obtained as
\[ s(x_i) = \begin{cases} 1 - a(x_i)/b(x_i), & a(x_i) < b(x_i) \\ 0, & a(x_i) = b(x_i) \\ b(x_i)/a(x_i) - 1, & a(x_i) > b(x_i). \end{cases} \tag{2.26} \]
One can then define the Silhouette score S as the average of s(x_i) over all samples in the dataset. The Silhouette score is bounded between -1 and 1, where -1 indicates poor clustering, 0 indicates overlapping clusters, and 1 indicates well-separated clusters.

2.3.6 Balanced Accuracy (BA)

Accuracy is utilized to measure the quality of a classification algorithm. For data with unbalanced class sizes, standard accuracy does not give a meaningful result; therefore, balanced accuracy (BA) is used [15], which computes the accuracy of each class separately and reports the average accuracy over the classes. Let y_i and ŷ_i be the true and predicted labels of the i-th sample, respectively. Then, the balanced accuracy score is defined as
\[ \mathrm{BA} = \frac{1}{\sum_i \omega_i} \sum_i 1_{\{y_i = \hat{y}_i\}}\, \omega_i, \tag{2.27} \]
where ω_i is the sample weight, given as \( \omega_i = 1 \big/ \sum_j 1_{\{y_j = y_i\}} \), and 1_{*} is the indicator function. A brief code sketch of these metrics is given following Figure 2.1 below.

2.4 Topological Data Analysis

2.4.1 Simplex

Let {v_0, v_1, ..., v_q} be a set of points in R^n. An affine combination is defined as
\[ v = \sum_{i=0}^{q} \lambda_i v_i, \quad \text{such that} \quad \sum_{i=0}^{q} \lambda_i = 1. \tag{2.28} \]
The affine hull is the set of all affine combinations. The q + 1 points {v_i}_{i=0}^q are affinely independent if v_1 − v_0, ..., v_q − v_0 are linearly independent. A q-plane is well defined if its q + 1 points are affinely independent. For example, in R^n there are at most n linearly independent vectors; therefore, there are at most n + 1 affinely independent points. Additionally, an affine combination, as defined in Equation 2.28, is a convex combination if we add the additional condition λ_i ≥ 0 for all i. Furthermore, the convex hull is the set of all convex combinations. A q-simplex, denoted σ_q, is the convex hull of q + 1 affinely independent points, with dimension dim(σ_q) = q. σ_0 is a vertex, σ_1 is an edge, σ_2 is a triangle, σ_3 is a tetrahedron, and so on; an example is shown in Figure 2.1.

Figure 2.1 Illustration of (a) 0-simplex or vertex, (b) 1-simplex or edge, (c) 2-simplex or triangle, and (d) 3-simplex or tetrahedron.
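As referenced at the end of Section 2.3, most of the metrics above are available in standard libraries; only the clustering accuracy of Eq. (2.22) needs the Hungarian matching written out. The sketch below is a minimal illustration assuming scikit-learn and SciPy are installed; it is not the evaluation code used in later chapters, and the toy labels are purely illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_accuracy(true_labels, cluster_labels):
    """ACC of Eq. (2.22): map cluster labels to true labels with the
    Hungarian algorithm, then count correct assignments."""
    true_ids = np.unique(true_labels)
    cluster_ids = np.unique(cluster_labels)
    # Contingency table n_ij = |cluster i intersect true class j|.
    cont = np.array([[np.sum((cluster_labels == c) & (true_labels == t))
                      for t in true_ids] for c in cluster_ids])
    row, col = linear_sum_assignment(-cont)   # maximize matched counts
    return cont[row, col].sum() / len(true_labels)

def purity(true_labels, cluster_labels):
    """Purity of Eq. (2.23): each cluster votes for its majority class."""
    total = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        total += np.bincount(members).max()
    return total / len(true_labels)

# Example usage on toy labels (three classes, permuted cluster ids).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])
print(clustering_accuracy(y_true, y_pred), purity(y_true, y_pred))
print(adjusted_rand_score(y_true, y_pred),
      normalized_mutual_info_score(y_true, y_pred))
```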
2.4.2 Simplicial Complex Let σq = [v0, v1, ..., vq] be the q-simplex, where vi is a vertex. A simplicial complex K is a collection of simplicies in Rn satisfying the following conditions 1. If σq ∈ K and σp is a face of σq, then σp ∈ K 32 2. The nonempy intersection of any 2 simplicies σq, σp ∈ K is a face of both σq and σp In essence, one can think of K as gluing together lower order simplicies together. Each element σq ∈ K is a q-simplex of K, and the dimension of K is defined as dim(K) = max{dim(σq) : σq ∈ K}. Alternatively, for q dimensional K, one can define simplicial complex as K = {σq|σq = n ∑︁ i=0 λivi, λi ≥ 0, n ∑︁ 0 λi = 1} (2.29) 2.4.3 Chain complex A q-chain is a formal sum of q-simplicies in a simplicial complex K with coefficient Z2 ∈ {0, 1}. The set of all q-chains contains the basis for the set of q-simplicies in K. Such set forms a finitely generating free Abelian group Cq(K). We can relate the chain groups by a boundary operator, which is a group homomorphism ∂q : Cq(K) → Cq–1(K). This boundary operator is defined as q ∑︁ ∂qσq : (–1)iσi q–1 (2.30) i=0 where σi q = [v0, v1, ..., v∗ i , ..., vq] be a (q – 1)-simplex with the vertex vi removed. We can then define a chain complex as a sequence of chain groups connected by the boundary operator. ∂q+2 −−−−→ Cq+1(K) ∂q+1 −−−−→ Cq(K) ∂q −−→ ... ... (2.31) By utilizing the boundary operators, we can define the qth cycle group Zq and the qth boundary group Bq. Both Zq and Bq are subgroups of the qth chain group Cq, and are defined as Zq = Ker∂q = {c ∈ Cq|∂qc = ∅} Bq = Im∂k+1 = {c ∈ Cq|∃d ∈ Cq+1 : c = ∂q+1d} (2.32) (2.33) Additionally, ∂q–1∂q = implies that Bq ⊆ Zq ⊆ Cq. Moreover, the qth cycle is the q-dimensional hole. The, the q-homology group Hq is defined to be a quotient group of the q-cycle group modulo the q-boundary group, ie Hq = Zq/Bq 33 (2.34) The rank of the the Hq is the qth Betti number, denoted as βq. βq = rank(Hq) = rank(Zq) – rank(Bq) (2.35) The qth Betti number describes the qth dimensional hole. For example, β0 is the number of connected components, β1 is the number of loops, β2 is the number of cavity, and so on. Furthermore, the Betti numbers describe the topological property of the system. An example of Betti numbers of vertex, circle, sphere and torus can be found in Table 2.1 Table 2.1 Betti numbers of vertex, circle, sphere and torus. β0 β1 β2 1 0 0 1 1 0 1 0 1 1 2 1 2.4.4 Computational tools for constructing simplcial complex In this section, we describe Vietoris-Rips (VR) complex and alpha complex. 2.4.4.1 Vietoris-Rips complex VR complex is construncted from the 1-skeleton induced by pairwise distances of the point cloud. Let d(x, y) be dome distance, and ε be a threshold distance. VR complex of a finite point cloud X is given by VRε(X) = {σ ⊆ X|d(u, v) ≤ ε, ∀u ≠ v ∈ σ} (2.36) The Euclidean distance is often used, but other distance can also be used depending of the ap- plication. Additionally, because VR complex does not rely on the exact geometry, other abstract distance that do not satisfy the triangle ineqaulity can be used, such as cosine distance, correlation 34 distance and kernel induced distance. d(x, y) = 1 – x · y ∥x∥2∥y∥2 cov(x, y) σxσy d(x, y) = 1 – e–∥x–y∥2 2/τ d(x, y) = 1 – (2.37) Figure 2.2 shows a comparison of VR complex using the standard Eucliden distance and the kernel-induced distance. We can see that the kernel-induced distance has additional H1 barcode. 
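A comparison of this kind can be reproduced with the Gudhi library, which is also used later in this chapter for the barcodes. The sketch below builds Vietoris-Rips complexes from a Euclidean and a kernel-induced distance matrix following Eq. (2.37); the random point cloud and the kernel scale τ are illustrative choices, not data from this dissertation.

```python
import numpy as np
import gudhi
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
points = rng.normal(size=(30, 3))          # illustrative point cloud

# Euclidean pairwise distances.
d_euc = squareform(pdist(points))

# Kernel-induced distance d(x, y) = 1 - exp(-||x - y||^2 / tau), Eq. (2.37).
tau = 2.0                                  # illustrative kernel scale
d_ker = 1.0 - np.exp(-d_euc**2 / tau)

for name, dmat in [("Euclidean", d_euc), ("kernel-induced", d_ker)]:
    rips = gudhi.RipsComplex(distance_matrix=dmat,
                             max_edge_length=float(dmat.max()))
    st = rips.create_simplex_tree(max_dimension=2)   # up to H1
    barcode = st.persistence()
    n_loops = sum(1 for dim, _ in barcode if dim == 1)
    print(f"{name}: {n_loops} H1 bars")
```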
Such kernel-induced distance has been utilized in molecular biology, where atom-specific proper- ties, such as Van der Waal radius, electrostatic potential, etc. can be used to refine the distance. Such application has been shown to be effective at predicting B-factor and protein-ligand binding affinity [16, 17, 18]. 35 (a) Euclidean distance (b) Kernel distance Figure 2.2 Comparison of VR complex using (a) standard Euclidean distance and (b) kernel-induced distance. 2.4.4.2 Alpha complex Unlike VR complex, alpha complex is related to geometric modeling and domain partitioning. Let X be a finite point cloud in a Euclidean space. We can build a Voroni diagram, and let V(x) be a Voronoi cell associated with x ∈ X. Then, for a given ε, the alpha complex is given by Aε(X) = {σ ∈ X| (cid:217) x∈X (V(x) ∩ Bε(x)) ≠ ∅} (2.38) where Bε(x) is the ball of radius ε around x ∈ X. Figure 2.3 show a comparison between the alpha complex and VR complex. VR complex show a longer persistence of H0, meaning that the points do not connect for longer period. Both complex show a loop, but the loop is located in a different interval. 36 (a) VR complex (b) Alpha complex Figure 2.3 Comparison of VR complex and alpha complex on 5 points. generated from VR complex, and (b) shows the barcode generated from alpha complex. (a) shows the barcode 2.4.5 Persistent homology One downside of utilizing the simplicial complex is that it does not provide sufficient information to understand the geometry of the data because we can only capture the data in a single scale. To this end, we utilize simplicial complex induced by filtration, ∅ = K0 ⊆ K1 ⊆ ... ⊆ Kp = K (2.39) where P is the number of filtration. Using such filtration is the foundation of persistent homology, where the persistence is observed through long-lasting topological features. For each filtration p, we construct the simplicial complex, the chain group, the subgroups and the homology group. In 37 particular, the t-persistent q-th homology group Ki is Hi,t q = Zi q/ (cid:16) Bi+t q (cid:217) (cid:17) Zi q (2.40) Computing the Betti numbers give the persistence of q-dimensional holes. Figure 2.4 show an example of persistent barcode using benzene (C6H6). We extracted the carbon atoms from benzene, and constructed the simplicial complex via vietoris-rips complex. Then, filtration was applied to obtain the Betti numbers. The top figures show the filtration process of benzene at r = 0.75, 1.50 and 3.00, and was visualized using ChimeraX [19]. Gudhi[20] was used to compute the simplicial complex and the barcodes. (a) r = 0.75 (b) r = 1.50 (c) r = 3.00 Figure 2.4 Illustration of vietoris-rips complex of benzene (C6H6) as the radius increases to r = 3.0. (a) r = 0.75, (b) r = 1.50 and (c) r = 3.00 shows the filtration process, and the bottom figure shows the persistent barcode corresponding to the filtration. 2.4.6 Combinatorial Laplacian Lapalcian Combinatorial Laplacian gives insight from both spectral analysis and persistent homology [21, 22]. Recall that the chain complex of a simplicial complex is defined through a sequence of boundary operators, and that looking at the kernel and the image of the operators defined the cycle 38 group Zq and the boundary Bq. Then, the q-th homology group Hq = Zq/Bq (or Hq = ker∂q/Im∂q+1). Moreover, the dimensional of Hq gives the q-betti numbers. defined as Cq(K) (cid:27) C∗ We can now define the dual chain complex through the adjoint operator of ∂q. The dual space is q : Cq–1(K) → Cq(K). 
q is defined as ∂∗ For ωq–1 ∈ Cq–1(K) and cq ∈ Cq(K), the coboundary operator is defined as q(K), and the coboundary operator ∂∗ ∂∗ωq–1(cq) ≡ ωq–1(∂cq). (2.41) Here ωq–1 is a (q – 1) cochain, or a homomorphic mapping from a chain to the coefficient group. The homology of the dual chain complex is called the cohomology. We then define the q-combinatorial Laplacian operator △q : Cq(K) → Cq(K) △q := ∂q+1∂∗ q+1 + ∂∗ q∂q. (2.42) Let Bq be the standard basis for the matrix representation of q-boundary operator from Cq(K) and Cq–1(K), and BT q be th q-coboundary operator. The matrix representation of the q-th order Laplacian operator Lq is defined as Lq = Bq+1BT q+1 + BT q Bq. (2.43) The multiplicity of zero eigenvalue of Lq is the q-th Betti number of the simplicial complex. The nonzero eigenvalues (non-harmonic spectrum) contains other topological and geometrical features. As stated before, simplicial complex does not provide sufficient information to understand the geometry of the data. To this end, we utilize simplicial complex induced by filtration {∅} = K0 ⊆ K1 ⊆ · · · ⊆ Kp = K, (2.44) where p is the number of filtration. For each Kt 0 ≤ t ≤ p, denote Cq(Kt) as chain group induced by Kt, and the corresponding boundary operator ∂t q : Cq(Kt) → Cq–1(Kt), resulting in ∂t qσq = q ∑︁ i=1 (–1)iσi q–1, 39 (2.45) for σq ∈ Kt. The adjoint operator of ∂t q is similarity defined as ∂t∗ q : Cq–1(Kt) → Cq(Kt), which we regard as the mapping Cq–1(Kt) → Cq(Kt) via the isomorphism between cochain and chain groups. Through these 2 operators, we can define the chain complexes induced by Kt. 2.4.7 Persistent Laplacian Utilizing filtration with simplicial complex, we can define persistence Laplacian spectra. Let Ct+p q–1 be Ct+p q whose boundary is in Ct this set, we can define the p-persistent q-boundary operator denoted ˆ∂t,p q corresponding adjoint operator (ˆ∂t,p)∗ : Ct q , assuming an inclusion mapping Ct : Ct,p q–1. On q–1 and the q . Then, the q-order p-persistent Laplacian q–1 q → Ct → Ct+p → Ct,p q–1 operator is computed as and its matrix representation as q = ˆ∂t,p △t,p q+1(ˆ∂t,p q+1)∗ + (ˆ∂t q)∗ ˆ∂t q, q = Bt,p Lt,p q+1(Bt,p q+1)T + (Bt q)T Bt q. (2.46) (2.47) Likewise as before, the multiplicity of the zero-eigenvalue is the q-th order p-persistent Betti number βt,p q , which is the q-dimensional hole in Kt that persists in Kt+p. Moreover, the q-th order Laplacian is just a particular case of Lt,p q , where p = 0, which is a snapshot of the topology at the filtration step t [23, 24]. Figure 2.5 show an example of persistent barcode feature, harmonic and nonharmonic spectra of persistent Laplacian of fullerene (C60) molecule. (a), (b) and (c) correspond to the radii r = 0.75, r = 1.5 and r = 2.5, respectively, and was visualized using ChimeraX[19]. (d) shows the persistent barcode of C60 molecule, which was generated using Gudhi. (e) shows the harmonic (βr,p the nonharmonic (λr,p k ) and k ) spectra of persistent Laplacian, which was obtained via HERMES[23]. Vietrois-rips complex was used for both cases, and for the persistent Laplacian, p = 0 was used. As indicated by the blue bars in (d), the connected components (H0) dies at about r = 1.5, which is consistent with βr,0 0 , which is to be expected. Persistent Laplacians captures the Betti information of persistent homology. However, persistent Laplacian also captures the non-harmonic spectra, which 40 is indicated by the red line. Such feature can capture the homtopic shape of evolution throughout the filtration process. 
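As a concrete illustration of Eq. (2.43), the 0-th combinatorial Laplacian of a small complex can be assembled directly from its boundary matrix, and the multiplicity of its zero eigenvalues recovers β0. The NumPy sketch below uses a toy complex (a hollow triangle plus one isolated vertex) chosen purely for illustration; it is not tied to the HERMES or Gudhi computations reported in the figures.

```python
import numpy as np

# Boundary matrix B1 for a toy complex: vertices {0,1,2,3}, edges
# [0,1], [1,2], [0,2]; vertex 3 is isolated.  Each column is an edge and
# its entries follow Eq. (2.30), with signs (-1, +1) on the edge endpoints.
B1 = np.array([
    [-1,  0, -1],
    [ 1, -1,  0],
    [ 0,  1,  1],
    [ 0,  0,  0],
], dtype=float)

# Eq. (2.43) with q = 0 (there is no lower boundary term): L0 = B1 B1^T.
L0 = B1 @ B1.T
# With no 2-simplices, B2 is empty and L1 reduces to B1^T B1.
L1 = B1.T @ B1

def betti(L, tol=1e-10):
    """Betti number = multiplicity of the zero eigenvalue of Lq."""
    return int(np.sum(np.linalg.eigvalsh(L) < tol))

print("beta_0 =", betti(L0))   # 2 connected components
print("beta_1 =", betti(L1))   # 1 loop (the unfilled triangle)
```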
41 (a) r = 0.75 (b) r = 1.50 (c) r = 2.5 (d) Persistent Barcode (e) Persistent Laplacian Figure 2.5 Comparison of persistent barcode and persistent Laplacian utilizing fullerene C60 molecule. HERMES[23] package was used to generate the Betti numbers and to non-harmonic spectra using the vietoris-rips complex and p = 0. 42 2.5 Phylogenetic Analysis 2.5.1 k-mers method In DNA sequencing analysis, k-mers are words or motif of length k. For example, 1-mers consist of 4 types, adenine (A), cytosine (C), guanine (G) or thymine (T) (in the case of RNA sequencing, uracil (U)). 2-mers consists of 16 combinations, such as AA, GT, and GA. In general, for a given k, the number of subsequence of length k is 4k. In k-mers based methods, the frequency of each k-mers are computed. For example, the sequence GTAGAGCTGT contains 2A, 1C, 4G and 3T, which forms the frequency for 1-mers. Since the length of the sequence is 10, we can count up to k-mers of length 10. We will now formulate that case for 1-mers. Let S = s1s2...sN be a DNA sequence of length N, where si ∈ {A, C, G, T}. Define the indicator function for nucleotide l δl(si) = 1, si = l 0, otherwise    where 1 ≤ i ≤ N and l ∈ {A, C, G, T}. Then, the total number of nucleotide l is Nl = N ∑︁ i=1 δl(si). (2.48) (2.49) It is easy to see that the length of the sequence, assuming that all nucleotides are valid, satisfies N = NA + NC + NG. Now, for the general k-mers case, we have a total of N–k+1 k-mers given by (s1...sk)(s2...sk+1)...(sN–k+1...sN). Then, we label the 4k possible k-mers as l1, l2, ...lj, ..., l4k, and define the k-mers specific indicator function as δlj(si...si+k) = 1, (si...si+k) = lj 0, otherwise    Then, the total number of k-mers lj is Nlj = N–k+1 ∑︁ i=1 δlj(si...si+k) 43 (2.50) (2.51) Because k-mers has the computational efficiency of O(N), it can handle long sequences, includ- ing mammalian chromosomes. 2.5.2 Natural Vector Method The Natural Vector Method (NVM) was developed in 2011. It transforms the DNA sequence into a vector by capturing the moments of the position of the nucleotide [25]. This extends the traditional k-mers method by incorporating positional information of the k-mers. Let S = s1s2...sN be a DNA sequence of length N, where si ∈ {A, C, G, T}. Here, we consider uracil U as thymine T. Define a indicator function δl(si) as δl(si) = 1, si = l 0 otherwise    where l = A, C, G, T. Then, the natural vector for the sequence is defined as S = (NA, NC, NG, NT , µA, µC, µG, µT , DA 2 , DC 2 , DG 2 , DT 2 , ..., DA M, DC M, DG M, DT M) where N ∑︁ i=1 N ∑︁ Nl = µl = δl(si) δl(si) (2.52) i Nl i=1 N ∑︁ i=1 Dl m = (i – µl)m Nm–1 l Nm–1 δ(si) Here, m = 0, 1, 2, ..., M is the order of the moment, N = NA + NC + NG + NT . Nl and µl are the 0-order and 1-order, respectively. In essence, Nl is the total number of nucleotide l, and µl is the average position of nucleotide l. This results in a uniform feature vector of size 4M for each sequence. The natural vector was further improved via the implementation of k-mers specific natural vector [26]. Rather than looking at the nucleotide, the moments are calculated for each k-mers. For the sequence S = s1s2...sN, there are a total of N – k + 1 k-mers (s1...sk)(s2...sk+1)...(sN–k+1...sN). 
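Before extending the natural vector to k-mers, the nucleotide-level quantities defined above, namely the k-mer counts of Eq. (2.51) and the counts, mean positions, and central moments of Eq. (2.52), can be sketched in a few lines of Python. This is a minimal illustration rather than the feature-generation code used later; the toy sequence is the example from the text, and ambiguous characters are simply skipped as an assumed convention.

```python
from itertools import product
import numpy as np

def kmer_counts(seq, k):
    """Frequency of every length-k word over {A, C, G, T} (Eq. 2.51)."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:          # skips ambiguous characters such as 'N'
            counts[word] += 1
    return counts

def natural_vector(seq, M=2):
    """Nucleotide-level natural vector (N_l, mu_l, D_l^2, ..., D_l^M), Eq. (2.52)."""
    N = len(seq)
    features = []
    for l in "ACGT":
        pos = np.array([i + 1 for i, s in enumerate(seq) if s == l], dtype=float)
        n_l = len(pos)
        mu_l = pos.mean() if n_l > 0 else 0.0
        moments = [((pos - mu_l) ** m).sum() / (n_l ** (m - 1) * N ** (m - 1))
                   if n_l > 0 else 0.0
                   for m in range(2, M + 1)]
        features.extend([n_l, mu_l, *moments])
    return np.array(features)

seq = "GTAGAGCTGT"                     # toy sequence from the text
print(kmer_counts(seq, 1))             # {'A': 2, 'C': 1, 'G': 4, 'T': 3}
print(natural_vector(seq, M=2))
```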
44 Similarly, for a given k, the natural vector is defined as S = (Nl1, ..., Nl4k , µl1, ..., µl4k , Dl1 2 ..., D l4k M , ..., Dl1 M, ..., D l4k M ) where δlj(si...si+k) N–k+1 ∑︁ i=1 N–k+1 ∑︁ Nlj = µlj = lj m = D δlk(si...si+k) i Nlj (i – µlj)m Nm–1 Nm–1 lj i=1 N–k+1 ∑︁ i=1 δlj(si...si+k) (2.53) where δlj(si...si+j) is an indicator function for the k-mer lj. For a given k, there is a total of 4k combination of k-mers. Moreover, for a fixed k and M, the number of features will be 4kM. Then, combining all the k-mers together, we have a natural vector of length (cid:205)K k=1 4kM 2.5.3 Markov k-string model In this section, we discuss the Markov k-string model, which is the foundation for all alignment- free probabilistic models. Markov model aims to characterize evolution by evaluating the difference between observed probability of an appearance of a k-mer and the predicted appearance of a k-mer [27]. Let α1α2...αk be the nucleotide of a k-mer, and let f (α1α2...αk) denote its frequency. We can normalize the frequency by dividing it by the total number of k-mers, which will give the probability of a particular k-mers appearing. p(α1α2...αk) = f (α1α2...αk) N – k + 1 (2.54) Since the goal is to compare the observed and predicted probability, we compute the probability of (k – 1)-mers and (k – 2)-mer derived from the k-mer. The predicted probability of k-mers appearing can be computed using a Markov model: p0(α1α2...αk) = p(α1α2...αk–1)p(α2α3...αk) p(α2α3...αk–1) (2.55) 45 where p0 denote the predicted probability, and such prediction model has been utilized for both DNA and protein sequences [28, 29] Then, the evolution can be characterized by taking the normalized difference between the observe and predicted probabilities: a(α1α2...αk) = p(α1α2...αk)–p0(α1α2...αk) p0(α1α2...αk) 0 p0 ≠ 0 p0 = 0    (2.56) Likewise with the original k-mers and NVM methods, Markov models are also computationally efficient, and has been applied to both proteins and DNA sequences. 46 BIBLIOGRAPHY [1] Ian T Jolliffe and Jorge Cadima. Principal component analysis: a review and recent devel- opments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065):20150202, 2016. [2] Yu-Xiong Wang and Yu-Jin Zhang. Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on knowledge and data engineering, 25(6):1336–1353, 2012. [3] Daniel Lee and H Sebastian Seung. Algorithms for non-negative matrix factorization. Ad- vances in neural information processing systems, 13, 2000. [4] George C Linderman, Manas Rachh, Jeremy G Hoskins, Stefan Steinerberger, and Yuval Kluger. Fast interpolation-based t-sne for improved visualization of single-cell rna-seq data. Nature methods, 16(3):243–245, 2019. [5] GW Wei. Wavelets generated by using discrete singular convolution kernels. Journal of Physics A: Mathematical and General, 33(47):8577, 2000. [6] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018. [7] [8] Jiahui Chen, Rundong Zhao, Yiying Tong, and Guo-Wei Wei. Evolutionary de rham-hodge method. arXiv preprint arXiv:1912.12388, 2019. J MacQueen. Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Symposium on Math., Stat., and Prob, page 281, 1965. [9] Hae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for k-medoids clustering. Expert systems with applications, 36(2):3336–3341, 2009. 
[10] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of classification, 2:193– 218, 1985. [11] Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information theoretic measures for clusterings comparison: is a correction for chance necessary? In Proceedings of the 26th annual international conference on machine learning, pages 1073–1080, 2009. [12] David F Crouse. On implementing 2d rectangular assignment algorithms. IEEE Transactions on Aerospace and Electronic Systems, 52(4):1679–1696, 2016. [13] KVSN Rama Rao and B Manjula Josephine. Exploring the impact of optimal clusters on In 2018 3rd International Conference on Communication and Electronics cluster purity. Systems (ICCES), pages 754–757. IEEE, 2018. 47 [14] Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53–65, 1987. [15] Kay Henning Brodersen, Cheng Soon Ong, Klaas Enno Stephan, and Joachim M Buhmann. The balanced accuracy and its posterior distribution. In 2010 20th international conference on pattern recognition, pages 3121–3124. IEEE, 2010. [16] Zixuan Cang and Guo-Wei Wei. Analysis and prediction of protein folding energy changes upon mutation by element specific persistent homology. Bioinformatics, 33(22):3549–3557, 2017. [17] Zixuan Cang and Guo-Wei Wei. machine learning for protein-ligand binding affinity prediction. numerical methods in biomedical engineering, 34(2):e2914, 2018. Integration of element specific persistent homology and International journal for [18] David Bramer and Guo-Wei Wei. Atom-specific persistent homology and its application to protein flexibility analysis. Computational and mathematical biophysics, 8(1):1–35, 2020. [19] Thomas D Goddard, Conrad C Huang, Elaine C Meng, Eric F Pettersen, Gregory S Couch, John H Morris, and Thomas E Ferrin. Ucsf chimerax: Meeting modern challenges in visualization and analysis. Protein science, 27(1):14–25, 2018. [20] Clément Maria, Jean-Daniel Boissonnat, Marc Glisse, and Mariette Yvinec. The gudhi library: Simplicial complexes and persistent homology. In Mathematical Software–ICMS 2014: 4th International Congress, Seoul, South Korea, August 5-9, 2014. Proceedings 4, pages 167–174. Springer, 2014. [21] Beno Eckmann. Harmonische funktionen und randwertaufgaben in einem komplex. Com- mentarii Mathematici Helvetici, 17(1):240–255, 1944. [22] Daniel Hernández Serrano, Juan Hernández-Serrano, and Darío Sánchez Gómez. Simplicial degree in complex networks. applications of topological data analysis to network science. Chaos, Solitons and amp; Fractals, 137:109839, August 2020. [23] Rui Wang, Rundong Zhao, Emily Ribando-Gros, Jiahui Chen, Yiying Tong, and Guo-Wei Wei. Hermes: Persistent spectral graph software. Foundations of data science (Springfield, Mo.), 3(1):67, 2021. [24] Rui Wang, Duc Duy Nguyen, and Guo-Wei Wei. Persistent spectral graph. International journal for numerical methods in biomedical engineering, 36(9):e3376, 2020. [25] Mo Deng, Chenglong Yu, Qian Liang, Rong L He, and Stephen S-T Yau. A novel method of characterizing genetic sequences: genome space with biological distance and applications. PloS one, 6(3):e17293, 2011. 48 [26] Nan Sun, Shaojun Pei, Lily He, Changchuan Yin, Rong Lucy He, and Stephen S-T Yau. Geometric construction of viral genome space and its applications. Computational and Structural Biotechnology Journal, 19:4226–4234, 2021. [27] Ji Qi, Bin Wang, and Bai-Iin Hao. 
Whole proteome prokaryote phylogeny without sequence alignment: Ak-string composition approach. Journal of molecular evolution, 58:1–11, 2004. [28] Bruce Alberts, Dennis Bray, Julian Lewis, Martin Raff, Keith Roberts, James D Watson, et al. Molecular biology of the cell, volume 3. Garland New York, 1994. [29] Rui Hu and Bin Wang. Statistically significant strings are related to regulatory elements in the promoter regions of saccharomyces cerevisiae. Physica A: Statistical Mechanics and its Applications, 290(3-4):464–474, 2001. 49 CHAPTER 3 PHYLOGENETIC ANALYSIS 3.1 UMAP-assisted k-means clustering of large-scale SARS-CoV-2 mutation datasets 3.1.1 Introduction Beginning in December 2019, coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has become one of the most deadly global pandemics in history. The COVID-19 infections in the United States (US) and other nations are still spiking. As of January 20, 2021, the World Health Organization (WHO) has reported 93,217,287 confirmed cases of COVID-19 and 2,014,957 confirmed deaths. The virus has spread to Africa, Americas, Eastern Mediterranean, Europe, South-East Asia and Western Pacific [1]. To prevent further damage to our livelihood, we must control its spread through testing, social distancing, tracking the spread, and developing effective vaccines, drugs, diagnostics, and treatments. SARS-CoV-2 is a positive-sense single-strand RNA virus that belongs to the Nidovirales or- der, coronaviridae family and betacoronavirus genus [2]. To effectively track the virus, testing patients with suspected exposure to COVID-19 and sequencing the strand via PCR (polymerase chain reaction) are important. From sequencing, we can analyze patterns in mutation and predict transmission pathways. Without understanding such pathways, current efforts to find effective medicines and vaccines could become futile because mutations may change viral genome or lead to resistance. As of January 20, 2021, there are 203,344 available sequences with 26,844 unique sin- gle nucleotide polymorphisms (SNPs) with respect to the first SARS-CoV-2 sequence collected in December 2019 [3] according to our mutation tracker https://users.math.msu.edu/users/weig/SARS- CoV-2_Mutation_Tracker.html. A popular method for understanding mutational trends is to perform phylogenetic analysis, where one clusters mutations to find evolution patterns and transmission pathways. Phylogenetic analysis has been done on the Nidovirales family [4, 5, 6, 7, 4, 8] to understand genetic evolutionary pathways, protein level changes [9, 10, 11, 8], large scale variants [10, 9, 12, 13] and global trends 50 [14, 15, 16]. Commonly used techniques for phylogenetic analysis include tree based methods [17] and K-means clustering. Both methods belong to unsupervised machine learning techniques, where ground truth is unavailable. These approaches provide valuable information for exploratory research. A main issue with phylogenetic tree analysis is that as the number of samples increase, its computation becomes unpractical, making it unsuitable for large genome datasets. In contrast, K-means scales well with sample size increase, but does not perform well when the sample size is too small. Jaccard distance is commonly used to compare genome sequences [18] because it offers a phylogenetic or topological difference between samples. 
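For two SNP profiles represented as sets of mutation positions, the Jaccard distance used throughout this chapter amounts to one line of Python; the formal definition is given in Section 3.1.2.3. The sketch below is purely illustrative, and the two example profiles are hypothetical sets of genome positions, not records from the dataset analyzed here.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets of SNP positions: 1 - |A∩B| / |A∪B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Two hypothetical SNP profiles (genome positions of observed mutations).
profile_1 = {241, 3037, 14408, 23403}
profile_2 = {241, 3037, 23403, 28881}
print(jaccard_distance(profile_1, profile_2))   # 0.4
```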
However, the tradeoff to the Jaccard distance is that its feature dimension is the same as its number of samples, suggesting that for a large sample size, the number of features is also large. Since K-means clustering relies on computing the distance between the center of the cluster and each sample, having a large feature space can result in expensive computation, large memory requirement, and poor clustering performance. This become a significant problem as the number of SARS-CoV-2 genome isolates from patients has reached 200,000 at this point. There is a pressing need for efficient clustering methods for SARS-CoV-2 genome sequences. One technique to address this challenge is to perform dimensional reduction on the K-means input dataset so that the task becomes manageable. Commonly used dimension reduction algorithms focus on two aspects: 1) the pairwise distance structure of all the data samples and 2) preservation of the local distances over the global distance. Techniques such as principal component analysis (PCA) [19], Sammon mapping [20], and multidimensional scaling (MDS) [21] aim to preserve the pairwise distance structure of the dataset. In contrast, the t-distributed stochastic neighbor embedding (t-SNE) [22, 23], uniform manifold approximation and projection (UMAP) [24, 25], Laplacian eigenmaps [26], and LargeVis [27] focus on the preservation of local distances. Among them, PCA, t-SNE, and UMAP are the most frequently used algorithms in the applications of cell biology, bioinformatics, and visualization [25]. PCA is a popular method used in exploratory studies, aiming to find the directions of the maximum variance in high-dimensional data and projecting them onto a new subspace to obtain 51 low-dimensional feature spaces while preserving most of the variance. The principal components of the new subspace can be interpreted as the directions of the maximum variance, which makes the new feature axes orthogonal to each other. Although PCA is able to cover the maximum variance among features, it may lose some information if one chooses an inappropriate number of principal components. As a linear algorithm, PCA performs poorly on the features with nonlinear relationship. Therefore, in order to present high-dimensional data on low dimensional and non- linear manifold, some nonlinear dimensional reduction algorithms such as t-SNE and UMAP are employed. T-SNE is a nonlinear method that can preserve the local and global structures of data. There are two main steps in t-SNE. First, it finds a probability distribution of the high dimensional dataset, where similar data points are given higher probability. Second, it finds a similar probabil- ity distribution in the lower dimension space, and the difference between the two distributions is minimized. However, t-SNE computes pairwise conditional probabilities for each pair of samples and involves hyperparameters that are not always easy to tune, which makes it computationally complex. UMAP is a novel manifold learning technique that also captures a nonlinear structure, which is competitive with t-SNE for visualization quality and maintains more of the global structure with superior run-time performance [24]. UMAP is built upon the mathematical work of Belkin and Niyogi on Laplacian eigenmaps, aiming to address the importance of uniform data distributions on manifolds via Riemannian geometry and the metric realization of fuzzy simplicial sets by David Spivak [28]. 
Similar to t-SNE, UMAP can optimize the embedded low-dimensional representation with respect to fuzzy set cross-entropy loss function by using stochastic gradient descent. The embedding is found by finding a low-dimensional projection of the data that closely matches the fuzzy topological structure of the original space. The error between two topological spaces will be minimized by optimizing the spectral layout of data in a low dimensional space. The objective of this work is to explore efficient computational methods for the SARS-CoV-2 phylogenetic analysis of large volume of SARS-CoV-2 genome sequences. Specifically, we are interested in developing a dimension-reduction assisted clustering method. With the increase in available sequencing data, the SNP dataset of SARS-CoV-2 has run into large-data problem. By 52 effectively analyzing clusters, we can find evolutionary trends, which will aid in finding effective medicines and vaccines. To this end, we compare the effectiveness and accuracy of PCA, t-SNE and UMAP for dimension reduction in association with the K-means clustering. To quantitatively evaluate the performance, we recast supervised classification problems with labels into a K-means clustering problems so that the accuracy of K-means clustering can be evaluated. As a result, the accuracy and performance of PCA, t-SNE and UMAP-assisted K-means clustering can be compared. By choosing the different dimensional reduction ratios, we examine the performance of these methods in K-means settings on four standard datasets. We found that UMAP is the most efficient, robust, reliable, and accurate algorithm. Based on this finding, we applied the UMAP- assisted K-means technique to large scale SARS-CoV-2 datasets generated from a Jaccard distance representation and a SNP position-based representation to further analyze its effectiveness, both in terms of speed and scalability. Our results are compared with those in the literature [9] to shed new light on SARS-CoV-2 phylogenetics. 3.1.2 Methods 3.1.2.1 Sequence and alignment The SARS-CoV-2 sequences were obtained from GISAID databank (www.gisaid.com). Only complete genome sequences with collection date, high coverage, and without ’NNNNNN’ in the sequences were considered. Each sequence was aligned to the reference sequence [3] using a multiple sequence alignment (MSA) package Clustal Omega [29]. A total of 203,344 complete SARS-CoV-2 sequences are analyzed in this work. 3.1.2.2 SNP position based features Let N be the number of SNP profiles with respect to the SARS-CoV-2 reference genome sequence, and let M be the number of unique mutation sites. Denote Vi as the position based feature of the ith SNP profile. Vi = [v1 i , v2 i , ..., vM i ], i = 1, 2, ..., N (3.1) 53 is a 1 × M vector. Here vj i = We compile this into an N × M position based feature, 1, mutation site 0, otherwise.    S(i, j) = vj i (3.2) (3.3) where each row represents a sample. Note that S(i, j) is a binary representation of the position and is sparse. 3.1.2.3 Jaccard based representation The Jaccard distance measures the dissimilarity between two sets. It is widely used in the phylogenetic studies of SNP profiles. In this work, we utilize Jaccard distance to compare SNP profiles of SARS-CoV-2 genome isolates. Let A and B be two sets. 
Consider the Jaccard index between A and B, denoted J(A, B), as the cardinality of the intersection divided by the cardinality of the union J(A, B) = (cid:12)A ∩ B(cid:12) (cid:12) (cid:12) (cid:12)A ∪ B(cid:12) (cid:12) (cid:12) (cid:12) (cid:12)A ∩ B(cid:12) (cid:12) (cid:12)A ∩ B(cid:12) (cid:12) – (cid:12) (cid:12)B(cid:12) (cid:12) + (cid:12) (cid:12) (cid:12)A(cid:12) (cid:12) . = The Jaccard distance between the two sets is defined by subtracting the Jaccard index from 1: dJ(A, B) = 1 – J(A, B) = (cid:12)A ∪ B(cid:12) (cid:12) (cid:12) – (cid:12) (cid:12)A ∩ B(cid:12) (cid:12) (cid:12)A ∪ B(cid:12) (cid:12) (cid:12) (3.4) (3.5) We assume there are N SNP profiles or genome isolates that have been aligned to the reference SARS-CoV-2 genome. Let Si, i = 1, ..., N, be the set with the position of the mutation of the ith sample. The Jaccard distance between two sets Si and Sj is given by dJ(Si, Sj). Taking the pairwise distance between all the samples, we can construct the Jaccard based representation, resulting in an N × N distance matrix D D(i, j) = dJ(Si, Sj) (3.6) This distance defines a metric over the collections of all finite sets [30]. 54 3.1.3 Validation K-means clustering is one of the unsupervised learning algorithms, suggesting that neither the accuracy nor the root-mean-square error can be calculated to evaluate the performance of the K-means clustering explicitly. Additionally, K-means clustering can be problematic for high- dimensional large datasets. Dimension-reduced K-means clustering is an efficient approach. To evaluate its accuracy and performance, we convert supervised classification problems with known solutions into dimension-reduced K-means clustering problems. In doing so, we apply the K-means clustering to the classification dataset by setting the number of clusters equals to the number of the real categories. Next, in each cluster, we will take the data with the dominant label as the test for all samples and then calculate the K-means clustering accuracy for the whole dataset. 3.1.3.1 Validation data In this work, we will consider the following classification datasets to test the performance of the clustering methods: Coil 20, Facebook large page-page network, MNIST, and Jaccard distanced- based MNIST. Previous work has been done on datasets using Euclidean and Minkowski distance for lower dimensions[24, 22, 23]. Here, we verify the result with higher reduction ratios, and tested the validity of using Jaccard distance as a metric. • Coil 20: Coil 20 [31] is a dataset with 1440 gray scale images, consisting of 20 different objects, each with 72 orientation. Each image is of size 128 × 128, which was treated as a 16384 dimensional vector for dimensional reduction • Facebook Network: Facebook large page-page network [32] is a page-page webgraph of verified Facebook sites. Each node represents a facebook page, and the links are the mutual links between sites. This is a binary dataset with 22,470 nodes; hence the sample size and feature size are both 22,470. Jaccard distance was computed between each nodes for the feature space. • MNIST: MNIST [33] is a hand written digit dataset. Each image is a grey scale of size 28 × 28, which was treated as a 784 dimensional vector for the feature space, each with a 55 integer value in [0, 255]. Standard normalization was used before performing dimensional reduction. There are 70,000 sample, with 10 different labels. • Jaccard distanced-based MNIST: The above dataset was converted to a Jaccard distance- based dataset. 
This is to simulate position based mutational dataset, where 1 indicates a mutation in a particular position. Jaccard distance was used to construct the feature space, hence for each sample, the feature size is 70,000. This dataset can be viewed as an additional validation on our Jaccard distance representation. 3.1.3.2 Validation results In the present work, we implement three popular dimensional reduction methods, PCA, UMAP, and t-SNE, for the dimension reduction and compare their performance in K-means clustering. For a uniform comparison, we reduce the dimensions of the samples by a set of ratios. The minimum between the number of features and the number of samples was taken as base of the reduction. For the Coil 20 dataset, since the numbers of samples and features were 1440 and 16384, respectively, dimension-reductions were based on 1440. For the Facebook Network, since the numbers of samples and features were both 22,470, dimension-reductions were based on 22,470. For the MNIST dataset, since the numbers of samples and features were respectively 70,000 and 784, dimension-reductions were based on 784. Finally, for the Jaccard distanced-based MNIST dataset, since the numbers of samples and features were both 70,000, dimension-reductions were based on 70,000. Note that for the Jaccard distanced-based MNIST data, more aggressive ratios were used because the original feature size is huge, i.e., 70,000. The standard ratios of 2, 4, and 8, etc do not sufficiently reduce the dimension for effective K-means computation. For the purpose of visualization, two-dimensional reduction algorithms are applied to each reduction scheme. In order to validate PCA, UMAP, and t-SNE assisted K-means clustering, we observed their performance using labeled datasets. K-nearest neighbors (K-NN) was used to find the baseline of the reduction, which reveals how much information can be preserved in the feature after applying a dimensional reduction algorithm. For k-NN, 10 fold cross-validation was performed. 56 Notably, K-means clustering is an unsupervised learning algorithm, which does not have labels to evaluate the clustering performance explicitly. However, we can assess the K-means clustering accuracy via labeled datasets that has ground truth. In doing so, we choose the number of K as the original number of classes. The detail of computing accuracy of k-means clustering can be found in section 2.3. 3.1.3.3 Coil 20 Figure 3.1 Comparison of different dimensional reduction algorithms on Coil 20 dataset. Total 20 different labels are in the Coil 20 dataset, and we use the ground truth label to color each data points. (a) Feature size is reduced to dimension 2 by PCA. (b) Feature size is reduced to dimension 2 by t-SNE. (c) Feature size is reduced to dimension 2 by UMAP. Figure 3.1 shows the performance of PCA-assisted, UMAP-assisted and t-SNE-assisted clus- tering of the Coil 20 dataset. For each case, the dataset were reduced to dimension 2 using default parameters, and the plots were colored with the ground truth of the Coil 20 dataset. It can be seen that PCA does not present good clustering, whereas UMAP and t-SNE show very good clusters. 57 Table 3.1 Accuracy of k-NN of the Coil 20 dataset without applying any reduction algorithms, as well as the accuracy of k-NN assisted by PCA, UMAP and t-SNE with different dimensional reduction ratio. The sample size, feature size, and the number of labels of the Coil 20 dataset are 1440, 16384, and 20, respectively. 
Dataset: Coil 20 (1440, 16384, 20); k-NN accuracy w/o reduction: 0.956

Reduced dimension   PCA accuracy   UMAP accuracy   t-SNE accuracy
720 (1/2)           0.955          0.668           0.850
360 (1/4)           0.957          0.861           0.889
180 (1/8)           0.973          0.867           0.881
90 (1/16)           0.977          0.860           0.885
45 (1/32)           0.980          0.861           0.875
22 (1/64)           0.985          0.868           0.743
14 (1/100)          0.730          0.851           0.878
7 (1/200)           0.985          0.870           0.845
3                   0.850          0.863           0.959
2                   0.730          0.853           0.948

Table 3.1 shows the accuracy of k-NN clustering of the Coil 20 dataset assisted by PCA, t-SNE, and UMAP with different dimensional reduction ratios. The Coil 20 dataset has 1,440 samples, 16,384 features, and 20 different labels. For PCA, the sklearn implementation in Python was used with standard parameters. Note that for all methods, dimensions were also reduced to 3 and 2 for comparison. For t-SNE, Multicore-TSNE [34] was used because it supports multi-core execution (up to 8 cores), which is not available in the sklearn implementation, and it is the fastest-performing t-SNE implementation. For UMAP, we used standard parameters [24]. It can be seen that when we reduce the dimension to 3, t-SNE performs best. Moreover, when the dimensional reduction ratio is 1/100, PCA and UMAP also perform well. Notably, the k-NN accuracy for the data without applying any dimensional reduction algorithm is 0.956, indicating that UMAP does not provide the best clustering performance on the Coil 20 dataset. However, PCA and t-SNE preserve the information of the original data at dimensional reduction ratios larger than 1/100, and t-SNE even performs better at dimension three on the Coil 20 dataset.

Table 3.2 Accuracy of K-means clustering of the Coil 20 dataset without applying any reduction algorithm, as well as the accuracy of K-means assisted by PCA, UMAP, and t-SNE with different dimensional reduction ratios. The sample size, feature size, and the number of labels of the Coil 20 dataset are 1440, 16384, and 20, respectively.

Dataset: Coil 20 (1440, 16384, 20); K-means accuracy w/o reduction: 0.626

Reduced dimension   PCA accuracy   UMAP accuracy   t-SNE accuracy
720 (1/2)           0.64           0.301           0.798
360 (1/4)           0.678          0.800           0.718
180 (1/8)           0.633          0.822           0.648
90 (1/16)           0.642          0.799           0.681
45 (1/32)           0.666          0.800           0.615
22 (1/64)           0.673          0.819           0.151
14 (1/100)          0.631          0.817           0.154
7 (1/200)           0.591          0.819           0.360
3                   0.561          0.800           0.780
2                   0.537          0.801           0.828

Table 3.2 describes the accuracy of K-means clustering of Coil 20 assisted by PCA, UMAP, and t-SNE with different dimensional reduction ratios. For consistency, we use the same set of standard parameters as for k-NN. For the Coil 20 dataset, the accuracy of K-means clustering assisted by UMAP has the best overall performance. When the reduced dimension is 180 (ratio 1/8), UMAP results in a relatively high K-means accuracy (0.822). Moreover, although PCA performs best on k-NN accuracy, it performs poorly on K-means accuracy, indicating that PCA is not a suitable dimensionality reduction algorithm for the Coil 20 dataset. Furthermore, the highest accuracy of K-means clustering is 0.828, which is obtained from the t-SNE-assisted algorithm. However, the t-SNE-assisted accuracy changes dramatically under different reduction ratios. When the ratio is 1/64, the t-SNE-assisted accuracy is only 0.151, indicating that t-SNE is sensitive to the hyper-parameter settings. In contrast, the performance of UMAP is highly stable under all dimension-reduction ratios. Note that dimension-reduced K-means clustering methods outperform the original K-means clustering.
Therefore, the proposed dimension-reduced k-means clustering methods not only improve the k-means clustering efficiency, but also achieve better accuracy. 59 3.1.3.4 Facebook Network Figure 3.2 Comparison of different dimensional reduction algorithms on the Facebook Network dataset. Total 4 different labels are in the Facebook Network dataset, and we use the ground truth label to color each data points. (a) Feature size is reduced to dimension 2 by PCA. (b) Feature size is reduced to dimension 2 by t-SNE. (c) Feature size is reduced to dimension 2 by UMAP. Figure 3.2 shows the visualization performance of PCA-assisted, UMAP-assisted, and t-SNE- assisted clustering of the Facebook Network. For each case, the dataset was reduced to dimension 2 using default parameters, and the plots were colored with the ground truth of the Facebook Network. Figure 3.2 shows that the PCA-based data is located distributively, while the t-SNE- and UMAP-based data show clusters. Table 3.3 shows the accuracy of k-NN clustering of the Facebook Network assisted by PCA, t-SNE, and UMAP with different dimensional reduction radio. The Facebook Network dataset has 22,470 samples with 4 different labels, and the feature size of the Facebook Network is also 22,470. For each algorithm, we use the same settings as the Coil 20 dataset. Without applying any dimensional reduction method, The Facebook Network has 0.755 k-NN accuracy. The reduced feature from PCA has the best k-NN performance when the reduction ratio is 1/2. UMAP has a better performance compared to PCA and t-SNE when the reduction ratio is smaller than 1/16. 60 Table 3.3 Accuracy of k-NN of the Facebook Network without applying any reduction algorithms, as well as the accuracy of k-NN assisted by PCA, UMAP and t-SNE with different dimensional reduction ratio. The sample size, feature size, and the number of labels of the Facebook Network are 22470, 22470, and 4, respectively. Dataset k-NN accuracy w/o reduction Facebook Network (22470, 22470, 4) 0.755 Reduced dimension 11235 (1/2) 5617 (1/4) 2808 (1/8) 1404 (1/16) 702 (1/32) 351 (1/64) 224 (1/100) 112 (1/200) 44 (1/500) 22 (1/1000) 3 2 PCA accuracy 0.756 0.755 0.754 0.751 0.751 0.746 0.733 0.721 0.714 0.690 0.552 0.501 UMAP accuracy 0.360 0.669 0.754 0.816 0.814 0.815 0.814 0.819 0.816 0.815 0.801 0.786 t-SNE accuracy 0.307 0.316 0.355 0.707 0.669 0.690 0.676 0.633 0.709 0.643 0.741 0.732 Table 3.4 describes the accuracy of K-means clustering of the Facebook Network assisted by PCA, UMAP and t-SNE with different dimensional reduction ratio. PCA, UMAP, and t-SNE all have very poor performance, which may be caused by the smaller number of labels. The highest accuracy 0.427 is observed in the t-SNE-assistant algorithm with dimension 2. 61 Table 3.4 Accuracy of K-means clustering of the Facebook Network without applying any reduction algorithms, as well as the accuracy of K-means assisted by PCA, UMAP and t-SNE with different dimensional reduction ratio. The sample size, feature size, and the number of labels of the Facebook Network are 22470, 22470, and 4, respectively. 
Dataset k-NN accuracy w/o reduction Facebook Network (22470, 22470, 4) 0.374 Reduced dimension 11235 (1/2) 5617 (1/4) 2808 (1/8) 1404 (1/16) 702 (1/32) 351 (1/64) 224 (1/100) 112 (1/200) 44 (1/500) 22 (1/1000) 3 2 PCA accuracy 0.331 0.331 0.331 0.331 0.331 0.331 0.331 0.331 0.331 0.331 0.332 0.358 UMAP accuracy 0.306 0.307 0.411 0.397 0.401 0.400 0.400 0.400 0.400 0.401 0.351 0.345 t-SNE accuracy 0.306 0.299 0.314 0.313 0.306 0.308 0.327 0.306 0.313 0.306 0.344 0.427 Similar to the last case, UMAP-based and t-SNE-based dimension-reduced k-means clustering methods outperform the original k-means clustering with the full feature dimension. Therefore, it is useful to carry out dimension reduction before k-means clustering for large datasets. 3.1.3.5 MNIST Figure 3.3 Comparison of different dimensional reduction algorithms on the MNIST dataset. Total 10 different labels are in the MNIST dataset, and we use the ground truth label to color each data points. (a) Feature size is reduced to dimension 2 by PCA. (b) Feature size is reduced to dimension 2 by t-SNE. (c) Feature size is reduced to dimension 2 by UMAP. 62 Table 3.5 Accuracy of k-NN of the MNIST dataset without applying any reduction algorithms, as well as the accuracy of k-NN assisted by PCA, UMAP and t-SNE with different dimensional reduction ratio. The sample size, feature size, and the number of labels of the MNIST dataset are 70000, 784, and 10, respectively. Dataset k-NN accuracy w/o reduction MNIST (70000, 784, 10) 0.948 Reduced dimension 392 (1/2) 196 (1/4) 98 (1/8) 49 (1/16) 24 (1/32) 12 (1/64) 7 (1/100) 3 2 PCA accuracy 0.951 0.956 0.960 0.961 0.953 0.926 0.846 0.513 0.323 UMAP accuracy 0.937 0.938 0.937 0.937 0.937 0.937 0.936 0.929 0.919 t-SNE accuracy 0.696 0.846 0.893 0.886 0.842 0.676 0.940 0.938 0.928 Figure 3.3 shows the performance of PCA-assisted, UMAP-assisted and t-SNE-assisted cluster- ing of the MNIST dataset. The sample size of the MNIST dataset is 70000, which has 784 features with 10 different digit labels. For each case, the dataset was reduced to dimension 2 using default parameters, and the plots were colored with the ground truth of the MNIST dataset. In Figure 3.3, by applying the UMAP algorithm, the clear clusters can be detected for the MNIST dataset. The t-SNE offers a reasonable clustering at dimension 2 too. However, the PCA does not provide a good clustering. Table 3.6 Accuracy of K-means clustering of the MNIST dataset without applying any reduction algorithms, as well as the accuracy of K-means assisted by PCA, UMAP and t-SNE with different dimensional reduction ratio. The sample size, feature size, and the number of labels of the MNIST dataset are 70000, 784, and 10, respectively. Dataset k-NN accuracy w/o reduction MNIST (70000, 784, 10) 0.494 Reduced dimension 392 (1/2) 196 (1/4) 98 (1/8) 49 (1/16) 24 (1/32) 12 (1/64) 7 (1/100) 3 2 PCA accuracy 0.487 0.492 0.498 0.496 0.501 0.489 0.464 0.365 0.300 UMAP accuracy 0.665 0.667 0.673 0.718 0.697 0.682 0.677 0.727 0.712 t-SNE accuracy 0.122 0.113 0.113 0.113 0.114 0.138 0.740 0.537 0.593 63 Table 3.5 shows the accuracy of k-NN clustering of the MNIST dataset assisted by PCA, t- SNE, and UMAP with different dimensional reduction radios. For each algorithm, we use the same settings as the Coil 20 dataset. Without applying any dimensional reduction algorithms, the accuracy of k-NN is 0.948. 
By applying PCA/UMAP with the reduction ratio greater than 1/64, the accuracy of PCA/UMAP-assisted k-NN is at the same level without using any dimensional reduction algorithm. However, in contract with UMAP and t-SNE, when the reduced dimension is 2 or 3, PCA performs poorly. This indicates that the PCA may not be suitable for dimension-reduction for datasets with a large sample size. Table 3.6 describes the accuracy of K-means clustering of the MNIST dataset assisted by PCA, UMAP, and t-SNE with different dimensional reduction ratios. By applying PCA, the accuracy of K-means is around 0.45. The t-SNE method performance is quite unstable, from very poor (0113) to the best (0.740), and to a relatively low value of 0.593. In contrast, we can see a stable and improved accuracy from using UMAP at various reduction ratios, indicating that the reduced feature generated by UMAP can better represent the clustering properties of the MNIST dataset compared to the PCA and t-SNE. As observed early, the present UMAP and t-SNE-assisted k-means clustering methods also significantly out-perform the original k-means clustering for this dataset. 3.1.3.6 Jaccard distanced-based MNIST Figure 3.4 Comparison of different dimensional reduction algorithms on the Jaccard distanced- based MNIST dataset. Total 10 different labels are in the Jaccard distanced-based MNIST dataset, and we use the ground truth label to color each data points. (a) Feature size is reduced to dimension 2 by PCA. (b) Feature size is reduced to dimension 2 by t-SNE. (c) Feature size is reduced to dimension 2 by UMAP. 64 Our last validation dataset is Jaccard distanced-based MNIST. This dataset can be treated as a test on the Jaccard distance-based data representation. Figure 3.4 shows the performance of PCA- assisted, UMAP-assisted, and t-SNE-assisted clustering of the Jaccard distanced-based MNIST dataset. The dataset was reduced to dimension 2 using default parameters for visualization, and the plots were colored with the ground truth of the Jaccard distanced-based MNIST dataset. From Figure 3.4, we can see that UMAP provides the clearest clusters compared to PCA and t-SNE when the dimension is reduced to 2. The performance of t-SNE is reasonable while PCA does not give a good clustering. Table 3.7 Accuracy of k-NN of the Jaccard distanced-based MNIST dataset without applying any reduction algorithms, as well as the accuracy of k-NN assisted by PCA, UMAP and t-SNE with different dimensional reduction ratio. The sample size, feature size, and the number of labels of the Jaccard distanced-based MNIST dataset are 70000, 70000, and 10, respectively. Dataset k-NN accuracy w/o reduction Jaccard distanced based MNIST (70000, 70000, 10) 0.958 Reduced dimension 7000 (1/10) 3500 (1/20) 1750 (1/40) 875 (1/80) 437 (1/160) 218 (1/320) 109 (1/640) 70 (1/1000) 35 (1/2000) 17 (1/5000) 7 (1/10000) 3 2 PCA UMAP accuracy accuracy 0.958 0.958 0.958 0.958 0.958 0.958 0.958 0.958 0.956 0.938 0.867 0.487 0.313 0.958 0.966 0.967 0.967 0.968 0.968 0.968 0.968 0.968 0.968 0.967 0.965 0.960 t-SNE accuracy 0.588 0.601 0.725 0.613 0.718 0.701 0.873 0.915 0.872 0.916 0.942 0.939 0.924 65 Table 3.8 Accuracy of K-means clustering of the Jaccard distanced-based MNIST dataset without applying any reduction algorithms, as well as the accuracy of K-means assisted by PCA, UMAP and t-SNE with different dimensional reduction ratio. The sample size, feature size, and the number of labels of the Jaccard distanced-based MNIST dataset are 70000, 70000, and 10, respectively. 
Dataset k-NN accuracy w/o reduction Jaccard distanced based MNIST (70000, 70000, 10) 0.555 Reduced dimension 7000 (1/10) 3500 (1/20) 1750 (1/40) 875 (1/80) 437 (1/160) 218 (1/320) 109 (1/640) 70 (1/1000) 35 (1/2000) 17 (1/5000) 7 (1/10000) 3 2 PCA UMAP accuracy accuracy 0.436 0.436 0.436 0.435 0.435 0.435 0.435 0.436 0.435 0.436 0.431 0.364 0.261 0.329 0.693 0.792 0.793 0.793 0.793 0.794 0.793 0.794 0.793 0.793 0.798 0.791 t-SNE accuracy 0.119 0.120 0.112 0.112 0.114 0.156 0.114 0.113 0.116 0.113 0.737 0.635 0.635 Table 3.7 shows the accuracy of k-NN clustering of Jaccard distanced-based MNIST assisted by PCA, t-SNE, and UMAP with different dimensional reduction radios. For each algorithm, we use the same settings as the Coil 20 dataset. Notably, the k-NN accuracy for the data without applying any dimensional reduction algorithm is 0.958, which is at the same level as the PCA algorithm with a reduction ratio greater than 1/5000. Moreover, we can find that UMAP performs well compared to PCA and t-SNE, indicating that after applying UMAP, the reduced feature still preserves most of the valued information of the Jaccard distanced-based MNIST dataset. The stability and persistence of UMAP at various reduction ratios are the most important features. Table 3.8 describes the accuracy of K-means clustering of the Jaccard distanced-based MNIST dataset assisted by PCA, UMAP, and t-SNE with different dimensional reduction ratio. For consistency, we will use the same standard parameters as k-NN. Similar to the MNIST dataset, the accuracy of K-means clustering assisted by UMAP still has the best performance. When the reduced dimension is 3, UMAP will result in the highest K-means accuracy 0.798. Noticeably, although PCA performs well on k-NN accuracy, it has the lowest K-mean accuracy, indicating that 66 PCA is not a suitable dimensional reduction algorithm, especially for those datasets with a large number of samples. To be noted, the t-SNE accuracy at four reduced dimensions are not available due to the extremely long running time. In a nutshell, PCA, UMAP, and t-SNE can all perform well for k-NN. However, for the Coil 20 dataset, UMAP performs slightly poorly, whereas the t-SNE performs well, which may be caused by a lack of data size. In order to train UMAP, it needs a suitable data size. The Coil 20 dataset has 20 labels, each with only 72 samples. This may not be enough to train UMAP properly. However, even in this case, UMAP performance is still very stable at various reduction ratios and is the best method in terms of reliability, which become the major advantages of UMAP. Another strength of UMAP comes from its dimension-reduction for K-means clustering. In most cases, UMAP can improve K-means clustering accuracy, especially for the Jaccard distanced- based MNIST dataset. Furthermore, UMAP can generate a very clear and elegant visualization of clusters with low dimensional reduction value such as 2. Additionally, UMAP performed better than PCA and t-SNE for a larger dataset (MNIST and Jaccard distanced-based MNIST). Especially for the Jaccard distanced-based MNIST data, where Jaccard distance was used as the metric, UMAP performed best, which indicates the merit of using UMAP for Jaccard distanced-based datasets, such as COVID-19 SNP datasets. 
Furthermore, the accuracies of k-NN classification and K-means clustering are both improved on the Jaccard distance-based MNIST dataset compared to the original MNIST dataset, which provides convincing evidence that the Jaccard distance representation will help improve the performance of the clustering on the SARS-CoV-2 mutation dataset in the following sections.

3.1.3.7 Efficiency comparison

It is important to understand the computational time behavior of the various methods. To this end, we compare the computational time of the three dimension-reduction techniques. Figure 3.5 depicts the computational time of the three methods for the four datasets under various reduction ratios. The green, orange, and blue lines represent the computational time of t-SNE, UMAP, and PCA, respectively. Some points in the green line of Figure 3.5 (d) are not available, which is due to the extremely long running time. PCA performed best in most cases, except for the Coil 20 dataset, where UMAP had comparable computational time. This behavior is expected because PCA is a linear transformation, and its time should scale linearly with the number of components in the lower-dimensional space. UMAP and t-SNE were slower than PCA, but it is evident from the MNIST and Jaccard distance-based MNIST datasets that UMAP scales better with the increase in the number of samples. Note that for Jaccard distance-based MNIST, the higher dimensions were not computed because the computational time was too long. For the Facebook Network, UMAP outperforms t-SNE; however, for higher dimensions, t-SNE computed faster. Nonetheless, from our baseline test in Table 3.3, t-SNE does not perform well, indicating instability. Faster computation may indicate premature convergence, which leads to a poor embedding.

Figure 3.5 Computational time at each reduction ratio for (a) Coil 20, (b) Facebook Network, (c) MNIST, and (d) Jaccard distance-based MNIST. The green, orange, and blue lines represent the computational time of t-SNE, UMAP, and PCA, respectively. Not surprisingly, PCA performs the best in the majority of cases, except for the Coil 20 dataset. UMAP and t-SNE perform worse than PCA, but UMAP scales better when there are more samples, as evident from the MNIST and Jaccard distance-based MNIST datasets. Note that for Jaccard distance-based MNIST, the higher dimensions were not computed because the computational time was too long.

3.1.4 SARS-CoV-2 mutation clustering

3.1.4.1 World SARS-CoV-2 mutation clustering

We gathered data submitted to GISAID up to January 20, 2021, for a total of 203,344 samples. We first obtained the SNP information by applying multiple sequence alignment, which leads to 26,844 unique SNPs. Next, we calculated the pairwise Jaccard distance of our dataset in order to generate the Jaccard distance-based features. Here, the number of rows is the number of samples (203,344), and the number of columns is the feature size (203,344). As we mentioned in Section 3.1.2.3, the Jaccard distance-based feature is a square matrix.
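The construction of this Jaccard distance-based feature matrix from the SNP position sets follows Eqs. (3.4)-(3.6) directly. A minimal Python sketch of this step is given below; the function names and the toy SNP profiles are illustrative only and are not the script used to produce the results reported in this section.

import numpy as np

def jaccard_distance(a: set, b: set) -> float:
    # d_J(A, B) = 1 - |A intersect B| / |A union B|  (Eqs. 3.4-3.5)
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty SNP profiles are treated as identical here
    return 1.0 - len(a & b) / union

def jaccard_distance_matrix(profiles):
    # Pairwise N x N distance matrix D(i, j) = d_J(S_i, S_j)  (Eq. 3.6)
    n = len(profiles)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = jaccard_distance(profiles[i], profiles[j])
    return D

# Hypothetical SNP profiles: each set holds the mutated positions of one isolate.
profiles = [{241, 3037, 14408, 23403},
            {241, 3037, 14408, 23403, 25563},
            {8782, 28144}]
print(jaccard_distance_matrix(profiles))

Each row of the resulting square matrix is then treated as the feature vector of the corresponding sample, as described above.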
However, due to the large number of samples and features, applying K-means clustering directly to a feature matrix of size 203,344 × 203,344 is a very time-consuming process. Considering that UMAP outperforms the other two dimensionality reduction algorithms (PCA and t-SNE) on the Jaccard distance-based MNIST dataset, we employ UMAP to reduce our original feature matrix of size 203,344 × 203,344 to 203,344 × 203. To be noted, UMAP is a reliable and stable algorithm that performs consistently in clustering at various reduction ratios. Therefore, there is no need to use the same reduction dimension of 203, and one can choose a different reduction dimension to generate similar results. With the reduced feature matrix of size 203,344 × 203, we split our SARS-CoV-2 dataset into different clusters by applying K-means clustering. After comparing the WCSS under different numbers of clusters, we find, based on the elbow method, that there are 6 clusters forming within the SARS-CoV-2 population, which can be determined from Figure 3A.1 in the Supporting Information. Table 3A.1 in the Supporting Information shows the top 25 single mutations of each cluster. In order to understand the relationships, we also analyzed the co-mutations occurring in each cluster (Table 3.9). Here, we define a co-mutation as mutations that occur simultaneously in one SNP profile. For example, mutations occurring at positions 241 and 3037 in a single SNP sample form the co-mutation [241, 3037]. From Table 3A.1 and Table 3.9 we see the following:

Table 3.9 The frequency and occurrence percentage of SARS-CoV-2 co-mutations from each cluster in the world.

Cluster     Co-mutations                                                                    Frequency   Percentage
Cluster 1   [241, 3037, 14408, 23403, 28881, 28882, 28883]                                  21802       0.926
Cluster 2   [241, 1059, 3037, 14408, 23403, 25563]                                          15008       0.660
Cluster 3   [241, 1163, 3037, 7540, 14408, 16647, 18555, 22992, 23401, 23403,
            28881, 28882, 28883]                                                            2089        0.606
Cluster 4   [241, 3037, 14408, 23403]                                                       13387       0.936
Cluster 5   [241, 3037, 14408, 23403]                                                       124290      0.915
Cluster 6   [241, 3037, 4543, 5629, 9526, 11497, 13993, 14408, 15766, 16889, 17019,
            18877, 22992, 23403, 25563, 25710, 26735, 26876, 28975, 29399]                  3279        0.940

• Though Clusters 1 and 6 seem similar from the top 25 single mutations, the co-mutations tell a different story.

• Clusters 2 and 5 have a high frequency of the [241, 3037, 14408, 23403] mutations, but Cluster 5 has a clear co-mutation descendant with high frequency.

• Cluster 3 has a unique combination of mutations that is only popular in Cluster 3.

• Cluster 6 has a high frequency of multiple co-mutations. Since it shares similarity with Clusters 4 and 5, it may be that Cluster 6 branched from Clusters 4 and 5.

• Cluster 6 has many co-mutations when compared to the other clusters. As seen in Table 3A.2, the majority of its cases are found in Europe, including the United Kingdom (UK), Denmark (DK), the Netherlands (NL), Switzerland (CH), and Luxembourg (LU).

Table 3A.2 in the Supporting Information shows the cluster distributions of samples from 25 countries. Here, we use the ISO 3166-1 alpha-2 codes as the country codes. The listed countries are the United Kingdom (UK), the United States (US), Australia (AU), India (IN), Switzerland (CH), the Netherlands (NL), Canada (CA), France (FR), Belgium (BE), Singapore (SG), Spain (ES), Russia (RU), Portugal (PT), Denmark (DK), Sweden (SE), Austria (AT), Japan (JP), South Africa (ZA), Iceland (IS), Brazil (BR), Saudi Arabia (SA), Norway (NO), China (CN), Italy (IT), and Korea (KR).
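The reduction and clustering pipeline described above (UMAP applied to the rows of the Jaccard feature matrix, followed by K-means and the elbow criterion on the WCSS) can be sketched as follows. This is a minimal illustration assuming the umap-learn and scikit-learn packages; the random stand-in data, the target dimension, and the candidate cluster numbers are placeholders rather than the settings used for the 203,344-sample GISAID matrix.

import numpy as np
import umap                               # umap-learn
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
D = rng.random((500, 500))                # stand-in for the N x N Jaccard distance matrix
D = (D + D.T) / 2
np.fill_diagonal(D, 0.0)

# Each row of the Jaccard matrix is treated as that sample's feature vector,
# and UMAP reduces it to a much lower dimension (203 in the text; 20 here).
embedding = umap.UMAP(n_components=20, random_state=0).fit_transform(D)

# Elbow method: record the within-cluster sum of squares (the KMeans inertia).
wcss = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(embedding).inertia_
        for k in range(2, 11)}

# Inspect the WCSS curve for an elbow, then refit with the chosen number (6 in the text).
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(embedding)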
We can visualize the clusters on the world map in Figure 3.6, which was generated using Highcharts. The underlying color indicates the dominant cluster for each country. Furthermore, from Table 3A.2, we can see the following:

• SNP profiles from the UK and DK are dominated by Cluster 5.

• Cluster 3's SNP profiles are predominantly found in AU. This may indicate that SARS-CoV-2 is mutating differently in AU.

• SNP profiles from the US are found mostly in Clusters 2 and 5.

• Most countries' SNP profiles are found in Clusters 1, 2, 4, 5, and 6, with some clusters having slightly higher counts.

Figure 3.6 Cluster distribution of the global SARS-CoV-2 mutation dataset. Using Highcharts, the world map was colored according to the dominant cluster. For example, the United States has SNP profiles from all clusters, but Cluster 5 (purple) is the dominant type in the US. Only countries with more than 25 sequenced genomes available on GISAID were considered; countries with fewer than 25 samples are colored gray.

Notably, in Table 3.9, Clusters 4 and 5 have the same co-mutations with relatively high frequencies, which indicates that Clusters 4 and 5 share the same "root". Clusters 1, 2, 3, and 6 share this co-mutation with Clusters 4 and 5, indicating that Clusters 1, 2, 3, and 6 may have branched from Clusters 4 and 5 in the 203-dimensional (203D) space. However, we cannot visualize the distribution of our reduced dataset in the 203D space. Therefore, benefiting from the stable and reliable performance of UMAP at various reduction ratios, we reduce the dimension of our original dataset to 2, which enables us to observe the distribution of the dataset in the two-dimensional (2D) space. Figure 3.7 visualizes the distribution of our dataset with 6 distinct clusters using 2D UMAP. It can be seen that Clusters 2, 3, and 4 share the same "root" in the middle. Clusters 3 and 6 are farther away from the center, indicating that they are descendants of the middle root.

In addition, we looked specifically at the spike (S) protein because of its significance in viral infectivity. In all the clusters, 23403A>G (D614G) is present. Studies have shown that D614G increases the infectivity of SARS-CoV-2 [35]; hence the high frequency in our data reflects such infectivity. In Clusters 1, 2, and 4, there are no significant co-mutations in the S protein. In Cluster 3, 100% of the variants contain the co-mutation [22992, 23401, 23403], which further supports its geographical isolation, as it is predominantly found in AU. Cluster 5 does not have a significant co-mutation, but the co-mutations [21614, 22227, 23403, 24334] occurred in 11290 SNP profiles (0.083). Cluster 6 has a pair of co-mutations [22992, 23403], which occurs in 99.7% of samples.

Figure 3.7 2D UMAP visualization of the world SARS-CoV-2 mutation dataset with 6 distinct clusters.

3.1.4.2 United States SARS-CoV-2 mutation clustering

In addition to analyzing the clustering in the world, SNP profiles of SARS-CoV-2 from the US were considered. In this section, the US dataset has 17,164 unique single mutations and 43,395 samples. Therefore, the dimension of the Jaccard distance-based dataset is 43,395 × 43,395. After applying UMAP, we reduce the dimension of the original dataset to 43,395 × 216. Following a similar K-means clustering process as for the world dataset and using the elbow method, we can see from Figure 3A.2 that there are 6 predominant clusters forming in the United States.
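The co-mutation frequencies and occurrence percentages reported in Tables 3.9 and 3.10 amount to counting, within each cluster, how many SNP profiles contain every position of a given combination. A small illustrative sketch is shown below; the profiles, cluster labels, and helper name are hypothetical and are only meant to show the counting logic.

from collections import defaultdict

def comutation_stats(profiles, labels, comutation):
    # For each cluster, count profiles whose SNP set contains every position in
    # `comutation`, and report the within-cluster occurrence fraction.
    target = set(comutation)
    counts, sizes = defaultdict(int), defaultdict(int)
    for snp_set, cluster in zip(profiles, labels):
        sizes[cluster] += 1
        if target <= snp_set:          # co-mutation: all positions occur together
            counts[cluster] += 1
    return {c: (counts[c], counts[c] / sizes[c]) for c in sizes}

# Hypothetical SNP profiles and cluster assignments.
profiles = [{241, 3037, 14408, 23403}, {241, 3037, 14408, 23403, 25563},
            {8782, 28144}, {241, 3037, 14408, 23403, 28881, 28882, 28883}]
labels = [0, 0, 1, 0]
print(comutation_stats(profiles, labels, [241, 3037, 14408, 23403]))
# {0: (3, 1.0), 1: (0, 0.0)}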
Figure 3.8 shows the US map with the cluster statistics. Here, Highcharts was used to generate the plot with pie charts, and each state was colored based on its dominant cluster. Table 3A.3 in the Supporting Information shows the top 25 mutations from each cluster in the United States. The cluster distribution of each state is listed in Table 3A.4. Table 3.10 shows the commonly occurring co-mutations, and we can observe the following:

• Cluster A has a high frequency of the co-mutation [241, 1059, 3037, 14408, 23403, 25563], which is a descendant of the common co-mutation of Cluster 2, [241, 1059, 3037, 14408, 23403, 25563], from Table 3A.3.

• Cluster B has a high frequency of the co-mutation [241, 3037, 14408, 23403], which is a descendant of the common co-mutation of Clusters 4 and 5, [241, 3037, 14408, 23403].

• Cluster C has a high frequency of the co-mutation [241, 3037, 14408, 23403, 28881, 28882, 28883], which is a descendant of the common co-mutation of Cluster 1, [241, 3037, 14408, 23403, 28881, 28882, 28883], from Table 3.10.

• Cluster D has a high frequency of the co-mutation [241, 3037, 14408, 20268, 23403, 28854], which is a descendant of the Clusters 4 and 5 co-mutation [241, 3037, 14408, 23403]. The US accounts for more than one third of mutations at site 23403 and half of mutations at site 28854.

• Clusters E and F have a high frequency of the co-mutations [8782, 17747, 17858, 18060, 28144] and [241, 1059, 3037, 11916, 14408, 18998, 23403, 25563, 29540], respectively, which are descendants of the Clusters 4 and 5 co-mutation [241, 3037, 14408, 23403].

• Cluster F has a high frequency of the co-mutation [241, 1059, 3037, 11916, 14408, 18998, 23403, 25563, 29540], which is a descendant of Cluster 2's co-mutation [241, 1059, 3037, 14408, 23403, 25563].

Table 3.10 The frequency and occurrence percentage of SARS-CoV-2 co-mutations from each cluster in the US.

Cluster     Co-mutations                                                       Frequency   Percentage
Cluster A   [241, 1059, 3037, 14408, 23403, 25563]                             6646        0.702
Cluster B   [241, 3037, 14408, 23403]                                          20442       0.932
Cluster C   [241, 3037, 14408, 23403, 28881, 28882, 28883]                     4429        0.945
Cluster D   [241, 3037, 14408, 20268, 23403, 28854]                            3276        0.643
Cluster E   [8782, 17747, 17858, 18060, 28144]                                 1183        0.744
Cluster F   [241, 1059, 3037, 11916, 14408, 18998, 23403, 25563, 29540]        501         0.789

Figure 3.8 Cluster distribution of the United States SARS-CoV-2 mutation dataset. Using Highcharts, the US map was colored according to the dominant cluster. For example, the United States has SNP profiles from all clusters, but Cluster E (purple) is the dominant type in the US. Only states with more than 25 sequenced genomes available on GISAID were considered in the plot.

Notably, in Table 3.10, Cluster B has a co-mutation that is present in Clusters A, C, D, and F, indicating that Clusters A, C, D, and F are descendants of Cluster B. Interestingly, Cluster E has a completely different set of co-mutations from the other clusters, indicating that it is a different strain of mutations. Considering the stability and reliability of UMAP at various reduction ratios, we apply UMAP to the original US dataset with reduced dimension 2, aiming to observe the distribution of the dataset in the 2D space. Figure 3.9 illustrates the 2D visualization of the US dataset with 6 distinct clusters. We can see that 3 clusters (Clusters A′, B′, and C′) share the same "root" located in the middle of the figure, while the other 3 clusters (Clusters D′, E′, and F′) do not. Cluster E′ is quite distinct from the other clusters.
This confirms our deduction about why Cluster E′ has a high frequency of different co-mutations in Table 3.10. In addition, Cluster D′ is located close to Cluster A′, which may indicate that they have a similar root that diverged.

In addition, we looked at co-mutations on the S protein. Every cluster, except for Cluster E, contains mutation 23403, which is expected due to its ability to increase the infectivity of SARS-CoV-2. Clusters A, C, and F do not have any significant co-mutation occurring in the S protein, aside from 23403. Cluster E does not have a significant co-mutation nor a significant single mutation in the S protein. Cluster B has the co-mutation [22255, 23403], which occurs in 780 samples. Cluster D has the co-mutation [23403, 23604, 24076], which occurs in 892 samples.

Figure 3.9 The 2D UMAP visualization of the US SARS-CoV-2 mutation dataset with 6 distinct clusters.

3.1.5 Discussion

In this section, we compare our past results [9] with our new method to gain a different perspective on clustering the SNP profiles of COVID-19. In our previous work, a total of 8,309 unique single mutations were detected in 15,140 SARS-CoV-2 isolates. Here, we also calculate the pairwise distance among the 15,140 SNP profiles and set the number of clusters to six. Table 3A.5 shows the cluster distribution of samples from the 15 countries [9]. The listed countries are the United States (US), Canada (CA), Australia (AU), the United Kingdom (UK), Germany (DE), France (FR), Italy (IT), Russia (RU), China (CN), Japan (JP), Korea (KR), India (IN), Spain (ES), Saudi Arabia (SA), and Turkey (TR), and we use Clusters I, II, III, IV, V, and VI to represent the six clusters obtained without applying any dimensionality reduction algorithm. Table 3A.6 lists the cluster distribution of samples from the same 15 countries, where we use Ip, IIp, IIIp, IVp, Vp, and VIp to represent the six clusters obtained with PCA at a reduction ratio of 1/160. Table 3A.7 lists the cluster distribution of samples from the same 15 countries, where we use Iu, IIu, IIIu, IVu, Vu, and VIu to represent the six clusters obtained with UMAP at a reduction ratio of 1/160. Noticeably, the SNP profiles are concentrated in Cluster Iu, whereas in the non-reduced version, the samples are more spread out. This may be caused by the large number of features, which makes the computed distances between the centroid and each data point too similar, leading to samples being placed in incorrect clusters. Not surprisingly, PCA and the original method of [9] have nearly identical results. It has been shown in [9] that PCA is the continuous solution of the cluster indicators in the K-means clustering method. On the other hand, UMAP shows a slightly different result. In the PCA method, the distribution is more spread out. In addition, the top occurrence for each country is higher for UMAP. On the other hand, we see that there are more samples in Cluster Iu for UMAP, which may indicate that mutations in Cluster Iu are the main strain. Moreover, Figure 3.10 illustrates the 2D visualizations of the US dataset up to June 01, 2020, with 6 distinct clusters by applying two different dimensionality reduction algorithms. We can see that the data are distributed in a disordered manner under both the PCA- and UMAP-assisted K-means clustering algorithms.
Specifically, the PCA-assisted algorithm has a really poor clustering performance, while the UMAP- assisted algorithm forms more clear and better clusters than the PCA-assisted algorithm, which is consistent with our previous analysis in Section 3.1.3.1. Table 3.11 shows co-mutations occurred in each cluster from the UMAP-assisted K-means from data collected up to June 01, 2020. Cluster IIIu has 2 dominant co-mutations. Note that the dataset had 15140 SARS-CoV-2 isolates, whereas our current dataset has over 200,000 isolates. Nonetheless, we can compare the clusters to see which clusters persists. Cluster 1’s co-mutations are the same as those of Cluster Vu, indicating that Cluster 1 may have been derived from Cluster Vu. Cluster 2 shares the same co-mutations as those of Cluster IIu. Cluster 3’s co-mutations are the descendants of Cluster Vu. Clusters 4 and 5 have the same co-mutations as those of Clusters IIIu and VIu, indicating Clusters 4 and 5 are derived from Cluster IIIu and VIu. Cluster 6’s co- 77 mutations are descendants of Clusters IIIu and VIu. Note that co-mutations of Cluster Iu and the second set of co-mutations of Cluster IIu ([8782, 28144]) are not predominant co-mutations in our dataset, which may indicate a weaker infectivity. For example, every co-mutation in Table 3.9 has mutation 23403A>G (D614G) in the spike protein, which has been shown to increase infectivity of COVID-19 [35]. It is not surprising to see a co-mutation group not being dominant in our current dataset. By comparing these co-mutations, we can see that co-mutations that are dominant in both datasets (up to June 01 and January 20) will most likely persist in the future. Table 3.11 The frequency and occurrence percentage of SARS-CoV-2 co-mutations from each clusters collected from June 01, 2020. Cluster Cluster Iu Cluster IIu Cluster IIIu Cluster IVu Cluster Vu Cluster VIu Co-mutations [11083, 14805, 26144] [241, 3037, 14408, 23403, 25563] [241, 3037, 14408, 23403] [8782, 28144] [241, 1059, 3037, 14408, 23403, 25563] [241, 3037, 14408, 23403, 28881, 28882, 28883] [241, 3037, 14408, 23403] Frequency Percentage 948 2800 1468 1475 1318 1872 2222 0.730 0.893 0.412 0.414 0.621 0.817 0.969 Figure 3.10 2D visualizations of the US SARS-CoV-2 mutation dataset up to June 01, 2020 with 6 distinct clusters by applying two different dimensional reduction algorithms. (a) 2D PCA visualization. (b) 2D UMAP visualization. 3.1.6 Conclusion The rapid global spread of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has led to genetic mutation stimulated by genetic evolution and adaptation. Up to January 20, 2021, 203,344 complete SARS-CoV-2 78 sequences, and a total of 26,844 unique SNPs have been detected. Our previous work traced the COVID-19 transmission pathways and analyzed the distribution of the subtypes of SARS-CoV-2 across the world based on 15,140 complete SARS-CoV-2 sequences. The K-means clustering separated the sequences into six distinguished clusters. However, considering the tremendous increase in the number of available SARS-CoV-2 sequences, an efficient and reliable dimensional reduction method is urgently required. Therefore, the objective of the present work is to explore the best suited dimension reduction algorithm based on their performance and effectiveness. 
Here, a linear algorithm PCA and two non-linear algorithms, t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP), have been discussed. To evaluate the performance of dimension reduction techniques in clustering, which is an unsupervised problem, we first cast classification problems into clustering problems with labels. Next, by setting different reduction ratios, we test the effectiveness and accuracy of PCA, t-SNE, and UMAP for k- NN and K-means using four benchmark datasets. The results show that overall, UMAP outperforms other two algorithms. The major strengths of UMAP is that UMAP-assisted k-NN classification and UMAP-assisted K-means clustering at various dimension reduction ratios have a consistent performance in terms of accuracy, which proves that UMAP is a stable and reliable dimension reduction algorithm. Moreover, compared to the K-means clustering accuracy that does not involve any dimensional reduction, UMAP-assisted K-means clustering can improve the accuracy for most cases. Furthermore, when the dimension is reduced to two, the UMAP clustering visualization is clear and elegant. Additionally, UMAP is a relatively efficient algorithm compared to t-SNE. Although PCA is a faster algorithm, its major limitation is its poor performance in accuracy. To be noted, UMAP performs better than PCA and t-SNE for the dataset with a large number of samples, indicating it is the best suited dimensional reduction algorithm for our SARS-CoV-2 mutation dataset. Moreover, we apply the UMAP-assisted K-means clustering to the world SARS-CoV-2 mutation dataset (up to January 20, 2021), which displays six distinct clusters. Correspondingly, the same approaches are also applied to the United States SARS-CoV-2 mutation dataset (up to January 20, 2021), resulting in six different clusters as well. Furthermore, we provide a new 79 perspective by utilizing UMAP-assisted K-means clustering to analyze our previous SARS-CoV-2 mutation datasets, and the 2D visualization of UMAP-assisted K-means clustering of our previous world SARS-CoV-2 mutation dataset (up to June 01, 2020) forms more clear clusters than the PCA-assisted K-means clustering. Finally, one of our four datasets was generated by the Jaccard distance representation, which improves both kNN classification and k-means clustering accuracies on the original dataset. 3.2 K-mer Topology for Whole Genome Analysis 3.2.1 Introduction Topological data analysis (TDA) is an emerging field in data science and has also been utilized in DNA sequence alignment. TDA, or more specifically, persistent homology, begins by first representing the point cloud data with vertices, edges, triangles, tetrahedra, etc., or more generally, a simplicial complex. Then, concepts from algebraic topology, such as connected components, holes, and voids, are utilized to extract topological invariants, and filtration is applied to capture the persistence of such topological invariants across different scales. Chan et al. [36] utilized persistent homology to model both vertical and horizontal evolution in viruses, including clonal evolution, reassortment, and recombination. They computed sequence dissimilarity via sequence alignment and used this information to calculate persistent homology based on genetic dissimilarity. The 0th order homology represents vertical evolution, while the 1st order homology corresponds to horizontal evolution. 
This method was applied to SARS-CoV-2 to analyze mutations and utilized a novel metric called the topological recurrence index (tRI), which calculated the number of cycles in a tree and was used to measure convergent evolution [37]. In Nguyen et al. [38], persistent homology was applied to the CGR representation of viral sequences, and the resulting persistence diagram was computed. Subsequently, the Wasserstein distance was calculated between these diagrams to construct the phylogenetic tree. These methods represent some of the first adoptions of persistent homology for DNA sequence analysis. However, due to the high computational cost associated with computing persistent homology, their approach may not be suitable for longer sequences, general whole viral sequences, and large-scale comparisons.

In this work, we introduce k-mer topology, a novel persistent homology approach based on k-mer positions. For each k-mer, we construct position-based distances. Using the standard persistent homology methodology, we obtain Betti curves for each k-mer. We benchmarked our method using the NCBI virus reference genomes and conducted phylogenetic tree analysis on six datasets. Our method demonstrates the highest accuracy in NCBI reference virus classification and proves to be an effective tool for phylogenetic analysis. Furthermore, we demonstrate that our method can handle large whole bacterial genomes, a capability not achievable with other persistent homology tools.

3.2.2 Method

In this section, we describe the k-mer specific topology. For an overview of persistent homology, refer to Section 2.4.

3.2.2.1 Persistent Homology on Nucleotide Sequences

In this section, we describe the persistent homology for nucleotide sequences and the construction of the features. Then, we define a metric on the features, which will be used for the phylogenetic analysis.

Position based distance Let S = s_1 s_2 ... s_N be a DNA sequence of length N, where s_i ∈ {A, C, G, T} is a nucleotide. Define the nucleotide-specific indicator function

δ_l(s_i) = 1 if s_i = l, and 0 otherwise, (3.7)

where l ∈ {A, C, G, T}. Then, we can define a nucleotide-specific position vector S_l as

S_l = {i | δ_l(s_i) = 1, 1 ≤ i ≤ N}. (3.8)

For example, for a sequence S = CGGATAACGTCCAGCAGTCAGTGATCGCATATCTTGAC, we have

S_A = [4, 6, 7, 13, 16], (3.9)
S_C = [1, 8, 11, 12, 15], (3.10)
S_G = [2, 3, 9, 14, 17], (3.11)
S_T = [5, 10]. (3.12)

The distance based on the nucleotide positions, denoted D_l, is computed as follows:

D_l(i, j) = |S_l(i) − S_l(j)|. (3.13)

For example, the position-based distance matrix for S_A is

D_A =
[  0   2   3   9  12 ]
[  2   0   1   7  10 ]
[  3   1   0   6   9 ]
[  9   7   6   0   3 ]
[ 12  10   9   3   0 ].  (3.14)

Additionally, we can define a k-mer-specific position, where instead of focusing solely on individual nucleotides, we examine strings of nucleotides of length k. For a given k, there are 4^k different combinations of k-mers, denoted l_1, l_2, ..., l_{4^k}. Similarly, we can define the k-mer-specific indicator function as

δ_{l_p}(s_i s_{i+1} ... s_{i+k−1}) = 1 if s_i s_{i+1} ... s_{i+k−1} = l_p, and 0 otherwise. (3.15)

Then, the k-mer-specific position vector is given by

S_{l_p} = {i | δ_{l_p}(s_i s_{i+1} ... s_{i+k−1}) = 1, 1 ≤ i ≤ N − k + 1}. (3.16)

Then the k-mer-specific position-based distance can be defined as D_{l_p}(i, j) = |S_{l_p}(i) − S_{l_p}(j)|.
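These position-based constructions translate directly into a few lines of code. The sketch below, which uses a prefix of the example sequence above as a toy input, builds the k-mer-specific position vector of Eq. (3.16) and the corresponding position-based distance matrix; the helper names are illustrative and this is not the implementation used for the benchmarks.

import numpy as np
from itertools import product

def kmer_positions(seq: str, kmer: str):
    # 1-based start positions of `kmer` in `seq` (Eq. 3.16).
    k = len(kmer)
    return [i + 1 for i in range(len(seq) - k + 1) if seq[i:i + k] == kmer]

def position_distance_matrix(positions):
    # D_{l_p}(i, j) = |S_{l_p}(i) - S_{l_p}(j)| from the position vector.
    s = np.asarray(positions, dtype=float)
    return np.abs(s[:, None] - s[None, :])

seq = "CGGATAACGTCCAGCAG"             # prefix of the example sequence, for illustration
for kmer in ("A", "AG"):
    pos = kmer_positions(seq, kmer)
    print(kmer, pos)                  # A -> [4, 6, 7, 13, 16]; AG -> [13, 16]
    print(position_distance_matrix(pos))

# The full k-mer alphabet l_1, ..., l_{4^k} can be enumerated when needed, e.g. for k = 2:
all_2mers = ["".join(p) for p in product("ACGT", repeat=2)]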
Using the same sequence as an example, the AG-specific position vector and distance matrix are

S_AG = [13, 16],  D_AG =
[ 0  3 ]
[ 3  0 ].  (3.17)

Persistent homology features Using the position-based distance, Ripser [39] was used to construct the simplicial complex and the persistence barcodes. Then, Gudhi [40] was used to obtain the Betti-0 curves. Denote β_{0,r}^{l_p} as the Betti number at filtration value r for a particular k-mer l_p. For example, β_{0,3}^{AT} would be the Betti-0 number of the 2-mer AT at filtration value 3. We can then collect the Betti values across increasing filtration values to obtain the Betti curve β_0^{l_p}, which is defined as the vector

β_0^{l_p} = (β_{0,0}^{l_p}, ..., β_{0,r}^{l_p}, ..., β_{0,R}^{l_p}), (3.18)

where R is the maximum number of filtrations.

Figure 3.11 shows an example of a persistence barcode and the corresponding Betti curve. We generated a random sequence of length N = 100 and allowed the sequence to be circular. For such a circular sequence, we can define the distance between S_{l_p}(i) and S_{l_p}(j) to be

D_{l_p}(i, j) = min(S_{l_p}(j) − S_{l_p}(i), N − S_{l_p}(j) + S_{l_p}(i)) (3.19)

for i < j. Then, the persistent homology is based on a loop. The left panel shows the persistence barcode of l = A, where the barcode shows the birth and death of connected components (H0 in blue), loops (H1 in green), and cavities (H2 in red). The right panel shows β_0^l. The 4 colors correspond to the nucleotides l = A, C, G, T.

Figure 3.11 Persistence barcode and Betti curve of nucleotide l = A of a randomly generated circular sequence of length 100 base pairs. (a) Blue, red, and green correspond to the birth and death of H0, H1, and H2 features. (b) The Betti curves of A (red), C (green), G (blue), and T (purple). The y-axis corresponds to the β0 number at a particular filtration value.

Additionally, we provide a real virus example using NC_001330, a reference sequence from the Microviridae family. The Microviridae family comprises bacteriophages with single-stranded DNA (ssDNA) genomes. NC_001330 consists of 6087 base pairs, including 1431 A, 1325 C, 1425 G, and 1906 T. Figure 3.12 shows the Betti curves for individual nucleotides, as well as for 2-mers, 3-mers, and 4-mers.

Figure 3.12 Betti curves of 1-mers, 2-mers, 3-mers, and 4-mers of the NC_001330 reference genome of the Microviridae family.

For a given k, there are a total of 4^k combinations of k-mers. Therefore, if the sequence were truly random, we would expect to see the same k-mer roughly every 4^k positions. Not surprisingly, as we increase k, we see that the connected components persist for a longer time, which indicates that the filtration range must increase. Moreover, in order to reduce the number of features we obtain from the Betti curves, we compute the Betti numbers at a step size of 4^{k−1} for a total of 50 steps.

Distance on the k-mer persistent homology features We now define the metric used to perform the phylogenetic analysis. For each k, we have 4^k different k-mer-specific Betti curves. We concatenate these Betti curves to obtain the vector

β^k = (β_0^{l_1}, ..., β_0^{l_{4^k}}). (3.20)

Then, for viruses i and j, we can define the metric between the k-mer-specific Betti curves as

dist_k(i, j) = ∥β_i^k − β_j^k∥_2, (3.21)

where ∥·∥_2 denotes the Euclidean distance. Then, we can take a weighted sum over different k to obtain the distance between viruses i and j,

Dist(i, j) = Σ_{k=1}^{K} a_k dist_k(i, j).
(3.22) 3.2.3 Results 3.2.3.1 Classification of the reference viral genome In order to verify the effectiveness of our method, we conducted classification on the viral reference genome from the National Center for Biotechnology Information (NCBI) virus database. NCBI adopts the virus taxonomy from the International Committee on Taxonomy of Viruses (ICTV), which updates the nomenclature of viruses periodically according to new findings. We considered NCBI dataset collected in 2020, 2022 and 2024, and the details of the preprocessing process can be found in Table 3.12. For NCBI 2020 and NCBI 2022, we have removed sequences that were no longer considered reference sequences as of January 20, 2024. For NCBI 2024, we removed families without a proper taxonomy rank, i.e., any sequences with a family name not ending in ’-viridae’. For example, the viral family Tolecusatellitidae was removed because its rank is undetermined. We included these undetermined families in NCBI 2024 full. Additionally, we included sequences with invalid nucleotides to simulate real-world scenarios. We benchmarked our method on these data because viral taxonomy is regularly updated. Therefore, our method not only needs to identify the correct family but also find the most similar sequence within that family. For example, viral families such as Herpesviridae and Reoviridae from the NCBI 2020 and NCBI 2022 datasets were abolished in late 2022, with some sequences being divided into smaller families or remaining unclassified. Therefore, it is crucial for the alignment method to be robust to these changes and accurately identify the correct viral family. 86 Table 3.12 Dataset, NCBI collection date, proprcessing procedure, number of families and number of sequences. Name NCBI date NCBI 2020 [41] 03/2020 NCBI 2022 03/2022 NCBI 2024 01/20/2024 NCBI 2024 full 01/20/2024 Removed sequence Unknown Baltimore class Unknown family families <3 sequence Partial sequence Unknown family families <2 sequence Invalid nucleotides Partial sequence Unknown family Only ’-viridae’ families <3 sequence Invalid nucleotide Partial sequence Unknown family families <3 sequence # families # sequence 83 6,993 123 11,428 199 12,154 209 13,645 For benchmarking, we adopt the procedure outlined in [41]. After constructing the k-mer spe- cific Betti curves and the distance matrix, we performed a 1-nearest neighbor (1-NN) classification . The identification is considered correct if the most similar sequence has the same family label. This simulates a real-world scenario, where identifying the most similar sequence is critical for further analysis. We compare our method with generalize natural vector (GNV) and Markov K-string (Markov) model [42], and their details can be found in section 2.5 Table 3.13 show the 1-NN classification accuracy of individual k-mer specific Betti curves as well as the weighted sum. 3-mers and 4-mers have the highest individual accuracy, and 1-mers have the lowest accuracy. The weighted sum has the higher accuracy on all the data except for NCBI 2022. Table 3.13 1-NN classification of the 4 dataset. The accuracy of individual k-mer specific Betti curves were obtained, as well as the weighted sums. The bolded number correspond to the highest accuracy. 
Data NCBI 2020 NCBI 2022 NCBI 2024 NCBI 2024 full k = 1 0.817 0.814 0.750 0.748 k = 2 0.902 0.894 0.857 0.857 87 k = 4 k = 5 k = 3 Sum 0.929 0.933 0.927 0.933 0.920 0.912 0.917 0.828 0.898 0.887 0.887 0.901 0.889 0.921 0.895 0.897 Figure 3.13 shows the comparison of our method to k-mer specific GNV and Markov model. For GNV, we utilized order 2 for each k and, used Euclidean distance to construct the distance matrix. Notice that our model’s accuracy increases as k increases, plateauing at k = 4. On the other hand, GNV and Markov model decreases in accuracy at k = 2 and k = 3, respectively. The decrease in accuracy for the Markov model is not surprising. This is because the Markov k-string method assesses sequence differences by comparing the observed appearance of k-strings with the predicted appearance of k-strings computed through the Markov model. The predicted model evaluates the left and right k – 1 substrings, along with the middle k – 2 substring. Therefore, 3-mers exhibit the most reliable computation of predicted appearance. The decrease in GNV performance is consistent with the report given by the original authors [41]; however, it is important to note the GNV’s strength comes from the metric defined on all the k-mer specific GNV. (a) NCBI 2020 (b) NCBI 2022 (c) NCBI 2024 (d) NCBI 2024 All Figure 3.13 Comparison of k-mer topology with GNV and Markov models. For GNV, order 2 for each k is presented. 88 Figure 3.14 shows the comparison of our method with GNV and the Markov k-string model. For GNV, we utilized order 2 with K = 9, and for the Markov model, we employed k = 3. In all models, there is a decrease in accuracy from 2020 to 2024, which is not surprising given the increase in the number of families and the division of large families into smaller ones. Notably, the Markov model proves to be less robust to these changes. Interestingly, our method demonstrates slightly better performance on NCBI 2024 All data compared to NCBI 2024 data. This is promising for real-world applications, as most sequences contain some invalid nucleotides due to experimental errors. Moreover, our method outperforms GNV by 6.13%, 5.12%, 8.33%, and 9.17% for NCBI 2020, NCBI 2022, NCBI 2024, and NCBI 2024 All data, respectively. Figure 3.14 Accuracy of k-mer topology, GNV and Markov models on the 4 dataset. For k-mer topology and GNV, the weighted sum over the different k was used. For GNV, k = 9 and order 2 was used, and for Markov model, k = 3 was used. 3.2.3.2 Phylogenetic analysis using k-mer topology We constructed a phylogenetic tree utilizing the k-mer topology method. After computing the distance, we utilized unweighted pair group method with arithmetic mean (UPGMA) method to construct the tree. Ebolavirus We obtained 59 complete genomes of ebolavirus, representing 5 species: Bundibu- gyo virus (BDBV), Reston virus (RESTV), Ebola virus (EBOV, formerly known as Zaire virus), Sudan virus (SUDV), and Tai Forest virus (TAFV). Details of the data can be found in Table 3B.1. 89 The EBOV species was further divided into 6 subtypes based on the location and year of the outbreak: Zaire (now known as DRC) pandemic of 1976-1977, DRC pandemic of 2007, Guinea outbreak of 2014, Gabon outbreak of 1994, and DRC outbreak of 1995. Figure 3.15 displays the phylogenetic tree generated using our method. We employed K = 5 to compute the k-mer topology features and utilized the UPGMA algorithm to generate the phylogenetic tree. 
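The distance computation and UPGMA step described at the start of this subsection, which is reused for each of the datasets below, can be sketched as follows. The random stand-in Betti-curve features and the weights a_k are placeholders; SciPy's average-linkage hierarchical clustering is used here because average linkage implements UPGMA, but this is an illustrative sketch rather than the exact script behind the figures.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
n_seq = 6
# betti[k] holds one concatenated Betti-curve vector beta^k per sequence
# (4^k k-mers x 50 filtration steps); random stand-ins are used here.
betti = {k: rng.random((n_seq, 4**k * 50)) for k in (1, 2, 3)}
weights = {1: 1.0, 2: 1.0, 3: 1.0}     # the a_k of Eq. (3.22), illustrative values

# Dist(i, j) = sum_k a_k * ||beta^k_i - beta^k_j||_2  (Eqs. 3.21-3.22)
dist = sum(w * squareform(pdist(betti[k], metric="euclidean"))
           for k, w in weights.items())

# UPGMA = average-linkage hierarchical clustering on the condensed distances.
upgma_tree = linkage(squareform(dist, checks=False), method="average")
# upgma_tree can be rendered with scipy.cluster.hierarchy.dendrogram or exported
# to a tree viewer to produce phylogenetic trees like those shown below.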
The colors of the clades and labels correspond to the ebolavirus types and the EBOV subtypes. Notably, BDBV, RESTV, EBOV, SUDV, and TAFV are separated into distinct clades. Additionally, the EBOV subtypes are placed in separate clades, representing an improvement over [43], where the Fuzzy Integral Similarity method incorrectly placed one of the DRC outbreak of 2007 strains into the original Zaire pandemic clade. 90 Figure 3.15 Phylogenetic tree of ebolavirus complete genomes, including those from Bundibugyo virus (BDBV), Reston virus (RESTV), Ebola virus (EBOV, formally known as Zaire virus), Sudan virus (SUDBV) and Tai Forest virus (TAFV). EBOV was divided into 6 subtypes corresponding to the pandemic or outbreak. The labels and clades were colored according to their types and pandemic. Mammalian Mitochondria RNA We obtained 41 full mitochondria genome of mammals, con- sisting of 8 types Artiodactyla, Carnivore, Cetacea, Erinaceomorpha, Lagomorpha, Perissodactyla, Primates, Rodentia. The details of the data can be found in Table 3B.2. We utilized K = 4 to compute the k-mer topology features and the metric, and the UPGMA algorithm was employed 91 to generate the phylogenetic tree. Figure 3.16 illustrates the phylogenetic tree of the 8 classes of species. The labels and the clade are colored according to their group labels. Our method correctly clusters each type into individual clades, which is an improvement from other methods, which often placed Rabbit (Lagomorpha) into a clade with Carnivores [43]. Figure 3.16 Phylogenetic tree of mammalian mitochondrial genomes, including those from Artio- dactyla, Carnivora, Cetacea, Erinaceomorpha, Lagomorpha, Perissodactyla, Primates, and Roden- tia. Details of the data can be found in Table 3B.2. The species and clades are colored according to their groups. K = 4 was used to compute the k-mer topology features, and the UPGMA algorithm was used to construct the tree. Rhinovirus We obtained 113 rhinbovirus (HRV) and 3 non-rhinovirus outgroup. HRV is a positive-sense, single-stranded RNA virus (ssRNA(+)) belonging to the viral family Picornaviridae and the genus Enterovirus, and is often the predominent cause of common cold [44, 45]. The details 92 of the data can be found in Table 3B.3. There are 3 distinct groups within Picornaviridae, HRV- A, HRV-B and HRV-C. Figure 3.17 show the phylogenetic tree of 113 whole HRV genome and 3 outgroup sequence (HEV-C). We utilized K = 3 to compute the homology features and the distance. The labels and clades are colored according to their groups. Figure 3.17 Phylogenetic tree of 113 rhinovirus (HRV) and 3 non-rhinovirus outgroup. The colors correspond to different host animals. K = 5 was used to compute the k-mer topology features and distance, and the UPGMA algorithm was used to construct the tree. Our method correctly separates all the groups into individual clades. However, we noticed that HRV-C shares a clade with the HEV-C outgroup. This result is similar to the findings in [46], where HRV-A and HRV-B shared a clade, and HRV-C shared a clade with the HRV-A/HRV-B parent clade. 93 Coronavirus We obtained 30 coronavirus full genome sequences, which range from about 25,000 to 30,000 nucleotides. This dataset has been studied in various other studies [47, 48, 49, 46, 43]." Coronaviruses are enveloped, single-stranded, positive-sense RNA (ssRNA(+)) viruses belonging to the Coronaviridae family. 
They are known to infect many species, including bats, birds, humans, and other mammals, and can spread across different species [50, 51]. With the SARS-CoV-2 (COVID-19) pandemic, there has been interest in uncovering the evolutionary relationships among coronaviruses. The details of the data can be found in Table 3B.4. Figure 3.18 shows the phylogenetic tree generated by our method. K = 7 was used to compute the k-mer topology features and the metric, and the UPGMA algorithm was used to construct the tree. The samples were colored according to the 5 coronavirus groups.

Figure 3.18 Phylogenetic tree of 30 coronavirus and 4 non-coronavirus outgroup whole genomes. The colors correspond to the different coronavirus groups and the outgroup. K = 8 was used to compute the k-mer topology features and distance, and the UPGMA algorithm was used to construct the tree.

Our method correctly clusters the non-coronavirus outgroups and groups 2, 3, 4, and 5. The NL63 strain from group 1 was not clustered with the other group 1 human coronaviruses, consistent with many alignment-free studies [43]. Additionally, our method separates group 2 coronaviruses into their hosts, namely the mouse hepatitis virus (MHV) and the bat coronavirus (BCoV).

Influenza Type A We obtained 36 full genomes of influenza A viruses. Influenza A viruses are single-stranded, segmented RNA viruses classified based on the surface proteins hemagglutinin (H) and neuraminidase (N) [52]. Additionally, they pose a major health threat to humans and have resulted in numerous outbreaks [53]. We obtained 5 subtypes, H7N9, H7N3, H2N2, H5N1, and H1N1, for the analysis, and the details of the data can be found in Table 3B.5. Figure 3.19 shows the phylogenetic trees generated using our method. The left and right figures correspond to homology features and distances computed using K = 5 and K = 1, respectively. Notice that with K = 5, we have multiple clades of H1N1 that share the same clade as H5N1 strains; however, K = 1 separated the H5N1 and H1N1 strains into different clades. This indicates potential overfitting with K = 5.

Figure 3.19 Phylogenetic trees of 36 influenza A viruses. The colors correspond to different subtypes, including H7N9, H7N3, H2N2, H5N1, and H1N1. (a) K = 5. (b) K = 1. After computing the k-mer topology features and distances, the UPGMA algorithm was used to construct the trees. The labels and the clades are colored according to the subtypes.

16S rDNA from Bacteria Isolates We obtained 40 bacterial sequences that were cultured from endophytic bacteria isolated from the surface-sterilized mature endosperms of 6 rice varieties. The 40 sequences have lengths of less than 1500 bp. The details of the data can be found in Table 3B.6. Figure 3.20 shows the phylogenetic tree of the 16S rDNA from bacteria isolates using our method. We utilized K = 5 to compute the homology features and the distance. Then, the UPGMA algorithm was used to obtain the tree. The clades and labels were colored according to the bacterial families.

Figure 3.20 Phylogenetic tree of 40 16S rDNA sequences from bacteria isolates. The colors correspond to the different bacterial families. K = 5 was used to compute the k-mer topology features and distance, and the UPGMA algorithm was used to construct the tree.

Our method shows a main clade for each of Actinomycetales, Xanthomonadales, Enterobacteriales, Pseudomonadales, and Rhizobiales. However, the top clade shows a mix of misclustered sequences. This is consistent with [43], but our method shows dominant clades for each bacterial family. This result suggests a limitation of our method for shorter sequences, where more computationally expensive methods, such as multiple sequence alignment, can be of great benefit.
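As a point of reference for the per-k features used throughout this section, the following is a simplified, hypothetical sketch of one ingredient the text describes: collecting the positions at which each k-mer occurs in a sequence and summarizing that one-dimensional point set with persistent homology. It is not the exact k-mer topology construction, and the use of GUDHI [40] for the persistence computation is an assumption made only for illustration.

```python
# Simplified sketch (not the exact k-mer topology construction): for each k-mer,
# gather its occurrence positions and compute 0-dimensional persistence of that
# one-dimensional point set with a Vietoris-Rips complex.
from collections import defaultdict
from itertools import product
import numpy as np
import gudhi

def kmer_position_persistence(seq: str, k: int):
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)              # occurrence positions of each k-mer
    summaries = {}
    for kmer in ("".join(p) for p in product("ACGT", repeat=k)):
        pos = positions.get(kmer, [])
        if not pos:
            summaries[kmer] = np.empty((0, 2))         # k-mer absent from the sequence
            continue
        pts = np.array(pos, dtype=float).reshape(-1, 1)
        rips = gudhi.RipsComplex(points=pts, max_edge_length=float(len(seq)))
        st = rips.create_simplex_tree(max_dimension=1)
        st.persistence()                               # compute birth-death pairs
        summaries[kmer] = st.persistence_intervals_in_dimension(0)
    return summaries

# Example: 0-dimensional birth-death pairs of the positional point set of every 3-mer.
bars = kmer_position_persistence("ACGTACGTTAGCACGT", k=3)
```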
Large bacteria genome We obtained 30 bacterial whole genomes from the families Bacillaceae, Borreliaceae, Burkholderiaceae, Clostridiaceae, Desulfovibrionaceae, Enterobacteriaceae, Rhodobacteraceae, Staphylococcaceae, and Yersiniaceae. The details of the data can be found in Table 3B.7. All the sequences have over 1 million base pairs, and we used this dataset to test the computational limits of our method. Because the sequences are long, we computed the homology of 3-mers, 4-mers, and 5-mers, and computed the distance using these 3 k-mers. Figure 3.21 shows the phylogenetic tree generated using our method. After the distance was computed, we utilized the UPGMA algorithm to construct the tree. The labels and clades were colored according to the family labels.

Figure 3.21 Phylogenetic tree of 30 bacterial whole genomes. Betti curves for 3-mers, 4-mers, and 5-mers were computed, and the distance was computed using these 3 k-mers. The labels and clades were colored according to their family labels.

Our method correctly separates the individual families into their own clades. However, our method fails to place the samples into the appropriate phyla. Enterobacteriaceae, Yersiniaceae, Burkholderiaceae, and Rhodobacteraceae belong to the phylum Pseudomonadota, and our method only clusters Enterobacteriaceae and Yersiniaceae into the same clade. Additionally, Staphylococcaceae, Clostridiaceae, and Bacillaceae belong to the phylum Bacillota, but only Staphylococcaceae and Clostridiaceae share a clade. This is most likely because our method was not able to compute the 1-mer and 2-mer topology for these long sequences, and the lower-order k-mers should provide insight into higher-order taxonomy.

3.2.4 Discussion and conclusion

K-mer topology is a novel method that computes the persistent homology based on the positions of the k-mers. Our method outperformed the GNV method in viral classification tasks on different versions of the data, indicating its robustness to changes in the taxonomy. We have also validated our method on standard phylogenetic analyses, including bacterial sequences, indicating the scalability of our method. However, our method does come with limitations. First, the accuracy for identifying the correct family in bacteria and viruses is high; however, our method does not perform well when clustering higher-order taxonomy, most notably in the case of bacteria. This is due to the high computational cost of the 1-mer and 2-mer topology for long sequences. Additionally, analyzing genomes longer than bacterial genomes is not possible at this moment. In the future, we want to look at more algebraic topology tools, such as topological Laplacians, hypergraphs, and more. Moreover, we would like to evaluate the nonharmonic portion of our spectra to obtain more geometric information about the DNA sequence. Additionally, to combat the computational limitation for long sequences, we would like to explore the potential of taking segments to increase the efficiency of our algorithm. Lastly, we would like to explore applications to protein sequences.

BIBLIOGRAPHY

[1] Covid-19 weekly epidemiological update, 19 January 2021, 2021.

[2] The species severe acute respiratory syndrome-related coronavirus: classifying 2019-ncov and naming it sars-cov-2. Nature microbiology, 5(4):536–544, 2020.
[3] Fan Wu, Su Zhao, Bin Yu, Yan-Mei Chen, Wen Wang, Zhi-Gang Song, Yi Hu, Zhao-Wu Tao, Jun-Hua Tian, Yuan-Yuan Pei, et al. A new coronavirus associated with human respiratory disease in china. Nature, 579(7798):265–269, 2020. [4] Intikhab Alam, Allan A Kamau, Maxat Kulmanov, Łukasz Jaremko, Stefan T Arold, Arnab Pain, Takashi Gojobori, and Carlos M Duarte. Functional pangenome analysis shows key features of e protein are preserved in sars and sars-cov-2. Frontiers in cellular and infection microbiology, 10:405, 2020. [5] Yu-Nong Gong, Kuo-Chien Tsao, Mei-Jen Hsiao, Chung-Guei Huang, Peng-Nien Huang, Po-Wei Huang, Kuo-Ming Lee, Yi-Chun Liu, Shu-Li Yang, Rei-Lin Kuo, et al. Sars-cov- 2 genomic surveillance in taiwan revealed novel orf8-deletion mutant and clade possibly associated with infections in middle east. Emerging microbes & infections, 9(1):1457–1466, 2020. [6] Peter Forster, Lucy Forster, Colin Renfrew, and Michael Forster. Phylogenetic network analysis of sars-cov-2 genomes. Proceedings of the National Academy of Sciences, 117(17):9241– 9243, 2020. [7] Xingguang Li, Junjie Zai, Qiang Zhao, Qing Nie, Yi Li, Brian T Foley, and Antoine Chaillon. Evolutionary history, potential intermediate animal host, and cross-species analyses of sars- cov-2. Journal of medical virology, 92(6):602–611, 2020. [8] Sunitha M Kasibhatla, Meenal Kinikar, Sanket Limaye, Mohan M Kale, and Urmila Kulkarni- Kale. Understanding evolution of sars-cov-2: a perspective from analysis of genetic diversity of rdrp gene. Journal of medical virology, 92(10):1932–1937, 2020. [9] Rui Wang, Yuta Hozumi, Changchuan Yin, and Guo-Wei Wei. Decoding sars-cov-2 transmis- sion and evolution and ramifications for covid-19 diagnosis, vaccine, and medicine. Journal of chemical information and modeling, 60(12):5853–5865, 2020. [10] Rui Wang, Jiahui Chen, Kaifu Gao, Yuta Hozumi, Changchuan Yin, and Guo-Wei Wei. Characterizing sars-cov-2 mutations in the united states. Research square, 2020. [11] Jiahui Chen, Rui Wang, Menglun Wang, and Guo-Wei Wei. Mutations strengthened sars-cov-2 infectivity. Journal of molecular biology, 432(19):5212–5226, 2020. [12] Rui Wang, Jiahui Chen, Yuta Hozumi, Changchuan Yin, and Guo-Wei Wei. Decoding 100 asymptomatic covid-19 infection and transmission. The journal of physical chemistry letters, 11(23):10007–10015, 2020. [13] Michael Worobey, Jonathan Pekar, Brendan B Larsen, Martha I Nelson, Verity Hill, Jef- frey B Joy, Andrew Rambaut, Marc A Suchard, Joel O Wertheim, and Philippe Lemey. The emergence of sars-cov-2 in europe and north america. Science, 370(6516):564–570, 2020. [14] Yunmeng Bai, Dawei Jiang, Jerome R Lon, Xiaoshi Chen, Meiling Hu, Shudai Lin, Zixi Chen, Xiaoning Wang, Yuhuan Meng, and Hongli Du. Comprehensive evolution and molecular char- acteristics of a large number of sars-cov-2 genomes reveal its epidemic trends. International Journal of Infectious Diseases, 100:164–173, 2020. [15] Yujiro Toyoshima, Kensaku Nemoto, Saki Matsumoto, Yusuke Nakamura, and Kazuma Kiyotani. Sars-cov-2 genomic variations associated with mortality rate of covid-19. Journal of human genetics, 65(12):1075–1082, 2020. [16] Lucy Van Dorp, Mislav Acman, Damien Richard, Liam P Shaw, Charlotte E Ford, Louise Ormond, Christopher J Owen, Juanita Pang, Cedric CS Tan, Florencia AT Boshier, et al. Emergence of genomic diversity and recurrent mutations in sars-cov-2. Infection, Genetics and Evolution, 83:104351, 2020. [17] Roderic DM Page. Space, time, form: viewing the tree of life. 
Trends in ecology & evolution, 27(2):113–120, 2012. [18] Tingting Zhou, Keith CC Chan, Yi Pan, and Zhenghua Wang. An approach for determining In Bioinformatics Research evolutionary distance in network-based phylogenetic analysis. and Applications: Fourth International Symposium, ISBRA 2008, Atlanta, GA, USA, May 6-9, 2008. Proceedings 4, pages 38–49. Springer, 2008. [19] Ian T Jolliffe and Jorge Cadima. Principal component analysis: a review and recent de- velopments. Philosophical transactions of the royal society A: Mathematical, Physical and Engineering Sciences, 374(2065):20150202, 2016. [20] John W Sammon. A nonlinear mapping for data structure analysis. IEEE Transactions on computers, 100(5):401–409, 1969. [21] Chun-houh Chen, Wolfgang Härdle, Antony Unwin, Michael AA Cox, and Trevor F Cox. Multidimensional scaling. Springer, 2008. [22] George C Linderman, Manas Rachh, Jeremy G Hoskins, Stefan Steinerberger, and Yuval Kluger. Fast interpolation-based t-sne for improved visualization of single-cell rna-seq data. Nature methods, 16(3):243–245, 2019. [23] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008. 101 [24] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018. [25] Etienne Becht, Leland McInnes, John Healy, Charles-Antoine Dutertre, Immanuel WH Kwok, Lai Guan Ng, Florent Ginhoux, and Evan W Newell. Dimensionality reduction for visualizing single-cell data using umap. Nature biotechnology, 37(1):38–44, 2019. [26] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embed- ding and clustering. Advances in neural information processing systems, 14, 2001. [27] Jian Tang, Jingzhou Liu, Ming Zhang, and Qiaozhu Mei. Visualizing large-scale and high- dimensional data. In Proceedings of the 25th international conference on world wide web, pages 287–297, 2016. [28] David I Spivak. Metric realization of fuzzy simplicial sets. Preprint, page 4, 2009. [29] Fabian Sievers, Andreas Wilm, David Dineen, Toby J Gibson, Kevin Karplus, Weizhong Li, Rodrigo Lopez, Hamish McWilliam, Michael Remmert, Johannes Söding, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using clustal omega. Molecular systems biology, 7(1):539, 2011. [30] Michael Levandowsky and David Winter. Distance between sets. Nature, 234(5323):34–35, 1971. [31] Sameer A Nene, Shree K Nayar, Hiroshi Murase, et al. Columbia object image library (coil-20). 1996. [32] Benedek Rozemberczki, Carl Allen, and Rik Sarkar. Multi-scale attributed node embedding. Journal of Complex Networks, 9(2):cnab014, 2021. [33] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. [34] D Ulyanov. Multicore-tsne https://github. com/dmitryulyanov. Multicore-TSNE2016, 2016. [35] Bette Korber, Will M Fischer, Sandrasegaram Gnanakaran, Hyejin Yoon, James Theiler, Werner Abfalterer, Nick Hengartner, Elena E Giorgi, Tanmoy Bhattacharya, Brian Foley, et al. Tracking changes in sars-cov-2 spike: evidence that d614g increases infectivity of the covid-19 virus. Cell, 182(4):812–827, 2020. [36] Joseph Minhow Chan, Gunnar Carlsson, and Raul Rabadan. Topology of viral evolution. Proceedings of the National Academy of Sciences, 110(46):18566–18571, 2013. 
[37] Michael Bleher, Lukas Hahn, Maximilian Neumann, Juan Angel Patino-Galindo, Mathieu Carriere, Ulrich Bauer, Raul Rabadan, and Andreas Ott. Topological data analysis identifies 102 emerging adaptive mutations in sars-cov-2. arXiv preprint arXiv:2106.07292, 2021. [38] Dong Quan Ngoc Nguyen, Phuong Dong Tan Le, Lin Xing, and Lizhen Lin. A topological characterization of dna sequences based on chaos geometry and persistent homology. In 2022 International Conference on Computational Science and Computational Intelligence (CSCI), pages 1591–1597. IEEE, 2022. [39] Christopher Tralie, Nathaniel Saul, and Rann Bar-On. Ripser.py: A lean persistent homology library for python. The Journal of Open Source Software, 3(29):925, Sep 2018. [40] Clément Maria, Jean-Daniel Boissonnat, Marc Glisse, and Mariette Yvinec. The gudhi library: Simplicial complexes and persistent homology. In Mathematical Software–ICMS 2014: 4th International Congress, Seoul, South Korea, August 5-9, 2014. Proceedings 4, pages 167–174. Springer, 2014. [41] Nan Sun, Shaojun Pei, Lily He, Changchuan Yin, Rong Lucy He, and Stephen S-T Yau. Geometric construction of viral genome space and its applications. Computational and Structural Biotechnology Journal, 19:4226–4234, 2021. [42] Ji Qi, Bin Wang, and Bai-Iin Hao. Whole proteome prokaryote phylogeny without sequence alignment: Ak-string composition approach. Journal of molecular evolution, 58:1–11, 2004. [43] Ajay Kumar Saw, Garima Raj, Manashi Das, Narayan Chandra Talukdar, Binod Chandra Tripathy, and Soumyadeep Nandi. Alignment-free method for dna sequence clustering using fuzzy integral similarity. Scientific reports, 9(1):3753, 2019. [44] Ann C Palmenberg, David Spiro, Ryan Kuzmickas, Shiliang Wang, Appolinaire Djikeng, Jennifer A Rathe, Claire M Fraser-Liggett, and Stephen B Liggett. Sequencing and analyses of all known human rhinovirus genomes reveal structure and evolution. Science, 324(5923):55– 59, 2009. [45] Mo Deng, Chenglong Yu, Qian Liang, Rong L He, and Stephen S-T Yau. A novel method of characterizing genetic sequences: genome space with biological distance and applications. PloS one, 6(3):e17293, 2011. [46] Tung Hoang, Changchuan Yin, Hui Zheng, Chenglong Yu, Rong Lucy He, and Stephen S-T Yau. A new method to cluster dna sequences using fourier power spectrum. Journal of theoretical biology, 372:135–145, 2015. [47] Patrick CY Woo, Susanna KP Lau, Chung-ming Chu, Kwok-hung Chan, Hoi-wah Tsoi, Yi Huang, Beatrice HL Wong, Rosana WS Poon, James J Cai, Wei-kwang Luk, et al. Characterization and complete genome sequence of a novel coronavirus, coronavirus hku1, from patients with pneumonia. Journal of virology, 79(2):884–895, 2005. [48] Chenglong Yu, Qian Liang, Changchuan Yin, Rong L He, and Stephen S-T Yau. A novel 103 construction of genome space with biological geometry. DNA research, 17(3):155–168, 2010. [49] Lia Van Der Hoek, Krzysztof Pyrc, Maarten F Jebbink, Wilma Vermeulen-Oost, Ron JM Berkhout, Katja C Wolthers, Pauline ME Wertheim-van Dillen, Jos Kaandorp, Joke Spaar- Identification of a new human coronavirus. Nature medicine, garen, and Ben Berkhout. 10(4):368–373, 2004. [50] Paul S Masters. The molecular biology of coronaviruses. Advances in virus research, 66:193– 292, 2006. [51] Kristian G Andersen, Andrew Rambaut, W Ian Lipkin, Edward C Holmes, and Robert F Garry. The proximal origin of sars-cov-2. Nature medicine, 26(4):450–452, 2020. [52] Robert G Webster, William J Bean, Owen T Gorman, Thomas M Chambers, and Yoshihiro Kawaoka. 
Evolution and ecology of influenza a viruses. Microbiological reviews, 56(1):152– 179, 1992. [53] Dennis J Alexander. A review of avian influenza in different bird species. Veterinary micro- biology, 74(1-2):3–13, 2000. 104 APPENDIX 3A ADDITIONAL MATERIALS FOR UMAP-ASSISTED K-MEANS CLUSTERING OF LARGE-SCALE SARS-COV-2 MUTATION DATASETS 3A.1 Cluster Stability The K-means clustering is used to classify the SARS-CoV-2 the SNP variants. The Elbow method is used to determine the optimal number of clusters. Our results demonstrate four main clusters as shown in Figure S1, which plots the within-cluster sum of squares according to the number of clusters k for the SNP variants based on Jaccard distance metric. The optimal values of K-mean clusters is shown as the turning point in the in the elbow plots. Figure 3A.1 Within cluster sum of squares (WCSS) of the UMAP-assisted k-means result of SARS-CoV-2 collected up to January 20, 2021. Using elbow method, the number of cluster was determined to be 6. 105 Figure 3A.2 Within cluster sum of squares (WCSS) of the UMAP-assisted k-means result of US SARS-CoV-2 collected up to January 20, 2021. Using elbow method, the number of cluster was determined to be 6. 3A.2 Clusters distribution Table 3A.1 shows the top 25 mutations of each cluster from the dataset collected up to January 20, 2021. 106 Table 3A.1 Clusters distribution of the top 25 single mutations of SARS-CoV-2 in the world, collected up to January 20, 2021. Top Top 1 Top 2 Top 3 Top 4 Top 5 Top 6 Top 7 Top 8 Top 9 Top 10 Top 11 Top 12 Top 13 Top 14 Top 15 Top 16 Top 17 Top 18 Top 19 Top 20 Top 21 Top 22 Top 23 Top 24 Top 25 Position Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 3450 23403 3450 14408 3450 3037 3064 241 24 26801 0 22227 1 6286 1 21255 1 29645 1 28932 0 445 3450 28881 3450 28882 3450 28883 0 25563 0 27944 6 204 0 1059 0 21614 0 20268 3449 22992 0 18877 0 28854 40 11083 0 26735 125155 125032 125052 124492 57139 57097 56961 56844 56792 56731 56754 28396 28268 28274 22757 38443 32527 13045 25209 9379 7546 7393 9147 10985 7317 22649 22687 22631 22395 76 32 15 57 13 18 1 896 838 816 21663 18 27 17219 26 131 34 2499 386 738 977 23395 23381 23358 23132 62 33 17 82 26 7 9 23418 23413 23435 112 19 37 60 106 19 87 42 72 611 35 14230 14181 14174 13955 33 27 24 11 11 26 2 92 21 16 34 3 8 33 150 6487 37 11 3772 414 52 3488 3488 3481 3477 22 20 16 13 15 10 12 12 15 13 3475 3 3 2 20 1 3477 3480 2 103 3468 Table 3A.2 shows cluster distribution among the top 25 available sequence from each country. 107 Table 3A.2 Cluster distributions of SARS-CoV-2 sequences from top 25 countries with the highest number of sequences as of October 30, 2020. The top 25 countries are the United Kingdom (UK), the United States (US), Australia (AU), India (IN), Switzerland (CH), Netherlands (NL), Canada (CA), France (FR), Belgium (BE), Singapore (SG), Spain (ES), Russia (RU), Portugal (PT), Denmark (DK), Sweden (SE), Austria (AT), Japan (JP), South Africa (ZA), Iceland (IS), Brazil (BR), Saudi Arabia (SA), Norway (NO), China (CN), Italy (IT), and Korea (KR). 
Top Top 1 Top 2 Top 3 Top 4 Top 5 Top 6 Top 7 Top 8 Top 9 Top 10 Top 11 Top 12 Top 13 Top 14 Top 15 Top 16 Top 17 Top 18 Top 19 Top 20 Top 21 Top 22 Top 23 Top 24 Top 25 Country Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 0 0 0 3449 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1291 15042 1256 397 123 728 90 79 154 592 113 20 361 43 25 72 243 27 12 27 65 144 2 198 31 65107 20897 20177 5001 2973 1993 1669 2753 815 627 1099 1161 757 1130 889 972 394 240 508 21 683 461 96 414 225 8200 3691 954 466 368 258 357 112 1168 175 473 139 421 97 307 152 304 718 456 845 82 153 567 49 388 3385 3760 215 280 356 496 656 156 568 505 361 598 329 110 279 44 113 82 36 61 50 131 170 47 118 1030 5 317 9 238 3 425 19 0 439 164 15 40 512 27 1 18 0 0 0 62 0 0 61 1 UK US DK AU NL CA CH IS IN FR BE ES DE LU IT SG SE AE BR RU NO CL ZA FI PT Table 3A.3 shows the top 25 mutation of each cluster in the US dataset, collected up to January 20, 2021. 108 Table 3A.3 Clusters distribution of the top 25 single mutations of SARS-CoV-2 in the United States, collected up to January 20, 2021. Top Top 1 Top 2 Top 3 Top 4 Top 5 Top 6 Top 7 Top 8 Top 9 Top 10 Top 11 Top 12 Top 13 Top 14 Top 15 Top 16 Top 17 Top 18 Top 19 Top 20 Top 21 Top 22 Top 23 Top 24 Top 25 Position Cluster A Cluster B Cluster C Cluster D Cluster E Cluster F 4677 23403 4661 14408 4666 3037 4594 241 33 25563 21 1059 8 27964 13 10319 19 21304 5 18424 6 28869 3 20268 8 25907 10 28472 12 28854 4644 28881 4641 28882 4656 28883 32 14805 0 8083 12 8782 3 28144 2 24076 10 18060 13 17747 20621 20609 20577 20498 15710 15160 10205 7790 6333 6321 6251 2369 6133 6086 2585 1207 1207 1193 2857 2527 768 749 320 271 311 4890 4889 4888 4762 44 36 9 11 13 7 22 3785 7 16 3440 35 5 2 4 1 0 3 1759 40 8 3 3 2 3 0 1 1 0 0 0 0 0 0 0 0 1 0 0 16 1 1575 1572 0 1551 1498 9438 9445 9416 9345 9432 8030 3 7 13 3 7 4 10 8 62 20 10 2 4 1 2 5 0 7 12 635 634 630 630 635 634 0 0 0 0 0 0 0 1 6 2 1 1 0 3 0 0 0 0 0 Table 3A.3 shows the cluster statistics for each states with more than 50 SARS-CoV-2 genome isolates. 109 Table 3A.4 Cluster statistics for states with more than 50 SARS-CoV-2 genome samples. State Michigan Washington Virginia Maryland Alaska Idaho New Jersey Kentucky Wyoming Arizona Arkansas Tennessee Pennsylvania Missouri California Indiana Florida New Mexico Maine Nevada Iowa Minnesota Colorado Kansas Nebraska New York Oregon Vermont South Carolina Montana South Dakota Alabama Georgia District of Columbia Utah Massachusetts Oklahoma North Carolina Connecticut Cluster A Cluster B Cluster C Cluster D Cluster E Cluster F 709 1275 874 1790 477 389 45 783 546 231 343 120 339 181 36 232 80 105 131 53 63 25 36 82 110 22 72 13 67 17 32 5 19 9 22 33 31 8 10 5884 1961 1853 1525 2034 1053 1564 518 544 678 569 432 346 377 210 157 288 201 181 211 158 213 76 75 22 70 77 129 16 43 39 72 44 23 29 31 26 37 30 968 943 1084 275 293 50 45 66 146 50 168 82 54 63 18 35 26 44 26 8 24 11 48 18 11 47 13 0 5 7 0 0 9 13 15 0 3 2 1 527 1962 169 433 279 152 36 64 39 205 46 204 68 49 391 59 54 40 52 29 47 4 2 5 17 10 5 4 5 34 9 3 2 21 3 3 1 12 1 3 216 921 37 12 121 1 7 16 47 16 15 10 1 4 30 7 13 2 11 10 5 43 4 17 0 2 7 1 0 0 0 2 5 0 0 1 0 1 25 100 8 341 5 3 0 3 59 3 5 0 9 2 0 1 0 9 0 0 2 0 0 18 2 27 0 0 8 0 2 0 3 0 0 0 0 0 0 Table 3A.5 shows the world wide cluster statistics from SARS-CoV-2 genome collected up to June 01, 2020, which is our previous result from [9]. 
The listed countries are the United States 110 (US), Canada (CA), Australia (AU), United Kingdom (UK), Germany (DE), France (FR), Italy (IT), Russia (RU), China (CN), Japan (JP), Korean (KR), India (IN), Spain (ES), Saudi Arabia (SA), and Turkey (TR). Table 3A.5 The world wide clusters from SARS-CoV-2 genome data available up to June 01, 2020. The listed countries are the United States (US), Canada (CA), Australia (AU), United Kingdom (UK), Germany (DE), France (FR), Italy (IT), Russia (RU), China (CN), Japan (JP), Korean (KR), India (IN), Spain (ES), Saudi Arabia (SA), and Turkey (TR) [9]. Country Cluster I Cluster II Cluster III Cluster IV Cluster V Cluster VI US CA AU UK DE FR IT RU CN JP KR IN ES SA TR 844 12 163 539 10 41 26 10 8 0 0 93 27 14 25 311 29 149 875 20 85 24 27 3 3 0 69 100 31 3 488 17 410 908 21 14 9 1 215 68 28 0 141 74 9 24 156 16 135 1532 38 12 17 109 1 20 0 10 25 1 9 1813 19 146 119 42 82 0 3 1 3 0 3 3 2 0 975 41 77 3 0 0 0 0 25 0 0 0 2 0 0 PCA was used to reduce the same dataset as Table 3A.5 by reduction ratio of 1/160. Table 3A.6 shows the world-wide cluster statistics of the PCA-assisted K-means. K = 6 was used to compare with the original result from [9]. 111 Table 3A.6 The world wide clusters from SARS-CoV-2 genome data available up to June 01, 2020 using PCA embedding with reduction ratio of 1/160. Country Cluster Ip Cluster IIp Cluster IIIp Cluster IVp Cluster Vp Cluster VIp US CA AU UK DE FR IT RU CN JP KR IN ES SA TR 915 14 164 543 10 46 26 10 8 0 0 95 27 30 27 489 17 414 908 21 14 9 1 213 68 28 141 74 9 24 239 27 143 857 20 80 24 27 3 3 0 67 100 15 1 156 16 136 1546 38 12 17 109 1 20 0 10 25 1 9 1813 19 146 119 42 82 0 3 1 3 0 3 3 2 0 975 41 77 3 0 0 0 0 24 0 0 0 2 0 0 UMAP was used to reduce the same dataset as Table 3A.5 by reduction ratio of 1/160. Table 3A.7 shows the world-wide cluster statistics of the UMAP-assisted K-means. K = 6 was used to compare with the original result from [9]. Table 3A.7 The world wide clusters from SARS-CoV-2 genome data available up to June 01, 2020 using UMAP embedding with reduction ratio of 1/160. Country Cluster Iu Cluster IIu Cluster IIIu Cluster IVu Cluster Vu Cluster VIu US CA AU UK DE FR IT RU CN JP KR IN ES SA TR 2446 71 784 2171 57 163 13 92 178 36 18 232 205 56 56 1096 15 94 115 40 45 1 2 28 0 0 3 2 0 1 751 35 18 2 0 0 0 0 10 0 1 0 0 0 0 110 1 83 534 5 11 5 0 22 47 9 2 7 0 0 94 3 37 326 15 5 22 7 6 0 0 72 5 0 0 90 9 64 828 14 10 35 49 6 11 0 7 12 1 4 112 APPENDIX 3B ADDITIONAL MATERIALS FOR K-MERS TOPOLOGY FOR ALIGNEMNT-FREE SEQUENCE ANALYSIS The statistics of each dataset used for the phylogenetic is listed below. • Table 3B.1: Ebolavirus data • Table 3B.2: Mammalian mitochondria data • Table 3B.3: Rhinovirus data • Table 3B.4: Coronavirus data • Table 3B.5: Infleunza type A data • Table 3B.6: Bacterial 16S rDNA isolates data • Table 3B.7: Bacteria whole genome data. Table 3B.1 Accession, group, sequence length of ebolavirus data. 
Accession FJ217161.1 KC545395.1 KC545396.1 AF522874.1 JX477166.1 FJ621583.1 FJ968794.1 EU338380.1 JN638998.1 KC545390.1 KC545392.1 KC242801.1 KC242791.1 KC242793.1 AY354458.1 KC242799.1 KC242786.1 KC242789.1 KC242790.1 KC242800.1 KM034562.1 KM034557.1 KM233050.1 KM233057.1 KM233072.1 KM233070.1 KM233097.1 KM233096.1 KJ660346.2 KJ660348.2 Group BDBV BDBV BDBV RESTV RESTV RESTV SUDV SUDV SUDV SUDV SUDV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV NA 5964 5975 5975 5937 5935 5928 5905 5914 5931 5925 5923 6061 6061 6043 6054 6054 6063 6062 6060 6042 6051 6052 6053 6052 6049 6055 6051 6054 6053 6055 NC 4324 4290 4293 3929 3920 3929 4034 4032 4059 4080 4081 4037 4037 4052 4051 4050 4025 4023 4025 4052 4050 4052 4050 4052 4049 4052 4052 4049 4052 4051 NG 3632 3635 3635 3746 3755 3767 3756 3750 3729 3730 3732 3752 3752 3761 3747 3748 3749 3750 3751 3762 3756 3753 3753 3751 3752 3753 3752 3751 3755 3754 NT 5020 5039 5036 5279 5281 5263 5180 5179 5156 5139 5138 5109 5109 5102 5109 5107 5121 5123 5122 5102 5100 5099 5100 5099 5099 5099 5098 5099 5099 5099 Accession KC545393.1 KC545394.1 FJ217162.1 AB050936.1 FJ621585.1 JX477165.1 KC242783.2 AY729654.1 KC545389.1 KC545391.1 KC589025.1 NC_002549.1 KC242792.1 KC242794.1 KC242796.1 KC242784.1 KC242787.1 KC242785.1 KC242788.1 KM034555.1 KM233039.1 KM034560.1 KM233053.1 KM233063.1 KM233110.1 KM233099.1 KM233109.1 KM233103.1 KJ660347.2 Group BDBV BDBV TAFV RESTV RESTV RESTV SUDV SUDV SUDV SUDV SUDV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV NA 5974 5974 6020 5927 5900 5924 5911 5920 5924 5924 5921 6061 6047 6039 6055 6061 6062 6060 6063 6049 6052 6050 6053 6053 6053 6052 6053 6051 6054 NC 4293 4292 4371 3924 3898 3929 4028 4071 4080 4080 4047 4035 4052 4063 4049 4028 4025 4026 4032 4049 4051 4051 4051 4052 4053 4052 4055 4050 4052 NG 3636 3637 3630 3762 3747 3774 3750 3732 3731 3731 3734 3752 3756 3762 3748 3750 3750 3752 3749 3753 3751 3752 3753 3751 3752 3751 3753 3751 3754 NT 5036 5036 4914 5277 5251 5260 5186 5152 5139 5139 5173 5111 5104 5095 5107 5119 5121 5120 5114 5099 5099 5099 5100 5099 5098 5098 5097 5098 5099 113 Table 3B.2 Accession, group, sequence length of mammalian mitochondria data. 
Accession V00662.1 D38116.1 D38113.1 D38114.1 X99256.1 Y18001.1 AY863426.1 D38115.1 NC_002083.1 NC_002764.1 U20753.1 U96639.2 EU442884.2 EF551003.1 EF551002.1 DQ402478.1 AF303110.1 AF303111.1 EF212882.1 AJ002189.1 AF010406.1 AF533441.1 V00654.1 AY488491.1 NC_007441.1 NC_008830.1 NC_010640.1 X72204.1 NC_005268.1 NC_001321.1 NC_005270.1 NC_005275.1 NC_006931.1 NC_001788.1 X97336.1 Y07726.1 NC_001640.1 AJ238588.1 AJ001562.1 AJ001588.1 X88898.2 Groups Primates Primates Primates Primates Primates Primates Primates Primates Primates Primates Carnivore Carnivore Carnivore Carnivore Carnivore Carnivore Carnivore Carnivore Carnivore Artiodactyla Artiodactyla Artiodactyla Artiodactyla Artiodactyla Artiodactyla Artiodactyla Artiodactyla Cetacea Cetacea Cetacea Cetacea Cetacea Cetacea Perissodactyla Perissodactyla Perissodactyla Perissodactyla Rodentia Rodentia Lagomorpha Erinaceomorpha Length 16569 16563 16554 16364 16472 16521 16389 16389 16499 16586 17009 16727 16774 16990 16964 16868 17020 17017 16805 16680 16616 16640 16338 16355 16498 16719 16524 16402 16390 16398 16412 16324 16386 16670 16829 16832 16660 16507 16602 17245 17447 114 NG 2176 2104 2133 2160 2256 2169 2049 2168 2176 2116 NC 5176 5084 5099 5022 5231 5047 4953 5317 5403 5027 NT NA 4094 5123 4186 5189 4168 5154 4123 5059 3946 5039 4110 5195 4137 5243 3897 5007 3889 5031 5306 4137 5543 4454 2406 4606 5290 4267 2366 4804 5293 4265 2398 4812 5418 4513 2478 4581 5397 4508 2467 4592 5270 4285 2601 4712 5258 4355 2676 4731 5253 4346 2692 4726 5338 4000 2518 4949 4296 5790 4552 5594 4569 5569 4443 5460 4375 5421 4434 5542 4371 5786 5519 4396 5374 4527 2140 4361 5354 4609 2162 4265 5359 4474 2182 4383 5374 4626 2153 4259 5377 4525 2040 4382 5357 4573 2164 4292 4259 5394 4405 5663 4333 5623 4312 5358 5094 5301 5386 5207 5429 4584 2350 4882 5937 3503 2185 5822 4384 4289 4313 4237 4298 4358 4340 4404 2210 2181 2189 2198 2261 2164 2222 2205 2198 2131 2169 2236 2071 2096 4819 4630 4707 4754 4041 3913 Table 3B.3 Accession, group, sequence length of rhinovirus data. 
Accession AF499637.1 AY751783.1 DQ473486.1 DQ473489.1 DQ473491.1 DQ473493.1 DQ473496.1 DQ473499.1 DQ473504.1 DQ473506.1 DQ473508.1 DQ473511.1 EF077280.1 EF173415.1 EF173423.1 EF186077.2 EF582386.1 FJ445111.1 FJ445113.1 FJ445115.1 FJ445117.1 FJ445119.1 FJ445122.1 FJ445124.1 FJ445126.1 FJ445128.1 FJ445130.1 FJ445132.1 FJ445134.1 FJ445136.1 FJ445138.1 FJ445140.1 FJ445142.1 FJ445144.1 FJ445146.1 FJ445148.1 FJ445151.1 FJ445153.1 FJ445155.1 FJ445157.1 FJ445159.1 FJ445161.1 FJ445163.1 FJ445165.1 FJ445167.1 FJ445169.1 FJ445171.1 FJ445173.1 FJ445175.1 FJ445177.1 FJ445179.1 FJ445181.1 FJ445183.1 FJ445185.1 FJ445187.1 FJ445189.1 L05355.1 V01149.1 Group HEV A B B A A A A A A A A C A B C C A A A A A A B A A B A A A A A A A A A B B B A A B A A A B A A A A A A A A B A B HEV NA 2243 2348 2381 2365 2301 2370 2363 2333 2253 2415 2371 2367 2153 2299 2384 2296 2304 2388 2376 2377 2321 2354 2327 2403 2354 2336 2405 2352 2360 2349 2353 2342 2233 2340 2320 2390 2316 2372 2369 2357 2331 2382 2355 2235 2394 2338 2316 2387 2319 2347 2318 2333 2356 2344 2372 2370 2313 2206 NC 1677 1362 1428 1476 1383 1287 1342 1301 1380 1349 1389 1288 1550 1396 1456 1549 1492 1284 1387 1321 1352 1310 1369 1397 1292 1347 1423 1406 1364 1350 1326 1323 1363 1361 1352 1339 1499 1461 1421 1339 1334 1376 1334 1388 1295 1430 1344 1299 1317 1309 1336 1379 1379 1384 1411 1328 1460 1737 NG 1662 1428 1431 1463 1454 1442 1389 1427 1402 1412 1390 1386 1500 1416 1407 1520 1480 1388 1407 1426 1421 1421 1445 1413 1414 1429 1405 1424 1404 1425 1418 1381 1415 1416 1411 1393 1504 1452 1433 1406 1437 1460 1412 1416 1409 1463 1446 1381 1433 1449 1417 1441 1429 1444 1466 1408 1475 1711 NT 1876 1999 1976 1919 2007 2035 2012 2062 2108 1972 1998 1995 1812 2013 1969 1769 1838 2074 1938 2009 2049 2048 1971 1996 2069 2019 1989 1932 1981 2025 2036 2088 2128 2022 2058 2016 1890 1931 2001 2011 2014 2011 2039 2111 2025 2002 2026 2066 2071 2021 2022 1972 1981 1957 1971 2011 1964 1786 Accession AF546702.1 DQ473485.1 DQ473488.1 DQ473490.1 DQ473492.1 DQ473494.1 DQ473497.1 DQ473500.1 DQ473505.1 DQ473507.1 DQ473510.1 EF077279.1 EF173414.1 EF173420.1 EF173425.1 EF582385.1 EF582387.1 FJ445112.1 FJ445114.1 FJ445116.1 FJ445118.1 FJ445121.1 FJ445123.1 FJ445125.1 FJ445127.1 FJ445129.1 FJ445131.1 FJ445133.1 FJ445135.1 FJ445137.1 FJ445139.1 FJ445141.1 FJ445143.1 FJ445145.1 FJ445147.1 FJ445149.1 FJ445152.1 FJ445154.1 FJ445156.1 FJ445158.1 FJ445160.1 FJ445162.1 FJ445164.1 FJ445166.1 FJ445168.1 FJ445170.1 FJ445172.1 FJ445174.1 FJ445176.1 FJ445178.1 FJ445180.1 FJ445182.1 FJ445184.1 FJ445186.1 FJ445188.1 FJ445190.1 L24917.1 X02316.1 Group HEV B B B A A A A A A A C A B B C C B A A A A A A A A A A A B A A A A A A A A A A A B B A B A B B A A A A A B B A A A NA 2209 2338 2382 2362 2297 2389 2309 2336 2247 2407 2406 2176 2369 2356 2426 2195 2261 2402 2375 2298 2386 2376 2358 2332 2375 2380 2397 2342 2371 2334 2364 2414 2390 2371 2353 2377 2375 2368 2389 2336 2325 2392 2388 2227 2376 2371 2403 2404 2284 2331 2387 2334 2241 2377 2322 2392 2383 2324 NC 1649 1436 1456 1417 1371 1314 1322 1344 1404 1350 1313 1503 1310 1500 1412 1565 1533 1391 1329 1356 1340 1328 1277 1299 1287 1311 1275 1275 1328 1517 1327 1287 1315 1270 1383 1325 1362 1281 1341 1333 1325 1420 1374 1387 1515 1382 1410 1400 1339 1369 1314 1347 1381 1401 1497 1307 1331 1347 NG 1681 1443 1447 1420 1443 1437 1396 1409 1409 1426 1424 1509 1403 1477 1432 1473 1513 1417 1428 1437 1423 1403 1399 1416 1408 1391 1436 1439 1395 1473 1413 1401 1391 1391 1429 1435 1427 1406 1421 1434 1454 1410 1420 1425 1450 1413 1418 1384 1382 1420 1425 
1420 1406 1447 1490 1389 1412 1418 NT 1867 1991 1929 2013 2029 1980 1998 2046 2081 1960 1994 1756 2043 1886 1944 1866 1779 2002 2002 2049 1970 2026 2091 2075 2062 2056 2019 2074 2022 1891 2029 2031 2042 2095 1997 1998 1997 2077 1984 2013 2019 1977 2028 2113 1878 1944 1975 2018 2140 2017 2010 2027 2121 1992 1907 2044 1998 2013 115 Table 3B.4 Accession, group, sequence length of coronavirus data. Group Group 1 Group 1 Group 1 Group 2 Group 2 Group 2 Group 2 Group 2 Group 2 Group 2 Group 2 Group 2 Group 3 Group 3 Group 4 Group 4 Group 4 Group 4 Group 4 Group 4 Group 4 Group 4 Group 4 Group 4 Group 4 Group 4 Group 4 Group 4 Group 4 Group 5 Accession AF304460.1 AF353511.1 NC_005831.2 AY391777.1 U00735.2 AF391542.1 AF220295.1 NC_003045.1 AF208067.1 AF201929.1 AF208066.1 NC_001846.1 NC_001451.1 EU095850.1 AY278488.2 AY278741.1 AY278491.2 AY278554.2 AY282752.2 AY283794.1 AY283795.1 AY283796.1 AY283797.1 AY283798.2 AY291451.1 NC_004718.3 AY297028.1 AY572034.1 AY572035.1 NC_006577.2 NC_001564.2 Flaviviridae outgroup NC_004102.1 Flaviviridae outgroup NC_001512.1 Togaviridae outgroup NC_001544.1 Togaviridae outgroup Length 27317 28033 27553 30738 31032 31028 31100 31028 31233 31276 31112 31357 27608 27657 29725 29727 29742 29736 29736 29711 29705 29711 29706 29711 29729 29751 29715 29540 29518 29926 10682 9646 11835 11657 NA 7420 6937 7253 8485 8490 8486 8544 8487 8087 8117 8030 8138 7967 7969 8465 8455 8475 8476 8476 8453 8447 8453 8451 8453 8457 8481 8458 8402 8395 8331 2618 1889 3676 3220 NC 4549 5382 3979 4658 4713 4743 4711 4752 5591 5548 5534 5614 4479 4513 5941 5940 5942 5942 5939 5937 5936 5936 5935 5935 5940 5940 5934 5911 5907 3895 2531 2893 2860 2901 NG 5903 6397 5516 6655 6774 6772 6790 6767 7466 7422 7416 7487 5993 6066 6185 6188 6183 6185 6185 6184 6187 6185 6184 6185 6188 6187 6187 6154 6151 5699 2919 2724 2859 3065 NT 9445 9317 10805 10940 11055 11027 11055 11022 10089 10189 10132 10118 9169 9108 9134 9144 9142 9133 9136 9137 9135 9137 9135 9138 9144 9143 9135 9073 9065 12001 2614 2140 2440 2416 116 Table 3B.5 Accession, group, sequence length of influenza A virus data. 
Accession HM370969.1 H1N1 CY138562.1 H1N1 CY149630.1 H1N1 KC608160.1 H1N1 AM157358.1 H1N1 AB470663.1 H1N1 AB546159.1 H1N1 HQ897966.1 H1N1 EU026046.2 H1N1 FJ357114.1 H1N1 GQ411894.1 H1N1 CY140047.1 H1N1 KM244078.1 H1N1 HQ185381.1 H5N1 HQ185383.1 H5N1 EU635875.1 H5N1 FM177121.1 H5N1 AM914017.1 H5N1 KF572435.1 H5N1 AF509102.2 H5N1 AB684161.1 H5N1 EF541464.1 H5N1 JF699677.1 H5N1 GU186511.1 H5N1 EU500854.1 H7N3 CY129336.1 H7N3 CY076231.1 H7N3 CY039321.1 H7N3 AY646080.1 H7N3 KF259734.1 H7N9 KF938945.1 H7N9 KF259688.1 H7N9 KC609801.1 H7N9 CY014788.1 H7N9 CY186004.1 H7N9 DQ017487.1 H2N2 CY005540.1 H2N2 JX081142.1 H2N2 Group Length NA NC NG NT 373 453 383 437 385 441 381 409 381 418 393 418 378 421 389 422 384 439 392 438 376 430 385 440 367 447 365 406 366 408 358 397 368 407 365 398 355 403 364 401 363 404 359 396 362 404 374 407 355 475 348 470 340 467 343 470 355 485 309 478 312 483 312 490 314 488 317 500 308 494 386 445 384 455 397 446 1419 1422 1433 1398 1413 1422 1410 1410 1433 1433 1413 1433 1410 1350 1350 1350 1370 1350 1350 1366 1350 1350 1350 1370 1453 1428 1420 1434 1453 1398 1404 1413 1426 1460 1422 1467 1467 1457 330 343 346 357 355 359 351 353 347 350 346 347 335 339 336 347 350 344 345 344 348 349 348 345 339 332 327 333 329 321 322 320 332 337 317 355 344 349 263 259 261 251 259 252 260 246 263 253 260 261 261 240 240 248 245 243 247 257 235 246 236 244 284 278 286 288 284 290 287 291 292 306 303 281 284 265 117 Table 3B.6 Accession, group, sequence length of bacteria 16S rDNA data. 314 435 447 319 324 431 318 432 318 431 318 434 177 232 325 441 299 226 317 430 318 420 298 400 324 436 Length NA NC NG NT 345 1104 230 263 265 256 146 761 194 165 464 287 1452 361 340 262 295 301 1253 395 241 286 372 296 1195 250 339 282 1099 228 206 154 179 179 718 263 326 419 327 1335 260 314 430 335 1339 422 267 335 313 1337 268 317 1334 268 332 1366 262 338 1356 262 337 1350 262 335 1346 262 337 1351 149 184 742 262 337 1365 231 278 1035 262 334 1344 269 336 1343 253 318 1269 263 341 1364 442 278 345 322 1387 265 337 1356 262 335 1346 251 329 1294 261 338 1347 143 189 753 261 336 1345 161 193 785 274 352 1411 146 201 796 255 338 1319 263 335 1345 260 335 1341 260 335 1345 279 345 1346 255 325 1322 260 331 1337 325 429 317 431 312 401 316 432 169 252 317 431 188 243 330 455 183 266 309 417 317 430 315 431 318 432 306 416 321 419 315 428 Accession Family KY486204.1 Methylobacteriaceae KY486205.1 Xanthomonadaceae KY486206.1 Xanthomonadaceae KY486207.1 Intrasporangiaceae KY486218.1 Microbacteriaceae Pseudomonadaceae KY486219.1 Bacillaceae KY927407.1 KY486220.1 Paenibacillaceae Enterobacteriaceae KY486221.1 KY486222.1 Xanthomonadaceae KY486223.1 Microbacteriaceae KY486209.1 Rhodanobacteraceae Enterobacteriaceae KY486210.1 Enterobacteriaceae KY486232.1 KY019246.1 Enterobacteriaceae Enterobacteriaceae KY013009.1 KY927404.1 Microbacteriaceae Enterobacteriaceae KY486211.1 Staphylococcaceae KY013011.1 Enterobacteriaceae KY019245.1 Bacillaceae KY013010.1 KY486208.1 Enterobacteriaceae KY486212.1 Microbacteriaceae KY486213.1 Xanthomonadaceae Enterobacteriaceae KY486228.1 Enterobacteriaceae KY486224.1 Enterobacteriaceae KY486225.1 Enterobacteriaceae KY486226.1 KY927405.1 Enterobacteriaceae Enterobacteriaceae KY486227.1 KY927408.1 Microbacteriaceae KY486214.1 Enterobacteriaceae KY927406.1 Microbacteriaceae Enterobacteriaceae KY486215.1 Enterobacteriaceae KY486216.1 Enterobacteriaceae KY486217.1 Enterobacteriaceae KY486229.1 Pseudomonadaceae KY486230.1 Enterobacteriaceae KY486231.1 Enterobacteriaceae KY019244.1 
118 Table 3B.7 Accession, group, sequence length of bacteria data. Family Bacillaceae Bacillaceae Bacillaceae Bacillaceae Borreliaceae Borreliaceae Borreliaceae Borreliaceae Clostridiaceae Clostridiaceae Clostridiaceae Accession CP001598.1 AE016879.1 CP001215.1 AE017225.1 CP000976.1 CP000048.1 CP000993.1 CP000049.1 CP000246.1 CP000312.1 BA000016.3 CP000527.1 Desulfovibrionaceae AE017285.1 Desulfovibrionaceae CP002297.1 Desulfovibrionaceae AM260480.1 CP000091.1 CP000578.1 CP001151.1 AM295250.1 AE015929.1 AP006716.1 CP001837.1 AL590842.1 CP001585.1 AE009952.1 CP001593.1 CP001671.1 CP000468.1 CP001383.1 AE005674.2 Burkholderiaceae Burkholderiaceae Rhodobacteraceae Rhodobacteraceae Staphylococcaceae Staphylococcaceae Staphylococcaceae Staphylococcaceae Yersiniaceae Yersiniaceae Yersiniaceae Yersiniaceae Enterobacteriaceae Enterobacteriaceae Enterobacteriaceae Enterobacteriaceae Length 5227419 5227293 5230115 5228663 931674 922307 930981 917330 3256683 2897393 3031430 3462887 3570858 3532052 2912490 2726152 1219053 1297647 2566424 2499279 2685015 2658366 4653728 4640720 4600755 4553586 5131397 5082025 4650856 4607202 NA 1685408 1685374 1667671 1685622 335148 321202 335785 322493 1148078 1017083 1060154 639427 659017 652227 486037 480326 192966 204455 833636 837991 907537 878689 1219520 1216182 1200303 1187961 1271011 1256126 1145625 1133784 NC 930043 930007 974191 930391 129225 137556 129350 133424 470943 423439 446732 1092219 1127624 1113805 972789 883953 416292 445520 449856 405441 437414 445972 1102670 1097565 1090469 1077463 1298314 1285309 1187110 1176618 NG 919269 919244 876267 919481 127787 137547 126839 133693 453276 395046 419228 1089813 1127109 1116142 972298 887595 420323 446045 438966 396707 443072 454367 1114185 1112625 1101384 1093635 1297349 1283517 1177854 1167963 NT 1692699 1692668 1711986 1693169 339514 326002 339007 327720 1184386 1061825 1105316 641428 657108 649878 481366 474278 189472 201627 843965 859140 896992 879338 1217353 1214348 1208599 1194527 1264723 1256945 1140266 1128831 119 CHAPTER 4 RESIDUE-SIMILARITY SCORES AND INDEXES Traditionally, data visualization involves reducing data into 2 or 3 feature components. However, aggressive reduction often leads to poor representations for data with high intrinsic dimensions, despite producing visually appealing results. For classification problems with 2 classes, the Receiver Operating Characteristic (ROC) curve and Area Under the ROC Curve (AUC) curve can effectively show performance. However, not all classification problems are binary. In this section, we introduce a new visualization tool called Residue-Similarity (R-S) scores or R-S plots. Unlike traditional methods, R-S plots can be applied to an arbitrary number of classes, providing a more comprehensive visualization solution. 4.1 Methods 4.1.1 Residue-Similarity score and indexes An R-S plot consists of two components, residue and similarity scores. Assume that the data is {(xm, ym)|xm ∈ RN, ym ∈ ZL}M m=1, where xm is the mth data. For classification problems, ym is the ground truth, and for clustering problems, ym is the cluster label. Here, N is the number of features and M is the number of samples. L is the number of classes, that is ym ∈ [0, 1, ..., L – 1]. We can partition X = {xm}M m=1 into L classes by taking Cl = {xm ∈ X|ym = l}. Note that ⊎L–1 l=0 The residue score is defined as the inter-class sum of distances. Suppose ym = l. Then, the Cl = X. 
residue score for $x_m$ is given by

$$ R_m := R(x_m) = \frac{1}{R_{\max}} \sum_{x_j \notin C_l} \| x_m - x_j \|, \qquad (4.1) $$

where $\| \cdot \|$ is the distance between a pair of vectors and $R_{\max} = \max_{x_m \in X} R(x_m)$ is the maximal residue score. The similarity score is given by taking the average intra-class score. That is, for $y_m = l$,

$$ S_m := S(x_m) = \frac{1}{|C_l|} \sum_{x_j \in C_l} \left( 1 - \frac{\| x_m - x_j \|}{d_{\max}} \right), \qquad (4.2) $$

where $d_{\max} = \max_{x_i, x_j \in X} \| x_i - x_j \|$ is the maximal pairwise distance of the dataset. Note that by scaling, $0 \le R(x_m) \le 1$ and $0 \le S(x_m) \le 1$ for all $x_m$. In this work, we employ the Euclidean distance in our R-S scores. However, other distance metrics can be used as well. In general, a large $R(x_m)$ indicates that the sample is far from the other classes, and a large $S(x_m)$ indicates that the sample is well clustered. Since $R_{\max}$ and $d_{\max}$ are defined for the whole dataset, residue and similarity scores in different classes can be compared. The residue score and similarity score can be used to visualize each class separately, where $R(x)$ is the x-axis and $S(x)$ is the y-axis.

In the case of classification, define $\{(x_m, y_m, \hat{y}_m) \,|\, x_m \in \mathbb{R}^N, y_m \in \mathbb{Z}_L, \hat{y}_m \in \mathbb{Z}_L\}_{m=1}^{M}$, where $\hat{y}_m$ is the predicted label for the $m$th sample. Then, we can repeat the above process using the ground truth and visualize each class separately. By coloring each data point with its predicted label $\hat{y}_m$, we obtain the R-S score visualization of the classification.

The class residue index (CRI) and class similarity index (CSI) can be easily defined for the $l$th class as $\mathrm{CRI}_l = \frac{1}{|C_l|} \sum_{x_m \in C_l} R_m$ and $\mathrm{CSI}_l = \frac{1}{|C_l|} \sum_{x_m \in C_l} S_m$, respectively. Such indexes can be used to compare the distributions in different classes obtained by different methods. The above indices depend on clusters or classes. It is more useful to construct class-independent global indices. To this end, we first define the residue index (RI) and similarity index (SI) as $\mathrm{RI} = \frac{1}{L} \sum_l \mathrm{CRI}_l$ and $\mathrm{SI} = \frac{1}{L} \sum_l \mathrm{CSI}_l$, respectively. All of these indexes have the range [0,1], and the larger the better for a given dataset. Additionally, we define the R-S disparity (RSD) as $\mathrm{RSD} = \mathrm{RI} - \mathrm{SI}$. RSD ranges over [-1,1]. Finally, we define the R-S index (RSI) as $\mathrm{RSI} = 1 - |\mathrm{RI} - \mathrm{SI}|$. The R-S index has the range [0,1]. The Rand index is known to correlate with accuracy [1]. We speculate that the R-S disparity may correlate with the convergence of clustering and that the R-S index may correlate with the accuracy of classification. The R-S disparity and R-S index can be used to measure the performance of different methods.
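A direct sketch of Eqs. (4.1)–(4.2) and the derived indexes, using the Euclidean distance as in this work, is given below. The function and variable names are illustrative, and the loop-based implementation favors clarity over speed.

```python
# Sketch of the residue (R) and similarity (S) scores of Eqs. (4.1)-(4.2) and the
# derived indexes (CRI, CSI, RI, SI, RSD, RSI), using the Euclidean distance.
import numpy as np

def rs_scores(X: np.ndarray, y: np.ndarray):
    """Per-sample residue and similarity scores for data X (M x N) with labels y (M,)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    d_max = D.max()
    R = np.empty(len(X))
    S = np.empty(len(X))
    for m in range(len(X)):
        same = (y == y[m])
        R[m] = D[m, ~same].sum()                    # inter-class sum of distances
        S[m] = np.mean(1.0 - D[m, same] / d_max)    # average intra-class similarity
    R /= R.max()                                    # divide by R_max so 0 <= R_m <= 1
    return R, S

def rs_indexes(R: np.ndarray, S: np.ndarray, y: np.ndarray):
    """Class indexes CRI_l, CSI_l and the global RI, SI, RSD, and RSI."""
    labels = np.unique(y)
    CRI = np.array([R[y == l].mean() for l in labels])
    CSI = np.array([S[y == l].mean() for l in labels])
    RI, SI = CRI.mean(), CSI.mean()
    return {"RI": RI, "SI": SI, "RSD": RI - SI, "RSI": 1.0 - abs(RI - SI)}
```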
Considering two boundary operators ∂t q+1 : Cq+1(Kt+p) ↦→ Cq(Kt+p), where Cq(Kt+p) is a chain group and Kt ⊂ Kt+p are simplicial complexes generated by a filtration. Denote ∂t+p q+1 q : Cq(Kt) ↦→ Cq–1(Kt) and ∂t+p q+1 such as as ðt,p |Ct,p q+1 Ct,p q+1 = {α ∈ Ct+p q+1 | ∂t+p q+1(α) ∈ Ct q}. (4.3) Namely, Ct,p q+1 consists of elements whose images under ∂t+p q+1 are in Ct q. The p-persistent q- combinatorial Laplacian operator [12] is given by q = ðt,p ∆t,p q+1(ðt,p q+1)∗ + (∂t q)∗∂t q. (4.4) The topological invariants of the corresponding persistent homology defined by the same filtration are recovered from the kernel of the persistent Laplacian Eq. (4.4) [12], βt,p q = dim ker ∂t q – dim im ðt,p q+1 = dim ker ∆t,p q . (4.5) State differently, the zero eigenvalues of the persistent Laplacian operator Eq. (4.4) give rise to the entire topological variants of the persistent homology. Then, the non-harmonic part of the spectra (i.e., the non-zero eigenvalues of the persistent Laplacian) and associated eigenvectors offer additional shape information of the underlying data. Note that for small-sized high-dimensional datasets, PSG can be directly employed to reduce the dimensionality in terms of the statistical quantities of the data spectra. The resulting spectra or their statistics can be directly used to represent the original datasets. 122 4.1.3 The shape of data Continuous FRI was defined to offer the shape of M data entries in R3 [13]. A similar idea was used to define interactive differentiable Riemannian manifolds [14]. Here, we extend these ideas to construct Grassmann manifolds Gr(N – 1, I). Let X = {x1, ..., xm, ..., xM} be a finite set of M data entries. Denote xm ∈ RN be the feature vector for the mth sample, and |x – xm| be the Euclidean distance between a point x ∈ RN to the jth sample. Let η = 1 M M ∑︁ m=1 min xj ∥xm – xj ∥ be the average minimum pairwise distance of the input data. Then, the unnormalized rigidity density at point x ∈ RN is given by µ(x) = M ∑︁ m=1 ωmΦ(∥x – xm∥; τ, η, κ), (4.6) where ωm = 1, and τ and κ are the hyperparameters of the correlation kernel Φ. Notice that we can choose an isosurface µ(x) = cµmax, which defines an (N – 1)– dimensional Riemannian manifold by the collection of points {x|x ∈ RN, µ(x) = cµmax}, (4.7) µ(x). The shape of data can be directly visualized for 2 ≤ N ≤ 3 where c ∈ (0, 1) and µmax = max x as shown in Ref. [14]. One can restrict xm to a given subset in Eq. (4.6) to compare the shape of data in different classes when the class labels are known. For further analysis, one can obtain (N – 1) independent curvatures via fundamental forms [14]. Additionally, Hodge decomposition can be applied to analyze topological connectivity (i.e., Betti numbers associated with the harmonic spectra) and non-harmonic spectra of the Hodge Laplacians of the data [15]. For evolving manifolds, the evolutionary de Rham-Hodge theory can be used to analyze the geometry and topology of data [11]. 123 4.2 Results 4.2.1 Geometric shape, Residue-Similarity (R-S), and topological analysis In this section, we compare the 3D shape, R-S plot, and topological persistence of the TCGA- PANCAN data. For the comparison, CCP was used to reduce the data to N = 3 components. The data were divided according to their true labels into 5 classes. The 5-fold cross-validation was used to obtain the predicted labels for visualization (coloring). 
For the 3D shape visualization, after the dimensionality reduction, the Gaussian surface was used to generate the volumetric representation. The Chimera [16] was used to visualize the shape of data at the isovalue of 0.1. The surface was colored according to the predicted labels. For the persistence plot, after the dimensionality reduction, the data was divided according to their true labels. The HERMES package [9] with the α complex was used to generate topological dimensions 0 (Betti-0), 1 (Betti-1), and 2 (Betti-2) curves and the corresponding smallest non- zero eigenvalue curves. Note that persistent Laplacian itself offers low-dimensional geometric and topological representations of the original high-dimensional data [17]. Figure 4.1 shows the 3 different visualizations of class 1. Notice that the shape analysis shows predominately red regions or dots mixed with misclassified labels. We can see this mixing in the R-S plot as well. The yellow points have lower R scores, indicating that these samples are more likely to be mislabeled in machine learning. We can see that blue points with low S scores are isolated in the shape visualization. and βα,0 Note that βα,0 1 2 offer the same information as persistent homology does for the 0 , βα,0 shows there are about 290 samples in this class that become fully connected at shows there are many cycles in the sample. The βα,0 indicates there 0 = 1). The βα,0 1 1 data. The βα,0 0 radius 6 ( βα,0 are at most 7 cavities. There are no topological changes in the data after radius=11. However, the smallest non-zero eigenvalue (λα,0 0 ) keeps changing as the filtration radius increases, indicating that persistent Laplacian reveals more information about the data than persistent homology does. Finally, it is noticed that most misclassified samples have relatively low R-S scores. This observation indicates the effectiveness of our R-S scores and indexes. 124 (a) Gaussian surface of TCGA- PANCAN class 1. (b) R-S plot of TCGA-PANCAN class 1. (c) Persistence of TCGA-PANCAN class 1. Figure 4.1 Shape of data, R-S and persistence visualization of TCGA-PANCAN class 1 data. CCP was used to reduce the data to N = 3. (a) Shape of data was visualized with isovalue 0.1 in ChimeraX [16]. Red color indicates the correctly classified data. (b) R-S plot of class 1. Red circle is the correct label. The x and y-axes correspond to the residue and similarity scores, respectively. (c) Visualization of the smallest non-zero eigenvalue curves along the filtration (indicated by red color) λα,0 0 , βα,0 0 , λα,0 1 , and βα,0 for class 1. HERMES package [9] with the α complex was used to calculate the harmonic and non-harmonic spectra. The x-axis is the filtration radius. The left y-axis corresponds to the βα,0 0 , βα,0 1 ,and βα,0 1 , and λα,0 from top to bottom. , and the harmonic spectral curves (indicated by blue color) βα,0 from top to bottom, and the right y-axis corresponds to λα,0 1 , and λα,0 0 , λα,0 2 2 2 2 The shape, R-S, and topological analysis of class 2 are given in Figure 4.2. The βα,0 class 2 has about 130 samples, which become fully connected near radius=10. The λα,0 shows a significant discontinuity at radius near radius=10. The βα,0 peak value. At most two cavities in data have been found in βα,0 0 0 1 at a given filtration. The shape shows about 28 cycles at its indicates curve 2 shows four major pieces and only a few samples were mislabeled by machine learning. R-S plots should have four different labels in this class. 
Most of the samples were correctly predicted, which 125 (a) Gaussian surface of TCGA- PANCAN class 2. (b) R-S plot of TCGA-PANCAN class 2. (c) Persistence of TCGA-PANCAN class 2. Figure 4.2 Shape of data, R-S and persistence visualization of TCGA-PANCAN class 2 data. CCP was used to reduce the data to N = 3. (a) Shape of data was visualized with isovalue 0.1 in ChimeraX [16]. Green color indicates the correctly classified data. (b) R-S plot of class 2. Green circle is the correct label. The x and y-axes correspond to the residue and similarity scores, respectively. (c) Visualization of the smallest non-zero eigenvalue curves along the filtration (indicated by red color) λα,0 0 , βα,0 0 , λα,0 1 , and βα,0 for class 2. HERMES package [9] with the α complex was used to calculate the harmonic and non-harmonic spectra. The x-axis is the filtration radius. The left y-axis corresponds to the βα,0 0 , βα,0 1 ,and βα,0 1 , and λα,0 from top to bottom. , and the harmonic spectral curves (indicated by blue color) βα,0 from top to bottom, and the right y-axis corresponds to λα,0 1 , and λα,0 0 , λα,0 2 2 2 2 is consistent with the shape analysis. Most class 1 labels (red ones) have lower S scores, indicating that they do not belong. Most misclassified samples have low R-S scores and are disconnected from other samples in the class. Figure 4.3 gives three types of analyses for class 3. This class has only about 78 samples, as shown by the βα,0 0 curve. At filtration radius=2, there were 11 one-dimensional holes in the data. 126 (a) Gaussian surface of TCGA- PANCAN class 3. (b) R-S plot of TCGA-PANCAN class 3. (c) Persistence of TCGA-PANCAN class 3. Figure 4.3 Shape of data, R-S and persistence visualization of TCGA-PANCAN class 3 data. CCP was used to reduce the data to N = 3. (a) Shape of data was visualized with isovalue 0.1 in ChimeraX [16]. Blue color indicates the correctly classified data. (b) R-S plot of class 3. Blue circle is the correct label. The x and y-axes correspond to the residue and similarity scores, respectively. (c) Visualization of the smallest non-zero eigenvalue curves along the filtration (indicated by red color) λα,0 0 , λα,0 1 , and λα,0 1 , and βα,0 for class 3. HERMES package [9] with the α complex was used to calculate the harmonic 2 and non-harmonic spectra. The x-axis is the filtration radius. The left y-axis corresponds to the βα,0 0 , βα,0 1 ,and βα,0 1 , and λα,0 from top to bottom. , and the harmonic spectral curves (indicated by blue color) βα,0 from top to bottom, and the right y-axis corresponds to λα,0 0 , λα,0 0 , βα,0 2 2 2 There were only two cavities found by βα,0 at isovalue 0.1 but merge at radius 7 as detected by βα,0 ones, as shown by the shape and R-S plots. The λα,0 topological persistence (βα,0 1 2 . The shape plot indicates most samples are disconnected 0 . The yellow labels are close to the red curve demonstrates a few discontinuities after 1 ) becomes flat, indicating important homotopic events in the data. As in other classes, most misclassified samples have relatively low R-S scores. In Figure 4.4, the βα,0 0 curve suggests 150 samples in class 4 (yellow). Some samples are 127 (a) Gaussian surface of TCGA- PANCAN class 4. (b) R-S plot of TCGA-PANCAN class 4. (c) Persistence of TCGA-PANCAN class 4. Figure 4.4 Shape of data, R-S and persistence visualization of TCGA-PANCAN class 4 data. CCP was used to reduce the data to N = 3. (a) Shape of data was visualized with isovalue 0.1 in ChimeraX [16]. Yellow color indicates the correctly classified data. 
(b) R-S plot of class 4. Yellow circles indicate the correct label. The x and y axes correspond to the residue and similarity scores, respectively. (c) Persistence of TCGA-PANCAN class 4: visualization of the smallest non-zero eigenvalue curves along the filtration (indicated by red color) $\lambda_0^{\alpha,0}$, $\lambda_1^{\alpha,0}$, and $\lambda_2^{\alpha,0}$, and the harmonic spectral curves (indicated by blue color) $\beta_0^{\alpha,0}$, $\beta_1^{\alpha,0}$, and $\beta_2^{\alpha,0}$ for class 4. The HERMES package [9] with the $\alpha$ complex was used to calculate the harmonic and non-harmonic spectra. The x-axis is the filtration radius. The left y-axis corresponds to $\beta_0^{\alpha,0}$, $\beta_1^{\alpha,0}$, and $\beta_2^{\alpha,0}$ from top to bottom, and the right y-axis corresponds to $\lambda_0^{\alpha,0}$, $\lambda_1^{\alpha,0}$, and $\lambda_2^{\alpha,0}$ from top to bottom.

Some samples are misclassified as class 1 (red), class 3 (blue), and class 5 (purple), as shown in the shape and R-S plots. The topological persistence indicates many topological invariants along the filtration axis, which can be a faithful representation of the data [17]. Specifically, all data points overlap at radius 5, as shown by $\beta_0^{\alpha,0}$. All cycles disappear after radius 7, as revealed by $\beta_1^{\alpha,0}$; the last cycle persists from radius 6 to 7. The $\beta_2^{\alpha,0}$ curve becomes flat at radius 4. However, $\lambda_0^{\alpha,0}$ still indicates a discontinuity at radius 7. The misclassified red samples show low S scores.

Figure 4.5 Shape of data, R-S and persistence visualization of TCGA-PANCAN class 5 data. CCP was used to reduce the data to N = 3. (a) Gaussian surface of TCGA-PANCAN class 5. The shape of data was visualized with isovalue 0.1 in ChimeraX [16]; purple color indicates the correctly classified data. (b) R-S plot of class 5. Purple circles indicate the correct label. The x and y axes correspond to the residue and similarity scores, respectively. (c) Persistence of TCGA-PANCAN class 5: visualization of the smallest non-zero eigenvalue curves along the filtration (indicated by red color) $\lambda_0^{\alpha,0}$, $\lambda_1^{\alpha,0}$, and $\lambda_2^{\alpha,0}$, and the harmonic spectral curves (indicated by blue color) $\beta_0^{\alpha,0}$, $\beta_1^{\alpha,0}$, and $\beta_2^{\alpha,0}$ for class 5. The HERMES package [9] with the $\alpha$ complex was used to calculate the harmonic and non-harmonic spectra. The x-axis is the filtration radius. The left y-axis corresponds to $\beta_0^{\alpha,0}$, $\beta_1^{\alpha,0}$, and $\beta_2^{\alpha,0}$ from top to bottom, and the right y-axis corresponds to $\lambda_0^{\alpha,0}$, $\lambda_1^{\alpha,0}$, and $\lambda_2^{\alpha,0}$ from top to bottom.

Figure 4.5 illustrates our shape, R-S, and topological analyses of class 5. Although $\beta_0^{\alpha,0}$ indicates there are only about 140 samples, class 5 is very rich in its topological persistence. The $\beta_0^{\alpha,0}$ curve shows that the data points did not connect before the filtration radius reached 12. The $\beta_1^{\alpha,0}$ curve indicates a large one-dimensional hole from radius 13.5 to 15. The $\beta_2^{\alpha,0}$ curve shows 9 short-living cavities in the data. It is clear from the R-S plot that samples having low R-S scores are more likely to be mislabeled.

Figure 4.6 Shape of class 5 of the TCGA-PANCAN dataset visualized at multiple scales in ChimeraX [16] as the isovalue is varied: (a) isovalue 0.235, (b) isovalue 0.139, (c) isovalue 0.0963, and (d) isovalue 0.0426. The different colors indicate the predicted labels, and purple is the true label of class 5.

Because Figure 4.5's persistence plot indicates interesting topological features, class 5 was further visualized in Figure 4.6 with varying isovalues, i.e., scales. From isovalue 0.235 to 0.139, two holes form in the bottom right corner, and another hole begins to form in the bottom center.
From isovalue 0.139 to 0.0963, the two holes are no longer visible, but the hole that was forming is now completed; this corresponds to $\beta_1^{\alpha,0}$ in Figure 4.5. The voids detected by $\beta_2^{\alpha,0}$ are short-lived and are not visible in the isosurface. Decreasing the isovalue further to 0.0426 shows the combination of the two main parts of the data, which stabilizes the structure. This corresponds to the decrease of $\beta_0^{\alpha,0}$, because the number of connected components is decreasing, while the increase in $\lambda_0^{\alpha,0}$ indicates that the structure is more stable.

4.3 Discussion

4.3.1 R-S plot vs 2D plot

The R-S plot is an effective tool for visualizing classification performance in general. Figure 4.7 compares the R-S plot and the traditional 2D visualization of the Coil-20 dataset when the dimension is reduced to 2 by UMAP. For the traditional 2D plot, each data point was colored by the ground truth; for the R-S plot, each section represents one of the 20 classes, and data points were colored by the predicted labels from the k-NN classification. In the traditional 2D visualization, labels 3, 9, and 19 are located in the same region. It is interesting to see that this situation is reflected in our R-S plot as three mixed-up labels. In the R-S plot, labels 3, 9, and 19 have a high similarity score but a low residue score, meaning that the data points are not well separated among different classes, which shows the limitation of preserving the local structure of high-dimensional data in a 2D representation. Essentially, some data lie in an intrinsically high-dimensional space that cannot be well described in 2D. Note that 2D plots work best when the data dimension is reduced to 2, whereas R-S plots can be applied to arbitrarily high dimensions.

Figure 4.7 (a) R-S plot of the Coil-20 dataset when reduced by UMAP to N = 2. Each section represents a different class, and the data points were colored according to their predicted labels from k-NN classification via 10-fold cross-validation. (b) Traditional 2D plot of the Coil-20 dataset reduced to N = 2 by UMAP. The data points were colored according to the ground truth.

BIBLIOGRAPHY

[1] William M Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.

[2] Patrizio Frosini. Measuring shapes by size functions. In Intelligent Robots and Computer Vision X: Algorithms and Techniques, volume 1607, pages 122–133. International Society for Optics and Photonics, 1992.

[3] Herbert Edelsbrunner, David Letscher, and Afra Zomorodian. Topological persistence and simplification. In Proceedings 41st Annual Symposium on Foundations of Computer Science, pages 454–463. IEEE, 2000.

[4] Afra Zomorodian and Gunnar Carlsson. Computing persistent homology. Discrete & Computational Geometry, 33(2):249–274, 2005.

[5] Gunnar Carlsson. Topology and data. Bulletin of the American Mathematical Society, 46(2):255–308, 2009.

[6] Konstantin Mischaikow and Vidit Nanda. Morse theory for filtrations and efficient computation of persistent homology. Discrete & Computational Geometry, 50(2):330–353, 2013.

[7] K. L. Xia and G. W. Wei. Persistent homology analysis of protein structure, flexibility and folding. International Journal for Numerical Methods in Biomedical Engineering, 30:814–844, 2014.
[8] Jacob Townsend, Cassie Putman Micucci, John H Hymel, Vasileios Maroulas, and Konstantinos D Vogiatzis. Representation of molecular structures with persistent homology for machine learning applications in chemistry. Nature Communications, 11(1):1–9, 2020.

[9] Rui Wang, Rundong Zhao, Emily Ribando-Gros, Jiahui Chen, Yiying Tong, and Guo-Wei Wei. HERMES: Persistent spectral graph software. Foundations of Data Science, 3(1):67, 2021.

[10] Facundo Mémoli, Zhengchao Wan, and Yusu Wang. Persistent Laplacians: Properties, algorithms and implications. SIAM Journal on Mathematics of Data Science, 2022.

[11] Jiahui Chen, Rundong Zhao, Yiying Tong, and Guo-Wei Wei. Evolutionary de Rham-Hodge method. Discrete and Continuous Dynamical Systems, Series B, 26(7):3785, 2021.

[12] Rui Wang, Duc Duy Nguyen, and Guo-Wei Wei. Persistent spectral graph. International Journal for Numerical Methods in Biomedical Engineering, 36(9):e3376, 2020.

[13] Kelin Xia, Kristopher Opron, and Guo-Wei Wei. Multiscale multiphysics and multidomain models—flexibility and rigidity. The Journal of Chemical Physics, 139(19):11B614_1, 2013.

[14] Duc Duy Nguyen and Guo-Wei Wei. DG-GL: Differential geometry-based geometric learning of molecular datasets. International Journal for Numerical Methods in Biomedical Engineering, 35(3):e3179, 2019.

[15] Rundong Zhao, Menglun Wang, Jiahui Chen, Yiying Tong, and Guo-Wei Wei. The de Rham–Hodge analysis and modeling of biomolecules. Bulletin of Mathematical Biology, 82(8):1–38, 2020.

[16] Eric F Pettersen, Thomas D Goddard, Conrad C Huang, Elaine C Meng, Gregory S Couch, Tristan I Croll, John H Morris, and Thomas E Ferrin. UCSF ChimeraX: Structure visualization for researchers, educators, and developers. Protein Science, 30(1):70–82, 2021.

[17] Jiahui Chen, Yuchi Qiu, Rui Wang, and Guo-Wei Wei. Persistent Laplacian projected Omicron BA.4 and BA.5 to become new dominating variants. Computers in Biology and Medicine, 151:106262, 2022.

CHAPTER 5
CORRELATED CLUSTERING AND PROJECTION

5.1 Introduction

In this work, we propose a two-step data-domain method that seeks an optimal clustering, in terms of a distance describing intrinsic feature correlations among samples, to divide I feature vectors into N correlated clusters and then nonlinearly projects the correlated features in each cluster into a single descriptor by using the Flexibility Rigidity Index (FRI) [1], which results in a low-dimensional representation of the original data. Additionally, the complex global correlations among samples are embedded into the samples' local representations during the FRI-based nonlinear projection $\mathbb{R}^I \to \mathbb{R}^N$. To gain computational efficiency, one may further compute the pairwise correlation matrix of samples and impose a cutoff distance to avoid the global summation during the projection. The resulting method, called Correlated Clustering and Projection (CCP), compares favorably with other DR algorithms in the following aspects. (1) CCP does not involve matrix diagonalization and thus can handle the dimensionality reduction of large sample sizes. (2) CCP exploits statistical measures, such as (distance) covariance, to quantify the high-level dependence between random feature vectors, rendering a stable algorithm when dealing with high-dimensional data. (3) CCP is flexible with respect to the targeted dimension N because the partition of features is based on N, whereas other methods may rely on min(M, N).
The performance of CCP is stable with respect to the increase of N, which is important for datasets with high or moderately high intrinsic dimensions. In contrast, many existing methods stop working when the intrinsic data dimension is moderately high. (4) CCP is stable with respect to subsampling, which allows continuously adding new samples to a pre-existing dataset without the need to restart the calculation from the very beginning and thus is advantageous for continuous data acquisition, collection, and analysis. This capability is valuable when the transient data are too expensive to be kept permanently, e.g., molecular dynamics simulations. Additionally, this subsampling property enables parameter optimization using a small amount of data in case of a large data size. (5) As a data-domain method, CCP can be combined with a frequency-domain method, such as UMAP or t-SNE, for a secondary dimensionality reduction to better preserve global structures of data and achieve higher accuracy. (6) Finally, the performance of CCP is validated on several benchmark classification datasets: Leukemia, Carcinoma, ALL-AML, TCGA-PANCAN, Coil-20, Coil-100, and Smallnorb, based on various traditional algorithms such as k-NN, support vector machine, random forest, and gradient boost decision tree. In all cases, CCP is very competitive with the state-of-the-art algorithms.

Additionally, we have also proposed a new method, called Residue-Similarity (R-S) scores or the R-S plot, for the performance visualization of unsupervised clustering and supervised classification algorithms. Although the Receiver Operating Characteristic (ROC) curve and the Area Under the ROC Curve (AUC) are typically used for the performance visualization of binary classes, they are not convenient for multiple classes. The proposed R-S scores can be used for visualizing the performance for an arbitrary number of classes. Finally, the R index, S index, R-S disparity, and total R-S index are proposed to characterize clustering and classification results.

Recent years have witnessed the growth of Topological Data Analysis (TDA) via persistent homology [2, 3, 4, 5, 6, 7, 8] in data sciences. It can be used to analyze the topological invariants of the R-S scores. However, persistent homology is insensitive to the homotopic shape evolution of data during filtration. We introduce a topological Laplacian, the Persistent Spectral Graph (PSG) [9], to capture the homotopic shape of data in addition to topological invariants. Note that TDA and PSG are dimensionality reduction algorithms that can generate low-dimensional representations of the original high-dimensional data [10, 11]. To further analyze the shape of data, we transform point cloud data into a Grassmann manifold representation by using FRI [1]. When N = 3, the 2-manifold shape of data can be directly visualized. Such shape of data can be further analyzed by differential geometry apparatuses, including curvatures [12], Hodge decomposition [13], and evolutionary de Rham-Hodge theory [14].

5.2 Methods and Algorithms

Let $Z := \{z^i_m\}_{m=1,i=1}^{M,I}$, with M and I being the number of input data entries (i.e., samples) and the number of features for each data entry, respectively. Our goal is to find an N-dimensional representation of the original data, denoted as $X := \{x^i_m\}_{m=1,i=1}^{M,N}$, such that $1 \le N \ll I$, by using a data-domain two-step clustering-projection strategy.

5.2.1 Feature clustering

Let $Z = \{z^1, \ldots, z^i, \ldots, z^I\}$ be the set of data, where $z^i \in \mathbb{R}^M$ represents the ith feature vector of the data.
The objective is to partition the feature vectors into N parts, where $1 \le N \ll I$ is a preselected reduced feature dimension. To this end, we find an optimal disjoint partition of the data $Z := \uplus_{n=1}^{N} Z_n$ for a given N, where $Z_n$ is the nth partition (cluster) of the features. To seek the optimal partition, we first analyze the correlation among the feature vectors $z^i$. A variety of correlation measures can be used for this purpose. We discuss two standard approaches.

Covariance distance. First, we consider an $I \times I$ normalized covariance matrix with components
$$\rho(z^i, z^j) = \frac{\mathrm{Cov}(z^i, z^j)}{\sigma(z^i)\,\sigma(z^j)}, \quad 1 \le i, j \le I, \tag{5.1}$$
where $\mathrm{Cov}(z^i, z^j)$ is the covariance of $z^i$ and $z^j$, and $\sigma(z^i)$ and $\sigma(z^j)$ are the standard deviations of $z^i$ and $z^j$, respectively. We set negative covariances to zero and subtract from 1 to obtain a covariance distance between feature vectors,
$$\|z^i - z^j\|_{\mathrm{dCov}} = \begin{cases} 1 - \rho(z^i, z^j), & \rho(z^i, z^j) > 0, \\ 1, & \text{otherwise}. \end{cases} \tag{5.2}$$
Note that covariance distances lie in the range $[0, 1]$ for all pairs of vectors $z^i$ and $z^j$. Highly correlated feature vectors have covariance distances close to 0, while uncorrelated feature vectors have covariance distances close to 1.

Correlation distance. Alternatively, one may consider the correlation distance defined via the distance correlation [15]. First, one computes a distance matrix for each vector $z^i$, $i = 1, 2, \ldots, I$,
$$a^i_{mk} = \|z^i_m - z^i_k\|, \quad m, k = 1, 2, \ldots, M, \tag{5.3}$$
where $\|\cdot\|$ denotes the Euclidean norm. Define the doubly centered distances for vector $z^i$,
$$A^i_{mk} := a^i_{mk} - \bar{a}^i_{m\cdot} - \bar{a}^i_{\cdot k} + \bar{a}^i_{\cdot\cdot}, \tag{5.4}$$
where $\bar{a}^i_{m\cdot}$ is the mth row mean, $\bar{a}^i_{\cdot k}$ is the kth column mean, and $\bar{a}^i_{\cdot\cdot}$ is the grand mean of the distance matrix for vector $z^i$. For a pair of vectors $(z^i, z^j)$, the squared distance covariance is given by
$$\mathrm{dCov}^2(z^i, z^j) := \frac{1}{M^2} \sum_{m=1}^{M}\sum_{k=1}^{M} A^i_{mk} A^j_{mk}. \tag{5.5}$$
The distance correlation between vectors $(z^i, z^j)$ is given by
$$\mathrm{dCor}(z^i, z^j) := \frac{\mathrm{dCov}^2(z^i, z^j)}{\mathrm{dCov}(z^i, z^i)\,\mathrm{dCov}(z^j, z^j)}. \tag{5.6}$$
We define a correlation distance between vectors $z^i$ and $z^j$ as
$$\|z^i - z^j\|_{\mathrm{dCor}} = 1 - \mathrm{dCor}(z^i, z^j). \tag{5.7}$$
The correlation distance takes values in the range $0 \le \|z^i - z^j\|_{\mathrm{dCor}} \le 1$. It gives $\|z^i - z^j\|_{\mathrm{dCor}} = 1$ if $z^i$ and $z^j$ are uncorrelated or independent. When $z^i$ and $z^j$ depend linearly on each other, one has $\|z^i - z^j\|_{\mathrm{dCor}} = 0$.

Correlated clustering. Feature partition can be achieved with a variety of clustering methods. Here, as an example, we utilize a modified K-medoids method to perform the partition in a minimization process. Certainly, other K-means-type algorithms, including the BFR algorithm, centroidal Voronoi tessellation, k q-flats, K-means++, etc., can be utilized for our feature partition as well. For a pre-selected N, we begin by randomly selecting N medoids $\{m_n\}_{n=1}^{N}$ and assigning each vector to its nearest medoid, which gives rise to the initial partition $\{Z_n\}_{n=1}^{N}$. Second, we denote the closest vector to the center of the nth partition $Z_n$ as the new medoid $\{m_n \in Z_n\}_{n=1}^{N}$. We then reassign each vector to its nearest medoid, resulting in a new partition $\{Z_n\}_{n=1}^{N}$ that reduces the loss function, i.e., the accumulated distance. The process is repeated until $\{Z_n\}_{n=1}^{N}$ is optimized with respect to a specific distance definition,
$$\arg\min_{\{Z_1, \ldots, Z_n, \ldots, Z_N\}} \sum_{n=1}^{N} \sum_{z^i \in Z_n} \|z^i - m_n\|, \tag{5.8}$$
where $\|\cdot\|$ is either the covariance distance or the correlation distance. In comparison, the covariance distance is easy to compute, while the correlation distance can deal with complex nonlinear high-level correlations among feature vectors and samples.
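As a concrete illustration of this clustering step, the sketch below computes the covariance distance of Eq. (5.2) between feature vectors and partitions them with the K-medoids implementation in scikit-learn-extra (listed in Table 5.1). The function names and default values are illustrative assumptions; the dissertation's own CCP implementation may differ in details such as initialization and tie-breaking.

```python
import numpy as np
from sklearn_extra.cluster import KMedoids  # scikit-learn-extra, cf. Table 5.1

def covariance_distance(Z):
    """Covariance distance between feature vectors, Eq. (5.2).

    Z : (M, I) data matrix with M samples and I features.
    Returns an (I, I) distance matrix; negative correlations are clipped to zero first.
    """
    rho = np.corrcoef(Z, rowvar=False)   # normalized covariance matrix, Eq. (5.1)
    rho = np.clip(rho, 0.0, None)        # set negative covariances to zero
    return 1.0 - rho

def cluster_features(Z, n_clusters, random_state=0):
    """Partition the I feature vectors into N correlated clusters, cf. Eq. (5.8)."""
    D = covariance_distance(Z)
    km = KMedoids(n_clusters=n_clusters, metric="precomputed",
                  random_state=random_state).fit(D)
    # labels[i] = n means feature i belongs to cluster Z_n;
    # inertia_ is the accumulated distance (the loss in Eq. (5.8)).
    return km.labels_, km.inertia_

# Example usage on random data with M = 100 samples and I = 500 features.
if __name__ == "__main__":
    Z = np.random.default_rng(0).normal(size=(100, 500))
    labels, loss = cluster_features(Z, n_clusters=8)
    print("cluster sizes:", np.bincount(labels), "loss:", round(loss, 2))
```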
Note that many other metrics can be used too. For a given N, the minimization partitions similar feature vectors into N clusters, which provides the foundation for further projections. Our next goal is to project the original I-dimensional dataset Z into an N-dimensional representation X according to the partition result.

5.2.2 Feature projection

Flexibility Rigidity Index (FRI). In this section, we review the Flexibility Rigidity Index (FRI) [1]. Let $\{z_1, \ldots, z_m, \ldots, z_M\}$ be the input dataset, where $z_m \in \mathbb{R}^I$. Denote by $\|z_i - z_j\|$ some metric between the ith and jth data entries. The correlations between data entries are computed as
$$C_{ij} = \Phi(\|z_i - z_j\|; \eta, \tau, \kappa), \quad 1 \le i, j \le M, \tag{5.9}$$
where $\Phi$ is the correlation kernel and $\tau, \eta, \kappa > 0$ are the parameters of the kernel. Commonly used metrics include the Euclidean distance, the Manhattan distance, the Wasserstein distance, etc. The correlation kernel is a real-valued, smooth, monotonically decreasing function satisfying the two properties
$$\Phi(\|z_i - z_j\|; \eta, \tau, \kappa) \to 1 \quad \text{as } \|z_i - z_j\| \to 0,$$
$$\Phi(\|z_i - z_j\|; \eta, \tau, \kappa) \to 0 \quad \text{as } \|z_i - z_j\| \to \infty.$$
A popular choice for such functions is a radial basis function. For example, one may use the generalized exponential function
$$\Phi(\|z_i - z_j\|; \eta, \tau, \kappa) = \begin{cases} e^{-(\|z_i - z_j\|/\tau\eta)^{\kappa}}, & \|z_i - z_j\| < r_c, \\ 0, & \text{otherwise}, \end{cases} \tag{5.10}$$
or the generalized Lorentz kernel
$$\Phi(\|z_i - z_j\|; \eta, \tau, \kappa) = \begin{cases} \dfrac{1}{1 + (\|z_i - z_j\|/\tau\eta)^{\kappa}}, & \|z_i - z_j\| < r_c, \\ 0, & \text{otherwise}, \end{cases} \tag{5.11}$$
where $\kappa$ is the power, $\tau$ is the multiscale parameter, $\eta$ is a scale to be computed from the given data, and $r_c$ is the cutoff distance, which is useful for certain data structures to reduce the computational complexity [16]. In the context of t-SNE, $\eta$ would play the role of the perplexity, and in UMAP, $\eta$ would define the geodesic of the Riemannian metric. We can construct a correlation matrix $C = \{C_{ij}\}$, which reveals the topological connectivity between samples [1]. We can also view such a correlation map as a weighted graph [9, 17], where $r_c$ is the cutoff distance. In order to understand the connectivity, we choose $\eta$ to be the average minimum distance between the data entries,
$$\eta = \frac{\sum_{m=1}^{M} \min_{z_j \ne z_m} \|z_m - z_j\|}{M}.$$
Using the correlation function, we can define the rigidity of $z_i$ as
$$\mu_i = \sum_{m=1}^{M} \omega_{im}\,\Phi(\|z_i - z_m\|; \eta, \tau, \kappa),$$
where $\omega_{im}$ are the weights. Here, we set $\omega_{im} = 1$ for all i and m. From the graph perspective, one can also view $\mu_i$ as the degree of node $z_i$.

Correlated projection. In this subsection, we employ FRI for the correlative dimensionality reduction of the input dataset $\{z_1, \ldots, z_m, \ldots, z_M\}$, where $z_m \in \mathbb{R}^I$, leading to a low-dimensional representation $\{x_1, \ldots, x_m, \ldots, x_M\}$ with $x_m \in \mathbb{R}^N$. The FRI reduction captures the intrinsic correlation among samples. Recall that $\{Z_n\}_{n=1}^{N}$ is the optimal partition of feature vectors from the K-medoids or another clustering method. Let $S = \{1, \ldots, I\}$ be the whole set of indices of the feature vectors, and let $S_n = \{i \mid z^i \in Z_n\}$, so that $S = \uplus_{n=1}^{N} S_n$. We define $z^{S_n}_m$ as the mth input data entry restricted to the nth collection of feature indices $S_n$, i.e., $z^{S_n}_m := \{z^i_m \mid i \in S_n\}$. We can now define the nth correlation matrix $\{C^n_{ij}\}_{i,j=1,\ldots,M}$ associated with the subset $S_n$ of features,
$$C^n_{ij} = \Phi_n(\|z^{S_n}_i - z^{S_n}_j\|; \eta_n, \tau, \kappa), \quad 1 \le i, j \le M, \ 1 \le n \le N, \tag{5.12}$$
where $\Phi_n$ is the radial basis kernel for the nth grouping.
For example, one may choose
$$\Phi_n(\|z^{S_n}_i - z^{S_n}_j\|; \eta_n, \tau, \kappa) = \begin{cases} e^{-\left(\|z^{S_n}_i - z^{S_n}_j\|/\tau\eta_n\right)^{\kappa}}, & \|z^{S_n}_i - z^{S_n}_j\| < r^n_c, \\ 0, & \text{otherwise}, \end{cases} \tag{5.13}$$
or
$$\Phi_n(\|z^{S_n}_i - z^{S_n}_j\|; \eta_n, \tau, \kappa) = \begin{cases} \dfrac{1}{1 + \left(\|z^{S_n}_i - z^{S_n}_j\|/\tau\eta_n\right)^{\kappa}}, & \|z^{S_n}_i - z^{S_n}_j\| < r^n_c, \\ 0, & \text{otherwise}, \end{cases} \tag{5.14}$$
where the truncation distance $r^n_c$ can be set to 2 or 3 standard deviations, and $\eta_n$ is set to
$$\eta_n = \frac{\sum_{m=1}^{M} \min_{z^{S_n}_j \ne z^{S_n}_m} \|z^{S_n}_m - z^{S_n}_j\|}{M}.$$
Then, we can project the data to an N-dimensional representation by taking the rigidity function defined by the correlation kernels,
$$x^n_i = \sum_{m=1}^{M} \omega^n_{im}\,\Phi_n(\|z^{S_n}_i - z^{S_n}_m\|; \eta_n, \tau, \kappa), \quad n = 1, 2, \ldots, N; \ i = 1, 2, \ldots, M, \tag{5.15}$$
where $\omega^n_{im}$ are the weights associated with $\Phi_n$ for the nth cluster and can be set to 1. Moreover, the mth data entry in the reduced N-dimensional representation is the vector $x_m = (x^1_m, \ldots, x^n_m, \ldots, x^N_m)^T$.

We also propose to improve the computational efficiency of Eq. (5.15) by avoiding the global summation. This can be easily done as follows. First, we construct an $M \times M$ global distance matrix of the samples to obtain the nearest neighbors of each sample. Then, we use the cell lists algorithm with the cutoff value to replace the global summation in Eq. (5.15) by considering only a few nearest neighbors [16]. This approach significantly reduces the memory requirement. Since the projections of $x^n_i$ for different i and n are independent of each other, massively parallel computations can be used for large datasets.

5.3 Results

In this section, we numerically explore CCP's performance on a variety of high-dimensional benchmark test datasets. For each dataset, we use 10 random seeds for 5-fold or 10-fold cross-validations, depending on the number of samples of the data. In order to validate the effectiveness of CCP, we compare it with Uniform Manifold Approximation and Projection (UMAP), Principal Component Analysis (PCA), Locally Linear Embedding (LLE), and Isomap. For metric-based embedding, the Euclidean distance was used. All parameters were set to their defaults, according to the packages outlined in Table 5.1. In order to test the effectiveness of the dimensionality reduction, a k-nearest neighbor classifier (k-NN) was used. Table 5.1 shows the versions of the packages used in our comparison. In order to visualize the accuracy, R-S scores were utilized. In R-S plots, the residue and similarity scores were represented as the x and y axes, respectively, and the data points were colored according to the predicted labels from the classification results.

Table 5.1 Python packages used for the dimensionality reduction and benchmark tests: Python v3.8.5, NumPy v1.19.2, scikit-learn v0.23.2, scikit-learn-extra v0.2.0, sklearn v0.0, and umap-learn v0.5.1.

5.3.1 Datasets

We test CCP and several other commonly used algorithms on benchmark datasets, including Leukemia, ALL-AML, Carcinoma, Arcene, TCGA-PANCAN, Coil-20, Coil-100, and Smallnorb. Table 5.2 summarizes the datasets used in the present work; (M, I, L) denotes the number of samples, the number of features, and the number of classes, respectively.

Table 5.2 Datasets used in the benchmark tests.
Leukemia [18], (M, I, L) = (72, 7070, 2): Microarray dataset of Leukemia. The data contains 72 samples, each with 7070 gene expressions.
Carcinoma [19, 20], (174, 9182, 11): Microarray dataset of human carcinomas. The original data [19] contains 12533 genes, which were processed to 9182 dimensions in Ref. [20].
ALL-AML [21], (72, 7129, 2): Cancer classification dataset based on gene expressions measured by DNA microarrays for acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL).
TCGA-PANCAN [22], (801, 20531, 5): Gene expression dataset. Part of the RNA-seq (HiSeq) PANCAN data, where expressions of 5 different types of tumors were extracted.
Coil-20 [23], (1440, 16384, 20): Image classification dataset with 1440 images. Each image has size 128 × 128 = 16384, where 20 objects are captured at 72 angles. Each image was treated as a vector of length 16384.
Coil-100 [24], (7200, 49152, 100): Image classification dataset of 7200 images. Each image has size 128 × 128 = 16384 with 3 channels, where 100 objects are captured at 72 angles. Each image was treated as a vector of length 49152.
Smallnorb [25], (24300, 18432, 5): Image classification dataset with 5 generic categories: four-legged animals, human figures, airplanes, trucks, and cars. Each object was taken from a variety of radial and azimuthal angles. Each sample consists of 2 images, the left and the right views, both of size 96 × 96. Both images were flattened to create a vector of length 2 × 96 × 96.

5.3.2 Validation

5.3.2.1 Clustering analysis

Since CCP uses clustering to partition features based on correlations, it is vital to analyze the clustering effectiveness. One of the best ways to evaluate the effectiveness of the clustering from k-medoids and/or k-means is to compute the loss function with respect to a varying number k of clusters. Then, using the elbow method, one finds the inflection point of the loss function and sets the corresponding k to be the optimal number of clusters. In this work, we present another method for visualizing the feature clusters using our R-S scores.

Figure 5.1 shows the loss function of the k-medoids feature clustering given in Eq. (5.8) for the Coil-20 dataset for N = 2 to 200. In addition, for N = 4, 16, 36, and 64, the clustering results were visualized using R-S scores. The red circles on the loss function correspond to N = 4, 16, 36, and 64, plotted in the four charts. Each section in a given chart represents one cluster, and the x and y axes represent the residue and similarity scores, respectively, for the cluster. For N = 4, the cluster colored in blue has a low similarity score, indicating that the data are spread out within the cluster; this indicates that a larger N is needed. From N = 36 to N = 64, the loss function has an elbow, indicating that these numbers of clusters are appropriate. The R-S plots in these two charts are more compact, indicating that the clustering performs well.

Figure 5.1 Illustration of the partition and R-S visualization of the 16384 features of the Coil-20 dataset into different numbers (N) of clusters. The line is the loss function defined in Eq. (5.8) with respect to different N values ranging from 2 to 200. Four red circles indicate N = 4, 16, 36, and 64, for which the corresponding R-S charts are given from left to right. Each section of the charts represents an individual feature cluster, with the x and y axes being the residue and similarity scores, respectively. The R-S indices for N = 4, 16, 36, and 64 are 0.842, 0.887, 0.952, and 0.992, respectively.

As speculated earlier, the R-S disparity may correlate with the convergence of clustering. To verify this speculation, we have computed the R-S disparity for the feature clustering results at N = 4, 16, 36, and 64. We found that $\mathrm{RSD}_{N=4} = 0.158$, $\mathrm{RSD}_{N=16} = 0.113$, $\mathrm{RSD}_{N=36} = 0.048$, and $\mathrm{RSD}_{N=64} = -0.008$. Therefore, the R-S disparity correlates strongly with the loss function of the k-medoids clustering shown in Figure 5.1.
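A minimal sketch of the loss-function scan behind the elbow analysis of Figure 5.1 is given below; it reuses the K-medoids clustering of Section 5.2.1 on a precomputed feature-feature distance matrix. The candidate values of N and the elbow heuristic (largest discrete second difference of the loss) are illustrative choices, not the exact procedure used to produce the figure.

```python
import numpy as np
from sklearn_extra.cluster import KMedoids

def kmedoids_loss_curve(D, candidate_N, random_state=0):
    """K-medoids loss of Eq. (5.8) as a function of the number of feature clusters N.

    D : (I, I) precomputed feature-feature distance matrix (e.g., covariance distance).
    Returns a list of (N, loss) pairs that can be plotted for the elbow analysis.
    """
    curve = []
    for N in candidate_N:
        km = KMedoids(n_clusters=N, metric="precomputed",
                      random_state=random_state).fit(D)
        curve.append((N, km.inertia_))
    return curve

def pick_elbow(curve):
    """Heuristic elbow: the N with the largest discrete second difference of the loss."""
    Ns = np.array([n for n, _ in curve], dtype=float)
    losses = np.array([l for _, l in curve])
    second_diff = np.diff(losses, 2)     # one value per interior N
    return int(Ns[1:-1][np.argmax(second_diff)])

# Example usage with a symmetric random matrix standing in for a feature distance matrix.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((200, 200))
    D = (A + A.T) / 2.0
    np.fill_diagonal(D, 0.0)
    curve = kmedoids_loss_curve(D, candidate_N=range(2, 40, 4))
    print("suggested N:", pick_elbow(curve))
```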
One of the advantages of the K-medoids method is that the cluster centers are always actual feature vectors. Since each medoid is one of the feature vectors, the method is much more efficient when dealing with high-dimensional data. Other clustering methods, such as k-means, spectral clustering, DBSCAN, and hierarchical clustering, can be utilized for the clustering as well.

5.3.2.2 Partition scheme evaluation

In order to explore the effectiveness of different partitions, we compare results obtained with three feature partitions: the correlation partition, the random equal partition, and the equal variance partition. In the random equal partition, the features are randomly shuffled and split into N equal-sized clusters (i.e., the N dimensions in CCP). Therefore, each cluster has the same number of features, which will be projected into a one-dimensional representation in CCP. In the equal variance partition, the features are normalized with respect to the largest variance and ordered, and are then split into N clusters such that all clusters have a similar amount of variance. In this partition, the first cluster contains the largest number of low-variance features, whereas the last cluster, cluster N, contains the smallest number of high-variance features. Notice that in the correlation partition, the numbers of features in the clusters also vary and are determined by minimization according to Eq. (5.8). The Leukemia and Carcinoma datasets were used to compare the three feature partition schemes. For both tests, 5-fold cross-validations with 10 random seeds were used for the dimensionality reduction, and k-NN was used to obtain the classification accuracy. For each fold of the partition, all results attained from the 10 seeds were included to evaluate the partition schemes.

Figure 5.2 Comparing the CCP-based classification effectiveness using the correlation partition, random equal partition, and equal variance partition of the features of the Leukemia dataset. (a) Comparison of the correlation partition, random equal partition, and equal variance partition-based dimensionality reduction and classification for the Leukemia dataset of 7070 features. The accuracies are computed from k-NN classifications using the reduced N features generated from CCP with the three partitions. (b) R-S plots of clusters generated from CCP-based classification using the correlation partition, equal variance partition, and random equal partition of the Leukemia dataset at N = 18. For FRI, the exponential kernel with κ = 2 and τ = 1.0 was used. For each test, results from all 10 seeds were plotted. From left to right: R-S plots of the correlation partition, equal variance partition, and random equal partition. The x-axis is the residue score, and the y-axis is the similarity score. Each section corresponds to one cluster, and the data were colored according to the predicted labels from the k-NN classification.

Figure 5.2(a) shows the accuracy of the CCP-based classification of the Leukemia dataset under various CCP reduced dimensions N equipped with the three feature partition schemes. The correlation partition outperforms both the random equal and equal variance partitions over all N values. In addition, as shown in Figure 5.2(b), in the R-S plots, the correlation partition outperforms the other two partitions in each cluster at N = 18. In particular, the random equal partition and the equal variance partition do not work well in classifying label 2.
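For reference, the two baseline partitions compared here can be generated as in the sketch below. The helper names are hypothetical, and the greedy sweep in the equal variance partition is one plausible reading of the description above (features normalized by the largest variance, ordered, and split so each group carries a comparable share of the total variance).

```python
import numpy as np

def random_equal_partition(n_features, N, seed=0):
    """Shuffle feature indices and split them into N equal-sized clusters."""
    idx = np.random.default_rng(seed).permutation(n_features)
    return np.array_split(idx, N)

def equal_variance_partition(Z, N):
    """Split variance-ordered features into N groups carrying similar total variance.

    Z : (M, I) data matrix. Variances are normalized by the largest variance and the
    features are sorted; a greedy sweep closes a group once it has accumulated
    roughly 1/N of the total normalized variance.
    """
    var = Z.var(axis=0)
    var = var / var.max()
    order = np.argsort(var)
    target = var.sum() / N
    groups, current, acc = [], [], 0.0
    for i in order:
        current.append(i)
        acc += var[i]
        if acc >= target and len(groups) < N - 1:
            groups.append(np.array(current))
            current, acc = [], 0.0
    groups.append(np.array(current))
    return groups

if __name__ == "__main__":
    Z = np.random.default_rng(1).normal(size=(72, 7070))   # Leukemia-sized toy example
    print([len(g) for g in random_equal_partition(7070, 18)][:5])
    print([len(g) for g in equal_variance_partition(Z, 18)][:5])
```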
Figure 5.3 shows the accuracies of the CCP-based classifications of the Carcinoma dataset under various CCP reduced dimensions N equipped with the three feature partition schemes. The correlation partition outperforms both the random equal and equal variance partitions over all N values. In addition, as shown in Figure 5.4, in the R-S plots, the correlation partition outperforms the other two partitions in each cluster at N = 46.

Figure 5.3 Comparison of the correlation partition, random equal partition, and equal variance partition-based dimensionality reduction and classification for the Carcinoma dataset of 9182 features. The accuracies are computed from k-NN classifications using the reduced N features generated from CCP with the three partitions.

Figure 5.4 Comparing the CCP-based classification effectiveness using the correlation partition, random equal partition, and equal variance partition of the features of the Carcinoma dataset. For FRI, the exponential kernel with κ = 2 and τ = 2.0 was used. For each test, the results of all 10 seeds were plotted. From left to right: R-S plots of the correlation partition, equal variance partition, and random equal partition. The x-axis is the residue score, and the y-axis is the similarity score. Each section corresponds to one cluster, and the data were colored according to the predicted labels from k-NN.

5.3.3 Comparison with other dimensionality reduction methods

In this section, we compare CCP's performance with UMAP, PCA, LLE, and Isomap on the ALL-AML, TCGA-PANCAN, Coil-20, and Coil-100 datasets. For each dataset, we performed 5-fold or 10-fold cross-validation, depending on the size of the dataset, to test the accuracy using k-nearest neighbors (k-NN). Results of all 10 random seeds were used in the performance evaluation.

5.3.3.1 ALL-AML

Figure 5.5 Accuracy of k-NN classification of the ALL-AML dataset when the dimension is reduced to N by using CCP, UMAP, PCA, LLE, and Isomap. Here, a 5-fold cross-validation with 10 random seedings was used. The test-train split was done prior to the dimensionality reduction. For CCP, the exponential kernel with κ = 1 and τ = 2.0 was used. The sample size, feature size, and the number of classes of the ALL-AML dataset are 72, 7129, and 2, respectively.

Figure 5.6 Visualization of the ALL-AML dataset when the dimension is reduced by CCP with the exponential kernel, κ = 1 and τ = 2.0, to N = 36. Each section represents a class, and the samples were colored according to their predicted labels from the k-NN classification via 5-fold cross-validation. Results of all 10 seeds were used for the visualization. The x and y axes are the residue and similarity scores, respectively. The sample size, feature size, and the number of classes of the ALL-AML dataset are 72, 7129, and 2, respectively.

The dimension of the ALL-AML dataset was reduced using the exponential kernel with κ = 1 and τ = 2.0. Figure 5.5 shows the performance of CCP, UMAP, PCA, Isomap, and LLE. Here, a 5-fold cross-validation with 10 random seeds was used. CCP performs better than the other algorithms and is stable over a wide range of N values. All other methods show a drop in their accuracy beyond dimension N = 36. Since the ALL-AML dataset only has 72 samples, UMAP, PCA, LLE, and Isomap cannot reduce the ALL-AML dimension to N > 72 because their dimension is limited by the size of the matrix diagonalization.

Figure 5.7 Visualization of the ALL-AML dataset when the dimension was reduced to N = 36 by using CCP, UMAP, and PCA.
Since there are only 72 samples in the ALL-AML dataset, results from all 10 seeds were plotted, leading to 720 sample points in the plot. For CCP, the exponential kernel with κ = 1 and τ = 2.0 was used. The x and y axes are the residue and similarity scores, respectively. Each row is for one class, and the data points are colored based on the predicted labels from the k-NN classifier, using 5-fold cross-validation. The sample size, feature size, and the number of classes of the ALL-AML dataset are 72, 7129, and 2, respectively.

Figure 5.6 shows the R-S plot of the ALL-AML dataset when the dimension is reduced by CCP to N = 36. The left and right sections correspond to the 2 classes. Samples were plotted according to their 36 features and colored with the predicted labels from k-NN. Results of all 10 seeds were plotted into one chart (i.e., 720 samples), and the residue and similarity scores were calculated separately for each random seed. The x and y axes are the residue and similarity scores, respectively. Class 2 has a better R-S distribution.

Figure 5.7 shows the R-S plots of the ALL-AML dataset when the feature dimension is reduced to N = 36 by using CCP, UMAP, and PCA. Results of all 10 random seeds were used in the visualization to obtain a better understanding of the performance, and the residue and similarity scores were computed separately for each seed. In each class, the samples were colored according to their predicted labels obtained from k-NN. The x and y axes of each R-S plot are the residue and similarity scores, respectively. The top row is class 1 (ALL), and the bottom row is class 2 (AML). The number inside each plot is the accuracy for that class. Notice that UMAP's R-S plot indicates that UMAP's reduction has a low similarity score, which explains its low accuracy. On the other hand, PCA has higher accuracy than UMAP, but most AML samples are mislabeled. This indicates that PCA is unable to distinguish the difference between ALL and AML when N = 36.

5.3.3.2 TCGA-PANCAN

For CCP, the dimension of the TCGA-PANCAN dataset was reduced using the Lorentz kernel with κ = 1 and τ = 1.0. Figure 5.8 shows the performance comparison of CCP, UMAP, PCA, Isomap, and LLE. Here, a 5-fold cross-validation with 10 random seeds was used. Notice that CCP is comparable to Isomap and LLE in accuracy, whereas UMAP and PCA are unstable at higher dimensions. Figure 5.9 shows the R-S plot of the TCGA-PANCAN dataset when the dimension was reduced by CCP to N = 103. Each section corresponds to one of the 5 classes of TCGA-PANCAN. Samples were plotted according to the 103 features and colored with the predicted labels from k-NN. The x and y axes are the residue and similarity scores, respectively. Figure 5.10 shows the R-S plots of TCGA-PANCAN when the dimension is reduced to N = 103 by using CCP, UMAP, and PCA, respectively. The samples were plotted based on the 103 features and colored with their predicted labels from k-NN. The x and y axes of each plot are the residue and similarity scores, respectively. Each row corresponds to one of the 5 classes, and the number inside each plot is the accuracy for that class. Notice that UMAP has a cluster in each plot, but the cluster has a low similarity score. This means that in UMAP's embedding, the samples within each class are not near each other, which results in low accuracy. PCA has comparable accuracy to CCP, but CCP has a notable improvement for classes 1 and 4.
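The evaluation protocol used throughout this section (test-train split before the reduction, followed by k-NN, averaged over folds and seeds) can be sketched with scikit-learn as below. PCA stands in for the reducer because CCP itself is not part of scikit-learn, and the standard scaling after reduction follows the setup described later in this chapter; the fold counts, seeds, and toy data sizes are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

def cv_accuracy(X, y, n_components, n_splits=5, seeds=range(10)):
    """k-NN accuracy with the reducer fit on the training fold only.

    Mirrors the protocol of Section 5.3: the test-train split is done prior to the
    dimensionality reduction, and results are averaged over folds and random seeds.
    PCA is used here as a stand-in reducer.
    """
    scores = []
    for seed in seeds:
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for train, test in skf.split(X, y):
            reducer = PCA(n_components=n_components).fit(X[train])
            scaler = StandardScaler().fit(reducer.transform(X[train]))
            Xtr = scaler.transform(reducer.transform(X[train]))
            Xte = scaler.transform(reducer.transform(X[test]))
            knn = KNeighborsClassifier().fit(Xtr, y[train])
            scores.append(knn.score(Xte, y[test]))
    return float(np.mean(scores))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(72, 500))          # ALL-AML-sized toy example
    y = rng.integers(0, 2, size=72)
    print("mean accuracy:", round(cv_accuracy(X, y, n_components=36), 3))
```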
Figure 5.8 Accuracy of k-NN classification of the TCGA-PANCAN dataset when the dimension is reduced to N by using CCP, UMAP, PCA, LLE, and Isomap. Here, a 5-fold cross-validation with 10 random seedings was used, and the test-train split was done prior to the reduction. For CCP, the Lorentz kernel with κ = 1 and τ = 1.0 was used. The sample size, feature size, and the number of classes of TCGA-PANCAN are 801, 20531, and 5, respectively.

Figure 5.9 Visualization of the TCGA-PANCAN dataset when the dimension is reduced to N = 103 by using CCP with the Lorentz kernel, κ = 1 and τ = 1.0. Each section represents a different class. The samples were plotted based on the 103 features and colored with their predicted labels from the k-NN classification via 5-fold cross-validation. The x and y axes are the residue and similarity scores, respectively. The sample size, feature size, and the number of classes of TCGA-PANCAN are 801, 20531, and 5, respectively.

Figure 5.10 Visualization of the TCGA-PANCAN dataset when the dimension is reduced to N = 103 by using CCP, UMAP, and PCA. For CCP, the Lorentz kernel with κ = 1 and τ = 1.0 was used. The x and y axes are the residue and similarity scores, respectively. Each row contains one class. The data are plotted based on the 103 features and colored with the predicted labels of the k-NN classifier, using 5-fold cross-validation. The sample size, feature size, and the number of classes of TCGA-PANCAN are 801, 20531, and 5, respectively.

5.3.3.3 Coil-20

The dimension of the Coil-20 dataset was reduced using the Lorentz kernel with κ = 1 and τ = 6.0. Figure 5.11 shows the performance of CCP, UMAP, PCA, Isomap, and LLE. The 10-fold cross-validation with 10 random seeds was used. CCP has the best performance of the 5 algorithms and maintains its accuracy in higher dimensions. PCA also has high accuracy but loses its accuracy in higher dimensions. Figure 5.12 shows the R-S plot of the Coil-20 dataset when the dimension is reduced by CCP to N = 82. Each section corresponds to one of the 20 classes of Coil-20. Samples were plotted based on the 82 features and colored with the predicted labels from k-NN. The x and y axes are the residue and similarity scores, respectively. Figure 5.13 shows the R-S plots of Coil-20 when its dimension is reduced to N = 82 by using CCP, UMAP, and PCA. Samples in each class are plotted according to their 82 features and colored according to their predicted labels from k-NN. The x and y axes of each plot are the residue and similarity scores, respectively. Each row corresponds to one of the 20 classes, and the number inside each plot is the classification accuracy for that class. Notice that all of UMAP's visualizations show a poor distribution in the bottom right, indicating that the residue score is high and the similarity score is low, which gives rise to poor performance in the classification. In order to further investigate the performance, labels 1, 2, and 3 were visualized in Figure 5.15. In the zoomed-in view, there are small subclusters within each plot, which come from the different folds of the cross-validation.

Figure 5.11 Accuracy of k-NN classification of the Coil-20 dataset when its dimension is reduced to different dimensions N by using CCP, UMAP, PCA, LLE, and Isomap. The 10-fold cross-validation with 10 random seedings was used, and the test-train split was done prior to the dimensionality reduction. For CCP, the Lorentz kernel with κ = 1 and τ = 6.0 was used.
The sample size, feature size, and the number of classes of the Coil-20 dataset are 1440, 16384, and 20, respectively.

Figure 5.12 Visualization of the Coil-20 dataset when the dimension is reduced to N = 82 by using CCP with the Lorentz kernel, κ = 1 and τ = 6.0. Each section represents a different class, and the samples were plotted based on the 82 features and colored with their predicted labels from the k-NN classification via 10-fold cross-validation. The x and y axes are the residue and similarity scores, respectively. The sample size, feature size, and the number of classes of the Coil-20 dataset are 1440, 16384, and 20, respectively.

Figure 5.13 Visualization of the Coil-20 dataset for classes 1 to 10 when the dimension is reduced to N = 82 by using CCP, UMAP, and PCA. For CCP, the Lorentz kernel with κ = 1 and τ = 6.0 was used. The x and y axes are the residue and similarity scores, respectively.

Figure 5.14 Visualization of the Coil-20 dataset for classes 11 to 20 when the dimension is reduced to N = 82 by using CCP, UMAP, and PCA. For CCP, the Lorentz kernel with κ = 1 and τ = 6.0 was used. The x and y axes are the residue and similarity scores, respectively.

Figure 5.15 Visualization of classes 1, 2, and 3 of the Coil-20 dataset when the data dimension is reduced to N = 82 by UMAP. The figures are zoomed-in views. The data were plotted based on the 82 features and colored according to their predicted labels from the k-NN classifier using 10-fold cross-validation. Label 1 has an accuracy of 0.125, whereas labels 2 and 3 have an accuracy of 0.000.

5.3.3.4 Coil-100

The dimension of the Coil-100 dataset was reduced using the exponential kernel with κ = 1 and τ = 6.0. Figure 5.16 shows the performance of CCP, UMAP, PCA, Isomap, and LLE. Here, a 10-fold cross-validation with 10 random seeds was used. CCP, PCA, LLE, and Isomap have comparable results, whereas UMAP is unstable at higher dimensions N. The best performance of UMAP was not as good as those of CCP and PCA. This indicates that Coil-100 has a high intrinsic dimension, for which UMAP has poor performance.

Figure 5.16 Accuracy of k-NN classification of the Coil-100 dataset when the dimension is reduced to N by using CCP, UMAP, PCA, LLE, and Isomap. The 10-fold cross-validation with 10 random seedings was used, and a test-train split was done prior to the reduction. For CCP, the Lorentz kernel with κ = 1 and τ = 6.0 was used. The sample size, feature size, and the number of classes of the Coil-100 dataset are 7200, 49152, and 100, respectively.

5.4 Discussion

5.4.1 Centrality-based CCP

CCP uses FRI to project a group of correlated features into a 1D representation. If we observe the projection in a graph setting, the FRI projection can be viewed as computing the degree centrality of a graph. That is, let $Z \in \mathbb{R}^{M \times I}$ be the data, with M samples and I features. For each partition, we can define a graph $G_n = (V_n, E_n, W_n)$, $n = 1, 2, \ldots, N$, where $V_n$, $E_n$, and $W_n$ are the vertex, edge, and weight sets of the graph of the nth component, respectively. The weights are precisely the kernels defined in Eq. (5.12). Then, the FRI projection for $x^n_i$ can be viewed as the degree centrality ($C_d$) of a weighted graph,
$$C_d(z^{S_n}_i) = \sum_{z^{S_n}_j} \Phi_n(\|z^{S_n}_i - z^{S_n}_j\|; \eta_n, \tau, \kappa), \tag{5.16}$$
where $C_d(z^{S_n}_i)$ is the degree centrality of the vertex $z^{S_n}_i$. In this case, we treat each data entry $z^{S_n}_i$ as a vertex. Instead of using the FRI projection, we can impose a traditional graph-based approach, setting the edge weight $\omega^n_{ij} = 1$ for all $1 \le i, j \le M$ and $1 \le n \le N$ when the node-node distance satisfies a cutoff. That is, instead of applying Eq. (5.12),
we take $A^n = \{A^n_{ij}\}$,
$$A^n_{ij} = \begin{cases} 1, & \|z^{S_n}_i - z^{S_n}_j\| < r^n_c, \\ 0, & \text{otherwise}, \end{cases} \quad 1 \le i, j \le M. \tag{5.17}$$
Here, instead of writing $C^n_{ij}$ as in Eq. (5.12), we use $A^n_{ij}$ to denote the adjacency matrix of the graph, and $r^n_c$ is the cutoff distance. Then, the reduced new variables $x^n_i$ can be computed by replacing $\Phi_n(\|z^{S_n}_i - z^{S_n}_j\|; \eta_n, \tau, \kappa)$ in Eq. (5.15) with $A^n_{ij}$. In such a manner, we can implement other centrality formulations, such as degree centrality, closeness centrality [26], betweenness centrality [27], and eigenvector centrality [28], in CCP.

Figure 5.17 shows the accuracy of using different centrality formulations instead of the FRI projection. Using the adjacency matrix, the degree centrality, closeness centrality, betweenness centrality, and eigenvector centrality were computed with $r_c = 0.7 d_{\max}$, where $d_{\max}$ is the maximum pairwise distance between the input data. The performance of all methods is quite similar. However, the stability of computing the centrality is heavily reliant on $r_c$. Moreover, if the data are well clustered within each class, the graph may not be connected, which may affect the stability of the centrality computations.

Figure 5.17 The Coil-20 dataset was reduced using centrality formulations instead of the FRI projection. Degree, closeness, eigenvector, and betweenness centralities were tested, with $r_c = 0.7 d_{\max}$. The accuracy was calculated from 10-fold cross-validation with 10 random seeds. The sample size, feature size, and the number of classes of the Coil-20 dataset are 1440, 16384, and 20, respectively.

5.4.2 Correlation distance-based CCP

CCP utilizes the covariance distance in clustering to partition features. However, other distance metrics can be used in the clustering as well, depending on the size of the data and the relationship between the features. In particular, the correlation distance can be used instead of the covariance distance when the relationship between features is highly nonlinear. Figure 5.18 shows the effectiveness of correlation distance-based CCP when compared to covariance distance-based CCP and other DR algorithms. Notice that the correlation distance-based CCP significantly outperforms covariance-based CCP and the other DR algorithms. Therefore, correlation distance-based CCP can be employed if high accuracy is desirable. However, it is noted that correlation distance-based CCP is very time-consuming and memory-demanding. This limitation may constrain the use of correlation distance-based CCP for high-dimensional data with large data sizes.

Figure 5.18 Comparison between correlation distance-based partitioning and covariance distance-based partitioning of the ALL-AML dataset. k-NN with 10-fold cross-validation was used to compute the accuracy. The sample size, feature size, and the number of classes of ALL-AML are 72, 7129, and 2, respectively.

5.4.3 Parameter-free CCP

The performance of the proposed two-step CCP depends on a few parameters, such as the dimension N, the kernel type (i.e., generalized exponential or generalized Lorentz), the power (κ), and the scale (τ). Among them, the dimension may be chosen by the user. Although a set of default parameters is prescribed, it may not be optimal for different datasets, and it would be a burden for users to select parameters. Fortunately, CCP is very stable under subsampling. Therefore, we can use subsampling to search the optimal parameter range for a given dataset automatically. In this subsection, we show that CCP is stable under subsampling.
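A minimal sketch of such a subsampling-based parameter search is given below. The kernel grid, the subsample fraction, and the placeholder `ccp_reduce` function (standing in for an actual CCP reduction with a given kernel, κ, and τ) are illustrative assumptions, not the exact search used in the experiments that follow.

```python
import numpy as np
from itertools import product
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def search_kernel_parameters(X, y, ccp_reduce, n_components,
                             fraction=0.05, seed=0,
                             kappas=(1, 2), taus=(1.0, 2.0, 6.0)):
    """Pick (kernel, kappa, tau) on a small subsample, then reuse them on the full data.

    ccp_reduce(X, n_components, kernel, kappa, tau) is a placeholder for the CCP
    reduction; only a `fraction` of the samples is used during the search.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=max(2, int(fraction * len(X))), replace=False)
    Xs, ys = X[idx], y[idx]
    best, best_score = None, -np.inf
    for kernel, kappa, tau in product(("exponential", "lorentz"), kappas, taus):
        Xr = ccp_reduce(Xs, n_components, kernel, kappa, tau)
        score = cross_val_score(KNeighborsClassifier(), Xr, ys, cv=3).mean()
        if score > best_score:
            best, best_score = (kernel, kappa, tau), score
    return best

if __name__ == "__main__":
    # Toy demonstration with a dummy reducer (truncation) standing in for CCP.
    def dummy_ccp_reduce(X, n, kernel, kappa, tau):
        return X[:, :n]
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 50))
    y = rng.integers(0, 5, size=2000)
    print(search_kernel_parameters(X, y, dummy_ccp_reduce, n_components=10))
```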
To verify the stability claim, we test CCP on the Smallnorb dataset, which has 46,800 samples and 5 classes. Each sample consists of a binocular picture of an object of size 96×96 pixels, taken from different radial and azimuthal angles. We flattened each image and combined the images to make an 18,432-dimensional feature vector. We subsample 1%, 5%, 10%, and 20% of the samples to optimize the CCP kernel parameters, respectively. Then, based on these CCP parameter sets, we carry out the CCP dimensionality reduction of the whole dataset for classification. The resulting 10-fold cross-validation classification accuracies for the Smallnorb dataset are shown in Figure 5.19 for subsampling at 1%, 5%, 10%, and 20%. It is clear that the accuracy increases as the subsampling size is increased from 1% to 20%. However, the accuracy difference between 1% subsampling and 20% subsampling is under 2% for all classes. It is seen that under different subsampling ratios, CCP can capture the structure of the data. Even at 1% subsampling, CCP is still very accurate.

Figure 5.19 R-S plot visualization of Smallnorb classification using CCP with a reduction ratio of 400 (47 dimensions) at 1%, 5%, 10%, and 20% subsampling. Each row represents the data plotted based on the 47 features and colored with the predicted labels from the k-NN classifier, using 10-fold cross-validation. The number in each plot shows the accuracy within each label obtained with the subsampling-generated kernel parameters. The x and y axes are the residue and similarity scores, respectively.

Since CCP is very stable under subsampling, one can make CCP a parameter-free method by using a relatively small amount of data to determine the CCP parameters automatically. CCP's stability under subsampling implies that CCP can be used in the dynamic data acquisition of excessively large datasets. Newly collected data can be added to the existing data without the need to restart the CCP calculation from the very beginning.

5.4.4 Accuracy comparison using four classifiers

We have shown the effectiveness of CCP on various datasets. However, all the aforementioned analysis was based on the k-NN classifier. It is important to know whether the same pattern returns if other classification algorithms are employed. To this end, we compare CCP with other dimensionality reduction methods using k-NN, support vector machine (SVM), random forest (RF), and gradient boost decision tree (GBDT).

Figure 5.20 Comparison of the accuracy of CCP on a variety of datasets and classification algorithms. The rows represent four datasets, from top to bottom: ALL-AML, TCGA-PANCAN, Coil-20, and Coil-100. The columns are for four classification algorithms, namely, k-NN, SVM, RF, and GBDT. The x-axes are the reduced dimension N, and the y-axes are accuracy.

Figure 5.20 shows the comparison of CCP when utilizing k-NN, SVM, RF, and GBDT on the ALL-AML, TCGA-PANCAN, Coil-20, and Coil-100 datasets. The rows are the 4 datasets, and the columns are the 4 classification methods. For all the tests, sklearn's classification package was utilized. For k-NN and SVC, default parameters were used. For RF and GBDT, {n_estimators=1000, max_depth=7, min_samples_split=3, max_features='sqrt', n_jobs=-1} were used. For all tests, standard scaling was used after the reduction. First, CCP remains very competitive against all other dimensionality reduction methods over all datasets when other classifiers are employed.
The relative behaviors of all dimensionality reduction methods did not change much under different classifiers. Therefore, our earlier comparison is fair and our findings remain correct. Second, SVM appears to slightly improve the performance of CCP and PCA. However, LLE and Isomap do not work well with SVM. Third, UMAP did not perform well on ALL-AML, Coil-20, and Coil-100 when the k-NN method was used, and its performance does not improve much with SVM, RF, and GBDT. Its instability with relatively large reduced dimensions N persists over different classifiers. In fact, its best results never reached those of the other methods for these three datasets. A possible reason is that UMAP does not work well for data having moderately large intrinsic dimensions. Fourth, LLE had some instability on the TCGA-PANCAN and Coil-100 datasets. Because the input data led the computed matrix to become singular, some of the tests from the cross-validation were not computed. For these cases, the average was taken over the working tests. Finally, we noticed that all dimensionality reduction methods underperformed with RF for the Coil-100 dataset and with GBDT for the ALL-AML dataset. This behavior might be due to the fact that, for a given classifier, a uniform set of parameters was used for all datasets, and RF does not work well for large datasets.

5.4.5 Efficiency comparison

Although accuracy is very important, computational cost can be a crucial factor for huge datasets. In this section, we assess the computational times of various methods with elementary computer resource allocations. Specifically, 4 central processing units (CPUs) with 64GB of memory from the High-Performance Computing Center (HPCC) of Michigan State University were used for all methods and all datasets. Figure 5.21 shows the computational times of the dimensionality reduction methods on ALL-AML, TCGA-PANCAN, Coil-20, and Coil-100. For the ALL-AML and TCGA-PANCAN datasets, the average time from the 5-fold cross-validation over 10 random seeds was computed. For Coil-20 and Coil-100, the average time from the 10-fold cross-validation over 10 random seeds was recorded.

Figure 5.21 CPU run time comparison among CCP, UMAP, and PCA on the ALL-AML, TCGA-PANCAN, Coil-20, and Coil-100 datasets. For ALL-AML and TCGA-PANCAN, computational times for N = 10, 20, and 30 were calculated by taking the average of the 5-fold cross-validation over 10 random seeds. For Coil-20 and Coil-100, computational times for N = 10, 20, 30, and 40 were calculated by taking the average of the 10-fold cross-validation over 10 random seeds. In each chart, the x-axis corresponds to the reduced dimension N, and the y-axis is the average time (s).

PCA shows essentially the fastest computation for all datasets. Isomap and LLE have very similar behaviors for all datasets, and their time efficiencies are quite similar to that of PCA. UMAP is faster than CCP for Coil-20 and Coil-100. For ALL-AML and TCGA-PANCAN, CCP is faster because of the small data size. Note that Eq. (5.15) involves a summation over all samples that satisfy a cutoff within 3 standard deviations of the average pairwise distance. This cutoff can be reduced for faster computation; however, doing so reduces the overall accuracy. For CCP, because clustered features are projected independently, each reduced dimension can be computed independently using a parallel architecture. Similar parallel computations can be applied to different samples.
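To make this parallel structure concrete, the sketch below projects each feature cluster independently with joblib. The exponential-kernel projection follows Eq. (5.15) with unit weights and no cutoff, and the cluster list is a stand-in for the K-medoids partition, so this is a simplified illustration rather than the production implementation.

```python
import numpy as np
from joblib import Parallel, delayed

def project_cluster(Z_sub, tau=1.0, kappa=2.0):
    """FRI-type projection of one feature cluster into a single descriptor, cf. Eq. (5.15).

    Z_sub : (M, |S_n|) slice of the data restricted to one feature cluster.
    Returns a length-M vector (one reduced dimension).  Weights are set to 1.
    """
    diff = Z_sub[:, None, :] - Z_sub[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)                       # (M, M) pairwise distances
    eta = np.mean(np.sort(dist, axis=1)[:, 1])                 # average minimum distance
    return np.exp(-(dist / (tau * eta)) ** kappa).sum(axis=1)  # generalized exponential kernel

def ccp_project_parallel(Z, clusters, n_jobs=-1):
    """Project each feature cluster independently; `clusters` is a list of index arrays."""
    columns = Parallel(n_jobs=n_jobs)(
        delayed(project_cluster)(Z[:, idx]) for idx in clusters)
    return np.column_stack(columns)                            # (M, N) reduced representation

if __name__ == "__main__":
    Z = np.random.default_rng(0).normal(size=(300, 1000))
    clusters = np.array_split(np.arange(1000), 20)             # stand-in for the K-medoids partition
    X = ccp_project_parallel(Z, clusters)
    print(X.shape)   # (300, 20)
```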
Therefore, CCP can be further accelerated by using parallel and graphics processing unit (GPU) algorithms in practical applications.

5.5 Concluding remarks

Like other dimensionality reduction algorithms, CCP has its advantages and disadvantages. First, CCP is a unique data-domain method, and its features are highly interpretable. Because CCP partitions features into clusters according to some metric, such as the covariance distance or correlation distance, features with high correlation will perform better. One limitation of many methods relying on matrix diagonalization is that pairwise distance computation can encounter the "curse of dimensionality", where distance computation for high-dimensional data can become unreliable. By clustering features, CCP can compute distances more reliably because the dimension in each cluster is much lower. Moreover, CCP performs better for data with a large number of features, such as the TCGA-PANCAN, Coil-20, and Coil-100 datasets. Therefore, CCP is suitable for the dimensionality reduction of data with relatively large intrinsic dimensions, for which many other popular methods may not work well. However, for datasets with a smaller number of features, CCP may not be as good as other methods; in this case, dimensionality reduction is unnecessary anyway. Also, we noticed that CCP might not be as good as UMAP and some other frequency-domain methods for extremely low final dimensions, say N = 2 or 3.

In addition to doing well for data having moderately large intrinsic dimensions, CCP allows embedding for streaming datasets, such as molecular dynamics generated transient data. We have shown that CCP is stable under subsampling, which enables users to optimize the CCP model with a small portion of the initial data and allows subsequent data to be embedded with the initial set. We noticed that dimensionality reduction algorithms that rely on matrix diagonalization exhibit instability when dealing with streaming data.

Because CCP does not compute a nearest-neighbor graph and does not diagonalize any matrix, a traditional 2D plot does not give a meaningful visualization. However, each dimension of CCP is computed by projecting the partitioned features; hence, we can easily interpret each dimension of CCP. In tree-based classification algorithms, such as random forest and gradient boost decision trees, feature importance can be computed for each feature component, which gives a rank of how much impact each component has on the classification. For CCP, feature importance may be interpreted as how meaningful a set of highly correlated features is in the classification.

CCP can be further optimized in various ways. It allows a wide variety of alternative data-domain embedding strategies in each of its two steps: clustering and projection. For example, in the clustering step, one might select alternative distance metrics, clustering algorithms, and loss functions to optimize the feature vector partition for a given dataset. In the projection step, one might choose alternative distance metrics based on Riemannian geometry or statistical theories and select alternative projections based on linear/nonlinear, orthogonal/non-orthogonal, and Grassmannian considerations. A wide variety of multistep dimensionality reduction methods can be developed. Unlike frequency-domain dimensionality reduction techniques, CCP renders a data-domain representation of the original high-dimensional data.
Therefore, the resulting low-dimensional data can be reused as an input for a Secondary Dimensionality Reduction (SDR) with a frequency-domain technique to achieve specific goals. For example, one can use CCP as an initializer for local methods to capture global patterns [29]. The combination of CCP with UMAP and t-SNE, called CCP-UMAP and CCP-t-SNE, respectively, may generate better 2D visualizations for datasets with global structures. Additionally, for real-world problems, better accuracy is always desirable. New hybrid methods, such as three-step CCP-UMAP and CCP-Isomap, may achieve better dimensionality reduction performance for clustering, classification, and regression.

Finally, the R-S scores, R index, S index, R-S disparity, and R-S index introduced in this work can be used for general-purpose data visualization and analysis. The shape of data and the persistent Laplacian discussed in this work offer new geometric, topological, and spectral tools for data analysis and visualization.

Server availability

The CCP online server is available at https://weilab.math.msu.edu/CCP/.

BIBLIOGRAPHY

[1] Kelin Xia, Kristopher Opron, and Guo-Wei Wei. Multiscale multiphysics and multidomain models—flexibility and rigidity. The Journal of Chemical Physics, 139(19):11B614_1, 2013.

[2] Patrizio Frosini. Measuring shapes by size functions. In Intelligent Robots and Computer Vision X: Algorithms and Techniques, volume 1607, pages 122–133. International Society for Optics and Photonics, 1992.

[3] Herbert Edelsbrunner, David Letscher, and Afra Zomorodian. Topological persistence and simplification. In Proceedings 41st Annual Symposium on Foundations of Computer Science, pages 454–463. IEEE, 2000.

[4] Afra Zomorodian and Gunnar Carlsson. Computing persistent homology. Discrete & Computational Geometry, 33(2):249–274, 2005.

[5] Gunnar Carlsson. Topology and data. Bulletin of the American Mathematical Society, 46(2):255–308, 2009.

[6] Konstantin Mischaikow and Vidit Nanda. Morse theory for filtrations and efficient computation of persistent homology. Discrete & Computational Geometry, 50(2):330–353, 2013.

[7] K. L. Xia and G. W. Wei. Persistent homology analysis of protein structure, flexibility and folding. International Journal for Numerical Methods in Biomedical Engineering, 30:814–844, 2014.

[8] Jacob Townsend, Cassie Putman Micucci, John H Hymel, Vasileios Maroulas, and Konstantinos D Vogiatzis. Representation of molecular structures with persistent homology for machine learning applications in chemistry. Nature Communications, 11(1):1–9, 2020.

[9] Rui Wang, Duc Duy Nguyen, and Guo-Wei Wei. Persistent spectral graph. International Journal for Numerical Methods in Biomedical Engineering, 36(9):e3376, 2020.

[10] Jiahui Chen, Yuchi Qiu, Rui Wang, and Guo-Wei Wei. Persistent Laplacian projected Omicron BA.4 and BA.5 to become new dominating variants. Computers in Biology and Medicine, 151:106262, 2022.

[11] Duc Duy Nguyen, Zixuan Cang, and Guo-Wei Wei. A review of mathematical representations of biomolecular data. Physical Chemistry Chemical Physics, 22(8):4343–4367, 2020.

[12] Duc Duy Nguyen and Guo-Wei Wei. DG-GL: Differential geometry-based geometric learning of molecular datasets. International Journal for Numerical Methods in Biomedical Engineering, 35(3):e3179, 2019.

[13] Rundong Zhao, Menglun Wang, Jiahui Chen, Yiying Tong, and Guo-Wei Wei. The de Rham–Hodge analysis and modeling of biomolecules. Bulletin of Mathematical Biology, 82(8):1–38, 2020.
[14] Jiahui Chen, Rundong Zhao, Yiying Tong, and Guo-Wei Wei. Evolutionary de Rham-Hodge method. Discrete and Continuous Dynamical Systems. Series B, 26(7):3785, 2021.

[15] Gábor J Székely, Maria L Rizzo, and Nail K Bakirov. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794, 2007.

[16] Kristopher Opron, Kelin Xia, and Guo-Wei Wei. Fast and anisotropic flexibility-rigidity index for protein flexibility and fluctuation analysis. The Journal of Chemical Physics, 140(23):06B617_1, 2014.

[17] Duc Duy Nguyen and Guo-Wei Wei. AGL-Score: Algebraic graph learning score for protein–ligand binding scoring, ranking, docking, and screening. Journal of Chemical Information and Modeling, 59(7):3291–3304, 2019.

[18] Chris Ding and Hanchuan Peng. Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology, 3(02):185–205, 2005.

[19] Andrew I Su, John B Welsh, Lisa M Sapinoso, Suzanne G Kern, Petre Dimitrov, Hilmar Lapp, Peter G Schultz, Steven M Powell, Christopher A Moskaluk, Henry F Frierson, et al. Molecular classification of human carcinomas by use of gene expression signatures. Cancer Research, 61(20):7388–7393, 2001.

[20] Kun Yang, Zhipeng Cai, Jianzhong Li, and Guohui Lin. A stable gene selection in microarray data analysis. BMC Bioinformatics, 7(1):1–16, 2006.

[21] Todd R Golub, Donna K Slonim, Pablo Tamayo, Christine Huard, Michelle Gaasenbeek, Jill P Mesirov, Hilary Coller, Mignon L Loh, James R Downing, Mark A Caligiuri, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537, 1999.

[22] Kyle Chang, Chad J Creighton, Caleb Davis, Lawrence Donehower, Jennifer Drummond, David Wheeler, Adrian Ally, Miruna Balasundaram, Inanc Birol, Yaron SN Butterfield, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics, 45(10):1113–1120, 2013.

[23] Sameer A Nene, Shree K Nayar, Hiroshi Murase, et al. Columbia Object Image Library (COIL-20). 1996.

[24] Sameer A Nene, Shree K Nayar, Hiroshi Murase, et al. Columbia Object Image Library (COIL-100). 1996.

[25] Yann LeCun, Fu Jie Huang, and Léon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2:II–104 Vol.2, 2004.

[26] Linton C Freeman. Centrality in social networks conceptual clarification. Social Networks, 1(3):215–239, 1978.

[27] Linton C Freeman. A set of measures of centrality based on betweenness. Sociometry, pages 35–41, 1977.

[28] Phillip Bonacich. Power and centrality: A family of measures. American Journal of Sociology, 92(5):1170–1182, 1987.

[29] Dmitry Kobak and George C Linderman. Initialization is critical for preserving global data structure in both t-SNE and UMAP. Nature Biotechnology, 39(2):156–157, 2021.

CHAPTER 6
TOPOLOGICAL NONNEGATIVE MATRIX FACTORIZATION

6.1 Introduction

Nonnegative matrix factorization (NMF) is a dimensionality reduction method whose objective is to decompose the original count matrix into two nonnegative factor matrices [1, 2]. The columns of the resulting basis matrix are often referred to as meta-genes, and each represents a nonnegative linear combination of the original genes. Consequently, NMF results are highly interpretable. However, the original formulation employs a least-squares optimization scheme, making the method susceptible to outlier errors [3].
To address this issue, Kong et al. [4] introduced robust NMF (rNMF), or l2,1-NMF, which utilizes the l2,1-norm and can better handle outliers while maintaining computational efficiency comparable to standard NMF. Manifold regularization has also been employed to incorporate geometric structure into dimensionality reduction by utilizing a graph Laplacian, leading to graph regularized NMF (GNMF) [5]. Semi-supervised methods, such as those incorporating marker genes [6] or similarity and dissimilarity constraints [7], have been proposed to enhance NMF's robustness. Additionally, various other NMF derivatives have been introduced [8, 9, 10].

Despite these advancements, manifold regularization remains an essential component to ensure that the lower-dimensional representation of the data can form meaningful clusters. However, a graph Laplacian captures only a single scale of the data, set by the scaling factor in the heat kernel; single-scale graph Laplacians therefore lack multiscale information.

In this work, we introduce persistent Laplacian (PL)-regularized NMF, namely the topological NMF (TNMF) and robust topological NMF (rTNMF). Both TNMF and rTNMF can better capture multiscale geometric information than the standard GNMF and rGNMF. To achieve improved performance, the PL is constructed by observing sample-sample interactions at multiple scales through a filtration, creating a sequence of simplicial complexes. We can then view the spectra of each complex in the filtration to capture both topological and geometric information. Additionally, we introduce a k-NN based PL for TNMF and rTNMF, referred to as k-TNMF and k-rTNMF, respectively. The k-NN based PL reduces the number of hyperparameters compared to the standard PL algorithm.

The outline of this work is as follows. First, we provide a brief overview of NMF, rNMF, GNMF, and rGNMF. Next, we present a concise theoretical formulation of the PL and derive the multiplicative updating schemes for TNMF and rTNMF. Additionally, we introduce an alternative construction of the PL, termed the k-NN PL.

6.2 Prior Work

In this section, we provide an overview of NMF methods, including the standard NMF, the l2,1-NMF or rNMF, the graph regularized NMF (GNMF), and the robust graph regularized NMF (rGNMF), and we utilize single-cell RNA sequencing (scRNA-seq) data as an example for the interpretation of NMF. The notations and their descriptions are summarized in Table 6.1.

Table 6.1 Abbreviations and notations used in the methods.
Notation: Description
NMF: Nonnegative matrix factorization
rNMF: Robust nonnegative matrix factorization
GNMF: Graph regularized nonnegative matrix factorization
rGNMF: Robust graph regularized nonnegative matrix factorization
TNMF: Topological nonnegative matrix factorization
rTNMF: Robust topological nonnegative matrix factorization
k-TNMF: k-NN induced topological nonnegative matrix factorization
k-rTNMF: k-NN induced robust topological nonnegative matrix factorization
X ∈ R^{M×N}: Nonnegative data matrix with M genes and N cells
W ∈ R^{M×p}: The basis, or the meta-genes, where p is the rank
H ∈ R^{p×N}: Lower-dimensional representation of the data, where p is the rank
A ∈ R^{N×N}: Adjacency matrix
D ∈ R^{N×N}: Degree matrix
L ∈ R^{N×N}: Graph Laplacian, L = D - A
PL ∈ R^{N×N}: Persistent Laplacian, PL = PD - PA
PA ∈ R^{N×N}: Adjacency matrix associated with the PL
PD ∈ R^{N×N}: Degree matrix associated with the PL
ζ_t: The weight of the graph for the t-th filtration
λ: Hyperparameter for the regularized NMF

6.2.1 NMF

The original formulation of NMF utilizes the Frobenius norm, which assumes that the noise of the data is sampled from a Gaussian distribution. Let $X \in \mathbb{R}^{M \times N}$ be a nonnegative data matrix; in scRNA-seq, this is the gene-count matrix. The goal of NMF is to find the decomposition $X \approx WH$, where both $W \in \mathbb{R}^{M \times p}$ and $H \in \mathbb{R}^{p \times N}$ are nonnegative. Here, $p$ is the rank of the decomposition. The minimization problem is given as

$$\min_{W,H} \|X - WH\|_F^2, \quad \text{s.t. } W, H \geq 0, \qquad (6.1)$$

where $\|A\|_F^2 = \sum_{i,j} a_{ij}^2$. $W$ is the basis, whose columns are often called the meta-genes in scRNA-seq. Lee et al. proposed a multiplicative updating scheme, which preserves nonnegativity [1]. For the $(t+1)$-th iteration,

$$w_{ij}^{t+1} = w_{ij}^{t} \frac{(XH^T)_{ij}}{(WHH^T)_{ij}}, \qquad (6.2)$$

$$h_{ij}^{t+1} = h_{ij}^{t} \frac{(W^T X)_{ij}}{(W^T WH)_{ij}}. \qquad (6.3)$$

Although the updating scheme is simple and effective in many biological data applications, scRNA-seq data are sparse and contain a large amount of noise. Therefore, a model that is more robust to noise is necessary for feature selection and dimensionality reduction.

6.2.2 rNMF

The robust NMF (rNMF) utilizes the $l_{2,1}$ norm, which assumes that the noise of the data is sampled from a Laplace distribution; this may be more suitable for a count-based data matrix such as scRNA-seq. The minimization problem is given as

$$\min_{W,H} \|X - WH\|_{2,1}, \quad \text{s.t. } W, H \geq 0,$$

where $\|A\|_{2,1} = \sum_j \|a_j\|_2$. Because the $l_{2,1}$-norm sums the $l_2$ distances between each original cell feature and its reduced representation, outliers do not dominate the loss function as much as in the Frobenius-norm formulation. rNMF has the following updating scheme

$$w_{ij}^{t+1} = w_{ij}^{t} \frac{(XQH^T)_{ij}}{(WHQH^T)_{ij}}, \qquad (6.4)$$

$$h_{ij}^{t+1} = h_{ij}^{t} \frac{(W^T XQ)_{ij}}{(W^T WHQ)_{ij}}, \qquad (6.5)$$

where $Q_{jj} = 1/\|x_j - Wh_j\|_2$.

6.2.2.1 GNMF and rGNMF

Manifold regularization has been widely utilized in scRNA-seq. Let $G(V, E, W)$ be a graph, where $V = \{x_j\}_{j=1}^{N}$ is the set of vertices, $E = \{(x_i, x_j) \mid x_i \in N_k(x_j) \cup x_j \in N_k(x_i)\}$ is the set of edges, and $W$ is the set of weights associated with the edges. Here, $N_k(x_j)$ denotes the k nearest neighbors of vertex $j$. For the weight between vertices $i$ and $j$, denoted $\omega_{ij}$, we chose a decaying function with the following properties:

$$\omega_{ij} \to 0 \text{ as } \|x_i - x_j\| \to \infty, \quad \omega_{ij} \to 1 \text{ as } \|x_i - x_j\| \to 0. \qquad (6.6)$$

A common choice for such a function is a radial basis function, for example, the heat kernel

$$\omega_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma}\right), \qquad (6.7)$$

where $\sigma$ is the scale of the kernel. We can then represent the weights as an adjacency matrix $A$,

$$A_{ij} = \begin{cases} \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma}\right), & x_j \in N_k(x_i) \\ 0, & \text{otherwise}. \end{cases}$$
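As a concrete illustration, the following sketch builds the k-NN heat-kernel adjacency matrix of Equation 6.7 and the corresponding single-scale graph Laplacian L = D - A; the function name, the choice of sigma, and the use of scikit-learn's nearest-neighbor search are assumptions of this sketch rather than part of the GNMF formulation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def heat_kernel_graph_laplacian(X, k=5, sigma=1.0):
    """k-NN heat-kernel adjacency A (Eq. 6.7) and graph Laplacian L = D - A.
    X: data matrix with one sample (cell) per row."""
    n = X.shape[0]
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)   # k+1: each point is its own nearest neighbor
    dist, idx = nbrs.kneighbors(X)

    A = np.zeros((n, n))
    for i in range(n):
        for d, j in zip(dist[i, 1:], idx[i, 1:]):        # skip the self-neighbor
            A[i, j] = np.exp(-d**2 / sigma)              # heat-kernel weight
    A = np.maximum(A, A.T)                               # edge if i in N_k(j) or j in N_k(i)

    D = np.diag(A.sum(axis=1))                           # degree matrix
    L = D - A                                            # single-scale graph Laplacian
    return A, D, L
```

This single-scale L is what GNMF and rGNMF regularize with; the persistent Laplacian introduced in Section 6.3 replaces it with a weighted family of Laplacians obtained from a filtration.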
We can now construct the graph regularization term, $R_G$, from the distances $\|h_i - h_j\|^2$ and the adjacency matrix:

$$R_G = \frac{1}{2}\sum_{i,j} A_{ij}\|h_i - h_j\|^2 = \sum_i D_{ii} h_i^T h_i - \sum_{i,j} A_{ij} h_i^T h_j = \mathrm{Tr}(HDH^T) - \mathrm{Tr}(HAH^T) = \mathrm{Tr}(HLH^T).$$

Here, $L$ and $D$ are the Laplacian and the degree matrix, given by $L = D - A$ and $D_{ii} = \sum_j A_{ij}$, respectively, and $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix. Utilizing the regularization parameter $\lambda \geq 0$, we get the objective function of GNMF,

$$\min_{W,H} \|X - WH\|_F^2 + \lambda \mathrm{Tr}(HLH^T), \qquad (6.8)$$

and the objective function of rGNMF,

$$\min_{W,H} \|X - WH\|_{2,1} + \lambda \mathrm{Tr}(HLH^T). \qquad (6.9)$$

6.3 Topological NMF

While graph regularization improves the traditional NMF and rNMF, the choice of $\sigma$ can vastly change the result. Furthermore, graph regularization captures only a single scale and may not capture the multiscale geometric information in the data. In this section, we only show the construction of the persistent Laplacian; its theoretical formulation can be found in Section 2.4.

6.3.1 TNMF and rTNMF

For scRNA-seq data, we calculate the 0-persistent Laplacian using the Vietoris-Rips (VR) complexes by increasing the filtration distance. We can then take a weighted sum over the 0-persistent Laplacians induced by the changes in the filtration distance. For persistent Laplacian enhanced NMF, we provide a computationally efficient algorithm to construct the persistent Laplacian matrix. Let $L$ be a Laplacian matrix induced by some weighted graph, and note the following:

$$L_{ij} = \begin{cases} l_{ij}, & i \neq j \\ -\sum_{j=1}^{N} l_{ij}, & i = j. \end{cases} \qquad (6.10)$$

Then, let $l_{\max} = \max_{i\neq j} l_{ij}$, $l_{\min} = \min_{i\neq j} l_{ij}$, and $d = l_{\max} - l_{\min}$. The $t$-th persistent Laplacian $L^t$, $t = 1, \dots, T$, is defined as $L^t = \{l^t_{ij}\}$, where

$$l^t_{ij} = \begin{cases} 1, & l_{ij} \leq (t/T)d + l_{\min} \\ 0, & \text{otherwise}, \end{cases} \qquad l^t_{ii} = -\sum_{i \neq j} l^t_{ij}. \qquad (6.11)$$

Then, we take the weighted sum over all the persistent Laplacians,

$$PL := \sum_{t=1}^{T} \zeta_t L^t. \qquad (6.12)$$

Unlike the standard Laplacian matrix $L$, the PL captures the topological features that persist over different filtrations, thus providing a multiscale view of the data that the standard Laplacian lacks. Here, $\zeta_t$ is a hyperparameter and must be chosen. Then, the PL-regularized NMF, which we call topological nonnegative matrix factorization (TNMF), is defined as

$$\min_{W,H} \|X - WH\|_F^2 + \lambda\mathrm{Tr}(H(PL)H^T), \qquad (6.13)$$

and the robust topological NMF (rTNMF) is defined as

$$\min_{W,H} \|X - WH\|_{2,1} + \lambda\mathrm{Tr}(H(PL)H^T). \qquad (6.14)$$

6.3.2 Multiplicative Updating Scheme

The updating scheme follows the same principle as the standard GNMF and rGNMF.

TNMF. For TNMF, the Lagrangian function is defined as

$$\mathcal{L} = \|X - WH\|_F^2 + \lambda\mathrm{Tr}(H(PL)H^T) + \mathrm{Tr}(\Phi W) + \mathrm{Tr}(\Psi H) \qquad (6.15)$$

$$= \mathrm{Tr}(X^TX) - 2\mathrm{Tr}(XH^TW^T) + \mathrm{Tr}(WHH^TW^T) + \lambda\mathrm{Tr}(H(PL)H^T) + \mathrm{Tr}(\Phi W) + \mathrm{Tr}(\Psi H). \qquad (6.16)$$

Taking the partial derivative with respect to W, we get

$$\frac{\partial \mathcal{L}}{\partial W} = -2XH^T + 2WHH^T + \Phi. \qquad (6.17)$$

Using the KKT condition $\Phi_{ij}w_{ij} = 0$, we get

$$(-2XH^T)_{ij}w_{ij} + (2WHH^T)_{ij}w_{ij} = 0. \qquad (6.18)$$

Therefore, the updating scheme is

$$w^{t+1}_{ij} \leftarrow w^{t}_{ij}\frac{(XH^T)_{ij}}{(WHH^T)_{ij}}. \qquad (6.19)$$

For updating H, we take the derivative of the Lagrangian function with respect to H,

$$\frac{\partial \mathcal{L}}{\partial H} = -2W^TX + 2W^TWH + 2\lambda H(PL) + \Psi. \qquad (6.20)$$

Using the Karush-Kuhn-Tucker (KKT) condition $\Psi_{ij}h_{ij} = 0$, we obtain

$$-2(W^TX + \lambda H(PA))_{ij}h_{ij} + 2(W^TWH + \lambda H(PD))_{ij}h_{ij} = 0, \qquad (6.21)$$

where $PL = PD - PA$ and $PD_{ii} = \sum_{j \neq i} PA_{ij}$. The updating scheme is then given by

$$h^{t+1}_{ij} \leftarrow h^{t}_{ij}\frac{(W^TX + \lambda H(PA))_{ij}}{(W^TWH + \lambda H(PD))_{ij}}. \qquad (6.22)$$
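To make the update rules concrete, the following numpy sketch applies one iteration of Equations 6.19 and 6.22, taking the persistent adjacency PA and persistent degree PD matrices as inputs; the small epsilon guarding the denominators and the function name are assumptions of this illustration, not part of the derivation.

```python
import numpy as np

def tnmf_update(X, W, H, PA, PD, lam=1.0, eps=1e-10):
    """One multiplicative update of TNMF (Eqs. 6.19 and 6.22).
    X: (M, N) nonnegative data, W: (M, p) basis, H: (p, N) representation,
    PA, PD: (N, N) persistent adjacency and degree matrices, lam: regularization weight."""
    # Eq. 6.19: W update (identical in form to standard NMF)
    W = W * (X @ H.T) / (W @ H @ H.T + eps)
    # Eq. 6.22: H update, using the split of the persistent Laplacian PL = PD - PA
    H = H * (W.T @ X + lam * H @ PA) / (W.T @ W @ H + lam * H @ PD + eps)
    return W, H
```

With lam = 0 this reduces to the standard Lee-Seung updates of Equations 6.2 and 6.3, and replacing PA and PD with the single-scale A and D recovers the GNMF update.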
6.3.2.1 rTNMF

For the updating scheme of rTNMF, we utilize the fact that $\|A\|_{2,1} = \mathrm{Tr}(AQA^T)$, where $Q_{ii} = \frac{1}{2\|A_i\|_2}$. The Lagrangian is given by

$$\mathcal{L} = \|X - WH\|_{2,1} + \lambda\mathrm{Tr}(H(PL)H^T) + \mathrm{Tr}(\Phi W) + \mathrm{Tr}(\Psi H) \qquad (6.23)$$

$$= \mathrm{Tr}((X - WH)Q(X - WH)^T) + \lambda\mathrm{Tr}(H(PL)H^T) + \mathrm{Tr}(\Phi W) + \mathrm{Tr}(\Psi H) \qquad (6.24)$$

$$= \mathrm{Tr}(XQX^T) - 2\mathrm{Tr}(WHQX^T) + \mathrm{Tr}(WHQH^TW^T) + \lambda\mathrm{Tr}(H(PL)H^T) + \mathrm{Tr}(\Phi W) + \mathrm{Tr}(\Psi H), \qquad (6.25)$$

where $Q_{jj} = \frac{1}{\|x_j - Wh_j\|_2}$. Taking the partial derivative with respect to W, we get

$$\frac{\partial\mathcal{L}}{\partial W} = -(XQH^T) + WHQH^T - \Phi. \qquad (6.26)$$

Using the KKT condition $\Phi_{ij}w_{ij} = 0$, we get

$$-(XQH^T)_{ij}w_{ij} + (WHQH^T)_{ij}w_{ij} = 0, \qquad (6.27)$$

which gives the updating scheme

$$w^{t+1}_{ij} \leftarrow w^{t}_{ij}\frac{(XQH^T)_{ij}}{(WHQH^T)_{ij}}. \qquad (6.28)$$

For H, we take the partial derivative with respect to H,

$$\frac{\partial\mathcal{L}}{\partial H} = -W^TXQ + W^TWHQ + 2\lambda H(PL) + \Psi. \qquad (6.29)$$

Then, using the KKT condition $\Psi_{ij}h_{ij} = 0$, we get

$$(-W^TXQ - 2\lambda H(PA))_{ij}h_{ij} + (W^TWHQ + 2\lambda H(PD))_{ij}h_{ij} = 0, \qquad (6.30)$$

where $PL = PD - PA$, which gives the updating scheme

$$h^{t+1}_{ij} \leftarrow h^{t}_{ij}\frac{(W^TXQ + 2\lambda H(PA))_{ij}}{(W^TWHQ + 2\lambda H(PD))_{ij}}. \qquad (6.31)$$

6.3.3 k-NN induced Persistent Laplacian

One major issue with TNMF and rTNMF is that the parameters $\{\zeta_t\}_{t=1}^{T}$ have to be chosen. For these parameters, we let $\zeta_t \in \{0, 1, 1/2, \dots, 1/T\}$, i.e., $T + 1$ possible values per filtration step. Therefore, the number of parameter combinations that needs to be searched grows exponentially as the number of filtrations T increases. We therefore propose an approximation to the original formulation using a k-NN induced PL. Let $N_t(x_j)$ be the t nearest neighbors of sample $x_j$. First, we define the t-persistent directed adjacency matrix $\tilde{A}^t = \{\tilde{a}^t_{ij}\}$ as

$$\tilde{a}^t_{ij} = \begin{cases} 1, & x_j \in N_t(x_i) \\ 0, & \text{otherwise}. \end{cases} \qquad (6.32)$$

Then, the k-NN based directed adjacency matrix is the weighted sum of the $\{\tilde{A}^t\}$,

$$\tilde{A} := \sum_{t=1}^{T} \zeta_t \tilde{A}^t. \qquad (6.33)$$

The undirected persistent adjacency matrix can be obtained via symmetrization, $PA = \tilde{A} + \tilde{A}^T - \tilde{A} \otimes \tilde{A}^T$, where $\otimes$ denotes the Hadamard product. Then, the PL can be constructed using the persistent degree and persistent adjacency matrices,

$$PL = PD - PA, \qquad PD_{ii} = \sum_{j \neq i} PA_{ij}. \qquad (6.34)$$

One advantage of utilizing the k-NN induced persistent Laplacian is that the parameter space is much smaller. We can set $\zeta_t \in \{0, 1\}$, where $\zeta_t = 0$ 'turns off' the connectivity of that particular neighbor. In essence, the number of parameter combinations is reduced to $2^T$, a significant decrease from the $(T + 1)^T$ of the original formulation. We call the k-NN induced TNMF k-TNMF and the k-NN induced rTNMF k-rTNMF.
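The k-NN induced persistent Laplacian of Equations 6.32-6.34 can be assembled in a few lines. The sketch below is a minimal illustration using scikit-learn's nearest-neighbor search, with the zeta weights passed as a simple list; the function name and defaults are assumptions of this sketch rather than the released implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_persistent_laplacian(X, zetas):
    """k-NN induced persistent Laplacian (Eqs. 6.32-6.34).
    X: (N, d) samples; zetas: length-T sequence of weights, zetas[t-1] for the t-NN scale."""
    N, T = X.shape[0], len(zetas)
    nbrs = NearestNeighbors(n_neighbors=T + 1).fit(X)   # +1 because each point is its own neighbor
    _, idx = nbrs.kneighbors(X)

    A_tilde = np.zeros((N, N))
    for t in range(1, T + 1):
        # Eq. 6.32: directed adjacency of the t nearest neighbors (excluding the point itself)
        At = np.zeros((N, N))
        rows = np.repeat(np.arange(N), t)
        At[rows, idx[:, 1:t + 1].ravel()] = 1.0
        A_tilde += zetas[t - 1] * At                     # Eq. 6.33: weighted sum over scales

    PA = A_tilde + A_tilde.T - A_tilde * A_tilde.T       # symmetrization with the Hadamard product
    PD = np.diag(PA.sum(axis=1))                         # persistent degree matrix
    PL = PD - PA                                         # Eq. 6.34
    return PL, PA, PD
```

The resulting PA and PD are exactly what the tnmf_update sketch above consumes, so k-TNMF and k-rTNMF reuse the same multiplicative updates with this Laplacian swapped in.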
BIBLIOGRAPHY

[1] Daniel Lee and H Sebastian Seung. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, 13, 2000.

[2] Yu-Xiong Wang and Yu-Jin Zhang. Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on Knowledge and Data Engineering, 25(6):1336–1353, 2012.

[3] Weixiang Liu, Nanning Zheng, and Qubo You. Nonnegative matrix factorization and its applications in pattern recognition. Chinese Science Bulletin, 51:7–18, 2006.

[4] Deguang Kong, Chris Ding, and Heng Huang. Robust nonnegative matrix factorization using l21-norm. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 673–682, 2011.

[5] Qiu Xiao, Jiawei Luo, Cheng Liang, Jie Cai, and Pingjian Ding. A graph regularized non-negative matrix factorization method for identifying microRNA-disease associations. Bioinformatics, 34(2):239–248, 2018.

[6] Peng Wu, Mo An, Hai-Ren Zou, Cai-Ying Zhong, Wei Wang, and Chang-Peng Wu. A robust semi-supervised NMF model for single cell RNA-seq data. PeerJ, 8:e10091, 2020.

[7] Zhenqiu Shu, Qinghan Long, Luping Zhang, Zhengtao Yu, and Xiao-Jun Wu. Robust graph regularized NMF with dissimilarity and similarity constraints for scRNA-seq data clustering. Journal of Chemical Information and Modeling, 62(23):6271–6286, 2022.

[8] Wei Lan, Jianwei Chen, Qingfeng Chen, Jin Liu, Jianxin Wang, and Yi-Ping Phoebe Chen. Detecting cell type from single cell RNA sequencing based on deep bi-stochastic graph regularized matrix factorization. bioRxiv, pages 2022–05, 2022.

[9] Jin-Xing Liu, Dong Wang, Ying-Lian Gao, Chun-Hou Zheng, Jun-Liang Shang, Feng Liu, and Yong Xu. A joint-l2,1-norm-constraint-based semi-supervised feature extraction for RNA-seq data analysis. Neurocomputing, 228:263–269, 2017.

[10] Na Yu, Ying-Lian Gao, Jin-Xing Liu, Juan Wang, and Junliang Shang. Robust hypergraph regularized non-negative matrix factorization for sample clustering and feature selection in multi-view gene expression data. Human Genomics, 13(1):1–10, 2019.

CHAPTER 7
APPLICATION IN SINGLE CELL RNA SEQUENCING

7.1 Preprocessing of Single Cell RNA Sequencing data using Correlated Clustering and Projection

7.1.1 Introduction

In this section, we propose a computationally efficient and interpretable dimensionality reduction algorithm for scRNA-seq data called correlated clustering and projection (CCP) [1]. CCP begins by clustering genes based on their similarity and then uses the flexibility-rigidity index (FRI) [2] to nonlinearly project each gene cluster into a super-gene, which is a measure of accumulated gene-gene correlations among cells. Unlike traditional nonlinear reduction methods, CCP bypasses matrix diagonalization, allowing users to select the number of super-genes, which is beneficial for machine learning and deep learning tasks. Furthermore, similar to NMF's meta-genes, super-genes are all nonnegative and highly interpretable. We validated CCP's performance on 14 scRNA-seq datasets by varying the number of super-genes and conducting support vector machine classification and k-means clustering.

Additionally, we have validated the performance of a novel evaluation metric for dimensionality reduction, called the Residue-Similarity Index (RSI) [1]. RSI evaluates the intra-cluster similarity of cell types or clusters and compares it to their inter-cluster residue score. As RSI only requires one set of labels, which can be computed from k-means, it can measure the performance of dimensionality reduction for both clustering and classification tasks without requiring knowledge of the true labels. Furthermore, by analyzing the relationship between samples, RSI allows for a deeper understanding of the quality of the dimensionality reduction algorithm. We have verified the effectiveness of RSI alongside CCP on both clustering and classification tasks, and introduced the R-S plot as a novel visualization technique for data containing multiple cell types.

7.1.2 Results

Table 7.1 Accession ID, source organism, counts of samples, genes, and cell types, and normalization for the 14 datasets.
Accession ID | Reference | Organism | Samples | Genes | Cell types | Normalization
GSE45719 | Deng [3] | Mouse | 300 | 22431 | 8 | RPKM
GSE59114 | Kowalczyk [4] | Mouse | 1428 | 8422 | 6 | TPM
GSE67835 | Darmanis [5] | Human | 420 | 22084 | 8 | CPM
GSE75748 cell | Chu [6] | Human | 1018 | 19097 | 7 | TPM
GSE75748 time | Chu [6] | Human | 758 | 19189 | 6 | TPM
GSE82187 | Gokce [7] | Mouse | 705 | 18840 | 10 | TPM
GSE84133 h1 | Baron [8] | Human | 1937 | 20125 | 14 | TPM
GSE84133 h2 | Baron [8] | Human | 1724 | 20125 | 14 | TPM
GSE84133 h3 | Baron [8] | Human | 3605 | 20125 | 14 | TPM
GSE84133 h4 | Baron [8] | Human | 1308 | 20125 | 14 | TPM
GSE84133 m1 | Baron [8] | Mouse | 822 | 14878 | 13 | TPM
GSE84133 m2 | Baron [8] | Mouse | 1064 | 14878 | 13 | TPM
GSE89232 | Breton [9] | Human | 957 | 20689 | 4 | TPM
GSE94820 | Villani [10] | Human | 1140 | 26593 | 5 | TPM

CCP was benchmarked against PCA on 14 datasets, whose details can be found in Table 7.1. The data were normalized using either reads per kilobase of transcript per million (RPKM), transcripts per million (TPM), or counts per million (CPM). For each dataset, CCP was used to obtain N = 50, 100, 150, 200, 250, and 300 super-genes. The parameters κ and τ of the exponential kernel were searched over κ = 1, 2 and τ = 1, 2, ..., 6 and were set to τ = 6 and κ = 2. To test the reduction, 20 random seeds were used for CCP and PCA, and for each reduction, 30 random initializations of k-means were used to obtain cluster labels. After obtaining the cluster labels, ARI and NMI were computed by comparing the results to the labeled cell types, and the averages were visualized. In each figure, the red and blue lines represent CCP and PCA, respectively, and the star and dot markers indicate ARI and NMI, respectively.

7.1.2.1 CCP Benchmark

Figure 7.1 shows the performance of CCP and PCA on three datasets: GSE67835, GSE75748 time, and GSE59114. For GSE67835, CCP outperforms PCA at all the dimensions we tested. For GSE75748 time, CCP outperforms PCA for 50 super-genes and above, and its performance increases as the number of super-genes increases. PCA exhibits instability as N increases, which is noticeable from its decrease in performance from N = 50 to 150 for both datasets. CCP does not perform well on GSE59114: both ARI and NMI are less than 0.3 for all the dimensions we tested. CCP's performance may be poor due to the low intrinsic dimensionality of GSE59114. In other words, the number of gene clusters is inherently small, leading to redundant clusters. GSE59114, in particular, has only 8,422 genes, whereas the other datasets have over 15,000 genes.

Figure 7.1 ARI and NMI of the clustering results of CCP and PCA on GSE67835, GSE75748 time, and GSE59114 data. The red and blue lines correspond to CCP and PCA, respectively. A total of 20 random initializations were used to test the reduction, and for each reduction, a total of 30 random initializations were used to obtain the clustering results from k-means clustering. The averages of the ARI and NMI were obtained. For CCP, all tests utilize τ = 6 and κ = 2 for the exponential kernel.

In order to verify CCP's performance, the residue similarity index (RSI) was calculated for the k-means clustering result of the gene partitioning in CCP. Figure 7.2 shows the RSI of the k-means clustering on the genes at various numbers of gene clusters (k). The top row shows the clustering result for GSE59114, which had poor CCP performance, and the bottom row shows the clustering result for GSE67835, which had good CCP performance.
For each number of clusters, 10 random initializations were used for the k-means clustering, and the averages of the RI, SI, and RSI were obtained. The red, blue, and green lines correspond to RI, SI, and RSI, respectively. RSI can be used to check the quality of the clustering, where a peak in RSI suggests the optimal number of clusters, which in the case of CCP reflects the intrinsic dimensionality of the data. The right column shows the 2D visualization of the genes using t-SNE, with the samples colored according to their cluster labels. The t-SNE visualization of GSE59114 shows the k-means clustering result when k = 8 was selected. The t-SNE visualization of GSE67835 shows the k-means clustering result when k = 64 was selected; seven of the 64 clusters were colored, and the green samples are the rest of the genes.

Notice that in GSE59114 there is a noticeable peak in the RSI score at k = 8 clusters, whereas in GSE67835 the peak is flat and occurs at about k = 32-64 clusters. This suggests that the intrinsic dimensionality is about 8 for GSE59114, which is unfavorable for CCP. On the other hand, the intrinsic dimension of GSE67835 is much higher, which is more suitable for CCP. Notice that the GSE59114 clusters have distinct boundaries, supporting the relatively low dimensionality of the data. On the other hand, the GSE67835 data is not well clustered even at k = 64: the orange and blue genes have some outliers, and the purple genes are not well clustered. This suggests that the optimal number of gene clusters is larger, which indicates a high gene dimensionality and favors CCP.

Figure 7.2 RI, SI, and RSI of the gene clustering of (a) GSE59114 and (b) GSE67835. k-means clustering was performed with k = 2, 4, 8, 16, 32, 64, and 128 gene clusters. For each number of clusters, 10 random initializations were utilized, and the averages of RI, SI, and RSI were obtained. The red, blue, and green lines correspond to RI, SI, and RSI, respectively. We use t-SNE to visualize the genes in 2D. For GSE59114, k = 8 clusters were obtained, and the genes were colored according to their cluster assignment. For GSE67835, k = 64 clusters were obtained; seven random clusters were colored, and the rest were colored in green.

7.1.2.2 Residue-Similarity Index comparison

The residue-similarity index (RSI) has been shown to correlate with classification accuracy in [1]. In this section, we use RSI for classification and clustering on the 14 datasets from Table 7.1. We use CCP to process each dataset with the same parameters as in the previous section with 20 random initializations. For classification, we use 5-fold cross-validation with 10 random seeds and a support vector machine to predict cell types. We use balanced accuracy (BA) to measure the performance of the classification. Then, using the same 5-fold cross-validation, we calculate RSI, where we obtain the RI, SI, and RSI from the test set, similar to [1]. For clustering, we compute RSI for PCA and CCP using the k-means clustering labels and the true labels. Additionally, using the k-means clustering labels, we compute the Silhouette score to compare the results with RSI. Full details of the benchmark procedure can be found in Section 7A.2.1 of the Supporting Materials.

In general, we have found no correlation between the Silhouette scores and RSI for clustering results. Additionally, we have found that BA and RSI correlate in classification results.
Figure 7.3 Comparison of RSI for classification and clustering problems on GSE67835, GSE75748 time, and GSE82187 data at reduced dimensions N = 50, 100, 150, 200, 250, and 300. CCP was used to reduce the original data dimension using τ = 6 and κ = 2 for the exponential kernel. The top and bottom rows correspond to the classification and clustering results, respectively. For classification, a support vector machine was used, and the true labels were used to compute the RSI for the 5-fold cross-validation. For clustering, RSI was computed using both the cluster labels from k-means clustering and the true labels.

We found that RSI correlates with classification accuracy in many of our tests. Figure 7.3 shows RSI for classification and clustering problems on GSE67835, GSE75748 time, and GSE82187 data. CCP was used to reduce the original data using τ = 6 and κ = 2 for the exponential kernel. The top row corresponds to classification results, and the bottom row corresponds to clustering results. Notice that for the classification results, all three datasets show a correlation between BA and RSI. The RSI of the classification results for GSE67835 plateaus at about 150 super-genes, which corresponds to the plateau of the BA; this suggests that the optimal dimension is about 150. The RSI of the classification results for GSE75748 time plateaus at about 200 super-genes, even though the BA plateaus at about 150; this suggests that the optimal dimension is about 200 super-genes. In addition, since GSE75748 time observes cell differentiation at different times, it is possible that some cells are at different stages of their cell cycles, as suggested in the literature [6], which indicates that there are many intermediate stages in cell differentiation. The RSI of the classification results for GSE82187 shows a small decrease as the number of super-genes increases, suggesting that its optimal dimension is smaller than those of GSE67835 and GSE75748 time. Lastly, RSI decreases for all three datasets when PCA is utilized, which corresponds to the decrease in BA.

For the clustering results, the RSI values computed with the k-means labels and with the true cell types are similar. Even though the ARI and NMI of PCA decrease as the number of gene clusters increases, its RSI remains consistent. This suggests that PCA is not able to differentiate clusters at higher dimensions. CCP, on the other hand, shows a correlation with both RSI scores. Additional examples of utilizing RSI on classification and clustering problems can be found in Section 7A.2.2 of the Supporting Materials.

Figure 7.4 Comparison of CCP and PCA clustering on GSE67835, GSE75748 time, and GSE82187 data. CCP was used to reduce the original data dimension using τ = 6 and κ = 2 for the exponential kernel. The blue, orange, green, and red bars correspond to the mean CCP ARI, mean PCA ARI, mean CCP NMI, and mean PCA NMI, respectively. Here, the average was taken over the different dimensions.

Figure 7.4 shows the overall clustering performance of CCP and PCA. The bars show the mean ARI and NMI values across the different numbers of components. Notice that for both ARI and NMI, CCP outperforms PCA.

Figure 7.5 Comparison of CCP and PCA classification on GSE67835, GSE75748 time, and GSE82187 data. CCP was used to reduce the original data dimension using τ = 6 and κ = 2 for the exponential kernel. The blue and orange bars correspond to the mean BAs of CCP and PCA, respectively. Here, the average was taken over the different dimensions.
Figure 7.5 shows the overall classification performance of CCP and PCA. The bars show the mean BAs across the different numbers of dimensions. Notice that for the mean BA, CCP outperforms PCA.

7.1.3 Discussion

7.1.3.1 CCP

Like other dimensionality reduction algorithms, CCP has its advantages and disadvantages. CCP nonlinearly projects each cluster of similar genes into a super-gene. Super-genes are highly interpretable: for a given cell, each super-gene measures the accumulated pairwise nonlinear correlations between a cluster of genes in that cell and the same cluster of genes in all other cells. Similar to NMF, super-genes are nonnegative, which is important for downstream analysis such as differential gene expression analysis.

Since CCP is a data-domain method, it bypasses matrix diagonalization. One limitation of many dimensionality reduction algorithms is their dependence on matrix diagonalization. In scRNA-seq data, the number of genes is typically larger than 5,000, which gives rise to the "curse of dimensionality": when the number of features is large, every sample may appear to be equidistant from every other sample, which prevents many machine learning algorithms from finding meaningful clusters in the data. CCP, on the other hand, partitions the genes into clusters and computes the pairwise gene-gene correlations across all cells, which avoids the curse of dimensionality.

Even though CCP has shown success on many scRNA-seq datasets, it does have limitations. CCP does not perform well for datasets with a low intrinsic dimension. As shown in Figure 7.2, GSE59114 and GSE94820 have a low intrinsic dimension, and as a result, their clustering results also suffered. In addition, many scRNA-seq datasets are sparse due to a low signal-to-noise ratio and dropout events. Therefore, CCP will most likely benefit from data imputation.

7.1.3.2 RSI

RSI is a useful tool for assessing the performance of dimensionality reduction for both clustering and classification problems. In the following, we compare RSI to the traditional clustering metrics, ARI and NMI, and also to the Silhouette score. Then, we discuss RSI and its connection with classification accuracy.

RSI for clustering. Compared to ARI and NMI, which measure the similarity between two sets of labels, RSI evaluates the performance using only one set of labels. In this study, ARI and NMI were used to compare the true labels with the clustering labels. However, in practice, such true labels may not be available. RSI, on the other hand, can evaluate the effectiveness of clustering without the need for the original labels. This is similar to the Silhouette score, which measures the separation between clusters. However, when there are multiple clusters, the Silhouette score becomes difficult to interpret because it measures whether a sample belongs to its current cluster assignment or to the nearest neighboring cluster. Therefore, it is often used to evaluate the optimal number of clusters rather than to evaluate different parameters while fixing the number of clusters. RSI, in contrast, can evaluate the effectiveness of different parameters while fixing the number of clusters.

RSI for classification. Using RSI for cell types, we have shown that RSI correlates with classification accuracy. Additionally, RI and SI indicate how well the clusters separate from each other. The area under the receiver operating characteristic curve (AUC-ROC) is a metric commonly used to evaluate classification effectiveness.
However, AUC-ROC is better suited to binary classification problems, and its interpretation is more challenging for multiclass problems. RSI, on the other hand, can handle problems with more than two cell types. Lastly, RSI uses the features and labels to compute the scores, so it can also demonstrate the effectiveness of dimensionality reduction algorithms in conjunction with classification problems.

RSI can also be utilized for visualizing each class or cluster, which we have called the Residue-Similarity (R-S) plot. In order to showcase the R-S plot, we compare it with traditional visualization techniques used in scRNA-seq data, namely t-SNE and UMAP. CCP was utilized to reduce the dimensionality. The 5-fold cross-validation was used to divide the data into 5 parts, where 4 parts were used to train the support vector machine classifier and 1 part was used to test the classifier. Then, residue and similarity scores were computed for each sample and plotted according to its true cell type. Samples were then colored according to their predicted labels from the support vector machine classifier. The x-axis and y-axis correspond to the residue and similarity scores, respectively. Both residue and similarity scores range from 0 to 1, where 1 is optimal, and the top-right corner indicates a well-separated and well-clustered reduction. However, it is important to note that a balance of both scores is needed, as shown in Hozumi et al. [1]. For t-SNE and UMAP, the original data was log-transformed, and genes with variance less than 10^-6 were removed prior to the reduction. Samples were then plotted and colored according to their cell types.

Figure 7.6 R-S plot, CCP-assisted t-SNE plot, and standard t-SNE plot of GSE75748 time data. CCP was used to reduce the scRNA-seq data to 200 super-genes using τ = 6 and κ = 2. The 5-fold cross-validation was used to split the data into 5 parts, where 4 were used for training and 1 part was used for testing the support vector machine classifier. R-S scores were computed for the testing set, and all 5 folds were visualized. Each section corresponds to one of the true cell types, and each sample's color and marker correspond to the predicted label from the support vector machine classifier. For t-SNE and UMAP, the data was log-transformed, and any genes with less than 10^-6 variance were removed before applying the reduction. Samples were colored according to their cell types.

Figure 7.6 shows a comparison between the R-S plot and 2D plots of UMAP and t-SNE for the GSE75748 time data. CCP was used to generate 200 super-genes with τ = 6 and κ = 2. For the UMAP and t-SNE plots, the reduction was directly applied to the log-transformed original data. In [6], Chu obtained snapshots at different times of ES cell differentiation from pluripotency to definitive endoderm over 4 days, at 0hr, 12hrs, 24hrs, 36hrs, 72hrs, and 96hrs. Noticeably, cells recorded at 72hrs and 96hrs are mixed in the UMAP and t-SNE plots and are misclassified in the R-S plot. This finding is consistent with [6], where cells from 72hrs and 96hrs were relatively homogeneous. In a biological sense, this may indicate that cell differentiation had mostly completed by 72 hrs, such that not much further differentiation was observed at 96 hrs. In the t-SNE and UMAP plots, we can see a similar pattern as in the R-S plot. There are 2 subclusters of the 12hr samples.
Additionally, the 72hr and 96hr samples form one large cluster, which is consistent with the R-S plot's findings. Most notably, there is a large difference between the ES cells at 0hr and the ES cells at later times in all visualizations, and there is no misclassification of the 0hr state with cells from the 72hr and 96hr states, indicating that the cells have indeed differentiated from the original pluripotent state.

Figure 7.7 R-S plot and CCP-assisted UMAP and t-SNE plots of GSE75748 cell data. CCP was used to reduce the scRNA-seq data to 100 components using τ = 6 and κ = 2. The 5-fold cross-validation was used to split the data into 5 parts, where 4 parts were used for training and 1 part was used for testing the k-NN classifier. R-S scores were computed for the testing set, and all 5 folds were visualized. Each section corresponds to one of the 7 true cell types, and each sample's color and marker correspond to the predicted label from the k-NN classifier. For t-SNE and UMAP, the data was log-transformed and any genes with less than 10^-6 variance were removed before applying the reduction. Samples were colored according to their cell types.

Figure 7.7 shows a comparison between the R-S plot and 2D plots of UMAP and t-SNE of the GSE75748 cell data. CCP was used to reduce the dimension to 100 super-genes with τ = 6 and κ = 2. In [6], Chu obtained snapshots of lineage-specific progenitor cells that differentiated from H1 human embryonic stem (ES) cells. These differentiated cells include neuronal progenitor cells (NPC), definitive endoderm cells (DEC), endothelial cells (EC), trophoblast-like cells (TB), human foreskin fibroblasts (HFF), and undifferentiated H1 and H9 human ES cells. Not surprisingly, all 3 visualizations show that the undifferentiated ES cells H1 and H9 are clustered together, indicating that these two ES cell lines are relatively homogeneous, which agrees with Chu's findings. In the R-S plot, we see that all but 1 DEC sample are classified correctly, whereas in the UMAP and t-SNE plots, the DEC samples do not form a distinct cluster and instead form a supercluster with the H1 and H9 clusters. In addition, all 3 visualizations show 2 clusters of NPC samples, but CCP is able to classify the NPC samples correctly. Notice that in the R-S plot, there are a few misclassifications of EC and DEC cells, and in UMAP, these two clusters are adjacent to one another, which is consistent with the small number of misclassified EC and DEC cells shown in the R-S plot. Since EC are derivatives of mesoderm, it has been suggested [11, 12, 13] that mesoderm and DEC may have developed and differentiated from a common progenitor pool.

7.1.4 Conclusion

CCP is a novel dimensionality reduction method that projects each cluster of similar genes into a super-gene, defined as accumulated pairwise nonlinear gene-gene correlations among cells. We have shown that CCP is able to differentiate cell types and also preserve similarity along the trajectory of cellular differentiation. In addition, since CCP works exclusively in the data domain, it does not rely on matrix diagonalization and its results are easily interpretable. It outperforms PCA for problems having an intrinsically high dimensionality. We have also shown that RSI is a novel metric for evaluating the effectiveness of dimensionality reduction algorithms. Since it correlates with accuracy but does not rely on knowing the true labels of the data, it can be applied to improve both clustering and classification.
In addition, RSI can be used to vary the number of clusters and obtain insight into the optimal number of cell types. This information can be used to filter out data on which CCP may not perform well, because CCP works best when the intrinsic dimensionality of the data, i.e., the number of gene features, is relatively high. Lastly, the R-S plot is introduced as a new visualization tool that works well for problems with a large number of cell types.

7.1.5 Code and Data availability

All data was processed and is available at https://github.com/hozumiyu/SingleCellDataProcess. The code needed to reproduce this paper's results can be found at https://github.com/hozumiyu/CCP-for-Single-Cell-RNA-Sequencing. CCP is made available through our web server at https://weilab.math.msu.edu/CCP/ or through the source code at https://github.com/hozumiyu/CCP. The source code for RSI and the R-S plot can be found at https://github.com/hozumiyu/RSI.

7.2 Analyzing scRNA-seq data by CCP-assisted UMAP and t-SNE

7.2.1 Introduction

The objective of the present work is to explore the utility of CCP for initializing scRNA-seq data. We are particularly interested in its potential application for initializing UMAP and t-SNE, which are among the most successful visualization tools in scRNA-seq analysis. We tested CCP-assisted UMAP and CCP-assisted t-SNE on eight publicly available datasets. CCP's performance in assisting UMAP and t-SNE compares favorably with that of PCA and NMF.

Additionally, we introduce a novel method for handling low-variance (LV) genes. Instead of discarding low-variance genes like many other methods, we group them together into a single category. This grouping is achieved by projecting them into one descriptor using FRI. One of the drawbacks of dropping low-variance genes is that scRNA-seq data often has an unbalanced cell-type composition. Moreover, there are numerous genes with low expression, and removing too many genes may result in overlooking cell outliers. The LV-gene addresses this issue by consolidating low-variance genes into one descriptor, thereby increasing their predictive power. We found that CCP improves the accuracy of UMAP and t-SNE by over 11% in each case.

7.2.2 Methods and Algorithms

In this section, we describe the construction of the LV-gene, which becomes one of the components. For the rest of the components, refer to Section 5.2 for details.

7.2.2.1 Low variance (LV) genes

Let $v = (v_1, \dots, v_I)$ be the variances of the genes, where $v_i$ is the variance of gene $z_i$, and assume that the variances are sorted in descending order. Then, define the low-variance set $P$ as $P = \{i \mid i > v_c I\}$, where $0 \leq v_c \leq 1$ is the cutoff ratio. We can then obtain the cell-cell correlation using these low-variance genes,

$$C^P_{ij} = \Phi(\|z^P_i - z^P_j\|; \eta^P, \tau, \kappa),$$

where $\Phi(\|z^P_i - z^P_j\|; \eta^P, \tau, \kappa)$ is the generalized exponential function

$$\Phi(\|z^P_i - z^P_j\|; \eta^P, \tau, \kappa) = \begin{cases} e^{-\left( \frac{\|z^P_i - z^P_j\|}{\eta^P \tau} \right)^{\kappa}}, & \|z^P_i - z^P_j\| < r^P_c \\ 0, & \text{otherwise}. \end{cases}$$

Here, $r^P_c$ is taken as 3 standard deviations of the pairwise distances, and $\eta^P$ is the average minimum distance

$$\eta^P = \frac{\sum_{m=1}^{M} \min_{z^P_j} \|z^P_m - z^P_j\|}{M}.$$

Using the correlation function, CCP projects the $|P|$ low-variance genes into a super-gene for the $i$-th sample,

$$x^P_i = \sum_{m=1}^{M} w_{im}\, \Phi(\|z^P_i - z^P_m\|; \eta^P, \tau, \kappa),$$

where the $w_{im}$ are the weights. For CCP, we compute the LV-gene first and then use the correlated partition algorithm on the remaining genes.
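A minimal sketch of the LV-gene construction described above is given below; it mirrors the per-cluster CCP projection, and the uniform weights w_im = 1, the default parameters, and the hypothetical function name lv_gene are assumptions of this illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def lv_gene(X, vc=0.8, tau=6.0, kappa=2.0):
    """Project the low-variance genes into a single LV super-gene per cell.
    X: (cells, genes) normalized expression matrix; vc: cutoff ratio for the low-variance set P."""
    variances = X.var(axis=0)
    order = np.argsort(variances)[::-1]             # genes sorted by descending variance
    lv_idx = order[int(np.ceil(vc * X.shape[1])):]  # low-variance set P: indices past the cutoff
    Z = X[:, lv_idx]                                # cells restricted to the low-variance genes

    D = squareform(pdist(Z))                        # pairwise cell-cell distances
    r_c = 3.0 * D.std()                             # cutoff: 3 standard deviations of the pairwise distances
    D_off = D + np.eye(len(Z)) * (D.max() + 1.0)
    eta = D_off.min(axis=1).mean() + 1e-12          # average minimum distance

    Phi = np.exp(-(D / (eta * tau)) ** kappa)       # generalized exponential kernel
    Phi[D >= r_c] = 0.0
    return Phi.sum(axis=1)                          # LV super-gene (weights w_im taken as 1)
```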
7.2.3 Results

7.2.3.1 Data preprocessing

We have tested CCP-assisted UMAP and tSNE visualization on 20 publicly available datasets. Table 7.2 displays information including the Gene Expression Omnibus (GEO) accession ID [14, 15], the reference, the data dimensions, and the cell composition for each dataset. Additionally, data from the scziDesk paper [16] was utilized and can be accessed from their supporting materials. The Qx and Qs data correspond to 10x Genomics and Smart-seq2 data from Quake et al. [17], respectively. Notably, the GSE84133 human dataset encompasses all human patient data from Baron et al. [8]. Detailed statistics for each dataset can be found in Table 7B.1 in the supporting materials.

Table 7.2 Dataset name, reference, dimensions, and cell type composition.

Dataset [Ref] | Size (cells x genes) | Cell Composition
GSE75748 cell [6] | 1018 x 19097 | 7 clusters: 138, 105, 212, 162, 159, 173, 69
GSE75748 time [6] | 758 x 19189 | 6 clusters: 92, 102, 66, 172, 138, 188
GSE82187 [7] | 705 x 18840 | 10 clusters: 107, 18, 21, 71, 48, 7, 334, 13, 43, 43
GSE67835 [5] | 420 x 22084 | 8 clusters: 18, 62, 20, 110, 25, 16, 131, 38
GSE84133 H1 [8] | 1937 x 20125 | 14 clusters: 110, 51, 236, 872, 214, 120, 130, 13, 70, 14, 8, 92, 5, 2
GSE84133 H2 [8] | 1724 x 20125 | 14 clusters: 3, 81, 676, 371, 125, 301, 23, 2, 86, 17, 9, 22, 6, 2
GSE84133 H3 [8] | 3605 x 20125 | 14 clusters: 843, 100, 1130, 787, 161, 376, 92, 2, 36, 14, 7, 54, 1, 2
GSE84133 H4 [8] | 1303 x 20125 | 14 clusters: 2, 52, 284, 495, 101, 280, 7, 1, 63, 10, 1, 5, 1, 1
GSE84133 M1 [8] | 822 x 14878 | 13 clusters: 2, 4, 4, 9, 343, 85, 236, 72, 14, 4, 17, 29, 3
GSE84133 M2 [8] | 1064 x 14878 | 13 clusters: 8, 3, 10, 182, 551, 133, 39, 67, 27, 4, 19, 18, 3
GSE84133 human [8] | 8569 x 20125 | 14 clusters: 958, 284, 2326, 2525, 601, 1077, 252, 18, 255, 55, 25, 173, 13, 7
Muraro [18] | 2122 x 19046 | 9 clusters: 21, 812, 193, 101, 219, 245, 3, 80, 448
Romanov [19] | 2881 x 21143 | 7 clusters: 267, 240, 356, 48, 898, 1001, 71
Qx Bladder [17] | 2500 x 23341 | 4 clusters: 1203, 1167, 57, 73
Qx Limb Muscle [17] | 3909 x 23341 | 6 clusters: 461, 320, 1330, 308, 1136, 354
Qx Spleen [17] | 9552 x 23341 | 5 clusters: 6886, 1930, 42, 464, 230
Qs Diaphragm [17] | 870 x 23341 | 5 clusters: 78, 81, 31, 241, 439
Qs Limb Muscle [17] | 1090 x 23341 | 6 clusters: 71, 35, 141, 45, 258, 540
Qs Lung [17] | 1676 x 23341 | 11 clusters: 57, 53, 25, 90, 113, 35, 693, 65, 85, 37, 423
Qs Trachea [17] | 1350 x 23341 | 4 clusters: 206, 113, 201, 830

To normalize the data, we first normalized the counts using the median count per cell. Let X ∈ R^{M×N} be the data, with M cells and N genes. Each row (cell) was divided by its row sum, followed by multiplication by the median row sum, to obtain a normalized count matrix. Finally, a log transformation using log1p was applied to obtain the final normalized counts.

In our benchmarking process, we employed CCP with parameters τ = 6 and κ = 2 to reduce the dimensions to 300 super-genes. Additionally, we utilized a cutoff ratio of v_c = 0.8 to generate the LV-gene. Clustering was performed using the Leiden algorithm, and we evaluated the quality of the clustering using ARI, NMI, and ECM by comparing the obtained clusters with the cell types provided by the original authors. Visualizations were generated using Scanpy's implementation of UMAP and tSNE. In order to reduce the computational load for datasets exceeding 2,000 samples, we utilized subsampling.
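The preprocessing and evaluation pipeline just described can be summarized in a short scanpy/scikit-learn sketch; the CCP reduction itself is assumed to be available as a precomputed matrix of super-genes, the function names are illustrative, and the ECM score is not included here.

```python
import numpy as np
import scanpy as sc
from anndata import AnnData
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def median_normalize_log1p(X):
    """Median-count normalization followed by log1p, as described above.
    X: (cells, genes) raw count matrix."""
    row_sums = X.sum(axis=1, keepdims=True)
    X_norm = X / np.maximum(row_sums, 1e-12) * np.median(row_sums)
    return np.log1p(X_norm)

def leiden_scores(features, true_labels):
    """Cluster the reduced features (e.g., 300 CCP super-genes) with the Leiden algorithm
    and score the clusters against the author-provided cell types with ARI and NMI."""
    adata = AnnData(np.asarray(features, dtype=np.float32))
    sc.pp.neighbors(adata, use_rep="X")   # k-NN graph on the reduced features
    sc.tl.leiden(adata)                   # Leiden community detection
    pred = adata.obs["leiden"].to_numpy()
    return (adjusted_rand_score(true_labels, pred),
            normalized_mutual_info_score(true_labels, pred))
```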
7.2.3.2 Visualization

Preprocessing of scRNA-seq data is a key step for visualization. Figure 7.8 shows an example of CCP-assisted tSNE visualization and the original tSNE visualization of the Baron dataset [8]. The original data has 20,125 genes, and aggressively reducing the original dimension to 2 dimensions by tSNE leads to poor visualization. In CCP-assisted tSNE, CCP was utilized to reduce the original genes to 300 super-genes, which were further reduced to 2 dimensions with tSNE for visualization. Obviously, CCP-assisted tSNE significantly improves the visualization quality in this case. We further showcase CCP-assisted visualization on the datasets described in Table 7.2. We provide additional comparisons with PCA-assisted and NMF-assisted visualization in Section 7B.1.2 of the supporting materials.

Figure 7.8 tSNE visualization of GSE84133 mouse 2 data. The left and right figures show the CCP-assisted and unassisted tSNE visualizations, respectively.

Figure 7.9 shows the comparison of CCP-assisted UMAP and tSNE with standard UMAP and tSNE visualizations on the Quake datasets. Each row corresponds to one of the 5 datasets, and the columns correspond to the CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP, and standard tSNE visualizations. The samples were colored according to the true cell types.

Figure 7.9 Comparison of CCP-assisted UMAP and tSNE with standard UMAP and tSNE visualizations on the Quake datasets. The rows correspond to Qx Bladder, Qx Limb Muscle, Qs Diaphragm, Qs Limb Muscle, and Qs Trachea. Qx indicates scRNA-seq data obtained using the 10x Genomics platform, and Qs indicates data obtained from the Smart-seq2 platform. CCP was used to reduce the dimension to 300 super-genes. UMAP and tSNE were utilized to further reduce the dimension to 2 to obtain the visualizations. Samples were colored according to the cell types provided by the original authors.

CCP improves the overall visualization of the Quake datasets. In the Qx Bladder data, CCP-assisted UMAP and tSNE show three subclusters of bladder urothelial cells, whereas the standard UMAP and tSNE show only one cluster. However, the standard visualizations show leukocytes and endothelial cells within the bladder urothelial cell cluster. In the Qs Diaphragm data, CCP-assisted UMAP and tSNE show 5 distinct clusters corresponding to the cell types. However, the standard UMAP visualization does not differentiate the 5 cell types, and the standard tSNE visualization shows poor clustering, where satellite cells, mesenchymal cells, and endothelial cells form a supercluster. In the Qs Limb Muscle data, all visualizations show a supercluster of B cells and T cells. The CCP-assisted visualizations show a clear distinction between the B-T cell supercluster and macrophages, whereas the standard visualizations show a supercluster of B cells, T cells, macrophages, and endothelial cells. In the Qs Trachea data, the standard UMAP and tSNE visualizations show a subpopulation of mesenchymal cells within the epithelial cells, whereas the CCP-assisted counterparts do not.

Figure 7.10 shows the comparison of CCP-assisted UMAP and tSNE with standard UMAP and tSNE visualizations on the GSE75748 cell, GSE75748 time, GSE67835, and GSE82187 datasets. The columns correspond to the CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP, and standard tSNE visualizations. The samples were colored according to the true cell types.

Figure 7.10 Comparison of CCP-assisted UMAP with PCA-assisted, NMF-assisted, and standard UMAP visualization on GSE75748 cell, GSE75748 time, GSE67835, and GSE82187. CCP was used to reduce the dimension to 300 super-genes. UMAP and tSNE were utilized to further reduce the dimension to 2 to obtain the visualizations. Samples were colored according to the cell types provided by the original authors.
In the GSE75748 cell data, all visualizations are similar. In [6], Chu obtained snapshots of lineage-specific progenitor cells that differentiated from H1 human embryonic stem (ES) cells and compared the gene profiles with undifferentiated H1 and H9 human ES cells as controls. Most notably, H1 and H9 cluster together, which is consistent with our visualization. In GSE75748 time, all visualizations are comparable. Chu et al. [6] obtained snapshots of ES cell differentiation from pluripotency to definitive endoderm over the time points 0hr, 12hr, 24hr, 36hr, 72hr, and 96hr. Chu noted that the cells sequenced at 72hr and 96hr show relatively similar expression profiles, suggesting that the differentiation had completed by 72hr. We see from our visualization that the 72hr and 96hr cells form one cluster, the 12hr and 24hr cells form another cluster, and the 0hr cells form their own cluster, indicating that there is a clear distinction between the undifferentiated cells and the cells undergoing differentiation. In GSE67835, the CCP-assisted visualizations and their standard counterparts give comparable results. Most notably, neurons form a distinct cluster in the CCP-assisted visualizations, whereas they do not in the standard visualizations. In the GSE82187 data, CCP-assisted UMAP and tSNE show a significant improvement over the standard UMAP and tSNE visualizations. Aside from astrocytes and OPC, each cell type forms its own cluster, while standard UMAP and tSNE fail to show significant clustering of the different cell types.

Figure 7.11 shows the comparison of CCP-assisted UMAP and tSNE with standard UMAP and tSNE visualizations on the Baron human dataset [8]. The rows correspond to the patients, and the columns correspond to the CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP, and standard tSNE visualizations. The samples were colored according to the true cell types.

Figure 7.11 Comparison of CCP-assisted UMAP with PCA-assisted, NMF-assisted, and standard UMAP visualization on the GSE84133 human dataset. Each row corresponds to 1 of the 4 patients. CCP was used to reduce the dimension to 300 super-genes. UMAP and tSNE were utilized to further reduce the dimension to 2 to obtain the visualizations. Samples were colored according to the cell types provided by the original authors.

Overall, the CCP-assisted visualizations show stronger clustering. In the standard UMAP and tSNE visualizations across all patients, we noticed superclusters with unclear boundaries. Conversely, the CCP-assisted visualizations display well-defined boundaries between cell types. Most notable is the clear differentiation of quiescent stellate (Q-Stellate) cells, alpha cells, and ductal cells across all patients, a distinction that is not as evident in the standard visualizations.

Figure 7.12 shows the comparison of CCP-assisted UMAP and tSNE with standard UMAP and tSNE visualizations on the Baron mouse dataset [8]. The rows correspond to mouse 1 and 2, and the columns correspond to the CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP, and standard tSNE visualizations. The samples were colored according to the true cell types.

Figure 7.12 Comparison of CCP-assisted UMAP with PCA-assisted, NMF-assisted, and standard UMAP visualization on the GSE84133 mouse dataset. The rows correspond to mouse 1 and 2. CCP was used to reduce the dimension to 300 super-genes. UMAP and tSNE were utilized to further reduce the dimension to 2 to obtain the visualizations. Samples were colored according to the cell types provided by the original authors.
CCP-assisted visualizations demonstrate significantly stronger clustering for both mouse samples. In the standard visualizations, beta cells are scattered among other cell types. Furthermore, in the data from mouse 2, alpha cells do not form a distinct cluster. Conversely, CCP-assisted visualizations distinctly cluster all cell types. Regarding mouse 1, the CCP-assisted visualization does not form a cluster for gamma cells, potentially due to the limited number of available gamma cells.

Figure 7.13 shows the visualization of the Muraro, Romanov, and Qs Lung data. The columns correspond to CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP, and standard tSNE visualization. The samples were colored according to the true cell type.

Figure 7.13 Comparison of CCP-assisted UMAP and tSNE with standard UMAP and tSNE visualization on the Muraro, Romanov, and Qs Lung datasets. CCP was used to reduce the dimension to 300 super-genes. UMAP and tSNE were utilized to further reduce the dimension to 2 to obtain the visualization. Samples were colored according to the cell types provided by the original authors.

In the Muraro dataset, CCP-assisted UMAP exhibits a clear separation of A cells, D cells, B cells, and ductal cells. In contrast, the standard UMAP visualization presents these cells as a supercluster. The standard tSNE visualization is dominated by the outlier B cells, rendering the visualization less interpretable. Regarding the Romanov dataset, all visualizations are relatively similar. CCP-assisted UMAP reveals a distinct cluster of astrocytes and ependymal cells, whereas both the standard UMAP and tSNE display a supercluster of these two cell types. Additionally, CCP-assisted UMAP and tSNE suggest two subclusters of VSM and endothelial cells, which are not discernible in the standard visualization. In the Qs Lung dataset, CCP-assisted and standard visualizations yield comparable results. While the standard tSNE separates monocytes from classical monocytes, CCP-assisted UMAP and tSNE portray a homogeneous clustering of these two cell types.

7.2.3.3 Accuracy

To assess CCP's effectiveness as a primary dimensionality reduction tool for UMAP and tSNE, we conducted clustering using the Leiden algorithm within scanpy. We employed the adjusted Rand index (ARI) and normalized mutual information (NMI) to gauge accuracy by comparing the clustering results with the labels provided by the datasets' authors. It is important to note that these metrics do not measure absolute accuracy due to the absence of a gold standard dataset for scRNA-seq. Additionally, we used the element centric measure (ECM) [20] to evaluate cluster stability.

Figure 7.14 The average ARI, NMI, and ECM over the 18 datasets: (a) average ARI, (b) average NMI, and (c) average ECM. Ten random initializations were used to compute CCP, CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP, and standard tSNE for each dataset. Leiden clustering was used to obtain the clustering results.

Figure 7.14 shows the average ARI, NMI, and ECM of CCP-assisted UMAP, CCP-assisted tSNE, UMAP, and tSNE across the 18 datasets. For each dataset, we used 10 random seeds to perform the dimensionality reduction and applied Leiden clustering to generate cluster labels. These labels were then compared to the annotated cell types provided by the original authors. CCP-assisted UMAP demonstrates a 24% improvement in ARI, 15% in NMI, and 17% in ECM over standard UMAP.
Similarly, CCP-assisted tSNE improves standard tSNE by 11% in ARI, 10% in NMI, and 8% in ECM. Notably, both CCP-assisted UMAP and tSNE yield higher ECM scores, indicating that their clustering is more stable. Interestingly, standard tSNE outperforms standard UMAP. However, UMAP's performance heavily relies on accurately finding nearest neighbors, which can be challenging with noisy, sparse, and high-dimensional data. CCP effectively reduces the dimension, enabling UMAP to find neighbors more effectively and resulting in improved visualization.

For a detailed comparison between CCP-assisted, PCA-assisted, and NMF-assisted visualizations, please refer to Section 7B.1.3 in the supporting materials.

7.2.4 Discussion

7.2.4.1 Large Data

While CCP proves to be an efficient dimensionality reduction technique for datasets with a large number of features, such as scRNA-seq data, it may encounter limitations due to the necessity of computing cell-cell correlations for each super-gene. To address this challenge when dealing with larger datasets, we proposed a subsampling approach. Let Z = {z_1, ..., z_M} be the training data used to develop a CCP model, and let Y = {y_1, ..., y_T} be a new dataset or additional data. Using the training data, the gene partitions S_n, the cutoff distances r_c^{S_n}, and the connectivities η^{S_n} are determined. Then, we embed the new data into the trained model, utilizing the following modification of Equation 5.15 to obtain the appropriate super-genes:

x_i^n = \sum_{m=1}^{M} \Phi\big( \| y_i^{S_n} - z_m^{S_n} \| ; \eta_n, \tau, \kappa \big),    (7.1)

We verified the subsampling approach on the GSE84133 human and Qx Spleen data. We combined all four patients' sequencing data into one superset for this analysis. We randomly subsampled 500, 1000, 1500, 2000, 2500, and 3000 samples as training data, and performed the subsampling under 10 random seeds. We projected the testing data using Equation 7.1, followed by Leiden clustering. ARI and NMI were computed, and the average scores are reported in Figure 7.15. Notably, both the GSE84133 human and Qx Spleen datasets exhibited consistent and stable results under varying subsampling sizes. Additionally, we show the CCP-assisted UMAP and tSNE for both datasets when subsampling was utilized. All visualizations were comparable, underscoring the stability of CCP-assisted visualizations even under subsampling.

Figure 7.15 UMAP and tSNE visualization of GSE84133 human and Qx Spleen data under different subsampling sizes. 300 super-genes were generated from CCP, and Leiden clustering was used to obtain the clustering results. (a) ARI and NMI under different subsampling sizes; the left panel shows the ARI and NMI for GSE84133 human, where the four patient datasets were combined, and the right panel shows the ARI and NMI of the Qx Spleen data. (b) CCP-assisted UMAP and tSNE of the GSE84133 human data under different subsampling sizes. (c) CCP-assisted UMAP and tSNE of the Qx Spleen data under different subsampling sizes.

7.2.4.2 Low Variance Genes

We group low-variance (LV) genes into a single LV-gene cluster to enhance the predictive power of the super-genes. By using a high cutoff ratio, we can reduce the number of genes used in the feature partition, potentially resulting in a lower number of super-genes.
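A minimal sketch of how low-variance genes might be separated from the genes entering the feature partition is shown below. For illustration only, the cutoff ratio νc is interpreted here as the fraction of lowest-variance genes pooled into the single LV-gene cluster; the precise definition of νc follows the CCP methodology described earlier in this dissertation and may differ from this simplification.

```python
# Illustrative sketch of splitting genes into an LV-gene cluster and the genes
# used for the CCP feature partition. The interpretation of the cutoff ratio
# nu_c as "the bottom nu_c fraction of genes ranked by variance" is an
# assumption made for this sketch only.
import numpy as np

def split_lv_genes(X, nu_c=0.8):
    """X: cells x genes matrix (log-transformed).
    Returns (indices of partition genes, indices of LV genes)."""
    variances = X.var(axis=0)
    order = np.argsort(variances)          # gene indices, ascending variance
    n_lv = int(nu_c * X.shape[1])          # bottom nu_c fraction -> LV cluster
    return order[n_lv:], order[:n_lv]
```

Under this reading, a larger νc removes more genes from the partition step, which matches the observation above that a high cutoff ratio reduces the number of genes used in the feature partition.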
To assess the impact of the cutoff ratio on the number of super-genes used for UMAP and tSNE visualizations, we conducted tests using the GSE82187 and GSE75748 cell data. The discussion for the GSE75748 cell data can be found in Section 7B.2.1 of the supporting materials. Figure 7.16 shows the effect of varying the number of super-genes and the cutoff ratio on the predictive power and visualization of the GSE82187 data. We utilized 10 random seeds to generate CCP super-genes using different numbers of super-genes and cutoff ratios. Subsequently, Leiden clustering was applied to obtain cluster labels, and the ARI was computed using the cell labels provided by the original authors. Notably, across all cutoff ratios, the ARI increases with the number of super-genes, plateauing at a comparable level around 300 super-genes. This indicates the robustness of the LV-gene approach. Figure 7.16(c) shows the visualization of CCP-assisted UMAP and tSNE at various cutoff ratios. For the visualization, 300 super-genes were utilized, and UMAP and tSNE were applied to the super-genes to reduce the dimension to 2. Samples were then colored according to the cell types provided by the original authors. Note that all the visualizations are comparable, indicating the robustness of the LV-gene approach under different cutoff ratios.

Figure 7.16 Analysis of varying the cutoff ratio νc on the clustering and visualization of the GSE82187 data. (a) ARI of Leiden clustering when the number of super-genes and the cutoff ratio are changed. (b) The number of genes in the LV-gene cluster when νc is changed. (c) The top and bottom rows show the CCP-assisted UMAP and tSNE visualizations, and the columns correspond to νc = 0.6, 0.7, 0.8, 0.9. 300 super-genes were used to initialize UMAP and tSNE, and the samples were colored according to the true cell type.

7.2.5 Conclusion

CCP is a nonlinear data-domain dimensionality reduction technique that leverages gene-gene correlations to partition genes and utilizes cell-cell correlations to generate super-genes. Unlike methods that involve matrix diagonalization, CCP can be directly applied as a primary dimensionality reduction tool to complement traditional visualization techniques such as UMAP and tSNE. In our experiments with 18 datasets, CCP-assisted UMAP and CCP-assisted tSNE visualizations consistently outperformed the original UMAP and tSNE. On average, CCP-assisted UMAP improves the standard UMAP visualization by 24% in ARI and 15% in NMI, and CCP-assisted tSNE improves standard tSNE by 11% in ARI and 10% in NMI. Although the improvement for tSNE visualization is smaller than that for UMAP, tSNE is sensitive to potential outliers and noise, which can render the visualization uninterpretable; CCP-assisted tSNE consistently shows clear visualizations on the 18 datasets we have tested. Additionally, CCP-assisted visualization improves upon PCA-assisted and NMF-assisted visualization on the same 18 datasets. However, CCP comes with some disadvantages. For data with no clear gene-gene correlation, CCP will most likely not perform well. Additionally, although utilizing gene clustering removes the complication of computing distances in high dimensions, when the number of samples becomes large, the cell-cell correlation computation becomes time-consuming. We show that subsampling via a training set is an effective approach to enable CCP to deal with large data.
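To make the subsampling strategy of Section 7.2.4.1 concrete, the sketch below implements the out-of-sample projection of Equation 7.1: each new cell is assigned a super-gene value by summing kernel-weighted distances to the training cells, restricted to the genes in each partition. The kernel form Φ(d; η, τ, κ) = exp(−(d/(τη))^κ) is an assumption based on the generalized exponential kernels used elsewhere in this dissertation, and the released CCP code may parameterize it differently.

```python
# Sketch of the out-of-sample projection in Eq. (7.1): new cells are embedded
# into the trained super-gene space using the gene partitions S_n and scales
# eta_n learned from the training data.
import numpy as np

def phi(d, eta, tau=6.0, kappa=2.0):
    # Assumed generalized exponential kernel; the exact form used by CCP may differ.
    return np.exp(-(d / (tau * eta)) ** kappa)

def project_new_cells(Y, Z, partitions, etas, tau=6.0, kappa=2.0):
    """Y: new cells x genes, Z: training cells x genes,
    partitions: list of gene-index arrays S_n, etas: scale eta_n per partition.
    Returns a new-cells x n_super_genes matrix."""
    X_new = np.zeros((Y.shape[0], len(partitions)))
    for n, (S_n, eta_n) in enumerate(zip(partitions, etas)):
        # Pairwise distances between new and training cells, restricted to S_n.
        diff = Y[:, S_n][:, None, :] - Z[:, S_n][None, :, :]
        dist = np.linalg.norm(diff, axis=2)            # (n_new, n_train)
        X_new[:, n] = phi(dist, eta_n, tau, kappa).sum(axis=1)
    return X_new
```

With a modest training subsample (500 to 3,000 cells, as above), the pairwise distance computation stays small even when the full dataset is large, which is the point of the subsampling approach.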
One possible extension for gene clustering is to incorporate prior information, such as using known genes or known gene regulatory pathways, to guide the clustering. Additionally, CCP can also be employed in many other single-cell contexts, such as spatial transcriptomics and cell-cell communication, and for initializing deep learning methods.

7.2.6 Code and Data availability

All data can be downloaded from the Gene Expression Omnibus [14, 15]. The processing files for these data can be found at https://github.com/hozumiyu/SingleCellDataProcess. CCP is made available through our web server at https://weilab.math.msu.edu/CCP/ or through the source code at https://github.com/hozumiyu/CCP. The code to reproduce this paper is found at https://github.com/hozumiyu/CCP-scRNAseq-UMAP-TSNE.

7.3 Topological Non-Negative Matrix Factorization for Single Cell RNA Sequencing Data

7.3.1 Introduction

In this work, we introduce PL-regularized NMF, namely the topological NMF (TNMF) and the robust topological NMF (rTNMF). Both TNMF and rTNMF can better capture multiscale geometric information than the standard GNMF and rGNMF. To achieve improved performance, the PL is constructed by observing cell-cell interactions at multiple scales through a filtration, creating a sequence of simplicial complexes. We can then view the spectra of each complex associated with the filtration to capture both topological and geometric information. Additionally, we introduce the k-NN based PL to TNMF and rTNMF, referred to as k-TNMF and k-rTNMF, respectively. The k-NN based PL reduces the number of hyperparameters compared to the standard PL algorithm. For the methodology and the algorithm for TNMF, refer to Section 6.3. We apply both the k-NN based and the cutoff-based PL-regularized NMF to scRNA-seq data and evaluate the effectiveness of the clustering using the adjusted Rand index (ARI), normalized mutual information (NMI), purity, and accuracy (ACC). Additionally, the residue-similarity (RS) plot is utilized to visualize the effectiveness of the clustering.

7.3.2 Results

7.3.2.1 Benchmark Data

We performed benchmarks on 12 publicly available datasets. The GEO accession number, reference, organism, number of cell types, number of samples, and number of genes are recorded in Table 7.3. For each dataset, cell types with fewer than 15 cells were removed. The data was log-normalized, and then each cell was normalized to have unit length. For GNMF and rGNMF, k = 8 neighbors were used. For TNMF and rTNMF, 8 filtration values were used to construct the PL, and for each scale, the binary selection ζp = 0, 1 was used. For k-TNMF and k-rTNMF, k = 8 was used with ζp = 0, 1. For each test, nonnegative double singular value decomposition with zeros filled with the average of X (NNDSVDA) was used for the initialization. For the rank, we chose √N, where N is the number of cells. Then k-means clustering was applied to obtain the clustering results.

Table 7.3 GEO accession code, reference, organism type, number of cell types, number of samples, and number of genes of each dataset.
Geo Accession GSE67835 GSE75748 time GSE82187 GSE84133human1 GSE84133human2 GSE84133human3 GSE84133human4 GSE84133mouse1 GSE84133mouse2 GSE57249 GSE64016 GSE94820 Reference Dramanis [5] Chu [6] Gokce [7] Baron [8] Baron [8] Baron [8] Baron [8] Baron [8] Baron [8] Biase [21] Leng [22] Villani [10] Organism Cell type Human Human Mouse Human Human Human Human Mouse Mouse Human Human Human 8 6 8 9 9 9 6 6 8 3 4 5 # of Samples # of Genes 420 758 705 1895 1702 3579 1275 782 1036 49 460 1140 22084 19189 18840 20125 20125 20125 20125 14878 14878 25737 19084 26593 7.3.2.2 Benchmarking PL regularized NMF In order to benchmark persistent Laplacain regularized NMF, we compared our methods to other commonly used NMF methods, namely the GNMF, rGNMF, rNMF and NMF. For a fair comparison, We omitted supervised or semi-supervised methods. For k-rTNMF, rTNMF, k-TNMF, TNMF, GNMF and rGNMF, we set α = 1 for all tests. Table 7.4 shows the ARI values of the NMF methods for the 12 data we have tested. The bold number indicate the highest performance. Figure 7.17 depicts the average ARI value over the 12 datasets for each method. Table 7.4 ARI of NMF methods across 12 datasets. Data GSE67835 GSE64016 GSE75748time GSE82187 GSE84133human1 GSE84133human2 GSE84133human3 GSE84133human4 GSE84133mouse1 GSE84133mouse2 GSE57249 GSE94820 k-rTNMF rTNMF k-TNMF TNMF rGNMF GNMF 0.9109 0.1605 0.5790 0.7577 0.7907 0.9255 0.8361 0.8681 0.7918 0.6957 1.0000 0.5189 0.9391 0.1456 0.6104 0.7558 0.8220 0.9350 0.8447 0.8699 0.7945 0.6808 1.0000 0.5139 0.8533 0.1491 0.6099 0.9809 0.8855 0.9255 0.9181 0.9692 0.7913 0.9331 0.9483 0.5574 0.9306 0.2237 0.5963 0.9676 0.8301 0.9433 0.8625 0.8712 0.8003 0.7005 1.0000 0.4916 0.9236 0.1544 0.6581 0.9815 0.8969 0.9072 0.9179 0.9692 0.7894 0.8689 0.9638 0.5480 0.9454 0.2569 0.6421 0.9877 0.8310 0.9469 0.8504 0.8712 0.8003 0.6953 1.0000 0.6101 rNMF 0.7295 0.1455 0.5969 0.8221 0.7080 0.8930 0.7909 0.8311 0.6428 0.5436 0.9483 0.5440 NMF 0.7314 0.1466 0.5996 0.8208 0.6120 0.8929 0.8089 0.8311 0.6348 0.5470 0.9483 0.5556 216 Figure 7.17 Average ARI of k-rTNMF, rTNMF, k-TNMF,TNMF, rGNMF, GNMF, rNMF and NMF for the 12 datasets. Overall, PL regularized rNMF and NMF have the highest ARI value across all the datasets. k-rTNMF outperforms other NMF methods by at least 0.09 for GSE64016. All PL regularized NMF methods outperform other NMF methods by at least 0.14 for GSE82187. For GSE84133 human 3, both rTNMF and TNMF outperform other methods by 0.07. TNMF improves other methods by more than 0.2 for GSE84133 mouse 2. Lastly, k-rTNMF has the highest ARI value for GSE94820. Moreover, rTNMF improves rGNMF by 0.05, and TNMF improves GNMF by about 0.06. k-TNMF and k-rTNMF also improve GNMF and rGNMF by about 0.03. Table 7.5 shows the NMI values of of the NMF methods for the 12 datasets we have tested. The bold number indicate the highest performance. Figure 7.18 shows the average NMI value over the 12 datasets. 217 Table 7.5 NMI of NMF methods across 12 datasets. 
data GSE67835 GSE64016 GSE75748time GSE82187 GSE84133human1 GSE84133human2 GSE84133human3 GSE84133human4 GSE84133mouse1 GSE84133mouse2 GSE57249 GSE94820 k-rTNMF rTNMF k-TNMF TNMF rGNMF GNMF 0.8858 0.2562 0.6971 0.8754 0.8310 0.9145 0.8357 0.8753 0.8565 0.8129 1.0000 0.6258 0.9235 0.3057 0.7522 0.9759 0.8802 0.9363 0.8500 0.8795 0.8664 0.8218 1.0000 0.7085 0.9104 0.2593 0.7235 0.8802 0.8713 0.9237 0.8439 0.8775 0.8596 0.8005 1.0000 0.6195 0.8607 0.1869 0.7343 0.9668 0.8780 0.9070 0.8677 0.9542 0.8495 0.8713 0.9293 0.6716 0.8999 0.2059 0.7750 0.9691 0.8716 0.8937 0.8718 0.9542 0.8498 0.8355 0.9505 0.6657 0.9107 0.3136 0.7159 0.9298 0.8785 0.9313 0.8577 0.8795 0.8664 0.8299 1.0000 0.6157 rNMF 0.7975 0.1896 0.7227 0.9124 0.8226 0.8835 0.8215 0.8694 0.7634 0.7258 0.9293 0.6624 NMF 0.8017 0.1849 0.7244 0.9117 0.7949 0.8829 0.8260 0.8694 0.7593 0.7272 0.9293 0.6693 Figure 7.18 Average NMI values of k-rTNMF, rTNMF, k-TNMF,TNMF, rGNMF, GNMF, rNMF and NMF for the 12 datasets. Interestingly, k-rTNMF and k-TNMF on average have higher NMI values than rTNMF and TNMF, respectively. However, all PL regularized methods outperform rGNMF, GNMF, rNMF and NMF. Overall, PL regularized methods outperform other methods. Most noticeably, k-rTNMF, rTNMF and TNMF outperform standard NMF methods by 0.06 for GSE82187. Both rTNMF and TNMf outperform rGNMF and GNMF by 0.08 for GSE84133 human 4. Table 7.6 shows the purity values of the NMF methods for the 12 datasets we have tested. The bold number indicate the highest performance. Figure 7.19 shows the average purity over the 12 datasets. 218 Table 7.6 Purity of NMF methods across 12 datasets. data GSE67835 GSE64016 GSE75748time GSE82187 GSE84133human1 GSE84133human2 GSE84133human3 GSE84133human4 GSE84133mouse1 GSE84133mouse2 GSE57249 GSE94820 k-rTNMF rTNMF k-TNMF TNMF 0.9024 0.5013 0.7454 0.9888 0.9382 0.9661 0.9460 0.9882 0.9540 0.9373 0.9796 0.7550 0.9595 0.5846 0.7533 0.9620 0.9536 0.9806 0.9531 0.9427 0.9565 0.9604 1.0000 0.6658 0.9267 0.4913 0.7512 0.9895 0.9357 0.9614 0.9485 0.9882 0.9540 0.9410 0.9857 0.7462 0.9643 0.6048 0.7736 0.9927 0.9543 0.9818 0.9472 0.9427 0.9565 0.9585 1.0000 0.7893 rGNMF GNMF 0.9476 0.9595 0.5398 0.5339 0.7387 0.7553 0.9594 0.9620 0.9187 0.9490 0.9736 0.9777 0.9420 0.9452 0.9420 0.9427 0.9540 0.9552 0.9507 0.9466 1.0000 1.0000 0.6421 0.6421 rNMF 0.8726 0.5080 0.7467 0.9693 0.9189 0.9602 0.9464 0.9412 0.9309 0.9185 0.9796 0.7429 NMF 0.8719 0.5050 0.7455 0.9692 0.9099 0.9600 0.9466 0.9412 0.9299 0.9199 0.9796 0.7531 Figure 7.19 Average purity values of k-rTNMF, rTNMF, k-TNMF,TNMF, rGNMF, GNMF, rNMF and NMF for the 12 datasets. In general, PL-regularized methods achieve higher purity values compared to other NMF methods. Purity measures the maximum intersection between true and predicted classes, which is why we do not observe a significant difference, as seen in ARI and NMI. Furthermore, since purity does not account for the size of a class, and given the imbalanced class sizes in scNRA-seq data, it is not surprising that the purity values are similar. Table 7.7 shows the ACC of the NMF methods for the 12 datasets we have tested. The bold number indicate the highest performance. Figure 7.20 shows the average ACC over the 12 datasets. 219 Table 7.7 ACC of NMF methods across 12 datasets. 
data GSE67835 GSE64016 GSE75748time GSE82187 GSE84133human1 GSE84133human2 GSE84133human3 GSE84133human4 GSE84133mouse1 GSE84133mouse2 GSE57249 GSE94820 k-rTNMF rTNMF k-TNMF TNMF rGNMF GNMF 0.9383 0.4537 0.7241 0.8514 0.8364 0.9177 0.8228 0.8816 0.8542 0.8155 1.0000 0.6107 0.9595 0.4891 0.7355 0.8512 0.8889 0.9224 0.8498 0.8824 0.8555 0.7903 1.0000 0.6088 0.9000 0.4746 0.6917 0.9888 0.9088 0.9447 0.9419 0.9882 0.8542 0.9305 0.9796 0.7201 0.9595 0.5502 0.7414 0.9599 0.8974 0.9242 0.8597 0.8831 0.8581 0.8263 1.0000 0.6482 0.9243 0.4870 0.7438 0.9895 0.9194 0.9069 0.9456 0.9882 0.8542 0.9101 0.9857 0.7119 0.9643 0.5700 0.7565 0.9927 0.8973 0.9260 0.8539 0.8831 0.8581 0.8232 1.0000 0.7533 rNMF 0.8357 0.4691 0.6873 0.8896 0.7988 0.8998 0.8032 0.8847 0.7361 0.7239 0.9796 0.7091 NMF 0.8364 0.4759 0.6875 0.8889 0.7370 0.8994 0.8178 0.8847 0.7311 0.7294 0.9796 0.7189 Figure 7.20 Average ACC of k-rTNMF, rTNMF, k-TNMF,TNMF, rGNMF, GNMF, rNMF and NMF for the 12 datasets. Once again, we see that PL regularized methods have higher ACC than other NMF methods. RTNMF and TNMF improves rGNMF and GNMF by 0.05, and k-rTNMF and k-TNMF improves rGNMF and GNMF by 0.04. We see an improvement in ACC for both k-rTNMF and k-TNMF for GSE64016. All 4 PL regularized methods improve ACC of GSE82187 by 0.1. RTNMF and TNMF improve GSE84133 mouse 2 by at least 0.1 as well. 7.3.2.3 Overall performance Figure 7.21 shows the average ARI, NMI, purity and ACC of k-rTNMF, rTNMF, k-TNMF, TNMF, rGNMF, GNMF, rNMF, NMF across 10 datasets. All PL regularized NMF methods 220 Figure 7.21 Average ARI, NMI, purity and ACC of k-rTNMF, rTNMF, k-TNMF, TNMF, rGNMF, GNMF, rNMF, NMF across 12 datasets. outperform the traditional rGNMF, GNMF, rNMF and NMF. Both rTNMF and TNMF have higher average ARI and purity than the k-NN based PL counterparts. However, k-rTNMF and k-TNMF have higher average NMI than rTNMF and TNMF, respectively. k-rTNMF has a significantly higher purity than other methods. 7.3.3 Discussion 7.3.3.1 Visualization of meta-genes based UMAP and t-SNE Both UMAP and t-SNE are well-known for their effectiveness in visualization. However, these methods may not perform as competitively in clustering or classification tasks. Therefore, it is beneficial to employ NMF-based methods to enhance the visualization capabilities of UMAP and t-SNE. In this process, we generate meta-genes and subsequently utilize UMAP or t-SNE to further reduce the data to 2 dimensions for visualization. For a dataset with M cells, the number of meta-genes will be the integer value of √ M. To compare the standard UMAP and t-SNE plots with the top-NMF-assisted and top-rNMF-assisted UMAP and t-SNE visualizations, we used the default settings of the Python implementation of UMAP and the Scikit-learn implementation of t-SNE. For unassisted UMAP and t-SNE, we first removed low-abundance genes and performed log-transformation before applying UMAP and t-SNE. 221 Figure 7.22 shows the visualization of PL regularized NMF methods through UMAP. Each row corresponds to GSE67835, GSE75748 time, GSE94820 and GSE84133 mouse 2 data. The columns from left to right are the k-rTNMF assisted UMAP, rTNMF assisted UMAP, k-TNMF assisted UMAP, TNMF assisted UMAP and UMAP visualization. Samples were colored according to their true cell types. Figure 7.22 Visualization of top-NMF and top-rNMF meta-genes through UMAP. Each row cor- responds to GSE67835, GSE75748 time, GSE94820 and GSE84133 mouse 2 data. 
The columns from left to right are the k-rTNMF assisted UMAP, rTNMF assisted UMAP, k-TNMF assisted UMAP, TNMF assisted UMAP, and UMAP visualization. Samples were colored according to their true cell types.

Figure 7.23 shows the visualization of PL-regularized NMF through t-SNE. Each row corresponds to the GSE67835, GSE75748 time, GSE94820, and GSE84133 mouse 2 data. The columns from left to right are the k-rTNMF assisted t-SNE, rTNMF assisted t-SNE, k-TNMF assisted t-SNE, TNMF assisted t-SNE, and t-SNE visualization. Samples were colored according to their true cell types.

Figure 7.23 Visualization of top-NMF and top-rNMF meta-genes through t-SNE. Each row corresponds to the GSE67835, GSE75748 time, GSE94820, and GSE84133 mouse 2 data. The columns from left to right are the k-rTNMF assisted t-SNE, rTNMF assisted t-SNE, k-TNMF assisted t-SNE, TNMF assisted t-SNE, and t-SNE visualization. Samples were colored according to their true cell types.

We see a considerable improvement in both the top-NMF assisted and top-rNMF assisted UMAP and t-SNE visualizations.

GSE67835 In the assisted UMAP and t-SNE visualizations of GSE67835, we observe more distinct clusters, including a supercluster of fetal quiescent (Fetal-Q) and fetal replicating (Fetal-R) cells. Darmanis et al. [5] conducted a study that involved obtaining differential gene expression data for human adult brain cells and sequencing fetal brain cells for comparison. It is not surprising that the undeveloped Fetal-Q and Fetal-R cells do not exhibit significant differences and cluster together.

GSE75748 time In the GSE75748 time data, Chu et al. [6] sequenced human embryonic stem cells at times 0hr, 12hr, 24hr, 36hr, 72hr, and 96hr under hypoxic conditions to observe differentiation. In unassisted UMAP and t-SNE, although some clustering is visible, there is no clear separation between the clusters. Additionally, two subclusters of 12hr cells are observed. Notably, in the PL-regularized assisted UMAP and t-SNE visualizations, there is a distinct supercluster comprising the 72hr and 96hr cells, while cells from different time points form their own separate clusters. This finding aligns with Chu's observation that there was no significant difference between the 72hr and 96hr cells, suggesting that differentiation may have already occurred by the 72hr mark.

GSE94820 Notice that in both the unassisted t-SNE and UMAP, although there is a boundary, the cells do not form distinct clusters. This lack of distinct clustering can pose challenges for many clustering and classification methods. On the other hand, all PL-regularized NMF methods result in distinct clusters. Among the PL-regularized NMF approaches, the cutoff-based PL methods, rTNMF and TNMF, form a single CD1C+ (CD1C1) cluster, whereas the k-NN induced PL methods, k-rTNMF and k-TNMF, exhibit two subclusters. Villani et al. [10] previously noted the similarity in the expression profiles of CD1C1–CD141– (DoubleNeg) cells and monocytes. PL-regularized NMF successfully differentiates between these two types.

GSE84133 mouse 2 PL-regularized NMF yields significantly more distinct clusters compared to unassisted UMAP and t-SNE. Notably, the beta and gamma cells form distinct clusters with PL-regularized NMF. Additionally, when PL-regularized NMF is applied to assist UMAP, potential outliers within the beta cell population become visible. Baron et al. [8] previously highlighted heterogeneity within the beta cell population, and we observe potential outliers in all visualizations.
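Before turning to the RS analysis, it may help to spell out the benchmark protocol of Section 7.3.2 for the plain NMF baseline in code: log-normalization, unit-length scaling of each cell, NMF with NNDSVDA initialization and rank √N, k-means clustering with one cluster per annotated cell type, and an ACC score obtained by mapping cluster labels to cell types with the Hungarian algorithm. This sketch omits the persistent Laplacian regularization that distinguishes TNMF and its variants, and the log1p transform and iteration limit are assumptions rather than the exact benchmark settings.

```python
# Sketch of the NMF baseline protocol (without the persistent Laplacian penalty):
# log-normalize, scale cells to unit length, factorize with rank = sqrt(N) and
# NNDSVDA initialization, cluster the cell factors with k-means, and score ACC
# by mapping clusters to annotated cell types via the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize

def nmf_baseline_acc(counts, true_labels, seed=0):
    X = np.log1p(counts)                          # log-normalization (assumed log1p)
    X = normalize(X, norm="l2", axis=1)           # each cell scaled to unit length
    rank = int(np.sqrt(X.shape[0]))               # rank = sqrt(number of cells)
    W = NMF(n_components=rank, init="nndsvda", max_iter=500,
            random_state=seed).fit_transform(X)   # cells x rank factor
    k = len(np.unique(true_labels))               # one cluster per annotated cell type
    clusters = KMeans(n_clusters=k, n_init=30, random_state=seed).fit_predict(W)

    # Hungarian mapping from k-means clusters to cell types (assumes k clusters).
    types = np.unique(true_labels)
    C = np.array([[np.sum((np.asarray(true_labels) == t) & (clusters == c))
                   for c in range(k)] for t in types])
    row, col = linear_sum_assignment(-C)          # maximize matched samples
    mapping = {c: types[r] for r, c in zip(row, col)}
    mapped = np.array([mapping[c] for c in clusters])
    return float((mapped == np.asarray(true_labels)).mean())
```

ARI, NMI, and purity are computed from the same cluster labels with the corresponding scikit-learn metrics.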
7.3.3.2 RS analysis

Although UMAP and t-SNE are excellent tools for visualizing clusters, they may struggle to capture heterogeneity within clusters. Moreover, these methods can be less effective when dealing with a large number of classes. Therefore, it is essential to explore alternative visualization techniques.

Figure 7.24 RS plots of the GSE67835 data. The columns from left to right correspond to k-rTNMF, rTNMF, k-TNMF, and TNMF. Each row corresponds to a cell type. For each panel, the x-axis and y-axis correspond to the S-score and R-score, respectively. K-means was used to obtain the cluster labels, and the Hungarian algorithm was used to map the cluster labels to the true labels. Each sample was colored according to its true label.

In our approach, we visualize each cluster using RS plots as described in subsection 4.1.1. RS plots depict the relationship between the residue score (R-score) and similarity score (S-score) and have proven useful in various applications for visualizing data with multiple class types [23, 24, 25, 26, 27]. Figure 7.24 shows the RS plots of the PL-regularized NMF methods for the GSE67835 data. The columns from left to right correspond to k-rTNMF, rTNMF, k-TNMF, and TNMF, while the rows correspond to the cell types. The x-axis and y-axis represent the S-score and R-score for each sample, respectively. The samples are colored according to their predicted cell types. Predictions were obtained using k-means clustering, and the Hungarian algorithm was employed to find the optimal mapping from the cluster labels to the true cell types. We can see that TNMF fails to identify OPC cells, whereas k-rTNMF, rTNMF, and k-TNMF are able to identify OPC cells. Notably, the S-score is quite low, indicating that the OPC cells did not form a cluster for TNMF. For fetal quiescent and fetal replicating cells, k-rTNMF correctly identifies these two types, and the few misclassified samples are located on the boundaries. rTNMF is able to correctly identify fetal replicating cells but could not distinguish fetal quiescent cells from fetal replicating cells. The S-score is low for neurons in both rTNMF and TNMF, which shows a direct correlation with the number of misclassified cells.

7.3.4 Conclusion

Persistent Laplacian-regularized NMF is a dimensionality reduction technique that incorporates multiscale topological interactions between the cells. Traditional graph Laplacian-based regularization only represents a single scale and cannot capture the multiscale features of the data. We have also shown that the k-NN induced persistent Laplacian outperforms other NMF methods and is comparable to the cutoff-based persistent Laplacian-regularized NMF methods. However, PL methods do come with downsides. In particular, the weights for each filtration must be determined prior to the reduction. If there are T filtrations, then the hyperparameter space has size (T + 1)^T. However, the k-NN induced PL reduces the hyperparameter space to 2^T. In addition, we have shown that we can achieve a significant improvement even if we limit the hyperparameter space to 2^T. We would like to further explore possible parameter-free versions of topological NMF. Additionally, NMF objectives are not globally convex, but we have shown that with NNDSVDA initialization, our methods perform the best. One possible extension to the proposed methods is to incorporate higher-order persistent Laplacians in the regularization framework, which will reveal higher-order interactions.
In addition, we would like to expand the ideas to tensor decomposition, such as Canonical Polyadic Decomposition (CPD) and Tucker decomposition, multimodal omics data, and spatial transcriptomics data. 7.3.5 Data availability and code The data and model used to produce these results can be obtained at https://github.com/hozumiyu/TopologicalNMF-scRNAseq. 228 BIBLIOGRAPHY [1] Yuta Hozumi, Rui Wang, and Guo-Wei Wei. Ccp: Correlated clustering and projection for dimensionality reduction. arXiv preprint arXiv:2206.04189, 2022. [2] Kelin Xia, Kristopher Opron, and Guo-Wei Wei. Multiscale multiphysics and multidomain models—flexibility and rigidity. The Journal of chemical physics, 139(19):11B614_1, 2013. [3] Qiaolin Deng, Daniel Ramsköld, Björn Reinius, and Rickard Sandberg. Single-cell rna- seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science, 343(6167):193–196, 2014. [4] Monika S Kowalczyk, Itay Tirosh, Dirk Heckl, Tata Nageswara Rao, Atray Dixit, Brian J Haas, Rebekka K Schneider, Amy J Wagers, Benjamin L Ebert, and Aviv Regev. Single-cell rna-seq reveals changes in cell cycle and differentiation programs upon aging of hematopoietic stem cells. Genome research, 25(12):1860–1872, 2015. [5] Spyros Darmanis, Steven A Sloan, Ye Zhang, Martin Enge, Christine Caneda, Lawrence M Shuer, Melanie G Hayden Gephart, Ben A Barres, and Stephen R Quake. A survey of human brain transcriptome diversity at the single cell level. Proceedings of the National Academy of Sciences, 112(23):7285–7290, 2015. [6] Li-Fang Chu, Ning Leng, Jue Zhang, Zhonggang Hou, Daniel Mamott, David T Vereide, Jeea Choi, Christina Kendziorski, Ron Stewart, and James A Thomson. Single-cell rna-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome biology, 17:1–20, 2016. [7] Ozgun Gokce, Geoffrey M Stanley, Barbara Treutlein, Norma F Neff, J Gray Camp, Robert C Malenka, Patrick E Rothwell, Marc V Fuccillo, Thomas C Südhof, and Stephen R Quake. Cellular taxonomy of the mouse striatum as revealed by single-cell rna-seq. Cell reports, 16(4):1126–1137, 2016. [8] Maayan Baron, Adrian Veres, Samuel L Wolock, Aubrey L Faust, Renaud Gaujoux, Amedeo Vetere, Jennifer Hyoje Ryu, Bridget K Wagner, Shai S Shen-Orr, Allon M Klein, et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter-and intra-cell population structure. Cell systems, 3(4):346–360, 2016. [9] Gaëlle Breton, Shiwei Zheng, Renan Valieris, Israel Tojal da Silva, Rahul Satija, and Michel C Nussenzweig. Human dendritic cells (dcs) are derived from distinct circulating precursors that are precommitted to become cd1c+ or cd141+ dcs. Journal of Experimental Medicine, 213(13):2861–2870, 2016. [10] Alexandra-Chloé Villani, Rahul Satija, Gary Reynolds, Siranush Sarkizova, Karthik Shekhar, James Fletcher, Morgane Griesbeck, Andrew Butler, Shiwei Zheng, Suzan Lazo, et al. Single- cell rna-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. 229 Science, 356(6335):eaah4573, 2017. [11] Pengzhi Yu, Guangjin Pan, Junying Yu, and James A Thomson. Fgf2 sustains nanog and switches the outcome of bmp4-induced human embryonic stem cell differentiation. Cell stem cell, 8(3):326–334, 2011. [12] Adam Rodaway, Hiroyuki Takeda, Sumito Koshida, Joanne Broadbent, Brenda Price, James C Induction of the mesendoderm in the zebrafish Smith, Roger Patient, and Nigel Holder. germ ring by yolk cell-derived tgf-beta family signals and discrimination of mesoderm and endoderm by fgf. 
Development, 126(14):3067–3078, 1999. [13] Shinsuke Tada, Takumi Era, Chikara Furusawa, Hidetoshi Sakurai, Satomi Nishikawa, Masaki Kinoshita, Kazuki Nakao, Tsutomu Chiba, and Shin-Ichi Nishikawa. Characterization of mesendoderm: a diverging point of the definitive endoderm and mesoderm in embryonic stem cell differentiation culture. Development, 2005. [14] Ron Edgar, Michael Domrachev, and Alex E Lash. Gene expression omnibus: Ncbi gene expression and hybridization array data repository. Nucleic acids research, 30(1):207–210, 2002. [15] Tanya Barrett, Stephen E Wilhite, Pierre Ledoux, Carlos Evangelista, Irene F Kim, Maxim Tomashevsky, Kimberly A Marshall, Katherine H Phillippy, Patti M Sherman, Michelle Holko, et al. Ncbi geo: archive for functional genomics data sets—update. Nucleic acids research, 41(D1):D991–D995, 2012. [16] Liang Chen, Weinan Wang, Yuyao Zhai, and Minghua Deng. Deep soft k-means cluster- ing with self-training for single-cell rna sequence data. NAR genomics and bioinformatics, 2(2):lqaa039, 2020. [17] Nicholas Schaum, Jim Karkanias, Norma F Neff, Andrew P May, Stephen R Quake, Tony Wyss-Coray, Spyros Darmanis, Joshua Batson, Olga Botvinnik, Michelle B Chen, et al. Single-cell transcriptomics of 20 mouse organs creates a tabula muris: The tabula muris consortium. Nature, 562(7727):367, 2018. [18] Mauro J Muraro, Gitanjali Dharmadhikari, Dominic Grün, Nathalie Groen, Tim Dielen, Erik Jansen, Leon Van Gurp, Marten A Engelse, Francoise Carlotti, Eelco Jp De Koning, et al. A single-cell transcriptome atlas of the human pancreas. Cell systems, 3(4):385–394, 2016. [19] Roman A Romanov, Amit Zeisel, Joanne Bakker, Fatima Girach, Arash Hellysaz, Raju Tomer, Alan Alpar, Jan Mulder, Frederic Clotman, Erik Keimpema, et al. Molecular interrogation of hypothalamic organization reveals distinct dopamine neuronal subtypes. Nature neuroscience, 20(2):176–188, 2017. [20] Alexander J Gates, Ian B Wood, William P Hetrick, and Yong-Yeol Ahn. Element-centric clustering comparison unifies overlaps and hierarchy. Scientific reports, 9(1):8574, 2019. 230 [21] Fernando H Biase, Xiaoyi Cao, and Sheng Zhong. Cell fate inclination within 2-cell and 4-cell mouse embryos revealed by single-cell rna sequencing. Genome research, 24(11):1787–1796, 2014. [22] Ning Leng, Li-Fang Chu, Chris Barry, Yuan Li, Jeea Choi, Xiaomao Li, Peng Jiang, Ron M Stewart, James A Thomson, and Christina Kendziorski. Oscope identifies oscillatory genes in unsynchronized single-cell rna-seq experiments. Nature methods, 12(10):947–950, 2015. [23] Yuta Hozumi, Kiyoto A Tanemura, and Guo Wei Wei. Preprocessing of single cell rna sequencing data using correlated clustering and projection. Journal of chemical Information and Modeling, accepted, 2023. [24] Hongsong Feng and Guo-Wei Wei. Virtual screening of drugbank database for herg block- ers using topological laplacian-assisted ai models. Computers in biology and medicine, 153:106491, 2023. [25] Zailiang Zhu, Bozheng Dou, Yukang Cao, Jian Jiang, Yueying Zhu, Dong Chen, Hongsong Feng, Jie Liu, Bengong Zhang, Tianshou Zhou, et al. Tidal: Topology-inferred drug addiction learning. Journal of Chemical Information and Modeling, 63(5):1472–1489, 2023. [26] Li Shen, Hongsong Feng, Yuchi Qiu, and Guo-Wei Wei. Svsbi: sequence-based virtual screening of biomolecular interactions. Communications Biology, 6(1):536, 2023. [27] Sean Cottrell, Rui Wang, and Guowei Wei. PCA for microarray data analysis. doi.org/10.1021/acs.jcim.3c01023, 2023. 
PLPCA: Persistent Laplacian enhanced- Journal of Chemical Information and Modeling, 231 APPENDIX 7A ADDITIONAL RESULTS FOR PREPROCESSING SINGLE CELL RNA SEQUENCING DATA USING CORRELATED CLUSTERING AND PROJECTION 7A.1 Methods 7A.1.1 Program and Packages Python v3.8.5 was utilized for the benchmark, and the following Python packages were uti- lized: NumPy v1.19.2, Scikit-learn v0.23.2. For visualization, Plotly v5.9.0 and Matplotlib v2.5.2 were used. For preprocessing of the scRNA-seq data, instructions can be found on GitHub at https://github.com/hozumiyu/SingleCellDataProcess. 7A.1.2 Benchmark Protocol Before applying dimensionality reduction, a log-transform was performed on the data. PCA and CCP were then applied to the data to generate 50, 100, 150, 200, 250 and 300 components. For CCP, τ = 6.0 and κ = 2.0 was used for the exponential kernel to generate super-genes. For each reduction, 20 random initializations were used to ensure robustness and stability. For clustering, k-means was used, where k was chosen as the number of cell or gene types, provided by the data, indicated in Table 7.1. A total of 30 random initializations was used to compute the clusters, and for each clustering result, ARI, NMI, Silhouette score, and RSI were computed. For classification, the support vector machine (SVM) with default parameters from Scikit- learn’s package was used. We performed 5-fold cross validation, where 4 parts were used for training the SVM and 1 part was used for testing. Ten random seeds were used for each reduction. Since the number of cells within each class can differ greatly in scRNA-seq data, we modified the training and testing sets. We removed any cell type with less than 5 cells and for the training set, we randomly sampled up to five times the average number of samples per cell type to ensure a balanced training set. We computed the balanced accuracy and RSI for each of the classification results. 7A.2 Results 7A.2.1 Additional benchmark In this section, we compare the performance of CCP and PCA on datasets that were not intro- duced in the main text. More details on the benchmark can be found in Section subsubsection 7.1.2.1 of the main text. Figure 7A.1 shows the performance of CCP and PCA on three datasets: GSE45719, GSE75748 cell, and GSE82187 data. The solid and dashed lines correspond to CCP and PCA, respectively, and the red and blue lines correspond to the ARI and NMI, respectively. For CCP, τ = 6.0 and κ = 2.0 were used for the exponential kernel. For all three datasets, as the number of components increases, ARI and NMI either increase or remain stable for CCP. On the other hand, PCA shows instability in terms of ARI and NMI as the number of components increases. PCA has higher ARI 232 and NMI values when the number of components is 50, but in higher numbers of dimensions, PCA performs worse. Figure 7A.1 ARI and NMI of the clustering results of CCP and PCA on GSE45719, GSE75748 cell. The red and blue lines correspond to CCP and PCA, respectively. For CCP, all the tests utilize τ = 6 and κ = 2 for the exponential kernels. The x-axis shows the number of components in the reduction. Figure 7A.2 shows the performance of CCP and PCA on the 6 datasets, GSE84133h1, GSE84133h2, GSE84133h3 GSE84133h4, GSE84133m1, and GSE84133m2. The solid and dashed lines correspond to CCP and PCA, respectively, and the red and blue lines correspond to the ARI and NMI, respectively. 
For GSE84133h1, GSE84133h2, and GSE84133h4 data, CCP outperforms PCA when the number of components is larger than 150 for both ARI and NMI. CCP is able to maintain its performance throughout higher dimensions, whereas PCA has a notable drop in its ARI and NMI values when the number of components is larger than 150. For GSE84133h3, PCA outperforms CCP until 300 components. We found that GSE84133h3 data is the only data, where PCA’s ARI increased at higher dimensions. For the mouse datasets, GSE84133m1 and GSE84133m2, CCP outperforms PCA when the number of components is higher than 50. There is a slight decrease in ARI and NMI beyond 100 components for CCP, but it is much more stable than PCA. 233 Figure 7A.2 ARI and NMI of the clustering results of CCP and PCA on GSE84133h1, GSE84133h2, GSE84133h3, GSE84133h4, GSE84133m1, and GSE84133m2. The red and blue lines correspond to CCP and PCA, respectively. For CCP, all the test utilize τ = 6 and κ = 2 for the exponential kernels. The x-axis shows the number of components in the reduction. Figure 7A.3 shows the performance of CCP and PCA on the 2 datasets, GSE89232 and GSE94820. The solid and dashed lines correspond to CCP and PCA, respectively, and the red and blue lines correspond to the ARI and NMI, respectively. For GSE94820, CCP outperforms PCA when the number of components is greater than 50. Both ARI and NMI improve until about 200 components. On the other hand, PCA has a notable drop in its performance from 100 components to 150 components, where the ARI and NMI drop to below 0.1, indicating very low clustering accuracy. CCP has poor performance for GSE89232 at all components tested. Both ARI and NMI are less than 0.2, indicating poor clustering accuracy. Similar to other data, PCA has stability concerns at higher dimensions, where both ARI and NMI become less than 0.1. Figure 7A.3 ARI and NMI of the clustering results of CCP and PCA on GSE89232 and GSE94820. The red and blue lines correspond to CCP and PCA, respectively. For CCP, all the test utilize τ = 6 and κ = 2 for the exponential kernels. The x-axis shows the number of components in the reduction. 234 In order to understand the poor performance of CCP on GSE89232, the RSI was computed on the gene clustering result. Figure 7A.4 shows the RSI plot and t-SNE visualization of GSE89232 and GSE94820 genes. RSI was computed by varying the number of clusters, where the x-axis shows the number of clusters, and the green, red, and blue lines correspond to the RSI, RI, and SI of the gene clustering results. The RSI stabilizes starting at 16 clusters and gradually decreases as the number of clusters increases. Notice that even though the RSI peaks at 16 clusters, the RI and SI continue to increase, indicating that the clustering may be improving. In addition, t-SNE was used to obtain a 2-dimensional (2D) visualization of the gene types. For GSE94820, k = 64 was used for the k-means clustering. Seven gene clusters with the largest number of genes were colored, and the rest were colored in green. Notice that purple genes are spread out, indicating that the purple genes do not cluster well. For GSE89232, the RSI is high at a low number of clusters. Compared to the RI and SI of GSE94820, the RI and SI are relatively high at a low number of gene clusters. This may indicate that the clustering is better at a lower number of clusters. In addition, t-SNE was used to visualize the genes with k = 8. The blue and orange genes are mixed, but they are located at the same general location. 
The other genes are clustered nicely. This indicates that the optimal number of clusters may be smaller than 8, which suggests that the intrinsic dimensionality of GSE89232 is low. Such a low intrinsic dimensionality is unfavorable for CCP, which may explain its poor performance for this data. 235 (a) GSE94820 (b) GSE89232 Figure 7A.4 RI, SI, and RSI of the gene clustering of GSE94820 and GSE89232. k-means clustering was performed with 2, 4, 8, 16, 32, 64, and 128 gene clusters. For each number of clusters, 10 random initializations were utilized, and the average of RI, SI, and RSI was obtained. The red, blue, and green lines correspond to RI, SI, and RSI, respectively. t-SNE was used to visualize the genes in 2D. For GSE94820, k = 64 was used for k-means clustering, and 7 gene clusters were colored according to their cluster label, while the rest of the genes were colored in green. For GSE89232, k = 8 was used for k-means clustering, and the genes were colored according to their gene cluster labels. 7A.2.2 Additional Residue-Similarity Index comparison Figure 7A.5 shows performance comparison between CCP and PCA in classification and cluster- ing problems using GSE45719, GSE59114, and GSE89232 data. Top row shows the classification results, where 5-fold cross validation and SVM were used. The bottom row shows the clustering results, where k-means was used to obtain the labels. For RSI, cluster labels obtained from k-means and from the true cell types were used. 236 Figure 7A.5 Comparison of RSI on classification and clustering results of GSE45719, GSE59114, and GSE89232 data. The top row shows the classification results, where the solid and dashed lines correspond to the CCP and PCA results, respectively. The x-axis represents the number of components in the dimensionality reduction. The 5-fold cross-validation was used to evaluate the classification performance using SVM. The bottom row shows the clustering results of the GSE45719, GSE59114, and GSE89232 data. For RSI, the cluster labels obtained from k-means and the true cell labels provided by the authors were utilized. The solid and dashed lines correspond to the CCP and PCA results, respectively. There is a noticeable correlation between RSI and BA in the classification results of GSE45719 and GSE89232. PCA’s BA decreases in all three datasets. For GSE59114, CCP’s result improves as the number of components increases, but is not as good as PCA’s result. We can see from the ARI and NMI of GSE59114 that these two values are also low. In addition, from Figure 7A.4, we can see that the gene clustering is poor for GSE59114, which may have caused the poor performance. As for the clustering results, it is evident that the Silhouette score does not correlate with the other metrics used. Interestingly, the RSI using the true cell types and the cluster labels are similar for all the results, which means RSI is a good index for clustering when there are no true labels. Figure 7A.6 shows performance comparison between CCP and PCA in classification and clustering problems using GSE84133h1, GSE84133h2, and GSE84133h3 data. Top row shows the classification results, where 5-fold cross validation and SVM were used. The bottom row shows the clustering results, where k-means was used to obtain the labels. For RSI, cluster labels obtained from k-means and from the true cell types were used. 237 Figure 7A.6 Comparison of RSI on classification and clustering results of GSE84133h1, GSE84133h2, and GSE84133h3 data. 
The top row shows the classification results, where the solid and dashed lines correspond to CCP and PCA results, respectively. The x-axis represents the number of components in the dimensionality reduction. The classification results were obtained using SVM and verified using 5-fold cross validation. The bottom row shows the clustering results of the GSE45719, GSE59114, and GSE89232 datasets. For RSI, cluster labels obtained from k-means and the true cell labels provided by the data were utilized. The solid and dashed lines correspond to CCP and PCA results, respectively. For classification result of GSE84133h1 and GSE84133h2, CCP outperforms PCA after 100 components, whereas in GSE84133h3, PCA outperforms PCA. However, for all 3 datasets, PCA’s BA decreases as the number of components increases. In addition, RSI correlates with BA in all 3 datasets for CCP. For clustering results, Silhouette score shows no relation with ARI, NMI, or RSI. RSI using the true cell types and cluster labels correlates for all 3 datasets. Similar to the classification results, CCP’s ARI and NMI of GSE84133h3 are lower than those of PCA’s for most of the dimensions. Figure 7A.7 shows performance comparison between CCP and PCA in classification and clustering problems using GSE84133h1, GSE84133h2, and GSE84133h3 data. Top row shows the classification results, where 5-fold cross validation and SVM were used. The bottom row shows the clustering results, where k-means was used to obtain the labels. For RSI, cluster labels obtained from k-means and from the true cell types were used. 238 Figure 7A.7 Comparison of RSI on classification and clustering results of GSE84133h4, GSE84133m1, and GSE84133m2 data. Top row shows the classification results, where the solid and dashed lines correspond to CCP and PCA results, respectively. The x-axis is the number of components in the dimensionality reduction. The 5-fold cross validation was used to verify the classification result obtained from using SVM. The bottom row shows the clustering results of the GSE45719, GSE59114, and GSE89232 data. For RSI, the cluster labels obtained from k-means and from the true cell labels provided by the data were utilized. The solid and dashed lines correspond to CCP and PCA results, respectively. For all three datasets, CCP outperforms PCA in classification accuracy when the number of components is larger than 150. CCP’s RSI of classification results correlates with BA, whereas PCA’s RSI shows less correlation with BA. In addition, for all three datasets, PCA’s accuracy decreases as the number of components increases, showing instability in the reduction algorithm. The Silhouette score does not show any correlation with the other metrics. RSI using the true cell types and cluster labels correlates for all three datasets. Lastly, ARI and NMI follow a similar trend as BA, where CCP outperforms PCA at higher numbers of components. Figure 7A.8 shows performance comparison between CCP and PCA in classification and clustering problems using GSE75748 cell and GSE94820 data. Top row shows classification results, where 5-fold cross validation and SVM were used. The bottom row shows the clustering results, where k-means was used to obtain the labels. For RSI, cluster labels obtained from k-means and from the true cell types were used. 239 Figure 7A.8 Comparison of RSI on classification and clustering results of GSE75748 cell and GSE94820 data. 
Top row shows the classification results, where the solid and dashed lines correspond to CCP and PCA results, respectively. The x-axis is the number of components in the dimensionality reduction. The 5-fold cross validation was used to verify the classification results obtained from using SVM. The bottom row shows the clustering results of the GSE45719, GSE59114, and GSE89232 data. For RSI, the cluster labels obtained from k-means and from the true cell labels provided by the data were utilized. The solid and dashed lines correspond to CCP and PCA results, respectively. For all three datasets, CCP outperforms PCA in classification accuracy when the number of components is larger than 150. RSI of CCP’s classification results correlates with BA, whereas PCA’s RSI does not show any correlation. In addition, for all three datasets, PCA’s accuracy decreases as the number of components increases, showing instability in the reduction algorithm. The Silhouette score does not show any correlation with the other metrics. RSI using the true cell types and cluster labels correlates for all three datasets. Lastly, ARI and NMI follow a similar trend as BA, where CCP outperforms PCA at higher numbers of components. 240 APPENDIX 7B ADDITIONAL MATERIALS FOR ANALYZING SCRNA-SEQ DATA BY CCP-ASSISTED UMAP AND T-SNE 7B.1 Results 7B.1.1 Data statistics Table 7B.1 show basic statistical analysis of the dataset, namely the sparsity (number of zero expression), max, mean and median expression, and the median sum of the cell expression. For the mean and the median expression, we considered all nonzero expression values. Table 7B.1 Accession ID, source organism, and the counts for samples, genes, and cell types for fourteen individual datasets. Dataset GSE75748cell [6] GSE75748time [6] GSE82187 [7] GSE67835 [5] GSE84133 H1 [8] GSE84133 H2 [8] GSE84133 H3 [8] GSE84133 H4 [8] GSE84133 M1 [8] GSE84133 M2 [8] GSE84133 human [8] Muraro [18] Romanov [19] Qx Bladder [17] Qx Limb Muscle [17] Qx Spleen [17] Qs Diaphragm [17] Qs Limb Muscle [17] Qs Lung [17] Qs Trachea [17] Size (cells x genes) 1018 x 19097 758 x 19189 705 x 18840 420 x 22084 1937 x 20125 1724 x 20125 3605 x 20125 1303 x 20125 822 x 14878 1064 x 14878 8569 x 20125 2122 x 19046 2881 x 21143 2500 x 23341 3909 x 23341 9552 x 23341 870 x 23341 1090 x 23341 1676 x 23341 1350 x 23341 Sparsity 49.64 54.69 77.72 81.40 90.44 90.59 91.30 89.05 90.48 87.81 90.62 73.02 85.92 86.94 93.57 94.34 91.35 89.47 89.08 85.48 Max 605598.59 165308.35 5.60 58272.00 4318.00 3476.00 3071.00 4234.00 3477.00 3656.00 4318.00 4501.60 2571.46 4640.55 1124.88 1665.24 481734.98 682914.40 300415.15 730410.52 Mean Median Median Row Sum 470.98 153.61 1.70 136.71 3.02 2.70 3.35 3.05 2.64 3.28 3.09 3.52 2.48 3.64 2.74 2.46 247.92 294.23 245.66 220.08 4404343.22 1306562.14 7086.69 505256.00 5346.00 4889.50 4742.00 6017.00 2936.50 5631.00 5075.00 18076.53 7373.00 11102.50 4110.00 3244.50 500405.50 723268.50 626066.50 745632.50 75.00 27.00 1.73 27.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.60 1.26 1.27 1.17 1.07 70.45 78.39 67.56 61.81 7B.1.2 Comparison of CCP, PCA and NMF assisted visualization In this section, we show the effectiveness of CCP as an initialization of UMAP and tSNE by comparing it to PCA and NMF, 2 of the most utilized algorithm for dimensionality reduction for scRNA-seq data. We perform the same preprocessing procedure as described in Section 7.2.3.1 of the main text. 
Figure 7B.1 shows the comparison of CCP-assisted, PCA-assisted, and NMF-assisted visualization on the GSE75748 cell, GSE75748 time, GSE67835, and GSE82187 data. The columns from left to right correspond to CCP-assisted UMAP, CCP-assisted tSNE, PCA-assisted UMAP, PCA-assisted tSNE, NMF-assisted UMAP, and NMF-assisted tSNE. The NMF-assisted visualizations do not provide a meaningful clustering result for any of these datasets. We can observe the following from the visualizations:

• In GSE75748 cell, the CCP-assisted and PCA-assisted visualizations show similar results.

• In the GSE75748 time data, the CCP-assisted and PCA-assisted visualizations show similar results. Most notably, both techniques show a supercluster of 72hr and 96hr cells, which is consistent with Chu's findings [6]. There is a clearer distinction between the 72hr-96hr supercluster and the 36hr cluster in the PCA-assisted visualization. However, the CCP-assisted visualization shows a clear distinction of the 0hr cells from the other cells, indicating a clear separation of undifferentiated cells from cells that have begun or completed differentiating.

• In the GSE67835 data, CCP-assisted UMAP and tSNE have the best visualization. PCA-assisted UMAP does not show clear clustering, and the PCA-assisted tSNE visualization is dominated by outlier astrocytes.

• Both CCP-assisted UMAP and tSNE show the best visualization for the GSE82187 data. All the cell types form distinct clusters, whereas the PCA-assisted visualizations do not.

Figure 7B.1 Comparison of CCP-assisted visualization with PCA-assisted and NMF-assisted visualization on the GSE75748 cell, GSE75748 time, GSE67835, and GSE82187 data. The columns from left to right show CCP-assisted UMAP, CCP-assisted tSNE, PCA-assisted UMAP, PCA-assisted tSNE, NMF-assisted UMAP, and NMF-assisted tSNE. The rows from top to bottom show the GSE75748 cell, GSE75748 time, GSE67835, and GSE82187 data. CCP, PCA, and NMF were utilized to reduce the dimension to 300, and UMAP and tSNE were used to further reduce the dimension to 2. Samples were colored according to the cell types provided by the original authors.

Figure 7B.2 shows the comparison of CCP-assisted, PCA-assisted, and NMF-assisted visualization on the Quake data. The columns from left to right correspond to CCP-assisted UMAP, CCP-assisted tSNE, PCA-assisted UMAP, PCA-assisted tSNE, NMF-assisted UMAP, and NMF-assisted tSNE. Qx indicates scRNA-seq data obtained using the 10x Genomics platform, and Qs indicates data obtained from the SmartSeq2 platform. The NMF-assisted visualizations do not provide a meaningful clustering result for any of these datasets. We can observe the following from the visualizations:

• The CCP-assisted and PCA-assisted visualizations show similar clustering results for both the Qx Bladder and Qx Limb Muscle data. The most notable difference is that CCP-assisted UMAP shows 3 subclusters of bladder urothelial cells.

• CCP-assisted tSNE shows a considerable improvement over PCA-assisted tSNE. Macrophages do not form a cluster in PCA-assisted tSNE, whereas they form a cluster in CCP-assisted tSNE. Additionally, satellite cells are more dispersed in PCA-assisted tSNE than in CCP-assisted tSNE.

• In Qs Limb Muscle, both the CCP-assisted and PCA-assisted visualizations show similar results.

• In the Qs Trachea data, the CCP-assisted visualizations show better clustering results than their PCA-assisted counterparts. Noticeably, epithelial cells form a distinct cluster in the CCP-assisted visualization, whereas in the PCA-assisted visualization, epithelial cells form 2 clusters and have many cells mixed with mesenchymal cells.
Figure 7B.2 Comparison of CCP-assisted visualization with PCA-assisted and NMF-assisted visualization on the Quake data. The columns from left to right show CCP-assisted UMAP, CCP-assisted tSNE, PCA-assisted UMAP, PCA-assisted tSNE, NMF-assisted UMAP, and NMF-assisted tSNE. The rows each correspond to one of the five Quake datasets. Qx indicates scRNA-seq data obtained using the 10x Genomics platform, and Qs indicates data obtained from the SmartSeq2 platform. CCP, PCA, and NMF were utilized to reduce the dimension to 300, and UMAP and tSNE were used to further reduce the dimension to 2. Samples were colored according to the cell types provided by the original authors.
Figure 7B.3 shows the comparison of CCP-assisted, PCA-assisted, and NMF-assisted visualization on the GSE84133 human data. The columns from left to right correspond to CCP-assisted UMAP, CCP-assisted tSNE, PCA-assisted UMAP, PCA-assisted tSNE, NMF-assisted UMAP, and NMF-assisted tSNE. NMF-assisted visualization does not provide a meaningful clustering result for any of the datasets. In general, CCP-assisted visualization shows the clearest distinction between the cell types. PCA-assisted UMAP and tSNE do not separate all of the cell types.
Figure 7B.3 Comparison of CCP-assisted visualization with PCA-assisted and NMF-assisted visualization on the GSE84133 human data. The columns from left to right show CCP-assisted UMAP, CCP-assisted tSNE, PCA-assisted UMAP, PCA-assisted tSNE, NMF-assisted UMAP, and NMF-assisted tSNE. The rows correspond to one of the four patients. CCP, PCA, and NMF were utilized to reduce the dimension to 300, and UMAP and tSNE were used to further reduce the dimension to 2. Samples were colored according to the cell types provided by the original authors.
Figure 7B.4 shows the comparison of CCP-assisted, PCA-assisted, and NMF-assisted visualization on the GSE84133 mouse data. The columns from left to right correspond to CCP-assisted UMAP, CCP-assisted tSNE, PCA-assisted UMAP, PCA-assisted tSNE, NMF-assisted UMAP, and NMF-assisted tSNE. NMF-assisted visualization does not provide a meaningful clustering result for any of the datasets. In general, CCP-assisted visualization shows the clearest distinction between the cell types. PCA-assisted UMAP and tSNE do not separate all of the cell types.
Figure 7B.4 Comparison of CCP-assisted visualization with PCA-assisted and NMF-assisted visualization on the GSE84133 mouse data. The columns from left to right show CCP-assisted UMAP, CCP-assisted tSNE, PCA-assisted UMAP, PCA-assisted tSNE, NMF-assisted UMAP, and NMF-assisted tSNE. The rows correspond to mouse 1 and mouse 2. CCP, PCA, and NMF were utilized to reduce the dimension to 300, and UMAP and tSNE were used to further reduce the dimension to 2. Samples were colored according to the cell types provided by the original authors.
Figure 7B.5 shows the comparison of CCP-assisted, PCA-assisted, and NMF-assisted visualization on the Muraro, Romanov, and Qs Lung data. The columns from left to right correspond to CCP-assisted UMAP, CCP-assisted tSNE, PCA-assisted UMAP, PCA-assisted tSNE, NMF-assisted UMAP, and NMF-assisted tSNE. NMF-assisted visualization does not provide a meaningful clustering result for any of the datasets. CCP-assisted and PCA-assisted visualization are comparable for all three datasets. The PCA-assisted visualization of the Qs Lung data shows a clear distinction between monocytes and ciliated columnar cells, whereas the CCP-assisted visualization shows a supercluster of these two cell types.
However, the stromal cells form a distinct cluster in the CCP-assisted visualization, whereas in the PCA-assisted visualization they appear as a subcluster within the endothelial cell cluster.
Figure 7B.5 Comparison of CCP-assisted visualization with PCA-assisted and NMF-assisted visualization on the Muraro, Romanov, and Qs Lung data. The columns from left to right show CCP-assisted UMAP, CCP-assisted tSNE, PCA-assisted UMAP, PCA-assisted tSNE, NMF-assisted UMAP, and NMF-assisted tSNE. The rows from top to bottom show the Muraro, Romanov, and Qs Lung data. CCP, PCA, and NMF were utilized to reduce the dimension to 300, and UMAP and tSNE were used to further reduce the dimension to 2. Samples were colored according to the cell types provided by the original authors.
7B.1.3 Accuracy
To validate CCP's performance, we computed the ARI, NMI, and ECM for the 18 datasets. For each dataset, 10 random seeds were used to generate CCP, PCA, and NMF features, and the reduced features were further reduced to 2D using UMAP and tSNE. Leiden clustering was performed to obtain the cluster assignments, and the ARI, NMI, and ECM were computed by comparing the clustering results with the cell types provided by the original authors.
Figure 7B.6 shows the average ARI, NMI, and ECM of CCP-assisted, PCA-assisted, and NMF-assisted UMAP and tSNE over the 18 datasets. Notice that CCP outperforms both PCA and NMF. CCP-assisted UMAP improves upon PCA-assisted UMAP on average by 43% in ARI, 22.5% in NMI, and 19% in ECM.
(a) Average ARI over the 18 datasets (b) Average NMI over the 18 datasets (c) Average ECM over the 18 datasets
Figure 7B.6 The average ARI, NMI, and ECM over the 18 datasets. Ten random initializations were used to compute the reduction for each dataset. Leiden clustering was used to obtain the clustering results.
Additionally, we validated the CCP super-genes against the PCA-genes and NMF-genes. Figure 7B.7 shows the average ARI, NMI, and ECM of CCP, PCA, and NMF over the 18 datasets. Notice that CCP outperforms both PCA and NMF. Most notably, CCP improves upon PCA by 29% in ARI, 20% in NMI, and 15% in ECM.
(a) Average ARI over the 18 datasets (b) Average NMI over the 18 datasets (c) Average ECM over the 18 datasets
Figure 7B.7 The average ARI, NMI, and ECM over the 18 datasets. Ten random initializations were used to compute the reduction for each dataset. Leiden clustering was used to obtain the clustering results.
7B.2 Discussion
7B.2.1 Low-variance genes
Figure 7B.8 shows the effect of varying the number of super-genes and the cutoff ratio on the predictive power and visualization of the GSE75748 cell data. We utilized 10 random seeds to generate CCP super-genes with different numbers of super-genes and cutoff ratios. Then, Leiden clustering was used to obtain the cluster labels, and the ARI was computed using the cell types provided by the original authors. Notice that for every cutoff ratio, the ARI increases as the number of super-genes increases, and the ARI values are comparable at 300 super-genes. This suggests the robustness of the LV-gene procedure. Additionally, we computed the number of genes in the LV-gene cluster at varying cutoff values and plotted them together with the gene variances in descending order. Notice that at νc = 0.9 the variances of these genes are relatively small, indicating that their predictive power may not be high. Figure 7B.8(c) shows the visualization of CCP-assisted UMAP and tSNE at various cutoff ratios.
For the visualization, 300 super-genes were utilized, and UMAP and tSNE were applied to the super-genes to reduce the dimension to 2. Samples were then colored according to the cell types provided by the original authors. Note that all the visualizations are comparable, indicating the robustness of the LV-gene procedure under different cutoff ratios.
(a) ARI versus the number of super-genes and νc (b) Number of genes in the LV-gene cluster (c) CCP-assisted UMAP and tSNE visualization
Figure 7B.8 Analysis of the effect of the cutoff ratio νc on clustering and visualization of the GSE75748 cell data. (a) ARI of Leiden clustering as the number of super-genes and the cutoff ratio are varied. (b) The number of genes in the LV-gene cluster as νc is varied. (c) The top and bottom rows show the CCP-assisted UMAP and t-SNE visualizations, and the columns correspond to νc = 0.6, 0.7, 0.8, 0.9. 300 super-genes were used to initialize UMAP and tSNE, and the samples were colored according to the true cell types.

APPENDIX 7C
ALGEBRAIC CONNECTIVITY OF FRI IN CCP

7C.1 Motivation
In the original formulation of CCP, for component n, we have

\[
\Phi\left(|x_i^{S_n} - x_j^{S_n}|;\, r_c^{S_n}, \eta_n, \tau, \kappa\right) =
\begin{cases}
\exp\left(-\left(\dfrac{|x_i^{S_n} - x_j^{S_n}|}{\eta_n \tau}\right)^{\kappa}\right), & \text{if } |x_i^{S_n} - x_j^{S_n}| < r_c^{S_n}, \\
0, & \text{otherwise},
\end{cases}
\]
(7C.1)

where \(\eta_n\) is the algebraic connectivity term defined as

\[
\eta_n = \frac{1}{M} \sum_{m=1}^{M} \min_{j \neq m} |x_m^{S_n} - x_j^{S_n}|.
\]
(7C.2)

In other words, the algebraic connectivity term is defined as the average minimal distance. However, there are alternative values that \(\eta_n\) can take, using both supervised and unsupervised approaches. In this section, I will discuss these alternatives, namely the average mean distance, the cluster-wide minimal distance, and the cluster-wide average distance.

7C.2 Formulation of algebraic connectivity
7C.2.1 Average mean distance
The average mean distance can be defined as

\[
\eta_n = \frac{1}{M} \sum_{m=1}^{M} \frac{\sum_{j \neq m} |x_m^{S_n} - x_j^{S_n}|}{M}.
\]
(7C.3)

7C.2.2 Cluster-wide minimal distance
Let the data be represented as \(\{(x_m, y_m)\}_{m=1}^{M}\), where \(y_m \in \mathbb{Z}_L\) is the class or cluster label for sample m and L is the number of classes or clusters. Let the data be partitioned into L classes or clusters \(C_l = \{x_m \mid y_m = l\}\). For the cluster-wide minimal distance, the samples in each \(C_l\) obtain their own algebraic connectivity term \(\eta_n^l\). For class l, we have

\[
\eta_n^l = \frac{1}{|C_l|} \sum_{x_m \in C_l} \min_{x_j \in C_l,\, j \neq m} |x_m^{S_n} - x_j^{S_n}|.
\]
(7C.4)

7C.2.3 Cluster-wide average distance
Define \(C_l\) as in the cluster-wide minimal distance. For the cluster-wide average distance, the samples in each \(C_l\) obtain their own algebraic connectivity term \(\eta_n^l\). For class l, we have

\[
\eta_n^l = \frac{1}{|C_l|} \sum_{x_m \in C_l} \frac{\sum_{x_j \in C_l} |x_m^{S_n} - x_j^{S_n}|}{|C_l|}.
\]
(7C.5)

7C.3 Results
We utilized the same data as in Table 7.1 and performed the same classification procedure as in subsection 7A.1.2. Figure 7C.1 shows the average balanced accuracy (14) of the four methods: the standard (average minimal distance), the average mean distance, the cluster-wide minimal distance, and the cluster-wide mean distance. τ = 1 and κ = 2 with the exponential kernel were used for all tests.
Figure 7C.1 Average balanced accuracy (BA) for different choices of the algebraic connectivity. From left to right: standard (average minimal distance), average mean distance, cluster-wide minimal distance, and cluster-wide mean distance. The number above each bar indicates the average BA. τ = 1 and κ = 2 with the exponential kernel were used for all tests.
We can see that the average mean distance outperforms the other three methods for scRNA-seq data.
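For concreteness, the following is a minimal Python sketch of the four algebraic connectivity choices defined in Section 7C.2, computed for a single feature partition S_n. The function names and the use of SciPy's distance routines are illustrative and are not part of the released CCP implementation.

```python
# Sketch of the algebraic connectivity choices in Section 7C.2, assuming X is the
# (M x |S_n|) sub-matrix of samples restricted to partition S_n and `labels` holds
# the class/cluster assignments used by the cluster-wide variants.
import numpy as np
from scipy.spatial.distance import cdist


def eta_average_minimal(X):
    """Eq. (7C.2): mean over samples of the distance to their nearest neighbor."""
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)   # exclude the zero self-distance before taking the min
    return D.min(axis=1).mean()


def eta_average_mean(X):
    """Eq. (7C.3): mean over samples of their average distance to the other samples."""
    M = X.shape[0]
    D = cdist(X, X)               # self-distance is 0, so it does not change the row sums
    return (D.sum(axis=1) / M).mean()


def eta_cluster_wide(X, labels, mode="minimal"):
    """Eqs. (7C.4)-(7C.5): one eta per class/cluster, using minimal or mean distances."""
    labels = np.asarray(labels)
    etas = {}
    for l in np.unique(labels):
        Xl = X[labels == l]
        D = cdist(Xl, Xl)
        if mode == "minimal":
            np.fill_diagonal(D, np.inf)
            etas[l] = D.min(axis=1).mean()
        else:
            etas[l] = (D.sum(axis=1) / len(Xl)).mean()
    return etas
```

The resulting eta (or the per-class eta for labeled samples) then sets the kernel scale, entering Eq. (7C.1) through the product eta*tau in the denominator.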
However, as shown in Figure 7C.1, the other choices of algebraic connectivity also perform well. This indicates that CCP is quite stable for classification tasks, and that the choice of algebraic connectivity can be data-dependent.

CHAPTER 8
DISSERTATION CONTRIBUTIONS AND FUTURE DIRECTIONS

The main contributions of this dissertation are as follows:
• In Chapter 3, we introduce a large-scale clustering method called UMAP-assisted k-means clustering for clustering SARS-CoV-2 mutations, and analyze worldwide mutational trends. We also developed a novel alignment-free DNA sequence analysis method called k-mers topology. We show that k-mers topology outperforms other viral classification algorithms, and that k-mers topology can also be used for phylogenetic analysis.
• In Chapter 4, we propose residue-similarity (RS) analysis and the RS plot for data with dimensions greater than 3. The definitions and the formulation are presented in this chapter. We compared the RS plot with visualizations of the geometric shape of the data and with topological analysis to validate and compare the results.
• In Chapter 5, we propose correlated clustering and projection (CCP) as a dimensionality reduction method for data with high intrinsic dimensions. We benchmark CCP against PCA on standard benchmark datasets.
• In Chapter 6, we introduce topological nonnegative matrix factorization (TNMF), which incorporates multiscale topological and geometric information through persistent Laplacians (PLs). We show how PLs are constructed, present an alternative construction called the k-NN induced PL, and then prove the validity of the updating scheme for TNMF.
• In Chapter 7, we apply CCP and TNMF to scRNA-seq data. We show that both CCP and TNMF improve upon other dimensionality reduction methods for clustering, classification, and visualization.
In the future, I would like to extend this work and explore the following:
• In the k-mers topology method, we utilized persistent homology to convert sequences of varying length into a fixed set of features. We would like to utilize tools from natural language processing models to further enhance the method's performance. Additionally, I am interested in exploring the biological implications of the distribution of k-mers.
• The k-mers topology method only utilized persistent homology. We would like to extend the work to persistent Laplacians and other topological methods. Additionally, k-mers can also be interpreted as nodes of hypergraphs, and I would like to explore the use of hypergraphs in DNA sequencing methods.
• I would like to extend the k-mers topology to protein sequence classification.
• One downside of topological NMF is that it requires choosing the weights for each filtration. Although the k-NN induced persistent Laplacian can reduce the number of parameters, the number of parameters is still 2T. In the future, I would like to consider a parameter-free approach using consensus methods. Additionally, I would like to extend the work to a semi-supervised model.
• I would like to explore the use of the residue-similarity score as a minimization objective for dimensionality reduction.
The content of this dissertation is mostly adapted from the following publications and preprints.
• Hozumi, Y., & Wei, G.-W. (2024). K-mer Topology for Whole Genome Analysis.
• Hozumi, Y., & Wei, G.-W. (2024). Analyzing single cell RNA sequencing with topological nonnegative matrix factorization. Journal of Computational and Applied Mathematics, 115842.
• Hozumi, Y., Tanemura, K. A., & Wei, G.-W. (2023).
Preprocessing of single cell RNA sequencing data using correlated clustering and projection. Journal of Chemical Information and Modeling.
• Hozumi, Y., Wang, R., & Wei, G.-W. (2022). CCP: Correlated clustering and projection for dimensionality reduction. ArXiv Preprint ArXiv:2206.04189.
• Hozumi, Y., Wang, R., Yin, C., & Wei, G.-W. (2021). UMAP-assisted K-means clustering of large-scale SARS-CoV-2 mutation datasets. Computers in Biology and Medicine, 131, 104264.
• Hozumi, Y., & Wei, G.-W. (2023). Analyzing scRNA-seq data by CCP-assisted UMAP and t-SNE. ArXiv Preprint ArXiv:2306.13750.
These works led to the following publications and preprints, which are not discussed in this dissertation.
• Chen, J., Wang, R., Hozumi, Y., Liu, G., Qiu, Y., Wei, X., & Wei, G.-W. (2022). Emerging dominant SARS-CoV-2 variants. Journal of Chemical Information and Modeling, 63(1), 335–342.
• Cottrell, S., Hozumi, Y., & Wei, G.-W. (2023). K-Nearest-Neighbors Induced Topological PCA for Single Cell RNA-Sequence Data Analysis. ArXiv.
• Feng, H., Cottrell, S., Hozumi, Y., & Wei, G.-W. (2024). Multiscale differential geometry learning of networks with applications to single-cell RNA sequencing data. Computers in Biology and Medicine, 171, 108211.
• Wang, R., Chen, J., Gao, K., Hozumi, Y., Yin, C., & Wei, G.-W. (2020). Characterizing SARS-CoV-2 mutations in the United States. Research Square.
• Wang, R., Chen, J., Gao, K., Hozumi, Y., Yin, C., & Wei, G.-W. (2021). Analysis of SARS-CoV-2 mutations in the United States suggests presence of four substrains and novel variants. Communications Biology, 4(1), 228.
• Wang, R., Chen, J., Hozumi, Y., Yin, C., & Wei, G.-W. (2020). Decoding asymptomatic COVID-19 infection and transmission. The Journal of Physical Chemistry Letters, 11(23), 10007–10015.
• Wang, R., Chen, J., Hozumi, Y., Yin, C., & Wei, G.-W. (2022). Emerging vaccine-breakthrough SARS-CoV-2 variants. ACS Infectious Diseases, 8(3), 546–556.
• Wang, R., Hozumi, Y., Yin, C., & Wei, G.-W. (2020a). Decoding SARS-CoV-2 transmission and evolution and ramifications for COVID-19 diagnosis, vaccine, and medicine. Journal of Chemical Information and Modeling, 60(12), 5853–5865.
• Wang, R., Hozumi, Y., Yin, C., & Wei, G.-W. (2020b). Mutations on COVID-19 diagnostic targets. Genomics, 112(6), 5204–5213.
• Wang, R., Hozumi, Y., Zheng, Y.-H., Yin, C., & Wei, G.-W. (2020). Host immune response driving SARS-CoV-2 evolution. Viruses, 12(10), 1095.