TOPOLOGICAL REPRESENTATION AND DIMENSIONALITY REDUCTION OF SINGLE CELL RNA SEQUENCING AND VIRAL PHYLOGENETIC DATA By Yuta Hozumi A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Applied Mathematics—Doctor of Philosophy 2024

ABSTRACT

With the development and improvement of technology, we are now able to observe and collect large volumes of data, especially in the field of biology. Sequencing technology has rapidly advanced, allowing for large-scale sequencing with increased precision. This has resulted in large datasets with high dimensionality. One trade-off of high dimensionality is the increase in sparsity. There are multiple causes of sparsity, including sequencing errors, nonuniform noise, low read depth, etc. It is therefore paramount to find effective ways to reduce the dimensionality of the data and to select important features for downstream analysis.

DNA sequencing is a vital aspect of computational biology. Analyzing the sequences gives insight into relationships between species as well as within species, such as evolution, genetic drift, mutation, common ancestry, and more. We gathered over 3.6 million SARS-CoV-2 sequences from all over the world and performed multiple sequence alignments to extract mutations. Utilizing the mutation data, we performed UMAP dimensionality reduction followed by clustering to analyze mutational trends in the US and the world. However, such alignment methods come with limitations, namely their computational cost and underlying assumptions. To this end, we developed k-mer topology, an alignment-free DNA sequence analysis method that utilizes techniques from topological data analysis. This results in a uniform set of features for all sequences, regardless of their sequence length.

Another advance in sequencing technology is single-cell RNA sequencing (scRNA-seq), in which gene expression profiles for each cell can be extracted. With current technology, roughly 10,000 cells and over 20,000 genes can be sequenced, which causes problems due to the high dimensionality of the data. Most dimensionality reduction methods employ frequency domain representations obtained from matrix diagonalization and may not be efficient for large datasets with relatively high intrinsic dimensions. To address this challenge, Correlated Clustering and Projection (CCP) offers a novel data domain strategy that does not rely on matrix diagonalization. CCP partitions high-dimensional features into correlated clusters and then projects the correlated features in each cluster into a one-dimensional representation based on sample correlations. Additionally, we propose topological nonnegative matrix factorization (TNMF), a matrix decomposition algorithm that incorporates multiscale geometric and topological information in the form of the persistent Laplacian. Due to the nonnegativity constraint of the method, the reduced features are interpretable. Both CCP and TNMF are validated on scRNA-seq data, demonstrating superior performance in clustering, classification, and visualization.

Traditional visualization methods require dimensionality reduction to two or three dimensions. However, such aggressive reduction can lead to misleading conclusions. We therefore introduce the Residue-Similarity (RS) plot, a visualization tool for data with dimensions greater than 3. The RS plot is constructed from two measures, the residue (R) and similarity (S) scores.
The R-score measures the interclass difference, and the S-score measures the intraclass similarity. The Residue-Similarity Index (RSI) is introduced to evaluate the R-score and S-score jointly and is verified to correlate with clustering and classification accuracies across a variety of benchmark datasets.

Copyright by YUTA HOZUMI 2024

This dissertation is dedicated to my parents, Eiji and Momoko. Thank you for always believing in me.

ACKNOWLEDGEMENTS

I would like to begin by thanking Prof. Guo-Wei Wei for being the most supportive and encouraging advisor throughout my Ph.D. journey. During my first year, when I was struggling through the exams, Prof. Wei was the first one to reach out to me to offer advice. After I passed the exams, right when I was getting accustomed to research, we were hit by the COVID-19 pandemic. During this time, Prof. Wei called me regularly, making sure I was doing well, and always made himself available whenever I needed help. Even though I was still inexperienced, he invited me to join the COVID research group, which kick-started my research career. I was able to observe the motivation and brilliance of researchers at the forefront of research. Furthermore, Prof. Wei taught me the importance of the 'big picture,' which many students fail to see, and it has opened my eyes to mathematical biology. I genuinely appreciate him for providing me with numerous opportunities. Without Prof. Wei, I do not think my current goal of pursuing research would exist.

I would also like to thank my dissertation committee, Prof. Mark Iwen, Prof. Ekaterina Rapinchuk (Merkurjev), and Prof. Longxiu Huang, for their encouragement, feedback, and comments. Every time I saw them, they offered encouragement and advice. Additionally, I would like to thank Prof. Changchuan Yin for introducing me to genomic analysis during the COVID projects. He introduced me to the field of bioinformatics, which became the basis of my dissertation. I would also like to thank the senior members of our group, Dr. Rui Wang, Prof. Jiahui Chen, and Prof. Duc Nguyen, for mentoring me throughout my Ph.D. I am especially thankful to Dr. Rui Wang for her guidance throughout my second and third years at MSU. I learned all kinds of skills, including coding, writing papers, topological data analysis, and more. I am also thankful to my cohort. Without their support, especially during my first year and the COVID lockdown, I may not have been able to finish my Ph.D. I am especially thankful to Cullen Haselby, Luis Suarez, Quinn Minich, and Chris Potvin for their encouragement. We are all in different fields of mathematics, but I am thankful for all the advice they have given me.

Lastly, I give my biggest thanks to my father, Eiji Hozumi, and my mother, Momoko Hozumi. They have been my biggest supporters from day one at MSU. I sincerely thank them for their endless and unconditional support and love. They were the first ones I would reach out to if anything happened, good or bad, and they would always listen till the very end.

Finally, I would like to acknowledge that all of the work included in this thesis was supported in part by the following grants: NIH grants R01GM126189, R01AI164266, and R35GM148196; NSF grants DMS-1721024, DMS-1761320, IIS-1900473, and DMS-2052983; NASA grant 80NSSC21M0023; Michigan State University Research Foundation; Bristol-Myers Squibb 65109; and Pfizer.
Additionally, I would like to thank MSU's high performance computing cluster, the IBM TJ Watson Research Center, the COVID-19 High Performance Computing Consortium, and NVIDIA for computational assistance.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
1.1 Dimensionality Reduction
1.2 Topological Data Analysis
1.3 Phylogenetic Analysis
1.4 Single Cell RNA Sequencing (scRNA-seq)
1.5 Outline
BIBLIOGRAPHY

CHAPTER 2 BACKGROUND
2.1 Overview of Dimensionality Reduction
2.2 Clustering Algorithm
2.3 Evaluation metrics for clustering and classification
2.4 Topological Data Analysis
2.5 Phylogenetic Analysis
BIBLIOGRAPHY

CHAPTER 3 PHYLOGENETIC ANALYSIS
3.1 UMAP-assisted k-means clustering of large-scale SARS-CoV-2 mutation datasets
3.2 K-mer Topology for Whole Genome Analysis
BIBLIOGRAPHY
APPENDIX 3A ADDITIONAL MATERIALS FOR UMAP-ASSISTED K-MEANS CLUSTERING OF LARGE-SCALE SARS-COV-2 MUTATION DATASETS
APPENDIX 3B ADDITIONAL MATERIALS FOR K-MERS TOPOLOGY FOR ALIGNMENT-FREE SEQUENCE ANALYSIS

CHAPTER 4 RESIDUE-SIMILARITY SCORES AND INDEXES
4.1 Methods
4.2 Results
4.3 Discussion
BIBLIOGRAPHY

CHAPTER 5 CORRELATED CLUSTERING AND PROJECTION
5.1 Introduction
5.2 Methods and Algorithms
5.3 Results
5.4 Discussion
5.5 Concluding remarks
BIBLIOGRAPHY

CHAPTER 6 TOPOLOGICAL NONNEGATIVE MATRIX FACTORIZATION
6.1 Introduction
6.2 Prior Work
6.3 Topological NMF
BIBLIOGRAPHY
CHAPTER 7 APPLICATION IN SINGLE CELL RNA SEQUENCING
7.1 Preprocessing of Single Cell RNA Sequencing data using Correlated Clustering and Projection
7.2 Analyzing scRNA-seq data by CCP-assisted UMAP and t-SNE
7.3 Topological Non-Negative Matrix Factorization for Single Cell RNA Sequencing Data
BIBLIOGRAPHY
APPENDIX 7A ADDITIONAL RESULTS FOR PREPROCESSING SINGLE CELL RNA SEQUENCING DATA USING CORRELATED CLUSTERING AND PROJECTION
APPENDIX 7B ADDITIONAL MATERIALS FOR ANALYZING SCRNA-SEQ DATA BY CCP-ASSISTED UMAP AND T-SNE
APPENDIX 7C ALGEBRAIC CONNECTIVITY OF FRI IN CCP

CHAPTER 8 DISSERTATION CONTRIBUTIONS AND FUTURE DIRECTIONS

CHAPTER 1 INTRODUCTION

1.1 Dimensionality Reduction

Technological advances have fueled exponential growth in high-dimensional data. In biological science, high-dimensional data are ubiquitous in genomics, epigenomics, transcriptomics, proteomics, metabolomics, and phenomics. In image science, an image of moderate size, i.e., 1024 × 1024, gives rise to a 1,048,576-dimensional vector. The rapid increase in the size and complexity of scientific data has made the problem of the 'curse of dimensionality' more challenging than ever before in data sciences [1]. In machine learning, this problem is associated with the phenomenon that the average predictive power of a well-trained model first increases as the feature size increases but starts to deteriorate beyond a certain dimensionality [2, 3]. Moreover, data in an enormously large feature space become sparse, posing challenges for statistical analysis in determining statistical significance and principal variables. Furthermore, it is challenging to visualize data in high dimensions unless one can reduce the dimension to two or three. Therefore, it is desirable to reduce the dimensionality of high-dimensional data for the sake of prediction, analysis, and visualization. These challenges have been driving the development of many dimensionality reduction methods that can capture the intrinsic correlations of the original data in a low-dimensional representation [4]. Dimensionality reduction can be achieved through various deep neural networks (DNNs), such as graph neural networks, autoencoders, transformers, etc. However, most DNN methods may not work well with excessively high-dimensional data. Commonly used dimensionality reduction algorithms fall into two categories: linear and nonlinear with respect to a certain distance metric. Principal component analysis (PCA) [5] is a basic linear dimensionality reduction (DR) algorithm that focuses on finding the principal components by creating new uncorrelated variables that successively maximize variance [6]. Specifically, the first principal component is a vector that maximizes the variance of the projected data, while the ith principal component is a vector that is orthogonal to the first (i − 1) principal components and maximizes the variance of the projected data. Linear discriminant analysis (LDA) is another linear dimensionality reduction method, proposed by Sir Ronald Fisher in 1936 [7].
As a generalization of Fisher's linear discriminant, LDA aims to find a linear combination of features that maximizes the separability of classes and minimizes the intra-class variance for the multi-class classification problem [8]. Nonnegative matrix factorization (NMF) is another linear dimensionality reduction method that is often employed for data with nonnegative entries, such as the count matrix in single-cell RNA sequencing (scRNA-seq) data. NMF decomposes a nonnegative matrix into two nonnegative factor matrices, and the basis is a nonnegative combination of the original features, leading to a more interpretable result [9]. Another category of dimensionality reduction methods contains many nonlinear algorithms, which can be classified into two groups: those that favor the preservation of the global pairwise distances and those that seek to retain local distances instead of global distances. Algorithms such as kernel principal component analysis (kernel PCA), Sammon mapping, and spectral embedding fall within the former category, while Isomap, LargeVis, Laplacian eigenmaps, locally linear embedding (LLE), diffusion maps [10, 11], t-SNE, and UMAP fall into the latter category. Kernel PCA [12] is an extension of PCA. Standard PCA typically performs poorly if the data have complicated algebraic structures that cannot be well represented in a linear space. Kernel PCA, introduced in 1998, addresses this by applying kernel functions in a reproducing kernel Hilbert space. In 1969, John W. Sammon first proposed the Sammon mapping [13], which aims to conserve the structure of inter-point distances by minimizing Sammon's error and attempts to ensure that the mapping does not affect the underlying topology [14]. Spectral embedding computes the full graph Laplacian and uses its eigenvectors, which allows for the preservation of the original global graph structure in the lower-dimensional space. Although kernel PCA, Sammon mapping, and spectral embedding preserve the pairwise distance structure among all the data, they fail to capture the local relationship between data points. Therefore, nonlinear algorithms are essential to incorporate the local structure in the low-dimensional space and better describe the local information of the original data. A quantitative survey of dimensionality reduction techniques is provided in Ref. [15]. Several widely used nonlinear DR algorithms are briefly discussed in the following. Isomap [16] is a nonlinear method aiming to preserve the geodesic distance between samples while reducing dimensionality. It is an extension of multidimensional scaling (MDS) [17], replacing the Euclidean distance in MDS with the geodesic distance (estimated by Dijkstra's shortest-path distance in graph theory). Isomap is a local method, estimating the intrinsic geometry of a data manifold by estimating each sample's neighbors, ensuring its efficiency [18]. Laplacian Eigenmap (LE), introduced in 2003 [19], is another unsupervised nonlinear algorithm that preserves the local properties of a weighted graph Laplacian. LE constructs a neighborhood graph in which each data point is linked to its nearest neighbors. Then, edge weights are estimated using the Gaussian kernel function. After solving the generalized eigenvalue problem of the weighted neighborhood graph Laplacian, the eigenvector associated with the zero eigenvalue is discarded, and the eigenvectors corresponding to the smallest nonzero eigenvalues are used for embedding in a k-dimensional space.
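As an illustration of the Laplacian eigenmap construction just described (a nearest-neighbor graph with Gaussian edge weights followed by a generalized eigenvalue problem), the following NumPy/SciPy sketch gives a minimal, simplified implementation. It is not the code used in this dissertation; the function name, parameter choices, and toy data are assumptions made purely for exposition, and a connected neighborhood graph is assumed.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def laplacian_eigenmap(X, n_neighbors=10, sigma=1.0, n_components=2):
    """Embed the rows of X (samples x features) into n_components dimensions."""
    D = cdist(X, X)                               # pairwise Euclidean distances
    W = np.exp(-D**2 / (2.0 * sigma**2))          # Gaussian (heat kernel) weights
    # Keep only each point's k nearest neighbors and symmetrize the graph.
    order = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
    mask = np.zeros_like(W, dtype=bool)
    rows = np.repeat(np.arange(X.shape[0]), n_neighbors)
    mask[rows, order.ravel()] = True
    W = np.where(mask | mask.T, W, 0.0)
    deg = np.diag(W.sum(axis=1))                  # degree matrix
    L = deg - W                                   # unnormalized graph Laplacian
    # Generalized eigenproblem L v = lambda * deg * v (assumes a connected graph).
    _, vecs = eigh(L, deg)
    # Discard the trivial eigenvector of the zero eigenvalue; keep the next ones.
    return vecs[:, 1:n_components + 1]

# Toy usage: embed 200 noisy points sampled near a circle into 2D.
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(t), np.sin(t), 0.05 * rng.standard_normal(200)]
Y = laplacian_eigenmap(X, n_neighbors=10, sigma=0.5, n_components=2)
print(Y.shape)  # (200, 2)
```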
Moreover, t-Distributed Stochastic Neighbor Embedding (t-SNE) [20, 21] is a nonlinear, manifold-based method well-suited for reducing high-dimensional data into a two- or three-dimensional space for visualization. t-SNE represents the similarity of every pair of data points by constructing a conditional probability distribution over pairs of points. It then applies the Student t-distribution to obtain the probability distribution in the embedded space. By minimizing the Kullback-Leibler (KL) divergence between these two distributions in the original and embedded spaces, t-SNE preserves the significant structure of the data for analysis and visualization [22]. Furthermore, a state-of-the-art nonlinear dimensionality reduction algorithm is uniform manifold approximation and projection (UMAP) [23], a graph-based algorithm that builds on Laplacian eigenmaps and performs well for visualization and feature extraction. Three assumptions make UMAP stand out among the other dimensionality reduction algorithms: (1) the data are uniformly distributed on a Riemannian manifold, (2) the Riemannian metric is locally constant, and (3) the manifold is locally connected. UMAP creates a weighted graph representation based on a k-nearest-neighbor search and then minimizes the edge-wise cross-entropy between this graph and the embedded low-dimensional weighted graph representation, in terms of a fuzzy set cross-entropy loss function, via stochastic gradient descent. Specifically, UMAP constructs a weighted directed adjacency matrix A, where A(i, j) represents the connection between the ith node and the jth node when the jth node is one of the k nearest neighbors. Next, a normalized sparse Laplacian matrix can be derived from A with the cross-entropy loss incorporated, and the k-dimensional eigenvectors of this normalized Laplacian are used to represent each of the original data points in a low-dimensional space. All of the dimensionality reduction algorithms mentioned above have broad applications in science and technology. However, they often rely on frequency domain representations obtained from matrix diagonalization. Typically, the computational complexity of eigenvalue decomposition for a full matrix is O(M³), where M is the number of samples forming an M × M matrix. While fast solvers exist, they may sacrifice accuracy, especially for datasets with relatively high intrinsic dimensions [15]. Moreover, for datasets with a large number of features I, where M ≪ I, the reliance on matrix diagonalization limits the performance of these algorithms. Additionally, many methods depend on computing distances between data entries (samples), which can be problematic in high dimensions. In particular, methods that utilize nearest neighbors, such as UMAP and t-SNE, may suffer from instability in datasets with moderately high intrinsic dimensions, as outlined by the 'curse of dimensionality' [1]. A somewhat related but distinct problem is tensor-based dimensionality reduction [24, 25], which addresses data with internal structures of geometric, topological, algebraic, and/or physical origins. Methods dealing with tensorial structures, such as the Tucker decomposition [26, 27], are often employed in addition to the aforementioned dimensionality reduction approaches. These methods find applications in various domains such as videos, X-ray Computed Tomography (X-ray CT), and Magnetic Resonance Imaging (MRI) data.
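For concreteness, the short sketch below shows how the t-SNE and UMAP embeddings discussed above are typically invoked in practice. It assumes the third-party scikit-learn and umap-learn packages are available, and the dataset and parameter values are illustrative assumptions rather than settings used in this dissertation.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import umap  # provided by the umap-learn package

X, y = load_digits(return_X_y=True)          # 1797 samples, 64 features

# t-SNE: minimizes the KL divergence between pairwise similarity distributions.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# UMAP: builds a fuzzy k-nearest-neighbor graph and optimizes a cross-entropy loss.
X_umap = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2,
                   random_state=0).fit_transform(X)

print(X_tsne.shape, X_umap.shape)            # (1797, 2) (1797, 2)
```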
Other related issues pertain to feature evaluation, ranking, clustering, extraction, and selection, particularly for unlabeled data. Feature evaluation and ranking can be accomplished through filtering or embedding techniques, while feature clustering and selection can be carried out using methods such as k-means, k-means++, and k-medoids, among others. These methods often serve as preprocessing steps for dimensionality reduction. For labeled data, various supervised learning methods can be employed for feature selection or extraction.

In this dissertation, we introduce two novel dimensionality reduction methods: Correlated Clustering and Projection (CCP) and Topological Nonnegative Matrix Factorization (TNMF). CCP is a data domain dimensionality reduction method consisting of two steps. First, features are clustered based on their similarities. Then, the clustered features are nonlinearly projected into a single descriptor. TNMF, on the other hand, is a persistent Laplacian regularized NMF method. Unlike standard manifold regularization, TNMF can capture the geometric and topological shape of the data at multiple scales. We evaluate CCP and TNMF on standard benchmark datasets and apply these methods to single-cell RNA sequencing (scRNA-seq) data. Additionally, we propose a novel visualization method called the Residue-Similarity (RS) plot. The RS plot comprises two components. Given a set of labels for each sample, the Residue (R) score measures the interclass difference, while the Similarity (S) score measures the intraclass similarity. We find that the RS plot can effectively visualize data with dimensions greater than 3, and the RS score correlates with the accuracy of classification tasks.

1.2 Topological Data Analysis

In recent years, there has been an explosion of data across multiple disciplines, with biology experiencing a significant influx of new data from technological advances such as genomics, single-cell RNA sequencing (scRNA-seq) [28], spatial transcriptomics (ST) [29], epigenomics [30], and DNA sequencing [31]. The data in many fields are often high-dimensional and noisy and lack uniformity in size or features. However, traditional data analysis tools, including dimensionality reduction techniques, often struggle to capture the intricate structures and relationships within such complex datasets. Moreover, traditional approaches are ill-equipped to handle nonuniform data distributions. To address these limitations, topological data analysis (TDA) has gained popularity in recent years due to its ability to extract meaningful insights from complex data using properties from algebraic topology [32, 33, 34, 35]. At the heart of TDA lies persistent homology, a branch of algebraic topology that serves as a powerful tool for characterizing the intrinsic structure in datasets, regardless of the inherent noise present in the data [36, 37, 38]. Persistent homology differs from traditional dimensionality reduction techniques because it relies on the topology of the data rather than on its geometric properties and traditional metrics. In essence, persistent homology represents the dataset as a topological space and uses tools from algebraic topology to extract meaningful features, namely the topological invariants. Additionally, by introducing a filtration, we can keep track of the changes in the topological invariants to understand the geometric and topological shape of the data, which is termed persistence.
Through the lens of persistent homology, researchers can uncover hidden patterns, identify relevant structures, and gain a deeper understanding of the datasets' intrinsic characteristics. Spectral graph theory has also been introduced to TDA in recent years, with the introduction of topological Laplacians. Eckmann [39] introduced simplicial complexes to the graph Laplacian defined on point cloud data, leading to the combinatorial Laplacian. This can be viewed as a discrete counterpart of the de Rham-Hodge Laplacian on manifolds. Both the Hodge Laplacian and the combinatorial Laplacian are topological Laplacians that give rise to topological invariants in their kernel space, specifically the harmonic spectra. However, the nonharmonic spectra contain algebraic connectivity that cannot be revealed by the topological invariants from standard persistent homology [40]. A significant development in topological Laplacians occurred in 2019 with the introduction of persistent topological Laplacians. Specifically, an evolutionary de Rham theory was introduced to obtain persistent Hodge Laplacians on manifolds [41]. Meanwhile, the persistent combinatorial Laplacian [42], also known as the persistent spectral graph or persistent Laplacian (PL), was introduced for point cloud data. These methods have spurred numerous theoretical developments [43, 44, 45, 46, 47] and a software package [48], as well as remarkable applications in various fields, including protein engineering [49], forecasting the emerging SARS-CoV-2 variants BA.4/BA.5 [50], and predicting protein-ligand binding affinity [51]. Recently, PL has been shown to improve PCA performance [52]. In this dissertation, we utilize TDA as a visualization tool for high-dimensional data. Additionally, we utilize persistent homology to develop a novel algorithm called k-mers topology, which is used to analyze viral nucleotide sequences. Lastly, we develop topological nonnegative matrix factorization (TNMF), utilizing PL to analyze single-cell RNA sequencing data.

1.3 Phylogenetic Analysis

Phylogenetic analysis of genetic sequences is crucial for elucidating the evolutionary relationships both between and within species [53]. This analysis entails the comparison of two or more DNA or RNA sequences to discern similarities, differences, and patterns within the genetic code. By identifying similarities and differences among sequences, researchers can gain insights into evolutionary relationships, mutations, genetic drift, genome assembly, gene annotation, and more. Having a robust and scalable method that enables comparisons within species and across species is essential for laying the groundwork of genomic analysis. Traditional phylogenetic analysis utilizes sequence alignment, where pairs of sequences are aligned to obtain similarity scores. These methods include sequence similarity scores, like BLAST [54] and FASTA [55]; sequence profile searches, like PSI-BLAST [54] and HMMER [56]; whole-genome comparison, like Mauve [57] and TBA [58]; and multiple sequence alignment, like ClustalW [59], MAFFT [60], and Muscle [61]. In recent years, the sequence of SARS-CoV-2 has been heavily studied using alignment-based methods to analyze mutational trends and predict future mutations [62, 63, 50, 64, 65]. An advantage of sequence alignment methods is their ability to extract mutation sites, which can then be utilized for downstream analysis.
These alignment-based methods assume that the segments of the sequences can be categorized into conserved and non-conserved segments. The effectiveness of the alignment can then be measured based on the amount of conserved regions, penalized by the amount of non-conserved segments, or gaps. However, such methods can fail if these conserved segments are not properly arranged in the sequence or if there is not a sufficient amount of conserved segments, which often occurs in real-world data [66, 67, 68]. Additionally, alignment-based methods are time-consuming and require a large amount of memory, which makes large-scale comparison difficult. Other cases where alignment-based methods can fail can be found in [69]. To combat these issues, alignment-free methods have been developed, which do not make such assumptions about the sequences. One of the most common alignment-free methods is the k-mers-based approach. In the original k-mers method developed by Blaisdell in 1986 [70], the frequency of words, or motifs, is counted, and each sequence is represented as a count-based vector. Then, a metric can be defined to compare sequences for phylogenetic analysis. Numerous extensions have been proposed to improve the k-mers method, such as those by Wu et al. [71, 72, 73], Korf et al. [74], and Jun et al. [75]. Because the counting is linear with respect to the sequence length, it is extremely computationally efficient, allowing for whole-genome analysis. However, these methods do not capture the relationship between the k-mers or the positional information of the sequence. Several alternative alignment-free methods have been proposed as well. The Natural Vector Method (NVM) computes the moments of the k-mers' positions [76, 77], incorporating the relationship between the k-mers within the sequence, thus overcoming limitations of the k-mers-based method. In information theory-based methods, the distance between sequences is obtained by computing the amount of information shared between them [78, 79, 80, 81, 82]. The Chaos Game Representation (CGR), originally proposed by Jeffrey in 1990 [83], represents the sequence as an iterated function, allowing the DNA sequence to be visualized as an image [84]. The CGR representation was further extended in [85, 86, 87, 88] for sequence analysis. Other methods, such as the discrete Fourier power spectrum method [89, 90], the fuzzy integral similarity method [91], and more [92], have been developed for phylogenetic analysis. In this dissertation, we employ alignment-based methods to study the mutational patterns in SARS-CoV-2. After aligning the sequences, we extract mutations and utilize UMAP to reduce the dimensionality for clustering. Additionally, we introduce a novel alignment-free method called k-mer topology, which utilizes tools from persistent homology to extract patterns within the sequence. We benchmark our method on virus classification tasks and apply it to standard phylogenetic analysis.

1.4 Single Cell RNA Sequencing (scRNA-seq)

Single-cell RNA sequencing (scRNA-seq) reveals heterogeneity within cell types, leading to an understanding of cell-cell communication, cell differentiation, and differential gene expression. With current technology and protocols, more than 20,000 genes can be identified. Additionally, numerous data analysis pipelines have been developed to assist in analyzing such complex data [93, 94, 95, 96, 97, 98]. Despite improvements in technology that enable more accurate gene readings, analyzing these readings remains challenging.
Causes of this challenge include dropout event-induced zero expression counts, low sequencing depth resulting in fewer readings, general noise, and the high dimensionality of the original data [28]. As a result, dimensionality reduction and feature selection become important for downstream analysis. Numerous dimensionality reduction and feature selection methods have been proposed for scRNA-seq data. One such method is SinNLRR, a subspace clustering method based on non-negative and low-rank representation, which assumes that scRNA-seq data have an inherently low rank and attempts to find the smallest-rank matrix that captures the original data [99]. Additionally, various non-negative matrix factorization (NMF) methods with different constraints have been developed. In these methods, the low-dimensional representation of scRNA-seq data is achieved through a linear combination of the original genes, which are called meta-genes [100, 101, 102, 103, 104, 105]. Single-cell interpretation via multikernel learning (SIMLR) utilizes multiple kernels to learn a cell-cell similarity metric that generalizes to different biological experiments and experimental procedures [106]. In addition, more traditional approaches, such as principal component analysis (PCA) [107] and its derivatives [108, 109], and visualization techniques, such as uniform manifold approximation and projection (UMAP) [23] and t-distributed stochastic neighbor embedding (t-SNE) [110], have been heavily utilized for scRNA-seq data. Furthermore, deep learning has also been used for dimensionality reduction [111, 112, 113, 114, 115, 116]. Deep learning and ensemble methods are another class of approaches that have become popular for single-cell RNA-seq analysis. Single-cell variational inference (scVI) utilizes deep neural networks to obtain information from similar cells and genes to approximate the distribution of the underlying gene expression values [111]. Single-cell clustering using marker genes (SCMcluster) utilizes known marker genes to guide feature selection and perform ensemble clustering [117]. AutoCell [118] utilizes a variational autoencoding network that combines a Gaussian mixture model and graph embedding to model high-dimensional scRNA-seq data. Diffusion models [119, 120, 121, 122], generative adversarial networks (GANs) [121], language models [123, 124, 125], transformers [126, 127, 128], ensemble methods [106, 129, 130], and more [131, 132] have also been used for scRNA-seq analysis. Though these methods perform well, they rely on careful curation of data and often require a large amount of data for pretraining. Although numerous techniques have been developed, PCA is the most commonly used method for downstream analysis of scRNA-seq data [133]. PCA is a linear dimensionality reduction method whose goal is to compute the principal components as new features that maximize the variance. The first principal component is a feature that maximizes the variance of the projected data, and the ith principal component is orthogonal to the first (i − 1) principal components and maximizes the variance of the projected data [6]. Single-cell consensus clustering (SC3) [129] utilizes PCA and the eigenvectors of the graph Laplacian induced by Euclidean, Pearson, and Spearman distances, and performs a consensus on k-means results obtained from different dimensions using the CSPA algorithm to obtain the final cell clustering result.
CellChat [134] utilizes the low-dimensional representation of scRNA-seq data alongside known interactions between ligands, receptors, and cofactors to predict cell-cell communication, and the user can perform dimensionality reduction prior to utilizing CellChat. DEEPsc [135] is a deep learning method that predicts the probability of a cell belonging to a reference atlas by projecting scRNA-seq data to the PCA space of the reference atlas, which can then be used to predict cell types. The popular package Seurat [136] utilizes supervised PCA (sPCA), which finds the projection that captures the weighted nearest neighbor graph of the reference dataset, for its downstream analysis. In addition to cell clustering, semi-supervised and supervised learning methods have been used to classify cell types according to their reference cells by projecting unknown cells to the PCA space of the reference cells [137, 138]. Utilizing PCA has many advantages, such as computational efficiency and ease of projecting new data onto the principal components. However, PCA lacks concrete interpretability and loses the non-negativity of read-count data. In contrast, the components of NMF are all positive and can be considered as metagenes, where metagenes are linear combinations of the original genes. Nonlinear dimensionality reduction methods, such as UMAP, t-SNE, and Isomap, perform well at low dimensions and can capture the local structure of the data, but they also lack interpretability due to matrix diagonalization. Moreover, both PCA and traditional nonlinear reduction methods are unstable when the data are reduced to higher dimensions, which is unfavorable for machine learning and deep learning tasks that typically require a large number of features. In this dissertation, we apply Correlated Clustering and Projection (CCP) and Topological Nonnegative Matrix Factorization (TNMF) to scRNA-seq data. We found that CCP outperforms PCA in both classification and clustering tasks when the number of components is higher than 50. Additionally, CCP improves both UMAP and t-SNE visualization of scRNA-seq data. Lastly, TNMF outperforms other NMF methods in clustering tasks on scRNA-seq data.

1.5 Outline

The outline of the dissertation is as follows. In Chapter 2, we introduce background information that will be utilized throughout this dissertation, such as evaluation metrics, clustering algorithms, and dimensionality reduction algorithms. In Chapter 3, we introduce a large-scale clustering algorithm, UMAP-assisted k-means, and apply the technique to SARS-CoV-2 single nucleotide polymorphism (SNP) data. Furthermore, we introduce a novel alignment-free DNA sequence analysis method called k-mers topology. In Chapter 4, we introduce Residue-Similarity scores and indexes, which present a novel approach to evaluating classification and clustering problems. In Chapter 5, we introduce Correlated Clustering and Projection (CCP), a nonlinear data-domain dimensionality reduction method, and we compare its performance on standard benchmark data. In Chapter 6, we introduce topological nonnegative matrix factorization (TNMF), a persistent Laplacian (PL) regularized nonnegative matrix factorization (NMF). Compared to the traditional graph Laplacian, PL utilizes filtration to capture multiscale interactions, which a traditional Laplacian cannot. Additionally, PL captures standard topological information and homotopic shape evolution, which gives a comprehensive view of the overall shape of the data.
In Chapter 7, CCP and TNMF are applied to single-cell RNA sequencing (scRNA-seq) data. CCP improves upon other dimensionality reduction methods not only in clustering and classification tasks but also in visualization. Additionally, RS analysis is performed on scRNA-seq data and is shown to be effective for both visualization and performance evaluation. TNMF is shown to improve the adjusted Rand index (ARI), normalized mutual information (NMI), purity, and accuracy (ACC) over other NMF methods for clustering.

BIBLIOGRAPHY

[1] Richard Bellman. Dynamic programming. Science, 153(3731):34–37, 1966.
[2] Gerard V Trunk. A problem of dimensionality: A simple example. IEEE Transactions on pattern analysis and machine intelligence, (3):306–307, 1979.
[3] B Chandrasekaran and Anil K Jain. Quantization complexity and independent measurements. IEEE Transactions on Computers, 100(1):102–106, 1974.
[4] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
[5] George H Dunteman. Principal components analysis. Number 69. Sage, 1989.
[6] Ian T Jolliffe and Jorge Cadima. Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065):20150202, 2016.
[7] Ronald A Fisher. The use of multiple measurements in taxonomic problems. Annals of eugenics, 7(2):179–188, 1936.
[8] Petros Xanthopoulos, Panos M Pardalos, and Theodore B Trafalis. Linear discriminant analysis. In Robust data mining, pages 27–33. Springer, 2013.
[9] Yu-Xiong Wang and Yu-Jin Zhang. Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on knowledge and data engineering, 25(6):1336–1353, 2012.
[10] Ronald R Coifman, Stephane Lafon, Ann B Lee, Mauro Maggioni, Boaz Nadler, Frederick Warner, and Steven W Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proceedings of the national academy of sciences, 102(21):7426–7431, 2005.
[11] Ketson R Dos Santos, Dimitrios G Giovanis, and Michael D Shields. Grassmannian diffusion maps–based dimension reduction and classification for high-dimensional data. SIAM Journal on Scientific Computing, 44(2):B250–B274, 2022.
[12] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural computation, 10(5):1299–1319, 1998.
[13] John W Sammon. A nonlinear mapping for data structure analysis. IEEE Transactions on computers, 100(5):401–409, 1969.
[14] Paul Henderson. Sammon mapping. Pattern Recognit. Lett, 18(11-13):1307–1316, 1997.
[15] Mateus Espadoto, Rafael M Martins, Andreas Kerren, Nina ST Hirata, and Alexandru C Telea. Toward a quantitative survey of dimension reduction techniques. IEEE transactions on visualization and computer graphics, 27(3):2153–2173, 2019.
[16] Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[17] Al Mead. Review of the development of multidimensional scaling methods. Journal of the Royal Statistical Society: Series D (The Statistician), 41(1):27–39, 1992.
[18] Farzana Anowar, Samira Sadaoui, and Bassant Selim. Conceptual and empirical comparison of dimensionality reduction algorithms (pca, kpca, lda, mds, svd, lle, isomap, le, ica, t-sne). Computer Science Review, 40:100378, 2021.
[19] Mikhail Belkin and Partha Niyogi.
Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396, 2003. [20] Geoffrey Hinton and Sam T Roweis. Stochastic neighbor embedding. In NIPS, volume 15, pages 833–840. Citeseer, 2002. [21] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008. [22] Bo Li, Yan-Rui Li, and Xiao-Long Zhang. A survey on laplacian eigenmaps based manifold learning methods. Neurocomputing, 335:336–351, 2019. [23] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018. [24] Jiun-Hung Chen and Linda G Shapiro. Pca vs. tensor-based dimension reduction methods: An empirical comparison on active shape models of organs. In 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pages 5838–5841. IEEE, 2009. [25] Hongcheng Wang and Narendra Ahuja. A tensor approximation approach to dimensionality reduction. International Journal of Computer Vision, 76(3):217–229, 2008. [26] Ledyard R Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966. [27] Xutao Li, Michael K Ng, Gao Cong, Yunming Ye, and Qingyao Wu. Mr-ntd: Manifold regularization nonnegative tucker decomposition for tensor data dimension reduction and representation. IEEE transactions on neural networks and learning systems, 28(8):1787– 1800, 2016. [28] David Lähnemann, Johannes Köster, Ewa Szczurek, Davis J McCarthy, Stephanie C Hicks, 14 Mark D Robinson, Catalina A Vallejos, Kieran R Campbell, Niko Beerenwinkel, Ahmed Mahfouz, et al. Eleven grand challenges in single-cell data science. Genome biology, 21(1):1–35, 2020. [29] Cameron G Williams, Hyun Jae Lee, Takahiro Asatsuma, Roser Vento-Tormo, and Ashra- ful Haque. An introduction to spatial transcriptomics for biomedical research. Genome Medicine, 14(1):68, 2022. [30] Christoph Bock and Thomas Lengauer. Computational epigenetics. Bioinformatics, 24(1):1– 10, 2008. [31] Susana Vinga and Jonas Almeida. Alignment-free sequence comparison—a review. Bioin- formatics, 19(4):513–523, 2003. [32] David Cohen-Steiner, Herbert Edelsbrunner, John Harer, and Yuriy Mileyko. Lipschitz functions have l p-stable persistence. Foundations of computational mathematics, 10(2):127– 139, 2010. [33] Paul Bendich, Herbert Edelsbrunner, and Michael Kerber. Computing robustness and per- sistence for images. IEEE transactions on visualization and computer graphics, 16(6):1251– 1260, 2010. [34] Robert Ghrist. Barcodes: the persistent topology of data. Bulletin of the American Mathe- matical Society, 45(1):61–75, 2008. [35] Gunnar Carlsson. Topology and data. Bulletin of the American Mathematical Society, 46(2):255–308, 2009. [36] Afra Zomorodian and Gunnar Carlsson. Localized homology. Computational Geometry, 41(3):126–148, 2008. [37] Afra Zomorodian and Gunnar Carlsson. Computing persistent homology. In Proceedings of the twentieth annual symposium on Computational geometry, pages 347–356, 2004. [38] Edelsbrunner, Letscher, and Zomorodian. Topological persistence and simplification. Dis- crete & computational geometry, 28:511–533, 2002. [39] Beno Eckmann. Harmonische funktionen und randwertaufgaben in einem komplex. Com- mentarii Mathematici Helvetici, 17(1):240–255, 1944. [40] Danijela Horak and Jürgen Jost. Spectra of combinatorial laplace operators on simplicial complexes. Advances in Mathematics, 244:303–336, 2013. 
[41] Jiahui Chen, Rundong Zhao, Yiying Tong, and Guo-Wei Wei. Evolutionary de rham-hodge method. Discrete and continuous dynamical systems. Series B, 26(7):3785, 2021. 15 [42] Rui Wang, Duc Duy Nguyen, and Guo-Wei Wei. Persistent spectral graph. International journal for numerical methods in biomedical engineering, 36(9):e3376, 2020. [43] Facundo Mémoli, Zhengchao Wan, and Yusu Wang. Persistent laplacians: Properties, algorithms and implications. SIAM Journal on Mathematics of Data Science, 4(2):858–884, 2022. [44] Jian Liu, Jingyan Li, and Jie Wu. The algebraic stability for persistent laplacians. arXiv preprint arXiv:2302.03902, 2023. [45] Xiaoqi Wei and Guo-Wei Wei. Persistent sheaf laplacians. arXiv preprint arXiv:2112.10906, 2021. [46] Rui Wang and Guo-Wei Wei. Persistent path laplacian. Foundations of Data Science, 5:26–55, 2023. [47] Dong Chen, Jian Liu, Jie Wu, and Guo-Wei Wei. Persistent hyperdigraph homology and per- sistent hyperdigraph laplacians. Foundations of Data Science, doi: 10.3934/fods.2023010, 2023. [48] Rui Wang, Rundong Zhao, Emily Ribando-Gros, Jiahui Chen, Yiying Tong, and Guo-Wei Wei. Hermes: Persistent spectral graph software. Foundations of data science (Springfield, Mo.), 3(1):67, 2021. [49] Yuchi Qiu and Guo-Wei Wei. Persistent spectral theory-guided protein engineering. Nature Computational Science, 3(2):149–163, 2023. [50] Jiahui Chen, Yuchi Qiu, Rui Wang, and Guo-Wei Wei. Persistent laplacian projected omicron ba. 4 and ba. 5 to become new dominating variants. Computers in Biology and Medicine, 151:106262, 2022. [51] Zhenyu Meng and Kelin Xia. Persistent spectral–based machine learning (perspect ml) for protein-ligand binding affinity prediction. Science advances, 7(19):eabc5329, 2021. [52] Sean Cottrell, Rui Wang, and Guowei Wei. PLPCA: Persistent Laplacian enhanced- Journal of Chemical Information and Modeling, PCA for microarray data analysis. doi.org/10.1021/acs.jcim.3c01023, 2023. [53] Masatoshi Nei. Phylogenetic analysis in molecular evolutionary genetics. Annual review of genetics, 30(1):371–403, 1996. [54] Stephen F Altschul, Thomas L Madden, Alejandro A Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J Lipman. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic acids research, 25(17):3389–3402, 1997. 16 [55] William R Pearson and David J Lipman. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences, 85(8):2444–2448, 1988. [56] Robert D Finn, Alex Bateman, Jody Clements, Penelope Coggill, Ruth Y Eberhardt, Sean R Eddy, Andreas Heger, Kirstie Hetherington, Liisa Holm, Jaina Mistry, et al. Pfam: the protein families database. Nucleic acids research, 42(D1):D222–D230, 2014. [57] Aaron E Darling, Bob Mau, and Nicole T Perna. progressivemauve: multiple genome alignment with gene gain, loss and rearrangement. PloS one, 5(6):e11147, 2010. [58] Mathieu Blanchette, W James Kent, Cathy Riemer, Laura Elnitski, Arian FA Smit, Kr- ishna M Roskin, Robert Baertsch, Kate Rosenbloom, Hiram Clawson, Eric D Green, et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome research, 14(4):708–715, 2004. [59] Julie D Thompson, Desmond G Higgins, and Toby J Gibson. Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position- specific gap penalties and weight matrix choice. Nucleic acids research, 22(22):4673–4680, 1994. [60] Kazutaka Katoh, Kazuharu Misawa, Kei-ichi Kuma, and Takashi Miyata. 
Mafft: a novel method for rapid multiple sequence alignment based on fast fourier transform. Nucleic acids research, 30(14):3059–3066, 2002. [61] Robert C Edgar. Muscle: multiple sequence alignment with high accuracy and high through- put. Nucleic acids research, 32(5):1792–1797, 2004. [62] Yuta Hozumi, Rui Wang, Changchuan Yin, and Guo-Wei Wei. Umap-assisted k-means clustering of large-scale sars-cov-2 mutation datasets. Computers in biology and medicine, 131:104264, 2021. [63] Jiahui Chen and Guo-Wei Wei. Omicron ba. 2 (b. 1.1. 529.2): high potential for becoming the next dominant variant. The journal of physical chemistry letters, 13(17):3840–3849, 2022. [64] Michael Bleher, Lukas Hahn, Maximilian Neumann, Juan Angel Patino-Galindo, Mathieu Carriere, Ulrich Bauer, Raul Rabadan, and Andreas Ott. Topological data analysis identifies emerging adaptive mutations in sars-cov-2. arXiv preprint arXiv:2106.07292, 2021. [65] Juan Ángel Patiño-Galindo, Ioan Filip, Ratul Chowdhury, Costas D Maranas, Peter K Sorger, Mohammed AlQuraishi, and Raul Rabadan. Recombination and lineage-specific mutations linked to the emergence of sars-cov-2. Genome Medicine, 13:1–14, 2021. [66] Sean R Eddy. Where did the blosum62 alignment score matrix come from? Nature biotechnology, 22(8):1035–1036, 2004. 17 [67] Paul P Gardner, Andreas Wilm, and Stefan Washietl. A benchmark of multiple sequence alignment programs upon structural rnas. Nucleic acids research, 33(8):2433–2439, 2005. [68] Emidio Capriotti and Marc A Marti-Renom. Quantifying the relationship between sequence and three-dimensional structure conservation in rna. BMC bioinformatics, 11:1–10, 2010. [69] Andrzej Zielezinski, Susana Vinga, Jonas Almeida, and Wojciech M Karlowski. Alignment- free sequence comparison: benefits, applications, and tools. Genome biology, 18:1–17, 2017. [70] B Edwin Blaisdell. A measure of the similarity of sets of sequences not requiring sequence alignment. Proceedings of the National Academy of Sciences, 83(14):5155–5159, 1986. [71] Tiee-Jian Wu, John P Burke, and Daniel B Davison. A measure of dna sequence dissimilarity based on mahalanobis distance between frequencies of words. Biometrics, pages 1431–1439, 1997. [72] Tiee-Jian Wu, Ya-Ching Hsieh, and Lung-An Li. Statistical measures of dna sequence dissimilarity under markov chain models of base composition. Biometrics, 57(2):441–448, 2001. [73] Tiee-Jian Wu, Ying-Hsueh Huang, and Lung-An Li. Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between dna sequences. Bioinformat- ics, 21(22):4125–4132, 2005. [74] Ian F Korf and Alan B Rose. Applying word-based algorithms: the imeter. Plant Systems Biology, pages 287–301, 2009. [75] Se-Ran Jun, Gregory E Sims, Guohong A Wu, and Sung-Hou Kim. Whole-proteome phylogeny of prokaryotes by feature frequency profiles: An alignment-free method with optimal feature resolution. Proceedings of the National Academy of Sciences, 107(1):133– 138, 2010. [76] Chenglong Yu, Troy Hernandez, Hui Zheng, Shek-Chung Yau, Hsin-Hsiung Huang, Rong Lucy He, Jie Yang, and Stephen S-T Yau. Real time classification of viruses in 12 dimensions. PloS one, 8(5):e64328, 2013. [77] Mo Deng, Chenglong Yu, Qian Liang, Rong L He, and Stephen S-T Yau. A novel method of characterizing genetic sequences: genome space with biological distance and applications. PloS one, 6(3):e17293, 2011. [78] Igor Ulitsky, David Burstein, Tamir Tuller, and Benny Chor. The average common substring approach to phylogenomic reconstruction. 
Journal of Computational Biology, 13(2):336– 350, 2006. 18 [79] Chris-Andre Leimeister and Burkhard Morgenstern. Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics, 30(14):2000–2008, 2014. [80] Lianping Yang, Xiangde Zhang, Haoyue Fu, and Chenhui Yang. An estimator for local analysis of genome based on the minimal absent word. Journal of Theoretical Biology, 395:23–30, 2016. [81] Lianping Yang, Xiangde Zhang, and Hegui Zhu. Alignment free comparison: similarity distribution between the dna primary sequences based on the shortest absent word. Journal of theoretical biology, 295:125–131, 2012. [82] Susana Vinga. Information theory applications for biological sequence analysis. Briefings in bioinformatics, 15(3):376–389, 2014. [83] H Joel Jeffrey. Chaos game representation of gene structure. Nucleic acids research, 18(8):2163–2170, 1990. [84] Milan Randić, Marjana Novič, and Dejan Plavšić. Milestones in graphical bioinformatics. International Journal of Quantum Chemistry, 113(22):2413–2446, 2013. [85] Pradeep Kumar Burma, Alok Raj, Jayant K Deb, and Samir K Brahmachari. Genome analysis: a new approach for visualization of sequence organization in genomes. Journal of biosciences, 17:395–411, 1992. [86] Jonas S Almeida, Joao A Carrico, Antonio Maretzek, Peter A Noble, and Madilyn Fletcher. Analysis of genomic sequences by chaos game representation. Bioinformatics, 17(5):429– 437, 2001. [87] Patrick J Deschavanne, Alain Giron, Joseph Vilain, Guillaume Fagot, and Bernard Fertil. Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Molecular biology and evolution, 16(10):1391–1399, 1999. [88] Bai-Lin Hao. Fractals from genomes–exact solutions of a biology-inspired problem. Physica A: Statistical Mechanics and its Applications, 282(1-2):225–246, 2000. [89] Tung Hoang, Changchuan Yin, Hui Zheng, Chenglong Yu, Rong Lucy He, and Stephen S-T Yau. A new method to cluster dna sequences using fourier power spectrum. Journal of theoretical biology, 372:135–145, 2015. [90] Changchuan Yin, Ying Chen, and Stephen S-T Yau. A measure of dna sequence similarity by fourier transform with applications on hierarchical clustering. Journal of theoretical biology, 359:18–28, 2014. [91] Ajay Kumar Saw, Garima Raj, Manashi Das, Narayan Chandra Talukdar, Binod Chandra 19 Tripathy, and Soumyadeep Nandi. Alignment-free method for dna sequence clustering using fuzzy integral similarity. Scientific reports, 9(1):3753, 2019. [92] Chenglong Yu, Qian Liang, Changchuan Yin, Rong L He, and Stephen S-T Yau. A novel construction of genome space with biological geometry. DNA research, 17(3):155–168, 2010. [93] Byungjin Hwang, Ji Hyun Lee, and Duhee Bang. Single-cell rna sequencing technologies and bioinformatics pipelines. Experimental & molecular medicine, 50(8):1–14, 2018. [94] Tallulah S Andrews, Vladimir Yu Kiselev, Davis McCarthy, and Martin Hemberg. Tutorial: guidelines for the computational analysis of single-cell rna sequencing data. Nature protocols, 16(1):1–9, 2021. [95] Malte D Luecken and Fabian J Theis. Current best practices in single-cell rna-seq analysis: a tutorial. Molecular systems biology, 15(6):e8746, 2019. [96] Geng Chen, Baitang Ning, and Tieliu Shi. Single-cell rna-seq technologies and related computational data analysis. Frontiers in genetics, page 317, 2019. [97] Raphael Petegrosso, Zhuliu Li, and Rui Kuang. 
Machine learning and statistical methods for clustering single-cell rna-sequencing data. Briefings in bioinformatics, 21(4):1209–1223, 2020. [98] Wei Vivian Li and Jingyi Jessica Li. A statistical simulator scdesign for rational scrna-seq experimental design. Bioinformatics, 35(14):i41–i50, 2019. [99] Ruiqing Zheng, Min Li, Zhenlan Liang, Fang-Xiang Wu, Yi Pan, and Jianxin Wang. Sinnlrr: a robust subspace clustering method for cell type detection by non-negative and low-rank representation. Bioinformatics, 35(19):3642–3650, 2019. [100] Zhenqiu Shu, Qinghan Long, Luping Zhang, Zhengtao Yu, and Xiao-Jun Wu. Robust graph regularized nmf with dissimilarity and similarity constraints for scrna-seq data clustering. Journal of Chemical Information and Modeling, 62(23):6271–6286, 2022. [101] Peng Wu, Mo An, Hai-Ren Zou, Cai-Ying Zhong, Wei Wang, and Chang-Peng Wu. A robust semi-supervised nmf model for single cell rna-seq data. PeerJ, 8:e10091, 2020. [102] Jianwei Chen. Detecting cell type from single cell rna sequencing based on deep bi-stochastic graph regularized matrix factorization. bioRxiv, 2022. [103] Qiu Xiao, Jiawei Luo, Cheng Liang, Jie Cai, and Pingjian Ding. A graph regularized non- negative matrix factorization method for identifying microrna-disease associations. Bioin- formatics, 34(2):239–248, 2018. 20 [104] Na Yu, Ying-Lian Gao, Jin-Xing Liu, Juan Wang, and Junliang Shang. Robust hypergraph regularized non-negative matrix factorization for sample clustering and feature selection in multi-view gene expression data. Human genomics, 13(1):1–10, 2019. [105] Jin-Xing Liu, Dong Wang, Ying-Lian Gao, Chun-Hou Zheng, Jun-Liang Shang, Feng Liu, and Yong Xu. A joint-l2, 1-norm-constraint-based semi-supervised feature extraction for rna-seq data analysis. Neurocomputing, 228:263–269, 2017. [106] Bo Wang, Junjie Zhu, Emma Pierson, Daniele Ramazzotti, and Serafim Batzoglou. Visual- ization and analysis of single-cell rna-seq data by kernel-based similarity learning. Nature methods, 14(4):414–416, 2017. [107] Shuangge Ma and Ying Dai. Principal component analysis based methods in bioinformatics studies. Briefings in bioinformatics, 12(6):714–722, 2011. [108] Seyoung Park and Hongyu Zhao. Sparse principal component analysis with missing obser- vations. The Annals of Applied Statistics, 13(2):1016–1042, 2019. [109] F William Townes, Stephanie C Hicks, Martin J Aryee, and Rafael A Irizarry. Feature selection and dimension reduction for single-cell rna-seq based on a multinomial model. Genome biology, 20:1–16, 2019. [110] Dmitry Kobak and Philipp Berens. The art of using t-sne for single-cell transcriptomics. Nature communications, 10(1):1–14, 2019. [111] Romain Lopez, Jeffrey Regier, Michael B Cole, Michael I Jordan, and Nir Yosef. Deep generative modeling for single-cell transcriptomics. Nature methods, 15(12):1053–1058, 2018. [112] Carlos Torroja and Fatima Sanchez-Cabo. Digitaldlsorter: deep-learning on scrna-seq to deconvolute gene expression data. Frontiers in Genetics, 10:978, 2019. [113] Musu Yuan, Liang Chen, and Minghua Deng. scmra: a robust deep learning method to annotate scrna-seq data with multiple reference datasets. Bioinformatics, 38(3):738–745, 2022. [114] Zixiang Luo, Chenyu Xu, Zhen Zhang, and Wenfei Jin. A topology-preserving dimensional- ity reduction method for single-cell rna-seq data using graph autoencoder. Scientific reports, 11(1):20028, 2021. [115] Dongfang Wang and Jin Gu. 
Vasc: dimension reduction and visualization of single-cell rna-seq data by deep variational autoencoder. Genomics, proteomics & bioinformatics, 16(5):320–331, 2018. [116] Eugene Lin, Sudipto Mukherjee, and Sreeram Kannan. A deep adversarial variational 21 autoencoder model for dimensionality reduction in single-cell rna sequencing analysis. BMC bioinformatics, 21(1):1–11, 2020. [117] Hao Wu, Haoru Zhou, Bing Zhou, and Meili Wang. Scmcluster: a high-precision cell clus- tering algorithm integrating marker gene set with single-cell rna sequencing data. Briefings in Functional Genomics, page elad004, 2023. [118] Junlin Xu, Jielin Xu, Yajie Meng, Changcheng Lu, Lijun Cai, Xiangxiang Zeng, Ruth Nussi- nov, and Feixiong Cheng. Graph embedding and gaussian mixture variational autoencoder network for end-to-end analysis of single-cell rna sequencing data. Cell Reports methods, 3(1), 2023. [119] Mehrshad Sadria and Anita Layton. The power of two: integrating deep diffusion models and variational autoencoders for single-cell transcriptomics analysis. bioRxiv, pages 2023–04, 2023. [120] Alessandro Palma, Fabian J Theis, and Mohammad Lotfollahi. Predicting cell morphological responses to perturbations using generative modeling. bioRxiv, pages 2023–07, 2023. [121] Valentina Giansanti, Francesca Giannese, Oronza A Botrugno, Giorgia Gandolfi, Chiara Balestrieri, Marco Antoniotti, Giovanni Tonon, and Davide Cittaro. Scalable integration of multiomic single cell data using generative adversarial networks. bioRxiv, pages 2023–06, 2023. [122] Julius B Kirkegaard. Spontanously breaking of symmetry in overlapping cell instance segmentation using diffusion models. bioRxiv, pages 2023–07, 2023. [123] Hongru Shen, Jilei Liu, Jiani Hu, Xilin Shen, Chao Zhang, Dan Wu, Mengyao Feng, Meng Yang, Yang Li, Yichen Yang, et al. Generative pretraining from large-scale transcriptomes for single-cell deciphering. Iscience, 26(5), 2023. [124] William Connell, Umair Khan, and Michael J Keiser. A single-cell gene expression language model. arXiv preprint arXiv:2210.14330, 2022. [125] Fan Yang, Wenchuan Wang, Fang Wang, Yuan Fang, Duyu Tang, Junzhou Huang, Hui Lu, and Jianhua Yao. scbert as a large-scale pretrained deep language model for cell type annotation of single-cell rna-seq data. Nature Machine Intelligence, 4(10):852–866, 2022. [126] Jiawei Chen, Hao Xu, Wanyu Tao, Zhaoxiong Chen, Yuxuan Zhao, and Jing-Dong J Han. Transformer for one stop interpretable cell type annotation. Nature Communications, 14(1):223, 2023. [127] Jing Xu, Aidi Zhang, Fang Liu, Liang Chen, and Xiujun Zhang. Ciform as a transformer- based model for cell-type annotation of large-scale single-cell rna-seq data. Briefings in Bioinformatics, page bbad195, 2023. 22 [128] Linfang Jiao, Gan Wang, Huanhuan Dai, Xue Li, Shuang Wang, and Tao Song. sctranssort: Transformers for intelligent annotation of cell types by gene embeddings. Biomolecules, 13(4):611, 2023. [129] Vladimir Yu Kiselev, Kristina Kirschner, Michael T Schaub, Tallulah Andrews, Andrew Yiu, Tamir Chandra, Kedar N Natarajan, Wolf Reik, Mauricio Barahona, Anthony R Green, et al. Sc3: consensus clustering of single-cell rna-seq data. Nature methods, 14(5):483–486, 2017. [130] Pengyu Zhang, Hongming Zhang, and Hao Wu. ipro-wael: a comprehensive and ro- bust framework for identifying promoters in multiple species. Nucleic Acids Research, 50(18):10278–10289, 2022. [131] Pengyu Zhang, Yingfu Wu, Haoru Zhou, Bing Zhou, Hongming Zhang, and Hao Wu. 
Clnn-loop: a deep learning model to predict ctcf-mediated chromatin loops in the different cell lines and ctcf-binding sites (cbs) pair types. Bioinformatics, 38(19):4497–4504, 2022.

[132] Pengyu Zhang and Hao Wu. Ichrom-deep: An attention-based deep learning model for identifying chromatin interactions. IEEE Journal of Biomedical and Health Informatics, 2023.

[133] Heather J Zhou, Lei Li, Yumei Li, Wei Li, and Jingyi Jessica Li. Pca outperforms popular hidden variable inference methods for molecular qtl mapping. Genome Biology, 23(1):1–17, 2022.

[134] Suoqin Jin, Christian F Guerrero-Juarez, Lihua Zhang, Ivan Chang, Raul Ramos, Chen-Hsiang Kuan, Peggy Myung, Maksim V Plikus, and Qing Nie. Inference and analysis of cell-cell communication using cellchat. Nature communications, 12(1):1088, 2021.

[135] Floyd Maseda, Zixuan Cang, and Qing Nie. Deepsc: a deep learning-based map connecting single-cell transcriptomics and spatial imaging data. Frontiers in Genetics, 12:636743, 2021.

[136] Yuhan Hao, Stephanie Hao, Erica Andersen-Nissen, William M Mauck, Shiwei Zheng, Andrew Butler, Maddie J Lee, Aaron J Wilk, Charlotte Darby, Michael Zager, et al. Integrated analysis of multimodal single-cell data. Cell, 184(13):3573–3587, 2021.

[137] Hannah A Pliner, Jay Shendure, and Cole Trapnell. Supervised classification enables rapid annotation of cell atlases. Nature methods, 16(10):983–986, 2019.

[138] Ze Zhang, Danni Luo, Xue Zhong, Jin Huk Choi, Yuanqing Ma, Stacy Wang, Elena Mahrt, Wei Guo, Eric W Stawiski, Zora Modrusan, et al. Scina: a semi-supervised subtyping algorithm of single cells and bulk samples. Genes, 10(7):531, 2019.

CHAPTER 2

BACKGROUND

2.1 Overview of Dimensionality Reduction

2.1.1 Principal Component Analysis

Principal component analysis (PCA) is one of the most commonly used dimensional reduction techniques for the exploratory analysis of high-dimensional data [1]. The first component, termed the principal component, is the direction that maximizes the variance, while the subsequent components are orthogonal to the earlier ones. Let X = {x_i}_{i=1}^N be the input dataset, with N being the number of samples or data points. For each x_i, let x_i ∈ R^M, where M is the number of features, or the data dimension. PCA seeks to find a linear combination of the columns of X with maximum variance,
\[ \sum_{j=1}^{n} a_j x_j = Xa, \tag{2.1} \]
where a_1, a_2, ..., a_n are constants and a is the vector collecting a_1, a_2, ..., a_n. The variance of this linear combination is defined as
\[ \mathrm{var}(Xa) = a^T S a, \tag{2.2} \]
where S is the covariance matrix of the dataset; in practice, the principal directions are obtained from the eigendecomposition of this covariance matrix. The maximum variance can be computed iteratively using Rayleigh's quotient,
\[ a^{(1)} = \arg\max_{a} \frac{a^T X^T X a}{a^T a}. \tag{2.3} \]
The subsequent components can be computed by maximizing the variance of
\[ \hat{X}_k = X - \sum_{j=1}^{k-1} X a_j a_j^T, \tag{2.4} \]
where k represents the kth principal component. Here, the first k − 1 principal components are subtracted from the original matrix X. Therefore, the complexity of the method scales linearly with the number of components one seeks to find. In applications, we hope that the first few components give rise to a good PCA representation of the original data matrix X.

2.1.2 Nonnegative matrix factorization

Nonnegative matrix factorization (NMF) decomposes the original data matrix into two submatrices that are both nonnegative. Unlike PCA, which imposes orthogonality on its components, NMF instead requires the original matrix and its factors to be nonnegative [2]. More formally, NMF solves the following optimization problem
\[ \min_{W,H} \|X - WH\|_F^2, \quad \text{s.t.}\ W, H \ge 0, \tag{2.5} \]
where \(\|A\|_F^2 = \sum_{i,j} a_{ij}^2\) is the Frobenius norm. Lee et al. [3] proposed a multiplicative updating scheme, which preserves nonnegativity. For the (t + 1)-th iteration,
\[ w_{ij}^{t+1} = w_{ij}^{t}\, \frac{(XH^T)_{ij}}{(WHH^T)_{ij}}, \tag{2.6} \]
\[ h_{ij}^{t+1} = h_{ij}^{t}\, \frac{(W^T X)_{ij}}{(W^T W H)_{ij}}. \tag{2.7} \]
Note that the objective is convex in W with H fixed and convex in H with W fixed, but it is not jointly convex in W and H. (A brief code sketch of these multiplicative updates is given at the end of Section 2.1.4.)

2.1.3 T-Distributed Stochastic Neighbor Embedding

The t-distributed stochastic neighbor embedding (t-SNE) is a nonlinear dimensional reduction algorithm that is well suited for reducing high-dimensional data into two- or three-dimensional space. There are two main stages in t-SNE. First, it constructs a probability distribution over pairs of data points such that nearby points are assigned a high probability, while distant points are given a low probability. Second, t-SNE defines a probability distribution in the embedded space that is similar to that in the original high-dimensional space, and aims to minimize the Kullback-Leibler (KL) divergence between them [4]. Let {x_1, x_2, ..., x_N | x_i ∈ R^M} be a high-dimensional input dataset. Our goal is to find an optimal low-dimensional representation {y_1, ..., y_N | y_i ∈ R^k}, such that k << M. The first step in t-SNE is to compute a pairwise distribution between x_i and x_j, denoted p_ij. First, we find the conditional probability of x_j given x_i:
\[ p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{m \neq i} \exp\left(-\|x_i - x_m\|^2 / 2\sigma_i^2\right)}, \quad i \neq j, \tag{2.8} \]
setting p_{i|i} = 0; the denominator normalizes the probability. Here, σ_i is a point-dependent bandwidth, chosen so that the conditional distribution matches a predefined hyperparameter called the perplexity; a smaller σ_i is used in denser regions of the dataset. Notice that the conditional probabilities are not symmetric in general, i.e., p_{i|j} ≠ p_{j|i}, which motivates the symmetrized pairwise probability
\[ p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}. \tag{2.9} \]
In the second step, we learn a k-dimensional embedding {y_1, ..., y_N | y_i ∈ R^k}. To this end, t-SNE calculates a similar probability distribution q_ij defined as
\[ q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{m} \sum_{l \neq m} \left(1 + \|y_m - y_l\|^2\right)^{-1}}, \quad i \neq j, \tag{2.10} \]
setting q_{ii} = 0. Finally, the low-dimensional embedding {y_1, ..., y_N | y_i ∈ R^k} is found by minimizing the KL divergence via a standard gradient descent method,
\[ KL(P \| Q) = \sum_{i,j} p_{ij} \log \frac{p_{ij}}{q_{ij}}, \tag{2.11} \]
where P and Q are the distributions of p_ij and q_ij, respectively. Note that the probability distributions in Equation 2.8 and Equation 2.10 can be replaced by many other delta sequence kernels of positive type [5].

2.1.4 Uniform Manifold Approximation and Projection

Uniform manifold approximation and projection (UMAP) is a nonlinear dimensional reduction method built on three assumptions: the data is uniformly distributed on a Riemannian manifold, the Riemannian metric is locally constant, and the manifold is locally connected. Unlike t-SNE, which relies on a probabilistic model, UMAP is a graph-based algorithm. The essential idea is to create a predefined k-dimensional weighted UMAP graph representation of the original high-dimensional data points, with the aim of minimizing the edge-wise cross-entropy between the weighted graph and the original data. Finally, the k-dimensional eigenvectors of the UMAP graph are used to represent each of the original data points. In this section, a computational view of UMAP is presented. For a more theoretical account, the reader is referred to Ref. [6]. Similar to t-SNE, UMAP considers the input data X = {x_1, x_2, ..., x_N}, x_i ∈ R^M, and looks for an optimal low-dimensional representation {y_1, ..., y_N | y_i ∈ R^k}, such that k < M.
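Before detailing the UMAP graph construction, it is worth pausing on the NMF updates of Eqs. (2.6)–(2.7), which are simple enough to sketch directly. The snippet below is a minimal NumPy illustration, not the implementation used later in this dissertation; the data layout (samples as columns), the number of iterations, and the small constant eps added to the denominators are illustrative assumptions.

```python
import numpy as np

def nmf_multiplicative(X, r, n_iter=200, eps=1e-10, seed=0):
    """Minimal NMF sketch following the multiplicative updates (2.6)-(2.7).

    X : (M, N) nonnegative data matrix, r : target rank.
    Returns W (M, r) and H (r, N) with X ~ W @ H.
    """
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.random((M, r))
    H = rng.random((r, N))
    for _ in range(n_iter):
        # Eq. (2.6): elementwise update of W; eps avoids division by zero.
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        # Eq. (2.7): elementwise update of H.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
    return W, H

# Usage: factor a random nonnegative matrix and check the reconstruction error.
X = np.abs(np.random.default_rng(1).normal(size=(50, 30)))
W, H = nmf_multiplicative(X, r=5)
print(np.linalg.norm(X - W @ H, "fro"))
```

Returning to UMAP, the construction proceeds in two stages.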
The first stage is the 26 construction of weighted k-neighbor graphs. Define a metric d : X × X → R+. Let k << M be a hyperparemeter, and compute the k-nearest neighbors of each xi under a given metric d. For each xi, let where σi is defined via ρi = min{d(xi, xj)|1 ≤ j ≤ k, d(xi, xj) > 0} k ∑︁ j=1 exp (cid:18) – max(0, d(xi, xj) – ρi) σi (cid:19) = log2 k. (2.12) (2.13) Such choice of ρi ensure at least one data point is connected to xi and having edge weight of 1, and set σi as a scaling parameter. Then, define a weighted directed graph ¯G = (V, E, ω), where V is the set of vertices (in this case, the data X), E is the set of edges E = {(xi, xj)|1 ≤ j ≤ k, 1 ≤ i ≤ N}, and ω is the weight for edges ω(xi, xj) = exp (cid:18) – max(0, d(xi, xj) – ρi) σi (cid:19) . (2.14) UMAP tries to define an undirected weighted graph G from directed graph ¯G via symmetrization. Let A be the adjacency matrix of the graph ¯G. A symmetric matrix can be obtained B = A + AT – A ⊗ AT , (2.15) where T is the transpose and ⊗ denotes the Hadamard product. Then, the undirected weighted Laplacian G (the UMAP graph) is defined by its adjacency matrix B. In its realization, UMAP evolves an equivalent weighted graph H with a set of points {yi}N i=1, utilizing attractive and repulsive forces. The attractive and repulsive forces at coordinate yi and yj are given by ω(xi, xj)(yi – yj), and –2ab∥yi – yj ∥2(b–1) 2 1 + ∥yi – yj ∥2 2 2b 2)(1 + a∥yi – yj ∥2b 2 ) (ε + ∥yi – yj ∥2 (1 – ω(xi, xj))(yi – yj) (2.16) (2.17) where a, b are hyperparemeters, and ε is taken to be a small value such that the denominator does i=1, yi ∈ Rk, that not become 0. The goal is to find the optimal low-dimensional coordinates {yi}N 27 minimizes the edge-wise cross entropy with the original data at each point. The evolution of the UMAP graph Laplacian G can be regarded as a discrete approximation of the Laplace-Beltrami operator on a manifold defined by the data [7]. Implementation and further detail of UMAP can be found in Ref. [6]. UMAP may not work well if the data points are non-uniform. If part of the data points have k important neighbors while other part of the data points have k′ >> k important neighbors, the k-dimensional UMAP will not work efficiently. Currently, there is no algorithm to automatically determine the critic minimal kmin for a given dataset. Additionally, weights ω(xi, xj) and force terms can be replaced by other functions that are easier to evaluate [5]. The metric d can be selected as Euclidean distance, Manhattan distance, Minkowski distance, and Chebyshev distance, depending on applications. 2.2 Clustering Algorithm 2.2.1 K-means Clustering K-means is the most commonly used unsupervised clustering method, where the aim is to partition {xj|xj ∈ Rm, 1 ≤ j ≤ N} into K clusters {C1, ..., Ck}, where K << N [8]. K-means begins by randomly selecting K points to be the cluster centers, called centroids. Here, centroids are denoted as µ1, ..., µk, ..., µK, and the centroid µk is the center of cluster Ck. Then, each sample is assigned to the nearest centroid, forming the initial clusters. Centroids are then updated by minimizing the within-cluster sum of squares (WCSS), which is defined as: which gives the updating scheme K ∑︁ ∑︁ k=1 xj∈Ck ∥xj – µk ∥2, µk = 1 |Ck| ∑︁ xj∈Ck xj. (2.18) This process is repeated until convergence or until the maximal number of iterations is reached. 28 2.2.2 K-medoids Clustering K-medoids clustering is similar to k-means clustering. 
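The K-means loop just described is short enough to sketch directly, which also makes the contrast with K-medoids easier to see. The snippet below is a minimal NumPy sketch, not the implementation used in this dissertation: it performs random initialization, nearest-centroid assignment, and the centroid update of Eq. (2.18); the tolerance tol is an illustrative stopping criterion rather than part of the algorithm's definition.

```python
import numpy as np

def kmeans(X, K, n_iter=100, tol=1e-6, seed=0):
    """Minimal K-means sketch: assign each sample to its nearest centroid,
    then update each centroid as the mean of its cluster (Eq. 2.18)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: distance of every sample to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: centroid = mean of the samples assigned to it
        # (keep the old centroid if a cluster becomes empty).
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids
```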
The main difference is that in k-means clustering, the centroid, or center of the cluster, is not necessarily a sample of the data, because the centroid is computed by averaging the samples in the cluster. K-medoids, on the other hand, chooses the medoid, or 'cluster center', to be a sample of the data [9]. For a pre-selected K, we begin by randomly selecting K medoids {m_k}_{k=1}^K and assigning each vector to its nearest medoid, which gives rise to the initial partition {C_k}_{k=1}^K. Second, we take the vector closest to the center of the kth partition C_k as the new medoid m_k ∈ C_k. We then reassign each vector to its nearest medoid, resulting in a new partition {C_k}_{k=1}^K that decreases the loss function, or accumulated distance. The process is repeated until {C_k}_{k=1}^K is optimized with respect to a specific distance definition,
\[ \arg\min_{\{C_1, \ldots, C_K\}} \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - m_k\|, \tag{2.19} \]
where ∥·∥ is some distance. Any distance can be used in k-medoids, such as the Euclidean, Manhattan, covariance, or correlation distance.

2.3 Evaluation metrics for clustering and classification

2.3.1 Adjusted Rand Index (ARI)

The Adjusted Rand Index (ARI) measures the agreement between two clustering results by considering all pairs of data points and comparing their assignments in the two results [10]. Let {T_i}_{i=1}^p and {S_j}_{j=1}^q be two partitionings of the data. Often, T_i are the clustering results and S_j are the true labels. Let n_ij = |T_i ∩ S_j| be the number of samples that belong to cluster label i and true label j, and define a_i = Σ_j n_ij and b_j = Σ_i n_ij. Then, the ARI is defined as
\[ \mathrm{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \Big/ \binom{N}{2}}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \Big/ \binom{N}{2}}. \tag{2.20} \]
The ARI takes on a value between -0.5 and 1, where 1 indicates a perfect match between the two clusterings and 0 corresponds to a completely random assignment of labels.

2.3.2 Normalized Mutual Information (NMI)

The normalized mutual information (NMI) measures the mutual information between two clustering results and normalizes it according to cluster size [11]. We fix the true labels Y as one of the clustering results and use the predicted labels C as the other to calculate NMI. The NMI is calculated as
\[ \mathrm{NMI} = \frac{2\, I(Y; C)}{H(Y) + H(C)}, \tag{2.21} \]
where H(·) is the entropy and I(Y; C) is the mutual information between the true labels Y and the predicted labels C. NMI ranges from 0 to 1, where 1 indicates perfect mutual information between the two sets of labels.

2.3.3 Accuracy (ACC)

Accuracy (ACC) calculates the percentage of correctly predicted class labels.
The accuracy is given by
\[ \mathrm{ACC} = \frac{1}{N} \sum_{i=1}^{N} \delta\left(y_i, f(c_i)\right), \tag{2.22} \]
where δ(a, b) is the indicator function; that is, δ(a, b) = 1 if a = b, and 0 otherwise. Here f : C → Y maps the cluster labels to the true labels, where the mapping is the optimal permutation between cluster labels and true labels obtained from the Hungarian algorithm [12].

2.3.4 Purity

Let {C_i}_{i=1}^p be the cluster labels and {Y_j}_{j=1}^q be the true labels. For the purity calculation, each predicted label C_i is assigned to the true label Y_j such that |C_i ∩ Y_j| is maximized [13]. Taking the average over all predicted labels, we obtain
\[ \mathrm{Purity} = \frac{1}{N} \sum_{i=1}^{p} \max_{j} |C_i \cap Y_j|. \tag{2.23} \]
Note that, unlike accuracy, purity does not map the predicted labels to the true labels.

2.3.5 Silhouette Score

The Silhouette score is a common metric used to determine the number of clusters for k-means clustering, and it assumes that the clusters are well separated [14]. Let x_i ∈ A ⊂ X be a sample in cluster A from the clustering algorithm. We define a(x_i) as the dissimilarity of sample x_i to all the other samples of A, that is,
\[ a(x_i) = \frac{1}{|A|} \sum_{x_j \in A} \|x_i - x_j\|, \tag{2.24} \]
where |A| is the number of samples in cluster A. Now, let C ≠ A be another cluster from the clustering algorithm. Then b(x_i) is the minimum average distance from x_i to the closest cluster that is not A, that is,
\[ b(x_i) = \min_{C \neq A} \frac{1}{|C|} \sum_{x_j \in C} \|x_i - x_j\|. \tag{2.25} \]
The score for sample x_i, denoted s(x_i), is then obtained as
\[ s(x_i) = \begin{cases} 1 - a(x_i)/b(x_i), & a(x_i) < b(x_i) \\ 0, & a(x_i) = b(x_i) \\ b(x_i)/a(x_i) - 1, & a(x_i) > b(x_i). \end{cases} \tag{2.26} \]
One can then define the Silhouette score S as the average of s(x_i) over all samples in the dataset. The Silhouette score is bounded between -1 and 1, where -1 indicates poor clustering, 0 indicates overlapping clusters, and 1 indicates well-separated clusters.

2.3.6 Balanced Accuracy (BA)

Accuracy is utilized to measure the quality of a classification algorithm. For data with unbalanced class sizes, standard accuracy does not give a meaningful result; therefore, balanced accuracy (BA) is used [15], which computes the accuracy of each class separately and reports the average accuracy over the classes. Let y_i and ŷ_i be the true and predicted labels of the i-th sample, respectively. Then, the balanced accuracy score is defined as
\[ \mathrm{BA} = \frac{1}{\sum_i \omega_i} \sum_i 1_{\{y_i = \hat{y}_i\}}\, \omega_i, \tag{2.27} \]
where ω_i is the sample weight, given as \( \omega_i = 1 \big/ \sum_j 1_{\{y_j = y_i\}} \), and 1_{*} is the indicator function. A brief code sketch of these metrics is given following Figure 2.1 below.

2.4 Topological Data Analysis

2.4.1 Simplex

Let {v_0, v_1, ..., v_q} be a set of points in R^n. An affine combination is defined as
\[ v = \sum_{i=0}^{q} \lambda_i v_i, \quad \text{such that} \quad \sum_{i=0}^{q} \lambda_i = 1. \tag{2.28} \]
The affine hull is the set of all affine combinations. The q + 1 points {v_i}_{i=0}^q are affinely independent if v_1 − v_0, ..., v_q − v_0 are linearly independent. A q-plane is well defined if its q + 1 points are affinely independent. For example, in R^n there are at most n linearly independent vectors; therefore, there are at most n + 1 affinely independent points. Additionally, an affine combination, as defined in Equation 2.28, is a convex combination if we add the additional condition λ_i ≥ 0 for all i. Furthermore, the convex hull is the set of all convex combinations. A q-simplex, denoted σ_q, is the convex hull of q + 1 affinely independent points, with dimension dim(σ_q) = q. σ_0 is a vertex, σ_1 is an edge, σ_2 is a triangle, σ_3 is a tetrahedron, and so on; an example is shown in Figure 2.1.

Figure 2.1 Illustration of (a) 0-simplex or vertex, (b) 1-simplex or edge, (c) 2-simplex or triangle, and (d) 3-simplex or tetrahedron.
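As referenced at the end of Section 2.3, most of the metrics above are available in standard libraries; only the clustering accuracy of Eq. (2.22) needs the Hungarian matching written out. The sketch below is a minimal illustration assuming scikit-learn and SciPy are installed; it is not the evaluation code used in later chapters, and the toy labels are purely illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_accuracy(true_labels, cluster_labels):
    """ACC of Eq. (2.22): map cluster labels to true labels with the
    Hungarian algorithm, then count correct assignments."""
    true_ids = np.unique(true_labels)
    cluster_ids = np.unique(cluster_labels)
    # Contingency table n_ij = |cluster i intersect true class j|.
    cont = np.array([[np.sum((cluster_labels == c) & (true_labels == t))
                      for t in true_ids] for c in cluster_ids])
    row, col = linear_sum_assignment(-cont)   # maximize matched counts
    return cont[row, col].sum() / len(true_labels)

def purity(true_labels, cluster_labels):
    """Purity of Eq. (2.23): each cluster votes for its majority class."""
    total = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        total += np.bincount(members).max()
    return total / len(true_labels)

# Example usage on toy labels (three classes, permuted cluster ids).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])
print(clustering_accuracy(y_true, y_pred), purity(y_true, y_pred))
print(adjusted_rand_score(y_true, y_pred),
      normalized_mutual_info_score(y_true, y_pred))
```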
2.4.2 Simplicial Complex Let σq = [v0, v1, ..., vq] be the q-simplex, where vi is a vertex. A simplicial complex K is a collection of simplicies in Rn satisfying the following conditions 1. If σq ∈ K and σp is a face of σq, then σp ∈ K 32 2. The nonempy intersection of any 2 simplicies σq, σp ∈ K is a face of both σq and σp In essence, one can think of K as gluing together lower order simplicies together. Each element σq ∈ K is a q-simplex of K, and the dimension of K is defined as dim(K) = max{dim(σq) : σq ∈ K}. Alternatively, for q dimensional K, one can define simplicial complex as K = {σq|σq = n ∑︁ i=0 λivi, λi ≥ 0, n ∑︁ 0 λi = 1} (2.29) 2.4.3 Chain complex A q-chain is a formal sum of q-simplicies in a simplicial complex K with coefficient Z2 ∈ {0, 1}. The set of all q-chains contains the basis for the set of q-simplicies in K. Such set forms a finitely generating free Abelian group Cq(K). We can relate the chain groups by a boundary operator, which is a group homomorphism ∂q : Cq(K) → Cq–1(K). This boundary operator is defined as q ∑︁ ∂qσq : (–1)iσi q–1 (2.30) i=0 where σi q = [v0, v1, ..., v∗ i , ..., vq] be a (q – 1)-simplex with the vertex vi removed. We can then define a chain complex as a sequence of chain groups connected by the boundary operator. ∂q+2 −−−−→ Cq+1(K) ∂q+1 −−−−→ Cq(K) ∂q −−→ ... ... (2.31) By utilizing the boundary operators, we can define the qth cycle group Zq and the qth boundary group Bq. Both Zq and Bq are subgroups of the qth chain group Cq, and are defined as Zq = Ker∂q = {c ∈ Cq|∂qc = ∅} Bq = Im∂k+1 = {c ∈ Cq|∃d ∈ Cq+1 : c = ∂q+1d} (2.32) (2.33) Additionally, ∂q–1∂q = implies that Bq ⊆ Zq ⊆ Cq. Moreover, the qth cycle is the q-dimensional hole. The, the q-homology group Hq is defined to be a quotient group of the q-cycle group modulo the q-boundary group, ie Hq = Zq/Bq 33 (2.34) The rank of the the Hq is the qth Betti number, denoted as βq. βq = rank(Hq) = rank(Zq) – rank(Bq) (2.35) The qth Betti number describes the qth dimensional hole. For example, β0 is the number of connected components, β1 is the number of loops, β2 is the number of cavity, and so on. Furthermore, the Betti numbers describe the topological property of the system. An example of Betti numbers of vertex, circle, sphere and torus can be found in Table 2.1 Table 2.1 Betti numbers of vertex, circle, sphere and torus. β0 β1 β2 1 0 0 1 1 0 1 0 1 1 2 1 2.4.4 Computational tools for constructing simplcial complex In this section, we describe Vietoris-Rips (VR) complex and alpha complex. 2.4.4.1 Vietoris-Rips complex VR complex is construncted from the 1-skeleton induced by pairwise distances of the point cloud. Let d(x, y) be dome distance, and ε be a threshold distance. VR complex of a finite point cloud X is given by VRε(X) = {σ ⊆ X|d(u, v) ≤ ε, ∀u ≠ v ∈ σ} (2.36) The Euclidean distance is often used, but other distance can also be used depending of the ap- plication. Additionally, because VR complex does not rely on the exact geometry, other abstract distance that do not satisfy the triangle ineqaulity can be used, such as cosine distance, correlation 34 distance and kernel induced distance. d(x, y) = 1 – x · y ∥x∥2∥y∥2 cov(x, y) σxσy d(x, y) = 1 – e–∥x–y∥2 2/τ d(x, y) = 1 – (2.37) Figure 2.2 shows a comparison of VR complex using the standard Eucliden distance and the kernel-induced distance. We can see that the kernel-induced distance has additional H1 barcode. 
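A comparison of this kind can be reproduced with the Gudhi library, which is also used later in this chapter for the barcodes. The sketch below builds Vietoris-Rips complexes from a Euclidean and a kernel-induced distance matrix following Eq. (2.37); the random point cloud and the kernel scale τ are illustrative choices, not data from this dissertation.

```python
import numpy as np
import gudhi
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
points = rng.normal(size=(30, 3))          # illustrative point cloud

# Euclidean pairwise distances.
d_euc = squareform(pdist(points))

# Kernel-induced distance d(x, y) = 1 - exp(-||x - y||^2 / tau), Eq. (2.37).
tau = 2.0                                  # illustrative kernel scale
d_ker = 1.0 - np.exp(-d_euc**2 / tau)

for name, dmat in [("Euclidean", d_euc), ("kernel-induced", d_ker)]:
    rips = gudhi.RipsComplex(distance_matrix=dmat,
                             max_edge_length=float(dmat.max()))
    st = rips.create_simplex_tree(max_dimension=2)   # up to H1
    barcode = st.persistence()
    n_loops = sum(1 for dim, _ in barcode if dim == 1)
    print(f"{name}: {n_loops} H1 bars")
```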
Such kernel-induced distance has been utilized in molecular biology, where atom-specific proper- ties, such as Van der Waal radius, electrostatic potential, etc. can be used to refine the distance. Such application has been shown to be effective at predicting B-factor and protein-ligand binding affinity [16, 17, 18]. 35 (a) Euclidean distance (b) Kernel distance Figure 2.2 Comparison of VR complex using (a) standard Euclidean distance and (b) kernel-induced distance. 2.4.4.2 Alpha complex Unlike VR complex, alpha complex is related to geometric modeling and domain partitioning. Let X be a finite point cloud in a Euclidean space. We can build a Voroni diagram, and let V(x) be a Voronoi cell associated with x ∈ X. Then, for a given ε, the alpha complex is given by Aε(X) = {σ ∈ X| (cid:217) x∈X (V(x) ∩ Bε(x)) ≠ ∅} (2.38) where Bε(x) is the ball of radius ε around x ∈ X. Figure 2.3 show a comparison between the alpha complex and VR complex. VR complex show a longer persistence of H0, meaning that the points do not connect for longer period. Both complex show a loop, but the loop is located in a different interval. 36 (a) VR complex (b) Alpha complex Figure 2.3 Comparison of VR complex and alpha complex on 5 points. generated from VR complex, and (b) shows the barcode generated from alpha complex. (a) shows the barcode 2.4.5 Persistent homology One downside of utilizing the simplicial complex is that it does not provide sufficient information to understand the geometry of the data because we can only capture the data in a single scale. To this end, we utilize simplicial complex induced by filtration, ∅ = K0 ⊆ K1 ⊆ ... ⊆ Kp = K (2.39) where P is the number of filtration. Using such filtration is the foundation of persistent homology, where the persistence is observed through long-lasting topological features. For each filtration p, we construct the simplicial complex, the chain group, the subgroups and the homology group. In 37 particular, the t-persistent q-th homology group Ki is Hi,t q = Zi q/ (cid:16) Bi+t q (cid:217) (cid:17) Zi q (2.40) Computing the Betti numbers give the persistence of q-dimensional holes. Figure 2.4 show an example of persistent barcode using benzene (C6H6). We extracted the carbon atoms from benzene, and constructed the simplicial complex via vietoris-rips complex. Then, filtration was applied to obtain the Betti numbers. The top figures show the filtration process of benzene at r = 0.75, 1.50 and 3.00, and was visualized using ChimeraX [19]. Gudhi[20] was used to compute the simplicial complex and the barcodes. (a) r = 0.75 (b) r = 1.50 (c) r = 3.00 Figure 2.4 Illustration of vietoris-rips complex of benzene (C6H6) as the radius increases to r = 3.0. (a) r = 0.75, (b) r = 1.50 and (c) r = 3.00 shows the filtration process, and the bottom figure shows the persistent barcode corresponding to the filtration. 2.4.6 Combinatorial Laplacian Lapalcian Combinatorial Laplacian gives insight from both spectral analysis and persistent homology [21, 22]. Recall that the chain complex of a simplicial complex is defined through a sequence of boundary operators, and that looking at the kernel and the image of the operators defined the cycle 38 group Zq and the boundary Bq. Then, the q-th homology group Hq = Zq/Bq (or Hq = ker∂q/Im∂q+1). Moreover, the dimensional of Hq gives the q-betti numbers. defined as Cq(K) (cid:27) C∗ We can now define the dual chain complex through the adjoint operator of ∂q. The dual space is q : Cq–1(K) → Cq(K). 
q is defined as ∂∗ For ωq–1 ∈ Cq–1(K) and cq ∈ Cq(K), the coboundary operator is defined as q(K), and the coboundary operator ∂∗ ∂∗ωq–1(cq) ≡ ωq–1(∂cq). (2.41) Here ωq–1 is a (q – 1) cochain, or a homomorphic mapping from a chain to the coefficient group. The homology of the dual chain complex is called the cohomology. We then define the q-combinatorial Laplacian operator △q : Cq(K) → Cq(K) △q := ∂q+1∂∗ q+1 + ∂∗ q∂q. (2.42) Let Bq be the standard basis for the matrix representation of q-boundary operator from Cq(K) and Cq–1(K), and BT q be th q-coboundary operator. The matrix representation of the q-th order Laplacian operator Lq is defined as Lq = Bq+1BT q+1 + BT q Bq. (2.43) The multiplicity of zero eigenvalue of Lq is the q-th Betti number of the simplicial complex. The nonzero eigenvalues (non-harmonic spectrum) contains other topological and geometrical features. As stated before, simplicial complex does not provide sufficient information to understand the geometry of the data. To this end, we utilize simplicial complex induced by filtration {∅} = K0 ⊆ K1 ⊆ · · · ⊆ Kp = K, (2.44) where p is the number of filtration. For each Kt 0 ≤ t ≤ p, denote Cq(Kt) as chain group induced by Kt, and the corresponding boundary operator ∂t q : Cq(Kt) → Cq–1(Kt), resulting in ∂t qσq = q ∑︁ i=1 (–1)iσi q–1, 39 (2.45) for σq ∈ Kt. The adjoint operator of ∂t q is similarity defined as ∂t∗ q : Cq–1(Kt) → Cq(Kt), which we regard as the mapping Cq–1(Kt) → Cq(Kt) via the isomorphism between cochain and chain groups. Through these 2 operators, we can define the chain complexes induced by Kt. 2.4.7 Persistent Laplacian Utilizing filtration with simplicial complex, we can define persistence Laplacian spectra. Let Ct+p q–1 be Ct+p q whose boundary is in Ct this set, we can define the p-persistent q-boundary operator denoted ˆ∂t,p q corresponding adjoint operator (ˆ∂t,p)∗ : Ct q , assuming an inclusion mapping Ct : Ct,p q–1. On q–1 and the q . Then, the q-order p-persistent Laplacian q–1 q → Ct → Ct+p → Ct,p q–1 operator is computed as and its matrix representation as q = ˆ∂t,p △t,p q+1(ˆ∂t,p q+1)∗ + (ˆ∂t q)∗ ˆ∂t q, q = Bt,p Lt,p q+1(Bt,p q+1)T + (Bt q)T Bt q. (2.46) (2.47) Likewise as before, the multiplicity of the zero-eigenvalue is the q-th order p-persistent Betti number βt,p q , which is the q-dimensional hole in Kt that persists in Kt+p. Moreover, the q-th order Laplacian is just a particular case of Lt,p q , where p = 0, which is a snapshot of the topology at the filtration step t [23, 24]. Figure 2.5 show an example of persistent barcode feature, harmonic and nonharmonic spectra of persistent Laplacian of fullerene (C60) molecule. (a), (b) and (c) correspond to the radii r = 0.75, r = 1.5 and r = 2.5, respectively, and was visualized using ChimeraX[19]. (d) shows the persistent barcode of C60 molecule, which was generated using Gudhi. (e) shows the harmonic (βr,p the nonharmonic (λr,p k ) and k ) spectra of persistent Laplacian, which was obtained via HERMES[23]. Vietrois-rips complex was used for both cases, and for the persistent Laplacian, p = 0 was used. As indicated by the blue bars in (d), the connected components (H0) dies at about r = 1.5, which is consistent with βr,0 0 , which is to be expected. Persistent Laplacians captures the Betti information of persistent homology. However, persistent Laplacian also captures the non-harmonic spectra, which 40 is indicated by the red line. Such feature can capture the homtopic shape of evolution throughout the filtration process. 
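As a concrete illustration of Eq. (2.43), the 0-th combinatorial Laplacian of a small complex can be assembled directly from its boundary matrix, and the multiplicity of its zero eigenvalues recovers β0. The NumPy sketch below uses a toy complex (a hollow triangle plus one isolated vertex) chosen purely for illustration; it is not tied to the HERMES or Gudhi computations reported in the figures.

```python
import numpy as np

# Boundary matrix B1 for a toy complex: vertices {0,1,2,3}, edges
# [0,1], [1,2], [0,2]; vertex 3 is isolated.  Each column is an edge and
# its entries follow Eq. (2.30), with signs (-1, +1) on the edge endpoints.
B1 = np.array([
    [-1,  0, -1],
    [ 1, -1,  0],
    [ 0,  1,  1],
    [ 0,  0,  0],
], dtype=float)

# Eq. (2.43) with q = 0 (there is no lower boundary term): L0 = B1 B1^T.
L0 = B1 @ B1.T
# With no 2-simplices, B2 is empty and L1 reduces to B1^T B1.
L1 = B1.T @ B1

def betti(L, tol=1e-10):
    """Betti number = multiplicity of the zero eigenvalue of Lq."""
    return int(np.sum(np.linalg.eigvalsh(L) < tol))

print("beta_0 =", betti(L0))   # 2 connected components
print("beta_1 =", betti(L1))   # 1 loop (the unfilled triangle)
```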
41 (a) r = 0.75 (b) r = 1.50 (c) r = 2.5 (d) Persistent Barcode (e) Persistent Laplacian Figure 2.5 Comparison of persistent barcode and persistent Laplacian utilizing fullerene C60 molecule. HERMES[23] package was used to generate the Betti numbers and to non-harmonic spectra using the vietoris-rips complex and p = 0. 42 2.5 Phylogenetic Analysis 2.5.1 k-mers method In DNA sequencing analysis, k-mers are words or motif of length k. For example, 1-mers consist of 4 types, adenine (A), cytosine (C), guanine (G) or thymine (T) (in the case of RNA sequencing, uracil (U)). 2-mers consists of 16 combinations, such as AA, GT, and GA. In general, for a given k, the number of subsequence of length k is 4k. In k-mers based methods, the frequency of each k-mers are computed. For example, the sequence GTAGAGCTGT contains 2A, 1C, 4G and 3T, which forms the frequency for 1-mers. Since the length of the sequence is 10, we can count up to k-mers of length 10. We will now formulate that case for 1-mers. Let S = s1s2...sN be a DNA sequence of length N, where si ∈ {A, C, G, T}. Define the indicator function for nucleotide l δl(si) = 1, si = l 0, otherwise    where 1 ≤ i ≤ N and l ∈ {A, C, G, T}. Then, the total number of nucleotide l is Nl = N ∑︁ i=1 δl(si). (2.48) (2.49) It is easy to see that the length of the sequence, assuming that all nucleotides are valid, satisfies N = NA + NC + NG. Now, for the general k-mers case, we have a total of N–k+1 k-mers given by (s1...sk)(s2...sk+1)...(sN–k+1...sN). Then, we label the 4k possible k-mers as l1, l2, ...lj, ..., l4k, and define the k-mers specific indicator function as δlj(si...si+k) = 1, (si...si+k) = lj 0, otherwise    Then, the total number of k-mers lj is Nlj = N–k+1 ∑︁ i=1 δlj(si...si+k) 43 (2.50) (2.51) Because k-mers has the computational efficiency of O(N), it can handle long sequences, includ- ing mammalian chromosomes. 2.5.2 Natural Vector Method The Natural Vector Method (NVM) was developed in 2011. It transforms the DNA sequence into a vector by capturing the moments of the position of the nucleotide [25]. This extends the traditional k-mers method by incorporating positional information of the k-mers. Let S = s1s2...sN be a DNA sequence of length N, where si ∈ {A, C, G, T}. Here, we consider uracil U as thymine T. Define a indicator function δl(si) as δl(si) = 1, si = l 0 otherwise    where l = A, C, G, T. Then, the natural vector for the sequence is defined as S = (NA, NC, NG, NT , µA, µC, µG, µT , DA 2 , DC 2 , DG 2 , DT 2 , ..., DA M, DC M, DG M, DT M) where N ∑︁ i=1 N ∑︁ Nl = µl = δl(si) δl(si) (2.52) i Nl i=1 N ∑︁ i=1 Dl m = (i – µl)m Nm–1 l Nm–1 δ(si) Here, m = 0, 1, 2, ..., M is the order of the moment, N = NA + NC + NG + NT . Nl and µl are the 0-order and 1-order, respectively. In essence, Nl is the total number of nucleotide l, and µl is the average position of nucleotide l. This results in a uniform feature vector of size 4M for each sequence. The natural vector was further improved via the implementation of k-mers specific natural vector [26]. Rather than looking at the nucleotide, the moments are calculated for each k-mers. For the sequence S = s1s2...sN, there are a total of N – k + 1 k-mers (s1...sk)(s2...sk+1)...(sN–k+1...sN). 
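Before extending the natural vector to k-mers, the nucleotide-level quantities defined above, namely the k-mer counts of Eq. (2.51) and the counts, mean positions, and central moments of Eq. (2.52), can be sketched in a few lines of Python. This is a minimal illustration rather than the feature-generation code used later; the toy sequence is the example from the text, and ambiguous characters are simply skipped as an assumed convention.

```python
from itertools import product
import numpy as np

def kmer_counts(seq, k):
    """Frequency of every length-k word over {A, C, G, T} (Eq. 2.51)."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:          # skips ambiguous characters such as 'N'
            counts[word] += 1
    return counts

def natural_vector(seq, M=2):
    """Nucleotide-level natural vector (N_l, mu_l, D_l^2, ..., D_l^M), Eq. (2.52)."""
    N = len(seq)
    features = []
    for l in "ACGT":
        pos = np.array([i + 1 for i, s in enumerate(seq) if s == l], dtype=float)
        n_l = len(pos)
        mu_l = pos.mean() if n_l > 0 else 0.0
        moments = [((pos - mu_l) ** m).sum() / (n_l ** (m - 1) * N ** (m - 1))
                   if n_l > 0 else 0.0
                   for m in range(2, M + 1)]
        features.extend([n_l, mu_l, *moments])
    return np.array(features)

seq = "GTAGAGCTGT"                     # toy sequence from the text
print(kmer_counts(seq, 1))             # {'A': 2, 'C': 1, 'G': 4, 'T': 3}
print(natural_vector(seq, M=2))
```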
44 Similarly, for a given k, the natural vector is defined as S = (Nl1, ..., Nl4k , µl1, ..., µl4k , Dl1 2 ..., D l4k M , ..., Dl1 M, ..., D l4k M ) where δlj(si...si+k) N–k+1 ∑︁ i=1 N–k+1 ∑︁ Nlj = µlj = lj m = D δlk(si...si+k) i Nlj (i – µlj)m Nm–1 Nm–1 lj i=1 N–k+1 ∑︁ i=1 δlj(si...si+k) (2.53) where δlj(si...si+j) is an indicator function for the k-mer lj. For a given k, there is a total of 4k combination of k-mers. Moreover, for a fixed k and M, the number of features will be 4kM. Then, combining all the k-mers together, we have a natural vector of length (cid:205)K k=1 4kM 2.5.3 Markov k-string model In this section, we discuss the Markov k-string model, which is the foundation for all alignment- free probabilistic models. Markov model aims to characterize evolution by evaluating the difference between observed probability of an appearance of a k-mer and the predicted appearance of a k-mer [27]. Let α1α2...αk be the nucleotide of a k-mer, and let f (α1α2...αk) denote its frequency. We can normalize the frequency by dividing it by the total number of k-mers, which will give the probability of a particular k-mers appearing. p(α1α2...αk) = f (α1α2...αk) N – k + 1 (2.54) Since the goal is to compare the observed and predicted probability, we compute the probability of (k – 1)-mers and (k – 2)-mer derived from the k-mer. The predicted probability of k-mers appearing can be computed using a Markov model: p0(α1α2...αk) = p(α1α2...αk–1)p(α2α3...αk) p(α2α3...αk–1) (2.55) 45 where p0 denote the predicted probability, and such prediction model has been utilized for both DNA and protein sequences [28, 29] Then, the evolution can be characterized by taking the normalized difference between the observe and predicted probabilities: a(α1α2...αk) = p(α1α2...αk)–p0(α1α2...αk) p0(α1α2...αk) 0 p0 ≠ 0 p0 = 0    (2.56) Likewise with the original k-mers and NVM methods, Markov models are also computationally efficient, and has been applied to both proteins and DNA sequences. 46 BIBLIOGRAPHY [1] Ian T Jolliffe and Jorge Cadima. Principal component analysis: a review and recent devel- opments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065):20150202, 2016. [2] Yu-Xiong Wang and Yu-Jin Zhang. Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on knowledge and data engineering, 25(6):1336–1353, 2012. [3] Daniel Lee and H Sebastian Seung. Algorithms for non-negative matrix factorization. Ad- vances in neural information processing systems, 13, 2000. [4] George C Linderman, Manas Rachh, Jeremy G Hoskins, Stefan Steinerberger, and Yuval Kluger. Fast interpolation-based t-sne for improved visualization of single-cell rna-seq data. Nature methods, 16(3):243–245, 2019. [5] GW Wei. Wavelets generated by using discrete singular convolution kernels. Journal of Physics A: Mathematical and General, 33(47):8577, 2000. [6] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018. [7] [8] Jiahui Chen, Rundong Zhao, Yiying Tong, and Guo-Wei Wei. Evolutionary de rham-hodge method. arXiv preprint arXiv:1912.12388, 2019. J MacQueen. Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Symposium on Math., Stat., and Prob, page 281, 1965. [9] Hae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for k-medoids clustering. Expert systems with applications, 36(2):3336–3341, 2009. 
[10] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of classification, 2:193– 218, 1985. [11] Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information theoretic measures for clusterings comparison: is a correction for chance necessary? In Proceedings of the 26th annual international conference on machine learning, pages 1073–1080, 2009. [12] David F Crouse. On implementing 2d rectangular assignment algorithms. IEEE Transactions on Aerospace and Electronic Systems, 52(4):1679–1696, 2016. [13] KVSN Rama Rao and B Manjula Josephine. Exploring the impact of optimal clusters on In 2018 3rd International Conference on Communication and Electronics cluster purity. Systems (ICCES), pages 754–757. IEEE, 2018. 47 [14] Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53–65, 1987. [15] Kay Henning Brodersen, Cheng Soon Ong, Klaas Enno Stephan, and Joachim M Buhmann. The balanced accuracy and its posterior distribution. In 2010 20th international conference on pattern recognition, pages 3121–3124. IEEE, 2010. [16] Zixuan Cang and Guo-Wei Wei. Analysis and prediction of protein folding energy changes upon mutation by element specific persistent homology. Bioinformatics, 33(22):3549–3557, 2017. [17] Zixuan Cang and Guo-Wei Wei. machine learning for protein-ligand binding affinity prediction. numerical methods in biomedical engineering, 34(2):e2914, 2018. Integration of element specific persistent homology and International journal for [18] David Bramer and Guo-Wei Wei. Atom-specific persistent homology and its application to protein flexibility analysis. Computational and mathematical biophysics, 8(1):1–35, 2020. [19] Thomas D Goddard, Conrad C Huang, Elaine C Meng, Eric F Pettersen, Gregory S Couch, John H Morris, and Thomas E Ferrin. Ucsf chimerax: Meeting modern challenges in visualization and analysis. Protein science, 27(1):14–25, 2018. [20] Clément Maria, Jean-Daniel Boissonnat, Marc Glisse, and Mariette Yvinec. The gudhi library: Simplicial complexes and persistent homology. In Mathematical Software–ICMS 2014: 4th International Congress, Seoul, South Korea, August 5-9, 2014. Proceedings 4, pages 167–174. Springer, 2014. [21] Beno Eckmann. Harmonische funktionen und randwertaufgaben in einem komplex. Com- mentarii Mathematici Helvetici, 17(1):240–255, 1944. [22] Daniel Hernández Serrano, Juan Hernández-Serrano, and Darío Sánchez Gómez. Simplicial degree in complex networks. applications of topological data analysis to network science. Chaos, Solitons and amp; Fractals, 137:109839, August 2020. [23] Rui Wang, Rundong Zhao, Emily Ribando-Gros, Jiahui Chen, Yiying Tong, and Guo-Wei Wei. Hermes: Persistent spectral graph software. Foundations of data science (Springfield, Mo.), 3(1):67, 2021. [24] Rui Wang, Duc Duy Nguyen, and Guo-Wei Wei. Persistent spectral graph. International journal for numerical methods in biomedical engineering, 36(9):e3376, 2020. [25] Mo Deng, Chenglong Yu, Qian Liang, Rong L He, and Stephen S-T Yau. A novel method of characterizing genetic sequences: genome space with biological distance and applications. PloS one, 6(3):e17293, 2011. 48 [26] Nan Sun, Shaojun Pei, Lily He, Changchuan Yin, Rong Lucy He, and Stephen S-T Yau. Geometric construction of viral genome space and its applications. Computational and Structural Biotechnology Journal, 19:4226–4234, 2021. [27] Ji Qi, Bin Wang, and Bai-Iin Hao. 
Whole proteome prokaryote phylogeny without sequence alignment: Ak-string composition approach. Journal of molecular evolution, 58:1–11, 2004. [28] Bruce Alberts, Dennis Bray, Julian Lewis, Martin Raff, Keith Roberts, James D Watson, et al. Molecular biology of the cell, volume 3. Garland New York, 1994. [29] Rui Hu and Bin Wang. Statistically significant strings are related to regulatory elements in the promoter regions of saccharomyces cerevisiae. Physica A: Statistical Mechanics and its Applications, 290(3-4):464–474, 2001. 49 CHAPTER 3 PHYLOGENETIC ANALYSIS 3.1 UMAP-assisted k-means clustering of large-scale SARS-CoV-2 mutation datasets 3.1.1 Introduction Beginning in December 2019, coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has become one of the most deadly global pandemics in history. The COVID-19 infections in the United States (US) and other nations are still spiking. As of January 20, 2021, the World Health Organization (WHO) has reported 93,217,287 confirmed cases of COVID-19 and 2,014,957 confirmed deaths. The virus has spread to Africa, Americas, Eastern Mediterranean, Europe, South-East Asia and Western Pacific [1]. To prevent further damage to our livelihood, we must control its spread through testing, social distancing, tracking the spread, and developing effective vaccines, drugs, diagnostics, and treatments. SARS-CoV-2 is a positive-sense single-strand RNA virus that belongs to the Nidovirales or- der, coronaviridae family and betacoronavirus genus [2]. To effectively track the virus, testing patients with suspected exposure to COVID-19 and sequencing the strand via PCR (polymerase chain reaction) are important. From sequencing, we can analyze patterns in mutation and predict transmission pathways. Without understanding such pathways, current efforts to find effective medicines and vaccines could become futile because mutations may change viral genome or lead to resistance. As of January 20, 2021, there are 203,344 available sequences with 26,844 unique sin- gle nucleotide polymorphisms (SNPs) with respect to the first SARS-CoV-2 sequence collected in December 2019 [3] according to our mutation tracker https://users.math.msu.edu/users/weig/SARS- CoV-2_Mutation_Tracker.html. A popular method for understanding mutational trends is to perform phylogenetic analysis, where one clusters mutations to find evolution patterns and transmission pathways. Phylogenetic analysis has been done on the Nidovirales family [4, 5, 6, 7, 4, 8] to understand genetic evolutionary pathways, protein level changes [9, 10, 11, 8], large scale variants [10, 9, 12, 13] and global trends 50 [14, 15, 16]. Commonly used techniques for phylogenetic analysis include tree based methods [17] and K-means clustering. Both methods belong to unsupervised machine learning techniques, where ground truth is unavailable. These approaches provide valuable information for exploratory research. A main issue with phylogenetic tree analysis is that as the number of samples increase, its computation becomes unpractical, making it unsuitable for large genome datasets. In contrast, K-means scales well with sample size increase, but does not perform well when the sample size is too small. Jaccard distance is commonly used to compare genome sequences [18] because it offers a phylogenetic or topological difference between samples. 
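For two SNP profiles represented as sets of mutation positions, the Jaccard distance used throughout this chapter amounts to one line of Python; the formal definition is given in Section 3.1.2.3. The sketch below is purely illustrative, and the two example profiles are hypothetical sets of genome positions, not records from the dataset analyzed here.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets of SNP positions: 1 - |A∩B| / |A∪B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Two hypothetical SNP profiles (genome positions of observed mutations).
profile_1 = {241, 3037, 14408, 23403}
profile_2 = {241, 3037, 23403, 28881}
print(jaccard_distance(profile_1, profile_2))   # 0.4
```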
However, the tradeoff to the Jaccard distance is that its feature dimension is the same as its number of samples, suggesting that for a large sample size, the number of features is also large. Since K-means clustering relies on computing the distance between the center of the cluster and each sample, having a large feature space can result in expensive computation, large memory requirement, and poor clustering performance. This become a significant problem as the number of SARS-CoV-2 genome isolates from patients has reached 200,000 at this point. There is a pressing need for efficient clustering methods for SARS-CoV-2 genome sequences. One technique to address this challenge is to perform dimensional reduction on the K-means input dataset so that the task becomes manageable. Commonly used dimension reduction algorithms focus on two aspects: 1) the pairwise distance structure of all the data samples and 2) preservation of the local distances over the global distance. Techniques such as principal component analysis (PCA) [19], Sammon mapping [20], and multidimensional scaling (MDS) [21] aim to preserve the pairwise distance structure of the dataset. In contrast, the t-distributed stochastic neighbor embedding (t-SNE) [22, 23], uniform manifold approximation and projection (UMAP) [24, 25], Laplacian eigenmaps [26], and LargeVis [27] focus on the preservation of local distances. Among them, PCA, t-SNE, and UMAP are the most frequently used algorithms in the applications of cell biology, bioinformatics, and visualization [25]. PCA is a popular method used in exploratory studies, aiming to find the directions of the maximum variance in high-dimensional data and projecting them onto a new subspace to obtain 51 low-dimensional feature spaces while preserving most of the variance. The principal components of the new subspace can be interpreted as the directions of the maximum variance, which makes the new feature axes orthogonal to each other. Although PCA is able to cover the maximum variance among features, it may lose some information if one chooses an inappropriate number of principal components. As a linear algorithm, PCA performs poorly on the features with nonlinear relationship. Therefore, in order to present high-dimensional data on low dimensional and non- linear manifold, some nonlinear dimensional reduction algorithms such as t-SNE and UMAP are employed. T-SNE is a nonlinear method that can preserve the local and global structures of data. There are two main steps in t-SNE. First, it finds a probability distribution of the high dimensional dataset, where similar data points are given higher probability. Second, it finds a similar probabil- ity distribution in the lower dimension space, and the difference between the two distributions is minimized. However, t-SNE computes pairwise conditional probabilities for each pair of samples and involves hyperparameters that are not always easy to tune, which makes it computationally complex. UMAP is a novel manifold learning technique that also captures a nonlinear structure, which is competitive with t-SNE for visualization quality and maintains more of the global structure with superior run-time performance [24]. UMAP is built upon the mathematical work of Belkin and Niyogi on Laplacian eigenmaps, aiming to address the importance of uniform data distributions on manifolds via Riemannian geometry and the metric realization of fuzzy simplicial sets by David Spivak [28]. 
Similar to t-SNE, UMAP can optimize the embedded low-dimensional representation with respect to fuzzy set cross-entropy loss function by using stochastic gradient descent. The embedding is found by finding a low-dimensional projection of the data that closely matches the fuzzy topological structure of the original space. The error between two topological spaces will be minimized by optimizing the spectral layout of data in a low dimensional space. The objective of this work is to explore efficient computational methods for the SARS-CoV-2 phylogenetic analysis of large volume of SARS-CoV-2 genome sequences. Specifically, we are interested in developing a dimension-reduction assisted clustering method. With the increase in available sequencing data, the SNP dataset of SARS-CoV-2 has run into large-data problem. By 52 effectively analyzing clusters, we can find evolutionary trends, which will aid in finding effective medicines and vaccines. To this end, we compare the effectiveness and accuracy of PCA, t-SNE and UMAP for dimension reduction in association with the K-means clustering. To quantitatively evaluate the performance, we recast supervised classification problems with labels into a K-means clustering problems so that the accuracy of K-means clustering can be evaluated. As a result, the accuracy and performance of PCA, t-SNE and UMAP-assisted K-means clustering can be compared. By choosing the different dimensional reduction ratios, we examine the performance of these methods in K-means settings on four standard datasets. We found that UMAP is the most efficient, robust, reliable, and accurate algorithm. Based on this finding, we applied the UMAP- assisted K-means technique to large scale SARS-CoV-2 datasets generated from a Jaccard distance representation and a SNP position-based representation to further analyze its effectiveness, both in terms of speed and scalability. Our results are compared with those in the literature [9] to shed new light on SARS-CoV-2 phylogenetics. 3.1.2 Methods 3.1.2.1 Sequence and alignment The SARS-CoV-2 sequences were obtained from GISAID databank (www.gisaid.com). Only complete genome sequences with collection date, high coverage, and without ’NNNNNN’ in the sequences were considered. Each sequence was aligned to the reference sequence [3] using a multiple sequence alignment (MSA) package Clustal Omega [29]. A total of 203,344 complete SARS-CoV-2 sequences are analyzed in this work. 3.1.2.2 SNP position based features Let N be the number of SNP profiles with respect to the SARS-CoV-2 reference genome sequence, and let M be the number of unique mutation sites. Denote Vi as the position based feature of the ith SNP profile. Vi = [v1 i , v2 i , ..., vM i ], i = 1, 2, ..., N (3.1) 53 is a 1 × M vector. Here vj i = We compile this into an N × M position based feature, 1, mutation site 0, otherwise.    S(i, j) = vj i (3.2) (3.3) where each row represents a sample. Note that S(i, j) is a binary representation of the position and is sparse. 3.1.2.3 Jaccard based representation The Jaccard distance measures the dissimilarity between two sets. It is widely used in the phylogenetic studies of SNP profiles. In this work, we utilize Jaccard distance to compare SNP profiles of SARS-CoV-2 genome isolates. Let A and B be two sets. 
Consider the Jaccard index between A and B, denoted J(A, B), as the cardinality of the intersection divided by the cardinality of the union J(A, B) = (cid:12)A ∩ B(cid:12) (cid:12) (cid:12) (cid:12)A ∪ B(cid:12) (cid:12) (cid:12) (cid:12) (cid:12)A ∩ B(cid:12) (cid:12) (cid:12)A ∩ B(cid:12) (cid:12) – (cid:12) (cid:12)B(cid:12) (cid:12) + (cid:12) (cid:12) (cid:12)A(cid:12) (cid:12) . = The Jaccard distance between the two sets is defined by subtracting the Jaccard index from 1: dJ(A, B) = 1 – J(A, B) = (cid:12)A ∪ B(cid:12) (cid:12) (cid:12) – (cid:12) (cid:12)A ∩ B(cid:12) (cid:12) (cid:12)A ∪ B(cid:12) (cid:12) (cid:12) (3.4) (3.5) We assume there are N SNP profiles or genome isolates that have been aligned to the reference SARS-CoV-2 genome. Let Si, i = 1, ..., N, be the set with the position of the mutation of the ith sample. The Jaccard distance between two sets Si and Sj is given by dJ(Si, Sj). Taking the pairwise distance between all the samples, we can construct the Jaccard based representation, resulting in an N × N distance matrix D D(i, j) = dJ(Si, Sj) (3.6) This distance defines a metric over the collections of all finite sets [30]. 54 3.1.3 Validation K-means clustering is one of the unsupervised learning algorithms, suggesting that neither the accuracy nor the root-mean-square error can be calculated to evaluate the performance of the K-means clustering explicitly. Additionally, K-means clustering can be problematic for high- dimensional large datasets. Dimension-reduced K-means clustering is an efficient approach. To evaluate its accuracy and performance, we convert supervised classification problems with known solutions into dimension-reduced K-means clustering problems. In doing so, we apply the K-means clustering to the classification dataset by setting the number of clusters equals to the number of the real categories. Next, in each cluster, we will take the data with the dominant label as the test for all samples and then calculate the K-means clustering accuracy for the whole dataset. 3.1.3.1 Validation data In this work, we will consider the following classification datasets to test the performance of the clustering methods: Coil 20, Facebook large page-page network, MNIST, and Jaccard distanced- based MNIST. Previous work has been done on datasets using Euclidean and Minkowski distance for lower dimensions[24, 22, 23]. Here, we verify the result with higher reduction ratios, and tested the validity of using Jaccard distance as a metric. • Coil 20: Coil 20 [31] is a dataset with 1440 gray scale images, consisting of 20 different objects, each with 72 orientation. Each image is of size 128 × 128, which was treated as a 16384 dimensional vector for dimensional reduction • Facebook Network: Facebook large page-page network [32] is a page-page webgraph of verified Facebook sites. Each node represents a facebook page, and the links are the mutual links between sites. This is a binary dataset with 22,470 nodes; hence the sample size and feature size are both 22,470. Jaccard distance was computed between each nodes for the feature space. • MNIST: MNIST [33] is a hand written digit dataset. Each image is a grey scale of size 28 × 28, which was treated as a 784 dimensional vector for the feature space, each with a 55 integer value in [0, 255]. Standard normalization was used before performing dimensional reduction. There are 70,000 sample, with 10 different labels. • Jaccard distanced-based MNIST: The above dataset was converted to a Jaccard distance- based dataset. 
This is to simulate position based mutational dataset, where 1 indicates a mutation in a particular position. Jaccard distance was used to construct the feature space, hence for each sample, the feature size is 70,000. This dataset can be viewed as an additional validation on our Jaccard distance representation. 3.1.3.2 Validation results In the present work, we implement three popular dimensional reduction methods, PCA, UMAP, and t-SNE, for the dimension reduction and compare their performance in K-means clustering. For a uniform comparison, we reduce the dimensions of the samples by a set of ratios. The minimum between the number of features and the number of samples was taken as base of the reduction. For the Coil 20 dataset, since the numbers of samples and features were 1440 and 16384, respectively, dimension-reductions were based on 1440. For the Facebook Network, since the numbers of samples and features were both 22,470, dimension-reductions were based on 22,470. For the MNIST dataset, since the numbers of samples and features were respectively 70,000 and 784, dimension-reductions were based on 784. Finally, for the Jaccard distanced-based MNIST dataset, since the numbers of samples and features were both 70,000, dimension-reductions were based on 70,000. Note that for the Jaccard distanced-based MNIST data, more aggressive ratios were used because the original feature size is huge, i.e., 70,000. The standard ratios of 2, 4, and 8, etc do not sufficiently reduce the dimension for effective K-means computation. For the purpose of visualization, two-dimensional reduction algorithms are applied to each reduction scheme. In order to validate PCA, UMAP, and t-SNE assisted K-means clustering, we observed their performance using labeled datasets. K-nearest neighbors (K-NN) was used to find the baseline of the reduction, which reveals how much information can be preserved in the feature after applying a dimensional reduction algorithm. For k-NN, 10 fold cross-validation was performed. 56 Notably, K-means clustering is an unsupervised learning algorithm, which does not have labels to evaluate the clustering performance explicitly. However, we can assess the K-means clustering accuracy via labeled datasets that has ground truth. In doing so, we choose the number of K as the original number of classes. The detail of computing accuracy of k-means clustering can be found in section 2.3. 3.1.3.3 Coil 20 Figure 3.1 Comparison of different dimensional reduction algorithms on Coil 20 dataset. Total 20 different labels are in the Coil 20 dataset, and we use the ground truth label to color each data points. (a) Feature size is reduced to dimension 2 by PCA. (b) Feature size is reduced to dimension 2 by t-SNE. (c) Feature size is reduced to dimension 2 by UMAP. Figure 3.1 shows the performance of PCA-assisted, UMAP-assisted and t-SNE-assisted clus- tering of the Coil 20 dataset. For each case, the dataset were reduced to dimension 2 using default parameters, and the plots were colored with the ground truth of the Coil 20 dataset. It can be seen that PCA does not present good clustering, whereas UMAP and t-SNE show very good clusters. 57 Table 3.1 Accuracy of k-NN of the Coil 20 dataset without applying any reduction algorithms, as well as the accuracy of k-NN assisted by PCA, UMAP and t-SNE with different dimensional reduction ratio. The sample size, feature size, and the number of labels of the Coil 20 dataset are 1440, 16384, and 20, respectively. 
Dataset: Coil 20 (1440, 16384, 20); k-NN accuracy w/o reduction: 0.956

Reduced dimension   PCA accuracy   UMAP accuracy   t-SNE accuracy
720 (1/2)           0.955          0.668           0.850
360 (1/4)           0.957          0.861           0.889
180 (1/8)           0.973          0.867           0.881
90 (1/16)           0.977          0.860           0.885
45 (1/32)           0.980          0.861           0.875
22 (1/64)           0.985          0.868           0.743
14 (1/100)          0.730          0.851           0.878
7 (1/200)           0.985          0.870           0.845
3                   0.850          0.863           0.959
2                   0.730          0.853           0.948

Table 3.1 shows the accuracy of k-NN clustering of the Coil 20 dataset assisted by PCA, t-SNE, and UMAP with different dimensional reduction ratios. The Coil 20 dataset has 1,440 samples, 16,384 features, and 20 different labels. For PCA, the sklearn implementation in Python was used with standard parameters. Note that for all methods, dimensions were also reduced to 3 and 2 for comparison. For t-SNE, Multicore-TSNE [34] was used because it supports multi-core execution (up to 8 cores), which is not available in the sklearn implementation, and it is the fastest-performing t-SNE implementation. For UMAP, we used standard parameters [24]. It can be seen that when we reduce the dimension to 3, t-SNE performs best. Moreover, when the dimensional reduction ratio is 1/100, PCA and UMAP also perform well. Notably, the k-NN accuracy for the data without applying any dimensional reduction algorithm is 0.956, indicating that UMAP does not provide the best clustering performance on the Coil 20 dataset. However, PCA and t-SNE preserve the information of the original data at dimensional reduction ratios larger than 1/100, and t-SNE even performs better at dimension three on the Coil 20 dataset.

Table 3.2 Accuracy of K-means clustering of the Coil 20 dataset without applying any reduction algorithm, as well as the accuracy of K-means assisted by PCA, UMAP, and t-SNE with different dimensional reduction ratios. The sample size, feature size, and the number of labels of the Coil 20 dataset are 1440, 16384, and 20, respectively.

Dataset: Coil 20 (1440, 16384, 20); K-means accuracy w/o reduction: 0.626

Reduced dimension   PCA accuracy   UMAP accuracy   t-SNE accuracy
720 (1/2)           0.64           0.301           0.798
360 (1/4)           0.678          0.800           0.718
180 (1/8)           0.633          0.822           0.648
90 (1/16)           0.642          0.799           0.681
45 (1/32)           0.666          0.800           0.615
22 (1/64)           0.673          0.819           0.151
14 (1/100)          0.631          0.817           0.154
7 (1/200)           0.591          0.819           0.360
3                   0.561          0.800           0.780
2                   0.537          0.801           0.828

Table 3.2 describes the accuracy of K-means clustering of Coil 20 assisted by PCA, UMAP, and t-SNE with different dimensional reduction ratios. For consistency, we use the same set of standard parameters as for k-NN. For the Coil 20 dataset, the accuracy of K-means clustering assisted by UMAP has the best overall performance. When the reduced dimension is 180 (ratio 1/8), UMAP results in a relatively high K-means accuracy (0.822). Moreover, although PCA performs best on k-NN accuracy, it performs poorly on K-means accuracy, indicating that PCA is not a suitable dimensionality reduction algorithm for the Coil 20 dataset. Furthermore, the highest accuracy of K-means clustering is 0.828, which is obtained from the t-SNE-assisted algorithm. However, the t-SNE-assisted accuracy changes dramatically under different reduction ratios. When the ratio is 1/64, the t-SNE-assisted accuracy is only 0.151, indicating that t-SNE is sensitive to the hyper-parameter settings. In contrast, the performance of UMAP is highly stable under all dimension-reduction ratios. Note that dimension-reduced K-means clustering methods outperform the original K-means clustering.
Therefore, the proposed dimension-reduced k-means clustering methods not only improve the k-means clustering efficiency, but also achieve better accuracy. 59 3.1.3.4 Facebook Network Figure 3.2 Comparison of different dimensional reduction algorithms on the Facebook Network dataset. Total 4 different labels are in the Facebook Network dataset, and we use the ground truth label to color each data points. (a) Feature size is reduced to dimension 2 by PCA. (b) Feature size is reduced to dimension 2 by t-SNE. (c) Feature size is reduced to dimension 2 by UMAP. Figure 3.2 shows the visualization performance of PCA-assisted, UMAP-assisted, and t-SNE- assisted clustering of the Facebook Network. For each case, the dataset was reduced to dimension 2 using default parameters, and the plots were colored with the ground truth of the Facebook Network. Figure 3.2 shows that the PCA-based data is located distributively, while the t-SNE- and UMAP-based data show clusters. Table 3.3 shows the accuracy of k-NN clustering of the Facebook Network assisted by PCA, t-SNE, and UMAP with different dimensional reduction radio. The Facebook Network dataset has 22,470 samples with 4 different labels, and the feature size of the Facebook Network is also 22,470. For each algorithm, we use the same settings as the Coil 20 dataset. Without applying any dimensional reduction method, The Facebook Network has 0.755 k-NN accuracy. The reduced feature from PCA has the best k-NN performance when the reduction ratio is 1/2. UMAP has a better performance compared to PCA and t-SNE when the reduction ratio is smaller than 1/16. 60 Table 3.3 Accuracy of k-NN of the Facebook Network without applying any reduction algorithms, as well as the accuracy of k-NN assisted by PCA, UMAP and t-SNE with different dimensional reduction ratio. The sample size, feature size, and the number of labels of the Facebook Network are 22470, 22470, and 4, respectively. Dataset k-NN accuracy w/o reduction Facebook Network (22470, 22470, 4) 0.755 Reduced dimension 11235 (1/2) 5617 (1/4) 2808 (1/8) 1404 (1/16) 702 (1/32) 351 (1/64) 224 (1/100) 112 (1/200) 44 (1/500) 22 (1/1000) 3 2 PCA accuracy 0.756 0.755 0.754 0.751 0.751 0.746 0.733 0.721 0.714 0.690 0.552 0.501 UMAP accuracy 0.360 0.669 0.754 0.816 0.814 0.815 0.814 0.819 0.816 0.815 0.801 0.786 t-SNE accuracy 0.307 0.316 0.355 0.707 0.669 0.690 0.676 0.633 0.709 0.643 0.741 0.732 Table 3.4 describes the accuracy of K-means clustering of the Facebook Network assisted by PCA, UMAP and t-SNE with different dimensional reduction ratio. PCA, UMAP, and t-SNE all have very poor performance, which may be caused by the smaller number of labels. The highest accuracy 0.427 is observed in the t-SNE-assistant algorithm with dimension 2. 61 Table 3.4 Accuracy of K-means clustering of the Facebook Network without applying any reduction algorithms, as well as the accuracy of K-means assisted by PCA, UMAP and t-SNE with different dimensional reduction ratio. The sample size, feature size, and the number of labels of the Facebook Network are 22470, 22470, and 4, respectively. 
Dataset k-NN accuracy w/o reduction Facebook Network (22470, 22470, 4) 0.374 Reduced dimension 11235 (1/2) 5617 (1/4) 2808 (1/8) 1404 (1/16) 702 (1/32) 351 (1/64) 224 (1/100) 112 (1/200) 44 (1/500) 22 (1/1000) 3 2 PCA accuracy 0.331 0.331 0.331 0.331 0.331 0.331 0.331 0.331 0.331 0.331 0.332 0.358 UMAP accuracy 0.306 0.307 0.411 0.397 0.401 0.400 0.400 0.400 0.400 0.401 0.351 0.345 t-SNE accuracy 0.306 0.299 0.314 0.313 0.306 0.308 0.327 0.306 0.313 0.306 0.344 0.427 Similar to the last case, UMAP-based and t-SNE-based dimension-reduced k-means clustering methods outperform the original k-means clustering with the full feature dimension. Therefore, it is useful to carry out dimension reduction before k-means clustering for large datasets. 3.1.3.5 MNIST Figure 3.3 Comparison of different dimensional reduction algorithms on the MNIST dataset. Total 10 different labels are in the MNIST dataset, and we use the ground truth label to color each data points. (a) Feature size is reduced to dimension 2 by PCA. (b) Feature size is reduced to dimension 2 by t-SNE. (c) Feature size is reduced to dimension 2 by UMAP. 62 Table 3.5 Accuracy of k-NN of the MNIST dataset without applying any reduction algorithms, as well as the accuracy of k-NN assisted by PCA, UMAP and t-SNE with different dimensional reduction ratio. The sample size, feature size, and the number of labels of the MNIST dataset are 70000, 784, and 10, respectively. Dataset k-NN accuracy w/o reduction MNIST (70000, 784, 10) 0.948 Reduced dimension 392 (1/2) 196 (1/4) 98 (1/8) 49 (1/16) 24 (1/32) 12 (1/64) 7 (1/100) 3 2 PCA accuracy 0.951 0.956 0.960 0.961 0.953 0.926 0.846 0.513 0.323 UMAP accuracy 0.937 0.938 0.937 0.937 0.937 0.937 0.936 0.929 0.919 t-SNE accuracy 0.696 0.846 0.893 0.886 0.842 0.676 0.940 0.938 0.928 Figure 3.3 shows the performance of PCA-assisted, UMAP-assisted and t-SNE-assisted cluster- ing of the MNIST dataset. The sample size of the MNIST dataset is 70000, which has 784 features with 10 different digit labels. For each case, the dataset was reduced to dimension 2 using default parameters, and the plots were colored with the ground truth of the MNIST dataset. In Figure 3.3, by applying the UMAP algorithm, the clear clusters can be detected for the MNIST dataset. The t-SNE offers a reasonable clustering at dimension 2 too. However, the PCA does not provide a good clustering. Table 3.6 Accuracy of K-means clustering of the MNIST dataset without applying any reduction algorithms, as well as the accuracy of K-means assisted by PCA, UMAP and t-SNE with different dimensional reduction ratio. The sample size, feature size, and the number of labels of the MNIST dataset are 70000, 784, and 10, respectively. Dataset k-NN accuracy w/o reduction MNIST (70000, 784, 10) 0.494 Reduced dimension 392 (1/2) 196 (1/4) 98 (1/8) 49 (1/16) 24 (1/32) 12 (1/64) 7 (1/100) 3 2 PCA accuracy 0.487 0.492 0.498 0.496 0.501 0.489 0.464 0.365 0.300 UMAP accuracy 0.665 0.667 0.673 0.718 0.697 0.682 0.677 0.727 0.712 t-SNE accuracy 0.122 0.113 0.113 0.113 0.114 0.138 0.740 0.537 0.593 63 Table 3.5 shows the accuracy of k-NN clustering of the MNIST dataset assisted by PCA, t- SNE, and UMAP with different dimensional reduction radios. For each algorithm, we use the same settings as the Coil 20 dataset. Without applying any dimensional reduction algorithms, the accuracy of k-NN is 0.948. 
By applying PCA/UMAP with the reduction ratio greater than 1/64, the accuracy of PCA/UMAP-assisted k-NN is at the same level without using any dimensional reduction algorithm. However, in contract with UMAP and t-SNE, when the reduced dimension is 2 or 3, PCA performs poorly. This indicates that the PCA may not be suitable for dimension-reduction for datasets with a large sample size. Table 3.6 describes the accuracy of K-means clustering of the MNIST dataset assisted by PCA, UMAP, and t-SNE with different dimensional reduction ratios. By applying PCA, the accuracy of K-means is around 0.45. The t-SNE method performance is quite unstable, from very poor (0113) to the best (0.740), and to a relatively low value of 0.593. In contrast, we can see a stable and improved accuracy from using UMAP at various reduction ratios, indicating that the reduced feature generated by UMAP can better represent the clustering properties of the MNIST dataset compared to the PCA and t-SNE. As observed early, the present UMAP and t-SNE-assisted k-means clustering methods also significantly out-perform the original k-means clustering for this dataset. 3.1.3.6 Jaccard distanced-based MNIST Figure 3.4 Comparison of different dimensional reduction algorithms on the Jaccard distanced- based MNIST dataset. Total 10 different labels are in the Jaccard distanced-based MNIST dataset, and we use the ground truth label to color each data points. (a) Feature size is reduced to dimension 2 by PCA. (b) Feature size is reduced to dimension 2 by t-SNE. (c) Feature size is reduced to dimension 2 by UMAP. 64 Our last validation dataset is Jaccard distanced-based MNIST. This dataset can be treated as a test on the Jaccard distance-based data representation. Figure 3.4 shows the performance of PCA- assisted, UMAP-assisted, and t-SNE-assisted clustering of the Jaccard distanced-based MNIST dataset. The dataset was reduced to dimension 2 using default parameters for visualization, and the plots were colored with the ground truth of the Jaccard distanced-based MNIST dataset. From Figure 3.4, we can see that UMAP provides the clearest clusters compared to PCA and t-SNE when the dimension is reduced to 2. The performance of t-SNE is reasonable while PCA does not give a good clustering. Table 3.7 Accuracy of k-NN of the Jaccard distanced-based MNIST dataset without applying any reduction algorithms, as well as the accuracy of k-NN assisted by PCA, UMAP and t-SNE with different dimensional reduction ratio. The sample size, feature size, and the number of labels of the Jaccard distanced-based MNIST dataset are 70000, 70000, and 10, respectively. Dataset k-NN accuracy w/o reduction Jaccard distanced based MNIST (70000, 70000, 10) 0.958 Reduced dimension 7000 (1/10) 3500 (1/20) 1750 (1/40) 875 (1/80) 437 (1/160) 218 (1/320) 109 (1/640) 70 (1/1000) 35 (1/2000) 17 (1/5000) 7 (1/10000) 3 2 PCA UMAP accuracy accuracy 0.958 0.958 0.958 0.958 0.958 0.958 0.958 0.958 0.956 0.938 0.867 0.487 0.313 0.958 0.966 0.967 0.967 0.968 0.968 0.968 0.968 0.968 0.968 0.967 0.965 0.960 t-SNE accuracy 0.588 0.601 0.725 0.613 0.718 0.701 0.873 0.915 0.872 0.916 0.942 0.939 0.924 65 Table 3.8 Accuracy of K-means clustering of the Jaccard distanced-based MNIST dataset without applying any reduction algorithms, as well as the accuracy of K-means assisted by PCA, UMAP and t-SNE with different dimensional reduction ratio. The sample size, feature size, and the number of labels of the Jaccard distanced-based MNIST dataset are 70000, 70000, and 10, respectively. 
Dataset k-NN accuracy w/o reduction Jaccard distanced based MNIST (70000, 70000, 10) 0.555 Reduced dimension 7000 (1/10) 3500 (1/20) 1750 (1/40) 875 (1/80) 437 (1/160) 218 (1/320) 109 (1/640) 70 (1/1000) 35 (1/2000) 17 (1/5000) 7 (1/10000) 3 2 PCA UMAP accuracy accuracy 0.436 0.436 0.436 0.435 0.435 0.435 0.435 0.436 0.435 0.436 0.431 0.364 0.261 0.329 0.693 0.792 0.793 0.793 0.793 0.794 0.793 0.794 0.793 0.793 0.798 0.791 t-SNE accuracy 0.119 0.120 0.112 0.112 0.114 0.156 0.114 0.113 0.116 0.113 0.737 0.635 0.635 Table 3.7 shows the accuracy of k-NN clustering of Jaccard distanced-based MNIST assisted by PCA, t-SNE, and UMAP with different dimensional reduction radios. For each algorithm, we use the same settings as the Coil 20 dataset. Notably, the k-NN accuracy for the data without applying any dimensional reduction algorithm is 0.958, which is at the same level as the PCA algorithm with a reduction ratio greater than 1/5000. Moreover, we can find that UMAP performs well compared to PCA and t-SNE, indicating that after applying UMAP, the reduced feature still preserves most of the valued information of the Jaccard distanced-based MNIST dataset. The stability and persistence of UMAP at various reduction ratios are the most important features. Table 3.8 describes the accuracy of K-means clustering of the Jaccard distanced-based MNIST dataset assisted by PCA, UMAP, and t-SNE with different dimensional reduction ratio. For consistency, we will use the same standard parameters as k-NN. Similar to the MNIST dataset, the accuracy of K-means clustering assisted by UMAP still has the best performance. When the reduced dimension is 3, UMAP will result in the highest K-means accuracy 0.798. Noticeably, although PCA performs well on k-NN accuracy, it has the lowest K-mean accuracy, indicating that 66 PCA is not a suitable dimensional reduction algorithm, especially for those datasets with a large number of samples. To be noted, the t-SNE accuracy at four reduced dimensions are not available due to the extremely long running time. In a nutshell, PCA, UMAP, and t-SNE can all perform well for k-NN. However, for the Coil 20 dataset, UMAP performs slightly poorly, whereas the t-SNE performs well, which may be caused by a lack of data size. In order to train UMAP, it needs a suitable data size. The Coil 20 dataset has 20 labels, each with only 72 samples. This may not be enough to train UMAP properly. However, even in this case, UMAP performance is still very stable at various reduction ratios and is the best method in terms of reliability, which become the major advantages of UMAP. Another strength of UMAP comes from its dimension-reduction for K-means clustering. In most cases, UMAP can improve K-means clustering accuracy, especially for the Jaccard distanced- based MNIST dataset. Furthermore, UMAP can generate a very clear and elegant visualization of clusters with low dimensional reduction value such as 2. Additionally, UMAP performed better than PCA and t-SNE for a larger dataset (MNIST and Jaccard distanced-based MNIST). Especially for the Jaccard distanced-based MNIST data, where Jaccard distance was used as the metric, UMAP performed best, which indicates the merit of using UMAP for Jaccard distanced-based datasets, such as COVID-19 SNP datasets. 
Furthermore, the accuracies of k-NN classification and K-means clustering are both improved on the Jaccard distance-based MNIST dataset compared to the original MNIST dataset, which provides convincing evidence that the Jaccard distance representation will help improve the performance of the clustering on the SARS-CoV-2 mutation dataset in the following sections.

3.1.3.7 Efficiency comparison

It is important to understand the computational time behavior of the various methods. To this end, we compare the computational time of the three dimension-reduction techniques. Figure 3.5 depicts the computational time of the three methods for the four datasets under various reduction ratios. The green, orange, and blue lines represent the computational time of t-SNE, UMAP, and PCA, respectively. Some points in the green line of Figure 3.5 (d) are not available, which is due to the extremely long running time. PCA performed best in most cases, except for the Coil 20 dataset, where UMAP had comparable computational time. This behavior is expected because PCA is a linear transformation, and its time should scale linearly with the number of components in the lower-dimensional space. UMAP and t-SNE were slower than PCA, but it is evident from the MNIST and Jaccard distance-based MNIST datasets that UMAP scales better with the increase in the number of samples. Note that for Jaccard distance-based MNIST, the higher dimensions were not computed because the computational time was too long. For the Facebook Network, UMAP outperforms t-SNE; however, for higher dimensions, t-SNE computed faster. Nonetheless, from our baseline test in Table 3.3, t-SNE does not perform well, indicating instability. Faster computation may indicate premature convergence, which leads to a poor embedding.

Figure 3.5 Computational time at each reduction ratio for (a) Coil 20, (b) Facebook Network, (c) MNIST, and (d) Jaccard distance-based MNIST. The green, orange, and blue lines represent the computational time of t-SNE, UMAP, and PCA, respectively. Not surprisingly, PCA performs the best in the majority of cases, except for the Coil 20 dataset. UMAP and t-SNE perform worse than PCA, but UMAP scales better when there are more samples, as evident from the MNIST and Jaccard distance-based MNIST datasets. Note that for Jaccard distance-based MNIST, the higher dimensions were not computed because the computational time was too long.

3.1.4 SARS-CoV-2 mutation clustering

3.1.4.1 World SARS-CoV-2 mutation clustering

We gathered data submitted to GISAID up to January 20, 2021, for a total of 203,344 samples. We first obtained the SNP information by applying multiple sequence alignment, which leads to 26,844 unique SNPs. Next, we calculated the pairwise Jaccard distance of our dataset in order to generate the Jaccard distance-based features. Here, the number of rows is the number of samples (203,344), and the number of columns is the feature size (203,344). As we mentioned in Section 3.1.2.3, the Jaccard distance-based feature is a square matrix.
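The construction of this Jaccard distance-based feature matrix from the SNP position sets follows Eqs. (3.4)-(3.6) directly. A minimal Python sketch of this step is given below; the function names and the toy SNP profiles are illustrative only and are not the script used to produce the results reported in this section.

import numpy as np

def jaccard_distance(a: set, b: set) -> float:
    # d_J(A, B) = 1 - |A intersect B| / |A union B|  (Eqs. 3.4-3.5)
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty SNP profiles are treated as identical here
    return 1.0 - len(a & b) / union

def jaccard_distance_matrix(profiles):
    # Pairwise N x N distance matrix D(i, j) = d_J(S_i, S_j)  (Eq. 3.6)
    n = len(profiles)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = jaccard_distance(profiles[i], profiles[j])
    return D

# Hypothetical SNP profiles: each set holds the mutated positions of one isolate.
profiles = [{241, 3037, 14408, 23403},
            {241, 3037, 14408, 23403, 25563},
            {8782, 28144}]
print(jaccard_distance_matrix(profiles))

Each row of the resulting square matrix is then treated as the feature vector of the corresponding sample, as described above.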
However, due to the large number of samples and features, applying K-means clustering directly to a feature matrix of size 203,344 × 203,344 is a very time-consuming process. Considering that UMAP outperforms the other two dimensionality reduction algorithms (PCA and t-SNE) on the Jaccard distance-based MNIST dataset, we employ UMAP to reduce our original feature matrix of size 203,344 × 203,344 to 203,344 × 203. To be noted, UMAP is a reliable and stable algorithm that performs consistently in clustering at various reduction ratios. Therefore, there is no need to use the same reduction dimension of 203, and one can choose a different reduction dimension to generate similar results. With the reduced feature matrix of size 203,344 × 203, we split our SARS-CoV-2 dataset into different clusters by applying K-means clustering. After comparing the WCSS under different numbers of clusters, we find, based on the elbow method, that there are 6 clusters forming within the SARS-CoV-2 population, which can be determined from Figure 3A.1 in the Supporting Information. Table 3A.1 in the Supporting Information shows the top 25 single mutations of each cluster. In order to understand the relationships, we also analyzed the co-mutations occurring in each cluster (Table 3.9). Here, we define a co-mutation as mutations that occur simultaneously in one SNP profile. For example, mutations occurring at positions 241 and 3037 in a single SNP sample form the co-mutation [241, 3037]. From Table 3A.1 and Table 3.9 we see the following:

Table 3.9 The frequency and occurrence percentage of SARS-CoV-2 co-mutations from each cluster in the world.

Cluster     Co-mutations                                                                    Frequency   Percentage
Cluster 1   [241, 3037, 14408, 23403, 28881, 28882, 28883]                                  21802       0.926
Cluster 2   [241, 1059, 3037, 14408, 23403, 25563]                                          15008       0.660
Cluster 3   [241, 1163, 3037, 7540, 14408, 16647, 18555, 22992, 23401, 23403,
            28881, 28882, 28883]                                                            2089        0.606
Cluster 4   [241, 3037, 14408, 23403]                                                       13387       0.936
Cluster 5   [241, 3037, 14408, 23403]                                                       124290      0.915
Cluster 6   [241, 3037, 4543, 5629, 9526, 11497, 13993, 14408, 15766, 16889, 17019,
            18877, 22992, 23403, 25563, 25710, 26735, 26876, 28975, 29399]                  3279        0.940

• Though Clusters 1 and 6 seem similar from the top 25 single mutations, the co-mutations tell a different story.

• Clusters 2 and 5 have a high frequency of the [241, 3037, 14408, 23403] mutations, but Cluster 5 has a clear co-mutation descendant with high frequency.

• Cluster 3 has a unique combination of mutations that is only popular in Cluster 3.

• Cluster 6 has a high frequency of multiple co-mutations. Since it shares similarity with Clusters 4 and 5, it may be that Cluster 6 branched from Clusters 4 and 5.

• Cluster 6 has many co-mutations when compared to the other clusters. As seen in Table 3A.2, the majority of its cases are found in Europe, including the United Kingdom (UK), Denmark (DK), the Netherlands (NL), Switzerland (CH), and Luxembourg (LU).

Table 3A.2 in the Supporting Information shows the cluster distributions of samples from 25 countries. Here, we use the ISO 3166-1 alpha-2 codes as the country codes. The listed countries are the United Kingdom (UK), the United States (US), Australia (AU), India (IN), Switzerland (CH), the Netherlands (NL), Canada (CA), France (FR), Belgium (BE), Singapore (SG), Spain (ES), Russia (RU), Portugal (PT), Denmark (DK), Sweden (SE), Austria (AT), Japan (JP), South Africa (ZA), Iceland (IS), Brazil (BR), Saudi Arabia (SA), Norway (NO), China (CN), Italy (IT), and Korea (KR).
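The reduction and clustering pipeline described above (UMAP applied to the rows of the Jaccard feature matrix, followed by K-means and the elbow criterion on the WCSS) can be sketched as follows. This is a minimal illustration assuming the umap-learn and scikit-learn packages; the random stand-in data, the target dimension, and the candidate cluster numbers are placeholders rather than the settings used for the 203,344-sample GISAID matrix.

import numpy as np
import umap                               # umap-learn
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
D = rng.random((500, 500))                # stand-in for the N x N Jaccard distance matrix
D = (D + D.T) / 2
np.fill_diagonal(D, 0.0)

# Each row of the Jaccard matrix is treated as that sample's feature vector,
# and UMAP reduces it to a much lower dimension (203 in the text; 20 here).
embedding = umap.UMAP(n_components=20, random_state=0).fit_transform(D)

# Elbow method: record the within-cluster sum of squares (the KMeans inertia).
wcss = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(embedding).inertia_
        for k in range(2, 11)}

# Inspect the WCSS curve for an elbow, then refit with the chosen number (6 in the text).
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(embedding)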
We can visualize the clusters on the world map in Figure 3.6, which was generated using Highcharts. The underlying color indicates the dominant cluster for each country. Furthermore, from Table 3A.2, we can see the following:

• SNP profiles from the UK and DK are dominated by Cluster 5.

• Cluster 3's SNP profiles are predominantly found in AU. This may indicate that SARS-CoV-2 is mutating differently in AU.

• SNP profiles from the US are found mostly in Clusters 2 and 5.

• Most countries' SNP profiles are found in Clusters 1, 2, 4, 5, and 6, with some clusters having slightly higher counts.

Figure 3.6 Cluster distribution of the global SARS-CoV-2 mutation dataset. Using Highcharts, the world map was colored according to the dominant cluster. For example, the United States has SNP profiles from all clusters, but Cluster 5 (purple) is the dominant type in the US. Only countries with more than 25 sequenced genomes available on GISAID were considered; countries with fewer than 25 samples are colored gray.

Notably, in Table 3.9, Clusters 4 and 5 have the same co-mutations with relatively high frequencies, which indicates that Clusters 4 and 5 share the same "root". Clusters 1, 2, 3, and 6 share this co-mutation with Clusters 4 and 5, indicating that Clusters 1, 2, 3, and 6 may have branched from Clusters 4 and 5 in the 203-dimensional (203D) space. However, we cannot visualize the distribution of our reduced dataset in the 203D space. Therefore, benefiting from the stable and reliable performance of UMAP at various reduction ratios, we reduce the dimension of our original dataset to 2, which enables us to observe the distribution of the dataset in the two-dimensional (2D) space. Figure 3.7 visualizes the distribution of our dataset with 6 distinct clusters using 2D UMAP. It can be seen that Clusters 2, 3, and 4 share the same "root" in the middle. Clusters 3 and 6 are farther away from the center, indicating that they are descendants of the middle root.

In addition, we looked specifically at the spike (S) protein because of its significance in viral infectivity. In all the clusters, 23403A>G (D614G) is present. Studies have shown that D614G increases the infectivity of SARS-CoV-2 [35]; hence the high frequency in our data reflects such infectivity. In Clusters 1, 2, and 4, there are no significant co-mutations in the S protein. In Cluster 3, 100% of the variants contain the co-mutation [22992, 23401, 23403], which further supports its geographical isolation, as it is predominantly found in AU. Cluster 5 does not have a significant co-mutation, but the co-mutations [21614, 22227, 23403, 24334] occurred in 11290 SNP profiles (0.083). Cluster 6 has a pair of co-mutations [22992, 23403], which occurs in 99.7% of samples.

Figure 3.7 2D UMAP visualization of the world SARS-CoV-2 mutation dataset with 6 distinct clusters.

3.1.4.2 United States SARS-CoV-2 mutation clustering

In addition to analyzing the clustering in the world, SNP profiles of SARS-CoV-2 from the US were considered. In this section, the US dataset has 17,164 unique single mutations and 43,395 samples. Therefore, the dimension of the Jaccard distance-based dataset is 43,395 × 43,395. After applying UMAP, we reduce the dimension of the original dataset to 43,395 × 216. Following a similar K-means clustering process as for the world dataset and using the elbow method, we can see from Figure 3A.2 that there are 6 predominant clusters forming in the United States.
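The co-mutation frequencies and occurrence percentages reported in Tables 3.9 and 3.10 amount to counting, within each cluster, how many SNP profiles contain every position of a given combination. A small illustrative sketch is shown below; the profiles, cluster labels, and helper name are hypothetical and are only meant to show the counting logic.

from collections import defaultdict

def comutation_stats(profiles, labels, comutation):
    # For each cluster, count profiles whose SNP set contains every position in
    # `comutation`, and report the within-cluster occurrence fraction.
    target = set(comutation)
    counts, sizes = defaultdict(int), defaultdict(int)
    for snp_set, cluster in zip(profiles, labels):
        sizes[cluster] += 1
        if target <= snp_set:          # co-mutation: all positions occur together
            counts[cluster] += 1
    return {c: (counts[c], counts[c] / sizes[c]) for c in sizes}

# Hypothetical SNP profiles and cluster assignments.
profiles = [{241, 3037, 14408, 23403}, {241, 3037, 14408, 23403, 25563},
            {8782, 28144}, {241, 3037, 14408, 23403, 28881, 28882, 28883}]
labels = [0, 0, 1, 0]
print(comutation_stats(profiles, labels, [241, 3037, 14408, 23403]))
# {0: (3, 1.0), 1: (0, 0.0)}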
Figure 3.8 shows the US map with the cluster statistics. Here, Highcharts was used to generate the plot with pie charts, and each state was colored based on its dominant cluster. Table 3A.3 in the Supporting Information shows the top 25 mutations from each cluster in the United States. The cluster distribution of each state is listed in Table 3A.4. Table 3.10 shows the commonly occurring co-mutations, and we can observe the following:

• Cluster A has a high frequency of the co-mutation [241, 1059, 3037, 14408, 23403, 25563], which is a descendant of the common co-mutation of Cluster 2, [241, 1059, 3037, 14408, 23403, 25563], from Table 3A.3.

• Cluster B has a high frequency of the co-mutation [241, 3037, 14408, 23403], which is a descendant of the common co-mutation of Clusters 4 and 5, [241, 3037, 14408, 23403].

• Cluster C has a high frequency of the co-mutation [241, 3037, 14408, 23403, 28881, 28882, 28883], which is a descendant of the common co-mutation of Cluster 1, [241, 3037, 14408, 23403, 28881, 28882, 28883], from Table 3.10.

• Cluster D has a high frequency of the co-mutation [241, 3037, 14408, 20268, 23403, 28854], which is a descendant of the Clusters 4 and 5 co-mutation [241, 3037, 14408, 23403]. The US accounts for more than one third of mutations at site 23403 and half of mutations at site 28854.

• Clusters E and F have a high frequency of the co-mutations [8782, 17747, 17858, 18060, 28144] and [241, 1059, 3037, 11916, 14408, 18998, 23403, 25563, 29540], respectively, which are descendants of the Clusters 4 and 5 co-mutation [241, 3037, 14408, 23403].

• Cluster F has a high frequency of the co-mutation [241, 1059, 3037, 11916, 14408, 18998, 23403, 25563, 29540], which is a descendant of Cluster 2's co-mutation [241, 1059, 3037, 14408, 23403, 25563].

Table 3.10 The frequency and occurrence percentage of SARS-CoV-2 co-mutations from each cluster in the US.

Cluster     Co-mutations                                                       Frequency   Percentage
Cluster A   [241, 1059, 3037, 14408, 23403, 25563]                             6646        0.702
Cluster B   [241, 3037, 14408, 23403]                                          20442       0.932
Cluster C   [241, 3037, 14408, 23403, 28881, 28882, 28883]                     4429        0.945
Cluster D   [241, 3037, 14408, 20268, 23403, 28854]                            3276        0.643
Cluster E   [8782, 17747, 17858, 18060, 28144]                                 1183        0.744
Cluster F   [241, 1059, 3037, 11916, 14408, 18998, 23403, 25563, 29540]        501         0.789

Figure 3.8 Cluster distribution of the United States SARS-CoV-2 mutation dataset. Using Highcharts, the US map was colored according to the dominant cluster. For example, the United States has SNP profiles from all clusters, but Cluster E (purple) is the dominant type in the US. Only states with more than 25 sequenced genomes available on GISAID were considered in the plot.

Notably, in Table 3.10, Cluster B has a co-mutation that is present in Clusters A, C, D, and F, indicating that Clusters A, C, D, and F are descendants of Cluster B. Interestingly, Cluster E has a completely different set of co-mutations from the other clusters, indicating that it is a different strain of mutations. Considering the stability and reliability of UMAP at various reduction ratios, we apply UMAP to the original US dataset with reduced dimension 2, aiming to observe the distribution of the dataset in the 2D space. Figure 3.9 illustrates the 2D visualization of the US dataset with 6 distinct clusters. We can see that 3 clusters (Clusters A′, B′, and C′) share the same "root" located in the middle of the figure, while the other 3 clusters (Clusters D′, E′, and F′) do not. Cluster E′ is quite distinct from the other clusters.
This confirms our deduction about why Cluster E′ has a high frequency of different co-mutations in Table 3.10. In addition, Cluster D′ is located close to Cluster A′, which may indicate that they have a similar root that diverged.

In addition, we looked at co-mutations on the S protein. Every cluster, except for Cluster E, contains mutation 23403, which is expected due to its ability to increase the infectivity of SARS-CoV-2. Clusters A, C, and F do not have any significant co-mutation occurring in the S protein, aside from 23403. Cluster E does not have a significant co-mutation nor a significant single mutation in the S protein. Cluster B has the co-mutation [22255, 23403], which occurs in 780 samples. Cluster D has the co-mutation [23403, 23604, 24076], which occurs in 892 samples.

Figure 3.9 The 2D UMAP visualization of the US SARS-CoV-2 mutation dataset with 6 distinct clusters.

3.1.5 Discussion

In this section, we compare our past results [9] with our new method to gain a different perspective on clustering the SNP profiles of COVID-19. In our previous work, a total of 8,309 unique single mutations were detected in 15,140 SARS-CoV-2 isolates. Here, we also calculate the pairwise distance among the 15,140 SNP profiles and set the number of clusters to six. Table 3A.5 shows the cluster distribution of samples from the 15 countries [9]. The listed countries are the United States (US), Canada (CA), Australia (AU), the United Kingdom (UK), Germany (DE), France (FR), Italy (IT), Russia (RU), China (CN), Japan (JP), Korea (KR), India (IN), Spain (ES), Saudi Arabia (SA), and Turkey (TR), and we use Clusters I, II, III, IV, V, and VI to represent the six clusters obtained without applying any dimensionality reduction algorithm. Table 3A.6 lists the cluster distribution of samples from the same 15 countries, where we use Ip, IIp, IIIp, IVp, Vp, and VIp to represent the six clusters obtained with PCA at a reduction ratio of 1/160. Table 3A.7 lists the cluster distribution of samples from the same 15 countries, where we use Iu, IIu, IIIu, IVu, Vu, and VIu to represent the six clusters obtained with UMAP at a reduction ratio of 1/160. Noticeably, the SNP profiles are concentrated in Cluster Iu, whereas in the non-reduced version, the samples are more spread out. This may be caused by the large number of features, which makes the computed distances between the centroid and each data point too similar, leading to samples being placed in incorrect clusters. Not surprisingly, PCA and the original method of [9] have nearly identical results. It has been shown in [9] that PCA is the continuous solution of the cluster indicators in the K-means clustering method. On the other hand, UMAP shows a slightly different result. In the PCA method, the distribution is more spread out. In addition, the top occurrence for each country is higher for UMAP. On the other hand, we see that there are more samples in Cluster Iu for UMAP, which may indicate that mutations in Cluster Iu are the main strain. Moreover, Figure 3.10 illustrates the 2D visualizations of the US dataset up to June 01, 2020, with 6 distinct clusters by applying two different dimensionality reduction algorithms. We can see that the data are distributed in a disordered manner under both the PCA- and UMAP-assisted K-means clustering algorithms.
Specifically, the PCA-assisted algorithm has a really poor clustering performance, while the UMAP- assisted algorithm forms more clear and better clusters than the PCA-assisted algorithm, which is consistent with our previous analysis in Section 3.1.3.1. Table 3.11 shows co-mutations occurred in each cluster from the UMAP-assisted K-means from data collected up to June 01, 2020. Cluster IIIu has 2 dominant co-mutations. Note that the dataset had 15140 SARS-CoV-2 isolates, whereas our current dataset has over 200,000 isolates. Nonetheless, we can compare the clusters to see which clusters persists. Cluster 1’s co-mutations are the same as those of Cluster Vu, indicating that Cluster 1 may have been derived from Cluster Vu. Cluster 2 shares the same co-mutations as those of Cluster IIu. Cluster 3’s co-mutations are the descendants of Cluster Vu. Clusters 4 and 5 have the same co-mutations as those of Clusters IIIu and VIu, indicating Clusters 4 and 5 are derived from Cluster IIIu and VIu. Cluster 6’s co- 77 mutations are descendants of Clusters IIIu and VIu. Note that co-mutations of Cluster Iu and the second set of co-mutations of Cluster IIu ([8782, 28144]) are not predominant co-mutations in our dataset, which may indicate a weaker infectivity. For example, every co-mutation in Table 3.9 has mutation 23403A>G (D614G) in the spike protein, which has been shown to increase infectivity of COVID-19 [35]. It is not surprising to see a co-mutation group not being dominant in our current dataset. By comparing these co-mutations, we can see that co-mutations that are dominant in both datasets (up to June 01 and January 20) will most likely persist in the future. Table 3.11 The frequency and occurrence percentage of SARS-CoV-2 co-mutations from each clusters collected from June 01, 2020. Cluster Cluster Iu Cluster IIu Cluster IIIu Cluster IVu Cluster Vu Cluster VIu Co-mutations [11083, 14805, 26144] [241, 3037, 14408, 23403, 25563] [241, 3037, 14408, 23403] [8782, 28144] [241, 1059, 3037, 14408, 23403, 25563] [241, 3037, 14408, 23403, 28881, 28882, 28883] [241, 3037, 14408, 23403] Frequency Percentage 948 2800 1468 1475 1318 1872 2222 0.730 0.893 0.412 0.414 0.621 0.817 0.969 Figure 3.10 2D visualizations of the US SARS-CoV-2 mutation dataset up to June 01, 2020 with 6 distinct clusters by applying two different dimensional reduction algorithms. (a) 2D PCA visualization. (b) 2D UMAP visualization. 3.1.6 Conclusion The rapid global spread of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has led to genetic mutation stimulated by genetic evolution and adaptation. Up to January 20, 2021, 203,344 complete SARS-CoV-2 78 sequences, and a total of 26,844 unique SNPs have been detected. Our previous work traced the COVID-19 transmission pathways and analyzed the distribution of the subtypes of SARS-CoV-2 across the world based on 15,140 complete SARS-CoV-2 sequences. The K-means clustering separated the sequences into six distinguished clusters. However, considering the tremendous increase in the number of available SARS-CoV-2 sequences, an efficient and reliable dimensional reduction method is urgently required. Therefore, the objective of the present work is to explore the best suited dimension reduction algorithm based on their performance and effectiveness. 
Here, a linear algorithm PCA and two non-linear algorithms, t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP), have been discussed. To evaluate the performance of dimension reduction techniques in clustering, which is an unsupervised problem, we first cast classification problems into clustering problems with labels. Next, by setting different reduction ratios, we test the effectiveness and accuracy of PCA, t-SNE, and UMAP for k- NN and K-means using four benchmark datasets. The results show that overall, UMAP outperforms other two algorithms. The major strengths of UMAP is that UMAP-assisted k-NN classification and UMAP-assisted K-means clustering at various dimension reduction ratios have a consistent performance in terms of accuracy, which proves that UMAP is a stable and reliable dimension reduction algorithm. Moreover, compared to the K-means clustering accuracy that does not involve any dimensional reduction, UMAP-assisted K-means clustering can improve the accuracy for most cases. Furthermore, when the dimension is reduced to two, the UMAP clustering visualization is clear and elegant. Additionally, UMAP is a relatively efficient algorithm compared to t-SNE. Although PCA is a faster algorithm, its major limitation is its poor performance in accuracy. To be noted, UMAP performs better than PCA and t-SNE for the dataset with a large number of samples, indicating it is the best suited dimensional reduction algorithm for our SARS-CoV-2 mutation dataset. Moreover, we apply the UMAP-assisted K-means clustering to the world SARS-CoV-2 mutation dataset (up to January 20, 2021), which displays six distinct clusters. Correspondingly, the same approaches are also applied to the United States SARS-CoV-2 mutation dataset (up to January 20, 2021), resulting in six different clusters as well. Furthermore, we provide a new 79 perspective by utilizing UMAP-assisted K-means clustering to analyze our previous SARS-CoV-2 mutation datasets, and the 2D visualization of UMAP-assisted K-means clustering of our previous world SARS-CoV-2 mutation dataset (up to June 01, 2020) forms more clear clusters than the PCA-assisted K-means clustering. Finally, one of our four datasets was generated by the Jaccard distance representation, which improves both kNN classification and k-means clustering accuracies on the original dataset. 3.2 K-mer Topology for Whole Genome Analysis 3.2.1 Introduction Topological data analysis (TDA) is an emerging field in data science and has also been utilized in DNA sequence alignment. TDA, or more specifically, persistent homology, begins by first representing the point cloud data with vertices, edges, triangles, tetrahedra, etc., or more generally, a simplicial complex. Then, concepts from algebraic topology, such as connected components, holes, and voids, are utilized to extract topological invariants, and filtration is applied to capture the persistence of such topological invariants across different scales. Chan et al. [36] utilized persistent homology to model both vertical and horizontal evolution in viruses, including clonal evolution, reassortment, and recombination. They computed sequence dissimilarity via sequence alignment and used this information to calculate persistent homology based on genetic dissimilarity. The 0th order homology represents vertical evolution, while the 1st order homology corresponds to horizontal evolution. 
This method was applied to SARS-CoV-2 to analyze mutations and utilized a novel metric called the topological recurrence index (tRI), which calculated the number of cycles in a tree and was used to measure convergent evolution [37]. In Nguyen et al. [38], persistent homology was applied to the CGR representation of viral sequences, and the resulting persistence diagram was computed. Subsequently, the Wasserstein distance was calculated between these diagrams to construct the phylogenetic tree. These methods represent some of the first adoptions of persistent homology for DNA sequence analysis. However, due to the high computational cost associated with computing persistent homology, their approach may not be suitable for longer sequences, general whole viral sequences, and large-scale comparisons.

In this work, we introduce k-mer topology, a novel persistent homology approach based on k-mer positions. For each k-mer, we construct position-based distances. Using the standard persistent homology methodology, we obtain Betti curves for each k-mer. We benchmarked our method using the NCBI virus reference genomes and conducted phylogenetic tree analysis on six datasets. Our method demonstrates the highest accuracy in NCBI reference virus classification and proves to be an effective tool for phylogenetic analysis. Furthermore, we demonstrate that our method can handle large whole bacterial genomes, a capability not achievable with other persistent homology tools.

3.2.2 Method

In this section, we describe the k-mer specific topology. For an overview of persistent homology, refer to Section 2.4.

3.2.2.1 Persistent Homology on Nucleotide Sequences

In this section, we describe the persistent homology for nucleotide sequences and the construction of the features. Then, we define a metric on the features, which will be used for the phylogenetic analysis.

Position based distance Let S = s_1 s_2 ... s_N be a DNA sequence of length N, where s_i ∈ {A, C, G, T} is a nucleotide. Define the nucleotide-specific indicator function

δ_l(s_i) = 1 if s_i = l, and 0 otherwise, (3.7)

where l ∈ {A, C, G, T}. Then, we can define a nucleotide-specific position vector S_l as

S_l = {i | δ_l(s_i) = 1, 1 ≤ i ≤ N}. (3.8)

For example, for a sequence S = CGGATAACGTCCAGCAGTCAGTGATCGCATATCTTGAC, we have

S_A = [4, 6, 7, 13, 16], (3.9)
S_C = [1, 8, 11, 12, 15], (3.10)
S_G = [2, 3, 9, 14, 17], (3.11)
S_T = [5, 10]. (3.12)

The distance based on the nucleotide positions, denoted D_l, is computed as follows:

D_l(i, j) = |S_l(i) − S_l(j)|. (3.13)

For example, the position-based distance matrix for S_A is

D_A =
[  0   2   3   9  12 ]
[  2   0   1   7  10 ]
[  3   1   0   6   9 ]
[  9   7   6   0   3 ]
[ 12  10   9   3   0 ].  (3.14)

Additionally, we can define a k-mer-specific position, where instead of focusing solely on individual nucleotides, we examine strings of nucleotides of length k. For a given k, there are 4^k different combinations of k-mers, denoted l_1, l_2, ..., l_{4^k}. Similarly, we can define the k-mer-specific indicator function as

δ_{l_p}(s_i s_{i+1} ... s_{i+k−1}) = 1 if s_i s_{i+1} ... s_{i+k−1} = l_p, and 0 otherwise. (3.15)

Then, the k-mer-specific position vector is given by

S_{l_p} = {i | δ_{l_p}(s_i s_{i+1} ... s_{i+k−1}) = 1, 1 ≤ i ≤ N − k + 1}. (3.16)

Then the k-mer-specific position-based distance can be defined as D_{l_p}(i, j) = |S_{l_p}(i) − S_{l_p}(j)|.
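These position-based constructions translate directly into a few lines of code. The sketch below, which uses a prefix of the example sequence above as a toy input, builds the k-mer-specific position vector of Eq. (3.16) and the corresponding position-based distance matrix; the helper names are illustrative and this is not the implementation used for the benchmarks.

import numpy as np
from itertools import product

def kmer_positions(seq: str, kmer: str):
    # 1-based start positions of `kmer` in `seq` (Eq. 3.16).
    k = len(kmer)
    return [i + 1 for i in range(len(seq) - k + 1) if seq[i:i + k] == kmer]

def position_distance_matrix(positions):
    # D_{l_p}(i, j) = |S_{l_p}(i) - S_{l_p}(j)| from the position vector.
    s = np.asarray(positions, dtype=float)
    return np.abs(s[:, None] - s[None, :])

seq = "CGGATAACGTCCAGCAG"             # prefix of the example sequence, for illustration
for kmer in ("A", "AG"):
    pos = kmer_positions(seq, kmer)
    print(kmer, pos)                  # A -> [4, 6, 7, 13, 16]; AG -> [13, 16]
    print(position_distance_matrix(pos))

# The full k-mer alphabet l_1, ..., l_{4^k} can be enumerated when needed, e.g. for k = 2:
all_2mers = ["".join(p) for p in product("ACGT", repeat=2)]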
Using the same sequence as an example, the AG-specific position vector and distance matrix are

S_AG = [13, 16],  D_AG =
[ 0  3 ]
[ 3  0 ].  (3.17)

Persistent homology features Using the position-based distance, Ripser [39] was used to construct the simplicial complex and the persistence barcodes. Then, Gudhi [40] was used to obtain the Betti-0 curves. Denote β_{0,r}^{l_p} as the Betti number at filtration value r for a particular k-mer l_p. For example, β_{0,3}^{AT} would be the Betti-0 number of the 2-mer AT at filtration value 3. We can then collect the Betti values across increasing filtration values to obtain the Betti curve β_0^{l_p}, which is defined as the vector

β_0^{l_p} = (β_{0,0}^{l_p}, ..., β_{0,r}^{l_p}, ..., β_{0,R}^{l_p}), (3.18)

where R is the maximum number of filtrations.

Figure 3.11 shows an example of a persistence barcode and the corresponding Betti curve. We generated a random sequence of length N = 100 and allowed the sequence to be circular. For such a circular sequence, we can define the distance between S_{l_p}(i) and S_{l_p}(j) to be

D_{l_p}(i, j) = min(S_{l_p}(j) − S_{l_p}(i), N − S_{l_p}(j) + S_{l_p}(i)) (3.19)

for i < j. Then, the persistent homology is based on a loop. The left panel shows the persistence barcode of l = A, where the barcode shows the birth and death of connected components (H0 in blue), loops (H1 in green), and cavities (H2 in red). The right panel shows β_0^l. The 4 colors correspond to the nucleotides l = A, C, G, T.

Figure 3.11 Persistence barcode and Betti curve of nucleotide l = A of a randomly generated circular sequence of length 100 base pairs. (a) Blue, red, and green correspond to the birth and death of H0, H1, and H2 features. (b) The Betti curves of A (red), C (green), G (blue), and T (purple). The y-axis corresponds to the β0 number at a particular filtration value.

Additionally, we provide a real virus example using NC_001330, a reference sequence from the Microviridae family. The Microviridae family comprises bacteriophages with single-stranded DNA (ssDNA) genomes. NC_001330 consists of 6087 base pairs, including 1431 A, 1325 C, 1425 G, and 1906 T. Figure 3.12 shows the Betti curves for individual nucleotides, as well as for 2-mers, 3-mers, and 4-mers.

Figure 3.12 Betti curves of 1-mers, 2-mers, 3-mers, and 4-mers of the NC_001330 reference genome of the Microviridae family.

For a given k, there are a total of 4^k combinations of k-mers. Therefore, if the sequence were truly random, we would expect to see the same k-mer roughly every 4^k positions. Not surprisingly, as we increase k, we see that the connected components persist for a longer time, which indicates that the filtration range must increase. Moreover, in order to reduce the number of features we obtain from the Betti curves, we compute the Betti numbers at a step size of 4^{k−1} for a total of 50 steps.

Distance on the k-mer persistent homology features We now define the metric used to perform the phylogenetic analysis. For each k, we have 4^k different k-mer-specific Betti curves. We concatenate these Betti curves to obtain the vector

β^k = (β_0^{l_1}, ..., β_0^{l_{4^k}}). (3.20)

Then, for viruses i and j, we can define the metric between the k-mer-specific Betti curves as

dist_k(i, j) = ∥β_i^k − β_j^k∥_2, (3.21)

where ∥·∥_2 denotes the Euclidean distance. Then, we can take a weighted sum over different k to obtain the distance between viruses i and j,

Dist(i, j) = Σ_{k=1}^{K} a_k dist_k(i, j).
(3.22) 3.2.3 Results 3.2.3.1 Classification of the reference viral genome In order to verify the effectiveness of our method, we conducted classification on the viral reference genome from the National Center for Biotechnology Information (NCBI) virus database. NCBI adopts the virus taxonomy from the International Committee on Taxonomy of Viruses (ICTV), which updates the nomenclature of viruses periodically according to new findings. We considered NCBI dataset collected in 2020, 2022 and 2024, and the details of the preprocessing process can be found in Table 3.12. For NCBI 2020 and NCBI 2022, we have removed sequences that were no longer considered reference sequences as of January 20, 2024. For NCBI 2024, we removed families without a proper taxonomy rank, i.e., any sequences with a family name not ending in ’-viridae’. For example, the viral family Tolecusatellitidae was removed because its rank is undetermined. We included these undetermined families in NCBI 2024 full. Additionally, we included sequences with invalid nucleotides to simulate real-world scenarios. We benchmarked our method on these data because viral taxonomy is regularly updated. Therefore, our method not only needs to identify the correct family but also find the most similar sequence within that family. For example, viral families such as Herpesviridae and Reoviridae from the NCBI 2020 and NCBI 2022 datasets were abolished in late 2022, with some sequences being divided into smaller families or remaining unclassified. Therefore, it is crucial for the alignment method to be robust to these changes and accurately identify the correct viral family. 86 Table 3.12 Dataset, NCBI collection date, proprcessing procedure, number of families and number of sequences. Name NCBI date NCBI 2020 [41] 03/2020 NCBI 2022 03/2022 NCBI 2024 01/20/2024 NCBI 2024 full 01/20/2024 Removed sequence Unknown Baltimore class Unknown family families <3 sequence Partial sequence Unknown family families <2 sequence Invalid nucleotides Partial sequence Unknown family Only ’-viridae’ families <3 sequence Invalid nucleotide Partial sequence Unknown family families <3 sequence # families # sequence 83 6,993 123 11,428 199 12,154 209 13,645 For benchmarking, we adopt the procedure outlined in [41]. After constructing the k-mer spe- cific Betti curves and the distance matrix, we performed a 1-nearest neighbor (1-NN) classification . The identification is considered correct if the most similar sequence has the same family label. This simulates a real-world scenario, where identifying the most similar sequence is critical for further analysis. We compare our method with generalize natural vector (GNV) and Markov K-string (Markov) model [42], and their details can be found in section 2.5 Table 3.13 show the 1-NN classification accuracy of individual k-mer specific Betti curves as well as the weighted sum. 3-mers and 4-mers have the highest individual accuracy, and 1-mers have the lowest accuracy. The weighted sum has the higher accuracy on all the data except for NCBI 2022. Table 3.13 1-NN classification of the 4 dataset. The accuracy of individual k-mer specific Betti curves were obtained, as well as the weighted sums. The bolded number correspond to the highest accuracy. 
Data NCBI 2020 NCBI 2022 NCBI 2024 NCBI 2024 full k = 1 0.817 0.814 0.750 0.748 k = 2 0.902 0.894 0.857 0.857 87 k = 4 k = 5 k = 3 Sum 0.929 0.933 0.927 0.933 0.920 0.912 0.917 0.828 0.898 0.887 0.887 0.901 0.889 0.921 0.895 0.897 Figure 3.13 shows the comparison of our method to k-mer specific GNV and Markov model. For GNV, we utilized order 2 for each k and, used Euclidean distance to construct the distance matrix. Notice that our model’s accuracy increases as k increases, plateauing at k = 4. On the other hand, GNV and Markov model decreases in accuracy at k = 2 and k = 3, respectively. The decrease in accuracy for the Markov model is not surprising. This is because the Markov k-string method assesses sequence differences by comparing the observed appearance of k-strings with the predicted appearance of k-strings computed through the Markov model. The predicted model evaluates the left and right k – 1 substrings, along with the middle k – 2 substring. Therefore, 3-mers exhibit the most reliable computation of predicted appearance. The decrease in GNV performance is consistent with the report given by the original authors [41]; however, it is important to note the GNV’s strength comes from the metric defined on all the k-mer specific GNV. (a) NCBI 2020 (b) NCBI 2022 (c) NCBI 2024 (d) NCBI 2024 All Figure 3.13 Comparison of k-mer topology with GNV and Markov models. For GNV, order 2 for each k is presented. 88 Figure 3.14 shows the comparison of our method with GNV and the Markov k-string model. For GNV, we utilized order 2 with K = 9, and for the Markov model, we employed k = 3. In all models, there is a decrease in accuracy from 2020 to 2024, which is not surprising given the increase in the number of families and the division of large families into smaller ones. Notably, the Markov model proves to be less robust to these changes. Interestingly, our method demonstrates slightly better performance on NCBI 2024 All data compared to NCBI 2024 data. This is promising for real-world applications, as most sequences contain some invalid nucleotides due to experimental errors. Moreover, our method outperforms GNV by 6.13%, 5.12%, 8.33%, and 9.17% for NCBI 2020, NCBI 2022, NCBI 2024, and NCBI 2024 All data, respectively. Figure 3.14 Accuracy of k-mer topology, GNV and Markov models on the 4 dataset. For k-mer topology and GNV, the weighted sum over the different k was used. For GNV, k = 9 and order 2 was used, and for Markov model, k = 3 was used. 3.2.3.2 Phylogenetic analysis using k-mer topology We constructed a phylogenetic tree utilizing the k-mer topology method. After computing the distance, we utilized unweighted pair group method with arithmetic mean (UPGMA) method to construct the tree. Ebolavirus We obtained 59 complete genomes of ebolavirus, representing 5 species: Bundibu- gyo virus (BDBV), Reston virus (RESTV), Ebola virus (EBOV, formerly known as Zaire virus), Sudan virus (SUDV), and Tai Forest virus (TAFV). Details of the data can be found in Table 3B.1. 89 The EBOV species was further divided into 6 subtypes based on the location and year of the outbreak: Zaire (now known as DRC) pandemic of 1976-1977, DRC pandemic of 2007, Guinea outbreak of 2014, Gabon outbreak of 1994, and DRC outbreak of 1995. Figure 3.15 displays the phylogenetic tree generated using our method. We employed K = 5 to compute the k-mer topology features and utilized the UPGMA algorithm to generate the phylogenetic tree. 
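The distance computation and UPGMA step described at the start of this subsection, which is reused for each of the datasets below, can be sketched as follows. The random stand-in Betti-curve features and the weights a_k are placeholders; SciPy's average-linkage hierarchical clustering is used here because average linkage implements UPGMA, but this is an illustrative sketch rather than the exact script behind the figures.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
n_seq = 6
# betti[k] holds one concatenated Betti-curve vector beta^k per sequence
# (4^k k-mers x 50 filtration steps); random stand-ins are used here.
betti = {k: rng.random((n_seq, 4**k * 50)) for k in (1, 2, 3)}
weights = {1: 1.0, 2: 1.0, 3: 1.0}     # the a_k of Eq. (3.22), illustrative values

# Dist(i, j) = sum_k a_k * ||beta^k_i - beta^k_j||_2  (Eqs. 3.21-3.22)
dist = sum(w * squareform(pdist(betti[k], metric="euclidean"))
           for k, w in weights.items())

# UPGMA = average-linkage hierarchical clustering on the condensed distances.
upgma_tree = linkage(squareform(dist, checks=False), method="average")
# upgma_tree can be rendered with scipy.cluster.hierarchy.dendrogram or exported
# to a tree viewer to produce phylogenetic trees like those shown below.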
The colors of the clades and labels correspond to the ebolavirus types and the EBOV subtypes. Notably, BDBV, RESTV, EBOV, SUDV, and TAFV are separated into distinct clades. Additionally, the EBOV subtypes are placed in separate clades, representing an improvement over [43], where the Fuzzy Integral Similarity method incorrectly placed one of the DRC outbreak of 2007 strains into the original Zaire pandemic clade. 90 Figure 3.15 Phylogenetic tree of ebolavirus complete genomes, including those from Bundibugyo virus (BDBV), Reston virus (RESTV), Ebola virus (EBOV, formally known as Zaire virus), Sudan virus (SUDBV) and Tai Forest virus (TAFV). EBOV was divided into 6 subtypes corresponding to the pandemic or outbreak. The labels and clades were colored according to their types and pandemic. Mammalian Mitochondria RNA We obtained 41 full mitochondria genome of mammals, con- sisting of 8 types Artiodactyla, Carnivore, Cetacea, Erinaceomorpha, Lagomorpha, Perissodactyla, Primates, Rodentia. The details of the data can be found in Table 3B.2. We utilized K = 4 to compute the k-mer topology features and the metric, and the UPGMA algorithm was employed 91 to generate the phylogenetic tree. Figure 3.16 illustrates the phylogenetic tree of the 8 classes of species. The labels and the clade are colored according to their group labels. Our method correctly clusters each type into individual clades, which is an improvement from other methods, which often placed Rabbit (Lagomorpha) into a clade with Carnivores [43]. Figure 3.16 Phylogenetic tree of mammalian mitochondrial genomes, including those from Artio- dactyla, Carnivora, Cetacea, Erinaceomorpha, Lagomorpha, Perissodactyla, Primates, and Roden- tia. Details of the data can be found in Table 3B.2. The species and clades are colored according to their groups. K = 4 was used to compute the k-mer topology features, and the UPGMA algorithm was used to construct the tree. Rhinovirus We obtained 113 rhinbovirus (HRV) and 3 non-rhinovirus outgroup. HRV is a positive-sense, single-stranded RNA virus (ssRNA(+)) belonging to the viral family Picornaviridae and the genus Enterovirus, and is often the predominent cause of common cold [44, 45]. The details 92 of the data can be found in Table 3B.3. There are 3 distinct groups within Picornaviridae, HRV- A, HRV-B and HRV-C. Figure 3.17 show the phylogenetic tree of 113 whole HRV genome and 3 outgroup sequence (HEV-C). We utilized K = 3 to compute the homology features and the distance. The labels and clades are colored according to their groups. Figure 3.17 Phylogenetic tree of 113 rhinovirus (HRV) and 3 non-rhinovirus outgroup. The colors correspond to different host animals. K = 5 was used to compute the k-mer topology features and distance, and the UPGMA algorithm was used to construct the tree. Our method correctly separates all the groups into individual clades. However, we noticed that HRV-C shares a clade with the HEV-C outgroup. This result is similar to the findings in [46], where HRV-A and HRV-B shared a clade, and HRV-C shared a clade with the HRV-A/HRV-B parent clade. 93 Coronavirus We obtained 30 coronavirus full genome sequences, which range from about 25,000 to 30,000 nucleotides. This dataset has been studied in various other studies [47, 48, 49, 46, 43]." Coronaviruses are enveloped, single-stranded, positive-sense RNA (ssRNA(+)) viruses belonging to the Coronaviridae family. 
They are known to infect many species, including bats, birds, humans, and other mammals, and can spread across different species [50, 51]. With the SARS-CoV-2 (COVID-19) pandemic, there has been interest in uncovering the evolutionary relationships among coronaviruses. The details of the data can be found in Table 3B.4. Figure 3.18 shows the phylogenetic tree generated by our method. K = 7 was used to compute the k-mer topology features and the metric, and the UPGMA algorithm was used to construct the tree. The samples were colored according to the 5 coronavirus groups.

Figure 3.18 Phylogenetic tree of 30 coronavirus and 4 non-coronavirus outgroup whole genomes. The colors correspond to the different coronavirus groups and the outgroup. K = 8 was used to compute the k-mer topology features and distance, and the UPGMA algorithm was used to construct the tree.

Our method correctly clusters the non-coronavirus outgroups and groups 2, 3, 4, and 5. The NL63 strain from group 1 was not clustered with the other group 1 human coronaviruses, consistent with many alignment-free studies [43]. Additionally, our method separates group 2 coronaviruses into their hosts, namely the mouse hepatitis virus (MHV) and the bat coronavirus (BCoV).

Influenza Type A We obtained 36 full genomes of influenza A viruses. Influenza A viruses are single-stranded, segmented RNA viruses classified based on the surface proteins hemagglutinin (H) and neuraminidase (N) [52]. Additionally, they pose a major health threat to humans and have resulted in numerous outbreaks [53]. We obtained 5 subtypes, H7N9, H7N3, H2N2, H5N1, and H1N1, for the analysis, and the details of the data can be found in Table 3B.5. Figure 3.19 shows the phylogenetic trees generated using our method. The left and right figures correspond to homology features and distances computed using K = 5 and K = 1, respectively. Notice that with K = 5, we have multiple clades of H1N1 that share the same clade as H5N1 strains; however, K = 1 separated the H5N1 and H1N1 strains into different clades. This indicates potential overfitting with K = 5.

Figure 3.19 Phylogenetic trees of 36 influenza A viruses. The colors correspond to different subtypes, including H7N9, H7N3, H2N2, H5N1, and H1N1. (a) K = 5. (b) K = 1. After computing the k-mer topology features and distances, the UPGMA algorithm was used to construct the trees. The labels and the clades are colored according to the subtypes.

16S rDNA from Bacteria Isolates We obtained 40 bacterial sequences that were cultured from endophytic bacteria isolated from the surface-sterilized mature endosperms of 6 rice varieties. The 40 sequences have lengths of less than 1500 bp. The details of the data can be found in Table 3B.6. Figure 3.20 shows the phylogenetic tree of the 16S rDNA from bacteria isolates using our method. We utilized K = 5 to compute the homology features and the distance. Then, the UPGMA algorithm was used to obtain the tree. The clades and labels were colored according to the bacterial families.

Figure 3.20 Phylogenetic tree of 40 16S rDNA sequences from bacteria isolates. The colors correspond to the different bacterial families. K = 5 was used to compute the k-mer topology features and distance, and the UPGMA algorithm was used to construct the tree.

Our method shows a main clade for each of Actinomycetales, Xanthomonadales, Enterobacteriales, Pseudomonadales, and Rhizobiales. However, the top clade shows a mix of misclustered sequences. This is consistent with [43], but our method shows dominant clades for each bacterial family. This result suggests a limitation of our method for shorter sequences, where more computationally expensive methods, such as multiple sequence alignment, can be of great benefit.
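As a point of reference for the per-k features used throughout this section, the following is a simplified, hypothetical sketch of one ingredient the text describes: collecting the positions at which each k-mer occurs in a sequence and summarizing that one-dimensional point set with persistent homology. It is not the exact k-mer topology construction, and the use of GUDHI [40] for the persistence computation is an assumption made only for illustration.

```python
# Simplified sketch (not the exact k-mer topology construction): for each k-mer,
# gather its occurrence positions and compute 0-dimensional persistence of that
# one-dimensional point set with a Vietoris-Rips complex.
from collections import defaultdict
from itertools import product
import numpy as np
import gudhi

def kmer_position_persistence(seq: str, k: int):
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)              # occurrence positions of each k-mer
    summaries = {}
    for kmer in ("".join(p) for p in product("ACGT", repeat=k)):
        pos = positions.get(kmer, [])
        if not pos:
            summaries[kmer] = np.empty((0, 2))         # k-mer absent from the sequence
            continue
        pts = np.array(pos, dtype=float).reshape(-1, 1)
        rips = gudhi.RipsComplex(points=pts, max_edge_length=float(len(seq)))
        st = rips.create_simplex_tree(max_dimension=1)
        st.persistence()                               # compute birth-death pairs
        summaries[kmer] = st.persistence_intervals_in_dimension(0)
    return summaries

# Example: 0-dimensional birth-death pairs of the positional point set of every 3-mer.
bars = kmer_position_persistence("ACGTACGTTAGCACGT", k=3)
```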
Large bacteria genome We obtained 30 bacterial whole genomes from the families Bacillaceae, Borreliaceae, Burkholderiaceae, Clostridiaceae, Desulfovibrionaceae, Enterobacteriaceae, Rhodobacteraceae, Staphylococcaceae, and Yersiniaceae. The details of the data can be found in Table 3B.7. All the sequences have over 1 million base pairs, and we used this dataset to test the computational limits of our method. Because the sequences are long, we computed the homology of 3-mers, 4-mers, and 5-mers, and computed the distance using these 3 k-mers. Figure 3.21 shows the phylogenetic tree generated using our method. After the distance was computed, we utilized the UPGMA algorithm to construct the tree. The labels and clades were colored according to the family labels.

Figure 3.21 Phylogenetic tree of 30 bacterial whole genomes. Betti curves for 3-mers, 4-mers, and 5-mers were computed, and the distance was computed using these 3 k-mers. The labels and clades were colored according to their family labels.

Our method correctly separates the individual families into their own clades. However, our method fails to place the samples into the appropriate phyla. Enterobacteriaceae, Yersiniaceae, Burkholderiaceae, and Rhodobacteraceae belong to the phylum Pseudomonadota, and our method only clusters Enterobacteriaceae and Yersiniaceae into the same clade. Additionally, Staphylococcaceae, Clostridiaceae, and Bacillaceae belong to the phylum Bacillota, but only Staphylococcaceae and Clostridiaceae share a clade. This is most likely because our method was not able to compute the 1-mer and 2-mer topology for these long sequences, and the lower-order k-mers should provide insight into higher-order taxonomy.

3.2.4 Discussion and conclusion

K-mer topology is a novel method that computes the persistent homology based on the positions of the k-mers. Our method outperformed the GNV method in viral classification tasks on different versions of the data, indicating its robustness to changes in the taxonomy. We have also validated our method on standard phylogenetic analyses, including bacterial sequences, indicating the scalability of our method. However, our method does come with limitations. First, the accuracy for identifying the correct family in bacteria and viruses is high; however, our method does not perform well when clustering higher-order taxonomy, most notably in the case of bacteria. This is due to the high computational cost of the 1-mer and 2-mer topology for long sequences. Additionally, analyzing genomes longer than bacterial genomes is not possible at this moment. In the future, we want to look at more algebraic topology tools, such as topological Laplacians, hypergraphs, and more. Moreover, we would like to evaluate the nonharmonic portion of our spectra to obtain more geometric information about the DNA sequence. Additionally, to combat the computational limitation for long sequences, we would like to explore the potential of taking segments to increase the efficiency of our algorithm. Lastly, we would like to explore applications to protein sequences.

BIBLIOGRAPHY

[1] Covid-19 weekly epidemiological update, 19 January 2021, 2021.

[2] The species severe acute respiratory syndrome-related coronavirus: classifying 2019-ncov and naming it sars-cov-2. Nature microbiology, 5(4):536–544, 2020.
[3] Fan Wu, Su Zhao, Bin Yu, Yan-Mei Chen, Wen Wang, Zhi-Gang Song, Yi Hu, Zhao-Wu Tao, Jun-Hua Tian, Yuan-Yuan Pei, et al. A new coronavirus associated with human respiratory disease in china. Nature, 579(7798):265–269, 2020. [4] Intikhab Alam, Allan A Kamau, Maxat Kulmanov, Łukasz Jaremko, Stefan T Arold, Arnab Pain, Takashi Gojobori, and Carlos M Duarte. Functional pangenome analysis shows key features of e protein are preserved in sars and sars-cov-2. Frontiers in cellular and infection microbiology, 10:405, 2020. [5] Yu-Nong Gong, Kuo-Chien Tsao, Mei-Jen Hsiao, Chung-Guei Huang, Peng-Nien Huang, Po-Wei Huang, Kuo-Ming Lee, Yi-Chun Liu, Shu-Li Yang, Rei-Lin Kuo, et al. Sars-cov- 2 genomic surveillance in taiwan revealed novel orf8-deletion mutant and clade possibly associated with infections in middle east. Emerging microbes & infections, 9(1):1457–1466, 2020. [6] Peter Forster, Lucy Forster, Colin Renfrew, and Michael Forster. Phylogenetic network analysis of sars-cov-2 genomes. Proceedings of the National Academy of Sciences, 117(17):9241– 9243, 2020. [7] Xingguang Li, Junjie Zai, Qiang Zhao, Qing Nie, Yi Li, Brian T Foley, and Antoine Chaillon. Evolutionary history, potential intermediate animal host, and cross-species analyses of sars- cov-2. Journal of medical virology, 92(6):602–611, 2020. [8] Sunitha M Kasibhatla, Meenal Kinikar, Sanket Limaye, Mohan M Kale, and Urmila Kulkarni- Kale. Understanding evolution of sars-cov-2: a perspective from analysis of genetic diversity of rdrp gene. Journal of medical virology, 92(10):1932–1937, 2020. [9] Rui Wang, Yuta Hozumi, Changchuan Yin, and Guo-Wei Wei. Decoding sars-cov-2 transmis- sion and evolution and ramifications for covid-19 diagnosis, vaccine, and medicine. Journal of chemical information and modeling, 60(12):5853–5865, 2020. [10] Rui Wang, Jiahui Chen, Kaifu Gao, Yuta Hozumi, Changchuan Yin, and Guo-Wei Wei. Characterizing sars-cov-2 mutations in the united states. Research square, 2020. [11] Jiahui Chen, Rui Wang, Menglun Wang, and Guo-Wei Wei. Mutations strengthened sars-cov-2 infectivity. Journal of molecular biology, 432(19):5212–5226, 2020. [12] Rui Wang, Jiahui Chen, Yuta Hozumi, Changchuan Yin, and Guo-Wei Wei. Decoding 100 asymptomatic covid-19 infection and transmission. The journal of physical chemistry letters, 11(23):10007–10015, 2020. [13] Michael Worobey, Jonathan Pekar, Brendan B Larsen, Martha I Nelson, Verity Hill, Jef- frey B Joy, Andrew Rambaut, Marc A Suchard, Joel O Wertheim, and Philippe Lemey. The emergence of sars-cov-2 in europe and north america. Science, 370(6516):564–570, 2020. [14] Yunmeng Bai, Dawei Jiang, Jerome R Lon, Xiaoshi Chen, Meiling Hu, Shudai Lin, Zixi Chen, Xiaoning Wang, Yuhuan Meng, and Hongli Du. Comprehensive evolution and molecular char- acteristics of a large number of sars-cov-2 genomes reveal its epidemic trends. International Journal of Infectious Diseases, 100:164–173, 2020. [15] Yujiro Toyoshima, Kensaku Nemoto, Saki Matsumoto, Yusuke Nakamura, and Kazuma Kiyotani. Sars-cov-2 genomic variations associated with mortality rate of covid-19. Journal of human genetics, 65(12):1075–1082, 2020. [16] Lucy Van Dorp, Mislav Acman, Damien Richard, Liam P Shaw, Charlotte E Ford, Louise Ormond, Christopher J Owen, Juanita Pang, Cedric CS Tan, Florencia AT Boshier, et al. Emergence of genomic diversity and recurrent mutations in sars-cov-2. Infection, Genetics and Evolution, 83:104351, 2020. [17] Roderic DM Page. Space, time, form: viewing the tree of life. 
Trends in ecology & evolution, 27(2):113–120, 2012. [18] Tingting Zhou, Keith CC Chan, Yi Pan, and Zhenghua Wang. An approach for determining In Bioinformatics Research evolutionary distance in network-based phylogenetic analysis. and Applications: Fourth International Symposium, ISBRA 2008, Atlanta, GA, USA, May 6-9, 2008. Proceedings 4, pages 38–49. Springer, 2008. [19] Ian T Jolliffe and Jorge Cadima. Principal component analysis: a review and recent de- velopments. Philosophical transactions of the royal society A: Mathematical, Physical and Engineering Sciences, 374(2065):20150202, 2016. [20] John W Sammon. A nonlinear mapping for data structure analysis. IEEE Transactions on computers, 100(5):401–409, 1969. [21] Chun-houh Chen, Wolfgang Härdle, Antony Unwin, Michael AA Cox, and Trevor F Cox. Multidimensional scaling. Springer, 2008. [22] George C Linderman, Manas Rachh, Jeremy G Hoskins, Stefan Steinerberger, and Yuval Kluger. Fast interpolation-based t-sne for improved visualization of single-cell rna-seq data. Nature methods, 16(3):243–245, 2019. [23] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008. 101 [24] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018. [25] Etienne Becht, Leland McInnes, John Healy, Charles-Antoine Dutertre, Immanuel WH Kwok, Lai Guan Ng, Florent Ginhoux, and Evan W Newell. Dimensionality reduction for visualizing single-cell data using umap. Nature biotechnology, 37(1):38–44, 2019. [26] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embed- ding and clustering. Advances in neural information processing systems, 14, 2001. [27] Jian Tang, Jingzhou Liu, Ming Zhang, and Qiaozhu Mei. Visualizing large-scale and high- dimensional data. In Proceedings of the 25th international conference on world wide web, pages 287–297, 2016. [28] David I Spivak. Metric realization of fuzzy simplicial sets. Preprint, page 4, 2009. [29] Fabian Sievers, Andreas Wilm, David Dineen, Toby J Gibson, Kevin Karplus, Weizhong Li, Rodrigo Lopez, Hamish McWilliam, Michael Remmert, Johannes Söding, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using clustal omega. Molecular systems biology, 7(1):539, 2011. [30] Michael Levandowsky and David Winter. Distance between sets. Nature, 234(5323):34–35, 1971. [31] Sameer A Nene, Shree K Nayar, Hiroshi Murase, et al. Columbia object image library (coil-20). 1996. [32] Benedek Rozemberczki, Carl Allen, and Rik Sarkar. Multi-scale attributed node embedding. Journal of Complex Networks, 9(2):cnab014, 2021. [33] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. [34] D Ulyanov. Multicore-tsne https://github. com/dmitryulyanov. Multicore-TSNE2016, 2016. [35] Bette Korber, Will M Fischer, Sandrasegaram Gnanakaran, Hyejin Yoon, James Theiler, Werner Abfalterer, Nick Hengartner, Elena E Giorgi, Tanmoy Bhattacharya, Brian Foley, et al. Tracking changes in sars-cov-2 spike: evidence that d614g increases infectivity of the covid-19 virus. Cell, 182(4):812–827, 2020. [36] Joseph Minhow Chan, Gunnar Carlsson, and Raul Rabadan. Topology of viral evolution. Proceedings of the National Academy of Sciences, 110(46):18566–18571, 2013. 
[37] Michael Bleher, Lukas Hahn, Maximilian Neumann, Juan Angel Patino-Galindo, Mathieu Carriere, Ulrich Bauer, Raul Rabadan, and Andreas Ott. Topological data analysis identifies 102 emerging adaptive mutations in sars-cov-2. arXiv preprint arXiv:2106.07292, 2021. [38] Dong Quan Ngoc Nguyen, Phuong Dong Tan Le, Lin Xing, and Lizhen Lin. A topological characterization of dna sequences based on chaos geometry and persistent homology. In 2022 International Conference on Computational Science and Computational Intelligence (CSCI), pages 1591–1597. IEEE, 2022. [39] Christopher Tralie, Nathaniel Saul, and Rann Bar-On. Ripser.py: A lean persistent homology library for python. The Journal of Open Source Software, 3(29):925, Sep 2018. [40] Clément Maria, Jean-Daniel Boissonnat, Marc Glisse, and Mariette Yvinec. The gudhi library: Simplicial complexes and persistent homology. In Mathematical Software–ICMS 2014: 4th International Congress, Seoul, South Korea, August 5-9, 2014. Proceedings 4, pages 167–174. Springer, 2014. [41] Nan Sun, Shaojun Pei, Lily He, Changchuan Yin, Rong Lucy He, and Stephen S-T Yau. Geometric construction of viral genome space and its applications. Computational and Structural Biotechnology Journal, 19:4226–4234, 2021. [42] Ji Qi, Bin Wang, and Bai-Iin Hao. Whole proteome prokaryote phylogeny without sequence alignment: Ak-string composition approach. Journal of molecular evolution, 58:1–11, 2004. [43] Ajay Kumar Saw, Garima Raj, Manashi Das, Narayan Chandra Talukdar, Binod Chandra Tripathy, and Soumyadeep Nandi. Alignment-free method for dna sequence clustering using fuzzy integral similarity. Scientific reports, 9(1):3753, 2019. [44] Ann C Palmenberg, David Spiro, Ryan Kuzmickas, Shiliang Wang, Appolinaire Djikeng, Jennifer A Rathe, Claire M Fraser-Liggett, and Stephen B Liggett. Sequencing and analyses of all known human rhinovirus genomes reveal structure and evolution. Science, 324(5923):55– 59, 2009. [45] Mo Deng, Chenglong Yu, Qian Liang, Rong L He, and Stephen S-T Yau. A novel method of characterizing genetic sequences: genome space with biological distance and applications. PloS one, 6(3):e17293, 2011. [46] Tung Hoang, Changchuan Yin, Hui Zheng, Chenglong Yu, Rong Lucy He, and Stephen S-T Yau. A new method to cluster dna sequences using fourier power spectrum. Journal of theoretical biology, 372:135–145, 2015. [47] Patrick CY Woo, Susanna KP Lau, Chung-ming Chu, Kwok-hung Chan, Hoi-wah Tsoi, Yi Huang, Beatrice HL Wong, Rosana WS Poon, James J Cai, Wei-kwang Luk, et al. Characterization and complete genome sequence of a novel coronavirus, coronavirus hku1, from patients with pneumonia. Journal of virology, 79(2):884–895, 2005. [48] Chenglong Yu, Qian Liang, Changchuan Yin, Rong L He, and Stephen S-T Yau. A novel 103 construction of genome space with biological geometry. DNA research, 17(3):155–168, 2010. [49] Lia Van Der Hoek, Krzysztof Pyrc, Maarten F Jebbink, Wilma Vermeulen-Oost, Ron JM Berkhout, Katja C Wolthers, Pauline ME Wertheim-van Dillen, Jos Kaandorp, Joke Spaar- Identification of a new human coronavirus. Nature medicine, garen, and Ben Berkhout. 10(4):368–373, 2004. [50] Paul S Masters. The molecular biology of coronaviruses. Advances in virus research, 66:193– 292, 2006. [51] Kristian G Andersen, Andrew Rambaut, W Ian Lipkin, Edward C Holmes, and Robert F Garry. The proximal origin of sars-cov-2. Nature medicine, 26(4):450–452, 2020. [52] Robert G Webster, William J Bean, Owen T Gorman, Thomas M Chambers, and Yoshihiro Kawaoka. 
Evolution and ecology of influenza a viruses. Microbiological reviews, 56(1):152– 179, 1992. [53] Dennis J Alexander. A review of avian influenza in different bird species. Veterinary micro- biology, 74(1-2):3–13, 2000. 104 APPENDIX 3A ADDITIONAL MATERIALS FOR UMAP-ASSISTED K-MEANS CLUSTERING OF LARGE-SCALE SARS-COV-2 MUTATION DATASETS 3A.1 Cluster Stability The K-means clustering is used to classify the SARS-CoV-2 the SNP variants. The Elbow method is used to determine the optimal number of clusters. Our results demonstrate four main clusters as shown in Figure S1, which plots the within-cluster sum of squares according to the number of clusters k for the SNP variants based on Jaccard distance metric. The optimal values of K-mean clusters is shown as the turning point in the in the elbow plots. Figure 3A.1 Within cluster sum of squares (WCSS) of the UMAP-assisted k-means result of SARS-CoV-2 collected up to January 20, 2021. Using elbow method, the number of cluster was determined to be 6. 105 Figure 3A.2 Within cluster sum of squares (WCSS) of the UMAP-assisted k-means result of US SARS-CoV-2 collected up to January 20, 2021. Using elbow method, the number of cluster was determined to be 6. 3A.2 Clusters distribution Table 3A.1 shows the top 25 mutations of each cluster from the dataset collected up to January 20, 2021. 106 Table 3A.1 Clusters distribution of the top 25 single mutations of SARS-CoV-2 in the world, collected up to January 20, 2021. Top Top 1 Top 2 Top 3 Top 4 Top 5 Top 6 Top 7 Top 8 Top 9 Top 10 Top 11 Top 12 Top 13 Top 14 Top 15 Top 16 Top 17 Top 18 Top 19 Top 20 Top 21 Top 22 Top 23 Top 24 Top 25 Position Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 3450 23403 3450 14408 3450 3037 3064 241 24 26801 0 22227 1 6286 1 21255 1 29645 1 28932 0 445 3450 28881 3450 28882 3450 28883 0 25563 0 27944 6 204 0 1059 0 21614 0 20268 3449 22992 0 18877 0 28854 40 11083 0 26735 125155 125032 125052 124492 57139 57097 56961 56844 56792 56731 56754 28396 28268 28274 22757 38443 32527 13045 25209 9379 7546 7393 9147 10985 7317 22649 22687 22631 22395 76 32 15 57 13 18 1 896 838 816 21663 18 27 17219 26 131 34 2499 386 738 977 23395 23381 23358 23132 62 33 17 82 26 7 9 23418 23413 23435 112 19 37 60 106 19 87 42 72 611 35 14230 14181 14174 13955 33 27 24 11 11 26 2 92 21 16 34 3 8 33 150 6487 37 11 3772 414 52 3488 3488 3481 3477 22 20 16 13 15 10 12 12 15 13 3475 3 3 2 20 1 3477 3480 2 103 3468 Table 3A.2 shows cluster distribution among the top 25 available sequence from each country. 107 Table 3A.2 Cluster distributions of SARS-CoV-2 sequences from top 25 countries with the highest number of sequences as of October 30, 2020. The top 25 countries are the United Kingdom (UK), the United States (US), Australia (AU), India (IN), Switzerland (CH), Netherlands (NL), Canada (CA), France (FR), Belgium (BE), Singapore (SG), Spain (ES), Russia (RU), Portugal (PT), Denmark (DK), Sweden (SE), Austria (AT), Japan (JP), South Africa (ZA), Iceland (IS), Brazil (BR), Saudi Arabia (SA), Norway (NO), China (CN), Italy (IT), and Korea (KR). 
Top Top 1 Top 2 Top 3 Top 4 Top 5 Top 6 Top 7 Top 8 Top 9 Top 10 Top 11 Top 12 Top 13 Top 14 Top 15 Top 16 Top 17 Top 18 Top 19 Top 20 Top 21 Top 22 Top 23 Top 24 Top 25 Country Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 0 0 0 3449 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1291 15042 1256 397 123 728 90 79 154 592 113 20 361 43 25 72 243 27 12 27 65 144 2 198 31 65107 20897 20177 5001 2973 1993 1669 2753 815 627 1099 1161 757 1130 889 972 394 240 508 21 683 461 96 414 225 8200 3691 954 466 368 258 357 112 1168 175 473 139 421 97 307 152 304 718 456 845 82 153 567 49 388 3385 3760 215 280 356 496 656 156 568 505 361 598 329 110 279 44 113 82 36 61 50 131 170 47 118 1030 5 317 9 238 3 425 19 0 439 164 15 40 512 27 1 18 0 0 0 62 0 0 61 1 UK US DK AU NL CA CH IS IN FR BE ES DE LU IT SG SE AE BR RU NO CL ZA FI PT Table 3A.3 shows the top 25 mutation of each cluster in the US dataset, collected up to January 20, 2021. 108 Table 3A.3 Clusters distribution of the top 25 single mutations of SARS-CoV-2 in the United States, collected up to January 20, 2021. Top Top 1 Top 2 Top 3 Top 4 Top 5 Top 6 Top 7 Top 8 Top 9 Top 10 Top 11 Top 12 Top 13 Top 14 Top 15 Top 16 Top 17 Top 18 Top 19 Top 20 Top 21 Top 22 Top 23 Top 24 Top 25 Position Cluster A Cluster B Cluster C Cluster D Cluster E Cluster F 4677 23403 4661 14408 4666 3037 4594 241 33 25563 21 1059 8 27964 13 10319 19 21304 5 18424 6 28869 3 20268 8 25907 10 28472 12 28854 4644 28881 4641 28882 4656 28883 32 14805 0 8083 12 8782 3 28144 2 24076 10 18060 13 17747 20621 20609 20577 20498 15710 15160 10205 7790 6333 6321 6251 2369 6133 6086 2585 1207 1207 1193 2857 2527 768 749 320 271 311 4890 4889 4888 4762 44 36 9 11 13 7 22 3785 7 16 3440 35 5 2 4 1 0 3 1759 40 8 3 3 2 3 0 1 1 0 0 0 0 0 0 0 0 1 0 0 16 1 1575 1572 0 1551 1498 9438 9445 9416 9345 9432 8030 3 7 13 3 7 4 10 8 62 20 10 2 4 1 2 5 0 7 12 635 634 630 630 635 634 0 0 0 0 0 0 0 1 6 2 1 1 0 3 0 0 0 0 0 Table 3A.3 shows the cluster statistics for each states with more than 50 SARS-CoV-2 genome isolates. 109 Table 3A.4 Cluster statistics for states with more than 50 SARS-CoV-2 genome samples. State Michigan Washington Virginia Maryland Alaska Idaho New Jersey Kentucky Wyoming Arizona Arkansas Tennessee Pennsylvania Missouri California Indiana Florida New Mexico Maine Nevada Iowa Minnesota Colorado Kansas Nebraska New York Oregon Vermont South Carolina Montana South Dakota Alabama Georgia District of Columbia Utah Massachusetts Oklahoma North Carolina Connecticut Cluster A Cluster B Cluster C Cluster D Cluster E Cluster F 709 1275 874 1790 477 389 45 783 546 231 343 120 339 181 36 232 80 105 131 53 63 25 36 82 110 22 72 13 67 17 32 5 19 9 22 33 31 8 10 5884 1961 1853 1525 2034 1053 1564 518 544 678 569 432 346 377 210 157 288 201 181 211 158 213 76 75 22 70 77 129 16 43 39 72 44 23 29 31 26 37 30 968 943 1084 275 293 50 45 66 146 50 168 82 54 63 18 35 26 44 26 8 24 11 48 18 11 47 13 0 5 7 0 0 9 13 15 0 3 2 1 527 1962 169 433 279 152 36 64 39 205 46 204 68 49 391 59 54 40 52 29 47 4 2 5 17 10 5 4 5 34 9 3 2 21 3 3 1 12 1 3 216 921 37 12 121 1 7 16 47 16 15 10 1 4 30 7 13 2 11 10 5 43 4 17 0 2 7 1 0 0 0 2 5 0 0 1 0 1 25 100 8 341 5 3 0 3 59 3 5 0 9 2 0 1 0 9 0 0 2 0 0 18 2 27 0 0 8 0 2 0 3 0 0 0 0 0 0 Table 3A.5 shows the world wide cluster statistics from SARS-CoV-2 genome collected up to June 01, 2020, which is our previous result from [9]. 
The listed countries are the United States 110 (US), Canada (CA), Australia (AU), United Kingdom (UK), Germany (DE), France (FR), Italy (IT), Russia (RU), China (CN), Japan (JP), Korean (KR), India (IN), Spain (ES), Saudi Arabia (SA), and Turkey (TR). Table 3A.5 The world wide clusters from SARS-CoV-2 genome data available up to June 01, 2020. The listed countries are the United States (US), Canada (CA), Australia (AU), United Kingdom (UK), Germany (DE), France (FR), Italy (IT), Russia (RU), China (CN), Japan (JP), Korean (KR), India (IN), Spain (ES), Saudi Arabia (SA), and Turkey (TR) [9]. Country Cluster I Cluster II Cluster III Cluster IV Cluster V Cluster VI US CA AU UK DE FR IT RU CN JP KR IN ES SA TR 844 12 163 539 10 41 26 10 8 0 0 93 27 14 25 311 29 149 875 20 85 24 27 3 3 0 69 100 31 3 488 17 410 908 21 14 9 1 215 68 28 0 141 74 9 24 156 16 135 1532 38 12 17 109 1 20 0 10 25 1 9 1813 19 146 119 42 82 0 3 1 3 0 3 3 2 0 975 41 77 3 0 0 0 0 25 0 0 0 2 0 0 PCA was used to reduce the same dataset as Table 3A.5 by reduction ratio of 1/160. Table 3A.6 shows the world-wide cluster statistics of the PCA-assisted K-means. K = 6 was used to compare with the original result from [9]. 111 Table 3A.6 The world wide clusters from SARS-CoV-2 genome data available up to June 01, 2020 using PCA embedding with reduction ratio of 1/160. Country Cluster Ip Cluster IIp Cluster IIIp Cluster IVp Cluster Vp Cluster VIp US CA AU UK DE FR IT RU CN JP KR IN ES SA TR 915 14 164 543 10 46 26 10 8 0 0 95 27 30 27 489 17 414 908 21 14 9 1 213 68 28 141 74 9 24 239 27 143 857 20 80 24 27 3 3 0 67 100 15 1 156 16 136 1546 38 12 17 109 1 20 0 10 25 1 9 1813 19 146 119 42 82 0 3 1 3 0 3 3 2 0 975 41 77 3 0 0 0 0 24 0 0 0 2 0 0 UMAP was used to reduce the same dataset as Table 3A.5 by reduction ratio of 1/160. Table 3A.7 shows the world-wide cluster statistics of the UMAP-assisted K-means. K = 6 was used to compare with the original result from [9]. Table 3A.7 The world wide clusters from SARS-CoV-2 genome data available up to June 01, 2020 using UMAP embedding with reduction ratio of 1/160. Country Cluster Iu Cluster IIu Cluster IIIu Cluster IVu Cluster Vu Cluster VIu US CA AU UK DE FR IT RU CN JP KR IN ES SA TR 2446 71 784 2171 57 163 13 92 178 36 18 232 205 56 56 1096 15 94 115 40 45 1 2 28 0 0 3 2 0 1 751 35 18 2 0 0 0 0 10 0 1 0 0 0 0 110 1 83 534 5 11 5 0 22 47 9 2 7 0 0 94 3 37 326 15 5 22 7 6 0 0 72 5 0 0 90 9 64 828 14 10 35 49 6 11 0 7 12 1 4 112 APPENDIX 3B ADDITIONAL MATERIALS FOR K-MERS TOPOLOGY FOR ALIGNEMNT-FREE SEQUENCE ANALYSIS The statistics of each dataset used for the phylogenetic is listed below. • Table 3B.1: Ebolavirus data • Table 3B.2: Mammalian mitochondria data • Table 3B.3: Rhinovirus data • Table 3B.4: Coronavirus data • Table 3B.5: Infleunza type A data • Table 3B.6: Bacterial 16S rDNA isolates data • Table 3B.7: Bacteria whole genome data. Table 3B.1 Accession, group, sequence length of ebolavirus data. 
Accession FJ217161.1 KC545395.1 KC545396.1 AF522874.1 JX477166.1 FJ621583.1 FJ968794.1 EU338380.1 JN638998.1 KC545390.1 KC545392.1 KC242801.1 KC242791.1 KC242793.1 AY354458.1 KC242799.1 KC242786.1 KC242789.1 KC242790.1 KC242800.1 KM034562.1 KM034557.1 KM233050.1 KM233057.1 KM233072.1 KM233070.1 KM233097.1 KM233096.1 KJ660346.2 KJ660348.2 Group BDBV BDBV BDBV RESTV RESTV RESTV SUDV SUDV SUDV SUDV SUDV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV NA 5964 5975 5975 5937 5935 5928 5905 5914 5931 5925 5923 6061 6061 6043 6054 6054 6063 6062 6060 6042 6051 6052 6053 6052 6049 6055 6051 6054 6053 6055 NC 4324 4290 4293 3929 3920 3929 4034 4032 4059 4080 4081 4037 4037 4052 4051 4050 4025 4023 4025 4052 4050 4052 4050 4052 4049 4052 4052 4049 4052 4051 NG 3632 3635 3635 3746 3755 3767 3756 3750 3729 3730 3732 3752 3752 3761 3747 3748 3749 3750 3751 3762 3756 3753 3753 3751 3752 3753 3752 3751 3755 3754 NT 5020 5039 5036 5279 5281 5263 5180 5179 5156 5139 5138 5109 5109 5102 5109 5107 5121 5123 5122 5102 5100 5099 5100 5099 5099 5099 5098 5099 5099 5099 Accession KC545393.1 KC545394.1 FJ217162.1 AB050936.1 FJ621585.1 JX477165.1 KC242783.2 AY729654.1 KC545389.1 KC545391.1 KC589025.1 NC_002549.1 KC242792.1 KC242794.1 KC242796.1 KC242784.1 KC242787.1 KC242785.1 KC242788.1 KM034555.1 KM233039.1 KM034560.1 KM233053.1 KM233063.1 KM233110.1 KM233099.1 KM233109.1 KM233103.1 KJ660347.2 Group BDBV BDBV TAFV RESTV RESTV RESTV SUDV SUDV SUDV SUDV SUDV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV EBOV NA 5974 5974 6020 5927 5900 5924 5911 5920 5924 5924 5921 6061 6047 6039 6055 6061 6062 6060 6063 6049 6052 6050 6053 6053 6053 6052 6053 6051 6054 NC 4293 4292 4371 3924 3898 3929 4028 4071 4080 4080 4047 4035 4052 4063 4049 4028 4025 4026 4032 4049 4051 4051 4051 4052 4053 4052 4055 4050 4052 NG 3636 3637 3630 3762 3747 3774 3750 3732 3731 3731 3734 3752 3756 3762 3748 3750 3750 3752 3749 3753 3751 3752 3753 3751 3752 3751 3753 3751 3754 NT 5036 5036 4914 5277 5251 5260 5186 5152 5139 5139 5173 5111 5104 5095 5107 5119 5121 5120 5114 5099 5099 5099 5100 5099 5098 5098 5097 5098 5099 113 Table 3B.2 Accession, group, sequence length of mammalian mitochondria data. 
Accession V00662.1 D38116.1 D38113.1 D38114.1 X99256.1 Y18001.1 AY863426.1 D38115.1 NC_002083.1 NC_002764.1 U20753.1 U96639.2 EU442884.2 EF551003.1 EF551002.1 DQ402478.1 AF303110.1 AF303111.1 EF212882.1 AJ002189.1 AF010406.1 AF533441.1 V00654.1 AY488491.1 NC_007441.1 NC_008830.1 NC_010640.1 X72204.1 NC_005268.1 NC_001321.1 NC_005270.1 NC_005275.1 NC_006931.1 NC_001788.1 X97336.1 Y07726.1 NC_001640.1 AJ238588.1 AJ001562.1 AJ001588.1 X88898.2 Groups Primates Primates Primates Primates Primates Primates Primates Primates Primates Primates Carnivore Carnivore Carnivore Carnivore Carnivore Carnivore Carnivore Carnivore Carnivore Artiodactyla Artiodactyla Artiodactyla Artiodactyla Artiodactyla Artiodactyla Artiodactyla Artiodactyla Cetacea Cetacea Cetacea Cetacea Cetacea Cetacea Perissodactyla Perissodactyla Perissodactyla Perissodactyla Rodentia Rodentia Lagomorpha Erinaceomorpha Length 16569 16563 16554 16364 16472 16521 16389 16389 16499 16586 17009 16727 16774 16990 16964 16868 17020 17017 16805 16680 16616 16640 16338 16355 16498 16719 16524 16402 16390 16398 16412 16324 16386 16670 16829 16832 16660 16507 16602 17245 17447 114 NG 2176 2104 2133 2160 2256 2169 2049 2168 2176 2116 NC 5176 5084 5099 5022 5231 5047 4953 5317 5403 5027 NT NA 4094 5123 4186 5189 4168 5154 4123 5059 3946 5039 4110 5195 4137 5243 3897 5007 3889 5031 5306 4137 5543 4454 2406 4606 5290 4267 2366 4804 5293 4265 2398 4812 5418 4513 2478 4581 5397 4508 2467 4592 5270 4285 2601 4712 5258 4355 2676 4731 5253 4346 2692 4726 5338 4000 2518 4949 4296 5790 4552 5594 4569 5569 4443 5460 4375 5421 4434 5542 4371 5786 5519 4396 5374 4527 2140 4361 5354 4609 2162 4265 5359 4474 2182 4383 5374 4626 2153 4259 5377 4525 2040 4382 5357 4573 2164 4292 4259 5394 4405 5663 4333 5623 4312 5358 5094 5301 5386 5207 5429 4584 2350 4882 5937 3503 2185 5822 4384 4289 4313 4237 4298 4358 4340 4404 2210 2181 2189 2198 2261 2164 2222 2205 2198 2131 2169 2236 2071 2096 4819 4630 4707 4754 4041 3913 Table 3B.3 Accession, group, sequence length of rhinovirus data. 
Accession AF499637.1 AY751783.1 DQ473486.1 DQ473489.1 DQ473491.1 DQ473493.1 DQ473496.1 DQ473499.1 DQ473504.1 DQ473506.1 DQ473508.1 DQ473511.1 EF077280.1 EF173415.1 EF173423.1 EF186077.2 EF582386.1 FJ445111.1 FJ445113.1 FJ445115.1 FJ445117.1 FJ445119.1 FJ445122.1 FJ445124.1 FJ445126.1 FJ445128.1 FJ445130.1 FJ445132.1 FJ445134.1 FJ445136.1 FJ445138.1 FJ445140.1 FJ445142.1 FJ445144.1 FJ445146.1 FJ445148.1 FJ445151.1 FJ445153.1 FJ445155.1 FJ445157.1 FJ445159.1 FJ445161.1 FJ445163.1 FJ445165.1 FJ445167.1 FJ445169.1 FJ445171.1 FJ445173.1 FJ445175.1 FJ445177.1 FJ445179.1 FJ445181.1 FJ445183.1 FJ445185.1 FJ445187.1 FJ445189.1 L05355.1 V01149.1 Group HEV A B B A A A A A A A A C A B C C A A A A A A B A A B A A A A A A A A A B B B A A B A A A B A A A A A A A A B A B HEV NA 2243 2348 2381 2365 2301 2370 2363 2333 2253 2415 2371 2367 2153 2299 2384 2296 2304 2388 2376 2377 2321 2354 2327 2403 2354 2336 2405 2352 2360 2349 2353 2342 2233 2340 2320 2390 2316 2372 2369 2357 2331 2382 2355 2235 2394 2338 2316 2387 2319 2347 2318 2333 2356 2344 2372 2370 2313 2206 NC 1677 1362 1428 1476 1383 1287 1342 1301 1380 1349 1389 1288 1550 1396 1456 1549 1492 1284 1387 1321 1352 1310 1369 1397 1292 1347 1423 1406 1364 1350 1326 1323 1363 1361 1352 1339 1499 1461 1421 1339 1334 1376 1334 1388 1295 1430 1344 1299 1317 1309 1336 1379 1379 1384 1411 1328 1460 1737 NG 1662 1428 1431 1463 1454 1442 1389 1427 1402 1412 1390 1386 1500 1416 1407 1520 1480 1388 1407 1426 1421 1421 1445 1413 1414 1429 1405 1424 1404 1425 1418 1381 1415 1416 1411 1393 1504 1452 1433 1406 1437 1460 1412 1416 1409 1463 1446 1381 1433 1449 1417 1441 1429 1444 1466 1408 1475 1711 NT 1876 1999 1976 1919 2007 2035 2012 2062 2108 1972 1998 1995 1812 2013 1969 1769 1838 2074 1938 2009 2049 2048 1971 1996 2069 2019 1989 1932 1981 2025 2036 2088 2128 2022 2058 2016 1890 1931 2001 2011 2014 2011 2039 2111 2025 2002 2026 2066 2071 2021 2022 1972 1981 1957 1971 2011 1964 1786 Accession AF546702.1 DQ473485.1 DQ473488.1 DQ473490.1 DQ473492.1 DQ473494.1 DQ473497.1 DQ473500.1 DQ473505.1 DQ473507.1 DQ473510.1 EF077279.1 EF173414.1 EF173420.1 EF173425.1 EF582385.1 EF582387.1 FJ445112.1 FJ445114.1 FJ445116.1 FJ445118.1 FJ445121.1 FJ445123.1 FJ445125.1 FJ445127.1 FJ445129.1 FJ445131.1 FJ445133.1 FJ445135.1 FJ445137.1 FJ445139.1 FJ445141.1 FJ445143.1 FJ445145.1 FJ445147.1 FJ445149.1 FJ445152.1 FJ445154.1 FJ445156.1 FJ445158.1 FJ445160.1 FJ445162.1 FJ445164.1 FJ445166.1 FJ445168.1 FJ445170.1 FJ445172.1 FJ445174.1 FJ445176.1 FJ445178.1 FJ445180.1 FJ445182.1 FJ445184.1 FJ445186.1 FJ445188.1 FJ445190.1 L24917.1 X02316.1 Group HEV B B B A A A A A A A C A B B C C B A A A A A A A A A A A B A A A A A A A A A A A B B A B A B B A A A A A B B A A A NA 2209 2338 2382 2362 2297 2389 2309 2336 2247 2407 2406 2176 2369 2356 2426 2195 2261 2402 2375 2298 2386 2376 2358 2332 2375 2380 2397 2342 2371 2334 2364 2414 2390 2371 2353 2377 2375 2368 2389 2336 2325 2392 2388 2227 2376 2371 2403 2404 2284 2331 2387 2334 2241 2377 2322 2392 2383 2324 NC 1649 1436 1456 1417 1371 1314 1322 1344 1404 1350 1313 1503 1310 1500 1412 1565 1533 1391 1329 1356 1340 1328 1277 1299 1287 1311 1275 1275 1328 1517 1327 1287 1315 1270 1383 1325 1362 1281 1341 1333 1325 1420 1374 1387 1515 1382 1410 1400 1339 1369 1314 1347 1381 1401 1497 1307 1331 1347 NG 1681 1443 1447 1420 1443 1437 1396 1409 1409 1426 1424 1509 1403 1477 1432 1473 1513 1417 1428 1437 1423 1403 1399 1416 1408 1391 1436 1439 1395 1473 1413 1401 1391 1391 1429 1435 1427 1406 1421 1434 1454 1410 1420 1425 1450 1413 1418 1384 1382 1420 1425 
1420 1406 1447 1490 1389 1412 1418 NT 1867 1991 1929 2013 2029 1980 1998 2046 2081 1960 1994 1756 2043 1886 1944 1866 1779 2002 2002 2049 1970 2026 2091 2075 2062 2056 2019 2074 2022 1891 2029 2031 2042 2095 1997 1998 1997 2077 1984 2013 2019 1977 2028 2113 1878 1944 1975 2018 2140 2017 2010 2027 2121 1992 1907 2044 1998 2013 115 Table 3B.4 Accession, group, sequence length of coronavirus data. Group Group 1 Group 1 Group 1 Group 2 Group 2 Group 2 Group 2 Group 2 Group 2 Group 2 Group 2 Group 2 Group 3 Group 3 Group 4 Group 4 Group 4 Group 4 Group 4 Group 4 Group 4 Group 4 Group 4 Group 4 Group 4 Group 4 Group 4 Group 4 Group 4 Group 5 Accession AF304460.1 AF353511.1 NC_005831.2 AY391777.1 U00735.2 AF391542.1 AF220295.1 NC_003045.1 AF208067.1 AF201929.1 AF208066.1 NC_001846.1 NC_001451.1 EU095850.1 AY278488.2 AY278741.1 AY278491.2 AY278554.2 AY282752.2 AY283794.1 AY283795.1 AY283796.1 AY283797.1 AY283798.2 AY291451.1 NC_004718.3 AY297028.1 AY572034.1 AY572035.1 NC_006577.2 NC_001564.2 Flaviviridae outgroup NC_004102.1 Flaviviridae outgroup NC_001512.1 Togaviridae outgroup NC_001544.1 Togaviridae outgroup Length 27317 28033 27553 30738 31032 31028 31100 31028 31233 31276 31112 31357 27608 27657 29725 29727 29742 29736 29736 29711 29705 29711 29706 29711 29729 29751 29715 29540 29518 29926 10682 9646 11835 11657 NA 7420 6937 7253 8485 8490 8486 8544 8487 8087 8117 8030 8138 7967 7969 8465 8455 8475 8476 8476 8453 8447 8453 8451 8453 8457 8481 8458 8402 8395 8331 2618 1889 3676 3220 NC 4549 5382 3979 4658 4713 4743 4711 4752 5591 5548 5534 5614 4479 4513 5941 5940 5942 5942 5939 5937 5936 5936 5935 5935 5940 5940 5934 5911 5907 3895 2531 2893 2860 2901 NG 5903 6397 5516 6655 6774 6772 6790 6767 7466 7422 7416 7487 5993 6066 6185 6188 6183 6185 6185 6184 6187 6185 6184 6185 6188 6187 6187 6154 6151 5699 2919 2724 2859 3065 NT 9445 9317 10805 10940 11055 11027 11055 11022 10089 10189 10132 10118 9169 9108 9134 9144 9142 9133 9136 9137 9135 9137 9135 9138 9144 9143 9135 9073 9065 12001 2614 2140 2440 2416 116 Table 3B.5 Accession, group, sequence length of influenza A virus data. 
Accession HM370969.1 H1N1 CY138562.1 H1N1 CY149630.1 H1N1 KC608160.1 H1N1 AM157358.1 H1N1 AB470663.1 H1N1 AB546159.1 H1N1 HQ897966.1 H1N1 EU026046.2 H1N1 FJ357114.1 H1N1 GQ411894.1 H1N1 CY140047.1 H1N1 KM244078.1 H1N1 HQ185381.1 H5N1 HQ185383.1 H5N1 EU635875.1 H5N1 FM177121.1 H5N1 AM914017.1 H5N1 KF572435.1 H5N1 AF509102.2 H5N1 AB684161.1 H5N1 EF541464.1 H5N1 JF699677.1 H5N1 GU186511.1 H5N1 EU500854.1 H7N3 CY129336.1 H7N3 CY076231.1 H7N3 CY039321.1 H7N3 AY646080.1 H7N3 KF259734.1 H7N9 KF938945.1 H7N9 KF259688.1 H7N9 KC609801.1 H7N9 CY014788.1 H7N9 CY186004.1 H7N9 DQ017487.1 H2N2 CY005540.1 H2N2 JX081142.1 H2N2 Group Length NA NC NG NT 373 453 383 437 385 441 381 409 381 418 393 418 378 421 389 422 384 439 392 438 376 430 385 440 367 447 365 406 366 408 358 397 368 407 365 398 355 403 364 401 363 404 359 396 362 404 374 407 355 475 348 470 340 467 343 470 355 485 309 478 312 483 312 490 314 488 317 500 308 494 386 445 384 455 397 446 1419 1422 1433 1398 1413 1422 1410 1410 1433 1433 1413 1433 1410 1350 1350 1350 1370 1350 1350 1366 1350 1350 1350 1370 1453 1428 1420 1434 1453 1398 1404 1413 1426 1460 1422 1467 1467 1457 330 343 346 357 355 359 351 353 347 350 346 347 335 339 336 347 350 344 345 344 348 349 348 345 339 332 327 333 329 321 322 320 332 337 317 355 344 349 263 259 261 251 259 252 260 246 263 253 260 261 261 240 240 248 245 243 247 257 235 246 236 244 284 278 286 288 284 290 287 291 292 306 303 281 284 265 117 Table 3B.6 Accession, group, sequence length of bacteria 16S rDNA data. 314 435 447 319 324 431 318 432 318 431 318 434 177 232 325 441 299 226 317 430 318 420 298 400 324 436 Length NA NC NG NT 345 1104 230 263 265 256 146 761 194 165 464 287 1452 361 340 262 295 301 1253 395 241 286 372 296 1195 250 339 282 1099 228 206 154 179 179 718 263 326 419 327 1335 260 314 430 335 1339 422 267 335 313 1337 268 317 1334 268 332 1366 262 338 1356 262 337 1350 262 335 1346 262 337 1351 149 184 742 262 337 1365 231 278 1035 262 334 1344 269 336 1343 253 318 1269 263 341 1364 442 278 345 322 1387 265 337 1356 262 335 1346 251 329 1294 261 338 1347 143 189 753 261 336 1345 161 193 785 274 352 1411 146 201 796 255 338 1319 263 335 1345 260 335 1341 260 335 1345 279 345 1346 255 325 1322 260 331 1337 325 429 317 431 312 401 316 432 169 252 317 431 188 243 330 455 183 266 309 417 317 430 315 431 318 432 306 416 321 419 315 428 Accession Family KY486204.1 Methylobacteriaceae KY486205.1 Xanthomonadaceae KY486206.1 Xanthomonadaceae KY486207.1 Intrasporangiaceae KY486218.1 Microbacteriaceae Pseudomonadaceae KY486219.1 Bacillaceae KY927407.1 KY486220.1 Paenibacillaceae Enterobacteriaceae KY486221.1 KY486222.1 Xanthomonadaceae KY486223.1 Microbacteriaceae KY486209.1 Rhodanobacteraceae Enterobacteriaceae KY486210.1 Enterobacteriaceae KY486232.1 KY019246.1 Enterobacteriaceae Enterobacteriaceae KY013009.1 KY927404.1 Microbacteriaceae Enterobacteriaceae KY486211.1 Staphylococcaceae KY013011.1 Enterobacteriaceae KY019245.1 Bacillaceae KY013010.1 KY486208.1 Enterobacteriaceae KY486212.1 Microbacteriaceae KY486213.1 Xanthomonadaceae Enterobacteriaceae KY486228.1 Enterobacteriaceae KY486224.1 Enterobacteriaceae KY486225.1 Enterobacteriaceae KY486226.1 KY927405.1 Enterobacteriaceae Enterobacteriaceae KY486227.1 KY927408.1 Microbacteriaceae KY486214.1 Enterobacteriaceae KY927406.1 Microbacteriaceae Enterobacteriaceae KY486215.1 Enterobacteriaceae KY486216.1 Enterobacteriaceae KY486217.1 Enterobacteriaceae KY486229.1 Pseudomonadaceae KY486230.1 Enterobacteriaceae KY486231.1 Enterobacteriaceae KY019244.1 
118 Table 3B.7 Accession, group, sequence length of bacteria data. Family Bacillaceae Bacillaceae Bacillaceae Bacillaceae Borreliaceae Borreliaceae Borreliaceae Borreliaceae Clostridiaceae Clostridiaceae Clostridiaceae Accession CP001598.1 AE016879.1 CP001215.1 AE017225.1 CP000976.1 CP000048.1 CP000993.1 CP000049.1 CP000246.1 CP000312.1 BA000016.3 CP000527.1 Desulfovibrionaceae AE017285.1 Desulfovibrionaceae CP002297.1 Desulfovibrionaceae AM260480.1 CP000091.1 CP000578.1 CP001151.1 AM295250.1 AE015929.1 AP006716.1 CP001837.1 AL590842.1 CP001585.1 AE009952.1 CP001593.1 CP001671.1 CP000468.1 CP001383.1 AE005674.2 Burkholderiaceae Burkholderiaceae Rhodobacteraceae Rhodobacteraceae Staphylococcaceae Staphylococcaceae Staphylococcaceae Staphylococcaceae Yersiniaceae Yersiniaceae Yersiniaceae Yersiniaceae Enterobacteriaceae Enterobacteriaceae Enterobacteriaceae Enterobacteriaceae Length 5227419 5227293 5230115 5228663 931674 922307 930981 917330 3256683 2897393 3031430 3462887 3570858 3532052 2912490 2726152 1219053 1297647 2566424 2499279 2685015 2658366 4653728 4640720 4600755 4553586 5131397 5082025 4650856 4607202 NA 1685408 1685374 1667671 1685622 335148 321202 335785 322493 1148078 1017083 1060154 639427 659017 652227 486037 480326 192966 204455 833636 837991 907537 878689 1219520 1216182 1200303 1187961 1271011 1256126 1145625 1133784 NC 930043 930007 974191 930391 129225 137556 129350 133424 470943 423439 446732 1092219 1127624 1113805 972789 883953 416292 445520 449856 405441 437414 445972 1102670 1097565 1090469 1077463 1298314 1285309 1187110 1176618 NG 919269 919244 876267 919481 127787 137547 126839 133693 453276 395046 419228 1089813 1127109 1116142 972298 887595 420323 446045 438966 396707 443072 454367 1114185 1112625 1101384 1093635 1297349 1283517 1177854 1167963 NT 1692699 1692668 1711986 1693169 339514 326002 339007 327720 1184386 1061825 1105316 641428 657108 649878 481366 474278 189472 201627 843965 859140 896992 879338 1217353 1214348 1208599 1194527 1264723 1256945 1140266 1128831 119 CHAPTER 4 RESIDUE-SIMILARITY SCORES AND INDEXES Traditionally, data visualization involves reducing data into 2 or 3 feature components. However, aggressive reduction often leads to poor representations for data with high intrinsic dimensions, despite producing visually appealing results. For classification problems with 2 classes, the Receiver Operating Characteristic (ROC) curve and Area Under the ROC Curve (AUC) curve can effectively show performance. However, not all classification problems are binary. In this section, we introduce a new visualization tool called Residue-Similarity (R-S) scores or R-S plots. Unlike traditional methods, R-S plots can be applied to an arbitrary number of classes, providing a more comprehensive visualization solution. 4.1 Methods 4.1.1 Residue-Similarity score and indexes An R-S plot consists of two components, residue and similarity scores. Assume that the data is {(xm, ym)|xm ∈ RN, ym ∈ ZL}M m=1, where xm is the mth data. For classification problems, ym is the ground truth, and for clustering problems, ym is the cluster label. Here, N is the number of features and M is the number of samples. L is the number of classes, that is ym ∈ [0, 1, ..., L – 1]. We can partition X = {xm}M m=1 into L classes by taking Cl = {xm ∈ X|ym = l}. Note that ⊎L–1 l=0 The residue score is defined as the inter-class sum of distances. Suppose ym = l. Then, the Cl = X. 
residue score for $x_m$ is given by

$$ R_m := R(x_m) = \frac{1}{R_{\max}} \sum_{x_j \notin C_l} \| x_m - x_j \|, \qquad (4.1) $$

where $\| \cdot \|$ is the distance between a pair of vectors and $R_{\max} = \max_{x_m \in X} R(x_m)$ is the maximal residue score. The similarity score is given by taking the average intra-class score. That is, for $y_m = l$,

$$ S_m := S(x_m) = \frac{1}{|C_l|} \sum_{x_j \in C_l} \left( 1 - \frac{\| x_m - x_j \|}{d_{\max}} \right), \qquad (4.2) $$

where $d_{\max} = \max_{x_i, x_j \in X} \| x_i - x_j \|$ is the maximal pairwise distance of the dataset. Note that by scaling, $0 \le R(x_m) \le 1$ and $0 \le S(x_m) \le 1$ for all $x_m$. In this work, we employ the Euclidean distance in our R-S scores. However, other distance metrics can be used as well. In general, a large $R(x_m)$ indicates that the sample is far from the other classes, and a large $S(x_m)$ indicates that the sample is well clustered. Since $R_{\max}$ and $d_{\max}$ are defined for the whole dataset, residue and similarity scores in different classes can be compared. The residue score and similarity score can be used to visualize each class separately, where $R(x)$ is the x-axis and $S(x)$ is the y-axis.

In the case of classification, define $\{(x_m, y_m, \hat{y}_m) \,|\, x_m \in \mathbb{R}^N, y_m \in \mathbb{Z}_L, \hat{y}_m \in \mathbb{Z}_L\}_{m=1}^{M}$, where $\hat{y}_m$ is the predicted label for the $m$th sample. Then, we can repeat the above process using the ground truth and visualize each class separately. By coloring each data point with its predicted label $\hat{y}_m$, we obtain the R-S score visualization of the classification.

The class residue index (CRI) and class similarity index (CSI) can be easily defined for the $l$th class as $\mathrm{CRI}_l = \frac{1}{|C_l|} \sum_{x_m \in C_l} R_m$ and $\mathrm{CSI}_l = \frac{1}{|C_l|} \sum_{x_m \in C_l} S_m$, respectively. Such indexes can be used to compare the distributions in different classes obtained by different methods. The above indices depend on clusters or classes. It is more useful to construct class-independent global indices. To this end, we first define the residue index (RI) and similarity index (SI) as $\mathrm{RI} = \frac{1}{L} \sum_l \mathrm{CRI}_l$ and $\mathrm{SI} = \frac{1}{L} \sum_l \mathrm{CSI}_l$, respectively. All of these indexes have the range [0,1], and the larger the better for a given dataset. Additionally, we define the R-S disparity (RSD) as $\mathrm{RSD} = \mathrm{RI} - \mathrm{SI}$. RSD ranges over [-1,1]. Finally, we define the R-S index (RSI) as $\mathrm{RSI} = 1 - |\mathrm{RI} - \mathrm{SI}|$. The R-S index has the range [0,1]. The Rand index is known to correlate with accuracy [1]. We speculate that the R-S disparity may correlate with the convergence of clustering and that the R-S index may correlate with the accuracy of classification. The R-S disparity and R-S index can be used to measure the performance of different methods.
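A direct sketch of Eqs. (4.1)–(4.2) and the derived indexes, using the Euclidean distance as in this work, is given below. The function and variable names are illustrative, and the loop-based implementation favors clarity over speed.

```python
# Sketch of the residue (R) and similarity (S) scores of Eqs. (4.1)-(4.2) and the
# derived indexes (CRI, CSI, RI, SI, RSD, RSI), using the Euclidean distance.
import numpy as np

def rs_scores(X: np.ndarray, y: np.ndarray):
    """Per-sample residue and similarity scores for data X (M x N) with labels y (M,)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    d_max = D.max()
    R = np.empty(len(X))
    S = np.empty(len(X))
    for m in range(len(X)):
        same = (y == y[m])
        R[m] = D[m, ~same].sum()                    # inter-class sum of distances
        S[m] = np.mean(1.0 - D[m, same] / d_max)    # average intra-class similarity
    R /= R.max()                                    # divide by R_max so 0 <= R_m <= 1
    return R, S

def rs_indexes(R: np.ndarray, S: np.ndarray, y: np.ndarray):
    """Class indexes CRI_l, CSI_l and the global RI, SI, RSD, and RSI."""
    labels = np.unique(y)
    CRI = np.array([R[y == l].mean() for l in labels])
    CSI = np.array([S[y == l].mean() for l in labels])
    RI, SI = CRI.mean(), CSI.mean()
    return {"RI": RI, "SI": SI, "RSD": RI - SI, "RSI": 1.0 - abs(RI - SI)}
```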
Considering two boundary operators ∂t q+1 : Cq+1(Kt+p) ↦→ Cq(Kt+p), where Cq(Kt+p) is a chain group and Kt ⊂ Kt+p are simplicial complexes generated by a filtration. Denote ∂t+p q+1 q : Cq(Kt) ↦→ Cq–1(Kt) and ∂t+p q+1 such as as ðt,p |Ct,p q+1 Ct,p q+1 = {α ∈ Ct+p q+1 | ∂t+p q+1(α) ∈ Ct q}. (4.3) Namely, Ct,p q+1 consists of elements whose images under ∂t+p q+1 are in Ct q. The p-persistent q- combinatorial Laplacian operator [12] is given by q = ðt,p ∆t,p q+1(ðt,p q+1)∗ + (∂t q)∗∂t q. (4.4) The topological invariants of the corresponding persistent homology defined by the same filtration are recovered from the kernel of the persistent Laplacian Eq. (4.4) [12], βt,p q = dim ker ∂t q – dim im ðt,p q+1 = dim ker ∆t,p q . (4.5) State differently, the zero eigenvalues of the persistent Laplacian operator Eq. (4.4) give rise to the entire topological variants of the persistent homology. Then, the non-harmonic part of the spectra (i.e., the non-zero eigenvalues of the persistent Laplacian) and associated eigenvectors offer additional shape information of the underlying data. Note that for small-sized high-dimensional datasets, PSG can be directly employed to reduce the dimensionality in terms of the statistical quantities of the data spectra. The resulting spectra or their statistics can be directly used to represent the original datasets. 122 4.1.3 The shape of data Continuous FRI was defined to offer the shape of M data entries in R3 [13]. A similar idea was used to define interactive differentiable Riemannian manifolds [14]. Here, we extend these ideas to construct Grassmann manifolds Gr(N – 1, I). Let X = {x1, ..., xm, ..., xM} be a finite set of M data entries. Denote xm ∈ RN be the feature vector for the mth sample, and |x – xm| be the Euclidean distance between a point x ∈ RN to the jth sample. Let η = 1 M M ∑︁ m=1 min xj ∥xm – xj ∥ be the average minimum pairwise distance of the input data. Then, the unnormalized rigidity density at point x ∈ RN is given by µ(x) = M ∑︁ m=1 ωmΦ(∥x – xm∥; τ, η, κ), (4.6) where ωm = 1, and τ and κ are the hyperparameters of the correlation kernel Φ. Notice that we can choose an isosurface µ(x) = cµmax, which defines an (N – 1)– dimensional Riemannian manifold by the collection of points {x|x ∈ RN, µ(x) = cµmax}, (4.7) µ(x). The shape of data can be directly visualized for 2 ≤ N ≤ 3 where c ∈ (0, 1) and µmax = max x as shown in Ref. [14]. One can restrict xm to a given subset in Eq. (4.6) to compare the shape of data in different classes when the class labels are known. For further analysis, one can obtain (N – 1) independent curvatures via fundamental forms [14]. Additionally, Hodge decomposition can be applied to analyze topological connectivity (i.e., Betti numbers associated with the harmonic spectra) and non-harmonic spectra of the Hodge Laplacians of the data [15]. For evolving manifolds, the evolutionary de Rham-Hodge theory can be used to analyze the geometry and topology of data [11]. 123 4.2 Results 4.2.1 Geometric shape, Residue-Similarity (R-S), and topological analysis In this section, we compare the 3D shape, R-S plot, and topological persistence of the TCGA- PANCAN data. For the comparison, CCP was used to reduce the data to N = 3 components. The data were divided according to their true labels into 5 classes. The 5-fold cross-validation was used to obtain the predicted labels for visualization (coloring). 
For the 3D shape visualization, after the dimensionality reduction, the Gaussian surface was used to generate the volumetric representation. The Chimera [16] was used to visualize the shape of data at the isovalue of 0.1. The surface was colored according to the predicted labels. For the persistence plot, after the dimensionality reduction, the data was divided according to their true labels. The HERMES package [9] with the α complex was used to generate topological dimensions 0 (Betti-0), 1 (Betti-1), and 2 (Betti-2) curves and the corresponding smallest non- zero eigenvalue curves. Note that persistent Laplacian itself offers low-dimensional geometric and topological representations of the original high-dimensional data [17]. Figure 4.1 shows the 3 different visualizations of class 1. Notice that the shape analysis shows predominately red regions or dots mixed with misclassified labels. We can see this mixing in the R-S plot as well. The yellow points have lower R scores, indicating that these samples are more likely to be mislabeled in machine learning. We can see that blue points with low S scores are isolated in the shape visualization. and βα,0 Note that βα,0 1 2 offer the same information as persistent homology does for the 0 , βα,0 shows there are about 290 samples in this class that become fully connected at shows there are many cycles in the sample. The βα,0 indicates there 0 = 1). The βα,0 1 1 data. The βα,0 0 radius 6 ( βα,0 are at most 7 cavities. There are no topological changes in the data after radius=11. However, the smallest non-zero eigenvalue (λα,0 0 ) keeps changing as the filtration radius increases, indicating that persistent Laplacian reveals more information about the data than persistent homology does. Finally, it is noticed that most misclassified samples have relatively low R-S scores. This observation indicates the effectiveness of our R-S scores and indexes. 124 (a) Gaussian surface of TCGA- PANCAN class 1. (b) R-S plot of TCGA-PANCAN class 1. (c) Persistence of TCGA-PANCAN class 1. Figure 4.1 Shape of data, R-S and persistence visualization of TCGA-PANCAN class 1 data. CCP was used to reduce the data to N = 3. (a) Shape of data was visualized with isovalue 0.1 in ChimeraX [16]. Red color indicates the correctly classified data. (b) R-S plot of class 1. Red circle is the correct label. The x and y-axes correspond to the residue and similarity scores, respectively. (c) Visualization of the smallest non-zero eigenvalue curves along the filtration (indicated by red color) λα,0 0 , βα,0 0 , λα,0 1 , and βα,0 for class 1. HERMES package [9] with the α complex was used to calculate the harmonic and non-harmonic spectra. The x-axis is the filtration radius. The left y-axis corresponds to the βα,0 0 , βα,0 1 ,and βα,0 1 , and λα,0 from top to bottom. , and the harmonic spectral curves (indicated by blue color) βα,0 from top to bottom, and the right y-axis corresponds to λα,0 1 , and λα,0 0 , λα,0 2 2 2 2 The shape, R-S, and topological analysis of class 2 are given in Figure 4.2. The βα,0 class 2 has about 130 samples, which become fully connected near radius=10. The λα,0 shows a significant discontinuity at radius near radius=10. The βα,0 peak value. At most two cavities in data have been found in βα,0 0 0 1 at a given filtration. The shape shows about 28 cycles at its indicates curve 2 shows four major pieces and only a few samples were mislabeled by machine learning. R-S plots should have four different labels in this class. 
Most of the samples were correctly predicted, which 125 (a) Gaussian surface of TCGA- PANCAN class 2. (b) R-S plot of TCGA-PANCAN class 2. (c) Persistence of TCGA-PANCAN class 2. Figure 4.2 Shape of data, R-S and persistence visualization of TCGA-PANCAN class 2 data. CCP was used to reduce the data to N = 3. (a) Shape of data was visualized with isovalue 0.1 in ChimeraX [16]. Green color indicates the correctly classified data. (b) R-S plot of class 2. Green circle is the correct label. The x and y-axes correspond to the residue and similarity scores, respectively. (c) Visualization of the smallest non-zero eigenvalue curves along the filtration (indicated by red color) λα,0 0 , βα,0 0 , λα,0 1 , and βα,0 for class 2. HERMES package [9] with the α complex was used to calculate the harmonic and non-harmonic spectra. The x-axis is the filtration radius. The left y-axis corresponds to the βα,0 0 , βα,0 1 ,and βα,0 1 , and λα,0 from top to bottom. , and the harmonic spectral curves (indicated by blue color) βα,0 from top to bottom, and the right y-axis corresponds to λα,0 1 , and λα,0 0 , λα,0 2 2 2 2 is consistent with the shape analysis. Most class 1 labels (red ones) have lower S scores, indicating that they do not belong. Most misclassified samples have low R-S scores and are disconnected from other samples in the class. Figure 4.3 gives three types of analyses for class 3. This class has only about 78 samples, as shown by the βα,0 0 curve. At filtration radius=2, there were 11 one-dimensional holes in the data. 126 (a) Gaussian surface of TCGA- PANCAN class 3. (b) R-S plot of TCGA-PANCAN class 3. (c) Persistence of TCGA-PANCAN class 3. Figure 4.3 Shape of data, R-S and persistence visualization of TCGA-PANCAN class 3 data. CCP was used to reduce the data to N = 3. (a) Shape of data was visualized with isovalue 0.1 in ChimeraX [16]. Blue color indicates the correctly classified data. (b) R-S plot of class 3. Blue circle is the correct label. The x and y-axes correspond to the residue and similarity scores, respectively. (c) Visualization of the smallest non-zero eigenvalue curves along the filtration (indicated by red color) λα,0 0 , λα,0 1 , and λα,0 1 , and βα,0 for class 3. HERMES package [9] with the α complex was used to calculate the harmonic 2 and non-harmonic spectra. The x-axis is the filtration radius. The left y-axis corresponds to the βα,0 0 , βα,0 1 ,and βα,0 1 , and λα,0 from top to bottom. , and the harmonic spectral curves (indicated by blue color) βα,0 from top to bottom, and the right y-axis corresponds to λα,0 0 , λα,0 0 , βα,0 2 2 2 There were only two cavities found by βα,0 at isovalue 0.1 but merge at radius 7 as detected by βα,0 ones, as shown by the shape and R-S plots. The λα,0 topological persistence (βα,0 1 2 . The shape plot indicates most samples are disconnected 0 . The yellow labels are close to the red curve demonstrates a few discontinuities after 1 ) becomes flat, indicating important homotopic events in the data. As in other classes, most misclassified samples have relatively low R-S scores. In Figure 4.4, the βα,0 0 curve suggests 150 samples in class 4 (yellow). Some samples are 127 (a) Gaussian surface of TCGA- PANCAN class 4. (b) R-S plot of TCGA-PANCAN class 4. (c) Persistence of TCGA-PANCAN class 4. Figure 4.4 Shape of data, R-S and persistence visualization of TCGA-PANCAN class 4 data. CCP was used to reduce the data to N = 3. (a) Shape of data was visualized with isovalue 0.1 in ChimeraX [16]. Yellow color indicates the correctly classified data. 
(b) R-S plot of class 4. Yellow circles indicate the correct label. The x and y axes correspond to the residue and similarity scores, respectively. (c) Persistence of TCGA-PANCAN class 4: visualization of the smallest non-zero eigenvalue curves along the filtration (indicated by red color) $\lambda_0^{\alpha,0}$, $\lambda_1^{\alpha,0}$, and $\lambda_2^{\alpha,0}$, and the harmonic spectral curves (indicated by blue color) $\beta_0^{\alpha,0}$, $\beta_1^{\alpha,0}$, and $\beta_2^{\alpha,0}$ for class 4. The HERMES package [9] with the $\alpha$ complex was used to calculate the harmonic and non-harmonic spectra. The x-axis is the filtration radius. The left y-axis corresponds to $\beta_0^{\alpha,0}$, $\beta_1^{\alpha,0}$, and $\beta_2^{\alpha,0}$ from top to bottom, and the right y-axis corresponds to $\lambda_0^{\alpha,0}$, $\lambda_1^{\alpha,0}$, and $\lambda_2^{\alpha,0}$ from top to bottom.

Some samples are misclassified as class 1 (red), class 3 (blue), and class 5 (purple), as shown in the shape and R-S plots. The topological persistence indicates many topological invariants along the filtration axis, which can be a faithful representation of the data [17]. Specifically, all data points overlap at radius 5, as shown by $\beta_0^{\alpha,0}$. All cycles disappear after radius 7, as revealed by $\beta_1^{\alpha,0}$; the last cycle persists from radius 6 to 7. The $\beta_2^{\alpha,0}$ curve becomes flat at radius 4. However, $\lambda_0^{\alpha,0}$ still indicates a discontinuity at radius 7. The misclassified red samples show low S scores.

Figure 4.5 Shape of data, R-S and persistence visualization of TCGA-PANCAN class 5 data. CCP was used to reduce the data to N = 3. (a) Gaussian surface of TCGA-PANCAN class 5. The shape of data was visualized with isovalue 0.1 in ChimeraX [16]; purple color indicates the correctly classified data. (b) R-S plot of class 5. Purple circles indicate the correct label. The x and y axes correspond to the residue and similarity scores, respectively. (c) Persistence of TCGA-PANCAN class 5: visualization of the smallest non-zero eigenvalue curves along the filtration (indicated by red color) $\lambda_0^{\alpha,0}$, $\lambda_1^{\alpha,0}$, and $\lambda_2^{\alpha,0}$, and the harmonic spectral curves (indicated by blue color) $\beta_0^{\alpha,0}$, $\beta_1^{\alpha,0}$, and $\beta_2^{\alpha,0}$ for class 5. The HERMES package [9] with the $\alpha$ complex was used to calculate the harmonic and non-harmonic spectra. The x-axis is the filtration radius. The left y-axis corresponds to $\beta_0^{\alpha,0}$, $\beta_1^{\alpha,0}$, and $\beta_2^{\alpha,0}$ from top to bottom, and the right y-axis corresponds to $\lambda_0^{\alpha,0}$, $\lambda_1^{\alpha,0}$, and $\lambda_2^{\alpha,0}$ from top to bottom.

Figure 4.5 illustrates our shape, R-S, and topological analyses of class 5. Although $\beta_0^{\alpha,0}$ indicates there are only about 140 samples, class 5 is very rich in its topological persistence. The $\beta_0^{\alpha,0}$ curve shows that the data points did not connect before the filtration radius reached 12. The $\beta_1^{\alpha,0}$ curve indicates a large one-dimensional hole from radius 13.5 to 15. The $\beta_2^{\alpha,0}$ curve shows 9 short-living cavities in the data. It is clear from the R-S plot that samples having low R-S scores are more likely to be mislabeled.

Figure 4.6 Shape of class 5 of the TCGA-PANCAN dataset visualized at multiple scales in ChimeraX [16] as the isovalue is varied: (a) isovalue 0.235, (b) isovalue 0.139, (c) isovalue 0.0963, and (d) isovalue 0.0426. The different colors indicate the predicted labels, and purple is the true label of class 5.

Because Figure 4.5's persistence plot indicates interesting topological features, class 5 was further visualized in Figure 4.6 with varying isovalues, i.e., scales. From isovalue 0.235 to 0.139, two holes form in the bottom right corner, and another hole begins to form in the bottom center.
From isovalue 0.139 to 0.0963, the two holes are no longer visible, but the hole that was forming is now completed; this corresponds to $\beta_1^{\alpha,0}$ in Figure 4.5. The voids detected by $\beta_2^{\alpha,0}$ are short-lived and are not visible in the isosurface. Decreasing the isovalue further to 0.0426 shows the combination of the two main parts of the data, which stabilizes the structure. This corresponds to the decrease of $\beta_0^{\alpha,0}$, because the number of connected components is decreasing, while the increase in $\lambda_0^{\alpha,0}$ indicates that the structure is more stable.

4.3 Discussion

4.3.1 R-S plot vs 2D plot

The R-S plot is an effective tool for visualizing classification performance in general. Figure 4.7 compares the R-S plot and the traditional 2D visualization of the Coil-20 dataset when the dimension is reduced to 2 by UMAP. For the traditional 2D plot, each data point was colored by the ground truth; for the R-S plot, each section represents one of the 20 classes, and data points were colored by the predicted labels from the k-NN classification. In the traditional 2D visualization, labels 3, 9, and 19 are located in the same region. It is interesting to see that this situation is reflected in our R-S plot as three mixed-up labels. In the R-S plot, labels 3, 9, and 19 have a high similarity score but a low residue score, meaning that the data points are not well separated among different classes, which shows the limitation of preserving the local structure of high-dimensional data in a 2D representation. Essentially, some data lie in an intrinsically high-dimensional space that cannot be well described in 2D. Note that 2D plots work best when the data dimension is reduced to 2, whereas R-S plots can be applied to arbitrarily high dimensions.

Figure 4.7 (a) R-S plot of the Coil-20 dataset when reduced by UMAP to N = 2. Each section represents a different class, and the data points were colored according to their predicted labels from k-NN classification via 10-fold cross-validation. (b) Traditional 2D plot of the Coil-20 dataset reduced to N = 2 by UMAP. The data points were colored according to the ground truth.

BIBLIOGRAPHY

[1] William M Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.

[2] Patrizio Frosini. Measuring shapes by size functions. In Intelligent Robots and Computer Vision X: Algorithms and Techniques, volume 1607, pages 122–133. International Society for Optics and Photonics, 1992.

[3] Herbert Edelsbrunner, David Letscher, and Afra Zomorodian. Topological persistence and simplification. In Proceedings 41st Annual Symposium on Foundations of Computer Science, pages 454–463. IEEE, 2000.

[4] Afra Zomorodian and Gunnar Carlsson. Computing persistent homology. Discrete & Computational Geometry, 33(2):249–274, 2005.

[5] Gunnar Carlsson. Topology and data. Bulletin of the American Mathematical Society, 46(2):255–308, 2009.

[6] Konstantin Mischaikow and Vidit Nanda. Morse theory for filtrations and efficient computation of persistent homology. Discrete & Computational Geometry, 50(2):330–353, 2013.

[7] K. L. Xia and G. W. Wei. Persistent homology analysis of protein structure, flexibility and folding. International Journal for Numerical Methods in Biomedical Engineering, 30:814–844, 2014.
[8] Jacob Townsend, Cassie Putman Micucci, John H Hymel, Vasileios Maroulas, and Konstantinos D Vogiatzis. Representation of molecular structures with persistent homology for machine learning applications in chemistry. Nature Communications, 11(1):1–9, 2020.

[9] Rui Wang, Rundong Zhao, Emily Ribando-Gros, Jiahui Chen, Yiying Tong, and Guo-Wei Wei. HERMES: Persistent spectral graph software. Foundations of Data Science, 3(1):67, 2021.

[10] Facundo Mémoli, Zhengchao Wan, and Yusu Wang. Persistent Laplacians: Properties, algorithms and implications. SIAM Journal on Mathematics of Data Science, 2022.

[11] Jiahui Chen, Rundong Zhao, Yiying Tong, and Guo-Wei Wei. Evolutionary de Rham-Hodge method. Discrete and Continuous Dynamical Systems, Series B, 26(7):3785, 2021.

[12] Rui Wang, Duc Duy Nguyen, and Guo-Wei Wei. Persistent spectral graph. International Journal for Numerical Methods in Biomedical Engineering, 36(9):e3376, 2020.

[13] Kelin Xia, Kristopher Opron, and Guo-Wei Wei. Multiscale multiphysics and multidomain models—flexibility and rigidity. The Journal of Chemical Physics, 139(19):11B614_1, 2013.

[14] Duc Duy Nguyen and Guo-Wei Wei. DG-GL: Differential geometry-based geometric learning of molecular datasets. International Journal for Numerical Methods in Biomedical Engineering, 35(3):e3179, 2019.

[15] Rundong Zhao, Menglun Wang, Jiahui Chen, Yiying Tong, and Guo-Wei Wei. The de Rham–Hodge analysis and modeling of biomolecules. Bulletin of Mathematical Biology, 82(8):1–38, 2020.

[16] Eric F Pettersen, Thomas D Goddard, Conrad C Huang, Elaine C Meng, Gregory S Couch, Tristan I Croll, John H Morris, and Thomas E Ferrin. UCSF ChimeraX: Structure visualization for researchers, educators, and developers. Protein Science, 30(1):70–82, 2021.

[17] Jiahui Chen, Yuchi Qiu, Rui Wang, and Guo-Wei Wei. Persistent Laplacian projected Omicron BA.4 and BA.5 to become new dominating variants. Computers in Biology and Medicine, 151:106262, 2022.

CHAPTER 5
CORRELATED CLUSTERING AND PROJECTION

5.1 Introduction

In this work, we propose a two-step data-domain method that seeks an optimal clustering, in terms of a distance describing intrinsic feature correlations among samples, to divide I feature vectors into N correlated clusters and then nonlinearly projects the correlated features in each cluster into a single descriptor by using the Flexibility Rigidity Index (FRI) [1], which results in a low-dimensional representation of the original data. Additionally, the complex global correlations among samples are embedded into the samples' local representations during the FRI-based nonlinear projection $\mathbb{R}^I \to \mathbb{R}^N$. To gain computational efficiency, one may further compute the pairwise correlation matrix of samples and impose a cutoff distance to avoid the global summation during the projection. The resulting method, called Correlated Clustering and Projection (CCP), compares favorably with other DR algorithms in the following aspects. (1) CCP does not involve matrix diagonalization and thus can handle the dimensionality reduction of large sample sizes. (2) CCP exploits statistical measures, such as (distance) covariance, to quantify the high-level dependence between random feature vectors, rendering a stable algorithm when dealing with high-dimensional data. (3) CCP is flexible with respect to the targeted dimension N because the partition of features is based on N, whereas other methods may rely on min(M, N).
The performance of CCP is stable with respect to the increase of N, which is important for datasets with high or moderately high intrinsic dimensions. In contrast, many existing methods stop working when the intrinsic data dimension is moderately high. (4) CCP is stable with respect to subsampling, which allows continuously adding new samples to a pre-existing dataset without the need to restart the calculation from the very beginning and thus is advantageous for continuous data acquisition, collection, and analysis. This capability is valuable when the transient data are too expensive to be kept permanently, e.g., molecular dynamics simulations. Additionally, this subsampling property enables parameter optimization using a small amount of data in case of a large data size. (5) As a data-domain method, CCP can be combined with a frequency-domain method, such as UMAP or t-SNE, for a secondary dimensionality reduction to better preserve global structures of data and achieve higher accuracy. (6) Finally, the performance of CCP is validated on several benchmark classification datasets: Leukemia, Carcinoma, ALL-AML, TCGA-PANCAN, Coil-20, Coil-100, and Smallnorb, based on various traditional algorithms such as k-NN, support vector machine, random forest, and gradient boost decision tree. In all cases, CCP is very competitive with the state-of-the-art algorithms.

Additionally, we have also proposed a new method, called Residue-Similarity (R-S) scores or the R-S plot, for the performance visualization of unsupervised clustering and supervised classification algorithms. Although the Receiver Operating Characteristic (ROC) curve and the Area Under the ROC Curve (AUC) are typically used for the performance visualization of binary classes, they are not convenient for multiple classes. The proposed R-S scores can be used for visualizing the performance for an arbitrary number of classes. Finally, the R index, S index, R-S disparity, and total R-S index are proposed to characterize clustering and classification results.

Recent years have witnessed the growth of Topological Data Analysis (TDA) via persistent homology [2, 3, 4, 5, 6, 7, 8] in data sciences. It can be used to analyze the topological invariants of the R-S scores. However, persistent homology is insensitive to the homotopic shape evolution of data during filtration. We introduce a topological Laplacian, the Persistent Spectral Graph (PSG) [9], to capture the homotopic shape of data in addition to topological invariants. Note that TDA and PSG are dimensionality reduction algorithms that can generate low-dimensional representations of the original high-dimensional data [10, 11]. To further analyze the shape of data, we transform point cloud data into a Grassmann manifold representation by using FRI [1]. When N = 3, the 2-manifold shape of data can be directly visualized. Such shape of data can be further analyzed by differential geometry apparatuses, including curvatures [12], Hodge decomposition [13], and evolutionary de Rham-Hodge theory [14].

5.2 Methods and Algorithms

Let $Z := \{z^i_m\}_{m=1,i=1}^{M,I}$, with M and I being the number of input data entries (i.e., samples) and the number of features for each data entry, respectively. Our goal is to find an N-dimensional representation of the original data, denoted as $X := \{x^i_m\}_{m=1,i=1}^{M,N}$, such that $1 \le N \ll I$, by using a data-domain two-step clustering-projection strategy.

5.2.1 Feature clustering

Let $Z = \{z^1, \ldots, z^i, \ldots, z^I\}$ be the set of data, where $z^i \in \mathbb{R}^M$ represents the ith feature vector of the data.
The objective is to partition the feature vectors into N parts, where $1 \le N \ll I$ is a preselected reduced feature dimension. To this end, we find an optimal disjoint partition of the data $Z := \uplus_{n=1}^{N} Z_n$ for a given N, where $Z_n$ is the nth partition (cluster) of the features. To seek the optimal partition, we first analyze the correlation among the feature vectors $z^i$. A variety of correlation measures can be used for this purpose. We discuss two standard approaches.

Covariance distance. First, we consider an $I \times I$ normalized covariance matrix with components
$$\rho(z^i, z^j) = \frac{\mathrm{Cov}(z^i, z^j)}{\sigma(z^i)\,\sigma(z^j)}, \quad 1 \le i, j \le I, \tag{5.1}$$
where $\mathrm{Cov}(z^i, z^j)$ is the covariance of $z^i$ and $z^j$, and $\sigma(z^i)$ and $\sigma(z^j)$ are the standard deviations of $z^i$ and $z^j$, respectively. We set negative covariances to zero and subtract from 1 to obtain a covariance distance between feature vectors,
$$\|z^i - z^j\|_{\mathrm{dCov}} = \begin{cases} 1 - \rho(z^i, z^j), & \rho(z^i, z^j) > 0, \\ 1, & \text{otherwise}. \end{cases} \tag{5.2}$$
Note that covariance distances lie in the range $[0, 1]$ for all pairs of vectors $z^i$ and $z^j$. Highly correlated feature vectors have covariance distances close to 0, while uncorrelated feature vectors have covariance distances close to 1.

Correlation distance. Alternatively, one may consider the correlation distance defined via the distance correlation [15]. First, one computes a distance matrix for each vector $z^i$, $i = 1, 2, \ldots, I$,
$$a^i_{mk} = \|z^i_m - z^i_k\|, \quad m, k = 1, 2, \ldots, M, \tag{5.3}$$
where $\|\cdot\|$ denotes the Euclidean norm. Define the doubly centered distances for vector $z^i$,
$$A^i_{mk} := a^i_{mk} - \bar{a}^i_{m\cdot} - \bar{a}^i_{\cdot k} + \bar{a}^i_{\cdot\cdot}, \tag{5.4}$$
where $\bar{a}^i_{m\cdot}$ is the mth row mean, $\bar{a}^i_{\cdot k}$ is the kth column mean, and $\bar{a}^i_{\cdot\cdot}$ is the grand mean of the distance matrix for vector $z^i$. For a pair of vectors $(z^i, z^j)$, the squared distance covariance is given by
$$\mathrm{dCov}^2(z^i, z^j) := \frac{1}{M^2} \sum_{m=1}^{M}\sum_{k=1}^{M} A^i_{mk} A^j_{mk}. \tag{5.5}$$
The distance correlation between vectors $(z^i, z^j)$ is given by
$$\mathrm{dCor}(z^i, z^j) := \frac{\mathrm{dCov}^2(z^i, z^j)}{\mathrm{dCov}(z^i, z^i)\,\mathrm{dCov}(z^j, z^j)}. \tag{5.6}$$
We define a correlation distance between vectors $z^i$ and $z^j$ as
$$\|z^i - z^j\|_{\mathrm{dCor}} = 1 - \mathrm{dCor}(z^i, z^j). \tag{5.7}$$
The correlation distance takes values in the range $0 \le \|z^i - z^j\|_{\mathrm{dCor}} \le 1$. It gives $\|z^i - z^j\|_{\mathrm{dCor}} = 1$ if $z^i$ and $z^j$ are uncorrelated or independent. When $z^i$ and $z^j$ depend linearly on each other, one has $\|z^i - z^j\|_{\mathrm{dCor}} = 0$.

Correlated clustering. Feature partition can be achieved with a variety of clustering methods. Here, as an example, we utilize a modified K-medoids method to perform the partition in a minimization process. Certainly, other K-means-type algorithms, including the BFR algorithm, centroidal Voronoi tessellation, k q-flats, K-means++, etc., can be utilized for our feature partition as well. For a pre-selected N, we begin by randomly selecting N medoids $\{m_n\}_{n=1}^{N}$ and assigning each vector to its nearest medoid, which gives rise to the initial partition $\{Z_n\}_{n=1}^{N}$. Second, we denote the closest vector to the center of the nth partition $Z_n$ as the new medoid $\{m_n \in Z_n\}_{n=1}^{N}$. We then reassign each vector to its nearest medoid, resulting in a new partition $\{Z_n\}_{n=1}^{N}$ that reduces the loss function, i.e., the accumulated distance. The process is repeated until $\{Z_n\}_{n=1}^{N}$ is optimized with respect to a specific distance definition,
$$\arg\min_{\{Z_1, \ldots, Z_n, \ldots, Z_N\}} \sum_{n=1}^{N} \sum_{z^i \in Z_n} \|z^i - m_n\|, \tag{5.8}$$
where $\|\cdot\|$ is either the covariance distance or the correlation distance. In comparison, the covariance distance is easy to compute, while the correlation distance can deal with complex nonlinear high-level correlations among feature vectors and samples.
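As a concrete illustration of this clustering step, the sketch below computes the covariance distance of Eq. (5.2) between feature vectors and partitions them with the K-medoids implementation in scikit-learn-extra (listed in Table 5.1). The function names and default values are illustrative assumptions; the dissertation's own CCP implementation may differ in details such as initialization and tie-breaking.

```python
import numpy as np
from sklearn_extra.cluster import KMedoids  # scikit-learn-extra, cf. Table 5.1

def covariance_distance(Z):
    """Covariance distance between feature vectors, Eq. (5.2).

    Z : (M, I) data matrix with M samples and I features.
    Returns an (I, I) distance matrix; negative correlations are clipped to zero first.
    """
    rho = np.corrcoef(Z, rowvar=False)   # normalized covariance matrix, Eq. (5.1)
    rho = np.clip(rho, 0.0, None)        # set negative covariances to zero
    return 1.0 - rho

def cluster_features(Z, n_clusters, random_state=0):
    """Partition the I feature vectors into N correlated clusters, cf. Eq. (5.8)."""
    D = covariance_distance(Z)
    km = KMedoids(n_clusters=n_clusters, metric="precomputed",
                  random_state=random_state).fit(D)
    # labels[i] = n means feature i belongs to cluster Z_n;
    # inertia_ is the accumulated distance (the loss in Eq. (5.8)).
    return km.labels_, km.inertia_

# Example usage on random data with M = 100 samples and I = 500 features.
if __name__ == "__main__":
    Z = np.random.default_rng(0).normal(size=(100, 500))
    labels, loss = cluster_features(Z, n_clusters=8)
    print("cluster sizes:", np.bincount(labels), "loss:", round(loss, 2))
```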
Note that many other metrics can be used too. For a given N, the minimization partitions similar feature vectors into N clusters, which provides the foundation for further projections. Our next goal is to project the original I-dimensional dataset Z into an N-dimensional representation X according to the partition result.

5.2.2 Feature projection

Flexibility Rigidity Index (FRI). In this section, we review the Flexibility Rigidity Index (FRI) [1]. Let $\{z_1, \ldots, z_m, \ldots, z_M\}$ be the input dataset, where $z_m \in \mathbb{R}^I$. Denote by $\|z_i - z_j\|$ some metric between the ith and jth data entries. The correlations between data entries are computed as
$$C_{ij} = \Phi(\|z_i - z_j\|; \eta, \tau, \kappa), \quad 1 \le i, j \le M, \tag{5.9}$$
where $\Phi$ is the correlation kernel and $\tau, \eta, \kappa > 0$ are the parameters of the kernel. Commonly used metrics include the Euclidean distance, the Manhattan distance, the Wasserstein distance, etc. The correlation kernel is a real-valued, smooth, monotonically decreasing function satisfying the two properties
$$\Phi(\|z_i - z_j\|; \eta, \tau, \kappa) \to 1 \quad \text{as } \|z_i - z_j\| \to 0,$$
$$\Phi(\|z_i - z_j\|; \eta, \tau, \kappa) \to 0 \quad \text{as } \|z_i - z_j\| \to \infty.$$
A popular choice for such functions is a radial basis function. For example, one may use the generalized exponential function
$$\Phi(\|z_i - z_j\|; \eta, \tau, \kappa) = \begin{cases} e^{-(\|z_i - z_j\|/\tau\eta)^{\kappa}}, & \|z_i - z_j\| < r_c, \\ 0, & \text{otherwise}, \end{cases} \tag{5.10}$$
or the generalized Lorentz kernel
$$\Phi(\|z_i - z_j\|; \eta, \tau, \kappa) = \begin{cases} \dfrac{1}{1 + (\|z_i - z_j\|/\tau\eta)^{\kappa}}, & \|z_i - z_j\| < r_c, \\ 0, & \text{otherwise}, \end{cases} \tag{5.11}$$
where $\kappa$ is the power, $\tau$ is the multiscale parameter, $\eta$ is a scale to be computed from the given data, and $r_c$ is the cutoff distance, which is useful for certain data structures to reduce the computational complexity [16]. In the context of t-SNE, $\eta$ would play the role of the perplexity, and in UMAP, $\eta$ would define the geodesic of the Riemannian metric. We can construct a correlation matrix $C = \{C_{ij}\}$, which reveals the topological connectivity between samples [1]. We can also view such a correlation map as a weighted graph [9, 17], where $r_c$ is the cutoff distance. In order to understand the connectivity, we choose $\eta$ to be the average minimum distance between the data entries,
$$\eta = \frac{\sum_{m=1}^{M} \min_{z_j \ne z_m} \|z_m - z_j\|}{M}.$$
Using the correlation function, we can define the rigidity of $z_i$ as
$$\mu_i = \sum_{m=1}^{M} \omega_{im}\,\Phi(\|z_i - z_m\|; \eta, \tau, \kappa),$$
where $\omega_{im}$ are the weights. Here, we set $\omega_{im} = 1$ for all i and m. From the graph perspective, one can also view $\mu_i$ as the degree of node $z_i$.

Correlated projection. In this subsection, we employ FRI for the correlative dimensionality reduction of the input dataset $\{z_1, \ldots, z_m, \ldots, z_M\}$, where $z_m \in \mathbb{R}^I$, leading to a low-dimensional representation $\{x_1, \ldots, x_m, \ldots, x_M\}$ with $x_m \in \mathbb{R}^N$. The FRI reduction captures the intrinsic correlation among samples. Recall that $\{Z_n\}_{n=1}^{N}$ is the optimal partition of feature vectors from the K-medoids or another clustering method. Let $S = \{1, \ldots, I\}$ be the whole set of indices of the feature vectors, and let $S_n = \{i \mid z^i \in Z_n\}$, so that $S = \uplus_{n=1}^{N} S_n$. We define $z^{S_n}_m$ as the mth input data entry restricted to the nth collection of feature indices $S_n$, i.e., $z^{S_n}_m := \{z^i_m \mid i \in S_n\}$. We can now define the nth correlation matrix $\{C^n_{ij}\}_{i,j=1,\ldots,M}$ associated with the subset $S_n$ of features,
$$C^n_{ij} = \Phi_n(\|z^{S_n}_i - z^{S_n}_j\|; \eta_n, \tau, \kappa), \quad 1 \le i, j \le M, \ 1 \le n \le N, \tag{5.12}$$
where $\Phi_n$ is the radial basis kernel for the nth grouping.
For example, one may choose
$$\Phi_n(\|z^{S_n}_i - z^{S_n}_j\|; \eta_n, \tau, \kappa) = \begin{cases} e^{-\left(\|z^{S_n}_i - z^{S_n}_j\|/\tau\eta_n\right)^{\kappa}}, & \|z^{S_n}_i - z^{S_n}_j\| < r^n_c, \\ 0, & \text{otherwise}, \end{cases} \tag{5.13}$$
or
$$\Phi_n(\|z^{S_n}_i - z^{S_n}_j\|; \eta_n, \tau, \kappa) = \begin{cases} \dfrac{1}{1 + \left(\|z^{S_n}_i - z^{S_n}_j\|/\tau\eta_n\right)^{\kappa}}, & \|z^{S_n}_i - z^{S_n}_j\| < r^n_c, \\ 0, & \text{otherwise}, \end{cases} \tag{5.14}$$
where the truncation distance $r^n_c$ can be set to 2 or 3 standard deviations, and $\eta_n$ is set to
$$\eta_n = \frac{\sum_{m=1}^{M} \min_{z^{S_n}_j \ne z^{S_n}_m} \|z^{S_n}_m - z^{S_n}_j\|}{M}.$$
Then, we can project the data to an N-dimensional representation by taking the rigidity function defined by the correlation kernels,
$$x^n_i = \sum_{m=1}^{M} \omega^n_{im}\,\Phi_n(\|z^{S_n}_i - z^{S_n}_m\|; \eta_n, \tau, \kappa), \quad n = 1, 2, \ldots, N; \ i = 1, 2, \ldots, M, \tag{5.15}$$
where $\omega^n_{im}$ are the weights associated with $\Phi_n$ for the nth cluster and can be set to 1. Moreover, the mth data entry in the reduced N-dimensional representation is the vector $x_m = (x^1_m, \ldots, x^n_m, \ldots, x^N_m)^T$.

We also propose to improve the computational efficiency of Eq. (5.15) by avoiding the global summation. This can be easily done as follows. First, we construct an $M \times M$ global distance matrix of the samples to obtain the nearest neighbors of each sample. Then, we use the cell lists algorithm with the cutoff value to replace the global summation in Eq. (5.15) by considering only a few nearest neighbors [16]. This approach significantly reduces the memory requirement. Since the projections of $x^n_i$ for different i and n are independent of each other, massively parallel computations can be used for large datasets.

5.3 Results

In this section, we numerically explore CCP's performance on a variety of high-dimensional benchmark test datasets. For each dataset, we use 10 random seeds for 5-fold or 10-fold cross-validations, depending on the number of samples of the data. In order to validate the effectiveness of CCP, we compare it with Uniform Manifold Approximation and Projection (UMAP), Principal Component Analysis (PCA), Locally Linear Embedding (LLE), and Isomap. For metric-based embedding, the Euclidean distance was used. All parameters were set to their defaults, according to the packages outlined in Table 5.1. In order to test the effectiveness of the dimensionality reduction, a k-nearest neighbor classifier (k-NN) was used. Table 5.1 shows the versions of the packages used in our comparison. In order to visualize the accuracy, R-S scores were utilized. In R-S plots, the residue and similarity scores were represented as the x and y axes, respectively, and the data points were colored according to the predicted labels from the classification results.

Table 5.1 Python packages used for the dimensionality reduction and benchmark tests: Python v3.8.5, NumPy v1.19.2, scikit-learn v0.23.2, scikit-learn-extra v0.2.0, sklearn v0.0, and umap-learn v0.5.1.

5.3.1 Datasets

We test CCP and several other commonly used algorithms on benchmark datasets, including Leukemia, ALL-AML, Carcinoma, Arcene, TCGA-PANCAN, Coil-20, Coil-100, and Smallnorb. Table 5.2 summarizes the datasets used in the present work; (M, I, L) denotes the number of samples, the number of features, and the number of classes, respectively.

Table 5.2 Datasets used in the benchmark tests.
Leukemia [18], (M, I, L) = (72, 7070, 2): Microarray dataset of Leukemia. The data contains 72 samples, each with 7070 gene expressions.
Carcinoma [19, 20], (174, 9182, 11): Microarray dataset of human carcinomas. The original data [19] contains 12533 genes, which were processed to 9182 dimensions in Ref. [20].
ALL-AML [21], (72, 7129, 2): Cancer classification dataset based on gene expressions measured by DNA microarrays for acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL).
TCGA-PANCAN [22], (801, 20531, 5): Gene expression dataset. Part of the RNA-seq (HiSeq) PANCAN data, where expressions of 5 different types of tumors were extracted.
Coil-20 [23], (1440, 16384, 20): Image classification dataset with 1440 images. Each image has size 128 × 128 = 16384, where 20 objects are captured at 72 angles. Each image was treated as a vector of length 16384.
Coil-100 [24], (7200, 49152, 100): Image classification dataset of 7200 images. Each image has size 128 × 128 = 16384 with 3 channels, where 100 objects are captured at 72 angles. Each image was treated as a vector of length 49152.
Smallnorb [25], (24300, 18432, 5): Image classification dataset with 5 generic categories: four-legged animals, human figures, airplanes, trucks, and cars. Each object was taken from a variety of radial and azimuthal angles. Each sample consists of 2 images, the left and the right views, both of size 96 × 96. Both images were flattened to create a vector of length 2 × 96 × 96.

5.3.2 Validation

5.3.2.1 Clustering analysis

Since CCP uses clustering to partition features based on correlations, it is vital to analyze the clustering effectiveness. One of the best ways to evaluate the effectiveness of the clustering from k-medoids and/or k-means is to compute the loss function with respect to a varying number k of clusters. Then, using the elbow method, one finds the inflection point of the loss function and sets the corresponding k to be the optimal number of clusters. In this work, we present another method for visualizing the feature clusters using our R-S scores.

Figure 5.1 shows the loss function of the k-medoids feature clustering given in Eq. (5.8) for the Coil-20 dataset for N = 2 to 200. In addition, for N = 4, 16, 36, and 64, the clustering results were visualized using R-S scores. The red circles on the loss function correspond to N = 4, 16, 36, and 64, plotted in the four charts. Each section in a given chart represents one cluster, and the x and y axes represent the residue and similarity scores, respectively, for the cluster. For N = 4, the cluster colored in blue has a low similarity score, indicating that the data are spread out within the cluster; this indicates that a larger N is needed. From N = 36 to N = 64, the loss function has an elbow, indicating that these numbers of clusters are appropriate. The R-S plots in these two charts are more compact, indicating that the clustering performs well.

Figure 5.1 Illustration of the partition and R-S visualization of the 16384 features of the Coil-20 dataset into different numbers (N) of clusters. The line is the loss function defined in Eq. (5.8) with respect to different N values ranging from 2 to 200. Four red circles indicate N = 4, 16, 36, and 64, for which the corresponding R-S charts are given from left to right. Each section of the charts represents an individual feature cluster, with the x and y axes being the residue and similarity scores, respectively. The R-S indices for N = 4, 16, 36, and 64 are 0.842, 0.887, 0.952, and 0.992, respectively.

As speculated earlier, the R-S disparity may correlate with the convergence of clustering. To verify this speculation, we have computed the R-S disparity for the feature clustering results at N = 4, 16, 36, and 64. We found that $\mathrm{RSD}_{N=4} = 0.158$, $\mathrm{RSD}_{N=16} = 0.113$, $\mathrm{RSD}_{N=36} = 0.048$, and $\mathrm{RSD}_{N=64} = -0.008$. Therefore, the R-S disparity correlates strongly with the loss function of the k-medoids clustering shown in Figure 5.1.
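A minimal sketch of the loss-function scan behind the elbow analysis of Figure 5.1 is given below; it reuses the K-medoids clustering of Section 5.2.1 on a precomputed feature-feature distance matrix. The candidate values of N and the elbow heuristic (largest discrete second difference of the loss) are illustrative choices, not the exact procedure used to produce the figure.

```python
import numpy as np
from sklearn_extra.cluster import KMedoids

def kmedoids_loss_curve(D, candidate_N, random_state=0):
    """K-medoids loss of Eq. (5.8) as a function of the number of feature clusters N.

    D : (I, I) precomputed feature-feature distance matrix (e.g., covariance distance).
    Returns a list of (N, loss) pairs that can be plotted for the elbow analysis.
    """
    curve = []
    for N in candidate_N:
        km = KMedoids(n_clusters=N, metric="precomputed",
                      random_state=random_state).fit(D)
        curve.append((N, km.inertia_))
    return curve

def pick_elbow(curve):
    """Heuristic elbow: the N with the largest discrete second difference of the loss."""
    Ns = np.array([n for n, _ in curve], dtype=float)
    losses = np.array([l for _, l in curve])
    second_diff = np.diff(losses, 2)     # one value per interior N
    return int(Ns[1:-1][np.argmax(second_diff)])

# Example usage with a symmetric random matrix standing in for a feature distance matrix.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((200, 200))
    D = (A + A.T) / 2.0
    np.fill_diagonal(D, 0.0)
    curve = kmedoids_loss_curve(D, candidate_N=range(2, 40, 4))
    print("suggested N:", pick_elbow(curve))
```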
One of the advantages of the K-medoids method is that the cluster centers are always actual feature vectors. Since each medoid is one of the feature vectors, the method is much more efficient when dealing with high-dimensional data. Other clustering methods, such as k-means, spectral clustering, DBSCAN, and hierarchical clustering, can be utilized for the clustering as well.

5.3.2.2 Partition scheme evaluation

In order to explore the effectiveness of different partitions, we compare results obtained with three feature partitions: the correlation partition, the random equal partition, and the equal variance partition. In the random equal partition, the features are randomly shuffled and split into N equal-sized clusters (i.e., the N dimensions in CCP). Therefore, each cluster has the same number of features, which will be projected into a one-dimensional representation in CCP. In the equal variance partition, the features are normalized with respect to the largest variance and ordered, and are then split into N clusters such that all clusters have a similar amount of variance. In this partition, the first cluster contains the largest number of low-variance features, whereas the last cluster, cluster N, contains the smallest number of high-variance features. Notice that in the correlation partition, the numbers of features in the clusters also vary and are determined by minimization according to Eq. (5.8). The Leukemia and Carcinoma datasets were used to compare the three feature partition schemes. For both tests, 5-fold cross-validations with 10 random seeds were used for the dimensionality reduction, and k-NN was used to obtain the classification accuracy. For each fold of the partition, all results attained from the 10 seeds were included to evaluate the partition schemes.

Figure 5.2 Comparing the CCP-based classification effectiveness using the correlation partition, random equal partition, and equal variance partition of the features of the Leukemia dataset. (a) Comparison of the correlation partition, random equal partition, and equal variance partition-based dimensionality reduction and classification for the Leukemia dataset of 7070 features. The accuracies are computed from k-NN classifications using the reduced N features generated from CCP with the three partitions. (b) R-S plots of clusters generated from CCP-based classification using the correlation partition, equal variance partition, and random equal partition of the Leukemia dataset at N = 18. For FRI, the exponential kernel with κ = 2 and τ = 1.0 was used. For each test, results from all 10 seeds were plotted. From left to right: R-S plots of the correlation partition, equal variance partition, and random equal partition. The x-axis is the residue score, and the y-axis is the similarity score. Each section corresponds to one cluster, and the data were colored according to the predicted labels from the k-NN classification.

Figure 5.2(a) shows the accuracy of the CCP-based classification of the Leukemia dataset under various CCP reduced dimensions N equipped with the three feature partition schemes. The correlation partition outperforms both the random equal and equal variance partitions over all N values. In addition, as shown in Figure 5.2(b), in the R-S plots, the correlation partition outperforms the other two partitions in each cluster at N = 18. In particular, the random equal partition and the equal variance partition do not work well in classifying label 2.
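For reference, the two baseline partitions compared here can be generated as in the sketch below. The helper names are hypothetical, and the greedy sweep in the equal variance partition is one plausible reading of the description above (features normalized by the largest variance, ordered, and split so each group carries a comparable share of the total variance).

```python
import numpy as np

def random_equal_partition(n_features, N, seed=0):
    """Shuffle feature indices and split them into N equal-sized clusters."""
    idx = np.random.default_rng(seed).permutation(n_features)
    return np.array_split(idx, N)

def equal_variance_partition(Z, N):
    """Split variance-ordered features into N groups carrying similar total variance.

    Z : (M, I) data matrix. Variances are normalized by the largest variance and the
    features are sorted; a greedy sweep closes a group once it has accumulated
    roughly 1/N of the total normalized variance.
    """
    var = Z.var(axis=0)
    var = var / var.max()
    order = np.argsort(var)
    target = var.sum() / N
    groups, current, acc = [], [], 0.0
    for i in order:
        current.append(i)
        acc += var[i]
        if acc >= target and len(groups) < N - 1:
            groups.append(np.array(current))
            current, acc = [], 0.0
    groups.append(np.array(current))
    return groups

if __name__ == "__main__":
    Z = np.random.default_rng(1).normal(size=(72, 7070))   # Leukemia-sized toy example
    print([len(g) for g in random_equal_partition(7070, 18)][:5])
    print([len(g) for g in equal_variance_partition(Z, 18)][:5])
```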
Figure 5.3 shows the accuracies of the CCP-based classifications of the Carcinoma dataset under various CCP reduced dimensions N equipped with the three feature partition schemes. The correlation partition outperforms both the random equal and equal variance partitions over all N values. In addition, as shown in Figure 5.4, in the R-S plots, the correlation partition outperforms the other two partitions in each cluster at N = 46.

Figure 5.3 Comparison of the correlation partition, random equal partition, and equal variance partition-based dimensionality reduction and classification for the Carcinoma dataset of 9182 features. The accuracies are computed from k-NN classifications using the reduced N features generated from CCP with the three partitions.

Figure 5.4 Comparing the CCP-based classification effectiveness using the correlation partition, random equal partition, and equal variance partition of the features of the Carcinoma dataset. For FRI, the exponential kernel with κ = 2 and τ = 2.0 was used. For each test, the results of all 10 seeds were plotted. From left to right: R-S plots of the correlation partition, equal variance partition, and random equal partition. The x-axis is the residue score, and the y-axis is the similarity score. Each section corresponds to one cluster, and the data were colored according to the predicted labels from k-NN.

5.3.3 Comparison with other dimensionality reduction methods

In this section, we compare CCP's performance with UMAP, PCA, LLE, and Isomap on the ALL-AML, TCGA-PANCAN, Coil-20, and Coil-100 datasets. For each dataset, we performed 5-fold or 10-fold cross-validation, depending on the size of the dataset, to test the accuracy using k-nearest neighbors (k-NN). Results of all 10 random seeds were used in the performance evaluation.

5.3.3.1 ALL-AML

Figure 5.5 Accuracy of k-NN classification of the ALL-AML dataset when the dimension is reduced to N by using CCP, UMAP, PCA, LLE, and Isomap. Here, a 5-fold cross-validation with 10 random seedings was used. The test-train split was done prior to the dimensionality reduction. For CCP, the exponential kernel with κ = 1 and τ = 2.0 was used. The sample size, feature size, and the number of classes of the ALL-AML dataset are 72, 7129, and 2, respectively.

Figure 5.6 Visualization of the ALL-AML dataset when the dimension is reduced by CCP with the exponential kernel, κ = 1 and τ = 2.0, to N = 36. Each section represents a class, and the samples were colored according to their predicted labels from the k-NN classification via 5-fold cross-validation. Results of all 10 seeds were used for the visualization. The x and y axes are the residue and similarity scores, respectively. The sample size, feature size, and the number of classes of the ALL-AML dataset are 72, 7129, and 2, respectively.

The dimension of the ALL-AML dataset was reduced using the exponential kernel with κ = 1 and τ = 2.0. Figure 5.5 shows the performance of CCP, UMAP, PCA, Isomap, and LLE. Here, a 5-fold cross-validation with 10 random seeds was used. CCP performs better than the other algorithms and is stable over a wide range of N values. All other methods show a drop in their accuracy beyond dimension N = 36. Since the ALL-AML dataset only has 72 samples, UMAP, PCA, LLE, and Isomap cannot reduce the ALL-AML dimension to N > 72 because their dimension is limited by the size of the matrix diagonalization.

Figure 5.7 Visualization of the ALL-AML dataset when the dimension was reduced to N = 36 by using CCP, UMAP, and PCA.
Since there are only 72 samples in the ALL-AML dataset, results from all 10 seeds were plotted, leading to 720 sample points in the plot. For CCP, the exponential kernel with κ = 1 and τ = 2.0 was used. The x and y axes are the residue and similarity scores, respectively. Each row is for one class, and the data points are colored based on the predicted labels from the k-NN classifier, using 5-fold cross-validation. The sample size, feature size, and the number of classes of the ALL-AML dataset are 72, 7129, and 2, respectively.

Figure 5.6 shows the R-S plot of the ALL-AML dataset when the dimension is reduced by CCP to N = 36. The left and right sections correspond to the 2 classes. Samples were plotted according to their 36 features and colored with the predicted labels from k-NN. Results of all 10 seeds were plotted into one chart (i.e., 720 samples), and the residue and similarity scores were calculated separately for each random seed. The x and y axes are the residue and similarity scores, respectively. Class 2 has a better R-S distribution.

Figure 5.7 shows the R-S plots of the ALL-AML dataset when the feature dimension is reduced to N = 36 by using CCP, UMAP, and PCA. Results of all 10 random seeds were used in the visualization to obtain a better understanding of the performance, and the residue and similarity scores were computed separately for each seed. In each class, the samples were colored according to their predicted labels obtained from k-NN. The x and y axes of each R-S plot are the residue and similarity scores, respectively. The top row is class 1 (ALL), and the bottom row is class 2 (AML). The number inside each plot is the accuracy for that class. Notice that UMAP's R-S plot indicates that UMAP's reduction has a low similarity score, which explains its low accuracy. On the other hand, PCA has higher accuracy than UMAP, but most AML samples are mislabeled. This indicates that PCA is unable to distinguish the difference between ALL and AML when N = 36.

5.3.3.2 TCGA-PANCAN

For CCP, the dimension of the TCGA-PANCAN dataset was reduced using the Lorentz kernel with κ = 1 and τ = 1.0. Figure 5.8 shows the performance comparison of CCP, UMAP, PCA, Isomap, and LLE. Here, a 5-fold cross-validation with 10 random seeds was used. Notice that CCP is comparable to Isomap and LLE in accuracy, whereas UMAP and PCA are unstable at higher dimensions. Figure 5.9 shows the R-S plot of the TCGA-PANCAN dataset when the dimension was reduced by CCP to N = 103. Each section corresponds to one of the 5 classes of TCGA-PANCAN. Samples were plotted according to the 103 features and colored with the predicted labels from k-NN. The x and y axes are the residue and similarity scores, respectively. Figure 5.10 shows the R-S plots of TCGA-PANCAN when the dimension is reduced to N = 103 by using CCP, UMAP, and PCA, respectively. The samples were plotted based on the 103 features and colored with their predicted labels from k-NN. The x and y axes of each plot are the residue and similarity scores, respectively. Each row corresponds to one of the 5 classes, and the number inside each plot is the accuracy for that class. Notice that UMAP has a cluster in each plot, but the cluster has a low similarity score. This means that in UMAP's embedding, the samples within each class are not near each other, which results in low accuracy. PCA has comparable accuracy to CCP, but CCP has a notable improvement for classes 1 and 4.
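The evaluation protocol used throughout this section (test-train split before the reduction, followed by k-NN, averaged over folds and seeds) can be sketched with scikit-learn as below. PCA stands in for the reducer because CCP itself is not part of scikit-learn, and the standard scaling after reduction follows the setup described later in this chapter; the fold counts, seeds, and toy data sizes are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

def cv_accuracy(X, y, n_components, n_splits=5, seeds=range(10)):
    """k-NN accuracy with the reducer fit on the training fold only.

    Mirrors the protocol of Section 5.3: the test-train split is done prior to the
    dimensionality reduction, and results are averaged over folds and random seeds.
    PCA is used here as a stand-in reducer.
    """
    scores = []
    for seed in seeds:
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for train, test in skf.split(X, y):
            reducer = PCA(n_components=n_components).fit(X[train])
            scaler = StandardScaler().fit(reducer.transform(X[train]))
            Xtr = scaler.transform(reducer.transform(X[train]))
            Xte = scaler.transform(reducer.transform(X[test]))
            knn = KNeighborsClassifier().fit(Xtr, y[train])
            scores.append(knn.score(Xte, y[test]))
    return float(np.mean(scores))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(72, 500))          # ALL-AML-sized toy example
    y = rng.integers(0, 2, size=72)
    print("mean accuracy:", round(cv_accuracy(X, y, n_components=36), 3))
```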
Figure 5.8 Accuracy of k-NN classification of the TCGA-PANCAN dataset when the dimension is reduced to N by using CCP, UMAP, PCA, LLE, and Isomap. Here, a 5-fold cross-validation with 10 random seedings was used, and the test-train split was done prior to the reduction. For CCP, the Lorentz kernel with κ = 1 and τ = 1.0 was used. The sample size, feature size, and the number of classes of TCGA-PANCAN are 801, 20531, and 5, respectively.

Figure 5.9 Visualization of the TCGA-PANCAN dataset when the dimension is reduced to N = 103 by using CCP with the Lorentz kernel, κ = 1 and τ = 1.0. Each section represents a different class. The samples were plotted based on the 103 features and colored with their predicted labels from the k-NN classification via 5-fold cross-validation. The x and y axes are the residue and similarity scores, respectively. The sample size, feature size, and the number of classes of TCGA-PANCAN are 801, 20531, and 5, respectively.

Figure 5.10 Visualization of the TCGA-PANCAN dataset when the dimension is reduced to N = 103 by using CCP, UMAP, and PCA. For CCP, the Lorentz kernel with κ = 1 and τ = 1.0 was used. The x and y axes are the residue and similarity scores, respectively. Each row contains one class. The data are plotted based on the 103 features and colored with the predicted labels of the k-NN classifier, using 5-fold cross-validation. The sample size, feature size, and the number of classes of TCGA-PANCAN are 801, 20531, and 5, respectively.

5.3.3.3 Coil-20

The dimension of the Coil-20 dataset was reduced using the Lorentz kernel with κ = 1 and τ = 6.0. Figure 5.11 shows the performance of CCP, UMAP, PCA, Isomap, and LLE. The 10-fold cross-validation with 10 random seeds was used. CCP has the best performance of the 5 algorithms and maintains its accuracy in higher dimensions. PCA also has high accuracy but loses its accuracy in higher dimensions. Figure 5.12 shows the R-S plot of the Coil-20 dataset when the dimension is reduced by CCP to N = 82. Each section corresponds to one of the 20 classes of Coil-20. Samples were plotted based on the 82 features and colored with the predicted labels from k-NN. The x and y axes are the residue and similarity scores, respectively. Figure 5.13 shows the R-S plots of Coil-20 when its dimension is reduced to N = 82 by using CCP, UMAP, and PCA. Samples in each class are plotted according to their 82 features and colored according to their predicted labels from k-NN. The x and y axes of each plot are the residue and similarity scores, respectively. Each row corresponds to one of the 20 classes, and the number inside each plot is the classification accuracy for that class. Notice that all of UMAP's visualizations show a poor distribution in the bottom right, indicating that the residue score is high and the similarity score is low, which gives rise to poor performance in the classification. In order to further investigate the performance, labels 1, 2, and 3 were visualized in Figure 5.15. In the zoomed-in view, there are small subclusters within each plot, which come from the different folds of the cross-validation.

Figure 5.11 Accuracy of k-NN classification of the Coil-20 dataset when its dimension is reduced to different dimensions N by using CCP, UMAP, PCA, LLE, and Isomap. The 10-fold cross-validation with 10 random seedings was used, and the test-train split was done prior to the dimensionality reduction. For CCP, the Lorentz kernel with κ = 1 and τ = 6.0 was used.
The sample size, feature size, and the number of classes of the Coil-20 dataset are 1440, 16384, and 20, respectively.

Figure 5.12 Visualization of the Coil-20 dataset when the dimension is reduced to N = 82 by using CCP with the Lorentz kernel, κ = 1 and τ = 6.0. Each section represents a different class, and the samples were plotted based on the 82 features and colored with their predicted labels from the k-NN classification via 10-fold cross-validation. The x and y axes are the residue and similarity scores, respectively. The sample size, feature size, and the number of classes of the Coil-20 dataset are 1440, 16384, and 20, respectively.

Figure 5.13 Visualization of the Coil-20 dataset for classes 1 to 10 when the dimension is reduced to N = 82 by using CCP, UMAP, and PCA. For CCP, the Lorentz kernel with κ = 1 and τ = 6.0 was used. The x and y axes are the residue and similarity scores, respectively.

Figure 5.14 Visualization of the Coil-20 dataset for classes 11 to 20 when the dimension is reduced to N = 82 by using CCP, UMAP, and PCA. For CCP, the Lorentz kernel with κ = 1 and τ = 6.0 was used. The x and y axes are the residue and similarity scores, respectively.

Figure 5.15 Visualization of classes 1, 2, and 3 of the Coil-20 dataset when the data dimension is reduced to N = 82 by UMAP. The figures are zoomed-in views. The data were plotted based on the 82 features and colored according to their predicted labels from the k-NN classifier using 10-fold cross-validation. Label 1 has an accuracy of 0.125, whereas labels 2 and 3 have an accuracy of 0.000.

5.3.3.4 Coil-100

The dimension of the Coil-100 dataset was reduced using the exponential kernel with κ = 1 and τ = 6.0. Figure 5.16 shows the performance of CCP, UMAP, PCA, Isomap, and LLE. Here, a 10-fold cross-validation with 10 random seeds was used. CCP, PCA, LLE, and Isomap have comparable results, whereas UMAP is unstable at higher dimensions N. The best performance of UMAP was not as good as those of CCP and PCA. This indicates that Coil-100 has a high intrinsic dimension, for which UMAP has poor performance.

Figure 5.16 Accuracy of k-NN classification of the Coil-100 dataset when the dimension is reduced to N by using CCP, UMAP, PCA, LLE, and Isomap. The 10-fold cross-validation with 10 random seedings was used, and a test-train split was done prior to the reduction. For CCP, the Lorentz kernel with κ = 1 and τ = 6.0 was used. The sample size, feature size, and the number of classes of the Coil-100 dataset are 7200, 49152, and 100, respectively.

5.4 Discussion

5.4.1 Centrality-based CCP

CCP uses FRI to project a group of correlated features into a 1D representation. If we observe the projection in a graph setting, the FRI projection can be viewed as computing the degree centrality of a graph. That is, let $Z \in \mathbb{R}^{M \times I}$ be the data, with M samples and I features. For each partition, we can define a graph $G_n = (V_n, E_n, W_n)$, $n = 1, 2, \ldots, N$, where $V_n$, $E_n$, and $W_n$ are the vertex, edge, and weight sets of the graph of the nth component, respectively. The weights are precisely the kernels defined in Eq. (5.12). Then, the FRI projection for $x^n_i$ can be viewed as the degree centrality ($C_d$) of a weighted graph,
$$C_d(z^{S_n}_i) = \sum_{z^{S_n}_j} \Phi_n(\|z^{S_n}_i - z^{S_n}_j\|; \eta_n, \tau, \kappa), \tag{5.16}$$
where $C_d(z^{S_n}_i)$ is the degree centrality of the vertex $z^{S_n}_i$. In this case, we treat each data entry $z^{S_n}_i$ as a vertex. Instead of using the FRI projection, we can impose a traditional graph-based approach, setting the edge weight $\omega^n_{ij} = 1$ for all $1 \le i, j \le M$ and $1 \le n \le N$ when the node-node distance satisfies a cutoff. That is, instead of applying Eq. (5.12),
we take $A^n = \{A^n_{ij}\}$,
$$A^n_{ij} = \begin{cases} 1, & \|z^{S_n}_i - z^{S_n}_j\| < r^n_c, \\ 0, & \text{otherwise}, \end{cases} \quad 1 \le i, j \le M. \tag{5.17}$$
Here, instead of writing $C^n_{ij}$ as in Eq. (5.12), we use $A^n_{ij}$ to denote the adjacency matrix of the graph, and $r^n_c$ is the cutoff distance. Then, the reduced new variables $x^n_i$ can be computed by replacing $\Phi_n(\|z^{S_n}_i - z^{S_n}_j\|; \eta_n, \tau, \kappa)$ in Eq. (5.15) with $A^n_{ij}$. In such a manner, we can implement other centrality formulations, such as degree centrality, closeness centrality [26], betweenness centrality [27], and eigenvector centrality [28], in CCP.

Figure 5.17 shows the accuracy of using different centrality formulations instead of the FRI projection. Using the adjacency matrix, the degree centrality, closeness centrality, betweenness centrality, and eigenvector centrality were computed with $r_c = 0.7 d_{\max}$, where $d_{\max}$ is the maximum pairwise distance between the input data. The performance of all methods is quite similar. However, the stability of computing the centrality is heavily reliant on $r_c$. Moreover, if the data are well clustered within each class, the graph may not be connected, which may affect the stability of the centrality computations.

Figure 5.17 The Coil-20 dataset was reduced using centrality formulations instead of the FRI projection. Degree, closeness, eigenvector, and betweenness centralities were tested, with $r_c = 0.7 d_{\max}$. The accuracy was calculated from 10-fold cross-validation with 10 random seeds. The sample size, feature size, and the number of classes of the Coil-20 dataset are 1440, 16384, and 20, respectively.

5.4.2 Correlation distance-based CCP

CCP utilizes the covariance distance in clustering to partition features. However, other distance metrics can be used in the clustering as well, depending on the size of the data and the relationship between the features. In particular, the correlation distance can be used instead of the covariance distance when the relationship between features is highly nonlinear. Figure 5.18 shows the effectiveness of correlation distance-based CCP when compared to covariance distance-based CCP and other DR algorithms. Notice that the correlation distance-based CCP significantly outperforms covariance-based CCP and the other DR algorithms. Therefore, correlation distance-based CCP can be employed if high accuracy is desirable. However, it is noted that correlation distance-based CCP is very time-consuming and memory-demanding. This limitation may constrain the use of correlation distance-based CCP for high-dimensional data with large data sizes.

Figure 5.18 Comparison between correlation distance-based partitioning and covariance distance-based partitioning of the ALL-AML dataset. k-NN with 10-fold cross-validation was used to compute the accuracy. The sample size, feature size, and the number of classes of ALL-AML are 72, 7129, and 2, respectively.

5.4.3 Parameter-free CCP

The performance of the proposed two-step CCP depends on a few parameters, such as the dimension N, the kernel type (i.e., generalized exponential or generalized Lorentz), the power (κ), and the scale (τ). Among them, the dimension may be chosen by the user. Although a set of default parameters is prescribed, it may not be optimal for different datasets, and it would be a burden for users to select parameters. Fortunately, CCP is very stable under subsampling. Therefore, we can use subsampling to search the optimal parameter range for a given dataset automatically. In this subsection, we show that CCP is stable under subsampling.
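A minimal sketch of such a subsampling-based parameter search is given below. The kernel grid, the subsample fraction, and the placeholder `ccp_reduce` function (standing in for an actual CCP reduction with a given kernel, κ, and τ) are illustrative assumptions, not the exact search used in the experiments that follow.

```python
import numpy as np
from itertools import product
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def search_kernel_parameters(X, y, ccp_reduce, n_components,
                             fraction=0.05, seed=0,
                             kappas=(1, 2), taus=(1.0, 2.0, 6.0)):
    """Pick (kernel, kappa, tau) on a small subsample, then reuse them on the full data.

    ccp_reduce(X, n_components, kernel, kappa, tau) is a placeholder for the CCP
    reduction; only a `fraction` of the samples is used during the search.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=max(2, int(fraction * len(X))), replace=False)
    Xs, ys = X[idx], y[idx]
    best, best_score = None, -np.inf
    for kernel, kappa, tau in product(("exponential", "lorentz"), kappas, taus):
        Xr = ccp_reduce(Xs, n_components, kernel, kappa, tau)
        score = cross_val_score(KNeighborsClassifier(), Xr, ys, cv=3).mean()
        if score > best_score:
            best, best_score = (kernel, kappa, tau), score
    return best

if __name__ == "__main__":
    # Toy demonstration with a dummy reducer (truncation) standing in for CCP.
    def dummy_ccp_reduce(X, n, kernel, kappa, tau):
        return X[:, :n]
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 50))
    y = rng.integers(0, 5, size=2000)
    print(search_kernel_parameters(X, y, dummy_ccp_reduce, n_components=10))
```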
To verify the stability claim, we test CCP on the Smallnorb dataset, which has 46,800 samples and 5 classes. Each sample consists of a binocular picture of an object of size 96×96 pixels, taken from different radial and azimuthal angles. We flattened each image and combined the images to make an 18,432-dimensional feature vector. We subsample 1%, 5%, 10%, and 20% of the samples to optimize the CCP kernel parameters, respectively. Then, based on these CCP parameter sets, we carry out the CCP dimensionality reduction of the whole dataset for classification. The resulting 10-fold cross-validation classification accuracies for the Smallnorb dataset are shown in Figure 5.19 for subsampling at 1%, 5%, 10%, and 20%. It is clear that the accuracy increases as the subsampling size is increased from 1% to 20%. However, the accuracy difference between 1% subsampling and 20% subsampling is under 2% for all classes. It is seen that under different subsampling ratios, CCP can capture the structure of the data. Even at 1% subsampling, CCP is still very accurate.

Figure 5.19 R-S plot visualization of Smallnorb classification using CCP with a reduction ratio of 400 (47 dimensions) at 1%, 5%, 10%, and 20% subsampling. Each row represents the data plotted based on the 47 features and colored with the predicted labels from the k-NN classifier, using 10-fold cross-validation. The number in each plot shows the accuracy within each label obtained with the subsampling-generated kernel parameters. The x and y axes are the residue and similarity scores, respectively.

Since CCP is very stable under subsampling, one can make CCP a parameter-free method by using a relatively small amount of data to determine the CCP parameters automatically. CCP's stability under subsampling implies that CCP can be used in the dynamic data acquisition of excessively large datasets. Newly collected data can be added to the existing data without the need to restart the CCP calculation from the very beginning.

5.4.4 Accuracy comparison using four classifiers

We have shown the effectiveness of CCP on various datasets. However, all the aforementioned analysis was based on the k-NN classifier. It is important to know whether the same pattern returns if other classification algorithms are employed. To this end, we compare CCP with other dimensionality reduction methods using k-NN, support vector machine (SVM), random forest (RF), and gradient boost decision tree (GBDT).

Figure 5.20 Comparison of the accuracy of CCP on a variety of datasets and classification algorithms. The rows represent four datasets, from top to bottom: ALL-AML, TCGA-PANCAN, Coil-20, and Coil-100. The columns are for four classification algorithms, namely, k-NN, SVM, RF, and GBDT. The x-axes are the reduced dimension N, and the y-axes are accuracy.

Figure 5.20 shows the comparison of CCP when utilizing k-NN, SVM, RF, and GBDT on the ALL-AML, TCGA-PANCAN, Coil-20, and Coil-100 datasets. The rows are the 4 datasets, and the columns are the 4 classification methods. For all the tests, sklearn's classification package was utilized. For k-NN and SVC, default parameters were used. For RF and GBDT, {n_estimators=1000, max_depth=7, min_samples_split=3, max_features='sqrt', n_jobs=-1} were used. For all tests, standard scaling was used after the reduction. First, CCP remains very competitive against all other dimensionality reduction methods over all datasets when other classifiers are employed.
The relative behaviors of all dimensionality reduction methods did not change much under different classifiers. Therefore, our earlier comparison is fair and our findings remain correct. Second, SVM appears to slightly improve the performance of CCP and PCA. However, LLE and Isomap do not work well with SVM. Third, UMAP did not perform well on ALL-AML, Coil-20, and Coil-100 when the k-NN method was used, and its performance does not improve much with SVM, RF, and GBDT. Its instability with relatively large reduced dimensions N persists over different classifiers. In fact, its best results never reached those of the other methods for these three datasets. A possible reason is that UMAP does not work well for data having moderately large intrinsic dimensions. Fourth, LLE had some instability on the TCGA-PANCAN and Coil-100 datasets. Because the input data led the computed matrix to become singular, some of the tests from the cross-validation were not computed. For these cases, the average was taken over the working tests. Finally, we noticed that all dimensionality reduction methods underperformed with RF for the Coil-100 dataset and with GBDT for the ALL-AML dataset. This behavior might be due to the fact that, for a given classifier, a uniform set of parameters was used for all datasets, and RF does not work well for large datasets.

5.4.5 Efficiency comparison

Although accuracy is very important, computational cost can be a crucial factor for huge datasets. In this section, we assess the computational times of various methods with elementary computer resource allocations. Specifically, 4 central processing units (CPUs) with 64GB of memory from the High-Performance Computing Center (HPCC) of Michigan State University were used for all methods and all datasets. Figure 5.21 shows the computational times of the dimensionality reduction methods on ALL-AML, TCGA-PANCAN, Coil-20, and Coil-100. For the ALL-AML and TCGA-PANCAN datasets, the average time from the 5-fold cross-validation over 10 random seeds was computed. For Coil-20 and Coil-100, the average time from the 10-fold cross-validation over 10 random seeds was recorded.

Figure 5.21 CPU run time comparison among CCP, UMAP, and PCA on the ALL-AML, TCGA-PANCAN, Coil-20, and Coil-100 datasets. For ALL-AML and TCGA-PANCAN, computational times for N = 10, 20, and 30 were calculated by taking the average of the 5-fold cross-validation over 10 random seeds. For Coil-20 and Coil-100, computational times for N = 10, 20, 30, and 40 were calculated by taking the average of the 10-fold cross-validation over 10 random seeds. In each chart, the x-axis corresponds to the reduced dimension N, and the y-axis is the average time (s).

PCA shows essentially the fastest computation for all datasets. Isomap and LLE have very similar behaviors for all datasets, and their time efficiencies are quite similar to that of PCA. UMAP is faster than CCP for Coil-20 and Coil-100. For ALL-AML and TCGA-PANCAN, CCP is faster because of the small data size. Note that Eq. (5.15) involves a summation over all samples that satisfy a cutoff within 3 standard deviations of the average pairwise distance. This cutoff can be reduced for faster computation; however, doing so reduces the overall accuracy. For CCP, because clustered features are projected independently, each reduced dimension can be computed independently using a parallel architecture. Similar parallel computations can be applied to different samples.
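To make this parallel structure concrete, the sketch below projects each feature cluster independently with joblib. The exponential-kernel projection follows Eq. (5.15) with unit weights and no cutoff, and the cluster list is a stand-in for the K-medoids partition, so this is a simplified illustration rather than the production implementation.

```python
import numpy as np
from joblib import Parallel, delayed

def project_cluster(Z_sub, tau=1.0, kappa=2.0):
    """FRI-type projection of one feature cluster into a single descriptor, cf. Eq. (5.15).

    Z_sub : (M, |S_n|) slice of the data restricted to one feature cluster.
    Returns a length-M vector (one reduced dimension).  Weights are set to 1.
    """
    diff = Z_sub[:, None, :] - Z_sub[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)                       # (M, M) pairwise distances
    eta = np.mean(np.sort(dist, axis=1)[:, 1])                 # average minimum distance
    return np.exp(-(dist / (tau * eta)) ** kappa).sum(axis=1)  # generalized exponential kernel

def ccp_project_parallel(Z, clusters, n_jobs=-1):
    """Project each feature cluster independently; `clusters` is a list of index arrays."""
    columns = Parallel(n_jobs=n_jobs)(
        delayed(project_cluster)(Z[:, idx]) for idx in clusters)
    return np.column_stack(columns)                            # (M, N) reduced representation

if __name__ == "__main__":
    Z = np.random.default_rng(0).normal(size=(300, 1000))
    clusters = np.array_split(np.arange(1000), 20)             # stand-in for the K-medoids partition
    X = ccp_project_parallel(Z, clusters)
    print(X.shape)   # (300, 20)
```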
Therefore, CCP can be further accelerated by using parallel and graphics processing unit (GPU) algorithms in practical applications.

5.5 Concluding remarks

Like other dimensionality reduction algorithms, CCP has its advantages and disadvantages. First, CCP is a unique data-domain method, and its features are highly interpretable. Because CCP partitions features into clusters according to some metric, such as the covariance distance or correlation distance, features with high correlation will perform better. One limitation of many methods relying on matrix diagonalization is that pairwise distance computation can encounter the "curse of dimensionality", where distance computation for high-dimensional data can become unreliable. By clustering features, CCP can compute distances more reliably because the dimension in each cluster is much lower. Moreover, CCP performs better for data with a large number of features, such as the TCGA-PANCAN, Coil-20, and Coil-100 datasets. Therefore, CCP is suitable for the dimensionality reduction of data with relatively large intrinsic dimensions, for which many other popular methods may not work well. However, for datasets with a smaller number of features, CCP may not be as good as other methods; in this case, dimensionality reduction is unnecessary anyway. Also, we noticed that CCP might not be as good as UMAP and some other frequency-domain methods for extremely low final dimensions, say N = 2 or 3.

In addition to doing well for data having moderately large intrinsic dimensions, CCP allows embedding for streaming datasets, such as molecular dynamics generated transient data. We have shown that CCP is stable under subsampling, which enables users to optimize the CCP model with a small portion of the initial data and allows subsequent data to be embedded with the initial set. We noticed that dimensionality reduction algorithms that rely on matrix diagonalization exhibit instability when dealing with streaming data.

Because CCP does not compute a nearest-neighbor graph and does not diagonalize any matrix, a traditional 2D plot does not give a meaningful visualization. However, each dimension of CCP is computed by projecting the partitioned features; hence, we can easily interpret each dimension of CCP. In tree-based classification algorithms, such as random forest and gradient boost decision trees, feature importance can be computed for each feature component, which gives a rank of how much impact each component has on the classification. For CCP, feature importance may be interpreted as how meaningful a set of highly correlated features is in the classification.

CCP can be further optimized in various ways. It allows a wide variety of alternative data-domain embedding strategies in each of its two steps: clustering and projection. For example, in the clustering step, one might select alternative distance metrics, clustering algorithms, and loss functions to optimize the feature vector partition for a given dataset. In the projection step, one might choose alternative distance metrics based on Riemannian geometry or statistical theories and select alternative projections based on linear/nonlinear, orthogonal/non-orthogonal, and Grassmannian considerations. A wide variety of multistep dimensionality reduction methods can be developed. Unlike frequency-domain dimensionality reduction techniques, CCP renders a data-domain representation of the original high-dimensional data.
Therefore, the resulting low-dimensional data can be reused as an input for a Secondary Dimensionality Reduction (SDR) with a frequency-domain technique to achieve specific goals. For example, one can use CCP as an initializer for local methods to capture global patterns [29]. The combination of CCP with UMAP and t-SNE, called CCP-UMAP and CCP-t-SNE, respectively, may generate better 2D visualizations for datasets with global structures. Additionally, for real-world problems, better accuracy is always desirable. New hybrid methods, such as three-step CCP-UMAP and CCP-Isomap, may achieve better dimensionality reduction performance for clustering, classification, and regression.

Finally, the R-S scores, R index, S index, R-S disparity, and R-S index introduced in this work can be used for general-purpose data visualization and analysis. The shape of data and the persistent Laplacian discussed in this work offer new geometric, topological, and spectral tools for data analysis and visualization.

Server availability

The CCP online server is available at https://weilab.math.msu.edu/CCP/.

BIBLIOGRAPHY

[1] Kelin Xia, Kristopher Opron, and Guo-Wei Wei. Multiscale multiphysics and multidomain models—flexibility and rigidity. The Journal of Chemical Physics, 139(19):11B614_1, 2013.

[2] Patrizio Frosini. Measuring shapes by size functions. In Intelligent Robots and Computer Vision X: Algorithms and Techniques, volume 1607, pages 122–133. International Society for Optics and Photonics, 1992.

[3] Herbert Edelsbrunner, David Letscher, and Afra Zomorodian. Topological persistence and simplification. In Proceedings 41st Annual Symposium on Foundations of Computer Science, pages 454–463. IEEE, 2000.

[4] Afra Zomorodian and Gunnar Carlsson. Computing persistent homology. Discrete & Computational Geometry, 33(2):249–274, 2005.

[5] Gunnar Carlsson. Topology and data. Bulletin of the American Mathematical Society, 46(2):255–308, 2009.

[6] Konstantin Mischaikow and Vidit Nanda. Morse theory for filtrations and efficient computation of persistent homology. Discrete & Computational Geometry, 50(2):330–353, 2013.

[7] K. L. Xia and G. W. Wei. Persistent homology analysis of protein structure, flexibility and folding. International Journal for Numerical Methods in Biomedical Engineering, 30:814–844, 2014.

[8] Jacob Townsend, Cassie Putman Micucci, John H Hymel, Vasileios Maroulas, and Konstantinos D Vogiatzis. Representation of molecular structures with persistent homology for machine learning applications in chemistry. Nature Communications, 11(1):1–9, 2020.

[9] Rui Wang, Duc Duy Nguyen, and Guo-Wei Wei. Persistent spectral graph. International Journal for Numerical Methods in Biomedical Engineering, 36(9):e3376, 2020.

[10] Jiahui Chen, Yuchi Qiu, Rui Wang, and Guo-Wei Wei. Persistent Laplacian projected Omicron BA.4 and BA.5 to become new dominating variants. Computers in Biology and Medicine, 151:106262, 2022.

[11] Duc Duy Nguyen, Zixuan Cang, and Guo-Wei Wei. A review of mathematical representations of biomolecular data. Physical Chemistry Chemical Physics, 22(8):4343–4367, 2020.

[12] Duc Duy Nguyen and Guo-Wei Wei. DG-GL: Differential geometry-based geometric learning of molecular datasets. International Journal for Numerical Methods in Biomedical Engineering, 35(3):e3179, 2019.

[13] Rundong Zhao, Menglun Wang, Jiahui Chen, Yiying Tong, and Guo-Wei Wei. The de Rham–Hodge analysis and modeling of biomolecules. Bulletin of Mathematical Biology, 82(8):1–38, 2020.
[14] Jiahui Chen, Rundong Zhao, Yiying Tong, and Guo-Wei Wei. Evolutionary de Rham-Hodge method. Discrete and Continuous Dynamical Systems. Series B, 26(7):3785, 2021.

[15] Gábor J Székely, Maria L Rizzo, and Nail K Bakirov. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794, 2007.

[16] Kristopher Opron, Kelin Xia, and Guo-Wei Wei. Fast and anisotropic flexibility-rigidity index for protein flexibility and fluctuation analysis. The Journal of Chemical Physics, 140(23):06B617_1, 2014.

[17] Duc Duy Nguyen and Guo-Wei Wei. AGL-Score: Algebraic graph learning score for protein–ligand binding scoring, ranking, docking, and screening. Journal of Chemical Information and Modeling, 59(7):3291–3304, 2019.

[18] Chris Ding and Hanchuan Peng. Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology, 3(02):185–205, 2005.

[19] Andrew I Su, John B Welsh, Lisa M Sapinoso, Suzanne G Kern, Petre Dimitrov, Hilmar Lapp, Peter G Schultz, Steven M Powell, Christopher A Moskaluk, Henry F Frierson, et al. Molecular classification of human carcinomas by use of gene expression signatures. Cancer Research, 61(20):7388–7393, 2001.

[20] Kun Yang, Zhipeng Cai, Jianzhong Li, and Guohui Lin. A stable gene selection in microarray data analysis. BMC Bioinformatics, 7(1):1–16, 2006.

[21] Todd R Golub, Donna K Slonim, Pablo Tamayo, Christine Huard, Michelle Gaasenbeek, Jill P Mesirov, Hilary Coller, Mignon L Loh, James R Downing, Mark A Caligiuri, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537, 1999.

[22] Kyle Chang, Chad J Creighton, Caleb Davis, Lawrence Donehower, Jennifer Drummond, David Wheeler, Adrian Ally, Miruna Balasundaram, Inanc Birol, Yaron SN Butterfield, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics, 45(10):1113–1120, 2013.

[23] Sameer A Nene, Shree K Nayar, Hiroshi Murase, et al. Columbia Object Image Library (COIL-20). 1996.

[24] Sameer A Nene, Shree K Nayar, Hiroshi Murase, et al. Columbia Object Image Library (COIL-100). 1996.

[25] Yann LeCun, Fu Jie Huang, and Léon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2:II–104 Vol.2, 2004.

[26] Linton C Freeman. Centrality in social networks conceptual clarification. Social Networks, 1(3):215–239, 1978.

[27] Linton C Freeman. A set of measures of centrality based on betweenness. Sociometry, pages 35–41, 1977.

[28] Phillip Bonacich. Power and centrality: A family of measures. American Journal of Sociology, 92(5):1170–1182, 1987.

[29] Dmitry Kobak and George C Linderman. Initialization is critical for preserving global data structure in both t-SNE and UMAP. Nature Biotechnology, 39(2):156–157, 2021.

CHAPTER 6
TOPOLOGICAL NONNEGATIVE MATRIX FACTORIZATION

6.1 Introduction

Nonnegative matrix factorization (NMF) is a dimensionality reduction method whose objective is to decompose the original count matrix into two nonnegative factor matrices [1, 2]. The columns of the resulting basis matrix are often referred to as meta-genes, and each represents a nonnegative linear combination of the original genes. Consequently, NMF results are highly interpretable. However, the original formulation employs a least-squares optimization scheme, making the method susceptible to outlier errors [3].
To address this issue, Kong et al. [4] introduced robust NMF (rNMF), or l2,1-NMF, which utilizes the l2,1-norm and can better handle outliers while maintaining computational efficiency comparable to standard NMF. Manifold regularization has also been employed to incorporate geometric structure into dimensionality reduction by utilizing a graph Laplacian, leading to graph regularized NMF (GNMF) [5]. Semi-supervised methods, such as those incorporating marker genes [6] or similarity and dissimilarity constraints [7], have been proposed to enhance NMF's robustness. Additionally, various other NMF derivatives have been introduced [8, 9, 10].

Despite these advancements, manifold regularization remains an essential component to ensure that the lower-dimensional representation of the data can form meaningful clusters. However, a graph Laplacian captures only a single scale of the data, set by the scaling factor in the heat kernel; single-scale graph Laplacians therefore lack multiscale information.

In this work, we introduce persistent Laplacian (PL)-regularized NMF, namely the topological NMF (TNMF) and robust topological NMF (rTNMF). Both TNMF and rTNMF can better capture multiscale geometric information than the standard GNMF and rGNMF. To achieve improved performance, the PL is constructed by observing sample-sample interactions at multiple scales through a filtration, creating a sequence of simplicial complexes. We can then view the spectra of each complex in the filtration to capture both topological and geometric information. Additionally, we introduce a k-NN based PL for TNMF and rTNMF, referred to as k-TNMF and k-rTNMF, respectively. The k-NN based PL reduces the number of hyperparameters compared to the standard PL algorithm.

The outline of this work is as follows. First, we provide a brief overview of NMF, rNMF, GNMF, and rGNMF. Next, we present a concise theoretical formulation of the PL and derive the multiplicative updating schemes for TNMF and rTNMF. Additionally, we introduce an alternative construction of the PL, termed the k-NN PL.

6.2 Prior Work

In this section, we provide an overview of NMF methods, including the standard NMF, the l2,1-NMF or rNMF, the graph regularized NMF (GNMF), and the robust graph regularized NMF (rGNMF), and we utilize single-cell RNA sequencing (scRNA-seq) data as an example for the interpretation of NMF. The notations and their descriptions are summarized in Table 6.1.

Table 6.1 Abbreviations and notations used in the methods.
Notation: Description
NMF: Nonnegative matrix factorization
rNMF: Robust nonnegative matrix factorization
GNMF: Graph regularized nonnegative matrix factorization
rGNMF: Robust graph regularized nonnegative matrix factorization
TNMF: Topological nonnegative matrix factorization
rTNMF: Robust topological nonnegative matrix factorization
k-TNMF: k-NN induced topological nonnegative matrix factorization
k-rTNMF: k-NN induced robust topological nonnegative matrix factorization
X ∈ R^{M×N}: Nonnegative data matrix with M genes and N cells
W ∈ R^{M×p}: The basis, or the meta-genes, where p is the rank
H ∈ R^{p×N}: Lower-dimensional representation of the data, where p is the rank
A ∈ R^{N×N}: Adjacency matrix
D ∈ R^{N×N}: Degree matrix
L ∈ R^{N×N}: Graph Laplacian, L = D - A
PL ∈ R^{N×N}: Persistent Laplacian, PL = PD - PA
PA ∈ R^{N×N}: Adjacency matrix associated with the PL
PD ∈ R^{N×N}: Degree matrix associated with the PL
ζ_t: The weight of the graph for the t-th filtration
λ: Hyperparameter for the regularized NMF

6.2.1 NMF

The original formulation of NMF utilizes the Frobenius norm, which assumes that the noise of the data is sampled from a Gaussian distribution. Let $X \in \mathbb{R}^{M \times N}$ be a nonnegative data matrix; in scRNA-seq, this is the gene-count matrix. The goal of NMF is to find the decomposition $X \approx WH$, where both $W \in \mathbb{R}^{M \times p}$ and $H \in \mathbb{R}^{p \times N}$ are nonnegative. Here, $p$ is the rank of the decomposition. The minimization problem is given as

$$\min_{W,H} \|X - WH\|_F^2, \quad \text{s.t. } W, H \geq 0, \qquad (6.1)$$

where $\|A\|_F^2 = \sum_{i,j} a_{ij}^2$. $W$ is the basis, whose columns are often called the meta-genes in scRNA-seq. Lee et al. proposed a multiplicative updating scheme, which preserves nonnegativity [1]. For the $(t+1)$-th iteration,

$$w_{ij}^{t+1} = w_{ij}^{t} \frac{(XH^T)_{ij}}{(WHH^T)_{ij}}, \qquad (6.2)$$

$$h_{ij}^{t+1} = h_{ij}^{t} \frac{(W^T X)_{ij}}{(W^T WH)_{ij}}. \qquad (6.3)$$

Although the updating scheme is simple and effective in many biological data applications, scRNA-seq data are sparse and contain a large amount of noise. Therefore, a model that is more robust to noise is necessary for feature selection and dimensionality reduction.

6.2.2 rNMF

The robust NMF (rNMF) utilizes the $l_{2,1}$ norm, which assumes that the noise of the data is sampled from a Laplace distribution; this may be more suitable for a count-based data matrix such as scRNA-seq. The minimization problem is given as

$$\min_{W,H} \|X - WH\|_{2,1}, \quad \text{s.t. } W, H \geq 0,$$

where $\|A\|_{2,1} = \sum_j \|a_j\|_2$. Because the $l_{2,1}$-norm sums the $l_2$ distances between each original cell feature and its reduced representation, outliers do not dominate the loss function as much as in the Frobenius-norm formulation. rNMF has the following updating scheme

$$w_{ij}^{t+1} = w_{ij}^{t} \frac{(XQH^T)_{ij}}{(WHQH^T)_{ij}}, \qquad (6.4)$$

$$h_{ij}^{t+1} = h_{ij}^{t} \frac{(W^T XQ)_{ij}}{(W^T WHQ)_{ij}}, \qquad (6.5)$$

where $Q_{jj} = 1/\|x_j - Wh_j\|_2$.

6.2.2.1 GNMF and rGNMF

Manifold regularization has been widely utilized in scRNA-seq. Let $G(V, E, W)$ be a graph, where $V = \{x_j\}_{j=1}^{N}$ is the set of vertices, $E = \{(x_i, x_j) \mid x_i \in N_k(x_j) \cup x_j \in N_k(x_i)\}$ is the set of edges, and $W$ is the set of weights associated with the edges. Here, $N_k(x_j)$ denotes the k nearest neighbors of vertex $j$. For the weight between vertices $i$ and $j$, denoted $\omega_{ij}$, we chose a decaying function with the following properties:

$$\omega_{ij} \to 0 \text{ as } \|x_i - x_j\| \to \infty, \quad \omega_{ij} \to 1 \text{ as } \|x_i - x_j\| \to 0. \qquad (6.6)$$

A common choice for such a function is a radial basis function, for example, the heat kernel

$$\omega_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma}\right), \qquad (6.7)$$

where $\sigma$ is the scale of the kernel. We can then represent the weights as an adjacency matrix $A$,

$$A_{ij} = \begin{cases} \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma}\right), & x_j \in N_k(x_i) \\ 0, & \text{otherwise}. \end{cases}$$
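As a concrete illustration, the following sketch builds the k-NN heat-kernel adjacency matrix of Equation 6.7 and the corresponding single-scale graph Laplacian L = D - A; the function name, the choice of sigma, and the use of scikit-learn's nearest-neighbor search are assumptions of this sketch rather than part of the GNMF formulation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def heat_kernel_graph_laplacian(X, k=5, sigma=1.0):
    """k-NN heat-kernel adjacency A (Eq. 6.7) and graph Laplacian L = D - A.
    X: data matrix with one sample (cell) per row."""
    n = X.shape[0]
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)   # k+1: each point is its own nearest neighbor
    dist, idx = nbrs.kneighbors(X)

    A = np.zeros((n, n))
    for i in range(n):
        for d, j in zip(dist[i, 1:], idx[i, 1:]):        # skip the self-neighbor
            A[i, j] = np.exp(-d**2 / sigma)              # heat-kernel weight
    A = np.maximum(A, A.T)                               # edge if i in N_k(j) or j in N_k(i)

    D = np.diag(A.sum(axis=1))                           # degree matrix
    L = D - A                                            # single-scale graph Laplacian
    return A, D, L
```

This single-scale L is what GNMF and rGNMF regularize with; the persistent Laplacian introduced in Section 6.3 replaces it with a weighted family of Laplacians obtained from a filtration.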
We can now construct the graph regularization term, $R_G$, from the distances $\|h_i - h_j\|^2$ and the adjacency matrix:

$$R_G = \frac{1}{2}\sum_{i,j} A_{ij}\|h_i - h_j\|^2 = \sum_i D_{ii} h_i^T h_i - \sum_{i,j} A_{ij} h_i^T h_j = \mathrm{Tr}(HDH^T) - \mathrm{Tr}(HAH^T) = \mathrm{Tr}(HLH^T).$$

Here, $L$ and $D$ are the Laplacian and the degree matrix, given by $L = D - A$ and $D_{ii} = \sum_j A_{ij}$, respectively, and $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix. Utilizing the regularization parameter $\lambda \geq 0$, we get the objective function of GNMF,

$$\min_{W,H} \|X - WH\|_F^2 + \lambda \mathrm{Tr}(HLH^T), \qquad (6.8)$$

and the objective function of rGNMF,

$$\min_{W,H} \|X - WH\|_{2,1} + \lambda \mathrm{Tr}(HLH^T). \qquad (6.9)$$

6.3 Topological NMF

While graph regularization improves the traditional NMF and rNMF, the choice of $\sigma$ can vastly change the result. Furthermore, graph regularization captures only a single scale and may not capture the multiscale geometric information in the data. In this section, we only show the construction of the persistent Laplacian; its theoretical formulation can be found in Section 2.4.

6.3.1 TNMF and rTNMF

For scRNA-seq data, we calculate the 0-persistent Laplacian using the Vietoris-Rips (VR) complexes by increasing the filtration distance. We can then take a weighted sum over the 0-persistent Laplacians induced by the changes in the filtration distance. For persistent Laplacian enhanced NMF, we provide a computationally efficient algorithm to construct the persistent Laplacian matrix. Let $L$ be a Laplacian matrix induced by some weighted graph, and note the following:

$$L_{ij} = \begin{cases} l_{ij}, & i \neq j \\ -\sum_{j=1}^{N} l_{ij}, & i = j. \end{cases} \qquad (6.10)$$

Then, let $l_{\max} = \max_{i\neq j} l_{ij}$, $l_{\min} = \min_{i\neq j} l_{ij}$, and $d = l_{\max} - l_{\min}$. The $t$-th persistent Laplacian $L^t$, $t = 1, \dots, T$, is defined as $L^t = \{l^t_{ij}\}$, where

$$l^t_{ij} = \begin{cases} 1, & l_{ij} \leq (t/T)d + l_{\min} \\ 0, & \text{otherwise}, \end{cases} \qquad l^t_{ii} = -\sum_{i \neq j} l^t_{ij}. \qquad (6.11)$$

Then, we take the weighted sum over all the persistent Laplacians,

$$PL := \sum_{t=1}^{T} \zeta_t L^t. \qquad (6.12)$$

Unlike the standard Laplacian matrix $L$, the PL captures the topological features that persist over different filtrations, thus providing a multiscale view of the data that the standard Laplacian lacks. Here, $\zeta_t$ is a hyperparameter and must be chosen. Then, the PL-regularized NMF, which we call topological nonnegative matrix factorization (TNMF), is defined as

$$\min_{W,H} \|X - WH\|_F^2 + \lambda\mathrm{Tr}(H(PL)H^T), \qquad (6.13)$$

and the robust topological NMF (rTNMF) is defined as

$$\min_{W,H} \|X - WH\|_{2,1} + \lambda\mathrm{Tr}(H(PL)H^T). \qquad (6.14)$$

6.3.2 Multiplicative Updating Scheme

The updating scheme follows the same principle as the standard GNMF and rGNMF.

TNMF. For TNMF, the Lagrangian function is defined as

$$\mathcal{L} = \|X - WH\|_F^2 + \lambda\mathrm{Tr}(H(PL)H^T) + \mathrm{Tr}(\Phi W) + \mathrm{Tr}(\Psi H) \qquad (6.15)$$

$$= \mathrm{Tr}(X^TX) - 2\mathrm{Tr}(XH^TW^T) + \mathrm{Tr}(WHH^TW^T) + \lambda\mathrm{Tr}(H(PL)H^T) + \mathrm{Tr}(\Phi W) + \mathrm{Tr}(\Psi H). \qquad (6.16)$$

Taking the partial derivative with respect to W, we get

$$\frac{\partial \mathcal{L}}{\partial W} = -2XH^T + 2WHH^T + \Phi. \qquad (6.17)$$

Using the KKT condition $\Phi_{ij}w_{ij} = 0$, we get

$$(-2XH^T)_{ij}w_{ij} + (2WHH^T)_{ij}w_{ij} = 0. \qquad (6.18)$$

Therefore, the updating scheme is

$$w^{t+1}_{ij} \leftarrow w^{t}_{ij}\frac{(XH^T)_{ij}}{(WHH^T)_{ij}}. \qquad (6.19)$$

For updating H, we take the derivative of the Lagrangian function with respect to H,

$$\frac{\partial \mathcal{L}}{\partial H} = -2W^TX + 2W^TWH + 2\lambda H(PL) + \Psi. \qquad (6.20)$$

Using the Karush-Kuhn-Tucker (KKT) condition $\Psi_{ij}h_{ij} = 0$, we obtain

$$-2(W^TX + \lambda H(PA))_{ij}h_{ij} + 2(W^TWH + \lambda H(PD))_{ij}h_{ij} = 0, \qquad (6.21)$$

where $PL = PD - PA$ and $PD_{ii} = \sum_{j \neq i} PA_{ij}$. The updating scheme is then given by

$$h^{t+1}_{ij} \leftarrow h^{t}_{ij}\frac{(W^TX + \lambda H(PA))_{ij}}{(W^TWH + \lambda H(PD))_{ij}}. \qquad (6.22)$$
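To make the update rules concrete, the following numpy sketch applies one iteration of Equations 6.19 and 6.22, taking the persistent adjacency PA and persistent degree PD matrices as inputs; the small epsilon guarding the denominators and the function name are assumptions of this illustration, not part of the derivation.

```python
import numpy as np

def tnmf_update(X, W, H, PA, PD, lam=1.0, eps=1e-10):
    """One multiplicative update of TNMF (Eqs. 6.19 and 6.22).
    X: (M, N) nonnegative data, W: (M, p) basis, H: (p, N) representation,
    PA, PD: (N, N) persistent adjacency and degree matrices, lam: regularization weight."""
    # Eq. 6.19: W update (identical in form to standard NMF)
    W = W * (X @ H.T) / (W @ H @ H.T + eps)
    # Eq. 6.22: H update, using the split of the persistent Laplacian PL = PD - PA
    H = H * (W.T @ X + lam * H @ PA) / (W.T @ W @ H + lam * H @ PD + eps)
    return W, H
```

With lam = 0 this reduces to the standard Lee-Seung updates of Equations 6.2 and 6.3, and replacing PA and PD with the single-scale A and D recovers the GNMF update.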
6.3.2.1 rTNMF

For the updating scheme of rTNMF, we utilize the fact that $\|A\|_{2,1} = \mathrm{Tr}(AQA^T)$, where $Q_{ii} = \frac{1}{2\|A_i\|_2}$. The Lagrangian is given by

$$\mathcal{L} = \|X - WH\|_{2,1} + \lambda\mathrm{Tr}(H(PL)H^T) + \mathrm{Tr}(\Phi W) + \mathrm{Tr}(\Psi H) \qquad (6.23)$$

$$= \mathrm{Tr}((X - WH)Q(X - WH)^T) + \lambda\mathrm{Tr}(H(PL)H^T) + \mathrm{Tr}(\Phi W) + \mathrm{Tr}(\Psi H) \qquad (6.24)$$

$$= \mathrm{Tr}(XQX^T) - 2\mathrm{Tr}(WHQX^T) + \mathrm{Tr}(WHQH^TW^T) + \lambda\mathrm{Tr}(H(PL)H^T) + \mathrm{Tr}(\Phi W) + \mathrm{Tr}(\Psi H), \qquad (6.25)$$

where $Q_{jj} = \frac{1}{\|x_j - Wh_j\|_2}$. Taking the partial derivative with respect to W, we get

$$\frac{\partial\mathcal{L}}{\partial W} = -(XQH^T) + WHQH^T - \Phi. \qquad (6.26)$$

Using the KKT condition $\Phi_{ij}w_{ij} = 0$, we get

$$-(XQH^T)_{ij}w_{ij} + (WHQH^T)_{ij}w_{ij} = 0, \qquad (6.27)$$

which gives the updating scheme

$$w^{t+1}_{ij} \leftarrow w^{t}_{ij}\frac{(XQH^T)_{ij}}{(WHQH^T)_{ij}}. \qquad (6.28)$$

For H, we take the partial derivative with respect to H,

$$\frac{\partial\mathcal{L}}{\partial H} = -W^TXQ + W^TWHQ + 2\lambda H(PL) + \Psi. \qquad (6.29)$$

Then, using the KKT condition $\Psi_{ij}h_{ij} = 0$, we get

$$(-W^TXQ - 2\lambda H(PA))_{ij}h_{ij} + (W^TWHQ + 2\lambda H(PD))_{ij}h_{ij} = 0, \qquad (6.30)$$

where $PL = PD - PA$, which gives the updating scheme

$$h^{t+1}_{ij} \leftarrow h^{t}_{ij}\frac{(W^TXQ + 2\lambda H(PA))_{ij}}{(W^TWHQ + 2\lambda H(PD))_{ij}}. \qquad (6.31)$$

6.3.3 k-NN induced Persistent Laplacian

One major issue with TNMF and rTNMF is that the parameters $\{\zeta_t\}_{t=1}^{T}$ have to be chosen. For these parameters, we let $\zeta_t \in \{0, 1, 1/2, \dots, 1/T\}$, i.e., $T + 1$ possible values per filtration step. Therefore, the number of parameter combinations that needs to be searched grows exponentially as the number of filtrations T increases. We therefore propose an approximation to the original formulation using a k-NN induced PL. Let $N_t(x_j)$ be the t nearest neighbors of sample $x_j$. First, we define the t-persistent directed adjacency matrix $\tilde{A}^t = \{\tilde{a}^t_{ij}\}$ as

$$\tilde{a}^t_{ij} = \begin{cases} 1, & x_j \in N_t(x_i) \\ 0, & \text{otherwise}. \end{cases} \qquad (6.32)$$

Then, the k-NN based directed adjacency matrix is the weighted sum of the $\{\tilde{A}^t\}$,

$$\tilde{A} := \sum_{t=1}^{T} \zeta_t \tilde{A}^t. \qquad (6.33)$$

The undirected persistent adjacency matrix can be obtained via symmetrization, $PA = \tilde{A} + \tilde{A}^T - \tilde{A} \otimes \tilde{A}^T$, where $\otimes$ denotes the Hadamard product. Then, the PL can be constructed using the persistent degree and persistent adjacency matrices,

$$PL = PD - PA, \qquad PD_{ii} = \sum_{j \neq i} PA_{ij}. \qquad (6.34)$$

One advantage of utilizing the k-NN induced persistent Laplacian is that the parameter space is much smaller. We can set $\zeta_t \in \{0, 1\}$, where $\zeta_t = 0$ 'turns off' the connectivity of that particular neighbor. In essence, the number of parameter combinations is reduced to $2^T$, a significant decrease from the $(T + 1)^T$ of the original formulation. We call the k-NN induced TNMF k-TNMF and the k-NN induced rTNMF k-rTNMF.
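The k-NN induced persistent Laplacian of Equations 6.32-6.34 can be assembled in a few lines. The sketch below is a minimal illustration using scikit-learn's nearest-neighbor search, with the zeta weights passed as a simple list; the function name and defaults are assumptions of this sketch rather than the released implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_persistent_laplacian(X, zetas):
    """k-NN induced persistent Laplacian (Eqs. 6.32-6.34).
    X: (N, d) samples; zetas: length-T sequence of weights, zetas[t-1] for the t-NN scale."""
    N, T = X.shape[0], len(zetas)
    nbrs = NearestNeighbors(n_neighbors=T + 1).fit(X)   # +1 because each point is its own neighbor
    _, idx = nbrs.kneighbors(X)

    A_tilde = np.zeros((N, N))
    for t in range(1, T + 1):
        # Eq. 6.32: directed adjacency of the t nearest neighbors (excluding the point itself)
        At = np.zeros((N, N))
        rows = np.repeat(np.arange(N), t)
        At[rows, idx[:, 1:t + 1].ravel()] = 1.0
        A_tilde += zetas[t - 1] * At                     # Eq. 6.33: weighted sum over scales

    PA = A_tilde + A_tilde.T - A_tilde * A_tilde.T       # symmetrization with the Hadamard product
    PD = np.diag(PA.sum(axis=1))                         # persistent degree matrix
    PL = PD - PA                                         # Eq. 6.34
    return PL, PA, PD
```

The resulting PA and PD are exactly what the tnmf_update sketch above consumes, so k-TNMF and k-rTNMF reuse the same multiplicative updates with this Laplacian swapped in.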
BIBLIOGRAPHY

[1] Daniel Lee and H Sebastian Seung. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, 13, 2000.

[2] Yu-Xiong Wang and Yu-Jin Zhang. Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on Knowledge and Data Engineering, 25(6):1336–1353, 2012.

[3] Weixiang Liu, Nanning Zheng, and Qubo You. Nonnegative matrix factorization and its applications in pattern recognition. Chinese Science Bulletin, 51:7–18, 2006.

[4] Deguang Kong, Chris Ding, and Heng Huang. Robust nonnegative matrix factorization using l21-norm. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 673–682, 2011.

[5] Qiu Xiao, Jiawei Luo, Cheng Liang, Jie Cai, and Pingjian Ding. A graph regularized non-negative matrix factorization method for identifying microRNA-disease associations. Bioinformatics, 34(2):239–248, 2018.

[6] Peng Wu, Mo An, Hai-Ren Zou, Cai-Ying Zhong, Wei Wang, and Chang-Peng Wu. A robust semi-supervised NMF model for single cell RNA-seq data. PeerJ, 8:e10091, 2020.

[7] Zhenqiu Shu, Qinghan Long, Luping Zhang, Zhengtao Yu, and Xiao-Jun Wu. Robust graph regularized NMF with dissimilarity and similarity constraints for scRNA-seq data clustering. Journal of Chemical Information and Modeling, 62(23):6271–6286, 2022.

[8] Wei Lan, Jianwei Chen, Qingfeng Chen, Jin Liu, Jianxin Wang, and Yi-Ping Phoebe Chen. Detecting cell type from single cell RNA sequencing based on deep bi-stochastic graph regularized matrix factorization. bioRxiv, pages 2022–05, 2022.

[9] Jin-Xing Liu, Dong Wang, Ying-Lian Gao, Chun-Hou Zheng, Jun-Liang Shang, Feng Liu, and Yong Xu. A joint-l2,1-norm-constraint-based semi-supervised feature extraction for RNA-seq data analysis. Neurocomputing, 228:263–269, 2017.

[10] Na Yu, Ying-Lian Gao, Jin-Xing Liu, Juan Wang, and Junliang Shang. Robust hypergraph regularized non-negative matrix factorization for sample clustering and feature selection in multi-view gene expression data. Human Genomics, 13(1):1–10, 2019.

CHAPTER 7
APPLICATION IN SINGLE CELL RNA SEQUENCING

7.1 Preprocessing of Single Cell RNA Sequencing data using Correlated Clustering and Projection

7.1.1 Introduction

In this section, we propose a computationally efficient and interpretable dimensionality reduction algorithm for scRNA-seq data called correlated clustering and projection (CCP) [1]. CCP begins by clustering genes based on their similarity and then uses the flexibility-rigidity index (FRI) [2] to nonlinearly project each gene cluster into a super-gene, which is a measure of accumulated gene-gene correlations among cells. Unlike traditional nonlinear reduction methods, CCP bypasses matrix diagonalization, allowing users to select the number of super-genes, which is beneficial for machine learning and deep learning tasks. Furthermore, similar to NMF's meta-genes, super-genes are all nonnegative and highly interpretable. We validated CCP's performance on 14 scRNA-seq datasets by varying the number of super-genes and conducting support vector machine classification and k-means clustering.

Additionally, we have validated the performance of a novel evaluation metric for dimensionality reduction, called the Residue-Similarity Index (RSI) [1]. RSI evaluates the intra-cluster similarity of cell types or clusters and compares it to their inter-cluster residue score. As RSI only requires one set of labels, which can be computed from k-means, it can measure the performance of dimensionality reduction for both clustering and classification tasks without requiring knowledge of the true labels. Furthermore, by analyzing the relationship between samples, RSI allows for a deeper understanding of the quality of the dimensionality reduction algorithm. We have verified the effectiveness of RSI alongside CCP on both clustering and classification tasks, and introduced the R-S plot as a novel visualization technique for data containing multiple cell types.

7.1.2 Results

Table 7.1 Accession ID, source organism, counts of samples, genes, and cell types, and normalization for the 14 datasets.
Accession ID | Reference | Organism | Samples | Genes | Cell types | Normalization
GSE45719 | Deng [3] | Mouse | 300 | 22431 | 8 | RPKM
GSE59114 | Kowalczyk [4] | Mouse | 1428 | 8422 | 6 | TPM
GSE67835 | Darmanis [5] | Human | 420 | 22084 | 8 | CPM
GSE75748 cell | Chu [6] | Human | 1018 | 19097 | 7 | TPM
GSE75748 time | Chu [6] | Human | 758 | 19189 | 6 | TPM
GSE82187 | Gokce [7] | Mouse | 705 | 18840 | 10 | TPM
GSE84133 h1 | Baron [8] | Human | 1937 | 20125 | 14 | TPM
GSE84133 h2 | Baron [8] | Human | 1724 | 20125 | 14 | TPM
GSE84133 h3 | Baron [8] | Human | 3605 | 20125 | 14 | TPM
GSE84133 h4 | Baron [8] | Human | 1308 | 20125 | 14 | TPM
GSE84133 m1 | Baron [8] | Mouse | 822 | 14878 | 13 | TPM
GSE84133 m2 | Baron [8] | Mouse | 1064 | 14878 | 13 | TPM
GSE89232 | Breton [9] | Human | 957 | 20689 | 4 | TPM
GSE94820 | Villani [10] | Human | 1140 | 26593 | 5 | TPM

CCP was benchmarked against PCA on 14 datasets, whose details can be found in Table 7.1. The data were normalized using either reads per kilobase of transcript per million (RPKM), transcripts per million (TPM), or counts per million (CPM). For each dataset, CCP was used to obtain N = 50, 100, 150, 200, 250, and 300 super-genes. The parameters κ and τ of the exponential kernel were searched over κ = 1, 2 and τ = 1, 2, ..., 6 and were set to τ = 6 and κ = 2. To test the reduction, 20 random seeds were used for CCP and PCA, and for each reduction, 30 random initializations of k-means were used to obtain cluster labels. After obtaining the cluster labels, ARI and NMI were computed by comparing the results to the labeled cell types, and the averages were visualized. In each figure, the red and blue lines represent CCP and PCA, respectively, and the star and dot markers indicate ARI and NMI, respectively.

7.1.2.1 CCP Benchmark

Figure 7.1 shows the performance of CCP and PCA on three datasets: GSE67835, GSE75748 time, and GSE59114. For GSE67835, CCP outperforms PCA at all the dimensions we tested. For GSE75748 time, CCP outperforms PCA for 50 super-genes and above, and its performance increases as the number of super-genes increases. PCA exhibits instability as N increases, which is noticeable from its decrease in performance from N = 50 to 150 for both datasets. CCP does not perform well on GSE59114: both ARI and NMI are less than 0.3 for all the dimensions we tested. CCP's performance may be poor due to the low intrinsic dimensionality of GSE59114. In other words, the number of gene clusters is inherently small, leading to redundant clusters. GSE59114, in particular, has only 8,422 genes, whereas the other datasets have over 15,000 genes.

Figure 7.1 ARI and NMI of the clustering results of CCP and PCA on GSE67835, GSE75748 time, and GSE59114 data. The red and blue lines correspond to CCP and PCA, respectively. A total of 20 random initializations were used to test the reduction, and for each reduction, a total of 30 random initializations were used to obtain the clustering results from k-means clustering. The averages of the ARI and NMI were obtained. For CCP, all tests utilize τ = 6 and κ = 2 for the exponential kernel.

In order to verify CCP's performance, the residue similarity index (RSI) was calculated for the k-means clustering result of the gene partitioning in CCP. Figure 7.2 shows the RSI of the k-means clustering on the genes at various numbers of gene clusters (k). The top row shows the clustering result for GSE59114, which had poor CCP performance, and the bottom row shows the clustering result for GSE67835, which had good CCP performance.
For each number of clusters, 10 random initializations were used for the k-means clustering, and the averages of the RI, SI, and RSI were obtained. The red, blue, and green lines correspond to RI, SI, and RSI, respectively. RSI can be used to check the quality of the clustering, where a peak in RSI suggests the optimal number of clusters, which in the case of CCP reflects the intrinsic dimensionality of the data. The right column shows the 2D visualization of the genes using t-SNE, with the samples colored according to their cluster labels. The t-SNE visualization of GSE59114 shows the k-means clustering result when k = 8 was selected. The t-SNE visualization of GSE67835 shows the k-means clustering result when k = 64 was selected; seven of the 64 clusters were colored, and the green samples are the rest of the genes.

Notice that in GSE59114 there is a noticeable peak in the RSI score at k = 8 clusters, whereas in GSE67835 the peak is flat and occurs at about k = 32-64 clusters. This suggests that the intrinsic dimensionality is about 8 for GSE59114, which is unfavorable for CCP. On the other hand, the intrinsic dimension of GSE67835 is much higher, which is more suitable for CCP. Notice that the GSE59114 clusters have distinct boundaries, supporting the relatively low dimensionality of the data. On the other hand, the GSE67835 data is not well clustered even at k = 64: the orange and blue genes have some outliers, and the purple genes are not well clustered. This suggests that the optimal number of gene clusters is larger, which indicates a high gene dimensionality and favors CCP.

Figure 7.2 RI, SI, and RSI of the gene clustering of (a) GSE59114 and (b) GSE67835. k-means clustering was performed with k = 2, 4, 8, 16, 32, 64, and 128 gene clusters. For each number of clusters, 10 random initializations were utilized, and the averages of RI, SI, and RSI were obtained. The red, blue, and green lines correspond to RI, SI, and RSI, respectively. We use t-SNE to visualize the genes in 2D. For GSE59114, k = 8 clusters were obtained, and the genes were colored according to their cluster assignment. For GSE67835, k = 64 clusters were obtained; seven random clusters were colored, and the rest were colored in green.

7.1.2.2 Residue-Similarity Index comparison

The residue-similarity index (RSI) has been shown to correlate with classification accuracy in [1]. In this section, we use RSI for classification and clustering on the 14 datasets from Table 7.1. We use CCP to process each dataset with the same parameters as in the previous section with 20 random initializations. For classification, we use 5-fold cross-validation with 10 random seeds and a support vector machine to predict cell types. We use balanced accuracy (BA) to measure the performance of the classification. Then, using the same 5-fold cross-validation, we calculate RSI, where we obtain the RI, SI, and RSI from the test set, similar to [1]. For clustering, we compute RSI for PCA and CCP using the k-means clustering labels and the true labels. Additionally, using the k-means clustering labels, we compute the Silhouette score to compare the results with RSI. Full details of the benchmark procedure can be found in Section 7A.2.1 of the Supporting Materials.

In general, we have found no correlation between the Silhouette scores and RSI for clustering results. Additionally, we have found that BA and RSI correlate in classification results.
Figure 7.3 Comparison of RSI for classification and clustering problems on GSE67835, GSE75748 time, and GSE82187 data at reduced dimensions N = 50, 100, 150, 200, 250, and 300. CCP was used to reduce the original data dimension using τ = 6 and κ = 2 for the exponential kernel. The top and bottom rows correspond to the classification and clustering results, respectively. For classification, a support vector machine was used, and the true labels were used to compute the RSI for the 5-fold cross-validation. For clustering, RSI was computed using both the cluster labels from k-means clustering and the true labels.

We found that RSI correlates with classification accuracy in many of our tests. Figure 7.3 shows RSI for classification and clustering problems on GSE67835, GSE75748 time, and GSE82187 data. CCP was used to reduce the original data using τ = 6 and κ = 2 for the exponential kernel. The top row corresponds to classification results, and the bottom row corresponds to clustering results. Notice that for the classification results, all three datasets show a correlation between BA and RSI. The RSI of the classification results for GSE67835 plateaus at about 150 super-genes, which corresponds to the plateau of the BA; this suggests that the optimal dimension is about 150. The RSI of the classification results for GSE75748 time plateaus at about 200 super-genes, even though the BA plateaus at about 150; this suggests that the optimal dimension is about 200 super-genes. In addition, since GSE75748 time observes cell differentiation at different times, it is possible that some cells are at different stages of their cell cycles, as suggested in the literature [6], which indicates that there are many intermediate stages in cell differentiation. The RSI of the classification results for GSE82187 shows a small decrease as the number of super-genes increases, suggesting that its optimal dimension is smaller than those of GSE67835 and GSE75748 time. Lastly, RSI decreases for all three datasets when PCA is utilized, which corresponds to the decrease in BA.

For the clustering results, the RSI values computed with the k-means labels and with the true cell types are similar. Even though the ARI and NMI of PCA decrease as the number of gene clusters increases, its RSI remains consistent. This suggests that PCA is not able to differentiate clusters at higher dimensions. CCP, on the other hand, shows a correlation with both RSI scores. Additional examples of utilizing RSI on classification and clustering problems can be found in Section 7A.2.2 of the Supporting Materials.

Figure 7.4 Comparison of CCP and PCA clustering on GSE67835, GSE75748 time, and GSE82187 data. CCP was used to reduce the original data dimension using τ = 6 and κ = 2 for the exponential kernel. The blue, orange, green, and red bars correspond to the mean CCP ARI, mean PCA ARI, mean CCP NMI, and mean PCA NMI, respectively. Here, the average was taken over the different dimensions.

Figure 7.4 shows the overall clustering performance of CCP and PCA. The bars show the mean ARI and NMI values across the different numbers of components. Notice that for both ARI and NMI, CCP outperforms PCA.

Figure 7.5 Comparison of CCP and PCA classification on GSE67835, GSE75748 time, and GSE82187 data. CCP was used to reduce the original data dimension using τ = 6 and κ = 2 for the exponential kernel. The blue and orange bars correspond to the mean BAs of CCP and PCA, respectively. Here, the average was taken over the different dimensions.
Figure 7.5 shows the overall classification performance of CCP and PCA. The bars show the mean BAs across the different numbers of dimensions. Notice that for the mean BA, CCP outperforms PCA.

7.1.3 Discussion

7.1.3.1 CCP

Like other dimensionality reduction algorithms, CCP has its advantages and disadvantages. CCP nonlinearly projects each cluster of similar genes into a super-gene. Super-genes are highly interpretable: for a given cell, each super-gene measures the accumulated pairwise nonlinear correlations between a cluster of genes in that cell and the same cluster of genes in all other cells. Similar to NMF, super-genes are nonnegative, which is important for downstream analysis such as differential gene expression analysis.

Since CCP is a data-domain method, it bypasses matrix diagonalization. One limitation of many dimensionality reduction algorithms is their dependence on matrix diagonalization. In scRNA-seq data, the number of genes is typically larger than 5,000, which gives rise to the "curse of dimensionality": when the number of features is large, every sample may appear to be equidistant from every other sample, which prevents many machine learning algorithms from finding meaningful clusters in the data. CCP, on the other hand, partitions the genes into clusters and computes the pairwise gene-gene correlations across all cells, which avoids the curse of dimensionality.

Even though CCP has shown success on many scRNA-seq datasets, it does have limitations. CCP does not perform well for datasets with a low intrinsic dimension. As shown in Figure 7.2, GSE59114 and GSE94820 have a low intrinsic dimension, and as a result, their clustering results also suffered. In addition, many scRNA-seq datasets are sparse due to a low signal-to-noise ratio and dropout events. Therefore, CCP will most likely benefit from data imputation.

7.1.3.2 RSI

RSI is a useful tool for assessing the performance of dimensionality reduction for both clustering and classification problems. In the following, we compare RSI to the traditional clustering metrics, ARI and NMI, and also to the Silhouette score. Then, we discuss RSI and its connection with classification accuracy.

RSI for clustering. Compared to ARI and NMI, which measure the similarity between two sets of labels, RSI evaluates the performance using only one set of labels. In this study, ARI and NMI were used to compare the true labels with the clustering labels. However, in practice, such true labels may not be available. RSI, on the other hand, can evaluate the effectiveness of clustering without the need for the original labels. This is similar to the Silhouette score, which measures the separation between clusters. However, when there are multiple clusters, the Silhouette score becomes difficult to interpret because it measures whether a sample belongs to its current cluster assignment or to the nearest neighboring cluster. Therefore, it is often used to evaluate the optimal number of clusters rather than to evaluate different parameters while fixing the number of clusters. RSI, in contrast, can evaluate the effectiveness of different parameters while fixing the number of clusters.

RSI for classification. Using RSI for cell types, we have shown that RSI correlates with classification accuracy. Additionally, RI and SI indicate how well the clusters separate from each other. The area under the receiver operating characteristic curve (AUC-ROC) is a metric commonly used to evaluate classification effectiveness.
However, AUC-ROC is better suited to binary classification problems, and its interpretation is more challenging for multiclass problems. RSI, on the other hand, can handle problems with more than two cell types. Lastly, RSI uses the features and labels to compute the scores, so it can also demonstrate the effectiveness of dimensionality reduction algorithms in conjunction with classification problems.

RSI can also be utilized for visualizing each class or cluster, which we have called the Residue-Similarity (R-S) plot. In order to showcase the R-S plot, we compare it with traditional visualization techniques used in scRNA-seq data, namely t-SNE and UMAP. CCP was utilized to reduce the dimensionality. The 5-fold cross-validation was used to divide the data into 5 parts, where 4 parts were used to train the support vector machine classifier and 1 part was used to test the classifier. Then, residue and similarity scores were computed for each sample and plotted according to its true cell type. Samples were then colored according to their predicted labels from the support vector machine classifier. The x-axis and y-axis correspond to the residue and similarity scores, respectively. Both residue and similarity scores range from 0 to 1, where 1 is optimal, and the top-right corner indicates a well-separated and well-clustered reduction. However, it is important to note that a balance of both scores is needed, as shown in Hozumi et al. [1]. For t-SNE and UMAP, the original data was log-transformed, and genes with variance less than 10^-6 were removed prior to the reduction. Samples were then plotted and colored according to their cell types.

Figure 7.6 R-S plot, CCP-assisted t-SNE plot, and standard t-SNE plot of GSE75748 time data. CCP was used to reduce the scRNA-seq data to 200 super-genes using τ = 6 and κ = 2. The 5-fold cross-validation was used to split the data into 5 parts, where 4 were used for training and 1 part was used for testing the support vector machine classifier. R-S scores were computed for the testing set, and all 5 folds were visualized. Each section corresponds to one of the true cell types, and each sample's color and marker correspond to the predicted label from the support vector machine classifier. For t-SNE and UMAP, the data was log-transformed, and any genes with less than 10^-6 variance were removed before applying the reduction. Samples were colored according to their cell types.

Figure 7.6 shows a comparison between the R-S plot and 2D plots of UMAP and t-SNE for the GSE75748 time data. CCP was used to generate 200 super-genes with τ = 6 and κ = 2. For the UMAP and t-SNE plots, the reduction was directly applied to the log-transformed original data. In [6], Chu obtained snapshots at different times of ES cell differentiation from pluripotency to definitive endoderm over 4 days, at 0hr, 12hrs, 24hrs, 36hrs, 72hrs, and 96hrs. Noticeably, cells recorded at 72hrs and 96hrs are mixed in the UMAP and t-SNE plots and are misclassified in the R-S plot. This finding is consistent with [6], where cells from 72hrs and 96hrs were relatively homogeneous. In a biological sense, this may indicate that cell differentiation had mostly completed by 72 hrs, such that not much further differentiation was observed at 96 hrs. In the t-SNE and UMAP plots, we can see a similar pattern as in the R-S plot. There are 2 subclusters of the 12hr samples.
Additionally, the 72hr and 96hr samples form one large cluster, which is consistent with the R-S plot's findings. Most notably, there is a large difference between the ES cells at 0hr and the ES cells at later times in all visualizations, and there is no misclassification of the 0hr state with cells from the 72hr and 96hr states, indicating that the cells have indeed differentiated from the original pluripotent state.

Figure 7.7 R-S plot and CCP-assisted UMAP and t-SNE plots of GSE75748 cell data. CCP was used to reduce the scRNA-seq data to 100 components using τ = 6 and κ = 2. The 5-fold cross-validation was used to split the data into 5 parts, where 4 parts were used for training and 1 part was used for testing the k-NN classifier. R-S scores were computed for the testing set, and all 5 folds were visualized. Each section corresponds to one of the 7 true cell types, and each sample's color and marker correspond to the predicted label from the k-NN classifier. For t-SNE and UMAP, the data was log-transformed and any genes with less than 10^-6 variance were removed before applying the reduction. Samples were colored according to their cell types.

Figure 7.7 shows a comparison between the R-S plot and 2D plots of UMAP and t-SNE of the GSE75748 cell data. CCP was used to reduce the dimension to 100 super-genes with τ = 6 and κ = 2. In [6], Chu obtained snapshots of lineage-specific progenitor cells that differentiated from H1 human embryonic stem (ES) cells. These differentiated cells include neuronal progenitor cells (NPC), definitive endoderm cells (DEC), endothelial cells (EC), trophoblast-like cells (TB), human foreskin fibroblasts (HFF), and undifferentiated H1 and H9 human ES cells. Not surprisingly, all 3 visualizations show that the undifferentiated ES cells H1 and H9 are clustered together, indicating that these two ES cell lines are relatively homogeneous, which agrees with Chu's findings. In the R-S plot, we see that all but 1 DEC sample are classified correctly, whereas in the UMAP and t-SNE plots, the DEC samples do not form a distinct cluster and instead form a supercluster with the H1 and H9 clusters. In addition, all 3 visualizations show 2 clusters of NPC samples, but CCP is able to classify the NPC samples correctly. Notice that in the R-S plot, there are a few misclassifications of EC and DEC cells, and in UMAP, these two clusters are adjacent to one another, which is consistent with the small number of misclassified EC and DEC cells shown in the R-S plot. Since EC are derivatives of mesoderm, it has been suggested [11, 12, 13] that mesoderm and DEC may have developed and differentiated from a common progenitor pool.

7.1.4 Conclusion

CCP is a novel dimensionality reduction method that projects each cluster of similar genes into a super-gene, defined as accumulated pairwise nonlinear gene-gene correlations among cells. We have shown that CCP is able to differentiate cell types and also preserve similarity along the trajectory of cellular differentiation. In addition, since CCP works exclusively in the data domain, it does not rely on matrix diagonalization and its results are easily interpretable. It outperforms PCA for problems having an intrinsically high dimensionality. We have also shown that RSI is a novel metric for evaluating the effectiveness of dimensionality reduction algorithms. Since it correlates with accuracy but does not rely on knowing the true labels of the data, it can be applied to improve both clustering and classification.
In addition, RSI can be used to vary the number of clusters and obtain insight into the optimal number of cell types. This information can be used to filter out data on which CCP may not perform well, because CCP works best when the intrinsic dimensionality of the data, i.e., the number of gene features, is relatively high. Lastly, the R-S plot is introduced as a new visualization tool that works well for problems with a large number of cell types.

7.1.5 Code and Data availability

All data was processed and is available at https://github.com/hozumiyu/SingleCellDataProcess. The code needed to reproduce this paper's results can be found at https://github.com/hozumiyu/CCP-for-Single-Cell-RNA-Sequencing. CCP is made available through our web server at https://weilab.math.msu.edu/CCP/ or through the source code at https://github.com/hozumiyu/CCP. The source code for RSI and the R-S plot can be found at https://github.com/hozumiyu/RSI.

7.2 Analyzing scRNA-seq data by CCP-assisted UMAP and t-SNE

7.2.1 Introduction

The objective of the present work is to explore the utility of CCP for initializing scRNA-seq data. We are particularly interested in its potential application for initializing UMAP and t-SNE, which are among the most successful visualization tools in scRNA-seq analysis. We tested CCP-assisted UMAP and CCP-assisted t-SNE on eight publicly available datasets. CCP's performance in assisting UMAP and t-SNE compares favorably with that of PCA and NMF.

Additionally, we introduce a novel method for handling low-variance (LV) genes. Instead of discarding low-variance genes like many other methods, we group them together into a single category. This grouping is achieved by projecting them into one descriptor using FRI. One of the drawbacks of dropping low-variance genes is that scRNA-seq data often has an unbalanced cell-type composition. Moreover, there are numerous genes with low expression, and removing too many genes may result in overlooking cell outliers. The LV-gene addresses this issue by consolidating low-variance genes into one descriptor, thereby increasing their predictive power. We found that CCP improves the accuracy of UMAP and t-SNE by over 11% in each case.

7.2.2 Methods and Algorithms

In this section, we describe the construction of the LV-gene, which becomes one of the components. For the rest of the components, refer to Section 5.2 for details.

7.2.2.1 Low variance (LV) genes

Let $v = (v_1, \dots, v_I)$ be the variances of the genes, where $v_i$ is the variance of gene $z_i$, and assume that the variances are sorted in descending order. Then, define the low-variance set $P$ as $P = \{i \mid i > v_c I\}$, where $0 \leq v_c \leq 1$ is the cutoff ratio. We can then obtain the cell-cell correlation using these low-variance genes,

$$C^P_{ij} = \Phi(\|z^P_i - z^P_j\|; \eta^P, \tau, \kappa),$$

where $\Phi(\|z^P_i - z^P_j\|; \eta^P, \tau, \kappa)$ is the generalized exponential function

$$\Phi(\|z^P_i - z^P_j\|; \eta^P, \tau, \kappa) = \begin{cases} e^{-\left( \frac{\|z^P_i - z^P_j\|}{\eta^P \tau} \right)^{\kappa}}, & \|z^P_i - z^P_j\| < r^P_c \\ 0, & \text{otherwise}. \end{cases}$$

Here, $r^P_c$ is taken as 3 standard deviations of the pairwise distances, and $\eta^P$ is the average minimum distance

$$\eta^P = \frac{\sum_{m=1}^{M} \min_{z^P_j} \|z^P_m - z^P_j\|}{M}.$$

Using the correlation function, CCP projects the $|P|$ low-variance genes into a super-gene for the $i$-th sample,

$$x^P_i = \sum_{m=1}^{M} w_{im}\, \Phi(\|z^P_i - z^P_m\|; \eta^P, \tau, \kappa),$$

where the $w_{im}$ are the weights. For CCP, we compute the LV-gene first and then use the correlated partition algorithm on the remaining genes.
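A minimal sketch of the LV-gene construction described above is given below; it mirrors the per-cluster CCP projection, and the uniform weights w_im = 1, the default parameters, and the hypothetical function name lv_gene are assumptions of this illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def lv_gene(X, vc=0.8, tau=6.0, kappa=2.0):
    """Project the low-variance genes into a single LV super-gene per cell.
    X: (cells, genes) normalized expression matrix; vc: cutoff ratio for the low-variance set P."""
    variances = X.var(axis=0)
    order = np.argsort(variances)[::-1]             # genes sorted by descending variance
    lv_idx = order[int(np.ceil(vc * X.shape[1])):]  # low-variance set P: indices past the cutoff
    Z = X[:, lv_idx]                                # cells restricted to the low-variance genes

    D = squareform(pdist(Z))                        # pairwise cell-cell distances
    r_c = 3.0 * D.std()                             # cutoff: 3 standard deviations of the pairwise distances
    D_off = D + np.eye(len(Z)) * (D.max() + 1.0)
    eta = D_off.min(axis=1).mean() + 1e-12          # average minimum distance

    Phi = np.exp(-(D / (eta * tau)) ** kappa)       # generalized exponential kernel
    Phi[D >= r_c] = 0.0
    return Phi.sum(axis=1)                          # LV super-gene (weights w_im taken as 1)
```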
7.2.3 Results

7.2.3.1 Data preprocessing

We have tested CCP-assisted UMAP and tSNE visualization on 20 publicly available datasets. Table 7.2 displays information including the Gene Expression Omnibus (GEO) accession ID [14, 15], the reference, the data dimensions, and the cell composition for each dataset. Additionally, data from the scziDesk paper [16] was utilized and can be accessed from their supporting materials. The Qx and Qs data correspond to 10x Genomics and Smart-seq2 data from Quake et al. [17], respectively. Notably, the GSE84133 human dataset encompasses all human patient data from Baron et al. [8]. Detailed statistics for each dataset can be found in Table 7B.1 in the supporting materials.

Table 7.2 Dataset name, reference, dimensions, and cell type composition.

Dataset [Ref] | Size (cells x genes) | Cell Composition
GSE75748 cell [6] | 1018 x 19097 | 7 clusters: 138, 105, 212, 162, 159, 173, 69
GSE75748 time [6] | 758 x 19189 | 6 clusters: 92, 102, 66, 172, 138, 188
GSE82187 [7] | 705 x 18840 | 10 clusters: 107, 18, 21, 71, 48, 7, 334, 13, 43, 43
GSE67835 [5] | 420 x 22084 | 8 clusters: 18, 62, 20, 110, 25, 16, 131, 38
GSE84133 H1 [8] | 1937 x 20125 | 14 clusters: 110, 51, 236, 872, 214, 120, 130, 13, 70, 14, 8, 92, 5, 2
GSE84133 H2 [8] | 1724 x 20125 | 14 clusters: 3, 81, 676, 371, 125, 301, 23, 2, 86, 17, 9, 22, 6, 2
GSE84133 H3 [8] | 3605 x 20125 | 14 clusters: 843, 100, 1130, 787, 161, 376, 92, 2, 36, 14, 7, 54, 1, 2
GSE84133 H4 [8] | 1303 x 20125 | 14 clusters: 2, 52, 284, 495, 101, 280, 7, 1, 63, 10, 1, 5, 1, 1
GSE84133 M1 [8] | 822 x 14878 | 13 clusters: 2, 4, 4, 9, 343, 85, 236, 72, 14, 4, 17, 29, 3
GSE84133 M2 [8] | 1064 x 14878 | 13 clusters: 8, 3, 10, 182, 551, 133, 39, 67, 27, 4, 19, 18, 3
GSE84133 human [8] | 8569 x 20125 | 14 clusters: 958, 284, 2326, 2525, 601, 1077, 252, 18, 255, 55, 25, 173, 13, 7
Muraro [18] | 2122 x 19046 | 9 clusters: 21, 812, 193, 101, 219, 245, 3, 80, 448
Romanov [19] | 2881 x 21143 | 7 clusters: 267, 240, 356, 48, 898, 1001, 71
Qx Bladder [17] | 2500 x 23341 | 4 clusters: 1203, 1167, 57, 73
Qx Limb Muscle [17] | 3909 x 23341 | 6 clusters: 461, 320, 1330, 308, 1136, 354
Qx Spleen [17] | 9552 x 23341 | 5 clusters: 6886, 1930, 42, 464, 230
Qs Diaphragm [17] | 870 x 23341 | 5 clusters: 78, 81, 31, 241, 439
Qs Limb Muscle [17] | 1090 x 23341 | 6 clusters: 71, 35, 141, 45, 258, 540
Qs Lung [17] | 1676 x 23341 | 11 clusters: 57, 53, 25, 90, 113, 35, 693, 65, 85, 37, 423
Qs Trachea [17] | 1350 x 23341 | 4 clusters: 206, 113, 201, 830

To normalize the data, we first normalized the counts using the median count per cell. Let X ∈ R^{M×N} be the data, with M cells and N genes. Each row (cell) was divided by its row sum, followed by multiplication by the median row sum, to obtain a normalized count matrix. Finally, a log transformation using log1p was applied to obtain the final normalized counts.

In our benchmarking process, we employed CCP with parameters τ = 6 and κ = 2 to reduce the dimensions to 300 super-genes. Additionally, we utilized a cutoff ratio of v_c = 0.8 to generate the LV-gene. Clustering was performed using the Leiden algorithm, and we evaluated the quality of the clustering using ARI, NMI, and ECM by comparing the obtained clusters with the cell types provided by the original authors. Visualizations were generated using Scanpy's implementation of UMAP and tSNE. In order to reduce the computational load for datasets exceeding 2,000 samples, we utilized subsampling.
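The preprocessing and evaluation pipeline just described can be summarized in a short scanpy/scikit-learn sketch; the CCP reduction itself is assumed to be available as a precomputed matrix of super-genes, the function names are illustrative, and the ECM score is not included here.

```python
import numpy as np
import scanpy as sc
from anndata import AnnData
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def median_normalize_log1p(X):
    """Median-count normalization followed by log1p, as described above.
    X: (cells, genes) raw count matrix."""
    row_sums = X.sum(axis=1, keepdims=True)
    X_norm = X / np.maximum(row_sums, 1e-12) * np.median(row_sums)
    return np.log1p(X_norm)

def leiden_scores(features, true_labels):
    """Cluster the reduced features (e.g., 300 CCP super-genes) with the Leiden algorithm
    and score the clusters against the author-provided cell types with ARI and NMI."""
    adata = AnnData(np.asarray(features, dtype=np.float32))
    sc.pp.neighbors(adata, use_rep="X")   # k-NN graph on the reduced features
    sc.tl.leiden(adata)                   # Leiden community detection
    pred = adata.obs["leiden"].to_numpy()
    return (adjusted_rand_score(true_labels, pred),
            normalized_mutual_info_score(true_labels, pred))
```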
7.2.3.2 Visualization

Preprocessing of scRNA-seq data is a key step for visualization. Figure 7.8 shows an example of CCP-assisted tSNE visualization and the original tSNE visualization of the Baron dataset [8]. The original data has 20,125 genes, and aggressively reducing the original dimension to 2 dimensions by tSNE leads to poor visualization. In CCP-assisted tSNE, CCP was utilized to reduce the original genes to 300 super-genes, which were further reduced to 2 dimensions with tSNE for visualization. Obviously, CCP-assisted tSNE significantly improves the visualization quality in this case. We further showcase CCP-assisted visualization on the datasets described in Table 7.2. We provide additional comparisons with PCA-assisted and NMF-assisted visualization in Section 7B.1.2 of the supporting materials.

Figure 7.8 tSNE visualization of GSE84133 mouse 2 data. The left and right figures show the CCP-assisted and unassisted tSNE visualizations, respectively.

Figure 7.9 shows the comparison of CCP-assisted UMAP and tSNE with standard UMAP and tSNE visualizations on the Quake datasets. Each row corresponds to one of the 5 datasets, and the columns correspond to the CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP, and standard tSNE visualizations. The samples were colored according to the true cell types.

Figure 7.9 Comparison of CCP-assisted UMAP and tSNE with standard UMAP and tSNE visualizations on the Quake datasets. The rows correspond to Qx Bladder, Qx Limb Muscle, Qs Diaphragm, Qs Limb Muscle, and Qs Trachea. Qx indicates scRNA-seq data obtained using the 10x Genomics platform, and Qs indicates data obtained from the Smart-seq2 platform. CCP was used to reduce the dimension to 300 super-genes. UMAP and tSNE were utilized to further reduce the dimension to 2 to obtain the visualizations. Samples were colored according to the cell types provided by the original authors.

CCP improves the overall visualization of the Quake datasets. In the Qx Bladder data, CCP-assisted UMAP and tSNE show three subclusters of bladder urothelial cells, whereas the standard UMAP and tSNE show only one cluster. However, the standard visualizations show leukocytes and endothelial cells within the bladder urothelial cell cluster. In the Qs Diaphragm data, CCP-assisted UMAP and tSNE show 5 distinct clusters corresponding to the cell types. However, the standard UMAP visualization does not differentiate the 5 cell types, and the standard tSNE visualization shows poor clustering, where satellite cells, mesenchymal cells, and endothelial cells form a supercluster. In the Qs Limb Muscle data, all visualizations show a supercluster of B cells and T cells. The CCP-assisted visualizations show a clear distinction between the B-T cell supercluster and macrophages, whereas the standard visualizations show a supercluster of B cells, T cells, macrophages, and endothelial cells. In the Qs Trachea data, the standard UMAP and tSNE visualizations show a subpopulation of mesenchymal cells within the epithelial cells, whereas the CCP-assisted counterparts do not.

Figure 7.10 shows the comparison of CCP-assisted UMAP and tSNE with standard UMAP and tSNE visualizations on the GSE75748 cell, GSE75748 time, GSE67835, and GSE82187 datasets. The columns correspond to the CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP, and standard tSNE visualizations. The samples were colored according to the true cell types.

Figure 7.10 Comparison of CCP-assisted UMAP with PCA-assisted, NMF-assisted, and standard UMAP visualization on GSE75748 cell, GSE75748 time, GSE67835, and GSE82187. CCP was used to reduce the dimension to 300 super-genes. UMAP and tSNE were utilized to further reduce the dimension to 2 to obtain the visualizations. Samples were colored according to the cell types provided by the original authors.
In the GSE75748 cell data, all visualizations are similar. In [6], Chu obtained snapshots of lineage-specific progenitor cells that differentiated from H1 human embryonic stem (ES) cells and compared the gene profiles with undifferentiated H1 and H9 human ES cells as controls. Most notably, H1 and H9 cluster together, which is consistent with our visualization. In GSE75748 time, all visualizations are comparable. Chu et al. [6] obtained snapshots of ES cell differentiation from pluripotency to definitive endoderm over the time points 0hr, 12hr, 24hr, 36hr, 72hr, and 96hr. Chu noted that the cells sequenced at 72hr and 96hr show relatively similar expression profiles, suggesting that the differentiation had completed by 72hr. We see from our visualization that the 72hr and 96hr cells form one cluster, the 12hr and 24hr cells form another cluster, and the 0hr cells form their own cluster, indicating that there is a clear distinction between the undifferentiated cells and the cells undergoing differentiation. In GSE67835, the CCP-assisted visualizations and their standard counterparts give comparable results. Most notably, neurons form a distinct cluster in the CCP-assisted visualizations, whereas they do not in the standard visualizations. In the GSE82187 data, CCP-assisted UMAP and tSNE show a significant improvement over the standard UMAP and tSNE visualizations. Aside from astrocytes and OPC, each cell type forms its own cluster, while standard UMAP and tSNE fail to show significant clustering of the different cell types.

Figure 7.11 shows the comparison of CCP-assisted UMAP and tSNE with standard UMAP and tSNE visualizations on the Baron human dataset [8]. The rows correspond to the patients, and the columns correspond to the CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP, and standard tSNE visualizations. The samples were colored according to the true cell types.

Figure 7.11 Comparison of CCP-assisted UMAP with PCA-assisted, NMF-assisted, and standard UMAP visualization on the GSE84133 human dataset. Each row corresponds to 1 of the 4 patients. CCP was used to reduce the dimension to 300 super-genes. UMAP and tSNE were utilized to further reduce the dimension to 2 to obtain the visualizations. Samples were colored according to the cell types provided by the original authors.

Overall, the CCP-assisted visualizations show stronger clustering. In the standard UMAP and tSNE visualizations across all patients, we noticed superclusters with unclear boundaries. Conversely, the CCP-assisted visualizations display well-defined boundaries between cell types. Most notable is the clear differentiation of quiescent stellate (Q-Stellate) cells, alpha cells, and ductal cells across all patients, a distinction that is not as evident in the standard visualizations.

Figure 7.12 shows the comparison of CCP-assisted UMAP and tSNE with standard UMAP and tSNE visualizations on the Baron mouse dataset [8]. The rows correspond to mouse 1 and 2, and the columns correspond to the CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP, and standard tSNE visualizations. The samples were colored according to the true cell types.

Figure 7.12 Comparison of CCP-assisted UMAP with PCA-assisted, NMF-assisted, and standard UMAP visualization on the GSE84133 mouse dataset. The rows correspond to mouse 1 and 2. CCP was used to reduce the dimension to 300 super-genes. UMAP and tSNE were utilized to further reduce the dimension to 2 to obtain the visualizations. Samples were colored according to the cell types provided by the original authors.
CCP-assisted visualizations demonstrate significantly stronger clustering for both mouse samples. In the standard visualizations, beta cells are scattered among other cell types. Furthermore, in the data from mouse 2, alpha cells do not form a distinct cluster. Conversely, CCP-assisted visualizations distinctly cluster all cell types. Regarding mouse 1, the CCP-assisted visualization does not form a cluster for gamma cells, potentially due to the limited number of available gamma cells.

Figure 7.13 shows the visualization of the Muraro, Romanov, and Qs Lung data. The columns correspond to CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP, and standard tSNE visualization. The samples were colored according to the true cell type.

Figure 7.13 Comparison of CCP-assisted UMAP and tSNE with standard UMAP and tSNE visualization on the Muraro, Romanov, and Qs Lung datasets. CCP was used to reduce the dimension to 300 super-genes. UMAP and tSNE were utilized to further reduce the dimension to 2 to obtain the visualization. Samples were colored according to the cell types provided by the original authors.

In the Muraro dataset, CCP-assisted UMAP exhibits a clear separation of A cells, D cells, B cells, and ductal cells. In contrast, the standard UMAP visualization presents these cells as a supercluster. The standard tSNE visualization is dominated by the outlier B cells, rendering the visualization less interpretable. Regarding the Romanov dataset, all visualizations are relatively similar. CCP-assisted UMAP reveals a distinct cluster of astrocytes and ependymal cells, whereas both the standard UMAP and tSNE display a supercluster of these two cell types. Additionally, CCP-assisted UMAP and tSNE suggest two subclusters of VSM and endothelial cells, which are not discernible in the standard visualization. In the Qs Lung dataset, CCP-assisted and standard visualizations yield comparable results. While the standard tSNE separates monocytes from classical monocytes, CCP-assisted UMAP and tSNE portray a homogeneous clustering of these two cell types.

7.2.3.3 Accuracy

To assess CCP's effectiveness as a primary dimensionality reduction tool for UMAP and tSNE, we conducted clustering using the Leiden algorithm within scanpy. We employed the adjusted Rand index (ARI) and normalized mutual information (NMI) to gauge accuracy by comparing the clustering results with the labels provided by the datasets' authors. It is important to note that these metrics do not measure absolute accuracy due to the absence of a gold standard dataset for scRNA-seq. Additionally, we used the element centric measure (ECM) [20] to evaluate cluster stability.

Figure 7.14 The average ARI, NMI, and ECM over the 18 datasets: (a) average ARI, (b) average NMI, and (c) average ECM. Ten random initializations were used to compute CCP, CCP-assisted UMAP, CCP-assisted tSNE, standard UMAP, and standard tSNE for each dataset. Leiden clustering was used to obtain the clustering results.

Figure 7.14 shows the average ARI, NMI, and ECM of CCP-assisted UMAP, CCP-assisted tSNE, UMAP, and tSNE across the 18 datasets. For each dataset, we used 10 random seeds to perform the dimensionality reduction and applied Leiden clustering to generate cluster labels. These labels were then compared to the annotated cell types provided by the original authors. CCP-assisted UMAP demonstrates a 24% improvement in ARI, 15% in NMI, and 17% in ECM over standard UMAP.
Similarly, CCP-assisted tSNE improves standard tSNE by 11% in ARI, 10% in NMI, and 8% in ECM. Notably, both CCP-assisted UMAP and tSNE yield higher ECM scores, indicating that their clustering is more stable. Interestingly, standard tSNE outperforms standard UMAP. However, UMAP's performance heavily relies on accurately finding nearest neighbors, which can be challenging with noisy, sparse, and high-dimensional data. CCP effectively reduces the dimension, enabling UMAP to find neighbors more effectively and resulting in improved visualization.

For a detailed comparison between CCP-assisted, PCA-assisted, and NMF-assisted visualizations, please refer to Section 7B.1.3 in the supporting materials.

7.2.4 Discussion

7.2.4.1 Large Data

While CCP proves to be an efficient dimensionality reduction technique for datasets with a large number of features, such as scRNA-seq data, it may encounter limitations due to the necessity of computing cell-cell correlations for each super-gene. To address this challenge when dealing with larger datasets, we proposed a subsampling approach. Let Z = {z_1, ..., z_M} be the training data used to develop a CCP model, and let Y = {y_1, ..., y_T} be a new dataset or additional data. Using the training data, the gene partitions S_n, the cutoff distances r_c^{S_n}, and the connectivities η^{S_n} are determined. Then, we embed the new data into the trained model, utilizing the following modification of Equation 5.15 to obtain the appropriate super-genes:

x_i^n = \sum_{m=1}^{M} \Phi\big( \| y_i^{S_n} - z_m^{S_n} \| ; \eta_n, \tau, \kappa \big),    (7.1)

We verified the subsampling approach on the GSE84133 human and Qx Spleen data. We combined all four patients' sequencing data into one superset for this analysis. We randomly subsampled 500, 1000, 1500, 2000, 2500, and 3000 samples as training data, and performed the subsampling under 10 random seeds. We projected the testing data using Equation 7.1, followed by Leiden clustering. ARI and NMI were computed, and the average scores are reported in Figure 7.15. Notably, both the GSE84133 human and Qx Spleen datasets exhibited consistent and stable results under varying subsampling sizes. Additionally, we show the CCP-assisted UMAP and tSNE for both datasets when subsampling was utilized. All visualizations were comparable, underscoring the stability of CCP-assisted visualizations even under subsampling.

Figure 7.15 UMAP and tSNE visualization of GSE84133 human and Qx Spleen data under different subsampling sizes. 300 super-genes were generated from CCP, and Leiden clustering was used to obtain the clustering results. (a) ARI and NMI under different subsampling sizes; the left panel shows the ARI and NMI for GSE84133 human, where the four patient datasets were combined, and the right panel shows the ARI and NMI of the Qx Spleen data. (b) CCP-assisted UMAP and tSNE of the GSE84133 human data under different subsampling sizes. (c) CCP-assisted UMAP and tSNE of the Qx Spleen data under different subsampling sizes.

7.2.4.2 Low Variance Genes

We group low-variance (LV) genes into a single LV-gene cluster to enhance the predictive power of the super-genes. By using a high cutoff ratio, we can reduce the number of genes used in the feature partition, potentially resulting in a lower number of super-genes.
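A minimal sketch of how low-variance genes might be separated from the genes entering the feature partition is shown below. For illustration only, the cutoff ratio νc is interpreted here as the fraction of lowest-variance genes pooled into the single LV-gene cluster; the precise definition of νc follows the CCP methodology described earlier in this dissertation and may differ from this simplification.

```python
# Illustrative sketch of splitting genes into an LV-gene cluster and the genes
# used for the CCP feature partition. The interpretation of the cutoff ratio
# nu_c as "the bottom nu_c fraction of genes ranked by variance" is an
# assumption made for this sketch only.
import numpy as np

def split_lv_genes(X, nu_c=0.8):
    """X: cells x genes matrix (log-transformed).
    Returns (indices of partition genes, indices of LV genes)."""
    variances = X.var(axis=0)
    order = np.argsort(variances)          # gene indices, ascending variance
    n_lv = int(nu_c * X.shape[1])          # bottom nu_c fraction -> LV cluster
    return order[n_lv:], order[:n_lv]
```

Under this reading, a larger νc removes more genes from the partition step, which matches the observation above that a high cutoff ratio reduces the number of genes used in the feature partition.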
To assess the impact of the cutoff ratio on the number of super-genes used for UMAP and tSNE visualizations, we conducted tests using the GSE82187 and GSE75748 cell data. The discussion for the GSE75748 cell data can be found in Section 7B.2.1 of the supporting materials. Figure 7.16 shows the effect of varying the number of super-genes and the cutoff ratio on the predictive power and visualization of the GSE82187 data. We utilized 10 random seeds to generate CCP super-genes using different numbers of super-genes and cutoff ratios. Subsequently, Leiden clustering was applied to obtain cluster labels, and the ARI was computed using the cell labels provided by the original authors. Notably, across all cutoff ratios, the ARI increases with the number of super-genes, plateauing at a comparable level around 300 super-genes. This indicates the robustness of the LV-gene approach. Figure 7.16(c) shows the visualization of CCP-assisted UMAP and tSNE at various cutoff ratios. For the visualization, 300 super-genes were utilized, and UMAP and tSNE were applied to the super-genes to reduce the dimension to 2. Samples were then colored according to the cell types provided by the original authors. Note that all the visualizations are comparable, indicating the robustness of the LV-gene approach under different cutoff ratios.

Figure 7.16 Analysis of varying the cutoff ratio νc on the clustering and visualization of the GSE82187 data. (a) ARI of Leiden clustering when the number of super-genes and the cutoff ratio are changed. (b) The number of genes in the LV-gene cluster when νc is changed. (c) The top and bottom rows show the CCP-assisted UMAP and tSNE visualizations, and the columns correspond to νc = 0.6, 0.7, 0.8, 0.9. 300 super-genes were used to initialize UMAP and tSNE, and the samples were colored according to the true cell type.

7.2.5 Conclusion

CCP is a nonlinear data-domain dimensionality reduction technique that leverages gene-gene correlations to partition genes and utilizes cell-cell correlations to generate super-genes. Unlike methods that involve matrix diagonalization, CCP can be directly applied as a primary dimensionality reduction tool to complement traditional visualization techniques such as UMAP and tSNE. In our experiments with 18 datasets, CCP-assisted UMAP and CCP-assisted tSNE visualizations consistently outperformed the original UMAP and tSNE. On average, CCP-assisted UMAP improves the standard UMAP visualization by 24% in ARI and 15% in NMI, and CCP-assisted tSNE improves standard tSNE by 11% in ARI and 10% in NMI. Although the improvement for tSNE visualization is smaller than that for UMAP, tSNE is sensitive to potential outliers and noise, which can render the visualization uninterpretable; CCP-assisted tSNE consistently shows clear visualizations on the 18 datasets we have tested. Additionally, CCP-assisted visualization improves upon PCA-assisted and NMF-assisted visualization on the same 18 datasets. However, CCP comes with some disadvantages. For data with no clear gene-gene correlation, CCP will most likely not perform well. Additionally, although utilizing gene clustering removes the complication of computing distances in high dimensions, when the number of samples becomes large, the cell-cell correlation computation becomes time-consuming. We show that subsampling via a training set is an effective approach to enable CCP to deal with large data.
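To make the subsampling strategy of Section 7.2.4.1 concrete, the sketch below implements the out-of-sample projection of Equation 7.1: each new cell is assigned a super-gene value by summing kernel-weighted distances to the training cells, restricted to the genes in each partition. The kernel form Φ(d; η, τ, κ) = exp(−(d/(τη))^κ) is an assumption based on the generalized exponential kernels used elsewhere in this dissertation, and the released CCP code may parameterize it differently.

```python
# Sketch of the out-of-sample projection in Eq. (7.1): new cells are embedded
# into the trained super-gene space using the gene partitions S_n and scales
# eta_n learned from the training data.
import numpy as np

def phi(d, eta, tau=6.0, kappa=2.0):
    # Assumed generalized exponential kernel; the exact form used by CCP may differ.
    return np.exp(-(d / (tau * eta)) ** kappa)

def project_new_cells(Y, Z, partitions, etas, tau=6.0, kappa=2.0):
    """Y: new cells x genes, Z: training cells x genes,
    partitions: list of gene-index arrays S_n, etas: scale eta_n per partition.
    Returns a new-cells x n_super_genes matrix."""
    X_new = np.zeros((Y.shape[0], len(partitions)))
    for n, (S_n, eta_n) in enumerate(zip(partitions, etas)):
        # Pairwise distances between new and training cells, restricted to S_n.
        diff = Y[:, S_n][:, None, :] - Z[:, S_n][None, :, :]
        dist = np.linalg.norm(diff, axis=2)            # (n_new, n_train)
        X_new[:, n] = phi(dist, eta_n, tau, kappa).sum(axis=1)
    return X_new
```

With a modest training subsample (500 to 3,000 cells, as above), the pairwise distance computation stays small even when the full dataset is large, which is the point of the subsampling approach.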
One possible extension for gene clustering is to incorporate prior information, such as using known genes or known gene regulatory pathways, to guide the clustering. Additionally, CCP can also be employed in many other single-cell contexts, such as spatial transcriptomics and cell-cell communication, and for initializing deep learning methods.

7.2.6 Code and Data availability

All data can be downloaded from the Gene Expression Omnibus [14, 15]. The processing files for these data can be found at https://github.com/hozumiyu/SingleCellDataProcess. CCP is made available through our web server at https://weilab.math.msu.edu/CCP/ or through the source code at https://github.com/hozumiyu/CCP. The code to reproduce this paper is found at https://github.com/hozumiyu/CCP-scRNAseq-UMAP-TSNE.

7.3 Topological Non-Negative Matrix Factorization for Single Cell RNA Sequencing Data

7.3.1 Introduction

In this work, we introduce PL-regularized NMF, namely the topological NMF (TNMF) and the robust topological NMF (rTNMF). Both TNMF and rTNMF can better capture multiscale geometric information than the standard GNMF and rGNMF. To achieve improved performance, the PL is constructed by observing cell-cell interactions at multiple scales through a filtration, creating a sequence of simplicial complexes. We can then view the spectra of each complex associated with the filtration to capture both topological and geometric information. Additionally, we introduce the k-NN based PL to TNMF and rTNMF, referred to as k-TNMF and k-rTNMF, respectively. The k-NN based PL reduces the number of hyperparameters compared to the standard PL algorithm. For the methodology and the algorithm for TNMF, refer to Section 6.3. We apply both the k-NN based and the cutoff-based PL-regularized NMF to scRNA-seq data and evaluate the effectiveness of the clustering using the adjusted Rand index (ARI), normalized mutual information (NMI), purity, and accuracy (ACC). Additionally, the residue-similarity (RS) plot is utilized to visualize the effectiveness of the clustering.

7.3.2 Results

7.3.2.1 Benchmark Data

We performed benchmarks on 12 publicly available datasets. The GEO accession number, reference, organism, number of cell types, number of samples, and number of genes are recorded in Table 7.3. For each dataset, cell types with fewer than 15 cells were removed. The data was log-normalized, and then each cell was normalized to have unit length. For GNMF and rGNMF, k = 8 neighbors were used. For TNMF and rTNMF, 8 filtration values were used to construct the PL, and for each scale, the binary selection ζp = 0, 1 was used. For k-TNMF and k-rTNMF, k = 8 was used with ζp = 0, 1. For each test, nonnegative double singular value decomposition with zeros filled with the average of X (NNDSVDA) was used for the initialization. For the rank, we chose √N, where N is the number of cells. Then k-means clustering was applied to obtain the clustering results.

Table 7.3 GEO accession code, reference, organism type, number of cell types, number of samples, and number of genes of each dataset.
Geo Accession GSE67835 GSE75748 time GSE82187 GSE84133human1 GSE84133human2 GSE84133human3 GSE84133human4 GSE84133mouse1 GSE84133mouse2 GSE57249 GSE64016 GSE94820 Reference Dramanis [5] Chu [6] Gokce [7] Baron [8] Baron [8] Baron [8] Baron [8] Baron [8] Baron [8] Biase [21] Leng [22] Villani [10] Organism Cell type Human Human Mouse Human Human Human Human Mouse Mouse Human Human Human 8 6 8 9 9 9 6 6 8 3 4 5 # of Samples # of Genes 420 758 705 1895 1702 3579 1275 782 1036 49 460 1140 22084 19189 18840 20125 20125 20125 20125 14878 14878 25737 19084 26593 7.3.2.2 Benchmarking PL regularized NMF In order to benchmark persistent Laplacain regularized NMF, we compared our methods to other commonly used NMF methods, namely the GNMF, rGNMF, rNMF and NMF. For a fair comparison, We omitted supervised or semi-supervised methods. For k-rTNMF, rTNMF, k-TNMF, TNMF, GNMF and rGNMF, we set α = 1 for all tests. Table 7.4 shows the ARI values of the NMF methods for the 12 data we have tested. The bold number indicate the highest performance. Figure 7.17 depicts the average ARI value over the 12 datasets for each method. Table 7.4 ARI of NMF methods across 12 datasets. Data GSE67835 GSE64016 GSE75748time GSE82187 GSE84133human1 GSE84133human2 GSE84133human3 GSE84133human4 GSE84133mouse1 GSE84133mouse2 GSE57249 GSE94820 k-rTNMF rTNMF k-TNMF TNMF rGNMF GNMF 0.9109 0.1605 0.5790 0.7577 0.7907 0.9255 0.8361 0.8681 0.7918 0.6957 1.0000 0.5189 0.9391 0.1456 0.6104 0.7558 0.8220 0.9350 0.8447 0.8699 0.7945 0.6808 1.0000 0.5139 0.8533 0.1491 0.6099 0.9809 0.8855 0.9255 0.9181 0.9692 0.7913 0.9331 0.9483 0.5574 0.9306 0.2237 0.5963 0.9676 0.8301 0.9433 0.8625 0.8712 0.8003 0.7005 1.0000 0.4916 0.9236 0.1544 0.6581 0.9815 0.8969 0.9072 0.9179 0.9692 0.7894 0.8689 0.9638 0.5480 0.9454 0.2569 0.6421 0.9877 0.8310 0.9469 0.8504 0.8712 0.8003 0.6953 1.0000 0.6101 rNMF 0.7295 0.1455 0.5969 0.8221 0.7080 0.8930 0.7909 0.8311 0.6428 0.5436 0.9483 0.5440 NMF 0.7314 0.1466 0.5996 0.8208 0.6120 0.8929 0.8089 0.8311 0.6348 0.5470 0.9483 0.5556 216 Figure 7.17 Average ARI of k-rTNMF, rTNMF, k-TNMF,TNMF, rGNMF, GNMF, rNMF and NMF for the 12 datasets. Overall, PL regularized rNMF and NMF have the highest ARI value across all the datasets. k-rTNMF outperforms other NMF methods by at least 0.09 for GSE64016. All PL regularized NMF methods outperform other NMF methods by at least 0.14 for GSE82187. For GSE84133 human 3, both rTNMF and TNMF outperform other methods by 0.07. TNMF improves other methods by more than 0.2 for GSE84133 mouse 2. Lastly, k-rTNMF has the highest ARI value for GSE94820. Moreover, rTNMF improves rGNMF by 0.05, and TNMF improves GNMF by about 0.06. k-TNMF and k-rTNMF also improve GNMF and rGNMF by about 0.03. Table 7.5 shows the NMI values of of the NMF methods for the 12 datasets we have tested. The bold number indicate the highest performance. Figure 7.18 shows the average NMI value over the 12 datasets. 217 Table 7.5 NMI of NMF methods across 12 datasets. 
data GSE67835 GSE64016 GSE75748time GSE82187 GSE84133human1 GSE84133human2 GSE84133human3 GSE84133human4 GSE84133mouse1 GSE84133mouse2 GSE57249 GSE94820 k-rTNMF rTNMF k-TNMF TNMF rGNMF GNMF 0.8858 0.2562 0.6971 0.8754 0.8310 0.9145 0.8357 0.8753 0.8565 0.8129 1.0000 0.6258 0.9235 0.3057 0.7522 0.9759 0.8802 0.9363 0.8500 0.8795 0.8664 0.8218 1.0000 0.7085 0.9104 0.2593 0.7235 0.8802 0.8713 0.9237 0.8439 0.8775 0.8596 0.8005 1.0000 0.6195 0.8607 0.1869 0.7343 0.9668 0.8780 0.9070 0.8677 0.9542 0.8495 0.8713 0.9293 0.6716 0.8999 0.2059 0.7750 0.9691 0.8716 0.8937 0.8718 0.9542 0.8498 0.8355 0.9505 0.6657 0.9107 0.3136 0.7159 0.9298 0.8785 0.9313 0.8577 0.8795 0.8664 0.8299 1.0000 0.6157 rNMF 0.7975 0.1896 0.7227 0.9124 0.8226 0.8835 0.8215 0.8694 0.7634 0.7258 0.9293 0.6624 NMF 0.8017 0.1849 0.7244 0.9117 0.7949 0.8829 0.8260 0.8694 0.7593 0.7272 0.9293 0.6693 Figure 7.18 Average NMI values of k-rTNMF, rTNMF, k-TNMF,TNMF, rGNMF, GNMF, rNMF and NMF for the 12 datasets. Interestingly, k-rTNMF and k-TNMF on average have higher NMI values than rTNMF and TNMF, respectively. However, all PL regularized methods outperform rGNMF, GNMF, rNMF and NMF. Overall, PL regularized methods outperform other methods. Most noticeably, k-rTNMF, rTNMF and TNMF outperform standard NMF methods by 0.06 for GSE82187. Both rTNMF and TNMf outperform rGNMF and GNMF by 0.08 for GSE84133 human 4. Table 7.6 shows the purity values of the NMF methods for the 12 datasets we have tested. The bold number indicate the highest performance. Figure 7.19 shows the average purity over the 12 datasets. 218 Table 7.6 Purity of NMF methods across 12 datasets. data GSE67835 GSE64016 GSE75748time GSE82187 GSE84133human1 GSE84133human2 GSE84133human3 GSE84133human4 GSE84133mouse1 GSE84133mouse2 GSE57249 GSE94820 k-rTNMF rTNMF k-TNMF TNMF 0.9024 0.5013 0.7454 0.9888 0.9382 0.9661 0.9460 0.9882 0.9540 0.9373 0.9796 0.7550 0.9595 0.5846 0.7533 0.9620 0.9536 0.9806 0.9531 0.9427 0.9565 0.9604 1.0000 0.6658 0.9267 0.4913 0.7512 0.9895 0.9357 0.9614 0.9485 0.9882 0.9540 0.9410 0.9857 0.7462 0.9643 0.6048 0.7736 0.9927 0.9543 0.9818 0.9472 0.9427 0.9565 0.9585 1.0000 0.7893 rGNMF GNMF 0.9476 0.9595 0.5398 0.5339 0.7387 0.7553 0.9594 0.9620 0.9187 0.9490 0.9736 0.9777 0.9420 0.9452 0.9420 0.9427 0.9540 0.9552 0.9507 0.9466 1.0000 1.0000 0.6421 0.6421 rNMF 0.8726 0.5080 0.7467 0.9693 0.9189 0.9602 0.9464 0.9412 0.9309 0.9185 0.9796 0.7429 NMF 0.8719 0.5050 0.7455 0.9692 0.9099 0.9600 0.9466 0.9412 0.9299 0.9199 0.9796 0.7531 Figure 7.19 Average purity values of k-rTNMF, rTNMF, k-TNMF,TNMF, rGNMF, GNMF, rNMF and NMF for the 12 datasets. In general, PL-regularized methods achieve higher purity values compared to other NMF methods. Purity measures the maximum intersection between true and predicted classes, which is why we do not observe a significant difference, as seen in ARI and NMI. Furthermore, since purity does not account for the size of a class, and given the imbalanced class sizes in scNRA-seq data, it is not surprising that the purity values are similar. Table 7.7 shows the ACC of the NMF methods for the 12 datasets we have tested. The bold number indicate the highest performance. Figure 7.20 shows the average ACC over the 12 datasets. 219 Table 7.7 ACC of NMF methods across 12 datasets. 
data GSE67835 GSE64016 GSE75748time GSE82187 GSE84133human1 GSE84133human2 GSE84133human3 GSE84133human4 GSE84133mouse1 GSE84133mouse2 GSE57249 GSE94820 k-rTNMF rTNMF k-TNMF TNMF rGNMF GNMF 0.9383 0.4537 0.7241 0.8514 0.8364 0.9177 0.8228 0.8816 0.8542 0.8155 1.0000 0.6107 0.9595 0.4891 0.7355 0.8512 0.8889 0.9224 0.8498 0.8824 0.8555 0.7903 1.0000 0.6088 0.9000 0.4746 0.6917 0.9888 0.9088 0.9447 0.9419 0.9882 0.8542 0.9305 0.9796 0.7201 0.9595 0.5502 0.7414 0.9599 0.8974 0.9242 0.8597 0.8831 0.8581 0.8263 1.0000 0.6482 0.9243 0.4870 0.7438 0.9895 0.9194 0.9069 0.9456 0.9882 0.8542 0.9101 0.9857 0.7119 0.9643 0.5700 0.7565 0.9927 0.8973 0.9260 0.8539 0.8831 0.8581 0.8232 1.0000 0.7533 rNMF 0.8357 0.4691 0.6873 0.8896 0.7988 0.8998 0.8032 0.8847 0.7361 0.7239 0.9796 0.7091 NMF 0.8364 0.4759 0.6875 0.8889 0.7370 0.8994 0.8178 0.8847 0.7311 0.7294 0.9796 0.7189 Figure 7.20 Average ACC of k-rTNMF, rTNMF, k-TNMF,TNMF, rGNMF, GNMF, rNMF and NMF for the 12 datasets. Once again, we see that PL regularized methods have higher ACC than other NMF methods. RTNMF and TNMF improves rGNMF and GNMF by 0.05, and k-rTNMF and k-TNMF improves rGNMF and GNMF by 0.04. We see an improvement in ACC for both k-rTNMF and k-TNMF for GSE64016. All 4 PL regularized methods improve ACC of GSE82187 by 0.1. RTNMF and TNMF improve GSE84133 mouse 2 by at least 0.1 as well. 7.3.2.3 Overall performance Figure 7.21 shows the average ARI, NMI, purity and ACC of k-rTNMF, rTNMF, k-TNMF, TNMF, rGNMF, GNMF, rNMF, NMF across 10 datasets. All PL regularized NMF methods 220 Figure 7.21 Average ARI, NMI, purity and ACC of k-rTNMF, rTNMF, k-TNMF, TNMF, rGNMF, GNMF, rNMF, NMF across 12 datasets. outperform the traditional rGNMF, GNMF, rNMF and NMF. Both rTNMF and TNMF have higher average ARI and purity than the k-NN based PL counterparts. However, k-rTNMF and k-TNMF have higher average NMI than rTNMF and TNMF, respectively. k-rTNMF has a significantly higher purity than other methods. 7.3.3 Discussion 7.3.3.1 Visualization of meta-genes based UMAP and t-SNE Both UMAP and t-SNE are well-known for their effectiveness in visualization. However, these methods may not perform as competitively in clustering or classification tasks. Therefore, it is beneficial to employ NMF-based methods to enhance the visualization capabilities of UMAP and t-SNE. In this process, we generate meta-genes and subsequently utilize UMAP or t-SNE to further reduce the data to 2 dimensions for visualization. For a dataset with M cells, the number of meta-genes will be the integer value of √ M. To compare the standard UMAP and t-SNE plots with the top-NMF-assisted and top-rNMF-assisted UMAP and t-SNE visualizations, we used the default settings of the Python implementation of UMAP and the Scikit-learn implementation of t-SNE. For unassisted UMAP and t-SNE, we first removed low-abundance genes and performed log-transformation before applying UMAP and t-SNE. 221 Figure 7.22 shows the visualization of PL regularized NMF methods through UMAP. Each row corresponds to GSE67835, GSE75748 time, GSE94820 and GSE84133 mouse 2 data. The columns from left to right are the k-rTNMF assisted UMAP, rTNMF assisted UMAP, k-TNMF assisted UMAP, TNMF assisted UMAP and UMAP visualization. Samples were colored according to their true cell types. Figure 7.22 Visualization of top-NMF and top-rNMF meta-genes through UMAP. Each row cor- responds to GSE67835, GSE75748 time, GSE94820 and GSE84133 mouse 2 data. 
The columns from left to right are the k-rTNMF assisted UMAP, rTNMF assisted UMAP, k-TNMF assisted UMAP, TNMF assisted UMAP, and UMAP visualization. Samples were colored according to their true cell types.

Figure 7.23 shows the visualization of PL-regularized NMF through t-SNE. Each row corresponds to the GSE67835, GSE75748 time, GSE94820, and GSE84133 mouse 2 data. The columns from left to right are the k-rTNMF assisted t-SNE, rTNMF assisted t-SNE, k-TNMF assisted t-SNE, TNMF assisted t-SNE, and t-SNE visualization. Samples were colored according to their true cell types.

Figure 7.23 Visualization of top-NMF and top-rNMF meta-genes through t-SNE. Each row corresponds to the GSE67835, GSE75748 time, GSE94820, and GSE84133 mouse 2 data. The columns from left to right are the k-rTNMF assisted t-SNE, rTNMF assisted t-SNE, k-TNMF assisted t-SNE, TNMF assisted t-SNE, and t-SNE visualization. Samples were colored according to their true cell types.

We see a considerable improvement in both the top-NMF assisted and top-rNMF assisted UMAP and t-SNE visualizations.

GSE67835 In the assisted UMAP and t-SNE visualizations of GSE67835, we observe more distinct clusters, including a supercluster of fetal quiescent (Fetal-Q) and fetal replicating (Fetal-R) cells. Darmanis et al. [5] conducted a study that involved obtaining differential gene expression data for human adult brain cells and sequencing fetal brain cells for comparison. It is not surprising that the undeveloped Fetal-Q and Fetal-R cells do not exhibit significant differences and cluster together.

GSE75748 time In the GSE75748 time data, Chu et al. [6] sequenced human embryonic stem cells at times 0hr, 12hr, 24hr, 36hr, 72hr, and 96hr under hypoxic conditions to observe differentiation. In unassisted UMAP and t-SNE, although some clustering is visible, there is no clear separation between the clusters. Additionally, two subclusters of 12hr cells are observed. Notably, in the PL-regularized assisted UMAP and t-SNE visualizations, there is a distinct supercluster comprising the 72hr and 96hr cells, while cells from different time points form their own separate clusters. This finding aligns with Chu's observation that there was no significant difference between the 72hr and 96hr cells, suggesting that differentiation may have already occurred by the 72hr mark.

GSE94820 Notice that in both the unassisted t-SNE and UMAP, although there is a boundary, the cells do not form distinct clusters. This lack of distinct clustering can pose challenges for many clustering and classification methods. On the other hand, all PL-regularized NMF methods result in distinct clusters. Among the PL-regularized NMF approaches, the cutoff-based PL methods, rTNMF and TNMF, form a single CD1C+ (CD1C1) cluster, whereas the k-NN induced PL methods, k-rTNMF and k-TNMF, exhibit two subclusters. Villani et al. [10] previously noted the similarity in the expression profiles of CD1C1–CD141– (DoubleNeg) cells and monocytes. PL-regularized NMF successfully differentiates between these two types.

GSE84133 mouse 2 PL-regularized NMF yields significantly more distinct clusters compared to unassisted UMAP and t-SNE. Notably, the beta and gamma cells form distinct clusters with PL-regularized NMF. Additionally, when PL-regularized NMF is applied to assist UMAP, potential outliers within the beta cell population become visible. Baron et al. [8] previously highlighted heterogeneity within the beta cell population, and we observe potential outliers in all visualizations.
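Before turning to the RS analysis, it may help to spell out the benchmark protocol of Section 7.3.2 for the plain NMF baseline in code: log-normalization, unit-length scaling of each cell, NMF with NNDSVDA initialization and rank √N, k-means clustering with one cluster per annotated cell type, and an ACC score obtained by mapping cluster labels to cell types with the Hungarian algorithm. This sketch omits the persistent Laplacian regularization that distinguishes TNMF and its variants, and the log1p transform and iteration limit are assumptions rather than the exact benchmark settings.

```python
# Sketch of the NMF baseline protocol (without the persistent Laplacian penalty):
# log-normalize, scale cells to unit length, factorize with rank = sqrt(N) and
# NNDSVDA initialization, cluster the cell factors with k-means, and score ACC
# by mapping clusters to annotated cell types via the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize

def nmf_baseline_acc(counts, true_labels, seed=0):
    X = np.log1p(counts)                          # log-normalization (assumed log1p)
    X = normalize(X, norm="l2", axis=1)           # each cell scaled to unit length
    rank = int(np.sqrt(X.shape[0]))               # rank = sqrt(number of cells)
    W = NMF(n_components=rank, init="nndsvda", max_iter=500,
            random_state=seed).fit_transform(X)   # cells x rank factor
    k = len(np.unique(true_labels))               # one cluster per annotated cell type
    clusters = KMeans(n_clusters=k, n_init=30, random_state=seed).fit_predict(W)

    # Hungarian mapping from k-means clusters to cell types (assumes k clusters).
    types = np.unique(true_labels)
    C = np.array([[np.sum((np.asarray(true_labels) == t) & (clusters == c))
                   for c in range(k)] for t in types])
    row, col = linear_sum_assignment(-C)          # maximize matched samples
    mapping = {c: types[r] for r, c in zip(row, col)}
    mapped = np.array([mapping[c] for c in clusters])
    return float((mapped == np.asarray(true_labels)).mean())
```

ARI, NMI, and purity are computed from the same cluster labels with the corresponding scikit-learn metrics.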
7.3.3.2 RS analysis

Although UMAP and t-SNE are excellent tools for visualizing clusters, they may struggle to capture heterogeneity within clusters. Moreover, these methods can be less effective when dealing with a large number of classes. Therefore, it is essential to explore alternative visualization techniques.

Figure 7.24 RS plots of the GSE67835 data. The columns from left to right correspond to k-rTNMF, rTNMF, k-TNMF, and TNMF. Each row corresponds to a cell type. For each panel, the x-axis and y-axis correspond to the S-score and R-score, respectively. K-means was used to obtain the cluster labels, and the Hungarian algorithm was used to map the cluster labels to the true labels. Each sample was colored according to its true label.

In our approach, we visualize each cluster using RS plots as described in subsection 4.1.1. RS plots depict the relationship between the residue score (R-score) and similarity score (S-score) and have proven useful in various applications for visualizing data with multiple class types [23, 24, 25, 26, 27]. Figure 7.24 shows the RS plots of the PL-regularized NMF methods for the GSE67835 data. The columns from left to right correspond to k-rTNMF, rTNMF, k-TNMF, and TNMF, while the rows correspond to the cell types. The x-axis and y-axis represent the S-score and R-score for each sample, respectively. The samples are colored according to their predicted cell types. Predictions were obtained using k-means clustering, and the Hungarian algorithm was employed to find the optimal mapping from the cluster labels to the true cell types. We can see that TNMF fails to identify OPC cells, whereas k-rTNMF, rTNMF, and k-TNMF are able to identify OPC cells. Notably, the S-score is quite low, indicating that the OPC cells did not form a cluster for TNMF. For fetal quiescent and fetal replicating cells, k-rTNMF correctly identifies these two types, and the few misclassified samples are located on the boundaries. rTNMF is able to correctly identify fetal replicating cells but could not distinguish fetal quiescent cells from fetal replicating cells. The S-score is low for neurons in both rTNMF and TNMF, which shows a direct correlation with the number of misclassified cells.

7.3.4 Conclusion

Persistent Laplacian-regularized NMF is a dimensionality reduction technique that incorporates multiscale topological interactions between the cells. Traditional graph Laplacian-based regularization only represents a single scale and cannot capture the multiscale features of the data. We have also shown that the k-NN induced persistent Laplacian outperforms other NMF methods and is comparable to the cutoff-based persistent Laplacian-regularized NMF methods. However, PL methods do come with downsides. In particular, the weights for each filtration must be determined prior to the reduction. If there are T filtrations, then the hyperparameter space has size (T + 1)^T. However, the k-NN induced PL reduces the hyperparameter space to 2^T. In addition, we have shown that we can achieve a significant improvement even if we limit the hyperparameter space to 2^T. We would like to further explore possible parameter-free versions of topological NMF. Additionally, NMF objectives are not globally convex, but we have shown that with NNDSVDA initialization, our methods perform the best. One possible extension to the proposed methods is to incorporate higher-order persistent Laplacians in the regularization framework, which will reveal higher-order interactions.
In addition, we would like to expand the ideas to tensor decomposition, such as Canonical Polyadic Decomposition (CPD) and Tucker decomposition, multimodal omics data, and spatial transcriptomics data. 7.3.5 Data availability and code The data and model used to produce these results can be obtained at https://github.com/hozumiyu/TopologicalNMF-scRNAseq. 228 BIBLIOGRAPHY [1] Yuta Hozumi, Rui Wang, and Guo-Wei Wei. Ccp: Correlated clustering and projection for dimensionality reduction. arXiv preprint arXiv:2206.04189, 2022. [2] Kelin Xia, Kristopher Opron, and Guo-Wei Wei. Multiscale multiphysics and multidomain models—flexibility and rigidity. The Journal of chemical physics, 139(19):11B614_1, 2013. [3] Qiaolin Deng, Daniel Ramsköld, Björn Reinius, and Rickard Sandberg. Single-cell rna- seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science, 343(6167):193–196, 2014. [4] Monika S Kowalczyk, Itay Tirosh, Dirk Heckl, Tata Nageswara Rao, Atray Dixit, Brian J Haas, Rebekka K Schneider, Amy J Wagers, Benjamin L Ebert, and Aviv Regev. Single-cell rna-seq reveals changes in cell cycle and differentiation programs upon aging of hematopoietic stem cells. Genome research, 25(12):1860–1872, 2015. [5] Spyros Darmanis, Steven A Sloan, Ye Zhang, Martin Enge, Christine Caneda, Lawrence M Shuer, Melanie G Hayden Gephart, Ben A Barres, and Stephen R Quake. A survey of human brain transcriptome diversity at the single cell level. Proceedings of the National Academy of Sciences, 112(23):7285–7290, 2015. [6] Li-Fang Chu, Ning Leng, Jue Zhang, Zhonggang Hou, Daniel Mamott, David T Vereide, Jeea Choi, Christina Kendziorski, Ron Stewart, and James A Thomson. Single-cell rna-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome biology, 17:1–20, 2016. [7] Ozgun Gokce, Geoffrey M Stanley, Barbara Treutlein, Norma F Neff, J Gray Camp, Robert C Malenka, Patrick E Rothwell, Marc V Fuccillo, Thomas C Südhof, and Stephen R Quake. Cellular taxonomy of the mouse striatum as revealed by single-cell rna-seq. Cell reports, 16(4):1126–1137, 2016. [8] Maayan Baron, Adrian Veres, Samuel L Wolock, Aubrey L Faust, Renaud Gaujoux, Amedeo Vetere, Jennifer Hyoje Ryu, Bridget K Wagner, Shai S Shen-Orr, Allon M Klein, et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter-and intra-cell population structure. Cell systems, 3(4):346–360, 2016. [9] Gaëlle Breton, Shiwei Zheng, Renan Valieris, Israel Tojal da Silva, Rahul Satija, and Michel C Nussenzweig. Human dendritic cells (dcs) are derived from distinct circulating precursors that are precommitted to become cd1c+ or cd141+ dcs. Journal of Experimental Medicine, 213(13):2861–2870, 2016. [10] Alexandra-Chloé Villani, Rahul Satija, Gary Reynolds, Siranush Sarkizova, Karthik Shekhar, James Fletcher, Morgane Griesbeck, Andrew Butler, Shiwei Zheng, Suzan Lazo, et al. Single- cell rna-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. 229 Science, 356(6335):eaah4573, 2017. [11] Pengzhi Yu, Guangjin Pan, Junying Yu, and James A Thomson. Fgf2 sustains nanog and switches the outcome of bmp4-induced human embryonic stem cell differentiation. Cell stem cell, 8(3):326–334, 2011. [12] Adam Rodaway, Hiroyuki Takeda, Sumito Koshida, Joanne Broadbent, Brenda Price, James C Induction of the mesendoderm in the zebrafish Smith, Roger Patient, and Nigel Holder. germ ring by yolk cell-derived tgf-beta family signals and discrimination of mesoderm and endoderm by fgf. 
Development, 126(14):3067–3078, 1999. [13] Shinsuke Tada, Takumi Era, Chikara Furusawa, Hidetoshi Sakurai, Satomi Nishikawa, Masaki Kinoshita, Kazuki Nakao, Tsutomu Chiba, and Shin-Ichi Nishikawa. Characterization of mesendoderm: a diverging point of the definitive endoderm and mesoderm in embryonic stem cell differentiation culture. Development, 2005. [14] Ron Edgar, Michael Domrachev, and Alex E Lash. Gene expression omnibus: Ncbi gene expression and hybridization array data repository. Nucleic acids research, 30(1):207–210, 2002. [15] Tanya Barrett, Stephen E Wilhite, Pierre Ledoux, Carlos Evangelista, Irene F Kim, Maxim Tomashevsky, Kimberly A Marshall, Katherine H Phillippy, Patti M Sherman, Michelle Holko, et al. Ncbi geo: archive for functional genomics data sets—update. Nucleic acids research, 41(D1):D991–D995, 2012. [16] Liang Chen, Weinan Wang, Yuyao Zhai, and Minghua Deng. Deep soft k-means cluster- ing with self-training for single-cell rna sequence data. NAR genomics and bioinformatics, 2(2):lqaa039, 2020. [17] Nicholas Schaum, Jim Karkanias, Norma F Neff, Andrew P May, Stephen R Quake, Tony Wyss-Coray, Spyros Darmanis, Joshua Batson, Olga Botvinnik, Michelle B Chen, et al. Single-cell transcriptomics of 20 mouse organs creates a tabula muris: The tabula muris consortium. Nature, 562(7727):367, 2018. [18] Mauro J Muraro, Gitanjali Dharmadhikari, Dominic Grün, Nathalie Groen, Tim Dielen, Erik Jansen, Leon Van Gurp, Marten A Engelse, Francoise Carlotti, Eelco Jp De Koning, et al. A single-cell transcriptome atlas of the human pancreas. Cell systems, 3(4):385–394, 2016. [19] Roman A Romanov, Amit Zeisel, Joanne Bakker, Fatima Girach, Arash Hellysaz, Raju Tomer, Alan Alpar, Jan Mulder, Frederic Clotman, Erik Keimpema, et al. Molecular interrogation of hypothalamic organization reveals distinct dopamine neuronal subtypes. Nature neuroscience, 20(2):176–188, 2017. [20] Alexander J Gates, Ian B Wood, William P Hetrick, and Yong-Yeol Ahn. Element-centric clustering comparison unifies overlaps and hierarchy. Scientific reports, 9(1):8574, 2019. 230 [21] Fernando H Biase, Xiaoyi Cao, and Sheng Zhong. Cell fate inclination within 2-cell and 4-cell mouse embryos revealed by single-cell rna sequencing. Genome research, 24(11):1787–1796, 2014. [22] Ning Leng, Li-Fang Chu, Chris Barry, Yuan Li, Jeea Choi, Xiaomao Li, Peng Jiang, Ron M Stewart, James A Thomson, and Christina Kendziorski. Oscope identifies oscillatory genes in unsynchronized single-cell rna-seq experiments. Nature methods, 12(10):947–950, 2015. [23] Yuta Hozumi, Kiyoto A Tanemura, and Guo Wei Wei. Preprocessing of single cell rna sequencing data using correlated clustering and projection. Journal of chemical Information and Modeling, accepted, 2023. [24] Hongsong Feng and Guo-Wei Wei. Virtual screening of drugbank database for herg block- ers using topological laplacian-assisted ai models. Computers in biology and medicine, 153:106491, 2023. [25] Zailiang Zhu, Bozheng Dou, Yukang Cao, Jian Jiang, Yueying Zhu, Dong Chen, Hongsong Feng, Jie Liu, Bengong Zhang, Tianshou Zhou, et al. Tidal: Topology-inferred drug addiction learning. Journal of Chemical Information and Modeling, 63(5):1472–1489, 2023. [26] Li Shen, Hongsong Feng, Yuchi Qiu, and Guo-Wei Wei. Svsbi: sequence-based virtual screening of biomolecular interactions. Communications Biology, 6(1):536, 2023. [27] Sean Cottrell, Rui Wang, and Guowei Wei. PCA for microarray data analysis. doi.org/10.1021/acs.jcim.3c01023, 2023. 
PLPCA: Persistent Laplacian enhanced- Journal of Chemical Information and Modeling, 231 APPENDIX 7A ADDITIONAL RESULTS FOR PREPROCESSING SINGLE CELL RNA SEQUENCING DATA USING CORRELATED CLUSTERING AND PROJECTION 7A.1 Methods 7A.1.1 Program and Packages Python v3.8.5 was utilized for the benchmark, and the following Python packages were uti- lized: NumPy v1.19.2, Scikit-learn v0.23.2. For visualization, Plotly v5.9.0 and Matplotlib v2.5.2 were used. For preprocessing of the scRNA-seq data, instructions can be found on GitHub at https://github.com/hozumiyu/SingleCellDataProcess. 7A.1.2 Benchmark Protocol Before applying dimensionality reduction, a log-transform was performed on the data. PCA and CCP were then applied to the data to generate 50, 100, 150, 200, 250 and 300 components. For CCP, τ = 6.0 and κ = 2.0 was used for the exponential kernel to generate super-genes. For each reduction, 20 random initializations were used to ensure robustness and stability. For clustering, k-means was used, where k was chosen as the number of cell or gene types, provided by the data, indicated in Table 7.1. A total of 30 random initializations was used to compute the clusters, and for each clustering result, ARI, NMI, Silhouette score, and RSI were computed. For classification, the support vector machine (SVM) with default parameters from Scikit- learn’s package was used. We performed 5-fold cross validation, where 4 parts were used for training the SVM and 1 part was used for testing. Ten random seeds were used for each reduction. Since the number of cells within each class can differ greatly in scRNA-seq data, we modified the training and testing sets. We removed any cell type with less than 5 cells and for the training set, we randomly sampled up to five times the average number of samples per cell type to ensure a balanced training set. We computed the balanced accuracy and RSI for each of the classification results. 7A.2 Results 7A.2.1 Additional benchmark In this section, we compare the performance of CCP and PCA on datasets that were not intro- duced in the main text. More details on the benchmark can be found in Section subsubsection 7.1.2.1 of the main text. Figure 7A.1 shows the performance of CCP and PCA on three datasets: GSE45719, GSE75748 cell, and GSE82187 data. The solid and dashed lines correspond to CCP and PCA, respectively, and the red and blue lines correspond to the ARI and NMI, respectively. For CCP, τ = 6.0 and κ = 2.0 were used for the exponential kernel. For all three datasets, as the number of components increases, ARI and NMI either increase or remain stable for CCP. On the other hand, PCA shows instability in terms of ARI and NMI as the number of components increases. PCA has higher ARI 232 and NMI values when the number of components is 50, but in higher numbers of dimensions, PCA performs worse. Figure 7A.1 ARI and NMI of the clustering results of CCP and PCA on GSE45719, GSE75748 cell. The red and blue lines correspond to CCP and PCA, respectively. For CCP, all the tests utilize τ = 6 and κ = 2 for the exponential kernels. The x-axis shows the number of components in the reduction. Figure 7A.2 shows the performance of CCP and PCA on the 6 datasets, GSE84133h1, GSE84133h2, GSE84133h3 GSE84133h4, GSE84133m1, and GSE84133m2. The solid and dashed lines correspond to CCP and PCA, respectively, and the red and blue lines correspond to the ARI and NMI, respectively. 
For GSE84133h1, GSE84133h2, and GSE84133h4 data, CCP outperforms PCA when the number of components is larger than 150 for both ARI and NMI. CCP is able to maintain its performance throughout higher dimensions, whereas PCA has a notable drop in its ARI and NMI values when the number of components is larger than 150. For GSE84133h3, PCA outperforms CCP until 300 components. We found that GSE84133h3 data is the only data, where PCA’s ARI increased at higher dimensions. For the mouse datasets, GSE84133m1 and GSE84133m2, CCP outperforms PCA when the number of components is higher than 50. There is a slight decrease in ARI and NMI beyond 100 components for CCP, but it is much more stable than PCA. 233 Figure 7A.2 ARI and NMI of the clustering results of CCP and PCA on GSE84133h1, GSE84133h2, GSE84133h3, GSE84133h4, GSE84133m1, and GSE84133m2. The red and blue lines correspond to CCP and PCA, respectively. For CCP, all the test utilize τ = 6 and κ = 2 for the exponential kernels. The x-axis shows the number of components in the reduction. Figure 7A.3 shows the performance of CCP and PCA on the 2 datasets, GSE89232 and GSE94820. The solid and dashed lines correspond to CCP and PCA, respectively, and the red and blue lines correspond to the ARI and NMI, respectively. For GSE94820, CCP outperforms PCA when the number of components is greater than 50. Both ARI and NMI improve until about 200 components. On the other hand, PCA has a notable drop in its performance from 100 components to 150 components, where the ARI and NMI drop to below 0.1, indicating very low clustering accuracy. CCP has poor performance for GSE89232 at all components tested. Both ARI and NMI are less than 0.2, indicating poor clustering accuracy. Similar to other data, PCA has stability concerns at higher dimensions, where both ARI and NMI become less than 0.1. Figure 7A.3 ARI and NMI of the clustering results of CCP and PCA on GSE89232 and GSE94820. The red and blue lines correspond to CCP and PCA, respectively. For CCP, all the test utilize τ = 6 and κ = 2 for the exponential kernels. The x-axis shows the number of components in the reduction. 234 In order to understand the poor performance of CCP on GSE89232, the RSI was computed on the gene clustering result. Figure 7A.4 shows the RSI plot and t-SNE visualization of GSE89232 and GSE94820 genes. RSI was computed by varying the number of clusters, where the x-axis shows the number of clusters, and the green, red, and blue lines correspond to the RSI, RI, and SI of the gene clustering results. The RSI stabilizes starting at 16 clusters and gradually decreases as the number of clusters increases. Notice that even though the RSI peaks at 16 clusters, the RI and SI continue to increase, indicating that the clustering may be improving. In addition, t-SNE was used to obtain a 2-dimensional (2D) visualization of the gene types. For GSE94820, k = 64 was used for the k-means clustering. Seven gene clusters with the largest number of genes were colored, and the rest were colored in green. Notice that purple genes are spread out, indicating that the purple genes do not cluster well. For GSE89232, the RSI is high at a low number of clusters. Compared to the RI and SI of GSE94820, the RI and SI are relatively high at a low number of gene clusters. This may indicate that the clustering is better at a lower number of clusters. In addition, t-SNE was used to visualize the genes with k = 8. The blue and orange genes are mixed, but they are located at the same general location. 
The other genes are clustered nicely. This indicates that the optimal number of clusters may be smaller than 8, which suggests that the intrinsic dimensionality of GSE89232 is low. Such a low intrinsic dimensionality is unfavorable for CCP, which may explain its poor performance for this data. 235 (a) GSE94820 (b) GSE89232 Figure 7A.4 RI, SI, and RSI of the gene clustering of GSE94820 and GSE89232. k-means clustering was performed with 2, 4, 8, 16, 32, 64, and 128 gene clusters. For each number of clusters, 10 random initializations were utilized, and the average of RI, SI, and RSI was obtained. The red, blue, and green lines correspond to RI, SI, and RSI, respectively. t-SNE was used to visualize the genes in 2D. For GSE94820, k = 64 was used for k-means clustering, and 7 gene clusters were colored according to their cluster label, while the rest of the genes were colored in green. For GSE89232, k = 8 was used for k-means clustering, and the genes were colored according to their gene cluster labels. 7A.2.2 Additional Residue-Similarity Index comparison Figure 7A.5 shows performance comparison between CCP and PCA in classification and cluster- ing problems using GSE45719, GSE59114, and GSE89232 data. Top row shows the classification results, where 5-fold cross validation and SVM were used. The bottom row shows the clustering results, where k-means was used to obtain the labels. For RSI, cluster labels obtained from k-means and from the true cell types were used. 236 Figure 7A.5 Comparison of RSI on classification and clustering results of GSE45719, GSE59114, and GSE89232 data. The top row shows the classification results, where the solid and dashed lines correspond to the CCP and PCA results, respectively. The x-axis represents the number of components in the dimensionality reduction. The 5-fold cross-validation was used to evaluate the classification performance using SVM. The bottom row shows the clustering results of the GSE45719, GSE59114, and GSE89232 data. For RSI, the cluster labels obtained from k-means and the true cell labels provided by the authors were utilized. The solid and dashed lines correspond to the CCP and PCA results, respectively. There is a noticeable correlation between RSI and BA in the classification results of GSE45719 and GSE89232. PCA’s BA decreases in all three datasets. For GSE59114, CCP’s result improves as the number of components increases, but is not as good as PCA’s result. We can see from the ARI and NMI of GSE59114 that these two values are also low. In addition, from Figure 7A.4, we can see that the gene clustering is poor for GSE59114, which may have caused the poor performance. As for the clustering results, it is evident that the Silhouette score does not correlate with the other metrics used. Interestingly, the RSI using the true cell types and the cluster labels are similar for all the results, which means RSI is a good index for clustering when there are no true labels. Figure 7A.6 shows performance comparison between CCP and PCA in classification and clustering problems using GSE84133h1, GSE84133h2, and GSE84133h3 data. Top row shows the classification results, where 5-fold cross validation and SVM were used. The bottom row shows the clustering results, where k-means was used to obtain the labels. For RSI, cluster labels obtained from k-means and from the true cell types were used. 237 Figure 7A.6 Comparison of RSI on classification and clustering results of GSE84133h1, GSE84133h2, and GSE84133h3 data. 
The top row shows the classification results, where the solid and dashed lines correspond to CCP and PCA results, respectively. The x-axis represents the number of components in the dimensionality reduction. The classification results were obtained using SVM and verified using 5-fold cross validation. The bottom row shows the clustering results of the GSE45719, GSE59114, and GSE89232 datasets. For RSI, cluster labels obtained from k-means and the true cell labels provided by the data were utilized. The solid and dashed lines correspond to CCP and PCA results, respectively. For classification result of GSE84133h1 and GSE84133h2, CCP outperforms PCA after 100 components, whereas in GSE84133h3, PCA outperforms PCA. However, for all 3 datasets, PCA’s BA decreases as the number of components increases. In addition, RSI correlates with BA in all 3 datasets for CCP. For clustering results, Silhouette score shows no relation with ARI, NMI, or RSI. RSI using the true cell types and cluster labels correlates for all 3 datasets. Similar to the classification results, CCP’s ARI and NMI of GSE84133h3 are lower than those of PCA’s for most of the dimensions. Figure 7A.7 shows performance comparison between CCP and PCA in classification and clustering problems using GSE84133h1, GSE84133h2, and GSE84133h3 data. Top row shows the classification results, where 5-fold cross validation and SVM were used. The bottom row shows the clustering results, where k-means was used to obtain the labels. For RSI, cluster labels obtained from k-means and from the true cell types were used. 238 Figure 7A.7 Comparison of RSI on classification and clustering results of GSE84133h4, GSE84133m1, and GSE84133m2 data. Top row shows the classification results, where the solid and dashed lines correspond to CCP and PCA results, respectively. The x-axis is the number of components in the dimensionality reduction. The 5-fold cross validation was used to verify the classification result obtained from using SVM. The bottom row shows the clustering results of the GSE45719, GSE59114, and GSE89232 data. For RSI, the cluster labels obtained from k-means and from the true cell labels provided by the data were utilized. The solid and dashed lines correspond to CCP and PCA results, respectively. For all three datasets, CCP outperforms PCA in classification accuracy when the number of components is larger than 150. CCP’s RSI of classification results correlates with BA, whereas PCA’s RSI shows less correlation with BA. In addition, for all three datasets, PCA’s accuracy decreases as the number of components increases, showing instability in the reduction algorithm. The Silhouette score does not show any correlation with the other metrics. RSI using the true cell types and cluster labels correlates for all three datasets. Lastly, ARI and NMI follow a similar trend as BA, where CCP outperforms PCA at higher numbers of components. Figure 7A.8 shows performance comparison between CCP and PCA in classification and clustering problems using GSE75748 cell and GSE94820 data. Top row shows classification results, where 5-fold cross validation and SVM were used. The bottom row shows the clustering results, where k-means was used to obtain the labels. For RSI, cluster labels obtained from k-means and from the true cell types were used. 239 Figure 7A.8 Comparison of RSI on classification and clustering results of GSE75748 cell and GSE94820 data. 
Top row shows the classification results, where the solid and dashed lines correspond to CCP and PCA results, respectively. The x-axis is the number of components in the dimensionality reduction. The 5-fold cross validation was used to verify the classification results obtained from using SVM. The bottom row shows the clustering results of the GSE45719, GSE59114, and GSE89232 data. For RSI, the cluster labels obtained from k-means and from the true cell labels provided by the data were utilized. The solid and dashed lines correspond to CCP and PCA results, respectively. For all three datasets, CCP outperforms PCA in classification accuracy when the number of components is larger than 150. RSI of CCP’s classification results correlates with BA, whereas PCA’s RSI does not show any correlation. In addition, for all three datasets, PCA’s accuracy decreases as the number of components increases, showing instability in the reduction algorithm. The Silhouette score does not show any correlation with the other metrics. RSI using the true cell types and cluster labels correlates for all three datasets. Lastly, ARI and NMI follow a similar trend as BA, where CCP outperforms PCA at higher numbers of components. 240 APPENDIX 7B ADDITIONAL MATERIALS FOR ANALYZING SCRNA-SEQ DATA BY CCP-ASSISTED UMAP AND T-SNE 7B.1 Results 7B.1.1 Data statistics Table 7B.1 show basic statistical analysis of the dataset, namely the sparsity (number of zero expression), max, mean and median expression, and the median sum of the cell expression. For the mean and the median expression, we considered all nonzero expression values. Table 7B.1 Accession ID, source organism, and the counts for samples, genes, and cell types for fourteen individual datasets. Dataset GSE75748cell [6] GSE75748time [6] GSE82187 [7] GSE67835 [5] GSE84133 H1 [8] GSE84133 H2 [8] GSE84133 H3 [8] GSE84133 H4 [8] GSE84133 M1 [8] GSE84133 M2 [8] GSE84133 human [8] Muraro [18] Romanov [19] Qx Bladder [17] Qx Limb Muscle [17] Qx Spleen [17] Qs Diaphragm [17] Qs Limb Muscle [17] Qs Lung [17] Qs Trachea [17] Size (cells x genes) 1018 x 19097 758 x 19189 705 x 18840 420 x 22084 1937 x 20125 1724 x 20125 3605 x 20125 1303 x 20125 822 x 14878 1064 x 14878 8569 x 20125 2122 x 19046 2881 x 21143 2500 x 23341 3909 x 23341 9552 x 23341 870 x 23341 1090 x 23341 1676 x 23341 1350 x 23341 Sparsity 49.64 54.69 77.72 81.40 90.44 90.59 91.30 89.05 90.48 87.81 90.62 73.02 85.92 86.94 93.57 94.34 91.35 89.47 89.08 85.48 Max 605598.59 165308.35 5.60 58272.00 4318.00 3476.00 3071.00 4234.00 3477.00 3656.00 4318.00 4501.60 2571.46 4640.55 1124.88 1665.24 481734.98 682914.40 300415.15 730410.52 Mean Median Median Row Sum 470.98 153.61 1.70 136.71 3.02 2.70 3.35 3.05 2.64 3.28 3.09 3.52 2.48 3.64 2.74 2.46 247.92 294.23 245.66 220.08 4404343.22 1306562.14 7086.69 505256.00 5346.00 4889.50 4742.00 6017.00 2936.50 5631.00 5075.00 18076.53 7373.00 11102.50 4110.00 3244.50 500405.50 723268.50 626066.50 745632.50 75.00 27.00 1.73 27.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.60 1.26 1.27 1.17 1.07 70.45 78.39 67.56 61.81 7B.1.2 Comparison of CCP, PCA and NMF assisted visualization In this section, we show the effectiveness of CCP as an initialization of UMAP and tSNE by comparing it to PCA and NMF, 2 of the most utilized algorithm for dimensionality reduction for scRNA-seq data. We perform the same preprocessing procedure as described in Section 7.2.3.1 of the main text. 
Figure 7B.1 shows the comparison of CCP-assisted, PCA-assisted, and NMF-assisted visualization on the GSE75748 cell, GSE75748 time, GSE67835, and GSE82187 data. The columns from left to right correspond to CCP-assisted UMAP, CCP-assisted tSNE, PCA-assisted UMAP, PCA-assisted tSNE, NMF-assisted UMAP, and NMF-assisted tSNE. The NMF-assisted visualizations do not provide a meaningful clustering result for any of these datasets. We can observe the following from the visualizations:

• In GSE75748 cell, the CCP-assisted and PCA-assisted visualizations show similar results.

• In the GSE75748 time data, the CCP-assisted and PCA-assisted visualizations show similar results. Most notably, both techniques show a supercluster of 72hr and 96hr cells, which is consistent with Chu's findings [6]. There is a clearer distinction between the 72hr-96hr supercluster and the 36hr cluster in the PCA-assisted visualization. However, the CCP-assisted visualization shows a clear distinction of the 0hr cells from the other cells, indicating a clear separation of undifferentiated cells from cells that have begun or completed differentiating.

• In the GSE67835 data, CCP-assisted UMAP and tSNE have the best visualization. PCA-assisted UMAP does not show clear clustering, and the PCA-assisted tSNE visualization is dominated by outlier astrocytes.

• Both CCP-assisted UMAP and tSNE show the best visualization for the GSE82187 data. All the cell types form distinct clusters, whereas the PCA-assisted visualizations do not.

Figure 7B.1 Comparison of CCP-assisted visualization with PCA-assisted and NMF-assisted visualization on the GSE75748 cell, GSE75748 time, GSE67835, and GSE82187 data. The columns from left to right show CCP-assisted UMAP, CCP-assisted tSNE, PCA-assisted UMAP, PCA-assisted tSNE, NMF-assisted UMAP, and NMF-assisted tSNE. The rows from top to bottom show the GSE75748 cell, GSE75748 time, GSE67835, and GSE82187 data. CCP, PCA, and NMF were utilized to reduce the dimension to 300, and UMAP and tSNE were used to further reduce the dimension to 2. Samples were colored according to the cell types provided by the original authors.

Figure 7B.2 shows the comparison of CCP-assisted, PCA-assisted, and NMF-assisted visualization on the Quake data. The columns from left to right correspond to CCP-assisted UMAP, CCP-assisted tSNE, PCA-assisted UMAP, PCA-assisted tSNE, NMF-assisted UMAP, and NMF-assisted tSNE. Qx indicates scRNA-seq data obtained using the 10x Genomics platform, and Qs indicates data obtained from the SmartSeq2 platform. The NMF-assisted visualizations do not provide a meaningful clustering result for any of these datasets. We can observe the following from the visualizations:

• The CCP-assisted and PCA-assisted visualizations show similar clustering results for both the Qx Bladder and Qx Limb Muscle data. The most notable difference is that CCP-assisted UMAP shows 3 subclusters of bladder urothelial cells.

• CCP-assisted tSNE shows a considerable improvement over PCA-assisted tSNE. Macrophages do not form a cluster in PCA-assisted tSNE, whereas they form a cluster in CCP-assisted tSNE. Additionally, satellite cells are more dispersed in PCA-assisted tSNE than in CCP-assisted tSNE.

• In Qs Limb Muscle, both the CCP-assisted and PCA-assisted visualizations show similar results.

• In the Qs Trachea data, the CCP-assisted visualizations show better clustering results than their PCA-assisted counterparts. Noticeably, epithelial cells form a distinct cluster in the CCP-assisted visualization, whereas in the PCA-assisted visualization, epithelial cells form 2 clusters and have many cells mixed with mesenchymal cells.
Figure 7B.2 Comparison of CCP-assisted visualization with PCA-assisted and NMF-assisted visualization on the Quake data. The columns from left to right show CCP-assisted UMAP, CCP-assisted tSNE, PCA-assisted UMAP, PCA-assisted tSNE, NMF-assisted UMAP, and NMF-assisted tSNE. The rows each correspond to one of the five Quake datasets. Qx indicates scRNA-seq data obtained using the 10x Genomics platform, and Qs indicates data obtained from the SmartSeq2 platform. CCP, PCA, and NMF were utilized to reduce the dimension to 300, and UMAP and tSNE were used to further reduce the dimension to 2. Samples were colored according to the cell types provided by the original authors.
Figure 7B.3 shows the comparison of CCP-assisted, PCA-assisted, and NMF-assisted visualization on the GSE84133 human data. The columns from left to right correspond to CCP-assisted UMAP, CCP-assisted tSNE, PCA-assisted UMAP, PCA-assisted tSNE, NMF-assisted UMAP, and NMF-assisted tSNE. NMF-assisted visualization does not provide a meaningful clustering result for any of the datasets. In general, CCP-assisted visualization shows the clearest distinction between the cell types. PCA-assisted UMAP and tSNE do not separate all of the cell types.
Figure 7B.3 Comparison of CCP-assisted visualization with PCA-assisted and NMF-assisted visualization on the GSE84133 human data. The columns from left to right show CCP-assisted UMAP, CCP-assisted tSNE, PCA-assisted UMAP, PCA-assisted tSNE, NMF-assisted UMAP, and NMF-assisted tSNE. The rows correspond to one of the four patients. CCP, PCA, and NMF were utilized to reduce the dimension to 300, and UMAP and tSNE were used to further reduce the dimension to 2. Samples were colored according to the cell types provided by the original authors.
Figure 7B.4 shows the comparison of CCP-assisted, PCA-assisted, and NMF-assisted visualization on the GSE84133 mouse data. The columns from left to right correspond to CCP-assisted UMAP, CCP-assisted tSNE, PCA-assisted UMAP, PCA-assisted tSNE, NMF-assisted UMAP, and NMF-assisted tSNE. NMF-assisted visualization does not provide a meaningful clustering result for any of the datasets. In general, CCP-assisted visualization shows the clearest distinction between the cell types. PCA-assisted UMAP and tSNE do not separate all of the cell types.
Figure 7B.4 Comparison of CCP-assisted visualization with PCA-assisted and NMF-assisted visualization on the GSE84133 mouse data. The columns from left to right show CCP-assisted UMAP, CCP-assisted tSNE, PCA-assisted UMAP, PCA-assisted tSNE, NMF-assisted UMAP, and NMF-assisted tSNE. The rows correspond to mouse 1 and mouse 2. CCP, PCA, and NMF were utilized to reduce the dimension to 300, and UMAP and tSNE were used to further reduce the dimension to 2. Samples were colored according to the cell types provided by the original authors.
Figure 7B.5 shows the comparison of CCP-assisted, PCA-assisted, and NMF-assisted visualization on the Muraro, Romanov, and Qs Lung data. The columns from left to right correspond to CCP-assisted UMAP, CCP-assisted tSNE, PCA-assisted UMAP, PCA-assisted tSNE, NMF-assisted UMAP, and NMF-assisted tSNE. NMF-assisted visualization does not provide a meaningful clustering result for any of the datasets. CCP-assisted and PCA-assisted visualization are comparable for all three datasets. The PCA-assisted visualization of the Qs Lung data shows a clear distinction between monocytes and ciliated columnar cells, whereas the CCP-assisted visualization shows a supercluster of these two cell types.
However, the stromal cells form a distinct cluster in the CCP-assisted visualization, whereas in the PCA-assisted visualization they appear as a subcluster within the endothelial cell cluster.
Figure 7B.5 Comparison of CCP-assisted visualization with PCA-assisted and NMF-assisted visualization on the Muraro, Romanov, and Qs Lung data. The columns from left to right show CCP-assisted UMAP, CCP-assisted tSNE, PCA-assisted UMAP, PCA-assisted tSNE, NMF-assisted UMAP, and NMF-assisted tSNE. The rows from top to bottom show the Muraro, Romanov, and Qs Lung data. CCP, PCA, and NMF were utilized to reduce the dimension to 300, and UMAP and tSNE were used to further reduce the dimension to 2. Samples were colored according to the cell types provided by the original authors.
7B.1.3 Accuracy
To validate CCP's performance, we computed the ARI, NMI, and ECM for the 18 datasets. For each dataset, 10 random seeds were used to generate CCP, PCA, and NMF features, and the reduced features were further reduced to 2D using UMAP and tSNE. Leiden clustering was performed to obtain the cluster assignments, and the ARI, NMI, and ECM were computed by comparing the clustering results with the cell types provided by the original authors.
Figure 7B.6 shows the average ARI, NMI, and ECM of CCP-assisted, PCA-assisted, and NMF-assisted UMAP and tSNE over the 18 datasets. Notice that CCP outperforms both PCA and NMF. CCP-assisted UMAP improves upon PCA-assisted UMAP on average by 43% in ARI, 22.5% in NMI, and 19% in ECM.
(a) Average ARI over the 18 datasets (b) Average NMI over the 18 datasets (c) Average ECM over the 18 datasets
Figure 7B.6 The average ARI, NMI, and ECM over the 18 datasets. Ten random initializations were used to compute the reduction for each dataset. Leiden clustering was used to obtain the clustering results.
Additionally, we validated the CCP super-genes against the PCA-genes and NMF-genes. Figure 7B.7 shows the average ARI, NMI, and ECM of CCP, PCA, and NMF over the 18 datasets. Notice that CCP outperforms both PCA and NMF. Most notably, CCP improves upon PCA by 29% in ARI, 20% in NMI, and 15% in ECM.
(a) Average ARI over the 18 datasets (b) Average NMI over the 18 datasets (c) Average ECM over the 18 datasets
Figure 7B.7 The average ARI, NMI, and ECM over the 18 datasets. Ten random initializations were used to compute the reduction for each dataset. Leiden clustering was used to obtain the clustering results.
7B.2 Discussion
7B.2.1 Low-variance genes
Figure 7B.8 shows the effect of varying the number of super-genes and the cutoff ratio on the predictive power and visualization of the GSE75748 cell data. We utilized 10 random seeds to generate CCP super-genes with different numbers of super-genes and cutoff ratios. Then, Leiden clustering was used to obtain the cluster labels, and the ARI was computed using the cell types provided by the original authors. Notice that for every cutoff ratio, the ARI increases as the number of super-genes increases, and the ARI values are comparable at 300 super-genes. This suggests the robustness of the LV-gene procedure. Additionally, we computed the number of genes in the LV-gene cluster at varying cutoff values and plotted them together with the gene variances in descending order. Notice that at νc = 0.9 the variances of these genes are relatively small, indicating that their predictive power may not be high. Figure 7B.8(c) shows the visualization of CCP-assisted UMAP and tSNE at various cutoff ratios.
For the visualization, 300 super-genes were utilized, and UMAP and tSNE were applied to the super-genes to reduce the dimension to 2. Samples were then colored according to the cell types provided by the original authors. Note that all the visualizations are comparable, indicating the robustness of the LV-gene procedure under different cutoff ratios.
(a) ARI versus the number of super-genes and νc (b) Number of genes in the LV-gene cluster (c) CCP-assisted UMAP and tSNE visualization
Figure 7B.8 Analysis of the effect of the cutoff ratio νc on clustering and visualization of the GSE75748 cell data. (a) ARI of Leiden clustering as the number of super-genes and the cutoff ratio are varied. (b) The number of genes in the LV-gene cluster as νc is varied. (c) The top and bottom rows show the CCP-assisted UMAP and t-SNE visualizations, and the columns correspond to νc = 0.6, 0.7, 0.8, 0.9. 300 super-genes were used to initialize UMAP and tSNE, and the samples were colored according to the true cell types.

APPENDIX 7C
ALGEBRAIC CONNECTIVITY OF FRI IN CCP

7C.1 Motivation
In the original formulation of CCP, for component n, we have

\[
\Phi\left(|x_i^{S_n} - x_j^{S_n}|;\, r_c^{S_n}, \eta_n, \tau, \kappa\right) =
\begin{cases}
\exp\left(-\left(\dfrac{|x_i^{S_n} - x_j^{S_n}|}{\eta_n \tau}\right)^{\kappa}\right), & \text{if } |x_i^{S_n} - x_j^{S_n}| < r_c^{S_n}, \\
0, & \text{otherwise},
\end{cases}
\]
(7C.1)

where \(\eta_n\) is the algebraic connectivity term defined as

\[
\eta_n = \frac{1}{M} \sum_{m=1}^{M} \min_{j \neq m} |x_m^{S_n} - x_j^{S_n}|.
\]
(7C.2)

In other words, the algebraic connectivity term is defined as the average minimal distance. However, there are alternative values that \(\eta_n\) can take, using both supervised and unsupervised approaches. In this section, I will discuss these alternatives, namely the average mean distance, the cluster-wide minimal distance, and the cluster-wide average distance.

7C.2 Formulation of algebraic connectivity
7C.2.1 Average mean distance
The average mean distance can be defined as

\[
\eta_n = \frac{1}{M} \sum_{m=1}^{M} \frac{\sum_{j \neq m} |x_m^{S_n} - x_j^{S_n}|}{M}.
\]
(7C.3)

7C.2.2 Cluster-wide minimal distance
Let the data be represented as \(\{(x_m, y_m)\}_{m=1}^{M}\), where \(y_m \in \mathbb{Z}_L\) is the class or cluster label for sample m and L is the number of classes or clusters. Let the data be partitioned into L classes or clusters \(C_l = \{x_m \mid y_m = l\}\). For the cluster-wide minimal distance, the samples in each \(C_l\) obtain their own algebraic connectivity term \(\eta_n^l\). For class l, we have

\[
\eta_n^l = \frac{1}{|C_l|} \sum_{x_m \in C_l} \min_{x_j \in C_l,\, j \neq m} |x_m^{S_n} - x_j^{S_n}|.
\]
(7C.4)

7C.2.3 Cluster-wide average distance
Define \(C_l\) as in the cluster-wide minimal distance. For the cluster-wide average distance, the samples in each \(C_l\) obtain their own algebraic connectivity term \(\eta_n^l\). For class l, we have

\[
\eta_n^l = \frac{1}{|C_l|} \sum_{x_m \in C_l} \frac{\sum_{x_j \in C_l} |x_m^{S_n} - x_j^{S_n}|}{|C_l|}.
\]
(7C.5)

7C.3 Results
We utilized the same data as in Table 7.1 and performed the same classification procedure as in subsection 7A.1.2. Figure 7C.1 shows the average balanced accuracy (14) of the four methods: the standard (average minimal distance), the average mean distance, the cluster-wide minimal distance, and the cluster-wide mean distance. τ = 1 and κ = 2 with the exponential kernel were used for all tests.
Figure 7C.1 Average balanced accuracy (BA) for different choices of the algebraic connectivity. From left to right: standard (average minimal distance), average mean distance, cluster-wide minimal distance, and cluster-wide mean distance. The number above each bar indicates the average BA. τ = 1 and κ = 2 with the exponential kernel were used for all tests.
We can see that the average mean distance outperforms the other three methods for scRNA-seq data.
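For concreteness, the following is a minimal Python sketch of the four algebraic connectivity choices defined in Section 7C.2, computed for a single feature partition S_n. The function names and the use of SciPy's distance routines are illustrative and are not part of the released CCP implementation.

```python
# Sketch of the algebraic connectivity choices in Section 7C.2, assuming X is the
# (M x |S_n|) sub-matrix of samples restricted to partition S_n and `labels` holds
# the class/cluster assignments used by the cluster-wide variants.
import numpy as np
from scipy.spatial.distance import cdist


def eta_average_minimal(X):
    """Eq. (7C.2): mean over samples of the distance to their nearest neighbor."""
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)   # exclude the zero self-distance before taking the min
    return D.min(axis=1).mean()


def eta_average_mean(X):
    """Eq. (7C.3): mean over samples of their average distance to the other samples."""
    M = X.shape[0]
    D = cdist(X, X)               # self-distance is 0, so it does not change the row sums
    return (D.sum(axis=1) / M).mean()


def eta_cluster_wide(X, labels, mode="minimal"):
    """Eqs. (7C.4)-(7C.5): one eta per class/cluster, using minimal or mean distances."""
    labels = np.asarray(labels)
    etas = {}
    for l in np.unique(labels):
        Xl = X[labels == l]
        D = cdist(Xl, Xl)
        if mode == "minimal":
            np.fill_diagonal(D, np.inf)
            etas[l] = D.min(axis=1).mean()
        else:
            etas[l] = (D.sum(axis=1) / len(Xl)).mean()
    return etas
```

The resulting eta (or the per-class eta for labeled samples) then sets the kernel scale, entering Eq. (7C.1) through the product eta*tau in the denominator.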
However, as shown in Figure 7C.1, the other choices of algebraic connectivity also perform well. This indicates that CCP is quite stable for classification tasks, and that the choice of algebraic connectivity can be data-dependent.

CHAPTER 8
DISSERTATION CONTRIBUTIONS AND FUTURE DIRECTIONS

The main contributions of this dissertation are as follows:
• In Chapter 3, we introduce a large-scale clustering method called UMAP-assisted k-means clustering for clustering SARS-CoV-2 mutations, and analyze worldwide mutational trends. We also developed a novel alignment-free DNA sequence analysis method called k-mers topology. We show that k-mers topology outperforms other viral classification algorithms, and that k-mers topology can also be used for phylogenetic analysis.
• In Chapter 4, we propose residue-similarity (RS) analysis and the RS plot for data with dimensions greater than 3. The definitions and the formulation are presented in this chapter. We compared the RS plot with visualizations of the geometric shape of the data and with topological analysis to validate and compare the results.
• In Chapter 5, we propose correlated clustering and projection (CCP) as a dimensionality reduction method for data with high intrinsic dimensions. We benchmark CCP against PCA on standard benchmark datasets.
• In Chapter 6, we introduce topological nonnegative matrix factorization (TNMF), which incorporates multiscale topological and geometric information through persistent Laplacians (PLs). We show how PLs are constructed, present an alternative construction called the k-NN induced PL, and then prove the validity of the updating scheme for TNMF.
• In Chapter 7, we apply CCP and TNMF to scRNA-seq data. We show that both CCP and TNMF improve upon other dimensionality reduction methods for clustering, classification, and visualization.
In the future, I would like to extend this work and explore the following:
• In the k-mers topology method, we utilized persistent homology to convert sequences of varying length into a fixed set of features. We would like to utilize tools from natural language processing models to further enhance the method's performance. Additionally, I am interested in exploring the biological implications of the distribution of k-mers.
• The k-mers topology method only utilized persistent homology. We would like to extend the work to persistent Laplacians and other topological methods. Additionally, k-mers can also be interpreted as nodes of hypergraphs, and I would like to explore the use of hypergraphs in DNA sequencing methods.
• I would like to extend the k-mers topology to protein sequence classification.
• One downside of topological NMF is that it requires choosing the weights for each filtration. Although the k-NN induced persistent Laplacian can reduce the number of parameters, the number of parameters is still 2T. In the future, I would like to consider a parameter-free approach using consensus methods. Additionally, I would like to extend the work to a semi-supervised model.
• I would like to explore the use of the residue-similarity score as a minimization objective for dimensionality reduction.
The content of this dissertation is mostly adapted from the following publications and preprints.
• Hozumi, Y., & Wei, G.-W. (2024). K-mer Topology for Whole Genome Analysis.
• Hozumi, Y., & Wei, G.-W. (2024). Analyzing single cell RNA sequencing with topological nonnegative matrix factorization. Journal of Computational and Applied Mathematics, 115842.
• Hozumi, Y., Tanemura, K. A., & Wei, G.-W. (2023).
Preprocessing of single cell RNA sequencing data using correlated clustering and projection. Journal of Chemical Information and Modeling.
• Hozumi, Y., Wang, R., & Wei, G.-W. (2022). CCP: Correlated clustering and projection for dimensionality reduction. ArXiv Preprint ArXiv:2206.04189.
• Hozumi, Y., Wang, R., Yin, C., & Wei, G.-W. (2021). UMAP-assisted K-means clustering of large-scale SARS-CoV-2 mutation datasets. Computers in Biology and Medicine, 131, 104264.
• Hozumi, Y., & Wei, G.-W. (2023). Analyzing scRNA-seq data by CCP-assisted UMAP and t-SNE. ArXiv Preprint ArXiv:2306.13750.
These works led to the following publications and preprints, which are not discussed in this dissertation.
• Chen, J., Wang, R., Hozumi, Y., Liu, G., Qiu, Y., Wei, X., & Wei, G.-W. (2022). Emerging dominant SARS-CoV-2 variants. Journal of Chemical Information and Modeling, 63(1), 335–342.
• Cottrell, S., Hozumi, Y., & Wei, G.-W. (2023). K-Nearest-Neighbors Induced Topological PCA for Single Cell RNA-Sequence Data Analysis. ArXiv.
• Feng, H., Cottrell, S., Hozumi, Y., & Wei, G.-W. (2024). Multiscale differential geometry learning of networks with applications to single-cell RNA sequencing data. Computers in Biology and Medicine, 171, 108211.
• Wang, R., Chen, J., Gao, K., Hozumi, Y., Yin, C., & Wei, G.-W. (2020). Characterizing SARS-CoV-2 mutations in the United States. Research Square.
• Wang, R., Chen, J., Gao, K., Hozumi, Y., Yin, C., & Wei, G.-W. (2021). Analysis of SARS-CoV-2 mutations in the United States suggests presence of four substrains and novel variants. Communications Biology, 4(1), 228.
• Wang, R., Chen, J., Hozumi, Y., Yin, C., & Wei, G.-W. (2020). Decoding asymptomatic COVID-19 infection and transmission. The Journal of Physical Chemistry Letters, 11(23), 10007–10015.
• Wang, R., Chen, J., Hozumi, Y., Yin, C., & Wei, G.-W. (2022). Emerging vaccine-breakthrough SARS-CoV-2 variants. ACS Infectious Diseases, 8(3), 546–556.
• Wang, R., Hozumi, Y., Yin, C., & Wei, G.-W. (2020a). Decoding SARS-CoV-2 transmission and evolution and ramifications for COVID-19 diagnosis, vaccine, and medicine. Journal of Chemical Information and Modeling, 60(12), 5853–5865.
• Wang, R., Hozumi, Y., Yin, C., & Wei, G.-W. (2020b). Mutations on COVID-19 diagnostic targets. Genomics, 112(6), 5204–5213.
• Wang, R., Hozumi, Y., Zheng, Y.-H., Yin, C., & Wei, G.-W. (2020). Host immune response driving SARS-CoV-2 evolution. Viruses, 12(10), 1095.