GRAPH-BASED CLUSTERING ALGORITHMS FOR SINGLE-CELL RNA SEQUENCING DATA: METHODS AND THEORY By Andriana Manousidaki A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics—Doctor of Philosophy 2023 ABSTRACT The innovative technology of single-cell RNA sequencing (scRNAseq) allows us to extract gene expression information from each cell of a tissue, resulting in data sets of tens of thousands to millions of points (cells). Clustering of cells based on the similarity of their gene expression enables the understanding of their functions and hence the characterization of cell types in a tissue. This dissertation focuses on the most widely used clustering methodology for scRNAseq data – clustering based on the graph representation of data points (cells as vertices on a graph). Firstly, we showcase how existing methods can effectively identify an important group of tumor growth related cells in the analysis of head and neck cancer scRNAseq data. The newly discovered marker genes can potentially facilitate new therapy approaches. Secondly, we introduce a novel clustering method that preserves both the global data geometry and cluster structure, via multidimensional scaling based on power-weighted path metrics. The new method outperforms prevailing scRNAseq clustering algorithms on a wide range of benchmarking data sets. Thirdly, we study spectral clustering on shared nearest neighbors (SNN) graphs. In contrast to current ad-hoc methods for number of neighbors selection, we develop a general cross-validation tuning algorithm to achieve effective clustering. Moreover, we provide a comprehensive theoretical analysis of SNN based spectral clustering in the nonparametric setting. Our theoretical results reveal an optimal range of the number of neighbors for cluster identification and characterize the impact of data density on spectral clustering. This dissertation is dedicated to my family. iii ACKNOWLEDGEMENTS First and foremost, I would like to extend my gratitude to my advisors Dr. Haolei Weng and Dr. Yuying Xie, for their valuable mentoring and the precious research opportunities that offered me during my Ph.D. journey. Additionally, I would like to thank Dr. Anna Little and Dr. Yuehua Cui for their advice as members of my guidance committee. Furthermore, I would like to highlight my gratitude to Professor Camille Fairbourn for her continuous support and mentoring throughout my professional endeavours. Moreover, I feel deeply thankful to Dr. Dimitra Papadovasilaki for advising me during my Ph.D life. My life in East Lansing and my Ph.D studies wouldn’t be so pleasant if I had not met my roommate and best friend Dr. Ana-Maria Raicu and the rest of my "Spartan" friends Dr. Ilias Magoulas, Eleni Lygda, Dr. Christos Gregoriadis, Dr. Ioannis Zaxos, Dr. George Psaromiligos, Dr. Michalis Paparizos and Dr. Dimitris Vardakis who have helped since before I arrived in United States and we are now family. Also I would like to thank my most recent friends Estefania Blancas Garcias, Manos Kokarakis and Mylena Ortiz. I would like to highlight my gratitude to my husband, Dr. Marios Velivasakis, the most enthusiastic Mathematician I know for his constant support and mathematical discussion which played an important role for the completion of this thesis. I also want to thank my mother-in-law and father-in-law, Sofia Zervou and George Velivasakis for their love. I am proud to call them my parents. 
I also want to thank my sisters Catherine Manousidaki and Dr. Maria Manousidaki, because they have shown to me what means to be a strong independent modern woman. With their studies, career and life experiences paved an easier way for me toward my Ph.D. Finally, I would like to thank my father Ioannis Manousidakis and my mother Evaggelia Maslimopoulou for believing in me, encouraging me to pursue my dreams and teaching me how to continue fighting when things are hard. iv TABLE OF CONTENTS CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 CHAPTER 2 SINGLE-CELL ANALYSIS OF CANCER STEM CELLS IN HEAD AND NECK CANCER . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 CHAPTER 3 CLUSTERING AND VISUALIZATION OF SINGLE-CELL RNA-SEQ DATA USING PATH METRICS . . . . . . . . . . . . . . . . . . . . . . 18 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 APPENDIX A DATA PREPROCESSING . . . . . . . . . . . . . . . . . . . . 41 APPENDIX B ADDITIONAL CLUSTERING RESULTS . . . . . . . . . . . . 44 CHAPTER 4 SHARED NEAREST NEIGHBORS GRAPH BASED SPECTRAL CLUSTERING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 APPENDIX A PERFORMANCE ON GAUSSIAN DATA WITH DIAGONAL COVARIANCE MATRIX . . . . . . . . . . . . . . . . . . . . . 77 APPENDIX B PERFORMANCE ON GAUSSIAN DATA WITH TRIDIAGONAL PRECISION MATRIX . . . . . . . . . . . . . . . . . . . . . . 82 APPENDIX C PERFORMANCE ON GAUSSIAN DATA WITH NETWORK OF FEATURES . . . . . . . . . . . . . . . . . . . . . . . . . . 87 v CHAPTER 1 INTRODUCTION The pioneering technology of single-cell RNA sequencing (scRNAseq) enables the extraction of gene expression information from individual cells within a tissue, yielding datasets comprising tens of thousands to millions of cellular data points. Clustering cells based on the congruity of their gene expression profiles facilitates the comprehension of their functional attributes, thereby enabling the characterization of distinct cell types within a given tissue. Prevalent clustering methodologies developed for scRNAseq data rely on the representation of data points (cells) as vertices in a graph (Stuart et al., 2019; Wolf et al., 2018). The present dissertation primarily focuses on graph-based clustering methods tailored for scRNAseq data analysis. Firstly, we present the contribution of established approaches to the identification of Cancer Stem Cells (CSCs), a cellular cohort characterized by their resistance to therapeutic interventions and their pivotal role in tumor initiation and progression (Chen et al., 2021; Mroz et al., 2015). Secondly, we introduce a novel clustering methodology denoted as Single-Cell Path Metrics Profiling (scPMP), which concurrently upholds both local cluster structure and global data geometry. Thirdly, we undertake an exploration of the performance of Spectral Clustering on Shared Nearest Neighbors (SNN) graphs in relationship with the parameter of nearest neighbors used in the construction of the SNN grpah. We finally suggest a general cross-validation method for the tuning of this parameter. In Chapter 2, an in-depth analysis of scRNAseq data originating from cell cultures of head and neck cancer lines, as well as 10 primary tumors, is conducted. 
The primary objective of this analysis revolves around the identification of the most homogeneous cluster of CSCs within each dataset, while simultaneously elucidating their dynamic states and plasticity via an extension of the repertoire of CSC marker genes. Chapter 3 presents the introduction of the scPMP algorithm, a novel clustering methodology predicated upon path-metric distances among cells. Unlike conventional distance metrics, such as the Euclidean distance, path metrics possess the capacity to discern density variations and faithfully uphold the underlying data geometry. By integrating path metrics with multidimensional scaling 1 techniques, we obtain a low-dimensional representation of the data that faithfully encapsulates both the global data geometry and cluster structure. The efficacy of the scPMP algorithm is evaluated comprehensively in terms of clustering quality and geometric fidelity, ultimately establishing its superiority over current scRNAseq clustering algorithms across a diverse spectrum of benchmark datasets. Chapter 4 delves into Spectral Clustering on SNN graphs. SNN graphs are constructed based on a 𝑘 Nearest Neighbors (𝑘NN) graph, thus rendering their properties contingent upon the choice of the parameter 𝑘. Our findings indicate that, in both the absence of noise and the presence of noise, it is imperative to select 𝑘 of the magnitude 𝑐𝑛 in order to maximize the likelihood of cluster identification. This contrasts with the literature on random geometric graphs, which suggests an order of log 𝑛 for 𝑘 (Brito et al., 1997). Additionally, we propose a comprehensive cross-validation tuning approach for fine-tuning the parameters of clustering algorithms. We employ this approach to determine the optimal number of nearest neighbors, denoted as 𝑘, for the SNN spectral clustering algorithm using various types of simulated data. 2 BIBLIOGRAPHY Brito, M. R., Chávez, E. L., Quiroz, A. J., and Yukich, J. E. (1997). Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection. Statistics & Probability Letters, 35(1):33–42. Chen, C.-Y., Ueha, S., Ishiwata, Y., Shichino, S., Yokochi, S., Yang, D., Oppenheim, J. J., Ogiwara, H., Deshimaru, S., Kanno, Y., et al. (2021). Combining an alarmin hmgn1 peptide with pd-l1 blockade results in robust antitumor effects with a concomitant increase of stem-like/progenitor exhausted cd8+ t cells. Cancer immunology research, 9(10):1214–1228. Mroz, E. A., Tward, A. M., Hammon, R. J., Ren, Y., and Rocco, J. W. (2015). Intra-tumor genetic heterogeneity and mortality in head and neck cancer: analysis of data from the cancer genome atlas. PLoS medicine, 12(2):e1001786. Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., III, W. M. M., Hao, Y., Stoeckius, M., Smibert, P., and Satija, R. (2019). Comprehensive integration of single-cell data. Cell, 177:1888–1902. Wolf, F. A., Angerer, P., and Theis, F. J. (2018). SCANPY: large-scale single-cell gene expression data analysis. Genome Biology, 19. 3 CHAPTER 2 SINGLE-CELL ANALYSIS OF CANCER STEM CELLS IN HEAD AND NECK CANCER 2.1 Introduction Head and neck cancer (HNC) is a major global health problem, with an estimated 880,000 new cases and 445,000 deaths annually worldwide (Sung et al., 2021). While human papilloma (HPV)-associated HNC have improved outcomes, despite advances in comprehensive cancer care, HPV-negative HNC remains a highly morbid disease with stagnant survival rates hovering at 50%. 
This poor prognosis is due in part to the complex heterogeneity of HNC, which involves multiple cell types, genetic alterations and transitional states that lead to treatment resistance and poor outcomes (Chen et al., 2021; Puram et al., 2018; Mroz et al., 2015). Tumoral heterogeneity is a well-established biomarker of poor prognosis, being associated with aggressive cancer behavior and treatment resistance in various cancer types, including HNC. Tumoral heterogeneity has been associated with worse outcomes, mediated by intrinsic and extrinsic factors related to subpopulations with distinct molecular profiles. Tumoral plasticity has been identified as a critical driver of tumoral heterogeneity, where clonal expansion and subclonal selection are based on evolutionary progression with each clone arising from cells with high propagation potential, plasticity, and self-renewal. Within the tumor microenvironment (TME), there is a subpopulation of tumor initiating cells, or cancer stem cells (CSC), that have the capacity to drive clonal and subclonal selection (O’Brien et al., 2007). Traditionally CSC were deemed fixed cells with limited to no plasticity based on their original definitions. However, as the field has advanced, the role of plasticity in CSC has expanded and the traditional view of CSCs has evolved. While the term CSC has persisted, despite much controversy on their existence, a more nuanced understanding of CSC is that their stem-like activity (CSC-state: self-renewal, tumorigenicity and asymmetric division) is not fixed but a transient state dictated by tumoral and environmental cues (Chaffer and Weinberg, 2011). When cells are in this CSC-state, they are associated with treatment resistance, metastasis, and tumor recurrence. However, CSCs in HNC remains controversial and the CSC-like state has been difficult to study as the mechanisms of CSC plasticity are poorly 4 understood. Plasticity and heterogeneity are also critical components of epithelial to mesenchymal transition (EMT) programs. Weinberg and others have shown EMT represent transient cancer cell states with varying degrees of activities, strongly suggesting the EMT process enables cancer cells to acquire CSC-like properties and enhance their ability to initiate and sustain tumors (Mani et al., 2008; Tam and Weinberg, 2013). However, understanding the link between EMT and CSC remains elusive due to their rarity and potentially transiet states. Analyzing this interaction is critical to understanding the CSC-like state and defining potential mechanisms for plasticity and identify novel targets for therapy. Recent advances in single-cell RNA sequencing (scRNAseq) technology have enabled the identification of distinct subpopulations of cells within tumors based on their gene expression profiles, providing a powerful tool to study the heterogeneity and plasticity of CSCs in HNC (Wang et al., 2019). Moreover, in vitro lineage tracing can be used to assess CSC’s capacity for plasticity and evaluate their various states. In this study, we integrated scRNAseq and in vitro and in silico lineage tracing to analyze these rare CSC subpopulations in cell culture and primary HNC tumors to characterize their dynamic states and plasticity. Our study sheds new light on the dynamic nature and plasticity of CSCs in HNC, and their potential involvement in EMT programs. 
Our findings have important implications for the development of novel therapeutic strategies for HNC, as well as other cancers, and for the broader understanding of CSC-states and plasticity.

2.2 Methods

2.2.1 Cell lines analysis

To better evaluate transcriptional differences while controlling for tumoral heterogeneity, we subsequently performed single cell sequencing of two patient-derived HNC cell lines (UMSCC-122 and UMSCC-103). Both cell lines were sorted to select for CSC (CD44high/ALDHhigh) and non-CSC (CD44low/ALDHlow) cells. After standard quality control filtering and integration of the two cell line expression data sets, we found 26 clusters using Seurat (Stuart et al., 2019). We observed that the clusters were not separated on the UMAP plot, suggesting that cells lie on a continuum, but the distinction between non-CSC and CSC was weak since cell cycle phase affected the clustering (Figure 2.1A). As a next step, we eliminated the cell-cycle effect and performed a trajectory analysis to capture both local and global nonlinear structure using an information-geometric distance between cells (Moon et al., 2019). Given recent evidence suggesting an inherent plasticity in cancer stem cells, we were interested in evaluating if there is a continuity between non-stem and stem cells in the tumor. Figure 2.1B demonstrates a spectrum of non-CSC to CSC, with an overlapping region in the middle, suggesting cells can progress from a non-stem-like state to a stem-like state and supporting the hypothesis that CSCs possess plasticity and exist in a transitional CSC-state (Intermediate cells).

Figure 2.1 Cell line data results (panels A-D).

Finally, we used the cell line data to generate a model to predict the probability that a cancer cell is a stem cell (or in a stem-like state) and to develop a CSC-state gene expression signature. Using the trajectory coordinates of the cell line cells, we constructed a logistic regression model that provides the probability a cell is a cancer stem cell (Figure 2.1, right); a minimal sketch of this step is given at the end of this subsection. We calculated the correlation of each gene's expression to the cells' predicted probabilities and used these values to rank each gene and to perform gene set enrichment analysis. We found 29 enriched pathways in the C2 and Hallmark databases. The positively enriched pathways and contributing genes are demonstrated in the leading-edge analysis (Figure 2.1D). We tested which of the contributing genes are expressed significantly more in each predicted group of cells (CSC, Intermediate, Non-CSC) to construct a signature genes matrix (Figure 2.2).

Figure 2.2 CSC signature genes matrix - Cell line data.

Together these data demonstrate a conserved cancer stem cell signature identified with single cell sequencing and novel bioinformatic techniques. These data nominate a subset of genes (ACTB, ANXA2, TPM1, MYL12A, CD63, CCDN1, CD59, ATOX1, LAMA3, LAMC2, L1CAM, KRT8, ANXA3, CLTB and IL32) as drivers of the cancer stem cell phenotype. Despite these cells being exclusively derived from epithelial cells, several of the CSC differentially expressed genes are associated predominantly with CAFs (TPM1, MYL12A, KRT8, CD63 and IL32), suggesting a mesenchymal state of CSC. These clusters were selected to further define a pure epithelial CSC signature in the primary tumor data of 10 patients.
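As a concrete illustration of the probability model and gene-ranking step described above, the following sketch fits a logistic regression of sorted CSC status on trajectory coordinates and ranks genes by the correlation of their expression with the predicted probabilities. All inputs (coords, sorted_label, expr) are random placeholders; the snippet is only meant to convey the structure of this step, not the exact pipeline or tooling used in the analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: trajectory coordinates (cells x dims), FACS-sorted labels
# (1 = sorted CSC, 0 = sorted non-CSC), and a genes x cells expression matrix.
coords = np.random.rand(500, 2)            # placeholder trajectory coordinates
sorted_label = np.random.randint(0, 2, 500)
expr = np.random.rand(2000, 500)           # placeholder normalized expression

# Logistic regression of sorted status on trajectory coordinates gives each
# cell a predicted probability of being in the CSC state.
model = LogisticRegression().fit(coords, sorted_label)
p_csc = model.predict_proba(coords)[:, 1]

# Rank genes by the correlation of their expression with the predicted
# probabilities; this ranked list is what would be passed to GSEA.
corr = np.array([np.corrcoef(expr[g], p_csc)[0, 1] for g in range(expr.shape[0])])
ranking = np.argsort(-corr)                # most CSC-associated genes first
```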
2.2.2 Primary tumor data analysis

While patient-derived cell line data provides critical informatic and biologic information, it fails to capture the complexity and heterogeneity of HNC. We leveraged our access to fresh tumor specimens to perform scRNAseq. Given the evidence that the tumor microenvironment plays a large role in maintenance of the cancer stem cell niche, we hypothesized that the cancer stem cell signature may differ between cell lines and primary tumors, but that cells in the CSC-state will have conserved signatures. To evaluate CSC signatures in primary tumors we analyzed 10 HNC tumors harvested directly from the operative theatre. Tumors were then digested into single cell suspension and sorted by FACS for standard CSC markers (ALDH and CD44). scRNAseq was then performed on the enriched groups. Seurat clusters are shown in Figure 2.3A. Of the CD44/ALDH enriched cells, the deconvoluted epithelial tumor population was found to make up only a small proportion of the tumor bulk (5%), with the remaining cells representing the immune and stromal elements of the TME. To confirm the identity of the epithelial cell cluster, the cell line CSC expression data was normalized and mapped onto the primary tumor expression data. As seen in Figure 2.3A, the cell line data, in black, overlap with the epithelial cluster (cluster 9), confirming an epithelial expression pattern. We then used RNAscope to show co-localization of top expressed epithelial genes within the tumor cell population to further confirm expression of the DEG genes in the primary tumor (Figure 2.3C). We considered isolating not only the epithelial but also the fibroblast cluster, since CAF genes were found among the CSC signature genes suggested by the cell line analysis. We also observe that PTPRC is low in fibroblasts of the sample and that the epithelial annotated cells are mapped onto each fibroblast cluster (Figure 2.3B). Hence, we proceed to investigate the expression profile of CSCs in both epithelial and fibroblast cells. Following the steps suggested by the analysis of the cell line data, we explore the trajectory of the epithelial and fibroblast cells.

Figure 2.3 Primary tumor data analysis (panels A-C).

Figure 2.4 Epithelial and fibroblast cells of primary tumor data. (A) Epithelial and fibroblast cells by state and by pseudotime. (B) Epithelial and fibroblast cells by probability of CSC status and by predicted status. (C) Heatmap of GSEA ranked genes in epithelial and fibroblast cells (including PTGS2, AQP1, SLPI, CCL19, CCL11 and TIMP1).

Figure 2.4A demonstrates that there is a spectrum of CSCs, Intermediate and Non-CSCs with two branches. To understand the ordering of cells on the trajectory we used the trajectory coordinates of each cell and their grouping based on ALDH and CD44 levels to extract their pseudotime (Street et al., 2018). We observe that cells with the highest pseudotime are located at the leftmost part of the trajectory and have high ALDH and CD44 (sorted as CSC). For this reason, we assume that the purest CSCs are located in the leftmost corner. To validate this assumption, a logistic regression model was generated to predict the probability that a cell in the epithelial and fibroblast cluster is a CSC using the pseudotime information of the cell. Cells with the highest probability are indeed located in the leftmost corner (Figure 2.4B). Additionally, genes of the fibroblast and epithelial cluster were ranked based on the correlation of their expression level with the pseudotime assigned to each cell.

Figure 2.5 GSEA of primary tumor data (panels A-B).

Figure 2.6 Heatmap of core enrichment genes in epithelial and fibroblast cells of the 10 patient data (genes shown include TIMP1, SFRP4, IGFBP3, ADH1B and TNC).
This ranking was used to perform gene set enrichment analysis. Positively enriched pathways are associated with the genes at the top of the list, namely those that are positively correlated with pseudotime and hence with the CSC cells. A heatmap of the expression of genes at the top and bottom of the ranked list shows a clustering of the tumor and epithelial cells based on pseudotime and the gradual transition from Non-CSC to CSC cells (Figure 2.4C). The biologic behavior of the CSC subpopulation can be further described through the results of the GSEA of the cell line and primary data. Permutation-based analysis was performed to calculate p-values and false discovery rates (FDR). Here we find that CSCs were enriched for the HALLMARK EPITHELIAL MESENCHYMAL TRANSITION, HALLMARK ANGIOGENESIS, C2 ANASTASSIOU MULTICANCER INVASIVENESS SIGNATURE, and C2 BOQUEST STEM CELL UP gene sets compared to non-CSC. Interestingly, non-CSC were enriched for radiation response pathways (SMIRNOV RESPONSE TO IR 2HR UP) as well as the well-established HNC pathways HALLMARK P53 PATHWAY, HALLMARK MYC TARGETS V1 and HALLMARK PI3K AKT MTOR SIGNALING (Figure 2.5). Taken together, these findings further support that CSCs are critical components of EMT and play a crucial role in promoting tumor invasion, metastasis, and treatment resistance.

To further define the genes of interest, common genes across the significantly enriched gene sets were explored. Genes common to at least 2 significantly enriched pathways are SFRP4, ALDH1B, WNT5A, TIMP1, COL1A1, COL3A1, MFAP5, COL1A2, LUM, COL5A1, THBS2, COL5A2, COL6A3, VCAN, LOX, MXRA5, COL6A2, FAP, CDH11, DCN, SPOCK1. Given that many of these genes are established mesenchymal genes (COL6A3, FAP, VCAN), this suggests that CSC may represent a critical subset of cells in a mesenchymal state as part of the EMT.

2.2.3 Survival analysis of TCGA data

To test the significance of the CSC signature genes (Figure 2.2), we utilized expression information for those genes in HPV patients of the TCGA database. Specifically, the CSC signature genes matrix was used to estimate the pure CSC, intermediate and pure Non-CSC proportion of cells in each patient via a least trimmed squares gene-expression deconvolution technique (Hao et al., 2018). Next, patients were grouped into two clusters via k-means based on their estimated proportions. The two clusters separated patients with low pure CSC proportion from patients with high pure CSC proportion. Patients in the cluster with low pure CSC proportion have a significantly higher survival rate (p-value = 0.0028, Figure 2.7).

Figure 2.7 Survival rate of TCGA patients based on estimated CSC proportion.

2.3 Discussion

Our study provides new insights into the dynamic nature and plasticity of CSCs in head and neck cancer, and their potential involvement in the epithelial-to-mesenchymal transition (EMT) process. We identified multiple dynamic states of CSCs within our primary cell cultures of UMSCC HNC cell lines and 10 primary tumors, suggesting that CSCs exist in a state of dynamic equilibrium with their non-CSC counterparts. Our in vitro lineage tracing experiments further confirmed the plasticity of CSCs, and their ability to differentiate into non-CSCs and vice versa. Recent insight into CSC biology has moved away from CSCs representing a distinct entity and toward viewing the CSC phenotype as a dynamic state.
Our findings are consistent with previous studies that have shown the plasticity of CSCs in various cancer types, including breast cancer, colorectal cancer, and glioblastoma (Chaffer and Weinberg, 2011; Vermeulen et al., 2010; Wang et al., 2018). In addition, our study supports the hypothesis that the EMT process may be involved in the acquisition of stem cell-like properties by cancer cells and may enhance their ability to initiate and sustain tumors (Mani et al., 2008; Tam and Weinberg, 2013). This hypothesis is supported by our gene set enrichment analysis (GSEA) results, which identified enrichment of EMT-related gene sets in our CSC populations. Interestingly, as part of this mesenchymal transition, we found cells in the CSC-state had similar canonical expression patterns as CAFs. This was seen both in pure epithelial cell line data as well as in the primary data. Furthermore, we found that the TIMP1/CD63 pathway was differentially expressed in our CSC populations. TIMP1 and CD63 have both been characterized in CAFs, more so than epithelial cells. TIMP1 is a member of the tissue inhibitor of metalloproteinase (TIMP) family, regulating matrix metalloproteinases (MMPs), and have been critical mediators in cancer invasion and metastasis. TIMP1 has been shown to be overexpressed in several solid organ cancers, including breast, lung, prostate and ovarian. In addition to invasive characteristics, TIMP1 has been shown promote cancer cell survival, thus critical for cancer maintenance. CD63 is a member of the tetraspanin family of membrane proteins, thus associated with cell adhesion, migration, and signaling through the 14 regulation of cell surface receptor trafficking and the regulation of extracellular vesicles, which are important mediators of intercellular communication within the TME. Like TIMP1, CD63 has been found to be overexpressed in various types of cancer, including breast, lung, and melanoma. Together, both TIMP1 and CD63 are involved in several overlapping pathways that regulate EMT, specifically AKT/mTOR, WNT/b-catenin, integrins and CD44. TIMP1/CD63 has been studied in other cancers as part of the EMT as well as CSCs, suggesting that this pathway may play a role in the maintenance and plasticity of CSCs in HNC. Our study provides new evidence for the potential involvement of this pathway in HN CSC biology, and may open up new avenues for the development of targeted therapies for HNC and other cancers. Taken together, our findings highlight the dynamic and plastic nature of CSCs in HNC, and their potential involvement in the EMT process and the TIMP1/CD63 pathway. Our study may have implications for the development of personalized therapeutic strategies for this deadly disease. Further studies are needed to validate our findings in larger patient cohorts and to explore the clinical relevance of our results. 15 BIBLIOGRAPHY Chaffer, C. L. and Weinberg, R. A. (2011). A perspective on cancer cell metastasis. science, 331(6024):1559–1564. Chen, C.-Y., Ueha, S., Ishiwata, Y., Shichino, S., Yokochi, S., Yang, D., Oppenheim, J. J., Ogiwara, H., Deshimaru, S., Kanno, Y., et al. (2021). Combining an alarmin hmgn1 peptide with pd-l1 blockade results in robust antitumor effects with a concomitant increase of stem-like/progenitor exhausted cd8+ t cells. Cancer immunology research, 9(10):1214–1228. Hao, Y., Yan, M., Lei, Y. L., and Xie, Y. (2018). Fast and robust deconvolution of tumor infiltrating lymphocyte from expression profiles using least trimmed squares. bioRxiv. Mani, S. 
A., Guo, W., Liao, M.-J., Eaton, E. N., Ayyanan, A., Zhou, A. Y., Brooks, M., Reinhard, F., Zhang, C. C., Shipitsin, M., et al. (2008). The epithelial-mesenchymal transition generates cells with properties of stem cells. Cell, 133(4):704–715. Moon, K. R., van Dijk, D., Wang, Z., Gigante, S., Burkhardt, D. B., Chen, W. S., Yim, K., van den Elzen, A., Hirn, M. J., Coifman, R. R., et al. (2019). Visualizing structure and transitions in high-dimensional biological data. Nature Biotechnology, 37(12):1482–1492. Mroz, E. A., Tward, A. M., Hammon, R. J., Ren, Y., and Rocco, J. W. (2015). Intra-tumor genetic heterogeneity and mortality in head and neck cancer: analysis of data from the cancer genome atlas. PLoS medicine, 12(2):e1001786. O’Brien, C. A., Pollett, A., Gallinger, S., and Dick, J. E. (2007). A human colon cancer cell capable of initiating tumour growth in immunodeficient mice. Nature, 445(7123):106–110. Puram, S. V., Parikh, A. S., and Tirosh, I. (2018). Single cell rna-seq highlights a role for a partial emt in head and neck cancer. Molecular & cellular oncology, 5(3):e1448244. Street, K., Risso, D., Fletcher, R. B., Das, D., Ngai, J., Yosef, N., Purdom, E., and Dudoit, S. (2018). Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC genomics, 19:1–16. Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., III, W. M. M., Hao, Y., Stoeckius, M., Smibert, P., and Satija, R. (2019). Comprehensive integration of single-cell data. Cell, 177:1888–1902. Sung, H., Ferlay, J., Siegel, R. L., Laversanne, M., Soerjomataram, I., Jemal, A., and Bray, F. (2021). Global cancer statistics 2020: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: a cancer journal for clinicians, 71(3):209–249. Tam, W. L. and Weinberg, R. A. (2013). The epigenetics of epithelial-mesenchymal plasticity in cancer. Nature medicine, 19(11):1438–1449. 16 Vermeulen, L., De Sousa E Melo, F., Van Der Heijden, M., Cameron, K., De Jong, J. H., Borovski, T., Tuynman, J. B., Todaro, M., Merz, C., Rodermond, H., et al. (2010). Wnt activity defines colon cancer stem cells and is regulated by the microenvironment. Nature cell biology, 12(5):468–476. Wang, Q., He, Z., Huang, M., Liu, T., Wang, Y., Xu, H., Duan, H., Ma, P., Zhang, L., Zamvil, S. S., et al. (2018). Vascular niche il-6 induces alternative macrophage activation in glioblastoma through hif-2𝛼. Nature communications, 9(1):559. 17 CHAPTER 3 CLUSTERING AND VISUALIZATION OF SINGLE-CELL RNA-SEQ DATA USING PATH METRICS 3.1 Introduction The advance in single-cell RNA-seq (scRNA-seq) technologies in recent years has enabled the simultaneous measurement of gene expression at the single-cell level Saliba et al. (2014); Eberwine et al. (1992); Tang et al. (2009). This opens up new possibilities to detect previously unknown cell populations, study cellular development and dynamics, and characterize cell composition within bulk tissues. Despite its similarity with bulk RNAseq data, scRNAseq data tends to have larger variation and larger amounts of missing values due to the low abundance of initial mRNA per cell. To address these challenges, numerous computational algorithms have been proposed focusing on different aspects. Given a collection of single cell transcriptomes from scRNAseq, one of the most common applications is to identify and characterize subpopulations, e.g., cell types or cell states. Numerous clustering approaches have been developed such as 𝑘-means based methods SC3 Kiselev et al. 
(2017), SIMLR Wang et al. (2017), and RaceID Herman et al. (2018); hierarchical clustering based methods CIDR Lin et al. (2017), BackSPIN A et al. (2015), and pcaReduce žurauskienė and Yau (2016); graph based methods Rphenograph CLevine et al. (2015), SNN-Cliq Xu and Su (2015), SSNN-Louvain Zhu et al. (2020), Seurat Stuart et al. (2019), and scanpy Wolf et al. (2018); and deep-learning based methods scGNN Wang et al. (2021), scVI Lopez et al. (2018), ScDeepCluster Tian et al. (2019b), and DANCE Ding et al. (2022).

To visualize and characterize relationships between cell types, it is important to represent them in a low-dimensional space. Many low-dimensional embedding methods have been proposed, including UMAP McInnes et al. (2018), t-SNE Van der Maaten and Hinton (2008), PHATE Moon et al. (2019), and LargeVis Tang et al. (2016). However, a key challenge for embedding methods is to simultaneously reduce cluster variance and preserve the global geometry, including the distances between clusters and cluster shapes. For example, Figure 3.4 illustrates the typical situation on a cell mixture dataset Tian et al. (2019a): the PCA embedding preserves the global geometry but clusters have high variance; clusters are better separated in the UMAP and t-SNE embeddings, but the global geometric structure of the clusters is lost.

When choosing a clustering algorithm, there is always an underlying tension between respecting data density and data geometry. Density-based methods such as DBSCAN cluster data by connecting together high-density regions, regardless of cluster shape. More traditional approaches such as k-means require that clusters are convex and geometrically well separated. However, in many real data sets, clusters tend to have both nonconvex/elongated geometry and a lack of robust density separation, as shown in Figure 3.2b, which consists of three elongated Gaussian distributions and a bridge connecting two of the distributions. The data set is challenging because it exhibits elongated geometry, but methods relying only on density will fail due to the bridge. Such characteristics are commonly observed in scRNA-seq data, especially for cells sampled from a developmental process, as cell types often trace out elongated structures and frequently lack robust density separation. This elongated geometry phenomenon is due to the fact that all the cell types originate from stem cells through a trajectory-like differentiation process, and the bridge structures are created by the cells in the transition states. For example, circulating monocytes in the Tabula Muris (TM) lung data set Tabula Muris Consortium (2018) have an elongated cluster structure as illustrated by the PCA plot in Figure 3.1a, as do the ductal cells in the TM pancreatic data set (see Figure 3.1c). The UMAP plots of these same data sets illustrate the lack of robust density separation: for TM lung, there is a bridge connecting the alveolar and lung cell types, and also an overlap/bridge between the circulating and invading monocytes (see Figure 3.1b); for TM pancreatic, the pancreatic A and pancreatic PP cells are not well separated. The combination of elongation and poor density separation makes clustering scRNA-seq data sets a challenging task.

Figure 3.1 Tabula Muris data sets have elongated clusters in the PCA embedding and clusters connected with a bridge of points in the UMAP embedding. Panels: (a) PCA TM Lung, (b) UMAP TM Lung, (c) PCA TM Pancreatic, (d) UMAP TM Pancreatic.

Figure 3.2 Toy Data Sets. Panels: (a) Balls, (b) Elongated with Bridge, (c) Swiss Roll, (d) GL Manifold. 3.2a and 3.2b show the 2-dimensional data sets. 3.2c plots the first two coordinates of the Swiss roll. 3.2d shows the 2-dimensional PCA plot of the SO(3) manifolds.

We propose an embedding method based on the power weighted path metric which is well suited to this difficult regime. These metrics balance density and geometry considerations in data learning tasks such as clustering and semi-supervised learning Vincent and Bengio (2003); Bousquet et al.
(2004); Sajama and Orlitsky (2005); Chang and Yeung (2008); Bijral et al. (2011); Moscovich et al. (2017); Mckenzie and Damelin (2019); Little et al. (2020a); Borghini et al. (2020). They have performed well in applications such as imaging Fischer et al. (2001); Zhang and Murphy (2021); Little et al. (2020a); Mckenzie and Damelin (2019), but their usefulness for the analysis of scRNAseq data remains unexplored. Because these metrics are density-sensitive, they reduce cluster variance; in addition, these metrics also capture global distance information, and thus preserve global geometry; see Figure 3.4b. Using the path metric embedding to cluster the data thus yields a clustering method which balances density-based and geometric information.

3.2 Materials and Methods

We first introduce our theoretical framework in Section 3.2.1; Section 3.2.2 then describes the details of the proposed scPMP algorithm, and Section 3.2.3 describes metrics for assessment.

3.2.1 Path Metrics

We first define a family of power-weighted path metrics parametrized by 1 ≤ p < ∞.

Definition 1 Given a discrete data set X, the discrete p-power weighted path metric between a, b ∈ X is defined as
$$\ell_p(a,b) := \inf_{(x_0,\ldots,x_s)} \left( \sum_{i=0}^{s-1} \| x_{i+1} - x_i \|_2^{\,p} \right)^{1/p},$$
where the infimum is taken over all sequences of points x_0, ..., x_s in X with x_0 = a and x_s = b.

Note as p → ∞, ℓ_p converges to the "bottleneck edge" distance
$$\ell_\infty(a,b) := \inf_{(x_0,\ldots,x_s)} \max_i \| x_{i+1} - x_i \|_2,$$
which is well studied in the computer science literature Pollack (1960); Hu (1961); Camerini (1978); Gabow and Tarjan (1988). Two points are close in ℓ_∞ if they are connected by a high-density path through the data, regardless of how far apart the points are. On the other hand, when p = 1, ℓ_1 reduces to Euclidean distance. If path edges are furthermore restricted to lie in a nearest neighbor graph, ℓ_1 approximates the geodesic distance between the points, i.e. the length of the shortest path lying on the underlying data structure, which is a highly useful metric for manifold learning Tenenbaum et al. (2000). The parameter p governs a trade-off between these two extremes, i.e. it determines how to balance density and geometry considerations when determining which data points should be considered close.

The relationship between ℓ_p and density can be made precise. Assume n independent samples from a continuous, nonzero density function f supported on a d-dimensional, compact Riemannian manifold M (a manifold is a smooth, locally linear surface; see Lee (2018)). Then for p > 1, ℓ_p(a, b) converges (after appropriate normalization) to
$$\mathcal{L}_p(a,b) := \inf_{\gamma} \left( \int_0^1 f(\gamma(t))^{-(p-1)/d}\, |\gamma'(t)|\, dt \right)^{1/p} \qquad (3.1)$$
as n → ∞, where the infimum is taken over all smooth curves γ : [0, 1] → M connecting a, b Hwang et al. (2016); Groisman et al. (2022); Fernández et al. (2023). Note |γ'(t)| is simply the arclength element on M, so L_1 reduces to the standard geodesic distance. When p ≠ 1, one obtains a density-weighted geodesic distance. The optimal L_p path is not necessarily the most direct: a detour may be worth it if it allows the path to stay in a high-density region; see Figure 3.3. Thus the metric is density-sensitive, in that distances across high-density regions are smaller than distances across low-density regions; this is a desirable property for many machine learning tasks Chu et al. (2017), including trajectory estimation for developmental cells and cancer cells. However the metric is also geometry preserving, since it is computed by path integrals on M. The parameter p controls the balance of these two properties: when p is small, L_p depends mainly on the geometry of the data, while for large p, L_p is primarily determined by data density.

Figure 3.3 Optimal ℓ_p path between two points in a moon data set. Panels: (a) p = 1.1, (b) p = 2.

Although path metrics are defined in a complete graph, i.e. Definition 1 considers every path in the data connecting a, b, recent work Little et al. (2020b); Groisman et al. (2018); Mckenzie and Damelin (2019); Chu et al. (2020) has established that it is sufficient to only consider paths in a K-nearest neighbors (KNN) graph, as long as K ≥ C log n for a constant C depending on p, d, f, and the geometry of the data. By restricting to a KNN graph, all pairwise path distances can be computed in O(Kn^2) with Dijkstra's algorithm.
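Before turning to the scPMP algorithm, the following minimal sketch evaluates Definition 1 by brute force on a tiny point set, simply to illustrate how increasing p favors paths made of many short edges over a single long edge. It is not how path metrics are computed in practice; the KNN-graph computation used by scPMP is sketched in the next subsection.

```python
import itertools
import numpy as np

# Two endpoints connected either directly or through a nearby intermediate
# point; as p grows, the single long edge is penalized more heavily and the
# two-hop path becomes the optimal one.
X = np.array([[0.0, 0.0], [0.5, 0.05], [1.0, 0.0], [0.5, 2.0]])

def path_cost(points, p):
    steps = np.linalg.norm(np.diff(points, axis=0), axis=1)
    return (steps ** p).sum() ** (1.0 / p)

def ell_p(X, a, b, p):
    """Brute-force l_p: minimize over all orderings of intermediate points."""
    others = [i for i in range(len(X)) if i not in (a, b)]
    best = np.inf
    for r in range(len(others) + 1):
        for mid in itertools.permutations(others, r):
            best = min(best, path_cost(X[[a, *mid, b]], p))
    return best

for p in (1, 2, 4):
    print(p, ell_p(X, 0, 2, p))   # p = 1 picks the direct edge; larger p does not
```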
3.2.2 Algorithm

We consider a noisy data set of n data points x̃_1, ..., x̃_n ∈ R^d, which form the rows of the noisy data matrix X̃ ∈ R^{n×d}. We first denoise the data with a local averaging procedure, which has been shown to be advantageous for manifold plus noise data models García Trillos et al. (2019). More specifically, we replace x̃_i with its local average:
$$x_i := \frac{1}{K_1} \sum_{j \in \mathcal{N}_{i,K_1}} \tilde{x}_j, \qquad \mathcal{N}_{i,K_1} = \{\, j : \tilde{x}_j \text{ is a } K_1\text{NN of } \tilde{x}_i \,\},$$
and let X ∈ R^{n×d} denote the denoised data matrix.

We then fix p and compute the p-power weighted path distance between all points in X to obtain the pairwise distance matrix D_PM ∈ R^{n×n}. More precisely, we let G_{K_2} = (X, E) be the graph on X where x_i, x_j are connected with edge weight E_{ij} = ||x_i − x_j||^p if x_i is a K_2NN of x_j or x_j is a K_2NN of x_i. We then compute D^p_{ij} as the total length of the shortest path connecting x_i, x_j in G_{K_2}, and define D_PM by (D_PM)_{ij} = (D^p_{ij})^{1/p}.
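A minimal sketch of the denoising and path-distance computation just described (Lines 6-12 of Algorithm 3.1 below) is given here. It relies on scikit-learn and SciPy for the neighbor searches and shortest paths; the function name and defaults are illustrative, and this is a sketch rather than the reference scPMP implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors, kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def scpmp_path_distances(X_noisy, p=2.0, K1=12, K2=None):
    n = X_noisy.shape[0]
    K2 = min(n - 1, 500) if K2 is None else K2
    # Denoise: replace each point by the mean of its K1 nearest neighbors
    # (each point is returned as its own nearest neighbor, so it is included).
    _, nbrs = NearestNeighbors(n_neighbors=K1).fit(X_noisy).kneighbors(X_noisy)
    X = X_noisy[nbrs].mean(axis=1)
    # K2-NN graph with edge weights ||x_i - x_j||^p, symmetrized so an edge is
    # kept if either point is a K2-NN of the other.
    G = kneighbors_graph(X, n_neighbors=K2, mode="distance")
    G.data = G.data ** p
    G = G.maximum(G.T)
    # Shortest paths (Dijkstra) give D^p; the 1/p-th root gives D_PM.
    D_p = shortest_path(G, method="D", directed=False)
    return D_p ** (1.0 / p)
```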
We next apply classical multidimensional scaling Borg and Groenen (2005) to obtain a low-dimensional embedding which preserves the path metrics. Specifically, we define the path metric MDS matrix B = −(1/2) J D_PM^(2) J, where J = I_n − (1/n)11^T is the centering matrix, 1 ∈ R^n is a vector of all 1's, and D_PM^(2) is obtained from D_PM by squaring all entries. We let the spectral decomposition of B be denoted by B = VΛV^T, where Λ = diag(λ_1, ..., λ_n) and V = (v_1, ..., v_n) ∈ R^{n×n} contain the eigenvalues and eigenvectors of B in descending order. The embedding dimension r is then chosen as the index i which maximizes the eigenratio λ_i/λ_{i+1} Lam and Yao (2012), with the following restrictions: we constrain 3 ≤ i ≤ 39 and only consider ratios λ_i/λ_{i+1} between "large" eigenvalues, i.e. we require λ_i/λ_1 ≥ 0.01. The scPMP embedding is then defined by Y = (√λ_1 v_1, ..., √λ_r v_r) ∈ R^{n×r}.

Finally, we apply k-means to the scPMP embedding to obtain cluster labels. Specifically, we let ℓ̂_i ∈ [k] = {1, ..., k} be the cluster label of x_i returned by running k-means on Y with k clusters and 20 replicates. Since k-means may return highly imbalanced clusters, cluster sample sizes were constrained to be at least √n/2. Specifically, if k-means returned a tiny cluster, k was increased to k + 1, and the tiny cluster merged with the closest non-trivial cluster. This entire procedure is summarized in the pseudocode in Algorithm 3.1.

Algorithm 3.1 scPMP.
1: Input: noisy data X̃ ∈ R^{n×d}, parameter p, number of clusters k
2: Optional input: K_1, K_2, r_min, r_max, τ
3: (Defaults: 12, n ∧ 500, 3, 39, 0.01)
4: Output: scPMP embedding Y ∈ R^{n×r}, label vector ℓ̂ ∈ [k]^n
5:
6: % Denoise data:
7: x_i ← (1/K_1) Σ_{j ∈ N_{i,K_1}} x̃_j
8:
9: % Compute path metrics:
10: G_{K_2} ← K_2NN graph on X with edge weights ||x_i − x_j||^p
11: D^p_{ij} ← length of shortest path connecting x_i, x_j in G_{K_2}
12: (D_PM)_{ij} ← (D^p_{ij})^{1/p}
13:
14: % Compute MDS embedding of path metrics:
15: B ← −(1/2) J D_PM^(2) J
16: Λ = diag(λ_1, ..., λ_n) ← eigenvalues of B in descending order
17: V = (v_1, ..., v_n) ← corresponding eigenvectors of B
18: r ← index i maximizing λ_i/λ_{i+1} for i satisfying r_min ≤ i ≤ r_max, λ_i/λ_1 ≥ τ
19: Y ← (√λ_1 v_1, ..., √λ_r v_r) ∈ R^{n×r}
20:
21: % Cluster the data:
22: ℓ̂ ← constrained k-means(Y, k)

We note that the computational bottleneck for Algorithm 3.1 is the computation and storage of all pairwise path distances, which has complexity O(n^2 log n) when K_2 = O(log n). However this quadratic cost can be avoided by utilizing a low rank approximation of the squared distance matrix via the Nystrom method Williams and Seeger (2001); Ghojogh et al. (2020); Platt (2005); Yu et al. (2012); Civril et al. (2006). For example, Shamai et al. (2020) propose a fast, quasi-linear implementation of MDS which only requires the computation of path distances from a set of q landmarks, so that the complexity of computing path distances is reduced to O(qn log n). Our implementation of scPMP includes the option to use this landmark-based approximation and is thus highly scalable.

We also note that an important consideration in the fully unsupervised setting is how to select the number of clusters k. This is a rather ill-posed question with multiple reasonable answers due to hierarchical cluster structure. We do not focus on this in the current article, and Algorithm 3.1 assumes the number of clusters is given. However we emphasize that when k is unknown, the scPMP embedding offers a useful tool for selecting a reasonable number of clusters. For example, the clustering step of Algorithm 3.1 (Lines 21-22) can be repeated for a range of candidate k values to obtain candidate clusterings ℓ̂_k; k̂ can then be chosen so that ℓ̂_k̂ optimizes a cluster validity criterion such as the silhouette criterion Kaufman and Rousseeuw (2009); Maechler et al. (2021). Alternatively, one could build a graph with distances computed in the scPMP embedding, and estimate k as the number of small eigenvalues of a corresponding graph Laplacian Von Luxburg (2007); Little et al. (2020a).
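The MDS embedding and eigenratio-based dimension selection can be sketched as follows, continuing from a path-distance matrix such as the one computed above. The defaults r_min = 3, r_max = 39, and tau = 0.01 follow Algorithm 3.1, but plain k-means stands in for the constrained k-means step, so this is an illustrative sketch rather than the released implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def scpmp_embedding(D_pm, r_min=3, r_max=39, tau=0.01):
    n = D_pm.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (D_pm ** 2) @ J                # classical MDS matrix
    lam, V = np.linalg.eigh(B)
    lam, V = lam[::-1], V[:, ::-1]                # eigenvalues in descending order
    # Eigenratio criterion: pick r maximizing lam_r / lam_{r+1} among "large"
    # eigenvalues (lam_r / lam_1 >= tau) with r_min <= r <= r_max.
    cand = [i for i in range(r_min, min(r_max, n - 1) + 1)
            if lam[i - 1] / lam[0] >= tau and lam[i] > 0]
    r = max(cand, key=lambda i: lam[i - 1] / lam[i]) if cand else r_min
    return V[:, :r] * np.sqrt(np.clip(lam[:r], 0, None))

def scpmp_cluster(Y, k):
    # Plain k-means (20 restarts) stands in for the constrained k-means step.
    return KMeans(n_clusters=k, n_init=20).fit_predict(Y)
```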
3.2.3 Assessment

We evaluate the performance of Algorithm 3.1 with respect to (1) cluster quality and (2) geometric fidelity on a collection of benchmarking data sets with ground truth labels ℓ. There are many helpful metrics for the quality of the estimated cluster labels ℓ̂, and we compute the adjusted rand index (ARI), entropy of cluster accuracy (ECA), and entropy of cluster purity (ECP). Definitions of ECA and ECP can be found in Appendix B. We compare our clustering results with the output of k-means, DBSCAN Ester et al. (1996); Xu et al. (1998), k-means on the t-SNE embedding Van der Maaten and Hinton (2008), DBSCAN on the UMAP embedding McInnes et al. (2018), and, for scRNAseq data sets, additionally with the following scRNAseq clustering methods: SC3 Kiselev et al. (2017), scanpy Wolf et al. (2018), RaceID3 Grün et al. (2018), SIMLR Wang et al. (2017) and Seurat Stuart et al. (2019).

Assessing the geometric fidelity of the low-dimensional embedding Y is more delicate; we want to assess whether the embedding procedure preserves the global relative distances between clusters. We first compute the mean of each cluster as in Van der Maaten and Hinton (2008) using the ground truth labels, i.e. μ_j(X) = (1/|I_j|) Σ_{i∈I_j} x_i where I_j = {i : ℓ_i = j}; we then define D_{μ,X}(i, j) = ||μ_{ℓ_i}(X) − μ_{ℓ_j}(X)||_2. Similarly, we compute the means μ_j(Y) in the scPMP embedding and define D_{μ,Y}(i, j) = ||μ_{ℓ_i}(Y) − μ_{ℓ_j}(Y)||_2; we then compare D_{μ,X} and D_{μ,Y}. Specifically, we define the geometric perturbation π by
$$\pi(X, Y, \ell) = \min_{c} \frac{\| D_{\mu,X} - c\, D_{\mu,Y} \|_F^2}{\| D_{\mu,X} \|_F^2},$$
where ||·||_F is the Frobenius norm. The c achieving the minimum is easy to compute, and one obtains
$$\pi(X, Y, \ell) = \frac{\| D_{\mu,X} - c^* D_{\mu,Y} \|_F^2}{\| D_{\mu,X} \|_F^2}, \qquad c^* = \frac{\langle D_{\mu,X},\, D_{\mu,Y} \rangle}{\| D_{\mu,Y} \|_F^2}.$$
We compare π(X, Y, ℓ) with the geometric perturbation of other embedding schemes for X, i.e. with π(X, U, ℓ) for U equal to the UMAP McInnes et al. (2018) and t-SNE Van der Maaten and Hinton (2008) embeddings. Note that π is not always a useful measure: for example, if X consisted of concentric spheres sharing the same center, the metric would be meaningless, as the distance between cluster means would be zero. Nevertheless, in most cases π is a helpful metric for quantifying the preservation of global cluster geometry.
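Because the minimization over c has the closed form given above, the geometric perturbation is straightforward to compute. The sketch below follows the cell-indexed definition of D_{μ,X} and D_{μ,Y}, with X, Y and labels standing for the original data, an embedding, and the ground truth labels ℓ.

```python
import numpy as np

def geometric_perturbation(X, Y, labels):
    """pi(X, Y, l): relative Frobenius error between the cell-indexed
    cluster-mean distance matrices of X and Y, after optimal scaling."""
    classes, inv = np.unique(labels, return_inverse=True)
    mu_X = np.array([X[labels == c].mean(axis=0) for c in classes])
    mu_Y = np.array([Y[labels == c].mean(axis=0) for c in classes])
    # Pairwise distances between cluster means, then expanded to the n x n
    # matrices D_mu(i, j) = || mu_{l_i} - mu_{l_j} ||_2 used in the definition.
    Dk_X = np.linalg.norm(mu_X[:, None] - mu_X[None, :], axis=-1)
    Dk_Y = np.linalg.norm(mu_Y[:, None] - mu_Y[None, :], axis=-1)
    D_X, D_Y = Dk_X[inv][:, inv], Dk_Y[inv][:, inv]
    # Closed-form optimal scale c* = <D_X, D_Y> / ||D_Y||_F^2.
    c_star = (D_X * D_Y).sum() / (D_Y ** 2).sum()
    return ((D_X - c_star * D_Y) ** 2).sum() / (D_X ** 2).sum()
```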
3.3 Results

We apply Algorithm 3.1 to both a collection of toy manifold data sets and a collection of scRNAseq data sets. Results are reported in Sections 3.3.1 and 3.3.2 respectively. The default parameter values reported in Algorithm 3.1 were used on all data sets.

3.3.1 Manifold Data

We apply Algorithm 3.1 for p = 1.5, 2, 4 to the following four manifold data sets:

Balls (n = 1200, d = 2, k = 3): Clusters were created by uniform sampling of 3 overlapping balls in R^2; see Figure 3.2a.

Elongated with bridge (denoted EWB, n = 620, d = 2, k = 3): Clusters were created by sampling from 3 elongated Gaussian distributions. A bridge was added connecting two of the Gaussians; see Figure 3.2b.

Swiss roll (n = 1275, d = 3, k = 3): Clusters were created by uniform sampling from three distinct regions of a Swiss roll; 3-dimensional isotropic Gaussian noise (σ = 0.75) was then added to the data. Figure 3.2c shows the first two data coordinates.

SO(3) manifolds (n = 3000, d = 1000, k = 3): For 1 ≤ i ≤ 3, the 3-dimensional manifold M_i ⊆ R^9 is defined by fixing three eigenvalues D_i = diag(λ_1, λ_2, λ_3) and then defining M_i = ∪_{V∈SO(3)} V D_i V^T, where SO(3) is the special orthogonal group. After fixing D_i, we randomly sample from M_i by taking random orthonormal bases V of R^3. A noisy, high-dimensional embedding was then obtained by adding uniform random noise with standard deviation σ = 0.0075 in 1000 dimensions. Figure 3.2d shows the first two principal components of the data, which exhibit no cluster separation. (A minimal generation sketch for this data set is given at the end of this subsection.)

The data sets were chosen to illustrate various cluster separability characteristics. For the balls, the clusters have good geometric separation but are not separable by density. For the Swiss roll and SO(3), the clusters have a complex and inter-twined geometry but are well separated in terms of density. For EWB, clusters are both elongated and lack robust density separability due to the bridge, and one expects that methods which rely too heavily on either geometry or density will fail.

The ARIs achieved by Algorithm 3.1, k-means based methods, DBSCAN based methods, and Seurat are reported in Table 3.1. See Tables B.1 and B.2 in Appendix B for ECP and ECA. As expected, k-means outperforms all methods on the balls but performs very poorly on all other data sets. DBSCAN and Seurat achieve perfect accuracy on the Swiss roll and SO(3) but perform rather poorly on the balls and EWB, although Seurat does noticeably better than DBSCAN. PM2 is the only method which achieves a high ARI (> 90%) and a low ECP and ECA (< 0.15) on all data sets.

Method          Balls   EWB     Swiss   SO(3)
k-means         0.955   -0.001  0.373   0.010
DBSCAN          0.055   0.550   1       1
UMAP+DBSCAN     0.600   0.645   1       1
t-SNE+k-means   0.895   0.359   1       0.532
Seurat          0.777   0.837   1       1
PM1.5           0.921   0.489   1       0.501
PM2             0.907   0.990   1       1
PM4             0.781   0.584   1       1
Table 3.1 ARI for manifold data.

Table 3.2 reports the geometric perturbation of the embedding produced by Algorithm 3.1 and compares with UMAP and t-SNE. Since Algorithm 3.1 generally selects an embedding dimension r > 2, to ensure a fair comparison the geometric perturbation was computed in both the 2d and r-dimensional (rd) embeddings for all methods, where for UMAP r is the dimension selected by Algorithm 3.1 and for t-SNE r = 3 (note r ≤ 3 was required in the Rtsne implementation). Overall PM1.5 achieved the lowest geometric perturbation, although all methods had small perturbation on the Balls data set and t-SNE had the lowest perturbation on EWB. We point out however that for both the Swiss roll and SO(3), the metric may not be meaningful due to the complicated cluster geometry.

Method      Balls   EWB     Swiss   SO(3)
2d UMAP     0.001   0.006   0.305   0.071
rd UMAP     0       0.033   0.339   0.054
2d t-SNE    0       0.004   0.187   0.171
rd t-SNE    0       0.042   0.074   0.157
2d PM1.5    0       0.033   0.002   0.103
rd PM1.5    0       0.023   0.011   0.154
2d PM2      0       0.146   0.025   0.156
rd PM2      0       0.068   0.025   0.179
2d PM4      0.003   0.191   0.056   0.194
rd PM4      0.004   0.157   0.056   0.194
Table 3.2 Geometric perturbation for manifold data. The rd UMAP embeddings were computed with an embedding dimension of r = 5 for the balls, EWB and Swiss roll and r = 7 for SO(3), which corresponded to the estimated dimension for both PM1.5 and PM2. For t-SNE, r = 3 for all data sets.
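The SO(3) data set described above can be generated in a few lines of NumPy: random orthonormal bases are drawn via QR decompositions of Gaussian matrices, the symmetric matrices V D_i V^T are flattened to R^9 and padded to 1000 dimensions, and uniform noise is added. The eigenvalue triples below are arbitrary placeholders, since the specific values of D_i are not stated in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_so3_cluster(eigvals, n_points, ambient_dim=1000, sigma=0.0075):
    """Sample from M_i = { V diag(eigvals) V^T : V in SO(3) }, flattened to R^9,
    padded to R^ambient_dim, with uniform noise of standard deviation sigma."""
    D = np.diag(eigvals)
    pts = []
    for _ in range(n_points):
        # Random orthonormal basis of R^3 via QR of a Gaussian matrix.
        Q, R = np.linalg.qr(rng.standard_normal((3, 3)))
        Q = Q * np.sign(np.diag(R))
        pts.append((Q @ D @ Q.T).ravel())
    X = np.zeros((n_points, ambient_dim))
    X[:, :9] = np.array(pts)
    half_width = sigma * np.sqrt(3.0)      # uniform on [-a, a] has sd a / sqrt(3)
    return X + rng.uniform(-half_width, half_width, size=X.shape)

# Three clusters of 1000 points each (eigenvalue triples chosen for illustration).
X = np.vstack([sample_so3_cluster(ev, 1000)
               for ev in ([1.0, 2.0, 3.0], [1.0, 1.5, 4.0], [0.5, 2.5, 3.5])])
```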
3.3.2 scRNAseq Data

We apply Algorithm 3.1 for p = 1.5, 2, 4 to the following synthetic scRNAseq data sets:

RNA mixture: Benchmarking scRNAseq data sets from Tian et al. (2019a). RNAmix1 was processed with CEL-seq2 and has n = 296 cells and d = 14687 genes. RNAmix2 was processed with Sort-seq and has n = 340 cells and d = 14224 genes. For the creation of the two data sets, RNA was extracted in bulk for each of the following cell lines: H2228, H1975, HCC827. Then the RNA was mixed in k = 7 different proportions (each defining a ground truth cluster label), diluted to single cell equivalent amounts ranging from 3.75pg to 30pg, and processed using CEL-seq2 and SORT-seq. Supplemental information, including the ground truth geometric structure, is available with the original benchmark publication Tian et al. (2019a).

Simulated beta: Simulated data set of n = 473 beta cells and d = 2279 genes, created based on SAVER Huang et al. (2018) and scImpute Li and Li (2018). First, we subset the Baron's Pancreatic data set Baron et al. (2016) to include only Beta cells. As in Li and Li (2018), we randomly choose 10% of the genes to operate as marker genes. Then, we split the cells into k = 3 clusters, and each cluster is assigned a different group of marker genes. For each cluster we scale up the mean expression of its marker genes. Lastly, to simulate the dropout effect, as in Huang et al. (2018), we multiply each cell by an efficiency loss constant drawn from Gamma(10, 100). Using S to refer to the data matrix resulting from the above steps, the final simulated data X is obtained by letting X_ij be drawn from Poisson(S_ij). (A minimal sketch of this simulation is given after the data set descriptions below.)

In addition to the synthetic data, we evaluate the performance of Algorithm 3.1 on the following real scRNAseq data sets:

Cell mixture data set: Another benchmarking data set from Tian et al. (2019a), consisting of a mixture of k = 5 cell lines created with the 10x sequencing platform. The cell line identity of a cell is also its true cluster label. The data set consists of n = 3822 cells and d = 11786 genes; we removed multiplets based on the provided metadata file and kept the 3000 most variable genes after SCT transformation Hafemeister and Satija (2019); Choudhary and Satija (2022).

Baron's pancreatic: Human pancreatic data set generated by Baron et al. (2016). After quality control and SAVER imputation, there are d = 14738 genes and n = 1844 cells. For analysis purposes, cells that belong to a group with fewer than 70 members were filtered out, reducing to k = 8 cell types. Also, we kept only the 3000 most variable genes after SCT transformation Hafemeister and Satija (2019); Choudhary and Satija (2022). The cell types associated with each cell were obtained by an iterative hierarchical clustering method that restricts genes enriched in one cell type from being used to separate other cell types. The enriched markers in every cluster defined the cell type of the cells that belong in that cluster.

Tabula Muris data sets: Mouse scRNAseq data for different tissues and organs Tabula Muris Consortium (2018). We select the pancreatic data (TM Panc) with n = 1444 cells and d = 23433 genes and the lung data (TM Lung) with n = 453 cells and d = 23433 genes. Both data sets have k = 7 different cell types, which were characterized by a FACS-based full length transcript analysis.

PBMC4k data set: This data set includes the gene expression of Peripheral Blood Mononuclear Cells. The raw data are available from 10X Genomics. After quality control, SAVER imputation, and removing the two smallest cell types, there are d = 16655 genes and n = 4316 cells in the dataset. Also, we merge CD8+ T-cells and CD4+ T-cells into one type named T-cells, resulting in k = 4 cell types. The ground truth cell types are provided by SingleR annotation after marker gene verification in github.com/SingleR.
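The simulated beta construction described above can be sketched as follows. Here mean_expr stands for the per-gene mean expression of the Baron Beta cells, the Gamma(10, 100) efficiency loss is interpreted with a shape/rate parameterization (mean 0.1), and the marker scale-up factor is a placeholder, since its exact value is not stated in the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_beta(mean_expr, n_cells=473, k=3, marker_frac=0.10, scale_up=3.0):
    """Split cells into k clusters, up-scale a disjoint marker gene set in each
    cluster, apply a per-cell efficiency loss, and draw Poisson counts.
    Returns a cells x genes count matrix and the cluster labels."""
    n_genes = mean_expr.shape[0]
    markers = rng.choice(n_genes, int(marker_frac * n_genes), replace=False)
    marker_groups = np.array_split(markers, k)
    cluster = rng.integers(0, k, n_cells)
    S = np.tile(mean_expr, (n_cells, 1))
    for c in range(k):
        # scale_up is a placeholder for the (unstated) marker scaling factor.
        S[np.ix_(cluster == c, marker_groups[c])] *= scale_up
    # Per-cell efficiency loss ~ Gamma(shape=10, rate=100), i.e. mean 0.1,
    # assuming a shape/rate reading of "Gamma(10, 100)".
    eff = rng.gamma(shape=10.0, scale=1.0 / 100.0, size=(n_cells, 1))
    return rng.poisson(S * eff), cluster
```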
Details about the pre-processing of the data sets can be found in Appendix A. For the following UMAP and t-SNE results, Linnorm normalization was applied without denoising, as this normalization gave the best results. Note that Seurat_def refers to the results of the entire Seurat pipeline, whereas Seurat refers to the result of using Seurat clustering on data with the same processing and normalization as for PM. The embedding dimension r selected by Algorithm 3.1 ranged from 3 to 7 for PM1.5 and PM2, and from 3 to 11 for PM4.

Table 3.3 reports the ARI achieved by Algorithm 3.1 and other methods; see Tables B.6 and B.5 in Appendix B for ECP and ECA.

Method          RNA1    RNA2    TMLung  Beta    TMPanc  BaronPanc  PBMC4K  CellMix
SC3             0.637   0.827   0.798   0.969   0.894   0.767      0.889   1
scanpy          0.620   0.825   0.796   0.898   0.615   0.966      0.977   1
RaceID3         0.730   0.520   0.900   0.714   0.751   0.651      0.763   1
SIMLR           0.878   0.792   0.727   0.969   0.599   0.698      0.705   1
Seurat          0.792   0.667   0.843   0.901   0.547   0.941      0.889   0.993
Seurat_def      0.714   0.785   0.764   0.907   0.798   0.971      0.975   1
k-means         0.921   0.786   0.848   0.957   0.840   0.662      0.747   1
DBSCAN          0.952   0.826   0.587   0.541   0.734   0.724      0.889   1
UMAP+DBSCAN     0.926   0.892   0.619   0.946   0.893   0.848      0.974   1
t-SNE+k-means   0.943   0.915   0.753   0.928   0.620   0.641      0.596   0.878
PM1.5           0.939   0.924   0.888   0.969   0.626   0.804      0.754   1
PM2             0.939   0.973   0.808   0.969   0.921   0.969      0.757   1
PM4             0.939   0.939   0.731   0.975   0.775   0.853      0.978   1
Table 3.3 ARI for RNA data.

The path metric methods perform equally well or better than the rest of the methods. Once again PM2 exhibits the best overall performance, with a high ARI (≥ 90%) on all data sets except TM lung and PBMC4K; the next best method is PM4, which achieves a high ARI on all but 3 data sets. SC3, RaceID3, and SIMLR had a low ARI (< 90%) on 6 of the 8 data sets; scanpy, Seurat, k-means, and t-SNE+k-means had a low ARI on 5 of the 8 data sets; Seurat_def, UMAP+DBSCAN, and PM1.5 had a low ARI on 4 of the 8 data sets. These results indicate that incorporating both density-based and geometric information when determining similarity generally leads to more robust results for scRNA-seq data. Moreover, PM2 achieves the best median ECP and median ECA values across all RNA data sets. Although the optimal balance depends on the data set (for example PBMC4K does best with p = 4, while TMLung does best with p = 1.5), path metrics with a moderate p exhibit the best performance across a wide range of data sets.

For BaronPanc we observe that Seurat_def achieves a slightly higher ARI than all the reported path metric methods (p = 1.5, 2, 4). However, a significant advantage of Algorithm 3.1 over Seurat is its high clustering performance across a wide range of sample sizes. To demonstrate this claim we compare the ARI results on down-sampled versions of BaronPanc. We selected a stratified sample of 50%, 25% and 10% of the cells of the BaronPanc data set. The results can be found in Table B.4 of Appendix B. We observed no ARI deterioration for Algorithm 3.1 for the 50% and 25% down-sampled data sets and only a moderate decrease for the 10% down-sampled data set (ARI of 0.67 at 10% downsampling for p = 1.5). On the contrary, there is significant ARI deterioration both for Seurat and Seurat_def; in particular, at 10% downsampling the ARI deteriorates to 0.405 for Seurat and to 0.185 for Seurat_def. Notice that in the 10% down-sampled data set, we use regular k-means for PM2 to allow for the prediction of smaller sized clusters.
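The down-sampling comparison above can be reproduced schematically as below, where cluster_fn stands for any clustering pipeline (scPMP, Seurat, or another method) mapping an expression matrix to labels; the stratified sampling is done per annotated cell type and the ARI is computed against the retained annotations.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

def downsample_ari(X, labels, cluster_fn, fractions=(0.5, 0.25, 0.10), seed=0):
    """Stratified down-sampling by cell type, re-clustering, and ARI per fraction."""
    labels = np.asarray(labels)
    results = {}
    for frac in fractions:
        idx, _ = train_test_split(np.arange(len(labels)), train_size=frac,
                                  stratify=labels, random_state=seed)
        pred = cluster_fn(X[idx])                 # placeholder clustering pipeline
        results[frac] = adjusted_rand_score(labels[idx], pred)
    return results
```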
for Seurat and to 0.185 for Seurat_def. Notice that in the 10% down-sampled data set, we use regular k-means for PM2 to allow for the prediction of smaller clusters. We also investigated whether we could learn the ground truth number of clusters by optimizing the silhouette criterion in the scPMP embedding, and compared this with the number of clusters obtained from Seurat using the default resolution; see Table B.3 in Appendix B. For 4 out of the 8 RNA data sets evaluated in this article (RNAMix1, RNAMix2, BaronPanc, and CellMix), this procedure on PM2 yielded an estimate for k which matched the number of distinct annotated labels. On the other hand, Seurat correctly estimates the number of clusters for only 2 out of the 8 RNA data sets (RNAMix1 and TMLung).

Table 3.4 reports the geometric perturbation. We see that increasing p increases the geometric perturbation, with PM1.5 yielding the smallest geometric perturbation on all data sets. Although PM1.5 is the clear winner in terms of this metric, PM2 still performed favorably with respect to UMAP and t-SNE. Indeed, rd PM2 had lower geometric perturbation than UMAP on all but one data set (TMPanc), and lower geometric perturbation than t-SNE on the majority of data sets.

Figure 3.4 shows the PCA, PM2, UMAP, and t-SNE embeddings of the Cell Mix data set, as well as a tree structure on the clusters. The tree structure was obtained by first computing the cluster means in the embedding and then applying hierarchical clustering with average linkage to the means. The PCA tree (Figure 3.4(e)) was computed using 40 PCs so that it accurately reflects the global geometry of the clusters. Interestingly, path metrics recover the same hierarchical structure on the clusters as PCA: the cell types HCC827 and H1975 are the most similar, and H838 is the most distinct. This is what one would expect given more extensive biological information about the cell types, since H838 is the only cell line here derived from a metastatic site (lymph node) of a male patient, while both HCC827 and H1975 originated from the primary site of female lung cancer patients. However, neither UMAP nor t-SNE gives the correct hierarchical representation of the clusters, because both methods struggle to preserve global geometric structure, as observed in numerous studies Kobak and Berens (2019); Cooley et al. (2020).

Figure 3.5 Processing and clustering time (complete_rt, in minutes) for the PBMC4K and Baron's Pancreatic data sets, by method.

Furthermore, Figure 3.5 records the runtime for processing and clustering (in minutes) of the Baron's Pancreatic (n = 1844) and PBMC4K (n = 4316) data sets. For PBMC4k (our largest data set), we use the landmark-based approximation of path distances for scalability. All the PM methods run in less than a minute on BaronPanc and less than 6 minutes on PBMC4k; RaceID3, scanpy, and Seurat were also fast. SC3 and SIMLR had long runtimes, requiring 37.9 and 91.1 minutes respectively for PBMC4k.

3.4 Discussion

This article applies a new theoretical framework to the analysis of single cell RNA-seq data which is based on the computation of optimal paths.
Path metrics encode both geometric and density-based information, and the resulting low-dimensional embeddings simultaneously preserve density-based cluster structure as well as global cluster orientation. The method exhibits competitive performance when applied to numerous benchmarks, and the implementation is scalable to large data sets. Although we investigated other choices of 𝑝, we found that 𝑝 = 2 performed well on a wide range of RNA data sets, indicating that 𝑝 = 2 is an appropriate balance between density and geometry for this application. Future research will explore ways to make the method more robust to noise and adapting the method to the semi-supervised context. 34 BIBLIOGRAPHY A, Z., Muñoz-Manchado, A. B., Codeluppi, S., P, L., G, L. M., A, J., S, M., H, M., L, H., C, B., C, R., G, C.-B., J, H.-L., and and, L. S. (2015). Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science (New York, N.Y.), 347:1138–1142. Baron, M. et al. (2016). A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Systems, 3(4):346–360. Bijral, A., Ratliff, N., and Srebro, N. (2011). Semi-supervised learning with density based distances. In UAI, pages 43–50. Borg, I. and Groenen, P. J. (2005). Modern multidimensional scaling: Theory and applications. Springer Science & Business Media. Borghini, E., Fernández, X., Groisman, P., and Mindlin, G. (2020). Intrinsic persistent homology via density-based metric learning. arXiv preprint arXiv:2012.07621. Bousquet, O., Chapelle, O., and Hein, M. (2004). Measure based regularization. In NIPS, pages 1221–1228. Camerini, P. (1978). The min-max spanning tree problem and some extensions. Information Processing Letters, 1(10-14). Chang, H. and Yeung, D.-Y. (2008). Robust path-based spectral clustering. Pattern Recognition, 41(1):191–203. Choudhary, S. and Satija, R. (2022). Comparison and evaluation of statistical error models for scrna-seq. Genome Biology, 23. Chu, T., Miller, G., and Sheehy, D. (2017). Exploration of a graph-based density sensitive metric. arXiv preprint arXiv:1709.07797. Chu, T., Miller, G., and Sheehy, D. (2020). Exact computation of a manifold metric, via Lipschitz embeddings and shortest paths on a graph. In SODA, pages 411–425. Civril, A., Magdon-Ismail, M., and Bocek-Rivele, E. (2006). Ssde: Fast graph drawing using sampled spectral distance embedding. In International Symposium on Graph Drawing, pages 30–41. Springer. CLevine, J., Simonds, E., Bendall, S., Davis, K., Amir, E.-a., Tadmor, M., Litvin, O., Fienberg, H., Jager, A., Zunder, E., Finck, R., Gedman, A., Radtke, I., Downing, J., Pe’er, D., and Nolan, G. (2015). Data-driven phenotypic dissection of aml reveals progenitor-like cells that correlate with prognosis. Cell. 35 Cooley, S. M., Hamilton, T., Ray, J. C. J., and Deeds, E. J. (2020). A novel metric reveals previously unrecognized distortion in dimensionality reduction of scrna-seq data. bioRxiv. Ding, J., Wen, H., Tang, W., Liu, R., Li, Z., Venegas, J., Su, R., Molho, D., Jin, W., Zuo, W., et al. (2022). Dance: A deep learning library and benchmark for single-cell analysis. bioRxiv, pages 2022–10. Eberwine, J., Yeh, H., Miyashiro, K., Cao, Y., Nair, S., Finnell, R., Zettel, M., and Coleman, P. (1992). Analysis of gene expression in single live neurons. Proceedings of the National Academy of Sciences, 89(7):3010–3014. Ester, M., Kriegel, H.-P., Sander, J., Xu, X., et al. (1996). 
A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd, volume 96, pages 226–231. Fernández, X., Borghini, E., Mindlin, G., and Groisman, P. (2023). Intrinsic persistent homology via density-based metric learning. Journal of Machine Learning Research, 24(75):1–42. Fischer, B., Zöller, T., and Buhmann, J. M. (2001). Path based pairwise data clustering with application to texture segmentation. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 235–250. Springer. Gabow, H. and Tarjan, R. (1988). Algorithms for two bottleneck optimization problems. Journal of Algorithms, 9:411–417. García Trillos, N., Sanz-Alonso, D., and Yang, R. (2019). Local regularization of noisy point clouds: Improved global geometric estimates and data analysis. Journal of Machine Learning Research, 20(136):1–37. Ghojogh, B., Ghodsi, A., Karray, F., and Crowley, M. (2020). Multidimensional scaling, sammon mapping, and isomap: Tutorial and survey. arXiv preprint arXiv:2009.08136. Groisman, P., Jonckheere, M., and Sapienza, F. (2018). Nonhomogeneous Euclidean first-passage percolation and distance learning. arXiv preprint arXiv:1810.09398. Groisman, P., Jonckheere, M., and Sapienza, F. (2022). Nonhomogeneous euclidean first-passage percolation and distance learning. Bernoulli, 28(1):255–276. Grün, D. et al. (2018). Revealing dynamics of gene expression variability in cell state space. Nature methods, 17:45–49. Hafemeister, C. and Satija, R. (2019). Normalization and variance stabilization of single-cell rna-seq data using regularized negative binomial regression. Genome Biology, 20(1). Herman, J. S., Grün, D., et al. (2018). Fateid infers cell fate bias in multipotent progenitors from single-cell rna-seq data. Nature methods, 15(5):379. 36 Hu, T. (1961). Letter to the editor: The maximum capacity route problem. Operations Research, 9(6):898–900. Huang, M., Wang, J., Torre, E., Dueck, H., Shaffer, S., Bonasio, R., Murray, J. I., Raj, A., Li, M., and Zhang, N. R. (2018). Saver: gene expression recovery for single-cell rna sequencing. Nature methods, 15(7):539–542. Hwang, S., Damelin, S., and Hero, A. (2016). Shortest path through random points. The Annals of Applied Probability, 26(5):2791–2823. Kaufman, L. and Rousseeuw, P. (2009). Finding Groups in Data: An Introduction to Cluster Analysis. Kiselev, V. Y., Kirschner, K., Schaub, M. T., Andrews, T., Yiu, A., Chandra, T., Natarajan, K. N., Reik, W., Barahona, M., Green, A. R., and Hemberg, M. (2017). SC3: consensus clustering of single-cell RNA-seq data. Nature Methods, 14:483–486. Kobak, D. and Berens, P. (2019). The art of using t-sne for single-cell transcriptomics. Nature Communications, 10:2041–1723. Lam, C. and Yao, Q. (2012). Factor modeling for high-dimensional time series: inference for the number of factors. The Annals of Statistics, pages 694–726. Lee, J. M. (2018). Introduction to Riemannian manifolds. Springer. Li, W. V. and Li, J. J. (2018). An accurate and robust imputation method scimpute for single-cell rna-seq data. Nature communications, 9(1):1–9. Lin, P., Troup, M., and Ho, J. W. K. (2017). CIDR: Ultrafast and accurate clustering through imputation for single-cell RNA-seq data. Genome Biology, 18. Little, A., Maggioni, M., and Murphy, J. (2020a). Path-based spectral clustering: Guarantees, robustness to outliers, and fast algorithms. Journal of Machine Learning Research, 21(6):1–66. Little, A., McKenzie, D., and Murphy, J. (2020b). 
Balancing geometry and density: Path distances on high-dimensional data. arXiv preprint arXiv:2012.09385. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I., and Yosef, N. (2018). Deep generative modeling for single-cell transcriptomics. Nature methods, 15(12):1053–1058. Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., and Hornik, K. (2021). cluster: Cluster Analysis Basics and Extensions. McInnes, L., Healy, J., and Melville, J. (2018). Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. 37 Mckenzie, D. and Damelin, S. (2019). Power weighted shortest paths for clustering Euclidean data. Foundations of Data Science, 1(3):307. Moon, K. R., van Dijk, D., Wang, Z., Gigante, S., Burkhardt, D. B., Chen, W. S., Yim, K., Elzen, A. v. d., Hirn, M. J., Coifman, R. R., Ivanova, N. B., Wolf, G., and Krishnaswamy, S. (2019). Visualizing structure and transitions in high-dimensional biological data. Nature Biotechnology, 37(12):1482–1492. Moscovich, A., Jaffe, A., and B.Nadler (2017). Minimax-optimal semi-supervised regression on unknown manifolds. In AISTATS, pages 933–942. Platt, J. (2005). Fastmap, metricmap, and landmark mds are all nyström algorithms. In International Workshop on Artificial Intelligence and Statistics, pages 261–268. PMLR. Pollack, M. (1960). Letter to the editor: The maximum capacity through a network. Operations Research, 8(5):733–736. Sajama and Orlitsky, A. (2005). Estimating and computing density based distance metrics. In ICML, pages 760–767. Saliba, A.-E., Westermann, A. J., Gorski, S. A., and Vogel, J. (2014). Single-cell RNA-seq: advances and future challenges. Nucleic Acids Research, 42(14):8845–8860. Shamai, G., Zibulevsky, M., and Kimmel, R. (2020). Efficient inter-geodesic distance computation and fast classical scaling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(1):74–85. Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., III, W. M. M., Hao, Y., Stoeckius, M., Smibert, P., and Satija, R. (2019). Comprehensive integration of single-cell data. Cell, 177:1888–1902. Tabula Muris Consortium, Overall coordination, L. c. e. a. (2018). Single-cell transcriptomics of 20 mouse organs creates a tabula muris. Nature, 562:367–372. Tang, F., Barbacioru, C., Wang, Y., Nordman, E., Lee, C., Xu, N., Wang, X., Bodeau, J., Tuch, B. B., Siddiqui, A., et al. (2009). mrna-seq whole-transcriptome analysis of a single cell. Nature methods, 6(5):377–382. Tang, J., Liu, J., Zhang, M., and Mei, Q. (2016). Visualizing large-scale and high-dimensional data. In Proceedings of the 25th international conference on world wide web, pages 287–297. Tenenbaum, J., Silva, V. D., and Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323. Tian, L., Dong, X., Freytag, S., Lê Cao, K.-A., Su, S., JalalAbadi, A., Amann-Zalcenstein, D., 38 Weber, T. S., Seidi, A., Jabbari, J. S., et al. (2019a). Benchmarking single cell rna-sequencing analysis pipelines using mixture control experiments. Nature methods, 16(6):479–487. Tian, T., Wan, J., Song, Q., and Wei, Z. (2019b). Clustering single-cell rna-seq data with a model-based deep learning approach. Nature Machine Intelligence, 1(4):191–198. Van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-sne. Journal of machine learning research, 9(11). Vincent, P. and Bengio, Y. (2003). Density-sensitive metrics and kernels. In Snowbird Learning Workshop. Von Luxburg, U. (2007). 
A tutorial on spectral clustering. Statistics and computing, 17(4):395–416. Wang, B., Zhu, J., Pierson, E., Ramazzotti, D., and Batzoglou, S. (2017). Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nature Methods, 14:414–416. Wang, J., Ma, A., Chang, Y., Gong, J., Jiang, Y., Qi, R., Wang, C., Fu, H., Ma, Q., and Xu, D. (2021). scgnn is a novel graph neural network framework for single-cell rna-seq analyses. Nature communications, 12(1):1882. Williams, C. and Seeger, M. (2001). Using the nyström method to speed up kernel machines. In Proceedings of the 14th annual conference on neural information processing systems, number CONF, pages 682–688. Wolf, F. A., Angerer, P., and Theis, F. J. (2018). SCANPY: large-scale single-cell gene expression data analysis. Genome Biology, 19. Xu, C. and Su, Z. (2015). Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics, 31(12):1974–1980. Xu, X., Ester, M., Kriegel, H.-P., and Sander, J. (1998). A distribution-based clustering algorithm for mining in large spatial databases. In Proceedings 14th International Conference on Data Engineering, pages 324–331. IEEE. Yip, S. H., Wang, P., Kocher, J.-P. A., Sham, P. C., and Wang, J. (2017). Linnorm: improved statistical analysis for single cell RNA-seq expression data. Nucleic Acids Research, 45(22):e179– e179. Yu, H., Zhao, X., Zhang, X., and Yang, Y. (2012). Isomap using nyström method with incremental sampling. Advances in Information Sciences & Service Sciences, 4(12). Zhang, S. and Murphy, J. (2021). Hyperspectral image clustering with spatially-regularized ultra- metrics. Remote Sensing, 13(5):955. 39 Zhu, X., Zhang, J., Xu, Y., Wang, J., Peng, X., and Li, H.-D. (2020). Single-cell clustering based on shared nearest neighbor and graph partitioning. Interdisciplinary Sciences: Computational Life Sciences, 12:117–130. žurauskienė, J. and Yau, C. (2016). pcaReduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics, 17. 40 APPENDIX A DATA PREPROCESSING In this section the pre-processing of all RNA data sets is described. The main preprocessing steps are quality control, imputation with SAVER Huang et al. (2018), and normalization. Below we provide information about quality control and imputation and then we describe how we used those steps according to the guidelines of each method. A.1 Data availability The raw data of Cellmix and RNAmix are downloaded from GEO under accession code GSE118767, and the preprocessed data are available at their github repository. The PBMC4K data is available at 10x Genomics’s website. The Baron’s pancreatic data is available in GEO with the access code GSM2230757. The mouse tissue scRNAseq data sets are accessible on Figshare. A.2 Main steps Quality Control: Quality control is applied on RNAmix1, RNAmix2, Cellmix, BaronPanc, PMC4K, Beta. Specifically, cells where at most 200 genes are expressed are filtered out. Also, only genes that are expressed in more than 3 cells are included in the data set. In addition, cells with percentage of expressed mitochondrial genes greater than 20% are excluded. The data sets TMpanc and TMLung as found in Figshare have passed a quality control check with cutoffs of at least 500 genes and 50,000 reads, so no additional filtering was applied. Imputation: Imputation with SAVER Huang et al. (2018) was applied to all RNA seq data sets apart from Cellmix. 
After removing multiplets the Cellmix data set included high quality data and every clustering method achieved high ARI, suggesting no need for further processing and imputation. A.3 Preprocessing per method Path metrics (PM), 𝑘-means, DBSCAN: After quality control and imputation, we normalize the data. RNAmix1, RNAmix2, TMLung, Beta, TMPanc, PBMC4K were row normalized and log transformed (data matrix had cells in rows and genes in columns). We then restrict to the top 41 2000 high variance genes. For the BaronPanc and CellMix, which have large sample size, SCT transformation was applied instead and the top 3000 variable genes were kept Hafemeister and Satija (2019); Choudhary and Satija (2022). When needed, we rescale genes where variances were extremely high. As a next step we apply PCA for dimension reduction, keeping the top 40 PC’s. Finally, denoising is applied by replacing each point with the mean of its local neighborhood, using a neighborhood size of 𝐾 = 12 points. For very large data sets, one may want to use a larger 𝐾. UMAP+DBSCAN, 𝑡-SNE+𝑘-means: After quality control and imputation, we apply Linnnorm Yip et al. (2017) to all data sets. Then, we restrict to the top 2000 high variance genes. When needed, we rescale genes with extremely high variance. Finally, we apply PCA for dimension reduction, keeping the top 40 PC’s. Seurat: For this method, we process the data as for PM and then use Seurat’sStuart et al. (2019) functions to find neighboring points and cluster them. Notice that here we adjust the parameter ‘res’, to retrieve the correct number of clusters. Seurat_def: We follow the suggested processing and clustering workflow of Seurat Stuart et al. (2019) for all data sets. Notice that we normalize BaronPanc and CellMix with the SCT method Hafemeister and Satija (2019); Choudhary and Satija (2022). Then data sets are clustered with adjusted resolution parameter, to retrieve the correct number of clusters. SC3: After quality control and imputation we normalize the information of every cell and multiply by 10000. Then we use the log of the data for clustering with SC3 Kiselev et al. (2017). Exception to this are the BaronPanc and CellMix data set, for which we use SCT normalization. scanpy: After quality control and imputation we use the lognormalization of scanpy Wolf et al. (2018). Exception to this are the BaronPanc and CellMix data set, for which we use SCT normalization. RaceID3: We apply quality control on the cells of the counts of the data set. RaceID3 Herman et al. (2018); Grün et al. (2018) applies filtering and normalization in one step, which we adjust to have about the same amount of cells and genes as with other methods. Notice that we do not apply imputation because imputed data would not be counts, which are the required input of RaceID3. 42 SIMLR: For SIMLR Wang et al. (2017) After quality control and imputation we normalize the information of every cell and multiply by 10000 and use the log of those data. Exception to this are the BaronPanc and CellMix data set, for which we use SCT normalization. 43 APPENDIX B ADDITIONAL CLUSTERING RESULTS Here we present more clustering evaluation results based on Entropy of Cluster Accuracy (ECA) and Entropy of cluster Purity (ECP). The ECA can quantify the variety of true labels within a predicted cluster and ECP can quantify the variety of predicted cluster labels within a true group. Definition 2 Let 𝑁 represent the number of true groups and 𝑀 the number of predicted clusters. 
Let 𝑁 𝑗 be the number of true groups with data points within the 𝑗 th predicted cluster and similarly let 𝑀 𝑗 be the number of predicted clusters with data points within the 𝑗 th true group. Finally let 𝑝(𝑥 𝑗 ) denote the proportion of data points belonging to the 𝑗 th true group that are within a given 𝑗 th predicted cluster and let 𝑝𝑖 (𝑦 𝑗 ) denote the proportion of data points of 𝑗 th predicted cluster that are within a given 𝑖 th true group. Then: 𝑀 𝑁𝑖 1 ∑︁ ∑︁ 𝐸𝐶 𝐴 = − 𝑝𝑖 (𝑥 𝑗 )𝑙𝑜𝑔( 𝑝(𝑥 𝑗 )) 𝑀 𝑖=1 𝑗=1 𝑁 𝑀 1 ∑︁ ∑︁𝑖 𝐸𝐶𝑃 = − 𝑝𝑖 (𝑦 𝑗 )𝑙𝑜𝑔( 𝑝(𝑦 𝑗 )) 𝑁 𝑖=1 𝑗=1 For a given clustering, low ECA means that data points in a predicted cluster originate from the same true group. On the other hand, low ECP indicates that almost all the data points in a true group were assigned the same clustering label. Use of ECP and ECA in clustering of scRNAseq data was also found in Tian et al. (2019a). Method Balls EWB Swiss SO(3) 𝑘-means 0.082 1.050 0.588 1.084 DBSCAN 0.385 0.114 0 0 UMAP+DBSCAN 0.941 0.695 0 0 𝑡-SNE+ 𝑘-means 0.153 0.630 0 0.440 Seurat 0.255 0.193 0 0 PM1.5 0.123 0.447 0 0.460 PM2 0.142 0.020 0 0 PM4 0.253 0.268 0 0 Table B.1 ECP for manifold data. 44 Method Balls EWB Swiss SO(3) 𝑘-means 0.082 1.096 0.633 1.089 DBSCAN 0.362 0.231 0 0 UMAP+DBSCAN 0.200 0.014 0 0 𝑡-SNE+ 𝑘-means 0.147 0.582 0 0.440 Seurat 0.250 0.183 0 0 PM1.5 0.120 0.461 0 0.462 PM2 0.138 0.020 0 0 PM4 0.248 0.291 0 0 Table B.2 ECA for manifold data. Method RNA1 RNA2 TMLung Beta TMPanc BaronPanc PBMC4k CellMix Seurat_res=0.8 7 8 7 7 11 12 13 14 PM1.5 12 11 8 4 15 9 4 5 PM2 7 7 9 4 5 8 5 5 PM4 8 8 16 4 5 7 4 5 True 𝑘 7 7 7 3 7 8 4 5 Table B.3 Predicted number of clusters for Seurat and Path metrics for RNA data (𝑘 is the true number of clusters). Dataset Seurat Seurat_def PM1.5 PM2 PM4 100% of Baron’s Pancreatic 0.941 0.971 0.804 0.969 0.853 50% of Baron’s Pancreatic 0.880 0.844 0.969 0.969 0.969 25% of Baron’s Pancreatic 0.973 0.705 0.973 0.973 0.973 10% of Baron’s Pancreatic 0.410 0.185 0.674 0.939* 0.804 Table B.4 Downsampling results. Method RNA1 RNA2 TMLung Beta TMPanc BaronPanc PBMC4k CellMix SC3 0.289 0.114 0.228 0.058 0.132 0.368 0.328 0 Scanpy 0.481 0.242 0.314 0.147 0.309 0.093 0.054 0 RaceID3 0.336 0.621 0.168 0.342 0.122 0.181 0.207 0.000 SIMLR 0.163 0.294 0.263 0.057 0.407 0.104 0.190 0 Seurat 0.319 0.230 0.193 0.146 0.290 0.097 0.265 0 Seurat_def 0.256 0.270 0.423 0.128 0.289 0.112 0.106 0 𝑘-means 0.147 0.268 0.221 0.080 0.194 0.164 0.193 0 DBSCAN 0.090 0.188 0.368 0.440 0.202 0.146 0.262 0 UMAP+db 0.078 0.151 0.449 0.019 0.124 0.076 0.163 0 𝑡-SNE+ 𝑘-means 0.104 0.137 0.426 0.085 0.259 0.171 0.187 0.126 PM1.5 0.110 0.146 0.180 0.061 0.305 0.147 0.196 0 PM2 0.110 0.071 0.362 0.061 0.279 0.077 0.195 0 PM4 0.110 0.123 0.230 0.048 0.156 0.159 0.096 0 Table B.5 ECA for RNA data. 
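As a companion to Definition 2, the following is a minimal sketch of how the ECA and ECP values reported in Tables B.5 and B.6 can be computed from a vector of true group labels and a vector of predicted cluster labels. It follows the verbal description above (per-predicted-cluster entropy of true labels for ECA, per-true-group entropy of predicted labels for ECP) and is not the evaluation code used in this chapter.

```python
import numpy as np

def entropy_of_proportions(labels):
    """Shannon entropy of the empirical label proportions in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def eca(true_labels, pred_labels):
    """Entropy of cluster accuracy: average, over predicted clusters, of the
    entropy of the true-group labels within that cluster. Low ECA means each
    predicted cluster draws its points from a single true group."""
    true_labels, pred_labels = np.asarray(true_labels), np.asarray(pred_labels)
    clusters = np.unique(pred_labels)
    return float(np.mean([entropy_of_proportions(true_labels[pred_labels == c])
                          for c in clusters]))

def ecp(true_labels, pred_labels):
    """Entropy of cluster purity: average, over true groups, of the entropy of
    the predicted labels within that group. Low ECP means almost all points of
    a true group were assigned the same predicted label."""
    true_labels, pred_labels = np.asarray(true_labels), np.asarray(pred_labels)
    groups = np.unique(true_labels)
    return float(np.mean([entropy_of_proportions(pred_labels[true_labels == g])
                          for g in groups]))

# Toy example: one predicted cluster mixes points from two true groups.
truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred  = [0, 0, 0, 1, 1, 2, 2, 2, 2]
print(round(eca(truth, pred), 3), round(ecp(truth, pred), 3))
```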
45 Method RNA1 RNA2 TMLung Beta TMPanc BaronPanc PBMC4k CellMix SC3 0.328 0.114 0.294 0.058 0.070 0.301 0.062 0 Scanpy 0.517 0.183 0.322 0.146 0.516 0.088 0.057 0 RaceID3 0.381 0.665 0.182 0.351 0.268 0.413 0.310 0 SIMLR 0.151 0.267 0.275 0.057 0.543 0.380 0.360 0 Seurat 0.292 0.282 0.230 0.150 0.540 0.122 0.053 0.027 Seurat_def 0.320 0.258 0.436 0.133 0.284 0.089 0.062 0 𝑘-means 0.131 0.255 0.244 0.079 0.215 0.395 0.316 0 DBSCAN 0.075 0.141 0.404 0.135 0.138 0.109 0.051 0 UMAP+db 0.151 0.226 0.413 0.126 0.061 0.248 0.087 0 𝑡-SNE+ 𝑘-means 0.102 0.133 0.437 0.089 0.494 0.402 0.451 0.147 PM1.5 0.096 0.136 0.197 0.061 0.482 0.273 0.312 0 PM2 0.096 0.062 0.323 0.061 0.158 0.081 0.308 0 PM4 0.096 0.114 0.184 0.048 0.260 0.226 0.055 0 Table B.6 ECP for RNA data. 46 CHAPTER 4 SHARED NEAREST NEIGHBORS GRAPH BASED SPECTRAL CLUSTERING 4.1 Introduction The exploration of the theoretical properties of spectral clustering on finite sample data started more than twenty years ago (Spielman and Teng, 1996; Guattery and Miller, 1998; Ng et al., 2001; Meilă and Shi, 2001) along with theoretical properties for increasing sample size (Luxburg et al., 2004). One of the advantages of spectral clustering is the interpretability of its performance on data points represented as vertices on graphs (𝑘NN, 𝜖-neighbor graph) that are connected with edges based on their similarity to other data points (von Luxburg, 2007). Although theoretical results of 𝑘NN graph-based clustering methods have been explored by Maier et al. (2009), the theoretical properties of SNN graph-based clustering combined with spectral clustering haven’t been investigated yet. In the following sections, we use a similar framework as of Maier et al. (2009) to provide a range for the number of neighbors used for the construction of the SNN graph, such that exact cluster identification is achieved. 4.2 Framework Our aim is to cluster a set of 𝑛 points, 𝑋1 , . . . , 𝑋𝑛 , which have been drawn from some underlying density, 𝑝, of R𝑚 . For this task, we build the SNN graph of those points and use its Laplacian for spectral clustering. The number of true clusters is known and denoted as 𝐾. 4.2.1 The SNN graph For the construction of the SNN graph, we first find the 𝑘 nearest neighbors of each point 𝑋𝑖 . Let 𝑘NN(𝑋𝑖 ) be the set of the first 𝑘 nearest neighbors of 𝑋𝑖 . Then, we connect two vertices 𝑋𝑖 and 𝑋 𝑗 , if 𝑋𝑖 ∈ 𝑘 𝑁 𝑁 (𝑋 𝑗 ) or if 𝑋 𝑗 ∈ 𝑘 𝑁 𝑁 (𝑋𝑖 ). The weight, 𝑊𝑖, 𝑗 , of their edge is the Jaccard similarity of 𝑘NN(𝑋𝑖 ), 𝑘NN(𝑋 𝑗 ), i.e. |𝑘 𝑁 𝑁 (𝑋𝑖 ) ∩ 𝑘 𝑁 𝑁 (𝑋 𝑗 )| 𝑊𝑖, 𝑗 = . (4.1) |𝑘 𝑁 𝑁 (𝑋𝑖 ) ∪ 𝑘 𝑁 𝑁 (𝑋 𝑗 )| The SNN graph is a symmetric graph denote as 𝐺 𝑆𝑁 𝑁 (𝑘). 47 4.2.2 The SNN Spectral Clustering Algorithm Algorithm 4.1 SNN Spectral Clustering Algorithm. 1: Input: 𝑋 ∈ R𝑛×𝑚 , number of clusters 𝐾, nearest neighbors 𝑘, bandwidth ℎ, density bound 𝑡, 1 1 Denoise = (TRUE, FALSE), Laplacian = (𝐷 − 𝑊, 𝐷 − 2 𝑊 𝐷 − 2 , 𝐼 − 𝐷 −1𝑊), 𝑡, 𝛿, 𝜖 𝑛 2: 3: Output: Predicted clustering labels ℓ ∈ R𝑛 4: 5: % 𝑘-nearest neighbors: 6: 7: for i=1 to n do 8: 𝑘 𝑁 𝑁𝑖 ← {𝑋 𝑗 : 𝑋 𝑗 is one of the 𝑘 𝑁 𝑁 of 𝑋𝑖 , based on Euclidean distance} 9: 10: % Shared Nearest Neighbors graph: 11: |𝑘 𝑁 𝑁 ∩𝑘 𝑁 𝑁 | 12: 𝑊𝑖, 𝑗 ← |𝑘 𝑁 𝑁𝑖𝑖 ∪𝑘 𝑁 𝑁 𝑗𝑗 | 13: 𝐺 𝑆𝑁 𝑁 (𝑘) ← graph with adjacency matrix 𝑊 14: 15: % Kernel estimation of density 𝑝: 16: 17: for i=1 to n do 𝑛 𝑋 −𝑋 1 Í 18: 𝑝ˆ𝑛 (𝑋𝑖 ) ← 𝑛ℎ 𝐾( 𝑖 ℎ 𝑗 ) 𝑗=1 19: 20: % Denoising: 21: 22: if Denoising = TRUE then 23: Remove vertices and edges of X s.t. 
𝑝ˆ𝑛 (𝑋) < 𝑡 − 2𝜖 𝑛 24: Remove components with size less than 𝛿𝑛 25: 𝐺 ′𝑆𝑁 𝑁 (𝑘) ← denoised 𝐺 𝑆𝑁 𝑁 (𝑘) 26: 𝑊, 𝑛 ← adjacency of 𝐺 ′𝑆𝑁 𝑁 (𝑘), number of vertices in 𝐺 ′𝑆𝑁 𝑁 (𝑘) 27: 28: % Eigendecomposition of SNN graph Laplacian: 29: Í 30: 𝐷 𝑖,𝑖 ← 𝑛𝑗=1 𝑊𝑖, 𝑗 31: 𝐷 ← diagonal(𝐷 𝑖,𝑖 ) 32: 𝐿 ← Laplacian 33: 𝑉 ← 𝐾 eigenvector matrix of𝐿 34: 35: % Clustering of 𝑋1 , . . . , 𝑋𝑛 : 36: 37: ℓ ← 𝑘-means_plusplus(𝑉, 𝐾) 38: 48 Algorithm 4.2 Eigenvector matrix of 𝐿. 1 1 1: Input: 𝐿, 𝐾, Laplacian = (𝐷 − 𝑊, 𝐷 − 2 𝑊 𝐷 − 2 , 𝐼 − 𝐷 −1𝑊) 2: 3: Output: 𝐾 eigenvector matrix of 𝐿 4: 1 1 5: if 𝐿 = 𝐷 − 2 𝑊 𝐷 − 2 then 6: {𝜆 1 , 𝜆2 , ..., 𝜆 𝐾 , ...} ← eigenvalues of 𝐿 such that 𝜆 1 > 𝜆2 > 𝜆3 ... 7: 𝑉 = [𝑉1 , 𝑉2 , ..., 𝑉𝐾 ] ← 𝑉𝑖 eigenvector of 𝜆𝑖 8: % Row normalization of 𝑉 9: 𝑣 𝑖 ← 𝑖-th row of 𝑉 𝑞 ← (𝑞 1 , . . . , 𝑞 𝑛 ) where 𝑞𝑖 = ( 𝑗 𝑣 𝑖,2 𝑗 ) 1/2 Í 10: 11: 𝑉˜ ← 𝐷 −1 𝑞 𝑉, where 𝐷 𝑞 = 𝑑𝑖𝑎𝑔𝑜𝑛𝑎𝑙 (𝑞) 12: ℓ ← 𝑘-means_plusplus(𝑉, ˜ 𝐾) 13: else 14: {𝜆 1 , 𝜆2 , ..., 𝜆 𝐾 , ...} ← eigenvalues of 𝐿 such that 𝜆 1 < 𝜆2 < 𝜆3 ... 15: 𝑉 = [𝑉1 , 𝑉2 , ..., 𝑉𝐾 ] ← 𝑉𝑖 eigenvector of 𝜆𝑖 4.3 Theoretical results The theoretical results presented in section 4.3 are divided into two cases; the noise-free case and the noisy case, based on Maier et al. (2009). Noise-free case. In this case, we consider a probability distribution 𝑝, whose support consists of several high-density regions separated by a positive distance from each other. We consider that successful cluster identification means that each high-density region corresponds to a unique predicted cluster. Since there is no overlap between high-density regions, every point will belong to one cluster only. Hence, the denoising step of 4.1 will not remove any points from the SNN graph. Noisy case. In this case, the high-density regions of 𝑝 are connected by low-density regions. For a 𝑡 > 0 we define the 𝑡-level set, 𝐿 (𝑡), as the closure of all points 𝑥 ∈ R𝑚 with 𝑝(𝑥) ≥ 𝑡, i.e. {𝑥 : 𝑝(𝑥) ≥ 𝑡}. We denote those components as 𝐶 (1) , . . . , 𝐶 (𝐾) . In the following results, we explore two approaches to the noisy case. In the first approach, the true clusters are the sets 𝐶 (1) , . . . , 𝐶 (𝐾) . Points in low-density regions do not belong to any cluster and are removed. We consider that clusters are identified exactly by our algorithm when each connected component of 𝐿 (𝑡) is included in a unique predicted cluster 49 (𝑖) 𝑝 max supremum of density attained by points of 𝐶 (𝑖) (𝑖) 𝑝 min infimum of density attained by points of 𝐶 (𝑖) 𝑢 (𝑖) lower bound on distance of cluster 𝐶 (𝑖) to other clusters 𝑢𝑖 𝑗 distance between cluster 𝐶 (𝑖) and cluster 𝐶 ( 𝑗) 𝜌(𝑢 (𝑖) ) probability of balls of radius 𝑢 (𝑖) in 𝐶 (𝑖) 𝛽 (𝑖) probability mass of cluster 𝐶 (𝑖) ′ 𝐺 𝑆𝑁 𝑁 (𝑘) the SNN graph after denoising Table 4.1 Notations. and the ratio of noisy points to cluster points goes to zero. We also consider rough identification of clusters, when the clustering algorithm predicts components that contain points of a unique 𝐶 (𝑖) plus some noisy points that do not belong to any cluster. The following table includes notation used in sections . In the second approach, noisy points are not removed. Instead, they are defined as cluster points of the 𝐿(𝑡) component closest to them. For this approach, there are connections between subgraphs that correspond to true clusters and spectral clustering might mis-cluster points that are equidistant from two clusters. We provide results regarding the mis-clustering error. Let the sets 𝐶 (1) , . . . , 𝐶 (𝐾) be 𝐾 disjoint, compact and connected subsets of R𝑚 . 
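Before the geometric assumptions are developed further, the following minimal sketch fixes ideas about what Algorithm 4.1 computes in the noise-free setting: the SNN graph with Jaccard weights on shared k-nearest-neighbor lists, the unnormalized Laplacian L = D − W, and k-means++ on the K eigenvectors associated with its smallest eigenvalues. The denoising step and the alternative Laplacians of Algorithm 4.2 are omitted; this is an illustrative sketch, not the implementation evaluated later in the chapter.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def snn_spectral_clustering(X, K, k):
    """Cluster the rows of X into K groups using the SNN graph G_SNN(k).

    Two points are joined if either is among the other's k nearest neighbors;
    the edge weight is the Jaccard similarity of their kNN lists. Spectral
    clustering then uses the unnormalized Laplacian L = D - W and runs
    k-means++ on the K eigenvectors of its smallest eigenvalues.
    """
    n = X.shape[0]
    D = cdist(X, X)                        # pairwise Euclidean distances
    knn = np.argsort(D, axis=1)[:, 1:k + 1]  # k nearest neighbors (self excluded)
    knn_sets = [set(row) for row in knn]

    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if i in knn_sets[j] or j in knn_sets[i]:
                inter = len(knn_sets[i] & knn_sets[j])
                union = len(knn_sets[i] | knn_sets[j])
                W[i, j] = W[j, i] = inter / union

    deg = W.sum(axis=1)
    L = np.diag(deg) - W                   # unnormalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    V = eigvecs[:, :K]                     # eigenvectors of the K smallest eigenvalues
    return KMeans(n_clusters=K, init="k-means++", n_init=10).fit_predict(V)

# Toy example: two well-separated Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
print(snn_spectral_clustering(X, K=2, k=10)[:5])
```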
The boundary 𝜕𝐶 (𝑖) of every 𝐶 (𝑖) is a smooth (𝑚 −1)-dimensional submanifold in R𝑚 . We will denote with 𝜅 (𝑖) the minimal curvature radius of 𝜕𝐶 (𝑖) , which is equal to the inverse of the largest principal curvature ∫ of 𝜕𝐶 (𝑖) . Also we will denote with 𝛽 (𝑖) = 𝜇(𝐶 (𝑖) ) = 𝐶 (𝑖) 𝑝𝑑𝜆, the probability mass of 𝐶 (𝑖) , where 𝜆 is the Lebesgue measure in R𝑚 . The following results about within cluster connectivity of a predicted cluster and isolation (disconnectivity) of each cluster will be proven using the collar set of each cluster. Specifically, the collar of 𝐶 (𝑖) is defined as 𝐶𝑜𝑙 (𝑖) (𝜈) = {𝑥 ∈ 𝐶 (𝑖) | 𝑑𝑖𝑠𝑡 (𝑥, 𝜕𝐶 (𝑖) ) ≤ 𝜈}, with 𝜈 < 𝜅 (𝑖) . Furthermore, (𝑖) we define the maximal covering radius to be 𝜈𝑚𝑎𝑥 = max𝜈≤𝜅 (𝑖) {𝜈 | 𝐶 (𝑖) ∖ 𝐶𝑜𝑙 (𝑖) (𝜈) is connected } and we denote with 𝑢 (𝑖) a lower bound of the distance of 𝐶 (𝑖) from 𝐶 (𝑖) with 𝑗 ≠ 𝑖. In the noise-free case, 𝑢 (𝑖) will be considered strictly greater than 0. Finally, the 𝑘NN radius of a point 𝑋𝑖 is the (𝑖) maximum distance of 𝑋𝑖 to a point in 𝑘NN(𝑋𝑖 ). The minimal 𝑘NN radius of a cluster 𝐶 (𝑖) , 𝑅𝑚𝑖𝑛 , is the minimal 𝑘NN radius of the points in 𝐶 (𝑖) . 50 4.3.1 Noise-free case Lemma 1 (Within cluster connectedness in 𝐺 𝑆𝑁 𝑁 (𝑘)) Let A𝑛(𝑖) denote the event that the points of cluster 𝐶 𝑖 are connected in 𝐺 𝑆𝑁 𝑁 (𝑘). For 𝑧 ∈   (𝑖) 0, 2𝑚𝑖𝑛{𝑢 (𝑖) , 𝜈𝑚𝑎𝑥 } , !       (𝑛−1) 𝑃 (A𝑛(𝑖) ) 𝑐 ≤ 𝑛𝛽 (𝑖) 𝑃 𝑀 ≥ 𝑘 + 𝑁 1 − (𝑖) 𝑝 min 𝜂𝑚 𝑧 𝑚 /4𝑚 (𝑖) 1 − 𝜂𝑚 𝑧 𝑚 /4𝑚 𝑝 min (𝑖)  − 𝑛𝑝 max , (4.2) (𝑖) for 𝑀 ∼ 𝐵𝑖𝑛(𝑛 − 1, 𝑝 max 𝜂𝑚 𝑧 𝑚 ). (𝑖) Proof. Observe that if 𝑅𝑚𝑖𝑛 > 𝑧, for some 𝑧 > 0 and if for two points of 𝑋𝑖 , 𝑋 𝑗 ∈ 𝐶 (𝑖) holds that 𝑑 (𝑋𝑖 , 𝑋 𝑗 ) ≤ 𝑧, then we have that 𝑋𝑖 ∈ 𝑘NN(𝑋 𝑗 ) and 𝑋 𝑗 ∈ 𝑘NN(𝑋𝑖 ) . Furthermore, if we can find a covering of 𝐶 (𝑖) ∖ 𝐶𝑜𝑙 (𝑖) (𝑧/4) of a finite number of balls of radius 𝑧/4, where every ball contains at least two points of 𝐶 (𝑖) , then points in neighboring balls have distance less than 𝑧. Hence they are in the list of 𝑘 nearest neighbors of each other and every pair of points will have a shared neighbor. This implies that every point in 𝐶 (𝑖) ∖ 𝐶𝑜𝑙 (𝑖) (𝑧/4) will be connected in 𝐺 𝑆𝑁 𝑁 (𝑘). Notice that points in 𝐶𝑜𝑙 (𝑖) (𝑧/4) will have at most distance 3𝑧/4 from the balls of the covering and since every ball includes at least two points of the cluster, we conclude that the points of 𝐶𝑜𝑙 (𝑖) (𝑧/4) will be connected to 𝐶 (𝑖) ∖ 𝐶𝑜𝑙 (𝑖) (𝑧/4). As a result, the points of 𝐶 (𝑖) will be connected in 𝐺 𝑆𝑁 𝑁 (𝑘). Let F𝑧(𝑖) be the event that, given a covering of 𝐶 (𝑖) ∖ 𝐶𝑜𝑙 (𝑖) (𝑧/4), there exists a ball that doesn’t contain (𝑖) at least two points of 𝐶 (𝑖) . Based on the above observation, {𝑅𝑚𝑖𝑛 > 𝑧} ∩ (F𝑧(𝑖) ) 𝑐 implies the points of 𝐶 (𝑖) will be connected on 𝐺 𝑆𝑁 𝑁 (𝑘). Therefore,       𝑃 (A𝑛(𝑖) ) 𝑐 ≤𝑃 (𝑖) {𝑅𝑚𝑖𝑛 ≤ 𝑧} + 𝑃 F𝑧(𝑖) (4.3)   (i) For P {Rmin ≤ z} : We define 𝑁 𝑠 = |{ 𝑗 ≠ 𝑠|𝑋 𝑗 ∈ 𝐵(𝑋𝑠 , 𝑧)}|, for 1 ≤ 𝑠 ≤ 𝑛. Then, (𝑖) if the event {𝑅𝑚𝑖𝑛 ≤ 𝑧} is true =⇒ ∃𝑋𝑠 ∈ 𝐶 (𝑖) s.t. 𝑚𝑎𝑥{𝑑 (𝑦, 𝑋𝑠 ) | 𝑦 ∈ 𝑘 𝑁 𝑁 (𝑋𝑠 )} ≤ 𝑧 =⇒ ∃𝑋𝑠 ∈ 𝐶 (𝑖) s.t. 𝑁 𝑠 ≥ 𝑘. 51 Ø 𝑛 (𝑖) {𝑁 𝑠 ≥ 𝑘 } ∩ {𝑋𝑠 ∈ 𝐶 (𝑖) } and then we have:  Therefore, {𝑅𝑚𝑖𝑛 ≤ 𝑧} ⊆ 𝑠=1   ∑︁ 𝑛       (𝑖) 𝑃 {𝑅𝑚𝑖𝑛 ≤ 𝑧} ≤ 𝑃 𝑁 𝑠 ≥ 𝑘 | 𝑋𝑠 ∈ 𝐶 (𝑖) 𝑃 𝑋𝑠 ∈ 𝐶 (𝑖) ≤ 𝑛𝛽 (𝑖) 𝑃 𝑀 ≥ 𝑘 . 𝑠=1 Here 𝑁 𝑠 | {𝑋𝑠 ∈ 𝐶 (𝑖) } ∼ 𝐵𝑖𝑛(𝑛 − 1, 𝜇(𝐵(𝑋𝑠 , 𝑧)) and 𝜇(𝐵(𝑋𝑠 , 𝑧)) ≤ sup𝑥∈𝐶 (𝑖) 𝜇(𝐵(𝑥, 𝑧)) ≤     (𝑖) (𝑖) 𝑝 max 𝜂𝑚 𝑧 . Hence, 𝑃 𝑁 𝑠 ≥ 𝑘 < 𝑃 𝑀 ≥ 𝑘 , for 𝑀 ∼ 𝐵𝑖𝑛(𝑛 − 1, 𝑝 max 𝑚 𝜂𝑚 𝑧 𝑚 ) and 𝜂𝑚 the volume of the 𝑚-dimensional unit ball.   (i) For P Fz : Since 𝐶 (𝑖) is compact and connected, we can find a covering of 𝐶 (𝑖) ∖ 𝐶𝑜𝑙 (𝑖) (𝑧/4) with 𝑁 balls, 𝐵1 (𝑧/4), . . . , 𝐵 𝑁 (𝑧/4) of radius 𝑧/4. 
Let us denote, 𝑃 𝑋 𝑗 ,𝐵𝑠 = 𝑃(𝑋 𝑗 ∈ 𝐵 𝑠 (𝑧/4) | 𝑋 𝑗 ∈ 𝐶 (𝑖) )𝑃(𝑋 𝑗 ∈ 𝐶 (𝑖) )) the probability that the point 𝑋 𝑗 is in 𝐶 (𝑖) and in the ball 𝐵 𝑠 (𝑧/4). Then,     (𝑖) (𝑖) 𝑃 F𝑧 = 𝑃 {∃𝑠, 1 ≤ 𝑠 ≤ 𝑁, s.t. 𝐵 𝑠 (𝑧/4) has less than two points of 𝐶 } 𝑁 ∑︁   ≤ 𝑃 {𝐵 𝑠 (𝑧/4) has less than two points of 𝐶 (𝑖) } 𝑠=1 𝑁 ∑︁   𝑁 ∑︁   (𝑖) = 𝑃 {𝐵 𝑠 (𝑧/4) has no points of 𝐶 } + 𝑃 {𝐵 𝑠 (𝑧/4) has exactly one point of 𝐶 (𝑖) } 𝑠=1 𝑠=1 𝑁 Ö ∑︁ 𝑛   ∑︁𝑁 ∑︁ 𝑛 Ö  = 1 − 𝑃 𝑋 𝑗 ,𝐵𝑠 + 𝑃 𝑋 𝑗 ,𝐵𝑠 1 − 𝑃 𝑋𝑞 ,𝐵𝑠 𝑠=1 𝑗=1 𝑠=1 𝑗=1 𝑞≠ 𝑗 (4.4) Now we notice that, (𝑖) 𝑃 𝑋 𝑗 ,𝐵𝑠 = 𝜇(𝐵 𝑠 (𝑧/4)) ≤ sup 𝜇(𝐵𝑞 (𝑧/4)) ≤ 𝑝 max 𝜂𝑚 𝑧 𝑚 /4𝑚 and, (4.5) 𝐵𝑞 ⊂𝐶 (𝑖) (𝑖) 𝑃 𝑋 𝑗 ,𝐵𝑠 = 𝜇(𝐵 𝑠 (𝑧/4)) ≥ inf 𝜇(𝐵𝑞 (𝑧/4)) ≥ 𝑝 min 𝜂𝑚 𝑧 𝑚 /4𝑚 (4.6) 𝐵𝑞 ⊂𝐶 (𝑖) (𝑖) (𝑖) where 𝑝 max is the supremum of density attained by points of 𝐶 (𝑖) and 𝑝 min the infimum of the density attained by points of 𝐶 (𝑖) . From inequalities 4.5, 4.6, we can write 4.4 as,    𝑛   (𝑛−1) 𝑃 F𝑧(𝑖) ≤ 𝑁 1 − 𝑝 min (𝑖) 𝜂𝑚 𝑧 𝑚 /4𝑚 + 𝑛𝑁 𝑝 max (𝑖) 𝜂𝑚 𝑧 𝑚 /4𝑚 1 − 𝑝 min (𝑖) 𝜂𝑚 𝑧 𝑚 /4𝑚 !   (𝑛−1) (𝑖) (𝑖) (𝑖) = 𝑁 1 − 𝑝 min 𝜂𝑚 𝑧 𝑚 /4𝑚 1 − 𝑝 min 𝜂𝑚 𝑧 𝑚 /4𝑚 + 𝑛𝑝 max 𝜂𝑚 𝑧 𝑚 /4𝑚 !   (𝑛−1) (𝑖) (𝑖) (𝑖)  = 𝑁 1 − 𝑝 min 𝜂𝑚 𝑧 𝑚 /4𝑚 1 − 𝜂𝑚 𝑧 𝑚 /4𝑚 𝑝 min − 𝑛𝑝 max 52 (𝑖) 𝜈 𝑚𝑎𝑥 For the covering we will use a standard 4𝑧 −packing. Also, since 𝑧 4 ≤ 2 , balls of radius 𝑧/8 around the packing centers are disjoint subsets of 𝐶 (𝑖) . Consequently, the total volume of the 𝑁 𝑚 balls will be bounded by the volume of cluster 𝐶 (𝑖) and hence 𝑁𝜂𝑚 8𝑧 𝑚 ≤ 𝑉 𝑜𝑙 (𝐶 (𝑖) ). Finally we get the following bound for the number of covering balls, 𝑁, 𝑉 𝑜𝑙 (𝐶 (𝑖) ) 𝑁≤ 𝑚 (4.7) 𝜂𝑚 8𝑧 𝑚 □ Now we will explore the connectivity of points from different clusters in 𝐺 𝑆𝑁 𝑁 (𝑘). We say that 𝐶 (𝑖) is isolated on 𝐺 𝑆𝑁 𝑁 (𝑘), if there is no edge between sample points of 𝐶 (𝑖) and any other cluster. Lemma 2 (Between clusters connectivity in 𝐺 𝑆𝑁 𝑁 (𝑘)) Let I𝑛(𝑖) be the event that 𝐶 (𝑖) is isolated from all other clusters in 𝐺 𝑆𝑁 𝑁 (𝑘), then for 𝑘 ≤ 𝜌(𝑢 (𝑖) )𝑛/2 − 2 log(𝛽 (𝑖) 𝑛)   𝐾 𝜌(𝑢 (𝑖) ) 𝑘−1  (𝑖)  𝑐  ∑︁ − 𝑛−1 2 2 − 𝑛−1 𝑃 I𝑛 ≤ 𝑒 . (4.8) 𝑖=1 Proof. For the construction of the 𝐺 𝑆𝑁 𝑁 (𝑘), we practically start with constructing a symmetric 𝑘NN graph, 𝐺 𝑘 𝑁 𝑁 , and then we remove the edges of points that do not have common neighbors in their 𝑘-nearest neighbors lists. This implies that points that aren’t connected on the 𝐺 𝑘 𝑁 𝑁 will not be connected on the 𝐺 𝑆𝑁 𝑁 (𝑘). Hence, (I𝑛(𝑖) =⇒ 𝐶 (𝑖) is not isolated in 𝐺 𝑘 𝑁 𝑁 𝑐   𝑐   𝑃 I𝑛(𝑖) ≤ 𝑃 𝐶 (𝑖) not isolated in 𝐺 𝑘 𝑁 𝑁   ∑︁    (𝑖) (𝑖) ( 𝑗) (𝑖 𝑗) ≤ 𝑃 𝑅max ≥ 𝑢 + 𝑃 𝑅max ≥ 𝑢 𝑗≠𝑖   ∑︁    (𝑖) ( 𝑗) ≤𝑃 𝑅max ≥ 𝑢 (𝑖) + 𝑃 𝑅max ≥ 𝑢 ( 𝑗) 𝑗≠𝑖   𝐾 𝜌(𝑢 (𝑖) ) 𝑘−1 ∑︁ − 𝑛−1 2 2 − 𝑛−1 ≤ 𝑒 𝑖=1 where we get the final bound for 𝑘 < 𝜌(𝑢 (𝑖) )𝑛/2 − 2 log(𝛽 (𝑖) 𝑛) and using Preposition 6 and Lemma 7 from Maier et al. (2009). Here, 𝑢 (𝑖) ≤ 𝑢𝑖 𝑗 and 𝜌(𝑢 (𝑖) ) is the probability of balls of radius 𝑢 (𝑖) in 𝐶 (𝑖) . □ 53 Lemma 3 (Range of 𝑘 for within-cluster connectedness) (𝑖) (𝑖) (𝑖) 𝑚 𝑉 𝑜𝑙 (𝐶 (𝑖) )8𝑚 and 𝑘 ≤ (𝑛 − 1) 𝑝 𝑚𝑎𝑥 𝜂𝑚 min{(𝑢 (𝑖) ) 𝑚 , (𝜈𝑚𝑎𝑥  If 𝑘 ≥ 4𝑚+1 log 2𝑛𝑝 𝑚𝑎𝑥 ) } then, (𝑖) (𝑘−1) 𝑝    1 𝑛2  − 4𝑚+1 𝑝 (𝑖) 𝑚𝑖𝑛 𝑃 (A𝑛(𝑖) ) 𝑐 ≤ 2+ 𝑚 𝑒 𝑚𝑎𝑥 (4.9) 4 𝑛−1   Proof. Our overall goal is to find appropriate values for 𝑘 such that 𝑃 (A𝑛(𝑖) ) 𝑐 has an upper bound that goes to zero as n goes to infinity. We will find upper bounds for the two terms of the inequality 4.2. We will use the following inequalities in the proof: 𝑘 (Hoeffding’s inequality). Let 𝑀 ∼ 𝐵𝑖𝑛(𝑛, 𝑝) and define 𝛼 = 𝑛−1 . 
Then,   𝛼 ≥ 𝑝, 𝑃 𝑀 ≥ 𝑘 ≤ 𝑒 −𝑛𝐾 (𝛼|| 𝑝) , (4.10) 1−𝑎 𝑎   where 𝐾 (𝛼|| 𝑝) = 𝛼 log 𝑝 + (1 − 𝛼) log 1−𝑝 is the Kullback-Leibler divergence of (𝛼, 1 − 𝛼) and ( 𝑝, 1 − 𝑝). (1st logarithmic inequality.) 𝑥−1 log(𝑥) ≥ , for 𝑥 > 0 (4.11) 𝑥 (2nd logarithmic inequality.) log(1 − 𝑥) ≤ 𝑥, for 𝑥 ≤ 1. (4.12) (𝑖) 𝑘 For the first term of 4.2, we use inequality 4.10 for 𝑝 = 𝑝 𝑚𝑎𝑥 𝜂𝑚 𝑧 𝑚 and 𝛼 = 𝑛−1 . Now, assuming that 𝑝 < 𝛼 and that 𝑘 ≤ 𝑛 − 1 we get,       𝑎 1−𝑎 𝑎   −(𝑛−1) 𝛼 log 𝑝 +(1−𝛼) log 1− 𝑝 −(𝑛−1) 𝛼 log 𝑝 +𝑝−𝛼 𝑛𝛽 (𝑖) 𝑃 𝑀 ≥ 𝑘 ≤ 𝑒 ≤𝑒 , by 4.11. Let 𝜃 = 𝜂𝑚 𝑧 𝑚 /𝛼 then we have,    1 (𝑖)   −𝑘 log (𝑖) +𝜃 𝑝 𝑚𝑎𝑥 −1 𝑛𝛽 (𝑖) 𝑃 𝑀 ≥ 𝑘 ≤ 𝑛𝛽 (𝑖) 𝑒 𝜃 𝑝𝑚𝑎𝑥    1 (𝑖) log(𝑛𝛽 (𝑖) )−𝑘 log +𝜃 𝑝 𝑚𝑎𝑥 −1 =𝑒 (𝑖) 𝜃 𝑝𝑚𝑎𝑥 (4.13)  1  (𝑖)  − 𝑘2 log (𝑖) +𝜃 𝑝 𝑚𝑎𝑥 −1 ≤𝑒 𝜃 𝑝𝑚𝑎𝑥 , 54 𝑘 1  (𝑖)  for 𝑘 such that log(𝑛𝛽 (𝑖) ) ≤ 2 log (𝑖) + 𝜃 𝑝 𝑚𝑎𝑥 − 1 . That way we attain a lower bound for 𝑘: 𝜃 𝑝 𝑚𝑎𝑥 2 log(𝑛𝛽 (𝑖) ) 𝑘≥ (𝑖) (4.14) 1  log (𝑖) + 𝜃 𝑝 𝑚𝑎𝑥 −1 𝜃 𝑝 𝑚𝑎𝑥 For the second term of 4.2 we have,   (𝑛−1)   (𝑖) (𝑖) (𝑖)  𝑁 1 − 𝑝 min 𝜂𝑚 𝑧 𝑚 /4𝑚 1 − 𝜂𝑚 𝑧 𝑚 /4𝑚 𝑝 min − 𝑛𝑝 max = (𝑖)  (𝑛−1) log(1−𝑝 (𝑖) 𝜂 𝑚 𝑧 𝑚 /4𝑚 )+log(𝑁)   (𝑖) 1 − 𝜂𝑚 𝑧 𝑚 /4𝑚 𝑝 min − 𝑛𝑝 max 𝑒 min ≤ 𝑉𝑜𝑙 (𝐶 (𝑖) ) (𝑖)    (𝑛−1) log(1−𝑝 min 𝜂 𝑚 𝑧 𝑚 /4𝑚 )+log 𝑚 𝑚 𝑚 (𝑖) (𝑖)  𝜂𝑚 𝑧 𝑚 1 − 𝜂𝑚 𝑧 /4 𝑝 min − 𝑛𝑝 max 𝑒 8 , by 4.7. 𝜃𝑘 Now we use the substitution 𝜂𝑚 𝑧 𝑚 = 𝜃𝛼 = 𝑛−1   (𝑛−1)   (𝑖) (𝑖) (𝑖)  𝑁 1 − 𝑝 min 𝜂𝑚 𝑧 𝑚 /4𝑚 1 − 𝜂𝑚 𝑧 𝑚 /4𝑚 𝑝 min − 𝑛𝑝 max ≤  𝜃 𝑘 𝑝 (𝑖) (𝑖) 𝑚 (𝑛−1)  (𝑖)  − 4𝑚min +log 𝑉𝑜𝑙 (𝐶 𝜃)8  𝜃𝑘 (𝑖) 1+ 𝑚 𝑛𝑝 max − 𝑝 min 𝑒 𝑘 ≤ 4 (𝑛 − 1)  𝜃 𝑘 𝑝 (𝑖) (𝑖) 𝑚  (4.15) (𝑖)  − 4𝑚min +log 𝑉𝑜𝑙 (𝐶 𝜃 )8 𝑛  𝜃𝑛 (𝑖) 1+ 𝑚 𝑛𝑝 max − 𝑝 min 𝑒 ≤ 4 (𝑛 − 1) (𝑖)   𝜃𝑝 𝜃𝑛 (𝑖) (𝑖)  − 𝑘2 4𝑚 min 1+ 𝑚 𝑛𝑝 max − 𝑝 min 𝑒 , 4 (𝑛 − 1) (𝑖) (𝑖) 𝜃𝑘 𝑝 min 𝑉 𝑜𝑙 (𝐶 (𝑖) )8𝑚 𝑛  𝜃 𝑝 min where the last step of the inequality 4.15 holds if − 4𝑚 + log 𝜃 ≤ − 2𝑘 4𝑚 or equiva- lently if, 4𝑚 2 𝑉 𝑜𝑙 (𝐶 (𝑖) )8𝑚 𝑛  𝑘≥ (𝑖) log (4.16) 𝜃 𝑝 𝑚𝑖𝑛 𝜃 To bound inequality 4.13 with the upper bound of 4.15 we need,   (𝑖)   (𝑖) 𝑘 1  𝜃𝑛 (𝑖) (𝑖)  𝑘 𝜃𝑝 min − 2 log (𝑖) +𝜃 𝑝 𝑚𝑎𝑥 −1 𝑒 − 2 4𝑚 ≥ 𝑒  1+ 𝑚 𝑛𝑝 max − 𝑝 min 𝜃 𝑝𝑚𝑎𝑥 4 (𝑛 − 1) (𝑖)   𝑘 𝜃 𝑝 min 𝑘 1  (𝑖) for which it suffices to have, 2 4𝑚 ≤ 2 log (𝑖) + 𝜃 𝑝 𝑚𝑎𝑥 − 1 or equivalently − log(𝛾) ≥ 𝜃 𝑝 𝑚𝑎𝑥 (𝑖) 𝛾 𝑝 𝑚𝑖𝑛 (𝑖) 3𝛾 1+𝛾− 4𝑚 𝑝 (𝑖) for 𝛾 = 𝜃 𝑝 𝑚𝑎𝑥 . The above is satisfied for values of 𝛾 that − log(𝛾) ≥ 1 + 𝛾 − 4 , 𝑚𝑎𝑥 (𝑖) 𝑝 𝑚𝑖𝑛 since (𝑖) ≤ 14 . Such values of 𝛾 could be 1 1 10 , 2 and others. We will use 𝛾 = 12 , which will result 4𝑚 𝑝 𝑚𝑎𝑥 to 𝜃 = 1(𝑖) and the probability inequality in Lemma 1 can be rewritten as: 2𝑝 𝑚𝑎𝑥 (𝑖) (𝑖) 𝑘𝑝 (𝑘−1) 𝑝    𝑛 1 (𝑖)  − 4𝑚+1 𝑝 (𝑖)  𝑚𝑖𝑛  1 𝑛2  − 4𝑚+1 𝑝 (𝑖) 𝑚𝑖𝑛 𝑃 (A𝑛(𝑖) ) 𝑐 ≤ 2 1+ (𝑖) 𝑛𝑝 (𝑖) max − 𝑝 min 𝑒 𝑚𝑎𝑥 ≤ 2 + 𝑒 𝑚𝑎𝑥 𝑛 − 1 4𝑚 2𝑝 𝑚𝑎𝑥 4𝑚 𝑛 − 1 55 Now we return to the range of 𝑘. We substitute the chosen value for 𝜃 in 4.14 and 4.16 and we observe that the largest of the two lower bounds is the one in 4.16. This is because (𝑖) 𝛽 (𝑖) = 𝜇(𝐶 (𝑖) ) ≤ 𝑝 𝑚𝑎𝑥 𝑉 𝑜𝑙 (𝐶 (𝑖) ). (𝑖) 𝑉 𝑜𝑙 (𝐶 (𝑖) )8𝑚 .  Hence, we conclude that an appropriate lower bound for 𝑘 is 𝑘 ≥ 4𝑚+1 log 2𝑛𝑝 𝑚𝑎𝑥 (𝑖) From the assumption 𝜂𝑚 𝑧 𝑚 = 𝜃𝛼 and that 𝑧 ≤ 2 min{𝑢 (𝑖) , 𝜈𝑚𝑎𝑥 } we get an upper bound for 𝑘. (𝑖) (𝑖) (𝑖) 𝑚 𝑘 < (𝑛 − 1) 𝑝 𝑚𝑎𝑥 𝜂𝑚 𝑧 𝑚 ⇔ 𝑘 < (𝑛 − 1) 𝑝 𝑚𝑎𝑥 𝜂𝑚 min{(𝑢 (𝑖) ) 𝑚 , (𝜈𝑚𝑎𝑥 ) }. □ The theoretical results we derived so far provide us with probabilities of having connected components in the graph 𝐺 𝑆𝑁 𝑁 (𝑘), isolated from other components that each correspond to a specific cluster. Now, we want to explore the probability that the step of spectral clustering on 𝐺 𝑆𝑁 𝑁 (𝑘) of algorithm 4.1 will yield exact identification of clusters. 
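For readability, the three elementary bounds invoked in the proofs of Lemmas 1–3 are restated below in clean LaTeX, since the displayed versions above were damaged in extraction. The direction of the second logarithmic inequality is inferred from how it is used (to turn (1 − x)^n factors into exponential bounds), so this restatement should be read as a best-effort reconstruction.

```latex
% Chernoff--Hoeffding bound in relative-entropy form: for M ~ Bin(N, p) and
% alpha = k/N >= p (in the proofs above it is applied with N = n - 1),
\[
  P(M \ge k) \;\le\; e^{-N\,K(\alpha\,\|\,p)},
  \qquad
  K(\alpha\,\|\,p) = \alpha\log\frac{\alpha}{p}
    + (1-\alpha)\log\frac{1-\alpha}{1-p}.
\]
% First logarithmic inequality:
\[
  \log x \;\ge\; \frac{x-1}{x}, \qquad x > 0.
\]
% Second logarithmic inequality (the form needed for exponential bounds):
\[
  \log(1-x) \;\le\; -x, \qquad x < 1.
\]
```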
Theorem 1 (Optimal k for exact identification of clusters) 4 log(𝑛)+(𝑛−1) 𝜌 𝑚𝑖𝑛  𝑚+1 log 2𝑛𝑝 (𝑖) 𝑉 𝑜𝑙 (𝐶 (𝑖) )8𝑚  For 𝑘 = 𝑐 1 + 𝑝𝑚𝑖𝑛 2+ 4𝑚 𝑝𝑚𝑎𝑥 , where 𝑐 such that 𝑘 will satisfy 𝑘 ≥ 4 𝑚𝑎𝑥 (𝑖) (𝑖) 𝑚 and 𝑘 ≤ min 𝜌(𝑢 (𝑖) )𝑛/2 − 2 log(𝛽 (𝑖) 𝑛), (𝑛 − 1) 𝑝 𝑚𝑎𝑥 𝜂𝑚 min{(𝑢 (𝑖) ) 𝑚 , (𝜈𝑚𝑎𝑥  ) } , for 𝑖 ∈ {1, . . . , 𝐾 }, the algorithm 4.1 achieves exact cluster identification with probability      1 𝑛2  (𝑘−1) 𝑝 − 𝑚+1 𝑚𝑖𝑛 − 𝑛−1 𝜌𝑚𝑖𝑛 𝑘−1 2 − 𝑛−1 − 𝐾 2𝑒 2 𝑃 Q𝑛 ≥ 1 − 𝐾 2 + 𝑒 4 𝑝𝑚𝑎𝑥 4𝑚 𝑛 − 1 𝐾 A𝑛(𝑖) , the event that for each cluster Ñ Proof. We will start our proof with some notation. Let A𝑛 = 𝑖=1 𝐾 I𝑛(𝑖) be the event that every Ñ its points are connected on the graph. Correspondingly, let I𝑛 = 𝑖=1 cluster is isolated from the other clusters on the graph 𝐺 𝑆𝑁 𝑁 (𝑘). Then, we denote with Q𝑛 the intersection of A𝑛 and I𝑛 . If Q𝑛 is true, then with an appropriate permutation of rows, the adjacency matrix 𝑊 of 𝐺 𝑆𝑁 𝑁 (𝑘) will be block-diagonal and each block will correspond to a cluster. The first step of our proof will be, to explain how each of the Laplacian option will utilize the block structure of 𝑊 to achieve exact clustering. The second step of our proof is to find an optimal choice for 𝑘 and an upper bound for the probability of event (Q𝑛 ) 𝑐 . For L = D − W and L = I − D−1 W : According to Propositions 2 and 4 of von Luxburg (2007), the unnormalized Laplacian and the random walk Laplacian have eigenvalue zero with multiplicity 56 equal to the number of connected components on the graph. Furthermore, the eigenspace of eigenvalue zero is spanned by indicator vectors 1𝐶 (𝑖) , for 𝑖 ∈ {1, . . . , 𝐾 }. If 𝐺 𝑆𝑁 𝑁 (𝑘) has a connected component for all the points that belong to a unique cluster, and the components are isolated from components that correspond to different clusters, the matrix 𝑊 will have a block structure and every block will include all the points of one cluster. The matrix 𝑉 with columns the eigenvectors of L will hence be of the form 𝑉 = 𝐵𝑀, where 𝑀 is an orthogonal matrix and B is a block diagonal matrix with columns the indicator eigenvectors 1𝐶 (𝑖) , for 𝑖 ∈ {1, . . . , 𝐾 }. Now, notice that 𝑉 will also have a block diagonal structure and the rows of every block will be equal. On the other hand, due to the block structure of 𝑉, columns of 𝑉 of different blocks will be perpendicular to each other. Hence, kmeans on V will choose one row from each block as the centroid of a cluster, and as a result points of the same block will be clustered together. 1 1 1 1 1 1 For L = D− 2 WD− 2 : We observe that 𝐿 = 𝐷 − 2 𝑊 𝐷 − 2 and 𝐿 𝑠𝑦𝑚 = 𝐼 − 𝐷 − 2 𝑊 𝐷 − 2 have the same eigenvectors but different eigenvalues. Specifically, if 𝑣 is an eigenvector for the eigenvalue 𝜆 of 𝐿 𝑠𝑦𝑚 then 𝑣 is an eigenvector for the eigenvalue −𝜆 of 𝐿. According to Proposition 4 of von Luxburg (2007), 𝐿 𝑠𝑦𝑚 has eigenvalue zero with multiplicity equal to the number of connected components in the graph and the eigenspace of zero is spanned by the vectors {𝐷 − 2 1𝐶 (1) , . . . , 𝐷 − 2 1𝐶 (𝐾 ) }. 1 1 1 In this case, the matrix 𝑉 of eigenvectors of 𝐿 will be of the form 𝑉 = 𝐷 − 2 𝐵𝑀, where 𝑀 is an orthogonal matrix, and B is a block diagonal matrix with columns the indicator eigenvectors 1𝐶 (𝑖) , for 𝑖 ∈ {1, . . . , 𝐾 }. This time the rows of V that correspond to points of the same component can be seen as vectors that aren’t identical, but, interestingly, have the same direction and different lengths. For this reason in algorithm 4.1 we do not apply kmeans on 𝑉, but instead on 𝑉˜ which is equal to V after row-normalization. 
We observe that the rows of 𝑉˜ that correspond to points of the same component will be equal. The columns of 𝑉˜ of different blocks will be perpendicular to each other. Hence, kmeans on 𝑉˜ will choose one row from each block as the centroid of a cluster, and as a result points of the same block will be clustered together. 57   Moving on to the second step of the proof, we will calculate an upper bound for 𝑃 (Q𝑛 ) 𝑐 . Let (𝑖) (𝑖) 𝜌𝑚𝑖𝑛 = min 𝜌(𝑢 (𝑖) ), 𝑝 𝑚𝑖𝑛 = min 𝑝 𝑚𝑖𝑛 and 𝑝 𝑚𝑎𝑥 = max 𝑝 𝑚𝑎𝑥 . We notice that by Lemma 3, for 𝑖=1,...,𝐾 𝑖=1,...,𝐾 𝑖=1,...,𝐾 (𝑖) (𝑖) (𝑖) 𝑚 𝑉 𝑜𝑙 (𝐶 (𝑖) )8𝑚 and 𝑘 ≤ (𝑛−1) 𝑝 𝑚𝑎𝑥 𝜂𝑚 min{(𝑢 (𝑖) ) 𝑚 , (𝜈𝑚𝑎𝑥  𝑘 ≥ 4𝑚+1 log 2𝑛𝑝 𝑚𝑎𝑥 ) } for 𝑖 = 1, . . . , 𝐾, we have that, (𝑖) 𝐾 𝐾  (𝑘−1) 𝑝   ∑︁  (𝑖) 𝑐  ∑︁ 1 𝑛2  − 4𝑚+1 𝑝 (𝑖) 𝑚𝑖𝑛 𝑃 (A𝑛 ) 𝑐 ≤ 𝑃 (A𝑛 ) ≤ 2+ 𝑚 𝑒 𝑚𝑎𝑥 𝑖=1 𝑖=1 4 𝑛−1 𝐾  ∑︁ 1 𝑛2  − 4(𝑘−1) 𝑝𝑚𝑖𝑛 (4.17) ≤ 2+ 𝑒 𝑚+1 𝑝𝑚𝑎𝑥 𝑖=1 4𝑚 𝑛 − 1  1 𝑛2  − 4(𝑘−1) 𝑝𝑚𝑖𝑛 = 𝐾 2+ 𝑚 𝑒 𝑚+1 𝑝𝑚𝑎𝑥 4 𝑛−1 If we additionally have that 𝑘 ≤ 𝜌(𝑢 (𝑖) )𝑛/2 − 2 log(𝛽 (𝑖) 𝑛) for every 𝑖 in {1, . . . , 𝐾 }, then by Lemma 2   𝐾 𝐾 𝜌(𝑢 (𝑖) ) 𝑘−1   ∑︁  𝑐 ∑︁ − 𝑛−1 − 𝑛−1 𝑃 I𝑛(𝑖) 2 2 𝑃 (I𝑛 ) 𝑐 ≤ ≤𝐾 𝑒 𝑖=1 𝑖=1   𝐾 𝜌𝑚𝑖𝑛 𝑘−1 ∑︁ − 𝑛−1 2 2 − 𝑛−1 (4.18) ≤𝐾 𝑒 𝑖=1   𝑛−1 𝜌𝑚𝑖𝑛 𝑘−1 2 − 2 2 − 𝑛−1 =𝐾 𝑒 . Hence,          1 𝑛2  (𝑘−1) 𝑝 − 𝑚+1 𝑚𝑖𝑛 − 𝑛−1 𝜌𝑚𝑖𝑛 𝑘−1 2 − 𝑛−1 + 𝐾 2𝑒 2 𝑃 (Q𝑛 ) 𝑐 ≤ 𝑃 (A𝑛 ) 𝑐 + 𝑃 (I𝑛 ) 𝑐 ≤ 𝐾 2 + 𝑒 4 𝑝𝑚𝑎𝑥 4𝑚 𝑛 − 1 Now we want to choose k such that the bounds of 4.17, 4.18 will be of the same order. Equivalently we want,   (𝑘−1) 𝑝 𝜌𝑚𝑖𝑛 𝑘−1 − 𝑚+1 𝑚𝑖𝑛 − 𝑛−1 2 2 − 𝑛−1 𝑛𝑒 4 𝑝𝑚𝑎𝑥 =𝑒 , 4 log(𝑛)+(𝑛−1) 𝜌 𝑚𝑖𝑛 4 log(𝑛)+(𝑛−1) 𝜌 𝑚𝑖𝑛  which holds for 𝑘 = 1 + 𝑝 2+ 4𝑚 𝑝𝑚𝑖𝑛 . In conclusion, choosing 𝑘 = 𝑐 1 + 𝑝 2+ 4𝑚 𝑝𝑚𝑖𝑛 , for 𝑚𝑎𝑥 𝑚𝑎𝑥   a constant c such that 𝑘 will satisfy the conditions for inequalities 4.17, 4.18, will let 𝑃 (Q𝑛 ) 𝑐 go to zero exponentially with n. □ 58 4.3.2 Noisy case 4.3.2.1 First approach to noisy case: Remove low-density points We now explore the connectedness and isolation properties of clusters predicted by 4.1 when there are noise points in our sample - points that do not belong to any cluster and have low density. In this case, we apply spectral clustering on 𝐺 ′𝑆𝑁 𝑁 (𝑘), the graph that doesn’t include any points with ˆ 𝑝(𝑥) < 𝑡 − 2𝜖 𝑛 and their edges. The clusters of the 𝐿 (𝑡 − 2𝜖 𝑛 ) are denoted as 𝐶 (𝑖) (2𝜖 𝑛 ). The value 𝜖 𝑛 is the error in density estimation. We choose to work with clusters of the 𝐿 (𝑡 − 2𝜖 𝑛 ) to ensure that the points of the 𝐿 (𝑡) set will not be removed from the 𝐺 ′𝑆𝑁 𝑁 (𝑘). Additionally, 𝜖 𝑛 has the property (𝑖) (𝑖) that 𝑑𝑖𝑠𝑡 (𝐶 (𝑖) (2𝜖 𝑛 ), 𝐶 ( 𝑗) (2𝜖 𝑛 )) ≥ 𝑢 (𝑖) for every 𝑖, 𝑗 ∈ {1, . . . , 𝐾 }. We denote with 𝑅˜ 𝑚𝑖𝑛 , 𝑅˜ 𝑚𝑎𝑥 the minimal and maximal 𝑘NN radius of 𝐶 (𝑖) (2𝜖 𝑛 ) and with 𝛽˜ (𝑖) the mass of 𝐶 (𝑖) (2𝜖 𝑛 ), 𝜇(𝐶 (𝑖) (2𝜖 𝑛 )). ˜ (𝑖) ) is the probability of balls of radius 𝑢 (𝑖) in 𝐶 (𝑖) (2𝜖 𝑛 ). The aim of this section is to Finally, 𝜌(𝑢 illustrate how to extend the connectedness and isolation results for the clusters 𝐶 (𝑖) 𝑖 = 1, . . . , 𝐾 in the noisy case. In more detail, we prove that 𝐺 ′𝑆𝑁 𝑁 (𝑘) will have so many connected components as the number of clusters 𝐾 and each component on the graph 𝐺 ′𝑆𝑁 𝑁 (𝑘) will correspond to a unique topological component of the 𝐿 (𝑡 − 2𝜖 𝑛 ) set. Furthermore, the component of 𝐺 ′𝑆𝑁 𝑁 (𝑘) corresponding to 𝐶 (𝑖) (2𝜖 𝑛 ) will include all the points of 𝐶 (𝑖) and some additional points, but it will be isolated from all other components. Hence the adjacency matrix of 𝐺 ′𝑆𝑁 𝑁 (𝑘) will again be block-diagonal and spectral clustering will predict clusters that include the points of 𝐶 (𝑖) plus some boundary points. 
We further prove that as the sample size n increases, the ratio of boundary to cluster points will go to zero and allow spectral clustering to yield a grouping only of true cluster points achieving that way exact clustering. Let D𝑛(𝑖) be the event that | 𝑝ˆ𝑛 (𝑋𝑖 ) − 𝑝 𝑛 (𝑋𝑖 )| ≤ 𝜖 𝑛 for every 𝑋𝑖 , 𝑖 = 1, . . . , 𝑛. Lemma 4 (Range of 𝑘 for within-cluster connectedness in noisy case 1) Let A𝑛(𝑖) denote the event that the points of cluster 𝐶 (𝑖) are connected in 𝐺 ′𝑆𝑁 𝑁 (𝑘). If 𝑘 ≥ (𝑖) (𝑖) (𝑖) 𝑚 𝑉 𝑜𝑙 (𝐶 (𝑖) )8𝑚 and 𝑘 ≤ (𝑛 − 1) 𝑝 𝑚𝑎𝑥 𝜂𝑚 min{(𝑢 (𝑖) ) 𝑚 , (𝜈𝑚𝑎𝑥  4𝑚+1 log 2𝑛𝑝 𝑚𝑎𝑥 ) } then, (𝑘−1)𝑡    1 𝑛2  − 4𝑚+1 𝑝 (𝑖)   𝑃 (A𝑛(𝑖) ) 𝑐 ≤ 2+ 𝑚 𝑒 𝑚𝑎𝑥 + 𝑃 D𝑛𝑐 4 𝑛−1 59 Proof. We assume that D𝑛(𝑖) holds, and we use the steps of the proof of Lemma 1. There are no   difference in the bounds for 𝑃 {𝑅𝑚𝑖𝑛 ≤ 𝑧} ≤ 𝑛𝛽 (𝑖) 𝑃(𝑀 ≥ 𝑘). We recall that F𝑧(𝑖) is the event (𝑖) that, given a covering of 𝐶 (𝑖) (2𝜖 𝑛 ) ∖ 𝐶𝑜𝑙 (𝑖) (𝑧/4), there exists a ball that doesn’t contain at least two points of 𝐶 (𝑖) (2𝜖 𝑛 ). In the noisy case, this event can happen either if some ball in the covering contains less than two points or if some points of 𝐶 (𝑖) were discarded. If the event D𝑛(𝑖) holds for a point x then 𝑝(𝑥) ˆ > 𝑡 − 𝜖 𝑛 , then x will not be removed at the denoising step. Additionally, 𝑝 𝑚𝑖𝑛 = 𝑡 in 𝐶 (𝑖) . Consequently, !     (𝑛−1)   (𝑖) P Fz(i) ≤ 𝑁 1 − 𝑡𝜂𝑚 𝑧 𝑚 /4𝑚  1 − 𝜂𝑚 𝑧 𝑚 /4𝑚 𝑡 − 𝑛𝑝 max + 𝑃 D𝑛𝑐 (4.19)   We find the final bound of 𝑃 (A𝑛(𝑖) ) 𝑐 by following the proof of Lemma 3 and using the inequality 4.19. □ Lemma 5 (Cluster size probability) Let B𝑛(𝑖) denote the event that there are more than 𝛿𝑛 sample points from cluster 𝐶 (𝑖) . If 𝛽 (𝑖) > 𝛿 then, 𝛽 (𝑖) − 𝛿 2  − 21 𝑛𝛽 (𝑖) B𝑛(𝑖) 𝑐 𝑃 ≤𝑒 𝛽 (𝑖) Proof Same as in Maier et al. (2009) Lemma 4. □ Lemma 6 (Density estimation error) Let E𝑛(𝑖) denote the event that there are less than 𝛿𝑛 points in all the boundary points sets 𝐾 𝐶 ( 𝑗) (2𝜖 𝑛 ) \ 𝐶 ( 𝑗) together. If 𝜇(𝐶 ( 𝑗) (2𝜖 𝑛 ) \ 𝐶 ( 𝑗) ) < 𝛿/2, we have 𝑃 E𝑛(𝑖) ≤ 𝑒 −𝛿𝑛/8 Í 𝑐 𝑗=1 Proof Same as in Maier et al. (2009) Lemma 5. □ Proposition 1 (Cluster connectedness in 𝐺 ′𝑆𝑁 𝑁 (𝑘)) Let C𝑛(𝑖) be the event that in the denoised graph 𝐺 ′𝑆𝑁 𝑁 (𝑘) it holds that: - all sample points of 𝐶𝑛(𝑖) are contained in the graph -the sample points of 𝐶𝑛(𝑖) are connected in the graph -there is no component of the graph that consists only of points outside the 𝐿(𝑡) set. 60 Then under the conditions that 1. 𝛽 (𝑖) > 2𝛿 𝐾 𝐶 ( 𝑗) (2𝜖 𝑛 ) \ 𝐶 ( 𝑗) ) ≤ 𝛿/2 Ð 2. 𝜖 𝑛 sufficiently small such that 𝜇( 𝑖=1 (𝑖) 𝑉 𝑜𝑙 (𝐶 (𝑖) )8𝑚 and  3. 𝑘 ≥ 4𝑚+1 log 2𝑛𝑝 𝑚𝑎𝑥 (𝑖) (𝑖) 𝑚 4. 𝑘 ≤ (𝑛 − 1) 𝑝 𝑚𝑎𝑥 𝜂𝑚 min{(𝑢 (𝑖) ) 𝑚 , (𝜈𝑚𝑎𝑥 ) } and for sufficiently large n we obtain (𝑘−1)𝑡  𝑐  1 𝑛2  − 4𝑚+1 𝑝 (𝑖) − 𝛿𝑛   𝑃 C𝑛(𝑖) ≤ 2+ 𝑚 𝑒 𝑚𝑎𝑥 + 2𝑒 8 + 2𝑃 D 𝑐 𝑛 (4.20) 4 𝑛−1 Proof. We observe that:  𝑐  𝑐  𝑐  𝑐  𝑐 𝑃 C𝑛(𝑖) ≤ 𝑃 A𝑛(𝑖) + 𝑃 B𝑛(𝑖) + 𝑃 E𝑛(𝑖) + 𝑃 D𝑛(𝑖) and use 4,5 and 6. □ Lemma 7 (Between clusters connectivity in 𝐺 ′𝑆𝑁 𝑁 (𝑘)) Let I𝑛(𝑖) be the event that 𝐶 (𝑖) (2𝜖 𝑛 ) is isolated from all other clusters in 𝐺 ′𝑆𝑁 𝑁 (𝑘), then for 𝑘 ≤ 𝜌(𝑢 (𝑖) )𝑛/2 − 2 log( 𝛽˜ (𝑖) 𝑛)   𝐾 𝜌(𝑢 (𝑖) ) 𝑘−1  (𝑖)  𝑐  ∑︁ − 𝑛−1 2 2 − 𝑛−1   𝑃 I𝑛 ≤ 𝑒 +𝑃 D𝑛𝑐 . (4.21) 𝑖=1 Proof. We follow the proofs of Proposition 6 and Lemma 7 from Maier et al. (2009). □ Proposition 1 and Lemma 7 will now be used to find a range for k for the rough identification of clusters 𝐶 (𝑖) (2𝜖 𝑛 ) with spectral clustering on 𝐺 ′𝑆𝑁 𝑁 (𝑘). With the term rough identification, we mean that all points of each true cluster 𝐶 (𝑖) belong to the same predicted cluster, which may also have some additional points that do not belong to any cluster. 
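To make the first noisy-case approach concrete, here is a minimal sketch of the denoising step of Algorithm 4.1: estimate the density with a kernel density estimator, discard points whose estimated density falls below t − 2ε_n, and then discard connected components of the remaining SNN graph with fewer than δn points. The Gaussian kernel and all numerical values in the toy example are illustrative assumptions, not the settings analyzed in the theorems.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def denoise_snn_graph(X, W, t, eps_n, delta, h):
    """Return the indices of points kept in the denoised graph G'_SNN(k).

    Points with kernel density estimate below t - 2*eps_n are removed first;
    afterwards any connected component with fewer than delta*n points is also
    removed, mirroring the denoising step of Algorithm 4.1.
    """
    n, m = X.shape
    # Gaussian kernel density estimate with bandwidth h (kernel is an assumption).
    D = cdist(X, X)
    p_hat = np.exp(-0.5 * (D / h) ** 2).sum(axis=1) / (n * (h * np.sqrt(2 * np.pi)) ** m)

    keep = np.where(p_hat >= t - 2 * eps_n)[0]      # density thresholding
    W_kept = W[np.ix_(keep, keep)]

    # Drop connected components containing fewer than delta*n points.
    n_comp, comp = connected_components(csr_matrix(W_kept), directed=False)
    sizes = np.bincount(comp, minlength=n_comp)
    keep = keep[sizes[comp] >= delta * n]
    return keep

# Toy usage: two Gaussian clusters plus uniform background noise; W is a
# stand-in adjacency matrix built from a distance threshold, not a true SNN graph.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(3, 0.3, (40, 2)),
               rng.uniform(-2, 5, (10, 2))])
W = (cdist(X, X) < 0.8).astype(float)
kept = denoise_snn_graph(X, W, t=0.05, eps_n=0.005, delta=0.1, h=0.4)
print(len(kept), "of", len(X), "points kept")
```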
Two important conditions for the results of exact cluster identification are the following: Condition 1: (𝑖) 𝑉 𝑜𝑙 (𝐶 (𝑖) )8𝑚 and  a) 𝑘 ≥ 4𝑚+1 log 2𝑛𝑝 𝑚𝑎𝑥 (𝑖) (𝑖) 𝑚 b) 𝑘 ≤ min 𝜌(𝑢 (𝑖) )𝑛/2 − 2 log(𝛽 (𝑖) 𝑛), (𝑛 − 1) 𝑝 𝑚𝑎𝑥 𝜂𝑚 min{(𝑢 (𝑖) ) 𝑚 , (𝜈𝑚𝑎𝑥  ) } , Condition 2: a) 𝛽 (𝑖) > 2𝛿 b) 𝑝 is three times continuously differentiable with uniformly bounded derivatives 61 𝐾 𝐶 ( 𝑗) (2𝜖 𝑛 ) \ 𝐶 ( 𝑗) ) ≤ 𝛿/2. Ð c) 𝜖 𝑛 sufficiently small such that 𝜇( 𝑖=1 Theorem 2 (Rough cluster identification in noisy case 1) If condition 2 holds, an optimal choice of 𝑘 for the identification of clusters 𝐶 (𝑖) is 𝑘 = 𝑐 1 + 4 log(𝑛)+(𝑛−1) 𝜌 𝑚𝑖𝑛  2+ 𝑡 , for a constant c such that 𝑘 will satisfy condition 1 for every 𝑖 ∈ {1, . . . , 𝐾 }. 4𝑚 𝑝𝑚𝑎𝑥 Also for a kernel density estimator 𝑝ˆ𝑛 with bandwidth ℎ there are constants 𝐶1 , 𝐶2 such that if ℎ2 ≤ 𝐶1 𝜖 𝑛 :      1 𝑛2  − 𝑚+1(𝑘−1)𝑡 2 − 2 𝑛−1 𝜌𝑚𝑖𝑛 𝑘−1 2 − 𝑛−1 𝛿 𝑚𝜖 2 𝑃 Q𝑛 ≥ 1 − 𝐾 2 + 𝑒 4 𝑝𝑚𝑎𝑥 +𝐾 𝑒 + 2𝑒 −𝑛 8 + 3𝑒 −𝐶2 𝑛ℎ 𝑛 , 4𝑚 𝑛 − 1 where Q𝑛 is the event that the algorithm 4.1 roughly identifies all clusters 𝐶 (𝑖) . Proof. If I𝑛(𝑖) is true for all clusters 𝐶 (𝑖) (2𝜖 𝑛 ) then for every 𝑖 such that 1 ≤ 𝑖 ≤ 𝐾, there will be no connections between the subgraph of 𝐺 ′𝑆𝑁 𝑁 (𝑘) containing points of cluster 𝐶 (𝑖) and any other cluster. Furthermore, if C𝑛(𝑖) holds, then all points that belong to 𝐶 (𝑖) are connected on 𝐺 ′𝑆𝑁 𝑁 (𝑘) and points outside 𝐶 (𝑖) are discarded or connected to points of 𝐶 (𝑖) . If I𝑛(𝑖) and C𝑛(𝑖) hold for every cluster, then the adjacency matrix 𝑊 of this graph will be block diagonal. Each block will correspond to a cluster 𝐶 (𝑖) (2𝜖 𝑛 ) and spectral clustering will roughly identify all 𝐶 (𝑖) . To prove that, we follow the same argument regarding the different types of Laplacians as in the proof of Theorem 1. 𝐾 C𝑛(𝑖) and Ñ Let Q𝑛 be the event that that clusters are roughly identified by algorithm 4.1, C𝑛 = 𝑖=1 𝐾 I𝑛(𝑖) . Then using Proposition 1 and Lemma 7 we obtain, Ñ I𝑛 = 𝑖=1    𝑐   𝑐   𝑐   1 𝑛2  − 𝑚+1(𝑘−1)𝑡 2 − 2 𝑛−1 𝜌𝑚𝑖𝑛 𝑘−1 2 − 𝑛−1 −𝑛 𝛿  𝑐  𝑃 (Q𝑛 ) ≤ 𝑃 (C𝑛 ) +𝑃 (I𝑛 ) ≤ 𝐾 2+ 𝑚 𝑒 4 𝑝𝑚𝑎𝑥 +𝐾 𝑒 +2𝑒 8 +3𝑃 (D𝑛 ) , 4 𝑛−1 (𝑖) where 𝜌𝑚𝑖𝑛 = min 𝜌(𝑢 (𝑖) ) and 𝑝 𝑚𝑎𝑥 = max 𝑝 𝑚𝑎𝑥 . 𝑖=1,...,𝐾 𝑖=1,...,𝐾 According to Lemma 9 of Maier et al. (2009) if 𝑝 ∈ 𝐶 2 (R𝑚 ) with || 𝑝|| ∞ = 𝑝 𝑚𝑎𝑥 and 𝑝′ (𝑥) ≠ 0 for 𝑥 in the neighborhood of {𝑝 = 𝑡} then for sufficiently small 𝜖 𝑛 Ø𝐾 ∑︁𝐾 ( 𝑗) ( 𝑗) 𝜇( 𝐶 (2𝜖 𝑛 ) \ 𝐶 )≤𝐶 𝑣𝑜𝑙 (𝜕𝐶 (𝑖) ) 𝑝 𝑚𝑎𝑥 𝜖 𝑛 , 𝑖=1 𝑖=1 62 for some constant 𝐶. Under those conditions for 𝑝 and Theorem 3.1.7 of Prakasa Rao (1983), there exist constants 𝐶1 , 𝐶2 such that when we choose bandwidth ℎ for the estimation of density 𝑝   𝑚 2 that satisfies ℎ2 ≤ 𝐶1 𝜖 𝑛 we get that 𝑃 (D𝑛 ) 𝑐 ≤ 𝑒𝐶2 𝑛ℎ 𝜖 𝑛 . Hence, under the conditions for 𝑘 of Proposition 1 and Lemma 7,      1 𝑛2  (𝑘−1)𝑡 − 𝑚+1 − 𝑛−1 𝜌𝑚𝑖𝑛 𝑘−1 2 − 𝑛−1 𝛿 𝑚𝜖 2 + 𝐾 2𝑒 + 2𝑒 −𝑛 8 + 3𝑒 −𝐶2 𝑛ℎ 2 𝑃 (Q𝑛 ) 𝑐 ≤ 𝐾 2 + 𝑒 4 𝑝𝑚𝑎𝑥 𝑛 . (4.22) 4𝑚 𝑛 − 1 To find an appropriate value of 𝑘 so that 4.22 holds, we follow the same argument as in the proof of Theorem 1 and noticing that last term of the bound of 𝑃((Q𝑛 ) 𝑐 ) is independent of 𝑘. We 4 log(𝑛)+(𝑛−1) 𝜌 𝑚𝑖𝑛  find that an appropriate value for 𝑘 will be 𝑘 = 𝑐 1 + 2+ 4𝑚 𝑝𝑡𝑚𝑎𝑥 , for a constant c such that   condition 1 is satisfied. Again 𝑃 (Q𝑛 ) 𝑐 goes to zero exponentially with n. □ It is important to notice that as 𝑛 increases the boundary points found in 𝐶 (𝑖) (2𝜖 𝑛 ) \ 𝐶 ( 𝑗) , 𝑖 = 1, . . . , 𝐾 will decrease and will be significantly less than the points of the L(t) set, leading to cleaner predicted clusters. Actually 𝐶 (𝑖) (2𝜖 𝑛 ) will collapse to 𝐶 (𝑖) . 
We will refer to the term exact identification when rough identification of clusters is achieved and the ratio of number of points that do not belong to any cluster to number of cluster points goes to zero. Theorem 3 (Exact cluster identification in noisy case 1) Let 𝑝 be three times continuously differentiable with uniformly bounded derivatives and let 𝑝ˆ𝑛 be 1 a kernel density estimator with bandwidth ℎ𝑛 = ℎ0 (𝑙𝑜𝑔𝑛/𝑛) 𝑚+4 for some ℎ0 > 0. For a suitable 2 𝜖0 > 0 set 𝜖 𝑛 = 𝜖0 (𝑙𝑜𝑔𝑛/𝑛) 𝑚+4 . Then there exist constants 𝑐 1 , 𝑐 2 such that for 𝑛 → ∞ and 𝑐 1 𝑙𝑜𝑔𝑛 ≤ 𝑘 ≤ 𝑐 2 𝑛 we obtain cluster 𝐶 (𝑖) is exactly identified by algorithm 4.1 almost surely. Proof. According to Proposition 8 of von Luxburg (2007) if 𝑁𝑐𝑙𝑢𝑠𝑡𝑒𝑟 is the number of cluster points and 𝑁 𝑁𝑜𝐶𝑙𝑢𝑠𝑡𝑒𝑟 is the number of points that do not belong to any cluster, then for 𝜖 𝑛 that goes to 𝐾 𝛽 (𝑖) , there exist a constant 𝐷¯ such that for large 𝑛, Í zero as n goes to infinity and 𝛽 = 𝑖=1  𝐷¯  1 ¯ 𝛽   𝑃 𝑁 𝑁𝑜𝐶𝑙𝑢𝑠𝑡𝑒𝑟 /𝑁𝑐𝑙𝑢𝑠𝑡𝑒𝑟 > 4 𝜖 𝑛 | C𝑛 ≤ 𝑒 − 4 𝐷𝜖 𝑛 𝑛 + 𝑒 − 𝑛 + 𝑃 (D𝑛 ) 𝑐 . (4.23) 𝛽 8 We can choose 𝜖0 such that ℎ2𝑛 ≤ 𝐶𝜖 𝑛 for a suitable constant 𝐶. Then there exist 𝐶2 > 0 with 4   𝑚 2 𝑚 𝑃 (D𝑛 ) 𝑐 ≤ 𝑒 −𝐶2 𝑛ℎ𝑛 𝜖 𝑛 . Notice that 𝑛ℎ𝑛𝑚 𝜖 𝑛2 = ℎ0𝑚 𝜖 0 𝑛 𝑙𝑜𝑔𝑛  𝑚+4 𝑙𝑜𝑔𝑛  𝑚+4 𝑛 𝑛 = ℎ0𝑚 𝜖 0 𝑙𝑜𝑔𝑛. Hence, 63 ∞ Í   ∞ Í  ¯ 𝑃 (D𝑛 ) 𝑐 < ∞. Furthermore by inequality 4.23 we have 𝑃 𝑁 𝑁𝑜𝐶𝑙𝑢𝑠𝑡𝑒𝑟 /𝑁𝑐𝑙𝑢𝑠𝑡𝑒𝑟 > 4 𝐷𝛽 𝜖 𝑛 | 𝑖=1  𝑖=1 C𝑛 < ∞. Following similar proof as of Theorem 2 we can find constants 𝑐 1 , 𝑐 2 such that for 𝑐 1 𝑙𝑜𝑔𝑛 ≤ 𝑘 ≤ 𝑐 2 𝑛 cluster 𝐶 (𝑖) will be roughly identified almost surely, and as a result the event C𝑛 will also occur almost surely. Consequently, 𝑁 𝑁𝑜𝐶𝑙𝑢𝑠𝑡𝑒𝑟 /𝑁𝑐𝑙𝑢𝑠𝑡𝑒𝑟 → 0 alomost surely. 4.3.2.2 Second approach to noisy case - No removal of points As before we denote the connected components of the 𝑡-level of the density 𝑝 by 𝐶 (1) , ..., 𝐶 (𝐾) . For the rest in the support of 𝑝 (i.e. 𝐵 = supp{𝑝} ∖ ∪𝑖=1 𝐾 𝐶 (𝑖) ), we denote: 𝐶˜ (𝑖) = {𝑥 ∈ 𝐵 : 𝑖 = argmin1≤ 𝑗 ≤𝐾 𝑑 (𝑥, 𝐶 ( 𝑗) )}, for 1 ≤ 𝑖 ≤ 𝐾. The ith cluster consists of 𝐶 (𝑖) and 𝐶˜ (𝑖) and no point is removed. Let cluster 𝑖 be denoted as the set 𝐶¯ (𝑖) = 𝐶 (𝑖) ∪ 𝐶˜ (𝑖) . We describe the noisy case as the event that the minimal distance of points between 𝐶 (𝑖) and 𝐶˜ (𝑖) is zero. For this reason, for the minimum density of points in 𝐶¯ (𝑖) , 𝑝 𝑚𝑖𝑛 , will hold that 𝑝 𝑚𝑖𝑛 > 0. Consequently, 𝐶¯ (𝑖) is connected topologically. The following results correspond to clustering with algorithm 4.1 and the choice of graph Laplacian to be the unnormalized, i.e. 𝐿 = 𝐷 − 𝑊. Let us denote the “ideal" version by 𝐿˜ = 𝐷˜ − 𝑊, ˜ namely; 𝑊˜ is the revised version of 𝑊 by removing all connections between different clusters. So 𝑊˜ is a block diagonal matrix (up to permutation of the nodes), with each block corresponding to a true cluster. Table 4.2 provides notations for this case. Lemma 8 (Connectedness of 𝐶¯ (𝑖) ) Let C𝑛(𝑖) be the event that the sample points in 𝐶¯ (𝑖) are connected. Then under the conditions: (𝑖) 1. 𝑘 ≥ 4𝑚+1 𝑙𝑜𝑔(2𝑛𝑝 𝑚𝑎𝑥 𝑉 𝑜𝑙 (𝐶¯ (𝑖) )8𝑚 ) and (𝑖) 2. 𝑘 < 2(𝑛 − 1) 𝑝 𝑚𝑎𝑥 𝜂𝑚 4𝑚 min{(𝑑𝑛(𝑖) ) 𝑚 , (𝜈𝑚𝑎𝑥 (𝑖) 𝑚 ) }, we obtain (𝑖) (𝑘−1) 𝑝 1 𝑛2 − 𝑚𝑖𝑛 𝑃 (C𝑛(𝑖) ) 𝑐 (𝑖) ≤ (2 + 𝑚 )𝑒 4𝑚+1 𝑝𝑚𝑎𝑥 . (4.24) 4 𝑛−1 (𝑖) 𝑉 𝑜𝑙 (𝐶¯ (𝑖) ) Proof. We find a covering of 𝐶¯ (𝑖) ( 𝑝 𝑚𝑖𝑛 + 𝜖 𝑛 ) of 𝑁 ≤ 𝑚 balls with radius 𝑧/4, where 𝜂 𝑚 8𝑧 𝑚 𝑧 ∈ 4(0, min{𝑑𝑛(𝑖) , 𝜈𝑚𝑎𝑥 (𝑖) }). 
$\bar C^{(i)}$                    $C^{(i)} \cup \tilde C^{(i)}$
$p_{\max}^{(i)}$                  supremum of the density attained by points of $\bar C^{(i)}$
$p_{\min}^{(i)}$                  infimum of the density attained by points of $\bar C^{(i)}$
$u^{(i)}$                         lower bound on the distance of $C^{(i)}$ to all $C^{(j)}$ with $j \ne i$
$u^{(ij)}$                        distance between $C^{(i)}$ and $C^{(j)}$
$\rho(u^{(i)})$                   probability mass of balls of radius $u^{(i)}$ in $C^{(i)}$
$\rho_{\min}$                     $\min_{i=1,\dots,K}\rho(u^{(i)})$
$\tilde\rho(u^{(i)})$             probability mass of balls of radius $u^{(i)}$ in $\tilde C^{(i)}$
$\tilde\rho_{\min}$               $\min_{i=1,\dots,K}\tilde\rho(u^{(i)}/2)$
$\beta^{(i)}$                     probability mass of $C^{(i)}$
$\beta_{\max}$                    $\max_{i=1,\dots,K}\beta^{(i)}$
$\tilde\mu^{(i)}$                 probability mass of $\tilde C^{(i)}$
$\tilde\mu_{\max}$                $\max_{i=1,\dots,K}\tilde\mu^{(i)}$
$\hat R_{\min}^{(i)}$             minimal $k$NN radius of $\bar C^{(i)}$
$\tilde R_{\max}^{(i)}$, $R_{\max}^{(i)}$   maximal $k$NN radius of $\tilde C^{(i)}$, $C^{(i)}$
$d_n^{(i)}$                       minimum distance of $\bar C^{(i)}(p_{\min}^{(i)} + \epsilon_n)$ from $\partial\bar C^{(i)}$
$\kappa^{(i)}$                    minimal curvature radius of $\partial\bar C^{(i)}$
$\nu_{\max}^{(i)}$                $\max_{\nu<\kappa^{(i)}}\{\nu \mid \bar C^{(i)}\setminus\mathrm{Col}^{(i)}(\nu)$ is connected$\}$, where $\mathrm{Col}^{(i)}$ is the collar of $\bar C^{(i)}$
Table 4.2 Notations.

If each of the covering balls contains at least two points of $\bar C^{(i)}$ and the event $\{\hat R_{\min}^{(i)} > z\}$ holds, then neighboring covering balls contain points that share common neighbors and hence are connected. We follow the arguments of the proofs of Lemma 1 and Lemma 3 to obtain
$$P\big((\mathcal{C}_n^{(i)})^c\big) \le \left(2+\frac{1}{4^m}\frac{n^2}{n-1}\right)e^{-\frac{(k-1)p_{\min}^{(i)}}{4^{m+1}p_{\max}^{(i)}}}$$
under the stated conditions, with $p_{\max}^{(i)}$, $p_{\min}^{(i)}$ the supremum and the infimum of the density attained by points of $\bar C^{(i)}$. □

Lemma 9 (Isolation of every $\tilde C^{(i)}$ from all $C^{(j)}$)
Let $\tilde{\mathcal{I}}_n$ denote the event that no point in any $\tilde C^{(i)}$ is connected to points in any $C^{(j)}$ with $j \ne i$. Then for $k < \frac{\tilde\rho_{\min} n}{2} - 2\max\{\log(\tilde\mu_{\max}n), \log(\beta_{\max}n)\}$ we obtain
$$P(\tilde{\mathcal{I}}_n^c) \le K^2 e^{-\frac{n-1}{2}\left(\frac{\tilde\rho_{\min}}{2}-\frac{k-1}{n-1}\right)^2}. \qquad (4.25)$$
Proof. Let $\tilde{\mathcal{I}}_n^{(i)}$ denote the event that no point of $\tilde C^{(i)}$ is connected to points in any $C^{(j)}$ with $j \ne i$. Then $\tilde C^{(i)}$ is connected to some $C^{(j)}$, $j \ne i$, if either the event $\{\tilde R_{\max}^{(i)} > u^{(i)}\}$ occurs or $\bigcup_{j\ne i}\{R_{\max}^{(j)} > u^{(ij)}/2\}$ occurs. Following the steps of the proof of Lemma 2, we obtain the bound under the conditions that $k < \frac{\tilde\rho(u^{(i)})\,n}{2} - 2\log(\tilde\mu^{(i)}n)$ and, for every $j \ne i$, $k < \frac{\rho(u^{(j)}/2)\,n}{2} - 2\log(\beta^{(j)}n)$. We reach the stated result by observing that $\tilde\rho(u^{(i)}) > \tilde\rho(u^{(i)}/2) \ge \tilde\rho_{\min}$, $\rho(u^{(j)}) > \tilde\rho(u^{(j)}/2) \ge \tilde\rho_{\min}$ and that $P(\tilde{\mathcal{I}}_n^c) \le \sum_{i=1}^{K}P\big((\tilde{\mathcal{I}}_n^{(i)})^c\big)$. □

Proposition 2 (Distance between the eigenspaces of $L$ and $\tilde L$)
Under the conditions
1. $k \ge 4^{m+1}\log\big(2 n p_{\max}^{(i)}\,\mathrm{Vol}(\bar C^{(i)})\,8^m\big)$ for every $i$, $1 \le i \le K$,
2. $k < 2(n-1)\,p_{\max}^{(i)}\,\frac{\eta_m}{4^m}\min\{(d_n^{(i)})^m,(\nu_{\max}^{(i)})^m\}$ for every $i$, $1 \le i \le K$, and
3. $k < \frac{\tilde\rho_{\min}n}{2} - 2\max\{\log(\tilde\mu_{\max}n), \log(\beta_{\max}n)\}$,
there exists an orthogonal matrix $O \in \mathbb{R}^{K\times K}$ such that
$$\|UO - \tilde U\|_F \le \frac{2^{5/2}\sqrt{K}\sum_{i=1}^{K}\tilde n_i}{\lambda^+(\tilde L)},$$
with probability at least $1 - K\left(2+\frac{1}{4^m}\frac{n^2}{n-1}\right)e^{-\frac{(k-1)p_{\min}}{4^{m+1}p_{\max}}} - K^2 e^{-\frac{n-1}{2}\left(\frac{\rho_{\min}+\tilde\rho_{\min}}{2}-\frac{2(k-1)}{n-1}\right)^2}$, where $U, \tilde U$ are the eigenvector matrices of the $K$ smallest eigenvalues of $L, \tilde L$, respectively, $\tilde n_i$ is the size of $\tilde C^{(i)}$ and $\lambda^+(\tilde L)$ is the $(K+1)$-st smallest eigenvalue of $\tilde L$.

Proof. Let $\mathcal{C}_n$ be the event that every cluster $\bar C^{(i)}$ is connected on $G_{SNN}(k)$. Then if $k$ satisfies the conditions of Lemma 8 for every cluster, we obtain
$$P(\mathcal{C}_n^c) \le K\left(2+\frac{1}{4^m}\frac{n^2}{n-1}\right)e^{-\frac{(k-1)p_{\min}}{4^{m+1}p_{\max}}}, \qquad (4.26)$$
where $p_{\min} = \min_{i=1,\dots,K}p_{\min}^{(i)}$ and $p_{\max} = \max_{i=1,\dots,K}p_{\max}^{(i)}$. Removing any edges that connect subgraphs of different clusters results in a block diagonal adjacency matrix $\tilde W$. As we explained in the proof of Theorem 1, this is equivalent to exact cluster identification by spectral clustering.
Furthermore, let $\mathcal{I}_n$ be the event that every set $C^{(i)}$ is isolated from the other sets $C^{(j)}$, $j \ne i$. Following the proof of Lemma 2, we obtain that if $k < \frac{\rho_{\min}n}{2} - 2\log(\beta_{\max}n)$ then
$$P(\mathcal{I}_n^c) \le K^2 e^{-\frac{n-1}{2}\left(\frac{\rho_{\min}}{2}-\frac{k-1}{n-1}\right)^2}. \qquad (4.27)$$
Since $\frac{\rho_{\min}n}{2} - 2\log(\beta_{\max}n) \ge \frac{\tilde\rho_{\min}n}{2} - 2\max\{\log(\tilde\mu_{\max}n), \log(\beta_{\max}n)\}$, we observe that if $k < \frac{\tilde\rho_{\min}n}{2} - 2\max\{\log(\tilde\mu_{\max}n), \log(\beta_{\max}n)\}$, the upper bounds of inequalities 4.27 and 4.25 both hold.

Now let us denote by $\mathcal{S}_n$ the event that the only connections between any pair of clusters $C^{(i)}, C^{(j)}$ are from the samples in $\tilde C^{(i)}$ and $\tilde C^{(j)}$. Then
$$P(\mathcal{S}_n^c) \le P(\mathcal{C}_n^c) + P(\mathcal{I}_n^c) + P(\tilde{\mathcal{I}}_n^c), \qquad (4.28)$$
and if $k$ satisfies the conditions for the upper bounds of those probabilities we obtain
$$P(\mathcal{S}_n^c) \le K\left(2+\frac{1}{4^m}\frac{n^2}{n-1}\right)e^{-\frac{(k-1)p_{\min}}{4^{m+1}p_{\max}}} + K^2 e^{-\frac{n-1}{2}\left(\frac{\rho_{\min}+\tilde\rho_{\min}}{2}-\frac{2(k-1)}{n-1}\right)^2}. \qquad (4.29)$$
Additionally, by Theorem 2 in Yu et al. (2015) there exists an orthogonal matrix $O \in \mathbb{R}^{K\times K}$ such that
$$\|UO - \tilde U\|_F \le \frac{2^{3/2}\sqrt{K}\,\|L - \tilde L\|_2}{\lambda^+(\tilde L)},$$
where $U, \tilde U$ are the eigenvector matrices of the $K$ smallest eigenvalues of $L, \tilde L$, respectively, $\|L - \tilde L\|_2$ is the spectral norm, and $\lambda^+(\tilde L)$ is the smallest non-zero eigenvalue of $\tilde L$ (i.e. the $(K+1)$-st smallest eigenvalue of $\tilde L$).

Now we would like to bound $\|L - \tilde L\|_2$. First of all, $\|L - \tilde L\|_2 \le \|D - \tilde D\|_2 + \|W - \tilde W\|_2$, and $\|W - \tilde W\|_2$ is determined by the between-cluster connections. So if the event $\mathcal{S}_n$ occurs, and because the Jaccard similarity is at most 1, we notice that
$$\|W - \tilde W\|_2 \le \sum_{i=1}^{K}\tilde n_i \quad\text{and}\quad \|L - \tilde L\|_2 \le 2\sum_{i=1}^{K}\tilde n_i,$$
where $\tilde n_i$ is the sample size of $\tilde C^{(i)}$. Hence, under the condition on $k$ we conclude that
$$\|UO - \tilde U\|_F \le \frac{2^{5/2}\sqrt{K}\sum_{i=1}^{K}\tilde n_i}{\lambda^+(\tilde L)},$$
with probability at least $1 - K\left(2+\frac{1}{4^m}\frac{n^2}{n-1}\right)e^{-\frac{(k-1)p_{\min}}{4^{m+1}p_{\max}}} - K^2 e^{-\frac{n-1}{2}\left(\frac{\rho_{\min}+\tilde\rho_{\min}}{2}-\frac{2(k-1)}{n-1}\right)^2}$. □

Definition 3
The mis-clustering error is defined as
$$M_n = \min_{\sigma\in S_K}\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\big(\sigma(\tilde q_i) \ne q_i\big),$$
where $q_i$ is the true cluster label of the $i$th data point, $\tilde q_i$ is the estimated one and $S_K$ is the set of all possible permutations of $\{1,\dots,K\}$.

Theorem 4 (Mis-clustering error bound)
Under the conditions
1. $k \ge 4^{m+1}\log\big(2 n p_{\max}^{(i)}\,\mathrm{Vol}(\bar C^{(i)})\,8^m\big)$ for every $i$, $1 \le i \le K$,
2. $k < 2(n-1)\,p_{\max}^{(i)}\,\frac{\eta_m}{4^m}\min\{(d_n^{(i)})^m,(\nu_{\max}^{(i)})^m\}$ for every $i$, $1 \le i \le K$, and
3. $k < \frac{\tilde\rho_{\min}n}{2} - 2\max\{\log(\tilde\mu_{\max}n), \log(\beta_{\max}n)\}$,
we obtain
$$M_n \le \frac{256\,n^* K\big(\sum_{i=1}^{K}\tilde n_i\big)^2}{n\,\lambda^+(\tilde L)^2}$$
with probability at least $1 - K\left(2+\frac{1}{4^m}\frac{n^2}{n-1}\right)e^{-\frac{(k-1)p_{\min}}{4^{m+1}p_{\max}}} - K^2 e^{-\frac{n-1}{2}\left(\frac{\rho_{\min}+\tilde\rho_{\min}}{2}-\frac{2(k-1)}{n-1}\right)^2}$. Here, $n^* = \max_{1\le i\le K}n_i$, where $n_i$ is the size of cluster $\bar C^{(i)}$.

Proof. We notice that $\{\tilde q_i\}_{i=1}^{n}$ are obtained by running $k$-means on $U \in \mathbb{R}^{n\times K}$. Let us define their associated centroids by $\{\tilde h_i\}_{i=1}^{n}$. Note also that $\{\tilde h_i\}_{i=1}^{n}$ contains $K$ unique vectors. Further define
$$\tilde H = \mathrm{argmin}_{H\in\mathbb{R}^{n\times K}:\ H\ \text{has}\ K\ \text{unique rows}}\ \|U - H\|_F^2.$$
It is clear that $\tilde H = (\tilde h_1, \tilde h_2, \dots, \tilde h_n)^T \in \mathbb{R}^{n\times K}$. We define the set
$$A = \Big\{1 \le i \le n : \|\tilde h_i - e_i^T\tilde U O^T\|_2 \ge \frac{1}{\sqrt{2n^*}}\Big\},$$
where $\{e_i\}_{i=1}^{n}$ is the standard basis of $\mathbb{R}^n$ and $n^* = \max_{1\le i\le K}n_i$, where $n_i$ is the sample size of the $i$th cluster $C^{(i)}\cup\tilde C^{(i)}$. Then,
$$\|\tilde h_i - e_i^T\tilde U O^T\|_2 < \frac{1}{\sqrt{2n^*}}, \quad\text{for } i \notin A. \qquad (4.30)$$
Also,
$$\tilde U = \begin{pmatrix}\frac{1}{\sqrt{n_1}}\mathbf{1}_{n_1} & 0 & \cdots & 0\\ 0 & \frac{1}{\sqrt{n_2}}\mathbf{1}_{n_2} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \frac{1}{\sqrt{n_K}}\mathbf{1}_{n_K}\end{pmatrix}P, \qquad (4.31)$$
with $P$ being orthogonal.
By 4.31, for points $i$ and $j$ that belong to different clusters,
$$\|e_i^T\tilde U O^T - e_j^T\tilde U O^T\|_2 \ge \sqrt{\frac{2}{n^*}}. \qquad (4.32)$$
Combining 4.30 and 4.32, for $i \notin A$ and $j$ in a different cluster than $i$,
$$\|\tilde h_i - e_j^T\tilde U O^T\|_2 \ge \|e_i^T\tilde U O^T - e_j^T\tilde U O^T\|_2 - \|\tilde h_i - e_i^T\tilde U O^T\|_2 > \|e_i^T\tilde U O^T - e_j^T\tilde U O^T\|_2 - \frac{1}{\sqrt{2n^*}} > \frac{1}{\sqrt{2n^*}}.$$
Consequently,
$$M_n \le \frac{|A|}{n} \le \frac{1}{n}\sum_{i\in A}1 \le \frac{2n^*}{n}\sum_{i\in A}\|\tilde h_i - e_i^T\tilde U O^T\|_2^2 \le \frac{2n^*}{n}\|\tilde H - \tilde U O^T\|_F^2 \le \frac{2n^*}{n}\big(\|\tilde H - U\|_F + \|U - \tilde U O^T\|_F\big)^2 \le \frac{8n^*}{n}\|U - \tilde U O^T\|_F^2 = \frac{8n^*}{n}\|UO - \tilde U\|_F^2 \le \frac{8n^*}{n}\left(\frac{2^{5/2}\sqrt{K}\sum_{i=1}^{K}\tilde n_i}{\lambda^+(\tilde L)}\right)^2 \le \frac{256\,n^* K\big(\sum_{i=1}^{K}\tilde n_i\big)^2}{n\,\lambda^+(\tilde L)^2},$$
using Proposition 2, with probability at least $1 - K\left(2+\frac{1}{4^m}\frac{n^2}{n-1}\right)e^{-\frac{(k-1)p_{\min}}{4^{m+1}p_{\max}}} - K^2 e^{-\frac{n-1}{2}\left(\frac{\rho_{\min}+\tilde\rho_{\min}}{2}-\frac{2(k-1)}{n-1}\right)^2}$. □

4.4 A general algorithm for tuning of clustering method parameters
To construct an SNN graph as described in section 4.2.1, one must first construct a $k$NN graph and then remove the edges between $k$NN neighbors that do not share any of their neighbors. The parameter $k$ affects the structure of the SNN graph and hence any other method that is based on it, such as algorithm 4.1. Many SNN graph-based clustering methods (Stuart et al., 2019; Xu and Su, 2015) do not use data-driven ways to decide on the value of $k$. Below we introduce a general cross-validation tuning algorithm, called $k$cv tuning, that provides a choice of a clustering parameter based on the data. We apply this algorithm to find a choice of the parameter $k$, the number of nearest neighbors used in the SNN spectral clustering algorithm 4.1, on simulated data with different signal-to-noise levels and feature structures. The simulated data are described in section 4.4.2 and the tools used to assess the performance of algorithm 4.3 are introduced in section 4.4.3. The performance results are summarized in section 4.4.4.

4.4.1 $k$cv tuning algorithm
The introduced cross-validation method suggests a tuning of a parameter $k$ of a clustering method based on the idea that, when the clustering is optimal, points in a cluster can predict with high accuracy the features of other points in the same cluster. A similar methodology has been used for the tuning of model parameters in Li et al. (2020). The $k$cv tuning algorithm works in $N$ folds.
- Given a dataset $X$, in every fold a version of $X$ is created by randomly removing 10% of the entries of $X$.
- Then, Singular Value Thresholding is applied to each version to extract its low-rank approximation and hence a completed matrix $\hat A$.
- The chosen clustering algorithm is applied to $\hat A$ for multiple values of $k$.
- Next, the missing entries of $X$ are predicted. Specifically, a missing feature of a data point in the predicted cluster $c$ associated with a specific value of $k$ is predicted from the data points of $X$ that also belong to cluster $c$.
- The optimal $k$ is chosen as the one that attains the lowest average prediction error across the $N$ folds.
Notice that for the completion of $A_q$ we first find its SVD, i.e. $A_q = UDV^T$, and then we use SVD thresholding to obtain the final result $\hat A_q = \frac{1}{m}UD_{\hat k}V^T$, where $D_{\hat k} = \mathrm{diag}(D_{1,1}, D_{2,2}, \dots, D_{\hat k,\hat k}, 0, \dots, 0)$.
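As an illustration, a minimal Python sketch of this completion step is given below; the function name is illustrative, the held-out entries of $A_q$ are assumed to have already been set to zero, and the $1/m$ rescaling simply mirrors the formula stated above.

```python
import numpy as np


def svd_threshold_complete(A_q, k_hat, m):
    """Low-rank completion of the masked matrix A_q: keep the k_hat largest
    singular values of its SVD and rescale by 1/m, as described above."""
    U, d, Vt = np.linalg.svd(A_q, full_matrices=False)
    d[k_hat:] = 0.0                     # D_khat: zero all but the first k_hat singular values
    return (U * d) @ Vt / m
```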
Algorithm 4.3 kcv_tuning
Input: $X \in \mathbb{R}^{n\times m}$, number of clusters $K$, clustering function $G$, nearest neighbors $k$, number of folds $N$, prediction method (mean, ols, lasso), low-rank approximation threshold $\hat k$, training percentage $p$
Output: Optimal number of nearest neighbors, $k_{optimal}$
for $q = 1$ to $N$ do
    $A_q \leftarrow X$ with a fraction $(1-p)$ of its entries randomly replaced by 0
    $I_q \leftarrow \{(i,j) : x_{i,j}$ replaced with 0 in $A_q\}$
    % $A_q$ completion via Singular Value Thresholding
    $\hat A_q \leftarrow$ SVD of $A_q$ with threshold $\hat k$
    % Evaluation of the performance of $k$
    for $k = 10$ to $\lceil n/3\rceil$, in steps of 20, do
        $\ell_k \leftarrow G(\hat A_q, K, k)$ (predicted membership)
        for $(i,j) \in I_q$ do
            $C_i \leftarrow$ points in the same cluster as point $i$
            $Y = [X_{rj}]$ for $r \in C_i$
            $Z = [X_{rt}]$ for $r \in C_i$ and $t \ne j$
            % Prediction of missing values
            if prediction_method = mean then
                $\hat x_{i,j} \leftarrow \bar Y$
            else if prediction_method = ols then
                $\hat\beta \leftarrow \mathrm{argmin}_\beta\{\|Y - Z\beta\|^2\}$; $\hat x_{i,j} \leftarrow z_i^T\hat\beta$, where $z_i = [X_{it}]$ for $t \ne j$
            else if prediction_method = lasso then
                $(\lambda_0, \hat\beta) \leftarrow \mathrm{argmin}_{\lambda,\beta}\{\|Y - Z\beta\|^2 + \lambda\|\beta\|_1\}$; $\hat x_{i,j} \leftarrow z_i^T\hat\beta$
        % Mean prediction error of $k$ in fold $q$
        $L_{k,q} = \frac{1}{|I_q|}\sum_{(i,j)\in I_q}(x_{ij} - \hat x_{ij})^2$
% Mean prediction error of $k$
$L_k = \frac{1}{N}\sum_{q=1}^{N}L_{k,q}$
% Optimal $k$
$k_{optimal} = \mathrm{argmin}_k(L_k)$

4.4.2 Simulations
We test the performance of algorithm 4.3, when used for the tuning of the parameter $k$ of algorithm 4.1, on various Multivariate Gaussian (MG) data. We consider a simulated data set of $n$ data points $x_1, x_2, \dots, x_n \in \mathbb{R}^m$ that are grouped in $K$ clusters. The data are independently sampled from a Multivariate Gaussian mixture model
$$\sum_{i=1}^{K}\pi_i N(\mu_i, \Sigma_i).$$
The coordinates of the centers of the Gaussian mixture follow the standard normal distribution, $\mu_{ij} \overset{i.i.d.}{\sim} N(0,1)$ for $i = 1,\dots,K$, $j = 1,\dots,m$, and the clusters have equal sizes, i.e. $\pi_1 = \cdots = \pi_K = 1/K$. We consider three types of covariance matrices $\Sigma_k$:
1. the simple case where $\Sigma_1 = \cdots = \Sigma_K = ms\cdot I$;
2. the case where $\Sigma_1 = \cdots = \Sigma_K = ms\cdot\Sigma$ and the corresponding precision matrix $\Omega = \Sigma^{-1}$ is tridiagonal. This case simulates a chain dependency between the features of the data. Specifically, $\Sigma = \{\sigma_{i,j}\}$ with $\sigma_{i,j} = 0.5^{|i-j|}$;
3. the case where $\Sigma_1 = \cdots = \Sigma_K = ms\cdot\Sigma$ and the corresponding precision matrix $\Omega = \Sigma^{-1}$ is sparse and simulates a network dependency between the features of the data. For the construction of $\Omega$ we follow the simulation procedure of Li and Gui (2005).
For the assessment of the $k$cv tuning algorithm, we simulated Multivariate Gaussian data of $n = 1000$ points and $m = 10$ or $50$ features. Table 4.3 lists the data settings considered, based on the type of covariance matrix used; a small simulation sketch follows the table.

(a) Simple MG data
m     s=0.1        s=0.3        s=0.5
10    setting 1    setting 2    setting 3
50    setting 4    setting 5    setting 6

(b) MG data with tridiagonal precision matrix
m     s=0.1        s=0.2        s=0.3
10    setting 7    setting 8    setting 9
50    setting 10   setting 11   setting 12

(c) MG data with network feature dependency
m     s=0.06       s=0.16       s=0.23
10    setting 13   setting 14   setting 15
50    setting 16   setting 17   setting 18

Table 4.3 Settings of simulated data.
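The sketch below indicates how such data can be generated in Python for the first two covariance families; the network-structured precision matrix of Li and Gui (2005) is not reproduced here, the function name is illustrative, and the covariance scale follows the $ms\cdot\Sigma$ form stated above. The number of clusters $K$ is left as an argument, since it is not fixed by the description above.

```python
import numpy as np


def simulate_mg_mixture(n, m, K, s, cov_type="identity", seed=0):
    """Sample n points in R^m from an equally weighted K-component Gaussian
    mixture with centers mu_ij ~ N(0, 1) and common covariance m*s*Sigma."""
    rng = np.random.default_rng(seed)
    if cov_type == "identity":
        Sigma = np.eye(m)                                    # simple case (settings 1-6)
    elif cov_type == "tridiagonal_precision":
        idx = np.arange(m)
        Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])   # sigma_ij = 0.5^|i-j| (settings 7-12)
    else:
        raise NotImplementedError("network precision of Li and Gui (2005) not shown here")
    centers = rng.standard_normal((K, m))
    labels = np.arange(n) % K                                # equal cluster sizes, pi_i = 1/K
    X = centers[labels] + rng.multivariate_normal(np.zeros(m), m * s * Sigma, size=n)
    return X, labels
```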
4.4.3 Assessment methods
We introduce the Normalized Prediction Accuracy function, which measures the performance of a value of $k$ for the clustering of a data set utilizing the average prediction loss. Additionally, we define the ARI Relative Ratio, which measures how close the choice of $k$ suggested by algorithm 4.3 is to the value that achieves the maximum ARI.

Definition 4 (Normalized Prediction Accuracy - NPA)
Let $\mathcal{K}$ be the set of candidate values for the parameter $k$ of the number of nearest neighbors. Let $L_k^i$ be the mean prediction error of $k$ associated with the prediction of the held-out entries of the data matrix $X$ in simulation iteration $i$. The Normalized Prediction Accuracy, $F$, of $k$ for iteration $i$ is
$$F(k,i) = 1 - \frac{L_k^i - \min_{j\in\mathcal{K}}(L_j^i)}{\max_{j\in\mathcal{K}}(L_j^i) - \min_{j\in\mathcal{K}}(L_j^i)}.$$

Definition 5 (ARI Relative Ratio)
Let $\tilde k = \mathrm{argmax}_{k\in\mathcal{K}}(\mathrm{ARI})$ when running the Snn_Spectral_Clustering algorithm, and let $\hat k$ be the optimal $k$ based on the kcv_tuning algorithm. The ARI Relative Ratio for iteration $i$, $R(i)$, is
$$R(i) = \frac{\mathrm{ARI}(\tilde k)_i - \mathrm{ARI}(\hat k)_i}{\mathrm{ARI}(\tilde k)_i},$$
where $\mathrm{ARI}(k)_i$ is the Adjusted Rand Index of the clustering produced using $k$ nearest neighbors in iteration $i$.
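For concreteness, both criteria can be computed directly from the per-$k$ prediction losses and ARI values of a single simulation iteration, for instance as in the following sketch (array and function names are illustrative):

```python
import numpy as np


def normalized_prediction_accuracy(losses):
    """F(k, i) for every candidate k in one iteration; losses[j] is the mean
    prediction error of the j-th candidate value of k."""
    L = np.asarray(losses, dtype=float)
    return 1.0 - (L - L.min()) / (L.max() - L.min())


def ari_relative_ratio(ari_by_k, k_hat_index):
    """R(i) = (ARI(k_tilde) - ARI(k_hat)) / ARI(k_tilde), where k_tilde maximizes the
    ARI over the candidate grid and k_hat is the value chosen by the kcv tuning."""
    ari = np.asarray(ari_by_k, dtype=float)
    return (ari.max() - ari[k_hat_index]) / ari.max()
```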
4.4.4 Simulation Results
The simulated data were used to tune the parameter $k$ of the SNN spectral clustering algorithm introduced in section 4.2.2. Below we describe the performance results of the $k$cv tuning algorithm.

To summarize the NPA results for a particular value of $k$, we use the mean NPA of this value over a round of 1000 simulations. The maximum mean NPA is achieved for the value of $k$ that the $k$cv tuning algorithm suggests as optimal. It is interesting to examine whether the mean NPA is maximized at the same value of $k$ that achieves the maximum mean ARI over a round of simulations. We find that this depends on the prediction method used, along with the structure of the feature dependency.

For data settings 1-6 (figures A.2, A.4) we observe that the ARI of the SNN spectral algorithm increases for $10 \le k \le 50$, and for larger values of $k$ the ARI remains about the same. The maximum ARI is achieved at $k = 330$, i.e. the largest $k$ we consider during tuning. The NPA curve is also maximized at $k = 330$ for mean, ols, and lasso prediction. Mean and lasso perform better than ols and achieve lower values of the ARI ratios. This is because ols uses information from all features, introducing more variance in the prediction, since these settings simulate data sets with independent features. Mean uses information from only one feature and lasso selects a few of the features, and hence both perform better than ols.

For data settings 7-9 and 13-15, the maximum ARI achieved by the SNN spectral clustering algorithm is attained for $k$ within the range $[10, 90]$ (figures B.2, B.4, C.2). Prediction with ols or mean fails to tune $k$ within this range, in contrast to lasso. This is because the simulated data of types 7-15 do not have independent features; every feature depends on 2-5 other features. The ols prediction utilizes information from every feature and introduces more error, and the mean prediction takes into account the information of only one feature, whereas lasso uses information only from the features correlated with the one of interest. For this reason, lasso performs better for types 7-9 and 13-15.

In settings 10-12 and 16-18, although feature selection methods like lasso are expected to perform better than ols and mean, they are observed to have similar performance. Here the simulated data have a low signal-to-noise ratio, and hence a ridge regression or elastic net prediction might perform better.

The mean and median ARI Relative Ratios provide an estimate of the proportion of the difference between the maximum ARI and the ARI achieved by clustering with the tuned $k$. The settings with a larger variance factor $s$, as shown in Table 4.3, have higher mean and median ARI Relative Ratios than settings with lower $s$. In more detail, we observe that for settings 1-6 the ARI of SNN spectral clustering after tuning is about 1% to 10% less than the optimum ARI when using mean or lasso prediction for tuning (tables A.1, A.2). For settings 7-9, lasso achieves the best tuning results, with an ARI that is 3% to 14% smaller than the optimum ARI (table B.1). The chain dependency of a higher number of features makes settings 10-12 harder to tune for $k$. The tuned $k$ obtains an ARI difference from the optimum of between 3% and 20% (table B.2). However, for the final set of simulations (settings 13-18) the features of each dataset can be represented as a network, and every feature depends on 5 (settings 13-15) or 12 other features (settings 16-18). In this case, the ARI of SNN spectral clustering after tuning is only 0.6% to 10% lower than the optimum (tables C.1, C.2).

4.5 Conclusions
In this chapter, we conducted an investigation into the clustering performance of an SNN graph-based method. For this method, we build the SNN graph as a subgraph of a $k$NN graph based on the Jaccard similarity of the $k$NN neighbor sets of the vertices. The parameter $k$ affects the structure of the SNN graph and hence the clustering performance. Our goal was to determine for which values of $k$ the SNN spectral clustering algorithm can achieve true cluster identification with high probability. Our results suggest that in both the noise-free and the noisy case one needs to select $k$ of the order $cn$ to maximize the probability of cluster identification, in contrast to the random geometric graph literature, which suggests $k$ of order $\log n$ (Brito et al., 1997). Furthermore, we introduce a general cross-validation tuning method for parameters of clustering algorithms. We use this method to tune the number of nearest neighbors $k$ of the SNN spectral clustering algorithm for a variety of simulated data types and find that the accuracy of the clustering results after using the tuned value is 1% to 20% lower than the accuracy achieved by the optimum $k$, depending on the feature dependency of the data.

BIBLIOGRAPHY

Brito, M. R., Chávez, E. L., Quiroz, A. J., and Yukich, J. E. (1997). Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection. Statistics & Probability Letters, 35(1):33–42.

Guattery, S. and Miller, G. L. (1998). On the quality of spectral separators. SIAM Journal on Matrix Analysis and Applications, 19(3):701–719.

Li, H. and Gui, J. (2005). Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics, 7(2):302–317.

Li, T., Levina, E., and Zhu, J. (2020). Network cross-validation by edge sampling. Biometrika, 107(2):257–276.

Luxburg, U., Bousquet, O., and Belkin, M. (2004). Limits of spectral clustering. Advances in Neural Information Processing Systems, 17.

Maier, M., Hein, M., and von Luxburg, U. (2009). Optimal construction of k-nearest-neighbor graphs for identifying noisy clusters. Theoretical Computer Science, 410(19):1749–1764.

Meilă, M. and Shi, J. (2001). A random walks view of spectral segmentation. In International Workshop on Artificial Intelligence and Statistics, pages 203–208. PMLR.

Ng, A., Jordan, M., and Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 14.

Prakasa Rao, B. (1983). Nonparametric Functional Estimation. Probability and Mathematical Statistics: A Series of Monographs and Textbooks. Academic Press.
Spielman, D. A. and Teng, S.-H. (1996). Spectral partitioning works: Planar graphs and finite element meshes. In Proceedings of the 37th Conference on Foundations of Computer Science, pages 96–105. IEEE.

Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., Mauck III, W. M., Hao, Y., Stoeckius, M., Smibert, P., and Satija, R. (2019). Comprehensive integration of single-cell data. Cell, 177:1888–1902.

von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17.

Xu, C. and Su, Z. (2015). Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics, 31(12):1974–1980.

Yu, Y., Wang, T., and Samworth, R. J. (2015). A useful variant of the Davis–Kahan theorem for statisticians. Biometrika, 102(2):315–323.

APPENDIX A

PERFORMANCE ON GAUSSIAN DATA WITH DIAGONAL COVARIANCE MATRIX

prediction   setting   Median   Mean    Sd.Error
mean         1         0.007    0.010   0.001
mean         2         0.021    0.042   0.004
mean         3         0.053    0.099   0.010
ols          1         0.006    0.011   0.001
ols          2         0.024    0.053   0.005
ols          3         0.055    0.109   0.010
lasso        1         0.007    0.012   0.001
lasso        2         0.023    0.044   0.004
lasso        3         0.051    0.102   0.010
Table A.1 ARI ratios summary for settings 1, 2 and 3.

prediction   setting   Median   Mean    Sd.Error
mean         4         0.009    0.012   0.001
mean         5         0.030    0.074   0.008
mean         6         0.114    0.280   0.023
ols          4         0.009    0.019   0.002
ols          5         0.049    0.151   0.016
ols          6         0.132    0.309   0.024
lasso        4         0.009    0.011   0.001
lasso        5         0.030    0.078   0.010
lasso        6         0.100    0.249   0.021
Table A.2 ARI ratios summary for settings 4, 5 and 6.

[Figure A.1: ARI ratio comparisons for settings 1, 2 and 3; scatter plots of the ARI ratios of ols vs mean, lasso vs mean, and lasso vs ols prediction.]

[Figure A.2: ARI and NPA as functions of k for settings 1, 2 and 3, under mean, ols, and lasso prediction.]
[Figure A.3: ARI ratio comparisons for settings 4, 5 and 6; scatter plots of the ARI ratios of ols vs mean, lasso vs mean, and lasso vs ols prediction.]

[Figure A.4: ARI and NPA as functions of k for settings 4, 5 and 6, under mean, ols, and lasso prediction.]

APPENDIX B

PERFORMANCE ON GAUSSIAN DATA WITH TRIDIAGONAL PRECISION MATRIX

prediction   setting   Median   Mean    Sd.Error
mean         7         0.029    0.046   0.004
mean         8         0.094    0.116   0.008
mean         9         0.118    0.158   0.011
ols          7         0.023    0.036   0.003
ols          8         0.067    0.101   0.008
ols          9         0.102    0.143   0.011
lasso        7         0.015    0.030   0.003
lasso        8         0.060    0.094   0.007
lasso        9         0.099    0.141   0.010
Table B.1 ARI ratios summary for settings 7, 8 and 9.

prediction   setting   Median   Mean    Sd.Error
mean         10        0.038    0.048   0.003
mean         11        0.171    0.201   0.012
mean         12        0.243    0.281   0.015
ols          10        0.025    0.040   0.003
ols          11        0.163    0.198   0.012
ols          12        0.102    0.143   0.011
lasso        10        0.027    0.038   0.002
lasso        11        0.147    0.202   0.013
lasso        12        0.219    0.271   0.016
Table B.2 ARI ratios summary for settings 10, 11 and 12.

[Figure B.1: ARI ratio comparisons for settings 7, 8 and 9; scatter plots of the ARI ratios of ols vs mean, lasso vs mean, and lasso vs ols prediction.]
[Figure B.2: ARI and NPA as functions of k for settings 7, 8 and 9, under mean, ols, and lasso prediction.]

[Figure B.3: ARI ratio comparisons for settings 10, 11 and 12; scatter plots of the ARI ratios of ols vs mean, lasso vs mean, and lasso vs ols prediction.]

[Figure B.4: ARI and NPA as functions of k for settings 10, 11 and 12, under mean, ols, and lasso prediction.]

APPENDIX C

PERFORMANCE ON GAUSSIAN DATA WITH NETWORK OF FEATURES

prediction   setting   Median   Mean    Sd.Error
mean         13        0.015    0.026   0.003
mean         14        0.058    0.078   0.006
mean         15        0.088    0.121   0.009
ols          13        0.012    0.020   0.002
ols          14        0.048    0.071   0.005
ols          15        0.077    0.105   0.008
lasso        13        0.009    0.017   0.001
lasso        14        0.047    0.072   0.006
lasso        15        0.072    0.104   0.008
Table C.1 ARI ratios summary for settings 13, 14 and 15.

prediction   setting   Median   Mean    Sd.Error
mean         16        0.006    0.006   0.000
mean         17        0.027    0.036   0.003
mean         18        0.050    0.107   0.012
ols          16        0.006    0.008   0.001
ols          17        0.029    0.056   0.006
ols          18        0.057    0.147   0.014
lasso        16        0.006    0.007   0.000
lasso        17        0.027    0.053   0.006
lasso        18        0.061    0.164   0.015
Table C.2 ARI ratios summary for settings 16, 17 and 18.
[Figure C.1: ARI ratio comparisons for settings 13, 14 and 15; scatter plots of the ARI ratios of ols vs mean, lasso vs mean, and lasso vs ols prediction.]

[Figure C.2: ARI and NPA as functions of k for settings 13, 14 and 15, under mean, ols, and lasso prediction.]

[Figure C.3: ARI ratio comparisons for settings 16, 17 and 18; scatter plots of the ARI ratios of ols vs mean, lasso vs mean, and lasso vs ols prediction.]

[Figure C.4: ARI and NPA as functions of k for settings 16, 17 and 18, under mean, ols, and lasso prediction.]