GRAPH-BASED CLUSTERING ALGORITHMS FOR SINGLE-CELL RNA SEQUENCING DATA: METHODS AND THEORY By Andriana Manousidaki A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics—Doctor of Philosophy 2023 ABSTRACT The innovative technology of single-cell RNA sequencing (scRNAseq) allows us to extract gene expression information from each cell of a tissue, resulting in data sets of tens of thousands to millions of points (cells). Clustering of cells based on the similarity of their gene expression enables the understanding of their functions and hence the characterization of cell types in a tissue. This dissertation focuses on the most widely used clustering methodology for scRNAseq data – clustering based on the graph representation of data points (cells as vertices on a graph). Firstly, we showcase how existing methods can effectively identify an important group of tumor growth related cells in the analysis of head and neck cancer scRNAseq data. The newly discovered marker genes can potentially facilitate new therapy approaches. Secondly, we introduce a novel clustering method that preserves both the global data geometry and cluster structure, via multidimensional scaling based on power-weighted path metrics. The new method outperforms prevailing scRNAseq clustering algorithms on a wide range of benchmarking data sets. Thirdly, we study spectral clustering on shared nearest neighbors (SNN) graphs. In contrast to current ad-hoc methods for number of neighbors selection, we develop a general cross-validation tuning algorithm to achieve effective clustering. Moreover, we provide a comprehensive theoretical analysis of SNN based spectral clustering in the nonparametric setting. Our theoretical results reveal an optimal range of the number of neighbors for cluster identification and characterize the impact of data density on spectral clustering. This dissertation is dedicated to my family. iii ACKNOWLEDGEMENTS First and foremost, I would like to extend my gratitude to my advisors Dr. Haolei Weng and Dr. Yuying Xie, for their valuable mentoring and the precious research opportunities that offered me during my Ph.D. journey. Additionally, I would like to thank Dr. Anna Little and Dr. Yuehua Cui for their advice as members of my guidance committee. Furthermore, I would like to highlight my gratitude to Professor Camille Fairbourn for her continuous support and mentoring throughout my professional endeavours. Moreover, I feel deeply thankful to Dr. Dimitra Papadovasilaki for advising me during my Ph.D life. My life in East Lansing and my Ph.D studies wouldn’t be so pleasant if I had not met my roommate and best friend Dr. Ana-Maria Raicu and the rest of my "Spartan" friends Dr. Ilias Magoulas, Eleni Lygda, Dr. Christos Gregoriadis, Dr. Ioannis Zaxos, Dr. George Psaromiligos, Dr. Michalis Paparizos and Dr. Dimitris Vardakis who have helped since before I arrived in United States and we are now family. Also I would like to thank my most recent friends Estefania Blancas Garcias, Manos Kokarakis and Mylena Ortiz. I would like to highlight my gratitude to my husband, Dr. Marios Velivasakis, the most enthusiastic Mathematician I know for his constant support and mathematical discussion which played an important role for the completion of this thesis. I also want to thank my mother-in-law and father-in-law, Sofia Zervou and George Velivasakis for their love. I am proud to call them my parents. 
I also want to thank my sisters Catherine Manousidaki and Dr. Maria Manousidaki, because they have shown to me what means to be a strong independent modern woman. With their studies, career and life experiences paved an easier way for me toward my Ph.D. Finally, I would like to thank my father Ioannis Manousidakis and my mother Evaggelia Maslimopoulou for believing in me, encouraging me to pursue my dreams and teaching me how to continue fighting when things are hard. iv TABLE OF CONTENTS CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 CHAPTER 2 SINGLE-CELL ANALYSIS OF CANCER STEM CELLS IN HEAD AND NECK CANCER . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 CHAPTER 3 CLUSTERING AND VISUALIZATION OF SINGLE-CELL RNA-SEQ DATA USING PATH METRICS . . . . . . . . . . . . . . . . . . . . . . 18 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 APPENDIX A DATA PREPROCESSING . . . . . . . . . . . . . . . . . . . . 41 APPENDIX B ADDITIONAL CLUSTERING RESULTS . . . . . . . . . . . . 44 CHAPTER 4 SHARED NEAREST NEIGHBORS GRAPH BASED SPECTRAL CLUSTERING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 APPENDIX A PERFORMANCE ON GAUSSIAN DATA WITH DIAGONAL COVARIANCE MATRIX . . . . . . . . . . . . . . . . . . . . . 77 APPENDIX B PERFORMANCE ON GAUSSIAN DATA WITH TRIDIAGONAL PRECISION MATRIX . . . . . . . . . . . . . . . . . . . . . . 82 APPENDIX C PERFORMANCE ON GAUSSIAN DATA WITH NETWORK OF FEATURES . . . . . . . . . . . . . . . . . . . . . . . . . . 87 v CHAPTER 1 INTRODUCTION The pioneering technology of single-cell RNA sequencing (scRNAseq) enables the extraction of gene expression information from individual cells within a tissue, yielding datasets comprising tens of thousands to millions of cellular data points. Clustering cells based on the congruity of their gene expression profiles facilitates the comprehension of their functional attributes, thereby enabling the characterization of distinct cell types within a given tissue. Prevalent clustering methodologies developed for scRNAseq data rely on the representation of data points (cells) as vertices in a graph (Stuart et al., 2019; Wolf et al., 2018). The present dissertation primarily focuses on graph-based clustering methods tailored for scRNAseq data analysis. Firstly, we present the contribution of established approaches to the identification of Cancer Stem Cells (CSCs), a cellular cohort characterized by their resistance to therapeutic interventions and their pivotal role in tumor initiation and progression (Chen et al., 2021; Mroz et al., 2015). Secondly, we introduce a novel clustering methodology denoted as Single-Cell Path Metrics Profiling (scPMP), which concurrently upholds both local cluster structure and global data geometry. Thirdly, we undertake an exploration of the performance of Spectral Clustering on Shared Nearest Neighbors (SNN) graphs in relationship with the parameter of nearest neighbors used in the construction of the SNN grpah. We finally suggest a general cross-validation method for the tuning of this parameter. In Chapter 2, an in-depth analysis of scRNAseq data originating from cell cultures of head and neck cancer lines, as well as 10 primary tumors, is conducted. 
The primary objective of this analysis revolves around the identification of the most homogeneous cluster of CSCs within each dataset, while simultaneously elucidating their dynamic states and plasticity via an extension of the repertoire of CSC marker genes. Chapter 3 presents the introduction of the scPMP algorithm, a novel clustering methodology predicated upon path-metric distances among cells. Unlike conventional distance metrics, such as the Euclidean distance, path metrics possess the capacity to discern density variations and faithfully uphold the underlying data geometry. By integrating path metrics with multidimensional scaling 1 techniques, we obtain a low-dimensional representation of the data that faithfully encapsulates both the global data geometry and cluster structure. The efficacy of the scPMP algorithm is evaluated comprehensively in terms of clustering quality and geometric fidelity, ultimately establishing its superiority over current scRNAseq clustering algorithms across a diverse spectrum of benchmark datasets. Chapter 4 delves into Spectral Clustering on SNN graphs. SNN graphs are constructed based on a 𝑘 Nearest Neighbors (𝑘NN) graph, thus rendering their properties contingent upon the choice of the parameter 𝑘. Our findings indicate that, in both the absence of noise and the presence of noise, it is imperative to select 𝑘 of the magnitude 𝑐𝑛 in order to maximize the likelihood of cluster identification. This contrasts with the literature on random geometric graphs, which suggests an order of log 𝑛 for 𝑘 (Brito et al., 1997). Additionally, we propose a comprehensive cross-validation tuning approach for fine-tuning the parameters of clustering algorithms. We employ this approach to determine the optimal number of nearest neighbors, denoted as 𝑘, for the SNN spectral clustering algorithm using various types of simulated data. 2 BIBLIOGRAPHY Brito, M. R., Chávez, E. L., Quiroz, A. J., and Yukich, J. E. (1997). Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection. Statistics & Probability Letters, 35(1):33–42. Chen, C.-Y., Ueha, S., Ishiwata, Y., Shichino, S., Yokochi, S., Yang, D., Oppenheim, J. J., Ogiwara, H., Deshimaru, S., Kanno, Y., et al. (2021). Combining an alarmin hmgn1 peptide with pd-l1 blockade results in robust antitumor effects with a concomitant increase of stem-like/progenitor exhausted cd8+ t cells. Cancer immunology research, 9(10):1214–1228. Mroz, E. A., Tward, A. M., Hammon, R. J., Ren, Y., and Rocco, J. W. (2015). Intra-tumor genetic heterogeneity and mortality in head and neck cancer: analysis of data from the cancer genome atlas. PLoS medicine, 12(2):e1001786. Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., III, W. M. M., Hao, Y., Stoeckius, M., Smibert, P., and Satija, R. (2019). Comprehensive integration of single-cell data. Cell, 177:1888–1902. Wolf, F. A., Angerer, P., and Theis, F. J. (2018). SCANPY: large-scale single-cell gene expression data analysis. Genome Biology, 19. 3 CHAPTER 2 SINGLE-CELL ANALYSIS OF CANCER STEM CELLS IN HEAD AND NECK CANCER 2.1 Introduction Head and neck cancer (HNC) is a major global health problem, with an estimated 880,000 new cases and 445,000 deaths annually worldwide (Sung et al., 2021). While human papilloma (HPV)-associated HNC have improved outcomes, despite advances in comprehensive cancer care, HPV-negative HNC remains a highly morbid disease with stagnant survival rates hovering at 50%. 
This poor prognosis is due in part to the complex heterogeneity of HNC, which involves multiple cell types, genetic alterations and transitional states that lead to treatment resistance and poor outcomes (Chen et al., 2021; Puram et al., 2018; Mroz et al., 2015). Tumoral heterogeneity is a well-established biomarker of poor prognosis, being associated with aggressive cancer behavior and treatment resistance in various cancer types, including HNC. Tumoral heterogeneity has been associated with worse outcomes, mediated by intrinsic and extrinsic factors related to subpopulations with distinct molecular profiles. Tumoral plasticity has been identified as a critical driver of tumoral heterogeneity, where clonal expansion and subclonal selection are based on evolutionary progression with each clone arising from cells with high propagation potential, plasticity, and self-renewal. Within the tumor microenvironment (TME), there is a subpopulation of tumor initiating cells, or cancer stem cells (CSC), that have the capacity to drive clonal and subclonal selection (O’Brien et al., 2007). Traditionally CSC were deemed fixed cells with limited to no plasticity based on their original definitions. However, as the field has advanced, the role of plasticity in CSC has expanded and the traditional view of CSCs has evolved. While the term CSC has persisted, despite much controversy on their existence, a more nuanced understanding of CSC is that their stem-like activity (CSC-state: self-renewal, tumorigenicity and asymmetric division) is not fixed but a transient state dictated by tumoral and environmental cues (Chaffer and Weinberg, 2011). When cells are in this CSC-state, they are associated with treatment resistance, metastasis, and tumor recurrence. However, CSCs in HNC remains controversial and the CSC-like state has been difficult to study as the mechanisms of CSC plasticity are poorly 4 understood. Plasticity and heterogeneity are also critical components of epithelial to mesenchymal transition (EMT) programs. Weinberg and others have shown EMT represent transient cancer cell states with varying degrees of activities, strongly suggesting the EMT process enables cancer cells to acquire CSC-like properties and enhance their ability to initiate and sustain tumors (Mani et al., 2008; Tam and Weinberg, 2013). However, understanding the link between EMT and CSC remains elusive due to their rarity and potentially transiet states. Analyzing this interaction is critical to understanding the CSC-like state and defining potential mechanisms for plasticity and identify novel targets for therapy. Recent advances in single-cell RNA sequencing (scRNAseq) technology have enabled the identification of distinct subpopulations of cells within tumors based on their gene expression profiles, providing a powerful tool to study the heterogeneity and plasticity of CSCs in HNC (Wang et al., 2019). Moreover, in vitro lineage tracing can be used to assess CSC’s capacity for plasticity and evaluate their various states. In this study, we integrated scRNAseq and in vitro and in silico lineage tracing to analyze these rare CSC subpopulations in cell culture and primary HNC tumors to characterize their dynamic states and plasticity. Our study sheds new light on the dynamic nature and plasticity of CSCs in HNC, and their potential involvement in EMT programs. 
Our findings have important implications for the development of novel therapeutic strategies for HNC, as well as other cancers, and for the broader understanding of CSC-states and plasticity.

2.2 Methods

2.2.1 Cell lines analysis

To better evaluate transcriptional differences while controlling for tumoral heterogeneity, we subsequently performed single cell sequencing of two patient-derived HNC cell lines (UMSCC-122 and UMSCC-103). Both cell lines were sorted to select for CSC (CD44high/ALDHhigh) and non-CSC (CD44low/ALDHlow) cells. After standard quality control filtering and integration of the two cell line expression data sets, we found 26 clusters using Seurat (Stuart et al., 2019). We observed that the clusters were not separated on the UMAP plot, suggesting that cells lie on a continuum, but the distinction between non-CSC and CSC was weak since cell cycle phase affected the clustering (Figure 2.1A). As a next step, we eliminated the cell-cycle effect and performed a trajectory analysis to capture both local and global nonlinear structure using an information-geometric distance between cells (Moon et al., 2019). Given recent evidence suggesting an inherent plasticity in cancer stem cells, we were interested in evaluating if there is a continuity between non-stem and stem cells in the tumor. Figure 2.1B demonstrates a spectrum of non-CSC to CSC, with an overlapping region in the middle, suggesting cells can progress from a non-stem-like state to a stem-like state and supporting the hypothesis that CSCs possess plasticity and exist in a transitional CSC-state (Intermediate cells).

Figure 2.1 Cell line data results (panels A-D).

Finally, we used the cell line data to generate a model to predict the probability that a cancer cell is a stem cell (or in a stem-like state) and to develop a CSC-state gene expression signature. Using the trajectory coordinates of the cell line cells, we constructed a logistic regression model that provides the probability a cell is a cancer stem cell (Figure 2.1, right); a minimal sketch of this step is given at the end of this subsection. We calculated the correlation of each gene's expression to the cells' predicted probabilities and used these values to rank each gene and to perform gene set enrichment analysis. We found 29 enriched pathways in the C2 and Hallmark databases. The positively enriched pathways and contributing genes are demonstrated in the leading-edge analysis (Figure 2.1D). We tested which of the contributing genes are expressed significantly more in each predicted group of cells (CSC, Intermediate, Non-CSC) to construct a signature genes matrix (Figure 2.2).

Figure 2.2 CSC signature genes matrix - Cell line data.

Together these data demonstrate a conserved cancer stem cell signature identified with single cell sequencing and novel bioinformatic techniques. These data nominate a subset of genes (ACTB, ANXA2, TPM1, MYL12A, CD63, CCDN1, CD59, ATOX1, LAMA3, LAMC2, L1CAM, KRT8, ANXA3, CLTB and IL32) as drivers of the cancer stem cell phenotype. Despite these cells being exclusively derived from epithelial cells, several of the CSC differentially expressed genes are associated predominantly with CAFs (TPM1, MYL12A, KRT8, CD63 and IL32), suggesting a mesenchymal state of CSC. These clusters were selected to further define a pure epithelial CSC signature in the primary tumor data of 10 patients.
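As a concrete illustration of the probability model and gene-ranking step described above, the following sketch fits a logistic regression of sorted CSC status on trajectory coordinates and ranks genes by the correlation of their expression with the predicted probabilities. All inputs (coords, sorted_label, expr) are random placeholders; the snippet is only meant to convey the structure of this step, not the exact pipeline or tooling used in the analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: trajectory coordinates (cells x dims), FACS-sorted labels
# (1 = sorted CSC, 0 = sorted non-CSC), and a genes x cells expression matrix.
coords = np.random.rand(500, 2)            # placeholder trajectory coordinates
sorted_label = np.random.randint(0, 2, 500)
expr = np.random.rand(2000, 500)           # placeholder normalized expression

# Logistic regression of sorted status on trajectory coordinates gives each
# cell a predicted probability of being in the CSC state.
model = LogisticRegression().fit(coords, sorted_label)
p_csc = model.predict_proba(coords)[:, 1]

# Rank genes by the correlation of their expression with the predicted
# probabilities; this ranked list is what would be passed to GSEA.
corr = np.array([np.corrcoef(expr[g], p_csc)[0, 1] for g in range(expr.shape[0])])
ranking = np.argsort(-corr)                # most CSC-associated genes first
```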
2.2.2 Primary tumor data analysis

While patient-derived cell line data provides critical informatic and biologic information, it fails to capture the complexity and heterogeneity of HNC. We leveraged our access to fresh tumor specimens to perform scRNAseq. Given the evidence that the tumor microenvironment plays a large role in maintenance of the cancer stem cell niche, we hypothesized that the cancer stem cell signature may differ between cell lines and primary tumors, but that cells in the CSC-state will have conserved signatures. To evaluate CSC signatures in primary tumors we analyzed 10 HNC tumors harvested directly from the operative theatre. Tumors were then digested into single cell suspension and sorted by FACS for standard CSC markers (ALDH and CD44). scRNAseq was then performed on the enriched groups. Seurat clusters are shown in Figure 2.3A. Of the CD44/ALDH enriched cells, the deconvoluted epithelial tumor population was found to make up only a small proportion of the tumor bulk (5%), with the remaining cells representing the immune and stromal elements of the TME. To confirm the identity of the epithelial cell cluster, the cell line CSC expression data was normalized and mapped onto the primary tumor expression data. As seen in Figure 2.3A, the cell line data, in black, overlap with the epithelial cluster (cluster 9), confirming an epithelial expression pattern. We then used RNAscope to show co-localization of top expressed epithelial genes within the tumor cell population to further confirm expression of the DEG genes in the primary tumor (Figure 2.3C). We considered isolating not only the epithelial but also the fibroblast cluster, since CAF genes were found among the CSC signature genes suggested by the cell line analysis. We also observe that PTPRC is low in fibroblasts of the sample and that the epithelial annotated cells are mapped onto each fibroblast cluster (Figure 2.3B). Hence, we proceed to investigate the expression profile of CSCs in both epithelial and fibroblast cells. Following the steps suggested by the analysis of the cell line data, we explore the trajectory of the epithelial and fibroblast cells.

Figure 2.3 Primary tumor data analysis (panels A-C).

Figure 2.4 Epithelial and fibroblast cells of primary tumor data. (A) Epithelial and fibroblast cells by state and by pseudotime. (B) Epithelial and fibroblast cells by probability of CSC status and by predicted status. (C) Heatmap of GSEA ranked genes in epithelial and fibroblast cells (including PTGS2, AQP1, SLPI, CCL19, CCL11 and TIMP1).

Figure 2.4A demonstrates that there is a spectrum of CSCs, Intermediate and Non-CSCs with two branches. To understand the ordering of cells on the trajectory we used the trajectory coordinates of each cell and their grouping based on ALDH and CD44 levels to extract their pseudotime (Street et al., 2018). We observe that cells with the highest pseudotime are located at the leftmost part of the trajectory and have high ALDH and CD44 (sorted as CSC). For this reason, we assume that the purest CSCs are located in the leftmost corner. To validate this assumption, a logistic regression model was generated to predict the probability that a cell in the epithelial and fibroblast cluster is a CSC using the pseudotime information of the cell. Cells with the highest probability are indeed located in the leftmost corner (Figure 2.4B). Additionally, genes of the fibroblast and epithelial cluster were ranked based on the correlation of their expression level with the pseudotime assigned to each cell.

Figure 2.5 GSEA of primary tumor data (panels A-B).

Figure 2.6 Heatmap of core enrichment genes in epithelial and fibroblast cells of the 10 patient data (genes shown include TIMP1, SFRP4, IGFBP3, ADH1B and TNC).
This ranking was used to perform gene set enrichment analysis. Positively enriched pathways are associated with the genes at the top of the list, namely those that are positively correlated with pseudotime and hence with the CSC cells. A heatmap of the expression of genes at the top and bottom of the ranked list shows a clustering of the tumor and epithelial cells based on pseudotime and the gradual transition from Non-CSC to CSC cells (Figure 2.4C). The biologic behavior of the CSC subpopulation can be further described through the results of the GSEA of the cell line and primary data. Permutation-based analysis was performed to calculate p-values and false discovery rates (FDR). Here we find that CSCs were enriched for the HALLMARK EPITHELIAL MESENCHYMAL TRANSITION, HALLMARK ANGIOGENESIS, C2 ANASTASSIOU MULTICANCER INVASIVENESS SIGNATURE, and C2 BOQUEST STEM CELL UP gene sets compared to non-CSC. Interestingly, non-CSC were enriched for radiation response pathways (SMIRNOV RESPONSE TO IR 2HR UP) as well as the well-established HNC pathways HALLMARK P53 PATHWAY, HALLMARK MYC TARGETS V1 and HALLMARK PI3K AKT MTOR SIGNALING (Figure 2.5). Taken together, these findings further support that CSCs are critical components of EMT and play a crucial role in promoting tumor invasion, metastasis, and treatment resistance.

To further define the genes of interest, common genes across the significantly enriched gene sets were explored. Genes common to at least 2 significantly enriched pathways are SFRP4, ALDH1B, WNT5A, TIMP1, COL1A1, COL3A1, MFAP5, COL1A2, LUM, COL5A1, THBS2, COL5A2, COL6A3, VCAN, LOX, MXRA5, COL6A2, FAP, CDH11, DCN, SPOCK1. Given that many of these genes are established mesenchymal genes (COL6A3, FAP, VCAN), this suggests that CSC may represent a critical subset of cells in a mesenchymal state as part of the EMT.

2.2.3 Survival analysis of TCGA data

To test the significance of the CSC signature genes (Figure 2.2), we utilized expression information for those genes in HPV patients of the TCGA database. Specifically, the CSC signature genes matrix was used to estimate the pure CSC, intermediate and pure Non-CSC proportion of cells in each patient via a least trimmed squares gene-expression deconvolution technique (Hao et al., 2018). Next, patients were grouped into two clusters via k-means based on their estimated proportions. The two clusters separated patients with low pure CSC proportion from patients with high pure CSC proportion. Patients in the cluster with low pure CSC proportion have a significantly higher survival rate (p-value = 0.0028, Figure 2.7).

Figure 2.7 Survival rate of TCGA patients based on estimated CSC proportion.

2.3 Discussion

Our study provides new insights into the dynamic nature and plasticity of CSCs in head and neck cancer, and their potential involvement in the epithelial-to-mesenchymal transition (EMT) process. We identified multiple dynamic states of CSCs within our primary cell cultures of UMSCC HNC cell lines and 10 primary tumors, suggesting that CSCs exist in a state of dynamic equilibrium with their non-CSC counterparts. Our in vitro lineage tracing experiments further confirmed the plasticity of CSCs, and their ability to differentiate into non-CSCs and vice versa. Recent insight into CSC biology has moved away from CSCs representing a distinct entity and toward viewing the CSC phenotype as a dynamic state.
Our findings are consistent with previous studies that have shown the plasticity of CSCs in various cancer types, including breast cancer, colorectal cancer, and glioblastoma (Chaffer and Weinberg, 2011; Vermeulen et al., 2010; Wang et al., 2018). In addition, our study supports the hypothesis that the EMT process may be involved in the acquisition of stem cell-like properties by cancer cells and may enhance their ability to initiate and sustain tumors (Mani et al., 2008; Tam and Weinberg, 2013). This hypothesis is supported by our gene set enrichment analysis (GSEA) results, which identified enrichment of EMT-related gene sets in our CSC populations. Interestingly, as part of this mesenchymal transition, we found cells in the CSC-state had similar canonical expression patterns as CAFs. This was seen both in pure epithelial cell line data as well as in the primary data. Furthermore, we found that the TIMP1/CD63 pathway was differentially expressed in our CSC populations. TIMP1 and CD63 have both been characterized in CAFs, more so than epithelial cells. TIMP1 is a member of the tissue inhibitor of metalloproteinase (TIMP) family, regulating matrix metalloproteinases (MMPs), and have been critical mediators in cancer invasion and metastasis. TIMP1 has been shown to be overexpressed in several solid organ cancers, including breast, lung, prostate and ovarian. In addition to invasive characteristics, TIMP1 has been shown promote cancer cell survival, thus critical for cancer maintenance. CD63 is a member of the tetraspanin family of membrane proteins, thus associated with cell adhesion, migration, and signaling through the 14 regulation of cell surface receptor trafficking and the regulation of extracellular vesicles, which are important mediators of intercellular communication within the TME. Like TIMP1, CD63 has been found to be overexpressed in various types of cancer, including breast, lung, and melanoma. Together, both TIMP1 and CD63 are involved in several overlapping pathways that regulate EMT, specifically AKT/mTOR, WNT/b-catenin, integrins and CD44. TIMP1/CD63 has been studied in other cancers as part of the EMT as well as CSCs, suggesting that this pathway may play a role in the maintenance and plasticity of CSCs in HNC. Our study provides new evidence for the potential involvement of this pathway in HN CSC biology, and may open up new avenues for the development of targeted therapies for HNC and other cancers. Taken together, our findings highlight the dynamic and plastic nature of CSCs in HNC, and their potential involvement in the EMT process and the TIMP1/CD63 pathway. Our study may have implications for the development of personalized therapeutic strategies for this deadly disease. Further studies are needed to validate our findings in larger patient cohorts and to explore the clinical relevance of our results. 15 BIBLIOGRAPHY Chaffer, C. L. and Weinberg, R. A. (2011). A perspective on cancer cell metastasis. science, 331(6024):1559–1564. Chen, C.-Y., Ueha, S., Ishiwata, Y., Shichino, S., Yokochi, S., Yang, D., Oppenheim, J. J., Ogiwara, H., Deshimaru, S., Kanno, Y., et al. (2021). Combining an alarmin hmgn1 peptide with pd-l1 blockade results in robust antitumor effects with a concomitant increase of stem-like/progenitor exhausted cd8+ t cells. Cancer immunology research, 9(10):1214–1228. Hao, Y., Yan, M., Lei, Y. L., and Xie, Y. (2018). Fast and robust deconvolution of tumor infiltrating lymphocyte from expression profiles using least trimmed squares. bioRxiv. Mani, S. 
A., Guo, W., Liao, M.-J., Eaton, E. N., Ayyanan, A., Zhou, A. Y., Brooks, M., Reinhard, F., Zhang, C. C., Shipitsin, M., et al. (2008). The epithelial-mesenchymal transition generates cells with properties of stem cells. Cell, 133(4):704–715. Moon, K. R., van Dijk, D., Wang, Z., Gigante, S., Burkhardt, D. B., Chen, W. S., Yim, K., van den Elzen, A., Hirn, M. J., Coifman, R. R., et al. (2019). Visualizing structure and transitions in high-dimensional biological data. Nature Biotechnology, 37(12):1482–1492. Mroz, E. A., Tward, A. M., Hammon, R. J., Ren, Y., and Rocco, J. W. (2015). Intra-tumor genetic heterogeneity and mortality in head and neck cancer: analysis of data from the cancer genome atlas. PLoS medicine, 12(2):e1001786. O’Brien, C. A., Pollett, A., Gallinger, S., and Dick, J. E. (2007). A human colon cancer cell capable of initiating tumour growth in immunodeficient mice. Nature, 445(7123):106–110. Puram, S. V., Parikh, A. S., and Tirosh, I. (2018). Single cell rna-seq highlights a role for a partial emt in head and neck cancer. Molecular & cellular oncology, 5(3):e1448244. Street, K., Risso, D., Fletcher, R. B., Das, D., Ngai, J., Yosef, N., Purdom, E., and Dudoit, S. (2018). Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC genomics, 19:1–16. Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., III, W. M. M., Hao, Y., Stoeckius, M., Smibert, P., and Satija, R. (2019). Comprehensive integration of single-cell data. Cell, 177:1888–1902. Sung, H., Ferlay, J., Siegel, R. L., Laversanne, M., Soerjomataram, I., Jemal, A., and Bray, F. (2021). Global cancer statistics 2020: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: a cancer journal for clinicians, 71(3):209–249. Tam, W. L. and Weinberg, R. A. (2013). The epigenetics of epithelial-mesenchymal plasticity in cancer. Nature medicine, 19(11):1438–1449. 16 Vermeulen, L., De Sousa E Melo, F., Van Der Heijden, M., Cameron, K., De Jong, J. H., Borovski, T., Tuynman, J. B., Todaro, M., Merz, C., Rodermond, H., et al. (2010). Wnt activity defines colon cancer stem cells and is regulated by the microenvironment. Nature cell biology, 12(5):468–476. Wang, Q., He, Z., Huang, M., Liu, T., Wang, Y., Xu, H., Duan, H., Ma, P., Zhang, L., Zamvil, S. S., et al. (2018). Vascular niche il-6 induces alternative macrophage activation in glioblastoma through hif-2𝛼. Nature communications, 9(1):559. 17 CHAPTER 3 CLUSTERING AND VISUALIZATION OF SINGLE-CELL RNA-SEQ DATA USING PATH METRICS 3.1 Introduction The advance in single-cell RNA-seq (scRNA-seq) technologies in recent years has enabled the simultaneous measurement of gene expression at the single-cell level Saliba et al. (2014); Eberwine et al. (1992); Tang et al. (2009). This opens up new possibilities to detect previously unknown cell populations, study cellular development and dynamics, and characterize cell composition within bulk tissues. Despite its similarity with bulk RNAseq data, scRNAseq data tends to have larger variation and larger amounts of missing values due to the low abundance of initial mRNA per cell. To address these challenges, numerous computational algorithms have been proposed focusing on different aspects. Given a collection of single cell transcriptomes from scRNAseq, one of the most common applications is to identify and characterize subpopulations, e.g., cell types or cell states. Numerous clustering approaches have been developed such as 𝑘-means based methods SC3 Kiselev et al. 
(2017), SIMLR Wang et al. (2017), and RaceID Herman et al. (2018); hierarchical clustering based methods CIDR Lin et al. (2017), BackSPIN A et al. (2015), and pcaReduce žurauskienė and Yau (2016); graph based methods Rphenograph CLevine et al. (2015), SNN-Cliq Xu and Su (2015), SSNN-Louvain Zhu et al. (2020), Seurat Stuart et al. (2019), and scanpy Wolf et al. (2018); and deep-learning based methods scGNN Wang et al. (2021), scVI Lopez et al. (2018), ScDeepCluster Tian et al. (2019b), and DANCE Ding et al. (2022).

To visualize and characterize relationships between cell types, it is important to represent them in a low-dimensional space. Many low-dimensional embedding methods have been proposed, including UMAP McInnes et al. (2018), t-SNE Van der Maaten and Hinton (2008), PHATE Moon et al. (2019), and LargeVis Tang et al. (2016). However, a key challenge for embedding methods is to simultaneously reduce cluster variance and preserve the global geometry, including the distances between clusters and cluster shapes. For example, Figure 3.4 illustrates the typical situation on a cell mixture dataset Tian et al. (2019a): the PCA embedding preserves the global geometry but clusters have high variance; clusters are better separated in the UMAP and t-SNE embeddings, but the global geometric structure of the clusters is lost.

When choosing a clustering algorithm, there is always an underlying tension between respecting data density and data geometry. Density-based methods such as DBSCAN cluster data by connecting together high-density regions, regardless of cluster shape. More traditional approaches such as k-means require that clusters are convex and geometrically well separated. However, in many real data sets, clusters tend to have both nonconvex/elongated geometry and a lack of robust density separation, as shown in Figure 3.2b, which consists of three elongated Gaussian distributions and a bridge connecting two of the distributions. The data set is challenging because it exhibits elongated geometry, but methods relying only on density will fail due to the bridge. Such characteristics are commonly observed in scRNA-seq data, especially for cells sampled from a developmental process, as cell types often trace out elongated structures and frequently lack robust density separation. This elongated geometry phenomenon is due to the fact that all the cell types originate from stem cells through a trajectory-like differentiation process, and the bridge structures are created by the cells in the transition states. For example, circulating monocytes in the Tabula Muris (TM) lung data set Tabula Muris Consortium (2018) have an elongated cluster structure as illustrated by the PCA plot in Figure 3.1a, as do the ductal cells in the TM pancreatic data set (see Figure 3.1c). The UMAP plots of these same data sets illustrate the lack of robust density separation: for TM lung, there is a bridge connecting the alveolar and lung cell types, and also an overlap/bridge between the circulating and invading monocytes (see Figure 3.1b); for TM pancreatic, the pancreatic A and pancreatic PP cells are not well separated. The combination of elongation and poor density separation makes clustering scRNA-seq data sets a challenging task.

Figure 3.1 Tabula Muris data sets have elongated clusters in the PCA embedding and clusters connected with a bridge of points in the UMAP embedding. Panels: (a) PCA TM Lung, (b) UMAP TM Lung, (c) PCA TM Pancreatic, (d) UMAP TM Pancreatic.

Figure 3.2 Toy Data Sets. Panels: (a) Balls, (b) Elongated with Bridge, (c) Swiss Roll, (d) GL Manifold. 3.2a and 3.2b show the 2-dimensional data sets. 3.2c plots the first two coordinates of the Swiss roll. 3.2d shows the 2-dimensional PCA plot of the SO(3) manifolds.

We propose an embedding method based on the power weighted path metric which is well suited to this difficult regime. These metrics balance density and geometry considerations in data learning tasks such as clustering and semi-supervised learning Vincent and Bengio (2003); Bousquet et al.
(2004); Sajama and Orlitsky (2005); Chang and Yeung (2008); Bijral et al. (2011); Moscovich et al. (2017); Mckenzie and Damelin (2019); Little et al. (2020a); Borghini et al. (2020). They have performed well in applications such as imaging Fischer et al. (2001); Zhang and Murphy (2021); Little et al. (2020a); Mckenzie and Damelin (2019), but their usefulness for the analysis of scRNAseq data remains unexplored. Because these metrics are density-sensitive, they reduce cluster variance; in addition, these metrics also capture global distance information, and thus preserve global geometry; see Figure 3.4b. Using the path metric embedding to cluster the data thus yields a clustering method which balances density-based and geometric information.

3.2 Materials and Methods

We first introduce our theoretical framework in Section 3.2.1; Section 3.2.2 then describes the details of the proposed scPMP algorithm, and Section 3.2.3 describes metrics for assessment.

3.2.1 Path Metrics

We first define a family of power-weighted path metrics parametrized by 1 ≤ p < ∞.

Definition 1 Given a discrete data set X, the discrete p-power weighted path metric between a, b ∈ X is defined as
$$\ell_p(a,b) := \inf_{(x_0,\ldots,x_s)} \left( \sum_{i=0}^{s-1} \| x_{i+1} - x_i \|_2^{\,p} \right)^{1/p},$$
where the infimum is taken over all sequences of points x_0, ..., x_s in X with x_0 = a and x_s = b.

Note as p → ∞, ℓ_p converges to the "bottleneck edge" distance
$$\ell_\infty(a,b) := \inf_{(x_0,\ldots,x_s)} \max_i \| x_{i+1} - x_i \|_2,$$
which is well studied in the computer science literature Pollack (1960); Hu (1961); Camerini (1978); Gabow and Tarjan (1988). Two points are close in ℓ_∞ if they are connected by a high-density path through the data, regardless of how far apart the points are. On the other hand, when p = 1, ℓ_1 reduces to Euclidean distance. If path edges are furthermore restricted to lie in a nearest neighbor graph, ℓ_1 approximates the geodesic distance between the points, i.e. the length of the shortest path lying on the underlying data structure, which is a highly useful metric for manifold learning Tenenbaum et al. (2000). The parameter p governs a trade-off between these two extremes, i.e. it determines how to balance density and geometry considerations when determining which data points should be considered close.

The relationship between ℓ_p and density can be made precise. Assume n independent samples from a continuous, nonzero density function f supported on a d-dimensional, compact Riemannian manifold M (a manifold is a smooth, locally linear surface; see Lee (2018)). Then for p > 1, ℓ_p(a, b) converges (after appropriate normalization) to
$$\mathcal{L}_p(a,b) := \inf_{\gamma} \left( \int_0^1 f(\gamma(t))^{-(p-1)/d}\, |\gamma'(t)|\, dt \right)^{1/p} \qquad (3.1)$$
as n → ∞, where the infimum is taken over all smooth curves γ : [0, 1] → M connecting a, b Hwang et al. (2016); Groisman et al. (2022); Fernández et al. (2023). Note |γ'(t)| is simply the arclength element on M, so L_1 reduces to the standard geodesic distance. When p ≠ 1, one obtains a density-weighted geodesic distance. The optimal L_p path is not necessarily the most direct: a detour may be worth it if it allows the path to stay in a high-density region; see Figure 3.3. Thus the metric is density-sensitive, in that distances across high-density regions are smaller than distances across low-density regions; this is a desirable property for many machine learning tasks Chu et al. (2017), including trajectory estimation for developmental cells and cancer cells. However the metric is also geometry preserving, since it is computed by path integrals on M. The parameter p controls the balance of these two properties: when p is small, L_p depends mainly on the geometry of the data, while for large p, L_p is primarily determined by data density.

Figure 3.3 Optimal ℓ_p path between two points in a moon data set. Panels: (a) p = 1.1, (b) p = 2.

Although path metrics are defined in a complete graph, i.e. Definition 1 considers every path in the data connecting a, b, recent work Little et al. (2020b); Groisman et al. (2018); Mckenzie and Damelin (2019); Chu et al. (2020) has established that it is sufficient to only consider paths in a K-nearest neighbors (KNN) graph, as long as K ≥ C log n for a constant C depending on p, d, f, and the geometry of the data. By restricting to a KNN graph, all pairwise path distances can be computed in O(Kn^2) with Dijkstra's algorithm.
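Before turning to the scPMP algorithm, the following minimal sketch evaluates Definition 1 by brute force on a tiny point set, simply to illustrate how increasing p favors paths made of many short edges over a single long edge. It is not how path metrics are computed in practice; the KNN-graph computation used by scPMP is sketched in the next subsection.

```python
import itertools
import numpy as np

# Two endpoints connected either directly or through a nearby intermediate
# point; as p grows, the single long edge is penalized more heavily and the
# two-hop path becomes the optimal one.
X = np.array([[0.0, 0.0], [0.5, 0.05], [1.0, 0.0], [0.5, 2.0]])

def path_cost(points, p):
    steps = np.linalg.norm(np.diff(points, axis=0), axis=1)
    return (steps ** p).sum() ** (1.0 / p)

def ell_p(X, a, b, p):
    """Brute-force l_p: minimize over all orderings of intermediate points."""
    others = [i for i in range(len(X)) if i not in (a, b)]
    best = np.inf
    for r in range(len(others) + 1):
        for mid in itertools.permutations(others, r):
            best = min(best, path_cost(X[[a, *mid, b]], p))
    return best

for p in (1, 2, 4):
    print(p, ell_p(X, 0, 2, p))   # p = 1 picks the direct edge; larger p does not
```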
3.2.2 Algorithm

We consider a noisy data set of n data points x̃_1, ..., x̃_n ∈ R^d, which form the rows of the noisy data matrix X̃ ∈ R^{n×d}. We first denoise the data with a local averaging procedure, which has been shown to be advantageous for manifold plus noise data models García Trillos et al. (2019). More specifically, we replace x̃_i with its local average:
$$x_i := \frac{1}{K_1} \sum_{j \in \mathcal{N}_{i,K_1}} \tilde{x}_j, \qquad \mathcal{N}_{i,K_1} = \{\, j : \tilde{x}_j \text{ is a } K_1\text{NN of } \tilde{x}_i \,\},$$
and let X ∈ R^{n×d} denote the denoised data matrix.

We then fix p and compute the p-power weighted path distance between all points in X to obtain the pairwise distance matrix D_PM ∈ R^{n×n}. More precisely, we let G_{K_2} = (X, E) be the graph on X where x_i, x_j are connected with edge weight E_{ij} = ||x_i − x_j||^p if x_i is a K_2NN of x_j or x_j is a K_2NN of x_i. We then compute D^p_{ij} as the total length of the shortest path connecting x_i, x_j in G_{K_2}, and define D_PM by (D_PM)_{ij} = (D^p_{ij})^{1/p}.
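A minimal sketch of the denoising and path-distance computation just described (Lines 6-12 of Algorithm 3.1 below) is given here. It relies on scikit-learn and SciPy for the neighbor searches and shortest paths; the function name and defaults are illustrative, and this is a sketch rather than the reference scPMP implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors, kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def scpmp_path_distances(X_noisy, p=2.0, K1=12, K2=None):
    n = X_noisy.shape[0]
    K2 = min(n - 1, 500) if K2 is None else K2
    # Denoise: replace each point by the mean of its K1 nearest neighbors
    # (each point is returned as its own nearest neighbor, so it is included).
    _, nbrs = NearestNeighbors(n_neighbors=K1).fit(X_noisy).kneighbors(X_noisy)
    X = X_noisy[nbrs].mean(axis=1)
    # K2-NN graph with edge weights ||x_i - x_j||^p, symmetrized so an edge is
    # kept if either point is a K2-NN of the other.
    G = kneighbors_graph(X, n_neighbors=K2, mode="distance")
    G.data = G.data ** p
    G = G.maximum(G.T)
    # Shortest paths (Dijkstra) give D^p; the 1/p-th root gives D_PM.
    D_p = shortest_path(G, method="D", directed=False)
    return D_p ** (1.0 / p)
```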
We next apply classical multidimensional scaling Borg and Groenen (2005) to obtain a low-dimensional embedding which preserves the path metrics. Specifically, we define the path metric MDS matrix B = −(1/2) J D_PM^(2) J, where J = I_n − (1/n)11^T is the centering matrix, 1 ∈ R^n is a vector of all 1's, and D_PM^(2) is obtained from D_PM by squaring all entries. We let the spectral decomposition of B be denoted by B = VΛV^T, where Λ = diag(λ_1, ..., λ_n) and V = (v_1, ..., v_n) ∈ R^{n×n} contain the eigenvalues and eigenvectors of B in descending order. The embedding dimension r is then chosen as the index i which maximizes the eigenratio λ_i/λ_{i+1} Lam and Yao (2012), with the following restrictions: we constrain 3 ≤ i ≤ 39 and only consider ratios λ_i/λ_{i+1} between "large" eigenvalues, i.e. we require λ_i/λ_1 ≥ 0.01. The scPMP embedding is then defined by Y = (√λ_1 v_1, ..., √λ_r v_r) ∈ R^{n×r}.

Finally, we apply k-means to the scPMP embedding to obtain cluster labels. Specifically, we let ℓ̂_i ∈ [k] = {1, ..., k} be the cluster label of x_i returned by running k-means on Y with k clusters and 20 replicates. Since k-means may return highly imbalanced clusters, cluster sample sizes were constrained to be at least √n/2. Specifically, if k-means returned a tiny cluster, k was increased to k + 1, and the tiny cluster merged with the closest non-trivial cluster. This entire procedure is summarized in the pseudocode in Algorithm 3.1.

Algorithm 3.1 scPMP.
1: Input: noisy data X̃ ∈ R^{n×d}, parameter p, number of clusters k
2: Optional input: K_1, K_2, r_min, r_max, τ
3: (Defaults: 12, n ∧ 500, 3, 39, 0.01)
4: Output: scPMP embedding Y ∈ R^{n×r}, label vector ℓ̂ ∈ [k]^n
5:
6: % Denoise data:
7: x_i ← (1/K_1) Σ_{j ∈ N_{i,K_1}} x̃_j
8:
9: % Compute path metrics:
10: G_{K_2} ← K_2NN graph on X with edge weights ||x_i − x_j||^p
11: D^p_{ij} ← length of shortest path connecting x_i, x_j in G_{K_2}
12: (D_PM)_{ij} ← (D^p_{ij})^{1/p}
13:
14: % Compute MDS embedding of path metrics:
15: B ← −(1/2) J D_PM^(2) J
16: Λ = diag(λ_1, ..., λ_n) ← eigenvalues of B in descending order
17: V = (v_1, ..., v_n) ← corresponding eigenvectors of B
18: r ← index i maximizing λ_i/λ_{i+1} for i satisfying r_min ≤ i ≤ r_max, λ_i/λ_1 ≥ τ
19: Y ← (√λ_1 v_1, ..., √λ_r v_r) ∈ R^{n×r}
20:
21: % Cluster the data:
22: ℓ̂ ← constrained k-means(Y, k)

We note that the computational bottleneck for Algorithm 3.1 is the computation and storage of all pairwise path distances, which has complexity O(n^2 log n) when K_2 = O(log n). However this quadratic cost can be avoided by utilizing a low rank approximation of the squared distance matrix via the Nystrom method Williams and Seeger (2001); Ghojogh et al. (2020); Platt (2005); Yu et al. (2012); Civril et al. (2006). For example, Shamai et al. (2020) propose a fast, quasi-linear implementation of MDS which only requires the computation of path distances from a set of q landmarks, so that the complexity of computing path distances is reduced to O(qn log n). Our implementation of scPMP includes the option to use this landmark-based approximation and is thus highly scalable.

We also note that an important consideration in the fully unsupervised setting is how to select the number of clusters k. This is a rather ill-posed question with multiple reasonable answers due to hierarchical cluster structure. We do not focus on this in the current article, and Algorithm 3.1 assumes the number of clusters is given. However we emphasize that when k is unknown, the scPMP embedding offers a useful tool for selecting a reasonable number of clusters. For example, the clustering step of Algorithm 3.1 (Lines 21-22) can be repeated for a range of candidate k values to obtain candidate clusterings ℓ̂_k; k̂ can then be chosen so that ℓ̂_k̂ optimizes a cluster validity criterion such as the silhouette criterion Kaufman and Rousseeuw (2009); Maechler et al. (2021). Alternatively, one could build a graph with distances computed in the scPMP embedding, and estimate k as the number of small eigenvalues of a corresponding graph Laplacian Von Luxburg (2007); Little et al. (2020a).
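The MDS embedding and eigenratio-based dimension selection can be sketched as follows, continuing from a path-distance matrix such as the one computed above. The defaults r_min = 3, r_max = 39, and tau = 0.01 follow Algorithm 3.1, but plain k-means stands in for the constrained k-means step, so this is an illustrative sketch rather than the released implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def scpmp_embedding(D_pm, r_min=3, r_max=39, tau=0.01):
    n = D_pm.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (D_pm ** 2) @ J                # classical MDS matrix
    lam, V = np.linalg.eigh(B)
    lam, V = lam[::-1], V[:, ::-1]                # eigenvalues in descending order
    # Eigenratio criterion: pick r maximizing lam_r / lam_{r+1} among "large"
    # eigenvalues (lam_r / lam_1 >= tau) with r_min <= r <= r_max.
    cand = [i for i in range(r_min, min(r_max, n - 1) + 1)
            if lam[i - 1] / lam[0] >= tau and lam[i] > 0]
    r = max(cand, key=lambda i: lam[i - 1] / lam[i]) if cand else r_min
    return V[:, :r] * np.sqrt(np.clip(lam[:r], 0, None))

def scpmp_cluster(Y, k):
    # Plain k-means (20 restarts) stands in for the constrained k-means step.
    return KMeans(n_clusters=k, n_init=20).fit_predict(Y)
```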
3.2.3 Assessment

We evaluate the performance of Algorithm 3.1 with respect to (1) cluster quality and (2) geometric fidelity on a collection of benchmarking data sets with ground truth labels ℓ. There are many helpful metrics for the quality of the estimated cluster labels ℓ̂, and we compute the adjusted rand index (ARI), entropy of cluster accuracy (ECA), and entropy of cluster purity (ECP). Definitions of ECA and ECP can be found in Appendix B. We compare our clustering results with the output of k-means, DBSCAN Ester et al. (1996); Xu et al. (1998), k-means on the t-SNE embedding Van der Maaten and Hinton (2008), DBSCAN on the UMAP embedding McInnes et al. (2018), and, for scRNAseq data sets, additionally with the following scRNAseq clustering methods: SC3 Kiselev et al. (2017), scanpy Wolf et al. (2018), RaceID3 Grün et al. (2018), SIMLR Wang et al. (2017) and Seurat Stuart et al. (2019).

Assessing the geometric fidelity of the low-dimensional embedding Y is more delicate; we want to assess whether the embedding procedure preserves the global relative distances between clusters. We first compute the mean of each cluster as in Van der Maaten and Hinton (2008) using the ground truth labels, i.e. μ_j(X) = (1/|I_j|) Σ_{i∈I_j} x_i where I_j = {i : ℓ_i = j}; we then define D_{μ,X}(i, j) = ||μ_{ℓ_i}(X) − μ_{ℓ_j}(X)||_2. Similarly, we compute the means μ_j(Y) in the scPMP embedding and define D_{μ,Y}(i, j) = ||μ_{ℓ_i}(Y) − μ_{ℓ_j}(Y)||_2; we then compare D_{μ,X} and D_{μ,Y}. Specifically, we define the geometric perturbation π by
$$\pi(X, Y, \ell) = \min_{c} \frac{\| D_{\mu,X} - c\, D_{\mu,Y} \|_F^2}{\| D_{\mu,X} \|_F^2},$$
where ||·||_F is the Frobenius norm. The c achieving the minimum is easy to compute, and one obtains
$$\pi(X, Y, \ell) = \frac{\| D_{\mu,X} - c^* D_{\mu,Y} \|_F^2}{\| D_{\mu,X} \|_F^2}, \qquad c^* = \frac{\langle D_{\mu,X},\, D_{\mu,Y} \rangle}{\| D_{\mu,Y} \|_F^2}.$$
We compare π(X, Y, ℓ) with the geometric perturbation of other embedding schemes for X, i.e. with π(X, U, ℓ) for U equal to the UMAP McInnes et al. (2018) and t-SNE Van der Maaten and Hinton (2008) embeddings. Note that π is not always a useful measure: for example, if X consisted of concentric spheres sharing the same center, the metric would be meaningless, as the distance between cluster means would be zero. Nevertheless, in most cases π is a helpful metric for quantifying the preservation of global cluster geometry.
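Because the minimization over c has the closed form given above, the geometric perturbation is straightforward to compute. The sketch below follows the cell-indexed definition of D_{μ,X} and D_{μ,Y}, with X, Y and labels standing for the original data, an embedding, and the ground truth labels ℓ.

```python
import numpy as np

def geometric_perturbation(X, Y, labels):
    """pi(X, Y, l): relative Frobenius error between the cell-indexed
    cluster-mean distance matrices of X and Y, after optimal scaling."""
    classes, inv = np.unique(labels, return_inverse=True)
    mu_X = np.array([X[labels == c].mean(axis=0) for c in classes])
    mu_Y = np.array([Y[labels == c].mean(axis=0) for c in classes])
    # Pairwise distances between cluster means, then expanded to the n x n
    # matrices D_mu(i, j) = || mu_{l_i} - mu_{l_j} ||_2 used in the definition.
    Dk_X = np.linalg.norm(mu_X[:, None] - mu_X[None, :], axis=-1)
    Dk_Y = np.linalg.norm(mu_Y[:, None] - mu_Y[None, :], axis=-1)
    D_X, D_Y = Dk_X[inv][:, inv], Dk_Y[inv][:, inv]
    # Closed-form optimal scale c* = <D_X, D_Y> / ||D_Y||_F^2.
    c_star = (D_X * D_Y).sum() / (D_Y ** 2).sum()
    return ((D_X - c_star * D_Y) ** 2).sum() / (D_X ** 2).sum()
```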
3.3 Results

We apply Algorithm 3.1 to both a collection of toy manifold data sets and a collection of scRNAseq data sets. Results are reported in Sections 3.3.1 and 3.3.2 respectively. The default parameter values reported in Algorithm 3.1 were used on all data sets.

3.3.1 Manifold Data

We apply Algorithm 3.1 for p = 1.5, 2, 4 to the following four manifold data sets:

Balls (n = 1200, d = 2, k = 3): Clusters were created by uniform sampling of 3 overlapping balls in R^2; see Figure 3.2a.

Elongated with bridge (denoted EWB, n = 620, d = 2, k = 3): Clusters were created by sampling from 3 elongated Gaussian distributions. A bridge was added connecting two of the Gaussians; see Figure 3.2b.

Swiss roll (n = 1275, d = 3, k = 3): Clusters were created by uniform sampling from three distinct regions of a Swiss roll; 3-dimensional isotropic Gaussian noise (σ = 0.75) was then added to the data. Figure 3.2c shows the first two data coordinates.

SO(3) manifolds (n = 3000, d = 1000, k = 3): For 1 ≤ i ≤ 3, the 3-dimensional manifold M_i ⊆ R^9 is defined by fixing three eigenvalues D_i = diag(λ_1, λ_2, λ_3) and then defining M_i = ∪_{V∈SO(3)} V D_i V^T, where SO(3) is the special orthogonal group. After fixing D_i, we randomly sample from M_i by taking random orthonormal bases V of R^3. A noisy, high-dimensional embedding was then obtained by adding uniform random noise with standard deviation σ = 0.0075 in 1000 dimensions. Figure 3.2d shows the first two principal components of the data, which exhibit no cluster separation. (A minimal generation sketch for this data set is given at the end of this subsection.)

The data sets were chosen to illustrate various cluster separability characteristics. For the balls, the clusters have good geometric separation but are not separable by density. For the Swiss roll and SO(3), the clusters have a complex and inter-twined geometry but are well separated in terms of density. For EWB, clusters are both elongated and lack robust density separability due to the bridge, and one expects that methods which rely too heavily on either geometry or density will fail.

The ARIs achieved by Algorithm 3.1, k-means based methods, DBSCAN based methods, and Seurat are reported in Table 3.1. See Tables B.1 and B.2 in Appendix B for ECP and ECA. As expected, k-means outperforms all methods on the balls but performs very poorly on all other data sets. DBSCAN and Seurat achieve perfect accuracy on the Swiss roll and SO(3) but perform rather poorly on the balls and EWB, although Seurat does noticeably better than DBSCAN. PM2 is the only method which achieves a high ARI (> 90%) and a low ECP and ECA (< 0.15) on all data sets.

Method          Balls   EWB     Swiss   SO(3)
k-means         0.955   -0.001  0.373   0.010
DBSCAN          0.055   0.550   1       1
UMAP+DBSCAN     0.600   0.645   1       1
t-SNE+k-means   0.895   0.359   1       0.532
Seurat          0.777   0.837   1       1
PM1.5           0.921   0.489   1       0.501
PM2             0.907   0.990   1       1
PM4             0.781   0.584   1       1
Table 3.1 ARI for manifold data.

Table 3.2 reports the geometric perturbation of the embedding produced by Algorithm 3.1 and compares with UMAP and t-SNE. Since Algorithm 3.1 generally selects an embedding dimension r > 2, to ensure a fair comparison the geometric perturbation was computed in both the 2d and r-dimensional (rd) embeddings for all methods, where for UMAP r is the dimension selected by Algorithm 3.1 and for t-SNE r = 3 (note r ≤ 3 was required in the Rtsne implementation). Overall PM1.5 achieved the lowest geometric perturbation, although all methods had small perturbation on the Balls data set and t-SNE had the lowest perturbation on EWB. We point out however that for both the Swiss roll and SO(3), the metric may not be meaningful due to the complicated cluster geometry.

Method      Balls   EWB     Swiss   SO(3)
2d UMAP     0.001   0.006   0.305   0.071
rd UMAP     0       0.033   0.339   0.054
2d t-SNE    0       0.004   0.187   0.171
rd t-SNE    0       0.042   0.074   0.157
2d PM1.5    0       0.033   0.002   0.103
rd PM1.5    0       0.023   0.011   0.154
2d PM2      0       0.146   0.025   0.156
rd PM2      0       0.068   0.025   0.179
2d PM4      0.003   0.191   0.056   0.194
rd PM4      0.004   0.157   0.056   0.194
Table 3.2 Geometric perturbation for manifold data. The rd UMAP embeddings were computed with an embedding dimension of r = 5 for the balls, EWB and Swiss roll and r = 7 for SO(3), which corresponded to the estimated dimension for both PM1.5 and PM2. For t-SNE, r = 3 for all data sets.
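The SO(3) data set described above can be generated in a few lines of NumPy: random orthonormal bases are drawn via QR decompositions of Gaussian matrices, the symmetric matrices V D_i V^T are flattened to R^9 and padded to 1000 dimensions, and uniform noise is added. The eigenvalue triples below are arbitrary placeholders, since the specific values of D_i are not stated in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_so3_cluster(eigvals, n_points, ambient_dim=1000, sigma=0.0075):
    """Sample from M_i = { V diag(eigvals) V^T : V in SO(3) }, flattened to R^9,
    padded to R^ambient_dim, with uniform noise of standard deviation sigma."""
    D = np.diag(eigvals)
    pts = []
    for _ in range(n_points):
        # Random orthonormal basis of R^3 via QR of a Gaussian matrix.
        Q, R = np.linalg.qr(rng.standard_normal((3, 3)))
        Q = Q * np.sign(np.diag(R))
        pts.append((Q @ D @ Q.T).ravel())
    X = np.zeros((n_points, ambient_dim))
    X[:, :9] = np.array(pts)
    half_width = sigma * np.sqrt(3.0)      # uniform on [-a, a] has sd a / sqrt(3)
    return X + rng.uniform(-half_width, half_width, size=X.shape)

# Three clusters of 1000 points each (eigenvalue triples chosen for illustration).
X = np.vstack([sample_so3_cluster(ev, 1000)
               for ev in ([1.0, 2.0, 3.0], [1.0, 1.5, 4.0], [0.5, 2.5, 3.5])])
```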
3.3.2 scRNAseq Data

We apply Algorithm 3.1 for p = 1.5, 2, 4 to the following synthetic scRNAseq data sets:

RNA mixture: Benchmarking scRNAseq data sets from Tian et al. (2019a). RNAmix1 was processed with CEL-seq2 and has n = 296 cells and d = 14687 genes. RNAmix2 was processed with Sort-seq and has n = 340 cells and d = 14224 genes. For the creation of the two data sets, RNA was extracted in bulk for each of the following cell lines: H2228, H1975, HCC827. Then the RNA was mixed in k = 7 different proportions (each defining a ground truth cluster label), diluted to single cell equivalent amounts ranging from 3.75pg to 30pg, and processed using CEL-seq2 and SORT-seq. Supplemental information, including the ground truth geometric structure, is available with the original benchmark publication Tian et al. (2019a).

Simulated beta: Simulated data set of n = 473 beta cells and d = 2279 genes, created based on SAVER Huang et al. (2018) and scImpute Li and Li (2018). First, we subset the Baron's Pancreatic data set Baron et al. (2016) to include only Beta cells. As in Li and Li (2018), we randomly choose 10% of the genes to operate as marker genes. Then, we split the cells into k = 3 clusters, and each cluster is assigned a different group of marker genes. For each cluster we scale up the mean expression of its marker genes. Lastly, to simulate the dropout effect, as in Huang et al. (2018), we multiply each cell by an efficiency loss constant drawn from Gamma(10, 100). Using S to refer to the data matrix resulting from the above steps, the final simulated data X is obtained by letting X_ij be drawn from Poisson(S_ij). (A minimal sketch of this simulation is given after the data set descriptions below.)

In addition to the synthetic data, we evaluate the performance of Algorithm 3.1 on the following real scRNAseq data sets:

Cell mixture data set: Another benchmarking data set from Tian et al. (2019a), consisting of a mixture of k = 5 cell lines created with the 10x sequencing platform. The cell line identity of a cell is also its true cluster label. The data set consists of n = 3822 cells and d = 11786 genes; we removed multiplets based on the provided metadata file and kept the 3000 most variable genes after SCT transformation Hafemeister and Satija (2019); Choudhary and Satija (2022).

Baron's pancreatic: Human pancreatic data set generated by Baron et al. (2016). After quality control and SAVER imputation, there are d = 14738 genes and n = 1844 cells. For analysis purposes, cells that belong to a group with fewer than 70 members were filtered out, reducing to k = 8 cell types. Also, we kept only the 3000 most variable genes after SCT transformation Hafemeister and Satija (2019); Choudhary and Satija (2022). The cell types associated with each cell were obtained by an iterative hierarchical clustering method that restricts genes enriched in one cell type from being used to separate other cell types. The enriched markers in every cluster defined the cell type of the cells that belong in that cluster.

Tabula Muris data sets: Mouse scRNAseq data for different tissues and organs Tabula Muris Consortium (2018). We select the pancreatic data (TM Panc) with n = 1444 cells and d = 23433 genes and the lung data (TM Lung) with n = 453 cells and d = 23433 genes. Both data sets have k = 7 different cell types, which were characterized by a FACS-based full length transcript analysis.

PBMC4k data set: This data set includes the gene expression of Peripheral Blood Mononuclear Cells. The raw data are available from 10X Genomics. After quality control, SAVER imputation, and removing the two smallest cell types, there are d = 16655 genes and n = 4316 cells in the dataset. Also, we merge CD8+ T-cells and CD4+ T-cells into one type named T-cells, resulting in k = 4 cell types. The ground truth cell types are provided by SingleR annotation after marker gene verification in github.com/SingleR.
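The simulated beta construction described above can be sketched as follows. Here mean_expr stands for the per-gene mean expression of the Baron Beta cells, the Gamma(10, 100) efficiency loss is interpreted with a shape/rate parameterization (mean 0.1), and the marker scale-up factor is a placeholder, since its exact value is not stated in the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_beta(mean_expr, n_cells=473, k=3, marker_frac=0.10, scale_up=3.0):
    """Split cells into k clusters, up-scale a disjoint marker gene set in each
    cluster, apply a per-cell efficiency loss, and draw Poisson counts.
    Returns a cells x genes count matrix and the cluster labels."""
    n_genes = mean_expr.shape[0]
    markers = rng.choice(n_genes, int(marker_frac * n_genes), replace=False)
    marker_groups = np.array_split(markers, k)
    cluster = rng.integers(0, k, n_cells)
    S = np.tile(mean_expr, (n_cells, 1))
    for c in range(k):
        # scale_up is a placeholder for the (unstated) marker scaling factor.
        S[np.ix_(cluster == c, marker_groups[c])] *= scale_up
    # Per-cell efficiency loss ~ Gamma(shape=10, rate=100), i.e. mean 0.1,
    # assuming a shape/rate reading of "Gamma(10, 100)".
    eff = rng.gamma(shape=10.0, scale=1.0 / 100.0, size=(n_cells, 1))
    return rng.poisson(S * eff), cluster
```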
Details about the pre-processing of the data sets can be found in Appendix A. For the following UMAP and t-SNE results, Linnorm normalization was applied without denoising, as this normalization gave the best results. Note that Seurat_def refers to the results of the entire Seurat pipeline, whereas Seurat refers to the result of using Seurat clustering on data with the same processing and normalization as for PM. The embedding dimension r selected by Algorithm 3.1 ranged from 3 to 7 for PM1.5 and PM2, and from 3 to 11 for PM4.

Table 3.3 reports the ARI achieved by Algorithm 3.1 and other methods; see Tables B.6 and B.5 in Appendix B for ECP and ECA.

Method          RNA1    RNA2    TMLung  Beta    TMPanc  BaronPanc  PBMC4K  CellMix
SC3             0.637   0.827   0.798   0.969   0.894   0.767      0.889   1
scanpy          0.620   0.825   0.796   0.898   0.615   0.966      0.977   1
RaceID3         0.730   0.520   0.900   0.714   0.751   0.651      0.763   1
SIMLR           0.878   0.792   0.727   0.969   0.599   0.698      0.705   1
Seurat          0.792   0.667   0.843   0.901   0.547   0.941      0.889   0.993
Seurat_def      0.714   0.785   0.764   0.907   0.798   0.971      0.975   1
k-means         0.921   0.786   0.848   0.957   0.840   0.662      0.747   1
DBSCAN          0.952   0.826   0.587   0.541   0.734   0.724      0.889   1
UMAP+DBSCAN     0.926   0.892   0.619   0.946   0.893   0.848      0.974   1
t-SNE+k-means   0.943   0.915   0.753   0.928   0.620   0.641      0.596   0.878
PM1.5           0.939   0.924   0.888   0.969   0.626   0.804      0.754   1
PM2             0.939   0.973   0.808   0.969   0.921   0.969      0.757   1
PM4             0.939   0.939   0.731   0.975   0.775   0.853      0.978   1
Table 3.3 ARI for RNA data.

The path metric methods perform equally well or better than the rest of the methods. Once again PM2 exhibits the best overall performance, with a high ARI (≥ 90%) on all data sets except TM lung and PBMC4K; the next best method is PM4, which achieves a high ARI on all but 3 data sets. SC3, RaceID3, and SIMLR had a low ARI (< 90%) on 6 of the 8 data sets; scanpy, Seurat, k-means, and t-SNE+k-means had a low ARI on 5 of the 8 data sets; Seurat_def, UMAP+DBSCAN, and PM1.5 had a low ARI on 4 of the 8 data sets. These results indicate that incorporating both density-based and geometric information when determining similarity generally leads to more robust results for scRNA-seq data. Moreover, PM2 achieves the best median ECP and median ECA values across all RNA data sets. Although the optimal balance depends on the data set (for example PBMC4K does best with p = 4, while TMLung does best with p = 1.5), path metrics with a moderate p exhibit the best performance across a wide range of data sets.

For BaronPanc we observe that Seurat_def achieves a slightly higher ARI than all the reported path metric methods (p = 1.5, 2, 4). However, a significant advantage of Algorithm 3.1 over Seurat is its high clustering performance across a wide range of sample sizes. To demonstrate this claim we compare the ARI results on down-sampled versions of BaronPanc. We selected a stratified sample of 50%, 25% and 10% of the cells of the BaronPanc data set. The results can be found in Table B.4 of Appendix B. We observed no ARI deterioration for Algorithm 3.1 for the 50% and 25% down-sampled data sets and only a moderate decrease for the 10% down-sampled data set (ARI of 0.67 at 10% downsampling for p = 1.5). On the contrary, there is significant ARI deterioration both for Seurat and Seurat_def; in particular, at 10% downsampling the ARI deteriorates to 0.405 for Seurat and to 0.185 for Seurat_def. Notice that in the 10% down-sampled data set, we use regular k-means for PM2 to allow for the prediction of smaller sized clusters.
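The down-sampling comparison above can be reproduced schematically as below, where cluster_fn stands for any clustering pipeline (scPMP, Seurat, or another method) mapping an expression matrix to labels; the stratified sampling is done per annotated cell type and the ARI is computed against the retained annotations.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

def downsample_ari(X, labels, cluster_fn, fractions=(0.5, 0.25, 0.10), seed=0):
    """Stratified down-sampling by cell type, re-clustering, and ARI per fraction."""
    labels = np.asarray(labels)
    results = {}
    for frac in fractions:
        idx, _ = train_test_split(np.arange(len(labels)), train_size=frac,
                                  stratify=labels, random_state=seed)
        pred = cluster_fn(X[idx])                 # placeholder clustering pipeline
        results[frac] = adjusted_rand_score(labels[idx], pred)
    return results
```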
for Seurat and to 0.185 for Seurat_def. Notice that in the 10% down-sampled data set, we use regular k-means for PM2 to allow for the prediction of smaller clusters. We also investigated whether we could learn the ground truth number of clusters by optimizing the silhouette criterion in the scPMP embedding, and compared this with the number of clusters obtained from Seurat using the default resolution; see Table B.3 in Appendix B. For 4 out of the 8 RNA data sets evaluated in this article (RNAMix1, RNAMix2, BaronPanc, and CellMix), this procedure on PM2 yielded an estimate for k which matched the number of distinct annotated labels. On the other hand, Seurat correctly estimates the number of clusters for only 2 out of the 8 RNA data sets (RNAMix1 and TMLung).

Table 3.4 reports the geometric perturbation. We see that increasing p increases the geometric perturbation, with PM1.5 yielding the smallest geometric perturbation on all data sets. Although PM1.5 is the clear winner in terms of this metric, PM2 still performed favorably with respect to UMAP and t-SNE. Indeed, rd PM2 had lower geometric perturbation than UMAP on all but one data set (TMPanc), and lower geometric perturbation than t-SNE on the majority of data sets.

Figure 3.4 shows the PCA, PM2, UMAP, and t-SNE embeddings of the Cell Mix data set, as well as a tree structure on the clusters. The tree structure was obtained by first computing the cluster means in the embedding and then applying hierarchical clustering with average linkage to the means. The PCA tree (Figure 3.4(e)) was computed using 40 PCs so that it accurately reflects the global geometry of the clusters. Interestingly, path metrics recover the same hierarchical structure on the clusters as PCA: the cell types HCC827 and H1975 are the most similar, and H838 is the most distinct. This is what one would expect given more extensive biological information about the cell types, since H838 is the only cell line here derived from a metastatic site (lymph node) of a male patient, while both HCC827 and H1975 originated from the primary site of female lung cancer patients. However, neither UMAP nor t-SNE gives the correct hierarchical representation of the clusters, because both methods struggle to preserve global geometric structure, as observed in numerous studies Kobak and Berens (2019); Cooley et al. (2020).

Figure 3.5 Processing and clustering time (complete_rt, in minutes) for the PBMC4K and Baron's Pancreatic data sets, by method.

Furthermore, Figure 3.5 records the runtime for processing and clustering (in minutes) of the Baron's Pancreatic (n = 1844) and PBMC4K (n = 4316) data sets. For PBMC4k (our largest data set), we use the landmark-based approximation of path distances for scalability. All the PM methods run in less than a minute on BaronPanc and less than 6 minutes on PBMC4k; RaceID3, scanpy, and Seurat were also fast. SC3 and SIMLR had long runtimes, requiring 37.9 and 91.1 minutes respectively for PBMC4k.

3.4 Discussion

This article applies a new theoretical framework to the analysis of single cell RNA-seq data which is based on the computation of optimal paths.
Path metrics encode both geometric and density-based information, and the resulting low-dimensional embeddings simultaneously preserve density-based cluster structure as well as global cluster orientation. The method exhibits competitive performance when applied to numerous benchmarks, and the implementation is scalable to large data sets. Although we investigated other choices of 𝑝, we found that 𝑝 = 2 performed well on a wide range of RNA data sets, indicating that 𝑝 = 2 is an appropriate balance between density and geometry for this application. Future research will explore ways to make the method more robust to noise and adapting the method to the semi-supervised context. 34 BIBLIOGRAPHY A, Z., Muñoz-Manchado, A. B., Codeluppi, S., P, L., G, L. M., A, J., S, M., H, M., L, H., C, B., C, R., G, C.-B., J, H.-L., and and, L. S. (2015). Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science (New York, N.Y.), 347:1138–1142. Baron, M. et al. (2016). A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Systems, 3(4):346–360. Bijral, A., Ratliff, N., and Srebro, N. (2011). Semi-supervised learning with density based distances. In UAI, pages 43–50. Borg, I. and Groenen, P. J. (2005). Modern multidimensional scaling: Theory and applications. Springer Science & Business Media. Borghini, E., Fernández, X., Groisman, P., and Mindlin, G. (2020). Intrinsic persistent homology via density-based metric learning. arXiv preprint arXiv:2012.07621. Bousquet, O., Chapelle, O., and Hein, M. (2004). Measure based regularization. In NIPS, pages 1221–1228. Camerini, P. (1978). The min-max spanning tree problem and some extensions. Information Processing Letters, 1(10-14). Chang, H. and Yeung, D.-Y. (2008). Robust path-based spectral clustering. Pattern Recognition, 41(1):191–203. Choudhary, S. and Satija, R. (2022). Comparison and evaluation of statistical error models for scrna-seq. Genome Biology, 23. Chu, T., Miller, G., and Sheehy, D. (2017). Exploration of a graph-based density sensitive metric. arXiv preprint arXiv:1709.07797. Chu, T., Miller, G., and Sheehy, D. (2020). Exact computation of a manifold metric, via Lipschitz embeddings and shortest paths on a graph. In SODA, pages 411–425. Civril, A., Magdon-Ismail, M., and Bocek-Rivele, E. (2006). Ssde: Fast graph drawing using sampled spectral distance embedding. In International Symposium on Graph Drawing, pages 30–41. Springer. CLevine, J., Simonds, E., Bendall, S., Davis, K., Amir, E.-a., Tadmor, M., Litvin, O., Fienberg, H., Jager, A., Zunder, E., Finck, R., Gedman, A., Radtke, I., Downing, J., Pe’er, D., and Nolan, G. (2015). Data-driven phenotypic dissection of aml reveals progenitor-like cells that correlate with prognosis. Cell. 35 Cooley, S. M., Hamilton, T., Ray, J. C. J., and Deeds, E. J. (2020). A novel metric reveals previously unrecognized distortion in dimensionality reduction of scrna-seq data. bioRxiv. Ding, J., Wen, H., Tang, W., Liu, R., Li, Z., Venegas, J., Su, R., Molho, D., Jin, W., Zuo, W., et al. (2022). Dance: A deep learning library and benchmark for single-cell analysis. bioRxiv, pages 2022–10. Eberwine, J., Yeh, H., Miyashiro, K., Cao, Y., Nair, S., Finnell, R., Zettel, M., and Coleman, P. (1992). Analysis of gene expression in single live neurons. Proceedings of the National Academy of Sciences, 89(7):3010–3014. Ester, M., Kriegel, H.-P., Sander, J., Xu, X., et al. (1996). 
A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd, volume 96, pages 226–231. Fernández, X., Borghini, E., Mindlin, G., and Groisman, P. (2023). Intrinsic persistent homology via density-based metric learning. Journal of Machine Learning Research, 24(75):1–42. Fischer, B., Zöller, T., and Buhmann, J. M. (2001). Path based pairwise data clustering with application to texture segmentation. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 235–250. Springer. Gabow, H. and Tarjan, R. (1988). Algorithms for two bottleneck optimization problems. Journal of Algorithms, 9:411–417. García Trillos, N., Sanz-Alonso, D., and Yang, R. (2019). Local regularization of noisy point clouds: Improved global geometric estimates and data analysis. Journal of Machine Learning Research, 20(136):1–37. Ghojogh, B., Ghodsi, A., Karray, F., and Crowley, M. (2020). Multidimensional scaling, sammon mapping, and isomap: Tutorial and survey. arXiv preprint arXiv:2009.08136. Groisman, P., Jonckheere, M., and Sapienza, F. (2018). Nonhomogeneous Euclidean first-passage percolation and distance learning. arXiv preprint arXiv:1810.09398. Groisman, P., Jonckheere, M., and Sapienza, F. (2022). Nonhomogeneous euclidean first-passage percolation and distance learning. Bernoulli, 28(1):255–276. Grün, D. et al. (2018). Revealing dynamics of gene expression variability in cell state space. Nature methods, 17:45–49. Hafemeister, C. and Satija, R. (2019). Normalization and variance stabilization of single-cell rna-seq data using regularized negative binomial regression. Genome Biology, 20(1). Herman, J. S., Grün, D., et al. (2018). Fateid infers cell fate bias in multipotent progenitors from single-cell rna-seq data. Nature methods, 15(5):379. 36 Hu, T. (1961). Letter to the editor: The maximum capacity route problem. Operations Research, 9(6):898–900. Huang, M., Wang, J., Torre, E., Dueck, H., Shaffer, S., Bonasio, R., Murray, J. I., Raj, A., Li, M., and Zhang, N. R. (2018). Saver: gene expression recovery for single-cell rna sequencing. Nature methods, 15(7):539–542. Hwang, S., Damelin, S., and Hero, A. (2016). Shortest path through random points. The Annals of Applied Probability, 26(5):2791–2823. Kaufman, L. and Rousseeuw, P. (2009). Finding Groups in Data: An Introduction to Cluster Analysis. Kiselev, V. Y., Kirschner, K., Schaub, M. T., Andrews, T., Yiu, A., Chandra, T., Natarajan, K. N., Reik, W., Barahona, M., Green, A. R., and Hemberg, M. (2017). SC3: consensus clustering of single-cell RNA-seq data. Nature Methods, 14:483–486. Kobak, D. and Berens, P. (2019). The art of using t-sne for single-cell transcriptomics. Nature Communications, 10:2041–1723. Lam, C. and Yao, Q. (2012). Factor modeling for high-dimensional time series: inference for the number of factors. The Annals of Statistics, pages 694–726. Lee, J. M. (2018). Introduction to Riemannian manifolds. Springer. Li, W. V. and Li, J. J. (2018). An accurate and robust imputation method scimpute for single-cell rna-seq data. Nature communications, 9(1):1–9. Lin, P., Troup, M., and Ho, J. W. K. (2017). CIDR: Ultrafast and accurate clustering through imputation for single-cell RNA-seq data. Genome Biology, 18. Little, A., Maggioni, M., and Murphy, J. (2020a). Path-based spectral clustering: Guarantees, robustness to outliers, and fast algorithms. Journal of Machine Learning Research, 21(6):1–66. Little, A., McKenzie, D., and Murphy, J. (2020b). 
Balancing geometry and density: Path distances on high-dimensional data. arXiv preprint arXiv:2012.09385. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I., and Yosef, N. (2018). Deep generative modeling for single-cell transcriptomics. Nature methods, 15(12):1053–1058. Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., and Hornik, K. (2021). cluster: Cluster Analysis Basics and Extensions. McInnes, L., Healy, J., and Melville, J. (2018). Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. 37 Mckenzie, D. and Damelin, S. (2019). Power weighted shortest paths for clustering Euclidean data. Foundations of Data Science, 1(3):307. Moon, K. R., van Dijk, D., Wang, Z., Gigante, S., Burkhardt, D. B., Chen, W. S., Yim, K., Elzen, A. v. d., Hirn, M. J., Coifman, R. R., Ivanova, N. B., Wolf, G., and Krishnaswamy, S. (2019). Visualizing structure and transitions in high-dimensional biological data. Nature Biotechnology, 37(12):1482–1492. Moscovich, A., Jaffe, A., and B.Nadler (2017). Minimax-optimal semi-supervised regression on unknown manifolds. In AISTATS, pages 933–942. Platt, J. (2005). Fastmap, metricmap, and landmark mds are all nyström algorithms. In International Workshop on Artificial Intelligence and Statistics, pages 261–268. PMLR. Pollack, M. (1960). Letter to the editor: The maximum capacity through a network. Operations Research, 8(5):733–736. Sajama and Orlitsky, A. (2005). Estimating and computing density based distance metrics. In ICML, pages 760–767. Saliba, A.-E., Westermann, A. J., Gorski, S. A., and Vogel, J. (2014). Single-cell RNA-seq: advances and future challenges. Nucleic Acids Research, 42(14):8845–8860. Shamai, G., Zibulevsky, M., and Kimmel, R. (2020). Efficient inter-geodesic distance computation and fast classical scaling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(1):74–85. Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., III, W. M. M., Hao, Y., Stoeckius, M., Smibert, P., and Satija, R. (2019). Comprehensive integration of single-cell data. Cell, 177:1888–1902. Tabula Muris Consortium, Overall coordination, L. c. e. a. (2018). Single-cell transcriptomics of 20 mouse organs creates a tabula muris. Nature, 562:367–372. Tang, F., Barbacioru, C., Wang, Y., Nordman, E., Lee, C., Xu, N., Wang, X., Bodeau, J., Tuch, B. B., Siddiqui, A., et al. (2009). mrna-seq whole-transcriptome analysis of a single cell. Nature methods, 6(5):377–382. Tang, J., Liu, J., Zhang, M., and Mei, Q. (2016). Visualizing large-scale and high-dimensional data. In Proceedings of the 25th international conference on world wide web, pages 287–297. Tenenbaum, J., Silva, V. D., and Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323. Tian, L., Dong, X., Freytag, S., Lê Cao, K.-A., Su, S., JalalAbadi, A., Amann-Zalcenstein, D., 38 Weber, T. S., Seidi, A., Jabbari, J. S., et al. (2019a). Benchmarking single cell rna-sequencing analysis pipelines using mixture control experiments. Nature methods, 16(6):479–487. Tian, T., Wan, J., Song, Q., and Wei, Z. (2019b). Clustering single-cell rna-seq data with a model-based deep learning approach. Nature Machine Intelligence, 1(4):191–198. Van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-sne. Journal of machine learning research, 9(11). Vincent, P. and Bengio, Y. (2003). Density-sensitive metrics and kernels. In Snowbird Learning Workshop. Von Luxburg, U. (2007). 
A tutorial on spectral clustering. Statistics and computing, 17(4):395–416. Wang, B., Zhu, J., Pierson, E., Ramazzotti, D., and Batzoglou, S. (2017). Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nature Methods, 14:414–416. Wang, J., Ma, A., Chang, Y., Gong, J., Jiang, Y., Qi, R., Wang, C., Fu, H., Ma, Q., and Xu, D. (2021). scgnn is a novel graph neural network framework for single-cell rna-seq analyses. Nature communications, 12(1):1882. Williams, C. and Seeger, M. (2001). Using the nyström method to speed up kernel machines. In Proceedings of the 14th annual conference on neural information processing systems, number CONF, pages 682–688. Wolf, F. A., Angerer, P., and Theis, F. J. (2018). SCANPY: large-scale single-cell gene expression data analysis. Genome Biology, 19. Xu, C. and Su, Z. (2015). Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics, 31(12):1974–1980. Xu, X., Ester, M., Kriegel, H.-P., and Sander, J. (1998). A distribution-based clustering algorithm for mining in large spatial databases. In Proceedings 14th International Conference on Data Engineering, pages 324–331. IEEE. Yip, S. H., Wang, P., Kocher, J.-P. A., Sham, P. C., and Wang, J. (2017). Linnorm: improved statistical analysis for single cell RNA-seq expression data. Nucleic Acids Research, 45(22):e179– e179. Yu, H., Zhao, X., Zhang, X., and Yang, Y. (2012). Isomap using nyström method with incremental sampling. Advances in Information Sciences & Service Sciences, 4(12). Zhang, S. and Murphy, J. (2021). Hyperspectral image clustering with spatially-regularized ultra- metrics. Remote Sensing, 13(5):955. 39 Zhu, X., Zhang, J., Xu, Y., Wang, J., Peng, X., and Li, H.-D. (2020). Single-cell clustering based on shared nearest neighbor and graph partitioning. Interdisciplinary Sciences: Computational Life Sciences, 12:117–130. žurauskienė, J. and Yau, C. (2016). pcaReduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics, 17. 40 APPENDIX A DATA PREPROCESSING In this section the pre-processing of all RNA data sets is described. The main preprocessing steps are quality control, imputation with SAVER Huang et al. (2018), and normalization. Below we provide information about quality control and imputation and then we describe how we used those steps according to the guidelines of each method. A.1 Data availability The raw data of Cellmix and RNAmix are downloaded from GEO under accession code GSE118767, and the preprocessed data are available at their github repository. The PBMC4K data is available at 10x Genomics’s website. The Baron’s pancreatic data is available in GEO with the access code GSM2230757. The mouse tissue scRNAseq data sets are accessible on Figshare. A.2 Main steps Quality Control: Quality control is applied on RNAmix1, RNAmix2, Cellmix, BaronPanc, PMC4K, Beta. Specifically, cells where at most 200 genes are expressed are filtered out. Also, only genes that are expressed in more than 3 cells are included in the data set. In addition, cells with percentage of expressed mitochondrial genes greater than 20% are excluded. The data sets TMpanc and TMLung as found in Figshare have passed a quality control check with cutoffs of at least 500 genes and 50,000 reads, so no additional filtering was applied. Imputation: Imputation with SAVER Huang et al. (2018) was applied to all RNA seq data sets apart from Cellmix. 
After removing multiplets the Cellmix data set included high quality data and every clustering method achieved high ARI, suggesting no need for further processing and imputation. A.3 Preprocessing per method Path metrics (PM), 𝑘-means, DBSCAN: After quality control and imputation, we normalize the data. RNAmix1, RNAmix2, TMLung, Beta, TMPanc, PBMC4K were row normalized and log transformed (data matrix had cells in rows and genes in columns). We then restrict to the top 41 2000 high variance genes. For the BaronPanc and CellMix, which have large sample size, SCT transformation was applied instead and the top 3000 variable genes were kept Hafemeister and Satija (2019); Choudhary and Satija (2022). When needed, we rescale genes where variances were extremely high. As a next step we apply PCA for dimension reduction, keeping the top 40 PC’s. Finally, denoising is applied by replacing each point with the mean of its local neighborhood, using a neighborhood size of 𝐾 = 12 points. For very large data sets, one may want to use a larger 𝐾. UMAP+DBSCAN, 𝑡-SNE+𝑘-means: After quality control and imputation, we apply Linnnorm Yip et al. (2017) to all data sets. Then, we restrict to the top 2000 high variance genes. When needed, we rescale genes with extremely high variance. Finally, we apply PCA for dimension reduction, keeping the top 40 PC’s. Seurat: For this method, we process the data as for PM and then use Seurat’sStuart et al. (2019) functions to find neighboring points and cluster them. Notice that here we adjust the parameter ‘res’, to retrieve the correct number of clusters. Seurat_def: We follow the suggested processing and clustering workflow of Seurat Stuart et al. (2019) for all data sets. Notice that we normalize BaronPanc and CellMix with the SCT method Hafemeister and Satija (2019); Choudhary and Satija (2022). Then data sets are clustered with adjusted resolution parameter, to retrieve the correct number of clusters. SC3: After quality control and imputation we normalize the information of every cell and multiply by 10000. Then we use the log of the data for clustering with SC3 Kiselev et al. (2017). Exception to this are the BaronPanc and CellMix data set, for which we use SCT normalization. scanpy: After quality control and imputation we use the lognormalization of scanpy Wolf et al. (2018). Exception to this are the BaronPanc and CellMix data set, for which we use SCT normalization. RaceID3: We apply quality control on the cells of the counts of the data set. RaceID3 Herman et al. (2018); Grün et al. (2018) applies filtering and normalization in one step, which we adjust to have about the same amount of cells and genes as with other methods. Notice that we do not apply imputation because imputed data would not be counts, which are the required input of RaceID3. 42 SIMLR: For SIMLR Wang et al. (2017) After quality control and imputation we normalize the information of every cell and multiply by 10000 and use the log of those data. Exception to this are the BaronPanc and CellMix data set, for which we use SCT normalization. 43 APPENDIX B ADDITIONAL CLUSTERING RESULTS Here we present more clustering evaluation results based on Entropy of Cluster Accuracy (ECA) and Entropy of cluster Purity (ECP). The ECA can quantify the variety of true labels within a predicted cluster and ECP can quantify the variety of predicted cluster labels within a true group. Definition 2 Let 𝑁 represent the number of true groups and 𝑀 the number of predicted clusters. 
Let 𝑁 𝑗 be the number of true groups with data points within the 𝑗 th predicted cluster and similarly let 𝑀 𝑗 be the number of predicted clusters with data points within the 𝑗 th true group. Finally let 𝑝(𝑥 𝑗 ) denote the proportion of data points belonging to the 𝑗 th true group that are within a given 𝑗 th predicted cluster and let 𝑝𝑖 (𝑦 𝑗 ) denote the proportion of data points of 𝑗 th predicted cluster that are within a given 𝑖 th true group. Then: 𝑀 𝑁𝑖 1 ∑︁ ∑︁ 𝐸𝐶 𝐴 = − 𝑝𝑖 (𝑥 𝑗 )𝑙𝑜𝑔( 𝑝(𝑥 𝑗 )) 𝑀 𝑖=1 𝑗=1 𝑁 𝑀 1 ∑︁ ∑︁𝑖 𝐸𝐶𝑃 = − 𝑝𝑖 (𝑦 𝑗 )𝑙𝑜𝑔( 𝑝(𝑦 𝑗 )) 𝑁 𝑖=1 𝑗=1 For a given clustering, low ECA means that data points in a predicted cluster originate from the same true group. On the other hand, low ECP indicates that almost all the data points in a true group were assigned the same clustering label. Use of ECP and ECA in clustering of scRNAseq data was also found in Tian et al. (2019a). Method Balls EWB Swiss SO(3) 𝑘-means 0.082 1.050 0.588 1.084 DBSCAN 0.385 0.114 0 0 UMAP+DBSCAN 0.941 0.695 0 0 𝑡-SNE+ 𝑘-means 0.153 0.630 0 0.440 Seurat 0.255 0.193 0 0 PM1.5 0.123 0.447 0 0.460 PM2 0.142 0.020 0 0 PM4 0.253 0.268 0 0 Table B.1 ECP for manifold data. 44 Method Balls EWB Swiss SO(3) 𝑘-means 0.082 1.096 0.633 1.089 DBSCAN 0.362 0.231 0 0 UMAP+DBSCAN 0.200 0.014 0 0 𝑡-SNE+ 𝑘-means 0.147 0.582 0 0.440 Seurat 0.250 0.183 0 0 PM1.5 0.120 0.461 0 0.462 PM2 0.138 0.020 0 0 PM4 0.248 0.291 0 0 Table B.2 ECA for manifold data. Method RNA1 RNA2 TMLung Beta TMPanc BaronPanc PBMC4k CellMix Seurat_res=0.8 7 8 7 7 11 12 13 14 PM1.5 12 11 8 4 15 9 4 5 PM2 7 7 9 4 5 8 5 5 PM4 8 8 16 4 5 7 4 5 True 𝑘 7 7 7 3 7 8 4 5 Table B.3 Predicted number of clusters for Seurat and Path metrics for RNA data (𝑘 is the true number of clusters). Dataset Seurat Seurat_def PM1.5 PM2 PM4 100% of Baron’s Pancreatic 0.941 0.971 0.804 0.969 0.853 50% of Baron’s Pancreatic 0.880 0.844 0.969 0.969 0.969 25% of Baron’s Pancreatic 0.973 0.705 0.973 0.973 0.973 10% of Baron’s Pancreatic 0.410 0.185 0.674 0.939* 0.804 Table B.4 Downsampling results. Method RNA1 RNA2 TMLung Beta TMPanc BaronPanc PBMC4k CellMix SC3 0.289 0.114 0.228 0.058 0.132 0.368 0.328 0 Scanpy 0.481 0.242 0.314 0.147 0.309 0.093 0.054 0 RaceID3 0.336 0.621 0.168 0.342 0.122 0.181 0.207 0.000 SIMLR 0.163 0.294 0.263 0.057 0.407 0.104 0.190 0 Seurat 0.319 0.230 0.193 0.146 0.290 0.097 0.265 0 Seurat_def 0.256 0.270 0.423 0.128 0.289 0.112 0.106 0 𝑘-means 0.147 0.268 0.221 0.080 0.194 0.164 0.193 0 DBSCAN 0.090 0.188 0.368 0.440 0.202 0.146 0.262 0 UMAP+db 0.078 0.151 0.449 0.019 0.124 0.076 0.163 0 𝑡-SNE+ 𝑘-means 0.104 0.137 0.426 0.085 0.259 0.171 0.187 0.126 PM1.5 0.110 0.146 0.180 0.061 0.305 0.147 0.196 0 PM2 0.110 0.071 0.362 0.061 0.279 0.077 0.195 0 PM4 0.110 0.123 0.230 0.048 0.156 0.159 0.096 0 Table B.5 ECA for RNA data. 
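As a companion to Definition 2, the following is a minimal sketch of how the ECA and ECP values reported in Tables B.5 and B.6 can be computed from a vector of true group labels and a vector of predicted cluster labels. It follows the verbal description above (per-predicted-cluster entropy of true labels for ECA, per-true-group entropy of predicted labels for ECP) and is not the evaluation code used in this chapter.

```python
import numpy as np

def entropy_of_proportions(labels):
    """Shannon entropy of the empirical label proportions in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def eca(true_labels, pred_labels):
    """Entropy of cluster accuracy: average, over predicted clusters, of the
    entropy of the true-group labels within that cluster. Low ECA means each
    predicted cluster draws its points from a single true group."""
    true_labels, pred_labels = np.asarray(true_labels), np.asarray(pred_labels)
    clusters = np.unique(pred_labels)
    return float(np.mean([entropy_of_proportions(true_labels[pred_labels == c])
                          for c in clusters]))

def ecp(true_labels, pred_labels):
    """Entropy of cluster purity: average, over true groups, of the entropy of
    the predicted labels within that group. Low ECP means almost all points of
    a true group were assigned the same predicted label."""
    true_labels, pred_labels = np.asarray(true_labels), np.asarray(pred_labels)
    groups = np.unique(true_labels)
    return float(np.mean([entropy_of_proportions(pred_labels[true_labels == g])
                          for g in groups]))

# Toy example: one predicted cluster mixes points from two true groups.
truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred  = [0, 0, 0, 1, 1, 2, 2, 2, 2]
print(round(eca(truth, pred), 3), round(ecp(truth, pred), 3))
```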
45 Method RNA1 RNA2 TMLung Beta TMPanc BaronPanc PBMC4k CellMix SC3 0.328 0.114 0.294 0.058 0.070 0.301 0.062 0 Scanpy 0.517 0.183 0.322 0.146 0.516 0.088 0.057 0 RaceID3 0.381 0.665 0.182 0.351 0.268 0.413 0.310 0 SIMLR 0.151 0.267 0.275 0.057 0.543 0.380 0.360 0 Seurat 0.292 0.282 0.230 0.150 0.540 0.122 0.053 0.027 Seurat_def 0.320 0.258 0.436 0.133 0.284 0.089 0.062 0 𝑘-means 0.131 0.255 0.244 0.079 0.215 0.395 0.316 0 DBSCAN 0.075 0.141 0.404 0.135 0.138 0.109 0.051 0 UMAP+db 0.151 0.226 0.413 0.126 0.061 0.248 0.087 0 𝑡-SNE+ 𝑘-means 0.102 0.133 0.437 0.089 0.494 0.402 0.451 0.147 PM1.5 0.096 0.136 0.197 0.061 0.482 0.273 0.312 0 PM2 0.096 0.062 0.323 0.061 0.158 0.081 0.308 0 PM4 0.096 0.114 0.184 0.048 0.260 0.226 0.055 0 Table B.6 ECP for RNA data. 46 CHAPTER 4 SHARED NEAREST NEIGHBORS GRAPH BASED SPECTRAL CLUSTERING 4.1 Introduction The exploration of the theoretical properties of spectral clustering on finite sample data started more than twenty years ago (Spielman and Teng, 1996; Guattery and Miller, 1998; Ng et al., 2001; Meilă and Shi, 2001) along with theoretical properties for increasing sample size (Luxburg et al., 2004). One of the advantages of spectral clustering is the interpretability of its performance on data points represented as vertices on graphs (𝑘NN, 𝜖-neighbor graph) that are connected with edges based on their similarity to other data points (von Luxburg, 2007). Although theoretical results of 𝑘NN graph-based clustering methods have been explored by Maier et al. (2009), the theoretical properties of SNN graph-based clustering combined with spectral clustering haven’t been investigated yet. In the following sections, we use a similar framework as of Maier et al. (2009) to provide a range for the number of neighbors used for the construction of the SNN graph, such that exact cluster identification is achieved. 4.2 Framework Our aim is to cluster a set of 𝑛 points, 𝑋1 , . . . , 𝑋𝑛 , which have been drawn from some underlying density, 𝑝, of R𝑚 . For this task, we build the SNN graph of those points and use its Laplacian for spectral clustering. The number of true clusters is known and denoted as 𝐾. 4.2.1 The SNN graph For the construction of the SNN graph, we first find the 𝑘 nearest neighbors of each point 𝑋𝑖 . Let 𝑘NN(𝑋𝑖 ) be the set of the first 𝑘 nearest neighbors of 𝑋𝑖 . Then, we connect two vertices 𝑋𝑖 and 𝑋 𝑗 , if 𝑋𝑖 ∈ 𝑘 𝑁 𝑁 (𝑋 𝑗 ) or if 𝑋 𝑗 ∈ 𝑘 𝑁 𝑁 (𝑋𝑖 ). The weight, 𝑊𝑖, 𝑗 , of their edge is the Jaccard similarity of 𝑘NN(𝑋𝑖 ), 𝑘NN(𝑋 𝑗 ), i.e. |𝑘 𝑁 𝑁 (𝑋𝑖 ) ∩ 𝑘 𝑁 𝑁 (𝑋 𝑗 )| 𝑊𝑖, 𝑗 = . (4.1) |𝑘 𝑁 𝑁 (𝑋𝑖 ) ∪ 𝑘 𝑁 𝑁 (𝑋 𝑗 )| The SNN graph is a symmetric graph denote as 𝐺 𝑆𝑁 𝑁 (𝑘). 47 4.2.2 The SNN Spectral Clustering Algorithm Algorithm 4.1 SNN Spectral Clustering Algorithm. 1: Input: 𝑋 ∈ R𝑛×𝑚 , number of clusters 𝐾, nearest neighbors 𝑘, bandwidth ℎ, density bound 𝑡, 1 1 Denoise = (TRUE, FALSE), Laplacian = (𝐷 − 𝑊, 𝐷 − 2 𝑊 𝐷 − 2 , 𝐼 − 𝐷 −1𝑊), 𝑡, 𝛿, 𝜖 𝑛 2: 3: Output: Predicted clustering labels ℓ ∈ R𝑛 4: 5: % 𝑘-nearest neighbors: 6: 7: for i=1 to n do 8: 𝑘 𝑁 𝑁𝑖 ← {𝑋 𝑗 : 𝑋 𝑗 is one of the 𝑘 𝑁 𝑁 of 𝑋𝑖 , based on Euclidean distance} 9: 10: % Shared Nearest Neighbors graph: 11: |𝑘 𝑁 𝑁 ∩𝑘 𝑁 𝑁 | 12: 𝑊𝑖, 𝑗 ← |𝑘 𝑁 𝑁𝑖𝑖 ∪𝑘 𝑁 𝑁 𝑗𝑗 | 13: 𝐺 𝑆𝑁 𝑁 (𝑘) ← graph with adjacency matrix 𝑊 14: 15: % Kernel estimation of density 𝑝: 16: 17: for i=1 to n do 𝑛 𝑋 −𝑋 1 Í 18: 𝑝ˆ𝑛 (𝑋𝑖 ) ← 𝑛ℎ 𝐾( 𝑖 ℎ 𝑗 ) 𝑗=1 19: 20: % Denoising: 21: 22: if Denoising = TRUE then 23: Remove vertices and edges of X s.t. 
𝑝ˆ𝑛 (𝑋) < 𝑡 − 2𝜖 𝑛 24: Remove components with size less than 𝛿𝑛 25: 𝐺 ′𝑆𝑁 𝑁 (𝑘) ← denoised 𝐺 𝑆𝑁 𝑁 (𝑘) 26: 𝑊, 𝑛 ← adjacency of 𝐺 ′𝑆𝑁 𝑁 (𝑘), number of vertices in 𝐺 ′𝑆𝑁 𝑁 (𝑘) 27: 28: % Eigendecomposition of SNN graph Laplacian: 29: Í 30: 𝐷 𝑖,𝑖 ← 𝑛𝑗=1 𝑊𝑖, 𝑗 31: 𝐷 ← diagonal(𝐷 𝑖,𝑖 ) 32: 𝐿 ← Laplacian 33: 𝑉 ← 𝐾 eigenvector matrix of𝐿 34: 35: % Clustering of 𝑋1 , . . . , 𝑋𝑛 : 36: 37: ℓ ← 𝑘-means_plusplus(𝑉, 𝐾) 38: 48 Algorithm 4.2 Eigenvector matrix of 𝐿. 1 1 1: Input: 𝐿, 𝐾, Laplacian = (𝐷 − 𝑊, 𝐷 − 2 𝑊 𝐷 − 2 , 𝐼 − 𝐷 −1𝑊) 2: 3: Output: 𝐾 eigenvector matrix of 𝐿 4: 1 1 5: if 𝐿 = 𝐷 − 2 𝑊 𝐷 − 2 then 6: {𝜆 1 , 𝜆2 , ..., 𝜆 𝐾 , ...} ← eigenvalues of 𝐿 such that 𝜆 1 > 𝜆2 > 𝜆3 ... 7: 𝑉 = [𝑉1 , 𝑉2 , ..., 𝑉𝐾 ] ← 𝑉𝑖 eigenvector of 𝜆𝑖 8: % Row normalization of 𝑉 9: 𝑣 𝑖 ← 𝑖-th row of 𝑉 𝑞 ← (𝑞 1 , . . . , 𝑞 𝑛 ) where 𝑞𝑖 = ( 𝑗 𝑣 𝑖,2 𝑗 ) 1/2 Í 10: 11: 𝑉˜ ← 𝐷 −1 𝑞 𝑉, where 𝐷 𝑞 = 𝑑𝑖𝑎𝑔𝑜𝑛𝑎𝑙 (𝑞) 12: ℓ ← 𝑘-means_plusplus(𝑉, ˜ 𝐾) 13: else 14: {𝜆 1 , 𝜆2 , ..., 𝜆 𝐾 , ...} ← eigenvalues of 𝐿 such that 𝜆 1 < 𝜆2 < 𝜆3 ... 15: 𝑉 = [𝑉1 , 𝑉2 , ..., 𝑉𝐾 ] ← 𝑉𝑖 eigenvector of 𝜆𝑖 4.3 Theoretical results The theoretical results presented in section 4.3 are divided into two cases; the noise-free case and the noisy case, based on Maier et al. (2009). Noise-free case. In this case, we consider a probability distribution 𝑝, whose support consists of several high-density regions separated by a positive distance from each other. We consider that successful cluster identification means that each high-density region corresponds to a unique predicted cluster. Since there is no overlap between high-density regions, every point will belong to one cluster only. Hence, the denoising step of 4.1 will not remove any points from the SNN graph. Noisy case. In this case, the high-density regions of 𝑝 are connected by low-density regions. For a 𝑡 > 0 we define the 𝑡-level set, 𝐿 (𝑡), as the closure of all points 𝑥 ∈ R𝑚 with 𝑝(𝑥) ≥ 𝑡, i.e. {𝑥 : 𝑝(𝑥) ≥ 𝑡}. We denote those components as 𝐶 (1) , . . . , 𝐶 (𝐾) . In the following results, we explore two approaches to the noisy case. In the first approach, the true clusters are the sets 𝐶 (1) , . . . , 𝐶 (𝐾) . Points in low-density regions do not belong to any cluster and are removed. We consider that clusters are identified exactly by our algorithm when each connected component of 𝐿 (𝑡) is included in a unique predicted cluster 49 (𝑖) 𝑝 max supremum of density attained by points of 𝐶 (𝑖) (𝑖) 𝑝 min infimum of density attained by points of 𝐶 (𝑖) 𝑢 (𝑖) lower bound on distance of cluster 𝐶 (𝑖) to other clusters 𝑢𝑖 𝑗 distance between cluster 𝐶 (𝑖) and cluster 𝐶 ( 𝑗) 𝜌(𝑢 (𝑖) ) probability of balls of radius 𝑢 (𝑖) in 𝐶 (𝑖) 𝛽 (𝑖) probability mass of cluster 𝐶 (𝑖) ′ 𝐺 𝑆𝑁 𝑁 (𝑘) the SNN graph after denoising Table 4.1 Notations. and the ratio of noisy points to cluster points goes to zero. We also consider rough identification of clusters, when the clustering algorithm predicts components that contain points of a unique 𝐶 (𝑖) plus some noisy points that do not belong to any cluster. The following table includes notation used in sections . In the second approach, noisy points are not removed. Instead, they are defined as cluster points of the 𝐿(𝑡) component closest to them. For this approach, there are connections between subgraphs that correspond to true clusters and spectral clustering might mis-cluster points that are equidistant from two clusters. We provide results regarding the mis-clustering error. Let the sets 𝐶 (1) , . . . , 𝐶 (𝐾) be 𝐾 disjoint, compact and connected subsets of R𝑚 . 
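Before the geometric assumptions are developed further, the following minimal sketch fixes ideas about what Algorithm 4.1 computes in the noise-free setting: the SNN graph with Jaccard weights on shared k-nearest-neighbor lists, the unnormalized Laplacian L = D − W, and k-means++ on the K eigenvectors associated with its smallest eigenvalues. The denoising step and the alternative Laplacians of Algorithm 4.2 are omitted; this is an illustrative sketch, not the implementation evaluated later in the chapter.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def snn_spectral_clustering(X, K, k):
    """Cluster the rows of X into K groups using the SNN graph G_SNN(k).

    Two points are joined if either is among the other's k nearest neighbors;
    the edge weight is the Jaccard similarity of their kNN lists. Spectral
    clustering then uses the unnormalized Laplacian L = D - W and runs
    k-means++ on the K eigenvectors of its smallest eigenvalues.
    """
    n = X.shape[0]
    D = cdist(X, X)                        # pairwise Euclidean distances
    knn = np.argsort(D, axis=1)[:, 1:k + 1]  # k nearest neighbors (self excluded)
    knn_sets = [set(row) for row in knn]

    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if i in knn_sets[j] or j in knn_sets[i]:
                inter = len(knn_sets[i] & knn_sets[j])
                union = len(knn_sets[i] | knn_sets[j])
                W[i, j] = W[j, i] = inter / union

    deg = W.sum(axis=1)
    L = np.diag(deg) - W                   # unnormalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    V = eigvecs[:, :K]                     # eigenvectors of the K smallest eigenvalues
    return KMeans(n_clusters=K, init="k-means++", n_init=10).fit_predict(V)

# Toy example: two well-separated Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
print(snn_spectral_clustering(X, K=2, k=10)[:5])
```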
The boundary 𝜕𝐶 (𝑖) of every 𝐶 (𝑖) is a smooth (𝑚 −1)-dimensional submanifold in R𝑚 . We will denote with 𝜅 (𝑖) the minimal curvature radius of 𝜕𝐶 (𝑖) , which is equal to the inverse of the largest principal curvature ∫ of 𝜕𝐶 (𝑖) . Also we will denote with 𝛽 (𝑖) = 𝜇(𝐶 (𝑖) ) = 𝐶 (𝑖) 𝑝𝑑𝜆, the probability mass of 𝐶 (𝑖) , where 𝜆 is the Lebesgue measure in R𝑚 . The following results about within cluster connectivity of a predicted cluster and isolation (disconnectivity) of each cluster will be proven using the collar set of each cluster. Specifically, the collar of 𝐶 (𝑖) is defined as 𝐶𝑜𝑙 (𝑖) (𝜈) = {𝑥 ∈ 𝐶 (𝑖) | 𝑑𝑖𝑠𝑡 (𝑥, 𝜕𝐶 (𝑖) ) ≤ 𝜈}, with 𝜈 < 𝜅 (𝑖) . Furthermore, (𝑖) we define the maximal covering radius to be 𝜈𝑚𝑎𝑥 = max𝜈≤𝜅 (𝑖) {𝜈 | 𝐶 (𝑖) ∖ 𝐶𝑜𝑙 (𝑖) (𝜈) is connected } and we denote with 𝑢 (𝑖) a lower bound of the distance of 𝐶 (𝑖) from 𝐶 (𝑖) with 𝑗 ≠ 𝑖. In the noise-free case, 𝑢 (𝑖) will be considered strictly greater than 0. Finally, the 𝑘NN radius of a point 𝑋𝑖 is the (𝑖) maximum distance of 𝑋𝑖 to a point in 𝑘NN(𝑋𝑖 ). The minimal 𝑘NN radius of a cluster 𝐶 (𝑖) , 𝑅𝑚𝑖𝑛 , is the minimal 𝑘NN radius of the points in 𝐶 (𝑖) . 50 4.3.1 Noise-free case Lemma 1 (Within cluster connectedness in 𝐺 𝑆𝑁 𝑁 (𝑘)) Let A𝑛(𝑖) denote the event that the points of cluster 𝐶 𝑖 are connected in 𝐺 𝑆𝑁 𝑁 (𝑘). For 𝑧 ∈   (𝑖) 0, 2𝑚𝑖𝑛{𝑢 (𝑖) , 𝜈𝑚𝑎𝑥 } , !       (𝑛−1) 𝑃 (A𝑛(𝑖) ) 𝑐 ≤ 𝑛𝛽 (𝑖) 𝑃 𝑀 ≥ 𝑘 + 𝑁 1 − (𝑖) 𝑝 min 𝜂𝑚 𝑧 𝑚 /4𝑚 (𝑖) 1 − 𝜂𝑚 𝑧 𝑚 /4𝑚 𝑝 min (𝑖)  − 𝑛𝑝 max , (4.2) (𝑖) for 𝑀 ∼ 𝐵𝑖𝑛(𝑛 − 1, 𝑝 max 𝜂𝑚 𝑧 𝑚 ). (𝑖) Proof. Observe that if 𝑅𝑚𝑖𝑛 > 𝑧, for some 𝑧 > 0 and if for two points of 𝑋𝑖 , 𝑋 𝑗 ∈ 𝐶 (𝑖) holds that 𝑑 (𝑋𝑖 , 𝑋 𝑗 ) ≤ 𝑧, then we have that 𝑋𝑖 ∈ 𝑘NN(𝑋 𝑗 ) and 𝑋 𝑗 ∈ 𝑘NN(𝑋𝑖 ) . Furthermore, if we can find a covering of 𝐶 (𝑖) ∖ 𝐶𝑜𝑙 (𝑖) (𝑧/4) of a finite number of balls of radius 𝑧/4, where every ball contains at least two points of 𝐶 (𝑖) , then points in neighboring balls have distance less than 𝑧. Hence they are in the list of 𝑘 nearest neighbors of each other and every pair of points will have a shared neighbor. This implies that every point in 𝐶 (𝑖) ∖ 𝐶𝑜𝑙 (𝑖) (𝑧/4) will be connected in 𝐺 𝑆𝑁 𝑁 (𝑘). Notice that points in 𝐶𝑜𝑙 (𝑖) (𝑧/4) will have at most distance 3𝑧/4 from the balls of the covering and since every ball includes at least two points of the cluster, we conclude that the points of 𝐶𝑜𝑙 (𝑖) (𝑧/4) will be connected to 𝐶 (𝑖) ∖ 𝐶𝑜𝑙 (𝑖) (𝑧/4). As a result, the points of 𝐶 (𝑖) will be connected in 𝐺 𝑆𝑁 𝑁 (𝑘). Let F𝑧(𝑖) be the event that, given a covering of 𝐶 (𝑖) ∖ 𝐶𝑜𝑙 (𝑖) (𝑧/4), there exists a ball that doesn’t contain (𝑖) at least two points of 𝐶 (𝑖) . Based on the above observation, {𝑅𝑚𝑖𝑛 > 𝑧} ∩ (F𝑧(𝑖) ) 𝑐 implies the points of 𝐶 (𝑖) will be connected on 𝐺 𝑆𝑁 𝑁 (𝑘). Therefore,       𝑃 (A𝑛(𝑖) ) 𝑐 ≤𝑃 (𝑖) {𝑅𝑚𝑖𝑛 ≤ 𝑧} + 𝑃 F𝑧(𝑖) (4.3)   (i) For P {Rmin ≤ z} : We define 𝑁 𝑠 = |{ 𝑗 ≠ 𝑠|𝑋 𝑗 ∈ 𝐵(𝑋𝑠 , 𝑧)}|, for 1 ≤ 𝑠 ≤ 𝑛. Then, (𝑖) if the event {𝑅𝑚𝑖𝑛 ≤ 𝑧} is true =⇒ ∃𝑋𝑠 ∈ 𝐶 (𝑖) s.t. 𝑚𝑎𝑥{𝑑 (𝑦, 𝑋𝑠 ) | 𝑦 ∈ 𝑘 𝑁 𝑁 (𝑋𝑠 )} ≤ 𝑧 =⇒ ∃𝑋𝑠 ∈ 𝐶 (𝑖) s.t. 𝑁 𝑠 ≥ 𝑘. 51 Ø 𝑛 (𝑖) {𝑁 𝑠 ≥ 𝑘 } ∩ {𝑋𝑠 ∈ 𝐶 (𝑖) } and then we have:  Therefore, {𝑅𝑚𝑖𝑛 ≤ 𝑧} ⊆ 𝑠=1   ∑︁ 𝑛       (𝑖) 𝑃 {𝑅𝑚𝑖𝑛 ≤ 𝑧} ≤ 𝑃 𝑁 𝑠 ≥ 𝑘 | 𝑋𝑠 ∈ 𝐶 (𝑖) 𝑃 𝑋𝑠 ∈ 𝐶 (𝑖) ≤ 𝑛𝛽 (𝑖) 𝑃 𝑀 ≥ 𝑘 . 𝑠=1 Here 𝑁 𝑠 | {𝑋𝑠 ∈ 𝐶 (𝑖) } ∼ 𝐵𝑖𝑛(𝑛 − 1, 𝜇(𝐵(𝑋𝑠 , 𝑧)) and 𝜇(𝐵(𝑋𝑠 , 𝑧)) ≤ sup𝑥∈𝐶 (𝑖) 𝜇(𝐵(𝑥, 𝑧)) ≤     (𝑖) (𝑖) 𝑝 max 𝜂𝑚 𝑧 . Hence, 𝑃 𝑁 𝑠 ≥ 𝑘 < 𝑃 𝑀 ≥ 𝑘 , for 𝑀 ∼ 𝐵𝑖𝑛(𝑛 − 1, 𝑝 max 𝑚 𝜂𝑚 𝑧 𝑚 ) and 𝜂𝑚 the volume of the 𝑚-dimensional unit ball.   (i) For P Fz : Since 𝐶 (𝑖) is compact and connected, we can find a covering of 𝐶 (𝑖) ∖ 𝐶𝑜𝑙 (𝑖) (𝑧/4) with 𝑁 balls, 𝐵1 (𝑧/4), . . . , 𝐵 𝑁 (𝑧/4) of radius 𝑧/4. 
Let us denote, 𝑃 𝑋 𝑗 ,𝐵𝑠 = 𝑃(𝑋 𝑗 ∈ 𝐵 𝑠 (𝑧/4) | 𝑋 𝑗 ∈ 𝐶 (𝑖) )𝑃(𝑋 𝑗 ∈ 𝐶 (𝑖) )) the probability that the point 𝑋 𝑗 is in 𝐶 (𝑖) and in the ball 𝐵 𝑠 (𝑧/4). Then,     (𝑖) (𝑖) 𝑃 F𝑧 = 𝑃 {∃𝑠, 1 ≤ 𝑠 ≤ 𝑁, s.t. 𝐵 𝑠 (𝑧/4) has less than two points of 𝐶 } 𝑁 ∑︁   ≤ 𝑃 {𝐵 𝑠 (𝑧/4) has less than two points of 𝐶 (𝑖) } 𝑠=1 𝑁 ∑︁   𝑁 ∑︁   (𝑖) = 𝑃 {𝐵 𝑠 (𝑧/4) has no points of 𝐶 } + 𝑃 {𝐵 𝑠 (𝑧/4) has exactly one point of 𝐶 (𝑖) } 𝑠=1 𝑠=1 𝑁 Ö ∑︁ 𝑛   ∑︁𝑁 ∑︁ 𝑛 Ö  = 1 − 𝑃 𝑋 𝑗 ,𝐵𝑠 + 𝑃 𝑋 𝑗 ,𝐵𝑠 1 − 𝑃 𝑋𝑞 ,𝐵𝑠 𝑠=1 𝑗=1 𝑠=1 𝑗=1 𝑞≠ 𝑗 (4.4) Now we notice that, (𝑖) 𝑃 𝑋 𝑗 ,𝐵𝑠 = 𝜇(𝐵 𝑠 (𝑧/4)) ≤ sup 𝜇(𝐵𝑞 (𝑧/4)) ≤ 𝑝 max 𝜂𝑚 𝑧 𝑚 /4𝑚 and, (4.5) 𝐵𝑞 ⊂𝐶 (𝑖) (𝑖) 𝑃 𝑋 𝑗 ,𝐵𝑠 = 𝜇(𝐵 𝑠 (𝑧/4)) ≥ inf 𝜇(𝐵𝑞 (𝑧/4)) ≥ 𝑝 min 𝜂𝑚 𝑧 𝑚 /4𝑚 (4.6) 𝐵𝑞 ⊂𝐶 (𝑖) (𝑖) (𝑖) where 𝑝 max is the supremum of density attained by points of 𝐶 (𝑖) and 𝑝 min the infimum of the density attained by points of 𝐶 (𝑖) . From inequalities 4.5, 4.6, we can write 4.4 as,    𝑛   (𝑛−1) 𝑃 F𝑧(𝑖) ≤ 𝑁 1 − 𝑝 min (𝑖) 𝜂𝑚 𝑧 𝑚 /4𝑚 + 𝑛𝑁 𝑝 max (𝑖) 𝜂𝑚 𝑧 𝑚 /4𝑚 1 − 𝑝 min (𝑖) 𝜂𝑚 𝑧 𝑚 /4𝑚 !   (𝑛−1) (𝑖) (𝑖) (𝑖) = 𝑁 1 − 𝑝 min 𝜂𝑚 𝑧 𝑚 /4𝑚 1 − 𝑝 min 𝜂𝑚 𝑧 𝑚 /4𝑚 + 𝑛𝑝 max 𝜂𝑚 𝑧 𝑚 /4𝑚 !   (𝑛−1) (𝑖) (𝑖) (𝑖)  = 𝑁 1 − 𝑝 min 𝜂𝑚 𝑧 𝑚 /4𝑚 1 − 𝜂𝑚 𝑧 𝑚 /4𝑚 𝑝 min − 𝑛𝑝 max 52 (𝑖) 𝜈 𝑚𝑎𝑥 For the covering we will use a standard 4𝑧 −packing. Also, since 𝑧 4 ≤ 2 , balls of radius 𝑧/8 around the packing centers are disjoint subsets of 𝐶 (𝑖) . Consequently, the total volume of the 𝑁 𝑚 balls will be bounded by the volume of cluster 𝐶 (𝑖) and hence 𝑁𝜂𝑚 8𝑧 𝑚 ≤ 𝑉 𝑜𝑙 (𝐶 (𝑖) ). Finally we get the following bound for the number of covering balls, 𝑁, 𝑉 𝑜𝑙 (𝐶 (𝑖) ) 𝑁≤ 𝑚 (4.7) 𝜂𝑚 8𝑧 𝑚 □ Now we will explore the connectivity of points from different clusters in 𝐺 𝑆𝑁 𝑁 (𝑘). We say that 𝐶 (𝑖) is isolated on 𝐺 𝑆𝑁 𝑁 (𝑘), if there is no edge between sample points of 𝐶 (𝑖) and any other cluster. Lemma 2 (Between clusters connectivity in 𝐺 𝑆𝑁 𝑁 (𝑘)) Let I𝑛(𝑖) be the event that 𝐶 (𝑖) is isolated from all other clusters in 𝐺 𝑆𝑁 𝑁 (𝑘), then for 𝑘 ≤ 𝜌(𝑢 (𝑖) )𝑛/2 − 2 log(𝛽 (𝑖) 𝑛)   𝐾 𝜌(𝑢 (𝑖) ) 𝑘−1  (𝑖)  𝑐  ∑︁ − 𝑛−1 2 2 − 𝑛−1 𝑃 I𝑛 ≤ 𝑒 . (4.8) 𝑖=1 Proof. For the construction of the 𝐺 𝑆𝑁 𝑁 (𝑘), we practically start with constructing a symmetric 𝑘NN graph, 𝐺 𝑘 𝑁 𝑁 , and then we remove the edges of points that do not have common neighbors in their 𝑘-nearest neighbors lists. This implies that points that aren’t connected on the 𝐺 𝑘 𝑁 𝑁 will not be connected on the 𝐺 𝑆𝑁 𝑁 (𝑘). Hence, (I𝑛(𝑖) =⇒ 𝐶 (𝑖) is not isolated in 𝐺 𝑘 𝑁 𝑁 𝑐   𝑐   𝑃 I𝑛(𝑖) ≤ 𝑃 𝐶 (𝑖) not isolated in 𝐺 𝑘 𝑁 𝑁   ∑︁    (𝑖) (𝑖) ( 𝑗) (𝑖 𝑗) ≤ 𝑃 𝑅max ≥ 𝑢 + 𝑃 𝑅max ≥ 𝑢 𝑗≠𝑖   ∑︁    (𝑖) ( 𝑗) ≤𝑃 𝑅max ≥ 𝑢 (𝑖) + 𝑃 𝑅max ≥ 𝑢 ( 𝑗) 𝑗≠𝑖   𝐾 𝜌(𝑢 (𝑖) ) 𝑘−1 ∑︁ − 𝑛−1 2 2 − 𝑛−1 ≤ 𝑒 𝑖=1 where we get the final bound for 𝑘 < 𝜌(𝑢 (𝑖) )𝑛/2 − 2 log(𝛽 (𝑖) 𝑛) and using Preposition 6 and Lemma 7 from Maier et al. (2009). Here, 𝑢 (𝑖) ≤ 𝑢𝑖 𝑗 and 𝜌(𝑢 (𝑖) ) is the probability of balls of radius 𝑢 (𝑖) in 𝐶 (𝑖) . □ 53 Lemma 3 (Range of 𝑘 for within-cluster connectedness) (𝑖) (𝑖) (𝑖) 𝑚 𝑉 𝑜𝑙 (𝐶 (𝑖) )8𝑚 and 𝑘 ≤ (𝑛 − 1) 𝑝 𝑚𝑎𝑥 𝜂𝑚 min{(𝑢 (𝑖) ) 𝑚 , (𝜈𝑚𝑎𝑥  If 𝑘 ≥ 4𝑚+1 log 2𝑛𝑝 𝑚𝑎𝑥 ) } then, (𝑖) (𝑘−1) 𝑝    1 𝑛2  − 4𝑚+1 𝑝 (𝑖) 𝑚𝑖𝑛 𝑃 (A𝑛(𝑖) ) 𝑐 ≤ 2+ 𝑚 𝑒 𝑚𝑎𝑥 (4.9) 4 𝑛−1   Proof. Our overall goal is to find appropriate values for 𝑘 such that 𝑃 (A𝑛(𝑖) ) 𝑐 has an upper bound that goes to zero as n goes to infinity. We will find upper bounds for the two terms of the inequality 4.2. We will use the following inequalities in the proof: 𝑘 (Hoeffding’s inequality). Let 𝑀 ∼ 𝐵𝑖𝑛(𝑛, 𝑝) and define 𝛼 = 𝑛−1 . 
Then,   𝛼 ≥ 𝑝, 𝑃 𝑀 ≥ 𝑘 ≤ 𝑒 −𝑛𝐾 (𝛼|| 𝑝) , (4.10) 1−𝑎 𝑎   where 𝐾 (𝛼|| 𝑝) = 𝛼 log 𝑝 + (1 − 𝛼) log 1−𝑝 is the Kullback-Leibler divergence of (𝛼, 1 − 𝛼) and ( 𝑝, 1 − 𝑝). (1st logarithmic inequality.) 𝑥−1 log(𝑥) ≥ , for 𝑥 > 0 (4.11) 𝑥 (2nd logarithmic inequality.) log(1 − 𝑥) ≤ 𝑥, for 𝑥 ≤ 1. (4.12) (𝑖) 𝑘 For the first term of 4.2, we use inequality 4.10 for 𝑝 = 𝑝 𝑚𝑎𝑥 𝜂𝑚 𝑧 𝑚 and 𝛼 = 𝑛−1 . Now, assuming that 𝑝 < 𝛼 and that 𝑘 ≤ 𝑛 − 1 we get,       𝑎 1−𝑎 𝑎   −(𝑛−1) 𝛼 log 𝑝 +(1−𝛼) log 1− 𝑝 −(𝑛−1) 𝛼 log 𝑝 +𝑝−𝛼 𝑛𝛽 (𝑖) 𝑃 𝑀 ≥ 𝑘 ≤ 𝑒 ≤𝑒 , by 4.11. Let 𝜃 = 𝜂𝑚 𝑧 𝑚 /𝛼 then we have,    1 (𝑖)   −𝑘 log (𝑖) +𝜃 𝑝 𝑚𝑎𝑥 −1 𝑛𝛽 (𝑖) 𝑃 𝑀 ≥ 𝑘 ≤ 𝑛𝛽 (𝑖) 𝑒 𝜃 𝑝𝑚𝑎𝑥    1 (𝑖) log(𝑛𝛽 (𝑖) )−𝑘 log +𝜃 𝑝 𝑚𝑎𝑥 −1 =𝑒 (𝑖) 𝜃 𝑝𝑚𝑎𝑥 (4.13)  1  (𝑖)  − 𝑘2 log (𝑖) +𝜃 𝑝 𝑚𝑎𝑥 −1 ≤𝑒 𝜃 𝑝𝑚𝑎𝑥 , 54 𝑘 1  (𝑖)  for 𝑘 such that log(𝑛𝛽 (𝑖) ) ≤ 2 log (𝑖) + 𝜃 𝑝 𝑚𝑎𝑥 − 1 . That way we attain a lower bound for 𝑘: 𝜃 𝑝 𝑚𝑎𝑥 2 log(𝑛𝛽 (𝑖) ) 𝑘≥ (𝑖) (4.14) 1  log (𝑖) + 𝜃 𝑝 𝑚𝑎𝑥 −1 𝜃 𝑝 𝑚𝑎𝑥 For the second term of 4.2 we have,   (𝑛−1)   (𝑖) (𝑖) (𝑖)  𝑁 1 − 𝑝 min 𝜂𝑚 𝑧 𝑚 /4𝑚 1 − 𝜂𝑚 𝑧 𝑚 /4𝑚 𝑝 min − 𝑛𝑝 max = (𝑖)  (𝑛−1) log(1−𝑝 (𝑖) 𝜂 𝑚 𝑧 𝑚 /4𝑚 )+log(𝑁)   (𝑖) 1 − 𝜂𝑚 𝑧 𝑚 /4𝑚 𝑝 min − 𝑛𝑝 max 𝑒 min ≤ 𝑉𝑜𝑙 (𝐶 (𝑖) ) (𝑖)    (𝑛−1) log(1−𝑝 min 𝜂 𝑚 𝑧 𝑚 /4𝑚 )+log 𝑚 𝑚 𝑚 (𝑖) (𝑖)  𝜂𝑚 𝑧 𝑚 1 − 𝜂𝑚 𝑧 /4 𝑝 min − 𝑛𝑝 max 𝑒 8 , by 4.7. 𝜃𝑘 Now we use the substitution 𝜂𝑚 𝑧 𝑚 = 𝜃𝛼 = 𝑛−1   (𝑛−1)   (𝑖) (𝑖) (𝑖)  𝑁 1 − 𝑝 min 𝜂𝑚 𝑧 𝑚 /4𝑚 1 − 𝜂𝑚 𝑧 𝑚 /4𝑚 𝑝 min − 𝑛𝑝 max ≤  𝜃 𝑘 𝑝 (𝑖) (𝑖) 𝑚 (𝑛−1)  (𝑖)  − 4𝑚min +log 𝑉𝑜𝑙 (𝐶 𝜃)8  𝜃𝑘 (𝑖) 1+ 𝑚 𝑛𝑝 max − 𝑝 min 𝑒 𝑘 ≤ 4 (𝑛 − 1)  𝜃 𝑘 𝑝 (𝑖) (𝑖) 𝑚  (4.15) (𝑖)  − 4𝑚min +log 𝑉𝑜𝑙 (𝐶 𝜃 )8 𝑛  𝜃𝑛 (𝑖) 1+ 𝑚 𝑛𝑝 max − 𝑝 min 𝑒 ≤ 4 (𝑛 − 1) (𝑖)   𝜃𝑝 𝜃𝑛 (𝑖) (𝑖)  − 𝑘2 4𝑚 min 1+ 𝑚 𝑛𝑝 max − 𝑝 min 𝑒 , 4 (𝑛 − 1) (𝑖) (𝑖) 𝜃𝑘 𝑝 min 𝑉 𝑜𝑙 (𝐶 (𝑖) )8𝑚 𝑛  𝜃 𝑝 min where the last step of the inequality 4.15 holds if − 4𝑚 + log 𝜃 ≤ − 2𝑘 4𝑚 or equiva- lently if, 4𝑚 2 𝑉 𝑜𝑙 (𝐶 (𝑖) )8𝑚 𝑛  𝑘≥ (𝑖) log (4.16) 𝜃 𝑝 𝑚𝑖𝑛 𝜃 To bound inequality 4.13 with the upper bound of 4.15 we need,   (𝑖)   (𝑖) 𝑘 1  𝜃𝑛 (𝑖) (𝑖)  𝑘 𝜃𝑝 min − 2 log (𝑖) +𝜃 𝑝 𝑚𝑎𝑥 −1 𝑒 − 2 4𝑚 ≥ 𝑒  1+ 𝑚 𝑛𝑝 max − 𝑝 min 𝜃 𝑝𝑚𝑎𝑥 4 (𝑛 − 1) (𝑖)   𝑘 𝜃 𝑝 min 𝑘 1  (𝑖) for which it suffices to have, 2 4𝑚 ≤ 2 log (𝑖) + 𝜃 𝑝 𝑚𝑎𝑥 − 1 or equivalently − log(𝛾) ≥ 𝜃 𝑝 𝑚𝑎𝑥 (𝑖) 𝛾 𝑝 𝑚𝑖𝑛 (𝑖) 3𝛾 1+𝛾− 4𝑚 𝑝 (𝑖) for 𝛾 = 𝜃 𝑝 𝑚𝑎𝑥 . The above is satisfied for values of 𝛾 that − log(𝛾) ≥ 1 + 𝛾 − 4 , 𝑚𝑎𝑥 (𝑖) 𝑝 𝑚𝑖𝑛 since (𝑖) ≤ 14 . Such values of 𝛾 could be 1 1 10 , 2 and others. We will use 𝛾 = 12 , which will result 4𝑚 𝑝 𝑚𝑎𝑥 to 𝜃 = 1(𝑖) and the probability inequality in Lemma 1 can be rewritten as: 2𝑝 𝑚𝑎𝑥 (𝑖) (𝑖) 𝑘𝑝 (𝑘−1) 𝑝    𝑛 1 (𝑖)  − 4𝑚+1 𝑝 (𝑖)  𝑚𝑖𝑛  1 𝑛2  − 4𝑚+1 𝑝 (𝑖) 𝑚𝑖𝑛 𝑃 (A𝑛(𝑖) ) 𝑐 ≤ 2 1+ (𝑖) 𝑛𝑝 (𝑖) max − 𝑝 min 𝑒 𝑚𝑎𝑥 ≤ 2 + 𝑒 𝑚𝑎𝑥 𝑛 − 1 4𝑚 2𝑝 𝑚𝑎𝑥 4𝑚 𝑛 − 1 55 Now we return to the range of 𝑘. We substitute the chosen value for 𝜃 in 4.14 and 4.16 and we observe that the largest of the two lower bounds is the one in 4.16. This is because (𝑖) 𝛽 (𝑖) = 𝜇(𝐶 (𝑖) ) ≤ 𝑝 𝑚𝑎𝑥 𝑉 𝑜𝑙 (𝐶 (𝑖) ). (𝑖) 𝑉 𝑜𝑙 (𝐶 (𝑖) )8𝑚 .  Hence, we conclude that an appropriate lower bound for 𝑘 is 𝑘 ≥ 4𝑚+1 log 2𝑛𝑝 𝑚𝑎𝑥 (𝑖) From the assumption 𝜂𝑚 𝑧 𝑚 = 𝜃𝛼 and that 𝑧 ≤ 2 min{𝑢 (𝑖) , 𝜈𝑚𝑎𝑥 } we get an upper bound for 𝑘. (𝑖) (𝑖) (𝑖) 𝑚 𝑘 < (𝑛 − 1) 𝑝 𝑚𝑎𝑥 𝜂𝑚 𝑧 𝑚 ⇔ 𝑘 < (𝑛 − 1) 𝑝 𝑚𝑎𝑥 𝜂𝑚 min{(𝑢 (𝑖) ) 𝑚 , (𝜈𝑚𝑎𝑥 ) }. □ The theoretical results we derived so far provide us with probabilities of having connected components in the graph 𝐺 𝑆𝑁 𝑁 (𝑘), isolated from other components that each correspond to a specific cluster. Now, we want to explore the probability that the step of spectral clustering on 𝐺 𝑆𝑁 𝑁 (𝑘) of algorithm 4.1 will yield exact identification of clusters. 
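For readability, the three elementary bounds invoked in the proofs of Lemmas 1–3 are restated below in clean LaTeX, since the displayed versions above were damaged in extraction. The direction of the second logarithmic inequality is inferred from how it is used (to turn (1 − x)^n factors into exponential bounds), so this restatement should be read as a best-effort reconstruction.

```latex
% Chernoff--Hoeffding bound in relative-entropy form: for M ~ Bin(N, p) and
% alpha = k/N >= p (in the proofs above it is applied with N = n - 1),
\[
  P(M \ge k) \;\le\; e^{-N\,K(\alpha\,\|\,p)},
  \qquad
  K(\alpha\,\|\,p) = \alpha\log\frac{\alpha}{p}
    + (1-\alpha)\log\frac{1-\alpha}{1-p}.
\]
% First logarithmic inequality:
\[
  \log x \;\ge\; \frac{x-1}{x}, \qquad x > 0.
\]
% Second logarithmic inequality (the form needed for exponential bounds):
\[
  \log(1-x) \;\le\; -x, \qquad x < 1.
\]
```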
Theorem 1 (Optimal k for exact identification of clusters) 4 log(𝑛)+(𝑛−1) 𝜌 𝑚𝑖𝑛  𝑚+1 log 2𝑛𝑝 (𝑖) 𝑉 𝑜𝑙 (𝐶 (𝑖) )8𝑚  For 𝑘 = 𝑐 1 + 𝑝𝑚𝑖𝑛 2+ 4𝑚 𝑝𝑚𝑎𝑥 , where 𝑐 such that 𝑘 will satisfy 𝑘 ≥ 4 𝑚𝑎𝑥 (𝑖) (𝑖) 𝑚 and 𝑘 ≤ min 𝜌(𝑢 (𝑖) )𝑛/2 − 2 log(𝛽 (𝑖) 𝑛), (𝑛 − 1) 𝑝 𝑚𝑎𝑥 𝜂𝑚 min{(𝑢 (𝑖) ) 𝑚 , (𝜈𝑚𝑎𝑥  ) } , for 𝑖 ∈ {1, . . . , 𝐾 }, the algorithm 4.1 achieves exact cluster identification with probability      1 𝑛2  (𝑘−1) 𝑝 − 𝑚+1 𝑚𝑖𝑛 − 𝑛−1 𝜌𝑚𝑖𝑛 𝑘−1 2 − 𝑛−1 − 𝐾 2𝑒 2 𝑃 Q𝑛 ≥ 1 − 𝐾 2 + 𝑒 4 𝑝𝑚𝑎𝑥 4𝑚 𝑛 − 1 𝐾 A𝑛(𝑖) , the event that for each cluster Ñ Proof. We will start our proof with some notation. Let A𝑛 = 𝑖=1 𝐾 I𝑛(𝑖) be the event that every Ñ its points are connected on the graph. Correspondingly, let I𝑛 = 𝑖=1 cluster is isolated from the other clusters on the graph 𝐺 𝑆𝑁 𝑁 (𝑘). Then, we denote with Q𝑛 the intersection of A𝑛 and I𝑛 . If Q𝑛 is true, then with an appropriate permutation of rows, the adjacency matrix 𝑊 of 𝐺 𝑆𝑁 𝑁 (𝑘) will be block-diagonal and each block will correspond to a cluster. The first step of our proof will be, to explain how each of the Laplacian option will utilize the block structure of 𝑊 to achieve exact clustering. The second step of our proof is to find an optimal choice for 𝑘 and an upper bound for the probability of event (Q𝑛 ) 𝑐 . For L = D − W and L = I − D−1 W : According to Propositions 2 and 4 of von Luxburg (2007), the unnormalized Laplacian and the random walk Laplacian have eigenvalue zero with multiplicity 56 equal to the number of connected components on the graph. Furthermore, the eigenspace of eigenvalue zero is spanned by indicator vectors 1𝐶 (𝑖) , for 𝑖 ∈ {1, . . . , 𝐾 }. If 𝐺 𝑆𝑁 𝑁 (𝑘) has a connected component for all the points that belong to a unique cluster, and the components are isolated from components that correspond to different clusters, the matrix 𝑊 will have a block structure and every block will include all the points of one cluster. The matrix 𝑉 with columns the eigenvectors of L will hence be of the form 𝑉 = 𝐵𝑀, where 𝑀 is an orthogonal matrix and B is a block diagonal matrix with columns the indicator eigenvectors 1𝐶 (𝑖) , for 𝑖 ∈ {1, . . . , 𝐾 }. Now, notice that 𝑉 will also have a block diagonal structure and the rows of every block will be equal. On the other hand, due to the block structure of 𝑉, columns of 𝑉 of different blocks will be perpendicular to each other. Hence, kmeans on V will choose one row from each block as the centroid of a cluster, and as a result points of the same block will be clustered together. 1 1 1 1 1 1 For L = D− 2 WD− 2 : We observe that 𝐿 = 𝐷 − 2 𝑊 𝐷 − 2 and 𝐿 𝑠𝑦𝑚 = 𝐼 − 𝐷 − 2 𝑊 𝐷 − 2 have the same eigenvectors but different eigenvalues. Specifically, if 𝑣 is an eigenvector for the eigenvalue 𝜆 of 𝐿 𝑠𝑦𝑚 then 𝑣 is an eigenvector for the eigenvalue −𝜆 of 𝐿. According to Proposition 4 of von Luxburg (2007), 𝐿 𝑠𝑦𝑚 has eigenvalue zero with multiplicity equal to the number of connected components in the graph and the eigenspace of zero is spanned by the vectors {𝐷 − 2 1𝐶 (1) , . . . , 𝐷 − 2 1𝐶 (𝐾 ) }. 1 1 1 In this case, the matrix 𝑉 of eigenvectors of 𝐿 will be of the form 𝑉 = 𝐷 − 2 𝐵𝑀, where 𝑀 is an orthogonal matrix, and B is a block diagonal matrix with columns the indicator eigenvectors 1𝐶 (𝑖) , for 𝑖 ∈ {1, . . . , 𝐾 }. This time the rows of V that correspond to points of the same component can be seen as vectors that aren’t identical, but, interestingly, have the same direction and different lengths. For this reason in algorithm 4.1 we do not apply kmeans on 𝑉, but instead on 𝑉˜ which is equal to V after row-normalization. 
We observe that the rows of 𝑉˜ that correspond to points of the same component will be equal. The columns of 𝑉˜ of different blocks will be perpendicular to each other. Hence, kmeans on 𝑉˜ will choose one row from each block as the centroid of a cluster, and as a result points of the same block will be clustered together. 57   Moving on to the second step of the proof, we will calculate an upper bound for 𝑃 (Q𝑛 ) 𝑐 . Let (𝑖) (𝑖) 𝜌𝑚𝑖𝑛 = min 𝜌(𝑢 (𝑖) ), 𝑝 𝑚𝑖𝑛 = min 𝑝 𝑚𝑖𝑛 and 𝑝 𝑚𝑎𝑥 = max 𝑝 𝑚𝑎𝑥 . We notice that by Lemma 3, for 𝑖=1,...,𝐾 𝑖=1,...,𝐾 𝑖=1,...,𝐾 (𝑖) (𝑖) (𝑖) 𝑚 𝑉 𝑜𝑙 (𝐶 (𝑖) )8𝑚 and 𝑘 ≤ (𝑛−1) 𝑝 𝑚𝑎𝑥 𝜂𝑚 min{(𝑢 (𝑖) ) 𝑚 , (𝜈𝑚𝑎𝑥  𝑘 ≥ 4𝑚+1 log 2𝑛𝑝 𝑚𝑎𝑥 ) } for 𝑖 = 1, . . . , 𝐾, we have that, (𝑖) 𝐾 𝐾  (𝑘−1) 𝑝   ∑︁  (𝑖) 𝑐  ∑︁ 1 𝑛2  − 4𝑚+1 𝑝 (𝑖) 𝑚𝑖𝑛 𝑃 (A𝑛 ) 𝑐 ≤ 𝑃 (A𝑛 ) ≤ 2+ 𝑚 𝑒 𝑚𝑎𝑥 𝑖=1 𝑖=1 4 𝑛−1 𝐾  ∑︁ 1 𝑛2  − 4(𝑘−1) 𝑝𝑚𝑖𝑛 (4.17) ≤ 2+ 𝑒 𝑚+1 𝑝𝑚𝑎𝑥 𝑖=1 4𝑚 𝑛 − 1  1 𝑛2  − 4(𝑘−1) 𝑝𝑚𝑖𝑛 = 𝐾 2+ 𝑚 𝑒 𝑚+1 𝑝𝑚𝑎𝑥 4 𝑛−1 If we additionally have that 𝑘 ≤ 𝜌(𝑢 (𝑖) )𝑛/2 − 2 log(𝛽 (𝑖) 𝑛) for every 𝑖 in {1, . . . , 𝐾 }, then by Lemma 2   𝐾 𝐾 𝜌(𝑢 (𝑖) ) 𝑘−1   ∑︁  𝑐 ∑︁ − 𝑛−1 − 𝑛−1 𝑃 I𝑛(𝑖) 2 2 𝑃 (I𝑛 ) 𝑐 ≤ ≤𝐾 𝑒 𝑖=1 𝑖=1   𝐾 𝜌𝑚𝑖𝑛 𝑘−1 ∑︁ − 𝑛−1 2 2 − 𝑛−1 (4.18) ≤𝐾 𝑒 𝑖=1   𝑛−1 𝜌𝑚𝑖𝑛 𝑘−1 2 − 2 2 − 𝑛−1 =𝐾 𝑒 . Hence,          1 𝑛2  (𝑘−1) 𝑝 − 𝑚+1 𝑚𝑖𝑛 − 𝑛−1 𝜌𝑚𝑖𝑛 𝑘−1 2 − 𝑛−1 + 𝐾 2𝑒 2 𝑃 (Q𝑛 ) 𝑐 ≤ 𝑃 (A𝑛 ) 𝑐 + 𝑃 (I𝑛 ) 𝑐 ≤ 𝐾 2 + 𝑒 4 𝑝𝑚𝑎𝑥 4𝑚 𝑛 − 1 Now we want to choose k such that the bounds of 4.17, 4.18 will be of the same order. Equivalently we want,   (𝑘−1) 𝑝 𝜌𝑚𝑖𝑛 𝑘−1 − 𝑚+1 𝑚𝑖𝑛 − 𝑛−1 2 2 − 𝑛−1 𝑛𝑒 4 𝑝𝑚𝑎𝑥 =𝑒 , 4 log(𝑛)+(𝑛−1) 𝜌 𝑚𝑖𝑛 4 log(𝑛)+(𝑛−1) 𝜌 𝑚𝑖𝑛  which holds for 𝑘 = 1 + 𝑝 2+ 4𝑚 𝑝𝑚𝑖𝑛 . In conclusion, choosing 𝑘 = 𝑐 1 + 𝑝 2+ 4𝑚 𝑝𝑚𝑖𝑛 , for 𝑚𝑎𝑥 𝑚𝑎𝑥   a constant c such that 𝑘 will satisfy the conditions for inequalities 4.17, 4.18, will let 𝑃 (Q𝑛 ) 𝑐 go to zero exponentially with n. □ 58 4.3.2 Noisy case 4.3.2.1 First approach to noisy case: Remove low-density points We now explore the connectedness and isolation properties of clusters predicted by 4.1 when there are noise points in our sample - points that do not belong to any cluster and have low density. In this case, we apply spectral clustering on 𝐺 ′𝑆𝑁 𝑁 (𝑘), the graph that doesn’t include any points with ˆ 𝑝(𝑥) < 𝑡 − 2𝜖 𝑛 and their edges. The clusters of the 𝐿 (𝑡 − 2𝜖 𝑛 ) are denoted as 𝐶 (𝑖) (2𝜖 𝑛 ). The value 𝜖 𝑛 is the error in density estimation. We choose to work with clusters of the 𝐿 (𝑡 − 2𝜖 𝑛 ) to ensure that the points of the 𝐿 (𝑡) set will not be removed from the 𝐺 ′𝑆𝑁 𝑁 (𝑘). Additionally, 𝜖 𝑛 has the property (𝑖) (𝑖) that 𝑑𝑖𝑠𝑡 (𝐶 (𝑖) (2𝜖 𝑛 ), 𝐶 ( 𝑗) (2𝜖 𝑛 )) ≥ 𝑢 (𝑖) for every 𝑖, 𝑗 ∈ {1, . . . , 𝐾 }. We denote with 𝑅˜ 𝑚𝑖𝑛 , 𝑅˜ 𝑚𝑎𝑥 the minimal and maximal 𝑘NN radius of 𝐶 (𝑖) (2𝜖 𝑛 ) and with 𝛽˜ (𝑖) the mass of 𝐶 (𝑖) (2𝜖 𝑛 ), 𝜇(𝐶 (𝑖) (2𝜖 𝑛 )). ˜ (𝑖) ) is the probability of balls of radius 𝑢 (𝑖) in 𝐶 (𝑖) (2𝜖 𝑛 ). The aim of this section is to Finally, 𝜌(𝑢 illustrate how to extend the connectedness and isolation results for the clusters 𝐶 (𝑖) 𝑖 = 1, . . . , 𝐾 in the noisy case. In more detail, we prove that 𝐺 ′𝑆𝑁 𝑁 (𝑘) will have so many connected components as the number of clusters 𝐾 and each component on the graph 𝐺 ′𝑆𝑁 𝑁 (𝑘) will correspond to a unique topological component of the 𝐿 (𝑡 − 2𝜖 𝑛 ) set. Furthermore, the component of 𝐺 ′𝑆𝑁 𝑁 (𝑘) corresponding to 𝐶 (𝑖) (2𝜖 𝑛 ) will include all the points of 𝐶 (𝑖) and some additional points, but it will be isolated from all other components. Hence the adjacency matrix of 𝐺 ′𝑆𝑁 𝑁 (𝑘) will again be block-diagonal and spectral clustering will predict clusters that include the points of 𝐶 (𝑖) plus some boundary points. 
We further prove that as the sample size n increases, the ratio of boundary to cluster points will go to zero and allow spectral clustering to yield a grouping only of true cluster points achieving that way exact clustering. Let D𝑛(𝑖) be the event that | 𝑝ˆ𝑛 (𝑋𝑖 ) − 𝑝 𝑛 (𝑋𝑖 )| ≤ 𝜖 𝑛 for every 𝑋𝑖 , 𝑖 = 1, . . . , 𝑛. Lemma 4 (Range of 𝑘 for within-cluster connectedness in noisy case 1) Let A𝑛(𝑖) denote the event that the points of cluster 𝐶 (𝑖) are connected in 𝐺 ′𝑆𝑁 𝑁 (𝑘). If 𝑘 ≥ (𝑖) (𝑖) (𝑖) 𝑚 𝑉 𝑜𝑙 (𝐶 (𝑖) )8𝑚 and 𝑘 ≤ (𝑛 − 1) 𝑝 𝑚𝑎𝑥 𝜂𝑚 min{(𝑢 (𝑖) ) 𝑚 , (𝜈𝑚𝑎𝑥  4𝑚+1 log 2𝑛𝑝 𝑚𝑎𝑥 ) } then, (𝑘−1)𝑡    1 𝑛2  − 4𝑚+1 𝑝 (𝑖)   𝑃 (A𝑛(𝑖) ) 𝑐 ≤ 2+ 𝑚 𝑒 𝑚𝑎𝑥 + 𝑃 D𝑛𝑐 4 𝑛−1 59 Proof. We assume that D𝑛(𝑖) holds, and we use the steps of the proof of Lemma 1. There are no   difference in the bounds for 𝑃 {𝑅𝑚𝑖𝑛 ≤ 𝑧} ≤ 𝑛𝛽 (𝑖) 𝑃(𝑀 ≥ 𝑘). We recall that F𝑧(𝑖) is the event (𝑖) that, given a covering of 𝐶 (𝑖) (2𝜖 𝑛 ) ∖ 𝐶𝑜𝑙 (𝑖) (𝑧/4), there exists a ball that doesn’t contain at least two points of 𝐶 (𝑖) (2𝜖 𝑛 ). In the noisy case, this event can happen either if some ball in the covering contains less than two points or if some points of 𝐶 (𝑖) were discarded. If the event D𝑛(𝑖) holds for a point x then 𝑝(𝑥) ˆ > 𝑡 − 𝜖 𝑛 , then x will not be removed at the denoising step. Additionally, 𝑝 𝑚𝑖𝑛 = 𝑡 in 𝐶 (𝑖) . Consequently, !     (𝑛−1)   (𝑖) P Fz(i) ≤ 𝑁 1 − 𝑡𝜂𝑚 𝑧 𝑚 /4𝑚  1 − 𝜂𝑚 𝑧 𝑚 /4𝑚 𝑡 − 𝑛𝑝 max + 𝑃 D𝑛𝑐 (4.19)   We find the final bound of 𝑃 (A𝑛(𝑖) ) 𝑐 by following the proof of Lemma 3 and using the inequality 4.19. □ Lemma 5 (Cluster size probability) Let B𝑛(𝑖) denote the event that there are more than 𝛿𝑛 sample points from cluster 𝐶 (𝑖) . If 𝛽 (𝑖) > 𝛿 then, 𝛽 (𝑖) − 𝛿 2  − 21 𝑛𝛽 (𝑖) B𝑛(𝑖) 𝑐 𝑃 ≤𝑒 𝛽 (𝑖) Proof Same as in Maier et al. (2009) Lemma 4. □ Lemma 6 (Density estimation error) Let E𝑛(𝑖) denote the event that there are less than 𝛿𝑛 points in all the boundary points sets 𝐾 𝐶 ( 𝑗) (2𝜖 𝑛 ) \ 𝐶 ( 𝑗) together. If 𝜇(𝐶 ( 𝑗) (2𝜖 𝑛 ) \ 𝐶 ( 𝑗) ) < 𝛿/2, we have 𝑃 E𝑛(𝑖) ≤ 𝑒 −𝛿𝑛/8 Í 𝑐 𝑗=1 Proof Same as in Maier et al. (2009) Lemma 5. □ Proposition 1 (Cluster connectedness in 𝐺 ′𝑆𝑁 𝑁 (𝑘)) Let C𝑛(𝑖) be the event that in the denoised graph 𝐺 ′𝑆𝑁 𝑁 (𝑘) it holds that: - all sample points of 𝐶𝑛(𝑖) are contained in the graph -the sample points of 𝐶𝑛(𝑖) are connected in the graph -there is no component of the graph that consists only of points outside the 𝐿(𝑡) set. 60 Then under the conditions that 1. 𝛽 (𝑖) > 2𝛿 𝐾 𝐶 ( 𝑗) (2𝜖 𝑛 ) \ 𝐶 ( 𝑗) ) ≤ 𝛿/2 Ð 2. 𝜖 𝑛 sufficiently small such that 𝜇( 𝑖=1 (𝑖) 𝑉 𝑜𝑙 (𝐶 (𝑖) )8𝑚 and  3. 𝑘 ≥ 4𝑚+1 log 2𝑛𝑝 𝑚𝑎𝑥 (𝑖) (𝑖) 𝑚 4. 𝑘 ≤ (𝑛 − 1) 𝑝 𝑚𝑎𝑥 𝜂𝑚 min{(𝑢 (𝑖) ) 𝑚 , (𝜈𝑚𝑎𝑥 ) } and for sufficiently large n we obtain (𝑘−1)𝑡  𝑐  1 𝑛2  − 4𝑚+1 𝑝 (𝑖) − 𝛿𝑛   𝑃 C𝑛(𝑖) ≤ 2+ 𝑚 𝑒 𝑚𝑎𝑥 + 2𝑒 8 + 2𝑃 D 𝑐 𝑛 (4.20) 4 𝑛−1 Proof. We observe that:  𝑐  𝑐  𝑐  𝑐  𝑐 𝑃 C𝑛(𝑖) ≤ 𝑃 A𝑛(𝑖) + 𝑃 B𝑛(𝑖) + 𝑃 E𝑛(𝑖) + 𝑃 D𝑛(𝑖) and use 4,5 and 6. □ Lemma 7 (Between clusters connectivity in 𝐺 ′𝑆𝑁 𝑁 (𝑘)) Let I𝑛(𝑖) be the event that 𝐶 (𝑖) (2𝜖 𝑛 ) is isolated from all other clusters in 𝐺 ′𝑆𝑁 𝑁 (𝑘), then for 𝑘 ≤ 𝜌(𝑢 (𝑖) )𝑛/2 − 2 log( 𝛽˜ (𝑖) 𝑛)   𝐾 𝜌(𝑢 (𝑖) ) 𝑘−1  (𝑖)  𝑐  ∑︁ − 𝑛−1 2 2 − 𝑛−1   𝑃 I𝑛 ≤ 𝑒 +𝑃 D𝑛𝑐 . (4.21) 𝑖=1 Proof. We follow the proofs of Proposition 6 and Lemma 7 from Maier et al. (2009). □ Proposition 1 and Lemma 7 will now be used to find a range for k for the rough identification of clusters 𝐶 (𝑖) (2𝜖 𝑛 ) with spectral clustering on 𝐺 ′𝑆𝑁 𝑁 (𝑘). With the term rough identification, we mean that all points of each true cluster 𝐶 (𝑖) belong to the same predicted cluster, which may also have some additional points that do not belong to any cluster. 
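To make the first noisy-case approach concrete, here is a minimal sketch of the denoising step of Algorithm 4.1: estimate the density with a kernel density estimator, discard points whose estimated density falls below t − 2ε_n, and then discard connected components of the remaining SNN graph with fewer than δn points. The Gaussian kernel and all numerical values in the toy example are illustrative assumptions, not the settings analyzed in the theorems.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def denoise_snn_graph(X, W, t, eps_n, delta, h):
    """Return the indices of points kept in the denoised graph G'_SNN(k).

    Points with kernel density estimate below t - 2*eps_n are removed first;
    afterwards any connected component with fewer than delta*n points is also
    removed, mirroring the denoising step of Algorithm 4.1.
    """
    n, m = X.shape
    # Gaussian kernel density estimate with bandwidth h (kernel is an assumption).
    D = cdist(X, X)
    p_hat = np.exp(-0.5 * (D / h) ** 2).sum(axis=1) / (n * (h * np.sqrt(2 * np.pi)) ** m)

    keep = np.where(p_hat >= t - 2 * eps_n)[0]      # density thresholding
    W_kept = W[np.ix_(keep, keep)]

    # Drop connected components containing fewer than delta*n points.
    n_comp, comp = connected_components(csr_matrix(W_kept), directed=False)
    sizes = np.bincount(comp, minlength=n_comp)
    keep = keep[sizes[comp] >= delta * n]
    return keep

# Toy usage: two Gaussian clusters plus uniform background noise; W is a
# stand-in adjacency matrix built from a distance threshold, not a true SNN graph.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(3, 0.3, (40, 2)),
               rng.uniform(-2, 5, (10, 2))])
W = (cdist(X, X) < 0.8).astype(float)
kept = denoise_snn_graph(X, W, t=0.05, eps_n=0.005, delta=0.1, h=0.4)
print(len(kept), "of", len(X), "points kept")
```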
Two important conditions for the results of exact cluster identification are the following: Condition 1: (𝑖) 𝑉 𝑜𝑙 (𝐶 (𝑖) )8𝑚 and  a) 𝑘 ≥ 4𝑚+1 log 2𝑛𝑝 𝑚𝑎𝑥 (𝑖) (𝑖) 𝑚 b) 𝑘 ≤ min 𝜌(𝑢 (𝑖) )𝑛/2 − 2 log(𝛽 (𝑖) 𝑛), (𝑛 − 1) 𝑝 𝑚𝑎𝑥 𝜂𝑚 min{(𝑢 (𝑖) ) 𝑚 , (𝜈𝑚𝑎𝑥  ) } , Condition 2: a) 𝛽 (𝑖) > 2𝛿 b) 𝑝 is three times continuously differentiable with uniformly bounded derivatives 61 𝐾 𝐶 ( 𝑗) (2𝜖 𝑛 ) \ 𝐶 ( 𝑗) ) ≤ 𝛿/2. Ð c) 𝜖 𝑛 sufficiently small such that 𝜇( 𝑖=1 Theorem 2 (Rough cluster identification in noisy case 1) If condition 2 holds, an optimal choice of 𝑘 for the identification of clusters 𝐶 (𝑖) is 𝑘 = 𝑐 1 + 4 log(𝑛)+(𝑛−1) 𝜌 𝑚𝑖𝑛  2+ 𝑡 , for a constant c such that 𝑘 will satisfy condition 1 for every 𝑖 ∈ {1, . . . , 𝐾 }. 4𝑚 𝑝𝑚𝑎𝑥 Also for a kernel density estimator 𝑝ˆ𝑛 with bandwidth ℎ there are constants 𝐶1 , 𝐶2 such that if ℎ2 ≤ 𝐶1 𝜖 𝑛 :      1 𝑛2  − 𝑚+1(𝑘−1)𝑡 2 − 2 𝑛−1 𝜌𝑚𝑖𝑛 𝑘−1 2 − 𝑛−1 𝛿 𝑚𝜖 2 𝑃 Q𝑛 ≥ 1 − 𝐾 2 + 𝑒 4 𝑝𝑚𝑎𝑥 +𝐾 𝑒 + 2𝑒 −𝑛 8 + 3𝑒 −𝐶2 𝑛ℎ 𝑛 , 4𝑚 𝑛 − 1 where Q𝑛 is the event that the algorithm 4.1 roughly identifies all clusters 𝐶 (𝑖) . Proof. If I𝑛(𝑖) is true for all clusters 𝐶 (𝑖) (2𝜖 𝑛 ) then for every 𝑖 such that 1 ≤ 𝑖 ≤ 𝐾, there will be no connections between the subgraph of 𝐺 ′𝑆𝑁 𝑁 (𝑘) containing points of cluster 𝐶 (𝑖) and any other cluster. Furthermore, if C𝑛(𝑖) holds, then all points that belong to 𝐶 (𝑖) are connected on 𝐺 ′𝑆𝑁 𝑁 (𝑘) and points outside 𝐶 (𝑖) are discarded or connected to points of 𝐶 (𝑖) . If I𝑛(𝑖) and C𝑛(𝑖) hold for every cluster, then the adjacency matrix 𝑊 of this graph will be block diagonal. Each block will correspond to a cluster 𝐶 (𝑖) (2𝜖 𝑛 ) and spectral clustering will roughly identify all 𝐶 (𝑖) . To prove that, we follow the same argument regarding the different types of Laplacians as in the proof of Theorem 1. 𝐾 C𝑛(𝑖) and Ñ Let Q𝑛 be the event that that clusters are roughly identified by algorithm 4.1, C𝑛 = 𝑖=1 𝐾 I𝑛(𝑖) . Then using Proposition 1 and Lemma 7 we obtain, Ñ I𝑛 = 𝑖=1    𝑐   𝑐   𝑐   1 𝑛2  − 𝑚+1(𝑘−1)𝑡 2 − 2 𝑛−1 𝜌𝑚𝑖𝑛 𝑘−1 2 − 𝑛−1 −𝑛 𝛿  𝑐  𝑃 (Q𝑛 ) ≤ 𝑃 (C𝑛 ) +𝑃 (I𝑛 ) ≤ 𝐾 2+ 𝑚 𝑒 4 𝑝𝑚𝑎𝑥 +𝐾 𝑒 +2𝑒 8 +3𝑃 (D𝑛 ) , 4 𝑛−1 (𝑖) where 𝜌𝑚𝑖𝑛 = min 𝜌(𝑢 (𝑖) ) and 𝑝 𝑚𝑎𝑥 = max 𝑝 𝑚𝑎𝑥 . 𝑖=1,...,𝐾 𝑖=1,...,𝐾 According to Lemma 9 of Maier et al. (2009) if 𝑝 ∈ 𝐶 2 (R𝑚 ) with || 𝑝|| ∞ = 𝑝 𝑚𝑎𝑥 and 𝑝′ (𝑥) ≠ 0 for 𝑥 in the neighborhood of {𝑝 = 𝑡} then for sufficiently small 𝜖 𝑛 Ø𝐾 ∑︁𝐾 ( 𝑗) ( 𝑗) 𝜇( 𝐶 (2𝜖 𝑛 ) \ 𝐶 )≤𝐶 𝑣𝑜𝑙 (𝜕𝐶 (𝑖) ) 𝑝 𝑚𝑎𝑥 𝜖 𝑛 , 𝑖=1 𝑖=1 62 for some constant 𝐶. Under those conditions for 𝑝 and Theorem 3.1.7 of Prakasa Rao (1983), there exist constants 𝐶1 , 𝐶2 such that when we choose bandwidth ℎ for the estimation of density 𝑝   𝑚 2 that satisfies ℎ2 ≤ 𝐶1 𝜖 𝑛 we get that 𝑃 (D𝑛 ) 𝑐 ≤ 𝑒𝐶2 𝑛ℎ 𝜖 𝑛 . Hence, under the conditions for 𝑘 of Proposition 1 and Lemma 7,      1 𝑛2  (𝑘−1)𝑡 − 𝑚+1 − 𝑛−1 𝜌𝑚𝑖𝑛 𝑘−1 2 − 𝑛−1 𝛿 𝑚𝜖 2 + 𝐾 2𝑒 + 2𝑒 −𝑛 8 + 3𝑒 −𝐶2 𝑛ℎ 2 𝑃 (Q𝑛 ) 𝑐 ≤ 𝐾 2 + 𝑒 4 𝑝𝑚𝑎𝑥 𝑛 . (4.22) 4𝑚 𝑛 − 1 To find an appropriate value of 𝑘 so that 4.22 holds, we follow the same argument as in the proof of Theorem 1 and noticing that last term of the bound of 𝑃((Q𝑛 ) 𝑐 ) is independent of 𝑘. We 4 log(𝑛)+(𝑛−1) 𝜌 𝑚𝑖𝑛  find that an appropriate value for 𝑘 will be 𝑘 = 𝑐 1 + 2+ 4𝑚 𝑝𝑡𝑚𝑎𝑥 , for a constant c such that   condition 1 is satisfied. Again 𝑃 (Q𝑛 ) 𝑐 goes to zero exponentially with n. □ It is important to notice that as 𝑛 increases the boundary points found in 𝐶 (𝑖) (2𝜖 𝑛 ) \ 𝐶 ( 𝑗) , 𝑖 = 1, . . . , 𝐾 will decrease and will be significantly less than the points of the L(t) set, leading to cleaner predicted clusters. Actually 𝐶 (𝑖) (2𝜖 𝑛 ) will collapse to 𝐶 (𝑖) . 
We will refer to the term exact identification when rough identification of clusters is achieved and the ratio of number of points that do not belong to any cluster to number of cluster points goes to zero. Theorem 3 (Exact cluster identification in noisy case 1) Let 𝑝 be three times continuously differentiable with uniformly bounded derivatives and let 𝑝ˆ𝑛 be 1 a kernel density estimator with bandwidth ℎ𝑛 = ℎ0 (𝑙𝑜𝑔𝑛/𝑛) 𝑚+4 for some ℎ0 > 0. For a suitable 2 𝜖0 > 0 set 𝜖 𝑛 = 𝜖0 (𝑙𝑜𝑔𝑛/𝑛) 𝑚+4 . Then there exist constants 𝑐 1 , 𝑐 2 such that for 𝑛 → ∞ and 𝑐 1 𝑙𝑜𝑔𝑛 ≤ 𝑘 ≤ 𝑐 2 𝑛 we obtain cluster 𝐶 (𝑖) is exactly identified by algorithm 4.1 almost surely. Proof. According to Proposition 8 of von Luxburg (2007) if 𝑁𝑐𝑙𝑢𝑠𝑡𝑒𝑟 is the number of cluster points and 𝑁 𝑁𝑜𝐶𝑙𝑢𝑠𝑡𝑒𝑟 is the number of points that do not belong to any cluster, then for 𝜖 𝑛 that goes to 𝐾 𝛽 (𝑖) , there exist a constant 𝐷¯ such that for large 𝑛, Í zero as n goes to infinity and 𝛽 = 𝑖=1  𝐷¯  1 ¯ 𝛽   𝑃 𝑁 𝑁𝑜𝐶𝑙𝑢𝑠𝑡𝑒𝑟 /𝑁𝑐𝑙𝑢𝑠𝑡𝑒𝑟 > 4 𝜖 𝑛 | C𝑛 ≤ 𝑒 − 4 𝐷𝜖 𝑛 𝑛 + 𝑒 − 𝑛 + 𝑃 (D𝑛 ) 𝑐 . (4.23) 𝛽 8 We can choose 𝜖0 such that ℎ2𝑛 ≤ 𝐶𝜖 𝑛 for a suitable constant 𝐶. Then there exist 𝐶2 > 0 with 4   𝑚 2 𝑚 𝑃 (D𝑛 ) 𝑐 ≤ 𝑒 −𝐶2 𝑛ℎ𝑛 𝜖 𝑛 . Notice that 𝑛ℎ𝑛𝑚 𝜖 𝑛2 = ℎ0𝑚 𝜖 0 𝑛 𝑙𝑜𝑔𝑛  𝑚+4 𝑙𝑜𝑔𝑛  𝑚+4 𝑛 𝑛 = ℎ0𝑚 𝜖 0 𝑙𝑜𝑔𝑛. Hence, 63 ∞ Í   ∞ Í  ¯ 𝑃 (D𝑛 ) 𝑐 < ∞. Furthermore by inequality 4.23 we have 𝑃 𝑁 𝑁𝑜𝐶𝑙𝑢𝑠𝑡𝑒𝑟 /𝑁𝑐𝑙𝑢𝑠𝑡𝑒𝑟 > 4 𝐷𝛽 𝜖 𝑛 | 𝑖=1  𝑖=1 C𝑛 < ∞. Following similar proof as of Theorem 2 we can find constants 𝑐 1 , 𝑐 2 such that for 𝑐 1 𝑙𝑜𝑔𝑛 ≤ 𝑘 ≤ 𝑐 2 𝑛 cluster 𝐶 (𝑖) will be roughly identified almost surely, and as a result the event C𝑛 will also occur almost surely. Consequently, 𝑁 𝑁𝑜𝐶𝑙𝑢𝑠𝑡𝑒𝑟 /𝑁𝑐𝑙𝑢𝑠𝑡𝑒𝑟 → 0 alomost surely. 4.3.2.2 Second approach to noisy case - No removal of points As before we denote the connected components of the 𝑡-level of the density 𝑝 by 𝐶 (1) , ..., 𝐶 (𝐾) . For the rest in the support of 𝑝 (i.e. 𝐵 = supp{𝑝} ∖ ∪𝑖=1 𝐾 𝐶 (𝑖) ), we denote: 𝐶˜ (𝑖) = {𝑥 ∈ 𝐵 : 𝑖 = argmin1≤ 𝑗 ≤𝐾 𝑑 (𝑥, 𝐶 ( 𝑗) )}, for 1 ≤ 𝑖 ≤ 𝐾. The ith cluster consists of 𝐶 (𝑖) and 𝐶˜ (𝑖) and no point is removed. Let cluster 𝑖 be denoted as the set 𝐶¯ (𝑖) = 𝐶 (𝑖) ∪ 𝐶˜ (𝑖) . We describe the noisy case as the event that the minimal distance of points between 𝐶 (𝑖) and 𝐶˜ (𝑖) is zero. For this reason, for the minimum density of points in 𝐶¯ (𝑖) , 𝑝 𝑚𝑖𝑛 , will hold that 𝑝 𝑚𝑖𝑛 > 0. Consequently, 𝐶¯ (𝑖) is connected topologically. The following results correspond to clustering with algorithm 4.1 and the choice of graph Laplacian to be the unnormalized, i.e. 𝐿 = 𝐷 − 𝑊. Let us denote the “ideal" version by 𝐿˜ = 𝐷˜ − 𝑊, ˜ namely; 𝑊˜ is the revised version of 𝑊 by removing all connections between different clusters. So 𝑊˜ is a block diagonal matrix (up to permutation of the nodes), with each block corresponding to a true cluster. Table 4.2 provides notations for this case. Lemma 8 (Connectedness of 𝐶¯ (𝑖) ) Let C𝑛(𝑖) be the event that the sample points in 𝐶¯ (𝑖) are connected. Then under the conditions: (𝑖) 1. 𝑘 ≥ 4𝑚+1 𝑙𝑜𝑔(2𝑛𝑝 𝑚𝑎𝑥 𝑉 𝑜𝑙 (𝐶¯ (𝑖) )8𝑚 ) and (𝑖) 2. 𝑘 < 2(𝑛 − 1) 𝑝 𝑚𝑎𝑥 𝜂𝑚 4𝑚 min{(𝑑𝑛(𝑖) ) 𝑚 , (𝜈𝑚𝑎𝑥 (𝑖) 𝑚 ) }, we obtain (𝑖) (𝑘−1) 𝑝 1 𝑛2 − 𝑚𝑖𝑛 𝑃 (C𝑛(𝑖) ) 𝑐 (𝑖) ≤ (2 + 𝑚 )𝑒 4𝑚+1 𝑝𝑚𝑎𝑥 . (4.24) 4 𝑛−1 (𝑖) 𝑉 𝑜𝑙 (𝐶¯ (𝑖) ) Proof. We find a covering of 𝐶¯ (𝑖) ( 𝑝 𝑚𝑖𝑛 + 𝜖 𝑛 ) of 𝑁 ≤ 𝑚 balls with radius 𝑧/4, where 𝜂 𝑚 8𝑧 𝑚 𝑧 ∈ 4(0, min{𝑑𝑛(𝑖) , 𝜈𝑚𝑎𝑥 (𝑖) }). 
$\bar C^{(i)}$                    $C^{(i)} \cup \tilde C^{(i)}$
$p_{\max}^{(i)}$                  supremum of the density attained by points of $\bar C^{(i)}$
$p_{\min}^{(i)}$                  infimum of the density attained by points of $\bar C^{(i)}$
$u^{(i)}$                         lower bound on the distance of $C^{(i)}$ to all $C^{(j)}$ with $j \ne i$
$u^{(ij)}$                        distance between $C^{(i)}$ and $C^{(j)}$
$\rho(u^{(i)})$                   probability mass of balls of radius $u^{(i)}$ in $C^{(i)}$
$\rho_{\min}$                     $\min_{i=1,\dots,K}\rho(u^{(i)})$
$\tilde\rho(u^{(i)})$             probability mass of balls of radius $u^{(i)}$ in $\tilde C^{(i)}$
$\tilde\rho_{\min}$               $\min_{i=1,\dots,K}\tilde\rho(u^{(i)}/2)$
$\beta^{(i)}$                     probability mass of $C^{(i)}$
$\beta_{\max}$                    $\max_{i=1,\dots,K}\beta^{(i)}$
$\tilde\mu^{(i)}$                 probability mass of $\tilde C^{(i)}$
$\tilde\mu_{\max}$                $\max_{i=1,\dots,K}\tilde\mu^{(i)}$
$\hat R_{\min}^{(i)}$             minimal $k$NN radius of $\bar C^{(i)}$
$\tilde R_{\max}^{(i)}$, $R_{\max}^{(i)}$   maximal $k$NN radius of $\tilde C^{(i)}$, $C^{(i)}$
$d_n^{(i)}$                       minimum distance of $\bar C^{(i)}(p_{\min}^{(i)} + \epsilon_n)$ from $\partial\bar C^{(i)}$
$\kappa^{(i)}$                    minimal curvature radius of $\partial\bar C^{(i)}$
$\nu_{\max}^{(i)}$                $\max_{\nu<\kappa^{(i)}}\{\nu \mid \bar C^{(i)}\setminus\mathrm{Col}^{(i)}(\nu)$ is connected$\}$, where $\mathrm{Col}^{(i)}$ is the collar of $\bar C^{(i)}$
Table 4.2 Notations.

If each of the covering balls contains at least two points of $\bar C^{(i)}$ and the event $\{\hat R_{\min}^{(i)} > z\}$ holds, then neighboring covering balls contain points that share common neighbors and hence are connected. We follow the arguments of the proofs of Lemma 1 and Lemma 3 to obtain
$$P\big((\mathcal{C}_n^{(i)})^c\big) \le \left(2+\frac{1}{4^m}\frac{n^2}{n-1}\right)e^{-\frac{(k-1)p_{\min}^{(i)}}{4^{m+1}p_{\max}^{(i)}}}$$
under the stated conditions, with $p_{\max}^{(i)}$, $p_{\min}^{(i)}$ the supremum and the infimum of the density attained by points of $\bar C^{(i)}$. □

Lemma 9 (Isolation of every $\tilde C^{(i)}$ from all $C^{(j)}$)
Let $\tilde{\mathcal{I}}_n$ denote the event that no point in any $\tilde C^{(i)}$ is connected to points in any $C^{(j)}$ with $j \ne i$. Then for $k < \frac{\tilde\rho_{\min} n}{2} - 2\max\{\log(\tilde\mu_{\max}n), \log(\beta_{\max}n)\}$ we obtain
$$P(\tilde{\mathcal{I}}_n^c) \le K^2 e^{-\frac{n-1}{2}\left(\frac{\tilde\rho_{\min}}{2}-\frac{k-1}{n-1}\right)^2}. \qquad (4.25)$$
Proof. Let $\tilde{\mathcal{I}}_n^{(i)}$ denote the event that no point of $\tilde C^{(i)}$ is connected to points in any $C^{(j)}$ with $j \ne i$. Then $\tilde C^{(i)}$ is connected to some $C^{(j)}$, $j \ne i$, if either the event $\{\tilde R_{\max}^{(i)} > u^{(i)}\}$ occurs or $\bigcup_{j\ne i}\{R_{\max}^{(j)} > u^{(ij)}/2\}$ occurs. Following the steps of the proof of Lemma 2, we obtain the bound under the conditions that $k < \frac{\tilde\rho(u^{(i)})\,n}{2} - 2\log(\tilde\mu^{(i)}n)$ and, for every $j \ne i$, $k < \frac{\rho(u^{(j)}/2)\,n}{2} - 2\log(\beta^{(j)}n)$. We reach the stated result by observing that $\tilde\rho(u^{(i)}) > \tilde\rho(u^{(i)}/2) \ge \tilde\rho_{\min}$, $\rho(u^{(j)}) > \tilde\rho(u^{(j)}/2) \ge \tilde\rho_{\min}$ and that $P(\tilde{\mathcal{I}}_n^c) \le \sum_{i=1}^{K}P\big((\tilde{\mathcal{I}}_n^{(i)})^c\big)$. □

Proposition 2 (Distance between the eigenspaces of $L$ and $\tilde L$)
Under the conditions
1. $k \ge 4^{m+1}\log\big(2 n p_{\max}^{(i)}\,\mathrm{Vol}(\bar C^{(i)})\,8^m\big)$ for every $i$, $1 \le i \le K$,
2. $k < 2(n-1)\,p_{\max}^{(i)}\,\frac{\eta_m}{4^m}\min\{(d_n^{(i)})^m,(\nu_{\max}^{(i)})^m\}$ for every $i$, $1 \le i \le K$, and
3. $k < \frac{\tilde\rho_{\min}n}{2} - 2\max\{\log(\tilde\mu_{\max}n), \log(\beta_{\max}n)\}$,
there exists an orthogonal matrix $O \in \mathbb{R}^{K\times K}$ such that
$$\|UO - \tilde U\|_F \le \frac{2^{5/2}\sqrt{K}\sum_{i=1}^{K}\tilde n_i}{\lambda^+(\tilde L)},$$
with probability at least $1 - K\left(2+\frac{1}{4^m}\frac{n^2}{n-1}\right)e^{-\frac{(k-1)p_{\min}}{4^{m+1}p_{\max}}} - K^2 e^{-\frac{n-1}{2}\left(\frac{\rho_{\min}+\tilde\rho_{\min}}{2}-\frac{2(k-1)}{n-1}\right)^2}$, where $U, \tilde U$ are the eigenvector matrices of the $K$ smallest eigenvalues of $L, \tilde L$, respectively, $\tilde n_i$ is the size of $\tilde C^{(i)}$ and $\lambda^+(\tilde L)$ is the $(K+1)$-st smallest eigenvalue of $\tilde L$.

Proof. Let $\mathcal{C}_n$ be the event that every cluster $\bar C^{(i)}$ is connected on $G_{SNN}(k)$. Then if $k$ satisfies the conditions of Lemma 8 for every cluster, we obtain
$$P(\mathcal{C}_n^c) \le K\left(2+\frac{1}{4^m}\frac{n^2}{n-1}\right)e^{-\frac{(k-1)p_{\min}}{4^{m+1}p_{\max}}}, \qquad (4.26)$$
where $p_{\min} = \min_{i=1,\dots,K}p_{\min}^{(i)}$ and $p_{\max} = \max_{i=1,\dots,K}p_{\max}^{(i)}$. Removing any edges that connect subgraphs of different clusters results in a block diagonal adjacency matrix $\tilde W$. As we explained in the proof of Theorem 1, this is equivalent to exact cluster identification by spectral clustering.
Furthermore, let $\mathcal{I}_n$ be the event that every set $C^{(i)}$ is isolated from the other sets $C^{(j)}$, $j \ne i$. Following the proof of Lemma 2, we obtain that if $k < \frac{\rho_{\min}n}{2} - 2\log(\beta_{\max}n)$ then
$$P(\mathcal{I}_n^c) \le K^2 e^{-\frac{n-1}{2}\left(\frac{\rho_{\min}}{2}-\frac{k-1}{n-1}\right)^2}. \qquad (4.27)$$
Since $\frac{\rho_{\min}n}{2} - 2\log(\beta_{\max}n) \ge \frac{\tilde\rho_{\min}n}{2} - 2\max\{\log(\tilde\mu_{\max}n), \log(\beta_{\max}n)\}$, we observe that if $k < \frac{\tilde\rho_{\min}n}{2} - 2\max\{\log(\tilde\mu_{\max}n), \log(\beta_{\max}n)\}$, the upper bounds of inequalities 4.27 and 4.25 both hold.

Now let us denote by $\mathcal{S}_n$ the event that the only connections between any pair of clusters $C^{(i)}, C^{(j)}$ are from the samples in $\tilde C^{(i)}$ and $\tilde C^{(j)}$. Then
$$P(\mathcal{S}_n^c) \le P(\mathcal{C}_n^c) + P(\mathcal{I}_n^c) + P(\tilde{\mathcal{I}}_n^c), \qquad (4.28)$$
and if $k$ satisfies the conditions for the upper bounds of those probabilities we obtain
$$P(\mathcal{S}_n^c) \le K\left(2+\frac{1}{4^m}\frac{n^2}{n-1}\right)e^{-\frac{(k-1)p_{\min}}{4^{m+1}p_{\max}}} + K^2 e^{-\frac{n-1}{2}\left(\frac{\rho_{\min}+\tilde\rho_{\min}}{2}-\frac{2(k-1)}{n-1}\right)^2}. \qquad (4.29)$$
Additionally, by Theorem 2 in Yu et al. (2015) there exists an orthogonal matrix $O \in \mathbb{R}^{K\times K}$ such that
$$\|UO - \tilde U\|_F \le \frac{2^{3/2}\sqrt{K}\,\|L - \tilde L\|_2}{\lambda^+(\tilde L)},$$
where $U, \tilde U$ are the eigenvector matrices of the $K$ smallest eigenvalues of $L, \tilde L$, respectively, $\|L - \tilde L\|_2$ is the spectral norm, and $\lambda^+(\tilde L)$ is the smallest non-zero eigenvalue of $\tilde L$ (i.e. the $(K+1)$-st smallest eigenvalue of $\tilde L$).

Now we would like to bound $\|L - \tilde L\|_2$. First of all, $\|L - \tilde L\|_2 \le \|D - \tilde D\|_2 + \|W - \tilde W\|_2$, and $\|W - \tilde W\|_2$ is determined by the between-cluster connections. So if the event $\mathcal{S}_n$ occurs, and because the Jaccard similarity is at most 1, we notice that
$$\|W - \tilde W\|_2 \le \sum_{i=1}^{K}\tilde n_i \quad\text{and}\quad \|L - \tilde L\|_2 \le 2\sum_{i=1}^{K}\tilde n_i,$$
where $\tilde n_i$ is the sample size of $\tilde C^{(i)}$. Hence, under the condition on $k$ we conclude that
$$\|UO - \tilde U\|_F \le \frac{2^{5/2}\sqrt{K}\sum_{i=1}^{K}\tilde n_i}{\lambda^+(\tilde L)},$$
with probability at least $1 - K\left(2+\frac{1}{4^m}\frac{n^2}{n-1}\right)e^{-\frac{(k-1)p_{\min}}{4^{m+1}p_{\max}}} - K^2 e^{-\frac{n-1}{2}\left(\frac{\rho_{\min}+\tilde\rho_{\min}}{2}-\frac{2(k-1)}{n-1}\right)^2}$. □

Definition 3
The mis-clustering error is defined as
$$M_n = \min_{\sigma\in S_K}\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\big(\sigma(\tilde q_i) \ne q_i\big),$$
where $q_i$ is the true cluster label of the $i$th data point, $\tilde q_i$ is the estimated one and $S_K$ is the set of all possible permutations of $\{1,\dots,K\}$.

Theorem 4 (Mis-clustering error bound)
Under the conditions
1. $k \ge 4^{m+1}\log\big(2 n p_{\max}^{(i)}\,\mathrm{Vol}(\bar C^{(i)})\,8^m\big)$ for every $i$, $1 \le i \le K$,
2. $k < 2(n-1)\,p_{\max}^{(i)}\,\frac{\eta_m}{4^m}\min\{(d_n^{(i)})^m,(\nu_{\max}^{(i)})^m\}$ for every $i$, $1 \le i \le K$, and
3. $k < \frac{\tilde\rho_{\min}n}{2} - 2\max\{\log(\tilde\mu_{\max}n), \log(\beta_{\max}n)\}$,
we obtain
$$M_n \le \frac{256\,n^* K\big(\sum_{i=1}^{K}\tilde n_i\big)^2}{n\,\lambda^+(\tilde L)^2}$$
with probability at least $1 - K\left(2+\frac{1}{4^m}\frac{n^2}{n-1}\right)e^{-\frac{(k-1)p_{\min}}{4^{m+1}p_{\max}}} - K^2 e^{-\frac{n-1}{2}\left(\frac{\rho_{\min}+\tilde\rho_{\min}}{2}-\frac{2(k-1)}{n-1}\right)^2}$. Here, $n^* = \max_{1\le i\le K}n_i$, where $n_i$ is the size of cluster $\bar C^{(i)}$.

Proof. We notice that $\{\tilde q_i\}_{i=1}^{n}$ are obtained by running $k$-means on $U \in \mathbb{R}^{n\times K}$. Let us define their associated centroids by $\{\tilde h_i\}_{i=1}^{n}$. Note also that $\{\tilde h_i\}_{i=1}^{n}$ contains $K$ unique vectors. Further define
$$\tilde H = \mathrm{argmin}_{H\in\mathbb{R}^{n\times K}:\ H\ \text{has}\ K\ \text{unique rows}}\ \|U - H\|_F^2.$$
It is clear that $\tilde H = (\tilde h_1, \tilde h_2, \dots, \tilde h_n)^T \in \mathbb{R}^{n\times K}$. We define the set
$$A = \Big\{1 \le i \le n : \|\tilde h_i - e_i^T\tilde U O^T\|_2 \ge \frac{1}{\sqrt{2n^*}}\Big\},$$
where $\{e_i\}_{i=1}^{n}$ is the standard basis of $\mathbb{R}^n$ and $n^* = \max_{1\le i\le K}n_i$, where $n_i$ is the sample size of the $i$th cluster $C^{(i)}\cup\tilde C^{(i)}$. Then,
$$\|\tilde h_i - e_i^T\tilde U O^T\|_2 < \frac{1}{\sqrt{2n^*}}, \quad\text{for } i \notin A. \qquad (4.30)$$
Also,
$$\tilde U = \begin{pmatrix}\frac{1}{\sqrt{n_1}}\mathbf{1}_{n_1} & 0 & \cdots & 0\\ 0 & \frac{1}{\sqrt{n_2}}\mathbf{1}_{n_2} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \frac{1}{\sqrt{n_K}}\mathbf{1}_{n_K}\end{pmatrix}P, \qquad (4.31)$$
with $P$ being orthogonal.
By 4.31, for points $i$ and $j$ that belong to different clusters,
$$\|e_i^T\tilde U O^T - e_j^T\tilde U O^T\|_2 \ge \sqrt{\frac{2}{n^*}}. \qquad (4.32)$$
Combining 4.30 and 4.32, for $i \notin A$ and $j$ in a different cluster than $i$,
$$\|\tilde h_i - e_j^T\tilde U O^T\|_2 \ge \|e_i^T\tilde U O^T - e_j^T\tilde U O^T\|_2 - \|\tilde h_i - e_i^T\tilde U O^T\|_2 > \|e_i^T\tilde U O^T - e_j^T\tilde U O^T\|_2 - \frac{1}{\sqrt{2n^*}} > \frac{1}{\sqrt{2n^*}}.$$
Consequently,
$$M_n \le \frac{|A|}{n} \le \frac{1}{n}\sum_{i\in A}1 \le \frac{2n^*}{n}\sum_{i\in A}\|\tilde h_i - e_i^T\tilde U O^T\|_2^2 \le \frac{2n^*}{n}\|\tilde H - \tilde U O^T\|_F^2 \le \frac{2n^*}{n}\big(\|\tilde H - U\|_F + \|U - \tilde U O^T\|_F\big)^2 \le \frac{8n^*}{n}\|U - \tilde U O^T\|_F^2 = \frac{8n^*}{n}\|UO - \tilde U\|_F^2 \le \frac{8n^*}{n}\left(\frac{2^{5/2}\sqrt{K}\sum_{i=1}^{K}\tilde n_i}{\lambda^+(\tilde L)}\right)^2 \le \frac{256\,n^* K\big(\sum_{i=1}^{K}\tilde n_i\big)^2}{n\,\lambda^+(\tilde L)^2},$$
using Proposition 2, with probability at least $1 - K\left(2+\frac{1}{4^m}\frac{n^2}{n-1}\right)e^{-\frac{(k-1)p_{\min}}{4^{m+1}p_{\max}}} - K^2 e^{-\frac{n-1}{2}\left(\frac{\rho_{\min}+\tilde\rho_{\min}}{2}-\frac{2(k-1)}{n-1}\right)^2}$. □

4.4 A general algorithm for tuning of clustering method parameters
To construct an SNN graph as described in section 4.2.1, one must first construct a $k$NN graph and then remove the edges between $k$NN neighbors that do not share any of their neighbors. The parameter $k$ affects the structure of the SNN graph and hence any other method that is based on it, such as algorithm 4.1. Many SNN graph-based clustering methods (Stuart et al., 2019; Xu and Su, 2015) do not use data-driven ways to decide on the value of $k$. Below we introduce a general cross-validation tuning algorithm, called $k$cv tuning, that provides a choice of a clustering parameter based on the data. We apply this algorithm to find a choice of the parameter $k$, the number of nearest neighbors used in the SNN spectral clustering algorithm 4.1, on simulated data with different signal-to-noise levels and feature structures. The simulated data are described in section 4.4.2 and the tools used to assess the performance of algorithm 4.3 are introduced in section 4.4.3. The performance results are summarized in section 4.4.4.

4.4.1 $k$cv tuning algorithm
The introduced cross-validation method suggests a tuning of a parameter $k$ of a clustering method based on the idea that, when the clustering is optimal, points in a cluster can predict with high accuracy the features of other points in the same cluster. A similar methodology has been used for the tuning of model parameters in Li et al. (2020). The $k$cv tuning algorithm works in $N$ folds.
- Given a dataset $X$, in every fold a version of $X$ is created by randomly removing 10% of the entries of $X$.
- Then, Singular Value Thresholding is applied to each version to extract its low-rank approximation and hence a completed matrix $\hat A$.
- The chosen clustering algorithm is applied to $\hat A$ for multiple values of $k$.
- Next, the missing entries of $X$ are predicted. Specifically, a missing feature of a data point in the predicted cluster $c$ associated with a specific value of $k$ is predicted from the data points of $X$ that also belong to cluster $c$.
- The optimal $k$ is chosen as the one that attains the lowest average prediction error across the $N$ folds.
Notice that for the completion of $A_q$ we first find its SVD, i.e. $A_q = UDV^T$, and then we use SVD thresholding to obtain the final result $\hat A_q = \frac{1}{m}UD_{\hat k}V^T$, where $D_{\hat k} = \mathrm{diag}(D_{1,1}, D_{2,2}, \dots, D_{\hat k,\hat k}, 0, \dots, 0)$.
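As an illustration, a minimal Python sketch of this completion step is given below; the function name is illustrative, the held-out entries of $A_q$ are assumed to have already been set to zero, and the $1/m$ rescaling simply mirrors the formula stated above.

```python
import numpy as np


def svd_threshold_complete(A_q, k_hat, m):
    """Low-rank completion of the masked matrix A_q: keep the k_hat largest
    singular values of its SVD and rescale by 1/m, as described above."""
    U, d, Vt = np.linalg.svd(A_q, full_matrices=False)
    d[k_hat:] = 0.0                     # D_khat: zero all but the first k_hat singular values
    return (U * d) @ Vt / m
```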
Algorithm 4.3 kcv_tuning
Input: $X \in \mathbb{R}^{n\times m}$, number of clusters $K$, clustering function $G$, nearest neighbors $k$, number of folds $N$, prediction method (mean, ols, lasso), low-rank approximation threshold $\hat k$, training percentage $p$
Output: Optimal number of nearest neighbors, $k_{optimal}$
for $q = 1$ to $N$ do
    $A_q \leftarrow X$ with a fraction $(1-p)$ of its entries randomly replaced by 0
    $I_q \leftarrow \{(i,j) : x_{i,j}$ replaced with 0 in $A_q\}$
    % $A_q$ completion via Singular Value Thresholding
    $\hat A_q \leftarrow$ SVD of $A_q$ with threshold $\hat k$
    % Evaluation of the performance of $k$
    for $k = 10$ to $\lceil n/3\rceil$, in steps of 20, do
        $\ell_k \leftarrow G(\hat A_q, K, k)$ (predicted membership)
        for $(i,j) \in I_q$ do
            $C_i \leftarrow$ points in the same cluster as point $i$
            $Y = [X_{rj}]$ for $r \in C_i$
            $Z = [X_{rt}]$ for $r \in C_i$ and $t \ne j$
            % Prediction of missing values
            if prediction_method = mean then
                $\hat x_{i,j} \leftarrow \bar Y$
            else if prediction_method = ols then
                $\hat\beta \leftarrow \mathrm{argmin}_\beta\{\|Y - Z\beta\|^2\}$; $\hat x_{i,j} \leftarrow z_i^T\hat\beta$, where $z_i = [X_{it}]$ for $t \ne j$
            else if prediction_method = lasso then
                $(\lambda_0, \hat\beta) \leftarrow \mathrm{argmin}_{\lambda,\beta}\{\|Y - Z\beta\|^2 + \lambda\|\beta\|_1\}$; $\hat x_{i,j} \leftarrow z_i^T\hat\beta$
        % Mean prediction error of $k$ in fold $q$
        $L_{k,q} = \frac{1}{|I_q|}\sum_{(i,j)\in I_q}(x_{ij} - \hat x_{ij})^2$
% Mean prediction error of $k$
$L_k = \frac{1}{N}\sum_{q=1}^{N}L_{k,q}$
% Optimal $k$
$k_{optimal} = \mathrm{argmin}_k(L_k)$

4.4.2 Simulations
We test the performance of algorithm 4.3, when used for the tuning of the parameter $k$ of algorithm 4.1, on various Multivariate Gaussian (MG) data. We consider a simulated data set of $n$ data points $x_1, x_2, \dots, x_n \in \mathbb{R}^m$ that are grouped in $K$ clusters. The data are independently sampled from a Multivariate Gaussian mixture model
$$\sum_{i=1}^{K}\pi_i N(\mu_i, \Sigma_i).$$
The coordinates of the centers of the Gaussian mixture follow the standard normal distribution, $\mu_{ij} \overset{i.i.d.}{\sim} N(0,1)$ for $i = 1,\dots,K$, $j = 1,\dots,m$, and the clusters have equal sizes, i.e. $\pi_1 = \cdots = \pi_K = 1/K$. We consider three types of covariance matrices $\Sigma_k$:
1. the simple case where $\Sigma_1 = \cdots = \Sigma_K = ms\cdot I$;
2. the case where $\Sigma_1 = \cdots = \Sigma_K = ms\cdot\Sigma$ and the corresponding precision matrix $\Omega = \Sigma^{-1}$ is tridiagonal. This case simulates a chain dependency between the features of the data. Specifically, $\Sigma = \{\sigma_{i,j}\}$ with $\sigma_{i,j} = 0.5^{|i-j|}$;
3. the case where $\Sigma_1 = \cdots = \Sigma_K = ms\cdot\Sigma$ and the corresponding precision matrix $\Omega = \Sigma^{-1}$ is sparse and simulates a network dependency between the features of the data. For the construction of $\Omega$ we follow the simulation procedure of Li and Gui (2005).
For the assessment of the $k$cv tuning algorithm, we simulated Multivariate Gaussian data of $n = 1000$ points and $m = 10$ or $50$ features. Table 4.3 lists the data settings considered, based on the type of covariance matrix used; a small simulation sketch follows the table.

(a) Simple MG data
m     s=0.1        s=0.3        s=0.5
10    setting 1    setting 2    setting 3
50    setting 4    setting 5    setting 6

(b) MG data with tridiagonal precision matrix
m     s=0.1        s=0.2        s=0.3
10    setting 7    setting 8    setting 9
50    setting 10   setting 11   setting 12

(c) MG data with network feature dependency
m     s=0.06       s=0.16       s=0.23
10    setting 13   setting 14   setting 15
50    setting 16   setting 17   setting 18

Table 4.3 Settings of simulated data.
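The sketch below indicates how such data can be generated in Python for the first two covariance families; the network-structured precision matrix of Li and Gui (2005) is not reproduced here, the function name is illustrative, and the covariance scale follows the $ms\cdot\Sigma$ form stated above. The number of clusters $K$ is left as an argument, since it is not fixed by the description above.

```python
import numpy as np


def simulate_mg_mixture(n, m, K, s, cov_type="identity", seed=0):
    """Sample n points in R^m from an equally weighted K-component Gaussian
    mixture with centers mu_ij ~ N(0, 1) and common covariance m*s*Sigma."""
    rng = np.random.default_rng(seed)
    if cov_type == "identity":
        Sigma = np.eye(m)                                    # simple case (settings 1-6)
    elif cov_type == "tridiagonal_precision":
        idx = np.arange(m)
        Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])   # sigma_ij = 0.5^|i-j| (settings 7-12)
    else:
        raise NotImplementedError("network precision of Li and Gui (2005) not shown here")
    centers = rng.standard_normal((K, m))
    labels = np.arange(n) % K                                # equal cluster sizes, pi_i = 1/K
    X = centers[labels] + rng.multivariate_normal(np.zeros(m), m * s * Sigma, size=n)
    return X, labels
```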
4.4.3 Assessment methods
We introduce the Normalized Prediction Accuracy function, which measures the performance of a value of $k$ for the clustering of a data set utilizing the average prediction loss. Additionally, we define the ARI Relative Ratio, which measures how close the choice of $k$ suggested by algorithm 4.3 is to the value that achieves the maximum ARI.

Definition 4 (Normalized Prediction Accuracy - NPA)
Let $\mathcal{K}$ be the set of candidate values for the parameter $k$ of the number of nearest neighbors. Let $L_k^i$ be the mean prediction error of $k$ associated with the prediction of the held-out entries of the data matrix $X$ in simulation iteration $i$. The Normalized Prediction Accuracy, $F$, of $k$ for iteration $i$ is
$$F(k,i) = 1 - \frac{L_k^i - \min_{j\in\mathcal{K}}(L_j^i)}{\max_{j\in\mathcal{K}}(L_j^i) - \min_{j\in\mathcal{K}}(L_j^i)}.$$

Definition 5 (ARI Relative Ratio)
Let $\tilde k = \mathrm{argmax}_{k\in\mathcal{K}}(\mathrm{ARI})$ when running the Snn_Spectral_Clustering algorithm, and let $\hat k$ be the optimal $k$ based on the kcv_tuning algorithm. The ARI Relative Ratio for iteration $i$, $R(i)$, is
$$R(i) = \frac{\mathrm{ARI}(\tilde k)_i - \mathrm{ARI}(\hat k)_i}{\mathrm{ARI}(\tilde k)_i},$$
where $\mathrm{ARI}(k)_i$ is the Adjusted Rand Index of the clustering produced using $k$ nearest neighbors in iteration $i$.
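For concreteness, both criteria can be computed directly from the per-$k$ prediction losses and ARI values of a single simulation iteration, for instance as in the following sketch (array and function names are illustrative):

```python
import numpy as np


def normalized_prediction_accuracy(losses):
    """F(k, i) for every candidate k in one iteration; losses[j] is the mean
    prediction error of the j-th candidate value of k."""
    L = np.asarray(losses, dtype=float)
    return 1.0 - (L - L.min()) / (L.max() - L.min())


def ari_relative_ratio(ari_by_k, k_hat_index):
    """R(i) = (ARI(k_tilde) - ARI(k_hat)) / ARI(k_tilde), where k_tilde maximizes the
    ARI over the candidate grid and k_hat is the value chosen by the kcv tuning."""
    ari = np.asarray(ari_by_k, dtype=float)
    return (ari.max() - ari[k_hat_index]) / ari.max()
```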
4.4.4 Simulation Results
The simulated data were used to tune the parameter $k$ of the SNN spectral clustering algorithm introduced in section 4.2.2. Below we describe the performance results of the $k$cv tuning algorithm.

To summarize the NPA results for a particular value of $k$, we use the mean NPA of this value over a round of 1000 simulations. The maximum mean NPA is achieved for the value of $k$ that the $k$cv tuning algorithm suggests as optimal. It is interesting to examine whether the mean NPA is maximized at the same value of $k$ that achieves the maximum mean ARI over a round of simulations. We find that this depends on the prediction method used, along with the structure of the feature dependency.

For data settings 1-6 (figures A.2, A.4) we observe that the ARI of the SNN spectral algorithm increases for $10 \le k \le 50$, and for larger values of $k$ the ARI remains about the same. The maximum ARI is achieved at $k = 330$, i.e. the largest $k$ we consider during tuning. The NPA curve is also maximized at $k = 330$ for mean, ols, and lasso prediction. Mean and lasso perform better than ols and achieve lower values of the ARI ratios. This is because ols uses information from all features, introducing more variance in the prediction, since these settings simulate data sets with independent features. Mean uses information from only one feature and lasso selects a few of the features, and hence both perform better than ols.

For data settings 7-9 and 13-15, the maximum ARI achieved by the SNN spectral clustering algorithm is attained for $k$ within the range $[10, 90]$ (figures B.2, B.4, C.2). Prediction with ols or mean fails to tune $k$ within this range, in contrast to lasso. This is because the simulated data of types 7-15 do not have independent features; every feature depends on 2-5 other features. The ols prediction utilizes information from every feature and introduces more error, and the mean prediction takes into account the information of only one feature, whereas lasso uses information only from the features correlated with the one of interest. For this reason, lasso performs better for types 7-9 and 13-15.

In settings 10-12 and 16-18, although feature selection methods like lasso are expected to perform better than ols and mean, they are observed to have similar performance. Here the simulated data have a low signal-to-noise ratio, and hence a ridge regression or elastic net prediction might perform better.

The mean and median ARI Relative Ratios provide an estimate of the proportion of the difference between the maximum ARI and the ARI achieved by clustering with the tuned $k$. The settings with a larger variance factor $s$, as shown in Table 4.3, have higher mean and median ARI Relative Ratios than settings with lower $s$. In more detail, we observe that for settings 1-6 the ARI of SNN spectral clustering after tuning is about 1% to 10% less than the optimum ARI when using mean or lasso prediction for tuning (tables A.1, A.2). For settings 7-9, lasso achieves the best tuning results, with an ARI that is 3% to 14% smaller than the optimum ARI (table B.1). The chain dependency of a higher number of features makes settings 10-12 harder to tune for $k$. The tuned $k$ obtains an ARI difference from the optimum of between 3% and 20% (table B.2). However, for the final set of simulations (settings 13-18) the features of each dataset can be represented as a network, and every feature depends on 5 (settings 13-15) or 12 other features (settings 16-18). In this case, the ARI of SNN spectral clustering after tuning is only 0.6% to 10% lower than the optimum (tables C.1, C.2).

4.5 Conclusions
In this chapter, we conducted an investigation into the clustering performance of an SNN graph-based method. For this method, we build the SNN graph as a subgraph of a $k$NN graph based on the Jaccard similarity of the $k$NN neighbor sets of the vertices. The parameter $k$ affects the structure of the SNN graph and hence the clustering performance. Our goal was to determine for which values of $k$ the SNN spectral clustering algorithm can achieve true cluster identification with high probability. Our results suggest that in both the noise-free and the noisy case one needs to select $k$ of the order $cn$ to maximize the probability of cluster identification, in contrast to the random geometric graph literature, which suggests $k$ of order $\log n$ (Brito et al., 1997). Furthermore, we introduce a general cross-validation tuning method for parameters of clustering algorithms. We use this method to tune the number of nearest neighbors $k$ of the SNN spectral clustering algorithm for a variety of simulated data types and find that the accuracy of the clustering results after using the tuned value is 1% to 20% lower than the accuracy achieved by the optimum $k$, depending on the feature dependency of the data.

BIBLIOGRAPHY

Brito, M. R., Chávez, E. L., Quiroz, A. J., and Yukich, J. E. (1997). Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection. Statistics & Probability Letters, 35(1):33–42.

Guattery, S. and Miller, G. L. (1998). On the quality of spectral separators. SIAM Journal on Matrix Analysis and Applications, 19(3):701–719.

Li, H. and Gui, J. (2005). Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics, 7(2):302–317.

Li, T., Levina, E., and Zhu, J. (2020). Network cross-validation by edge sampling. Biometrika, 107(2):257–276.

Luxburg, U., Bousquet, O., and Belkin, M. (2004). Limits of spectral clustering. Advances in Neural Information Processing Systems, 17.

Maier, M., Hein, M., and von Luxburg, U. (2009). Optimal construction of k-nearest-neighbor graphs for identifying noisy clusters. Theoretical Computer Science, 410(19):1749–1764.

Meilă, M. and Shi, J. (2001). A random walks view of spectral segmentation. In International Workshop on Artificial Intelligence and Statistics, pages 203–208. PMLR.

Ng, A., Jordan, M., and Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 14.

Prakasa Rao, B. (1983). Nonparametric Functional Estimation. Probability and Mathematical Statistics: A Series of Monographs and Textbooks. Academic Press.
Spielman, D. A. and Teng, S.-H. (1996). Spectral partitioning works: Planar graphs and finite element meshes. In Proceedings of the 37th Conference on Foundations of Computer Science, pages 96–105. IEEE.

Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., Mauck III, W. M., Hao, Y., Stoeckius, M., Smibert, P., and Satija, R. (2019). Comprehensive integration of single-cell data. Cell, 177:1888–1902.

von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17.

Xu, C. and Su, Z. (2015). Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics, 31(12):1974–1980.

Yu, Y., Wang, T., and Samworth, R. J. (2015). A useful variant of the Davis–Kahan theorem for statisticians. Biometrika, 102(2):315–323.

APPENDIX A

PERFORMANCE ON GAUSSIAN DATA WITH DIAGONAL COVARIANCE MATRIX

prediction   setting   Median   Mean    Sd.Error
mean         1         0.007    0.010   0.001
mean         2         0.021    0.042   0.004
mean         3         0.053    0.099   0.010
ols          1         0.006    0.011   0.001
ols          2         0.024    0.053   0.005
ols          3         0.055    0.109   0.010
lasso        1         0.007    0.012   0.001
lasso        2         0.023    0.044   0.004
lasso        3         0.051    0.102   0.010
Table A.1 ARI ratios summary for settings 1, 2 and 3.

prediction   setting   Median   Mean    Sd.Error
mean         4         0.009    0.012   0.001
mean         5         0.030    0.074   0.008
mean         6         0.114    0.280   0.023
ols          4         0.009    0.019   0.002
ols          5         0.049    0.151   0.016
ols          6         0.132    0.309   0.024
lasso        4         0.009    0.011   0.001
lasso        5         0.030    0.078   0.010
lasso        6         0.100    0.249   0.021
Table A.2 ARI ratios summary for settings 4, 5 and 6.

[Figure A.1: ARI ratio comparisons for settings 1, 2 and 3; scatter plots of the ARI ratios of ols vs mean, lasso vs mean, and lasso vs ols prediction.]

[Figure A.2: ARI and NPA as functions of k for settings 1, 2 and 3, under mean, ols, and lasso prediction.]
[Figure A.3: ARI ratio comparisons for settings 4, 5 and 6; scatter plots of the ARI ratios of ols vs mean, lasso vs mean, and lasso vs ols prediction.]

[Figure A.4: ARI and NPA as functions of k for settings 4, 5 and 6, under mean, ols, and lasso prediction.]

APPENDIX B

PERFORMANCE ON GAUSSIAN DATA WITH TRIDIAGONAL PRECISION MATRIX

prediction   setting   Median   Mean    Sd.Error
mean         7         0.029    0.046   0.004
mean         8         0.094    0.116   0.008
mean         9         0.118    0.158   0.011
ols          7         0.023    0.036   0.003
ols          8         0.067    0.101   0.008
ols          9         0.102    0.143   0.011
lasso        7         0.015    0.030   0.003
lasso        8         0.060    0.094   0.007
lasso        9         0.099    0.141   0.010
Table B.1 ARI ratios summary for settings 7, 8 and 9.

prediction   setting   Median   Mean    Sd.Error
mean         10        0.038    0.048   0.003
mean         11        0.171    0.201   0.012
mean         12        0.243    0.281   0.015
ols          10        0.025    0.040   0.003
ols          11        0.163    0.198   0.012
ols          12        0.102    0.143   0.011
lasso        10        0.027    0.038   0.002
lasso        11        0.147    0.202   0.013
lasso        12        0.219    0.271   0.016
Table B.2 ARI ratios summary for settings 10, 11 and 12.

[Figure B.1: ARI ratio comparisons for settings 7, 8 and 9; scatter plots of the ARI ratios of ols vs mean, lasso vs mean, and lasso vs ols prediction.]
[Figure B.2: ARI and NPA as functions of k for settings 7, 8 and 9, under mean, ols, and lasso prediction.]

[Figure B.3: ARI ratio comparisons for settings 10, 11 and 12; scatter plots of the ARI ratios of ols vs mean, lasso vs mean, and lasso vs ols prediction.]

[Figure B.4: ARI and NPA as functions of k for settings 10, 11 and 12, under mean, ols, and lasso prediction.]

APPENDIX C

PERFORMANCE ON GAUSSIAN DATA WITH NETWORK OF FEATURES

prediction   setting   Median   Mean    Sd.Error
mean         13        0.015    0.026   0.003
mean         14        0.058    0.078   0.006
mean         15        0.088    0.121   0.009
ols          13        0.012    0.020   0.002
ols          14        0.048    0.071   0.005
ols          15        0.077    0.105   0.008
lasso        13        0.009    0.017   0.001
lasso        14        0.047    0.072   0.006
lasso        15        0.072    0.104   0.008
Table C.1 ARI ratios summary for settings 13, 14 and 15.

prediction   setting   Median   Mean    Sd.Error
mean         16        0.006    0.006   0.000
mean         17        0.027    0.036   0.003
mean         18        0.050    0.107   0.012
ols          16        0.006    0.008   0.001
ols          17        0.029    0.056   0.006
ols          18        0.057    0.147   0.014
lasso        16        0.006    0.007   0.000
lasso        17        0.027    0.053   0.006
lasso        18        0.061    0.164   0.015
Table C.2 ARI ratios summary for settings 16, 17 and 18.
[Figure C.1: ARI ratio comparisons for settings 13, 14 and 15; scatter plots of the ARI ratios of ols vs mean, lasso vs mean, and lasso vs ols prediction.]

[Figure C.2: ARI and NPA as functions of k for settings 13, 14 and 15, under mean, ols, and lasso prediction.]

[Figure C.3: ARI ratio comparisons for settings 16, 17 and 18; scatter plots of the ARI ratios of ols vs mean, lasso vs mean, and lasso vs ols prediction.]

[Figure C.4: ARI and NPA as functions of k for settings 16, 17 and 18, under mean, ols, and lasso prediction.]