ADVANCED STATISTICAL AND COMPUTATIONAL TECHNIQUES FOR GENOMIC DATA ANALYSIS By Sikta Das Adhikari A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics—Doctor of Philosophy Computational Mathematics, Science and Engineering—Dual Major 2024 ABSTRACT The human body is an incredibly complex system that researchers have studied for decades to uncover its secrets. Genetic, transcriptomic, and epigenetic data each offer unique insights into its functioning. Recent technological advancements have enabled the generation of vast amounts of high-quality biological data, creating unprecedented opportunities to explore molecular mech- anisms underlying health and disease. Analyzing these diverse datasets is crucial for developing targeted therapies, personalized medicine, and advancing our understanding of biology. Build- ing advanced statistical and computational models to handle these complex datasets is now more important than ever for translating biological information into actionable insights and driving breakthroughs in medical research and treatment strategies. In this dissertation, I first developed BayesKAT, a kernel-based testing methodology for as- sessing the association between user-defined groups of SNPs or genes and a phenotype of interest (Chapter 2). Unlike existing kernel-based tests that use predefined single or average kernels and often yield ambiguous results, this algorithm adaptively selects the optimal composite kernel using a Gaussian process model within a Bayesian framework, providing more interpretable outcomes. Next, I explored the emerging field of spatial transcriptomics, where similar Gaussian process models and kernel-based testing have significant potential. To complement the recent surge in spa- tial transcriptomics research, Chapter 3 presents a comprehensive literature review of significant methodologies, particularly for spatial gene detection, which is a crucial step in spatial transcrip- tomics data analysis. This review provides an overview of the current state of research in the field. In Chapter 4, I extended the kernel-based testing procedure to address challenges in spatial transcriptomic data. The newly developed algorithm, cSVG, not only detects spatially variable genes, but also improves spatial domain detection accuracy and addresses additional problems in this field. Finally, to tackle the scarcity of crucial TF binding information for many transcription factors (TFs) across various cell types, I developed a computational model, 3D-TF-IMPUTE (Chapter 5). This model predicts TF binding sites by utilizing readily available epigenetic datasets and leveraging the three-dimensional structure of the genome in an unsupervised manner, efficiently predicting TF binding sites essential for understanding the functional genome. By tackling key challenges in the analysis of genetic, transcriptomic, and epigenetic data, this dissertation makes significant contributions to the field. It provides powerful tools for researchers to better understand the molecular underpinnings of health and disease, paving the way for future breakthroughs in biomedical research. Copyright by SIKTA DAS ADHIKARI 2024 ACKNOWLEDGEMENTS First and foremost, I would like to express my deepest gratitude to my advisors, Dr. Jianrong Wang and Dr. Yuehua Cui, for their invaluable guidance and support throughout this journey. I am incredibly fortunate to have had such wonderful mentors during my PhD studies. Their doors were always open whenever I needed help, and their positive attitudes always helped me recover from setbacks and push through obstacles. I am also profoundly grateful to my committee members Professor Lyudmila Sakhanenko, Professor Haolei Weng and Professor Rongrong Wang, for their insightful feedback on my research and professional guidance on my thesis. Their expertise and advice have been instrumental in shaping my work. I would like to express my gratitude to the Department of Statistics and Probability, my primary department for the past five years. I am grateful to have been part of this department and to have had the opportunity to learn from exceptional professors. Special thanks to Dr. Haolei Weng and Dr. Yimin Xiao for teaching the core courses so beautifully. I feel fortunate to have had such outstanding teachers. I also want to thank the amazing STT staff, especially Tami and Andy, for solving all my problems so quickly. My time in graduate school has been an incredible five-year journey, with the last 3.5 years spent as part of Dr. Jianrong Wang’s lab. I extend my heartfelt thanks to my lab members: Dr. Jiaxin Yang, Dr. Hao Wang, Dr. Binbin Huang, Dr. Wenjie Qi, Dr. Jiwoong Kim, Dr. Pronoy Kanti Mondol, Tairan Song, Shuyang Zhang, and Xinrui Yu, for fostering a positive environment, always being helpful, and generously sharing their knowledge. Dr. Wang’s passion for research and his positive mindset have been truly inspiring, often boosting my self-confidence and productivity. I have inherited a wealth of research techniques and biological concepts from him. A special thank you goes to my husband, Arka, for his unwavering support and for being with me through all the highs and lows. This journey has been so fulfilling because we experienced it together. I am immensely grateful for the privilege of discussing new research papers, admiring each other’s work, providing honest critiques, and cheering each other on after long, overwhelming v days. The past five years have been the most enriching period of my life, and I have made incredible new friends along the way. Thanks to Sumegha, Hema, Tathagata, and Nisarg for introducing board games and many more wonderful things into my life. Anirban da and Sampriti, thank you for always being there, for our joint adventures, endless debates, and spontaneous plans. Your friendships have made this journey truly memorable, and I have found friends for life. Thanks to Nian, Haoxiang, and Sang Kyu for making our coursework time so enjoyable. I would like to acknowledge my internship supervisor, Dr. Joyce Hsiao, and my mentor, Dr. Michael kleyman, who significantly enriched my learning journey and broadened my horizons. This internship provided me with the confidence to step out of my comfort zone and taught me how to plan and execute projects within limited time frames. My heartfelt thanks go to my professors at the University of Calcutta—Prof. Manisha Pal, Prof. Asis Kumar Chattopadhyay, Prof. Uttam Bandyopadhyay, Prof. Sugata Sen Roy, Prof. Gauran- gadeb Chattopadhyay, Prof. Nripes Kumar Mandal, Prof. Rahul Bhattacharya, and especially Prof. Bhaswati Ganguli—whose help, guidance and motivation inspired me to pursue a PhD. Finally, I thank my parents, Alpana Das Adhikari and Samir Das Adhikari, for their emotional support and love, which have been the foundation of my achievements. I couldn’t have come this far without their boundless generosity and wonderful parenting. I also thank my elder brother Saikat Das Adhikari and sister-in-law Arpita Dutta for their constant support. I thank them for always believing in me. I am grateful to my in-laws for always cheering me on and supporting me. This journey has allowed me to gain expertise in my field of interest and acquire skills that have prepared me for my next career path. The personal growth I have experienced has been remarkable, and I am forever thankful to everyone who has helped me along the way. vi CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 TABLE OF CONTENTS CHAPTER 2 BAYESKAT: BAYESIAN OPTIMAL KERNEL-BASED TEST FOR GENETIC ASSOCIATION STUDIES REVEAL JOINT GENETIC EFFECTS IN COMPLEX DISEASES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 5 2.1 2.2 Material and methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3 Enhanced efficacy of BayesKAT benchmarked on simulation studies . . . . . . 20 2.4 Application to ADNI datsets: BayesKAT reveals novel associated genetic Introduction . . . . . basis of complex traits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 . 32 2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.6 Data and code avilability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.7 Supplementary materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 3 SPATIAL TRANSCRIPTOMICS - SVG DETECTION TECHNIQUES AND SCOPES FOR IMPROVEMENT . . . . . . . . . . . . . . . . . . 39 . 39 Introduction: Analysis of spatial transcriptomic data . . . . . . . . . . . . . . 3.1 3.2 An overview of SVG detection techniques . . . . . . . . . . . . . . . . . . . . 40 3.3 Statistical Inference with Multiple Testing Control . . . . . . . . . . . . . . . . 50 . 51 3.4 Exploring Performance, Advantages, and Limitations . . . . . . . . . . . . . . 53 3.5 Assessing Input Data and Model Outputs . . . . . . . . . . . . . . . . . . . . . 54 3.6 Publicly Accessible Code for Major SVG Detection Methods . . . . . . . . . . 54 3.7 Summary and Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.8 Downstream analysis: Spatial domain detection . . . . . . . . . . . . . . . . CHAPTER 4 Introduction . . CSVG: IMPROVED SPATIAL DOMAIN DETECTION BY SPATIALLY VARIABLE GENE CLUSTERING ADJUSTING FOR CELL TYPE EFFECT . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 . 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 . 4.3 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.4 Real data analysis . . 79 4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.6 Supplementary materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 5 . . . . 3D-TF-IMPUTE: A COMPUTATIONAL MODEL TO IMPUTE TF BINDING SITES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Introduction . Input data pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1 5.2 5.3 Method Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4 Results . 5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.6 Data and Code availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 . 87 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 . 91 . 92 . 94 . . . . . . . . . vii CHAPTER 6 CONCLUSION AND FUTURE WORK . . . . . . . . . . . . . . . . . . 95 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 APPENDIX A KERNEL BASED MODELS: FROM SEMIPARAMETRIC MODEL TO LMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 . APPENDIX B SCORE STATISTIC AND THE NULL DISTRIBUTION . . . . . . . . 113 APPENDIX C T-SNE PLOT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 APPENDIX D COMMUNITY DETECTION ALGORITHMS . . . . . . . . . . . . . 117 APPENDIX E CHAPTER 2 SUPPLEMENTARY TABLES AND FIGURES . . . . . . 118 APPENDIX F CHAPTER 4 SUPPLEMENTARY FIGURES . . . . . . . . . . . . . . 125 viii CHAPTER 1 INTRODUCTION In contemporary biomedical research, the analysis of diverse biological datasets has emerged as a cornerstone in deciphering the intricate mechanisms of human diseases. Within the human body, comprised of trillions of cells, each cell type is specialized to execute distinct functions. At the core of cellular function lies the flow of genetic information, wherein DNA within the nucleus undergoes transcription into RNA, subsequently translated into proteins pivotal for various biological processes. Throughout this transcription and translation cascade, distinct types of data are collected at each stage. Beginning with genome sequencing data extracted from DNA, researchers scrutinize variations such as single nucleotide polymorphisms (SNPs) to discern associations with diseases or phenotypic traits. Transitioning to transcriptomic data, derived from the second stage, unveils the remarkable diversity in gene expression among cells, despite harboring identical DNA sequences. Gene expression data analysis plays a crucial role in biomedical research by providing insights into the activity levels of genes within cells or tissues. Gene expression profiling can identify genes whose expression levels correlate with specific disease states or physiological conditions. These genes can serve as biomarkers for disease diagnosis, prognosis, and treatment response prediction. In addition to genetic and transcriptomic datasets, epigenetic data play a pivotal role in un- derstanding functional regulation. Epigenetic modifications, including DNA methylation, histone modifications, and transcription factor (TF) binding, orchestrate changes in gene expression crucial for the specialized functions of diverse cell types within various organs. These chemical modi- fications to DNA and associated proteins exert profound effects on gene expression, influencing cellular phenotypes and contributing to the complexity of human diseases. Understanding the molecular underpinnings of human diseases requires comprehensive anal- ysis of genetic, transcriptomic, and epigenetic datasets. This thesis addresses key challenges associated with these three types of datasets by developing advanced statistical and computational methodologies. 1 Genome-wide association studies (GWAS) have been instrumental over the past two decades in identifying millions of disease-associated single nucleotide polymorphisms (SNPs). However, many complex diseases and phenotypes are influenced by multiple genetic variants, where individ- ual SNPs may only have a weak linear association with the phenotype. These SNPs can collectively contribute to the phenotype through their involvement in crucial biological processes or pathways. The problem of evaluating the strength of association between user-defined SNP groups and spe- cific phenotypes or diseases is often addressed by semiparametric models and kernel-based tests. However, existing kernel-based testing methods rely on pre-specified single or average kernels, which can sometimes lead to inconsistent or ambiguous results. Chapter 2 introduces BayesKAT, a Bayesian optimal kernel-based test that adaptively selects the optimal composite kernel using a Gaussian process model within a Bayesian framework, providing more interpretable results. This algorithm is applicable not only to SNP-groups but also to sets of genes, whether user-specified or from genome-wide biologically important pathways. Based on a series of performance compar- isons using both simulated and real large-scale genetics data, BayesKAT outperforms the available methods in detecting complex group-level associations. Recent advancements in spatial transcriptomic technology have transformed the field of gene expression analysis, sparking a surge in research over the past few years. Given the extensive research in this field, Chapter 3 offers a detailed literature review, particularly on techniques for detecting spatially variable genes (SVGs), which is a crucial initial step in spatial transcriptomic data analysis. While various methods have been proposed for SVG detection, the significance of these genes lies in their contribution to downstream analyses, particularly in tasks like spatial domain detection. Conventional approaches typically rely on using all or a predetermined number of top- ranked SVGs for spatial domain detection. However, in datasets characterized by high diversity and a large number of SVGs, this strategy may not ensure accurate spatial domain detection or subsequent downstream analyses. Alternatively, grouping SVGs based on their expression patterns and leveraging all SVG groups in the downstream analysis can enhance accuracy, as seen in many examples. Furthermore, classifying SVGs in this manner is akin to identifying cell type marker 2 genes, offering valuable biological insights. The challenge lies in accurately categorizing SVGs into relevant clusters, aggravated by the absence of ground truth and prior knowledge regarding the number and spectrum of spatial expression patterns exhibited by genes. Addressing this challenge, Chapter 4 introduces cSVG, a framework that begins with SVG detection and proceeds to precisely classify SVGs based on their spatial patterns by adjusting for confounding effects caused by shared cell types. Notably, this method eliminates the need for prior knowledge of gene cluster numbers, distinct spatial patterns, or cell type information. Through comprehensive simulation studies and real data analyses, this approach demonstrates considerable efficiency and holds promise as a potent tool in spatial transcriptomics analysis. Transcription factors (TFs) are essential in the biological system as they are key regulators of gene expression. By binding to specific parts of DNA, TFs can activate or repress the transcription of target genes, thereby controlling the flow of genetic information from DNA to mRNA. TFs orchestrate complex gene networks, ensuring that genes are expressed at the right time, place, and levels, which is vital for maintaining cellular function. Understanding TF binding sites and their regulatory mechanisms is fundamental for insights into health, disease, and potential therapeutic interventions. Epigenetic data analysis often faces the challenge of unavailable or poor-quality TF binding data across various cell types and species. Accurately predicting TF binding sites remains a significant hurdle. Existing methods for predicting TF binding sites are typically based on supervised models, which involve computationally intensive preprocessing and training steps, and require various types of input data that may not be available for many cell types and species. Moreover, these methods overlook the three-dimensional structure of the genome, which holds valuable insights into underlying processes. To tackle these limitations, Chapter 5 presents a pioneering computational model, 3D-TF-IMPUTE. This method predicts TF binding sites for various TFs by leveraging the three-dimensional DNA structure, requiring only motif information and chromatin accessibility data for unsupervised prediction of TF binding sites. This thesis addresses critical challenges in the analysis of genetic, transcriptomic, and epigenetic data, paving the way for a deeper understanding of disease biology and potential therapeutic 3 interventions. 4 CHAPTER 2 BAYESKAT: BAYESIAN OPTIMAL KERNEL-BASED TEST FOR GENETIC ASSOCIATION STUDIES REVEAL JOINT GENETIC EFFECTS IN COMPLEX DISEASES A modified version of part of this chapter was previously published in Briefings in Bioinformatics journal by Oxford University Press: Sikta Das Adhikari, Yuehua Cui, Jianrong Wang, BayesKAT: bayesian optimal kernel-based test for genetic association studies reveals joint genetic effects in complex diseases, Briefings in Bioin- formatics, Volume 25, Issue 3, May 2024, bbae182, https://doi.org/10.1093/bib/bbae182 2.1 Introduction Deciphering the genetic basis of complex traits, such as the Alzheimer’s disease, plays pivotal roles in functional genomics and precision medicine [1],[2]. Based on the advancement in high- throughput sequencing techniques, specific associated genetic variants, e.g., single-nucleotide polymorphisms (SNPs), have been identified for a large panel of phenotypes using Genome-wide Association Studies (GWAS) [3]. However, traditional GWAS approaches treat SNPs independently and can only discover individual SNPs that have strong marginal statistical associations with the phenotype of interest. It is well documented that many complex diseases and phenotypes are often associated with multiple genetic variants [4], [5], [6], [7] where an individual variant itself might be weakly associated with the phenotype. In contrast, groups of such SNPs may jointly contribute to the phenotype, potentially mediated via their cooperative participation in important biological processes or pathways [8], [9]. Therefore, the traditional GWAS framework of testing individual SNPs separately without considering the correlation structures and the potential interactions among SNPs may not capture the group-wise joint SNP effects. Separate testings of SNP associations by traditional GWAS approaches are also limited to reveal the underlying biological mechanisms of complex phenotypes. Alternative approaches based on multivariate regression significantly suffer from the large degrees of freedom in genome-wide association tests and can substantially lose the statistical power [10]. 5 To overcome this critical challenge, kernel-based testing (KBT) framework has been introduced to test group-wise joint SNP effects [10], [11], [12], [13], [14], [15], [16]. By incorporating a kernel function to measure the similarity among genetic variants and compare with the phenotype similarities, the KBT framework simultaneously models the joint effects of multiple genetic variants. Wu et al. 2011 [12] first proposed the widely-used sequence kernel association test (SKAT) model to test rare-variant associations. As a supervised, versatile and computationally streamlined regression approach, SKAT accesses the associations between genetic variants within a specific region and the trait. As the outputs from the SKAT model, p-values of the statistical associations are generated, facilitating straightforward interpretations of the findings. An R package [17] has been developed for implementing different kinds of kernel-based testing models, including SKAT. To enable novel discoveries of the genetic basis underlying complex diseases, maximizing statistical power in genome-wide association tests while effectively controlling type 1 errors is strongly desired. Under the KBT framework, statistical power heavily depends on the specific choice of kernel functions [16], [18], [19], [20]. However, the existing KBT models, including SKAT, require the kernel function to be specified a priori. Because the true functional relationship between the genetic variants and phenotypes is usually unknown in practice, selecting the ideal kernel function in advance for the KBT model, one that maximizes statistical power without increasing the type 1 error rate, poses statistical and computational challenges. One common approach that has been used is to repeat the KBT procedures based on different choices of kernels and then select the one resulting in the minimum p-value, which has been discussed by multiple studies [18], [21]. The major problem of this straightforward approach is the inflated type 1 error. Although data-dependent permutation or perturbation methods [18] can help ease the problem, they are not computationally scalable, especially when applied to high-dimensional datasets in large-scale genomic studies. An alternative approach is to use an equal-weighted average of multiple candidate kernels to form an averaged composite kernel [18], which performs better than the worst performing candidate kernel but does not usually achieve the performance of the optimal kernel function. Tests based on the average kernel approach may lead to inconsistent or 6 incorrect conclusions in applications, as we will demonstrate below. He et al. 2018 [21] proposed a maximum kernel test model based on the U statistic, i.e. the mKU model, which claims to achieve the statistical power as close as to the best candidate kernel in high-dimensional settings under certain distributional assumptions. However, the specific distributional assumptions may not hold in practice and hence lead to inflated p-values, which will also be discussed in this study. To further illustrate the significance and difficulty of choosing appropriate kernels in genetic association testings, Figure 2.1A demonstrates an example based on the genotype data for the trait of whole brain volume collected from the ADNI project for the Alzheimer’s Disease Neuroimaging Initiative (https://adni.loni.usc.edu/)[22]. The group of genetic variants located in genes belonging to the caffeine metabolism pathway [23], [24], [25] are included into the kernel-based testing model to test the hypothesis that whether the caffeine metabolism pathway is associated with the whole brain volume phenotype. Using different kernel functions, the SKAT model leads to inconsistent conclusions. For instance, based on Quadratic kernel, the SKAT model rejects the null hypothesis (p-value<0.05), while the use of Gaussian kernel or IBS kernel does not lead to any rejection of the null hypothesis Figure 2.1A. On the other hand, using the equal-weighted average composite kernel, the SKAT model tends to reject the null hypothesis (p-value= 0.047). Since there is no clear mechanistic link between the caffeine metabolism pathway and whole brain volume, the rejection of the null hypothesis based on the Quadratic and the average composite kernels is likely a false discovery. To further quantify this issue, in Figure 2.1B, a cohort of total 500 replicate synthetic datasets based on the same covariates and genotype data is created, where the phenotype variables are generated by a Quadratic function ℎ(·). Applying the SKAT test based on different kernel functions on the synthetic datasets, inconsistent testing results appear to be a persistent problem. Although the Quadratic kernel function leads to the correct hypothesis testing result as expected, the average composite kernel usually leads to incorrect conclusions (Figure 2.1B). As shown in the barplot of Figure 2.1B, using different kernels (Linear, Quadratic, Gaussian and average composite kernels) across the 500 synthetic datasets, inconsistent results are observed and the overall fractions of correct testing results are very low. Hence, the inconsistent conclusions 7 based on different kernels suggest the fundamental need of developing a systematic data-adaptive approach of selecting appropriate kernel functions for KBT models in genome-wide association tests. Figure 2.1 Overview of the significance and model design for BayesKAT. (A) Real-world example of association tests with inconsistent results depending on specific kernels. where BayesKAT offers a more interpretable metric. (B) Based on a synthetic data cohort simulated assuming a true quadratic function, different kernel functions lead to inconsistent results. Across 500 simulation replicates, each individual kernel yields inconsistent and ambiguous conclusions (barplot). In comparison, BayesKAT generates both interpretable and highly consistent results, with substantially boosted power. (C) Workflow of BayesKAT implementation for diverse types of genetic association tests to derive biological meaningful interpretations. (D) Model structures and the inference algorithms for the two BayesKAT strategies: BayesKAT-MCMC (left) and BayesKAT-MAP (right). BayesKAT-MCMC samples from posterior parameter distributions, providing a comprehensive view of the posterior parameter distributions. On the other hand, BayesKAT-MAP provides a more scalable solution, particularly well-suited for high-dimensional data. 8 Figure 2.1 (cont’d) In this study, we developed a novel Bayesian Kernel-Based Association Testing algorithm, BayesKAT (https://github.com/wangjr03/BayesKAT). This algorithm effectively tackles the kernel selection challenge by choosing the optimal kernel in a data-adaptive way and calculating the poste- rior probability of association by evaluating the joint statistical associations of specific SNP groups with a complex phenotype. Moreover, compared to existing KBT-based methods, BayesKAT si- multaneously achieves four goals in genome-wide association tests: (i) superior statistical power, by selecting the optimal kernel function based on the dataset under study; (ii) consistent results, by avoiding repeated tests based on a variety of different kernels; (iii) controlled type-1 error, without relying on unverified distributional assumptions or minimum p-value kernels; and (iv) strong computational scalability for high-dimensional and large-sample genome-wide data. Two alternative computational strategies, i.e., MCMC and MAP, are incorporated in BayesKAT, leading to additional implementation flexibilities for users. Extensively tested on a series of simulated datasets under different parameter settings, BayesKAT consistently demonstrates superior perfor- mance against existing methods. Furthermore, applied on the ADNI genotype datasets of the complex trait of whole brain volume (https://adni.loni.usc.edu), BayesKAT successfully discovered mechanistically related genes and biological pathways with higher accuracy. Specific genes and 9 pathways related with neurodegenerative diseases, as reported by previous studies, are consistently prioritized by BayesKAT while not prioritized by other methods. Strikingly, BayesKAT is able to identify group-level SNP effects of novel co-expressed gene modules and protein complexes that potentially participate in the molecular processes modulating the whole brain volume phenotype. These algorithmic advantages and new biological discoveries robustly support the statistical in- novation of BayesKAT and strongly highlight its effectiveness in decoding the genetic basis and associated molecular mechanisms underlying complex diseases. 2.2 Material and methods 2.2.1 Overview of kernel based testing models for genetic data Under the kernel machine regression framework, continuous quantitative traits can be associated to genetic variants or molecular features, along with additional covariates, through a semiparametric model: 𝑌𝑖 = 𝑋𝑖 𝛽 + ℎ(𝑍𝑖) + 𝜖𝑖, 𝑖 = 1, 2, · · · , 𝑛 (2.1) where 𝑌𝑖 denotes the continuous value of the trait for the 𝑖th person in a sample of size 𝑛; 𝑋𝑖 = [𝑋𝑖1, 𝑋𝑖2, · · · , 𝑋𝑖𝑘 ] is a set of 𝑘 covariates for the 𝑖th individual that need to be controlled; and 𝛽 = [𝛽1, 𝛽2, · · · , 𝛽𝑘 ] are the corresponding effects of covariates. 𝑍𝑖 = [𝑍𝑖1, 𝑍𝑖2, · · · , 𝑍𝑖 𝑝] is the vector for the 𝑝 genetic variants or molecular features, where 𝑍𝑖 𝑗 denotes the 𝑗th genetic variant or molecular level feature for the 𝑖th individual . The unknown errors 𝜖𝑖 are assumed to be independent and follow 𝑁 (0, 𝜎2), where the value 𝜎2 is also unknown. The most common genetic features are SNPs and the widely used molecular-level features include gene expressions. The features, i.e. 𝑍. 𝑗 , 𝑗 = 1, 2, · · · , 𝑝 are associated with the trait, i.e. 𝑦, through an arbitrary function ℎ(·) which is assumed to lie in a function space 𝐻𝐾 generated by a kernel function 𝐾 (·, ·). The supplementary files provide More discussion about different kernel function types. It has been shown (see Appendix A) [11] that the kernel machine regression model in equation (2.1) is equivalent to the following linear mixed model: 𝑌 = 𝑋 𝛽 + ℎ + 𝜖 (2.2) 10 where 𝛽 ∈ R𝑘 is a vector of effect sizes for covariates 𝑋 ∈ R𝑛×𝑘 , ℎ is an 𝑛 × 1 vector of random effects which is distributed as ℎ ∼ 𝑁 (0, 𝜏𝐾), where 𝐾 is the 𝑛 × 𝑛 kernel matrix and the error is distributed as 𝜖 ∼ 𝑁 (0, 𝜎2𝐼), where 𝜎2 is the error variance and 𝜏 is a variance component for the genetic effect. The main goal is to test if the genetic variants have any combined effect on the outcome variable 𝑌 . Testing for the presence of group effect of 𝑍 is equivalent to testing the hypothesis 𝐻0 : 𝜏 = 0 vs. 𝐻1 : 𝜏 > 0. Choosing an appropriate kernel is crucial and that topic has been discussed further in the supplementary files. Most KBT methods, including SKAT, choose the kernel function first and then perform the testing based on the specified kernel function. 2.2.2 Importance of choosing appropriate Kernel Although a variety of different kernels are available, for a given dataset, it is practically impossible to know a priori which kernel will fit the dataset best and maximize the testing power. Genetic data related to complex phenotypes pose particular challenges, primarily stemming from our limited understanding of how the interplay among genetic or molecular features influences their collective association with a phenotype. Therefore, choosing a kernel randomly can lead to a less powerful testing procedure for genome-wide applications. For example, if the outcome variable 𝑌 is related to the features through a Quadratic function, using a Linear kernel in the model will lead to weak tests that are not able to reject the null hypothesis even when the association is strong. On the other hand, by repeatedly applying KBT models based on different candidate kernels and choosing the one resulting in the minimum p-value, there is a high chance of making a false discovery, i.e., rejecting the null hypothesis when there is no association. Combining a panel of candidate kernels together to create a composite kernel is thus a natural and effective strategy to overcome this issue. While a straightforward strategy of averaging kernels to form a composite kernel, i.e., a linear combination of candidate kernels with equal weights, can perform better than the worst-performing kernel function, it usually cannot perform as efficiently as the best kernel to accurately represent the association between the trait and features for a given dataset, thus, is not guaranteed to increase the statistical power. As shown in 2.1A and 2.1B, 11 evaluated on both real and synthetic datasets, the average kernel strategy can often lead to incorrect and inconsistent results in practice. Thus, a systematic data-adaptive approach of optimal kernel selection is highly desirable for high-dimensional genome-wide association tests, especially for complex human disease phenotypes that are genetically modulated by multiple inter-dependent genetic variants. Although a variety of different kernels are available, for a given dataset, it is practically impossible to know a priori which kernel will fit the data best and maximize the testing power. Genetic datasets of complex phenotypes are particularly challenging due to the limited prior knowledge about the structure of interdependence among genetic or molecular features and how the features are quantitatively associated with the phenotypes. Therefore, choosing a kernel randomly can lead to a less powerful testing procedure for genome-wide applications. For example, if the outcome variable Y is related to the features through a Quadratic function, using a Linear kernel in the model will lead to weak tests that are not able to reject the null hypothesis even when the association is strong. On the other hand, by repeatedly applying KBT models based on different candidate kernels and choosing the one resulting in the minimum p-value, there is a high chance of making a false discovery, i.e., rejecting the null hypothesis when there is no association. 2.2.3 BayesKAT Our new algorithm, BayesKAT (https://github.com/wangjr03/BayesKAT) employs a novel Bayesian modeling strategy to automatically select the optimal composite kernel based on the data and does not require the composite kernel function to be set a priori by the user. Based on the inferred optimal composite kernel function, BayesKAT can efficiently test the joint effects induced by a group of genetic or molecular features associated with a phenotype. The optimal composite kernel is a linear combination of candidate kernels where the weight of each candidate kernel re- flects the degree of usefulness of the kernel explaining the complex relationship between a group of features and the phenotype of interest. As an illustration, suppose there are three potential kernels: Quadratic, Gaussian, and IBS. If the IBS kernel effectively captures the underlying relationship, it will carry greater significance within the composite kernel, hence have a larger weight, while 12 the impact of other kernels may be relatively weak, as indicated by their lower weights. Figure 2.1C provides an overview of the workflow of BayesKAT and its two computational strategies: 1) the Markov Chain Monte Carlo (MCMC) strategy; and 2) the Maximum a Posteriori (MAP) strategy. Additionally, it is noteworthy that while BayesKAT is primarily developed to test genetic associations, it can also be employed for a wide range applications, including testing the association between continuous gene expression features and complex traits. Consider a set of 𝑚 candidate kernels 𝐾1, 𝐾2, · · · , 𝐾𝑚, the composite kernel is in the form of 𝜌𝑖𝐾𝑖, where 0 ≤ 𝜌𝑖 ≤ 1 and (cid:205)𝑚 𝑖=1 𝜌𝑖=1 and (cid:205)𝑚 𝑖=1 ℎ ∼ 𝑁 (0, 𝜏 (cid:205)𝑚 𝑖=1 𝜌𝑖𝐾𝑖), Therefore, selecting the optimal composite kernel is equivalent to selecting the optimal value for the weight 𝜌𝑖 (𝑖 = 1, · · · , 𝑚) so that it can capture the underlying relationship between the genetic or molecular features and the trait, when testing the group-level effect of a set of multiple features. As a kernel-based testing model, BayesKAT relies on a set of candidate kernel functions, which are incorporated to infer the optimal composite kernel for the association tests. For the convenience of practical implementations, BayesKAT infers the optimal composite kernel consisting of three candidate kernels as the default setting. And the default candidate kernels include Quadratic, Gaussian and IBS kernel. To construct a composite kernel, the candidate kernels are normalized in BayesKAT based on the previously proposed technique [21] so that they are in the same scale and comparable. To gain robust performance, weakly informative prior distributions are used for model pa- rameters by default, although the users can incorporate more informative priors based on spe- cific knowledge about the data. The important model parameters are 𝜃 = [ ˜𝜌, 𝜏1, 𝜎2, 𝛽], where ˜𝜌 = [ ˜𝜌1, ˜𝜌2, ..., ˜𝜌𝑚] are the unscaled weights of the candidate kernels ((cid:205)𝑚 𝑖=1 ˜𝜌𝑖 ≠ 1), 𝛽 = (𝛽1, 𝛽2, · · · , 𝛽𝑘 )𝑇 and 𝜏1 = 𝜏 𝜎2 after reparameterization. And the weakly informative prior dis- tributions are: 𝜎2 ∼ 𝐼𝑛𝑣𝑒𝑟 𝑠𝑒𝐺𝑎𝑚𝑚𝑎(2, 2), 13 𝜏1 ∼ 𝑈𝑛𝑖 𝑓 𝑜𝑟𝑚(0, 2), 𝛽 ∼ 𝑀𝑢𝑙𝑡𝑖𝑣𝑎𝑟𝑖𝑎𝑡𝑒𝑁𝑜𝑟𝑚𝑎𝑙 (0, 10𝐼), ˜𝜌1, ˜𝜌2, ˜𝜌3 ∼ 𝐺𝑎𝑚𝑚𝑎(1, 1). The actual weights for candidate kernels 𝜌 = (𝜌1, 𝜌2, 𝜌3) = ( ˜𝜌1, ˜𝜌2, ˜𝜌3)/(cid:205)3 1 and 𝜌 ∼ 𝑑𝑖𝑟𝑖𝑐ℎ𝑙𝑒𝑡 (1, 1, 1). The data distribution is defined as: 𝑖=1 ˜𝜌𝑖. Clearly (cid:205)3 𝑖=1 𝑦|𝜃 ∼ 𝑁 (𝑋 𝛽, 𝜎2(𝜏1𝐾𝑐 + 𝐼)), where the composite kernel 𝐾𝑐 = (cid:205)3 𝑖=1 𝜌𝑖𝐾𝑖. 2.2.3.1 BayesKAT strategy: 𝜌𝑖 = (2.3) Here the main hypothesis to test is: 𝐻0 : 𝜏 = 0 vs. 𝐻1 : 𝜏 > 0. It is equivalent to test 𝐻0 : 𝜏1 = 0 vs. 𝐻1 : 𝜏1 > 0. Bayes factor(𝐵𝐹10) is calculated to test the hypothesis, which evaluates the evidence in favor of the alternative hypothesis. Bayes factor is defined as the ratio of marginal likelihoods under two hypotheses [26]: 𝐵𝐹10 = 𝑃(𝐷𝑎𝑡𝑎|𝐻1) 𝑃(𝐷𝑎𝑡𝑎|𝐻0) (2.4) where 𝑃(𝐷𝑎𝑡𝑎|𝐻0) and 𝑃(𝐷𝑎𝑡𝑎|𝐻1) are the marginal likelihoods under 𝐻0 and 𝐻1, respectively. Given the input data, BayesKAT mainly uses two efficient and easy-to-implement strategies to select the composite kernel and calculate the Bayes Factor 𝐵𝐹10 by estimating 𝑃(𝐷𝑎𝑡𝑎|𝐻0) and 𝑃(𝐷𝑎𝑡𝑎|𝐻1). The two computational strategies are explained in subsequent sections. 2.2.3.2 Interpreting BayesKAT output: The preliminary output from BayesKAT, 𝐵𝐹10, is a summary of evidence provided by the data in favor of 𝐻1 as opposed to 𝐻0. In addition, the posterior probability of the association (𝑃(𝐻1|𝐷𝑎𝑡𝑎)) is calculated as the final output, which has a one-one relation with 𝐵𝐹10, i.e., 𝑃(𝐻1|𝐷𝑎𝑡𝑎) 𝑃(𝐻0|𝐷𝑎𝑡𝑎) = ⇒ 𝑃(𝐻1|𝐷𝑎𝑡𝑎) = 𝑃(𝐻0|𝐷𝑎𝑡𝑎)𝑃(𝐻1) × 𝑃(𝐻1) 𝑃(𝐻1|𝐷𝑎𝑡𝑎)𝑃(𝐻0) 𝑃(𝐻0) 1 𝑃 (𝐻 0 ) 𝑃 (𝐻 1 ) 1+ 1 𝐵𝐹 10 Here 𝑃(𝐻1) and 𝑃(𝐻0) are the prior probabilities under 𝐻1 and 𝐻0, respectively, i.e. the probabilities of existence of association and no association, respectively. Depending on the specific 14 biological problem and dataset, the values of 𝑃(𝐻0) and 𝑃(𝐻1) can be set given biological evidence or prior knowledge, and 𝑃(𝐻1|𝐷𝑎𝑡𝑎) can be calculated. 𝑃(𝐻1|𝐷𝑎𝑡𝑎) is the posterior probability of the model under 𝐻1 given the data. 𝑃(𝐻1|𝐷𝑎𝑡𝑎) gives a quantitative evaluation of how probable there exists an association between the genetic features and the phenotype of interest or how strong the evidence is against the null hypothesis. As already demonstrated by the example in Figure 2.1A and 2.1B in the introduction section, existing methods based on pre-selected kernels or average composite kernels suffer from inconsistent hypothesis testing results and may lead to false discoveries. In contrast, for the Figure 2.1A scenario, BayesKAT calculated the posterior probability of a true association given the data P(𝐻1|𝑑𝑎𝑡𝑎)=0.19, i.e. the evidence of association is very low and the existence of true association is not very likely. This is consistent with the fact that there is a lack of documented evidence of the association between the caffeine metabolism pathway and the whole brain volume phenotype. Strikingly, in the scenario of the synthetic data cohort simulated with known association based on the Quadratic kernel (see Figure 2.1B), BayesKAT successfully calculated the posterior probability 𝑃(𝐻1|𝐷𝑎𝑡𝑎) = 0.99 to suggest the existence of association, without incorporating any prior information. Moreover, tested on 500 repeated simulation cohorts, BayesKAT achieves much higher statistical power than other methods and also demonstrates more consistent testing results (see Figure 2.1B). Investigating the inferred weights for each candidate kernel functions in the final composite kernel selected by BayesKAT further shows that the Quadratic kernel is correctly assigned with the largest weights (Supplementary Figure E.1), suggesting that BayesKAT can efficiently capture the functional form of the underlying statistical associations in a data adaptive way. 2.2.4 BayesKAT-MCMC As a Bayesian model, BayesKAT-MCMC employs the Markov Chain Monte Carlo (MCMC) sampling-based strategy to infer the optimal composite kernel function. Leveraging MCMC for efficient and traceable samplings from complex target distributions, BayesKAT-MCMC avoids direct sampling from the posterior distribution 𝑃(𝜃|𝐷𝑎𝑡𝑎) (see section 2.2.6), which does not have a closed mathematical form and is computationally intractable. Instead, Metropolis-Hastings 15 method [27, 28] is used to iteratively draw samples based on the generated Markov chain, which are able to approximate the target probability distribution 𝑃(𝜃|𝐷𝑎𝑡𝑎). Let 𝜃0 denote the initial value for 𝜃. The 𝑡th iteration of the Metropolis-Hastings algorithm consists of the following steps [29], [30]: 1. Sample a candidate point 𝜃𝑡 from a proposal distribution 𝐽𝑡 (𝜃∗|𝜃𝑡−1). 2. Calculate the acceptance ratio for jumping to the new point 𝑟1 = 𝑝(𝜃∗|𝐷𝑎𝑡𝑎)/𝐽𝑡 (𝜃∗|𝜃𝑡 −1) 𝑝(𝜃𝑡 −1)/𝐽𝑡 (𝜃𝑡 −1|𝜃∗) 3. Set 𝜃𝑡 = 𝜃∗ with probability 𝑟1 and 𝜃𝑡 = 𝜃𝑡−1 with probability 1 − 𝑟1. That is, it jumps to the new proposed value with probability 𝑟1 and stays at the same value with probability 1 − 𝑟1. Here is the main workflow for BayesKAT-MCMC: Input: Genotype matrix 𝑍, covariate matrix 𝑋, response 𝑦. Parameters under 𝐻1 : 𝜏1 > 0: 𝜃𝐻1 = [ ˜𝜌, 𝜏1, 𝜎2, 𝛽] Parameters under 𝐻0 : 𝜏1 = 0: 𝜃𝐻0 = [𝜎2, 𝛽] , 𝜃𝐻0 as mentioned in "BayesKAT" section. Prior distribution of 𝜃𝐻1 Define the data distribution based on 𝜃𝐻𝑖 , 𝑋,𝑍 and 𝑦, as defined in (2.3) where 𝑖 =0 or 1. Step 1: Using the Metropolis-Hastings MCMC method mentioned above, draw samples from the posterior distribution of 𝜃𝐻1 using three separate MCMC chains, each of which ran 50,000 iterations. Step 2: check if the algorithm has converged, otherwise run more iterations. Step 3: Draw samples from the posterior distribution of 𝜃𝐻1 by similarly Repeating step 1 and 2. Step 4: Using the posterior samples of 𝜃𝐻1, the unknown parameters are estimated: [ ˆ˜𝜌, ˆ𝜏1, ˆ𝜎2, ˆ𝛽]. From ˆ˜𝜌, the optimal kernel weights 𝜌 can be estimated. ˆ𝜃𝐻1 = Step 5: the posterior samples of 𝜃𝐻1 , 𝜃𝐻0 are used to calculate 𝑃(𝑑𝑎𝑡𝑎|𝐻1), 𝑃(𝑑𝑎𝑡𝑎|𝐻0) respectively using Chib’s method[31]. Step 6: Bayes Factor calculated from 𝑃(𝑑𝑎𝑡𝑎|𝐻1), 𝑃(𝑑𝑎𝑡𝑎|𝐻0) using (2.4). 16 BayesKAT-MCMC uses the R package BayesianTools [32] to generate two sets of samples from the posterior distribution of 𝜃 under the hypotheses 𝐻1 and 𝐻0, using the Metropolis-Hastings algorithm in an adaptive way [33] to leverage the history of the stochastic process and appropriately fine-tune the proposal distributions. Three separate MCMC chains initiated from different random start points are generated for 50,000 iterations. To ensure that the MCMC chains are converged, trace plot is used to visualize the moves of the Markov chains in the state space [34] . In addition, based on the Gelman-Rubin diagnostic method [35], the potential scale reduction factor, i.e. PSRF score, is also calculated and presented at the end of MCMC sampling to inspect whether the chain is converged in which the PSRF score is close to one. Based on the generated posterior samples, the marginal distributions of the parameters are further visualized as shown in Figure 2.1D. The posterior samples are used to estimate the composite kernel weights and also the marginal likelihoods 𝑃(𝐷𝑎𝑡𝑎|𝐻𝑖), 𝑖 = 1, 0 using the Chib’s method [31]. Bayes Factor is subsequently calculated using the formula presented in equation (2.4). 2.2.5 BayesKAT-MAP Although BayesKAT-MCMC yields comprehensive information of the marginal distributions of model parameters, drawing large sets of samples from the posterior distributions in the MCMC strategy is computationally expensive. Here, we provide an alternative strategy, termed BayesKAT- MAP, which is easy to implement, allows parallel calculations, and has higher computational scalability. Instead of drawing numerous MCMC samples from the posterior distributions and then estimating the parameter values, BayesKAT-MAP employs the quick optimization technique to estimate the parameters of interest directly, based on the Maximum A Posteriori (MAP) strategy such that, ˆ𝜃 𝑀 𝐴𝑃 = 𝑎𝑟𝑔𝑚𝑎𝑥 𝜃 [log 𝑝(𝜃|𝑦)] = 𝑎𝑟𝑔𝑚𝑎𝑥 𝜃 [log 𝑝(𝑦|𝜃) + log 𝜋(𝜃)] (2.5) where 𝜋(𝜃) denotes the prior distribution 𝜃. Because the objective function is nondifferentiable at some points, a derivative-free optimization algorithm by Quadratic approximation using the 17 R package Minqa [36] is implemented. The most important model parameter 𝜏1 indicates the existence of association between the feature set and the trait variable, with 𝜏1 = 0 suggesting that there is no association. In BayesKAT-MAP, if the calculated MAP estimator of 𝜏1, i.e., ˆ𝜏1 𝑀 𝐴𝑃 = 0, it follows that the Bayes Factor =0 (i.e., no evidence of association is found), and the computational process terminates. On the other hand, if ˆ𝜏1 𝑀 𝐴𝑃 > 0, it implies that there might be some evidence of association and BayesKAT-MAP proceeds to calculate the MAP estimator again under 𝐻0 and then computes the marginal likelihoods 𝑃(𝐷𝑎𝑡𝑎|𝐻𝑖), (𝑖 = 1, 0), along with the Bayes Factor. Due to the practical limitations of exact analytical methods, such as relying on specific dis- tributional assumptions, efficient numerical integration approaches [37] are needed to calculate the marginal likelihoods under hypotheses 𝐻𝑖, 𝑃(𝐷𝑎𝑡𝑎|𝐻𝑖) = ∫ 𝑃𝑟 (𝐷𝑎𝑡𝑎|𝜃, 𝐻𝑖) 𝜋(𝜃|𝐻𝑖)𝑑𝜃, so that the model can be applied on diverse panels of data. BayesKAT-MAP employs the Laplace’s method [38], [39], [26] for approximating the integral 𝑇 = ∫ 𝑃𝑟 (𝐷𝑎𝑡𝑎|𝜃, 𝐻𝑖)𝜋(𝜃|𝐻𝑖)𝑑𝜃 by ˆ𝑇, where ˆ𝑇 = (2𝜋)𝑑/2| ˜Σ|1/2𝑃𝑟 (𝐷𝑎𝑡𝑎| ˜𝜃, 𝐻𝑖)𝜋( ˜𝜃|𝐻𝑖) and 𝑑 is the dimension of 𝜃, ˜𝜃 is the mode of the log-likelihood function 𝑙 (𝜃|𝐷𝑎𝑡𝑎), ˜Σ is the inverse of the negative Hessian matrix of the second derivative of 𝑙 (𝜃|𝐷𝑎𝑡𝑎) computed at ˜𝜃. For boundary regions of the parameter space, the Laplace approximation is modified according to the previously developed protocol [40] to accommodate the boundary cases. Based on the estimated marginal likelihood densities, the Bayes Factor is then computed and the posterior probabilities are finally inferred, given user-defined priors 𝑝(𝐻0) and 𝑝(𝐻1) for which equal values are used as the default setting in BayesKAT. Here is the summary of the workflow for BayesKAT-MAP: Input: Genotype matrix 𝑍, covariate matrix 𝑋, response 𝑦. Parameters under 𝐻1 : 𝜏1 > 0: 𝜃𝐻1 = [ ˜𝜌, 𝜏1, 𝜎2, 𝛽] Parameters under 𝐻0 : 𝜏1 = 0: 𝜃𝐻0 = [𝜎2, 𝛽] Prior distribution of 𝜃𝐻1 , 𝜃𝐻0 as mentioned in "BayesKAT" section. Define the data distribution based on 𝜃𝐻𝑖 , 𝑋,𝑍 and 𝑦 as mentioned in (2.3), where 𝑖 =0 or 1 Step 1: ˆ𝜃𝐻1 = ˆ𝜃 𝑀 𝐴𝑃 is calculated using this formulation 2.5 using optimization technique. 18 Step 2: check if ˆ𝜏1 = 0. If 0, stop. conclusion: no association! If >0, go to next step. Step 3: Calculate ˆ𝜃𝐻0 using same technique in step 1 under 𝐻0. Step 4: Calculate 𝑃(𝑑𝑎𝑡𝑎|𝐻1), 𝑃(𝑑𝑎𝑡𝑎|𝐻0) using Laplace approximation(look at previous paragraph) based on ˆ𝜃𝐻1 and ˆ𝜃𝐻0. Step 5: Bayes Factor calculated from 𝑃(𝑑𝑎𝑡𝑎|𝐻1), 𝑃(𝑑𝑎𝑡𝑎|𝐻0) using (2.4). The performance and runtime of BayesKAT-MCMC and BayesKAT-MAP under different settings are systematically compared and further discussed in supplementary files. Because of the superior computational scalability of BayesKAT-MAP, as shown in 2.1D, the results in the paper are generated using BayesKAT-MAP. 2.2.6 Input data organization and model set up for BayesKAT Data containing the information of genotypes or molecular features for complex phenotypes can be collected from large public-accessible or user-generated cohorts (e.g., ADNI, UK biobank, All of Us (https://allofus.nih.gov/), GTEx, PsychENCODE (https:// psychencode.synapse.org/), etc.). Individual-level data can be pre-processed and efficiently undergo steps of quality controls using software such as Plink [41]. Additionally, biological meaningful feature groups need to be defined and created depending on the goals of genome-wide association tests. In this study, we have explored four different biology inspired ways of grouping functionally related genetic variants, including 1) gene-wise groups: aggregating SNPs located within genomic regions of genes; 2) pathway-level groups: aggregating SNPs situated within genes that belong to a specific molecular pathway; 3) co-expression gene modules: aggregating SNPs located within genes that belong to a specific co-expression module; and 4) Protein-protein interaction(PPI) modules: aggregating SNPs located within genes that belong to a specific PPI module, which may represent a protein complex. 19 2.2.7 Comparison with other methods and performance evaluation The performance of BayesKAT is compared to two state-of-the-art algorithms: 1) SKAT using the average composite kernel, denoted as SKAT(Avg) [18]; and 2) the U statistic-based method, denoted as mKU [21]. Both of these two methods are frequentist approaches that maximize the power after restricting the type 1 error to a fixed level of 𝛼, such as setting 𝛼 to 0.05, and have been shown to outperform other existing methods. In contrast, as the first Bayesian model for this problem, BayesKAT uses a fixed threshold on the posterior probability or the Bayes factor, based on previously suggested guidelines [26], to reject the null hypothesis. To make fair comparisons, the performance of each method (i.e., the empirical statistical power), is evaluated at a fixed and equal empirical type 1 error across all three algorithms. A systematic comparison based on rigorous simulations are presented in the Results section. As multiple groups are simultaneously tested, a multiplicity correction technique is implemented, which is discussed in detail in the supplementary files. 2.3 Enhanced efficacy of BayesKAT benchmarked on simulation studies As defined in the section of Materials and Methods, the rows of the feature matrix 𝑍 ∈ R𝑛×𝑝 correspond to 𝑛 individuals and the columns correspond to the 𝑝 features. Depending on the particular genetic association studies, the features may encompass discrete genetic characteristics, like alleles with values of 0, 1, or 2, or continuous molecular features, such as gene expressions. We have conducted simulations for both cases, with 𝑍 corresponding to discrete or continuous features, under both low and high dimensional settings. 2.3.1 Simulation with continuous features: As shown in Figure 2.2A, the simulations based on continuous features are first conducted to evaluate the performance of BayesKAT using similar scenarios presented in [21]. With the specified parameters (𝑛 = 500, 𝑝 = 100, 𝑘 = 2, 𝑚 = 3), the feature matrix 𝑍 is simulated from a multivariate normal distribution with mean 0𝑝 and an AR(1) correlation matrix 𝑅 where (𝑅( 𝑗, 𝑗 ′) = 𝑟 | 𝑗− 𝑗 ′ |). In the simulation, the correlation 𝑟 is set to be 0.6. The covariate matrix 𝑋 ∈ R𝑛×2 has one binary covariate generated from a Bernoulli (0.6) distribution and one continuous covariate generated 20 Figure 2.2 Performance comparison based on simulations using continuous features. (A) Schematic summary of the data generation process for continuous molecular features (e.g. gene expression features) and the demonstration of the implementation under various scenarios. (B) Performance comparison across different simulation settings with systematic performance evaluations, i.e. the empirical power versus empirical type 1 error, for SKAT(Avg), mKU, and BayesKAT across different scenarios and parameter settings. When increasing 𝑝 while keeping other factors constant, all methods exhibit a slight decline in power, but BayesKAT consistently outperforms SKAT(Avg) and mKU. Additionally, as 𝑟 increases under fixed parameters, BayesKAT also consistently surpasses SKAT(Avg) and mKU. 21 from N(2,1). The covariate coefficients are set as 𝛽 = (0.03, 0.5) and the outcomes 𝑌 are simulated based on model (2.1). The error term with variance 𝜎2 = 1 is added as the random noise. Three commonly used candidate kernels are incorporated in BayesKAT: Linear, Quadratic, and Gaussian. Different functional forms of ℎ(·) are used to create different scenarios to test the performance. The empirical type 1 errors of each method are calculated based on the specific scenario where ℎ(𝑍) = 0, i.e. there is no association between 𝑍 and 𝑌 in the simulated data. The different scenarios are given below: • Scenario A: ℎ(𝑍) = 0.6 × 𝑍1𝑍3 • Scenario B: ℎ(𝑍) = 0.55 × 𝑍1𝑍3 + 0.1 × 𝑍1 + 0.1 × 𝑍3 • Scenario C: ℎ(𝑍) = 0.3 × (𝑍1 − 𝑍3) + 1.5 × 𝑐𝑜𝑠(𝑍3)𝑒𝑥 𝑝(−𝑍 2 3/5)) In all the simulation scenarios, as shown in Figure 2.2B, the empirical power versus empirical type 1 error for BayesKAT is consistently better than that of SKAT(Avg) and mkU. Moreover, sensitivity analyses are conducted to evaluate the effects with different simulation parameters on the final performance of different algorithms. Notably, when the number of features 𝑝 increases from 100 to 150, while keeping all other parameters fixed, BayesKAT consistently achieves superior empirical power compared to other methods (see Figure 2.2B). Similarly, when the correlation 𝑟 increases from 0.6 to 0.8 with 𝑝 fixed as 150, BayesKAT consistently demonstrates higher empirical power and outperforms other methods. Taken together, these simulation results demonstrate the robust superior performance of BayesKAT compared to SKAT(Avg) and mKU in various settings with continuous features. 2.3.2 Simulation with discrete features: The performance of different models on discrete SNP features is first evaluated based on simulations where randomly selected SNP groups are used as features. Because randomly selected SNPs are generally not functionally related, the overall effectiveness of all KBT models decreases as expected, with BayesKAT still showing improved empirical power compared to other methods (Supplementary Figure E.2). Because the real-world implementations of KBT models for genetics 22 Figure 2.3 Performance comparison based on simulations with discrete features. (A) Schematic summary of the data generation process for discrete genetic features (e.g., SNP variants) and the demonstration of the implementation under various scenarios. Starting with the original 𝑍 matrix containing groups of SNPs from each pathway, a simulated SNP matrix 𝑍′ is generated, preserving the underlying interrelationships among the SNPs. Covariate variables (i.e., age, gender and occupation) are incorporated based on the data from ADNI. (B) Performance comparison across different simulation settings. Systematic performance evaluations, i.e. the curves of empirical power versus empirical type 1 error, for SKAT(Avg), mKU, and BayesKAT across different scenarios are plotted. The performance across three sets of simulated datasets is averaged. The performance curves based on each individual simulated dataset can be found in Supplementary Figure E.3. studies usually focus on functionally related groups of SNPs, a more realistic strategy of simulating groups of discrete SNP features, instead of randomly selected unrelated SNPs, is employed to further benchmark the performance of BayesKAT (Figure 2.3A). To rigorously capture the underlying linkage disequilibrium structures of discrete features in real-world SNP data, the ADNI dataset is used as the basis for the simulations, where the covariate matrix 𝑋 is created from the real covariates (e.g. age, gender and education) of the corresponding 23 individuals in the dataset, and SNPs located in specific genes belonging to the selected KEGG pathways are included into the testing (see Materials and Methods). Three different KEGG pathways are randomly selected for performance evaluations. For each pathway, the groups of SNPs located in the corresponding gene members are identified. The "SNPknock" package [42] is then used to simulate the knockoff SNP data, which maintains the structural dependency among SNPs in each group intact in the simulated knockoff SNPs (Figure 2.3A). The feature matrix 𝑍 is constructed based on the simulated SNP data, with 𝑛 = 755 and 𝑝 ranging between 4000 and 5000. The outcome variable 𝑌 is simulated based on three different scenarios. Each scenario corresponds to a different functional form ℎ(𝑍), that is: • Scenario D: ℎ(𝑍) = 2 × 𝑍1𝑍3 • Scenario E: ℎ(𝑍) = 2 × 𝑍1𝑍3 + 0.04 × 𝑍𝑖 + 0.04 × 𝑍3 • Scenario F: ℎ(𝑍) = 0.4 × (𝑍1 − 𝑍3) + 0.4 × 𝑐𝑜𝑠(𝑍3)𝑒𝑥 𝑝(−𝑍 2 3/5)) Where 𝑍𝑖 is the 𝑖th column of 𝑍 corresponding to the 𝑖th SNP. For each group of pathway-level knockoff SNPs, 500 simulations are generated. By applying BayesKAT and the other methods on the set of simulations, the corresponding empirical type 1 error and empirical power are calculated accordingly. The summary of performance comparisons based on this extensive set of simulations is shown in Supplementary Figure E.3. The empirical power and empirical type 1 error for each SNP group clearly demonstrates that BayesKAT robustly outperforms SKAT(Avg) and mKU, across different simulation scenarios and settings. The averaged performance over 3 pathway- level SNP groups can be found in Figure 2.3B. To demonstrate the robust superior performance of BayesKAT, the simulation study is repeated using different sample sizes, n=1000,1500. The comparison outcomes are illustrated in supplementary Figures E.4 and E.5. Remarkably, in addition to these consistent advantages, BayesKAT also achieves much lower empirical type 1 error across all simulation settings, when the suggested Bayes Factor threshold [26] is employed. It suggests that the associations between the SNP groups and the phenotype detected by BayesKAT exhibit a 24 significantly higher level of reliability compared to other methods, a crucial attribute for genetic applications in complex diseases. 2.4 Application to ADNI datsets: BayesKAT reveals novel associated genetic basis of complex traits To illustrate the novel biological insights generated by BayesKAT, the individual-level data, in- cluding the genotype, phenotype and demographic covariates, from the ADNI project (https://adni. loni.usc.edu) [43], [22] are used to conduct a series of group-level genetic association testings. Specifically, BayesKAT is used to test the group-wise associations between SNP sets and the com- plex phenotype of whole brain volume, based on the available information across 755 individuals in the ADNI cohort. 2.4.1 Real data preprocessing In addition to a series of simulated datasets, real genetic datasets are used to evaluate the performance of BayesKAT and its derived biological discoveries. The data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers and clinical and neuropsychological assessments can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD). Plink software [41] (http://pngu.mgh.harvard.edu/purcell/plink/) is used to pre-process the individual-level genotype data. Detailed information on data access, download and pre-processing steps can be found in the supplementary material. Four biology-based strategies are used to create the feature groups of functionally related SNPs, leading to complementary biological insights into the genetic basis underlying the specific phenotype. The four SNP feature grouping strategies are: (1) Gene-wise SNP groups for 18,999 protein-coding genes in the human genome[44]. For each protein-coding gene, all SNPs located within +/-5KB of the gene body are collected as the gene-wise SNP feature set; (2) Pathway-wise SNP groups for 352 KEGG pathways [23], [24], [25]. For each pathway, 25 gene-wise SNP groups for all genes belonging to the specific pathway are collected as the pathway- wise SNP feature set. The number of SNPs per pathway varies from 35 to 22,555, with a mean of 1,721 SNPs. Figure 2.5A provides a schematic figure demonstrating the pathway-wise joint SNP testing procedure; (3) Co-expression gene module based SNP groups. Forty-one co-expression gene modules are identified using the R package "WGCNA" [45], [46] to find correlated gene co-expression clusters from expression data (available on the ADNI website). The co-expression gene modules have a different number of genes in them, which varies from 6 to 3361. SNPs within each module are extracted for further group-wise testing; and (4) 401 protein-protein interaction (PPI) gene module based SNP groups. The PPI gene modules are previously created in [47] based on the topology of the PPI network. The number of genes in PPI modules ranges from 2 to 497. Figure 2.4 Functional validation of BayesKAT’s prioritized genes using orthogonal information. (A) Top-ranking genes prioritized by BayesKAT are strongly supported by previous literature of functional studies of brain-related diseases. (B) The selected genes by BayesKAT demonstrate higher fractions of overlapping meQTL’s CpG sites than the genes selected by SKAT(Avg) and mKU. The CpG sites of significant meQTLs from the brain tissues represent orthogonal molecular-level evidence in support of the gene’s functional involvement with whole brain volume. 2.4.2 BayesKAT prioritizes functionally related genes: To identify genes associated with the trait of whole brain volume, we conduct a gene-wise association test with gene-level SNPs (see Materials and Methods). BayesKAT prioritized 17 genes, whose posterior probability of association (𝑃(𝐻1|𝐷𝑎𝑡𝑎)) is greater than 0.7. Figure 2.4A shows some examples of the prioritized genes, which have been suggested to be associated with 26 brain or neurodegenerative disorders by previous seminal studies [48], [49], [50]. The whole list of the 17 prioritized genes and their corresponding posterior probability of associations are provided in Supplementary Table E.1. Strikingly, as external evidence in support of BayesKAT’s prioritized genes, 12 out of the 17 genes (71%) contain CpGs that have been found to be involved with significant meQTLs [51] in the human brain cortex. In contrast, the genes prioritized by SKAT(Avg) and mKU demonstrate much lower fractions of overlapping with CpGs linked to meQTLs (66% and 61% respectively, Figure 2.4B). The higher proportion of prioritized genes containing CpGs offers molecular-level support for BayesKAT’s ability to uncover genes mechanistically linked to complex traits. Figure 2.5 Pathway-level association tests by BayesKAT prioritizes neurodegenerative disease related pathways. (A) Schematic representation illustrating the steps of pathway-level association tests. Sets of SNPs located within genes belonging to each of the 352 pathways are tested simultaneously for pathway-level associations with the phenotype of interest (e.g., the whole brain volume ). Multiplicity control is implemented to identify the specific list of significant pathways linked to the phenotype.(B) The top-ranking pathways prioritized by (i) BayesKAT, (ii) mKU, and (iii) SKAT(Avg) demonstrate distinct enrichment with neurodegenerative disease associated pathways. Top 50 pathways are shown for fair comparison. The top-ranking pathways by BayesKAT are ranked by the estimated posterior probabilities of the pathway-level associations. The top-ranking pathways by the frequentist methods, mKU and SKAT(Avg) are ranked by the -log10(p-values). The pathways highlighted in red are neurodegenerative disease related pathways. The red horizontal dashed line in each bar plot indicates the threshold used by each model for fair comparison (see Materials and Methods). The pie charts illustrate the proportion of the selected pathways (above model’s thresholds) that belongs to the neurodegenerative disease pathways. BayesKAT notably exhibits enhanced prioritization of neurodegenerative disease pathways. Due to the issue of inflated p-values in mKU, pathways with p-values of 0 are assigned -log10(p-values)=30 for visualizations. 27 Figure 2.5 (cont’d) 2.4.3 Biological pathways linked to neurodegenerative diseases are top-ranked by BayesKAT To identify biological pathways that potentially modulate the whole brain volume trait, BayesKAT is used to analyze pathway-level SNP groups (see Materials and Methods, Figure 2.5A). The top 50 ranked KEGG pathways by each model are summarized in Figure 2.5B, where BayesKAT ranks the pathways based on decreasing posterior probability of association (𝑃(𝐻1|𝐷𝑎𝑡𝑎)) while the mKU and SKAT(Avg) methods rank the pathways based on decreasing -log10(p-value). Interestingly, BayesKAT successfully prioritized most of the neurodegenerative disease related pathways with top ranks (Figure 2.5B). This is a strong mechanistic support to BayesKAT’s results, because the neu- rodegenerative diseases, including Alzheimer’s disease, Huntington’s disease, Amyotrophic lateral sclerosis and Parkinson’s disease, have been found to be strongly related to brain volume loss [52], [53], [54], [55]. Based on a reasonable posterior probability threshold 0.7, there are 21 pathways identified by BayesKAT (Figure 2.5B), Supplementary Table E.2), among which five pathways 28 are associated with neurodegenerative diseases. In comparison, the mKU and SKAT(Avg) models prioritized large numbers of pathways (242 and 72 respectively), while only a small fraction of them are neurodegenerative disease related pathways (6 and 5, respectively). For SKAT(Avg), these functionally related pathways are not even the top-ranked ones. As summarized in the cor- responding pie charts in Figure 2.5B, BayesKAT achieves the highest efficiency in prioritizing important pathways and resulting in fewer false discoveries than the other methods. These results are also consistent with the larger type 1 errors of mKU and SKAT(Avg) observed from simulation analyses described above. Note that, setting the posterior probability threshold is subjective, same as selecting type 1 error threshold or FDR threshold (0.05 or 0.01 or 0.1). Opting for a threshold greater than 0.7 (such as 0.9, 0.95, or 0.99) is also effective, as BayesKAT efficiently prioritizes the top pathways. Overall, the highly prioritized functional relevant pathways imply the novel biological insights that can be generated by using BayesKAT. To further evaluate the performance of BayesKAT in determining the optimal kernel weights, 10 randomly chosen pathways are used to test the pathway-level associations. The resulting composite kernels are compared to the results of using SKAT based on individual kernels separately. The corresponding -log10(p-values) metrics from SKAT using individual kernels are compared to the inferred kernel weights in the composite kernels from BayesKAT. As shown in Figure 2.6A, the high similarity between the two heatmaps indicates that BayesKAT can efficiently select the optimal composite kernel automatically from the data, without relying on prior knowledge or repetitively trying different individual kernels. 2.4.4 BayesKAT identifies trait-associated gene modules and protein complexes To further demonstrate BayesKAT’s capability of revealing novel group-level associations to traits from specific sets of cooperative SNPs, two additional SNP grouping strategies are applied: 1) SNPs from co-expression gene modules; and 2) SNPs from protein-protein interaction (PPI) modules (see Materials and Methods). Applied on the SNP groups aggregated from co-expression gene modules, BayesKAT is able to pinpoint specific modules as significantly associated with the whole brain volume trait (Figure 2.6B Left). On the other hand, SKAT(Avg) and mKU identify 29 a large number of modules (Figure 2.6B Right), which are consistent to the inflated type 1 errors of these two methods as observed previously, and failed to provide specific prioritizations of the gene modules. Remarkably, by comparing to the significant GWAS SNPs identified from another genome-wide meta-analysis of brain volume study [56], two out of the four selected modules by BayesKAT (50%) contain significant GWAS SNPs (Figure 2.6C). In contrast, only 26% (7/27) and 43% (6/14) modules selected by mKU and SKAT(Avg) contain significant GWAS SNPs. These results further provide orthogonal support for the superior performance of BayesKAT in identifying new collective associations to the complex traits for SNP groups of functionally related genes. 30 Figure 2.6 Boosted association tests based on BayesKAT’s composite kernels reveal novel modules of genes and proteins linked to brain volume. (A) The inferred weights of individual kernels in BayesKAT’s composite kernels (right) recapitulate the strength of each kernel (-log10(p-values)) when each kernel is incorporated separately (left). Without relying on prior knowledge or repetitively testing different kernels separately, BayesKAT automatically infers the appropriate composite kernels to boost the group-level tests for different pathways. (B) Prioritized co-expression gene modules by BayesKAT (left) vs. SKAT(Avg) and mKU (right). The significance threshold of selection for each model is represented by the horizontal red dashed lines (see Materials and Methods). (C) The selected significant co-expression gene modules by BayesKAT demonstrate higher fractions of overlapping with significant SNPs from orthogonal GWAS studies, compared to the results from SKAT(Avg) and mKU. (D) The selected significant PPI modules by BayesKAT demonstrate higher fractions of overlapping with significant SNPs from orthogonal GWAS studies, compared to the results from SKAT(Avg) and mKU. (E) Prioritized PPI modules by BayesKAT (left) vs. SKAT(Avg) and mKU (right). The significance threshold of selection for each model is represented by the horizontal red dashed lines (see Materials and Methods) 31 Figure 2.6 (cont’d) When applied to SNP groups organized according to PPI modules that largely represent protein complexes, BayesKAT distinctly prioritizes eight PPI modules as significant, six of which contain significant GWAS SNPs (75%, Figure 2.6D, Figure 2.6E Left). On the contrary, only 30%(17/56) and 65%(13/20) of the modules selected by mKU and SKAT(Avg) respectively contain significant GWAS SNPs (Figure 2.6D, Figure 2.6E Right). Taken together, the highly specific prioritizations of potential gene modules and protein complexes, along with the substantially improved justification from other GWAS SNPs, suggest that BayesKAT can facilitate novel discoveries of molecular components involved in complex traits and may pave the way for innovative approaches to disease treatments. 2.5 Discussion BayesKAT (https://github.com/wangjr03/BayesKAT) is a data adaptive methodology that auto- matically selects the appropriate composite kernel using the MCMC algorithm (BayesKAT-MCMC) or the optimization technique (BayesKAT-MAP) and conducts hypothesis testing on the presence of group-level genetic associations for complex traits. The Bayesian framework and the inferred posterior probabilities are more interpretable and informative, compared to p-values from fre- 32 quentist methods. Based on extensive benchmark analyses, BayesKAT demonstrates consistent superior performance than other methods across different settings. Moreover, evaluated on a series of biologically inspired SNP groups based on a real genetic dataset, BayesKAT not only achieves improved prioritization of functionally relevant and justified group-level SNP associations, but also enables novel discoveries with respect to the underlying molecular mechanisms of complex traits. By revealing the collective effects of functionally cooperative SNPs without relying on the prior knowledge of specific kernels, BayesKAT represents one important step forward towards the goal of deciphering the intricate genetic basis of human diseases. Although some methods based on the Gaussian process [57] or supervised learning technique [58] attempt to select the best kernel using training data for prediction purposes, BayesKAT is the first Bayesian KBT methodology that simultaneously selects the optimal composite kernel while testing for the associations, without requiring the training data. In addition, the data- adaptive strategy of composite kernel selections also facilitates the description of more complicated interdependence structures that can not be fully captured by individual kernels. Furthermore, BayesKAT provides the flexibility of incorporating multiple testing corrections, integrating prior biological knowledge, and modeling various data types. To complement the MCMC strategy, BayesKAT-MAP is highly scalable and can be efficiently implemented for large-scale genome- wide studies. BayesKAT utilizes the Metropolis-Hastings MCMC algorithm in combination with a derivative- free grid-search-based optimization approach to choose the composite kernel for specific datasets. Nevertheless, the BayesKAT framework is not restricted to these techniques. Other efficient MCMC sampling algorithms or reliable optimization techniques can be incorporated. A variety of sampling techniques have been reviewed and compared for different purposes of modeling [59], [60], [61], [62]. Integrating these techniques, especially the variational Bayes techniques, into the BayesKAT framework and systematically evaluating their performance for different types of applications will be an important step for future developments (More discussion on this topic is included in the supplementary files). 33 2.6 Data and code avilability BayesKAT is an open source infrastructure available in the GitHub repository https:// github.com/ wangjr03/BayesKAT. The repository includes R codes for BayesKAT, along with comprehensive instructions, sample testing data, and code for pre-processing real data. Please note that the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data used for this manuscript is not publicly available, and interested users can request for access through the official portal: https://adni.loni.usc. edu/data-samples/access-data/. Detailed information on data download and processing procedures can be found in the supplementary material. 2.7 Supplementary materials 2.7.1 Kernel functions A kernel function is defined as a function 𝐾 : X×X → R, where the kernel matrix 𝐾 = (𝑘𝑖,𝑖′)𝑛 𝑖,𝑖′=1 is symmetric and positive semidefinite with 𝑘𝑖,𝑖′ = 𝑘 (𝑍𝑖, 𝑍𝑖′). In this setting, 𝑘 (𝑍𝑖, 𝑍𝑖′) is a measure of similarity between the 𝑖th and the 𝑖′th subject. There are a variety of kernel functions to choose from, and the most widely used ones include the Linear kernel, the Quadratic kernel and the Gaussian kernel. For genetic SNP data, identity by state (IBS) kernel is a popular candidate kernel function suggested by various studies [19], [10], [12]. The functional forms of these kernels are summarized below: • Linear kernel: 𝐾 (𝑍𝑖, 𝑍𝑖′) = 𝑍𝑇 𝑖 𝑍𝑖′ • Quadratic kernel: 𝐾 (𝑍𝑖, 𝑍𝑖′) = (𝑍𝑇 𝑖 𝑍𝑖′ + 1)2 • Gaussian kernel: 𝐾 (𝑍𝑖, 𝑍𝑖′) = 𝑒𝑥 𝑝{−∥𝑍𝑖 − 𝑍𝑖′ ∥2/𝑙}, where ∥𝑍𝑖 − 𝑍𝑖′ ∥2 = (cid:205)𝑝 𝑗=1 (𝑍𝑖 𝑗 − 𝑍𝑖′ 𝑗 )2, 𝑙 is a tuning parameter. • IBS kernel: 𝐾 (𝑍𝑖, 𝑍𝑖′) = (2𝑝)−1 (cid:205)𝑝 𝑗=1 𝐼 𝐵𝑆(𝑍𝑖 𝑗 , 𝑍𝑖′ 𝑗 ) = (2𝑝)−1 (cid:205)𝑝 𝑗=1 (2 − |𝑍𝑖 𝑗 − 𝑍𝑖′ 𝑗 |) 2.7.2 Importance of choosing appropriate Kernel Although a variety of different kernels are available, for a given dataset, it is practically impossible to know a priori which kernel will fit the dataset best and maximize the testing power. 34 Genetic data related to complex phenotypes pose particular challenges, primarily stemming from our limited understanding of how the interplay among genetic or molecular features influences their collective association with a phenotype. Therefore, choosing a kernel randomly can lead to a less powerful testing procedure for genome-wide applications. For example, if the outcome variable 𝑌 is related to the features through a Quadratic function, using a Linear kernel in the model will lead to weak tests that are not able to reject the null hypothesis even when the association is strong. On the other hand, by repeatedly applying KBT models based on different candidate kernels and choosing the one resulting in the minimum p-value, there is a high chance of making a false discovery, i.e., rejecting the null hypothesis when there is no association. Combining a panel of candidate kernels together to create a composite kernel is thus a natural and effective strategy to overcome this issue. While a straightforward strategy of averaging kernels to form a composite kernel, i.e., a linear combination of candidate kernels with equal weights, can perform better than the worst-performing kernel function, it usually cannot perform as efficiently as the best kernel to accurately represent the association between the trait and features for a given dataset, thus, is not guaranteed to increase the statistical power. As shown in main Figure 1 (A)(B), evaluated on both real and synthetic datasets, the average kernel strategy can often lead to incorrect and inconsistent results in practice. Thus, a systematic data-adaptive approach of optimal kernel selection is highly desirable for high-dimensional genome-wide association tests, especially for complex human disease phenotypes that are genetically modulated by multiple inter-dependent genetic variants. 2.7.3 Comaparison between BayesKAT-MCMC and BayesKAT-MAP The performance and runtime of BayesKAT-MCMC and BayesKAT-MAP under different settings are systematically compared. Supplementary Table 2.1 summarizes the empirical type- 1 error and empirical power for different simulated functional dependencies. The simulations were based on the proposed settings used by previous studies [21]. As shown in Table 1, both MCMC and MAP strategies achieve nearly equal statistical power in detecting associations. While BayesKAT-MCMC provides more information on the posterior distributions of parameters, it is 35 more computationally expensive compared to BayesKAT-MAP. When the number of samples or features, i.e. 𝑛 or 𝑝, increases beyond 500, BayesKAT-MCMC is not sufficiently scalable without requesting more high-performance computing resources. On the other hand, Main Figure 1(D) shows the superior computational scalability of BayesKAT-MAP based on the same level of computational resource support. Therefore, these two alternative strategies provide similar accuracy and complementary signatures for genetic association tests of complex traits, with BayesKAT-MAP being more flexible and conservative. For the rest of the paper, results of BayesKAT-MAP are presented due to its scalability. The code for implementing BayesKAT-MCMC and BayesKAT- MAP are both made publicly available via GitHub: https://github.com/wangjr03/BayesKAT. Table 2.1 Empirical power of BayesKAT-MCMC and BayesKAT-MAP. BayesKAT-MCMC BayesKAT-MAP h(Z) ℎ(𝑍) = 0 ℎ(𝑍) = 2 × 𝑍1𝑍3 ℎ(𝑍) = 2 × 𝑍1𝑍3+ 0.04 × 𝑍𝑖 + 0.04 × 𝑍3 ℎ(𝑍) = 0.4 × (𝑍1 − 𝑍3)+ 0.4 × 𝑐𝑜𝑠(𝑍3)𝑒𝑥 𝑝(−𝑍 2 3/5) n 500 500 p 500 500 0.018 0.946 500 500 0.948 500 500 0.976 0 0.946 0.958 0.998 The functional forms in simulation scenarios are taken from [21] and coefficients are adjusted based on 𝑛 (sample size) and 𝑝 (no. of features) 2.7.4 Multiple testing correction When 𝑚1 multiple groups are simultaneously tested, multiplicity corrections are needed. Mul- tiple testing corrections on p-values from frequentist models are carried out using methods such as the Bonferroni correction [63], which controls the family-wise type 1 error to 𝛼 by setting the individual test’s type 1 error at 𝛼/𝑚1. Other methods like FDR control have also been popular ones. For multiple testing corrections on Bayesian models, the multiplicity control is achieved by setting a high prior probability of the individual null hypothesis, as suggested by previous studies [64] [65]. For each test, the prior probability is set as 𝑃(𝐻0) = 0.99, which is equivalent to assuming that, on average, one in 100 tests is believed to have a true association. Considering the number 36 of SNP groups in genetic applications, such as the number of biological pathways or co-expression modules, this choice of prior probability setting is regarded as rather conservative and ensures fair performance comparisons. 2.7.5 ADNI Data Handling and Pre-processing steps The Alzheimer’s Disease Neuroimaging Initiative(ADNI) data referenced in this manuscript is not publicly accessible. Interested users can request access through the official portal at https://adni .loni.usc.edu/data-samples/access-data/. Once granted access, users can navigate to the "down- loads" section and download the genotype data. For this study, "ADNI 1 SNP genotype data - PLINK" file containing the .bed, .fam and .bim files was downloaded. Initially, these files contain information on 757 individuals and 620901 SNPs. The standard quality control steps are per- formed using Plink[41] software, which resulted in the removal of SNPs and individuals failing standard missingness thresholds, Minor Allele Frequency (MAF), and Hardy-Weinberg Equilib- rium (HWE) criteria. Subsequent to quality control, a simple imputation technique addresses the remaining missing values, resulting in a final genotype matrix featuring 531086 SNPs and 756 individuals meeting the established quality criteria. Demographic and phenotypic information for these individuals is retrieved using the publicly available ADNIMERGE package (detailed descrip- tion: https://adni.bitbucket.io/). Covariates, specifically Age, gender, and education levels, which exhibit significant linear relationships with the phenotype of interest (whole brain volume), are integrated into the model. By aligning individual IDs across response, genotype, and covariate data, a final dataset comprising 755 individual-level data points is obtained. The GitHub repos- itory https://github.com/wangjr03/BayesKAT contains the specific codes for preprocessing real individual-level genotype datasets. 2.7.6 Further Discussion In recent years, different deep learning models have been developed and applied in genomics studies to predict molecular features, such as gene expression, histone marks, chromatin accessibil- ity and transcription factor binding, using DNA sequences as features [66], [67], [68], [69]. These models allow for in silico mutations of DNA sequences and the prediction of perturbed molecular 37 features for each mutation individually. Compared to these models, the power of BayesKAT in delineating the collective group-level genetic associations based on automatic composite kernel se- lection opens up a new level of analytical capability of dissecting the genetic complexity. Combined together, the complementary advantages of BayesKAT and deep learning models are expected to facilitate novel mechanistic insights into human diseases. The implementation of BayesKAT is not limited to genetic studies based on features of SNPs or gene expressions. Any group of continuous or discrete features that are functionally related can be tested for associations with an outcome using the BayesKAT methodology. For instance, BayeskAT can be used to test if a group of images is associated with a particular disease trait by adopting properly defined kernels. As another direction for future developments, non-linear functions of candidate kernels, instead of the linear combinations, will be explored as the composite kernel, which may lead to improved power for kernel-based testing. 38 CHAPTER 3 SPATIAL TRANSCRIPTOMICS - SVG DETECTION TECHNIQUES AND SCOPES FOR IMPROVEMENT A modified version of part of this chapter was previously published (Das Adhikari,Sikta et al, 2024): Das Adhikari S., Yang J., Wang J. and Cui, Y. (2024) Recent advances in spatially variable gene detection in spatial transcriptomic. https://doi.org/10.1016/j.csbj.2024.01.016 3.1 Introduction: Analysis of spatial transcriptomic data Recent advancements in Spatially-resolved transcriptomics (SRT) technology have provided comprehensive gene expression data for thousands of genes across multiple samples or spatial spots, accompanied by their respective spatial coordinates across a tissue which refers to a collection of cells that are organized in a specific manner and perform a particular function or set of functions within an organism. It is a complex and dynamic landscape where the spatial arrangement of cells is integral to understanding gene expression patterns and their implications for health and disease. Depending on the specific technology utilized, a sample could represent a single cell (as in the case of STARmap technology), a cell-sized local region (as with HDST technology[70]), or a localized region comprising dozens of cells (as seen in Slide-seq[71, 72] and Visium technologies). The latest SRT platforms, such as 10x Genomics Visium and Slide-seqV2, encompass thousands of spatial locations within each tissue sample, with future developments poised to achieve even higher resolutions. As technology progresses, the demand for more robust statistical frameworks to effectively analyze spatial data intensifies. Although spatial transcriptomic (ST) data permit addressing a range of distinct questions, a fundamental initial step in the downstream analysis of spatial data is the identification of spatially variable genes (SVGs). These are genes that exhibit variations in expression levels either across the entire tissue or within predefined spatial domains. These genes can potentially unveil tissue heterogeneity and the underlying structural factors that drive distinct expression patterns across spatial locations, thus offering valuable insights into biology. Numerous methods have been developed for the identification of SVGs. These methods en- 39 compass a spectrum of approaches, including the utilization of standard spatial statistics measures like Moran’s I statistic[73] and Geary’s C statistic[74] to rank genes based on their spatial au- tocorrelation. More advanced methods employ model-based approaches such as SpatialDE[75], SpatialDE2[76], SPARK and its extensions[77], nnSVG[78], BOOST-GP[79], marked point pro- cess frameworks like Trendsceek[80] and scGCO[81], or model-free frameworks like sepal[82] and GLISS[83]. Additionally, there are toolboxes, such as MERINGUE[84], Giotto[85], Seurat[86], Squidpy[87] that integrate some of these methods into comprehensive end-to-end analysis frame- works. Downstream analysis involving SVGs encompasses various tasks, such as spatial clustering, deciphering spatial domains, and identifying spatial domain-specific SVGs. Additionally, there are numerous other downstream analyses that leverage additional information like scRNASeq data, histological images, and more, for tasks such as spatial decomposition of spots, gene imputation, the inference of cell-cell and gene-gene interactions and spatial location reconstruction for scRNA-seq data. However, this review till section 3.7 primarily concentrates on SVG detection frameworks and does not delve into the details of other downstream analyses in this section. Therefore, the primary focus of this chapter is to discuss selected frameworks for SVG identifi- cation. This serves as a literature review aimed at providing a comprehensive overview of the field of spatial transcriptomic studies. The goal is to become acquainted with existing SVG identification frameworks, including their unique characteristics, novelty, as well as their pros and cons. 3.2 An overview of SVG detection techniques Generally, in a spatial transcriptomics setup, the available spatial dataset contains gene expres- sion measures/counts for 𝑚 genes distributed across 𝑁 known spatial coordinates or spots. This section establishes the key symbols that will be frequently utilized. Specifically, 𝑦 = (𝑦1, 𝑦2, ..., 𝑦𝑁 ) is defined as the gene expression profiles/counts for a given gene across spatial coordinates (referred to as samples or spots), denoted by 𝑠 = (𝑠1, ..., 𝑠𝑁 ). The coordinates of the spatial locations are typ- ically two-dimensional, i.e., 𝑠𝑖 = (𝑠𝑖1, 𝑠𝑖2), but any dimensional coordinates can be employed. The primary objective of these SVG detection models is to ascertain which genes, out of the 𝑚 genes, 40 are spatially variable across the tissue. In other words, the main goal is to determine whether the gene expression measure 𝑦 depends on or relates to the spatial locations where the gene expression measures are collected. Here, we classify SVG detection methods based on two primary categories: (1) based on input data type and (2) based on the computational framework. The initial categorization focuses on input data type, representing the foundational step in SVG detection. Therefore, we first discuss the input data pre-processing step in Section 3.2.1. Subsequently, Sections 3.2.2 and 3.2.3 delve into the detailed exploration of model-based and model-free approaches, respectively, aligning with the later categorization based on the computational framework. Table 1 is then presented in this sequential order to reflect the dual categorization process. 3.2.1 Gene expression data and pre-processing step The gene expression measure 𝑦 are generally of count data type (originated from sequence based or image based technology). Various SVG detection models have been developed to specifically use count data as input following some mandatory filtering and quality control steps. Some examples of these models include SPARK-X[88] ,BOOST-GP[79], SINFONIA[89], and GPcounts[90]. The gene expression count data often exhibit over-dispersion and contain numerous zero values, mainly due to the technology employed for data generation or simply because many genes are poorly expressed for biological reasons. These particular issues in count data are generally taken care of by using negative binomial models which handle over-dispersion well. For the issue of zero-inflation, Zhao et al, 2022 [91] showed that modeling zero inflation is not necessary in spatial transcriptomics, thus is not a concern in many method development. On the other hand, some methods, for example SpatialDE[75], nnSVG[78], and BOOST-MI[92], use normalized gene expression data as input in the framework for easy implementation, where in most of cases, the data is modeled using multivariate normal distribution after transformation. Authors in SPARK[77] proposed two different data models, SPARK and SPARK-G which uses count data and normalized data, respectively. The data normalization method is not unique for these methods. The normalization step generally removes the bias due to differences in sequencing depth using size factors and 41 normalizes the data using log transformation(log10 or log2 transformations after adding a pseudo- count value 𝑐, preferably 1). The method sepal[82] uses a slightly different normalization procedure which involves mapping the log-transformed values to the interval [0,1] and using a pseudocount 2. Other normalization methods, such as scran, scuttle, and scater R/Bioconductor packages[93, 94], can also be applied. Table 3.1 provides information on some selective methods together with their required input data type and the implemented model: Table 3.1 A selective list of methods for SVG detection in ST data analysis categorized based on required input data type and the implemented computational framework. Input data type Computational framework Method SpatialDE2[76] SPARK[77] BOOST-GP[79] CTSV[95] GPcounts[90] SPARK-X[88] SINFONIA[89] HEARTSVG[96] Count Count Count Count Count Count Count Count SpatialDE[75] Normalized SPARK-G[77] Normalized nnSVG[78] Normalized SOMDE[97] Normalized BOOST-MI[92] Normalized Trendsceek[80] Normalized scGCO[81] sepal[82] GLISS[83] Normalized Normalized Normalized MULTILAYER[98] Normalized BSP[99] Normalized model-based model-based model-based model-based Data model Poisson Overdispersed poisson Zero-inflated negative binomial Zero-inflated negative binomial model-based Negative binomial - - - Multivariate Normal Multivariate Normal Multivariate Normal Multivariate Normal Modified Ising model Marked point process Marked point process - - - - model-free Model-free Model-free model-based model-based model-based model-based model-based model-based model-based Model-free Model-free Model-free Model-free 42 3.2.2 Overview of model-based frameworks 3.2.2.1 Gaussian process(GP) regression based and similar models The majority of the methods, including some of the state-of-the-art algorithms to detect SVG, are based on Gaussian process (GP) regression models. For example, one of the first published SVG detection methods, SpatialDE[75], models the normalized gene expression 𝑦 for a given gene using the following multivariate normal model: 𝑝(𝑦|𝜇, 𝜎2 𝑠 , 𝛿, 𝐾) ∼ 𝑁 (𝑦|𝜇1, 𝜎2 𝑠 𝐾 + 𝛿𝐼), (3.1) where the covariance term is decomposed into a spatial and a non-spatial part, where 𝛿𝐼 represents the non-spatial part and 𝜎2 𝑠 𝐾 is the spatial covariance matrix, whose (𝑖, 𝑗)𝑡ℎ element in the kernel matrix 𝐾 denotes the spatial similarity between the 𝑖𝑡ℎ and 𝑗 𝑡ℎ spot calculated based on the corresponding coordinates 𝑠𝑖 and 𝑠 𝑗 . The choice of the kernel function plays a very important role in detecting the spatial correlation present in the gene expressions. More discussion about kernel function is provided in the next subsection. Other methods like SPARK-G [77] (the Gaussian version of SPARK), nnSVG [78], and SOMDE [97] implement similar GP models for modeling normalized gene expression data with some extra features or added level of complexity. SPARK-G and nnSVG provide the option to include extra covariate terms in the model. The covariates or the explanatory variables could contain batch information, cell-cycle information, or other information that is important to adjust for during the analysis. SOMDE is a two-step procedure. This approach involves first utilizing a self-organizing map to cluster neighboring cells into nodes. Subsequently, it employs a Gaussian process to model and analyze the spatial gene expression patterns at the node level. Table 3.1 shows that methods like SPARK[77], SpatialDE2[76], BOOST-GP[79], CTSV[95], and GPcounts[90] model count data directly. SPARK models the count data using an overdispersed poisson model where the logarithm of the unknown Poisson rate parameter is assumed to follow a stationary Gaussian process with similar spatial and non-spatial covariance components. BOOST- GP presents a novel Bayesian hierarchical model to analyze spatial transcriptomic data, which 43 models the count data using a zero-inflated negative binomial(ZINB) model. The logarithm of the normalized expression level, which is included in the expectation term in NB, can be seen as a GP with a spatial covariance term representing the spatial variability in case there is a spatial pattern. GPcounts also uses negative binomial distribution to model the UMI(Unique Molecular Identifier) data. SpatialDE2 employs a Generalized Linear Mixed Model (GLMM) for count data modeling. In contrast to GP-based techniques that typically separate covariance into a spatial and a non-spatial component, SpatialDE2 dissects the covariance into several spatial components along with a non-spatial random component. CTSV implements a slightly different technique and does not use the GP model. In CTSV, the gene specific, spot specific and cell-type specific relative mean expression level in the ZINB model is a linear combination of ℎ1(𝑠.1) and ℎ2(𝑠.2) where the functions ℎ1(·) and ℎ2(·) represents the underlying true spatial effect modeled with the kernel function in GP model. 3.2.2.2 Statistical inference and selecting kernel function in GP-based frameworks Typically, when evaluating the existence of spatial patterns within the data, an assessment is made by testing the alternative hypothesis, which suggests the presence of a spatial covariance term in the model, against the null hypothesis, where the spatial covariance term is set to zero, indicating the absence of spatial variability. This comparison between the model fitted under the alternative hypothesis and the null model forms the basis of a significance testing procedure. This often involves conducting significance tests and drawing conclusions based on p-values in frequentist approaches. For example, in model (3.1), testing SVG is equivalent to testing 𝐻0 : 𝜎2 𝑠 = 0. As previously mentioned, selecting the appropriate kernel function for computing the spatial covariance matrix is a critically important step in identifying spatial patterns within the data. Ideally, the kernel function should accurately capture the true underlying relationship between the 𝑦 values and the spatial coordinates 𝑠. In practice, the actual underlying function remains unknown, and the closer the chosen kernel function approximates the true functional form, the more precise the model specification becomes, rendering the test more robust and powerful. SpatialDE employs a squared exponential covariance function(a.k.a Gaussian kernel function 44 or radial basis kernel function) to compute the spatial covariance matrix: 𝐾𝑖, 𝑗 = 𝑘 (𝑠𝑖, 𝑠 𝑗 ) = 𝑒𝑥 𝑝(− |𝑠𝑖−𝑠 𝑗 |2 2𝑙2 ) The hyperparameter 𝑙, recognized as the characteristic length scale or bandwidth, determines how rapidly the covariance decays as a function of distance and is typically chosen by grid search. SOMDE also uses the squared exponential (Gaussian) kernel in their model with ten different length scales and chooses the one that achieves the highest log-likelihood ratio value. GPcounts uses linear or periodic kernel based on BIC values. SPARK asserts that relying on a single kernel restricts the ability to robustly identify spatially variable genes across diverse spatial patterns. Therefore, SPARK (and SPARK-G) adopts an approach involving a total of ten distinct spatial kernels. These comprise five periodic kernels (e.g., Cosine kernels) with varying periodicity parameters and five Gaussian kernels with different smoothness parameters. SPARK proceeds to compute ten p- values, each derived from a different test employing these various kernel functions. These p-values are subsequently combined using the Cauchy combination rule [100, 101]. Similar to SPARK, SpatialDE2 incorporates a variety of pre-defined kernels with varying structures and length scales. It also offers the flexibility to conduct an omnibus test as an alternative to independently testing each kernel and subsequently merging the p-values. nnSVG posits that genes can potentially display a vast spectrum of spatial patterns, and using the same set of kernel functions for all of the genes might lead to less powerful tests. They consider the use of an exponential covariance function as a kernel function where the length scale parameter of the kernel function is fitted for each gene, which allows capturing the unique spatial variability pattern of the gene. CTSV uses five different sets of functional forms for ℎ1(𝑠.1) and ℎ2(𝑠.2), which includes linear functions, squared exponential functions, and periodic functions with different sets of scaling parameters and the five p-values calculated from five different forms are combined using the Cauchy combination method. Although different models discussed here have some similarities in testing procedures, the model fitting techniques implemented and the testing procedures utilized are different and are summarized in Table 3.2. 45 Table 3.2 List of some popular SVG detection methods with model-fitting and testing information. Method SpatialDE Bayesian/ Frequentist Frequentist Model fitting and parameter estimation Maximizing marginal log likelihood SpatialDE2 Frequentist Only null model parameters SPARK Frequentist needs to be estimated by BLUP Approximate-inference algorithm based on the PQL approach Hypothesis testing method Likelihood ratio test Score test based on Zhang and Lin[102] Satterthwaite method on the basis of score statistics SPARK-G Frequentist Maximum likelihood Score test nnSVG Frequentist Fast optimization algorithms for NNGP models (BRISC R package) Likelihood ratio test SOMDE Frequentist Gradient optimization Likelihood ratio test CTSV Frequentist Approx. maximum likelihood using conjugate gradient(CG) algorithm Wald tests (R package pscl) GPcounts Frequentist Optimization of log marginal Likelihood ratio test BOOST-GP Bayesian likelihood by variational approximation Sampling from posterior using MCMC Bayes Factor or posterior probabilities of inclusion (PPI) The statistical power of GP-based methods hinges on the selection of kernel functions[77], which can complicate the model selection and limit SVG detection power. To address this challenge, the authors in [92] introduced BOOST-MI. This novel approach employs Bayesian modeling of spatial transcriptomics data via a modified Ising model to identify SV genes. As an initial step, BOOST-MI takes normalized gene expression data as input and dichotomizes the normalized expression levels into a binary spatial pattern. Subsequently, BOOST-MI proceeds to identify a wide spectrum of spatial patterns displayed by the genes by inferring the Ising model interaction parameter within a Bayesian framework. It achieves this by generating samples from the posterior distribution of the parameters through a double Metropolis-Hastings (DMH) algorithm[103]. Subsequently, it 46 computes the Bayes factor based on these posterior samples, which are then used for selecting SV genes. Trendsceek[80], one of the earliest published SVG detection methods, models data as marked point processes, where they assign points to represent the spatial locations of spots and marks on each point to represent expression levels. The pivotal objective of Trendsceek revolves around evaluating the dependency between the spatial distribution of points and their respective marks through pairwise analyses as a function of the inter-point distances. The underlying premise is that if there exists no dependency between marks and point locations, the resulting scores should remain constant across various distances. A resampling procedure is executed to gauge the significance of a gene’s spatial expression pattern, involving permutations of expression values that create a null model with no spatial expression dependency. Similar to Trendsceek, ScGCO[81](single-cell graph cuts optimization) method also models gene expression data as a marked point process where points represent the spatial locations of measured cells or spots, and marks are discrete gene expression states (such as, down-regulated or up-regulated) associated with points. It analyzes the dependency of points with a specific mark on spatial locations using a hypothesis test. Under the null hypothesis (i.e., no spatial dependency), it assumes that points with a specific mark in a 2D space are distributed in a completely random fashion and can be described by a homogeneous spatial Poisson process. Genes with spatial regions whose number of cells/spots of specific marks are associated with statistically significant low probabilities under the null model are selected as SVG. 3.2.3 Overview of model-free frameworks There are other SVG detection methods such as SPARK-X[88], sepal[82], GLISS[83], and SINFONIA[89] which do not attempt to model the data generation process or rely on distributional assumptions. Instead, they use model-free techniques to detect SVGs. The authors introduced sepal[82] (Spatial Expression Pattern Locator), an innovative method that leverages transcript diffusion simulations to identify genes exhibiting spatial patterns. It simulates transcript diffusion within the spatial domain and measures the time required for convergence. The core idea is that 47 transcripts with random spatial distributions will converge more quickly or reach a homogeneous state faster compared to those with distinct spatial patterns. Consequently, the diffusion time serves as an indicator of a gene’s degree of spatial variability. Genes with longer diffusion times exhibit less spatial randomness. Therefore, ranking genes based on this indicator and selecting the top-ranked genes as SVGs is a logical approach. SINFONIA[89] offers a scalable approach to initially identify spatially variable genes through ensemble strategies as part of its spatial transcriptomic data analysis, with the ultimate goal of deciphering spatial domains. SINFONIA initially identifies the 𝑘 nearest neighbors in Euclidean space for each spot and builds a Spatial Neighbor Graph (SNG) using the weight matrix where the (𝑖, 𝑗)th element is determined by a function of the distance between the 𝑖th and 𝑗th spot. Next, SINFONIA calculates Moran’s I and Geary’s C statistics for each gene based on the weight matrix 𝑊 to assess spatial autocorrelation. The underlying concept is that genes with more pronounced spatial autocorrelation exhibit more organized spatial expression patterns. HEARTSVG[96] utilizes a unique, distribution-free, test-based approach that focuses on iden- tifying non-SVGs first and then infers the presence of SVGs using this information. The process involves assessing serial autocorrelations within the marginal expressions across the global spatial context to pinpoint non-SVGs. This, in turn, enables the automatic recognition of all other genes as SVGs, regardless of their spatial patterns. HEARTSVG asserts its superiority in terms of ro- bustness and computational efficiency by abstaining from assumptions about specific underlying spatial patterns for these variable genes. SPARK-X[88] is a nonparametric method grounded in the following insight: if 𝑦 is independent of 𝑠, then the spatial distance between two locations 𝑖 and 𝑗 would also be unrelated to the gene- expression difference between those two locations. SPARK-X constructs two 𝑁 × 𝑁 projection covariance matrices: (1) The expression covariance matrix based on gene expression levels; and (2) the distance covariance matrix based on all spatial locations. It employs a test statistic derived from the product of these two covariance matrices to evaluate the independence between the gene expression (𝑦) and the spatial coordinates (𝑠). In simpler terms, if gene expressions are indeed 48 independent of spatial coordinates, the product of these covariance matrices will yield a small value. Conversely, if gene expressions are dependent on the spatial coordinates, the product of the matrices will yield a large value. Similar to the kernel matrix used in methods like SpatialDE or SPARK, the statistical power of the SPARK-X test inevitably hinges on how the distance covariance matrix is constructed and how well it aligns with the true underlying spatial patterns exhibited by the gene of interest. To ensure robust identification of spatially varying genes across diverse spatial expression patterns, SPARK-X explores various transformations of the spatial coordinates (𝑠) and subsequently generates distinct distance covariance matrices. Specifically, the algorithm applies five Gaussian transformations with varying smoothness parameters and five cosine transformations to the spatial coordinates (𝑠). This process results in the creation of eleven distinct p-values, corresponding to the ten transformed distance covariance matrices and the original one constructed using the original coordinates. These individual p-values are then combined using the Cauchy combination method. MULTILAYER[98] treats spatially transcriptomics data as a raster image and uses digital image strategies to resolve tissue substructures. The basic unit in MULTILAYER is the "gexel", gene expression element analogous to a pixel in a digital image. The gene expression levels per gexel relative to the average gene expression are computed within the tissue. Genes are considered upregulated or downregulated when their normalized read counts per gene are above or below the average behavior, respectively. Differentially expressed genes are ranked based on the number of related gexels, providing a rapid view of genes that are overrepresented on the digital map based on their relative expression. GLISS[83] (Graph Laplacian-based Integrative Single-cell Spatial Analysis) utilizes a graph- based feature learning framework to detect and discover SVGs and recover cell locations in scRNA- seq data by leveraging spatial transcriptomic and scRNA-seq data. The workflow involves multiple steps. First, SV genes are identified from ST data using graph-based feature selection. Next, it determines the cells of interest in the scRNA-seq data based on unsupervised learning methods and leverage these selected SVGs to discover new SVGs in scRNA-seq data. The final goal of this workflow is to cluster genes based on their spatial patterns. 49 The BSP (Big-Small Patch)[99] method, introduced in a recent publication, utilizes a non- parametric model for the identification of spatially variable genes in 2D or 3D spatial transcriptomics data. The approach involves taking normalized spatial transcriptomics data as input. It defines big and small patches for each spatial spot based on neighboring spots with larger or smaller radii, respectively. The method then calculates local means of gene expression for both big and small patches. Following this, it calculates the ratio between the variances of local means for each gene, approximating a log-normal distribution for the distribution of these ratios. Subsequently, a p-value is determined for each gene based on this approximated distribution. 3.3 Statistical Inference with Multiple Testing Control We have previously discussed both model-based and model-free methods for detecting SVGs. The mathematical models employed for capturing the data generation process and the innovative model-free SVG detection technique have proven valuable for uncovering significant SVGs that offer critical biological insights. However, from a statistical perspective, concerns arise regarding the potential for false discoveries of genes that lack genuine spatial variability. This concern becomes more pronounced when a large number of genes are simultaneously tested across most frameworks. If the false discovery rate or type 1 error is not adequately controlled, it may lead to incorrect conclusions and the selection of numerous genes that exhibit false spatial variability. Various methods have been developed for multiplicity correction (MC) to address this concern. Some methods analytically constrain the false discovery rate (FDR) to remain below a predetermined threshold, while others do not analytically control the FDR and simply select a user-specified number of top genes as SVGs. Researchers may choose a method that aligns better with their research goals and the type of downstream analysis they intend to perform. In Table 3.3, we present an overview of these methods, organized around these critical questions. The permutation-based method is usually considered as the golden standard method as it is purely data-driven and distribution free. However, it is the least scalable one since it is computationally more demanding. The FDR-based methods have been the commonly applied ones since they offer type I error control while maintaining high power compared to the Bonferroni method. Nevertheless, depending on the downstream analysis 50 goal, it is not necessary to strictly enforce the MC rule. For example, when the goal is to find the low dimensional embedding of genes, such as in spatial PCA analysis [104], people usually choose top ranked genes for further analysis. In such cases, strictly enforcing MC is not needed. 3.4 Exploring Performance, Advantages, and Limitations In the preceding sections, we have explored the complexities associated with spatial count data. In many instances, these count data are characterized by sparsity and overdispersion. Section 2 of this review classifies modeling frameworks based on whether they directly model the count data or opt for modeling the normalized data. Some literatures [88, 79] argue against modeling normalized data with a Gaussian distribution due to concerns that such a parametric approximation may result in reduced statistical power and difficulties in controlling type 1 errors, especially when dealing with small p-values. On the other hand, methods that employ normalized count data, such as SpatialDE, SPARK- G, and nnSVG, offer advantages, including simpler model structures and reduced computational challenges. Notably, SPARK employs a dual modeling approach, encompassing both an overdis- persed Poisson model (SPARK) and a Gaussian model (SPARK-G) for count data analysis. They declare that SPARK-G exhibits significantly improved computational efficiency compared to the Poisson version of SPARK. Moreover, SPARK-G may demonstrate greater resilience to model misspecification, potentially enhancing its effectiveness in specific data applications. Although many researchers prefer to model count data directly, there is no consensus on the preferred approach for directly modeling count data either. While some opt for Poisson distribution models, others argue that it may be insufficient to address issues of overdispersion, suggesting that a negative binomial distribution is more suitable in such cases. Furthermore, when data exhibit extreme sparsity, the utilization of a zero-inflated Poisson or negative binomial model may be more logical, although it tends to introduce greater complexity into the model. But we need to note that direct modeling of sparse count data with a negative binomial distribution or other over-dispersed Poisson distributions incurs algorithm stability issues [88, 108, 90]. With the continuous evolution of spatial transcriptomic technologies, researchers now have 51 access to increasingly vast and high-resolution spatial datasets. Analyzing these extensive datasets demands the use of efficient and scalable methods for downstream analysis. Notably, approaches like Trendsceek and BOOST-GP impose substantial computational demands. In a study referenced from SRTsim[109], it was observed that when applying these methods to synthetic data, Trendsceek (v.1.0.0) required approximately 10 hours, while BOOST-GP needed about 8 hours to analyze a single synthetic dataset containing 1000 genes and 673 locations. In the same research context, SOMDE (v.0.1.8) struggled, failing to process nearly 90 percent of the genes and yielding NA values. Another comprehensive comparison, outlined in a review paper[110], assessed the performance of various SVG detection methods. The evaluation considered computational time and memory usage across 20 diverse spatial datasets, each varying in the number of spots or samples. Among the methods examined, including SpatialDE, SPARK-X, nnSVG, SOMDE, Giotto KM, and Giotto Rank (both are implemented in the Giotto package), SPARK-X emerged as the swiftest, with SOMDE following as the second-best option, albeit notably slower than SPARK-X. SpatialDE exhibited poorer performance in larger datasets, while nnSVG proved faster than SpatialDE for larger datasets but relatively slower for datasets with fewer spatial locations. In particular, SPARK- X [88] scales linearly with the number of spatial locations, while other methods scale cubically (e.g., SpatialDE) or quadratically (SpatialDE2, SPARK). In terms of peak memory usage, study [110] revealed that SOMDE consumed the least memory, with SPARK-X ranking second. Conversely, SpatialDE demonstrated high peak memory consump- tion. Considering the trade-off between speed and memory usage, SPARK-X and SOMDE emerged as the two most efficient methods, as determined by the experiment. Furthermore, the evaluation included other methods such as Giotto KM, Giotto Rank, and Moran’s I , but none of these alternatives matched the efficiency of SPARK-X or SOMDE based on the experimental findings. In summary, each modeling framework comes with its own set of pros and cons, necessitat- ing careful consideration of the trade-off between computational efficiency/cost and performance when selecting the most suitable approach for analyzing spatial count data. The model-free or 52 nonparametric approaches do not try to capture the data generation process and offer alternative frameworks to detect SVG. Most of the method frameworks are very intuitive but each comes with its own sets of restrictions or assumptions. For example, Trendsceek is a resampling-based method, which incurs a substantial computational load, rendering its application impractical for extensive ST datasets. SPARK-X exhibits impressive performance for high dimensional data, but the authors recommend using it with large sample (e.g., spot) size, say 3,000 or more. 3.5 Assessing Input Data and Model Outputs For the various methodologies we reivewed so far, some of these approaches primarily focus on identifying genes that exhibit spatial variability across the entire tissue, exemplified by methods like SpatialDE and SPARK. In contrast, others are additionally equipped to detect genes with spatial variability within predefined spatial domains, as seen in nnSVG. Other methods like SpaGCN[111] and STAMarker[112] are designed to identify Spatial domains and detect SVGs within spatial domains. Additionally, certain methods aim to identify SVGs to facilitate downstream analysis. For instance, SINFONIA, as cited in this work, provides a scalable approach for the initial identification of spatially variable genes using ensemble strategies within the context of spatial transcriptomic data analysis. The ultimate objective of this method is to decipher distinct spatial domains within the tissue. Furthermore, some of these methods leverage additional information as input, such as single- cell RNA sequencing (scRNA-seq) data, spatial domain information, or tissue-specific markers, in conjunction with spatial transcriptomic data. For instance, CTSV requires scRNA-seq data and a set of marker genes as input alongside the spatial transcriptomic data. It employs deconvolution techniques like SPOTlight[113], RCTD[114], or SpatialDWLS[115] to estimate cell-type propor- tions for each spatial spot. Ultimately, this approach identifies spatially variable genes specific to different cell types. Similarly, Trendsceek identifies genes with significant spatial trends and subsequently determines the subset of cells occupying spatial regions of interest. Given the distinct ultimate objectives and input criteria for each method, it would be unfair to 53 evaluate their performance solely based on a single parameter. Rather, the utility or superiority of these frameworks depends on the researcher’s specific goals and the nature of their research inquiries. In this context, we present a table that combines these selective frameworks, including details about their typical inputs and primary research objectives (see Table 3.4). 3.6 Publicly Accessible Code for Major SVG Detection Methods Every prominent SVG detection method featured in this paper has made its code publicly available. Certain frameworks have packages published in CRAN or available as Python modules, while others have shared their code on Github, and the package can be installed directly from Github. Here, we have compiled a list of the packages and repositories associated with these techniques, along with the coding language they have used (see Table 3.5). This compilation aims to facilitate convenient access to their respective code bases, making it easier for researchers to choose a method based on their preferred programming language. 3.7 Summary and Outlook We systematically reviewed recently developed frameworks for identifying spatially variable genes and grouped them into different categories and delved into the unique aspects of their models and underlying principles. Here, we provide a brief discussion encompassing various facets, including pre-processing steps, modeling frameworks, inference techniques, scalability, and practical applicability of these frameworks. We explored the performance of select methods as reported in previously published papers. Nevertheless, it is essential to note that we refrained from conducting evaluations based solely on the number of SVGs detected or the trade-off between statistical power and FDR. This decision arises from the fact that the methods discussed in this paper often serve different research objectives, each tailored to specific research questions. For example, a method primarily focused on spatial clustering may yield similar outcomes when considering the top 100 genes versus the top 110 genes. In contrast, a method geared toward accurately identifying genuine SVGs and scrutinizing individual SVGs to glean deeper insights into biological mechanisms may prioritize stringent control of false discovery rates, making it a pivotal concern in their evaluation. The evaluation criteria must align with the unique goals and nuances of each 54 method, akin to comparing apples to oranges when attempting to gauge their performance solely based on the number of SVGs selected. This paper[110] has previously investigated several methods for detecting spatial gene expres- sion variations and benchmarked their performance based on different measures. It reported that, although each SVG detection method successfully identifies a significant number of SVGs, there is limited overlap in the SVGs detected when a significance cutoff is applied to filter the SVGs. The study’s simulation analysis revealed that, in most cases, the estimated FDRs do not accurately reflect the true FDRs. These findings indicate that there is room for improvement in the commonly used methods for SVG detection and their associated FDR control approaches. In the context of Gaussian process based methods, one potential issue could be related to the selection of kernel function. For instance, as an improvement to spatialDE, approaches like SPARK and SPARK-X employ a variety of different kernels to robustly identify various traits. However, they apply the same set of parameter values to all genes, even when these genes may exhibit vastly different spatial patterns. While nnSVG offers improvements by allowing gene-specific kernel function parameter selection, it relies on a single type of kernel function. This opens room for further methodological development for optimal kernel selection when kernel-based methods are applied for SVG detection. Furthermore, model-free techniques, in many cases, do not analytically control FDR, making it challenging to establish a specific cutoff for selecting SVGs. Many methods claim to detect more SVGs than others, often undetected by alternative methods. However, the mere detection of more SVGs does not necessarily indicate the superiority of a framework if it does not effectively control the FDR. If the goal is to pinpoint the top 𝑘 (say 1000) SVGs for subsequent analysis without the necessity of precisely quantifying detection uncertainty, these methods can be employed. However, for a more rigorous approach, it is crucial to implement stringent FDR control measures to prevent false discoveries. In our empirical analysis, we observed that numerous methods exhibit elevated false positive rates with inflated p-values (data not shown). There is an urgent demand for the development of more rigorous statistical approaches to enhance false positive control. 55 Model-based SVG detection techniques frequently incorporate covariate variables, such as cell type information or domain structure information, into the model. However, the unavailability of this information alongside spatial transcriptomics data poses a challenge. It remains unclear how to obtain covariate information without utilizing the same data twice—once for identifying covariates and again for detecting SVG. Addressing this issue represents an ongoing challenge within the framework of SVG detection techniques. Hopefully, future methods will be developed to effectively bridge this gap. Finally, we acknowledge that benchmarking existing methods is essential to determine their efficiency in terms of scalability and accurate selection of SVGs. This is crucial for ensuring proper downstream analysis. To address this need, we present a foundational outline of a benchmarking design. For data generation, they can be simulated through methods such as SRTsim [109] and scDesign3 [116]. Both methods are capable of simulating datasets that emulate the structure of a real spatial dataset by learning their parameters. SRTsim offers a ShinyR platform where spatial patterns can be visualized, and parameters can be configured to generate count data for spatially variable gene expression. Diverse datasets containing both spatial and non-spatial genes can be simulated, with various spatial effect strengths, sparsity levels, distinct spatial patterns. scDesign3 is another model-based simulation machinery where users can use real data to estimate parameters which allow for a wide range of simulation scenarios, from homogenous cell populations to complex tissues with diverse cell types. Benchmarking involves assessing the performance of implemented methods by checking their power and false discovery rate (FDR) in detecting SVGs, their scalability, as well as the impact on specific downstream analysis such as spatial domain detection. In summary, this chapter provides a selective survey of recently published and archived literature on SVG detection, offering an analysis of their practical utility, adaptability, innovation, and constraints from various practical perspectives. This effort aims to facilitate new researchers in gaining a holistic understanding of the available methods and assist them in selecting a framework aligned with their specific research needs and questions. 56 Table 3.3 Compilation of SVG detection techniques Grouped by the method’s control of False Discovery Rate (FDR). Method Trendsceek SpatialDE SpatialDE2 SPARK SPARK-G SPARK-X nnSVG BOOST-GP GLISS scGCO CTSV HEARTSVG GPcounts BSP SOMDE sepal SINFONIA BOOST-MI MULTILAYER If framework analytically controls FDR Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes yes No No No No No How SVGs are selected Permutation based p-values, Benjamini–Hochberg procedure[105] for MC Analytically estimated p-values, q-value method[106] for MC Analytically estimated p-values, Benjamini–Yekutieli procedure[107] for MC Analytically estimated p-values, Benjamini–Yekutieli procedure for MC same as SPARK same as SPARK Analytical approximate p-values, Benjamini–Hochberg method for MC Based on Bayesian FDR controlled PPI threshold Permutation based p-values, Benjamini–Hochberg procedure for MC Analytically estimated p-values, Benjamini–Hochberg procedure for MC Analytically estimated p-values, q-value method for MC Analytically estimated p-values, MC by Bonferroni/Holm/Hochberg Analytical or permuted p-values, q-value method for MC Analytically estimated p-values, q-value method[106] for MC Top ranked genes based on spatial variability score Top 𝑘 genes with highest ranks Top 𝑘 genes with highest score and an ensemble technique Based on specific Bayes Factor threshold Based on the two-fold threshold of a test statistic 57 Table 3.4 List of selective methods with input data type and main goal. Method (Publication) SpatialDE(2018) Input data ST data SpatialDE2(Archived,2021) ST data SPARK(2020) SPARK-X(2021) nnSVG(2023) BOOST-GP(2021) SOMDE(2021) sepal(2021) SINFONIA(2023) BOOST-MI(2022) scGCO(2022) BSP HEARTSVG (Archived,2023) MULTILAYER(2021) ST data ST data ST data ST data ST data ST data ST data ST data ST data ST data ST data ST data Main goal Finding SVG Spatial gene-clustering Tissue region segmentation Finding SVG Spatial gene-clustering Finding SVG Finding SVG Finding SVGs across tissue or within spatial domains Finding SVG Finding SVG Finding SVG Spatial gene-clustering Finding SVG for deciphering spatial domains Finding SVG Finding SVG Finding SVG Detecting SVG and spatial domain Detecting SVG, dimensionality reduction, spatial clustering and more STAMarker(Archived,2022) ST data Spatial domain-specific variable genes GPcounts(2021) SpaGCN(2021) Trendsceek(2018) GLISS(Archived,2020) CTSV(2022) ST data scRNA-seq data Finding SVG, identifying gene-specific branching locations and more ST data histology image data Identifying spatial domains and SVG in domain Finding SVG Identifying cells in spatially significant gene expression regions Finding SVG, recovering cell locations in scRNA-seq data and gene-clustering Detecting cell-type-specific SVG ST data scRNA-Seq data ST data scRNA-seq data ST data scRNA-seq set of marker genes 58 Table 3.5 List of methods with implementing code language and package site. Method SpatialDE SpatialDE2 SOMDE Code language Python Python Python Package or GitHub or vignette https://github.com/Teichlab/SpatialDE https://github.com/PMBio/SpatialDE https://pypi.org/project/somde https://github.com/XuegongLab/somde sepal Python https://github.com/almaan/sepal GLISS SINFONIA ScGCO Python Python Python 10.5281/zenodo.4573237 https://github.com/junjiezhujason/gliss https://github.com/BioX-NKU/SINFONIA https://github.com/WangPeng-Lab/scGCO MULTILAYER Python https://github.com/SysFate/MULTILAYER GPcounts BSP Trendsceek SPARK SPARK-G SPARK-X nnSVG BOOST-MI CTSV HEARTSVG Python Python R R R R R R https://github.com/ManchesterBioinference/GPcounts https://github.com/juexinwang/BSP/ https://github.com/edsgard/trendsceek https://github.com/xzhoulab/SPARK https://xzhoulab.github.io/SPARK/01_about/ https://github.com/lmweber/nnSVG https://github.com/Xijiang1997/BOOST-MI https://bioconductor.org/packages/devel/bioc/html/CTSV.html https://github.com/cz0316/HEARTSVG.git BOOST-GP R/C++ https://github.com/Minzhe/BOOST-GP 59 3.8 Downstream analysis: Spatial domain detection In spatially resolved RNA-seq data, researchers aim to identify spatially variable genes (SVGs) as an initial step in the analysis, although detecting SVGs is not the primary objective. The main goal lies in the downstream analysis, which relies on the SVGs identified in the initial step. Key downstream analyses include detecting spatial domains, spatial trajectory inference on the tissue, and high-resolution spatial map reconstruction. etc. Here, our primary focus is on spatial domain detection. Various algorithms have been proposed by authors that are equipped to perform this task, such as SpaGCN[111], SpatialPCA[104], BayesSpace[117] etc. Among these, we opt for SpatialPCA due to its superior performance compared to other existing algorithms. In this section, we elaborate on the method and domain detection steps conducted by SpatialPCA in greater detail, as we have employed this efficient algorithm in our framework, which will be described in the next chapter. 3.8.1 SpatialPCA Spatial transcriptomic datasets are typically vast and high dimensional, with a significantly higher number of genes (m) than spots (N). Handling the entire dataset directly poses challenges. Moreover, disregarding spatial information while seeking low-dimensional embeddings of the data from gene expression values results in information loss. To address this issue, the Authors propose SpatialPCA, a method for spatially-aware dimension reduction. SpatialPCA effectively extracts a low-dimensional representation of spatial transcriptomic data while preserving both biological signals and spatial correlation structures. This condensed representation of spatial transcriptomic data can be further utilized for many downstream analysis, for example spatial domain detection, which we are interested in. 3.8.1.1 Method to obtain Spatial PCs 𝑌 is denoted as the available 𝑚 × 𝑁 normalized gene expression matrix. The 𝑗𝑖𝑡ℎ element of 𝑌 , 𝑦 𝑗𝑖 (𝑠𝑖) represents the gene expression measure for 𝑗 𝑡ℎ gene on 𝑖𝑡ℎ location. SpatialPCA aims to perform dimension reduction on the gene expression matrix and infer a 𝑑 × 𝑁 factor matrix 𝑍 that represents a low-dimensional embedding of 𝑌 .For dimension reduction, consider the following 60 latent factor model 𝑌 = (𝑋 𝐵)𝑇 + 𝑊 𝑍 + 𝐸 where 𝑋 is the covariate matrix, 𝐵 be the corresponding coefficients. 𝑊 is a 𝑚 × 𝑑 factor loading matrix and 𝐸 is the 𝑚 × 𝑁 residual matrix, where 𝐸 𝑗𝑖 ∼ 𝑁 (0, 𝜎2 0 ). The model is currently unidentifiable, so following the probabilistic principal component analysis model (PPCA)[118] an orthonormality constraint is imposed on 𝑊 : 𝑊𝑇𝑊 = 𝐼𝑑. Unlike the independent assumption on 𝑍 (𝑍 ∼ 𝑁 (0, 𝐼)) in PPCA and similar models, here the elements of 𝑍 are not independent because in spatial transcriptomic data, the spots spatially close to each other are likely to be more similar than the ones far from each other. This is because the spots closer to each other might share similar cell type composition and display similar gene expression measures. In order to promote consistency among neighborhood factor values and enhance the exchange of information among adjacent areas for factor estimation, it is assumed that the 𝑙𝑡ℎ factor values 𝑍 1×𝑁 𝑙. across N locations follows a multivariate normal distribution: 𝑍𝑙. ∼ 𝑀𝑉 𝑁 (0, Σ𝑙) where the covariance matrix Σ𝑙 is constructed using the Gaussian kernel Σ𝑙 = 𝜎2 0 where 𝐾 (𝑠𝑖, 𝑠 𝑗 ) = 𝑒𝑥 𝑝(−|𝑠𝑖 − 𝑠 𝑗 |2/𝛾) 𝜏𝑙 𝐾 where 𝛾 being the bandwidth parameter and 𝜎2 0 𝜏𝑙 is the variance component that is scaled with respect to the residual error variance 𝜎2 0 . With the model specifications provided above, the factor loading matrix 𝑊 and the factor matrix 𝑍, along with the hyperparameters (𝜏, 𝜎2 0 ), are inferred through maximum likelihood-based optimization. Specifically, both 𝐵 and 𝑍 are integrated out initially to obtain a marginal likelihood, based on which 𝜏, 𝜎2 posterior mean conditional on the estimated 𝜏, 𝜎2 0 , and 𝑊 are inferred. Subsequently, 𝑍 is estimated by computing their 0 , and 𝑊. The rows of the final matrix 𝑍 are called Spatial PCs. 61 3.8.1.2 Spatial domain detection with spatial PCs The spatial PCs 𝑍 inferred from SpatialPCA can be combined with various methods already developed in the scRNA-seq literature to enable a range of downstream applications in spatial transcriptomics. Spatial domain detection, aimed at segmenting the tissue into multiple structures or microenvironments, each characterized by a distinct transcriptomic profile, is facilitated. For spatial domain detection, it is formulated as a clustering problem on the inferred spatial PCs 𝑍. Specifically, standard clustering algorithms like walktrap or Louvain are applied on 𝑍 to categorize spatial locations into different spatial domains. Due to the critical spatial correlation information present across locations in 𝑍, clustering based on the spatial PCs would result in similar cluster assignments in neighboring locations, leading to smooth boundaries in the detected tissue structures. This advantageous property of SpatialPCA is harnessed by our framework, to be described in the next chapter, to enhance the accuracy of spatial domain detection. 62 CHAPTER 4 CSVG: IMPROVED SPATIAL DOMAIN DETECTION BY SPATIALLY VARIABLE GENE CLUSTERING ADJUSTING FOR CELL TYPE EFFECT 4.1 Introduction Recent advancements in spatially-resolved transcriptomics (SRT) technology have revolution- ized our ability to acquire comprehensive gene expression data for thousands of genes across tissue locations in multiple samples. The number of genes and spatial resolution vary depending on the specific technology employed. However, regardless of the technology and resolution, spatial transcriptomic data facilitate the exploration of various biological questions. Often a fundamental initial step in the analysis of SRT data involves identifying spatially vari- able genes (SVGs). These genes exhibit expression level variations either across the entire tissue or within predefined spatial domains. In recent years, there has been an abundance of research and the development of new methods to address the challenge of detecting SVGs[119][120]. Although the detection of SVGs lets us visualize the spatial patterns in the tissue which might offer some level of biological insights about the tissue of interest, the main use of SVGs lies in downstream analysis, specifically for spatial domain detection. Spatial domains are distinct and functionally specialized anatomical structures within tissue, each distinguished by unique local characteristics including cell-type composition, transcriptome heterogeneity, and cell-cell interactions[121][122][123]. De- tecting these domains is crucial for understanding their collaborative role in tissue functions and development stages. To achieve this, a set number of top SVGs is typically selected, and spatial domains are identified using these top SVGs. However, using an arbitrary number of top SVGs might not represent all the spatial patterns exhibited by the SVGs. As previously argued [82], Some dominant patterns may overshadow less pronounced yet relevant patterns. Previous methods, like SpatialDE[75], SPARK [77] and Sepal[82], attempted to classify SVGs into groups with similar spatial patterns, aiding in a more holistic representation of results. It’s important to highlight that classifying spatially variable genes (SVGs) is a challenging task requiring specialized methods. Simple clustering approaches are 63 inadequate as they overlook spatial information [75]. However, in existing methods challenges arise regarding the selection of the unknown number of spatial pattern groups and selection of other parameters, as well as the unclear impact of classification on downstream analysis. Here, we propose an efficient method cSVG to classify SVGs into clusters and explore the benefits of this clustering step in the final goal of spatial domain detection. The concept of clustering SVGs carries a significant biological rationale. Researchers have already identified many SVGs as the markers for specific cell types [124],[125],[126],[127], yet detecting these cell type-specific SVGs presents a formidable challenge. As distinct spatial patterns are associated with distinct cell type compositions, it follows that different cell type-specific SVGs would manifest distinct spatial patterns. Consequently, clustering SVGs with similar patterns can be viewed as a means of segregating distinct cell type-specific SVGs. cSVG is a Gaussian process-based method that initially identifies SVGs and then establishes a dependency map among these SVGs using an intuitive approach (see methods 4.2). This map links each SVG with other SVGs exhibiting similar spatial patterns, ultimately clustering similarly expressed SVGs together (see Figure 4.1A). The resulting SVG-clusters from the cSVG algorithm can serve as inputs for further downstream analysis. For performing downstream analysis, we employ a well-known dimension reduction technique tailored for spatial data, SpatialPCA[104], to derive low-dimensional embeddings specific to each SVG-cluster. These embeddings are subsequently utilized for spatial domain detection. In each example considered in this study, whether through simulation setups or real data analysis, we compare our findings with those obtained from the SpatialPCA framework, which generates low- dimensional embeddings for all top SVGs collectively and subsequently performs the same spatial domain detection step. This comparison aims to elucidate the advantages of the gene clustering step facilitated by cSVG. To evaluate the performance of cSVG, we conducted a comparison using a synthetic dataset derived from real human DLPFC cortex data, with known annotations of its 5 layers (4 prefrontal cortex layers and the white matter). See Supplementary Figure F.1 for more details on the synthetic 64 data generation. Figure 4.1B illustrates the domain-based SVG clusters within the synthetic dataset: cluster 1 represents genes predominantly expressed in layer 1 cluster 2 includes genes from cortex layers 2, 3, or 4, while cluster 3 comprises genes overexpressed in the white matter region. The distribution of the number of genes across clusters is uneven in this scenario as shown in figure 4.1C, as frequently evident in real datasets. Upon repeating the analysis on 10 simulated synthetic datasets, and considering ARI scores (Adjusted Rand Index, higher the better) and PAS scores (Percentage of abnormal spots, lower the better, see method section 4.2) (see Figure 4.1D), we observe that our framework provides better domain detection results compared to the SpatialPCA framework. Additionally, Figure 4.1E displays a t-SNE plot [128] representing genes from a randomly chosen simulation outcome. The genes are color-coded according to the gene cluster label identified by cSVG. This visualization demonstrates that genes within each cluster are packed together, indicating the accuracy of gene classification by cSVG. In this paper, we conducted two types of simulation studies: firstly, to assess the accuracy of the cSVG framework for SVG classification, and secondly, to evaluate the accuracy of domain detection based on the detected SVG-clusters by cSVG using synthetic datasets mimicking real- world scenarios. In addition, we analyzed three publicly available datasets and one newly acquired pancreatic cancer SRT dataset. The publicly available datasets include: 1). The DLPFC human cortex annotated dataset[129], comprising 12 samples; 2). The HER2 human breast tumor annotated dataset[130]; and 3). The dataset from study of human breast cancer biopsies[131]. The newly acquired dataset is the pancreatic cancer dataset, which comes with rudimentary annotation. The application of cSVG and the detection of domains aligns well with the rough annotation. Overall, findings from simulation studies and real data analyses across multiple datasets affirm that utilizing SVG clusters and, consequently, the cSVG framework can markedly enhance the performance of spatial domain detection. This approach holds significant promise to unveil hidden spatial patterns which provide novel insights into spatial heterogeneity of tissue samples. 65 Figure 4.1 (A) Schematic overview of the cSVG framework. (B-E) An example simulation based on a synthetic dataset: (B) The synthetic dataset is generated based on a real dataset with annotated spatial domains. Three types of domain-specific SVGs and noise genes are created. (C) The distribution of the number of each type of gene is presented. (D) Evaluation based on ARI (higher is better) and PAS score (lower is better). (E) Visualization by the t-SNE plot: shown is for a randomly selected simulation result. The gene clusters identified by cSVG are highlighted with distinct colors. The plot illustrates that genes within the clusters are densely grouped together and separated from genes belonging to other clusters. 66 4.2 Method In a typical spatial transcriptomics setup, the dataset comprises gene expression measures or counts for 𝑚 genes distributed across 𝑁 known spatial coordinates or spots. Suppose 𝑦 = (𝑦1, 𝑦2, ..., 𝑦𝑁 ) represents the gene expression profiles or counts for a particular gene across spatial coordinates (referred to as samples or spots), denoted by 𝑠 = (𝑠1, ..., 𝑠𝑁 ). The spatial locations are typically represented as two-dimensional coordinates, i.e., 𝑠𝑖 = (𝑠𝑖1, 𝑠𝑖2), although coordinates of any dimensionality can be employed. The primary objective of spatially variable gene (SVG) detection models is to identify which genes, among the 𝑚 genes, exhibit spatial variability across the tissue. In essence, the key goal is to determine whether the gene expression measure 𝑦 is dependent on or related to the spatial locations where the gene expression measures are sampled. Consider a scenario where there are 𝑘 𝑑 spatial domains within the tissue of interest. A spatial domain represents a distinct region or area within the tissue characterized by unique molecular signatures or gene expression patterns. These patterns may arise from specific factors such as cellular composition, anatomical organization, spatial arrangement, or functional attributes. In reality, the specific spatial domains are typically unknown. However, each domain may be defined with a set of spatially variable genes (SVGs) exhibiting characteristic gene expression patterns in proximity to these domains. To accurately reconstruct the underlying domain structure, it is beneficial to group SVGs based on their spatial patterns and utilize all SVG groups in downstream analysis, rather than relying solely on an arbitrary subset of all SVGs. With this objective in mind, we present the cSVG framework, organized into two primary steps. The first step involves the selection of SVGs, while the second step utilizes the SVGs selected in the first step to generate an SVG dependency list and subsequently create clusters of SVGs based on this information (see 4.1A). It is worth noting that there exists a plethora of SVG detection techniques in the literature [119] [120], and any of these methods can substitute for the first step in the framework. However, methods that rigorously control the False Discovery Rate (FDR) are preferable, as they enhance accuracy in the subsequent stages of analysis. 67 4.2.1 Step 1: Selecting SVGs Like the majority of the SVG detection methods this step is based on the Gaussian process (GP) regression model which models the normalized gene expression 𝑦 for a given gene assuming the following multivariate normal model: 𝑝(𝑦|𝜇, 𝜎2 𝑠 , 𝛿, 𝐾) ∼ 𝑀𝑉 𝑁 (𝑋 𝛽, 𝜎2 𝑠 𝐾 + 𝛿𝐼), (4.1) where the covariance term is decomposed into a spatial and a non-spatial part, with 𝛿𝐼 and 𝜎2 𝑠 𝐾 representing the non-spatial and spatial covariance matrix, respectively. The (𝑖, 𝑗)𝑡ℎ element in the kernel matrix 𝐾 denotes the spatial similarity between the 𝑖𝑡ℎ and 𝑗 𝑡ℎ spot calculated based on the corresponding coordinates 𝑠𝑖 and 𝑠 𝑗 . The choice of the kernel function plays a very important role in detecting the spatial correlation presented in the gene expressions. 𝑋 𝑁×𝑘 represents the covariate matrix, while 𝛽𝑘×1 denotes the array of corresponding coefficients. This model can incorporate up to 𝑘 − 1 covariates, such as cell type information or domain structure information. However, often such information is either unavailable or deemed untrustworthy. Hence, in practice, we typically employ 𝑋 solely as the intercept. When evaluating the existence of spatial patterns within the data, an assessment is made by testing the alternative hypothesis, which suggests the presence of spatial variance in the model, against the null hypothesis, where the spatial variance component is zero, indicating the absence of spatial variability. This comparison between the model fitted under the null and alternative hypotheses forms the basis of a significance testing procedure. This often involves conducting significance tests and drawing conclusions based on p-values in frequentist approaches. In model (4.1), testing if a gene is spatially variable is equivalent to testing 𝐻0 : 𝜎2 𝑠 = 0. Within this framework, we use a straightforward score test to test the underlying hypothesis and a p-value is calculated. More information regarding the test is provided in the supplementary material. Prior studies [77, 132] indicate that Gaussian and Cosine kernels are adept at capturing a wide spectrum of distinct spatial gene expression patterns. Hence, we utilize 10 different kernels (5 Cosine and 5 Gaussian kernels with varying parameter values) following the approach established by SPARK [77]. Denote the number of detected SVGs by 𝑚1 in step 1. 68 4.2.2 Step 2: Classifying SVGs This stage categorizes the SVGs identifed in step 1 into cohesive groups. While conventional clustering algorithms can segregate genes into distinct groups based on gene expression, they invariably disregard spatial information. Hence, we require more sophisticated algorithms to classify spatially variable genes more precisely [75]. Intuitively, if two spatially variable genes 𝑦1 and 𝑦2 exhibit similar spatial patterns, they should be correlated which is equivalent to assume 𝑦1 = 𝑦2 + 𝜖, where 𝜖 ∼ (0, 𝜎2 𝜖 𝐼) is a random noise term with arbitrary variance 𝜎2 𝜖 . Hence, while testing gene 𝑦1, if we use gene 𝑦2 as a covariate in model (4.1) and test the same alternative hypothesis described in step 1, we would expect gene 𝑦1 to be less or even not significant depending on how strong the correlation between 𝑦1 and 𝑦2 is. On the other hand, if after using another gene 𝑦3 as a covariate in the model we still find gene 𝑦1 to be significant, that would imply that gene 𝑦1 and gene 𝑦3 have different spatial patterns. Given that many SVGs are cell-type marker genes [124][125][126][127], we would expect gene 𝑦1 and 𝑦2 belong to the same cell type while gene 𝑦3 belongs to a different cell type. However, our algorithm does not require such knowledge (often not available in spot-level SRT data) as detailed below. With this intuition, for each SVG 𝑗, 𝑗 = 1, · · · , 𝑚1, we can select a list of genes 𝑆 𝑗 which are correlated to gene 𝑗 and use them as covariates in the model 4.1. If the no. of genes in 𝑆 𝑗 exceeds 3, we choose the top 𝑘 𝑗 principal components and include them as covariates in the model. Typically, 𝑘 𝑗 is chosen such that at least 80% of the total variance is explained by the top 𝑘 𝑗 principal components. With this, model (4.1) becomes: 𝑦(𝑠) = 𝑘 𝑗 ∑︁ 𝑙=1 𝛽𝑙 𝑃𝐶𝑙 + 𝛼(𝑠) + 𝜖 (𝑠) (4.2) where 𝑃𝐶𝑙 represents the 𝑙𝑡ℎ principal component, 𝛼(𝑠) ∼ 𝑁 (0, 𝜎2 𝑠 𝐾) and 𝜖 (𝑠) ∼ 𝑁 (0, 𝛿𝐼). There might be different ways of selecting related genes. It could be done by choosing a threshold based on a correlation measure, either linear or nonlinear, such as Pearson correlation, Spearman correlation, distance correlation or kernel correlation. Alternatively, one can run a penalized regression with LASSO[133], Elastic net [134][135] or MCP penalty[136] or perform 69 sure independence screening[137] to select genes correlated with gene 𝑗. In real applications, many SVGs represent marker genes for cell types, exhibiting spatial expression patterns that reflect the distribution of various cell types. Given that marker genes for a given cell type often exhibit similar or strongly correlated gene expression distributions, employing this model adeptly controls for the cell-type specific effect, a factor typically challenging to ascertain or quantify in real-world settings as the cell type information is typically unknown in spot-level SRT data. Employing this model on all the genes, we can get a set of genes with unique patterns (showing significance under model (4.2)) and a gene dependency list. From the list, we come up with a weighted graph structure, where nodes are SVGs and two nodes are connected if they are correlated. By applying a clustering algorithm such as leiden [138], groups of genes are determined and the unique genes (which are not part of any gene group) create singleton sets. The full algorithm steps are provided in the supplementary file. 4.2.3 Downstream analysis: spatial domain detection Spatially resolved transcriptomics serve a crucial role in identifying tissue or region substruc- tures through domain detection analysis. Numerous frameworks have been developed for this purpose, to name a few, SpaGCN[111], SpatialPCA [104] and BayesSpace[117]. In our analysis, we opted for SpatialPCA[104] due to its proven superiority in performance over other available algorithms. Furthermore, to conduct domain detection post identification of SVG clusters using our framework, we require an effective low-dimensional representation of the dataset, a task efficiently facilitated by SpatialPCA. SpatialPCA effectively extracts a low-dimensional representation of spa- tial transcriptomic data while preserving both biological signals and spatial correlation structures. This condensed representation can serve as an input for efficient clustering algorithms, such as the Louvain [139] or Walktrap algorithm [140], facilitating the clustering of spots and thereby identifying spatial domains. The steps for obtaining spatial domains in the SpatialPCA workflow include: 1) Select the top 3000 SVGs and calculate spatial PCs based on these genes, and 2) Use the top 20-30 spatial PCs for spatial clustering using algorithms like Louvein or walktrap. 70 In our approach, we utilize our framework to identify SVG groups with similar spatial patterns. Within each SVG group, we compute the spatial PCs and aggregate the top spatial PCs from each group to form the final embedding. These aggregated spatial PCs are then used as input in the clustering algorithm to detect spatial domains. The differences between our method and the existing ones are that existing methods use low-dimensional embeddings based on all top SVGs, while our method gets low-dimensional embeddings within each cluster. The within-cluster embeddings can capture unique spatial structures and hence lead to improved spatial domain detection. 4.2.4 Measuring accuracy of domain detection Detecting domains essentially involves assigning a cluster label to each of the 𝑁 spots in the tissue sample. Once a framework is implemented for detecting spatial domains, it becomes crucial to measure its accuracy against the ground truth domain labels. We primarily employ a standard clustering evaluation metric, the Adjusted Rand Index (ARI), to assess the similarity between the predicted domain labels and the true labels. Additionally, we utilize the PAS (Percentage of Abnormal Spots) score to quantify the clustering performance of spatial domain detection, following the approach outlined in [104]. This score gauges the randomness of spots located outside their clustered spatial domain computed as the proportion of spots with a cluster label differing from at least six of their ten neighboring spots. A lower PAS score reflects greater homogeneity within spatial clusters. 4.3 Simulation Study We conducted two simulation studies: one to demonstrate the effectiveness of the SVG clustering performance and another to assess whether the SVG clusters identified by cSVG are beneficial for enhancing domain detection accuracy. 4.3.1 Evaluation of the clustering performance To illustrate cSVG’s capability to accurately classify SVGs, we devise the simulation scenario I in which 100 normalized gene expression datasets were simulated, each consisting of 53 genes (labeled g1-g53) and 2000 spots following model 4.1 with no covariates. The first 10 genes (g1-g10) 71 Figure 4.2 Simulation setting for cSVG. A) Six representative genes each with a distinct spatial pattern in the simulated dataset. Green and Red color represents low and high gene expression. B) Correlation heatmap of the simulated dataset with compound symmetry correlation structure within gene groups. Independent: uncorrelated gene group for genes without any spatial pattern; Correlated: correlated gene group for genes without any spatial pattern. Pattern 1-3: correlated gene group for genes with spatial pattern 1-3. Pattern 4-6: single gene with spatial pattern 4-6. C) Empirical power of the SVG detection step at detecting SVGs for the simulated datasets with compound symmetry (Left) and AR(1) (Right) correlation structure within each gene group. D) The distribution of ARI values based on predicted gene clusters for simulation under each correlation structure. E) Empirical FDR distribution for simulation under each correlation structure. Here, false discovery occurs when Correlated or Independent genes show up in any of the final gene clusters. 72 represent independent noise genes, not correlated with any other gene in the dataset. The next set of 10 genes (g11-g20) exhibit strong correlations among themselves but do not show any spatial pattern. The third (g21-g30), fourth (g31-g40), and fifth (g41-g50) sets of genes represent spatially variable genes with three distinct spatial patterns (Spatial Pattern 1, Spatial Pattern 2, and Spatial Pattern 3, respectively). Figure 4.2A exhibits the six sets of representative genes with distinct spatial patterns. Genes are correlated within the spatial pattern 1-3 as exhibited in Figure 4.2B, and the spatial effect strength increases with the gene index. For example, g21 and g30 are correlated and share the same spatial pattern, but the spatial effect is stronger in g30 compared to g21. The correlation between the genes within a gene group could have different structures. They could display Compound Symmetry (CS) if any pair of genes within a group have the same correlation, i.e., 𝜌𝑖 𝑗 = 𝜌. Alternatively, they could demonstrate a first order Autoregressive (AR(1)) pattern if the correlation between two genes decays as their distance increases, i.e., 𝜌𝑖 𝑗 = 𝜌|𝑖− 𝑗 |. Figure 4.2B exhibits CS correlation structure. The last three genes, g51, g52, and g53, each exhibit a unique spatial pattern (Spatial Pattern 4, Spatial Pattern 5, and Spatial Pattern 6, respectively). As the spatial pattern strength increases within each spatial gene group (pattern1-pattern3), the SVG detection power converges towards 1, as expected (see Figure 4.2C). Furthermore, upon comparing this outcome with the performance of the most efficient comparable SVG detection method, nnSVG[78] (where both nnSVG and cSVG use normalized gene expressions), we observed that cSVG detects spatial genes slightly more effectively (see Supplementary Figure F.3). The cSVG framework not only demonstrates efficacy in identifying the true SVGs (g21-g53) but also adeptly categorizes them. The Adjusted Rand Index (ARI) scores, calculated for each simulated dataset, which are computed based on the cluster labels of detected SVGs and their true cluster labels - cluster near 1 in the violin plot in Figure 4.2D, indicating high accuracy of gene clustering. The false discovery rates were also monitored in this study. Here false discovery happens when any of the non-spatial genes (g1-g20) appears in any of the final gene clusters. FDR is calculated as the number of false discoveries divided by the total number of SVG discoveries. As indicated in Figure 4.2E, the FDR values are predominantly distributed near 0, indicating high accuracy of the results. 73 4.3.2 Evaluation of the spatial domain detection performance To showcase how the outputs of the cSVG algorithm aid in the downstream analysis of spatial domain detection and enhance its accuracy, we generated synthetic datasets based on the annotated human DLPFC data with sample ID 151670. Supplementary Figure F.1 details the steps involved in generating the synthetic data, ensuring that key features of the original dataset are preserved, such as the distribution of means and variances of all genes. The dataset is annotated with 5 layers (4 prefrontal cortex layers and the white matter layer). After filtering out sparse genes, we ended up with 4,865 genes whose expression were measured in 3484 spots. We first randomly selected 2,000 genes which were converted to SVGs in the generated dataset (See supplementary Figure F.1). The rest of the genes were converted to random noise genes with no specific pattern. Among the 2,000 SVGs, three distinct spatial domain structures were represented (see 4.1B and 4.1C): 800 SVGs correspond to the first cluster, wherein genes are predominantly expressed in the cortex layer 1; 150 SVGs exhibit overexpression in cortex layers 2, 3, or 4; and the remaining 1,050 SVGs from cluster 3 predominantly display expression in the white matter domain region. In such 10 simulated synthetic datasets, we applied our framework, cSVG, to identify gene clusters which were further leveraged for domain detection. We compared the domain detection results of cSVG with the default one by SpatialPCA without an additional clustering step. As we mentioned previously in the introduction 4.1, our framework significantly improves the spatial domain detection accuracy, as evidenced by ARI scores and PAS scores (see 4.1D. Additionally, t-SNE plot[128] from a randomly chosen simulation result for the genes displays the accuracy of gene clustering performance of cSVG. The t-SNE plots for all 10 simulation results were given in Supplementary Figure F.2. 4.4 Real data analysis 4.4.1 Human DLPFC 10x Genomics Visium dataset We applied cSVG algorithm to the human dorsolateral prefrontal cortex (DLPFC) data[129] generated by Visium from 10x Genomics. Publicly available datasets from 12 human DLPFC tissue samples, obtained from three individuals, can be accessed and downloaded from the link: http:// spatial.libd.org/spatialLIBD/. We directly downloaded the processed datasets from the SpatialPCA 74 Github repository, available at: https://github.com/shangll123/SpatialPCA_analysis_codes. These samples, on average, encompassed 3,973 spots, each manually annotated to one of the six prefrontal cortex layers or white matter. Our primary analysis is focused on two samples with ID 151670 and 151673, which contain expression measurements of 33,538 genes across 3,498 spots and 33,538 genes across 36,39 spots, respectively. We also analysed the other 10 samples and the results are available in the supplementary files. We applied our framework to detect the spatial domains (see methods 4.2) for each of the samples. We also followed the SpatialPCA framework provided in https://github.com/shangll123/ SpatialPCA_analysis_codes without separating genes into clusters to detect spatial domains to com- pare with our results. The comparison is based on the ARI scores, utilizing the provided annotations for each sample. We also compared the PAS scores for all the samples. In Figure 4.3A, sample 151670 is presented with annotated spatial domains as the ground truth (left), spatial domains detected by SpatialPCA (middle), and spatial domains detected by our framework (right). The ARI values for the detected domains by SpatialPCA and our framework (0.34 and 0.69, respectively, as shown in Figure 4.3E), along with the PAS scores (0.013722 and 0.006289, respectively, depicted in Figure 4.3F), highlight the superior performance of our framework over SpatialPCA. Similarly, in Figure 4.3C, the spatial domains for sample 151673 are displayed in the same order: annotated domains, spatial domains by SpatialPCA, and spatial domains by our framework. The ARIs for SpatialPCA and our framework are 0.58 and 0.65, respectively and the PAS scores are 0.028579 and 0.023083, respectively. Combining the results from all the samples, our framework significantly enhances spatial domain detection compared to SpatialPCA (see Figure 4.3G, 4.3H) as confirmed by the increase in the median ARI score. We visualized the SVG clusters identified by cSVG through the t-SNE plots in Figure 3B and 3D for the two samples. These visualizations highlight the resemblance among genes within the same cluster and the disparity between genes from separate clusters, emphasizing the efficacy of cSVG in delineating SVG clusters. 75 Figure 4.3 Spatial domain detection analysis results of the Human cortex data from DLPFC: (A) The detected spatial domains for Sample 151670: Annotated domain used as the ground truth (left), by SpatialPCA (middle), and by our framework (Right). (B) The t-SNE plot for all the SVGs detected by cSVG for Sample 151670. SVGs are colored based on cluster label calculated by cSVG. (C) The spatial domains detected for Sample 151673: Annotated domain used as the ground truth (left), by SpatialPCA (middle), and by our framework (Right). (D) The t-SNE plot for all the SVGs detected by cSVG for Sample 151673. The fact that SVGs with similar color are close to each other but are away from those with different colors indicates accurate gene clustering. (E) Comparison of ARI score (higher the better) between SpatialPCA and our framework based on these two samples. (F) Comparison of PAS score (lower the better) between SpatialPCA and our framework based on these two samples. (G) Comparison of ARI score between SpatialPCA and our framework based on all 12 samples. (H) Comparison of PAS score between SpatialPCA and our framework based on all 12 samples. 76 4.4.2 Analysis of HER2 breast tumor data We applied cSVG to another dataset, the HER2-positive breast tumor data[130], initially com- prising 36 tumor datasets from eight individuals (patients A-H), each consisting of 3 or 6 sections. Following SpatialPCA analysis, we selected the H1 dataset, encompassing 15,030 genes measured across 613 spatial locations. We utilized the pre-processed dataset available in the SpatialPCA repository, containing 10,053 genes across 607 spots, and omitted datasets from other samples due to sparse gene expressions or minimal spot coverage. cSVG identified 268 SVGs grouped into three main clusters, along with a few unique pattern genes. The three main gene clusters consist of 84, 48, and 117 genes respectively. The t-SNE plot of the SVGs (see Figure 4.4B) exhibits a similar pattern for genes within the same cluster in close proximity to each other. Comparing spatial domains detected by SpatialPCA and cSVG based on the ARI value, our framework shows better performance (ARI=0.48) than SpatialPCA (ARI=0.44), thereby improving domain detection accuracy (see 4.4C). 4.4.3 Analysis of the pancreatic cancer data Our final analysis was conducted on a new dataset concerning Pancreatic cancer, obtained from the Henry Ford Health System. This dataset comprises gene expression measurements for 17,943 genes across 3,142 spots, collected from a pancreatic tumor-infested tissue. Following standard filtering and normalization procedures, we applied our cSVG framework to identify SVG clusters and detect spatial domains. Three primary SVG clusters were identified, each showcasing differential expressions in distinct tissue regions. Representative genes from these clusters are depicted in supplementary Figure F.8. For each SVG cluster, we extracted low-dimensional embeddings (top SpatialPCs from Spa- tialPCA), combined them, and applied the Leiden algorithm to identify spatial domains. We repeated this process for the SpatialPCA framework, utilizing the top 20 Spatial PCs from the top 3,000 SVGs. Due to the absence of spot annotations, the exact number of spatial domains is unknown. Therefore, we employed the same Leiden algorithm for the SpatialPCA framework, rather than using algorithms typically utilized by SpatialPCA that require a predetermined number 77 Figure 4.4 Analysis of the HER2 data. (A) Annotated spatial layers considered as ground truth, showcasing 6 known tissue components including cancer-related spots in blue shades. (B) The t-SNE plot illustrating SVGs detected by cSVG, with distinct colors representing distinct SVG-clusters. (C) Spatial domains detected by cSVG and SpatialPCA, with respective ARI scores of 0.48 and 0.44. of domains. The dataset includes a rough annotation (see 4.5A) highlighting the tumor (marked in red) and non-tumor (marked in yellow) regions of interest. Figures 4.5B and 4.5C display the predicted domains by SpatialPCA and cSVG frameworks, respectively. As the precise annotation labels for each spot are unavailable, we cannot compute scores like ARI to compare spatial domain detection accuracy between the two methods. However, through visual inspection, the domains detected by cSVG appear more accurate, effectively capturing the most important regions. Tumor-containing regions identified by cSVG are notably smaller and more accurate compared to those identified by SpatialPCA. The cSVG algorithm identified three primary clusters of spatially variable genes (comprising 3931, 2751, and 1781 genes, respectively) in the pancreatic cancer dataset, each providing sig- 78 nificant biological insights. Supplementary Figure F.8 demonstrates that genes in Cluster 1 are predominantly overexpressed in non-tumor regions, while genes in Clusters 2 and 3 are overex- pressed around tumor regions. Consequently, pathway enrichment analysis reveals that genes in Clusters 2 and 3 are associated with cancer-related pathways, whereas genes in Cluster 1 are not. Notably, Cluster 3 genes are enriched with several T cell-related pathways, a feature not observed in Cluster 2. This suggests that Cluster 3 gene expressions exhibit immune cells spatially located around cancer regions, offering intriguing insights into the biology of cancer. Figure 4.5 Analysis of the Pancreatic cancer data: (A) Rough annotation of the tissue depicting different important regions. Tumor regions are highlighted in red, while non-tumor yet significant regions are marked in yellow. (B) Spatial domains detected by SpatialPCA, utilizing the calculation of spatial PCs and employing the Leiden algorithm for clustering without presupposing the number of clusters. (C) Spatial domain detection by cSVG, involving the aggregation of SVG-cluster specific spatial PCs and utilizing the Leiden algorithm for clustering without presupposing the number of clusters. 4.5 Discussion In recent years, the exploration and analysis of spatial data have reached unprecedented heights, offering diverse insights into biological systems. Central to this endeavor is the identification of SVGs, which serve as pivotal components in understanding tissue organization and function. However, merely detecting SVGs does not inherently yield substantial biological insights. Rather, their significance lies in their dual role: 1) SVGs are used for spatial domain detection and 2) Some SVGs serve as markers for specific cell types. Traditional approaches often struggle to achieve precise spatial domain detection, and discerning cell type-specific SVGs amidst the data noise poses 79 a formidable task. Our proposed framework, cSVG, addresses these challenges to improve spatial domain detection without requiring cell type information. By implementing cSVG, we effectively detect SVG clusters that can be interpreted as clusters of cell type SVGs. Remarkably, this identification is accomplished without the need for complex cell type deconvolution techniques, streamlining the analysis process while providing biologically meaningful insights. Through extensive real data analysis and simulation studies, we have demon- strated the efficacy of cSVG in enhancing spatial domain detection accuracy, validating its utility in spatial transcriptomic analysis. The implementation of cSVG involves two sequential steps, with the initial phase focusing on SVG detection. This step can be performed using any existing SVG detection technique and the alteration can be done without changing any code for cSVG. The detailed step-by-step code and results are provided on our GitHub repository https://github.com/wangjr03/cSVG. Looking ahead, there are ample opportunities to refine and expand upon the cSVG framework. Future enhancements may involve simplifying and scaling up the SVG detection and cluster detection methods to accommodate larger and more complex datasets. Additionally, continued refinement of the framework will enable researchers to extract deeper insights from spatial transcriptomic data, advancing our understanding of tissue biology and disease mechanisms. In conclusion, the use of SVG clusters generated by cSVG represents a crucial advancement in spatial transcriptomic analysis. As we continue to refine and evolve this methodology, it is poised to become an indispensable tool for dissecting the spatial complexity of biological systems and unraveling the intricate interplay between genes, cells, and tissues. Data and code availability All relevant codes for reproducing each step of the real data analysis and simulation study results are available on our GitHub repository: https://github.com/wangjr03/cSVG. The publicly accessible datasets and their sources are provided in the data folder. Please note that the pancreatic cancer data used in this study was received from our collaborators Dr. Nina Steele and Dr. Brian Theisen from the Henry Ford Health System. The Institutional and Review Board approval is 80 maintained for 16150 at Henry Ford Hospital for The Translational and Clinical Research Center Biorepository. The data has not been published and is not publicly available at this time. We received a de-identified dataset for the analysis described in this chapter. 4.6 Supplementary materials 4.6.1 Score test for SVG detection The Gaussian process (GP) regression model which models the normalized gene expression 𝑦 for a given gene using the following multivariate normal model: 𝑝(𝑦|𝜇, 𝜎2 𝑠 , 𝛿, 𝐾) ∼ 𝑁 (𝑦|𝑋 𝛽, 𝜎2 𝑠 𝐾 + 𝛿𝐼), where the covariance term is decomposed into a spatial and a non-spatial part, where 𝛿𝐼 represents the non-spatial part and 𝜎2 𝑠 𝐾 is the spatial covariance matrix, whose (𝑖, 𝑗)𝑡ℎ element in the kernel matrix 𝐾 denotes the spatial similarity between the 𝑖th and 𝑗th spot calculated based on the corresponding coordinates 𝑠𝑖 and 𝑠 𝑗 . The choice of the kernel function plays a very important role in detecting the spatial correlation present in the gene expressions. 𝑋 𝑁×𝑘 represents the covariate matrix, while 𝛽𝑘×1 denotes the array of corresponding coefficients. As we mentioned in the main text, testing if a gene is a SVG is equivalent to testing 𝐻0 : 𝜎2 𝑠 = 0. The null hypothesis 𝐻0 : 𝜎2 𝑠 =0 can be tested using the variance-component score test which is the locally most powerful test[141]. The variance-covariance score statistic is: 𝑄 = (𝑦 − 𝑋 ˆ𝛽)𝑇 𝐾 (𝑦 − 𝑋 ˆ𝛽) where ˆ𝛽 is the MLEs under the null model. Under the null hypothesis, the score statistic 𝑄 follows a mixture of chi-square distributions [142], which can be closely approximated with the computationally efficient Davies’ method[143]. More details about the test is provided in the appendix B. In this chapter, we utilize part of the code provided along with the SKAT paper [142] which uses the same score test for the purpose of rare-variant association testing in genetic data. Kernel functions: A kernel function is defined as a function 𝐾 : X × X → R, where the kernel 81 matrix 𝐾 = (𝑘𝑖,𝑖′)𝑛 𝑖,𝑖′=1 is symmetric and positive semidefinite with 𝑘𝑖,𝑖′ = 𝑘 (𝑠𝑖, 𝑠𝑖′). In this setting, 𝑘 (𝑠𝑖, 𝑠𝑖′) is a measure of similarity between the 𝑖th and the 𝑖′th subject. There are a variety of kernel functions to choose from, and the most simple one is the Linear kernel. The other useful kernels are Polynomial kernel, the Gaussian kernel and the cosine kernel. The functional forms of these kernels are summarized below: • Linear kernel: 𝐾 (𝑠𝑖, 𝑠𝑖′) = 𝑠𝑇 𝑖 𝑠𝑖′ • Polynomial kernel: 𝐾 (𝑠𝑖, 𝑠𝑖′) = (𝑠𝑇 𝑖 𝑠𝑖′ + 𝑐)𝑑, where 𝑐,𝑑 are the free parameters. • Gaussian kernel: 𝐾 (𝑠𝑖, 𝑠𝑖′) = 𝑒𝑥 𝑝{−∥𝑠𝑖 − 𝑠𝑖′ ∥2/𝑙}, where ∥𝑠𝑖 − 𝑠𝑖′ ∥2 = (cid:205)𝑝 𝑗=1 (𝑠𝑖 𝑗 − 𝑠𝑖′ 𝑗 )2 is the Euclidean distance, 𝑙 is a length scale parameter. • Cosine kernel: 𝐾 (𝑠𝑖, 𝑠𝑖′) = 𝑐𝑜𝑠(2𝜋 ∥𝑠𝑖−𝑠𝑖′ ∥2 𝜙 ), where 𝜙 is the periodicity parameter. Choices of kernel functions: We must define the kernel function in order to proceed with the hypothesis testing. As it is unknown which kernel will be best for the test, we employ the score test to evaluate the null hypothesis across various kernel functions with distinct kernel parameters. Gaussian and cosine kernels are typically effective in capturing spatial patterns. Following the method outlined in the SPARK paper [77], we compute five different length scale parameter values for the Gaussian Kernel and five different periodic parameter values for the cosine kernel. We conduct the test across ten different kernels and aggregate the resulting p-values using the Cauchy combination rule[144]. 4.6.2 cSVG algorithm Our model is built upon the Gaussian process model. Thus, we require normalized count matrix data with 𝑚 rows (genes) and 𝑁 columns (spots). We also need the spatial location matrix 𝐿 with 𝑁 rows, 2 columns (𝑋 and 𝑌 coordinates of spots). Step 1: Detect SVGs based on model 4.1 defined in the main text. Suppose there are 𝑚1 SVGs and denote the SVG list as 𝑆𝑦 with |𝑆𝑦 | = 𝑚1 where | · | denotes the cardinality of a set. 82 Step 2: Start with the subset matrix denoted by 𝑀 𝑚1×𝑁 𝑆𝑉𝐺 . Repeat for 𝑗 = 1, · · · , 𝑚1: 1. For the 𝑗th gene in 𝑆𝑦, find genes correlated with it using methods such as (1) SIS[137], (2) marginal correlation test (+ve correlation), (3) Elastic net[134][135], (4) SIS+Enet, or other methods. Denote the correlated gene list as 𝑆 𝑗 . 2. If |𝑆 𝑗 | = 0, then the 𝑗th gene has an unique spatial pattern. If |𝑆 𝑗 | <= 3, then fit all genes in 𝑆 𝑗 as the covariates in model (2) in the main text. If |𝑆 𝑗 | > 3, then get the 𝑘 𝑗 PCs of genes in 𝑆 𝑗 and fit them as covariates in model (2) in the main text. 3.. Using the model 4.2 in the main text, conduct a score test to compute the p-value under 10 different kernels following the SPARK idea, then integrate these 10 p-values using the Cauchy combination rule[144],[145] to get the final p-value. The output includes: a) For each SVG 𝑗, a list of correlated genes in 𝑆 𝑗 ; and b) A list of unique SVGs as defined in 2. Step 3: 1). Based on the output in step 2, a weighted graph structure is created where each SVG is a node. Each SVG 𝑗 in the output list in 3.a) has a common edge with all its dependent genes specified in list 𝑆 𝑗 in the graph. 2). Clusters are determined from the weighted graph structure in 1) using the Leiden community detection algorithm[138] (see Appendix D). 3). The unique genes in output list 3.b) not connected with other nodes in the graph structure are allocated to the singleton set. 4.6.3 Multiplicity correction At each stage of cSVG, our model is applied to each of the 𝑚 genes. To control the false discovery rate (FDR), multiplicity correction is required. We employ the Benjamini–Yekutieli (BY) procedure, known for its effectiveness under arbitrary correlation conditions, to obtain adjusted p-values to claim significant genes [146]. 83 4.6.4 Analysis of human breast cancer dataset We analyzed another spatial transcriptomics dataset of human breast cancer. We obtained the data from the SPARK[77] GitHub repository at https://github.com/xzhoulab/SPARK-Analysis (specifically the ‘Layer2_BC_count_matrix-1.tsv’ file). The dataset can also be downloaded from Spatial Transcriptomics Research (http://www.spatialtranscriptomicsresearch.org). This dataset contains 14,789 genes measured across 251 spots. Our preprocessing involved filtering out genes with expression levels below 10% across the array spots and retaining spots with a total read count > 10. Following these criteria, our analysis focused on a final set of 5,262 genes observed across 250 spots within the breast cancer dataset. Using our framework cSVG, we identified 724 spatially relevant genes in the initial step. In comparison, SPARK[77] detected 290 genes, SPARK-G detected 244 genes, nnSVG[78] detected 592 genes, and SPARK-X[88] detected 901 genes. The 724 genes were subsequently classified into three main clusters(containing 369, 320, and 14 genes respectively) in the second step of cSVG framework, along with a few unique singleton genes. The representative genes from each cluster are visualized in Figure 4.6. Genes in cluster 1 exhibited higher expression levels in the lower tissue region, while cluster 2 genes were overexpressed in the middle tissue region, and cluster 3 genes highlighted the upper tissue region. Additionally, the unique genes showcased distinct expression patterns in the third row. This figure confirms the superior performance of our cSVG method. 84 Figure 4.6 The representative spatial genes in the human breast cancer dataset from SVG clusters detected by cSVG. 85 CHAPTER 5 3D-TF-IMPUTE: A COMPUTATIONAL MODEL TO IMPUTE TF BINDING SITES 5.1 Introduction Typically, the initial phase in understanding gene regulation involves pinpointing the in vivo binding sites across the genome for various transcription factors (TFs). Unfortunately, ChIP-seq data of TF binding are only available for a small subset of TFs in limited cellular contexts. For instance, within the human genome, despite hundreds of recognized TFs, ChIP-seq data is available for fewer than 10 TFs in most cell types cataloged in ENCODE [147]. This limitation is exacerbated in other organisms, such as plants [148]. Hence, there arises an urgent demand for innovative algorithms to predict TF binding sites on a genome-wide scale across diverse cell types. While several methods have been proposed [149],[150],[151] to address this TF binding imputation challenge, The most recent algorithm, Avocado [152], proposes to leverage the information from 6,870 epigenomics and TF binding data together and do imputation for the human genome. Although it can generate a rich model, it is limited for biological applications since it is usually impossible for biologists to gather the huge amounts of data first, especially for studies in other species. As chromatin accessibility data is becoming widely available [153] , other algorithms, with ‘J-Team’ and ‘Yuanfang Guan’ as the winning methods [149],[150] , all share similar ideas: i.e. combine TF binding motif information with in vivo chromatin accessibility to impute TF binding for specific genomic locations, based on models learned from another cell type. To improve the performance, co-occurring motifs in local flanking sequences, which biologically represent the potential co-factors, are integrated into the models as important features. The fundamental restriction of these methods (e.g. ‘J-Team’) is that only local co-occurring motif information along the 1D genome (< 1KB) is used. Since TF bindings happen in 3D chromatin, we need to explore the co-occurring motif patterns of potential co-factors and the TF itself in 3D neighborhoods. This idea has big potential to improve cell type specific TF imputation because: 1) It align with the real biology better. TFs and their co-factors usually form complexes 86 and co-regulate genes in 3D chromatin space, i.e. the well-known transcription factories [154], [155]. 2) Preliminary observation: We grouped 3D interactions (in GM12878 cells) by genes and looked at TF binding frequency in 3D neighborhood for two TFs: CTCF and it’s known co-factor YY1. A substantial number of 3D neighborhoods show binding for both the TFs (See 5.1A), suggesting the TF co-factors patterns in 3D space are critical. 3)Leveraging 3D chromatin information can significantly expand useful features from local flanking sequences to long-range interacting sequences. The percentages of binding sites for TF cofactors CTCF and YY1 are much higher in 3D neighborhood compared to in 1D flanking region (See 5.1B). 4) Based on 3D interactions to genes, we are enabled to explore specific TF co-occurring information in different gene groups. There are many challenges in this work. To efficiently leverage the 3D chromatin information, we need to overcome a series of computational hurdles: 1) Co-factors of TFs in 3D interacting neighborhood are largely unknown ; 2) Combinations of TFs and corresponding co-factors may vary for different gene groups; (See 5.1C) 3) Distributions of chromatin accessibility (e.g. DNase-seq) are unknown in 3D interacting regions vs. non-interacting regions. We propose 3D-TF-IMPUTE, a novel probabilistic model designed to concurrently infer cell type-specific TF binding and uncover all pertinent unknown parameters, as depicted in Figure 5.1C,D, employing an unsupervised approach. The algorithm takes as input: 1)Chromatin acces- sibility data, 2)Motif hit information for the entire genome across various TFs, and 3)Chromatin interaction data. The first key feature of the model is to explore 3D chromatin neighborhoods, instead of 1D flanking regions. The second key feature of the model is to explore gene group specific co-factor association patterns, so that we can borrow information across different genes. Notably, TF-ChIP-Seq data is omitted during model computation and only utilized post hoc to validate the model’s outcomes. 5.2 Input data pre-processing 3D chromatin interaction data is now widely available for different cell types and species [153],[156],[157]. Both Capture Hi-C data and ChIA-PET data based on general factors (e.g. 87 Figure 5.1 Motivation and Overview of 3D-TF-IMPUTE. A) Often TF pairs exhibit tendency to bind simultaneously in 3D neighborhood. Barplot shows frequency of CTCF and YY1 binding sites in chromosome 1 within 3D neighborhood regions, comparing individual binding occurrences versus simultaneous binding. B) Pair-wise binding tendency of CTCF, YY1 in 3D neighborhoods vs. in 1D neighborhood regions. C) Schematic of TF binding imputation. 3D neighborhoods are defined for each genes (doted circle). Genes are grouped based on similar 3D co-factor profile (rectangular line). D) Modeling and inference of TF binding by borrowing information from 3D neighbors and gene groups. Subscript nbr denotes features of 3D neighborhoods. 𝐵: binary label represents TF binding. 𝐷: continuous ATAC-Seq signal at the location. Seq: binary sequence featured by matching with known TF motifs. 𝑆: 3D chromatin structure. 𝑀𝐺: gene group-specific 3D co-factor profile. Posterior probability of 𝐵 is calculated based on observed ATAC-Seq signal and sequence feature of local and 3D neighborhoods given 3D chromatin structure and group-wise co-factor profile. After updating 𝐵, each probability component is recalculated based on updated 𝐵 until convergence. p300, Pol II or H3K4me2) can be used as inputs to the model since they have better resolution. For Hi-C data, we use ATAC-Seq peaks to significantly increase the resolution and cell-type specificity of chromatin contacts, as suggested by previous studies [158]. 88 To enhance scalability, we partitioned the entire genome into consecutive 100-base pair (BP) bins and assessed the likelihood of a binding site occurrence within each bin for every TF. If a bin is predicted to contain a binding site, the specific binding site location can be identified by the presence of an 8mer MOTIF within that bin. 5.3 Method Overview Here we describe our advanced novel sampling framework 3D-TF-IMPUTE, which uses ATAC- Seq, Hi-C, and TF binding motif information datasets as input and predicts the TF binding sites for our TF of interest. First, the whole chromosome is divided into consecutive 100 Base Pairs length segments or genomic bins and 3D-TF-IMPUTE predicts which of these bins contains the potential binding sites for the TF of interest. It narrows down the search to active bins, which serve as the potential region for TF binding sites. Each of these active bins create a 3D neighborhood, which is defined as genomic regions in chromatin interacting hubs(indicated by 3D chromatin interaction data) where at least one gene promoter region is involved. We refine the selection of active bins further by identifying those that overlap with any TF MOTIF. These bins, characterized by such overlap, emerge as highly active and promising candidates for TF binding sites. 3D-TF-IMPUTE starts with this subset of active bins and calculates the probability 𝑃(𝐵 = 1 | 𝐷, 𝐷𝑛𝑏𝑟, 𝑆𝑒𝑞, 𝑆𝑒𝑞𝑛𝑏𝑟, 𝑀𝐺, 𝑆) for all the bins in the active set, where B is the latent variable indicating whether the genomic bin location belongs to a TF binding site or not (B=1 or 0), D represents the ATAC-seq signal at the location, 𝐷𝑛𝑏𝑟 represents the ATAC-seq signals in the 3D neighborhood corresponding to the location, Seq represents the occurrences of motif hits for all available TFs at the location, 𝑆𝑒𝑞𝑛𝑏𝑟 represents the same in the 3D neighborhood, 𝑀𝐺 represents the TF co-occurrence in 3D neighborhoods for a specific gene-group 𝐺, and S represents the 3D chromatin structure. A higher value of 𝑃(𝐵 = 1 | 𝐷, 𝐷𝑛𝑏𝑟, 𝑆𝑒𝑞, 𝑆𝑒𝑞𝑛𝑏𝑟, 𝑀𝐺, 𝑆) indicates higher chances of the TF binding at the active bin of interest. Directly this posterior probability is complex, but it can be decomposed into several parts: 𝑃(𝐵 | 𝐷, 𝐷𝑛𝑏𝑟, 𝑆𝑒𝑞, 𝑆𝑒𝑞𝑛𝑏𝑟, 𝑀𝐺, 𝑆) ∝ 𝑃(𝑆𝑒𝑞, 𝑆𝑒𝑞𝑛𝑏𝑟 | 𝐵, 𝑀𝐺, 𝑆)𝑃(𝐷, 𝐷𝑛𝑏𝑟 | 𝐵, 𝑆)𝑃(𝐵 | 𝑆). The first probability term infers the likelihood of the event based on 3D neighborhood TFs interdependence, the second probability term incorporates the 89 information based on chromatin accecibility and the third probability term simply includes the odds of having a binding site. The first probability term can be further decomposed into 𝑃(𝑆𝑒𝑞, 𝑆𝑒𝑞𝑛𝑏𝑟 | 𝐵, 𝑀𝐺, 𝑆)= 𝑃(𝑆𝑒𝑞𝑛𝑏𝑟 | 𝑆𝑒𝑞, 𝐵 = 1, 𝑀𝐺, 𝑆) 𝑃(𝑆𝑒𝑞 | 𝐵 = 1, 𝑀𝐺, 𝑆) = 𝑃(𝑆𝑒𝑞𝑛𝑏𝑟 | 𝑆𝑒𝑞, 𝐵 = 1, 𝑀𝐺, 𝑆) 𝑃(𝑆𝑒𝑞𝑇 𝐹1 | 𝐵 = | 𝐵 = 1, 𝑀𝐺, 𝑆), where 𝑆𝑒𝑞𝑡×1 array contains in- 1, 𝑀𝐺, 𝑆) 𝑃(𝑆𝑒𝑞𝑇 𝐹2 |𝐵 = 1, 𝑀𝐺, 𝑆) · · · 𝑃(𝑆𝑒𝑞𝑇 𝐹𝑡 formation about 𝑡 known TFs MOTIF binding at the location, 𝑆𝑒𝑞𝑡×1 = [𝑆𝑒𝑞𝑇 𝐹1 , 𝑆𝑒𝑞𝑇 𝐹2 , · · · , 𝑆𝑒𝑞𝑇 𝐹𝑡 ]. The second "=" holds with the assumption that given the binding status at the location and the TF-TF co-occurrence information provided by 𝑀𝐺, 𝑆𝑒𝑞𝑇 𝐹𝑖 , 𝑖 = 1, 2, ..𝑡 are independently distributed. While predicting TF binding sites for 𝑇 𝐹𝑗 , 3D-TF-IMPUTE algorithm estimates (cid:206)𝑡 𝑖=1𝑖≠ 𝑗 𝑃(𝑆𝑒𝑞𝑇 𝐹𝑖 |𝐵 = 1, 𝑀𝐺, 𝑆) by (cid:205)𝑡 𝑖=1 𝑆𝑒𝑞𝑇 𝐹𝑖 ×𝑀𝐺 𝑗𝑖 (cid:205)𝑖=1𝑡 𝑀𝐺 𝑗𝑖 , where 𝑀𝐺 𝑗𝑖 is the 𝑖𝑡ℎ component of the row in 𝑀𝐺 corresponding to the TF of interest 𝑇 𝐹𝑗 . The second probability term is calculated based on a kernel density estimation(KDE) of the joint distribution of 𝐷, 𝐷𝑛𝑏𝑟. The third probability term is calculated based on the proportion of active bins with predicted binding sites. In the iterative process of 3D-TF-IMPUTE, the 𝐵 values are initialized for all active bins using TF MOTIF data. Subsequently, utilizing these initialized 𝐵 values, the probability terms 1, 2, and 3 are computed. Based on the resulting final probability 𝑃(𝐵 | 𝐷, 𝐷𝑛𝑏𝑟, 𝑆𝑒𝑞, 𝑆𝑒𝑞𝑛𝑏𝑟, 𝑀𝐺, 𝑆) values, the 𝐵 values for the active bins are sampled, initiating the next iteration. This iterative cycle continues until the probability values converge across all active bins. The schematic overview of the sampling framework is provided in figure 5.1D and the detailed algorithm is provided here: Set ch=chromosome1, TF=TF1,num𝑖𝑡𝑒𝑟=1000 Prepare ch specific ATAC-seq data Prepare ch specific MOTIF data. Prepare ch specific 3D neighbourhood data Prepare TF specific MOTIF data iter 1:num𝑖𝑡𝑒𝑟: 90 Initialize latent variable values b1 specific to the TF for all active bins. Prepare P2 function input based on ATAC and initialized latent variable values. b1 in ActiveBinList: calculate 𝑃1(1), 𝑃1(0) calculate 𝑃2(1), 𝑃2(0) calculate 𝑃3(1), 𝑃3(0) calculate 𝑃(𝐵 = 1|𝐷, 𝐷𝑛𝑏𝑟, 𝑆𝑒𝑞, 𝑆𝑒𝑞𝑛𝑏𝑟, 𝑠) = 𝑃1 (1)∗𝑃2 (1)∗𝑃3 (1) 𝑃1 (1)∗𝑃2 (1)∗𝑃3 (1)+𝑃1 (0)∗𝑃2 (0)∗𝑃3 (0) Sample B based on 𝑃(𝐵 = 1|𝐷, 𝐷𝑛𝑏𝑟, 𝑆𝑒𝑞, 𝑆𝑒𝑞𝑛𝑏𝑟, 𝑠), update b1 where, 𝑃1(1) = 𝑃(𝑆𝑒𝑞, 𝑆𝑒𝑞𝑛𝑏𝑟 |𝐵 = 1, 𝑀𝐺, 𝑆), 𝑃1(0) = 𝑃(𝑆𝑒𝑞, 𝑆𝑒𝑞𝑛𝑏𝑟 |𝐵 = 0, 𝑀𝐺, 𝑆) 𝑃2(1) = 𝑃(𝐷, 𝐷𝑛𝑏𝑟 |𝐵 = 1, 𝑆), 𝑃2(0) = 𝑃(𝐷, 𝐷𝑛𝑏𝑟 |𝐵 = 0, 𝑆) 𝑃3(1) = 𝑃(𝐵 = 1|𝑆), 𝑃3(0) = 𝑃(𝐵 = 0|𝑆) 5.4 Results In cell types where TF-ChIP-seq data is available, the data is often noisy or incomplete. Since there is no ground truth available, we evaluate the performance of our predictions by examining the number of overlaps with TF ChIP-seq binding sites. Specifically, we compare the predicted binding sites of CTCF, YY1, and RUNX3 in GM12878 cell type with the ChIP-Seq data. Figure 5.2 showcases real examples on the UCSC genome browser, where black tracks represent TF ChIP-Seq peaks and predicted binding sites are marked in red. These examples illustrate the degree of overlap between predicted binding sites and TF ChIP-Seq peaks. Figure 5.3 presents the ROC and Precision-Recall curves for these three key TFs: CTCF, YY1, and RUNX3. The high quality of TF MOTIF data for CTCF leads to highly accurate final predictions as evident by the high area under the ROC and PR curve. On the other hand, the MOTIF data quality for YY1 and RUNX3 is extremely low. Nonetheless, their predictions remain reasonable. Overall, the results demonstrate the effectiveness of the 3D-TF-IMPUTE algorithm in accurately predicting TF binding sites. 91 Figure 5.2 Examples of predicted TF binding sites by 3D-TF-IMPUTE overlapped with significant TF peaks shown in the genome browser. The tracks in black correspond to the transcription factors A) CTCF, B) RUNX3, C) YY1. The predictions by 3D-TF-IMPUTE are highlighted in red. 5.5 Discussion In conclusion, while existing computational methods for TF binding have predominantly relied on supervised setups, our proposed approach marks a significant departure by introducing an unsupervised probabilistic model. Unlike traditional methods, which often encounter challenges when training data from one cell type fails to accurately represent TF binding in others or data quality is low, our method circumvents such issues. Moreover, in contrast to existing methods that suffer from lengthy execution times, our approach 92 Figure 5.3 Evaluation of 3D-TF-IMPUTE performance. A) ROC curve demonstrating the prediction accuracy for significant TFs. B) Precision-Recall curve demonstrating the prediction accuracy of the same TFs. showcases remarkable efficiency by running in parallel, thus substantially decreasing preprocessing and computation time. Remarkably, 3D-TF-IMPUTE attains a notably low false discovery rate, particularly in cases where input data quality is high. Even in instances of subpar motif data quality, 3D-TF-IMPUTE adeptly utilizes information from co-factors to predict high-quality TF binding sites, thereby aug- menting prediction reliability, as illustrated by the example of YY1( see figure 5.3). While our approach represents a pioneering advancement, there remains room for enhancement. Addressing the impact of low-quality motif data on prediction accuracy could be a promising avenue for future refinement. Incorporating mechanisms to update motif information based on co-factor 93 interactions may prove instrumental in bolstering prediction power. Importantly, our work underscores the pivotal role of three-dimensional chromatin neighbor- hoods in enhancing TF binding prediction accuracy. Looking ahead, this concept holds significant promise for inspiring further methodological developments, with potential applications across diverse research endeavors focused on predicting TF binding sites. 5.6 Data and Code availability The code to run this model is available at GitHub: https://github.com/wangjr03/3DTFImpute. The input dataset ATAC-Seq and ChIP-Seq for GM12878 cell type is downloaded from ENCODE [147]. The Motif data downloaded from [159] https://www.internationalgenome.org and the Hi-C data from [160] https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE63525. 94 CHAPTER 6 CONCLUSION AND FUTURE WORK The cSVG model introduced in Chapter 4 represents a significant advancement in spatial domain detection, substantially improving accuracy compared to existing methods. However, there are opportunities to further enhance its performance. One area for improvement is addressing the challenge of gene separability observed in t-SNE plots of LIBD data samples 5, 6, 7, and 8 (see F.6). In these instances, genes may form large clusters with spatial patterns overlapping multiple distinct clusters. This suggests the presence of genes with mixed spatial features, requiring a more nuanced approach to modeling spatial variability. The integration of composite kernel functions, as facilitated by the BayesKAT model, could offer a more efficient means of capturing the spatial complexity of these genes. Furthermore, enhancing the scalability of cSVG cluster identification represents a significant area for improvement in future research. While we have demonstrated the utility of SVG clusters in improving downstream analysis, the current approach may be resource-intensive for larger datasets. Leveraging the BayesKAT model to identify gene clusters based on selected composite kernel weights offers a promising solution to this challenge, streamlining the process of SVG cluster identification and enabling more scalable analysis. These future directions hold the potential to address key questions in the field of spatial transcrip- tomics and further enhance our understanding of spatial gene expression patterns. By refining our methodologies and leveraging advanced computational techniques, we aim to unlock new insights into the spatial organization of gene expression and its implications for biological processes. The primary objective behind developing the BayesKAT algorithm was to assess the joint association of a single group of SNPs or gene expressions with the phenotype of interest. While the effectiveness and efficiency of BayesKAT in prioritizing and ranking significant SNP groups were demonstrated through pathway-wise or gene-wise analyses in the chapter, from a statistical perspective, determining a threshold that effectively controls false discovery rates poses challenges. Additionally, the current version of BayesKAT relies on approximations, which may not perform 95 optimally with small sample sizes. As a response, I propose an updated version of the BayesKAT model which improves the model structure. Also, I intend to enhance its speed and performance by implementing variational inference techniques[161]. As defined in 2.3 in chapter 2, a slight reparametrization of the BayesKAT model has the data distribution: 𝑦|𝜃 ∼ 𝑁 (𝑋 𝛽, 𝜏𝐾𝑐 + 𝜎2𝐼)), (6.1) where the composite kernel 𝐾𝑐 = (cid:205)3 𝑖=1 𝜌𝑖𝐾𝑖. Now we use one indicator variable 𝛿 to indicate whether or not 𝜏 is 0. That means, if 𝛿 = 0, it indicates that 𝜏 = 0, i.e there is no association between the group of SNPs and the phenotype. We treat the indicator variable 𝛿 as Bernoulli random trials with success rate 𝑝(𝛿 = 1) = 𝜋. 𝜏 follows an exponential(𝜆𝑠) prior distribution when 𝛿 = 1 and equals to 0 when 𝛿 = 0. This is similar to the spike and slab prior [162], where the slab distribution is generally normal as it is normally used for variable selection. We might set specific values for the hyperparameters of the model 𝜋, 𝜆𝑠 or different prior distributions can be assumed on them. The posterior distribution for all the parameters can be estimated using the variational inference technique[161] assuming a mean-field variational family and finally, the posterior inclusion probability (PIP) can be defined as: 𝑃𝐼 𝑃 = 𝑝(𝛿 = 1 | 𝑋, 𝑦). Based on the PIP value it would be possible to infer the strength of the association. Furthermore, in the case of multiple testing(same phenotype, different sets of SNPs), the problem can be written as a variable selection problem and solved following the sparse Bayesian variable selection technique outlined in literature [163], [164]. 96 [1] [2] [3] [4] [5] [6] [7] [8] [9] BIBLIOGRAPHY Lars Bertram and Rudolph E. Tanzi. Thirty years of alzheimer’s disease genetics: the implications of systematic meta-analyses. Nature Reviews Neuroscience, 9(10):768–778, Oct 2008. Timothy J. Vyse and John A. Todd. Genetic analysis of autoimmune disease. Cell, 85(3):311– 318, May 1996. Emil Uffelmann, Qin Qin Huang, Nchangwi Syntia Munung, Jantina de Vries, Yukinori Okada, Alicia R. Martin, Hilary C. Martin, Tuuli Lappalainen, and Danielle Posthuma. Genome-wide association studies. Nature Reviews Methods Primers, 1(1):59, Aug 2021. Evan A Boyle, Yang I Li, and Jonathan K Pritchard. An expanded view of complex traits: from polygenic to omnigenic. Cell, 169(7):1177–1186, 2017. Eric S Lander. 470(7333):187–197, 2011. Initial impact of the sequencing of the human genome. Nature, Andre Franke, Dermot PB McGovern, Jeffrey C Barrett, Kai Wang, Graham L Radford- Smith, Tariq Ahmad, Charlie W Lees, Tobias Balschun, James Lee, Rebecca Roberts, et al. Genome-wide meta-analysis increases to 71 the number of confirmed crohn’s disease susceptibility loci. Nature genetics, 42(12):1118–1125, 2010. Benjamin F Voight, Laura J Scott, Valgerdur Steinthorsdottir, Andrew P Morris, Chris- tian Dina, Ryan P Welch, Eleftheria Zeggini, Cornelia Huth, Yurii S Aulchenko, Gudmar Thorleifsson, et al. Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis. Nature genetics, 42(7):579–589, 2010. Laura I Furlong. Human diseases through the lens of network biology. Trends Genet, 29(3):150–159, December 2012. Aravinda Chakravarti and Tychele N Turner. Revealing rate-limiting steps in complex disease biology: The crucial importance of studying rare, extreme-phenotype families. Bioessays, 38(6):578–586, April 2016. [10] Lydia Coulter Kwee, Dawei Liu, Xihong Lin, Debashis Ghosh, and Michael P Epstein. A powerful and flexible multilocus association test for quantitative traits. The American Journal of Human Genetics, 82(2):386–397, 2008. [11] Dawei Liu, Xihong Lin, and Debashis Ghosh. Semiparametric regression of multidimen- sional genetic pathway data: Least-squares kernel machines and linear mixed models. Bio- metrics, 63(4):1079–1088, 2007. [12] Michael C. Wu, Seunggeun Lee, Tianxi Cai, Yun Li, Michael Boehnke, and Xihong Lin. Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics, 89(1):82–93, 2011. [13] Shaoyu Li and Yuehua Cui. Gene-centric gene–gene interaction: A model-based kernel machine method. The Annals of Applied Statistics, 6(3):1134 – 1161, 2012. 97 [14] Dawei Liu, Debashis Ghosh, and Xihong Lin. Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC bioinformatics, 9:1–11, 2008. [15] Rachel Marceau, Wenbin Lu, Shannon Holloway, Michèle M Sale, Bradford B Worrall, Stephen R Williams, Fang-Chi Hsu, and Jung-Ying Tzeng. A fast multiple-kernel method with applications to detect gene-environment interaction. Genetic epidemiology, 39(6):456– 468, 2015. [16] Michael C Wu, Peter Kraft, Michael P Epstein, Deanne M Taylor, Stephen J Chanock, David J Hunter, and Xihong Lin. Powerful snp-set analysis for case-control genome-wide association studies. The American Journal of Human Genetics, 86(6):929–942, 2010. [17] Seunggeun Lee, Zhangchen Zhao, with contributions from Larisa Miropolsky, and Michael Wu. SKAT: SNP-Set (Sequence) Kernel Association Test, 2023. R package version 2.2.5. [18] Michael C Wu, Arnab Maity, Seunggeun Lee, Elizabeth M Simmons, Quaker E Harmon, Xinyi Lin, Stephanie M Engel, Jeffrey J Molldrem, and Paul M Armistead. Kernel machine snp-set testing under multiple candidate kernels. Genetic epidemiology, 37(3):267–275, 2013. [19] Jennifer Wessel and Nicholas J Schork. Generalized genomic distance–based regression methodology for multilocus association analysis. The American Journal of Human Genetics, 79(5):792–806, 2006. [20] Xinyi Lin, Tianxi Cai, Michael C Wu, Qian Zhou, Geoffrey Liu, David C Christiani, and Xihong Lin. Kernel machine snp-set analysis for censored survival outcomes in genome-wide association studies. Genetic epidemiology, 35(7):620–631, 2011. [21] Tao He, Shaoyu Li, Ping-Shou Zhong, and Yuehua Cui. An optimal kernel-based u-statistic method for quantitative gene-set association analysis. Genetic epidemiology, 43(2):137–149, 2019. [22] the ADNI team. ADNIMERGE: Alzheimer’s Disease Neuroimaging Initiative, 2023. R package version 0.0.1. [23] Minoru Kanehisa and Susumu Goto. Kegg: kyoto encyclopedia of genes and genomes. Nucleic acids research, 28(1):27–30, 2000. [24] Minoru Kanehisa. Toward understanding the origin and evolution of cellular organisms. Protein Science, 28(11):1947–1951, 2019. [25] Minoru Kanehisa, Miho Furumichi, Yoko Sato, Masayuki Kawashima, and Mari Ishiguro- Watanabe. Kegg for taxonomy-based analysis of pathways and genomes. Nucleic Acids Research, 51(D1):D587–D592, 2023. [26] Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the american statistical association, 90(430):773–795, 1995. 98 [27] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of State Calculations by Fast Computing Machines. The Journal of Chemical Physics, 21(6):1087–1092, 12 2004. [28] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 04 1970. [29] Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian data analysis. Chapman and Hall/CRC, 1995. [30] Christian P Robert, George Casella, and George Casella. Monte Carlo statistical methods, volume 2. Springer, 1999. [31] Siddhartha Chib and Ivan Jeliazkov. Marginal likelihood from the metropolis–hastings output. Journal of the American statistical association, 96(453):270–281, 2001. [32] Florian Hartig, Francesco Minunno, and Stefan Paul. BayesianTools: General-Purpose MCMC and SMC Samplers and Tools for Bayesian Statistics, 2023. R package version 0.1.8. [33] Heikki Haario, Eero Saksman, and Johanna Tamminen. An adaptive metropolis algorithm. Bernoulli, pages 223–242, 2001. [34] Vivekananda Roy. Convergence diagnostics for markov chain monte carlo. Annual Review of Statistics and Its Application, 7:387–412, 2020. [35] Andrew Gelman and Donald B Rubin. Inference from iterative simulation using multiple sequences. Statistical science, pages 457–472, 1992. [36] Douglas Bates, Katharine M. Mullen, John C. Nash, and Ravi Varadhan. minqa: Derivative- Free Optimization Algorithms by Quadratic Approximation, 2022. R package version 1.2.5. [37] Michael Evans and Tim Swartz. Methods for approximating integrals in statistics with special emphasis on bayesian integration problems. Statistical science, pages 254–272, 1995. [38] Luke Tierney and Joseph B Kadane. Accurate approximations for posterior moments and marginal densities. Journal of the american statistical association, 81(393):82–86, 1986. [39] Robert E Kass. The validity of posterior expansions based on laplace’s method. Bayesian and likelihood methods in statistics and econometrics, pages 473–487, 1990. [40] Donna K Pauler, Jonathan C Wakefield, and Robert E Kass. Bayes factors and approxi- mations for variance component models. Journal of the American Statistical Association, 94(448):1242–1253, 1999. [41] Shaun Purcell, Benjamin Neale, Kathe Todd-Brown, Lori Thomas, Manuel A R Ferreira, David Bender, Julian Maller, Pamela Sklar, Paul I W de Bakker, Mark J Daly, and Pak C Sham. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet, 81(3):559–575, July 2007. 99 [42] Matteo Sesia, Chiara Sabatti, and Emmanuel J Candès. Gene hunting with hidden markov model knockoffs. Biometrika, 106(1):1–18, 2019. [43] Andrew J Saykin, Li Shen, Tatiana M Foroud, Steven G Potkin, Shanker Swaminathan, Sungeun Kim, Shannon L Risacher, Kwangsik Nho, Matthew J Huentelman, David W Craig, et al. Alzheimer’s disease neuroimaging initiative biomarkers as quantitative phenotypes: Genetics core aims, progress, and plans. Alzheimer’s & Dementia, 6(3):265–273, 2010. [44] Jennifer Harrow, Adam Frankish, Jose M Gonzalez, Electra Tapanari, Mark Diekhans, Felix Kokocinski, Bronwen L Aken, Daniel Barrell, Amonida Zadissa, Stephen Searle, If Barnes, Alexandra Bignell, Veronika Boychenko, Toby Hunt, Mike Kay, Gaurab Mukherjee, Jeena Rajan, Gloria Despacio-Reyes, Gary Saunders, Charles Steward, Rachel Harte, Michael Lin, Cédric Howald, Andrea Tanzer, Thomas Derrien, Jacqueline Chrast, Nathalie Walters, Suganthi Balasubramanian, Baikang Pei, Michael Tress, Jose Manuel Rodriguez, Iakes Ezkurdia, Jeltje van Baren, Michael Brent, David Haussler, Manolis Kellis, Alfonso Valencia, Alexandre Reymond, Mark Gerstein, Roderic Guigó, and Tim J Hubbard. GENCODE: the reference human genome annotation for the ENCODE project. Genome Res, 22(9):1760– 1774, September 2012. [45] Bin Zhang and Steve Horvath. A general framework for weighted gene co-expression network analysis. Statistical applications in genetics and molecular biology, 4(1), 2005. [46] Peter Langfelder and Steve Horvath. Wgcna: an r package for weighted correlation network analysis. BMC bioinformatics, 9(1):1–13, 2008. [47] Hao Wang, Binbin Huang, and Jianrong Wang. Predict long-range enhancer regulation based on protein–protein interactions between transcription factors. Nucleic Acids Research, 49(18):10347–10368, 2021. [48] Pinki Munot, Dawn E Saunders, Dianna M Milewicz, Ellen S Regalado, John R Ostergaard, Kees P Braun, Timothy Kerr, Klaske D Lichtenbelt, Sunny Philip, Christopher Rittey, et al. A novel distinctive cerebrovascular phenotype is associated with heterozygous arg179 acta2 mutations. Brain, 135(8):2506–2514, 2012. [49] Kwangsik Nho, Jason Corneveaux, Sungeun Kim, Hai Lin, Shannon Risacher, Li Shen, S Swaminathan, V Ramanan, Y Liu, T Foroud, Mark Inlow, Ashley Siniard, Rebecca Reiman, P Aisen, Ronald Petersen, R Green, Clifford Jack, Michael Weiner, C Baldwin, and Andrew Saykin. Whole-exome sequencing and imaging genetics identify functional variants for rate of change in hippocampal volume in mild cognitive impairment. Molecular Psychiatry, 18:781–787, 04 2013. [50] Michelle C do Rosario, Guillermo Rodriguez Bey, Bruce Nmezi, Fang Liu, Talia Oranburg, Ana SA Cohen, Keith A Coffman, Maya R Brown, Kirill Kiselyov, Quinten Waisfisz, et al. Variants in the zinc transporter tmem163 cause a hypomyelinating leukodystrophy. Brain, 145(12):4202–4209, 2022. [51] Bernard Ng, Charles C White, Hans-Ulrich Klein, Solveig K Sieberts, Cristin McCabe, Ellis Patrick, Jishu Xu, Lei Yu, Chris Gaiteri, David A Bennett, et al. An xqtl map integrates the 100 genetic architecture of the human brain’s transcriptome and epigenome. Nature neuroscience, 20(10):1418–1426, 2017. [52] Jasper D Sluimer, Wiesje M van der Flier, Giorgos B Karas, Nick C Fox, Philip Scheltens, Frederik Barkhof, and Hugo Vrenken. Whole-brain atrophy rate and cognitive decline: longitudinal mr study of memory clinic patients. Radiology, 248(2):590–598, 2008. [53] Ferdinando Squitieri, Milena Cannella, Maria Simonelli, Jenny Sassone, Tiziana Martino, Eugenio Venditti, Andrea Ciammola, Claudio Colonnese, Luigi Frati, and Andrea Cia- rmiello. Distinct brain volume changes correlating with clinical stage, disease progression rate, mutation size, and age at onset prediction as early biomarkers of brain atrophy in huntington’s disease. CNS neuroscience & therapeutics, 15(1):1–11, 2009. [54] DM Mezzapesa, A Ceccarelli, F Dicuonzo, A Carella, MF De Caro, M Lopez, V Samarelli, P Livrea, and IL Simone. Whole-brain and regional brain atrophy in amyotrophic lateral sclerosis. American Journal of Neuroradiology, 28(2):255–259, 2007. [55] Emma J Burton, Ian G McKeith, David J Burn, E David Williams, and John T O’Brien. Cerebral atrophy in parkinson’s disease with and without dementia: a comparison with alzheimer’s disease, dementia with lewy bodies and controls. Brain, 127(4):791–800, 2004. [56] Philip R Jansen, Mats Nagel, Kyoko Watanabe, Yongbin Wei, Jeanne E Savage, Christi- aan A de Leeuw, Martijn P van den Heuvel, Sophie van der Sluis, and Danielle Posthuma. Genome-wide meta-analysis of brain volume identifies genomic loci and genes shared with intelligence. Nature communications, 11(1):5606, 2020. [57] Tong Teng, Jie Chen, Yehong Zhang, and Bryan Kian Hsiang Low. Scalable variational bayesian kernel selection for sparse gaussian process regression. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04):5997–6004, Apr. 2020. [58] Mehmet Gönen and Ethem Alpaydın. Multiple kernel learning algorithms. The Journal of Machine Learning Research, 12:2211–2268, 2011. [59] Cong Han and Bradley P Carlin. Markov chain monte carlo methods for computing bayes fac- tors: A comparative review. Journal of the American Statistical Association, 96(455):1122– 1132, 2001. [60] Thomas J DiCiccio, Robert E Kass, Adrian Raftery, and Larry Wasserman. Computing bayes factors by combining simulation and asymptotic approximations. Journal of the American Statistical Association, 92(439):903–915, 1997. [61] Sandip Sinharay and Hal S Stern. An empirical comparison of methods for computing bayes factors in generalized linear mixed models. Journal of Computational and Graphical Statistics, 14(2):415–435, 2005. [62] Nial Friel and Anthony Pettitt. Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society Series B, 70:589–607, 07 2008. 101 [63] Bradley Efron. Large-scale inference: empirical Bayes methods for estimation, testing, and prediction, volume 1. Cambridge University Press, 2012. [64] Jon Wakefield. Bayes factors for genome-wide association studies: comparison with p- values. Genetic Epidemiology, 33(1):79–86, 2009. [65] Matthew Stephens and David J Balding. Bayesian statistical methods for genetic association studies. Nature Reviews Genetics, 10(10):681–690, 2009. [66] [67] Jian Zhou and Olga G Troyanskaya. Predicting effects of noncoding variants with deep learning–based sequence model. Nature methods, 12(10):931–934, 2015. Jian Zhou, Chandra L Theesfeld, Kevin Yao, Kathleen M Chen, Aaron K Wong, and Olga G Troyanskaya. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nature genetics, 50(8):1171–1179, 2018. [68] David R Kelley, Jasper Snoek, and John L Rinn. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome research, 26(7):990– 999, 2016. [69] Žiga Avsec, Vikram Agarwal, Daniel Visentin, Joseph R Ledsam, Agnieszka Grabska- Barwinska, Kyle R Taylor, Yannis Assael, John Jumper, Pushmeet Kohli, and David R Kelley. Effective gene expression prediction from sequence by integrating long-range interactions. Nature methods, 18(10):1196–1203, 2021. [70] Sanja Vickovic, Gökcen Eraslan, Fredrik Salmén, Johanna Klughammer, Linnea Stenbeck, Denis Schapiro, Tarmo Äijö, Richard Bonneau, Ludvig Bergenstråhle, José Fernandéz Navarro, et al. High-definition spatial transcriptomics for in situ tissue profiling. Nature methods, 16(10):987–990, 2019. [71] Samuel G Rodriques, Robert R Stickels, Aleksandrina Goeva, Carly A Martin, Evan Murray, Charles R Vanderburg, Joshua Welch, Linlin M Chen, Fei Chen, and Evan Z Macosko. Slide- seq: A scalable technology for measuring genome-wide expression at high spatial resolution. Science, 363(6434):1463–1467, 2019. [72] Robert R Stickels, Evan Murray, Pawan Kumar, Jilong Li, Jamie L Marshall, Daniela J Di Bella, Paola Arlotta, Evan Z Macosko, and Fei Chen. Highly sensitive spatial transcrip- tomics at near-cellular resolution with slide-seqv2. Nature biotechnology, 39(3):313–319, 2021. [73] Patrick AP Moran. Notes on continuous stochastic phenomena. Biometrika, 37(1/2):17–23, 1950. [74] Robert C Geary. The contiguity ratio and statistical mapping. The incorporated statistician, 5(3):115–146, 1954. [75] Valentine Svensson, Sarah A. Teichmann, and Oliver Stegle. Spatialde: identification of spatially variable genes. Nature Methods, 15(5):343–346, May 2018. 102 [76] Ilia Kats, Roser Vento-Tormo, and Oliver Stegle. Spatialde2: Fast and localized variance component analysis of spatial transcriptomics. bioRxiv, 2021. [77] Shiquan Sun, Jiaqiang Zhu, and Xiang Zhou. Statistical analysis of spatial expression patterns for spatially resolved transcriptomic studies. Nature Methods, 17(2):193–200, Feb 2020. [78] Lukas M. Weber, Arkajyoti Saha, Abhirup Datta, Kasper D. Hansen, and Stephanie C. Hicks. nnsvg for the scalable identification of spatially variable genes using nearest-neighbor gaussian processes. Nature Communications, 14(1):4059, Jul 2023. [79] Qiwei Li, Minzhe Zhang, Yang Xie, and Guanghua Xiao. Bayesian modeling of spatial molecular profiling data via Gaussian process. Bioinformatics, 37(22):4129–4136, 06 2021. [80] Daniel Edsgärd, Per Johnsson, and Rickard Sandberg. Identification of spatial expression trends in single-cell gene expression data. Nature Methods, 15(5):339–342, May 2018. [81] Ke Zhang, Wanwan Feng, and Peng Wang. Identification of spatially variable genes with graph cuts. Nature Communications, 13(1):5488, Sep 2022. [82] Alma Andersson and Joakim Lundeberg. sepal: identifying transcript profiles with spatial patterns by diffusion-based modeling. Bioinformatics, 37(17):2644–2650, 03 2021. [83] Junjie Zhu and Chiara Sabatti. feature learning. bioRxiv, 2020. Integrative spatial single-cell analysis with graph-based [84] Brendan F Miller, Dhananjay Bambah-Mukku, Catherine Dulac, Xiaowei Zhuang, and Jean Fan. Characterizing spatial gene expression heterogeneity in spatially resolved single-cell transcriptomic data with nonuniform cellular densities. Genome Res, 31(10):1843–1855, May 2021. [85] Ruben Dries, Qian Zhu, Rui Dong, Chee-Huat Linus Eng, Huipeng Li, Kan Liu, Yuntian Fu, Tianxiao Zhao, Arpan Sarkar, Feng Bao, et al. Giotto: a toolbox for integrative analysis and visualization of spatial expression data. Genome biology, 22:1–31, 2021. [86] Yuhan Hao, Stephanie Hao, Erica Andersen-Nissen, William M Mauck, Shiwei Zheng, Andrew Butler, Maddie J Lee, Aaron J Wilk, Charlotte Darby, Michael Zager, et al. Integrated analysis of multimodal single-cell data. Cell, 184(13):3573–3587, 2021. [87] Giovanni Palla, Hannah Spitzer, Michal Klein, David Fischer, Anna Christina Schaar, Louis Benedikt Kuemmerle, Sergei Rybakov, Ignacio L. Ibarra, Olle Holmberg, Isaac Vir- shup, Mohammad Lotfollahi, Sabrina Richter, and Fabian J. Theis. Squidpy: a scalable framework for spatial omics analysis. Nature Methods, 19(2):171–178, February 2022. [88] Jiaqiang Zhu, Shiquan Sun, and Xiang Zhou. Spark-x: non-parametric modeling enables scalable and robust detection of spatial expression patterns for large spatial transcriptomic studies. Genome Biology, 22(1):184, Jun 2021. [89] Rui Jiang, Zhen Li, Yuhang Jia, Siyu Li, and Shengquan Chen. SINFONIA: Scalable identification of spatially variable genes for deciphering spatial domains. Cells, 12(4), February 2023. 103 [90] Nuha BinTayyash, Sokratia Georgaka, S T John, Sumon Ahmed, Alexis Boukouvalas, James Hensman, and Magnus Rattray. Non-parametric modelling of temporal and spatial counts data from RNA-seq experiments. Bioinformatics, 37(21):3788–3795, November 2021. [91] Peiyao Zhao, Jiaqiang Zhu, Ying Ma, and Xiang Zhou. Modeling zero inflation is not necessary for spatial transcriptomics. Genome Biology, 23(1):118, May 2022. [92] Xi Jiang, Guanghua Xiao, and Qiwei Li. A bayesian modified ising model for identi- fying spatially variable genes from spatial transcriptomics data. Statistics in Medicine, 41(23):4647–4665, 2022. [93] Aaron TL Lun, Davis J McCarthy, and John C Marioni. A step-by-step workflow for low-level analysis of single-cell rna-seq data with bioconductor. F1000Research, 5, 2016. [94] Davis J McCarthy, Kieran R Campbell, Aaron TL Lun, and Quin F Wills. Scater: pre- processing, quality control, normalization and visualization of single-cell rna-seq data in r. Bioinformatics, 33(8):1179–1186, 2017. [95] Jinge Yu and Xiangyu Luo. accounting for excess zeros. Bioinformatics, 38(17):4135–4144, 07 2022. Identification of cell-type-specific spatially variable genes [96] Xin Yuan, Yanran Ma, Ruitian Gao, Shuya Cui, Yifan Wang, Botao Fa, Shiyang Ma, Ting Wei, Shuangge Ma, and Zhangsheng Yu. Heartsvg: a fast and accurate method for spatially variable gene identification in large-scale spatial transcriptomic data. bioRxiv, 2023. [97] Minsheng Hao, Kui Hua, and Xuegong Zhang. SOMDE: a scalable method for identifying spatially variable genes with self-organizing map. Bioinformatics, 37(23):4392–4398, 06 2021. [98] [99] Julien Moehlin, Bastien Mollet, Bruno Maria Colombo, and Marco Antonio Mendoza-Parra. Inferring biologically relevant molecular tissue substructures by agglomerative clustering of digitized spatial transcriptomes with multilayer. Cell Systems, 12(7):694–705.e3, 2021. Juexin Wang, Jinpu Li, Skyler T Kramer, Li Su, Yuzhou Chang, Chunhui Xu, Michael T Eadon, Krzysztof Kiryluk, Qin Ma, and Dong Xu. Dimension-agnostic and granularity- based spatially variable gene identification using bsp. Nature Communications, 14(1):7367, 2023. [100] Yaowu Liu, Sixing Chen, Zilin Li, Alanna C Morrison, Eric Boerwinkle, and Xihong Lin. Acat: a fast and powerful p value combination method for rare-variant analysis in sequencing studies. The American Journal of Human Genetics, 104(3):410–421, 2019. [101] Natesh S. Pillai and Xiao-Li Meng. An unexpected encounter with cauchy and lévy, 2015. [102] Daowen Zhang and Xihong Lin. Hypothesis testing in semiparametric additive mixed models. Biostatistics, 4(1):57–74, January 2003. [103] Faming Liang. A double metropolis–hastings sampler for spatial models with intractable normalizing constants. Journal of Statistical Computation and Simulation, 80:1007–1022, 09 2010. 104 [104] Lulu Shang and Xiang Zhou. Spatially aware dimension reduction for spatial transcriptomics. Nature Communications, 13(1):7203, Nov 2022. [105] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995. [106] John D Storey and Robert Tibshirani. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, 100(16):9440–9445, 2003. [107] Yoav Benjamini and Daniel Yekutieli. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29(4):1165 – 1188, 2001. [108] William N Venables and Brian D Ripley. Modern applied statistics with S-PLUS. Springer Science & Business Media, 2013. [109] Jiaqiang Zhu, Lulu Shang, and Xiang Zhou. Srtsim: spatial pattern preserving simulations for spatially resolved transcriptomics. Genome Biology, 24(1):39, Mar 2023. [110] Carissa Chen, Hani Jieun Kim, and Pengyi Yang. Evaluating spatially variable gene detection methods for spatial transcriptomics data. bioRxiv, 2022. [111] Jian Hu, Xiangjie Li, Kyle Coleman, Amelia Schroeder, Nan Ma, David J. Irwin, Edward B. Lee, Russell T. Shinohara, and Mingyao Li. Spagcn: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nature Methods, 18(11):1342–1351, Nov 2021. [112] Chihao Zhang, Kangning Dong, Kazuyuki Aihara, Luonan Chen, and Shihua Zhang. Sta- marker: Determining spatial domain-specific variable genes with saliency maps in deep learning. bioRxiv, 2022. [113] Marc Elosua-Bayes, Paula Nieto, Elisabetta Mereu, Ivo Gut, and Holger Heyn. SPOT- light: seeded NMF regression to deconvolute spatial transcriptomics spots with single-cell transcriptomes. Nucleic Acids Research, 49(9):e50–e50, 02 2021. [114] Dylan M Cable, Evan Murray, Luli S Zou, Aleksandrina Goeva, Evan Z Macosko, Fei Chen, and Rafael A Irizarry. Robust decomposition of cell type mixtures in spatial transcriptomics. Nat Biotechnol, 40(4):517–526, February 2021. [115] Rui Dong and Guo-Cheng Yuan. Spatialdwls: accurate deconvolution of spatial transcrip- tomic data. Genome biology, 22(1):145, 2021. [116] Dongyuan Song, Qingyang Wang, Guanao Yan, Tianyang Liu, Tianyi Sun, and Jingyi Jessica Li. scdesign3 generates realistic in silico data for multimodal single-cell and spatial omics. Nature Biotechnology, 42(2):247–252, 2024. [117] Edward Zhao, Matthew R Stone, Xing Ren, Jamie Guenthoer, Kimberly S Smythe, Thomas Pulliam, Stephen R Williams, Cedric R Uytingco, Sarah EB Taylor, Paul Nghiem, et al. Spatial transcriptomics at subspot resolution with bayesspace. Nature biotechnology, 39(11):1375–1384, 2021. 105 [118] Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology, 61(3):611–622, 1999. [119] Sikta Das Adhikari, Jiaxin Yang, Jianrong Wang, and Yuehua Cui. Recent advances in spatially variable gene detection in spatial transcriptomics. Comput. Struct. Biotechnol. J., February 2024. [120] Zhijian Li, Zain M Patel, Dongyuan Song, Guanao Yan, Jingyi Jessica Li, and Luca Pinello. Benchmarking computational methods to identify spatially variable genes and peaks. bioRx- ivorg, December 2023. [121] Charles Swanton. Intratumor heterogeneity: evolution through space and time. Cancer research, 72(19):4875–4882, 2012. [122] Michalina Janiszewska. The microcosmos of intratumor heterogeneity: the space-time of cancer evolution. Oncogene, 39(10):2031–2039, 2020. [123] David T Scadden. Nice neighborhood: emerging concepts of the stem cell niche. Cell, 157(1):41–50, 2014. [124] Miranda V Hunter, Reuben Moncada, Joshua M Weiss, Itai Yanai, and Richard M White. Spatially resolved transcriptomics reveals the architecture of the tumor-microenvironment interface. Nature communications, 12(1):6278, 2021. [125] Qianghu Wang, Baoli Hu, Xin Hu, Hoon Kim, Massimo Squatrito, Lisa Scarpace, Ana C DeCarvalho, Sali Lyu, Pengping Li, Yan Li, et al. Tumor evolution of glioma-intrinsic gene expression subtypes associates with immunological changes in the microenvironment. Cancer cell, 32(1):42–56, 2017. [126] Yi Zhang, Guanjue Xiang, Alva Yijia Jiang, Allen Lynch, Zexian Zeng, Chenfei Wang, Wubing Zhang, Jingyu Fan, Jiajinlong Kang, Shengqing Stan Gu, et al. Metatime integrates single-cell gene expression to characterize the meta-components of the tumor immune mi- croenvironment. Nature communications, 14(1):2634, 2023. [127] Srivatsan Raghavan, Peter S Winter, Andrew W Navia, Hannah L Williams, Alan DenAdel, Kristen E Lowder, Jennyfer Galvez-Reyes, Radha L Kalekar, Nolawit Mulugeta, Kevin S Kapner, et al. Microenvironment drives cell state, plasticity, and drug response in pancreatic cancer. Cell, 184(25):6119–6137, 2021. [128] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008. [129] Kristen R Maynard, Leonardo Collado-Torres, Lukas M Weber, Cedric Uytingco, Brianna K Barry, Stephen R Williams, Joseph L Catallini, Matthew N Tran, Zachary Besich, Mad- havi Tippani, et al. Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nature neuroscience, 24(3):425–436, 2021. 106 [130] Alma Andersson, Ludvig Larsson, Linnea Stenbeck, Fredrik Salmén, Anna Ehinger, Sunny Z Wu, Ghamdan Al-Eryani, Daniel Roden, Alex Swarbrick, Åke Borg, et al. Spatial decon- volution of her2-positive breast cancer delineates tumor-associated cell type interactions. Nature communications, 12(1):6012, 2021. [131] Patrik L Ståhl, Fredrik Salmén, Sanja Vickovic, Anna Lundmark, José Fernández Navarro, Jens Magnusson, Stefania Giacomello, Michaela Asp, Jakub O Westholm, Mikael Huss, et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science, 353(6294):78–82, 2016. [132] Jiaqiang Zhu, Shiquan Sun, and Xiang Zhou. Spark-x: non-parametric modeling enables scalable and robust detection of spatial expression patterns for large spatial transcriptomic studies. Genome biology, 22(1):184, 2021. [133] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996. [134] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1):1, 2010. [135] Noah Simon, Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for cox’s proportional hazards model via coordinate descent. Journal of statistical software, 39(5):1, 2011. [136] Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. 2010. [137] Jianqing Fan and Jinchi Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society Series B: Statistical Methodology, 70(5):849– 911, 2008. [138] Vincent A Traag, Ludo Waltman, and Nees Jan Van Eck. From louvain to leiden: guaran- teeing well-connected communities. Scientific reports, 9(1):5233, 2019. [139] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008, 2008. [140] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences-ISCIS 2005: 20th International Symposium, Istanbul, Turkey, October 26-28, 2005. Proceedings 20, pages 284–293. Springer, 2005. [141] Xihong Lin. Variance component testing in generalised linear models with random effects. Biometrika, 84(2):309–326, 1997. [142] Michael C Wu, Seunggeun Lee, Tianxi Cai, Yun Li, Michael Boehnke, and Xihong Lin. Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics, 89(1):82–93, 2011. [143] Robert B Davies. The distribution of a linear combination of 𝜒2 random variables. Journal of the Royal Statistical Society Series C: Applied Statistics, 29(3):323–333, 1980. 107 [144] Yaowu Liu, Sixing Chen, Zilin Li, Alanna C Morrison, Eric Boerwinkle, and Xihong Lin. Acat: a fast and powerful p value combination method for rare-variant analysis in sequencing studies. The American Journal of Human Genetics, 104(3):410–421, 2019. [145] Yaowu Liu and Jun Xie. Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures. J Am Stat Assoc, 115(529):393–402, April 2019. [146] Yoav Benjamini and Daniel Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of statistics, pages 1165–1188, 2001. [147] Carrie A Davis, Benjamin C Hitz, Cricket A Sloan, Esther T Chan, Jean M Davidson, Idan Gabdank, Jason A Hilton, Kriti Jain, Ulugbek K Baymuradov, Aditi K Narayanan, et al. The encyclopedia of dna elements (encode): data portal update. Nucleic acids research, 46(D1):D794–D801, 2018. [148] Alper Yilmaz, Milton Y Nishiyama Jr, Bernardo Garcia Fuentes, Glaucia Mendes Souza, Daniel Janies, John Gray, and Erich Grotewold. Grassius: a platform for comparative regulatory genomics across the grasses. Plant physiology, 149(1):171–180, 2009. [149] Jens Keilwagen, Stefan Posch, and Jan Grau. Accurate prediction of cell type-specific transcription factor binding. Genome biology, 20:1–17, 2019. [150] Hongyang Li, Daniel Quang, and Yuanfang Guan. Anchor: trans-cell type prediction of transcription factor binding sites. Genome research, 29(2):281–292, 2019. [151] Richard I Sherwood, Tatsunori Hashimoto, Charles W O’donnell, Sophia Lewis, Amira A Barkal, John Peter Van Hoff, Vivek Karun, Tommi Jaakkola, and David K Gifford. Discovery of directional and nondirectional pioneer transcription factors by modeling dnase profile magnitude and shape. Nature biotechnology, 32(2):171–178, 2014. [152] Jacob Schreiber, Jeffrey Bilmes, and William Stafford Noble. Completing the encode3 compendium yields accurate imputations across a variety of assays and human biosamples. Genome biology, 21:1–13, 2020. [153] Yanli Wang, Fan Song, Bo Zhang, Lijun Zhang, Jie Xu, Da Kuang, Daofeng Li, Mayank NK Choudhary, Yun Li, Ming Hu, et al. The 3d genome browser: a web-based browser for visualizing 3d genome organization and long-range chromatin interactions. Genome biology, 19:1–12, 2018. [154] Heidi Sutherland and Wendy A Bickmore. Transcription factories: gene expression in unions? Nature Reviews Genetics, 10(7):457–466, 2009. [155] Cameron S Osborne. Molecular pathways: transcription factories and chromosomal translo- cations. Clinical Cancer Research, 20(2):296–300, 2014. [156] Qianli Dong, Ning Li, Xiaochong Li, Zan Yuan, Dejian Xie, Xiaofei Wang, Jianing Li, Yanan Yu, Jinbin Wang, Baoxu Ding, et al. Genome-wide hi-c analysis reveals extensive hierarchical chromatin interactions in rice. The Plant Journal, 94(6):1141–1156, 2018. 108 [157] Yuri B Schwartz and Giacomo Cavalli. Three-dimensional genome organization and function in drosophila. Genetics, 205(1):5–24, 2017. [158] Wenxiu Ma, Ferhat Ay, Choli Lee, Gunhan Gulsoy, Xinxian Deng, Savannah Cook, Jennifer Hesson, Christopher Cavanaugh, Carol B Ware, Anton Krumm, et al. Fine-scale chromatin interaction maps reveal the cis-regulatory landscape of human lincrna genes. Nature methods, 12(1):71–78, 2015. [159] Peter H Sudmant, Tobias Rausch, Eugene J Gardner, Robert E Handsaker, Alexej Abyzov, John Huddleston, Yan Zhang, Kai Ye, Goo Jun, Markus Hsi-Yang Fritz, et al. An integrated map of structural variation in 2,504 human genomes. Nature, 526(7571):75–81, 2015. [160] Suhas SP Rao, Miriam H Huntley, Neva C Durand, Elena K Stamenova, Ivan D Bochkov, James T Robinson, Adrian L Sanborn, Ido Machol, Arina D Omer, Eric S Lander, et al. A 3d map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell, 159(7):1665–1680, 2014. [161] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877, 2017. [162] Toby J Mitchell and John J Beauchamp. Bayesian variable selection in linear regression. Journal of the american statistical association, 83(404):1023–1032, 1988. [163] Peter Carbonetto and Matthew Stephens. Scalable variational inference for bayesian variable selection in regression, and its accuracy in genetic association studies. 2012. [164] Benjamin A Logsdon, Gabriel E Hoffman, and Jason G Mezey. A variational bayes algorithm for fast and accurate multiple locus genome-wide association analysis. BMC bioinformatics, 11:1–13, 2010. [165] Nello Cristianini and John Shawe-Taylor. An introduction to support vector machines and other kernel-based learning methods. Cambridge university press, 2000. [166] Dawei Liu, Xihong Lin, and Debashis Ghosh. Semiparametric regression of multidimen- sional genetic pathway data: least-squares kernel machines and linear mixed models. Bio- metrics, 63(4):1079–1088, 2007. [167] Eugene Demidenko. Mixed models: theory and applications with R. John Wiley & Sons, 2013. [168] David A Harville. Bayesian inference for variance components using only error contrasts. Biometrika, 61(2):383–385, 1974. [169] H Desmond Patterson and Robin Thompson. Recovery of inter-block information when block sizes are unequal. Biometrika, 58(3):545–554, 1971. [170] Nan M Laird and James H Ware. Random-effects models for longitudinal data. Biometrics, pages 963–974, 1982. 109 [171] Regev Schweiger, Omer Weissbrod, Elior Rahmani, Martina Müller-Nurasyid, Sonja Kunze, Christian Gieger, Melanie Waldenberger, Saharon Rosset, and Eran Halperin. Rl-skat: an exact and efficient score test for heritability and set tests. Genetics, 207(4):1275–1283, 2017. [172] Geoffrey E Hinton and Sam Roweis. Stochastic neighbor embedding. Advances in neural information processing systems, 15, 2002. [173] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008. [174] Jesse H. Krijthe. Rtsne: T-Distributed Stochastic Neighbor Embedding using Barnes-Hut Implementation, 2015. R package version 0.17. [175] L.J.P. van der Maaten and G.E. Hinton. Visualizing high-dimensional data using t-sne. Journal of Machine Learning Research, 9:2579–2605, 2008. [176] L.J.P. van der Maaten. Accelerating t-sne using tree-based algorithms. Journal of Machine Learning Research, 15:3221–3245, 2014. [177] Mark EJ Newman and Michelle Girvan. Finding and evaluating community structure in networks. Physical review E, 69(2):026113, 2004. [178] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008, 2008. [179] Gabor Csardi and Tamas Nepusz. The igraph software package for complex network research. InterJournal, Complex Systems:1695, 2006. [180] Gábor Csárdi, Tamás Nepusz, Vincent Traag, Szabolcs Horvát, Fabio Zanini, Daniel Noom, and Kirill Müller. igraph: Network Analysis and Visualization in R, 2024. R package version 2.0.3. 110 APPENDIX A KERNEL BASED MODELS: FROM SEMIPARAMETRIC MODEL TO LMM As mentioned in chapter 2, under the kernel machine regression framework, continuous quantitative traits can be associated to genetic variants or molecular features, along with additional covariates, through a semiparametric model: 𝑦𝑖 = 𝑋𝑖 𝛽 + ℎ(𝑍𝑖) + 𝜖𝑖, 𝑖 = 1, 2, · · · , 𝑛 (A.1) Here, ℎ(𝑍𝑖) is an unknown centered smooth function. Model A.1 models covariate effects para- metrically and the spatial effect parametrically or nonparametrically. When h(·) = 0, A.1 reduces to the standard linear regression model. We assume that the nonparametric function ℎ(𝑍) resides in a function space H𝑘 defined by a positive definite kernel function 𝐾 (., .). According to Mercer’s theorem [165], assuming certain regularity conditions hold, a kernel function 𝐾 (., .) implicitly defines a unique function space spanned by a specific set of orthogonal basis functions (features) 𝜙 𝑗 (𝑍)𝐽 ℎ(𝑍) can be expressed as a linear combination of these bases: ℎ(𝑍) = (cid:205)𝐽 𝑗=1. This implies that any 𝜔 𝑗 𝜙 𝑗 (𝑍) = 𝜙 𝑗 (𝑍)𝑇 𝜔 𝑗=1 (referred to as the primal representation), where 𝜔 is a coefficient vector. Alternatively, ℎ(𝑍) can be represented using the kernel function 𝐾 (., .) as ℎ(𝑍) = (cid:205)𝐿 𝑙=1 𝛼𝑙 𝐾 (𝑍 ∗ representation), where 𝐿 is an integer, 𝛼𝑙 are constants, and {𝑍 ∗ 1 , 𝑍 ∗ 2 , .., 𝑍 ∗ Now, assuming ℎ(𝑍) belongs to H𝑘 , the function space generated by a kernel function 𝐾 (., .). Estimation of 𝛽 and ℎ(.) in A.1 proceeds by maximizing the scaled penalized likelihood function: 𝐽 (ℎ, 𝛽) = − 1 2 𝑛 ∑︁ 𝑖=1 [𝑦𝑖 − 𝑥𝑖 𝛽 − ℎ(𝑍𝑖)]2 − 1 2 𝜆∥ℎ∥2 H𝑘 = − 1 2 𝑛 ∑︁ 𝑖=1 [𝑦𝑖 − 𝑥𝑖 𝛽 − ℎ(𝑍𝑖)]2 − 𝜆𝛼𝑇 𝐾𝛼 1 2 (A.2) where 𝜆 is a tuning parameter which controls the tradeoff between goodness of fit and complexity of the model. When 𝜆 = 0, the model interpolates the data, whereas when 𝜆 = ∞, the model reduces to a simple linear model without ℎ(.). This is exactly similar to the log-likelihood of the 111 𝑙 , 𝑍) (known as the dual 𝐿 } ∈ R𝑚. linear mixed model[166] up to some constant: 𝑦 = 𝑋 𝛽 + ℎ + 𝜖 (A.3) where the random effect ℎ ∼ 𝑁 (0, 𝜏𝐾), random noise 𝜖 ∼ 𝑁 (0, 𝜎2). Here, 𝑦 ∼ 𝑁 (𝑋 𝛽, Σ), where Σ = 𝜏𝐾 + 𝜎2𝐼 and 𝑦|ℎ ∼ (𝑋 𝛽 + ℎ, 𝜎2𝐼). The log-likelihood: 𝑙𝑜𝑔(𝑙 (𝑦)) = 𝑙𝑜𝑔(𝑙 (𝑦|ℎ)𝑙 (ℎ)) = 𝑙𝑜𝑔(𝑙 (𝑦|ℎ)) + 𝑙𝑜𝑔(𝑙 (ℎ)) = 𝑙𝑜𝑔(𝑒𝑥 𝑝− 1 2𝜎2 =− 1 2𝜎2 (cid:205)𝑛 𝑖=1 (𝑦𝑖−𝑋𝑖 𝛽−ℎ𝑖)𝑇 (𝑦𝑖−𝑋𝑖 𝛽−ℎ𝑖)) + 𝑙𝑜𝑔(𝑒𝑥 𝑝− 1 (cid:205)𝑛 𝑖=1(𝑦𝑖 − 𝑋𝑖 𝛽 − 𝑍𝑖)𝑇 (𝑦𝑖 − 𝑋𝑖 𝛽 − ℎ𝑖) − 1 =− (cid:205)𝑛 2𝜏 ℎ𝑇 𝐾 −1ℎ 𝑖=1 [𝑦𝑖 − 𝑥𝑖 𝛽 − ℎ𝑖]2 − 1 2𝜏 ℎ𝑇 𝐾 −1ℎ 2𝜏 ℎ𝑇 𝐾 −1ℎ)+ constant terms Setting 𝜏 = 1/𝜆 and ℎ = 𝐾𝛼, one can easily see that equations A.2 and A.3 are identical. 112 APPENDIX B SCORE STATISTIC AND THE NULL DISTRIBUTION B.0.1 Score test for SVG detection The Gaussian process (GP) regression model which models the normalized gene expression 𝑦 for a given gene using the following multivariate normal model: 𝑝(𝑦|𝜇, 𝜎2 𝑠 , 𝛿, 𝐾) ∼ 𝑁 (𝑦|𝑋 𝛽, 𝜎2 𝑠 𝐾 + 𝛿𝐼), (B.1) where the covariance term is decomposed into a spatial and a non-spatial part, where 𝛿𝐼 represents the non-spatial part and 𝜎2 𝑠 𝐾 is the spatial covariance matrix, whose (𝑖, 𝑗)𝑡ℎ element in the kernel matrix 𝐾 denotes the spatial similarity between the 𝑖𝑡ℎ and 𝑗 𝑡ℎ spot calculated based on the corresponding coordinates 𝑠𝑖 and 𝑠 𝑗 . The choice of the kernel function plays a very important role in detecting the spatial correlation present in the gene expressions. 𝑋 𝑁×𝑘 represents the covariate matrix, while 𝛽𝑁×1 denotes the array of corresponding coefficients. This model can incorporate up to 𝑘 − 1 covariates, such as cell type information or domain structure information. For the 𝑖𝑡ℎ location, the model can be written like this: 𝑦𝑖 (𝑠𝑖) = 𝑋𝑖 (𝑠𝑖) 𝛽 + ℎ𝑖 (𝑠𝑖) + 𝜖𝑖 (B.2) where 𝜖𝑖 is the iid residual error which follows 𝑁 (0, 𝛿), 𝛼𝑖 (𝑠𝑖) is such that: ℎ(𝑠) = (ℎ1(𝑠1), ℎ2(𝑠2), · · · , ℎ𝑛 (𝑠𝑖)) ∼ 𝑀𝑉 𝑁 (0, 𝜎2 𝑠 𝐾 (𝑠)) is a spatial random effect which captures the spatial variation using the spatial covariate matrix 𝜎2 𝑠 𝐾; Overall the covariance for the normalized gene expression 𝑦(𝑠) is Σ = 𝜎2 𝑠 𝐾 + 𝛿𝐼. As we mentioned in the main text, testing if a gene is a SVG is equivalent to testing 𝐻0 : 𝜎2 𝑠 = 0. The null hypothesis 𝐻0 : 𝜎2 𝑠 =0 can be tested using the variance-component score test which is the locally most powerful test[141]. The variance-component score statistic is: 𝑄 = (𝑦 − 𝑋 ˆ𝛽)𝑇 𝐾 (𝑦 − 𝑋 ˆ𝛽) where ˆ𝛽 is the MLEs under the null model. Under the null hypothesis, the score statistic 𝑄 follows a mixture of chi-square distributions [142], which can be closely approximated with the 113 computationally efficient Davies’ method[143]. More details about the test is provided in the next subsection. B.1 More details on its null distribution The Gaussian process model presented in B.1 has many unknown parameters, for example 𝛿, the kernel parameter 𝜌, 𝜎2 𝑠 . The unknown parameters are estimated simultaneously by treating them as variance components in the linear mixed model and estimating them using the Restricted Maximum Likelihood(RML) model. The RML model was used to reduce bias in the variance components model [167][168][169]. The restricted log-likelihood[170][166] for the lmm in 4.1 is as follows: 𝐿 𝑅 (𝜎2 𝑠 , 𝛿, 𝜌) = − 𝑙𝑜𝑔|Σ| − 1 2 1 2 𝑙𝑜𝑔|𝑋𝑇 Σ−1𝑋 | − 1 2 (𝑦 − 𝑋 𝛽)𝑇 Σ−1(𝑦 − 𝑋 𝛽) (B.3) where Σ = 𝜎2 𝑠 𝐾 + 𝛿𝐼. The score statistic can be written in the form of 𝑋 (𝑋𝑇 𝑋)−1𝑋𝑇 , ˆ𝛽 and ˆ𝛿 are the MLEs under the null model and ˜𝑄(𝛽, 𝛿) = 1 Under 𝐻0, the final score statistic for the test Q reduces to 𝑦𝑇 𝐻𝑇 𝐾 𝐻𝑦 and ˜𝑄( ˆ𝛽, ˆ𝛿) − 𝑡𝑟 (𝐻𝐾)[166], where 𝐻 = 𝐼 − 2𝛿 (𝑦 − 𝑋 𝛽)𝑇 𝐾 (𝑦 − 𝑋 𝛽) 𝑄 ∼ 𝑁 ∑︁ 𝑖=1 𝜆𝑖 𝜒2 𝑖,1 (B.4) where 𝜒2 𝑖,1 are independent 𝜒2 1 random variables and 𝜆1, 𝜆2, · · · , 𝜆𝑁 are the eigenvalues of ˆ𝛿𝐻1/2𝐾 𝐻1/2. The form of the null distribution of the score statistic distribution stated follows from this argument [171]. Under 𝐻0, 𝑦 ∼ 𝑁 (𝑋 𝛽, 𝛿𝐼), the covariates can be removed using the transformation ˜𝑦 = 𝐻𝑦. Now ˜𝑦 ∼ 𝑁 (0, 𝛿𝐻𝐻𝑇 ) and 𝐾 1/2 ˜𝑦 ∼ 𝑁 (0, 𝛿𝐾 1/2𝐻𝐻𝑇 𝐾 1/2). Let 𝑈 be the matrix whose columns are the eigenvectors of ˆ𝛿𝐾 1/2𝐻𝐻𝑇 𝐾 1/2 and the corresponding eigenvalues are ˜𝜆1, · · · , ˜𝜆𝑛. Therefore, 𝑈𝑇 𝐾 1/2𝐻𝑦 ∼ 𝑁 (0, 𝑑𝑖𝑎𝑔( ˜𝜆𝑖)) and 𝑄 = (𝑈𝑇 𝐾 1/2𝐻𝑦)𝑇 (𝑈𝑇 𝐾 1/2𝐻𝑦) ∼ (cid:205)𝑛 𝑖=1 𝑖,1. Using the fact that the eigenvalues of any matrix 𝐴, 𝐴𝐴𝑇 , 𝐴𝑇 𝐴 are the same and 𝐻2 = 𝐻, it can be shown ˜𝜆𝑖 𝜒2 that ˜𝜆𝑖 = 𝜆𝑖, the eigen values of ˆ𝛿𝐻1/2𝐾 𝐻1/2. 114 APPENDIX C T-SNE PLOT t-Distributed Stochastic Neighbor Embedding (t-SNE) [172][173] is a powerful nonlinear dimen- sionality reduction technique designed to facilitate the visualization of high-dimensional data by mapping it into two or three dimensions. This reduction process enables the identification of inherent patterns, clusters, and relationships that are otherwise difficult to discern. t-SNE begins by calculating pairwise similarities between data points 𝑥𝑖 and 𝑥 𝑗 in the high- dimensional space. The similarity of datapoint 𝑥 𝑗 to datapoint 𝑥𝑖 is the conditional probability, 𝑝 𝑗 |𝑖, that 𝑥𝑖 would pick 𝑥 𝑗 as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at 𝑥𝑖 . For nearby data points, 𝑝 𝑗 |𝑖 is relatively high, while for widely separated data points, 𝑝 𝑗 |𝑖 becomes almost infinitesimal.𝑝 𝑗 |𝑖 is given by: 𝑝 𝑗 |𝑖 = exp (cid:0)−∥𝑥𝑖 − 𝑥 𝑗 ∥2/2𝜎2 (cid:1) 𝑖 (cid:205)𝑘≠𝑖 exp (cid:0)−∥𝑥𝑖 − 𝑥𝑘 ∥2/2𝜎2 𝑖 (cid:1) For the low-dimensional counterparts 𝑦𝑖 and 𝑦 𝑗 of the high-dimensional data points 𝑥𝑖 and 𝑥 𝑗 , we model the similarity of map point 𝑦 𝑗 to map point 𝑦𝑖 by: 𝑞 𝑗 |𝑖 = exp (cid:0)−∥𝑦𝑖 − 𝑦 𝑗 ∥2(cid:1) (cid:205)𝑘≠𝑖 exp (cid:0)−∥𝑦𝑖 − 𝑦𝑘 ∥2(cid:1) If the map points 𝑦𝑖 and 𝑦 𝑗 correctly model the similarity between the high-dimensional data points 𝑥𝑖 and 𝑥 𝑗 , the conditional probabilities 𝑝 𝑗 |𝑖 and 𝑞 𝑗 |𝑖 will be equal. To achieve this, t-SNE aims to minimize the mismatch between 𝑝𝑖 𝑗 and 𝑞𝑖 𝑗 . This is done by minimizing the sum of Kullback-Leibler divergences over all data points using gradient descent. The cost function 𝐶 is given by: ∑︁ 𝐶 = 𝐾 𝐿 (𝑃𝑖 ∥𝑄𝑖) = ∑︁ ∑︁ 𝑝 𝑗 |𝑖 log 𝑖 𝑖 𝑗 𝑝 𝑗 |𝑖 𝑞 𝑗 |𝑖 where 𝑃𝑖 represents the conditional probability distribution over all other datapoints given datapoint 𝑥𝑖 , and 𝑄𝑖 represents the conditional probability distribution over all other map points given map point 𝑦𝑖. 115 In summary, t-SNE reduces high-dimensional data to two or three dimensions by preserving the pairwise similarities between points as much as possible, thereby facilitating the visualization of complex data structures. Because the Kullback-Leibler divergence is not symmetric, different types of errors in the pairwise distances in the low-dimensional map are weighted unequally. Specifically, there is a high cost for using widely separated map points to represent nearby data points (i.e., using a small 𝑞 𝑗 |𝑖 to model a large 𝑝 𝑗 |𝑖). Conversely, there is a relatively small cost for using nearby map points to represent widely separated data points. This means that the SNE cost function prioritizes preserving the local structure of the data in the map. In Chapter 4, t-SNE plots are extensively used to illustrate the similarity between the gene clusters identified by cSVG. The R package "Rtsne" [174][175][176] has been used for that purpose. In complex biological data, traditional methods such as PCA often fall short, necessitating the use of advanced techniques like t-SNE to achieve better performance and more meaningful visualizations. 116 APPENDIX D COMMUNITY DETECTION ALGORITHMS In many complex networks, nodes cluster and form relatively dense groups—often called commu- nities, where the nodes within each community are more densely connected to each other than to the rest of the network. Such a modular structure is usually not known beforehand, making the detection of communities an important problem. One of the best-known methods for community detection is called modularity[177]. This method tries to maximize the difference between the actual number of edges in a community and the expected number of such edges. Community detection algorithms use various methodologies to partition networks into meaningful clusters, revealing insights into the relationships and interactions within the network. Different community detection algorithms like Louvain and Leiden are widely used in single-cell RNA sequencing (scRNA-seq) analysis for clustering single cells, aiding in the identification of distinct cell types and states. The modularity is defined by: H = 1 2𝑚 ∑︁ 𝑐 (𝑒𝑐 − 𝛾 𝐾 2 𝑐 2𝑚 ) where 𝑚 is the total number of edges in the network, 𝑒𝑐 is the actual number of edges in community 𝑐, the expected number of edges can be expressed as 2𝑚 , where 𝐾𝑐 is the sum of the degrees of the nodes in the community 𝑐. 𝛾 > 0 is a resolution parameter. Higher resolutions lead to more 𝐾 2 𝑐 communities, while lower resolutions lead to fewer communities. The Louvain method[178] is one of the most popular community detection algorithms due to its simplicity and effectiveness, typically operating based on modularity optimization. However, in recent years, some drawbacks of the Louvain algorithm have been identified[138], leading to the increased popularity of the Leiden algorithm. The Leiden method is more robust and accurate, making it better suited for analyzing complex networks where high-quality community detection is crucial. In step 3 of the cSVG algorithm, we use the Leiden algorithm, implemented through the "igraph" R package[179],[180] with the default resolution parameter value 𝛾 = 1. 117 APPENDIX E CHAPTER 2 SUPPLEMENTARY TABLES AND FIGURES Table E.1 List of significant genes selected by BayesKAT and their posterior probability of association. Gene Name p(H1|Data) CARD10 0.9835514 MGAT5 0.9474112 TMEM71 0.9237384 FAM174B 0.9042780 ACTA2 0.8677653 C18orf45 0.8569289 TMEM163 0.8357041 LRP1B 0.8257515 SLC35B4 0.8248786 EDNRA 0.7878737 ABCB7 0.7699132 C9orf135 0.7660853 SLC25A18 0.7583290 DKK2 PCID2 0.7501011 0.7387756 TMEM38A 0.7042073 NXNL2 0.7037605 118 Table E.2 List of the significant KEGG pathways selected by BayesKAT and their corresponding posterior probabilities of association. Pathway Name p(H1|Data) Pathways of neurodegeneration - multiple diseases 0.9998840 Alzheimer disease Salmonella infection Antifolate resistance Huntington disease Bile secretion Alcoholic liver disease Dilated cardiomyopathy Metabolic pathways Amyotrophic lateral sclerosis Hypertrophic cardiomyopathy Cardiac muscle contraction mTOR signaling pathway cAMP signaling pathway Pathogenic Escherichia coli infection Calcium signaling pathway ABC transporters Citrate cycle (TCA cycle) Non-alcoholic fatty liver disease AMPK signaling pathway Parkinson disease 0.9982576 0.9897521 0.9666899 0.9630876 0.9530779 0.9285561 0.9240475 0.9211165 0.9070785 0.8975824 0.8966726 0.8869947 0.8688107 0.8645842 0.7962372 0.7851350 0.7766141 0.7676582 0.7640913 0.7242965 119 Figure E.1 Inferred weights of each candidate kernel within the composite kernels based on the simulation tests as presented in Figure 1(B). The inferred composite kernel demonstrates strong agreements with the underlying true kernel function (the Quadratic kernel) used for data generation. High kernel weights are consistently inferred for the Quadratic kernel. 120 Figure E.2 Performance comparison based on simulations involving unrelated discrete genetic features. With the same fixed level of empirical type 1 error, the empirical power of all three methods are overall low, because this basic simulation only groups unrelated genetic features together, which is not recommended for real-world genetic association studies. Even through, BayesKAT still achieves better empirical power. 121 Figure E.3 The empirical power vs empirical type 1 error plot from the pathway-based simulations with sample size n=755. For each scenario, BayesKAT consistently achieves the best performance. The aggregated result presented in main figure 3(B). 122 Figure E.4 The empirical power vs empirical type 1 error plot from the pathway-based simulations with sample size n=1000. The strength of the relationship is intentionally weakened by adjusting effect sizes for performance comparison purposes. Note that for larger sample sizes under the scenarios in S3, all methods exhibit a power of 1, making them incomparable. For each scenario, BayesKAT consistently achieves the best performance. 123 Figure E.5 The empirical power vs empirical type 1 error plot from the pathway-based simulations with sample size n=1500. The strength of the relationship is intentionally weakened by adjusting effect sizes for performance comparison purposes. Note that for larger sample sizes under the scenarios in S3, all methods exhibit a power of 1, making them incomparable. BayesKAT consistently performs reasonably well for each scenario. 124 APPENDIX F CHAPTER 4 SUPPLEMENTARY FIGURES Figure F.1 Steps to create a synthetic dataset from an annotated original Spatial dataset. We begin with a filtered gene expression count matrix and a location matrix where spots are annotated. The tissue region contains 𝑁 spots where gene expression measurements are measured for each of the 𝑚 genes. Step 1: Randomly select 𝑚1 genes from the original dataset to be converted into new SVGs in the synthetic dataset. The remaining 𝑚 − 𝑚1 genes will serve as noise genes with no spatial pattern. Step 2: For each of the 𝑚1 genes in the original dataset, sort and arrange the gene expression count values according to the reference domain structure. For the remaining genes, sort and arrange the count values randomly on the tissue region. 125 Figure F.2 The t-SNE plots[128] illustrate the separation of domain-specific SVGs by cSVG across 10 synthetic datasets. 126 Figure F.3 The performance comparison between A)nnSVG[78] and B)cSVG is conducted for the simulation setup outlined in section 3.1 of the main manuscript. The gene groups are: Independent: Uncorrelated Gene group for genes without any spatial pattern(g1-g10), Correlated: Correlated Gene group for genes without any spatial pattern(g11-g20). Pattern 1-3: Correlated Gene group for genes with spatial pattern 1-3 (g21-g30,g31-g40,g41-g50). Pattern 4-6: Single gene with spatial pattern 4-6(g51-g53) The spatial pattern strength intensifies within each spatial gene group(pattern1-pattern3) as indicated by the triangles between the plots. The empirical power of the SVG detection step is evaluated for simulated datasets with AR(1) (Left) and compound symmetry (CS) (Right) correlation structures within the gene groups. 127 Figure F.4 The SVG-clusters detected from the DLPFC data using cSVG distinctly highlight the disparity between genes in the two main clusters. The representative genes in the second cluster demonstrate overexpression in the white matter region, while those in the first cluster exhibit overexpression in the other six cortex layers. Additionally, the three unique genes in the final row display a slightly different pattern compared to the genes in the main clusters. 128 Figure F.5 Annotated and predicted spatial domains for all 12 DLPFC samples using cSVG, alongside ARI scores for each sample indicating prediction accuracy. 129 Figure F.6 t-SNE plots illustrating genes from all 12 DLPFC samples, with colors representing SVG cluster labels detected by cSVG. Across the majority of samples, distinct clustering is observed, indicating accurate separation of genes within different clusters. 130 Figure F.6 (cont’d) 131 Figure F.7 Table displays gene-cluster sizes by cSVG for all 12 DLPFC samples. 132 Figure F.8 Analysis of Pancreatic Cancer Data using cSVG unveils three primary SVG clusters. Representative genes from each cluster (Cluster 1, Cluster 2, and Cluster 3) are showcased. The genes from these distinct clusters exhibit overexpression in three different regions. 133 Figure F.9 Analysis of Pancreatic cancer data: Pathway enrichment analysis of genes from A) Cluster 1, B) Cluster 2, and C)Cluster are showcased. As expected, cluster 2 and cluster 3 genes show enrichment in cancer-related pathways. 134