SINGLE CELLS ARE BIOLOGICAL TOKENS: TOWARDS CELL LANGUAGE MODELS

By

Hongzhi Wen

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy

2024

ABSTRACT

The rapid advancement of single-cell technologies allows for simultaneous measurement of multiple molecular features within individual cells, providing unprecedented multimodal data through single-cell multi-omics and spatial omics technologies. This dissertation addresses the complex challenges of modeling these multimodal interactions using deep learning techniques. We propose two series of studies. The first is the application of graph neural networks and graph transformers to model relations between multimodal features, incorporating external domain knowledge. We propose the Single-cell Multi-Omics GNN (scMoGNN) and the Single-cell Multi-Omics Transformer (scMoFormer); the latter extends the former and demonstrates the promise of transformers in single-cell multi-omics representation learning. The second is the application of transformers in spatial omics representation learning. We propose the Spatial Transformer (SpaFormer), a transformer-based masked autoencoder learning framework for extracting cell context information and imputing spatial transcriptomics data. Despite the effectiveness of these models, their knowledge transferability across tasks and datasets remains limited. To overcome this, we introduce a new transformer-based foundation model, the Cell Pre-trained Language Model (CellPLM), which encodes inter-cellular relations and multimodal features, demonstrating the significant potential of foundation models for future research in single-cell biology.

ACKNOWLEDGEMENTS

First and foremost, I would like to express my deepest gratitude to my advisor, Dr. Jiliang Tang, for his invaluable guidance, inspiration, and unwavering support throughout my Ph.D. journey. Dr. Tang has been more than just an academic mentor; he has also been a true friend, offering wisdom and advice in many aspects of life. Joining Dr. Tang's research lab was one of the best decisions I made four years ago, as it has provided me with an enriching experience. Under his mentorship, I have acquired extensive knowledge, from effective research methodologies to leadership skills. Dr. Tang is among the finest advisors one could hope for, and I am profoundly grateful for his consistent encouragement and responsiveness in nurturing the growth of his students. I owe much of my success to his steadfast support over the years.

I would like to extend my deepest appreciation to my dissertation committee members, Dr. Guan-Hua Tu, Dr. Hui Liu, and Dr. Yuying Xie, for their insightful comments and invaluable suggestions. Their expertise and feedback have been instrumental in shaping both this dissertation and my future career path.

I am grateful to have had the pleasure and fortune of having supportive and encouraging colleagues during my Ph.D. I am thankful to all my colleagues from the Data Science and Engineering Lab: Dr. Jamell Dacon, Dr. Wei Jin, Jiayuan Ding, Wenzhuo Tang, Xinnan Dai, Dr. Remy Liu, Kaiqi Yang, Hang Li, Zhikai Chen, Haitao Mao, Yingqian Cui, Jie Ren, Dr. Wenqi Fan, Dr. Hamid Karimi, Dr. Tyler Derr, Dr. Yao Ma, Dr. Haochen Liu, Hua Liu, Dr. Xiaorui Liu, Dr. Wentao Wang, Dr. Yiqi Wang, Dr. Zhiwei Wang, Dr. Han Xu, Dr. Xiangyu Zhao, Haoyu Han, Harry Shomer, Kai Guo, Juanhui Li, Yaxin Li and Yuxuan Wan.
I would also like to extend my heartfelt appreciation to the DANCE team, my dear friends, and collaborators outside of MSU for their unwavering support and encouragement during the challenging moments of my Ph.D. journey. Their kindness and steadfast support helped me persevere through times of frustration. Finally, I would like to thank my family for their unconditional love and support.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
CHAPTER 2 GRAPH NEURAL NETWORKS FOR SINGLE-CELL MULTI-OMICS REPRESENTATION LEARNING
CHAPTER 3 TRANSFORMERS FOR SINGLE-CELL MULTI-OMICS REPRESENTATION LEARNING
CHAPTER 4 TRANSFORMERS FOR SINGLE-CELL SPATIAL OMICS REPRESENTATION LEARNING
CHAPTER 5 BUILDING SINGLE-CELL FOUNDATION MODEL BEYOND SINGLE CELLS
CHAPTER 6 CONCLUSION
BIBLIOGRAPHY
APPENDIX A GRAPH NEURAL NETWORKS FOR MULTI-OMICS REPRESENTATION LEARNING
APPENDIX B TRANSFORMERS FOR SINGLE-CELL SPATIAL OMICS REPRESENTATION LEARNING
APPENDIX C BUILDING SINGLE-CELL FOUNDATION MODEL BEYOND SINGLE CELLS

CHAPTER 1
INTRODUCTION

1.1 Motivation

The rapid advance of single-cell technologies makes it possible to simultaneously measure multiple molecular features across multiple modalities in a cell, such as gene expression, protein abundance, and chromatin accessibility; these are known as single-cell multi-omics technologies. For instance, CITE-seq (cellular indexing of transcriptomes and epitopes by sequencing) (Stoeckius et al., 2017) enables simultaneous quantification of mRNA expression and surface protein abundance; methods like sci-CAR (Cao et al., 2018), Paired-seq (Zhu et al., 2019), and SNARE-seq (Chen et al., 2019) enable joint profiling of mRNA expression and chromatin accessibility (i.e., genome-wide DNA accessibility). The joint measurements from these methods provide unprecedented multimodal data for single cells, which has given rise to valuable insights into not only the relationship between different modalities but, more importantly, a holistic understanding of the cellular system. In addition to the aforementioned multi-omics technologies, single-cell spatial omics technologies have recently emerged as another next-generation tool for biomedical research. For instance, in-situ hybridization (ISH) based technology (Lubeck et al., 2014) produces detailed single-cell transcriptomic profiles along with the locations of cells within a tissue, yielding deeper insights into cell identity and functionality than ever before. The proliferation of these advanced single-cell data also introduces tremendous challenges in modeling the complex interactions among different modalities. As a result, there is an emerging trend to leverage deep learning techniques to handle multimodal single-cell data (Molho et al., 2024).
For example, BABEL (Wu et al., 2021b) translated between the transcriptome (mRNA) and chromatin (DNA) profiles of a single cell based on an encoder-decoder architecture; scMM (Minoura et al., 2021) implemented a mixture-of-experts deep generative model for joint embedding learning and modality prediction; and Cobolt (Gong et al., 2021) acquired joint embeddings via a variant of the Multimodal Variational Autoencoder (MVAE) (Yao et al., 2021). While there are a growing number of deep learning frameworks for modeling multimodal single-cell data, most of them treat each cell as a separate input without considering possible high-order interactions among cells or different modalities. This makes them suboptimal for multi-omics and spatial omics technologies. In light of this, we propose two series of studies to fill these gaps from two perspectives. First, we study how to model the complex relations between multimodal features. Specifically, we explore the application of graph neural networks and graph transformers on cell-feature heterogeneous graphs, which can readily incorporate external domain knowledge and model the interactions within each modality and across modalities. Second, we study how to model the complex interactions between cells. Specifically, we propose a transformer-based framework for spatial transcriptomic data and leverage positional encodings to help extract cell context information. Through these two studies, we demonstrate the effectiveness and advantages of deep learning models, particularly transformers, in modeling single-cell multimodal data.

Although the proposed models are effective in handling multimodal data, the knowledge they learn is not transferable across tasks and datasets. To address this issue, there is an emerging effort (Yang et al., 2022; Gong et al., 2023; Shen et al., 2023; Cui et al., 2023; Theodoris et al., 2023) from the research community to explore the potential of a foundation model that first extracts latent knowledge from unlabeled data and subsequently generalizes this knowledge to a variety of tasks. Existing foundation models all regard genes as tokens and focus solely on modeling gene relationships within individual cells, neglecting the inter-cellular information in an organism as well as the abundant multimodal resources. Therefore, a need has arisen to extend the foundation model to multimodal applications beyond single cells. Based on our previous studies, we propose a new transformer-based foundation model to encode inter-cellular relations and multimodal features. Our research on various downstream tasks demonstrates the power of this foundation model, which has great potential to facilitate future research in single-cell biology.

1.2 Contributions

This dissertation addresses the complex challenges of modeling single-cell multi-omics and spatial omics data using deep learning techniques. First, we study representation learning for multi-omics data, explore the application of graph neural networks and transformers, and highlight the promise of transformers in single-cell analysis. Second, we study the representation learning of spatial omics with transformer architectures and demonstrate the effectiveness of masked autoencoders in spatial omics learning. Based on these findings, we further develop a foundation model for single-cell data analysis.
The major contributions of this dissertation can be summarized as follows:

• We conduct research on designing transformer-based frameworks for single-cell spatial and multimodal data. Based on these studies, we further propose a multimodal foundation model for single-cell analysis.

• In Chapter 2, we explore the application of graph neural networks on cell-feature heterogeneous graphs for multi-modal single-cell data, named scMoGNN (Wen et al., 2022a). scMoGNN can readily incorporate external domain knowledge and model the interactions within each modality and across modalities.

• In Chapter 3, we extend our studies from graph neural networks (i.e., scMoGNN) to graph transformers (i.e., scMoFormer (Tang et al., 2023)) for multi-omics data analysis, which not only enhances performance but also further highlights the potential of transformers in multi-omics representation learning.

• In Chapter 4, we propose a transformer-based model for spatial transcriptomic data, SpaFormer (Wen et al., 2023), and leverage positional encodings to facilitate cell-cell relation identification and data imputation. Our results demonstrate the high efficiency and performance of a masked autoencoder backbone for spatial omics representation learning.

• In Chapter 5, by unifying the explored architectures, we propose a new transformer-based foundation model for multimodal single-cell data, CellPLM (Wen et al., 2024), which can encode spatial and transcriptomic features as well as cell-cell relations.

CHAPTER 2
GRAPH NEURAL NETWORKS FOR SINGLE-CELL MULTI-OMICS REPRESENTATION LEARNING

Recent advances in multimodal single-cell technologies have enabled simultaneous acquisition of multiple omics data from the same cell, providing deeper insights into cellular states and dynamics. However, it is challenging to learn the joint representations from the multimodal data, model the relationships between modalities, and, more importantly, incorporate the vast amount of single-modality datasets into downstream analyses. To address these challenges and correspondingly facilitate multimodal single-cell data analyses, three key tasks have been introduced: modality prediction, modality matching, and joint embedding. In this work, we present a general graph neural network framework, scMoGNN, to tackle these three tasks and show that scMoGNN demonstrates superior results in all three tasks compared with state-of-the-art and conventional approaches. Our method is the official winner in the overall ranking of modality prediction in the NeurIPS 2021 Competition (https://openproblems.bio/neurips_2021), and all implementations of our methods have been integrated into the DANCE package (https://github.com/OmicsML/dance).

2.1 Chapter Introduction

The rapid advance of single-cell technologies makes it possible to simultaneously measure multiple molecular features across multiple modalities in a cell, such as gene expression, protein abundance, and chromatin accessibility. For instance, CITE-seq (cellular indexing of transcriptomes and epitopes by sequencing) (Stoeckius et al., 2017) enables simultaneous quantification of mRNA expression and surface protein abundance; methods like sci-CAR (Cao et al., 2018), Paired-seq (Zhu et al., 2019), and SNARE-seq (Chen et al., 2019) enable joint profiling of mRNA expression and chromatin accessibility (i.e., genome-wide DNA accessibility).
The joint measurements from these methods provide unprecedented multimodal data for single cells, which has given rise to valuable insights into not only the relationship between different modalities but, more importantly, a holistic understanding of the cellular system. Despite the emergence of joint platforms, single-modality datasets are still far more prevalent. How to effectively utilize complementary information from multimodal data to investigate cellular states and dynamics, and how to incorporate the vast amount of single-modality data while leveraging the multimodal data, pose great challenges in single-cell genomics. To address these challenges, Luecken et al. (2021) summarized three major tasks: (1) modality prediction aims at predicting the features of one modality from the features of another modality (Wu et al., 2021b); (2) modality matching focuses on identifying the correspondence of cells between different modalities (Welch et al., 2017); and (3) joint embedding requires embedding the features of two modalities into the same low-dimensional space (Stoeckius et al., 2017). The motivation of modality prediction and modality matching is to better integrate existing single-modality datasets, while joint embedding can provide more meaningful representations of cellular states from different types of measurements. In light of these benefits, computational biologists recently organized a competition for multimodal single-cell data integration at NeurIPS 2021 (Luecken et al., 2021) to benchmark these three tasks and facilitate the computational biology community.

There is an emerging trend to leverage deep learning techniques to tackle the tasks mentioned above for multimodal single-cell data (Molho et al., 2024). BABEL (Wu et al., 2021b) translated between the transcriptome (mRNA) and chromatin (DNA) profiles of a single cell based on an encoder-decoder architecture; scMM (Minoura et al., 2021) implemented a mixture-of-experts deep generative model for joint embedding learning and modality prediction; Cobolt (Gong et al., 2021) acquired joint embeddings via a variant of the Multimodal Variational Autoencoder (MVAE) (Yao et al., 2021); and MOFA2 (Argelaguet et al., 2020) used Bayesian group factor analysis to reduce the dimensions of multi-modality data and generate a low-dimensional joint representation. However, most of these approaches treat each cell as a separate input without considering possible high-order interactions among cells or different modalities. Such higher-order information can be essential for learning with high-dimensional and sparse cell features, which are common in single-cell data. Take the joint embedding task as an example: the feature dimensions for GEX (mRNA) and ATAC (DNA) data are as high as 13,431 and 116,490, respectively; however, only 9.75% of GEX and 2.9% of ATAC features are nonzero on average over 42,492 training samples (cells). Furthermore, integrated measurement often requires additional processing of cells, which can introduce extra noise and drop-out in the resulting data (Lee et al., 2020; Mimitou et al., 2021). Therefore, a technique that can mitigate the negative impact of such noise is highly desirable.
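To make the sparsity figures above concrete, the nonzero fraction of a modality matrix can be read directly from its sparse representation. The following is a minimal sketch using a randomly generated stand-in matrix; the variable names and the use of SciPy are illustrative only and not part of the competition pipeline:

```python
import scipy.sparse as sp

# Toy stand-in for a GEX matrix; the real joint embedding data comprise
# 42,492 cells x 13,431 genes with ~9.75% nonzero entries on average.
gex = sp.random(1000, 13431, density=0.0975, format="csr", random_state=0)

# Fraction of stored (nonzero) entries over all matrix entries.
nonzero_fraction = gex.nnz / (gex.shape[0] * gex.shape[1])
print(f"average nonzero fraction: {nonzero_fraction:.2%}")  # ~9.75%
```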
Recently, the advances in graph neural networks (GNNs) (Kipf and Welling, 2017; Wu et al., 2020; Battaglia et al., 2018; Gilmer et al., 2017; Liu et al., 2021b) pave the way for addressing the aforementioned issues in single-cell data integration. Specifically, GNNs aggregate information from neighborhoods to update node embeddings iteratively (Gilmer et al., 2017); thus, the node embeddings can eventually encode high-order structural information through multiple aggregation layers. In addition, GNNs smooth the features by aggregating neighbors' embeddings and also filter the eigenvalues of the graph Laplacian, which provides an extra denoising mechanism (Ma et al., 2021b). Hence, by modeling the interactions between cells and their features as a graph, we can adopt GNNs to exploit the structural information and tackle the limitations of previous techniques for single-cell data integration. With the constructed graph, we can readily incorporate external knowledge (e.g., interactions between genes) into the graph to serve as additional structural information. Moreover, it enables a transductive learning paradigm with GNNs to gain additional semi-supervised signals that enhance representation learning. Given those advantages, we aim to design a GNN framework for multimodal data integration. While several existing works attempted to introduce graph neural networks to single-cell analysis (Song et al., 2021; Wang et al., 2021; Ciortan and Defrance, 2022; Shao et al., 2021), none of them tackles the challenging problem of multimodal data integration, which requires handling different modalities simultaneously. Therefore, we aim to develop GNN methods for the tasks in multimodal data integration, especially modality prediction, modality matching and joint embedding. Specifically, we propose a general framework, scMoGNN, for modeling interactions of modalities and leveraging GNNs in single-cell analysis; our solution won first place in the modality prediction task of the Multimodal Single-Cell Data Integration competition at NeurIPS 2021. Our framework is highly versatile: we demonstrate its use cases in the three different multimodal tasks. To the best of our knowledge, we are the first to develop a GNN framework in this emerging research topic, i.e., multimodal single-cell data integration. Our proposed framework achieves the best results in all three tasks on the benchmark datasets, providing a very strong baseline for follow-up research. Our contributions can be summarized as follows:

1. We study the problem of multimodal single-cell data integration and propose a general GNN-based framework, scMoGNN, to capture the high-order structural information between cells and modalities.

2. The proposed general framework is highly flexible, as it can be adopted in different multimodal single-cell tasks.

3. Our framework achieves remarkable performance across tasks. It won first place in the modality prediction task of the Multimodal Single-Cell Data Integration competition, and currently outperforms all models for all three tasks on the leaderboard (https://eval.ai/web/challenges/challenge-page/1111/leaderboard/2860). All of our results are based on publicly available data and are reproducible.

2.2 Problem Statement

Before we present the problem statement, we first introduce the notations used in this chapter. There are three modalities spanning the tasks: GEX as mRNA data, ATAC as DNA data, and ADT as protein data.
Each modality is initially represented by a matrix $M \in \mathbb{R}^{N \times k}$, where $N$ indicates the number of cells and $k$ denotes the feature dimension for each cell. In our work, we later construct a bipartite graph $G = (U, V, E)$ based on each modality $M$, where $U$ is the set of $N$ cell nodes $\{u_1, u_2, \ldots, u_N\}$ and $V$ is the set of $k$ feature nodes $\{v_1, v_2, \ldots, v_k\}$. With the aforementioned notations, the problem of learning GNNs for single-cell data integration is formally defined as follows: given a modality $M \in \mathbb{R}^{N \times k}$, we aim at learning a mapping function $f_\theta$ which maps $M$ to the space of downstream tasks. In the following, we formally define the three key tasks of single-cell data integration: modality prediction, modality matching and joint embedding. We will also define the corresponding evaluation metrics for each task. Note that these metrics are also adopted by the competition to decide the top winners.

2.2.1 Task 1: Modality Prediction

In this task, given one modality (like GEX), the goal is to predict the other (like ATAC) for all feature values in each cell. It can be formally defined as follows: given a source modality $M_1 \in \mathbb{R}^{N \times k_1}$, the goal is to predict a target modality $M_2 \in \mathbb{R}^{N \times k_2}$ via learning a mapping function $f_\theta$ parameterized by $\theta$ such that $M_2 = f_\theta(M_1)$. Possible modality pairs $(M_1, M_2)$ are (GEX, ATAC), (ATAC, GEX), (GEX, ADT) and (ADT, GEX), which correspond to four sub-tasks in Task 1. Root mean square error (RMSE) is used to quantify performance between observed and predicted feature values.

2.2.2 Task 2: Modality Matching

The goal of this task is to identify the correspondence between two single-cell profiles and provide the probability distribution of these predictions. It can be formally defined as follows: given modality $M_1 \in \mathbb{R}^{N \times k_1}$ and modality $M_2 \in \mathbb{R}^{N \times k_2}$, we aim to learn two mapping functions $f_{\theta_1}$ parameterized by $\theta_1$ and $f_{\theta_2}$ parameterized by $\theta_2$ to map them into the same space such that

$$ S = g(f_{\theta_1}(M_1), f_{\theta_2}(M_2)) \qquad (2.1) $$

where $g$ is a score function that calculates the probability distribution of correspondence predictions. $S \in \mathbb{R}^{N \times N}$ is an output score matrix with each row summing to 1, and $S_{i,j}$ is the correspondence probability between the $i$-th cell from modality $M_1$ and the $j$-th cell from modality $M_2$. Possible modality pairs $(M_1, M_2)$ are (GEX, ATAC), (ATAC, GEX), (GEX, ADT) and (ADT, GEX), which correspond to four sub-tasks in Task 2. The sum of the weights on the correct correspondences of $S$ is used as the final score to quantify prediction performance: $\mathrm{score} = \sum_{i=1}^{N} S_{i,i}$.

2.2.3 Task 3: Joint Embedding

In this task, the goal is to learn an embedded representation that leverages the information of both modalities. The quality of the embedding will be evaluated using a variety of criteria generated from expert annotation. It can be formally defined as follows: given modality $M_1 \in \mathbb{R}^{N \times k_1}$ and modality $M_2 \in \mathbb{R}^{N \times k_2}$, we aim to learn three mapping functions $f_{\theta_1}$, $f_{\theta_2}$ and $f_{\theta_3}$, parameterized by $\theta_1$, $\theta_2$ and $\theta_3$ accordingly, to project them into the downstream task space:

$$ H = f_{\theta_3}\big(\mathrm{CONCAT}(f_{\theta_1}(M_1), f_{\theta_2}(M_2))\big) \qquad (2.2) $$

where $f_{\theta_1}(M_1) \in \mathbb{R}^{N \times k_1'}$ and $f_{\theta_2}(M_2) \in \mathbb{R}^{N \times k_2'}$ correspond to new representations learned from modalities $M_1$ and $M_2$ separately, and $H \in \mathbb{R}^{N \times k_3}$ is the final output embedding learned through $f_{\theta_3}$ on the concatenation of $f_{\theta_1}(M_1)$ and $f_{\theta_2}(M_2)$. $H$ will be measured using six different metrics broken into two classes: biology conservation and batch removal.
Biology conservation metrics include "NMI cluster/label", "Cell type ASW", "Cell cycle conservation" and "Trajectory conservation", which aim to measure how well an embedding conserves expert-annotated biology. Batch removal metrics include "Batch ASW" and "Graph connectivity", which evaluate how well an embedding removes batch variation. Please refer to Appendix A.1 for more details about these metrics. Possible modality pairs $(M_1, M_2)$ are (GEX, ATAC) and (ADT, GEX), which correspond to two sub-tasks in Task 3.

In this work, we instantiate $f_\theta$ as a graph neural network model by first constructing a bipartite graph $G = (U, V, E)$ based on modality $M$, and then learning better cell node representations through message passing on the graph.

2.3 Method

In this section, we first introduce the proposed general framework scMoGNN for multimodal data integration. Then we introduce how to adapt scMoGNN to advance the three tasks: modality prediction, modality matching and joint embedding. An illustration of our framework is shown in Figure 2.1. Specifically, our framework can be divided into three stages: graph construction, cell-feature graph convolution, and task-specific head.

2.3.1 A General GNN Framework

To develop a GNN-based framework for single-cell data integration, we are essentially faced with the following challenges: (1) how to construct the graph for cells and their features (or modalities); (2) how to effectively extract meaningful patterns from the graph; and (3) how to adapt the framework to different multimodal tasks.

Figure 2.1 An overview of scMoGNN. (a) Graph construction. (b) Cell-feature graph convolution with task-specific head. We first construct the cell-feature graph from a given modality and then perform cell-feature graph convolution to obtain latent embeddings of cells, which are sent to a task-specific head to perform the downstream task.

2.3.1.1 Graph Construction

With the single-cell data, our first step is to construct a cell-feature graph that GNNs can be applied to. We construct a cell-feature bipartite graph where the cells and their biological features (e.g., GEX, ADT or ATAC features) are treated as different nodes, which we term cell nodes and feature nodes, respectively. A cell node is connected with the feature nodes that represent its features. The advantage of such a graph structure is that cell nodes can propagate features to their neighboring feature nodes, and vice versa. Formally, we denote the bipartite graph as $G = (U, V, E)$. In this graph, $U$ is the set of $N$ cell nodes $\{u_1, u_2, \ldots, u_N\}$ and $V$ is the set of $k$ feature nodes $\{v_1, v_2, \ldots, v_k\}$, where each feature node refers to one feature dimension of the input data. $E \subseteq U \times V$ represents the set of edges between $U$ and $V$, which describe the relations between cells and features. The graph can be denoted as a weighted adjacency matrix

$$ A = \begin{pmatrix} O & M \\ M^T & O \end{pmatrix} \in \mathbb{R}^{(N+k) \times (N+k)} \qquad (2.3) $$

where $O$ is a matrix filled with constant 0 and $M \in \mathbb{R}^{N \times k}$ is the input feature matrix of cells. $M$ can also be viewed as the features of one modality, such as GEX. Note that $A$ is a bipartite graph where two nodes within the same set (either feature nodes or cell nodes) are not adjacent.
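The block structure of Eq. (2.3) can be assembled directly from the cell-by-feature matrix. Below is a minimal sketch using SciPy sparse matrices; the function name and the toy data are illustrative and do not come from the released implementation:

```python
import numpy as np
from scipy.sparse import csr_matrix, bmat

def build_cell_feature_graph(M: csr_matrix) -> csr_matrix:
    """Assemble the (N+k) x (N+k) bipartite adjacency of Eq. (2.3).

    Cell nodes occupy indices [0, N) and feature nodes [N, N+k); the edge
    (cell i, feature j) is weighted by the expression value M[i, j].
    """
    # None blocks are inferred as all-zero (the O blocks in Eq. (2.3)).
    return bmat([[None, M], [M.transpose(), None]], format="csr")

# Toy usage: 3 cells x 4 features.
M = csr_matrix(np.array([[1., 0., 2., 0.],
                         [0., 3., 0., 0.],
                         [2., 0., 0., 1.]]))
A = build_cell_feature_graph(M)
assert A.shape == (3 + 4, 3 + 4)
```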
Based on the aforementioned process of graph construction, we can adjust it for specific tasks, e.g., by incorporating a prior knowledge graph of genes, which can change the bipartite characteristic of $A$. Furthermore, since GNNs mostly deal with attributed graphs, we need to assign initial embeddings to feature and cell nodes. Specifically, we use $X_{\mathrm{cell}}$ and $X_{\mathrm{feat}}$ to denote the initial embeddings for cell and feature nodes, respectively. We have $X_{\mathrm{cell}} \in \mathbb{R}^{N \times d'}$ and $X_{\mathrm{feat}} \in \mathbb{R}^{k \times d''}$, where $d'$ and $d''$ are determined by the task-specific settings. As an illustrative example, the initial embeddings of feature nodes $\{v_1, \ldots, v_k\}$ could be the one-hot index of each feature dimension; thus, $X_{\mathrm{feat}} \in \mathbb{R}^{k \times k}$ is an identity matrix, i.e., $X_{\mathrm{feat}} = I_k$. Meanwhile, we do not have any prior knowledge for each cell; thus, $X_{\mathrm{cell}} = O_{N \times 1}$. Together with the node features, the constructed cell-feature graph can be denoted as $G = (A, X_{\mathrm{feat}}, X_{\mathrm{cell}})$.

2.3.1.2 Cell-Feature Graph Convolution

After we obtain the constructed cell-feature graph, the next challenge is how to extract high-order structural information from the graph. In this work, we utilize GNNs to effectively learn cell/feature representations over the constructed graph. Note that there are two types of nodes in the graph and we need to deal with them differently. For ease of illustration, we first consider only one type of nodes (e.g., feature nodes $V$), and later we extend it to the whole graph. Typically, GNNs follow the message-passing paradigm (Gilmer et al., 2017), and in each layer of a GNN, the embedding of each node is updated according to the messages passed from its neighbors. Let $H^l = \{h_1^l, \ldots, h_N^l\}$ be the input node embeddings in the $l$-th layer, where $h_i^l \in \mathbb{R}^{d_l}$ corresponds to node $v_i$. Hence, the output embeddings of the $l$-th layer can be expressed as follows:

$$ h_i^{l+1} = \mathrm{Update}\big(h_i^l, \mathrm{Agg}(h_j^l \mid j \in \mathcal{N}_i)\big) \qquad (2.4) $$

where $\mathcal{N}_i$ is the set of first-order neighbors of node $v_i$, $\mathrm{Agg}(\cdot)$ indicates an aggregation function on the neighbor nodes' embeddings, and $\mathrm{Update}(\cdot)$ is an update function that generates a new node embedding vector from the previous one and the aggregation results from neighbors. Notably, for the input node embeddings in the first layer, we have

$$ H^1 = \sigma(X_{\mathrm{feat}} W_{\mathrm{feat}} + b_{\mathrm{feat}}), \quad W_{\mathrm{feat}} \in \mathbb{R}^{d' \times d} \qquad (2.5) $$

where $W_{\mathrm{feat}}$ is a transformation matrix, $d$ is the dimension of the hidden space and $\sigma$ is an activation function. Though there exist a number of different GNN models, in this work we focus on the most representative one, the Graph Convolutional Network (GCN) (Kipf and Welling, 2017). Note that it is straightforward to extend the proposed framework to other GNN models. Considering that we have two different types of nodes in the graph (i.e., cell and feature nodes), we make some modifications to the vanilla GCN to deal with different types of nodes/edges.
We name it cell-feature graph convolution, where we separately perform the aggregation function on different types of edges to capture interactions between cell and feature nodes. Moreover, we use different parameters for aggregation on different edges, thus allowing the embeddings of each node type to have very different distributions. Specifically, from a message-passing perspective, the operation in a cell-feature graph convolution layer can be expressed as two steps, i.e., aggregation and updating. In order to generate a message $m$ for different types of nodes, there are at least two basic aggregation functions. One is

$$ m_{U \to V}^{i,l} = \sigma\Big(b_{U \to V}^{l} + \sum_{j \in \mathcal{N}_i,\, v_i \in V} \frac{e_{ji}}{c_{ji}} W_{U \to V}^{l} h_j^l\Big) \qquad (2.6) $$

where $i$ corresponds to node $v_i \in V$ and $j$ corresponds to node $u_j \in U$; $e_{ji}$ denotes the edge weight between $u_j$ and $v_i$; $W_{U \to V}^{l}$ and $b_{U \to V}^{l}$ are trainable parameters; $\sigma(\cdot)$ is an activation function such as ReLU; and $c_{ji}$ is a normalization term defined as follows:

$$ c_{ji} = \sqrt{\sum_{k \in \mathcal{N}_j} e_{jk}} \sqrt{\sum_{k \in \mathcal{N}_i} e_{ki}} \qquad (2.7) $$

Obviously, Eq. (2.6) is the aggregation function from cell nodes $U$ to feature nodes $V$. Thus the other aggregation function, from $V$ to $U$, can be written as

$$ m_{V \to U}^{i,l} = \sigma\Big(b_{V \to U}^{l} + \sum_{j \in \mathcal{N}_i,\, u_i \in U} \frac{e_{ji}}{c_{ji}} W_{V \to U}^{l} h_j^l\Big) \qquad (2.8) $$

The transformation matrices $W_{U \to V}^{l}$ and $W_{V \to U}^{l}$ project the node embeddings from one hidden space to the other, and vice versa. After generating the messages from the neighborhoods, we then update the embeddings of the nodes in $V$ and $U$ accordingly:

$$ h_i^{l+1} = h_i^l + m_{U \to V}^{i,l}, \qquad h_j^{l+1} = h_j^l + m_{V \to U}^{j,l} \qquad (2.9) $$

where $v_i \in V$ and $u_j \in U$. In Eq. (2.9), we adopt a simple residual mechanism in order to enhance self-information. As mentioned earlier, we can have more than two types of edges depending on the downstream task, and the graph structure can be much more complex than a bipartite graph. Despite such complexity, our proposed framework and cell-feature graph convolution have the capacity to handle more types of edges/nodes. We introduce these details in Section 2.3.2.

2.3.1.3 Task-specific Head

After we learn node embeddings for feature and cell nodes, we need to project the embeddings to the space of the specific downstream task. Hence, we design a task-specific head, which depends on the downstream task. Specifically, we first take the node embeddings of cell nodes from each convolution layer, aggregate them, and project them into the downstream task space $\hat{Y}$:

$$ \hat{Y} = \mathrm{Head}\big(\mathrm{Readout}_\theta(H_U^1, \ldots, H_U^L)\big) \qquad (2.10) $$

where $H_U^i$ refers to the embeddings of all cell nodes in the $i$-th layer, $\mathrm{Readout}_\theta(\cdot)$ is a trainable aggregation function, and $\mathrm{Head}(\cdot)$ is a linear transformation that projects the latent embedding to the downstream task space. With the obtained output, we can then optimize the framework by minimizing the task-specific loss functions. In the following subsections, we give the details of training the general framework for different tasks.

2.3.2 Model Specifications

With the proposed general framework scMoGNN, we are now able to perform different tasks by adjusting some of the components. In this subsection, we show how scMoGNN is applied to the three important tasks in single-cell data integration: modality prediction, modality matching and joint embedding.

2.3.2.1 Modality Prediction

In the modality prediction task, our objective is to translate the data from one modality to another. Given the flexibility of graph construction and GNNs in our framework, we can readily incorporate external knowledge into our method.
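Before turning to the task-specific designs, the following is a minimal PyTorch sketch of the cell-feature graph convolution layer of Section 2.3.1.2 (Eqs. (2.6)-(2.9)). For readability it operates on a dense cell-feature matrix, whereas a practical implementation would use sparse message passing; the class and variable names are illustrative and do not come from the released code:

```python
import torch
import torch.nn as nn

class CellFeatureConv(nn.Module):
    """One layer of Eqs. (2.6)-(2.9): direction-specific aggregation
    with a residual update, sketched with a dense cell-feature matrix."""

    def __init__(self, d: int):
        super().__init__()
        self.lin_u2v = nn.Linear(d, d)   # W_{U->V}, b_{U->V}
        self.lin_v2u = nn.Linear(d, d)   # W_{V->U}, b_{V->U}

    def forward(self, h_cell, h_feat, M):
        # M: (N, k) weighted cell-feature matrix; normalize as in Eq. (2.7).
        deg_cell = M.sum(dim=1, keepdim=True).clamp(min=1e-12)   # (N, 1)
        deg_feat = M.sum(dim=0, keepdim=True).clamp(min=1e-12)   # (1, k)
        M_norm = M / torch.sqrt(deg_cell) / torch.sqrt(deg_feat)
        # Messages: cells -> features (Eq. 2.6) and features -> cells (Eq. 2.8).
        m_u2v = torch.relu(self.lin_u2v(M_norm.t() @ h_cell))    # (k, d)
        m_v2u = torch.relu(self.lin_v2u(M_norm @ h_feat))        # (N, d)
        # Residual updates (Eq. 2.9).
        return h_cell + m_v2u, h_feat + m_u2v

# Toy usage: 3 cells, 4 features, hidden size 8.
conv = CellFeatureConv(d=8)
h_cell, h_feat = torch.zeros(3, 8), torch.randn(4, 8)
M = torch.rand(3, 4)
h_cell, h_feat = conv(h_cell, h_feat, M)
```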
In this task, we adjust the graph construction to include such domain knowledge to enhance the feature information. Specifically, in the GEX-to-ADT and GEX-to-ATAC subtasks, we introduce hallmark gene sets (Liberzon et al., 2015) (i.e., pathway data) from the Molecular Signatures Database (MSigDB) (Subramanian et al., 2005). The pathway data describe the correlations between gene features (i.e., features of GEX data), and the dataset consists of 50 so-called gene sets, each collecting a group of correlated genes. However, there is no numerical information in these gene sets to help us quantify the relations between those genes, resulting in homogeneous relations. Intuitively, we can construct a fully-connected inter-gene graph for each gene set and incorporate those edges into our original graph $G$. Hence, the new adjacency matrix for $G$ is

$$ A = \begin{pmatrix} O & M \\ M^T & P \end{pmatrix} \qquad (2.11) $$

where $P \in \mathbb{R}^{k \times k}$ is a symmetric matrix referring to the links between gene features, generated from the gene set data. Furthermore, we manually add some quantitative information, i.e., the cosine similarity between gene features based on their distributions in the GEX input data. Due to the existence of an extra type of edges in the graph (i.e., edges among feature nodes), we need to make a corresponding adjustment to our cell-feature graph convolution. Taking an arbitrary feature node $v_i$ as an example, we have $\mathcal{N}_i = \mathcal{N}_i^u \cup \mathcal{N}_i^v$, where $\mathcal{N}_i$ denotes the set of neighbors of node $v_i$, $\mathcal{N}_i^u \subseteq U$ is the set of cell-node neighbors of $v_i$, and $\mathcal{N}_i^v \subseteq V$ is the set of feature-node neighbors of $v_i$. Since cell and feature nodes have very different characteristics, when we aggregate the embeddings from $\mathcal{N}_i^u$ and $\mathcal{N}_i^v$ respectively, we expect to get very different results. Thus, when we update the embedding of the center node, we need a strategy to combine messages from the different channels. As a starting point, we enable a scalar weight. Formally speaking, similar to Eq. (2.6) and Eq. (2.8), we first generate two messages $m_{U \to V}^{i,l}$ and $m_{V \to V}^{i,l}$ for each node $v_i \in V$, and then update the node embedding of $v_i$ as follows:

$$ h_i^{l+1} = h_i^l + \alpha \cdot m_{V \to V}^{i,l} + (1 - \alpha) \cdot m_{U \to V}^{i,l} \qquad (2.12) $$

where $\alpha$ is either a hyper-parameter or a learnable scalar that controls the ratio between inter-feature aggregation and cell-feature aggregation.

Next, we elaborate on the modality prediction head and loss function for the task. The head structure for modality prediction is relatively simple. Note that we deliberately keep the same hidden dimension throughout the cell-feature graph convolution; thus, we can simply use a trainable weight vector $w$ to sum up the cell node embeddings from different layers as follows:

$$ \hat{H} = \sum_{i=1}^{L} w_i \cdot H_U^i \qquad (2.13) $$

where $\hat{H}, H_U^i \in \mathbb{R}^{N \times d}$ and $d$ is the dimension of the hidden layers. After that, a simple fully connected layer is applied to transform it to the target space:

$$ \hat{Y} = \hat{H} W + b \qquad (2.14) $$

A root mean squared error (RMSE) loss is then calculated as

$$ \mathcal{L} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2} \qquad (2.15) $$

which is a typical loss function for regression tasks.
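As a minimal sketch of the modality prediction head (Eqs. (2.13)-(2.15)), assuming per-layer cell embeddings are already available from the convolution layers (all names are illustrative, not taken from the released code):

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Layer-weighted readout (Eq. 2.13) followed by a linear projection
    to the target modality (Eq. 2.14)."""

    def __init__(self, num_layers: int, d: int, out_dim: int):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_layers) / num_layers)  # layer weights
        self.proj = nn.Linear(d, out_dim)

    def forward(self, layer_embeddings):
        # layer_embeddings: list of L tensors, each (N, d) cell embeddings.
        H_hat = sum(w_i * H_i for w_i, H_i in zip(self.w, layer_embeddings))
        return self.proj(H_hat)  # (N, out_dim) predicted target modality

def rmse_loss(y_pred, y_true):
    """RMSE loss of Eq. (2.15)."""
    return torch.sqrt(torch.mean((y_true - y_pred) ** 2))

# Toy usage: 4 layers, 3 cells, hidden size 8, predicting a 134-dim ADT vector.
head = PredictionHead(num_layers=4, d=8, out_dim=134)
layers = [torch.randn(3, 8) for _ in range(4)]
loss = rmse_loss(head(layers), torch.randn(3, 134))
```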
2.3.2.2 Modality Matching

Our goal in the modality matching task is to predict the probability that a pair of data from two different modalities actually comes from the same cell. Modality matching differs from the modality prediction task in two regards: (1) it requires interactions between two modalities; and (2) it does not demand detailed predictions from the model but instead emphasizes the pairing between two modalities. Therefore, we need to adjust the graph construction and task-specific head of the framework correspondingly. Since two different modalities are presented as input in this task, we construct two cell-feature graphs. Cell-feature graph convolutions are then performed on these two graphs separately to obtain two embedding matrices $\hat{H}_{m_1} \in \mathbb{R}^{N \times d_1}$ and $\hat{H}_{m_2} \in \mathbb{R}^{N \times d_2}$. We set $d_1 = d_2$ so they can be directly multiplied together. Thus, we calculate the cosine similarity between each cell pair as follows:

$$ S = \hat{H}'_{m_1} \cdot \hat{H}'^{T}_{m_2} \qquad (2.16) $$

where $S \in \mathbb{R}^{N \times N}$ denotes the score matrix, and $\hat{H}'_{m_1}$ and $\hat{H}'_{m_2}$ indicate that we perform L2 normalization on each row (i.e., each cell) of $\hat{H}_{m_1}$ and $\hat{H}_{m_2}$ before the matrix multiplication. We further apply the softmax function to each row and each column of $S$ to convert scores to probabilities. Then we can express the loss function as follows:

$$ \mathcal{L}_{\mathrm{match}} = -\sum_{c_1=1}^{N} \sum_{c_2=1}^{N} \Big( Y_{c_1,c_2} \log(P^r_{c_1,c_2}) + Y_{c_1,c_2} \log(P^c_{c_1,c_2}) \Big), \quad P^r_{i,j} = \frac{e^{S_{i,j}}}{\sum_{k=1}^{N} e^{S_{i,k}}}, \quad P^c_{i,j} = \frac{e^{S_{i,j}}}{\sum_{k=1}^{N} e^{S_{k,j}}} \qquad (2.17) $$

where $Y \in \mathbb{R}^{N \times N}$ denotes a binarized matching matrix that indicates the perfect matching (i.e., the ground-truth labels), and $P_{i,j}$ represents the probability that the $i$-th data point in modality 1 and the $j$-th data point in modality 2 actually refer to the same cell. In addition to the matching loss $\mathcal{L}_{\mathrm{match}}$, we include a set of auxiliary losses to boost the performance, i.e., prediction losses and reconstruction losses:

$$ \mathcal{L}_{\mathrm{aux}} = \mathcal{L}_{\mathrm{pred12}} + \mathcal{L}_{\mathrm{pred21}} + \mathcal{L}_{\mathrm{recon11}} + \mathcal{L}_{\mathrm{recon22}} = \frac{1}{N} \sum_{i=1}^{N} \big(X_{m_2} - f_{\theta_2}(\hat{H}_{m_1})\big)^2 + \frac{1}{N} \sum_{i=1}^{N} \big(X_{m_1} - f_{\theta_1}(\hat{H}_{m_2})\big)^2 + \frac{1}{N} \sum_{i=1}^{N} \big(X_{m_1} - f_{\theta_1}(\hat{H}_{m_1})\big)^2 + \frac{1}{N} \sum_{i=1}^{N} \big(X_{m_2} - f_{\theta_2}(\hat{H}_{m_2})\big)^2 \qquad (2.18) $$

where $X_{m_1}$ and $X_{m_2}$ refer to the preprocessing results of the two modalities, and $f_{\theta_1}$ and $f_{\theta_2}$ each refer to a fully connected network with one hidden layer, which projects the node embeddings $\hat{H}$ to the corresponding target modality space. These auxiliary losses provide extra supervision for our model to encode the necessary information within the hidden space $\hat{H}$; hence, they have the potential to improve the robustness of the model and reduce the risk of overfitting. In the training phase, we jointly optimize $\mathcal{L}_{\mathrm{match}}$ and $\mathcal{L}_{\mathrm{aux}}$.

Lastly, in the inference phase, we introduce bipartite matching as an extra post-processing step to further improve our matching result. Specifically, we first use a percentile threshold to filter the score matrix $S$ in order to reduce the subsequent calculations, resulting in a sparse matrix $S'$. We consider $S'$ as the adjacency matrix of a bipartite graph, which helps us model the data of the two modalities and their inter-class relations. Thus, our goal (i.e., matching between two modalities) becomes a rectangular linear assignment problem, where we try to find a perfect matching that maximizes the sum of the weights of the edges included in the matching. We can effectively solve this problem with the Hungarian algorithm, also known as the Munkres or Kuhn-Munkres algorithm.
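The matching pipeline above can be summarized in a few lines. The sketch below assumes an aligned training batch, so the ground-truth matching is the identity, and uses SciPy's linear_sum_assignment for the Kuhn-Munkres post-processing step; the function names are illustrative:

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def matching_scores(H1, H2):
    """Cosine-similarity score matrix of Eq. (2.16): L2-normalize each row
    (each cell), then multiply. H1, H2: (N, d) embeddings of two modalities."""
    return F.normalize(H1, dim=1) @ F.normalize(H2, dim=1).t()

def matching_loss(S):
    """Row- and column-wise softmax cross-entropy of Eq. (2.17), assuming an
    aligned batch so that the ground-truth matching is the identity."""
    target = torch.arange(S.size(0))
    return F.cross_entropy(S, target) + F.cross_entropy(S.t(), target)

# Toy usage, followed by the assignment-based post-processing on the scores.
S = matching_scores(torch.randn(6, 8), torch.randn(6, 8))
loss = matching_loss(S)
rows, cols = linear_sum_assignment(S.detach().numpy(), maximize=True)
```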
2.3.2.3 Joint Embedding

The target of the joint embedding task is to learn cell embeddings from multiple modalities and thus better describe cellular heterogeneity. Several complex metrics are involved in this task, and there often exist trade-offs among them (e.g., removing the batch effect while retaining the biological information). To provide more training signals, we utilize both supervised and self-supervised losses to train our graph neural networks. Specifically, we first use LSI preprocessing to generate the input node features for the two modalities and concatenate them as one joint modality, which allows us to construct a cell-feature graph. Based on the graph, we perform the proposed cell-feature graph convolution and generate the output cell node embeddings in the same way as in Eq. (2.13). As suggested by prior work (Liu et al., 2021a), cell type information plays a key role in the metrics of joint embedding evaluation, and it is beneficial to extract $T$ dimensions from the hidden space to serve as supervision signals. Following this idea, we calculate the softmax function for $t \in \{1, \ldots, T\}$ and $i \in \{1, \ldots, N\}$, with $N$ being the number of cells:

$$ \hat{Y}_{i,t} = \frac{e^{\hat{H}_{i,t}}}{\sum_{k=1}^{T} e^{\hat{H}_{i,k}}} \qquad (2.19) $$

where $\hat{Y}_{i,t}$ is the probability that cell $i$ belongs to cell type $t$, and $T$ is set to be exactly equal to the total number of cell types. Then we introduce the loss functions:

$$ \mathcal{L} = \mathcal{L}_{\mathrm{recon}} + \mathcal{L}_{\mathrm{cell type}} + \mathcal{L}_{\mathrm{regular}} = \frac{1}{N} \sum_{i=1}^{N} \big(X_{\mathrm{LSI}} - f_\theta(\hat{H})\big)^2 - \sum_{i=1}^{N} \sum_{t=1}^{T} Y_{i,t} \log(\hat{Y}_{i,t}) + \beta \cdot \| \hat{H}_{\tilde{J}} \|^2 \qquad (2.20) $$

where $f_\theta$ is a two-layer perceptron, $Y \in \mathbb{R}^{N \times T}$ contains the sparse cell type labels, and $\hat{H}_{\tilde{J}}$ refers to the hidden dimensions other than the $T$ dimensions that are exclusive to cell type information. Eventually, the output hidden space $\hat{H}$ contains the cellular information required by the task, although the loss functions do not directly optimize the final metrics.
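A compact sketch of the joint embedding objective in Eq. (2.20), under the assumption that the first T hidden dimensions are reserved for cell types and the decoder f_theta is a small two-layer perceptron (all names are illustrative):

```python
import torch
import torch.nn.functional as F

def joint_embedding_loss(H_hat, X_lsi, labels, decoder, T: int, beta: float = 1.0):
    """Sketch of Eq. (2.20): reconstruction + cell-type supervision on the
    first T hidden dimensions + an L2 penalty on the remaining dimensions."""
    recon = F.mse_loss(decoder(H_hat), X_lsi)          # L_recon
    cell_type = F.cross_entropy(H_hat[:, :T], labels)  # softmax of Eq. (2.19)
    regular = beta * H_hat[:, T:].pow(2).mean()        # penalty on other dims
    return recon + cell_type + regular

# Toy usage: 6 cells, 32-dim embedding, 5 cell types, 20-dim LSI input.
decoder = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                              torch.nn.Linear(64, 20))
H_hat = torch.randn(6, 32, requires_grad=True)
loss = joint_embedding_loss(H_hat, torch.randn(6, 20),
                            torch.randint(0, 5, (6,)), decoder, T=5)
```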
2.4 Experiment

In this section, we evaluate the effectiveness of our framework scMoGNN on the three tasks and show how scMoGNN outperforms representative baselines on all of them by combining our general framework with task-specific designs. Note that in these experiments, we follow the official settings and datasets of the multimodal single-cell data integration competition at NeurIPS 2021 (Luecken et al., 2021), and we compare the performance of the proposed framework with that of the top winners in the competition. All source code of our model has been integrated into the DANCE package (Ding et al., 2024).

2.4.1 Modality Prediction

Datasets. In the modality prediction dataset, each modality is provided with source data, preprocessed data (with default methods), and batch labels. Data statistics are shown in Table A.1 in Appendix A.2. Note that the GEX-ATAC and ATAC-GEX subtasks are not entirely symmetric, because in the GEX-ATAC task the output dimension is deliberately reduced: 10,000 ATAC features are randomly selected out of the original 116,490 dimensions.

Settings and Parameters. For our own model, we report the average performance of 10 runs. In each run, we reserved 15% of the training cells for validation and early stopping. In practice, several tricks help boost the performance: (1) utilizing initial residual connections instead of the skip connections in Eq. (2.9), and (2) using group normalization for the aggregation results and dismissing the edge weight normalization stated in Eq. (2.7). Besides, we empirically set the parameter $\alpha$ in Eq. (2.12) to 0.5. Additionally, in order to fit the high-dimensional ATAC features, we enabled node sampling.

Baselines. In Table 2.1, we only show the teams that achieved top results in one or more subtasks of the competition, because they are officially required to declare their model details. For further comparison, we briefly describe each method: (1) Baseline, truncated-SVD dimensionality reduction followed by linear regression. (2) Dengkw, a well-designed model based on kernel ridge regression. (3) Novel, an encoder-decoder structure with LSI preprocessing. (4) Living Systems Lab, an ensemble model composed of random forest models, CatBoost (Prokhorenkova et al., 2018) models, and k-nearest neighbors regression models. (5) Cajal, a feed-forward neural network with heavy feature selection guided by prior knowledge. (6) ScJoint, an ensemble neural network model incorporating various preprocessing strategies, such as extracting TF-IDF features and filtering highly variable genes/features. The source code for all the methods above can be found in the official GitHub repository of the competition (https://github.com/openproblems-bio/neurips2021_multimodal_topmethods). It can be seen that the existing models are relatively simple, mainly based on traditional machine learning algorithms and autoencoders. In contrast, our framework has a more advanced architecture, which provides flexibility across different structural data (e.g., graph data) and different tasks. This makes our framework a very suitable backbone in the field of single-cell multimodal data integration.

Table 2.1 RMSE for Modality Prediction (Task 1) ↓. '*' indicates ensemble models.

Method                GEX2ADT  ADT2GEX  GEX2ATAC  ATAC2GEX  Overall
Baseline              0.4395   0.3344   0.1785    0.2524    0.3012
Dengkw*               0.3854   0.3242   0.1833    0.2449    0.2836
Novel                 0.4153   0.3177   0.1781    0.2531    0.2911
LS. Lab*              0.4065   0.3228   0.1774    0.2393    0.2865
Cajal                 0.4393   0.3311   0.1777    0.2169    0.2891
ScJoint*              0.3954   0.3247   0.1785    0.2377    0.2840
scMoGNN*              0.3898   0.3221   0.1776    0.2403    0.2824
scMoGNN (Single)      0.3885   0.3242   0.1778    0.2315    0.2805
scMoGNN* (Ensemble)   0.3809   0.3223   0.1777    0.2310    0.2780

Results. As shown in Table 2.1, our method achieved the lowest overall loss in the competition (the lower, the better). An interesting observation is that no team led consistently across all subtasks, resulting in individual winners for each category, which is very different from the other two tasks in the competition; however, as far as we know, no team worked on only one subtask. Such a phenomenon may be caused by three factors: (1) the modality prediction task is the most competitive task in the competition, and many participating teams (including our own) participated only in this task; as a result, over 40 individual teams appeared on the final leaderboard. (2) The modality prediction task has only 1,000 cells in the private test dataset; therefore, certain variance exists in the evaluation results. (3) The diverse feature dimensionality and sparsity of the different modalities raise additional challenges to the models' generalization ability. Compared with the other models, our GNN model presented consistently better performance across these four subtasks and became the overall winner of the competition. Furthermore, we have improved our models since the competition, with modifications including a learning-rate decay training strategy and more hidden units along with stronger regularization via dropout and weight decay. These changes effectively strengthened our graph neural network model and significantly improved our results, especially in the toughest subtask, GEX-to-ADT, where the output for each cell is a 134-dimensional dense vector. We now achieve an RMSE loss of 0.3809, which is lower than the previous best score of 0.3854 in the competition.
Overall, the results prove the effectiveness of our GNN framework, and in some specific cases, scMoGNN has a tremendous advantage in terms of performance.

2.4.2 Modality Matching

Datasets. The majority of the modality matching dataset is the same as the modality prediction dataset (as shown in Table A.2 in Appendix A.2), although several differences exist, including: (1) the number of testing cells; (2) the dimensionality of the ATAC features; and (3) the inconsistent cell order among modalities. In the training data, the correspondence of samples between different modalities is given, while for the test data our goal is to find the correspondence.

Settings and Parameters. The experimental settings are similar to the previous task, with some adjustments. For computational convenience, we decoupled the propagation and transformation. Besides, batch labels are given together with the test data, which provides very strong prior knowledge for matching. We thus divided the test data into different batches and then matched samples belonging to the same batch. This trick dramatically reduces the search space, resulting in a significant performance boost. To be fair, we have confirmed that the winning solution also used the same strategy.

Baselines. In Table 2.2, we only compare our model with the winning team and the runner-up team of the competition. Next, we briefly introduce these models: (1) Baseline, which first projects one modality to the other with linear regression, and then searches for the nearest neighbors in the same modality. (2) GLUE (Cao and Gao, 2022) (CLUE), the official winner, a variational autoencoder (VAE) model supervised by three auxiliary losses. (3) Novel, a feed-forward neural network directly supervised by the matching loss.

Table 2.2 Performances for Modality Matching (Task 2) ↑.

Method        GEX2ADT  ADT2GEX  GEX2ATAC  ATAC2GEX  Overall
Baseline      0.0000   0.0000   0.0001    0.0001    0.0001
GLUE (CLUE)   0.0495   0.0516   0.0560    0.0583    0.0539
Novel         0.0373   0.0373   0.0412    0.0412    0.0392
scMoGNN       0.0810   0.0810   0.0630    0.0630    0.0720

Results. As shown in Table 2.2, scMoGNN outperforms the winning team and the runner-up team by a very large margin. Note that we did not create any models for this task during the competition, since we focused on the modality prediction task; this is why we do not have results on the official leaderboard. The score of this metric can be roughly interpreted as the accuracy of predicting the right correspondence for each piece of data, while the search space grows with the total number of cells in the test data. For example, in the test phase of the ADT-to-GEX subtask, we have 15,066 cells to match; thus, for each piece of ADT data, there are 15,066 candidates in the GEX data. Although we separated those cells into three batches, the rough expected accuracy of random guessing is still as low as 1/5000, which indicates the difficulty of this task. Thus, scMoGNN has already achieved very high performance (e.g., 0.08 in the ADT-GEX subtask). Note that both team Novel's model and scMoGNN utilize a symmetric matching algorithm; thus, we obtain exactly the same performance on dual subtasks (e.g., GEX2ADT and ADT2GEX). Another interesting observation is that our proposed graph neural network model is especially good at the GEX-ADT dual subtasks, where we improved the previous winning performance from 0.05 to 0.08.
Table 2.3 Performances for GEX-ADT Joint Embedding (Task 3) ↑.

Method          NMI     Cell type ASW  Cc_con  Traj_con  Batch ASW  Graph Conn  Average
Baseline        0.6408  0.5266         0.9270  0.8631    0.7982     0.8945      0.7699
Amateur (JAE)   0.7608  0.6043         0.7817  0.8591    0.8432     0.9700      0.8039
GLUE            0.8022  0.5759         0.6058  0.7084    0.8800     0.9506      0.7789
scMoGNN         0.8499  0.6496         0.8325  0.8532    0.8691     0.9708      0.8168

2.4.3 Joint Embedding

Datasets. The training data of this task are basically the same as in the modality prediction task. Moreover, data from different modalities are already aligned. Regarding the complementary data, there are two settings for this task: in the 'online' setting, the only available data are the features of the two modalities, while in the 'pre-trained' setting, any external knowledge is acceptable. In this chapter, we follow the second (i.e., pre-trained) setting, and we only compare our results with other pre-trained models. Generally speaking, pre-trained models obtain better performance than online models. In our experiments, cell type labels are provided in the training phase, while the test data consist of all the training cells and unseen test cells, without cell type labels.

Settings and Parameters. Considering that the input in this task is similar to modality matching, we followed the settings in Section 2.4.2.

Baselines. We briefly describe the other three models in Table 2.3: (1) Baseline, a concatenation of the PCA results of the two modalities. (2) Amateur (JAE), the official winner, an autoencoder model that incorporates extra supervision (cell annotations); the model is adapted from scDEC (Liu et al., 2021a). (3) GLUE (Cao and Gao, 2022), an autoencoder model guided by an external knowledge graph.

Results. As shown in Table 2.3, our scMoGNN significantly outperforms the other two models on the GEX-ADT joint embedding task, with an improvement of over 0.1 in the average metric.

2.4.4 Ablation Study

Throughout the previous sections, we have shown that our general graph neural network framework is suitable for all of these tasks in single-cell multimodal data integration. In this subsection, we investigate whether we really benefit from graph structural information, taking the modality matching task as an example. In the modality matching task, we use decoupled GNNs; thus, we can easily remove the graph structural information by eliminating the propagation layers. The result is referred to as "w/o propagation" in Figure 2.2. The performance significantly drops from 0.0810 to 0.0745 in the GEX-ADT subtask and from 0.0630 to 0.0562 in the GEX-ATAC subtask, respectively. These observations indicate that the graph structural information extracted by the propagation layers indeed helps the performance of our method significantly. We also examined the importance of our auxiliary losses, as shown in Figure 2.2. Without the supervision of the auxiliary losses, scMoGNN loses much of its generalization ability, performing as poorly as without graph structural information.

Figure 2.2 Ablation study for the modality matching task.

Figure 2.3 Parameter analysis of layer weight w.

2.4.5 Parameter Analysis

We analyzed an important parameter in our framework, i.e., $w$ in Eq. (2.13), in order to gain deeper insight into scMoGNN.
Specifically, $w$ is a learnable parameter that controls the weight of each propagation layer. Intuitively, the values of $w$ demonstrate the effectiveness of graph structural information and help us understand how much higher-order structural information is valued by the models in different tasks. Therefore, we show the values of $w$ learned by scMoGNN in different tasks in Figure 2.3. Note that the number of layers differs across tasks: we have 4 layers in modality prediction, and 3 and 2 layers in modality matching and joint embedding, respectively. The results consistently show that scMoGNN tends to synthesize the information from every layer, not just the shallow layers, which suggests that higher-order graph structural information is indeed effective. More specifically, joint embedding depends more on local information, which exists in the source input data, while in modality prediction more higher-order information is referenced, indicating that the model needs to enrich its representations with information from similar cells or similar features. This can be explained by the need for more detailed information in modality prediction.

2.5 Related Work

In this section, we briefly introduce work related to ours, including GNNs on single-modality data and multimodal data integration.

GNNs on Single-Modality Data. Graphs occur as a natural representation of single-cell data, both feature-centric (RNAs, DNAs, or proteins) and cell-centric. Thus, a few recent works have applied GNNs to single-cell data. scGCN (Song et al., 2021) proposes a GNN model for knowledge transfer in single-cell omics (mRNA or DNA) based on Graph Convolutional Networks (Kipf and Welling, 2017). scGNN (Wang et al., 2021) formulates and aggregates cell-cell relationships with graph neural networks for missing-data imputation and cell clustering using single-cell RNA sequencing (scRNA-seq) data. scDeepSort (Shao et al., 2021) is a pre-trained cell-type annotation tool for scRNA-seq data that utilizes a deep learning model with a weighted GNN. Similar to our proposed model, scDeepSort also relies on feature-cell graphs.
To the best of our knowledge, we are the first to apply GNNs to multimodal single-cell data integration and to build a GNN-based general framework that broadly works on the three key tasks from the NeurIPS 2021 Competition. Our framework officially won first place in the overall ranking of the modality prediction task. After the competition, we extended our framework to the other two tasks and achieved superior performance compared with the top winning methods.

2.6 Chapter Conclusion

In this chapter, we proposed scMoGNN, a general framework based on GNNs for multimodal single-cell data integration and multi-omics representation learning. It can be broadly applied to all three key tasks, modality prediction, modality matching, and joint embedding, from the NeurIPS 2021 Competition. scMoGNN is able to capture high-order structural information between cells and features. To the best of our knowledge, we are the first to apply GNNs in this field. Our method officially won first place in the overall ranking of the modality prediction task and now outperforms all models on the leaderboards of all three tasks by a remarkable margin.

CHAPTER 3
TRANSFORMERS FOR SINGLE-CELL MULTI-OMICS REPRESENTATION LEARNING

The recent development of multimodal single-cell technology has made it possible to acquire multiple omics data from individual cells, thereby enabling a deeper understanding of cellular states and dynamics. Nevertheless, the proliferation of multimodal single-cell data also introduces tremendous challenges in modeling the complex interactions among different modalities. Recently advanced methods focus on constructing static interaction graphs and applying graph neural networks (GNNs) to learn from multimodal data, such as scMoGNN in Chapter 2. However, such static graphs can be suboptimal, as they do not take advantage of downstream task information; meanwhile, GNNs also have inherent limitations when their layers are deeply stacked. To tackle these issues, in this work, we investigate how to leverage transformers for multimodal single-cell data in an end-to-end manner while exploiting downstream task information. In particular, we propose the scMoFormer framework, which can readily incorporate external domain knowledge and model the interactions within each modality and across modalities.

3.1 Chapter Introduction

Advancements in multimodal single-cell technologies provide the capability to simultaneously profile multiple data types in the same cell, including chromatin accessibility (Cao et al., 2018; Chen et al., 2019), DNA methylation (Gaiti et al., 2019), and nucleosome occupancy (Pott, 2017). For example, CITE-seq (Stoeckius et al., 2017), which utilizes oligonucleotide-conjugated antibodies, can quantify RNA and surface protein abundance in the same cells. Here, protein abundance is measured via antibody-derived tag (ADT) read counts. The Single Cell Multiome ATAC + Gene Expression technology (Belhocine et al., 2021) concurrently profiles the assay for transposase-accessible chromatin (ATAC-seq) (Buenrostro et al., 2013) and RNA expression from the same cell. These technologies offer an exciting opportunity to characterize cell identity and state at an unprecedented resolution, enabling a better understanding of gene regulatory networks in multicellular organisms and tissues (Zhu et al., 2020). Despite the rapid accumulation of multimodal single-cell data, the analyses of such data still face numerous challenges.
First, single-cell measurements frequently exhibit high sparsity, making it difficult to draw meaningful insights from the data. The measurement process is impacted by various environmental factors, such as amplification bias, cell cycle effects, variations in library size, and RNA capture rate, all of which contribute to substantial noise in single-cell data (Eraslan et al., 2019). Furthermore, samples are often measured under different conditions, including batches, times, locations, or instruments, which leads to systematic variations in the measured values and further complicates the interpretation of single-cell data. These imperfections can result in biased estimates of cell-cell interactions and pose tremendous challenges for computational models that aim to exploit such interactions.

To effectively capture the intricate interactions within cells and genes, current research focuses on constructing static graphs based on heuristic criteria and then employing graph neural networks (GNNs) (Kipf and Welling, 2017; Velickovic et al., 2018; Ma and Tang, 2021) to extract information from the built graph. For example, Hao et al. (2021) and Van Dijk et al. (2018) build k-nearest neighbor (k-NN) graphs for cells, which assess the connections between cells by measuring the similarity of their gene expression. However, the quality of k-NN graphs is contingent upon the selected heuristic similarity measure, and the construction does not incorporate downstream task information, which can introduce noise that prevents GNNs from performing well on the downstream task. An alternative way to build the interaction graph is to construct it from prior domain knowledge, e.g., by utilizing publicly accessible databases (Wen et al., 2022a). For instance, Wen et al. (2022a) construct the graph based on the pathway data (Liberzon et al., 2015) extracted from the Molecular Signatures Database (Subramanian et al., 2005), which describes the correlations between gene features. However, similar to k-NN graphs, such a graph construction process does not leverage downstream task information, and the knowledge base may not include all relevant genes/proteins. To make things worse, GNNs have inherent limitations that hinder their success in applications: the over-smoothing (Kreuzer et al., 2021) and over-squashing (Alon and Yahav, 2021) issues, where GNNs produce poor results when their layers are deeply stacked. In view of these, one question naturally arises: can we have a better approach to construct interaction graphs (among cells, genes, and proteins) that utilizes downstream information while avoiding the aforementioned issues?

In light of the recent advances of transformers (Devlin et al., 2019; Kitaev et al., 2020; Liu et al., 2021c, 2022) in capturing pairwise relations among objects, we seek to utilize transformers to learn the interaction graph for cells, genes, and proteins in an end-to-end manner. Transformers are well suited to address the limitations of static graphs: they learn the interactions between objects through the self-attention mechanism, where all objects attend to each other with learnable attention scores indicating their interaction strength. Thus, the attention matrix provides an advanced approach to characterize the interaction between objects in a data-driven way and has demonstrated success in reducing unwanted variance and noise across batches (Yang et al., 2022).
Moreover, multi-head and multi-layer transformers have the capability of capturing more complex and nuanced relationships during the training process (Huang et al., 2021; Ieremie et al., 2022). However, these traditional transformers do not account for available graph structure and are therefore unable to leverage prior information present in graph data, such as biological knowledge graphs. In this context, graph transformers (Rampasek et al., 2022; Ying et al., 2021; Dwivedi and Bresson, 2020) offer a solution by combining the strengths of GNNs and transformers to make use of graph data. These approaches allow for the incorporation of prior insights from structural information learned by GNNs, while still allowing data-specific interactions to be learned through the attention mechanism. On top of that, graph transformers also alleviate the over-smoothing and over-squashing problems of GNNs by enabling individual objects to attend to unconnected objects (Chen et al., 2022; Dwivedi et al., 2021; Kreuzer et al., 2021; Rampášek et al., 2022). Therefore, it is of great importance to investigate the potential of (graph) transformers in single-cell analysis.

In this work, we aim to design a transformer framework for multimodal single-cell data. To utilize the strengths of (graph) transformers, we face two non-trivial challenges. First, since multimodal data contain diverse information from various sources, e.g., genes, proteins, and cells, it can be difficult for a single transformer to capture all aspects. Second, traditional transformers suffer a quadratic computational complexity w.r.t. the number of objects, which poses a challenge for single-cell analysis, where the number of cells can be large. To address the first challenge, we introduce the Single-Cell Multimodal Transformer (scMoFormer), which employs multiple transformers to model the multimodal data, allowing each transformer to deal with a specific data modality. The core of scMoFormer is the cross-modality aggregation component, which builds a bridge between these transformers and aggregates the necessary information from each of them. For the second challenge, scMoFormer applies linearized transformers (Choromanski et al., 2021) to the cells, which greatly reduces the computational complexity. To the best of our knowledge, we are the first to employ transformers to advance the analysis of multimodal single-cell data. Our proposed framework achieves promising results on the benchmark datasets, providing a very strong baseline for follow-up research. Our contributions can be summarized as follows:

• We study the problem of multimodal single-cell data analysis and propose a transformer framework, scMoFormer, to capture the intricate relations within and between modalities.

• The proposed scMoFormer is versatile and can flexibly incorporate domain knowledge from biological databases regarding genes and proteins.

• The proposed scMoFormer achieves superior performance on various benchmark datasets. Remarkably, we won a Kaggle silver medal with a rank of 24/1221 (top 2%) without multi-model ensembling in a NeurIPS 2022 competition¹.

3.2 Problem Statement

Before we state the problem, we first introduce the notations used in the following sections. For clarity and simplicity, we use the subscripts "g", "p", and "c" for gene, protein, and cell, respectively. For instance, we use $\mathbf{h}_g$, $\mathbf{h}_p$, $\mathbf{h}_c$ to denote the embeddings of genes, proteins, and cells, respectively.
Although we aim to develop a general framework for multi-omics, a good entry point is to focus on a specific standardized problem that is easy to evaluate. In this study, we follow the benchmark setting of the NeurIPS 2022 competition, i.e., Multimodal Single-Cell Integration Across Time, Individuals, and Batches². The specific problem in this competition is to predict a paired modality from a given modality and to infer how DNA, RNA, and protein measurements co-vary in single cells. For simplicity, in this work, we focus on using gene expression (RNA) to predict surface protein levels. We denote $\mathbf{X}_g$ and $\mathbf{X}_p$ as the measurement counts of genes and proteins, respectively. Given $\mathbf{X}_g$, we try to learn a mapping function that best describes the relationship between the two modalities. We denote $\mathcal{L}(\cdot,\cdot)$ as the objective function that measures the dissimilarity between the predicted and the true protein levels. Formally, we describe our target as an optimization problem: given $\mathbf{X}_g$ and the objective function $\mathcal{L}(\cdot,\cdot)$, we aim to find a mapping function $f_\theta^*$ (parameterized by $\theta$) that minimizes the objective loss:

$$f_\theta^* = \arg\min_{f_\theta} \mathcal{L}\left(f_\theta\left(\mathbf{X}_g\right), \mathbf{X}_p\right). \quad (3.1)$$

In the subsequent sections, we formulate the mapping $f_\theta^*$ using transformers and GNNs and employ Root Mean Square Error, Mean Absolute Error, and the Pearson correlation coefficient as evaluation metrics for our predictions.

¹The competition official website is https://nips.cc/virtual/2022/competition/50092.
²The competition information can be found at https://www.kaggle.com/competitions/open-problems-multimodal.

3.3 Proposed Multimodal Transformer Framework

In this section, we introduce the proposed multimodal framework scMoFormer. An overview of scMoFormer is shown in Figure 3.1. It consists of multimodal graph construction, a multimodal transformer, and a prediction layer. In brief, we first construct a heterogeneous graph that contains cell, gene, and protein nodes together with their interactions. We then utilize multiple (graph) transformers on top of this heterogeneous graph to extract rich cell representations and predict each cell's surface protein abundance levels. In the following subsections, we detail these key components.

Figure 3.1 An illustration of scMoFormer. In this framework, three important components are included: graph construction, multimodal transformer, and prediction layer.

3.3.1 Multimodal Graph Construction

In this subsection, we introduce how we construct the graph for multimodal single-cell data. Specifically, we construct a heterogeneous graph containing three different types of nodes to represent the entire data. It has four subgraphs, as shown in Figure 3.1, i.e., a protein-protein graph, a gene-gene graph, a gene-protein graph, and a cell-gene graph. Next, we detail how to construct these subgraphs.

3.3.1.1 Subgraph Construction

Before we present the construction of the multimodal heterogeneous graph, we first describe how we build the subgraphs within and between modalities.

Protein-protein graph. To integrate prior biological information into the protein-protein graph, we refer to the STRING (Szklarczyk et al., 2023) database. STRING provides a comprehensive resource of protein-protein interactions and the functional relationships between different proteins. It contains seven different channels covering varied sources, including genomic context, experimental results, and text mining efforts.
We use the combined confidence score of all channels as edge weights to comprehensively enhance the graph. The proteins in STRING are labeled with Ensembl (Cunningham et al., 2022) protein IDs (ENSP), and the mapping from ENSP to the preferred protein name is provided in an additional information resource. Note that a protein commonly has multiple aliases. To match the protein display names with ENSP identifiers, we utilize GeneCards (Stelzer et al., 2016) to find all possible aliases of our target proteins. In a nutshell, the mapping from STRING nodes to our target proteins is given by: ENSP → possible aliases of proteins → the display names of proteins within the dataset.

Gene-gene graph. In order to maintain consistency and reduce the impact of potential noise from multiple sources of prior information, we also utilize the STRING database to construct the gene-gene graph and combine all seven channels of the STRING database to enhance the graph as much as possible. It is worth noting that although the gene and protein nodes are biologically equivalent in the sense of prior information, we process them separately to more specifically handle data-specific information related to the target labels. The genes are labeled with Ensembl gene IDs (ENSG), and additional matching efforts are needed to form the gene-gene graph. We utilize the MyGene.info (Wu et al., 2014) gene query service to align the STRING protein nodes, encoded by ENSP, with the input gene nodes, encoded by ENSG.

Gene-protein graph. Now that we have two separate graphs among proteins and genes, we would like to form a general frame by adding connections between genes and proteins. Following the central dogma of molecular biology, information flows from RNA to proteins via translation. This encoding relationship is recorded within the gene and protein nomenclature. Specifically, the gene names in existing single-cell multimodal datasets contain two parts: the ENSG and the symbol or name of the corresponding protein. That is, if a gene and a protein share the same symbol, the protein is encoded by that gene. Utilizing this biological information, we link proteins to genes by matching their symbols.

Cell-gene graph. The constructed gene-protein graph is fully based on prior knowledge, and we integrate data-based information into the multimodal heterogeneous graph by involving cell nodes. The gene expression counts of multimodal data imply the data-specific relationships between genes and cells. Naturally, a cell and a gene are connected if the gene is expressed within that cell. Note that the number of genes detected within each cell varies significantly, which indicates that the raw counts of RNA abundance show substantial heterogeneity among cells. In addition, the raw data include extremely large counts, making it impractical to straightforwardly use the count data as edge weights between cells and genes. Therefore, we normalize each cell by the total counts over all genes and take the logarithm of the resulting data matrix, i.e., library-size normalization and centered log-transformation. The processed data are then fed into the multimodal graph to describe the cell-gene links.

Remark. In the above discussion, we have introduced four types of subgraphs, i.e., a protein-protein graph, a gene-gene graph, a gene-protein graph, and a cell-gene graph. It is worth noting that we do not construct a cell-cell graph.
Instead, we use the transformer to learn cell-cell relationships via the attention mechanism. This is in contrast with some previous studies that utilize a static cell-cell similarity graph generated from the input features, which can be prone to inaccurate cell relationships due to the noisy nature of single-cell data. On the other hand, we do not explicitly establish connections between the cells and the target protein nodes, as this might cause the model to overfit easily.

3.3.1.2 Heterogeneous Graph Construction

With the aforementioned subgraphs, we are now ready to explain how we construct the multimodal heterogeneous graph.

Formal Graph Definition. We now formally define the multimodal heterogeneous graph. Let $\mathbf{A}$ be the adjacency matrix of the multimodal heterogeneous graph with node set $\mathcal{V} = (\mathbf{v}_p, \mathbf{v}_g, \mathbf{v}_c)$, where

$$\mathbf{v}_p = \left(v_p^1, v_p^2, \ldots, v_p^{N_p}\right), \quad \mathbf{v}_g = \left(v_g^1, v_g^2, \ldots, v_g^{N_g}\right), \quad \mathbf{v}_c = \left(v_c^1, v_c^2, \ldots, v_c^{N_c}\right), \quad (3.2)$$

and $N_p$, $N_g$, $N_c$ are the numbers of proteins, genes, and cells, respectively. Denoting $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ as the edge set, we write the multimodal heterogeneous graph as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. Let $N = N_p + N_g + N_c$ be the total number of nodes; the adjacency matrix $\mathbf{A}$ can then be written as:

$$\mathbf{A} = \begin{pmatrix} \mathbf{A}_p & \mathbf{S}^T & \mathbf{0} \\ \mathbf{S} & \mathbf{A}_g & \mathbf{A}_{\mathrm{RNA}}^T \\ \mathbf{0} & \mathbf{A}_{\mathrm{RNA}} & \mathbf{0} \end{pmatrix} \in \mathbb{R}^{N \times N}, \quad (3.3)$$

where $\mathbf{A}_p \in \mathbb{R}^{N_p \times N_p}$ is the protein-protein interaction graph; $\mathbf{A}_g \in \mathbb{R}^{N_g \times N_g}$ denotes the gene-gene graph mapped from STRING via MyGene.info (Lelong et al., 2022); $\mathbf{S} \in \mathbb{R}^{N_g \times N_p}$ is the encoding relationship between genes and proteins; and $\mathbf{A}_{\mathrm{RNA}} \in \mathbb{R}^{N_c \times N_g}$ represents the gene expression.

Node Feature Initialization. With the graph structure, we next discuss how to initialize the node features. Since the RNA counts $\mathbf{X}_g$ are drastically sparse and high-dimensional, they are not practical to use directly as cell features. Therefore, we first denoise the data and reduce the dimension with Singular Value Decomposition (SVD). To alleviate the effects of cell-to-cell heterogeneity and extremely large counts, we conduct library-size normalization and centered log-transformation. The preprocessed data $\bar{\mathbf{X}}$ are then passed through the SVD algorithm. Cell features are initialized with the reduced features $\mathbf{h}_c^0 \in \mathbb{R}^{N_c \times d_0}$. We then initialize the gene features $\mathbf{h}_g^0 \in \mathbb{R}^{N_g \times d_0}$ as the weighted sum $\mathbf{h}_g^0 = \bar{\mathbf{X}}_g^T \cdot \mathbf{h}_c^0$, with the normalized counts $\bar{\mathbf{X}}_g$ as the weights. In the studied problem, proteins are the target modality for prediction; thus they are initialized randomly based on their indices.

3.3.2 Multimodal Transformers

In the proposed scMoFormer, we address the challenge of heterogeneity in the multimodal heterogeneous graph, which consists of three distinct modalities: cells, genes, and proteins. To effectively handle this heterogeneity, we employ multiple transformers, each designed to process a specific data modality. The information obtained from these individual transformers is then aggregated through a cross-modality aggregation mechanism. Subsequently, we present the transformers assigned to each modality and elaborate on how to coherently integrate them.

3.3.2.1 Cell Transformer

The transformer (Vaswani et al., 2017) has made significant achievements in the field of Natural Language Processing (NLP) in recent years.
The attention mechanism can capture high-order and non-Euclidean connections between nodes, which is desirable for exploring cell-cell relationships within single-cell multimodal data. Following the notation of the original transformer, we denote the queries, keys, and input cell embeddings as $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{h}_c \in \mathbb{R}^{N_c \times d}$ with input dimension $d$; the scaled dot-product attention can be formulated as

$$\mathrm{Attn}(\mathbf{h}_c) = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}}\right) \mathbf{h}_c, \quad (3.4)$$

where $\mathbf{A}_{\mathrm{attn}} = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}}\right)$ is the attention matrix of cell-cell interactions. Despite being effective in various NLP tasks, the original transformer has limited scalability due to its relatively high space complexity of $O(N_c^2 + N_c d)$ and time complexity of $O(N_c^2 d)$. This issue considerably limits the application of transformers in single-cell multimodal analysis. Since multimodal data typically include tens of thousands of cells, it is not feasible to apply the original transformer directly to cells. To address the scalability issue, we employ generalized kernelizable attention (Choromanski et al., 2021) as a computationally efficient approximation of traditional attention. The attention blocks are kernelized in the form $\mathbf{A}(i,j) = \mathcal{K}\left(\mathbf{q}_i^\top, \mathbf{k}_j^\top\right)$, where $\mathbf{q}_i$ stands for the $i$-th row of the query $\mathbf{Q}$ and $\mathbf{k}_j$ denotes the $j$-th row of the key $\mathbf{K}$. The kernel $\mathcal{K}: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_+$ is specified as:

$$\mathcal{K}(\mathbf{x}, \mathbf{y}) = \mathbb{E}\left[\phi(\mathbf{x})^\top \phi(\mathbf{y})\right], \quad (3.5)$$

where $\phi: \mathbb{R}^d \to \mathbb{R}_+^r$ denotes the feature mapping. Let $\mathbf{Q}', \mathbf{K}' \in \mathbb{R}^{N_c \times r}$ be the approximate query and key with rows given by $\phi(\mathbf{q}_i^\top)^\top$ and $\phi(\mathbf{k}_i^\top)^\top$, respectively; the kernel approximation of cell attention is then formulated as

$$\mathrm{Attn}(\mathbf{h}_c) = \widehat{\mathbf{D}}^{-1}\left(\mathbf{Q}'\left((\mathbf{K}')^\top \mathbf{h}_c\right)\right), \quad \text{where } \widehat{\mathbf{D}} = \mathrm{diag}\left(\mathbf{Q}'\left((\mathbf{K}')^\top \mathbf{1}_{N_c}\right)\right). \quad (3.6)$$

With a kernel of dimension $r$, the space and time complexity are reduced to $O(N_c r + N_c d + rd)$ and $O(N_c r d)$, respectively. The generalized kernelizable attention boosts our cell transformer to linear space and time complexity, while still delivering results comparable to regular transformers (Choromanski et al., 2021). The attention mechanism captures the intricate relationships between cells, resulting in an improved representation of individual cells.

3.3.2.2 Gene Transformer and Protein Transformer

To leverage biological insights, we include STRING (Szklarczyk et al., 2023) as an additional source of local information about genes and proteins. Although STRING provides solid prior networks, there may still exist data-related concerns when applied to sequencing data. For instance, the NeurIPS 2022 competition dataset³ contains 22,050 genes, but only 13,101 of these genes have interactions recorded in the STRING database. This issue also occurs for proteins, as the NeurIPS 2022 competition dataset collects 140 proteins and only 120 of them are found within the STRING networks. This means that information about interactions among the remaining molecules is not available unless additional global information is included. To address these concerns, we utilize graph transformers to encode both the local and global information about genes and proteins. Specifically, following GraphGPS (Rampasek et al., 2022), we adapt graph transformers for the gene-gene and protein-protein graphs separately, i.e., a gene transformer and a protein transformer.
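Before detailing the gene and protein transformers, we make the linear attention of the cell transformer concrete. The following is a minimal PyTorch sketch of the generalized kernelizable attention in Eqs. (3.5)-(3.6), using the positive random-feature map of Choromanski et al. (2021); the function names, toy sizes, and feature dimension r are illustrative rather than our exact implementation:

```python
import torch

def positive_random_features(x, proj):
    # Positive random-feature map phi(.) of Choromanski et al. (2021).
    # x: (N, d) queries or keys; proj: (r, d) Gaussian projection matrix.
    x = x / x.shape[-1] ** 0.25  # rescale so phi(q) . phi(k) estimates exp(q k^T / sqrt(d))
    return torch.exp(x @ proj.T - (x ** 2).sum(-1, keepdim=True) / 2) / proj.shape[0] ** 0.5

def kernelized_attention(q, k, h_c, r=64):
    """Linear-complexity approximation of softmax(Q K^T / sqrt(d)) h_c, cf. Eq. (3.6)."""
    proj = torch.randn(r, q.shape[-1])            # random projection shared by Q and K
    q_prime = positive_random_features(q, proj)   # Q': (N_c, r)
    k_prime = positive_random_features(k, proj)   # K': (N_c, r)
    out = q_prime @ (k_prime.T @ h_c)             # compute (K')^T h_c first: O(N_c * r * d)
    ones = torch.ones(k.shape[0], 1)
    normalizer = q_prime @ (k_prime.T @ ones)     # diagonal of D-hat, shape (N_c, 1)
    return out / normalizer                       # row-wise normalization, i.e. D-hat^{-1}(...)

# Toy usage: 10,000 cells without ever materializing a 10,000 x 10,000 attention matrix.
N_c, d = 10_000, 32
q, k, h_c = (torch.randn(N_c, d) for _ in range(3))
out = kernelized_attention(q, k, h_c)             # shape (10000, 32)
```

The key design point is the order of multiplication: because $(\mathbf{K}')^\top \mathbf{h}_c$ is computed first, the $N_c \times N_c$ attention matrix is never formed, which is what yields the linear complexity stated above.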
Since the gene transformer and the protein transformer have the same architecture, we use the gene transformer to illustrate the details. A gene transformer layer consists of two parallel components: a message-passing GNN block and a global attention block. The GNN block extracts gene interaction information from the prior local network, while the attention block learns global, data-specific relationships by allowing each gene to attend to all other genes. The results from the two blocks are summed together and then processed by fully connected layers to update the gene embeddings. Recall that the adjacency matrix of the gene-gene graph is denoted as $\mathbf{A}_g \in \mathbb{R}^{N_g \times N_g}$; let $\mathbf{h}_g^\ell \in \mathbb{R}^{N_g \times d_\ell}$ with dimension $d_\ell$ be the gene embedding of the $\ell$-th layer. The update functions are as follows:

$$\mathbf{h}_{g,M}^{\ell+1} = \mathrm{GNN}^\ell\left(\mathbf{h}_g^\ell, \mathbf{A}_g\right), \quad \mathbf{h}_{g,T}^{\ell+1} = \mathrm{Attn}^\ell\left(\mathbf{h}_g^\ell\right), \quad \mathbf{h}_g^{\ell+1} = \mathrm{MLP}^\ell\left(\mathbf{h}_{g,M}^{\ell+1} + \mathbf{h}_{g,T}^{\ell+1}\right), \quad (3.7)$$

where $\mathrm{GNN}^\ell$ and $\mathrm{Attn}^\ell$ represent the message-passing GNN block and the global attention mechanism, and $\mathrm{MLP}^\ell$ is a 2-layer MLP block. The GNN block brings in prior biological insights, while the attention block resolves the expressivity bottlenecks caused by over-smoothing (Kreuzer et al., 2021) and over-squashing (Alon and Yahav, 2021).

Positional encoding is applied to the gene transformer layer to provide another solution to the expressivity bottlenecks in gene-gene graphs. Message-passing GNNs update gene node embeddings by aggregating the local neighborhood representations given by gene knowledge graphs. By incorporating additional positional information, positional encoding helps to differentiate nodes that have the same local surroundings but distinct positions. Accompanied by prior gene-gene interaction networks, positional encoding distinguishes genes by the absolute position of each gene within the STRING network. This is important from a biological perspective, as each gene functions differently. Here, we implement Laplacian positional encoding (Dwivedi and Bresson, 2020) and random walk positional encoding (Dwivedi et al., 2022). The Laplacian positional encoding captures the spectral information of the graph Laplacian through its eigenvectors. Denoting the graph Laplacian of the input gene graph as $\mathbf{L}_g$, the matrix factorization of $\mathbf{L}_g$ is formulated as

$$\mathbf{L}_g = \mathbf{I} - \mathbf{D}_g^{-1/2} \mathbf{A}_g \mathbf{D}_g^{-1/2} = \mathbf{U}_g^T \boldsymbol{\Lambda}_g \mathbf{U}_g, \quad (3.8)$$

where $\mathbf{D}_g$ is the degree matrix of the gene graph, and $\boldsymbol{\Lambda}_g$ and $\mathbf{U}_g$ correspond to the eigenvalues and eigenvectors, respectively. The gene node Laplacian positional encoding of dimension $k$ is defined as the $k$ smallest non-trivial eigenvectors. While the Laplacian positional encoding embeds positional information from the graph Laplacian, the random walk positional encoding tends to grasp the positional information given by the graph clusters. The random walk positional encoding of dimension $k$ is defined with $k$ steps of the random walk as:

$$\mathbf{p}_i = \left[\mathbf{RW}_{ii}, \mathbf{RW}_{ii}^2, \cdots, \mathbf{RW}_{ii}^k\right] \in \mathbb{R}^k, \quad (3.9)$$

where $\mathbf{RW} = \mathbf{A}_g \mathbf{D}_g^{-1}$ is the random walk operator. The term $\mathbf{RW}_{ii}^k$ represents the landing probability of gene node $i$ to itself after $k$ steps. The processed positional encoding is then combined with the gene features through a fully connected layer.

³The dataset can be accessed at https://www.kaggle.com/competitions/open-problems-multimodal.

3.3.2.3 Cross-Modality Aggregation

Transformers are constructed separately for each modality. To build a bridge among those transformers, we implement message-passing GNNs along the links that connect nodes of distinct types.
The information from the cell transformer denoises the prior knowledge by adding data-specific details to the gene and protein embeddings. Meanwhile, information from the gene transformer and protein transformer brings biological insights to the cell transformer for predicting the target proteins. In particular, we take advantage of GraphSAGE (Hamilton et al., 2017) to transfer the information. Denote the $i$-th destination node as $v_i$ and the $j$-th source node as $u_j$. The information from the source nodes to the destination nodes is updated by:

$$\mathbf{h}_{\mathcal{N}(v_i)}^{(\ell+1)} = \mathrm{Aggregate}\left(\left\{\mathbf{h}_{u_j}^{\ell}, \ \forall\, u_j \in \mathcal{N}(v_i)\right\}\right), \quad \mathbf{h}_{v_i}^{(\ell+1)} = \mathrm{Update}\left(\mathbf{h}_{v_i}, \mathbf{h}_{\mathcal{N}(v_i)}\right). \quad (3.10)$$

For node $v_i$, the message-passing GNN aggregates information from its neighbors through the aggregator function ($\mathrm{Aggregate}$). The neighborhood information $\mathbf{h}_{\mathcal{N}(v_i)}^{(\ell+1)}$ is then combined with the embedding $\mathbf{h}_{v_i}^{\ell}$ and processed by an update procedure. The newly generated embeddings are normalized before the next iteration. The message-passing GNN modules facilitate communication between the transformers, enabling them to leverage various forms of information during the training process.

We now summarize the workflow of scMoFormer. As shown in Figure 3.1, we apply transformers within each type of node and utilize message-passing GNNs to form the bridges between transformers. Formally, for the $\ell$-th layer, denote the cell transformer as $\mathrm{Trans}_c^\ell$, and the gene and protein graph transformers as $\mathrm{GT}_g^\ell$ and $\mathrm{GT}_p^\ell$, respectively. Let $\mathrm{MPG}_{g \to p}^\ell$ and $\mathrm{MPG}_{p \to g}^\ell$ be the message-passing GNN modules between genes and proteins, and $\mathrm{MPG}_{g \to c}^\ell$ and $\mathrm{MPG}_{c \to g}^\ell$ be those between genes and cells. For the purpose of prediction, we process the cell embeddings with an MLP as cell readouts, where $\mathrm{FC}^\ell$ represents the $\ell$-th fully connected layer. The updates of the node embeddings are achieved by

$$\begin{aligned}
\mathbf{h}_g^{\ell+1} &= \mathrm{GT}_g^\ell\left(\mathbf{h}_g^\ell, \mathbf{A}_g\right) + \mathrm{MPG}_{p \to g}^\ell\left(\mathbf{h}_p^\ell, \mathbf{S}^T\right) + \mathrm{MPG}_{c \to g}^\ell\left(\mathbf{h}_c^\ell, \mathbf{A}_{\mathrm{RNA}}\right), \\
\mathbf{h}_p^{\ell+1} &= \mathrm{GT}_p^\ell\left(\mathbf{h}_p^\ell, \mathbf{A}_p\right) + \mathrm{MPG}_{g \to p}^\ell\left(\mathbf{h}_g^\ell, \mathbf{S}\right), \\
\mathbf{h}_c^{\ell+1} &= \mathrm{Trans}_c^\ell\left(\mathbf{h}_c^\ell\right) + \mathrm{MPG}_{g \to c}^\ell\left(\mathbf{h}_g^\ell, \mathbf{A}_{\mathrm{RNA}}^T\right) + \mathrm{FC}^\ell\left(\mathbf{h}_c^\ell\right),
\end{aligned} \quad (3.11)$$

where $\mathbf{S}$, $\mathbf{A}_p$, $\mathbf{A}_g$, $\mathbf{A}_{\mathrm{RNA}}$ are the adjacency blocks defined in Section 3.3.1.

3.3.3 Prediction Layer

With $L$ layers, the predictions are given by one extra fully connected layer $\mathrm{FC}^{L+1}$ as

$$\mathbf{h}_c^{L+1} = \mathrm{Concat}\left(\mathbf{h}_c^{\ell}, \ \ell \in \{1, 2, \ldots, L\}\right), \quad \widehat{\mathbf{X}}_p = \mathrm{FC}^{L+1}\left(\mathbf{h}_c^{L+1}\right), \quad (3.12)$$

where $\widehat{\mathbf{X}}_p \in \mathbb{R}^{N_c \times N_p}$ is the final prediction. To optimize the whole framework, we adopt a Mean Square Error (MSE) loss to measure the difference between the predictions and the ground-truth values:

$$\mathcal{L}\left(\widehat{\mathbf{X}}_p, \mathbf{X}_p\right) = \frac{1}{N_p N_c} \sum_{i=1}^{N_c} \sum_{j=1}^{N_p} \left(\mathbf{X}_p^{ij} - \widehat{\mathbf{X}}_p^{ij}\right)^2. \quad (3.13)$$

3.4 Experiment

In this section, we present the experimental results of scMoFormer against baselines on benchmark datasets. In particular, we aim to answer the following questions:

• RQ1: How does scMoFormer perform compared with baselines based on various evaluation metrics?

• RQ2: Given various choices of positional encodings, how do they affect the performance of scMoFormer?

• RQ3: How does each model component impact the performance of scMoFormer?

Before presenting our experimental results and observations, we first introduce the experimental settings.
3.4.1 Experimental Settings

3.4.1.1 Datasets

We follow the settings of the NeurIPS multimodal single-cell integration competitions of 2021 (Luecken et al., 2021) and 2022 and collect the joint measurements of gene expression and surface protein levels from the competitions. Both datasets contain the raw counts, which represent the number of reads per gene per cell, as well as normalized counts. For the NeurIPS 2021 competition, we pick the data corresponding to the task of protein abundance prediction from gene expression and refer to it as "GEX2ADT". The processed data are centered and log-transformed for denoising purposes. For the 2022 competition, which we refer to as "CITE", the objective is to utilize CITE-seq (Stoeckius et al., 2017) data measured on days 2, 3, and 4 to predict the protein levels on day 7 for different individuals. It is worth mentioning that the protein-level test data were not available at the time of this work. Therefore, we simulate the competition scenario by treating the training data from day 4 as our test set. The processed RNA data are centered and log-transformed, while the normalized protein levels are denoised and scaled by background (Kotliarov et al., 2020). We summarize the statistics of both datasets in Table 3.1.

Table 3.1 Dataset statistics.

                     CITE      GEX2ADT
Number of RNAs       22,050    13,953
Number of Proteins   140       134
Training Cells       42,843    66,175
Test Cells           28,145    1,000
RNA Zero Rate        0.780     0.904

3.4.1.2 Baselines

We evaluate the performance of scMoFormer against state-of-the-art multimodal prediction models on the task of using gene expression to predict surface protein levels. The selected baselines are as follows:

• Cross-modal Autoencoders (Yang et al., 2021), short for CMAE, incorporates multiple autoencoders to integrate multimodal data and utilizes domain knowledge by adding a discriminative loss to the training process to align shared markers or clusters among datasets.

• BABEL (Wu et al., 2021b) proposed a general framework for multimodal translation with modality-specific encoders and decoders. Note that BABEL originally focused on RNA and ATAC-seq (Buenrostro et al., 2013) data; in this evaluation, we repurpose BABEL for the RNA-to-protein setting.

• scMM (Minoura et al., 2021) models the multimodal data in a generative setting. We note that the input of scMM is restricted to raw counts by design, and the output predictions are scaled as centered log-transformed data.

• ScMoGNN (Wen et al., 2022a) involves domain knowledge such as biological pathways to enhance the GNNs. The original ScMoGNN followed a transductive setting; in this work, we implement an inductive setting of ScMoGNN for a fair comparison with the baselines.

Table 3.2 Prediction evaluations based on different metrics (score ± std).

Dataset  Metric   BABEL               CMAE                scMM*               ScMoGNN             scMoFormer
CITE     RMSE ↓   1.67388 ± 0.00765   2.00874 ± 0.02088   -                   1.66634 ± 0.00741   1.62720 ± 0.00731
         MAE ↓    1.07777 ± 0.00602   1.21897 ± 0.01042   -                   1.07577 ± 0.00372   1.05639 ± 0.00221
         Corr ↑   0.87475 ± 0.00119   0.82502 ± 0.00843   -                   0.87788 ± 0.00113   0.88552 ± 0.00080
GEX2ADT  RMSE ↓   0.45387 ± 0.00738   0.51549 ± 0.00857   0.64067 ± 0.00722   0.42576 ± 0.01180   0.41987 ± 0.00234
         MAE ↓    0.30720 ± 0.00618   0.34855 ± 0.00488   0.43407 ± 0.00307   0.28819 ± 0.00976   0.28289 ± 0.00223
         Corr ↑   0.86144 ± 0.00274   0.81565 ± 0.00463   0.68287 ± 0.00981   0.87051 ± 0.00524   0.87698 ± 0.00121

*The scale of scMM predictions is not compatible with that of normalized protein levels.
3.4.1.3 Parameter Setting

To benchmark the performance of the baselines and scMoFormer, we uniformly employ an inductive setting on both datasets. On the CITE dataset, we use the data measured on day 4 for testing and randomly split the data prior to day 4 into 80%/20% for training and validation. On the GEX2ADT dataset, we randomly pick 15% of the training data for validation and evaluate the predictions on the test set. For BABEL, the hidden dimension is tuned over {16, 32, 64, 128}. For CMAE, the weights of the adversarial loss and the reconstruction loss are chosen from {0.1, 1, 2.5, 5, 10}. For scMM, the hidden dimensions are tuned over {16, 32, 64, 128}. For ScMoGNN, the weight decay of the optimizer is tuned over {5 × 10⁻⁶, 1 × 10⁻⁵, 5 × 10⁻⁵, 1 × 10⁻⁴}.

3.4.2 Evaluation of Predictions

We evaluate the final protein-level prediction performance using Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). Meanwhile, because multimodal data usually suffer from batch effects and unbalanced measuring depth, the scale of counts in each cell may vary significantly, which substantially affects the RMSE and MAE metrics. Therefore, we also include the Pearson correlation coefficient (Corr), which is normalized by the mean and variance of the input on a per-cell basis, as a robust and scale-free metric to evaluate the predictions. A lower RMSE or MAE score indicates a geometrically closer estimation of the protein levels, while a higher Corr score suggests a statistically more similar match to the actual values. We report the mean and standard deviation of each metric across five different runs in Table 3.2, with the best performance highlighted in bold.

To answer the first question, we note that our scMoFormer consistently outperforms all baselines on all three metrics on both datasets, indicating that scMoFormer successfully captures the quantitative characteristics of the target protein levels given the input gene expression measurements. In particular, on the CITE dataset, scMoFormer achieves a significantly lower RMSE than the second-best model, ScMoGNN, by 0.04. More importantly, scMoFormer achieves a significant improvement in the Pearson correlation metric over all baselines, with noticeably lower performance variation across runs, indicating the stability of our model. We further analyze the performance of different models on the proteins that are least well captured by any model. Specifically, for each model, we compute the RMSE for each protein separately and identify the ten proteins with the highest average RMSE across all models. As shown in Figure 3.2, scMoFormer and ScMoGNN achieve relatively stable results and are consistently better than BABEL and CMAE.

Figure 3.2 Least well-predicted protein performance comparison across models.

3.4.3 Positional Encoding

As mentioned in Section 3.3.2.2, we implement Laplacian positional encoding (Dwivedi and Bresson, 2020) and random walk positional encoding (Dwivedi et al., 2022) to capture the positional information of the prior knowledge graph. For ease of notation, we use the abbreviation PE to refer to positional encoding.

Table 3.3 Prediction RMSE results of different positional encodings (score ± std).

                  CITE                GEX2ADT
Laplacian PE      1.63161 ± 0.01082   0.42025 ± 0.00243
Random Walk PE    1.63014 ± 0.01129   0.41987 ± 0.00234
w/o PE            1.62720 ± 0.00731   0.42202 ± 0.00399
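Both encodings can be computed directly from an adjacency matrix. As a point of reference, here is a minimal sketch of Eqs. (3.8)-(3.9), assuming a dense, symmetric adjacency matrix and numpy; the function names are illustrative rather than our exact implementation:

```python
import numpy as np

def laplacian_pe(A, k):
    """Laplacian PE: k smallest non-trivial eigenvectors of L = I - D^{-1/2} A D^{-1/2}."""
    deg = A.sum(axis=1).astype(float)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L)     # eigenvectors sorted by ascending eigenvalue
    return eigvecs[:, 1:k + 1]         # drop the trivial first eigenvector

def random_walk_pe(A, k):
    """Random walk PE: self-landing probabilities of 1..k-step walks, RW = A D^{-1}."""
    deg = A.sum(axis=0).astype(float)
    rw = A / np.where(deg > 0, deg, 1.0)[None, :]  # column-normalized random walk operator
    pe, power = [], np.eye(len(A))
    for _ in range(k):
        power = power @ rw             # RW^1, RW^2, ..., RW^k
        pe.append(np.diag(power))      # RW^t_{ii}: probability of returning to node i
    return np.stack(pe, axis=1)        # shape (num_nodes, k)
```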
To benchmark the impact of the two types of PE among two datasets and answer the second question, we show the performance of scMoFormer with each PE and compare them with the scenario without any PE. The mean and standard deviation of RMSE scores of five runs are shown in Table 3.3. According to the results, the influence of PE varies among datasets. The setting without PE reaches the best RMSE score on CITE dataset, while two types of PE both improve the performance on GEX2ADT dataset. Notice that in Table 3.1, the RNA zero rate of CITE dataset is significantly lower compared to the GEX2ADT dataset, providing the model with greater access to data-specific information. If the data contains sufficient information, then the neighborhood information from the GNNs alone is adequate and there is no need for the extra prior knowledge from the PEs. This is further supported by the observation that random walk PE performs better than Laplacian PE in both datasets. The Laplacian PE models global information by using the spectral information of the graph Laplacian, while the random walk PE encodes local information by accessing the landing probability of a 𝑘-step random walk. In cases where the prior knowledge may be noisy for downstream tasks, the local information alone is enough for predictions and the global structure becomes redundant. 3.4.4 Ablation Study Table 3.2 demonstrates that models that incorporate domain knowledge perform better in modality prediction compared to those that do not. BABEL, ScMoGNN, and scMoFormer are the three models that make use of domain knowledge, and they show improved performance compared to the other two models. Among these three models, ScMoGNN, which is based on GNNs, performs better than BABEL, while scMoFormer outperforms all other models with its combination of transformers and GNNs framework. Given that scMoFormer includes three transformers, this raises the questions: Which transformer has the biggest impact on performance? How much do the 43 transformers contribute to the improvement in performance? 3.4.4.1 Influence of Every Transformer The propose multimodal transformers consist of three different transformers, namely the cell transformer, gene transformer and protein transformers. As our predictions are based on the cell readout, it is expected that each of the three transformers will have different levels of impact on the performance. To quantify the specific impact of a single transformer, we conduct an experiment by removing the other two transformers and measuring the prediction RMSE scores. The results of the evaluation, including the scores of three partial models and the model with no transformers, are summarized in Figure 3.3. (a) CITE (b) GEX2ADT Figure 3.3 Prediction RMSE↓ results of keeping only one Transformer. The performance of the gene transformer and protein transformer is better than that of the cell transformer in the CITE dataset, while it is the opposite in the GEX2ADT dataset. This can be explained by the difference in RNA zero rate between the two datasets, as shown in Table 3.1. For the GEX2ADT dataset, the high RNA zero rate means less information, making the cell transformer crucial in increasing performance by drawing more information from the data. On the other hand, 44 Table 3.4 Prediction RMSE results of scMoFormer over GNNs (score ± std). 
the CITE dataset has a lower zero rate, meaning it provides more information, allowing the gene transformer and protein transformer to enhance the model by adding external biological knowledge.

3.4.4.2 How to Utilize Prior Knowledge

To answer the second question, we compare scMoFormer with two GNN-based models in Table 3.4. The model "GNN-prior" refers to GNNs built on the same graph as in Section 3.3.1, while the "GNN" model is constructed using only the cell-gene graph, without incorporating any prior information. The results show that incorporating prior knowledge into the graph yields a slight performance boost on both datasets. However, when the multimodal transformers are included, the performance improvement is much more pronounced. This highlights the usefulness of prior knowledge and the importance of using transformers to effectively incorporate this information into the model.

Table 3.4 Prediction RMSE results of scMoFormer over GNNs (score ± std).

             CITE                GEX2ADT
GNN          1.63071 ± 0.01081   0.43070 ± 0.00237
GNN-prior    1.63020 ± 0.00672   0.42731 ± 0.00138
scMoFormer   1.62720 ± 0.00731   0.41987 ± 0.00234

3.5 Related Work

In this section, we review works related to our proposed framework, including GNNs, transformers, and other deep learning methods in single-cell analysis.

There is a growing number of deep-learning-based methods for multimodal single-cell analysis in the community. For instance, scMDC (Lin et al., 2022) is an end-to-end autoencoder-based model with one encoder and two decoders. The encoder takes the concatenation of two modalities as input and then reconstructs the two modalities separately via two individual decoders. After training, the learned latent embedding is used for clustering analysis. DCCA (Zuo et al., 2021) learns a coordinated but distinct representation for each omics data type, with the representations mutually supervising each other on the basis of semantic similarity across embeddings, and then reconstructs each omics data type back to its original dimension via a dedicated decoder. Cross-modal Autoencoders (Yang et al., 2021) utilize multiple autoencoders to map different modalities onto the same latent space and incorporate prior knowledge through adversarial and paired anchor losses in the training process. BABEL (Wu et al., 2021b) consists of two neural-network-based encoders and two decoders for translation and reconstruction. Both Cross-modal Autoencoders and BABEL focus on multimodal translation by adding interoperability constraints to train multiple encoders and decoders. Another approach, scMM (Minoura et al., 2021), captures nonlinear latent structures with variational autoencoders. It exploits a mixture-of-experts framework with a deep generative model and attains end-to-end learning by modeling the raw counts of each modality. While these models have made significant advancements in multimodal integration, most of them are based on autoencoders and tend to overlook the underlying biological interactions of molecules and cells.

To capture the biological interactions of molecules and cells, an increasing number of GNN and transformer frameworks have been published in the field of single-cell analysis. One benefit of transformers applied to single-cell data is capturing long-range dependencies from a global view; another is interpreting biological phenomena via the attention mechanism. From the GNN perspective, graphs are a natural representation for all kinds of single-cell data, such as gene-to-gene graphs, cell-to-cell graphs, and cell-to-gene graphs.
Another benefit of GNNs is the ease of adding domain or prior knowledge into graphs, such as pathways between genes or overlaps between genes and peaks. For example, scGNN (Wang et al., 2021) models cell-cell interactions by incorporating GNNs with multi-modal autoencoders. Specifically, scGNN builds a cell graph by capturing cell-type-specific regulatory signals and utilizes a left-truncated mixture Gaussian model for scRNA-seq data analysis. GLUE (Cao and Gao, 2022) pre-trains modality-specific variational autoencoders to obtain cell embeddings and then encodes a knowledge-based graph with GNNs; the next step performs an adversarial multimodal alignment of the cells through an iterative optimization process. In addition, ScMoGNN (Wen et al., 2022a) models cell similarity and feature similarity by building a cell-feature graph and extracts information from the data with a graph encoder. ScMoGNN takes advantage of gene pathway data as prior knowledge to enhance the graph and denoise the data. Moreover, scBERT (Yang et al., 2022) follows the pre-training and fine-tuning paradigm of bidirectional encoder representations from transformers (BERT) for cell annotation of scRNA-seq data. The annotation process involves extracting high-level patterns of cell types from a reference dataset. Different from these approaches, which focus on single-modality data, we are the first to introduce transformers and GNNs to single-cell multimodal prediction.

3.6 Chapter Conclusion

In this chapter, we enhance our scMoGNN with a transformer architecture and present scMoFormer, a multimodal transformer model for predicting single-cell surface protein abundance from gene expression measurements. We combine the data with prior biological interaction knowledge from the STRING database into a richly connected heterogeneous graph and leverage transformer architectures to learn an accurate mapping between gene expression and surface protein abundance. Remarkably, scMoFormer achieves superior and more stable performance than other baselines on both the 2021 and 2022 NeurIPS single-cell competition datasets.

CHAPTER 4
TRANSFORMERS FOR SINGLE-CELL SPATIAL OMICS REPRESENTATION LEARNING

Spatially resolved transcriptomics brings exciting breakthroughs to single-cell analysis by providing physical locations along with gene expression. However, as a cost of the extremely high spatial resolution, cellular-level spatial transcriptomic data suffer significantly from missing values. While a standard solution is to impute the missing values, most existing methods either overlook spatial information or only incorporate localized spatial context, without the ability to capture long-range spatial information. Using multi-head self-attention mechanisms and positional encoding, transformer models can readily grasp the relationships between tokens and encode location information. In this chapter, by treating single cells as spatial tokens, we study how to leverage transformers to impute spatial transcriptomic data. In particular, we investigate the following two key questions: (1) how to encode the spatial information of cells in transformers, and (2) how to train a transformer for spatial transcriptomic imputation. By answering these two questions, we present a transformer-based imputation framework, SpaFormer, for cellular-level spatial transcriptomic data.
Extensive experiments demonstrate that SpaFormer outperforms existing state-of-the-art imputation algorithms on three large-scale datasets while maintaining superior computational efficiency.

4.1 Chapter Introduction

Spatial transcriptomic technologies have developed rapidly in recent years and emerged as next-generation tools for biomedical research. For instance, in-situ hybridization (ISH) based technology (Lubeck et al., 2014) produces detailed single-cell transcriptomic profiles along with the locations of cells within a tissue, yielding deeper insights into cell identity and functionality than ever. However, as the number of profiled genes increases, the required number of hybridization rounds also increases, which raises the potential for cumulative errors. As a result, ISH-based spatial transcriptomic data generally suffer from low mRNA capture efficiency and may miss a significant number of expressed genes, resulting in many false zero counts in the observed gene expression data.

An effective approach to mitigate this problem is to apply imputation methods that rectify the false zeros. There are numerous imputation methods for conventional transcriptomic data (i.e., scRNA-seq data). Nevertheless, those methods tend to yield suboptimal performance when imputing spatial transcriptomic data, as they do not leverage the available spatial information. Notably, the spatial locations of cells provide important information about cell-cell interactions as well as cell similarities (Wen et al., 2022b), which has great potential to advance the imputation process. For example, tumor cells usually show strong aggregation, and neighboring tumor cells with similar micro-environments in terms of ligands tend to have higher gene expression similarity than distant cells (Browaeys et al., 2020). To effectively utilize spatial information, graph neural networks have recently been utilized for spatial transcriptomic analysis (Li et al., 2022; Wang et al., 2022). Concretely, graph neural networks (GNNs) are applied to cell-cell neighboring graphs built from the spatial positions. However, these methods are limited to modeling a localized spatial context, which can be unfavorable for identifying long-range correlated cells. For example, Treg cells are spatially scarce in many tissues but still tend to be homologous and share similar gene expression profiles (Rudensky, 2011). Hence, it is desirable to capture cell interactions from broader contexts. To achieve this goal, we employ the transformer model (Vaswani et al., 2017) for the studied problem. The transformer model was originally designed for textual data. It uses multi-head self-attention mechanisms to model relationships between input tokens and utilizes positional encoding to represent the locations of tokens. Transformers are able to weigh the importance of each input token relative to all other tokens, rather than only adjacent ones as in GNNs. In our studied problem, by treating cells as tokens, we can readily apply the transformer model to capture long-range correlations between cells.

In this chapter, we investigate two key questions when applying transformers to spatial transcriptomic imputation: (1) how to encode the spatial information of cells in transformers, and (2) how to train a transformer for transcriptomic imputation. To answer the first question, one natural idea is to adopt the positional encodings of common transformers to encode spatial cellular coordinates.
However, the spatial coordinates of cells are continuous and irregular, which is essentially different from the discrete coordinates in the common practice of transformers. Therefore, dedicated efforts are needed to design positional encodings for spatial transcriptomics. To address this issue, we investigate existing positional encodings, compare their advantages and disadvantages, and conduct comprehensive experiments to obtain a best practice for spatial transcriptomics. To answer the second question, we generalize the well-studied imputation models for conventional transcriptomic data into a flexible autoencoder framework, where we adopt a transformer as the encoder. Furthermore, we propose a new bi-level masking technique that can be plugged into the general autoencoder framework. With the solutions to these two questions, our proposed framework, SpaFormer, consistently achieves outstanding imputation performance on three large-scale cellular-level spatial transcriptomic datasets.

4.2 Problem Statement

Before we introduce the notations and basic concepts, we first introduce the data we use. It is important to note that the focus of this chapter is on high-resolution cellular-level datasets (typically generated by ISH-based techniques), as opposed to the commonly studied spot-level datasets (e.g., 10X Visium, Seq-based data). In this work, we use two published datasets and one unpublished dataset produced by the Nanostring CosMx (He et al., 2022b) platform. To obtain the cellular-level gene expression, the CellPose (Stringer et al., 2021) software is applied to conduct cell segmentation. Figure 4.1 gives an example of the raw image data and how cells are segmented.

Figure 4.1 A sample image of protein, RNA molecules, and segmented cells: (a) molecular image; (b) cell segmentation. Colors in sub-figure (a) indicate the stained protein molecules. These proteins contribute to the cell segmentation process, which results in sub-figure (b). The final output of the pipeline consists of the position of each cell and a cell-by-gene count matrix.

After pre-processing, a typical single-cell spatial transcriptomic dataset comprises two essential components, i.e., the gene expression of cells and the corresponding spatial locations. We denote the gene expression data as a matrix $\mathbf{X} \in \mathbb{R}^{N \times k}$, where $N$ is the number of cells and $k$ is the number of genes. Hereby, $\mathbf{X}_{i,j}$ denotes the count of the $j$-th gene captured in the $i$-th cell. We use $\mathbf{C} \in \mathbb{R}^{N \times 2}$ to denote the two-dimensional coordinates of each cell, where the coordinates are based on the center position of each cell. Note that each dataset is composed of multiple fields-of-view (FOVs). Each FOV contains thousands of cells and might not be adjacent to the others. Thus, we process data in units of FOVs by default.

In the spatial transcriptomic imputation problem, we suppose that a part of the input values in $\mathbf{X}$ are missing, denoted by a mask matrix $\mathbf{M} \in \{0, 1\}^{N \times k}$, where the value of $\mathbf{X}_{i,j}$ can only be observed when $\mathbf{M}_{i,j} = 1$. A partially observed data matrix $\widetilde{\mathbf{X}}$ is defined as:

$$\widetilde{\mathbf{X}}_{i,j} = \begin{cases} 0 & \mathbf{M}_{i,j} = 0 \\ \mathbf{X}_{i,j} & \mathbf{M}_{i,j} = 1 \end{cases} \quad (4.1)$$

Our objective is to predict the missing values $\mathbf{X}_{i,j}$ at $\mathbf{M}_{i,j} = 0$, given the partially observed data $\widetilde{\mathbf{X}}$ and the spatial positions $\mathbf{C}$. Note that in this work we treat cells as tokens, so we use the two terms interchangeably in the remainder of the chapter.
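To make the imputation setup concrete, the following minimal sketch generates a random mask M and the partially observed matrix of Eq. (4.1); the masking rate and toy matrix sizes are illustrative, not the actual evaluation protocol:

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 5_000, 1_000                                    # toy numbers of cells and genes
X = rng.poisson(0.5, size=(N, k)).astype(float)        # stand-in for a raw count matrix

mask_rate = 0.3                                        # fraction of entries held out
M = (rng.random((N, k)) >= mask_rate).astype(np.int8)  # M_ij = 0 means unobserved
X_tilde = np.where(M == 1, X, 0.0)                     # partially observed matrix, Eq. (4.1)

# An imputation model receives (X_tilde, C) and is evaluated on its
# predictions at the held-out positions where M == 0.
```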
4.3 Encoding Spatial Information in Transformers

An essential component of our SpaFormer framework is the transformer (Vaswani et al., 2017) encoder, which was introduced in Section 3.3.2.1. When applying transformer models to spatial transcriptomic data, a critical challenge is how to encode spatial information. In a standard transformer, positional encodings (PEs) are added to the token embeddings to make use of the positional information of tokens. Different from Chapter 3, the positional encodings in this chapter mainly encode the spatial position of every single cell. The coordinates in spatial transcriptomics are continuous and irregular, which is essentially different from sequential data, image data, or graph data (as in Chapter 3). To tackle this issue, we investigate three general groups of positional encodings and demonstrate how we apply them to spatial transcriptomics. On top of that, we introduce two advanced model-based positional encodings that address the limitations of the previous positional encodings.

4.3.1 Patch-based Positional Encodings

We define patch-based positional encodings as encodings derived from discrete and regular patches. For example, ViT (Dosovitskiy et al., 2020) separates an image into 16 × 16 patches, so that the patches lie on a regular grid. Positional encodings generated from these patch coordinates are considered patch-based positional encodings. To apply this approach to spatial transcriptomic data, we segment a whole input region, a.k.a. a field-of-view (FOV), into regular patches. In this chapter, we segment each FOV into 100 × 100 patches, where each patch contains 0.3-0.55 cells on average due to the distinct cell sizes of different tissues. Based on the patch coordinates, there are generally two ways to produce positional encodings (PEs) for each patch, i.e., learnable PE and sinusoid PE. Learnable PE was introduced in ViT (Dosovitskiy et al., 2020), which learns a positional encoding for each individual position. Sinusoid PE was proposed in the vanilla transformer (Vaswani et al., 2017), originally designed for sequential data, but it can be extended to 2-dimensional space (Carion et al., 2020).

4.3.2 Coordinate-based Positional Encodings

Coordinate-based positional encodings aim to generate positional encodings directly from the continuous spatial coordinates. In order to improve generalizability, we normalize the spatial coordinates $\mathbf{C} \in \mathbb{R}^{N \times 2}$ to $[0, 1]$ by applying min-max normalization in each FOV. In this subsection, we discuss two specific types of positional encodings: naive PE and sinusoid PE. Naive PE projects the normalized spatial coordinates to the desired dimension $d$ via an MLP or another transformation. Compared to patch-based learnable PE, naive PE imposes the ordinal relation between coordinates. On the other hand, coordinate-based sinusoid PE replaces the discrete patch coordinates in the patch-based version with continuous spatial coordinates.

4.3.3 Graph-based Positional Encodings

Graph-based positional encodings are derived from the spatial adjacency graph that connects adjacent tokens. The motivation is that the spatial adjacency graph conveys the relative positional relations between tokens, so we can capture positional information by encoding the spatial adjacency graph. To this end, we construct a spatial graph in which cell pairs are connected when their Euclidean distance is smaller than 15-25 μm.
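A minimal sketch of this radius-graph construction is shown below, assuming scipy is available; the 15 μm threshold is one point in the stated range, and the function name is illustrative:

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.spatial import cKDTree

def spatial_adjacency(C, radius=15.0):
    """Connect every pair of cells whose centers are within `radius` micrometers."""
    tree = cKDTree(C)                                          # C: (N, 2) cell-center coordinates
    pairs = tree.query_pairs(r=radius, output_type="ndarray")  # unique pairs (i, j) with i < j
    rows = np.concatenate([pairs[:, 0], pairs[:, 1]])          # symmetrize: add both directions
    cols = np.concatenate([pairs[:, 1], pairs[:, 0]])
    data = np.ones(len(rows), dtype=np.float32)
    return coo_matrix((data, (rows, cols)), shape=(len(C), len(C))).tocsr()
```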
4.3.3 Graph-based Positional Encodings

Graph-based positional encodings are derived from the spatial adjacency graph that connects adjacent tokens. The motivation is that the spatial adjacency graph conveys the relative position relations between tokens, so we can capture the positional information by encoding the spatial adjacency graph. To this end, we construct a spatial graph in which cell pairs are connected when their Euclidean distance is smaller than 15–25 μm. As a result, each cell is connected to 4–6 cells on average, depending on the specific tissue type. The resulting adjacency matrix is denoted as A. Next, we present two ways to encode positional information from spatial adjacency graphs.

Random walk PE (RWPE), which derives from the landing probabilities of random walks, is proposed in LSPE (Dwivedi et al., 2021). This positional encoding is effective in encoding graph structures; however, it hardly encodes distance information. Furthermore, it is limited to structural information within the k-th order neighborhood and thus overlooks the global context.

Laplacian PE (LapPE) uses Laplacian eigenvectors as positional encodings (Kreuzer et al., 2021). Specifically, the graph Laplacian L is defined as L = I_n − D^{-1/2} A D^{-1/2} = U Λ U^T, where I_n is an identity matrix, Λ ∈ R^{k×k} and U ∈ R^{n×k} correspond to the eigenvalues and eigenvectors respectively, n is the number of nodes, and k is the number of top eigenvalues we select. The Laplacian eigenvectors constitute a local coordinate system that retains the overall structure of the graph. Thus, we can use the i-th row of the eigenvector matrix U as the positional encoding for node i (a.k.a. cell i) in the graph. LapPE is generally a good choice despite being limited by sign ambiguity (Kreuzer et al., 2021). A straightforward approach to this problem is to randomly flip the sign of the eigenvectors during training, forcing the model to be sign-invariant.
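A minimal sketch of LapPE with random sign flipping is given below, assuming the spatial adjacency graph is stored as a SciPy sparse matrix; the eigensolver settings are illustrative.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def laplacian_pe(A, k=16, flip_signs=True, seed=0):
    """Rows of the first k non-trivial eigenvectors of
    L = I - D^{-1/2} A D^{-1/2}, with random sign flips for sign invariance."""
    deg = np.asarray(A.sum(axis=1)).ravel()
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L = sp.eye(A.shape[0]) - sp.diags(d_inv_sqrt) @ A @ sp.diags(d_inv_sqrt)
    vals, vecs = eigsh(L, k=k + 1, which="SM")  # smallest eigenvalues first
    pe = vecs[:, 1:]                            # drop the trivial eigenvector
    if flip_signs:                              # re-drawn at every training step
        rng = np.random.default_rng(seed)
        pe = pe * rng.choice([-1.0, 1.0], size=(1, pe.shape[1]))
    return pe
```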
4.3.4 Model-based Positional Encodings

Although the aforementioned positional encodings can encode spatial information, each of them has certain limitations. Specifically, patch-based and coordinate-based positional encodings depend on the absolute location. However, studies have shown that relative positional encodings (Wu et al., 2021a; Luo et al., 2021), which consider pair-wise relationships, generally perform better in various domains. One key advantage of relative positional encodings is their translation-invariant property: positional encodings remain unchanged upon a global translation of the coordinates, thereby enhancing the generalizability of transformers. Fortunately, graph-based positional encodings also achieve translation invariance, since they are derived from the spatial adjacency graph, which is naturally based on invariant pair-wise distances. Nevertheless, RWPE cannot accurately encode pair-wise distances, while LapPE suffers from sign ambiguity. To overcome these limitations, we introduce two advanced models that generate positional encodings based on spatial adjacency graphs.

SignNet (Lim et al., 2022) proposes a sign-invariant and permutation-invariant network to learn positional encodings from the Laplacian eigenvectors of a target graph. The resulting positional encodings inherit the advantages of LapPE while not suffering from sign ambiguity.

Cond PE stands for conditional positional encodings, originally proposed in CPVT (Chu et al., 2021). In the original version, CPVT added a simple convolution layer before vision transformers to provide positions conditioned on the local neighborhood of each input token. We adopt this idea while substituting the visual convolution with a graph convolution, as it is more feasible to implement a graph convolution on a spatial graph than to directly convolve the sparsely positioned cell tokens in 2-dimensional space. Consequently, we apply a graph attention network (Velickovic et al., 2018) to the spatial adjacency graph to generate an initial embedding for each cell token. This embedding is considered a conditional positional encoding (Cond PE) that encodes spatial neighborhoods while achieving translation invariance.

Table 4.1 Comparison between positional encodings regarding the desired properties for spatial transcriptomics.

PE             Source              Distance Aware.  Global Effec.  Translation Invari.  Structure Aware.  Other Issues
Sinusoid PE    Patch / Coordinate  ✓                ✓              ✗                    ✗
Learnable PE   Patch               ✓                ✓              ✗                    ✗
Naive PE       Coordinate          ✓                ✓              ✗                    ✗
RWPE           Graph               ✗                ✗              ✓                    ✓
LapPE          Graph               ✓                ✓              ✓                    ✓                 Sign Ambiguity
Relative PE    Distance            ✓                ✓              ✓                    ✗                 Scalability
SignNet        Model               ✓                ✓              ✓                    ✓
Cond PE        Model               ✓                ✗              ✓                    ✓

4.3.5 Summary

We summarize all the aforementioned positional encodings in Table 4.1. We consider four main properties when comparing these methods, namely distance awareness, global effectiveness, translation invariance, and spatial structure awareness; these properties have been discussed in the preceding introduction. In conclusion, we investigate four types of positional encodings to capture spatial information in transformers, which possess different properties and mostly demonstrate good potential as positional encodings for spatial transcriptomics. We will apply distinct positional encodings in our SpaFormer framework in order to determine the best practices for positional encoding in spatial transcriptomics.

4.4 Proposed Transformer Framework

In this section, we introduce our SpaFormer framework. An overview of SpaFormer is illustrated in Figure 4.2. In SpaFormer, we first extract the spatial positional encoding for each cell, using the different methods discussed in Section 4.3. Then cell embeddings are initialized with gene expressions and positional encodings, while some cells are selectively masked depending on the specific setting. Next, a transformer encoder is applied to encode both cellular profiles and intercellular contexts into the latent space. Finally, a decoder reconstructs the input (or masked) information based on the latent variables. In the following, we first introduce our general framework, which generalizes popular transcriptomic imputation models, in Section 4.4.1. Based on the general framework, we then propose a new bi-level masking strategy in Section 4.4.2, which is particularly suitable for spatial transcriptomic data imputation.

Figure 4.2 An illustration of our transformer-based autoencoder framework for spatial transcriptomics data imputation.

4.4.1 Generalized Autoencoder Framework

The most popular architecture for deep-learning-based transcriptomic data imputation methods (Molho et al., 2024) is the autoencoder, due to its prevalence in data denoising and missing data applications. Existing methods introduced a few variants of autoencoders, including variational autoencoders (VAE) (Kingma and Welling, 2013; Lopez et al., 2018) and ZINB-based (zero-inflated negative binomial) autoencoders (Eraslan et al., 2019), though they often lack a systematic comparison of these autoencoder variants. In order to compare the performance of different autoencoders on the spatial transcriptomic imputation task, our SpaFormer framework generalizes all these variants of autoencoders. Our general framework takes a batch of cells as input. To be consistent with Section 4.2, we denote the input as X̃ ∈ R^{n×k}, where n is the number of cells in the input field-of-view (FOV). Note that here we omit the positional encodings, which should also be included in the input matrix. An encoder q_θ projects the input data to the latent space, resulting in Z ∈ R^{n×d}, where d is the latent dimension. A decoder p_φ then generates an output X̂ ∈ R^{n×k} from the latent space, where we expect X̂ to be identical to the input matrix X̃. The overall framework can be simply written as:

Z = q_θ(X̃),   (4.2)
X̂ = p_φ(Z).   (4.3)
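The skeleton below sketches Eqs. (4.2)–(4.3) with a transformer encoder and an MLP decoder. The layer sizes are placeholders, and the dense PyTorch transformer stands in for the efficient encoder used in our experiments.

```python
import torch
import torch.nn as nn

class SpaAutoencoder(nn.Module):
    """Minimal sketch of the generalized framework: q_theta = transformer over
    cell tokens, p_phi = MLP decoder (the asymmetric design of Section 4.4.1)."""
    def __init__(self, n_genes, d_model=128, n_layers=4, n_heads=4):
        super().__init__()
        self.in_proj = nn.Linear(n_genes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.decoder = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                     nn.Linear(d_model, n_genes))

    def forward(self, x, pe):              # x: (n, k) cells; pe: (n, d_model)
        tokens = self.in_proj(x) + pe      # initialize cell tokens with PE
        z = self.encoder(tokens.unsqueeze(0)).squeeze(0)  # one FOV as one set
        return self.decoder(z), z          # reconstruction and latent code
```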
Next, we proceed to introduce how we implement various autoencoder variants under the general framework.

Vanilla autoencoders. In the case of the vanilla autoencoder, we implement q_θ with a transformer and p_φ with a multi-layer perceptron (MLP). Note that we have selected an asymmetrical encoder-decoder architecture, as we believe that the model requires a greater ability to utilize spatial information for denoising during the encoding process; during decoding, the contextual information of each cell has already been included in the latent space, making decoding easier. The loss function for vanilla autoencoders can be written as:

L_MSE = ‖X̂ − X̃‖_F^2.   (4.4)

ZINB-based autoencoders. Previous studies (Eraslan et al., 2019; Lopez et al., 2018) pointed out that the distribution of transcriptomic data can be approximated by the zero-inflated negative binomial (ZINB) distribution, because the data are discrete, overdispersed, and contain many zero values. Therefore, we can adopt a ZINB-based autoencoder to leverage this prior information. A ZINB distribution is defined as:

NB(x | μ, θ) = Γ(x + θ) / (x! Γ(θ)) · (θ / (θ + μ))^θ · (μ / (θ + μ))^x,   (4.5)
ZINB(x | π, μ, θ) = π δ₀(x) + (1 − π) NB(x | μ, θ),   (4.6)

where μ and θ denote the mean and dispersion, π is the weight of the point mass at zero, and δ₀ generates constant 0. In our SpaFormer framework, to adapt a vanilla autoencoder into a ZINB-based autoencoder, the encoder q_θ remains unchanged, while a ZINB decoder replaces the MLP decoder p_φ. A ZINB decoder takes the latent code Z as input and generates an intermediate result Ĥ ∈ R^{n×d′} via an MLP. On top of that, ZINB decoders have three fully-connected output layers to estimate the parameters of the ZINB distribution. Let Π, M, and Θ be the three parameters π, μ, θ of the ZINB distribution estimated by the three output layers, respectively; the overall loss function of the ZINB-based autoencoder then changes to the negative log-likelihood of the ZINB distribution, formulated as:

L_ZINB = −log ZINB(X̃ | Π, M, Θ).   (4.7)

In the inference stage, the estimated mean matrix M is selected as the imputation output.

Variational autoencoders. VAEs have been widely applied in single-cell transcriptomic analysis (Lopez et al., 2018, 2019; Molho et al., 2024) since they are robust to technical noise and bias. In our SpaFormer framework, we optionally transform an autoencoder into a VAE by modifying the latent space and the loss function. Specifically, two additional fully-connected layers are appended to the encoder to estimate the mean and variance of the latent variables:

μ_z = q_θ(X̃) · W_{μz},   σ_z = exp( q_θ(X̃) · W_{σz} ),   (4.8)

where μ_z and σ_z are the mean and variance, respectively. The latent variables Z are then sampled from the estimated Gaussian distribution N(μ_z, σ_z), where we apply the reparametrization trick (Kingma and Welling, 2013) to keep gradient descent possible. Furthermore, an additional Kullback-Leibler (KL) divergence loss term is added to regularize the distribution of the latent variables:

L_KL = D_KL( N(μ_z, σ_z) ∥ N(0, 1) ).   (4.9)

The loss functions of the vanilla autoencoder and the ZINB-based autoencoder then turn into L = L_MSE + β·L_KL or L = L_ZINB + β·L_KL, respectively, where β is the KL loss weight. In the inference stage, we directly employ the estimated mean μ_z as the latent variables, instead of sampling.
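As a concrete reference for the ZINB-based variant above, Eq. (4.7) reduces to the negative log-likelihood sketched below; this is a standard DCA/scVI-style formulation, and the tensor names are illustrative.

```python
import torch

def zinb_nll(x, mu, theta, pi, eps=1e-8):
    """Negative log-likelihood of ZINB (Eqs. 4.5-4.7); x, mu, theta, pi are
    broadcastable tensors, with pi the zero-inflation weight."""
    log_nb = (torch.lgamma(x + theta) - torch.lgamma(theta)
              - torch.lgamma(x + 1)
              + theta * (torch.log(theta + eps) - torch.log(theta + mu + eps))
              + x * (torch.log(mu + eps) - torch.log(theta + mu + eps)))
    nonzero_case = torch.log(1 - pi + eps) + log_nb
    zero_nb = torch.pow(theta / (theta + mu + eps), theta)  # NB(0 | mu, theta)
    zero_case = torch.log(pi + (1 - pi) * zero_nb + eps)
    return -torch.where(x < eps, zero_case, nonzero_case).mean()
```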
4.4.2 Bi-level Masked Autoencoders

Inspired by the tremendous success of the masked autoencoding paradigm in NLP (Devlin et al., 2019) and computer vision (He et al., 2022a), we propose to incorporate a new masked autoencoder model into our generalized framework. Masked autoencoders are a form of the more general denoising autoencoders (Vincent et al., 2008), which adopt a simple concept: remove a proportion of the data and then learn to recover the removed parts. The idea of masked autoencoders is also natural and applicable in spatial transcriptomics, since we expect a powerful model to be able to recover missing cellular profiles from neighboring cells sharing the same microenvironment and from homologous cells with the same cell state. Note that the objective of the masked autoencoder is highly consistent with the process of data imputation. Hence, it is very promising that the model will be trained with the specific capability of imputing missing values.

Although masked autoencoders are highly suitable for training an imputation model, the characteristics of spatial transcriptomics add difficulties to their utilization. Compared to NLP and vision, it is much more difficult to recover cell profiles solely based on contexts. One reason is that the external signals a cell receives depend not only on the concentration of ligands in the microenvironment but also on the amount of receptors on its membrane. Therefore, it is hardly possible to identify intercellular correlations when the cell profile is completely masked out. To address this issue, we make a modification to the standard masked autoencoder, which we call a bi-level masking strategy. In a bi-level masking strategy, for each input batch of tokens, we first determine which cells are to be masked. Then, instead of thoroughly masking those tokens, we selectively mask out a certain proportion of features in them, leaving the potential for the model to recover underlying intercellular correlations. Specifically, we first sample a mask vector m̃_token ∈ {0, 1}^n from a Bernoulli distribution with p = θ, where n is the number of input tokens and θ is the probability that a token is selected to be masked. Next, M̃_feat ∈ {0, 1}^{n×k} is sampled from another Bernoulli distribution with p = γ, where γ is the probability that a feature is masked in a selected token. Lastly, we combine m̃_token and M̃_feat as:

M̃_{i,j} = 1 − m̃_token,i · M̃_feat,i,j,   (4.10)

where M̃ is the finalized mask matrix. The input features X̃_{i,j} are then masked where M̃_{i,j} = 0, resulting in a new feature matrix X̃′ ∈ R^{n×k} written as:

X̃′ = φ( M̃ ⊙ X̃ ),   (4.11)

where ⊙ indicates element-wise multiplication, and φ refers to a rescaling technique that keeps the mean of the input unchanged for each cell. Notably, when γ = 1, our masked autoencoder is equivalent to MAE for vision (He et al., 2022a); when θ = 1, it is equivalent to a denoising autoencoder (Vincent et al., 2008). Such a bi-level masking strategy provides our framework with greater flexibility. During training, the masked features X̃′ are used as the initial token embeddings; therefore, X̃′ is plugged into Eq. (4.2) to replace X̃. Note that our bi-level masking strategy fits neatly into the overall autoencoder framework, which means we can combine the masked autoencoder with the other components, i.e., VAE and ZINB decoders. When enabling the masked autoencoder, we change the reconstruction loss in Eq. (4.4) to:

L_MSE = ‖ (1 − M̃) ⊙ (X̂ − X̃) ‖_F^2,   (4.12)

where we only calculate the mean squared error (MSE) on the predictions of masked values. For inference, we simply disable the bi-level masking strategy; instead, the unmasked input X̃ is used, which provides extra information for imputation.
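A minimal sketch of the bi-level masking of Eqs. (4.10)–(4.11) follows; the per-cell mean rescaling is one plausible choice of φ, and the function name is illustrative.

```python
import torch

def bilevel_mask(x, theta=0.5, gamma=0.5, eps=1e-8):
    """Sample the bi-level mask: pick cells with prob. theta, then mask a
    fraction gamma of features within the picked cells (Eq. 4.10)."""
    n, k = x.shape
    m_token = torch.bernoulli(torch.full((n, 1), theta))  # 1 = cell selected
    m_feat = torch.bernoulli(torch.full((n, k), gamma))   # 1 = feature masked
    m = 1 - m_token * m_feat                              # 1 = entry kept
    x_masked = m * x
    # rescaling phi: keep each cell's mean expression unchanged (Eq. 4.11)
    scale = x.mean(dim=1, keepdim=True) / (x_masked.mean(dim=1, keepdim=True) + eps)
    return x_masked * scale, m
```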
4.5 Experiment

4.5.1 Experimental settings

Datasets. We validate the proposed approach on three spatial transcriptomic datasets generated by the CosMx platform (He et al., 2022b) from Nanostring, which can be accessed on their official website¹. These datasets differ in their scales and tissue sources, which highlights the comprehensiveness of our experiments. The data statistics are presented in Table 4.3.

¹The dataset can be downloaded from https://nanostring.com/products/cosmx-spatial-molecular-imager/nsclc-ffpe-dataset.

Table 4.3 Dataset statistics.

Dataset        Cell Num.   FOV Num.   Gene Num.   Zero Ratio
Lung 5         99,656      30         960         86.74%
Kidney 1139    61,283      18         960         83.49%
Liver Normal   305,730     244        1000        86.39%

Baselines. To evaluate the effectiveness of SpaFormer, we compare it with state-of-the-art spatial and non-spatial transcriptomic imputation models: scImpute (Li and Li, 2018), SAVER (Huang et al., 2018), scVI (Lopez et al., 2018), DCA (Eraslan et al., 2019), GraphSCI (Rao
et al., 2021), scGNN (Wang et al., 2021), gimVI (Lopez et al., 2019), Tangram (Biancalani et al., 2021), SpaGE (Abdelaal et al., 2020), Sprod (Li et al., 2022), and SEDR (Xu et al., 2024). The first six methods are designed for scRNA-seq data imputation; gimVI, Tangram, and SpaGE impute spatial transcriptomic data via external reference scRNA-seq data; Sprod and SEDR build cell-cell graphs among spatial neighbors to help imputation. In addition to these published baselines, we create a new baseline model, SpaGAT, which uses the same bi-level masked autoencoder framework as SpaFormer but is based on a graph neural network encoder with spatial graphs. Specifically, we implement a graph attention network (Velickovic et al., 2018) as the encoder. Since the graph attention network is a localized version of transformers, SpaGAT can be considered an ablation study for our SpaFormer model.

Implementation Settings. Before we conduct the experiment, we first randomly mask 30% of the data to create the partially observed data, while the original masked values are considered ground truth. All training and inference processes are then conducted on the partially observed data, and the dropped values are held out for evaluation. Based on the partially observed data, we further conduct preprocessing according to the recommended settings of each specific model. For our own SpaGAT and SpaFormer, we first normalize the total RNA counts of each cell and then apply a log1p transform. In addition, considering that the output format of the baseline models varies between raw counts and log-transformed values, we uniformly conduct postprocessing to make sure all imputed data and ground-truth data are normalized and log-transformed. By default, SpaGAT and SpaFormer adopt a bi-level masked autoencoder, and SpaFormer uses Cond PE as its positional encoding. Reproducibility details and codes for all methods can be found in our GitHub repository² as well as Appendix B.1.

²Codes and hyperparameter settings are available at https://github.com/wehos/CellT.

Evaluation. For imputation, the rooted mean squared error (RMSE), Pearson correlation coefficient (Pearson), and cosine similarity (Cosine) metrics are calculated based on the predictions of the masked values. For clustering, we first reduce the dimension of the imputed data with PCA and then construct a kNN graph based on the first 10 PCs. The clustering result is obtained from the Leiden algorithm on the kNN graph, where we conduct a grid search to find the optimal resolution for Leiden. The clustering metrics normalized mutual information (NMI) and adjusted Rand index (ARI) are then calculated based on the clustering results and the predefined cell type labels. Lastly, all deep learning models are evaluated with 5 random seeds and the average performance is reported, while statistical models are evaluated only once. The standard deviations are presented in Appendix B.2.
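For reference, the three imputation metrics are computed only on the held-out entries, along the lines of the sketch below (assuming dense NumPy arrays and a mask with 0 marking held-out values; the function name is illustrative).

```python
import numpy as np
from scipy.stats import pearsonr

def imputation_metrics(x_true, x_pred, m):
    """RMSE / Pearson / cosine restricted to the masked entries (m == 0)."""
    t, p = x_true[m == 0], x_pred[m == 0]
    rmse = np.sqrt(np.mean((t - p) ** 2))
    corr, _ = pearsonr(t, p)
    cosine = float(np.dot(t, p) / (np.linalg.norm(t) * np.linalg.norm(p) + 1e-8))
    return rmse, corr, cosine
```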
Table 4.2 Imputation performance on three CosMx datasets.

                           Lung 5                       Kidney 1139                  Liver Normal
Methods                    RMSE↓   Pearson↑ Cosine↑    RMSE↓   Pearson↑ Cosine↑    RMSE↓   Pearson↑ Cosine↑
Baseline
  Raw                      0.3758  -        -          0.3747  -        -          0.3507  -        -
scRNA-seq methods
  scImpute                 0.3245  0.444    0.5214     0.311   0.4824   0.5714     0.3048  0.4437   0.5074
  SAVER                    0.3213  0.4564   0.5269     0.3106  0.4887   0.5689     0.2909  0.5462   0.5864
  scVI                     0.2861  0.6231   0.6661     0.2901  0.5834   0.648      0.2797  0.5749   0.6224
  DCA                      0.2858  0.6223   0.6648     0.2852  0.5985   0.6597     0.2542  0.657    0.688
  GraphSCI                 0.3957  0.1334   0.3081     0.3624  0.2403   0.4128     0.3347  0.3707   0.443
  scGNN                    OOM*                        OOM*                        OOM*
Reference-based methods
  gimVI                    0.3170  0.5320   0.5917     0.4387  -0.0104  0.1967     0.4542  -0.0015  0.1170
  Tangram                  0.3905  0.3161   0.3883     0.3639  0.4037   0.4798     0.3373  0.4830   0.5236
  SpaGE                    0.4420  0.3942   0.4572     0.5096  0.4311   0.5065     0.6433  0.5483   0.5899
Spatial methods
  Sprod                    OOT**                       OOT**                       OOT**
  SEDR                     0.3245  0.4949   0.5499     0.3116  0.5101   0.5826     OOM*
Ours
  SpaGAT                   0.2865  0.6047   0.6518     0.2859  0.5852   0.6506     0.2241  0.7624   0.7829
  SpaFormer                0.2785  0.6363   0.6786     0.2794  0.6108   0.671      0.2117  0.7793   0.7973

*Ran out of 300G CPU memory. **Ran for more than 48 hours on 128 CPUs.

4.5.2 Imputation Performance

In Table 4.2, we present the experimental results. Our SpaFormer consistently outperforms all baselines in imputation performance. Aside from that, there are several interesting observations. (1) scVI and DCA consistently present near-optimal performance, outperforming some more sophisticated methods, e.g., GraphSCI. Since both DCA and scVI adopt ZINB-based autoencoders, this demonstrates the effectiveness of ZINB-based autoencoders. (2) Reference-based methods, i.e., gimVI, Tangram, and SpaGE, generally do not work well on our highly noisy spatial transcriptomic datasets. These methods map spatial transcriptomic data to scRNA-seq reference datasets and impute the missing values based on the reference data. However, single-cell spatial transcriptomic data has a substantially low capture rate of RNA molecules, which causes distinct distributions between spatial transcriptomic data and reference scRNA-seq data. These results indicate the limitation of reference-based imputation approaches. (3) SpaGAT, the non-transformer version of SpaFormer, presents fairly good performance. This demonstrates the advantage of our bi-level masked-autoencoder framework. (4) Recent methods specifically designed for spatial transcriptomics either suffer from scalability issues or demonstrate suboptimal performance. Therefore, there remains significant space for exploration in the field of single-cell spatial transcriptomic imputation.

4.5.3 Scalability Analysis

Scalability is critical when deploying methods on single-cell spatial transcriptomic data, since the datasets often contain tens of thousands of cells, so we considered it an important factor in our evaluation. Specifically, we set limits on the runtime and memory consumption of each model, i.e., 48 hours of running time, 300G of CPU memory, and 45G of GPU memory. Notably, we encountered scalability issues with the scGNN and Sprod methods even on the smallest Kidney dataset. scGNN and SEDR train neural networks on GPUs; however, both of them may run out of 300G CPU memory when building the cell-cell graph. Sprod separates data into numerous batches and runs purely on CPUs; we allocated 128 CPU processors for it, yet it failed to finish running within 48 hours. In contrast, our proposed SpaFormer exhibits linear complexity w.r.t. the number of cells, thanks to our efficient autoencoder architecture and the Performer (Choromanski et al., 2021) encoder.

In addition, we present the empirical computational overhead of all methods on the largest Liver dataset in Table 4.4. Specifically, we record the running time, peak CPU memory, and peak GPU memory. Peak CPU memory refers to the physical CPU memory consumption; peak GPU memory refers to the peak consumption of GPU memory resources; running time refers to the training and inference for the imputation, excluding preprocessing time. The experiments are conducted on an RTX 8000 GPU card and 128 AMD EPYC CPU cores with 300G CPU memory. Each setting is repeated 4 times, and the average running time is reported. Our SpaFormer method demonstrates superior computational efficiency, especially in running time.

4.5.4 Clustering Performance

In addition to imputation performance, we conduct unsupervised clustering on the imputed data to validate whether the imputation can help recover cell type information. For evaluation, the cell type labels accompanying the dataset are considered ground truth. It is expected that well-imputed data can better recover the cell clustering structures. As shown in Figure 4.3, although all models enhance the clustering performance compared to the unimputed raw data, clustering based on SpaFormer achieves the best performance.

Figure 4.3 Clustering performance on imputed data of the Lung dataset. (ARI/NMI: Clean 0.641/0.599, SpaFormer 0.659/0.626, DCA 0.648/0.585, gimVI 0.628/0.604, SpaGE 0.540/0.483, scVI 0.622/0.575, SAVER 0.402/0.235, GraphSCI 0.402/0.236, scImpute 0.460/0.405, SEDR 0.366/0.252, Tangram 0.335/0.236, Raw 0.228/0.096.)

Table 4.4 Computational overhead on the Liver dataset. Methods are ranked by running time.
Methods     Running Time (hour)   Peak CPU Memory (Mb)   Peak GPU Memory (Mb)
SpaFormer   0.068                 6,612                  1,678
scVI        0.119                 3,511                  254
SpaGE       0.179                 12,493                 N/A
DCA         0.236                 7,508                  41,850
Tangram     0.551                 4,507                  17,012
gimVI       0.971                 6,600                  250
SAVER       1.975                 245,760                N/A
scImpute    2.917                 269,490                N/A
GraphSCI    9.592                 18,329                 N/A
scGNN       N/A                   >307,200               N/A
Sprod       >48                   N/A                    N/A
SEDR        N/A                   >307,200               N/A

In addition, an interesting observation is that although reference-based methods (i.e., SpaGE and gimVI) do not achieve impressive performance on the imputation metrics, their clustering performance is relatively better. This indicates that they may have succeeded in discovering reference cells from scRNA-seq data; however, they still suffer from the distribution gap between the reference and query data.

4.5.5 Ablation Study

4.5.5.1 Positional Encodings

As mentioned in Section 4.3.5, we conduct experiments on three datasets to compare the performance of different positional encodings. According to our experiments in Figure 4.4, Cond PE and LapPE outperform the other positional encodings w.r.t. overall performance on the three datasets, indicating the effectiveness of graph-based PEs. Meanwhile, Cond PE outperforms SignNet, suggesting that global effectiveness might not be a desired property for spatial transcriptomics, as discussed in Section 4.3.5. In conclusion, we select Cond PE as the default setting in SpaFormer. Besides, a 2-dimensional sinusoid PE can also be a good choice for spatial transcriptomics since it has fewer parameters than the others. Meanwhile, the gap between transformers with and without positional encodings is less significant than we expected. Therefore, there is still large room for exploration in positional encodings for spatial transcriptomic data.

4.5.5.2 Autoencoder Frameworks

Another highlight of our SpaFormer framework is that it generalizes different popular autoencoder-based models. Based on our general framework, we conduct a thorough ablation study on the Liver dataset. The experimental results in Figure 4.5 indicate that the various autoencoder variants achieve optimal performance when combined with masking, and the simple bi-level masked autoencoder obtains the best results. Therefore, we recommend the masked autoencoder as the default setting for our SpaFormer framework.

Figure 4.4 Comparison between different positional encodings on three datasets. Values indicate the Pearson correlation coefficient. (None / SignNet / Learnable PE / Naïve PE / Sinusoid / LapPE / Cond PE — Kidney: 0.607, 0.607, 0.607, 0.611, 0.611, 0.611, 0.611; Lung: 0.608, 0.624, 0.628, 0.623, 0.634, 0.635, 0.636; Liver: 0.772, 0.774, 0.771, 0.777, 0.776, 0.776, 0.779.)

Figure 4.5 Ablation study on different autoencoder variants, i.e., vanilla autoencoders (AE), variational autoencoders (VAE), bi-level masked autoencoder (MVAE), and masked autoencoder (MAE). For each variant, we implement both MLP and ZINB decoders. Values indicate the Pearson correlation coefficient on the Liver dataset. (AE-MLP 0.288, AE-ZINB 0.391, VAE-MLP 0.503, VAE-ZINB 0.771, MVAE-MLP 0.768, MVAE-ZINB 0.760, MAE-MLP 0.780, MAE-ZINB 0.775.)

4.5.5.3 Mask Ratio

Lastly, we conduct a parameter analysis on the token mask ratio θ and the feature mask ratio γ from Section 4.4.2 to demonstrate the importance of our bi-level masking strategy. As shown in Figure 4.6, jointly tuning θ and γ results in an optimal Pearson correlation score. In our main experiments, we select θ = 0.5, γ = 0.5 as the default parameters, since this setting consistently achieves optimal performance on the three datasets.
Figure 4.6 Parameter analysis on θ and γ. Values indicate the Pearson correlation coefficient on the Lung dataset.

4.6 Related Work

The increased resolution of transcriptomic profiling methods comes at the cost of increased data sparsity. The profiling technology may fail to capture a number of the expressed genes of an individual cell, due to the low amounts of mRNA in individual cells and a low capture rate. A popular way to address this issue is to perform imputation, which aims to correct false zeros by estimating realistic values for those gene-cell pairs. A large number of methods have been developed for the task of scRNA-seq data imputation, mainly focusing on generative probability models or matrix factorization (Gong et al., 2018; Huang et al., 2018; Ronen and Akalin, 2018; van Dijk et al., 2017). Aside from these methods, deep learning models have gained immense popularity in recent years. A natural deep learning architecture for the imputation task is the autoencoder, due to its prevalence in data denoising and missing data applications (Beaulieu-Jones et al., 2017; Boquet et al., 2019; Gondara and Wang, 2017, 2018; Pereira et al., 2020). To leverage the spatial information of spatial transcriptomic data, the latest method, Sprod (Wang et al., 2022), constructs a graph by connecting cells with similar transcriptomic profiles and prunes it with physical distance; a denoised matrix is then learned by minimizing a reconstruction error and a graph Laplacian smoothing term. However, this graph only connects cells that are spatially close to each other. As a result, it is limited to a localized spatial context and has difficulty identifying cells with long-range correlations. Different from the aforementioned methods, we propose to utilize transformers for spatial transcriptomic imputation to globally model cell-cell interactions while exploiting the spatial information.

4.7 Chapter Conclusion

In this chapter, we comprehensively investigate two key questions related to implementing transformer models for spatial transcriptomic imputation at the cellular level. By answering these two questions, we propose a transformer-based general imputation framework, SpaFormer, for spatial transcriptomics data. In addition, we propose a new bi-level masking technique, which can be incorporated into general autoencoder frameworks. Our method outperforms all existing methods in recovering missing values and clustering structures. This result strongly suggests the advantage of transformers in spatial omics data analysis.

CHAPTER 5
BUILDING SINGLE-CELL FOUNDATION MODEL BEYOND SINGLE CELLS

Although the proposed methods in Chapters 3 and 4 are effective in handling multimodal data, the knowledge learned by these models is not transferable across tasks and datasets. To address this issue, there is an emerging effort (Yang et al., 2022; Gong et al., 2023; Shen et al., 2023; Cui et al., 2023; Theodoris et al., 2023) from the research community to explore the potential of a foundation model that first extracts latent knowledge from unlabeled data and subsequently generalizes this knowledge to a variety of tasks.
The current state-of-the-art single-cell pre-trained models are greatly inspired by the success of large language models. They train transformers by treating genes as tokens and cells as sentences. However, three fundamental differences between single-cell data and natural language data are overlooked: (1) scRNA-seq data are presented as bags of genes instead of sequences of RNAs; (2) cell-cell relations are more intricate and important than inter-sentence relations; and (3) the quantity of single-cell data is considerably smaller than that of text data, and the data are very noisy. In light of these characteristics, we propose a new pre-trained model, CellPLM, which takes cells as tokens and tissues as sentences. In addition, we leverage spatially-resolved transcriptomic data in pre-training to facilitate learning cell-cell relationships, and we introduce a Gaussian mixture prior distribution as an additional inductive bias to overcome the data limitations. CellPLM is the first single-cell foundation model that encodes cell-cell relations, and it consistently outperforms existing foundation models in diverse downstream tasks, with 500× higher inference speed in generating cell embeddings than existing pre-trained models.

5.1 Chapter Introduction

Next-generation sequencing technologies such as single-cell RNA sequencing (scRNA-seq) have produced vast amounts of data, sparking a surge of interest in developing large-scale pre-trained models for single-cell analysis (Yang et al., 2022; Gong et al., 2023; Shen et al., 2023; Cui et al., 2023; Theodoris et al., 2023). These models seek to capture underlying structures and patterns from unlabeled scRNA-seq data, and can be fine-tuned on specific downstream datasets to deliver accurate predictions and nuanced insights into cellular mechanisms. In particular, these pre-trained models have been inspired by the success of large language models, such as BERT and GPT (Devlin et al., 2019; Bubeck et al., 2023), and treat genes as words (tokens) and cells as sentences to train transformers (Vaswani et al., 2017). However, we argue that these approaches may have limitations due to the fundamental differences between single-cell data and natural language data, which have been largely overlooked in the existing literature.

First, unlike sentences, the scRNA-seq data utilized by existing pre-trained models are not sequential. Before the training stage, RNA sequences have already been identified as functional units, i.e., genes. Instead of original sequences, the data are denoted as a cell-by-gene count matrix that measures the abundance of individual genes within each cell. This is analogous to the bag-of-words model in natural language, where the set of genes is fixed and there is no sequential relationship among them.

Second, the relationships between cells are remarkably more intricate and important than those between sentences, since cell-cell communications play an essential role in determining cell states and cell development (Armingol et al., 2021). Additionally, within tissues, there are numerous cells from the same or similar cell lineages, which grants them similar gene expression profiles and hence provides valuable supplementary information for denoising and identifying cell states (Cannoodt et al., 2016; Molho et al., 2024; Street et al., 2018). As a result, many recent methods (Wang et al., 2021; Shao et al., 2022; Xu et al., 2023; Wen et al., 2023) have constructed cell-cell graphs to advance representation learning for single-cell data.
Such evidence demonstrates the importance of the cell-cell relationship, which is neglected by existing pre-trained models.

Third, the quantity and quality of single-cell datasets are significantly lower than those of natural language data. For comparison, the high-quality filtered English dataset extracted from the Common Crawl corpora (Wenzek et al., 2020) consists of 32 billion sentences, whereas the largest collection of single-cell datasets, namely the Human Cell Atlas (Regev et al., 2017), includes fewer than 50 million cells. To make matters worse, single-cell data often suffer from technical artifacts and dropout events (Svensson et al., 2017; Qiu, 2020), as well as significant batch effects between sequencing platforms and experiments (Tran et al., 2020; Argelaguet et al., 2021).

The aforementioned differences introduce distinct challenges that call for new pre-training strategies tailored to single-cell data. To bridge this gap, we propose a novel single-Cell Pre-trained Language Model (CellPLM), which addresses these challenges from the following perspectives. First, as shown in Figure 5.1, CellPLM proposes a cell language model to account for cell-cell relations. The cell embeddings are initialized by aggregating gene embeddings, since gene expressions are bag-of-words features. Second, CellPLM leverages a new type of data, spatially-resolved transcriptomic (SRT) data, to gain an additional reference for uncovering cell-cell interactions. Compared to scRNA-seq data, SRT data provide additional positional information for cells. Both types of data are jointly modeled by transformers. Third, CellPLM introduces an inductive bias to overcome the limitations of data quantity and quality by utilizing a Gaussian mixture model as the prior distribution in the latent space. This design can lead to smoother and better cell latent representations (Grønbech et al., 2020; Xu et al., 2023; Jiang et al., 2023). To the best of our knowledge, the proposed CellPLM is the first pre-trained transformer framework that encodes inter-cell relations, leverages spatially-resolved transcriptomic data, and adopts a reasonable prior distribution. It is evident from our experiments that CellPLM consistently outperforms both pre-trained and non-pre-trained methods across five distinct downstream tasks, with 500× higher inference speed in generating cell embeddings compared to existing pre-trained models.

5.2 Cell Language Model Beyond Single Cells

In this section, we introduce the concept of cell language models and the detailed implementation of the proposed CellPLM. As illustrated in Figure 5.2, CellPLM consists of four modules: a gene expression embedder, an encoder, a latent space, and a decoder, which we describe in Section 5.2.2. At a higher level, there are two stages in our framework: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data with a masked language modeling objective. For fine-tuning, the model is first initialized with the pre-trained parameters, and then all of the parameters are fine-tuned using data and labels (if available) from the downstream datasets. We present the pre-training and fine-tuning frameworks in Sections 5.2.3 and 5.2.3.2, respectively.

Figure 5.1 An illustration of the difference in the language models between existing single-cell pre-trained models and CellPLM.
Existing pre-trained models only consider the conditional probability between gene expressions within the same cell, while in CellPLM, the gene expression distribution is also conditioned on other cells. See details in Section 5.2.

5.2.1 Cell Language Model

Due to the recent achievements of large language models (Bubeck et al., 2023), several studies have drawn inspiration from natural language processing in an attempt to establish a foundation model for single-cell analysis. These studies consider genes as tokens and train transformers on them, aiming to model the conditional probability between gene expressions. Concretely, previous pre-trained models are trained on scRNA-seq data, which are stored in the format of a cell-by-gene matrix X ∈ R^{N×k}, where N is the number of cells and k is the number of distinct gene types. The value of X_{i,j} denotes the count of gene j observed in cell i, also known as gene expression. The pre-training goal of these models is to estimate a conditional probability distribution, which can be formulated as:

p( X_{i,j} | {X_{i,o}}_{o∈O(i)} ),  j ∈ U(i),   (5.1)

where i refers to the i-th cell and O(i) is the set of observed genes in cell i whose expressions are known; U(i) denotes the set of unobserved genes in cell i whose expressions will be predicted by the model, typically referred to as masked genes. If we consider genes as words, this objective is analogous to the language model in computational linguistics (Bengio et al., 2000) and thus can be named a “gene language model”. In this way, the model is trained to capture the intrinsic relations between genes, which can provide prior knowledge for downstream analysis.

However, in Eq. (5.1), the distribution of unobserved gene expressions only depends on genes within the same cell, disregarding the information of other cells within the same tissue, which does not align with the inherent nature of biology. Therefore, in CellPLM, we provide a different perspective to model scRNA-seq data by treating cells as tokens:

p( X_{i,j} | {X_{u,v}}_{(u,v)∈M^C} ),  (i, j) ∈ M,   (5.2)

where we denote M as the set of masked gene expressions in X, and M^C is its complement, i.e., the set of unmasked expressions. The distribution of a masked entry X_{i,j} depends on both the observed genes in cell i and the genes from other cells that are not masked. We hereby name this a “cell language model”, which models the distribution of cellular features beyond single cells. By estimating the conditional probability distribution in Eq. (5.2), CellPLM is trained to capture the intricate relationships that exist between not only genes but also cells. From a biology perspective, there are particularly two types of inter-cell relations that can be beneficial to CellPLM. First, within tissues, there are numerous cells from the same or similar cell lineages, which mutually provide valuable supplementary information for denoising and identifying cell states (Cannoodt et al., 2016; Molho et al., 2024; Street et al., 2018). The other type of relation, cell-cell communication, plays an essential role in determining cell development and cell states (Armingol et al., 2021).
Existing analysis methods (Hou et al., 2020; Jin et al., 2021; Raredon et al., 2019) have already explored cell-cell communications at the cell type or cluster level, while CellPLM aims to capture the intricate “language” of cell-cell communications between single cells. Overall, CellPLM presents a novel cell language model that aligns well with biological principles and holds great potential to enhance downstream tasks by extracting valuable cellular knowledge from unlabeled single-cell data.

5.2.2 Model Architecture

5.2.2.1 Gene Expression Embedder

The first module in the CellPLM model is a gene expression embedder, which projects input gene expressions into a low-dimensional cellular feature space. In light of the fact that scRNA-seq is profiled as bag-of-genes features, CellPLM learns an embedding vector for each type of gene and then aggregates these gene embeddings according to their expression levels in each cell. Formally speaking, for gene j ∈ {1, ..., k}, a randomly-initialized learnable embedding vector h_j ∈ R^d is assigned, where d is the hidden dimension of the encoder layers. The gene expression embedding matrix E ∈ R^{N×d} is then generated by aggregating the gene embeddings according to their expressions:

E_i = Σ_{j=1}^{k} X_{i,j} h_j,   (5.3)

where E_i is the i-th row vector of E, corresponding to the gene expression embedding of cell i. Note that the gene expression matrix X is a sparse matrix, since the zero-rate of scRNA-seq can be up to 90% (Jiang et al., 2022). In addition, unmeasured genes (per sequencing platform) also lead to zero entries in X. Therefore, when implementing Eq. (5.3), CellPLM leverages sparse operations, which significantly improves memory and computational efficiency. In addition, following convention (Stuart et al., 2019), we preprocess X with library size normalization and log1p transformation before inputting it to the model.

Figure 5.2 An illustration of the pre-training framework of CellPLM. CellPLM is pre-trained with a cell-level masked language modeling task. The model consists of four modules: a gene expression embedder, a transformer encoder, a Gaussian mixture latent space, and a batch-aware decoder.

5.2.2.2 Transformer Encoder

The proposed CellPLM follows an encoder-decoder structure, where the encoder is based on transformers (Vaswani et al., 2017). The transformer model was originally developed for processing textual data. It leverages multi-head self-attention mechanisms to capture relationships between input tokens and incorporates positional encodings to represent the token positions. In CellPLM, by considering cells as tokens, we can readily apply the transformer model to capture intercellular relationships. When applying the transformer, we consider the embedding at the l-th layer, H^{(l)} ∈ R^{N×d}, as a set of N tokens, where N is the total number of cells in a tissue sample and d is the hidden dimension. By stacking L transformer layers, CellPLM gradually encodes cellular and inter-cellular information into cell embeddings, formulated as:

H^{(l)} = TransformerLayer^{(l)}( H^{(l−1)} ).   (5.4)
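A minimal sketch of the expression-weighted aggregation in Eq. (5.3) follows, using a sparse matrix product; the class name and initialization scale are illustrative, and the input is assumed to be a torch sparse tensor of normalized expressions.

```python
import torch
import torch.nn as nn

class GeneExpressionEmbedder(nn.Module):
    """Eq. (5.3): cell embedding = expression-weighted sum of gene embeddings."""
    def __init__(self, n_genes, d_model):
        super().__init__()
        self.gene_emb = nn.Parameter(torch.randn(n_genes, d_model) * 0.02)

    def forward(self, x_sparse):
        # x_sparse: (N, k) torch sparse tensor; sparse ops skip zero entries
        return torch.sparse.mm(x_sparse, self.gene_emb)  # (N, d_model)
```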
In practice, N can scale up to ten thousand, which is beyond the capacity of an ordinary transformer. Therefore, we adopt an efficient variant of transformers with linear complexity (i.e., Flowformer (Wu et al., 2022)) for the implementation of the transformer layers. To further inform inter-cellular relations, we incorporate the spatial positional information of individual cells from a novel type of data, spatially-resolved transcriptomic (SRT) data. Specifically, SRT data consist of two parts: one is a gene expression matrix X ∈ R^{N×k}, the same as for scRNA-seq data; the other is a 2D coordinate matrix C ∈ R^{N×2}. The coordinates denote the center position of each cell within the field-of-view (FOV) where the cells are located (an illustration can be found in Appendix C.1). This feature helps locate the microenvironment surrounding each cell, providing an additional reference for identifying cell lineages and cell communications, which were introduced in Section 5.2.1. To encode this extra positional information, we leverage the idea of positional encodings (PE) in transformers.

5.2.2.3 Gaussian Mixture Latent Space

One of the highlights of CellPLM is the design of its probabilistic latent space. Prior studies have employed variational autoencoders for single-cell analysis, typically assuming an isotropic Gaussian distribution as the prior distribution of the latent space (Lopez et al., 2018; Xu et al., 2021). While this approach can effectively remove batch effects, it may also result in a loss of information regarding the underlying biological structure of cell groups. To address this limitation, CellPLM incorporates the concept of the Gaussian mixture variational encoder (Dilokthanakul et al., 2016; Yang et al., 2019; Xu et al., 2023), which utilizes a mixture of Gaussians to capture the information of distinct functional groups of cells. Formally, for i ∈ {1, ..., N}, the generative model of cell i can be formulated as:

p(y_i; π) = Multinomial(π),
p(z_i | y_i) = ∏_{l=1}^{L} N( μ_{y_{i,l}}, diag(σ²_{y_{i,l}}) ),
p_{θ_dec}(x_i | z_i) = N( μ_{z_i}, σ²I ),   (5.5)

where y_i ∈ R^L represents the one-hot latent cluster variable and π is its prior; y_{i,l} denotes the l-th entry of y_i; μ_{y_l} ∈ R^{d_z} and σ²_{y_l} ∈ R^{d_z×d_z} denote the mean and variance of the l-th Gaussian component, respectively; and μ_{z_i} ∈ R^k and σ²I ∈ R^{k×k} denote the posterior mean and variance of the expression x_i, respectively. In this work, we assume that σ² is a constant, and the posterior mean is parameterized by μ_{z_i} = f_dec(z_i; θ_dec). To estimate the posteriors of z_i and y_i, we parameterize the inference process with neural networks, which is detailed in Appendix C.3. On top of that, a log-evidence lower bound (ELBO) can be derived from this generative model for optimization purposes (Dilokthanakul et al., 2016). However, as mentioned in Section 5.2.1, our pre-training framework incorporates a cell language model, where parts of the input gene expression matrix X are masked. This results in a modified objective. To formalize the problem, recall that we previously defined the masked set as M. On top of that, we denote by M ∈ R^{N×k} a mask indicator matrix such that

M_{i,j} = 1 if (i, j) ∉ M,  and  M_{i,j} = 0 if (i, j) ∈ M.

Let X̃ ∈ R^{N×k} be the masked gene expression matrix given by the element-wise multiplication X̃ = M ⊙ X. The objective of the cell language model with a Gaussian mixture prior, i.e., a denoising
The objective of cell language model with Gaussian mixture prior, i.e., a denoising 76 variational lower bound (Im Im et al., 2017), can be formulated as: LCellLM =E𝑞(Z,Y| ˜X) E𝑝( ˜X|X) ln (cid:34) (cid:35) 𝑝𝜃 (X, Z, Y) 𝑞𝜂 (Z, Y | ˜X) (5.6) = E𝑞 𝜂𝑒𝑛𝑐 (Z| ˜X) E𝑝( ˜X|X) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:124) (cid:2)log 𝑝𝜃𝑑𝑒𝑐 (X | Z)(cid:3) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) (cid:125) Lrecon − E𝑞 𝜂 𝜋 (Y|Z) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:2)KL (cid:0)𝑞𝜂𝑒𝑛𝑐 (Z | ˜X) ∥ 𝑝(Z | Y)(cid:1) (cid:3) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) (cid:124) (cid:123)(cid:122) Lcond − E𝑞 𝜂𝑒𝑛𝑐 (Z| ˜X) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:2)KL (cid:0)𝑞𝜂 𝜋 (Y | Z) ∥ 𝑝(Y)(cid:1)(cid:3) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) . (cid:124) (cid:123)(cid:122) LY Similar to previous works (Dilokthanakul et al., 2016), we refer to the three terms in Eq. (5.6) as reconstruction term Lrecon, conditional prior term Lcond and Y prior term LY. The approximation and estimation of the denoising variational lower bound are specified in Section 5.2.3. 5.2.2.4 Batch-aware Decoder The decoder in CellPLM operates by decoding each cell individually, given that the tissue context has already been encoded into the latent space by the encoder. The decoder’s purpose is twofold: to reconstruct masked features and to help remove batch effects from the latent space. In order to accomplish this goal, the decoder stacks several feed-forward layers (FFLayers) atop the input of latent variables z, and a batch embedding, denoted as b ∈ R 𝑑𝑧 . 
5.2.2.4 Batch-aware Decoder

The decoder in CellPLM operates by decoding each cell individually, given that the tissue context has already been encoded into the latent space by the encoder. The decoder's purpose is twofold: to reconstruct the masked features and to help remove batch effects from the latent space. To accomplish this goal, the decoder stacks several feed-forward layers (FFLayers) on top of the latent variables z and a batch embedding, denoted as b ∈ R^{d_z}. Specifically, for each cell, the batch embedding is loaded from a learnable lookup table as b = LookUp(b), where the (scalar) b is the label indicating the specific tissue sample (or FOV for SRT data) from which the cell has been drawn. By feeding the batch label to the decoder, a batch-effect-free latent space can be achieved, as empirically evidenced in scVI (Lopez et al., 2018). The decoder can thus be formulated as:

h^{(0)} = z + b,
h^{(l)} = FFLayer^{(l)}( h^{(l−1)} ),

where l indicates the layer number, h^{(l)} is the hidden vector of layer l ∈ {1, ..., L−1}, and L is the total number of fully-connected layers. The dimension of the last layer differs from the previous layers because the last layer is an output layer, with h^{(L)} ∈ R^k, where k is the size of the gene set in the gene expression matrix X ∈ R^{N×k}.

5.2.3 Model Pre-training and Fine-tuning

5.2.3.1 Pre-training

The pre-training of CellPLM follows a cell language modeling objective, as demonstrated in Eq. (5.6). Specifically, given a batch of cell tokens as input, we first decide which cells should be masked. Instead of completely masking these cell tokens, we selectively mask a certain percentage of the gene expressions within them. This allows the model to recover underlying correlations between cells, as proposed in a recent preprint, SpaFormer (Wen et al., 2023). A significant concern in CellPLM is the disparity in the number of genes measured by different sequencing platforms. Notably, the gap between scRNA-seq and SRT can be substantial, ranging from 1,000 to 30,000 genes. Taking this into consideration, CellPLM only masks the expression of genes that are measured in each dataset, implying that the reconstruction loss is calculated exclusively on these measured genes. When optimizing the denoising variational lower bound in Eq. (5.6), we apply the reparameterization trick and Monte Carlo sampling, as proposed in VAE (Kingma and Welling, 2014). Furthermore, under the independent Gaussian assumption, we reformulate and estimate the reconstruction term L_recon in Eq. (5.6) with a mean squared error (MSE). Therefore, the pre-training loss function of CellPLM can be formulated as:

L_MSE = ‖ (1 − M) ⊙ (H^{(L)} − X) ‖_F^2,
L_pretrain = L_MSE + L_cond + L_Y,   (5.7)

where ⊙ signifies element-wise multiplication, H^{(L)} ∈ R^{N×k} is the output of the decoder, and X and M are the ground-truth gene expression matrix and the mask indicator matrix, respectively, as defined above. L_cond and L_Y are derived from Eq. (5.6).

5.2.3.2 Task-specific Fine-tuning

When fine-tuning CellPLM, the model is first initialized with the pre-trained parameters. In downstream tasks that require gene expressions as output, the pre-trained decoder can be directly leveraged, and the batch embedding is set to the mixture of all pre-training batches. Otherwise, the decoder is replaced with a task-specific head. The entire model is then fine-tuned with task-specific loss functions, which helps align the general knowledge of the model with the specific downstream task. For example, in the spatial transcriptomic imputation task, the entire pre-trained model can perform zero-shot inference. It can also be fine-tuned on a query SRT dataset and a reference scRNA-seq dataset, where the two datasets are sampled from the same type of tissue. In this case, the loss function remains the same as Eq. (5.7). After fine-tuning on these datasets, CellPLM fits the data distribution of the target tissue and can readily perform imputation. The design and implementation of heads and loss functions for other downstream tasks are elucidated in Appendix C.5.
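A compact sketch of the batch-aware decoder (Section 5.2.2.4) follows: latent code plus a learned batch embedding, then feed-forward layers. The two-layer depth is a placeholder rather than the released configuration.

```python
import torch
import torch.nn as nn

class BatchAwareDecoder(nn.Module):
    """z + batch embedding -> FFLayers -> reconstructed expression."""
    def __init__(self, d_z, d_hidden, n_genes, n_batches):
        super().__init__()
        self.batch_emb = nn.Embedding(n_batches, d_z)  # learnable lookup table
        self.ff = nn.Sequential(nn.Linear(d_z, d_hidden), nn.ReLU(),
                                nn.Linear(d_hidden, n_genes))

    def forward(self, z, batch_label):                 # batch_label: (N,) long
        return self.ff(z + self.batch_emb(batch_label))
```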
5.3 Experiment

CellPLM is first pre-trained on more than 9 million scRNA-seq cells and 2 million SRT cells with the masked language modeling objective demonstrated in Section 5.2.3. The model consists of over 80 million parameters, and pre-training finished in less than 24 hours on a GPU server with 8 Nvidia Tesla V100 16GB cards. The hyperparameters, datasets, and reproducibility information for the pre-trained models are detailed in Appendix C.4. In the following sections, we evaluate the performance of CellPLM on various downstream tasks, including zero-shot clustering, scRNA-seq denoising, spatial transcriptomic imputation, cell type annotation, and perturbation prediction. With the selected tasks, we aim to answer the following research questions:

RQ1: Is CellPLM capable of transferring pre-trained knowledge to a brand new dataset?
RQ2: Does CellPLM provide better cell representations than other pre-trained and non-pre-trained models?
RQ3: Does CellPLM succeed in jointly modeling scRNA-seq and SRT data?

5.3.1 Preliminary study: zero-shot clustering

Table 5.1 Inference time (s) for querying 48,082 cells on a Tesla V100 GPU. Due to GPU memory capacity, the batch size of Geneformer and scGPT is set to 8, while the batch size of CellPLM is 48,082.

Model        Inference time (s)
Geneformer   1183.20
scGPT        1540.65
CellPLM      2.54

To evaluate the transferability of the pre-trained model, we extract cell embeddings from the pre-trained encoder on a public dataset from Li et al. (Li et al., 2020), which is not included in the pre-training data. In addition to CellPLM, we include three baselines, i.e., PCA, Geneformer (Theodoris et al., 2023), and scGPT (Cui et al., 2023). PCA refers to the first 512 PCs of log-normalized expressions from 4,500 highly variable genes (the number of PCs equals the embedding size of scGPT and CellPLM); this is a common embedding method for scRNA-seq data. Geneformer and scGPT are two recently published pre-trained models that are capable of generating cell embeddings. Figure 5.3a illustrates how well the embeddings align with the curated cell type labels, and Figure 5.3b demonstrates the models' ability to remove technical artifacts and mix biological signals from different experiments.

Figure 5.3 CellPLM readily removes patient batch effects and provides accurate cell clustering results without fine-tuning. (a) Clustering performance. (b) Batch correction. (Clustering results: CellPLM ARI = 0.867, NMI = 0.823; PCA ARI = 0.843, NMI = 0.812; scGPT ARI = 0.836, NMI = 0.818; Geneformer ARI = 0.461, NMI = 0.586.)

Notably, the clustering result of CellPLM's embedding achieves the highest ARI and NMI with respect to the ground-truth cell type labels. From the visualization, it is also clear that CellPLM possesses a smoother latent space than the others, which is attributed to our Gaussian mixture prior distribution in pre-training. We also notice that CellPLM is over 500 times faster than other pre-trained models that conduct self-attention among gene tokens, as shown in Table 5.1. Overall, this preliminary study addresses RQ1 and RQ2, and it indicates that CellPLM can readily transfer pre-trained knowledge to new datasets, removing batch effects and generating high-quality cell embeddings at extraordinarily high inference speed.
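A clustering evaluation of this kind can be reproduced along the following lines with Scanpy; this is a hedged sketch in which the AnnData object and the key storing CellPLM embeddings are assumptions, not our exact pipeline.

```python
import scanpy as sc

def cluster_embeddings(adata, rep="X_cellplm", resolution=1.0):
    """kNN graph on a precomputed embedding in adata.obsm[rep], then Leiden."""
    sc.pp.neighbors(adata, use_rep=rep)
    sc.tl.leiden(adata, resolution=resolution, key_added="leiden")
    return adata.obs["leiden"]
```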
Table 5.2 (Task 1) The scRNA-seq denoising performance on the PBMC 5K and Jurkat datasets.

Model                  | PBMC 5K RMSE (↓) | PBMC 5K MAE (↓) | Jurkat RMSE (↓) | Jurkat MAE (↓)
DeepImpute             | 1.168 ± 0.018    | 1.051 ± 0.025   | 0.786 ± 0.006   | 0.557 ± 0.003
scGNN 2.0              | 1.376 ± 0.015    | 1.237 ± 0.019   | 1.001 ± 0.016   | 0.917 ± 0.021
GraphSCI               | 1.068 ± 0.007    | 0.924 ± 0.009   | 0.659 ± 0.030   | 0.481 ± 0.024
SAVER                  | 0.884 ± 0.001    | 0.748 ± 0.001   | 0.569 ± 0.001   | 0.472 ± 0.001
DCA                    | 0.775 ± 0.002    | 0.621 ± 0.002   | 0.423 ± 0.001   | 0.351 ± 0.001
scVI                   | 0.777 ± 0.005    | 0.623 ± 0.004   | 0.416 ± 0.001   | 0.344 ± 0.002
MAGIC                  | 0.793 ± 0.001    | 0.639 ± 0.001   | 0.424 ± 0.001   | 0.351 ± 0.002
scImpute               | 1.170 ± 0.003    | 1.002 ± 0.001   | 0.624 ± 0.002   | 0.529 ± 0.001
scGPT (fine-tuned)     | 0.901 ± 0.001    | 0.565 ± 0.001   | 0.711 ± 0.001   | 0.498 ± 0.001
CellPLM (zero-shot)    | 0.854 ± 0.001    | 0.692 ± 0.000   | 0.517 ± 0.001   | 0.426 ± 0.000
CellPLM (from scratch) | 0.761 ± 0.009    | 0.571 ± 0.011   | 0.395 ± 0.003   | 0.320 ± 0.003
CellPLM (fine-tuned)   | 0.725 ± 0.001    | 0.551 ± 0.001   | 0.391 ± 0.001   | 0.320 ± 0.001

5.3.2 Task 1: scRNA-seq Denoising

Given that single-cell RNA-seq protocols capture only a subset of the mRNA molecules within individual cells, the resulting measurements exhibit substantial technical noise (Grün et al., 2014). Therefore, we consider denoising power the most desired and essential property for a single-cell foundation model. The goal of the denoising task is to estimate the true expression level of each gene in each cell from a noisy observation. To assess the denoising efficacy of CellPLM, we conduct an evaluation on two single-cell RNA-seq datasets, i.e., PBMC 5K and Jurkat from 10x Genomics¹. These two datasets were excluded from pre-training.

¹ 10x Genomics datasets are available at https://support.10xgenomics.com/single-cellgene-expression/datasets.

Following the setting of scGNN (Wang et al., 2021) and scGNN 2.0 (Gu et al., 2022), we apply a random flipping process to a subset of non-zero entries, transforming them into zeros in order to simulate the effects of dropout. We compare CellPLM against a broad range of contemporary approaches, including DeepImpute (Arisdakessian et al., 2019), scGNN 2.0 (Gu et al., 2022), SAVER (Huang et al., 2018), DCA (Eraslan et al., 2019), scVI (Lopez et al., 2018), MAGIC (Van Dijk et al., 2018), scImpute (Li and Li, 2018), and scGPT (Cui et al., 2023). We evaluate scRNA-seq denoising performance based on two popular regression metrics, i.e., Root Mean Square Error (RMSE) and Mean Absolute Error (MAE), to measure the degree of similarity between the predicted gene expressions and the actual ones. More details pertaining to these methods, the fine-tuning of CellPLM, and the evaluation metrics under the task of scRNA-seq denoising can be found in Appendix C.5.1.

It is evident that the fine-tuned CellPLM consistently exhibits superior performance compared to all baseline models on both datasets. Note that even under the zero-shot setting, CellPLM shows satisfactory results that surpass the majority of baselines (6 out of 8) on each dataset. These observations answer the questions of RQ1 and RQ2. As a powerful denoising model, CellPLM can serve as a foundation for other downstream analyses.
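As a concrete illustration of the random-flipping protocol used above to simulate dropout, a minimal sketch is given below; the flip rate and function names are our own placeholders, not values taken from the experiments.

```python
import numpy as np

def simulate_dropout(X, flip_rate=0.1, seed=0):
    """Zero out a random subset of non-zero entries to mimic dropout.

    Returns the corrupted matrix and a boolean mask marking the flipped
    entries, on which denoising RMSE/MAE can then be evaluated.
    """
    rng = np.random.default_rng(seed)
    X_noisy = X.copy()
    rows, cols = np.nonzero(X)                      # candidate entries
    picked = rng.choice(rows.size, size=int(flip_rate * rows.size),
                        replace=False)
    X_noisy[rows[picked], cols[picked]] = 0.0       # simulated dropouts
    eval_mask = np.zeros(X.shape, dtype=bool)
    eval_mask[rows[picked], cols[picked]] = True
    return X_noisy, eval_mask
```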
5.3.3 Task 2: Spatial Transcriptomic Imputation

Spatially resolved transcriptomics has revolutionized single-cell analysis by incorporating physical locations along with gene expression, leading to exciting breakthroughs. However, as a tradeoff for the highly detailed spatial resolution, spatial transcriptomic data at the cellular level typically cover less than 1,000 genes, which poses challenges in data analysis.

To assess the potential benefits of the pre-trained model in this task, we evaluate CellPLM on two spatial transcriptomic datasets at single-cell resolution, i.e., Lung2 and Liver2 from the Merscope FFPE datasets² (the whole study is not included in our pre-training data). Following the setting of the baselines, including SpaGE (Abdelaal et al., 2020), stPlus (Shengquan et al., 2021), gimVI (Lopez et al., 2019), and Tangram (Biancalani et al., 2021), we impute the unmeasured genes of the SRT dataset utilizing a scRNA-seq dataset as reference. We identify the testing gene set in the SRT data by stratified sampling according to gene sparsity (Avşar and Pir, 2023) and hold out those genes in the fine-tuning stage. To evaluate the accuracy of spatial transcriptomic imputation, we employ the Pearson correlation coefficient (Corr) and cosine similarity (Cosine) to measure the degree of similarity between the predicted spatial gene expressions and the corresponding ground-truth expression values. Remarkably, the fine-tuned CellPLM takes the lead on both datasets (Table 5.3), effectively addressing research questions RQ1 and RQ3. However, when trained from scratch on these datasets, CellPLM hardly converges, which indicates that the pre-trained information is necessary for CellPLM to impute SRT data. In addition, we conduct a preliminary study by visualizing the attention between cells in space, which might hint at cell-cell communication. For the visualization and additional information regarding the baselines, the fine-tuning of CellPLM, and the evaluation metrics under this task, please refer to Appendix C.5.2.

² Merscope FFPE human immuno-oncology datasets are available at https://info.vizgen.com/ffpe-showcase.

Table 5.3 (Task 2) The results of spatial transcriptomic imputation on the Lung2 and Liver2 datasets.

Model                  | Lung2 Corr (↑) | Lung2 Cosine (↑) | Liver2 Corr (↑) | Liver2 Cosine (↑)
SpaGE                  | 0.227 ± 0.011  | 0.352 ± 0.015    | 0.253 ± 0.014   | 0.376 ± 0.005
stPlus                 | 0.177 ± 0.021  | 0.360 ± 0.014    | 0.224 ± 0.010   | 0.399 ± 0.012
gimVI                  | 0.130 ± 0.010  | 0.325 ± 0.010    | 0.163 ± 0.019   | 0.338 ± 0.010
Tangram                | 0.123 ± 0.005  | 0.285 ± 0.008    | 0.168 ± 0.024   | 0.309 ± 0.008
CellPLM (zero-shot)    | 0.119 ± 0.024  | 0.327 ± 0.011    | 0.141 ± 0.013   | 0.322 ± 0.145
CellPLM (from scratch) | 0.058 ± 0.020  | 0.370 ± 0.013    | 0.024 ± 0.039   | 0.352 ± 0.011
CellPLM (fine-tuned)   | 0.318 ± 0.015  | 0.481 ± 0.011    | 0.328 ± 0.011   | 0.481 ± 0.010

Table 5.4 (Task 3) The results of cell type annotation on the MS and hPancreas datasets. "*" indicates results taken directly from scGPT (Cui et al., 2023).

Model                  | MS F1 (↑)     | MS Precision (↑) | hPancreas F1 (↑) | hPancreas Precision (↑)
CellTypist             | 0.667 ± 0.002 | 0.693 ± 0.001    | 0.708 ± 0.023    | 0.736 ± 0.025
ACTINN                 | 0.628 ± 0.012 | 0.634 ± 0.009    | 0.705 ± 0.005    | 0.709 ± 0.006
SingleCellNet          | 0.637 ± 0.001 | 0.700 ± 0.001    | 0.739 ± 0.006    | 0.761 ± 0.004
TOSICA*                | 0.578         | 0.664            | 0.656            | 0.661
scBERT* (fine-tuned)   | 0.599         | 0.604            | 0.685            | 0.699
scGPT* (fine-tuned)    | 0.703         | 0.729            | 0.718            | 0.735
CellPLM (from scratch) | 0.709 ± 0.007 | 0.732 ± 0.015    | 0.689 ± 0.034    | 0.682 ± 0.037
CellPLM (fine-tuned)   | 0.766 ± 0.007 | 0.803 ± 0.008    | 0.749 ± 0.010    | 0.753 ± 0.010

5.3.4 Task 3: Cell Type Annotation

Cell type annotation is another important task in single-cell analysis, as it enables the identification and characterization of distinct cell populations within a tissue or organism. The objective of this task is to classify the type of cells in query datasets according to the annotations in reference datasets.
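Since the output of this task is not gene expression, fine-tuning follows the recipe of Section 5.2.3.2: the pre-trained decoder is replaced with a classification head. A minimal sketch of such a head is shown below; the class and dimensions are hypothetical and not taken from the exact CellPLM fine-tuning code.

```python
import torch
import torch.nn as nn

class AnnotationHead(nn.Module):
    """Hypothetical task-specific head: latent cell embedding -> type logits."""
    def __init__(self, latent_dim: int, n_types: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_types),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)

# Fine-tuning step (sketch): the encoder starts from pre-trained weights and
# is trained jointly with the head under cross-entropy:
#   logits = head(encoder(x)); loss = nn.functional.cross_entropy(logits, y)
```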
Here we follow the suggestion of scGPT (Cui et al., 2023) and include the hPancreas (Chen et al., 2023) and Multiple Sclerosis (MS) (Schirmer et al., 2019) datasets. More details about the datasets, the baseline methods, and the fine-tuning of CellPLM can be found in Appendix C.5.4. The empirical results presented in Table 5.4 indicate that CellPLM learns a well-represented and generalizable cellular embedding, achieving considerable improvements on the cell type annotation task. This again confirms our positive answer to research questions RQ1 and RQ2.

In addition to the aforementioned cell-level tasks, we also explore the potential of CellPLM in gene-level tasks. We conduct an experiment on perturbation prediction following the setting of GEARS (Roohani et al., 2022). The results show that CellPLM can also benefit gene perturbation prediction and successfully outperforms the previous SOTA. Due to space limitations, we leave the detailed results to Appendix C.5.3.

5.4 Related Work

Deep learning methods for single-cell data have garnered significant research interest in recent years (Molho et al., 2024). However, due to their distinct model architectures, the knowledge learned by these models is not transferable across tasks. To address this issue, there is an emerging effort (Yang et al., 2022; Gong et al., 2023; Shen et al., 2023; Cui et al., 2023; Theodoris et al., 2023) from the research community to explore the potential of a foundation model that first extracts latent knowledge from unlabeled scRNA-seq data and subsequently generalizes this knowledge to a variety of tasks. The first such pre-trained model for single-cell data, scBERT (Yang et al., 2022), takes genes as tokens and leverages an efficient transformer (Choromanski et al., 2020) to encode over 16,000 gene tokens for each cell. By randomly masking a fraction of non-zero gene expression values and predicting them based on the remaining data, scBERT effectively learns intricate relationships between genes, leading to improved cellular representations. Later, xTrimoGene (Gong et al., 2023) made two key enhancements to scBERT: pruning zero-expressed genes and improving the expression binning strategy via auto-discretization. These modifications notably enhance scalability and feature resolution. Another recent preprint, scGPT (Cui et al., 2023), introduces a variant of masked language modeling that mimics auto-regressive generation in natural language processing, where masked genes are iteratively predicted according to the model's confidence. Unlike the aforementioned models, Geneformer (Theodoris et al., 2023) and tGPT (Shen et al., 2023) completely abandon precise expression levels of genes. Instead, they model the rank of gene expressions and construct sequences of genes according to their relative expression levels within each cell.
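To illustrate the rank-based alternative just described, the sketch below converts one cell's expression vector into an ordered gene-token sequence. This is only a schematic of the general idea; Geneformer and tGPT apply additional normalization and vocabulary handling that are not reproduced here.

```python
import numpy as np

def rank_tokens(expr, gene_ids, max_len=2048):
    """Order expressed genes by decreasing expression and return their ids,
    the style of input consumed by rank-based models."""
    nonzero = np.nonzero(expr)[0]                 # keep expressed genes only
    order = nonzero[np.argsort(-expr[nonzero])]   # highest expression first
    return [gene_ids[i] for i in order[:max_len]]

# Example: rank_tokens(np.array([0.0, 5.2, 1.1]), ["GENE_A", "GENE_B", "GENE_C"])
# returns ["GENE_B", "GENE_C"]; exact expression values are discarded.
```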
The aforementioned models all regard genes as tokens and focus solely on modeling gene relationships within individual cells, neglecting the intercellular information in an organism. In contrast, CellPLM overcomes this limitation by introducing a cell language model that extends beyond single cells. Furthermore, by leveraging the spatial information of cells acquired from SRT data, along with a prior Gaussian mixture distribution, the model achieves unparalleled performance on a range of downstream tasks.

5.5 Chapter Conclusion

In this chapter, we propose the cell language model, a novel paradigm for single-cell pre-trained models that aligns well with the fundamental characteristics of single-cell data. This has led to CellPLM, the first pre-trained transformer framework that encodes inter-cell relations, leverages spatially resolved transcriptomic data, and adopts a reasonable prior distribution. Our experiments on various downstream tasks demonstrate the power of CellPLM, which has great potential to facilitate future research in single-cell biology.

CHAPTER 6 CONCLUSION

In this chapter, we summarize the research results of this dissertation, discuss their broader impact, and highlight promising research directions.

6.1 Summary

In this dissertation, we have explored various deep learning approaches for advancing single-cell multi-omics and spatial omics analysis. By leveraging graph neural networks (GNNs) and transformers, we introduced several innovative models tailored to the unique challenges presented by single-cell data. These models not only demonstrate significant improvements over existing methodologies but also lay the groundwork for future developments in the field.

One of the central contributions is the development of CellPLM, a novel pre-trained model that effectively captures cell-cell relationships and integrates spatial transcriptomic data, outperforming existing models on various tasks. CellPLM builds upon our findings from the earlier chapters and effectively transfers knowledge from cross-domain unlabeled data to various downstream tasks and datasets of interest. It can serve as a universal backbone for single-cell research, which exactly fits the concept of a foundation model.

Therefore, the implications of this dissertation extend beyond the specific models and tasks discussed. The methodologies proposed in this work highlight the importance of considering complex inter-cellular interactions and multi-modal data integration in single-cell foundation model development. As single-cell technologies continue to advance, the need for more sophisticated foundation models that can handle the intricacies of these data types will only grow.

6.2 Future Work

In the future, we plan to further unleash the power of foundation models such as CellPLM in single-cell analysis. Currently, CellPLM only considers two modalities: scRNA-seq and spatial transcriptomics. While these modalities have already demonstrated significant improvements on various tasks, there is substantial room for growth and refinement.

One potential direction for future work is the extension of the CellPLM framework to incorporate multi-omics data, similar to what we explored with scMoGNN and scMoFormer. The integration of additional omics layers, such as epigenomics, proteomics, and metabolomics, would enable a more comprehensive understanding of cellular processes. By incorporating these diverse data types, CellPLM could capture a more holistic view of the molecular landscape, leading to improved accuracy in cell classification, state prediction, and trajectory inference.

Another promising area for future exploration is the enhancement of the embedder module within CellPLM to better handle cross-modality integration. The introduction of advanced techniques, such as multi-view learning or contrastive learning, could improve the model's ability to learn meaningful representations across different omics modalities.
Furthermore, the development of novel attention mechanisms tailored specifically for multi-modal data could allow the model to more effectively prioritize and fuse information from each modality, thus enhancing its overall performance.

Additionally, there is growing interest in applying foundation models like CellPLM to the study of rare and heterogeneous cell populations. As single-cell technologies continue to advance, the ability to analyze these rare cell types with high precision becomes increasingly important. Future work could involve the adaptation of CellPLM to focus on these challenging cell populations, possibly through the incorporation of few-shot learning techniques or the development of specialized pre-training strategies designed to handle imbalanced datasets.

In conclusion, the potential future directions for CellPLM are vast and varied. By extending the framework to incorporate more diverse data types, refining the model's architecture, and exploring new application domains, we can continue to push the boundaries of single-cell analysis and unlock new biological insights. As single-cell technologies continue to evolve, the development of sophisticated foundation models like CellPLM will be crucial in addressing the increasing complexity and diversity of the data being generated.

BIBLIOGRAPHY

Abdelaal, T., Mourragui, S., Mahfouz, A., and Reinders, M. J. (2020). Spage: spatial gene enhancement using scrna-seq. Nucleic Acids Research, 48(18):e107–e107.

Adamson, B., Norman, T. M., Jost, M., Cho, M. Y., Nuñez, J. K., Chen, Y., Villalta, J. E., Gilbert, L. A., Horlbeck, M. A., Hein, M. Y., et al. (2016). A multiplexed single-cell crispr screening platform enables systematic dissection of the unfolded protein response. Cell, 167(7):1867–1882.

Alon, U. and Yahav, E. (2021). On the bottleneck of graph neural networks and its practical implications. In International Conference on Learning Representations.

Argelaguet, R., Arnol, D., Bredikhin, D., Deloro, Y., Velten, B., Marioni, J. C., and Stegle, O. (2020). Mofa+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biology, 21(1):1–17.

Argelaguet, R., Cuomo, A. S., Stegle, O., and Marioni, J. C. (2021). Computational principles and challenges in single-cell data integration. Nature Biotechnology, 39(10):1202–1215.

Arisdakessian, C., Poirion, O., Yunits, B., Zhu, X., and Garmire, L. X. (2019). Deepimpute: an accurate, fast, and scalable deep neural network method to impute single-cell rna-seq data. Genome Biology, 20(1):1–14.

Armingol, E., Officer, A., Harismendy, O., and Lewis, N. E. (2021). Deciphering cell–cell interactions and communication from gene expression. Nature Reviews Genetics, 22(2):71–88.

Avşar, G. and Pir, P. (2023). A comparative performance evaluation of imputation methods in spatially resolved transcriptomics data. Molecular Omics.

Batool, F. and Hennig, C. (2021). Clustering with the average silhouette width. Computational Statistics & Data Analysis, 158:107190.

Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al. (2018). Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.

Beaulieu-Jones, B. K., Moore, J. H., and the PRO-AACT Consortium (2017). Missing data imputation in the electronic health record using deeply learned autoencoders. In Pacific symposium on biocomputing 2017, pages 207–218.
World Scientific. Belhocine, K., DeMare, L., and Habern, O. (2021). Single-cell multiomics: Simultaneous epigenetic and transcriptional profiling: 10x genomics shares experimental planning and sample preparation tips for the chromium single cell multiome atac+ gene expression system. Genetic Engineering & Biotechnology News, 41(1):66–68. 88 Bengio, Y., Ducharme, R., and Vincent, P. (2000). A neural probabilistic language model. Advances in Neural Information Processing Systems, 13. Biancalani, T., Scalia, G., Buffoni, L., Avasthi, R., Lu, Z., Sanger, A., Tokcan, N., Vanderburg, C. R., Segerstolpe, Å., Zhang, M., et al. (2021). Deep learning and alignment of spatially resolved single-cell transcriptomes with tangram. Nature Methods, 18(11):1352–1362. Boquet, G., Vicario, J. L., Morell, A., and Serrano, J. (2019). Missing data in traffic estimation: A variational autoencoder imputation method. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2882–2886. IEEE. Browaeys, R., Saelens, W., and Saeys, Y. (2020). Nichenet: modeling intercellular communication by linking ligands to target genes. Nature Methods, 17(2):159–162. Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. (2023). Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712. Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y., and Greenleaf, W. J. (2013). Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, dna-binding proteins and nucleosome position. Nature Methods, 10(12):1213–1218. Burgess, D. J. (2019). Spatial transcriptomics coming of age. Nature Reviews Genetics, 20(6):317– 317. Cang, Z., Zhao, Y., Almet, A. A., Stabell, A., Ramos, R., Plikus, M. V., Atwood, S. X., and Nie, Q. (2023). Screening cell–cell communication in spatial transcriptomics via collective optimal transport. Nature Methods, 20(2):218–228. Cannoodt, R., Saelens, W., and Saeys, Y. (2016). Computational methods for trajectory inference from single-cell transcriptomics. European journal of immunology, 46(11):2496–2506. Cao, J., Cusanovich, D. A., Ramani, V., Aghamirzaie, D., Pliner, H. A., Hill, A. J., Daza, R. M., McFaline-Figueroa, J. L., Packer, J. S., Christiansen, L., et al. (2018). Joint profiling of chromatin accessibility and gene expression in thousands of single cells. Science, 361(6409):1380–1385. Cao, Z.-J. and Gao, G. (2022). Multi-omics single-cell data integration and regulatory inference with graph-linked embedding. Nature Biotechnology, 40(10):1–9. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020). End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 213–229. Springer. Chen, D., O’Bray, L., and Borgwardt, K. (2022). Structure-aware transformer for graph representation learning. In International Conference on Machine Learning, pages 3469–3489. PMLR. 89 Chen, J., Xu, H., Tao, W., Chen, Z., Zhao, Y., and Han, J.-D. J. (2023). Transformer for one stop interpretable cell type annotation. Nature Communications, 14(1):223. Chen, S., Lake, B. B., and Zhang, K. (2019). High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nature Biotechnology, 37(12):1452–1457. 
Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., et al. (2020). Rethinking attention with performers. arXiv preprint arXiv:2009.14794. Choromanski, K. M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J. Q., Mohiuddin, A., Kaiser, L., Belanger, D. B., Colwell, L. J., and Weller, A. (2021). Rethinking attention with performers. In International Conference on Learning Representations. Chu, X., Tian, Z., Zhang, B., Wang, X., Wei, X., Xia, H., and Shen, C. (2021). Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882. Ciortan, M. and Defrance, M. (2022). Gnn-based embedding for clustering scrna-seq data. Bioinformatics, 38(4):1037–1044. Cui, H., Wang, C., Maan, H., and Wang, B. (2023). scgpt: Towards building a foundation model for single-cell multi-omics using generative ai. bioRxiv, pages 2023–04. Cunningham, F., Allen, J. E., Allen, J., Alvarez-Jarreta, J., Amode, M. R., Armean, I. M., Austine- Orimoloye, O., Azov, A. G., Barnes, I., Bennett, R., et al. (2022). Ensembl 2022. Nucleic Acids Research, 50(D1):D988–D995. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186. Dilokthanakul, N., Mediano, P. A., Garnelo, M., Lee, M. C., Salimbeni, H., Arulkumaran, K., and Shanahan, M. (2016). Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648. Ding, J., Liu, R., Wen, H., Tang, W., Li, Z., Venegas, J., Su, R., Molho, D., Jin, W., Wang, Y., et al. (2024). Dance: A deep learning library and benchmark platform for single-cell analysis. Genome Biology, 25(1):72. Dixit, A., Parnas, O., Li, B., Chen, J., Fulco, C. P., Jerby-Arnon, L., Marjanovic, N. D., Dionne, D., Burks, T., Raychowdhury, R., et al. (2016). Perturb-seq: dissecting molecular circuits with scalable single-cell rna profiling of pooled genetic screens. Cell, 167(7):1853–1866. Domínguez Conde, C., Xu, C., Jarvis, L., Rainbow, D., Wells, S., Gomes, T., Howlett, S., Suchanek, 90 O., Polanski, K., King, H., et al. (2022). Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science, 376(6594):eabl5197. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Duren, Z., Chen, X., Zamanighomi, M., Zeng, W., Satpathy, A. T., Chang, H. Y., Wang, Y., and Wong, W. H. (2018). Integrative analysis of single-cell genomics data by coupled nonnegative matrix factorizations. Proceedings of the National Academy of Sciences, 115(30):7723–7728. Dwivedi, V. P. and Bresson, X. (2020). A generalization of transformer networks to graphs. arXiv preprint arXiv:2012.09699. Dwivedi, V. P., Luu, A. T., Laurent, T., Bengio, Y., and Bresson, X. (2021). Graph neural networks with learnable structural and positional representations. arXiv preprint arXiv:2110.07875. Dwivedi, V. P., Luu, A. T., Laurent, T., Bengio, Y., and Bresson, X. (2022). Graph neural networks with learnable structural and positional representations. 
In International Conference on Learning Representations. Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S., and Theis, F. J. (2019). Single-cell rna-seq denoising using a deep count autoencoder. Nature Communications, 10(1):1–14. Gaiti, F., Chaligne, R., Gu, H., Brand, R. M., Kothen-Hill, S., Schulman, R. C., Grigorev, K., Risso, D., Kim, K.-T., Pastore, A., et al. (2019). Epigenetic evolution and lineage histories of chronic lymphocytic leukaemia. Nature, 569(7757):576–580. Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. (2017). Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, Proceedings of Machine Learning Research. Gondara, L. and Wang, K. (2017). Multiple imputation using deep denoising autoencoders. arXiv preprint arXiv:1705.02737, 280. Gondara, L. and Wang, K. (2018). Mida: Multiple imputation using denoising autoencoders. In Pacific-Asia conference on knowledge discovery and data mining, pages 260–272. Springer. Gong, B., Zhou, Y., and Purdom, E. (2021). Cobolt: integrative analysis of multimodal single-cell sequencing data. Genome Biology, 22(1):1–21. Gong, J., Hao, M., Zeng, X., Liu, C., Ma, J., Cheng, X., Wang, T., Zhang, X., and Song, L. (2023). xtrimogene: An efficient and scalable representation learner for single-cell rna-seq data. bioRxiv, pages 2023–03. 91 Gong, W., Kwak, I.-Y., Pota, P., Koyano-Nakagawa, N., and Garry, D. J. (2018). Drimpute: imputing dropout events in single cell rna sequencing data. BMC bioinformatics, 19(1):1–10. Grønbech, C. H., Vording, M. F., Timshel, P. N., Sønderby, C. K., Pers, T. H., and Winther, O. (2020). scvae: variational auto-encoders for single-cell gene expression data. Bioinformatics, 36(16):4415–4422. Grün, D., Kester, L., and Van Oudenaarden, A. (2014). Validation of noise models for single-cell transcriptomics. Nature Methods, 11(6):637–640. Gu, H., Cheng, H., Ma, A., Li, Y., Wang, J., Xu, D., and Ma, Q. (2022). scgnn 2.0: a graph neural network tool for imputation and clustering of single-cell rna-seq data. Bioinformatics, 38(23):5322–5325. Hamilton, W. L., Ying, Z., and Leskovec, J. (2017). Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, volume 30. Hao, Y., Hao, S., Andersen-Nissen, E., Mauck, W. M., Zheng, S., Butler, A., Lee, M. J., Wilk, A. J., Darby, C., Zager, M., et al. (2021). Integrated analysis of multimodal single-cell data. Cell, 184(13):3573–3587. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. (2022a). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009. He, S., Bhatt, R., Brown, C., Brown, E. A., Buhr, D. L., Chantranuvatana, K., Danaher, P., Dunaway, D., Garrison, R. G., Geiss, G., Gregory, M. T., Hoang, M. L., Khafizov, R., Killingbeck, E. E., Kim, D., Kim, T. K., Kim, Y., Klock, A., Korukonda, M., Kutchma, A., Lewis, Z. R., Liang, Y., Nelson, J. S., Ong, G. T., Perillo, E. P., Phan, J. C., Phan-Everson, T., Piazza, E., Rane, T., Reitz, Z., Rhodes, M., Rosenbloom, A., Ross, D., Sato, H., Wardhani, A. W., Williams-Wietzikoski, C. A., Wu, L., and Beechem, J. M. (2022b). High-plex multiomic analysis in ffpe at subcellular level by spatial molecular imaging. bioRxiv. 
Hou, R., Denisenko, E., Ong, H. T., Ramilowski, J. A., and Forrest, A. R. (2020). Predicting cell-to-cell communication networks using natmi. Nature Communications, 11(1):5011. Huang, K., Xiao, C., Glass, L. M., and Sun, J. (2021). Moltrans: molecular interaction transformer for drug–target interaction prediction. Bioinformatics, 37(6):830–836. Huang, M., Wang, J., Torre, E., Dueck, H., Shaffer, S., Bonasio, R., Murray, J. I., Raj, A., Li, M., and Zhang, N. R. (2018). Saver: gene expression recovery for single-cell rna sequencing. Nature Methods, 15(7):539–542. 92 Ieremie, I., Ewing, R. M., and Niranjan, M. (2022). Transformergo: predicting protein–protein interactions by modelling the attention between sets of gene ontology terms. Bioinformatics, 38(8):2269–2277. Im Im, D., Ahn, S., Memisevic, R., and Bengio, Y. (2017). Denoising criterion for variational auto-encoding framework. In Proceedings of the AAAI conference on artificial intelligence, volume 31. Jiang, J., Xu, J., Liu, Y., Song, B., Guo, X., Zeng, X., and Zou, Q. (2023). Dimensionality reduction and visualization of single-cell rna-seq data with an improved deep variational autoencoder. Briefings in Bioinformatics, page bbad152. Jiang, R., Sun, T., Song, D., and Li, J. J. (2022). Statistics or biology: the zero-inflation controversy about scrna-seq data. Genome Biology, 23(1):1–24. Jin, S., Guerrero-Juarez, C., Zhang, L., Chang, I., Ramos, R., Kuan, C., Myung, P., Plikus, M., and Nie, Q. (2021). Inference and analysis of cell-cell communication using cellchat. nat. commun. 12, 1088. Jin, S., Zhang, L., and Nie, Q. (2020). scai: an unsupervised approach for the integrative analysis of parallel single-cell transcriptomic and epigenomic profiles. Genome biology, 21:1–19. Kim, N., Kim, H. K., Lee, K., Hong, Y., Cho, J. H., Choi, J. W., Lee, J.-I., Suh, Y.-L., Ku, B. M., Eum, H. H., et al. (2020). Single-cell rna sequencing demonstrates the molecular and cellular reprogramming of metastatic lung adenocarcinoma. Nature Communications, 11(1):2285. Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Kingma, D. P. and Welling, M. (2014). Auto-encoding variational bayes. In Bengio, Y. and LeCun, Y., editors, 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings. Kipf, T. N. and Welling, M. (2017). Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations. Kitaev, N., Kaiser, Ł., and Levskaya, A. (2020). Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451. Kotliarov, Y., Sparks, R., Martins, A. J., Mulè, M. P., Lu, Y., Goswami, M., Kardava, L., Banchereau, R., Pascual, V., Biancotto, A., et al. (2020). Broad immune activation underlies shared set point signatures for vaccine responsiveness in healthy individuals and disease activity in patients with lupus. Nature Medicine, 26(4):618–629. Kreuzer, D., Beaini, D., Hamilton, W. L., Létourneau, V., and Tossou, P. (2021). Rethinking graph 93 transformers with spectral attention. In Advances in Neural Information Processing Systems, volume 34, pages 21618–21629. Lee, J., Hyeon, D. Y., and Hwang, D. (2020). Single-cell multiomics: technologies and data analysis methods. Experimental & Molecular Medicine, 52(9):1428–1442. Lelong, S., Zhou, X., Afrasiabi, C., Qian, Z., Cano, M. A., Tsueng, G., Xin, J., Mullen, J., Yao, Y., Avila, R., et al. (2022). 
Biothings sdk: a toolkit for building high-performance data apis in biomedical research. Bioinformatics, 38(7):2077–2079. Li, J., Chen, S., Pan, X., Yuan, Y., and Shen, H.-B. (2022). Cell clustering for spatial transcriptomics data with graph neural networks. Nature Computational Science, 2(6):399–408. Li, W. V. and Li, J. J. (2018). An accurate and robust imputation method scimpute for single-cell rna-seq data. Nature Communications, 9(1):1–9. Li, Y., Ren, P., Dawson, A., Vasquez, H. G., Ageedi, W., Zhang, C., Luo, W., Chen, R., Li, Y., Kim, S., et al. (2020). Single-cell transcriptome analysis reveals dynamic cell populations and differential gene expression patterns in control and aneurysmal human aortic tissue. Circulation, 142(14):1374–1388. Liberzon, A., Birger, C., Thorvaldsdóttir, H., Ghandi, M., Mesirov, J. P., and Tamayo, P. (2015). The molecular signatures database hallmark gene set collection. Cell systems, 1(6):417–425. Lim, D., Robinson, J., Zhao, L., Smidt, T., Sra, S., Maron, H., and Jegelka, S. (2022). Sign and basis invariant networks for spectral graph representation learning. arXiv preprint arXiv:2202.13013. Lin, X., Tian, T., Wei, Z., and Hakonarson, H. (2022). Clustering of single-cell multi-omics data with a multimodal deep learning method. Nature Communications, 13(1):7705. Liu, Q., Chen, S., Jiang, R., and Wong, W. H. (2021a). Simultaneous deep generative modelling and clustering of single-cell genomic data. Nature machine intelligence, 3(6):536–544. Liu, X., Ding, J., Jin, W., Xu, H., Ma, Y., Liu, Z., and Tang, J. (2021b). Graph neural networks with adaptive residual. Advances in Neural Information Processing Systems, 34. Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., et al. (2022). Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12009–12019. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021c). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022. Lopez, R., Nazaret, A., Langevin, M., Samaran, J., Regier, J., Jordan, M. I., and Yosef, N. (2019). A 94 joint model of unpaired data from scrna-seq and spatial transcriptomics for imputing missing gene expression measurements. arXiv preprint arXiv:1905.02269. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I., and Yosef, N. (2018). Deep generative modeling for single-cell transcriptomics. Nature Methods, 15(12):1053–1058. Lotfollahi, M., Wolf, F. A., and Theis, F. J. (2019). scgen predicts single-cell perturbation responses. Nature Methods, 16(8):715–721. Lubeck, E., Coskun, A. F., Zhiyentayev, T., Ahmad, M., and Cai, L. (2014). Single-cell in situ rna profiling by sequential hybridization. Nature Methods, 11(4):360–361. Luecken, M. D., Burkhardt, D. B., Cannoodt, R., Lance, C., Agrawal, A., Aliee, H., Chen, A. T., Deconinck, L., Detweiler, A. M., Granados, A. A., et al. (2021). A sandbox for prediction and integration of dna, rna, and proteins in single cells. In Thirty-fifth conference on neural information processing systems datasets and benchmarks track (Round 2). Luo, S., Li, S., Cai, T., He, D., Peng, D., Zheng, S., Ke, G., Wang, L., and Liu, T.-Y. (2021). Stable, fast and accurate: Kernelized attention with relative positional encoding. Advances in Neural Information Processing Systems, 34:22795–22807. Ma, F. 
and Pellegrini, M. (2020). Actinn: automated identification of cell types in single cell rna sequencing. Bioinformatics, 36(2):533–538. Ma, L., Wang, L., Khatib, S. A., Chang, C.-W., Heinrich, S., Dominguez, D. A., Forgues, M., Candia, J., Hernandez, M. O., Kelly, M., et al. (2021a). Single-cell atlas of tumor cell evolution in response to therapy in hepatocellular carcinoma and intrahepatic cholangiocarcinoma. Journal of Hepatology, 75(6):1397–1408. Ma, Y., Liu, X., Zhao, T., Liu, Y., Tang, J., and Shah, N. (2021b). A unified view on graph neural networks as graph signal denoising. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 1202–1211. Ma, Y. and Tang, J. (2021). Deep Learning on Graphs. Cambridge University Press. Merritt, C. R., Ong, G. T., Church, S. E., Barker, K., Danaher, P., Geiss, G., Hoang, M., Jung, J., Liang, Y., McKay-Fleisch, J., et al. (2020). Multiplex digital spatial profiling of proteins and rna in fixed tissue. Nature Biotechnology. Mimitou, E. P., Lareau, C. A., Chen, K. Y., Zorzetto-Fernandes, A. L., Hao, Y., Takeshima, Y., Luo, W., Huang, T.-S., Yeung, B. Z., Papalexi, E., et al. (2021). Scalable, multimodal profiling of chromatin accessibility, gene expression and protein levels in single cells. Nature Biotechnology, 39(10):1246–1258. Minoura, K., Abe, K., Nam, H., Nishikawa, H., and Shimamura, T. (2021). A mixture-of-experts 95 deep generative model for integrated analysis of single-cell multiomics data. Cell reports methods, 1(5):100071. Molho, D., Ding, J., Tang, W., Li, Z., Wen, H., Wang, Y., Venegas, J., Jin, W., Liu, R., Su, R., et al. (2024). Deep learning in single-cell analysis. ACM Transactions on Intelligent Systems and Technology, 15(3):1–62. Norman, T. M., Horlbeck, M. A., Replogle, J. M., Ge, A. Y., Xu, A., Jost, M., Gilbert, L. A., and Weissman, J. S. (2019). Exploring genetic interaction manifolds constructed from rich single-cell phenotypes. Science, 365(6455):786–793. Pereira, R. C., Santos, M. S., Rodrigues, P. P., and Abreu, P. H. (2020). Reviewing autoencoders for missing data imputation: Technical trends, applications and outcomes. Journal of Artificial Intelligence Research, 69:1255–1285. Pott, S. (2017). Simultaneous measurement of chromatin accessibility, dna methylation, and nucleosome phasing in single cells. eLife, 6:e23203. Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A. (2018). Catboost: unbiased boosting with categorical features. Advances in Neural Information Processing Systems, 31. Qiu, P. (2020). Embracing the dropouts in single-cell rna-seq analysis. Nature Communications, 11(1):1169. Rampasek, L., Galkin, M., Dwivedi, V. P., Luu, A. T., Wolf, G., and Beaini, D. (2022). Recipe for a general, powerful, scalable graph transformer. In Advances in Neural Information Processing Systems. Rampášek, L., Galkin, M., Dwivedi, V. P., Luu, A. T., Wolf, G., and Beaini, D. (2022). Recipe for a general, powerful, scalable graph transformer. arXiv preprint arXiv:2205.12454. Rao, J., Zhou, X., Lu, Y., Zhao, H., and Yang, Y. (2021). Imputing single-cell rna-seq data by combining graph convolution and autoencoder neural networks. iScience, 24(5):102393. Raredon, M. S. B., Adams, T. S., Suhail, Y., Schupp, J. C., Poli, S., Neumark, N., Leiby, K. L., Greaney, A. M., Yuan, Y., Horien, C., et al. (2019). Single-cell connectomic analysis of adult mammalian lungs. Science advances, 5(12):eaaw3851. Regev, A., Teichmann, S. A., Lander, E. 
S., Amit, I., Benoist, C., Birney, E., Bodenmiller, B., Campbell, P., Carninci, P., Clatworthy, M., et al. (2017). The human cell atlas. eLife, 6:e27041. Ronen, J. and Akalin, A. (2018). netsmooth: Network-smoothing based imputation for single cell rna-seq. F1000Research, 7. 96 Roohani, Y., Huang, K., and Leskovec, J. (2022). Gears: Predicting transcriptional outcomes of novel multi-gene perturbations. bioRxiv, pages 2022–07. Rudensky, A. Y. (2011). Regulatory t cells and foxp3. Immunological reviews, 241(1):260–268. Schirmer, L., Velmeshev, D., Holmqvist, S., Kaufmann, M., Werneburg, S., Jung, D., Vistnes, S., Stockley, J. H., Young, A., Steindel, M., et al. (2019). Neuronal vulnerability and multilineage diversity in multiple sclerosis. Nature, 573(7772):75–82. Shao, X., Li, C., Yang, H., Lu, X., Liao, J., Qian, J., Wang, K., Cheng, J., Yang, P., Chen, H., et al. (2022). Knowledge-graph-based cell-cell communication inference for spatially resolved transcriptomic data with spatalk. Nature Communications, 13(1):4429. Shao, X., Yang, H., Zhuang, X., Liao, J., Yang, P., Cheng, J., Lu, X., Chen, H., and Fan, X. (2021). scdeepsort: a pre-trained cell-type annotation method for single-cell transcriptomics using deep learning with a weighted graph neural network. Nucleic Acids Research, 49(21). Shen, H., Liu, J., Hu, J., Shen, X., Zhang, C., Wu, D., Feng, M., Yang, M., Li, Y., Yang, Y., et al. (2023). Generative pretraining from large-scale transcriptomes for single-cell deciphering. iScience. Shen, R., Olshen, A. B., and Ladanyi, M. (2009). Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics. Shengquan, C., Boheng, Z., Xiaoyang, C., Xuegong, Z., and Rui, J. (2021). stplus: a reference-based method for the accurate enhancement of spatial transcriptomics. Bioinformatics, 37(Supplement_1):i299–i307. Song, Q. and Su, J. (2021). Dstg: deconvoluting spatial transcriptomics data through graph-based artificial intelligence. Brief Bioinform, 22(5):1–13. Song, Q., Su, J., and Zhang, W. (2021). scgcn is a graph convolutional networks algorithm for knowledge transfer in single cell omics. Nature Communications, 12(1):1–11. Stein-O’Brien, G. L., Arora, R., Culhane, A. C., Favorov, A. V., Garmire, L. X., Greene, C. S., Goff, L. A., Li, Y., Ngom, A., Ochs, M. F., et al. (2018). Enter the matrix: factorization uncovers knowledge from omics. Trends in Genetics, 34(10):790–805. Stelzer, G., Rosen, N., Plaschkes, I., Zimmerman, S., Twik, M., Fishilevich, S., Stein, T. I., Nudel, R., Lieder, I., Mazor, Y., et al. (2016). The genecards suite: from gene data mining to disease genome sequence analyses. Current protocols in bioinformatics, 54(1):1–30. Stoeckius, M., Hafemeister, C., Stephenson, W., Houck-Loomis, B., Chattopadhyay, P. K., Swerdlow, H., Satija, R., and Smibert, P. (2017). Simultaneous epitope and transcriptome measurement in 97 single cells. Nature Methods, 14(9):865–868. Street, K., Risso, D., Fletcher, R. B., Das, D., Ngai, J., Yosef, N., Purdom, E., and Dudoit, S. (2018). Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC genomics, 19:1–16. Stringer, C., Wang, T., Michaelos, M., and Pachitariu, M. (2021). Cellpose: a generalist algorithm for cellular segmentation. Nature Methods, 18(1):100–106. Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., Mauck III, W. M., Hao, Y., Stoeckius, M., Smibert, P., and Satija, R. (2019). 
Comprehensive integration of single-cell data. Cell, 177(7):1888–1902. Ståhl, P. L., Salmén, F., Vickovic, S., Lundmark, A., Navarro, J. F., Magnusson, J., Giacomello, S., Asp, M., Westholm, J. O., Huss, M., Mollbrink, A., Linnarsson, S., Codeluppi, S., Åke Borg, Pontén, F., Costea, P. I., Sahlén, P., Mulder, J., Bergmann, O., Lundeberg, J., and Frisén, J. (2016). Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science, 353(6294):78–82. Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S., et al. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43):15545–15550. Svensson, V., Natarajan, K. N., Ly, L.-H., Miragaia, R. J., Labalette, C., Macaulay, I. C., Cvejic, A., and Teichmann, S. A. (2017). Power analysis of single-cell rna-sequencing experiments. Nature Methods, 14(4):381–387. Szklarczyk, D., Kirsch, R., Koutrouli, M., Nastou, K., Mehryary, F., Hachilif, R., Gable, A. L., Fang, T., Doncheva, N. T., Pyysalo, S., et al. (2023). The string database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Research, 51(D1):D638–D646. Tan, Y. and Cahan, P. (2019). Singlecellnet: a computational tool to classify single cell rna-seq data across platforms and across species. Cell systems, 9(2):207–213. Tang, W., Wen, H., Liu, R., Ding, J., Jin, W., Xie, Y., Liu, H., and Tang, J. (2023). Single- cell multimodal prediction via transformers. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 2422–2431. Theodoris, C. V., Xiao, L., Chopra, A., Chaffin, M. D., Al Sayed, Z. R., Hill, M. C., Mantineo, H., Brydon, E. M., Zeng, Z., Liu, X. S., et al. (2023). Transfer learning enables predictions in network biology. Nature, pages 1–9. Tran, H. T. N., Ang, K. S., Chevrier, M., Zhang, X., Lee, N. Y. S., Goh, M., and Chen, J. (2020). 98 A benchmark of batch-effect correction methods for single-cell rna sequencing data. Genome Biology, 21:1–32. van Dijk, D., Nainys, J., Sharma, R., Kaithail, P., Carr, A. J., Moon, K. R., Mazutis, L., Wolf, G., Krishnaswamy, S., and Pe’er, D. (2017). Magic: A diffusion-based imputation method reveals gene-gene interactions in single-cell rna-sequencing data. bioRxiv, page 111591. Van Dijk, D., Sharma, R., Nainys, J., Yim, K., Kathail, P., Carr, A. J., Burdziak, C., Moon, K. R., Chaffer, C. L., Pattabiraman, D., et al. (2018). Recovering gene interactions from single-cell data using data diffusion. Cell, 174(3):716–729. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30. Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. (2018). Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. Wang, J., Ma, A., Chang, Y., Gong, J., Jiang, Y., Qi, R., Wang, C., Fu, H., Ma, Q., and Xu, D. (2021). 
scgnn is a novel graph neural network framework for single-cell rna-seq analyses. Nature Communications, 12(1):1–11. Wang, Y., Song, B., Wang, S., Chen, M., Xie, Y., Xiao, G., Wang, L., and Wang, T. (2022). Sprod for de-noising spatially resolved transcriptomics data based on position and image information. Nature Methods, 19(8):950–958. Welch, J. D., Hartemink, A. J., and Prins, J. F. (2017). Matcher: manifold alignment reveals correspondence between single cell transcriptome and epigenome dynamics. Genome Biology, 18(1):1–19. Wen, H., Ding, J., Jin, W., Wang, Y., Xie, Y., and Tang, J. (2022a). Graph neural networks for multimodal single-cell data integration. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4153–4163. Wen, H., Jin, W., Ding, J., Xu, C., Xie, Y., and Tang, J. (2022b). Bi-channel masked graph autoencoders for spatially resolved single-cell transcriptomics data imputation. In NeurIPS 2022 AI for Science: Progress and Promises. Wen, H., Tang, W., Dai, X., Ding, J., Jin, W., Xie, Y., and Tang, J. (2024). CellPLM: Pre-training of cell language model beyond single cells. In The Twelfth International Conference on Learning 99 Representations. Wen, H., Tang, W., Jin, W., Ding, J., Liu, R., Shi, F., Xie, Y., and Tang, J. (2023). Single cells are spatial tokens: Transformers for spatial transcriptomic data imputation. arXiv preprint arXiv:2302.03038. Wenzek, G., Lachaux, M.-A., Conneau, A., Chaudhary, V., Guzmán, F., Joulin, A., and Grave, E. (2020). CCNet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4003–4012, Marseille, France. European Language Resources Association. Wu, C., Mark, A., and Su, A. I. (2014). Mygene. info: gene annotation query as a service. bioRxiv, page 009332. Wu, H., Wu, J., Xu, J., Wang, J., and Long, M. (2022). Flowformer: Linearizing transformers with conservation flows. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S., editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 24226–24242. PMLR. Wu, K., Peng, H., Chen, M., Fu, J., and Chao, H. (2021a). Rethinking and improving relative position encoding for vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10033–10041. Wu, K. E., Yost, K. E., Chang, H. Y., and Zou, J. (2021b). Babel enables cross-modality translation between multiomic profiles at single-cell resolution. Proceedings of the National Academy of Sciences, 118(15):e2023070118. Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Philip, S. Y. (2020). A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 32(1):4–24. Xu, C., Lopez, R., Mehlman, E., Regier, J., Jordan, M. I., and Yosef, N. (2021). Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Molecular systems biology, 17(1):e9620. Xu, H., Fu, H., Long, Y., Ang, K. S., Sethi, R., Chong, K., Li, M., Uddamvathanak, R., Lee, H. K., Ling, J., et al. (2024). Unsupervised spatially embedded deep representation of spatial transcriptomics. Genome Medicine, 16(1):12. Xu, J., Xu, J., Meng, Y., Lu, C., Cai, L., Zeng, X., Nussinov, R., and Cheng, F. (2023). Graph embedding and gaussian mixture variational autoencoder network for end-to-end analysis of single-cell rna sequencing data. 
Cell Reports Methods, page 100382.

Yang, F., Wang, W., Wang, F., Fang, Y., Tang, D., Huang, J., Lu, H., and Yao, J. (2022). scbert as a large-scale pretrained deep language model for cell type annotation of single-cell rna-seq data. Nature Machine Intelligence, 4(10):852–866.

Yang, K. D., Belyaeva, A., Venkatachalapathy, S., Damodaran, K., Katcoff, A., Radhakrishnan, A., Shivashankar, G., and Uhler, C. (2021). Multi-domain translation between single-cell imaging and sequencing data using autoencoders. Nature Communications, 12(1):1–10.

Yang, L., Cheung, N.-M., Li, J., and Fang, J. (2019). Deep clustering by gaussian mixture variational autoencoders with graph embedding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6440–6449.

Yao, Z., Liu, H., Xie, F., Fischer, S., Adkins, R. S., Aldridge, A. I., Ament, S. A., Bartlett, A., Behrens, M. M., Van den Berge, K., et al. (2021). A transcriptomic and epigenomic cell atlas of the mouse primary motor cortex. Nature, 598(7879):103–110.

Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., and Liu, T.-Y. (2021). Do transformers really perform badly for graph representation? Advances in Neural Information Processing Systems, 34:28877–28888.

Zhu, C., Preissl, S., and Ren, B. (2020). Single-cell multimodal omics: the power of many. Nature Methods, 17(1):11–14.

Zhu, C., Yu, M., Huang, H., Juric, I., Abnousi, A., Hu, R., Lucero, J., Behrens, M. M., Hu, M., and Ren, B. (2019). An ultra high-throughput method for single-cell joint analysis of open chromatin and transcriptome. Nature Structural & Molecular Biology, 26(11):1063–1070.

Zuo, C., Dai, H., and Chen, L. (2021). Deep cross-omics cycle attention model for joint analysis of single-cell multi-omics data. Bioinformatics, 37(22):4091–4099.

APPENDIX A GRAPH NEURAL NETWORKS FOR MULTI-OMICS REPRESENTATION LEARNING

A.1 Details of Metrics in Task 3

A.1.1 Biology Conservation Metrics

NMI cluster/label. Normalized mutual information (NMI) compares the overlap of two clusterings. We use NMI to compare the cell type labels with an automated clustering computed on the integrated dataset (based on Louvain clustering). NMI scores of 0 or 1 correspond to uncorrelated clustering or a perfect match, respectively. Automated Louvain clustering is performed at resolutions ranging from 0.1 to 2 in steps of 0.1, and the clustering output with the highest NMI against the label set is used.

Cell type ASW. The silhouette width metric indicates the degree to which observations with identical labels are compact. The average silhouette width (ASW) (Batool and Hennig, 2021), which ranges between -1 and 1, is calculated by averaging the silhouette widths of all cells in a set. We employ ASW to determine the compactness of cell types in the resulting embedding. The ASW is calculated using the cell identity labels and scaled to a value between 0 and 1 using the equation:

ASW = (ASW_C + 1) / 2    (A.1)

where C denotes the set of all cell identity labels.
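A minimal sketch of the scaled cell-type ASW in Eq. (A.1), using scikit-learn (the function name is our own):

```python
from sklearn.metrics import silhouette_score

def cell_type_asw(embedding, cell_labels):
    """Average silhouette width on cell-type labels, rescaled from
    [-1, 1] to [0, 1] as in Eq. (A.1)."""
    asw = silhouette_score(embedding, cell_labels)  # mean over all cells
    return (asw + 1.0) / 2.0
```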
Cell cycle conservation. The cell cycle conservation score serves as a proxy for the preservation of the signal associated with gene programs during data integration. It determines the amount of variance explained by the cell cycle per batch prior to and following integration. The differences between the variance before, Var_before, and the variance after, Var_after, are aggregated into a final score between 0 and 1, using the equation:

CC_conservation = 1 − |Var_after − Var_before| / Var_before    (A.2)

where values near 0 suggest less conservation of the variance explained by the cell cycle, while 1 represents complete conservation.

Trajectory conservation. The trajectory conservation score is a proxy for the conservation of a continuous biological signal within a joint embedding. Using this metric, we compare trajectories computed after integration, for relevant cell types that depict a continuous cellular differentiation process, to trajectories computed per batch and modality. The conservation of the trajectory is quantified via Spearman's rank correlation coefficient, s, between the pseudotime values before and after integration. The final score is scaled to a value between 0 and 1 using the equation:

trajectory_conservation = (s + 1) / 2    (A.3)

where a value of 1 or 0 indicates that the cells on the trajectory are in the same order before and after integration, or in the reverse order, respectively.

A.1.2 Batch Removal Metrics

Batch ASW. The ASW is used to quantify batch mixing by taking into account the incompatibility of batch labels per cell type cluster. In particular, we consider the absolute silhouette width, on batch labels per cell. Here, zero shows that batches are thoroughly mixed, but any deviation from zero indicates the presence of a batch effect. We rescale this score so that higher values imply better batch mixing, and compute the score for each cell type label j using the equation:

batch ASW_j = (1 / |C_j|) Σ_{i ∈ C_j} (1 − |s(i)|)    (A.4)

where C_j is the set of cells with the cell label j and |C_j| denotes the number of cells in that set. To obtain the final batch ASW score, the label-specific batch ASW_j scores are averaged:

batch ASW = (1 / |M|) Σ_{j ∈ M} batch ASW_j    (A.5)

where M is the set of unique cell labels. A batch ASW value of 1 suggests optimal batch mixing, whereas a value of 0 indicates severely separated batches.

Graph connectivity. The graph connectivity metric determines whether cells of the same kind from various batches are embedded close to one another. This is determined by computing a k-nearest neighbor (kNN) graph using Euclidean distances on the embedding. Then, we determine whether all cells with the same cell identity label are connected in this kNN graph. For each cell identity label c, we generate the subset kNN graph G(N_c; E_c), which contains only cells from a given label. Using these subset kNN graphs, we compute the graph connectivity score:

gc = (1 / |C|) Σ_{c ∈ C} |LCC(G(N_c; E_c))| / |N_c|    (A.6)

where C represents the set of cell identity labels, LCC(·) denotes the number of nodes in the largest connected component of the graph, and N_c is the number of nodes with cell identity c. The resulting score ranges from 0 to 1, where 1 means that all cells with the same cell identity are connected in the integrated kNN graph, while 0 indicates that no cell is connected in the graph.
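A sketch of the graph connectivity score in Eq. (A.6) is given below, building the kNN graph with scikit-learn and extracting connected components with SciPy; the value of k and the function name are our own choices.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components

def graph_connectivity(embedding, cell_labels, k=15):
    """Eq. (A.6): per cell-type label, the fraction of its cells inside the
    largest connected component of that label's kNN subgraph, averaged."""
    knn = kneighbors_graph(embedding, n_neighbors=k, include_self=False)
    scores = []
    for label in np.unique(cell_labels):
        idx = np.where(cell_labels == label)[0]
        sub = knn[idx][:, idx]                      # subgraph for this label
        _, comp = connected_components(sub, directed=False)
        scores.append(np.bincount(comp).max() / idx.size)
    return float(np.mean(scores))
```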
A.1.3 Metric Aggregation

Due to the differing nature of each metric, each metric is assigned a weight. An overall weighted average of the bio-conservation and batch correction scores is computed via the equation:

S_overall,i = 0.6 · S_bio,i + 0.4 · S_batch,i    (A.7)

A.2 Data Statistics

Table A.1 and Table A.2 provide statistics about the datasets used in the modality prediction task and the modality matching task, respectively.

Table A.1 Dataset statistics of the modality prediction task: the number of feature dimensions, train/test samples, and batches.

              | GEX-ADT | ADT-GEX | GEX-ATAC | ATAC-GEX
Source Dim    | 13,953  | 134     | 13,431   | 116,490
Target Dim    | 134     | 13,953  | 10,000   | 13,431
Train Cells   | 66,175  | 66,175  | 42,492   | 42,492
Test Cells    | 1,000   | 1,000   | 1,000    | 1,000
Train Batches | 9       | 9       | 10       | 10
Test Batches  | 3       | 3       | 3        | 3

Table A.2 Dataset statistics of the modality matching task: the number of feature dimensions, train/test samples, and batches.

              | GEX-ADT | ADT-GEX | GEX-ATAC | ATAC-GEX
Source Dim    | 13,953  | 134     | 13,431   | 116,490
Target Dim    | 134     | 13,953  | 116,490  | 13,431
Train Cells   | 66,175  | 66,175  | 42,492   | 42,492
Test Cells    | 15,066  | 15,066  | 20,009   | 20,009
Train Batches | 10      | 10      | 10       | 10
Test Batches  | 3       | 3       | 3        | 3

A.3 Reproducibility

A.3.1 Source Codes

All the source code of the winning solutions can be found at the OpenProblems official GitHub. These codes have been officially verified, so reproducibility is ensured. For the new models we developed after the competitions, all source code has been integrated into the DANCE package (Ding et al., 2024).

APPENDIX B TRANSFORMERS FOR SINGLE-CELL SPATIAL OMICS REPRESENTATION LEARNING

B.1 Reproducibility Details

B.1.1 Data Availability

The two datasets we used (Lung and Liver) are publicly available on the Nanostring official website: https://nanostring.com/products/cosmx-spatial-molecular-imager/nsclc-ffpe-dataset.

B.1.2 Implementation Settings

B.1.2.1 Baselines

To evaluate the effectiveness of SpaFormer, we compare it with the following state-of-the-art spatial and non-spatial transcriptomic imputation models: (1) scImpute (Li and Li, 2018) employs a probabilistic model to detect dropouts and implements imputation through non-negative least squares regression. (2) SAVER (Huang et al., 2018) uses a negative binomial distribution to model the data and estimates a Gamma prior through Poisson Lasso regression. The posterior mean is used to output expression, with uncertainty quantification from the posterior distribution. (3) scVI (Lopez et al., 2018) models dropouts with a ZINB distribution and estimates the distributional parameters of each gene in each cell with a VAE model. (4) DCA (Eraslan et al., 2019) is an autoencoder that predicts the parameters of chosen distributions like ZINB to generate the imputed data. (5) GraphSCI (Rao et al., 2021) employs a graph autoencoder on a gene correlation graph. Meanwhile, it uses another autoencoder to reconstruct the gene expressions, taking the graph autoencoder embeddings as additional input. (6) scGNN (Wang et al., 2021) first builds a cell-cell graph based on gene expression similarity and then utilizes a graph autoencoder together with a standard autoencoder to refine graph structures and cell representations. Lastly, an imputation autoencoder is trained with a graph Laplacian smoothing term added to the reconstruction loss. (7) gimVI (Lopez et al., 2019) is a deep generative model for integrating spatial transcriptomics data and scRNA-seq data, which can be used to impute spatial transcriptomic data. gimVI is based on scVI (Lopez et al., 2018) and employs alternative conditional distributions to address technology-specific covariate shift more effectively. (8) Sprod (Wang et al., 2022) is the latest published imputation method for spatial transcriptomics data. It first projects gene expressions to a latent space, then connects neighboring cells in the latent space and prunes the graph with physical distance. A denoised matrix is then learned by jointly minimizing the reconstruction error and a graph Laplacian smoothing term. (9) SpaGAT is a baseline model created by ourselves, using the same bi-level masking autoencoder framework as SpaFormer but based on a graph neural network encoder with spatial graphs; specifically, we implement a graph attention network (Velickovic et al., 2018) as the encoder. Since the graph attention network is a localized version of the transformer, SpaGAT can be considered an ablation study for our SpaFormer model.
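Because SpaGAT is only summarized above, the sketch below shows the two ingredients it refers to: a spatial kNN graph over cell coordinates and a graph attention encoder, here written with PyTorch Geometric. All names and sizes are illustrative (the hidden/latent sizes mirror the defaults reported in B.1.2.2 below), and the bi-level masking and decoder of the full framework are omitted.

```python
import numpy as np
import torch
from sklearn.neighbors import kneighbors_graph
from torch_geometric.nn import GATConv

def spatial_edge_index(coords, k=10):
    """Connect each cell to its k nearest spatial neighbors."""
    adj = kneighbors_graph(coords, n_neighbors=k, include_self=False).tocoo()
    return torch.from_numpy(np.vstack([adj.row, adj.col])).long()

class GATEncoder(torch.nn.Module):
    """Two-layer graph attention encoder over the spatial cell graph."""
    def __init__(self, in_dim, hidden=128, latent=20, heads=8):
        super().__init__()
        self.conv1 = GATConv(in_dim, hidden, heads=heads, concat=False)
        self.conv2 = GATConv(hidden, latent, heads=heads, concat=False)

    def forward(self, x, edge_index):
        h = torch.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)
```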
A denoised matrix is then learned by jointly minimizing the reconstruction error and a graph Laplacian smoothing term. (9) SpaGAT is a baseline model that we created ourselves. It uses the same bi-level masking autoencoder framework as SpaFormer, but with a graph neural network encoder operating on spatial graphs. Specifically, we implement a graph attention network (Velickovic et al., 2018) as the encoder. Since the graph attention network is a localized version of the transformer, SpaGAT can be considered an ablation of our SpaFormer model.

B.1.2.2 Hyperparameter Settings

For our own SpaGAT and SpaFormer models, we first normalize the total RNA counts of each cell and then apply a log1p transform. We conducted an extensive hyperparameter search on the Lung dataset. However, we noticed that performance is not sensitive to most hyperparameters, except for the masking rate, autoencoder type, and positional encodings, as presented in our ablation studies. To reproduce our results, the recommended hyperparameters are n_layer=2, num_heads=8, num_hidden=128, latent_dim=20, learning_rate=1e-3, weight_decay=5e-4. We used these default hyperparameters on the other two datasets. The source code will be released on our GitHub. For our own baseline model SpaGAT, we used the same set of hyperparameters while replacing the transformer encoder with a graph attention network; a sketch of this configuration is given below.
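As a compact reference, here is a minimal sketch of the preprocessing and recommended configuration described above, assuming scanpy/anndata conventions. The dictionary keys mirror the hyperparameter names in the text, while `build_spaformer` is a hypothetical constructor standing in for the actual model class.

```python
# A minimal sketch of the preprocessing and recommended configuration described
# above; assumes scanpy/anndata conventions. build_spaformer is hypothetical.
import scanpy as sc

def preprocess(adata):
    sc.pp.normalize_total(adata)   # normalize total RNA counts per cell
    sc.pp.log1p(adata)             # log1p transform
    return adata

SPAFORMER_CONFIG = {
    "n_layer": 2,
    "num_heads": 8,
    "num_hidden": 128,
    "latent_dim": 20,
    "learning_rate": 1e-3,
    "weight_decay": 5e-4,
}
# SpaGAT reuses the same configuration with a graph attention network encoder,
# e.g., model = build_spaformer(**SPAFORMER_CONFIG, encoder="gat")
```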
For the baseline models, all implementations are from the authors' repositories/software, optimizers/trainers are those provided by the original implementations, and preprocessing is consistent with the original methods. The detailed settings are as follows:

scImpute involves only two parameters: K, the assumed number of cell populations, and a threshold t on dropout probabilities. We set t=0.5 and K=15, per the authors' instructions in their paper, i.e., a default threshold of 0.5 is sufficient for most scRNA-seq data, and K should be chosen based on the clustering result of the raw data and the resolution level desired by the user, where K=15 is close to the ground-truth number of cell types.

SAVER is a statistical model and does not expose any hyperparameters in its implementation, so we ran its default settings.

For scVI, we searched over combinations of n_hidden=[128, 256], n_layer=[1, 2], and gene_likelihood=[nb, zinb] on the Lung5 dataset. Each setting was repeated five times, and the best mean performance was achieved by n_hidden=128, n_layer=1, and gene_likelihood='nb'. Other parameters were left at their defaults. We then applied this set of hyperparameters to all three datasets and reported it in the main results.

For DCA, the original hyperparameter optimization mode is broken (there is a relevant unresolved issue on GitHub), and there are no other parameter instructions in the tutorial, so we used the default parameters.

For GraphSCI, we tried hidden1=[16, 32, 64] and hidden2=[32, 64]. Note that GraphSCI has no other hyperparameters and its default hidden sizes are quite small. The performance reported in our paper was obtained with hidden1=32 and hidden2=64.

The gimVI software does not expose hyperparameters such as n_hidden and n_layer, so we followed the default settings. Additionally, gimVI requires external scRNA-seq reference data. For the Lung5 dataset, we used data from GSE131907 as a reference; for Kidney, we used GSE159115; and for Liver, we used GSE115469.

scGNN provides two sets of preprocessing in its GitHub repository, and we adopted the first one, "Cell/Gene filtering without inferring LTMG," for the CSV format. We then followed the corresponding instructions to run its imputation method, but encountered an out-of-memory error during the EM algorithm.

Sprod provides a batch mode for handling very large spatial datasets. We followed its instructions for datasets without a matching image and ran the batch mode with num_of_batches=30; however, it could not finish running within 48 hours, even on the smallest Kidney dataset.

For SpaGE, we followed the official tutorial. For SeDR, we followed the official tutorial for imputation and batch integration, where we treat each FOV as a batch. SeDR does not provide any information on hyperparameters in its tutorial; therefore, we ran the default method with 5 random seeds. To adapt SeDR's output to our evaluation protocol, we removed the inherent standardization function from SeDR's `.recon` function.

For Tangram, we found that directly inputting all the data would lead to an OOM issue. Therefore, we followed the official instructions to separate the data into 5 splits and imputed them split by split.

B.2 Standard Deviation

For simplicity, we omitted the standard deviations in Table 4.2. The detailed experiment statistics are shown below in Table B.1.

Table B.1 The standard deviation of the imputation experiments (mean ± std).

                            scVI            DCA             GraphSCI        gimVI           SpaFormer
Lung 5         RMSE     0.2861±0.0037   0.2858±0.0002   0.3957±0.0006   0.3170±0.0018   0.2785±0.0005
Lung 5         Pearson  0.6231±0.0136   0.6223±0.0008   0.1334±0.0016   0.5320±0.0063   0.6363±0.0019
Lung 5         Cosine   0.6661±0.0112   0.6648±0.0007   0.3081±0.0007   0.5917±0.0052   0.6786±0.0016
Kidney 1139    RMSE     0.2901±0.0040   0.2852±0.0005   0.3624±0.0009   0.4387±0.0015   0.2794±0.0042
Kidney 1139    Pearson  0.5834±0.0152   0.5985±0.0022   0.2403±0.0048   −0.0104±0.0004  0.6108±0.0152
Kidney 1139    Cosine   0.6480±0.0125   0.6597±0.0017   0.4128±0.0029   0.1967±0.0027   0.671±0.0121
Liver Normal   RMSE     0.2797±0.0159   0.2542±0.0129   0.3347±0.0009   0.4542±0.0024   0.2117±0.0014
Liver Normal   Pearson  0.5749±0.0641   0.657±0.0461    0.3707±0.0065   −0.0015±0.0010  0.7976±0.0034
Liver Normal   Cosine   0.6224±0.0582   0.688±0.0401    0.4430±0.0065   0.1167±0.0050   0.7797±0.0032

APPENDIX C

BUILDING SINGLE-CELL FOUNDATION MODEL BEYOND SINGLE CELLS

C.1 Spatially-Resolved Transcriptomic Data

Recently, spatial transcriptomic technologies have been developed to spatially resolve transcriptomic profiles (Ståhl et al., 2016; Merritt et al., 2020). With spatial transcriptomics data, researchers can learn the spatial context of cells and cell clusters within a tissue (Burgess, 2019). The major technologies/platforms for spatial transcriptomics are Visium by 10x (Ståhl et al., 2016), the GeoMx Digital Spatial Profiler (DSP) (Merritt et al., 2020) and the CosMx Spatial Molecular Imager (SMI) by NanoString, as well as MERFISH, Vizgen, Resolve, Rebus, and Molecular Cartography. 10x Visium does not profile at single-cell resolution, and while GeoMx DSP is capable of single-cell resolution through user-drawn profiling regions, its scalability is limited. The most recent platform, CosMx SMI (He et al., 2022b), can profile consistently at single-cell and even sub-cellular resolution. CosMx SMI follows much of the initial protocol of GeoMx DSP, with barcoding and ISH hybridization. However, the SMI instrument performs 16 cycles of automated cyclic readout, and in each cycle, the set of barcodes (readouts) is UV-cleaved and removed.
These cycles of hybridization and imaging yield spatially resolved profiling of RNA and protein at single-cell (∼10 μm) and subcellular (∼1 μm) resolution. In this work, we use two published datasets and one unpublished dataset produced by the CosMx platform. To obtain cellular-level gene expression, the CellPose (Stringer et al., 2021) software is applied to conduct cell segmentation. To give a concrete example, we provide a sample field of view (FOV) in Fig. 4.1. Pre-selected types of RNA molecules are captured by the molecular imager and denoted as white dots in the figures. Colors in the first sub-figure indicate the protein molecules that are stained. These proteins contribute to the cell segmentation process, which results in the second sub-figure. The final output of the pipeline consists of the position of each cell and a cell-by-gene count matrix, produced by counting the number of RNA molecules within each cell. The difference between scRNA-seq and SRT data is further illustrated in Fig. C.1.

Figure C.1 An illustration of the difference between scRNA-seq and SRT data.

C.2 2D Sinusoid Positional Encodings

Since 2D sinusoidal PE achieves competitive performance and has lower complexity on SRT data (Wen et al., 2023), in our transformer encoder, we generate a sinusoidal PE for cells in SRT data, formulated as:

$$PE_{(x,y,2i)} = \sin\left(x / 10000^{4i/d}\right), \quad PE_{(x,y,2i+1)} = \cos\left(x / 10000^{4i/d}\right),$$
$$PE_{(x,y,2j+d/2)} = \sin\left(y / 10000^{4j/d}\right), \quad PE_{(x,y,2j+1+d/2)} = \cos\left(y / 10000^{4j/d}\right), \qquad (C.1)$$

where $d$ is the total dimension of the positional encoding, and $i, j \in [0, d/4)$ specify a feature dimension. Let $\tilde{C} \in \mathbb{R}^{N \times 2}$ be a normalized coordinate matrix, where we normalize and truncate the coordinates in $C$ to integers ranging in $[0, 100)$. $x, y$ then refer to the spatial coordinates from $\tilde{C}$, e.g., $x = \tilde{C}_{t,0}$ and $y = \tilde{C}_{t,1}$ for cell $t$. In this way, we generate a PE matrix $P \in \mathbb{R}^{N \times d}$ for every cell in SRT data, where $P_i$ is the PE vector for cell $i$. Meanwhile, for scRNA-seq data, a randomly initialized $d$-dimensional vector $p'$ is shared among all cells, which also results in a placeholder PE matrix $P$.
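For illustration, below is a minimal numpy sketch of Equation (C.1), assuming the coordinates have already been normalized and truncated to integers in [0, 100); the function name and array layout are our own, not taken from the CellPLM codebase.

```python
# A minimal numpy sketch of the 2D sinusoidal positional encoding in Eq. (C.1).
import numpy as np

def sinusoidal_pe_2d(coords: np.ndarray, d: int) -> np.ndarray:
    """coords: (N, 2) integer array of (x, y); d: encoding dim, divisible by 4."""
    x, y = coords[:, 0:1], coords[:, 1:2]            # (N, 1) each
    idx = np.arange(d // 4)                          # i, j in [0, d/4)
    freq = 1.0 / (10000 ** (4 * idx / d))            # 1 / 10000^{4i/d}
    pe = np.empty((coords.shape[0], d))
    pe[:, 0:d // 2:2] = np.sin(x * freq)             # PE(x,y,2i)
    pe[:, 1:d // 2:2] = np.cos(x * freq)             # PE(x,y,2i+1)
    pe[:, d // 2::2] = np.sin(y * freq)              # PE(x,y,2j+d/2)
    pe[:, d // 2 + 1::2] = np.cos(y * freq)          # PE(x,y,2j+1+d/2)
    return pe
```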
C.3 Denoising Variational Lower Bound for Masked Language Modeling

One of the highlights of CellPLM is the design of a probabilistic latent space. Prior studies have employed variational autoencoders for single-cell analysis, typically assuming an isotropic Gaussian distribution as the prior of the latent space (Lopez et al., 2018; Xu et al., 2021). While this approach can effectively remove batch effects, it may also result in a loss of information regarding the underlying biological structure of cell groups. To address this limitation, CellPLM incorporates the concept of the Gaussian mixture variational encoder (Dilokthanakul et al., 2016; Yang et al., 2019; Xu et al., 2023), which utilizes a mixture of Gaussians to capture the information of distinct functional groups of cells. Formally, for $i \in \{1, \ldots, N\}$, the generative model of cell $i$ can be formulated as:

$$p(\mathbf{y}_i; \boldsymbol{\pi}) = \text{Multinomial}(\boldsymbol{\pi}),$$
$$p(\mathbf{z}_i \mid \mathbf{y}_i) = \prod_{l=1}^{L} \mathcal{N}\left(\boldsymbol{\mu}_l, \operatorname{diag}\left(\boldsymbol{\sigma}^2_l\right)\right)^{y_{i,l}}, \qquad (C.2)$$
$$p_{\theta_{dec}}(\mathbf{x}_i \mid \mathbf{z}_i) = \mathcal{N}\left(\boldsymbol{\mu}_{z_i}, \sigma^2 \mathbf{I}\right),$$

where $\mathbf{y}_i \in \mathbb{R}^L$ represents the one-hot latent cluster variable and $\boldsymbol{\pi}$ is its prior; $y_{i,l}$ denotes the $l$-th entry of $\mathbf{y}_i$; $\boldsymbol{\mu}_l \in \mathbb{R}^{d_z}$ and $\boldsymbol{\sigma}^2_l \in \mathbb{R}^{d_z \times d_z}$ denote the mean and variance of the $l$-th Gaussian component, respectively; and $\boldsymbol{\mu}_{z_i} \in \mathbb{R}^k$ and $\sigma^2 \mathbf{I} \in \mathbb{R}^{k \times k}$ denote the posterior mean and variance of the expression $\mathbf{x}_i$, respectively. In this work, we assume that $\sigma^2$ is a constant, and the posterior mean is parameterized by $\boldsymbol{\mu}_{z_i} = f_{dec}(\mathbf{z}_i; \theta_{dec})$. To estimate the posteriors of $\mathbf{z}_i$ and $\mathbf{y}_i$, we parameterize the inference process with neural networks. Specifically, we assume that the cluster variables $\mathbf{y}_i$ are independent of the expression $\mathbf{x}_i$ conditioned on the latent variables $\mathbf{z}_i$. The inference model can be formulated as:

$$q_{\eta_\mu, \eta_\sigma}(\mathbf{z}_i \mid \mathbf{x}_i) = \mathcal{N}\left(\hat{\boldsymbol{\mu}}_i, \operatorname{diag}\left(\hat{\boldsymbol{\sigma}}^2_i\right)\right), \quad q_{\eta_\pi}(\mathbf{y}_i \mid \mathbf{z}_i) = \text{Multinomial}(\hat{\boldsymbol{\pi}}_i), \qquad (C.3)$$

where the estimates are given by:

$$\mathbf{h}_i = f_{enc}(\mathbf{x}_i; \eta_{enc}), \quad \hat{\boldsymbol{\mu}}_i = f_\mu(\mathbf{h}_i; \eta_\mu), \quad \log\left(\hat{\boldsymbol{\sigma}}^2_i\right) = f_\sigma(\mathbf{h}_i; \eta_\sigma), \quad \hat{\boldsymbol{\pi}}_i = f_\pi(\mathbf{z}_i; \eta_\pi). \qquad (C.4)$$

Here $f_{enc}(\cdot; \eta_{enc})$ represents the transformer encoder, and $f_\mu(\cdot; \eta_\mu)$, $f_\sigma(\cdot; \eta_\sigma)$ and $f_\pi(\cdot; \eta_\pi)$ are neural networks. A log-evidence lower bound (ELBO) can be derived from this generative model for optimization purposes (Dilokthanakul et al., 2016). However, as mentioned in Section 5.2.1, our pre-training framework incorporates a cell language model, where parts of the input gene expression matrix $X$ are masked. This results in a modified objective. To formalize the problem, recall that we previously defined the masked set as $\mathcal{M}$. On top of that, we denote $\mathbf{M} \in \mathbb{R}^{N \times k}$ as a mask indicator matrix such that

$$\mathbf{M}_{i,j} = \begin{cases} 1 & \text{if } (i,j) \notin \mathcal{M}, \\ 0 & \text{if } (i,j) \in \mathcal{M}. \end{cases}$$

Let $\tilde{X} \in \mathbb{R}^{N \times k}$ be the masked gene expression matrix given by the element-wise multiplication $\tilde{X} = \mathbf{M} \odot X$.
The objective of the cell language model with a Gaussian mixture prior, i.e., a denoising variational lower bound (Im Im et al., 2017), can be formulated as:

$$\mathcal{L}_{CellLM} = \mathbb{E}_{q(Z, Y \mid \tilde{X})} \, \mathbb{E}_{p(\tilde{X} \mid X)} \left[ \ln \frac{p_\theta(X, Z, Y)}{q_\eta(Z, Y \mid \tilde{X})} \right]$$
$$= \underbrace{\mathbb{E}_{q_{\eta_{enc}}(Z \mid \tilde{X})} \, \mathbb{E}_{p(\tilde{X} \mid X)} \left[ \log p_{\theta_{dec}}(X \mid Z) \right]}_{\mathcal{L}_{recon}} - \underbrace{\mathbb{E}_{q_{\eta_\pi}(Y \mid Z)} \left[ \operatorname{KL}\left( q_{\eta_{enc}}(Z \mid \tilde{X}) \,\|\, p(Z \mid Y) \right) \right]}_{\mathcal{L}_{cond}} - \underbrace{\mathbb{E}_{q_{\eta_{enc}}(Z \mid \tilde{X})} \left[ \operatorname{KL}\left( q_{\eta_\pi}(Y \mid Z) \,\|\, p(Y) \right) \right]}_{\mathcal{L}_Y}. \qquad (C.5)$$
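To make the structure of this objective concrete, the following is a schematic PyTorch sketch of the three terms of Equation (C.5) under simplifying assumptions: a Gaussian decoder with constant variance (so the reconstruction term reduces to MSE), a uniform prior over the L components, and a single Monte Carlo sample of Z. All module and variable names are illustrative; this is not the actual CellPLM implementation.

```python
# A schematic sketch of the three terms in Eq. (C.5); assumes a Gaussian
# encoder, a deterministic decoder, and a uniform prior p(Y).
import torch
import torch.nn.functional as F

def cell_lm_loss(x, x_masked, encoder, decoder, pi_head, mu_prior, logvar_prior):
    """x: (N, k) expressions; x_masked: M * x; mu_prior/logvar_prior: (L, d_z)."""
    # q(z | x~): Gaussian posterior inferred from the masked input
    mu, logvar = encoder(x_masked)                        # each (N, d_z)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization

    # L_recon: log-likelihood of the unmasked expressions (Gaussian -> MSE)
    l_recon = F.mse_loss(decoder(z), x)

    # q(y | z): posterior over the L mixture components
    log_q_y = F.log_softmax(pi_head(z), dim=-1)           # (N, L)
    q_y = log_q_y.exp()

    # L_cond: E_{q(y|z)} KL( q(z|x~) || N(mu_l, sigma_l^2) ), closed form
    kl_z = 0.5 * (
        logvar_prior - logvar.unsqueeze(1)
        + ((mu.unsqueeze(1) - mu_prior) ** 2 + logvar.exp().unsqueeze(1))
        / logvar_prior.exp()
        - 1.0
    ).sum(-1)                                             # (N, L)
    l_cond = (q_y * kl_z).sum(-1).mean()

    # L_Y: KL( q(y|z) || uniform(1/L) ) = sum_l q log q + log L
    num_components = q_y.size(-1)
    l_y = (q_y * log_q_y).sum(-1).mean() + torch.log(
        torch.tensor(float(num_components)))

    # minimize the negative lower bound: recon error plus the two KL penalties
    return l_recon + l_cond + l_y
```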
C.4 Pre-training Settings

C.4.1 Hyperparameter Settings

We pre-trained the CellPLM model with the hyperparameters specified in Table C.1.

Table C.1 Hyperparameters for pre-training the CellPLM model.

                             CellPLM
encoder hidden dim           1024
encoder layers               4
latent dimension             512
decoder hidden dim           1024
decoder layers               2
model dropout                0.2
cell mask rate               0.75
gene mask rate               0.25
learning rate                2e-4
weight decay                 1e-8
num of clusters (for GMM)    16
total parameters             82,402,543

C.4.2 Datasets for Pre-training

The dataset for pre-training contains 11.4 million cells from scRNA-seq and SRT data. The scRNA-seq data consist of 4.7 million cells from the Human Tumor Cell Atlas (HTCA, https://humantumoratlas.org), 1.4 million cells from the Human Cell Atlas (HCA, https://www.humancellatlas.org), and 2.6 million cells from the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo). All of them are publicly available data, elucidated in Table C.2. A more detailed list and external links will be disclosed in our GitHub repository. Note that although CellPLM is capable of handling various input feature sets, when we concatenated these scRNA-seq datasets, we used the inner join performed by default by the Anndata package. As a result, all scRNA-seq datasets only cover a common set of 13,500 genes. We will address this issue and increase the size of the gene set in future versions of CellPLM. The SRT datasets we used are publicly available on the NanoString official website (nanostring.com), where 2.7 million cells and 1,000 genes are measured. Both scRNA-seq and SRT data are preprocessed with library-size normalization and log1p transformation, following the convention in Stuart et al. (2019).

Table C.2 List of datasets and data sources. External links will be included in our GitHub repository.

Source                       Datasets
HCA                          cxg_PBMCs, EGAS00001004571_PBMCs, eQTLAutoimmune, covid19autoimmunityPBMCs, VanDerWijst-Human-10x5pv1, cxg_Airways, COMBAT2022, TabulaSapiens, PAN.A01.v01.raw_count.20210429.PFI.embedding, GTEx_8_tissues_snRNAseq_atlas_071421.public_obs
HTCA                         HTAN_HTAPP, HTAN_Stanford, HTAN_Vanderbilt, HTAN_BU
GEO                          GSE139324, GSE136246, GSE179994, GSE131907, GSE171145, GSE139555, GSE156728_CD4, GSE148071, PMID_34663877, Qian_et_al_2020_LC, GSE176021, GSE156728_CD8
Other Atlas (deduplicated)   MalteEtAl_LungAtlas, TICAtlas

C.5 Additional Experimental Details

In this section, we provide more experimental details about fine-tuning, baselines, and evaluation metrics for each downstream task.

C.5.1 scRNA-seq Denoising

Downstream Task Datasets. In the scRNA-seq denoising task, we evaluate CellPLM on two datasets, i.e., PBMC 5K and Jurkat from 10x Genomics (datasets available at https://support.10xgenomics.com/single-cell-gene-expression/datasets). It is worth noting that during the preprocessing stage, we subset both datasets to ensure that all genes were included in the gene set of the pre-training data. Additionally, any genes with zero counts were removed from the analysis. The statistics are listed in Table C.3.

Table C.3 scRNA-seq denoising datasets.

                    5K PBMC   Jurkat
Number of genes      33,538   32,738
Number of cells       5,247    3,258
Num genes picked      7,197    7,618

Evaluation Metrics. Following the settings of scGNN (Wang et al., 2021), scGNN2.0 (Gu et al., 2022) and DeepImpute (Arisdakessian et al., 2019), we performed synthetic dropout simulation under the missing-at-random (MAR) setting. While scGNN only considered a simple scenario, i.e., randomly flipping 10% of the non-zero entries to zeros, DeepImpute applied a cell-wise mask with masking probabilities given by a multinomial distribution. Specifically, we adapted the setting from DeepImpute with an exponential kernel. For each cell $i$ that contains at least 5 expressed genes, the probability that a non-zero count $x_{i,j}$ is masked during the training process is given by $\mathrm{Exp}(0, 20)$:

$$p_{i,j} = \frac{1}{20} e^{-\frac{x_{i,j}}{20}}, \quad q_{i,j} = \frac{p_{i,j}}{\sum_{j=0}^{J_i} p_{i,j}},$$

where $J_i$ is the number of non-zero counts within cell $i$. We masked 10% of the non-zero counts according to $\{q_{i,j}\}_{j=0}^{J_i}$ and evaluated model performance on the masked entries; a sketch of this masking scheme is given below. We calculate the root mean squared error (RMSE) and mean absolute error (MAE) between the predicted values and the ground truth.
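Here is a minimal numpy sketch of this masking scheme for a single cell, assuming a dense count vector; variable names are illustrative, and the exact sampling details of DeepImpute's implementation may differ.

```python
# A minimal sketch of the exponential-kernel MAR masking described above.
import numpy as np

def mask_cell(x, rng, frac=0.10):
    """Mask ~frac of the non-zero counts of one cell, favoring small counts."""
    nz = np.flatnonzero(x)
    if nz.size < 5:                      # require at least 5 expressed genes
        return np.zeros_like(x, dtype=bool)
    p = np.exp(-x[nz] / 20.0) / 20.0     # p_ij = (1/20) exp(-x_ij / 20)
    q = p / p.sum()                      # normalize to a multinomial
    n_mask = max(1, int(frac * nz.size))
    picked = rng.choice(nz, size=n_mask, replace=False, p=q)
    mask = np.zeros_like(x, dtype=bool)
    mask[picked] = True                  # True = held out for evaluation
    return mask

rng = np.random.default_rng(0)
# e.g., masks = np.stack([mask_cell(row, rng) for row in counts])
```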
Baselines. (1) DeepImpute (Arisdakessian et al., 2019) divides genes into subsets and constructs deep neural networks to impute scRNA-seq data. We implemented DeepImpute with default settings in the DANCE (Ding et al., 2024) package. (2) scGNN2.0 (Gu et al., 2022) incorporates a feature autoencoder, a cluster autoencoder and a graph attention autoencoder for simultaneous imputation and clustering. scGNN2.0 is implemented by the DANCE package with default settings. (3) GraphSCI (Rao et al., 2021) combines autoencoders with graph convolution networks on a gene-gene similarity graph. We accommodated the implementation of GraphSCI in the DANCE package. (4) SAVER (Huang et al., 2018) leverages Poisson Lasso regression to model the scRNA-seq counts with a Poisson-Gamma mixture. We used the R package SAVER to evaluate its performance. (5) DCA (Eraslan et al., 2019) introduces an autoencoder framework based on the zero-inflated negative binomial (ZINB) distribution. We applied DCA to the aforementioned datasets with its Python package. (6) MAGIC (Van Dijk et al., 2018) utilizes Markov affinity to capture gene-gene relationships and impute missing gene expression. We used its Python package to evaluate its performance. (7) scImpute (Li and Li, 2018) develops a Gamma and Gaussian mixture model to identify dropout values. We evaluated scImpute with its R package.

Fine-tuning. Since the denoising task requires the model to recover the gene expression matrix, we can directly obtain the zero-shot performance of CellPLM by specifying the gene set of the target dataset. Additionally, we fine-tuned CellPLM by replacing the pre-trained decoder with an MLP head and initializing the encoder with pre-trained weights. For methods that require model selection on a validation set, we performed another 10% simulated dropout and treated the masked entries as the validation set. The fine-tuned CellPLM was trained with an MSE reconstruction loss, and the best model was selected by evaluating MSE on the validation set.

C.5.2 Spatial Transcriptomic Imputation

Downstream Task Datasets. To evaluate spatial transcriptomic imputation models at single-cell resolution, we collected two samples from the MERSCOPE FFPE Human Immuno-oncology Data (available at https://info.vizgen.com/ffpe-showcase). Specifically, we chose "Lung cancer 2" and "Liver cancer 2" as our samples, subsequently referred to as "Lung2" and "Liver2", respectively. The Lung2 and Liver2 datasets were subset to align with the gene set of the pre-training data. Additionally, we removed the fields of view (FOVs) that contained fewer than 100 cells and retained only the first 100 FOVs from both datasets. Note that all baselines require reference scRNA-seq datasets to impute the unseen genes of SRT data; we collected GSE131907 (Kim et al., 2020) and GSE151530 (Ma et al., 2021a) for lung cancer and liver cancer, respectively. The statistics of all datasets are illustrated in Table C.4.

Table C.4 Spatial transcriptomic imputation datasets.

                     Lung2     Liver2   GSE131907   GSE151530
Number of genes        500        500      29,634      18,667
Number of cells    836,739    598,141     208,506      56,721
Num genes picked       462        446         All         All
Num cells picked    40,114     20,629         All         All
Evaluation Metrics. Following the evaluation pipeline proposed by Avşar and Pir (2023), we selected target genes of the SRT data with stratified sampling according to gene sparsity. Specifically, we grouped genes into four categories: low sparsity, moderate sparsity, high sparsity, and very high sparsity. Empirically, the boundaries (in percent of zero entries, $x$) were defined as $[x < 75,\ 75 \le x < 90,\ 90 \le x < 95,\ 95 \le x]$ to approximate Gaussian mean and standard-deviation slices. Subsequently, we randomly selected 25 genes from each sparsity group and removed them from the training data (see the sketch after this paragraph). After training the models, we calculate the evaluation metrics on the target genes. Namely, we compute the root mean squared error (RMSE), Pearson's correlation coefficient (PCC) and cosine similarity (Cosine) between the ground-truth values and the corresponding imputed values in a gene-wise approach.
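The following is a minimal numpy sketch of this stratified selection, assuming a dense cell-by-gene count matrix `counts` with at least 25 genes per sparsity group; names are illustrative.

```python
# A minimal sketch of sparsity-stratified target-gene selection; the boundaries
# follow the text above, and `counts` is an assumed (cells x genes) array.
import numpy as np

rng = np.random.default_rng(0)
sparsity = (counts == 0).mean(axis=0) * 100          # percent zeros per gene
bins = [(0, 75), (75, 90), (90, 95), (95, 100.01)]   # low / moderate / high / very high
targets = []
for lo, hi in bins:
    group = np.flatnonzero((sparsity >= lo) & (sparsity < hi))
    targets.append(rng.choice(group, size=25, replace=False))
targets = np.concatenate(targets)                    # held out from training
```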
Baselines. (1) SpaGE (Abdelaal et al., 2020) relies on domain adaptation to map scRNA-seq data onto SRT data and utilizes a k-nearest-neighbor (kNN) graph to predict unseen genes. We implemented SpaGE with default settings on both datasets. (2) stPlus (Shengquan et al., 2021) develops an autoencoder framework for learning cell embeddings and imputing SRT genes using a weighted kNN approach. The performance of stPlus is assessed with its Python package. (3) gimVI (Lopez et al., 2019) introduces a variational autoencoder based model with protocol-specific treatments of scRNA-seq data and SRT data. We applied the algorithm with default settings to evaluate the performance of gimVI. (4) Tangram (Biancalani et al., 2021) utilizes a deep learning approach to learn the spatial alignment of scRNA-seq data based on a reference SRT dataset with consistent spatial maps. We evaluated Tangram with its Python package.

Fine-tuning. Similar to scRNA-seq denoising, the spatial transcriptomic imputation task requires the output of the model to be gene expression. Thus, we directly fine-tune CellPLM from the pre-trained weights while specifying the input genes and target genes. The last two batches were held out for validation.

Figure C.2 Visualization of the attention matrix demonstrates cell-cell communication.

Visualization of attention. One essential multi-cell task is cell-cell communication (CCC) inference, where CCC mainly represents biochemical signaling through ligand-receptor binding across cells (Cang et al., 2023). CellPLM applies the self-attention mechanism at the cell level, from which we can study interaction strength via the cell attention matrix. As a preliminary study, we extracted the attention matrix between cells from a randomly chosen field of view (FOV) in the CosMx Liver dataset. The attention matrix is treated as CCC scores, and we visualize the results following the stream-plot setting in Cang et al. (2023). As shown in Figure C.2, there are strong trends on the left and right sides of the FOV, suggesting further exploration of specific signaling pathways for the included cells. This case study showcases the potential of our CellPLM model in cell-cell communication research. We hope our model can facilitate more insightful biological research in the future.

C.5.3 Perturbation Prediction

The Perturb-seq technology has been established to examine the gene expression response at the single-cell level when subjected to pooled perturbations (Dixit et al., 2016). By comparing gene expression before and after perturbation, downstream analysis of differential expression (DE) enables the identification of genes that play a crucial role in disease progression. To assess the potential benefits of CellPLM in gene-level tasks, we conduct experiments to predict the expression values of genes after perturbation. Following the setting of GEARS (Roohani et al., 2022), we partition the perturbations into training, validation, and test sets, ensuring that none of the test perturbations are encountered during the optimization process. Two perturbation datasets are employed for evaluation: (1) the Adamson Perturb-seq dataset (Adamson et al., 2016), consisting of 87 one-gene perturbations; and (2) the Norman Perturb-seq dataset (Norman et al., 2019), containing 131 two-gene perturbations and 105 one-gene perturbations. To evaluate the performance of perturbation prediction, we employ the root mean square error (RMSE) to measure the similarity between the non-zero ground-truth expression values and the corresponding predicted gene expressions. In addition, following previous settings in GEARS (Roohani et al., 2022), we also present the RMSE calculated on the top 20 differentially expressed genes. We compare the performance of CellPLM against two baselines, i.e., the recent GEARS method (Roohani et al., 2022) and scGen (Lotfollahi et al., 2019). The results in Figure C.3 show that CellPLM achieves the lowest RMSE values across all settings.

Figure C.3 (Task 3) The RMSE performance (↓) on the Adamson Perturb-seq and Norman Perturb-seq datasets. The Norman Perturb-seq dataset consists of two settings: one-gene perturbations and two-gene perturbations, denoted as Norm.0 and Norm.1, respectively.

Downstream Task Datasets. We included the Adamson Perturb-seq dataset (Adamson et al., 2016) for one-gene perturbations and the Norman Perturb-seq dataset (Norman et al., 2019) for two-gene perturbations. We followed the preprocessing pipeline of GEARS (Roohani et al., 2022), and both datasets were then subset gene-wise to fit the gene set of the pre-training data. The statistics are summarized in Table C.5.

Table C.5 Perturbation prediction datasets.

                       Adamson   Norman
Number of genes          5,060    5,045
Number of cells         68,603   91,205
Num genes picked         3,246    2,353
Num one-gene pert.          87      105
Num two-gene pert.           –      131

Evaluation Metrics. Following the setting of GEARS (Roohani et al., 2022), we applied a data split such that the test perturbations are unseen during the training process. Specifically, for the Adamson dataset, we randomly held out 25% of the perturbations for testing and 10% of the perturbations within the training set for validation. For the Norman dataset, two settings for two-gene perturbations are implemented for evaluation purposes: 1/2 unseen and 2/2 unseen. We excluded all two-gene combinations in which at least one of the individual genes involved in the combination belonged to the unseen set. Finally, we evaluate performance by calculating the root mean squared error (RMSE) between the predictions and the true values within the test set.

Baselines. (1) GEARS (Roohani et al., 2022) utilizes a gene co-expression knowledge graph and a Gene Ontology-derived knowledge graph to model the influence of perturbations. We followed the recommended parameter settings in its Python package to assess its performance. (2) scGen (Lotfollahi et al., 2019) builds a conditional variational autoencoder and incorporates vector arithmetic to model perturbation responses. We implemented scGen with its Python package on both datasets.
Fine-tuning. For a given perturbation, we set the input of the perturbed genes to −100 to mimic the gene perturbation action; see the sketch below. During the fine-tuning process, we substituted the original batch-aware decoder with a simplified MLP decoder and initialized the remaining components of CellPLM with pre-trained weights. The final model was chosen as the best-performing model on the validation set.
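As a toy illustration of this input convention (the −100 sentinel follows the text above; the function name and tensor layout are our own):

```python
# A toy sketch of encoding a perturbation in the input, assuming expressions
# are stored in a (cells x genes) tensor; names are illustrative.
import torch

def apply_perturbation(x: torch.Tensor, perturbed_genes: list) -> torch.Tensor:
    """Return a copy of x with perturbed gene columns set to the -100 sentinel."""
    x = x.clone()
    x[:, perturbed_genes] = -100.0  # mimic the gene perturbation action
    return x

# e.g., a two-gene perturbation on gene columns 5 and 42:
# x_pert = apply_perturbation(x, [5, 42])
```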
C.5.4 Cell Type Annotation

Cell type annotation is a crucial step in single-cell analysis, as it enables the identification and characterization of distinct cell populations within a tissue or organism. This information is crucial for understanding the functional diversity, developmental trajectories, and disease relevance of different cell types, providing insights into biological processes and facilitating targeted therapeutic approaches.

Downstream Task Datasets. We assess the performance of CellPLM on the cell type annotation task on the hPancreas (Chen et al., 2023) and Multiple Sclerosis (MS) (Schirmer et al., 2019) datasets, which are suggested by Cui et al. (2023). The hPancreas dataset contains five scRNA-seq datasets of human pancreas cells, divided into reference and query sets with annotations including 13 cell types and 11 cell types, respectively. The Multiple Sclerosis (M.S.) dataset, sourced from EMBL-EBI, includes 9 healthy control and 12 M.S. samples; 3,000 highly variable genes were retained.

Evaluation Metrics. We evaluate cell type annotation performance based on two standard classification metrics, macro precision and macro F1 score.

Baselines. To benchmark the performance of CellPLM, we compare it with pre-trained models, including scGPT (Cui et al., 2023) and scBERT (Yang et al., 2022), as well as non-pre-trained SOTA models, including ACTINN (Ma and Pellegrini, 2020), CellTypist (Domínguez Conde et al., 2022), SingleCellNet (Tan and Cahan, 2019), and TOSICA (Chen et al., 2023). For baseline methods, we adhere to their provided guidelines and utilize the default parameter settings. The performance metrics reported for scBERT, TOSICA and scGPT on this task are obtained directly from the scGPT paper.

Fine-tuning. For the CellPLM model, we attach a feed-forward layer to the pre-trained encoder and latent space, and tune the downstream model on the downstream dataset with a standard cross-entropy loss.

C.6 Comparison between CellPLM and scVI

As a supplement to the zero-shot clustering experiments in Section 5.3.1, we add a comparison with scVI (Lopez et al., 2018) on the same dataset. As shown in Figure C.4, CellPLM outperforms scVI without any training or fine-tuning, while the latter was trained on this specific dataset.

Figure C.4 Visualization and comparison between CellPLM (zero-shot; ARI = 0.867, NMI = 0.823) and scVI (ARI = 0.843, NMI = 0.823) on the clustering task, colored by patient and major cell type.

C.7 Visualization of Gene Embeddings

To examine whether gene interactions are encoded in CellPLM, we present a visualization of the pre-trained gene embeddings from the gene expression embedder in Figure C.5. The visualization shows that the gene embeddings maintain latent structures. To further verify the effectiveness of this latent structure, we highlight a specific family of genes, the HLA genes, which comprises multiple classes. For example, HLA class I genes (e.g., HLA-A, -B, and -C) present endogenous peptides to responding CD8+ T cells, while class II genes (e.g., HLA-DR, -DP, and -DQ) process exogenous peptides for presentation to CD4+ helper T cells. In the UMAP visualization, the HLA gene embedding clusters perfectly match the functionality and characteristics of these genes.

Figure C.5 Visualization of gene embeddings in the pre-trained CellPLM demonstrates that CellPLM successfully captures gene interactions in the initial gene embeddings. For example, HLA class I genes and HLA class II genes perfectly form two clusters in the gene embedding space.