INTEGRATIVE LEARNING OF CELLULAR SYSTEMS AND NETWORKS
By
Julian D. Venegas
A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computational Mathematics, Science, and Engineering - Doctor of Philosophy
2024
ABSTRACT
Advances in omics technologies have led to an abundance of comprehensive biomolecular information on biological systems, down to single-cell resolution. With omics data, biologists can gain a deeper understanding of the complex, hierarchical networks that constitute an organism. To this end, deep learning methods are often applied to assist in discovering meaningful patterns and relationships from omics data. Though deep learning methods can offer high performance on many complex tasks, some challenges arise with omics-based tasks: (1) Omics data are often high-dimensional with low sample size and/or high levels of sparsity, with complex dependency structures between and within omics data types. (2) There is an imbalance of annotated data across different species and environments. These difficulties make it desirable to integrate omics data across different modalities, group samples, and platforms, as well as environments and species.
This thesis examines, builds, and implements approaches to address these challenges through Integrative Learning techniques, which I use as a general term to encompass techniques that incorporate multiple sources of related data for improved learning (e.g., transfer learning, multi-task learning, and multi-modal data integration). In this work I highlight and address these challenges in different omics-based tasks. In addition to applied methods development, I provide probabilistic and mathematical frameworks that underpin many of these applied problems in omics analysis. Lastly, I showcase some of my more recent experiments in deconvolution.
Though these experiments are preliminary, often through toy examples, they demonstrate some possible future directions I want to consider.

I dedicate this thesis to my father, mother, wife, and three sons.

ACKNOWLEDGEMENTS
First, I must thank my advisor and committee chair, Dr. Yuying Xie. He has been a consistent source of support since he welcomed me to his lab. I most appreciate him offering me strong guidance in research, balanced with freedom to pursue my own interests; it has been a pleasant experience. I would also like to thank my co-advisor, Dr. Shin-Han Shiu. He served a key role in applied projects, offering me insights and direction through the lens of a domain expert in biology. I would also like to thank my other committee members, Dr. Frederi Viens and Dr. Longxiu Huang, for their feedback, advice, and support. All of my committee members have provided me with a great example of scholarship and professionalism, balanced with humanity.
I also want to thank the CMSE department and the College of Engineering more broadly. I was always made to feel welcome and supported by all. In particular, I'd like to thank Heather Williams for her administrative support, and Dr. Katy Colbry for guiding me towards being awarded the NSF Graduate Research Fellowship, among other funds.
Lastly, thank you to my family: my father, mother, wife, and three sons. You've supported, encouraged, and inspired me throughout my studies; I am forever grateful.

TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION
CHAPTER 2 BACKGROUND
2.1 Omics Data
2.2 Probabilistic Framework for scRNA-seq Data
CHAPTER 3 PLANT STRESS RESPONSE
3.1 Introduction
3.2 Data and Problem Statement
3.3 Model Architecture and Training
3.4 Experiments
3.5 Motif Learning and Location Dependencies
3.6 Additional Experiments and Future directions
CHAPTER 4 CELL TYPE DECONVOLUTION
4.1 Introduction
4.2 Problem Formulation
4.3 Survey of Methods
4.4 Cell Type Deconvolution Benchmark Dataset and Model Development
4.5 Spot Data and Deconvolution Methodology
CHAPTER 5 FURTHER EXPLORATIONS IN DECONVOLUTION
5.1 Towards a Probabilistic Framework for Deconvolution
5.2 Learning Cell Profiles
5.3 Bivariate normal genes for 2 cell-types
CHAPTER 6 CONCLUSION
BIBLIOGRAPHY
APPENDIX

CHAPTER 1 INTRODUCTION
Complex cellular networks with dynamic functional states have a pivotal role in the hierarchy of networks and structures that make up an organism McManus et al. (2015). This makes the study of cells and cellular networks integral to understanding all of biology. This has motivated advances in high-throughput single-cell sequencing and imaging technologies, which have enabled the collection of massive biomolecular data at single-cell resolution.
Indeed, single-cell sequencing technologies often generate tens of thousands to millions of samples/cells per study Svensson et al. (2018). These massive and complex data make deep learning methods an attractive approach for analyzing cell behavior and cellular networks. Deep learning methods, which have consistently shown cutting-edge performance in various big data applications Pouyanfar et al. (2018); Dong et al. (2021), have fertile new ground for research that pushes the frontiers of biological science.
The first developments in single-cell sequencing technology began with complementary DNA (cDNA) Eberwine et al. (1992); Brady et al. (1990). A major breakthrough would come decades later, through the development of single-cell RNA sequencing (scRNA-seq) methods Tang et al. (2009). scRNA-seq methods have given scientists the ability to analyze biology at single-cell resolution, i.e., at the level of the building blocks of life. This has played a pivotal role in changing how we study biology, and has led to great advancements across many fields of biology and science more broadly. Since then, there have been further developments in next-generation sequencing platforms, with over one hundred single-cell sequencing technologies currently in existence Wang and Navin (2015); Wen and Tang (2022). We now have technologies that measure a wide array of cell features, e.g., DNA sequences and epigenetic features, methylation and chromatin accessibility, RNA expression, and profiles of surface proteins. Building on these developments, recent omics technology advances offer multi-feature capability, and additional ancillary features such as spatial location via spatial transcriptomics.
These advances in sequencing technologies have facilitated great advances on the bioinformatics front, through high-throughput methods Svensson et al. (2018); Kolodziejczyk et al.
(2015) which provide (1) higher resolution, from bulk tissues to individual cells (and sub-cellular levels), and (2) a massive volume of open data. To illustrate this, consider bulk tissue RNA-seq (bulk RNA-seq) Li and Wang (2021) vs scRNA-seq data. With bulk RNA-seq we quantify expression as an aggregate from a group of cells (cell pool). Here, cellular heterogeneity is lost, and we have only a single expression sample composed of multiple cells. With single-cell sequencing technologies we quantify expression for each individual cell. So if we consider a cell pool consisting of 20 cells, we get a single sample from bulk RNA-seq and 20 samples from scRNA-seq. In real sequencing experiments, this small difference becomes a large one: a single experiment with scRNA-seq technologies can generate tens of thousands up to millions of samples. These high-resolution and high-throughput omics technologies have provided a massive amount of rich biological data to analyze and explore.
Naturally, as with data explosions across other scientific domains, this has fostered advancements in data analysis and machine learning. In particular, deep learning methods have come to the forefront in many big data applications Pouyanfar et al. (2018); Dong et al. (2021), and omics data analysis is no exception. Deep learning applications in omics have provided biologists with powerful in silico tools to supplement and inform their in vitro and in vivo experiments. Conversely, domain knowledge gained from in vitro and in vivo experiments can inform deep learning models through integrative learning methods. This feedback loop between domain experts and advanced in silico learning methods will serve as a vehicle that drives us to new frontiers in our understanding of biology, and thereby, physical life.
Despite the success of single-cell data in numerous applications, difficulties arise due to the complexity of the data, which requires advanced analysis pipelines with a number of steps.
Single-cell data preprocessing includes many stages of data pruning, normalization, and often challenging machine learning tasks like batch effect correction, data imputation, or dimensionality reduction. Moreover, specialized types of single-cell data require further processing such as multimodal data integration and cell type deconvolution for spatial transcriptomics. These steps are crucial to facilitate downstream tasks ranging from clustering and cell annotation, disease prediction, and identifying gene coexpression networks, to the identification of developmental trajectories of cells transitioning between states Lähnemann et al. (2020).
For tasks with clear evaluation metrics, deep learning often achieves top performance against other classical machine learning techniques Muzio et al. (2021). Deep learning can uniquely leverage its diverse architectures to capture networks of interdependencies between genes that alter other genes' expression levels Bansal et al. (2007), and cells that communicate with other cells through mechanisms like ligand-receptor pairs Li et al. (2022b). Due to the richness of deep learning architectures and the customization of hyper-parameters and loss functions, deep learning models can be more readily tailored to particular tasks in single-cell analysis compared to shallow, classical machine learning methods. Deep learning has already become a staple in omics analysis, but there is still fertile ground for deep learning applications to problems in omics data analysis.
In this work, I review and showcase some major problems in omics data analysis, with applications of deep learning on omics data. In Chapter 2, I give background on omics data and technologies, focusing on single-cell, bulk-seq, and spatial transcriptomics data. Further, I put scRNA-seq data into a probabilistic framework that underpins many methods in scRNA-seq analysis.
In Chapter 3, I apply deep learning frameworks and methods to predicting plant stress response from DNA sequences. There, I emphasize the utility of integrative learning through transfer learning and multi-task learning. In Chapter 4, I develop and apply deep graph learning methods for cell type deconvolution, provide benchmarking experiments, and present a method for generating large-scale cell type deconvolution benchmarking datasets with ground truth labels. In Chapter 5, I showcase a few more (early) works in deconvolution, emphasizing probabilistic frameworks for deconvolution. Lastly, in Chapter 6, I give some concluding remarks about experimental results, developments, and future directions.

CHAPTER 2 BACKGROUND
We begin this chapter with an overview of omics data, with an emphasis on single-cell data and spatial transcriptomics data. Here, we will discuss developments in omics technologies, as well as structure and select problems in omics data analysis. We then develop a probabilistic framework for scRNA-seq count data, which serves as an underpinning assumption in many analytical methods for tasks involving scRNA-seq data. The first section, Omics Data, is derived in part from my contributions to the survey paper Molho et al. (2024) and the new developments paper Ding et al. (2024b).

2.1 Omics Data
At a high level, we study biology to understand life through its internal (unseen or micro-scale) dynamics and, further, to understand the levers that control those dynamics. To understand the dynamics and controls of any system, it is most helpful to create causal maps, i.e., mappings of cause and effect. To do this, we need outputs or outcomes that we can observe through our senses. Of course, dynamic systems are complex, with controls defined by non-linear interactions of components across both horizontal and vertical scales. To understand the dynamics of life, then, it helps to understand biological systems across varying levels.
This allows us to better understand the causal maps controlling biological systems. In biological systems, one of the most tangible levels at which we can observe outcomes is the phenotype. On the other hand, genotypes are the microscopic basic building blocks of organisms. This makes the task of understanding causal maps from genotypes to phenotypes essential to understanding biological systems. Further, any map from genome to phenome begins with the transcriptome, making transcriptome analysis an essential goal for biologists Houle et al. (2010).
While the genotypes of cells within an organism are nearly identical, only a small subset of the total gene pool is expressed at any given moment in time. That is, cellular networks have dynamic functional states, which lead to variations in RNA transcribed from DNA across cells, i.e., transcriptomic variations. A major control of these dynamics is the gene regulatory network, which regulates gene activity. In short, the transcriptome is largely defined by gene regulatory networks. Moreover, variations in the transcriptome highlight and reinforce the notion of cellular heterogeneity. Thus, being able to capture and analyze omics data of individual cells can help us account for cellular heterogeneity, which in turn helps us understand gene regulatory networks and other major drivers of biological systems. Fortunately, there have been significant developments in single-cell technologies that provide researchers with ever-increasing information at the cellular level, including transcriptomics, genomics, and epigenomics data.
While bulk sampling methods can access and take transcriptomics measurements of cell pools within a tissue, they lack the capability to capture the heterogeneity and stochasticity of the cells that make up the bulk sample. Further, even with a homogeneous bulk mixture, we lose granularity in the aggregate signal measured from the mixture.
On the other hand, single-cell technologies measure signals from individual cells, thus giving more granularity to study cell heterogeneity. This provides a way to isolate individual cells and their influence on upstream biological functions Goldman et al. (2019); Kulkarni et al. (2019); Stegle et al. (2015); Nguyen et al. (2018). In this section we review developments of single-cell and other omics technologies over time, with a focus on transcriptomics technologies. We summarize these technological advances chronologically in Figure 2.1.

Figure 2.1 Timeline of major developments in single-cell technologies.

2.1.1 Single-cell Technologies
The first developments in sequencing technology can be traced back to Eberwine et al. (1992) and Iscove and colleagues (Brady et al., 1990), who first expanded complementary DNAs (cDNAs). Since then, modest yet consistent advancements culminated in a significant breakthrough in 2009, when single-cell RNA sequencing (scRNA-seq) was created Tang et al. (2009). The ideas underpinning scRNA-seq have since paved the way for various single-cell technologies covering a broader range of target measures within a cell. Some examples include technologies that target DNA methylation Guo et al. (2013), protein and DNA accessibility (2015), and histone modifications Bartosovic et al. (2021). Beyond sequencing technologies, scRNA-seq has facilitated advancements in quantitative methods and understanding. With scRNA-seq data, researchers have made strides in essential tasks for deeper biological understanding, such as cell type segmentation, classification, and cell type expression profiling, to name a few.
Structurally, single-cell omics data are put into matrix form, with measured signals (e.g., RNA transcripts) as columns and cells as rows. Example features include accessible DNA regions in ATAC-seq data, genes in scRNA-seq data, and proteins in CITE-seq data.
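As a minimal sketch of the cells-by-features layout just described, consider the toy count matrix below. The cell and gene names and all counts are hypothetical and chosen only for illustration; real matrices have thousands of rows and columns and are usually stored in a sparse format.

```python
import numpy as np

# Hypothetical labels, for illustration only.
cells = ["cell_1", "cell_2", "cell_3", "cell_4"]
genes = ["gene_A", "gene_B", "gene_C"]

# Rows are cells, columns are measured features (here, genes).
counts = np.array([
    [0, 3, 1],
    [2, 0, 0],
    [0, 0, 5],
    [1, 0, 0],
])

n_cells, n_genes = counts.shape
sparsity = float(np.mean(counts == 0))  # fraction of zero entries
library_size = counts.sum(axis=1)       # total counts observed per cell
```

Quantities such as the per-cell library size and the fraction of zeros are simple summaries of this matrix that are commonly inspected before any downstream analysis.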
Figure 2.2 shows a simple example of a single-cell omics data matrix.
For scRNA-seq, as for other single-cell technologies, isolating individual cells is the first step in obtaining transcriptome information. How cell isolation is performed is often what distinguishes the different single-cell technologies. The earliest technologies tended to be low throughput and achieved cell isolation through serial dilution or robotic micromanipulation Brehm-Stecher and Johnson (1990). More recently, technologies that use microfluidic methods to isolate the cell have provided higher throughput capabilities Whitesides (2006). A key microfluidic-based cell isolation method uses microdroplets Thorsen et al. (2001). Here, water droplets are uniformly dispersed in a medium of oil, which allows individual cells to be isolated within the droplets. While commercial microfluidic platforms like Fluidigm C1, ICELL8, and Chromium benefit from high throughput, they face the challenge of high cost and often require uniform cell size in the sample.
Once a cell is separated and lysed, messenger RNAs in this cell are reverse transcribed into more stable cDNAs with a unique cell barcode. The cDNAs are then amplified via Polymerase Chain Reaction (PCR) for better data capture before sequencing, which tends to introduce bias due to uneven amplification efficiency. Therefore, besides the unique barcodes, the cDNA molecules in a cell are also given a Unique Molecular Identifier (UMI) to correct the amplification bias by collapsing the reads with the same UMI into one read. After debiasing, sequence reads are mapped to the genome and are grouped into genes for the creation of a count matrix Wang and Navin (2015).
In addition to measuring RNA transcripts, some single-cell technologies can also measure chromatin accessibility of a cell's chromosomes.
Eukaryotic genomes are hierarchically packaged into chromatin Kornberg (1974), and this packaging plays a central role in gene regulation Kornberg and Lorch (1974). Buenrostro et al. created a means for sampling the epigenome at the single-cell level through the Assay for Transposase Accessible Chromatin using sequencing (ATAC-seq) Buenrostro et al. (2013) in 2013. ATAC-seq allows the identification of accessible DNA, i.e., the nucleosome-free regions of the genome Hendrickson et al. (2018). DNA accessibility within the genome can be used to identify regulatory elements in different cell types which cause the activation or repression of gene expression Thurman et al. (2012). scATAC-seq produces a count matrix of the number of reads per open chromatin region, which leads to very large matrices with hundreds of thousands of regions. Furthermore, the data are known to be very sparse; it is common for non-zero entries to make up less than 3% of the data Li et al. (2021).

Figure 2.2 An illustration of data matrices produced by single-cell technologies.

While single-cell sequencing techniques for transcriptome measurements have seen great growth, single-cell proteomics methods have been developing at a slower pace. This leaves a gap in omics data analysis, since proteomic data are essential to understand how genes respond to environmental changes. Moreover, proteins are basic functional units for many cellular processes, which makes proteomics data essential to analyzing and understanding cellular behavior. Unlike most sequencing technologies, which have a standard process, proteomic measurements are often bespoke and designed for specific applications Vistain and Tay (2021). However, some technologies have made significant strides in capturing protein information of cells and combining this with mRNA measurements.
Specifically, Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq) simultaneously sequences mRNA and measures the surface proteins on a cell Stoeckius et al. (2017). The method can sample over 1,000 genes and 80 proteins per cell, but like many other sequencing techniques, it suffers from high noise. In addition, CITE-seq is incapable of detecting intracellular proteins Baron and Yanai (2017).

2.1.2 Single-Cell Spatial Transcriptomics
Single-cell technologies that capture transcriptomic, proteomic, or epigenetic information do so with great precision but with the loss of spatial information of the cells within the tissues. However, the cells' relative locations within tissue are critical to understanding normal development and disease pathology. With spatial transcriptomic technologies, researchers are able to measure transcriptomics and leverage the spatial information or relative locations of cells in a tissue to perform better on downstream tasks Crosetto et al. (2015); Moor and Itzkovitz (2017); Wang et al. (2018); Marx (2021); Asp et al. (2020); Waylen et al. (2020); Teves and Won (2020). For example, motivated by the fact that a ligand and a receptor that are closer together bind more easily, HoloNet Li et al. (2022b) builds up a directed graph based on the expression of ligand–receptor gene lists and the physical distance between the sender cell and receiver cell to represent cell–cell communication events. However, the early generations of spatially resolved profiling technologies are not at single-cell resolution but instead sample in groups called 'spots', which capture several cells. Additional work is required to determine the cell type proportions in spots, a process called cell type deconvolution. Alternatively, many cell imaging platforms provide RNA spatial information at the cellular and subcellular level, but the individual cells must be identified through cell segmentation methods.
Major technologies or platforms for spatial transcriptomics include multiplexed error-robust fluorescence in-situ hybridization (MERFISH), sequential fluorescence ISH (seqFISH+), Slide-Seq, Visium by 10x Ståhl et al. (2016), GeoMx Digital Spatial Profiler (DSP) Merritt et al. (2020) by NanoString, and CosMx Spatial Molecular Imager (SMI) by NanoString. MERFISH Moffitt and Zhuang (2016), first introduced in 2015, is a single-molecule-fluorescence-in-situ-hybridization (smFISH)-based technology that can be applied to fresh-frozen samples to provide subcellular resolution. While traditionally the procedure of these smFISH-based technologies is complex, a number of commercialized platforms have emerged recently, such as Vizgen, Rebus Esper, Molecular Cartography, and Resolve Biosciences Moses and Pachter (2022), which allow more convenient sequencing of spatial transcriptomics at a lower cost. As an alternative to MERFISH, seqFISH+ Lubeck et al. (2014); Shah et al. (2016); Eng et al. (2019) employs 'pseudocolors', i.e., combinations of colors, to increase the number of detectable transcripts Rao et al. (2021a).
Beyond early in-situ hybridization methods, a number of sequence-based technologies have emerged. Closely related to scRNA-seq technologies, these sequencing-based methods barcode RNAs such that each read can be mapped to its corresponding spatial location through the associated barcodes. The rest of the sequencing read is mapped to the genome to identify the transcript of origin, collectively generating a gene-expression matrix. Ståhl et al. (2016) first proposed this method, which has been adapted by commercial platforms such as 10x Visium. 10x Visium fixes spatially barcoded oligos to each spot in a capture slide (6.5 mm × 6.5 mm), with the barcoding done through DNA extension and reverse transcription for formalin-fixed paraffin-embedded (FFPE) tissues and fresh frozen tissues, respectively.
In particular, the 10x Visium expression slide contains 4 capture slides, each 6.5 mm × 6.5 mm, where fresh frozen or FFPE tissues are placed. Each of the capture slides contains a grid of approximately 5000 barcoded spots that are 55 µm in diameter with a center-to-center distance of 100 µm between any two adjacent spots. On average, there are 1 to 10 cells in each of these spots, and ~18,000 unique genes in human (~20,000 in mouse) can be quantified. Another major sequencing-based technology is Slide-Seq, which captures mRNA by placing barcoded beads on slides, achieving a high resolution of 10 µm. Technological innovations have further improved sequencing resolution in recent years. For instance, high-definition spatial transcriptomics (HDST) Rodriques et al. (2019a) uses wells rather than slides, whereas Slide-seqV2 Stickels et al. (2020), built upon Slide-Seq, raised the resolution to near-cellular level while reaching an RNA capture efficiency of roughly 50% of that of scRNA-seq. Finally, spatio-temporal enhanced resolution omics sequencing (Stereo-seq) Chen et al. (2022) deposits barcoded DNA nanoballs in patterned arrays to achieve single-cell resolution while maintaining high sensitivity.
While 10x Visium and Slide-Seq do not profile at cellular resolution, NanoString's GeoMx DSP is capable of cellular resolution through user-drawn profiling regions. GeoMx DSP uses a PC-linker to link barcodes via antibodies to proteins and RNA for identification. The spatial regions of interest (ROI) on the tissue are flexible and can be user-defined or follow pre-defined layouts (such as a square grid). During imaging, the DSP barcodes from each ROI are UV-cleaved and collected for sequencing, and the spatial information is recorded. Due to the flexibility of the ROI definitions, the ROIs can be a range of sizes, from a single cell to hundreds of cells. The RNA assay can quantify > 18,000 target genes, and the protein assay can quantify > 96 proteins.
Though GeoMx can produce cellular-resolution sequencing, its scalability is limited. The most recent platform, CosMx Spatial Molecular Imager (SMI) Lewis et al. (2022), is able to profile consistently at single-cell, and even subcellular, resolution. CosMx SMI follows much of the same initial protocol as GeoMx DSP, with barcoding and ISH. However, the SMI instrument performs 16 cycles of automated cyclic readout, and in each cycle the set of barcodes (readouts) is UV-cleaved and removed. These cycles of hybridization and imaging yield spatially resolved profiling of RNA (> 980 target genes) and protein (> 80 validated proteins) at single-cell (~10 µm) and subcellular (~1 µm) resolution.
Multiplex imaging technologies have significantly advanced the spatial resolution of single-cell profiling. Spatially resolved transcriptomic data, along with corresponding imaging data, enable single-cell or even subcellular analysis using both spatial morphology and pixel-level information. Recently, antibody-based multiplexed imaging methods have dominated the multiplexing approaches, as they can capture cellular organization and tissue phenotypic heterogeneity at the protein level. They utilize various protein markers for cellular identification. Immunohistochemistry (IHC) Coons et al. (1942), first reported in 1942, is one of the most commonly used multiplexed imaging methods. It uses appropriately labeled antibodies to bind specifically to their target antigens in situ (in the original site), which can be better captured by current light or fluorescence microscopy. Because of its limited protein readouts, methods including multiplexed immunofluorescence (MxIF) Gerdes et al. (2013) and cyclic immunofluorescence (CyCIF) Lin et al. (2015, 2018) were proposed to add new antibodies in multiple rounds of staining. Another imaging platform, Co-Detection by IndeXing (CODEX) Goltsev et al.
(2018), is designed for up to 40 proteins using cyclic detection of DNA-indexed antibody panels. Imaging mass cytometry (IMC) Giesen et al. (2014) is an evolutionary technology that leverages mass spectrometry to obtain images from tissues with 40+ labels simultaneously. This vastly reduces data noise and enhances the multiplex capability. Multiplexed ion beam imaging (MIBI) Keren et al. (2019) is also performed by imaging tissues with secondary ion mass spectrometry based on metal-labeled antibodies. These multiplexed imaging tools provide high-dimensional imaging assays at the single-cell level and enable the analysis and understanding of single-cell function and tissue structure.

2.1.3 Spatial & Bulk Deconvolution
In addition to omics data, spatial transcriptomics technologies provide spatial location data of samples. The spatial resolution capabilities now range from multi-cell pools or bulk samples to single-cell and even sub-cellular levels. This information gives us yet another aspect from which to study the functional dynamics in biological systems. Along with cellular heterogeneity, we can now study spatial heterogeneity and the cellular composition of tissues (Molho et al., 2024; Rao et al., 2021b; Fan et al., 2023). With spatial transcriptomics, we can analyze omics data within a spatial context, which can offer deeper insight into cellular interactions (Tian et al., 2023; Raredon et al., 2023), and cell type localization under varying conditions. For example, we can study how cell types are organized in cancerous or diseased tissue (Williams et al., 2022; Rao et al., 2021b). Of course, to analyze this most readily and objectively requires single-cell resolution. With lower resolution (multi-cell pools or spots), each omics data point is an aggregate of the multiple cells within the captured region. For example, with RNA we would have an aggregate expression of the cells in the cell pool, which is heterogeneous in many cases.
While this does provide us with some spatial context, it does not readily allow for studying the spatial distribution of cells, and thus hinders any downstream analysis that depends on cell type composition. While there are options with single-cell resolution, the more affordable options in spatial transcriptomics naturally do not offer single-cell resolution. Some popular lower resolution options include 10x Visium (Maynard et al., 2021) and Slide-seq (Stickels et al., 2021; Rodriques et al., 2019b).
With this tradeoff in mind, it is desirable to make best use of the non-single-cell resolution spatial transcriptomics data. Though spatial information is not captured at single-cell resolution, we can use the spatial omics data together with robust reference single-cell data to deconvolute the aggregated omics data in terms of cell type composition. This task is called cell type deconvolution (Ding et al., 2024a; Molho et al., 2024): the task of quantifying the cell type composition of bulk mixtures (in this context, spatial mixtures) by decomposing bulk mixture omics measurements by cell type proportion.
The simplest idea to accomplish this is through non-negative matrix factorization (NMF). Here, we take cell type expression profiles derived from some set of robust single-cell reference data, and regress the bulk mixture expression data onto them, with a non-negative constraint on the coefficients. These non-negative coefficients (after normalization) are taken as the cell type composition estimates. This is fairly intuitive: the bulk mixture expression is an aggregate of the multiple cells in the mixture, so finding the cell type composition (coefficients) that best matches a standard cell type expression profile should give a decent estimate of the true cell type composition. In recent years there has been further development in cell type deconvolution, with many methods building on this basic idea of NMF.
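The non-negative regression idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of any method cited here: the function name `deconvolve_nnls`, the toy reference profiles, and the proportions are all hypothetical, and the non-negative least squares problem is solved with plain projected gradient descent rather than a dedicated solver.

```python
import numpy as np

def deconvolve_nnls(profiles, mixture, n_iter=2000):
    """Estimate cell type proportions by non-negative least squares.

    profiles: (n_genes, n_cell_types) reference expression profiles.
    mixture:  (n_genes,) expression of one bulk sample or spot.
    Minimizes ||profiles @ w - mixture||^2 subject to w >= 0 with
    projected gradient descent, then normalizes w to sum to one.
    """
    profiles = np.asarray(profiles, dtype=float)
    mixture = np.asarray(mixture, dtype=float)
    # Step size 1/L, with L the largest eigenvalue of profiles.T @ profiles.
    step = 1.0 / np.linalg.norm(profiles, 2) ** 2
    w = np.zeros(profiles.shape[1])
    for _ in range(n_iter):
        grad = profiles.T @ (profiles @ w - mixture)
        w = np.maximum(w - step * grad, 0.0)  # project onto w >= 0
    total = w.sum()
    return w / total if total > 0 else w

# Toy reference: 4 genes (rows) by 3 cell types (columns), hypothetical values.
profiles = np.array([
    [10.0, 1.0, 0.0],
    [1.0, 8.0, 1.0],
    [0.0, 2.0, 9.0],
    [3.0, 3.0, 3.0],
])
true_proportions = np.array([0.5, 0.3, 0.2])
mixture = profiles @ true_proportions  # noise-free aggregate signal
estimated = deconvolve_nnls(profiles, mixture)
```

In this noise-free toy setting the estimate should recover the true proportions up to numerical tolerance; with real data, noise and mismatch between the reference profiles and the mixture make the problem considerably harder. In practice a library routine such as `scipy.optimize.nnls` could replace the hand-written solver.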
Here are some methods and brief descriptions (further description is found in the Deconvolution chapter). SPOTlight (Elosua-Bayes et al., 2021a) essentially applies a seeding to NMF regression to deconvolve bulk mixtures with reference scRNA-seq data. Stereoscope (Andersson et al., 2020) is a Bayesian model that integrates information from both single-cell and spatial transcriptomics data to estimate the probability of each cell type at each location within the tissue sample. Cell2location (Kleshchevnikov et al., 2022a) puts bulk expression into a Bayesian hierarchical framework with a spatially determined prior on the cell-type compositions. Outside of these shallow, classical methods that build on the basic idea of NMF, there have been developments in deep learning-based methods (Molho et al., 2024) for cell type deconvolution. Tangram (Biancalani et al., 2021) spatially aligns reference scRNA-seq data and identifies spatially co-expressed gene modules, from which it infers the presence of different cell types in a tissue sample. DSTG (Song and Su, 2021a) creates synthetic mixtures by sampling reference scRNA-seq data and maps the synthetic and real bulk mixtures to a common domain. A graph is then constructed from mutual nearest neighbors in this domain and taken as input to graph convolutional networks, which directly estimate the cell type proportions.

Currently, a roadblock to significant advancements in cell type deconvolution methods is the relatively limited amount of data with ground truth cell type compositions. At its core, cell type deconvolution reduces to an inverse problem of sorts, as we are trying to recover cell type composition from an aggregate signal over all cells in a mixture. The only way to get cell type labeled data is with single-cell resolution, so synthetic experiments must be done to create cell type labeled mixture data. In most cases, such labeled data is small and taken from non-human tissue.
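As a rough illustration of how such synthetic labeled mixtures are typically built, the sketch below pools a few randomly sampled reference cells and records their label frequencies as the ground truth composition. The function name `simulate_mixture`, the toy Poisson counts, and the pool size of 2 to 8 cells (borrowed from the DSTG description later in this thesis) are assumptions for the example, not a prescribed protocol.

```python
import numpy as np


def simulate_mixture(counts, labels, n_types, rng, min_cells=2, max_cells=8):
    """Build one synthetic cell-pool from labeled single cells.

    counts: (cells, genes) reference count matrix (hypothetical input)
    labels: (cells,) integer cell type label of each reference cell
    Returns the pooled expression and its ground truth composition.
    """
    n_cells = rng.integers(min_cells, max_cells + 1)       # pool size, inclusive range
    idx = rng.choice(counts.shape[0], size=n_cells, replace=True)
    mixture = counts[idx].sum(axis=0)                       # aggregate counts over the pool
    props = np.bincount(labels[idx], minlength=n_types) / n_cells
    return mixture, props


# Toy reference (assumed): 100 cells, 50 genes, 3 cell types
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(100, 50))
labels = rng.integers(0, 3, size=100)

mixture, props = simulate_mixture(counts, labels, n_types=3, rng=rng)
print(mixture.shape, props)
```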
Moreover, creating synthetic mixtures does not always reflect the heterogeneity and organization of cell types within real tissues, and hence real biological systems. For example, Yan and Sun (2023) provide three deconvolution benchmark datasets from mouse tissues, which contain at most 80,000 cells. Li et al. (2022a) provide 32 synthetic deconvolution datasets, taken from scRNA-seq reference data.

Further developments in methods and benchmark datasets for cell type deconvolution can provide a way to better leverage high-throughput bulk omics data for downstream analyses that depend on cell type composition. Of course, this is strengthened further with spatial omics data, as we can then use spatial information together with cell type composition. An important area that could greatly benefit from these developments is immuno-oncology, to better understand tumor cell organization and its interactions with the immune system. Indeed, tumors grow from cell proliferation, making the study of immune cell composition and organization an important tool for understanding cancer and developing better therapies (Sturm et al., 2019). Towards this end, in the Cell Type Deconvolution chapter we develop a spatial transcriptomic benchmark dataset from human samples that include the tumor microenvironment (TME), a deconvolution method, and a set of benchmarking exercises.

2.2 Probabilistic Framework for scRNA-seq Data

2.2.1 scRNA-seq Count Models

Cell type composition relies on cell type annotation of cells from scRNA-seq data, either directly or as a reference for deconvolution of bulk data. Many approaches to cell type annotation rely on scRNA count models to estimate the annotations. Below is a short progression of scRNA count models that add parameters to incorporate technical effects, which are often needed to deal with differences across studies (batches) and platforms.
We are given $d$ genes, $N$ single cells, and $K$ cell types.

Observed:
  $X = (x_{i,j}) \in \mathbb{R}^{N \times d}$ : observed single-cell expression
  $\tilde{X} = (\tilde{x}_{i,j}) \in \mathbb{R}^{K \times d}$ : estimate of the mean cell-type profiles $\mu$

Unobserved:
  $\mu = (\mu_{i,j}) \in \mathbb{R}^{K \times d}$ : true mean cell-type profiles
  $c(i)$ : cell type of cell $i \in [1, N]$

Goal: Predict the true cell type $c(i)$ of each cell $i \in [1, N]$ from $\{X, \tilde{X}\}$.

$$X_{i,j} \sim \mathrm{NB}(\mu = \mu_{c(i),j},\ \mathrm{size} = \theta) \equiv \mathrm{Poisson}(\Gamma), \qquad \Gamma \sim \mathrm{Gamma}(\theta,\ \theta/\mu_{c(i),j}) \tag{2.1}$$

Technical effects model:

$$X_{i,j} \sim \mathrm{Poisson}(b_i) + \mathrm{NB}(\mu = s_i\,\mu_{c(i),j},\ \mathrm{size} = \theta) \tag{2.2}$$

Technical effects of observation:
  (i) Not all mRNA transcripts get detected → detection efficiency scale factor $s_i$
  (ii) Off-target binding → background counts additive factor $b_i$
  (iii) Gene-specific detection efficiencies

Simplified technical effects model:

$$X_{i,j} \sim \mathrm{NB}(\mu = b_i + s_i\,\tilde{X}_{c(i),j},\ \mathrm{size} = \theta) \tag{2.3}$$

This is a simplifying assumption to avoid a convolution of discrete distributions. It has the same mean model as (2.2), but higher variance.

2.2.2 scRNA-seq Count Pre-processing

Forgiving the minor deviation in symbol definition from the previous section, we define

  $X = [x_{gm}] \in \mathbb{R}^{D \times N}$ : raw expression, $D$ genes × $N$ samples
  $\bar{X} = \frac{1}{N} X \mathbf{1} = [\bar{x}_g] \in \mathbb{R}^{D \times 1}$ : mean expression, $D$ genes over $N$ samples

and we outline some standard pre-processing steps applied to scRNA-seq expression data.

1. Select $d \le D$ candidate genes $G$ with the highest mean expression:

$$G = \operatorname*{argmax}_{G' \subset [1, D],\ |G'| = d} \Big\{ \sum_{g \in G'} \bar{x}_g \Big\}$$

2. For all $d$ candidate genes $g, g' \in G$, where $g \neq g'$:

  (i) Compute log-transformed expression ratios $L = [L_{gg'}] \in \mathbb{R}^{N \times d(d-1)}$:

$$L_{gg'} = \Big[ \log_2 \Big( \frac{x_{gm}}{x_{g'm}} \Big) \Big]_{m=1}^{N} \in \mathbb{R}^{N \times 1}$$

  (ii) Compute pair-wise variations $V = [V_{gg'}] \in \mathbb{R}^{d \times (d-1)}$, with $V_{gg'} = \mathrm{SE}(L_{gg'})$, where

$$\mathrm{SE}^2(L_{gg'}) = \frac{1}{N-1} \sum_{m=1}^{N} \Big( \log_2 \Big( \frac{x_{gm}}{x_{g'm}} \Big) - \bar{L}_{gg'} \Big)^2, \qquad \bar{L}_{gg'} = \frac{1}{N} \mathbf{1}^T L_{gg'}$$

  or, in matrix form,

$$\mathrm{SE}^2(L_{gg'}) = \frac{1}{N-1} \big( L_{gg'} - \bar{L}_{gg'} \mathbf{1} \big)^T \big( L_{gg'} - \bar{L}_{gg'} \mathbf{1} \big) = \frac{1}{N-1} L_{gg'}^T \Big( I_N - \frac{1}{N} \mathbf{1}\mathbf{1}^T \Big) L_{gg'}$$

  (iii) Compute gene-stability measures $M = [M_g] \in \mathbb{R}^{d \times 1}$, the arithmetic mean of the pairwise variations:

$$M_g = \frac{1}{d-1} \sum_{g' \neq g} V_{gg'}$$

3. Determine $d^* \le d$ housekeeper/reference genes $G^*$: the genes with the lowest gene-stability measure (low ⟹ more stable):

$$G^* = \operatorname*{argmin}_{G' \subset G,\ |G'| = d^*} \Big\{ \sum_{g \in G'} M_g \Big\}$$

Normalizing factors:

$$\tilde{F} = F \ast \bar{F} \in \mathbb{R}^{N \times 1}, \quad \text{where } F = \big[ F_m \big]_{m=1}^{N}, \quad F_m = \Big( \prod_{g \in G^*} x_{gm} \Big)^{1/d^*}, \quad \bar{F} = \frac{1}{N} \mathbf{1}^T F$$

Normalized expression matrix:

$$\tilde{x} = X\,\mathrm{diag}(\tilde{F}) \in \mathbb{R}^{D \times N}$$

Median normalized expression:

$$\mathrm{median}(X) = [\mathrm{median}(X_c)]_{c=1}^{K} \in \mathbb{R}^{D \times K}$$

Then re-scale for equal median expression across cells:

$$\tilde{X} = \mathrm{median}(X)\,\mathrm{diag}\big( M(X) \big), \quad \text{where } M(X) = \Big( \frac{\mathrm{median}(x_1)}{\mathrm{median}(x_c)} \Big)_{c=1}^{K}$$

The background and normalized read counts can then be computed from negative control probes.

  $\{P_l\}_{l=1}^{L}$ : probe pools
  $\{R_l\}_{l=1}^{L},\ R_l \subset P_l$ : negative control (nc) probes in pool $P_l$
  $\tilde{X}_{R_l} \in \mathbb{R}^{|R_l| \times N}$ : normalized read counts of nc probes (from $\tilde{X}$)

CHAPTER 3
PLANT STRESS RESPONSE

The chapter is comprised of work on a collaborative project between Dr. Xie’s and Dr. Shiu’s labs. Here, I have reproduced original experiments I worked on with Dr. Yuning Hao and Dr. Runze Su. I give them both credit for the original experiments, and figures 3.4-3.7, 3.9-3.10. In part, the reproductions were done to make further inquiries and developments.
Particularly, I contributed inquiries and experiments in stress type grouping and alternative model architectures.

3.1 Introduction

A central problem in molecular plant biology is to understand how plants respond to various abiotic and biotic stressors (e.g. heat waves, drought, and pest infestations) at the molecular level. As stresses trigger changes in gene expression levels, differential expression analysis is a key tool to understand stress response in plants. The goal of such analyses is to determine motifs that are

A main component of gene expression regulation is the binding of transcription factors to specific sequences of DNA called regulatory elements (motifs). For this reason, an avenue of research has been to identify these transcription factors and their respective regulatory motifs in order to predict gene expression responses (Uygun et al., 2017; Wilkins et al., 2016). However, identifying individual regulatory motifs, such as transcription factor binding sites (TFBS), is only a small part of the complex process of gene regulation. Indeed, gene regulation processes also depend on the location, orientation, quantity and co-localization of regulatory motifs. These dependencies form the structures that modulate gene regulation, and these structures form what is called regulatory grammar (Weingarten-Gabbay and Segal, 2014).

Understanding regulatory grammar through computational modeling of these complex dependencies has thus become an active area of bioinformatics research. Many advancements towards modeling complex regulatory grammar have come from deep sequence learning models, traditionally used in natural language processing.

One of the early deep learning models developed to account for sequential dependencies was DeepSEA (Zhou and Troyanskaya, 2015). It uses convolutional neural networks (CNN) to learn motifs and local dependencies, which are ultimately used for functional-variant prediction.
Building on the DeepSEA model, Quang and Xie (2016) developed DanQ, which couples the CNN with a recurrent neural network (RNN), namely a bi-directional long short-term memory network (LSTM) (Hochreiter and Schmidhuber, 1997). The LSTM component helps identify long-range dependencies [9], and hence co-localization dependencies. As the LSTM is bi-directional, it learns these features on both the forward and reverse ordering of sequences (hence orientation).

These developments in deep sequence learning models are easily tailored and applied to our problem of interest: predicting plant stress response from DNA sequences. Building on these ideas, we propose DeepCAT, a convolutional self-attention architecture to predict plant stress response from DNA sequences¹. DeepCAT consists of 3 layers. The first is a convolutional layer, which converts DNA base-pairs to a numerical sequence, identifying key predictive motifs and local dependencies. The second is a self-attention layer, which captures key predictive co-localization dependencies. The last is a fully-connected (FC) layer that outputs prediction scores of gene up-regulation under different abiotic and biotic stresses.

A few properties of DeepCAT yield advantages over popular learning models. First, the self-attention method has been shown to capture long-range sequence dependencies beyond the capabilities of RNNs (Vaswani et al., 2017). Another property of self-attention is that it does not impose a strict order on how a sequence is processed (Vaswani et al., 2017), which is advantageous since the ordering of base-pairs may not always be a factor in gene expression. With these advantages we hypothesize that DeepCAT outperforms many other popular learning models in predicting plant stress response.
We also hypothesize that DeepCAT extracts sequential features that can identify known and potentially novel transcription factor binding motifs (TFBMs) and their interactions. We demonstrate this on the problem of predicting gene expression of Arabidopsis thaliana (A. thaliana) in response to 57 abiotic and biotic stress conditions.

¹ DeepCAT is an adaptation of another model our lab has proposed: CANEE, a convolutional self-attention architecture to analyze the function of non-coding DNA sequences.

Figure 3.1 Basic DNA up-regulation prediction model.

Figure 3.2 High level pipeline of DeepCAT.

3.2 Data and Problem Statement

Given raw Arabidopsis thaliana DNA sequence data, the objective is to predict gene up-regulation under 57 environmental stress conditions. Specifically, the task is to predict whether an Arabidopsis gene was up-regulated or not in shoot tissue under each of 36 abiotic (e.g. cold, heat, osmotic) and 21 biotic (e.g. Pseudomonas syringae, bacterial flagellin) stress conditions.

Gene expression and sequence data of 20,799 A. thaliana genes, each consisting of 3,200 bp (covering promoter and 5’ UTR), were downloaded from the AtGenExpress database and processed as in Uygun et al. (2017). In particular, the preprocessed and normalized expression data from AtGenExpress were used to calculate the log2 fold change between stress and control conditions using Limma in the R environment. Genes with a log2 fold change ≥ 1 were considered up-regulated. DNA sequences were pulled for each gene from TAIR10. Particularly, the sequences were taken from 1 kilobase (kb) upstream and 500 base pairs (bp) downstream of the transcription start site, and 500 bp upstream and 1 kb downstream of the transcription stop site. These sequences were then one-hot encoded, with each sequence converted into a 3200 × 4 binary matrix.
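The one-hot encoding step can be sketched as follows. This is a minimal illustration, not the thesis pipeline: the helper name `one_hot_encode` and the choice to leave ambiguous bases (such as N) as all-zero rows are assumptions made for the example.

```python
import numpy as np

BASES = "ACGT"  # column order of the encoded matrix


def one_hot_encode(seq):
    """Convert a DNA string into a (len(seq), 4) binary matrix."""
    index = {base: col for col, base in enumerate(BASES)}
    mat = np.zeros((len(seq), len(BASES)), dtype=np.uint8)
    for pos, base in enumerate(seq.upper()):
        col = index.get(base)
        if col is not None:      # assumption: unknown bases stay as an all-zero row
            mat[pos, col] = 1
    return mat


encoded = one_hot_encode("ACGTTGCA")
print(encoded)
```

A 3,200 bp sequence passed to this function would yield the 3200 × 4 matrix described above.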
The columns correspond to A, C, G, T, and the rows correspond to the position in the DNA sequence, with each row containing a single 1 in one column and zeros in the remaining columns. Genes were randomly assigned according to a training-validation-test split of 70-10-20.

3.3 Model Architecture and Training

DeepCAT consists of 3 main modules: (1) CNN, (2) self-attention and (3) FC/output.

CNN Module

A 1D convolutional layer and a max-pooling layer make up the CNN module. Suppose the input of the convolutional layer has shape $(N, I, L)$ and the output is $(N, O)$; then the 1D convolution is given by

$$\mathrm{Conv1D}(X_{N_m}, O_j) = \mathrm{ReLU}\Big( \mathrm{Bias}(O_j) + \sum_{k=0}^{I-1} W_{O_j,k} \star X_{N_m,k} \Big),$$

where $N$ is the batch size, $I$ the input sequence dimension, $O$ the output element dimension, and $L$ the input sequence length. The subsequent max-pooling is given by

$$\mathrm{Output}(N_i, C_j, k) = \max_{m = 0, 1, \ldots, \text{kernel size} - 1} \mathrm{input}(N_i, C_j, k + m),$$

where the input is of size $(N, C, L)$.

Self Attention Module

First, positional encoding is applied to enable the model to capture relative positional information, and thus potential order structure. This consists of applying a positional embedding to the sequences that are output from the CNN module. We apply the sinusoidal embedding as in Vaswani et al. (2017).

Figure 3.3 DeepCAT architecture.

The self-attention mechanism, as in Vaswani et al. (2017), has three factors: query $Q$, key $K$ and value $V$. Assuming the input is $X$, the formulation can be expressed as

$$Q_i = W_Q X_i, \qquad K_i = W_K X_i, \qquad V_i = W_V X_i$$

$$S_{i,j} = \frac{Q_i \cdot K_j}{\sqrt{d}}, \qquad \mathrm{Score}_{i,j} = \frac{\exp(S_{i,j})}{\sum_k \exp(S_{i,k})}, \qquad \mathrm{output}_i = \sum_j \mathrm{Score}_{i,j} V_j$$

Here, $W_{(\cdot)}$ represents a weight matrix. The weights corresponding to each of these factors, and hence the factors themselves, are learned during the training process. The output from the self-attention network is then passed to the FC module.

Fully-Connected Output Module

We apply a single FC layer, giving weighted scores for each of the 57 stress types.
We then apply a sigmoid output layer to obtain probability scores for gene up-regulation.

We trained DeepCAT by minimizing the average multi-task binary cross-entropy loss in mini-batches of size 50 using the Adam optimizer (Kingma and Ba, 2014). All the weights and biases were initialized with Xavier (uniform) (Glorot and Bengio, 2010) and zero values, respectively. For model regularization purposes, we applied dropout with a rate of 0.1 in the attention layers. Validation data was used to determine an optimal number of training iterations. Namely, we use an early stopper to stop the training process if the validation loss does not decrease for a set number of epochs (default 5), thus keeping the model that performs best on the validation set.

In all of our experiments we trained DeepCAT with the following settings: 320 convolutional kernels/filters, kernel dimension 26, pooling dimension 13, and 4 attention heads. Our implementation was in PyTorch, and our experiments (training and testing) were run on an NVIDIA K80 GPU.

3.4 Experiments

Using the fully trained models, performance was measured on the testing data (70-10-20 train-valid-test split). We used two metrics: the Receiver Operating Characteristic-Area Under the Curve (ROC-AUC) and the Precision Recall-Area Under the Curve (PR-AUC). For overall comparison purposes we averaged the PR-AUC and ROC-AUC across the 57 stress types.

In each of our experiments we compare DeepCAT against a few well-known shallow and deep learning models. The shallow models consisted of Support Vector Machines (SVM) and Random Forest. The deep learning model we compared against was essentially the DanQ model (Quang and Xie, 2016), with the output layer modified to give plant response probability scores for the 57 different stress types. We chose this deep learning model for comparison because it has a similar structure to DeepCAT and has performed well on a different but similar problem with human DNA data.
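The averaging of per-stress metrics mentioned above can be sketched as follows for ROC-AUC. This is a minimal NumPy illustration under stated assumptions: the function names `roc_auc` and `macro_auc` are hypothetical, the AUC is computed through its rank-based (Mann-Whitney) form with tied scores given average ranks, and tasks with only one class present are skipped in the average. It is not the evaluation code used in the experiments.

```python
import numpy as np


def roc_auc(y_true, y_score):
    """ROC-AUC for one task via the rank-sum (Mann-Whitney U) identity."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = int(y_true.sum())
    n_neg = y_true.size - n_pos
    if n_pos == 0 or n_neg == 0:
        return float("nan")                      # AUC undefined with a single class
    order = np.argsort(y_score, kind="mergesort")
    sorted_scores = y_score[order]
    ranks = np.empty(y_score.size, dtype=float)
    i = 0
    while i < sorted_scores.size:                # assign average ranks to tied scores
        j = i
        while j + 1 < sorted_scores.size and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1
    return (ranks[y_true].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def macro_auc(Y_true, Y_score):
    """Average the per-task AUC over the columns (stress types)."""
    aucs = [roc_auc(Y_true[:, t], Y_score[:, t]) for t in range(Y_true.shape[1])]
    return float(np.nanmean(aucs))


# Toy example (assumed): 4 genes, 2 stress types
Y_true = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
Y_score = np.array([[0.9, 0.2], [0.1, 0.8], [0.8, 0.7], [0.3, 0.6]])
print(macro_auc(Y_true, Y_score))  # per-task AUCs are 1.0 and 0.75
```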
We first evaluated baseline performances of our standard DeepCAT model and the other learning models. To improve on these baseline results, we experimented with transfer learning strategies.

3.4.1 Transfer Learning - experimentally verified and pre-learned sources

The main idea of transfer learning (Tan et al., 2018) is to leverage existing knowledge in one setting to learn a task in a different but related setting. Here we injected existing knowledge in two ways. One was through experimentally verified information. The other was information learned from a model with a much larger data set.

Figure 3.4 PFM construction process.

As the kernels in the CNN layer of DeepCAT act as DNA motif finders, we experimented with initializing the kernels with known A. thaliana TFBMs. That is, we initialize with the resulting position weight matrices (PWM). To do this, consider a set of n aligned sequences of length m. We construct the position frequency matrix (PFM) of size 4 × m with the counts of each base at each position over the sequences. We then normalize each column to get the 4 × m position probability matrix, in which each column (position) sums to 1. Finally, we compute the PWM as the log odds of the position probabilities against a background model. In the simplest case, we use a uniform background model, i.e. each base is equally likely to occur in a sequence.

Moreover, the DanQ model used a massive amount of human gene data (> 4 million), so we also experimented with initializing the kernels with the kernel weights learned in the DanQ model. We found that implementing these transfer learning methods in DeepCAT led to better performance across nearly all 57 stresses (Figures 3.5 and 3.6).

3.4.2 Multi-task learning - Stress Grouping

Our previous results are based on learning all 57 tasks (57 stress responses) simultaneously. Hence we are in a multi-task learning (MTL) setting.
In this setting, the learning process seeks the shared representation (feature mapping of the sequences) that best predicts response to each of the 57 stress types. This can be problematic, however, because different stress types may induce very different underlying regulatory mechanisms, and finding a good shared representation may not be possible. The expectation is that in an MTL setting, stresses with similar underlying regulatory mechanisms will benefit from each other, while stresses with very different underlying regulatory mechanisms may hinder performance.

Figure 3.5 Performance of DeepCAT with kernels initialized from weights learned from the DanQ human model - updated original figure from Dr. Yuning Hao.

We tested this hypothesis by finding related stresses through k-means (k=3) clustering of the stress responses, and trained three models separately, one per cluster. We also paired this with a transfer learning scheme by initializing the convolutional kernels with known A. thaliana TFBMs. Overall we found that both of these experiments led to better performance, with the latter yielding the best performance (Figure 3.8). Figure 3.7 shows the clustering hierarchy of the stress types.

3.5 Motif Learning and Location Dependencies

An interesting result is that the DeepCAT model can be interpreted as a motif learner, by translating the kernels in the convolution layer to positional weight matrices (B. Alipanahi and Frey, 2015). We aligned these to known motifs from the DAP-seq and CIS-BP databases using the TOMTOM software (see meme-suite.org/meme/tools/tomtom). Of the 319 motifs learned by our model, 114 significantly match known motifs (E < 0.1); a threshold of 0.05 was used for p-value to measure the similarities.

Figure 3.6 Performance of DeepCAT with kernels initialized from experimentally verified TFBMs - updated original figure from Dr. Yuning Hao.
Figure A.7 shows this process and Figures A.8-A.11 show the resulting learned TFBMs; all are found in the appendix.

Analyzing the scores in the attention module, we found interactions of motifs at different positions. From Figure A.11 we can see that the attention model identifies interactions between base-pairs at long ranges, and thus identifies long-range co-localization dependencies.

Figure 3.7 Clustering hierarchy of the stress types from k-means clustering. Three large clusters are identifiable, with red highlighted stresses being heat related - original figure from Dr. Yuning Hao.

Figure 3.8 Performance of DeepCAT with known TFBM initialized kernels, and the clustered response multi-task model (also with TFBM initialized kernels) - original figure from Dr. Yuning Hao.

Figure 3.9 Task-specific architectural variations.

3.6 Additional Experiments and Future directions

A direction I explored, and continue to explore, for mitigating negative transfer between stress types is the addition of task-specific layers. In its most basic form, the idea is to compose the model of two modules. The first is a shared module, where the parameters are shared between all the tasks. This is the set-up of the basic multi-task architecture. The second module consists of task-specific silos or towers. These serve as the output layers, where each task has a dedicated silo and parameters are not shared between the tasks. These architectures can be improved by incorporating gating mechanisms that limit the amount of shared information each silo receives as input, as well as by incorporating a mixture of experts. At this point I have only done preliminary benchmarking exercises with these architectures and have not yet seen significant improvement over our standard DeepCAT architecture.
Nonetheless, with DeepCAT we have shown how deep sequence learning and other learning mechanisms, such as grouped learning and transfer learning, can move us towards solving the problem of plant stress response prediction. Moreover, these methods are able to learn and extract key motifs and long-range motif interactions, which are important components of understanding regulatory grammar, and hence gene regulation.

CHAPTER 4
CELL TYPE DECONVOLUTION

4.1 Introduction

Quantifying cell type composition (proportion) is an important tool to better understand and characterize diseases and other abnormalities through differences in cell type composition between experimental groups (Karagiannis et al., 2022). With single-cell transcriptomics data, this can be done by annotating the cell type of each cell. Cell type annotation is often done computationally by comparing expression patterns in a cell with reference cell type expression profiles using maximum likelihood approaches. Segmentation-based methods are also used, as well as expert validation (when the sample size is small). While advances in high-throughput transcriptomics technologies have facilitated this task by providing single-cell resolution, bulk sequencing is significantly less labor intensive and costly (Jin and Liu, 2021). Estimating cell type composition in bulk samples offers a more economical approach, and provides a way to make use of a wealth of public bulk data.

A more recent motivating factor for estimating cell type composition in bulk data comes from developments in spatial transcriptomics, as doing so provides a way to understand cell type composition in a spatial context within a tissue. Moreover, spatial transcriptomics technologies measure gene expression of small spatially tagged regions, but most platforms do not have single-cell resolution. An issue with bulk sampling of biomolecular information, such as bulk RNA-seq, is that information is averaged across a cell pool that is often heterogeneous.
This makes it difficult to deconfound cell type composition from differences in the molecular profile between experimental groups (Repsilber et al., 2010). Cell type deconvolution is the task of estimating and deconfounding the composition of cell types in bulk samples of biomolecular information. This is a type of inverse problem, as we are trying to determine the signal of individual cell types from aggregated readings across multiple cell types. Solving this task then requires some transfer learning approaches, using single-cell data as a reference (transfer source) and bulk data as a query (transfer target). Typically, gene expression data are used, though other data such as protein expression (Okendo et al., 2022) or DNA methylation (Singh et al., 2021) have also been used.

4.2 Problem Formulation

The input of the task of cell type deconvolution consists of three components: 1) bulk gene expression (to be deconvoluted), 2) reference scRNA-seq, and 3) (optional) spot coordinates to indicate the location within the tissue of each spot. Note that additional data, such as histology images, can be leveraged as well.

The problem is formulated as follows. We are given bulk expression data $Y \in \mathbb{R}^{d \times n}$, where each cell-pool $i \in [1, n]$ is composed of a mixture of cell types $[1, K]$. Then for each cell-pool $i \in [1, n]$, we wish to construct an estimator $\hat{a}_i \in \Delta^{K-1}$ of the true cell type composition $a_i \in \Delta^{K-1}$, where $\Delta^{K-1}$ is the standard $(K-1)$-simplex

$$\Delta^{K-1} = \Big\{ x \in \mathbb{R}^{K} : \sum_{k=1}^{K} x_k = 1,\ x_k \ge 0 \text{ for } k = 1, 2, \ldots, K \Big\} \tag{4.1}$$

We are also given some reference scRNA-seq expression data $X \in \mathbb{R}^{d \times N}$ with one-hot labeled cell types $C \in \mathbb{R}^{N \times K}$. We then construct the estimator of cell type compositions for the $n$ cell-pools by some function

$$\hat{B} = F(Y, X, C) \in \mathbb{R}^{K \times n} \tag{4.2}$$

If the spatial information (2D or 3D coordinates in the given tissue)

$$S = \begin{bmatrix} s_1 \\ \vdots \\ s_n \end{bmatrix} \in \mathbb{R}^{n \times m}, \quad m \in \{2, 3\}$$

of the cell-pools is available, then we may incorporate it into the estimator of cell type compositions for the $n$ cell-pools by some function

$$\hat{B} = F(Y, S, X, C) \in \mathbb{R}^{K \times n} \tag{4.3}$$

We can further generalize this setup to account for multi-batch data.

Reference: $N$ reference single cells, $K$ cell types
Target: $T$ experiments/batches of $n_t$ cell-pools

Observed:
  $Y^{(t)} = (y^{(t)}_{i,j}) \in \mathbb{R}^{d \times n_t}$ : bulk mRNA counts
  $X = (x_{i,j}) \in \mathbb{R}^{d \times N}$ : reference single-cell mRNA counts
  $c(i)$ : cell type of reference cell $i \in [N]$
  $\tilde{X} = (\tilde{x}_{i,j}) \in \mathbb{R}^{d \times K}$ : estimate of the mean cell-type profiles $\mu$

Unobserved:
  $\mu = (\mu_{i,j}) \in \mathbb{R}^{d \times K}$ : true mean cell type profiles
  $B^{(t)} = (\beta^{(t)}_{i,j}) \in \mathbb{R}^{K \times n_t}$ : cell type abundances of cell-mixtures
  $\beta^{(t)}_{\cdot,j} \in \Delta^{K-1} = \{ x \in \mathbb{R}^{K} : \sum_{k=1}^{K} x_k = 1,\ x_k \ge 0 \text{ for } k = 1, 2, \ldots, K \}$

For simplicity, we return to the single batch setup to further define the deconvolution problem under two different scenarios: with and without ground truth cell type compositions.

Unsupervised solution:

$$B^*(Y; X) = \operatorname*{argmin}_{B} \min_{\theta} L\big( Y, \hat{Y}(B, \theta; X) \big), \quad \text{where } \hat{Y}(B, \theta; X) = h(X; \theta) B$$

Supervised solution:

$$B^*(Y; X) = \hat{B}(\theta^*; X, Y), \quad \text{where } \hat{B}(\theta; X, Y) = f(X, Y; \theta) \text{ and } \theta^* = \operatorname*{argmin}_{\theta} L\big( B, \hat{B}(\theta; X, Y) \big)$$

Baseline unsupervised solutions:

(1) multivariate linear regression, independently for each sample mixture

$$\beta^*_{LS} = \operatorname*{argmin}_{\beta} \| y - X\beta \|^2$$

(2) non-negative constraint of (1)

$$\beta^*_{NNLS} = \operatorname*{argmin}_{\beta \succeq 0} \| y - X\beta \|^2$$

(3) multivariate regression with multiplicative log-normal errors, independently for each sample mixture

$$\beta^*_{lnm} = \operatorname*{argmin}_{\beta \succeq 0} \| \log(y) - \log(X\beta) \|^2$$

Typically, cell type deconvolution models are evaluated on datasets with ground truth cell type proportions using mean squared error (MSE), mean absolute error (MAE), correlation, cross-entropy and Jensen-Shannon divergence (JSD). In most cases, however, non-simulated datasets do not have ground truth cell type proportions.
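For datasets that do have ground truth, several of the evaluation metrics listed above can be sketched as follows. This is a minimal NumPy illustration: the function names `jsd` and `evaluate` are hypothetical, proportions are assumed to be stored as $K \times n$ matrices with one column per cell-pool (matching the notation above), Pearson correlation is taken over all entries, and JSD is reported in nats. Other conventions, such as per-cell-type correlation, are equally plausible.

```python
import numpy as np


def jsd(p, q):
    """Jensen-Shannon divergence (in nats) between two composition vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                      # terms with a = 0 contribute zero
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def evaluate(B_true, B_hat):
    """Compare true and estimated (K, n) composition matrices."""
    B_true = np.asarray(B_true, dtype=float)
    B_hat = np.asarray(B_hat, dtype=float)
    return {
        "mse": float(np.mean((B_true - B_hat) ** 2)),
        "mae": float(np.mean(np.abs(B_true - B_hat))),
        "pearson": float(np.corrcoef(B_true.ravel(), B_hat.ravel())[0, 1]),
        "jsd": float(np.mean([jsd(B_true[:, i], B_hat[:, i])
                              for i in range(B_true.shape[1])])),
    }


# Toy example (assumed): 3 cell types, 2 cell-pools
B_true = np.array([[0.5, 0.2], [0.3, 0.3], [0.2, 0.5]])
B_hat = np.array([[0.4, 0.25], [0.35, 0.3], [0.25, 0.45]])
metrics = evaluate(B_true, B_hat)
print(metrics)
```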
In this unsupervised setting, if profiled marker proteins are also provided with the dataset, one evaluation metric (Danaher et al., 2022) is the correlation between predicted cell type proportions and the respective marker proteins.

4.3 Survey of Methods

Most classical methods for cell type deconvolution are based on non-negative matrix factorization (NNMF). The most basic method is non-negative least squares (NNLS), where some reference scRNA-seq gene expression is used to create a cell-profile matrix $\tilde{X} \in \mathbb{R}^{d \times K}$, onto which the bulk gene expression is then regressed. The resulting (non-negative) coefficients are then used as the cell type composition estimates:

$$\hat{B} = \operatorname*{argmin}_{B \ge 0} \| Y - \tilde{X} B \|_F \tag{4.4}$$

Here, the idea is that the single cells’ expression will aggregate linearly, according to their proportion in the bulk sample. The cell profile or signature matrix is typically constructed through the median or mean across cells within each cell type of interest. A penalized NNLS approach is taken by DWLS (Tsoucas et al., 2019), which applies a dampened weighting scheme to the standard NNLS framework. Here, each gene’s error term is weighted by the squared inverse of the predicted bulk expression level. This is done to reduce bias towards highly expressed genes, or genes that are highly represented across cell types.

Most other traditional methods build on these ideas. NMFreg (Rodriques et al., 2019a) applies non-negative matrix factorization (NNMF) to the reference $X$ to construct a basis in a lower dimensional gene space,

$$W, H = \operatorname*{argmin}_{W', H' \ge 0} \| X - W' H' \|_F \tag{4.5}$$

where the rows of $H \in \mathbb{R}^{K \times N}$ are the cell-topic embeddings, and the columns of $W \in \mathbb{R}^{d \times K}$ the corresponding weightings. The cell-topic profiles are then used for the deconvolution of the bulk data via NNLS:

$$\hat{B} = \operatorname*{argmin}_{B \ge 0} \| Y - H B \|_F \tag{4.6}$$

Building on NMFReg, SPOTlight Elosua-Bayes et al.
(2021b) uses non-negative matrix factorization to produce the cell-topic profile matrix. Taking $W, H$ from the first step of NMFReg, SPOTlight then constructs spot-topic profiles $P \in \mathbb{R}^{K \times n}$ through NNLS of $X$ onto $W$:

$$P = \operatorname*{argmin}_{P' \ge 0} \| X - W P' \|_F \tag{4.7}$$

Cell-topic profiles $\tilde{H} \in \mathbb{R}^{K \times K}$ are then constructed from $H$ by taking the median over each cell type. Finally, the estimator of cell type compositions for the bulk data is given by

$$\hat{B} = \operatorname*{argmin}_{B \ge 0} \| P - \tilde{H} B \|_F \tag{4.8}$$

Altering the classical assumption of an additive error linear model, SpatialDecon (Danaher et al., 2022) implements a non-negative linear regression-based method that assumes a log-normal multiplicative error model. The log-normal error model is given by

$$\log(Y_{i\cdot}) = \log(\tilde{X}_{i\cdot}^{T} B) + \epsilon_i, \quad \text{where } \epsilon_i \sim N(0, \sigma^2 I_n) \text{ and } B \in \mathbb{R}^{K \times n} \tag{4.9}$$

The estimator of cell type compositions for the $n$ cell-pools is then given by

$$\hat{B} = \operatorname*{argmin}_{B \ge 0} \| \log(Y) - \log(\tilde{X}^{T} B) \|^2 \tag{4.10}$$

One of the first methods to incorporate spatial information in the deconvolution of spatial transcriptomics data is Conditional AutoRegressive Model-based Deconvolution (CARD) (Ma and Zhou, 2022). CARD applies a conditional autoregressive (CAR) assumption on the coefficients of the classical non-negative linear model between the bulk expression $Y$ and a cell-profile matrix $\tilde{X}$. The linear model is given by

$$Y = \tilde{X} B + \epsilon, \quad \epsilon \sim N(0, \sigma_e^2 I_n) \tag{4.11}$$

The CAR assumption then incorporates 2D spatial information $S \in \mathbb{R}^{n \times 2}$ through an intrinsic prior on the cell type compositions (the model coefficients), modeling the composition at each location as a weighted combination of the compositions at all other locations. This modeling assumption is given by

$$B_{ki} = b_k + \phi \sum_{j=1, j \ne i}^{n} W_{ij} (B_{kj} - b_k) + \epsilon_{ki}, \quad \epsilon_{ki} \sim N(0, \sigma_{ki}^2) \tag{4.12}$$

where the weights $W_{ij}$ are given by the Gaussian kernel

$$W_{ij} = K_G(s_i, s_j; \sigma^2) = \exp\Big( -\frac{\| s_i - s_j \|_2^2}{2\sigma^2} \Big) \tag{4.13}$$

with default scaling parameter $\sigma^2 = 0.1$.
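To make the spatial weighting in (4.12) and (4.13) concrete, the sketch below computes the Gaussian kernel weights from spot coordinates and the CAR conditional mean for one cell type. It is an illustration only and not the CARD implementation: the function names, the toy coordinates, and the values of $b_k$ and $\phi$ are assumptions, and the coordinates are taken to be already scaled so that $\sigma^2 = 0.1$ is meaningful.

```python
import numpy as np


def gaussian_kernel_weights(coords, sigma2=0.1):
    """Pairwise weights W_ij = exp(-||s_i - s_j||^2 / (2 * sigma2)), as in (4.13)."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]   # (n, n, m) pairwise differences
    sq_dist = np.sum(diff ** 2, axis=-1)
    W = np.exp(-sq_dist / (2.0 * sigma2))
    np.fill_diagonal(W, 0.0)                         # the sum in (4.12) excludes j = i
    return W


def car_conditional_mean(B_k, b_k, phi, W):
    """Mean of B_ki given the other locations: b_k + phi * sum_j W_ij (B_kj - b_k)."""
    B_k = np.asarray(B_k, dtype=float)
    return b_k + phi * (W @ (B_k - b_k))


# Toy example (assumed): three spots on a unit grid
coords = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = gaussian_kernel_weights(coords, sigma2=0.1)
cond_mean = car_conditional_mean(B_k=[0.2, 0.4, 0.6], b_k=0.4, phi=0.5, W=W)
print(W)
print(cond_mean)
```

Nearby spots receive weights close to 1 and distant spots weights close to 0, so the prior pulls each location's composition towards that of its neighbors.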
CARD then estimates the cell type composition of the spatial transcriptomic data through constrained maximum likelihood estimation.

Some recent developments in cell type deconvolution have applied deep learning-based methods. These approaches typically apply a transfer learning scheme wherein they first simulate bulk data from scRNA reference data, and use a common network to predict the cell type composition of both the simulated and real bulk data. A notable feature of the deep learning-based methods is that they model the cell type compositions directly, i.e. the model objective is on the predicted cell type compositions. This contrasts with most classical methods, where the predicted cell type proportions are the optimal parameters/coefficients from some regression model.

One of the early deep learning approaches to the cell type deconvolution problem is Scaden (Menden et al., 2020). First, cells are randomly sampled from the scRNA reference data to generate simulated mixed-cell samples. A fully-connected network is then trained to predict the true cell type compositions of the simulated bulk data, with a cross-entropy loss function. This trained model is then applied to the real bulk data to get cell type compositions.

Building on this approach, DSTG (Song and Su, 2021b) is a Graph Neural Network (GNN) based method, modeling similarities in expression between different bulk samples. First, the pseudo bulk expression data is generated by taking $n_p$ random samples (with replacement) of 2 to 8 cells from the scRNA-seq reference, aggregating their UMI counts, and downsampling to adjust for realistic bulk UMI counts. The pseudo and real bulk data are then aligned in a lower dimensional ($S < d$) gene space using Canonical Correlation Analysis (CCA). The projections to the $s = 1, 2, ..., S$ dimensions are given by the canonical variables

$$U_s = \tilde{X}\mu_s^*, \quad V_s = X\nu_s^* \tag{4.14}$$

where

$$\mu_s^*, \nu_s^* = \operatorname*{argmax}_{\mu_s, \nu_s \in \mathbb{R}^d} \{\mu_s^T \tilde{X}^T X \nu_s\} \quad \text{s.t. } U_s^T U_{s'} = V_s^T V_{s'} = \delta_{ss'} \tag{4.15}$$

are the canonical correlation vector pairs. These embeddings are then used to construct a graph by considering Mutual Nearest Neighbors (MNN) as adjacent in the graph. That is, given a pair of sample cell-pools $i, j$, we let

$$A_{i,j} = \begin{cases} 1 & \text{if } i \text{ and } j \text{ are mutual nearest neighbors} \\ 0 & \text{otherwise} \end{cases} \tag{4.16}$$

Here, adjacencies can be between simulated-to-real and real-to-real samples. With $X_{in} = [\tilde{X}\ X] \in \mathbb{R}^{d \times N}$ ($N = n_p + n$) and the normalized adjacency matrix $\tilde{A}$ as input, the $L \ge 1$ (default 1) graph convolution (GCN) layers of the DSTG model are given by

$$H^{(0)} = X_{in}, \quad H^{(l)} = \mathrm{ReLU}(\tilde{A} H^{(l-1)} W^{(l)}) \text{ for } l \in [1, L] \tag{4.17}$$

where $W^{(l)}$ is the weight matrix for the $l$th layer. The output of the DSTG model is the predicted composition of $K$ cell types, given by

$$\begin{bmatrix} \hat{B}_p \\ \hat{B} \end{bmatrix} = \mathrm{softmax}(\tilde{A} H^{(L)} W) \in \mathbb{R}^{N \times K} \tag{4.18}$$

where $\hat{B}_p$ and $\hat{B}$ are the predictions for the pseudo and real cell-pools, respectively. The loss function is then defined as the cross-entropy between the predicted and true cell type compositions of the pseudo cell-pools

$$L = -\sum_{i=1}^{n_p} \sum_{k=1}^{K} y_{i,k}^{(p)} \ln(\hat{y}_{i,k}^{(p)}) \tag{4.19}$$

where $\hat{y}_{i,k}^{(p)}$ and $y_{i,k}^{(p)}$ are the predicted and true composition of cell type $k$ in the $i$th pseudo cell-pool.

I have summarized a list of existing cell type deconvolution methods in Table 4.1, and in Table 4.2 I highlight select benchmark datasets. Note that this benchmark collection was made prior to our development of large-scale benchmark data.

Table 4.1 A summary of tools for cell type deconvolution.
| Tool | Algorithm | Description | Language | Availability |
| NMFReg | Classical | A non-negative matrix factorization of an annotated scRNA reference matrix | Matlab, Python | NMFReg Rodriques et al. (2019a); NMFReg-Python |
| SPOTlight | Classical | Extension of NMFReg, with non-negative matrix factorization applied to both the scRNA reference matrix and the bulk expression matrix | R, Python | SPOTlight Elosua-Bayes et al. (2021b); Dance Ding et al. (2022) |
| DWLS | Classical | Weighted NNLS; dampened weighting is applied to genes | R | DWLS Tsoucas et al. (2019) |
| SpatialDWLS | Classical | A subset of cell types chosen via PAGE enrichment analysis | R | SpatialDWLS Dong and Yuan (2021) |
| SpatialDecon | Classical | A multiplicative log-normal error model | R, Python | SpatialDecon Danaher et al. (2022); Dance Ding et al. (2022) |
| cell2location | Variational Inference | Bayesian hierarchical model of spatial expression counts with a spatially informed prior on cell type compositions | Python | cell2location Kleshchevnikov et al. (2022b) |
| CARD | Classical | Conditional autoregressive based model that incorporates spatial correlation of cell type composition | R, Python | CARD Ma and Zhou (2022); Dance Ding et al. (2022) |
| RNA-Sieve | Classical | A likelihood based inference model that estimates cell type proportion through a maximum-likelihood method | Python | RNA-Sieve Dan D. Erdmann-Pham and Song (2021) |
| Scaden | GNN | A fully-connected network that is trained on simulated bulk data, and used to predict cell type compositions of real bulk data | Python | Scaden Menden et al. (2020) |
| DSTG | GNN | A graph convolutional network whose graph is constructed on Mutual Nearest Neighbors of low-dimensional embeddings of simulated and real bulk data | R, Python | DSTG Song and Su (2021b); Dance Ding et al. (2022) |

4.4 Cell Type Deconvolution Benchmark Dataset and Model Development

As mentioned in the background chapter, section Spatial & Bulk Deconvolution, there is a lack of quality benchmark data for cell type deconvolution.
Again, such datasets must be generated through either synthetic mixture experiments or some form of random sampling from a reference scRNA-seq dataset. In either case, most datasets are limited in size, and/or do not reflect real conditions. Towards this end, we develop a method to generate large yet realistic cell type deconvolution benchmark datasets, from which we have generated a human tumor microenvironment dataset consisting of 1.8 million cells.

Additionally, we build on ideas from DSTG and develop a spatially informed Graph Neural Network based method, GNNDECONVOLVER. Here, we build the model framework to incorporate spatial information, if it is available. Prior to this, only a small set of classical (shallow) methods, such as CARD, have incorporated spatial information. With this method, we can leverage reference scRNA-seq data with and without spatial information, to infer cell type compositions of bulk mixtures with and without spatial information.

To validate GNNDECONVOLVER, we carry out a set of experiments on the large-scale benchmark dataset we have generated. In this benchmarking, we will see that GNNDECONVOLVER performs strongly against a set of existing state-of-the-art methods. For fairness, we have included methods that incorporate spatial information.

Table 4.2 A summary of datasets for cell-type deconvolution.

| Dataset | Species | Tissue | Dataset Dimensions | Protocol | Availability |
| Mouse Posterior Brain 10x Visium Data | Mouse | Posterior brain | 3,353 spots, 31,053 genes | 10X Visium | MPB10xV lin (d) |
| Mouse Olfactory Bulb | Mouse | Olfactory bulb | 1,185 spots, 11,176 genes | 10X Visium | MOB10xV lin (c) |
| HEK293T and CCRF-CEM cell line mixture | Human | | 56 mixtures, 1,414 genes | NanoString GeoMx | CelllineGeoMx lin (a) |
| Human PDAC | Human | Pancreas | 1,819 spots, 19,738 genes | Spatial Transcriptomics | HPdacST lin (b) |
An outcome of this method to generate cell type deconvolution benchmark datasets is an open tool that takes single-cell resolution spatial transcriptomics data and generates synthetic mixtures of varying size. This tool accepts data from many popular spatial transcriptomic platforms, such as 10x Visium, MERFISH, and sci-Space.

Here, we developed cell type deconvolution benchmarking datasets that are larger in scope and higher in quality than current datasets. In terms of quality, most benchmark datasets for cell type deconvolution are created through random sampling of scRNA-seq data, wherein the number of sampled cells is randomly chosen within some range that matches typical ranges found in a given spatial transcriptomics platform. This sampling process lacks spatial context, as spatial context is lost with scRNA-seq methods.

Table 4.3 A summary of datasets for cell-type deconvolution.

| Tissue sections | Cells | Genes | Fields of view (FOV) | FOV size |
| 8 | 771,236 | 980 | 239 | ∼984.96 µm × 656.64 µm |

We used single-cell resolution spatial transcriptomics datasets generated by the CosMx Spatial Molecular Imager (SMI) to create benchmark datasets with preserved spatial context and large sample size. However, while SMI is high-throughput (up to nearly 1 million cells), it has relatively low multiplex capability (it can target around 1,000 genes and 100 proteins per panel) (He et al., 2022). Data from the CosMx SMI consists of transcriptomic data, cell type annotations, spatial coordinates, histology images, and some protein data. Cell type annotations are made from a negative binomial likelihood model with the mean given by cell type reference profiles, bias added from expected background, and a large size parameter (default 10) to account for overdispersion due to technical sources of variance. This is the model given in Section 2.2, with the detection scale factor set to 1.
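The annotation rule described above can be illustrated with a small sketch. It is not the CosMx implementation: the function names, the toy profiles, and the constant background value are all hypothetical. The only ingredients taken from the text are a negative binomial likelihood whose mean is the reference profile plus background, a size parameter of 10, and a scale factor of 1; each cell is assigned the type with the highest log-likelihood.

```python
import math
import numpy as np

def nb_logpmf(x, mean, size):
    """Log pmf of a negative binomial with the given mean and size
    (variance = mean + mean**2 / size)."""
    return (math.lgamma(x + size) - math.lgamma(size) - math.lgamma(x + 1)
            + size * math.log(size / (size + mean))
            + x * math.log(mean / (size + mean)))

def assign_cell_type(counts, profiles, background, size=10.0, scale=1.0):
    """Return (best type index, per-type log-likelihoods) for one cell."""
    n_types = profiles.shape[1]
    loglik = np.zeros(n_types)
    for k in range(n_types):
        # Mean = scaled reference profile plus expected background
        # (a positive background keeps every mean strictly positive).
        mean = scale * profiles[:, k] + background
        loglik[k] = sum(nb_logpmf(float(x), float(m), size)
                        for x, m in zip(counts, mean))
    return int(np.argmax(loglik)), loglik

# Hypothetical toy reference: 3 genes x 2 cell types
profiles = np.array([[10.0, 1.0], [1.0, 10.0], [5.0, 5.0]])
counts = np.array([12, 0, 4])          # one cell's observed counts
best_type, loglik = assign_cell_type(counts, profiles, background=0.1)
```

In this toy case the cell expresses the first gene strongly and the second not at all, so the first profile yields the higher likelihood.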
4.4.1 Non-small cell lung cancer tumors

For this experiment, we used the NanoString CosMx open-source non-small-cell lung cancer (NSCLC) dataset. This dataset consists of transcriptomic, spatial, and histology image data for 8 samples of 5 NSCLC tumors. A data summary is given in Table 4.3. The cell type compositions of each sample can be seen in physical space, and in gene space through gene expression UMAP projections, in Figure 4.1.

Figure 4.1 Composition of CosMx Lung Samples in A. physical space, and B. gene space through gene expression UMAP projections. Figures from He et al. (2022).

To generate pseudo-spot data, we divided each FOV spatially into a uniform grid (see Figure 4.3). The uniform spot (grid rectangle) sizes were chosen to cover an area of 37456.28 µm², which is the mean area of spots in another NSCLC spatial transcriptomics dataset from Nanostring's GeoMx platform (not single-cell resolution). The ultimate purpose of this is to test these pseudo-spot data as a reference for deconvolution of the GeoMx generated data. To allow for spatial context in the pseudo spots, we simply use spot centers as the coordinates of each spot. Figures 4.2-4.3 illustrate this two step process of applying an FOV grid on the tissue, and a second layer grid on each FOV.

Figure 4.2 FOV grid overlaying tissue sample.

Figure 4.3 Layout of pseudo-spots as a grid over each FOV.

Figure 4.4 Cell type deconvolution benchmark data results - NSCLC tumors.

In addition to this benchmark dataset, we developed a GNN-based model by modifying and building upon ideas from the GNN-based model DSTG (Song and Su, 2021b) that is detailed in the Survey of Methods subsection. An important change we made was in the graph construction. DSTG only uses CCA embedded expression data to define adjacencies through Mutual Nearest Neighbors, and does not allow for the integration of spatial information.
Our model incorporates spatial information by defining graph adjacencies from expression data and spatial location.

We used these benchmark pseudo-spot data to validate our model. The common experimental design we use is to take one or more samples as references (training and/or validation sets), and one sample as the query (test set). We used both mean squared error (MSE) and mean absolute error (MAE) as evaluation metrics. Our GNN performed best in the majority of experiments, in both the 18 cell type and 17 cell type settings. See Figure 4.4 for a breakdown of the experimental settings, and the performance results.

4.5 Spot Data and Deconvolution Methodology

Here, we build on the preliminary developments of spot data generation and the graph-based deconvolution scheme proposed.

4.5.1 Dataset

The single-cell resolution spatial transcriptomic data we used is from the CosMx platform (He et al., 2021) by Nanostring, which uses a spatial molecular imaging technique. Through our collaboration with Nanostring, we collected 20 samples from human tissue in lung, kidney, and liver. All the samples contain tumor microenvironments. Each dataset was generated from a 960-to-1000-plex CosMx RNA panel run on CosMx SMI. Here we describe the data from each tissue in detail.

Human Lung. This dataset consists of 8 samples over 5 NSCLC (non-small cell lung cancer) tissues. The resulting dataset contains measurements from 960 targets over 800,327 cells, of which 766,313 cells are analyzed. In total, 259,604,214 transcripts are detected. In these samples, each cell was experimentally labeled (by CosMx) as one of 18 detected cell types.

Human Kidney. This dataset consists of 10 samples of tissue taken from lupus nephritis patients, via kidney core biopsy. The resulting dataset contains nearly 300,000 cells.

Human Liver. This dataset has subcellular resolution, and consists of 1,000 genes over 800,000 cells.
These samples cover a 180 mm² area of liver tissue, from 1 normal liver and 1 hepatocellular carcinoma tissue.

Figure 4.5 An Overview of SPATIALCTD. (a) The method for generating the SPATIALCTD dataset. SPATIALCTD comprises three distinct human tissues, namely, the lung, kidney, and liver. For each sample in each tissue, SPATIALCTD consists of a spot gene expression file, a spot location file, and a ground truth file. (b) A summary of SPATIALCTD.

4.5.2 Pseudo spot generation

As in the pseudo spot generation process described in the previous section, we impose a grid on the single-cell resolution spatial transcriptomics data. We choose a spot size to yield multiple cells per spot, with an average size similar to lower resolution spatial transcriptomics methods. We then have the cell type compositions of these pseudo spots, but in a realistic setting since we are generating them within the spatial context of the tissue samples.

We assign the cells to pseudo spots based on the centroid coordinates of each cell. That is, a cell is assigned to the pseudo spot whose boundary contains its centroid. In our data, no cell's centroid sat on a boundary line, so we did not have to deal with that assignment problem. Had one occurred, we could have randomly assigned the cell to one of the pseudo spots forming that boundary. We take the expression measure of the pseudo spots by aggregating the expression over the cells within the pseudo spot. One thing to note is that we are operating under the simple assumption that expression from bulk mixtures is the sum of the expression of the cells within the mixture. This may not be the case; indeed, some studies have shown a log-normal aggregation, but it is a widely used assumption (hence NMF based methods) and a simple starting point. We then compute the cell type composition of the pseudo spots from the cell types of the cells within the pseudo spots, which serves as the ground truth labels of the pseudo spots. We define the spatial location of the spots from their centroid coordinates.
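The procedure above can be sketched in a few lines. This is an illustrative sketch rather than the released tool: the function name, argument names, and the toy data are hypothetical, and a centroid lying exactly on a grid line is sent to the higher-index bin by the floor operation, which is a deterministic stand-in for the random tie-breaking mentioned in the text. Only non-empty grid cells become spots.

```python
import numpy as np

def make_pseudo_spots(xy, counts, cell_types, n_types, spot_w, spot_h):
    """Bin cells into a uniform grid by centroid and summarize each non-empty bin."""
    xy = np.asarray(xy, dtype=float)
    counts = np.asarray(counts, dtype=float)
    cell_types = np.asarray(cell_types, dtype=int)

    # Grid row/column of every cell centroid
    col = np.floor(xy[:, 0] / spot_w).astype(int)
    row = np.floor(xy[:, 1] / spot_h).astype(int)
    bins = list(zip(row.tolist(), col.tolist()))

    # Give each non-empty bin a spot ID and map cells to spots
    keys = sorted(set(bins))
    spot_id = {key: idx for idx, key in enumerate(keys)}
    cell_to_spot = np.array([spot_id[b] for b in bins])

    n_spots = len(keys)
    expression = np.zeros((n_spots, counts.shape[1]))
    composition = np.zeros((n_spots, n_types))
    np.add.at(expression, cell_to_spot, counts)              # additive aggregation
    np.add.at(composition, (cell_to_spot, cell_types), 1.0)  # cells per type
    n_cells = composition.sum(axis=1)
    composition = composition / n_cells[:, None]             # ground truth proportions

    # Spot location = center of its grid rectangle
    centers = np.array([[(c + 0.5) * spot_w, (r + 0.5) * spot_h] for r, c in keys])
    return {"cell_to_spot": cell_to_spot, "n_cells": n_cells, "centers": centers,
            "expression": expression, "composition": composition}

# Toy data: 4 cells, 2 genes, 2 cell types, 10 x 10 spots
spots = make_pseudo_spots(
    xy=[[1, 1], [2, 3], [12, 1], [13, 2]],
    counts=[[1, 0], [2, 1], [0, 4], [1, 1]],
    cell_types=[0, 1, 1, 1], n_types=2, spot_w=10.0, spot_h=10.0)
```

In the toy data the first two cells fall into one spot and the last two into its right-hand neighbor, so the first spot has composition (0.5, 0.5) and the second (0, 1).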
This process generates the following data: (1) number of cells in each spot, (2) cell ID, spot ID, and their mapping that defines the cell-to-spot assignment, (3) spatial location of each spot, (4) spot level gene expression, and (5) the ground truth cell type compositions. This procedure has spot size as a parameter, which we set to a realistic size seen in Nanostring's GeoMx platform, which is lower resolution (bulk mixtures). Here, we aimed for spot sizes near the mean and/or median spot size taken by GeoMx, 37456.28 µm² and 24168.74 µm² respectively.

Human Lung. We set the FOV accordingly: 5,472 pixels × 3,648 pixels, 0.18 µm per pixel. Dividing each FOV into 20 pseudo spots, we get a spot area of 32338.2067 µm², which lies between the median and mean GeoMx spot sizes. Lastly, we filtered out low quality spots (spots without cells), resulting in a total of 4,660 spots over the 8 samples, and 771,236 cells. In Figure 4.5 we outline the generation procedure and give a summary of the generated datasets.

Human Kidney. We set the FOV accordingly: 5,472 pixels × 3,648 pixels, 0.18 µm per pixel. Dividing each FOV into 20 pseudo spots, we get a spot area of 32338.2067 µm². After filtering, we have 2,460 spots over 10 samples, consisting of 296,838 cells.

Human Liver. We set the FOV accordingly: 4,236 pixels × 4,236 pixels, 0.12 µm per pixel. Dividing each FOV into 9 pseudo spots, we get a spot area of 28709.9136 µm². After filtering, we have 5,796 spots over 2 samples, consisting of 760,506 cells.

Figure 4.6 An overview of GNNDECONVOLVER and experimental settings. a. An overview of GNNDECONVOLVER. Note that reference refers to the training data with known cell type proportion labels, and query indicates test data the model will predict. b. Four types of experimental settings.
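The pseudo-spot areas quoted for each tissue follow from the FOV pixel dimensions, the pixel size, and the number of spots per FOV. The short check below reproduces that arithmetic; the function name is hypothetical, and it assumes the FOV is split into equal-area spots.

```python
def spot_area_um2(width_px, height_px, um_per_px, spots_per_fov):
    """Area of one pseudo spot when an FOV is split into equal-area spots."""
    fov_area = (width_px * um_per_px) * (height_px * um_per_px)
    return fov_area / spots_per_fov

lung_kidney_area = spot_area_um2(5472, 3648, 0.18, 20)   # about 32338.2067 um^2
liver_area = spot_area_um2(4236, 4236, 0.12, 9)          # about 28709.9136 um^2

# GeoMx reference spot sizes quoted in the text
geomx_mean, geomx_median = 37456.28, 24168.74
```

Both areas fall between the GeoMx median and mean spot sizes, which is the target range stated above.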
4.5.3 GNNDECONVOLVER

To begin, let's assume we have $t$ samples which measure expression of $d$ genes over $n_1, n_2, ..., n_t$ spots. We treat the spots as graph nodes, constructing the graph from their log-normalized expression values (Lytal et al., 2020). The nodes in each sample are then connected according to both expression level and spatial distance. To do this, let $A_{spatial}$ and $A_{gene}$ be the adjacency matrices of the distances and expression respectively. We define $A_{spatial}$ by nearest neighbors, with $K = 5$; that is, each spot is connected to its 5 nearest nodes (spatially). We also considered defining this by specifying a distance threshold. We apply the same construction for $A_{gene}$, except that here we measure closeness with cosine similarity between expression levels. We then define the final adjacency matrix $A_{sample}$ as a weighted sum of these adjacency matrices, $A_{sample} = \alpha A_{spatial} + \beta A_{gene}$. We experimentally tested and set $\alpha$ and $\beta$ to 0.3 and 0.7 respectively. Then, for each sample $t$ the graph is defined by $A_{sample_t} \in \mathbb{R}^{n_t \times n_t}$ and $X_{sample_t} \in \mathbb{R}^{n_t \times d}$, where $A_{sample_t}$ is the final adjacency matrix, and $X_{sample_t}$ is the node representation matrix.

We also connect nodes between different samples, but the spatial context is only within samples, so we construct the between-sample graphs via gene expression only. First, we compute expression similarity of nodes between each sample. We define the adjacency matrix using a nearest neighbors scheme, again with $K = 5$.
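A minimal sketch of the within-sample graph construction described above is given here. The function names are hypothetical, and two details are assumptions not stated in the text: each nearest-neighbor graph is symmetrized, and self-loops are excluded at this stage. The weights 0.3 and 0.7 and the neighbor count of 5 are the values given in the text.

```python
import numpy as np

def knn_adjacency(score, k):
    """0/1 adjacency linking each node to its k highest-scoring other nodes,
    symmetrized so that an edge exists if either node selects the other."""
    n = score.shape[0]
    k = min(k, n - 1)
    s = np.array(score, dtype=float)
    np.fill_diagonal(s, -np.inf)               # a node is not its own neighbor
    nbrs = np.argsort(-s, axis=1)[:, :k]       # indices of the k best scores per row
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    return np.maximum(A, A.T)

def sample_adjacency(X, S, k=5, alpha=0.3, beta=0.7):
    """A_sample = alpha * A_spatial + beta * A_gene for one sample.
    X: (n_spots, d) log-normalized expression, S: (n_spots, 2) coordinates."""
    X = np.asarray(X, dtype=float)
    S = np.asarray(S, dtype=float)
    sq_dist = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1)
    A_spatial = knn_adjacency(-sq_dist, k)     # nearest in space = highest score
    norms = np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    Xn = X / norms
    A_gene = knn_adjacency(Xn @ Xn.T, k)       # highest cosine similarity
    return alpha * A_spatial + beta * A_gene

# Toy sample: 3 spots on a line, 2 genes, one neighbor each
S_demo = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
X_demo = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
A_demo = sample_adjacency(X_demo, S_demo, k=1)
```

An edge supported by both spatial and expression neighbors receives weight $\alpha + \beta = 1$, while an edge supported by only one of them receives 0.3 or 0.7. The between-sample graph would use only the expression term.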
This yields a graph that connects nodes across all $t$ samples, both labeled and unlabeled (cell type compositions). The between-sample graph is then given by $A_{all} \in \mathbb{R}^{n \times n}$ and $X_{all} \in \mathbb{R}^{n \times d}$ where $n = n_1 + n_2 + \cdots + n_t$.

This graph construction yields a linked graph $G = (V, E)$. The task of GNNDECONVOLVER is to predict cell type compositions of unlabeled spots, with both spot features and the graph features defined by the graph of node connections between labeled and unlabeled spots. Namely, we have the input as $[A\ X]$, where $A \in \mathbb{R}^{n \times n}$ is the adjacency matrix, and $X \in \mathbb{R}^{n \times d}$ is the node representation matrix. Again, we have $n$ spots with expression measurements over $d$ genes.

GNNDECONVOLVER consists of two graph convolutional layers, where the second layer is treated as the output layer, i.e. no activation function is applied. These layers are defined accordingly:

$$H^{(l+1)} = \sigma\left(\tilde{A} H^{(l)} W^{(l)}\right) = \mathrm{ReLU}\left(\tilde{A} H^{(l)} W^{(l)}\right) \tag{4.20}$$

where $\tilde{A} = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$ with $\tilde{D}$ the diagonal degree matrix of $A + I$ and $I$ the identity matrix. $H^{(l)}$ is the input from the previous layer. $W^{(l)}$ is the weight matrix of the $l$-th layer. $\mathrm{ReLU}(\cdot)$ is the nonlinear activation function. Here, the input for the first layer is the original node representation $H^{(0)} = X$. We can define GNNDECONVOLVER as the following composition:

$$\hat{Y} = \tilde{A}\, \mathrm{ReLU}\left(\tilde{A} X W^{(0)}\right) W^{(1)} \tag{4.21}$$

where $W^{(0)}$ and $W^{(1)}$ are learned weight matrices, and $\hat{Y}$ is the predicted cell type compositions with $F$ unique cell types. The loss function is defined as the cross-entropy between ground truth and predicted cell type composition:

$$L = -\sum_{i=1}^{n_q} \sum_{f=1}^{F} y_{i,f} \ln(\hat{y}_{i,f}) \tag{4.22}$$

Here, we have $n_q$ labeled nodes, and $\hat{y}_{i,f}$ and $y_{i,f}$ represent the predicted and ground truth proportion of cell type $f$ in spot $i$, respectively. We train the model by minimizing the cross-entropy $L$ on training sets via stochastic gradient descent using backpropagation.
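The forward pass in (4.20)-(4.21) and the loss in (4.22) can be sketched with plain NumPy. This is a sketch under stated assumptions, not the trained model: the weights are random, the toy graph and labels are hypothetical, and a row-wise softmax is applied to the second-layer output before the loss so that the logarithm in (4.22) is well defined. That softmax is an assumption of this sketch, since the text specifies no output activation. Training by gradient descent is omitted.

```python
import numpy as np

def normalize_adjacency(A):
    """A_tilde = D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

def gcn_forward(A_norm, X, W0, W1):
    """Two graph convolutions: ReLU after the first, none after the second."""
    H1 = np.maximum(A_norm @ X @ W0, 0.0)
    return A_norm @ H1 @ W1

def softmax(Z):
    """Row-wise softmax (an assumption of this sketch, see lead-in)."""
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(Y_true, Y_pred, eps=1e-12):
    """Cross-entropy summed over the labeled spots passed in."""
    return -np.sum(Y_true * np.log(Y_pred + eps))

# Hypothetical toy problem: 6 spots, 4 genes, 3 cell types, 8 hidden units
rng = np.random.default_rng(0)
n, d, hidden, F = 6, 4, 8, 3
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T)
np.fill_diagonal(A, 0.0)
X = rng.random((n, d))
Y = rng.dirichlet(np.ones(F), size=n)            # toy ground truth compositions
W0 = rng.normal(scale=0.1, size=(d, hidden))
W1 = rng.normal(scale=0.1, size=(hidden, F))

A_norm = normalize_adjacency(A)
Y_hat = softmax(gcn_forward(A_norm, X, W0, W1))
labeled = np.arange(4)                           # first 4 spots act as the reference
loss = cross_entropy(Y[labeled], Y_hat[labeled])
```

Because the loss is evaluated only on labeled rows while the graph convolution mixes information from all rows, unlabeled query spots still influence the hidden representations of their labeled neighbors.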
4.5.4 Results

Here we test GNNDECONVOLVER against 8 other deconvolution methods, and compare our generated dataset with other synthetic bulk mixture data. We see that GNNDECONVOLVER outperforms all other methods in each evaluation metric. We continue to see this trend as we vary the spot size. Refer back to Figure 4.6 for an overview of the model and experimental setup.

These results show GNNDECONVOLVER to be a useful deconvolution method, and the formative ideas may help guide future method developments, particularly the idea of integrating reference scRNA-seq data with spatial transcriptomics data. We also see that this pseudo spot generation procedure can provide us with more realistic cell type deconvolution benchmark data, at a large scale.

Figure 4.7 Performance of 9 methods in cell type deconvolution. a. A summary of results for 9 cell type deconvolution methods. b-e. Comparison of the models under different settings on SPATIALCTD kidney tissue in terms of MSE, MAE and PCC.

[Figure 4.7 content: panel a is a ranking table with columns Method, Overall, Metric (MSE, MAE, PCC), Dataset (Kidney, Liver, Lung), and Setting (S1-S4) for GNNDeconvolver, RCTD, CARD, DestVI, SpatialDecon, Cell2location, Stereoscope, NNLS, and Tangram; panels b-e show Kidney, Settings 1-4.]

Figure 4.8 Statistical comparison between SPATIALCTD and existing cell type deconvolution benchmark datasets.
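The three evaluation metrics used in this comparison (MSE, MAE, and PCC) can be sketched as follows. The function name and the toy matrices are hypothetical, and computing the Pearson correlation over all flattened entries is an assumption of this sketch; it could equally be computed per spot or per cell type and then averaged.

```python
import numpy as np

def deconvolution_metrics(B_true, B_pred):
    """MSE, MAE, and Pearson correlation between true and predicted compositions.
    Both inputs are (n_spots, n_cell_types) matrices."""
    B_true = np.asarray(B_true, dtype=float)
    B_pred = np.asarray(B_pred, dtype=float)
    err = B_pred - B_true
    return {
        "MSE": float(np.mean(err ** 2)),
        "MAE": float(np.mean(np.abs(err))),
        # Correlation over all entries (assumption, see lead-in)
        "PCC": float(np.corrcoef(B_true.ravel(), B_pred.ravel())[0, 1]),
    }

# Hypothetical toy compositions for two spots and two cell types
B_true = np.array([[0.5, 0.5], [1.0, 0.0]])
B_pred = np.array([[0.4, 0.6], [0.8, 0.2]])
metrics = deconvolution_metrics(B_true, B_pred)
```

Lower MSE and MAE and higher PCC indicate better agreement with the ground truth compositions.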
[Figure 4.8, panel a: table comparing datasets (MPOA, Mouse Brain, Mouse Cortex, Mouse Visual Cortex, Simulated ST for Human Lung, Mouse Embryo, Simulated ST for Human Liver, Mouse Posterior Brain, Mouse Olfactory Bulb, HEK293T & CCRF-CEM, Human PDAC, SPOTlight Synthetic, Our Human Lung, Our Human Kidney, Our Human Liver) by # Cells, # Cell Types, Human, TME, Spot Image, Subcellular Location, Cellular Location, and Cell Composition.]

CHAPTER 5 FURTHER EXPLORATIONS IN DECONVOLUTION

5.1 Towards a Probabilistic Framework for Deconvolution

Let $C$ be the random variable of observing a single cell of type $k \in [1, K] = \{1, 2, ..., K\}$, with distribution

$$C \sim \mathrm{Categorical}_K(w), \quad p(C = k; w) = w_k, \quad w \sim \pi_w \tag{5.1}$$

Let $X$ be the random variable of a single cell's gene expression for $D$ genes, and suppose

$$p_k(x; \theta_k) = p(X = x \mid C = k), \quad \theta = (\theta_1, ..., \theta_K) \sim \pi_\theta \tag{5.2}$$

Then $X \sim f$ where $f$ is the mixture density

$$f(x; \theta, w) = \sum_{k=1}^{K} w_k\, p_k(x; \theta_k) \tag{5.3}$$

To simplify notation, for $k \in [1, K]$ we let

$$X_k = X \mid C = k, \quad \mu_k = E(X_k), \quad \Sigma_k = \mathrm{cov}(X_k) \tag{5.4}$$

Suppose now that we randomly sample $n$ cells $C^{(1)}, ..., C^{(n)} \overset{iid}{\sim} \mathrm{Categorical}_K(w)$. Then the total number of cells of each of the $K$ cell types is given by the random variable $N = (N_1, ..., N_K) \sim \mathrm{Multinomial}_K(n, w)$,

$$p(n_1, ..., n_K; n, w) = \frac{n!}{n_1! \cdots n_K!} \prod_{k=1}^{K} w_k^{n_k}, \quad \text{for } (n_1, ..., n_K)/n \in \Delta^{K-1} \tag{5.5}$$

Suppose we measure the expression $X_k^{(1)}, ..., X_k^{(N_k)} \overset{iid}{\sim} p_k$, $k \in [1, K]$, and aggregate over each cell type

$$Y = \sum_{k=1}^{K} \sum_{i=1}^{N_k} X_k^{(i)} = \sum_{k=1}^{K} N_k H_k, \quad \text{where } H_k = \bar{X}_k = \frac{1}{N_k} \sum_{i=1}^{N_k} X_k^{(i)} \tag{5.6}$$

Letting $B_k = N_k / n$ and $B = (B_1, ..., B_K) \in \Delta^{K-1}$, taking the sample mean we get

$$\frac{Y}{n} = \sum_{k=1}^{K} \frac{N_k}{n} H_k = \sum_{k=1}^{K} B_k H_k \tag{5.7}$$

Note that $E_{p_k}(H_k) = \mu_k$ and $\mathrm{cov}_{p_k}(H_k) = \frac{1}{N_k} \Sigma_k$, so

$$E(Y) = \sum_{k=1}^{K} N_k \mu_k, \quad \mathrm{cov}(Y) = \sum_{k=1}^{K} N_k \Sigma_k \tag{5.8}$$

and

$$E\left(\frac{Y}{n}\right) = \sum_{k=1}^{K} B_k \mu_k, \quad \mathrm{cov}\left(\frac{Y}{n}\right) = \sum_{k=1}^{K} \frac{B_k}{n} \Sigma_k \tag{5.9}$$

If we let $Y = \sum_{i=1}^{n} X^{(i)}$, where $X^{(1)}, ..., X^{(n)} \overset{iid}{\sim} f$, then the distribution of $Y$ is the $n$-fold convolution of $f$:

$$f_Y(y) = (f * f * \cdots * f * f)(y) = f^{*n}(y) \tag{5.10}$$

Also, $n_k H_k \sim p_k^{*n_k}$, and $Y = \sum_{k=1}^{K} n_k H_k \sim (p_1^{*n_1} * p_2^{*n_2} * \cdots * p_{K-1}^{*n_{K-1}} * p_K^{*n_K})(y)$.

5.2 Learning Cell Profiles

Cell type expression profiles have played a key role in cell type deconvolution, as they are what the most basic methods are built on. In chapter 2, section 2, we have shown a standard way of constructing cell type expression profiles from reference scRNA-seq data. These methods are mostly rule based, where we take some normalized reference scRNA-seq data, account for background, and take the median or mean. A direction I thought would be interesting is to develop a deep learning method that learns the cell type expression profile. As I thought about this task, it made sense to try this through the task of deconvolution, which often relies on the cell type profile.

5.2.1 Architecture

The most basic form of this learning method is to take the full reference scRNA-seq data set and pass it through a locally connected neural network (LocNet), specifically organized to only share parameters within each cell type.
Setting the dimension of the penultimate layer to the number of unique cell types and applying a non-negative activation would then yield at least the form of a cell type profile matrix. This is then used in the final output layer as a regression task, where the coefficients are considered the cell type compositions. So, in this setup, we not only learn a cell type profile from reference scRNA-seq data, but we also perform deconvolution and get an estimate of cell type compositions.

Figure 5.1 Locally-connected Neural Network.

Here is a brief setup of the method. Note that we are using the log error model framework. Here we have $D$ genes, $N$ cells from reference scRNA-seq composed of $K$ cell types, and $M$ bulk mixtures.

Reference scRNA expression (raw):

$$X = [X_1\ X_2\ \cdots\ X_K] \in \mathbb{R}^{D \times N}, \quad X_c \in \mathbb{R}^{D \times N_c}, \quad \sum_{c=1}^{K} N_c = N$$

Mixture expression (raw):

$$Y \in \mathbb{R}^{D \times M}$$

We then define LocNet and its locally connected layer as follows.

Locally connected layer:

$$H = \sigma\big(X\, \mathrm{diag}(W_1, W_2, ..., W_K)\big) = \big[\sigma(X_1 W_1)\ \ \sigma(X_2 W_2)\ \ \cdots\ \ \sigma(X_K W_K)\big] \in \mathbb{R}^{D \times PK}, \quad \text{where } W_c \in \mathbb{R}^{N_c \times P} \text{ for } c = 1, ..., K$$

We can extend LocNet to have multiple locally connected layers, with

$$H^{(0)} = X, \quad H^{(i)} = \sigma\big(H^{(i-1)}\, \mathrm{diag}(W_1^{(i)}, W_2^{(i)}, ..., W_K^{(i)})\big)$$

and we define the cell type expression profile by $F(X) = H^{(L)} \in \mathbb{R}^{D \times K}$. The objective is then to minimize the MSLE:

$$\frac{1}{D} \|\log(Y) - \log(F(X)B)\|_2^2$$

Figure 5.2 True vs estimated cell type proportions.

Figure 5.3 LocNet preliminary validation results table.

Figure 5.4 LocNet learned profile vs median profile.

Minimizing this objective, we learn a cell type expression profile $F(X) = H^{(L)}$ and an estimate for the cell type compositions $B$. A small example given in Figure 5.1 will help illustrate the local nature of this network.
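To make the locally connected layer concrete, the sketch below evaluates a single-layer LocNet forward pass and the MSLE objective with NumPy. It is an illustration under assumptions, not the trained model: the function names and toy data are hypothetical, ReLU stands in for the non-negative activation $\sigma$, $P = 1$ so that the single layer already yields one profile column per cell type, and the weights are fixed to uniform averages rather than learned. A small `eps` is added inside the logarithms for numerical safety.

```python
import numpy as np

def locnet_profile(X_blocks, W_blocks):
    """One locally connected layer: W_c is applied only to the cells of type c.
    X_c has shape (D, N_c) and W_c has shape (N_c, P)."""
    return np.hstack([np.maximum(Xc @ Wc, 0.0) for Xc, Wc in zip(X_blocks, W_blocks)])

def msle(Y, F, B, eps=1e-8):
    """(1 / D) * || log(Y) - log(F B) ||^2, with eps guarding the logarithms."""
    D = Y.shape[0]
    return float(np.sum((np.log(Y + eps) - np.log(F @ B + eps)) ** 2) / D)

# Hypothetical toy reference: D = 3 genes, K = 2 cell types with 4 and 5 cells
rng = np.random.default_rng(1)
X1 = rng.random((3, 4)) + 1.0
X2 = rng.random((3, 5)) + 2.0
# Uniform weights make each column of F the mean profile of one cell type
W1 = np.full((4, 1), 1.0 / 4)
W2 = np.full((5, 1), 1.0 / 5)
F = locnet_profile([X1, X2], [W1, W2])           # shape (3, 2)

# Two toy mixtures built from known compositions (columns sum to 1)
B_true = np.array([[0.3, 0.8], [0.7, 0.2]])
Y = F @ B_true
objective_at_truth = msle(Y, F, B_true)
```

With uniform weights the layer reduces to the mean-profile rule, which shows that the rule based profile is one point in the family LocNet searches over; learning the weights lets the profile deviate from the mean or median when that lowers the deconvolution error.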
5.2.2 Preliminary Experimental Results

To validate this method I chose a small cell line mixture dataset from Nanostring GeoMx, which came from a cell pellet array study by Nanostring (Danaher et al., 2022). It is a small dataset, but this is only an early validation to test the utility of the model. It consists of expression data from two cell lines (HEK293T, CCRF-CEM) mixed in a cell pellet array at varying proportions (40 mixtures, 16 pure cell lines). Regions of 700 µm were profiled for 1414 genes with the GeoMx platform. Further, I normalized the expression data with 27 housekeeping genes that were selected using geNorm on the 50 genes with the highest mean expression.

I tested this method against the basic Non-negative Least Squares (NNLS) method and the Log-Normal Regression (LogNormReg) method. I chose these two methods because they both use cell type profiles directly. For both, I used the standard cell type profile construction, with the median expression within each cell type. The results (Figures 5.2 and 5.3) show that LocNet performs strongly across various metrics in estimating true cell type composition. Since the loss function of LocNet is just the Log-Normal Regression model objective, this may suggest that LocNet is learning a better cell type profile than the standard rule-based profile used in NNLS and LogNormReg. Further, the results suggest that the median may be underestimating the true cell type profile.

5.3 Bivariate normal genes for 2 cell-types

Here I wanted to consider deconvolution in a probabilistic framework, beginning with a small, digestible example consisting of only two cell types, with samples drawn from bivariate normal distributions. The setup is as follows.

\[ \nu_c \sim \mathcal{N}\left( \mu_c = \begin{pmatrix} \mu_{1,c} \\ \mu_{2,c} \end{pmatrix}, \; \Sigma_c = \begin{pmatrix} \sigma_{1,c}^2 & \rho_c \sigma_{1,c} \sigma_{2,c} \\ \rho_c \sigma_{1,c} \sigma_{2,c} & \sigma_{2,c}^2 \end{pmatrix} \right), \quad \text{for cell types } c = 1, 2, \text{ with } \nu_1 \perp\!\!\!\perp \nu_2 \]

Reference scRNA: $x_c^{(i)} \sim \nu_c$ (i.i.d.), for $i = 1, \dots, N$.
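As an illustration of this sampling step, the following NumPy sketch draws reference cells for both cell types and places the two blocks side by side. The parameter values are the toy means and covariances this section uses for its example; the seed, the number of cells per type, and the helper names (`sample_reference`, `implied_correlation`) are assumptions made only for this sketch.

```python
import numpy as np

# Toy parameter values quoted from this section (cell types c = 1, 2).
MU = [np.array([45.0, 30.0]), np.array([30.0, 45.0])]
SIGMA = [
    np.array([[60.0, -30.0], [-30.0, 20.0]]),
    np.array([[20.0, -30.0], [-30.0, 60.0]]),
]

def sample_reference(n_cells, rng):
    # Draw n_cells i.i.d. expression vectors per cell type; block X_c is 2 x n_cells.
    blocks = [rng.multivariate_normal(m, S, size=n_cells).T
              for m, S in zip(MU, SIGMA)]
    return np.hstack(blocks)  # columns ordered as [cell type 1 | cell type 2]

def implied_correlation(S):
    # rho_c recovered from Sigma_c: off-diagonal / (sigma_1 * sigma_2).
    return S[0, 1] / np.sqrt(S[0, 0] * S[1, 1])

rng = np.random.default_rng(0)   # seed chosen arbitrarily
N = 1000                         # hypothetical number of cells per type
X = sample_reference(N, rng)     # shape (2, 2N)
rho = [implied_correlation(S) for S in SIGMA]
```

With these covariances the implied correlation between the two genes is about -0.87 for both cell types, so the two clouds are strongly anti-correlated and differ mainly in which gene carries the larger mean and variance.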
We put this in matrix form as:

\[ X = [X_1 \; X_2] \in \mathbb{R}^{2 \times 2N}, \quad \text{where } X_c = [x_c^{(1)} \; x_c^{(2)} \; \cdots \; x_c^{(N)}] \]

Mixed-cell RNA: $y^{(i)} = \beta_1^{(i)} \tilde{x}_1^{(i)} + \beta_2^{(i)} \tilde{x}_2^{(i)}$, for $i = 1, \dots, M$, where $\tilde{x}_c^{(i)} \sim \nu_c$ (i.i.d.), and $1 - \beta_1^{(i)} = \beta_2^{(i)} \in \{0, 1\}$. We put this in matrix form as:

\[ Y = [y^{(1)} \; y^{(2)} \; \cdots \; y^{(M)}] \in \mathbb{R}^{2 \times M} \]

In this toy example we set the mean and covariance parameters as

\[ \mu_1 = \begin{pmatrix} 45 \\ 30 \end{pmatrix}, \quad \Sigma_1 = \begin{pmatrix} 60 & -30 \\ -30 & 20 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} 30 \\ 45 \end{pmatrix}, \quad \Sigma_2 = \begin{pmatrix} 20 & -30 \\ -30 & 60 \end{pmatrix} \]

We take 1000 samples from this distribution and plot their 2D coordinates in Figure 5.5.

Figure 5.5 Toy example: bivariate normal samples.
Figure 5.6 Decision boundary and classifications of MAP, LocNet and NNLS on the toy bivariate normal gene expression samples.

With this toy example, I am able to compute the maximum a posteriori (MAP) estimate for the cell type compositions, against which I compare estimates from LocNet and NNLS in preliminary tests. Interestingly, I found that LocNet performs nearly as well as the MAP estimate and defines very similar decision boundaries.

CHAPTER 6 CONCLUSION

Going back to the Cell Type Deconvolution chapter, recall that scRNA-seq data are used as references for cell type composition estimation of spatial transcriptomic data. There are many variables at play between any pair of these datasets. One major variable is the difference in library preparation between the two sequencing technologies, which can lead to systematic bias that confounds cell type deconvolution results, termed platform effects. RCTD (Cable et al., 2022) and cell2location (Kleshchevnikov et al., 2022b) are two methods that statistically account for platform effects and other sources of gene expression variation to model cell type compositions in spatial transcriptomic data. One direction I would like to take my research is to synthesize these ideas with deep learning methods, especially a GNN-based method that allows for easier multimodal data integration.
Another aim of mine is to continue developing the cell type deconvolution benchmark datasets and to use them to validate my model developments.

A problem highlighted in the Plant Stress Response chapter is that of negative transfer. This is a problem that occurs across many domains, and it interests me greatly. Going back to the experimental results in the Plant Stress Response chapter, DeepCAT shows decent performance relative to well-established shallow and deep learning methods, but accuracy is still low in absolute terms. Thus there are still some challenges to overcome. From our results, we see that the stress grouping matters significantly, and more sophisticated methods to learn the best groupings (i.e., the most related stresses) could help increase testing accuracy. Additionally, we saw that leveraging transfer learning from both the big-data human model and the experimentally verified TFBMs helped increase predictive accuracy. This is another lever for increased accuracy, and hence a direction of great interest. However, one issue I found with the transfer learning schemes is that there were no control mechanisms on the transferred information. We simply used the source model's parameters as initializations and trained on all of the target data (Arabidopsis) from there. We did not account for similarities or differences, for example, in the DNA sequences between humans and Arabidopsis.

Negative transfer is also found in the multi-task learning scheme. Grouping the heat and non-heat related stresses separately did improve on sharing across all stresses. However, even within those groups I found that the model performed particularly poorly for a handful of stresses. Again, control mechanisms would be helpful to limit the sharing of information between stresses when it leads to negative transfer.

BIBLIOGRAPHY

Hek293t and ccrf-cem cell line mixture data. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE174746.
Human pdac data.
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE111672.
Mouse olfactory bulb data. https://www.10xgenomics.com/resources/datasets/adult-mouse-olfactory-bulb-1-standard-1.
Mouse posterior brain 10x visium data. https://support.10xgenomics.com/spatial-gene-expression/datasets/1.0.0/V1_Mouse_Brain_Sagittal_Posterior.
Andersson, A., Bergenstråhle, J., Asp, M., Bergenstråhle, L., Jurek, A., Fernández Navarro, J., and Lundeberg, J. (2020). Single-cell and spatial transcriptomics enables probabilistic inference of cell type topography. Communications biology, 3(1):565.
Asp, M., Bergenstråhle, J., and Lundeberg, J. (2020). Spatially resolved transcriptomes—next generation tools for tissue exploration. BioEssays, 42(10):1900221.
Alipanahi, B., Delong, A., Weirauch, M. T., and Frey, B. J. (2015). Predicting the sequence specificities of dna- and rna-binding proteins by deep learning. Nature Biotechnology, 33(8):831–838.
Bansal, M., Belcastro, V., Ambesi-Impiombato, A., and Di Bernardo, D. (2007). How to infer gene networks from expression profiles. Molecular systems biology, 3(1):78.
Baron, M. and Yanai, I. (2017). New skin for the old rna-seq ceremony: the age of single-cell multi-omics. Genome Biology, 18(1):1–3.
Bartosovic, M., Kabbe, M., and Castelo-Branco, G. (2021). Single-cell cut&tag profiles histone modifications and transcription factors in complex tissues. Nature biotechnology, 39(7):825–835.
Biancalani, T., Scalia, G., Buffoni, L., Avasthi, R., Lu, Z., Sanger, A., Tokcan, N., Vanderburg, C. R., Segerstolpe, Å., Zhang, M., et al. (2021). Deep learning and alignment of spatially resolved single-cell transcriptomes with tangram. Nature methods, 18(11):1352–1362.
Brady, G., Barbara, M., and Iscove, N. (1990). Representative in vitro cdna amplification from individual hemopoietic cells and colonies. Methods in Molecular and Cellular Biology, 2:17–25.
Brehm-Stecher, B. F. and Johnson, E. A. (1990). Single-cell microbiology: Tools, technologies, and applications. Microbiology and Molecular Biology Reviews, 68(2):538–559.
Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y., and Greenleaf, W. J. (2013). Transposition of native chromatin for multimodal regulatory analysis and personal epigenomics. Nature methods, 10(12):1213.
Cable, D. M., Murray, E., Zou, L. S., Goeva, A., Macosko, E. Z., Chen, F., and Irizarry, R. A. (2022). Robust decomposition of cell type mixtures in spatial transcriptomics. Nature Biotechnology, 40(4):517–526.
Chen, A., Liao, S., Cheng, M., Ma, K., Wu, L., Lai, Y., Qiu, X., Yang, J., Xu, J., Hao, S., Wang, X., Lu, H., Chen, X., Liu, X., Huang, X., Li, Z., Hong, Y., Jiang, Y., Peng, J., Liu, S., Shen, M., Liu, C., Li, Q., Yuan, Y., Wei, X., Zheng, H., Feng, W., Wang, Z., Liu, Y., Wang, Z., Yang, Y., Xiang, H., Han, L., Qin, B., Guo, P., Lai, G., Muñoz-Cánoves, P., Maxwell, P. H., Thiery, J. P., Wu, Q.-F., Zhao, F., Chen, B., Li, M., Dai, X., Wang, S., Kuang, H., Hui, J., Wang, L., Fei, J.-F., Wang, O., Wei, X., Lu, H., Wang, B., Liu, S., Gu, Y., Ni, M., Zhang, W., Mu, F., Yin, Y., Yang, H., Lisby, M., Cornall, R. J., Mulder, J., Uhlén, M., Esteban, M. A., Li, Y., Liu, L., Xu, X., and Wang, J. (2022). Spatiotemporal transcriptomic atlas of mouse organogenesis using DNA nanoball-patterned arrays. Cell, 185(10):1777–1792.e21.
Coons, A. H., Creech, H. J., Jones, R. N., and Berliner, E. (1942). The demonstration of pneumococcal antigen in tissues by the use of fluorescent antibody. The Journal of Immunology, 45(3):159–170.
Crosetto, N., Bienko, M., and Van Oudenaarden, A. (2015). Spatially resolved transcriptomics and beyond. Nature Reviews Genetics, 16(1):57–66.
Erdmann-Pham, D. D., Fischer, J., Hong, J., and Song, Y. S. (2021). A likelihood-based deconvolution of bulk gene expression data using single-cell references. Genome Research. Code Link: https://github.com/songlab-cal/rna-sieve.
Danaher, P., Kim, Y., Nelson, B., Griswold, M., Yang, Z., Piazza, E., and Beechem, J. M. (2022). Advances in mixed cell deconvolution enable quantification of cell types in spatial transcriptomic data. Nature communications, 13(1):1–13.
Ding, J., Liu, R., Wen, H., Tang, W., Li, Z., Venegas, J., Su, R., Molho, D., Jin, W., Wang, Y., et al. (2024a). Dance: A deep learning library and benchmark platform for single-cell analysis. Genome Biology, 25(1):72.
Ding, J., Venegas, J., Li, L., Lu, Q., Wang, Y., Wu, L., Jin, W., Wen, H., Liu, R., Tang, W., Dai, X., Li, Z., Zuo, W., Chang, Y., Leo, Y., Lulu-Shang, L., Danaher, P., Xie, Y., and Tang, J. (2024b). Spatialctd: A large-scale tumor microenvironment spatial transcriptomic dataset to evaluate cell type deconvolution for immuno-oncology. Journal of Computational Biology.
Ding, J., Wen, H., Tang, W., Liu, R., Li, Z., Venegas, J., Su, R., Molho, D., Jin, W., Zuo, W., et al. (2022). Dance: A deep learning library and benchmark for single-cell analysis. bioRxiv. Code Link: https://github.com/OmicsML/dance.
Dong, R. and Yuan, G. (2021). Spatialdwls: accurate deconvolution of spatial transcriptomic data. Genome Biology, 22(1). Code Link: https://github.com/rdong08/spatialDWLS_dataset.
Dong, S., Wang, P., and Abbas, K. (2021). A survey on deep learning and its applications. Computer Science Review, 40:100379.
Eberwine, J., Yeh, H., Miyashiro, K., Cao, Y., Nair, S., Finnell, R., Zettel, M., and Coleman, P. (1992). Analysis of gene expression in single live neurons. Proceedings of the National Academy of Sciences of the United States of America, 89:3010–3014.
Elosua-Bayes, M., Nieto, P., Mereu, E., Gut, I., and Heyn, H. (2021a). Spotlight: seeded nmf regression to deconvolute spatial transcriptomics spots with single-cell transcriptomes. Nucleic acids research, 49(9):e50–e50.
Elosua-Bayes, M., Nieto, P., Mereu, E., Gut, I., and Heyn, H. (2021b). Spotlight: seeded nmf regression to deconvolute spatial transcriptomics spots with single-cell transcriptomes. Nucleic Acids Research, 49(9):e50–e50.
Eng, C.-H. L., Lawson, M., Zhu, Q., Dries, R., Koulena, N., Takei, Y., Yun, J., Cronin, C., Karp, C., Yuan, G.-C., et al. (2019). Transcriptome-scale super-resolved imaging in tissues by rna seqfish+. Nature, 568(7751):235–239.
Fan, Z., Luo, Y., Lu, H., Wang, T., Feng, Y., Zhao, W., Kim, P., and Zhou, X. (2023). Spascer: spatial transcriptomics annotation at single-cell resolution. Nucleic Acids Research, 51(D1):D1138–D1149.
Gerdes, M. J., Sevinsky, C. J., Sood, A., Adak, S., Bello, M. O., Bordwell, A., Can, A., Corwin, A., Dinn, S., Filkins, R. J., Hollman, D., Kamath, V., Kaanumalle, S., Kenny, K., Larsen, M., Lazare, M., Li, Q., Lowes, C., McCulloch, C. C., McDonough, E., Montalto, M. C., Pang, Z., Rittscher, J., Santamaria-Pang, A., Sarachan, B. D., Seel, M. L., Seppo, A., Shaikh, K., Sui, Y., Zhang, J., and Ginty, F. (2013). Highly multiplexed single-cell analysis of formalin-fixed, paraffin-embedded cancer tissue. Proceedings of the National Academy of Sciences, 110(29):11982–11987.
Giesen, C., Wang, H. A., Schapiro, D., Zivanovic, N., Jacobs, A., Hattendorf, B., Schüffler, P. J., Grolimund, D., Buhmann, J. M., Brandt, S., et al. (2014). Highly multiplexed imaging of tumor tissues with subcellular resolution by mass cytometry. Nature methods, 11(4):417–422.
Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9, pages 249–256. PMLR.
Goldman, S. L., MacKay, M., Afshinnekoo, E., Melnick, A. M., Wu, S., and Mason, C. E. (2019). The impact of heterogeneity on single-cell sequencing. Front. Genet., 10:8.
Goltsev, Y., Samusik, N., Kennedy-Darling, J., Bhate, S., Hale, M., Vazquez, G., Black, S., and Nolan, G. P. (2018). Deep profiling of mouse splenic architecture with codex multiplexed imaging. Cell, 174(4):968–981.
Guo, H., Zhu, P., Wu, X., Li, X., Wen, L., and Tang, F. (2013). Single-cell methylome landscapes of mouse embryonic stem cells and early embryos analyzed using reduced representation bisulfite sequencing. Genome Research, 23(12):2126–2135.
He, S., Bhatt, R., Birditt, B., Brown, C., Brown, E., Chantranuvatana, K., Danaher, P., Dunaway, D., Filanoski, B., Garrison, R. G., et al. (2021). High-plex multiomic analysis in ffpe tissue at single-cellular and subcellular resolution by spatial molecular imaging. bioRxiv, pages 2021–11.
He, S., Bhatt, R., Brown, C., et al. (2022). High-plex imaging of rna and proteins at subcellular resolution in fixed tissue by spatial molecular imaging. Nature Biotechnology.
Hendrickson, D., Soifer, I., Wranik, B., Botstein, D., and McIsaac, S. (2018). Simultaneous profiling of dna accessibility and gene expression dynamics with atac-seq and rna-seq. Methods in Molecular Biology, 1819:317–333.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–1780.
Houle, D., Govindaraju, D. R., and Omholt, S. (2010). Phenomics: the next challenge. Nature Reviews Genetics, 11(12):855–866.
Jin, H. and Liu, Z. (2021). A benchmark for rna-seq deconvolution analysis under dynamic testing environments. Genome Biology.
Karagiannis, T., Monti, S., and Sebastiani, P. (2022). Cell type diversity statistic: An entropy-based metric to compare overall cell type composition across samples. Frontiers in genetics.
Keren, L., Bosse, M., Thompson, S., Risom, T., Vijayaragavan, K., McCaffrey, E., Marquez, D., Angoshtari, R., Greenwald, N. F., Fienberg, H., et al. (2019). Mibi-tof: A multiplexed imaging platform relates cellular phenotypes and tissue structure. Science advances, 5(10):eaax5851.
Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. International Conference on Learning Representations.
Kleshchevnikov, V., Shmatko, A., Dann, E., Aivazidis, A., King, H. W., Li, T., Elmentaite, R., Lomakin, A., Kedlian, V., Gayoso, A., et al. (2022a). Cell2location maps fine-grained cell types in spatial transcriptomics. Nature biotechnology, 40(5):661–671.
Kleshchevnikov, V., Shmatko, A., Dann, E., Aivazidis, A., King, H. W., Li, T., Elmentaite, R., Lomakin, A., Kedlian, V., Gayoso, A., Jain, M. S., Park, J. S., Ramona, L., Tuck, E., Arutyunyan, A., Vento-Tormo, R., Gerstung, M., James, L., Stegle, O., and Bayraktar, O. A. (2022b). Cell2location maps fine-grained cell types in spatial transcriptomics. Nature Biotechnology, 40(5):661–671. Code Link: https://github.com/BayraktarLab/cell2location.
Kolodziejczyk, A. A., Kim, J. K., Svensson, V., Marioni, J. C., and Teichmann, S. A. (2015). The technology and biology of single-cell rna sequencing. Molecular cell, 58(4):610–620.
Kornberg, R. D. (1974). Chromatin structure: a repeating unit of histones and dna. Science, 184(4139):868–871.
Kornberg, R. D. and Lorch, Y. (1974). Chromatin structure and transcription. Annual Review of Cell Biology, 8:563–587.
Kulkarni, A., Anderson, A. G., Merullo, D. P., and Konopka, G. (2019). Beyond bulk: a review of single cell transcriptomics methodologies and applications. Curr. Opin. Biotechnol., 58:129–136.
Lähnemann, D., et al. (2020). Eleven grand challenges in single-cell data science. Genome Biology, 21(1).
Lewis, Z. R., Phan-Everson, T., Geiss, G., Korukonda, M., Bhatt, R., Brown, C., Dunaway, D., Phan, J., Rosenbloom, A., Filanoski, B., et al. (2022). Subcellular characterization of over 100 proteins in ffpe tumor biopsies with cosmx spatial molecular imager. Cancer Research, 82(12_Supplement):3878–3878.
Li, B., Zhang, W., Guo, C., Xu, H., Li, L., Fang, M., Hu, Y., Zhang, X., Yao, X., Tang, M., et al. (2022a). Benchmarking spatial and single-cell transcriptomics integration methods for transcript distribution prediction and cell type deconvolution. Nature methods, 19(6):662–670.
Li, H., Ma, T., Hao, M., Wei, L., and Zhang, X. (2022b). Decoding functional cell-cell communication events by multi-view graph learning on spatial transcriptomics. bioRxiv.
Li, X. and Wang, C.-Y. (2021). From bulk, single-cell to spatial rna sequencing. International Journal of Oral Science, 13(1):1–6.
Li, Z., Kuppe, C., Ziegler, S., Cheng, M., Kabgani, N., Menzel, S., Zenke, M., Kramann, R., and Costa, I. G. (2021). Chromatin-accessibility estimation from single-cell atac-seq data with scopen. Nature communications, 12(1):1–14.
Lin, J.-R., Fallahi-Sichani, M., and Sorger, P. K. (2015). Highly multiplexed imaging of single cells using a high-throughput cyclic immunofluorescence method. Nature communications, 6(1):1–7.
Lin, J.-R., Izar, B., Wang, S., Yapp, C., Mei, S., Shah, P. M., Santagata, S., and Sorger, P. K. (2018). Highly multiplexed immunofluorescence imaging of human tissues and tumors using t-cycif and conventional optical microscopes. Elife, 7.
Lubeck, E., Coskun, A. F., Zhiyentayev, T., Ahmad, M., and Cai, L. (2014). Single-cell in situ RNA profiling by sequential hybridization. Nature Methods, 11(4):360–361.
Lytal, N., Ran, D., and An, L. (2020). Normalization methods on single-cell rna-seq data: an empirical survey. Frontiers in genetics, 11:501166.
Ma, Y. and Zhou, X. (2022). Spatially informed cell-type deconvolution for spatial transcriptomics. Nature Biotechnology. Code Link: https://github.com/YingMa0107/CARD.
Marx, V. (2021). Method of the year: spatially resolved transcriptomics. Nature methods, 18(1):9–14.
Maynard, K. R., Collado-Torres, L., Weber, L. M., Uytingco, C., Barry, B. K., Williams, S. R., Catallini, J. L., Tran, M. N., Besich, Z., Tippani, M., et al. (2021). Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nature neuroscience, 24(3):425–436.
McManus, J., Cheng, Z., and Vogel, C. (2015). Next-generation analysis of gene expression regulation–comparing the roles of synthesis and degradation. Molecular bioSystems, 11(10):2680–2689.
Menden, K., Marouf, M., Oller, S., Dalmia, A., Magruder, D. S., Kloiber, K., Heutink, P., and Bonn, S. (2020). Deep learning-based cell composition analysis from tissue expression profiles. Science Advances, 6(30):eaba2619. Code Link: https://github.com/KevinMenden/scaden.
Merritt, C. R., Ong, G. T., Church, S. E., Barker, K., Danaher, P., Geiss, G., Hoang, M., Jung, J., Liang, Y., McKay-Fleisch, J., et al. (2020). Multiplex digital spatial profiling of proteins and rna in fixed tissue. Nature biotechnology, 38(5):586–599.
Moffitt, J. and Zhuang, X. (2016). Chapter one - RNA imaging with multiplexed error-robust fluorescence in situ hybridization (MERFISH). In Filonov, G. S. and Jaffrey, S. R., editors, Visualizing RNA dynamics in the cell, volume 572 of Methods in enzymology, pages 1–49. Academic Press. ISSN: 0076-6879.
Molho, D., Ding, J., Tang, W., Li, Z., Wen, H., Wang, Y., Venegas, J., Jin, W., Liu, R., Su, R., et al. (2024). Deep learning in single-cell analysis. ACM Transactions on Intelligent Systems and Technology, 15(3):1–62.
Moor, A. E. and Itzkovitz, S. (2017). Spatial transcriptomics: paving the way for tissue-level systems biology. Current opinion in biotechnology, 46:126–133.
Moses, L. and Pachter, L. (2022). Museum of spatial transcriptomics. Nature Methods, 19(5):534–546.
Muzio, G., O’Bray, L., and Borgwardt, K. (2021). Biological network analysis with deep learning. Briefings in Bioinformatics, 22(2):1515–1530.
Nguyen, Q. H., Pervolarakis, N., Nee, K., and Kessenbrock, K. (2018). Experimental Considerations for Single-Cell RNA Sequencing Approaches. Frontiers in Cell and Developmental Biology, 6:108.
Okendo, J., Okanda, D., Mwangi, P., and Nyaga, M. (2022). Proteomic deconvolution reveals distinct immune cell fractions in different body sites in sars-cov-2 positive individuals. medRxiv.
Pouyanfar, S., Sadiq, S., Yan, Y., Tian, H., Tao, Y., Reyes, M. P., Shyu, M.-L., Chen, S.-C., and Iyengar, S. S. (2018). A survey on deep learning: Algorithms, techniques, and applications. ACM Computing Surveys (CSUR), 51(5):1–36.
Quang, D. and Xie, X. (2016). Danq: a hybrid convolutional and recurrent deep neural network for quantifying the function of dna sequences. Nucleic acids research, 44(11):e107–e107.
Rao, A., Barkley, D., França, G. S., and Yanai, I. (2021a). Exploring tissue architecture using spatial transcriptomics. Nature, 596(7871):211–220.
Rao, A., Barkley, D., França, G. S., and Yanai, I. (2021b). Exploring tissue architecture using spatial transcriptomics. Nature, 596(7871):211–220.
Raredon, M. S. B., Yang, J., Kothapalli, N., Lewis, W., Kaminski, N., Niklason, L. E., and Kluger, Y. (2023). Comprehensive visualization of cell–cell interactions in single-cell and spatial transcriptomics with niches. Bioinformatics, 39(1):btac775.
Repsilber, D., Kern, S., Telaar, A., et al. (2010). Biomarker discovery in heterogeneous tissue samples - taking the in-silico deconfounding approach. BMC Bioinformatics.
Rodriques, S. G., Stickels, R. R., Goeva, A., Martin, C. A., Murray, E., Vanderburg, C. R., Welch, J., Chen, L. M., Chen, F., and Macosko, E. Z. (2019a). Slide-seq: A scalable technology for measuring genome-wide expression at high spatial resolution. Science, 363(6434):1463–1467. Code Link: https://github.com/broadchenf/Slideseq.
Rodriques, S. G., Stickels, R. R., Goeva, A., Martin, C. A., Murray, E., Vanderburg, C. R., Welch, J., Chen, L. M., Chen, F., and Macosko, E. Z. (2019b). Slide-seq: A scalable technology for measuring genome-wide expression at high spatial resolution. Science, 363(6434):1463–1467.
Shah, S., Lubeck, E., Zhou, W., and Cai, L. (2016). In situ transcription profiling of single cells reveals spatial organization of cells in the mouse hippocampus. Neuron, 92(2):342–357.
Singh, O., Pratt, D., and Aldape, K. (2021). Immune cell deconvolution of bulk dna methylation data reveals an association with methylation class, key somatic alterations, and cell state in glial/glioneuronal tumors. Acta Neuropathologica Communications.
Song, Q. and Su, J. (2021a). Dstg: deconvoluting spatial transcriptomics data through graph-based artificial intelligence. Briefings in bioinformatics, 22(5):bbaa414.
Song, Q. and Su, J. (2021b). Dstg: deconvoluting spatial transcriptomics data through graph-based artificial intelligence. Briefings in Bioinformatics, 22(3):1–13. Code Link: https://github.com/Su-informatics-lab/DSTG.
Ståhl, P. L., Salmén, F., Vickovic, S., Lundmark, A., Navarro, J. F., Magnusson, J., Giacomello, S., Asp, M., Westholm, J. O., Huss, M., Mollbrink, A., Linnarsson, S., Codeluppi, S., Borg, Å., Pontén, F., Costea, P. I., Sahlén, P., Mulder, J., Bergmann, O., Lundeberg, J., and Frisén, J. (2016). Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science, 353(6294):78–82.
Stegle, O., Teichmann, S. A., and Marioni, J. C. (2015). Computational and analytical challenges in single-cell transcriptomics. Nat. Rev. Genet., 16(3):133–145.
Stickels, R. R., Murray, E., Kumar, P., Li, J., Marshall, J. L., Bella, D. J. D., Arlotta, P., Macosko, E. Z., and Chen, F. (2020). Highly sensitive spatial transcriptomics at near-cellular resolution with slide-seqV2. Nature Biotechnology, 39(3):313–319.
Stickels, R. R., Murray, E., Kumar, P., Li, J., Marshall, J. L., Di Bella, D. J., Arlotta, P., Macosko, E. Z., and Chen, F. (2021). Highly sensitive spatial transcriptomics at near-cellular resolution with slide-seqv2. Nature biotechnology, 39(3):313–319.
Stoeckius, M., Hafemeister, C., Stephenson, W., Houck-Loomis, B., Chattopadhyay, P. K., Swerdlow, H., Satija, R., and Smibert, P. (2017). Large-scale simultaneous measurement of epitopes and transcriptomes in single cells. Nature methods, 14(9):865.
Sturm, G., Finotello, F., Petitprez, F., Zhang, J. D., Baumbach, J., Fridman, W. H., List, M., and Aneichyk, T. (2019). Comprehensive evaluation of transcriptome-based cell-type quantification methods for immuno-oncology. Bioinformatics, 35(14):i436–i445.
Svensson, V., Vento-Tormo, R., and Teichmann, S. A. (2018). Exponential scaling of single-cell rna-seq in the past decade. Nature protocols, 13(4):599–604.
Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., and Liu, C. (2018). A survey on deep transfer learning. CoRR, abs/1808.01974.
Tang, F., Barbacioru, C., Wang, Y., Nordman, E., Lee, C., Xu, N., Wang, X., Bodeau, J., Tuch, B. B., Siddiqui, A., Lao, K., and Surani, M. A. (2009). mrna-seq whole-transcriptome analysis of a single cell. Nature Methods, 6(5):377–382.
Teves, J. M. and Won, K. J. (2020). Mapping cellular coordinates through advances in spatial transcriptomics technology. Molecules and Cells, 43(7):591.
Thorsen, T., Roberts, R. W., Arnold, F. H., and Quake, S. R. (2001). Dynamic pattern formation in a vesicle-generating microfluidic device. Physical Review Letters, 86(18):4163–4166.
Thurman, R. et al. (2012). The accessible chromatin landscape of the human genome. Nature, 489(7414):75–82.
Tian, L., Chen, F., and Macosko, E. Z. (2023). The expanding vistas of spatial transcriptomics. Nature Biotechnology, 41(6):773–782.
Tsoucas, D., Dong, R., Chen, H., Zhu, Q., Guo, G., and Yuan, G.-C. (2019). Accurate estimation of cell-type composition from gene expression data. Nature Communications, 10(1). Code Link: https://github.com/dtsoucas/DWLS.
Uygun, S., Seddon, A. E., Azodi, C. B., and Shiu, S.-H. (2017). Predictive models of spatial transcriptional response to high salinity. Plant Physiology, 174(1):450–464.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Vistain, L. F. and Tay, S. (2021). Single-cell proteomics. Trends in biochemical sciences, 46(8):661–672.
Wang, G., Moffitt, J. R., and Zhuang, X. (2018). Multiplexed imaging of high-density libraries of rnas with merfish and expansion microscopy. Scientific reports, 8(1):1–13.
Wang, Y. and Navin, N. E. (2015). Advances and applications of single-cell sequencing technologies. Molecular cell, 58(4):598–609.
Waylen, L. N., Nim, H. T., Martelotto, L. G., and Ramialison, M. (2020). From whole-mount to single-cell spatial assessment of gene expression in 3d. Communications biology, 3(1):1–11.
Weingarten-Gabbay, S. and Segal, E. (2014). The grammar of transcriptional regulation. Hum Genet, 133(6):701–711.
Wen, L. and Tang, F. (2022). Recent advances in single-cell sequencing technologies. Precision Clinical Medicine, 5(1):pbac002.
Whitesides, G. M. (2006). The origins and the future of microfluidics. Nature, 442(7101):368–373.
Wilkins, O., Hafemeister, C., Plessis, A., Holloway-Phillips, M.-M., Pham, G. M., Nicotra, A. B., Gregorio, G. B., Jagadish, S. K., Septiningsih, E. M., Bonneau, R., and Purugganan, M. (2016). Egrins (environmental gene regulatory influence networks) in rice that function in the response to water deficit, high temperature, and agricultural environments. The Plant Cell, 28(10):2365–2384.
Williams, C. G., Lee, H. J., Asatsuma, T., Vento-Tormo, R., and Haque, A. (2022). An introduction to spatial transcriptomics for biomedical research. Genome Medicine, 14(1):68.
Yan, L. and Sun, X. (2023). Benchmarking and integration of methods for deconvoluting spatial transcriptomic data. Bioinformatics, 39(1):btac805.
Zhou, J. and Troyanskaya, O. G. (2015). Predicting effects of noncoding variants with deep learning–based sequence model. Nature methods, 12(10):931–934.

APPENDIX

Figure A.1 Median of rank-ordered Bray-Curtis dissimilarity taken over all spots.

Figure A.2 Median of rank-ordered Bray-Curtis dissimilarity taken over all spots.

Figure A.3 GNNDECONVOLVER deconvolution on SPATIALCTD lung 5-2 sample. a. Ground truth single-cell resolution on SPATIALCTD Lung 5-2 sample. Each dot is a single cell colored by its ground truth cell type label. Proportions of deconvolved cell types from ground truth and GNNDECONVOLVER are represented as pie charts for each spot. b. Spatial autocorrelation of the cell type proportions computed using Hotspot. Spatial distribution of cell type proportion for T CD4 memory cells, T CD8 memory cells, tumor, macrophage and neutrophil cells, as inferred by GNNDECONVOLVER. Each dot represents a spot. The color depth of each point indicates the proportion of the cell type in the spot.
Figure A.4 HEK293T and CCRF-CEM cell line mixture observed expression vs estimated expression from various models.

Figure A.5 PC Region Reconstruction of multivariate normal distributions, applied towards deconvolution methods.

Figure A.6 The essence of the work I am most proud of and excited about.

Figure A.7 Pipeline to translate kernel weights to position weight matrices, which can be compared with experimentally verified motifs.

Figure A.8 Motifs learned from DeepCAT aligned with a known TFBM.

Figure A.9 Motifs learned from DeepCAT aligned with a known TFBM.

Figure A.10 Motifs learned from DeepCAT aligned with a known TFBM.

Figure A.11 Potential novel kernel PWM.

Figure A.12 This table examines scaling effects when only one cell line’s expression is scaled (by 1000 here), a common problem in cell type expression profiling.