APPLICATION OF DEEP GENERATIVE MODELING TO SINGLE CELL RNA SEQUENCING DATA IN TOXICOLOGY By Omar Kana A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Pharmacology and Toxicology - Environmental Toxicology – Doctor of Philosophy 2024 ABSTRACT Single cell RNA-sequencing provides an opportunity for pharmacologists and toxicologists study the heterogeneity of cellular responses to chemical perturbations. However, analyzing these datasets remains incredibly challenging as they are high volume and high dimensional. Efforts to mitigate these challenges have centered on dimensionality reduction. Recently, it has been shown that variational autoencoders (VAEs), deep generative models that can perform dimensionality reduction, can be successfully deployed on single cell RNA-seq data to perform normalization and prediction. In this thesis I will explore both use cases. Even with the unprecedented detail single cell RNA-seq can provide for describing cell type specific chemical responses, exploring all relevant combinations of cell type-chemical perturbations remains difficult. Additionally, the dose of a chemical changes the overall character of its response. Variational autoencoders (VAEs) have been shown to predict chemical perturbations for single doses. However, VAEs have yet to be used to predict the entire dose- response. Here I introduce single cell Variational Inference of Dose-Response (scVIDR) which not only predicts the trajectory of the dose-response, but also achieves better large dose predictions than previous VAE algorithms. First, scVIDR is shown to predict dose-dependent gene expression across cell types in mouse liver, human blood cells, and cancer cell lines. Next, regression on scVIDR’s latent space is used to biologically interpret model predictions. Finally, scVIDR is used to order individual cells based on their chemical sensitivity by assigning a pseudo-dose value. I conclude scVIDR can effectively be used to predict chemical perturbations in a wide range of administration scenarios. Analysis of the compartmentalization of liver metabolism can be described with two axes: spatial and temporal. The spatial axis is conferred by the hepatic lobule, which is made up of concentric layers or zones of cells that have distinct metabolic programs. The master regulatory pathway that sustains this metabolic program is the Wnt/𝛽-catenin pathway, which when activated by Wnt ligands, induces the transcription of potentiators of metabolic and nutritional gradients. The temporal axis of liver is conferred by the circadian rhythm which originates from super chiasmatic nucleus. Within each cell in the liver, this circadian rhythm is maintained by a core set of genes which confer feedback loops which sustain the necessary oscillations for metabolic efficiency. Previous studies have shown how the axes of the liver lobule can interact using single cell RNA-seq data. However, how these axes interact with toxicological perturbation is still unknown. In this thesis, a variational autoencoder is used to batch correct single cell RNA-seq data for zonation inference. Existing models originally used to analyze the zonal-rhythmic axes of the liver lobule are extended to account for the effects of chemical perturbation. This methodology is applied to mouse liver snRNA-seq data from mice subjected to acute treatment of 2,3,7,8 tetrachlorodibenzo-p-dioxin. Copyright by OMAR KANA 2024 This thesis is dedicated to the family who mean the world to me. To my mother, Rana, my father, Munib, my brother Yusef, and my wonderful fiancΓ©, Lene. Thank you for loving and supporting me. v I cannot possibly acknowledge all who deserve it. I will try anyway. ACKNOWLEDGEMENTS I would first like to acknowledge my advisor, Dr. Sudin Bhattacharya, without whom my PhD would not have been possible. I have learned so much and gained such valuable experiences during my time in his lab. I thank him for creating the space I needed to develop as a scientist. I would like to acknowledge my committee members: Dr. Cheryl Rockwell, Dr. Norbert Kaminski, and Dr. Rance Nault. All of whom have been generous with their knowledge and data. Dr. Nault specifically has given me such wonderful opportunities and insight that I believe without his being there, my PhD would have been incredibly difficult. I would like to acknowledge the members of my lab: Dr. David Filipovic, Mr. Daniel Marri and Ms. Leah Terrian. All of whom are like friends to me. Their support and companionship over the last five years has eased many a stressful situation. To David, I would like to extend a personal thanks, as your insight and counsel has had a direct impact not only on my scientific career, but also on my outlook on life. I would like to acknowledge my best friends, Jason, and Tofiq, who have stuck with me since high school. Returning home to Louisiana during COVID-19 epidemic, and those brief visits during holidays were always richer when you were there to listen to my ramblings about graduate school. I would like to acknowledge my family. My mother, father, and brother have only ever been supportive since I started my PhD. Your love is like the solid ground beneath my feet, supporting me wherever I go. You have always been the driving force pushing me forward to be my best. Finally, I would like to acknowledge my fiancΓ©, Lene, whose presence in my life over the past two years has been wonderful. You have made every challenge easier, every stressful situation comfortable, and every cloudy Michigan day brighter. I am so thankful for your patience and support during my PhD. vi TABLE OF CONTENTS LIST OF ABBREVIATIONS ...................................................................................................... viii CHAPTER 1 INTRODUCTION .................................................................................................... 1 CHAPTER 2 PERTURBATION OF LIVER RHYTHMICITY AND ZONATION BY 2,3,7,8 TETRACHLORODIBENZO-P-DIOXIN..................................................................................... 13 CHAPTER 3 PREDICTION OF SINGLE DOSE CHEMICAL PERTURBATIONS ACROSS CELL STATES USING VARIATIONAL AUTOENCODERS .................................................... 53 CHAPTER 4 PREDICTION OF MULTIPLE DOSE CHEMICAL PERTURBATIONS ACROSS CELL STATES USING VARIATIONAL AUTO-ENCODERS ................................................... 80 CHAPTER 5 CONCLUSIONS AND FUTURE DIRECTIONS................................................ 104 BIBLIOGRAPHY ........................................................................................................................ 113 vii LIST OF ABBREVIATIONS AhR - Aryl Hydrocarbon Receptor ALT – Alanine Transferase AOP – Adverse Outcome Pathway BIC – Bayesian Information Criterion ChIP – Chromatin Immunoprecipitation CITE-seq – Cellular Indexing of Transcriptomes and Epitopes sequencing DEG – Differentially Expressed Genes 𝛿 – The mean difference between the treated and control populations GM-CSF – Granulocyte-Monocyte Colony Stimulating Factor GSEA – Gene Set Enrichment Analysis HVG – Highly Variable Gene INF-𝛽 – Interferon Beta LPS – Lipopolysaccharide MMD – Maximum Mean Discrepancy MNLEM – Mixed Non-Linear Effects Model NAFLD – Non-alcoholic Fatty Liver Disease DILI – Drug induced liver injury. PBMC – Peripheral Blood Mononuclear Cell PC – Principal Component PCA – Principal Component Analysis scVIDR – single cell Variational Inference of the Dose Response scATAC-seq – Single Cell assay for transposase-accessible chromatin scRNA-seq – Single Cell RNA sequencing snRNA-seq – Single Nuclei RNA sequencing TCDD – 2,3,7,8 Tetrachlorodibenzo-p-dioxin UMAP – Uniform Manifold Approximation and Projection VAE – Variational Auto-encoder ZT – Zeitgeber Time viii CHAPTER 1 INTRODUCTION 1.1 The combinatorial problem in pharmacology and toxicology Drug discovery presents a difficult problem to scientists. The space of possible biochemical therapeutics is extremely large1. Just for small organic chemicals, there exist 1060 possible structures, and of those potentially 1023 structures could be pharmacologically relevant2. This space of chemicals does not even include the biologics and inorganic compounds that exhibit therapeutic properties. Despite this large chemical space for drug discovery, many drugs will not make it to market. This is due to a myriad of reasons, including, prominently, potential toxicological effects on any different number of physiological systems3. Physiological responses to the same chemical can vary from cell type to cell type4, and according to Tabula Sapiens5, a molecular reference atlas of human cells, there are at least 400 distinct human cell types. Within those cell types, there exist potential physiological gradients and differences in function based on the context. An example explored in this thesis is the metabolic gradient in the liver lobule, where hepatocytes nearer to the central vein and those closer to the portal triad perform different and complementary sets of metabolic functions (e.g., Ahr expression; Figure 1.1)6. In addition to these metabolic gradients, stochasticity resulting from the biochemical process of transcription and translation, global differences in cellular parameters (e.g., gene copy number), and the chemical environment of the cell result in variation in the responses of cells of the same type7–9. As a result, the spectrum of cellular response to chemical perturbation gives us an incomprehensibly large combinatorial space that must be explored by toxicologists to determine chemical safety. 1 Figure 1.1 Expression of AhR across hepatic lobule. A) A zonal expression profile of normalized expression as described by Yang et al10 and Halpern et al6. Zone 0 represents the level of AhR expression in hepatocytes closest to the central vein. Zone 9 represents the level of AhR expression closest to the portal vein. B) A single-cell resolution image of the liver lobule generated by Halpern et al.6 with AhR expression levels represented by color from Yang et al.10 The central vein is denoted by β€œCV” (label in black) with the portal triad denoted by β€œPN” (label in white). In the past, this problem was not addressed due to the experimental limitations in measuring the state of a single cell. However, recent decades have created an omics revolution, in which 2 scientists can profile thousands of different endpoints from protein11 and RNA expression12 to the presence and concentration of particular metabolites in a cell13. This thesis will mainly focus on data from single cell RNA sequencing (scRNA-seq)14–16, which can be used to profile the transcriptomes of tens of thousands of individual cells at once. This presents an entirely new level of resolution for toxicologists, as for the first time ever, the true heterogeneity of cellular transcriptomic response can be profiled in an efficient manner17. My thesis presents a method to analyze and predict high dimensional gene expression data and heterogeneity in cellular dose- response. 1.2 The unreasonable effectiveness of dimensionality reduction In the statistical field of machine learning, there is a class of algorithms that perform a task called dimensionality reduction. The analysis and visualization of high dimensional data like single-cell gene expression can be facilitated by a broad class of dimensionality reduction algorithms. Examples of such algorithms include principal component analysis (PCA)18, which aims to preserve the global variances of the data, and uniform manifold approximation and projection (UMAP)19 which aims to preserve the local distances in low dimensions (Figure 1.2 B and C). This class of algorithms is routinely used in the study of scRNA-seq data and is useful for the visualization of high-dimensional datasets20, as well as in downstream tasks such as clustering and trajectory analysis21. 3 Figure 1.2 Dimensionality reduction of 3D mammoth. A) Point representation of wooly mammoth (Smithsonian Institution Archives USNM: V23792) at three different perspectives in 3D space. Colored using K-means clustering. B) 2D representation of wooly mammoth produced by principal component analysis (PCA). PCA preserves inter-cluster distances better than intra- cluster distances. C) 2D representation of wooly mammoth produced by Uniform manifold approximation and projection (UMAP). UMAP preserves intra-cluster distances better than inter- cluster distances. D) 2D representation of wooly mammoth produced by Variational autoencoder (VAE). VAE’s utilize neural networks to do dimensionality reduction and optimize both inter- and intra-cluster distances. Dimensionality reduction is remarkably efficient, often needing only orders of magnitude fewer features to properly represent a space. This can be seen even in large atlas sized scRNA-seq datasets made up of hundreds of thousands of cells. An example of such a dataset is the Tabula Sapiens5 which contains ~500,000 cells across 24 tissues from 15 human donors. Despite having measurements for ~30,000 genes, 100 principal components explain ~40% of the variance (Figure 1.3). This efficiency is reflected in the ubiquity of PCA’s use as it is a common preprocessing step in scRNA-seq pipelines to use the first 50-100 principal components22. With thousands of possible measurements in hundreds of thousands of cells, it is striking that with 4 only 50-100 components, one can capture significant variation in the dataset. A possible explanation lies in the manifold hypothesis23. Essentially, while real-world datasets can reside in high-dimensional space, they can often be described with a local-coordinate system of fewer dimensions. These dimensions represent a more concise summary of the high-dimensional space and thus make it more useful for experimentalists to describe a system. This is likely to be true in the case of cell biology, where gene-gene regulatory interactions constrain the number of possible attainable cell states24,25. This is particularly helpful in the domains of pharmacology and toxicology. While there are potentially 1023 pharmacological drugs2, chemicals with similar structures will tend to induce similar effects in the same biological system26. Due to the large number of genes in the genome, there is an infinite set of configurations of cellular state. However, the interactions between genes and the process of evolution limit the possible range of phenotypes to a much more tractable amount27. Which is why while omics can provide measurements of 1000s of dimensions, they are amenable to dimensionality reduction. 5 Figure 1.3 Cumulative explained variance in PCA of Tabula Sapiens. PCA done with differing number of components was performed on the Tabula Sapiens. The x-axis is the number of principal components used in the dimensionality reduction. The y-axis is the resulting variance ratio explained by the PCA. For the purposes of this thesis, I choose to focus on dimensionality reduction using a class of deep neural networks called variational auto encoders (VAEs)28 (Figure 1.2 D) which have been used extensively to model and make predictions of changes in single cell gene expression29–34. 1.3 A Brief Introduction to Variational Autoencoders VAEs are a class of deep generative models which rely on Bayesian priors to encode single cell data into a latent distribution28. VAE’s represent a subclass of autoencoders, which themselves are a generalization of PCA35. VAEs can be seen as a non-linear β€œcousin” to PCA, with certain advantages and disadvantages when comparing the two techniques. To introduce the reader to VAEs and how they work, I will first describe the relationship between PCA and autoencoders to 6 frame the discussion. I will then describe the basic structure of an autoencoder. Finally, I will describe how the autoencoder framework is extended to variational autoencoders. A full mathematical description of variational autoencoders can be found in section 3.4.1. As described in section 1.2, PCA is a form of dimensionality reduction that is used extensively in scRNA-seq data. PCA is a linear transformation of the data which aims to preserve as much variance as possible for each dimension18. A forward function brings the high dimensional scRNA-seq measurements to a lower dimensional representation. This lower dimensional representation, also known as principal component (PC) space, can be used for downstream analysis such as visualization 36, clustering of the data for cell type identification 37, and input into trajectory inference algorithms (i.e., pseudo-time)21. An advantage of PCA is that since it is linear the user can directly interpret the dimensions of the lower dimensional space using loadings38,39. These loadings can be used to describe gene expression programs in each principal component of the lower dimensional space. An additional useful feature of PCA is that the forward function is invertible, and the input data can be approximately reconstructed (Figure 1.4 A). However, trying to generate novel snRNA-seq measurements not included in training will often result in inaccurate reconstructions30. To address this problem, non-linear functions can replace the forward and inverse functions. 7 Figure 1.4 Diagrams of PCA, an Autoencoder, and a VAE. Input scRNA-seq measurement data is represented by the green box labeled 𝑋 and the reconstructed scRNA-seq data from the lower dimensional representation is represented by the orange box labeled 𝑋′. Lower dimensional representations are shown as blue boxes. A) Schematic of PCA. scRNA-seq measure data is orthogonally transformed using a linear 𝑓(𝑋) into a lower dimensional principal 8 Figure 1.4 (cont’d) component (PC) space. The inverse function, π‘“βˆ’1(𝑋), is used to take points from PC space back into scRNA-seq measurement space. B) Schematic of an autoencoder. Autoencoders are like PCA, but instead of linear transformations 𝑓 and π‘“βˆ’1, there are non-linear neural networks that replace them called an encoder and decoder respectively. The encoder takes the scRNA-seq measurement, and compresses it into the latent space, 𝑍. The decoder takes points from 𝑍 and approximately reconstructs them back into scRNA-seq measurement space. C) Schematic of a VAE. Unlike a normal autoencoder, a variational autoencoder has a probabilistic encoder and decoder. A normal VAE encoder, π‘žπœ™(𝑍|𝑋), outputs a mean and variance which is used to sample the features of the latent space, 𝑝(𝑍). The decoder then approximately maps the probability distribution of the latent space back into scRNA-seq measurement space. Autoencoders aim to perform the dimensionality reduction using deep learning. Autoencoders replace linear forward function and inverted function in PCA, with neural networks35. The neural network replacing the forward function, which compresses the high dimensional scRNA-seq data to a lower dimensional representation, is called the encoder. The lower dimensional representation of the scRNA-seq is referred to as the latent space. The neural network replacing the inverted function, taking latent space measurements back into scRNA-seq measurement space, is called the decoder (Figure 1.4 B). The use of neural networks instead of linear functions is useful, since now the user can calculate more accurate lower dimensional representations of the scRNA-seq measurements40,41. Accurate representations are critical during scRNA-seq processing pipeline, as the cell type identification step requires a lower dimensional representation of the data as input for clustering42. This comes at a cost, however, as the non- linearity of the neural networks makes interpretation of the latent dimensions more difficult as there are no longer any linear loadings. In section 4.2.3 and section 4.4.4 of this dissertation, I describe a method to overcome this limitation in VAEs and interpret them using a loadings like model. Additionally, the autoencoder’s latent space dimensions are not forced to be structured43, meaning that dimensions of the latent space are entangled with one another and are not guaranteed to be independent of one another like in PCA. As a result, vanilla autoencoders have latent spaces that are much less generalizable to other datasets35. 9 The variational autoencoder represents an extension of autoencoders that impose more structure on the latent space using Bayesian priors. This imposed structure forces the latent dimensions to be as disentangled as possible. To do this, instead of the dimensionality reduction being a deterministic function like in vanilla autoencoders, variational autoencoders describe the data generation process probabilistically35. This means that instead of encoding a lower representation directly, the encoder calculates a mean and variance of a gaussian distribution, from which the user can sample features for the latent space (Figure 1.4 C). This gaussian distribution is called the prior distribution and is what imposes structure one the latent space. To ensure that the encoder’s output is as close to prior (i.e., structured) as possible, the neural network minimizes the difference between the encoder output and the prior distribution. This imposed structure in the latent space is what allows the VAE to generate non-random samples from the latent space. In practice this means that the decoder can generate meaningful scRNA-seq measurements from unobserved points in the latent space rather than just the points in the training data. Thus, VAEs have the non-linear power of autoencoders with added generalizability to other datasets35. Due to these advantages, VAE’s have proven to be an incredibly flexible and effective tool in the analysis of scRNA-seq30,31,33,34,44–47. 1.4 Variational autoencoders and single cell RNA sequencing VAEs have been used extensively across a spectrum computational single cell task. At its most basic, VAEs have been used to visualize scRNA-seq data44,45, and at its most ambitious, has been used to integrate scRNA-seq data with other single cell data modalities34. Most single cell research on VAE’s has been in correcting scRNA-seq data for batch effects29. However, VAE’s have extended use in other problem domains such as cell type identification46, integration of datasets from multiple labs48, and inference of cell-cell interactions49. The following section is a brief survey of the current field, with a focus on batch correction, cell type identification, and prediction of chemical perturbations. VAEs have been used to integrate single cell RNA-seq across experimental groups29, labs48, and data modalities34. Integration, also called batch correction, is a process by which the variation of the data based on some batch covariate (e.g., sex, age, lab of origin, etc.) is removed from the data. This is often done in order to properly address whether the changes in the data originate from the variable of interest (i.e., TCDD treatment) independent of these batch effects. VAE models such as scVI have been used to model such covariates including library size and single 10 cell chemistry29. To do this, models will condition the encoder and decoder on the batch labels in order to remove their variance from the latent space. For integrating across different labs of origin, models such as scArches33,48 further force integration by adding a regularization term to the loss function. This term causes the model to bring the batch clusters as close together as possible on the latent space. VAE models such as MultiVI have been used to integrate multiple single cell data modalities (i.e., snRNA-seq, CITE-seq, and single cell ATAC-seq)34. To do this, MultiVI first encodes each modality into its own latent space. Then MultiVI averages the latent spaces together to make the final single latent representation of all data modalities. To make sure that the different latent spaces can be averaged together, an additional term is added to the loss function. This term forces the different latent spaces to be as similar to one another as possible. As with the normal integration, the aim is to make an integrated representation across all data modalities such that one can compare impacts across chromatin accessibility, protein expression, and transcription. In chapter 2 of this thesis, I describe the use of variational autoencoders to model and remove effects of TCDD on hepatocytes to infer liver zonation more accurately from scRNA-seq data. Conditional latent representations have been used in cell type identification. To do this, models will utilize existing large datasets of tissues such as the Allen Mouse Brain Atlas50, or the Tabula Sapiens5. These models use these large tissue datasets as reference to train the VAE’s, after which the VAE will then classify cells in new datasets based on their position in the latent space. Additionally, since these models can integrate across labs of origin, the models can utilize multiple large cell atlases to classify new cells. One approach, pioneered by scANVI46, is to utilize existing architecture created by scVI29 and extend it by incorporating a neural network classifier. The classifier and the VAE are trained in parallel so that the latent space is optimized to make it as simple as possible for the neural network to distinguish between different classes of cells. Alternative models for cell type classification include MoE-Sim-VAE which utilizes a mixture of experts architecture to cluster and identify different cell populations51. One pharmacologically focused application of variational autoencoders for the prediction of chemical perturbations30,33. In section 1.3, one of the mentioned advantages of VAE’s were that they were generative models, and as a result could generate meaningful scRNA-seq measurements from unobserved points in the latent space. Recently, VAEs have been demonstrated to predict the expression of a chemical perturbation on an unobserved cell type 11 based on the perturbations of other cell types30. To understand how this might work, I take an example of the model in another discipline, computer vision. Much like scRNA-seq, images are high dimensional data (each pixel representing a dimension) which can be compressed into a lower dimensional space using variational autoencoders. Let us consider a picture of a woman’s face, to which I want to virtually add glasses. This can be accomplished using VAEs by subtracting the latent representation of pictures of men’s faces from those of men’s faces with glasses, and subsequently adding the resulting β€œglasses” vector to pictures of women’s faces without glasses. I can imagine an analogue to this model, but instead of gender I have a cell type and instead of glasses I have a chemical perturbation. These models are examples of using vector arithmetic on the latent space of a VAE. One such model, scGen30, has been shown to outperform other generative models such as generative adversarial networks and other dimensionality reduction algorithms such as PCA on prediction or chemical perturbations for unseen cell types30. However, these models have difficulty accounting for certain complexities in biological data. For one, the response of a particular drug is highly dependent on cell type8,9. Thus, a simple addition of a perturbation must be weighted on the cell’s transcriptomic profile. Additionally, the magnitude of the chemical perturbation must also be considered. At lower concentrations, there may be little to no effect of a drug on a cell’s transcriptome. However, at higher concentrations, these effects can be more pronounced. Furthermore, even if I can make these predictions, interpretation of how the model makes predictions on a gene-by-gene basis must also be considered to evaluate whether the model is describing the chemical perturbation appropriately. In chapter 3 of this thesis, I improve on the original vector arithmetic method using regression on the latent space. In chapter 4, I further extend the model to account for multiple doses, interpret prediction on a gene-by-gene basis, and order cells based on how perturbed their transcriptomes are. 12 CHAPTER 2 PERTURBATION OF LIVER RHYTHMICITY AND ZONATION BY 2,3,7,8 TETRACHLORODIBENZO-P-DIOXIN 2.1 Introduction To deal with the multitude of chemicals that an organism will need to interact with on a regular basis and to sustain healthy metabolic homeostasis, the liver has evolved to compartmentalize metabolic programs52. This compartmentalization is both spatially and temporally organized. Spatially organized via the porto-central axis of the hepatic lobule, repeating hexagonal sub-units of the liver53. Temporally organized via the circadian rhythm based on feeding/fasting cycles54,55. In previous studies, the spatial and temporal metabolic organization of hepatocytes was described for over five thousand genes in hepatocytes52. However, it is unknown how resilient hepatic compartmentalization is to acute toxicological perturbation. In this chapter I investigate changes in the spatial and temporal axes of the liver with respect to acute treatment of 2,3,7,8 tetrachlorodibenzo-p-dioxin (TCDD). To do this I first infer the zonation of the liver lobule using a Variational autoencoder28,29 and diffusion maps56,57. Then I extend a method introduced by Droin and Kholtei et al52 to classify genes based on their rhythmicity and zonation to include classifications of TCDD influence. The histological unit of the liver is the hepatic lobule (Figure 2.1). The hepatic lobule is a hexagonally shaped structure with a central vein at the center and a portal triad at each vertex of the hexagon6,53. Each portal triad is made up of a portal artery, portal vein, and a bile duct58. Due to the positioning of vasculature and the direction of blood flow (from portal to central), nutrient and metabolic gradients are established within each lobule that confer different metabolic functions at different radii extending from the central vein53. For example, hepatocytes nearer to the central vein will tend to have higher expression of pathways involved in metabolism of xenobiotics by CYP450 genes59. Hepatocytes nearer to the portal triad will tend to specialize in 𝛽-oxidation and gluconeogenesis52. This metabolic gradation over the axis from the portal triads to the central vein (porto-central) is called liver zonation6,52,53,60. Liver zonation is maintained by a complex network of chemical cues and cell-cell interactions60. The canonical master regulatory pathway is the Wnt/𝛽-catenin pathway61. In brief, Wnt proteins bind to Frizzled causing lipoprotein receptor-related proteins to phosphorylate the 𝛽-catenin degradation complex. This causes 𝛽-catenin to dissociate from the degradation complex, translocate to the nucleus, and activate potentiators of zonation62,63. Traditionally, liver zonation refers to the categorization of 13 hepatocytes either residing the pericentral (referred to as central; zone 3), mid-lobular (zone 2), and periportal (referred to as portal; zone 1)53. However, in more recent studies zonation suggested a finer grained, or continuous pattern of concentric layers of hepatocytes6,52. . Figure 2.1 Cartoon of hepatic zonation within the liver lobule. Within the liver there exist hexagonal hepatic subunits known as liver lobules. Within these lobules exist a gradient of nutrient, oxygen, and metabolic functions along each liver sinusoid from the central vein to the portal triad (i.e., bile duct, portal vein, and portal artery). Cells along the sinusoid are separated into three zones (periportal zone 1, mid-lobular zone 2, and pericentral zone 3). The metabolic gradient of periportal functions and pathologies are represented by a green triangular gradient. Metabolic gradients of pericentral functions and pathologies are represented by an orange triangular gradient. Temporal regulation of the liver is described using the circadian rhythm. Temporal compartmentalization of metabolic functions makes sure that liver function (e.g., 14 glycolysis/gluconeogenesis) is coordinated with feeding/fasting cycles of the organism52,54,64–66. Disruptions of the circadian rhythm are associated with diseases such as non-alcoholic fatty liver disease (NAFLD)66–68. Thus, sustaining a consistent rhythm within the liver is important to the overall health of the organism. Within each cell is a molecular oscillator that is made up of a network of feedback loops. In hepatocytes, this oscillator is made up of a core set of circadian genes (Figure 2.2)69. CLOCK and ARNTL bind to one another forming the CLOCK-ARNTL complex. The CLOCK-ARNTL complex then binds to E-box motifs which activate the transcription of downstream core circadian genes. These genes include genes that translate to the PER-CRY complex which inhibits the binding of CLOCK-ARNTL. CLOCK-ARNTL also activates genes that translate to REV-ERB and DBP. DBP upregulates Nr1d1/2 (genes that encode REV-ERB) which translates to more REV-ERB. REV-ERB competes with ROR transcription factors and inhibits the translation of further Nr1d1/2 and Arntl. These feedback loops make the basis for the rhythmicity seen in hepatocytes69. 15 Figure 2.2 Molecular pathway of circadian rhythm in liver. Day-night cycle represented with box containing blue gradient. Proteins like CLOCK and DBP are indicated by colored boxes within the day-night diagram. Transcription of genes represented by curved arrows. Activation is represented by arrows. Inhibition is represented by flat-head arrows. A recent study by Droin and Kholtei et al demonstrated that the spatial and temporal axes overlap with one another52. This is mostly due to the inherently rhythmic behavior of Wnt ligand expression. This leads to targets of the Wnt signaling pathway such as Axin2 to exhibit rhythmic patterns of expression. Thus, many established zonated pathways have rhythmic patterns of expression. Among those pathways are many gene sets involved in drug metabolism. Interestingly, while rhythmicity impacted the core pathways that determine zonation, the converse was not true, as the core circadian clock (except for Cry1) exhibited no zonation52. The overlap with the temporal and spatial axes with the drug metabolism pathways suggests a third axis to consider when describing liver function, chemical perturbation. 2,3,7,8 Tetrachlorodibenzo-p-dioxin (TCDD) is a particularly interesting candidate due to its known impact to rhythmicity and zonation pathways62,66. It has been established that TCDD in sub- chronic exposures ablates or greatly dampens the oscillations of the majority of core circadian clock genes66. This is hypothesized to act through the cis-regulatory action of AhR, TCDD’s canonical receptor. AhR is observed to bind at the gene-body of many of the core circadian clock genes two hours post TCDD treatment66. Sub-chronic exposure to TCDD also impacts zonation 16 in a dose-dependent manner62. At lower doses, a clear pericentral bias appears across all hepatocytes which reflects a higher activation of pericentral drug metabolism pathways. At higher doses, dysregulation of zonation leads to complete disorganization of most zonation biomarkers62. Like in the effects on circadian rhythm, effects on zonation likely stem from the potential cross talk between AhR and Wnt signaling pathways70,71. While it is established how TCDD may impact both the spatial and temporal organization of the hepatic lobule, it is still less understood how each of these pathways interact with one another. Furthermore, most results discussed above deal with month long exposures to TCDD, and do not describe the initial acute effects that lead to the differential zonation and rhythmicity. Utilizing the model I developed, I show that TCDD, even at acute exposures, has a significant impact on both the zonation and rhythmicity of the liver. 2.2 Results 2.2.1 Visualization of acute hepatocyte response to 2,3,7,8 Tetrachlorodibenzo-p-dioxin To elucidate the transcriptional impact TCDD treatment plays in disrupting hepatocyte rhythmicity and zonation, I used hepatic single-nuclei RNA-seq data from male C57BL/6 mice generated Cholico et al72. Mice were housed in a room with a 12:12 light:dark cycle, gavaged with a single dose of 30 ΞΌg/kg TCDD (or sesame oil vehicle) at time-point 0 (6:00 AM), after which the livers were collected and snap frozen at timepoints 2, 4, 8, 12, 18, or 24 hours post treatment. The data was then clustered and hepatocytes were identified using established hepatic biomarkers from previous studies62. All other cell types were removed from the dataset. After filtering the hepatocytes for low quality and low read count, I was left with 129,373 hepatocytes. The top 15,000 highly variable genes (HVGs) were kept for future analysis. I subsequently performed dimensionality reduction for data visualization and analysis on important TCDD response genes (see section 2.4.1 for more preprocessing details). I performed a UMAP analysis on the hepatocytes to observe cell clustering patterns (Figure 2.3). I observe that the data naturally clusters between time points (with the exception of 18 and 24 hours with TCDD treatment). Additionally, I find that treatment groups separate sufficiently, indicating significant changes in the gene expression profile with respect to TCDD treatment. Interestingly, the UMAP hepatocyte clusters indicated a circular pattern in expression for both treatments. While UMAP’s are known to destroy global distance information40, I believe this reflects the internal rhythmicity of hepatocytes as this circular UMAP pattern was not found for 17 other cell types (data not shown). I can also observe while acute TCDD treatment impacts hepatocyte rhythmicity, it does not completely abolish it like in previous studies with sub-chronic TCDD treatments66. Figure 2.3 Visualization of snRNA-seq for acute TCDD toxicity in mouse hepatocytes. Each dot represents a cell. UMAP of hepatocytes colored by time in hours after treatment on the left, and dose of treatment in πœ‡g/kg of TCDD on the right. To analyze the direct response of TCDD over the time course, I analyze the expression of known TCDD response genes: Cyp1a1, Cyp1a2, Ahrr, and Tiparp (Figure 2.4)73,74. I find that Cyp1a1 and Ahrr achieve saturation in expression at 12 hours post treatment, while Cyp1a2 and Tiparp achieve saturation in expression sometime at or before two hours post treatment. All response genes exhibited significant increases in expression when treated with TCDD. 18 Figure 2.4 Time Series Expression of TCDD Response Genes. Normalized counts (see section 2.4.3) of known TCDD activated genes: Cyp1a1, Cyp1a2, Ahrr, and Tiparp. Expression of genes are plotted as function of the hours after treatment by TCDD (30 πœ‡π‘” π‘˜π‘” ) or sesame oil vehicle (0 πœ‡π‘” π‘˜π‘” ). Shaded regions represent 95% confidence interval. To analyze the potential acute effects TCDD has on the spatial and temporal metabolism of the liver lobule, I first inferred the porto-central axis for the liver hepatocytes, and then used a mixed non-linear effects model75 (MNLEM) to classify genes based on what factor (Rhythmicity, Zonation, TCDD treatment) or combination of factors controlled gene expression. 19 2.2.2 Inferring zonation of hepatocytes from single cell gene expression profiles Zonation of the liver lobule is not directly measured by snRNA-seq. Thus, zonation needs to be inferred from hepatocyte gene expression. To infer the zonation profile, expression must first be corrected for other confounding factors in the experimental design. For example, known centrally zonated gene Cyp1a210 is activated by TCDD. Another centrally zonated example, Slc1a2, is known to have rhythmic patterns in expression52. To correct these confounders and calculate the zonation trajectory I elected to utilize the approach taken by Aizarani et al76 and Nault et al62. Here I first perform batch correction on the single cell data to remove variance stemming from TCDD treatment and time-dependent effects. Then I perform trajectory inference to calculate the latent zonation value of each hepatocyte. I utilized single cell Variational Inference (scVI)29, a variational autoencoder (VAE), to perform batch correction on the snRNA-seq data (Figure 2.5). The major reason I utilize a variational autoencoder is due to the size of the dataset. After preprocessing, there are ~130,000 cells in the dataset and most single cell integration tools do not scale to datasets of those size easily77. scVI can integrate large gene expression atlases and batch correct for variation across many labs29,77. Thus, it is expressly tested to scale to millions of cells and can integrate them in a relatively short period of time29. 20 Figure 2.5 UMAP of Batch correction by scVI of snRNA-seq data. UMAP of cells not batch corrected on the left. In the middle is a cartoon of the encoder region of scVI which performs dimensionality reduction. On the right is a UMAP of the latent space of scVI. Cells in both UMAPs are colored by the combined label of hours after treatment and dose of TCDD treatment in πœ‡g/kg (Hour_πœ‡g/kg). VAE’s like scVI can batch correct for variables using their latent space. When encoding the latent distribution during training, the user can condition the model on some factor in the experimental design (e.g., dose and time). In this way, the model encodes a latent representation that doesn’t contain the variance of the factors being conditioned for. I can β€œdenoise” the data by decoding from the latent distribution back into gene expression distribution29. Utilizing scVI, I batch correct for TCDD treatment and time of harvest. I then use a trajectory inference algorithm, diffusion pseudo-time56, on the latent space scVI to infer the trajectory of expression on the porto-central axis. I utilize the second component of the diffusion pseudo-time plot (analogous to components in PCA) as our trajectory (see section 2.4.3) as I observed most zonal genes follow a gradient along this component (Figure 2.6). I will refer to trajectory inference values as pseudo-space, which I define as an ordering of cells based on how closely the cells approximate the expression of central hepatocytes (0 for most portal, and 1 for most central). 21 Figure 2.6 Diffusion map visualization of scVI latent space reveals zonation trajectory. Diffusion maps representation of batch corrected latent space of scVI. Each point represents a single hepatocyte. Hepatocytes on the right are colored by pseudo-space metric. Hepatocytes with a pseudo-space value closer to 0 have a more portal hepatocyte-like expression and closer to 1 have a more central hepatocyte-like expression. Hepatocytes in the middle panel are colored by Cyp2f2 (portal marker) expression. Hepatocytes in right panel are colored by Slc1a2 (central marker) expression. Expression of each gene is measured in cell normalized counts (see section 2.4.1). To confirm whether pseudo-space accurately represents the zonation of liver lobules, I plot expression profiles of known zonated genes Cyp2f2 and Slc1a2 along the pseudo-space axis. I observe significant correlations between the expression of these genes and the pseudo-space values of the cells (Figure 2.7). As a negative control, I analyze whether I see a similar correlation in known non-zonated genes, such as Arntl and Clock (Figure 2.7). I show that no significant correlations exist between the pseudo-space metric and these genes. 22 Figure 2.7 Regression plots of cell normalized expression versus pseudo-space. Regression plots of hepatocyte expression Cyp2f2, Slc1a2, Arntl, and Clock versus the associated pseudo- space value. Each point is a hepatocyte. Expression of each gene is measured in cell normalized counts. A pseudo-space value of 0 refers to a hepatocyte with more periportal expression. A pseudo-space value of 1 refers to hepatocytes with more pericentral expression. Pearson correlations and their associated p-values were calculated between the gene expression and pseudo-space metric. To simplify downstream analysis, I bin the pseudo-space values into layers (see section 2.4.3; Figure 2.8). Typically, hepatocytes are separated into three zones in the liver lobule: Zone 1 or 23 periportal, Zone 2 or mid-lobular, and Zone 3 pericentral cells53. However, in more recent studies involving single cell RNA-seq and smFISH show zonation to be finer grained6. Selection of the number of layers to bin the cells into is down to the resolution of the data. For example, in Halpern et al6 there were fifteen zones, while in Droin and Kholtei et al52, there are only eight zones. I elect to bin into five layers given to make sure there are at least two-thousand cells in each layer. Each bin has an equal length along the pseudo-space axis (e.g., all cells in layer one have a pseudo-space value between zero and one-fifth). Unsurprisingly, I find that there are fewer cells binned into the last (more central) layer than compared to the first (more portal) layer (Figure 2.9). This makes sense since by having smaller radius in the liver lobule, there are fewer cells nearer to the single central vein than nearer to the many portal triads. Figure 2.8 UMAP visualization of pseudo-space binning and zonation biomarkers. All UMAPs are of scVI’s latent space. A) Top panels represent schematic of binning process. On the top left cells are colored by the pseudo-space metric. A pseudo-space value of 1 refers to hepatocytes with more pericentral expression and a pseudo-space value of 0 refers to hepatocytes 24 Figure 2.8 (cont’d) with more periportal expression. On the top right, the same UMAP colored by the binned layers. Each layer refers to hepatocytes with pseudo-space values between (layer number - 1)/5 and layer number/5. B) On the bottom panels are UMAP plots with cells colored by Cyp2f2, Slc1a2, and Arntl gene expression in cell normalized counts. Figure 2.9 Bar plot of the number of hepatocytes in each inferred layer. Layers are described in figure 2.8. Layers with smaller values represent hepatocytes that have expression more similar to pericentral hepatocytes and layers with larger values represent hepatocytes that have expression more similar to periportal hepatocytes. Using the binned pseudo-space values, I create pseudobulk profiles (sum of counts across cells) for each treatment x time x layer combination (see section 2.4.3). These count sums are then normalized using size-factor estimation to normalize for the number of cells used to create each pseudobulk profile78. I use the normalized expression for each treatment x time x layer combination for down-stream analysis and classification of genes. 2.2.3 Classifying hepatocyte genes using a mixed non-linear effects model To investigate the spatial and zonated expression profiles of genes, and how those profiles are changed by TCDD treatment, I utilize an approach described by Droin and Kholtei et al52. In their approach they model rhythmicity (R) using the sum of sine and cosine functions, and zonation (Z) using the first and second order Legendre polynomials. I extend their model to include the influence of TCDD (Dioxin; D) on liver expression. To model the effect of TCDD, I 25 opted for a simple indicator function in which the hepatocytes were treated with TCDD (D = 1) or were not treated (D = 0). Using these functions, a series of MNLEMs were constructed that represented different combinations of factors (D, Z, and R) that could influence gene expression based on the original experimental design and the inference of zonation (Figures 2.10 and 2.11). Effects from different factors can either be independent of one another (e.g., Z + R) or dependent on one another (e.g., D x Z). How factors can influence one another to change expression is illustrated in figure 2.11. Using this approach, I classified the top fifteen thousand highly variable genes based on the MNLEM that best described said gene’s expression (see section 2.4.4 for more details on implementation). 26 Figure 2.10 Illustrations of each single or no effect models used for classification of hepatocytes genes. Ideal simulations of each model plotted in terms of time (in hours after treatment) versus expression, or inferred layers of the liver lobule. Each line plot is colored either by time, layer of dose of TCDD of treatment. TCDD (Dioxin) influence is delineated by D, influence of liver rhythmicity is delineated by R, and influence of liver zonation (layers) are delineated Z. Models with no influence from either D, R, or Z are delineated as flat, F. 27 Figure 2.11 Illustrations of multiple effect models used for classification of hepatocyte genes. Ideal simulations of each model as described before in Figure 2.9. Effects that are 28 Figure 2.11 (cont’d) separated with β€œ+” indicate models in which factors are independent of one another. Effects that are separated with a β€œx” indicate models in which factors are dependent on one another. I first confirm that my classification works by analyzing genes known to have rhythmic, zonal, or TCDD influenced expression. For this I classified Arntl (Rhythmic Gene), and Slc1a2 (Zonated and Rhythmic Gene), and Ahrr (a TCDD response gene that saturates at 12 hours). In figure 2.12 I show that my model can detect the rhythmicity of Arntl and Slc1a2. Additionally, the model can distinguish between TCDD saturation and true rhythmicity. The model is also able to find that Slc1a2 is zonated. Finally, Ahrr is classified as having TCDD influence. I find in this small survey that gene expression profiles reflect the classification of the gene one is looking at (Figure 2.12). 29 Figure 2.12 Classification of canonical rhythmic, zonal, and TCDD responsive genes. Analysis of canonical rhythmic gene, Arntl, canonical rhythmic and zonal gene, Slc1a2, canonical TCDD responsive gene Ahrr. A) Bar plots of Shwarz weights for each gene classification. The higher the Shwarz weight, the higher the probability the gene belongs in a particular class. Effects that are separated with β€œ+” indicate classes in which factors are independent of one another. Effects that are separated with a β€œx” indicate classes in which factors 30 Figure 2.12 (cont’d) are dependent on one another. TCDD (Dioxin) influence is delineated by D, influence of liver rhythmicity is delineated by R, and influence of liver zonation (layers) are delineated Z. Models with no influence from either D, R, or Z are delineated as flat, F.B) Line plots of gene expression measured in normalized counts (see section 2.4.3). Each line is colored by layer and dose of TCDD (πœ‡g/kg), or C) colored by hour after treatment and dose of TCDD. Looking at the distribution, I find that most genes are not influenced by any factors (Figure 2.13 A). Class sizes for classes that don’t include the effect of TCDD agree with previous studies with more genes being zonated than rhythmic52. The largest multi-effect class with TCDD influence is Dx(Z+R) (Figure 2.11 A). When I look at the classification of TCDD canonical receptor, AhR, I find that it has a classification of Z+R (Figure 2.13 B). Furthermore, when I perform gene set enrichment analysis (GSEA) on Dx(Z+R) class, I find that it is enriched for many of the canonical pathways involved in TCDD liver response (Figure 2.13 C). I found that no genes were categorized to have dependence between all three factors (DxZxR). 31 Figure 2.13 Classification distributions and enrichment of TCDD toxicity pathways. A) Bar plot of the number of genes in each class for the top 15,000 HVGs. Heights represent the number of genes in each class. B) Bar plot of Shwarz weights for Ahr classification. The higher the Shwarz weight, the higher the likelihood Ahr belongs to a particular class. C) GSEA analysis of the Dx(Z+R) class of genes. Significance of a particular pathway on the y-axis is represented by its negative log adjusted p-value. The higher the value the more significant the pathway. The top 10 most significant pathways are graphed. I use the above classification scheme to then investigate acute toxicity to rhythmic genes and zonal genes in the liver lobule. 32 2.2.4 Acute 2,3,7,8 Tetrachlorodibenzo-p-dioxin perturbation of rhythmic genes I first analyzed the impact of TCDD on a core set of circadian rhythm genes (Clock, Arntl, Per2, Cry1, Nr1d1, Npas2, Rorc). To do this I looked at gene classifications for each of the core circadian clock gene. All core circadian genes analyzed contained rhythmicity in their classification. The only genes classified as purely rhythmic were Arntl, Clock and Rorc (Figure 2.14 A). When I looked closely at their expression, I found that Arntl and Clock exhibit significantly increased expression at 2 and 12 hours (p-value < 0.01 Mann-Whitney U-test) with TCDD treatment. This correlates with the saturation of TCDD response genes analyzed in section 2.2.1 and Figure 2.2 (Figure 2.14 B, C). Npas2, Per2, and Nr1d2 all were classified to have TCDD influence (Figure 2.15 A). I find that Per2 has significantly higher expression for all time points except 18 hours post treatment. Npas2 has significantly lower expression at timepoint 2, 4 and 24. Finally, Nr1d2 exhibits significantly higher expression at time point 2 and 4, but significantly lower at time point 12. Cry1 was classified as zonal and rhythmic, which is corroborated by previous studies52. However, I note that the second highest classification of Cry1 implies TCDD influence which can be seen at time points 18 and 24 (Figure 2.15 A, B). 33 Figure 2.14 Core circadian clock genes classified as only rhythmic. A) Bar plots of Shwarz weights for gene classification of Arntl, Clock, and Rorc. B) Line plots of gene expression for Arntl, Clock, and Rorc graphed with respect to hours after treatment and colored by dose of treatment of TCDD. C) Violin plots plotting the distribution of expression for each treatment condition at each timepoint, with p-value differences described with *’s. ** is 0.001 < p-value < 0.01 and * 0.01 < p-value < 0.05. 34 Figure 2.15 Core circadian clock genes classified as having multiple effects. A) Bar plots of Shwarz weights for gene classification for Npas2, Nr1d2, Per2, and Cry1. B) Line plots of gene 35 Figure 2.15 (cont’d) expression for Npas2, Nr1d1, Per2, and Cry1 graphed with respect to hours after treatment and colored by dose of treatment of TCDD. C) Violin plots of Npas2, Nr1d1, Per2, and Cry1 plotting the distribution of expression for each treatment condition at each timepoint, with p-value differences described with *’s. ** is 0.001 < p-value < 0.01 and * 0.01 < p-value < 0.05. I next looked at the impact of TCDD on rhythmic genes in DxR and Dx(R+Z) classes by first trying to see if TCDD has induced or removed rhythmicity from a gene (see section 2.4.5). To investigate whether a gene has gained or lost rhythmicity with TCDD treatment, I use the likelihood ratio test (see section 2.4.6). Here I compare whether the expression of a gene better fits a rhythmic or non-rhythmic (flat) model. Genes that fit rhythmic model in control but fit flat model in treatment are classified to have lost rhythmicity. Genes that fit a flat expression model in control but fit a rhythmic expression model in treatment are classified to have gained rhythmicity. If the gene fits a rhythmic model in both treated and control conditions, they are classified to have kept rhythmicity. Looking at the distribution of genes in the TCDD influenced rhythmic classes, I see that 13% of genes have gained rhythmicity, while 21% of genes have lost rhythmicity (Figure 2.16 A). When I perform GSEA on genes that lost rhythmicity, I find hallmarks of TCDD gene expression response such as β€œMetabolism by CYP450” and β€œChemical Carcinogenesis” (Figure 2.16 B)79. When I perform GSEA on genes that have gained rhythmicity, I find another hallmark of TCDD expression response, retinol metabolism (Figure 2.16 B)80. 36 Figure 2.16 Analysis of gain/loss of rhythmicity for TCDD influenced rhythmic genes. A) A total of 1072 genes that were classified to be influenced by rhythm and TCDD were analyzed as to whether they had gained, or lost rhythmicity based on the likelihood ratio test. A bar plot of the proportion of genes analyzed colored by whether they lost (in orange) or gained rhythmicity (in blue). B) GSEA analysis of gene sets for genes that gained and lost rhythmicity. Significance of a particular pathway on the y-axis is represented by its negative log adjusted p-value. The higher the value the more significant the pathway. The top 10 most significant pathways are graphed unless fewer than 10 pathways were significant. Looking at the remaining genes that kept rhythmicity, I did analysis on how TCDD impacted the properties of their rhythm. In this case I analyzed the amplitudes and phases of the genes and how they changed with TCDD treatment. When analyzing the core circadian clock genes, I only see small changes in the phase and amplitude of genes (Figure 2.17 B). When looking across all 37 genes that kept rhythmicity, I similarly see that there are no major trends reflecting a delay in phase or reduction of amplitude (Figure 2.17 C). When looking at whether the magnitude of the genetic change with respect to TCDD treatment had a correlation with magnitude of the change of phase or amplitude to TCDD treatment, I found a weak correlation (Figure 2.15 D). 38 Figure 2.17 Analysis of rhythmic parameters for TCDD influenced rhythmic genes. A) Schematic and equations describing phase and amplitude of gene’s expression. B) Bar plot of core circadian clock genes (Arntl, Clock, Per2, Cry1, Npas2, Nr1d2, and Rorc) and their respective phase and amplitude in each treatment group. Phase is measured in radians, and 39 Figure 2.17 (cont’d) amplitude is measured in normalized counts. C) Violin plot of all genes that kept their rhythmicity’s rhythmic parameters. D) Regression plot of the magnitude of change of phase or amplitude vs. the magnitude of the mean log fold change in expression with respect to treatment. Each point represents a single gene. 2.2.5 Acute 2,3,7,8 Tetrachlorodibenzo-p-dioxin perturbation of zonal genes Zonation is primarily determined by the Wnt/𝛽-catenin pathway81 and influenced by the Ras pathway and hypoxia pathways6. To investigate if TCDD influences these pathways in a meaningful way, I have taken a list of known targets of each pathway6,52,82 to see if they are enriched in TCDD influenced gene classes. I find that all zonation pathway targets are enriched in the Z+R class of genes. For Z+R’s TCDD influenced counterpart, Dx(Z+R), and the dual effect DxZ class, I find that all pathway targets except targets for down regulated by the Ras pathway were enriched (Figure 2.18). From this I conclude that the core pathways of zonation are impacted by acute TCDD treatment. 40 Figure 2.18 Enrichment analysis of zonation regulation pathways on all zonation gene classes. GSEA of pathways that regulate zonation (Wnt, Ras, and hypoxia pathways)6,67. Both Z (Zonation) and Z + R (Zonation + Rhythmicity) represent zonated genes not affected by D (Dioxin). DxZ and Dx(Z+R) represent zonated genes affected by D. Significance of a particular pathway on the y-axis is represented by its negative log adjusted p-value. The higher the value the more significant the pathway. Similar to their rhythmic counterparts, I analyzed whether genes in the DxZ and Dx(Z+R) have either gained or lost zonation. Like with the rhythmic likelihood ratio test, I performed a zonated analogue (see section 2.4.5). I find that 18% of genes analyzed had gained zonation when treated with TCDD, and 13% of genes analyzed had lost their zonation with TCDD treatment (Figure 2.19 A). When I performed GSEA on each of the groups, I found that UDP- glucuronosyltransferase related pathways such as β€œPentose and Glucoronate Interconversions” 41 were enriched in genes that lost zonation (Figure 2.19 B). I found no enriched pathways for the genes that gained zonation. Finally, when analyzing whether the genes that lost or gained zonation were more portally enriched or centrally enriched, I found that genes that lost zonation were more centrally enriched even when compared to background (p-value < 0.001 Kolmogorov-Smirnov Test) (Figure 2.19 C). 42 Figure 2.19 Analysis of loss or gain of zonation based on TCDD treatment. A) Bar plot of proportion of 1059 TCDD influenced zonated genes. B) Enrichr83 analysis of genes that lost zonation with TCDD treatment. C) Stacked bar chart describing the distribution of zonated genes and in which zone of the liver lobule are those genes most highly expressed. Zonal location 43 Figure 2.19 (cont’d) based on the maximum expression of gene. Differences in distribution calculated using the two sample Kolmogorov-Smirnov test. Finally, looking at the genes that kept zonation, I looked at how TCDD impacted zonation parameters. For this I calculated the line of best fit for layer dependent expression of each gene. I call the slope of the line of best fit the gene’s β€œzonation slope”. I filtered out genes with centers of expression in the mid-lobular region, as they have a non-monotonic zonation pattern and thus are not linearly zonated. I found that linearly zonated genes that had larger magnitude log fold changes with respect to TCDD treatment also had larger magnitude changes in their zonation slope (Figure 2.20). Figure 2.20 Regression plot of TCDD induced log fold change vs. TCDD induced change in zonation. Each dot represents a zonated gene as determined by classification. Magnitude of change of the slope of the line best fit across liver layers is plotted versus the magnitude of the mean log fold change in expression of the gene. 2.3 Discussion Analysis of drugs in the liver have often been limited by having to study the chronobiology and spatial zonation separately. In this chapter, I have taken previous results integrating chronobiology and spatial zonation in the liver and have extended them to account for chemical dependent influence. This remains important as how the liver reorganizes its metabolic functions 44 in response to toxic stimuli remains understudied. I show that even in acute situations, TCDD has a significant impact on both the rhythm and zonation of the liver. I show that AhR, TCDD’s canonical receptor, has zonal and rhythmic components to its expression, and that many effects of TCDD also have zonal and rhythmic components. Previous studies showed that sub-chronic treatment by TCDD at the same dose caused a large dampening effect on the amplitude, and a large shift in the core circadian clock66. While this isn’t true for acute perturbation, I still observe significant TCDD influence in nearly half of the core circadian clock genes I analyzed. This corroborates previous results that AhR directly impacts the core-circadian feedback loops66. Most genes in the core-circadian clock analyzed that were classified as impacted by TCDD were downstream CLOCK-ARNTL (e.g., Per2, and Nr1d2 binding (with the single exception of Npas2). It has been shown in previous studies that AHR, ARNTL, and CLOCK colocalize on Per2 and Nr1d1, suggesting that AhR may interrupt normal CLOCK-ARNTL binding. Additionally, Ξ²NF-activated AhR has been shown to interact with ARNTL in Hepa1c1c7 cells impairing CLOCK-ARNTL heterodimer formation at E-boxes within the Per1 promoter84. Per1 was filtered out of the dataset due to low variance, however I see signs of Per1 inhibition in Per1’s cell normalized expression from time points 4-12 hours as well as 24 hours post TCDD treatment (Figure 2.21). 45 Figure 2.21 Time series expression of Per1. Cell normalized counts of Per1 plotted with respect to hours post oral gavage at ZT0 (6:00 AM). In blue (0 πœ‡π‘” π‘˜π‘” TCDD) are measurements from hepatocytes treated with just sesame oil vehicle. In orange are measurements from mice treated with 30 πœ‡π‘” π‘˜π‘” TCDD. Areas in shaded region represent a 95% confidence interval. TCDD’s overall effect on all rhythmic genes in the hepatocyte genome remains mixed. Most genes had modest effects on their expression at specific time-points. Additionally, the overall shape of the oscillations for genes that kept rhythmicity with treatment remained relatively constant. However, genes that gain or lose rhythmicity make up around 15% of all rhythmic genes. Additionally, GSEA results of the genes that lost rhythmicity had many of the hallmarks of TCDD gene expression response79,85. This suggests that TCDD ablates rhythmicity in many of the major pathways it targets. 46 TCDD also significantly impacts zonation pathways in the liver. TCDD alters the zonation of nearly a quarter of all zonated genes. Of those genes, around thirty percent of them have either gained or lost zonation. Genes that have lost zonation are enriched in phase II UDP- glucuronosyltransferase metabolism. Additionally, genes that have completely lost their zonation are enriched in the central liver of the liver lobule. This correlates with the central enrichment of AhR in the liver lobule59. Interestingly, I show that genes significantly affected by TCDD are enriched for targets of regulators of zonation pathways. This is potentially due to the cross talk between the Wnt/𝛽- catenin signaling pathway70. Interestingly, much of this enrichment is in Dx(Z+R) class. The rhythmic component of this class has been hypothesized to be linked to Wnt signaling by non- parenchymal cells in the pericentral zone52. Potentially, TCDD activation of AhR enriched in the central layer59 could cause perturbation to this signaling, and thus explain the observed perturbation to rhythmicity in these genes. Further analysis of the spatial and rhythmic effects of TCDD on non-parenchymal cells would be needed to confirm this hypothesis. Unanswered by my analysis is the exact role AhR plays in the perturbation of the spatial and rhythmic axes of the liver lobule. Are the effects I see on rhythmicity, for example, a result of TCDD activating AhR and AhR binding directly to cis-regulatory elements, or is this a more hierarchical effect? Is TCDD inducing changes to the core circadian clock, which in turn is inducing changes in the rhythmicity of other genes? Is a downstream effect of TCDD’s perturbation of rhythmicity the perturbation of zonation as many zonation regulators are also rhythmic? Or are the effects completely independent of one another? Is it different for different gene pathways? One way to probe these questions in a high throughput manner is through cis- regulatory binding analysis of AhR and its potential binding partners (ARNTL and ARNT) through technologies like ChIP-seq. The presence or absence of binding for each of these pathways could help delineate which pathways are directly perturbed by canonical AhR and which pathways are being perturbed by non-canonical TCDD toxicity pathways. The liver is a fantastically complex organ with a distinct space-time logic to its metabolic organization. Since the liver is central to the metabolism of many drugs, understanding how drugs impact this space-time logic may help to 1) describe how hepatocytes reorganize in response to chemical perturbation, and 2) how break down of this logic is related to liver injury. Here I present a methodology for analyzing the effects of chemical perturbations on zonation and 47 rhythmicity. The methodology can be extended to dose-response experiments via a change in the function describing the effect of chemical treatment. Additionally, the methodology can easily be applied to any snRNA-seq data set involving chemical perturbation of the liver. I envision pharmacologists and toxicologists utilizing the methodology described here to describe not only the impact that chemical have on particular pathways, but also how those impacted pathways figure into the overall metabolic compartmentalization of the liver. 2.4 Methods 2.4.1 Preprocessing of single-nuclei RNA-seq preprocessing Cholico et al72 performed clustering and cell type annotation of the dataset as described in previous lab studies79,86. All preprocessing was performed using the scanpy.pp pacakage87. Raw counts were normalized to the median total cell count using the normalize_total. They were then log transformed with a pseudocount using the log1p function. I refer to counts normalized in this way as β€œnormalized cell counts”. Highly variable genes were selected using the highly_variable_genes function. Filtering of cells first started with the removal of all non-parenchymal cell types so that I was only left with hepatocytes. Hepatocytes with fewer than 1,500 counts and 3 genes were removed from the dataset. Cells with fewer than 200 genes being expressed were also removed. Highly variable genes were selected using the highly_variable_genes function. 2.4.2 Batch correction using scVI scVI29 was trained on the snRNA-seq data to perform batch correction on the data. To remove sample wise batch variance, I labeled all the cells in each biological sample and then used that label as input for the batch labels in scVI. I used default parameters were used for the structure and training the model which are shown in table 2.1 and 2.2 respectively. 48 Hyperparameter Value Latent dimension Number of layers Layer width Dropout rate 30 1 128 0.1 Kullback-Leibler weight 5 Gene expression distribution βˆ— 10βˆ’5 NB Table 2.1 Hyperparameters for scVI’s variational autoencoder model. Latent distribution Normal Hyperparameter Value Training epochs Learning rate Learning rate decay Optimizer Optimizer epsilon 46 0.001 10βˆ’6 Adam 0.01 Table 2.2 Hyperparameters for scVI’s variational autoencoder training. 2.4.3 Layer calculations The latent space representation of the cell normalized counts was used as input into the Diffusion maps algorithm. Diffusion maps was calculated using the function diffmap from the scanpy.tl python package56. The second component of the diffusion maps representation was taken and then min-max scaled in order to generate the pseudo-space metric. The metric was oriented so that centrally enriched expression had the highest values and portally enriched expression had the lowest values. Cells were binned into five layers using the pseudo-space metric. Each bin represents a fifth of the pseudo-space trajectory (i.e., cells in bin 𝑖 have pseudo-dose values between π‘–βˆ’1 5 and 𝑖 ). 5 Counts from each treatment x time x layer combination were summed across all cells in then 49 normalized using the computeSumFactors function from the scran R package78. I call these pseudo-bulk normalized counts β€œNormalized Counts” in the manuscript. Genes were classified based on which layer they had maximum β€œNormalized Counts”. Genes that had maximum expression in layers 1 and 2 were considered portal. Genes that had maximum expression in layer 3 were considered mid-lobular. Genes that had maximum expression in layers 4 and 5 were considered central. 2.4.4 Design of Mixed Non-Linear Effects Models Mixed non-linear effects models (MNLEM) were deployed using the MixedLM class in statmodels python package88. In Table 2.3 I describe the individual terms for each factor: TCDD influence (D), rhythmicity (R), and zonation (Z). Equations for each class described in terms in Table 2.3 are described in Table 2.4. Term Effect Equation 𝐷 𝑅𝑆𝑖𝑛 π‘…πΆπ‘œπ‘  𝑍𝑃1 𝑍𝑃2 D R R Z Z 0 𝑖𝑓 π‘†π‘’π‘ π‘Žπ‘šπ‘’ 𝑂𝑖𝑙 πΆπ‘œπ‘›π‘‘π‘Ÿπ‘œπ‘™ { 1 𝑖𝑓 𝑇𝐢𝐷𝐷 π‘‘π‘Ÿπ‘’π‘Žπ‘‘π‘šπ‘’π‘›π‘‘ sin (πœ”π‘‘) cos (πœ”π‘‘) 𝑙 3𝑙2 2 Table 2.3 Terms for mixed linear effects models. Each term is denoted by its name and its effect. D is TCDD (Dioxin) Influence, R is rhythmicity, and Z is zonation. 𝑑 is the time in hours after treatment. 𝑙 is the layer the of the liver lobule. πœ” is the conversion factor between 𝑑 and radians which is equal to πœ‹ 12 . 50 Class F D R Z Z+R RxZ DxR DxZ Equation for Model 𝑦 = 𝛽0 𝑦 = 𝛽0 + 𝛽1𝐷 𝑦 = 𝛽0 + 𝛽1𝑅𝑆𝑖𝑛 + 𝛽2π‘…πΆπ‘œπ‘  𝑦 = 𝛽0 + 𝛽1𝑍𝑃1 + 𝛽2𝑍𝑃2 𝑦 = 𝛽0 + 𝛽1𝑅𝑆𝑖𝑛 + 𝛽2π‘…πΆπ‘œπ‘  + 𝛽3𝑍𝑃1 + 𝛽4𝑍𝑃2 𝑦 = 𝛽0 + (𝛽1𝑅𝑆𝑖𝑛 + 𝛽2π‘…πΆπ‘œπ‘  ) βˆ— (𝛽3𝑍𝑃1 + 𝛽4𝑍𝑃2) 𝑦 = 𝛽0 + 𝛽1𝐷 βˆ— (𝛽2𝑅𝑆𝑖𝑛 + 𝛽3π‘…πΆπ‘œπ‘ ) 𝑦 = 𝛽0 + 𝛽1𝐷 βˆ— (𝛽2𝑍𝑃1 + 𝛽3𝑍𝑃2) Dx(Z+R) DxZxR 𝑦 = 𝛽0 + 𝛽1𝐷 βˆ— (𝛽2𝑅𝑆𝑖𝑛 + 𝛽3π‘…πΆπ‘œπ‘  + 𝛽4𝑍𝑃1 + 𝛽4𝑍𝑃2) 𝑦 = 𝛽0 + 𝛽1𝐷 βˆ— (𝛽2𝑅𝑆𝑖𝑛 + 𝛽3π‘…πΆπ‘œπ‘  ) βˆ— (𝛽4𝑍𝑃1 + 𝛽5𝑍𝑃2) Table 2.4 Equations for each mixed non-linear effects model used for classification. Each term is denoted by its name and its effect. D is TCDD (Dioxin) Influence, R is rhythmicity, and Z is zonation. 𝑦 represents the predicted gene expression. 𝛽 represents the associated gene weight. All other terms such as 𝑅𝑆𝑖𝑛 and 𝑍𝑃1 are defined in Table 2.3. Implementation of the MNLEM was almost identical to Droin and Kholtei et al52. These equations were fit to normalized count of each individual gene using the Nelder-Mead optimization algorithm89. A noise offset (Οƒ0 = 0.15) was added to the data to make sure that overfitting was avoided. Equations with the smallest overall Bayesian information criterion90 (BIC) were classified with their corresponding class. BIC acts as a general multi-comparison analogue to the likelihood ratio (πœ’2) test as it penalizes more complex models. The exception to classifying with models that have the smallest BIC were when models tied with one another. I defined a tie as having a relative difference of 1%. In the case of ties, models with fewer parameters were selected. Bar graphs of classification were calculated using Shwarz weights. Shwarz weights are calculated using differences between the BIC values and the minimum BIC value in across all models (𝐡𝐼𝐢𝑀𝑖𝑛) using the following equation: π‘ β„Žπ‘€π‘Žπ‘Ÿπ‘§ π‘€π‘’π‘–π‘”β„Žπ‘‘ = 𝐡𝐼𝐢𝑖 βˆ’ 𝐡𝐼𝐢𝑀𝑖𝑛 exp 2 𝑛 (𝐡𝐼𝐢𝑖 βˆ’ 𝐡𝐼𝐢𝑀𝑖𝑛) Σ𝑖=0 51 2.4.5 Differential Rhythmicity and Zonation To perform differential rhythmicity and zonation analysis, I fit gene expression to models much like what I did in section 2.4.5. However instead of using Nelder-Mead, I used ordinary least squares with the ols function from the statsmodels.api python package. For each gene, I separated expression values between treated and control. These groups of expression values were then fitted either to the R class model or F class model (see table 2.4 for description of functions) in the case of differential rhythmicity or the Z class model or F class model in differential zonation. To see if the R or Z class models were more descriptive of gene expression, I used the likelihood ratio test (also called the πœ’2 test). Models that were gene expression that was significantly better explained by the R model was said to have rhythmicity (the same goes for zonation and the Z model). The likelihood ratio test was implemented using the chi2 function from the scipy.stats python package. I estimated rhythmicity parameters from the fitted parameters in R models. If I let π‘Ž = 𝛽1 and 𝑏 = Ξ²2 then I can calculate the amplitude of the expression oscillation as: And I can define the phase as: π΄π‘šπ‘π‘™π‘–π‘‘π‘’π‘‘π‘’ = βˆšπ‘Ž2 + 𝑏2 π‘β„Žπ‘Žπ‘ π‘’ = π‘Žπ‘‘π‘Žπ‘›2(𝑏, π‘Ž) Linear zonation slope was calculated for all zonal genes that kept zonation post TCDD treatment and were not mid-lobular. To calculate zonation slope I fit a simpler zonation model than the one in table 2.4: 𝑦 = 𝛽0 + 𝛽1𝑙 I fit this model much in the same way I did above during differential zonation. The zonation slope was equal to 𝛽1. The procedure was similar to how one would find the line of best fit. 2.4.6 Statistical Tests I determined differential expression between treatments and parameter values using the Mann- Whitney U-test using the function mannwhitneyu. To determine if there were differences between the distribution of gene zones between gene lists, I used the Kolmogorov-Smirnov test implemented in the ks_2samp function. All functions used for evaluating statistical significance were in the scipy. stats python package91. Gene set enrichment analysis was performed using Enrichr83 pathway analysis with in the gseapy package92. All significant pathways had an FDR < 0.2. 52 CHAPTER 3 PREDICTION OF SINGLE DOSE CHEMICAL PERTURBATIONS ACROSS CELL STATES USING VARIATIONAL AUTOENCODERS 3.1 Introduction The problem I aim to tackle in this chapter of the thesis is generalizing chemical responses across cell types. More formally, given that I know the response for some chemical 𝑋 in cell types 𝐴 and 𝐡, can I then use that information to accurately predict the response for the same chemical in a third cell type 𝐢? This is an important but challenging task in pharmacology and toxicology. Biological context makes the physiological response to chemical perturbation unique for different cell types. Thus, the information about the responses in cell types 𝐴 and 𝐡 may not be directly applicable to the response in cell type 𝐢. As a result, toxicologists require a great deal of information across biological systems to make assessments about the safety of a particular drug or chemical. A technology that could help address this problem is single-cell RNA-Seq (scRNA- seq). Perturbational scRNA-seq can measure the chemical responses of tens of thousands of individual cells across thousands of genes93. As a result, I can start to map the transcriptomic space of chemical perturbations at the single cell level. However, even with advances like the Mix-Seq protocol94 or comprehensive datasets like Sciplex95, scientists have only explored a small portion of the cell type x chemical perturbation space. Given this challenge, computational biologists have attempted to leverage large datasets to predict chemical perturbations in a variety of biological contexts96. Specifically, a number of deep generative modeling tools have been developed to predict chemical perturbations across cell types30,33,41,97,98. Such models have the potential to quantitatively map relevant portions of the cell type x chemical perturbation space using a relatively small amount of data. The first use of deep generative modeling in the prediction of chemical perturbations was scGen30, a variational autoencoder that performed dimensionality reduction by encoding scRNA- seq data into a latent space. It then used linear vector arithmetic to predict the state of virtual cells in the latent space, and finally decoded the virtual cells back to the full gene expression space. Alternative approaches have also been described in the literature. For example, scPreGan98 is a generative adversarial network which utilizes a generator-discriminator framework to predict the distributions of the unknown perturbed cell type. Another alternative is CellOT97, which is an autoencoder framework that utilizes optimal transport to predict cell type- specific perturbation. However, I find existing approaches perform poorly when approximating 53 perturbations in in vivo experiments. More specifically, when used to predict the in vivo expression of cell-type dependent differentially expressed genes (DEGs), they are unable to approximate the expression means accurately. As such, there is a need for models that can predicts the chemical perturbation of cell type specific responder genes appropriately. Here I propose single cell Variational Inference of Dose-Response (scVIDR), a regression- based improvement to the scGen model (Figure 3.1). scVIDR boosts prediction accuracy for cell type specific differentially expressed genes (DEGs). Additionally, scVIDR approximates high- dose experiments better than other state of the art algorithms. The model accomplishes this across several datasets including mouse liver cells treated with 30 πœ‡g/kg TCDD sub-chronically (Nault et al)79,86, PBMCs treated with IFN-𝛽 (Kang et al) 99, and different mammalian species treated with LPS6 (Hagai et al)100. Figure 3.1 Schematic of scVIDR for prediction of a single dose for some unknown cell type. Outline of the scVIDR model for expression prediction for unknown single dose-response in cell type 3. Training is done using cell types 1 and 2 as input to a variational autoencoder model. The difference between the means of latent representations of the control and treated groups, 𝛿1 and 𝛿2, are used as input into a linear regression model, which then predicts the unknown 𝛿3 of cell 54 Figure 3.1 (cont’d) type 3. I then use the decoder portion of the model to output the latent space predictions back into gene expression space. 3.2 Results 3.2.1 Description of scGen training and prediction I begin by considering a scRNA-seq dataset 𝑋 = {π‘₯𝑖}𝑖=1 𝑁 consisting of 𝑁 cells, where π‘₯𝑖 represents the expression profile of cell 𝑖. I assume that gene expression is generated by some continuous random process involving a lower dimensional random variable 𝑧. The generative process that describes the mapping from 𝑧 to 𝑋 is given by the probability distribution, π‘πœƒ(𝑋|𝑧) . Thus, given that I know 𝑋 and not 𝑧, I would like to approximate the probability distribution that maps 𝑋 to 𝑧, π‘πœƒ(𝑧|𝑋). Since calculating π‘πœƒ(𝑧|𝑋) is usually intractable, I use a neural network, the encoder, to approximate it using a different Gaussian distribution, π‘žπœ™(𝑧|𝑋). To map values back from 𝑧 to 𝑋, I use a second neural network, the decoder, to approximate π‘πœƒ(𝑋|𝑧). In practice, both the encoder and decoder are trained together to minimize the reconstruction error of the decoder and the difference between the prior distribution and the encoder distribution (see section 3.4.1 for full mathematical description). I characterize whether a cell has been treated with a set concentration of the chemical of interest with the indicator variable 𝑑 (Figure 3.1 A). I set 𝑑 = 1 for cells that have been treated with the chemical (treatment) and 𝑑 = 0 for cells that have not been treated (control). A dataset contains 𝑐 cell types within both the 𝑑 = 0 and 𝑑 = 1 groups. Each time a model is evaluated, one treated cell type is withheld from training and used in evaluation. In standard VAE vector arithmetic (scGen) the latent space representation of the perturbation of some cell type 𝐴 is approximated by 𝑧̂𝑖,𝐴,𝑑=1 = 𝑧𝑖,𝐴,𝑑=0 + 𝛿 where 𝑧𝑖,𝐴,𝑑=0 is the latent gene expression representations of cell type 𝐴,30 and 𝛿 is the difference between the centroids of the treated and control training groups in the latent space. scGen is described as a fixed model98, meaning that 𝛿 is not conditional on the cell type, i.e. it assumes that perturbations in the latent space are consistent across all cell types. While this may hold for datasets with conserved responses30, in datasets with more heterogenous tissues I find that perturbations in the latent space are not consistent across cell types. Thus, scGen is not 55 effective for chemical perturbation prediction when evaluating responses across highly distinct cell types. 3.2.2 scGen’s 𝜹 deviates greatly from cell specific differences when outlier cell types are present. The scGen model assumes that perturbations on the latent space are similar in magnitude and direction. To define what I mean by this, I first need to define a cell specific perturbation vector 𝛿, 𝛿𝑐. I define the difference between the latent centroids of the treated (t = 1) and control (t = 0) groups for a particular cell type A as 𝛿𝑐=𝐴 or as: 𝛿𝑐 = 𝑧̅𝑐=𝐴,𝑑=1 βˆ’ 𝑧̅𝑐=𝐴,𝑑=0 where 𝑧̅ is the average on the latent space. I calculated 𝛿𝑐 using a model trained on a snRNA-seq dataset of livers from mice gavaged with 30 πœ‡g/kg TCDD sub-chronically (Nault et al)79. A UMAP projection of the data can be seen in Figure 3.2, where I can observe by eye that perturbations on individual cell types differ significantly. However, since UMAP only preserves local rather than global distances, I need a more formal way of examining the data. Figure 3.2 UMAP projection of snRNA-seq data from TCDD-perturbed mouse livers. UMAP is performed on normalized gene expression of each cell. Individual points represent 56 Figure 3.2 (cont’d) UMAP of gene expression for individual liver cells. Cells are colored by cell type and dose (πœ‡g/kg) of TCDD. I evaluate the assumptions of scGen more formally using a combination of dimensionality reduction that preserves global distances, and high dimensional metrics. In Figure 3.3A, I use PCA to show that 𝛿𝑐 for many of the cell types in the dataset deviate significantly from 𝛿𝑠𝑐𝐺𝑒𝑛. 𝛿𝑠𝑐𝐺𝑒𝑛 is defined as the difference between the average of treated cell subtracted by the average of the control cells on the latent space of the scGen model. Specifically, 𝛿𝑠𝑐𝐺𝑒𝑛 has a higher overall magnitude than most other cell types except endothelial cells (Figure 3.3 B). To evaluate whether the directions for 𝛿𝑐 are consistent with 𝛿𝑠𝑐𝐺𝑒𝑛 I calculated the cosine distance between 𝛿𝑠𝑐𝐺𝑒𝑛 and 𝛿𝑐, which showed that 𝛿𝑠𝑐𝐺𝑒𝑛 deviates significantly from all 𝛿𝑐 in terms of direction 𝛿𝑠𝑐𝐺𝑒𝑛. 57 Figure 3.3 All Ξ΄c s deviate from Ξ΄scgen with respect to their direction across most cell types in the TCDD mouse liver snRNA-seq dataset. A) A PCA visualization of the calculated Ξ΄c’s for a VAE trained on all cell types. Each arrow represents the calculated 𝛿 for a particular cell type B) Bar plots of the magnitude of the Ξ΄ and other individual Ξ΄c’s, and Ξ΄c cosine distance from the Ξ΄. 58 Figure 3.3 (cont’d) A cosine distance of 0 represents a Ξ΄c in the same direction as Ξ΄, of 1 represents a Ξ΄c orthogonal to Ξ΄, and of 2 represent a Ξ΄c in the opposite direction as Ξ΄. It is worth noting that outlier cell types could influence the calculation of 𝛿𝑠𝑐𝐺𝑒𝑛 since the latent space is highly sensitive to changes in the composition of the training set. Thus, if I were to take out endothelial cells and stellate cells from the analysis and retrain the model, it does not necessarily fix the issue of outliers influencing the relative magnitude and direction of 𝛿𝑠𝑐𝐺𝑒𝑛. Instead, their removal exaggerates other differences in the latent space during the process of dimensionality reduction whereby cholangiocytes become a major outlier (Figure 3.4). Figure 3.4 Ξ΄c s deviation from Ξ΄ with respect to their direction across most cell types is sensitive to training dataset. A PCA visualization of the calculated Ξ΄c’s for a VAE trained on all cell types excluding endothelial cells and stellate cells. Each arrow represents the 𝛿𝑐 associated with a specific cell type. I conclude that a fixed model of 𝛿 cannot account for the full scope of heterogeneity that exists within many relevant toxicological datasets, and that the averaged 𝛿𝑠𝑐𝐺𝑒𝑛 is an inadequate way to 59 predict perturbations in the varied mouse liver cells. Instead of using a fixed 𝛿𝑠𝑐𝐺𝑒𝑛 I need a new 𝛿 prediction method that conditions on the transcriptomic state of the control cells. To do this, I introduce a regression-based correction to the data in order to more accurately capture 𝛿𝑐. 3.2.3 scVIDR better predicts cell specific differences on the latent space of mouse liver TCDD perturbation than scGen. When analyzing the latent space, I observed that the centroid of the control population for a particular cell type was predictive of 𝛿𝑐 (𝑅2 β‰ˆ 0.9), which led me to hypothesize that information about the perturbations of each cell type on the latent space might be encoded in the location of the corresponding centroid. I took a simple approach of using 𝛿𝑐 using the centroids of each control population as input data. From this point forward I will refer to this method as scVIDR. scVIDR improved on scGen in terms of predicting cell type specific changes in gene expression (Figure 3.5 A). scVIDR’s prediction, 𝛿𝑠𝑐𝑉𝐼𝐷𝑅, of π›Ώβ„Žπ‘’π‘π‘Žπ‘‘π‘œπ‘π‘¦π‘‘π‘’π‘ βˆ’π‘π‘œπ‘Ÿπ‘‘π‘Žπ‘™ is closer to the PCA projection than scGen’s prediction, 𝛿𝑠𝑐𝐺𝑒𝑛. Additionally, scVIDR better predicts the magnitude and direction of π›Ώβ„Žπ‘’π‘π‘Žπ‘‘π‘œπ‘π‘¦π‘‘π‘’π‘ βˆ’π‘π‘œπ‘Ÿπ‘‘π‘Žπ‘™ than scGen. I conclude from these observations on the latent space that scVIDR is a more viable way to infer 𝛿𝑐. 60 Figure 3.5 π›…π¬πœπ†πžπ§ deviates more from π›…π‡πžπ©πšπ­π¨πœπ²π­πžπ¬βˆ’π©π¨π«π­πšπ₯ than π›…π¬πœπ•πˆπƒπ‘ . A) A PCA visualizationof the calculated 𝛿𝑐’s for a VAE trained without portal hepatocytes. β€œscGen Hepatocytes – portal” refers to the prediction by scGen (𝛿𝑠𝑐𝐺𝑒𝑛), and β€œscVIDR Hepatocytes – 61 Figure 3.5 (cont’d) portal” refers to the prediction by scVIDR (𝛿𝑠𝑐𝑉𝐼𝐷𝑅). B) Bar plots of the magnitude of the 𝛿𝑐′𝑠, and the cosine distance from the π›Ώπ»π‘’π‘π‘Žπ‘‘π‘œπ‘π‘¦π‘‘π‘’π‘ βˆ’π‘π‘œπ‘Ÿπ‘‘π‘Žπ‘™ for each 𝛿𝑐. A cosine distance of 0 represents a 𝛿𝑐 in the same direction as π›Ώπ»π‘’π‘π‘Žπ‘‘π‘œπ‘π‘¦π‘‘π‘’π‘ βˆ’π‘π‘œπ‘Ÿπ‘‘π‘Žπ‘™, of 1 represents a 𝛿𝑐 orthogonal to π›Ώπ»π‘’π‘π‘Žπ‘‘π‘œπ‘π‘¦π‘‘π‘’π‘ βˆ’π‘π‘œπ‘Ÿπ‘‘π‘Žπ‘™ and of 2 represent a 𝛿𝑐 in the opposite direction as π›Ώπ»π‘’π‘π‘Žπ‘‘π‘œπ‘π‘¦π‘‘π‘’π‘ βˆ’π‘π‘œπ‘Ÿπ‘‘π‘Žπ‘™. scVIDR is equivalent to scGen when there is only one cell type in the training dataset (Figure 3.6). This is because regression on the latent space of one cell type returns the 𝛿 of the entire dataset. Figure 3.6 scVIDR is equivalent to scGen when training on a single cell type. A) A UMAP projection of latent space of single-cell expression of two cell types from Kang et al99: CD4T and B cells. They are colored by cell type, condition and train test split. B) Validation of prediction of B-cell perturbation when the VAE is trained solely on CD4-T cells. A regression plot is shown for both scVIDR and scGen performance. Each point represents the mean expression of a particular gene. Red points represent the top ten differentially expressed genes. Shaded region around regression line represents the 95% confidence interval. A PCA plot of the predictions is displayed where each point represents an individual cell. Ground truth is in orange and the 62 Figure 3.6 (cont’d) control data is colored blue. Predictions by scGen and scVIDR are colored green and red respectively. 3.2.4 scVIDR predicts single-dose, single-cell perturbation expression better than other state-of-the-art algorithms. I next evaluated scVIDR by comparing its ability to predict changes in gene expression in the mouse liver dataset with other state-of-the-art algorithms. Our training set (Figure 3.7A) consisted of all control and TCDD-treated cell types except for TCDD-treated portal hepatocytes which were used for model evaluation. I compared the performance of scGen30, scPreGAN98, CellOT97, and scVIDR on the top 5000 highly variable genes (HVGs), and the top 100 differentially expressed genes (DEGs). When predicting the gene expression of portal hepatocytes, each method generated a set of virtual portal hepatocytes (Figure 3.7B). I then computed the average expression of each gene across all cells and compared the average gene expression in predicted cells versus cells derived from snRNA-seq experiments. Across HVGs, the scVIDR model yielded an average 𝑅2 of 0.92 (Figure 3.7C). Across DEGs, scVIDR produced an average 𝑅2 of 0.81 (Figure 3.7C). Continuing the evaluation across all cell types (Figure 3.7D), leaving out one cell type perturbation at a time as described above for portal hepatocytes, our model outperformed all other models (with p-value < 0.001, one sided Mann- Whitney U Test) both when evaluated on HVGs and DEGs. 63 Figure 3.7 Prediction of in vivo single cell gene expression of portal hepatocytes from mice treated with 30 𝛍g/kg of TCDD. A) UMAP projection of the latent space representation of scVIDR for control and treated single-cell gene expression. In A) each cell type and dose in 64 Figure 3.7 (cont’d) πœ‡g/kg combination, and by the train-test split for model training is represented by different colors. In the example in the figure, TCDD-treated portal hepatocytes were used as a test set. B) PCA plots of predicted portal hepatocytes responses following treatment with 30 ΞΌg/kg of TCDD using scGen30, scVIDR, scPreGAN98, and CellOT97. C) Regression plots of each model. Each point represents the mean expression of a particular gene. Red points represent the top ten differentially expressed genes. Shaded region around regression line represents the 95% confidence interval. D) Boxplot of 𝑅2 for prediction across all liver cell types treated with 30 ΞΌg/kg of TCDD. Calculation of the mean 𝑅2 across all highly variable genes (blue). Calculation of the mean 𝑅2 across the top 100 differentially expressed highly variable genes (orange). Prediction performance distributions were compared using one sided Mann-Whitney U test. **** indicates p-values < 0.001. I further evaluated scVIDR on an additional dataset of IFN𝛽-treated PBMC 87. A similar benchmark was performed as before (Figure 3.8), however instead of using the top 5000 highly variable genes, I used the top 6998 genes to be consistent across models compared. For this dataset, π‘‘β€ˆ = β€ˆ1 labels PBMCs treated with 100 U/ml IFN-𝛽 , π‘‘β€ˆ = β€ˆ0 for untreated PBMCs. The left-out cell type being predicted is B cells (Figure 3.7A). Across HVGs, the models yielded 𝑅2 values of 0.97, 0.92, 0.77 and 0.66 and across DEGs, and 𝑅2 values of 0.96, 0.86, 0.80, and 0.84 for scVIDR, scGen, scPreGAN, and CellOT respectively (Figure 3.7C). When accuracy was assessed for all cell types, scVIDR significantly outperformed all other models (Figure 3.8D). 65 Figure 3.8 Prediction of in vitro response of B cells to IFN𝛃. A) UMAP projection of latent space of scVIDR for treated and untreated single-cell expression. UMAP plots are colored by 66 Figure 3.8 (cont’d) cell type, training split, and condition, respectively. B) PCA plot of scGen, scVIDR, scPreGAN, and CellOT predictions of B-cell expression after IFN𝛽 treatment. C) scGen, scVIDR, scPreGAN, and CellOT prediction versus experimental expression data regression plot. Each point represents the mean expression for a particular gene. Red points represent the top ten differentially expressed genes. Shaded region around regression line represents the 95% confidence interval. D) Boxplot of 𝑅2 scores across all tissues in the PBMC treated dataset. Prediction of all highly variable genes (blue), and top 100 differentially expressed genes (orange). Prediction performance distributions were compared using one sided Mann-Whitney U test. **** indicates p-values < 0.001. To make sure that scVIDR is predicting unknown physiologies in chemical perturbations, I performed an additional experiment on IFN𝛽 treated B-cells by selecting a subset of genes unique to IFN𝛽 response of B-cells. To do this I take the set of DEGs unique to B-cells (Figure 3.9A). I show in Figure 3.9B that scVIDR consistently outperforms other models at predicting the gene expression means across all unique DEGs. 67 Figure 3.9 Evaluating scVIDR on differentially expressed genes unique to B-cells. A) A Venn diagram of the set of all top 100 differentially expressed genes (DEGs) for each cell type in the IFN𝛽 treated PBMC dataset. Overlap between the top 100 differentially expressed gene in B- 68 Figure 3.9 (cont’d) cells and the set of all top 100 differentially expressed genes in all other cell types is shown. B) Regression plots of predictions of the differentially expressed genes unique to B-cells (red portion of panel A). Each point represents the mean expression for a particular gene. Shaded region represents the 95% confidence interval. 3.2.5 scVIDR predicts rat phagocyte perturbation by LPS6 better than all other state of the art algorithms. Next, I extend the idea of predicting chemical perturbations across cell types to cross-species predictions. The index 𝑐 in 𝛿𝑐 now refers to the species that the cells came from rather than individual cell type. I also attempt to account for species differences in the same way that I try to correct for cell type differences using scVIDR. I benchmark this potential for scVIDR on a dataset from Hagai et al100, where mononuclear phagocytes were harvested from four different mammal species (rat, pig, mouse, and rabbit). π‘‘β€ˆ = β€ˆ1 refers to phagocytes treated with 100 n𝑔/ml IFN-𝛽 , and π‘‘β€ˆ = β€ˆ0 for phagocytes treated with control, and the left-out species I want to predict are rats (Figure 3.10A). I see that scVIDR still outperforms all other algorithms at predicting perturbation of rat phagocytes (Figure 3.10B, C), with a correlation of 𝑅2 = 0.92 for HVGs, and an 𝑅2 = 0.76 for highly variable genes. I observe that while the predictions are more accurate for scVIDR, it still has difficulty predicting the 69 DEGs for rats. Figure 3.10 scVIDR predicts the effects of LPS6 on rat cells from mouse, rabbit, and pig cells better than other state-of-the-art algorithms. A) UMAP projection of latent space of scVIDR for treated and untreated single-cell gene expression. UMAP plots are colored by species, training split, and treatment condition, respectively. B) PCA plot of scGen, scVIDR, scPreGAN, and CellOT predictions of rat phagocytes after LPS6 treatment. C) scGen, scVIDR, scPreGAN, and CellOT prediction versus experimental expression data regression plot. Each point represents the mean expression for a particular gene. Red points represent the top ten differentially expressed genes. Shaded region around regression line represents the 95% confidence interval. 3.3 Discussion I have demonstrated several cases where scVIDR can predict single-dose chemical perturbations in scRNA-seq data. First, I establish that scGen is an inadequate algorithm for prediction of highly heterogenous tissues. Then I show how to improve scGen using regression to predict the 70 effect in a more cell-type specific way using an algorithm I developed, scVIDR. I then utilized scVIDR to predict chemical perturbations in three different scenarios. In the first scenario, I predict TCDD-induced perturbation in mouse liver cells. In the second where I predict IFN𝛽 stimulation of PBMCs. In the third scenario, I predict LPS6-induced chemical perturbation of phagocytes derived from rats. In every case, scVIDR outperforms all other algorithms. A question that arises is why scGen and scVIDR both perform better on the PBMC dataset (Kang et al)99 than the rat phagocyte (Hagai et al)100 and mouse liver datasets (Nault et al)79. I hypothesize that this is in part due to inter-cell-type overlap in the responses. I quantify this by looking at the overlap in the top 100 DEGs across all cells or species in the dataset. I observe there is much higher average overlap DEGs between the in the Kang et al dataset than in all other datasets (Figure 3.11). This makes intuitive sense given that datasets with similar cell-type responses will have more overlap and thus will be easier to predict for the VAE. I show that while this is an overall limitation in the VAE paradigm, scVIDR is more robust this limitation than scGen (see Sections 3.2.4 and 3.2.5). 71 Figure 3.11 Overlap between differentially expressed genes reflects scVIDR prediction performance across datasets. A-C) Heatmaps of Jaccard similarity between top 100 DEGs for each pair of cells in each dataset. A) is the IFN-𝛽 treated PBMCs by Kang et al.99 , B) TCDD Mouse liver from Nault et al.79 C) is the LPS6 treated phagocytes by Hagai et al.100 D) Boxplot of 𝑅2 regression values for top 100 DEGs by scVIDR for each study in chapter 3. 72 However, DEG similarities do not completely explain why within datasets, there are differences in predictions between certain cell types. For example, the endothelial cell group, while having the smallest average overlap, has the second highest prediction scores in the Nault et al dataset (Figure 3.12A, D). Likewise, in the TCDD dataset, scVIDR performed better on the cell types most sensitive to TCDD, e.g., hepatocytes and endothelial cells (Figure 3.12B). When looking at cell types less sensitive to TCDD (e.g., cholangiocytes and stellate cells), the model often underestimated the expression of differentially expressed genes (Figure 3.12E). This is likely a result of a combination of factors including the similarity of the treatment to the control data (Figure 3.12B), smaller control cell populations (Figure 3.12C), and the overall low expression of highly variable genes (Figure 3.12E). Thus, I believe the VAE has less information to predict differential gene expression for these cell types. Additional control data for more rare subpopulations could help alleviate the prediction inaccuracy. 73 Figure 3.12 Impact of differential expression overlap, latent perturbation magnitude, control population size on overall model performance. A) Average Jaccard similarity for each 74 Figure 3.12 (cont’d) cell type in the Nault et al study79 B) Sinkhorn distance between the latent distributions of the control and 30 πœ‡g/kg doses of TCDD of each cell type on the latent space. C) Bar plot of the control group cell population size for each cell type. D) Bar plot of mean gene 𝑅2 for each individual cell type when predicting only the 30 πœ‡g/kg dose of TCDD. E) scVIDR prediction versus real expression regression plot of cholangiocytes and stellate cell from mice administered with a 30 πœ‡g/kg dose of TCDD. Each point represents the mean expression of a gene. The top 10 differentially expressed genes are represented with red points. Shaded region around regression line represents the 95% confidence interval. Mapping the combinatorial space of chemical perturbation is becoming more and more crucial as industrial practices and drug development accelerate the number of chemicals toxicologists need to evaluate for safety. Doing so experimentally is prohibitively expensive due to the sheer size of the chemical x dose perturbation space. Computational algorithms that can support current endeavors to do high throughput testing of chemical perturbations are needed now more than ever. I envision a future where technologies such as scVIDR can be applied to large single-cell chemical perturbation atlases to reduce the number of experiments needed to evaluate a drug or chemical. Not only does this represent a practical endpoint to technologies like scVIDR, but also an ethical one, as reduction of animal testing has become a high priority in the field of toxicology. Thus, I feel that this work is of immense significance to the fields of pharmacology and toxicology. 3.4 Methods 3.4.1 A mathematical description of variational autoencoders Variational Autoencoders28 aim to estimate the posterior probability function, π‘πœƒ(𝑧|𝑋) of a latent process, 𝑧, given a set of observations, 𝑋. By Bayes theorem I can calculate the posterior probability: π‘πœƒ(𝑧|𝑋) = 𝑝(𝑋, 𝑧) 𝑝(𝑋) However, calculating the posterior in this way is intractable due to the difficulty in computing the marginal distribution: 𝑝(𝑋) = βˆ«π‘(𝑋|𝑧)𝑝(𝑧)𝑑𝑧 𝑧 75 Thus, I instead aim to approximate π‘πœƒ(𝑧|𝑋) by minimizing the Kullback-Leibler (KL) divergence between π‘πœƒ(𝑧|𝑋) and some gaussian distribution, π‘žπœ™(𝑧|𝑋), whose parameters, πœ™, are calculated by a neural network. This neural network is termed the encoder. I can calculate the KL divergence as: 𝐷𝐾𝐿(π‘žπœ™(𝑧|𝑋)||π‘πœƒ(𝑧|𝑋)) = 𝐸𝑧~π‘ž (π‘™π‘œπ‘” π‘žπœ™(𝑧|𝑋) π‘πœƒ(𝑧|𝑋) ) Substituting π‘πœƒ(𝑧|𝑋) with Bayes theorem and rearranging the terms: 𝐷𝐾𝐿(π‘žπœ™(𝑧|𝑋)||π‘πœƒ(𝑧|𝑋)) = log π‘πœƒ(𝑋) βˆ’ 𝐸𝑧~π‘ž (log π‘πœƒ(𝑋,𝑧) π‘žπœ™(𝑧|𝑋) ) The second term in the equation above is also known as the evidence lower bound, or ELBO. Since the KL divergence must be a positive value, I can minimize it by maximizing the ELBO. I can rewrite the ELBO as: 𝐸𝑧~π‘ž (log 𝑝(𝑋, 𝑧) π‘žπœ™(𝑧|𝑋) ) = 𝐸𝑧~π‘ž (log π‘πœƒ(𝑋|𝑧)π‘πœƒ(π‘₯) π‘žπœ™(𝑧|𝑋) ) = 𝐸𝑧~π‘ž(log π‘πœƒ(𝑋|𝑧)) βˆ’ 𝐸𝑧~π‘ž (log π‘žπœ™(𝑧|𝑋) π‘πœƒ(𝑧) ) Since the second term in the equation above is equivalent to the definition of the KL Divergence I can rewrite the ELBO as: 𝐸𝑧~π‘ž(log π‘πœƒ(𝑋|𝑧)) βˆ’ 𝐷𝐾𝐿(π‘žπœ™(𝑧|𝑋)||π‘πœƒ(𝑧)) The first term is known as the reconstruction error. This maximizes the likelihood that I will generate values from the latent space that match our observations. The second term is the KL divergence term, which is predicated by the KL divergence between the distribution estimated by the encoder and the prior distribution which is a standard normal multivariate distribution. This second term encourages structure in 𝑝(𝑧) as minimizing the difference between π‘žπœ™(𝑧|𝑋) and, as a result, maximizing disentanglement in 𝑝(𝑧). I can now construct our objective function, 𝐿(πœƒ, πœ™), as the following: 𝐿(πœƒ, πœ™) = βˆ’πΈπ‘§~π‘ž(log π‘πœƒ(𝑋|𝑧)) + 𝐷𝐾𝐿(π‘žπœ™(𝑧|𝑋)||π‘πœƒ(𝑧)) Unfortunately, naively trying to take the gradient with respect to πœ™ to minimize this function will result in highly variable gradients and thus variable training results. To fix this I instead utilize the reparameterization trick, where I sample from the latent space such that 𝑧 is deterministic with respect to some noise variable, πœ– ~ 𝑁(0, 𝐼), or: 1 2 β‹… πœ– 𝑧𝑖 = πœ‡πœ™ + Ξ£πœ™ Where, πœ‡ is the mean vector and Ξ£ is the covariance matrix of the inferred distribution. 76 3.4.2 Single cell expression datasets and preprocessing for scVIDR. Data from Nault et al79 was collected and processed from raw count expression matrices. The cell expression vectors are normalized to the median total expression counts for each cell. The cell counts are then log transformed with a pseudo-count of 1. Finally, I select the top 5000 most highly variable genes to do our analysis on. The preprocessing was carried out using the scanpy.pp package using the normalize_total, log1p, and highly_variable functions87. The Nault of et al dataset79 comprised of single nuclei RNA-seq of C57BL6 of flash frozen mouse livers. Mice in this dataset were administered, sub-chronically, a specified dose of TCDD via oral gavage every 4 days for 28 days. In our analysis, all immune cell types were left out, as immune cells are known to migrate from the lymph to the liver during TCDD administration79. Thus, there is a small size for the immune cell populations in the low-dose datasets versus the high-dose. PBMC data from Kang et al99, Study B data from Zheng et al16, and species data from Hagai et al100,was accessed as a processed dataset from Lotfollahi et al30. When training scGen and scVIDR, batch effects are accounted for with the scvi.data package using the setup_anndata function29. Differential abundances of cells in different groups are accounted for by random sampling with replacement the same number of cells for each dose and random sampling without replacement the same number of cells for each cell type. 3.4.3 Implementation and Training of Models All code in this manuscript is implemented in the Python programming language. The scVIDR model is built on the python package, scGen v. 2.0.030 which in turn is buil on the python pachakge scVI v. 0.13.029. I extend existing code bases to include linear regression on the latent space. Hyperparameters for the model and training are the default values selected by scGen v. 2.0.0. Table 3.1 outlines the model hyperparameters used in deploying scVIDR and scGen: 77 Hyperparameter Value Latent dimension Number of layers Layer width Dropout rate 100 2 800 0.2 Kullback-Leibler weight 5 βˆ— 10βˆ’5 Table 3.1 Hyperparameters for scVIDR’s and scGen’s variational autoencoder model. Table 3.2 outlines the training hyperparameters when deploying scVIDR and scGen: Hyperparameter Value Training epochs Learning rate Learning rate decay Optimizer Optimizer epsilon Early stopping 100 0.001 10βˆ’6 Adam 0.01 True Early stopping patience 25 Table 3.2 Hyperparameters for scVIDR’s and scGen’s training scheme. 3.4.4 scVIDR calculation of πœΉΜ‚ 𝒄 If I want to estimate a 𝛿𝑐 for some type of cell type 𝐡 based on 𝑧̅𝑐=𝐡,𝑝=0 and where 𝑧̅𝑐=𝐡,𝑝=1 is unknown, I can approximate a function based on 𝑧̅𝑐=𝐡,𝑝=0, or: 𝛿̂ 𝑐= 𝐡 = 𝑓(𝑧̅𝑐=𝐡,𝑝=0) 78 Where I approximate the above function using all other existing cell types in the dataset as input to ordinary least squares regression as implemented by the LinearRegression function in the sklearn.linear_model package101. 79 CHAPTER 4 PREDICTION OF MULTIPLE DOSE CHEMICAL PERTURBATIONS ACROSS CELL STATES USING VARIATIONAL AUTO-ENCODERS 4.1 Introduction In this chapter, I will examine the ability of models such as scVIDR to not only predict single dose chemical perturbations, but to interpolate response across unknown doses in the latent space. Here, as in the previous chapter, I have some unknown cell type 𝐢 whose response to a chemical is unknown. However, unlike the previous chapter, I am not only looking at a fixed dose of a particular chemical, but rather the spectrum of response that results from treatments at different levels of concentration. I refer to the cellular response to a specific concentration of a given chemical as the dose-response curve for that particular cell type. I will refer to the prediction of the response to multiple doses of a chemical as dose-response prediction. More formally, given that I know the dose-response for some cell types 𝐴 and 𝐡, can I predict the dose-response of 𝐢? I perform this task by extending the scVIDR model from chapter 3 to account for multiple doses. To do this I take the latent 𝛿𝑐 and interpolate on it log-linearly to predict doses between the lowest and highest measured in the dataset (Figure 4.1). Figure 4.1 Schematic of perturbation prediction across multiple doses in scVIDR. Outline of scVIDR for prediction of the unknown response to multiple doses for some unknown cell type. Log-linear interpolation on 𝛿 is used to predict dose dependent changes in gene expression in the latent space. The latent space representations are then projected back into gene expression space using the decoder. 80 To my knowledge there is only one other model in the literature that performs this particular task, the compositional perturbation autoencoder (CPA)41. CPA creates multiple autoencoders that the user can add together in a modular way to remove variation for multiple covariates (e.g., dose, time-point, cell type, etc.). As a result, the user can train a model on multiple covariates at the same time to create latent representations of each one. However, CPA requires a large amount of training data in order to create accurate latent spaces for each covariate. Our model, by comparison, requires much less data to make accurate predictions. In addition, CPA is based on a regular autoencoder framework, which unlike scVIDR’s VAE, encodes sparse representations of the latent space (i.e., the space between data points does not encode much information)102. This makes them less amenable to interpolation. VAEs by comparison learn using probability distributions, and better describe continuous trajectories in the data28. I evaluate the abilities of this model on two datasets. The first dataset is an extension of the TCDD mouse liver dataset described before, where instead of a single fixed dose of 30 πœ‡g/kg TCDD, there are seven additional doses ranging from 10 ng/kg to 10 πœ‡g/kg TCDD (Nault et al) 86. I show that simply by interpolating on 𝛿𝑐 I can predict the unknown dose-responses for a wide range of chemicals. The second dataset is the Sciplex dataset (Srivatsan et al)95, which is comprised of three cancer cell lines (A549, MCF7, and K562) treated with four doses of 188 drugs. I show that I can use scVIDR to predict the dose-response of thirty-seven different chemicals. In addition to the prediction of the response to multiple doses, I would also like to understand what genes are important in predicting the response for a particular drug. This would help validate the model when performing predictions. To do this I take advantage of the structure of the VAE to identify genes with the highest contribution to the dose-response according to scVIDR. Finally, when calculating the dose-response in single cell data, I encounter heterogeneity in individual cell responses to the same dose of a particular chemical, either due to the inherent stochasticity in transcription103, the chemical environment of the cell, or the internal state of the cell8,9. As a result, I would like a measure of how much a particular cell has responded transcriptionally to a chemical. I call the metric I have developed β€œpseudo-dose”. I show that pseudo-dose accurately describes the variation in the response to TCDD exhibited by hepatocytes across the liver lobule. 81 4.2 Results 4.2.1 scVIDR accurately predicts the transcriptomic response for multiple doses across cell types. In this section I predict the response of liver cells to multiple doses of TCDD. Here, 𝑑 is equal to the magnitude of the perturbation, which in my case is equivalent to the dose. Thus, 𝑑 = 0 represents expression at dose 0 and 𝑑 = 30 represents expression at dose 30, where the dose is in units of πœ‡g/kg for the Nault et al dataset86. As with the single-dose case, I train the model on the dose-response data for all cell types except one, for which only the 𝑑 = 0 condition is kept. I calculate the 𝛿̂ 𝑐 (equivalent to 𝛿𝑠𝑐𝑉𝐼𝐷𝑅 in section 3.2.2; see section 3.4.4) which is the estimated difference of means between the highest dose and the untreated groups. For scVIDR, intermediate doses are then calculated on the latent space by interpolating log-linearly on the 𝛿̂ For scGen30, I log-linearly interpolate on 𝛿𝑠𝑐𝐺𝑒𝑛. Finally, those latent space representations are decoded back into gene expression space using the decoder portion of each of the models. 𝑐. I analyzed a mouse liver snRNA-seq from the Nault et al dataset86 that included 8 doses (p = [0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10, 30]) of TCDD and a control (p = 0) in πœ‡g/kg (Figure 4.2 A). scVIDR outperforms scGen in approximating expression across the dose-response of TCDD in mouse liver. I used the mean 𝑅2 score across all evaluated genes as my performance metric (Figure 4.2B). scVIDR significantly out-performed scGen at predicting HVGs and DEGs for doses > 0.3 πœ‡g/kg (Mann-Whitney One-Sided U test p < 0.001). scVIDR predicts the important TCDD receptor repressor gene, Ahrr, at doses 1, 3, and 10 πœ‡g/kg in portal hepatocytes better than scGen (Figure 4.2C). When predicting all other cell types (cholangiocytes, endothelial cells, stellate cells, central hepatocytes, portal hepatocytes, and portal fibroblasts) scVIDR significantly outperformed scGen only at the highest doses of 10 and 30 πœ‡g/kg on prediction of all HVGs (Figure 4.2D). When predicting on just the DEGs, scVIDR significantly outperformed scGen for doses > 0.3 πœ‡g/kg (Figure 4.2E). 82 Figure 4.2 Prediction of in vivo singe cell expression of the TCDD dose-response in portal hepatocytes from mouse liver tissue. A) UMAP projection for the latent space representation of single cell expression across TCDD dose-response. Cells are colored by dose (πœ‡g/kg), cell type, and training split. Arrows on UMAP represent a 𝛿 calculated on UMAP space, with each 83 Figure 4.2 (cont’d) arrowhead representing a specific dose denoted by its color. B) Dose-response prediction for Ahrr using scVIDR, and scGen. The differences between the predicted and true distribution of Ahrr at each dose are measured via the Sinkhorn distance. C) Bar plots of the 𝑅2 of the gene expression means in portal hepatocytes for all highly variable genes and the top 100 differentially expressed genes. D) Box plot of the distribution of 𝑅2 scores across all cell types in liver tissue. I used scVIDR to predict the effects of a test set of 37 drugs out of 188 treatments in the sci-Plex dose-response data95 at 24 hours for A549 cells (Figure 4.3). scVIDR was trained on all data (all drugs and doses) in K562 and MCF7 cells. The model was also trained on the remaining 151 drugs in A549 cells not used in validation, as well as the vehicle data for the 37 drugs in the test set (Figure 4.3A). The dose-response for the 37 drugs was predicted as above by first calculating the 𝛿̂ 𝐴549 between the control and highest dose for a particular drug and log linearly interpolating along the 𝛿̂ 𝐴549 in order to predict the intermediate doses. I evaluated predictions made by scVIDR at the gene, drug, and drug pathway level. For the drug Belinostat, a histone deacetylase inhibitor, scVIDR improves on predictions of differentially expressed genes such as MALAT1 relative to scGen (Figure 4.3B). When predicting gene expression of the DEGs in Belinostat treated A549 cells, scVIDR also significantly outperformed scGen on all doses (Figure 4.3C). On predicting the DEGs of all drugs with the same mode of action as Belinostat (Epigenetics), scVIDR similarly outperformed scGen on all doses (Figure 4.3D). Finally, when looking across all 37 drugs in the test dataset, I was able to predict the expression of DEGs significantly better than scGen on average for the 3 highest doses of 100, 1,000, 10,000 nM (Figure 4.3E). 84 Figure 4.3 Prediction of in vitro dose-response of A549 cells to different drug treatments. A) UMAP of the latent space of single-cell expression colored by cell type and dose (nM) 85 Figure 4.3 (cont’d) respectively. B) Prediction of the dose-response of MALAT1 in response to Belinostat treatment of A549 cells. The differences between the predicted and true distribution and of MALAT1 at each dose are measured via the Sinkhorn distance. C) Bar plot of prediction performance of the dose-response of Belinostat administered to A549 cells on the top 100 differentially expressed genes D) Boxplot of prediction performance of the top 100 differentially expressed genes for the A549 dose-response in all test dataset epigenetic pathway drugs. E) Boxplot of prediction performance of the top 100 differentially expressed for the A549 dose-response in all 37 test dataset drugs. 4.2.2 More sophisticated models of dose-response interpolation do not improve cross cell type prediction when compared to log-linear interpolation. The dose-response is described by a sigmoid relationship between the concentration of the chemical of interest and the measured physiological response. Often when describing this relationship, pharmacologists and toxicologists use the Hill equation to model this relationship. However, as the Hill model has parameters that are usually unknown to us, I must fit the model to a dataset. As a result of the lack of information on parameters (e.g., EC50) for specific drugs in specific cell types, how well the model matches the true biology is difficult to validate. The same though process can be extended to a threshold model of response, where the sigmoid relationship is described by a log-linear response that starts at a specified dose rather than 0. More concretely, I can evaluate each of the interpolation functions based on whether they can predict intermediate doses more or less accurately across several drugs. I performed the following experiment, where I took five random drugs from Srivatsan et al dataset (TGX-221, Crizotinib (PF-02341066), Tranylcypromine (2-PCPA) HCl, XAV-939, and Decitabine) and predicted their dose-response in A549 cells using three different interpolation methods: Hill, threshold, and log-linear (explanation of models in section 4.4.3). I show in figure 4.4 that there is no statistical difference between the log-linear and either the threshold or the Hill model of interpolation. This suggests that calculation of 𝛿 has a much higher impact on the prediction accuracy than how I interpolate on said 𝛿. 86 Figure 4.4 Comparison of dose-response predictions for five drugs in A549 cells. Boxplot of the distribution of predictions made for five drugs (TGX-221, Crizotinib (PF-02341066), Tranylcypromine (2-PCPA) HCl, XAV-939, and Decitabine) in A549 cells at four different doses (nM). Log-linear model predictions of the dose-response are in blue, Hill models of the dose- response are in orange, and threshold models of the dose-response are in green. 4.2.3 Regression on the latent space infers the relationship between predicted gene expression and πœΉΜ‚ 𝒄 Insight into model decisions can provide information regarding proper model usage and pitfalls. It would be useful to identify which genes and pathways are associated with scVIDR’s prediction; however, standard VAEs do not have a linear map from the latent space to the gene expression and thus are hard to interpret. To interpret scVIDR’s predictions, I approximate the function of the decoder with linear regression (see section 4.4.4 for more extensive explanation). I take inspiration from the use of principal component analysis (PCA) in scSeq104 and the development of the linear decoded variational autoencoders (LDVAE)38. PCA is a linear transformation that projects the data onto a lower dimensional (latent) space while retaining as 87 much variance as possible. This transformation is represented by a linear weight matrix, π‘Šπ‘π‘π‘Ž, with dimensions π‘š π‘₯ 𝑔 where π‘š is the number of latent variables, and 𝑔 is the number of genes. I can understand each principal component as a linear combination of genes. This allows us to assess the relationship between genes and a direction in latent space. In a VAE, the mapping from the latent space to the gene space is done by the decoder which, unlike the inverse of PCA, is non-linear. In LDVAEs, however, the decoder portion of the VAE is a linear regression layer and thus the weight matrix of this layer, π‘Šπ‘™π‘‘π‘£π‘Žπ‘’, describes a linear relationship between direction in the latent space and gene prediction38. However, interpretability comes at the expense of model accuracy. LDVAEs have higher reconstruction error than standard VAEs on single cell data38. Similarly, using PCA and vector arithmetic to predict scSeq perturbations performed poorly compared to scGen30. As a result, one would like to try and interpret the latent space of a standard VAE. I present an approach to interpret the VAE’s latent space using sparse regression. I take an alternative approach to LDVAEs in which I instead approximate the non-linear function of the decoder in a standard VAE using sparse linear regression (Figure 4.5A). Sparse regression methods like LIME have been used to interpret complex models105. I specifically use sparse linear ridge regression, given that each gene has a non-zero contribution to each latent variable and gene weights are distributed parsimoniously. This gives us a linear transformation matrix, π‘ŠΜ‚π‘£π‘Žπ‘’, that approximates the function of the decoder. 88 Figure 4.5 Interrogation of VAE using ridge regression in portal hepatocyte prediction. A) Schematic of calculating latent dimension weights using ridge regression. B) Bar plot of top 20 genes with the highest scVIDR genes scores. C) Enrichr analysis of the top 100 genes with respect to the scVIDR gene scores. Bar plot of adjusted p-values from statistically significant (adjusted p-value < 0.05) enriched pathways from the WikiPathways 2019 Mouse Database. D) PCA projection of single cell expression data colored by log dose and fatty acid oxidation 89 Figure 4.5 (cont’d) pathway score. E) Logistic fit of median pathway score for each dose value. MAE - mean absolute error. The span of 𝛿̂ I use this weight matrix to interrogate the relationship between predicted gene expression and 𝛿̂ 𝑐 is simply a direction in scVIDR’s latent space. The importance of 𝛿̂ gene’s predicted expression is the sum of the latent dimensional components of 𝛿̂ the gene’s corresponding latent dimensional weight from π‘ŠΜ‚π‘£π‘Žπ‘’. In matrix form: 𝑐 to each 𝑐 multiplied by 𝑐. 𝐺𝑒𝑛𝑒 π‘†π‘π‘œπ‘Ÿπ‘’π‘  = 𝛿̂ π‘‡π‘ŠΜ‚π‘£π‘Žπ‘’ 𝑐 In practice, I found that normalizing the weight matrix by its L2 norm gives better insights when interpreting the model (see section 4.4.4). Gene scores represent how significant changes in latent space dimensions will impact the decoded transcriptomic response when I interpolate on the span of 𝛿̂ 𝑐 on the latent space. Thus, genes with higher scores will be predicted to have bigger changes when I increase the dose of my prediction by scVIDR. I utilize a trained scVIDR model where portal hepatocytes were left out of training and the 𝑐=π‘ƒπ‘œπ‘Ÿπ‘‘π‘Žπ‘™ π»π‘’π‘π‘Žπ‘‘π‘œπ‘π‘¦π‘‘π‘’π‘  was approximated (Figure 4.5 B-D). Gene scores for 𝛿̂ 𝛿̂ 𝑐=π‘ƒπ‘œπ‘Ÿπ‘‘π‘Žπ‘™ π»π‘’π‘π‘Žπ‘‘π‘œπ‘π‘¦π‘‘π‘’π‘  were calculated as described above. The genes with the top 20 highest magnitude genes scores included well established markers of TCDD-induced hepatotoxicity such as genes from the cytochrome P450 family (Figure 4.5B)106. To see whether this relationship extended to pathways involved in TCDD-induced genetic response, I performed Enrichr analysis83,92 using the 2019 WikiPathways database107 on genes with the top 100 gene scores (Figure 4.5C). Among the top enriched terms, I found the hallmarks of hepatic response to TCDD in mice, such as oxidation by cytochrome P450108, fatty acid omega oxidation109, and tryptophan metabolism110. To derive the relationship between the actual doses and the gene pathways, the genes with the top 100 gene scores that were in β€œFatty Acid Oxidation” from WikiPathways was used in calculating enrichment scores for each cell using Scanpy87. A sigmoid function was fit to the median enrichment score in each dose (section 4.4.6). I observed a small mean absolute error in my model and thus concluded that there was a sigmoidal dose-response relationship for the gene set generated by Enrichr (Figure 4.5 D, E). 90 4.2.4 Pseudo-dose captures zonation in TCDD hepatocyte response. In single cell analysis of developmental trajectories, it is useful to order cells with respect to a latent time course, termed β€œpseudo-time”. This is because cells develop at different rates due to natural variations among themselves and their environment. This ordering is usually done using algorithms such as Slingshot111 and Monocle112. In pharmacology and toxicology, I experience a similar problem as cells of the same type have variable sensitivities to the same toxicant. Hence, I propose to order cells in terms of a latent dose. I call this ordering of cells a β€œpseudo-dose”. 91 Figure 4.6 Interrogation of VAE using ridge regression in portal hepatocyte prediction. A) Schematic of calculating latent dimension weights using ridge regression. B) Bar plot of top 20 genes with the highest scVIDR genes scores. C) Enrichr analysis of the top 100 genes with 92 Figure 4.6 (cont’d) respect to the scVIDR gene scores. Bar plot of adjusted p-values from statistically significant (adjusted p-value < 0.05) enriched pathways from the WikiPathways 2019 Mouse Database. D) PCA projection of single cell expression data colored by log dose and fatty acid oxidation pathway score. E) Logistic fit of median pathway score for each dose value. MAE - mean absolute error. Working off the assumption that 𝛿𝑐 is the axis of perturbation in latent space, I orthogonally project the latent representation of each cell to the π‘ π‘π‘Žπ‘›(𝛿𝑐) to obtain a scalar coefficient for each cell along 𝛿𝑐 (Figure 4.6 A). I use this scalar coefficient as the pseudo-dose value for each cell. To test whether these pseudo-dose values capture the latent response across cell types, I distinguished between the portal and central regions of the liver lobule. Zonation of the lobule not only defines differences in hepatocyte gene expression along the portal to central axis, but also their metabolic characteristics53. Thus, I expect that the two zones will exhibit different sensitivities to TCDD. The pseudo-dose correlated well with the actual dose administered to the hepatocytes with an 𝑅2 = 0.76 (Figure 4.6C). I also found that pseudo-dose displayed a sigmoidal relationship (Section 4.4.6) between the expression of differentially expressed genes such as Fmo3 (Figure 4.6D). Finally, I found the pseudo-dose to be statistically higher on average in the central hepatocytes versus portal hepatocytes (Figure 4.6E). This is consistent with liver biology as central hepatocytes respond more strongly to treatment due to TCDD sequestration59, and higher AhR expression levels in the centrilobular zone113. 4.3 Discussion When profiling any chemical, the dose often makes the poison. Development of dose-response prediction models is non-trivial and significant as low-dose changes to expression do not necessarily predict high dose changes. This is important as the effects seen at therapeutic doses for drugs do not necessarily extrapolate to overdose situations. To illustrate this, in figure 4.7 I show that changes in the first half of the TCDD dose-response of mouse liver portal-hepatocytes (between 0 and 0.3 πœ‡g/kg) doesn’t correlate to changes in the second half of the dose-response (between 0.3 and 30 πœ‡g/kg). This is reflective of the non-linear nature of the chemical responses and belies the needs for methods that extrapolate dose-response utilizing other sources of data. I 93 show that scVIDR can handle predicting multiple doses from the chemical dose-responses of other cell types. I first modeled the dose response of thirty-eight chemicals across two datasets. I then show how I can interpret the dose-response predictions in scVIDR. Finally, I utilize the dimensionality reduction capabilities of scVIDR to order cells based on their individual cellular response using scVIDR. Figure 4.7 Regression plot of the magnitude of TCDD induced log fold-changes (LFC) of genes at low and high doses in mouse portal hepatocytes. All LFCs represent absolute values. Each point represents a gene. Low Mean LFC is the difference between the log(x+1) expression at dose 0 and 0.3 πœ‡g/kg. High Mean LFC is the difference between the log(x+1) expression at dose 0.3 and 30 πœ‡g/kg. The shaded region around the regression line represents the 95% confidence interval. 94 Examining the limitations of the model with respect to the Nault et al dataset86, I can see that many of the same principles discussed in section 3.3 are relevant. Datasets with fewer control cells, and the less sensitive the cells, the worse the overall performance at higher doses (Figure 4.8). The notable exception is with stellate cells, where the highest dose is predicted worse in scVIDR than in scGen. The reason why is difficult to pin down completely. I hypothesize that this may be a result of the stellate cells starting to undergo an epithelial to mesenchymal transition to myofibroblasts85. This may be a result of the current formulation of the model not accounting for discrete transitions. In the future, it may be useful to use more sophisticated models to predict physiological transitions in cellular state like the epithelial to mesenchymal transition. 95 Figure 4.8 Impact of latent perturbation magnitude, and control population size on dose- response prediction performance. A) Sinkhorn distance between the latent distributions of the 96 Figure 4.8 (cont’d) control and 30 πœ‡g/kg doses of TCDD of each cell type on the latent space in the Nault et al study86. C) Bar plot of the control group cell population size for each cell type. D) Bar plot of mean gene 𝑅2 for each individual cell type when predicting all doses of TCDD in πœ‡g/kg. In the sci-Plex dataset, prediction of certain drugs with epigenetic mode of actions produced the poorest prediction scores (Figure 4.9). This is because scSeq data provides no information regarding epigenetic modifications (e.g., chromatin accessibility, histone marks, and DNA binding proteins). Integration with epigenetic scSeq data such as single cell ATAC-seq could help to predict such responses with higher accuracy. 97 Figure 4.9 Overall drug pathway performances at the highest administered dose in Srivatsan et al. A boxplot of the mean gene 𝑅2 Performance by scVIDR across all drug pathways in the test dataset at a dose of 10 πœ‡M. While scVIDR and its pseudo-dose metric work on standard dose-response scenarios, it is likely inappropriate for use with more complex trajectories such as those found in cellular development and circadian rhythms114. Such trajectories include branching and cycling which involve 98 complex non-linear dynamics and require more sophisticated models to properly capture their topology. One could combine using scVIDR for dimensionality reduction and performing trajectory analysis using popular algorithms such as Monocle112, slingshot111, and diffusion pseudotime56 to account for such trajectories. Taken together, my tool facilitates dose-response predictions for a particular drug in a specific cell type using the response of other cell types. Dose-response modeling is important in the realm of drug development and toxicity testing as the physiological response of chemical perturbation is dose-dependent. I envision the use of scVIDR in optimizing dose-response studies during drug discovery and development. scVIDR enables prediction of chemical response in a wide array of cell types and doses using only the control and highest doses of previous experiments. As more data becomes available on single cell chemical perturbations, generative modeling can yield insights into the underlying manifold of gene expression and how different classes of chemicals act on that manifold. Discovery of the properties of the manifold will allow for generalizations to be made about the physiology of tissues and understudied chemical perturbations. 4.4 Methods 4.4.1 Implementation of scVIDR Implementation of scVIDR is identical to the implementation describe in sections 3.4.3 and 3.4.4. 4.4.2 Single cell expression datasets and preprocessing The sci-Plex dataset95 and TCDD dose-response dataset86 were collected and processed uniformly from raw count expression matrices. The cell expression vectors are normalized to the median total expression counts for each cell. The cell counts are then log transformed with a pseudo-count of 1. Finally, I select the top 5000 most highly variable genes to do my analysis on. The preprocessing was carried out using the scanpy. pp package using the normalize_total, log1p, and highly_variable functions87. The TCDD dose-response dataset comprised of single nuclei RNA-seq of C57BL6 of flash frozen mouse livers. Mice in this dataset were administered, sub-chronically, at a specified dose of TCDD via oral gavage every 4 days for 28 days. In my analysis, all immune cell types were left out, as immune cells are known to migrate from the lymph to the liver during TCDD administration79. Thus, there is a small size for the immune cell populations in the low-dose datasets versus the higher doses. The sci-Plex dataset is obtained from Srivatsan et al95. 99 4.4.3 Interpolating on 𝜹 for dose-response prediction. To predict the latent representation for a response at some dose, 𝑑, I interpolate log-linearly on 𝛿̂ 𝑐= 𝐡 such that for each latent cell in my prediction, 𝑧𝑖,𝑐,𝑝=𝑑: 𝑧̂𝑖,𝑐,𝑝=𝑑 = 𝑧𝑖,𝑐,𝑝=0 + 𝑝𝛿 Where 𝑝 is the coefficient, I want to calculate in order to calculate 𝑧̂𝑖,𝑐,𝑝=𝑑. In log linear interpolation: 𝑝 = π‘™π‘œπ‘”(𝑑 + 1) log (mπ‘Žπ‘₯(𝑑) + 1)) Where max(d) is the highest measured dose in the dataset. In the threshold model: And in the Hill Model: 𝑝 = { log(𝑑) log(π‘šπ‘Žπ‘₯(𝑑)) 0 𝑑 > π‘Ž 𝑑 ≀ π‘Ž 𝑝 = 𝐿 1 + π‘’βˆ’π‘˜(π‘‘βˆ’π‘‘0) where π‘Ž, 𝐿, π‘˜, 𝑑0 are free parameters that I fit to the data. These free parameters were fit using Levenberg-Marquardt algorithm implementation in the curve_fit function in the scipy.optimize package. 4.4.4 Inferring feature level contributions to perturbation prediction In PCA, I perform an orthogonal linear transformation on the data, such that my projected data preserves as much variance as possible. It is known that the solution to this maximization problem is to project the data onto the eigenvectors of the covariance matrix or: π‘π‘š = π‘‹π‘Šπ‘š Where 𝑋 is the mean-centered scRNA-seq expression matrix, π‘Šπ‘š is the eigenvectors corresponding to the π‘š highest eigenvalues of the covariance matrix of 𝑋, and π‘π‘š represents the π‘š-dimensional projection of the data onto its principal components. I can see from this formula that π‘π‘š is calculated as a linear combination of weights and gene expression, and thus there is a linear relationship between genes and the principal components. I can exploit this fact and calculate a loading for each gene with each corresponding eigenvector by taking the product of the eigenvector and the square root of the corresponding eigenvalue or: π‘™π‘œπ‘Žπ‘‘π‘–π‘›π‘”π‘–π‘— = π‘Šπ‘–π‘— βˆ— βˆšπœ†π‘– 100 Where π‘Šπ‘–π‘— is the π‘—π‘‘β„Ž value (corresponding to gene j) of the π‘–π‘‘β„Ž eigenvector and πœ†π‘– is the eigenvalue for the π‘–π‘‘β„Ž eigenvector. These loadings represent a normalized score of the relationship between a gene’s expression and a particular principal component. These loadings are also directly proportional to the actual correlation between the gene’s expression and the principal component of interest. It can be shown that PCA and autoencoders with a single hidden layer (with size less than the observations) and a strictly linear map are nearly equivalent115. I can project principal components back into expression space using the following function: Additionally, I note that PCA is a solution to the minimization of the reconstruction error: 𝑋̂ = π‘π‘šπ‘Šπ‘š 𝑇 𝑇 = π‘‹π‘Šπ‘šπ‘Šπ‘š ‖𝑋 βˆ’ 𝑋̂‖ 2 2 I find similarly that the loss function that I try to optimize in the autoencoder I described above is: ‖𝑋 βˆ’ π‘‹π‘Š1π‘Š2 2 𝑇‖2 Where π‘Š1 is the weights of the hidden layer, and π‘Š2 is the weights of the final layer of the autoencoder. In effect, I can see that the autoencoder described above can approximate the loadings of a principal component analysis using π‘Š2. The reconstruction error for a standard VAE with the assumption that the observations are a multivariate Gaussian is: 1 𝑁 ‖𝑋 βˆ’ 𝐷𝑒𝑐(𝑍)β€–2 Where 𝑁 is the number of samples, 𝐷𝑒𝑐(𝑍) is the function of the decoder neural network, and 𝑍 is the transformation by the encoder of the observations onto the latent space. In an LDVAE, the 𝐷𝑒𝑐(𝑍) is replaced with single layer with linear transfer operators such that the reconstruction error is the following: 1 𝑁 ‖𝑋 βˆ’ π‘π‘Šπ·π‘’π‘ 𝑇 β€–2 In which π‘Šπ·π‘’π‘ is the linear weights of the decoder. These weights give us an approximation of the contributions of individual genes to the dimensions of the latent space. I can interpret π‘Šπ·π‘’π‘ as a loadings matrix by which I can interpret the latent dimensions of the LDVAE. 101 To approximate feature contributions to predicting the perturbation in scVIDR, I train a ridge regression model. I then take the decoder portion of my model and sample 100,000 points from the latent space and generate their corresponding expression vectors. This will be my training dataset for a ridge regression. I then train the ridge regression using the Ridge class from the sklearn.linear_model package. I can describe the loss of my ridge regression as: ‖𝐷𝑒𝑐(𝑍) βˆ’ π‘π‘Šπ‘‡β€–2 + πœ†β€–π‘Šβ€–2 Where 𝑍 are the sampled points from the latent space, π‘π‘Šπ‘‡ are the approximation of the predicted gene expression vectors, and π‘Š, is a m x n matrix where m Is the number of genes and n is the number of latent dimensions. I divide π‘Š using the ||π‘Š|| to normalize for the effect of 2 over expressed genes. I then calculate the gene scores by taking the dot product of normalized π‘Š and 𝛿𝑐, or: 𝑔𝑒𝑛𝑒 π‘ π‘π‘œπ‘Ÿπ‘’π‘  = π‘Š ||π‘Š|| 2 β‹… 𝛿𝑐 I use these gene scores to order genes for Enrichr83 pathway analysis with the gseapy package92. Scores for each pathway were calculated using score_genes function from scanpy.tl package with the genes sets derived from the Enrichr results. 4.4.5 Calculating the pseudo-dose values. I can order each cell, π‘₯𝑖, with respect to the variable response of π‘₯𝑖 to the chemical by taking the latent representation, 𝑧𝑖, and orthogonally projecting it onto 𝐿 = π‘ π‘π‘Žπ‘›(𝛿𝑐): π‘π‘Ÿπ‘œπ‘—πΏ = 𝛿𝑐 β‹… 𝑧𝑖 𝛿𝑐 β‹… 𝛿𝑐 𝛿𝑐 = 𝑝𝛿𝑐 The scalar multiple of 𝛿, 𝑝, is the pseudo-dose value for π‘₯𝑖. 4.4.6 Regression of sigmoid function for evaluating dose-response relationships To establish whether a standard dose-response relationship existed between the top pathways inferred by Enrichr and between the pseudo-dose and gene expression, a logistic function of the form: 𝑓(𝑑) = 𝐿 1 + π‘’βˆ’π‘˜(π‘‘βˆ’π‘‘0) + 𝑏 Where d is the dose or pseudo-dose. The parameters of the function above were fit to the output variables (median enrichment score and Fmo3 normalized expression) using Levenberg- Marquardt algorithm implementation in the curve_fit function in the scipy.optimize package. The 102 regression was evaluated using the mean absolute error metric implementation in the mean_absolute_error function in the sklearn.metrics package. 103 CHAPTER 5 CONCLUSIONS AND FUTURE DIRECTIONS 5.1 Generative modeling for toxicological single cell RNA-seq scRNA-seq has proliferated across many different disciplines of biology from development116–118 to pharmacology93,95. In toxicology, single cell technologies have the potential to reveal the heterogeneous response previously masked by bulk tissue techniques like RNA-seq17. Even within cells of the same type there are exhibited differences in response depending on the transcriptional state of the cell8,9. While technologies like scRNA-seq have the potential to describe the entire spectrum of chemical response, due to the sheer volume of information and high-dimensionality, computational approaches must be deployed to uncover the biological complexity of tissue response. Broadly, to deal with the high dimensionality of single cell data, bioinformaticists will utilize dimensionality reduction algorithms. Dimensionality reduction algorithms like PCA are applied routinely to do visualization and preprocessing for downstream analysis36,37,42. Over the last five years, VAEs (a deep generative model) have gained popularity as a dimensionality reduction algorithm in scRNA-seq analysis. This is due to the model’s remarkable ability to perform data integration29,34,48, and its ability to generate unseen scRNA-seq from its latent space30,33. In this dissertation I utilize variational autoencoders to perform batch correction for administration of TCDD to mice. First, I inferred the zonation of individual hepatocytes based on their gene expression. Then classified genes based on whether they are responsive to TCDD, exhibit a rhythmic expression profile, and/or exhibit a zonated expression profile. Due to the highly ordered nature of gene expression in the liver lobule, in previous studies6,52,76 hepatocyte zonation is inferred from gene expression in control conditions. Unfortunately, TCDD significantly changes the expression of key zonation genes62. To accurately infer zonation from the liver, I performed batch correction in order to remove variance in expression caused by TCDD treatment. Using this new batch corrected dataset, I inferred zonation reducing the overall impact TCDD has on the inference. With the zonation properly inferred, I categorized hepatocyte genes based on their inferred zonation, time-series expression over twenty-four hours, and their response to TCDD treatment. With this gene classification, I revealed that most of the canonical TCDD response pathways exhibit both rhythmic and zonal properties and that these properties are independent of one another. This includes established spatial-temporal pathways in the liver 104 lobule including lipid and bile acid metabolism, and CYP450 metabolism pathways52. Additionally, I find that TCDD induced rhythmic changes are more associated with changes to the amplitude of the oscillations rather than changes in the phase. This follows with more than a third of TCDD affected genes completely losing rhythmicity. Mechanistically, I found that TCDD impacted the of the core circadian clock in the alternative Arntl binding partner to Clock, Npas2, and genes downstream of CLOCK-ARNTL binding such as Per1/2 and Nr1d2. Finally, I found that genes that had lost zonation were mostly associated with the central region of the hepatic lobule. The generative properties of the VAE were utilized to predict chemical responses on unseen cell types in a model I developed called scVIDR. To do this, I built on the existing scGen framework and expanded it to predict across several doses. The main advantage of scVIDR over scGen is post-training regression on the latent space, which weights predictions of the response based on the cell’s position on the latent space. This regression-based correction significantly improves prediction of chemical perturbations across several data sets such TCDD liver response, in-vitro cancer-responses to many different drug candidates, and PBMC response to IFN-𝛽. This not only worked on unseen cell types, but I also utilized scVIDR to predict phagocyte response to LPS6 on unseen species. To predict multiple doses, I log-linearly interpolate on the latent space and predict the dose-response of TCDD in the liver and thirty-six cancer therapy drugs in A549 cells. The main advantage of such modeling is its ability to take existing single-cell chemical response datasets and utilize them to predict unseen chemical responses in novel cell types of interest. Interpretation of the VAE latent space is difficult. In PCA, the lower dimensional principal component space can be interpreted using the weights of the linear functions compressing the data. These linear weights are like correlations between gene expression and the principal components. VAE’s use non-linear neural networks to do their compression and thus the latent space of VAE’s can’t be easily interpreted. To interpret the latent space of the VAE, scVIDR approximates the decoder function using linear regression. The weights of the linear regression are used like the weights in PCA, which acts as an approximation of the relationship between the latent space and the dimensions of the of scRNA-seq measurement space. This allows the users to infer how changes in the position of latent space β€œcorrelate” to changes in gene expression measurements in scRNA-seq. Using this method, scVIDR predicts transcriptomic response of mouse hepatocytes to TCDD is most pronounced in canonical response genes like Cyp1a1. 105 Additionally, when performing gene pathway analysis on the most β€œcorrelated” genes, I identify canonical TCDD response pathways including CYP450 metabolism and fatty acid oxidation. Finally, responses to TCDD varies across individual cells, even those within the same cell type. I utilize scVIDR to measure such variation in the magnitude of response. To do this, I orthogonally project cells onto the 𝛿𝑐, which is the mean difference between the control and treated cells on the latent space. The coefficient of the projection is the measure of the magnitude of response or the β€œpseudo-dose”. With this pseudo-dose metric, I am able to demonstrate the difference in sensitivity between portal and central hepatocytes to TCDD. 5.2 Future directions in research the spatial-temporal organization of the liver lobule and associated pathologies The liver is responsible for metabolism of nutrients and toxicants, production and recycling of proteins, and glucose and lipid homeostasis. Many of these tasks are zonated and rhythmic in the liver. Changes in the rhythmicity and zonation have been associated with NAFLD and drug induced liver injuries (DILI) such as those caused by acetaminophen. In this dissertation, I show that scRNA-seq can be used to comprehensively analyze responses of zonal and rhythmic processes to chemical perturbation of TCDD. I will first review future directions for studying TCDD’s impact on zonation and rhythmicity. Then I will discuss alternative liver pathologies for which the methods of chapter 2 can be applied. 5.2.1 Future directions in studying TCDD’s impact on rhythmicity and zonation and applications in other pathologies. As described before, sub-chronic administration of TCDD leads to the ablation of rhythmicity in the mouse liver66. In acute administration, I see less severe effects on rhythmicity. Most effects on rhythmicity are seen in standard TCDD response pathways such as CYP450 metabolism. I see changes in the normal expression of the core circadian clock genes, agreeing with previous studies66. The most pronounced effect was increased expression of Per2 across most of the acute response. Fader et al66 has shown that CLOCK, ARNTL, and AhR colocalize on the Per2 promoter, suggesting that TCDD induced AhR disruption of normal CLOCK- ARNTL binding may be the cause of this change. I also see higher expression Per1 two hours post treatment followed by a decrease in expression for most time points after (with the exception of time point 18). Previous studies into the regulation of Per1 have implicated 𝛽- naphthoflavone activated AhR in interrupting the normal binding of CLOCK-ARNTL in Hepa1c1c7 cells84. However, parsing how these acute changes in circadian rhythm leads to the 106 sub-chronic abolishment of rhythmicity in most core members of the circadian rhythm will require further studies on how different numbers of TCDD exposures change overall circadian clock expression and regulation. TCDD exposure has also been shown to greatly alter zonation in sub-chronic exposures62. While impacts on zonation are predictably subtler within acute exposures, I confirm a bias towards pericentral expression seen in those same sub-chronic exposure studies62. Additionally, core regulatory pathways of zonation such as Wnt/𝛽-catenin signaling were enriched in gene classes impacted by TCDD. How TCDD interrupts expression of these pathways is still unclear. TCDD induced changes to hepatocyte expression of Rspo3, which is a rheostat of zonation, may be the primary cause62. Additionally, I show that most changes to zonation are localized to the pericentral zone corroborating changes seen in pericentral Apc62. This pericentral localization of TCDD’s effect on zonation is likely due to the high expression of AhR in the pericentral region52 as well as activated Cyp1a2 sequestration of TCDD in the pericentral zone of the liver119. What is still unknown is how non-parenchymal response to TCDD is implicated in changes in zonation. Previous studies suggest that the rhythmic Wnt ligand expression of pericentral non- parenchymal cells is significant in zonation homeostasis52. Analysis on the potential impact TCDD has on this non-parenchymal signaling would potentially elucidate TCDD’s impact on pericentral expression of zonation regulators. One of the main limitations of the study regards inferring zonation from gene expression. Due to the nature of CYP450 expression being centrally enriched in liver lobules, TCDD biases gene expression towards more pericentral expression overall62. To get absolute localization of expression, spatial transcriptomic methodologies such as Xenium120 or MERFISH121 will be needed. Additionally, studies like those in chapter 2 are limited to cell types in the liver for which the zonation of expression is well profiled (i.e., hepatocytes6,52 and endothelial cells76,122). Spatial experiments could better pinpoint the zonation patterns of genes across many cells in the non- parenchymal compartment of the liver such as cholangiocytes and immune cells. For example, resident macrophages, Kupffer cells, have been found to have an asymmetric distribution across the liver lobule, with more localizing in the periportal region123. However, their zonal expression profile has not been fully studied. Recent research in Kupffer cells have shown their transcription factor activity significantly altered with respect to TCDD exposure124. Additionally, in partial hepatectomy rodent models, Kupffer cells were important secretors of Wnt ligands in the 107 midlobular and periportal zones post loss of tissue79,125. This Wnt signaling goes on to propagate a loss of zonation very similar to the pericentral bias found in Nault et al62 and in chapter 2. TCDD potentially could alter the expression of key Wnt ligand gene in Kupffer cells, much in the same way they do in partial hepatectomies125. The role of these and other non-parenchymal cells in TCDD liver zonation response, what affected hepatocytes they localize around, and what genes they differentially express proximal to liver injury can be better elucidated by spatial transcriptomic technologies. Additionally, insight generated by spatial experiments would have broader impacts in other zonated liver pathologies such as NAFLD and DILI. 5.2.2 Future Analysis of Rhythmicity and Zonation in Alternative Pathologies Like TCDD administration in mice, many liver pathologies are characterized by perturbations to the spatial-temporal organization of the liver. More specifically, NAFLD has been connected to both zonal and temporal perturbations in liver metabolism68. Zonally, NAFLD is characterized by the accumulation of triglycerides due to increased influx of fatty acids in periportal hepatocytes and decreased efflux fatty acids in pericentral hepatocytes68. This results in steatosis localizing in pericentral hepatocytes126. Temporally, NAFLD has been observed in jet-lagged models of mice127. Additionally, molecular knock-out models of core-circadian genes show upticks in NAFLD and mortality127. Dissecting the entire etiology of NAFLD, will need more comprehensive descriptions of the perturbations to both zonation and temporal genes. Studies similar to those described in chapter 2 of this thesis could be utilized on NAFLD mouse models to identify these perturbations and better describe the progression of NAFLD pathologies. Drug induced liver injuries (DILI) exhibit distinct zonal properties. In carbon tetrachloride (CCL4) models of DILI (which cause peri-central injury), injury leads to regenerative responses from adjacent layers of hepatocytes such as those found in the periportal region128. These regenerative responses were accompanied by a temporary loss of zonation which was recovered six days post DILI129. It is likely this loss of zonation is connected to the activation of proliferative phenotypes post DILI128. Identifying which genes lose their zonation in response to DILI, and where proliferative genes are being turned on along the zonation axis of the liver lobule will lead to stronger understandings of liver regeneration and DILI prevention. DILI is also dependent on circadian rhythm. For example, alanine transferase (ALT) levels were higher in mice administered with acetaminophen at ZT14 vs. ZT2130. This is correlated with peaks in acetaminophen enzyme expression early in the night. However, these effects are 108 dependent on feeding patterns as fasting induced higher ALT levels when compared to control. Despite this, CLOCK-deficient mice had resilient ALT levels regardless of acetaminophen administration time. Thus, DILI can be dependent on the circadian clock of the animal130. Identifying genes in a drug’s mode of action that are rhythmic would be helpful in determining more optimal times in which to administer said drug as to prevent DILIs. 5.3 Future Directions in Variational Autoencoders and the Prediction of Chemical Perturbations Illustrated already in this dissertation is scVIDR’s ability to improve on perturbation prediction and predict additional doses along the dose-response. Next steps include using these models to infer important dose-response metrics such as potency and maximal efficacy. Additionally, relating the trajectory of the pseudo-dose inferred by scVIDR to potential adverse outcome pathways or using it to infer new ones would be incredibly useful for toxicologists. Finally, I will discuss potential improvements to scVIDR with these end goals in mind. 5.3.1 Broader Toxicological Impacts of scVIDR One of the most prominent features a toxicologist wants to extract is the drug’s potency. Several dose-response metrics measure potency. For example, EC50 is the concentration at which the drug elicits fifty percent of its maximal response (assuming a standard dose-response)131. In section 4.2.2 a hill-model of interpolation is discussed for prediction but is ultimately discarded due to a lack of significant improvement over the more parsimonious log-linear model. However, models like this would, in theory, predict the EC50 (𝑑0 parameter from section 4.4.3). Validation of these predictions would require an external dataset of EC50s specific to cell types and thus was not included in the main text. Ensuring accurate predictions for chemical potency with models like scVIDR will rely on improving datasets as well as improving on the current model formulation. Current proposals for single cell perturbational atlases132 as well as the continuation of initiatives like TOXCAST21133 stand to improve on this issue. Ultimately, in addition to parameter discovery, toxicologists would like to use models like scVIDR to predict adverse outcome pathways. Adverse outcome pathways (AOPs) are the sequence of cellular and molecular events that lead a chemical to elicit a toxic effect134. I can relate 𝛿 or the β€œpseudo-dose” to gene pathways using existing methods such as gene set enrichment analysis (GSEA) as described in section 4.2.4. Expanding on the pseudo-dose algorithm to account for branching trajectories will better account for distinguishing between different sources of toxicity. One major limitation of scVIDR currently is that it can only predict 109 transcriptomic changes associated with a chemical. How well transcriptional changes reflect changes in protein levels and protein activity is a matter of intense debate135,136. While within sample correlations are rather robust, across samples (i.e., across tissues or conditions) correlations are much more modest due to variances in post-transcriptional and translational kinetics136. So, while increases or decrease in RNA expression likely imply increases or decreases in protein expression, the magnitude of these changes is much less correlated due to post-transcriptional, translational, and post-translational mechanisms of protein regulation. This protein regulation is also a frequent feature of AOPs137. Thus, understanding the toxic response from only the transcriptomic perspective would immediately limit the scope of discovery. Thus, integration of other data modalities such as CITE-seq and scATAC-seq is required address this limitation. 5.3.2 Improving scVIDR. The main determinant of scVIDR’s prediction accuracy is in calculation of the 𝛿. This is especially important when wanting to improve predictions of cell type specific EC50s and AOPs. While I show marked improvements in this prediction, potentially the model can be further improved with a change to the resolution of the prediction. Here, instead of cell-type specific 𝛿𝑐, I calculate a 𝛿𝑖, where 𝑖 is the index of a specific cell. This would allow for a more fine-grained prediction of cellular response. The main limitation in this approach would be how to calculate individual 𝛿𝑖 as individual cells cannot be followed across the dose-response using scRNA-seq. This is because scRNA-seq destroys the cells during sequencing. Alternative technologies such as Live-seq138 would be a way to overcome this limitation. However, no commercial platform yet exists with this type of technology for wide-spread adoption. Alternatively, a computational solution would be to map the control cells to the treatment cells on the latent space. The mapping would be used to infer where on the latent treatment distribution a control cell would lie after perturbation, thus giving us point pairs to calculate 𝛿𝑖. Having an individualized 𝛿𝑖 would allow for more complicated models, such as neural networks, to predict 𝛿𝑖 for unseen cells. For example, scANVI46 utilizes a neural network, π‘ž(𝑐𝑖|𝑧𝑖), on the latent space to classify cells into their respective cell types. This framework could be adapted to predict 𝛿𝑖 for unseen cells, by taking the existing 𝛿𝑖 on the latent space and using them as training for the neural network, π‘ž(𝛿𝑖|𝑧𝑖). However, this assumes that the shortest path on the latent space is the best representation of the underlying biological manifold. Optimal transport-based methods such as 110 TrajectoryNet139 utilize continuous normalizing flows140 to infer the underlying trajectory of cells along some time-series experiment. In theory, one could adapt this to the trajectory of the dose- response to get more accurate uncertainties in predictions. Future work will be needed to study how to improve prediction accuracy for concrete toxicological endpoints and drug safety studies. The pseudo-dose metric works for standard dose-response scenarios described in this dissertation. However, it struggles when faced with more complex trajectories. These more complex trajectories could be observed in relation to chrono-pharmacology66 or chemical disruptions to cellular development141. Additionally, toxicological responses can lead to more than one type of cell injury. To account for this, one could extend pseudo-dose using methods such as TSCAN142 or diffusion pseudo-time56. The only change the user would be to preprocess using scVIDR instead of PCA or UMAP. Additionally, with more complex trajectories, toxicologists would like to be able to distinguish between different sources of response (i.e., cell death vs. cellular necrosis). Differential trajectory analysis methods like condiments143 or Trade- seq21 could be used to identify different genes or different sets of genes associated with different branches or points in the dose-response. These genes could be used to identify potential mechanisms by which the drug will cause response. Currently, scVIDR doesn’t integrate other data modalities important to toxic response such as chromatin accessibility144 and protein expression145. Already, scVIDR has shown that it sometimes performs poorly on drugs with epigenetic modes of action (section 4.3; Figure 4.8). Incorporation of other data modalities will be critical not only for future models of chemical response prediction, but also in describing AOPs associated with different chemicals. VAE models that aim to integrate different types of single cell technologies such as MultiVI present a potential path forward. For example, let us assume that I have scRNA-seq, scATAC-seq, and CITE-seq datasets with some overlapping cell-types. Using an integrative version of scVIDR, multiple single cell technologies would be integrated to a single latent space. Thus, similar predictions could be extended to changes in chromatin accessibility with the scATAC-seq, and changes in expression of cell surface proteins with CITE-seq. With this integration, scVIDR will not only be able to predict transcriptomic events in an adverse outcome pathway, but also epigenetic and cell surface protein events. 111 5.4 Closing remarks In this dissertation I utilize current and develop new frameworks in generative modeling for toxicology. I show VAEs utility to describe responses to chemicals in different scRNA-seq datasets. My discussions in rhythmicity and zonation in TCDD response have implications in other liver pathologies such as NAFLD and DILI. Time-series modeling with VAEs could be incorporated similarly to help with these types of experiments. While VAEs are incredibly useful for scRNA-seq, high data requirements for training (i.e., thousands of cells) mean they require thousands of bulk RNA-seq expression datasets be properly trained on existing resources such as TOXCAST21133. Despite this, VAEs show great promise in aiding toxicologists in studying transcriptomic response. With larger single cell initiatives, I envision models like scVIDR can be extended to predict EC50s and adverse outcome pathways. Additionally, VAEs ability to integrate different types of single cell data, such as scATAC-seq and CITE-seq, will allow for better prediction of chemical response. With the proliferation of single cell technologies, an opportunity to understand toxicology at cellular resolution has presented itself. However, high dimensionality makes analysis challenging. I have shown that variational autoencoders can be used to process such data to make insights, as well as generate novel predictions using existing toxicological data. 112 BIBLIOGRAPHY 1. 2. 3. 4. 5. 6. 7. 8. 9. Bohacek, R. S., McMartin, C. & Guida, W. C. The art and practice of structure- based drug design: a molecular modeling perspective. Med Res Rev 16, 3–50 (1996). Shuker, S. B., Hajduk, P. J., Meadows, R. P. & Fesik, S. W. Discovering High- Affinity Ligands for Proteins: SAR by NMR. Science (1979) 274, 1531–1534 (1996). Sun, D., Gao, W., Hu, H. & Zhou, S. Why 90% of clinical drug development fails and how to improve it? Acta Pharm Sin B 12, 3049–3062 (2022). Aissa, A. F. et al. Single-cell transcriptional changes associated with drug tolerance and response to combination therapies in cancer. Nat Commun 12, 1628 (2021). null, null et al. The Tabula Sapiens: A multiple-organ, single-cell transcriptomic atlas of humans. Science (1979) 376, eabl4896 (2023). Halpern, K. B. et al. Single-cell spatial reconstruction reveals global division of labour in the mammalian liver. Nature 2017 542:7641 542, 352–356 (2017). Raj, A. & van Oudenaarden, A. Nature, Nurture, or Chance: Stochastic Gene Expression and Its Consequences. Cell vol. 135 216–226 Preprint at https://doi.org/10.1016/j.cell.2008.09.050 (2008). Yao, J., Pilko, A. & Wollman, R. Distinct cellular states determine calcium signaling response. Mol Syst Biol 12, 894 (2016). Kramer, B. A. & Pelkmans, L. Cellular state determines the multimodal signaling response of single cells. bioRxiv 2019.12.18.880930 (2019). 10. Yang, Y., Filipovic, D. & Bhattacharya, S. A Negative Feedback Loop and Transcription Factor Cooperation Regulate Zonal Gene Induction by 2, 3, 7, 8‐Tetrachlorodibenzo‐p‐Dioxin in the Mouse Liver. Hepatol Commun (2021) doi:10.1002/hep4.1848. 11. Madeira, C. & Costa, P. M. Chapter Two - Proteomics in systems toxicology. in Proteomics and Systems Biology (eds. Donev, R. & Karabencheva-Christova, T. B. T.-A. in P. C. and S. B.) vol. 127 55–91 (Academic Press, 2021). 12. Lamb, J. et al. The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313, 1929–35 (2006). 13. Ma, R., MartΓ­nez-RamΓ­rez, A. S., Borders, T. L., Gao, F. & Sosa-Pineda, B. Metabolic and non-metabolic liver zonation is established non-synchronously and requires sinusoidal Wnts. Elife 9, (2020). 14. Macosko, E. Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015). 15. Klein, A. M. et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015). 113 16. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun 8, 1–12 (2017). 17. Zhang, Q. et al. Embracing systems toxicology at single-cell resolution. Curr Opin Toxicol 16, 49–57 (2019). 18. Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 559–572 (1901). 19. McInnes, L., Healy, J. & Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018). 20. Ouyang, J. F., Kamaraj, U. S., Cao, E. Y. & Rackham, O. J. L. ShinyCell: simple and sharable visualization of single-cell gene expression data. Bioinformatics 37, 3374–3376 (2021). 21. Van den Berge, K. et al. Trajectory-based differential expression analysis for single- cell sequencing data. Nat Commun 11, 1201 (2020). 22. Linderman, G. C. Dimensionality Reduction of Single-Cell RNA-Seq Data. in RNA Bioinformatics (ed. Picardi, E.) 331–342 (Springer US, New York, NY, 2021). doi:10.1007/978-1-0716-1307-8_18. 23. Fefferman, C., Mitter, S. & Narayanan, H. Testing the manifold hypothesis. Journal of the American Mathematical Society 29, 983–1049 (2016). 24. Whitacre, J. M. Degeneracy: a link between evolvability, robustness and complexity in biological systems. Theor Biol Med Model 7, 6 (2010). 25. Nowak, M. A., Boerlijst, M. C., Cooke, J. & Smith, J. M. Evolution of genetic redundancy. Nature 388, 167–171 (1997). 26. Chen, B. et al. Relating Chemical Structure to Cellular Response: An Integrative Analysis of Gene Expression, Bioactivity, and Structural Data Across 11,000 Compounds. CPT Pharmacometrics Syst Pharmacol 4, 576–584 (2015). 27. Davidson, E. H. The β€œRegulatory Genome” for Animal Development. in The Regulatory Genome 1–29 (Elsevier, 2006). doi:10.1016/b978-012088563-3.50019- 5. 28. Kingma, D. P. & Welling, M. Auto-encoding variational bayes. in 2nd International Conference on Learning Representations, ICLR 2014 - Conference Track Proceedings (International Conference on Learning Representations, ICLR, 2014). 29. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat Methods 15, 1053–1058 (2018). 30. Lotfollahi, M., Wolf, F. A. & Theis, F. J. scGen predicts single-cell perturbation responses. Nat Methods 16, 715–721 (2019). 31. GrΓΈnbech, C. H. et al. scVAE: variational auto-encoders for single-cell gene expression data. Bioinformatics 36, 4415–4422 (2020). 32. Qiu, Y. L., Zheng, H. & Gevaert, O. Genomic data imputation with variational auto- encoders. Gigascience 9, 1–12 (2020). 114 33. Lotfollahi, M., Naghipourfar, M., Theis, F. J. & Wolf, F. A. Conditional out-of- distribution generation for unpaired data using transfer VAE. Bioinformatics 36, i610–i617 (2020). 34. Ashuach, T. et al. MultiVI: deep generative model for the integration of multimodal data. Nat Methods 20, 1222–1231 (2023). 35. Bank, D., Koenigstein, N. & Giryes, R. Autoencoders. in Machine Learning for Data Science Handbook: Data Mining and Knowledge Discovery Handbook (eds. Rokach, L., Maimon, O. & Shmueli, E.) 353–374 (Springer International Publishing, Cham, 2023). doi:10.1007/978-3-031-24628-9_16. 36. Liu, Z. Visualizing single-cell RNA-seq data with Semisupervised principal component analysis. Int J Mol Sci 21, 5797 (2020). 37. Grabski, I. N., Street, K. & Irizarry, R. A. Significance analysis for clustering with single-cell RNA-sequencing data. Nat Methods 20, 1196–1202 (2023). 38. Svensson, V., Gayoso, A., Yosef, N. & Pachter, L. Interpretable factor models of single-cell RNA-seq via variational autoencoders. Bioinformatics 36, 3418–3421 (2020). 39. Marini, F. & Binder, H. pcaExplorer: an R/Bioconductor package for interacting with RNA-seq principal components. BMC Bioinformatics 20, 331 (2019). 40. Chari, T. & Pachter, L. The specious art of single-cell genomics. PLoS Comput Biol 19, e1011288- (2023). 41. Lotfollahi, M. et al. Predicting cellular responses to complex perturbations in high- throughput screens. Mol Syst Biol 19, e11517 (2023). 42. Heumos, L. et al. Best practices for single-cell analysis across modalities. Nat Rev Genet 24, 550–572 (2023). 43. Higgins, I. et al. Early visual concept learning with unsupervised deep learning. arXiv preprint arXiv:1606.05579 (2016). 44. Ding, J., Condon, A. & Shah, S. P. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat Commun 9, (2018). 45. Wang, D. & Gu, J. VASC: Dimension Reduction and Visualization of Single-cell RNA-seq Data by Deep Variational Autoencoder. Genomics Proteomics Bioinformatics 16, 320–331 (2018). 46. Xu, C. et al. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Mol Syst Biol 17, e9620 (2021). 47. Jiang, J. et al. Dimensionality reduction and visualization of single-cell RNA-seq data with an improved deep variational autoencoder. Brief Bioinform 24, bbad152 (2023). 48. Lotfollahi, M. et al. Mapping single-cell data to reference atlases by transfer learning. Nat Biotechnol 40, 121–130 (2022). 115 49. Liu, S., Zhang, Y., Peng, J. & Shang, X. An improved hierarchical variational autoencoder for cell–cell communication estimation using single-cell RNA-seq data. Brief Funct Genomics elac056 (2023) doi:10.1093/bfgp/elac056. 50. Yao, Z. et al. A high-resolution transcriptomic and spatial atlas of cell types in the whole mouse brain. Nature 624, 317–332 (2023). 51. Kopf, A., Fortuin, V., Somnath, V. R. & Claassen, M. Mixture-of-Experts Variational Autoencoder for clustering and generating from similarity-based representations on single cell data. PLoS Comput Biol 17, e1009086 (2021). 52. Droin, C. et al. Space-time logic of liver gene expression at sub-lobular scale. Nat Metab 3, 43–58 (2021). 53. Cunningham, R. P. & Porat-Shliom, N. Liver Zonation – Revisiting Old Questions With New Technologies. Front Physiol 12, 732929 (2021). 54. Dibner, C., Schibler, U. & Albrecht, U. The Mammalian Circadian Timing System: Organization and Coordination of Central and Peripheral Clocks. Annu Rev Physiol 72, 517–549 (2010). 55. Buijs, R. M. et al. Organization of circadian functions: interaction with the body. in Progress in Brain Research (eds. Kalsbeek, A. et al.) vol. 153 341–360 (Elsevier, 2006). 56. Haghverdi, L., BΓΌttner, M., Wolf, F. A., Buettner, F. & Theis, F. J. Diffusion pseudotime robustly reconstructs lineage branching. Nat Methods 13, 845–848 (2016). 57. Coifman, R. R. & Lafon, S. Diffusion maps. Appl Comput Harmon Anal 21, 5–30 (2006). 58. 59. 60. Plumlee, K. Chapter 11. Hepatobiliary System. in 61–68 (2004). doi:10.1016/B0- 32-301125-X/50014-5. Santostefano, M. J. et al. Dose-dependent localization of TCDD in isolated centrilobular and periportal hepatocytes. Toxicological Sciences 52, 9–19 (1999). Jungermann, K. Zonation of metabolism and gene expression in liver. Histochem Cell Biol 103, 81–91 (1995). 61. Wild, S. L. et al. The canonical Wnt pathway as a key regulator in liver development, differentiation and homeostatic renewal. Genes (Basel) 11, 1163 (2020). 62. Nault, R. et al. Single-cell transcriptomics shows dose-dependent disruption of hepatic zonation by TCDD in mice. Toxicological Sciences 191, 135–148 (2023). 63. Pai, S. G. et al. Wnt/beta-catenin pathway: modulating anticancer immune response. J Hematol Oncol 10, 101 (2017). 64. Green, C. B., Takahashi, J. S. & Bass, J. The meter of metabolism. Cell 134, 728– 742 (2008). 116 65. Yeung, J. & Naef, F. Rhythms of the Genome: Circadian Dynamics from Chromatin Topology, Tissue-Specific Gene Expression, to Behavior. Trends in Genetics 34, 915–926 (2018). 66. Fader, K. A., Nault, R., Doskey, C. M., Fling, R. R. & Zacharewski, T. R. 2,3,7,8- Tetrachlorodibenzo-p-dioxin abolishes circadian regulation of hepatic metabolic activity in mice. Sci Rep 9, 6514 (2019). 67. Shi, D. et al. Circadian clock genes in the metabolism of non-alcoholic fatty liver disease. Front Physiol 10, 423 (2019). 68. Martini, T., Naef, F. & Tchorz, J. S. Spatiotemporal Metabolic Liver Zonation and Consequences on Pathophysiology. Annual Review of Pathology: Mechanisms of Disease 18, 439–466 (2023). 69. Tahara, Y. & Shibata, S. Circadian rhythms of liver physiology and disease: experimental and clinical evidence. Nat Rev Gastroenterol Hepatol 13, 217–226 (2016). 70. 71. ProchΓ‘zkovΓ‘, J. et al. The interplay of the aryl hydrocarbon receptor and Ξ²-catenin alters both AhR-dependent transcription and wnt/Ξ²-catenin signaling in liver progenitors. Toxicological Sciences 122, 349–360 (2011). Schneider, A. J., Branam, A. M. & Peterson, R. E. Intersection of AHR and Wnt signaling in development, health, and disease. International Journal of Molecular Sciences vol. 15 17852–17885 Preprint at https://doi.org/10.3390/ijms151017852 (2014). 72. Time-Course Single-Nuclei RNA-Seq Analysis Identifies NRG/ERBB Signaling Dysregulation in TCDD-Induced Hepatotoxicity. Society of Toxicology Annual Meeting Preprint at (2023). 73. Larigot, L., Juricek, L., Dairou, J. & Coumoul, X. AhR signaling pathways and regulatory functions. Biochim Open 7, 1–9 (2018). 74. Hutin, D. et al. 2,3,7,8-Tetrachlorodibenzo-p-Dioxin (TCDD)-Inducible Poly-ADP- Ribose Polymerase (TIPARP/PARP7) Catalytic Mutant Mice (TiparpH532A) Exhibit Increased Sensitivity to TCDD-Induced Hepatotoxicity and Lethality. Toxicological Sciences 183, 154–169 (2021). 75. Davidian, M. & Giltinan, D. M. Nonlinear models for repeated measurement data: An overview and update. J Agric Biol Environ Stat 8, 387 (2003). 76. Aizarani, N. et al. A human liver cell atlas reveals heterogeneity and epithelial progenitors. Nature 572, 199–204 (2019). 77. Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat Methods 19, 41–50 (2022). 78. L. Lun, A. T., Bach, K. & Marioni, J. C. Pooling across cells to normalize single- cell RNA sequencing data with many zero counts. Genome Biol 17, 75 (2016). 117 79. Nault, R., Fader, K. A., Bhattacharya, S. & Zacharewski, T. R. Single-Nuclei RNA Sequencing Assessment of the Hepatic Effects of 2,3,7,8-Tetrachlorodibenzo-p- dioxin. CMGH 11, 147–159 (2021). 80. Schmidt, C. K. et al. 2,3,7,8-Tetrachlorodibenzo-p-dioxin (TCDD) alters the endogenous metabolism of all-trans-retinoic acid in the rat. Arch Toxicol 77, 371– 383 (2003). 81. Burke, Z. D. & Tosh, D. The Wnt/Ξ²-catenin pathway: master regulator of liver zonation? BioEssays 28, 1072–1077 (2006). 82. Gougelet, A. et al. T-cell factor 4 and Ξ²-catenin chromatin occupancies pattern zonal liver metabolism in mice. Hepatology 59, 2344–2357 (2014). 83. Chen, E. Y. et al. Enrichr: Interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics 14, 128 (2013). 84. Xu, C.-X., Krager, S. L., Liao, D.-F. & Tischkau, S. A. Disruption of CLOCK- BMAL1 Transcriptional Activity Is Responsible for Aryl Hydrocarbon Receptor– Mediated Regulation of Period1 Gene. Toxicological Sciences 115, 98–108 (2010). 85. Nault, R. et al. Dose-Dependent Metabolic Reprogramming and Differential Gene Expression in TCDD-Elicited Hepatic Fibrosis. Toxicological Sciences 154, 253– 266 (2016). 86. Nault, R. et al. Benchmarking of a Bayesian single cell RNAseq differential gene expression test for dose–response study designs. Nucleic Acids Res (2022) doi:10.1093/nar/gkac019. 87. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: Large-scale single-cell gene expression data analysis. Genome Biol 19, 15 (2018). 88. Seabold, S. & Perktold, J. Statsmodels: Econometric and statistical modeling with python. in Proceedings of the 9th Python in Science Conference vol. 57 10–25080 (Austin, TX, 2010). 89. Nelder, J. A. & Mead, R. A Simplex Method for Function Minimization. Comput J 7, 308–313 (1965). 90. Schwarz, G. Estimating the dimension of a model. The annals of statistics 461–464 (1978). 91. Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods 2020 17:3 17, 261–272 (2020). 92. 93. Fang, Z., Liu, X. & Peltz, G. GSEApy: a comprehensive package for performing gene set enrichment analysis in Python. Bioinformatics 39, btac757 (2023). Peidli, S. et al. scPerturb: Harmonized Single-Cell Perturbation Data. bioRxiv 2022.08.20.504663 (2023) doi:10.1101/2022.08.20.504663. 94. McFarland, J. M. et al. Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action. Nat Commun 11, (2020). 118 95. 96. Srivatsan, S. R. et al. Massively multiplex chemical transcriptomics at single-cell resolution. Science (1979) 367, 45–51 (2020). Ji, Y., Lotfollahi, M., Wolf, F. A. & Theis, F. J. Machine learning for perturbational single-cell omics. Cell Syst 12, 522–537 (2021). 97. Bunne, C. et al. Learning Single-Cell Perturbation Responses using Neural Optimal Transport. bioRxiv 2021.12.15.472775 (2021) doi:10.1101/2021.12.15.472775. 98. Wei, X., Dong, J. & Wang, F. scPreGAN, a deep generative model for predicting the response of single-cell expression to perturbation. Bioinformatics 38, 3377– 3384 (2022). 99. Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat Biotechnol 36, 89–94 (2018). 100. Hagai, T. et al. Gene expression variability across cells and species shapes innate immunity. Nature 563, 197–202 (2018). 101. Pedregosa, F. et al. Scikit-learn: Machine learning in Python. Journal of machine learning research 12, 2825–2830 (2011). 102. Arpit, D., Zhou, Y., Ngo, H. & Govindaraju, V. Why regularized auto-encoders learn sparse representation? in International Conference on Machine Learning 136– 144 (PMLR, 2016). 103. Elowitz, M. B., Levine, A. J., Siggia, E. D. & Swain, P. S. Stochastic gene expression in a single cell. Science (1979) 297, 1183–1186 (2002). 104. Rostom, R., Svensson, V., Teichmann, S. A. & Kar, G. Computational approaches for interpreting scRNA-seq data. FEBS Lett 591, 2213–2225 (2017). 105. Ribeiro, M. T., Singh, S. & Guestrin, C. β€˜ Why should i trust you?’ Explaining the predictions of any classifier. in Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining 1135–1144 (2016). 106. Uno, S. et al. Cyp1a1(βˆ’/βˆ’) male mice: protection against high-dose TCDD-induced lethality and wasting syndrome, and resistance to intrahepatocyte lipid accumulation and uroporphyria. Toxicol Appl Pharmacol 196, 410–421 (2004). 107. Martens, M. et al. WikiPathways: Connecting communities. Nucleic Acids Res 49, D613–D621 (2021). 108. Henry, E. C., Welle, S. L. & Gasiewicz, T. A. TCDD and a Putative Endogenous AhR Ligand, ITE, Elicit the Same Immediate Changes in Gene Expression in Mouse Lung Fibroblasts. Toxicological Sciences 114, 90–100 (2010). 109. Cholico, G. N. et al. Thioesterase induction by 2,3,7,8-tetrachlorodibenzo-p-dioxin results in a futile cycle that inhibits hepatic Ξ²-oxidation. Sci Rep 11, 15689 (2021). 110. Friedrich, M. et al. Tryptophan metabolism drives dynamic immunosuppressive myeloid states in IDH-mutant gliomas. Nat Cancer 2, 723–740 (2021). 111. Street, K. et al. Slingshot: Cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics 19, 477 (2018). 119 112. Qiu, X. et al. Reversed graph embedding resolves complex single-cell trajectories. Nat Methods 14, 979–982 (2017). 113. Lindros, K. O., Oinonen, T., Johansson, I. & Ingelman-Sundberg, M. Selective Centrilobular Expression of the Aryl Hydrocarbon Receptor in Rat Liver. Journal of Pharmacology and Experimental Therapeutics 280, 506 LP – 511 (1997). 114. Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods. Nat Biotechnol 37, 547–554 (2019). 115. Plaut, E. From principal subspaces to principal components with linear autoencoders. arXiv preprint arXiv:1804.10253 (2018). 116. Cao, J. et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566, 496–502 (2019). 117. Byrnes, L. E. et al. Lineage dynamics of murine pancreatic development at single- cell resolution. Nat Commun 9, 3922 (2018). 118. Pellin, D. et al. A comprehensive single cell transcriptional landscape of human hematopoietic progenitors. Nat Commun 10, (2019). 119. Diliberto, J. J., Burgin, D. & Birnbaum, L. S. Role of CYP1A2 in Hepatic Sequestration of Dioxin: Studies Using CYP1A2 Knock-Out Mice. Biochem Biophys Res Commun 236, 431–433 (1997). 120. Janesick, A. et al. High resolution mapping of the tumor microenvironment using integrated single-cell, spatial and in situ analysis. Nat Commun 14, 8353 (2023). 121. Chen, K. H., Boettiger, A. N., Moffitt, J. R., Wang, S. & Zhuang, X. Spatially resolved, highly multiplexed RNA profiling in single cells. Science (1979) 348, aaa6090 (2015). 122. Halpern, K. B. et al. Paired-cell sequencing enables spatial gene expression mapping of liver endothelial cells. Nat Biotechnol 36, 962–970 (2018). 123. Gola, A. et al. Commensal-driven immune zonation of the liver promotes host defence. Nature 589, 131–136 (2021). 124. Karri, K. & Waxman, D. J. TCDD dysregulation of lncRNA expression, liver zonation and intercellular communication across the liver lobule. Toxicol Appl Pharmacol 471, 116550 (2023). 125. Walesky, C. M. et al. Functional compensation precedes recovery of tissue mass following acute liver injury. Nat Commun 11, 5785 (2020). 126. Brunt, E. M. Pathology of nonalcoholic fatty liver disease. Nat Rev Gastroenterol Hepatol 7, 195–203 (2010). 127. Kettner, N. M. et al. Circadian homeostasis of liver metabolism suppresses hepatocarcinogenesis. Cancer Cell 30, 909–924 (2016). 128. Panday Chase P.; Khetani Salman R., R. M. The Role of Liver Zonation in Physiology, Regeneration, and Disease. Semin Liver Dis 42, 1–16 (2022). 120 129. Wei, Y. et al. Liver homeostasis is maintained by midlobular zone 2 hepatocytes. Science (1979) 371, eabb1625 (2021). 130. DeBruyne, J. P., Weaver, D. R. & Dallmann, R. The Hepatic Circadian Clock Modulates Xenobiotic Metabolism in Mice. J Biol Rhythms 29, 277–287 (2014). 131. Jiang, X. & Kopp-Schneider, A. Summarizing EC50 estimates from multiple dose- response experiments: A comparison of a meta-analysis strategy to a mixed-effects model approach. Biometrical Journal 56, 493–512 (2014). 132. Lotfollahi, M. et al. Predicting cellular responses to complex perturbations in high‐throughput screens. Mol Syst Biol 19, e11517 (2023). 133. R, T. R., P, A. C., J, K. R. & R, B. J. Improving the Human Hazard Characterization of Chemicals: A Tox21 Update. Environ Health Perspect 121, 756–765 (2013). 134. Sakuratani, Y., Horie, M. & Leinala, E. Integrated Approaches to Testing and Assessment: OECD Activities on the Development and Use of Adverse Outcome Pathways and Case Studies. Basic Clin Pharmacol Toxicol 123, 20–28 (2018). 135. Koussounadis, A., Langdon, S. P., Um, I. H., Harrison, D. J. & Smith, V. A. Relationship between differentially expressed mRNA and mRNA-protein correlations in a xenograft model system. Sci Rep 5, 10775 (2015). 136. Buccitelli, C. & Selbach, M. mRNAs, proteins and the emerging principles of gene expression control. Nat Rev Genet 21, 630–644 (2020). 137. Liu, A., Han, N., Munoz-Muriedas, J. & Bender, A. Deriving time-concordant event cascades from gene expression data: A case study for Drug-Induced Liver Injury (DILI). PLoS Comput Biol 18, e1010148- (2022). 138. Chen, W. et al. Live-seq enables temporal transcriptomic recording of single cells. Nature 608, 733–740 (2022). 139. Tong, A., Huang, J., Wolf, G., Van Dijk, D. & Krishnaswamy, S. Trajectorynet: A dynamic optimal transport network for modeling cellular dynamics. in International conference on machine learning 9526–9536 (PMLR, 2020). 140. Mathieu, E. & Nickel, M. Riemannian continuous normalizing flows. Adv Neural Inf Process Syst 33, 2503–2515 (2020). 141. Khan, D. M. I. O., Karmaus, P. W. F., Bach, A., Crawford, R. B. & Kaminski, N. E. An in vitro model of human hematopoiesis identifies a regulatory role for the aryl hydrocarbon receptor. Blood Adv 7, 6253–6265 (2023). 142. Ji, Z. & Ji, H. TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res 44, e117–e117 (2016). 143. BΓ©zieux, H. R. de, Berge, K. Van den, Street, K. & Dudoit, S. Trajectory inference across multiple conditions with condiments: differential topology, progression, differentiation, and expression. bioRxiv 2021.03.09.433671 (2021) doi:10.1101/2021.03.09.433671. 144. Davis, I. J. & Pattenden, S. G. Chapter 1-3 - Chromatin Accessibility as a Strategy to Detect Changes Associated With Development, Disease, and Exposure and 121 Susceptibility to Chemical Toxins. in Toxicoepigenetics (eds. McCullough, S. D. & Dolinoy, D. C.) 85–103 (Academic Press, 2019). doi:https://doi.org/10.1016/B978- 0-12-812433-8.00003-4. 145. Kennedy, S. The role of proteomics in toxicology: identification of biomarkers of toxicity by protein expression analysis. Biomarkers 7, 269–290 (2002). 122