STATISTICAL METHODS FOR SINGLE CELL GENE EXPRESSION: DIFFERENTIAL EXPRESSION, CURVE ESTIMATION AND GRAPHICAL MODELLING

By

Satabdi Saha

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Statistics – Doctor of Philosophy

2022

ABSTRACT

This dissertation presents a set of statistical methods developed for the analysis of single cell gene expression datasets. Gene expression profiling of single cells has led to unprecedented progress in understanding normal physiology, disease progression and developmental processes. However, despite many improvements in high throughput sequencing, various technical factors including cell-cycle heterogeneity, library size differences, amplification bias, and low RNA capture per cell lead to high noise in scRNA-seq experiments. A primary characteristic of these datasets is a high number of zeroes, which represent undetectable levels of expression for a transcript. Statistical methods capable of modelling novel single cell experiments are developed, and new estimation strategies are proposed and validated using simulated and real data experiments.

• In Chapter 1, the motivation and underlying philosophy of single cell gene expression are reviewed. Methods for the analysis of dose response experiments and gene co-expression networks are reviewed, and novel statistical hypotheses to be investigated using single cell experiments are discussed.

• In Chapter 2, I analyze a unique in vivo dose response hepatic scRNAseq dataset consisting of 9 dose groups with 3 biological replicates for 11 distinct liver cell types, comprising more than 100K cells. A Hurdle model for multiple group data is proposed, which models the bimodality of single cell gene expression within multiple groups. Based on the model assumptions, I derive a fit-for-purpose Bayesian test for simultaneously testing the differences in mean gene expression and zero proportions across multiple dose groups. For comparison, the counterpart likelihood-ratio test for differential expression, which incorporates testing for both components, is also derived. This chapter was originally published in [1].

• In Chapter 3, dose response curve estimation for single cell experiments is studied. Current protocols for genomic dose response modelling are only capable of modelling bulk and microarray datasets. A semiparametric regression model is proposed for joint dose response curve estimation across multiple cell types while accounting for confounding covariates. A novel, scalable and efficient optimization algorithm using the MM philosophy is proposed for the estimation of both monotone and non-monotone curves. Two relevant tests of hypothesis are discussed and the proposed methods are validated using several simulated datasets.

• In Chapter 4, co-expression network estimation is studied using graph signal processing. A kernelized signed graph learning approach is developed for learning single cell gene co-expression networks, based on the assumption of smoothness of gene expressions over activating edges. Performance is assessed using real human and mouse embryonic datasets. This chapter was originally published in [2].

This work is dedicated to my husband and son. You have made me better, stronger, and more fulfilled than I could have ever imagined.

ACKNOWLEDGEMENTS

I would like to express my deepest gratitude to my advisors, Dr Tapabrata Maiti and Dr Samiran Sinha, both of whom have been extremely supportive of my study, research, and professional development. It has been inspiring to observe their scientific rigor and enthusiasm for interdisciplinary research.
I am grateful for their constant encouragement, patience, guidance, and the tremendous support throughout my doctoral studies at MSU.

I would like to thank Dr Sudin Bhattacharya, Dr Rance Nault and Dr Tim Zacharewski for introducing me to the field of toxicogenomics; it has always been a wonderful experience collaborating with them on highly interdisciplinary problems. I would like to thank my committee members, Dr Shlomo Levental and Dr Lyudamila Sakhanenko for serving in my committee and being patient readers of my work. I am highly indebted to Dr Sakhanenko and Dr Levental for being wonderful teachers; their contribution towards my PhD training is immense. In addition, I am very grateful to Dr Selin Aviyente and her doctoral advisee Abdullah Karaslaanli for collaborating with me on developing signal processing based graph learning ideas, which contribute to the fourth chapter of this dissertation. I thank Prof. Elijah Dikong and Prof. Camille Fairbourne for their advice on my teaching assistantships and for being wonderful mentors. Special thanks to the wonderful staff of our department, Sue, Teresa, Megan, Andy and Tami for their extensive support throughout the entire duration of my doctoral studies.

I sincerely thank my friends at MSU, Nilanjan, Sanket, Alex, Abhijnan, Sumegha and Phuong for their generous help and friendship. I also thank my brother and friend Subhrajit for being a constant source of positivity and encouragement. Finally and most importantly I wish to thank my mom and my husband. This dissertation would not have been possible without their endless support and encouragement.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
  1.1 Motivation for exploring single cell gene expression
  1.2 Structure of single cell gene expression datasets
  1.3 Dose Response Experiments
    1.3.1 National Toxicology Program’s approach to genomic dose response modeling
      1.3.1.1 Determining Adequate Signal in the Data
      1.3.1.2 Filtering of Measured Features
      1.3.1.3 Dose Response Curve Estimation
  1.4 Covariation Analysis
CHAPTER 2 BAYESIAN SINGLE CELL RNASEQ DIFFERENTIAL GENE EXPRESSION TEST FOR DOSE RESPONSE STUDY DESIGNS
  2.1 Single Cell Dose Response Experiments
  2.2 Materials and Methods
    2.2.1 Animal handling and treatment
    2.2.2 Real scRNAseq and snRNAseq datasets
    2.2.3 Dose-response data simulation
    2.2.4 Single cell RNA-seq Hurdle Model
      2.2.4.1 Hypothesis Formulation
      2.2.4.2 Single cell Bayesian Hurdle model Analysis (scBT)
      2.2.4.3 Multiple group Likelihood Ratio Test (LRT)
      2.2.4.4 Linear model-based Likelihood Ratio Test (LRT linear)
    2.2.5 Benchmarking method selection
      2.2.5.1 Seurat Bimod
      2.2.5.2 MAST
      2.2.5.3 Limma-trend
      2.2.5.4 Wilcoxon Rank Sum (WRS) Test
      2.2.5.5 ANOVA
      2.2.5.6 Kruskal-Wallis (KW) Test
    2.2.6 Benchmarking and sensitivity analyses
  2.3 Results
    2.3.1 Performance accuracy of DE test methods
    2.3.2 Type I error control and power
    2.3.3 Parameter Sensitivity Analysis
    2.3.4 Test method agreement
  2.4 Real dose–response dataset DE analysis
  2.5 Discussion
  2.6 Conclusion
  2.7 Acknowledgements
CHAPTER 3 SEMIPARAMETRIC DOSE RESPONSE CURVE ESTIMATION FOR SINGLE CELL DOSE RESPONSE EXPERIMENTS
  3.1 Single Cell Dose Response Experiments
    3.1.1 Motivating experimental study and hypothesis of interest
    3.1.2 Literature on dose response curve estimation
    3.1.3 Literature on MM algorithms
  3.2 Methods
    3.2.1 Model, notations and assumptions
    3.2.2 Penalized Estimation
    3.2.3 Monotonicity constraints
    3.2.4 Confidence interval estimation
    3.2.5 Model Selection
  3.3 Hypothesis Testing
  3.4 Results
    3.4.1 Simulation Design
    3.4.2 Analysis
  3.5 Discussion
CHAPTER 4 KERNELIZED SIGNED GRAPH LEARNING FOR SINGLE CELL GENE REGULATORY NETWORK INFERENCE
  4.1 Single Cell Gene Regulatory Networks
  4.2 Background
    4.2.1 Notations
    4.2.2 Low- and High-frequency Signals on Unsigned Graphs
    4.2.3 Unsigned Graph Learning
    4.2.4 Kernels
  4.3 Methods
    4.3.1 Signed Graph Learning
    4.3.2 Kernelized Signed Graph Learning
    4.3.3 Hyperparameter Selection
    4.3.4 Generation of simulated datasets from zero-inflated negative binomial distribution
    4.3.5 Performance Metrics
      4.3.5.1 AUPRC and AUROC
      4.3.5.2 AUPRC Activating/Inhibitory
      4.3.5.3 EPR
  4.4 Results
    4.4.1 Synthetic Datasets
    4.4.2 Real Datasets
  4.5 Discussion
  4.6 Acknowledgements
APPENDICES
  APPENDIX A BAYESIAN SINGLE CELL RNASEQ DIFFERENTIAL GENE EXPRESSION TEST FOR DOSE RESPONSE STUDY DESIGNS
  APPENDIX B SEMIPARAMETRIC DOSE RESPONSE CURVE ESTIMATION FOR SINGLE CELL DOSE RESPONSE EXPERIMENTS
  APPENDIX C KERNELIZED SIGNED GRAPH LEARNING FOR SINGLE CELL GENE REGULATORY NETWORK INFERENCE
BIBLIOGRAPHY

LIST OF TABLES

Table 1.1 Parametric models for genomic dose response analysis

Table 2.1 Dose-response models for simulation of scRNAseq data

Table 3.1 Parameter estimate, asymptotic standard error, bias and mean squared error (MSE) of parameter β under full model and the intercept model. The results are averaged over 500 replicates and reported for n_i = 100, 300

Table 3.2 Parameter estimate, asymptotic standard error, bias and mean squared error (MSE) of parameter ϕ under full model and the intercept model. The results are averaged over 500 replicates and reported for n_i = 100, 300

Table 3.3 Parameter estimate, asymptotic standard error, bias and mean squared error (MSE) of parameter ψ_0 under full model and the intercept model. The results are averaged over 500 replicates and reported for n_i = 100, 300

Table 3.4 Parameter estimate, asymptotic standard error, bias and mean squared error (MSE) of parameter ψ_1 under full model and the intercept model.
The results are averaged over 500 replicates and reported for n_i = 100, 300

LIST OF FIGURES

Figure 2.1 Flow diagram of the simulation, benchmarking, and experimental data evaluation strategy presented in the manuscript. Briefly, SplattDR was developed to simulate dose-response scRNAseq data and validated based on experimental dose-response data. Simulated datasets were generated varying diverse parameters 10 times and then used to assess the performance of each test method. Each test method was also assessed using experimental data from the hepatic snRNAseq dose response dataset obtained from male mice gavaged every 4 days for 28 days with 0.01, 0.03, 0.1, 0.3, 1, 3, 10, or 30 µg/kg TCDD. Related figures for each analysis from the main body are noted.

Figure 2.2 Comparison of simulated and real dose-response data. (A) Relationship between gene-wise mean expression and percent zeroes for simulated and real dose-response data. Simulation data consisted of 10,000 genes and 9 dose groups based on parameters derived from experimental dose-response snRNAseq data. Black line represents a model fitted to the experimental data, from which the normalized root mean square deviation (NRMSD) of simulated data was determined. (B) Relationship between gene-wise mean expression and variance for simulated and experimental data. NRMSD was calculated for simulated data from the fitted model represented as a black line. (C) Distribution of log(fold-changes) in experimental and simulated data showing the median and minimum and maximum values. (D) Principal components analysis of simulated data colored according to simulated dose groups. (E) NRMSD estimated relative to fitted model in A,B for simulated data generated from initial parameters derived from published hepatic scRNAseq (two dose; GSE148339), hepatic whole cell (whole cell; GSE129516), and peripheral blood mononuclear cell (PBMC; GSE108313) datasets. (F) NRMSD estimated relative to model fitted to cell-type specific experimental dose-response data when simulated from initial parameters estimated from that same cell type. Box and whisker plots show median NRMSD, 25th and 75th percentiles, and minimum and maximum values.

Figure 2.3 Classification performance of DE analysis tests. (A) ROCs estimated from simulated dose-response scRNAseq data for 9 DE test methods including all genes expressed in at least 1 cell (unfiltered). (B) ROCs for 9 DE test methods after filtering simulated dose-response scRNAseq data to retain only genes expressed in ≥ 5% of cells (low levels) in at least one dose group. (C) Precision-recall curves (PRCs) for 9 DE test methods on unfiltered simulated dose-response scRNAseq data. (D) PRCs for 9 DE test methods on filtered simulated dose-response scRNAseq data. Lines represent the mean values and the shaded region reflects the standard deviation for 10 independent simulations. (E) Precision of DE test methods. (F) FPR of DE test methods. (G) MCC for test methods. (E,F,G) Box and whisker plots show median values, 25th and 75th percentiles, and minimum and maximum values for 10 independent simulations. Points reflect values for each independent simulation. Panels display comparisons of unfiltered and filtered datasets.

Figure 2.4 Evaluation of Type I and II error control. (A) False positive rate (FPR) of 9 differential expression test methods estimated from negative control (0% DE genes) simulated dose-response scRNAseq data including all genes expressed in at least 1 cell (unfiltered) and only genes expressed in ≥ 5% of cells in at least one dose group (filtered). (B,C) Logistic regression models were fitted to negative control data to predict the probability of false positive identification using percent zeroes and mean expression as covariates. Lines represent the predicted probability of false positive classification with the shaded region representing the 95% confidence interval. (D) False negative rate (FNR) of 9 differential expression test methods estimated from positive control (100% DE genes) simulated dose-response scRNAseq data including unfiltered and filtered datasets. (E,F) Logistic regression models were fitted to positive control data. Lines represent the predicted probability of false negative classification with the shaded region representing the 95% confidence interval.

Figure 2.5 Matthews correlation coefficient (MCC) from sensitivity analyses of differential expression test methods. (A) MCC for 9 DGEA test methods determined from simulated dose response data with varying number of cells per dose group. Simulations consisted of 5,000 genes with a probability of differential expression of 10% and 9 dose groups. (B) MCC for simulated data varying the cell numbers by dose group. The number of cells in each of the 9 dose groups is shown on the right. (C) MCC for varying proportion of differentially expressed genes. (D) MCC when varying the mean fold-change (location) of repressed differentially expressed genes. (E) MCC for varying distribution of fold-change (scale) of differentially expressed genes. (F) MCC for varying dropout rates calculated as in Table S3. Points represent median and error bars represent minimum to maximum values. Boxplots represent median, 25th to 75th percentile, and minimum to maximum values. Each analysis consisted of 10 replicate datasets including all genes expressed in at least 1 cell (unfiltered) and genes expressed in ≥ 5% of cells in at least one dose group (filtered).

Figure 2.6 Agreement of differential expression test methods on experimental dose-response data. (A) UpSet plot showing the intersection size of genes identified as differentially expressed by 9 different test methods in hepatocytes from the portal region of the liver lobule. (B) Intersect of differentially expressed genes in portal fibroblasts. (C) Intersect size in hepatic stellate cells. Vertical bars represent the intersect size for test methods denoted by a black dot. Horizontal bars show the total number of differentially expressed genes identified within each test (set sizes). Only intersects for which genes were identified are shown. Genes were considered differentially expressed when (i) expressed in > 5% of cells within any given dose group and (ii) exhibiting a |fold-change| ≥ 1.5. A heatmap in the upper left corner of each panel shows the pairwise AUCC comparisons for the 500 lowest p-values. (D) Relative proportion of cell types identified in each dose group of the real dataset for the cell types in A,B,C. Experimental snRNAseq data was obtained from male mice gavaged with sesame oil vehicle (vehicle control) or 0.01 – 30 µg/kg TCDD every 4 days for 28 days. (E) Graph metrics for gene set enrichment analysis of portal fibroblasts grouped by similarity in gene membership. Violin plots show distribution of node-wise values for each test method. (F) Network visualization of significantly enriched (adjusted p-value ≤ 0.05) gene sets using the Bayes factor ranked genes of portal fibroblasts. Groups of ≥ 2 nodes were manually annotated following commonality in the gene set names. Each node represents a gene set with the size of the node representing the number of genes in a gene set, and edges connect nodes with ≥ 50% overlap.

Figure 3.1 Results of the simulation study, illustrating the performance of Model 1 (see Simulation Design) in 500 replicates with sample size 300. The columns correspond to the three different cell-types.
Continuous red and blue lines and the shaded grey region represent the log-mean curve averaged across 500 random replicates, the true simulated curves, and the 95% pointwise confidence intervals, respectively.

Figure 3.2 Results of the simulation study, illustrating the performance of Model 1 (see Simulation Design) in 500 replicates with sample size 100. The columns correspond to the three different cell-types. Continuous red and blue lines and the shaded grey region represent the log-mean curve averaged across 500 random replicates, the true simulated curves, and the 95% pointwise confidence intervals, respectively.

Figure 3.3 Results of the simulation study, illustrating the performance of Model 1 (see Simulation Design) in 500 replicates for sample size 300. The columns correspond to the three different cell-types. Continuous cyan, peach, darkblue, darkgreen and red lines represent the true curve and the estimated log-mean curves averaged across 500 random replicates for ZINB-SPL, ZINB-GAM-Dose, ZINB-GAM-Int and NB-GAM models, respectively.

Figure 3.4 Results of the simulation study, illustrating the estimated RMSE of Model 1 (see Simulation Design) in 500 replicates for sample size 300. The columns correspond to RMSE boxplots of the NB-GAM, ZINB-GAM-Dose, ZINB-GAM-Int and ZINB-SPL models, respectively, for the three different cell-types plotted over 500 replicates.

Figure 3.5 Results of the simulation study, illustrating the performance of Model 2 (see Simulation Design) in 500 replicates for sample size 300. The columns correspond to the three different cell-types. Continuous red and blue lines and the shaded grey region represent the log-mean curve averaged across 500 random replicates, the true simulated curves, and the 95% pointwise confidence intervals, respectively.

Figure 3.6 Results of the simulation study, illustrating the performance of Model 2 (see Simulation Design) in 500 replicates for sample size 100. The columns correspond to the three different cell-types. Continuous red and blue lines and the shaded grey region represent the log-mean curve averaged across 500 random replicates, the true simulated curves, and the 95% pointwise confidence intervals, respectively.

Figure 3.7 Results of the simulation study, illustrating the performance of Model 2 (see Simulation Design) in 500 replicates for sample size 300. The columns correspond to the three different cell-types. Continuous cyan, peach, darkblue, darkgreen and red lines represent the true simulated curves and the estimated log-mean curve averaged across 500 random replicates for ZINB-SPL, ZINB-GAM-Dose, ZINB-GAM-Int and NB-GAM models, respectively.

Figure 3.8 Results of the simulation study, illustrating the estimated RMSE of Model 2 (see Simulation Design) in 500 replicates for sample size 300. The columns correspond to RMSE boxplots of the NB-GAM, ZINB-GAM-Dose, ZINB-GAM-Int and ZINB-SPL models, respectively, for the three different cell-types plotted over 500 replicates.

Figure 3.9 Results of the simulation study, illustrating the performance of Model 3 (see Simulation Design) in 500 replicates for sample size 300. The columns correspond to the three different cell-types. Continuous red and blue lines and the shaded grey region represent the log-mean curve averaged across 500 random replicates, the true simulated curves, and the 95% pointwise confidence intervals, respectively.

Figure 3.10 Results of the simulation study, illustrating the performance of Model 3 (see Simulation Design) in 500 replicates for sample size 100. The columns correspond to the three different cell-types. Continuous red and blue lines and the shaded grey region represent the log-mean curve averaged across 500 random replicates, the true simulated curves, and the 95% pointwise confidence intervals, respectively.

Figure 3.11 Results of the simulation study, illustrating the performance of Model 3 (see Simulation Design) in 500 replicates for sample size 300. The columns correspond to the three different cell-types. Continuous cyan, peach, darkblue, darkgreen and red lines represent the true simulated curves and the estimated log-mean curve averaged across 500 random replicates for ZINB-SPL, ZINB-GAM-Dose, ZINB-GAM-Int and NB-GAM models, respectively.

Figure 3.12 Results of the simulation study, illustrating the estimated RMSE of Model 3 (see Simulation Design) in 500 replicates for sample size 300. The columns correspond to RMSE boxplots of the NB-GAM, ZINB-GAM-Dose, ZINB-GAM-Int and ZINB-SPL models, respectively, for the three different cell-types plotted over 500 replicates.

Figure 3.13 Results of the simulation study, illustrating the performance of Model 4 (see Simulation Design) in 500 replicates for sample size 100. The columns correspond to the three different cell-types. Continuous red and blue lines and the shaded grey region represent the log-mean curve averaged across 500 random replicates, the true simulated curves, and the 95% pointwise confidence intervals, respectively.

Figure 3.14 Results of the simulation study, illustrating the performance of Model 4 (see Simulation Design) in 500 replicates for sample size 300. The columns correspond to the three different cell-types. Continuous red and blue lines and the shaded grey region represent the log-mean curve averaged across 500 random replicates, the true simulated curves, and the 95% pointwise confidence intervals, respectively.
Figure 3.15 Results of the simulation study, illustrating the performance of Model 4 (see Simulation Design) in 500 replicates for sample size 300. The columns correspond to the three different cell-types. Continuous cyan, peach, darkblue, darkgreen and red lines represent the true simulated curves and the estimated log-mean curve averaged across 500 random replicates for ZINB-SPL, ZINB-GAM-Dose, ZINB-GAM-Int and NB-GAM models, respectively.

Figure 3.16 Results of the simulation study, illustrating the estimated RMSE of Model 4 (see Simulation Design) in 500 replicates for sample size 300. The columns correspond to RMSE boxplots of the NB-GAM, ZINB-GAM-Dose, ZINB-GAM-Int and ZINB-SPL models, respectively, for the three different cell-types plotted over 500 replicates.

Figure 4.1 Euclidean distances (left, normalized to [0, 1]) and correlations (right) between expressions of gene pairs in curated datasets studied in Section 4.4. Values are calculated only for gene pairs that are connected in the ground truth GRNs and they are reported separately for activating and inhibitory edges. Only inhibitory edges are reported for VSC, since its GRN includes only inhibitory edges.

Figure 4.2 Performance of scSGL and state-of-the-art methods on curated datasets as measured by AUPRC for activating and inhibitory edges. The x-axis indicates the dropout ratio in the dataset.

Figure 4.3 Performance of various methods for synthetic datasets with varying number of genes (top row), dropout ratio (middle row) and number of cells (bottom row).

Figure 4.4 Scalability analysis of different methods. Run times of benchmarking methods are calculated using the BEELINE pipeline [3]. Run time of scSGL includes kernel construction and the optimization procedure. All methods are run on the same computer. Results of GENIE3 for 2000 genes are not reported due to its high run time.

Figure 4.5 Performance of methods for two real-world scRNAseq datasets. Inferred graphs are compared to three different gene regulatory databases.

Figure 4.6 The subnetworks of 24 lineage specific genes in hESC (A) and 19 well known marker genes in mESC (B). We report results of scSGL-r as it has the highest AUPRC ratio in Figure 4.5. For clarity, only those edges whose absolute edge weights fall into the top 1 percentile are shown. Node sizes are proportional to their degrees.

Figure 4.7 UpSet plot that shows the intersection between the top 1000 edges by scSGL with 3 kernels and benchmarking methods in hESC and mESC datasets.

Figure A.1 Principal components analysis of cell types identified in a real hepatic dose-response snRNAseq dataset. Each point represents a distinct cell and colors reflect the dose group.

Figure A.2 Comparison of fold-change distribution in simulated and real dose-response snRNAseq data where the log-normal mean (facLoc) and standard deviation (facScale) were varied as well as the percentage of differentially expressed genes and proportion of downregulated DE genes. A total of 5000 genes were simulated or sampled from real data and the fold-change for the highest dose group was calculated. The Kullback-Leibler Divergence (KLD) intrinsic discrepancy (ID) was used to evaluate the similarity in distributions.

Figure A.3 Benchmarking scores for data simulated using the Splatter [4] wrapper Splattdr with default initial parameters. A total of 4500 cells and 5000 genes were simulated across 9 dose groups with a probability of being differentially expressed of 10%, 50% of which were downregulated. (A) Ground truth was used to estimate the False Positive Rate (FPR), True Positive Rate (TPR), False Negative Rate (FNR), True Negative Rate (TNR), precision, balanced accuracy, and F1 score. Boxplots and whiskers represent values for 10 replicate simulations. (B) The area under the concordance curve (AUCC) was calculated as previously described (ref) for the 100 most significant genes (K = 100). Heatmap represents the pairwise AUCC for each DE analysis grouped by similarity.

Figure C.1 AUROC values of various methods for synthetic datasets with three different topologies (random, modular and hub) and varying number of genes (top row), dropout ratio (middle row) and number of cells (bottom row).

Figure C.2 EPR values of various methods for synthetic datasets with three different topologies (random, modular and hub) and varying number of genes (top row), dropout ratio (middle row) and number of cells (bottom row).

Figure C.3 AUROC and EPR ratios of methods for two real-world scRNAseq datasets. Inferred graphs are compared to three different gene regulatory databases.

Figure C.4 Performance of scSGL and state-of-the-art methods on curated datasets as measured by AUPRC ratios for activating and inhibitory edges. Each column corresponds to a synthetic network. Abbreviations: LI, linear; CY, cycle; LL, linear long; BF, bifurcating; BFC, bifurcating converging and TF, trifurcating.

Figure C.5 Edges detected using scSGL-r between 24 lineage marker genes of hESC at different time points of the differentiation process. Only edges whose absolute edge weights fall into the top 10 percent are shown. Edge thicknesses are proportional to their weights, and node sizes are proportional to their degrees.
CHAPTER 1
INTRODUCTION

1.1 Motivation for exploring single cell gene expression

Cells are the fundamental units of life: millions of cells in our body coordinate to perform the basic physiological functions essential for maintaining life. Yet several critical biological processes, such as the genesis, development and fate commitment of early-stage embryos, are determined by the biology of a single cell. The ability to investigate single-cell states has allowed exploration of critical biological systems with unprecedented resolution. It is now possible to profile individual cells instead of cell populations, which advances our fundamental understanding of intrinsic cellular heterogeneity and dynamics [5, 6, 7]. However, despite many improvements in high throughput sequencing, various technical factors including cell-cycle heterogeneity, library size differences, amplification bias, and low RNA capture per cell lead to high noise in single-cell RNAseq (scRNAseq) experiments. Recent technologies are capable of sequencing millions of cells, but often generate highly sparse expression datasets due to shallow sequencing. These issues result in substantial noise that often obscures the true biological signal and renders traditional statistical models unsuitable for the analysis of single cell datasets [8, 9].

The availability of gene expression data on a single cell level has led to a more complete picture of the average state of an organism across time and space. Denoting by $Y$ the vector of gene expression measurements and by $Z$ the set of covariates, significant efforts have been made in modelling $E(Y \mid Z)$ to study differential expression of genes, i.e., to determine whether the observed difference or change in expression levels of genes between experimental conditions is statistically significant. Compared to bulk genomic protocols, single cell gene expression allows inference on cellular heterogeneity.
Bulk experiments dealt with data averaged over cells, hence measuring cell-to-cell variation is unfamiliar ground for gene expression methods. Grouping cells into predefined cell types (assuming that cells within a cell-type group are independent replicates), estimating cell-type-specific models and quantifying the statistical differences between them can lead to important insights into cell-type-specific behaviour. One might be interested in modelling the functional form of $E(Y \mid Z)$ for different cell types and quantifying the variability between the estimated functions. Finally, one might desire inference on the co-expression of $Y$, which explains how the expression of one gene covaries with the others, a question studied in co-expression network estimation. This might help us understand how genes regulate each other, and render deeper biological insights.

1.2 Structure of single cell gene expression datasets

Unlike microarray and bulk RNAseq datasets, scRNAseq data exhibit excess zero values due to the low per-cell RNA input, biases in capture and amplification, transcriptional bursts, and other technical factors [10, 11, 12]. This lack of expression (zero values), due to a conflation of both biological and technical factors, results in an excessive number of zeroes leading to bimodal expression. Consequently, single cell models consider the gene expression distribution as a mixture of an unexpressed (zero) and a positively (non-zero) expressed population [13, 11, 14, 1]. The statistical model for $Y$ is formulated as

\[
\begin{aligned}
Y_{i,j} \mid R_{i,j} &\sim f(\theta), \\
Y_{i,j} &= 0 \ \text{with probability 1 when } R_{i,j} = 0,
\end{aligned}
\tag{1.1}
\]

where $f(\theta)$ denotes the distribution of gene $j$ in cell $i$, given that it is detectable. The indicator variable $R_{i,j}$ indicates the presence or absence of expression for gene $j$ in cell $i$. Most existing methods use the Poisson or the negative binomial distribution to model the read counts of individual genes directly [10, 13, 11, 15].
Another frequently used approach is to assume that $\log(Y_{i,j})$ follows a normal distribution, which leads to a more tractable mathematical theory than count distributions have [14, 16, 17, 1]. Additionally, bulk RNA-seq data have historically been modelled using both count and continuous distributions, and the logarithmic transformation used to make the data continuous is well accepted in the field of bulk genomics [18, 19, 20]. There is also an ongoing debate in the field of single cell genomics as to whether the excess zeros present in the data need to be modelled differently, assuming a sampling process different from the process used to model the true biological expression. Recent research has argued that count distributions such as the negative binomial, or a Poisson distribution with a Gamma prior on the mean, which yields a negative binomial marginal distribution, are enough for modelling the excess zeros in the expression data. What we observed in our experimental datasets is a very high percentage of zeroes, often close to 60–70%, and the experimental scientists were more convinced that such a severe percentage of zeroes was more a result of technical errors than of the actual biological process. We conducted several experiments to test this hypothesis [1] and concluded that a zero-induced distribution would best model our experimental data [21, 22]. Therefore, throughout the entire dissertation we use zero-induced distributions. However, all our proposed methods could be readily adapted to work without the zero-inflation part.

1.3 Dose Response Experiments

The central idea of toxicology lies in understanding the molecular mechanisms of action that make a treatment toxic. An important tool for studying the mechanisms of action is to understand the response of a living organism to a range of doses of the toxic substance.
The dose–response relationship, or dose response curve, describes the magnitude of the response of an organism as a function of exposure to a stressor (usually a toxin) after a certain exposure time [23]. Developing dose response models is crucial for determining safe, hazardous and beneficial levels of pollutants, drugs, foods and several other substances to which humans or other organisms are exposed [24]. These conclusions often form the basis for public policy. The U.S. Environmental Protection Agency (EPA) has developed extensive guidance and reports on dose–response modeling and assessment, as well as software [25].

The potency value derived from the dose response model is the benchmark dose (BMD). The BMD is the dose or concentration that produces a predetermined change in the response rate of an adverse effect. This predetermined change in response is called the benchmark response (BMR), and is generally taken to be one standard deviation of the response at zero dose [26]. Dose-dependent gene expression profiling, also known as genomic dose-response studies (GDRS), of drugs and chemicals has been proposed as an alternative to rodent bioassays for assessing human health risks. Application of single-cell RNA sequencing (scRNAseq) to the evaluation of chemicals, drugs, and food contaminants presents the rare opportunity of accounting for cellular heterogeneity in pharmacological and toxicological responses [27].

1.3.1 National Toxicology Program’s approach to genomic dose response modeling

The National Toxicology Program (NTP) proposes the use of the approach outlined below for both in vivo and in vitro genomic dose-response studies (GDRS) [26].
The steps in the process are: (1) designing a dose response experiment with a sufficient number of doses to effectively capture the shape of a dose response curve and accurately determine the benchmark dose (BMD) for all dose-responsive genes; (2) designing a statistical test of hypothesis to determine whether the modelled data show any effect of the treatment, which acts as a preliminary step for filtering out genes that are not dose responsive; (3) conducting a second trend test to identify genes that exhibit a biologically plausible response to the treatment; (4) fitting parametric dose response models derived from the Environmental Protection Agency (EPA) BMD software to identify a biological potency estimate, i.e., a benchmark dose (BMD), for each gene exhibiting a dose-related response to treatment; (5) grouping genes into predefined sets defined by gene ontologies and calculating the composite BMD of each gene set; and (6) providing a biological explanation for the selected set of genes and BMD estimates. As pointed out by NTP, following all the steps for GDRS modelling leads to consistent modelling and facilitates the use of genomic dose-response data in risk assessment. Assuming that the experimental process is adequate, I mainly focus on the statistical considerations of items 2-4.

1.3.1.1 Determining Adequate Signal in the Data

The first step of the GDRS protocol seeks to determine whether the signal in the data is adequate to model and likely to yield minimally reproducible findings [28]. Simply put, we are interested in finding whether a gene is dose responsive. Statistically, this can be formulated as an ANOVA problem. There is a population of interest for which there is a true quantitative outcome $Y$ for each of the $K$ levels of dose $D$. For gene $j$, the population outcomes for each group have mean parameters $\mu_{1,j}, \dots, \mu_{K,j}$ with no restrictions on the pattern of means.
The population variances of the outcome for each of the $K$ groups defined by the levels of the explanatory variable all have the same value, $\sigma_j^2$. We are interested in testing the null hypothesis that "nothing interesting is happening". For one-way ANOVA, we use $H_0 : \mu_1 = \mu_2 = \dots = \mu_K$, which states that all of the population means are equal, without restricting what the common value is. The alternative must include everything else, which can be expressed as "at least one of the $K$ population means differs from another". Use of a pre-filtering test to determine dose responsiveness helps to avoid modeling data with no statistically plausible signal. Modeling data with no statistically significant effect is likely to yield unreproducible results with highly inaccurate estimates of BMD values. Further, given the large number of genes (usually in the order of tens of thousands), it is an unnecessary computational burden to model genes with no statistically plausible signal.

An associated problem in gene expression studies is to compare two biological conditions in order to find differentially expressed (DE) genes. A gene is defined as DE if it is transcribed into different amounts of mRNA molecules per cell under the two conditions. However, since we do not observe the true levels of expression, carefully designed statistical tests help biologists understand to what extent a gene is DE. Currently, there are dozens of differential gene expression analysis (DGEA) approaches for scRNAseq data, developed based on differences in assumptions, statistical methodologies, and study designs [29, 30, 31, 32, 33, 34]. A recent comparison of 36 approaches demonstrated acceptable performance for common bulk RNAseq tools such as edgeR and limma-trend, and MAST for single cell experiments, as well as common statistical tests such as the Wilcoxon rank sum (WRS) and the t-test [32].
However, most methods have been developed primarily for two group comparisons, whereas our study designs consist of multiple groups. The use of two sample tests for multiple group study designs elevates the type I error rate, warranting further investigation of these methods for multiple group dose–response study designs [35]. Chapter 2 proposes the first multiple group statistical testing framework designed exclusively for single cell dose response designs. The test statistic is derived for both the frequentist and Bayesian set-ups and benchmarked against selected bulk and single cell DE analysis methods capable of accommodating multiple group designs. Further background and analysis are deferred to Chapter 2.

1.3.1.2 Filtering of Measured Features

The second step of the GDRS protocol applies a statistical trend test and an effect size to filter genes. Statistically, it assumes that the experimenter has a priori knowledge that the response to the treatment has a known direction and, in most cases, that the response increases in magnitude as the dose is increased. Therefore the statistical analysis is based on the assumption that the responses are monotonically ordered. Thus the test of hypothesis translates to $H_0 : \mu_1 = \mu_2 = \dots = \mu_K$ versus $H_a : \mu_1 \le \mu_2 \le \dots \le \mu_K$ with at least one strict inequality. Several statistical tests have been proposed for testing this monotonically ordered alternative hypothesis. The test recommended by the US EPA and NTP for identifying monotonic trends is Williams' trend test [36].

Williams' test is a step-down trend test for comparing several treatment levels with a zero control in a one-way ANOVA design with normally distributed errors of homogeneous variance. Let there be $K$ groups including the control, and let the zero dose level be indicated by $k = 1$ and the treatment levels by $k = 2, \dots, K$. Then the following $K - 1$ hypotheses are tested:

\[
\begin{aligned}
H_{0,K-1} &: \mu_1 = \mu_2 = \dots = \mu_K, & H_{a,K-1} &: \mu_1 \le \mu_2 \le \dots \le \mu_K,\ \mu_1 < \mu_K, \\
H_{0,K-2} &: \mu_1 = \mu_2 = \dots = \mu_{K-1}, & H_{a,K-2} &: \mu_1 \le \mu_2 \le \dots \le \mu_{K-1},\ \mu_1 < \mu_{K-1}, \\
&\ \ \vdots & &\ \ \vdots \\
H_{0,1} &: \mu_1 = \mu_2, & H_{a,1} &: \mu_1 < \mu_2.
\end{aligned}
\]

The first step is to determine the maximum likelihood (ML) estimates $\hat{\mu}_k$ under the alternative hypothesis that there is a response. These are obtained using the pool adjacent violators algorithm (PAVA) [37]. If the response is a decrease in mean, then the above formulation applies after reversing the inequality signs. The $k$th test statistic $\bar{t}_k$ is calculated as

\[
\bar{t}_k = \frac{\hat{\mu}_k - \bar{x}_1}{\sqrt{s^2 \left( 1/n_k + 1/n_1 \right)}},
\]

where $\bar{x}_1$ is the sample mean of the control group, $n_k$ is the number of observations in group $k$, and $s^2$ is an unbiased estimator of the error variance of $Y$ with $\nu$ degrees of freedom. The critical t-values are given in the tables of [36] for $\alpha = 0.05$ (one-sided) and are looked up according to the degrees of freedom ($\nu$) and the order number of the dose level. Regarding genomic dose-response modeling, NTP acknowledges the limitations of using a more traditional trend test such as Williams' test, which identifies only monotonic trends.

1.3.1.3 Dose Response Curve Estimation

The third step of the genomic dose-response analysis focuses on fitting dose-response curves to each gene that exhibits a response to treatment as determined by the filtering approach described above. To model the data, parametric models such as polynomial, linear, power, Hill and exponential (2, 3 and 4) dose-response models are fitted to the measured features. Equations describing the models are given in the table below. A detailed description of all the models can be found in the EPA's Benchmark Dose Software (BMDS) Version 3.2 user guide. The current GDRS modelling software [28] is only capable of modelling bulk RNA-sequencing datasets, with no statistical framework or recommendation existing for dose response modelling of single cell datasets. The bulk gene expression data are usually log transformed and presumed to have a normal distribution, and each model is run assuming constant variance.
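Before turning to the model table, the isotonic step behind Williams' test (Section 1.3.1.2) can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: it assumes that only the treatment-group means are pooled by PAVA while the control mean is left as is, that group sizes serve as weights, and that the function names and numeric inputs are hypothetical.

```python
from math import sqrt

def pava_increasing(means, weights):
    """Weighted pool-adjacent-violators: nondecreasing fit to ordered group means."""
    blocks = []  # each block holds [pooled mean, total weight, number of groups pooled]
    for m, w in zip(means, weights):
        blocks.append([m, w, 1])
        # Merge backwards while the last two blocks violate the ordering.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    fitted = []
    for m, _, c in blocks:
        fitted.extend([m] * c)
    return fitted

def williams_statistic(group_means, group_sizes, s2):
    """t-bar for the highest dose; index 0 is the zero-dose control (assumed not pooled)."""
    fitted = pava_increasing(group_means[1:], group_sizes[1:])
    n_top, n_ctrl = group_sizes[-1], group_sizes[0]
    return (fitted[-1] - group_means[0]) / sqrt(s2 * (1.0 / n_top + 1.0 / n_ctrl))

# Hypothetical group means for a control and three doses, three replicates each.
means = [10.0, 12.0, 11.0, 14.0]
sizes = [3, 3, 3, 3]
fitted = pava_increasing(means[1:], sizes[1:])  # first two dose groups violate the order and are pooled
t_bar = williams_statistic(means, sizes, s2=1.5)
```

In practice the resulting statistic would be compared against the tabulated critical values of [36]; that lookup is not reproduced here.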
Table 1.1 Parametric models for genomic dose response analysis

Model                  Formula
Linear                 $\mu(\mathrm{dose}) = \gamma + \beta\,\mathrm{dose}$
Polynomial (order-q)   $\mu(\mathrm{dose}) = \gamma + \beta_1\,\mathrm{dose} + \beta_2\,\mathrm{dose}^2 + \dots + \beta_q\,\mathrm{dose}^q$
Hill                   $\mu(\mathrm{dose}) = \gamma + (v \cdot \mathrm{dose}^n)/(k^n + \mathrm{dose}^n)$
Exponential 2 or 3     $\mu(\mathrm{dose}) = a \cdot \exp(\mathrm{sign}(b \cdot \mathrm{dose}^d))$
Exponential 3 or 4     $\mu(\mathrm{dose}) = a \cdot (c - (c - 1) \cdot \exp(\mathrm{dose}^d))$
Power                  $\mu(\mathrm{dose}) = \gamma + \beta\,\mathrm{dose}^\delta$

Let us denote a sample of $n$ independent and identically distributed (IID) pairs of variables, $B = \{(Z_i, Y_i)\}$, $i = 1, \dots, n$, as the data. $Z_i$ can be further segmented as $Z_i = \{(D_i, X_i)\}$, where $D_i$ represents the dose vector and $X_i$ represents the vector of other confounding covariates such as age, gender, etc. The main interest of dose response studies lies in modelling $E(Y_i)$ as a function of the dose vector, i.e., $E(Y_i \mid D_i) = \zeta(D_i, \theta)$, where $\theta$ is the vector of unknown parameters. In general, methods for estimating dose response can be classified as parametric or nonparametric. Parametric methods assume a model $\zeta(D_i, \theta)$ for the dose response (DR) curve. A considerable amount of statistical methodology exists for parametric modelling of dose-response studies, employing parametric models such as the Logistic, Exponential, and Gompertz models as well as others [38], and the US EPA [28] gives guidelines on how to employ these methods. These models are generally fitted using standard techniques such as maximum likelihood (ML) or restricted maximum likelihood (RML), and their functional forms are monotonic, allowing for tractable estimation of the relevant BMDs. It is well known that if a parametric model is correctly specified then the ML/RML estimators are efficient. However, in many cases it is difficult to correctly specify the parametric form of the dose–response curve, because the biological mechanism of drug action or toxicity may be complex and the form of the dose–response curve is unknown a priori [39].
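To illustrate why a monotone parametric form makes the BMD tractable, the sketch below computes the BMD of a Hill curve from Table 1.1 by bisection and compares it with the closed form obtained by solving $\mu(\mathrm{BMD}) - \mu(0) = \mathrm{BMR}$, namely $\mathrm{BMD} = k\,\{\mathrm{BMR}/(v - \mathrm{BMR})\}^{1/n}$. The parameter values, the BMR of 0.5 (standing in for one control standard deviation) and the function names are hypothetical, and the curve is assumed to be increasing.

```python
def hill(dose, gamma, v, k, n):
    """Hill model from Table 1.1: gamma + v * dose**n / (k**n + dose**n)."""
    return gamma + v * dose**n / (k**n + dose**n)

def bmd_bisection(curve, bmr, upper, tol=1e-10):
    """Dose in [0, upper] at which an increasing curve has risen by bmr above its value at 0."""
    target = curve(0.0) + bmr
    if curve(upper) < target:
        return None  # the benchmark response is not reached within the dose range
    lo, hi = 0.0, upper
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if curve(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical fitted parameters and benchmark response.
gamma, v, k, n = 1.0, 2.0, 3.0, 2.0
bmr = 0.5
bmd_numeric = bmd_bisection(lambda d: hill(d, gamma, v, k, n), bmr, upper=30.0)
bmd_closed = k * (bmr / (v - bmr)) ** (1.0 / n)
```

The bisection routine works for any increasing curve, which is why monotonicity matters: a non-monotone curve may cross the benchmark level more than once, and the BMD is then no longer uniquely defined by this equation.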
When the parametric model is misspecified, the corresponding curve estimate may be severely biased. In addition, fitting extremely flexible, high-order polynomial models to match difficult non-monotonic curve shapes can lead to severe overfitting. The standard workaround is to fit multiple parametric models, each of which may fit the data reasonably well but which together produce a range of BMD estimates reflecting model uncertainty in the estimation process. Accounting for this uncertainty using model averaging (MA) has received recent attention in the literature [40]. However, despite simulation studies suggesting that model-averaged BMD estimates have low bias and nominal coverage properties, these estimates may fail to adequately describe the true uncertainty for DR curves on the edge of the MA model space, resulting in inaccurate inference. Further, as the MA results are available for a limited number of parametric forms, models that make fewer assumptions on parametric forms may allow for better estimation of the dose response curves [39].

To enhance the robustness of the estimation of the dose–response curve, many nonparametric methods have been proposed. [41] proposed maximum likelihood estimation procedures under the assumption that the dose response curve is sigmoid and non-decreasing. [42] proposed a method for non-parametric estimation of the dose response curve under monotonicity constraints using B-spline regression. [43] used kernel estimators to obtain estimates of the dose–response curve under general shape restrictions. In the Bayesian setup, several approaches have been proposed for non-parametric estimation of dose response curves under monotonicity constraints. From a Bayesian perspective, one specifies a suitable prior on the regression function that induces monotonicity, and inference is then based on the posterior distribution.
[44] employed an additive model with a prior imposed on the slope of the piecewise-linear functions. [45] adopted mixture modelling of shifted and scaled probability distribution functions. [46] proposed a Gaussian process model with a posterior projection approach for shape-constrained curves. In contrast to the parametric methods, all the non-parametric approaches are flexible and learn the shape of the dose response curves from the data. To the best of our knowledge, none of the existing literature has considered building a non-parametric regression model for both unconstrained and monotonicity-constrained estimation of dose response curves for single cell experiments while accounting for technical confounders. Incorporating such a flexible modelling structure will allow for more accurate modeling of a variety of non-monotonic responses. Further, a flexible nonparametric modeling approach that describes the diversity of dose-response behaviors observed in genomic dose-response studies will lead to the estimation of more accurate BMDs.

1.4 Covariation Analysis

A gene co-expression network (GCN) is an undirected graph, where nodes correspond to different genes, and edges connecting the nodes denote the co-expression relationships between genes. GCNs help reveal the functional relationships between genes and help infer and annotate the functions of unknown genes. GCN reconstruction attempts to infer this co-expression network from high-throughput data using statistical and computational approaches. Multiple methods encompassing varying mathematical concepts have been proposed during the last decade to infer GCNs using gene expression data from bulk population sequencing technologies, which accumulate expression profiles from all cells in a tissue.
These methods can be broadly classified into two groups: the first group infers a static GCN, considering the steady state of gene expression, while the second group uses temporal measurements to capture the expression profile of the genes in a dynamic process. A thorough evaluation of the static and dynamic models used in bulk GCN reconstruction can be found in [47, 48]. In the static group, several computational approaches, including correlation networks [49], Gaussian graphical models [50], Bayesian networks [51], regression analyses [52, 53], and information theoretical approaches [54, 55], have been used for network inference from population-level data. Similarly, several models have been proposed for the analysis of population-level dynamic gene expression data [56, 57, 58]. In this dissertation we will mainly focus on estimating a static co-expression network. Mathematically, we are interested in determining the relationship between $Y_i$ and $Y_j$. The first step is to construct a symmetric adjacency matrix $A$, where $A_{i,j}$ is a weighted adjacency score in the range from 0 to 1 between genes $i$ and $j$. $A_{i,j}$ measures the level of association between the gene expression vectors $Y_i$ and $Y_j$. Clustering methods can then be applied to search for gene modules based on a dissimilarity matrix derived from $A$ [49]. The modules can serve to understand functional relationships between known disease genes and candidate genes. Gene modules can also be used to detect regulatory genes and study the regulatory mechanisms in various organisms [59]. Further background is deferred to Chapter 4.

CHAPTER 2
BAYESIAN SINGLE CELL RNASEQ DIFFERENTIAL GENE EXPRESSION TEST FOR DOSE RESPONSE STUDY DESIGNS

2.1 Single Cell Dose Response Experiments

Single-cell transcriptomics enables researchers to investigate homeostasis, development and disease at unprecedented cellular resolution [60, 6, 61, 62, 63].
As with any innovative technology, diverse tools soon follow to address specific applications and unique challenges. Currently, there are dozens of differential gene expression analysis (DGEA) approaches for single-cell RNAseq (scRNAseq) data, developed based on differences in assumptions, statistical methodologies, and study designs [29, 30, 31, 32, 33, 34]. A recent comparison of 36 approaches demonstrated acceptable performance for common bulk RNAseq tools such as edgeR and limma-trend, and MAST for snRNAseq, as well as common statistical tests such as the Wilcoxon rank sum (WRS) and the t-test [32]. However, most methods have been developed primarily for two group comparisons, whereas study designs typical of pharmacology and toxicology experiments, such as dose–response studies, consist of multiple groups. The use of two sample tests for multiple group study designs elevates the type I error rate, warranting further investigation of these methods for multiple group dose–response study designs [35].

Dose–response studies are used to derive efficacy and/or safety margins such as the effective dose and the point of departure (POD). Significant efforts by the toxicology and regulatory communities have suggested acute (< 14 days) and sub-acute (14–28 days) transcriptomic studies as viable alternatives to the current standard 2-year rodent bioassay, significantly reducing the time and resources needed to assess risk [64, 65, 26]. Gene expression profiling at single-cell resolution could further support such evaluations by identifying cell-specific dose-dependent responses indicative of an adverse event. The U.S. National Toxicology Program (NTP) recently reported that a robust DGEA approach is essential to deriving biologically relevant PODs [26]. However, the inclusion of false positives can produce less conservative POD estimates and potentially lead to incorrect classification of the mode of action (MoA), which highlights the importance of controlling type I error rates [66, 67].

Unlike microarray and bulk RNAseq datasets, single-cell RNAseq (scRNAseq) data exhibit excess zero values due to the low per-cell RNA input, biases in capture and amplification, transcriptional bursts, and other technical factors [12]. This lack of expression (zero values), due to a conflation of both biological and technical factors, results in an excessive number of zeroes in an otherwise continuous measure [14]. Therefore, traditional tests of differential gene expression, based on the assumption of a normal distribution, fail to correctly model the bimodality of single cell gene expression [16]. Consequently, scRNAseq test methods usually consider the gene expression distribution as a mixture of an unexpressed (zero) and a positively (non-zero) expressed population [14, 16, 68]. For example, the Seurat Bimod approach tests for differential gene expression using a likelihood ratio test designed for this mixture population. MAST extends the Seurat Bimod test to a two-part generalized linear model structure capable of incorporating covariates [14, 16]. Given the improved performance of MAST [32, 14, 16], we hypothesized that multiple group tests developed assuming the same distributional framework would be most favorable for dose–response study designs. Furthermore, a Bayesian approach which considers prior knowledge is anticipated to minimize type I error rates [69, 70].

The aim of the presented study is to evaluate the performance of existing and novel DGEA test methods on dose–response scRNAseq datasets. To reduce the rate of false positives we propose a novel, multiplicity-corrected, Bayesian multiple group test (scBT) designed exclusively for DGEA of dose–response scRNAseq data.
Two other fit-for-purpose frequentist multiple group tests are also examined: (i) a multiple group extension of the Seurat Bimod test and (ii) a simple extension of test (i) to a generalized linear model framework. Existing and proposed methods are benchmarked on simulated and real experimental dose–response datasets. Using simulated datasets, we were able to investigate the influence of various parameters such as the number of cells, and to illustrate how using different test methods can aid in gaining biological insight into the role of individual cell types in the pathophysiological consequences of exposure.

2.2 Materials and Methods

2.2.1 Animal handling and treatment

Male C57BL/6 mice aged postnatal day (PND) 25 were obtained from Charles River Laboratories (Kingston, NY) and were housed and treated as previously described [71]. Mice were housed in Innovive cages (San Diego, CA) with ALPHA-dri bedding (Shepherd Specialty Papers, IL) at 23 °C, 30-40% relative humidity, and a 12:12 h light:dark cycle. Aquavive water (Innovive) and Harlan Teklad 22/5 Rodent Diet 8940 (Madison, WI) were provided ad libitum. On PND 29, randomly assigned mice were gavaged at Zeitgeber time (ZT) 0 with 0.1 mL sesame oil vehicle (Sigma-Aldrich, St. Louis, MO) or 0.01, 0.03, 0.1, 0.3, 1, 3, 10 or 30 µg/kg TCDD every 4 days for 28 days (7 total administered doses). At day 28, mice were euthanized by CO2 asphyxiation and livers were immediately flash frozen in liquid nitrogen and stored at −80 °C. All animal procedures were approved by the Michigan State University (MSU) Institutional Animal Care and Use Committee (IACUC), and reporting of in vivo experiments follows the Animal Research: Reporting of In Vivo Experiments (ARRIVE) [72] and Minimum Information about Animal Toxicology Experiments (MIATE) guidelines (https://fairsharing.org/FAIRsharing.wYScsE).

2.2.2 Real scRNAseq and snRNAseq datasets

Hepatic single-nuclei RNA-sequencing (snRNAseq) was performed as previously described using the 10x Genomics Chromium Single Cell 3′ v3.1 kit (10x Genomics, Pleasanton, CA) [73]. Briefly, nuclei were isolated using EZ Lysis Buffer (Sigma-Aldrich), homogenized with a disposable Dounce homogenizer, washed, and filtered using a 70-µm cell strainer. The nuclei pellet was resuspended in buffer containing DAPI (10 µg/ml), filtered using a 40-µm strainer, and immediately sorted using a BD FACSAria IIu (BD Biosciences, San Jose, CA) at the MSU Pharmacology and Toxicology Flow Cytometry Core (facs.iq.msu.edu/). Sequencing (150-bp paired end) was performed at a depth of 50,000 reads/cell using a NovaSeq6000 at Novogene (Beijing, China). CellRanger v3.0.2 (10x Genomics) was used to align reads to mouse gene models (mm10, release 93) including introns and exons, to consider both pre-mRNA and mature mRNA gene models. Seurat was used to integrate and log-normalize expression data [74]. The data are available on the Gene Expression Omnibus (GEO) under accession ID GSE184506, and R package versions are listed in Supplementary Table S1.

Additional real datasets were publicly available. Hepatic whole-cell data generated using the 10x Genomics platform were obtained from GEO (GSE129516) [63]. Hepatic single-nuclei data, processed in the same way as the dose–response data, for control and high dose TCDD treatment (0 and 30 µg/kg) were obtained from GEO (GSE148339). Peripheral blood mononuclear cell (PBMC) data, also generated using the 10x Genomics platform and Seurat, were obtained from the SeuratData R package [74].

Gene set enrichment analysis of experimental data was performed using the fgsea v1.14 R package on gene lists sorted by significance values (e.g. P-value). Gene sets from BIOCARTA, KEGG, PANTHER and WIKIPATHWAYS were obtained from the Gene Set Knowledgebase (GSKB; http://ge-lab.org/gskb/) and filtered for gene sets containing 15–250 genes.
Gene sets were agglomerated based on overlap of gene membership, and only those showing ≥ 50% overlap were considered similar for subsequent network analyses. Visualization and calculation of measures of centrality were performed using igraph v1.2.7. Gene sets were considered enriched when the adjusted P-value was < 0.05.

2.2.3 Dose-response data simulation

To simulate dose-response scRNAseq datasets we developed a wrapper for the Splatter R package [75]. Splatter simulates counts using parameters estimated from real data to set the mean expression, variance, and outlier probability. Other parameters, such as the number of cells, the number of genes, the probability of being differentially expressed, the mean fold-change of DE genes (location) and the standard deviation of the fold-change of DE genes (scale), were manually assigned to best reflect real data. The wrapper (SplattDR) leverages the group simulation feature of Splatter by applying a multiplicative factor estimated using the dose-response models in Table 2.1, which are based on the US EPA Benchmark Dose Software [28]. The SplattDR R package is available at github.com/zacharewskilab/splattdr.

Table 2.1 Dose-response models for simulation of scRNAseq data

Model                  Formula
Hill                   $\mu(\mathrm{dose}) = \gamma + (v \cdot \mathrm{dose}^n)/(k^n + \mathrm{dose}^n)$
Exponential 2 or 3     $\mu(\mathrm{dose}) = a \cdot \exp(\mathrm{sign}(b \cdot \mathrm{dose}^d))$
Exponential 3 or 4     $\mu(\mathrm{dose}) = a \cdot (c - (c - 1) \cdot \exp(\mathrm{dose}^d))$
Power                  $\mu(\mathrm{dose}) = \gamma + \beta\,\mathrm{dose}^\delta$

2.2.4 Single cell RNA-seq Hurdle Model

We model the log-normalized gene expression matrix using a hurdle distribution, wherein the presence of gene expression in a cell is assumed to follow a Bernoulli distribution and, conditional on a cell expressing the gene, the log-normalized expression level is assumed to follow a Gaussian distribution [14]. We denote by $Y_{i,j}$ the log-normalized expression value of gene $j$ in cell $i$, for $i = 1, \dots, n$ and $j = 1, \dots, p$. To characterize the bimodal properties of single cell data, for a given cell, a gene is defined to be either positively expressed or undetected.
Define $R_{i,j} = I[Y_{i,j} > 0]$ to be the indicator variable denoting the presence or absence of expression for gene $j$ in cell $i$. Following [14], the log-normalized gene expression values are modeled as follows:

\[
\begin{aligned}
Y_{i,j} \mid R_{i,j} = 1 &\sim \mathrm{Normal}(\mu_j, \sigma_j^2), \\
Y_{i,j} &= 0 \ \text{with probability 1 when } R_{i,j} = 0, \\
R_{i,j} &\sim \mathrm{Bernoulli}(\omega_j),
\end{aligned}
\tag{2.1}
\]

where $\mu_j$ and $\sigma_j^2$ denote the mean and variance of the gene expression level, conditional on the gene being expressed, and $\omega_j$ denotes the rate of expression of gene $j$ across all cells.

2.2.4.1 Hypothesis Formulation

We now assume that our data have been collected under $K$ conditions (doses), and denote the data by $D_{k,o} \equiv \{(Y_{k,i,j}, R_{k,i,j}),\ i = 1, \dots, n_k\}$. The underlying populations for the sample data $D_{k,o}$, for dose groups $k = 1, 2, \dots, K$, are assumed to be identified by the parameters $(\mu_{k,j}, \sigma_j^2, \omega_{k,j})$. The aim of this study is to test for differences in gene expression patterns between the dose groups. Traditionally one would perform an ANOVA test to detect changes in mean across groups for samples with continuous measurements. However, to account for the bimodality of the single cell gene expression distribution, the test should detect changes in the means $\mu_{k,j}$ and the expression rates $\omega_{k,j}$ simultaneously, as both could drive differential gene expression. Therefore we define

\[
H_0 : \mu_{1,j} = \mu_{2,j} = \dots = \mu_{K,j} = \mu_j \ \text{ and } \ \omega_{1,j} = \omega_{2,j} = \dots = \omega_{K,j} = \omega_j
\tag{2.2}
\]

versus the alternative

\[
H_a : \mu_{k,j} \text{ is different for at least one } k \text{ and } \omega_{k,j} \text{ is different for at least one } k, \quad k = 1, \dots, K.
\]

2.2.4.2 Single cell Bayesian Hurdle model Analysis (scBT)

Given the single cell RNA-seq hurdle model structure, we assume that a priori, given $\sigma_j^2$,

\[
\mu_{k,j} \sim \mathrm{Normal}(m_{k,0}, \tau_{k,\mu}\sigma_j^2), \qquad
\sigma_j^2 \sim \mathrm{IG}(a_\sigma, b_\sigma), \qquad
\omega_{k,j} \sim \mathrm{Beta}(a_{k,\omega}, b_{k,\omega}),
\]

where IG is the inverse gamma distribution with shape $a_\sigma$ and scale $b_\sigma$, and $m_{k,0}, \tau_{k,\mu}, a_\sigma, b_\sigma, a_{k,\omega}, b_{k,\omega}$ are the hyperparameters.
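As a concrete illustration of the hurdle model in Equation (2.1) combined with the priors above, the following minimal Python sketch draws the data of one gene for K dose groups. It is not part of the scBT package. The function name `simulate_gene`, the default hyperparameter values and the common prior mean `m0` are hypothetical choices made only for illustration, and the inverse gamma draw assumes the density is proportional to x^(-a-1) exp(-1/(b x)).

```python
import random


def simulate_gene(n_per_group, m0=2.0, tau=1.0, a_sigma=3.0, b_sigma=1.0,
                  a_omega=2.0, b_omega=2.0, seed=0):
    """Draw one gene's data for K dose groups from the priors and the hurdle model.

    Returns one (y, r) pair per dose group, where r[i] is the expression
    indicator of cell i and y[i] its log-normalized value (0.0 when r[i] == 0).
    All hyperparameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    # sigma^2 ~ IG(a_sigma, b_sigma), drawn as the reciprocal of a Gamma variate
    # with shape a_sigma and scale b_sigma (assumed parameterization).
    sigma2 = 1.0 / rng.gammavariate(a_sigma, b_sigma)
    groups = []
    for n_k in n_per_group:
        # Group-specific mean and expression rate drawn from their priors.
        mu_k = rng.gauss(m0, (tau * sigma2) ** 0.5)
        omega_k = rng.betavariate(a_omega, b_omega)
        # Bernoulli hurdle first, then a Gaussian value for expressed cells only.
        r = [1 if rng.random() < omega_k else 0 for _ in range(n_k)]
        y = [rng.gauss(mu_k, sigma2 ** 0.5) if r_i else 0.0 for r_i in r]
        groups.append((y, r))
    return groups
```

Because the Gaussian component is unbounded, a simulated value for an expressed cell can occasionally be non-positive; the indicator is therefore returned separately rather than recomputed from the sign of y.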
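Given the large number of gene-wise model fits arising from a single cell experiment, there is a pressing need to allow for a parallel structure whereby the same model is fitted to each gene. The prior distributions on the parameters describe how the unknown coefficients $\mu_{k,j}$, $\omega_{k,j}$ and $\sigma_j^2$ vary across the genes and the dose groups while allowing for information borrowing between the genes. Based on the model assumptions, we propose a Bayesian test for simultaneously testing the differences in mean gene expression and dropout proportions as formulated in Section 2.2.4.1. Under the null hypothesis the marginal likelihood is written as

\[
\begin{aligned}
L_{H_0,j} ={}& \int\!\!\int\!\!\int \prod_{k=1}^{K}\prod_{i=1}^{n_k}
\left[\frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\left\{-\frac{(Y_{k,i,j}-\mu_j)^2}{2\sigma_j^2}\right\}\omega_j\right]^{R_{k,i,j}}
(1-\omega_j)^{1-R_{k,i,j}} \\
&\times \pi(\mu_j \mid \sigma_j^2)\,\pi(\sigma_j^2)\,\pi(\omega_j)\; d\mu_j\, d\sigma_j^2\, d\omega_j \\
={}& \frac{1}{(2\pi)^{\left(\sum_{k=1}^{K}\sum_{i=1}^{n_k} R_{k,i,j}\right)/2}}
\times \frac{1}{\sqrt{1+\tau_\mu \sum_{k=1}^{K}\sum_{i=1}^{n_k} R_{k,i,j}}} \\
&\times \frac{\Gamma\!\left(a_\sigma + \left(\sum_{k=1}^{K}\sum_{i=1}^{n_k} R_{k,i,j}\right)/2\right)}{\Gamma(a_\sigma)\, b_\sigma^{a_\sigma}}
\times \frac{1}{\left(1/b_\sigma + A_{tot}/2\right)^{a_\sigma + \left(\sum_{k=1}^{K}\sum_{i=1}^{n_k} R_{k,i,j}\right)/2}} \\
&\times \frac{\mathrm{Beta}\!\left(a_\omega + \sum_{k=1}^{K}\sum_{i=1}^{n_k} R_{k,i,j},\; b_\omega + \sum_{k=1}^{K} n_k - \sum_{k=1}^{K}\sum_{i=1}^{n_k} R_{k,i,j}\right)}{\mathrm{Beta}(a_\omega, b_\omega)},
\end{aligned}
\tag{2.3}
\]

where

\[
A_{tot} = \left\{\sum_{k=1}^{K}\sum_{i=1}^{n_k} R_{k,i,j} Y_{k,i,j}^2 + \frac{m_0^2}{\tau_\mu}\right\}
- \left\{\sum_{k=1}^{K}\sum_{i=1}^{n_k} R_{k,i,j} + \frac{1}{\tau_\mu}\right\}^{-1}
\times \left\{\sum_{k=1}^{K}\sum_{i=1}^{n_k} R_{k,i,j} Y_{k,i,j} + \frac{m_0}{\tau_\mu}\right\}^{2}.
\]

Under the alternative hypothesis we compute the marginal likelihood without any restriction on the $K$ means $\mu_{k,j}$ and the dropout parameters $\omega_{k,j}$, $k = 1, 2, \dots, K$. In particular, we assume that $\mu_{k,j} \sim \mathrm{Normal}(m_{k,0}, \tau_{k,\mu}\sigma_j^2)$, $\sigma_j^2 \sim \mathrm{IG}(a_\sigma, b_\sigma)$ and $\omega_{k,j} \sim \mathrm{Beta}(a_{k,\omega}, b_{k,\omega})$, $k = 1, 2, \dots, K$.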
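Now, the marginal likelihood under the alternative hypothesis is given by

\[
\begin{aligned}
L_{H_a,j} ={}& \int \cdots \int \prod_{k=1}^{K}\prod_{i=1}^{n_k}
\left[\frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\left\{-\frac{(Y_{k,i,j}-\mu_{k,j})^2}{2\sigma_j^2}\right\}\omega_{k,j}\right]^{R_{k,i,j}}
(1-\omega_{k,j})^{1-R_{k,i,j}} \\
&\times \left\{\prod_{k=1}^{K}\pi(\mu_{k,j})\,\pi(\omega_{k,j})\right\}\pi(\sigma_j^2)
\left\{\prod_{k=1}^{K} d\mu_{k,j}\, d\omega_{k,j}\right\} d\sigma_j^2
\end{aligned}
\tag{2.4}
\]

\[
\begin{aligned}
={}& \frac{1}{(2\pi)^{\left(\sum_{k=1}^{K}\sum_{i=1}^{n_k} R_{k,i,j}\right)/2}}
\times \frac{1}{\prod_{k=1}^{K}\sqrt{1+\tau_{k,\mu}\sum_{i=1}^{n_k} R_{k,i,j}}} \\
&\times \frac{\Gamma\!\left(a_\sigma + \sum_{k=1}^{K}\sum_{i=1}^{n_k} R_{k,i,j}/2\right)}{\Gamma(a_\sigma)\, b_\sigma^{a_\sigma}}
\times \frac{1}{\left(1/b_\sigma + \sum_{k=1}^{K} A_k/2\right)^{a_\sigma + \sum_{k=1}^{K}\sum_{i=1}^{n_k} R_{k,i,j}/2}} \\
&\times \prod_{k=1}^{K}\frac{\mathrm{Beta}\!\left(a_{k,\omega} + \sum_{i=1}^{n_k} R_{k,i,j},\; b_{k,\omega} + n_k - \sum_{i=1}^{n_k} R_{k,i,j}\right)}{\mathrm{Beta}(a_{k,\omega}, b_{k,\omega})}.
\end{aligned}
\tag{2.5}
\]

Now, under the assumption that $a_\omega = a_{k,\omega}$, $b_\omega = b_{k,\omega}$, $\tau_\mu = \tau_{k,\mu}$ for $k = 1, 2, \dots, K$, and using Stirling's approximation $r! \sim \sqrt{2\pi r}\,(r/e)^r$, we have

\[
\begin{aligned}
\frac{L_{H_0,j}}{L_{H_a,j}} ={}& \left(\sqrt{2\pi}\,e\right)^{1-K}
\times \frac{\prod_{k=1}^{K}\sqrt{1+\tau_{k,\mu}\sum_{i=1}^{n_k} R_{k,i,j}}}{\sqrt{1+\tau_\mu\sum_{k=1}^{K}\sum_{i=1}^{n_k} R_{k,i,j}}}
\times \left(\frac{1/b_\sigma + \sum_{k=1}^{K} A_k/2}{1/b_\sigma + A_{tot}/2}\right)^{a_\sigma + \frac{1}{2}\sum_{k=1}^{K}\sum_{i=1}^{n_k} R_{k,i,j}} \\
&\times \sqrt{\frac{\left(a_\omega + \sum_{k=1}^{K}\sum_{i=1}^{n_k} R_{k,i,j} - 1\right)\left(b_\omega + \sum_{k=1}^{K} n_k - \sum_{k=1}^{K}\sum_{i=1}^{n_k} R_{k,i,j} - 1\right)}{a_\omega + b_\omega + \sum_{k=1}^{K} n_k - 1}} \\
&\times \frac{\left(a_\omega + \sum_{k=1}^{K}\sum_{i=1}^{n_k} R_{k,i,j} - 1\right)^{\left(a_\omega + \sum_{k=1}^{K}\sum_{i=1}^{n_k} R_{k,i,j} - 1\right)}}{\left(a_\omega + b_\omega + \sum_{k=1}^{K} n_k - 1\right)^{\left(a_\omega + b_\omega + \sum_{k=1}^{K} n_k - 1\right)}} \\
&\times \left(b_\omega + \sum_{k=1}^{K} n_k - \sum_{k=1}^{K}\sum_{i=1}^{n_k} R_{k,i,j} - 1\right)^{\left(b_\omega + \sum_{k=1}^{K} n_k - \sum_{k=1}^{K}\sum_{i=1}^{n_k} R_{k,i,j} - 1\right)} \\
&\times \prod_{k=1}^{K}\sqrt{\frac{a_\omega + b_\omega + n_k - 1}{\left(a_\omega + \sum_{i=1}^{n_k} R_{k,i,j} - 1\right)\left(b_\omega + n_k - \sum_{i=1}^{n_k} R_{k,i,j} - 1\right)}} \\
&\times \prod_{k=1}^{K}\left\{\frac{\left(a_\omega + b_\omega + n_k - 1\right)^{\left(a_\omega + b_\omega + n_k - 1\right)}}{\left(a_\omega + \sum_{i=1}^{n_k} R_{k,i,j} - 1\right)^{\left(a_\omega + \sum_{i=1}^{n_k} R_{k,i,j} - 1\right)}}
\times \frac{1}{\left(b_\omega + n_k - \sum_{i=1}^{n_k} R_{k,i,j} - 1\right)^{\left(b_\omega + n_k - \sum_{i=1}^{n_k} R_{k,i,j} - 1\right)}}\right\} \\
&\times \left\{\frac{\Gamma(a_\omega)\Gamma(b_\omega)}{\Gamma(a_\omega + b_\omega)}\right\}^{(K-1)}.
\end{aligned}
\]

The Bayes factor is then defined as

\[
BF_{01,j} = \frac{L_{H_0,j}}{L_{H_a,j}} \times \frac{\pi(H_a)}{\pi(H_0)},
\tag{2.6}
\]

where $\pi(H_a)$ and $\pi(H_0)$ are the prior probabilities for the alternative and null model, respectively. The hyperparameters are obtained by maximising the marginal likelihood under the null and the alternative hypothesis. Detailed derivations of the likelihood function and the Bayes factor are provided in the Supplementary Material. Using the test of hypothesis described in Equation (2.2), scBT conducts a test of DE for each gene independently.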
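The closed forms in Equations (2.3) and (2.5) can be evaluated on the log scale to avoid numerical underflow. The sketch below is a minimal illustration under stated assumptions, not the scBT implementation: the hyperparameters are taken as given and common to all groups (in the text they are obtained by maximising the marginal likelihood), and the group-wise term A_k, which is not written out in this section, is assumed to be the analogue of A_tot computed from the expressed cells of group k alone. All function names are hypothetical.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def a_term(pos, m0, tau):
    """A_tot-style quadratic term for a list of positive expression values."""
    s1 = sum(pos)
    s2 = sum(v * v for v in pos)
    return s2 + m0 ** 2 / tau - (s1 + m0 / tau) ** 2 / (len(pos) + 1.0 / tau)


def log_marginals(groups, m0, tau, a_s, b_s, a_w, b_w):
    """Return (log L_H0, log L_Ha) for one gene.

    groups holds one list of log-normalized values per dose group; zeros are
    treated as undetected cells (R = 0).
    """
    pos = [[v for v in g if v > 0] for g in groups]
    e = [len(p) for p in pos]          # expressed cells per group
    n = [len(g) for g in groups]       # all cells per group
    big_n, n_tot = sum(e), sum(n)
    # Factors shared by Equations (2.3) and (2.5).
    common = (-0.5 * big_n * math.log(2 * math.pi)
              + math.lgamma(a_s + big_n / 2) - math.lgamma(a_s)
              - a_s * math.log(b_s))
    a_tot = a_term([v for p in pos for v in p], m0, tau)
    log_null = (common - 0.5 * math.log(1 + tau * big_n)
                - (a_s + big_n / 2) * math.log(1 / b_s + a_tot / 2)
                + log_beta(a_w + big_n, b_w + n_tot - big_n) - log_beta(a_w, b_w))
    # Assumption: A_k is the group-wise analogue of A_tot.
    a_sum = sum(a_term(p, m0, tau) for p in pos)
    log_alt = (common - 0.5 * sum(math.log(1 + tau * e_k) for e_k in e)
               - (a_s + big_n / 2) * math.log(1 / b_s + a_sum / 2)
               + sum(log_beta(a_w + e_k, b_w + n_k - e_k) - log_beta(a_w, b_w)
                     for e_k, n_k in zip(e, n)))
    return log_null, log_alt


def posterior_null_prob(log_null, log_alt, prior_null=0.5):
    """p(H0 | D) = [1 + 1/BF01]^(-1), with BF01 as written in Equation (2.6)."""
    log_bf01 = log_null - log_alt + math.log((1 - prior_null) / prior_null)
    if log_bf01 >= 0:
        return 1.0 / (1.0 + math.exp(-log_bf01))
    z = math.exp(log_bf01)
    return z / (1.0 + z)
```

With a single dose group the two marginal likelihoods coincide, which gives a quick sanity check of the two formulas against each other.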
To control for multiplicity we adopt the FDR correction approach discussed in [76]. The rejection threshold is estimated in terms of the posterior probabilities of the null hypothesis, $p(H_{0,j} \mid D_j)$. For a target FDR $\alpha$, the procedure rejects all hypotheses with $p(H_{0,j} \mid D_j) < \zeta$, where $p(H_{0,j} \mid D_j) = [1 + 1/BF_{01,j}]^{-1}$ and $\zeta$ is the largest value such that $C(\zeta)/|J(\zeta)| \le \alpha$, where $J(\zeta) = \{j : p(H_{0,j} \mid D_j) \le \zeta\}$, $|J(\zeta)|$ is the number of genes in $J(\zeta)$, and $C(\zeta) = \sum_{j \in J(\zeta)} p(H_{0,j} \mid D_j)$.

2.2.4.3 Multiple group Likelihood Ratio Test (LRT)

To carry out a direct performance comparison with scBT, we extend the Seurat Bimod test [14] to multiple groups. Assuming that all $K$ groups have the same variance $\sigma_j^2$ and omitting the index $j$ for clarity, the likelihood ratio test can be defined as

\[
\Lambda(Y, R) = \frac{\sup_{\theta \in H_0} L(\theta \mid Y, R)}{\sup_{\theta \in H_a} L(\theta \mid Y, R)},
\]

where the likelihood can be written as

\[
L(\theta \mid Y, R) = \prod_{k} \omega_k^{e_k}(1-\omega_k)^{n_k - e_k} \prod_{i \in C_k} f(Y_{i,k} \mid \mu_k, \sigma^2),
\]

where $Y$ and $R$ represent the gene observation vector and the gene indicator vector across the $K$ dose groups, $\theta = \{\mu_k, \sigma^2, \omega_k,\ k = 1, \dots, K\}$ is the vector of unknown parameters, $C_k$ is the set of cells expressing the gene in group $k$ (i.e. $C_k = \{i : R_{i,k} = 1\}$), $e_k = \sum_i R_{i,k}$ is the number of cells expressing the gene in group $k$, and $f$ is the density function of the normal distribution with parameters $\mu_k$ and $\sigma^2$.
Therefore we can write

\[
\begin{aligned}
\Lambda(Y, R)
&= \frac{\sup_{\{\omega,\mu,\sigma^2\}} \omega^{\left(\sum_k e_k\right)}(1-\omega)^{\left(\sum_k n_k - \sum_k e_k\right)} \prod_k \prod_{i \in C_k} N(Y_{i,k} \mid \mu, \sigma^2)}
{\sup_{\{\omega_k,\mu_k,\sigma^2;\, k=1,\dots,K\}} \prod_k \omega_k^{e_k}(1-\omega_k)^{(n_k - e_k)} \prod_k \prod_{i \in C_k} N(Y_{i,k} \mid \mu_k, \sigma^2)} \\
&= \frac{\sup_{\{\omega\}} \omega^{\left(\sum_k e_k\right)}(1-\omega)^{\left(\sum_k n_k - \sum_k e_k\right)}}
{\sup_{\{\omega_k,\, k=1,\dots,K\}} \prod_k \omega_k^{e_k}(1-\omega_k)^{(n_k - e_k)}}
\times \frac{\sup_{\{\mu,\sigma^2\}} \prod_k \prod_{i \in C_k} N(Y_{i,k} \mid \mu, \sigma^2)}
{\sup_{\{\mu_k,\sigma^2;\, k=1,\dots,K\}} \prod_k \prod_{i \in C_k} N(Y_{i,k} \mid \mu_k, \sigma^2)} \\
&= \prod_k \left(\frac{\sum_k e_k / \sum_k n_k}{e_k / n_k}\right)^{e_k}
\cdot \left(\frac{1 - \sum_k e_k / \sum_k n_k}{1 - e_k / n_k}\right)^{n_k - e_k}
\cdot \left(1 + \frac{\sum_k e_k\left(\bar{Y}_k^{+} - \bar{Y}^{+}\right)^2}{\sum_k \sum_{i=1}^{e_k}\left(Y_{i,k} - \bar{Y}_k^{+}\right)^2}\right)^{-\frac{\sum_k e_k}{2}} \\
&= \Lambda_b(R) \cdot \Lambda_n(Y^{+}),
\end{aligned}
\]

where $\Lambda_b$ is a binomial LRT, $\Lambda_n$ is a normal LRT, $Y^{+}$ is the set of positive $Y$ values, $\bar{Y}_k^{+}$ is the mean of the positive $Y$ values in the $k$th group and $\bar{Y}^{+} = \frac{1}{K}\sum_{k=1}^{K} \bar{Y}_k^{+}$. Thus, it can be shown that the combined LRT is the product of a binomial and a normal LRT statistic, $\Lambda_b$ and $\Lambda_n$, both of which are derived using classical statistical theory. Applying classical asymptotic results about LRTs, $-2\log\Lambda(Y, R)$ converges to a $\chi^2$ distribution with $(2K - 2)$ degrees of freedom under $H_0$. We note here that the sample size for the $\chi^2$ statistic is not $n$ but $n_{+} = \sum_{k=1}^{K} \omega_k n_k$, and $n_{+}$ is sufficiently large for our simulation and real data experiments.

2.2.4.4 Linear model-based Likelihood Ratio Test (LRT linear)

The generalized linear model approach MAST was identified as one of the top performing tests for pairwise differential expression testing [32, 16]. Deriving from that approach, the LRT multiple test is extended to a generalized linear model framework, where the mean and the dropout proportions are modelled as a linear function of the dose groups (assumed to be a continuous covariate).
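The factorisation of the multiple group LRT of Section 2.2.4.3 into a binomial and a normal component can be sketched in a few lines of Python. This is an illustrative sketch rather than the code of the scBT package, and the function names are hypothetical. Two assumptions are made: the pooled mean of all positive values is used as the null estimate of the mean (it equals the average of the group means when every group has the same number of expressing cells), and every group is assumed to contain at least one expressing cell with non-zero within-group variation. Because 2K - 2 is always even, the chi-square tail probability has an exact finite-sum form.

```python
import math


def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def chi2_sf_even(x, df):
    """Upper tail of a chi-square distribution; exact for even df >= 2."""
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= (x / 2) / i
        total += term
    return math.exp(-x / 2) * total


def hurdle_lrt(groups):
    """Return (-2 log Lambda, p-value) for one gene across K >= 2 dose groups.

    groups holds one list of log-normalized values per dose group; zeros are
    treated as undetected cells.
    """
    k = len(groups)
    if k < 2:
        raise ValueError("at least two dose groups are required")
    pos = [[v for v in g if v > 0] for g in groups]
    n = [len(g) for g in groups]
    e = [len(p) for p in pos]
    n_tot, e_tot = sum(n), sum(e)
    # Binomial component: common expression rate versus group-specific rates.
    ll_null = xlogy(e_tot, e_tot / n_tot) + xlogy(n_tot - e_tot, 1 - e_tot / n_tot)
    ll_alt = sum(xlogy(e_k, e_k / n_k) + xlogy(n_k - e_k, 1 - e_k / n_k)
                 for e_k, n_k in zip(e, n))
    log_lambda_b = ll_null - ll_alt
    # Normal component: common mean versus group-specific means, shared variance.
    grand = sum(sum(p) for p in pos) / e_tot          # assumed pooled null mean
    means = [sum(p) / len(p) for p in pos]
    ssb = sum(e_k * (m - grand) ** 2 for e_k, m in zip(e, means))
    ssw = sum((v - m) ** 2 for p, m in zip(pos, means) for v in p)
    log_lambda_n = -0.5 * e_tot * math.log(1 + ssb / ssw)
    stat = max(-2 * (log_lambda_b + log_lambda_n), 0.0)
    return stat, chi2_sf_even(stat, 2 * k - 2)
```

For example, `hurdle_lrt([[0.0, 1.0, 2.0], [0.0, 3.0, 4.5]])` returns the statistic and its p-value on 2 degrees of freedom.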
Using the same distributional assumptions defined in Equation (2.1), we independently fit a logistic regression model for the discrete variable $R$ and a Gaussian linear model for the continuous variable $Y$ conditional on $R = 1$, as follows:

\[
E(Y_{ij} \mid R_{ij} = 1) = m_{0,j} + m_{1,j}\, d_i \quad \text{and} \quad \mathrm{logit}\{P(R_{ij} = 1)\} = \psi_{0,j} + \psi_{1,j}\, d_i,
\]

where $d_i$ represents the dose group of cell $i$, treated as a continuous covariate. Under this modelling approach, the null hypothesis described in Equation (2.2) can be rewritten as $H_0 : E(Y_{ij} \mid R_{ij} = 1) = m_{0,j}$ and $\mathrm{logit}\{P(R_{ij} = 1)\} = \psi_{0,j}$. The regression models are fit using the lm and brglm functions in the stats and brglm R packages, respectively. The likelihood ratio test statistic is computed using the same statistical theory discussed for the LRT multiple test and it asymptotically follows a $\chi^2$ distribution under $H_0$.

2.2.5 Benchmarking method selection

Our fit-for-purpose tests were benchmarked against existing differential expression testing methods or their multiple group equivalents, chosen based on previously reported performance, ability to consider multiple groups, or whether they served as a foundation for the scBT and multiple group LRT (LRT multiple) tests developed here. Seurat Bimod served as the foundation for the scBT and LRT multiple tests as previously outlined, and MAST was identified as one of the top performing tests for two group comparisons [32]. Similarly, limma-trend performed well for two sample comparisons and can consider multiple groups. The Wilcoxon Rank Sum test was identified as providing an excellent balance between its ability to identify DE genes and speed, and is the default test of the Seurat R package for scRNAseq analysis. It was also reported that the t test performed well, and we therefore included the ANOVA and Kruskal-Wallis (KW) tests, which are parametric and non-parametric alternatives to the t test for multiple group comparisons, respectively. All tests were run without correction for batch effects or other nuisance covariates. Multiplicity for each test was controlled using FDR correction (31).
All tests, including scBT, LRT multiple and LRT linear, are available in our scBT R package (github.com/satabdisaha1288/scBT). R session information is listed in Supplementary Table S1. A flow diagram outlines our benchmarking approach (Figure 2.1).

2.2.5.1 Seurat Bimod

The Seurat Bimod test [14] is a pairwise differential gene expression testing approach developed assuming the single cell RNA-seq hurdle model framework. The test is formulated as $H_0$: the mean and the dropout parameters of the gene vector under two dose groups are equal, versus $H_a$: the mean and the dropout parameters differ over the two groups. The LRT based test statistic $-2\log\Lambda(y, r)$ converges to a $\chi^2$ distribution with two degrees of freedom under $H_0$. The computations are carried out using the R package Seurat.

Figure 2.1 Flow diagram of the simulation, benchmarking, and experimental data evaluation strategy presented in the manuscript. Briefly, SplattDR was developed to simulate dose-response scRNAseq data and validated based on experimental dose-response data. Simulated datasets were generated varying diverse parameters 10 times and then used to assess the performance of each test method. Each test method was also assessed using experimental data from the hepatic snRNAseq dose response dataset obtained from male mice gavaged every 4 days for 28 days with 0.01, 0.03, 0.1, 0.3, 1, 3, 10, or 30 µg/kg TCDD. Related figures for each analysis from the main body are noted.

2.2.5.2 MAST

MAST [16] proposes a two-part generalized linear model for differential expression analysis of scRNAseq data. The first part models the rate of gene expression using logistic regression, $\mathrm{logit}(\omega_{ij}) = X_i\beta_j^{\omega}$, and the second part uses a linear model to express the positive gene expression $Y_{ij}$, conditional on $R_{ij}$, as $\mu_{ij} = X_i\beta_j^{\mu}$, where $\beta_j^{\omega}$ and $\beta_j^{\mu}$ are the coefficients of the covariates used in the logistic and linear regression models, respectively. A test with an
asymptotic $\chi^2$ null distribution is employed for identifying DEGs and multiplicity is controlled using FDR correction [77]. Although LRT linear and MAST share the same hurdle regression framework, the estimation process for the two methods differs in some significant respects. First, to achieve shrinkage of the continuous variance, MAST assumes a gamma prior distribution on the precision (inverse of variance) parameter, estimates its posterior maximum likelihood estimator (MLE), and uses that in place of the regular MLE of the precision parameter. Second, it fits a Bayesian logistic regression model for the discrete component by assuming Cauchy distribution priors centered at zero for the regression parameters. This is done to deal with cases of “linear separation” where the parameter estimates diverge to ±∞ and the Fisher information matrix becomes singular. Finally, it considers the cellular detection rate, defined as $CDR_i = \frac{1}{p}\sum_{j=1}^{p} R_{ij}$, to be a covariate in both the logistic and linear regression models. LRT linear, on the other hand, simply fits the non-Bayesian linear and logistic regression models without considering variance shrinkage or adjustment for additional covariates.

2.2.5.3 Limma-trend

Limma-trend [20] proposes a linear model based differential expression approach for modelling RNA-seq experiments of arbitrary complexity. Its framework models the mean gene expression as a function of several continuous and categorical covariates. A separate linear model is fitted for each gene, but the gene-wise models are linked by global parameters using parametric empirical Bayes approaches [78]. The global variance estimated by the empirical Bayes procedure also incorporates a mean-variance trend, allowing better modelling of low abundance genes. Finally, the test of differential gene expression is carried out by testing the significance of one or more coefficients of the fitted linear model.
2.2.5.4 Wilcoxon Rank Sum (WRS) Test

The WRS test [79] is a non-parametric test commonly used for pairwise DGE testing. The test is formulated as $H_0$: the distributions of the gene vector under two dose groups are equal, versus $H_a$: the distributions are not equal. The test involves the calculation of the U statistic, which for large samples is approximately normally distributed. Since this is a pairwise test, a union is taken over all the genes found to be DE in each of the pairwise tests. The computations are carried out using the wilcox.test function in the R package stats and multiplicity is controlled using FDR correction.

2.2.5.5 ANOVA

Analysis of variance (ANOVA) [80] is very commonly used for testing the differences among means in multiple groups. For a fixed gene $j$, it is assumed that the observed gene vector $y_{k,i,j}$ for cell $i$ is grouped by dose. Assuming that $Y_{k,i,j} \sim \mathrm{Normal}(\mu_{k,j}, \sigma_j^2)$, ANOVA aims to test the null hypothesis $H_0 : \mu_{1,j} = \mu_{2,j} = \dots = \mu_{K,j} = \mu_j$ versus $H_a$: $\mu_{k,j}$ is different for at least one $k$, where $i = 1, \dots, n$; $j = 1, \dots, p$; $k = 1, \dots, K$. The test statistic is computed using the aov function in the R package stats and it follows an F-distribution with $(K - 1)$ and $(n - K)$ degrees of freedom. Multiplicity is controlled by applying FDR correction to the obtained p-values.

2.2.5.6 Kruskal-Wallis (KW) Test

The KW test [81] extends the WRS test to multiple groups. It is also a non-parametric extension of the ANOVA test. The test is formulated as $H_0$: the distributions of the gene vector under $K$ dose groups are equal, versus $H_a$: the distributions are not equal. The computation of the KW test statistic is carried out using the kruskal.test function in the R package stats and it asymptotically follows a $\chi^2$ distribution with $K - 1$ degrees of freedom. Multiplicity is controlled by applying FDR correction to the obtained p-values.
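To make the two multiple group statistics above concrete, the following minimal Python sketch computes the one-way ANOVA F statistic and the Kruskal-Wallis H statistic for a single gene whose cells are grouped by dose. It is only an illustration of the textbook formulas; the benchmarking itself used the aov and kruskal.test functions in R. The function names are hypothetical, and the tie correction applied to H is the standard one, included here because scRNAseq data contain many tied zeros.

```python
def anova_f(groups):
    """One-way ANOVA F statistic with (K - 1, n - K) degrees of freedom."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ssw = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ssb / (k - 1)) / (ssw / (n - k))


def kruskal_h(groups):
    """Kruskal-Wallis H statistic with average ranks and tie correction.

    Under H0 it is asymptotically chi-square with K - 1 degrees of freedom.
    Undefined (division by zero) when all observations are identical.
    """
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0
    i = 0
    while i < n:
        # Find the block of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0   # mean of the ranks i+1, ..., j
        t = j - i
        tie_term += t ** 3 - t
        for _, gi in pooled[i:j]:
            rank_sums[gi] += avg_rank
        i = j
    h = (12.0 / (n * (n + 1))
         * sum(rs ** 2 / len(g) for rs, g in zip(rank_sums, groups))
         - 3 * (n + 1))
    return h / (1 - tie_term / (n ** 3 - n))
```

For three groups `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]` the sketch gives F = 27 and H = 7.2.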
2.2.6 Benchmarking and sensitivity analyses

Benchmarking of DE test methods was performed on simulated datasets based on initial parameters derived from real dose-response snRNAseq data. The probability of differential expression was set to 10% with a 50% probability of being down-regulated, equally distributed among the dose-response models in Table 2.1. Batch parameters were used to include sample variation associated with data obtained from 3 individuals in each dose group. A total of 5,000 genes were simulated for 4,500 cells (500 per dose group) using the same doses as the real dataset. Sensitivity analyses varied each of the following parameters according to the values in Supplementary Table 1: cell abundance equally distributed among dose groups, varying cell numbers in each dose group, percent DE genes, proportion of downregulated DE genes, fold-change location or scale, and dropout rate. Each simulation was replicated 10 times using a different initial seed. Method concordance was determined as the area under the concordance curve (AUCC) for the top 100- or 500-ranked genes in simulated and real datasets, respectively, as previously described [82].

2.3 Results

For benchmarking of DGEA methods, a ground truth is required. Existing simulation tools such as PowSimR, SymSim, SPsimSeq and Splatter are commonly used for power analyses, evaluating DE analysis methods, and testing cell clustering strategies [75, 83, 84, 85]. Tools such as SymSim and Splatter are also capable of simulating cell trajectories and modelling differentiation processes. Trajectories which exhibit non-linear changes over time or across different developmental stages are not unlike dose–response effects which change over a continuum of doses. However, dose-responsive changes commonly follow defined trajectories such as Hill, exponential, power, and linear models [28]. To simulate dose–response scRNAseq data we developed a wrapper for the Splatter scRNAseq data simulation tool named SplattDR.
SplattDR modified the Splatter grouped data simulation strategy by adjusting counts from means defined by one of the dose–response functions outlined in the Materials and Methods. To demonstrate the modeling capability of SplattDR, 10,000 gene expression responses were simulated with a 10% probability of being differentially expressed, equally distributed across the dose–response models. Parameters used in Splatter were initially estimated from our experimental single nuclei RNAseq (snRNAseq) dose–response dataset. Comparison of the simulated and experimental data showed that the relationships of mean expression with the percentage of zeroes and with the variance were consistent (Figures 2.2A, B). Estimation of the normalized root mean square deviation (NRMSD) from a curve fit to the experimental data indicated excellent concordance. The distribution of log(fold-changes) between vehicle (dose 0) and the highest simulated dose (dose 9; 30 µg/kg) showed a more even distribution within a similar range compared to experimental data, which were skewed towards induction (Figure 2.2C). However, the gene induction skew was captured by modulating the parameters affecting the probability of differential expression and the proportion of differentially repressed genes (Supplementary Figure A.1). Principal components analysis (PCA) of the simulated data clearly showed the dose-dependent characteristics of scRNAseq data, with distinct clusters increasing in separation with increasing dose (Figure 2.2D), which was also resolved by PCA within the experimental data (Supplementary Figure A.2).

To our knowledge, no other published in vivo dose-response scRNAseq datasets are available, which limits the number of datasets from which initial simulation parameters can be estimated to date. To investigate whether existing datasets generated using a different study design (e.g.
whole cells or different tissue source) could be used to derive initial parameters, we also simulated 10,000 genes starting with parameters estimated from (i) a two-dose liver snRNAseq (GSE148339), (ii) a whole cell liver scRNAseq (GSE129516) and (iii) a peripheral blood mononuclear cell (PBMC; GSE108313) dataset. When compared to a model fit for experimental data to determine the relation between mean expression and percent zeroes or variance, the NRMSD for data simulated from these datasets was between 1 and 10%, with data simulated from whole cell data differing the most from the model fit (Figure 2.2E). We then explored whether parameters estimated from distinct cell types could replicate the characteristics of that same cell type (Figure 2.2F). Not surprisingly, data simulated using initial parameters derived from individual cell types in the experimental dose–response data had lower NRMSD than data simulated from the whole cell dataset. Notably, when data derived from a less abundant cell subtype were used to estimate starting parameters, the dose–response characteristics for that cell subtype were also poorly modeled (Figures 2.2E, 2.2F).

Figure 2.2 Comparison of simulated and real dose-response data. (A) Relationship between gene-wise mean expression and percent zeroes for simulated and real dose-response data. Simulation data consisted of 10,000 genes and 9 dose groups based on parameters derived from experimental dose-response snRNAseq data. Black line represents a fitted model to the experimental data from which the normalized root mean square deviation (NRMSD) of simulated data was determined. (B) Relationship between gene-wise mean expression and variance for simulated and experimental data. NRMSD was calculated for simulated data from the fitted model represented as a black line. (C) Distribution of log(fold-changes) in experimental and simulated data showing the median and minimum and maximum values.
(D) Principal components analysis of simulated data colored according to simulated dose groups. (E) NRMSD estimated relative to the fitted model in A, B for simulated data generated from initial parameters derived from published hepatic scRNAseq (two dose; GSE148339), hepatic whole cell (whole cell; GSE129516), and peripheral blood mononuclear cell (PBMC; GSE108313) datasets. (F) NRMSD estimated relative to the model fitted to cell-type specific experimental dose-response data when simulated from initial parameters estimated from that same cell type. Box and whisker plots show median NRMSD, 25th and 75th percentiles, and minimum and maximum values.

2.3.1 Performance accuracy of DE test methods

Figure 2.3 Classification performance of DE analysis tests. (A) ROCs estimated from simulated dose-response scRNAseq data for 9 DE test methods including all genes expressed in at least 1 cell (unfiltered). (B) ROCs for 9 DE test methods after filtering simulated dose-response scRNAseq data to retain only genes expressed in ≥ 5% of cells in at least one dose group, which removes genes expressed at low levels. (C) Precision-recall curves (PRCs) for 9 DE test methods on unfiltered simulated dose-response scRNAseq data. (D) PRCs for 9 DE test methods on filtered simulated dose-response scRNAseq data. Lines represent the mean values and the shaded region reflects the standard deviation for 10 independent simulations. (E) Precision of DE test methods. (F) FPR of DE test methods. (G) MCC for test methods. For E, F, G, box and whisker plots show median values, 25th and 75th percentiles, and minimum and maximum values for 10 independent simulations. Points reflect values for each independent simulation. Panels display comparisons of unfiltered and filtered datasets.

We evaluated the performance of several differential gene expression analysis methods on simulated datasets consisting of nine dose groups of 500 cells each (4500 total) and 5000 genes with a 10% probability of being differentially expressed (500 differentially expressed genes).
Selection criteria for test inclusion are outlined in the Materials and Methods section and resulted in 9 test methods: ANOVA [80], single-cell Bayes hurdle model test (scBT), Kruskal–Wallis (KW) [81], limma-trend [20, 78], likelihood-ratio test (LRT) linear and multiple, MAST [16], Seurat Bimod [14] and WRS [79]. With ground truth from simulated data, the sensitivity, specificity, and precision of each test method were computed. The area under the receiver-operating characteristic curve (AUROC) was used to measure test performance for correctly classifying differentially expressed genes.

In unfiltered data, AUROC scores showed similar performance for most tests except scBT, which had the largest AUROC among all test methods (Figure 2.3A). To account for the inherent class imbalance between differentially expressed and non-differentially expressed classes, the area under the precision-recall curve (AUPRC) was also calculated. Similar to AUROCs, AUPRCs identified scBT as the best performing test (Figure 2.3C). In most standard differential expression testing pipelines, genes expressed at low levels are removed to minimize false detection rates. After retaining only genes expressed in ≥ 5% of cells in at least one dose group, scBT was consistently ranked as the best test based on AUROC and AUPRC scores. The performance of the LRT linear test also improved, with AUROC and AUPRC scores comparable to those of scBT, suggesting that LRT linear is poorly suited for genes expressed at low levels (Figures 2.3B–D).

AUROC and AUPRC reflect the performance of each test method over varying significance (i.e. P-value) thresholds. In the standard pipeline a fixed threshold is used, typically a P-value ≤ 0.05 after adjustment for multiple hypothesis testing (e.g. Bonferroni correction). For each method except scBT, the performance at an adjusted P-value ≤ 0.05 significance criterion was evaluated.
In scBT analysis, a gene was considered differentially expressed when the estimated posterior probability of the null hypothesis, $p(H_{0,j} \mid D_j)$, was less than $\zeta$, where the value of $\zeta$ was chosen to achieve a target FDR of 0.05. scBT significantly outperformed all other tests in precision irrespective of low expression filtering (Figure 2.3E). However, scBT was less effective in identifying true positives (Figure 2.3F). Applying the filtering criteria improved the recall rates, but the precision rates remained largely unchanged (Figures 2.3E, 2.3F). Test method classification performance scores were estimated as the Matthews correlation coefficient (MCC), which is well suited for unbalanced data [86]. The scBT and LRT linear tests performed best for this metric on both unfiltered and filtered data (Figure 2.3G).

Figure 2.4 Evaluation of Type I and II error control. (A) False positive rate (FPR) of 9 differential expression test methods estimated from negative control (0% DE genes) simulated dose-response scRNAseq data including all genes expressed in at least 1 cell (unfiltered) and only genes expressed in ≥ 5% of cells in at least one dose group (filtered). (B, C) Logistic regression models were fitted to negative control data to predict the probability of false positive identification using percent zeroes and mean expression as covariates. Lines represent the predicted probability of false positive classification with the shaded region representing the 95% confidence interval. (D) False negative rate (FNR) of 9 differential expression test methods estimated from positive control (100% DE genes) simulated dose-response scRNAseq data including unfiltered and filtered datasets. (E, F) Logistic regression models were fitted to positive control data. Lines represent the predicted probability of false negative classification with the shaded region representing the 95% confidence interval.
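The scBT decision rule used above, which declares a gene differentially expressed when its posterior null probability falls below a threshold chosen for a target FDR, can be sketched as follows. This is a minimal Python illustration of the posterior-probability FDR rule of Section 2.2.4.2, not the code of the scBT package; the function name is hypothetical, and genes are rejected in order of increasing posterior null probability for as long as the average of the rejected probabilities stays at or below the target.

```python
def bayesian_fdr_reject(post_null, alpha=0.05):
    """Return the sorted indices of genes declared differentially expressed.

    post_null[j] is the posterior probability of the null hypothesis for gene j.
    The running mean of the sorted probabilities estimates the FDR of the
    rejection set and is non-decreasing, so the scan can stop at the first
    violation of the target alpha.
    """
    order = sorted(range(len(post_null)), key=lambda j: post_null[j])
    rejected = []
    running = 0.0
    for count, j in enumerate(order, start=1):
        running += post_null[j]
        if running / count > alpha:
            break
        rejected.append(j)
    return sorted(rejected)
```

With posterior null probabilities `[0.9, 0.001, 0.2, 0.01, 0.02]` and a target FDR of 0.05, the genes at indices 1, 3 and 4 are rejected; adding the gene with probability 0.2 would raise the estimated FDR to about 0.058.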
2.3.2 Type I error control and power

To investigate test performance in controlling type I errors (false positives), DGEA methods were examined on simulated datasets with 0% DE genes (i.e. negative control). Using the threshold for the computed posterior null probabilities, scBT identified only one false positive gene in 2 of 10 simulations (Figure 2.4A). ANOVA, scBT, KW, limma-trend and LRT linear had false positive rates (FPRs) below 3%, indicating better performance compared to two group tests. After filtering for genes with low expression levels, scBT still correctly identified all the non-differentially expressed genes and was the best performing test. These are the same tests that had better FPR control in the initial simulations (Figure 2.4). To explore whether mean expression or percentage of zeroes influenced type I error rates, a logistic regression model was fitted to the negative control data. We predicted the probability for each gene to be identified as differentially expressed in the negative control data. While the curve for scBT is missing since few false positives were identified, the predicted FPR for all the other tests except LRT linear was high for highly expressed genes with few zeroes (Figures 2.4B, 2.4C).

Next, a positive control dataset with 100% differentially expressed genes was simulated to evaluate test performance for detecting true positives. All tests except scBT exhibited a false negative rate (FNR) ≤ 40% (Figure 2.4D). The best performing tests for FNR also had high FPR. Logistic regression model fitting for false negative classification of genes shows that, for all tests, the false negative rates were highest when the mean expression was either too high or too low (Figures 2.4E, 2.4F).

2.3.3 Parameter Sensitivity Analysis

Experimental scRNAseq datasets will vary between cell types, cell composition, and responses depending on the target tissue, treatment, number of cells sequenced, and more.
For example, in hepatic scRNAseq datasets some distinct cell types are very abundant (e.g. hepatocytes), while others are present at lower levels (e.g. portal fibroblasts). Moreover, treatments such as exposure to a xenobiotic can elicit dose-dependent changes in the relative proportions of cell types, such as the infiltration of immune cells (26). We investigated this impact by varying cell abundance from 25 to 2000 cells per dose group and observed an increase in the false positive rate (FPR) when increasing the number of cells. The FPR of the scBT and LRT linear tests was less sensitive to increasing cell abundance, while the total positive rates (TPR + FPR) increased with cell abundance for all methods.

Figure 2.5 Matthews correlation coefficient (MCC) from sensitivity analyses of differential expression test methods. (A) MCC for 9 DGEA test methods determined from simulated dose response data with varying number of cells per dose group. Simulations consisted of 5,000 genes with a probability of differential expression of 10% and 9 dose groups. (B) MCC for simulated data varying the cell numbers by dose group. The number of cells in each of the 9 dose groups is shown on the right. (C) MCC for varying proportion of differentially expressed genes. (D) MCC when varying the mean fold-change (location) of repressed differentially expressed genes. (E) MCC for varying distribution of fold-change (scale) of differentially expressed genes. (F) MCC for varying dropout rates calculated as in Table S3. Points represent the median and error bars represent minimum to maximum values. Boxplots represent the median, 25th to 75th percentile, and minimum to maximum values. Each analysis consisted of 10 replicate datasets including all genes expressed in at least 1 cell (unfiltered) and genes expressed in ≥ 5% of cells in at least one dose group (filtered).
Although all tests exhibited comparable performance at low cell numbers (≤ 500), as cell numbers increased scBT outperformed all other tests in both precision and MCC score (Figure 2.5A). Comparison of AUROCs and AUPRCs across cell numbers showed that the ANOVA, KW, limma-trend, and LRT linear tests performed best for a small number of cells, but the increase in AUROC was steeper for scBT.

It was also evident from the experimental snRNAseq dataset that the number of cells per dose group was not fixed. We evaluated the performance of the test methods when the number of cells dose-dependently increased or decreased, and when the number of cells per dose group was taken from experimental data. Notably, while scBT had the best MCC for increasing number of cells per dose, LRT linear performed better than scBT when the number of cells decreased, both before and after filtering for genes expressed at low levels (Figure 2.5B). The shift in MCC between increasing and decreasing cell numbers for scBT appears to be driven by a concomitant decrease in TPR and increase in FNR.

Increasing the proportion of differentially expressed genes led to an improvement in MCC except for scBT and LRT linear, though these tests maintained the top MCC scores as well as AUROC and AUPRC (Figure 2.5C). As the magnitude of the effect increased, LRT linear performed best at the low end while scBT exhibited the greatest improvement in MCC (Figure 2.5D). Conversely, while the MCC decreased for most tests when modulating the fold-change scale of differentially expressed genes, scBT improved and was more stable (Figure 2.5E). As the proportion of unexpressed genes increased, the FPR increased and the precision decreased for all tests. However, scBT was least affected and maintained the highest MCC among all tests (Figure 2.5F).
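The precision, FPR, FNR and MCC values compared throughout these sensitivity analyses are all functions of the confusion counts obtained by comparing the genes called differentially expressed with the simulated ground truth. The following minimal Python sketch shows the standard definitions; the function name and the returned dictionary layout are illustrative assumptions, not part of the benchmarking code.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard summary metrics from true/false positive and negative counts."""
    def ratio(num, den):
        # Undefined ratios (empty denominators) are reported as NaN.
        return num / den if den else float("nan")

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),      # true positive rate
        "fpr": ratio(fp, fp + tn),         # false positive rate
        "fnr": ratio(fn, fn + tp),         # false negative rate
        # MCC is conventionally set to 0 when any marginal count is zero.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }
```

Unlike precision or recall taken alone, the MCC uses all four counts, which is why it remains informative when only about 10% of the simulated genes are differentially expressed.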
2.3.4 Test method agreement

To assess agreement between tests, the area under the concordance curve (AUCC) for each pair of tests for the top 100 genes ranked by adjusted P-value was calculated as previously described [32, 82]. All methods showed excellent concordance (AUCC ≥ 0.77), with LRT linear showing the poorest consistency compared to all other tests, while the limma-trend and ANOVA tests showed perfect agreement with an AUCC of 1 (Supplementary Figure A.3). Pairwise differential gene expression comparisons between Seurat Bimod, MAST and WRS had AUCC > 0.95, while the multiple group tests ANOVA, LRT multiple, KW, and scBT clustered together with AUCC ranging between 0.9 and 1. In the absence of nuisance covariates, MAST and Seurat Bimod provided similar results, as expected given their similar mixture normal model structure. The same holds for ANOVA and limma-trend, both of which rely on normality assumptions for testing differential gene expression.

2.4 Real dose–response dataset DE analysis

Without ground truth for experimental data, the performance of the differential expression test methods was examined by first evaluating the agreement for each identified cell type (Figure 2.6). Genes in the experimental dataset were considered differentially expressed when they were expressed in ≥ 5% of cells in at least one dose group and had a |fold-change| ≥ 1.5. In hepatocytes, the most abundant cell type, fewer than 5 genes were not detected by all test methods, with the majority missed by the WRS test (Figure 2.6A). Upon closer examination, those genes were not expressed in control hepatocytes. Not surprisingly, for all cell types, the largest intersection was between all tests, indicating strong agreement among all test methods. Only a few tests identified a subset of unique genes as differentially expressed, which accounted for a very small fraction.
For example, LRT linear identified 12 unique differentially expressed genes in portal fibroblasts, one of the least abundant cell types (Figure 2.6B). LRT linear was the best performing test for low cell numbers, indicating that the 12 unique differentially expressed genes may in fact be true positives. Consistent with simulations of varying cell numbers (Figure 2.5B), 24 genes were not identified as differentially expressed by the scBT method for stellate cells, which exhibit a dose-dependent decrease in numbers (Figures 2.6C, D). Although scBT outperformed other tests in most scenarios, it underperformed in this scenario. Nevertheless, when ranking genes by significance level (i.e. P-values), AUCCs were high for all pairwise comparisons.

Figure 2.6 Agreement of differential expression test methods on experimental dose-response data. (A) Upset plot showing the intersection size of genes identified as differentially expressed by 9 different test methods in hepatocytes from the portal region of the liver lobule. (B) Intersect of differentially expressed genes in portal fibroblasts. (C) Intersect size in hepatic stellate cells. Vertical bars represent the intersect size for test methods denoted by a black dot. Horizontal bars show the total number of differentially expressed genes identified within each test (set sizes). Only intersects for which genes were identified are shown. Genes were considered differentially expressed when (i) expressed in > 5% of cells within any given dose group and (ii) exhibiting a |fold-change| ≥ 1.5. A heatmap in the upper left corner of each panel shows the pairwise AUCC comparisons for the 500 lowest p-values. (D) Relative proportion of cell types identified in each dose group of the real dataset for the cell types in A, B, C. Experimental snRNAseq data was obtained from male mice gavaged with sesame oil vehicle (vehicle control) or 0.01 – 30 µg/kg TCDD every 4 days for 28 days. (E) Graph metrics for gene set enrichment analysis of portal fibroblasts grouped by similarity in gene membership. Violin plots show distribution of node-wise values for each test method. (F) Network visualization of significantly enriched (adjusted p-value ≤ 0.05) gene sets using the Bayes factor ranked genes of portal fibroblasts. Groups of ≥ 2 nodes were manually annotated following commonality in the gene set names. Each node represents a gene set with the size of the node representing the number of genes in a gene set, and edges connect nodes with ≥ 50% overlap.

To explore the biological insight gained by using the test methods, gene set enrichment analysis was performed by ranking genes by significance values (adjusted P-value or Bayes factor) on gene sets from BIOCARTA, KEGG, PANTHER and WIKIPATHWAYS. Gene sets were grouped based on their similarity in gene membership into a network for which centrality measures can be estimated. An examination of portal fibroblasts, which exhibited the most disagreement among test methods (Figure 2.6B), showed that multiple group test methods, particularly scBT, had improved centrality metrics (centrality: number of edges; closeness: steps required to access other nodes; betweenness: number of paths that go through a node) (Figure 2.6E). Visualization of significantly enriched terms identified enriched functions associated with growth factor and immune cell signaling in addition to expected terms such as xenobiotic metabolism and nuclear receptors involved in lipid metabolism (Figure 2.6F). In contrast, WRS, which did not find as many connected groups of functions, was largely limited to those identified by scBT except for the hormone signaling and tryptophan metabolism clusters.
While there is no ground truth from real data, greater agreement between similar gene sets from disparate sources suggests that multiple group tests such as scBT provide more reliable findings [1]. However, all the test methods produce comparable gene set enrichment results, as expected, since the most robust changes were identified by all the test methods.

2.5 Discussion

The goal of this study was to compare the performance of newly developed DGEA test methods for dose–response experiments to existing analysis methods. Using simulated data to generate ground truth, we evaluated the performance of nine differential expression testing methods which were broadly classified as either fit-for-purpose, multiple group, or two group tests. Criteria for test method selection were based on previous benchmarking efforts for two group study designs identifying MAST, limma-trend, WRS, and t-test as the best performers [32, 87]. ANOVA and KW tests were also included for evaluating multiple group comparisons, and Seurat Bimod was included for having the same modelling framework as the scBT, LRT multiple and LRT linear tests. The test methods were ranked from best to worst (1-9) based on type I error rate, type II error rate, MCC, AUROC and AUPRC (Figure 7, Supplementary Table S4).

While several scRNAseq simulation tools have been developed [75, 83, 84, 85], none are designed to simulate dose–response models commonly identified in toxicological and pharmacological datasets [28, 88]. Our SplattDR wrapper for the Splatter package (28) was able to show that simulated data can effectively emulate key experimental scRNAseq data characteristics when simulation parameters were estimated from various Unique Molecular Identifier (UMI)-based datasets. In agreement with a previous report, technical and biological factors, such as cell type, do appear to influence gene dropout rates (18).
We primarily focused on 10× Genomics UMI data given the unavailability of real experimental dose–response data generated using other platforms.

Overall, test method performance was consistent with their intended application. For example, the fit-for-purpose tests scBT and LRT linear consistently ranked higher, followed by multiple group tests such as KW and LRT multiple. scBT exhibited the best overall performance with excellent FPR control and top ranked MCC, while LRT linear struck a balance between type I and type II error rates. The scBT results are not surprising, as Bayes factor-based tests have proven to be conservative and consequently more appropriate when false positives are of concern [69, 70]. In the context of investigating chemical or drug MoAs, false positives have the potential to lead to wasted effort and resources in attempts to validate and support findings [89]. Moreover, when assessing a large number of genes, a 5% FP rate (P-value ≤ 0.05) can result in hundreds of FPs that skew MoA classifications [67].

A single test method was not expected to outperform all other tests under all conditions, as previously demonstrated when comparing pairwise testing [29, 32, 87]. Therefore, we assessed the strengths and limitations of each test method by varying parameters likely to change within and across various experimental datasets. The number and relative abundance of cell types is known to be affected by disease or treatment, and the distribution of differential expression is influenced by the chemical, drug, or food contaminant being evaluated [63, 73]. scBT consistently ranked at the top under most scenarios, particularly when the mean and standard deviation of the fold-change for differentially expressed genes varied. However, scBT underperformed in MCC when the number of cells decreases in a dose-dependent manner, which would be expected in treatments that alter cell population sizes (e.g. inflammation).
Under these circumstances LRT linear outperformed all other tests, with scBT performing similarly to the other test methods, as evident when 24 differentially expressed genes were not identified by scBT within experimental data for stellate cells, which experienced a dose-dependent decrease in relative abundance following TCDD treatment. Although excluding genes expressed at low levels generally improved the performance of all test methods, the comparative performance of test methods did not significantly change in most cases. We did not have access to experimental scRNAseq dose–response data; however, we expect that scBT would perform as well as it does with experimental snRNAseq data, since the elevated number of zeroes is common to both types of data. Major differences between these types of data are (i) biases in gene detection and (ii) overall counts [73]. Given the higher overall counts in scRNAseq data, test methods such as scBT may even perform better.

DGEA provides biological information regarding the effects of exposure to chemicals, drugs, and food contaminants. As expected, gene set enrichment analyses did not dramatically differ in the enriched pathways, which are driven by the most robust responses such as xenobiotic metabolism. However, when integrating gene sets from disparate sources, we found that gene sets that partially overlap in gene membership were consistently identified by multiple group test methods. For example, several gene sets related to growth factors and cell proliferation were identified by scBT but not WRS. Portal fibroblasts are implicated in proliferation of cholangiocytes and the secretion of growth factors during development. Enrichment of these terms suggests a functional role consistent with the induction of bile
duct proliferation by TCDD (45,46). In contrast, WRS identified enrichment associated with tryptophan as well as oxytocin/thyrotropin-releasing-hormone pathways, which have not been linked to the effects of TCDD on portal fibroblasts. Although ground truth for the complete experimental dataset is not available, the use of test methods such as scBT reduces experimental noise to identify leads warranting further analysis.

2.6 Conclusion

Collectively, our findings suggest that the scBT and LRT linear fit-for-purpose tests are better suited for the differential expression analysis of dose–response studies and when false positives are of greater concern than false negatives. Moreover, consistent with previous benchmarking efforts, we show that common non-parametric tests such as KW outperform test methods developed for scRNAseq data when the study involves comparisons between multiple groups. Ultimately, each test method performs optimally under different scenarios. While the importance of controlling type I error rates is acknowledged, a balance must be struck with type II error rates. The tradeoff should be determined based on the individual research question being investigated. It may even be reasonable to apply different test methods to distinct cell types based on dropout rates, cell abundance, and changes in relative cell proportions, given the strengths and weaknesses of each test method.

2.7 Acknowledgements

This chapter is based on my published paper [1]. I would like to thank the joint first author Dr Rance Nault, and co-authors Dr Samiran Sinha, Dr Tapabrata Maiti, Dr Sudin Bhattacharya, Jack Dodson and Dr Tim Zacharewski for their support and advice. This work was funded by National Human Genome Research Institute [R21 HG010789]; National Institutes of Environmental Health Sciences Superfund Research Program [P42 ES004911] and NSF [DMS 1945824].
CHAPTER 3

SEMIPARAMETRIC DOSE RESPONSE CURVE ESTIMATION FOR SINGLE CELL DOSE RESPONSE EXPERIMENTS

3.1 Single Cell Dose Response Experiments

Gene expression profiling of single cells has led to unprecedented progress in understanding normal physiology, disease progression and developmental processes. In contrast to bulk RNA-sequencing (RNAseq), gene expression profiling of single cells allows investigation of cell-specific heterogeneity and can be used to assess changes in response to drugs and chemicals for individual cell populations. Dose-dependent gene expression profiling (also known as genomic dose-response studies, GDRS) of drugs and chemicals has been proposed as an alternative to rodent bioassays for assessing human health risks, and the application of single-cell RNA sequencing (scRNAseq) to the evaluation of chemicals, drugs, and food contaminants presents the opportunity to account for cellular heterogeneity in pharmacological and toxicological responses. Although dose response modelling of bulk RNA-seq datasets has long been used for generating quantitative estimates of risks associated with such responses, no statistical study yet exists for the estimation of dose response curves for scRNA-seq datasets.

Unlike microarray and bulk RNA-seq datasets that record gene expression measurements averaged over many cells, scRNAseq allows gene expression measurement at the level of individual cells, enabling biological investigation at that resolution. However, despite many improvements in high throughput sequencing, various technical factors including cell-cycle heterogeneity, library size differences, amplification bias, and low RNA capture per cell lead to high noise in scRNA-seq experiments. Recent technologies are capable of sequencing millions of cells, but often generate highly sparse expression datasets due to shallow sequencing.
The presence of these issues results in substantial noise that often obscures the true biological signal and renders the application of traditional statistical models unsuitable for the analysis of single cell datasets. Several statistical models have been proposed for the analysis of single cell datasets, with most focusing on zero-inflated count data distributions like Poisson or negative binomial, to account for the overdispersion. Continuous hurdle models like MAST have also been proposed that treat the zero sampling process as completely different from the sampling process of true biological expression, which is generally positive. Several statistical studies have been designed to investigate problems related to differential gene expression, cell-type clustering, denoising or imputation, gene regulatory network construction, trajectory inference and other statistical problems studied earlier in the context of bulk and microarray datasets. However, no rigorous statistical framework has yet been designed for dose response modelling of scRNA-seq datasets, which are invaluable for deriving cell-type specific efficacy and/or safety margins such as the effective dose and the point of departure (POD) of several toxicological/pharmacological responses.

Regulatory communities have suggested acute (< 14 days) and sub-acute (14–28 days) transcriptomic studies as a viable alternative to the current standard 2-year rodent bioassay, one that significantly reduces the time and resources needed to assess risk. Furthermore, single cell transcriptomic datasets could bolster such investigations by identifying cell-specific dose-dependent responses indicative of an adverse event. The U.S. National Toxicology Program (NTP) recently reported that a robust DGEA approach is essential to deriving biologically relevant PODs.
In characterizing biologic and public health significance, and the need for possible regulatory interventions, it is important to efficiently estimate dose response functions while accounting for cell-specific heterogeneity and known experimental confounders. Motivated by the lack of a single framework for single cell dose–response estimation and trend testing that accounts for covariates, this chapter proposes a complete statistical framework for these tasks.

3.1.1 Motivating experimental study and hypothesis of interest

In this work we analyze a unique in vivo dose response hepatic scRNAseq dataset consisting of 9 dose groups with 3 biological replicates for 11 distinct liver cell types, for more than 100K cells. The 9 dose groups represent 9 doses of varying levels of 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD). TCDD, a persistent organic pollutant and potent agonist of the aryl hydrocarbon receptor (AHR), induces hepatic lipid accumulation (steatosis) that progresses to steatohepatitis with fibrosis. In humans, TCDD and related compounds are associated with dyslipidemia and inflammation, and in mice, AHR activation by TCDD elicits cell-specific and spatially resolved histological and gene expression responses. Our interest lies in the estimation of cell-type specific dose response curves for genes of interest in order to derive finer points of departure (PODs) for use in informed decision-making on the safety and toxicity of drugs and chemicals. We hypothesize that some genes of interest would exhibit different PODs for different cell types, and our interest lies not only in estimating cell-specific DR curves but also in testing whether there exists a difference in cell-type specific PODs. This would be particularly hard if the genes exhibit slight changes in PODs over a handful of cell populations while exhibiting the same POD over others.
In those cases we would like to investigate how the POD estimates change in comparison to existing ones established via bulk and microarray based DR studies. Further, it could be possible that cell types exhibit completely different DR curve shapes for some genes, whereas there are only slight shifts and slope changes in cell-type specific DR curves for other genes. It is essential to allow the DR curve to change according to cell type for a particular gene in order to capture these varying curve shapes. It is also essential to test whether these estimated changes between cell types are statistically significant, in order to ensure accurate POD estimation. Therefore in this work we propose 1) a semi-parametric regression model for flexible modelling of dose response curves, while accounting for the high percentage of zeroes and the effects of additional covariates, 2) a scalable and computationally efficient MM algorithm for the estimation of regression parameters, 3) an approach to incorporate monotonicity assumptions, if supported by prior beliefs, within the estimation paradigm, 4) tests of hypothesis for testing the presence of a dose response trend and whether the trend varies for different cell types, and 5) estimation of cell-type aware PODs.

3.1.2 Literature on dose response curve estimation

The primary interest in dose response studies lies in assessing the association between a continuous predictor, $D$, and a response variable, $Y$, while adjusting for covariates, $X = (X_1, \dots, X_q)^\top$. There are a number of papers in the existing literature on dose response curve estimation, but none that solves the problem underlying our motivating example. Most of the existing literature focuses on analyzing data from bulk RNA-seq datasets, microarrays or traditional environmental toxicity studies that focus on continuous or dichotomous endpoints. Let us denote a sample of $n$ independent and identically distributed (IID) pairs of variables, $B = \{(Z_i, Y_i)\}$, $i = 1, \dots, n$, as the data. $Z_i$ can be further segmented as $Z_i = (D_i, X_i)$, where $D_i$ represents the dose vector and $X_i$ represents the vector of other confounding covariates such as age, gender, etc. The main interest of dose response studies lies in modelling $E(Y_i)$ as a function of the dose vector, i.e. $E(Y_i \mid D_i) = \zeta(D_i, \theta)$, where $\theta$ is the vector of unknown parameters.

In general, the methods for estimating dose response can be classified as parametric or nonparametric. Parametric methods assume a model $\zeta(D, \theta)$ for the DR curve, where $\theta$ is the vector of unknown parameters. A considerable amount of statistical methodology exists for parametric modelling of dose-response studies, employing parametric models such as Logistic, Exponential, and Gompertz as well as others (Holland-Letz and Kopp-Schneider 2015), and the US EPA (US EPA 2012) gives guidelines on how to employ these methods. These models are generally fit using standard techniques such as maximum likelihood (ML) or restricted maximum likelihood (RML), and their functional forms are monotonic, allowing for tractable estimation of relevant PODs. It is well known that if a parametric model is correctly specified then the ML/RML estimators are efficient. However, in many cases, it is difficult to correctly specify the parametric form of the dose–response curve because the biological mechanism of drug action or toxicity may be complex and the form of the dose–response curve is unknown a priori. When the parametric model is misspecified, the corresponding curve estimate may be severely biased. In addition, fitting extremely flexible high-order polynomial models to match difficult non-monotonic curve shapes can lead to severe overfitting. The standard workaround is to fit multiple parametric models, each of which may fit the data reasonably, but which together produce a range of POD estimates that reflects model uncertainty in the estimation process.
Accounting for this uncertainty using model averaging (MA) has received recent attention in the literature. However, despite simulation studies suggesting that model-averaged estimates provide POD estimates that have low bias and nominal coverage properties, these estimates may fail to adequately describe the true uncertainty for DR curves on the edge of the MA model space, hence resulting in inaccurate inference. Further, as MA results are available only for a limited number of parametric forms, models that make fewer assumptions on parametric forms may allow for better estimation of the dose response curves.

To enhance the robustness of the estimation of the dose–response curve, many nonparametric methods have been proposed. [41] proposed maximum likelihood estimation procedures under the assumption that the dose response curve is sigmoid and non-decreasing. [42] proposed a method for non-parametric estimation of the dose response curve under monotonicity constraints using B-spline regression. [43] used kernel estimators to obtain estimates for the dose–response curve under general shape restrictions. In the Bayesian setup, several approaches have been proposed for non-parametric estimation of dose response curves under monotonicity constraints. From a Bayesian perspective, one specifies a suitable prior on the regression function that induces monotonicity, and inference is then based on the posterior distribution. [44] employed an additive model with a prior imposed on the slope of the piecewise-linear functions. [45] adopted mixture modelling of shifted and scaled probability distribution functions. [46] proposed a Gaussian process model with a posterior projection approach for shape-constrained curves. In contrast to the parametric methods, all the non-parametric approaches are flexible and learn the shape of the dose response curves based on the data.
To our knowledge none of the existing literature has considered building a semi-parametric regression model for both unconstrained and monotonicity constrained estimation of dose response curves for single cell experiments while accounting for technical confounders; developing such a model is the primary goal of this chapter.

3.1.3 Literature on MM algorithms

A large number of statistical and machine learning problems require the computation of

$$\min_{\theta \in \Theta} R(\theta, B) \quad \text{or} \quad \hat{\theta} = \operatorname*{argmin}_{\theta \in \Theta} R(\theta, B), \tag{3.1}$$

where $R(\theta, B)$ is a risk function defined over the observed data $B$ and dependent on some parameter $\theta \in \Theta$. Common risk functions used in practice are the negative log-likelihood functions, which can be expressed as

$$R(\theta, B) = -\frac{1}{n} \sum_{i=1}^{n} \log f(z_i, y_i; \theta),$$

where $f(z_i, y_i; \theta)$ is a density function over the support of $Z$ and $Y$. Simple maximum likelihood estimation problems (or alternatively minimization of the negative log-likelihood) can be solved analytically, but most practical maximum likelihood and least squares estimation problems must be solved numerically. The task of computing (3.1) may be complicated by various factors, which include the lack of differentiability of $R$ or difficulty in obtaining closed-form solutions to the first-order condition $\nabla_\theta R = 0$, where $\nabla_\theta$ is the gradient operator with respect to $\theta$ and $0$ is a zero vector. The MM acronym does double duty, standing for majorize-minimize in minimization problems and minorize-maximize in maximization problems. The MM principle provides a unifying approach for simplifying the computation of a difficult instance of (3.1) via iterative minimization of surrogate functions [90]. Simplification is attained by (a) avoiding computationally expensive inversion of large matrices, (b) linearizing the optimization problem, (c) parameter uncoupling, (d) cleverly dealing with equality and inequality constraints or (e) turning a non-differentiable surface into a smooth one.
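As a small illustration of item (e), consider minimizing the non-differentiable objective $f(\theta) = \sum_i |y_i - \theta|$, whose minimizer is a sample median. Each term $|y_i - \theta|$ is bounded above by the quadratic $(y_i - \theta)^2 / (2|y_i - \theta^{(r)}|) + |y_i - \theta^{(r)}|/2$, which touches it at the current iterate $\theta^{(r)}$, so minimizing the smooth surrogate reduces to a weighted mean. The Python sketch below is a toy example under these assumptions; the function names, the data, and the small constant `eps` guarding against division by zero are illustrative choices and not part of the method developed in this chapter.

```python
def abs_loss(theta, ys):
    """Objective f(theta) = sum_i |y_i - theta| (non-differentiable at the data points)."""
    return sum(abs(y - theta) for y in ys)


def mm_median(ys, theta0, n_iter=100, eps=1e-8):
    """Majorize-minimize iterations for the sample median (toy illustration)."""
    theta = theta0
    history = [abs_loss(theta, ys)]
    for _ in range(n_iter):
        # Weights of the quadratic surrogate built at the current iterate
        weights = [1.0 / max(abs(y - theta), eps) for y in ys]
        # The surrogate is a weighted sum of squares, minimized by a weighted mean
        theta = sum(w * y for w, y in zip(weights, ys)) / sum(weights)
        history.append(abs_loss(theta, ys))
    return theta, history


ys = [1.0, 2.0, 4.0, 7.0, 20.0]  # hypothetical data with median 4.0
theta_hat, history = mm_median(ys, theta0=sum(ys) / len(ys))
```

Starting from the sample mean, the iterates move towards the median while the recorded objective values in `history` do not increase (up to the small perturbation introduced by `eps`).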
Let $\Theta^{(r)}$ represent a fixed value of the parameter $\Theta$, and let $h(\Theta \mid \Theta^{(r)})$ denote a real-valued function of $\Theta$ whose form depends on $\Theta^{(r)}$. The function $h(\Theta \mid \Theta^{(r)})$ is said to majorize a real-valued function $f(\Theta)$ at the point $\Theta^{(r)}$ provided

$$h(\Theta \mid \Theta^{(r)}) \ge f(\Theta) \ \text{ for all } \Theta, \qquad h(\Theta^{(r)} \mid \Theta^{(r)}) = f(\Theta^{(r)}). \tag{3.2}$$

Therefore the surface $h(\Theta \mid \Theta^{(r)})$ lies above the surface $f(\Theta)$ and is tangent to it at the point $\Theta = \Theta^{(r)}$. The function $h(\Theta \mid \Theta^{(r)})$ is said to minorize $f(\Theta)$ at $\Theta^{(r)}$ if $-h(\Theta \mid \Theta^{(r)})$ majorizes $-f(\Theta)$ at $\Theta^{(r)}$. Here $\Theta^{(r)}$ represents the current iterate in optimizing the surface $f(\Theta)$.

In a majorize-minimize MM algorithm, the algorithm minimizes the majorizing function $h(\Theta \mid \Theta^{(r)})$ rather than the actual function $f(\Theta)$. If $\Theta^{(r+1)}$ denotes the minimizer of $h(\Theta \mid \Theta^{(r)})$, then the MM procedure forces $f(\Theta)$ towards a minimum. The inequality

$$f(\Theta^{(r+1)}) = h(\Theta^{(r+1)} \mid \Theta^{(r)}) + f(\Theta^{(r+1)}) - h(\Theta^{(r+1)} \mid \Theta^{(r)}) \le h(\Theta^{(r)} \mid \Theta^{(r)}) + f(\Theta^{(r)}) - h(\Theta^{(r)} \mid \Theta^{(r)}) = f(\Theta^{(r)}) \tag{3.3}$$

follows directly from the fact that $h(\Theta^{(r+1)} \mid \Theta^{(r)}) \le h(\Theta^{(r)} \mid \Theta^{(r)})$ and from definition (3.2). The descent property (3.3) allows for remarkable numerical stability, and the MM algorithm also applies to maximization rather than minimization with straightforward changes.

MM algorithms, which represent a generalization of the EM (expectation–maximization) algorithms [91], have been shown to effectively solve a variety of optimization problems in machine learning, statistical estimation and signal processing. A comprehensive treatment of the theory and implementation of MM algorithms can be found in [92]. Summaries and tutorials on MM algorithms for various problems can be found in [93, 90, 94, 95, 92]. Some theoretical analyses of MM algorithms can be found in [96, 97, 98].

3.2 Methods

3.2.1 Model, notations and assumptions

It is understood that scRNA-seq data are collected from many cells of different types when cells are exposed to different dose levels.
Since the data for every gene will be analyzed separately, let us denote the data from a given gene by $\{Y_{i,j}, D_{i,j}, X_{i,j} \mid i = 1, \dots, I,\ j = 1, \dots, n_i\}$. Here $i$ indexes the cell type and $j$ indexes a cell within a specific cell type. $Y_{i,j}$ denotes the response (scRNA-seq expression) from the $j$th cell of the $i$th cell type, and $D_{i,j}$ and $X_{i,j}$ denote the dose level and a set of covariates for the corresponding cell. Since scRNA-seq expression data contain an excessive number of zeros, we adopt a zero-inflated negative binomial distribution:

$$Y_{i,j} \sim \omega_{i,j}\, \mathrm{NB}(m_{i,j} = s_{i,j}\mu_{i,j}, \phi) + (1 - \omega_{i,j})\, I(Y_{i,j} = 0). \tag{3.4}$$

Model (3.4) implies that the response is assumed to come from a mixture of two distributions. One component has its mass fully concentrated on zero, and the other component is the negative binomial distribution with mean $m_{i,j}$ and scale $\exp(\phi)$, where $\phi$ is on the real line. We define the mean $m_{i,j} = s_{i,j}\mu_{i,j}$, where $s_{i,j}$ represents the cell-specific bias and $\mu_{i,j}$ the expected transcript count. The scale factors $s_{i,j}$ are computed using scran [99], an approach that uses pooling across single cells to normalize scRNA-seq data with a high number of zeroes. Further, $\mu_{i,j}$ is assumed to be a function of the dose $D_{i,j}$ and covariates $X_{i,j}$. Specifically, we assume that

$$\mu_{i,j} = \exp\{\zeta_i(D_{i,j}) + X_{i,j}^\top \beta\}, \tag{3.5}$$

where $\zeta_i$ is a nonparametric function of the dose specific to the $i$th cell type. Our goal is to estimate this function $\zeta_i$. The covariate $X_{i,j}$ is assumed to exert a linear effect through the regression parameter $\beta$. Our method implicitly assumes that $\zeta_i$ can be well approximated between boundary points $a$ and $b$ by a cubic B-spline basis with some number of knots.
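To make the mixture in (3.4) concrete, the following Python sketch evaluates the zero-inflated negative binomial log-probability of a single count, with $\omega$ as the weight of the negative binomial component. The function names and numerical values are hypothetical, and the negative binomial is assumed to be parameterized by its mean $m$ and size $\exp(\phi)$, so that its variance is $m + m^2/\exp(\phi)$; this reading of "scale" is an assumption of the sketch.

```python
import math


def nb_logpmf(y, mean, phi):
    """Log-pmf of a negative binomial with the given mean and size exp(phi)."""
    size = math.exp(phi)
    return (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
            + size * (phi - math.log(mean + size))             # size * log(size / (mean + size))
            + y * (math.log(mean) - math.log(mean + size)))    # y * log(mean / (mean + size))


def zinb_logpmf(y, mean, phi, omega):
    """Log-probability under omega * NB + (1 - omega) * (point mass at zero)."""
    nb_part = omega * math.exp(nb_logpmf(y, mean, phi))
    zero_part = (1.0 - omega) if y == 0 else 0.0
    return math.log(nb_part + zero_part)


# Hypothetical values: size factor 1.2, expected count 3.0, phi = 0.5, omega = 0.7
mean = 1.2 * 3.0
probs = [math.exp(zinb_logpmf(y, mean, 0.5, 0.7)) for y in range(500)]
```

In this parameterization a zero count can arise from either component, whereas a positive count can only come from the negative binomial component.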
We will denote knot configurations by pairs $(K, \tau)$, where the number of knots $K$ is a non-negative integer and the knot locations are given by the $K$-dimensional vector $\tau = (\tau_1, \dots, \tau_K)$, with $a < d_{(1)} < \tau_1 \le \dots \le \tau_K < d_{(n)} < b$. Let $B_m(\cdot)$, for $m = 1, \dots, M$, denote the $m$th function in a cubic B-spline basis with natural boundary constraints, i.e. linear outside $[a, b]$. Therefore the dose response curve is approximated as

$$\zeta_i(D) = \sum_{m=1}^{M} \gamma_{i,m} B_m(D),$$

with $M = d + K + 1$. With the above model for the dose-response curve, we can write

$$\mu_{i,j} = \exp\Big\{\sum_{m=1}^{M} \gamma_{i,m} B_m(D_{i,j}) + X_{i,j}^\top \beta\Big\}. \tag{3.6}$$

Furthermore, lab experiments indicate that the chance of zero inflation tends to decrease with increasing dose level. This is because many genes are induced, thereby increasing the mean while simultaneously reducing the dropouts. Therefore we propose to model the inflation parameter as a function of the mean of the negative binomial component in (3.4), which means we model $\omega_{i,j}$ in terms of $s_{i,j}\mu_{i,j}$:

$$\mathrm{logit}(\omega_{i,j}) = \psi_0 + \psi_1 \log(s_{i,j}\mu_{i,j}), \tag{3.7}$$

where $\psi_0$ and $\psi_1$ are two unknown parameters to be estimated. Define $\Theta = (\phi, \beta, \gamma_{1,1}, \dots, \gamma_{1,M}, \dots, \gamma_{I,1}, \dots, \gamma_{I,M}, \psi_0, \psi_1)^\top$. Then the log-likelihood function is

$$\ell(\Theta) = \sum_{i=1}^{I} \sum_{j=1}^{n_i} \bigg( I(Y_{i,j} > 0) \Big[ Y_{i,j} \log(s_{i,j}\mu_{i,j}) - \{Y_{i,j} + \exp(\phi)\} \log\{s_{i,j}\mu_{i,j} + \exp(\phi)\} + \phi \exp(\phi) + \log \Gamma\{Y_{i,j} + \exp(\phi)\} - \log(Y_{i,j}!) - \log \Gamma\{\exp(\phi)\} + \log(\omega_{i,j}) \Big] + I(Y_{i,j} = 0) \log\Big[ (1 - \omega_{i,j}) + \omega_{i,j} \Big\{ \frac{\exp(\phi)}{s_{i,j}\mu_{i,j} + \exp(\phi)} \Big\}^{\exp(\phi)} \Big] \bigg).$$

The goal is to estimate the parameter $\Theta$ by maximizing $\ell(\Theta)$. The maximization of $\ell(\Theta)$ is difficult as the dimension of $\Theta$ could be large. Therefore, we develop an MM algorithm to ease the complexity of this optimization. The first step of the MM algorithm is developing a minorization function that is easy to optimize compared to $\ell(\Theta)$, and it is presented in the following theorem. Define $\Theta^{(0)} = (\phi^{(0)}, \beta^{(0)}, \gamma_{1,1}^{(0)}, \dots, \gamma_{1,M}^{(0)}, \dots, \gamma_{I,1}^{(0)}, \dots, \gamma_{I,M}^{(0)}, \psi_0^{(0)}, \psi_1^{(0)})^\top$.

Theorem 1.
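The minorizing function of $\ell(\Theta)$ is $\ell^{\dagger}(\Theta|\Theta^{(0)})$, such that
\[ \ell^{\dagger}(\Theta|\Theta^{(0)}) = \ell_1^{\dagger}(\phi|\Theta^{(0)}) + g_1(\psi_0|\Theta^{(0)}) + g_2(\psi_1|\Theta^{(0)}) + \sum_{\iota=1}^{I} \ell_{3,\iota}^{\dagger}(\gamma_\iota|\Theta^{(0)}) + \ell_4^{\dagger}(\beta|\Theta^{(0)}) + \ell_5^{\dagger}(\Theta^{(0)}), \tag{3.8} \]
$\ell(\Theta) \ge \ell^{\dagger}(\Theta|\Theta^{(0)})$ with equality holding when $\Theta = \Theta^{(0)}$, and the terms of (3.8) are
\[ \ell_1^{\dagger}(\phi|\Theta^{(0)}) = \sum_{i=1}^{I}\sum_{j=1}^{n_i}\Big[ \big\{ I(Y_{i,j}>0)\{Y_{i,j}+\exp(\phi)\} + \exp(\phi)\Gamma_{i,j}^{(0)} \big\} \frac{\exp(\phi^{(0)}) - \exp(\phi)}{s_{i,j}\mu_{i,j}^{(0)} + \exp(\phi^{(0)})} + \big\{ I(Y_{i,j}>0) + \Gamma_{i,j}^{(0)} \big\} \Big( \phi\exp(\phi) - \frac{0.5\, s_{i,j}}{s_{i,j}\mu_{i,j}^{(0)} + \exp(\phi^{(0)})}\{\exp(\phi) - \exp(\phi^{(0)})\}^2 - \exp(\phi)\log\{s_{i,j}\mu_{i,j}^{(0)} + \exp(\phi^{(0)})\} \Big) + I(Y_{i,j}>0)\big( \log\Gamma\{Y_{i,j}+\exp(\phi)\} - \log\Gamma\{\exp(\phi)\} \big) \Big], \]
\[ g_1(\psi_0|\Theta^{(0)}) = \sum_{i=1}^{I}\sum_{j=1}^{n_i}\Big[ \big\{ I(Y_{i,j}>0) + \Gamma_{i,j}^{(0)} \big\}\psi_0 - \frac{\omega_{i,j}^{(0)}}{2M+5}\exp\{(\psi_0 - \psi_0^{(0)})(2M+5)\} \Big], \]
\[ g_2(\psi_1|\Theta^{(0)}) = \sum_{i=1}^{I}\sum_{j=1}^{n_i}\Big[ \big\{ I(Y_{i,j}>0) + \Gamma_{i,j}^{(0)} \big\}\Big( \psi_1\log(s_{i,j}) - 0.5(M+1)(\psi_1 - \psi_1^{(0)})^2 + \psi_1\Big\{\sum_{m=1}^{M}\gamma_{i,m}^{(0)}B_m(D_{i,j}) + X_{i,j}^{\top}\beta^{(0)}\Big\} \Big) - \frac{\omega_{i,j}^{(0)}}{2M+5}\Big( \exp\{\log(s_{i,j})(\psi_1 - \psi_1^{(0)})(2M+5)\} + 0.5(M+1)\exp\{(\psi_1 - \psi_1^{(0)})^2\} + \exp\Big[(\psi_1 - \psi_1^{(0)})\Big\{\sum_{m=1}^{M}\gamma_{i,m}^{(0)}B_m(D_{i,j}) + X_{i,j}^{\top}\beta^{(0)}\Big\}(2M+5)\Big] \Big) \Big], \]
\[ \ell_{3,i}^{\dagger}(\gamma_i|\Theta^{(0)}) = \sum_{j=1}^{n_i}\Big[ \big\{ I(Y_{i,j}>0) + \Gamma_{i,j}^{(0)} \big\}\Big( -0.5\sum_{m=1}^{M}(\gamma_{i,m} - \gamma_{i,m}^{(0)})^2 B_m^2(D_{i,j}) + \psi_1^{(0)}\sum_{m=1}^{M}(\gamma_{i,m} - \gamma_{i,m}^{(0)})B_m(D_{i,j}) \Big) - \frac{\omega_{i,j}^{(0)}}{2M+5}\Big( 0.5\sum_{m=1}^{M}\exp[\{(\gamma_{i,m} - \gamma_{i,m}^{(0)})B_m(D_{i,j})(2M+5)\}^2] + \sum_{m=1}^{M}\exp\{(\gamma_{i,m} - \gamma_{i,m}^{(0)})\psi_1^{(0)}B_m(D_{i,j})(2M+5)\} \Big) \Big] \]
\[ \qquad + \sum_{j=1}^{n_i}\Big[ I(Y_{i,j}>0)Y_{i,j}\sum_{m=1}^{M}\gamma_{i,m}B_m(D_{i,j}) - \frac{s_{i,j}}{s_{i,j}\mu_{i,j}^{(0)} + \exp(\phi^{(0)})}\big\{ I(Y_{i,j}>0)Y_{i,j} + I(Y_{i,j}>0)\exp(\phi^{(0)}) + \Gamma_{i,j}^{(0)}\exp(\phi^{(0)}) \big\} \frac{\mu_{i,j}^{(0)}}{M+1}\sum_{m=1}^{M}\exp\{(\gamma_{i,m} - \gamma_{i,m}^{(0)})B_m(D_{i,j})(M+1)\} - \frac{0.5\, s_{i,j}}{s_{i,j}\mu_{i,j}^{(0)} + \exp(\phi^{(0)})}\big\{ I(Y_{i,j}>0) + \Gamma_{i,j}^{(0)} \big\}\Big( -2(\mu_{i,j}^{(0)})^2\sum_{m=1}^{M}(\gamma_{i,m} - \gamma_{i,m}^{(0)})B_m(D_{i,j}) + \frac{(\mu_{i,j}^{(0)})^2}{M+1}\sum_{m=1}^{M}\exp\{2(M+1)(\gamma_{i,m} - \gamma_{i,m}^{(0)})B_m(D_{i,j})\} \Big) \Big], \]
\[ \ell_4^{\dagger}(\beta|\Theta^{(0)}) = \sum_{i=1}^{I}\sum_{j=1}^{n_i}\Big[ \big\{ I(Y_{i,j}>0) + \Gamma_{i,j}^{(0)} \big\}\Big( -0.5\{X_{i,j}^{\top}(\beta - \beta^{(0)})\}^2 + \psi_1^{(0)}X_{i,j}^{\top}(\beta - \beta^{(0)}) \Big) - \frac{\omega_{i,j}^{(0)}}{2M+5}\Big( 0.5\exp[\{X_{i,j}^{\top}(\beta - \beta^{(0)})(2M+5)\}^2] + \exp\{X_{i,j}^{\top}\psi_1^{(0)}(\beta - \beta^{(0)})(2M+5)\} \Big) \Big] \]
\[ \qquad + \sum_{i=1}^{I}\sum_{j=1}^{n_i}\Big[ I(Y_{i,j}>0)Y_{i,j}X_{i,j}^{\top}\beta - \frac{s_{i,j}}{s_{i,j}\mu_{i,j}^{(0)} + \exp(\phi^{(0)})}\big\{ I(Y_{i,j}>0)Y_{i,j} + I(Y_{i,j}>0)\exp(\phi^{(0)}) + \Gamma_{i,j}^{(0)}\exp(\phi^{(0)}) \big\} \frac{\mu_{i,j}^{(0)}}{M+1}\exp\{X_{i,j}^{\top}(\beta - \beta^{(0)})(M+1)\} - \frac{0.5\, s_{i,j}}{s_{i,j}\mu_{i,j}^{(0)} + \exp(\phi^{(0)})}\big\{ I(Y_{i,j}>0) + \Gamma_{i,j}^{(0)} \big\}\Big( -2(\mu_{i,j}^{(0)})^2 X_{i,j}^{\top}(\beta - \beta^{(0)}) + \frac{(\mu_{i,j}^{(0)})^2}{M+1}\exp\{2(M+1)X_{i,j}^{\top}(\beta - \beta^{(0)})\} \Big) \Big], \]
where
\[ \Gamma_{i,j}^{(0)} = \frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)}, \mu_{i,j}^{(0)})}{(1 - \omega_{i,j}^{(0)}) + \omega_{i,j}^{(0)}G(\phi^{(0)}, \mu_{i,j}^{(0)})}, \qquad \omega_{i,j}^{(0)} = \frac{\exp\{\psi_0^{(0)} + \psi_1^{(0)}\log(s_{i,j}\mu_{i,j}^{(0)})\}}{1 + \exp\{\psi_0^{(0)} + \psi_1^{(0)}\log(s_{i,j}\mu_{i,j}^{(0)})\}}, \qquad G(\phi, \mu_{i,j}) = \Big[\frac{\exp(\phi)}{s_{i,j}\mu_{i,j} + \exp(\phi)}\Big]^{e^{\phi}}. \]
The noticeable fact is that the minorizing function separates all the parameters. Consequently, the follow-up calculation becomes much simpler than a joint update of non-separated parameters in an iterative procedure. Next, parameters are estimated by the Newton-Raphson method. This technique of updating the parameters is known as the gradient MM algorithm [93]. Let $\Theta^{(t)}$ and $\Theta^{(t+1)}$ be the parameter values at the $t$th and $(t+1)$th iterations, respectively. Then in the gradient MM approach,
\[ \Theta^{(t+1)} = \Theta^{(t)} - \Big\{ \frac{\partial^2 \ell^{\dagger}(\Theta|\Theta^{(t)})}{\partial\Theta\,\partial\Theta^{\top}}\Big|_{\Theta=\Theta^{(t)}} \Big\}^{-1} \frac{\partial \ell^{\dagger}(\Theta|\Theta^{(t)})}{\partial\Theta}\Big|_{\Theta=\Theta^{(t)}}. \]
Starting with an initial value of $\Theta$, this update is repeated until the parameters converge within a specified tolerance. Note that in the above update we use the minorizing function instead of the log-likelihood. Because of the separation of the parameters, $\partial^2 \ell^{\dagger}(\Theta|\Theta^{(t)})/\partial\Theta\,\partial\Theta^{\top}$ turns out to be a diagonal matrix.
Specifically,
\[ \frac{\partial \ell^{\dagger}(\Theta|\Theta^{(0)})}{\partial\phi}\Big|_{\Theta=\Theta^{(0)}} = \sum_{i=1}^{I}\sum_{j=1}^{n_i}\Big[ -\frac{\exp(\phi^{(0)})}{s_{i,j}\mu_{i,j}^{(0)} + \exp(\phi^{(0)})}\Big( Y_{i,j}I(Y_{i,j}>0) + \exp(\phi^{(0)})\big\{ I(Y_{i,j}>0) + \Gamma_{i,j}^{(0)} \big\} \Big) + I(Y_{i,j}>0)\Big( \frac{\partial}{\partial\phi}\log\Gamma\{Y_{i,j}+\exp(\phi)\}\Big|_{\phi=\phi^{(0)}} - \frac{\partial}{\partial\phi}\log\Gamma\{\exp(\phi)\}\Big|_{\phi=\phi^{(0)}} \Big) + \big\{ I(Y_{i,j}>0) + \Gamma_{i,j}^{(0)} \big\}\Big( \exp(\phi^{(0)}) + \exp(\phi^{(0)})\phi^{(0)} - \exp(\phi^{(0)})\log\{s_{i,j}\mu_{i,j}^{(0)} + \exp(\phi^{(0)})\} \Big) \Big], \]
\[ \frac{\partial \ell^{\dagger}(\Theta|\Theta^{(0)})}{\partial\psi_0}\Big|_{\Theta=\Theta^{(0)}} = \sum_{i=1}^{I}\sum_{j=1}^{n_i}\big\{ I(Y_{i,j}>0) + \Gamma_{i,j}^{(0)} - \omega_{i,j}^{(0)} \big\}, \]
\[ \frac{\partial \ell^{\dagger}(\Theta|\Theta^{(0)})}{\partial\psi_1}\Big|_{\Theta=\Theta^{(0)}} = \sum_{i=1}^{I}\sum_{j=1}^{n_i}\Big\{ \log(s_{i,j}) + \sum_{m=1}^{M}\gamma_{i,m}^{(0)}B_m(D_{i,j}) + X_{i,j}^{\top}\beta^{(0)} \Big\}\big\{ I(Y_{i,j}>0) + \Gamma_{i,j}^{(0)} - \omega_{i,j}^{(0)} \big\}, \]
\[ \frac{\partial \ell^{\dagger}(\Theta|\Theta^{(0)})}{\partial\beta}\Big|_{\Theta=\Theta^{(0)}} = \sum_{i=1}^{I}\sum_{j=1}^{n_i} X_{i,j}\Big[ \psi_1^{(0)}\big\{ I(Y_{i,j}>0) + \Gamma_{i,j}^{(0)} - \omega_{i,j}^{(0)} \big\} + I(Y_{i,j}>0)Y_{i,j} - \frac{s_{i,j}\mu_{i,j}^{(0)}}{s_{i,j}\mu_{i,j}^{(0)} + \exp(\phi^{(0)})}\big\{ I(Y_{i,j}>0)Y_{i,j} + I(Y_{i,j}>0)\exp(\phi^{(0)}) + \Gamma_{i,j}^{(0)}\exp(\phi^{(0)}) \big\} \Big], \]
\[ \frac{\partial \ell^{\dagger}(\Theta|\Theta^{(0)})}{\partial\gamma_{i,m}}\Big|_{\Theta=\Theta^{(0)}} = \sum_{j=1}^{n_i} B_m(D_{i,j})\Big[ \psi_1^{(0)}\big\{ I(Y_{i,j}>0) + \Gamma_{i,j}^{(0)} - \omega_{i,j}^{(0)} \big\} + I(Y_{i,j}>0)Y_{i,j} - \frac{s_{i,j}\mu_{i,j}^{(0)}}{s_{i,j}\mu_{i,j}^{(0)} + \exp(\phi^{(0)})}\big\{ I(Y_{i,j}>0)Y_{i,j} + I(Y_{i,j}>0)\exp(\phi^{(0)}) + \Gamma_{i,j}^{(0)}\exp(\phi^{(0)}) \big\} \Big]. \]
Also,
\[ \frac{\partial^2 \ell^{\dagger}(\Theta|\Theta^{(0)})}{\partial\phi^2}\Big|_{\Theta=\Theta^{(0)}} = \sum_{i=1}^{I}\sum_{j=1}^{n_i}\Big[ I(Y_{i,j}>0)\Big( -\frac{Y_{i,j}\exp(\phi^{(0)})}{s_{i,j}\mu_{i,j}^{(0)} + \exp(\phi^{(0)})} + \frac{\partial^2}{\partial\phi^2}\log\Gamma\{Y_{i,j}+\exp(\phi)\}\Big|_{\Theta=\Theta^{(0)}} - \frac{\partial^2}{\partial\phi^2}\log\Gamma\{\exp(\phi)\}\Big|_{\Theta=\Theta^{(0)}} \Big) + \big\{ I(Y_{i,j}>0) + \Gamma_{i,j}^{(0)} \big\}\Big( 2\exp(\phi^{(0)}) + \exp(\phi^{(0)})\phi^{(0)} - \exp(\phi^{(0)})\log\{s_{i,j}\mu_{i,j}^{(0)} + \exp(\phi^{(0)})\} - \frac{3\exp(2\phi^{(0)}) + s_{i,j}\exp(2\phi^{(0)})}{s_{i,j}\mu_{i,j}^{(0)} + \exp(\phi^{(0)})} \Big) \Big], \]
\[ \frac{\partial^2 \ell^{\dagger}(\Theta|\Theta^{(0)})}{\partial\psi_0^2}\Big|_{\Theta=\Theta^{(0)}} = -(2M+5)\sum_{i=1}^{I}\sum_{j=1}^{n_i}\omega_{i,j}^{(0)}, \]
\[ \frac{\partial^2 \ell^{\dagger}(\Theta|\Theta^{(0)})}{\partial\psi_1^2}\Big|_{\Theta=\Theta^{(0)}} = \sum_{i=1}^{I}\sum_{j=1}^{n_i}\Big[ -(M+1)\big\{ I(Y_{i,j}>0) + \Gamma_{i,j}^{(0)} \big\} - \frac{\omega_{i,j}^{(0)}}{2M+5}\Big( (2M+5)^2\{\log(s_{i,j})\}^2 + M + 1 + \Big\{\sum_{m=1}^{M}\gamma_{i,m}^{(0)}B_m(D_{i,j}) + X_{i,j}^{\top}\beta^{(0)}\Big\}^2(2M+5)^2 \Big) \Big], \]
\[ \frac{\partial^2 \ell^{\dagger}(\Theta|\Theta^{(0)})}{\partial\gamma_{i,r}^2}\Big|_{\Theta=\Theta^{(0)}} = -\sum_{j=1}^{n_i} B_r^2(D_{i,j})\Big[ I(Y_{i,j}>0) + \Gamma_{i,j}^{(0)} + (2M+5)\{1 + (\psi_1^{(0)})^2\}\omega_{i,j}^{(0)} + \frac{s_{i,j}\mu_{i,j}^{(0)}(M+1)}{s_{i,j}\mu_{i,j}^{(0)} + \exp(\phi^{(0)})}\Big( I(Y_{i,j}>0)Y_{i,j} + I(Y_{i,j}>0)\exp(\phi^{(0)}) + \Gamma_{i,j}^{(0)}\exp(\phi^{(0)}) + 2\mu_{i,j}^{(0)}\big\{ I(Y_{i,j}>0) + \Gamma_{i,j}^{(0)} \big\} \Big) \Big]. \]
A detailed proof of the theorem is provided in the Appendix for Chapter 3.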
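The two ingredients that recur in every term of the minorizing function can be sketched in a few lines of standard-library Python: the weight $\Gamma_{i,j}^{(0)}$ attached to each zero count, and the element-wise Newton step that a diagonal Hessian allows. The function names and the toy surrogate at the end are illustrative assumptions, not the dissertation's implementation.

```python
import math

def inflation_weight(psi0, psi1, s, mu):
    """omega = expit(psi0 + psi1 * log(s * mu)), the NB mixture weight of (3.7)."""
    z = psi0 + psi1 * math.log(s * mu)
    return 1.0 / (1.0 + math.exp(-z))

def gamma_weight(y, s, mu, phi, omega):
    """Gamma_ij: for a zero count, the conditional probability that it came from the NB part."""
    if y > 0:
        return 0.0
    r = math.exp(phi)
    g = (r / (s * mu + r)) ** r          # G(phi, mu): NB probability of observing zero
    return omega * g / ((1.0 - omega) + omega * g)

def diagonal_newton_step(theta, gradient, hessian_diag):
    """One gradient-MM update when the surrogate Hessian is diagonal."""
    return [t - g / h for t, g, h in zip(theta, gradient, hessian_diag)]

# Toy separable surrogate -(t1 - 2)^2 - 3 (t2 + 1)^2, evaluated at theta = (0, 0):
# gradient = (4, -6), diagonal Hessian = (-2, -6).
updated = diagonal_newton_step([0.0, 0.0], [4.0, -6.0], [-2.0, -6.0])

omega = inflation_weight(0.5, 0.1, 1.0, 2.0)
weight_for_zero = gamma_weight(0, 1.0, 2.0, 0.5, omega)
```

Because the surrogate is separable, no matrix inversion is needed: each parameter is updated from its own first and second derivative, which is what keeps the cost per iteration linear in the number of parameters.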
3.2.2 Penalized Estimation
In our proposed spline model, regression is performed by choosing a set of knots and finding the spline defined over these knots. The number of knots has an important influence on the resulting fit: too few knots lead to an underfitted regression and too many knots lead to an overfitted model. Choosing the position of the knots is also an issue, since uniformly distributed knots can lead to overfitting in an area where there are few points and underfitting in an area where there are many points. The most widely used spline regression methods overcome these difficulties by using a penalization approach. In smoothing splines, knots are set at each data point and the wiggliness of the spline is controlled by penalizing its integrated squared second order derivative $\int \{\zeta''(x)\}^2\, dx$. Substantial statistical methodology exists in the literature for finding the best number and location of knots; we refer to [100] for a review. To this effect, one can choose a large number of equally spaced initial knots and estimate the model by maximizing a penalized likelihood function. We follow [101], Chapter 5.5.3, and set, for $k = 1, \dots, K$, $\tau_k$ equal to the $\frac{k+1}{K+2}$th sample quantile of the unique dose levels. The default choice is $K = \min(0.25 \times \text{the number of unique dose levels},\ 35)$. This knot placement procedure has been shown to perform well in many scenarios. Alternatively, the statistician may choose $K$ based on visual inspection of the data scatterplot, to better reflect the complexity of the regression function relative to the noise in the data. The penalized log-likelihood function is
\[ \ell_p(\Theta) = \ell(\Theta) - \sum_i \lambda_i P(\gamma_i), \]
where $\lambda_i > 0$ is the smoothing parameter and $P(\gamma_i)$ denotes the penalty function for the $i$th dose-response curve. The minorizing function for $\ell_p(\Theta)$ is simply $\ell^{\dagger}(\Theta|\Theta^{(0)}) - \sum_i \lambda_i P(\gamma_i)$. This leads to a minor adjustment to the expressions when taking derivatives of the minorizing function with respect to $\gamma_i$.
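The default knot rule described above can be sketched as follows. Two details are assumptions made only for this illustration, since the text does not fix them: $0.25 \times$ (number of unique doses) is rounded down to an integer, and sample quantiles are computed by linear interpolation between order statistics. The function name `default_knots` is hypothetical.

```python
def default_knots(doses, max_knots=35):
    """Knots at the (k+1)/(K+2) sample quantiles of the unique dose levels."""
    unique = sorted(set(doses))
    n = len(unique)
    K = min(n // 4, max_knots)           # assumption: 0.25 * n rounded down

    def quantile(q):
        pos = q * (n - 1)                # assumption: linear-interpolation quantile
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return unique[lo] + (pos - lo) * (unique[hi] - unique[lo])

    return [quantile((k + 1) / (K + 2)) for k in range(1, K + 1)]

# Nine dose levels give K = 2 knots, at the 2/4 and 3/4 quantiles.
knots = default_knots([0, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30])
# knots == [0.3, 3.0]
```

With only a handful of distinct dose groups the rule yields very few knots, which is one reason a user may prefer to set $K$ by inspection as mentioned in the text.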
Procedures such as cross validation and generalized cross validation are used for the selection of the smoothness parameter $\lambda$ ([101], Chapter 5).
3.2.3 Monotonicity constraints
Our primary focus is inference on the unknown regression function $\log(\mu_i)$. Equation (1) works well when we have no prior information on the shape of the regression function. However, a high percentage of genes are expected to have a higher chance of an adverse response at increasing dose levels, after adjusting for important confounding factors such as age and sex. Additionally, incorporating monotonicity constraints while assessing the dose response function has been shown to improve estimation efficiency and the power to detect trends [102]. To impose monotonicity constraints on Equation (1), we need to assume that $\zeta_i \in \Theta^{+}$ is an isotonic regression function with $\Theta^{+} = \{\zeta_i(d_1) \le \zeta_i(d_2)\ \forall (d_1, d_2) \in \mathbb{R}^2,\ d_1 < d_2\}$. Currently we express the log-mean function as $\log(\mu_{i,j}) = \sum_{m=1}^{M}\gamma_{i,m}B_m(D_{i,j}) + X_{i,j}^{\top}\beta$. To make this function monotone increasing with respect to dose $d$, we need $\gamma_{i,m} > 0$ for $m = 1, \dots, M$ [103]. Therefore we write $\gamma_{i,m} = \exp(\eta_{i,m})$, so that
\[ \log(\mu_{i,j}) = \sum_{m=1}^{M}\exp(\eta_{i,m})B_m(D_{i,j}) + X_{i,j}^{\top}\beta. \]
According to Theorem 5.9 of [104], $\zeta_i(D) = \sum_{m=1}^{M}\gamma_{i,m}B_m(D)$ is a class of monotone nondecreasing splines, since the monotonicity of the B-splines is guaranteed by the nondecreasing order of coefficients. Thus our goal now is estimation of the parameters
\[ \Theta = (\phi, \beta, \eta_{1,1}, \dots, \eta_{1,M}, \dots, \eta_{I,1}, \dots, \eta_{I,M}, \psi_0, \psi_1)^{\top}. \]
Accounting for this parameter transformation in the minorization function only changes the first and second derivative calculations with respect to $\gamma_{i,m}$ for $m = 2, \dots, M$.
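The property cited from [104], that nondecreasing coefficients give a nondecreasing spline, can be checked numerically with a small Cox-de Boor evaluation of the basis. The sketch below is an illustration under stated assumptions: the knot vector is clamped at the boundaries $a$ and $b$, the interior knots and the coefficient vector are arbitrary example values, and the helper names are hypothetical. With $d = 3$ and $K = 3$ interior knots it produces $M = d + K + 1 = 7$ basis functions.

```python
def bspline_basis(x, interior_knots, a, b, degree=3):
    """Values B_1(x), ..., B_M(x) of a clamped B-spline basis on [a, b] (Cox-de Boor)."""
    t = [a] * (degree + 1) + list(interior_knots) + [b] * (degree + 1)
    n_int = len(t) - 1
    # Degree-0 indicators of the knot intervals; the right end point b is
    # assigned to the last interval of positive length.
    basis = [1.0 if t[i] <= x < t[i + 1] else 0.0 for i in range(n_int)]
    if x == b:
        basis[len(t) - degree - 2] = 1.0
    for p in range(1, degree + 1):
        nxt = []
        for i in range(n_int - p):
            left = right = 0.0
            if t[i + p] > t[i]:
                left = (x - t[i]) / (t[i + p] - t[i]) * basis[i]
            if t[i + p + 1] > t[i + 1]:
                right = (t[i + p + 1] - x) / (t[i + p + 1] - t[i + 1]) * basis[i + 1]
            nxt.append(left + right)
        basis = nxt
    return basis

def spline_value(x, gamma, interior_knots, a=0.0, b=1.0):
    """zeta(x) = sum_m gamma_m * B_m(x)."""
    return sum(g * v for g, v in zip(gamma, bspline_basis(x, interior_knots, a, b)))

knots = [0.25, 0.5, 0.75]                       # example interior knots
gamma = [0.1, 0.2, 0.2, 0.5, 0.9, 1.0, 1.3]     # nondecreasing example coefficients
grid = [k / 100 for k in range(101)]
values = [spline_value(x, gamma, knots) for x in grid]
is_monotone = all(v2 >= v1 - 1e-12 for v1, v2 in zip(values, values[1:]))
```

The check works because the derivative of a B-spline curve is itself a B-spline curve whose coefficients are scaled differences of consecutive $\gamma_{i,m}$, so nonnegative differences imply a nonnegative derivative.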
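Now, the first derivative of the minorization function with respect to $\eta_{i,m}$ will be
\[ \frac{\partial}{\partial\eta_{i,m}} g_{5,i}(\eta_i|\Theta^{(0)}) = \Big\{ \frac{\partial}{\partial\gamma_{i,m}} g_{3,i}(\exp(\eta_i)|\Theta^{(0)}) + \frac{\partial}{\partial\gamma_{i,m}} \ell_{3,i}^{\ddagger}(\exp(\eta_i)|\Theta^{(0)}) \Big\} \times \exp(\eta_{i,m}). \]
Define
\[ H_{m,m'}(\gamma_{i,m}) = \frac{\partial^2}{\partial\gamma_{i,m}^2} g_{3,i}(\gamma_i|\Theta^{(0)}) + \frac{\partial^2}{\partial\gamma_{i,m}^2} \ell_{3,i}^{\ddagger}(\gamma_i|\Theta^{(0)}). \]
Then the second derivatives with respect to $\eta_{i,m}$ can be written as
\[ H_{m,m'}(\eta_{i,m}) = H_{m,m'}(\exp(\eta_{i,m})) \times \exp(\eta_{i,m}) \times \exp(\eta_{i,m'}) + \frac{\partial}{\partial\eta_{i,m}} g_{5,i}(\eta_i|\Theta^{(0)}) \times I(m = m'). \]
The rest of the computations remain exactly the same as before.
3.2.4 Confidence interval estimation
The gradient of the penalized log-likelihood function is
\[ \frac{\partial\ell_p(\Theta)}{\partial\Theta} = \frac{\partial\ell(\Theta)}{\partial\Theta} - \sum_i \lambda_i \frac{\partial}{\partial\Theta}P(\gamma_i) = \sum_{i=1}^{I}\sum_{j=1}^{n_i} U_{i,j}, \]
where $U_{i,j} = \{\partial\ell_{i,j}(\Theta)/\partial\Theta - (\lambda_i/n_i)\,\partial P(\gamma_i)/\partial\Theta\}|_{\Theta=\hat\Theta}$. Let $H = \partial^2\ell_p(\Theta)/\partial\Theta\,\partial\Theta^{\top}|_{\Theta=\hat\Theta}$, and define $A = -(H/\sum_{i=1}^{I} n_i)^{-1}$. Then the variance-covariance of $\sqrt{\sum_{i=1}^{I} n_i}\,\hat\Theta$ can be estimated by
\[ V_0 = A\Big( \frac{1}{\sum_{i=1}^{I} n_i}\sum_{i=1}^{I}\sum_{j=1}^{n_i} U_{i,j}U_{i,j}^{\top} \Big)A^{\top}. \]
The square root of the diagonal elements of $V_0$, divided by the square root of the total sample size $\sum_i n_i$, is the standard error of the estimator of each component of the $\Theta$ vector. An important component of the model is estimation of the dose-response relation for a gene in a specific cell type. Suppose the interest is in the $i$th cell type, so the dose-response relation estimator for the $i$th cell type is $\hat G_i(D^*) = \sum_{m=1}^{M} B_m(D^*)\hat\gamma_{i,m}$ at the dose level $D^*$. We can re-write $\hat G_i(D^*) = L^{\top}\hat\Theta$, where
\[ L^{\top} = (0, 0, 0, 0, B_1(D^*), \dots, B_M(D^*), 0, \dots, 0). \]
Therefore,
\[ \mathrm{var}(\hat G_i(D^*)) = \mathrm{var}(L^{\top}\hat\Theta) = L^{\top}\Big( \frac{V_0}{\sum_{i=1}^{I} n_i} \Big)L. \]
Then the 95% pointwise CI for $G_i(D^*)$ is $\hat G_i(D^*) \pm 1.96 \times \sqrt{\mathrm{var}(\hat G_i(D^*))}$.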
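As a small illustration of the last two displays: because $L$ is zero outside the block belonging to $\gamma_i$, the quadratic form $L^{\top}(V_0/\sum_i n_i)L$ only involves the $M \times M$ covariance block of $\hat\gamma_i$. The sketch below assumes that block has already been extracted and divided by the total sample size; the function name and the numbers are hypothetical.

```python
import math

def pointwise_ci(basis, gamma_hat, cov_gamma, z=1.96):
    """Return (lower, estimate, upper) for G_i(D*) = sum_m B_m(D*) * gamma_hat_m.

    `basis` holds B_1(D*), ..., B_M(D*); `cov_gamma` is the covariance block of
    gamma_hat, assumed to be already scaled by the total sample size.
    """
    m = len(basis)
    estimate = sum(b * g for b, g in zip(basis, gamma_hat))
    variance = sum(basis[r] * cov_gamma[r][c] * basis[c]
                   for r in range(m) for c in range(m))
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate, estimate + half_width

# Two basis functions, purely illustrative values.
lower, estimate, upper = pointwise_ci([0.5, 0.5], [1.0, 3.0],
                                      [[0.04, 0.0], [0.0, 0.04]])
```

Repeating the call over a grid of dose levels $D^*$ gives a pointwise band of the kind shown in the simulation figures.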
3.2.5 Model Selection
If the user is not sure whether placing monotonicity constraints improves the model estimation, we suggest fitting both the unconstrained and constrained versions of the model and selecting the one with the lowest AIC, where $AIC = 2\theta - 2\log(l(\hat\Theta))$, $\theta$ is the number of estimated parameters and $\log(l(\hat\Theta))$ is the maximum value of the log-likelihood function for the model. In the case of penalized estimation, the optimal $\lambda$ can be chosen using cross-validation, and the value of the log-likelihood can then be calculated using the estimates obtained via the MM algorithm for penalized estimation.
3.3 Hypothesis Testing
There are two tests of hypothesis that we wish to conduct for the model described above:
1. Test if there is any effect of dose on the response for a given cell type. Suppose that we are interested in the $i$th cell type; then we set $H_{0,1}: \gamma_{i,2} = \gamma_{i,3} = \dots = \gamma_{i,M}$ versus the alternative $H_{a,1}$: at least one of the components of $\gamma_i$ is different from the rest of the components.
2. Test if the dose-response curves differ by cell type. We set $H_{0,2}: \gamma_1 = \dots = \gamma_I$ versus the alternative $H_{a,2}$: at least one of the $\gamma_i$ is different from the rest.
For both tests, we use the likelihood ratio test. The likelihood ratio test statistic is given by
\[ \Lambda_\psi = 2\{\ell(\hat\Theta) - \ell(\hat\Theta_{H_0})\}, \]
where $\hat\Theta$, calculated via the MM algorithm, denotes the unconstrained MLE of $\Theta$, while $\hat\Theta_{H_0}$ is the constrained MLE of $\Theta$ under $H_0$, which can be obtained by
\[ \Theta_{H_0}^{(t+1)} = \Theta_{H_0}^{(t)} - \Big\{ \frac{\partial^2\ell^{\dagger}(\Theta_{H_0}|\Theta_{H_0}^{(t)})}{\partial\Theta_{H_0}\,\partial\Theta_{H_0}^{\top}}\Big|_{\Theta_{H_0}=\Theta_{H_0}^{(t)}} \Big\}^{-1} \frac{\partial\ell^{\dagger}(\Theta_{H_0}|\Theta_{H_0}^{(t)})}{\partial\Theta_{H_0}}\Big|_{\Theta_{H_0}=\Theta_{H_0}^{(t)}}, \]
where
\[ \ell^{\dagger}(\Theta_{H_0}|\Theta_{H_0}^{(0)}) = \ell_1^{\dagger}(\phi_{H_0}|\Theta_{H_0}^{(0)}) + g_1(\psi_{0,H_0}|\Theta_{H_0}^{(0)}) + g_2(\psi_{1,H_0}|\Theta_{H_0}^{(0)}) + \sum_{i=1}^{I}\ell_{3,i}^{\dagger}(\gamma_{i,H_0}|\Theta_{H_0}^{(0)}) + \ell_4^{\dagger}(\beta_{H_0}|\Theta_{H_0}^{(0)}) + \ell_5^{\dagger}(\Theta_{H_0}^{(0)}). \]
For problem 1, under the null model $\zeta_i(D)$ does not vary with $D$, so it is a constant. Therefore, $\mu_{i,j} = \exp(\gamma^{*} + X_{i,j}^{\top}\beta)$.
Under the null hypothesis $H_{0,1}$, $\Lambda_\psi$ asymptotically follows a chi-squared distribution with $(M - 1)$ degrees of freedom. Under the null hypothesis $H_{0,2}$, $\gamma_1 = \dots = \gamma_I = \gamma$ (say), so $\Lambda_\psi$ asymptotically follows a chi-squared distribution with $(I - 1)M$ degrees of freedom.
3.4 Results
3.4.1 Simulation Design
To study the behavior of the procedure, we conducted a simulation study. We generated the log-mean dose response curve as $\log(\mu_{i,j}) = \zeta_i(D_{i,j}^{(s)}) + \beta X_{i,j}$, where $D_{i,j} \in \{0, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30\}$ and $D_{i,j}^{(s)} = \log(D_{i,j} + 1)/\log(30)$. The dose response function $\zeta_i(D_{i,j}^{(s)})$ is then assumed to follow four different patterns:
• Case 1:
\[ \zeta_{1,1}(D_{1,j}^{(s)}) = 1 + 0.1D_{1,j}^{(s)} + 0.1(D_{1,j}^{(s)})^2 + 3(D_{1,j}^{(s)} - 0.5)_+^2 \]
\[ \zeta_{1,2}(D_{2,j}^{(s)}) = 1 + 0.1D_{2,j}^{(s)} + 0.8(D_{2,j}^{(s)})^2 + 0.8(D_{2,j}^{(s)} - 0.5)_+^2 \]
\[ \zeta_{1,3}(D_{3,j}^{(s)}) = 0.5 + 0.2D_{3,j}^{(s)} \]
• Case 2:
\[ \zeta_{2,1}(D_{1,j}^{(s)}) = 0.15\exp\{2 \times D_{1,j}^{(s)}\} \]
\[ \zeta_{2,2}(D_{2,j}^{(s)}) = 0.15\exp[\{2 + (0.5/300)\} \times D_{2,j}^{(s)}] \]
\[ \zeta_{2,3}(D_{3,j}^{(s)}) = 0.15\exp[\{2 + (0.6/300)\} \times D_{3,j}^{(s)}] \]
• Case 3:
\[ \zeta_i(D_{i,j}^{(s)}) = \frac{1.5\sin(D_{i,j}^{(s)})}{1 + 1.5\sin(D_{i,j}^{(s)})} + 2, \quad i = 1, 2, 3 \]
• Case 4:
\[ \zeta_i(D_{i,j}^{(s)}) = 3, \quad i = 1, 2, 3 \]
We generated a single covariate $X_{i,j} \sim N(0, 1)$ and set $\beta = 0.5$. The B-spline basis matrix $B_m(D_{i,j})$ is then generated using three knots at 0.008, 0.075 and 0.4. We also fixed the degree of the B-spline basis as $d = 3$. We chose $\psi_0 = 0.5$, $\psi_1 = 0.1$ and computed $\omega_{i,j}$ using Equation (3.7). The binary random variable $R_{i,j}$ is then generated from a Bernoulli distribution with mean $\omega_{i,j}$. Finally, the expression data $Y_{i,j}$ are generated as $Y_{i,j} = Y'_{i,j}R_{i,j}$, where $Y'_{i,j}$ is generated from a negative binomial distribution with mean $s_{i,j}\mu_{i,j}$ and dispersion $\exp(\phi)$, where $\phi \sim N(0.5, 0.1)$. Datasets with three cell types are generated. The analysis is repeated with cell-type specific sample sizes $n_i = 100, 300$.
3.4.2 Analysis
For each dose response pattern and for $n_i = (100, 300)$ we simulated 500 data sets.
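The data-generating steps of the simulation design can be sketched as below, using only the Python standard library. This is an illustration under simplifying assumptions rather than the code used for the reported results: the size factor is fixed at $s_{i,j} = 1$, $\phi$ is held at 0.5 instead of being drawn from $N(0.5, 0.1)$, the negative binomial draw uses a gamma-Poisson mixture, and all function names are hypothetical. The example curve is the third cell type of Case 1.

```python
import math
import random

DOSES = [0, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]

def scaled_dose(d):
    """D^(s) = log(D + 1) / log(30)."""
    return math.log(d + 1.0) / math.log(30.0)

def poisson(rng, lam):
    """Poisson draw by Knuth's multiplication method; large rates are split in two."""
    if lam > 500.0:                      # avoid underflow of exp(-lam)
        return poisson(rng, lam / 2.0) + poisson(rng, lam / 2.0)
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_cell(rng, dose, zeta, beta=0.5, psi0=0.5, psi1=0.1, phi=0.5, s=1.0):
    """One cell: Y = Y' * R with Y' ~ NB(mean s*mu, size exp(phi)), R ~ Bernoulli(omega)."""
    x = rng.gauss(0.0, 1.0)
    mu = math.exp(zeta(scaled_dose(dose)) + beta * x)
    omega = 1.0 / (1.0 + math.exp(-(psi0 + psi1 * math.log(s * mu))))
    r = math.exp(phi)
    rate = rng.gammavariate(r, s * mu / r)   # gamma mixing: shape r, scale mean / r
    y_nb = poisson(rng, rate)
    keep = 1 if rng.random() < omega else 0
    return y_nb * keep, x

rng = random.Random(2022)
zeta_case1_type3 = lambda d: 0.5 + 0.2 * d
cells = [simulate_cell(rng, rng.choice(DOSES), zeta_case1_type3) for _ in range(300)]
counts = [y for y, _ in cells]
```

Fixing the seed makes the draw reproducible; repeating the last three lines with different seeds would give the replicate datasets over which bias, MSE and coverage are averaged.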
Tables 3.1, 3.2, 3.3 and 3.4 summarize the estimated mean, SE, bias, mean squared error and coverage probabilities of the parameters $\beta$, $\phi$, $\psi_0$ and $\psi_1$ for Cases 1-4. The results are compared with an intercept model having $\psi_1 = 0$. In general, the full model performs better than the intercept model in all aspects, indicating a dependence of the zero-inflation parameter on the log-mean. Figures 3.1, 3.2, 3.5, 3.6, 3.9, 3.10, 3.13 and 3.14 show the fits of the estimated curves along with the 95% confidence intervals. From the figures, it can be seen that the fitted values for both sample sizes are close to the true values, and the 95% confidence bands in general cover the true curves. As expected, the confidence intervals are wider when the sample size decreases.
To further demonstrate the performance of our proposed ZINB spline model (ZINB-SPL) we benchmarked it against three competing methods: (a) a zero-inflated negative binomial generalized additive model (ZINB-GAM-Dose) with the zero inflation parameter dependent on the dose as $\mathrm{logit}(\omega_{ij}) = \gamma_0 + \gamma_1 \times \text{dose}$, (b) a zero-inflated negative binomial generalized additive model (ZINB-GAM-Int) with the zero inflation parameter following an intercept model, $\mathrm{logit}(\omega_{ij}) = \gamma_0$, and (c) a negative binomial GAM (NB-GAM). We use the R packages mgcv [105] for fitting NB-GAM and zigam [106] for fitting models (a) and (b). The benchmarking curves for the simulation scenarios are reported in Figures 3.3, 3.7, 3.11 and 3.15. The performance of the four models is very close for the linear and constant models in simulation scenarios 3 and 4. However, ZINB-SPL leads to almost perfect curve estimates for the non-linear curve shapes in simulation scenarios 1 and 2. This is further demonstrated by the RMSE boxplots in Figures 3.4, 3.8, 3.12 and 3.16. The RMSE boxplots in Figures 3.4 and 3.8 for ZINB-SPL lie significantly lower than those of the other three models.
Also note that the NB-GAM model has the worst performance in scenarios 1 and 2, which justifies the need for a zero-inflation structure to model the excess zeroes in the data generating process.
Under Case 1 and Case 2, we had 500 positive control datasets where the true mean curve varies across the three cell types. Therefore Case 1 and Case 2 were used to study the power of Test 2 (Section 3.3) in detecting true positive instances. In this scenario, it is expected that the LRT statistic will have large values, with the expected p-value close to 0. Therefore the expected true positive rate is computed as
\[ \mathrm{TPR} = \frac{1}{500}\sum_{l=1}^{500} I[P(\Lambda_{\zeta,l} > \chi^2_{M(I-1)} \mid H_a) < \alpha]. \]
Case 2 embodies a local alternative testing scenario where the slope parameters of the DR functions for cell types 2 and 3 are written as $\alpha_2 = \alpha_1 + g_1/n_k$ and $\alpha_3 = \alpha_1 + g_2/n_k$, where $\alpha_1$ is the slope parameter of the DR function for cell type 1 and $g_1 = 0.5$ and $g_2 = 0.6$ are sufficiently close. Therefore it is expected that the power of Test 2 in detecting true positives for Case 2 will be much smaller than the expected TPR in Case 1, where there is a larger discrepancy in the shape of the true mean functions. In Case 1, for $n_i = 300$, the null hypothesis was rejected in 490 out of 500 simulations, giving an expected TPR of 0.98. For $n_i = 100$, the null hypothesis was rejected in 476 instances, giving an expected TPR of 0.952. For Case 2 with $n_i = 300$, 298 out of 500 instances yielded a rejection of the null hypothesis with an expected TPR of 0.578, and for $n_i = 100$, 196 out of 500 simulations yielded a rejection of the null hypothesis with an expected TPR of 0.392. The decrease in TPR for smaller sample sizes and local alternatives is an expected result.
Next, under Cases 3 and 4, the true log-mean response $\zeta_i(D_{i,j}^{(s)})$ followed the same DR model for all the cell types.
The only difference is that in Case 3 all cell types have a monotonically increasing DR function, whereas in Case 4 the DR function is a constant. Therefore the datasets generated from these cases serve as negative control datasets, where the true dose response curves do not change with the cell type. We conducted Test 2 on these datasets to study the ability of the test to control false positives. Since all the datasets in this scenario are generated from a null model, the p-values should be approximately uniformly distributed on (0, 1), and we determine the false positive rate as
\[ \mathrm{FPR} = \frac{1}{500}\sum_{l=1}^{500} I[P(\Lambda_{\eta,l} > \chi^2_{M(I-1)} \mid H_0) < \alpha], \]
where $\alpha = 0.05$. For Case 3 with $n_i = 300$, 39 out of 500 simulations yielded a rejection of the null hypothesis, with expected FPR = 0.076. For $n_i = 100$, 54 instances rejected the null hypothesis, with an expected FPR of 0.108. In Case 4 with $n_i = 300$, 22 out of 500 instances are rejected, giving an expected FPR of 0.044. For $n_i = 100$, 76 out of 500 instances are rejected at a level of 0.05, with an expected FPR of 0.152. This clearly indicates that a higher sample size is preferable for controlling the false positive rate.

                        ni = 300                                  ni = 100
Scenario  Model         β̂       SE      Bias     MSE     Cov      β̂       SE      Bias     MSE     Cov
Case 1    Full          0.507   0.049   0.007    0.003   0.95     0.491   0.083   -0.008   0.008   0.92
          Intercept     0.515   0.049   0.015    0.002   0.94     0.500   0.0824  0.007    0.007   0.93
Case 2    Full          0.498   0.112   -0.002   0.004   0.96     0.502   0.123   0.02     0.015   0.97
          Intercept     0.512   0.045   0.012    0.003   0.998    0.5079  0.0795  -0.007   0.011   0.992
Case 3    Full          0.500   0.037   -0.0015  0.0014  0.942    0.499   0.066   -0.001   0.004   0.938
          Intercept     0.500   0.037   0.001    0.001   0.938    0.498   0.067   0.002    0.004   0.965
Case 4    Full          0.495   0.031   -0.004   0.001   0.952    0.504   0.060   0.004    0.003   0.964
          Intercept     0.501   0.033   0.001    0.001   0.931    0.503   0.060   0.003    0.003   0.964
Table 3.1 Parameter estimate, asymptotic standard error, bias, mean squared error (MSE) and coverage (Cov) of parameter β under the full model and the intercept model. The results are averaged over 500 replicates and reported for ni = 100, 300.

                        ni = 300                                  ni = 100
Scenario  Model         ϕ̂       SE      Bias     MSE     Cov      ϕ̂       SE      Bias     MSE     Cov
Case 1    Full          0.618   0.168   0.118    0.043   0.902    0.891   0.303   -0.008   0.008   0.74
          Intercept     0.601   0.171   0.101    0.041   0.904    0.888   0.308   0.389    0.263   0.75
Case 2    Full          0.621   0.243   0.221    0.127   0.856    1.283   0.325   0.7836   1.136   0.532
          Intercept     0.687   0.264   0.187    0.119   0.776    1.317   0.314   0.817    1.158   0.525
Case 3    Full          0.554   0.089   0.054    0.011   0.922    0.684   0.155   0.185    0.080   0.764
          Intercept     0.555   0.0907  0.055    0.011   0.901    0.682   0.153   0.182    0.159   0.752
Case 4    Full          0.529   0.070   0.029    0.006   0.851    0.625   0.120   0.126    0.032   0.796
          Intercept     0.540   0.069   0.040    0.007   0.892    0.635   0.120   0.135    0.034   0.762
Table 3.2 Parameter estimate, asymptotic standard error, bias, mean squared error (MSE) and coverage (Cov) of parameter ϕ under the full model and the intercept model. The results are averaged over 500 replicates and reported for ni = 100, 300.

                        ni = 300                                  ni = 100
Scenario  Model         ψ̂0      SE      Bias     MSE     Cov      ψ̂0      SE      Bias     MSE     Cov
Case 1    Full          0.512   0.166   0.012    0.050   0.97     0.488   0.427   -0.012   0.173   0.978
          Intercept     0.619   0.268   0.119    0.029   1        0.609   0.480   0.109    0.064   1
Case 2    Full          0.455   0.223   -0.045   0.069   0.96     0.044   0.592   0.101    0.229   0.91
          Intercept     0.552   0.206   0.052    0.048   0.988    0.5218  0.383   0.021    0.146   0.976
Case 3    Full          0.4744  0.353   -0.025   0.0267  1        0.431   0.598   -0.068   0.190   0.972
          Intercept     0.725   0.352   0.0554   0.011   1        0.728   0.602   0.227    0.072   1
Case 4    Full          0.482   0.462   -0.018   0.016   0.961    0.468   0.795   -0.031   0.120   1
          Intercept     0.796   0.502   0.296    0.093   1        0.797   0.788   0.136    0.034   1
Table 3.3 Parameter estimate, asymptotic standard error, bias, mean squared error (MSE) and coverage (Cov) of parameter ψ0 under the full model and the intercept model. The results are averaged over 500 replicates and reported for ni = 100, 300.

Figure 3.1 Results of the simulation study, illustrating the performance of Model 1 (see Simulation Design) in 500 replicates with sample size 300. The columns correspond to the three different cell-types.
Continuous red and blue lines and the shaded grey region represent the log-mean curve averaged across 500 random replicates, the true simulated curves, and the 95% pointwise confidence intervals, respectively.
Figure 3.2 Results of the simulation study, illustrating the performance of Model 1 (see Simulation Design) in 500 replicates with sample size 100. The columns correspond to the three different cell-types. Continuous red and blue lines and the shaded grey region represent the log-mean curve averaged across 500 random replicates, the true simulated curves, and the 95% pointwise confidence intervals, respectively.
Figure 3.3 Results of the simulation study, illustrating the performance of Model 1 (see Simulation Design) in 500 replicates for sample size 300. The columns correspond to the three different cell-types. Continuous cyan, peach, darkblue, darkgreen and red lines represent the true curve and the estimated log-mean curves averaged across 500 random replicates for the ZINB-SPL, ZINB-GAM-Dose, ZINB-GAM-Int and NB-GAM models, respectively.
Figure 3.4 Results of the simulation study, illustrating the estimated RMSE of Model 1 (see Simulation Design) in 500 replicates for sample size 300. The columns correspond to RMSE boxplots of the NB-GAM, ZINB-GAM-Dose, ZINB-GAM-Int and ZINB-SPL models, respectively, for the three different cell-types plotted over 500 replicates.
Figure 3.5 Results of the simulation study, illustrating the performance of Model 2 (see Simulation Design) in 500 replicates for sample size 300. The columns correspond to the three different cell-types. Continuous red and blue lines and the shaded grey region represent the log-mean curve averaged across 500 random replicates, the true simulated curves, and the 95% pointwise confidence intervals, respectively.
Figure 3.6 Results of the simulation study, illustrating the performance of Model 2 (see Simulation Design) in 500 replicates for sample size 100. The columns correspond to the three different cell-types. Continuous red and blue lines and the shaded grey region represent the log-mean curve averaged across 500 random replicates, the true simulated curves, and the 95% pointwise confidence intervals, respectively.
Figure 3.7 Results of the simulation study, illustrating the performance of Model 2 (see Simulation Design) in 500 replicates for sample size 300. The columns correspond to the three different cell-types. Continuous cyan, peach, darkblue, darkgreen and red lines represent the true simulated curves and the estimated log-mean curves averaged across 500 random replicates for the ZINB-SPL, ZINB-GAM-Dose, ZINB-GAM-Int and NB-GAM models, respectively.
Figure 3.8 Results of the simulation study, illustrating the estimated RMSE of Model 2 (see Simulation Design) in 500 replicates for sample size 300. The columns correspond to RMSE boxplots of the NB-GAM, ZINB-GAM-Dose, ZINB-GAM-Int and ZINB-SPL models, respectively, for the three different cell-types plotted over 500 replicates.
Figure 3.9 Results of the simulation study, illustrating the performance of Model 3 (see Simulation Design) in 500 replicates for sample size 300. The columns correspond to the three different cell-types. Continuous red and blue lines and the shaded grey region represent the log-mean curve averaged across 500 random replicates, the true simulated curves, and the 95% pointwise confidence intervals, respectively.
Figure 3.10 Results of the simulation study, illustrating the performance of Model 3 (see Simulation Design) in 500 replicates for sample size 100. The columns correspond to the three different cell-types. Continuous red and blue lines and the shaded grey region represent the log-mean curve averaged across 500 random replicates, the true simulated curves, and the 95% pointwise confidence intervals, respectively.
Figure 3.11 Results of the simulation study, illustrating the performance of Model 3 (see Simulation Design) in 500 replicates for sample size 300. The columns correspond to the three different cell-types. Continuous cyan, peach, darkblue, darkgreen and red lines represent the true simulated curves and the estimated log-mean curves averaged across 500 random replicates for the ZINB-SPL, ZINB-GAM-Dose, ZINB-GAM-Int and NB-GAM models, respectively.
Figure 3.12 Results of the simulation study, illustrating the estimated RMSE of Model 3 (see Simulation Design) in 500 replicates for sample size 300. The columns correspond to RMSE boxplots of the NB-GAM, ZINB-GAM-Dose, ZINB-GAM-Int and ZINB-SPL models, respectively, for the three different cell-types plotted over 500 replicates.
Figure 3.13 Results of the simulation study, illustrating the performance of Model 4 (see Simulation Design) in 500 replicates for sample size 100. The columns correspond to the three different cell-types. Continuous red and blue lines and the shaded grey region represent the log-mean curve averaged across 500 random replicates, the true simulated curves, and the 95% pointwise confidence intervals, respectively.
Figure 3.14 Results of the simulation study, illustrating the performance of Model 4 (see Simulation Design) in 500 replicates for sample size 300. The columns correspond to the three different cell-types. Continuous red and blue lines and the shaded grey region represent the log-mean curve averaged across 500 random replicates, the true simulated curves, and the 95% pointwise confidence intervals, respectively.
Figure 3.15 Results of the simulation study, illustrating the performance of Model 4 (see Simulation Design) in 500 replicates for sample size 300. The columns correspond to the three different cell-types.
Continuous cyan, peach, darkblue, darkgreen and red lines represent the true simulated curves and the estimated log-mean curves averaged across 500 random replicates for the ZINB-SPL, ZINB-GAM-Dose, ZINB-GAM-Int and NB-GAM models, respectively.
Figure 3.16 Results of the simulation study, illustrating the estimated RMSE of Model 4 (see Simulation Design) in 500 replicates for sample size 300. The columns correspond to RMSE boxplots of the NB-GAM, ZINB-GAM-Dose, ZINB-GAM-Int and ZINB-SPL models, respectively, for the three different cell-types plotted over 500 replicates.

                        ni = 300                                  ni = 100
Scenario  Model         ψ̂1      SE      Bias     MSE     Cov      ψ̂1      SE      Bias     MSE     Cov
Case 1    Full          0.081   0.166   -0.019   0.025   0.956    0.101   0.294   0.099    0.009   0.964
          Intercept     0.000   0.1658  -0.099   0.094   1        0.000   0.291   -0.09    0.009   1
Case 2    Full          0.109   0.2607  0.0089   0.010   0.96     0.0438  0.572   -0.056   0.324   0.892
          Intercept     0.000   0.148   -0.100   0.057   0.998    0.000   0.556   -0.100   0.0100  0.992
Case 3    Full          0.113   0.152   0.013    0.006   0.956    0.135   0.262   0.034    0.038   0.988
          Intercept     0.000   0.150   -0.100   0.010   1        0.000   0.318   -0.100   0.010   1
Case 4    Full          0.107   0.154   0.007    0.002   0.952    0.111   0.264   0.012    0.014   0.964
          Intercept     0.000   0.164   -0.100   0.010   1        0.000   0.288   -0.100   0.0100  1
Table 3.4 Parameter estimate, asymptotic standard error, bias, mean squared error (MSE) and coverage (Cov) of parameter ψ1 under the full model and the intercept model. The results are averaged over 500 replicates and reported for ni = 100, 300.

3.5 Discussion
In this chapter we present the first statistically rigorous approach for modelling zero-inflated single cell dose response data. Our method proposes a semi-parametric regression framework for the estimation of several non-linear cell-type specific curves arising naturally in single cell dose response experiments. We discuss an extremely flexible penalized regression approach along with further extensions to estimation under monotonicity constraints.
Our novel technique flexibly models cell-type specific heterogeneity, and its use of a ZINB B-spline model allows it to effectively capture diverse DR functions and accommodate undesirable zero inflation in the data. Further, to better deal with the large number of parameters arising from the joint modelling of multiple cell-type specific DR curves, we propose a highly scalable and computationally efficient MM algorithm. Through our MM estimator we are able to significantly reduce the computational complexity of the proposed optimization problem. Finally, we propose two highly relevant hypothesis testing approaches that test for the presence of a DR signal and for differences in the shape of the functional form between cell types.
Comprehensive studies on simulated datasets confirm that our proposed model ZINB-SPL yields better RMSE than the competing methods (ZINB-GAM-Dose, ZINB-GAM-Int and NB-GAM). Further, ZINB-SPL displayed greater estimation accuracy for constant, linear and non-linear curve shapes. In the case of the highly non-linear shapes, the performance of ZINB-SPL showed a clear improvement over the competing methods. The consistent performance of ZINB-SPL in four realistic simulation scenarios highlighted the robustness of our proposed approach in modeling dose response relationships under varying characteristics of scRNA-seq datasets. Additionally, the hypothesis testing results for the simulation scenarios demonstrated effective false positive rate control while maintaining considerable power for testing against local alternatives.
An advantage of our proposed MM framework is that it is highly scalable and can provide computationally efficient estimation for a large number of parameters in the model, which could be the case if many cell types are modelled jointly. In addition, it is straightforward to generalize the proposed framework to monotonicity constrained estimation, which may be of interest in several biomedical problems.
Thus we provide the first framework for statistically modelling single cell dose response datasets, along with appropriate machinery for testing differences between cell-type specific toxicological responses.

CHAPTER 4

KERNELIZED SIGNED GRAPH LEARNING FOR SINGLE CELL GENE REGULATORY NETWORK INFERENCE

4.1 Single Cell Gene Regulatory Networks

Gene regulatory networks (GRNs) represent fundamental molecular regulatory interactions among genes that establish and maintain all required biological functions characterizing a certain physiological state of a cell in an organism [107]. Cell type identity in an organism is determined by how active transcription factors interact with a set of cis-regulatory regions in the genome and control the activity of genes by either activating or repressing transcription [108]. Usually, the relationships between these active transcription factors and their target genes characterize GRNs. Due to the inherent causality captured by these meaningful biological interactions in GRNs, genome-wide inference of these networks holds great promise in enhancing the understanding of normal cell physiology, and also in characterizing the molecular compositions of complex diseases [109, 110].

GRNs can be mathematically characterized as graphs where nodes represent genes and the edges quantify the regulatory relations. GRN reconstruction attempts to infer this regulatory network from high-throughput data using statistical and computational approaches. Multiple methods encompassing varying mathematical concepts have been proposed during the last decade to infer GRNs using gene expression data from bulk population sequencing technologies, which accumulate expression profiles from all cells in a tissue.
These methods can be broadly classified into two groups: the first group infers a static GRN, considering the steady state of gene expression, while the second group uses temporal measurements to capture the expression profile of the genes in a dynamic process. A thorough evaluation of the static and dynamic models used in bulk GRN reconstruction can be found in [47, 48].

Recent advances in RNA-sequencing technologies have enabled the measurement of gene expression in single cells. This has led to the development of several computational approaches aimed at quantifying the expression of individual cells for cell-type labelling and estimation of cellular lineages. Several algorithms have been developed to arrange cells in a projected temporal order (pseudotime trajectory) based on similarities in their transcriptional states. In parallel, several dynamic models for single cell GRN reconstruction have also been developed taking into account the estimated pseudotimes. Since single cell network reconstruction algorithms try to establish functional relationships between genes taking into account the entire population of cells, it is debatable whether additional knowledge regarding cell state transitions provides any added benefit [111, 3]. In summary, direct application of bulk GRN reconstruction methods may not be adequate for single cell network inference.

The complex nature of single-cell transcriptomics data poses unique challenges in GRN inference. Changes in gene expression due to cell-cell stochastic variation, cell-cycle heterogeneity, and high sparsity due to insufficient sequencing depth and capture inefficiency for genes with low expression are some of the unique characteristics of these datasets [112, 113]. Most importantly, the high sparsity (high proportion of zero values) of single cell datasets has garnered a lot of attention, and several statistical methods have been designed specifically to model this phenomenon [10, 16, 114].
Recent research has indicated that these zero values, referred to as "dropouts", most likely result from biological variation and may be indicative of heterogeneity in gene expression for varying cell types [21, 22]. To account for these unique challenges, a variety of algorithms for network reconstruction in scRNAseq data have recently been proposed, but most of these methods fail to outperform network estimation methods developed for bulk data or microarrays [3, 111].

To that end, we propose a network reconstruction algorithm that learns the co-expression between genes by borrowing ideas from the graph signal processing (GSP) literature. GSP provides a framework for analyzing signals defined on graphs by extending classical signal processing tools and concepts [115]. In many applications of GSP, the graph topology is not available, and thus it must be inferred or learned from the observed data. The major approaches to graph learning (GL) include smoothness based methods [116], where the graph is learned with the assumption that graph signals vary smoothly with respect to the graph structure; and diffusion process based models, where the graph is learned from signals that are assumed to be graph filtered versions of random processes [117]. In this work, we focus on learning graphs with the smoothness assumption for the following reasons. First, smooth signals admit low-pass and sparse representations in the graph Fourier domain. Thus, the GL problem is equivalent to finding efficient information processing transforms for graph signals. Second, many graph-based machine learning tasks, such as spectral clustering and graph regularized learning, are developed based on the smoothness of the graph signals. Finally, smooth graph signals are observed ubiquitously in real-world applications [117].
Smoothness based GL was first considered in [118] by modeling graph signals using factor analysis, where the transformation from factors to observed signals exploits the graph topology. By imposing a suitable prior on the factors, the graph signals are modelled to have a low-frequency representation in the graph Fourier domain. This analysis results in an optimization problem where a graph is learned such that the variation of signals over the learned graph is minimized. Different variations of this framework, with constraints on the learned topology and for handling noisy graph signals, were considered in [119, 120, 121, 122, 123]. All of the previous works learn unsigned graphs with the exception of [124], where a signed graph is learned by employing the signed graph Laplacian defined by [125]. By using the signed Laplacian, [124] aim to learn positive edges between nodes whose signal values are similar and negative edges between nodes whose signal values have opposite signs with similar absolute values. However, this approach is not suitable when graph signals are either all positive- or all negative-valued, as in the case of gene expression data.

[Figure 4.1: two panels, "Euclidean Distances" (y-axis: Distance) and "Correlations" (y-axis: Correlation); x-axis: Dataset (GSD, HSC, mCAD, VSC); legend: Activating, Inhibitory.]

Figure 4.1 Euclidean distances (left, normalized to [0, 1]) and correlations (right) between expressions of gene pairs in curated datasets studied in Section 4.4. Values are calculated only for gene pairs that are connected in the ground truth GRNs and they are reported separately for activating and inhibitory edges. Only inhibitory edges are reported for VSC, since its GRN includes only inhibitory edges.

Considering the advantages of GL approaches in learning graph topologies that are consistent with the observed signals, in this work we propose a novel GL algorithm for the reconstruction of GRNs. In particular, we assume gene expression data obtained from cells are graph signals residing on an unknown graph structure, which corresponds to the GRN. One important characteristic of GRNs is that they are signed graphs, where positive and negative edges correspond to activating and inhibitory regulations between genes. To this end, we propose a novel and computationally efficient signed GL approach, scSGL, that reconstructs the GRN under the assumption that graph signals admit a low-frequency representation over activating edges, while admitting a high-frequency representation over inhibitory edges. Biologically, this modelling implies that two genes that are connected with an activating edge have similar expressions, while two genes connected with an inhibitory edge have dissimilar expressions. In Figure 4.1, we show how these assumptions hold for the curated datasets studied in Section 4.4. The figure shows that Euclidean distances between expressions are smaller for gene pairs connected by activating edges than for those connected by inhibitory edges. The figure also reports correlations between expressions, which indicate that expressions of gene pairs connected with activating and inhibitory edges are positively correlated, i.e., similar, and negatively correlated, i.e., dissimilar, respectively. We also performed a Wilcoxon rank-sum test to determine whether, for Euclidean distances, the calculated associations for the positive ground truth connections were significantly lower than the associations for the negative ground truth connections. We test the null hypothesis $H_0$: the distributions of both populations are equal, versus the alternative hypothesis $H_a$: the distribution of the negative associations is stochastically greater than the distribution of the positive associations.
In the case of correlations, we test $H_a$: the distribution of the positive associations is stochastically greater than the distribution of the negative associations. The calculated p-values were all less than 0.01, hence justifying our assumptions for all curated datasets except VSC, which only has negative associations. Another important characteristic of scRNAseq data is the high proportion of dropouts. We address this issue by employing kernel functions to map graph signals to a higher dimensional space and assuming low- and high-frequency representations for these high dimensional graph signals. This mapping allows us to use kernels that are appropriate for modelling single cell data structures.

4.2 Background

4.2.1 Notations

Define an undirected graph as $G = (V, E)$, where $V$ is the set of nodes with cardinality $|V| = n$ and $E \subseteq V \times V$ denotes the edge set with cardinality $|E| = m$. The edge between nodes $i$ and $j$ is denoted by $e_{ij}$ and has weight $w_{ij}$. If the weights $w_{ij}$ are strictly positive then the graph is unsigned, whereas if $w_{ij}$ can take both positive and negative values then the graph is referred to as signed. Define $W$ to be the $n \times n$ symmetric adjacency matrix of $G$, where $W_{ij} = W_{ji} = w_{ij}$ if $e_{ij} \in E$ and $0$ otherwise. In the case of signed graphs, $W$ can be decomposed as $W = W^+ - W^-$, where $W^+_{ij} = w_{ij}$ for $w_{ij} > 0$ and $W^-_{ij} = |w_{ij}|$ for $w_{ij} < 0$. Further, the Laplacian matrix $L$ is defined as $L = D - W$, where $D$ is the degree matrix of $G$. Similar to the decomposition of the adjacency matrix, the Laplacian can be decomposed for estimation of signed Laplacians. All-one and all-zero vectors and matrices are represented by $\mathbf{1}$ and $\mathbf{0}$, respectively. Finally, the $i$th row and column of a matrix $X$ are represented by $X_{i\cdot}$ and $X_{\cdot i}$, respectively.

4.2.2 Low- and High-frequency Signals on Unsigned Graphs

A graph signal $x \in \mathbb{R}^n$ is a vector whose entries reside on the nodes of an unsigned graph $G$.
The graph Fourier transform (GFT) of $x$ is defined as the expansion of $x$ in terms of the eigenbasis of the graph Laplacian [126]. This representation allows us to characterize $x$ in terms of its graph spectral content as either low- or high-frequency, where low (high)-frequency graph signals have small (large) variation with respect to the graph [127]. Let $L = V \Lambda V^\top$ be the eigendecomposition of $L$, where $\Lambda$ is the diagonal matrix of eigenvalues with $\Lambda_{ii} = \lambda_i$ and $V_{\cdot i}$ is the eigenvector corresponding to $\lambda_i$. The GFT of $x$ is then $\hat{x} = V^\top x$ and the inverse GFT is [126]:

$$x = V\hat{x} = \sum_{i=1}^{n} \hat{x}_i V_{\cdot i}. \tag{4.1}$$

Thus, $x$ is a linear combination of the eigenvectors of $L$ with coefficients equal to the entries of $\hat{x}$. Eigenvectors of $L$ corresponding to small eigenvalues have small variation over the graph. Thus, if most of the energy of $\hat{x}$ lies in the $\hat{x}_i$'s corresponding to the small eigenvalues, then $x$ varies little over $G$. On the other hand, if most of the energy of $\hat{x}$ lies in the $\hat{x}_i$'s corresponding to the large eigenvalues, $x$ has high variation over $G$. The total variation of $x$ over $G$ is then quantified as [126]:

$$\mathrm{trace}(\hat{x}^\top \Lambda \hat{x}) = \mathrm{trace}(x^\top V \Lambda V^\top x) = \mathrm{trace}(x^\top L x), \tag{4.2}$$

which is small for low-frequency graph signals and large for high-frequency ones.

4.2.3 Unsigned Graph Learning

An unknown unsigned graph $G$ can be learned from a set of observed graph signals defined over it with the assumption that the graph signals have a low-frequency representation in the graph spectral domain, i.e., their total variation is small. Using this assumption, [118] proposes to learn $G$ by minimizing (4.2) with respect to $L$ given a set of graph signals $\{x_i\}_{i=1}^{p}$ as follows:

$$\min_{L \in \mathcal{L}} \ \mathrm{trace}(X^\top L X) + \alpha \|L\|_F^2 \quad \text{s.t.} \quad \mathrm{trace}(L) = 2n, \tag{4.3}$$

where $\mathcal{L} = \{L : L_{ij} = L_{ji} \le 0 \ \forall i \ne j, \ L\mathbf{1} = \mathbf{0}\}$ is the set of Laplacian matrices and $X \in \mathbb{R}^{n \times p}$ is the data matrix whose columns are graph signals. The constraint $\mathrm{trace}(L) = 2n$ ensures that trivial solutions are avoided. $\|L\|_F$ is employed to control the sparsity of the learned graph as discussed in [119].
In particular, the Frobenius norm penalizes $L$ for having entries with large absolute values, and the non-positivity constraint ($L_{ij} = L_{ji} \le 0$) and degree constraint ($\mathrm{trace}(L) = 2n$) threshold edges with small weights. Thus, as $\alpha \to \infty$, the edges of the learned graph tend to have very similar weights and no thresholding can be performed to sparsify the learned graph. On the other hand, small values of $\alpha$ lead $L$ to have entries with varying values, and the imposed constraints threshold the small values, yielding a sparse graph.

4.2.4 Kernels

Traditional machine learning and signal processing applications are mostly developed based on linear modelling due to its simplicity. However, real world problems often require nonlinear estimation that can detect more complex patterns in the data. For this purpose, kernels are introduced to capture the nonlinearity by mapping signals to a high-dimensional space [128]. Kernels correspond to dot products in a higher dimensional feature space and avoid explicit construction of the feature space, thus providing the simplicity of linear methods in nonlinear estimation. Given data from an input space $\mathcal{X}$ and a mapping function $\phi : \mathcal{X} \to \mathcal{H}$, where $\mathcal{H}$ is a Hilbert space, a kernel function can be expressed as an inner product in the corresponding feature space, i.e., $\kappa(x_i, x_i') = \langle \phi(x_i), \phi(x_i') \rangle$, where $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a finitely positive semi-definite kernel function [129]. An explicit representation of the feature map $\phi$ is not necessary, and the dimension of the mapped feature vectors can be high and even infinite. By using different kernels, a learning algorithm can be augmented to exploit various (nonlinear) associations between input data. For example, the first term in (4.3) can be rewritten as $\mathrm{trace}(XX^\top L) = \sum_{i,j} \langle X_{i\cdot}, X_{j\cdot} \rangle L_{ij}$, where the dot product between the rows of $X$ can be replaced by $\kappa(X_{i\cdot}, X_{j\cdot})$.
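As a numerical illustration of (4.1), (4.2) and the trace identity above, the following minimal sketch builds a small path graph and checks that the spectral and vertex-domain forms of the total variation agree, and that trace(X⊤LX) equals the entrywise sum of the linear kernel XX⊤ against L. The graph, the signals and all variable names are hypothetical and chosen only for illustration; they are not taken from the datasets studied in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-node unsigned path graph 0-1-2-3 with unit edge weights.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W            # Laplacian L = D - W

# Eigendecomposition L = V diag(lam) V^T (eigh returns ascending eigenvalues).
lam, V = np.linalg.eigh(L)

x = np.array([1.0, 1.1, 0.9, 1.0])        # smooth (low-frequency) signal
x_high = np.array([1.0, -1.0, 1.0, -1.0]) # alternating (high-frequency) signal

x_hat = V.T @ x                           # graph Fourier transform
x_rec = V @ x_hat                         # inverse GFT, equation (4.1)

tv_spectral = float(np.sum(lam * x_hat ** 2))  # x_hat^T Lambda x_hat
tv_vertex = float(x @ L @ x)                   # x^T L x, equation (4.2)

# Trace identity used to introduce kernels:
# trace(X^T L X) = sum_ij <X_i., X_j.> L_ij, with K = X X^T the linear kernel.
X = rng.normal(size=(4, 6))               # n = 4 nodes, p = 6 signals (columns)
K = X @ X.T
tv_trace = float(np.trace(X.T @ L @ X))
tv_kernel = float(np.sum(K * L))

print(tv_spectral, tv_vertex, float(x_high @ L @ x_high))
print(tv_trace, tv_kernel)
```

On this toy graph the smooth signal has a much smaller total variation than the alternating one, which is the sense in which (4.2) separates low- from high-frequency signals.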
In the next section, this observation is used to develop a graph learning framework that is able to capture nonlinear relations between graph signals.

4.3 Methods

4.3.1 Signed Graph Learning

In (4.3), an unsigned graph is learned with the assumption that the observed graph signals have a low-frequency representation in the graph spectral domain. In order to learn a signed graph $G$, one needs to make some additional assumptions about the graph signals $X$. In this work, we make the following assumptions:

1. Signal values on nodes connected by positive edges are similar to each other, i.e., variation over positive edges is small.
2. Signal values on nodes connected by negative edges are dissimilar to each other, i.e., variation over negative edges is large.

From a GSP perspective, these assumptions correspond to graph signals being low- and high-frequency over positive and negative edges, respectively. Let $G^+$ be the graph corresponding to the positive edges of $G$ and let $G^-$ be the graph corresponding to the negative edges of $G$, with edge weights equal to the absolute values of the original edge weights. Assumption 1 implies that the graph signals have a low-frequency representation in the graph Fourier domain of $G^+$. On the other hand, assumption 2 implies that the graph signals have a high-frequency representation in the graph Fourier domain of $G^-$. We use (4.2) to quantify how well the graph signals fit these assumptions. Thus, to learn an unknown signed graph, we minimize $\mathrm{trace}(X^\top L^+ X)$ with respect to $L^+$ while maximizing $\mathrm{trace}(X^\top L^- X)$ with respect to $L^-$:

$$\begin{aligned}
\min_{L^+, L^- \in \mathcal{L}} \quad & \mathrm{trace}(X^\top L^+ X) - \mathrm{trace}(X^\top L^- X) + \alpha_1 \|L^+\|_F^2 + \alpha_2 \|L^-\|_F^2 \\
\text{s.t.} \quad & \mathrm{trace}(L^+) = 2n, \quad \mathrm{trace}(L^-) = 2n, \\
& L^+_{ij} = 0 \text{ if } L^-_{ij} \ne 0 \ \text{ and } \ L^-_{ij} = 0 \text{ if } L^+_{ij} \ne 0 \quad \forall i \ne j,
\end{aligned} \tag{4.4}$$

where the Frobenius norms and the first two constraints are similar to (4.3), and the last constraint ensures that $L^+$ and $L^-$ are not both non-zero at the same index.
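To make the role of the two trace terms in (4.4) concrete, the sketch below evaluates the objective for fixed Laplacians on a three-node toy example. It does not solve the optimization problem: the helper names, the toy signals and the values of α1 and α2 are assumptions made purely for illustration. Node 1 is constructed to be similar to node 0 and node 2 dissimilar to node 0, so a graph with a positive edge (0, 1) and a negative edge (0, 2) should score lower than the graph with the signs swapped.

```python
import numpy as np

def laplacian(W):
    """Graph Laplacian L = D - W of a symmetric non-negative weight matrix."""
    return np.diag(W.sum(axis=1)) - W

def single_edge_graph(n, i, j, weight):
    """Weight matrix of an n-node graph with one undirected edge (i, j)."""
    W = np.zeros((n, n))
    W[i, j] = W[j, i] = weight
    return W

def signed_objective(L_pos, L_neg, X, alpha1=0.1, alpha2=0.1):
    """Value of the objective in (4.4) for fixed Laplacians.

    The constraints of (4.4) are not enforced here; the caller is
    expected to pass Laplacians that already satisfy them.
    """
    fit = np.trace(X.T @ L_pos @ X) - np.trace(X.T @ L_neg @ X)
    penalty = alpha1 * np.sum(L_pos ** 2) + alpha2 * np.sum(L_neg ** 2)
    return float(fit + penalty)

rng = np.random.default_rng(0)
n, p = 3, 50
base = rng.uniform(0.0, 1.0, size=p)
X = np.vstack([base,                               # node 0
               base + 0.05 * rng.normal(size=p),   # node 1: similar to node 0
               1.0 - base])                        # node 2: dissimilar to node 0

w = float(n)  # a single edge of weight n gives trace(L) = 2n
L_pos = laplacian(single_edge_graph(n, 0, 1, w))   # positive edge (0, 1)
L_neg = laplacian(single_edge_graph(n, 0, 2, w))   # negative edge (0, 2)

obj_matching = signed_objective(L_pos, L_neg, X)   # signs match the data
obj_swapped = signed_objective(L_neg, L_pos, X)    # signs swapped

print(obj_matching, obj_swapped)
```

Both candidate graphs satisfy the trace and complementary-support constraints and have identical Frobenius penalties, so the difference between the two objective values comes entirely from the data-fit terms.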
4.3.2 Kernelized Signed Graph Learning

As mentioned in Section 4.2.4, kernels are used in learning algorithms to exploit various (nonlinear) relations between input data. This is especially crucial in GRN inference, as shown in [130], where 17 different association measures between gene expressions are compared in terms of their performance in GRN inference and various other tasks on single-cell transcriptomic datasets. In this work, we consider three kernels: the correlation coefficient, $r$; a measure of proportionality, $\rho$ [131]; and a modification of Kendall’s tau, $\tau_{zi}$, for zero-inflated non-negative continuous data [132]. These kernels are selected because $r$ is a commonly used measure for network inference, $\rho$ performs the best in [130], and $\tau_{zi}$ can handle the high ratio of dropouts in scRNAseq data.

In more detail, kernels provide an efficient way of capturing the non-linear associations between input data samples, which is especially crucial in building single cell GRN learning algorithms due to the unique statistical properties exhibited by single cell data. To determine optimal measures of association in single cell network learning, [130] evaluated 17 association measures in terms of their ability to reconstruct gene and cellular networks from single cell transcriptomic datasets. Two measures, $\rho$ [131], a measure of association for compositional data, and $\tau_{zi}$, a measure of association for zero-inflated non-negative continuous data [132], are shown to perform consistently better in all learning scenarios investigated in [130]. The strong performance of $\rho$ can be explained on the basis that scRNA-seq captures only a small proportion of messenger RNA in each cell, and therefore gene expression measurements can be viewed as relative measures of abundance (as seen in compositional data).
On the other hand, $\tau_{zi}$, a modification of Kendall’s rank correlation coefficient, is expected to provide less biased estimates of association in the setting of zero-inflated continuous data, a characteristic of single cell transcriptomic datasets [132]. To compare and contrast these two measures, the correlation kernel $r$ is additionally investigated, since it is widely used in GRN reconstruction algorithms.

In its current form, (4.4) cannot be used directly with different association measures. Thus, the optimization problem in (4.4) is extended using kernels. The first term in (4.4) can be written as $\mathrm{trace}(XX^\top L^+) = \sum_{i,j} \langle X_{i\cdot}, X_{j\cdot} \rangle L^+_{ij}$, and the second term can be written similarly. By replacing the dot products with a given kernel function, i.e., $\kappa(X_{i\cdot}, X_{j\cdot})$, the problem in (4.4) can be extended to incorporate different associations as:

$$\begin{aligned}
\min_{L^+, L^- \in \mathcal{L}} \quad & \mathrm{trace}(K L^+) - \mathrm{trace}(K L^-) + \alpha_1 \|L^+\|_F^2 + \alpha_2 \|L^-\|_F^2 \\
\text{s.t.} \quad & \mathrm{trace}(L^+) = 2n, \quad \mathrm{trace}(L^-) = 2n, \\
& L^+_{ij} = 0 \text{ if } L^-_{ij} \ne 0 \ \text{ and } \ L^-_{ij} = 0 \text{ if } L^+_{ij} \ne 0 \quad \forall i \ne j,
\end{aligned} \tag{4.5}$$

where $K \in \mathbb{R}^{n \times n}$ is the kernel matrix with $K_{ij} = \kappa(X_{i\cdot}, X_{j\cdot})$. From a GSP perspective, this modification implies that the graph signals on each node, i.e., $X_{i\cdot}$, are first mapped to a (higher dimensional) Hilbert space and the signed graph is learned in this new space. Namely, let $\Phi \in \mathbb{R}^{n \times \hat{p}}$ be the matrix constructed by mapping the $X_{i\cdot}$'s to the Hilbert space $\mathcal{H}$ with dimension $\hat{p}$, where the rows of $\Phi$ are $\phi(X_{i\cdot})$. When learning an unknown signed graph $G$ with a kernel, each column of $\Phi$ is a graph signal over $G$, and these signals are assumed to have low- and high-frequency representations with respect to $G^+$ and $G^-$, respectively. Extending the signed graph learning problem in (4.4) using kernels brings flexibility, and any association metric in [130] can be implemented in this framework if it is a positive semi-definite kernel. The optimization procedure for (4.5) is given in the Supplementary Material.
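The data-fit part of (4.5) can be sketched with the correlation kernel r, which is the only one of the three kernels implemented here; the measures ρ and τzi are not reproduced. The toy signals, the helper name `single_edge_laplacian` and the candidate graphs are illustrative assumptions, and the Frobenius penalties are omitted because they are identical for the two candidates being compared.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 3, 50
base = rng.uniform(0.0, 1.0, size=p)
X = np.vstack([base,                               # node 0
               base + 0.05 * rng.normal(size=p),   # node 1: tracks node 0
               1.0 - base])                        # node 2: mirrors node 0

# Correlation kernel: K_ij is the Pearson correlation between rows i and j.
K = np.corrcoef(X)

def single_edge_laplacian(n, i, j, weight):
    """Laplacian of an n-node graph with one undirected edge (i, j)."""
    L = np.zeros((n, n))
    L[i, i] = L[j, j] = weight
    L[i, j] = L[j, i] = -weight
    return L

w = float(n)  # a single edge of weight n gives trace(L) = 2n
L_pos = single_edge_laplacian(n, 0, 1, w)   # candidate activating edge
L_neg = single_edge_laplacian(n, 0, 2, w)   # candidate inhibitory edge

# Data-fit part of (4.5) for the matching and the sign-swapped graphs.
fit_matching = float(np.trace(K @ L_pos) - np.trace(K @ L_neg))
fit_swapped = float(np.trace(K @ L_neg) - np.trace(K @ L_pos))

# A valid kernel matrix should be positive semi-definite (up to round-off).
min_eig = float(np.linalg.eigvalsh(K).min())

print(fit_matching, fit_swapped, min_eig)
```

For a kernel with unit diagonal, such as r, the term trace(KL) for a single edge (i, j) of weight w reduces to 2w(1 − K_ij), so minimizing trace(KL+) favours positive edges between strongly associated genes while maximizing trace(KL−) favours negative edges between negatively associated ones.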
4.3.3 Hyperparameter Selection

The optimization problem in (4.5) requires the selection of two regularization parameters, $\alpha_1$ and $\alpha_2$, which determine the density of the learnt graph, i.e., large values of $\alpha_1$ ($\alpha_2$) result in denser $L^+$ ($L^-$). Their values can be set to obtain a graph with desired positive and negative edge densities. Next we describe the algorithm used to select them.

1. Given a matrix $X \in \mathbb{R}^{n \times p}$ whose columns are graph signals, we randomly shuffle each column of the matrix $k$ times, creating $k$ surrogate data matrices.
2. Associations between rows of the surrogate data matrices are calculated by the kernel employed in (4.5).
3. Thresholds $\lambda_1$ and $\lambda_2$ are selected as the $p$th and $(100 - p)$th percentiles of the values in the association matrix calculated in Step 2.
4. Steps (1-3) are repeated $k$ times to construct the empirical distribution of the thresholds $\lambda_1$ and $\lambda_2$.
5. Finally, $\hat{\lambda}_1$ and $\hat{\lambda}_2$ are selected to be the medians of the empirical distributions constructed in Step 4.
6. The association matrix for the original data $X$ is constructed.
7. The number of entries in the association matrix that are smaller than $\hat{\lambda}_1$ is determined and normalized by the total number of entries in the association matrix to obtain the density of $L^-$. Similarly, the number of entries in the association matrix greater than $\hat{\lambda}_2$ is used to determine the density of $L^+$.
8. Values of $\alpha_1$ and $\alpha_2$ are then selected to learn graphs with the estimated graph densities found in Step 7. Since the density of the positive (negative) graph increases monotonically with the value of $\alpha_1$ ($\alpha_2$), bisection search is used to determine the values of $\alpha_1$ and $\alpha_2$ that give the desired densities.

4.3.4 Generation of simulated datasets from zero-inflated negative binomial distribution

In this section, we outline the algorithm that was used to generate the simulated datasets for parameter sensitivity analysis.

1. For each simulation setting, we first generated a binary association graph $G_{j_1 j_2}$, $\forall j_1, j_2 \in \{1, \dots, p\}$, using one of the three graph topologies: random, hub and cluster.
2. A binary indicator $I_{j_1 j_2}$ was next sampled for each entry of the association graph, with $I_{j_1 j_2} \sim \mathrm{Bernoulli}(0.5)$.
3. Given the binary association graph, a weight matrix was generated as:

$$W_{j_1 j_2} = \begin{cases} G_{j_1 j_2}\, \mathrm{Unif}(0.3, 0.7) & \text{if } I_{j_1 j_2} = 0, \\ G_{j_1 j_2}\, \mathrm{Unif}(-0.7, -0.3) & \text{if } I_{j_1 j_2} = 1. \end{cases}$$

4. Random samples of $n$ multivariate Gaussian random variables were then generated, with known weight matrix $W_{j_1 j_2}$. The random sample was denoted as $(X_1, \dots, X_p)$, where each variable (gene vector) $X_j = (X_{j1}, \dots, X_{jn})^\top$ consisted of $n$ realizations.
5. To mimic the dropout phenomenon present in real scRNAseq datasets, we next introduced additional zeros to the gene expression matrix. Following [13], the dropout probability for each row (gene vector) in the gene expression matrix $X$ was calculated as $p_{ji} = \exp(-\rho X_{ji}^2)$, where $\rho$ represents the exponential decay parameter that controls the dependence between the dropout probability and gene expression.
6. A binary indicator was next sampled for each entry: $\eta_{ji} \sim \mathrm{Bernoulli}(p_{ji})$, with $\eta_{ji} = 1$ indicating that the corresponding entry $X_{ji}$ would be replaced by 0. The dropout probability for each gene vector was calculated as $\omega_j = \sum_{i=1}^{n} \eta_{ji}$.
7. Using a modification of the NORTA (Normal to Anything) method [133], we generated samples from a multivariate zero-inflated negative binomial distribution based on the multivariate normal samples generated in Step 4, using mean, dispersion and zero-inflation parameters $\lambda$, $k$ and the $\omega_j$'s.
8. To mirror the behaviour of real scRNA-seq gene expression data, the gene expression mean $\lambda$ and dispersion $k$ were estimated from a real scRNA-seq dataset, Peripheral Blood Mononuclear Cells (PBMC), freely available from 10X Genomics.
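Steps 1 to 6 above can be sketched as follows. This is a simplified illustration under stated assumptions, not the simulator used in this chapter: the random graph is drawn with an illustrative edge probability of 0.3, the Gaussian samples are generated from a diagonally dominant precision matrix built from W (one possible way of making the weight matrix "known" to the sampler), the per-gene dropout summary is normalized by the number of cells, and the NORTA transformation to zero-inflated negative binomial counts (Step 7) and the parameter estimation from PBMC data (Step 8) are omitted. All parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, rho = 8, 200, 0.5   # genes, cells, decay parameter (illustrative values)

# Step 1: symmetric binary association graph without self-loops
# (Erdos-Renyi with edge probability 0.3, an illustrative choice).
upper = np.triu(rng.random((p, p)) < 0.3, k=1)
G = (upper | upper.T).astype(float)

# Step 2: symmetric sign indicator, Bernoulli(0.5) for every gene pair.
flip_upper = np.triu(rng.random((p, p)) < 0.5, k=1)
sign_flip = flip_upper | flip_upper.T

# Step 3: edge weights in (0.3, 0.7), negated where the indicator equals 1.
mag_upper = np.triu(rng.uniform(0.3, 0.7, size=(p, p)), k=1)
magnitude = mag_upper + mag_upper.T
W = G * np.where(sign_flip, -magnitude, magnitude)

# Step 4 (assumption): use W as the off-diagonal part of a precision matrix,
# made positive definite through strict diagonal dominance.
Theta = np.diag(np.abs(W).sum(axis=1) + 1.0) - W
Sigma = np.linalg.inv(Theta)
Sigma = (Sigma + Sigma.T) / 2.0                      # remove round-off asymmetry
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n).T   # genes x cells

# Step 5: dropout probability decays with squared expression.
P_drop = np.exp(-rho * X ** 2)

# Step 6: sample dropout indicators and zero out the selected entries.
eta = rng.random((p, n)) < P_drop
X_zi = np.where(eta, 0.0, X)
omega = eta.mean(axis=1)   # per-gene fraction of dropped entries (assumed normalization)

print(W.round(2))
print(omega.round(2))
```

With this construction a positive weight W_{j1 j2} corresponds to a positive partial correlation between the two genes and a negative weight to a negative one, which is one way to obtain signed ground truth associations from a Gaussian model.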
4.3.5 Performance Metrics

4.3.5.1 AUPRC and AUROC:

The area under the precision-recall curve (AUPRC) and the area under the receiver operating characteristic curve (AUROC) are calculated by comparing inferred graphs to ground truth gene regulations. During this calculation, the signs of the learned edges are ignored, as AUPRC and AUROC are performance metrics restricted to binary classification. In particular, we first take the absolute values of the edge weights and then compare them to the ground truth edges. Thus, these metrics indicate how well methods detect edges without considering the signs of the inferred edges. Ground truth networks are considered as undirected and self-loops are ignored. Following [3], we also define the AUPRC ratio and AUROC ratio as the ratio of the AUPRC (AUROC) value of a method to the AUPRC (AUROC) of the random estimator.

4.3.5.2 AUPRC Activating/Inhibitory:

One of our goals is to learn whether the edges are activating or inhibitory. AUPRC as defined above cannot evaluate the sign information. Thus, for curated datasets, whose ground truth gene regulations include signed edge information, we calculate AUPRC for activating and inhibitory edges separately. In particular, for methods that learn signed graphs, we compare the learned positive edges to the activating edges in the ground truth and the learned negative edges to the inhibitory edges in the ground truth. For methods that do not learn signed edges, we evaluate the inferred edges with respect to the ground truth activating and inhibitory edges separately to calculate two AUPRC values.

4.3.5.3 EPR

Early precision ratio is the fraction of true positives in the top-$k$ edges of the inferred graph, where $k$ is the number of edges in the ground truth network [3]. For methods that return signed edges, we found the top-$k$ edges after taking the absolute values of the edge weights; thus, this metric is used for edge detection performance rather than the detection of edge signs.
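The ratio metrics defined in this section can be sketched on a toy network as follows. This is a minimal illustration under several assumptions: AUPRC is approximated by average precision, the random estimator is taken to have AUPRC and early precision equal to the edge density of the ground truth, the ground truth is treated as undirected for both metrics, and the helper names and the toy graphs are hypothetical.

```python
import numpy as np

def upper_triangle(A):
    """Off-diagonal upper-triangular entries (undirected, no self-loops)."""
    return A[np.triu_indices_from(A, k=1)]

def average_precision(scores, labels):
    """Step-wise approximation of the area under the precision-recall curve."""
    order = np.argsort(-scores, kind="stable")
    hits = labels[order].astype(float)
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(np.sum(precision_at_k * hits) / hits.sum())

def early_precision(scores, labels):
    """Fraction of true edges among the top-k scores, k = number of true edges."""
    k = int(labels.sum())
    top = np.argsort(-scores, kind="stable")[:k]
    return float(labels[top].mean())

# Hypothetical ground truth with edges (0, 1) and (2, 3).
truth = np.zeros((4, 4))
truth[0, 1] = truth[1, 0] = 1
truth[2, 3] = truth[3, 2] = 1

# Hypothetical inferred signed graph; signs are dropped before scoring.
W_hat = np.zeros((4, 4))
W_hat[0, 1] = W_hat[1, 0] = 0.9
W_hat[2, 3] = W_hat[3, 2] = -0.8
W_hat[0, 2] = W_hat[2, 0] = 0.1

labels = upper_triangle(truth).astype(int)
scores = np.abs(upper_triangle(W_hat))

density = float(labels.mean())   # assumed score of the random estimator
auprc_ratio = average_precision(scores, labels) / density
epr_ratio = early_precision(scores, labels) / density

print(auprc_ratio, epr_ratio)
```

In this example both true edges receive the two largest absolute weights, so average precision and early precision both equal 1 and each ratio equals the reciprocal of the edge density.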
Ground truth networks are considered to be directed graphs and self-loops are ignored. Finally, EPR ratio is defined as the ratio of the EPR values of the methods to the EPR value of the random estimator.

4.4 Results

In this section, the performance of scSGL is evaluated and compared to state-of-the-art GRN inference methods on various simulated and experimental scRNAseq datasets. We selected GENIE3 [52], GRNBOOST2 [53], PIDC [134] and PPCOR [50] for comparison, as they are the top performing methods in [3]. GENIE3, GRNBOOST2 and PPCOR were originally developed for bulk analysis, while PIDC was developed for single cell gene expression data. Among these methods, GENIE3 and GRNBOOST2 return fully connected directed networks, while the remaining two infer undirected networks. Finally, only the PPCOR algorithm returns signed graphs. Given the inherent sparsity of gene networks, we used the area under the precision-recall curve (AUPRC) ratio as the primary evaluation metric. Supplementary Material also includes results using the area under the receiver operating characteristic curve (AUROC) ratio and early precision ratio (EPR) ratio, whose overall results are consistent with the following observations made with AUPRC ratio.

[Figure 4.2: heatmap with one panel per dataset and edge type. Cell values are listed below; within each panel the columns are dropout ratios 0%, 50% and 70%. Act. = Activating, Inh. = Inhibitory.]

Method      | GSD Act.         | GSD Inh.         | HSC Act.         | HSC Inh.         | mCAD Act.        | mCAD Inh.        | VSC Inh.
            | 0%    50%   70%  | 0%    50%   70%  | 0%    50%   70%  | 0%    50%   70%  | 0%    50%   70%  | 0%    50%   70%  | 0%    50%   70%
GENIE3      | 1.71  1.78  1.92 | 1.18  1.16  0.98 | 3.19  3.24  3.18 | 2.71  2.23  1.90 | 1.69  1.88  2.33 | 1.05  1.03  0.98 | 2.78  2.14  1.46
GRNBOOST2   | 1.60  1.65  1.50 | 1.43  1.42  1.29 | 2.93  3.10  3.02 | 2.93  2.54  2.24 | 1.72  1.64  2.00 | 1.00  0.99  0.99 | 2.73  2.24  1.77
PIDC        | 1.85  1.83  1.73 | 1.27  1.25  1.22 | 3.12  3.12  3.18 | 2.84  2.53  2.11 | 1.72  1.70  1.90 | 1.09  1.05  1.06 | 2.69  2.81  2.59
PPCOR       | 2.49  2.22  1.76 | 1.45  1.33  1.50 | 3.66  3.61  3.37 | 3.23  2.93  2.43 | 2.20  2.17  2.24 | 1.34  1.37  1.56 | 2.62  2.49  2.52
scSGL-r     | 2.64  2.43  2.26 | 2.08  2.06  2.02 | 3.57  3.53  3.51 | 3.12  3.28  3.06 | 2.50  2.50  2.48 | 1.53  1.50  1.50 | 2.72  2.84  2.87
scSGL-ρ     | 2.42  2.61  2.36 | 1.81  2.03  2.16 | 3.77  3.76  3.75 | 2.54  2.87  2.23 | 3.06  2.88  2.48 | 1.67  1.63  1.72 | 2.70  2.57  2.61
scSGL-τzi   | 2.74  2.57  2.27 | 2.23  2.30  2.32 | 3.83  3.76  3.80 | 3.02  3.31  2.75 | 2.75  2.69  2.48 | 1.46  1.48  1.55 | 2.69  2.67  2.78

Figure 4.2 Performance of scSGL and state-of-the-art methods on curated datasets as measured by AUPRC for activating and inhibitory edges. The x-axis indicates the dropout ratio in the dataset.

4.4.1 Synthetic Datasets

Curated Datasets From BEELINE: The first simulation datasets we consider are curated from "published Boolean models of GRNs" [3]. These datasets were generated using the recently proposed single cell GRN simulator BoolODE [3]. BoolODE converts Boolean functions specifying a GRN directly to ordinary differential equations (ODEs) using GeneNetWeaver [135, 136], a widely used method to simulate bulk transcriptomic data from GRNs. These datasets are generated from four literature-curated Boolean models: mammalian cortical area development (mCAD), ventral spinal cord (VSC) development, hematopoietic stem cell (HSC) differentiation and gonadal sex determination (GSD). These models represent different types of graph structures, with varying numbers of positive and negative edges, thus serving as good examples for illustrating the robustness of the proposed method in modelling signed graph topologies. BoolODE is used to create ten random simulations of the synthetic gene expression datasets with 2,000 cells for each model. For each dataset, one version with a dropout rate of 50% and another with a rate of 70% are also considered to evaluate the performance of the methods under missing values.

AUPRC ratios are calculated separately for activating and inhibitory edges and their

Parameter Sensitivity Analysis: To mimic the zero-inflated and overly dispersed nature of most scRNAseq datasets, we simulated gene expression data from a multivariate zero-inflated negative binomial (ZINB) distribution for our second simulation.
These datasets were then used to conduct parameter sensitivity analysis for the proposed methods. Given a known graph structure, synthetic datasets are generated from a ZINB distribution by adapting an algorithm developed by [133]. The three parameters of the ZINB distribution, $\lambda$, $k$ and $\omega$, which control its mean, dispersion and degree of zero-inflation, respectively, were determined from a real scRNAseq dataset so that the simulations mirror the properties of real datasets. The ZINB simulator is then used to generate expression data from three different network structures: random networks, networks with a given community structure and networks with hubs. Random networks are generated using the Erdős–Rényi model with a desired edge density. Since the Erdős–Rényi model is not realistic due to its binomial degree distribution, we also consider networks with hubs. These networks are generated using a Barabási–Albert model, whose degree distribution follows a power-law function. Finally, networks with community structure, also known as modular networks, are generated using a disjoint union of random graphs. The accuracy of the scSGL inferred graphs was then evaluated for all three graph structures.

To investigate the robustness of scSGL, we simulated datasets from the aforementioned network topologies by varying the following parameters: (i) number of genes (10, 50, 100 and 250), (ii) number of cells (100, 300, 500 and 1000) and (iii) dropout probabilities (0.26-0.36). To account for the inherent randomness of the simulations, 10 independent data replicates were generated for every parameter combination, and the mean AUPRC ratios obtained by averaging over the replicates are reported in Figure 4.3.

Recent investigations of scRNAseq datasets have revealed that dropout rates are primarily driven by a combination of technical and biological factors [108].
Consequently, while mean gene expression and the proportion of zeros are linked, this relationship may vary based on cell type, sex, and other biological and technical factors. While investigating the impact of dropout rates on network estimation accuracy, we found a steady decline in AUPRC ratios for all methods with an increase in the number of zeroes. scSGL, irrespective of the kernel choice, maintained the highest AUPRC ratios across all network topologies. Gene expression in scRNAseq datasets can be interpreted as a relative measure of abundance, because the datasets combine gene expression derived from several cell types. This could be the reason why proportionality measures perform well [130]. The strong performance of τzi can be explained by the fact that it explicitly models the dropouts present in scRNAseq datasets. Despite the poor performance of regularized correlation networks (PPCOR), we see a strong performance of scSGL when using the correlation kernel. This suggests that gene-gene relationships are in fact non-linear in nature. This belief is also strengthened by the above-average performance of tree-based machine learning algorithms like GENIE3 and GRNBOOST2. It is to be noted that PIDC, the only other method capable of modelling excess zeroes while accounting for non-linear relationships, fails to achieve top-ranking AUPRC ratios.

Next, we evaluated the impact of the number of cells on network reconstruction. Figure 4.3 demonstrates a clear rise in AUPRC ratios when the number of cells is increased. PIDC, the only other single cell network estimation technique, achieves below-average performance at the lowest sample size of 100. This could be due to the fact that PIDC requires large sample sizes for accurate estimation of the pairwise joint probability distributions used to calculate mutual information. In general, PPCOR has the worst performance among all methods.
It should also be noted that the performance of GRNBOOST2 was equivalent to that of scSGL for all the network topologies when the sample size was 10 times the number of genes. These results indicate the importance of sample size in accurate network estimation for all of the methods and network topologies considered.

Finally, the performance of each of the methods was evaluated by varying the number of genes. All methods had high AUPRC ratios across network topologies when the number of genes was small. While the AUPRC ratios of all the methods declined with an increase in the number of genes, scSGL performed significantly better than most of the benchmarking methods. This dip in performance could be attributed to the fact that all methods learn very dense networks. With an increase in the number of nodes, there is an increasing number of false edges detected by every algorithm. The performance of scSGL could further be improved with a more biologically informed framework for hyperparameter selection.

Computational Complexity: Methods are compared in terms of their scalability to datasets with a large number of genes. For this purpose, the synthetic data generation process used in the parameter sensitivity analysis is employed to create three datasets with 500, 1000 and 2000 genes. Each dataset is generated from a Barabási–Albert model, includes 1000 cells, and has a dropout ratio of 0.26. Average run time and AUPRC ratios over 10 replicates are reported in Figure 4.4. We report results only for the correlation kernel, as the other kernels have similar performances and run times. It is observed that scSGL runs significantly faster than GENIE3, GRNBOOST2 and PIDC while having superior performance in terms of AUPRC ratio. Although PPCOR runs faster than scSGL, it shows poor performance.
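The synthetic data generation described in this section can be illustrated with a minimal sketch. It builds the three network topologies (Erdős–Rényi, Barabási–Albert and a disjoint union of random graphs) and draws zero-inflated negative binomial counts using only the Python standard library. The function names and parameter values are hypothetical, ω is assumed here to be the probability of a structural zero, and genes are drawn independently: unlike the algorithm adapted from [133], this sketch does not impose the graph structure on the expression values.

```python
import math
import random


def erdos_renyi(n_genes, p, rng):
    """Random network: each possible undirected edge is kept with probability p."""
    return {(i, j) for i in range(n_genes) for j in range(i + 1, n_genes)
            if rng.random() < p}


def barabasi_albert(n_genes, m, rng):
    """Hub network: each new node attaches to m existing nodes chosen with
    probability proportional to their current degree."""
    edges, targets, repeated = set(), list(range(m)), []
    for new in range(m, n_genes):
        edges.update((t, new) for t in targets)
        # Each node appears in `repeated` once per incident edge, so a uniform
        # draw from it is a degree-weighted draw over nodes.
        repeated.extend(targets)
        repeated.extend([new] * m)
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(repeated))
        targets = list(chosen)
    return edges


def modular(block_sizes, p, rng):
    """Community structure: disjoint union of Erdős–Rényi graphs."""
    edges, offset = set(), 0
    for size in block_sizes:
        edges |= {(i + offset, j + offset) for i, j in erdos_renyi(size, p, rng)}
        offset += size
    return edges


def poisson(mu, rng):
    """Knuth's multiplication method; adequate for the modest means used here."""
    threshold, count, prod = math.exp(-mu), 0, rng.random()
    while prod > threshold:
        count += 1
        prod *= rng.random()
    return count


def zinb(lam, k, omega, rng):
    """One ZINB draw: structural zero with probability omega (assumed meaning),
    otherwise a negative binomial with mean lam and dispersion k, sampled as a
    gamma-Poisson mixture."""
    if rng.random() < omega:
        return 0
    return poisson(rng.gammavariate(k, lam / k), rng)


rng = random.Random(0)
hub_graph = barabasi_albert(n_genes=100, m=2, rng=rng)
# Independent marginal draws for 10 cells x 100 genes (illustrative values only).
counts = [[zinb(lam=3.0, k=0.5, omega=0.3, rng=rng) for _ in range(100)]
          for _ in range(10)]
```

In such a sketch, the parameters λ, k and ω would be estimated from a real scRNAseq dataset, as described above, rather than fixed by hand.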
Further discussion on the computational and storage complexity of scSGL is provided in [2].

4.4.2 Real Datasets

For real datasets, we consider scRNAseq expressions of human embryonic stem cells (hESC) and mouse embryonic stem cells (mESC), which include 758 and 451 cells, respectively. We inferred GRNs between 500 highly varying genes along with highly varying TFs [3]. Inferred GRNs are compared to three different databases of gene regulations: STRING [137], cell-type specific [138] and nonspecific [139, 140, 141]. AUPRC ratios are reported in Figure 4.5. All methods have performance values close to those of a random estimator. Except for PPCOR, which has random performance in both datasets and for all databases, the methods have comparable performances, with scSGL showing slightly better performance in hESC and the benchmarking methods working slightly better in mESC.

To add biological meaning to the estimated networks, we compared them to the reference networks in the STRING database. The STRING database is a compendium of protein-protein interactions created by gathering information from varying sources like experimental studies, text mining etc. The edges in the STRING network are classified as high confidence (minimum score of 0.700), medium confidence (minimum score of 0.400) and low confidence (minimum score of 0.150). In the hESC dataset, scSGL-ρ identified the maximum number of high confidence associations present in the STRING reference network. scSGL-ρ, scSGL-r and scSGL-τzi identified 60, 56 and 24 high confidence STRING interactions, respectively, with an edge confidence greater than 0.5. The interactions identified by scSGL-r form a network of 56 unique genes including genes Nanog, Sox2, Sox4, Pou5f1, Ctnnb1, Gata2, Gata3 and many others. Lineage-specific marker genes, Cdk6, Col5a1, Vim, and Itg5, which are known to have regulatory roles in cell differentiation, were also detected by scSGL-r but with edge confidence less than 0.5 (0.1-0.3) [142, 143].
scSGL-ρ and scSGL-r identified 20 common genes including Sox2, Sox4, Gata6, Ctnnb1 and Bmp4. scSGL-τzi identified the least number of genes but successfully retrieved lineage markers Nanog, Sox2, Sox4, Pou5f1, Ctnnb1, Gata2, Gata3. All three kernel methods identified genes Sox4, Ctnnb1, Bmp4 and Gata6. According to the STRING database, the 56 genes identified by scSGL-r are associated with 839 significantly enriched biological process gene ontology (GO) terms that include cell differentiation, chromosome separation, specification of animal organ position, mitotic nuclear division and organ formation. Genes identified by scSGL-ρ and scSGL-τzi had similar functional enrichments for biological processes.

Figure 4.5 Performance of methods for two real-world scRNAseq datasets. Inferred graphs are compared to three different gene regulatory databases. [Heatmap panels: hESC AUPRC ratio and mESC AUPRC ratio; rows: GENIE3, GRNBOOST2, PIDC, PPCOR, scSGL-r, scSGL-ρ, scSGL-τzi; columns: the three reference databases.]

To demonstrate some of the learned associations in hESC, we plotted the subnetwork of 24 lineage specific marker genes using scSGL [143]. Figure 4.6A shows the presence of activating relationships between key definitive endoderm (DE) markers like Gata6, Gata4, and Eomes and joint inhibition of pluripotency markers Pou5f1, Nanog, and Sox2. Gata4 and Gata6 have been reported as necessary for the development and function of a number of endoderm-derived tissues and cells [144, 145], and the onset of Gata4 and Gata6 expression has been reported to be coincident with the beginning of endoderm gene expression [146].
In addition, inhibition of pluripotency markers by the key DE markers indicates progression of the cells towards a DE state. We also learned day specific scSGL graphs for the 24 marker genes by first clustering the dataset over time points (0, 12, 24, 36, 72 and 96 hrs). Day specific graphs showed that scSGL can effectively recover gene network changes from data clustered over time points (Section 7; Supplementary Data).

In the mESC dataset, scSGL-ρ, scSGL-r and scSGL-τzi identified 67, 103 and 55 high confidence STRING interactions, respectively, with an edge confidence greater than 0.5. The three estimated networks capture interactions regulated by known transcription factors Sox2, Nanog, Klf4, Myc and Sall4 [147]. scSGL-r identified known relationships between Sox2 and Nanog, and of Esrrb with Sox2 and Rybp, among many others. scSGL-ρ identified known relationships between Esrrb and Etv5 and indirect interactions between Sall4 and Rybp regulated by TF Oct4. scSGL-τzi identified most of the important relationships identified by scSGL-r along with additional relationships between Sox2, Nanog, and Rif1. According to the STRING database, the 103 genes identified by scSGL-r are associated with 908 significantly enriched biological process GO terms that include cell fate determination, specification and commitment, mitotic DNA replication and regulation of nodal signalling pathway. Similar to the hESC analysis, scSGL, irrespective of the chosen kernel, identified genes with similar functional enrichments for biological processes. To demonstrate some of the learned associations in mESC, we plotted the subnetwork of 19 well known marker genes and TFs in mESC differentiation, estimated using scSGL. As can be seen in Figure 4.6B, Nanog, Gata4, Sox2, Sox17, Zfp42 and Lefty1 emerge as some of the hub nodes with high degrees of associations.
The learned network also captures vital signed associations between Sox2, Nanog, Sox17, Zfp42 and Gata4. It is well known that Sox2 and Nanog form the core of a transcription factor network that promotes embryonic stem cell pluripotency and self-renewal. Zfp42 is also known to be a direct target of Nanog, which is augmented by Sox2 [148]. In addition, Sox17 together with Gata4 expression reinforce a transcriptional network that antagonizes Nanog expression to initiate differentiation [149].

Figure 4.6 The subnetworks of 24 lineage specific genes in hESC (A) and 19 well known marker genes in mESC (B). We report results of scSGL-r as it has the highest AUPRC ratio in Figure 4.5. For clarity, only those edges whose absolute edge weight falls into the top 1 percentile are shown. Node sizes are proportional to their degrees.

Finally, to analyze the relation between edges identified by scSGL and benchmarking methods, the intersection between the top 1000 edges is reported as an UpSet plot [150] in Figure 4.7. In both datasets, PPCOR does not have any intersection with other methods, probably because of its poor performance reported in Figure 4.5. The remaining 6 methods have an intersection set with cardinality around 40 edges. The same number of common edges is found in the intersection of PIDC, GENIE3, GRNBOOST2, scSGL-τzi, scSGL-r and in the intersection of PIDC, GENIE3, scSGL-τzi, scSGL-r, scSGL-ρ. These observations hold for both datasets, indicating the reproducibility of the proposed approach across different datasets.
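The top-edge comparison described here can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to produce Figure 4.7: the function names are hypothetical, the edge scores are made-up toy values, and each edge is treated as an unordered gene pair so that directed and undirected methods can be compared.

```python
from itertools import combinations


def top_k_edges(scores, k):
    """Return the k edges with the largest absolute score as unordered pairs.

    `scores` maps a (gene_a, gene_b) tuple to a signed edge weight.
    """
    ranked = sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
    top, seen = [], set()
    for (gene_a, gene_b), _ in ranked:
        edge = frozenset((gene_a, gene_b))
        # Directed methods may score both (a, b) and (b, a); count the pair once.
        if edge not in seen:
            seen.add(edge)
            top.append(edge)
        if len(top) == k:
            break
    return set(top)


def overlap_summary(edge_sets):
    """Pairwise intersection sizes plus the edges shared by every method."""
    names = sorted(edge_sets)
    pairwise = {(a, b): len(edge_sets[a] & edge_sets[b])
                for a, b in combinations(names, 2)}
    shared_by_all = set.intersection(*edge_sets.values())
    return pairwise, shared_by_all


# Toy scores for two hypothetical methods (values are illustrative only).
method_a = {("Sox2", "Nanog"): 0.9, ("Nanog", "Sox2"): 0.8,
            ("Gata4", "Gata6"): -0.7, ("Sox2", "Gata6"): 0.1}
method_b = {("Nanog", "Sox2"): 0.5, ("Gata6", "Gata4"): 0.4,
            ("Sox17", "Gata4"): 0.3}

edge_sets = {"method_a": top_k_edges(method_a, 2),
             "method_b": top_k_edges(method_b, 2)}
pairwise, shared_by_all = overlap_summary(edge_sets)
```

An UpSet plot additionally reports the size of each exclusive intersection across all subsets of methods; the pairwise and all-method overlaps above are the simplest special cases.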
Edges identified by τzi and r have more intersecting edges with benchmarking methods and with each other than those identified by ρ, which indicates that the benchmarking methods have more common edges with correlation based association metrics than with proportionality measures. scSGL methods have more common edges with PIDC than with GENIE3 and GRNBOOST2, which may be due to the fact that PIDC learns a co-expression GRN similar to scSGL, while GENIE3 and GRNBOOST2 learn directed interactions between genes.

Figure 4.7 UpSet plot that shows the intersection between the top 1000 edges identified by scSGL with 3 kernels and by the benchmarking methods in the hESC and mESC datasets.

4.5 Discussion

In this chapter, we have introduced a novel network inference algorithm based on GSP. Our proposed algorithm scSGL identifies functional relationships between genes by learning the signed adjacency matrix from the gene expression data under the assumption that graph signals are similar over positive edges and dissimilar over negative edges. This novel technique also takes into account the nonlinearity of the gene interactions by employing kernel mappings. We applied scSGL to four curated datasets derived from "published Boolean models of GRN" and two real experimental scRNAseq datasets during differentiation. To conduct an in-depth analysis of gene co-expression network reconstruction from scRNAseq datasets, we generated simulations from zero-inflated negative binomial distributions. These simulations, generated using different parameter combinations, were used to investigate the robustness of our proposed methods to changing numbers of cells, gene numbers and dropout rates.
For the curated datasets, scSGL consistently obtained higher AUPRC ratios in comparison to the benchmarking methods, despite each dataset having a different number of stable cell states. Parameter sensitivity analysis reflected the superior performance of scSGL in estimating networks under varying network topologies. The performance remained consistent even when the gene numbers increased, the dropout rates were high and the sample sizes were low. This indicated the robustness of scSGL in modelling networks under varying characteristics of scRNA-seq datasets.

The networks estimated from real data using scSGL identified important functional relationships between target genes and transcription factors and exhibited enrichment for appropriate functional processes. In addition, day specific analysis of the hESC dataset showed how scSGL can be used to infer gene networks for clustered datasets (Supplementary Material). This may be particularly important in single cell network inference if cell-type specific networks are of interest. We also demonstrated that scSGL attained performance comparable to state-of-the-art methods in real data experiments, with the performance of all the GRN reconstruction methods being close to random. Accuracy evaluation of the predicted networks for the real datasets was done using the cell-type specific, non-specific and functional networks described in [3]. However, most of the information in these ground truth datasets has been accumulated based on tissue level data, and hence it is not completely appropriate to calculate precision and recall rates from these databases. Although scRNAseq techniques provide significant advantages over bulk data, such as increased sample size with higher depth coverage and the presence of highly distinct cell clusters, they also come with multiple sources of technical and biological noise.
Moreover, the inability to differentiate between technical and biological noise, and the absence of adequate noise modelling techniques, further exacerbate the problem [151, 8].

scSGL aims to capture the node similarities and dissimilarities based on distances between graph signals. These graph signals exhibit smoothness, which implies that within a given node cluster, genes tend to be homogeneous, while varying across clusters. This leads to densely connected graphs where the heterogeneity induced by distinct cell sub-populations can be simultaneously curbed. Using single cell data with cell cluster labels, easily obtained from single cell clustering algorithms [152], in conjunction with scSGL can aid in identifying functional modules that are associated with a cell type [153]. Integrating pseudotemporal ordering with scSGL can further help in identifying the functional modules associated with differential pathways [154].

Despite the availability of a large number of computational methods, accurate GRN reconstruction still remains an open problem. Most reconstruction methods are based on the assumption that the presence of an edge implies a regulatory relationship. They also have the tendency to establish links between genes regulated by the same regulator. These issues can generate a lot of false positives, and therefore additional sources of data, such as ChIP-seq measurements that help in identifying direct interactions between TFs and target genes, can provide a way to filter out the spurious interactions [155]. Finally, gene regulation has multiple layers beyond direct TF-target interaction, but functional relationships can only be established if these relationships induce persistent changes in transcriptional state. As single cell data sources over multiple modalities continue to become available, it will be interesting to see how integration of these data types aids GRN reconstruction using scSGL [156].
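For readers who wish to reproduce the evaluation metric referred to throughout this chapter, the following is a minimal sketch of an AUPRC ratio. It rests on assumptions that are not spelled out in this section: AUPRC is approximated by average precision over the ranked candidate edges, and the random-estimator baseline is taken to be the fraction of candidate edges that are true edges, so that a ratio of 1 corresponds to random performance. The function names and toy scores are hypothetical.

```python
def average_precision(scores, truth):
    """Average precision of edges ranked by decreasing score.

    `scores` maps each candidate edge to a confidence value and `truth` is the
    set of reference edges; only candidate edges are counted as positives.
    """
    n_pos = sum(1 for edge in scores if edge in truth)
    if n_pos == 0:
        raise ValueError("no reference edge is among the candidate edges")
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits, total = 0, 0.0
    for rank, edge in enumerate(ranked, start=1):
        if edge in truth:
            hits += 1
            total += hits / rank  # precision at each recovered true edge
    return total / n_pos


def auprc_ratio(scores, truth):
    """Average precision divided by the expected value for a random ranking
    (assumed here to be the density of true edges among the candidates)."""
    n_pos = sum(1 for edge in scores if edge in truth)
    baseline = n_pos / len(scores)
    return average_precision(scores, truth) / baseline


# Toy example with four candidate edges (scores are illustrative only).
toy_scores = {"e1": 0.9, "e2": 0.8, "e3": 0.1, "e4": 0.05}
good = auprc_ratio(toy_scores, {"e1", "e2"})   # true edges ranked first
poor = auprc_ratio(toy_scores, {"e3", "e4"})   # true edges ranked last
```

For signed graphs, one way to apply such a sketch separately to activating and inhibitory edges would be to rank by the signed weight for activating edges and by its negative for inhibitory edges; this is an illustrative choice rather than a description of the procedure used for the reported results.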
4.6 Acknowledgements

This chapter is based on my published paper [2]. I would like to thank the joint first author Abdullah Karaaslanli, and co-authors Dr Selin Aviyente and Dr Tapabrata Maiti for their support and advice. This work was supported by the National Science Foundation grant: CCF 2006800.

APPENDICES

APPENDIX A

BAYESIAN SINGLE CELL RNASEQ DIFFERENTIAL GENE EXPRESSION TEST FOR DOSE RESPONSE STUDY DESIGNS

Figure A.1 Principal components analysis of cell types identified in a real hepatic dose-response snRNAseq dataset. Each point represents a distinct cell and colors reflect the dose group.

Figure A.2 Comparison of fold-change distribution in simulated and real dose-response snRNAseq data where the log-normal mean (facLoc) and standard deviation (facScale) were varied as well as the percentage of differentially expressed genes and proportion of downregulated DE genes. A total of 5000 genes were simulated or sampled from real data and the fold-change for the highest dose group was calculated. The Kullback-Leibler Divergence (KLD) intrinsic discrepancy (ID) was used to evaluate the similarity in distributions.

Figure A.3 Benchmarking scores for data simulated using the Splatter [4] wrapper Splattdr with default initial parameters. A total of 4500 cells and 5000 genes were simulated across 9 dose groups with a probability of being differentially expressed of 10%, 50% of which were downregulated. (A) Ground truth was used to estimate the False Positive Rate (FPR), True Positive Rate (TPR), False Negative Rate (FNR), True Negative Rate (TNR), precision, balanced accuracy, and F1 score. Boxplots and whiskers represent values for 10 replicate simulations. (B) The area under the concordance curve (AUCC) was calculated as previously described (ref) for the 100 most significant genes (K = 100). Heatmap represents the pairwise AUCC for each DE analysis grouped by similarity.
103 APPENDIX B SEMIPARAMETRIC DOSE RESPONSE CURVE ESTIMATION FOR SINGLE CELL DOSE RESPONSE EXPERIMENTS B.1 Proof of Theorem 1 (0) (0) (0) (0) (0) (0) Define Θ(0) = (ϕ(0) , β (0) , γ1,1 , . . . , γ1,δ+K+1 , . . . , γJ,1 , . . . , γJ,δ+K+1 , ψ0 , ψ1 )⊤ . Then, ℓ(Θ) = ℓ(Θ(0) ) + ℓ(Θ) − ℓ(Θ(0) ) = ℓ(Θ(0) ) I X ni   ! ( ) X µi,j si µi,j + exp(ϕ) + I(Yi,j > 0) Yi,j log (0) − Yi,j log (0) i=1 j=1 µi,j si µi,j + exp(ϕ(0) ) (0) − exp(ϕ)log{si µi,j + exp(ϕ)} + exp(ϕ(0) )log{si µi,j + exp(ϕ(0) )} +ϕ exp(ϕ) − ϕ(0) exp(ϕ(0) ) + log[Γ{Yi,j + exp(ϕ)}] −log[Γ{Yi,j + exp(ϕ(0) )}] −log[Γ{exp(ϕ)}] + log[Γ{exp(ϕ(0) )}]  (0) +log(ωi,j ) − log(ωi,j )   eϕ  exp(ϕ) +I(Yi,j = 0)log (1 − ωi,j ) + ωi,j si µi,j + exp(ϕ) ( )eϕ(0)  exp(ϕ(0) )  (0) (0) −I(Yi,j = 0)log (1 − ωi,j ) + ωi,j (0) sj µi,j + exp(ϕ(0) ) For the sake of brevity, define G(ϕ, µi,j ) = [exp(ϕ)/{sj µi,j + exp(ϕ)}]e . Then, ϕ ℓ(Θ) = ℓ(Θ(0) ) I X ni   ! ( ) X µi,j si µi,j + exp(ϕ) + I(Yi,j > 0) Yi,j log (0) − Yi,j log (0) i=1 j=1 µi,j si µi,j + exp(ϕ(0) ) (0) − exp(ϕ)log{si µi,j + exp(ϕ)} + exp(ϕ(0) )log{si µi,j + exp(ϕ(0) )} +ϕ exp(ϕ) − ϕ(0) exp(ϕ(0) ) + log[Γ{Yi,j + exp(ϕ)}] 104 −log[Γ{Yi,j + exp(ϕ(0) )}] −log[Γ{exp(ϕ)}] + log[Γ{exp(ϕ(0) )}]  (0) +log(ωi,j ) − log(ωi,j ) ( ) (1 − ωi,j ) + ωi,j G(ϕ, µi,j ) +I(Yi,j = 0)log (0) (0) (0) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) = ℓ(Θ(0) ) I X ni   ! 
( ) X µi,j si µi,j + exp(ϕ) + I(Yi,j > 0) Yi,j log (0) − Yi,j log (0) i=1 j=1 µi,j si µi,j + exp(ϕ(0) ) ( ) si µi,j + exp(ϕ) − exp(ϕ)log (0) si µi,j + exp(ϕ(0) ) (0) (0) − exp(ϕ)log{si µi,j + exp(ϕ(0) )} + exp(ϕ(0) )log{si µi,j + exp(ϕ(0) )} +ϕ exp(ϕ) − ϕ(0) exp(ϕ(0) ) + log[Γ{Yi,j + exp(ϕ)}] −log[Γ{Yi,j + exp(ϕ(0) )}] − log[Γ{exp(ϕ)}] + log[Γ{exp(ϕ(0) )}]  (0) +log(ωi,j ) − log(ωi,j ) +I(Yi,j = 0)    (0) (0) × log (1 − ωi,j ){(1 − ωi,j )/(1 − ωi,j )} (0) (0) (0) (0) ωi,j G(ϕ(0) , µi,j ){ωi,j G(ϕ, µi,j )/ωi,j G(ϕ(0) , µi,j )}   +log 1 + (0) (0) (1 − ωi,j ){(1 − ωi,j )/(1 − ωi,j )}  (0) (0) (0) (0) −log[(1 − ωi,j ) + ωi,j G(ϕ , µi,j )] ≥ ℓ(Θ(0) ) I X ni   ! ( ) X µi,j si µi,j + exp(ϕ) + I(Yi,j > 0) Yi,j log (0) − Yi,j log (0) i=1 j=1 µi,j si µi,j + exp(ϕ(0) ) ( ) si µi,j + exp(ϕ) − exp(ϕ)log (0) si µi,j + exp(ϕ(0) ) (0) (0) − exp(ϕ)log{si µi,j + exp(ϕ(0) )} + exp(ϕ(0) )log{si µi,j + exp(ϕ(0) )} +ϕ exp(ϕ) − ϕ(0) exp(ϕ(0) ) + log[Γ{Yi,j + exp(ϕ)}] −log[Γ{Yi,j + exp(ϕ(0) )}] − log[Γ{exp(ϕ)}] + log[Γ{exp(ϕ(0) )}] 105  (0) +log(ωi,j ) − log(ωi,j ) (0) ( ) (1 − ωi,j )  (1 − ωi,j ) +I(Yi,j = 0) × (0) (0) (0) log (0) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) (1 − ωi,j ) (0) (0) ( )  ωi,j G(ϕ(0) , µi,j ) ωi,j G(ϕ, µi,j ) + (0) (0) (0) log (0) (0) (B.1) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) ωi,j G(ϕ(0) , µi,j ) The last inequality follows by applying the Jensen’s inequality to the term involved with I(Yi,j = 0). Now, applying the inequality −log(x) ≥ 1 − x for any generic x > 0 we obtain ( ) si µi,j + exp(ϕ) si µi,j + exp(ϕ) −log (0) ≥1 − (0) sj µi,j + exp(ϕ(0) ) sj µi,j + exp(ϕ(0) ) (0) sj (µi,j − µi,j ) + exp(ϕ(0) ) − exp(ϕ) = (0) . (B.2) sj µi,j + exp(ϕ(0) ) Let us now consider the very last term of (B.1). 
( ) ( ) ( ) ωi,j G(ϕ, µi,j ) ωi,j G(ϕ, µi,j ) log (0) (0) = log (0) + log (0) ωi,j G(ϕ(0) , µi,j ) ωi,j G(ϕ(0) , µi,j ) ( ) ωi,j (0) = log (0) − log{G(ϕ(0) , µi,j )} + log{G(ϕ, µi,j )} ωi,j ( ) ωi,j (0) = log (0) − log{G(ϕ(0) , µi,j )} + exp(ϕ)ϕ ωi,j  ( ) si µi,j + exp(ϕ) − exp(ϕ) log (0) sj µi,j + exp(ϕ(0) ) n o (0) (0) +log sj µi,j + exp(ϕ ) ( ) ωi,j (0) = log (0) − log{G(ϕ(0) , µi,j )} + exp(ϕ)ϕ ωi,j ( ) si µi,j + exp(ϕ) − exp(ϕ)log (0) sj µi,j + exp(ϕ(0) ) n o (0) (0) − exp(ϕ)log sj µi,j + exp(ϕ ) ( ) ωi,j (0) ≥ log (0) − log{G(ϕ(0) , µi,j )} + exp(ϕ)ϕ ωi,j ( (0) ) sj (µi,j − µi,j ) + exp(ϕ(0) ) − exp(ϕ) + exp(ϕ) (0) sj µi,j + exp(ϕ(0) ) 106 n o (0) − exp(ϕ)log sj µi,j + exp(ϕ(0) ) ( ) ωi,j (0) = log (0) − log{G(ϕ(0) , µi,j )} + exp(ϕ)ϕ ωi,j n o (0) − exp(ϕ)log sj µi,j + exp(ϕ(0) ) ( ) (0) exp(ϕ ) − exp(ϕ) + exp(ϕ) (0) sj µi,j + exp(ϕ(0) ) (0) sj exp(ϕ)(µi,j − µi,j ) + (0) sj µi,j + exp(ϕ(0) ) ( ) ωi,j (0) = log (0) − log{G(ϕ(0) , µi,j )} + exp(ϕ)ϕ ωi,j n o (0) − exp(ϕ)log sj µi,j + exp(ϕ(0) ) ( ) exp(ϕ(0) ) − exp(ϕ) + exp(ϕ) (0) sj µi,j + exp(ϕ(0) ) (0) sj {exp(ϕ) − exp(ϕ(0) )}(µi,j − µi,j ) + (0) sj µi,j + exp(ϕ(0) ) (0) sj exp(ϕ(0) )(µi,j − µi,j ) + (0) sj µi,j + exp(ϕ(0) ) ( ) ωi,j (0) ≥ log (0) − log{G(ϕ(0) , µi,j )} + exp(ϕ)ϕ ωi,j n o (0) − exp(ϕ)log sj µi,j + exp(ϕ(0) ) ( ) exp(ϕ(0) ) − exp(ϕ) + exp(ϕ) (0) sj µi,j + exp(ϕ(0) ) 0.5sj − (0) sj µi,j + exp(ϕ(0) )   (0) 2 (0) 2 × {exp(ϕ) − exp(ϕ )} + (µi,j − µi,j ) (0) sj exp(ϕ(0) )(µi,j − µi,j ) + (0) (B.3) sj µi,j + exp(ϕ(0) ) 107 The first inequality in the above derivation is obtained by using (B.2). The second inequality is obtained by applying the inequality ab ≥ −0.5(a2 + b2 ). 
Once again using (B.2) we obatin ( ) (0) si µi,j + exp(ϕ) sj (µi,j − µi,j ) + exp(ϕ(0) ) − exp(ϕ) − exp(ϕ)log (0) ≥ exp(ϕ) (0) si µi,j + exp(ϕ(0) ) si µi,j + exp(ϕ(0) ) (0) sj exp(ϕ)(µi,j − µi,j ) = (0) si µi,j + exp(ϕ(0) ) exp(ϕ){exp(ϕ(0) ) − exp(ϕ)} + (0) si µi,j + exp(ϕ(0) ) (0) sj {exp(ϕ) − exp(ϕ(0 )}(µi,j − µi,j ) = (0) si µi,j + exp(ϕ(0) ) (0) sj exp(ϕ(0 )(µi,j − µi,j ) + (0) si µi,j + exp(ϕ(0) ) exp(ϕ){exp(ϕ(0) ) − exp(ϕ)} + (0) si µi,j + exp(ϕ(0) ) 0.5sj ≥− (0) si µi,j + exp(ϕ(0) )   (0 2 (0) 2 × {exp(ϕ) − exp(ϕ )} + (µi,j − µi,j ) (0) sj exp(ϕ(0 )(µi,j − µi,j ) + (0) si µi,j + exp(ϕ(0) ) exp(ϕ){exp(ϕ(0) ) − exp(ϕ)} + (0) . (B.4) si µi,j + exp(ϕ(0) ) Now, we apply inequalities (B.2), (B.3), and (B.4) to (B.1) and obtain, ℓ(Θ) ≥ ℓ(Θ(0) ) I X ni   ! X µi,j + I(Yi,j > 0) Yi,j log (0) i=1 j=1 µi,j ( (0) ) sj (µi,j − µi,j ) + exp(ϕ(0) ) − exp(ϕ) +Yi,j (0) si µi,j + exp(ϕ(0) )   0.5sj (0 2 (0) 2 − (0) {exp(ϕ) − exp(ϕ )} + (µi,j − µi,j ) si µi,j + exp(ϕ(0) ) (0) sj exp(ϕ(0 )(µi,j − µi,j ) + (0) si µi,j + exp(ϕ(0) ) 108 exp(ϕ){exp(ϕ(0) ) − exp(ϕ)} + (0) si µi,j + exp(ϕ(0) ) (0) (0) − exp(ϕ)log{si µi,j + exp(ϕ(0) )} + exp(ϕ(0) )log{si µi,j + exp(ϕ(0) )} +ϕ exp(ϕ) − ϕ(0) exp(ϕ(0) ) + log[Γ{Yi,j + exp(ϕ)}] −log[Γ{Yi,j + exp(ϕ(0) )}] − log[Γ{exp(ϕ)}]  (0) +log[Γ{exp(ϕ )}] + log(ωi,j ) − log(ωi,j ) (0) (0) ( ) (1 − ωi,j )  (1 − ωi,j ) +I(Yi,j = 0) × (0) (0) (0) log (0) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) (1 − ωi,j ) (0) (0)  ( ) ωi,j G(ϕ(0) , µi,j ) ωi,j (0) + (0) (0) (0) log (0) − log{G(ϕ(0) , µi,j )} (1 − ωi,j ) + ωi,j G(ϕ , µi,j ) (0) ωi,j n o (0) (0) + exp(ϕ)ϕ − exp(ϕ)log sj µi,j + exp(ϕ ) ( ) exp(ϕ(0) ) − exp(ϕ) + exp(ϕ) (0) sj µi,j + exp(ϕ(0) )   0.5sj (0) 2 (0) 2 − (0) {exp(ϕ) − exp(ϕ )} + (µi,j − µi,j ) sj µi,j + exp(ϕ(0) ) (0) sj exp(ϕ(0) )(µi,j − µi,j )  + (0) sj µi,j + exp(ϕ(0) ) = ℓ†1 (ϕ|Θ(0) ) + ℓ∗2 (ψ0 , ψ1 , γ1 , . . . , γI , β|Θ(0) ) +ℓ∗3 (γ1 , . . . 
, γI , β|Θ(0) ) + ℓ∗4 (Θ(0) ) (B.5) where I X ni  ( ) (0)  X exp(ϕ ) − exp(ϕ) ℓ†1 (ϕ|Θ(0) ) = I(Yi,j > 0) Yi,j (0) i=1 j=1 si µi,j + exp(ϕ(0) ) 0.5sj − (0) {exp(ϕ) − exp(ϕ(0 )}2 si µi,j + exp(ϕ(0) ) exp(ϕ){exp(ϕ(0) ) − exp(ϕ)} + (0) si µi,j + exp(ϕ(0) ) (0) − exp(ϕ)log{si µi,j + exp(ϕ(0) )} +ϕ exp(ϕ) + log[Γ{Yi,j + exp(ϕ)}]  −log[Γ{exp(ϕ)}] 109 (0) (0) I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  + (0) (0) (0) exp(ϕ)ϕ (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) n o (0) − exp(ϕ)log sj µi,j + exp(ϕ(0) ) ( ) exp(ϕ(0) ) − exp(ϕ) + exp(ϕ) (0) sj µi,j + exp(ϕ(0) )  0.5sj (0) 2 − (0) {exp(ϕ) − exp(ϕ )} , sj µi,j + exp(ϕ(0) ) X I X ni  ℓ∗2 (ψ0 , ψ1 , γ1 , . . . , γI , β|Θ(0) ) = I(Yi,j > 0)log(ωi,j ) i=1 j=1 +I(Yi,j = 0) (0) (1 − ωi,j )  × (0) (0) (0) log(1 − ωi,j ) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) (0) (0) ωi,j G(ϕ(0) , µi,j )  + (0) (0) (0) log (ωi,j ) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) X I X ni   = I(Yi,j > 0) ψ0 + ψ1 log(si µi,j ) i=1 j=1  −log [1 + exp{ψ0 + ψ1 log(si µi,j )}] (0) (1 − ωi,j )  +I(Yi,j = 0) × − (0) (0) (0) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) ×log [1 + exp{ψ0 + ψ1 log(si µi,j )}] (0) (0) ωi,j G(ϕ(0) , µi,j ) + (0) (0) (0) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j )  × ψ0 + ψ1 log(si µi,j )  −log [1 + exp{ψ0 + ψ1 log(si µi,j )}] X I X ni  = I(Yi,j > 0) i=1 j=1 (0) (0) I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  + (0) (0) (0) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) × {ψ0 + ψ1 log(si µi,j )} 110  −log [1 + exp{ψ0 + ψ1 log(si µi,j )}] XI X ni  = I(Yi,j > 0) i=1 j=1 (0) (0) I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  + (0) (0) (0) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) × {ψ0 + ψ1 log(si µi,j )} " # 1 + exp{ψ0 + ψ1 log(si µi,j )} −log (0) (0) (0) 1 + exp{ψ0 + ψ1 log(si µi,j )}  (0) (0) (0) −log[1 + exp{ψ0 + ψ1 log(si µi,j )}] XI X ni  ≥ I(Yi,j > 0) i=1 j=1 (0) (0) I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  + (0) (0) (0) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) × {ψ0 + ψ1 log(si µi,j )} 1 + exp{ψ0 + ψ1 log(si µi,j )} +1 − (0) (0) (0) 1 + exp{ψ0 + ψ1 log(si µi,j )}  (0) (0) (0) −log[1 + exp{ψ0 + ψ1 log(si µi,j )}] , where the last 
inequality is obtained using the result log(x) ≤ x − 1, so −log(x) ≥ 1 − x. Now, X I X ni  ℓ∗2 (ψ0 , ψ1 , γ1 , . . . , γI , β|Θ(0) ) ≥ I(Yi,j > 0) i=1 j=1 (0) (0) I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  + (0) (0) (0) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) × {ψ0 + ψ1 log(si µi,j )} (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )} + (0) (0) (0) 1 + exp{ψ0 + ψ1 log(si µi,j )} exp{ψ0 + ψ1 log(si µi,j )} − (0) (0) (0) 1 + exp{ψ0 + ψ1 log(si µi,j )}  (0) (0) (0) −log[1 + exp{ψ0 + ψ1 log(si µi,j )}] 111 XI X ni  = I(Yi,j > 0) i=1 j=1 (0) (0) I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  + (0) (0) (0) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j )  × ψ0 + ψ1 log(si ) d+K+1 X  ⊤ +ψ1 { γi,m Bm (Di,j ) + Xi,j β} m=1 (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )} + (0) (0) (0) 1 + exp{ψ0 + ψ1 log(si µi,j )}  (0) × 1 − exp{ψ0 − ψ0 + ψ1 log(si µi,j )  (0) (0) −ψ1 log(si µi,j )}  (0) (0) (0) −log[1 + exp{ψ0 + ψ1 log(si µi,j )}] . Let us work on two specific terms of the above expression. First, d+K+1 X d+K+1 X ⊤ (0) ψ1 { γi,m Bm (Di,j ) + Xi,j β} = (ψ1 − ψ 1 (0) ){ (γi,m − γi,m )Bm (Di,j ) m=1 m=1 ⊤ (0) +Xi,j (β −β )} d+K+1 X (0) +ψ 1 (0) { (γi,m − γi,m )Bm (Di,j ) m=1 ⊤ (0) +Xi,j (β −β )} d+K+1 X (0) ⊤ (0) +ψ1 { γi,m Bm (Di,j ) + Xi,j β }  m=1 ≥ −0.5 (d + K + 2)(ψ1 − ψ (0) )2 d+K+1 X (0) + (γi,m − γi,m )2 Bm 2 (Di,j ) m=1  ⊤ (0) 2 +{Xi,j (β −β )} d+K+1 X (0) (0) +ψ 1 { (γi,m − γi,m )Bm (Di,j ) m=1 ⊤ (0) +Xi,j (β −β )} 112 d+K+1 X (0) ⊤ (0) +ψ1 { γi,m Bm (Di,j ) + Xi,j β }. m=1 The above inequality is obtained by applying the result, ab ≥ −0.5(a2 + b2 ) repeatedly on the product terms. 
The second term, (0) (0) (0) − exp{ψ0 − ψ0 + ψ1 log(si µi,j ) − ψ1 log(si µi,j )}  d+K+1 X (0) = − exp ψ0 − ψ0 + ψ1 log(si ) + ψ1 { ⊤ γi,m Bm (Di,j ) + Xi,j β} m=1 d+K+1 X  (0) (0) (0) −ψ1 log(si ) − ψ1 { γi,m Bm (Di,j ) + ⊤ (0) Xi,j β } m=1  (0) (0) (0) = − exp ψ0 − ψ0 + log(si )(ψ1 − ψ1 ) + (ψ1 − ψ1 ) d+K+1 X (0) ⊤ ×{ (γi,m − γi,m )Bm (Di,j ) + Xi,j (β − β (0) )} m=1 d+K+1 X (0) (0) ⊤ +ψ1 { (γi,m − γi,m )Bm (Di,j ) + Xi,j (β − β (0) )} m=1 d+K+1 X (0)  (0) ⊤ (0) +(ψ1 − ψ1 ){ γi,m Bm (Di,j ) + Xi,j β } m=1  (0) (0) (0) = − exp ψ0 − ψ0 + log(si )(ψ1 − ψ1 ) + (ψ1 − ψ1 ) d+K+1 X (0) ⊤ ×{ (γi,m − γi,m )Bm (Di,j ) + Xi,j (β − β (0) )} m=1 d+K+1 X (0) (0) ⊤ (0) + (γi,m − γi,m )ψ1 Bm (Di,j ) + Xi,j ψ1 (β − β (0) ) m=1 d+K+1 X (0)  (0) ⊤ (0) +(ψ1 − ψ1 ){ γi,m Bm (Di,j ) + Xi,j β } m=1  1 (0) ≥ − exp{(ψ0 − ψ0 )(2d + 2K + 7)} 2d + 2K + 7 (0) + exp{log(si )(ψ1 − ψ1 )(2d + 2K + 7)} d+K+1 X (0) (0) + exp{(ψ1 − ψ1 )(γi,m − γi,m )Bm (Di,j )(2d + 2K + 7)} m=1 (0) ⊤ + exp{(ψ1 − ψ1 )Xi,j (β − β (0) )}(2d + 2K + 7)} d+K+1 X (0) (0) + exp{(γi,m − γi,m )ψ1 Bm (Di,j )(2d + 2K + 7)} m=1 113 ⊤ (0) + exp{Xi,j ψ1 (β − β (0) )(2d + 2K + 7)}  d+K+1 X (0)  (0) ⊤ (0) + exp (ψ1 − ψ1 ){ γi,m Bm (Di,j ) + Xi,j β }(2d + 2K + 7) m=1  1 (0) ≥ − exp{(ψ0 − ψ0 )(2d + 2K + 7)} 2d + 2K + 7 (0) + exp{log(si )(ψ1 − ψ1 )(2d + 2K + 7)} d+K+1 X   (0) 2 (0) 2 + exp 0.5(ψ1 − ψ1 ) + 0.5{(γi,m − γi,m )Bm (Di,j )(2d + 2K + 7)} m=1   (0) 2 ⊤ (0) 2 + exp 0.5(ψ1 − ψ1 ) + 0.5{Xi,j (β − β )(2d + 2K + 7)} d+K+1 X (0) (0) + exp{(γi,m − γi,m )ψ1 Bm (Di,j )(2d + 2K + 7)} m=1 ⊤ (0) + exp{Xi,j ψ1 (β − β (0) )(2d + 2K + 7)}  d+K+1 X (0)  (0) ⊤ (0) + exp (ψ1 − ψ1 ){ γi,m Bm (Di,j ) + Xi,j β }(2d + 2K + 7) m=1  1 (0) ≥ − exp{(ψ0 − ψ0 )(2d + 2K + 7)} 2d + 2K + 7 (0) + exp{log(si )(ψ1 − ψ1 )(2d + 2K + 7)} (0) +0.5(d + K + 1) exp{(ψ1 − ψ1 )2 } d+K+1 X (0) +0.5 exp[{(γi,m − γi,m )Bm (Di,j )(2d + 2K + 7)}2 ] m=1 (0) ⊤ +0.5 exp{(ψ1 − ψ1 )2 } + 0.5 exp[{Xi,j (β − β (0) )(2d + 2K + 7)}2 ] d+K+1 X (0) (0) + exp{(γi,m − γi,m )ψ1 Bm (Di,j 
)(2d + 2K + 7)} m=1 ⊤ (0) + exp{Xi,j ψ1 (β − β (0) )(2d + 2K + 7)}  d+K+1 X (0)  (0) ⊤ (0) + exp (ψ1 − ψ1 ){ γi,m Bm (Di,j ) + Xi,j β }(2d + 2K + 7) . m=1 The first and third inequalities follow from the AM-GM inequality and the second inequality follows from the result that ab ≤ (a2 + b2 )/2. Now, XI X ni  ∗ (0) ℓ2 (ψ0 , ψ1 , γ1 , . . . , γI , β|Θ ) ≥ I(Yi,j > 0) i=1 j=1 (0) (0) I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  + (0) (0) (0) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) 114  × ψ0 + ψ1 log(si ) −0.5(d + K + 2)(ψ1 − ψ (0) )2 d+K+1 X (0) −0.5 (γi,m − γi,m )2 Bm 2 (Di,j ) m=1 ⊤ −0.5{Xi,j (β − β (0) )}2 d+K+1 X (0) (0) +ψ 1 { (γi,m − γi,m )Bm (Di,j ) m=1 ⊤ (0) +Xi,j (β −β )} d+K+1 X  (0) ⊤ (0) +ψ1 { γi,m Bm (Di,j ) + Xi,j β } m=1 (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )} + (0) (0) (0) 1 + exp{ψ0 + ψ1 log(si µi,j )}  1 × 1− 2d + 2K + 7  (0) × exp{(ψ0 − ψ0 )(2d + 2K + 7)} (0) + exp{log(si )(ψ1 − ψ1 )(2d + 2K + 7)} (0) +0.5(d + K + 1) exp{(ψ1 − ψ1 )2 } d+K+1 X (0) +0.5 exp[{(γi,m − γi,m )Bm (Di,j ) m=1 (0) ×(2d + 2K + 7)}2 ] + 0.5 exp{(ψ1 − ψ1 )2 } ⊤ +0.5 exp[{Xi,j (β − β (0) )(2d + 2K + 7)}2 ] d+K+1 X (0) (0) + exp{(γi,m − γi,m )ψ1 Bm (Di,j ) m=1 ×(2d + 2K + 7)} ⊤ (0) + exp{Xi,j ψ1 (β − β (0) )(2d + 2K + 7)}  d+K+1 X (0) (0) ⊤ (0) + exp (ψ1 − ψ1 ){ γi,m Bm (Di,j ) + Xi,j β } m=1 ×(2d + 2K + 7)  (0) (0) (0) −log[1 + exp{ψ0 + ψ1 log(si µi,j )}] 115 X I = g1 (ψ0 |Θ0 ) + g2 (ψ1 |Θ0 ) + g3,i (γi |Θ0 ) i=1 +g4 (β|Θ0 ) + g5 (Θ0 ), I X ni  (0) (0)  X I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j ) g1 (ψ0 |Θ0 ) = I(Yi,j > 0) + (0) (0) ψ0 (1 − ω ) + ω G(ϕ (0) , µ(0) ) i=1 j=1 i,j i,j i,j (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )} − (0) (0) (0) 1 + exp{ψ0 + ψ1 log(si µi,j )}  1 (0) × exp{(ψ0 − ψ0 )(2d + 2K + 7)} , 2d + 2K + 7 I X ni  (0) (0)  X I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j ) g2 (ψ1 |Θ0 ) = I(Yi,j > 0) + (0) (0) (0) i=1 j=1 (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j )  × ψ1 log(si ) − 0.5(d + K + 2)(ψ1 − ψ (0) )2 d+K+1 X  (0) ⊤ (0) +ψ1 { γi,m Bm (Di,j ) + Xi,j β } m=1 (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )}  1 − (0) (0) (0) × 
1 + exp{ψ0 + ψ1 log(si µi,j )} 2d + 2K + 7 (0) exp{log(si )(ψ1 − ψ1 )(2d + 2K + 7)} (0) +0.5(d + K + 1) exp{(ψ1 − ψ1 )2 } (0) +0.5 exp{(ψ1 − ψ1 )2 }  d+K+1 X (0)  (0) ⊤ (0) + exp (ψ1 − ψ1 ){ γi,m Bm (Di,j ) + Xi,j β }(2d + 2K + 7) , m=1 ni  (0) (0) I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j ) X  (0) g3,i (γi |Θ ) = I(Yi,j > 0) + (0) (0) (0) j=1 (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j )  d+K+1 X (0) × −0.5 (γi,m − γi,m )2 Bm 2 (Di,j ) m=1 d+K+1 X  (0) (0) +ψ 1 (γi,m − γi,m )Bm (Di,j ) m=1 116 (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )} − − (0) (0) (0) 1 + exp{ψ0 + ψ1 log(si µi,j )}  d+K+1 1 X (0) × 0.5 exp[{(γi,m − γi,m )Bm (Di,j ) 2d + 2K + 7 m=1 ×(2d + 2K + 7)}2 ] d+K+1 X  (0) (0) + exp{(γi,m − γi,m )ψ1 Bm (Di,j )(2d + 2K + 7)} , m=1 and I X ni  (0) (0)  (0) X I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j ) g4 (β|Θ ) = I(Yi,j > 0) + (0) (0) (0) i=1 j=1 (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j )   ⊤ (0) 2 (0) ⊤ (0) × −0.5{Xi,j (β − β )} + ψ1 Xi,j (β − β ) (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )} − (0) (0) (0) 1 + exp{ψ0 + ψ1 log(si µi,j )}  1 ⊤ × 0.5 exp[{Xi,j (β − β (0) )(2d + 2K + 7)}2 ] 2d + 2K + 7  ⊤ (0) (0) + exp{Xi,j ψ1 (β − β )(2d + 2K + 7)} . I X ni   X Yi,j sj µi,j ℓ∗3 (γ1 , . . . 
, γI , β|Θ(0) ) = I(Yi,j > 0) Yi,j log (µi,j ) − (0) i=1 j=1 si µi,j + exp(ϕ(0) ) 0.5sj (0) − (0) (µi,j − µi,j )2 si µi,j + exp(ϕ(0) ) sj exp(ϕ(0 )µi,j  − (0) si µi,j + exp(ϕ(0) ) (0) (0) I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j ) − (0) (0) (0) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) sj exp(ϕ(0) )µi,j   0.5sj (0) 2 × (0) (µi,j − µi,j ) + (0) sj µi,j + exp(ϕ(0) ) sj µi,j + exp(ϕ(0) ) X I X ni  = I(Yi,j > 0)Yi,j log (µi,j ) i=1 j=1 I(Yi,j > 0)sj exp(ϕ(0 )  I(Yi,j > 0)Yi,j sj − (0) + (0) si µi,j + exp(ϕ(0) ) si µi,j + exp(ϕ(0) ) 117 (0) (0) I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j ) sj exp(ϕ(0) )  + (0) (0) (0) × (0) µi,j (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) sj µi,j + exp(ϕ(0) ) (0) (0) I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  0.5sj I(Yi,j > 0) − (0) + (0) (0) (0) si µi,j + exp(ϕ(0) ) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j )   0.5sj (0) 2 × (0) ×(µi,j − µi,j ) sj µi,j + exp(ϕ(0) ) X I X ni  = I(Yi,j > 0)Yi,j log (µi,j ) i=1 j=1  sj − (0) I(Yi,j > 0)Yi,j + I(Yi,j > 0) exp(ϕ(0 ) si µi,j + exp(ϕ(0) ) (0) (0) I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  (0) + (0) (0) (0) × exp(ϕ ) µi,j (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j )  0.5sj − (0) I(Yi,j > 0) si µi,j + exp(ϕ(0) ) (0) (0)  I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  (0) 2 + (0) (0) (0) (µi,j − µi,j ) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) (0) Now, we obtain two inequalities for −µi,j and −(µi,j − µi,j )2 and apply them in the above expression. First, d+K+1 X ⊤ −µi,j = − exp{ γi,m Bm (Di,j ) + Xi,j β} m=1 d+K+1 X (0) (0) ⊤ = − µi,j exp{ (γi,m − γi,m )Bm (Di,j ) + Xi,j (β − β (0) )} m=1 (0) d+K+1 µi,j X (0) ≥− exp{(γi,m − γi,m )Bm (Di,j )(d + K + 2)} d+K +2 m=1  ⊤ (0) + exp{Xi,j (β −β )(d + K + 2)} , where the inequality is obtained by using the AM-GM inequality. Second, (0) (0) (0) −(µi,j − µi,j )2 = − (µi,j )2 + 2µi,j µi,j − (µi,j )2 ! 
!2 (0) (0) µ i,j (0) µi,j = − (µi,j )2 + 2(µi,j )2 (0) − (µi,j )2 (0) µi,j µi,j (0) = − (µi,j )2 118 d+K+1 X (0) (0) ⊤ + 2(µi,j )2 exp{ (γi,m − γi,m )Bm (Di,j ) + Xi,j (β − β (0) )} m=1 d+K+1 X (0) (0) ⊤ − (µi,j )2 exp{2 (γi,m − γi,m )Bm (Di,j ) + 2Xi,j (β − β (0) )} m=1 (0) ≥− (µi,j )2  d+K+1 X  (0) (0) ⊤ (0) + 2(µi,j )2 1+ (γi,m − γi,m )Bm (Di,j ) + Xi,j (β −β ) m=1 (0) d+K+1 (µi,j )2 X (0) − exp{2(d + K + 2)(γi,m − γi,m )Bm (Di,j )} d+K +2 m=1  ⊤ (0) + exp{2(d + K + 2)Xi,j (β −β )} , where the inequality is obtained by applying exp(x) ≥ 1 + x for any generic x to the middle term of the above expression and using the AM-GM inequality to the third term. Now, applying these two inequalities we obtain, X I X ni  d+K+1 X ⊤ ℓ∗3 (γ1 , . . . , γI , β|Θ(0) ) ≥ I(Yi,j > 0)Yi,j { γi,m Bm (Di,j ) + Xi,j β} i=1 j=1 m=1  sj − (0) I(Yi,j > 0)Yi,j + I(Yi,j > 0) exp(ϕ(0 ) si µi,j + exp(ϕ(0) ) (0) (0) I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  (0) + (0) (0) (0) × exp(ϕ ) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) (0) d+K+1 µi,j X (0) × exp{(γi,m − γi,m )Bm (Di,j ) d + K + 2 m=1  ⊤ (0) ×(d + K + 2)} + exp{Xi,j (β −β )(d + K + 2)}  0.5sj − (0) I(Yi,j > 0) si µi,j + exp(ϕ(0) ) (0) (0) I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  + (0) (0) (0) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j )  (0) × (µi,j )2 119  d+K+1 X (0) (0) −2(µi,j )2 1+ (γi,m − γi,m )Bm (Di,j ) m=1 ⊤ +Xi,j (β − β (0) ) (0) d+K+1 (µi,j )2 X + exp{2(d + K + 2) d+K +2 m=1 (0) ×(γi,m − γi,m )Bm (Di,j )}  ⊤ (0) + exp{2(d + K + 2)Xi,j (β −β )} X K = ℓ‡3,i (γi |Θ0 ) + ℓ‡4 (β|Θ0 ), i=1 where Xni  d+K+1 X ℓ‡3,i (γi |Θ0 ) = I(Yi,j > 0)Yi,j γi,m Bm (Di,j ) j=1 m=1  sj − (0) I(Yi,j > 0)Yi,j + I(Yi,j > 0) exp(ϕ(0 ) si µi,j + exp(ϕ(0) ) (0) (0) I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  (0) + (0) (0) (0) × exp(ϕ ) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) (0) d+K+1 µi,j X (0) × exp{(γi,m − γi,m )Bm (Di,j )(d + K + 2)} d + K + 2 m=1 (0) (0)  I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  0.5sj − (0) I(Yi,j > 0) + (0) (0) (0) si µi,j + exp(ϕ(0) ) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j )  d+K+1 X (0) 2 (0) × 
−2(µi,j ) (γi,m − γi,m )Bm (Di,j ) m=1 (0) d+K+1 (µi,j )2 X  (0) + exp{2(d + K + 2)(γi,m − γi,m )Bm (Di,j )} , d+K +2 m=1 XI X ni  ℓ‡4 (β|Θ0 ) = I(Yi,j > 0)Yi,j Xi,j ⊤ β i=1 j=1  sj − (0) I(Yi,j > 0)Yi,j + I(Yi,j > 0) exp(ϕ(0 ) si µi,j + exp(ϕ(0) ) (0) (0) I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  (0) + (0) (0) (0) × exp(ϕ ) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) 120 (0) µi,j ⊤ × exp{Xi,j (β − β (0) )(d + K + 2)} d+K +2 (0) (0)  I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  0.5sj − (0) I(Yi,j > 0) + (0) (0) (0) si µi,j + exp(ϕ(0) ) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) (0) (µi,j )2  (0) 2 ⊤ (0) × −2(µi,j ) Xi,j (β − β ) + exp{2(d + K + 2) d+K +2  ⊤ (0) ×Xi,j (β − β )} . Combining all results we can write X I ℓ(Θ) ≥ ℓ†1 (ϕ|Θ(0) ) (0) + g1 (ψ0 |Θ ) + g2 (ψ1 |Θ ) + (0) ℓ†3,i (γi |Θ(0) ) i=1 +ℓ†4 (β|Θ(0) ) + ℓ†5 (Θ(0) ), where ℓ†3,i (γi |Θ(0) ) = g3,i (γi |Θ(0) ) + ℓ‡3,i (γi |Θ(0) ), ℓ†4 (β|Θ(0) ) = g4 (β|Θ(0) ) + ℓ‡4 (β|Θ(0) ). Let us now consider the derivatives. I X ni  ( ) (0)  ∂ † ∂ X exp(ϕ ) − exp(ϕ) ℓ1 (ϕ|Θ(0) ) = I(Yi,j > 0) Yi,j (0) ∂ϕ ∂ϕ i=1 j=1 si µi,j + exp(ϕ(0) ) 0.5sj − (0) {exp(ϕ) − exp(ϕ(0 )}2 si µi,j + exp(ϕ(0) ) exp(ϕ){exp(ϕ(0) ) − exp(ϕ)} + (0) si µi,j + exp(ϕ(0) ) (0) − exp(ϕ)log{si µi,j + exp(ϕ(0) )} +ϕ exp(ϕ) + log[Γ{Yi,j + exp(ϕ)}]  −log[Γ{exp(ϕ)}] (0) (0) I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  + (0) (0) (0) exp(ϕ)ϕ (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) n o (0) − exp(ϕ)log sj µi,j + exp(ϕ(0) ) ( ) exp(ϕ(0) ) − exp(ϕ) + exp(ϕ) (0) sj µi,j + exp(ϕ(0) ) 121  0.5sj (0) 2 − (0) {exp(ϕ) − exp(ϕ )} , sj µi,j + exp(ϕ(0) ) I X ni   ( ) X exp(ϕ) = I(Yi,j > 0) −Yi,j (0) i=1 j=1 si µi,j + exp(ϕ(0) ) sj − (0) {exp(ϕ) − exp(ϕ(0 )} exp(ϕ) si µi,j + exp(ϕ(0) ) exp(ϕ){exp(ϕ(0) ) − 2 exp(ϕ)} + (0) si µi,j + exp(ϕ(0) ) (0) − exp(ϕ)log{si µi,j + exp(ϕ(0) )} ∂ + exp(ϕ) + ϕ exp(ϕ) + log[Γ{Yi,j + exp(ϕ)}] ∂ϕ  ∂ − log[Γ{exp(ϕ)}] ∂ϕ (0) (0)  I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j ) + (0) (0) (0) exp(ϕ) + exp(ϕ)ϕ (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) n o (0) − exp(ϕ)log sj µi,j + exp(ϕ(0) ) exp(ϕ){exp(ϕ(0) ) − 2 exp(ϕ)} 
+ (0) sj µi,j + exp(ϕ(0) )  sj (0) − (0) {exp(ϕ) − exp(ϕ )} exp(ϕ) , sj µi,j + exp(ϕ(0) ) I X ni   X Yi,j exp(ϕ) = I(Yi,j > 0) − (0) i=1 j=1 si µi,j + exp(ϕ(0) )  ∂ ∂ + log[Γ{Yi,j + exp(ϕ)}] − log[Γ{exp(ϕ)}] ∂ϕ ∂ϕ (0) (0)  I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  + I(Yi, > 0) + (0) (0) (0) exp(ϕ) + exp(ϕ)ϕ (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) n o (0) − exp(ϕ)log sj µi,j + exp(ϕ(0) ) exp(ϕ){exp(ϕ(0) ) − 2 exp(ϕ)} + (0) sj µi,j + exp(ϕ(0) )  sj (0) − (0) {exp(ϕ) − exp(ϕ )} exp(ϕ) , sj µi,j + exp(ϕ(0) ) Next, I X ni  ∂2 †  (0) X Yi,j exp(ϕ) 2 ℓ1 (ϕ|Θ ) = I(Yi,j > 0) − (0) ∂ϕ i=1 j=1 si µi,j + exp(ϕ(0) ) 122 ∂2 ∂2  + 2 log[Γ{Yi,j + exp(ϕ)}] − 2 log[Γ{exp(ϕ)}] ∂ϕ ∂ϕ (0) (0)  I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  + I(Yi, > 0) + (0) (0) (0) 2 exp(ϕ) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) n o (0) + exp(ϕ)ϕ − exp(ϕ)log sj µi,j + exp(ϕ(0) ) exp(ϕ){exp(ϕ(0) ) − 4 exp(ϕ)} + (0) sj µi,j + exp(ϕ(0) )  sj (0) − (0) {2 exp(ϕ) − exp(ϕ )} exp(ϕ) . sj µi,j + exp(ϕ(0) ) Now, I X ni  Yi,j exp(ϕ(0) )  ∂ † (0) X ℓ1 (ϕ|Θ ) |Θ=Θ(0) = I(Yi,j > 0) − (0) ∂ϕ i=1 j=1 si µi,j + exp(ϕ(0) ) ∂ + log[Γ{Yi,j + exp(ϕ)}] |ϕ=ϕ(0) ∂ϕ  ∂ − log[Γ{exp(ϕ)}] |ϕ=ϕ(0) ∂ϕ (0) (0)  I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  + I(Yi, > 0) + (0) (0) (0) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j )  exp(ϕ(0) ) + exp(ϕ(0) )ϕ(0) n o (0) (0) (0) − exp(ϕ )log sj µi,j + exp(ϕ ) exp(ϕ(0) ){exp(ϕ(0) ) − 2 exp(ϕ(0) )} + (0) sj µi,j + exp(ϕ(0) ) sj − (0) {exp(ϕ(0) ) − exp(ϕ(0) )} sj µi,j + exp(ϕ ) (0)  (0) exp(ϕ ) , I X ni  Yi,j exp(ϕ(0) ) X  = I(Yi,j > 0) − (0) i=1 j=1 si µi,j + exp(ϕ(0) ) ∂ + log[Γ{Yi,j + exp(ϕ)}] |ϕ=ϕ(0) ∂ϕ  ∂ − log[Γ{exp(ϕ)}] |ϕ=ϕ(0) ∂ϕ 123 (0) (0)  I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  + I(Yi, > 0) + (0) (0) (0) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j )  × exp(ϕ(0) ) + exp(ϕ(0) )ϕ(0) n o (0) − exp(ϕ(0) )log sj µi,j + exp(ϕ(0) ) exp(2ϕ(0) )  − (0) sj µi,j + exp(ϕ(0) ) I X ni  ∂2 † Yi,j exp(ϕ(0) ) X  (0) ℓ (ϕ|Θ ) |Θ=Θ(0) = I(Y i,j > 0) − ∂ϕ2 1 i=1 j=1 (0) si µi,j + exp(ϕ(0) ) ∂2 + log[Γ{Yi,j + exp(ϕ)}] |Θ=Θ(0) ∂ϕ2 ∂2  − 2 log[Γ{exp(ϕ)}] |Θ=Θ(0) ∂ϕ (0) (0)  
I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  + I(Yi, > 0) + (0) (0) (0) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j )  × 2 exp(ϕ(0) ) + exp(ϕ(0) )ϕ(0) n o (0) (0) (0) − exp(ϕ )log sj µi,j + exp(ϕ ) 3 exp(2ϕ(0) ) − (0) sj µi,j + exp(ϕ(0) ) sj exp(2ϕ(0) )  − (0) . sj µi,j + exp(ϕ(0) ) I X ni  (0) (0)  ∂ X I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j ) g1 (ψ0 |Θ0 ) = I(Yi,j > 0) + (0) (0) (0) ∂ψ0 i=1 j=1 (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )} − (0) (0) (0) 1 + exp{ψ0 + ψ1 log(si µi,j )}  (0) × exp{(ψ0 − ψ0 )(2d + 2K + 7)} , I X ni  (0) (0) ∂ X I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j ) g1 (ψ0 |Θ0 ) |Θ=Θ(0) = I(Yi,j > 0) + (0) (0) (0) ∂ψ0 i=1 j=1 (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )}  − (0) (0) (0) , 1 + exp{ψ0 + ψ1 log(si µi,j )} 124 (0) (0) (0) ∂2 XI X ni  exp{ψ0 + ψ1 log(si µi,j )} g1 (ψ0 |Θ0 ) = − ∂ψ02 i=1 j=1 (0) (0) 1 + exp{ψ0 + ψ1 log(si µi,j )} (0)  (0) ×(2d + 2K + 7) exp{(ψ0 − ψ0 )(2d + 2K + 7)} , (0) (0) (0) ∂2 I X ni  (2d + 2K + 7) exp{ψ0 + ψ1 log(si µi,j )} X  g1 (ψ0 |Θ0 ) |Θ=Θ(0) = − , ∂ψ02 i=1 j=1 (0) (0) 1 + exp{ψ0 + ψ1 log(si µi,j )} (0) I ni  (0) (0)  ∂ ∂ XX I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j ) g2 (ψ1 |Θ0 ) = I(Yi,j > 0) + (0) (0) (0) ∂ψ1 ∂ψ1 i=1 j=1 (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j )  × ψ1 log(si ) − 0.5(d + K + 2)(ψ1 − ψ (0) )2 d+K+1 X  (0) ⊤ (0) +ψ1 { γi,m Bm (Di,j ) + Xi,j β } m=1 (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )}  1 − (0) (0) (0) × 1 + exp{ψ0 + ψ1 log(si µi,j )} 2d + 2K + 7 (0) × exp{log(si )(ψ1 − ψ1 )(2d + 2K + 7)} (0) +0.5(d + K + 1) exp{(ψ1 − ψ1 )2 } (0) +0.5 exp{(ψ1 − ψ1 )2 }  d+K+1 X (0) (0) + exp (ψ1 − ψ1 ){ γi,m Bm (Di,j ) m=1  ⊤ (0) +Xi,j β }(2d + 2K + 7) I X ni  (0) (0)  X I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j ) = I(Yi,j > 0) + (0) (0) (0) i=1 j=1 (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j )  (0) × log(si ) − (d + K + 2)(ψ1 − ψ1 ) d+K+1 X  (0) ⊤ (0) +{ γi,m Bm (Di,j ) + Xi,j β } m=1 (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )}  1 − (0) (0) (0) × 1 + exp{ψ0 + ψ1 log(si µi,j )} 2d + 2K + 7 (0) (2d + 2K + 7) × log(si ) × exp{log(si )(ψ1 − ψ1 
)(2d + 2K + 7)} (0) (0) +(d + K + 1) exp{(ψ1 − ψ1 )2 } × (ψ1 − ψ1 ) 125 (0) (0) + exp{(ψ1 − ψ1 )2 }(ψ1 − ψ1 )  d+K+1 X (0)  (0) ⊤ (0) + exp (ψ1 − ψ1 ){ γi,m Bm (Di,j ) + Xi,j β }(2d + 2K + 7) m=1 d+K+1 X  (0) ⊤ (0) ×{ γi,m Bm (Di,j ) + Xi,j β }(2d + 2K + 7) , m=1 I X ni  (0) (0)  ∂ X I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j ) g2 (ψ1 |Θ0 ) |Θ=Θ(0) = I(Yi,j > 0) + (0) (0) (0) ∂ψ1 i=1 j=1 (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j )  d+K+1 X (0)  × log(si ) + γi,m Bm (Di,j ) + Xi,j β⊤ (0) m=1 (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )}  1 − (0) (0) (0) × 1 + exp{ψ0 + ψ1 log(si µi,j )} 2d + 2K + 7 (2d + 2K + 7) × log(si ) d+K+1 X (0)  ⊤ (0) +{ γi,m Bm (Di,j ) + Xi,j β }(2d + 2K + 7) m=1 I X ni  (0) (0)  X I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j ) = I(Yi,j > 0) + (0) (0) (0) i=1 j=1 (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j )  d+K+1 X (0)  × log(si ) + γi,m Bm (Di,j ) + Xi,j β⊤ (0) m=1 (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )} − (0) (0) (0) 1 + exp{ψ0 + ψ1 log(si µi,j )}  d+K+1 X (0)  × log(si ) + γi,m Bm (Di,j ) + Xi,j β⊤ (0) m=1 X I X ni  d+K+1 X (0)  = log(si ) + γi,m Bm (Di,j ) + Xi,j β ⊤ (0) i=1 j=1 m=1 (0) (0) I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  × I(Yi,j > 0) + (0) (0) (0) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )}  − (0) (0) (0) 1 + exp{ψ0 + ψ1 log(si µi,j )} I X ni  ∂2 X  g2 (ψ1 |Θ0 ) = −(d + K + 2) I(Yi,j > 0) ∂ψ12 i=1 j=1 126 (0) (0) I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  + (0) (0) (0) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )}  1 − (0) (0) (0) × 1 + exp{ψ0 + ψ1 log(si µi,j )} 2d + 2K + 7 (2d + 2K + 7)2 × {log(si )}2 (0) × exp{log(si )(ψ1 − ψ1 )(2d + 2K + 7)} (0) (0) +2(d + K + 1) exp{(ψ1 − ψ1 )2 } × (ψ1 − ψ1 )2 (0) +(d + K + 1) exp{(ψ1 − ψ1 )2 } (0) (0) +2 exp{(ψ1 − ψ1 )2 }(ψ1 − ψ1 )2 (0) + exp{(ψ1 − ψ1 )2 }  d+K+1 X (0)  (0) ⊤ (0) + exp (ψ1 − ψ1 ){ γi,m Bm (Di,j ) + Xi,j β }(2d + 2K + 7) m=1 d+K+1 X  (0) ⊤ (0) 2 2 ×{ γi,m Bm (Di,j ) + Xi,j β } (2d + 2K + 7) , m=1 I X ni  ∂2 X  g2 (ψ1 |Θ0 ) |Θ=Θ(0) = −(d + K + 2) I(Yi,j > 0) ∂ψ12 i=1 j=1 (0) (0)  I(Yi,j 
= 0)ωi,j G(ϕ(0) , µi,j ) + (0) (0) (0) (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )}  1 − (0) (0) (0) × 1 + exp{ψ0 + ψ1 log(si µi,j )} 2d + 2K + 7 (2d + 2K + 7)2 × {log(si )}2 + d + K + 2 d+K+1 X (0)  ⊤ (0) 2 2 +{ γi,m Bm (Di,j ) + Xi,j β } (2d + 2K + 7) m=1 ∂ g3,i (γi |Θ(0) ) ∂γi,m ni  (0) (0)  ∂ X I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j ) = I(Yi,j > 0) + (0) (0) (0) ∂γi,m j=1 (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j )  d+K+1 X d+K+1 X  (0) 2 2 (0) (0) × −0.5 (γi,m − γi,m ) Bm (Di,j ) + ψ 1 (γi,m − γi,m )Bm (Di,j ) m=1 m=1 127 (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )} − (0) (0) (0) 1 + exp{ψ0 + ψ1 log(si µi,j )}  d+K+1 1 X (0) × 0.5 exp[{(γi,m − γi,m )Bm (Di,j )(2d + 2K + 7)}2 ] 2d + 2K + 7 m=1 d+K+1 X  (0) (0) + exp{(γi,m − γi,m )ψ1 Bm (Di,j )(2d + 2K + 7)} m=1 ni  (0) (0) I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j ) X  = I(Yi,j > 0) + (0) (0) (0) j=1 (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j )   (0) 2 (0) × −(γi,m − γi,m )Bm (Di,j ) + ψ1 Bm (Di,j ) (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )} − (0) (0) (0) 1 + exp{ψ0 + ψ1 log(si µi,j )}  1 (0) × exp[{(γi,m − γi,m )Bm (Di,j )(2d + 2K + 7)}2 ] 2d + 2K + 7 (0) ×(γi,m − γi,m )Bm (Di,j ) × Bm (Di,j ) × (2d + 2K + 7)2 (0) (0) + exp{(γi,m − γi,m )ψ1 Bm (Di,j )(2d + 2K + 7)}  (0) ×ψ1 Bm (Di,j )(2d + 2K + 7) , ni (0) (0)  I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j ) X  = Bm (Di,j ) I(Yi,j > 0) + (0) (0) (0) j=1 (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j )   (0) (0) × −(γi,m − γi,m )Bm (Di,j ) + ψ1 (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )} − (0) (0) (0) 1 + exp{ψ0 + ψ1 log(si µi,j )}  (0) × exp[{(γi,m − γi,m )Bm (Di,j )(2d + 2K + 7)}2 ] (0) ×(γi,m − γi,m )Bm (Di,j ) × (2d + 2K + 7) (0) (0) + exp{(γi,m − γi,m )ψ1 Bm (Di,j )(2d + 2K + 7)}  (0) ×ψ1 , ∂ g3,i (γi |Θ(0) ) |Θ=Θ(0) ∂γi,m 128 ni (0) (0) I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j ) X  (0) = ψ1 Bm (Di,j ) I(Yi,j > 0) + (0) (0) (0) j=1 (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )}  − (0) (0) (0) 1 + exp{ψ0 + ψ1 log(si µi,j )} ∂2 2 g3,i (γi |Θ(0) ) ∂γi,m ni (0) (0)  I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j )  
∂ X = Bm (Di,j ) I(Yi,j > 0) + (0) (0) (0) ∂γi,m j=1 (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j )   (0) (0) × −(γi,m − γi,m )Bm (Di,j ) + ψ1 (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )} − (0) (0) (0) 1 + exp{ψ0 + ψ1 log(si µi,j )}  (0) × exp[{(γi,m − γi,m )Bm (Di,j )(2d + 2K + 7)}2 ] (0) ×(γi,m − γi,m )Bm (Di,j ) × (2d + 2K + 7) (0) (0) + exp{(γi,m − γi,m )ψ1 Bm (Di,j )(2d + 2K + 7)}  (0) ×ψ1 ni (0) (0)  I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j ) X  = Bm (Di,j ) I(Yi,j > 0) + (0) (0) (0) j=1 (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j )   × −Bm (Di,j ) (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )} − (0) (0) (0) 1 + exp{ψ0 + ψ1 log(si µi,j )}  (0) × exp[{(γi,m − γi,m )Bm (Di,j )(2d + 2K + 7)}2 ] (0) ×2(γi,m − γi,m )2 Bm 3 (Di,j )(2d + 2K + 7)3 (0) + exp[{(γi,m − γi,m )Bm (Di,j )(2d + 2K + 7)}2 ] ×Bm (Di,j ) × (2d + 2K + 7) (0) (0) + exp{(γi,m − γi,m )ψ1 Bm (Di,j )(2d + 2K + 7)}  ( 2 ×ψ 0)1 Bm (Di,j )(2d + 2K + 7) . 129 ∂2 2 g3,i (γi |Θ(0) ) |Θ=Θ(0) ∂γi,m ni (0) (0)  I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j ) X  = Bm (Di,j ) I(Yi,j > 0) + (0) (0) (0) j=1 (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j )   × −Bm (Di,j ) (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )} − (0) (0) (0) 1 + exp{ψ0 + ψ1 log(si µi,j )}   ( 2 × Bm (Di,j ) × (2d + 2K + 7) + ψ 0)1 Bm (Di,j )(2d + 2K + 7) ni (0) (0) I(Yi,j = 0)ωi,j G(ϕ(0) , µi,j ) X  2 = − Bm (Di,j ) I(Yi,j > 0) + (0) (0) (0) j=1 (1 − ωi,j ) + ωi,j G(ϕ(0) , µi,j ) (0) (0) (0) exp{ψ0 + ψ1 log(si µi,j )}  ( 2 +(2d + 2K + 7){1 + ψ 0)1 } (0) (0) (0) . 1 + exp{ψ0 + ψ1 log(si µi,j )} Also, ∂2 g3,i (γi |Θ(0) ) |Θ=Θ(0) = 0. 
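Here the derivative is the mixed partial $\partial^2/\partial\gamma_{i,m}\partial\gamma_{i,r}$ with $m \neq r$.

\[
\begin{aligned}
\frac{\partial}{\partial\gamma_{i,r}}\ell_{3,i}^{\ddagger}(\gamma_i|\Theta^{(0)})
&=\frac{\partial}{\partial\gamma_{i,r}}\sum_{j=1}^{n_i}\Bigg[I(Y_{i,j}>0)Y_{i,j}\sum_{m=1}^{d+K+1}\gamma_{i,m}B_m(D_{i,j})\\
&\qquad-\frac{s_j}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)Y_{i,j}+I(Y_{i,j}>0)\exp(\phi^{(0)})+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\exp(\phi^{(0)})\Big\}\\
&\qquad\qquad\times\frac{\mu_{i,j}^{(0)}}{d+K+2}\sum_{m=1}^{d+K+1}\exp\{(\gamma_{i,m}-\gamma_{i,m}^{(0)})B_m(D_{i,j})(d+K+2)\}\\
&\qquad-\frac{0.5s_j}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\Big\}\\
&\qquad\qquad\times\Big(-2(\mu_{i,j}^{(0)})^2\sum_{m=1}^{d+K+1}(\gamma_{i,m}-\gamma_{i,m}^{(0)})B_m(D_{i,j})+\frac{(\mu_{i,j}^{(0)})^2}{d+K+2}\sum_{m=1}^{d+K+1}\exp\{2(d+K+2)(\gamma_{i,m}-\gamma_{i,m}^{(0)})B_m(D_{i,j})\}\Big)\Bigg]\\
&=\sum_{j=1}^{n_i}\Bigg[I(Y_{i,j}>0)Y_{i,j}B_r(D_{i,j})\\
&\qquad-\frac{s_j}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)Y_{i,j}+I(Y_{i,j}>0)\exp(\phi^{(0)})+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\exp(\phi^{(0)})\Big\}\\
&\qquad\qquad\times\frac{\mu_{i,j}^{(0)}}{d+K+2}\exp\{(\gamma_{i,r}-\gamma_{i,r}^{(0)})B_r(D_{i,j})(d+K+2)\}B_r(D_{i,j})(d+K+2)\\
&\qquad-\frac{0.5s_j}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\Big\}\\
&\qquad\qquad\times\Big(-2(\mu_{i,j}^{(0)})^2B_r(D_{i,j})+\frac{(\mu_{i,j}^{(0)})^2}{d+K+2}\exp\{2(d+K+2)(\gamma_{i,r}-\gamma_{i,r}^{(0)})B_r(D_{i,j})\}\,2(d+K+2)B_r(D_{i,j})\Big)\Bigg]\\
&=\sum_{j=1}^{n_i}B_r(D_{i,j})\Bigg[I(Y_{i,j}>0)Y_{i,j}\\
&\qquad-\frac{s_j}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)Y_{i,j}+I(Y_{i,j}>0)\exp(\phi^{(0)})+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\exp(\phi^{(0)})\Big\}\\
&\qquad\qquad\times\mu_{i,j}^{(0)}\exp\{(\gamma_{i,r}-\gamma_{i,r}^{(0)})B_r(D_{i,j})(d+K+2)\}\\
&\qquad-\frac{s_j}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\Big\}\\
&\qquad\qquad\times\Big(-(\mu_{i,j}^{(0)})^2+(\mu_{i,j}^{(0)})^2\exp\{2(d+K+2)(\gamma_{i,r}-\gamma_{i,r}^{(0)})B_r(D_{i,j})\}\Big)\Bigg].
\end{aligned}
\]

So,

\[
\begin{aligned}
\frac{\partial}{\partial\gamma_{i,r}}\ell_{3,i}^{\ddagger}(\gamma_i|\Theta^{(0)})\Big|_{\Theta=\Theta^{(0)}}
&=\sum_{j=1}^{n_i}B_r(D_{i,j})\Bigg[I(Y_{i,j}>0)Y_{i,j}\\
&\qquad-\frac{s_j\mu_{i,j}^{(0)}}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)Y_{i,j}+I(Y_{i,j}>0)\exp(\phi^{(0)})+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\exp(\phi^{(0)})\Big\}\Bigg].
\end{aligned}
\]

\[
\begin{aligned}
\frac{\partial^2}{\partial\gamma_{i,r}^2}\ell_{3,i}^{\ddagger}(\gamma_i|\Theta^{(0)})
&=\frac{\partial}{\partial\gamma_{i,r}}\sum_{j=1}^{n_i}B_r(D_{i,j})\Bigg[I(Y_{i,j}>0)Y_{i,j}\\
&\qquad-\frac{s_j}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)Y_{i,j}+I(Y_{i,j}>0)\exp(\phi^{(0)})+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\exp(\phi^{(0)})\Big\}\\
&\qquad\qquad\times\mu_{i,j}^{(0)}\exp\{(\gamma_{i,r}-\gamma_{i,r}^{(0)})B_r(D_{i,j})(d+K+2)\}\\
&\qquad-\frac{s_j}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\Big\}\\
&\qquad\qquad\times\Big(-(\mu_{i,j}^{(0)})^2+(\mu_{i,j}^{(0)})^2\exp\{2(d+K+2)(\gamma_{i,r}-\gamma_{i,r}^{(0)})B_r(D_{i,j})\}\Big)\Bigg]\\
&=\sum_{j=1}^{n_i}B_r(D_{i,j})\Bigg[-\frac{s_j}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)Y_{i,j}+I(Y_{i,j}>0)\exp(\phi^{(0)})+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\exp(\phi^{(0)})\Big\}\\
&\qquad\qquad\times\mu_{i,j}^{(0)}B_r(D_{i,j})(d+K+2)\exp\{(\gamma_{i,r}-\gamma_{i,r}^{(0)})B_r(D_{i,j})(d+K+2)\}\\
&\qquad-\frac{s_j}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\Big\}\\
&\qquad\qquad\times2(d+K+2)B_r(D_{i,j})(\mu_{i,j}^{(0)})^2\exp\{2(d+K+2)(\gamma_{i,r}-\gamma_{i,r}^{(0)})B_r(D_{i,j})\}\Bigg].
\end{aligned}
\]

Now,

\[
\begin{aligned}
\frac{\partial^2}{\partial\gamma_{i,r}^2}\ell_{3,i}^{\ddagger}(\gamma_i|\Theta^{(0)})\Big|_{\Theta=\Theta^{(0)}}
&=-(d+K+2)\sum_{j=1}^{n_i}B_r^2(D_{i,j})\Bigg[\frac{s_j}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)Y_{i,j}+I(Y_{i,j}>0)\exp(\phi^{(0)})\\
&\qquad\qquad+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\exp(\phi^{(0)})\Big\}\mu_{i,j}^{(0)}\\
&\qquad+\frac{2s_j(\mu_{i,j}^{(0)})^2}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\Big\}\Bigg],
\end{aligned}
\]

\[
\frac{\partial^2}{\partial\gamma_{i,r}\,\partial\gamma_{i,m}}\ell_{3,i}^{\ddagger}(\gamma_i|\Theta^{(0)})\Big|_{\Theta=\Theta^{(0)}}=0.
\]

\[
\begin{aligned}
\frac{\partial}{\partial\beta}g_4(\beta|\Theta^{(0)})
&=\frac{\partial}{\partial\beta}\sum_{i=1}^{I}\sum_{j=1}^{n_i}\Bigg[\Big\{I(Y_{i,j}>0)+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\Big\}\\
&\qquad\qquad\times\Big(-0.5\{X_{i,j}^{\top}(\beta-\beta^{(0)})\}^2+\psi_1^{(0)}X_{i,j}^{\top}(\beta-\beta^{(0)})\Big)\\
&\qquad-\frac{\exp\{\psi_0^{(0)}+\psi_1^{(0)}\log(s_i\mu_{i,j}^{(0)})\}}{1+\exp\{\psi_0^{(0)}+\psi_1^{(0)}\log(s_i\mu_{i,j}^{(0)})\}}\times\frac{1}{2d+2K+7}\\
&\qquad\qquad\times\Big(0.5\exp[\{X_{i,j}^{\top}(\beta-\beta^{(0)})(2d+2K+7)\}^2]+\exp\{X_{i,j}^{\top}\psi_1^{(0)}(\beta-\beta^{(0)})(2d+2K+7)\}\Big)\Bigg].
\end{aligned}
\]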
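Differentiating term by term gives

\[
\begin{aligned}
\frac{\partial}{\partial\beta}g_4(\beta|\Theta^{(0)})
&=\sum_{i=1}^{I}\sum_{j=1}^{n_i}\Bigg[\Big\{I(Y_{i,j}>0)+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\Big\}\Big(-\{X_{i,j}^{\top}(\beta-\beta^{(0)})\}X_{i,j}+\psi_1^{(0)}X_{i,j}\Big)\\
&\qquad-\frac{\exp\{\psi_0^{(0)}+\psi_1^{(0)}\log(s_i\mu_{i,j}^{(0)})\}}{1+\exp\{\psi_0^{(0)}+\psi_1^{(0)}\log(s_i\mu_{i,j}^{(0)})\}}\times\frac{1}{2d+2K+7}\\
&\qquad\qquad\times\Big(\exp[\{X_{i,j}^{\top}(\beta-\beta^{(0)})(2d+2K+7)\}^2]\,X_{i,j}^{\top}(\beta-\beta^{(0)})X_{i,j}(2d+2K+7)^2\\
&\qquad\qquad\quad+\exp\{X_{i,j}^{\top}\psi_1^{(0)}(\beta-\beta^{(0)})(2d+2K+7)\}\,\psi_1^{(0)}X_{i,j}(2d+2K+7)\Big)\Bigg]\\
&=\sum_{i=1}^{I}\sum_{j=1}^{n_i}X_{i,j}\Bigg[\Big\{I(Y_{i,j}>0)+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\Big\}\Big(-X_{i,j}^{\top}(\beta-\beta^{(0)})+\psi_1^{(0)}\Big)\\
&\qquad-\frac{\exp\{\psi_0^{(0)}+\psi_1^{(0)}\log(s_i\mu_{i,j}^{(0)})\}}{1+\exp\{\psi_0^{(0)}+\psi_1^{(0)}\log(s_i\mu_{i,j}^{(0)})\}}\\
&\qquad\qquad\times\Big(\exp[\{X_{i,j}^{\top}(\beta-\beta^{(0)})(2d+2K+7)\}^2]\,X_{i,j}^{\top}(\beta-\beta^{(0)})(2d+2K+7)\\
&\qquad\qquad\quad+\exp\{X_{i,j}^{\top}\psi_1^{(0)}(\beta-\beta^{(0)})(2d+2K+7)\}\,\psi_1^{(0)}\Big)\Bigg].
\end{aligned}
\]

\[
\frac{\partial}{\partial\beta}g_4(\beta|\Theta^{(0)})\Big|_{\Theta=\Theta^{(0)}}
=\psi_1^{(0)}\sum_{i=1}^{I}\sum_{j=1}^{n_i}X_{i,j}\Bigg[I(Y_{i,j}>0)+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}-\frac{\exp\{\psi_0^{(0)}+\psi_1^{(0)}\log(s_i\mu_{i,j}^{(0)})\}}{1+\exp\{\psi_0^{(0)}+\psi_1^{(0)}\log(s_i\mu_{i,j}^{(0)})\}}\Bigg].
\]

\[
\begin{aligned}
\frac{\partial^2}{\partial\beta\,\partial\beta^{\top}}g_4(\beta|\Theta^{(0)})
&=\sum_{i=1}^{I}\sum_{j=1}^{n_i}X_{i,j}\Bigg[\Big\{I(Y_{i,j}>0)+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\Big\}\big(-X_{i,j}^{\top}\big)\\
&\qquad-\frac{\exp\{\psi_0^{(0)}+\psi_1^{(0)}\log(s_i\mu_{i,j}^{(0)})\}}{1+\exp\{\psi_0^{(0)}+\psi_1^{(0)}\log(s_i\mu_{i,j}^{(0)})\}}\\
&\qquad\qquad\times\Big(\exp[\{X_{i,j}^{\top}(\beta-\beta^{(0)})(2d+2K+7)\}^2]\,2\{X_{i,j}^{\top}(\beta-\beta^{(0)})\}^2(2d+2K+7)^3X_{i,j}^{\top}\\
&\qquad\qquad\quad+\exp[\{X_{i,j}^{\top}(\beta-\beta^{(0)})(2d+2K+7)\}^2]\,X_{i,j}^{\top}(2d+2K+7)\\
&\qquad\qquad\quad+\exp\{X_{i,j}^{\top}\psi_1^{(0)}(\beta-\beta^{(0)})(2d+2K+7)\}(\psi_1^{(0)})^2X_{i,j}^{\top}(2d+2K+7)\Big)\Bigg]\\
&=-\sum_{i=1}^{I}\sum_{j=1}^{n_i}X_{i,j}X_{i,j}^{\top}\Bigg[I(Y_{i,j}>0)+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\\
&\qquad+\frac{(2d+2K+7)\exp\{\psi_0^{(0)}+\psi_1^{(0)}\log(s_i\mu_{i,j}^{(0)})\}}{1+\exp\{\psi_0^{(0)}+\psi_1^{(0)}\log(s_i\mu_{i,j}^{(0)})\}}\\
&\qquad\qquad\times\Big(\exp[\{X_{i,j}^{\top}(\beta-\beta^{(0)})(2d+2K+7)\}^2]\,2\{X_{i,j}^{\top}(\beta-\beta^{(0)})\}^2(2d+2K+7)^2\\
&\qquad\qquad\quad+\exp[\{X_{i,j}^{\top}(\beta-\beta^{(0)})(2d+2K+7)\}^2]\\
&\qquad\qquad\quad+\exp\{X_{i,j}^{\top}\psi_1^{(0)}(\beta-\beta^{(0)})(2d+2K+7)\}(\psi_1^{(0)})^2\Big)\Bigg].
\end{aligned}
\]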
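\[
\begin{aligned}
\frac{\partial^2}{\partial\beta\,\partial\beta^{\top}}g_4(\beta|\Theta^{(0)})\Big|_{\Theta=\Theta^{(0)}}
&=-\sum_{i=1}^{I}\sum_{j=1}^{n_i}X_{i,j}X_{i,j}^{\top}\Bigg[I(Y_{i,j}>0)+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\\
&\qquad+(2d+2K+7)\{1+(\psi_1^{(0)})^2\}\times\frac{\exp\{\psi_0^{(0)}+\psi_1^{(0)}\log(s_i\mu_{i,j}^{(0)})\}}{1+\exp\{\psi_0^{(0)}+\psi_1^{(0)}\log(s_i\mu_{i,j}^{(0)})\}}\Bigg].
\end{aligned}
\]

\[
\begin{aligned}
\frac{\partial}{\partial\beta}\ell_4^{\ddagger}(\beta|\Theta^{(0)})
&=\frac{\partial}{\partial\beta}\sum_{i=1}^{I}\sum_{j=1}^{n_i}\Bigg[I(Y_{i,j}>0)Y_{i,j}X_{i,j}^{\top}\beta\\
&\qquad-\frac{s_j}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)Y_{i,j}+I(Y_{i,j}>0)\exp(\phi^{(0)})+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\exp(\phi^{(0)})\Big\}\\
&\qquad\qquad\times\frac{\mu_{i,j}^{(0)}}{d+K+2}\exp\{X_{i,j}^{\top}(\beta-\beta^{(0)})(d+K+2)\}\\
&\qquad-\frac{0.5s_j}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\Big\}\\
&\qquad\qquad\times\Big(-2(\mu_{i,j}^{(0)})^2X_{i,j}^{\top}(\beta-\beta^{(0)})+\frac{(\mu_{i,j}^{(0)})^2}{d+K+2}\exp\{2(d+K+2)X_{i,j}^{\top}(\beta-\beta^{(0)})\}\Big)\Bigg]\\
&=\sum_{i=1}^{I}\sum_{j=1}^{n_i}\Bigg[I(Y_{i,j}>0)Y_{i,j}X_{i,j}\\
&\qquad-\frac{s_j}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)Y_{i,j}+I(Y_{i,j}>0)\exp(\phi^{(0)})+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\exp(\phi^{(0)})\Big\}\\
&\qquad\qquad\times\frac{\mu_{i,j}^{(0)}}{d+K+2}\exp\{X_{i,j}^{\top}(\beta-\beta^{(0)})(d+K+2)\}\,X_{i,j}(d+K+2)\\
&\qquad-\frac{0.5s_j}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\Big\}\\
&\qquad\qquad\times\Big(-2(\mu_{i,j}^{(0)})^2X_{i,j}+\frac{(\mu_{i,j}^{(0)})^2}{d+K+2}\exp\{2(d+K+2)X_{i,j}^{\top}(\beta-\beta^{(0)})\}\,2(d+K+2)X_{i,j}\Big)\Bigg]\\
&=\sum_{i=1}^{I}\sum_{j=1}^{n_i}X_{i,j}\Bigg[I(Y_{i,j}>0)Y_{i,j}\\
&\qquad-\frac{s_j}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)Y_{i,j}+I(Y_{i,j}>0)\exp(\phi^{(0)})+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\exp(\phi^{(0)})\Big\}\\
&\qquad\qquad\times\mu_{i,j}^{(0)}\exp\{X_{i,j}^{\top}(\beta-\beta^{(0)})(d+K+2)\}\\
&\qquad-\frac{s_j(\mu_{i,j}^{(0)})^2}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\Big\}\\
&\qquad\qquad\times\Big(-1+\exp\{2(d+K+2)X_{i,j}^{\top}(\beta-\beta^{(0)})\}\Big)\Bigg].
\end{aligned}
\]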
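The closed forms for the gradient and Hessian of g4 at Θ = Θ(0) can be sanity-checked numerically in a scalar analogue. The sketch below is only an illustration under stated assumptions: it replaces X⊤(β − β(0)) by a scalar x, treats the bracketed weight and the logistic ratio as fixed hypothetical constants w and p, writes c for 2d + 2K + 7, and compares finite differences at x = 0 with ψ(w − p) and −{w + c(1 + ψ²)p}. The function names are illustrative and not part of the original work.

```python
import math

def g4_scalar(x, w, p, psi, c):
    """Scalar analogue of g_4 for one (i, j) term, with x standing for X^T (beta - beta0).

    w   : assumed constant standing for the weight in braces (a value in [0, 1])
    p   : assumed constant standing for the logistic ratio
    psi : stands for psi_1^(0)
    c   : stands for 2d + 2K + 7
    """
    quad = w * (-0.5 * x ** 2 + psi * x)
    expo = (p / c) * (0.5 * math.exp((c * x) ** 2) + math.exp(psi * c * x))
    return quad - expo

def first_derivative_at_zero(f, h=1e-5):
    """Central finite-difference approximation of f'(0)."""
    return (f(h) - f(-h)) / (2.0 * h)

def second_derivative_at_zero(f, h=1e-4):
    """Central finite-difference approximation of f''(0)."""
    return (f(h) - 2.0 * f(0.0) + f(-h)) / (h * h)

# Hypothetical values chosen only for illustration.
w, p, psi = 0.7, 0.3, 0.5
d, K = 3, 2
c = 2 * d + 2 * K + 7

f = lambda x: g4_scalar(x, w, p, psi, c)
grad_closed = psi * (w - p)                    # scalar form of the gradient at Theta^(0)
hess_closed = -(w + c * (1 + psi ** 2) * p)    # scalar form of the Hessian at Theta^(0)

grad_numeric = first_derivative_at_zero(f)
hess_numeric = second_derivative_at_zero(f)
```

Agreement between the numeric and closed-form values supports the evaluated expressions; it does not replace the vector-valued derivation.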
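\[
\begin{aligned}
\frac{\partial}{\partial\beta}\ell_4^{\ddagger}(\beta|\Theta^{(0)})\Big|_{\Theta=\Theta^{(0)}}
&=\sum_{i=1}^{I}\sum_{j=1}^{n_i}X_{i,j}\Bigg[I(Y_{i,j}>0)Y_{i,j}\\
&\qquad-\frac{s_j\mu_{i,j}^{(0)}}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)Y_{i,j}+I(Y_{i,j}>0)\exp(\phi^{(0)})+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\exp(\phi^{(0)})\Big\}\Bigg].
\end{aligned}
\]

\[
\begin{aligned}
\frac{\partial^2}{\partial\beta\,\partial\beta^{\top}}\ell_4^{\ddagger}(\beta|\Theta^{(0)})
&=\sum_{i=1}^{I}\sum_{j=1}^{n_i}X_{i,j}\Bigg[-\frac{s_j}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)Y_{i,j}+I(Y_{i,j}>0)\exp(\phi^{(0)})\\
&\qquad\qquad+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\exp(\phi^{(0)})\Big\}\\
&\qquad\qquad\times\mu_{i,j}^{(0)}\exp\{X_{i,j}^{\top}(\beta-\beta^{(0)})(d+K+2)\}\,X_{i,j}^{\top}(d+K+2)\\
&\qquad-\frac{s_j(\mu_{i,j}^{(0)})^2}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\Big\}\\
&\qquad\qquad\times\exp\{2(d+K+2)X_{i,j}^{\top}(\beta-\beta^{(0)})\}\times2(d+K+2)\times X_{i,j}^{\top}\Bigg]\\
&=-(d+K+2)\sum_{i=1}^{I}\sum_{j=1}^{n_i}X_{i,j}X_{i,j}^{\top}\Bigg[\frac{s_j}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)Y_{i,j}+I(Y_{i,j}>0)\exp(\phi^{(0)})\\
&\qquad\qquad+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\exp(\phi^{(0)})\Big\}\,\mu_{i,j}^{(0)}\exp\{X_{i,j}^{\top}(\beta-\beta^{(0)})(d+K+2)\}\\
&\qquad+\frac{s_j(\mu_{i,j}^{(0)})^2}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\Big\}\\
&\qquad\qquad\times2\exp\{2(d+K+2)X_{i,j}^{\top}(\beta-\beta^{(0)})\}\Bigg].
\end{aligned}
\]

\[
\begin{aligned}
\frac{\partial^2}{\partial\beta\,\partial\beta^{\top}}\ell_4^{\ddagger}(\beta|\Theta^{(0)})\Big|_{\Theta=\Theta^{(0)}}
&=-(d+K+2)\sum_{i=1}^{I}\sum_{j=1}^{n_i}X_{i,j}X_{i,j}^{\top}\Bigg[\frac{s_j\mu_{i,j}^{(0)}}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)Y_{i,j}+I(Y_{i,j}>0)\exp(\phi^{(0)})\\
&\qquad\qquad+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\exp(\phi^{(0)})\Big\}\\
&\qquad+\frac{2s_j(\mu_{i,j}^{(0)})^2}{s_i\mu_{i,j}^{(0)}+\exp(\phi^{(0)})}\Big\{I(Y_{i,j}>0)+\frac{I(Y_{i,j}=0)\,\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}{(1-\omega_{i,j}^{(0)})+\omega_{i,j}^{(0)}G(\phi^{(0)},\mu_{i,j}^{(0)})}\Big\}\Bigg].
\end{aligned}
\]

APPENDIX C

KERNELIZED SIGNED GRAPH LEARNING FOR SINGLE CELL GENE REGULATORY NETWORK INFERENCE

C.1 Optimization Algorithm for Signed Graph Learning

In this section, we present an ADMM based algorithm to solve the optimization problem for signed graph learning. For convenience, we restate the optimization problem below:

\[
\begin{aligned}
\min_{L^{+},\,L^{-}\in\mathcal{L}}\quad&\mathrm{trace}(KL^{+})-\mathrm{trace}(KL^{-})+\alpha_1\|L^{+}\|_F^2+\alpha_2\|L^{-}\|_F^2\\
\text{s.t.}\quad&\mathrm{trace}(L^{+})=2n,\quad\mathrm{trace}(L^{-})=2n,\\
&L^{+}_{ij}=0\ \text{if}\ L^{-}_{ij}\neq0\ \text{and}\ L^{-}_{ij}=0\ \text{if}\ L^{+}_{ij}\neq0\quad\forall i\neq j.
\end{aligned}
\tag{C.1}
\]

This problem is non-convex due to the last two constraints, which are called complementarity constraints [157]. In [158], it is shown that the alternating direction method of multipliers (ADMM) converges for problems with complementarity constraints under some assumptions.

First, we rewrite the problem in vector form. Let upper(·) be an operator that takes an n × n matrix and returns an n(n − 1)/2-dimensional vector that corresponds to the upper triangular part of the input matrix. Define diag(x) as an operator which returns a diagonal matrix with the diagonal elements equal to the input vector x. Similarly, diag(X) returns the diagonal of the input matrix X as a vector. The matrix $P\in\mathbb{R}^{n\times n(n-1)/2}$ is defined such that $P\,\mathrm{upper}(A)=A\mathbf{1}$, where A is a symmetric matrix whose diagonal entries are equal to zero. Let $k=\mathrm{upper}(K)$, $d=\mathrm{diag}(K)$, $\ell^{+}=\mathrm{upper}(L^{+})$ and $\ell^{-}=\mathrm{upper}(L^{-})$. Thus, (C.1) can be rewritten as:

\[
\begin{aligned}
\min_{\ell^{+}\leq0,\ \ell^{-}\leq0}\quad&\langle2k-P^{\top}d,\ell^{+}\rangle-\langle2k-P^{\top}d,\ell^{-}\rangle+\alpha_1\langle(2I+P^{\top}P)\ell^{+},\ell^{+}\rangle+\alpha_2\langle(2I+P^{\top}P)\ell^{-},\ell^{-}\rangle\\
\text{s.t.}\quad&\mathbf{1}^{\top}\ell^{+}=-n,\quad\mathbf{1}^{\top}\ell^{-}=-n,\quad\text{and}\quad\ell^{+}\perp\ell^{-},
\end{aligned}
\tag{C.2}
\]

where the first two terms correspond to the trace terms in (C.1), the last two terms correspond to the Frobenius norm terms of (C.1), and the first two constraints are the same as the first two constraints of (C.1). The last constraint, together with $\ell^{+}\leq0$ and $\ell^{-}\leq0$, corresponds to the complementarity constraints. By introducing two slack variables $v=\ell^{+}$ and $w=\ell^{-}$, the problem is written in standard ADMM form:

\[
\begin{aligned}
\min_{v,\,w,\,\ell^{+},\,\ell^{-}}\quad&\imath_S(v,w)+h(\ell^{+},\ell^{-})+\imath_H(\ell^{+})+\imath_H(\ell^{-})\\
\text{s.t.}\quad&v-\ell^{+}=0,\quad w-\ell^{-}=0,
\end{aligned}
\tag{C.3}
\]

where $\imath_S(\cdot)$ is the indicator function for the complementarity set $S=\{(v,w):v\leq0,\ w\leq0,\ v\perp w\}$, $h(\ell^{+},\ell^{-})$ is the objective function in (C.2), and $\imath_H(\cdot)$ is the indicator function for the hyperplane $H=\{\ell:\mathbf{1}^{\top}\ell=-n\}$.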
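The vectorization above can be checked numerically. The following is a minimal sketch, assuming NumPy and a row-major ordering of the upper triangle for upper(·); the helper names `upper`, `build_P` and `vectorized_terms` are illustrative and do not come from the scSGL code. It builds P explicitly for a small n and compares the trace, Frobenius and trace-constraint terms in matrix and vector form.

```python
import numpy as np

def upper(A):
    """Return the strictly upper-triangular entries of A as a vector (row-major order)."""
    i, j = np.triu_indices(A.shape[0], k=1)
    return A[i, j]

def build_P(n):
    """Dense n x n(n-1)/2 matrix P with P @ upper(A) == A @ 1 for symmetric, zero-diagonal A."""
    i, j = np.triu_indices(n, k=1)
    cols = np.arange(i.size)
    P = np.zeros((n, i.size))
    P[i, cols] = 1.0   # pair (i, j) contributes to row sum i
    P[j, cols] = 1.0   # and, by symmetry, to row sum j
    return P

def vectorized_terms(K, L):
    """Vector-form trace term, Frobenius term and trace of a Laplacian L."""
    n = K.shape[0]
    P = build_P(n)
    ell, k, d = upper(L), upper(K), np.diag(K)
    trace_term = (2.0 * k - P.T @ d) @ ell
    frob_term = ell @ ((2.0 * np.eye(P.shape[1]) + P.T @ P) @ ell)
    trace_L = -2.0 * ell.sum()
    return trace_term, frob_term, trace_L

# Small random example: a graph Laplacian (nonpositive off-diagonals, zero row sums)
# and an arbitrary symmetric matrix K standing in for the kernel matrix.
rng = np.random.default_rng(1)
n = 5
W = np.triu(rng.uniform(size=(n, n)), k=1)
W = W + W.T
L = np.diag(W.sum(axis=1)) - W
B = rng.normal(size=(n, 8))
K = B @ B.T

trace_term, frob_term, trace_L = vectorized_terms(K, L)
```

The identity trace(L) = −2·1⊤ℓ explains why the constraint trace(L) = 2n becomes 1⊤ℓ = −n in the vector form.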
The augmented Lagrangian of (C.3) is:

\[
\begin{aligned}
\mathcal{L}_\rho(v,w,\ell^{+},\ell^{-},\lambda_1,\lambda_2)=\ &\imath_S(v,w)+h(\ell^{+},\ell^{-})+\imath_H(\ell^{+})+\imath_H(\ell^{-})\\
&+\lambda_1^{\top}(v-\ell^{+})+\frac{\rho}{2}\|v-\ell^{+}\|_2^2+\lambda_2^{\top}(w-\ell^{-})+\frac{\rho}{2}\|w-\ell^{-}\|_2^2,
\end{aligned}
\tag{C.4}
\]

where $\lambda_1$ and $\lambda_2$ are Lagrange multipliers and $\rho>0$ is the augmented Lagrangian parameter.

(v, w)-step: The (v, w)-step of ADMM can be found as the projection onto the complementarity set S:

\[
(v^{k+1},w^{k+1})=\operatorname*{argmin}_{v,w}\ \imath_S(v,w)+\frac{\rho}{2}\Big\|v-\ell^{+k}+\frac{\lambda_1^{k}}{\rho}\Big\|_2^2+\frac{\rho}{2}\Big\|w-\ell^{-k}+\frac{\lambda_2^{k}}{\rho}\Big\|_2^2=\Pi_S(y),
\tag{C.5}
\]

where $y=[(\ell^{+k}-\lambda_1^{k}/\rho)^{\top},(\ell^{-k}-\lambda_2^{k}/\rho)^{\top}]^{\top}$ and $\Pi_S(\cdot)$ is the projection operator onto the set S.

(ℓ+, ℓ−)-step: Using the fact that the optimization can be performed separately for $\ell^{+}$ and $\ell^{-}$, the $\ell^{+}$-step can be written as:

\[
\begin{aligned}
\ell^{+k+1}&=\operatorname*{argmin}_{\ell^{+}}\ z^{\top}\ell^{+}+\alpha_1\langle(2I+P^{\top}P)\ell^{+},\ell^{+}\rangle+\imath_H(\ell^{+})+\frac{\rho}{2}\Big\|v^{k+1}-\ell^{+}+\frac{\lambda_1^{k}}{\rho}\Big\|_2^2\\
&=\Pi_H\big[((4\alpha_1+\rho)I+2\alpha_1P^{\top}P)^{-1}(\rho v^{k+1}+\lambda_1^{k}-z)\big],
\end{aligned}
\tag{C.6}
\]

where $z=2k-P^{\top}d$ and $\Pi_H(\cdot)$ is the projection operator onto the hyperplane H. Similarly, the $\ell^{-}$-step can be written as:

\[
\ell^{-k+1}=\Pi_H\big[((4\alpha_2+\rho)I+2\alpha_2P^{\top}P)^{-1}(\rho w^{k+1}+\lambda_2^{k}+z)\big].
\tag{C.7}
\]

Lagrange multipliers update: The updates of the Lagrange multipliers are:

\[
\lambda_1^{k+1}=\lambda_1^{k}+\rho(v^{k+1}-\ell^{+k+1}),
\tag{C.8}
\]

\[
\lambda_2^{k+1}=\lambda_2^{k}+\rho(w^{k+1}-\ell^{-k+1}).
\tag{C.9}
\]

C.1.1 Computational and Storage Complexity

The computational complexity of the optimization procedure described above can be found by determining how many computations are required for each ADMM step. Let M = n(n − 1)/2, where n is the number of genes. The (v, w)-step can be performed in O(M) time, or O(n²) time. The (ℓ+, ℓ−)-step requires the inversion of the matrix $(4\alpha_1+\rho)I+2\alpha_1P^{\top}P$, which needs to be calculated only once before the optimization iterations. The inverse matrix has a closed form solution which can be found using the Woodbury matrix identity. It has a decomposition of the form A⊤A, where A is a sparse matrix with O(n²) non-zero entries. Thus, the matrix-vector multiplication of the (ℓ+, ℓ−)-step can be done in O(n²) time.
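The iterations (C.5) to (C.9) can be sketched in a few lines. This is a hedged illustration, not the scSGL implementation: it assumes NumPy, a row-major ordering for upper(·), and an elementwise rule for Π_S that is derived here rather than stated in the text (for each pair, keep the more negative coordinate and set the other to zero). The linear solve applies the Woodbury identity together with the observation, also an assumption of this sketch, that PP⊤ = (n − 2)I + 11⊤, so neither P nor any inverse is formed. All function names, the starting point and the default ρ are hypothetical.

```python
import numpy as np

def upper(A):
    """Strictly upper-triangular entries of A as a vector (row-major order)."""
    i, j = np.triu_indices(A.shape[0], k=1)
    return A[i, j]

def P_mul(x, n):
    """Compute P @ x without forming P: entry r sums x over all pairs containing r."""
    i, j = np.triu_indices(n, k=1)
    return np.bincount(i, weights=x, minlength=n) + np.bincount(j, weights=x, minlength=n)

def Pt_mul(y, n):
    """Compute P.T @ y without forming P: entry for pair (i, j) is y[i] + y[j]."""
    i, j = np.triu_indices(n, k=1)
    return y[i] + y[j]

def solve(b, a, c, n):
    """Return (a*I + c*P.T@P)^{-1} b via Woodbury, using P@P.T = (n-2)*I + 1 1^T."""
    Pb = P_mul(b, n)
    t = a + c * (n - 2)
    u = (Pb - c * Pb.sum() / (t + c * n)) / t   # (a*I_n + c*P@P.T)^{-1} @ Pb
    return (b - c * Pt_mul(u, n)) / a

def proj_S(a, b):
    """Projection onto S = {v <= 0, w <= 0, v_i * w_i = 0}, applied elementwise."""
    v = np.where(a <= b, np.minimum(a, 0.0), 0.0)
    w = np.where(b < a, np.minimum(b, 0.0), 0.0)
    return v, w

def proj_H(x, n):
    """Projection onto the hyperplane {x : sum(x) = -n}."""
    return x - (x.sum() + n) / x.size

def scsgl_admm(K, alpha1, alpha2, rho=10.0, n_iter=300):
    """Run a fixed number of ADMM iterations for (C.3); returns (v, w, l_plus, l_minus)."""
    n = K.shape[0]
    M = n * (n - 1) // 2
    z = 2.0 * upper(K) - Pt_mul(np.diag(K), n)
    lp = np.full(M, -n / M)          # arbitrary starting point on the hyperplane H
    ln = lp.copy()
    lam1, lam2 = np.zeros(M), np.zeros(M)
    for _ in range(n_iter):
        v, w = proj_S(lp - lam1 / rho, ln - lam2 / rho)                                   # (C.5)
        lp = proj_H(solve(rho * v + lam1 - z, 4 * alpha1 + rho, 2 * alpha1, n), n)        # (C.6)
        ln = proj_H(solve(rho * w + lam2 + z, 4 * alpha2 + rho, 2 * alpha2, n), n)        # (C.7)
        lam1 = lam1 + rho * (v - lp)                                                      # (C.8)
        lam2 = lam2 + rho * (w - ln)                                                      # (C.9)
    return v, w, lp, ln
```

Each iteration touches only vectors of length M = n(n − 1)/2, which matches the O(n²) per-iteration cost discussed in this subsection. Because the problem is non-convex, a fixed iteration count is used here purely for illustration; a practical implementation would monitor the primal and dual residuals.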
Updates of the Lagrange multipliers can also be performed in O(M) time, or O(n²) time. Let I be the number of iterations required for the convergence of ADMM. Thus, the overall time complexity of scSGL is O(In²). The storage complexity of scSGL is determined by the size of the inverse matrix required in the (ℓ+, ℓ−)-step. Since this matrix has a decomposition of the form A⊤A, we only need to store A. Thus, the storage complexity of scSGL is O(n²). The computational and storage complexity of ADMM is quadratic in the number of genes and is not affected by the number of cells.

Note that scSGL also requires the construction of the kernel matrix before running the optimization. Since there are already very efficient tools to construct kernel matrices [130], we did not include their complexity in the analysis above. Finally, there are recent works in the GSP literature for scaling GL methods to learning graphs with millions of nodes [159]. These approaches can be employed to scale scSGL, which we leave as a future pursuit.

C.2 AUROC and EPR Results

In the main text, we consider AUPRC based metrics defined above as the main performance metrics for comparison due to the inherent sparsity of GRNs. In this section, AUROC and EPR values for the synthetic data used in the parameter sensitivity analysis are reported in Figures C.1 and C.2, respectively. AUROC and EPR ratios for real datasets are also reported in Figure C.3.
AUROC Ratio
                 hESC                              mESC
Method           Specific  Non-specific  STRING    Specific  Non-specific  STRING
GENIE3           0.921     0.946         1.236     0.985     1.066         1.156
GRNBOOST2        1.015     0.989         1.160     1.016     1.043         1.125
PIDC             1.012     0.992         1.241     1.011     1.040         1.114
PPCOR            1.000     1.000         1.000     1.000     1.000         1.000
scSGL-r          0.942     1.047         1.208     1.011     1.056         1.121
scSGL-τ          0.947     1.018         1.151     1.020     1.066         1.155
scSGL-τ_zi       0.958     1.049         1.173     1.009     1.055         1.121

EPR Ratio
                 hESC                              mESC
Method           Specific  Non-specific  STRING    Specific  Non-specific  STRING
GENIE3           0.959     1.778         3.457     1.054     3.237         3.268
GRNBOOST2        0.941     1.560         3.194     1.049     3.062         3.290
PIDC             0.914     1.904         3.750     1.001     2.527         3.179
PPCOR            1.000     1.000         1.000     1.000     1.000         1.000
scSGL-r          1.006     1.905         3.096     1.005     2.011         2.673
scSGL-τ          0.910     1.868         4.414     1.050     1.253         1.366
scSGL-τ_zi       0.991     1.850         3.398     0.999     1.944         2.364

Figure C.3 AUROC and EPR ratios of methods for two real-world scRNAseq datasets. Inferred graphs are compared to three different gene regulatory databases.

realizations of an expression data with 500 cells. Compared to the curated datasets analyzed in the main text, these datasets do not include any dropouts. We calculated AUPRC ratios for activating and inhibitory edges separately, and the averages over 10 realizations are reported in Figure C.4. The figure indicates that scSGL along with PPCOR are the best performing methods for inference of activating edges. For BF, BFC and CY, PPCOR followed by scSGL-r have the highest AUPRC ratios. On the other hand, scSGL-r followed by the other kernels and PPCOR have the highest performances for LI, LL and TF. For the inference of inhibitory edges, the best performing method varies across networks. In BF, GRNBOOST2 shows the best performance; in BFC and TF, scSGL-τ and GRNBOOST2 have the highest AUPRC ratios; and for the remaining datasets scSGL-r followed by PPCOR perform better than the others.
In [3], it is observed that methods perform well on linear networks (LI and LL), while inference in the remaining networks is harder. The AUPRC ratios reported in Figure C.4 are in line with this observation: AUPRC ratios for linear networks are generally higher than those for BF, BFC, CY and TF. Overall, scSGL-r and PPCOR are the best performing methods when the results on activating and inhibitory edges are evaluated together. Finally, when the performances of the kernels are compared, the correlation kernel shows the highest performance, followed by the zero-inflated Kendall kernel.

Synthetic Activating
Method        BF     BFC    CY     LI     LL     TF
GENIE3        1.87   2.84   2.76   2.90   6.40   3.88
GRNBOOST2     1.37   3.23   2.93   2.74   7.05   2.97
PIDC          1.92   2.22   3.22   2.29   7.00   3.16
PPCOR         3.58   4.03   5.21   3.25   7.50   5.08
scSGL-r       3.25   3.48   4.92   3.36   8.65   5.08
scSGL-τ       2.75   3.29   4.28   3.30   8.48   4.12
scSGL-τzi     3.39   3.39   4.83   3.02   8.10   5.04

Synthetic Inhibitory
Method        BF     BFC    CY     LI      LL      TF
GENIE3        1.76   2.00   1.70   6.74    78.70   1.45
GRNBOOST2     2.54   2.18   1.90   4.54    51.36   1.59
PIDC          1.81   1.62   1.93   8.42    31.29   1.48
PPCOR         1.86   1.68   3.19   13.43   79.07   1.36
scSGL-r       1.73   2.05   3.17   15.73   93.28   1.47
scSGL-τ       2.10   1.93   2.89   4.40    1.00    1.41
scSGL-τzi     1.91   2.22   3.14   12.45   29.62   1.56

Figure C.4 Performance of scSGL and state-of-the-art methods on curated datasets as measured by AUPRC ratios for activating and inhibitory edges. Each column corresponds to a synthetic network. Abbreviations: LI, linear; CY, cycle; LL, linear long; BF, bifurcating; BFC, bifurcating converging; and TF, trifurcating.

C.4 Cell-Type Specific GRN Inference

scSGL is developed based on the assumption that all cells are related to a single GRN. However, single cell datasets are generally a combination of cells arising from varying cell-types, and may therefore necessitate the inference of cell-type specific GRNs. Cell-type specific GRNs can be learned in our framework by adding a cell-type clustering step before applying scSGL.
One could group the cells either by using the cluster labels provided by the authors of the original experimental study or, when pre-defined cell labels are absent, by clustering the dataset with one of the many clustering algorithms proposed for single cell data [160]. Assuming independence within cell-groups, we could estimate cell-type specific networks using scSGL for each cluster separately.

In this section, we demonstrate the process of using scSGL to learn cell-type specific GRNs and apply this process to the differentiation dataset hESC. We apply scSGL separately to the hESC dataset clustered by time point (0, 12, 24, 36, 72 and 96 h) and learn scSGL graphs between 24 lineage-specific marker genes [143] at these different time points.

[Figure C.5: six network panels, one per time point: T = 0, T = 12, T = 24, T = 36, T = 72 and T = 96.]

Figure C.5 Edges detected using scSGL-r between 24 lineage marker genes of hESC at different time points of the differentiation process. Only edges whose absolute edge weights fall into the top 10 percent are shown. Edge thicknesses are proportional to their weights, and node sizes are proportional to their degrees.

Figure C.5 demonstrates the absence of edges from the Gata family binding proteins at 0 h. Gata family binding proteins have been reported as necessary for the development and function of a number of endoderm-derived tissues and cells [146, 161]. Onset of Gata4 and Gata6 expression has been reported to be coincident with the beginning of endoderm gene expression; hence the absence of edges from Gata4 and Gata6 at 0 h is indicative of the undifferentiated nature of the single cells [146]. Weak interactions start to emerge at 12 h of differentiation, with inhibition of the pluripotency markers Nanog and Sox2. At 24 h of differentiation, we notice a stronger inhibition of Nanog and of the pluripotency marker Pmaip1 by Gata6, indicating a transition of the cells towards a primitive streak state. Hand1 has been reported to play an essential role in both trophoblast giant cell differentiation and cardiac morphogenesis [162]. Inhibition of the known definitive endoderm (DE) marker Cer1 by Hand1 and Gata family TFs at 36 and 72 h of differentiation indicates an advanced state of differentiation. The appearance of the key DE markers Gata2, Gata4, Gata6, Cer1 and Eomes as hub nodes in the 96 h network indicates that the cells have progressed toward the DE state. This analysis demonstrates that scSGL identifies gene network changes from data clustered over time points. We acknowledge that analyzing the dataset in this manner does not exploit the similarity between the true cell-type specific networks, while estimating a single network for the different cell-types ignores the fact that we do not expect the cell-type specific graphs to be identical.
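The group-wise workflow described in this section can be sketched as follows: cells are split by a label (a cell type or a time point), a signed network is estimated for each group separately, and only the edges whose absolute weights fall into the top 10 percent are kept, as in Figure C.5. In this sketch `infer_network` is a hypothetical stand-in based on pairwise Pearson correlation, not the scSGL estimator, and the expression values are toy numbers rather than hESC measurements.

```python
from collections import defaultdict
from itertools import combinations
from math import ceil, sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def infer_network(cells, genes):
    """Hypothetical stand-in for a signed network estimator such as scSGL:
    maps each gene pair to a signed weight (here, a correlation)."""
    return {
        (g, h): pearson([c[g] for c in cells], [c[h] for c in cells])
        for g, h in combinations(genes, 2)
    }


def top_fraction(edges, fraction=0.10):
    """Keep the edges with the largest absolute weights (at least one edge)."""
    keep = max(1, ceil(fraction * len(edges)))
    ranked = sorted(edges.items(), key=lambda kv: -abs(kv[1]))
    return dict(ranked[:keep])


def networks_by_group(cells, labels, genes, fraction=0.10):
    """Estimate one thresholded network per group of cells."""
    groups = defaultdict(list)
    for cell, label in zip(cells, labels):
        groups[label].append(cell)
    return {
        label: top_fraction(infer_network(group, genes), fraction)
        for label, group in groups.items()
    }


# Toy data: three genes, cells labelled by time point (0 h and 96 h).
genes = ["GATA6", "NANOG", "SOX2"]
cells = [
    {"GATA6": 0.0, "NANOG": 5.0, "SOX2": 4.0},
    {"GATA6": 0.0, "NANOG": 6.0, "SOX2": 5.0},
    {"GATA6": 0.0, "NANOG": 7.0, "SOX2": 6.5},
    {"GATA6": 1.0, "NANOG": 4.0, "SOX2": 2.0},
    {"GATA6": 2.0, "NANOG": 3.0, "SOX2": 2.5},
    {"GATA6": 3.0, "NANOG": 2.0, "SOX2": 2.0},
    {"GATA6": 4.0, "NANOG": 1.0, "SOX2": 2.5},
]
labels = [0, 0, 0, 96, 96, 96, 96]
result = networks_by_group(cells, labels, genes)
print(result)
```

Replacing `infer_network` with a call to an actual estimator leaves the grouping and thresholding steps unchanged, since each group is processed independently.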
Our optimization framework can be extended to jointly learn Laplacians estimated from multiple cell groups, but that is out of scope for the current paper and will be considered in future research.

BIBLIOGRAPHY

[1] Rance Nault, Satabdi Saha, Sudin Bhattacharya, Jack Dodson, Samiran Sinha, Tapabrata Maiti, and Tim Zacharewski. Benchmarking of a bayesian single cell rnaseq differential gene expression test for dose–response study designs. Nucleic acids research, 50(8):e48–e48, 2022.
[2] Abdullah Karaaslanli, Satabdi Saha, Selin Aviyente, and Tapabrata Maiti. scsgl: kernelized signed graph learning for single-cell gene regulatory network inference. Bioinformatics, 38(11):3011–3019, 2022.
[3] Aditya Pratapa, Amogh P Jalihal, Jeffrey N Law, Aditya Bharadwaj, and TM Murali. Benchmarking algorithms for gene regulatory network inference from single-cell transcriptomic data. Nature methods, 17(2):147–154, 2020.
[4] L. Zappia, B. Phipson, and A. Oshlack. Splatter: simulation of single-cell rna sequencing data. Genome Biol, 18(1):174, 2017.
[5] Ehud Shapiro, Tamir Biezuner, and Sten Linnarsson. Single-cell sequencing-based technologies will revolutionize whole-organism science. Nature Reviews Genetics, 14(9):618–630, 2013.
[6] Cole Trapnell. Defining cell types and states with single-cell genomics. Genome research, 25(10):1491–1498, 2015.
[7] Aleksandra A Kolodziejczyk, Jong Kyoung Kim, Valentine Svensson, John C Marioni, and Sarah A Teichmann. The technology and biology of single-cell rna sequencing. Molecular cell, 58(4):610–620, 2015.
[8] Oliver Stegle, Sarah A Teichmann, and John C Marioni. Computational and analytical challenges in single-cell transcriptomics. Nature Reviews Genetics, 16(3):133–145, 2015.
[9] Charles Gawad, Winston Koh, and Stephen R Quake. Single-cell genome sequencing: current state of the science. Nature Reviews Genetics, 17(3):175–188, 2016.
[10] Peter V Kharchenko, Lev Silberstein, and David T Scadden.
Bayesian approach to single-cell differential expression analysis. Nature methods, 11(7):740–742, 2014.
[11] Davide Risso, Fanny Perraudeau, Svetlana Gribkova, Sandrine Dudoit, and Jean-Philippe Vert. Zinb-wave: A general and flexible method for signal extraction from single-cell rna-seq data. BioRxiv, page 125112, 2017.
[12] Kwangbom Choi, Yang Chen, Daniel A Skelly, and Gary A Churchill. Bayesian model selection reveals biological origins of zero inflation in single-cell transcriptomics. Genome biology, 21(1):1–16, 2020.
[13] Emma Pierson and Christopher Yau. Zifa: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome biology, 16(1):1–10, 2015.
[14] Andrew McDavid, Greg Finak, Pratip K Chattopadyay, Maria Dominguez, Laurie Lamoreaux, Steven S Ma, Mario Roederer, and Raphael Gottardo. Data exploration, quality control and testing in single-cell qpcr-based gene expression experiments. Bioinformatics, 29(4):461–467, 2013.
[15] Abhishek Sarkar and Matthew Stephens. Separating measurement and expression models clarifies confusion in single-cell rna sequencing analysis. Nature genetics, 53(6):770–777, 2021.
[16] Greg Finak, Andrew McDavid, Masanao Yajima, Jingyuan Deng, Vivian Gersuk, Alex K Shalek, Chloe K Slichter, Hannah W Miller, M Juliana McElrath, Martin Prlic, et al. Mast: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell rna sequencing data. Genome biology, 16(1):1–13, 2015.
[17] Andrew McDavid, Raphael Gottardo, Noah Simon, and Mathias Drton. Graphical models for zero-inflated single cell gene expression. The annals of applied statistics, 13(2):848, 2019.
[18] Mark D Robinson, Davis J McCarthy, and Gordon K Smyth. edger: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139–140, 2010.
[19] Simon Anders and Wolfgang Huber. Differential expression analysis for sequence count data.
Nature Precedings, pages 1–1, 2010.
[20] Charity W Law, Yunshun Chen, Wei Shi, and Gordon K Smyth. voom: Precision weights unlock linear model analysis tools for rna-seq read counts. Genome biology, 15(2):1–17, 2014.
[21] Valentine Svensson. Droplet scrna-seq is not zero-inflated. Nature Biotechnology, 38(2):147–150, 2020.
[22] Justin D Silverman, Kimberly Roche, Sayan Mukherjee, and Lawrence A David. Naught all zeros in sequence count data are the same. Computational and structural biotechnology journal, 18:2789, 2020.
[23] KS Crump, DG Hoel, CH Langley, and R Peto. Fundamental carcinogenic processes and their implications for low dose risk assessment. Cancer research, 36(9_Part_1):2973–2979, 1976.
[24] Food, Drug Administration, et al. Guidance for industry: exposure-response relationships-study design, data analysis, and regulatory applications. http://www.fda.gov/cber/gdlns/exposure.pdf, 2003.
[25] L Martin. Benchmark dose software (bmds) version 2.1 user’s manual version 2.0. Washington, DC: United States Environmental Protection Agency, Office of Environmental Information, 2009.
[26] National Toxicology Program et al. Ntp research report on national toxicology program approach to genomic dose-response modeling: Research report 5 [internet]. NTP Research Report on National Toxicology Program, 2018.
[27] Qiang Zhang, W Michael Caudle, Jingbo Pi, Sudin Bhattacharya, Melvin E Andersen, Norbert E Kaminski, and Rory B Conolly. Embracing systems toxicology at single-cell resolution. Current opinion in toxicology, 16:49–57, 2019.
[28] J Allen Davis, Jeffrey S Gift, and Q Jay Zhao. Introduction to benchmark dose methods and us epa’s benchmark dose software (bmds) version 2.1.1. Toxicology and applied pharmacology, 254(2):181–191, 2011.
[29] Beate Vieth, Swati Parekh, Christoph Ziegenhain, Wolfgang Enard, and Ines Hellmann. A systematic evaluation of single cell rna-seq analysis pipelines. Nature communications, 10(1):1–11, 2019.
[30] Zhun Miao, Ke Deng, Xiaowo Wang, and Xuegong Zhang. Desingle for detecting three types of differential expression in single-cell rna-seq data. Bioinformatics, 34(18):3223–3224, 2018.
[31] Keegan D Korthauer, Li-Fang Chu, Michael A Newton, Yuan Li, James Thomson, Ron Stewart, and Christina Kendziorski. A statistical approach for identifying differential distributions in single-cell rna-seq experiments. Genome biology, 17(1):1–15, 2016.
[32] Charlotte Soneson and Mark D Robinson. Bias, robustness and scalability in single-cell differential expression analysis. Nature methods, 15(4):255–261, 2018.
[33] Tian Mou, Wenjiang Deng, Fengyun Gu, Yudi Pawitan, and Trung Nghia Vu. Reproducibility of methods to detect differentially expressed genes from single-cell rna sequencing. Frontiers in genetics, 10:1331, 2020.
[34] Maria K Jaakkola, Fatemeh Seyednasrollah, Arfa Mehmood, and Laura L Elo. Comparison of methods to detect differentially expressed genes between single-cell populations. Briefings in bioinformatics, 18(5):735–743, 2017.
[35] Tae Kyun Kim. Understanding one-way anova using conceptual figures. Korean journal of anesthesiology, 70(1):22–26, 2017.
[36] DA Williams. A test for differences between treatment means when several dose levels are compared with a zero dose control. Biometrics, pages 103–117, 1971.
[37] Jan De Leeuw, Kurt Hornik, and Patrick Mair. Isotone optimization in r: pool-adjacent-violators algorithm (pava) and active set methods. Journal of statistical software, 32:1–24, 2010.
[38] Tim Holland-Letz and Annette Kopp-Schneider. Optimal experimental designs for dose–response studies with continuous endpoints. Archives of toxicology, 89(11):2059–2068, 2015.
[39] Marc Aerts, Matthew W Wheeler, and José Cortiñas Abrahantes. An extended and unified modeling framework for benchmark dose estimation for both continuous and binary data. Environmetrics, 31(7):e2630, 2020.
[40] Matthew W Wheeler, Jose Cortiñas Abrahantes, Marc Aerts, Jeffery S Gift, and Jerry Allen Davis. Continuous model averaging for benchmark dose analysis: Averaging over distributional forms. Environmetrics, page e2728, 2022.
[41] Richard L Schmoyer. Sigmoidally constrained maximum likelihood estimation in quantal bioassay. Journal of the American Statistical Association, 79(386):448–453, 1984.
[42] Colleen Kelly and John Rice. Monotone smoothing with application to dose-response curves and the assessment of synergism. Biometrics, pages 1071–1085, 1990.
[43] Michel Delecroix, Michel Simioni, and Christine Thomas-Agnan. Functional estimation under shape constraints. Journal of Nonparametric Statistics, 6(1):69–89, 1996.
[44] Brian Neelon and David B Dunson. Bayesian isotonic regression and trend analysis. Biometrics, 60(2):398–406, 2004.
[45] Björn Bornkamp and Katja Ickstadt. Bayesian nonparametric estimation of continuous monotone functions with applications to dose–response analysis. Biometrics, 65(1):198–205, 2009.
[46] Lizhen Lin and David B Dunson. Bayesian monotone regression using gaussian process projection. Biometrika, 101(2):303–317, 2014.
[47] Daniel Marbach, James C Costello, Robert Küffner, Nicole M Vega, Robert J Prill, Diogo M Camacho, Kyle R Allison, Manolis Kellis, James J Collins, and Gustavo Stolovitzky. Wisdom of crowds for robust gene network inference. Nature methods, 9(8):796–804, 2012.
[48] Lian En Chai, Swee Kuan Loh, Swee Thing Low, Mohd Saberi Mohamad, Safaai Deris, and Zalmiyah Zakaria. A review on the computational approaches for gene regulatory network construction. Computers in biology and medicine, 48:55–65, 2014.
[49] Peter Langfelder and Steve Horvath. Wgcna: an r package for weighted correlation network analysis. BMC bioinformatics, 9(1):1–13, 2008.
[50] Seongho Kim. ppcor: an r package for a fast calculation to semi-partial correlation coefficients.
Communications for statistical applications and methods, 22(6):665, 2015.
[51] Nir Friedman, Michal Linial, Iftach Nachman, and Dana Pe’er. Using bayesian networks to analyze expression data. Journal of computational biology, 7(3-4):601–620, 2000.
[52] Vân Anh Huynh-Thu, Alexandre Irrthum, Louis Wehenkel, and Pierre Geurts. Inferring regulatory networks from expression data using tree-based methods. PloS one, 5(9):1–10, 2010.
[53] Thomas Moerman, Sara Aibar Santos, Carmen Bravo González-Blas, Jaak Simm, Yves Moreau, Jan Aerts, and Stein Aerts. Grnboost2 and arboreto: efficient and scalable inference of gene regulatory networks. Bioinformatics, 35(12):2159–2161, 2019.
[54] Adam A Margolin, Ilya Nemenman, Katia Basso, Chris Wiggins, Gustavo Stolovitzky, Riccardo Dalla Favera, and Andrea Califano. Aracne: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. In BMC bioinformatics, volume 7, pages 1–15. Springer, 2006.
[55] Jeremiah J Faith, Boris Hayete, Joshua T Thaden, Ilaria Mogno, Jamey Wierzbowski, Guillaume Cottarel, Simon Kasif, James J Collins, and Timothy S Gardner. Large-scale mapping and validation of escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS biol, 5(1):e8, 2007.
[56] Kevin Murphy, Saira Mian, et al. Modelling gene expression data using dynamic bayesian networks. Technical report, Citeseer, 1999.
[57] Jiguo Cao, Xin Qi, and Hongyu Zhao. Modeling gene regulation networks using ordinary differential equations. In Next generation microarray bioinformatics, pages 185–197. Springer, 2012.
[58] Pierre Geurts et al. dyngenie3: dynamical genie3 for the inference of gene networks from time series expression data. Scientific reports, 8(1):1–12, 2018.
[59] Ziv Bar-Joseph, Georg K Gerber, Tong Ihn Lee, Nicola J Rinaldi, Jane Y Yoo, François Robert, D Benjamin Gordon, Ernest Fraenkel, Tommi S Jaakkola, Richard A Young, et al.
Computational discovery of gene modules and regulatory networks. Nature biotechnology, 21(11):1337–1342, 2003.
[60] Keren Bahar Halpern, Rom Shenhav, Orit Matcovitch-Natan, Beata Toth, Doron Lemze, Matan Golan, Efi E Massasa, Shaked Baydatch, Shanie Landen, Andreas E Moor, et al. Single-cell spatial reconstruction reveals global division of labour in the mammalian liver. Nature, 542(7641):352–356, 2017.
[61] Tianhao Mu, Liqin Xu, Yu Zhong, Xinyu Liu, Zhikun Zhao, Chaoben Huang, Xiaofeng Lan, Chengchen Lufei, Yi Zhou, Yixun Su, et al. Embryonic liver developmental trajectory revealed by single-cell rna sequencing in the foxa2egfp mouse. Communications biology, 3(1):1–12, 2020.
[62] Dongyin Guan, Ying Xiong, Trang Minh Trinh, Yang Xiao, Wenxiang Hu, Chunjie Jiang, Pieterjan Dierickx, Cholsoon Jang, Joshua D Rabinowitz, and Mitchell A Lazar. The hepatocyte clock and feeding control chronophysiology of multiple liver cell types. Science, 369(6509):1388–1394, 2020.
[63] Xuelian Xiong, Henry Kuang, Sahar Ansari, Tongyu Liu, Jianke Gong, Shuai Wang, Xu-Yun Zhao, Yewei Ji, Chuan Li, Liang Guo, et al. Landscape of intercellular crosstalk in healthy and nash liver revealed by single-cell secretome gene analysis. Molecular cell, 75(3):644–660, 2019.
[64] Reza Farmahin, Anne Marie Gannon, Rémi Gagné, Andrea Rowan-Carroll, Byron Kuo, Andrew Williams, Ivan Curran, and Carole L Yauk. Hepatic transcriptional dose-response analysis of male and female fischer rats exposed to hexabromocyclododecane. Food and Chemical Toxicology, 133:110262, 2019.
[65] Ivy Moffat, Nikolai L Chepelev, Sarah Labib, Julie Bourdon-Lacombe, Byron Kuo, Julie K Buick, France Lemieux, Andrew Williams, Sabina Halappanavar, Amal I Malik, et al. Comparison of toxicogenomics and traditional approaches to inform mode of action and points of departure in human health risk assessment of benzo[a]pyrene in drinking water. Critical reviews in toxicology, 45(1):1–43, 2015.
[66] A Francina Webster, Nikolai Chepelev, Rémi Gagné, Byron Kuo, Leslie Recio, Andrew Williams, and Carole L Yauk. Impact of genomics platform and statistical filtering on transcriptional benchmark doses (bmd) and multiple approaches for selection of chemical point of departure (pod). PLoS One, 10(8):e0136764, 2015.
[67] Timothy W Gant and Shu-Dong Zhang. In pursuit of effective toxicogenomics. Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis, 575(1-2):4–16, 2005.
[68] Samarendra Das and Shesh N Rai. Swarnseq: An improved statistical approach for differential expression analysis of single-cell rna-seq data. Genomics, 113(3):1308–1324, 2021.
[69] Minjeong Jeon and Paul De Boeck. Decision qualities of bayes factor and p value-based hypothesis testing. Psychological Methods, 22(2):340, 2017.
[70] Yong Li, Xiao-Bin Liu, and Jun Yu. A bayesian chi-squared test for hypothesis testing. Journal of Econometrics, 189(1):54–69, 2015.
[71] Kelly A Fader, Rance Nault, Mathew P Kirby, Gena Markous, Jason Matthews, and Timothy R Zacharewski. Convergence of hepcidin deficiency, systemic iron overloading, heme accumulation, and rev-erbα/β activation in aryl hydrocarbon receptor-elicited hepatotoxicity. Toxicology and applied pharmacology, 321:1–17, 2017.
[72] Nathalie Percie du Sert, Viki Hurst, Amrita Ahluwalia, Sabina Alam, Marc T Avey, Monya Baker, William J Browne, Alejandra Clark, Innes C Cuthill, Ulrich Dirnagl, et al. The arrive guidelines 2.0: Updated guidelines for reporting animal research. Journal of Cerebral Blood Flow & Metabolism, 40(9):1769–1777, 2020.
[73] Rance Nault, Kelly A Fader, Sudin Bhattacharya, and Tim R Zacharewski. Single-nuclei rna sequencing assessment of the hepatic effects of 2,3,7,8-tetrachlorodibenzo-p-dioxin. Cellular and Molecular Gastroenterology and Hepatology, 11(1):147–159, 2021.
[74] Andrew Butler, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija.
Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature biotechnology, 36(5):411–420, 2018.
[75] Luke Zappia, Belinda Phipson, and Alicia Oshlack. Splatter: simulation of single-cell rna sequencing data. Genome biology, 18(1):1–15, 2017.
[76] Michael A Newton, Amine Noueiry, Deepayan Sarkar, and Paul Ahlquist. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics, 5(2):155–176, 2004.
[77] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological), 57(1):289–300, 1995.
[78] Gordon K Smyth. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Statistical applications in genetics and molecular biology, 3(1), 2004.
[79] Frank Wilcoxon. Individual comparisons by ranking methods. In Breakthroughs in statistics, pages 196–202. Springer, 1992.
[80] Ronald Aylmer Fisher et al. On the "probable error" of a coefficient of correlation deduced from a small sample (1921). Contributions to Mathematical Statistics, 3–32, 1950.
[81] William H Kruskal and W Allen Wallis. Use of ranks in one-criterion variance analysis. Journal of the American statistical Association, 47(260):583–621, 1952.
[82] Rafael A Irizarry, Daniel Warren, Forrest Spencer, Irene F Kim, Shyam Biswal, Bryan C Frank, Edward Gabrielson, Joe GN Garcia, Joel Geoghegan, Gregory Germino, et al. Multiple-laboratory comparison of microarray platforms. Nature methods, 2(5):345–350, 2005.
[83] Beate Vieth, Christoph Ziegenhain, Swati Parekh, Wolfgang Enard, and Ines Hellmann. powsimr: power analysis for bulk and single cell rna-seq experiments. Bioinformatics, 33(21):3486–3488, 2017.
[84] Xiuwei Zhang, Chenling Xu, and Nir Yosef. Simulating multiple faceted variability in single cell rna sequencing.
Nature communications, 10(1):1–16, 2019.
[85] Alemu Takele Assefa, Jo Vandesompele, and Olivier Thas. Spsimseq: semi-parametric simulation of bulk and single-cell rna-sequencing data. Bioinformatics, 36(10):3276–3278, 2020.
[86] Davide Chicco and Giuseppe Jurman. The advantages of the matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation. BMC genomics, 21(1):1–13, 2020.
[87] Alessandra Dal Molin, Giacomo Baruzzo, and Barbara Di Camillo. Single-cell rna-sequencing: assessment of differential expression analysis methods. Frontiers in genetics, 8:62, 2017.
[88] Jason R Phillips, Daniel L Svoboda, Arpit Tandon, Shyam Patel, Alex Sedykh, Deepak Mav, Byron Kuo, Carole L Yauk, Longlong Yang, Russell S Thomas, et al. Bmdexpress 2: enhanced transcriptomic dose-response analysis workflow. Bioinformatics, 35(10):1780–1782, 2019.
[89] Othman Soufan, Jessica Ewald, Charles Viau, Doug Crump, Markus Hecker, Niladri Basu, and Jianguo Xia. T1000: a reduced gene set prioritized for toxicogenomic studies. PeerJ, 7:e7975, 2019.
[90] David R Hunter and Kenneth Lange. A tutorial on mm algorithms. The American Statistician, 58(1):30–37, 2004.
[91] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
[92] Kenneth Lange. MM optimization algorithms. SIAM, 2016.
[93] D. R. Hunter and K. Lange. A tutorial on mm algorithms. The American Statistician, 58:30–37, 2004.
[94] S Lang and A Brezger. Bayesian P-splines. Journal of Computational and Graphical Statistics, 13:183–212, 2004.
[95] Kenneth Lange. The mm algorithm. In Optimization, pages 185–219. Springer, 2013.
[96] Florin Vaida. Parameter convergence for em and mm algorithms. Statistica Sinica, pages 831–840, 2005.
[97] Jan de Leeuw and Kenneth Lange. Sharp quadratic majorization in one dimension.
Computational statistics & data analysis, 53(7):2471–2484, 2009.
[98] Kenneth Lange, Joong-Ho Won, Alfonso Landeros, and Hua Zhou. Nonconvex optimization via mm algorithms: Convergence theory. arXiv preprint arXiv:2106.02805, 2021.
[99] Aaron T. L. Lun, Karsten Bach, and John C Marioni. Pooling across cells to normalize single-cell rna sequencing data with many zero counts. Genome biology, 17(1):1–14, 2016.
[100] Matt P Wand. A comparison of regression spline smoothing procedures. Computational Statistics, 15(4):443–462, 2000.
[101] David Ruppert, Matt P Wand, and Raymond J Carroll. Semiparametric regression. Cambridge university press, 2003.
[102] T Robertson, FT Wright, and R Dykstra. Order Restricted Statistical Inference. John Wiley & Sons, 1988.
[103] Suzanne Winsberg and James O Ramsay. Monotonic transformations to additivity using splines. Biometrika, 67(3):669–674, 1980.
[104] Larry Schumaker. Spline functions: basic theory. Cambridge University Press, 2007.
[105] Simon Wood and Maintainer Simon Wood. Package ‘mgcv’. R package version, 1(29):729, 2015.
[106] S Wotherspoon and Burch P. Package ‘zigam’. R package Github version, https://github.com/AustralianAntarcticDataCentre/zigam, 2016.
[107] Naomi Moris, Cristina Pina, and Alfonso Martinez Arias. Transition states and cell fate decisions in epigenetic landscapes. Nature Reviews Genetics, 17(11):693–703, 2016.
[108] Mark WEJ Fiers, Liesbeth Minnoye, Sara Aibar, Carmen Bravo González-Blas, Zeynep Kalender Atak, and Stein Aerts. Mapping gene regulatory networks from single-cell omics data. Briefings in functional genomics, 17(4):246–254, 2018.
[109] Assieh Saadatpour, Guoji Guo, Stuart H Orkin, and Guo-Cheng Yuan. Characterizing heterogeneity in leukemic cells using single-cell gene expression analysis. Genome biology, 15(12):1–13, 2014.
[110] Victoria Moignard, Steven Woodhouse, Laleh Haghverdi, Andrew J Lilly, Yosuke Tanaka, Adam C Wilkinson, Florian Buettner, Iain C Macaulay, Wajid Jawaid, Evangelia Diamanti, et al. Decoding the regulatory network of early blood development from single-cell gene expression measurements. Nature biotechnology, 33(3):269–276, 2015.
[111] Shuonan Chen and Jessica C Mar. Evaluating methods of inferring gene regulatory networks highlights their lack of performance for single cell gene expression data. BMC bioinformatics, 19(1):1–21, 2018.
[112] Lucrezia Patruno, Davide Maspero, Francesco Craighero, Fabrizio Angaroni, Marco Antoniotti, and Alex Graudenzi. A review of computational strategies for denoising and imputation of single-cell transcriptomic data. Briefings in Bioinformatics, 22(4):bbaa222, 2021.
[113] Kyle Akers and TM Murali. Gene regulatory network inference in single-cell biology. Current Opinion in Systems Biology, 26:87–97, 2021.
[114] Davide Risso, Fanny Perraudeau, Svetlana Gribkova, Sandrine Dudoit, and Jean-Philippe Vert. A general and flexible method for signal extraction from single-cell rna-seq data. Nature communications, 9(1):1–17, 2018.
[115] Antonio Ortega, Pascal Frossard, Jelena Kovačević, José MF Moura, and Pierre Vandergheynst. Graph signal processing: Overview, challenges, and applications. Proceedings of the IEEE, 106(5):808–828, 2018.
[116] Xiaowen Dong, Dorina Thanou, Michael Rabbat, and Pascal Frossard. Learning graphs from data: A signal representation perspective. IEEE Signal Processing Magazine, 36(3):44–63, 2019.
[117] Gonzalo Mateos, Santiago Segarra, Antonio G Marques, and Alejandro Ribeiro. Connecting the dots: Identifying network structure via graph signal processing. IEEE Signal Processing Magazine, 36(3):16–43, 2019.
[118] Xiaowen Dong, Dorina Thanou, Pascal Frossard, and Pierre Vandergheynst. Learning laplacian matrix in smooth graph signal representations.
IEEE Transactions on Signal Processing, 64(23):6160–6173, 2016.
[119] Vassilis Kalofolias. How to learn a graph from smooth signals. In Artificial Intelligence and Statistics, pages 920–929, 2016.
[120] Junhui Hou, Lap-Pui Chau, Ying He, and Huanqiang Zeng. Robust laplacian matrix learning for smooth graph signals. In 2016 IEEE International Conference on Image Processing (ICIP), pages 1878–1882. IEEE, 2016.
[121] Peter Berger, Gabor Hannak, and Gerald Matz. Efficient graph learning from noisy and incomplete data. IEEE Transactions on Signal and Information Processing over Networks, 6:105–119, 2020.
[122] Sai Kiran Kadambari and Sundeep Prabhakar Chepuri. Learning product graphs from multidomain signals. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5665–5669. IEEE, 2020.
[123] Liu Rui, Hossein Nejati, Seyed Hamid Safavi, and Ngai-Man Cheung. Simultaneous low-rank component and graph estimation for high-dimensional graph signals: Application to brain imaging. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4134–4138. IEEE, 2017.
[124] Gerald Matz and Thomas Dittrich. Learning signed graphs from data. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5570–5574. IEEE, 2020.
[125] Jérôme Kunegis, Stephan Schmidt, Andreas Lommatzsch, Jürgen Lerner, Ernesto W De Luca, and Sahin Albayrak. Spectral analysis of signed graphs for clustering, prediction and visualization. In Proceedings of the 2010 SIAM International Conference on Data Mining, pages 559–570. SIAM, 2010.
[126] David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE signal processing magazine, 30(3):83–98, 2013.
[127] Aliaksei Sandryhaila and Jose MF Moura.
Discrete signal processing on graphs: Frequency analysis. IEEE Transactions on Signal Processing, 62(12):3042–3054, 2014.
[128] Thomas Hofmann, Bernhard Schölkopf, and Alexander J Smola. Kernel methods in machine learning. The annals of statistics, pages 1171–1220, 2008.
[129] John Shawe-Taylor, Nello Cristianini, et al. Kernel methods for pattern analysis. Cambridge university press, 2004.
[130] Michael A Skinnider, Jordan W Squair, and Leonard J Foster. Evaluating measures of association for single-cell transcriptomics. Nature methods, 16(5):381–386, 2019.
[131] Thomas P Quinn, Mark F Richardson, David Lovell, and Tamsyn M Crowley. propr: an r-package for identifying proportionally abundant features using compositional data analysis. Scientific reports, 7(1):1–9, 2017.
[132] Ronald S Pimentel, Magdalena Niewiadomska-Bugaj, and Jung-Chao Wang. Association of zero-inflated continuous variables. Statistics & Probability Letters, 96:61–67, 2015.
[133] Inbal Yahav and Galit Shmueli. On generating multivariate poisson data in management science applications. Applied Stochastic Models in Business and Industry, 28(1):91–102, 2012.
[134] Thalia E Chan, Michael PH Stumpf, and Ann C Babtie. Gene regulatory network inference from single-cell data using multivariate information measures. Cell systems, 5(3):251–267, 2017.
[135] Thomas Schaffter, Daniel Marbach, and Dario Floreano. Genenetweaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics, 27(16):2263–2270, 2011.
[136] Daniel Marbach, Thomas Schaffter, Claudio Mattiussi, and Dario Floreano. Generating realistic in silico gene networks for performance assessment of reverse engineering methods. Journal of computational biology, 16(2):229–239, 2009.
[137] Damian Szklarczyk, Annika L Gable, Katerina C Nastou, David Lyon, Rebecca Kirsch, Sampo Pyysalo, Nadezhda T Doncheva, Marc Legeay, Tao Fang, Peer Bork, et al.
The string database in 2021: customizable protein–protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic acids research, 49(D1):D605–D612, 2021.
[138] ENCODE Project Consortium et al. An integrated encyclopedia of dna elements in the human genome. Nature, 489(7414):57, 2012.
[139] Zhi-Ping Liu, Canglin Wu, Hongyu Miao, and Hulin Wu. Regnetwork: an integrated database of transcriptional and post-transcriptional regulatory networks in human and mouse. Database, 2015, 2015.
[140] Luz Garcia-Alonso, Christian H Holland, Mahmoud M Ibrahim, Denes Turei, and Julio Saez-Rodriguez. Benchmark and integration of resources for the estimation of human transcription factor activities. Genome research, 29(8):1363–1375, 2019.
[141] Heonjong Han, Jae-Won Cho, Sangyoung Lee, Ayoung Yun, Hyojin Kim, Dasom Bae, Sunmo Yang, Chan Yeong Kim, Muyoung Lee, Eunbeen Kim, et al. Trrust v2: an expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic acids research, 46(D1):D380–D386, 2018.
[142] DA Brafman, C Phung, N Kumar, and K Willert. Regulation of endodermal differentiation of human embryonic stem cells through integrin-ecm interactions. Cell Death & Differentiation, 20(3):369–381, 2013.
[143] Li-Fang Chu, Ning Leng, Jue Zhang, Zhonggang Hou, Daniel Mamott, David T Vereide, Jeea Choi, Christina Kendziorski, Ron Stewart, and James A Thomson. Single-cell rna-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome biology, 17(1):1–20, 2016.
[144] Alistair J Watt, Roong Zhao, Jixuan Li, and Stephen A Duncan. Development of the mammalian liver and ventral pancreas is dependent on gata4. BMC developmental biology, 7(1):1–11, 2007.
[145] Emily M Walker, Cayla A Thompson, and Michele A Battle. Gata4 and gata6 regulate intestinal epithelial cytodifferentiation during development. Developmental biology, 392(2):283–294, 2014.
[146] JB Fisher, K Pulakanti, S Rao, and SA Duncan. Gata6 is essential for endoderm formation from human pluripotent stem cells. Biology open, 6(7):1084–1095, 2017.
[147] Qing Zhou, Hiram Chipperfield, Douglas A Melton, and Wing Hung Wong. A gene regulatory network in mouse embryonic stem cells. Proceedings of the National Academy of Sciences, 104(42):16438–16443, 2007.
[148] Wenjing Shi, Hui Wang, Guangjin Pan, Yijie Geng, Yunqian Guo, and Duanqing Pei. Regulation of the pluripotency marker rex-1 by nanog and sox2. Journal of biological chemistry, 281(33):23319–23325, 2006.
[149] Kathy K Niakan, Hongkai Ji, René Maehr, Steven A Vokes, Kit T Rodolfa, Richard I Sherwood, Mariko Yamaki, John T Dimos, Alice E Chen, Douglas A Melton, et al. Sox17 promotes differentiation in mouse embryonic stem cells by directly regulating extraembryonic gene expression and indirectly antagonizing self-renewal. Genes & development, 24(3):312–326, 2010.
[150] Alexander Lex, Nils Gehlenborg, Hendrik Strobelt, Romain Vuillemot, and Hanspeter Pfister. Upset: visualization of intersecting sets. IEEE transactions on visualization and computer graphics, 20(12):1983–1992, 2014.
[151] Dominic Grün, Lennart Kester, and Alexander Van Oudenaarden. Validation of noise models for single-cell transcriptomics. Nature methods, 11(6):637–640, 2014.
[152] Raphael Petegrosso, Zhuliu Li, and Rui Kuang. Machine learning and statistical methods for clustering single-cell rna-sequencing data. Briefings in bioinformatics, 21(4):1209–1223, 2020.
[153] Zhaoning Wang, Miao Cui, Akansha M Shah, Wei Tan, Ning Liu, Rhonda Bassel-Duby, and Eric N Olson. Cell-type-specific gene regulatory networks underlying murine neonatal heart regeneration at single-cell resolution. Cell reports, 33(10):108472, 2020.
[154] Zhigang Xue, Kevin Huang, Chaochao Cai, Lingbo Cai, Chun-yan Jiang, Yun Feng, Zhenshan Liu, Qiao Zeng, Liming Cheng, Yi E Sun, et al.
Genetic programs in human and mouse early embryos revealed by single-cell rna sequencing. Nature, 500(7464):593–597, 2013.
[155] Sara Aibar, Carmen Bravo González-Blas, Thomas Moerman, Hana Imrichova, Gert Hulselmans, Florian Rambow, Jean-Christophe Marine, Pierre Geurts, Jan Aerts, Joost van den Oord, et al. Scenic: single-cell regulatory network inference and clustering. Nature methods, 14(11):1083–1086, 2017.
[156] Tim Stuart, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M Mauck III, Yuhan Hao, Marlon Stoeckius, Peter Smibert, and Rahul Satija. Comprehensive integration of single-cell data. Cell, 177(7):1888–1902, 2019.
[157] Holger Scheel and Stefan Scholtes. Mathematical programs with complementarity constraints: Stationarity, optimality, and sensitivity. Mathematics of Operations Research, 25(1):1–22, 2000.
[158] Yu Wang, Wotao Yin, and Jinshan Zeng. Global convergence of admm in nonconvex nonsmooth optimization. Journal of Scientific Computing, 78(1):29–63, 2019.
[159] Vassilis Kalofolias and Nathanaël Perraudin. Large scale graph learning from smooth signals. arXiv preprint arXiv:1710.05654, 2017.
[160] Vladimir Yu Kiselev, Tallulah S Andrews, and Martin Hemberg. Challenges in unsupervised clustering of single-cell rna-seq data. Nature Reviews Genetics, 20(5):273–282, 2019.
[161] I-Cheng Ho, Tzong-Shyuan Tai, Sung-Yun Pai, et al. Gata3 and the t-cell lineage: essential functions before and after t-helper-2-cell differentiation. Nature reviews immunology, 9(2):125–135, 2009.
[162] Paul Riley, Lynn Anson-Cartwright, and James C Cross. The hand1 bhlh transcription factor is essential for placentation and cardiac morphogenesis. Nature genetics, 18(3):271–275, 1998.