INTERPRETABLE MACHINE LEARNING IN PLANT GENOMES: STUDIES IN MODELING AND UNDERSTANDING COMPLEX BIOLOGICAL SYSTEMS

By

Christina Brady Azodi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Plant Biology—Doctor of Philosophy

2019

ABSTRACT

INTERPRETABLE MACHINE LEARNING IN PLANT GENOMES: STUDIES IN MODELING AND UNDERSTANDING COMPLEX BIOLOGICAL SYSTEMS

By

Christina Brady Azodi

Complex systems are ubiquitous in genetics and genomics. From the regulation of gene expression to the genetic basis of complex traits, we see that complex networks of diverse cellular molecules underpin the natural world. Driven by technological advances, today's researchers have access to large amounts of omics data from diverse species. At the same time, improvements in computer processing and algorithms have produced more powerful computational tools. Taken together, these advances mean that those working at the interface of data science and biology are poised to better model and understand complex biological systems.

The research in this dissertation demonstrates how a data-driven approach can be used to better understand three complex systems: (1) the transcriptional response to single and combined heat and drought stress in Arabidopsis thaliana, (2) the genetic basis of flowering time, a complex trait, in Zea mays, and (3) the social basis for opinions and beliefs about biotechnology products. To study the first system, we generated models of the cis-regulatory code from information about DNA sequence and additional omics levels using both classic machine learning and deep learning algorithms. We identified 1,061 putative cis-regulatory elements associated with different patterns of response to single and combined heat and drought stress and found that information about additional levels of regulation, especially chromatin accessibility and known transcription factor binding, improved our models of the cis-regulatory code. To study the second system, we generated phenotype prediction models for flowering time, height, and yield based on either genetic markers or transcript levels at the seedling stage. We found that, while genetic marker-based models performed better than transcript level-based models, models that integrated both types of data performed best. Furthermore, transcript-based models were more useful for finding genes known to be associated with flowering time, highlighting how using additional levels of omics data can improve our ability to understand the genetic basis of complex traits. Finally, to study the third system, we integrated 29 characteristics about a person (e.g. age, political ideology, education, values, environmental beliefs) into a machine learning model that would predict an individual's beliefs and opinions about five different types of biotechnology products (e.g. biofortification, biopharmaceuticals). While this approach was particularly useful for identifying individuals who were broadly supportive of biotechnology, finding characteristics of individuals with negative or conditional (i.e. supporting product A, but not B) opinions was more challenging, highlighting the complexity of public opinions about biotechnology.

ACKNOWLEDGEMENTS

I would like to express my deepest appreciation to all those who have supported, guided, and challenged me throughout my graduate training.
My advisor and mentor, Dr. Shin-Han Shiu, has graciously provided me with support when I needed it and space when I wanted it—allowing me to grow into an independent scientist with confidence and (usually) poise. His intentionality in cultivating a diverse and supportive lab has also benefited me and everyone lucky enough to call themselves a Shiu Lab member or alumnus.

I have had the opportunity to work with many excellent collaborators during my training. Dr. Gustavo de los Campos was a wealth of knowledge about all things statistics and has always encouraged me to have bold aspirations. Dr. Andrew McCarren and Dr. Mark Roantree generously hosted me for five months in their lab at Dublin City University and helped me become a stronger data scientist and interdisciplinary researcher. Dr. Thomas Dietz supported my efforts to turn my Environmental Science Policy Program capstone project into a peer-reviewed publication and shared priceless nuggets of academic wisdom every step of the way. Dr. Yuying Xie, Dr. Jiliang Tang, and Dr. Yuning Hao introduced me to state-of-the-art modeling approaches that have the potential to help answer fundamental questions in genetics. My interactions with them have reminded me to always continue learning and innovating. Finally, my committee members, Dr. Yuying Xie, Dr. David Lowry, and Dr. David Kramer, provided me with excellent guidance on my dissertation.

Many of my most joyful moments in graduate school were celebrating victories (both small and large) with my peers and collaborators—research is a team sport. I would like to thank all of my teammates who made every day more enjoyable: Bethany Moore, Siobhan Cusack, Dr. Peipei Wang, Dr. Sahra Uygun, Dr. Nick Panchy, Dr. Johnny Lloyd, Dr. Ming-Jung Liu, Jeremy Pardo, Birte Schwarz, Dr. Ben Lucker, and Dr. Peter Neofotis.

This dissertation would not have been possible without support from Michigan State University's Institute for Cyber-Enabled Research. Having access to a free, well-supported, high-performance computer cluster has been absolutely critical for my dissertation. I would also like to thank the Department of Plant Biology and the PLB Graduate Student Organization for providing me with an academic and social home.

I would like to thank my parents for their everlasting support and for always encouraging me to do what makes me happy. I would like to thank my sister and my friends for consoling me when science is hard and celebrating with me when science is going well. And finally, I would like to thank my partner Dom Del Ponte, whose levelheadedness and unwavering support have helped me grow from failures and aim even higher. I am unbelievably grateful that you have been with me for every step of my PhD and I cannot wait to begin our next adventure together! See you down under!

TABLE OF CONTENTS

LIST OF FIGURES .......... ix
KEY TO ABBREVIATIONS .......... xii
CHAPTER ONE: INTRODUCTION .......... 1
1.1 Introduction .......... 2
REFERENCES .......... 6
CHAPTER TWO: A LIGHT IN THE BLACK BOX: INTERPRETABLE MACHINE LEARNING FOR GENETICISTS .......... 9
2.1 Abstract .......... 10
2.2 Why is interpretable machine learning needed? .......... 10
2.3 Overview of strategies for interpretable machine learning .......... 13
2.4 Probing strategies for interpreting machine learning models .......... 15
2.4.1 Probing Support Vector Machine models .......... 15
2.4.2 Probing decision tree-based models .......... 17
2.4.3 Probing deep learning networks .......... 19
2.5 Perturbing strategies for interpreting machine learning models .......... 23
2.5.1 Sensitivity Analysis .......... 23
2.5.2 What-if Analysis .......... 25
2.6 Surrogate strategies for interpreting machine learning models .......... 28
2.7 Challenges and Opportunities .......... 29
2.8 Concluding Remarks and Future Perspective .......... 31
2.9 Outstanding Questions .......... 31
2.10 Glossary .......... 32
REFERENCES .......... 34
CHAPTER THREE: MODELING THE CIS-REGULATORY CODE OF PLANT SINGLE AND COMBINED STRESS TRANSCRIPTIONAL RESPONSE WITH MACHINE LEARNING .......... 39
3.1 Abstract .......... 40
3.2 Introduction .......... 41
3.3 Results and Discussion .......... 43
3.3.1 More than 50% of genes have synergistic or antagonistic responses to combined heat and drought stress .......... 43
3.3.2 Combinatorial stress response patterns can be predicted using known and putative regulatory elements .......... 48
3.3.3 Additional multi-omics regulatory information can improve cis-regulatory code models .......... 53
3.3.4 Interpreting deep learning models provides insight into the cis-regulatory code .......... 54
3.3.5 pCREs identified outside the promoter region are predictive of response patterns .......... 58
3.3.6 The cis-regulatory code of response to single and combined heat and drought stress .......... 62
3.4 Conclusions .......... 66
3.5 Methods .......... 68
3.5.1 Expression data processing, response group classification, and functional category enrichment analysis .......... 68
3.5.2 Identification of known binding sites from in vitro TF binding data .......... 70
3.5.3 Computational identification of novel pCREs and comparison with known TFBMs .......... 71
3.5.4 Sequence conservation, chromatin accessibility, and histone mark data processing and analysis .......... 72
3.5.5 Classic machine learning-based models of the cis-regulatory code .......... 73
3.5.6 Convolutional neural network-based models of the cis-regulatory code .......... 75
3.5.7 Data Availability .......... 78
APPENDIX .......... 79
REFERENCES .......... 90
CHAPTER FOUR: BENCHMARKING PARAMETRIC AND MACHINE LEARNING MODELS FOR GENOMIC PREDICTION OF COMPLEX TRAITS1 .......... 97
4.1 Abstract .......... 98
4.2 Introduction .......... 98
4.3 Materials and Methods .......... 103
4.3.1 Genotype and phenotype data .......... 103
4.3.2 Genomic selection algorithms .......... 104
4.3.3 Hyperparameter grid search using cross-validation .......... 108
4.3.4 Assessing Predictive Performance .......... 109
4.3.5 Feature Selection .......... 109
4.3.6 Initializing ANN starting weights seeded from other GP algorithms .......... 110
4.3.7 Data and Code Availability .......... 111
4.4 Results .......... 112
4.4.1 Hyperparameter grid search is critical, particularly among non-linear algorithms .......... 112
4.4.2 ANN is the most significantly impacted by hyperparameter choice .......... 113
4.4.3 Feature selection improves performance of ANN models .......... 116
4.4.4 Non-random initialization of ANN starting weights and convolutional layers improve ANN performance for some species .......... 120
4.4.5 No one GP algorithm performs best for all species and traits .......... 122
4.5 Discussion .......... 127
4.6 Acknowledgements .......... 131
APPENDIX .......... 132
REFERENCES .......... 141
CHAPTER FIVE: TRANSCRIPTOME-BASED PREDICTION OF COMPLEX TRAITS IN MAIZE 1 .......... 148
5.1 Abstract .......... 149
5.2 Introduction .......... 149
5.3 Results and Discussion .......... 152
5.3.1 Relationships between transcript levels, kinship, and phenotypes among maize lines .......... 152
5.3.2 Predicting complex traits from transcript or genetic marker data .......... 153
5.3.3 Predicting complex traits using both transcript and genetic marker data .......... 157
5.3.4 Comparison of the importance of transcripts versus genetic markers for model predictions .......... 159
5.3.5 Assessment of benchmark flowering time genes .......... 164
5.3.6 Improving our understanding of the genetic basis of flowering time using transcriptome data .......... 166
5.4 Conclusions .......... 170
5.5 Methods .......... 173
5.5.1 Genotypic, transcriptomic, and phenotypic data processing .......... 173
5.5.2 Comparison of transcript and genetic marker data .......... 174
5.5.3 Genomic prediction models and model performance .......... 175
5.5.4 Selecting subsets of T or G for input to genomic prediction models .......... 176
5.5.5 Genetic marker/transcript importance analysis .......... 177
5.5.6 Benchmark flowering time genes .......... 179
5.5.7 Data Availability .......... 179
5.6 Acknowledgements .......... 180
APPENDIX .......... 181
REFERENCES .......... 195
CHAPTER SIX: PERCEPTIONS OF EMERGING BIOTECHNOLOGIES 1 .......... 201
6.1 Abstract .......... 202

LIST OF FIGURES

Figure 1.1. Overview of the content of this dissertation. .......... 4
Figure 2.1. Machine Learning Crash Course. .......... 11
Figure 2.2. Why interpretable machine learning? .......... 12
Figure 2.3. Overview of strategies in interpretable machine learning. .......... 14
Figure 2.4. Detailed overview of the probing strategies. .......... 16
Figure 2.5. Deep Learning Crash Course. .......... 20
Figure 2.6. Detailed overview of the perturbing strategies. .......... 24
Figure 3.1. A framework for generating cis-regulatory code models. .......... 44
Figure 3.2. Gene expression response groups for single and combined heat and drought stress. .......... 46
Figure 3.3. Cis-regulatory code models based on known TFBMs and pCREs. .......... 50
Figure 3.4. Cis-regulatory code models based on pCREs and additional multi-omics regulatory information. .......... 56
Figure 3.5. Cis-regulatory code models based on pCREs identified in putative promoter and non-promoter regions. .......... 60
Figure 3.6. Overview of the most important pCREs for our cis-regulatory code models. .......... 63
Supplemental Figure 3.1. Overlap in true positive gene predictions from models using known TFBMs, pCREs, or both features as input. .......... 83
Supplemental Figure 3.2. Impact of including association rules between pCREs as model features. .......... 84
Supplemental Figure 3.3. The association of additional regulatory information with pCREs compared to random 6-mers and known TFBMs. .......... 86
Supplemental Figure 3.4. Probing the trained kernels to understand the important patterns of additional regulatory information identified by CNN models. .......... 88
Supplemental Figure 3.5. Overlap in true positive gene predictions from models using pCREs from different genetic regions. .......... 89
Figure 4.1. Algorithms used and compared in past GP studies and algorithms and data included in the GP benchmark. .......... 102
Figure 4.2. Grid search results for height in maize and overall GP algorithm performance for predicting height across species. .......... 114
Figure 4.3. Impact of feature selection on GP algorithm performance. .......... 117
Figure 4.4. Description and performance results of the seeded ANN approach. .......... 121
Figure 4.5. Comparison of algorithms for predicting additional traits. .......... 124
Supplemental Figure 4.1. Height prediction performance for non-linear GP algorithms during hyperparameter grid search. .......... 133
Supplemental Figure 4.2. Comparison of feature selection algorithms and change in performance variation after feature selection. .......... 135
Supplemental Figure 4.3. Hyperparameter random search results from predicting height in spruce. .......... 137
Supplemental Figure 4.4. Number of wins between each pair of GP algorithms. .......... 139
Supplemental Figure 4.5. Similarity between traits and datasets in model performance. .......... 140
Figure 5.1. Relationship between lines from transcript and genetic marker data. .......... 154
Figure 5.2. Genomic prediction model performance. .......... 156
Figure 5.3. Correlation between genetic marker and transcript importance for flowering time. .......... 161
Figure 5.4. Comparison of transcript and genetic marker importance scores for benchmark flowering time genes. .......... 165
Figure 5.5. Relationship between transcript level/allele type and flowering time for benchmark genes. .......... 167
Supplemental Figure 5.1. Distribution of genetic marker and transcript data across maize chromosomes. .......... 182
Supplemental Figure 5.2. Feature importance analysis for G+T models. .......... 183
Supplemental Figure 5.3. Impact of transcript region sizes on importance correlation between transcript:genetic marker pairs. .......... 185
Supplemental Figure 5.4. Manhattan plot of importance scores from Genomic Prediction models. .......... 186
Supplemental Figure 5.5. Correlation between genetic marker/eQTL and transcript importance. .......... 188
Supplemental Figure 5.6. Genomic prediction and genetic marker:transcript pairs using genome-wide genetic markers (GGW). .......... 190
Supplemental Figure 5.7. Correlation between feature importance between algorithms. .......... 191
Supplemental Figure 5.8. Relationship between transcript levels and alleles and flowering time for benchmark genes. .......... 192

KEY TO ABBREVIATIONS

ANN  Artificial neural network
ANOVA  Analysis of variance
BA  Bayes-A
BB  Bayes-B
BL  Bayesian LASSO
BP  Base pairs
BRR  Bayesian ridge regression
BT  Biotechnology
CNN  Convolutional neural network
CNS  Conserved noncoding sequence
DAP  DNA affinity purification
DBH  Diameter at breast height
DE  Wood density
DHS  DNaseI hypersensitive site
DM  Diallel mating
DNA  Deoxyribonucleic acid
DT  Developmental timing
ELU  Exponential linear unit
EN  Ensemble
FC  Fold change
FDR  False discovery rate
FNR  False negative rate
FPR  False positive rate
FT  Flowering time
GBS  Genotype by sequencing
GDD  Growing degree days
GEO  Gene expression omnibus
GM  Grain moisture
GMO  Genetically modified organism
GO  Gene ontology
GP  Genomic prediction
GTB  Gradient tree boosting
HSD  Honestly significant difference
HT  Height
LASSO  Least absolute shrinkage and selection operator
LIM  Local interpretable model-agnostic explanations
LOFO  Leave-one-feature-out
MAS  Marker assisted selection
ML  Machine learning
MSE  Mean squared error
NAM  Nested association mapping
NEP  New ecological paradigm
PCC  Pearson's correlation coefficient
pCRE  Putative cis-regulatory element
PWM  Position weight matrix
QTL  Quantitative trait loci
RBF  Radial basis function
RF  Random Forest
RKHS  Reproducing kernel Hilbert space
RNA  Ribonucleic acid
rrBLUP  Ridge regression best linear unbiased predictor
SELU  Scaled exponential linear unit
SNP  Single nucleotide polymorphism
ST  Standability
STEM  Science, technology, engineering, math
SVM  Support vector machine
SVR  Support vector regression
TAIR  The Arabidopsis Information Resource
TAMO  Tools for Analysis of Motifs
TF  Transcription factor
TFBM  Transcription factor binding motif
TPR  True positive rate
TSS  Transcription start site
UTR  Untranslated region
WTS  Willingness to sacrifice
YLD  Yield

CHAPTER ONE: INTRODUCTION

1.1 Introduction

In 1990, James D. Watson wrote, "when finally interpreted, the genetic messages encoded within our DNA molecules will provide the ultimate answers to the chemical underpinnings of human existence" (Watson 1990). And in fact, sequencing and assembling the first human genome spurred many novel discoveries in genetics, genomics, and human disease (Collins et al. 2003; Hood and Rowen 2013). However, genome sequences fell short of providing what Watson referred to as the "ultimate answers", as scientists began to appreciate the extent to which complicating factors, such as gene-by-environment interactions (i.e. GxE) and gene-by-gene interactions (i.e. epistasis), impact trait variation (Ku et al. 2010). Today, we have access to tens of thousands of genome sequences from species ranging from humans to fungi.
With advances in transcriptomics, epigenomics, proteomics, and metabolomics (i.e. omics), we also have access to increasing amounts of information about cellular molecules besides DNA. As scientists work to decode all of this information, our picture of "the chemical underpinnings of human existence" continues to become more complex (Huang et al. 2017; Pinu et al. 2019).

Fortunately, complexity is not an adverse quality in biological systems. Rather, complexity is tightly associated with robustness (i.e. the ability of a biological system to persist under perturbations) and evolvability (Carlson and Doyle 2002; Kitano 2004; Whitacre 2010). These qualities are especially vital in plant biological systems because, as sessile organisms, plants have to constantly adjust and evolve in response to their changing environment (Anderson et al. 2011). While beneficial for biological systems, greater complexity does mean that modeling and decoding important biological systems is a challenging task.

To meet the challenges of modeling complex biological systems with large amounts of heterogeneous (e.g. multi-omics) data, biologists are turning more to machine learning. Machine learning has been described as a tool that allows computers to learn patterns from data without being explicitly programmed (Samuel 1959). From this description we can define a few terms and concepts important for machine learning. First, the collection of all of the learned patterns from a particular dataset is called a machine learning model. Next, the input data from which the model learns is made up of a number of examples (i.e. instances) for which we have information about different characteristics (i.e. features) and, in the case of supervised machine learning, the values that we want to predict (i.e. the label). Finally, instead of having to define the nature of the relationship between a set of features and the label (e.g. linear, exponential, A interacts with B), the ability to learn without being explicitly programmed means that machine learning models learn these relationships from the examples provided. In addition to reducing the influence of human bias on a model, this also means that machine learning models are able to represent more complex systems.
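To make these terms concrete, a toy supervised learning dataset could be arranged as in the minimal Python sketch below (the gene names and values are hypothetical stand-ins):

    # Each row is an instance (a gene), each column is a feature (presence or
    # absence of a regulatory sequence), and the final column is the label
    # (whether the gene is up-regulated under stress).
    import pandas as pd

    data = pd.DataFrame(
        {"regulatory_seq_X": [1, 0, 1, 1],
         "regulatory_seq_Y": [0, 0, 1, 0],
         "up_regulated":     [1, 0, 1, 1]},
        index=["gene_A", "gene_B", "gene_C", "gene_D"])

    X = data.drop(columns="up_regulated")  # feature matrix (instances x features)
    y = data["up_regulated"]               # labels to be predicted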
While the capacity of machine learning models to outperform classical statistical models has led to their increased use for modeling complex biological systems (Tarca et al. 2007; Ma et al. 2014; Libbrecht and Noble 2015; Angermueller et al. 2016; Chicco 2017; Cuperlovic-Culf 2018), decoding these models can be more difficult (Lipton 2018; Guidotti et al. 2018). To improve our ability to interpret machine learning models and thus gain novel biological insights into complex systems that are modeled well by these algorithms, Chapter 2 is a review of interpretable machine learning. The review highlights different types of strategies that can be used to better understand the patterns that a machine learning model has learned. While the strategies reviewed can be used to interpret machine learning models built to learn any system, the review focuses on how these strategies can be used to better understand genetics and genomics.

The chapters following this review present my work on using machine learning to predict and understand complex systems (Figure 1.1). Chapter 3 focuses on elucidating how the model plant, Arabidopsis thaliana, regulates its response to single and combined heat and drought stress at the genetic level. Here, I use machine learning to integrate DNA sequence information (i.e. known and putative regulatory elements) and additional regulatory information (e.g. chromatin accessibility and histone marks) into models that are predictive of different patterns of response to single and combined heat and drought stress. These models were then interpreted in order to identify the regulatory elements and additional regulatory information driving these predictions.

Figure 1.1. Overview of the content of this dissertation.

Chapters 4 and 5 focus on using genetic information to predict mature plant phenotypes (e.g. yield and flowering time), an approach known as genomic prediction. Chapter 4 provides a thorough comparison of the performance of different algorithms for the genomic prediction of 18 different traits across six diverse plant species. This study focuses on how non-linear algorithms, including classic machine learning and deep learning based algorithms, compare to the linear-regression based algorithms that were first used for genomic prediction (Meuwissen et al. 2001; de los Campos et al. 2013). Then, Chapter 5 describes my work to use both genetic markers and gene expression levels from seedlings to predict mature phenotypes in Zea mays. These genomic prediction models were compared to determine which type of data was the most useful for predicting traits. In addition, because a great deal is already known about the genetic basis of flowering time in Z. mays, we used a set of known flowering time genes as a benchmark to determine if genetic marker-based or gene expression-based genomic prediction models were better for helping us understand the genetic basis of flowering time.

While up to this point the focus has been on modeling and improving our understanding of complex systems in plant genetics and genomics, complex systems are ubiquitous in other fields (De Laurentiis et al. 2016; Kapsar et al. 2019). Toward demonstrating this, Chapter 6 describes my work using interpretable machine learning to better understand the social basis for public opinion about biotechnology. Here I use factors including age, gender, religion, politics, personal values, and environmental beliefs to predict an individual's beliefs and opinions about five different types of biotechnology products. This chapter is the product of my capstone research for the Environmental Science Policy specialization.

REFERENCES

Anderson, J. T., J. H. Willis, and T. Mitchell-Olds, 2011 Evolutionary genetics of plant adaptation. Trends in Genetics 27: 258–266.

Angermueller, C., T. Pärnamaa, L. Parts, and O. Stegle, 2016 Deep learning for computational biology. Molecular Systems Biology 12: 878.

de los Campos, G., J. M. Hickey, R. Pong-Wong, H. D. Daetwyler, and M. P. L. Calus, 2013 Whole-Genome Regression and Prediction Methods Applied to Plant and Animal Breeding. Genetics 193: 327–345.

Carlson, J. M., and J. Doyle, 2002 Complexity and robustness. PNAS 99: 2538–2545.

Chicco, D., 2017 Ten quick tips for machine learning in computational biology. BioData Mining 10: 35.

Collins, F. S., M. Morgan, and A. Patrinos, 2003 The Human Genome Project: Lessons from Large-Scale Biology. Science 300: 286–290.

Cuperlovic-Culf, M., 2018 Machine Learning Methods for Analysis of Metabolic Data and Metabolic Pathway Modeling. Metabolites 8: 4.

De Laurentiis, V., D. V. L. Hunt, and C. D. F. Rogers, 2016 Overcoming Food Security Challenges within an Energy/Water/Food Nexus (EWFN) Approach. Sustainability 8: 95.

Guidotti, R., A. Monreale, S. Ruggieri, F. Turini, F. Giannotti et al., 2018 A Survey of Methods for Explaining Black Box Models. ACM Computing Surveys 51: 1–42.
Hood, L., and L. Rowen, 2013 The Human Genome Project: big science transforms biology and medicine. Genome Medicine 5: 79.

Huang, S., K. Chaudhary, and L. X. Garmire, 2017 More Is Better: Recent Progress in Multi-Omics Data Integration Methods. Frontiers in Genetics 8.

Kapsar, K. E., C. L. Hovis, R. F. Bicudo da Silva, E. K. Buchholtz, A. K. Carlson et al., 2019 Telecoupling Research: The First Five Years. Sustainability 11: 1033.

Kitano, H., 2004 Biological robustness. Nature Reviews Genetics 5: 826–837.

Ku, C. S., E. Y. Loy, A. Salim, Y. Pawitan, and K. S. Chia, 2010 The discovery of human genetic variations and their use as disease markers: past, present and future. Journal of Human Genetics 55: 403–415.

Libbrecht, M. W., and W. S. Noble, 2015 Machine learning applications in genetics and genomics. Nature Reviews Genetics 16: 321–332.

Lipton, Z. C., 2018 The Mythos of Model Interpretability. ACM Queue 16.

Ma, C., H. H. Zhang, and X. Wang, 2014 Machine learning for Big Data analytics in plants. Trends in Plant Science 1–11.

Meuwissen, T. H. E., B. J. Hayes, and M. E. Goddard, 2001 Prediction of Total Genetic Value Using Genome-Wide Dense Marker Maps. Genetics 157: 1819–1829.

Pinu, F. R., D. J. Beale, A. M. Paten, K. Kouremenos, S. Swarup et al., 2019 Systems Biology and Multi-Omics Integration: Viewpoints from the Metabolomics Research Community. Metabolites 9.

Samuel, A. L., 1959 Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development 3: 210–229.

Tarca, A. L., V. J. Carey, X. Chen, R. Romero, and S. Drăghici, 2007 Machine Learning and Its Applications to Biology. PLOS Computational Biology 3: e116.

Watson, J. D., 1990 The human genome project: past, present, and future. Science 248: 44–49.

Whitacre, J. M., 2010 Degeneracy: a link between evolvability, robustness and complexity in biological systems. Theoretical Biology and Medical Modelling 7: 6.

CHAPTER TWO: A LIGHT IN THE BLACK BOX: INTERPRETABLE MACHINE LEARNING FOR GENETICISTS

2.1 Abstract

As we move further into the Era of Big Data, geneticists are turning more and more to machine learning (ML) to make sense of the deluge of omics data now available. Machine learning is a subfield of artificial intelligence that focuses on generating models that learn from data without being explicitly programmed. ML models are well suited to address challenges unique to genetic and genomic data, including high dimensionality (e.g. predicting trait values from millions of genetic markers), complex systems (e.g. mapping gene regulatory networks), and high order interactions (e.g. identifying epistatic interactions). While the complexity of ML models is what makes them so powerful, it also makes them difficult to interpret. Fortunately, researchers have developed strategies to make the inner workings of machine learning models understandable to humans, and in doing so have made it possible to derive novel biological insights from ML models. In this review, we discuss what types of strategies for interpreting ML models are available, how they work, and how they can be used in a biological context. Finally, we describe challenges and promising future directions in interpretable ML for biology.

2.2 Why is interpretable machine learning needed?

Thanks to advances in technology and the reduced cost of generating data, biologists are now living in the Era of Big Data (Marx 2013; Stephens et al. 2015).
In this Era, big data will drive progress in biological fields ranging from population genetics (Schrider and Kern 2018) to precision medicine (Alyass et al. 2015). But big data also presents researchers with new challenges, such as how to derive biological understanding from large amounts of heterogeneous data (e.g. multi-omics data) and how to model highly complex systems (e.g. gene regulation and protein folding). In order to address the challenge of harnessing big data to answer biological questions, bioinformaticians and computational biologists are now turning to machine learning (ML; Figure 2.1) (Tarca et al. 2007; Ma et al. 2014; Libbrecht and Noble 2015; Angermueller et al. 2016; Chicco 2017; Cuperlovic-Culf 2018). Arthur Samuel, a pioneer of machine learning, described it as a "field of study that gives computers the ability to learn without being explicitly programmed" (Samuel 1959).

Figure 2.1. Machine Learning Crash Course.

One common criticism faced by those employing ML in biology and elsewhere is that ML models are "Black Boxes". While this term lacks a precise definition (Lipton 2018), broadly it means that only model inputs and outputs, but not the internal logic, can be understood by a human. However, not all ML models are equally "Black Boxes". For example, the internal logic of ML models based on decision trees is inherently interpretable. On the other hand, the internal logic of a deep learning model, which could be made up of hundreds or hundreds of thousands of connections and hidden variables, is much more of a black box.

There are three major reasons why ML model interpretation is important: troubleshooting, novel insights, and trust (Figure 2.2). First, models rarely perform best without tweaking or troubleshooting, and understanding why mispredictions are made is essential for determining if there were mistakes or biases in the input data or if there were issues with how the model trained. Second, an ML model with impressive performance may have identified biologically interesting patterns in the data, but scientists cannot learn such biological insights without interpreting the model. Finally, we are not keen to trust things if we do not understand how and why they work, and without trust ML models will not be used to their full potential. For example, ML models have the potential to improve our ability to diagnose and treat diseases. However, doctors will be reluctant to trust models that have not been or cannot be readily interpreted because they won't understand their medical basis and may worry the models are capturing artifacts (Miller 2017). Furthermore, many patients would be reluctant to undergo treatment without knowing why that treatment was selected for them. Therefore, model interpretability, or the ability to understand what logic is driving a model's performance, is critical for those using ML for biology.

Figure 2.2. Why interpretable machine learning? Interpretability of machine learning models is important for facilitating troubleshooting, making biological insights, and trust.
2.3 Overview of strategies for interpretable machine learning

Interpretable ML is an emerging focus among data scientists. Here we will review three strategies for interpretability—probing, surrogate, and perturbing (Figure 2.3)—that have been used to interpret ML models in biology. Probing strategies involve dissecting the inner structure of a trained model. Surrogate strategies involve training classically interpretable models that estimate an ML model's predictions. Perturbing strategies involve measuring the change in model performance before and after disturbing features or instances (Guidotti et al. 2018; Molnar 2019). Interpretation strategies can also be defined based on whether they are applicable to all algorithms (i.e. model-agnostic) or only to one or a subset of algorithms (i.e. model-specific) (see Figure 2.1 for example algorithms). Finally, interpretation strategies can be either global or local. Global interpretation involves explaining the overall relationship between features and labels, while local interpretation strategies focus on explaining the prediction of an individual instance.

For example, imagine you train an ML model to predict if a gene (i.e. the instance) is up-regulated when the organism is exposed to an environmental toxin (i.e. the label) based on the presence or absence of a set of known regulatory sequences (i.e. the features). A global interpretation strategy would tell you how important regulatory sequence X was for predicting up-regulation across all of the genes in your dataset. In contrast, a local interpretation strategy would tell you how important regulatory sequence X was for predicting gene Y as up-regulated. This example will be used repeatedly throughout the review to explain various concepts in ML and interpretable ML.

Figure 2.3. Overview of strategies in interpretable machine learning. ML models can be interpreted by probing trained models (gray box), training interpretable surrogate models (white box), or perturbing the input data (purple data stack) and measuring the change in performance. Probing strategies are characterized by which algorithms (shown here: deep learning model) they are used for. Surrogate models can be trained to represent global or local predictions. Perturbing strategies, which are algorithm agnostic, are characterized based on if a feature (i.e. Sensitivity analysis) or an instance (i.e. What-If analysis) is perturbed.

We should emphasize that, because ML models identify associations through correlation, ML interpretation strategies do not identify causal relationships between input features and labels. For example, if putative regulatory sequence X is present in the promoter region of all up-regulated but no non-up-regulated genes, X is considered highly correlated with up-regulation, but we cannot claim X is responsible for that response.

2.4 Probing strategies for interpreting machine learning models

Training a machine learning model involves identifying the best set of parameters to predict the label of interest (e.g. is the gene up-regulated). After training is complete, those parameters can be probed to better understand what the model learned. Probing strategies tend to provide global interpretations of trained models. However, some strategies for probing deep learning models provide local interpretations (e.g. DeepLIFT, see description below). Because the structure and type of parameters learned by ML models vary by algorithm, probing strategies are algorithm-specific. In the following sections, we will discuss how different types of ML algorithms can be probed and provide examples of how these strategies have been used to answer fundamental questions in biology.

2.4.1 Probing Support Vector Machine models

Support Vector Machine (SVM) is an algorithm that finds the hyperplane that best separates instances by their label (for classification tasks) or best approximates the label values (for predicting continuous labels). Using predicting gene up-regulation as an example, the hyperplane would lie in a multi-dimensional space defined by the presence or absence of the regulatory sequences (i.e. features) and would separate genes (i.e. instances) that are up-regulated from those that are not (i.e. the label) (Figure 2.4A). These models identify linear relationships between features and labels, but they can be modified to identify non-linear relationships by using a non-linear function (e.g. polynomial) to map the data into a non-linear feature space. While there are advanced methods for probing non-linear SVM models (Barakat and Bradley 2010; Rasmussen et al. 2011), in biological applications of SVM, only linear SVM models are typically interpreted with probing strategies.

Figure 2.4. Detailed overview of the probing strategies. Probing strategies used to understand the predictions made by (A) support vector machine, (B) decision tree based, and (C) deep learning models. Input features (e.g. genetic markers, environmental variables, image pixels) are shown in purple, the labels (e.g. state of differential expression, diagnosis) are shown in green/yellow, and the interpretation strategies (e.g. coefficient weight, mean decrease impurity, gradient) are shown in orange.

A linear SVM model is probed by extracting the trained weights (i.e. coefficients) that define the hyperplane (Figure 2.4A). These weights directly represent the relationship between the feature and the label, making their interpretation relatively straightforward. For example, Ronen et al. trained a linear SVM model to classify simulated populations as being under positive or negative selection, using genetic markers as features (Ronen et al. 2013). Genetic markers with large, positive weights in their SVM model (indicating strong positive selection) were also found to be associated with positive selection by traditional population genetics statistical tests (e.g. Tajima's D and Fay and Wu's H). However, SVM probing strategies, like many other interpretation strategies covered in this review, need to be interpreted with caution because they can provide an incomplete picture of what features are important for the model. For example, two highly correlated features will split the coefficient weight between them, effectively reducing the importance of each feature by half. Or a feature with a highly non-linear effect may not be assigned a high coefficient weight by a linear SVM model and will therefore be missed in the interpretation (see Challenges and Opportunities).
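To illustrate what this probing looks like in practice, the following minimal Python sketch (using scikit-learn on randomly generated stand-in data) extracts the hyperplane weights from a linear SVM:

    # Probing a linear SVM: one trained coefficient per feature, where the sign
    # gives the direction of the association and the magnitude its strength.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(100, 5))  # presence/absence of 5 regulatory sequences
    y = rng.integers(0, 2, size=100)       # up-regulated (1) or not (0)

    model = SVC(kernel="linear").fit(X, y)

    for i, weight in enumerate(model.coef_[0]):
        print(f"regulatory_sequence_{i}: weight = {weight:.3f}")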
2.4.2 Probing decision tree-based models

A decision tree is a set of true/false questions nested in a hierarchical structure. Decision trees are inherently interpretable because the content and order of each true/false question can be directly observed from the tree, and the path of each instance through the tree can be traced. For example, for predicting gene up-regulation using regulatory sequences as features, the first question in a trained decision tree could be, "is regulatory sequence X present?", and if the answer is yes, the second question could be, "is regulatory sequence Y present?" From this we can infer that, based on the trained model, regulatory sequence X best separates up-regulated from non-up-regulated genes, and that if regulatory sequence X is present, the presence or absence of regulatory sequence Y is the next most useful information. How well a given true/false question separates up-regulated from non-up-regulated genes can be quantified, for example, by measuring the change in node impurity (i.e. how many instances of a different class are present in a node) before and after a particular feature is used to separate the genes (Figure 2.4B). A regulatory sequence that is present in up-regulated but not in non-up-regulated gene promoters would split the genes perfectly, resulting in a large decrease in node impurity, and would therefore be interpreted as very important.

Single decision trees often do not perform well at predicting complex biological patterns. Instead, tree-based algorithms that take advantage of ensemble methods tend to perform better in biological applications (Rokach 2016). In ensemble methods, many decision trees are trained on small subsets of the data to produce many "weak" models that, when combined, provide one strong model (e.g. Random Forest, Gradient Tree Boosting, Extra-Trees (Breiman 2001)). However, the "forest" of decision trees is more complicated to interpret. One way to probe ensemble decision tree-based models is to average the decrease in node impurity every time a feature is used in a tree in the ensemble, providing a mean decrease in impurity (a.k.a. Gini importance) score for each feature. For example, Uygun et al. used the mean decrease in impurity to determine which DNA motifs (i.e. features) best classified genes as differentially expressed or not differentially expressed after a model plant, Arabidopsis thaliana, was exposed to high salinity stress (Uygun et al. 2017).
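As a minimal sketch (again with scikit-learn and random stand-in data), mean decrease in impurity scores are exposed for an ensemble of trees as follows:

    # Probing a Random Forest: feature_importances_ is each feature's decrease in
    # node impurity, averaged over every tree in the ensemble (Gini importance).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(100, 5))  # presence/absence of 5 DNA motifs
    y = rng.integers(0, 2, size=100)       # differentially expressed or not

    forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

    ranked = sorted(enumerate(forest.feature_importances_),
                    key=lambda pair: pair[1], reverse=True)
    for index, score in ranked:
        print(f"motif_{index}: mean decrease in impurity = {score:.3f}")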
In addition to facilitating intuitive model interpretation, the hierarchical structure of decision tree-based models allows them to inherently model interactions between features—and those interactions can be probed. Given the complexity of many biological systems (e.g. neuronal networks, gene regulatory networks, protein-protein interactions), interpretation strategies that can pinpoint important interactions are useful. Using iterative Random Forest, a tool for finding stable feature interactions in RF models (Basu et al. 2018), Vervier and Michaelson identified interactions between genomic, transcriptomic, and epigenomic features that were predictive of deleterious genetic variants (Vervier and Michaelson 2018). They found that the local GC content and the distance to the nearest expression Quantitative Trait Locus of a genetic variant consistently interacted, and that this interaction was important for driving model performance.

As with SVM-based importance scores, feature importance scores and interactions from decision tree-based models should be interpreted with caution. For example, mean decrease in impurity tends to be inflated for continuous over categorical features, for categorical features with a large number of categories over those with few categories, and for continuous features on a large scale (e.g. range = 0-100) over those on a small scale (e.g. range = 0-1) (Strobl et al. 2007). Therefore, this strategy should only be used when the input feature types are relatively uniform. In addition, while tree-based models are readily interpretable, deep learning algorithms outperform even ensemble decision tree-based methods at predicting complex patterns, in biology and in other fields. Deep learning algorithms also benefit from the ability to learn from raw data (e.g. whole DNA sequence), rather than user-defined features (e.g. enriched sequence motifs), and are therefore being adopted more and more by the ML community.

2.4.3 Probing deep learning networks

In statistics, there is often a tradeoff between predictability and interpretability, and this is certainly the case for deep learning (Figure 2.5). Given that deep learning models have been shown to outperform classic ML models (e.g. regression, decision trees) in applications ranging from machine translation to computer vision (LeCun et al. 2015; Guo et al. 2016; Banerjee et al. 2017), there has been a substantial effort to develop new methods to interpret these complex models. Because interpreting neural networks is an active area of research, there is not yet a consensus on the best interpretation methods. Here we describe three general strategies for probing deep learning models (Supplemental Table 2.1) and provide specific examples of how some of these strategies have been used in biology.

Figure 2.5. Deep Learning Crash Course.

Like the coefficient weights and the Gini importance scores for SVM and RF models, respectively, metrics indicating feature importance for neural networks can be derived by probing three different components of a trained network: (1) the weights of the connections between nodes in different layers of the network, (2) the gradient of the output with respect to the input, and (3) the activation level at every connection for a given instance (Figure 2.4C). In the first category, connection weight-based methods quantify the overall relationship between each feature and the output by summing the connection weights from input-to-hidden, hidden-to-hidden, and hidden-to-output nodes for each input feature (Garson 1991; Olden and Jackson 2002). For example, following the path through the neural network in Figure 2.5 between an input feature (e.g. i1) and the output layer, you notice the paths for different input features have different sized connection weights (represented by the widths of the gray lines). If we were to use some handy linear algebra to sum the connection weights along the path through the network for each feature, we could quantify the extent to which some features (e.g. i1) are more important for predicting the value of the label in the output layer than others (e.g. i3). This approach was used to determine which of 179 microRNAs (i.e. input features) had the highest connection weights to the expression level of Smad7 (i.e. the label), a gene that is involved in disrupting a signaling process that gets up-regulated in patients with breast cancer (Manzanarez-Ozuna et al. 2018).
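The linear algebra behind this kind of connection weight-based score can be sketched in a few lines. The following is a simplified illustration in the spirit of Olden and Jackson (2002) for a network with a single hidden layer; the weight matrices here are random stand-ins for a trained network's parameters:

    # For each input feature, sum the products of the connection weights along
    # every input -> hidden -> output path. The matrix product does exactly this.
    import numpy as np

    rng = np.random.default_rng(0)
    W_input_hidden = rng.normal(size=(5, 8))   # 5 input features -> 8 hidden nodes
    W_hidden_output = rng.normal(size=(8, 1))  # 8 hidden nodes -> 1 output node

    importance = W_input_hidden @ W_hidden_output  # one summed weight per feature
    for i, score in enumerate(importance.ravel()):
        print(f"feature_{i}: summed connection weight = {score:.3f}")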
The second category of neural network importance scores is gradient-based scores (sometimes referred to as saliency). In the case of interpretable deep learning, the gradient refers to the change in the predicted label value as the value of an input feature is changed. This is not to be confused with the use of the word gradient in describing how neural networks are trained (i.e. gradient descent). These scores are calculated using a handy trick from calculus, the partial derivative, which measures the change in the output (i.e. the predicted label value) due to making tiny changes in the input (i.e. the feature value). A feature with a large gradient (e.g. i1) is one where a small change in the input feature value would result in a big change in the prediction (Simonyan et al. 2013). Kelley et al. used this approach to identify putative distal regulatory sequences in genomic regions, where positive and negative gradient-based importance score peaks represented enhancer and silencer regions, respectively (Kelley et al. 2018).
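In a framework with automatic differentiation, such a gradient-based (saliency) score takes only a few lines. The minimal sketch below (in PyTorch, with an untrained stand-in network; in practice the model would be trained first) computes the partial derivative of the output with respect to each input feature:

    # Saliency: back-propagate from the model output to the input to get the
    # gradient of the prediction with respect to each feature value.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(5, 8), nn.ReLU(), nn.Linear(8, 1))

    x = torch.rand(1, 5, requires_grad=True)  # one instance with 5 features
    model(x).sum().backward()                 # .sum() reduces the output to a scalar

    saliency = x.grad.abs()  # a large value means a tiny change in that feature
    print(saliency)          # would substantially change the prediction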
Although weight-based and gradient-based approaches can provide interpretations of deep learning models, there are situations where their application can be misleading. For example, connection weight-based importance scores are not directly comparable when features are on different scales, they underestimate importance when positive and negative weights cancel each other out, and they can overestimate importance when connections with large weights are rarely activated (i.e. rarely turned on) (REF). Similarly, gradient-based importance scores are not useful when input features are categorical (e.g. true/false, present/absent) or when the signal from an input feature is saturated (i.e. small changes in the feature value will not change the prediction) (Shrikumar et al. 2017).

The third category, activation-based interpretation strategies, such as DeepLIFT (Deep Learning Important FeaTures), avoid these limitations and have therefore become popular for interpreting deep learning models in recent biological applications (Zuallaert et al. 2017; Shrikumar et al. 2017; Washburn et al. 2019). DeepLIFT works by comparing the activation level of nodes in the network (i.e. the output value of a node after it has passed through the activation function) when a reference instance is input into the trained model versus when an instance of interest is used. In models where the input is a DNA sequence, the reference instance could be a randomly shuffled DNA sequence or the background nucleotide frequency. Because a DeepLIFT interpretation is unique to the reference instance and the instance of interest that are selected, it provides a local interpretation. This approach is robust when features are on different scales or when connection weights are large but rarely activated because it relies on activation rather than connection weights. Furthermore, it is robust against categorical features and saturated nodes because it looks for important differences between the reference and the instance of interest.

In addition to the three strategies described above for probing the importance of each feature to the prediction overall, another promising strategy is to probe each hidden node to see what pattern it has learned to identify. One approach involving this type of strategy is to feed the trained model either real or fake instances to identify which ones maximally activate a node (referred to as activation maximization) and associate the properties of those real or fake instances with that node. For example, if the 10 DNA sequences that maximally activate node X (i.e. cause node X to have the maximum possible output value) all contain the motif ACGGTC, one could associate that motif with node X. Other strategies to probe hidden nodes are unique to the type of deep learning algorithm. For example, Esteva et al. used a dimensionality reduction technique (t-Distributed Stochastic Neighbor Embedding: tSNE) to visualize the nodes in the last hidden layer of a convolutional neural network trained to diagnose different types of skin cancer from photos (Esteva et al. 2017).

2.5 Perturbing strategies for interpreting machine learning models

Perturbing strategies involve modifying the input data and observing changes in the model output. Because modifications to the input data can be made regardless of the type of ML algorithm applied, perturbing strategies are generally model-agnostic (although there are some perturbing strategies particular to deep learning models that will not be discussed here). Next we discuss two general perturbation-based strategies: sensitivity analysis and What-if methods.

2.5.1 Sensitivity Analysis

Sensitivity analysis involves modifying the input features and measuring the extent to which these modifications impact overall model performance, which provides a global measure of importance (Figure 2.6A). Feature modification is typically done by either removing (i.e. leave-one-feature-out) or permuting (e.g. setting all values to the mean, or randomly shuffling) features one at a time and observing the change in predictive performance of the trained model. The result of a sensitivity analysis is a highly intuitive score for each feature that indicates its overall contribution as the decrease in model performance when the feature in question is removed or permuted (Figure 2.6A). Because perturbing a feature impacts not only that feature but also other features that interact with it, this type of analysis captures both the main and interaction effects of each feature.

Figure 2.6. Detailed overview of the perturbing strategies. Colors represent the same data as in Figure 2.4.

One approach for assessing feature importance using a sensitivity analysis is to compare the performance of a model trained on the original dataset with that of a model trained on the perturbed dataset (i.e. leave-one-feature-out). This approach was used to determine what sequence features were most important for identifying genomic islands containing clusters of genes acquired by horizontal gene transfer (Che et al. 2010). They found models trained without information about size (i.e. length of the genomic region in base pairs) had a 13% lower performance as measured by area under the Receiver Operating Characteristic curve (i.e. a plot of the true positive rate against the false positive rate at various thresholds). However, this type of sensitivity analysis can be computationally expensive, as it requires training new models for every perturbed dataset. Because of the greater computational cost of training deep learning models, this type of sensitivity analysis is typically only used to interpret deep learning models when there are few input features. For example, using convolutional neural networks (CNN; see Figure 2.5) (Schmidhuber 2015), leave-one-feature-out was used to determine which of five histone marks were most important for predicting transcription factor binding sites (Jing et al. 2019). They found the H3K4me3 mark was the most important for their model predictions, which is consistent with what is known about H3K4me3 being associated with active transcription of nearby genes (Roadmap Epigenomics Consortium et al. 2015).

Another way to reduce the amount of computational power needed to perform a sensitivity analysis is to train one model on the full dataset, then measure how the model performance on a specific instance or a held-out set of data changes when different features are perturbed. This type of approach has also been demonstrated to be well suited for interpretable ML in genetic studies because it mirrors mutagenesis experiments. For example, in silico mutagenesis (i.e. permuting nucleotides) was used to identify which changes in the DNA sequence most impacted tissue-specific gene expression (Zhou et al. 2018). First, they trained a series of models that predicted an epigenomic profile (e.g. histone marks, chromatin accessibility) directly from DNA sequence and then used that epigenomic profile to predict tissue-specific gene expression levels. After training their models they were able to mutate (i.e. perturb) over 140 million base pairs and measure the impact those mutations had on expression.
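The permutation variant of this idea can be sketched in a few lines of Python (a minimal example; the function name is hypothetical, and scikit-learn also ships a ready-made version as sklearn.inspection.permutation_importance):

    # Sensitivity analysis on a trained model: shuffle one feature at a time in
    # held-out data and record the drop in performance, without retraining.
    import numpy as np
    from sklearn.metrics import accuracy_score

    def permutation_sensitivity(model, X_test, y_test, seed=0):
        rng = np.random.default_rng(seed)
        baseline = accuracy_score(y_test, model.predict(X_test))
        drops = []
        for j in range(X_test.shape[1]):
            X_perturbed = X_test.copy()
            X_perturbed[:, j] = rng.permutation(X_perturbed[:, j])  # perturb feature j only
            score = accuracy_score(y_test, model.predict(X_perturbed))
            drops.append(baseline - score)  # a big drop = an important feature
        return drops

    # e.g. drops = permutation_sensitivity(trained_forest, X_test, y_test)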
Other strategies to probe hidden nodes are specific to the type of deep learning algorithm. For example, Esteva et al. used a dimensionality reduction technique (t-Distributed Stochastic Neighbor Embedding; tSNE) to visualize the nodes in the last hidden layer of a convolutional neural network trained to diagnose different types of skin cancer from photos (Esteva et al. 2017).

2.5 Perturbing strategies for interpreting machine learning models

Perturbing strategies involve modifying the input data and observing changes in the model output. Because modifications to the input data can be made regardless of the type of ML algorithm applied, perturbing strategies are generally model-agnostic (although there are some perturbing strategies particular to deep learning models that will not be discussed here). Next we discuss two general perturbation-based strategies: sensitivity analysis and What-if methods.

2.5.1 Sensitivity Analysis

Sensitivity analysis involves modifying the input features and measuring the extent to which these modifications impact overall model performance, which provides a global measure of importance (Figure 2.6A). Feature modification is typically done by either removing (i.e. leave-one-feature-out) or permuting (e.g. setting all values to the mean, or randomly shuffling) features one at a time during training and observing the change in predictive performance of the trained model. The result of a sensitivity analysis is a highly intuitive score for each feature that indicates its overall contribution: the decrease in model performance when the feature in question is removed or permuted (Figure 2.6A). Because perturbing a feature impacts not only that feature but also other features that interact with it, this type of analysis captures both the main and interaction effects for each feature.

Figure 2.6. Detailed overview of the perturbing strategies. Colors represent the same data as in Figure 2.4.

One approach for assessing feature importance using a sensitivity analysis is to compare the performance of a model trained on the original dataset to that of a model trained on the perturbed dataset (i.e. leave-one-feature-out). This approach was used to determine which sequence features were most important for identifying genomic islands containing clusters of genes acquired by horizontal gene transfer (Che et al. 2010). Models trained without information about size (i.e. the length of the genomic region in base pairs) had a 13% lower performance as measured by the area under the Receiver Operator Characteristic curve (i.e. a plot of the true positive rate against the false positive rate at various thresholds). However, this type of sensitivity analysis can be computationally expensive because it requires training a new model for every perturbed dataset. Because of the greater computational cost of training deep learning models, this type of sensitivity analysis is typically only used to interpret deep learning models when there are few input features. For example, leave-one-feature-out was used with convolutional neural networks (CNN; see Figure 2.5) (Schmidhuber 2015) to determine which of five histone marks were most important for predicting transcription factor binding sites (Jing et al. 2019). The H3K4me3 mark was the most important for their model predictions, which is consistent with what is known about H3K4me3 being associated with active transcription of nearby genes (Roadmap Epigenomics Consortium et al. 2015).
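A retraining-based (leave-one-feature-out) sensitivity analysis can be sketched as follows; the synthetic dataset stands in for, e.g., a matrix of five histone mark features.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a genomic feature matrix (five features).
X, y = make_classification(n_samples=500, n_features=5,
                           n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def auc_with(features):
    """Train a fresh model on the given feature subset; return test AUC-ROC."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train[:, features], y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test[:, features])[:, 1])

full_auc = auc_with(list(range(5)))
# Retrain once per left-out feature; importance = the drop in AUC-ROC.
for i in range(5):
    kept = [j for j in range(5) if j != i]
    print(f"feature {i}: AUC-ROC drop = {full_auc - auc_with(kept):+.3f}")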
Another way to reduce the amount of computational power needed to perform a sensitivity analysis is to train one model on the full dataset, then measure how the model's performance on a specific instance or a held-out set of data changes when different features are perturbed. This type of approach is also well suited for interpretable ML in genetic studies because it mirrors mutagenesis experiments. For example, in silico mutagenesis, i.e. permuting nucleotides, was used to identify which changes in the DNA sequence most impacted tissue-specific gene expression (Zhou et al. 2018). They trained a series of models that first predicted an epigenomic profile (e.g. histone marks, chromatin accessibility) directly from DNA sequence and then used that epigenomic profile to predict tissue-specific gene expression levels. After training their models, they were able to mutate (i.e. perturb) over 140 million base pairs and measure the impact those mutations had on expression.

2.5.2 What-if Analysis

What-if methods involve modifying one or more feature values for a single instance and observing to what degree the modification impacts the prediction for that instance. Because the focus is on a specific instance, this provides a local measure of importance, as opposed to sensitivity analysis, which looks at the global impact of feature space modifications on model performance. What-if interpretation strategies are also sometimes called counterfactual strategies, where counterfactuals are the name given to instances that have been modified and fed back to the model (Wachter et al. 2018). Here we discuss three What-if methods: partial dependency plots, individual conditional expectation plots, and occlusion sensitivity.

Partial dependency plots show the effect of changing the value of a specific feature on the prediction of an instance when all other feature values are averaged (Friedman 2001). For example, imagine we trained a machine learning model that predicts the likelihood that a sequence will be bound by a certain transcription factor (TF). To determine how important position #3 is for TF binding, we could generate an instance with the background nucleotide frequency at the other positions, then make a partial dependency plot showing the change in the TF binding likelihood when each nucleotide is placed at position #3 (left side; Figure 2.6B). These plots are useful because they show the magnitude, direction, and non-linearities of the relationship between a feature and the label. However, one limitation is that dependencies can only be visualized for one or two features at a time, so these plots are typically only generated for models with few features, which are uncommon in most applications of machine learning in genetics and genomics, or for a subset of features already deemed important based on another interpretation strategy (Liu and Yang 2014). A second limitation of partial dependency plots is that they can obscure patterns when there are interactions between features (e.g. the influence of regulatory sequence X on expression depends on the presence or absence of regulatory sequence Y) or when the effect of a feature is heterogeneous across instances (e.g. regulatory sequence X is associated with up-regulation in some genes, but not in others where regulatory sequence X is highly methylated).
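The standard partial dependency calculation can be sketched as follows: the feature of interest is fixed at each value along a grid while the model's predictions are averaged over the rest of the data. The dataset and model here are synthetic stand-ins.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=1)
model = RandomForestClassifier(random_state=1).fit(X, y)

# Partial dependence for feature 3: fix the feature at each grid value
# for every instance, then average the predicted probabilities.
grid = np.linspace(X[:, 3].min(), X[:, 3].max(), num=10)
for value in grid:
    X_mod = X.copy()
    X_mod[:, 3] = value
    mean_pred = model.predict_proba(X_mod)[:, 1].mean()
    print(f"x3 = {value:+.2f} -> mean P(class 1) = {mean_pred:.3f}")
# Plotting each instance's curve separately, instead of the mean curve,
# yields the individual conditional expectation plots discussed next.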
To alleviate the impact of feature interactions, Goldstein et al. suggested that dependency plots could be generated for every instance in the dataset, instead of for one averaged instance, an approach they termed individual conditional expectation (Goldstein et al. 2015). For example, instead of looking at the impact of changes in the nucleotide at position #3 when other positions are set to the background nucleotide frequency, an individual conditional expectation plot could be generated for a specific sequence (right side; Figure 2.6B).

Individual conditional expectation plots have been used to interpret deep learning models that utilize adversarial learning. Adversarial learning is when two deep learning models are trained by competing with one another, in a sort of machine learning arms race (Dalvi et al. 2004; Biggio and Roli 2018). For example, Generative Adversarial Networks (GANs, see Figure 2.5) are composed of a generator, which generates simulated data, and a discriminator, which has access to the real data and tries to determine whether the simulated data are real (Goodfellow et al. 2014). The generator and discriminator compete with one another, with the generator getting better at simulating real data and the discriminator getting better at spotting simulated data. This type of approach was used to better understand the diversity of gene expression patterns at the single-cell level, where a generator was trained to simulate realistic single-cell gene expression levels and a discriminator was trained to classify these simulations as real or not given a set of diverse, real single-cell expression data (Ghahramani et al. 2018). To determine what patterns of gene expression were characteristic of real single cells, they generated individual conditional expectation plots by varying the expression level of known epidermal cell marker genes and observing the change in the prediction from the discriminator. Using this approach, Ghahramani et al. were able to train a discriminator that was sensitive to changes in the expression levels of genes that were known markers for particular cell-type states (e.g. Ivl, Krt10, and Krt14 for the epidermal cell state). Therefore, such a discriminator can potentially be used to identify novel markers.

Finally, What-if approaches are useful for interpreting deep learning models used to classify images and have therefore been used to better understand models generated to classify medical images. For example, convolutional neural networks were used to classify blood samples as infected or non-infected with malaria based on blood smear images (Rajaraman et al. 2018). To make sure these models were basing their classifications on differences in parasitized regions as opposed to unrelated background signals, the authors "grayed out" different regions of an image and found that graying out regions with visible parasites decreased model performance more than graying out un-parasitized regions. This approach is often referred to as occlusion sensitivity because it involves excluding pixels from the image.
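Occlusion sensitivity can be sketched as follows; the scikit-learn digit images stand in for medical images, and any classifier with probability outputs could be substituted for the Random Forest used here.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

digits = load_digits()
X, y = digits.images, digits.target
model = RandomForestClassifier(random_state=0).fit(X.reshape(len(X), -1), y)

image, label = X[0].copy(), y[0]
base_prob = model.predict_proba(image.reshape(1, -1))[0, label]

# Slide a 2x2 occlusion patch over the 8x8 image; a large drop in the
# predicted probability marks a region the model relies on.
for row in range(0, 8, 2):
    for col in range(0, 8, 2):
        occluded = image.copy()
        occluded[row:row + 2, col:col + 2] = 0.0  # "gray out" the patch
        prob = model.predict_proba(occluded.reshape(1, -1))[0, label]
        print(f"patch ({row},{col}): drop = {base_prob - prob:+.3f}")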
2.6 Surrogate strategies for interpreting machine learning models

Imagine you have an ML model that is truly a Black Box—meaning it cannot be probed or perturbed, or that these strategies do not provide useful information. In such a case, one interpretation strategy is to use an inherently interpretable model as a surrogate for the Black Box model. The surrogate model could be a well-established statistical model (e.g. linear regression) or an ML model that is easier to interpret (e.g. a decision tree). For example, to generate a surrogate for a Black Box model that predicts gene up-regulation under toxin stress with great accuracy, we would first apply the Black Box model to a set of genes to get predictions. Then we would train our selected interpretable surrogate model (e.g. logistic regression) on the same set of genes to learn the predictions from the Black Box model (i.e. the surrogate label) and interpret the logistic regression model by examining the p-values and/or effect sizes.

One major limitation of such an approach is that Black Box models are often highly complex (i.e. have a highly non-linear decision boundary) and, thus, cannot be fully and accurately learned by an interpretable surrogate. In such cases, one approach is to generate a surrogate that can accurately learn just a portion of the Black Box model. This concept underlies LIME (Local Interpretable Model-agnostic Explanations) (Ribeiro et al. 2016). The theory behind LIME is that while the complex, non-linear decision boundary of the full model may be too complex for a surrogate model to learn, the decision boundary around one instance or a group of similar instances (e.g. genes) will likely be simple enough. To generate a LIME model, an instance of interest is selected, that instance is perturbed many times as in a sensitivity analysis, and the Black Box model is applied to those perturbed instances to generate predictions. Once there are Black Box model predictions for each perturbed instance, an interpretable model is trained to learn those predictions and is then inspected.

Local surrogate models have been used by those interested in understanding predictions in medical settings (Nanayakkara et al. 2018; Wang et al. 2019). For example, LIME was used to learn explanations for predictions of individual mortality following cardiac arrest (Nanayakkara et al. 2018). The authors focused their interpretation on patients who were misclassified by the Black Box model (i.e. predicted to survive but did not). For example, they found that for one patient who was given a 78% probability of survival due to favorable features (e.g. healthy neurologic status, lack of chronic respiratory illness), other features (e.g. elevated creatinine, advanced age) were indicators of mortality.
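The core LIME recipe (perturb the instance, predict with the Black Box, fit a proximity-weighted interpretable surrogate) can be sketched as follows; this is a simplified stand-in for the published LIME implementation, with a synthetic dataset and Gaussian perturbations.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import Ridge

# A complex "Black Box" model and one instance of interest.
X, y = make_classification(n_samples=500, n_features=8, random_state=2)
black_box = GradientBoostingClassifier(random_state=2).fit(X, y)
instance = X[0]

# 1. Perturb the instance of interest many times.
rng = np.random.default_rng(2)
perturbed = instance + rng.normal(scale=0.5, size=(1000, 8))
# 2. Ask the Black Box for predictions on the perturbed instances.
bb_preds = black_box.predict_proba(perturbed)[:, 1]
# 3. Weight each perturbation by its proximity to the instance of interest.
weights = np.exp(-np.linalg.norm(perturbed - instance, axis=1) ** 2)
# 4. Fit an interpretable surrogate to the Black Box predictions and read
#    its coefficients as a local explanation.
surrogate = Ridge().fit(perturbed, bb_preds, sample_weight=weights)
print(np.round(surrogate.coef_, 3))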
2.7 Challenges and Opportunities

While a great deal of effort has gone into developing the interpretation strategies discussed above, there are still several technical and practical challenges to interpreting machine learning models in genetics.

• Correlation: When the input data contain features that are highly correlated, both probing and perturbing interpretation strategies can underestimate the importance of a feature, either by splitting the importance between the correlated features (i.e. during probing) or by compensating for one feature when it is left out or permuted (i.e. during perturbing) (Altmann et al. 2016).

• Heterogeneous input space: Multi-omics data integration is an area of computational biology receiving much attention (Pinu et al. 2019). However, some interpretation strategies (e.g. SVM coefficient weights, ANN weight/gradient-based probing) provide misleading results if the scales of the input features differ. This can sometimes be addressed with normalization, but when both continuous and categorical features are used together, normalization may not be an option. Therefore, interpretation strategies that can handle diverse multi-omics data are needed.

• Heterogeneous effects: Given the importance of non-linear effects in biology (e.g. epistasis, feedback loops, synergistic/antagonistic effects), interpretation strategies that can identify features that have important but heterogeneous effects are critical. However, some interpretation strategies (e.g. partial dependency plots, LIME) are unable to identify such features because positive and negative signals average out. Therefore, interpretation strategies that identify features with heterogeneous effects are needed.

• Multiple interpretations: For some interpretation strategies (e.g. What-if analysis, DeepLIFT), there may be more than one explanation for why an instance was predicted a certain way, raising the question of which explanation is best.

While these challenges can at times undermine efforts to understand the biology driving a model's predictions, many of them are shared with traditional statistical methods and with data-driven applications in other fields. These challenges also represent opportunities for computational biologists to develop novel solutions. In addition, many tools have been developed to facilitate interpreting ML models using the strategies described in this review (Supplemental Table 2.2).

2.8 Concluding Remarks and Future Perspective

Interpretability is critical for applications of ML both within and outside of biology and will therefore likely see substantial advances in the coming years. Given its broad use, future innovations in ML interpretability will likely come out of fields working on diverse applications ranging from self-driving cars and smart city development to targeted advertising and natural language processing. Training the next generation of biologists to harness these innovations to improve our ability to derive biological insights from what have been considered "Black Box" models represents a major research and training priority in the coming decade (see Outstanding Questions).

2.9 Outstanding Questions

• How will advances in deep learning (e.g. transfer learning and multi-label learning) impact the interpretability of these models?

• Interaction effects are a common phenomenon in biological systems, whether we are thinking about community dynamics, epistasis, or environmental effects. What interpretation strategies are most appropriate for finding these, often complex, interactions?

• Deep learning algorithms often outperform less complex ML algorithms (e.g. RF and SVM) when enough training data are available; however, can we learn as much from interpreting deep learning models as we can from these simpler models?

• There have been few efforts to compare the robustness of interpretation strategies using biological datasets. How should biologists benchmark interpretation strategies in order to find the methods most useful for addressing biological questions?

• Interpreting ML models can be relatively involved and can require extensive computational skills. To what extent can existing and new tools for interpreting ML be made accessible to biologists without extensive programming skills?

2.10 Glossary

Algorithm: The procedure taken to solve a problem or to build a model.

Decision tree: A model made up of a series of branching true/false questions.

Deep Learning: A subset of ML algorithms, roughly inspired by the structure of the brain, that can find complex, nonlinear patterns in data.
Ensemble: A combination of multiple models that makes one prediction for each unknown sample instead of multiple predictions.

Feature: An explanatory (i.e. independent) variable used during modeling.

Global interpretation: An ML interpretation that gives an explanation of the overall relationship between the features and the label.

Instance: A single example or object (n) from which the model learns or to which it is applied.

Interpretable: Capable of being understood by a human.

Label: The dependent variable to be predicted; either a continuous variable for regression models or a categorical variable (i.e. class) for classification models.

Local interpretation: An ML interpretation that gives an explanation of the relationship between the features and the label for one or a subset of instances.

Machine learning: Computational models that are able to learn from data without being explicitly programmed.

Model: The set of patterns learned for a specific problem, where given an input (i.e. instances with features) the model will generate an output (i.e. a prediction).

Parameters: Variables in an ML model whose values are estimated/optimized during training (e.g. connection weights, coefficient weights).

Perturbing: A family of interpretable ML strategies that measure how changes in the input data impact model predictions or performance.

Probing: A family of interpretable ML strategies that involve inspecting the structure and parameters of a trained model.

Surrogate: A family of interpretable ML strategies that involve training an inherently interpretable surrogate model to represent a black-box model.

Test set: The subset of the data used to test the performance of a trained model.

Training: The process of identifying the best parameters to make up a model – the learning part of ML.

REFERENCES

Altmann, A., M. S. Schröter, V. I. Spoormaker, S. A. Kiem, D. Jordan et al., 2016 Validation of non-REM sleep stage decoding from resting state fMRI using linear support vector machines. NeuroImage 125: 544–555.

Alyass, A., M. Turcotte, and D. Meyre, 2015 From big data analysis to personalized medicine for all: challenges and opportunities. BMC Medical Genomics 8: 33.

Angermueller, C., T. Pärnamaa, L. Parts, and O. Stegle, 2016 Deep learning for computational biology. Molecular Systems Biology 12: 878.

Banerjee, S., P. Bhattacharjee, and S. Das, 2017 Performance of Deep Learning Algorithms vs. Shallow Models, in Extreme Conditions - Some Empirical Studies, pp. 565–574 in Pattern Recognition and Machine Intelligence, edited by B. U. Shankar, K. Ghosh, D. P. Mandal, S. S. Ray, D. Zhang, et al. Lecture Notes in Computer Science, Springer International Publishing.

Barakat, N., and A. P. Bradley, 2010 Rule extraction from support vector machines: A review. Neurocomputing 74: 178–190.

Basu, S., K. Kumbier, J. B. Brown, and B. Yu, 2018 Iterative random forests to discover predictive and stable high-order interactions. Proceedings of the National Academy of Sciences 115: 1943–1948.

Biggio, B., and F. Roli, 2018 Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition 84: 317–331.

Breiman, L., 2001 Random Forests. Machine Learning 45: 5–32.

Che, D., C. Hockenbury, R. Marmelstein, and K. Rasheed, 2010 Classification of genomic islands using decision trees and their ensemble algorithms. BMC Genomics 11: S1.

Chicco, D., 2017 Ten quick tips for machine learning in computational biology. BioData Mining 10: 35.
Cuperlovic-Culf, M., 2018 Machine Learning Methods for Analysis of Metabolic Data and Metabolic Pathway Modeling. Metabolites 8: 4.

Dalvi, N., P. Domingos, Mausam, S. Sanghai, and D. Verma, 2004 Adversarial classification, pp. 99 in Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '04, ACM Press, Seattle, WA, USA.

Esteva, A., B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter et al., 2017 Dermatologist-level classification of skin cancer with deep neural networks. Nature 542: 115–118.

Friedman, J. H., 2001 Greedy function approximation: A gradient boosting machine. Ann. Statist. 29: 1189–1232.

Garson, D. G., 1991 Interpreting neural network connection weights. AI Expert 6: 46–51.

Ghahramani, A., F. M. Watt, and N. M. Luscombe, 2018 Generative adversarial networks simulate gene expression and predict perturbations in single cells. bioRxiv 262501.

Goldstein, A., A. Kapelner, J. Bleich, and E. Pitkin, 2015 Peeking Inside the Black Box: Visualizing Statistical Learning With Plots of Individual Conditional Expectation. Journal of Computational and Graphical Statistics 24: 44–65.

Goodfellow, I. J., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley et al., 2014 Generative Adversarial Nets, pp. 2672–2680 in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, MIT Press, Cambridge, MA, USA.

Guidotti, R., A. Monreale, S. Ruggieri, F. Turini, F. Giannotti et al., 2018 A Survey of Methods for Explaining Black Box Models. ACM Computing Surveys 51: 1–42.

Guo, Y., Y. Liu, A. Oerlemans, S. Lao, S. Wu et al., 2016 Deep learning for visual understanding: A review. Neurocomputing 187: 27–48.

Jing, F., S. Zhang, Z. Cao, and S. Zhang, 2019 An integrative framework for combining sequence and epigenomic data to predict transcription factor binding sites using deep learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics 1–1.

Kelley, D. R., Y. A. Reshef, M. Bileschi, D. Belanger, C. Y. McLean et al., 2018 Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28: 739–750.

LeCun, Y., Y. Bengio, and G. Hinton, 2015 Deep learning. Nature 521: 436–444.

Libbrecht, M. W., and W. S. Noble, 2015 Machine learning applications in genetics and genomics. Nature Reviews Genetics 16: 321–332.

Lipton, Z. C., 2018 The Mythos of Model Interpretability. ACM Queue 16:.

Liu, Z., and J. Yang, 2014 Quantifying ecological drivers of ecosystem productivity of the early-successional boreal Larix gmelinii forest. Ecosphere 5: art84.

Ma, C., H. H. Zhang, and X. Wang, 2014 Machine learning for Big Data analytics in plants. Trends in Plant Science 1–11.

Manzanarez-Ozuna, E., D.-L. Flores, E. Gutiérrez-López, D. Cervantes, and P. Juárez, 2018 Model based on GA and DNN for prediction of mRNA-Smad7 expression regulated by miRNAs in breast cancer. Theoretical Biology and Medical Modelling 15:.

Marx, V., 2013 Biology: The big challenges of big data. Nature 498: 255–260.

Molnar, C., 2019 Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. Christoph Molnar.

Nanayakkara, S., S. Fogarty, M. Tremeer, K. Ross, B. Richards et al., 2018 Characterising risk of in-hospital mortality following cardiac arrest using machine learning: A retrospective international registry study. PLOS Medicine 15: e1002709.

Olden, J. D., and D. A. Jackson, 2002 Illuminating the "black box": a randomization approach for understanding variable contributions in artificial neural networks. Ecological Modelling 154: 135–150.

Pinu, F. R., D. J. Beale, A. M. Paten, K. Kouremenos, S. Swarup et al., 2019 Systems Biology and Multi-Omics Integration: Viewpoints from the Metabolomics Research Community. Metabolites 9:.

Rajaraman, S., K. Silamut, Md. A. Hossain, I. Ersoy, R. J. Maude et al., 2018 Understanding the learned behavior of customized convolutional neural networks toward malaria parasite detection in thin blood smear images. Journal of Medical Imaging 5: 1.

Rasmussen, P. M., K. H. Madsen, T. E. Lund, and L. K. Hansen, 2011 Visualization of nonlinear kernel models in neuroimaging by sensitivity maps. NeuroImage 55: 1120–1131.

Ribeiro, M. T., S. Singh, and C. Guestrin, 2016 "Why Should I Trust You?": Explaining the Predictions of Any Classifier, pp. 1135–1144 in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16, ACM Press, San Francisco, California, USA.

Roadmap Epigenomics Consortium, A. Kundaje, W. Meuleman, J. Ernst, M. Bilenky et al., 2015 Integrative analysis of 111 reference human epigenomes. Nature 518: 317–330.

Rokach, L., 2016 Decision forest: Twenty years of research. Information Fusion 27: 111–125.

Ronen, R., N. Udpa, E. Halperin, and V. Bafna, 2013 Learning Natural Selection from the Site Frequency Spectrum. Genetics 195: 181–193.

Samuel, A. L., 1959 Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development 3: 210–229.

Schmidhuber, J., 2015 Deep learning in neural networks: An overview. Neural Networks 61: 85–117.

Schrider, D. R., and A. D. Kern, 2018 Supervised Machine Learning for Population Genetics: A New Paradigm. Trends in Genetics 34: 301–312.

Shrikumar, A., P. Greenside, and A. Kundaje, 2017 Learning Important Features Through Propagating Activation Differences. Proceedings of the 34th International Conference on Machine Learning.

Simonyan, K., A. Vedaldi, and A. Zisserman, 2013 Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. International Conference on Learning Representations.

Stephens, Z. D., S. Y. Lee, F. Faghri, R. H. Campbell, C. Zhai et al., 2015 Big Data: Astronomical or Genomical? PLOS Biology 13: e1002195.

Strobl, C., A.-L. Boulesteix, A. Zeileis, and T. Hothorn, 2007 Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8:.

Tarca, A. L., V. J. Carey, X. Chen, R. Romero, and S. Drăghici, 2007 Machine Learning and Its Applications to Biology. PLOS Computational Biology 3: e116.

Uygun, S., A. E. Seddon, C. B. Azodi, and S.-H. Shiu, 2017 Predictive Models of Spatial Transcriptional Response to High Salinity. Plant Physiology 174: 450–464.

Vervier, K., and J. J. Michaelson, 2018 TiSAn: estimating tissue-specific effects of coding and non-coding variants. Bioinformatics 34: 3061–3068.

Wachter, S., B. D. M. Mittelstadt, and C. Russell, 2018 Counterfactual explanations without opening the black box: automated decisions and the GDPR. Harvard Journal of Law and Technology 31:.

Wang, X., D. Wang, Z. Yao, B. Xin, B. Wang et al., 2019 Machine Learning Models for Multiparametric Glioma Grading With Quantitative Result Interpretations. Front Neurosci 12:.
Washburn, J. D., M. K. Mejia-Guerra, G. Ramstein, K. A. Kremling, R. Valluru et al., 2019 Evolutionarily informed deep learning methods for predicting relative transcript abundance from DNA sequence. PNAS 116: 5542–5549.

Zhou, J., C. L. Theesfeld, K. Yao, K. M. Chen, A. K. Wong et al., 2018 Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat Genet 50: 1171–1179.

Zuallaert, J., M. J. Kim, Y. Saeys, and W. De Neve, 2017 Interpretable convolutional neural networks for effective translation initiation site prediction, pp. 1233–1237 in IEEE International Conference on Bioinformatics and Biomedicine-BIBM, IEEE.

CHAPTER THREE: MODELING THE CIS-REGULATORY CODE OF PLANT SINGLE AND COMBINED STRESS TRANSCRIPTIONAL RESPONSE WITH MACHINE LEARNING

3.1 Abstract

When faced with adverse environmental conditions, plants mount dynamic and diverse responses at the transcriptional level. The set of regulatory components that control these responses is called the cis-regulatory code. Previous studies have characterized the cis-regulatory code regulating response to individual stress conditions (e.g. salinity); however, the way plants regulate their response to multiple simultaneous stresses is poorly understood. Here, we use classic machine learning and deep learning approaches to model the cis-regulatory code of response to single and combined heat and drought stress in Arabidopsis thaliana. First, we trained cis-regulatory code models using DNA sequence-based features from gene promoters (e.g. the presence/absence of putative cis-regulatory elements) that were able to predict a gene's pattern of response better than random guessing. Then, we demonstrated how integrating additional levels of regulatory information (e.g. chromatin accessibility, histone modifications) and sequence-based features from outside the promoter region (e.g. downstream of the transcriptional stop site) improved the accuracy of our cis-regulatory code models. We found that features based on known transcription factor binding, histone 3 lysine 9 acetylation, chromatin accessibility, and downstream DNA sequence were the most useful additions to our models. We also found that while some of the most important putative cis-regulatory elements in our models resembled binding motifs (TFBMs) of transcription factors (TFs) known to be involved in heat and/or drought stress, others resembled TFBMs of TFs involved in developmental or other stress pathways. This study demonstrates how an in silico, data-driven approach can be used to generate biological insights into important complex biological systems.

3.2 Introduction

In order to survive and thrive, plants dynamically coordinate their physiology and development with their environment. Given projected increases in global temperatures (Stocker et al. 2013) and in the frequency and severity of droughts, heat waves, and flooding (Reynolds and Ortiz 2010; Sillmann et al. 2013), improving our understanding of how plants regulate these dynamic changes will be useful for future efforts to breed or engineer more resilient crops (Rabara et al. 2014) and for our ability to understand how a changing climate will impact diverse plant species (Nicotra et al. 2010). Efforts to study how plants regulate their response to a single stress have improved our understanding of stress response regulation. However, in nature multiple stressors are typically present, and the response to combined stress may be different from the response to either of the stresses individually.
This was demonstrated at the transcriptional level, where ~60% of Arabidopsis thaliana genes were found to respond to combined stress conditions in ways that are not predictable based on their responses to the stresses individually (Rasmussen et al. 2013). While recent efforts have been made to identify transcriptomic (Atkinson et al. 2013; Sewelam et al. 2014; Bonnet et al. 2017), metabolomic (Prasch and Sonnewald 2013; Georgii et al. 2017), and physiological (Shaar-Moshe et al. 2017) changes in response to combined stress, how these changes are regulated remains unclear.

There are a number of important components involved in regulating a gene's response to an environmental stress. One major component is the binding of one or more transcription factors (TFs) near that gene. TFs are proteins that bind to DNA and activate/repress transcription of nearby genes. Their importance for regulating transcriptional response to stress has made them targets for breeding and engineering plants for improved response to stresses, including salt (Hu et al. 2008), drought (Choi et al. 2013; Lee et al. 2017), and combined drought and heat (Wu et al. 2009; Chang et al. 2017) stress. In fact, some of the genes involved in the domestication of crop species were TFs (Konishi et al. 2006; Doebley et al. 2006). One approach to better understand and find the TFs driving stress-induced changes in gene expression is to identify the non-coding regions of DNA, or cis-regulatory elements (CREs), near the transcriptional start site of a gene where TFs bind. For some TFs in model species like A. thaliana, the DNA sequences that a TF can bind to (TF binding motifs; TFBMs) have been established in vitro (Weirauch et al. 2014; O'Malley et al. 2016). In addition, putative CREs (pCREs) can be found computationally using enrichment-based methods applied to co-expressed genes (Zou et al. 2011; Ghandi et al. 2014). Previous studies have demonstrated that both known TFBMs and pCREs can be used to generate models that are predictive of a gene's response to different environmental conditions (Zou et al. 2011; Uygun et al. 2017; Liu et al. 2018). These predictive models are referred to as the cis-regulatory code.

While such studies highlight the importance of the presence of TFBMs and pCREs for understanding transcriptional regulation, factors besides the presence or absence of a CRE can influence TF binding and therefore transcriptional response to stress. For example, TF binding can also be affected by chromatin accessibility (Huebert et al. 2012; He et al. 2012; Arvey et al. 2012; Wang et al. 2012) and histone modifications (Steinfeld et al. 2007; Zhu et al. 2012). Therefore, methods to integrate these additional layers of omics information into the cis-regulatory code are needed.

Here we explore the cis-regulatory code of transcriptional response to single and combined heat and drought stress in A. thaliana. Heat and drought stress were selected because they often co-occur in nature, they elicit some similar and some conflicting physiological responses in plants (Rizhsky et al. 2004), and because many important TFs and TF binding motifs have already been identified for these stresses individually. At the physiological level, the effects of combined drought and heat are generally additive (Vile et al. 2012). However, it is unclear to what degree these responses are additive, synergistic, or antagonistic at the level of transcriptional regulation.
To better understand the regulatory logic underlying single and combined stress, we first grouped genes likely to be co-regulated based on their shared pattern of transcriptional response under single and combined heat and drought stress (Prasch and Sonnewald 2013) (Step 1; Figure 3.1). Then, we used known TFBMs and enrichment-based pCREs (Step 2; Figure 3.1) to generate models of the cis-regulatory code controlling these different patterns of response to single and combined heat and drought stress using machine learning. To improve our cis-regulatory code models and therefore our understanding of how response to single and combined stress is regulated in A. thaliana, we modeled complex regulatory interactions (Step 3A; Figure 3.1), used a deep learning approach to integrate additional layers of regulatory information (i.e. chromatin accessibility, sequence conservation, and histone marks) into our models (Step 3B; Figure 3.1), and expanded the scope of our models by including pCREs identified outside of the promoter region (Step 3C; Figure 3.1). In addition to providing a comprehensive overview of the cis-regulatory code of response to single and combined heat and drought stress in A. thaliana, this study also exemplifies how a data-driven approach can be used to make novel discoveries in a complex system like gene regulation (Step 4; Figure 3.1).

Figure 3.1. A framework for generating cis-regulatory code models. Step 1: Genes were grouped based on their pattern of differential expression under heat (H), drought (D), and H+D stress compared to control conditions. Step 2: For each response group, known TFBMs and putative cis-regulatory elements (pCREs) were identified based on site enrichment among response group genes (Fisher's Exact Test; p-value < 0.01). Step 3: Information was gathered about associations between pCREs, their overlap with additional regulatory information, and pCREs located outside of the promoter regions. Step 4: All of this information was combined into machine learning models of the cis-regulatory code and the models were interpreted to identify the most important components driving the predictions.

3.3 Results and Discussion

3.3.1 More than 50% of genes have synergistic or antagonistic responses to combined heat and drought stress

In order to study the regulation of transcriptional response to single and combined stress, we first identified groups of genes that were likely to be co-regulated based on their shared pattern of transcriptional response (U: Up-regulated, N: Non-responsive, D: Down-regulated) to three stress conditions (heat, drought, and combined heat and drought stress), using transcriptome data from an earlier study (Prasch and Sonnewald 2013). For example, genes that were up-regulated under heat and combined stress, but not under drought alone, were placed in the UNU response group. These response groups were further categorized based on whether the response to the combined stress was similar to ("independent": UNU, NUU, DND, or NDD), less than ("antagonistic": UNN, NUN, DNN, or NDN), or greater than ("synergistic": NNU or NND) the sum of the responses to the single stress conditions (Figure 3.2A). Among genes that were responsive to at least one stress, 43%, 29%, and 24% were in the independent, antagonistic, and synergistic response groups, respectively (Figure 3.2A; Supplemental Table 3.1). The remaining 4% of genes belonged to rare response groups (e.g. DUN, UUD) and were not considered in our analysis.
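As a simplified sketch of how such response codes can be assigned (using only a fold-change threshold; the actual analysis also required significant differential expression), the hypothetical log2 fold changes below are mapped to three-letter response groups.

import pandas as pd

# Hypothetical log2 fold changes (stress vs. control) for four genes.
log2fc = pd.DataFrame(
    {"heat": [3.1, -2.4, 0.2, 1.8],
     "drought": [0.3, 0.1, 0.4, 1.6],
     "heat_drought": [2.7, -2.1, 2.2, 3.5]},
    index=["gene1", "gene2", "gene3", "gene4"])

def response_code(fc, threshold=1.0):
    """U: up-regulated, D: down-regulated, N: |log2(FC)| <= threshold."""
    if fc > threshold:
        return "U"
    if fc < -threshold:
        return "D"
    return "N"

groups = log2fc.applymap(response_code).apply(lambda row: "".join(row), axis=1)
print(groups)  # gene1: UNU, gene2: DND, gene3: NNU, gene4: UUU (a rare group)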
Most of the genes in the independent and antagonistic response categories were responsive (up- or down-regulated) to heat, rather than drought stress. The dominance of the heat response could be due to: (1) the mild nature of the drought stress (Prasch and Sonnewald 2013), (2) an overriding influence of heat stress, as heat response also dominates over salt stress (Rasmussen et al. 2013), or (3) the fact that the expression data are derived from leaf tissue, where osmotic stress has a lesser effect compared to root (Shen et al. 2017).

To determine if genes in a response group shared distinct biological functions, we tested for the enrichment of genes with different Gene Ontology (GO) terms in each response group compared to non-stress-responsive genes and genes in other response groups. Overall, we found that functional overlap existed between independent and synergistic, but not antagonistic, response group genes (Figure 3.2C, Supplemental Table 3.2). For example, both independent (UNU) and synergistic (NNU), but not antagonistic (UNN and NUN), response groups were enriched for heat and reactive oxygen species response. Similarly, for the down-regulation response groups, the independent (DND) and synergistic (NND) response groups were enriched primarily for photosynthesis-related GO terms (Mathur et al. 2014), while the antagonistic response group genes (DNN) were enriched for pollen development. This highlighted that genes in the same functional category are not necessarily co-regulated and that antagonistic genes are not only differently regulated but also perform different biological functions.

Figure 3.2. Gene expression response groups for single and combined heat and drought stress. (A) Gene expression response groups included in the study, where the three-letter response codes signify up-regulation (U), down-regulation (D), and no significant change in expression (N), ordered based on response to heat, drought, and both stresses. The number below the response group name is the number of genes in that response group that have non-overlapping promoters (1 kb upstream of the TSS) with neighboring genes. Colored bars designate whether genes in the response group are considered to have antagonistic (yellow), independent (orange), or synergistic (purple) responses to combined stress. (B) The log2 fold change (FC) in expression under heat (H), drought (D), and H+D compared to control for each gene (X-axis), sorted by response group. Genes with |log2(FC)| ≤ 1 are colored white (N). (C) Select Gene Ontology (GO) categories that were enriched for genes belonging to the different response groups compared to all other genes. GO categories with a large positive log2(q-value) (red) are over-represented, while those with a large negative log2(q-value) (blue) are under-represented in that response group.

In addition to having functional overlap, genes in the up-regulation independent and synergistic response groups were enriched for heat or water response functions (UNU & NNU, Figure 3.2C), while the antagonistic response group genes were enriched for non-canonical abiotic stress response categories, including brassinosteroid biosynthesis processes and innate immune response (UNN & NUN, Figure 3.2C). Brassinosteroids, for example, are best known as cell-division and developmental regulators, but have also been implicated in heat tolerance in Brassica juncea (Kumar et al. 2010).
Because an antagonistic response to combined stress means the response to the single stress was somehow counteracted, this suggests that the non-canonical functions are more tightly regulated than the functions enriched in independent response group genes. This could be because these responses are detrimental in the presence of drought stress. For example, up-regulation of innate immune response (Huot et al. 2014) in NUN genes and down-regulation of pollen development genes (De Storme and Geelen 2014) in DNN genes could be tightly regulated because aberrant responses could unnecessarily and negatively impact fitness.

In summary, we found that ~55% of genes responsive to at least one stress showed either antagonistic or synergistic responses to combined heat and drought stress. Because these non-independent responses to combined stress were so prevalent, we hypothesized that a unique regulatory code must exist that is able to fine-tune transcriptional response under combined heat and drought stress. We also found that genes in synergistic response groups overlapped functionally with genes in independent response groups, highlighting that genes with similar biological functions are not necessarily co-regulated.

3.3.2 Combinatorial stress response patterns can be predicted using known and putative regulatory elements

Because the TFs and binding sites regulating combinatorial stress response are yet to be identified, we set out to identify responsible TFs by taking advantage of available in vitro TF binding region and motif (known TFBM) data from the DAP-seq (O'Malley et al. 2016) and CIS-BP (Weirauch et al. 2014) databases for 344 TFs. First, 197 of the 344 known TFBMs were identified as enriched in the promoter region of at least one set of response group genes (referred to as enriched TFBMs, see Methods). On average, response groups were enriched for 35 known TFBMs (range: 0-87) from 27 TF families (referred to as enriched families, Supplemental Table 3.1). In parallel, to identify regulatory sequences not covered by known TFBMs, we searched for putative cis-regulatory elements (pCREs) by identifying k-mers enriched in the promoter regions of genes in each response group compared to genes not responsive to stress (see Methods). Response groups were enriched for 68 pCREs on average (range: 7-158). These pCREs were similar to TFBMs from 22 of the 27 enriched families. This similarity was defined in two different ways: (1) across all response groups, 13% of pCREs were significantly more similar to 36 of the 197 enriched known TFBMs than to other TFBMs from the same TF family (i.e. similar to a TFBM) (Supplemental Table 3.3, see Methods), and (2) an additional 66% of pCREs were significantly more similar to an enriched TFBM than to TFBMs from other families (i.e. similar to a TF family). The remaining 21% of pCREs were most similar to TFBMs from TF families not significantly enriched in response group genes. Thus, the iterative k-mer finding approach based on co-expression recovered additional regulatory information not captured by the in vitro TFBM data.
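A single round of this k-mer enrichment test can be sketched as follows; the actual pipeline was iterative and operated on genomic coordinates, whereas here promoters are plain strings and the gene sets are hypothetical.

from itertools import product
from scipy.stats import fisher_exact

def enriched_kmers(group_promoters, background_promoters, k=6, alpha=0.01):
    """Test every k-mer for over-representation among response group
    promoters relative to non-responsive (background) promoters."""
    results = []
    for kmer in ("".join(p) for p in product("ACGT", repeat=k)):
        in_group = sum(kmer in prom for prom in group_promoters)
        in_bg = sum(kmer in prom for prom in background_promoters)
        table = [[in_group, len(group_promoters) - in_group],
                 [in_bg, len(background_promoters) - in_bg]]
        _, pval = fisher_exact(table, alternative="greater")
        if pval < alpha:
            results.append((kmer, pval))
    return results

# Tiny hypothetical example: every response group promoter contains
# GCCACG, while no background promoter does.
group = ["AAGCCACGTT", "TTGCCACGAA", "CCGCCACGTG", "GAGCCACGAT", "TGGCCACGCA"]
background = ["ATATATATAT", "GGGTTTCCCA", "TTTTAAAACC", "CAGTCAGTCA", "GGCCTTAAGG"]
print(enriched_kmers(group, background))  # [('GCCACG', ~0.004)]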
To determine the extent to which known TFBMs and co-expression-based pCREs can explain combined stress response patterns (i.e. how much of the cis-regulatory code we have captured), we used the presence or absence of these TFBM and pCRE sites as features (i.e. independent variables) in machine learning models to classify genes as belonging to a response group or as non-responsive under any stress condition (i.e. the dependent variable). Because machine learning models need to learn from sufficient training data, we only used response groups with >20 genes. Model performance was measured by calculating the F-measure (F1) on a set of data held out from model training, where F1=1 would be a perfect classification and F1=0.5 would be no better than random guessing given our approach (see Methods). Although all models performed better than random guessing (Figure 3.3A), models built using pCREs (median F1=0.64) significantly outperformed those built using known, enriched TFBMs (median F1=0.58) (paired t-test, p=3.7x10-4). When we used all known TFBMs (i.e. both response group enriched and non-enriched), model performance decreased further (median F1=0.54). These findings support the notion that pCREs contain additional regulatory information not captured by the TFBM data. This is not to say that pCREs can completely replace TFBM data, because models built using the enriched TFBMs and models built using pCREs were able to correctly classify different subsets of genes (Supplemental Figure 3.1). However, including both types of elements as features did not improve model performance compared to using only pCREs (median F1=0.64; paired t-test, p=0.51). Across the response groups, the combined models also classified genes more similarly to the pCRE-based than to the enriched TFBM-based models (Supplemental Figure 3.1).

Figure 3.3. Cis-regulatory code models based on known TFBMs and pCREs. (A) Predictive performance (F1) of Random Forest machine learning models using known TFBMs (yellow), pCREs (teal), or both (rose) as input features for predicting response group vs. non-responsive genes. (B) Percent of pCREs identified for each response group that are significantly similar to pCREs in other response groups, based on a threshold that is the 95th percentile of the PCC distribution between 100 sets of 25 randomly generated 6-mers (numbers in cells). Darker red indicates that the pCREs identified independently for each of the two response groups shared higher sequence similarity. (C) Select example pCREs (first column) and a motif logo of the known TFBM each pCRE is most similar to. The similarity scores (Pearson's Correlation Coefficient; PCC) between each pCRE and its best match are shown. The pCREs are sorted from most to least commonly enriched across response groups, where the boxes indicate the response groups in which the pCRE was enriched (gray) or not enriched (white).

Next, we quantified the degree of overlap between pCREs identified for different response groups to assess how the cis-regulatory programs differ between different patterns of response to single and combined stress. Two pCREs were considered overlapping if they shared greater sequence similarity with each other than with 95% of random 6-mers (Figure 3.3B). Using this approach, the pCRE overlap ranged from 0 to 35% between response groups, with response groups that share the same direction of response (e.g. NNU and UNU) tending to have a higher degree of overlap (R2=0.41, p<1x10-4). Interestingly, of the pCREs enriched in the most response groups, the top three, GCCACGT, ACGTGG, and AAAATAT (stars, Figure 3.3C), were significantly similar to TFBMs associated with the circadian clock TFs bZIP16, PIF7, and RVE8, respectively (Hsieh et al. 2012; James et al. 2012).
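Sequence similarity between two elements can be scored as sketched below: each motif is converted to a one-hot matrix and the Pearson's correlation coefficient is taken over ungapped alignments. This is a simplified stand-in for the motif alignment procedure used in the study.

import numpy as np

BASES = "ACGT"

def one_hot_matrix(kmer):
    """4 x k matrix with a 1 for the base observed at each position."""
    m = np.zeros((4, len(kmer)))
    for pos, base in enumerate(kmer):
        m[BASES.index(base), pos] = 1.0
    return m

def motif_pcc(kmer_a, kmer_b):
    """Best PCC between two motifs across all ungapped alignments of
    the shorter motif along the longer one."""
    if len(kmer_a) < len(kmer_b):
        kmer_a, kmer_b = kmer_b, kmer_a
    long_m, short_m = one_hot_matrix(kmer_a), one_hot_matrix(kmer_b)
    best = -1.0
    for offset in range(long_m.shape[1] - short_m.shape[1] + 1):
        window = long_m[:, offset:offset + short_m.shape[1]]
        best = max(best, np.corrcoef(window.ravel(), short_m.ravel())[0, 1])
    return best

print(motif_pcc("GCCACGT", "CCACGT"))   # shared core: high PCC
print(motif_pcc("GCCACGT", "AAAATAT"))  # dissimilar: low PCC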
PIF7 has been shown to negatively regulate DREB1 as a means of avoiding the growth-hindering accumulation of DREB1 when the plant is not under stress (Kidokoro et al. 2009). Our findings further confirm earlier studies showing that stress response regulation has a significant circadian clock component (Liu et al. 2013). Nonetheless, 65 to 100% of pCREs differed between any two response groups (Figure 3.3B). This supported the notion that substantially distinct regulatory mechanisms are involved in different patterns of response to combined stress.

In summary, the iterative k-mer finding approach identified pCREs that, when used as predictive features, were better able to classify genes by their response groups than known enriched TFBMs. Over 20% of pCREs were not similar to known enriched TFBMs, indicating that pCREs contain novel regulatory information. In addition, the majority of pCREs did not show significant sequence similarity with pCREs from other response groups, suggesting substantial regulatory differences. Finally, while we were able to classify genes by their response group well above random expectation, with a median F1=0.64 there was still ample room for model improvement. Thus, we next explored three strategies to improve predictions of response to single and combined stress: (1) considering interactions between pCREs, (2) integrating dynamic multi-omics data, and (3) including pCREs located outside the proximal promoter.

Because TFs frequently work in concert to regulate gene expression (Harbison et al. 2004; Farnham 2009), we first incorporated interactions between TFs into our models by identifying interactions between pCREs for each response group using two statistical approaches: association rules (aRules) and iterative Random Forest (iRF). However, the pCRE pairs identified did not improve model performance when used as features alone or together with pCREs (Supplemental Figure 3.2 and Supplemental Info), unlike in high salinity stress (Uygun et al. 2017).

3.3.3 Additional multi-omics regulatory information can improve cis-regulatory code models

To account for additional levels of regulation involved in response to single and combined heat and drought stress, we next explored adding additional information to our cis-regulatory code models. We included information about chromatin accessibility (DNase I Hypersensitive Sites: DHS) (Sullivan et al. 2015; Liu et al. 2018) and eight histone marks (Pfluger and Wagner 2007; Dong and Weng 2013; Stroud et al. 2014) because both can impact the ability of a TF to bind. In addition, because regulatory elements can experience selective pressure, information about sequence conservation across the Brassicaceae family (Conserved Noncoding Sequences: CNS) (Haudry et al. 2013) was included, as it could be informative for identifying pCREs (Guo et al. 2003; Haberer et al. 2006). Finally, because in vitro TF binding regions have been identified in A. thaliana, as described above (O'Malley et al. 2016), we included these data as well, as they may also improve our ability to identify pCREs. These data are collectively referred to as "additional regulatory information". To determine if this additional regulatory information would improve our understanding of the cis-regulatory code of combined stress response patterns, we next tested whether adding these data to our machine learning models would improve their performance.
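As an illustration of how these data can be organized for modeling, the hypothetical table below pairs each pCRE's presence/absence with overlap indicators for additional regulatory information. For Random Forest, the table is flattened into independent columns, whereas the CNN approach described next preserves the pCRE-by-feature-type structure.

import pandas as pd

# Hypothetical input for three genes and two pCREs: presence/absence of
# each pCRE plus whether its site overlaps additional regulatory
# information (e.g. a DHS or an H3K9ac peak).
columns = pd.MultiIndex.from_product(
    [["pCRE_A", "pCRE_B"], ["presence", "DHS_overlap", "H3K9ac_overlap"]])
features = pd.DataFrame(
    [[1, 1, 1, 0, 0, 0],
     [0, 0, 0, 1, 1, 0],
     [1, 0, 1, 1, 0, 1]],
    index=["gene1", "gene2", "gene3"], columns=columns)

# Flattened view for Random Forest: every pCRE x feature type
# combination becomes its own independent column.
flat = features.copy()
flat.columns = ["_".join(pair) for pair in features.columns]
print(flat)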
While models utilizing this additional regulatory information improved the average performance for a few response groups (i.e. NNU, DNN), overall they did not perform significantly better than pCRE-only models (median F1=0.66; paired t-test, p=0.062) (olive; Figure 3.4A). One possible reason for this lack of improvement could be that our machine learning algorithm (i.e. Random Forest) was not adequately integrating the additional regulatory information into the models. For example, Random Forest treats all input features, such as pCREs and histone marks, as independent when they may not be. To address this limitation, we applied a deep learning approach, the convolutional neural network (CNN). CNNs are frequently used in image classification because, when given training data (e.g. many photographs of cats), they are able to learn local patterns (e.g. triangles that resemble cat ears) and associate those patterns with what is being predicted (e.g. whether there is a cat in the photograph). We hypothesized that we could train CNN models to look for patterns in the additional regulatory information available for each pCRE and to then associate those patterns with a response group (Figure 3.4B; see Methods). Using this approach, our ability to predict response groups increased (median F1=0.68) compared to the pCRE-only models (paired t-test, p=0.002), with the largest improvements for the UNU, DNN, DND, and NNU response groups (where F1 increased by 0.069, 0.055, 0.050, and 0.046, respectively) (rose; Figure 3.4A).

3.3.4 Interpreting deep learning models provides insight into the cis-regulatory code

To understand what combinations of additional regulatory information were important for the ability of our CNN models to classify genes by their response group, we visualized and measured the importance of the trained kernels. During the process of model training, each kernel learns a particular "pattern", i.e. how much value, or weight, should be given to each type of feature (i.e. presence/absence and additional regulatory information) to best predict whether a gene belongs to a response group. For example, in Figure 3.4B, kernel #1 (k1) learned to look for pCREs that were present and that overlapped with a DAP site and with histone marks H1 and H7 (positive kernel weights), but not H4 or H6 (negative kernel weights) (marks named for illustrative purposes only). Each trained kernel then scans across the input data and generates an output value for each pCRE based on how well it matches the kernel's pattern. For example, when k1 was used to scan from pCRE-A down to pCRE-X, it output a large (i.e. dark) value for pCREs that matched its pattern (e.g. pCRE-A) and a small value for pCREs that did not (e.g. pCRE-D). To assess which types of features were most important (i.e. highest weighted) among kernels from the CNN models for each response group, we extracted the weights (i.e. a list of 12 weights) for each kernel in each replicate, clustered the kernels into groups with similar patterns of weights, and calculated the median weight assigned to pCRE presence/absence and to each type of additional regulatory information for each cluster (Figure 3.4C, S4; see Methods). To measure the overall importance of each kernel, we calculated the change in model performance on the test data (i.e. data not used for training) when that kernel was zeroed out (i.e. all weights set to zero; see Methods). We then reported the median kernel importance for each kernel cluster (Figure 3.4C, S4).
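The kernel-zeroing procedure can be sketched as follows; the small, untrained network is only a stand-in for the trained CNNs, in which each kernel holds one weight per feature type (i.e. 12 weights) and scans across pCREs one at a time.

import torch
import torch.nn as nn
from sklearn.metrics import roc_auc_score

n_pcres, n_feature_types, n_kernels = 20, 12, 8
torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv1d(n_feature_types, n_kernels, kernel_size=1),
    nn.ReLU(),
    nn.AdaptiveMaxPool1d(1),
    nn.Flatten(),
    nn.Linear(n_kernels, 1),
    nn.Sigmoid(),
)

# Held-out (test) genes: a (genes x feature types x pCREs) tensor.
X_test = torch.rand(100, n_feature_types, n_pcres)
y_test = torch.randint(2, (100,)).numpy()

def test_auc():
    with torch.no_grad():
        return roc_auc_score(y_test, model(X_test).squeeze(1).numpy())

baseline = test_auc()
conv = model[0]
for k in range(n_kernels):
    saved_w, saved_b = conv.weight.data[k].clone(), conv.bias.data[k].clone()
    conv.weight.data[k] = 0.0  # zero out kernel k...
    conv.bias.data[k] = 0.0    # ...including its bias
    print(f"kernel {k}: AUC-ROC drop = {baseline - test_auc():+.4f}")
    conv.weight.data[k] = saved_w  # restore before testing the next kernel
    conv.bias.data[k] = saved_b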
For example, when a kernel in the first kernel cluster for DNN was set to zero, model performance (measured using the area under the receiver operator characteristic curve; see Methods) dropped by >0.005. Note that the performance decreases are all very small, indicating the models were robust to perturbation, likely because more than one kernel learned each important pattern. Overall, the presence or absence of the pCREs (P/A) had the highest median weights (leftmost column; Figure 3.4C). Of the additional regulatory information, DAP, H3K9ac, and DHS had the next highest kernel weights, suggesting known TF binding, the acetylation of lysine 9 on histone H3 (a hallmark of active promoters (Karmodiya et al. 2012)), and chromatin accessibility were consistently useful features for predicting response to single and combined stress.

Figure 3.4. Cis-regulatory code models based on pCREs and additional multi-omics regulatory information. (A) Predictive performance (F-measures (F1)) of Random Forest models using pCREs (teal, same as in Figure 3.3A) and pCREs + additional regulatory information (olive), and of Convolutional Neural Network (CNN) models using pCREs + additional regulatory information (rose). The larger error around NUN models is due to the small number of NUN genes available for model training. (B) An illustration of the internal workings of the CNN models and how the trained kernels (i.e. pattern identifiers) in those models were used to understand the patterns of additional regulatory information the models were trained to identify. (C) Summary of results from interpreting the trained CNN models. The feature types (i.e. presence/absence (P/A) and additional regulatory information) were sorted based on the average kernel weights across all kernels trained for all response groups and replicates (first column). The remaining columns represent kernel clusters for specific response groups. For each response group, all trained kernels from all CNN replicates were clustered using hierarchical clustering with dynamic cutting (min cluster size=250 kernels). The median kernel weights and kernel importance scores are shown here for the two clusters with the highest median kernel importance for each response group. Large kernel weights (dark purple) indicate the presence of that pCRE or its overlap with the additional regulatory information was an indicator of belonging to the response group rather than NNN.

Additional regulatory information was weighted differently in important kernel clusters for different response groups (second column and on; Figure 3.4C). This was especially true of histone mark features. For example, H3K27me3 tended to be negatively weighted in important kernel clusters for up-regulation response groups (e.g. UNN, NUN, NNU) but neutrally or positively weighted in important kernel clusters for down-regulation response groups (e.g. DNN, DND). Together with the fact that H3K27me3 is known to be associated with gene silencing (Luo and Lam 2010), this finding supports the idea that lysine 27 trimethylation is involved in regulating response to single and combined heat and drought stress. However, we also found that H3K4me3 had a large positive weight in the most important DNN kernel cluster. This was unexpected given that H3K4me3 is associated with active promoters (Luo and Lam 2010) and suggests that the role of lysine 4 trimethylation in regulating single and combined heat and drought stress response may be complicated.
In summary, we found that the integration of additional multi-omics regulatory information into our models of the cis-regulatory code using CNNs improved our ability to classify genes by their pattern of response to single and combined stress. While some information (e.g. TF binding, H3K9ac) was important for all response groups, other information (e.g. H3K4me3, H3K27me3) was differentially important across the response groups. The usefulness of these data was especially surprising given some of their limitations. For example, most of the data were generated either in vitro (e.g. DAP) or under growth conditions that do not match those of the transcriptome data used for this study (e.g. DHS).

3.3.5 pCREs identified outside the promoter region are predictive of response patterns

The models discussed thus far were based on features located in the proximal promoter regions that typically house regulatory sequences in plants (Yu et al. 2016). However, plant regulatory sequences can also be located in the 5' untranslated region (5' UTR) (Tompa 2001), first intron (Int1) (Zhang and Duff 1994), 3' UTR (Wasserman et al. 2000), and downstream of the transcriptional stop site (DS500). To assess the extent to which pCREs outside of the promoter regions were predictive of combined stress response patterns, the iterative k-mer finding approach was repeated in the 5' UTR, Int1, 3' UTR, and DS500. Then, predictive models were built using pCREs from each region either individually or in combination as features. Because sequence information was not available for all five regions for all genes (particularly the 5' and 3' UTRs), we had to remove between 47 and 587 genes from each response group to make our models comparable. Importantly, this means that the performance results from our earlier machine learning models are not directly comparable. To establish a direct comparison, we also reran the iterative k-mer finding and modeling on the promoter region using the smaller subsets of genes. Models built using pCREs located in promoter or, surprisingly, DS500 regions outperformed models built with pCREs from other regions (Tukey test; Figure 3.5A). DS500 pCREs substantially outperformed promoter pCREs for the NUN response group in terms of F1 (+0.06, Figure 3.5A), correctly classifying two more genes and reducing the false positives by 14 (Supplemental Figure 3.5). Interestingly, the most predictive DS500 pCRE, ACTTTG, shares significant sequence similarity (PCC=0.92) with the known TFBM for WRKY46, which has known roles in drought response. This pCRE was not enriched in the promoter region, emphasizing the potential importance of the DS500 region for cis-regulation. Although the 5'UTR and 3'UTR pCREs did not perform as well as those in promoters and DS500s, they were significantly better than random expectation (t-test: p=0.02 and 0.006, respectively); however, Int1 pCREs were not significantly different from random (p=0.75). Because models built using pCREs from different regions were able to correctly classify different subsets of genes (Supplemental Figure 3.5), we used pCREs from all regions as features, and the resulting models (the ALL column, Figure 3.5A) outperformed all single region-based models, suggesting that pCREs located beyond the promoter region are important for regulating combined stress response.
Figure 3.5. Cis-regulatory code models based on pCREs identified in putative promoter and non-promoter regions. (A) Predictive performance (F1) from Random Forest models using pCREs found in the promoter, 5' UTR, first intron (1st Int), 3' UTR, downstream region (DS500), or all regions (All) as input features. The box color represents the F1 scores normalized by the F1s of each response group (the darkest blue represents the best set of input features for each response group), with the actual F1 provided in each box. The boxplots show the distribution of F1 scores for each region (below) and for each response group (right). Letters on top of the boxplots signify significant differences by region based on the Tukey test (p<0.05). (B) Average sequence similarity (PCC) between pCREs identified from different regions (axes) for each response group (colors). The dashed line indicates the random average similarity obtained by calculating PCC for every possible pairing of 100 sets of 25 randomly selected 6-mers.

To determine if the pCREs identified from different genetic regions were unique to that region or found across regions, we identified the best matches between pCREs within and between different regions based on sequence similarity (Pearson's correlation coefficient, PCC; see Methods). The pCREs from most regions were more similar to each other (average PCC=0.73) than would be expected by random chance (dashed line, PCC95th=0.57; Figure 3.5B). The DS500 pCREs tended to be the most similar to pCREs from other regions, especially those from the 5'UTR, 1st intron, and 3' UTR. Interestingly, the promoter pCREs tended to be the least similar to pCREs from other regions. That said, the pCREs most similar to the promoter pCREs were enriched in the DS500 (average PCC=0.73), indicating similar cis-regulatory mechanisms govern transcriptional regulation upstream and downstream of genes. In addition, pCREs from down-regulation response groups (blue color series, Figure 3.5B) tended to be more similar between regions than up-regulation response group pCREs (red color series, Figure 3.5B). This suggests that regulatory elements involved in down-regulating genes are either less region specific or more likely to be located in multiple regions around the gene. This is in sharp contrast to NNU pCREs, which were the only pCREs that were less likely to be similar (average PCC=0.45) than random chance, suggesting the regulatory circuitry for synergistic up-regulation is specific to the promoter region.
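To illustrate the kind of motif similarity scoring used here, the sketch below computes a Pearson correlation between the one-hot base matrices of two 6-mers. For simplicity it scores only the ungapped, full-length alignment, which is a simplification of the alignment-based procedure of Uygun et al. (2017) applied in this study.

```python
# A minimal sketch of motif similarity as a Pearson correlation between
# one-hot base matrices of two 6-mers; only the ungapped, full-length
# alignment is scored, a simplification of the published procedure.
import numpy as np

def one_hot(kmer: str) -> np.ndarray:
    m = np.zeros((len(kmer), 4))
    for i, base in enumerate(kmer):
        m[i, "ACGT".index(base)] = 1.0
    return m

def pcc(a: str, b: str) -> float:
    x, y = one_hot(a).ravel(), one_hot(b).ravel()
    return float(np.corrcoef(x, y)[0, 1])

print(pcc("ACTTTG", "ACTTTG"))  # identical k-mers -> 1.0
print(pcc("ACTTTG", "ACTTTC"))  # one mismatch -> high, but < 1.0
print(pcc("ACTTTG", "GAGACA"))  # dissimilar k-mers -> low
```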
In summary, incorporating pCREs identified outside of the proximal promoter region improved our ability to predict response to single and combined heat and drought stress. Of the five regions assessed, the DS500 pCREs performed marginally better than promoter pCREs for two of the seven response groups. Taken together, this suggests that while most of the pertinent regulatory information is in the promoter region, additional regulatory information important for response to single and combined heat and drought stress may be located outside the promoter region.

3.3.6 The cis-regulatory code of response to single and combined heat and drought stress

We have demonstrated that adding multi-omics data and expanding our search for putative regulatory elements beyond the promoter region improved our cis-regulatory code models. While these models are still not perfect, they perform well above random expectation and therefore can be used to illuminate the cis-regulatory code of response to single and combined heat and drought stress in A. thaliana. To this end, here we further characterize a subset of the most important promoter-located (from CNN models) and non-promoter-located (from Random Forest models) pCREs identified for each of the seven response groups. The most important promoter-located pCREs from the CNN models were those that, when set to zero, caused the largest decrease in model performance (see Methods). The most important pCREs from the Random Forest models were those that, when used at a node in a decision tree, best separated genes by their response group (see Methods). The importance scores of pCREs based on these two approaches are in Supplemental Tables 3.5 and 3.6. To characterize the most important promoter pCREs using the additional levels of regulatory information included in the study, we determined how much more frequently the sites of each promoter pCRE overlapped with each type of additional regulatory information in response group genes than expected at random, using a set of 1,000 random 6-mers (Supplemental Table 3.5). Focusing on the top five most important pCREs from each response group, we found that these pCREs could be classified into three groups based on the degree of overlap between their sites and the additional regulatory information (Figure 3.6A). Group 1 pCREs were unique in that, in addition to DAP and DHS, they were also much more likely to overlap with CNS than random 6-mers (dashed boxes; Figure 3.6A), suggesting these pCREs are more highly conserved across the Brassicaceae. Group 2 pCREs also frequently overlapped with DAP and DHS regions, although to a lesser extent. Group 2 pCREs were also less likely to overlap histone marks associated with active transcription (e.g. H3K23ac, H3K4me1), which was interesting given that the important pCREs identified for the down-regulation response groups (i.e. DNN, DND, NND) tended to be in Group 2 (8 vs. 4 and 3 in Groups 1 and 3).

Figure 3.6. Overview of the most important pCREs for our cis-regulatory code models. (A) The top five most important promoter pCREs from CNN models clustered using k-means clustering (k=3) into three groups based on the pattern of overlap between their sites and additional regulatory information, and sorted using hierarchical clustering. The overlap percentile refers to how frequently a pCRE overlaps with each type of additional regulatory information in the promoters of response group genes compared to 1,000 random 6-mers, with values in darker red signifying higher degrees of overlap compared to the random background. The importance score is the median decrease in model performance on the test set when a pCRE and its associated additional regulatory information are removed from the CNN model (i.e. a larger decrease in performance means a larger importance). (B) The TF name, motif logo, and sequence similarity score (Pearson's Correlation Coefficient; PCC) for the known TFBMs that best match the top two promoter pCREs from the CNN model (left two columns) and the top two non-promoter pCREs from the RF model using pCREs from all five gene regions (see purple in Figure 3.5A) (right two columns) for each response group.
Finally, Group 3 pCREs were less likely to overlap with DAP regions than random 6-mers, suggesting these pCREs may be bound by TFs not yet included in in vitro binding databases.

We next characterized promoter and non-promoter pCREs by determining which were similar to known CREs and which represented putative novel CREs (see Methods). Overall, 40.5% of promoter pCREs and 37.6% of pCREs from other regions were significantly similar to a specific known TFBM (i.e. sequence similarity (PCC) > 95th percentile of PCCs between TFs in the same family) (Supplemental Tables 3.5, 3.6). Focusing on the two most important promoter and non-promoter pCREs for each response group (Figure 3.6B), we found many different TFs and TF families represented. The most important promoter and non-promoter pCRE for the DND models, AAATAT, is identical to the TFBM of a MYB-related TF, REVEILLE8 (RVE8) (Figure 3.6B), which has been proposed to be involved in a negative feedback loop regulating the circadian clock's response to temperature (James et al. 2012). The most important non-promoter pCRE for the NUN model, ACTTTG, is similar to TFBMs in the WRKY TF family (PCC to WRKY46 = 0.92), which are known to be involved in osmotic and salt stress response (Ding et al. 2015). The most important promoter pCRE for the NND models, TGTCGA, is similar to TFBMs in the AP2 TF family (PCC to DDF2 = 0.88), which are known to be involved in heat, cold, and drought tolerance in A. thaliana (Kang et al. 2011). Taken together, these three examples highlight that, by modeling and interpreting the cis-regulatory code, we were able to find pCREs similar to known TFBMs for TFs known to be involved in heat, drought, and combined heat and drought stress, respectively. Interestingly, the most important pCREs for the NNU response group are not similar to TFBMs for TFs known to be involved in either heat or drought stress. For example, the most important promoter pCRE, GAAAAC, is identical to the TFBM for the G2-like γMYB2 TF, which has no known association with stress response. The second most important promoter pCRE, CACGTG, is identical to the TFBM for bHLH104, which, while known to be involved in regulating iron homeostasis in A. thaliana (Li et al. 2016), is not associated with other stresses. Similarly, the most important non-promoter pCRE for NNU, AGATTC, is identical to the TFBM for AT1G49560, a G2-like family TF possibly involved in regulating flowering time. This highlights how much more work needs to be done to understand the regulation of combined heat and drought stress, with these pCREs and their associated TFs representing prime candidates for further characterization.

In summary, we found that important promoter pCREs belong to three groups that differ in how frequently the pCREs were associated with additional regulatory information. We also found that while some of the most important pCREs found by our cis-regulatory code models were similar to known TFBMs bound by TFs involved in heat and/or drought stress response, others (i.e. those enriched in NNU genes) were similar to TFBMs for TFs with no established association to either stress condition. Taken together, these findings highlight the complexity of the cis-regulatory code of response to single and combined heat and drought stress in A. thaliana and the need for further study.
3.4 Conclusions

Understanding how plants regulate their response to combined heat and drought stress is of great importance because of the frequency with which these stresses co-occur and the severity of their impact on our agricultural sector when they do (Rizhsky et al. 2004). Here we developed models of the cis-regulatory code regulating response to single and combined heat and drought stress in A. thaliana. We assessed the strength of our models by determining how well they classify genes not used for training as belonging to a response group (e.g. NNU, NND) or to the non-responsive group (i.e. NNN). We found that incorporating pCREs identified outside of the proximal promoter region and additional multi-omics regulatory information (i.e. chromatin accessibility, sequence conservation, known TF binding, and histone marks) into our models improved their performance. We also explored the use of a deep learning approach, the CNN, and demonstrated that it performed better than the classic machine learning algorithm used in this study, Random Forest. Furthermore, by interpreting our cis-regulatory code models, we were able to provide novel biological insights, including identifying which pCREs and which types of additional regulatory information were most important for predicting response to single and combined heat and drought stress.

Because our cis-regulatory code models are not able to perfectly predict a gene's response group, there is still more to learn about the complexities of the regulation of response to single and combined stress. One factor limiting our ability to model the cis-regulatory code is that genes in a response group are not all regulated by the same mechanisms. This issue is compounded by the fact that samples were gathered only at a single time point, a few days after the stress conditions were applied. Thus, we have only a snapshot and do not have information on whether the stress-responsive genes began to respond immediately after stress initiation or later, after the plants began to acclimate. A second limiting factor is that we are missing critical information about the rate of mRNA degradation. Because our picture of differential gene expression comes from measuring and comparing steady-state mRNA levels, we cannot determine if a change in gene expression is due to, for example, an increase in production or a decrease in degradation. Finally, while incorporating chromatin accessibility and epigenetic mark data into our models of the cis-regulatory code improved their performance, these data were not ideally suited for this study because they were generated from plants at different developmental stages and under different conditions than the plants used to generate the transcriptomic data used in this study (Prasch and Sonnewald 2013). We consider this a third limitation because both chromatin accessibility and epigenetic marks are dynamic, meaning they change over the course of development and in response to environmental conditions (Sullivan et al. 2014; King 2015).

There are numerous mechanistic possibilities for how a gene can regulate its response to combined stress. For example, a gene with a synergistic response could require the binding of two TFs in order to be expressed, each of which is only up-regulated or activated by one of the individual stresses. Alternatively, synergistic responses could be regulated by novel TFs that only bind or are only activated under the combined stress scenario.
In our study we found that some pCREs were unique to individual response groups. For example, AGCTTC, which perfectly matches the TFBM identified for HSFB2H, was only enriched in genes with a synergistic up-regulation response to combined stress (NNU). However, other pCREs, especially those containing G-box motifs, were found for multiple response groups, with pCREs from independent and synergistic response groups having the most overlap. By studying the pCREs identified for different response patterns, we can begin to understand at a global level how response to combined stress is regulated.

As high-throughput omics technologies continue to become more affordable and widely used, techniques to integrate across multiple types of omics data will become increasingly necessary. It is also critical that these techniques are interpretable so that we are able to derive from them insights into complex biological systems, such as the regulation of gene expression. Here we trained classic machine learning and deep learning models of the cis-regulatory code regulating response to single and combined heat and drought stress. We then used in silico model interpretation strategies to identify known actors in response to heat and/or drought stress, in addition to putative novel actors that are prime targets for further characterization. In the future, this approach could be used to study the regulation of other developmental and stress-induced responses in plants and other organisms.

3.5 Methods

3.5.1 Expression data processing, response group classification, and functional category enrichment analysis

Expression data for response to mild heat (32ºC day/28ºC night for 3 days), mild drought (30% field capacity), and combined heat and drought stress in A. thaliana were downloaded from the NCBI Gene Expression Omnibus (GEO) (GSE46760) as normalized signal intensity values (Prasch and Sonnewald 2013). The expression data were generated using the Agilent platform, and probe data were converted into TAIR10 gene identifiers using IDswop from the "agilp" package in the R environment (Chain 2012). If multiple probes were present for the same gene, the mean of the probe intensities was used, unless the intensities were >20% different, in which case the gene was excluded. Differential expression fold-changes and associated false discovery rate (FDR) adjusted p-values (i.e. q-values) (Benjamini and Hochberg 1995) between each stress condition and the control condition were calculated using limma (Ritchie et al. 2015) in the R environment. Genes were classified as significantly up-regulated (U) if their log2 fold-change was ≥ 1.0 with q ≤ 0.05, down-regulated (D) if their log2 fold-change was ≤ -1.0 with q ≤ 0.05, or non-responsive (N) otherwise. Genes were clustered into "response groups" using a convention established by Rasmussen et al. (2013). Briefly, each gene was defined by its pattern of U, D, or N under heat, drought, and combined stress conditions. For example, a gene that is U under heat, D under drought, and N under combined stress was classified as belonging to the UDN response group.
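The response-group labeling logic described above can be summarized in a few lines. The sketch below assumes a table of per-condition log2 fold-changes and q-values from limma; the column names, gene names, and input values are illustrative.

```python
# A minimal sketch of the response-group labeling described above, assuming
# a table of limma log2 fold-changes and q-values per condition. Column
# names, gene names, and values are illustrative.
import pandas as pd

def call(log2fc: float, q: float) -> str:
    """Classify one gene under one condition as U, D, or N."""
    if q <= 0.05 and log2fc >= 1.0:
        return "U"
    if q <= 0.05 and log2fc <= -1.0:
        return "D"
    return "N"

df = pd.DataFrame({
    "heat_lfc": [1.4, -0.2, 0.1], "heat_q": [0.01, 0.30, 0.90],
    "drought_lfc": [-1.8, 0.3, 0.0], "drought_q": [0.002, 0.40, 0.80],
    "combined_lfc": [0.2, 2.1, -0.1], "combined_q": [0.60, 0.004, 0.95],
}, index=["gene1", "gene2", "gene3"])

df["response_group"] = [
    call(r.heat_lfc, r.heat_q)
    + call(r.drought_lfc, r.drought_q)
    + call(r.combined_lfc, r.combined_q)
    for r in df.itertuples()
]
print(df["response_group"])  # gene1 -> UDN, gene2 -> NNU, gene3 -> NNN
```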
To more clearly distinguish genes belonging to a response group from genes that were considered non-stress responsive (NNN), genes were only considered NNN if they were not significantly differentially expressed (up- or down-regulated, with a log2 fold-change cutoff of 0.8) under any of the three stress conditions or under any stress condition at any time point in the AtGenExpress database (http://www.weigelworld.org/resources/microarray/AtGenExpress/).

Sequence data for the promoter, 5' UTR, 3' UTR, first intron, and downstream region of A. thaliana genes were downloaded from TAIR10. Genes whose promoter regions (1 kb upstream of the transcriptional start site) overlapped with neighboring genes were excluded from the analysis. We tested if genes oriented in the same direction as their upstream neighboring gene were more likely to be correctly predicted than genes with partially overlapping promoter regions, but the results were not significant for most response groups (Supplemental Table 3.1), so genes oriented in any direction were kept. For the analysis of the regulatory information in regions outside the proximal promoter, only genes that had sequence data available for all regions were included (Supplemental Table 3.4).

The enrichment of GO terms (http://www.geneontology.org/ontology/subsets/goslim_plant.obo) and metabolic pathways (http://www.plantcyc.org) in the response group genes compared to NNN genes was determined using Fisher's exact test with p-values adjusted for multiple testing (Storey 2003). As no AraCyc terms were enriched, only GO terms are discussed.

3.5.2 Identification of known binding sites from in vitro TF binding data

Two sets of in vitro TF binding motif (TFBM) data were used to identify known binding sites. First, in vitro 200-bp binding regions for 344 TFs were collected from the DAP-Seq database (O'Malley et al. 2016). These 200-bp regions were derived from mapped sequencing peaks, and only peaks with a fraction of reads in peaks (FRiP) ≥ 5% were included. Second, position frequency matrices (PFMs) were obtained from the CIS-BP database for an additional 190 TFs without DAP-Seq data (Weirauch et al. 2014). CIS-BP PFMs were converted to position weight matrices (PWMs) adjusted for A. thaliana's AT (0.33) and GC (0.17) background using the TAMO package (Gordon et al. 2005). These 190 PWMs were then mapped to the putative promoter region (within 1 kb upstream of the transcription start site) of A. thaliana genes using Motility with a threshold of p<1e-06 (http://cartwheel.caltech.edu/motility/). A gene was considered to be regulated by a TF if its putative promoter region overlapped with one or more known TFBM sites. We also identified a subset of known TFBMs that were enriched in the promoter regions of genes in a response group compared to non-responsive (NNN) genes using Fisher's exact test (p<0.05); these TFBMs are referred to as the known enriched TFBMs.
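As an illustration of the PFM-to-PWM conversion step, the sketch below computes log-odds weights against the AT/GC background given above using a toy frequency matrix. The counts and the scoring function are illustrative; the study itself used the TAMO package, so this is a conceptual stand-in rather than the pipeline's code.

```python
# A minimal sketch of converting a position frequency matrix (PFM) into a
# log-odds position weight matrix (PWM) using the AT/GC background given
# above (A = T = 0.33, G = C = 0.17). The toy counts and the scoring
# function are illustrative; the study itself used the TAMO package.
import numpy as np

bg = np.array([0.33, 0.17, 0.17, 0.33])  # background for A, C, G, T

# Toy PFM for a 4-bp motif: rows = A, C, G, T; columns = motif positions.
pfm = np.array([
    [12,  1,  2,  5],   # A counts
    [ 1,  1, 15,  3],   # C counts
    [ 1, 14,  1,  4],   # G counts
    [10,  2,  2,  6],   # T counts
], dtype=float)

probs = (pfm + 1) / (pfm + 1).sum(axis=0)  # add-one pseudocount, normalize per position
pwm = np.log2(probs / bg[:, None])         # log-odds relative to background
print(np.round(pwm, 2))

def score(site: str) -> float:
    """Sum the PWM weight of the observed base at each position."""
    return sum(pwm["ACGT".index(b), j] for j, b in enumerate(site))

print(round(score("AGCT"), 2))  # the consensus site scores highest
```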
3.5.3 Computational identification of novel pCREs and comparison with known TFBMs

To identify pCREs that were not covered by the available in vitro TF binding data, an enrichment-based computational approach was taken (referred to as the iterative k-mer finding approach). With this approach, modified from Liu et al. (2018), all possible 6-mers were tested for enrichment in the response group gene promoters compared to NNN gene promoters using Fisher's exact test (p<0.01). For 6-mers that were enriched, their sequence was lengthened to all eight possible 7-mers (e.g. ATATCG → AATATCG, TATATCG, GATATCG, CATATCG, ATATCGA, ATATCGT, ATATCGG, ATATCGC), which were then each tested for enrichment. The k-mer lengthening process continued until the longer k-mers were no longer significantly enriched. The above was repeated to find enriched pCREs in the 5' UTR, 1st intron, 3' UTR, and 500-bp downstream region.

To assess the sequence similarity (A) between the pCREs identified for different response groups, (B) between the pCREs identified in different regions, and (C) between the pCREs and all known in vitro TFBMs, the Pearson's correlation coefficients (PCCs) between pCREs/TFBMs were calculated as in Uygun et al. (2017). For two pCREs from (A) or (B) to be considered similar, their PCC had to be ≥ the 95th percentile value of PCCs (i.e. > 0.78) between best-matching pairs of pCREs from 100 sets of 25 random 6-mers, where each pCRE in each set was paired with the pCRE from each of the other 99 sets with the highest PCC value. We used 25 because it was the average number of pCREs enriched in each genetic region across all response groups. To determine the degree of sequence similarity in (C), three PCC thresholds were calculated for each TFBM, ranging from least to most stringent. The lowest level of stringency is "better than random", where the pCRE-TFBM PCC is ≥ the 95th percentile of PCCs between the TFBM and 1,000 random k-mers. The next level of stringency is "between family", where the pCRE-TFBM PCC is ≥ the 95th percentile of PCCs between the TFBM and TFBMs from other TF families. Finally, the highest level of stringency is "within family", where the pCRE-TFBM PCC is ≥ the 95th percentile of PCCs between TFBMs from within the same family.
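The iterative k-mer finding approach can be sketched as follows. The example below plants a motif in toy "promoter" sequences, tests all 6-mers for enrichment with Fisher's exact test, and lengthens enriched k-mers until no longer extension remains enriched. The sequence lengths, the planted motif, and the stopping bookkeeping are illustrative simplifications of the actual pipeline.

```python
# A minimal sketch of the iterative k-mer finding idea: test all 6-mers for
# enrichment in "response group" vs. "NNN" promoters with Fisher's exact
# test, then lengthen enriched k-mers until no extension remains enriched.
# Toy sequences with a planted motif stand in for real promoters, and the
# bookkeeping is simplified relative to the actual pipeline.
from itertools import product
import random
from scipy.stats import fisher_exact

random.seed(0)
group = ["".join(random.choices("ACGT", k=50)) for _ in range(40)]
group = [s[:20] + "ACGTGTCA" + s[28:] for s in group]  # plant a shared motif
nnn = ["".join(random.choices("ACGT", k=50)) for _ in range(40)]

def enriched(kmer, fg, bg, p_cut=0.01):
    a = sum(kmer in s for s in fg)
    c = sum(kmer in s for s in bg)
    table = [[a, len(fg) - a], [c, len(bg) - c]]
    return fisher_exact(table, alternative="greater")[1] < p_cut

hits = []
frontier = ["".join(k) for k in product("ACGT", repeat=6)
            if enriched("".join(k), group, nnn)]
while frontier:
    new = []
    for kmer in frontier:
        longer = [b + kmer for b in "ACGT"] + [kmer + b for b in "ACGT"]
        longer = [k for k in longer if enriched(k, group, nnn)]
        if longer:
            new.extend(longer)   # keep lengthening enriched extensions
        else:
            hits.append(kmer)    # no enriched extension: keep this k-mer
    frontier = new
print(sorted(set(hits)))
```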
3.5.4 Sequence conservation, chromatin accessibility, and histone mark data processing and analysis

For the between-species sequence conservation criterion, A. thaliana genomic regions that overlapped with ~90,000 conserved non-coding sequences (CNS) among nine Brassicaceae species were used (Haudry et al. 2013). DNase I hypersensitivity (DHS) regions were downloaded from GEO (GSE53322 and GSE53324) as peaks in BED format. These regions were identified from multiple tissues and developmental stages, including roots, root hair cells, leaf, seed coat, and dark-grown A. thaliana Col-0 seedlings at 7 days old (Sullivan et al. 2014). Regions associated with activation-associated histone marks (H3K4me1: SRR2001269, H3K4me3: SRR1964977, H3K9ac: SRR1964985, and H3K23ac: SRR1005405) and with repression-associated histone marks (H3K9me1: SRR1005422, H3K9me2: SRR493052, H3K27me3: SRR3087685, and H3T3p: SRR2001289) were as compiled previously (Lloyd et al. 2018) using data from Stroud et al. (2014).

The percentage of times the sites of a pCRE overlapped with the 11 types of additional regulatory information (DAP-Seq, CNS, DHS, and eight histone marks) was calculated for each combination of pCRE and additional regulatory information for each response group. To determine whether these overlaps were significant, 1,000 random, unique 6-mers were generated and mapped to the promoter regions of response group genes; the percentage of overlap with each combination of random 6-mer and additional regulatory information was then calculated for each response group. These overlap percentages were used to generate a background distribution for overlap with each type of additional regulatory information, allowing us to convert the percent overlap scores for pCREs into percentiles along this background distribution.

The percentage overlap with each type of additional regulatory information was also calculated for all CIS-BP motifs. Analysis of variance (ANOVA), implemented in R v3.5.3, was used to determine if there were differences in the overlap percentages for each of the 11 types of additional regulatory information for each set of response group genes among all pCREs, the top 10 most important pCREs (details below), the CIS-BP motifs, and the 1,000 random 6-mers. The ANOVA p-values were adjusted for multiple testing (Storey 2003). Finally, post-hoc Tukey tests, implemented using the HSD.test function from the agricolae package in R, were performed on comparisons with a significant ANOVA (q-value < 0.05) to identify which groups (i.e. pCREs, top 10 pCREs, CIS-BP motifs, or random 6-mers) had significantly different distributions in their percent overlap with the additional regulatory information (p < 0.05).

To convert the additional regulatory information into features that could be used as input to our machine learning models, a new feature was generated for each pCRE-additional regulatory information pair (e.g. pCRE-DHS), where the value of the feature was set to 1 if the pCRE was both present in the promoter region of the gene and overlapped with the additional regulatory information, and set to 0 if either or both of those criteria were not met. This resulted in a total of 12 features associated with each pCRE (i.e. the original presence/absence feature + the 11 additional features), as illustrated in the sketch below.
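```python
# A minimal sketch of the 12-features-per-pCRE construction: one
# presence/absence (P/A) feature plus one feature per type of additional
# regulatory information, set to 1 only when the pCRE is present AND its
# site overlaps that information. Gene IDs, the pCRE, and overlaps are toys.
import pandas as pd

reg_info = ["DAP", "CNS", "DHS", "H3K4me1", "H3K4me3", "H3K9ac", "H3K23ac",
            "H3K9me1", "H3K9me2", "H3K27me3", "H3T3p"]

present = {("gene1", "ACGTGT"): True, ("gene2", "ACGTGT"): False}
overlaps = {("gene1", "ACGTGT"): {"DAP", "DHS"}, ("gene2", "ACGTGT"): set()}

rows = {}
for (gene, pcre), is_present in present.items():
    feats = {f"{pcre}_PA": int(is_present)}
    for info in reg_info:
        feats[f"{pcre}_{info}"] = int(is_present and info in overlaps[(gene, pcre)])
    rows[gene] = feats

print(pd.DataFrame.from_dict(rows, orient="index"))  # 12 features per pCRE
```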
3.5.5 Classic machine learning-based models of the cis-regulatory code

A classic machine learning algorithm, Random Forest (RF) (Breiman 2001), was used to generate models of the cis-regulatory code for each response group. These models were trained using a supervised learning approach, meaning they learned to predict the desired output (e.g. does the gene belong to response group NNU or NNN?) using example instances (i.e. genes) for which they had both the input features (e.g. presence or absence of pCRE-X) and the true classification (e.g. NNU or NNN). Different sets of input features were used throughout the study, including known TFBMs, promoter pCREs, combinatorial pCRE rules (see Supplemental Methods), overlap with additional regulatory information, and non-promoter pCREs. RF was implemented using Scikit-Learn in Python 3 (Pedregosa et al. 2011). To avoid training models that classify all genes as belonging to the more common response group, we balanced our input data by randomly down-sampling genes from the larger response group to match the number of genes in the smaller response group. Because the genes included in the input data can impact model training and performance, this process was replicated 100 times. To measure the performance of our models on a set of genes not seen by the model during training, we used a 10-fold cross-validation scheme, where the input data were randomly divided into 10 bins, a model was trained on bins 1-9 (i.e. the training set), and that model's performance was measured based on how well it classified the instances in the 10th bin (i.e. the validation set). This was repeated until each bin was used as the validation set one time. To select the values for two important RF parameters, maximum depth [3, 5, 10, 50] and maximum features [10%, 25%, 50%, 75%, 100%, square root(100%), and log2(100%)], a cross-validated grid search, implemented using GridSearchCV from Scikit-Learn, was performed on the first 10 of the 100 balanced datasets (Supplemental Table 3.7). The maximum depth parameter controls how deep each decision tree can be trained, where trees that are too shallow may not be able to capture complex patterns and trees that are too deep may overfit, meaning they would predict the training genes well but would not generalize to genes not included in training (e.g. the validation set or new genes). The maximum features parameter controls how many of the input features each decision tree in the forest is allowed to use, where too few will result in poor performance from individual decision trees and too many will result in most decision trees in the forest identifying the same pattern.

Model performance was evaluated using the F-measure (F1) (Bishop 2006), the harmonic mean of precision (True Positives / (True Positives + False Positives)) and recall (True Positives / (True Positives + False Negatives)), where an F1=1 would indicate all genes were perfectly classified and an F1=0.5 would indicate the model did no better than random guessing. For each model we also determined which genes were correctly classified as belonging to a response group, R. Every balanced run of the model could have predicted a different subset of genes as belonging to R. Thus, a final classification call that a gene, G, belongs to group R was made if the mean predicted probability over the 100 balanced runs was ≥ the predicted score threshold (i.e. the threshold between 0 and 1 that maximized model performance averaged over replicates). For each balanced run, we identified the predicted score that maximized the F-measure, and we took the average of these F-measure-maximizing scores over all 100 runs as the predicted score threshold. Then, models with similar F1 scores could be compared to see if they predicted different subsets of genes. Finally, the relative importance of each feature in an RF model was determined using the feature_importances_ attribute of the Scikit-Learn implementation of RF, which calculates feature importance as the normalized decrease in node impurity across the decision trees when that feature is used to divide a node, known as the Gini importance (Breiman 2001).
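The sketch below illustrates one balanced replicate of this scheme with Scikit-Learn: the larger class is randomly down-sampled, a grid search over the maximum depth and maximum features values listed above is cross-validated, and Gini importances are read from the fitted model. The toy data, tree count, and random seeds are assumptions, not the study's settings.

```python
# A minimal sketch of one balanced replicate of the RF scheme above:
# down-sample the larger class, grid-search max_depth and max_features with
# cross-validation, then read Gini importances. Toy data; the tree count and
# random seeds are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 50))   # genes x pCRE features
y = np.array([1] * 60 + [0] * 240)       # response group vs. NNN (imbalanced)

pos = np.where(y == 1)[0]
neg = rng.choice(np.where(y == 0)[0], size=len(pos), replace=False)
idx = np.concatenate([pos, neg])         # one balanced dataset
Xb, yb = X[idx], y[idx]

grid = {"max_depth": [3, 5, 10, 50],
        "max_features": [0.1, 0.25, 0.5, 0.75, 1.0, "sqrt", "log2"]}
search = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=0),
                      grid, cv=10, scoring="f1")
search.fit(Xb, yb)

print(search.best_params_)
print(cross_val_score(search.best_estimator_, Xb, yb, cv=10, scoring="f1").mean())
print(search.best_estimator_.feature_importances_[:5])  # Gini importances
```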
3.5.6 Convolutional neural network-based models of the cis-regulatory code

Convolutional neural networks (CNNs), a deep learning algorithm, were tested to see if they could better integrate additional regulatory information into our models of the cis-regulatory code. CNNs were implemented in Python 3.6 using TensorFlow 2.0 (Girija 2016). CNN models were made up of four layers: input, convolutional, dense (i.e. fully connected), and output (i.e. the prediction) (see Figure 3.4B). The input is a 3-dimensional array [rows x columns x layers], where each layer contains data from a different gene, each column (size = # of pCREs for that response group) contains a different pCRE, and each row (size = 12) contains either pCRE presence/absence or overlap with one type of additional regulatory information. The convolutional layer is composed of kernels (i.e. pattern finders) with the dimensions [12 x 1]; using a stride length of 1, each kernel passes over each pCRE one time, resulting in an output with dimensions [# kernels x # pCREs]. The starting kernel weights were initialized randomly and scaled relative to the size of the input data using Xavier initialization (Glorot and Bengio 2010). The output from the convolutional layer was flattened (i.e. changed from a 2D array to a 1D array with shape [1 x (# kernels x # pCREs)]) and then passed to the dense layer. A non-linear activation function (rectified linear units; ReLU) was applied to both the convolutional and dense layers, and a sigmoid activation function was applied to the final output layer to facilitate making a binary decision (e.g. NNU vs. NNN). Weights were optimized using stochastic gradient descent with momentum (SGDm) (momentum=0.9) as implemented in TensorFlow.

Three strategies were used to reduce the likelihood of the CNN models overfitting, where models train so specifically to the training data that they do not generalize well to new data. First, L2 regularization was applied to the kernel weights in our convolutional layer, forcing the weights to shrink toward zero. Second, dropout regularization was applied to the dense layer, meaning that during each iteration of training a random subset of the dense nodes was removed. This adds randomness to the model and encourages the network to learn more general patterns in the data, rather than specific ones that may be overfit. Finally, CNNs can overfit to the training data if they are allowed to train for too many iterations; however, training for too few iterations will result in a model that has not yet converged (i.e. underfitting). To determine when to best stop training, we used an early stopping approach implemented in Keras (https://keras.io/callbacks/#earlystopping), where the training data were further split into training (90%) and validation (10%) sets and training stopped when model performance had not improved (min_delta = 0) for 10 iterations (patience = 10) on the validation data, with the maximum number of training iterations limited to 1,000. As with the RF models described above, CNN models were trained on balanced datasets. Because of the greater computational power needed by CNNs, instead of the cross-validation approach used for RF, the balanced data were divided into a training set (90%) and a testing set (10%), and performance was measured on the testing set. Model parameters were selected using a random search across the parameter space with five-fold cross-validation and ~4,800 iterations (implemented using RandomizedSearchCV in Scikit-Learn). Parameters in the search included the learning rate, the number of kernels in the convolutional layer, the number of nodes in the dense layer, the dropout rate, and the L2 regularization rate (Supplemental Table 3.7).

The importance of each pCRE and its associated additional regulatory information was determined by measuring the difference in model performance between the original model and a new model in which the values in all rows of that pCRE's column were set to zero (i.e. not present and not overlapping with the additional regulatory information) for all genes. Thus, larger positive differences indicate more important pCREs, while negative scores indicate that zeroing out the pCRE in question actually improved model performance. The change in performance was measured using the area under the receiver operating characteristic curve, rather than the F1, because it does not require the selection of a classification threshold. The median importance scores across replicates were used to summarize the importance of each pCRE and its associated additional regulatory information.

To determine what patterns the CNNs learned to identify, we extracted the weights from each kernel in the convolutional layer of our trained CNN models. Because we trained 100 CNN models, each with either 8 or 16 kernels (Supplemental Table 3.7), we used hierarchical clustering with dynamic branch cutting (minimum cluster size = 250) to group kernels based on the similarity of their weights and found the median weight at each position for each cluster. Kernel importance was measured as described above, where the change in model performance after a kernel's weights were set to zero (i.e. identifying no pattern) was calculated for each kernel. The median kernel importance scores across all kernels in a cluster are shown.
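The kernel pooling and clustering step might look like the following sketch, where SciPy's standard hierarchical clustering with a fixed cluster count stands in for the dynamic branch cutting used in the study (an assumption made for brevity); the kernel weight matrix is randomly generated.

```python
# A minimal sketch of pooling trained kernels across CNN replicates and
# clustering them by weight similarity. SciPy's standard hierarchical
# clustering with a fixed cluster count stands in for the dynamic branch
# cutting used in the study; the kernel weights here are randomly generated.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
kernels = rng.normal(size=(800, 12))  # e.g. 100 replicates x 8 kernels x 12 weights

Z = linkage(kernels, method="average", metric="euclidean")
labels = fcluster(Z, t=4, criterion="maxclust")  # cut into 4 clusters

for c in np.unique(labels):
    members = kernels[labels == c]
    median_w = np.round(np.median(members, axis=0)[:4], 2)
    print(f"cluster {c}: n = {len(members)}, first median weights = {median_w}")
```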
3.5.7 Data Availability

All data used in this study are publicly available (Haudry et al. 2013; Prasch and Sonnewald 2013; Stroud et al. 2014; Weirauch et al. 2014; Sullivan et al. 2014; O'Malley et al. 2016). All code needed to reproduce the results from this study is available on GitHub (https://github.com/ShiuLab/Manuscript_Code/2019_CRC_HeatDrought). This repository also contains a detailed README.md file that describes our analyses in more detail, provides the commands used to generate the results in this study, and includes links to the most recent versions of the scripts used.

APPENDIX

Supplemental Information

Methods: Combinatorial pCRE rule discovery

To determine if pairs of pCREs were significantly more likely to occur in combination with each other in response group genes, we mined for association rules between our computationally identified pCREs using two statistical tests. First, the 'aRules' package (Hahsler et al. 2011), implemented in R, was used to identify pairs of pCREs that were found together in at least 20% of the genes in a response group (support ≥ 0.2), had a confidence score ≥ 0.5, where confidence(X⇒Y) was defined as support(X∪Y)/support(X), and had a lift score significantly > 1 (q < 0.05) based on Fisher's exact test with Bonferroni multiple testing correction. Lift(X⇒Y) is defined as support(X∪Y)/(support(X)*support(Y)) and can be interpreted as the support for the rule given the prior probability of obtaining that rule by chance. To be regarded as combinatorial pCREs, pairs also had to be on average ≥ 2 base pairs apart. Second, the iterative Random Forest package ('iRF'; Basu et al. 2018), also implemented in R, was used to identify pCREs that formed stable interactions in the large leaf nodes of ensembles of decision trees. For each response group, iRFs were identified and compared to iRFs identified after permuting the presence/absence data; iRFs not identified after permutation and those with stability scores significantly greater than the permuted iRF stability scores were regarded as combinatorial pCREs.
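For reference, the sketch below computes the support, confidence, and lift statistics defined above for a single toy pCRE pair from a random presence/absence matrix.

```python
# A minimal sketch of the support, confidence, and lift statistics defined
# above, computed for one toy pCRE pair from a random presence/absence matrix.
import numpy as np

rng = np.random.default_rng(2)
pa = rng.integers(0, 2, size=(100, 2)).astype(bool)  # genes x {pCRE_X, pCRE_Y}
x, y = pa[:, 0], pa[:, 1]

support_x = x.mean()
support_y = y.mean()
support_xy = (x & y).mean()                  # support(X ∪ Y)
confidence = support_xy / support_x          # support(X ∪ Y) / support(X)
lift = support_xy / (support_x * support_y)  # > 1: co-occur more often than chance

print(f"support = {support_xy:.2f}, confidence = {confidence:.2f}, lift = {lift:.2f}")
```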
Results: Significant interactions between pCREs exist but do not improve predictive models

Given that TFs frequently work in concert (Farnham 2009), we next incorporated interactions between TFs into our cis-regulatory code models by generating input features to represent significant interactions between pCREs. We identified interactions between pCREs for each response group using three approaches. The first two approaches, association Rules (aRules) and iterative Random Forest (iRF), looked for significant associations between the pCREs we had already identified for each response group. The third approach was to use a combinatorial-rule-aware approach to identify new pairs of pCREs that were not identified by the iterative k-mer finding method because they were not individually enriched in the response group genes. Finally, we assessed if the inclusion of these pCRE interactions as features improved our models of the cis-regulatory code for each response group.

The first statistical approach, association Rules (aRules), identified pairs of pCREs that were significantly more likely to be found in the same promoter of response group genes compared to NNN genes and were found together in at least 20% of the response group genes (i.e. 20% support) (Hahsler et al. 2011). Because our iterative k-mer finding approach identified some pCREs with high sequence similarity that likely represent different regions of the same CRE (e.g. AATACT, GAATAC), we also stipulated that the average distance between pCREs in an aRule must be >2 bp. The second approach, iterative Random Forest (iRF) (Basu et al. 2018), identified pCREs that formed significantly more stable interactions in decision trees built to classify response group genes from NNN genes than in decision trees built using permuted data. While the average number of association rules across response groups was similar for both methods (27 aRules and 21 iRF rules), the number of association rules identified for each response group by each method was not correlated (Spearman's rank correlation (rho)=0.04) (Supplemental Figure 3.2A), suggesting these approaches identified different types of associations. Finally, we used iRF to identify interactions between 6-mers from all possible 6-mers for NUN and NNU and found 15 and 14 6-mer pairs, respectively.

To determine if the pCRE interactions improved our predictive models, we converted pCRE interactions into features, where a feature was given a value of 1 if both pCREs were present in the gene promoter and 0 otherwise (see the sketch below). When RF models were built using the interaction features alone or in combination with the single pCRE features, model performance either did not improve or improved moderately, with aRules improving performance for UNN and NUN and iRF rules improving performance for NND (Supplemental Figure 3.2B). We hypothesized that the lack of improvement was because combinatorial rule features hold less information (i.e. are sparser) than single pCRE features, since two pCREs must be present in a gene for the interaction to be considered present. Additionally, because of the hierarchical structure of the decision trees that make up RF models, RF is able to model interaction effects without explicitly coded interaction features. Finally, novel 6-mer pairs identified using iRF did not improve model performance when used as features alone or with pCREs (Supplemental Figure 3.2C).
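Encoding a rule as a model feature is then a one-line operation, as in this toy sketch (the pCRE names and presence/absence values are illustrative):

```python
# A toy sketch of encoding a pCRE pair rule as an interaction feature that
# is 1 only when both pCREs are present in a gene's promoter.
import pandas as pd

pa = pd.DataFrame({"ACGTGT": [1, 1, 0], "GAATAC": [1, 0, 1]},
                  index=["gene1", "gene2", "gene3"])
pa["ACGTGT&GAATAC"] = (pa["ACGTGT"] & pa["GAATAC"]).astype(int)
print(pa)  # only gene1 has the interaction feature set to 1
```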
Supplemental Figure 3.1. Overlap in true positive gene predictions from models using known TFBMs, pCREs, or both features as input. The number of genes that were similarly correctly predicted by known TFBM, pCRE, and known TFBM + pCRE cis-regulatory code models for each response group. The number of genes in a response group that were not predicted as belonging to that response group by any of the models (i.e. false negatives) is shown in red.

Supplemental Figure 3.2. Impact of including association rules between pCREs as model features. (A) The number of pCREs (green), pCRE association rules identified using aRules (pink), and iRF rules (light blue) for each response group (Y-axis). (B) Performance of RF cis-regulatory code models using all single pCREs (green; as in Figure 3.3A), only aRules (pink), only iRF rules (light blue), single pCREs + aRules (red), and single pCREs + iRF rules (dark blue) as input features. (C) Performance of RF cis-regulatory code models using 6-mer pairs identified by iRF from a set of all possible 6-mers as features.

Supplemental Figure 3.3. The association of additional regulatory information with pCREs compared to random 6-mers and known TFBMs. (A) Results of the analysis of variance (ANOVA) tests used to determine if there were differences in the percentage of times a sequence overlapped with each of the 11 types of additional regulatory information when it was present in a response group gene, for all pCREs (light green), the top 10 most important pCREs (based on the Gini index from RF models; dark green), known TFBMs (i.e. CIS-BP motifs; blue), and 1,000 random 6-mers (coral), for each response group. The results from the 77 ANOVAs were corrected for multiple testing (q-values) and are shown here as the negative log10(q-value). (B) The distribution of the percent overlap data used in (A). For the 59 response group-additional regulatory information pairs with significant differences in overlap (ANOVA q-value < 0.05), sequence groups are labeled (i.e. a, b, c) based on which groups are significantly different from each other using a post-hoc Tukey test (p-value < 0.05).

Supplemental Figure 3.4. Probing the trained kernels to understand the important patterns of additional regulatory information identified by CNN models. The full results from interpreting the trained CNN models (see Figure 3.4C). The feature types (i.e. presence/absence (P/A) and additional regulatory information) were sorted based on the average kernel weights across all kernels trained for all response groups and replicates (first column). The remaining columns represent kernel clusters for specific response groups. For each response group, all trained kernels from all CNN replicates were clustered using hierarchical clustering with dynamic cutting (min cluster size=250 kernels). The median kernel weights and kernel importance scores are shown here for the resulting clusters.

Supplemental Figure 3.5. Overlap in true positive gene predictions from models using pCREs from different genetic regions. The number of genes that were similarly correctly predicted by pCREs identified in the promoter (green), 3' untranslated region (UTR) (orange), first intron (blue), 5' UTR (purple), and downstream (500 bp; DS500; red) regions for each response group.

REFERENCES

Arvey, A., P. Agius, W. S. Noble, and C. Leslie, 2012 Sequence and chromatin determinants of cell-type-specific transcription factor binding. Genome Res. 22: 1723–1734.

Atkinson, N. J., C. J. Lilley, and P. E. Urwin, 2013 Identification of genes involved in the response of Arabidopsis to simultaneous biotic and abiotic stresses. Plant Physiology 162: 2028–2041.

Benjamini, Y., and Y. Hochberg, 1995 Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Series B Stat. Methodol. 57: 289–300.

Bishop, C. M., 2006 Pattern recognition and machine learning (Information Science and Statistics). Springer, New York.

Bonnet, C., S. Lassueur, C. Ponzio, R. Gols, M. Dicke et al., 2017 Combined biotic stresses trigger similar transcriptomic responses but contrasting resistance against a chewing herbivore in Brassica nigra. BMC Plant Biol. 17: 127.
Breiman, L., 2001 Random Forests. Machine Learning 45: 5–32.

Chain, B., 2012 agilp: Agilent expression array processing package. R package, http://www.bioconductor.org/packages/release/bioc/html/agilp.html.

Chang, Y., B. H. Nguyen, Y. Xie, B. Xiao, N. Tang et al., 2017 Co-overexpression of the Constitutively Active Form of OsbZIP46 and ABA-Activated Protein Kinase SAPK6 Improves Drought and Temperature Stress Resistance in Rice. Front. Plant Sci. 8: 1102.

Choi, Y.-S., Y.-M. Kim, O.-J. Hwang, Y.-J. Han, S. Y. Kim et al., 2013 Overexpression of Arabidopsis ABF3 gene confers enhanced tolerance to drought and heat stress in creeping bentgrass. Plant Biotechnol. Rep. 7: 165–173.

De Storme, N., and D. Geelen, 2014 The impact of environmental stress on male reproductive development in plants: biological processes and molecular mechanisms. Plant Cell Environ. 37: 1–18.

Ding, Z. J., J. Y. Yan, C. X. Li, G. X. Li, Y. R. Wu et al., 2015 Transcription factor WRKY46 modulates the development of Arabidopsis lateral roots in osmotic/salt stress conditions via regulation of ABA signaling and auxin homeostasis. Plant J. 84: 56–69.

Doebley, J. F., B. S. Gaut, and B. D. Smith, 2006 The Molecular Genetics of Crop Domestication. Cell 127: 1309–1321.

Dong, X., and Z. Weng, 2013 The correlation between histone modifications and gene expression. Epigenomics 5: 113–116.

Farnham, P. J., 2009 Insights from genomic profiling of transcription factors. Nature Reviews Genetics 10: 605–616.

Georgii, E., M. Jin, J. Zhao, B. Kanawati, P. Schmitt-Kopplin et al., 2017 Relationships between drought, heat and air humidity responses revealed by transcriptome-metabolome co-analysis. BMC Plant Biol. 17: 120.

Ghandi, M., D. Lee, M. Mohammad-Noori, and M. A. Beer, 2014 Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features. PLoS Computational Biology 10: e1003711.

Girija, S. S., 2016 Tensorflow: Large-scale machine learning on heterogeneous distributed systems.

Glorot, X., and Y. Bengio, 2010 Understanding the difficulty of training deep feedforward neural networks.

Harbison, C. T., D. B. Gordon, T. I. Lee, N. J. Rinaldi, K. D. Macisaac et al., 2004 Transcriptional regulatory code of a eukaryotic genome. Nature 431: 99–104.

Haudry, A., A. E. Platts, E. Vello, D. R. Hoen, M. Leclercq et al., 2013 An atlas of over 90,000 conserved noncoding sequences provides insight into crucifer regulatory regions. Nat. Genet. 45: 891–898.

He, H. H., C. A. Meyer, M. W. Chen, V. C. Jordan, M. Brown et al., 2012 Differential DNase I hypersensitivity reveals factor-dependent chromatin dynamics. Genome Res. 22: 1015–1025.

Hsieh, W.-P., H.-L. Hsieh, and S.-H. Wu, 2012 Arabidopsis bZIP16 Transcription Factor Integrates Light and Hormone Signaling Pathways to Regulate Early Seedling Development. The Plant Cell 24: 3997–4011.

Hu, H., J. You, Y. Fang, X. Zhu, Z. Qi et al., 2008 Characterization of transcription factor gene SNAC2 conferring cold and salt tolerance in rice. Plant Mol. Biol. 67: 169–181.

Huebert, D. J., P. F. Kuan, S. Keles, and A. P. Gasch, 2012 Dynamic Changes in Nucleosome Occupancy Are Not Predictive of Gene Expression Dynamics but Are Linked to Transcription and Chromatin Regulators. Molecular and Cellular Biology 32: 1645–1653.

James, A. B., N. H. Syed, J. W. S. Brown, and H. G. Nimmo, 2012 Thermoplasticity in the plant circadian clock. Plant Signal Behav 7: 1219–1223.
Kang, H.-G., J. Kim, B. Kim, H. Jeong, S. H. Choi et al., 2011 Overexpression of FTL1/DDF1, an AP2 transcription factor, enhances tolerance to cold, drought, and heat stresses in Arabidopsis thaliana. Plant Science 180: 634–641.

Karmodiya, K., A. R. Krebs, M. Oulad-Abdelghani, H. Kimura, and L. Tora, 2012 H3K9 and H3K14 acetylation co-occur at many gene regulatory elements, while H3K14ac marks a subset of inactive inducible promoters in mouse embryonic stem cells. BMC Genomics 13: 424.

Kidokoro, S., K. Maruyama, K. Nakashima, Y. Imura, Y. Narusaka et al., 2009 The phytochrome-interacting factor PIF7 negatively regulates DREB1 expression under circadian control in Arabidopsis. Plant Physiol. 151: 2046–2057.

King, G. J., 2015 Crop epigenetics and the molecular hardware of genotype × environment interactions. Frontiers in Plant Science 6: 10217–19.

Konishi, S., T. Izawa, S. Y. Lin, K. Ebana, Y. Fukuta et al., 2006 An SNP caused loss of seed shattering during rice domestication. Science 312: 1392–1396.

Kumar, M., G. Sirhindi, R. Bhardwaj, S. Kumar, and G. Jain, 2010 Effect of exogenous H2O2 on antioxidant enzymes of Brassica juncea L. seedlings in relation to 24-epibrassinolide under chilling stress. Indian J. Biochem. Biophys. 47: 378–382.

Lee, D.-K., P. J. Chung, J. S. Jeong, G. Jang, S. W. Bang et al., 2017 The rice OsNAC6 transcription factor orchestrates multiple molecular mechanisms involving root structural adaptions and nicotianamine biosynthesis for drought tolerance. Plant Biotechnol. J. 15: 754–764.

Li, X., H. Zhang, Q. Ai, G. Liang, and D. Yu, 2016 Two bHLH Transcription Factors, bHLH34 and bHLH104, Regulate Iron Homeostasis in Arabidopsis thaliana. Plant Physiology 170: 2478–2493.

Liu, T., J. Carlsson, T. Takeuchi, L. Newton, and E. M. Farré, 2013 Direct regulation of abiotic responses by the Arabidopsis circadian clock component PRR7. The Plant Journal 76: 101–114.

Liu, M.-J., K. Sugimoto, S. Uygun, N. Panchy, M. S. Campbell et al., 2018 Regulatory Divergence in Wound-Responsive Gene Expression between Domesticated and Wild Tomato. The Plant Cell 30: 1445–1460.

Lloyd, J. P., Z. T.-Y. Tsai, R. P. Sowers, N. L. Panchy, and S.-H. Shiu, 2018 A Model-Based Approach for Identifying Functional Intergenic Transcribed Regions and Noncoding RNAs (J. Gojobori, Ed.). Molecular Biology and Evolution 35: 1422–1436.

Luo, C., and E. Lam, 2010 ANCORP: a high-resolution approach that generates distinct chromatin state models from multiple genome-wide datasets. The Plant Journal 63: 339–351.

Mathur, S., D. Agrawal, and A. Jajoo, 2014 Photosynthesis: Response to high temperature stress. Journal of Photochemistry & Photobiology, B: Biology 137: 116–126.

Nicotra, A. B., O. K. Atkin, S. P. Bonser, A. M. Davidson, E. J. Finnegan et al., 2010 Plant phenotypic plasticity in a changing climate. Trends in Plant Science 15: 684–692.

O'Malley, R. C., S. C. Huang, L. Song, M. G. Lewsey, A. Bartlett et al., 2016 Cistrome and Epicistrome Features Shape the Regulatory DNA Landscape. Cell 165: 1280–1292.

Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion et al., 2011 Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12: 2825–2830.

Pfluger, J., and D. Wagner, 2007 Histone modifications and dynamic regulation of genome accessibility in plants. Curr. Opin. Plant Biol. 10: 645–652.

Prasch, C. M., and U. Sonnewald, 2013 Simultaneous application of heat, drought, and virus to Arabidopsis plants reveals significant shifts in signaling networks. Plant Physiology 162: 1849–1866.
Rabara, R. C., P. Tripathi, and P. J. Rushton, 2014 The Potential of Transcription Factor-Based Genetic Engineering in Improving Crop Tolerance to Drought. OMICS: A Journal of Integrative Biology 18: 601–614.

Rasmussen, S., P. Barah, M. C. Suarez-Rodriguez, S. Bressendorff, P. Friis et al., 2013 Transcriptome responses to combinations of stresses in Arabidopsis. Plant Physiology 161: 1783–1794.

Reynolds, M. P., and R. Ortiz, 2010 Adapting crops to climate change: a summary, pp. 1–8 in Climate Change and Crop Production, edited by M. P. Reynolds. CABI, Wallingford.

Ritchie, M. E., B. Phipson, D. Wu, Y. Hu, C. W. Law et al., 2015 limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43: e47.

Rizhsky, L., H. Liang, J. Shuman, V. Shulaev, S. Davletova et al., 2004 When defense pathways collide. The response of Arabidopsis to a combination of drought and heat stress. Plant Physiol. 134: 1683–1696.

Sewelam, N., Y. Oshima, N. Mitsuda, and M. Ohme-Takagi, 2014 A step towards understanding plant responses to multiple environmental stresses: a genome-wide study. Plant Cell Environ. 37: 2024–2035.

Shaar-Moshe, L., E. Blumwald, and Z. Peleg, 2017 Unique Physiological and Transcriptional Shifts under Combinations of Salinity, Drought, and Heat. Plant Physiology 174: 421–434.

Shen, P.-C., A.-L. Hour, and L.-Y. D. Liu, 2017 Microarray meta-analysis to explore abiotic stress-specific gene expression patterns in Arabidopsis. Bot. Stud. 58: 22.

Sillmann, J., V. V. Kharin, and X. Zhang, 2013 Climate extremes indices in the CMIP5 multimodel ensemble: Part 1. Model evaluation in the present climate. J. Geophys. Res. Atmos. 118: 1716–1733.

Steinfeld, I., R. Shamir, and M. Kupiec, 2007 A genome-wide analysis in Saccharomyces cerevisiae demonstrates the influence of chromatin modifiers on transcription. Nature Genetics 39: 303–309.

Stocker, T. F., D. Qin, G.-K. Plattner, M. Tignor, S. K. Allen et al., 2013 IPCC, 2013: climate change 2013: the physical science basis. Contribution of working group I to the fifth assessment report of the intergovernmental panel on climate change.

Storey, J. D., 2003 The positive false discovery rate: a Bayesian interpretation and the q-value. Ann. Stat. 31: 2013–2035.

Stroud, H., T. Do, J. Du, X. Zhong, S. Feng et al., 2014 Non-CG methylation patterns shape the epigenetic landscape in Arabidopsis. Nat. Struct. Mol. Biol. 21: 64–72.

Sullivan, A. M., A. A. Arsovski, J. Lempe, K. L. Bubb, M. T. Weirauch et al., 2014 Mapping and Dynamics of Regulatory DNA and Transcription Factor Networks in A. thaliana. Cell Reports 8: 2015–2030.

Sullivan, A. M., K. L. Bubb, R. Sandstrom, J. A. Stamatoyannopoulos, and C. Queitsch, 2015 DNase I hypersensitivity mapping, genomic footprinting, and transcription factor networks in plants. Current Plant Biology 3–4: 40–47.

Tompa, M., 2001 Identifying Functional Elements by Comparative DNA Sequence Analysis. Genome Res. 11: 1143–1144.

Uygun, S., A. E. Seddon, C. B. Azodi, and S.-H. Shiu, 2017 Predictive Models of Spatial Transcriptional Response to High Salinity. Plant Physiology 174: 450–464.

Vile, D., M. Pervent, M. Belluau, F. Vasseur, J. Bresson et al., 2012 Arabidopsis growth under prolonged high temperature and water deficit: independent or interactive effects? Plant Cell Environ. 35: 702–718.

Wang, J., J. Zhuang, S. Iyer, X. Lin, T. W. Whitfield et al., 2012 Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 22: 1798–1812.
Iyer, X. Lin, T. W. Whitfield et al., 2012 Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 22: 1798–1812. Wasserman, W. W., M. Palumbo, W. Thompson, J. W. Fickett, and C. E. Lawrence, 2000 Human-mouse genome comparisons to locate regulatory sites. Nat. Genet. 26: 225–228. Weirauch, M. T., A. Yang, M. Albu, A. G. Cote, A. Montenegro-Montero et al., 2014 Determination and Inference of Eukaryotic Transcription Factor Sequence Specificity. Cell 158: 1431–1443. Wu, X., Y. Shiroto, S. Kishitani, Y. Ito, and K. Toriyama, 2009 Enhanced heat and drought tolerance in transgenic rice seedlings overexpressing OsWRKY11 under the control of HSP101 promoter. Plant Cell Rep. 28: 21–30. Yu, C.-P., J.-J. Lin, and W.-H. Li, 2016 Positional distribution of transcription factor binding sites in Arabidopsis thaliana. Sci. Rep. 6: 25164. Zhang, G., and G. W. Duff, 1994 Intron 1 regulation of interleukin 1 beta (IL-1β) gene transcription: an alternative promoter? Cytokine 6: 564. Zhu, Y., A. Dong, and W.-H. Shen, 2012 Histone variants and chromatin assembly in plant abiotic stress responses. BBA - Gene Regulatory Mechanisms 1819: 343–348. Zou, C., K. Sun, J. D. Mackaluso, A. E. Seddon, R. Jin et al., 2011 Cis-regulatory code of stress-responsive transcription in Arabidopsis thaliana. PNAS 108: 14992–14997. CHAPTER FOUR: BENCHMARKING PARAMETRIC AND MACHINE LEARNING MODELS FOR GENOMIC PREDICTION OF COMPLEX TRAITS1 1 The work described in this chapter has been published in the following manuscript: Christina B. Azodi, Emily G. Bolger, Andrew McCarren, Mark Roantree, Gustavo de los Campos, Shin-Han Shiu (2019) Benchmarking parametric and machine learning models for genomic prediction of complex traits. G3. DOI: 10.1534/g3.119.400498 4.1 Abstract The usefulness of genomic prediction in crop and livestock breeding programs has prompted efforts to develop new and improved genomic prediction algorithms, such as artificial neural networks and gradient tree boosting. However, the performance of these algorithms has not been compared in a systematic manner using a wide range of datasets and models. Using data for 18 traits across six plant species with different marker densities and training population sizes, we compared the performance of six linear and six non-linear algorithms. First, we found that hyperparameter selection was necessary for all non-linear algorithms and that feature selection prior to model training was critical for artificial neural networks when the markers greatly outnumbered the number of training lines. Across all species and trait combinations, no one algorithm performed best; however, predictions based on a combination of results from multiple algorithms (i.e. ensemble predictions) performed consistently well. While linear and non-linear algorithms performed best for a similar number of traits, the performance of non-linear algorithms varied more between traits. Although artificial neural networks did not perform best for any trait, we identified strategies (i.e. feature selection, seeded starting weights) that boosted their performance to near the level of other algorithms. Our results highlight the importance of algorithm selection for the prediction of trait values. 4.2 Introduction The ability to predict complex traits from genotypes is a grand challenge in biology and is accelerating the speed of crop and livestock breeding (Heffner et al. 2009; Lorenz et al. 2011; Jonas and de Koning 2013; Desta and Ortiz 2014).
Genomic Prediction (GP, aka Genomic Selection), the use of genome-wide genetic markers to predict complex traits, was originally proposed by Meuwissen et al. (Meuwissen et al. 2001) as a solution to the limitations of Marker-Assisted Selection (MAS), where only a limited number of previously identified markers with the strongest associations are used to select the best lines. GP is particularly well-suited for the prediction of quantitative traits controlled by many small-effect alleles (Ribaut and Ragot 2007). A major challenge in using GP is estimating the effects of a large number of markers (p) using phenotype information of a comparatively limited number of individuals (n) (i.e. p >> n) (Meuwissen et al. 2001). To address this challenge, Meuwissen et al. first presented three statistical methods for GP (Meuwissen et al. 2001). The first was a linear mixed model called ridge regression Best Linear Unbiased Prediction (rrBLUP), which uniformly shrinks the marker effects. The other two were Bayesian approaches, BayesA (BA) and BayesB (BB), which both differentially shrink the marker effects, with BB also performing variable selection. Since then, additional approaches have been shown to be useful for GP, including the Least Absolute Shrinkage and Selection Operator (LASSO) (Usai et al. 2009), Elastic Net (Zou and Hastie 2005), Support Vector Regression with a linear kernel (SVRlin) (Moser et al. 2009; Xu et al. 2018), and additional Bayesian methods including Bayesian LASSO (BL), BayesCπ, and BayesDπ (de los Campos et al. 2009; Habier et al. 2011). While these approaches perform well when dealing with high dimensional data (i.e. p >> n), they are all based on a linear mapping from genotype to phenotypes, and therefore may not fully capture non-linear effects (e.g. epistasis, dominance), which are likely to be important for complex traits (Holland 2007; Monir and Zhu 2018). To overcome this limitation, non-linear approaches, including reproducing kernel Hilbert spaces (RKHS) regression (Gianola et al. 2006; de los Campos et al. 2010), Support Vector Regression with non-linear kernels (i.e. polynomial SVRpoly and radial basis function SVRrbf (Long et al. 2011; Kasnavi et al. 2017)), and decision tree-based algorithms such as Random Forest (RF) (González-Recio and Forni 2011; Spindel et al. 2015) and Gradient Tree Boosting (GTB) (González-Recio et al. 2013), have been applied to GP problems. In previous efforts to compare the performance of multiple linear and non-linear approaches (Heslot et al. 2012; Neves et al. 2012; Blondel et al. 2015; Ramstein et al. 2016; Roorkiwal et al. 2016), no single method performed best in all cases. Rather, factors such as the size of the training data set, marker type and number, trait heritability, effective population size, the number of causal loci, as well as genetic architecture (the locus effect size distribution) can all affect algorithm performance (Meuwissen 2009; Riedelsheimer et al. 2013; Spindel et al. 2015; Norman et al. 2018). This highlights the importance of comparing new algorithms across a diverse range of datasets. With improvements in computing speeds, the development of graphics processing units (GPUs), and breakthroughs in algorithms for backpropagation learning (Rumelhart et al. 1986; Parker 1987), there has been a resurgence of research using deep learning (i.e. artificial neural networks (ANNs)) to model complex biological processes (Angermueller et al. 2016; Webb 2018).
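To make the p >> n setting described above concrete, below is a minimal sketch (not the pipeline used in this study) of fitting a uniform-shrinkage linear model, in the spirit of rrBLUP, to a marker matrix with scikit-learn; the simulated marker matrix, the penalty value, and the 50 causal loci are all hypothetical placeholders.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n, p = 400, 20000  # far more markers than lines (p >> n)
    X = rng.choice([-1, 0, 1], size=(n, p)).astype(float)  # hypothetical marker calls
    y = X[:, :50] @ rng.normal(size=50) + rng.normal(size=n)  # 50 causal loci plus noise

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Like rrBLUP, ridge regression shrinks all marker effects uniformly toward
    # zero, which keeps the effect estimates stable even though p greatly exceeds n.
    model = Ridge(alpha=1000.0).fit(X_train, y_train)
    r = np.corrcoef(model.predict(X_test), y_test)[0, 1]  # Pearson's r, the standard GP metric
    print(f"test r = {r:.2f}")

Differential-shrinkage or variable-selection behavior, as in BayesB or LASSO, would replace the single uniform penalty with marker-specific ones.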
ANNs are a class of machine learning methods that perform layers of transformations on features to create abstracted features, known as hidden layers, which are used for predictions. The first application of ANNs for GP was presented in 2011, when Okut et al. trained fully connected ANNs (i.e. each node in a layer is connected to all nodes in surrounding layers) containing one hidden layer to predict body mass index in mice (Okut et al. 2011). Since 2011, more complex ANN architectures have been used for GP, including radial basis function neural networks (González-Camacho et al. 2012), deep neural networks (Ehret et al. 2015; Bellot et al. 2018), deep recurrent neural networks (Pouladi et al. 2015), probabilistic neural network classifiers (González-Camacho et al. 2016, 2018), and convolutional neural networks (CNNs) (Ma et al. 2018). With only one exception (Bellot et al. 2018), these ANNs have been applied to datasets with relatively few genetic markers (<60k); however, as sequencing continues to become less expensive, whole-genome marker datasets are becoming larger, with some breeding programs generating data for hundreds of thousands of markers. Because of the internal complexity of ANN models, training an ANN with so many markers can result in sub-optimal solutions (i.e. underfitting). Therefore, it is especially important to benchmark ANNs against other GP statistical approaches on datasets with high dimensionality where underfitting may occur. GP has yielded promising results for breeders. However, a comprehensive comparison of GP algorithms, particularly ANNs, on a wide range of GP problems is missing (Figure 4.1A). Here we compared the ability of 12 GP algorithms (see Methods, Figure 4.1B) to predict a diverse range of physiological traits in six plant species (maize, rice, sorghum, soy, spruce, and switchgrass; Figure 4.1C). These six data sets (referred to as the benchmark data sets) represent a wide range of GP data types, with the size of the training data set ranging from 327 to 5,014 individuals, and 4,000 to 332,000 markers derived from array-based approaches or sequencing. Compared to the linear algorithms included in the study, the non-linear algorithms, especially ANNs, require more pre-modeling tuning (e.g. hyperparameter selection, feature selection). Therefore, before comparing algorithm performance across all 18 combinations of species and traits, we first focused on predicting plant height in each species in order to establish best practices for model building. Because ANNs are underrepresented in GP comparison studies and our first attempts to use ANNs for GP performed relatively poorly, we focused on methods to improve ANN performance, including reducing model complexity using feature selection and combining relationships learned from linear algorithms into the more complex ANN architectures (i.e. a seeded ANN approach and convolutional layers (i.e. CNNs)). Then, using lessons learned from predicting height, we compared the performance of all GP algorithms across all species and traits. Figure 4.1. Algorithms used and compared in past GP studies and algorithms and data included in the GP benchmark. (A) Number of times a GP algorithm was utilized (diagonal) or directly compared to other GP algorithms (lower triangle) out of 91 publications published between 2012-2018 (Supplemental Table 4.1). GP algorithms were included if they were utilized in >1 study. (B) A graphical representation of the GP algorithms included in the study and their relationship to each other. Colors designate if the algorithm identifies only linear (orange) or linear and non-linear (green) relationships. The placement of each algorithm on the tree designates (qualitatively) the relationship between different algorithms. The labels at each branch provide more information about how algorithms in that branch differ from others. rrBLUP, ridge regression Best Linear Unbiased Predictor; BRR, Bayesian Ridge Regression; BA, BayesA; BB, BayesB; BL, Bayesian LASSO; SVR, Support Vector Regression (kernel type: lin, linear; poly, polynomial; rbf, radial basis function); RF, Random Forest; GTB, Gradient Tree Boosting; ANN, Artificial Neural Network. (C) Species and traits included in the benchmark with training population types and sizes and marker types and numbers for each dataset. NAM: Nested Association Mapping. DM: partial diallel mating. GBS: genotyping by sequencing. SNP: single nucleotide polymorphism. HT: height. FT: flowering time. YLD: yield. GM: grain moisture. R8: time to R8 developmental stage. DBH: diameter at breast height. DE: wood density. ST: standability. 4.3 Materials and Methods 4.3.1 Genotype and phenotype data Genotypic data from six plant species were used to predict 3 traits from each species (Figure 4.1C). The maize phenotypic (Hansey et al. 2011) and genotypic (Hirsch et al. 2014) data were from the pan-genome population; maize trait values were averaged over replicate plots. The rice data were from elite breeding lines from the International Rice Research Institute irrigated rice breeding program (Spindel et al. 2015), and dry season trait data averaged over four years were used. The sorghum data were generated from sorghum lines from the US National Plant Germplasm System grown in Urbana, IL (Fernandes et al. 2017), and trait values were averaged over two blocks for this study. The soybean data were generated from the SoyNAM population containing recombinant inbred lines (RILs) derived from 40 biparental populations (Xavier et al. 2016). The white spruce data were obtained from the SmartForests project team, using a SNP-chip developed by the Quebec Ministry of Forests, Wildlife and Parks (Beaulieu et al. 2014). Switchgrass phenotypic (Lipka et al. 2014) and genotypic (Evans et al. 2017) data were generated from the Northern Switchgrass Association Panel (Evans et al. 2015), which contains clones or genotypes from 66 diverse upland switchgrass populations. The genotype data were obtained in the form of biallelic SNPs with missing marker data already dropped or imputed by the original authors. Marker calls were converted when necessary to [-1,0,1] corresponding to [aa, Aa, AA] where A was either the reference or the most common allele. Genome locations of maize SNPs were converted from assembly AGPv2 to AGPv4, with AGPv2 SNPs that did not map to AGPv4 being removed, leaving 332,178 markers for the maize analysis. Phenotype values were normalized between 0 and 1. Lines with missing phenotypic values for any of the three traits were removed. 4.3.2 Genomic selection algorithms To assess what statistical approaches are most frequently used for genomic selection, we conducted a literature search of papers applying genomic selection methods to crop or simulated data from January 2012-February 2018.
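Before detailing the algorithms, here is a rough illustration of the preprocessing described in section 4.3.1 above: biallelic genotype calls are converted to [-1, 0, 1] relative to the most common allele, and phenotypes are min-max normalized. The marker names and input format below are hypothetical, not those of the published datasets.

    import pandas as pd

    # Hypothetical input: one row per line, one genotype string per marker.
    geno = pd.DataFrame({"m1": ["AA", "AT", "TT", "AA"],
                         "m2": ["CC", "CC", "CG", "GG"]})
    pheno = pd.Series([101.2, 95.7, 88.3, 110.4], name="height")

    def encode_marker(calls):
        # A = most common allele; AA -> 1, Aa -> 0, aa -> -1.
        allele_counts = pd.Series(list("".join(calls))).value_counts()
        major = allele_counts.idxmax()
        return calls.str.count(major) - 1  # copies of the major allele (0/1/2) minus 1

    X = geno.apply(encode_marker)                            # markers coded -1/0/1
    y = (pheno - pheno.min()) / (pheno.max() - pheno.min())  # normalized to [0, 1]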
We recorded what statistical approach(es) was(were) applied in each study (Supplemental Table 4.1), allowing us to calculate both the total number of times an approach had been applied and how many times any two approaches were directly compared (Fig 1A). Based on the results from this literature search, nine commonly used 104 statistical approaches were included in this study: rrBLUP, Bayes A (BA), Bayes B (BB), Bayesian LASSO, Bayesian-RR, RF, SVR with a linear kernel (SVRlin), SVR with polynomial kernel (SVRpoly), SVR with radial basis function kernel (SVRrbf). Three additional machine learning approaches, gradient tree boosting (GTB), artificial neural networks (ANN), and convolutional neural networks (CNN), were also included because of their ability to model non- linear relationships. Most linear algorithms were implemented in R packages rrBLUP (Endelman 2011) and BGLR (for Bayesian methods including BRR: Bayesian RR, BA: Bayes A, BB: Bayes B, and BL: Bayesian LASSO) (Pérez and de los Campos 2014). These algorithms vary in what approach they use to address the p >> n problem (Figure 4.1B), for example rrBLUP performs uniform shrinkage on all marker coefficients to reduce variance of the estimator, while BB performs differential shrinkage of the marker coefficients and variable selection. The differences between these algorithms have been thoroughly reviewed previously (de los Campos et al. 2013). Models for Bayesian methods were trained for 12,000 iterations using a burn-in of 2,000. Non-linear algorithms (SVRpoly, SVRrbf, RF, and GTB) and SVRlin were implemented in python using the Scikit-Learn library (Pedregosa et al. 2011). For SVR algorithms, the marker data is mapped into a new feature space using linear or non-linear kernels (i.e. poly, rbf) and then linear regression within that feature space is performed with the goal of minimizing error outside of a margin of tolerated error. The RF algorithm works by averaging the predictions from a “forest” of bootstrapped regression trees, where each tree contains a random subset of the lines and of the markers (Breiman 2001). Related to RF, GTB algorithm uses the principle of boosting (Friedman 2001) to improve predictions from weak learners (i.e. regression trees) by iteratively updating 105 the learners to minimize a loss function, therefore generating better weak learners as training progresses. Artificial Neural Networks (ANNs) were implemented in python using TensorFlow (Girija 2016). The input layer for the ANNs contained the genetic markers for an individual (x; Figure 4.1B), the nodes in the hidden layers were all fully connected to all nodes in the previous and following layers (i.e. Multilayer Perceptron). A non-linear activation function (selected during the grid search, see below) was applied to each node in the input and hidden layers, except the last hidden layer, which was connected with a linear function to the output layer, the predicted trait value (y). To reduce the likelihood of vanishing gradients, when the error gradient, which controls the degree to which the weights are updated during each iteration of training, becomes so small the weights stop updating thus halting model training, in the ANN, the starting weights (w) were scaled relative to the number of input markers using the Xavier Initializer (Glorot and Bengio). Weights were then optimized using the Adam Optimizer (Kingma and Ba 2014) with a learning rate selected by the grid search (described below). 
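The fully connected network just described can be sketched with the Keras API that ships with current TensorFlow (the study itself used lower-level TensorFlow calls); the two hidden layers, learning rate, and dropout rate below are placeholders for values the grid search would choose.

    import tensorflow as tf

    n_markers = 8000  # e.g. after feature selection

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_markers,)),
        # Xavier (Glorot) initialization scales starting weights to the layer
        # size, reducing the likelihood of vanishing gradients.
        tf.keras.layers.Dense(100, activation="sigmoid",
                              kernel_initializer="glorot_normal"),
        tf.keras.layers.Dropout(0.1),
        tf.keras.layers.Dense(50, activation="sigmoid",
                              kernel_initializer="glorot_normal"),
        # The last hidden layer connects linearly to the output node,
        # the predicted trait value.
        tf.keras.layers.Dense(1, activation="linear"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), loss="mse")
    # model.fit(X_train, y_train, validation_split=0.2, epochs=1000) would train it.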
To determine the optimal stopping time for training (i.e. number of epochs), an early stopping approach was used (Prechelt 1998), where the training set was further divided into training and validation, and early stopping occurred when the change in mean squared error (MSE) for the validation set was < 0.1% for 10 epochs using a 10 epoch burn-in. Occasionally, due to poor random initialization of weights, the early stopping criteria would be reached before the network started to converge and the resulting network would predict the same trait value for every line. When this was observed in the validation set the training process was repeated starting with new initialized weights. Convolutional Neural Networks (CNNs) were implemented in Python 3.6 using Tensorflow 2.0. The input layer for the CNNs consisted of the genetic markers for an individual 106 one-hot-encoded so that each possible allele at each locus was represented as present or absent. Because of the large size of the possible hyperparameter space (Supplemental Table 4.2), a randomized search (using RandomizedSearchCV from Scikit-Learn with 5 folds) was performed on rice for predicting height on one replicate, and the best combination of hyperparameters (lowest average mean squared error) from this one search was used for all other species, traits, and replicates. The input data first passed through a convolutional layer, followed by a maximum pooling layer, a dropout layer, a dense (i.e. fully connected) layer, a batch normalization layer, and finally to the output layer containing one node with the predicted trait value. The EarlyStopping function in Keras (https://keras.io/callbacks/#earlystopping) was used to avoid overfitting (min_delta = 0, patience = 10). To reduce the time and memory requirements, CNN models were trained using a batch size = 100 and run for a maximum of 1,000 epochs. As with ANN models, if the early stopping criteria was reached before the network started to converge, the model would be re-run starting with new initialized weights. To incorporate predictions from multiple algorithms into one summary prediction, an ensemble approach was used where the ensemble predicted trait value was the mean predicted trait value from 11 algorithms (EN11: rrBLUP, BRR, BA, BB, BL, SVR, SVRpoly, SVRrbf, RF, GTB, ANN) or five algorithms (EN5: rrBLUP, BL, SVRpoly, RF, ANN). The subset of five consisted of algorithms with differing statistical bases, where rrBLUP represented penalized methods, BL represented the Bayesian approaches, SVRpoly represented non-linear regularized functions, RF represented decision tree based methods, and ANN represented the deep learning approach. This ensemble predicted trait value was then compared to the true trait values to generate performance metrics. A Repeated Measures Analysis of variance (ANOVA) 107 implemented in R was used to compare model performance, where performance of each model on each replicate test set were considered related. 4.3.3 Hyperparameter grid search using cross-validation To obtain the best possible results from each algorithm, a grid search approach was used to determine the combination of hyperparameters that maximized performance for each trait/species combination. No hyperparameter needed to be defined for rrBLUP, BL, or BRR. For rrBLUP, the R package estimates the regularization and kernel parameters from the data. For BL or BRR, parameters for these Bayesian regression methods were also estimated from the data. 
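Returning briefly to section 4.3.2, the early-stopping rule used for the ANNs (stop once the change in validation MSE has been below 0.1% for 10 epochs, after a 10-epoch burn-in) can be sketched as a plain function; this illustrates the criterion only, not the study's TensorFlow implementation, and it assumes the change is measured relative to the previous epoch.

    def should_stop(val_mse, burn_in=10, patience=10, tol=0.001):
        """True once the relative change in validation MSE has stayed below
        tol (0.1%) for `patience` consecutive epochs, after `burn_in` epochs."""
        if len(val_mse) < burn_in + patience + 1:
            return False
        recent = val_mse[-(patience + 1):]
        changes = [abs(curr - prev) / prev
                   for prev, curr in zip(recent[:-1], recent[1:])]
        return all(c < tol for c in changes)

    val_history = []
    for epoch in range(1000):
        # ... one epoch of weight updates on the training split goes here ...
        val_history.append(0.05)  # placeholder validation MSE for this epoch
        if should_stop(val_history):
            break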
Between one and five hyperparameters were tested for the remaining algorithms (Supplemental Table 4.2). To avoid biasing our hyperparameter selection, an 80/20 training/testing approach was used, where 20% of the lines were held out from each model as a testing set and the grid search was performed on the remaining 80% of training lines. For RF, SVRlin, SVRpoly, SVRrbf, and GTB algorithms, 10 replicates of the grid search were run using the GridSearchCV function from Scikit-Learn with 5-fold cross validation. Ten replicates of the grid search were also run for ANN models, where for each replicate 80% of the training data was randomly selected for training the network with each combination of hyperparameters and the remaining 20% used to select the best combination. This whole process (train/test split, grid search) was replicated 10 times, with a different 20% of lines selected as the test set for each replicate. ANOVA implemented in R was used to determine which hyperparameters significantly impacted model performance for each species. 108 4.3.4 Assessing Predictive Performance The predictive performance of the models was compared using two metrics. For the grid search analysis, the mean squared error (MSE) between the predicted (Ŷ) and the true (Y) trait value was used. For the model comparisons, Pearson correlation coefficient (r) between the predicted (Ŷ) and the true trait value (Y) was used as it is the standard metric for GP performance (Heffner et al. 2009; Heslot et al. 2012; Riedelsheimer et al. 2013). It was computed using the cor() function in R for rrBLUP and the Bayesian approaches or the numpy corrcoef() function in Python for the ML and ANN approaches. Only predicted trait values for lines from the test set were considered when calculating r. Summary performance metrics (% of best r, rank, variance) were calculated using the mean predictive performance (r) across all replicates for each GP algorithm for each species/trait combination. 4.3.5 Feature Selection The top 10, 50, 100, 250, 500, 1000, 2000, 4000, and 8000 markers were selected using three different feature selection algorithms: Random Forest (RF), Elastic Net (EN), and BayesA (BA). RF and EN feature selection were implemented in Scikit-Learn and BA was implemented in the BGLR package in R. The EN feature selection algorithm requires tuning of the hyperparameter that controls the ratio of the L1- and L2- penalties (e.g. L1:L2 = 1:10 = 0.1). Because the L1 penalty function performs variable selection by shrinking some coefficients to zero, we started with an initial weight on the L1 penalty of 0.1 and then, if fewer than 8,000 markers remained after variable selection, we reduced it in steps of 0.02 until that criteria was met (a 4,000 marker threshold was used for spruce and soy, which only had 6,932 and 4,240 markers available, respectively). 109 To avoid bias during feature selection, the 80:20 training/testing approach described above was used, where feature selection was performed on the training data and the ultimate performance of models built using the selected markers was scored on the testing set. This was repeated for all 10 testing sets. A repeat measures ANOVA was conducted to compare feature selection algorithms, the number of features selected, and GP algorithms (i.e. independent variables) on model performance (i.e. dependent variable) where replicates were considered repeat measures as they used the same testing set. 
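As an illustration of the feature selection step in section 4.3.5, the sketch below ranks markers by Random Forest importance computed on the training split only and keeps the top k; X_train and y_train are hypothetical arrays, and the Elastic Net route would instead rank markers by the magnitude of their ElasticNet coefficients.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def top_k_markers(X_train, y_train, k=8000, seed=0):
        # Fit RF on the training data only, so that the held-out test lines
        # cannot inflate the apparent prediction accuracy.
        rf = RandomForestRegressor(n_estimators=100, random_state=seed)
        rf.fit(X_train, y_train)
        return np.argsort(rf.feature_importances_)[::-1][:k]

    # keep = top_k_markers(X_train, y_train, k=8000)
    # X_train_fs, X_test_fs = X_train[:, keep], X_test[:, keep]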
One-sided, paired Wilcoxon Signed-Rank tests were conducted to determine if model performance (i.e. dependent variable) increased after feature selection (all vs. top 4,000 for soy and spruce, all vs. top 8,000 for other species) (i.e. independent variable). Resulting p-values were corrected for multiple testing (q-value) (Benjamini and Hochberg 1995). 4.3.6 Initializing ANN starting weights seeded from other GP algorithms In addition to building ANNs with randomly initialized starting weights, we tested the usefulness of seeding the starting weights with information from other GP algorithms (i.e. rrBLUP, BB, BL, or RF). This is an ensemble-like approach in that it utilizes multiple algorithms to make a final prediction. Ensemble approaches often perform better than single algorithm approaches (Dietterich 2000). First, after the data was divided into training, validation, and testing sets and, for species with large p:n ratios (i.e. maize, rice, sorghum, switchgrass) the top 8,000 markers were selected, we applied a GP algorithm (rrBLUP, BB, BL, or RF) to the training data. From that model we extracted the coefficients/importance scores assigned to each marker and used those as the starting weights for 25% of the nodes in the first hidden layer. We also tested seeding starting weights for 50% of the nodes to predict height in all 6 species but found this significantly increased the model error (MSE) on the validation set (ANOVA; p- 110 value= 0.04), so only results from seeding 25% were included. Because we still needed to reduce the likelihood of vanishing gradients, described above, we manually adjusted the scale of the coefficients/importance scores to match the distribution of the starting weights assigned the remaining 75% of the nodes in the first hidden layer by Xavier Initialization. Finally, to reduce bias in the ANN, random noise was introduced to the seeded nodes by multiplying each starting weight with a random number from a normal distribution with a mean =0 and the standard deviation equal to the standard deviation of weights from Xavier Initialization. After the training data was used to determine these seeded starting weights, it was used to train the ANN model, the validation set was used to select the best set of hyperparameters and the early stopping point. Then the final trained model was applied to the testing set and performance metrics were calculated. A repeat measures ANOVA was conducted to test if the seeded or the unseeded ANN models (i.e. independent variable) differed in the amount of variation (standard deviation) in model performance across replicates (i.e. dependent variable), with each species acting as a repeat measurement. 4.3.7 Data and Code Availability For reproducibility, all six datasets along with training/testing designations are available on Dryad (https://doi.org/10.5061/dryad.xksn02vb9) and scripts to run all of the algorithms included in this study on GitHub for future benchmarking. All code used in this study is available on GitHub (https://github.com/ShiuLab/Manuscript_Code/tree/master/2019_GP_Comparison). A README file is included, which provides detailed instructions on how to use the code to generate GP models. Supplemental material available at FigShare (https://doi.org/10.25387/g3.9855590). 111 4.4 Results 4.4.1. 
Hyperparameter grid search is critical, particularly among non-linear algorithms We selected six linear and five non-linear algorithms (note, CNNs are discussed separately) to compare their performance in GP problems (see Methods). While some model parameters can be estimated from the data (de los Campos et al. 2013), other parameters, referred to as hyperparameters, have to be user-defined (Chapelle et al. 2002; Kuhn and Johnson 2013). This was the case for eight of the algorithms in our study: BA, BB, SVRlin, SVRpoly, SVRrbf, RF, GTB, and ANN. For these algorithms we conducted a grid search to evaluate the prediction accuracy of models using every possible combination of hyperparameter values (for lists of hyperparameters, see Supplemental Table 4.2). To produce unbiased estimates of prediction accuracy, the grid search was performed within the training set so that no data from the testing set was used to select hyperparameter values. Then we used the best set of hyperparameters from the grid search to build models using genotype and phenotype data from six plant species. This allowed us to compare the predictive performance of all algorithms across the benchmark datasets. To determine which hyperparameters significantly impacted model performance, we tested for changes in model performance (mean squared error; MSE) across the hyperparameter space for each algorithm/species/trait combination using Analysis of Variance (ANOVA). The degrees of freedom hyperparameter for BA and BB (both linear algorithms), which influences the shape of the prior density of marker effects (de los Campos et al. 2013), had no significant impact on model performance (ANOVA: p-value = 0.41~1.0; Supplemental Table 4.3). Other parameters for the Bayesian algorithms were determined using rules built into the BGLR package that account for factors such as phenotypic variance and the number of markers (p) (Pérez and de los Campos 2014) and were therefore not considered in our grid search. However, 15 of the 16 hyperparameters tested for the non-linear algorithms significantly impacted performance in at least one species (Supplemental Table 4.3, Supplemental Figure 4.1A-C). Using height in maize as an example, we found that the SVRpoly algorithm performed better (i.e. lower MSE) using 2nd degree polynomials compared to 3rd degree polynomials (p-value = 1×10^-21, Figure 4.2A). For RF-based models, the maximum depth (max depth) of decision trees allowed significantly impacted performance (p-value = 1×10^-3, Supplemental Table 4.3), with shallower trees typically performing better (Figure 4.2B). This pattern was also observed in RF models predicting height for rice, spruce, and soy (p-value = 1×10^-66~5×10^-4, Supplemental Table 4.3, Supplemental Figure 4.1B). Because shallower decision trees are less complex, they tend not to overfit, suggesting the best hyperparameters for RF are those that reduce overfitting. The only hyperparameter from the non-linear algorithms that did not impact performance was the rate of dropout (a useful regularization technique to avoid overfitting) for ANN models, where there was no significant change in model performance when two different rates (10% and 50%) were used (p-value = 0.24 ~ 0.97, Supplemental Table 4.3). 4.4.2 ANN is the most significantly impacted by hyperparameter choice Hyperparameters for SVRlin, SVRpoly, SVRrbf, RF, and GTB tended to have moderate effects on MSE, while ANN hyperparameters often caused substantial changes in MSE (Figure 4.2A-C; Supplemental Figure 4.1A-C).
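For readers implementing a comparable search, the cross-validated grid search described in section 4.3.3 can be sketched as follows for SVR with an RBF kernel; the handful of hyperparameter values shown is illustrative, not the full grid of Supplemental Table 4.2.

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVR

    param_grid = {"C": [0.1, 1, 10, 100],        # penalty
                  "gamma": [1e-5, 1e-4, 1e-3]}   # RBF kernel width

    # 5-fold cross-validation on the training lines only; the held-out 20%
    # test set is never used to pick hyperparameters.
    search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                          scoring="neg_mean_squared_error", cv=5)
    # search.fit(X_train, y_train)
    # search.best_params_ then holds the winning hyperparameter combination.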
Across the six species, the median variance in MSE across the hyperparameter space for ANN was 6×10^6, but ranged from 3×10^-3 to 0.1 for the other GP algorithms (Supplemental Figure 4.1D). For example, for predicting height in maize, SVRpoly models built using the 2nd degree polynomial outperformed those built using the 3rd degree polynomial with a decrease in MSE of ~0.05 (Figure 4.2A), while for ANN models, the hyperparameter combinations that performed the best (i.e. Sigmoid activation function and no L2 regularization) resulted in models with MSEs that were >500 lower than the worst performing model (Rectified Linear Unit (ReLU) activation function, no L2 regularization, and large numbers of hidden nodes; Figure 4.2C). Figure 4.2. Grid search results for height in maize and overall GP algorithm performance for predicting height across species. (A) Average mean squared error (MSE) over the hyperparameter space (penalty, C) for Support Vector Regression (SVR) based models predicting height in maize. SVRrbf and SVRpoly results are shown using gamma = 1×10^-5 and 1×10^-4, respectively. Poly: polynomial. RBF: Radial Basis Function. (B) Distribution of MSEs across hyperparameter space for Random Forest (RF; left) and Gradient Tree Boosting (GTB; right) as the maximum features available to each tree (Max Features) and maximum tree depth (color) change. GTB results are shown using a learning rate = 0.01. (C) Average MSE across hyperparameter space for ANN models with different network architectures, degrees of regularization (dropout or L2), using either the Rectified Linear Unit (ReLU; left) or Sigmoid (right) activation function. (D) Mean performance (Pearson's Correlation Coefficient: r, text) for predicting height and percent best r (colored box, top algorithm for each species = 100% (red)). White text: the best r values. Violin plots show the median and distribution of r values for each trait (right) and algorithm (bottom). This highlighted that, while hyperparameter selection is necessary for all non-linear algorithms, it is especially critical for building ANNs for GP problems. Using the best set of hyperparameters for each model, we next compared the predictive performance (Pearson's correlation coefficient, r, between predicted and true trait values) of each algorithm on plant height. As with past efforts to benchmark GP algorithms (Heslot et al. 2012; Neves et al. 2012), no one algorithm always performed the best (white bolded; Figure 4.2D). For example, while rrBLUP performed best for maize, sorghum, and switchgrass, BA performed best for soy, and RF performed best for rice and spruce. Notably, ANNs substantially underperformed compared to other non-linear algorithms, with a median performance at 84% of the best r across the six species (i.e. 16% below the best performing algorithm for a given trait/species). However, among the six species, ANN performed the best in soy (r = 0.44) relative to the species' best algorithm, BA (r = 0.47, Figure 4.2D). Soy has the largest number of training lines among the six species (5,014) and has a marker to training line ratio close to one (Figure 4.1C). Thus, we hypothesized the poor performance of the ANN models was in part due to our inability to train a network with so many features (markers) and so little training data (lines). During ANN model training, the weights assigned to each connection between nodes in neighboring layers of the network have to be estimated.
Because every input marker is connected to every node in the first hidden layer, including more markers in the model will require more weights to be estimated, resulting in a more complex network that is more likely to underfit. In an ideal situation, to account for the complexity in these large networks, five to ten times more instances (lines) than features (markers) would need to be available for training (Klimasauskas 1993). Alternatively, one can reduce model complexity by only including markers that are most likely to be associated with the trait using feature selection methods. 4.4.3 Feature selection improves performance of ANN models ANNs and sometimes other non-linear algorithms performed poorly compared to linear methods, which could be due to an insufficient number of training lines relative to the number of markers. To address this, we used feature selection to identify and select the markers most associated with trait variation. Because the number of markers associated with a trait depends on the genetic architecture of the trait and is not typically known, models were built using a range of numbers of markers (p = 10~8,000) and were compared to models built using all available markers from each species. Because performing feature selection on the training and testing data can artificially inflate prediction accuracies (Bermingham et al. 2015), feature selection was conducted on the training set only. This was repeated 10 times, using a different subset of lines for testing for each replicate (see Methods). Three feature selection algorithms (RF, BayesA, and Elastic Net (EN)) were compared to predict height in maize, the species with the largest number of markers (p) relative to training lines (n) (p:n = 850, Figure 4.1C). While each algorithm selected a largely different subset of markers (Figure 4.3A, Supplemental Figure 4.2A), the degree of overlap was significantly greater than random expectation. To demonstrate this, we randomly drew three sets of 8,000 maize markers 10,000 times and counted how many markers were present in all three sets; the 99th percentile of this random overlap was 10, whereas we observed an average of 220 overlapping markers across replicates using the three feature selection approaches. When the different feature selection subsets were used to predict height in maize, there was a significant interaction between the number of available markers (p) and the feature selection method (repeat measures ANOVA: p-value = 1.7×10^-12). Figure 4.3. Impact of feature selection on GP algorithm performance. (A) Average number of overlapping markers in the top 8,000 markers selected by three feature selection algorithms for predicting height in maize across ten replicates. EN: Elastic Net. (B) Change in ANN predictive performance (r) at predicting height in maize as the number of input markers (p) selected by three feature selection algorithms (BayesA: BA, EN, and Random Forest: RF) increases. Dashed line: mean r when all 332,178 maize markers were used. (C) Mean r of rrBLUP, SVRlin, RF, GTB, and ANN models for predicting height using subsets or all (X-axis) markers as features across 10 replicate feature selection and ML runs for each of six species with their ratios of numbers of markers (p) to numbers of lines (n) shown. Data points were jittered horizontally for ease of visualization. (D) The significance (-log10(q-value), paired Wilcoxon Signed-Rank test) of the difference in r between models from different GP algorithms (colored as in Figure 4.3C) generated using a subset of 4,000 or 8,000 markers vs. all markers as input. Dotted line designates significant differences (p-value < 0.05). Exploring this interaction further, we found that, while the feature selection algorithms performed similarly with large n, RF tended to perform the best when fewer markers were selected for GP (Figure 4.3B; Supplemental Figure 4.2B) and was therefore used to test the impact of feature selection on predicting height in the other five species. For species with a low p:n ratio (i.e. soy and spruce), for all GP algorithms tested, as p increased the model performance tended to increase continuously (e.g. all GP algorithms in spruce) or, in some cases, the model performance reached a maximum (or a plateau) quickly (e.g. in soy after 2,500 markers were used) (Figure 4.3C). For these species, there was no significant improvement in performance after feature selection (all vs. top 4,000) using any GP algorithm (one-sided, paired Wilcoxon Signed-Rank test: q-value = 0.98 ~ 0.99; Figure 4.3D). For example, ANNs built using all 6,932 spruce markers performed no better than those built using the top 4,000 markers (p-value = 0.98). For species with a large p:n ratio (i.e. maize, rice, sorghum, and switchgrass), a similar pattern was observed for rrBLUP, SVRlin, and GTB, where performance increased or reached a plateau as p increased and no significant improvement in performance was found after feature selection (p = 8,000) (q-value = 0.28 ~ 0.99; Figure 4.3D). However, for these four species, feature selection improved the performance of ANN models (q-value = 0.019 ~ 0.047; Figure 4.3D). For example, after feature selection, prediction of height in maize using ANNs improved from r = 0.17 to r = 0.41, a 141% increase. Ultimately, performing feature selection prior to ANN training for these four datasets with large p:n ratios improved ANN performance (median r at 89% of the best r for each of the six species) compared to ANNs without feature selection (84% of the best r). Therefore, for the GP benchmark analysis, feature selection was performed prior to model building for additional traits for maize, rice, sorghum, and switchgrass, and the top 8,000 markers were used. Because feature selection only improved the performance of RF models in sorghum and switchgrass, we did not perform feature selection before training RF models in the full benchmark study. While feature selection notably improved ANN performance, ANNs still often underperformed compared to other GP algorithms (Figure 4.3C), meaning they were unable to learn even the linear relationships between markers and traits that were found using the linear-based algorithms. Because ANNs should theoretically at least match the performance of linear algorithms, this suggests that the ANN hyperparameters were not optimal. Furthermore, we found that, even after feature selection, there was greater variation in performance across replicates for ANN models compared to rrBLUP, SVRlin, RF, and GTB (Supplemental Figure 4.2C-D), indicating the ANN models did not always converge on the best solution. One potential reason for this is that the final trained network can be heavily influenced by the initial weights used in the ANN, which are selected randomly. In addition, while random weight initialization, the procedure we have used thus far, reduces bias in the network, it can also result in some networks converging on a local, rather than global, optimal solution.
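The paired, one-sided tests used above to ask whether feature selection improved performance can be sketched with SciPy plus a Benjamini-Hochberg correction; the per-replicate r values below are placeholders, and in practice one p-value per species/algorithm comparison would enter the correction.

    import numpy as np
    from scipy.stats import wilcoxon
    from statsmodels.stats.multitest import multipletests

    # Hypothetical r values across 10 replicate test sets for one comparison.
    r_all_markers = np.array([0.40, 0.38, 0.41, 0.39, 0.42,
                              0.40, 0.37, 0.41, 0.39, 0.40])
    r_top_markers = np.array([0.43, 0.41, 0.42, 0.44, 0.43,
                              0.42, 0.41, 0.45, 0.42, 0.43])

    # One-sided and paired: did r increase on the same replicate test sets?
    stat, p = wilcoxon(r_top_markers, r_all_markers, alternative="greater")

    pvals = [p]  # one p-value per comparison, collected across all tests
    reject, qvals, _, _ = multipletests(pvals, method="fdr_bh")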
4.4.4 Non-random initialization of ANN starting weights and convolutional layers improve ANN performance for some species To reduce the likelihood of ANNs converging to locally optimal solutions, we developed an approach that allowed the ANNs to utilize the relationship between markers and traits determined by another GP algorithm. In this approach, a GP algorithm was applied to the training lines, and the coefficient or importance score assigned to each marker from this algorithm was used to seed the starting weights (Figure 4.4A). Four GP algorithms were tested to seed the weights: rrBLUP, BB, BL, and RF (referred to as ANNrrBLUP, ANNBB, ANNBL, and ANNRF, respectively). Because this approach could predispose the networks to only learn the relationships already identified by the seed algorithm, two steps were taken to re-introduce randomness into the network (see Methods). First, the seeded approach was only used to initialize starting weights for 25% of the nodes in the first hidden layer, while connection weights to the remaining 75% of nodes were initialized randomly as before. Second, noise was infused into the starting weights for the 25% of nodes that were seeded. Applying this approach to predict plant height, we found that ANN performance improved for three of six species (Figure 4.4B). For example, the average performance for rice without seeding (ANN) was r = 0.25 and with seeding from BL (ANNBL) was r = 0.32, a 28% improvement, while for sorghum, ANNBL had <0.1% improvement over the original ANN method. Figure 4.4. Description and performance results of the seeded ANN approach. (A) An overview of the seeded ANN approach. The network in the top left is an example of a fully connected ANN with 6 input nodes (i.e. 6 markers), two hidden layers, and one output layer (i.e. predicted trait value). The blue node in the first hidden layer represents an example node that will have seeded weights. For this node, the weights (w) connecting each input node to the hidden node will be seeded from the coefficient/importance for each marker as determined by another GP algorithm using the training data. b: bias, which helps control the value at which the activation function will trigger. (B) The distribution of model performance (r) using all-random (None) or 25% seeded (rrBLUP, BayesB, BL, RF) weight initialization, and convolutional neural networks (CNN). The mean performance of the overall top performing algorithm (i.e. not necessarily ANN) is shown as a dotted red line. Seeding ANN models did not significantly reduce the amount of variation in model performance across replicates (repeated measures ANOVA: p-value = 0.39, Supplemental Table 4.4). Ultimately, seeded ANN models had a median performance between 89% and 90% of the best r for each species (compared to 89% with random initialization, Figure 4.4B). While this represented only a moderate improvement, we included the seeded ANN approach in the benchmark analysis because of how substantial the improvement was for some species (i.e. rice). Another deep learning strategy for reducing the complexity of GP problems, and consequently decreasing the likelihood of converging on local optima, is to use convolutional and pooling layers to summarize local patterns of genetic markers and learn from these summaries (Ma et al. 2018). We tested this approach by training Convolutional Neural Networks (CNNs) to predict plant height (Supplemental Figure 4.3A). Notably, feature selection (p = 8,000) had either no or a negative impact on CNN performance. For example, the average performance of CNNs at predicting height in maize, the species with the most genetic markers, was r = 0.39, but dropped to r = 0.37 after feature selection.
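A minimal Keras sketch of the CNN architecture described in section 4.3.2 (one-hot input, then convolution, max pooling, dropout, a dense layer, batch normalization, and a single output node) is below; the filter count, kernel size, and layer widths are placeholders for values a randomized search would pick.

    import tensorflow as tf

    n_markers, n_alleles = 8000, 3  # one-hot: each allele coded present/absent per locus

    cnn = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_markers, n_alleles)),
        tf.keras.layers.Conv1D(filters=16, kernel_size=18, activation="relu"),
        tf.keras.layers.MaxPooling1D(pool_size=4),
        tf.keras.layers.Dropout(0.25),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(1),  # predicted trait value
    ])
    cnn.compile(optimizer="adam", loss="mse")

    # Early stopping as described in Methods (min_delta = 0, patience = 10).
    stop = tf.keras.callbacks.EarlyStopping(min_delta=0, patience=10)
    # cnn.fit(X_onehot, y, batch_size=100, epochs=1000,
    #         validation_split=0.2, callbacks=[stop])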
CNNs performed better than ANNs at predicting height in two of six species (yellow; Figure 4.4B), with the biggest improvement in rice, where the average performance increased from r = 0.25 using ANNs to r = 0.32 using CNNs, a 32% improvement. While CNN models did not reduce the amount of variation in model performance across replicates (repeated measures ANOVA: p-value = 0.08, Supplemental Table 4.4), we included CNNs in the final benchmark analysis because of the promising results in rice and switchgrass. 4.4.5 No one GP algorithm performs best for all species and traits Having established best practices for hyperparameter and feature selection for our datasets, we next compared the performance of all GP algorithms for predicting three traits in each of the six species. For maize, rice, and soy, these traits included height, flowering time, and yield (Figure 4.1C). For species where data was not available for one or more of these traits, other traits were used (see the panel labeled "Others", Figure 4.5A). As with past efforts to benchmark GP algorithms (Heslot et al. 2012; Neves et al. 2012), different algorithms performed best for different species/trait combinations (Figure 4.5A; Supplemental Table 4.5). Thus, we utilized the predictive power of multiple algorithms to establish an ensemble prediction using all (except CNN; EN11) or a subset of five (EN5) algorithms (see Methods). The ensemble models consistently performed well, with EN5 or EN11 being the best (three) or tied for the best (nine) algorithm for 12 of the 18 species/trait combinations included in the benchmark, and the ensembles had a median performance rank of 3 (Figure 4.5B; Supplemental Table 4.6). For the remaining six species/trait combinations, where EN5 or EN11 were not among the best performers, they tended to perform only slightly worse (median % of best r = 99.2%, Figure 4.5A). This suggests that ensemble-based predictions are more stable and more likely to result in better trait predictions than a single algorithm. Focusing on the species/trait combinations where one of the non-ensemble algorithms was or tied for best, we found that a linear algorithm performed best for five of the species/trait combinations, a non-linear algorithm performed best for four species/trait combinations, and both a linear and a non-linear algorithm performed equally well for the remaining six species/trait combinations (Figure 4.5B). This finding suggests that linear and non-linear algorithms are equally well suited for GP. The linear algorithms BRR and BA performed best overall, being among the top performers for 9 and 8 traits, respectively, and with the top two median ranks of 5 and 4.5, respectively (Supplemental Table 4.6). The top performing non-linear algorithm was SVRpoly, which was among the top performers for 8 traits and had a median rank of 6. There was notably greater performance variation across species/traits for non-linear algorithms (mean variance = 1.03%) compared to linear algorithms (mean variance = 0.65%) (Supplemental Table 4.6). Figure 4.5. Comparison of algorithms for predicting additional traits. (A) Mean model performance (r; text) for each species/trait combination (y-axis) for each GP algorithm (x-axis). White text: r of the best performing algorithm(s) for a species. Colored boxes: percent of best performance (r) for a species, with the top algorithm for each species = 100% (red). The median % of best performance for each GP algorithm for each type of trait (i.e. height, developmental timing, yield, other) is shown below each heatmap. GM: sorghum grain moisture. DBH and DE: diameter at breast height and wood density, respectively, for spruce. ST: standability for switchgrass. (B) Top left: summary of the number of species/trait combinations that were predicted best by an ensemble (gray) or a non-ensemble model (yellow), or predicted equally well by both (purple). Bottom right: among non-ensemble models that performed or tied for the best, the number of species/trait combinations that were predicted best by a linear (blue) or a non-linear model (green) or predicted equally well by both (orange). (C) Percent of replicates where one GP algorithm (y-axis, winner) outperformed another GP algorithm (x-axis, loser) for predicting height in switchgrass. Orange and cyan texts: linear and non-linear algorithms, respectively. (D) Hierarchical clustering of GP algorithms based on mean predictive performance across all species/trait combinations. Algorithms colored as in (C). For example, SVRrbf performed poorly at predicting developmental timing traits (median 83% of the best r); however, it had or was tied for the best prediction for three of the four "other" traits (median 100% of the best r) (Figure 4.5A). Results from ANN models using randomly initialized (ANN) and BB seeded (ANNBB) weights are shown because ANNBB had the best performance of the seeded ANN models (see Supplemental Table 4.6 for results from other seeded ANNs). Notably, none of the randomly initialized ANN (median rank = 13.5), the ANNBB (median rank = 13), or the CNN (median rank = 15.5) models performed best for any trait (Supplemental Table 4.6). One limitation of comparing the mean score or performance rank is that small but consistent differences in model performance could be missed. To account for this, we also calculated the number of times an algorithm outperformed another algorithm for each trait across the replicates. Using this metric, we were able to identify algorithms that consistently outperformed others for a given trait/species combination (Figure 4.5C, Supplemental Figure 4.4). We frequently observed that linear algorithms had higher win percentages than nonlinear algorithms; this was the case, for example, for all three traits in maize and soybean (Supplemental Figure 4.4). However, there were plenty of exceptions. RF and SVRrbf had higher win percentages than linear algorithms for predicting height and diameter at breast height (DBH) in spruce, and ANNBB had a higher win percentage than all algorithms except BA and BB for predicting flowering time in rice (Supplemental Figure 4.4). In a few cases, assessing win percentages allowed us to identify winners when mean predictive performance (r) was tied. For example, for predicting height in switchgrass, SVRpoly had the same average performance (r = 0.61) as several of the linear algorithms (e.g. rrBLUP and BA); however, it outperformed those algorithms in 70-80% of replicates (Figure 4.5C). In order to determine which algorithms perform similarly, we performed hierarchical clustering of the algorithms based on their performance across the 18 species/trait combinations (from Figure 4.5A). Interestingly, linear and non-linear algorithms did not clearly separate from each other (Figure 4.5D).
For example, rrBLUP and SVRlin were more similar to the neural network based models (i.e. CNN and ANNBB), than they were to the linear Bayesian algorithms (i.e. BA, BB, BL, and BRR). Notably, while the Bayesian algorithms tended to cluster together closely performance-wise, the non-linear algorithms tended to have a greater distance between them. Finally, in order to identify if algorithm performance was similar for specific types of traits (e.g. whether similar algorithms perform well at predicting traits related to developmental timing) or across species/population composition (e.g. whether similar algorithms perform well on diversity panels), we performed hierarchical clustering of each species/trait based on performance of all 14 algorithms (from Figure 4.5A). Surprisingly, species/trait combinations with similar patterns of algorithm performance were often not the same species, trait, or 126 population type (Supplemental Figure 4.5), suggesting that we cannot generalize easily the differences in performance based on species, trait, or population type. 4.5 Discussion We conducted a benchmarking comparison of GP algorithms on 18 species/trait combinations that differ in the type and size of the training data set and of the marker data available. Similar to previous GP algorithm benchmark studies conducted on smaller datasets (Heslot et al. 2012; Blondel et al. 2015), a key result from this analysis is that no one model performs best for all species and all traits. We further demonstrate that, while similar algorithms perform similarly across the 18 species/trait combinations, algorithm performance was not clearly related to the trait type or population composition. With that said, linear algorithms tend to perform consistently well, while the performance of non-linear algorithms varied widely by trait. Studies of gene networks have shown that non-additive interactions (e.g. epistasis, dominance) are important for development and regulation of complex traits (Holland 2007; Monir and Zhu 2018). One may expect approaches that can consider non-linear combinations would therefore be better suited for modeling complex trait. This was not the case and we found the inconsistency of non-linear algorithms surprising. We have three, non-mutually exclusive, explanations for why linear algorithms often outperform non-linear algorithms. First, the traits included in this study vary in their genetic architecture (i.e. the number and distribution of allele effects), therefore we may be observing that linear algorithms outperform non-linear algorithms when the trait has a predominantly additive genetic basis. Second, there is evidence that even highly complex biological systems generate allelic patterns that are consistent with a linear, additive genetic model because of the discrete nature of DNA variation and the fact that many markers have extreme allele frequencies 127 (Hill et al. 2008). The proportion of dominance and epistatic variance that can be captured by an additive (i.e. linear) model increases when allele frequencies are extreme (Hill et al. 2008). This phenomenon is even more important with inbred lines (e.g. soy and rice); where, at each locus there are only 2 possible variants (e.g. AA and TT); thus, the additive model fully captures the single-locus genetic variance. However, the fraction of epistatic variance that can be captured by an additive model depends on how many multi-locus genotypes are present in the data and this depends on allele frequencies. 
Thus, the distribution of allele frequency (which due to mutation, selection, and drift is often enriched at extreme values) is one of the reasons why additive models often capture and perform very well at predicting traits that at the biological level are affected by complex epistatic networks. Finally, a third explanation is that the amount of training data available for most GP problems was insufficient for learning non-linear interactions between large numbers of markers, therefore the linear models, which focus on modeling linear relationships, outperform the non-linear models. Three findings from our study suggest that limited training data plays a role. First, we found that non-linear algorithms performed better at predicting traits in species with a small marker number to population size (p:n) ratio. For example, RF, SVRpoly, and SVRrbf performed best at predicting traits in spruce and ANN models tended to perform better at predicting traits in soy, the species with the second smallest and smallest p:n, respectively. Second, the ANN models significantly improved after feature selection. This was not the case for other algorithms in our study or with previous efforts to use feature selection for GP (Vazquez et al. 2010; Bermingham et al. 2015). For example, for predicting traits in Holstein cattle, the top 2,000 markers had only 95% of the predictive ability of all the markers using BL (Vazquez et al. 2010). With a fixed training data size, prediction accuracy is a function of how much genetic 128 variation is captured by markers in linkage disequilibrium with quantitative trait loci and the accuracy of the estimated effects (Goddard 2009). Because feature selection removes markers from the model, such decreases in performance after feature selection for non-ANN models are likely due to the reduction in the amount of genetic variation captured without a subsequent increase in the accuracy of the estimated effects. However, we hypothesize that feature selection significantly improved performance for ANNs because it improved the accuracy of the estimated effects (i.e. the connection weights) more than it reduced the amount of genetic variation captured. Third, ANNs that have been trained on small datasets often have unstable performance likely because ANNs are sensitive to the initialized weight values when they do not have enough training data to learn from (LeBaron and Weigend 1998; Shaikhina and Khovanova 2017). We observed greater instability in performance across replicates for ANNs compared to other algorithms (S2C-D Figure), suggesting that our ANN models may have benefitted from additional training data. However, a recent study involving large sample size (n~80,000) in humans compared linear models with two types of ANN algorithms, multilayer perceptron and convolutional neural networks, and did not find any clear superiority of the ANN methods relative to linear models, if anything the linear model offered higher predictive power than the ANNs (Bellot et al. 2018). While they also found that feature selection improved the performance of their ANN models, using the top 10k of the 50k markers, these models still did not outperform the linear models (Bellot et al. 2018). Given that these results are from a single study in humans, we believe it will be informative to benchmark ANNs on a larger crop dataset in the future. 
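To make the weight-seeding strategy discussed above concrete, here is a simplified sketch of the scheme from section 4.3.6: marker effects from another GP model are rescaled to the spread of the Xavier-initialized weights, jittered with multiplicative noise, and written into 25% of the first hidden layer's columns. All array shapes and names are illustrative.

    import numpy as np

    def seed_first_layer(W, marker_effects, frac=0.25, rng=None):
        """W: Xavier-initialized weights (n_markers x n_hidden).
        marker_effects: per-marker coefficients/importances from e.g. rrBLUP or RF."""
        rng = rng if rng is not None else np.random.default_rng(0)
        n_seed = int(W.shape[1] * frac)  # seed 25% of the hidden nodes
        sd = W[:, n_seed:].std()         # spread of the randomly initialized weights
        scaled = marker_effects / marker_effects.std() * sd
        for j in range(n_seed):
            # Multiplicative noise (mean 0, sd matching the Xavier weights) keeps
            # the seeded nodes from being identical and re-introduces randomness.
            W[:, j] = scaled * rng.normal(0.0, sd, size=scaled.shape)
        return W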
While there is a great deal of excitement about the uses of deep learning in the field of genetics, much work remains to be done to improve the performance of deep learning-based models. In this study, we identified dimensionality as a major limitation to training ANNs for GP. Additional areas of deep learning research also need to be further explored. For example, in this study we limited the ANN hyperparameter space searched because the grid search method was too computationally intensive to be more thorough. Because changes in hyperparameters had a large impact on model performance, further hyperparameter tuning could lead to better performing models. For example, we limited our search to nine possible network architectures with between one and three hidden layers, each containing between 5 and 100 nodes (Supplemental Table 4.1), but it is possible that ANNs with different network architectures, such as more hidden layers or different combinations of layer sizes, could have performed better. Similarly, given that the hyperparameter space for CNN models was only tested for one species and trait (height in rice), it is likely that model-specific hyperparameter selection could improve the performance of CNN models beyond what we were able to achieve here.

In summary, we provided a thorough comparison of 12 GP algorithms and two ensembles for predicting diverse traits in six plant species with a range of marker types and numbers and population types and sizes. We found that no GP algorithm was best for all species/trait combinations and that trait type and population type were not closely associated with which algorithms worked best. While neural network approaches did not tend to outperform linear or other non-linear models, strategies to tailor neural networks for GP problems (e.g. non-random initialization of starting weights, convolutional and pooling layers) show promise. Unlike previous GP algorithm benchmark studies (Heslot et al. 2012), we found that the performance of ensemble models, generated by combining predictions from multiple individual GP algorithms, consistently tied with or exceeded the performance of the best individual algorithm. Taken together, these findings lead us to recommend that breeders test the performance of multiple algorithms on their training population to identify which algorithm or combination of algorithms performs best for the traits important to their breeding program.

4.6 Acknowledgements

We thank Peipei Wang and John Lloyd from the Shiu lab, Gabriel Rovere from the MSU QuantGen group, and Fouad Bahrpeyma from the Insight Centre for their valuable suggestions for our project. This work was supported by the National Science Foundation (NSF) Graduate Research Fellowship [Fellow ID: 2015196719] and Graduate Research Opportunities Worldwide (GROW) Fellowship to C.B.A.; an NSF Plant Genomics Research Experiences for Undergraduates award to E.B.; the U.S. Department of Energy Great Lakes Bioenergy Research Center [BER DE-SC0018409] and National Science Foundation [IOS-1546617, DEB-1655386] to S.-H.S.; and the National Institutes of Health [R01GM099992, R01GM101219] to G.D.L.C.

APPENDIX

Supplemental Figure 4.1. Height prediction performance for non-linear GP algorithms during hyperparameter grid search. (A) Average (line) and standard deviation (shadow) of mean squared error (MSE) over the hyperparameter space for SVR-based models predicting height as the penalty (C) (X-axis) changes. SVRrbf and SVRpoly results are shown using gamma = 1x10-5 and 1x10-4, respectively.
(B) Distribution of the MSE across the hyperparameter space for RF (left) and GTB (right) as the maximum features available to each tree (Max Features; X-axis) and maximum tree depth (color) change. GTB results are shown using a learning rate of 0.01. (C) Average MSE across the hyperparameter space for ANN models with different network architectures (X-axis) and degrees of regularization using dropout (D.o.) or L2 regularization (L2), using either the Rectified Linear Unit (ReLU; left) or Sigmoid (right) activation function. (D) Distribution of the variance in MSE across the hyperparameter space for predicting height in each species using each GP algorithm. Black bars represent the median variance across species for each GP algorithm.

Supplemental Figure 4.2. Comparison of feature selection algorithms and change in performance variation after feature selection. (A) Average number of overlapping markers in the top markers (p) selected by three different feature selection algorithms for predicting height in maize across ten replicates for p = 10 ~ 8,000. (B) Change in model performance (r) for five GP algorithms at predicting height in maize as the number of input markers (p) selected by three different feature selection algorithms increases. Dashed line: the mean r for each GP algorithm when all maize markers were used. Colored lines: mean r of models using feature selection subsets, with algorithms colored as in (A). Colored areas: standard deviation around the mean. (C) Distribution and median of the standard deviation of model performance (r) across replicates for all feature selection subsets (p = 10 ~ 8,000) combined across all species for each GP algorithm. (D) Distribution and median of the standard deviation of model performance across replicates for all feature selection subsets (p = 10 ~ 8,000) by species for each GP algorithm.

Supplemental Figure 4.3. Hyperparameter random search results from predicting height in rice. (A) Overview of the architecture and parameters used to train the CNN models. The parameters listed below each layer (black) were either pre-set (value shown in purple) or selected using RandomSearchCV (values tested shown in red). (B) Average mean squared error (MSE) across the hyperparameter space for predicting height in rice (replicate #1). RMSprop: Root Mean Square propagation.

Supplemental Figure 4.4. Number of wins between each pair of GP algorithms. Percent of replicates where one GP algorithm (y-axis) outperformed another GP algorithm (x-axis) for predicting each species/trait combination.

Supplemental Figure 4.5. Similarity between traits and datasets in model performance. Hierarchical clustering of trait:species combinations based on mean predictive performance across all algorithms included in the benchmark. HT: height. DT: developmental timing. YLD: yield. GM: grain moisture. DBH: diameter at breast height. DE: wood density. ST: standability.

REFERENCES

Angermueller, C., T. Pärnamaa, L. Parts, and O. Stegle, 2016 Deep learning for computational biology. Molecular Systems Biology 12: 878.
Beaulieu, J., T. K. Doerksen, J. MacKay, A. Rainville, and J. Bousquet, 2014 Genomic selection accuracies within and between environments and small breeding groups in white spruce. BMC Genomics 15: 1048.
Bellot, P., G. de los Campos, and M.
Pérez-Enciso, 2018 Can Deep Learning Improve Genomic Prediction of Complex Human Traits? Genetics 210: 809–819.
Benjamini, Y., and Y. Hochberg, 1995 Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Series B Stat. Methodol. 57: 289–300.
Bermingham, M. L., R. Pong-Wong, A. Spiliopoulou, C. Hayward, I. Rudan et al., 2015 Application of high-dimensional feature selection: evaluation for genomic prediction in man. Scientific Reports 5: 10312.
Blondel, M., A. Onogi, H. Iwata, and N. Ueda, 2015 A Ranking Approach to Genomic Selection. PLoS ONE 10: e0128570.
Breiman, L., 2001 Random Forests. Machine Learning 45: 5–32.
de los Campos, G., D. Gianola, G. J. M. Rosa, K. A. Weigel, and J. Crossa, 2010 Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods. Genet. Res. 92: 295–308.
de los Campos, G., J. M. Hickey, R. Pong-Wong, H. D. Daetwyler, and M. P. L. Calus, 2013 Whole-Genome Regression and Prediction Methods Applied to Plant and Animal Breeding. Genetics 193: 327–345.
de los Campos, G., H. Naya, D. Gianola, J. Crossa, A. Legarra et al., 2009 Predicting quantitative traits with regression models for dense molecular markers and pedigree. Genetics 182: 375–385.
Chapelle, O., V. Vapnik, O. Bousquet, and S. Mukherjee, 2002 Choosing Multiple Parameters for Support Vector Machines. Mach. Learn. 46: 131–159.
Desta, Z. A., and R. Ortiz, 2014 Genomic selection: genome-wide prediction in plant improvement. Trends in Plant Science 19: 592–601.
Dietterich, T. G., 2000 Ensemble methods in machine learning, in Proceedings of the International Workshop on Multiple Classifier Systems.
Ehret, A., D. Hochstuhl, D. Gianola, and G. Thaller, 2015 Application of neural networks with back-propagation to genome-enabled prediction of complex traits in Holstein-Friesian and German Fleckvieh cattle. Genet. Sel. Evol. 47: 22.
Endelman, J. B., 2011 Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4: 250–255.
Evans, J., E. Crisovan, K. Barry, C. Daum, J. Jenkins et al., 2015 Diversity and population structure of northern switchgrass as revealed through exome capture sequencing. Plant J. 84: 800–815.
Evans, J., M. D. Sanciangco, K. H. Lau, E. Crisovan, K. Barry et al., 2017 Extensive Genetic Diversity is Present within North American Switchgrass Germplasm. Plant Genome.
Fernandes, S. B., K. O. G. Dias, D. F. Ferreira, and P. J. Brown, 2017 Efficiency of multi-trait, indirect, and trait-assisted genomic selection for improvement of biomass sorghum. Theor. Appl. Genet.
Friedman, J. H., 2001 Greedy function approximation: A gradient boosting machine. Ann. Statist. 29: 1189–1232.
Gianola, D., R. L. Fernando, and A. Stella, 2006 Genomic-assisted prediction of genetic value with semiparametric procedures. Genetics 173: 1761–1776.
Girija, S. S., 2016 Tensorflow: Large-scale machine learning on heterogeneous distributed systems.
Glorot, X., and Y. Bengio, 2010 Understanding the difficulty of training deep feedforward neural networks, in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics.
Goddard, M., 2009 Genomic selection: prediction of accuracy and maximisation of long term response. Genetica 136: 245–257.
González-Camacho, J. M., G. de los Campos, P. Pérez, D. Gianola, J. E. Cairns et al., 2012 Genome-enabled prediction of genetic values using radial basis function neural networks. Theoretical and Applied Genetics 125: 759–771.
González-Camacho, J. M., J. Crossa, P. Pérez-Rodríguez, L. Ornella, and D.
Gianola, 2016 Genome-enabled prediction using probabilistic neural network classifiers. BMC Genomics 17: 208.
González-Camacho, J. M., L. Ornella, P. Pérez-Rodríguez, D. Gianola, S. Dreisigacker et al., 2018 Applications of Machine Learning Methods to Genomic Selection in Breeding Wheat for Rust Resistance. Plant Genome 11.
González-Recio, O., and S. Forni, 2011 Genome-wide prediction of discrete traits using Bayesian regressions and machine learning. Genet. Sel. Evol. 43: 7.
González-Recio, O., J. A. Jiménez-Montero, and R. Alenda, 2013 The gradient boosting algorithm and random boosting for genome-assisted evaluation in large data sets. J. Dairy Sci. 96: 614–624.
Habier, D., R. L. Fernando, K. Kizilkaya, and D. J. Garrick, 2011 Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics 12: 186.
Hansey, C. N., J. M. Johnson, R. S. Sekhon, S. M. Kaeppler, and N. de Leon, 2011 Genetic diversity of a maize association population with restricted phenology. Crop Sci. 51: 704–715.
Heffner, E. L., M. E. Sorrells, and J.-L. Jannink, 2009 Genomic Selection for Crop Improvement. Crop Science 49: 1–12.
Heslot, N., H.-P. Yang, M. E. Sorrells, and J.-L. Jannink, 2012 Genomic Selection in Plant Breeding: A Comparison of Models. Crop Science 52: 146–160.
Hill, W. G., M. E. Goddard, and P. M. Visscher, 2008 Data and theory point to mainly additive genetic variance for complex traits. PLoS Genet. 4: e1000008.
Hirsch, C. N., J. M. Foerster, J. M. Johnson, R. S. Sekhon, G. Muttoni et al., 2014 Insights into the maize pan-genome and pan-transcriptome. Plant Cell 26: 121–135.
Holland, J. B., 2007 Genetic architecture of complex traits in plants. Curr. Opin. Plant Biol. 10: 156–161.
Jonas, E., and D.-J. de Koning, 2013 Does genomic selection have a future in plant breeding? Trends Biotechnol. 31: 497–504.
Kasnavi, S. A., M. A. Afshar, M. M. Shariati, N. E. J. Kashan, and M. Honarvar, 2017 Performance evaluation of support vector machine (SVM)-based predictors in genomic selection. Indian J. Anim. Sci. 87: 1226–1231.
Kingma, D. P., and J. Ba, 2014 Adam: A Method for Stochastic Optimization. arXiv:1412.6980.
Klimasauskas, C. C., 1993 Applying neural networks, pp. 47–72 in Neural Networks in Finance and Investing.
Kuhn, M., and K. Johnson, 2013 Applied Predictive Modeling. Springer, New York.
LeBaron, B., and A. S. Weigend, 1998 A bootstrap evaluation of the effect of data splitting on financial time series. IEEE Transactions on Neural Networks 9: 213–220.
Lipka, A. E., F. Lu, J. H. Cherney, E. S. Buckler, M. D. Casler et al., 2014 Accelerating the Switchgrass (Panicum virgatum L.) Breeding Cycle Using Genomic Selection Approaches. PLoS ONE 9: e112227.
Long, N., D. Gianola, G. J. M. Rosa, and K. A. Weigel, 2011 Application of support vector regression to genome-assisted prediction of quantitative traits. Theor. Appl. Genet. 123: 1065–1074.
Lorenz, A. J., S. Chao, F. G. Asoro, E. L. Heffner, T. Hayashi et al., 2011 Genomic Selection in Plant Breeding: Knowledge and Prospects, in Advances in Agronomy. Elsevier.
Ma, W., Z. Qiu, J. Song, J. Li, Q. Cheng et al., 2018 A deep convolutional neural network approach for predicting phenotypes from genotypes. Planta 248: 1307–1318.
Meuwissen, T. H. E., 2009 Accuracy of breeding values of 'unrelated' individuals predicted by dense SNP genotyping. Genet. Sel. Evol. 41: 35.
Meuwissen, T. H. E., B. J. Hayes, and M. E. Goddard, 2001 Prediction of Total Genetic Value Using Genome-Wide Dense Marker Maps. Genetics 157: 1819–1829.
Monir, M. M., and J.
Zhu, 2018 Dominance and Epistasis Interactions Revealed as Important Variants for Leaf Traits of Maize NAM Population. Front. Plant Sci. 9: 627.
Moser, G., B. Tier, R. E. Crump, M. S. Khatkar, and H. W. Raadsma, 2009 A comparison of five methods to predict genomic breeding values of dairy bulls from genome-wide SNP markers. Genet. Sel. Evol. 41: 56.
Neves, H. H., R. Carvalheiro, and S. A. Queiroz, 2012 A comparison of statistical methods for genomic selection in a mice population. BMC Genet. 13: 100.
Norman, A., J. Taylor, J. Edwards, and H. Kuchel, 2018 Optimising Genomic Selection in Wheat: Effect of Marker Density, Population Size and Population Structure on Prediction Accuracy. G3 8: 2889–2899.
Okut, H., D. Gianola, G. J. M. Rosa, and K. A. Weigel, 2011 Prediction of body mass index in mice using dense molecular markers and a regularized neural network. Genet. Res. 93: 189–201.
Parker, D. B., 1987 Optimal algorithms for adaptive networks: Second order backpropagation, second order direct backpropagation, and second order Hebbian learning.
Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion et al., 2011 Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12: 2825–2830.
Pérez, P., and G. de los Campos, 2014 Genome-wide regression and prediction with the BGLR statistical package. Genetics 198: 483–495.
Pouladi, F., H. Salehinejad, and A. M. Gilani, 2015 Deep Recurrent Neural Networks for Sequential Phenotype Prediction in Genomics.
Prechelt, L., 1998 Early Stopping - But When?, pp. 55–69 in Neural Networks: Tricks of the Trade, edited by G. B. Orr and K.-R. Müller. Springer Berlin Heidelberg, Berlin, Heidelberg.
Ramstein, G. P., J. Evans, S. M. Kaeppler, R. B. Mitchell, K. P. Vogel et al., 2016 Accuracy of Genomic Prediction in Switchgrass (Panicum virgatum L.) Improved by Accounting for Linkage Disequilibrium. G3 6: 1049–1062.
Ribaut, J.-M., and M. Ragot, 2007 Marker-assisted selection to improve drought adaptation in maize: the backcross approach, perspectives, limitations, and alternatives. Journal of Experimental Botany 58: 351–360.
Riedelsheimer, C., F. Technow, and A. E. Melchinger, 2013 Comparison of whole-genome prediction models for traits with contrasting genetic architecture in a diversity panel of maize inbred lines.
Roorkiwal, M., A. Rathore, R. R. Das, M. K. Singh, A. Jain et al., 2016 Genome-Enabled Prediction Models for Yield Related Traits in Chickpea. Front. Plant Sci. 7: 1666.
Rumelhart, D. E., G. E. Hinton, and R. J. Williams, 1986 Learning internal representations by error propagation, in Parallel Distributed Processing, edited by D. E. Rumelhart and J. L. McClelland.
Shaikhina, T., and N. A. Khovanova, 2017 Handling limited datasets with neural networks in medical applications: A small-data approach. Artificial Intelligence in Medicine 75: 51–63.
Spindel, J., H. Begum, D. Akdemir, P. Virk, B. Collard et al., 2015 Genomic Selection and Association Mapping in Rice (Oryza sativa): Effect of Trait Genetic Architecture, Training Population Composition, Marker Number and Statistical Model on Accuracy of Rice Genomic Selection in Elite, Tropical Rice Breeding Lines. PLoS Genetics 11: e1004982.
Usai, M. G., M. E. Goddard, and B. J. Hayes, 2009 LASSO with cross-validation for genomic selection. Genet. Res. 91: 427–436.
Vazquez, A. I., G. J. M. Rosa, K. A. Weigel, G. de los Campos, D. Gianola et al., 2010 Predictive ability of subsets of single nucleotide polymorphisms with and without parent average in US Holsteins. J. Dairy Sci. 93: 5942–5949.
Webb, S., 2018 Deep learning for biology. Nature 554: 555–557.
Xavier, A., W. M. Muir, and K. M. Rainey, 2016 Assessing Predictive Properties of Genome-Wide Selection in Soybeans. G3 6: 2611–2616.
Xu, Y., X. Wang, X. Ding, X. Zheng, Z. Yang et al., 2018 Genomic selection of agronomic traits in hybrid rice using an NCII population. Rice 11: 32.
Zou, H., and T. Hastie, 2005 Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B 67: 301–320.

CHAPTER FIVE: TRANSCRIPTOME-BASED PREDICTION OF COMPLEX TRAITS IN MAIZE 1

1 The work described in this chapter has been published in the following manuscript: Christina B. Azodi, Jeremy Pardo, Robert VanBuren, Gustavo de los Campos, and Shin-Han Shiu (2019) Transcriptome-based prediction of complex traits in maize. The Plant Cell. DOI: 10.1105/tpc.19.00332

5.1 Abstract

The ability to predict traits from genome-wide sequence information (i.e. genomic prediction) has improved our understanding of the genetic basis of complex traits and transformed breeding practices. Transcriptome data may also be useful for genomic prediction. However, it remains unclear how well transcript levels can predict traits, particularly when traits are scored at different developmental stages. Using maize genetic markers and transcript levels from seedlings to predict mature plant traits, we found that transcript-based and genetic marker-based models have similar performance. When the transcripts and genetic markers with the greatest weights (i.e. the most important) in those models were used in one joint model, performance increased. Furthermore, genetic markers important for predictions were not close to, nor identified as regulatory variants for, important transcripts. These findings demonstrate that transcript levels are useful for predicting traits and that their predictive power is not simply due to genetic variation in the transcribed genomic regions. Finally, genetic marker models identified only one of 14 benchmark flowering time genes, while transcript models identified five, highlighting that, in addition to being useful for genomic prediction, transcriptome data can provide a link between traits and variation that cannot be readily captured at the sequence level.

5.2 Introduction

The prediction of complex traits from genetic data is a grand challenge in biology, and the outcome of such prediction has become increasingly useful for plant and animal breeding (Heffner et al. 2009; Jonas and de Koning 2013). Among the different approaches for connecting genotypes to phenotypes, genomic prediction (or genomic selection) using all available markers was developed to overcome the limitations of Marker-Assisted Selection, which uses only significant quantitative trait loci (QTLs), for breeding traits that are controlled by many small effect alleles (Meuwissen et al. 2001; Ribaut and Ragot 2007). Using genomic prediction, breeders are able to make data-driven decisions about which lines to include in their programs, speeding up and reducing the cost of developing the next generation of crops (Endelman et al. 2014; Spindel et al. 2015). Furthermore, because genomic prediction models associate genetic signatures with phenotypes, untangling genomic prediction models has the potential to improve our understanding of the genetic basis of complex traits.
However, as with related approaches such as genome-wide association studies and QTL mapping, it remains difficult to go from associated genetic markers to the molecular basis of a trait (Drinkwater and Gould 2012; Solberg Woods 2014). There are a number of factors contributing to this difficulty. The variants in markers associated with phenotypes may not be the causal variants but may instead be linked to the genes that control the trait in question. Considering that linkage disequilibrium distance can range from 1 kilobase (kb) in diverse maize populations (Tenaillon et al. 2001) to ~250 kb in Arabidopsis thaliana (Nordborg et al. 2002), the linked candidate genes can range from a few to a few hundred. Even if the associated genetic variant does control the underlying phenotype, most variants associated with complex traits have small effect sizes and can be regulatory (Albert and Kruglyak 2015), and regulatory variants may not be linked to the genes they regulate. Furthermore, multiple regulatory variants that have indiscernible effects on their own could interact epistatically to influence gene expression and, ultimately, trait expression. However, even with sufficient statistical power to detect genetic variants with small effect sizes and interactions between them, genetic information is connected to traits through multiple intermediate processes, including, for example, transcription, translation, epigenetic modification, and metabolism. Each of these intermediate processes represents an additional level of complexity that obscures the association between genetic information and a trait. One solution is to account for these intermediate processes by integrating relevant omics data in addition to genetic variation. This approach has led to promising, but often mixed, results in plants. Current efforts have focused primarily on predicting hybrid performance using transcriptional information from the parental lines. For example, transcript level-based distance measures generated from transcripts associated with the trait were better than genetic markers at predicting hybrid performance in maize (Frisch et al. 2010; Fu et al. 2012). However, when all transcripts were used (instead of a subset of pre-selected transcripts), model performance decreased (Zenke-Philippi et al. 2016). The performance of models based on transcript levels can be better or worse than that of models based on genetic markers depending on the trait. For example, transcriptome data performed better for predicting grain yield in hybrid maize populations, but genetic marker data performed better for predicting grain dry matter content in the same population (Schrag et al. 2018). Similarly, in a maize diversity panel, genomic prediction models that combined transcript and marker data only outperformed models using markers alone for certain traits (Guo et al. 2016). Finally, efforts to integrate additional omics information to predict various traits in Drosophila melanogaster (Li et al. 2019) and human diseases, such as breast cancer (González-Reymúndez et al. 2017), and responses to treatment interventions, including acute kidney rejection and response to infliximab in ulcerative colitis (Kang et al. 2017; Zarringhalam et al. 2018), have demonstrated the potential usefulness of transcriptome data in the field of precision medicine. Overall, these efforts provide reasonable evidence that transcriptome data could be useful for trait prediction.
However, genomic prediction approaches trained on the entire transcriptome have not been used to better understand the genetic mechanisms underlying a trait. In addition, the degree to which transcriptomes obtained at a particular developmental stage can be informative for predicting phenotypes scored at a different stage is not known. To address these questions, we used transcriptome data derived from maize whole seedlings (Hirsch et al. 2014) to predict phenotypes (flowering time, height, and grain yield) at much later developmental stages. In addition to comparing prediction performance between genetic marker- and transcriptome-based models, we also examined whether transcripts and genetic markers that were important for the prediction models were located in the same or adjacent regions. Finally, we determined how well our models were able to identify a benchmark set of flowering time genes to explore the potential of using genomic prediction to better understand the mechanistic basis of complex traits.

5.3 Results and Discussion

5.3.1 Relationships between transcript levels, kinship, and phenotypes among maize lines

Before using the transcriptome data for genomic prediction, we first assessed properties of the transcriptome data in three areas: (1) the quantity and distribution of transcript information across the genome, (2) the amount of variation in transcript levels, and (3) the similarity in transcriptome profiles between maize lines, with an emphasis on how these properties compared to those based on the genotype data. After filtering out 16,898 transcripts that did not map to the B73 reference genome or had zero or near-zero variance across lines (see Methods), we retained 31,238 transcripts. While the number of transcripts was <10% of the number of genetic markers used in this study (332,178), the distribution of transcripts along the genome was similar to the genetic marker distribution (Supplemental Figure 5.1). The log2-transformed median transcript level across lines ranged from 0 to 12.4 (median = 2.2) and the variance ranged from 3x10-30 to 14.5 (median = 0.13), highlighting that a subset of transcripts had relatively high variation in transcript levels across maize lines at the seedling stage. To determine how similar transcript levels were between lines, we calculated the expression correlation (eCor) between all pairs of lines using Pearson's Correlation Coefficient (PCC). The eCor values ranged from 0.84 to 0.99 (mean = 0.93). As expected, lines with similar transcriptome profiles were also genetically similar, as there was a significant correlation between eCor values and values in the kinship matrix generated from the genetic marker data (Spearman's rank ρ = 0.27, p < 2.2x10-16; Figure 5.1A). As a result, we were able to find clusters of lines that had both high transcript and high genetic similarity (e.g. clusters a, b; Figure 5.1B, C). However, most of the variation in eCor was not explained by kinship, which explains why we identified other clusters that had similar transcriptome profiles but were not genetically similar (e.g. cluster c, Figure 5.1B, C). Because the basis of genomic prediction is to predict a phenotype from genetic data, we next asked if kinship or eCor were anti-correlated with the phenotypic distances between lines (see Methods).
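Before turning to that question, for reference, the following is a minimal sketch of how an eCor matrix can be computed and compared against kinship. All matrices below are simulated stand-ins for the real transcriptome and marker data, with a shared latent population structure so the two similarity matrices are related:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(3)
n_lines, n_transcripts = 388, 31238   # sizes from the study; data simulated here

# Shared latent population structure linking the two simulated data types
structure = rng.normal(size=(n_lines, 5))
expr = structure @ rng.normal(size=(5, n_transcripts)) \
       + rng.normal(size=(n_lines, n_transcripts))       # stand-in log(TPM)
geno = structure @ rng.normal(size=(5, 2000)) \
       + rng.normal(size=(n_lines, 2000))                 # stand-in markers

ecor = np.corrcoef(expr)      # eCor: pairwise PCC of transcript profiles
kinship = np.corrcoef(geno)   # stand-in for TASSEL's centered-IBS kinship

# Compare the two similarity matrices over unique pairs of lines
iu = np.triu_indices(n_lines, k=1)
rho, p = spearmanr(ecor[iu], kinship[iu])
print(f"Spearman's rho between eCor and kinship: {rho:.2f} (p = {p:.2g})")
```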
While both kinship (ρ = -0.03, p < 2.2x10-16; Figure 5.1D) and eCor (ρ = -0.08, p < 2.2x10-16; Figure 5.1E) were significantly, negatively correlated with phenotypic distance, the degree of correlation was minor. Furthermore, the groups of lines that clustered together based on their eCor (e.g. clusters a, b; Figure 5.1B, C) did not have lower phenotypic distances (Figure 5.1F). Taken together, these findings suggest that transcriptome data may be similarly informative as genotype data but capture different aspects of phenotypic variation. We tested both of these interpretations further in subsequent sections.

Figure 5.1. Relationship between lines from transcript and genetic marker data. (A) Relationship between kinship based on genetic marker data (X-axis) and expression correlation (eCor, in Pearson's Correlation Coefficient (PCC)) based on transcript data (Y-axis). Boxplots show the median Y-axis value for each X-axis bin (bin size = 0.15) with the 5th (blue) and 95th (red) percentile range shown. The correlation between kinship and eCor was calculated using Spearman's rank coefficient (ρ). (B, C) The relationships between lines based on eCor (B) or kinship (C) for all pairs of maize lines. Lines are sorted based on hierarchical clustering results using the eCor values. The blue, white, and red color scales indicate negative, no, or positive correlations, respectively. Dotted rectangles: clusters of lines discussed in the main text. (D, E) The relationships between the Euclidean distance calculated with phenotype values (Phenotype Distance; Y-axis) and kinship (D) or eCor (E). Colored lines: as in (A). (F) The relationships between lines based on Phenotype Distance, where the lines were sorted as in (B). Red: smaller distance (more similar). Blue: greater distance (less similar).

5.3.2 Predicting complex traits from transcript or genetic marker data

To test how useful transcriptome data were for genomic prediction compared to genetic marker data, we applied four approaches to predict three agronomically important traits in maize: flowering time, height, and grain yield. Because no one genomic prediction algorithm always performs best (Heslot et al. 2012; Spindel et al. 2015), we tested two linear algorithms (ridge regression Best Linear Unbiased Predictor (rrBLUP) and Bayesian Least Absolute Shrinkage and Selection Operator (BL)), one non-linear algorithm (random forest: RF), and one ensemble approach (En; see Methods). To establish a baseline for our genomic prediction models, we determined the amount of the phenotypic signal that could be predicted using population structure alone, defined as the first 75 principal components from the genetic marker data (see Methods). Then we built models for each trait using genetic marker data (G), kinship (K) derived from G, transcript levels (T), or expression correlation (eCor) derived from T (Figure 5.2). Model performance was measured using the PCC between the actual and predicted phenotypic values. Across algorithms and traits, the K data resulted in models with the best predictive performance, while models built using the eCor data performed the worst (Figure 5.2, Supplemental Table 5.1). Furthermore, models built using G always outperformed models using T. Regardless, eCor- and T-based models were significantly better than the baseline predictions (dotted blue line, Figure 5.2), indicating that transcriptome data can be informative for genomic prediction.
Considering that the transcriptome data were from seedlings, it is particularly surprising that mature plant phenotypes could be predicted. Next, we asked if using only the most informative (i.e. largest absolute coefficient) transcripts or genetic markers as input to our models would improve trait predictions (see Methods).

Figure 5.2. Genomic prediction model performance. PCCs between predicted and true values for three traits and four algorithms using six different input features. The text in each box represents the absolute PCC, with the best performing model for each trait in white. The box color represents the PCC normalized by trait, where the brightest red (1) corresponds to the algorithm/input feature combination that performed best for the trait and the brightest blue (0) corresponds to the combination that performed worst. Right violin plots show the PCC distributions among different input features for each algorithm. The median PCCs are indicated with black bars. The model performance PCCs based on population structure alone (first 75 principal components) are indicated with a blue dashed line. Bottom violin plots show the PCC distributions among different algorithms for each input feature. rrB: ridge regression Best Linear Unbiased Predictor. BL: Bayesian Least Absolute Shrinkage and Selection Operator. RF: Random Forest. En: Ensemble.

We also tested different-sized subsets of the transcripts with the greatest degrees of line-specific expression to test whether they could better predict traits. However, using rrBLUP to predict flowering time as an example, none of these subsets performed better than the full T data (Supplemental Table 5.3). We also tested setting the most variable transcripts as fixed effects in our rrBLUP models, but this also did not improve performance (Supplemental Table 5.3). Finally, consistent with earlier findings (Shen and Chou 2006; Jia et al. 2015), combining the predictions from multiple algorithms, known as an ensemble approach, resulted in the best predictive models (Figure 5.2), and the ensemble is therefore used to illustrate most of our findings in the following sections.

5.3.3 Predicting complex traits using both transcript and genetic marker data

Because the genetic marker and transcriptome data represent different types of molecular information that could be associated with the traits of interest, we hypothesized that their combination would be more informative, and we next built models that used combined data, either K+T or G+T. However, adding the transcript data did not substantially improve performance over K or G alone (Figure 5.2). One possible reason for this lack of improvement could be overfitting, which is most common when there is only a small amount of training data (i.e. few maize lines) but a very large number of predictor variables (i.e. many genetic markers/transcripts). To test this hypothesis, we trained rrBLUP models (referred to as G200+T200) to predict flowering time using only the 200 genetic markers and the 200 transcripts with the largest absolute coefficients from the G and T rrBLUP models, respectively (see Methods). These genetic markers and transcripts are referred to as "features". To avoid overfitting during feature selection (Bermingham et al. 2015), we first separated the dataset into training and testing sets. The top features were selected using the training data only; the testing data were never used to select the top features.
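The key point is that feature selection must be nested inside the training split. Below is a minimal sketch of this discipline on simulated data; ridge regression is used as a stand-in for rrBLUP, to which it is closely related, and all sizes and names are illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n = 388
G = rng.binomial(2, 0.3, (n, 20000)).astype(float) - 1.0  # markers in [-1, 1]
T = rng.normal(size=(n, 5000))                            # log transcript levels
y = G[:, :30].sum(axis=1) + T[:, :30].sum(axis=1) + rng.normal(0, 2, n)

idx_tr, idx_te = train_test_split(np.arange(n), test_size=0.2, random_state=0)

def top_k(X, y, k=200):
    """Indices of the k features with the largest |ridge coefficient|,
    estimated on the training lines only (never on the test set)."""
    coef = Ridge(alpha=1.0).fit(X, y).coef_
    return np.argsort(np.abs(coef))[-k:]

g200 = top_k(G[idx_tr], y[idx_tr])
t200 = top_k(T[idx_tr], y[idx_tr])

# Combined G200+T200 model, trained on training lines, scored on held-out lines
X200 = np.hstack([G[:, g200], T[:, t200]])
model = Ridge(alpha=1.0).fit(X200[idx_tr], y[idx_tr])
pred = model.predict(X200[idx_te])
print("test PCC:", np.corrcoef(pred, y[idx_te])[0, 1].round(2))
```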
Using the independent testing data to evaluate performance, our ability to predict flowering time improved using G200+T200 (PCC = 0.68 +/- 0.06) compared to the full G+T model (PCC = 0.64 +/- 0.01) and to the individual G and T models (PCC = 0.64 +/- 0.01 and 0.61 +/- 0.01, respectively). One explanation for this improvement could be that using only the top features of each data type reduced noise in the model. If this were the case, the G200 and T200 models would be expected to outperform the G and T models, respectively. But we see the opposite result (see previous section; Supplemental Table 5.3), suggesting this improvement was due to a reduction in overfitting. To assess whether G or T features tend to be more informative for predicting traits, we further quantified the importance score of each genetic marker and transcript feature for models using G+T data. The importance score represents the impact that each feature had on model performance, defined according to the algorithm used (see Methods). Because the G and T features may contain overlapping information and, thus, are not independent, the importance scores from the G+T model may be affected by collinearity. However, given that the importance scores assigned to transcripts in the G+T models were correlated with the scores from the T-only models (Supplemental Figure 5.2A), the addition of the genetic marker features into the model did not impact the relative importance of transcript features. The only exception was a subset of Ts that were important for the G+T, but not the T-only, Bayesian LASSO (BL) models. Because RF importance measures tend to be biased toward continuous features (Strobl et al. 2007), we focused on rrBLUP and BL importance scores. For all three traits, the top 1,000 most important features were enriched for genetic markers relative to transcript features (odds ratio = 0.17 ~ 0.44; all p < 1x10-16; Supplemental Figure 5.2B; Supplemental Table 5.2). However, the top 20 most important features tended to be enriched for transcript relative to genetic marker features (odds ratio = 2.66 ~ 13.0, p = 0.087 ~ <1x10-16, Supplemental Table 5.2), with transcript features making up the top two most important features in all cases (Supplemental Figure 5.2B). The consistency with which transcript features were the most important for the models suggests that transcript information is useful for genomic prediction.

5.3.4 Comparison of the importance of transcripts versus genetic markers for model predictions

Because models built using transcript features outperformed baseline models based solely on population structure, we know the transcriptome data contained information useful for explaining phenotypic variation. Furthermore, using feature selection to combine both datasets into one predictive model (G200+T200) improved our ability to predict flowering time (Supplemental Table 5.3). Therefore, we hypothesized that these two data types capture different aspects of phenotypic variation. To address this, we assessed the extent to which the important genetic markers (from G-based models) overlapped with or neighbored the genes from which the important transcripts (from T-based models) originated (top; Figure 5.3A). We did not use the importance values from the G+T model due to concerns about feature dependence. The genic region and flanking sequences within a defined window around an important transcript are referred to as the transcript region (see Methods). For each trait and algorithm, we compared the importance assigned to a transcript with that of the genetic marker with the highest average importance in the transcript region (a T:G pair).
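A minimal sketch of how such T:G pairs can be constructed follows; the tables, gene names, coordinates, and importance values below are hypothetical stand-ins for the model outputs and genome annotation:

```python
import pandas as pd

# Hypothetical importance tables; in the study these came from the G-based
# and T-based models, with genome coordinates from the annotation.
transcripts = pd.DataFrame({
    "gene":   ["geneA", "geneB"],
    "chrom":  ["3", "9"],
    "center": [160_559_109, 156_980_141],   # gene midpoints (bp)
    "t_imp":  [0.98, 0.75],                 # transcript importance
})
markers = pd.DataFrame({
    "chrom":  ["3", "3", "9"],
    "pos":    [160_558_500, 160_560_050, 156_981_500],
    "g_imp":  [0.90, 0.40, 0.60],           # marker importance
})

def tg_pairs(transcripts, markers, window=1_000):
    """For each transcript, pair it with the most important marker within
    +/- `window` bp of the gene center (the 'transcript region')."""
    rows = []
    for t in transcripts.itertuples():
        near = markers[(markers.chrom == t.chrom) &
                       (markers.pos.between(t.center - window,
                                            t.center + window))]
        if not near.empty:
            rows.append({"gene": t.gene, "t_imp": t.t_imp,
                         "g_imp": near.g_imp.max()})
    return pd.DataFrame(rows)

print(tg_pairs(transcripts, markers))
```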
Multiple window sizes were explored (see Methods), and we used 2 kb (+/- 1 kb from the center of a gene), the size at which the feature importance correlation between transcripts and genetic markers was maximized (Supplemental Figure 5.3A). Using this window size, 15,049 T:G pairs were identified. At the whole-genome level, there appeared to be regions where both genetic markers and transcripts were identified as important (Supplemental Figure 5.4). However, on closer inspection, these regions mostly did not overlap. In some cases, the important genetic markers and transcripts were in linkage disequilibrium. Using the flowering time model as an example, we found that the most important genetic marker was located within a gene upstream of the most important transcript (GRMZM2G171650: MADS69; arrow a, Figure 5.3B), and the two are in linkage disequilibrium (Hirsch et al. 2014). In most cases, however, there were no important genetic markers located near important transcripts, and if we extend the window size to 80 kb, we see that MADS69 is the exception rather than the rule (Supplemental Figure 5.3B). For example, the second most important flowering time genetic marker was not located near important transcript regions (arrow b, Figure 5.3B). Similarly, the second most important flowering time transcript (GRMZM5G865543) was over 0.6 Mb from an important genetic marker (arrow c, Figure 5.3B). Across all traits and algorithms, the importance scores of T:G pairs were only weakly correlated (ρ = 0.09–0.13; Figure 5.3C, Supplemental Figure 5.5A). This lack of correlation is notable for the most important genetic markers and transcripts. For example, across the three traits, only 4–7 T:G pairs were both in the top 1% most important features from the ensemble models, and those pairs never included the top-ranked genetic markers or transcripts from the model (Figure 5.3B).

Figure 5.3. Correlation between genetic marker and transcript importance for flowering time. (A) Illustration of how transcript (T):genetic marker (G) (top graph) and T:expression Quantitative Trait Locus (eQTL) (bottom graph) pairs were determined. Genetic marker importance percentiles are shown above the genetic markers (red triangles) and eQTL (yellow triangles). A T:G pair was defined as the transcript and the most important genetic marker within the transcript region (top graph). A T:eQTL pair was defined as the transcript and the most important genetic marker within the eQTL region (bottom graph). (B) Manhattan plots of the transcript (blue bar) and genetic marker (red dot) importance scores (-loge(1-importance percentile)) in a 2 Mb window surrounding the top two genetic markers (top and middle plots) and transcripts (top and bottom plots) based on the T-based and G-based Ensemble models for predicting flowering time, respectively. All genetic markers (i.e. not just the T:G pairs) are shown. The threshold (gray dotted line) is set at the 99th percentile importance. (C) Density scatter plots of the importance scores (see Methods) of the genetic marker (Y-axis) and transcript (X-axis) for T:G pairs (top graphs) and of the eQTL genetic marker (Y-axis) and transcript (X-axis) for T:eQTL pairs (bottom graphs) for three traits, derived from the G-based and T-based Ensemble models, respectively. The threshold (black dotted line) was set at the 99th percentile importance score for each trait and input feature type. The correlation between importance scores for transcript and genetic marker/eQTL pairs was calculated using Spearman's rank (ρ).
These findings argue against the notion that the two data types capture similar aspects of phenotypic variation and are consistent with our earlier hypothesis that they capture different aspects. One concern was that the lack of correlation was due to the genetic marker data being derived from RNA-Seq experiments, and thus limited to transcribed regions. However, when the experiment was repeated using ~1 million genome-wide genetic markers (GGW) derived from whole-genome sequencing (Bukowski et al. 2018) as input features (Supplemental Figure 5.6A), the correlation between T:GGW pairs did not increase (Supplemental Figures 5.6B and 5.6C). In light of this, we hypothesized that the lack of correlation arose because important transcripts tend to be regulated by important trans factors located far beyond the transcript region. To test this, we assessed the degree to which important genetic markers identified as expression QTLs (eQTLs) were associated with important transcripts. We identified 58,361 cis (62) and trans (58,299) eQTLs associated with 7,052 transcripts and defined T:eQTL pairs for each of these transcripts by selecting the genetic marker within +/- 1 kb of an eQTL for that transcript (i.e. the eQTL region) with the highest average importance. Across all traits and algorithms, the importance of transcripts and eQTLs in T:eQTL pairs was actually negatively correlated (ρ = -0.15 ~ -0.06; Figure 5.3C, Supplemental Figure 5.5B). The lack of correlation between importance scores for T:G and T:eQTL pairs was in contrast to the relatively high correlation observed in feature importance between algorithms (ρ = 0.31–0.98), with rrBLUP and BL importance scores being the most correlated (ρ = 0.87–0.98) and the average correlation between algorithms for genetic markers (ρ = 0.75) being higher than that for transcripts (ρ = 0.55) (Supplemental Figure 5.7). Together with the findings that, for all three traits, important genetic markers were not co-located with, and eQTLs were not associated with, the genes that gave rise to the important transcripts, these results may suggest that transcriptome data capture layers of information, such as epigenetic signals, that are not captured by genome sequences. However, we cannot rule out the possibility that the eQTL approach using RNA-Seq-based genetic markers is not sufficiently sensitive for identifying important trans factors. Further study with more traits and high-quality genome-wide genetic marker data is needed to resolve these possibilities.

5.3.5 Assessment of benchmark flowering time genes

Because the genetic basis of flowering time is well studied (Muszynski et al. 2006; Danilevskaya et al. 2010; Meng et al. 2011; Lazakis et al. 2011), we identified a set of 14 known flowering time genes (Supplemental Table 5.4). To assess the extent to which these benchmark genes can predict flowering time, we trained an rrBLUP model in which these 14 genes were set as fixed, rather than random, effects, and model performance increased (PCC = 0.64 +/- 0.01; Supplemental Table 5.3) compared to when they were not fixed (PCC = 0.61). We then compared the ability of genetic marker- and transcript-based models to identify these benchmark genes as important using the T:G and T:eQTL pairs described earlier. Of the 14 benchmark genes, four had corresponding genetic markers in our T:G pair data.
When we increased the flanking region threshold for defining T:G pairs to 20 kb from the center of the transcript, corresponding genetic markers were found for five additional benchmark genes. Two benchmark genes, CCT1 and PEBP4, neither of which were members of a T:G pair, were associated with eQTLs. To account for differences in the distribution and range of importance scores generated by different algorithms and numbers of features, the importance scores were converted to percentiles for comparison purposes. Different benchmark genes were important (>95th percentile) for models using the two different data types, with one and five benchmark genes considered important by the genetic marker-based and transcript-based models, respectively (Figure 5.4A; Supplemental Table 5.5). For example, the genetic marker located within the RAP2 gene, which has been shown to be associated with flowering time in multiple studies (Buckler et al. 2009; Hirsch et al. 2014), was identified as important based on genetic marker (99.7th–99.9th percentile), but not transcript (59th–79th percentile), data. In contrast, MADS69, MADS1, PEBP24, and PEBP8 were identified as important using transcript data (95th–100th percentile), but not using genetic marker data (16th–93rd percentile). Furthermore, with transcript data we were able to assess the importance of three genes (ZAG6, PEBP5, and PEBP2) that were not located near genetic markers or associated with eQTLs. For example, there were no eQTLs associated with ZAG6 and no genetic markers within the 40 kb window around it, but ZAG6 was identified as important (98th–99.9th percentile) in the transcript-based models (Figure 5.4A).

Figure 5.4. Comparison of transcript and genetic marker importance scores for benchmark flowering time genes. Importance percentile of each transcript (left) and genetic marker (right) pair as determined by each of the four algorithms (X-axis). Genes are sorted based on hierarchical clustering of their importance percentiles. Gray boxes designate benchmark genes that did not have genetic markers within a 40 kb window. Confidence levels (high or medium) were assigned based on the type of evidence available for the benchmark gene (see Methods). Algorithms are abbreviated as in Figure 5.2.

For some of these benchmark genes, the region most closely linked to trait variation could be outside the +/- 20 kb window. For example, as described above, the important genetic marker for MADS69 (Chr3_160559109) is ~32 kb upstream (see arrow a; Figure 5.3B). However, when we plotted the correlation between importance scores for T:G pairs using the largest window size (80 kb), we found that MADS69 was the only gene for which this was the case (Supplemental Figure 5.3B). Taken together, these findings further highlight the usefulness of transcript data for identifying the genetic basis of variation in a trait.

5.3.6 Improving our understanding of the genetic basis of flowering time using transcriptome data

An open question was why transcript-based models identified as important five benchmark flowering time genes that were not identified by genetic marker-based models, and whether transcriptome data could be used to better understand the genetic basis of flowering time. To understand why benchmark genes were not uniformly identified as important for flowering time using both genetic marker and transcript data, we determined the extent to which transcript levels and genetic marker alleles (i.e. major or minor) were related to flowering time.
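A minimal sketch of the two tests used for this comparison follows; all values below are simulated stand-ins for the real transcript levels, allele calls, and flowering times:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 388  # lines; all data below are simulated stand-ins

flowering = rng.normal(10, 1.5, n)                  # growing degree days / 100
transcript = 0.3 * flowering + rng.normal(0, 1, n)  # a trait-associated transcript
allele = rng.binomial(1, 0.7, n)                    # 1 = major, 0 = minor allele

# Test 1: linear model of flowering time on transcript level
slope, intercept, r, p_lm, se = stats.linregress(transcript, flowering)
print(f"transcript: r^2 = {r**2:.2f}, p = {p_lm:.1e}")

# Test 2: t-test of flowering time between major- and minor-allele lines
t, p_tt = stats.ttest_ind(flowering[allele == 1], flowering[allele == 0])
print(f"marker allele: t = {t:.2f}, p = {p_tt:.2f}")
```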
As expected, we observed the most significant differences in flowering time for the transcripts (Figure 5.5A, Supplemental Figure 5.8A) and genetic markers (Figure 5.5B, Supplemental Figure 5.8B) that were identified as important by our models. For example, MADS1 was important only in the transcript-based models, and its transcript level was significantly correlated with flowering time (p = 0.0001; Figure 5.5A). In contrast, lines with the major allele for the genetic marker paired with the MADS1 transcript (Chr9: 156980141) did not flower at a significantly different time than lines with the minor allele (p = 0.062; Figure 5.5B).

Figure 5.5. Relationship between transcript level/allele type and flowering time for benchmark genes. (A) Boxplots show the transcript levels (loge(Fold-Change)) over flowering time bins with the 5th (blue) and 95th (red) percentile range shown. Flowering time was defined as growing degree days/100. Linear models were fit, and adjusted r2 and p-values are shown. Confidence levels of benchmark genes were designated as in Figure 5.4. (B) Distributions of flowering time for lines with the major (red) or minor (gray) allele of the genetic marker paired with each benchmark gene as indicated in (A). Differences in flowering time by allele were tested using t-tests. (C) Number of transcripts (Y-axis) for which transcript levels were associated with flowering time in linear models, within p-value bins (-log10(p-value); X-axis). Benchmark genes are labeled as in (A). (D) Number of genetic markers (Y-axis) for which differences in flowering time by allele from t-tests were within p-value bins (-log10(p-value); X-axis). Benchmark genes are labeled as in (A).

Another example was RAP2, which was important only in the genetic marker-based models. Lines with the major allele in RAP2 were more likely to flower late (p < 1x10-4), but RAP2 transcript levels did not significantly correlate with changes in flowering time (p = 0.33). Overall, benchmark genes were more likely to have transcript levels associated with flowering time (Figure 5.5C) than genetic marker alleles associated with flowering time (Figure 5.5D). Importantly, using the transcriptome data, we were also able to understand the impact of the benchmark genes on flowering time in more detail. For example, variation in the transcript level of MADS69 accounted for 16.7% of the variation in flowering time, more than any other transcript, with lines with lower transcript levels flowering later. Modulation of MADS69 expression levels has recently been patented as an approach to controlling flowering time (US Patent Application #20140366213). Similarly, MADS1 transcript levels explained 3.7% of the variation in flowering time, with lines with lower transcript levels flowering later. This is consistent with what has been observed experimentally, where down-regulation of MADS1 results in delayed flowering (Alter et al. 2016). For medium-confidence benchmark genes (i.e. those identified through association studies), the specific roles of the genes in flowering time are not well understood, but by finding positive or negative correlations between transcript levels and the underlying phenotypes, more mechanistic details can be inferred.
For example, transcript levels of ZAG6 had the second largest impact on flowering time, accounting for 6% of the variation, with increased transcript levels associated with earlier flowering. Another example is PEBP24, with transcript levels accounting for 2.7% of the variation in flowering time. Unlike many of the other benchmark genes, increased PEBP24 transcript levels were associated with later flowering. Overall, the identification of these medium-confidence benchmark genes as important transcripts indicates the relevance of transcriptional regulation to their flowering time functions. While using the benchmark genes allowed us to assess the usefulness of transcript levels compared to genetic marker information for identifying genes involved in flowering time, we should note that many non-benchmark genes were also identified by our models as important. For example, from the Ensemble model, there were 154 important, non-benchmark transcripts with importance scores falling between those of the two most important benchmark genes (MADS69, 100th percentile; ZAG6, 99.5th percentile; yellow, Dataset S1). While seven of these in-between transcripts were annotated with the Gene Ontology (GO) term "flower development" (GO:0009908; green, Dataset S1), the 154 non-benchmark transcripts were not enriched for this GO term (q = 1.0). In fact, neither these transcripts nor any other set of important transcripts from models based on other algorithms (see Methods) were enriched for any GO terms. Therefore, from our transcript-based genomic prediction models we identified 147 high-ranking transcripts, many with unknown functions, that are among the most important for predicting flowering time in maize but have no known roles in this process. For example, GRMZM5G865543 and GRMZM2G023520 (the second and third most important transcripts, respectively, from the Ensemble model) have no annotated function in maize, and while they do have homologs in Arabidopsis thaliana and Oryza sativa, those homologs have no known roles in flowering time (see Supplemental Table 5.6 for similar information about the top 10 transcripts). Note that the transcriptome data are from the seedling stage. It is possible that the genes encoding these important transcripts influence biological processes at earlier stages of development that in turn influence flowering time later. To further our understanding of the genetic basis of flowering time control and the connections between juvenile and adult phenotypes, these important transcripts are prime candidates for future genetic studies.

5.4 Conclusions

We have generated predictive models that use genetic markers, transcripts, and their combination to predict flowering time, height, and yield in a diverse maize population. While models built using transcriptome data did not outperform models that used genotype data, transcript-based models performed well above random expectation and, in many cases, performed similarly to genotype-based models. We found that transcripts and genetic markers from different genomic regions were identified as important for model predictions. Furthermore, by assessing the relative importance of the features used to build the models, we found that transcript-based models identified more known flowering time-associated genes than genetic marker-based models. These findings underscore the usefulness of transcript data for improving our understanding of the genetic mechanisms responsible for complex traits.
There are four possible mechanistic explanations for why transcript levels could have predictive power similar to that of genetic markers. First, cis-regulatory variants that impact transcript levels are more likely to be similar between closely related individuals; the ability of transcript data to predict phenotypes could therefore simply be a reflection of that dependency. However, we demonstrated that the most informative transcript features for predicting maize phenotypes are distinct from the most informative genetic marker features found in the transcript regions. While for some important transcripts the associated important genetic marker could be in linkage disequilibrium but outside of the 2 kb window used in our study (e.g. ~32 kb away in the case of MADS69), overall, as we increased the transcript region window size, the correlation between the importance scores assigned to T:G pairs decreased, suggesting this is not generally the case. Thus, the second explanation is that there are trans-regulatory variants, e.g. due to transposon polymorphisms or transcriptional regulators, that play a major role. However, we found that the importance of eQTLs (99.9% trans) and of their associated transcripts were not positively correlated, suggesting that the trans-regulatory variation we identified cannot explain why transcript variation is predictive of phenotypic variation either. Considering the challenges in identifying eQTLs due to the mixed tissues used (Wills et al. 2013), in modeling epistatic interactions (Becker et al. 2012), and in our limited ability to find cis-eQTLs, however, we cannot conclusively rule out this possibility. The third explanation is that transcription is a molecular phenotype reflecting the integration of multiple genetic marker signals, both cis and trans, that may not have strong signals individually. The fourth explanation is that there are epigenetic variants contributing to expression variation. The contribution of epigenetic variation to our ability to use transcript data to predict phenotypes remains to be determined. One surprise is that transcript data generated from V1 seedling tissue can predict adult plant phenotypes. We reason that complex traits, such as flowering time, are influenced by more than just canonical genes that act immediately prior to the growth and developmental sequences leading to flowering. For example, early developmental events such as cotyledon damage (Hanley and May 2006), root restriction (Keever et al. 2015), and photoperiod and temperature changes (Song et al. 2013) can impact flowering time in mature plants. Therefore, early developmental differences in transcript levels could eventually result in different flowering times. Three limitations of this study make our ability to predict adult plant phenotypes and identify known important transcripts even more surprising. First, the transcript level data were derived from whole V1 seedling tissue, which should limit the predictive power of our genomic prediction models for mature plant traits. We expect that transcript information taken from tissues and timepoints more relevant to the phenotype of interest is more likely to be predictive. For example, co-expression networks derived from maize root tissues are more predictive of the accumulation of 17 different elements (e.g. Al, Fe, K, Zn) in maize seeds than co-expression networks derived from tissues not involved in element uptake and transport (Schaefer et al. 2018).
One surprise is that transcript data generated from V1 seedling tissue can predict adult plant phenotypes. We reason that complex traits, such as flowering time, are influenced by more than just the canonical genes that act immediately prior to the growth and developmental sequences leading to flowering. For example, early developmental events such as cotyledon damage (Hanley and May 2006), root restriction (Keever et al. 2015), and photoperiod and temperature changes (Song et al. 2013) can impact flowering time in mature plants. Therefore, early developmental transcript differences could eventually result in different flowering times.

Three limitations of this study make our ability to predict adult plant phenotypes and identify known important transcripts even more surprising. First, the transcript level data were derived from whole V1 seedling tissue, which should limit the predictive power of our genomic prediction models for mature plant traits. We expect that transcript information taken from tissues and timepoints more relevant to the phenotype of interest is more likely to be predictive. For example, co-expression networks derived from maize root tissues are more predictive of the accumulation of 17 different elements (e.g. Al, Fe, K, Zn) in maize seeds than co-expression networks derived from tissues not involved in element uptake and transport (Schaefer et al. 2018). Second, transcript levels were calculated by mapping reads to the B73 reference genome without considering the structural and fragmental variation that exists between diverse maize lines. Having only the B73 reference genome to map to likely introduced bias or noise into our transcriptome dataset. In future studies, it will be informative to determine whether correcting for such structural and fragmental variation would improve genomic prediction. Third, no environmental component was considered. An area of active research in genomic prediction is the incorporation of genotype-by-environment (GxE) interactions into predictive models (Burgueño et al. 2012; Cuevas et al. 2017; Granato et al. 2018). A potential benefit of using transcript information for genomic prediction is that GxE interactions could be picked up by transcript level signals. Because the transcriptome data used in our study were from whole seedlings (i.e. not the same individuals that were phenotyped), this could not be tested.

Our findings highlight an important benefit of using transcript data to better understand the genetic basis of a trait. While it can be difficult to associate signals from a number of small-effect genetic markers, or even a single large-effect genetic marker, back to a specific gene, transcript level information is inherently associated with genes. Because of the importance of regulatory variation for complex traits (Albert and Kruglyak 2015), the use of transcript information in genomic prediction could be crucial for deciphering the contribution of regulatory variation to the genetic basis of traits. Therefore, while genetic marker data outperformed transcript data in terms of predictive ability, expression differences are more straightforward to interpret than sequence polymorphisms. In practice, this meant that transcript-based models identified five benchmark flowering time genes while genetic marker-based models identified only one, underscoring our finding that more insight into the genetic basis of complex traits can be gained when transcriptome data are considered.

5.5 Methods

5.5.1 Genotypic, transcriptomic, and phenotypic data processing

The phenotypic (Hansey et al. 2011) and genotypic and transcriptomic (Hirsch et al. 2014) data used in this study were generated from the pan-genome population consisting of diverse inbred maize lines. Genotype, transcriptome, flowering time, height, and yield data were all available for 388 of the 503 lines in the maize pan-genome panel, and these lines were used for the study (Dataset S2). Genetic marker scores derived from RNA-Seq reads were converted to a [-1, 0, 1] format corresponding to [aa, Aa, AA], with the more common allele (AA) designated as 1. The genetic marker positions were converted from the maize B73 reference genome A Golden Path v2 (AGPv2) to AGPv4.37. AGPv2 genetic markers that did not map to AGPv4.37 and genetic markers with a minor allele frequency below 5% were removed, resulting in 332,178 genetic markers. To determine if the use of RNA-Seq-derived genetic markers biased our results, we also tested a set of genome-wide markers (GGW). These markers were downloaded already processed and uplifted to AGPv4 from Bukowski et al. (2018). Data were available for 149 maize lines included in the study. After removing GGW with a minor allele frequency below 5% and duplicate patterns of allele calls across the 149 lines (i.e. the same criteria used for the G dataset), ~1.08 million markers were available for this analysis.
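A minimal sketch of the marker encoding and minor allele frequency (MAF) filter described above, assuming a lines-by-markers matrix already coded as [-1, 0, 1] for [aa, Aa, AA]; the toy data and the threshold implementation are illustrative, not the study's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy genotype matrix: 388 lines x 10,000 markers, coded -1/0/1 for aa/Aa/AA
geno = rng.choice([-1, 0, 1], size=(388, 10_000), p=[0.1, 0.1, 0.8])

# Frequency of the "a" allele: aa carries 2 copies, Aa 1, AA 0 = (1 - score)
a_freq = (1 - geno).mean(axis=0) / 2
maf = np.minimum(a_freq, 1 - a_freq)

keep = maf >= 0.05                      # drop markers with MAF < 5%
geno_filtered = geno[:, keep]
print(f"{keep.sum()} of {geno.shape[1]} markers retained")
```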
RNA-Seq-derived transcriptomic data from whole-seedling tissue (i.e. root and shoot) at the V1 stage from Hirsch et al. (2014) were processed to remove loci that did not map to AGPv4.37. The remaining maize B73 genes were filtered with default settings of the nearZeroVar function from the R caret package to remove genes with zero or near-zero variance (>95% of the lines sharing the same transcript level) across lines. After the filtering steps, transcript counts for 31,238 genes were retained in the final dataset. The raw transcripts-per-million count data were transformed with a log_e(x + 1) transformation before the data were used in subsequent analyses. Mapping rates to the B73 genome assembly were also downloaded from Hirsch et al. (2014). To assess whether the transcriptome data had predictive power beyond random expectation, the transcriptome data were permuted by gene, so that each gene had the same distribution of transcript values but the values were randomly assigned to different maize lines, for building the transcriptome-shuffled models. To compare important transcripts and genetic markers from genomic prediction models, transcripts were converted from AGPv2 to AGPv4; only genes with one-to-one correspondence between AGPv3 and AGPv4 were included in this analysis. To assess the impact of uplifting on expression levels, we re-mapped the transcript data from B73 to AGPv4 using Bowtie2 (version 2.3.2) and performed read counting using Cufflinks (version 2.2.1). The correlation between uplifted and re-mapped gene expression levels for B73 was 0.94 (PCC, p < 2x10^-16).

5.5.2 Comparison of transcript and genetic marker data

Three different approaches were used to determine the similarity between lines based on the three different data types. For the genotype data, a kinship matrix was generated using the centered Identity By State method (Endelman and Jannink 2012) implemented in TASSEL v5.20180517 (Bradbury et al. 2007). The Pearson correlation coefficient (PCC) between RNA-Seq mapping rates and kinship with B73 was calculated using the cor.test function in R. For the transcript data, we generated an expression correlation (eCor) matrix by calculating the PCCs of transcript values between lines using the cor.test function in R. The eCor matrix was normalized between 0 and 1 and the diagonal was set to 1. Finally, for the phenotype data, we calculated the Euclidean distance between lines using the distances package in the R environment. The correlations between kinship, eCor, and phenotype distance between pairs of lines were calculated using PCC.

5.5.3 Genomic prediction models and model performance

Because part of the phenotypic signal observed in genomic prediction models may be due to population structure/family relatedness within the breeding population, we established a baseline for our genomic prediction models by using principal components (PCs) generated from the marker data alone to predict phenotypic values for each trait. Because the relationship between population structure and traits can vary by trait and by population, we tested the top 5, 10, 15, 20, 50, 75, and 100 PCs and selected the top 75 PCs as our baseline because accuracy plateaued after this point.
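The population-structure baseline can be sketched as follows: PCs are computed from the marker matrix and used as predictors in a simple linear model. This sketch uses scikit-learn and toy data as stand-ins; the choice of 75 PCs above came from the observed accuracy plateau, which this illustration does not reproduce.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
geno = rng.choice([-1, 0, 1], size=(388, 2000)).astype(float)  # toy marker matrix
y = rng.normal(size=388)                                       # toy trait values

for n_pcs in [5, 10, 15, 20, 50, 75, 100]:
    # For brevity, PCA is fit on all lines; a stricter pipeline would fit it
    # within each training fold to avoid any leakage into the test folds
    pcs = PCA(n_components=n_pcs).fit_transform(geno)
    y_hat = cross_val_predict(LinearRegression(), pcs, y, cv=5)
    pcc = np.corrcoef(y, y_hat)[0, 1]
    print(f"{n_pcs:>3} PCs: baseline PCC = {pcc:.2f}")
```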
Four methods were used for each trait: two linear parametric methods, ridge regression best linear unbiased prediction (rrBLUP) (Endelman 2011) and the Bayesian least absolute shrinkage and selection operator (BL) (Pérez and de los Campos 2014); one non-linear, non-parametric method, Random Forest (RF) (Breiman 2001); and one ensemble-based approach (En) (Dietterich 2000). The rrBLUP models used the mixed.solve function in the rrBLUP package implemented in R. The BL models were also implemented in R, using the BGLR package. RF was implemented in Python using Scikit-Learn (Pedregosa et al. 2011). Ensemble predictions were generated by taking the mean of the predicted trait values from rrBLUP, BL, and RF. A grid search was performed on the first 10 of the 100 cross-validation replicates to find the best combination of parameters for the RF model. Parameters tested included the maximum tree depth (3, 5, 10, and 50) and the maximum number of features included in each tree (10%, 50%, 100%, square root, and log2).

The predictive performance of the models was compared using the PCC. The PCC between the predicted (Ŷ) and true (Y) trait values was computed using the cor() function in R for rrBLUP and BL, or the NumPy corrcoef function in Python for RF. One hundred replicates of a five-fold cross-validation approach were applied to maximize the data available for model training without overfitting. For each replicate, the lines were randomly divided into five subsets; each subset was used as the testing set once, with the remaining four subsets combined to train the model, resulting in a total of 500 cross-validated runs. PCC was calculated using only the predicted values from the testing set of each run. For the top 10 most important transcripts from the Ensemble model, a leave-one-feature-out analysis was performed using rrBLUP with 100 replicates to score how much the model performance (PCC) changed when that one transcript was removed from training (Supplemental Table 5.6). Information about top BLAST matches was collected from MaizeGDB (https://www.maizegdb.org/).

5.5.4 Selecting subsets of T or G for input to genomic prediction models

To determine if using smaller subsets of T or G as input to the genomic prediction models would improve our ability to predict traits, we used rrBLUP and flowering time as an example for feature selection. For the transcript data, features were selected in three ways. First, the 10, 20, 100, and 1000 transcripts with the greatest variance across the maize lines were selected and used as input to the rrBLUP models. Second, the 14 benchmark flowering time genes (see Methods: Benchmark flowering time genes) were used. Finally, the 14 and 200 transcripts with the greatest absolute coefficient (i.e. weight) assigned by rrBLUP during training were selected. For this analysis, the models were re-run without cross-validation so that feature selection and model training were performed on the training data and the testing data were used only to measure model performance, thus guarding against overfitting. This was done for each of the 100 replicates.
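The cross-validation and ensembling scheme described in section 5.5.3 can be sketched in Python as follows; ridge regression stands in for rrBLUP and BL (which were fit in R), and the data are toy values, so this illustrates the mechanics rather than reproducing the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
X = rng.normal(size=(388, 500))                  # toy feature matrix (lines x features)
y = X[:, :5].sum(axis=1) + rng.normal(size=388)  # toy trait values

# One replicate of five-fold CV: every line is predicted exactly once
y_hat = {"ridge": np.empty_like(y), "rf": np.empty_like(y)}
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    ridge = Ridge(alpha=1.0).fit(X[train], y[train])
    rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[train], y[train])
    y_hat["ridge"][test] = ridge.predict(X[test])
    y_hat["rf"][test] = rf.predict(X[test])

y_hat["ensemble"] = (y_hat["ridge"] + y_hat["rf"]) / 2  # mean of model predictions
for name, pred in y_hat.items():
    pcc = np.corrcoef(y, pred)[0, 1]                    # performance as PCC
    print(f"{name:>8}: PCC = {pcc:.2f}")
```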
5.5.5 Genetic marker/transcript importance analysis

To identify features important for building the genomic prediction models, feature importance information was extracted from each model established with one of the four methods: rrBLUP, BL, RF, and Ensemble. For rrBLUP, the importance metric was the marker effect ($u) calculated by mixed.solve in the R rrBLUP package. For BL, the importance metric was the estimated posterior mean ($ETA) calculated using the R BGLR package. The absolute values of the marker effect and the estimated posterior mean were used, since the features are categorical and the sign of the importance metric carries no particular meaning. For RF, the importance metric was the Gini importance, collected from the feature_importances_ attribute of the Scikit-Learn implementation of RF. The Gini importance is the total decrease in node impurity (i.e. the homogeneity of classes in a node) after a particular feature is used to split a node. Node impurity decreases as instances from one of the classes are removed from the node, leaving a greater proportion of instances from the other class. Importance metrics from rrBLUP, BL, and RF were averaged over the 100 cross-validation replicates. Ensemble importance scores were calculated by normalizing the average importance scores from each model and each method between 0 and 1, then taking the mean of the normalized importance scores across the three algorithms. Enrichment for transcript features relative to genetic marker features within the top 1000 or top 20 features was tested using Fisher's exact test, where the number of transcript features in and not in the top X features was compared to the number of genetic marker features in and not in the top X features.

To determine the degree to which the importance of a transcript correlates with the importance of nearby genetic markers, for each transcript T mapping to a genomic region R, the genetic marker G with the greatest mean importance score within a fixed window from the center of region R was selected, forming a T:G pair (Figure 5.3A). To identify the effect of window size, a series of window sizes ranging from 1 to 80 kb was tested. For each window size, Spearman's rank correlation (ρ) was calculated between the importance scores of the T:G pairs. The window size with the highest correlation (2 kb) was chosen (Supplemental Figure 5.3A). For this analysis, transcripts without location information or without one-to-one mapping between AGPv3 and AGPv4 were removed, leaving 24,412 transcripts. With a window size of 2 kb, additional transcripts were dropped because no genetic marker fell within the window, leaving 15,049 transcripts for the downstream analysis. This analysis was repeated for the genome-wide genetic markers (GGW) from Bukowski et al. (2018). To determine the degree to which the importance of a transcript correlated with the importance of trans-regulatory variants, significant eQTLs were identified for each transcript using the linear regression (modelLINEAR) approach from MatrixEQTL implemented in R. Benjamini-Hochberg false discovery rate correction was used to adjust p-values for multiple testing, and eQTLs were considered significant if the adjusted p < 0.05 (Benjamini and Hochberg 1995). The distance for considering an eQTL as cis was 1 megabase (Zan et al. 2016); however, because <0.1% of the eQTLs identified were cis, all eQTLs were analyzed together. The importance of an eQTL, or of the neighboring genetic marker located within a 2 kb window of the eQTL with the greatest average importance score, was compared to the importance of the transcript with the eQTL in question (a T:eQTL pair).
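The T:G pairing and window-size sweep can be illustrated with the following sketch, which pairs each transcript with the most important marker within ±w bp of its midpoint and computes Spearman's ρ across the pairs. Positions and importance scores are randomly generated, and taking the maximum over the window is a simplification of the "greatest mean importance" criterion above.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_t, n_g = 200, 2000
t_pos = np.sort(rng.integers(0, 1_000_000, n_t))   # toy transcript midpoints
g_pos = np.sort(rng.integers(0, 1_000_000, n_g))   # toy marker positions
t_imp = rng.random(n_t)                             # toy transcript importances
g_imp = rng.random(n_g)                             # toy marker importances

for w in [1_000, 2_000, 10_000, 80_000]:
    pairs_t, pairs_g = [], []
    for pos, ti in zip(t_pos, t_imp):
        in_win = (g_pos >= pos - w) & (g_pos <= pos + w)
        if in_win.any():                            # transcripts with no marker in window are dropped
            pairs_t.append(ti)
            pairs_g.append(g_imp[in_win].max())     # most important nearby marker
    rho, _ = spearmanr(pairs_t, pairs_g)
    print(f"window {w / 1000:>4.0f} kb: {len(pairs_t)} T:G pairs, rho = {rho:.2f}")
```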
Enrichment of Gene Ontology (GO) terms associated with important transcripts, relative to the reference genome, was tested using agriGO v2 (Tian et al. 2017). The enrichment p-values are corrected for multiple testing by agriGO v2 using FDR. The top 10, 25, and 100 transcripts from each algorithm, excluding the benchmark flowering time genes, were tested against the reference genome. The top 153 transcripts excluding benchmark genes (i.e. the top transcripts between the two best benchmark genes) from the Ensemble algorithm, and the union of the top 10, 25, and 100 transcripts from all four algorithms, were also tested.

5.5.6 Benchmark flowering time genes

We compiled a list of genes known to be involved in flowering time based on evidence from knockdown experiments (Muszynski et al. 2006; Danilevskaya et al. 2010; Meng et al. 2011; Lazakis et al. 2011; Alter et al. 2016) and/or association studies (Salvi et al. 2007; Hirsch et al. 2014). Genes were assigned confidence levels based on the type of evidence available, with experimental evidence considered high confidence, and association study evidence or significant similarity to known flowering time genes from other species considered medium confidence (Supplemental Table 5.4). Because some of these genes did not have genetic markers located within the 2 kb window of the center of the transcript, progressively larger windows, up to 40 kb, were used to identify the most important nearby genetic marker. To compare importance scores across algorithms and between models using G or T data as input, percentiles were used. To determine whether transcripts or genetic markers assigned to flowering time benchmark genes were associated with flowering time in this study, linear models and t-tests, respectively, implemented in R were used.

5.5.7 Data Availability

All data and code needed to reproduce the results of this study are available on GitHub (https://github.com/ShiuLab/Manuscript_Code/tree/master/2019_expression_GP/data).

5.6 Acknowledgements

We thank Richard Amasino, Wolfgang Busch, and David Lowry for their help in interpreting our findings. This work was partly supported by an NSF Graduate Research Fellowship (Fellow ID: 2015196719) and a Graduate Research Opportunities Worldwide (GROW) Fellowship to C.B.A., and by the U.S. Department of Energy Great Lakes Bioenergy Research Center (BER DE-SC0018409) and the National Science Foundation (IOS-1546617, DEB-1655386) to S.-H.S.

APPENDIX

Supplemental Figure 5.1. Distribution of genetic marker and transcript data across maize chromosomes. Number of genetic markers (top) and transcripts (bottom) included in this study in 1 Mb bins across the maize chromosomes. Supports Figure 5.1.

Supplemental Figure 5.2. Feature importance analysis for G+T models. (A) Relationships between importance scores for transcripts from the T (X-axis) and G+T (Y-axis) flowering time prediction models established with rrBLUP (left column), BL (middle column), and RF (right column). The Pearson correlation coefficient (r) is shown in the top left corner. (B) Distribution of importance scores for the top 1,000 (inset = top 20) features from the G+T models for three traits using rrBLUP (top row) and BL (bottom row). Transcripts are in purple and genetic markers are in yellow. Supports Figure 5.2.

Supplemental Figure 5.3. Impact of transcript region sizes on importance correlation between transcript:genetic marker pairs. (A) The correlation (green) between importance scores for transcript:genetic marker pairs and the number of pairs found (blue) as the transcript region size increases from 1 to 80 kb.
(B) Density plot of the importance scores of genetic markers (Y-axis) and transcripts (X-axis) from T:G pairs using an 80 kb window size. The threshold (red dotted line) was set at the 99th percentile of the normalized importance score for each trait, algorithm, and input feature type. The correlation between transcript and genetic marker importance was calculated using Spearman's rank correlation (ρ). Supports Figure 5.3.

Supplemental Figure 5.4. Manhattan plots of importance scores from genomic prediction models. Manhattan plots of genetic marker (top) and transcript (bottom) importance scores for predicting (A) flowering time, (B) height, and (C) yield. Threshold importance scores (dotted blue) were set at the 99th percentile importance score for each trait, algorithm, and input feature type (i.e. genetic markers or transcripts). Genetic markers and transcripts falling above the threshold are colored in blue. Supports Figure 5.3.

Supplemental Figure 5.5. Correlation between genetic marker/eQTL and transcript importance. Density plots of the importance scores of (A) genetic markers (G, Y-axis) and transcripts (T, X-axis) from T:G pairs and (B) eQTLs (Y-axis) and transcripts (T, X-axis) from T:eQTL pairs. The threshold (red dotted line) was set at the 99th percentile of the normalized importance score for each trait, algorithm, and input feature type. The correlation between transcript and genetic marker importance was calculated using Spearman's rank correlation (ρ). Supports Figure 5.3.

Supplemental Figure 5.6. Genomic prediction and genetic marker:transcript pairs using genome-wide genetic markers (GGW). (A) Summary of the performance of rrBLUP models using the genome-wide genetic markers (GGW) from Bukowski et al. compared to those using the RNA-Seq-derived genetic markers (G) from Hirsch et al. (B) The correlation (green) between importance scores for transcript (T):GGW pairs and the number of pairs found (blue) as the transcript region size increases from 1 to 80 kb (compare to the analysis using the Hirsch et al. genetic markers (G) in Supplemental Figure 5.3). (C) Density plots of the importance scores of GGW (Y-axis) and T (X-axis) using a 2 kb (left) or a 50 kb (right) window. The threshold (red dotted line) was set at the 99th percentile of the normalized importance score for each trait, algorithm, and input feature type. The correlation between transcript and genetic marker importance was calculated using Spearman's rank correlation (ρ). Supports Figure 5.3.

Supplemental Figure 5.7. Correlation of feature importance between algorithms. Density scatter plots of the importance scores of genetic markers (top) and transcripts (bottom) generated with rrBLUP and BL (left), rrBLUP and RF (middle), and BL and RF (right). The correlation between importance scores between algorithms was calculated using Spearman's rank correlation (ρ). Supports Figure 5.3.

Supplemental Figure 5.8. Relationship between transcript levels, alleles, and flowering time for benchmark genes. (A) Boxplots show the median transcript level (log(fold-change)) for each flowering time (growing degree days (GDD)/100) bin, with the 95th (red) and 5th (blue) percentiles shown. Linear models were fit and adjusted r2 and p-values are shown. (B) Violin plots of the distribution of flowering time (GDD/100) for lines with the major (blue) or minor (gray) allele of the genetic marker paired with each benchmark gene.
Significant differences in GDD by allele were tested using t-tests. Supports Figure 5.5.

Supplemental File 5.1. Top 1000 most important transcripts for each trait from the transcript-based Ensemble models. Supplemental File 5.1 can be found at the following link: http://www.plantcell.org/content/early/2019/10/22/tpc.19.00332/tab-figures-data

Supplemental File 5.2. Account of data (genetic marker, transcript, phenotype) availability for maize lines and the decision to include each line in the study. Supplemental File 5.2 can be found at the following link: http://www.plantcell.org/content/early/2019/10/22/tpc.19.00332/tab-figures-data

REFERENCES

Albert, F. W., and L. Kruglyak, 2015 The role of regulatory variation in complex traits and disease. Nat. Rev. Genet. 16: 197–212.
Alter, P., S. Bircheneder, L.-Z. Zhou, U. Schlüter, M. Gahrtz et al., 2016 Flowering Time-Regulated Genes in Maize Include the Transcription Factor ZmMADS1. Plant Physiol. 172: 389–404.
Becker, J., J. R. Wendland, B. Haenisch, M. M. Nöthen, and J. Schumacher, 2012 A systematic eQTL study of cis-trans epistasis in 210 HapMap individuals. Eur. J. Hum. Genet. 20: 97–101.
Benjamini, Y., and Y. Hochberg, 1995 Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Series B Stat. Methodol. 57: 289–300.
Bermingham, M. L., R. Pong-Wong, A. Spiliopoulou, C. Hayward, I. Rudan et al., 2015 Application of high-dimensional feature selection: evaluation for genomic prediction in man. Sci. Rep. 5: 10312.
Bradbury, P. J., Z. Zhang, D. E. Kroon, T. M. Casstevens, Y. Ramdoss et al., 2007 TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 23: 2633–2635.
Breiman, L., 2001 Random forests. Mach. Learn. 45: 5–32.
Buckler, E. S., J. B. Holland, P. J. Bradbury, C. B. Acharya, P. J. Brown et al., 2009 The genetic architecture of maize flowering time. Science 325: 714–718.
Bukowski, R., X. Guo, Y. Lu, C. Zou, B. He et al., 2018 Construction of the third-generation Zea mays haplotype map. Gigascience 7: 1–12.
Burgueño, J., G. de los Campos, K. Weigel, and J. Crossa, 2012 Genomic prediction of breeding values when modeling genotype × environment interaction using pedigree and dense molecular markers. Crop Sci. 52: 707–719.
Cuevas, J., J. Crossa, O. A. Montesinos-López, J. Burgueño, P. Pérez-Rodríguez et al., 2017 Bayesian Genomic Prediction with Genotype × Environment Interaction Kernel Models. G3 7: 41–53.
Danilevskaya, O. N., X. Meng, and E. V. Ananiev, 2010 Concerted modification of flowering time and inflorescence architecture by ectopic expression of TFL1-like genes in maize. Plant Physiol. 153: 238–251.
Dietterich, T. G., 2000 Ensemble methods in machine learning, pp. 1–15 in Multiple Classifier Systems. Springer, Berlin.
Drinkwater, N. R., and M. N. Gould, 2012 The long path from QTL to gene. PLoS Genet. 8: e1002975.
Endelman, J. B., 2011 Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4: 250–255.
Endelman, J. B., G. N. Atlin, Y. Beyene, K. Semagn, X. Zhang et al., 2014 Optimal Design of Preliminary Yield Trials with Genome-Wide Markers. Crop Sci. 54: 48–59.
Endelman, J. B., and J.-L. Jannink, 2012 Shrinkage estimation of the realized relationship matrix. G3 2: 1405–1413.
Frisch, M., A. Thiemann, J. Fu, T. A. Schrag, S. Scholten et al., 2010 Transcriptome-based distance measures for grouping of germplasm and prediction of hybrid performance in maize. Theor. Appl. Genet. 120: 441–450.
Fu, J., K. C. Falke, A. Thiemann, T. A.
Schrag, A. E. Melchinger et al., 2012 Partial least squares regression, support vector machine regression, and transcriptome-based distances for prediction of maize hybrid performance with gene expression data. Theor. Appl. Genet. 124: 825–833.
González-Reymúndez, A., G. de los Campos, L. Gutiérrez, S. Y. Lunt, and A. I. Vazquez, 2017 Prediction of years of life after diagnosis of breast cancer using omics and omic-by-treatment interactions. Eur. J. Hum. Genet. 25: 538–544.
Granato, I., J. Cuevas, F. Luna-Vázquez, J. Crossa, O. Montesinos-López et al., 2018 BGGE: A New Package for Genomic-Enabled Prediction Incorporating Genotype × Environment Interaction Models. G3 8: 3039–3047.
Guo, Z., M. M. Magwire, C. J. Basten, Z. Xu, and D. Wang, 2016 Evaluation of the utility of gene expression and metabolic information for genomic prediction in maize. Theor. Appl. Genet. 129: 2413–2427.
Hanley, M. E., and O. C. May, 2006 Cotyledon damage at the seedling stage affects growth and flowering potential in mature plants. New Phytol. 169: 243–250.
Hansey, C. N., J. M. Johnson, R. S. Sekhon, S. M. Kaeppler, and N. de Leon, 2011 Genetic diversity of a maize association population with restricted phenology. Crop Sci. 51: 704–715.
Heffner, E. L., M. E. Sorrells, and J.-L. Jannink, 2009 Genomic Selection for Crop Improvement. Crop Sci. 49: 1–12.
Heslot, N., H.-P. Yang, M. E. Sorrells, and J.-L. Jannink, 2012 Genomic Selection in Plant Breeding: A Comparison of Models. Crop Sci. 52: 146–160.
Hirsch, C. N., J. M. Foerster, J. M. Johnson, R. S. Sekhon, G. Muttoni et al., 2014 Insights into the maize pan-genome and pan-transcriptome. Plant Cell 26: 121–135.
Jia, J., Z. Liu, X. Xiao, B. Liu, and K.-C. Chou, 2015 iPPI-Esml: An ensemble classifier for identifying the interactions of proteins by incorporating their physicochemical properties and wavelet transforms into PseAAC. J. Theor. Biol. 377: 47–56.
Jonas, E., and D.-J. de Koning, 2013 Does genomic selection have a future in plant breeding? Trends Biotechnol. 31: 497–504.
Kang, T., W. Ding, L. Zhang, D. Ziemek, and K. Zarringhalam, 2017 A biological network-based regularized artificial neural network model for robust phenotype prediction from gene expression data. BMC Bioinformatics 18: 565.
Keever, G. J., J. R. Kessler Jr, G. B. Fain, and D. C. Mitchell, 2015 Seedling Developmental Stage at Transplanting Affects Growth and Flowering of Medallion Flower and Globe Amaranth. J. Environ. Hortic. 33: 53–57.
Lazakis, C. M., V. Coneva, and J. Colasanti, 2011 ZCN8 encodes a potential orthologue of Arabidopsis FT florigen that integrates both endogenous and photoperiod flowering signals in maize. J. Exp. Bot. 62: 4833–4842.
Li, Z., N. Gao, J. W. R. Martini, and H. Simianer, 2019 Integrating Gene Expression Data Into Genomic Prediction. Front. Genet. 10: 126.
Meng, X., M. G. Muszynski, and O. N. Danilevskaya, 2011 The FT-like ZCN8 Gene Functions as a Floral Activator and Is Involved in Photoperiod Sensitivity in Maize. Plant Cell 23: 942–960.
Meuwissen, T. H. E., B. J. Hayes, and M. E. Goddard, 2001 Prediction of Total Genetic Value Using Genome-Wide Dense Marker Maps. Genetics 157: 1819–1829.
Muszynski, M. G., T. Dam, B. Li, D. M. Shirbroun, Z. Hou et al., 2006 delayed flowering1 Encodes a basic leucine zipper protein that mediates floral inductive signals at the shoot apex in maize. Plant Physiol. 142: 1523–1536.
Nordborg, M., J. O. Borevitz, J. Bergelson, C. C. Berry, J.
Chory et al., 2002 The extent of linkage disequilibrium in Arabidopsis thaliana. Nat. Genet. 30: 190–193.
Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion et al., 2011 Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12: 2825–2830.
Pérez, P., and G. de los Campos, 2014 Genome-wide regression and prediction with the BGLR statistical package. Genetics 198: 483–495.
Ribaut, J.-M., and M. Ragot, 2007 Marker-assisted selection to improve drought adaptation in maize: the backcross approach, perspectives, limitations, and alternatives. J. Exp. Bot. 58: 351–360.
Salvi, S., G. Sponza, M. Morgante, D. Tomes, X. Niu et al., 2007 Conserved noncoding genomic sequences associated with a flowering-time quantitative trait locus in maize. Proc. Natl. Acad. Sci. U. S. A. 104: 11376–11381.
Schaefer, R. J., J.-M. Michno, J. Jeffers, O. Hoekenga, B. Dilkes et al., 2018 Integrating Coexpression Networks with GWAS to Prioritize Causal Genes in Maize. Plant Cell 30: 2922–2942.
Schrag, T. A., M. Westhues, W. Schipprack, F. Seifert, A. Thiemann et al., 2018 Beyond Genomic Prediction: Combining Different Types of omics Data Can Improve Prediction of Hybrid Performance in Maize. Genetics 208: 1373–1385.
Shen, H.-B., and K.-C. Chou, 2006 Ensemble classifier for protein fold pattern recognition. Bioinformatics 22: 1717–1722.
Solberg Woods, L. C., 2014 QTL mapping in outbred populations: successes and challenges. Physiol. Genomics 46: 81–90.
Song, Y. H., S. Ito, and T. Imaizumi, 2013 Flowering time regulation: photoperiod- and temperature-sensing in leaves. Trends Plant Sci. 18: 575–583.
Spindel, J., H. Begum, D. Akdemir, P. Virk, B. Collard et al., 2015 Genomic Selection and Association Mapping in Rice (Oryza sativa): Effect of Trait Genetic Architecture, Training Population Composition, Marker Number and Statistical Model on Accuracy of Rice Genomic Selection in Elite, Tropical Rice Breeding Lines. PLoS Genet. 11: e1004982.
Strobl, C., A.-L. Boulesteix, A. Zeileis, and T. Hothorn, 2007 Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8: 25.
Tenaillon, M. I., M. C. Sawkins, A. D. Long, R. L. Gaut, J. F. Doebley et al., 2001 Patterns of DNA sequence polymorphism along chromosome 1 of maize (Zea mays ssp. mays L.). Proc. Natl. Acad. Sci. U. S. A. 98: 9161–9166.
Tian, T., Y. Liu, H. Yan, Q. You, X. Yi et al., 2017 agriGO v2.0: a GO analysis toolkit for the agricultural community, 2017 update. Nucleic Acids Res. 45: W122–W129.
US Patent Application #20140366213, 2014 Modifying Flowering Time in Maize. Published December 11, 2014.
Wills, Q. F., K. J. Livak, A. J. Tipping, T. Enver, A. J. Goldson et al., 2013 Single-cell gene expression analysis reveals genetic associations masked in whole-tissue experiments. Nat. Biotechnol. 31: 748–752.
Zan, Y., X. Shen, S. K. G. Forsberg, and Ö. Carlborg, 2016 Genetic Regulation of Transcriptional Variation in Natural Arabidopsis thaliana Accessions. G3 6: 2319–2328.
Zarringhalam, K., D. Degras, C. Brockel, and D. Ziemek, 2018 Robust phenotype prediction from gene expression data using differential shrinkage of co-regulated genes. Sci. Rep. 8: 1237.
Zenke-Philippi, C., A. Thiemann, F. Seifert, T. Schrag, A. E. Melchinger et al., 2016 Prediction of hybrid performance in maize with a ridge regression model employed to DNA markers and mRNA transcription profiles. BMC Genomics 17: 262.
CHAPTER SIX: PERCEPTIONS OF EMERGING BIOTECHNOLOGIES

This chapter has been published as: Christina B. Azodi and Thomas Dietz (2019) Perceptions of emerging biotechnologies. Environmental Research Letters. DOI: 10.1088/1748-9326/ab4433. © IOP Publishing. Reproduced with permission. All rights reserved.

6.1 Abstract

Research on public views of biotechnology has centered on genetically modified (GM) foods. However, as the breadth of biotechnology applications grows, a better understanding of public concerns about non-agricultural biotechnology products is needed in order to develop proactive strategies to address those concerns. Here, we explore the perceived benefits and risks associated with five biotechnology products and how those perceptions translate into public opinion about the use and regulation of biotechnology in the United States. While we found greater support for non-agricultural biotechnology products, 70% of the individuals surveyed showed no or little variation in their support across the products, indicating that opinions about early GM products may be influencing the acceptance of emerging biotechnologies. We identified five common patterns of opinions about biotechnology and used machine learning models to integrate a wide range of factors and predict a respondent's opinion group. While the model was particularly good at identifying individuals supportive of biotechnology, differentiating between individuals from the non- and conditionally-supportive opinion groups was more challenging, emphasizing the complexity of public opinions of emerging biotechnology products. The full article is available at the following link: https://iopscience.iop.org/article/10.1088/1748-9326/ab4433
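As a rough illustration of the kind of multi-class prediction described in this abstract (respondent characteristics in, opinion group out), the sketch below trains a random forest classifier on invented survey-style data; the 29 features and five group labels are placeholders, not the study's actual variables or pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000                                         # toy respondents
X = rng.normal(size=(n, 29))                     # 29 characteristics (age, ideology, etc.)
y = rng.integers(0, 5, size=n)                   # five opinion groups (placeholder labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")  # ~chance (0.2) on random data
```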