USING TRANSCRIPTOME AND DATA SCIENCE METHODS TO UNCOVER GENE REGULATORY AND FUNCTIONAL INFORMATION By Sahra Uygun A DISSERTATION Submitted to Michigan State University in partial fulfilment of the requirements for the degree of Genetics – Doctor of Philosophy 2017 ABSTRACT USING TRANSCRIPTOME AND DATA SCIENCE METHODS TO UNCOVER GENE REGULATORY AND FUNCTIONAL INFORMATION By Sahra Uygun Even in the well-studied model organisms, there are still genomic regions with unknown function. These genomic regions include protein-coding genes and regulatory elements that are key components of transcriptional regulation. With technological advances, more biological data are being generated including spatial, temporal, developmental, and conditional gene expression data. Gene expression data, and specifically co-expression analyses have been widely used to predict gene function through guilt by association. However, it remains to be seen to what degree co-expression is informative, whether it can be applied to genes involved in different biological processes, and how the choice of gene expression dataset and clustering algorithms impact inferences about gene functions. To answer these questions, I used co-expression to identify novel genes that function in a biological process, and the impact of different clustering algorithms on the ability to identify genes that function in the same pathway. Apart from the functional associations, gene co-expression analyses can also be used to identify the putative cis-regulatory elements that are over-represented in co-expressed gene promoters. These elements can be used to build models of gene regulation under changing environments and genome-wide models of how different organ and cell type gene expression are regulated under changing environments have not yet been built in plants. I used Arabidopsis thaliana organ and cell type stress responsive gene expression data and co-expression clusters to identify putative cis-regulatory elements. Using these elements and machine learning models, I predicted high salinity responsive gene expression in shoots, roots and six root cell types. I found that plant organ and cell type transcriptional response to high salinity are likely regulated by a core set of elements that we identified and built predictive models of plant spatial transcriptional responses to environmental stress. Overall, this research contributes to understanding the role of “big data” in biology, provides guidelines for effectively using gene coexpression in functional associations and shows how computational approaches help in identifying gene regulatory information. ACKNOWLEDGEMENTS I would like to thank my advisor Shin-Han Shiu for his endless support and patience throughout my PhD. With his mentorship, I learned how to ask constructive scientific questions and approach them. He always created a fun and active environment in his laboratory. I would also like to thank my committee members, who all helped me in my PhD with their suggestions and feedback. I was fortunate to rotate in Robert L. Last’s laboratory during my first year of PhD. Later, he became one of my committee members and collaborator mentor in various projects and helped me with his scientific expertise and critical thinking. I was also so lucky to rotate in Jin Chen’s laboratory during the first year. With his computational biology expertise, I was inspired to use computational approaches to answer biological questions and he was always ready to help. In 2011, I took Christina Chan’s Systems Biology course, where I was introduced to various topics in functional genomics. I was so happy when she agreed to be on my committee. I am so grateful for her suggestions and support. Apart from my committee members, other professors also played a huge role in my PhD. I cannot end my acknowledgements without mentioning Cathy Ernst, Genetics Graduate Program director. She was always there for the Genetics students and was so helpful. It was so good to know that we had her support in all aspects of PhD. I would also like to thank the members of the Shiu laboratory over the years. The laboratory grew in numbers and members changed while I was working there and each member was valuable for both scientific discussions and friendship. I would like to specifically thank Melissa Lehti-Shiu for her endless support. Finally, without the continuous encouragement and love of my family, I would not be to complete my PhD. I would like to thank Ayşegül and Öner Uygun for always being there for me and I dedicate this thesis to them. iv TABLE OF CONTENTS LIST OF FIGURES .....................................................................................................................viii KEY TO ABBREVIATIONS .........................................................................................................x CHAPTER 1 ................................................................................................................................... 1 INTRODUCTION .......................................................................................................................... 1 1.1 Omics data and its use in functional genomics ......................................................................... 1 1.2 Abundance and utility of gene expression profiling data ......................................................... 2 1.3 Assigning functions to genes via computational approaches using gene expression data ....... 5 1.4 Deciphering key players in gene expression regulation ........................................................... 7 1.5 Thesis chapters .......................................................................................................................... 8 REFERENCES ............................................................................................................................. 10 CHAPTER 2 ................................................................................................................................. 17 UTILITY AND LIMITATIONS OF USING GENE EXPRESSION DATA TO IDENTIFY FUNCTIONAL ASSOCIATIONS ............................................................................................... 17 2.1 Abstract ................................................................................................................................... 18 2.2 Introduction ............................................................................................................................. 19 2.3 Results and discussion ............................................................................................................ 22 2.3.1 The extent to which genes in pathways have correlated transcript profiles .................... 22 2.3.2 Influence of annotation on pathway ECs ......................................................................... 26 2.3.3 Influence of the similarity measure used to assess EC .................................................... 28 2.3.4 Influence of biological factors on pathway EC................................................................ 29 2.3.5 Impact of datasets used to evaluate pathway EC ............................................................. 33 2.3.6 Influence of individual vs. combined stress datasets on pathway EC ............................. 34 2.3.7 Robustness in recovering pathway genes when using different datasets ......................... 38 2.3.8 Clusters as predictive units of pathways .......................................................................... 41 2.3.9 Impact of clustering algorithms and parameters on the identification of pathway genes 44 2.3.10 Using leucine degradation gene phenomics data to validate co-expression associations ................................................................................................................................................... 46 2.4 Conclusion .............................................................................................................................. 50 2.5 Methods................................................................................................................................... 52 2.5.1 A. thaliana metabolic pathways and pathway features analyzed ..................................... 52 2.5.2 Expression dataset and its processing .............................................................................. 53 2.5.3 Calculation of expression correlation and expression coherence .................................... 53 2.5.4 Co-expression clustering .................................................................................................. 55 2.5.5 Assessing the overlap in pathway and cluster memberships ........................................... 56 2.5.6 Using phenomics data to evaluate co-expression associations ........................................ 57 2.6 Acknowledgements ................................................................................................................. 57 APPENDIX ................................................................................................................................... 58 REFERENCES ............................................................................................................................. 66 v CHAPTER 3 ................................................................................................................................. 74 PREDICTIVE MODELS OF SPATIAL TRANSCRIPTIONAL RESPONSE TO HIGH SALINITY .................................................................................................................................... 74 3.1 Abstract ................................................................................................................................... 75 3.2 Introduction ............................................................................................................................. 75 3.3 Results and Discussion ........................................................................................................... 78 3.3.1 Transcriptional responses to stress have a strong spatial component .............................. 78 3.3.2 Known TF binding motifs contribute to a better than random performing model for salt up-regulation prediction ............................................................................................................ 81 3.3.3 pCREs derived from co-expression clusters are similar, but not identical, to the known binding motifs of TFs................................................................................................................ 85 3.3.4 pCRE set further improves salt up-regulation prediction in a spatially specific manner 88 3.3.5 Filtering pCREs based on TF binding, DNase I hypersensitivity, and conservation ...... 91 3.3.6 pCREs work best in combinations ................................................................................... 93 3.4 Conclusions ............................................................................................................................. 99 3.5 Methods................................................................................................................................. 101 3.5.1 Expression data processing and expression data analysis.............................................. 101 3.5.2 In vitro TF binding, DNase I hypersensitivity, and conserved non-coding datasets ..... 102 3.5.3 Salt up-regulation pCRE identification .......................................................................... 103 3.5.4 Comparison of pCREs and TFBMs ............................................................................... 104 3.5.5 Prediction of salt up-regulation using machine learning and feature selection ............. 106 3.5.6 Binary prediction of root and shoot up-regulated genes ................................................ 107 3.5.7 Combinatorial motif rule discovery ............................................................................... 108 3.6 Acknowledgements ............................................................................................................... 109 APPENDIX ................................................................................................................................. 110 REFERENCES ........................................................................................................................... 115 CHAPTER 4 ............................................................................................................................... 122 PREDICTIVE MODELS OF CELL TYPE HIGH SALINITY RESPONSIVE GENE EXPRESSION ............................................................................................................................ 122 4.1 Abstract ................................................................................................................................. 122 4.2 Introduction ........................................................................................................................... 123 4.3 Results and discussion .......................................................................................................... 125 4.3.1 Known TF binding data in predicting cell type salt responsive gene expression .......... 125 4.3.2 Organ pCREs in predicting cell type salt responsive gene expression .......................... 128 4.3.3 Identifying root cell type pCREs associated with salt stress ......................................... 131 4.3.4 Cell type pCREs in predicting salt up-regulation .......................................................... 133 4.4 Conclusions ........................................................................................................................... 136 4.5 Methods................................................................................................................................. 138 4.5.1 Gene expression dataset ................................................................................................. 138 4.5.2 Co-expression analyses .................................................................................................. 138 4.5.3 Cell type pCRE identification ........................................................................................ 139 4.5.4 Predictive models ........................................................................................................... 139 APPENDIX ................................................................................................................................. 141 REFERENCES ........................................................................................................................... 145 vi CHAPTER 5 ............................................................................................................................... 149 CONCLUDING REMARKS ...................................................................................................... 149 REFERENCES ........................................................................................................................... 154 vii LIST OF FIGURES Figure 2.1 Co-expression of A. thaliana pathway genes under stress .......................................... 25 Figure 2.2 Relationship between pathway ECs, annotation quality and similarity measures ...... 27 Figure 2.3 Impact of pathway size and other factors on EC ......................................................... 32 Figure 2.4 Impact of datasets on pathway EC Percentile ............................................................. 37 Figure 2.5 Performance of clusters in predicting pathways .......................................................... 42 Figure 2.6 Assessing the validity of co-expression associations with leucine measurement data 49 Figure S.2.1 Boxplots of EC values distributions of overall and selected GO-BPs and AraCyc pathways ................................................................................................................................... 59 Figure S.2.2 Factors that potentially influence pathway EC. ....................................................... 60 Figure S.2.3 Randomized pathway EC distributions and EC percentiles from multiple datasets 61 Figure S.2.4 Distinguishing gene pairs within and between pathways using a conditiondependent dataset and a condition-independent dataset ........................................................... 63 Figure S.2.5 The extent to which metabolic pathways are over-represented in co-expression clusters ...................................................................................................................................... 64 Figure S.2.6 The extent to which high EC pathways are over-represented in clusters................. 65 Figure 3.1 A. thaliana gene expression correlation across stress datasets and Gene Ontology (GO) terms enriched in salt responsive genes........................................................................... 79 Figure 3.2 Over-representation of known TF binding sites in organ salt up-regulated genes and performance of CRCs predicting salt up-regulation ................................................................. 84 Figure 3.3 Similarity of the pCREs to CIS-BP TFBMs ............................................................... 87 Figure 3.4 Performance of salt up-regulation prediction models using pCREs identified from coexpression clusters .................................................................................................................... 90 Figure 3.5 Performance of salt up-regulation prediction models using filtered pCRE sets ......... 94 Figure 3.6 Summary of root and shoot combinatorial pCRE rules and model performance ........ 98 Figure S.3.1 Distributions of PCCs between CIS-BP TFBMs with and between example TF families .................................................................................................................................... 111 viii Figure S.3.2 Contribution of general, root, and shoot pCREs to the predictions of true positive genes that were globally, root-specifically, and shoot-specifically up-regulated ................... 112 Figure S.3.3 Feature selection on pCREs and performance of RF models using selected pCREs ................................................................................................................................................. 113 Figure S.3.4 Summary of the distance between pairs of motifs in combinatorial rules ............. 114 Figure 4.1 Performance of cell type salt up-regulation prediction models using TF binding data and organ pCREs .................................................................................................................... 127 Figure 4.2 Gene expression correlation across stress datasets of root, shoot and root cell types 130 Figure 4.3 Cell type pCREs: Classification and similarity among pCRE sets ........................... 132 Figure 4.4 Performance of cell type salt up-regulation prediction models using cell type pCREs ................................................................................................................................................. 135 Figure S.4.1 Precision/recall of END and STE salt up-regulation prediction models using TF binding data and organ pCREs ............................................................................................... 142 Figure S.4.2 Precision/recall of END and STE salt up-regulation prediction models using cell type pCREs and union of cell type+organ pCREs .................................................................. 143 Figure S.4.3 True positive predictions from models using organ and cell type pCREs ............. 144 ix KEY TO ABBREVIATIONS ABA Abscisic Acid AlignACE Aligns Nucleic Acid Conserved Elements AUC-ROC Area Under Curve - Receiver Operating Characteristic BIC Bayesian Information Criterion BN Bayesian Network cDNA Complementary DNA CFS Correlation-based Feature Selection ChIP Chromatin Immunoprecipitation CNS Conserved Non-coding Sequence COL Columella COR Cortex CRC cis-regulatory Code CRE cis-regulatory Element DAP DNA Affinity Purification DHS DNase I Hypersensitive Sites DNA Deoxyribonucleic Acid EBI European Bioinformatics Institute EC Expression Coherence END Endodermis EPI Epidermis eQTL Expression Quantitative Trait Loci x FET Fishers Exact Test FRiP Fraction of Reads in Peaks GEO Gene Expression Omnibus GFP Green Fluorescent Protein GO Gene Ontology GO-BP Gene Ontology Biological Process IUPAC The International Union of Pure and Applied Chemistry KL Kullback-Leibler LeuDeg Leucine Degradation Pathway LOR Log-odds Ratio MDScan Motif Discovery Scan MEME Multiple Expectation Maximization for Motif Elicitation MI Mutual Information NASCArray Nottingham Arabidopsis Stock Centre Array NCBI The National Center for Biotechnology Information PBM Protein Binding Array PCC Pearson Correlation Coefficient pCRE Putative CRE PHL Proto-phloem PWM Position Weight Matrix RBF Radial Basis Function RF Random Forest RMA Robust Multi-Array Average xi RNA Ribonucleic Acid SAGE Serial Analysis of Gene Expression STE Stele SVM Support Vector Machine TAIR The Arabidopsis Information Resource TAMO Tools for Analysis of Motifs T-DNA Transfer DNA TF Transcription Factor TFBM Transcription Factor Binding Motif WEKA Waikato Environment for Knowledge Analysis WGCNA Weighted Gene Co-expression Network Analyses WT Wild-type YMF Yeast Motif Finder xii CHAPTER 1 INTRODUCTION 1.1 Omics data and its use in functional genomics The term “omics” is commonly used in biological sciences to refer to genome-scale data [1]. For example, genomics, first used by Thomas Roderick in 1986, refers to sequencing and analyzing the genomes of organisms [2]. Currently 3,808 eukaryotic genomes, including more than 100 plant species [3], are available in The National Center for Biotechnology Information (NCBI) database [4]. Even though genome sequences are useful, they only constitute a “natural coordinate system” [5], where the individual components involved in how organisms develop, function and respond to the environment are left to be identified. Once a genome has been sequenced, the next step is to identify and annotate the functional regions in the genome. This has been the major focus of functional genomics approaches, which have the goal of understanding functions of unknown genes and other regions in genome to understand how organisms function [5,6]. Annotations are descriptions of genomic features, and these descriptions can be structural or functional [7]. Structural annotations include specifying the coordinates, splice forms, intron/exon junctions of genes and could also include regulatory site information. Structural annotation of genes can be considered as the starting point for characterization of gene functions [3]. After structural annotation of genes is carried out, cellular function and location of gene products as well as the biological processes that the genes are involved in can be characterized. All these biological roles can be considered as functional annotations [8]. One of the common approaches towards annotating biological roles of genes is using comparative sequence analyses, comparing sequence 1 of unknown genes to known ones [9], particularly through interspecies comparisons. Sequencebased functional inference relies on high sequence similarity, which might reflect functional similarity but not all functionally related genes share sequence similarity [10]. Genomic sequence alone is not sufficient to pinpoint biologically meaningful regions. To uncover gene function, high-throughput data have been generated covering multiple layers of biological information involving chromatin, DNA, RNA, proteins and metabolites [11]. For example, chromatin level information includes nucleosome occupancy and histone modifications, while genome level information includes genomic sequences and genome binding by regulatory components like transcription factors (TFs). Gene product information such as quantification of transcripts, proteins, and metabolites have been generated at a genome-scale. Different levels of biological information contribute to understanding functions of unknown genes that are the goals of functional genomics approaches. At the levels of gene products, transcript level information i.e. steady-state mRNA levels is the most abundant data type [12] and available from publicly data repositories such as NCBI. This facilitates the computational studies to utilize the vast number of gene expression data to extract biological information. 1.2 Abundance and utility of gene expression profiling data Gene expression data are useful either alone or in combination with other data for disease classification [13–16], marker gene discovery [16], expression quantitative trait loci (eQTLs) discovery [17–19], learning about gene evolution [20], and identifying gene co-expression networks that can lead to gene regulatory and functional information [9,21,22]. In achieving these, publicly available gene expression datasets are useful resources. Gene Expression Omnibus (GEO, [23]) is a public functional genomics data repository [24]. Originally, this repository is started for 2 storing microarray and sequencing based high-throughput gene expression data similar to the ArrayExpress repository of European Bioinformatics Institute (EBI, [25]). However with technological advances, other biological data, such as nucleosome occupancy, are also being stored there. Nonetheless, gene expression profiling is the most abundant data type in this repository. Gene expression profiles could be obtained by microarray [26] and RNA-seq [27] experiments. These methods are the most widely used gene expression measurement approaches. For example, 48,501 gene expression datasets available in GEO are based on microarray technology. Microarrays for detecting gene expression was developed in 1995 [28] and the aim was to monitor the expression of many genes quantitatively in parallel [26]. This technique involves fixed DNA probes that the cDNA from biological sample can bind to and hybridize [26]. As hybridization is involved, using microarrays require prior knowledge of genome sequence, which is a limitation if an organism of unknown genome or transcriptome is studied. Also, cross-hybridization and saturation of hybridization are additional limitations of array-based approaches in monitoring gene expression [29]. To overcome some of these limitations, sequencing-based techniques have been developed, where prior sequence information is not required [27]. Currently, 11,338 gene expression datasets available in GEO are based on high-throughput sequencing. Different from array-based approaches, cDNA sequence can be directly determined using sequencers. Sequencing has also been improved from traditional Sanger sequencing and tag-based sequencing (such as Serial Analysis of Gene Expression (SAGE) [30]) to next-generation sequencing [29]. Through the genome-scale expression profiling techniques, it is possible to understand which genes are induced/repressed and highly/lowly expressed during a developmental stage, in a particular tissue, and under a particular condition. However, the number of genes altered in gene expression is not the only information that could be obtained from systems-wide gene expression 3 data, as it is possible to monitor thousands of genes and compare multiple gene expression profiles. This has led to substantial efforts in the field of bioinformatics to advance and develop data mining strategies to use the accumulating gene expression data to identify functions of genomic regions. For example, co-expression analysis relies on the observation that functionally related genes often have similar expression patterns and can be useful in reverse genetics approaches for narrowing down candidate genes to test for a particular function [10]. In addition, co-expressed genes tend to share common regulatory signatures such as similar TF binding sites [31]. This also makes gene expression data useful in finding over-represented motifs (potential TF binding sites) from coexpressed genes. Databases of co-expression gene networks, where expression similarity between genes (nodes) are captured in connections (edges), are available for multiple plant species (e.g. Arabidopsis thaliana, rice, barley and others [32–35]) as well as other organisms like fruit fly, mouse, and humans [36]. In the following sections, I discuss the utility of using gene expression data (temporal, spatial, and conditional) and co-expression analyses in hypothesizing gene function as well as identifying potential TF binding sites. It should be noted that the gene expression data only reflects one level of gene regulatory information among multiple levels of biological information mentioned earlier. Through integration of different levels of information, it is possible to get a more complete picture of inferred “function” compared to using one data type. This type of data integration efforts have been carried out in multiple organisms including Arabidopsis thaliana, fruit fly, mouse, and humans [37–39] by combining transcriptome with, for example, protein-DNA interactions, protein-protein interactions, and literature co-occurrences [40]. However, in data integration approaches, transcriptome data remain the most abundant and the most influential in capturing gene functional relationships [10,41]. 4 1.3 Assigning functions to genes via computational approaches using gene expression data Gene function can be characterized in multiple ways. Function of a protein-coding gene can be described by the biological processes and pathways its products are involved in, the molecular functions its gene product have, and/or the cellular component the gene product is located in. These properties of genes are summarized in controlled vocabulary in Gene Ontology (GO), where ~20 organisms have gene annotations to one of the following hierarchical ontologies: biological processes, molecular function and cellular component [42]. In the model plant, Arabidopsis thaliana, 40% of protein-coding genes have annotations with experimental evidences in at least one of the ontologies [10]. However, only ~5% of A. thaliana genes have annotations to all ontologies based on experimental evidence [10]. In addition, it is remarkable that only 1% of rice protein-coding genes have annotation in at least one of the ontologies based on experimental evidence [10]. This leaves over 90% of genes to characterize experimentally. Thus, it is necessary to computationally predict gene function and narrow down the candidate genes that would be further experimentally tested. Gene expression data are useful in hypothesizing and predicting gene function. If a gene X with unknown function coordinately expresses over a variety of experimental conditions (coexpresses) with a gene Y of known function, then it is possible that the gene X has a similar function as Y. This is known as guilt by association [43]. The relationship between co-expression and function was first demonstrated in Saccharomyces cerevisiae and human transcriptome studies [45–48]. This approach has been used in plants as well to identify genes that are involved in multiple pathways including fatty acid biosynthesis, specialized metabolism, and secondary cell wall biosynthesis [10,49–51]. For example, 71 co-expression modules that were associated with 5 genes that are expressed in specific tissues or in response to pathogen infection, abiotic stress, hormone treatments or environmental conditions were identified in Oryza sativa (rice) [52]. Importantly, 17% of the 17,298 rice genes that lacked a functional description (GO annotation) were found in at least one of the co-expression modules that could be functionally associated to other genes in the module [52]. In another recent study, researchers used co-expression to hypothesize gene function using guilt by association in Bos taurus (cattle) and they further supported their predictions with protein interaction data. Overall, 132 genes with previously unknown function were assigned biological roles [53]. Although co-expression is useful in predicting gene function, there are limitations to consider. First, assessing co-expression is not trivial. How similarity of expression is calculated and how high similarity is defined are expected to impact which genes are considered to be coexpressed. Thus, their definition can change conclusions regarding gene functional associations. The second limitation is that, depending on the number and type of gene expression samples used in co-expression analysis, the identity of genes defined as co-expressed might change. Current studies typically combine multiple datasets for gene function inference [9,25]. Including large numbers of samples (e.g. over ~100s) increases the statistical power of calculating expression similarity. However, combining gene expression samples from different experiments may result in the loss of context-specific relationships [54]. For example, similarities in gene expression profiles may depend on the cellular context [55], such as the differences in co-expression groups in cancer vs. non-cancer cells [56]. The inclusion of additional data may eliminate the coexpression signal. The third limitation of using co-expression in functional inference is that the genes of interest might be regulated at a level other than transcription, such as post-transcriptional regulation. In this case, the gene co-expression relationship might not be informative to pinpoint 6 functionally related genes and data integration methods that incorporate epigenetic and posttranscriptional information are expected to be more informative compared to using co-expression alone. Apart from gene functions, cis-regulatory code can be inferred from analyzing co-expressed genes. The knowledge of components of gene regulatory machinery is important for understanding how genes are regulated at the transcriptional level. Among the regulatory components, the cisregulatory elements (CREs) that the TFs bind to drive gene expression are important for regulating temporal, spatial and conditional gene expression. Gene expression data and co-expression analyses are useful in identifying CREs. CREs can then be used to form predictive models of gene expression. 1.4 Deciphering key players in gene expression regulation Gene regulation ensures that the gene products are made correctly in a temporal, spatial and conditional manner [57,58] and involve multiple levels. The multiple levels of regulation leading to the final protein products include epigenetic, transcriptional, post-transcriptional, and post-translational level. Steady-state mRNA levels are particularly determined by transcriptional regulation [59]. At the transcriptional level, CREs and the DNA-binding TFs are important components that recruit the basal transcriptional machinery including RNA polymerase. One way to identify CREs is to use chromatin immunoprecipitation (ChIP) with a given TF followed by array hybridization or sequencing, which yields binding site sequences [60]. In addition to identifying TF binding sites with ChIP studies, chromatin features such as histone modifications can also yield gene regulatory information. For example, ChIP-seq focusing on specific histone modifications (e.g. H3K27 acetylation and H3K4 methylation [61]) and open chromatin regions identified from DNAse I hypersensitivity [62] can provide information on 7 potential CREs. Another way to identify potential CREs is using computational methods to identify over-represented sequences among a given set of sequences. CREs are often found close to transcription start site of the genes [63,64] and using genomic sequences, particularly promoter regions of the genes that co-express over a variety of conditions [65,66] it is possible to identify potential TF binding elements. Other genomic regions, such as the first intron of the gene itself, are also found to contain CREs involved in gene regulation [67]. It should be noted that the above approaches offer genome-wide information on TF binding and/or CREs. However, they do not provide mechanistic details of transcriptional regulation of genes. To further complement the genome-wide approaches mentioned above, machine learning techniques can be used to predict gene expression based on rules of CREs. The rules can be based on presence/absence, copy number, location, combinations of CREs. Machine learning includes statistical modelling [68] and has three stages: algorithm design, learning and testing. Potential CREs identified from genome-wide approaches can be used to build models for predicting gene expression and, based on the performance of prediction models, a set of CREs might be determined as drivers of gene expression at the context studied. Thus, these data-driven methods are key tools for the identifying CREs by forming predictive models [70]. 1.5 Thesis chapters In this research, I used publicly available high-throughput gene expression data to identify functional associations and putative CREs involved in organ and cell type stress responsive gene expression. In research described in Chapter 2, I investigated the utilities and limitations of using gene co-expression in hypothesizing functional association. Specifically, I assessed to what extent Arabidopsis thaliana pathway genes co-express. I also determined what type of pathways tend to 8 form co-expression modules and evaluated the influence of dataset on the co-expressed genes. I also evaluated the impact of commonly-used clustering algorithms and their parameters on the ability to identify genes that function in the same pathways. In validating co-expression cluster memberships, I used an independent phenomics dataset to confirm the potential functional associations obtained from clusters. In Chapter 3, I explored what the cis-regulatory code is for the organ specific high-salinity responsive gene expression. I asked to what extent the stress gene expression is similar among different organs and whether the current knowledge of TF binding sites could explain the organ-specific stress gene expression. Through collaboration with Alexander E. Seddon, who completed his Master’s study in the Shiu laboratory, we identified putative CREs that might be involved in organ high-salinity responsive gene expression and formed predictive models of gene up-regulation. I incorporated known TF binding sites, chromatin accessibility and evolutionary conservation; Alexander E. Seddon incorporated CRE combinatorial relations in machine learning models. Overall, we present a genome-wide view of cis-regulatory logic of organ gene expression in response to high-salinity. In research described in Chapter 4, I asked whether whole organ associated CREs could explain gene expression at a finer resolution to cell type level. I identified root cell type putative CREs and formed predicted models for each cell type. 9 REFERENCES 10 REFERENCES 1. Joyce AR, Palsson BØ. The model organism as a system: integrating “omics” data sets. Nat Rev Mol Cell Biol. Nature Publishing Group; 2006;7: 198–210. doi:10.1038/nrm1857 2. Hieter P, Boguski M. Functional Genomics: It’s All How You Read It. Science (80- ). 1997;278. 3. Veeckman E, Ruttink T, Vandepoele K. Are We There Yet? Reliably Estimating the Completeness of Plant Genome Sequences. Plant Cell. 2016;28: 1759–68. doi:10.1105/tpc.16.00349 4. National Center for Biotechnology Information. [cited 14 Dec 2016]. Available: https://www.ncbi.nlm.nih.gov/ 5. Werner T. Next generation sequencing in functional genomics. Brief Bioinform. Oxford University Press; 2010;11: 499–511. doi:10.1093/bib/bbq018 6. Holtorf H, Guitton M-C, Reski R. Plant functional genomics. Naturwissenschaften. 2002;89: 235–49. Available: http://www.ncbi.nlm.nih.gov/pubmed/12146788 7. Cantarel BL, Korf I, Robb SMC, Parra G, Ross E, Moore B, et al. MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 2008;18: 188–196. doi:10.1101/gr.6743907 8. Berardini TZ, Mundodi S, Reiser L, Huala E, Garcia-Hernandez M, Zhang P, et al. Functional annotation of the Arabidopsis genome using controlled vocabularies. Plant Physiol. American Society of Plant Biologists; 2004;135: 745–55. doi:10.1104/pp.104.040071 9. Movahedi S, Van bel M, Heyndrickx K s., Vandepoele K. Comparative co-expression analysis in plant biology. Plant Cell Environ. Blackwell Publishing Ltd; 2012;35: 1787– 1798. doi:10.1111/j.1365-3040.2012.02517.x 10. Rhee SY, Mutwil M. Towards revealing the functions of all genes in plants. Trends Plant Sci. Elsevier Ltd; 2014;19: 212–221. doi:10.1016/j.tplants.2013.10.006 11. Berger B, Peng J, Singh M. Computational solutions for omics data. Nat Rev Genet. NIH Public Access; 2013;14: 333–46. doi:10.1038/nrg3433 12. Barrett T. Gene Expression Omnibus (GEO). The NCBI Handbook. National Center for Biotechnology Information (US); 2013. 13. van ’t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415: 530–536. doi:10.1038/415530a 11 14. Weigelt B, Baehner FL, Reis-Filho JS. The contribution of gene expression profiling to breast cancer classification, prognostication and prediction: a retrospective of the last decade. J Pathol. 2010;220: 263–80. doi:10.1002/path.2648 15. Yeoh E, Ross ME, Shurtleff SA, Williams WK, Patel D, Mahfouz R, et al. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell. 2002;1: 133–43. doi:10.1016/S15356108(02)00032-6 16. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science (80- ). 1999;286. 17. Albert FW, Kruglyak L. The role of regulatory variation in complex traits and disease. Nat Rev Genet. 2015;16: 197–212. doi:10.1038/nrg3891 18. Ranjan A, Budke JM, Rowland SD, Chitwood DH, Kumar R, Carriedo L, et al. eQTL Regulating Transcript Levels Associated with Diverse Biological Processes in Tomato. Plant Physiol. 2016;172: 328–40. doi:10.1104/pp.16.00289 19. West MAL, Kim K, Kliebenstein DJ, Van Leeuwen H, Michelmore RW, Doerge RW, et al. Global eQTL mapping reveals the complex genetic architecture of transcript-level variation in Arabidopsis. Genetics. 2007;175: 1441–1450. doi:10.1534/genetics.106.064972 20. Donoghue MT, Keshavaiah C, Swamidatta SH, Spillane C. Evolutionary origins of Brassicaceae specific genes in Arabidopsis thaliana. BMC Evol Biol. 2011;11: 47. doi:10.1186/1471-2148-11-47 21. Horan K, Jang C, Bailey-Serres J, Mittler R, Shelton C, Harper JF, et al. Annotating genes of known and unknown function by large-scale coexpression analysis. Plant Physiol. 2008;147: 41–57. doi:10.1104/pp.108.117366 22. Vandepoele K, Quimbaya M, Casneuf T, De Veylder L, Van de Peer Y. Unraveling transcriptional control in Arabidopsis using cis-regulatory elements and coexpression networks. Plant Physiol. 2009;150: 535–46. doi:10.1104/pp.109.136028 23. Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. Oxford University Press; 2002;30: 207–10. doi:10.1093/NAR/30.1.207 24. Rung J, Brazma A. Reuse of public genome-wide gene expression data. Nat Rev Genet. Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.; 2013;14: 89–99. doi:10.1038/nrg3394 25. Brazma A, Parkinson H, Sarkans U, Shojatalab M, Vilo J, Abeygunawardena N, et al. ArrayExpress - A public repository for microarray gene expression data at the EBI. Nucleic Acids Research. 2003. pp. 68–71. doi:10.1093/nar/gkg091 12 26. Brown PO, Botstein D. Exploring the new world of the genome with DNA microarrays. Nat Genet. 1999;21: 33–37. doi:10.1038/4462 27. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008;18: 1509–1517. doi:10.1101/gr.079558.108 28. Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science (80- ). 1995;270: 467–470. doi:10.1126/science.270.5235.467 29. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. NIH Public Access; 2009;10: 57–63. doi:10.1038/nrg2484 30. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Serial Analysis of Gene Expression. Science (80- ). 1995;270: 484–487. doi:10.1126/science.270.5235.484 31. Allocco DJ, Kohane IS, Butte AJ. Quantifying the relationship between co-expression, coregulation and gene function. BMC Bioinformatics. BioMed Central; 2004;5: 18. doi:10.1186/1471-2105-5-18 32. Obayashi T, Nishida K, Kasahara K, Kinoshita K. ATTED-II updates: Condition-specific gene coexpression to extend coexpression analyses and applications to a broad range of flowering plants. Plant Cell Physiol. 2011;52: 213–219. doi:10.1093/pcp/pcq203 33. Hamada K, Hongo K, Suwabe K, Shimizu A, Nagayama T, Abe R, et al. OryzaExpress: An integrated database of gene expression networks and omics annotations in rice. Plant Cell Physiol. 2011;52: 220–229. doi:10.1093/pcp/pcq195 34. Mochida K, Uehara-Yamaguchi Y, Yoshida T, Sakurai T, Shinozaki K. Global landscape of a Co-expressed gene network in barley and its application to gene discovery in triticeae crops. Plant Cell Physiol. 2011;52: 785–803. doi:10.1093/pcp/pcr035 35. Fukushima A, Nishizawa T, Hayakumo M, Hikosaka S, Saito K, Goto E, et al. Exploring Tomato Gene Functions Based on Coexpression Modules Using Graph Clustering and Differential Coexpression Approaches. Plant Physiol. 2012;158: 1487–1502. doi:10.1104/pp.111.188367 36. Obayashi T, Kinoshita K. COXPRESdb: A database to compare gene coexpression in seven model animals. Nucleic Acids Res. 2011;39. doi:10.1093/nar/gkq1147 37. Warde-Farley D, Donaldson SL, Comes O, Zuberi K, Badrawi R, Chao P, et al. The GeneMANIA prediction server: Biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res. 2010;38. doi:10.1093/nar/gkq537 38. Lee T, Yang S, Kim E, Ko Y, Hwang S, Shin J, et al. AraNet v2: an improved database of co-functional gene networks for the study of Arabidopsis thaliana and 27 other nonmodel plant species. Nucleic Acids Res. 2015;43: D996-1002. doi:10.1093/nar/gku1053 13 39. Kotera M, Yamanishi Y, Moriya Y, Kanehisa M, Goto S. GENIES: gene network inference engine based on supervised analysis. Nucleic Acids Res. 2012;40: W162-7. doi:10.1093/nar/gks459 40. Greenbaum D, Luscombe NM, Jansen R, Qian J, Gerstein M. Interrelating different types of genomic data, from proteome to secretome: ’Oming in on function. Genome Res. 2001;11: 1463–1468. doi:10.1101/gr.207401 41. Alexeyenko A, Sonnhammer ELL. Global networks of functional coupling in eukaryotes from comprehensive data integration. Genome Res. 2009;19: 1107–16. doi:10.1101/gr.087528.108 42. Gene Ontology Consortium TGO. Gene Ontology Consortium: going forward. Nucleic Acids Res. Oxford University Press; 2015;43: D1049-56. doi:10.1093/nar/gku1179 43. Bhat P, Yang H, Bögre L, Devoto A, Paccanaro A. Computational selection of transcriptomics experiments improves Guilt-by-Association analyses. Provart NJ, editor. PLoS One. Public Library of Science; 2012;7: e39681. doi:10.1371/journal.pone.0039681 44. Stuart JM, Segal E, Koller D, Kim SK. A Gene-Coexpression Network for Global Discovery of Conserved Genetic Modules. Science (80- ). 2003;302: 249–255. doi:10.1126/science.1087447 45. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genomewide expression patterns. Proc Natl Acad Sci. 1998;95: 14863–14868. doi:10.1073/pnas.95.25.14863 46. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998;9: 3273–97. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=25624&tool=pmcentrez&rend ertype=abstract 47. Jansen R, Greenbaum D, Gerstein M. Relating whole-genome expression data with proteinprotein interactions. Genome Res. 2002;12: 37–46. doi:10.1101/gr.205602 48. Lee HK, Hsu AK, Sajdak J, Qin J, Pavlidis P. Coexpression analysis of human genes across many microarray data sets. Genome Res. 2004;14: 1085–94. doi:10.1101/gr.1910904 49. Han X, Yin L, Xue H. Co-expression analysis identifies CRC and AP1 the regulator of Arabidopsis fatty acid biosynthesis. J Integr Plant Biol. 2012;54: 486–99. doi:10.1111/j.1744-7909.2012.01132.x 50. Maeda H, Yoo H, Dudareva N. Prephenate aminotransferase directs plant phenylalanine biosynthesis via arogenate. Nat Chem Biol. 2011;7: 19–21. doi:10.1038/nchembio.485 51. Persson S, Wei H, Milne J, Page GP, Somerville CR. Identification of genes required for cellulose synthesis by regression analysis of public microarray data sets. Proc Natl Acad Sci 14 U S A. 2005;102: 8633–8. doi:10.1073/pnas.0503392102 52. Childs KL, Davidson RM, Buell CR. Gene coexpression network analysis as a source of functional annotation for rice genes. PLoS One. 2011;6: e22196. doi:10.1371/journal.pone.0022196 53. Beiki H, Nejati-Javaremi A, Pakdel A, Masoudi-Nejad A, Hu Z-L, Reecy JM. Large-scale gene co-expression network as a source of functional annotation for cattle genes. BMC Genomics. London: BioMed Central; 2016;17: 846. doi:10.1186/s12864-016-3176-2 54. He F, Karve AA, Maslov S, Babst BA. Large-Scale Public Transcriptomic Data Mining Reveals a Tight Connection between the Transport of Nitrogen and Other Transport Processes in Arabidopsis. Front Plant Sci. Frontiers; 2016;7: 1207. doi:10.3389/fpls.2016.01207 55. de la Fuente A. From “differential expression” to “differential networking” - identification of dysfunctional regulatory networks in diseases. Trends in Genetics. 2010. pp. 326–333. doi:10.1016/j.tig.2010.05.001 56. Anglani R, Creanza TM, Liuzzi VC, Piepoli A, Panza A, Andriulli A, et al. Loss of connectivity in cancer co-expression networks. PLoS One. 2014;9. doi:10.1371/journal.pone.0087075 57. Lackner DH, Beilharz TH, Marguerat S, Mata J, Watt S, Schubert F, et al. A Network of Multiple Regulatory Layers Shapes Gene Expression in Fission Yeast. Mol Cell. 2007;26: 145–155. doi:10.1016/j.molcel.2007.03.002 58. Carlberg C, Molnar F. Mechanisms of gene regulation. Mechanisms of Gene Regulation. 2014. doi:10.1007/978-94-007-7905-1 59. Spangler JB, Ficklin SP, Luo F, Freeling M, Feltus FA. Conserved Non-Coding Regulatory Signatures in Arabidopsis Co-Expressed Gene Modules. PLoS One. 2012;7. doi:10.1371/journal.pone.0045041 60. Furey TS. ChIP-seq and beyond: new and improved methodologies to detect and characterize protein-DNA interactions. Nat Rev Genet. 2012;13: 840–52. doi:10.1038/nrg3306 61. Zhou VW, Goren A, Bernstein BE. Charting histone modifications and the functional organization of mammalian genomes. Nat Rev Genet. 2011;12: 7–18. doi:10.1038/nrg2905 62. Sullivan AM, Arsovski AA, Lempe J, Bubb KL, Weirauch MT, Sabo PJ, et al. Mapping and Dynamics of Regulatory DNA and Transcription Factor Networks in A. thaliana. Cell Rep. 2014;8: 2015–2030. doi:10.1016/j.celrep.2014.08.019 63. Farnham PJ. Insights from genomic profiling of transcription factors. Nat Rev Genet. 2009;10: 605–616. doi:10.1038/nrg2636 15 64. Yu C-P, Lin J-J, Li W-H. Positional distribution of transcription factor binding sites in Arabidopsis thaliana. Sci Rep. 2016;6: 25164. doi:10.1038/srep25164 65. Zou C, Sun K, Mackaluso JD, Seddon AE, Jin R, Thomashow MF, et al. Cis-regulatory code of stress-responsive transcription in Arabidopsis thaliana. Proc Natl Acad Sci U S A. 2011;108: 14992–7. doi:10.1073/pnas.1103202108 66. Koryachko A, Matthiadis A, Ducoste JJ, Tuck J, Long TA, Williams C. Computational approaches to identify regulators of plant stress response using high-throughput gene expression data. Curr Plant Biol. 2015;3: 20–29. doi:10.1016/j.cpb.2015.04.001 67. Rombauts S, Florquin K, Lescot M, Marchal K, Rouzé P, van de Peer Y. Computational approaches to identify promoters and cis-regulatory elements in plant genomes. Plant Physiol. American Society of Plant Biologists; 2003;132: 1162–76. doi:10.1104/pp.102.017715 68. Tarca AL, Carey VJ, Chen X, Romero R, Drăghici S, Rosenblatt F, et al. Machine Learning and Its Applications to Biology. PLoS Comput Biol. Public Library of Science; 2007;3: e116. doi:10.1371/journal.pcbi.0030116 69. Libbrecht MW, Noble WS. Machine learning applications in genetics and genomics. Nat Rev Genet. Nature Research; 2015;16: 321–332. doi:10.1038/nrg3920 70. Li Y, Chen C yu, Kaye AM, Wasserman WW. The identification of cis-regulatory elements: A review from a machine learning perspective. BioSystems. 2015. pp. 6–17. doi:10.1016/j.biosystems.2015.10.002 16 CHAPTER 2 UTILITY AND LIMITATIONS OF USING GENE EXPRESSION DATA TO IDENTIFY FUNCTIONAL ASSOCIATIONS1 1 The work described in this chapter was published in the following manuscript: Sahra Uygun, Cheng Peng, Melissa D. Lehti-Shiu, Robert L. Last, Shin-Han Shiu (2016) Utility and limitations of using gene expression data to identify functional associations. PLOS Computational Biology. http://dx.doi.org/10.1371/journal.pcbi.1005244 Contribution: I was involved in conceptualization, data collection and analyses of this project. I wrote the original manuscript and all authors were involved in reviewing and editing of the manuscript. 17 2.1 Abstract Gene co-expression has been widely used to hypothesize gene function through guilt-by association. However, it is not clear to what degree co-expression is informative, whether it can be applied to genes involved in different biological processes, and how the type of dataset impacts inferences about gene functions. Here our goal is to assess the utility and limitations of using coexpression as a criterion to recover functional associations between genes. By determining the percentage of gene pairs in a metabolic pathway with significant expression correlation, we found that many genes in the same pathway do not have similar transcript profiles and the choice of dataset, annotation quality, gene function, expression similarity measure, and clustering approach significantly impacts the ability to recover functional associations between genes using Arabidopsis thaliana as an example. Some datasets are more informative in capturing coordinated expression profiles and larger data sets are not always better. In addition, to recover the maximum number of known pathways and identify candidate genes with similar functions, it is important to explore rather exhaustively multiple dataset combinations, similarity measures, clustering algorithms and parameters. Finally, we validated the biological relevance of co-expression cluster memberships with an independent phenomics dataset and found that genes that consistently cluster with leucine degradation genes tend to have similar leucine levels in mutants. This study provides a framework for obtaining gene functional associations by maximizing the information that can be obtained from gene expression datasets. 18 2.2 Introduction With the ease of sequencing, an ever increasing number of genomes from a wide range of species are available. One major challenge is to ascribe functions to genomic features. For example, while ~70% of Arabidopsis thaliana genes have annotated functions [1], only ~40% of these annotations are supported by experimental evidence such as mutant phenotype or biochemical assays [2]. To increase functional information, transcriptome data have been used to develop hypotheses of gene function based on similarity of expression patterns (co-expression) with genes that have known functions [2–4]. The relationship between co-expression and functional correlation was first shown with Saccharomyces cerevisiae and human transcriptome data [5–8]. Subsequently, a large number of plant studies used co-expression analysis to infer gene functions [9–17]. For example, the MYB28 and MYB29 transcription factors are co-expressed with the glucosinolate pathway genes that they regulate [9]. Similarly, the transcription factors CRC and AP1 co-express with 58 fatty acid biosynthesis genes, and crc and ap1 mutants have altered fatty acid compositions [15]. More broadly, methods based on integration of multiple types of omics datasets were developed to account for different levels of regulation and to improve gene functional inferences [18–21]. In these data integration exercises, transcriptome data remain the most abundant and the most effective in capturing gene functional relationships [2,18]. Thus, analysis of gene expression results can inform hypotheses of plant gene functions. Despite its utility, there are known computational and biological limitations in using coexpression for gene functional inference, and these usually are not evaluated in co-expression based studies [2]. First, genes with similar expression profiles may not necessarily have related functions [22]. Second, for those genes that do have related functions, transcription patterns may not be coordinated due to post-transcriptional and other levels of regulation [23]. Third, it is also 19 possible that they do in fact co-express, but that the co-expression criteria need to be optimized. For example, using an expression coherence (EC) measure, which is the ratio of the number of coexpressed gene pairs to the total number of gene pairs [24], only 41% of the Gene Ontology Biological Process (GO-BP) terms have higher ECs than expected by chance [25]. The 59% of pathways with low ECs may contain genes that are regulated beyond transcription. Alternatively, a more detailed exploration is required to determine how co-expression should be defined. Consistent with this, in most studies, a fixed threshold of expression similarity is used to identify pairs of co-expressed genes. Depending on the value of this threshold, the degree of co-expression might be over- or underestimated and lead to false positive or negative associations. Therefore, it is necessary to optimize the criteria used to define co-expression to increase the utility of expression data in guilt-by-association studies. One major parameter that impacts co-expression studies is the type of dataset; it is expected that not all expression profiling experiments will be informative for revealing functional relationships between any given gene pair [26]. Most studies combine multiple datasets for gene function inference [9,25]. One advantage of this approach is the increased statistical power for establishing correlations. Small number of samples might lead to statistically unreliable connections [27]. However, the inclusion of too many samples can result in the loss of information [28], and expression datasets that are directly relevant to the underlying biological processes might be more useful in functional inference. For example, to uncover drought response pathway genes, it would be better to use a more specific, drought stress dataset instead of a collection that includes potentially uninformative experiments [13]. Other factors that impact the effectiveness of coexpression studies include the specific samples used (e.g. stress vs. developmental series), method of data transformation (e.g. fold change vs. absolute expression values), and the procedures and 20 parameters used to define co-expression. A comprehensive study evaluating the above is needed and would be highly informative for future studies that use co-expression as a means for functional inference. In addition to inferring functional relationships between two genes, co-expression is useful for uncovering groups of genes with related functions (referred to as clusters). Unsupervised learning methods, particularly various clustering algorithms, are among the most common approaches used to identify co-expression clusters [29]. Once the clusters are identified, functional categories such as GO can be used to evaluate what types of genes are over-represented in each cluster, and gene functions can be hypothesized based on cluster membership [30]. Although clustering and enrichment analyses are straightforward, there is no single best method [31] as there are a large numbers of clustering algorithms and the cluster memberships (which genes are in the same cluster) depend on many clustering variables (e.g. algorithm, distance measure and number of clusters). Because differences in parameter choice strongly influence the types of co-expression clusters obtained, it is important to perform clustering with multiple parameters rather than relying on a single method. In this study, our goal was to maximize the information from co-expression data to improve predictions of functional associations between genes. Specifically, we asked to what extent A. thaliana genes are co-expressed in each metabolic pathway. We also explored the features of high EC pathways. Next, we evaluated the influence of dataset on EC for each metabolic pathway, the best practices in using co-expression to identify novel genes that function in a biological process, and the impact of different commonly-used clustering algorithms and parameters on the ability to identify genes that function in the same pathways. Finally, the biological relevance of cluster membership was validated using an independent phenomics dataset. Overall, we demonstrated that 21 optimizing the use of co-expression based approaches requires considerations of the pathway of interest, expression dataset and clustering algorithm. 2.3 Results and discussion 2.3.1 The extent to which genes in pathways have correlated transcript profiles To evaluate the extent to which genes with similar expression patterns have similar functions, we asked whether genes in the same A. thaliana metabolic pathway were co-expressed (see Methods). To address this question, Pearson Correlation Coefficients (PCCs) between genes in each of the 382 A. thaliana metabolic pathways in AraCyc were calculated using an expression dataset consisting of 16 different environmental conditions (referred to as the stress dataset [32]) (Figure 2.1A). To broadly examine groups of functionally related genes in addition to metabolic pathways, we also calculated PCCs between genes in each of the 1,710 A. thaliana Gene Ontology Biological Process (GO-BP). A group of genes in an AraCyc pathway or a GO-BP is referred to as a “functional category”. The median PCC values were <0.1 for ~60% of functional categories, suggesting that many genes in the same pathway have dissimilar transcript profiles under stress conditions. To assess statistical significance and control for false positive expression correlation, the PCC values of pairs of genes in the same functional category were compared to PCC values of random gene pairs (Figure 2.1A). The 95th percentile PCC value of random gene pairs (referred to as PCC95) was 0.41 for the stress dataset. In other words, only 5% of random gene pairs have PCC values >0.41. We used PCC95 as the threshold for calling the expression profiles of a gene pair as significantly positively correlated with a 5% false positive rate. Based on this threshold, only 19% of gene pairs within a functional category have significantly correlated expression patterns. 22 To determine whether some functional categories contain more members with highly correlated expression than others, we adopted the expression coherence (EC) measure, which ranges from 0 to 1 [25]. Here the "pathway EC" is defined as the proportion of pairs of genes in a pathway or GO category that have significantly correlated transcript profiles. Note that the median ECs in A. thaliana are only 0.11 for GO-BPs and 0.14 for AraCyc pathways, indicating that 50% of the functional categories have <11-14% gene pairs with significant expression correlation. Consistent with an earlier study [25], we found that genes in functional categories generally have higher ECs than groups consisted of randomly selected genes (Mann-Whitney test, p <2.2e-16; Figure 2.1B). In particular, 36% of the AraCyc pathways have higher EC values than the 95th percentile of the random EC distribution (Figure 2.1B); these are defined as “high EC pathways”. Similarly, 32% of the GO-BPs have higher EC values than the 95th percentile of the random EC distribution (referred to as “high EC GO-BPs”). One explanation for the slightly higher number of high EC pathways than that of high EC GO-BPs may be because metabolism related pathways tend to have a more highly coordinated transcriptional regulation compared to other types of functional categories. Consistent with this notion, GO-BP categories related to metabolism, including metabolic pathways (GO:0008152) and its child terms, have higher median ECs (0.14 and 0.13) compared to signal transduction (GO:0007165, EC=0.11), cell-cycle (GO:0007049, EC=0.10) and response to stress (GO:0006950, EC=0.08) categories. Among metabolic GO-BPs, amino acid metabolism pathways (GO:0006520) have the highest median EC (0.21) among the categories we compared (Figure S.2.1). Overall, GO-BPs have lower ECs than AraCyc pathways (Mann-Whitney Test, p=2.41e-03; Figure S.2.1). The ECs for functional categories have a very wide range (Figure 1B, Figure S.2.1). The differences in ECs may be due to technical issues such as functional annotation quality or 23 methodological issues such as the similarity measure used to assess co-expression. The EC differences can also be due to differences in the biological characteristics of pathways, for example, the role of the pathway, presence of common transcriptional regulatory mechanisms, and regulation at levels beyond transcription. Finally, the dataset used to calculate EC could also be a major factor. In the following sections, we assess the factors influencing ECs and identify ways to maximize ECs for functional categories. Considering false positive annotation can have a significant, negative impact in further analyses, we examined features of high EC categories and the impact of multiple factors on ECs by focusing on AraCyc metabolic pathways in the following sections. 24 Figure 2.1 Co-expression of A. thaliana pathway genes under stress. (A) Boxplots of expression correlations (Pearson’s Correlation Coefficient, PCC) between pairs of genes in each A. thaliana metabolic pathway (left sub-figure) and random gene pairs (right sub-figure). The pathways are sorted based on median PCC. Light blue boxes: Interquartile range. Blue line: median PCCs. Red dashed line: the 95th percentile PCC value (PCC95=0.41) of the random gene pair PCC distribution. Black dashed line: the median PCC of the random gene pair PCC distribution. (B) Bar plot indicating ECs for A. thaliana pathways (left sub-figure) and random gene pairs (right sub-figure). The pathways are in the same order as in (A). The insert graph shows the number of pathways that have significantly higher ECs than randomly expected (black) and those that are not significant (white). Different percentile thresholds on the x-axis are based on the random EC distribution (right sub-figure). The red dashed line designates the 95th percentile of the random EC distribution. 25 2.3.2 Influence of annotation on pathway ECs Computational predictions of gene function without experimental evidence can lead to false assignments to pathways, resulting in lower pathway EC values. This is particularly important because computational annotations in the Plant Metabolic Network are based on sequence similarity only [33,34]. Functional annotations made using sequence similarity based methods are estimated to have an error rate of 49% [35] and high sequence similarity does not necessarily lead to co-expression [36] . To determine whether annotation quality is a major factor influencing pathway EC, we separated pathway genes into those with and without experimental evidence. Consistent with the hypothesis that annotation quality can significantly impact pathway EC, pathways with lower ECs tended to have proportionally fewer genes with experimental evidence (PCC=0.20, p=1.53e03; Figure S.2.2A). Pathway ECs calculated using genes with experimental evidence were substantially higher (Mann-Whitney test, p=5.44e-12, median EC=0.26) than those calculated using genes assigned to pathways solely based on computational predictions (median EC=0.10; Figure 2.2A). This is consistent with the hypothesis that some annotations based solely on computational evidence are incorrect. Although annotation quality influences pathway EC, it explains only ~4% of the variance in the median EC of pathways that include genes assigned based on all evidence (computational and experimental, EC=0.14) and pathways that include genes assigned based on computational evidence (EC=0.10). The small increase in co-expressed genes pairs when including experimental evidence is potentially due to the small fraction of genes that have experimental evidence (5,991 genes considering all evidence, 934 genes considering experimental evidence). Nonetheless, because annotation quality did have a measurable impact, only genes with experimental evidence were included in further analyses. 26 Figure 2.2 Relationship between pathway ECs, annotation quality and similarity measures. (A) Relationship between the EC calculated for pathway genes that are annotated based on experimental evidence (ECexp) and EC calculated for pathway genes that are annotated only computational evidence annotations (ECcomp). The genes used to calculate ECexp and ECcomp do not overlap. Each dot represents one pathway. Dashed line: y=x line. (B) Heatmap of correlations between pathway EC percentiles calculated with: partial correlations estimated with the corpcor method, Spearman’s rank correlation coefficient (Spearman), Pearson Correlation Coefficient (PCC), adjusted and normalized Mutual Information (MI), partial correlation calculated with the partialcorr method, and transformed p-values of Bayesian Network (BN) (C) Percent pathways that have high EC using different similarity measures. (D) Heatmap of pathway EC percentiles calculated using different similarity measures. Color represents EC percentiles. White dotted rectangles: high EC pathways that are specific to one measure. 27 2.3.3 Influence of the similarity measure used to assess EC In addition to gene annotation quality, similarity measure used to assess gene co-expression could impact pathway EC. Although PCC is among the most widely used similarity measures in co-expression studies, it does not deal with non-linear relationship well as other similarity measures including Spearman’s rank correlation coefficient and mutual information (MI). Another consideration is that, all three similarity measures above consider only pairwise correlations, thus higher order correlations due to the influence of the other genes in the network are not considered. To assess the influence of higher order correlation, we also evaluated two approaches: (1) partial correlation, where the correlation between genes was calculated after controlling the effects of other genes and (2) a graph model-based approach such as Bayesian Network (BN) where the strength of connection of a gene pair is determined by considering all genes in a network. To assess the impact of potential non-collinearity and higher order correlations, we first calculated pathway ECs with seven different similarity measures including PCC, Spearman’s rank, two partial correlation methods (corpcor and partialcorr), adjusted and normalized MI, and transformed pvalue of arc strength in a pathway BN (see 2.5 Methods). To assess the statistical significance of EC values and control for false positive ECs, EC values were calculated with randomly chosen gene pairs for each pathway size and for each similarity measure. Thus, for each pathway size and measure, a random EC distribution is available and used to determine the percentile value of a pathway EC (referred to as "EC percentile”). Thus, a high EC percentile indicates reduced probability that the observed pathway EC is spurious. First, we asked if the pathway EC percentiles are correlated among different measures (Figure 2.2B). For example, PCC were significantly positively correlated with, in order of diminishing degrees of correlations, corpcor (PCC=0.80, p=2.35e-50), Spearman’s rank 28 coefficient (PCC=0.78, p=1.67e-46), adjusted MI (PCC=0.53, p=5.23e-18), partialcorr (PCC=0.39, p=1.97e-09), BN (PCC=0.30, p=5.35e-06), and normalized MI (PCC=0.17, p=1.23e02). Given the degrees of correlations in EC percentiles differ widely between measures, the similarity measures have significant impact on pathway ECs. Consistent with this notion, the number of pathways with ECs that are significantly higher than randomly expected (high EC pathways, >95th percentile of the random EC distribution) vary widely depending on the similarity measure (Figure 2.2C). Among the measures, corpcor, PCC and Spearman’s rank allowed the highest numbers of high EC pathways to be identified. This finding is consistent with the finding of a recent study examining PCC, Spearman’s rank coefficient, MI, and other similarity measures [27]. Only five of the pathways have high ECs consistently regardless of similarity measures (Figure 2.2D). Importantly, consistent the idea that non-linearity and higher order correlations can be important, the ECs of some pathways are only significant if a particular similarity measure is used (white box, Figure 2.2D). Notably, 17 and 10 pathways have high ECs only when the corpcor method and the BN-based measure were used, respectively (Figure 2.2D), illustrating the importance of higher order correlations. In addition, different methods of calculating partial correlations led to significant differences in high EC pathway recovery. As the corpcor method was optimized for genomic data analysis [37], it is not surprising that the results from corpcor is more informative. For further analyses, we continue with PCC as the measure the calculate gene co-expression as it is one of the widely-used similarity measure and, along with Spearman’s rank and corpcor, uncover the highest numbers of high EC pathways. 2.3.4 Influence of biological factors on pathway EC Next, we explored biological factors that may influence pathway EC, including pathway size (the number of genes assigned to the pathway), subcellular location, pathway gene function, 29 and evidence of co-regulation. We hypothesized that a pathway with a larger number of genes might have relatively more complicated modes of regulation beyond transcription, leading to low pathway ECs. In addition, gene products with similar functions tend to be co-localized and may be coordinately regulated [38], as in the case for photosynthesis and other chloroplast-related pathways [39]. However, pathway gene number was not significantly correlated with pathway EC (PCC=-0.03, p=0.67; Figure 2.3A), and pathway gene product subcellular location was not associated with pathway EC (Figure 2.3B). To assess whether the general biological functions of a pathway contribute to differences in EC between pathways, we compared EC between five general pathway categories including generation of precursor metabolites and energy, biosynthesis, degradation. However, the significance of enrichment of these general categories was only marginal (Mann-Whitney Test, p=0.05; Figure 2.3C). Interestingly, although the expression of gene pairs in the general category of generation of precursor metabolites and energy is not always significantly coherent, the specific pathways within—photosynthesis light reactions, chlorophyllide a biosynthesis I and aerobic respiration—had significantly higher ECs compared to random pathways (99th percentile of pathway EC distribution). This finding suggests that EC, and more generally co-expression, is more relevant to more detailed levels of the functional classification hierarchy. Transcriptional regulation is another major factor that could influence pathway EC. Genes that are co-regulated could have similar transcript profiles, and the differences in the degree of coregulation may explain differences in pathway EC. To determine the extent of co-regulation, we asked how the presence of cis-regulatory elements differs among pathways. It is expected that pathway genes with similar sets of cis-elements in their promoters would have similar expression patterns and thus contribute to high pathway EC. We mapped 349 transcription factor binding 30 motifs [40] to the promoters of all A. thaliana genes, and identified motifs that were overrepresented in the promoters of pathway genes taking each pathway separately and comparing to all other genes. A total of 40 overrepresented motifs were found for 17 pathways. However, there was no significant difference in EC between pathways with and without overrepresented motif sites (Mann-Whitney Test, p=0.66; Figure S.2.2B). This was surprising given that the 349 motif dataset spans essentially all known A. thaliana transcription factor families, and transcription factors from the same family tend to have similar binding motifs [40]. Thus, the reason why high EC pathway genes do not necessarily have more shared motifs (Figure S.2.2B) is not simply due to unknown transcription factor binding sites. This finding can also be due to complex interactions between binding sites, nucleosome positioning and other DNA properties [41]. We also evaluated post-transcriptional regulation by miRNA, but did not find a significant difference in EC between pathways with miRNA target genes and those that did not (Mann-Whitney Test, p=0.31; Figure S.2.2C). Given the dearth of genome-wide post-transcriptional and other levels of regulatory data in plants, it remains to be resolved if post-transcriptional regulation contributes to a lower pathway EC. 31 Figure 2.3 Impact of pathway size and other factors on EC. (A) Relationship between ECexp of a pathway and pathway size (the number of genes assigned to a pathway). (B) ECexp value distribution for pathway genes with products that have subcellular location annotations. PM: Plasma membrane (C) ECexp value distribution for different pathway classes (general pathway categories). (D) Datasets used to determine pathway ECs. A “+” indicates that the dataset in question was used (either individually or in combination) for the analyses depicted by bar graphs in (E) and (F). The columns in (D) correspond to those in (E) and (F). (E) The 95th percentile PCC values (PCC95) in the null distributions for each dataset or combination of datasets. PCC95 of combined datasets (stress fold change and light (L)+stress (S)+development (D) absolute intensity) are labeled in the bar plot (F) Number of pathways with high EC for each dataset and or combination of datasets. Green: fold change values were used to calculate ECs. Orange: absolute intensity values were used for calculating ECs. 32 2.3.5 Impact of datasets used to evaluate pathway EC Among the factors studied ─ the size of the pathway, subcellular location, functions of pathway genes, and evidence of shared transcription factor binding sites ─ none significantly impact pathway EC. We next asked whether the expression dataset has a major impact on whether the EC for a pathway is high or low. The analyses described so far were performed using an environmental stress dataset consisting of 112 experiments including biotic and abiotic stress treatments in shoot and root [32]. Low pathway EC values could reflect the fact that pathways are only relevant to one type of stress (biotic or abiotic) and a large compiled dataset fails to capture the underlying patterns of co-expression. To address this possibility, we first calculated the random gene pair correlations for three subsets of the environmental stress gene expression dataset: shoot abiotic, shoot biotic, and root abiotic. PCC95 values were higher for subsets (PCC95=0.51 - 0.60) of the stress dataset than for the entire dataset (PCC95=0.41; Figure 2.3D and E), indicating that the difference in gene expression between experiments within a dataset, i.e. data heterogeneity, was lower when the samples were divided into biologically relevant subsets. Consistent with this, the average sample correlation within each of the shoot biotic, shoot abiotic, and root abiotic subsets is higher (0.15, 0.19, and 0.46, respectively) than the entire environmental stress dataset (0.13, Mann-Whitney Test using all pairwise sample PCCs, p=8.73e-26, 2.10e-05, 1.93e-144). Due to the impact of data heterogeneity, fewer high EC pathways tend to be recovered from individual stress datasets compared to combined datasets (Figure 2.3F). To test whether these findings are specific to the environmental stress data, an additional four expression datasets were analyzed (development, light, hormone, and diurnal; Figure 2.3D). We found that the threshold PCC95 values of these datasets were significantly negatively correlated with the number of high EC pathways (PCC=0.97, p=1.10e-08). Thus, because PCC95 33 is negatively correlated with data heterogeneity (as discussed in the previous section), higher data heterogeneity likely allows more co-expressed pathway genes to be recovered. Data heterogeneity can be influenced by which datasets are combined and how the expression data are processed and transformed. Combining datasets tends to increase data heterogeneity and thus leads to a better recovery of pathway genes based on co-expression (Figure 2.3F). Dataset processing also has an effect on data heterogeneity. For example, datasets that were processed to obtain fold change values had a substantially lower PCC95 (median PCC95 of fold change datasets=0.41; Figure 2.3E) than that of the absolute intensity dataset (median PCC95 of intensity datasets=0.76; Figure 2.3E), although this was not true for the hormone dataset (Figure 2.3E). Taken together, these results reveal that dataset transformation approaches and nature of the expression dataset impact the threshold for defining significant co-expression and thus significantly shapes pathway EC. 2.3.6 Influence of individual vs. combined stress datasets on pathway EC A wide range (5%-53%) of pathways have significantly high ECs depending on the dataset used (Figure 2.3F). This pattern led us to question whether some datasets were more informative than others in recovering specific pathways. To assess this, pathway EC percentiles were calculated for each dataset separately (Figure S.2.3A). Note that for each expression dataset analyzed, we picked half a million pairs of randomly chosen genes from a total of ~22,000 to establish background correlations and selecting the correlation threshold at the 95th percentile of the random correlation distribution. Because dataset heterogeneity influenced the threshold values used to determine gene co-expression (Figure 2.3E), we first asked whether larger, combined stress datasets were more informative (i.e. had higher pathway EC percentiles) compared to smaller, individual datasets (Figure 2.4A; Figure S.2.3B and C). The combined stress dataset had a higher median EC percentile (95.6) compared to the individual datasets (89.5-89.9). For example, 34 the monoterpene biosynthetic pathway had an EC percentile of 99.6 based on the combined stress dataset, but the values ranged from 26.3 to 89.9 for individual datasets. By contrast, in >14% of the pathways, the EC percentiles determined with the individual datasets were higher than those based on the combined dataset (Figure 2.4A; Figure S.2.3B and C). For example, the lipid dependent phytate biosynthesis I pathway had an EC percentile of 99.5 when the root abiotic stress dataset was used compared with EC percentiles<27 for all other individual and combined datasets. Another example is the cuticular wax biosynthetic pathway, which had an EC percentile of 99.7 calculated from the shoot abiotic stress data, but had EC percentiles of 26.4 and 26.6 when root abiotic and shoot biotic stress datasets were used, respectively. This is consistent with the role of cuticular wax in protecting the shoot from drought and other stresses [42,43] and the co-regulation of its biosynthetic genes [44]. Similarly, indole-3acetic acid (IAA) degradation genes have EC percentiles of 99.9 and 26.3 using root and shoot abiotic stress datasets, respectively, consistent with the finding that IAA degradation products have been mainly detected in roots [45,46]. These findings lead to the conclusion that EC among genes in the same pathway is strongly influenced by whether individual or combined stress datasets are used, particularly if the pathway in question is biologically relevant to the experimental conditions of the dataset. Thus, it is important to test multiple individual and combined datasets for finding the optimal EC for a pathway. It should be noted that, while high EC pathways can be recovered with individual datasets, the smaller numbers of samples in individual dataset have less power in detecting coexpression. This is because we have included randomized background information for different sized datasets in calculating threshold pairwise similarities for determining EC and in calculating threshold EC values for identifying pathways with significantly high ECs. A smaller dataset where 35 spurious correlations are expected will have a correspondingly higher threshold because the correlations between randomized gene pairs will be higher. 36 Figure 2.4 Impact of datasets on pathway EC Percentile. (A) Relationship between pathway EC percentiles calculated using the combined stress gene expression dataset and those calculated based on individual stress dataset, abiotic/shoot. (B) Relationship between pathway EC percentiles calculated using the light, development, and stress combined dataset and those calculated based on individual dataset, stress. In (A) and (B) the dashed line represents y =x, and each dot represents a pathway. (C) Individual and combinations of datasets used to determine pathway EC Percentiles. 37 Figure 2.4 (cont’d) *: NASCArray consisting of all the datasets listed here as well as additional datasets (~700 samples). The columns in (C) correspond to those in (D) and (E). (D) Bar plot of percent high EC pathways using different expression datasets (E) Heat map of pathway EC percentiles from 13 gene expression datasets. Dark red: EC percentiles≥ 95. Orange: 95 > EC percentiles < 75. Yellow: 75 > EC percentiles <50, Blue: 50 > EC percentiles < 0 (F) Histogram of the numbers of datasets leading high EC values for each pathway. Example pathways are labeled with an arrow. 2.3.7 Robustness in recovering pathway genes when using different datasets To determine whether the conclusion that EC is strongly influenced by stress (S) datasets is generalizable to non-stress ones, we further increased the dataset size by including light (L) and developmental series (D). We found that when using dataset L, S, D, and combined (L+S+D) datasets, 12, 46, 81, and 96 pathways had significantly higher than expected EC, respectively. Although the combined dataset was the best for uncovering more pathways, the EC percentiles were higher for some pathways when individual datasets were used (Figure 2.4B; Figure S.2.3D and E). Two interesting examples are the trans-zeatin biosynthesis and the iron reduction/absorption pathways. These pathways only had significantly high EC when using the light dataset. Fluctuations in light conditions can alter the expression of trans-zeatin biosynthesis genes [47]. In addition, iron is a central component of chlorophyll. One iron reduction gene, FRO6, contains multiple light-responsive elements, and another, iron reduction gene FRO7, has an expression pattern similar to FRO6 [48,49]. Consistent with our discussion on the impact of individual and combined datasets in the previous section, these findings indicate that dataset choice 38 impacts the optimal recovery of pathway genes. Next, we asked how data transformation impacts pathway EC percentile. The EC percentiles determined from fold change and absolute intensity were significantly positively correlated for the stress (PCC=0.38, p=4.90e-9) and hormone (PCC=0.57, p=1.17e-20; Figure S.2.3F and G) datasets. Despite these significant correlations, data transformation still resulted in a >50 percentile difference in EC for 27% and 12% of pathways using stress and hormone datasets respectively. Based on our results, it is important to test datasets according to the pathway of interest, but do more expression data samples necessarily lead to better pathway recovery? To answer this question, we compared pathway EC percentiles across 12 individual and combined datasets (Figure 2.4C). We found that stress dataset had the highest percentage of high EC pathways (53%) recovery rate among larger, combined datasets analyzed (Figure 2.4C). To further assess whether using a much more inclusive, more conditionally independent dataset compared to the 12 datasets we used, would increase the recovery rate of high EC pathways, we analyzed NASCArrays dataset with >700 samples [50]. We found that 24% of the pathways had high EC with the NASCArray dataset. This recovery rate was lower compared to a much smaller dataset such as the stress set, where 53% of pathways had high ECs (Figure 2.4D). Thus more is not necessarily better. This is because the overlap in within and between pathway expression correlations was larger when NASCArray dataset was used compared to stress dataset (Figure S.2.4A and B), indicating that it was harder to distinguish within and between pathway gene pairs using the NASCArray data. Next, we asked if some pathways have significantly high EC regardless of the dataset used (i.e. are robust). Among pathways, 180 had significantly high EC in >1 datasets (Figure 2.4E), but photosynthesis light reactions was the only pathway that had significantly high EC in all datasets. This is consistent with earlier findings that light reaction genes are tightly co-regulated 39 [51]. In addition to photosynthesis light reactions, jasmonic acid biosynthesis, aliphatic glucosinolate biosynthesis side chain elongation cycle, fatty acid elongation, palmitate biosynthesis II and chlorophyll a degradation II were also among the most robust pathways in terms of EC. On the other end of the spectrum, 15% of the 179 pathways with significant EC had significantly high EC in only one dataset (e.g. phenylalanine degradation; Figure 2.4F), further indicating the importance of dataset selection for co-expression associations with unknown genes. In addition, 21% of the pathways (e.g. ammonia assimilation cycle; Figure 2.4F) did not have significant EC regardless of the dataset used; indicating that additional datasets may be required and/or these pathways are mainly regulated at levels beyond transcription. Given that many pathways had significant EC when a particular dataset was used, we asked how many individual datasets are required to recover the 180 pathways with significant ECs. Interestingly, when datasets are included one at a time, the number of pathways with significantly high EC initially increased but appeared to be saturated after the addition of 11 datasets (Figure S.2.3H). Taken together, although genes within pathways can have similar expression patterns, this similarity is best recovered after experimenting with a number of different individual and combinations of datasets as well as with data transformations. In addition, although data heterogeneity increases the number of pathway genes that can be recovered, combining datasets is not necessarily the best approach for all pathways. Comparing to 5-53% high EC pathways that can be discovered when datasets are used individually, combining the analysis results of the individual datasets led to the finding that 80% pathways have high ECs. 40 2.3.8 Clusters as predictive units of pathways Clustering genes based on similar expression profiles is commonly performed to find genes that are functionally related [52]. In the best-case scenario, most of the genes in a pathway would be in the same cluster, and the remaining genes in the cluster could be tested to see if they have functions similar to the pathway genes. To evaluate the extent clustering would give us this best case scenario, we first employed one of the most widely used clustering algorithms, k-means, to group ~22,000 genes in the stress gene expression dataset. To determine the optimal k, there are multiple proposed statistical methods including Bayesian Information Criterion (BIC) [53], gap statistic based on the elbow plot [54], and silhouette score [55]. Although these measures have been successfully implemented in simulated datasets where the grouping is apparent [56], there is no best method in determining the number of natural groups of the high-throughput genomics data and often researchers have to try multiple number of clustering results [57,58]. In our initial analysis, we used elbow plot to define k. We computed within cluster sum-of squares for a range of k values starting from 5 clusters and going up to 2000 (Figure S.2.5A). Even though there is no clear elbow point, the decrease in the within sum of squares was apparent when k=100 which was used for k-means clustering. Once the 100 clusters were obtained, over-representation analysis was used to assess how well pathway and cluster membership coincide and an over-representation score was defined (see 2.5 Methods). Clusters with significant over-representation scores (q <0.05) were analyzed further (Figure 2.5). Our expectation for an ideal clustering result was a low q-value (~0). Only 30% of the pathways were found to be over-epresented in >1 cluster, and 38% pathways had an over-representation score < 2 (0.01 < q < 0.05). 41 Figure 2.5 Performance of clusters in predicting pathways. (A) Histogram of the maximum scores (-log(q)) for over-representation of pathways within clusters. (B) Histogram of the maximum F measures for prediction of pathway membership based on cluster membership. (C) Relationship between precision and recall for clusters. In (A-C), clusters were generated using kmeans with k=100. (D) Heat map of over-representation scores obtained from different individual and combined clustering algorithms (top) and cluster numbers (bottom) Color represents overrepresentation scores (-log(q)) from 0 to 12. 42 Figure 2.5 (cont’d) Scores less than 1.3 are indicated by dark blue. Scores more than 1.3 are represented by a spectrum of light blue to red. Pathways in the heat map are sorted based on the number of times that they are over-represented in the clusters, high to low. (E) Bar plot showing the difference between overall maximum over-representation score ─ the highest score from any single cluster ─ and the over-representation score from clusters generated using k-means, k=100 for each pathway. (F) Bar plot showing the difference between the overall maximum F measure ─ the highest score from any single cluster ─ and the F measure from clusters generated using k-means, k=100 for each pathway. (G) Bar plot showing the difference between maximum Precision ─ the highest score from any single cluster ─ and the Precision from clusters generated using k-means, k=100 from each pathway. Arrow: performance values for the leucine degradation pathway As significance alone does not tell us to what extent each cluster is informative in finding additional genes associated with the pathway of interest, we evaluated each clustering result as a prediction problem, where a gene’s membership in a cluster is used to predict its membership in a particular pathway. The performance of the clustering results was evaluated using the F measure, which is the harmonic mean of Precision and Recall. Here precision is the proportion of the number of genes that overlap between a cluster and a pathway to the number of genes in the cluster. Recall is the proportion of the number of genes that overlap between a pathway and a cluster to the number of genes in the pathway. F measures can range from 0 and 1 and higher F measures suggest that both Precision and Recall are high. Precision, Recall and F measures were calculated for every pathway-cluster combination when there was a significant enrichment (q <0.05; Figure 2.5B and 43 C). We expected high Precision (~1) for the most informative clusters, but the highest precision among cluster-pathway combinations was 0.11. In one cluster, 11% of the genes belong to the “glucosinolate biosynthesis from the tryptophan pathway”. The same cluster also yielded the highest F measure (0.18). This result suggests that there is a need to improve this clustering result, potentially by using different clustering algorithms and parameters that is explored further in the next section. 2.3.9 Impact of clustering algorithms and parameters on the identification of pathway genes In the analyses described so far, we used only one clustering algorithm (k-means) and fixed parameters (Euclidean distance, k=100). Next, we assessed how additional clustering algorithms and clustering parameters (number of clusters defined, distance measure, and number of runs) impact the identification of co-expressed gene clusters and how this in turn impacts the identification of genes with similar functions. Five algorithms were applied to the stress expression dataset using different parameters including number of clusters (k), consistency among runs, and other algorithm-specific parameters, to obtain 366 different clustering results. Although some of the algorithms (k-means, approximate kernel k-means, c-means) often yield local optima instead of an overall best result, clustering runs with the same algorithm and parameters gave very similar results (average PCC among 10 runs=0.8 - 1.0). Therefore, only the maximum over-representation score from 10 runs is shown (Figure 2.5D). We found that the choice of k is important; regardless of the algorithm, smaller k values resulted in low over-representation scores (Figure 2.5D) and a smaller number of pathways overrepresented among clusters (Figure S.2.5B). This is likely due to the fact that smaller k values lead to larger sized clusters that contain genes from multiple pathways. We also found that the number 44 of members in a cluster that overlap with members of a pathway differs depending on the algorithm used; k-means was the best performing algorithm, followed by approximate kernel k-means and hierarchical clustering with the Ward algorithm (Figure S.2.5B). Overall, with all clustering methods combined, we were able to recover 131 pathways out of 225 (64 more pathways than when only k-means, k=100 was used). In contrast, 95 out of 225 pathways were not over represented in any of the clusters, and 22 pathways were only over-represented in one algorithmparameter combination (Figure S.2.5C). Taken together, the clustering approach is not deterministic; the parameters used influence co-expression associations. Therefore, it is important to evaluate multiple algorithms and parameters to recover pathways of interest. Multiple algorithm-parameter combinations were examined (e.g. an example combination: k-means, k=100), to quantitatively assess the degree of improvement in performance measures. First, clusters from 69 algorithm-parameter combinations were generated (Figure 2.5D). For each pathway, we asked what the maximum over-representation score was among the clusters from all combinations. This maximum score was then compared to the over-representation score of clustering results from our standard method discussed above (k-means, k=100; Figure 2.5E). We found that the over-representation scores of the best clusters were increased by an average of 1.40 (25-fold better q-value) compared to the score when only one algorithm/parameter was used. We also evaluated clustering performance using F measure (improved by an average of 0.15; Figure 2.5F) and Precision (improved by an average of 0.20; Figure 2.5G). These results reinforce the importance of considering multiple algorithms and parameters to maximize pathway-cluster overlap. Furthermore, for algorithms requiring a predefined k, the k value may be different depending on the pathway one would like to recover and it is necessary to try out multiple values 45 for the best results. Thus, selecting a presumably optimal k may yield a more natural grouping of the entire dataset but at the expense of uncovering clusters representing individual pathways. We should emphasize that, although considering multiple clustering parameters allow recovery of 93 pathways, there are still 96 pathways that were not recovered by the five algorithms used in this study (Figure S.2.6). This may be because genes in these pathways do not have highly coordinated expression patterns and have low pathway ECs. Consistent with this interpretation, high EC pathways tend to be recovered by clustering compared to low EC ones (Fishers exact test, p=4.56E-12; Figure S.2.6). We should also emphasize that the scores used to assess the clustering performance ignore the possibility that some genes in the clusters will be novel pathway components. The presence of these genes reduces the over-representation score, precision, and Fmeasure. These novel pathway component genes are prime candidates for further functional characterization using genetic or biochemical analysis. 2.3.10 Using leucine degradation gene phenomics data to validate co-expression associations We established that the degree of gene co-expression in some pathways is influenced by dataset and data transformation and that it is important to use multiple algorithms and parameters when identifying clusters based on co-expression. To demonstrate that novel pathway components can in fact be recovered as a result, we used phenomics data to validate novel gene components of the leucine degradation pathway [59,60]. We chose to focus on leucine degradation because it is among the most over-represented pathways in co-expression clusters (Figure 2.5), and many components of the leucine degradation network remain to be discovered in plants [61,62]. Eighteen novel genes that were not annotated to leucine degradation in the AraCyc database are consistently found in clusters (≥10 clustering results) that are over-represented with 12 annotated leucine 46 degradation genes. Among these genes, AT1G55510, a branched-chain alpha-keto acid decarboxylase E1 beta subunit, was recently shown to be involved in leucine degradation [61] but has not yet been annotated as such. The fact that AT1G55510 is consistently found in the same clusters as leucine pathway genes prompted us to examine the rest of the genes that cluster with leucine degradation genes for involvement in leucine degradation. We hypothesized that previously unknown associations deduced from co-expression clusters could be verified based on their mutant phenotype data. To test this hypothesis, we used a published phenomics dataset that includes free seed leucine levels for mutants in more than 5,000 genes [60] (Figure 2.6A). The free leucine levels (nmol/g fresh weight) of leucine degradation gene mutants are expected to be more similar to genes within the same cluster than to wild type plants or randomly chosen mutants. As expected, the leucine degradation enzyme genes had higher leucine levels than mutants in random genes and wild-type plants (p=0.05 and 0.04 respectively; Figure 2.6B). Next we evaluated the clusters that were over-represented with leucine degradation genes by calculating the log ratio between the proportion of leucine degradation genes in a cluster to the proportion of non-leucine degradation genes in the same cluster. Note that, as k increases, the log ratio tends to increase (Figure 2.6C). This trend is potentially due to increased statistical power to identify over-representation in smaller sized clusters. Among these clusters, hierarchical clustering with the Ward algorithm (k=100 and k=200) and approximate kernel k-means (k=50, k=400 and k=500) yielded clusters that had genes (Figure 2.6D) whose leucine levels were significantly higher than the wild type measurements (p=0.01-0.05; Figure 6E). Thus, some genes in those co-expressed clusters are likely involved in leucine degradation. Nonetheless, the differences in leucine levels between mutants of genes in the cluster and wild type plants were small (Figure 2.6E). This may be due to the fact that some co-expressed genes are false positives. 47 However, some known leucine degradation pathway gene mutants also do not have dramatic differences in leucine level compared to wild type (Figure 2.6B) and this may also explain the small effect size. We next asked whether a gene that consistently clusters with leucine degradation genes ─ regardless of the algorithm and parameters used ─ tends to be a better pathway gene candidate than one that does not. Mutants in genes that were retrieved from three separate clustering results had significantly higher leucine levels than mutants in random genes and wildtype plants (p=0.03 and 2.60e-3 respectively; Figure 2.6F), indicating that consistency may serve as a criterion to increase confidence in candidate genes. 48 Figure 2.6 Assessing the validity of co-expression associations with leucine measurement data. (A) Log2 of the number leucine degradation (LeuDeg) gene mutants, random gene mutants, and WT control plants that were included in the analysis of leucine levels. (B) The absolute leucine levels (nM/gFW) in the same three types of genetic background as in (A). (C) Log-odds values (log ratio between the proportion of leucine degradation genes in a cluster to the proportion of nonleucine degradation genes in the same cluster) of clusters that are enriched in leucine degradation genes identified using different algorithm-size parameter combinations. 49 Figure 2.6 (cont’d) (D) Log2 of the number of genes that cluster with leucine degradation pathway genes (overrepresentation score >1.3) for each algorithm-size parameter combination. (E) Box plot showing the absolute seed leucine levels (nM/gFW) in plants with T-DNA insertions in genes clustered with leucine degradation pathway genes (enrichment score >1.3) for each algorithm-cluster size parameter combination. *: the groups of genes where mutant leucine levels are significantly greater than in wild type. (F) The absolute seed leucine levels (nM/gFW) of T-DNA insertion mutants of genes that cluster with leucine degradation genes. Binning in x-axis depends on the number of times that each gene clusters with leucine degradation genes considering all clustering results. 2.4 Conclusion A large number of high-throughput omics data are accumulating. Of these, transcriptome data are the most abundant, covering multiple tissues and conditions, and have been widely used to generate hypotheses about gene functions. Since almost the first microarray studies, researchers have used the guilt-by-association approach to make useful predictions about gene functions. This approach is based on the hypothesis that genes encoding proteins of shared function are more likely to have common features such as gene expression patterns. Here we show that even though this approach is useful, there are many limitations to co-expression-based functional inferences and that these limitations can be potentially overcome through methodological considerations that include pathway gene annotation quality, expression dataset used, clustering algorithms, and the use of an independent dataset such as the mutant phenotype data used here to maximize the utility of co-expression relationships in hypothesizing gene functional relations. 50 By evaluating within-pathway gene expression correlation based on the EC measure, we show that genes encoding proteins involved in the same pathway do not necessarily co-express. For example, only 5% of pathways have significantly high EC, using a light treatment dataset. For the remaining 95% of pathways, pathway genes may not be coordinately expressed and/or the light dataset is not informative. For some pathways, co-expression will be ill-suited due to gene sharing among pathways (thus multiple mode of regulation), requirement for condition-specific expression data that is not available, and/or that coordinated regulation of the pathway is at a level beyond transcription. In other situations, several approaches could be taken to improve the recovery of pathways with high EC. By filtering genes based on annotation, it might be possible to obtain a core set of genes that are co-expressed. In addition, using expression datasets of different type (e.g. treatment and/or tissue types), complexity (e.g. individual or combined), and transformation method (e.g. fold change or absolute intensity value) could be effective. In this study, we have demonstrated that clustering algorithms and parameters impact the ability to find novel pathway genes. Thus, by relying on a single algorithm and a single parameter ─ as is most commonly done in published studies ─ co-expression associations with functional implications might be missed. For any pathway being analyzed it is necessary to find the optimal algorithm and parameters to identify clusters that contain the majority of the known pathway genes. We also demonstrated that using one particular clustering algorithm-parameter combination in most cases does not lead to clusters that have optimal overlaps in gene memberships with pathways. Instead, for the best result, we need to consider multiple algorithms and parameters. The methodological considerations we had in this study reflect the multi-parameter nature of coexpression based analyses. The studies that include co-expression based approaches should 51 involve rigorous testing of multiple variables ranging from the pathway of interest to expression dataset and clustering algorithm. 2.5 Methods 2.5.1 A. thaliana metabolic pathways and pathway features analyzed A. thaliana metabolic pathways (AraCyc pathways), the genes belonging to these pathways and supporting evidence were obtained from the Plant Metabolic Network (version 8, [63]). To examine a broader set of gene function in addition to metabolism, A. thaliana Gene Ontology biological processes (GO-BPs) annotations were obtained from geneontology.org [64]. Only nuclear genes and pathways/processes with >2 genes were included in further analyses. The metabolic pathway genes were divided into two sets based on supporting evidence. The first set contained all pathway genes regardless of the types of evidence supporting the annotations (382 pathways, 5,991 genes). The second set only contained genes with experimental evidence (225 pathways, 934 genes). For the GO data, we examined 1,710 GO-BP terms covering 23,157 genes. To determine if genes of a pathway tend to have a particular subcellular location, subcellular location information was obtained from the SUBcellular Arabidopsis consensus database (SUBAcon [65]), and a contingency table for each pathway and subcellular location was established to calculate the enrichment p-value (Fisher’s Exact Test). The resulting p-values were corrected for multiple testing [66]. To determine whether similar sets of cis-regulatory elements are present among genes in the same pathway, 349 position frequency matrices taken from the CisBP database [40] were converted to position weight matrices (PWMs) based on the A. thaliana background AT and CG frequencies (0.33 and 0.17, respectively) using the Tools for Analysis of MOtifs (TAMO) package MotifTools [67]. The PWMs were used to determine the location of 52 motif sites in the 1kb region upstream of the transcriptional start sites of A. thaliana genes with Motility [68]. To assess the impact of post-transcriptional regulation, we used a dataset with associations between miRNAs and their target genes, downloaded from The Arabidopsis Information Resource-TAIR [69]. 2.5.2 Expression dataset and its processing Six publicly available Affymetrix ATH1 microarray gene expression datasets used in this study include: Biotic stress: GSE5615-5616, Light: GSE5617, Abiotic stress: GSE5620-5628, Development: GSE5629-5634, Hormone: GSE39384, and Diurnal [32,70–72]. In addition to these, ~700 A. thaliana microarray datasets were downloaded from NASCArrays database [50]. The datasets were downloaded in either normalized form [50,73] or as unprocessed data from Gene Expression Omnibus (GEO) [74]. For the unprocessed datasets, the CEL files for the AtGenExpress data [32,70,71] were downloaded from TAIR [69] and quantile normalized using the Bioconductor affy package in R [75]. The Bioconductor LIMMA package [76] was used to calculate fold changes by contrasting treatment and control experiments, and the p–values of significant fold changes were corrected for multiple testing [66]. 2.5.3 Calculation of expression correlation and expression coherence To generate the null expression correlation distribution, 500,000 gene pairs were randomly selected and their Pearson Correlation Coefficients (PCCs) were calculated using the SciPy library [77]. The 95th percentile PCC values (PCC95) in the null distributions were used as thresholds for calling the expression patterns of two genes as significantly correlated with a 5% false positive rate. Using the PCC95 values, the expression coherence (EC) score was calculated to determine the extent of co-expression among genes in a given pathway [24,25]. The EC score of a pathway 53 is the ratio of the number of gene pairs with PCC values higher than PCC95 and the total number of gene pairs in a pathway. Thus EC values range from 0 (no gene pair with significant expression correlation) to 1 (all gene pairs significantly co-expressed). To identify pathways with significantly higher than randomly expected ECs (high EC pathways), pathway-gene associations were randomized 100 times with the sizes of the pathways kept the same. For each dataset, a distribution of randomly expected EC values was established. For a given dataset, a pathway was defined as a high EC pathway if it had an EC score larger than the 95 percentile value of the null EC distribution. The percentile of the pathway ECs in the null EC distribution was referred to as EC percentile. To assess how similar the gene expression profiles among array experiments in a dataset, the PCC values between the experiments in a dataset were calculated and the median PCC value was used as a measure of homogeneity among the experiments within a dataset. To evaluate the impact of similarity measures, Spearman’s rank coefficient [77], partial correlation [37] and Mutual Information (MI) [78] were used as additional similarity measures to determine pathway EC in the same way as PCC was used. Partial correlations of pathway genes were calculated with two methods: (1) a Python implementation of partialcorr function in MATLAB, which determined the correlations between residuals of linear regression, and (2) the R package corpcor that was optimized for genomic datasets [37]. MI was calculated both as normalized and adjusted with the Python scikit-learn package [78]. The adjusted MI measure accounts for impacts of sample sizes (larger samples might lead to higher MI) and the normalized MI values was calculated by scaling MI values to between 0 and 1. To explore higher order correlations in addition to pairwise ones, Bayesian Networks (BNs) were constructed for each pathway using the bnlearn package in R [79]. Hill-climbing algorithm was used to construct BNs with options for continuous data. The transformed p-values (-log(p)) of arc strengths between 54 nodes (genes) in BNs were used as measures of gene association strengths that are used similarly as pairwise similarity. Only the transformed p-values were used because they were nearly perfectly correlated with arc strengths (r2=0.9998). BNs were also constructed for randomized pathways to determine threshold p-values for each gene association and the thresholds were then applied to determine how many gene pairs in each pathway have above threshold arc strength p-values to determine pathway EC. 2.5.4 Co-expression clustering To determine the impact of the clustering algorithm on the resulting co-expressed gene clusters, we tested k-means [80], hierarchical clustering (hclust), c-means [81] and Weighted Gene Coexpression Network Analysis (WGCNA) [82] in the R environment and approximate kernel kmeans [83] in MATLAB. Clustering parameters tested included the numbers of clusters (k), distance measures, and hierarchical clustering algorithms for relevant methods. Initially, we attempted to obtain the optimal k for clustering the stress expression dataset by obtaining the “elbow plot”. After testing 11 k values ranging from 5 to 2000 to assess, we realized that the selected k was not necessarily the best and the choice of k impacts clustering memberships of genes. For distance-based algorithms, three distance measures (Euclidean, radial basis function kernel and 1-PCC) were tested. For hierarchical clustering, we also explored the impact of average, complete and Ward linkage algorithms. For WGCNA, the pickSoftThreshold function was used to determine the ß values based on the scale-free topology model [82]. Commonly used clustering algorithms ─ such as k-means ─ are not deterministic, i.e. they may result in a local optimum solution. To evaluate whether multiple runs could result in significantly different results, we ran k-means, approximate kernel k-means and c-means 10 times. We refer to the similarity among 10 55 runs as consistency between the runs. In contrast, for hierarchical clustering and WGCNA, the coexpression cluster membership was always the same for every run. 2.5.5 Assessing the overlap in pathway and cluster memberships Fisher’s exact test was used to assess how well memberships within a cluster overlap with those in a pathway. The resulting p-values were corrected for multiple testing [66]. For each clustering algorithm-parameter combination, an “over-representation score” between a cluster and a pathway was defined as the -log(q) value where a higher score indicates a more significant degree of overlap between cluster and pathway memberships. An over-representation score ≥1.3 (q <0.05) was considered to be statistically significant. To account for the possibility that over-representation of some pathways is spurious we asked how often significant over-representation scores arise from randomized expression data. Specifically, the stress expression dataset was permuted to generate 15 random datasets that were used in k-means clustering (k=5 to 2000, 10 independent runs for each k and each random dataset). The same approach outlined above was also used to assess how well memberships in a pathway overlap with those in a random cluster. Among 1,650 random clusters, none had a significant over-representation score with A. thaliana pathways. To further assess if cluster membership can serve to predict pathway membership, we calculated the F measure (the harmonic mean of precision and recall) for each cluster-pathway combination. Precision is the proportion of correct predictions over total predictions; in our case it was the ratio between the number of genes in a cluster that were also found in a pathway and the total number of genes in that cluster. Recall is the proportion of correct predictions over total true positives; in our case it was the ratio between the number of genes in a cluster that were also found in a pathway and the total number of genes in that pathway. F measure was calculated for each pathway-cluster combination with an over-representation score ≥1.3. 56 2.5.6 Using phenomics data to evaluate co-expression associations Here we used the mutant profile data from Chloroplast 2010, a database consisting of phenotypic screening results for mutants of more than 5,000 genes [59,60] to confirm the potential functional links between genes found in the same co-expression cluster. This database includes measurements of amino acids and fatty acids as well as the chloroplast morphology and photosynthetic parameters. Taking leucine degradation as an example, we expected the leucine content to be more similar between mutants of leucine degradation genes and mutants of genes found in the same co-expression cluster than to wild-type and mutants of random genes. To determine whether this was the case, we retrieved the leucine measurements (in nmol/g fresh weight) of 12 leucine annotated degradation genes, genes that were clustered with pathway genes with an over-representation score >1.3 (q <0.05), 1000 random genes, where homozygous T-DNA insertions were available, and 184 wild type control plants included in the Chloroplast 2010 database. Significant differences between the leucine levels of mutants and controls, included randomly selected mutants of genes that are not in the leucine degradation pathway and wild-type plants, were identified with Mann-Whitney tests. 2.6 Acknowledgements We thank Kevin Childs for his help with WGCNA and advice on clustering, Xi Yin for assistance with approximate kernel k-means in MATLAB, Nicholas Panchy and Zing Tsung-Yeh Tsai for assistance with Bayesian Network, and members of the Last and Shiu lab for helpful discussions. 57 APPENDIX 58 Figure S.2.1 Boxplots of EC values distributions of overall and selected GO-BPs and AraCyc pathways. The order of x-axis is based on the median EC. 59 Figure S.2.2 Factors that potentially influence pathway EC. (A) Relationship between the proportion of genes with experimental evidence and EC. EC is shown on the x-axis. (B) EC calculated for pathway genes that are annotated based on experimental evidence (ECexp) distribution of pathways with and without enriched motifs. (C) ECexp distribution of pathways that have miRNA target genes and of pathways those do not. 60 Figure S.2.3 Randomized pathway EC distributions and EC percentiles from multiple datasets. (A) Left panel: Individual and combinations of datasets used to determine pathway EC percentiles. A “+” indicates that the dataset in question was used (either individually or in combination) for the distributions depicted in Right panel. 61 Figure S.2.3 (cont’d) Right panel: Distribution of randomized pathway EC. Pathway gene membership was randomized 100 times. Median EC per pathway size is shown in this distribution. (B) Relationship between pathway EC percentiles calculated using the combined stress gene expression dataset and those calculated based on individual stress dataset, biotic/shoot. (C) abiotic/shoot. (D) Relationship between pathway EC percentiles calculated using the light, development, and stress combined dataset and those calculated based on individual dataset, development. (E) light. (F) Relationship between pathway EC percentiles calculated using the fold change and absolute intensities for the stress gene expression dataset. (G) Relationship between pathway EC percentiles calculated using fold change and absolute intensities for the hormone gene expression dataset. Dashed line: y=x. Each dot represents a pathway. (H) The change in the number of pathways with high EC (y-axis) with the addition of more expression datasets (x-axis). 62 Figure S.2.4 Distinguishing gene pairs within and between pathways using a conditiondependent dataset and a condition-independent dataset. (A) Distributions of PCC values of within pathway gene pairs (light red) and between pathway gene pairs (light blue) using the condition-dependent, stress dataset. Red line: PCC at the 95th percentile of between pathway gene pair PCC distribution. (B) Same as (A) using the condition-independent, NASCArray dataset. 63 Figure S.2.5 The extent to which metabolic pathways are over-represented in co-expression clusters. (A) Elbow plot showing within cluster sum of squares for k = 5-2000. (B) The number of pathways over-represented (y-axis) in clusters obtained using different algorithms and cluster numbers (x-axis). (C) Distribution of the number of times that a pathway is over-represented in a cluster (sorted high to low). 64 Figure S.2.6 The extent to which high EC pathways are over-represented in clusters. The relationship between the pathway-cluster over-representation score (y-axis) and pathway EC percentile (x-axis). Horizontal red line: over-representation score=1.3, corresponding to adjusted p-value of 0.05. A pathway is considered over-represented if the over-representation score is >1.3. Vertical red line: EC percentile=95. High EC pathway has an EC percentile>95. The insert is the contingency table for testing (Fisher’s exact test) whether the number of high EC pathways is higher among over-represented clusters than randomly expected. 65 REFERENCES 66 REFERENCES 1. Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, et al. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 2012;40: D1202-10. doi:10.1093/nar/gkr1090 2. Rhee SY, Mutwil M. Towards revealing the functions of all genes in plants. Trends Plant Sci. Elsevier Ltd; 2014;19: 212–221. doi:10.1016/j.tplants.2013.10.006 3. Rosa BA, Jasmer DP, Mitreva M. Genome-wide tissue-specific gene expression, coexpression and regulation of co-expressed genes in adult nematode Ascaris suum. Jex AR, editor. PLoS Negl Trop Dis. Public Library of Science; 2014;8: e2678. doi:10.1371/journal.pntd.0002678 4. Provart NJ, Alonso J, Assmann SM, Bergmann D, Brady SM, Brkljacic J, et al. 50 years of Arabidopsis research: highlights and future directions. New Phytol. 2016;209: 921–44. doi:10.1111/nph.13687 5. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genomewide expression patterns. Proc Natl Acad Sci. 1998;95: 14863–14868. doi:10.1073/pnas.95.25.14863 6. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998;9: 3273–97. 7. Jansen R, Greenbaum D, Gerstein M. Relating whole-genome expression data with proteinprotein interactions. Genome Res. 2002;12: 37–46. doi:10.1101/gr.205602 8. Lee HK, Hsu AK, Sajdak J, Qin J, Pavlidis P. Coexpression analysis of human genes across many microarray data sets. Genome Res. 2004;14: 1085–94. doi:10.1101/gr.1910904 9. Hirai MY, Sugiyama K, Sawada Y, Tohge T, Obayashi T, Suzuki A, et al. Omics-based identification of Arabidopsis Myb transcription factors regulating aliphatic glucosinolate biosynthesis. Proc Natl Acad Sci U S A. 2007;104: 6478–83. doi:10.1073/pnas.0611629104 10. Wei H, Persson S, Mehta T, Srinivasasainagendra V, Chen L, Page GP, et al. Transcriptional coordination of the metabolic network in Arabidopsis. Plant Physiol. 2006;142: 762–74. doi:10.1104/pp.106.080358 11. Mentzen WI, Peng J, Ransom N, Nikolau BJ, Wurtele ES. Articulation of three core metabolic processes in Arabidopsis: fatty acid biosynthesis, leucine catabolism and starch metabolism. BMC Plant Biol. 2008;8: 76. doi:10.1186/1471-2229-8-76 67 12. Guttikonda SK, Trupti J, Bisht NC, Chen H, An Y-QC, Pandey S, et al. Whole genome coexpression analysis of soybean cytochrome P450 genes identifies nodulation-specific P450 monooxygenases. BMC Plant Biol. 2010;10: 243. doi:10.1186/1471-2229-10-243 13. Childs KL, Davidson RM, Buell CR. Gene coexpression network analysis as a source of functional annotation for rice genes. PLoS One. 2011;6: e22196. doi:10.1371/journal.pone.0022196 14. Mutwil M, Klie S, Tohge T, Giorgi FM, Wilkins O, Campbell MM, et al. PlaNet: combined sequence and expression comparisons across plant networks derived from seven species. Plant Cell. 2011;23: 895–910. doi:10.1105/tpc.111.083667 15. Han X, Yin L, Xue H. Co-expression analysis identifies CRC and AP1 the regulator of Arabidopsis fatty acid biosynthesis. J Integr Plant Biol. 2012;54: 486–99. doi:10.1111/j.1744-7909.2012.01132.x 16. Wong DCJ, Sweetman C, Ford CM. Annotation of gene function in citrus using gene expression information and co-expression networks. BMC Plant Biol. 2014;14: 186. doi:10.1186/1471-2229-14-186 17. Righetti K, Vu JL, Pelletier S, Vu BL, Glaab E, Lalanne D, et al. Inference of LongevityRelated Genes from a Robust Coexpression Network of Seed Maturation Identifies Regulators Linking Seed Storability to Biotic Defense-Related Pathways. Plant Cell. 2015;27: 2692–708. doi:10.1105/tpc.15.00632 18. Alexeyenko A, Sonnhammer ELL. Global networks of functional coupling in eukaryotes from comprehensive data integration. Genome Res. 2009;19: 1107–16. doi:10.1101/gr.087528.108 19. Lee I, Ambaru B, Thakkar P, Marcotte EM, Rhee SY. Rational association of genes with traits using a genome-scale gene network for Arabidopsis thaliana. Nat Biotechnol. Nature Publishing Group; 2010;28: 149–56. doi:10.1038/nbt.1603 20. Kotera M, Yamanishi Y, Moriya Y, Kanehisa M, Goto S. GENIES: gene network inference engine based on supervised analysis. Nucleic Acids Res. 2012;40: W162-7. doi:10.1093/nar/gks459 21. Lee T, Yang S, Kim E, Ko Y, Hwang S, Shin J, et al. AraNet v2: an improved database of co-functional gene networks for the study of Arabidopsis thaliana and 27 other nonmodel plant species. Nucleic Acids Res. 2015;43: D996-1002. doi:10.1093/nar/gku1053 22. Bergmann S, Ihmels J, Barkai N. Similarities and differences in genome-wide expression data of six organisms. PLoS Biol. 2004;2: E9. doi:10.1371/journal.pbio.0020009 23. Lelli KM, Slattery M, Mann RS. Disentangling the many layers of eukaryotic transcriptional regulation. Annu Rev Genet. Annual Reviews; 2012;46: 43–68. doi:10.1146/annurev- 68 genet-110711-155437 24. Pilpel Y, Sudarsanam P, Church GM. Identifying regulatory networks by combinatorial analysis of promoter elements. Nat Genet. 2001;29: 153–9. doi:10.1038/ng724 25. Vandepoele K, Quimbaya M, Casneuf T, De Veylder L, Van de Peer Y. Unraveling transcriptional control in Arabidopsis using cis-regulatory elements and coexpression networks. Plant Physiol. 2009;150: 535–46. doi:10.1104/pp.109.136028 26. Rest JS, Wilkins O, Yuan W, Purugganan MD, Gurevitch J. Meta-analysis and metaregression of transcriptomic responses to water stress in Arabidopsis. Plant J. 2016;85: 548– 560. doi:10.1111/tpj.13124 27. Ballouz S, Verleyen W, Gillis J. Guidance for RNA-seq co-expression network construction and analysis: safety in numbers. Bioinformatics. 2015;31: 2123–30. doi:10.1093/bioinformatics/btv118 28. Cosgrove EJ, Gardner TS, Kolaczyk ED. On the choice and number of microarrays for transcriptional regulatory network inference. BMC Bioinformatics. 2010;11: 454. 29. Pirooznia M, Yang JY, Yang MQ, Deng Y, Guyon I, Weston J, et al. A comparative study of different machine learning methods on microarray gene expression data. BMC Genomics. BioMed Central; 2008;9 Suppl 1: S13. doi:10.1186/1471-2164-9-S1-S13 30. Gerstein M, Jansen R. The current excitement in bioinformatics—analysis of wholegenome expression data: how does it relate to protein structure and function? Curr Opin Struct Biol. 2000;10: 574–584. doi:10.1016/S0959-440X(00)00134-2 31. Pirim H, Seker S. Ensemble Clustering for Biological Datasets. 2012; Available: http://www.researchgate.net/publication/236622906_Ensemble_Clustering_for_Biological _Datasets/file/50463518772638378a.pdf 32. Kilian J, Whitehead D, Horak J, Wanke D, Weinl S, Batistic O, et al. The AtGenExpress global stress expression data set: protocols, evaluation and model data analysis of UV-B light, drought and cold stress responses. Plant J. 2007;50: 347–63. doi:10.1111/j.1365313X.2007.03052.x 33. Mueller LA, Zhang P, Rhee SY. AraCyc: a biochemical pathway database for Arabidopsis. Plant Physiol. 2003;132: 453–60. doi:10.1104/pp.102.017236 34. Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol. 2005;4: Article17. doi:10.2202/1544-6115.1128 35. Jones CE, Brown AL, Baumann U. Estimating the annotation error rate of curated GO database sequence annotations. BMC Bioinformatics. 2007;8: 170. doi:10.1186/1471-21058-170 69 36. Patel R V, Nahal HK, Breit R, Provart NJ. BAR expressolog identification: expression profile similarity ranking of homologous genes in plant species. Plant J. 2012;71: 1038–50. doi:10.1111/j.1365-313X.2012.05055.x 37. Schäfer J, Strimmer K. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat Appl Genet Mol Biol. 2005;4: Article32. doi:10.2202/1544-6115.1175 38. Li S, Ehrhardt DW, Rhee SY. Systematic analysis of Arabidopsis organelles and a protein localization database for facilitating fluorescent tagging of full-length Arabidopsis proteins. Plant Physiol. 2006;141: 527–39. doi:10.1104/pp.106.078881 39. Mao L, Van Hemert JL, Dash S, Dickerson JA. Arabidopsis gene co-expression network and its functional modules. BMC Bioinformatics. BioMed Central; 2009;10: 346. doi:10.1186/1471-2105-10-346 40. Weirauch MT, Yang A, Albu M, Cote AG, Montenegro-Montero A, Drewe P, et al. Determination and Inference of Eukaryotic Transcription Factor Sequence Specificity. Cell. 2014;158: 1431–1443. doi:10.1016/j.cell.2014.08.009 41. Tsai ZT-Y, Shiu S-H, Tsai H-K. Contribution of Sequence Motif, Chromatin State, and DNA Structure Features to Predictive Models of Transcription Factor Binding in Yeast. PLoS Comput Biol. 2015;11: e1004418. doi:10.1371/journal.pcbi.1004418 42. Kunst L, Samuels AL. Biosynthesis and secretion of plant cuticular wax. Prog Lipid Res. 2003;42: 51–80. doi:10.1016/S0163-7827(02)00045-0 43. Go YS, Kim H, Kim HJ, Suh MC. Arabidopsis Cuticular Wax Biosynthesis Is Negatively Regulated by the DEWAX Gene Encoding an AP2/ERF-Type Transcription Factor. Plant Cell. 2014;26: 1666–1680. doi:10.1105/tpc.114.123307 44. Seo PJ, Lee SB, Suh MC, Park M-J, Go YS, Park C-M. The MYB96 transcription factor regulates cuticular wax biosynthesis under drought conditions in Arabidopsis. Plant Cell. 2011;23: 1138–52. doi:10.1105/tpc.111.083485 45. Ljung K. Auxin metabolism and homeostasis during plant development. Development. 2013;140: 943–50. doi:10.1242/dev.086363 46. Pencík A, Simonovik B, Petersson S V, Henyková E, Simon S, Greenham K, et al. Regulation of auxin homeostasis and gradients in Arabidopsis roots through the formation of the indole-3-acetic acid catabolite 2-oxindole-3-acetic acid. Plant Cell. 2013;25: 3858– 70. doi:10.1105/tpc.113.114421 47. Kasahara H. Distinct Isoprenoid Origins of cis- and trans-Zeatin Biosyntheses in Arabidopsis. J Biol Chem. 2004;279: 14049–14054. doi:10.1074/jbc.M314195200 70 48. Feng H, An F, Zhang S, Ji Z, Ling H-Q, Zuo J. Light-regulated, tissue-specific, and cell differentiation-specific expression of the Arabidopsis Fe(III)-chelate reductase gene AtFRO6. Plant Physiol. 2006;140: 1345–54. doi:10.1104/pp.105.074138 49. Kim SA, Guerinot M Lou. Mining iron: iron uptake and transport in plants. FEBS Lett. 2007;581: 2273–80. doi:10.1016/j.febslet.2007.04.043 50. Craigon DJ, James N, Okyere J, Higgins J, Jotham J, May S. NASCArrays: a repository for microarray data generated by NASC’s transcriptomics service. Nucleic Acids Res. Oxford University Press; 2004;32: D575-7. doi:10.1093/nar/gkh133 51. Bilgin DD, Zavala JA, Zhu J, Clough SJ, Ort DR, DeLucia EH. Biotic stress globally downregulates photosynthesis genes. Plant Cell Environ. 2010;33: 1597–613. doi:10.1111/j.1365-3040.2010.02167.x 52. D’haeseleer P. How does gene expression clustering work? Nat Biotechnol. Nature Publishing Group; 2005;23: 1499–501. doi:10.1038/nbt1205-1499 53. Pelleg DD, Pelleg DD, Moore AW, others. X-means: Extending K-means with Efficient Estimation of the Number of Clusters. ICML. 2000. pp. 727--734. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.3377 54. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Ser B (Statistical Methodol. Blackwell Publishers Ltd.; 2001;63: 411–423. doi:10.1111/1467-9868.00293 55. Rousseeuw PJ. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. North-Holland; 1987;20: 53–65. doi:10.1016/03770427(87)90125-7 56. Ben-Hur A, Elisseeff A, Guyon I. A stability based method for discovering structure in clustered data. Pacific Symp Biocomput. 2002;7: 6–17. 57. Jain AK. Data clustering: 50 years beyond K-means. Pattern Recognit Lett. 2010;31: 651– 666. doi:10.1016/j.patrec.2009.09.011 58. Pollard KS, van der Laan MJ. Cluster Analysis of Genomic Data. Springer New York; 2005. pp. 209–228. doi:10.1007/0-387-29362-0_13 59. Lu Y, Savage LJ, Larson MD, Wilkerson CG, Last RL. Chloroplast 2010: a database for large-scale phenotypic screening of Arabidopsis mutants. Plant Physiol. 2011;155: 1589– 600. doi:10.1104/pp.110.170118 60. Bell SM, Burgoon LD, Last RL. MIPHENO: data normalization for high throughput metabolite analysis. BMC Bioinformatics. 2012;13: 10. doi:10.1186/1471-2105-13-10 71 61. Peng C, Uygun S, Shiu S-H, Last RL. The Impact of the Branched-Chain Ketoacid Dehydrogenase Complex on Amino Acid Homeostasis in Arabidopsis. Plant Physiol. 2015; pp.15.00461-. doi:10.1104/pp.15.00461 62. Gu L, Jones AD, Last RL. Broad connections in the Arabidopsis seed metabolic network revealed by metabolite profiling of an amino acid catabolism mutant. Plant J. 2010;61: 579– 90. doi:10.1111/j.1365-313X.2009.04083.x 63. Plant metabolic pathway database (PMN / PlantCyc) home page [Internet]. [cited 12 Apr 2016]. Available: http://www.plantcyc.org/ 64. Gene Ontology Consortium TGO. Gene Ontology Consortium: going forward. Nucleic Acids Res. Oxford University Press; 2015;43: D1049-56. doi:10.1093/nar/gku1179 65. Hooper CM, Tanz SK, Castleden IR, Vacher MA, Small ID, Millar AH. SUBAcon: a consensus algorithm for unifying the subcellular localization data of the Arabidopsis proteome. Bioinformatics. 2014; btu550-. doi:10.1093/bioinformatics/btu550 66. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci U S A. 2003;100: 9440–5. doi:10.1073/pnas.1530509100 67. Gordon DB, Nekludova L, McCallum S, Fraenkel E. TAMO: a flexible, object-oriented framework for analyzing transcriptional regulation using DNA-sequence motifs. Bioinformatics. 2005;21: 3164–5. doi:10.1093/bioinformatics/bti481 68. Cartwheel: a framework for genomic sequence analysis [Internet]. [cited 12 Apr 2016]. Available: http://cartwheel.caltech.edu/ 69. TAIR. [cited 12 Apr 2016]. Available: http://www.arabidopsis.org/ 70. Schmid M, Davison TS, Henz SR, Pape UJ, Demar M, Vingron M, et al. A gene expression map of Arabidopsis thaliana development. Nat Genet. 2005;37: 501–6. doi:10.1038/ng1543 71. Goda H, Sasaki E, Akiyama K, Maruyama-Nakashita A, Nakabayashi K, Li W, et al. The AtGenExpress hormone and chemical treatment data set: experimental design, data evaluation, model data analysis and data access. Plant J. 2008;55: 526–42. doi:10.1111/j.0960-7412.2008.03510.x 72. Mockler TC, Michael TP, Priest HD, Shen R, Sullivan CM, Givan SA, et al. The DIURNAL project: DIURNAL and circadian expression profiling, model-based pattern matching, and promoter analysis. Cold Spring Harb Symp Quant Biol. 2007;72: 353–63. doi:10.1101/sqb.2007.72.006 73. AtGenExpress Resources - Weigel World [Internet]. [cited 12 Apr 2016]. Available: http://jsp.weigelworld.org/AtGenExpress/resources/ 72 74. GEO - NCBI. [cited 12 Apr 2016]. Available: http://www.ncbi.nlm.nih.gov/geo/ 75. Bioconductor. [cited 12 Apr 2016]. Available: http://www.bioconductor.org/ 76. Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3: Article3. doi:10.2202/1544-6115.1027 77. SciPy.org. [cited 12 Apr 2016]. Available: http://www.scipy.org/ 78. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12: 2825–2830. 79. bnlearn-R package. Available: http://www.bnlearn.com/ 80. Hartigan J, Wong M. Algorithm AS 136: A K-Means Clustering Algorithm. Appl Stat. 1979;28: 100. doi:10.2307/2346830 81. Pal NR, Bezdek JC, Hathaway RJ. Sequential Competitive Learning and the Fuzzy c-Means Clustering Algorithms. Neural Networks. Elsevier Science Ltd.; 1996;9: 787–796. doi:10.1016/0893-6080(95)00094-1 82. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9: 559. doi:10.1186/1471-2105-9-559 83. Chitta R, Jin R, Havens TC, Jain AK. Approximate Kernel k-means : Solution to Large Scale Kernel Clustering. 2011; 895–903. 73 CHAPTER 3 PREDICTIVE MODELS OF SPATIAL TRANSCRIPTIONAL RESPONSE TO HIGH SALINITY1 1 The work described in this chapter was submitted to the journal Plant Physiology and currently under review for publication: Sahra Uygun¶, Alexander E. Seddon¶, Christina B. Azodi, Shin-Han Shiu. Predictive models of spatial transcriptional response to high salinity ¶ These authors contributed equally to this work. Contribution: Alexander E. Seddon and I were involved in conceptualization, data curation and formal analyses of this project. We generated all figures of this manuscript. Alexander E. Seddon and I wrote the original manuscript, I updated and included sections, and all authors were involved in reviewing and editing of the manuscript. 74 3.1 Abstract Plants are exposed to a variety of environmental conditions, and their ability to respond to environment variation depends on the proper regulation of gene expression in an organ, tissue, and cell type specific manner. Although knowledge is accumulating on how stress responses are regulated, a genome-wide model of how plant transcription factors (TFs) and cis-regulatory elements (CREs) control spatially specific stress response has yet to emerge. Using Arabidopsis thaliana as a model, we identified a set of 1,894 putative CREs (pCREs) that are associated with high salinity (salt) up-regulated genes in the root or the shoot. These pCREs led to computational models that can better predict salt up-regulated genes in root and shoot compared to models based on known TF binding motifs. In addition, we incorporated TF binding sites identified via largescale in vitro assays, chromatin accessibility, evolutionary conservation and pCRE combinatorial relations in machine learning models, and found that only consideration of pCRE combinations led to better performance in salt up-regulation predictions in root and shoot. Our results suggest that the plant organ transcriptional response to high salinity is regulated by a core set of pCREs and provide a genome-wide view on the cis-regulatory code of plant spatial transcriptional responses to environmental stress. 3.2 Introduction Plants are equipped with a wide range of mechanisms to respond to environmental stresses such as excess heat, salinity, drought, and pathogen attack [1,2]. These stress response mechanisms are indispensable for plant survival and have a significant spatial component where organs and tissues respond to the environmental changes differently [3–5]. In the case of high salinity stress (referred to as salt stress), after perceiving an increase in soil salt concentration, the primary 75 physiological response of the root is to exclude sodium from the xylem and to send hormonal signals of stress to the shoot, while the shoot must respond to the effects of ion toxicity and water limitation [6,7]. In addition to physiological changes that are spatially specific, it is well documented that differential gene expression under stress conditions can be regulated in highly organ and tissue specific manner [8–11], which ultimately impact plant development and physiology. Spatially- and conditionally- specific gene expression is expected to be subjected to the control of transcriptional regulatory machineries, including transcription factors (TFs) and their associated cis-regulatory elements (CREs). Currently, the TFs and their corresponding CREs regulating stress response have received considerable attention [12–14], but our knowledge on spatial regulation of stress responses are limited. CREs can be identified based on co-expression [15–19] and/or through in vitro and in vivo TF binding experiments [20–23]. The co-expression approach has been successfully used to identify putative CREs (pCREs) in regulating stress responsive gene expression in yeast (Saccharomyces cerevisiae) [15] and in A. thaliana [17]. In addition, pCREs are over-represented in the 1kb regions upstream of tissue and cell type specifically expressed genes [24]. Although some of these pCREs are similar to binding sequences of TFs known to regulate stress responsive genes [24], it remains unclear how they may be relevant to spatial stress response regulation. One fruitful computational approach for assessing the relevance of pCREs is to ask how well they can be used to predict spatial stress response, i.e. how well they can be used to identify the “Cis-regulatory Code” (CRC [17,25]) CRC is defined as the sets of CREs involved in gene regulation in a particular context (e.g. environment, location, timing) [17,25]. One major conclusion from CRC studies is that TFs frequently regulate genes expression pattern in combinations. For example, in yeast, CREs 76 identified through TF binding data uncovered a complex regulatory code involving combinations of multiple CREs [20]. In humans, genes expressed in specific tissues are regulated by particular combinations of TFs and CREs [26]. In A. thaliana, CRCs consisting of binary combinations of pCREs resulted in more precise predictions of salt stress up-regulated genes [17] then using individual CREs. CRCs can potentially be further improved by knowledge of TF binding. For example, computational model considering in vitro TF binding site information, sequence conservation, DNA structure, and/or chromatin accessibility were shown to predictive of in vivo TF binding in mouse [27] and in yeast [28]. Tissue specific TF binding was also predicted using information on binding motifs and histone modifications [29]. These examples highlight the relevance and utility of CRCs and integration of multiple relevant datasets for understanding the mechanisms underlying genome-wide spatial transcriptional response to stress. However, such spatial response CRC is not available. The goal of this study was to uncover the CRCs underlying spatially specific transcriptional response to stress using plant as a model. Specifically, we focused on the CREs relevant to salt stress response in the above ground (shoot) and the below ground (root) parts of A. thaliana. Salt stress was chosen because it is well studied both physiologically [6,7] and molecularly [30,31], and there are documented differences in the transcriptional response to salt in the root and shoot [8,11]. Additionally, there are known TFs and CREs for salt stress [31–34] for verifying our results. To assess transcriptional changes to salt stress across different organs in A. thaliana, we first asked how salt up-regulated genes in roots and shoots differed in their functional annotations. Next, to determine how well current knowledge of TF binding sites in A. thaliana can explain spatial salt up-regulation, we used motifs and binding sites identified through two large-scale in vitro studies [22,23] to generate models of root and shoot salt up-regulation. We then identified 77 additional putative CREs (pCREs) with a co-expression approach to assess if these newly identified pCREs allowed better predictions of spatial response to salt stress. We tested whether pCRE individually could be used to establish a cis-regulatory model explaining spatial patterns of up-regulation during salt stress. To evaluate whether we could further improve spatial salt stress response prediction, we filtered pCRE sites according to information on in vitro TF binding [22,23], chromatin accessibility, and conserved non-coding regions [35]. Lastly, we built prediction models using combinations of CREs. 3.3 Results and Discussion 3.3.1 Transcriptional responses to stress have a strong spatial component Earlier global gene expression studies demonstrated that different plant organs have distinct transcriptional responses to stress [8–11]. To assess the extent to which organs have unique expression patterns under different stress conditions and to determine the similarities between organ (root vs. shoot) stress responses, we determined the correlations between the levels of differential expression across multiple conditions and time points using two types of existing datasets: (1) root and shoot samples under abiotic stress [11] and (2) shoot samples under biotic stress (see 3.5 Methods). There were several patterns worth noting. First, samples for related stress conditions tended to cluster together, and these "stress condition clusters" tended to have root and shoot sub-clusters (Figure 3.1A). For example, osmotic and salt stress samples formed a cluster, with sub-clusters composed of shoot and root samples (dotted rectangles I and II respectively; Figure 3.1A). 78 Figure 3.1 A. thaliana gene expression correlation across stress datasets and Gene Ontology (GO) terms enriched in salt responsive genes. (A) Between-sample Pearson's Correlation Coefficients (PCC) calculated based on log2 fold changes (log2(stress treatment/control)) of genes in shoot and root samples under each stress condition/treatment duration combination. The orders of rows and columns are the same, and they are sorted based on hierarchical clustering of the pairwise PCC values. Boxes on the left represent the key for organ and stress condition. Dotted rectangles I and II highlight osmotic and salt stress clusters, respectively. (B) Heatmap (left) indicating Gene Ontology (GO) Slim terms significantly over- (blue) or under- (red) represented in genes that are differentially up-regulated during salt stress after 3 hours in (R)oot only, (S)hoot only, and/or (G)lobally in both organs (log2 fold change>1, p ≤ 0.05). Right heatmap summarizes the log2 odds ratio (LOR) from the enrichment test (grey: LOR could not be calculated due to 0 in ratio). 79 The median Pearson’s Correlation Coefficient (PCC) of the log2 fold-change values between samples from the same organ but different stress conditions (median PCC=0.17) were significantly lower than those between samples from the same stress condition but different organs (median PCC=0.31, Mann-Whitney, p<2.2e-16). Thus, the stress condition has more impact on overall expression pattern than organ identity. Nonetheless, under some stress conditions there were stronger organ specific effects. For example, the salt stress response correlations between organs (median PCC=0.24) were significantly lower those between samples from the same organ (median PCC=0.69, Mann-Whitney, p<2.2e-16). Taken together, our findings are consistent with earlier studies [8–10] that, while there is a specific transcriptional response to each stress, this response is further influenced by the organ where genes are expressed. Given the stress response is influenced by organ identity, we next assessed what types of genes tend to be differentially up-regulated in root and shoot during salt stress using Gene Ontology (GO) term enrichment analysis (see 3.5 Methods). Three sets of significantly salt upregulated genes were defined: (1) global - 246 genes both in the root and the shoot; (2) root-specific - 1,854 genes only in the root, and (3) shoot-specific - 276 genes only in the shoot. There were 48 GO terms significantly over/under-represented in ≥1 gene sets defined above (Figure 3.1B). For example, thylakoid and plastid terms were over-represented among shoot specifically up-regulated genes, consistent with an earlier finding that photosynthesis is significantly impacted by salt stress [36]. Among the terms, signal transduction and response to stress were over-represented in all three gene sets (Figure 3.1B). As these three gene sets are mutually exclusive, this result suggests that the root and the shoot have unique signaling pathway genes up-regulated, as well as pathways that are globally necessary for stress response. This result is supported by work on the The Salt Overly Sensitive (SOS) pathway that involves components that are common to both organs as well 80 as those specific to the root and shoot [30,37]. Interestingly, “DNA binding transcription factor activity” and "DNA binding", were only enriched amongst the root specifically and globally upregulated genes. This suggests that there is a set of global TFs and another set specific to the root. In addition, genes up-regulated in the root may be regulated by both a global and a root specific set of TFs, while genes up-regulated in the shoot may be regulated primarily by a global TF set. To summarize, a variety of functional categories were found to be enriched in the genes up-regulated by salt stress. In some instances, root specifically, shoot specifically, and globally up-regulated genes had the same enriched functional categories. These common enriched terms suggest that roots and shoots up-regulate similar types of genes. However, there are also genes upregulated in an organ specific manner that may be regulated by distinct sets of up-regulated TFs. The TFs that are specifically up-regulated in roots may help to explain the differences in expression pattern that we observe between the roots and shoots under salt stress. Thus, the root specifically up-regulated genes may be controlled by the root specific TFs. Because TFs may differ in the CREs they bind and there are substantial amounts of in vitro TF-DNA interaction data in A. thaliana [22,23], we next examined whether known TF binding data may be associated with organ specific, salt induced gene expression. 3.3.2 Known TF binding motifs contribute to a better than random performing model for salt up-regulation prediction Because TFs exert their regulatory roles by binding to CREs, we expected that the global and organ specific activities of TFs will be reflected in which CREs are in the regulatory regions of global, root-specific, and shoot-specific salt up-regulated genes. We hypothesized that each organ had a different set of CREs regulating its salt stress up-regulated genes and these CREs could be used to construct CRCs that are models for predicting stress responsive gene expression 81 [17]. To test these hypotheses, we collected A. thaliana TF binding data from two large-scale in vitro studies, CIS-BP [22] and DAP-seq [23] that cover binding sites of 758 TFs. Given the extensive coverage of TFs, we expected that these datasets would cover a significant number of cis-regulatory sequences relevant for controlling root and/or shoot up-regulated genes. Here the root up-regulated genes were defined as the union of the root specifically and globally (in both root and shoot) up-regulated genes under high salinity treatment. Similarly, the shoot up-regulated genes were the union of the shoot specifically and globally up-regulated genes. We first tested if the TF binding sites predicted based on the CIS-BP data and the binding sites inferred from DAP-seq peaks were significantly over-represented in the putative promoter regions (within 1kb upstream of transcriptional start sites) of root and shoot up-regulated genes. Among binding site information for 758 TFs, we found that 262 and 397 were significantly overrepresented in the putative promoters of root and shoot up-regulated genes, respectively compared to non-responsive genes. Overall, we found that, if the TF binding sites based on the CIS-BP data are enriched in the promoter regions of root up-regulated genes, the same sites also tend to be enriched among shoot up-regulated genes (enrichment score PCC=0.88, p=9.20e-42; Figure 3.2A). This was also the case when DAP-seq data was used, but to a much lesser degree (PCC=0.25, p=2.27e-04; Figure 3.2B). This finding suggests that some cis-regulatory sites are common between root and shoot up-regulated genes. Nonetheless, the correlations were not perfect, suggesting that some CIS-BP TF and DAP-seq binding sites were differentially enriched between up-regulated genes in root and shoot. Consistent with this notion, the binding sites of some TF families were enriched in an organ specific manner. For example, WRKY binding sites were over-represented only in root and AP2 sites were over-represented mostly in shoot upregulated genes (Figure 3.2A and B). Next, to assess the extent to which known TF binding data 82 could explain organ-specific responses, we established CRCs with machine learning methods for predicting whether a gene is up-regulated or non-responsive to salt stress in root or shoot based on the presence and absence of CIS-BP TFBM or DAP-seq sites in the putative promoter regions (see 3.5 Methods). We used two approaches to evaluate CRC model performance. The first is Area Under Curve - Receiver Operating Characteristic (AUC-ROC), where a perfect model would have AUCROC=1 and random predictions would lead to AUC-ROC=0.5. The second way is the precisionrecall curve, where precision is the proportion of correctly predicted genes that are up-regulated in an organ and recall is the proportion of truly up-regulated genes in an organ that are correctly predicted. Better models would have precision-recall curves tending more towards the upper-right corner of the graph and random predictions would be no better than the background (dotted lines, Figure 3.2C and D). The model built with all binding site data according to CIS-BP or DAP-seq led to better predictions than randomly expected in root and in shoot (Figure 3.2C and D), indicating that these TF binding data contain relevant regulatory information for root and shoot salt up-regulation. Consistent with the expectation that only a subset of TFs would be involved in the organ-specific up-regulation, models using binding data of TFs with over-represented numbers of binding sites in salt up-regulated genes resulted in similar performance as the ones using all TF data in either root or shoot (Figure 3.2C and D). In addition, the models based on binding sites of TFs that were not over-represented performed poorly (AUC-ROC=0.54-0.56). 83 Figure 3.2 Over-representation of known TF binding sites in organ salt up-regulated genes and performance of CRCs predicting salt up-regulation. (A) Scatterplot of enrichment score (–log (q-value)) of CIS-BP TFBM sites in the promoters of root (y-axis) and shoot (x-axis) upregulated genes compared to non-responsive genes. Each point is for one TFBM. Blue: WRKY TF data. Red: AP2 TF data. Dotted lines: q-value threshold at 0.05. (B) As in (A) but using DAP-seq data. Each point is for one TF. (C) Precision-recall curves and AUC-ROCs (insert) of CRCs predicting root up-regulated genes using CIS-BP TFs (orange) or DAP-seq TFs (blue). O: overrepresented among root up-regulated genes (red and black). The colors of the precision-recall curves correspond to the colors for different subsets in the AUC-ROC bar chart. The error bar corresponds to the standard error from 10-fold cross validation for each model. (D) As in (C) but for shoot up-regulated genes. 84 As the TF binding information was available for ~38% of the known A. thaliana TFs tested under in vitro conditions [22,23,38], there could be some relevant CREs not be included in the models. In addition, it is worth noting that performance of modeling root up-regulated genes is not as good as modeling salt up-regulated genes in shoots (Figure 3.2C and D). Thus, to improve upon our understanding of what CREs are associated with and how these CREs may influence salt up-regulation in the root and shoot, we next identified putative CREs based on co-expression to assess how the regulatory logic differs between the root and shoot salt up-regulation. 3.3.3 pCREs derived from co-expression clusters are similar, but not identical, to the known binding motifs of TFs We hypothesized that motifs identified through co-expression clustering would provide additional regulatory information compared to large-scale TF in vitro binding data [22,23] in modeling salt up-regulation. To test this, we identified 1,894 putative CREs (pCREs) overrepresented in putative promoters of root and/or shoot salt up-regulated genes in co-expression clusters defined based on the stress fold-change data (see 3.5 Methods). Next, we asked if the pCREs identified based on co-expression were similar to 355 CIS-BP TF binding motifs (TFBMs) [22]. We calculated the PCC values of the position weight matrices (PWMs) of all motif pairs between pCRE and CIS-BP TFBMs to find the best matching pCRE-TFBM pairs where lower PCC values indicating diminishing similarity (Figure 3.3A). Three criteria were used to define whether a pCRE-TFBM pair had significant similarity. First, we identified pCREs that are identical to TFBMs. Only two pCREs were identical (PCC=1) to experimentally determined binding motifs for ATBZIP63 involved in abscisic acid (ABA) biosynthesis [34] and ABF3 involved in abscisic acid (ABA) signaling [39], consistent with their roles in salt stress response. Second, given PCC=1 is highly stringent, we defined that a pCRE-TFBM pair has significant similarity if their PCC is 85 significantly higher than (at the 5% level) PCCs of TF pairs from the same family (red; Figure 3.3A). Based on this criterion, 4% of the pCREs were significantly similar to TFBMs. Third, we defined a pCRE-TFBM pair has significant similarity if their PCC is significantly higher than (at the 5% level) PCCs of TF pairs from different families (blue; Figure 3.3A). This is reasonable because the within family TFBM PCC values tend to be higher than between families (Figure S.3.1). This criterion allowed us to assess TFs from which families may bind the pCREs. Based on this criterion, 25% pCREs can be assigned to 24 of the 27 TF families and example TFBMs and their best matching pCREs are shown in Figure 3.3B. While 25% of the organ pCREs enriched among salt up-regulated genes are significantly similar to ≥1 TFBMs, what should be made of the remaining 75% of pCREs? One possibility is that these pCREs are TFBMs likely bound to one of the families, just that in the existing TF binding data a close representative is not available. To test this, we asked if the pCREs are more similar to a known TFBM than to sequences randomly drawn from the genome (black dots; Figure 3.3A) and found that PCC values between pCREs and their best matching TFBMs are all higher than 95th percentile value in the pCRE-random sequence PCC distribution (Figure 3.3A). Thus, all pCREs are more significantly similar to known TFBMs than random sequences. These findings suggest that the pCREs are not simply random sequences pulled from the genome and that the coexpression-based analysis contributed to an expanded set of CREs that are relevant for organ salt up-regulation. 86 Figure 3.3 Similarity of the pCREs to CIS-BP TFBMs. (A) Distributions of PCC values between TFBMs. The y-axis indicates different TF families and the x-axis indicates the PCC values. Orange: the 95th percentile value of the distributions of PCCs between TFBMs from a particular family and pCREs with their best matches in the same family (TFBM vs. pCRE) TFBMs. Red: the 95th percentile value in the distribution of PCCs between pairs of TFBMs from each TF family (TFBM within). Blue: the 95th percentile value of the distribution of PCCs between TFBMs in one family and their best matching TFBMs in other families (TFBM between). Black: the 95th percentile value of the distribution of PCCs between TFBM from a particular family and random motifs (TFBM vs. random). (B) The sequence logos of bZIP, AP2, NAC-NAM and TCP TFBMs from CIS-BP (left) and their best matching pCREs and PCC values (right). 87 3.3.4 pCRE set further improves salt up-regulation prediction in a spatially specific manner To assess if the pCRE set predicts salt up-regulation better than known in vitro TF binding sites [22,23], we modeled salt up-regulated expression using the pCRE set (see 3.5 Methods; Figure 3.4A and B). Salt up-regulation prediction models based on pCREs had better prediction performance for both root up-regulated genes (red, AUC-ROC=0.71; Figure 3.4A) and shoot upregulated genes (red, AUC-ROC=0.79; Figure 3.4B) than the models based on CIS-BP and DAPseq data (root AUC-ROC=0.64, shoot AUC-ROC=0.74; Figure 3.2C and D). This improvement indicates that using motifs discovered from co-expression clusters containing root and/or shoot upregulated genes led to better prediction models of organ salt up-regulation. Next, we classified 1,894 organ pCREs into three subsets that were over-represented in the promoters of genes up-regulated by salt in the root (759 root pCREs), in the shoot (237 shoot pCREs), and in both root and shoot (898 general pCREs). The rationale for defining these pCRE subsets was that the root and shoot subsets might be more critical to controlling expression for the root specifically and shoot specifically up-regulated genes, respectively, while the general pCREs might be critical for globally up-regulated genes. To test this hypothesis, salt up-regulation prediction models were established using root, shoot, and general pCREs where each pCRE was treated as an independent predictor. For predicting root up-regulated genes (including genes upregulated globally and in root specifically), we found that a model based on root pCREs (AUCROC=0.70) was much better than a model based on shoot pCREs (AUC-ROC=0.61; Figure 3.4A). Similarly, a model based on shoot pCREs better predicted shoot salt up-regulated genes (AUCROC=0.73) than a model based on root pCREs (AUC-ROC=0.66; Figure 3.4B). Thus, the root and shoot pCRE sets are better at predicting up-regulated genes in the organs for which they are 88 associated, demonstrating they are relevant to spatially specific up-regulated genes. In addition, root pCREs alone or the combination of the general and the root pCRE set resulted in models that performed as well as the model using the all pCRE set (AUC-ROC=0.71; Figure 3.4A). This suggests that shoot pCREs provide no additional information to predict root up-regulated genes. In contrast, for shoot up-regulated genes, although the model based only on shoot pCRE performed reasonably well (AUC-ROC=0.73), it was not as good as the model based only on the general pCREs (AUC-ROC=0.80; Figure 3.4B). Surprisingly, adding the shoot pCRE set did not provide additional regulatory information for salt up-regulation in the shoots that was not already provided by the general pCREs. This conclusion is based on the finding that the general pCRE-based model performed as well as a model using both the general pCREs and the shoot pCREs (AUC-ROC=0.80; Figure 3.4B). This further supports the notion that shoot up-regulated genes may be regulated by a global set of TFs (Figure 3.1B) that bind to set of general pCREs. Another surprise was that, for root up-regulated gene prediction, the models based on the root pCREs, the general pCREs, and the union of the general and the root pCREs performed similarly (Figure 3.4A). One potential explanation is that each model captured a distinct subset of the organ up-regulated genes. To assess the extent to which the models were predicting similar sets of genes, we examined how genes were classified when different pCRE subsets were used (see 3.5 Methods). We found that more root specifically up-regulated genes were predicted with the root pCRE-based models (24%) compared to models using general pCREs (9%; Figure S.3.2). 89 Figure 3.4 Performance of salt up-regulation prediction models using pCREs identified from co-expression clusters. (A) Precision-recall curves for models predicting root salt up-regulated genes using all pCREs (black), root pCREs (blue), shoot pCREs (red), general pCREs (orange) and, root+general pCREs (purple). The bar plot on top right indicates the corresponding AUCROC values of the models. Error bar is the standard error of precision values or AUC-ROCs from 10-fold cross validation. (B) Precision-recall curves and AUC-ROC values for models predicting shoot up-regulated genes using all pCREs (black), root pCREs (blue), shoot pCREs (red), general pCREs (orange) and, shoot+general pCREs (purple). 90 Taken together, these results demonstrate that the identification of the pCRE set using stress expression data can lead to improvements in modeling gene expression over known in vitro TF binding sites. This supports our hypothesis that co-expression based approaches would improve CRE discovery. We also found that salt stress up-regulated genes in the root and the shoot may be regulated by different subsets of motifs in the pCRE set. Genes up-regulated by salt stress in the root can be best predicted with a model considering both root and general pCRE sets without considering shoot pCREs. However, the shoot up-regulated genes likely are regulated primarily by general pCREs, as seen in the equivalent performance of the general pCRE model and the full pCRE model of shoot up-regulated genes. 3.3.5 Filtering pCREs based on TF binding, DNase I hypersensitivity, and conservation We demonstrated that pCREs identified in this study can predict organ salt up-regulation. However, the large number of pCREs identified (1,894) raised the question of whether there might be motifs that were redundant or not particularly informative in the predictive model and could be filtered out. To reduce redundancy, we first removed highly similar pCRE pairs (see 3.5 Methods). Next, we used feature selection algorithms to identify the best performing pCREs in predicting root up-regulated genes (Figure 3.4A) and shoot up-regulated genes (Figure 3.4B). Among the feature selection algorithms used, the Chi-square statistic-based approach performed the best (see 3.5 Methods; Supplemental Figure 3). With a threshold chi-square statistic ≥ 10, 678 (41%) and 397 (35%) of pCREs (referred to as chi10 selected pCREs) were regarded as informative and could better predict root (AUC-ROC=0.73; Figure 3.5A) and shoot (AUC-ROC=0.81; Figure 3.5B) salt up-regulation, respectively, compared to when the full set of pCREs were used (Figure 3.4A and B). 91 To improve prediction of organ salt gene up-regulation further, we took advantage of additional regulatory information including the in vitro TF binding data (CIS-BP and DAP-seq), chromatin accessibility measured according to DNase I Hyper-Sensitivity (DHS) experiments [40], and the Conserved Non-coding Sequences (CNS) among Brassicaceae species [35]. Although root and shoot up-regulated gene promoters were over-represented with DHS regions (FETs, all p < 5e-13) and with CNSs (FETs, all p<1e-12) compared to non-responsive genes, the performance of models based on only DHS or CNS were the same as random guess (AUC-ROC ~ 0.5), suggesting additional regulatory sequence information was needed. Thus, we hypothesized that a pCRE site would be more informative in predicting gene expression if it overlapped with a potential TF binding site, a chromatin accessible region, and/or CNS. Models based on DAP-seq filtered pCREs had similar performance to the models using original unfiltered pCREs for both root and shoot salt up-regulation (AUC-ROC=0.73-74 and 0.80-81; Figure 3.5A and B). Because the model performance remained the same and 9-14% of the pCRE sites were removed, it is likely that filtering based on DAP-seq data eliminated some false positive pCRE sites but also true positive sites. This is also true for filtering pCRE sites based on CIS-BP data (Figure 3.5A). On the other hand, filtering pCRE sites based on DHS information further decreased the performance for shoot up-regulation prediction but did not impact prediction in root (Figure 3.5B). Thus, pCRE sites informative for predicting shoot salt up-regulation were likely removed, potentially because chromatin accessibility can only partially predict gene expression [41]. One surprising finding was that models based on pCRE sites overlapped with CNS had the worst performance in predicting both root and shoot up-regulated genes. This is likely because the CNS were identified with stringent criteria and filtering reduced true pCRE sites. In addition, this finding suggests that there are pCRE sites that are involved in organ salt up- 92 regulation but are not highly conserved. Taken together, the pCRE information alone already led to models with the best performance in predicting organ salt up-regulation. Additional TF binding information, DHS, and CNS either did not improve or worsen the model performance. 3.3.6 pCREs work best in combinations To evaluate what the minimal set of pCREs was for salt up-regulation predictions, we ranked all chi10 pCREs as well as those with DAP, CIS-BP, DHS, and/or CNS evidence according to importance scores generated during machine learning runs (see 3.5 Methods). For each chi10 pCRE, it was examined five times by either applying no filter to the sites the pCRE mapped to or by filtering based on four types of evidence (DAP, CIS-BP, DHS, or CNS), this analysis was referred to as a combined approach (Figure 3.5). Consistent with models with CNS filtering having the lowest performance in predicting organ salt up-regulation, we found that CNS features were the least important in predictions (Figure 3.5C and D). Given there were 678 root and 397 shoot chi10 pCREs, we ranked 2,712 root and 1,588 shoot pCRE-evidence pairs and identified the top 100 and 10 pCREs for root or shoot up-regulation. 93 Figure 3.5 Performance of salt up-regulation prediction models using filtered pCRE sets. (A) Precision-recall curves for models predicting root salt up-regulated genes using the following pCRE sets: (1) chi10 - pCREs selected with the chi-square test feature selection method with a threshold of chi-square statistic ≥ 10 (red) (Supplementary Figure 3E and F), (2) DAP - chi10 selected pCREs including only pCRE sites overlapped DAP-seq peaks (blue), (3) CIS-BP - chi10 selected pCREs including only pCRE sites overlapped with CIS-BP TFBM sites (orange), (4) DHS - chi10 selected pCREs including only pCRE sites overlapped with a DNase-I Hyper-Sensitivity (DHS) peaks (green), (5) CNS - chi10 selected pCREs including only pCRE sites overlapped with a Conserved Non-coding Sequence (CNS, purple), and (6) Combined - using all information from (1)-(5) (black). 94 Figure 3.5 (cont’d) (B) Precision-recall curves for models predicting shoot salt up-regulated genes using the six types of pCRE sets as in (A). (C) Distributions of importance ranks of all chi10 selected pCREs (chi10) or chi10 pCREs filtered based on DAP-seq, CIS-BP, DHS, or CNS data. The ranks were obtained from the model built with the Combined dataset in (A) for root. (D) As in (C) but based on the model built with the Combined dataset in (B) for shoot. The models based on the top 100 pCRE-evidence pair yielded AUC-ROC=0.72 and 0.80 for predicting root and shoot up-regulation, respectively, which were comparable to the models based on all chi10 selected pCREs (red; Figure 3.5A and B). However, using only the top 10 pCRE-evidence pairs, the prediction performance was significantly worse (AUC-ROC=0.66 and 0.72 for root and shoot prediction, respectively). This result suggests that most important 100 pCRE-evidence pairs that included 39 and 40 pCREs for root and shoot, respectively, were sufficient for predicting organ salt up-regulated genes. This would imply that the rest of the 1,854 pCREs we identified are not informative. Alternatively, it is possible that some of these seemingly uninformative pCREs may reveal their importance only in combinations as demonstrated in studies on regulation of gene expression under stress conditions [17,20], as well as tissue specific expression [26,42]. So far, we considered many pCREs collectively in the salt up-regulation predictions but treated each pCRE individually as an independent predictor. Therefore, we asked: (1) whether pCRE combinations were important for salt up-regulated genes in the root and/or the shoots, (2) what the important pCRE combinations were, (3) what types of pCREs were involved with the combinations, and (4) if combinatorial rules important in root expression were also important for shoot expression, or vice versa. 95 To identify pCRE combinations relevant to the up-regulation of genes under salt stress, we used the Classification by Association method (CBA; see 3.5 Methods). Due to consideration of computational complexity, we restricted our analysis to binary combinatorial rules where the presence of two pCREs predicted up-regulation (pCRE A + pCRE B  up-regulation in organ of interest). Rule sets were generated for both the root and the shoot salt up-regulated genes. As some pCREs may only be informative in combinations, we included all 1,894 pCREs without any filtering to identify combinatorial rules. This would also enable us to compare the pCREs involved in rules to the individual pCREs found to be most informative. We identified 2,826 and 351 combinatorial rules for root and shoot up-regulation. 1,086 pCREs were present in the combinatorial rules that were predictive of root up-regulation (root rules), but only 389 of them were also chi10 selected pCREs that were informative for predicting up-regulation when considered individually. Similarly, only 136 out of 427 pCREs in the shoot rules were chi10 selected. Thus, a substantial number of pCREs were informative for predicting root and shoot salt up-regulation only when considered in combination. We also found only 12 root rules (among 2,826) had the same pCRE combinations as the shoot rules, suggesting that the great majority of the rules for one organ were specific to that organ. Most importantly, models based on only the combinatorial rule sets improved predictions for both root (AUC-ROC=0.81; Figure 3.6A) and shoot (AUC-ROC=0.87; Figure 3.6B) up-regulated genes compared to the models based on presence/absence of single pCREs (AUC-ROC=0.71 and 0.79 for root and shoot, respectively, Figure 3.4A and B). These results indicate the involvement of pCRE combinations to the salt up-regulation. In addition, they demonstrate that the rules capture the physical interaction between two presumed TFs binding to a pair of pCREs. Nonetheless, we found that the sites of a pair of pCREs in a rule are not significantly closer together in salt up- 96 regulated genes compared with non-responsive genes (Figure S.3.4A and B). This is consistent with the finding that the distance distribution of the binding sites of interacting human TFs were not significantly different from random expectation [42]. Thus, sites of pCRE important for combinatorial regulation may not be constrained by their distances. With the combinatorial rules, we next examined if the rules tended to be composed of a general pCRE and an organ (root or shoot) pCRE, two general pCREs, or two organ pCREs (Figure 3.6C). We found that there was a significant difference in the distribution of these three categories of combinatorial rules for the shoot rules (Chi-square test, p=6.0e-06), particularly there were more general-general pCRE combinations than expected (odds-ratio=1.5), and fewer organorgan pCRE rules than expected (odds-ratio=0.52). This aligns with the notion that the general pCREs are more important for the regulation of shoot up-regulated genes. The root rules also had a significantly different distribution of rule types (Chi-square test, p=0.01), but the effect sizes were generally low (odds-ratios range 0.89-1.1). Thus, it does not appear that rules for root upregulated genes are composed of general pCRE with a pCRE from one of the organ sets. Taken together, our findings suggest that the organ pCREs work best in combinations and example combinatorial rules are shown in Figure 3.6D. The greater importance of combinatorial rules aligns well with what is already known in mammals, where individual CREs are important for expression in multiple tissues, but CRE combinations are more relevant in controlling tissuespecific expression [18,22]. Both root rules and shoot rules incorporate pCREs from the full set of organ pCREs, but there is only little overlap (0.4-3%) in the two sets of rules. This suggests that the pCREs need to be considered in combinations for better predictions of salt up-regulation. 97 Figure 3.6 Summary of root and shoot combinatorial pCRE rules and model performance. (A) Precision-recall curves comparing models based on combinatorial rules (green) and the full pCRE set (black) for predicting root salt up-regulated genes. (B) Precision-recall curves comparing model based on combinatorial rules (green) and the full pCRE set (black) for shoot salt upregulated genes. (C) Heatmap summarizing the rules identified for salt stress up-regulated genes in the root (pCREs: blue=root, red=shoot, orange=general). 98 3.4 Conclusions In this study, we identified a set of 1,894 pCREs from co-expression clusters that were relevant to the up-regulation of transcript abundance under salt stress in the root and shoot of A. thaliana. Among them, 25% pCREs were similar to the known binding motifs of TFs from multiple families. Machine learning models for predicting salt up-regulation based on the pCRE set had significantly better performance than those based on in vitro binding data from two largescale studies [22,23]. Thus, the pCREs identified likely contained cis-regulatory information of spatial response to salt. We also found that the salt up-regulation in the root required both a general pCRE set that was relevant to up-regulation in both root and shoot as well as a root pCRE set primarily associated with root specifically salt up-regulated genes. In contrast, the shoot salt upregulated genes relied primarily on a general pCRE set. Considering that substantially more genes were up-regulated in the root (2,100) compared to that in the shoot (524), this difference in the composition of relevant pCREs may reflect the differences in regulatory complexity and root as the primary organ exposed to high salinity treatment. Filtering pCREs using in vitro TF binding data, chromatin accessibility and conservation, we found that ~40 pCREs could predict organ salt up-regulation with the same performance as the model using all pCREs. Nonetheless, the organ salt up-regulation models considering combinations of pCREs had significantly improved performance over the models considering pCREs collectively but treated each pCRE as independent predictor. Most importantly, the majority of the pCREs in the combinatorial rules were not considered important when they were treated as independent predictors and would have been false negatives in common motif finding practices. One limitation of our study was that the pCREs were identified based on the expression data alone without knowledge of whether the sites mapped by these pCREs were actually bound 99 by TFs. To overcome this limitation, we incorporated in vitro binding, chromatin accessibility, and conservation data into the model. We found that the inclusion of in vitro binding data led to models with the same performance as those considered only pCREs. Nonetheless, we found that the pCREs identified are complementary to in vitro derived TF binding information. Because the in vitro TF binding was an assessment of what kinds of sequences could be bound and not where the in vivo binding sites were in the genome, the binding data by itself was not expected to predict condition-specific expression well. Combining the pCREs identified using condition-specific data and the in vitro binding data, the condition-specific regulators and regulatory sequences could be pinpointed. In addition to in vitro TF binding data, chromatin accessibility data (DHS) was incorporated but led to either a reduced model performance or did not improve the model performance. One potential reason was because the DHS data we used were generated under conditions not related to salt stress in different developmental stages of A. thaliana. Finally, CNS was incorporated to filter pCRE sites but yielded models with the lowest prediction performance. One reason could be stress pCREs might have higher evolutionary rates and not well conserved. Another possible reason is that CNSs are defined in a stringent fashion or not sensitive enough in obtaining sites that are under selection but beyond the limit of detection. Another limitation was that the spatial stress response was predicted at the organ level with limited resolution. The next logical step is to identify pCREs that can be used to predict the differential expression of genes in a cell type specific manner. In any case, our results show that co-expression based CRE identification in conjunction with machine learning-based modeling are a promising method for globally assessing spatial gene regulation in the context of stress. In addition to providing a genome-wide view of the potential cis-regulatory mechanisms, this approach may have possible applications in engineering plants that can respond to stresses. Use of 100 native, tissue-specific inducible promoters to engineer plants is promising, but it is limited by the promoters that are already available in nature [43]. The methods we used here may help to identify individual and/or combinations of cis-regulatory sequences that can be used in synthetic promoters to drive tissue specific expression in the context of stress. 3.5 Methods 3.5.1 Expression data processing and expression data analysis A. thaliana abiotic stress expression data for the root and shoot [11] and biotic stress data for the shoot were downloaded from the AtGenExpress website (http://www.weigelworld.org/resources/microarray/AtGenExpress/). The data came preprocessed and normalized. We calculated log2 fold changes and associated p-values for each stress condition and its corresponding control at each time point and each organ using limma [44] in the R environment [45]. The p-values were adjusted [46] to control for the False Discovery Rate. Genes were considered up-regulated if their log2 fold-change values≥1 and their adjusted p-values≤0.05. Genes up-regulated after salt treatment for three hours in the root and in the shoot were referred as the root and shoot up-regulated genes respectively. Genes were considered non-responsive if they were not significantly differentially expressed (up or down-regulated) under any stress condition at any time point in the root or the shoot. Each organ had its own set of non-responsive genes (“root non-responsive” and “shoot non-responsive”). This stringent definition of non-responsive genes was chosen to because cis-regulatory sequence may be relevant to regulating responses not only to salt but also to other stress conditions. To assess the relationship of the degrees of differential expression in the root and shoot under different stress conditions, Pearson’s Correlation Coefficients (PCCs) of log2 fold change 101 values were calculated for all pairs of conditions/organ. A heatmap of the PCC values (Figure 3.1) was generated using the gplots package in R [47]. To identify the functional categories enriched in salt up-regulated genes (3hr) in the root, in the shoot, or in both root and shoot, each plant GO slim category (http://www.geneontology.org/ontology/subsets/goslim_plant.obo) was used to determine if it contained over/under-represented numbers of up-regulated genes in root, shoot, or both organs with Fisher’s Exact Test (FET) implemented in SciPy (http://www.scipy.org/). The p-values were adjusted for multiple testing using the Benjamini-Hochberg method [46]. 3.5.2 In vitro TF binding, DNase I hypersensitivity, and conserved non-coding datasets Two sets of in vitro binding data were used. The first set included Position Frequency Matrices (PFMs) obtained from the CIS-BP database website [22]. These PFMs are based on either protein binding microarray data or TRANSFAC motifs [22]. The PFMs were converted to Position Weight Matrices (PWMs) adjusted for the background AT (0.33) and CG (0.17) content of A. thaliana genome using the TAMO package [48]. This resulted in a final set of 355 PWMs (referred to as TFBMs). To map the TFBMs, first the 1kb sequences upstream of transcriptional start sites (putative promoters) of all genes in A. thaliana were downloaded from The Arabidopsis Information Resource (ftp://ftp.arabidopsis.org/). The TFBMs from CIS-BP were mapped to the putative promoter sequences using Motility (http://cartwheel.caltech.edu) with a threshold p<1e06. The second set included 344 DAP-Seq experiments testing in vitro bindings to naked genomic DNA in 598 TFs from the Plant Cistrome Database [25]. A DAP-seq peak (~200bp long) contained TF binding site and only peak with fraction of reads in peaks (FRiP) ≥ 5% was considered further. We identified TFBM sites and DAP-seq peaks that were over-represented in the promoters of the 102 root up-regulated and shoot up-regulated genes by performing FET against the root non-responsive and shoot non-responsive genes, respectively. DNase I hyper-Sensitivity (DHS) data [40] were obtained from GEO (GSE53322 and GSE53324) in form of peaks in bed format. The DHS datasets were derived from multiple developmental stages and tissues including 7-day-old dark-grown A. thaliana Col-0 seedlings, root, root hair cells, root non-hair cells, and seed coat. Each DHS dataset was treated as distinct features in this study for predicting salt up-regulation. A. thaliana-based coordinates of ~90,000 Conserved non-coding sequences (CNS) among Brassicaceae species were obtained (http://mustang.biol.mcgill.ca:8885, [35]) to assess whether CNS may be informative for assessing salt up-regulation. In addition, both DHS and CNS regions were used to filter pCRE sites to see if sites with different degrees of chromatin accessibility and conservation may contribute to salt upregulation prediction differently. 3.5.3 Salt up-regulation pCRE identification To identify pCREs associated with salt up-regulated genes in the root and shoot, we used a published pipeline with modifications [17]. The stress expression dataset in the form of a log2 fold-change expression matrix was used to identify co-expression clusters using iterated rounds of k-means clustering such that all clusters contained 60 genes or less, while clusters smaller than 10 genes were excluded. Clusters enriched in salt up-regulated genes in any time point in either roots or shoots were analyzed further for identifying 6-18bp motifs in the putative promoter regions of genes in each cluster. Six motif finding programs were used: AlignACE [49], MDScan [50], MEME [51], Motif Sampler [52], Weeder [53], and YMF [54]. In the initial motif finding step, ~300,000 motifs were identified, many of which were redundant. Two rounds of pCRE merging/enrichment testing were performed. In the first round, the ~300,000 motifs were merged 103 if their consensus sequences shared the same IUPAC codes and/or if they were highly similar to each other (in the same cluster) based on clusters defined using Kullback-Leibler (KL) distance [17]. In the enrichment step, these merged pCREs were mapped to the 1kb promoter regions of genes in A. thaliana using Motility (http://cartwheel.caltech.edu), and we kept mappings with a p < 1e-06. The pCREs were further analyzed if their mapped sites were significantly overrepresented (FET, adjusted p ≤0.05) in promoters of salt up-regulated genes. In the second round, we further merged enriched motifs based on PCC distance (1-PCC) of the motif PWMs. Using the PCC distance matrix, motifs were clustered hierarchically and distinct clusters were demarcated with a PCC distance threshold of 0.10, which was previously found to be the first percentile of PCC distances for non-redundant motifs in the JASPER CORE dataset [17]. Within each cluster, a single motif was chosen based on having the most significant degree of enrichment for genes up-regulated under salt stress in roots and/or shoots. The motifs identified from all clusters were collectively referred to as pCREs. To identify pCREs particularly relevant to root, shoot, or general salt up-regulation, a final round of FET was done to identify motifs were significantly over-represented (p<0.05) only in the root salt up-regulated genes (“root pCREs”), only in shoot salt up-regulated genes (“shoot pCREs”), and among genes up-regulated in both organs (“general pCREs”). In the end, 1,894 shoot, root, and general pCREs were identified. 3.5.4 Comparison of pCREs and TFBMs To assess the similarity between the pCREs identified here and the known TFBMs from CIS-BP [22], the PCCs between PWMs of all pCRE-TFBM, all pCRE, and all TFBMs combinations were calculated. For each pCRE, the pCRE-TFBM combination with highest PCC was analyzed further. To assess the statistical significance of the correlation between a pCRE- 104 TFBM pair, a within TF family PCC distribution was established using TFBMs from each TF family. This allowed us to test whether a pCRE was more similar to a TFBM of a particular family than those between TFBMs within that same family. The PCC distribution of each TFBM family was fitted with normal or beta distribution functions based on maximum likelihood using the MASS package [55] in R. Every PCC between a pCRE and a TFBM from a particular family was compared to the cumulative density function of the fitted within family distribution to get a pvalue. All p-values from the pairwise comparisons were adjusted for multiple testing within the same family [56]. To further assess which families of TFs pCREs might bind to, between family TFBM PCC distributions were generated and fitted as described above. We compared the PCC for each pCRETFBM pair to the between family distributions to generate a p-value, which were adjusted for multiple testing [56]. We set a q-value of 0.05 as the threshold to say that the pCRE may be bound by the same family as the TFBM. Because the TFBMs for some families were more divergent than the other (a wide range median PCCs for within family distribution; Supplemental Figure 1), the false negative rates (fail to assign a pCRE to certain families) varied. To assess if the pCREs were more similar to TFBMs than to random genomic sequences, 1,894 random PWMs with the same length distribution as pCREs were generated. For a random PWM of length k, 15 k-mers were randomly generated using the background distribution of AT-GC in A. thaliana, and consolidated into a PWM using the MotifTools.Motif_from_counts function in TAMO [48]. The random PWMs were then compared to TFBMs to establish the distribution of PCC to randomized PWMs. 105 3.5.5 Prediction of salt up-regulation using machine learning and feature selection Our goal was to model salt up-regulation of genes in each organ as a classification problem involving two classes: salt up-regulated genes in an organ and genes that are not responsive under any stress condition. The Support Vector Machine (SVM, [57]) and Random Forest (RF, [58]) algorithms were used for classification implemented in the Waikato Environment for Knowledge Analysis (Weka [59]). To get the importance scores of each feature, RF was also used from Scikitlearn package in Python [60]. Every model in this paper had two components: 1) a set of genes, each of which is classified as up-regulated or non-responsive (“expression class”) and 2) a set of cis-regulatory sites (CIS-BP TFBMs, DAP-seq peaks, or pCREs) and their presence/absence on the putative promoter of each gene (“promoter features”). We established machine learning models using five sets of pCRE sites including all mapped pCRE sites, as well as pCRE sites overlapped with CIS-BP TFBM sites, DAP-seq peaks, DHS regions, and CNS. In this setup, the models predict the genes from the two expression classes using the presence or absence of the promoter features. Grid-searches were used to find the best combination of the following three parameters in SVM: (1) the ratio of non-responsive to up-regulated genes, (2) the parameter of the soft margin, and (3) the gamma parameter of the Radial Basis Function (RBF) kernel. The latter two parameters are part of the SVM method itself. Similarly, grid-searches were used for RF predictions including (1) the ratio of non-responsive to up-regulated genes and (2) number of attributes. The ratio of negative to positive examples was achieved using the Weka class “weka.filters.supervised.instance.SpreadSubsample”, which subsamples the non-responsive genes to achieve the desired ratio of up-regulated to non-responsive genes. We used 10-fold cross validation as implemented in Weka, and the average AUC-ROC from all 10 cross validation runs 106 was calculated using the ROCR package [61]. RF model results were reported in this study as the performances of SVM and RF models were correlated and RF was easier to scale-up to largedatasets. The parameter combination with the maximum average AUC-ROC were taken as the best parameters for each model, and this maximum AUC-ROC is what we report for each model. Precision-recall curves were plotted using the output from the model with the maximum AUCROCs. To eliminate redundant motifs, we used three univariate feature selection methods: 1) PCCbased using Caret R Package 2) Correlation Feature Selection-CFS in Weka (correlation is based on minimum description length (MDL), symmetrical uncertainty, and relief (Hall,1999)) and 3) Chi-squared test in Weka on the pCRE sets. For the PCC-based method, we calculated the PCC between each pair-wise feature (pCRE sites). For the pairs of features that have greater than a PCC of 0.5, kept only one of them. This is an arbitrary threshold; however, removing 15 -20% pCREs, did not change the AUC-ROC values of prediction models. With CFS method, we kept the default settings in Weka. Chi-squared test in Weka yields ranks for each pCRE based on the chi-square statistic. We used the chi-square statistic of 10 and 20 as thresholds to keep higher ranked pCREs. 3.5.6 Binary prediction of root and shoot up-regulated genes While the AUC-ROC is a good measure of the overall performance of machine learning models, it does not indicate how well individual genes are predicted. Thus, it is possible that two models have similar levels of performance as measured by AUC-ROC resulting from the correct predictions of different sets of genes. To assess which genes were predicted by models based on different pCRE sets, and to see if different models correctly predict different sets of genes, the Weka program CrossValidationAddPredictions was used to identify whether a gene was correctly predicted as up-regulated or non-responsive during salt stress. This program makes a model as 107 described above, but it keeps track of the prediction of each gene. We used the best parameter combination identified from the original grid-search as the basis for the binary prediction run. We chose the parameter combinations with the maximum average AUC-ROC. For that given run, maximum F-measure (harmonic mean of precision and recall, calculated using ROCR) was used as the threshold to create binary predictions for each gene. We also assessed the overlap of correctly predicted up-regulated genes (True Positives, “TP”) based on models using different pCRE sets by looking at the percentage of the up-regulated genes correctly predicted by two different models. 3.5.7 Combinatorial motif rule discovery To test if the combinations of specific pCREs were predictive of salt up-regulation in the root or shoot, the Classification by Association (CBA) [63] method was used to identify combinatorial rules of the form “pCRE A + pCRE B → up-regulation” were selected from the CBA output. This method is useful for identifying rules where some combinations of features are associated with a class. The features in our case were the presence or absence of pCRE pairs on a gene promoter and the class was root or shoot up-regulation. The root or shoot up-regulated and non-responsive genes were broken up into different subsamples. Each of these subsamples was run through CBA using multiple values for minimum confidence (percentage of genes where “pCRE A + pCRE B → up-regulation” out of all the instances of “pCRE A + pCRE B”) and support (percentage of genes that with the rule “pCRE A + pCRE B → up-regulation”). Rules for shoot up-regulated genes were discovered using a minimum support 0.5% and a minimum confidence of 60.0%, with a non-responsive to up-regulated ratio of 2:1. We went through several rounds of CBA to discover root rules using different values of support, confidence and nonresponsive to up-regulated ratios. We ended up using a minimum support of 0.1%, a minimum 108 confidence of 60%, and subsamples with 976 non-responsive genes to 488 responsive genes, which were the same numbers of genes used to generate the shoot rules. These parameters were chosen because the rules generated gave an appreciable gain in the AUC-ROC when performing predictions. Due to the limitation of using a GUI version of CBA, we were not able to do an extensive exploration of the best CBA parameter values. Thus, it is possible that there is a more optimal parameter set that will yield a greater performance gain. The distance between pairs of pCREs in a rule was calculated for all instances of the rules in the putative promoters. The minimal distance between the closest ends of two pCREs were determined. To determine if the minimal distances were significantly different than randomly expected, background distributions of pCREs was generated by modeling the frequency of distances between two random pCREs of the same lengths as the pCREs in the rule pair based on an earlier approach [42]. The only difference in our method was that we compared our observed distance distributions to the background distribution using a Mann-Whitney test instead of a Kolmogorov-Smirnov test, as the Mann-Whitney test can more directly test whether one distribution has higher or lower distances than the other distribution. 3.6 Acknowledgements We thank members of the Shiu laboratory, especially Bethany Moore, Ming-Jung Liu and Melissa Lehti-Shiu, for valuable discussions and John Lloyd for help with Random Forest. 109 APPENDIX 110 Figure S.3.1 Distributions of PCCs between CIS-BP TFBMs with and between example TF families. Distributions of PCCs between TFBMs within a TF family (red) and between TFBMs of a particular family to all other TFBMs across families (blue). 111 Figure S.3.2 Contribution of general, root, and shoot pCREs to the predictions of true positive genes that were globally, root-specifically, and shoot-specifically up-regulated. (A) Bar plot of % true positive root-specific or global salt up-regulated genes predicted by the root pCRE only model (blue), the general pCRE only model (orange), and both models (black). Grey: false negatives. (B) Bar plot of % true positive shoot-specific or global salt up-regulated genes predicted by the shoot pCRE only model (red), the general pCRE only model (orange), and both models (black). Grey: false negatives. 112 Figure S.3.3 Feature selection on pCREs and performance of RF models using selected pCREs. (A) Precision-recall curves of models predicting root up-regulated genes (B) Precisionrecall curves of models predicting shoot up-regulated genes. Bar plots on top right of precisionrecall curves indicate the AUC-ROC values for predicting root (A) and shoot (B) salt up-regulated genes. The features used are: “root+general” or “shoot+general” (red), “PCC<0.5”: pCREs that have PCCs higher than 0.5 (blue), “chi10”: pCREs that have chi-squared statistic higher than 10 (green). “chi20”: pCREs that have chi-squared statistic higher than 20 (orange). “CFS”: pCREs selected from Correlation Feature Selection (purple). (C) Bar plot of number of pCREs that each set of feature selection method yields for root+general pCRE set (D) Bar plot of number of pCREs that each set of feature selection method yields for shoot+general pCRE set. (E) Dot plot of chisquare statistic for root+general pCRE set. Red dashed lines: Thresholds used to select pCREs. (F) Dotplot of chi-square statistic for shoot+general pCRE set. 113 Figure S.3.4 Summary of the distance between pairs of motifs in combinatorial rules. (A) The distance between pairs of motifs in the same rule for all instances of the rule on the promoters of all genes in the A. thaliana genome for the root rules. Squares represent the median distance for a rule, and the edges of the ribbon represent the 25th and 75th percentile of distances. The color of each square represents the significance (Mann-Whitney, adjusted p < 0.05) of the distance distribution compared the random background distribution. White: Not significant, red: significantly closer than random, blue: significantly further than random. Rows are sorted from lowest median distance at the bottom to highest median distance at the top. (B) Same as (A) but for shoot rules. 114 REFERENCES 115 REFERENCES 1. Bostock RM, Pye MF, Roubtsova T V. Predisposition in Plant Disease: Exploiting the Nexus in Abiotic and Biotic Stress Perception and Response. Annu Rev Phytopathol. Annual Reviews ; 2014;52: 517–549. doi:10.1146/annurev-phyto-081211-172902 2. Rasheed S, Bashir K, Matsui A, Tanaka M, Seki M. Transcriptomic Analysis of Soil-Grown Arabidopsis thaliana Roots and Shoots in Response to a Drought Stress. Front Plant Sci. 2016;7. doi:10.3389/fpls.2016.00180 3. Cramer GR, Urano K, Delrot S, Pezzotti M, Shinozaki K. Effects of abiotic stress on plants: a systems biology perspective. BMC Plant Biol. BioMed Central; 2011;11: 163. doi:10.1186/1471-2229-11-163 4. Gargallo-Garriga A, Sardans J, Pérez-Trujillo M, Rivas-Ubach A, Oravec M, Vecerova K, et al. Opposite metabolic responses of shoots and roots to drought. Sci Rep. Nature Publishing Group; 2014;4: 6829. doi:10.1038/srep06829 5. Pierik R, Testerink C. The art of being flexible: how to escape from shade, salt, and drought. Plant Physiol. American Society of Plant Biologists; 2014;166: 5–22. doi:10.1104/pp.114.239160 6. Munns R, Tester M. Mechanisms of Salinity Tolerance. Annu Rev Plant Biol. 2008;59: 651–681. doi:10.1146/annurev.arplant.59.032607.092911 7. Munns R. Comparative physiology of salt and water stress. Plant Cell Environ. 2002;25: 239–250. doi:10.1046/j.0016-8025.2001.00808.x 8. Kreps JA, Wu Y, Chang H-S, Zhu T, Wang X, Harper JF. Transcriptome Changes for Arabidopsis in Response to Salt, Osmotic, and Cold Stress. Plant Physiol. 2002;130: 2129– 2141. doi:10.1104/pp.008532 9. Dinneny JR, Long TA, Wang JY, Jung JW, Mace D, Pointer S, et al. Cell Identity Mediates the Response of Arabidopsis Roots to Abiotic Stress. Science (80- ). 2008;320: 942–945. doi:10.1126/science.1153795 10. Geng Y, Wu R, Wee CW, Xie F, Wei X, Chan PMY, et al. A Spatio-Temporal Understanding of Growth Regulation during the Salt Stress Response in Arabidopsis. Plant Cell. 2013;25: 2132–2154. doi:10.1105/tpc.113.112896 11. Kilian J, Whitehead D, Horak J, Wanke D, Weinl S, Batistic O, et al. The AtGenExpress global stress expression data set: protocols, evaluation and model data analysis of UV-B light, drought and cold stress responses. Plant J. 2007;50: 347–63. doi:10.1111/j.1365313X.2007.03052.x 116 12. Qin F, Shinozaki K, Yamaguchi-Shinozaki K. Achievements and Challenges in Understanding Plant Abiotic Stress Responses and Tolerance. Plant Cell Physiol. 2011;52: 1569–1582. doi:10.1093/pcp/pcr106 13. Seki M, Narusaka M, Ishida J, Nanjo T, Fujita M, Oono Y, et al. Monitoring the expression profiles of 7000 Arabidopsis genes under drought, cold and high-salinity stresses using a full-length cDNA microarray. Plant J. Blackwell Science Ltd; 2002;31: 279–292. doi:10.1046/j.1365-313X.2002.01359.x 14. Haberer G, Wang Y, Mayer KFX. The Non-coding Landscape of the Genome of Arabidopsis thaliana. Genetics and Genomics of the Brassicaceae. New York, NY: Springer New York; 2011. pp. 67–121. doi:10.1007/978-1-4419-7118-0_3 15. Beer MA, Tavazoie S. Predicting Gene Expression from Sequence. Cell. 2004;117: 185– 198. doi:10.1016/S0092-8674(04)00304-6 16. Wang X, Haberer G, Mayer KF. Discovery of cis-elements between sorghum and rice using co-expression and evolutionary conservation. BMC Genomics. 2009;10: 284. doi:10.1186/1471-2164-10-284 17. Zou C, Sun K, Mackaluso JD, Seddon AE, Jin R, Thomashow MF, et al. Cis-regulatory code of stress-responsive transcription in Arabidopsis thaliana. Proc Natl Acad Sci U S A. 2011;108: 14992–7. doi:10.1073/pnas.1103202108 18. Priest HD, Filichkin SA, Mockler TC. Cis-regulatory elements in plant cell signaling. Curr Opin Plant Biol. 2009;12: 643–649. doi:10.1016/j.pbi.2009.07.016 19. Austin RS, Hiu S, Waese J, Ierullo M, Pasha A, Wang TT, et al. New BAR tools for mining expression data and exploring Cis -elements in Arabidopsis thaliana. Plant J. 2016; doi:10.1111/tpj.13261 20. Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, et al. Transcriptional regulatory code of a eukaryotic genome. Nature. 2004;431: 99–104. doi:10.1038/nature02800 21. Franco-Zorrilla JM, López-Vidriero I, Carrasco JL, Godoy M, Vera P, Solano R. DNAbinding specificities of plant transcription factors and their potential to define target genes. Proc Natl Acad Sci U S A. 2014;111: 2367–72. doi:10.1073/pnas.1316278111 22. Weirauch MT, Yang A, Albu M, Cote AG, Montenegro-Montero A, Drewe P, et al. Determination and Inference of Eukaryotic Transcription Factor Sequence Specificity. Cell. 2014;158: 1431–1443. doi:10.1016/j.cell.2014.08.009 23. O’Malley RC, Huang SC, Song L, Lewsey MG, Bartlett A, Nery JR, et al. Cistrome and Epicistrome Features Shape the Regulatory DNA Landscape. Cell. 2016;165: 1280–1292. doi:10.1016/j.cell.2016.04.038 117 24. Jiao Y, Lori Tausta S, Gandotra N, Sun N, Liu T, Clay NK, et al. A transcriptome atlas of rice cell types uncovers cellular, functional and developmental hierarchies. Nat Genet. 2009;41: 258–263. doi:10.1038/ng.282 25. Yáñez-Cuna JO, Kvon EZ, Stark A. Deciphering the transcriptional cis-regulatory code. Trends Genet. 2013;29: 11–22. doi:10.1016/j.tig.2012.09.007 26. Hu Z, Gallo SM. Identification of interacting transcription factors regulating tissue gene expression in human. BMC Genomics. 2010;11: 49. doi:10.1186/1471-2164-11-49 27. Zhong S, He X, Bar-Joseph Z, Harbison C, Gordon D, Lee T, et al. Predicting tissue specific transcription factor binding sites. BMC Genomics. BioMed Central; 2013;14: 796. doi:10.1186/1471-2164-14-796 28. Tsai ZT-Y, Shiu S-H, Tsai H-K. Contribution of Sequence Motif, Chromatin State, and DNA Structure Features to Predictive Models of Transcription Factor Binding in Yeast. PLoS Comput Biol. 2015;11: e1004418. doi:10.1371/journal.pcbi.1004418 29. McLeay RC, Leat CJ, Bailey TL. Tissue-specific prediction of directly regulated genes. Bioinformatics. 2011;27: 2354–60. doi:10.1093/bioinformatics/btr399 30. Zhu J-K. Salt and Drought Stress Signal Transduction in Plants. Annu Rev Plant Biol. 2002;53: 247–273. doi:10.1146/annurev.arplant.53.091401.143329 31. Golldack D, Lüking I, Yang O. Plant tolerance to drought and salinity: stress regulating transcription factors and their functional significance in the cellular transcriptional network. Plant Cell Rep. 2011;30: 1383–1391. doi:10.1007/s00299-011-1068-0 32. Mizoi J, Shinozaki K, Yamaguchi-Shinozaki K. AP2/ERF family transcription factors in plant abiotic stress responses. Biochim Biophys Acta - Gene Regul Mech. 2012;1819: 86– 96. doi:10.1016/j.bbagrm.2011.08.004 33. Nakashima K, Takasaki H, Mizoi J, Shinozaki K, Yamaguchi-Shinozaki K. NAC transcription factors in plant abiotic stress responses. Biochim Biophys Acta - Gene Regul Mech. 2012;1819: 97–103. doi:10.1016/j.bbagrm.2011.10.005 34. Matiolli CC, Tomaz JP, Duarte GT, Prado FM, Bem LEV Del, Silveira AB, et al. The Arabidopsis bZIP Gene AtbZIP63 Is a Sensitive Integrator of Transient Abscisic Acid and Glucose Signals. Plant Physiol. 2011;157: 692–705. doi:10.1104/pp.111.181743 35. Haudry A, Platts AE, Vello E, Hoen DR, Leclercq M, Williamson RJ, et al. An atlas of over 90,000 conserved noncoding sequences provides insight into crucifer regulatory regions. Nat Genet. Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.; 2013;45: 891–8. doi:10.1038/ng.2684 36. Chaves MM, Flexas J, Pinheiro C. Photosynthesis under drought and salt stress: regulation 118 mechanisms from whole doi:10.1093/aob/mcn125 plant to cell. Ann Bot. 2009;103: 551–560. 37. Ji H, Pardo JM, Batelli G, Van Oosten MJ, Bressan RA, Li X. The Salt Overly Sensitive (SOS) Pathway: Established and Emerging Roles. Mol Plant. 2013;6: 275–286. doi:10.1093/mp/sst017 38. Barah P, B N MN, Jayavelu ND, Sowdhamini R, Shameer K, Bones AM. Transcriptional regulatory networks in Arabidopsis thaliana during single and combined stresses. Nucleic Acids Res. Oxford University Press; 2016;44: 3147–64. doi:10.1093/nar/gkv1463 39. Kang J, Choi H, Im M, Kim SY. Arabidopsis Basic Leucine Zipper Proteins That Mediate Stress-Responsive Abscisic Acid Signaling. Plant Cell Online. 2002;14: 343–357. doi:10.1105/tpc.010362 40. Sullivan AM, Arsovski AA, Lempe J, Bubb KL, Weirauch MT, Sabo PJ, et al. Mapping and Dynamics of Regulatory DNA and Transcription Factor Networks in A. thaliana. Cell Rep. 2014;8: 2015–2030. doi:10.1016/j.celrep.2014.08.019 41. Liu M-J, Seddon AE, Tsai ZT-Y, Major IT, Floer M, Howe GA, et al. Determinants of nucleosome positioning and their influence on plant gene expression. Genome Res. 2015; gr.188680.114-. doi:10.1101/gr.188680.114 42. Yu X, Lin J, Zack DJ, Qian J. Computational analysis of tissue-specific combinatorial gene regulation: predicting interaction between transcription factors in human tissues. Nucleic Acids Res. 2006;34: 4925–4936. doi:10.1093/nar/gkl595 43. Potenza C, Aleman L, Sengupta-Gopalan C. Targeting transgene expression in research, agricultural, and environmental applications: Promoters used in plant transformation. Vitr Cell Dev Biol - Plant. 2004;40: 1–22. doi:10.1079/IVP2003477 44. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015; doi:10.1093/nar/gkv007 45. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; 2012. 46. Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J R Stat Soc Ser B. 1995;57: 289–300. doi:10.2307/2346101 47. Warnes GR, Bolker B, Bonebakker L, Gentleman R, Liaw WHA, Lumley T, et al. gplots: Various R Programming Tools for Plotting Data. 2015. 48. Gordon DB, Nekludova L, McCallum S, Fraenkel E. TAMO: a flexible, object-oriented framework for analyzing transcriptional regulation using DNA-sequence motifs. 119 Bioinformatics. 2005;21: 3164–5. doi:10.1093/bioinformatics/bti481 49. Roth FP, Hughes JD, Estep PW, Church GM. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol. 1998;16: 939–945. doi:10.1038/nbt1098-939 50. Liu XS, Brutlag DL, Liu JS. An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat Biotechnol. 2002;20: 835–839. doi:10.1038/nbt717 51. Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol. 1994;2: 28–36. 52. Thijs G, Lescot M, Marchal K, Rombauts S, De Moor B, Rouzé P, et al. A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics. 2001;17: 1113–1122. 53. Pavesi G, Mereghetti P, Zambelli F, Stefani M, Mauri G, Pesole G. MoD Tools: regulatory motif discovery in nucleotide sequences from co-regulated or homologous genes. Nucleic Acids Res. 2006;34: W566–W570. doi:10.1093/nar/gkl285 54. Sinha S, Tompa M. A statistical method for finding transcription factor binding sites. Proc Int Conf Intell Syst Mol Biol. 2000;8: 344–354. 55. Venables WN & Ripley BD. Modern Applied Statistics with S. Fourth Edition. Springer, New York.2002. ISBN 0-387-95457-0 56. Storey JD. The positive false discovery rate: a bayesian interpretation and the q-value 1. Ann Stat. 2003;31: 2013–2035. 57. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20: 273–297. doi:10.1007/BF00994018 58. Breiman L. Random Forests. Mach Learn. Kluwer Academic Publishers; 2001;45: 5–32. doi:10.1023/A:1010933404324 59. Frank E, Hall M, Trigg L, Holmes G, Witten IH. Data mining in bioinformatics using Weka. Bioinformatics. 2004;20: 2479–81. doi:10.1093/bioinformatics/bth261 60. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12: 2825–2830. 61. Sing T, Sander O, Beerenwinkel N, Lengauer T. ROCR: visualizing classifier performance in R. Bioinformatics. 2005;21: 3940–3941. doi:10.1093/bioinformatics/bti623 62. Hall MA. Correlation-based feature selection for machine learning. PhD Thesis. University 120 of Waikato.1999. 63. Ma BLWHY. Integrating classification and association rule mining. Proceedings of the 4th. 1998. 121 CHAPTER 4 PREDICTIVE MODELS OF CELL TYPE HIGH SALINITY RESPONSIVE GENE EXPRESSION 4.1 Abstract How multicellular organisms respond to their environment depends on the responses of individual cell types to the surrounding fluctuations. Transcriptional reprogramming based on the environmental changes is the key to these responses. Therefore, how transcriptional reprogramming is controlled in each distinct cell type is important to decipher. However, the mechanism of cell type specific gene expression regulation, particularly towards environmental changes, is mostly unknown in plants. Here, we use Arabidopsis thaliana root cell type data as examples to understand the mechanism of stress responsive gene expression regulation via cisregulatory elements (CREs). We used a computational approach to identify 3,095 putative CREs (pCREs) and built predictive models of high salinity responsive gene expression in six root cell types (columella, cortex, endodermis-quiescent center, epidermis, proto-phloem and stele). We found that pCREs we identified can be used to predict high salinity responsive gene expression better than random predictions. Overall, our predictive models are better at identifying nonresponsive genes rather than identifying high-salinity responsive genes. Also, whole organ associated CREs yield similar prediction performance to the cell type pCREs. Therefore, the cell type pCREs identified need to be further explored to better understand the mechanism of cell type environmental responsive gene expression. 122 4.2 Introduction Since the invention of the microscope, scientists have aimed to study different types of cells and the characteristics that make them different [1]. Distinguishing cell types requires studying unique functions that separate them from others. One of the crucial components of having various cell types with different functions is a result of how genes are regulated differentially and precisely controlled in response to changing environmental conditions. Based on this differential regulation, each cell type can be identified by its specific gene expression profile. However, the gene expressions, that is steady-state mRNA levels, might not reflect individual changes per cell type if the whole organ is analyzed [2]. Thus, isolating cell types is required to study cell type specific gene expression profiles. In order to isolate distinct cell types, techniques such as fluorescent activated cell sorting (FACS) and laser capture microdissection (LCM) are being used [3–5]. These techniques have been used to study Arabidopsis thaliana root cell types. A.thaliana root has been an ideal system to study cell types as the roots have a radial organization with layers of cell types (epidermis, cortex, endodermis and stele [6]) and undergo continuous development from the stem cells, meaning cells divide, expand and specialize [6]. To separate the root cell types, green fluorescent protein (GFP) lines that are specific to cell types have been developed and extensively used [7]. The ability to isolate the root cell types led the studies to investigate how individual cell types vary in gene expression and respond to environmental fluctuations. Among the environmental fluctuations, high salt concentrations in soil impact plants adversely and result in reduction of the yield in crops [8]. How each cell type is contributing to the overall root and whole plant response to salt stress can be learned by understanding how cell types regulate salt stress responsive gene expression. Gene regulation involves multiple players, including transcription factors (TFs), cofactors, and chromatin remodeling complexes [9]. Among 123 these players, sets of regulatory sequences that are accessible to TF binding could be used to study the differential gene expression across cell types [10]. TF binding sites can be determined in various ways such as using chromatin immunoprecipitation (ChIP) methods, array or sequencing based [11]. However, ChIP methods only cover a single TF binding and it is not feasible to cover all TFs in an organism in a single experiment. To determine TF binding sites for as many TFs as possible, in vitro methods such as Protein Binding Arrays (PBMs) and DNA Affinity Purification (DAP) methods have been developed [12,13]. Apart from in vivo and in vitro methods in identifying TF binding sites, computational approaches have also been successfully used [14] in hypothesizing putative cis-regulatory elements. These computational approaches include coexpression, phylogenetics and combination of these two [15]. For example, Haberer et al used PhyloCon (Phylogenetic Consensus, [16]) in combination with co-expression across 81 microarray studies in identifying candidate TF binding sites [15]. Also, Zou et al identified stress related putative cis-regulatory elements (pCREs) using co-expression across 16 stress conditions and time points, and six motif finders [17]. Given the large number of potential TF binding sites and/or pCREs that could be identified by the computational approaches (over 60,000 sites in Haberer et al, 1,215 pCREs in Zou et al), the false positive rate is likely high and additional tests are required to be confident of a handful of pCREs through experimental validation. In this regard, evaluation of pCREs with machine learning approaches is useful [17]. The machine learning approaches can take individual (and/or combinations of) pCREs as the predictors and build models for the outcome, such as cell type salt responsive gene expression. With these prediction models, it is possible to identify important pCREs that could be involved in TF binding and drive gene expression [17,18]. 124 Regulatory mechanisms that are responsible for cell type specific responses to external factors still remain to be deciphered [19]. In this study, we aimed to investigate the cis-regulatory code (CRC) of salt responsive gene expression in columella (COL), cortex (COR), stele (STE), proto-phloem (PHL), epidermis (EPI) and endodermis (END) in the roots and expand the CRC of the organ salt up-regulation. Firstly, we asked whether previously identified TF binding information could predict root cell type salt up-regulation. Next, we explored to what extent the salt responsive gene expression is similar between whole root and the individual root cell types and used previously identified organ pCREs to predict root cell type salt up-regulation. We also identified pCREs that might be involved in salt up-regulation in each cell type and found there are common pCREs among cell types as well as cell type specific ones based on over-representation in the promoters of salt up-regulation genes. We built prediction models utilizing these pCREs and found that depending on the cell type, different sets of pCREs are needed to be considered for the best performing prediction models. 4.3 Results and discussion 4.3.1 Known TF binding data in predicting cell type salt responsive gene expression To investigate to what the extent the current knowledge of TF binding data can explain salt up-regulation in the root cell types COL, COR, END, EPI, PHL, and STE [20], we built machine learning models. We used A. thaliana TF binding data from two large-scale in vitro studies, CISBP [12] and DAP-seq [13] that cover binding data of 758 TFs (~38% of the known A. thaliana TFs). Given the extensive coverage of the in-vitro TF binding data, we expected root cell type salt up-regulation might be explained with the known TF binding data and formed prediction models 125 using machine learning. Using known TF binding data as predictors in Random Forest (RF), root cell type salt up-regulation predictions were better than random, and CIS-BP data performed slightly better than DAP-seq data in all six cell type predictions (AUC-ROC=0.63-0.71 and AUCROC=0.58-0.68 for CIS-BP and DAP-seq respectively; Figure 4.1A). Among these predictions, END and STE had the highest performances and they have the least and the highest number of salt up-regulated genes respectively (precision-recall curves for END and STE predictions are given in Figure S.4.1A and B). These results suggest that the in-vitro TF binding data are useful in understanding root cell type up-regulation, however there is still room for improvement since the prediction performances were still lower than perfect classification (AUC-ROC ~1). Given that the CIS-BP and DAP-seq studies were not conducted on the condition that we study (high salinity), it is expected that using TF binding information that is obtained/predicted under the relevant conditions might be more informative in cell type gene expression predictive models. One such data could be the pCREs that we identified from root stress responsive gene expression. 126 Figure 4.1 Performance of root cell type salt up-regulation prediction models using TF binding data and organ pCREs. (A) Bar plot of AUC-ROC values of prediction models using CIS-BP (yellow) and DAP-seq (orange) data. (B) Bar plot of AUC-ROC values of prediction models using organ pCREs: root (pink), general (green), union of root and general (blue), and all organ (root+general+shoot; purple) pCREs. 127 4.3.2 Organ pCREs in predicting cell type salt responsive gene expression Next, to understand whether the pCREs identified from the whole organ (root and shoot) for salt stress may provide more information than in-vitro TF binding data in cell type salt upregulation predictions, we used organ pCREs as the predictors in the machine learning models (Figure 4.1B). Our hypothesis was that the organ pCREs, particularly root pCREs, might explain cell type specific salt up-regulation because the genome-wide expression patterns of the salttreated root cell types as well as the salt and osmotic stress-treated whole root are positively correlated (PCCs ≥ 95th percentile PCC from all pair-wise sample comparisons; Figure 4.2). However, we found that the root pCRE-based models did not outperform known TF binding data in predicting salt up-regulated genes in each cell type (AUC-ROC=0.64-0.67; Figure 4.1B). This suggests that the root pCREs are not predicting salt up-regulation to a finer resolution. Possibly with whole root expression dataset, signals from individual cell types are lost, therefore the organ pCREs do not represent the regulatory information required for the cell type salt stress responsive gene expression. Next, we used pCREs that were over-represented in both root and shoot salt upregulated genes (referred to as general pCREs), the union of root and general pCREs, and all organ pCREs in predictions. We found that general pCREs were the best performing ones among other pCRE sets in predicting root cell type salt up-regulation (AUC-ROC=0.68-0.78; Figure 4.1B, Examples of the precision-recall curves are given in Figure S.4.1C and D for END and STE predictions). This suggests that the pCREs that are responsible for a common salt response in the roots and shoots are also able to predict cell type salt up-regulation well. It is possible that the general pCREs are representatives of common stress responsive elements regardless of the tissue. Since general pCREs likely predict common cell type salt response, the question remains what the cell type specific components are in salt response and whether the cell type specific pCREs would 128 further improve salt up-regulation models providing a finer resolution for mechanism of salt stress responsive gene expression. A possible way to improve salt up-regulation prediction models is to focus on the gene expression at a fine spatial resolution, namely, the gene expression of genes in individual cell types to identify cell type specific pCREs. In human studies, cell type specific CREs were identified using the gene expression data across cell types and these CREs were used to predict cell type specific gene expression [21,22]. We hypothesized that using A.thaliana root cell type stress expression data and identifying co-expressed gene clusters, we could find cell type pCREs that can explain cell type salt up-regulation better than the known TF binding data and whole organ pCREs. 129 Figure 4.2 Gene expression correlation across stress datasets of root, shoot and root cell types. Heatmap of gene expression correlations using pair-wise sample comparisons. Colors represent PCC values that fall to ≥ 95th percentile of all pair-wise sample PCCs (0.42). Boxes outlined with black are the clusters of biologically relevant associations (e.g. root salt treatment samples clustering with root osmotic treatment samples). Boxes outlined with dashed black line are the correlations of 5 root cell type salt treatment samples (COL, END, EPI, PHL, STE) with root salt treatment samples, mixture of root abiotic stress samples and COR salt treatment sample respectively from top to bottom. 130 4.3.3 Identifying root cell type pCREs associated with salt stress To identify root cell type pCREs that might be involved in salt stress, we used a motif finding pipeline that was developed in a previous study [17]. We incorporated root cell type differential expression in COL, COR, END, EPI, PHL, and STE [20], with the root abiotic stress data [23] to identify co-expression clusters over-represented with salt stress up-regulated genes in different root cell types. After selecting the these clusters, we identified pCREs from the promoters of the co-expressed genes. According to the enrichment of pCRE sites in the promoters of the genes in these clusters, we classified pCREs into subsets. 583 pCREs were classified as root cell type general pCREs, 734-2828 pCREs were considered specific to a particular cell type or found to be over-represented in multiple cell type salt up-regulated genes, and 6-360 pCREs were considered specific to only one cell type (Figure 4.3A). Because we hypothesized that we would find new cell type pCREs, we expected to have motifs that were distinct from organ pCREs. Additionally, we asked to what extent each pCRE subset has similar pCREs and expected that within the same set, pCREs might be more similar than those across different sets. To address these questions, we calculated the average PCC among the pCREs within each pCRE subset. We found that within a pCRE set, we did not have the most similar pCREs. For example, cell type general pCREs had the highest average correlation (PCC=0.78), but EPI and STE specific pCREs had average PCCs ≤ 0.4. We also found that the organ pCRE set and cell type pCRE set did not have high similarity across subsets (PCC=0.37-0.47) supporting the notion that the pCREs we identified from genes up-regulated in different cell type were distinct from the set of pCREs identified using data from whole root and shoot (Figure 4.3B). 131 Figure 4.3 Cell type pCREs: Classification and similarity among pCRE sets. (A) Top panel: Bar plot of number of cell type specific pCREs. Bottom panel: Heatmap of over-represented pCREs. Each row is a pCRE and red color is for over-representation of that pCRE in the cell type salt up-regulated genes. (B) Heatmap of similarity among pCRE sets. Similarity is calculated as PCC between Position Weight Matrices (PWMs). COL, COR, END, EPI, PHL and STE pCREs are cell type specific pCREs. C.U. is cell type union, referring to pCREs that belong to more than one cell type pCRE set. C. GEN is the cell type general pCREs that are over-represented in all cell type salt up-regulated genes. 132 4.3.4 Cell type pCREs in predicting salt up-regulation We have shown that cell type pCREs were distinct from the organ pCRE set, suggesting that the novel motifs may be important in driving salt up-regulation among root cell types in addition to the general organ pCREs. We used the cell type pCREs in predicting salt up-regulation of each cell type. We found that the general cell type pCRE were better at predicting END salt upregulation than END pCREs (AUC-ROC=0.75 vs 0.55; Figure 4.4A; Figure S.4.2A). This trend was also observed with COL, COR and EPI salt up-regulation predictions; the general cell type pCREs performed better than the respective cell type pCREs (Figure 4.4A). On the other hand, STE salt up-regulation seemed to be driven by STE pCREs as the prediction performance was better when STE pCREs were use compared to using general cell type pCREs (AUCROC=0.69 vs 0.63 respectively; Figure 4.4A; Figure S.4.2B). This was also the case for PHL salt up-regulation predictions. For all the predictions, the union of cell type and general pCREs gave the best performances, reflecting that both general stress response pCREs and cell type specific pCREs might be responsible for the cell type responses. Also, depending on the cell type the major driving force in salt up-regulation could be either general or cell type pCREs. Overall, the performance of cell type pCREs was similar to the organ pCREs in predicting up-regulation in various cell types (Figure 4.1B; Figure 4.4A) and combining these pCRE sets (organ pCREs+cell type pCREs) did not further improve prediction model performances (Figure 4.4B). Even though we have seen similar performances of the salt up-regulation prediction using the organ pCRE set and cell type pCRE set, potentially these sets could predict different sets of genes to be salt up-regulated. To test this, we compared the sets of genes predicted by models based on the organ pCRE set and cell type pCREs, focusing on the STE up-regulated genes as an example. We found that ~50% of the true positives predictions were the same from the two models (Figure S.4.3A). Furthermore, 10% 133 of the salt responsive genes were correctly predicted by the organ pCREs only and 14% were predicted correctly by the cell type pCREs only. This result implies that we could predict an additional set of the salt responsive genes using cell type pCREs. However, it is interesting to note that organ pCREs also predict a subset of STE up-regulated genes that the cell type pCREs could not classify correctly. For these sets of genes that were predicted by only organ pCREs, by only cell type pCREs and by both, we asked whether the levels of salt up-regulation were different (e.g. cell type pCREs predicting genes that have a higher salt responsive gene expression-higher fold change- in STE). Our expectation was that there might be differences in the salt up-regulation levels between the genes predicted only by certain pCRE sets. However, the gene sets predicted (10%, 14%, 50% of STE) did not differ in fold change in salt responsive gene expression (Figure S.4.3B). This suggests that there may be another layer in which these genes differ in so that different sets of pCREs are informative in predicting them. Overall, we could predict salt upregulation to a finer resolution in A. thaliana, yet it remains to be investigated whether the cell type pCRE set could be improved in explaining the specific cell type response. 134 Figure 4.4 Performance of cell type salt up-regulation prediction models using cell type pCREs. (A) Bar plot of AUC-ROC values of prediction models using cell type (pink), general (green), union of cell type and general (blue), and all cell type (purple) pCREs. *: Each cell type has a different set of pCREs over-represented in the salt up-regulated gene sets (B) Bar plot of AUC-ROC values of prediction models using the union of organ and cell type pCREs. 135 4.4 Conclusions With the abundance of transcriptomics studies, it is possible to study the gene regulatory networks of different tissues and cell types under changing environmental conditions. In this study, we identified pCREs that might be involved in the cell type specific salt stress responsive gene expression, particularly salt up-regulation. We found evidence that CRC regulating genes across the whole organs might be partially responsible for regulating the root cell type expression based on the prediction performance of machine learning models with organ general pCREs as predictors (Figure 4.1B). However, we could not rule out the possibility that cell type specific pCREs were important in predicting cell type salt up-regulation. Particularly, for STE, we identified pCREs that were able to predict STE up-regulated genes well and 14% of the STE up-regulated genes were only predicted with cell type pCREs (Figure S.4.3A). This shows that even though organ and cell type pCREs lead to similar performances in predicting salt up-regulation across cell types, each pCRE set can explain a different portion of genes that are up-regulated. Overall, we identified pCREs and built computational models that can explain cell type salt up-regulation. To further improve the results of this study and overcome the limitations, the following efforts can be made. Firstly, the cell type gene expression data used here only consist of one time point (salt treatment for 1h). It was the only dataset available at the time of data processing stage of this project. Since then, more studies were conducted related to how root cell types respond to environmental fluctuations and salt stress [24]. There is a dataset consisting of 6 time points (1h, 3h, 8h, 20h, 32h, 48h) of salt treatment across 4 root cell types (COL, COR, EPI and STE) [24]. We analyzed this dataset with the expectation that including more time points would further improve the co-expression clustering and subsequent pCRE identification. However, we found that 1h salt treatments of the same cell types from two different datasets [20,24] did not yield 136 similar expression (based on expression sample clustering similar to Figure 4.2). This is a common issue known as the batch effect [25] in gene expression studies and needs further analyses. For future work, it would be necessary to evaluate the time scale data more extensively to remove the batch effects and incorporate this dataset into the pCRE identification. Apart from the expression data used, the approach in analyzing the potential regulatory motifs could also be improved. We can further expand on root cell type salt up-regulation by identifying k-mers. K-mers are consensus sequences that are of length k and over-represented in a given set of sequences (e.g promoters of co-expressed genes). Studies [9,26] found that k-mers are informative in predicting gene expression. In this study, we used six motif finders in finding PWMs, however using k-mers could be informative in predicting root cell type gene expression and in our initial analyses we found promising results. Combinatorial rules are another aspect that could be constructed for improving cell type salt up-regulation predictions. As organization of CREs is the key in the transcription factor complexes to form and drive gene expression [27], considering pCRE pairs might be more important in regulating cell type salt up-regulation compared to individual pCREs, as we have seen in organ salt responsive predictive models (Chapter 3). Apart from the points discussed in improving cell type salt up-regulation predictions, other questions to expand on the CRC involved in regulating cell type processes are: (1) What are the pCREs involved in the root cell identity and how different are pCREs involved in cell identity vs. cell response? (2) What are the pCREs involved in spatial salt down-regulation and what are the differences between CRC of up-regulation and. down-regulation? Through answering these questions, it might be possible to get a more detailed genome-wide view of the spatial and conditional CRC in plants. 137 4.5 Methods 4.5.1 Gene expression dataset Root cell type expression data was downloaded as CEL files from GEO (GSE7641, [20]). This expression dataset consists of control and salt stress conditions (150mM NaCl treatment for 1h) applied to columella (COL), cortex (COR), endodermis+quiescent center (END), epidermis (EPI), proto-phloem (PHL) and stele (STE). The CEL files were pre-processed and quantile normalized using the Bioconductor affy package in R environment [28]. Log2 fold changes and the p-values were calculated for salt stress and the corresponding control for each cell type using the limma package [29]. p-values were corrected for multiple testing using false discovery rate [30]. In addition to root cell type expression data, root abiotic stress data from AtGenExpress (http://www.weigelworld.org/resources/microarray/AtGenExpress/) were used and the data were processed in a previous study [17]. Up-regulation in each cell type was defined for the genes with log2 fold-change values≥1 and their adjusted p-values≤0.05. Non-responsive genes were defined as neither up or down-regulated under any stress at any time point in the root or in any cell type sample. 4.5.2 Co-expression analyses To find co-expressed gene clusters, root stress expression dataset from AtGenExpress was combined with root cell type salt stress expression dataset and 20,060 genes in the expression dataset were clustered into co-expression clusters using c-means [31] in the R environment. Among the resulting clusters, clusters that had greater than 10 genes and less than 60 genes were selected for further analyses. Clusters that were larger than 60 genes were clustered further and clusters containing less than 10 genes were excluded from the analyses. This range of number of 138 genes in a cluster was required for efficiently running the motif finders [17]. Overall 538 clusters were obtained. Fishers exact test was used to select the clusters among the 538 that were overrepresented with cell type salt up-regulated genes (q-value ≤0.05) [32]. This analysis was repeated for each cell type separately. 4.5.3 Cell type pCRE identification To identify cell type pCREs, a previously established pipeline using six motif finders was used [17]. Motifs were found in the putative promoters (-1kb) of the genes that were in the coexpression clusters over-represented with each cell type salt up-regulated genes. Overall, 7,417 and 3,095 pCREs were identified for cell type salt up-regulation and these pCREs were overrepresented in the cell type salt up-regulated genes (for at least one cell type) with an overrepresentation q-value of 0.05 and 10-6 respectively. Among the 3,095 pCREs, 978 of them were over-represented in the promoters of COL salt up-regulated genes compared to the rest of the genome; 1728 in COR, 734 in END, 2,340 in EPI and 2,828 in STE. 583 pCREs were commonly over-represented in all root cell type salt up-regulated genes and were referred as “general” pCREs. Note that cell type up-regulated genes include cell type specific up-regulated genes as well as the genes that were up-regulated in more than one cell type. To assess similarity between pairs of pCREs, PCCs were calculated using the Position Weight Matrices (PWMs) of each pCRE. 4.5.4 Predictive models In obtaining predictive models, Support Vector Machine (SVM, [33]) and Random Forest (RF, [34]) were used in classifying root cell type up-regulated genes and non-responsive genes in the Waikato Environment for Knowledge Analysis (Weka, [35]). To find optimal parameters for each classification, grid-searches were done with the following in SVM: (1) the ratio of non- 139 responsive to up-regulated genes, (2) the parameter of the soft margin, and (3) the gamma parameter of the Radial Basis Function (RBF) kernel; in RF: (1) the ratio of non-responsive to upregulated genes and (2) number of attributes to use in trees. 10-fold cross validation was performed in prediction models. Two approaches were used to evaluate the prediction performance. (1) Area Under Curve-Receiver Operating Characteristic (AUC-ROC) measure, where a perfect model would have AUC-ROC=1 and random predictions would lead to AUC-ROC=0.5. (2) Precisionrecall curve, where precision is the ratio of true positive predictions to overall predicted as positive and recall is the ratio of true positive predictions to total number of positive class (salt upregulated genes). Better models would have precision-recall curves tending more towards the upper-right corner of the graph and random predictions would be no better than the background. 140 APPENDIX 141 Figure S.4.1 Precision/recall of END and STE salt up-regulation prediction models using TF binding data and organ pCREs. (A) Precision/recall curves of END salt up-regulation models using CIS-BP (yellow) and DAP-seq data (orange). (B) Same as (A) but for STE salt up-regulation. (C) Precision/recall curves of END salt up-regulation models using root (pink), general (green), union of root and general (blue), and all organ (root+general+shoot; purple) pCREs. (D) Same as (C) but for STE salt up-regulation. 142 Figure S.4.2 Precision/recall of END and STE salt up-regulation prediction models using cell type pCREs and union of cell type+organ pCREs. (A) Precision/recall curves of END salt upregulation models using cell type (pink), general (green), union of cell type and general (blue), and all cell type (purple) pCREs. (B) Same as (A) but for STE salt up-regulation. *: depends on the predicted cell type, * in (A) refers to END pCREs, in (B) refers to STE pCREs. (C) Precision/recall curves of END salt up-regulation models using (D) Same as (A) but for STE salt up-regulation. 143 Figure S.4.3 True positive predictions from models using organ and cell type pCREs. (A) Venn-diagram of percentage true positive predictions using organ pCREs and cell type pCREs. STE up-regulated genes were divided into 4 classes: 12% that were correctly predicted by only organ pCREs, 49% that were correctly predicted by both organ and pCREs, 14% that were correctly predicted by only cell type pCREs and 25% that could not be predicted correctly by either pCRE sets (B) Boxplot of log2 fold change expression values of each class of STE up-regulated genes described in (A). 144 REFERENCES 145 REFERENCES 1. Trapnell C. Defining cell types and states with single-cell genomics. Genome Res. Cold Spring Harbor Laboratory Press; 2015;25: 1491–1498. doi:10.1101/gr.190595.115 2. Benfey PN, Bennett M, Schiefelbein J. Getting to the root of plant biology: impact of the Arabidopsis genome sequence on root research. Plant J. Blackwell Publishing Ltd; 2010;61: 992–1000. doi:10.1111/j.1365-313X.2010.04129.x 3. Neira M, Azen E. Gene discovery with laser capture microscopy. Methods Enzymol. 2002;356: 282–9. Available: http://www.ncbi.nlm.nih.gov/pubmed/12418206 4. Bryant Z, Subrahmanyan L, Tworoger M, LaTray L, Liu CR, Li MJ, et al. Characterization of differentially expressed genes in purified Drosophila follicle cells: toward a general strategy for cell type-specific developmental analysis. Proc Natl Acad Sci U S A. 1999;96: 5559–64. 5. Southall TD, Gold KS, Egger B, Davidson CM, Caygill EE, Marshall OJ, et al. Cell-TypeSpecific Profiling of Gene Expression and Chromatin Binding without Cell Isolation: Assaying RNA Pol II Occupancy in Neural Stem Cells. Dev Cell. Elsevier; 2013;26: 101– 112. doi:10.1016/j.devcel.2013.05.020 6. Benfey PN, Schiefelbein JW. Getting to the root of plant development: the genetics of Arabidopsis root formation. Trends Genet. 1994;10: 84–8. Available: http://www.ncbi.nlm.nih.gov/pubmed/8178369 7. Birnbaum K, Shasha DE, Wang JY, Jung JW, Lambert GM, Galbraith DW, et al. A gene expression map of the Arabidopsis root. Science. American Association for the Advancement of Science; 2003;302: 1956–60. doi:10.1126/science.1090022 8. Hirt H, Shinozaki K. Plant responses to abiotic stress. Springer; 2004. 9. Narlikar L, Ovcharenko I. Identifying regulatory elements in eukaryotic genomes. Brief Funct Genomic Proteomic. Oxford University Press; 2009;8: 215–30. doi:10.1093/bfgp/elp014 10. Stergachis AB, Neph S, Reynolds A, Humbert R, Miller B, Paige SL, et al. Developmental Fate and Cellular Maturity Encoded in Human Regulatory DNA Landscapes. Cell. Elsevier; 2013;154: 888–903. doi:10.1016/j.cell.2013.07.020 11. Furey TS. ChIP-seq and beyond: new and improved methodologies to detect and characterize protein-DNA interactions. Nat Rev Genet. 2012;13: 840–52. doi:10.1038/nrg3306 12. Weirauch MT, Yang A, Albu M, Cote AG, Montenegro-Montero A, Drewe P, et al. Determination and Inference of Eukaryotic Transcription Factor Sequence Specificity. Cell. 146 2014;158: 1431–1443. doi:10.1016/j.cell.2014.08.009 13. O’Malley RC, Huang SC, Song L, Lewsey MG, Bartlett A, Nery JR, et al. Cistrome and Epicistrome Features Shape the Regulatory DNA Landscape. Cell. 2016;165: 1280–1292. doi:10.1016/j.cell.2016.04.038 14. Koryachko A, Matthiadis A, Ducoste JJ, Tuck J, Long TA, Williams C. Computational approaches to identify regulators of plant stress response using high-throughput gene expression data. Curr Plant Biol. 2015;3: 20–29. doi:10.1016/j.cpb.2015.04.001 15. Haberer G, Mader MT, Kosarev P, Spannagl M, Yang L, Mayer KFX. Large-scale ciselement detection by analysis of correlated expression and sequence conservation between Arabidopsis and Brassica oleracea. Plant Physiol. American Society of Plant Biologists; 2006;142: 1589–602. doi:10.1104/pp.106.085639 16. Wang T, Stormo GD. Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics. Oxford University Press; 2003;19: 2369–80. doi:10.1093/BIOINFORMATICS/BTG329 17. Zou C, Sun K, Mackaluso JD, Seddon AE, Jin R, Thomashow MF, et al. Cis-regulatory code of stress-responsive transcription in Arabidopsis thaliana. Proc Natl Acad Sci U S A. 2011;108: 14992–7. doi:10.1073/pnas.1103202108 18. Li Y, Chen C yu, Kaye AM, Wasserman WW. The identification of cis-regulatory elements: A review from a machine learning perspective. BioSystems. 2015. pp. 6–17. doi:10.1016/j.biosystems.2015.10.002 19. Heinz S, Romanoski CE, Benner C, Glass CK. The selection and function of cell typespecific enhancers. Nat Rev Mol Cell Biol. Nature Research; 2015;16: 144–154. doi:10.1038/nrm3949 20. Dinneny JR, Long TA, Wang JY, Jung JW, Mace D, Pointer S, et al. Cell Identity Mediates the Response of Arabidopsis Roots to Abiotic Stress. Science (80- ). 2008;320: 942–945. doi:10.1126/science.1153795 21. Natarajan A, Yardımcı GG, Sheffield NC, Crawford GE, Ohler U. Predicting cell-type– specific gene expression from regions of open chromatin. Genome Res. 2012;22: 1711– 1722. doi:10.1101/gr.135129.111 22. Chen C, Zhang S, Zhang X-S. Discovery of cell-type specific regulatory elements in the human genome using differential chromatin modification analysis. Nucleic Acids Res. 2013;41: 9230–9242. doi:10.1093/nar/gkt712 23. Kilian J, Whitehead D, Horak J, Wanke D, Weinl S, Batistic O, et al. The AtGenExpress global stress expression data set: protocols, evaluation and model data analysis of UV-B light, drought and cold stress responses. Plant J. 2007;50: 347–63. doi:10.1111/j.1365313X.2007.03052.x 147 24. Geng Y, Wu R, Wee CW, Xie F, Wei X, Chan PMY, et al. A Spatio-Temporal Understanding of Growth Regulation during the Salt Stress Response in Arabidopsis. Plant Cell. 2013;25: 2132–2154. doi:10.1105/tpc.113.112896 25. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. Oxford University Press; 2007;8: 118–27. doi:10.1093/biostatistics/kxj037 26. Lapidot M, Michal L, Mizrahi-Man O, Pilpel Y. Functional characterization of variations on regulatory motifs. PLoS Genet. 2008;4: e1000018. doi:10.1371/journal.pgen.1000018 27. Ezer D, Zabet NR, Adryan B. Homotypic clusters of transcription factor binding sites: A model system for understanding the physical mechanics of gene expression. Comput Struct Biotechnol J. Research Network of Computational and Structural Biotechnology; 2014;10: 63–9. doi:10.1016/j.csbj.2014.07.005 28. Bioconductor. [cited 12 Apr 2016]. Available: http://www.bioconductor.org/ 29. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015; doi:10.1093/nar/gkv007 30. Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J R Stat Soc Ser B. 1995;57: 289–300. doi:10.2307/2346101 31. Pal NR, Bezdek JC, Hathaway RJ. Sequential Competitive Learning and the Fuzzy c-Means Clustering Algorithms. Neural Networks. Elsevier Science Ltd.; 1996;9: 787–796. doi:10.1016/0893-6080(95)00094-1 32. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci U S A. 2003;100: 9440–5. doi:10.1073/pnas.1530509100 33. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20: 273–297. doi:10.1007/BF00994018 34. Breiman L. Random Forests. Mach Learn. Kluwer Academic Publishers; 2001;45: 5–32. doi:10.1023/A:1010933404324 35. Frank E, Hall M, Trigg L, Holmes G, Witten IH. Data mining in bioinformatics using Weka. Bioinformatics. 2004;20: 2479–81. doi:10.1093/bioinformatics/bth261 148 CHAPTER 5 CONCLUDING REMARKS To decipher functional associations among Arabidopsis thaliana genes, and regulatory information of stress responsive gene expression, I used computational approaches to analyze publicly available global gene expression data. In Chapter 2, I evaluated the utilities and limitations of using gene expression data in hypothesizing gene function. I found that based on the individual metabolic pathways, the extent of gene co-expression differs and 5%-53% of pathways form significant gene co-expression groups with similarity measure and expression dataset used impacting this percentage. I also evaluated the commonly-used clustering algorithms with different parameters and showed that focusing on only one algorithm led to information loss in relating previously unknown genes to known pathways. In validating the co-expression clusters obtained, I used an independent phenomics dataset to confirm functional associations to leucine degradation pathway. These analyses serve as an outline of the best practices of using gene co-expression in functional inference and will be an important resource for studies that aim to utilize gene coexpression analyses. One potential improvement to the research described in Chapter 2 can be including gene expression datasets generated by different technologies. I used microarray data available for A. thaliana to ask to what extent pathway genes co-express and what the impact of dataset on the degree of pathway gene co-expression is. Currently, 48,501 gene expression datasets in the public data repository Gene Expression Omnibus (GEO) are microarray-based and 11,338 datasets are based on RNA-sequencing (RNA-seq) experiments (as of January 19th, 2017, [1]). Even though 149 there are more microarray datasets compared to RNA-seq ones, the advantage of RNA-seq is that genes currently not present as probes on microarrays can be studied. It is shown that 6,953 A. thaliana genes are structurally annotated but are not represented on probes of Affymetrix ATH1 microarray platform, and 52% of these genes not on the array do not have functional annotation information [2]. In this case, RNA-seq data can be helpful to uncover co-expression relationships of genes absent on microarrays. In addition, RNA-seq allows more information to be extracted, such as transcript isoforms or expressed intergenic regions, that can be included in co-expression studies. For example, co-splicing networks have been constructed using RNA-seq data and correlations in the isoform ratios across different genes have been calculated for functional associations [3]. These additional associations using RNA-seq data can complement the associations obtained from using microarray data alone. Another future direction would be to use machine learning models to predict pathway genes using gene co-expression features (e.g. expression coherence, cluster membership). In my study, I used unsupervised, clustering methods to associate unknown genes to known pathways. This goal can also be achieved in a supervised fashion by classification of pathway genes against genes that do not belong to any pathways. This classification can yield models that later be used on the rest of the genome to identify genes/genomic regions that show resemblance to genes from specific pathways. These models could also show what the most important co-expression features (i.e. EC and cluster membership) are in identifying genes that belong to specific pathways. In addition to gene co-expression, other levels of genome-wide information can be used in making functional associations. For example, protein-protein interactions can be another level of information to along with gene co-expression. Although there are data integration approaches using multiple biological data types to build gene functional networks [4,5], datasets apart from gene expression still do not 150 have extensive temporal, spatial and conditional information. With accumulation of such data, it will be possible to investigate the utilities and limitations of using different data types for gene functional inference. Another future direction in utilizing gene co-expression is new pathway discovery. In my study, I focused on existing literature of metabolic pathways and other biological processes. However, co-expression clusters might only contain unknown genes and these co-expressed genes could belong to a process that previously have not been studied. Using biotic stress treatment gene expression datasets, new signaling pathways have been identified from co-expression modules [6]. In the same study, known TF binding motifs such as G-box, W-box, and MYB motifs were also mapped to genes in the same co-expression clusters. Through combination of motif and pathway over-representation analyses on co-expression clusters, novel pathways as well as the potential regulatory signatures responsible for pathway gene co-expression can be identified. Similar analyses can be applied to the remaining co-expression clusters (366) obtained in my research. The co-expression clusters can also be useful in computational prediction of cis-regulatory elements (CREs) driving pathway gene co-expression. There are still many CREs to identify that are responsible for pathway gene expression patterns, considering I could not find over-represented known TF binding motifs on the promoters of high EC pathway genes. Therefore, additional CREs can to be identified using co-expression clusters over-represented with pathway genes to complement the TF binding motif information obtained from large scale in-vitro studies [7,8]. In research described in Chapters 3 and 4, I identified putative CREs (pCREs) that are likely involved in spatial salt responsive gene expression. These pCREs were further used in predictive models of organ and cell type salt up-regulation and contribute towards knowledge of salt stress gene regulation. In addition to pCREs, I also incorporated known TF binding sites, 151 chromatin accessibility information, and conservation across species into machine learning models to predict organ salt up-regulation and to identify the most important pCREs responsible for salt up-regulation in root and shoots. In Chapter 4, the predictive models were expanded to the cell type level and I identified pCREs that are likely involved in root cell type specific salt upregulation. Chapters 3 and 4 revealed pCREs that could be involved in spatial high salinity response and the predictive models generated For future research, additional genomic regions (in addition to promoters including introns, genes) could be used to identify pCREs and information on protein-protein interactions between TFs can be considered. Although most (86% [9]) TF binding sites are found near transcription start sites, there are also regulatory elements that are known to be present in the introns as well as the genic regions [10]. Therefore, to identify pCREs, other genomic regions should also be considered. Apart from the CREs, the characteristics of the TFs that bind to these CREs also needs to be considered. For example, TFs may not function by themselves and require additional activators to initiate binding to CREs. TFs bind to cofactors or other TFs to regulate gene expression. In this respect, one limitation of our study was that our predictive models did not take TF-cofactor and TF-TF interactions into account. It is possible that some TFs only drive gene expression after these interactions are present and we would miss the information from these interactions by only focusing on the co-expressed genes and the DNA sequences. Coupling information from gene coexpression with TF-protein interactions can improve the models of organ salt up-regulation. However, conditional protein-protein interaction data, particularly under salt stress, need to be generated. Prediction of cell type specific salt up-regulation was a challenge in this study. Cell-type data provides a higher resolution information compared to that from heterogeneous cell types 152 within an organ. However, the current cell-type study still focus a group of cells that may be heterogenous. Recently, studies focus on subpopulations of cells sharing a common geneexpression profile with the help of single-cell sequencing studies [11]. As more data are available from single cell sequencing studies, it will be possible to utilize the single cell gene expression data to identify CREs driving individual cell gene expression. Apart from cell type predictions, prediction of down-regulations (both organ and cell type) remains a challenge. Even though predictions of organ salt up-regulation were satisfactory using computationally identified pCREs as predictors, performance of down-regulation predictions was similar to random guessing. This suggests that down-regulation is more complex than up-regulation and considering pCREs is not enough. Potentially, post-transcriptional mechanism like RNA turnover can be taken into account in modeling down-regulations. Hence, improving predictions of down-regulations will require additional data types such as RNA approach to equilibrium sequencing (RATE-seq, [12]) to be generated across multiple experimental conditions. Overall, in my dissertation I used data-driven approaches towards learning about gene regulation and function. Even in the genomes that are well-studied, there are regions that need to be functionally annotated. The methodology and results of this thesis are applicable to annotating functional regions including genes and regulatory elements. 153 REFERENCES 154 REFERENCES 1. Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. Oxford University Press; 2002;30: 207–10. doi:10.1093/NAR/30.1.207 2. Giorgi FM, Del Fabbro C, Licausi F. Comparative study of RNA-seq- and Microarrayderived coexpression networks in Arabidopsis thaliana. Bioinformatics. 2013;29: 717–724. doi:10.1093/bioinformatics/btt053 3. Aghamirzaie D, Collakova E, Li S, Grene R, Carvalho R, Feijão C, et al. CoSpliceNet: a framework for co-splicing network inference from transcriptomics data. BMC Genomics. 2016;17: 845. doi:10.1186/s12864-016-3172-6 4. Lee T, Yang S, Kim E, Ko Y, Hwang S, Shin J, et al. AraNet v2: an improved database of co-functional gene networks for the study of Arabidopsis thaliana and 27 other nonmodel plant species. Nucleic Acids Res. 2015;43: D996-1002. doi:10.1093/nar/gku1053 5. Mutwil M, Klie S, Tohge T, Giorgi FM, Wilkins O, Campbell MM, et al. PlaNet: combined sequence and expression comparisons across plant networks derived from seven species. Plant Cell. 2011;23: 895–910. doi:10.1105/tpc.111.083667 6. Ma S, Shah S, Bohnert HJ, Snyder M, Dinesh-Kumar SP. Incorporating Motif Analysis into Gene Co-expression Networks Reveals Novel Modular Expression Pattern and New Signaling Pathways. Copenhaver GP, editor. PLoS Genet. Public Library of Science; 2013;9: e1003840. doi:10.1371/journal.pgen.1003840 7. Weirauch MT, Yang A, Albu M, Cote AG, Montenegro-Montero A, Drewe P, et al. Determination and Inference of Eukaryotic Transcription Factor Sequence Specificity. Cell. 2014;158: 1431–1443. doi:10.1016/j.cell.2014.08.009 8. O’Malley RC, Huang SC, Song L, Lewsey MG, Bartlett A, Nery JR, et al. Cistrome and Epicistrome Features Shape the Regulatory DNA Landscape. Cell. 2016;165: 1280–1292. doi:10.1016/j.cell.2016.04.038 9. Yu C-P, Lin J-J, Li W-H, Koudritsky M, Domany E, Lin Z, et al. Positional distribution of transcription factor binding sites in Arabidopsis thaliana. Sci Rep. Nature Publishing Group; 2016;6: 25164. doi:10.1038/srep25164 10. Riechmann JL. Transcriptional Regulation : a Genomic Overview. Arabidopsis Book. 2002;1: e0085. doi:10.1199/tab.0085 11. Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, et al. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat Biotechnol. Nature Research; 2015;33: 155– 155 160. doi:10.1038/nbt.3102 12. Neymotin B, Athanasiadou R, Gresham D. Determination of in vivo RNA kinetics using RATE-seq. RNA. Cold Spring Harbor Laboratory Press; 2014;20: 1645–52. doi:10.1261/rna.045104.114 13. Calviello L, Mukherjee N, Wyler E, Zauber H, Hirsekorn A, Selbach M, et al. Detecting actively translated open reading frames in ribosome profiling data. Nat Methods. Nature Research; 2015;13: 165–170. doi:10.1038/nmeth.3688 156