DEFINING THE CHARACTERISTICS AND ROLES OF FUNCTIONAL GENOMIC SEQUENCES USING COMPUTATIONAL APPROACHES By John P. Lloyd A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Plant Biology – Doctor of Philosophy 2017 ABSTRACT DEFINING THE CHARACTERISTICS AND ROLES OF FUNCTIONAL GENOMIC SEQUENCES USING COMPUTATIONAL APPROACHES By John P. Lloyd Advances in biotechnology have provided a wealth of sequencing data that is transforming our view of a genome. Eukaryotic genomes, initially thought to contain discrete genes in a sea of non-functional DNA, have been found to exhibit pervasive biochemical activity, particularly transcription. However, whether this biochemical activity is functional (i.e. under evolutionary selection) or the result of noisy activity of cellular machinery represents a fundamental debate of the post-genome era. The research described in this dissertation focuses on two open questions confronting genome biology: 1) Where are the functional elements within a genome? 2) What roles are functional elements performing? For the first question, I focused on transcribed regions in unannotated, intergenic regions of genomes, which represent functionally ambiguous sequences. To determine which and how many intergenic transcribed regions (ITRs) represent functional sequences, machine learningbased function prediction models were established using Arabidopsis thaliana as a model. The prediction models were able to successfully distinguish between benchmark functional (phenotype genes) and non-functional sequences (pseudogenes) using evolutionary, biochemical, and sequence-based structural features. When applied to ITRs, ~40% of ITRs were predicted as functional, suggesting ITRs primarily represent transcriptional noise. I further investigated the evolutionary histories of ITRs in four grass (Poaceae) species. ITRs were found to be primarily species-specific and exhibit recent duplicates, with rare examples of ancient duplicate retention. In addition, ITR duplicates and orthologs were usually not expressed. Function prediction models were also generated in Oryza sativa (rice) that predicted ~60% of rice ITRs as nonfunctional. The results of function prediction models and evaluating evolutionary histories both suggest ITRs are primarily non-functional sequences. However, I also provide a list of potentially-functional ITRs that should be considered high priority targets for future experimental studies. For the second question, I established a machine learning framework to predict mutant phenotypes, which provide potent evidence for the role of a gene. Phenotype predictions were focused on essential genes (those with lethal mutant phenotypes) in A. thaliana, as these genes represent a historically well-studied group. Combining 57 expression, duplication, evolutionary, and gene network characteristics through machine learning methods accurately distinguished between genes with lethal and non-lethal mutant phenotypes. Additionally, essential gene prediction models could be applied across species; essential gene prediction models generated in A. thaliana could identify essential genes in rice and Saccharomyces cerevisiae. Thus, machinelearning represents a promising avenue of prioritization of candidate genes for large-scale phenotyping efforts. Overall, the research described in this dissertation highlight computational approaches as highly effective in defining functional sequences and classifying the likely roles of genes. To Ani, for her unwavering love and support. iv ACKNOWLEDGEMENTS As I look back upon my five years in East Lansing, I reflect upon the many people I have to thank during this wonderful journey. First and foremost, I must thank my advisor, Shinhan, who convinced me to join his lab after just a short interview and has since proven time and again to be a thoughtful scientist and excellent mentor. Shinhan has always pushed us to generate topnotch science and not settle to churn out low-impact studies. I believe that I and all members of his lab are stronger scientists for it. I also wanted to mention my Master’s advisor, David Meinke, for putting me on the path that I am now. My life would have ended up very different if he had not brought me into his lab as an undergraduate and Master’s student. I also want to thank my committee members, Robin Buell, Robert Last, Doug Schemske, and George Mias, for donating their scarce time and helpful advice over the years. In addition to being a great mentor, Shinhan always managed to bring together a great group of people with which to work and relax. To the many friends and colleagues from the Shiu lab over the years, it has been a pleasure and I will always fondly recall our time together. In particular, I would like to mention Gaurav Moghe, Sahra Uygun, Nicholas Panchy, Beth and Josh Moore, and Christina Azodi, who were fixtures in the lab during my stay there and come to mind first when I think of my Ph.D. studies. Last and certainly not least, I must thank my lovely wife, Anita. She was always there with support, advice, and help in times of need. Thank you, Ani, you have helped me grow and become a better person. v TABLE OF CONTENTS LIST OF TABLES ......................................................................................................................... ix LIST OF FIGURES ........................................................................................................................ x KEY TO ABBREVIATIONS ....................................................................................................... xii CHAPTER 1: INTRODUCTION ................................................................................................... 1 Prevalence of intergenic transcription ...................................................................................... 2 The nature of genes ................................................................................................................... 3 Definition of function ............................................................................................................... 5 Gain and loss of functional sequences ...................................................................................... 7 Identifying the role of a gene .................................................................................................... 9 Machine learning as a tool in biology ..................................................................................... 10 Dissertation outline and significance ...................................................................................... 11 CHAPTER 2: DEFINING THE FUNCTIONAL SIGNIFICANCE OF INTERGENIC TRANSCRIBED REGIONS ........................................................................................................ 13 ABSTRACT ............................................................................................................................... 14 INTRODUCTION ..................................................................................................................... 15 RESULTS AND DISCUSSION ................................................................................................ 19 Relationship between genome size and intergenic expression indicates that intergenic transcripts may generally be non-functional ........................................................................... 19 Expression, conservation, and epigenetic features are significantly distinct between benchmark functional and non-functional genomic sequences .............................................. 20 Consideration of multiple features in combination produces accurate predictions of functional genomic regions ...................................................................................................................... 24 Exclusion of features from multiple tissues increases prediction performance for narrowlyexpressed sequences................................................................................................................ 29 Intergenic transcribed regions and annotated ncRNAs are mostly predicted as non-functional ................................................................................................................................................. 32 Intergenic transcribed regions and annotated ncRNAs do not resemble benchmark RNA genes ....................................................................................................................................... 34 CONCLUSION .......................................................................................................................... 40 METHODS ................................................................................................................................ 42 Identification of transcribed regions in leaf tissue of 15 flowering plants ............................. 42 Phenotype data sources ........................................................................................................... 43 Arabidopsis thaliana genome annotation ............................................................................... 44 Sequence conservation and structure features ........................................................................ 45 Transcription activity features ................................................................................................ 46 Histone 3 mark features .......................................................................................................... 48 DNA methylation features ...................................................................................................... 49 Chromatin accessibility and transcription factor binding features ......................................... 50 vi Single-feature prediction performance ................................................................................... 50 Binary classification with machine learning ........................................................................... 51 Multi-class machine learning model ....................................................................................... 52 ACKNOWLEDGEMENTS ....................................................................................................... 54 APPENDIX ................................................................................................................................ 55 CHAPTER 3: CROSS-SPECIES AND POST-DUPLICATION EVOLUTIONARY DYNAMICS OF INTERGENIC TRANSCRIBED REGIONS INDICATE WIDESPREAD NOISY TRANSCRIPTION .......................................................................................................... 67 ABSTRACT ............................................................................................................................... 68 INTRODUCTION ..................................................................................................................... 69 RESULTS AND DISCUSSION ................................................................................................ 72 Identification and classification of Poaceae transcribed regions ............................................ 72 Cross-species sequence conservation of ITRs ........................................................................ 75 Cross-species expression conservation of ITRs ...................................................................... 78 Post-duplication sequence and expression conservation of ITRs ........................................... 80 Predicting the functionality of rice transcribed regions .......................................................... 84 Comparison of function predictions and evolutionary histories ............................................. 89 CONCLUSION .......................................................................................................................... 91 METHODS ................................................................................................................................ 93 Gene, pseudogene, and random intergenic annotation ........................................................... 93 Identification and classification of transcribed regions .......................................................... 94 Sequence and expression conservation ................................................................................... 96 Identification of syntenic gene blocks .................................................................................... 97 Rice function prediction features ............................................................................................ 99 Transcription activity features ..............................................................................................99 Sequence conservation features ............................................................................................99 Histone mark and nucleosome occupancy features ............................................................100 DNA methylation features ...................................................................................................100 Machine learning approach ................................................................................................... 101 ACKNOWLEDGEMENTS ..................................................................................................... 104 APPENDIX .............................................................................................................................. 105 CHAPTER 4: CHARACTERISTICS OF PLANT ESSENTIAL GENES ALLOW FOR WITHIN- AND BETWEEN-SPECIES PREDICTION OF LETHAL MUTANT PHENOTYPES ..................................................................................................................................................... 114 ABSTRACT ............................................................................................................................. 115 INTRODUCTION ................................................................................................................... 116 RESULTS AND DISCUSSION .............................................................................................. 120 Phenotype classification and functions of genes with lethal phenotypes ............................. 120 Copy number of lethal genes ................................................................................................ 121 Duplication timing of lethal-phenotype genes ...................................................................... 123 Relationship between phenotype lethality and gene expression ........................................... 128 Conservation of lethal genes ................................................................................................. 133 Network connectivity of lethal-phenotype genes ................................................................. 136 Prediction of lethal genes using a machine learning framework .......................................... 138 vii Cross-species predictions of lethal-phenotype genes ........................................................... 144 CONCLUSION ........................................................................................................................ 149 METHODS .............................................................................................................................. 152 Phenotype Data Sources ....................................................................................................... 152 Gene Ontology Functional Annotation ................................................................................. 153 Evolutionary Rate Calculations and Analysis of Duplicates and Pseudogenes .................... 154 Expression Data Sources and Processing ............................................................................. 156 Network Analysis.................................................................................................................. 157 Machine learning predictions ................................................................................................ 157 ACKNOWLEDGEMENTS ..................................................................................................... 160 APPENDIX .............................................................................................................................. 161 CHAPTER 5: CONCLUSIONS ................................................................................................. 167 Assessing the functionality of genomic sequences ............................................................... 168 Predictions of mutant phenotypes ......................................................................................... 172 Concluding remarks .............................................................................................................. 174 REFERENCES ........................................................................................................................... 176 viii LIST OF TABLES Table 4.1 Features of essential genes in A. thaliana ....................................................................129 ix LIST OF FIGURES Figure 2.1 Relationship between genome size and transcription coverage ...................................21 Figure 2.2 Predictions of functional and non-functional sequences based on single features .......23 Figure 2.3 Predictions of functional and non-functional sequences based on multiple features ...26 Figure 2.4 Functional likelihood distributions based on the full model. .......................................28 Figure 2.5 Proportion of sequence classes predicted as functional in the full and single-category models ............................................................................................................................................31 Figure 2.6 Function predictions based on a four-class prediction model ......................................36 Supplemental Figure 2.1 Expression breadth distributions of sequence classes ...........................59 Supplemental Figure 2.2 Impacts of conditional phenotypes and expression breadth on the function prediction model ..............................................................................................................60 Supplemental Figure 2.3 Distributions of functional likelihood scores based on the 500 bp tissueagnostic model ...............................................................................................................................62 Supplemental Figure 2.4 Distributions of 12 example features from 7 feature categories............63 Supplemental Figure 2.5 Distance of ITRs and annotated ncRNA regions to and feature similarity with neighboring genes ..................................................................................................64 Supplemental Figure 2.6 Distributions of functional likelihood scores based on the 100 bp tissueagnostic model ...............................................................................................................................65 Supplemental Figure 2.7 Correlation between features used in functional predictions ................66 Figure 3.1 Transcriptome content in four Poaceae species............................................................73 Figure 3.2 Sequence conservation of transcribed regions .............................................................76 Figure 3.3 Expression conservation of transcribed regions ...........................................................79 Figure 3.4 Duplication characteristics of transcribed regions .......................................................82 Figure 3.5 Function predictions of transcribed regions .................................................................86 Supplemental Figure 3.1 Expression characteristics of transcribed regions in four Poaceae species ..........................................................................................................................................106 x Supplemental Figure 3.2 Distance and expression correlation between intergenic transcribed regions and neighboring genes.....................................................................................................107 Supplemental Figure 3.3 Distributions of duplication types .......................................................109 Supplemental Figure 3.4 Distributions of synonymous substitution rates between anchor genes of within-species collinear gene blocks ...........................................................................................110 Supplemental Figure 3.5 Prediction score distributions based on binary and three-class function prediction models .........................................................................................................................111 Supplemental Figure 3.6 Relationship between sequence length, % exon overlap, and function predictions ....................................................................................................................................113 Figure 4.1 Copy number of phenotype genes in A. thaliana and O. sativa .................................122 Figure 4.2 Duplication timing and type of A. thaliana phenotype genes ....................................125 Figure 4.3 Expression characteristics of A. thaliana phenotype genes........................................131 Figure 4.4 Evolutionary rate and cross-species protein conservation of A. thaliana phenotype genes ............................................................................................................................................135 Figure 4.5 Network connectivity of A. thaliana phenotype genes ..............................................137 Figure 4.6 Machine learning performance of essential gene predictions ....................................140 Supplemental Figure 4.1 Over- and under-representation of phenotype genes in Gene Ontology categories .....................................................................................................................................162 Supplemental Figure 4.2 Proportions of most similar paralogs produced in whole-genome duplication events ........................................................................................................................164 Supplemental Figure 4.3 Evolutionary rates between paralogs ...................................................165 Supplemental Figure 4.4 Performance of essential gene predictions with increasing numbers of features .........................................................................................................................................166 xi KEY TO ABBREVIATIONS AraNet Probabilistic functional gene network of Arabidopsis thaliana Araport Arabidopsis Information Portal AUC-ROC Area under the curve - receiver operating characteristic BLAST Basic local alignment search tool bp Base pairs BS-seq Bisulfite sequencing CAGE Cap analysis gene expression ChIP-seq Chromatin immunoprecipitation sequencing CNB Conserved nucleotide block DAP Days after pollination DHS Deoxyribonuclease I hypersensitive site DiProDB Dinucleotide Property Database DNA Deoxyribonucleic acid DNase Deoxyribonuclease I EMBOSS European Molecular Biology Open Software Suite ENCODE Encyclopedia of DNA Elements eRNA Enhancer RNA EST Expressed sequence tag ETR Exon transcribed region FDR False discovery rate FET Fisher's Exact Test xii FL Functional likelihood FNR False negative rate FPKM Fragments per kilobase of transcript per million mapped reads FPR False positive rate FungiDB Fungal Database GBM Gene body methylation GO Gene Ontology ITR Intergenic transcribed region K Rate of substitutions Ka Rate of non-synonymous substitutions Ks Rate of synonymous substitutions KST Kolmogorov–Smirnov test lncRNA long non-coding RNA MCScanX Multiple collinearity scan software, version X miRBase microRNA Database miRNA microRNA miRNA-TR microRNA transcribed region MNase-seq Micrococcal nuclease sequencing MSU Michigan State University MUSCLE Multiple Sequence Comparison by Log-Expectation MYA Million years ago MYO Million years old NCBI National Center for Biotechnology Information xiii ncRNA non-coding ribonucleic acid OrthoMCL Ortholog Markov clustering software PAML Phylogenetic Analysis by Maximum Likelihood PC Principal component PCC Pearson's correlation coefficient PHAST Phylogenetic Analysis with Space/Time models Phen-TR Phenotype exon transcribed region PHYLIP Phylogeny Inference Package Pseu-TR Pseudogene transcribed region r2 Square of Pearson's correlation coefficient RAxML Randomized Axelerated Maximum Likelihood RF Random forest RNA Ribonucleic acid RNA Pol II RNA Polymerase II RNA-seq RNA sequencing RPKM Reads per kilobase of transcript per million mapped reads SMO Sequential minimal optimization snoRNA Small nucleolar RNA SNP Single nucleotide polymorphism snRNA Small nuclear RNA SRA Sequence read archive SVM Support vector machines TAIR The Arabidopsis Information Resource xiv TE Transposable element TF Transcription factor tRNA Transfer RNA tRNAdb tRNA Database U test Mann Whitney U test WEKA Waikato Environment for Knowledge Analysis WGD Whole genome duplication xv CHAPTER 1: INTRODUCTION 1 Prevalence of intergenic transcription Advances in sequencing technology have led to the discovery of extensive transcriptional activities occurring in unannotated, intergenic regions of genomes. Intergenic transcription is prevalent in a variety of model systems, including Homo sapiens (human) (ENCODE Project Consortium, 2012), Drosophila melanogaster (fruit fly) (Brown et al., 2014), Caenorhabditis elegans (nematode) (Boeck et al., 2016), and Saccharomyces cerevisiae (yeast) (Nagalakshmi et al., 2008). In plant systems, intergenic transcripts have been identified in Arabidopsis thaliana (Yamada et al., 2003; Stolc et al., 2005; Moghe et al., 2013; Krishnakumar et al., 2015) and Oryza sativa (rice) (Nobuta et al., 2007). Initially, intergenic transcripts were suggested to be primarily related to nearby genes as unannotated exon extensions or run-on transcription (van Bakel et al., 2010). However, a variety of novel functions have been suggested for these sequences (Guil and Esteller, 2012; Hanada et al., 2013; Tan et al., 2015), suggesting they may also represent the activity of undetected genes. The notion that ITRs represent independent functional elements with novel functions awaiting discovery should be tempered by the possibility that they may also represent the product of noisy activity of the cellular machinery controlling gene expression. Transcriptional noise can be produced by random landing of RNA Polymerase II or presence of spurious regulatory signals driving expression of non-functional transcripts (Struhl, 2007). As much as 90% of the transcriptional activity in the yeast genome has been estimated to be the product of noisy transcription, including many transcripts originating from intergenic regions (Struhl, 2007). In addition to transcriptional noise, ITRs may also represent contamination from genomic DNA generated while preparing RNA samples for sequencing (Moghe et al., 2013). Few ITRs have been experimentally characterized (Ivanova et al., 2006; Guttman et al., 2009; Heinen et al., 2 2009; Ulitsky et al., 2011; Sauvageau et al., 2013; Lai et al., 2015) or exhibit sequence conservation within or between species (Moghe et al., 2013). Thus, the possibility that ITRs primarily represent non-functional sequences cannot be ruled out. To explore which ITRs likely result from the activities of novel genes (i.e. functional sequences), a clear conception of what defines a gene is required. The nature of genes Genes have been described historically as the unit of heredity (Gerstein et al., 2007) and underlie the biological diversity and evolution of life on earth (Raff, 1996). The conception and definition of a gene have evolved over the past century. The term “gene” was coined in 1909 and initially utilized as an abstract concept to explain the heredity of phenotypes between parents and offspring (Gerstein et al., 2007; Portin, 2015; Portin and Wilkins, 2017). Research over the following 50 years identified the physical characteristics of genes: they are regions of chromosomes, composed of DNA, and typically represent blueprints for proteins (Gerstein et al., 2007; Portin, 2015; Portin and Wilkins, 2017). The concept of a gene as a distinct chromosomal sequence that encodes a single functional protein product emerged (Beadle and Tatum, 1941; Crick, 1963; Gaertner and Cole, 1977; Portin and Wilkins, 2017). The post-genome era has complicated the straightforward view of a gene (Pesole, 2008; ENCODE Project Consortium, 2012). First, genes are not necessarily protein-coding, as extensive roles for non-coding RNAs have been uncovered (Fire et al., 1998; Guil and Esteller, 2012; Tan et al., 2015). Next, alternative splice forms (Kelemen et al., 2013) and translational start sites (Kochetov, 2008; Bazykin and Kochetov, 2011) are common, indicating that a single gene may encode multiple protein isoforms. A related issue is that the presence of alternative transcriptional start and stop sites suggest boundaries of a gene may not be well defined (Pesole, 3 2008). Genes may also overlap one another, either on opposing DNA strands or in distinct coding frames (Veeramachaneni et al., 2004; Makalowska et al., 2005), and therefore genes do not necessarily represent unique regions on a chromosome. Last, genes have previously been defined while including their upstream regulatory promoters (Pearson, 2006). However, transacting gene regulatory elements, such as enhancers (Bulger and Groudine, 2010), exist remotely from gene bodies. Together, these issues highlight genes as DNA sequences that frequently encode multiple protein isoforms, may not have clearly-definable start and stop points, and can be associated with an extensive array of both cis- and trans-acting of regulatory elements. The complicated picture of gene structure and function represents a paradigm shift in which the gene as a distinct protein-coding element now represents only a subset of all possible cases (Portin, 2015), and as a result, there have been repeated calls to reconsider the definition of a gene (Pearson, 2006; Gerstein et al., 2007; Pesole, 2008; Portin and Wilkins, 2017). However, it is not clear to what extent the potentially complex structures of genes undermine the traditional concept of a gene as the unit of heredity that underlies phenotype. Instead, ever-advancing understanding of the physical structure of genes has perhaps overemphasized the use of the “nominal” gene definition (Griffiths and Stotz, 2006), which focuses on predicted gene sequences without regard to potential phenotype association. As evidence for this, there have been multiple tentative updates to the definition of a gene outlined in the past decade (Gerstein et al., 2007; Pesole, 2008; Portin and Wilkins, 2017) and each includes a requirement that genes produce a “functional” product or associate with a phenotype (Gerstein et al., 2007; Pesole, 2008; Portin and Wilkins, 2017). Thus, there appears to be two questions related to gene definitions that should be considered separately: How do we define the coordinates of gene models? How do we identify functional genome sequences? Furthermore, biological noise may 4 play a critical role, as alternative splice variants or transcriptional start and end sites may result from non-specific activity of the cellular mechanisms regulating these processes. Thus, it is crucial to not only catalog all observed transcript isoforms, but to also direct attention to determining whether an isoform is functional. Definition of function Despite increasing complexity with which we view genes, these sequences are agreed to produce functional products. Thus, a clear definition of what constitutes function within a biological context is needed. Two contrasting definitions of function are frequently debated, the “causal role” and “selected effect” definitions (Cummins, 1975; Amundson and Lauder, 1994; ENCODE Project Consortium, 2012; Graur et al., 2013; Doolittle et al., 2014). Under the causal role definition, a functional genome region is one that is exhibits reproducible biochemical activity, such as transcription, protein-binding, or the presence of particular chromatin states. This definition was invoked by the ENCODE Consortium (2012) to conclude that 80% of the human genome was biochemically functional, which was in turn cited as evidence to disprove the existence of “junk” DNA (Eddy 2013). However, this estimate of the functional proportion of the human genome far exceeds those based on evolutionary conservation (Rands et al., 2014) or mutational load (Graur, 2017). Because of this, the suitability of the causal role definition has come under extensive critique (Doolittle, 2013; Graur et al., 2013; Niu and Jiang, 2013). An alternative definition for function is the selected effect, which requires that functional genome sequences contribute to the survival and reproduction of an organism and thus be under evolutionary selection. As biochemical activity may be the result of noise, the selected effect definition has been suggested to be more suitable for defining biological function (Amundson 5 and Lauder, 1994; Graur et al., 2013; Doolittle et al., 2014). Given these considerations, I will utilize the selected effect definition of function throughout this dissertation. Based on the selected effect definition, what data should be considered as evidence that a sequence is functional? Observing a phenotype resulting from mutation of a sequence is considered the gold standard for identifying functional genome regions (Ponting and Belgard, 2010; Niu and Jiang, 2013). In such cases, the presence of a mutant phenotype indicates that a sequence had a role in producing the wild-type phenotype and is likely under selection. However, phenotype data for each region in a genome is not available and has been used to validate the functionality of only a handful of ITRs. Sequence conservation, particularly over long time periods, also represents strong functional evidence and can be assessed genome-wide. However, selective pressure may be weak or positive, resulting in a lack of detectable conservation, particularly among RNA genes (Pang et al., 2006; Ponting, 2017), or a sequence may perform a species-specific function. By contrast, sequence conservation over short periods may result from a lack of time for sequences to significantly diverge. Thus, the presence or absence of conservation cannot be used by itself to classify functional sequences. Last, biochemical activity also represents evidence of functionality, including biochemical signatures such as transcription, transcription factor binding, and the presence of epigenetic marks (e.g. specific DNA methylation and histone mark patterns). However, biochemical activities are subject to noise and, similar to conservation evidence, any individual activity should not be used to define functionality alone. It has been suggested that an integrative approach that considers phenotype, evolutionary, and biochemical evidence in combination could be used to define functional sequences (Kellis et al., 2014). Such a framework has been criticized for ignoring the evolutionary origin of a 6 biochemical activity (i.e. the selected effect definition). Nevertheless, recent studies have shown that considering genetic, evolutionary, and biochemical properties in combination effectively distinguishes between sequences that are under selection and those that are not (Tsai et al., 2017). This approach is explored in more detail in this dissertation. Gain and loss of functional sequences Functional sequences in genomes are not static and each species harbors a distinct set of genes and regulatory sequences. What mechanisms underlie the changes in functional sequences between genomes? Gene duplication plays a critical role in the evolution of new functional sequences (Ohno et al., 1968; Zhang, 2003; Panchy et al., 2016). Following gene duplication, one gene copy may retain the ancestral function while the additional copy evolves a new function (Hughes, 1994; Zhang et al., 1998). Alternatively, two gene copies may lose complementary subsets of the ancestral gene to become more specialized, a process known as subfunctionalization (Force et al., 1999; Lynch and Force, 2000). Although subfunctionalization can result from a stochastic loss of gene functions, it may also play an important role in optimization of genes by separating conflicting functions into separate gene copies that can evolve independently (referred to as escape from adaptive conflict) (Hittinger and Carroll, 2007; Marais and Rausher, 2008). While duplication events represent a source of functional diversity, the most common outcome of gene duplication is the pseudogenization of one copy (Li et al., 1981; Maere et al., 2005; Hanada et al., 2008; Moghe et al., 2014). Thus, while duplication can facilitate the evolution of new functions, it more frequently results in the production of non-functional sequences. The process of pseudogenization of duplicate gene copies can take place over millions of years (Lynch and Conery, 2000; Lynch and Conery, 2003). As a result, a subset of 7 annotated genes may represent sequences undergoing functional decay and en route to pseudogene status. For example, a pair of duplicate transcription factors in A. thaliana, DDF1 and DDF2, exhibit evidence of highly asymmetric divergence with DDF1 retaining ancestral functions and DDF2 losing binding site affinity (Lehti-Shiu et al., 2015). This suggests that DDF2 could be undergoing functional decay. In a genome-wide survey, 1,939 A. thaliana protein-coding genes lack transcriptional evidence and exhibit characteristics that are more consistent with pseudogenes than protein-coding genes (Yang et al., 2011). This represents 7% of the annotated gene space in A. thaliana, indicating that functional decay among putative gene annotations may not be uncommon. Further complicating the relationship between pseudogenes and functional decay is the fact that pseudogenes may remain functional as truncated proteins or evolve novel functions at the RNA level post-pseudogenization (Poliseno et al., 2010; Karreth et al., 2015). In addition to gene duplication, novel genes can evolve de novo from genome regions that did not previously contain a functional element (Kaessmann, 2010; Tautz and DomazetLošo, 2011). While the probability of a protein-coding gene evolving from non-coding sequence was described as “practically zero” (Jacob, 1977), multiple studies have identified de novo gene birth in primates (Johnson et al., 2001; Knowles et al., 2009; Toll-Riera et al., 2009), fruit fly (Levine et al., 2006; Zhou et al., 2008) and yeast (Cai et al., 2008). Intergenic transcription may play a critical role in de novo gene evolution, with transient “proto”-genes evolving from intergenic transcripts that may come under positive selection and evolve into protein-coding genes (Carvunis et al., 2012). This could suggest that tolerating the transcription of nonfunctional intergenic sequences plays a role in the evolution novel functional sequences. 8 Identifying the role of a gene What role is a gene performing in the cell or for the organism? Evidence can be provided by expression patterns (Eisen et al., 1998; Spellman et al., 1998; Lee et al., 2004a; Uygun et al., 2016), metabolic profiling (Raamsdonk et al., 2001; Sumner et al., 2003; Hirai et al., 2007), biochemical analyses (Martzen et al., 1999; Chen et al., 2003), and gene networks and interactions (Bork et al., 2004; Lee et al., 2004b; Lee et al., 2010; Arabidopsis Interactome Mapping Consortium, 2011). Beyond these tools, assessing the biological consequences resulting from disruption of a gene (i.e. mutant phenotypes) provides potent clues to the function of a gene. Owing to this, comprehensive datasets of mutant phenotype data for each annotated gene in a genome are available for multiple model systems (Winzeler et al., 1999; Kamath et al., 2003; Boutros et al., 2004; Kim et al., 2010). Plant systems, however, lag behind in the availability of well-curated phenotype data. Only 13% of annotated genes in the model plant A. thaliana are associated with a mutant phenotype (Kuromori et al., 2006; Lloyd and Meinke, 2012; Savage et al., 2013), despite the presence of near-saturation mutagenesis resources (Kuromori et al., 2009). This is due in part to long generation times among plant species. For example, A. thaliana, a relatively fast-growing plant, has a generation time of 6 weeks (Meyerowitz, 1989), longer than those of other eukaryotic models such as yeast, fruit fly, or nematode. In addition, pervasive duplication of plant genes has resulted in extensive functional overlap between paralogs that masks the consequences of gene disruption, making the identification of phenotypes-of-interest difficult. Computational approaches may play a critical role in streamlining costly and time-consuming experimental analysis by providing candidate predictions of gene-gene interactions (You et al., 9 2010; Upstill-Goddard et al., 2013), genetic redundancy (Chen et al., 2010), and phenotype genes (Seringhaus et al., 2006; Yuan et al., 2012; Musso et al., 2014). Machine learning as a tool in biology Advancements in biotechnology have brought with them massive quantities of biological sequencing data. In response, biologists have begun to adopt computer science techniques to manipulate and mine these data for valuable biological insights. One notable data mining technique is machine learning (Tarca et al., 2007): computational algorithms designed to recognize patterns in data and make predictions. Briefly, two examples of machine learning algorithms are decision trees (e.g. Random Forests), which are constructed as flow chart-like structures based on a set of training data, and support vector machines (SVM), which use multidimensional planes to separate sets of labeled training data. Studies based on machine learning have made major impacts in a wide array of fields in biology. Early machine learning analyses were successful in identifying translation initiation sites in Escherichia coli (Stormo and Schneider, 1982) and predicting secondary structure of proteins (King and Sternberg, 1990). Machine learning methods have been widely applied in biomedical cancer research, where they effectively discriminate between healthy and diseased tissue (Furey et al., 2000; Guyon et al., 2002; Hilario et al., 2003; Wang et al., 2005) and classify tumor severity (Shipp et al., 2002; Ye et al., 2003) on the basis of expression and proteomic data. In more recent basic research, machine learning methods have been used to predict mutant phenotypes (Yuan et al., 2012; Zhang et al., 2013), generate gene functional networks (Bassel et al., 2011), classify stressresponsive gene expression (Zou et al., 2011), assess functional overlap between gene pairs (Chen et al., 2010), and characterize the features of gene duplicates (Moghe et al., 2014). Throughout this dissertation I make extensive use of machine learning techniques, specifically to 10 generate prediction models capable of classifying the functionality of a sequence and the essentiality of genes. Dissertation outline and significance Genes produce the functional protein and RNA products required by a cell, and thus represent a primary target for basic research and biological engineering. Given their fundamental importance, the identification of the number, location, and role of all genes within a genome represent critical and fundamental tasks in genome research. Toward the identification of genic (i.e. functional) sequences, Chapters 2 and 3 outline three key advancements: 1) systematic identification of the biochemical, evolutionary, and structural characteristics shared among functional genome regions, 2) establishment of a machine learning framework that accurately distinguishes functional and non-functional sequences, and 3) detailed evaluation of the evolutionary dynamics of ITRs, which represent functionally ambiguous genome regions. Overall, these chapters describe how functional genome sequences are distinct from nonfunctional ones and outline how these differences can be used to distinguish functional and noisy biochemical activity. Once functional sequences have been identified, the next question is what role they are performing in a cell. Toward this objective, Chapter 4 describes a machine learning framework that effectively distinguishes between genes with lethal and non-lethal mutant phenotypes. This resulted in a catalog of characteristics shared among essential genes, a gene set has been the target of research as they highlight the minimal gene set and biological processes that are required for life. Moreover, successful prediction of genes with lethal phenotypes provides a proof-of-concept to show that machine learning approaches represent a promising system for prioritizing genes for large-scale phenotyping analysis in plants. Importantly, predictions of both 11 functional genome regions and essential genes show translational potential, indicating that the ‘omics data available for model systems can be successfully leveraged to provide insight into potential gene function in non-model systems. 12 CHAPTER 2: DEFINING THE FUNCTIONAL SIGNIFICANCE OF INTERGENIC TRANSCRIBED REGIONS 1 1 The work described in this chapter has been submitted for publication: John P. Lloyd, Zing Tsung-Yeh Tsai, Rosalie P. Sowers, Nicholas L. Panchy, Shin-Han Shiu (2017) Defining the functional significance of intergenic transcribed regions. Submitted. 13 ABSTRACT With advances in transcript profiling, the presence of transcriptional activities in intergenic regions has been well established. However, whether intergenic expression reflects transcriptional noise or activity of novel genes remains unclear. We identified intergenic transcribed regions (ITRs) in 15 diverse flowering plant species and found that the amount of intergenic expression correlates with genome size, a pattern that could be expected if intergenic expression is largely non-functional. To further assess the functionality of ITRs, we first built machine learning classifiers using Arabidopsis thaliana as a model that accurately distinguish functional sequences (phenotype genes) and non-functional ones (pseudogenes and unexpressed intergenic regions) by integrating 93 biochemical, evolutionary, and sequence-structure features. Next, by applying the models genome-wide, we found that 4,427 ITRs (38%) and 796 annotated ncRNAs (44%) had features significantly similar to benchmark protein-coding or RNA genes and thus were likely parts of functional genes. Approximately 60% of ITRs and ncRNAs were more similar to non-functional sequences and should be considered transcriptional noise. The predictive framework established here provides not only a comprehensive look at how functional, genic sequences are distinct from likely non-functional ones, but also a new way to differentiate novel genes from genomic regions with noisy transcriptional activities. 14 INTRODUCTION Advances in sequencing technology have helped to identify pervasive transcription in intergenic regions with no annotated genes. These intergenic transcripts have been found in metazoa and fungi, including Homo sapiens (human) (ENCODE Project Consortium, 2012), Drosophila melanogaster (Brown et al., 2014), Caenorhabditis elegans (Boeck et al., 2016), and Saccharomyces cerevisiae (Nagalakshmi et al., 2008). In plants, ~7,000 and ~15,000 intergenic transcripts have also been reported in Arabidopsis thaliana (Yamada et al., 2003; Stolc et al., 2005; Moghe et al., 2013; Krishnakumar et al., 2015) and Oryza sativa (Nobuta et al., 2007), respectively. The presence of intergenic transcripts indicates that there may be additional genes in genomes that have escaped gene finding efforts thus far. Knowledge of the complete suite of functional elements present in a genome is an important goal for large-scale functional genomics efforts and the quest to connect genotype to phenotype. Thus the identification of functional intergenic transcribed regions (ITRs) represents a fundamental task that is critical to our understanding of the gene space in a genome. Loss-of-function study represents the gold standard by which the functional significance of genomic regions, including ITRs, can be confirmed (Ponting and Belgard, 2010; Niu and Jiang, 2013). In Mus musculus (mouse), at least 25 ITRs with loss-of-function mutant phenotypes have been identified (Sauvageau et al., 2013; Lai et al., 2015), indicating that they are bona fide genes. In addition, loss-of-function mutants have been used to confirm ITR functionality in mouse embryonic stem cell proliferation (Ivanova et al., 2006; Guttman et al., 2009) and male reproductive development (Heinen et al., 2009), as well as brain and eye development in Danio rerio (Ulitsky et al., 2011). In human, 162 long intergenic non-coding 15 RNAs harbor phenotype-associated SNPs, suggesting that these expressed intergenic regions may be functional (Ning et al., 2013). In addition to intergenic expression, most model organisms feature an abundance of annotated non-coding RNA (ncRNA) sequences (Zhao et al., 2016), which are mostly identified through the presence of expression occurring outside of annotated genes. Thus, the only difference between ITRs and most ncRNA sequences is whether or not they have been annotated. Similar to the ITR examples above, a small number of ncRNAs have been confirmed as functional through loss-of-function experimental characterization, including but not limited to Xist in mouse (Penny et al., 1996; Marahrens et al., 1997), Malat1 in human (Bernard et al., 2010), bereft in D. melanogaster (Hardiman et al., 2002), and At4 in A. thaliana (Shin et al., 2006). However, despite the presence of a few notable examples, the number of ITRs and ncRNAs with well-established functions is dwarfed by those with no known function. While some ITRs and ncRNAs are likely novel genes, intergenic transcription may also be the byproduct of noisy expression that can occur due to nonspecific landing of RNA Polymerase II (RNA Pol II) or spurious regulatory signals that drive expression in random genomic regions (Struhl, 2007). Thus, whether an intergenic transcript is considered functional cannot depend solely on the fact that it is expressed. In addition to being biochemically active, the genomic region must be under selection. This line of logic has revived the classical ideas on differentiating “causal role” and “selected effect” functionality (Doolittle et al., 2014). A “causal role” definition requires a definable activity to consider a genomic region as functional (Cummins, 1975; Amundson and Lauder, 1994), which is adopted by the ENCODE Consortium (ENCODE Project Consortium, 2012) to classify ~80% of the human genome as having biochemical functions. This finding has been used as evidence to disprove the presence of junk 16 DNA that is not under natural selection (Eddy, 2013). This has drawn considerable critique because biochemical activity itself is not an indication of selection (Graur et al., 2013; Niu and Jiang, 2013). Instead, selected effect functionality is advocated to be a more suitable definition for a genomic region with discernible activity (Amundson and Lauder, 1994; Graur et al., 2013; Doolittle et al., 2014). Under the selected effect functionality definition, ITRs and most annotated ncRNA genes remain functionally ambiguous. Functional ITRs represent genic sequences that have not been identified with conventional gene finding programs. Such programs incorporate sequence characteristics, transcriptional evidence, and conservation information to define genic regions that are expected to be functional. Thus, genes that lack the features typically associated with genic regions remain unidentified. Due to the debate on the definitions of function post-ENCODE, Kellis et al. (2014) suggested that evolutionary, biochemical, and genetic evidences provide complementary information to define functional genomic regions. Integration of chromatin accessibility, transcriptome, and conservation evidence was successful in identifying regions in the human genome that are under selection (Gulko et al., 2014). Moreover, a comprehensive integration of biochemical, evolutionary, and genetic evidence resulted in highly accurate identification of human disease genes and pseudogenes (Tsai et al., 2017). However, it is not known if such predictions are possible or if the features that define functional genomic regions in human are applicable in other species. In plants, even though many biochemical signatures are known to be associated with genic regions, these signatures have not been incorporated to assist in identifying the functional genomic regions. To investigate the functionality of intergenic transcription, we first identified ITRs in 15 flowering plant species with 17-fold genome size differences and evaluated the relationship 17 between the prevalence of intergenic expression and genome size. Next, we determined whether 93 evolutionary, biochemical, and sequence-structure features could distinguish functional sequences (phenotype genes) and non-functional ones (pseudogenes) using A. thaliana as a model. We then jointly considered all 93 features to establish functional gene prediction models using machine learning methods. Given that phenotype genes were composed of protein-coding sequences, prediction models were also generated by considering benchmark RNA genes to ensure that functional predictions were not exclusive to protein-coding genome regions. Finally, we applied the models to ITRs and annotated ncRNAs to determine whether these functionally ambiguous sequences are more similar to known functional or likely non-functional sequences. 18 RESULTS AND DISCUSSION Relationship between genome size and intergenic expression indicates that intergenic transcripts may generally be non-functional Transcription of an unannotated, intergenic region could be due to non-functional transcriptional noise or the activity of a novel gene. If noisy transcription occurs due to random landing of RNA Pol II or spurious regulatory signals, a naïve expectation is that, as genome size increases, the coverage of intergenic expression would increase accordingly. By contrast, we expect that the extent of expression for genic sequences will not be significantly correlated with genome size because larger plant genomes do not necessarily have more genes (r2=0.01; p=0.56; see Methods). Thus, to gauge if intergenic transcribed regions (ITRs) generally behave more like what we expect of noisy or genic transcription, we assessed the correlation between genome size and the coverage of intergenic expression occurring within 15 flowering plant species. We first identified genic and intergenic transcribed regions using leaf transcriptome data from 15 flowering plants with 17-fold differences in genome size (Supplemental Table 2.1). Identical numbers of RNA-sequencing (RNA-seq) reads (30 million) and the same mapping procedures were used in all species to facilitate cross-species comparisons (see Methods). Transcribed regions were considered as ITRs if they did not overlap with any gene annotation and had no significant translated sequence similarity to plant protein sequences. As expected, the coverage of expression originating from annotated genic regions had no significant correlation with genomes size (r2=0.03; p=0.53; Fig. 2.1A). In contrast, the coverage of intergenic expression occurring was significantly and positively correlated (r2=0.30; p=0.04; Fig. 2.1B). Because more intergenic expression is occurring in species with more genome space, this is 19 consistent with the interpretation that a significant proportion of intergenic expression represents transcriptional noise. However, the correlation between genome size and intergenic expression explained ~30% of the variation (Fig. 2.1B), suggesting that other factors also affect ITR content, including the possibility that some ITRs are truly functional, novel genes. To further evaluate the functionality of intergenic transcripts, we next identified the biochemical and evolutionary features of functional genic regions and tested whether intergenic transcripts in A. thaliana were more similar to functional or non-functional sequences. Expression, conservation, and epigenetic features are significantly distinct between benchmark functional and non-functional genomic sequences To determine whether intergenic transcripts resemble functional sequences, we first asked what features allow benchmark functional and non-functional genomic regions to be distinguished in the model plant, Arabidopsis thaliana. For benchmark functional sequences, we used genes with visible loss-of-function phenotypes when mutated (referred to as phenotype genes, n=1,876; see Methods). Because their mutations have significant growth and/or developmental impact and likely contribute to reduced fitness, these phenotype genes can be considered functional under the selected effect definition (Neander, 1991). For benchmark nonfunctional genomic regions, we utilized pseudogene sequences (n=761; see Methods). These pseudogenes exhibit sequence similarity to known genes, but harbor disabling mutations, including frame shifts and/or in-frame stop codons, that result in the production of presumably non-functional protein products. Considering that only 2% of pseudogenes are maintained over 90 million years of divergence between human and mouse (Svensson et al., 2006), it is expected that the majority of pseudogenes are no longer under selection (Li et al., 1981). 20 Figure 2.1 Relationship between genome size and transcription coverage. Transcription coverage is shown for (A) annotated genic regions and (B) intergenic regions excluding any annotated features. Each dot represents one of 15 flowering plant species. Mb: megabase. Gb: gigabase. Dotted lines: linear model fits. r2: square of Pearson’s correlation coefficient. 21 We evaluated 93 gene or gene product features for their ability to distinguish between phenotype genes and pseudogenes. These features were grouped into seven categories, including chromatin accessibility, DNA methylation, histone 3 (H3) marks, sequence conservation, sequence-structure, transcription factor (TF) binding, and transcription activity. Feature values (Supplemental Table 2.2) were calculated for a randomly-selected 500 base pair (bp) window inside a phenotype gene or pseudogene. We used Area Under the Curve - Receiver Operating Characteristic (AUC-ROC) as a metric to measure how well a feature distinguishes between phenotype genes and pseudogenes. AUC-ROC values range between 0.5 (random guessing) and 1 (perfect separation of functional and non-functional sequences), with AUC-ROC values of 0.7, 0.8, and 0.9 considered fair, good, and excellent performance, respectively. Among the seven feature categories, transcription activity features were highly informative (median AUCROC=0.88; Fig. 2.2A). Despite the strong performance of transcription activity-related features, the presence of expression (i.e. presence of transcript) evidence was a poor predictor of functionality (AUC-ROC=0.58; Fig. 2.2A). This is because 80% of pseudogenes were considered expressed in ≥1 of 51 RNA-seq datasets, demonstrating that presence of transcripts should not be used by itself as evidence of functionality. Sequence conservation, DNA methylation, TF binding, and H3 mark features were also fairly distinct between phenotype genes and pseudogenes (median AUC-ROC ~0.7 for each category; Fig. 2.2B-E). We also observed high performance variability within feature categories (see Supplemental Information). By contrast, chromatin accessibility and sequence-structure features were largely uninformative (median AUC-ROC=0.51 and 0.55, respectively; Fig. 2.2F,G). The poor performance of chromatin accessibility features is likely because the DNase I hypersensitive site (DHS) datasets were sparse, as only 2-6% of phenotype gene and pseudogene sequences overlapped a DHS 22 Figure 2.2 Predictions of functional and non-functional sequences based on single features. Prediction performance is measured using Area Under the Curve - Receiver Operating Characteristic (AUC-ROC). Features include those in the categories of (A) transcription activity, (B) sequence conservation, (C) DNA methylation, (D) transcription factor (TF) binding, (E) histone 3 (H3) marks, (F) sequence structure, and (G) chromatin accessibility. AUC-ROC ranges in value from 0.5 (equivalent to random guessing) to 1 (perfect predictions). Dotted lines: median AUC-ROC of features in a category. 23 peak. Further, median nucleosome occupancy of phenotype genes (median normalized nucleosome occupancy = 1.22) is only slightly lower than that of pseudogenes (median = 1.31; Mann Whitney U test, p < 2e-4). For sequence-structure features based on dinucleotide structures (see Methods), we found that poor performance was likely due to phenotype genes and pseudogenes sharing similar dinucleotide sequence compositions (r2=0.99, p<3e-16). The differences between genes and pseudogenes in transcription, conservation, and epigenetic features and functional genomic regions suggested that these features may individually provide sufficient information for distinguishing between functional and nonfunctional genomic regions. To assess this possibility, we next evaluated the error rates of function predictions based on single features. We first considered expression breadth of a sequence, the best predicting single feature of functionality. Despite high AUC-ROC (0.95; Fig. 2.2A), the false positive rate (FPR; % of pseudogenes predicted as phenotype genes) was 21% when only expression breadth was used, while the false negative rate (FNR; % of phenotype genes predicted as pseudogenes) was 4%. Similarly, the best-performing H3 mark- and sequence conservation-related features (Fig. 2.2B,E) had FPRs of 26% and 32%, respectively, and also incorrectly classified at least 10% of phenotype genes as pseudogenes. Thus, error rates are high even when considering well-performing single features, indicating the need to jointly consider multiple features for distinguishing phenotype genes and pseudogenes. Consideration of multiple features in combination produces accurate predictions of functional genomic regions To consider multiple features in combination, we first conducted principle component (PC) analysis to investigate how well phenotype genes and pseudogenes could be separated. Between the first two PCs, which jointly explain 40% of the variance in the feature dataset, 24 phenotype genes (Fig. 2.3A) and pseudogenes (Fig. 2.3B) were distributed in largely distinct space. However, there remains substantial overlap, indicating that standard parametric approaches are not well suited to distinguishing between benchmark functional and nonfunctional sequences. Thus, we instead considered all 93 features for phenotype gene and pseudogenes in combination using random forest (referred to as the full model; see Methods). The phenotype gene and pseudogene sequences and associated conservation, biochemical, and sequence-structure features were separated into distinct training and testing sets and the full model was generated and validated using independent data subsets (cross-validation). The full model provided more accurate predictions (AUC-ROC=0.98; FNR=4%; FPR=10%; Fig. 2.3C) than any individual feature (Fig. 2.2; Supplemental Table 2.3). An alternative measure of performance based on the precision (proportion of predicted functional sequences that are truly functional) and recall (proportion of truly functional sequences predicted correctly) values among predictions generated by the full model also indicated that the model was performing well (Fig. 2.3D). When compared to the best-performing single feature (expression breadth), the full model had a similar FNR but half the FPR (10% compared to 21%). Thus, the full model is highly capable of distinguishing between phenotype genes and pseudogenes. We next determined the relative contributions of different feature categories in predicting phenotype genes and pseudogenes and whether models based on a subset of features would perform similarly as the full model. Seven prediction models were established, each using only the subset of features from a single category (Fig. 2.2). Although none of these category-specific models had performance as high as the full model, the models based on transcription activity, sequence conservation, and H3 mark features scored highly (AUC-ROC=0.97, 0.92, and 0.91, respectively; Fig. 2.3C). Particularly, the transcription activity feature category model performed 25 Figure 2.3 Predictions of functional and non-functional sequences based on multiple features. Smoothed scatter plots of the first two principle components (PCs) of (A) phenotype gene and (B) pseudogene features. The percentages on the axes in (A) indicate the feature value variation explained by the associated PC. (C) AUC-ROC values of function prediction models built when considering all features (Full), all except transcription activity (TX)-related features (Full (-TX)), and all features from each category. The category abbreviations follow those in Fig. 2.2. (D) Precision-recall curves of the models with matching colors from (C). The models were built using feature values calculated from 500 bp sequence windows. 26 almost as well as the full model (FNR=6%, FPR=12%). We emphasize that the breadth and level of transcription are the causes of the strong performance of the transcription activity-only model, not the presence of expression evidence. To evaluate whether the strong performance of the full model is being driven solely by transcription activity-related features, we also built a function prediction model did not consider these features (full (-TX), Fig. 2.3 C,D). We found that the model excluding transcription activity features performed almost as well as the full model and similarly to the transcription activity-feature-only model, but with an increased FPR (AUCROC=0.96; FNR=3%; FPR=20%). This indicates that predictions of functional regions are not reliant solely on transcription data. Instead, a diverse array of features can be considered to make highly accurate predictions of the functionality of a genomic sequence. Meanwhile, our finding of the high performance of the transcription activity-only model highlights the possibility of establishing an accurate model for distinguishing functional genic and non-functional genomic sequences in plant species with only a modest amount of transcriptome data. To provide a measure of the potential functionality of any sequence in the A. thaliana genome, including ITRs and ncRNAs, we utilized the confidence score from the full model as a “functional likelihood” value (see Methods) (Tsai et al., 2017). The functional likelihood (FL) score ranges between 0 and 1, with high values indicating that a sequence is more similar to phenotype genes (functional) and low values indicating a sequence more closely resembles pseudogenes (non-functional). FL values for all genomic regions examined in this study are available in Supplemental Table 2.4. As expected, phenotype genes had high FL values (median=0.97; Fig. 2.4A) and pseudogenes had low values (median=0.01; Fig. 2.4B). To call sequences as functional or not, we defined a threshold FL value (0.35) by maximizing the Fmeasure (see Methods). Using this threshold, 96% of phenotype genes (Fig. 2.4A) and 90% of 27 Figure 2.4 Functional likelihood distributions based on the full model. (A) Phenotype genes. (B) Pseudogenes. (C) Annotated protein-coding genes. (D) Transposable elements. (E) Random unexpressed intergenic sequences. (F) Intergenic transcribed regions (ITR). (G) Araport11 ncRNAs. (H) TAIR10 ncRNAs. The full model was established using 500 bp sequence windows. Higher and lower functional likelihood values indicate greater similarity to phenotype genes and pseudogenes, respectively. Vertical dashed lines indicate the threshold for calling a sequence as functional or non-functional. The percentages to the left and right of the dashed line indicate the percent of sequences predicted as functional or non-functional, respectively. 28 pseudogenes (Fig. 2.4B) are correctly classified as functional and non-functional, respectively, demonstrating that the full model is highly capable of distinguishing functional and nonfunctional sequences. We next applied our model to predict the functionality of annotated protein-coding genes, transposable elements (TEs), and unexpressed intergenic regions. Most annotated protein-coding genes not included in the phenotype gene dataset had high FL scores (median=0.86; Fig. 2.4C) and 80% were predicted as functional. The features exhibited by lowscoring protein-coding genes and high-scoring pseudogenes are discussed in the Supplemental Information. By contrast, the FLs were low for both TEs (median=0.03, Fig. 2.4D) and unexpressed intergenic regions (median=0.07; Fig. 2.4E), and 99% of TEs and all unexpressed intergenic sequences were predicted as non-functional, further demonstrating the utility of the function prediction model. Overall, the FL measure provides a useful metric to distinguish between phenotype genes and pseudogenes. In addition, the FLs of annotated protein-coding genes, TEs, and unexpressed intergenic sequences agree with a priori expectations regarding the functionality of these sequences. Exclusion of features from multiple tissues increases prediction performance for narrowlyexpressed sequences We sought to apply functional prediction models to ITRs, which often exhibit narrow expression patterns (Supplemental Fig. 2.1A). However, given the association between transcription activity features and functional predictions (Fig. 2.2A; Fig. 2.3C), we first investigated how functional predictions performed for conditionally-functional and narrowlyexpressed sequences. We found that genes with conditional phenotypes (see Methods) had no significant differences in FLs (median=0.96) as those with phenotypes under standard growth conditions (median=0.97; U test, p=0.38, Supplemental Fig. 2.2A). Thus, our model can capture 29 conditionally functional sequences. Next, we evaluated FL distributions among sequences with different breadths of expression. For this comparison, we focused on non-stress, single-tissue expression datasets (Supplemental Table 2.5), which was distinct from the expression breadth feature in the prediction model that considered all datasets. While phenotype genes were better predicted than pseudogenes among sequences with the same number of tissues with expression evidence (U tests, all p < 1.7E-06; Supplemental Fig. 2.2B), 65% of the 62 phenotype genes expressed in ≤3 tissues were predicted as non-functional. Further, there was a significant correlation between the number of tissues with expression evidence and FL values of all sequences in our analysis (r2=0.77; p < 2E-16). Consistent with misclassifications among narrowly-expressed phenotype genes, a key difference between 80 pseudogenes predicted as functional (high-FL) and 683 pseudogenes predicted as non-functional was that high-FL pseudogenes were more highly and broadly expressed (Fig. 2.5). Thus, the function prediction model is biased against narrowly-expressed sequences, regardless of whether they are functional or not. To tailor functional predictions to narrowly-expressed sequences, we generated a “tissueagnostic” model that attempts to minimize the contribution of biochemical activities occurring in many tissues by excluding expression breadth and features that were available across multiple tissues (see Methods). The tissue-agnostic model performed similarly to the full model (AUCROC=0.97; FNR=4%; FPR=15%; Supplemental Fig. 2.3; Supplemental Table 2.4). Importantly, the proportion of phenotype genes expressed in ≤3 tissues predicted as functional increased by 23% (35% in the full model to 58% in the tissue-agnostic model, Supplemental Fig. 2.1C), indicating that the tissue-agnostic model is more suitable for predicting the functionality of narrowly-expressed sequences than the full model, although there was an increase in FPR (from 30 Figure 2.5 Proportion of sequence classes predicted as functional in the full and singlecategory models. Percentages of sequence classes that are predicted as functional in models based on all features and the single category models, each using all features from a category (abbreviated according to Fig. 2.2). The models are sorted from left to right based on performance (AUC-ROC). The colors of and numbers within the blocks indicate the proportion sequences predicted as functional by a given model. Phenotype gene and pseudogene sequences are shown in three sub-groups: all sequences (All), and those predicted as functional (high functional likelihood (FL)) and non-functional (low FL) in the full model. ITR: intergenic transcribed regions. 31 10% to 15%). We next sought to evaluate the FL of ITR and annotated ncRNA sequences utilizing both the full model and the tissue-agnostic model, as these sequences were often narrowly-expressed (Supplemental Fig. 2.5A). Intergenic transcribed regions and annotated ncRNAs are mostly predicted as nonfunctional A subset of ITRs and ncRNAs likely represent novel genes or unannotated exon extensions of known genes (Johnson et al., 2005; Moghe et al., 2013). Nevertheless, most ITRs and ncRNAs are functionally ambiguous, as they are predominantly identified by the presence of expression evidence and few have been characterized genetically. To evaluate the functionality of ITRs and ncRNAs, we applied both the full and tissue-agnostic models to 895 ITRs, 136 TAIR ncRNAs, and 252 Araport long ncRNAs (referred to as Araport ncRNAs; see Methods) that do not overlap with any annotated genome features. Consistent with previous studies (Moghe et al., 2013), ITRs and ncRNAs in our dataset were narrowly and weakly expressed and poorly conserved compared to phenotype genes, and ITRs in particular had biochemical characteristics that were generally more consistent with pseudogenes (Supplemental Fig. 2.4). The median FLs based on the full model were low (0.09) for both ITRs (Fig. 2.4F) and Araport ncRNAs (Fig. 2.4G), and only 15% and 9% of these sequences were predicted as functional, respectively. By contrast, TAIR ncRNAs had a significantly higher median FL value (0.53; U tests, both p<5e-31; Fig. 2.4H) and 68% were predicted as functional. The higher proportion of functional TAIR ncRNA predictions compared to ITRs and Araport ncRNAs could be best explained by differences in features from the transcription activity category (Fig. 2.5; Supplemental Fig. 2.4). We also note that a greater proportion of ITRs and Araport ncRNAs are predicted as functional when considering only DNA methylation or H3 mark features (Fig. 2.5). 32 However, these two category-specific models also had higher false positive rates (unexpressed intergenic sequences and pseudogenes, Fig. 2.5). Thus, single feature-category models do not provide additional support for the functionality of most Araport ncRNAs and ITRs. We next applied the tissue-agnostic model that is less biased against narrowly-expressed sequences (Supplemental Fig. 2.2C) to ITRs and TAIR/Araport ncRNAs that were generally narrowly-expressed (Supplemental Fig. 2.1A). Compared to the full model, around twice as many ITRs (30%) and Araport ncRNAs (19%) but a similar number of TAIR ncRNA (67%) were predicted as functional. Considering the union of the full and tissue-agnostic model predictions, 268 ITRs (32%), 57 Araport ncRNAs (23%), and 105 TAIR ncRNAs (77%) were likely functional. ITRs and annotated ncRNAs closer to annotated genes tended to be predicted as functional (Supplemental Fig. 2.5A). Using the 95th percentile of intron lengths for all genes as a threshold to call ITRs and annotated ncRNAs as proximal or distal to neighboring genes, 57% of likely functional and 35% of likely non-functional ITRs and ncRNAs were proximal to neighboring genes, respectively (FET, p < 2E-09). To assess if a subset these likely functional, proximal ITRs/ncRNAs may be unannotated exons of known genes, we assessed whether they tended to have similar features with their neighbors. Compared to feature similarities between neighboring and random gene pairs (Supplemental Fig. 2.5B-D), likely functional ITRs/ncRNAs were less similar to their neighbors, regardless of proximity (Supplemental Fig. 2.7C,D). Thus, despite their proximity to annotated genes, it remains unclear if some ITRs or annotated ncRNAs represent unannotated exon extensions of known genes or not. In addition, for proximal functional ITRs/annotated ncRNAs, we cannot rule out the possibility that they represent falsepositive functional predictions due to the accessible and active chromatin states of nearby genes. Given the challenge in ascertaining the origin of likely functional, proximal ITRs/ncRNAs, we 33 instead conservatively estimate that 116 distal, functional ITRs and annotated ncRNAs may represent fragments of novel genes. Intergenic transcribed regions and annotated ncRNAs do not resemble benchmark RNA genes Thus far, we predicted the majority of ITR and annotated ncRNA sequences as nonfunctional. We demonstrated that the full model was able to predict conditional phenotype genes (Supplemental Fig. 2.2A) and the tissue-agnostic model was more effective than the full model in predicting narrowly expressed phenotype genes (Supplemental Fig. 2.2B,C). Thus, conditional or tissue-specific functionality do not fully explain why the majority of ITRs and ncRNAs are predicted as non-functional. However, the function prediction models so far were built by contrasting protein-coding genes with pseudogenes and it remains possible that these proteincoding gene-based models can not accurately predict RNA genes. To evaluate this possibility, we generated a tissue-agnostic model using features calculated from a randomly-selected 100 bp sequence within a phenotype protein-coding gene or pseudogene body (for features, see Supplemental Table 2.6). The reason for using 100 bp sequences is that most RNA genes are too short to be considered by earlier models, which were based on 500 bp sequences. In addition, features from the tissue agnostic model are more suitable for RNA gene prediction as annotated RNA genes tend to be more narrowly expressed than phenotype genes (U tests, all p < 2e-05; Supplemental Fig. 2.1B). The 100 bp tissue-agnostic model performed similarly to the full 500 bp model in distinguishing between phenotype protein-coding genes and pseudogenes, except with higher FNR (AUC-ROC=0.97; FNR=13%; FPR=5%; Supplemental Fig. 2.6), but only predicted three out of six RNA genes with documented mutant phenotypes (phenotype RNA genes) as functional (Supplemental Fig. 2.6I). Further, other RNA Pol II-transcribed RNA genes 34 exhibited mixed predictions from the 100 bp tissue-agnostic model, as 15% of microRNA (miRNA) primary transcripts (Supplemental Fig. 2.6J), 73% of small nucleolar RNAs (snoRNAs; Supplemental Fig. 2.6K), and 50% of small nuclear RNAs (snRNAs; Supplemental Fig. 2.6L) were predicted as functional. Although the proportion of phenotype RNA genes predicted as functional (50%) is significantly higher than the proportion of pseudogenes predicted as functional (5%, FET, p < 0.004), this finding suggests that a model built with protein-coding genes has a substantial FNR for detecting RNA genes. To determine whether the suboptimal predictions by the phenotype protein-coding genebased models are because RNA genes belong to a class of their own, we next built a multi-class function prediction model aimed at distinguishing four classes of sequences: benchmark RNA genes (n=46), phenotype protein-coding genes (1,882), pseudogenes (3,916), and randomlyselected, unexpressed intergenic regions (4,000). Benchmark RNA genes include six phenotype RNA genes and 40 high-confidence miRNA primary transcript sequences (see Methods). RNA phenotype genes exhibit a phenotype when mutated and are likely under selection, fulfilling the selected effect definition of functionality. However, the lack of a sizeable sample of RNA genes with documented phenotypes required that we also include annotated RNA genes with no phenotype information. In addition, unexpressed intergenic sequences were included to provide another set of likely non-functional sequences distinct from pseudogenes. Expression breadth and tissue-specific features were excluded from the four-class model and 100 bp sequences were used. In the four-class model, 87% of benchmark RNA genes, including all six phenotype RNA genes, were predicted as functional sequences (65% RNA gene-like and 22% phenotype proteincoding gene-like; Fig. 2.6A). In addition, 95% of phenotype genes were predicted as functional (Fig. 2.6B), including 80% of narrowly expressed genes, an increase of 22% over the 500 bp 35 Figure 2.6 Function predictions based on a four-class prediction model. (A) Stacked bar plots indicate the prediction scores of benchmark RNA genes for each of the four classes: dark blue - phenotype protein-coding gene (Ph), cyan - RNA gene (RNA), red - pseudogene (Ps), yellow – random intergenic sequence (Ig). A benchmark RNA gene is classified as one of the four classes according to the highest prediction score. The color bars below the chart indicate the predicted class, with the same color scheme as the prediction score. Sequences classified as Ph or RNA were considered functional, while those classified as Ps or Ig were considered nonfunctional. Percentages below a classification region indicate the proportion of sequences classified as that class. (B) Phenotype protein-coding gene prediction scores. (C) Pseudogene 36 Figure 2.6 (cont’d) prediction scores. (D) Random unexpressed intergenic region prediction scores. Note that no sequence was predicted as functional. (E) Intergenic transcribed region (ITR), (F) Araport11 ncRNA regions. (G) TAIR10 ncRNA regions. 37 tissue-agnostic model (Supplemental Fig. 2.2C). For the remaining two sequence classes, 70% of pseudogenes (Fig. 2.6C) and 100% of unexpressed intergenic regions (Fig. 2.6D) were predicted as non-functional (either pseudogenes or unexpressed intergenic sequences). Thus, the four-class model improves prediction accuracy of RNA genes and narrowly expressed genes. However, the inclusion of RNA genes in the model has significantly increased the ambiguity in pseudogene classification. Since the four-class model was able to distinguish benchmark RNA genes from nonfunctional sequence classes, we next evaluated whether ITRs and annotated ncRNAs resemble functional sequences with the four-class model. Note that the 100 bp model used here allowed us to evaluate an additional 10,938 ITRs and 1,406 annotated ncRNAs. We found that 34% of ITR, 38% of Araport ncRNA, and of 65% TAIR ncRNAs were predicted as functional sequences (Fig. 2.6E-G). More specifically, 20% or fewer ITR and annotated ncRNA sequences were most similar to RNA genes (Fig. 2.6E-G), suggesting that most are not functioning as miRNA, which represented the majority of benchmark RNA sequences. However, other potential roles as RNA regulators, including cis-acting (Guil and Esteller, 2012) and competitive endogenous (Tan et al., 2015) regulatory functions, should be further studied. To provide an overall estimate of the proportion of likely-functional ITRs and annotated ncRNAs, we considered the predictions from the four-class model (Fig. 2.6), the full model (Fig. 2.3,2.4), and the tissue-agnostic models (Supplemental Fig. 2.3,2.6). Based on support from at least one of the four models, we classified 4,437 ITRs (38%) and 796 annotated ncRNAs (44%) as functional, as they resembled either phenotype protein-coding or RNA genes. Our findings lend support that they are likely parts of novel or annotated genes. Meanwhile, we find that a substantial number of ITRs (62%) and annotated ncRNAs (56%) are predicted as non-functional. Moreover, at least a third of ITRs 38 (Fig. 2.6E) and Araport ncRNAs (Fig. 2.6F) most closely resemble unexpressed intergenic regions. Thus, we show that the majority of ITRs and annotated ncRNA regions resemble nonfunctional genomic regions, and therefore could represent regions of noisy transcription. 39 CONCLUSION Discerning the location of functional regions within a genome represents a key goal in genomic biology. Despite advances in computational gene finding, it remains challenging to determine whether intergenic transcribed regions (ITRs) represent functional or noisy biochemical activity. We established robust function prediction models based on the evolutionary, biochemical, and structural characteristics of phenotype genes and pseudogenes. The prediction models accurately define functional and non-functional regions and are applicable genome-wide. These results echo recent findings that human phenotype genes could be distinguished from pseudogenes (Tsai et al., 2017). Given that function predictions were successful in both plant and metazoan model systems, integrating the evolutionary and biochemical features of known genes will likely be applicable to any species. The next step will be to test whether function prediction models can be applied across species, which could ultimately allow the phenotype data and ‘omics resources available in model systems to effectively guide the identification of functional regions in non-models. Expression data was highly informative to functional predictions. We found that the prediction model based on only 24 transcription activity-related features performs nearly as well as the full model that integrates additional information including conservation, H3 mark, methylation, and TF binding data. In human, use of transcription data from cell lines also produced highly accurate predictions of functional genomic regions (AUC-ROC=0.96) (Tsai et al., 2017). Despite the importance of transcription data, we emphasize that the presence of expression evidence is an extremely poor predictor. Taken together, these results indicate that function prediction models can be established in any species, model or not, with a modest 40 number of transcriptome datasets (e.g. 51 in this study and 19 in human). One caveat of the current model is that narrowly-expressed phenotype genes are frequently predicted as pseudogene and broadly-expressed pseudogenes tend to be called functional. To improve the function prediction model, it will be important to explore additional features unrelated to transcription. Because few phenotype genes are narrowly-expressed (5%) in the A. thaliana training data, more phenotyping data for narrowly expressed genes will be crucial as well. Upon application of the function prediction models genome-wide, we found that 4,427 ITRs and 796 annotated ncRNAs in A. thaliana are likely functional. Assuming each entry equals a novel gene, this estimate represents a 19% increase in annotated gene space (excluding annotated ncRNAs) for the model plant. However, considering the high false positive rates (e.g. 10% for the full and 31% for the four-class model), this is most likely an overestimate of the number of novel genes contributed by functional ITRs and annotated ncRNAs. In addition, we emphasize that the majority of ITRs and ncRNAs resemble pseudogenes and random unexpressed intergenic regions. Similarly, most human ncRNAs are more similar to nonfunctional sequences than they are to protein coding and RNA genes (Tsai et al., 2017). Furthermore, the significant relationship between the amount of intergenic expression occurring in a species and the size of a genome is consistent with the interpretation that intergenic transcripts are generally non-functional. Thus, instead of assuming any expressed sequence must be functionally significant, we advocate that the null hypothesis should be that it is not, particularly considering that most ITRs and annotated ncRNAs have not been experimentally characterized. The machine learning framework we have described provides an approach for distinguishing between functional and noisy biochemical activity, and will help defining the gene space in a genome. 41 METHODS Identification of transcribed regions in leaf tissue of 15 flowering plants RNA-sequencing (RNA-seq) datasets were retrieved from the Sequence Read Archive (SRA) at the National Center for Biotechnology Information (NCBI; www.ncbi.nlm.nih.gov/sra/) for 15 flowering plant species (Supplemental Table 2.1). All datasets were generated from leaf tissue and sequenced on Illumina HiSeq 2000 or 2500 platforms. Genome sequences and gene annotation files were downloaded from Phytozome v.11 (www.phytozome.net) (Goodstein et al., 2012) or Oropetium Base v.01 (www.sviridis.org) (VanBuren et al., 2015). Genome sequences were repeat masked using RepeatMasker v4.0.5 (www.repeatmasker.org) if a repeat-masked version was not available. Only one end from paired-end read datasets were utilized in downstream processing. Reads were trimmed to be rid of low scoring ends and residual adaptor sequences using Trimmomatic v0.33 (LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:20) (Bolger et al., 2014) and mapped to genome sequences using TopHat v2.0.13 (default parameters except as noted below) (Kim et al., 2013). Reads ≥20 nucleotides in length that mapped uniquely within a genome were used in further analysis. For each species, thirty million mapped reads were randomly selected from among all datasets and assembled into transcript fragments using Cufflinks v2.2.1 (default parameters except as noted below) (Trapnell et al., 2010), while correcting for sequence-specific biases during the sequencing process by providing an associated genome sequence with the -b flag. The expected mean fragment length for assembled transcript fragments in Cufflinks was set to 150 from the default of 200 so that expression levels in short fragments would not be overestimated. 42 The 1st and 99th percentile of intron lengths for each species were used as the minimum and maximum intron lengths, respectively, for both the TopHat2 and Cufflinks steps. Intergenic transcribed regions (ITRs) were defined by transcript fragments that did not overlap with gene annotation and did not have significant six-frame translated similarity to plant protein sequences in Phytozome v.10 (BLASTX E-value < 1E-05). The correlation between assembled genome size and gene counts was determined with data from the first 50 published plant genomes (Michael and Jackson, 2013). Phenotype data sources Mutant phenotype data for A. thaliana protein-coding genes was collected from a published dataset (Lloyd and Meinke, 2012), the Chloroplast 2010 database (Ajjawi et al., 2010; Savage et al., 2013), and the RIKEN phenome database (Kuromori et al., 2006) as described by Lloyd et al. (Lloyd et al., 2015). Phenotype genes used in our analyses were those whose disruption resulted in lethal or visible defects under standard laboratory growth conditions. Genes with documented mutant phenotypes under standard conditions were considered as a distinct and non-overlapping category from other annotated protein-coding genes. We identified six RNA genes with documented loss-of-function phenotypes through literature searches (Supplemental Table 2.7): At4 (AT5G03545) (Shin et al., 2006), MIR164A and MIR164D (AT2G47585 and AT5G01747, respectively) (Guo et al., 2005), MIR168A (AT4G19395) (Li et al., 2012b), and MIR828A and TAS4 (AT4G27765 and AT3G25795, respectively) (Hsieh et al., 2009). Conditional phenotype genes were those belonging to the Conditional phenotype group as described by Lloyd and Meinke (Lloyd and Meinke, 2012). Loss-of-function mutants of these genes exhibited phenotype only under stress conditions. 43 Arabidopsis thaliana genome annotation A. thaliana protein-coding gene, miRNA gene, snoRNA gene, snRNA gene, ncRNA region, pseudogene, and transposable element annotations were retrieved from The Arabidopsis Information Resource v.10 (TAIR10; www.arabidopsis.org) (Berardini et al., 2015). Additional miRNA gene and lncRNA region annotations were retrieved from Araport v.11 (www.araport.org). A primary difference between the TAIR ncRNAs and Araport lncRNAs (referred to as Araport ncRNAs in the Results & Discussion section) is the date in which they were annotated. For example, 221 ncRNAs were present in the v.7 release of TAIR, which dates back to 2007 (TAIR10 contains 394 ncRNA annotations) (Swarbreck et al., 2008; Lamesch et al., 2012; Berardini et al., 2015). However, Araport lncRNAs were annotated in the past five years (Krishnakumar et al., 2015). Thus, that TAIR ncRNAs are generally more highly and broadly expressed is likely a result of the less sensitive transcript identification methods available for early TAIR releases. A pseudogene-finding pipeline (Zou et al., 2009) was used to identify additional pseudogene fragments and count the number of disabling mutations (premature stop or frameshift mutations). Genes, pseudogenes, and transposons with overlapping annotation were excluded from further analysis. Overlapping lncRNA annotations were merged for further analysis. When pseudogenes from TAIR10 and the pseudogene-finding pipeline overlapped, the longer pseudogene annotation was used. A. thaliana ITRs analyzed include: (1) the Set 2 ITRs in Moghe et al. (Moghe et al., 2013), (2) the novel transcribed regions from Araport v.11, and (3) additional ITRs from 206 RNA-seq datasets (Supplemental Table 2.5). Reads were trimmed, mapped, and assembled into transcript fragments as described above, except that overlapping transcript fragments from across datasets were merged. ITRs analyzed did not overlap with any TAIR10, Araport11, or 44 pseudogene annotation. Overlapping ITRs from different annotated subsets were kept based on a priority system: Araport11 > Set 2 ITRs from Moghe et al. (Moghe et al., 2013) > ITRs identified in this study. For each sequence entry (gene, ncRNA, pseudogene, transposable element, or ITR), a 100 and 500 base pair (bp) window was randomly chosen for calculating feature values and subsequent model building steps. Feature descriptions are provided in the following sections. The feature values for randomly selected 500 and 100 bp windows are provided in Supplemental Tables 2.2 and 2.6, respectively. Additionally, non-expressed intergenic sequences were randomly-sampled from genome regions that did not overlap with annotated genes, pseudogenes, transposable elements, or regions with genic or intergenic transcript fragments (100 bp, n=4,000; 500 bp, n=3,716). All 100 and 500 bp windows described above are referred to as sequence windows throughout the Methods section. Sequence conservation and structure features There were 10 sequence conservation features examined. The first two were derived from comparisons between A. thaliana accessions including nucleotide diversity and Tajima’s D among 81 accessions (Cao et al., 2011) using a genome matrix file from the 1,001 genomes database (www.1001genomes.org). The python scripts are available through GitHub (https://github.com/ShiuLab/GenomeMatrixProcessing). The remaining eight features were derived from cross-species comparisons, three based on multiple sequence alignments and five based on pairwise alignments. Three multiple sequence alignment-based features were established using aligned genomic regions between A. thaliana and six other plant species (Glycine max, Medicago truncatula, Populus trichocarpa, Vitis vinifera, Sorghum bicolor, and Oryza sativa) (Li et al., 2012a), which are referred to as conserved blocks. For each conserved block, the first feature was the proportion of a sequence window that overlapped a conserved 45 block (referred to as coverage), and the two other features were the maximum and average phastCons scores within each sequence window. The phastCons score was determined for each nucleotide within conserved blocks (Li et al., 2012a). Nucleotides in a sequence window that did not overlap with a conserved block were assigned a phastCons score of 0. For each sequence window, five pairwise alignment-based cross-species conservation features were the percent identities to the most significant BLASTN match (if E-value<1E-05) in each of five taxonomic groups. The five taxonomic groups included the Brassicaceae family (nspecies=7), other dicotyledonous plants (22), monocotyledonous plants (7), other embryophytes (3), and green algae (5). If no sequence with significant similarity was present, percent identity was scored as zero. For sequence-structure features, we used 125 conformational and thermodynamic dinucleotide properties collected from DiProDB database (Friedel et al., 2009). Because the number of dinucleotide properties was high and dependent, we reduced the dimensionality by utilizing principal component (PC) analysis as described previously (Tsai et al., 2015). Sequence-structure values corresponding to the first five PCs were calculated for all dinucleotides in and averaged across the length of a sequence window and used as features when building function prediction models. Transcription activity features We generated four multi-dataset and 20 individual dataset transcription activity features. To identify a set of RNA-seq datasets to calculate multi-dataset features, we focused on the 72 of 206 RNA-seq datasets each with ≥20 million reads (see above; Supplemental Table 2.5). Transcribed regions were identified with TopHat2 and Cufflinks as described in the RNA-seq analysis section except that the 72 A. thaliana RNA-seq datasets were used. Following transcript 46 assembly, we excluded 21 RNA-seq datasets because they had unusually high RPKM (Reads Per Kilobase of transcript per Million mapped reads) values (median RPKM value range=272~2,504,294) compared to the rest (2~252). The remaining 51 RNA-seq datasets were used to generate four multi-dataset transcription activity features including: expression breadth, 95th percentile expression level, maximum transcript coverage, and presence of expression evidence (for values see Supplemental Table 2.2, 2.6). Expression breadth was the number of RNA-seq datasets that have ≥1 transcribed region that overlapped with a sequence window. The 95th percentile expression level was the 95th percentile of RPKM values across 51 RNA-seq datasets where RPKM values were set to 0 if there was no transcribed region for a sequence window. Maximum transcript coverage was the maximum proportion of a sequence window that overlapped with a transcribed region across 51 RNA-seq datasets. Presence of expression evidence was determined by overlap between a sequence window and any transcribed region in the 51 RNA-seq datasets. In addition to features based on multiple datasets, 20 individual dataset features were derived from 10 datasets: seven tissue/organ-specific RNA-seq datasets including pollen (SRR847501), seedling (SRR1020621), leaf (SRR953400), root (SRR578947), inflorescence (SRR953399), flower, (SRR505745) and silique (SRR953401), and three datasets from nonstandard growth conditions, including dark-grown seedlings (SRR974751) and leaf tissue under drought (SRR921316) and fungal infection (SRR391052). For each of these 10 RNA-seq datasets, we defined two features for each sequence window: the maximum transcript coverage (as described above) and the maximum RPKM value of overlapping transcribed regions (referred to as Level in Fig. 2.2). If no transcribed regions overlapped a sequence window, the maximum RPKM value was set as 0. For the analysis of narrowly- and broadly-expressed phenotype genes 47 and pseudogenes (Supplemental Fig. 2.2B,C), we used 28 out of 51 RNA-seq datasets generated from a single tissue and in standard growth conditions to calculate the number of tissues with evidence of expression (tissue expression breadth). In total, seven tissues were represented among the 28 selected RNA-seq datasets (see above; Supplemental Table 2.5), and thus tissue expression breadth ranges from 0 to 7 (note that only 1 through 7 are shown in Supplemental Fig. 2.2B,C due to low sample size of phenotype genes in the 0 bin). The tissue breadth value is distinct from the expression breadth feature used in model building that was generated using all 51 datasets and considered multiple RNA-seq datasets from the same tissue separately (range: 051). Histone 3 mark features Twenty histone 3 (H3) mark features were calculated based on eight H3 chromatin immunoprecipitation sequencing (ChIP-seq) datasets from SRA. The H3 marks examined include four associated with activation (H3K4me1: SRR2001269, H3K4me3: SRR1964977, H3K9ac: SRR1964985, and H3K23ac: SRR1005405) and four associated with repression (H3K9me1: SRR1005422, H3K9me2: SRR493052, H3K27me3: SRR3087685, and H3T3ph: SRR2001289). Reads were trimmed as described in the RNA-seq section and mapped to the TAIR10 genome with Bowtie v2.2.5 (default parameters) (Langmead et al., 2009). Spatial Clustering for Identification of ChIP-Enriched Regions v.1.1 (Xu et al., 2014) was used to identify ChIP-seq peaks with a false discover rate ≤ 0.05 with a non-overlapping window size of 200, a gap parameter of 600, and an effective genome size of 0.92 (Koehler et al., 2011). For each H3 mark, two features were calculated for each sequence window: the maximum intensity among overlapping peaks and peak coverage (proportion of overlap with the peak that overlaps maximally with the sequence window). In addition, four multi-mark features were generated. 48 Two of the multi-mark features were the number of activating marks (0-4) overlapping a sequence window and the proportion of a sequence window overlapping any peak from any of the four activating marks (activating mark peak coverage). The remaining two multi-mark features were the same as the two activating multi-mark features except focused on the four repressive marks. DNA methylation features Twenty-one DNA methylation features were calculated from bisulfite-sequencing (BSseq) datasets from seven tissues (pollen: SRR516176, embryo: SRR1039895, endosperm: SRR1039896, seedling: SRR520367, leaf: SRR1264996, root: SRR1188584, and inflorescence: SRR2155684). BS-seq reads were trimmed as described above and processed with Bismark v.3 (default parameters) (Krueger and Andrews, 2011) to identify methylated and unmethylated cytosines in CG, CHH, and CHG (H = A, C, or T) contexts. Methylated cytosines were defined as those with ≥5 mapped reads and with >50% of mapped reads indicating that the position was methylated. For each BS-seq dataset, the percentage of methylated cytosines in each sequence window for CG, CHG, and CHH contexts were calculated if the sequence window had ≥5 cytosines with ≥5 reads mapping to the position. To determine whether the above parameters where reasonable, we assessed the false positive rate of DNA methylation calls by evaluating the proportion of cytosines in the chloroplast genome that are called as methylated, as the chloroplast genome has few DNA methylation events (Ngernprasirtsiri et al., 1988; Zhang et al., 2006). Based on the above parameters, 0-1.5% of cytosines in CG, CHG, or CHH contexts in the chloroplast genome were considered methylated in any of the seven BS-seq datasets. This indicated that the false positive rates for DNA methylation calls were low and the parameters were reasonable. 49 Chromatin accessibility and transcription factor binding features Chromatin accessibility features consisted of ten DHS-related features and one micrococcal nuclease sequencing (MNase-seq)-derived feature. DHS peaks from five tissues (seed coat, seedling, root, unopened flowers, and opened flowers) were retrieved from the Gene Expression Omnibus (GSE53322 and GSE53324) (Sullivan et al., 2014). For each of the five tissues, the maximum DHS peak intensity and DHS peak coverage were calculated for each sequence window. Normalized nucleosome occupancy per bp based on MNase-seq was obtained from Liu et al. (Liu et al., 2015). The average nucleosome occupancy value was calculated across each sequence window. Transcription factor (TF) binding site features were based on in vitro DNA affinity purification sequencing data of 529 TFs (O’Malley et al., 2016). Two features were generated for each sequence window: the total number of TF binding sites and the number of distinct TFs bound. Single-feature prediction performance The ability for each single feature to distinguish between functional and non-functional regions was evaluated by calculating AUC-ROC value with the Python scikit-learn package (Pedregosa et al., 2011). AUC-ROC values range between 0.5 (equivalent to random guessing) and 1 (perfect predictions) and values above 0.7, 0.8, and 0.9 are considered to be fair, good, and excellent, respectively. Thresholds to predict sequences as functional or non-functional using a single feature were defined by the feature value that produced the highest F-measure, the harmonic mean of precision (proportion of sequences predicted as functional that are truly functional) and recall (proportion of truly functional sequences predicted as functional). The Fmeasure allows consideration of both false positives and false negatives at a given threshold. FPR were calculated as the percentage of negative (non-functional) cases with values above or 50 equal to the threshold and thus falsely predicted as functional. FNR were calculated as the percentage of positive (functional) cases with values below the threshold and thus falsely predicted as non-functional. Binary classification with machine learning For binary classification (two-class) models that contrasted phenotype genes and pseudogenes, the random forest (RF) implementation in the Waikato Environment for Knowledge Analysis software (WEKA) (Hall et al., 2009) was utilized. Three types of two-class models were established, including the full model (500 bp sequence window, Fig. 2.3C,D and Fig. 2.4), tissue-agnostic models (500 bp, Supplemental Fig. 2.3; 100 bp, Supplemental Fig. 2.6), and single feature category models (Fig. 2.3C,D). For each model type, we first generated 100 balanced datasets by randomly selecting equal numbers of phenotype genes (positive examples) and pseudogenes (negative examples). For each of these 100 datasets, 10-fold stratified crossvalidation was utilized, where the model was trained using 90% of sequences and tested on the remaining 10%. Thus, for each model type, a sequence window had 100 prediction scores, where each score was the proportion of 500 random forest trees that predicted a sequence as a phenotype gene in a balanced dataset. The median of 100 prediction scores was used as the functional likelihood (FL) value (Supplemental Table 2.4). The FL threshold to predict a sequence as functional or non-functional was defined based on maximum F-measure as described in the previous section. We tested multiple -K parameters (2 to 25) in the WEKA-RF implementation, which alters the number of randomly-selected features included in each RF tree (Supplemental Table 2.8), and found that 15 randomly-selected features provided the highest performance based on AUC-ROC (calculated and visualized using the ROCR package) (Sing et al., 2005). Feature importance was assessed by excluding one feature at a time to determine the 51 associated reduction in prediction performance (Supplemental Table 2.9). All leave-one-out models performed well (AUC-ROC >0.97), indicating that no single feature was dominating the function predictions and/or many features are correlated (Supplemental Fig. 2.7). Binary classification models were also built using all features from 500 bp sequences (equivalent to the full model) with the Sequential Minimal Optimization - Support Vector Machine (SMO-SVM) implementation in WEKA (Hall et al., 2009). The results of SMO-SVM models were highly similar to the full RF results: PCC between the FL values generated by RF and SMOSVM=0.97; AUC-ROC of SMO-SVM=0.97; FPR=12%; FNR=3%. By comparison, the full RF model had AUC-ROC=0.98, FPR=10%, FNR=4%. Tissue-agnostic models were generated by excluding the expression breadth feature and 95th percentile expression level and replacing all features from RNA-seq, BS-seq, and DHS datasets that were available in multiple tissues. For multiple-tissue RNA-seq data, the maximum expression level across 51 RNA-seq datasets (in RPKM) and maximum coverage (as described in the transcription activity section) of a sequence window in any of 51 RNA-seq datasets were used. For multi-tissue DNA methylation features, minimum proportions of methylated cytosines in any tissue in CG, CHG, and CHH contexts were used. For DHS data, the maximum peak intensity and peak coverage was used instead. In single feature category predictions, fewer total features were used and therefore lower –K values (i.e. the number of random features selected when building random forests) were considered in parameter searches (Supplemental Table 2.8). Multi-class machine learning model For the four-class model, benchmark RNA gene, phenotype protein-coding gene, pseudogene, and random unexpressed intergenic sequences were used as the four training classes. Benchmark RNA genes consisted of six RNA genes with documented loss-of-function 52 phenotypes and 40 high-confidence miRNA genes from miRBase (www.mirbase.org) (Kozomara and Griffiths-Jones, 2014). We generated 250 datasets with equal proportions (larger classes randomly sampled) of training sequences. Two-fold stratified cross-validation was utilized due to the low number of benchmark RNA genes. The features included those described for the tissue-agnostic model and focused on 100 bp sequence windows. The RF implementation, cforest, in the party package of R (Strobl et al., 2008) was used to build the classifiers. The fourclass predictions provide prediction scores for each sequence type: an RNA gene, phenotype protein-coding gene, pseudogene, and unexpressed intergenic score (Supplemental Table 2.4). The prediction scores indicate the proportion of random forest trees that classify a sequence as a particular class. Median prediction scores from across 100 balanced runs were used as final prediction scores. Scores from a single balanced dataset models sum to 1, but not the median from 100 balanced runs. Thus, the median scores were scaled to sum to 1. For each sequence window, the maximum prediction score among the four classes was used to classify a sequence as phenotype gene, pseudogene, unexpressed intergenic, or RNA gene. 53 ACKNOWLEDGEMENTS I wish to thank Christina Azodi, Ming-Jung Liu, Gaurav Moghe, Bethany Moore, and Sahra Uygun and for providing processed data and discussion. 54 APPENDIX 55 Supplemental information Variability in single feature performance within feature categories Within each feature category, there was a wide range of performance between features (Fig. 2.2, Supplemental Table 2.3) and there were clear biological or technical explanations for features that perform poorly. For the transcription activity category, 17 out of 24 features had an AUCROC performance >0.8, including the best-performing feature, expression breadth (AUCROC=0.95; Fig. 2.2A). However, five transcription activity-related features performed poorly (AUC-ROC<0.65), including the presence of expression (transcript) evidence (AUC-ROC=0.58; Fig. 2.2A). For the sequence conservation category, maximum and average phastCons conservation scores were highly distinct between phenotype genes and pseudogenes (AUCROC=0.83 and 0.82, respectively; Fig. 2.2B). On the other hand, identity to best matching nucleotide sequences found in Brassicaceae and algal species were not informative (AUCROC=0.55 and 0.51, respectively; Fig. 2.2B). This was because 99.8% and 95% of phenotype genes and pseudogenes, respectively, had a potentially homologous sequence within the Brassicaceae family and only 3% and 1%, respectively, in algal species. Thus, Brassicaceae genomes were too similar and algal genomes too dissimilar to A. thaliana to provide meaningful information. H3 mark features also displayed high variability. The most informative H3 mark features were based on the number and coverage of activation-related marks (AUC-ROC=0.87 and 0.85, respectively; Fig. 2.2E), consistent with the notion that histone marks are often jointly associated with active genomic sequences to provide a robust regulatory signal (Schreiber and Bernstein, 2002; Wang et al., 2008). By comparison, the coverage and intensity of H3 lysine 27 trimethylation (H3K27me3) and H3 threonine 3 phosphorylation (H3T3ph) were largely indistinct between phenotype genes and pseudogenes (AUC-ROC range: 0.55-0.59; Fig. 2.2E). 56 Features of misclassified sequences Although the full model performs exceedingly well, there remain false predictions. There are 76 phenotype genes (4%) predicted as non-functional (referred to as low-functional likelihood (FL) phenotype genes). We assessed why these phenotype genes were not correctly identified by first asking what category of features were particularly distinct between low-FL and the remaining phenotype genes. We found that the major category that led to the misclassification of phenotype genes was transcription activity, as only 7% of low-scoring phenotype genes were predicted as functional in the transcription activity-only model, compared to 98% of high FL phenotype genes (Fig. 2.5). By contrast, >65% of low-FL phenotype genes were predicted as functional when sequence conservation, H3 mark, or DNA methylation features were used. This could suggest that the full model is less effective in predicting functional sequences that are weakly or narrowly expressed. While sequence conservation features are distinct between functional and non-functional sequences when considered in combination, a significantly higher proportion of low-FL phenotype genes were specific to the Brassicaceae family, with only 33% present in dicotyledonous species outside of the Brassicaceae, compared to 78% of high-scoring phenotype genes (FET, p < 4e-12), thus our model likely has reduced power in detecting lineage-specific functional sequences. We also predict 80 pseudogenes (10%) to be functional (high-FL pseudogenes). A significantly higher proportion of high-FL pseudogenes came from existing genome annotation as 19% of annotated pseudogenes were classified as functional, compared to 4% of pseudogenes identified through a computational pipeline (FET, p < 1.5E-10) (Zou et al., 2009). We found that high-FL pseudogenes might be more recently pseudogenized and thus have not yet lost many genic signatures, as the mean number of disabling mutations (premature stop or frameshift) per 57 kb in high-scoring pseudogenes (1.9) were significantly lower than that of low-scoring pseudogenes (4.0; U test, p < 0.02). Lastly, we cannot rule out the possibility that a small subset of high-scoring pseudogenes represent truly functional sequences, rather than false positives (Poliseno et al., 2010; Karreth et al., 2015). Overall, the misclassification of both narrowlyexpressed phenotype genes and broadly-expressed pseudogenes highlights the need for an updated prediction model that is less influenced by expression breadth. Among protein-coding genes without phenotype information, we predict 20% as nonfunctional. We expect that at least 4% represent false negatives based on the FNR of the full model. The actual FNR among protein-coding genes may be higher, however, as phenotype genes represent a highly active and well conserved subset of all genes. However, a subset of the low-scoring protein-coding genes may also represent gene sequences undergoing functional decay and en route to pseudogene status. To assess this possibility, we examined 1,940 A. thaliana "decaying” genes that may be experiencing pseudogenization due to promoter disablement (Yang et al., 2011) and found that, while these decaying genes represented only 7% of all A. thaliana annotated protein-coding genes, they made up 45% of protein-coding genes predicted as non-functional (Fisher’s Exact Test (FET), p < 1E-11). 58 Supplemental Figure 2.1 Expression breadth distributions of sequence classes. (A) Based on 500 bp feature regions. (B) Based on 100 bp feature regions. 59 Supplemental Figure 2.2 Impacts of conditional phenotypes and expression breadth on the function prediction model. (A) Functional likelihood distributions of phenotype genes with mutant phenotypes under standard growth conditions (non-conditional) and non-standard growth conditions such as stressful environments (conditional) based on the 500 bp full model. Feature values were calculated from a random 500 bp region from within the sequence body. Higher and lower functional likelihood values indicate a greater similarity to phenotype genes and pseudogenes, respectively. (B,C) Distributions of functional likelihood scores for phenotype genes (blue) and pseudogenes (red) for sequences with various breadths of expression for (B) the 500 bp full model and (C) the 500 bp tissue-agnostic model generated by excluding the 60 Supplemental Figure 2.2 (cont’d) expression breadth and features available from multiple tissues. The tissue-agnostic model is aimed toward minimizing the effects of biochemical activity occurring across multiple tissues and predicts a greater proportion of narrowly-expressed phenotype genes as functional compared to the full model. 61 Supplemental Figure 2.3 Distributions of functional likelihood scores based on the 500 bp tissue-agnostic model. (A) Phenotype genes. (B) Pseudogenes. (C) Annotated protein-coding genes. (D) Transposable elements. (E) Random unexpressed intergenic sequences. (F) Intergenic transcribed regions (ITR). (G) Araport11 ncRNAs. (H) TAIR10 ncRNAs. Vertical dashed lines display the threshold to define a sequence as functional or non-functional. The numbers to the left and right of the dashed line show the percentage of sequences predicted as functional or nonfunctional, respectively. 62 Supplemental Figure 2.4 Distributions of 12 example features from 7 feature categories. Sequence classes include phenotype genes, pseudogenes, TAIR- and Araport-annotated ncRNAs and intergenic transcribed regions (ITR). 63 Supplemental Figure 2.5 Distance of ITRs and annotated ncRNA regions to and feature similarity with neighboring genes. (A) Distance from intergenic transcribed regions (ITRs) and annotated ncRNAs to the closest neighboring gene. ITR and ncRNA sequences are separated by whether they are predicted as functional (F) or non-functional (NF) by the 500 bp full model. (B) Feature similarity based on Pearson’s Correlation Coefficients (PCC) between random pairs of ITRs, Araport11 ncRNAs, TAIR10 ncRNAs, or annotated genes. (C) Feature similarity between proximal neighbors (within 95th percentile (456 bp) of intron lengths), and (D) Feature similarity between distal neighbors (>456 bp). Pairs involving ITRs and annotated ncRNAs were divided by whether the ITR or ncRNA sequence was predicted as functional (F) or non-functional (NF) by the full model. Feature values were quantile normalized prior to calculating correlations. 64 Supplemental Figure 2.6 Distributions of functional likelihood scores based on the 100 bp tissue-agnostic model. (A) Phenotype genes. (B) Pseudogenes. (C) Protein-coding gene. (D) transposable elements. (E) Random unexpressed intergenic sequences. (F) Intergenic transcribed regions (ITR). (G) Araport11 ncRNAs. (H) TAIR10 ncRNAs. (I) RNA genes with loss-offunction mutant phenotypes. (J) MicroRNAs, (K) Small nucleolar RNAs, (L) Small nuclear RNAs. The tissue-agnostic model was built with 100 bp features and while excluding the expression breadth and tissue-specific features. Higher functional likelihood values indicate greater similarity to phenotype genes while lower values indicate similarity to pseudogenes. Vertical dashed lines display the threshold to define a sequence as functional or non-functional. The numbers to the left and right of the dashed line show the percentage of sequences predicted as functional or non-functional, respectively. 65 Supplemental Figure 2.7 Correlation between features used in functional predictions. Colors within the heatmap indicate pairwise correlation between two features. Colors on the leftmost and bottom-most edges indicate the associated feature category (see Fig. 2.2). Feature values were quantile normalized prior to calculating correlation. 66 CHAPTER 3: CROSS-SPECIES AND POST-DUPLICATION EVOLUTIONARY DYNAMICS OF INTERGENIC TRANSCRIBED REGIONS INDICATE WIDESPREAD NOISY TRANSCRIPTION 67 ABSTRACT Extensive transcriptional activity occurring in unannotated, intergenic regions of genomes has generated spirited debate over whether intergenic transcription represents the activity of novel genes or noisy expression. A comprehensive investigation of the evolutionary histories of intergenic transcribed regions (ITRs) in multiple species would be informative in this debate but is currently lacking. Here, we evaluated the cross-species and post-duplication sequence and expression conservation of ITRs in four grass (Poaceae) species. The majority of ITRs in our analysis are species-specific sequences and ITR orthologs, when present, were usually not expressed. However, orthologous pairs of ITRs showed similarity in tissue expression patterns approaching that of annotated exon pairs, suggesting that ITRs with expression conservation are enriched for functional sequences. Similarly, ITR duplicates tend to be recent and not expressed, with few examples ITR duplicates retained in synteny from ancient whole genome duplication events. In addition, function prediction models were established in Oryza sativa (rice) to classify transcribed regions as likely-functional by integrating evolutionary and biochemical features and were highly capable of distinguishing between benchmark functional and non-functional sequences. Prediction models classified 2,754 rice ITRs (38%) as functional and confirmed that ITRs with expression conservation and those with ancient retained duplicates frequently represent functional sequences. However, evaluations of evolutionary histories primarily highlight ITRs as short-lived sequences with unstable expression, which is consistent with computational function predictions that suggest a majority of ITRs are not under selection. 68 INTRODUCTION Advances in sequencing technology have uncovered pervasive transcription occurring throughout unannotated, intergenic space in eukaryotic genomes, including metazoan (ENCODE Project Consortium, 2012; Brown et al., 2014; Boeck et al., 2016), fungal (Nagalakshmi et al., 2008), and plant systems (Yamada et al., 2003; Stolc et al., 2005; Nobuta et al., 2007; Moghe et al., 2013; Krishnakumar et al., 2015). While intergenic transcription was initially thought to be primarily associated with nearby genes (van Bakel et al., 2010), a host of possible new functions indicate many intergenic transcripts may represent the activity of novel, independent genes. These include roles as competitive endogenous RNAs (Tan et al., 2015), cis-acting regulatory transcripts (Guil and Esteller, 2012) or small protein-coding regions that are frequently missed by gene finding programs (Hanada et al., 2013). Despite these exciting possibilities, intergenic transcribed regions (ITRs) may also represent the products of noisy transcription, resulting from imperfect regulation of the cellular machinery that controls gene expression (Struhl, 2007). In addition, ITRs may provide the raw materials from which novel genes may evolve de novo (Carvunis et al., 2012). Thus, distinguishing between functional intergenic transcripts and those that are the products of noisy transcription represents a difficult but critical task in genome biology. The foundational step in identifying novel genes via intergenic transcription is to establish what it means to be functional. Throughout this article, we utilize the selected effect definition of function, which requires that a biochemical activity, particularly transcription, have been molded through evolutionary selection and contribute to organismal fitness (Amundson and Lauder, 1994; Graur et al., 2013; Doolittle et al., 2014). This definition stands in contrast to the 69 casual role definition that considers any reproducible biochemical activity to be functional (Cummins, 1975; Amundson and Lauder, 1994). Under the causal role definition, intergenic transcription would be considered functional de facto. However, the possibility of noisy transcription indicates that transcription activity by itself should not be used as evidence that a genome region is under selection. Moreover, pseudogenes (i.e. non-functional gene remnants) can be expressed (Zou et al., 2009; Pei et al., 2012), indicating that transcriptional activity may persist when biological function has been lost. Given these considerations, the selected effect definition of function is frequently considered as more appropriate within biological contexts (Amundson and Lauder, 1994; Graur et al., 2013; Doolittle et al., 2014). Given the selected effect definition, sequence conservation, particularly over long time periods, represents strong evidence for functionality. Sequence conservation over short time periods, however, may be due to insufficient time for mutations to accumulate, rather than selective pressure. By contrast, lack of observable conservation does not necessarily indicate a lack of function, as this can be due to weak or positive evolutionary selection (Pang et al., 2006; Ponting, 2017). Thus, it is critical to ask not only whether a sequence is conserved, but how informative the presence or absence of sequence conservation is for the potential functionality of a sequence. As a result, it is unclear whether the lack of sequence conservation among most ITRs in Arabidopsis thaliana, for example (Moghe et al., 2013), indicates that they are primarily nonfunctional. In addition to sequence conservation, duplication histories could prove informative to sequence functionality, as gene duplications represented critical evolutionary events that played a fundamental role in shaping the functional content of genomes (Cui et al., 2006; Soltis et al., 2009; Rutter et al., 2012). However, the duplication histories of ITRs are unknown. Instead of relying on a single line of evidence to define functional sequences, such as sequence 70 conservation, an approach that integrates genetic (i.e. phenotype), evolutionary, and biochemical data has been suggested (Kellis et al., 2014). Based on this framework, recent studies have shown that integrating evolutionary, transcription, epigenetic, and structural characteristics was highly effective at distinguishing between sequences that were under selection and those that were not (Lloyd et al., 2017 [preprint]; Tsai et al., 2017). Uniting the evolutionary histories of ITRs with robust function predictions based on the integration of evolutionary and biochemical signatures could provide valuable insight into the functional content of ITR sequences. In this study, we investigate the evolutionary dynamics and potential functionality of ITRs in four grass (Poaceae) species: Oryza sativa (rice), Brachypodium distachyon, Sorghum bicolor (sorghum), and Zea mays (maize). The four species are oriented phylogenetically as two species pairs (Fig. 3.1A), one pair that diverged 15 million years ago (MYA): maize and sorghum (Skendzic et al., 2007; Liu et al., 2014), and another that diverged 47 MYA: rice and B. distachyon (Massa et al., 2011). All four species have shared ancient whole genome duplications (WGDs) (Paterson et al., 2004; Tang et al., 2010) and exhibit recent small- and large-sale duplication events (Fig. 3.1A) (Swigoňová et al., 2004). We utilized this system to assess sequence and expression conservation of ITRs between species and following duplication. To determine the extent to which evolutionary histories inform the likely functionality of a sequence we also generated function prediction models in rice using established data integration approaches (Lloyd et al., 2017 [preprint]; Tsai et al., 2017). Function prediction models were applied genome-wide to predict candidate functional ITRs, which were then compared among ITRs with varying evolutionary histories. 71 RESULTS AND DISCUSSION Identification and classification of Poaceae transcribed regions To investigate the evolutionary dynamics and potential functionality of intergenic transcripts, we focused on four Poaceae species (Fig. 3.1A), each with a set of 10 to 14 developmentally-matched transcriptome datasets. In each of the four species, we identified transcribed regions and classified them according to overlap with exon, intron, and pseudogene annotation. Transcribed regions that did not overlap gene or pseudogene annotation were considered intergenic. To provide an overview of intergenic transcribed regions (ITRs) in the four Poaceae species, we first evaluated their prevalence, putative protein-coding and repetitive element content, and expression characteristics. With regard to the prevalence of ITRs, intergenic transcripts account for only 4-7% of mapped reads and 6-12% of transcribed regions in each species (Fig. 3.1B). By contrast, 92-96% of mapped reads overlap an annotated exon or intron (Fig. 3.1B), indicating that expression is primarily originating from genic regions. Similarly, 0.4-3.3% of mappable intergenic space (see Materials and Methods) was covered by transcribed regions in each species (Fig. 3.1C). Thus, intergenic transcription in the Poaceae is rare relative to genic expression, consistent with previous findings in Arabidopsis thaliana (Moghe et al., 2013) and Homo sapiens (van Bakel et al., 2011), and only a small fraction of intergenic space is expressed. Despite the relative scarcity of intergenic transcription, we identified between 7,000 and 16,000 ITRs in each species (Fig. 3.1B, right panel). To investigate the content of Poaceae ITRs, we identified ITRs with protein similarity and those that were highly repetitive (see Materials and Methods) and found that the majority of ITRs in each species (54-77%; Fig. 3.1D) did not 72 Figure 3.1 Transcriptome content in four Poaceae species. (A) Phylogenetic relationships between the four species. Whole genome duplication (WGD) events are marked with yellow circles. Os: rice, Bd: B. distachyon, Sb: sorghum; Zm: maize; Ex: exon, In: intron, Ps: pseudogene, Ig: intergenic. MYA: millions of years ago. (B) Number of reads mapping to (left panel) and transcribed regions overlapping (right panel) exon (Ex; dark blue), intron (In; cyan), pseudogene (Ps; red), and intergenic (Ig; yellow) genome regions. (C) Percent of nucleotides annotated as exon, intron, pseudogene, and intergenic overlapped by transcribed regions. Ig’ represents the proportion of intergenic space covered by transcribed regions that also overlap genic or pseudogenic regions. (D) Percent of transcribed regions that were classified as highly repetitive (pink), with protein similarity (blue), or neither of these (Other; tan). 73 exhibit hallmarks of either protein-coding or repetitive sequences. Thus, ITRs primarily represent non-protein-coding sequences. However, only 56-94 ITRs in each species (<1%) contained canonical RNA gene domains, indicating that few ITRs may be functioning as microRNAs (miRNAs), small nucleolar RNAs, or small nuclear RNAs. Given the lack of sequence similarity to known protein-coding or RNA gene regions, ITRs could represent the noisy transcription of random intergenic sequences. We next assessed whether the expression properties of ITRs are consistent with those expected of noisy transcripts. Compared to transcribed regions that overlap exons (ETRs), ITRs in all four species are shorter (Supplemental Fig. 3.1A; Mann Whitney U tests, all p<7e-105) and expressed at lower levels (Supplemental Fig. 3.1B; U tests, all p<8e-311). Further, 49-60% of ITRs in each species are expressed in a single tissue (Supplemental Fig. 3.1C), compared to 10-17% of ETRs (Fisher’s Exact Tests (FETs), all p<3e-10). These characteristics are consistent with descriptions of noisy transcripts as short and abortive (Struhl, 2007). However, we found that 76-88% of intergenic transcribed regions were reproducible across replicate leaf transcriptome datasets (Supplemental Fig. 3.1D). While these proportions are lower than those of ETRs (93-95%; FET, all p<3e-10), they far exceed the expected proportions if transcripts were randomly distributed across mappable intergenic space (expected proportions: 2-21%; all p<7e-10). This is indicative of the presence of hotspots in intergenic space where transcription is likely to originate, either due to the presence of truly functional sequences or spurious regulatory signals that increase the likelihood of noisy transcription occurring within a region. Intergenic transcripts have also been suggested to be associated with nearby genes (van Bakel et al., 2010), either as unannotated exon extensions or products of run-on transcription. We expected that ITRs would be disproportionately originating from the active chromatin regions 74 surrounding genes, however ITRs were further from genes when compared to a random background (Supplemental Fig. 3.2A). Nevertheless, at least 70% of ITRs in rice, B. distachyon, and sorghum and 35% of ITRs in maize were within the 95th percentile of intron lengths to neighboring genes (Supplemental Fig. 3.2B), indicating that a substantial proportion of ITRs could potentially represent unannotated extensions of genes. Consistent with this notion, ITRs that were within ~500 nucleotides to a gene exhibited increased expression correlation with neighboring ETRs compared to those that are further away (Supplemental Fig. 3.2C). While this could suggest that ITRs nearby genes represent gene extensions, a similar pattern is observed among transcribed regions that overlap pseudogenes. Thus, this may also be explained by the regulation of genes influencing the expression patterns of nearby, but unrelated, transcripts. Overall, we find 43,301 ITRs across four Poaceae species. These ITRs are primarily not proteincoding sequences and their expression characteristics are consistent with noisy expression. To further evaluate their characteristics and potential functionality, we next investigate the evolutionary histories of ITRs both between species and following duplication. Cross-species sequence conservation of ITRs Sequence conservation due to selective pressure is a hallmark frequently exhibited by functional genome regions. However, among ITRs in rice and B. distachyon, fewer than 15% were conserved across species (Fig. 3.2A), conservation rates that are lower than those of exon, intron, and pseudogene transcribed regions (FET, all p<2e-9). Nevertheless, randomly-selected, unexpressed intergenic regions in both species were conserved at lower rates compared to ITRs (<7%; FET, both p<8e-10). Similar patterns were observed among maize and sorghum transcribed regions, except that an increased proportion of sequences were conserved overall (Fig. 3.2A). In particular, 2-3 times as many ITRs in maize and sorghum were conserved 75 Figure 3.2 Sequence conservation of transcribed regions. (A) Heatmaps of cross-species sequence similarity, where colors indicate the -log10(E-value) for best match within a species. Columns were ordered by evolutionary distance (see Fig. 3.1A) then by smaller to larger genome size. Sequence type and species abbreviations are the same as in Fig. 3.1. (B) Percent of sequences that overlap a block of conserved nucleotides that are present in all four species. (C) Cumulative percentages of sequences that have at most a median phastCons score indicated on the x-axis. Os: rice, Bd: B. distachyon, Sb: sorghum; Zm: maize; Ex: exon, In: intron, Ps: pseudogene, Ig: intergenic. 76 compared to those in rice and B. distachyon (FET, all p<2e-9), a pattern that is likely due to a more recent divergence between maize and sorghum (Fig. 3.1A) (Skendzic et al., 2007; Massa et al., 2011; Liu et al., 2014). This highlights that sequence conservation is not always a result of selection, but instead can be due to insufficient time for sequences to mutate beyond recognition. As a result, reliance on sequence conservation among closely-related lineages to detect functional sequences may have significant false positive error rates. We also note that ITRs with protein similarity in all species have higher rates of conservation that those without (Supplemental Table 3.1; FET, all p<6e-5). Nevertheless, ITRs that lack protein similarity were conserved more frequently than random intergenic sequences (FET, all p<9e-10), suggesting that a subset of ITRs may be functioning at the RNA level (Mercer et al., 2009). Given that sequence conservation is not always be due to a history of selection, we next investigated potential selective pressure acting on ITRs by calculating phastCons scores (Siepel et al., 2005) within blocks of conserved nucleotides that were present in all four species (see Materials and Methods). As conserved nucleotide blocks (CNBs) and phastCons scores were generated in relation to the rice genome, we focused solely on rice sequences for this analysis. A lower proportion of rice ITRs overlapped CNBs (7%) compared to ETRs (51%; FET, p<2e-9; Fig. 3.2B). However, only 2% of random intergenic regions overlapped CNBs (FET, p<4e-10), providing additional evidence that ITRs are more frequently conserved than unexpressed intergenic regions. Although sequences within CNBs were present in all four Poaceae species, we observed significant differences in the phastCons scores among ETRs, ITRs, and random intergenic sequences. ETRs had higher average phastCons scores (median=0.69) than ITRs (median=0.40; U test, p<9e-43), and ITRs were more strongly conserved than random intergenic sequences (median=0.03; U test, p<3e-10). Overall, we find that a minority of ITRs are 77 conserved across species and are therefore considered lineage specific. However, the proportion of conserved ITRs exceeds that of random intergenic regions, suggesting a subset of ITRs likely represent functional sequences that are under selection. Cross-species expression conservation of ITRs Conservation of expression may offer additional information into the likely functionality of a sequence compared to sequence conservation alone. For ITRs that are conserved in sequence across species, we evaluated whether expression was also conserved. For this analysis, we focused on two sets of cross-species homologs: transcribed regions overlapping CNBs described in the previous section and pairs of orthologous sequences present in cross-species syntenic gene blocks (see Materials and Methods). First, we analyzed the expression states of CNBs that were exonic (n=13,616), intronic (n=125), or intergenic (n=525) in all four species. The majority of intergenic CNBs were not expressed in any species (85%; Fig. 3.3A). Among 79 intergenic CNBs that had evidence of expression, transcription was restricted to a single species in 57 cases (78%), a pattern that is more similar to conserved intron sequences than exons (Fig. 3.3A). Similarly, 18-44% of syntenic ITR orthologs exhibited conserved expression, rates that are lower than those of syntenic ETR orthologs (fig 3B; FET, all p<2e-10). Thus, ITRs with sequence conservation often do not exhibit expression conservation. Among ITRs, approximately twice as many syntenic orthologs between rice and B. distachyon were expressed (~40%) compared to those between maize and sorghum (~20%; fig 3B; FET, all p<6e-4), indicating that sequences maintained over longer periods of time are more likely to both be expressed. This could suggest that expression is lost more quickly than sequence similarity and, as a result, the presence of expression conservation may prove useful in classifying likely-functional sequences, particularly between closely-related species. 78 Figure 3.3 Expression conservation of transcribed regions. (A) Percentages of exon (Ex), intron (In), and intergenic (Ig) conserved nucleotides blocks that have evidence of expression from 0, 1, 2, 3, or all 4 species in our analysis. (B) Percentages of syntenic orthologs with expression evidence. Syntenic orthologs were identified by conservation across pairs of crossspecies syntenic blocks. (C,D) Commonality in tissue expression between reciprocal pairs of cross-species homologous exons (Ex), introns (In; panel C), and intergenic (Ig; panel D) transcribed regions and randomly-generated expression vectors (Rn). Exon transcript pairs represent a subsample of all exons with breadths of expression that match either intron (C) or intergenic (D) transcript pairs. Similarly, random expression vectors were generated with breadths that matched the distribution from intron or intergenic pairs. Os: rice, Bd: B. distachyon, Sb: sorghum; Zm: maize. 79 If orthologous sequences with conserved expression are both functional, they may perform similar roles in both species and show similar expression patterns across tissues. To evaluate this possibility, we first investigated similarity in tissue expression among pairs of transcribed, orthologous intron sequences and found that the proportion of expressed tissues in common between intron orthologs (mean % commonality=27%) was no different than random (mean=26%; Fig. 3.3C; U test, p=0.33) and lower than that of orthologous ETR pairs (mean=49%; U test, p<2e-18). This may suggest that a substantial proportion of conserved intron expression is due to capture of transcripts prior to splicing or misregulated splicing events from neighboring exon regions, rather than selection. By contrast, orthologous ITRs had similarity in tissue expression (mean % commonality=48%) that exceeded random (mean=36%; Fig. 3.3D; U test, p<6e-6), but was lower than that for orthologous ETRs (mean=53%; U test, p=0.04). However, the difference in % commonality between orthologous ITRs and ETRs was small. ITRs with cross-species expression conservation may frequently represent functional sequences performing similar roles in different species. Nevertheless, few ITRs exhibit sequence or expression conservation across species, suggesting that most either function in a species-specific manner, evolve too quickly for orthologous sequences to be detected, or are products of noisy transcription. Post-duplication sequence and expression conservation of ITRs Plant genomes harbor a rich history of large- and small-scale duplication events that have contributed to the adaptive evolution of these lineages (Cui et al., 2006; Rizzon et al., 2006; Hanada et al., 2008; Soltis et al., 2009). Thus, we next asked whether ITRs are present as duplicates, and if so, whether ITR duplicates tend to be retained over time. We first assessed the prevalence of ITR duplicates and found that 40-50% of ETRs, ITRs, and random intergenic 80 sequences were duplicated in rice, B. distachyon, and sorghum and 70-80% were duplicated in maize (Supplemental Fig. 3.3). Although duplication rates between sequence types could be significantly different (see Supplemental Fig. 3.3), almost all differences were less than 5%, indicating that the presence of a duplicate is not informative to whether a sequence is likely part of an annotated gene. We next estimated the timing of duplication using per-base substitution rates (K) and found that ITR duplicates in all species were generated from more recent duplication events compared to those of ETRs (Fig. 3.4A; U tests, all p<6e-127). Thus, duplication patterns are distinct between ITRs and ETRs and could be associated with the functionality of a sequence. To further investigate the duplication patterns of ITRs, we evaluated putative duplicate retention of ITRs present in WGD-associated syntenic genome regions (see Materials and Methods; Supplemental Fig. 3.4). Among 6,531 maize ITRs present in syntenic blocks associated with the 12 million-year-old (MYO) maize WGD (Fig. 3.1A; Supplemental Fig. 3.4), 12% exhibit a retained syntenic duplicate. Fewer retained duplicates are identified for the 5,783 ITRs from all four species present in syntenic blocks associated with the >70MYO ρ / σ WGDs (Fig. 3.1A; Supplemental Fig. 3.4), as only 63 exhibited a retained syntenic duplicate (1.1%), including 26 duplicates that were also intergenic (0.5%). We considered that ITRs may frequently represent RNA genes and sequence similarity over >70 million years could be difficult to detect. However, in rice 11 of 18 tRNA genes (61%) and 3 of 7 miRNA genes (43%) had identifiable ρ / σ WGD-associated duplicates, retention rates that surpass that of rice ITRs (2.2%; both p<5e-5). This suggests ITRs exhibit low duplicate retention rates over time. However, we cannot rule out the possibility that many ITRs present in WGD-associated syntenic regions arose after WGD events. Additionally, ITRs may have been disproportionately 81 Figure 3.4 Duplication characteristics of transcribed regions. (A) Distributions of per-base substitution rates (K) between a sequence and its top within-species BLASTN match based on a generalized time reversible model. (B) Syntenic duplicates from the maize whole genome duplication (WGD; left panel) and ρ or σ WGDs (right panel). ‘None’ indicates that no syntenic duplicate was identified. Exon transcribed regions (Ex, x-axis) were categorized by whether they overlapped genes anchoring syntenic blocks, ‘(A)’, and those that were between anchor genes, ‘(N)’. Only intron transcribed regions (In, x-axis) that overlapped anchor genes are shown. (C) Percent of expressed duplicates among sequence types binned by duplication time, which was estimated using K. Os: rice, Bd: B. distachyon, Sb: sorghum; Zm: maize; Ps: pseudogene, Ig: intergenic. 82 transposed into syntenic blocks or their duplicates transposed out, resulting in a lack of observable synteny. Nevertheless, ITRs feature WGD-related duplicates at greater rates than those of random intergenic regions (Fig. 3.4B, FET, both p<0.002). Thus, ITRs and their duplicates from WGD events could be under selective maintenance and represent functional sequences. Consistent with this notion, 92% of maize ITRs with a retained syntenic duplicate from the recent maize WGD event were conserved across species, compared to 46% of those without an identifiable duplicate (FET, p<3e-10). Similarly, 26 of 26 ITRs in all four species with an intergenic duplicate retained from the ρ / σ WGD events were conserved across species. Thus, duplicate retention over long periods is a likely a strong indicator of sequence functionality. Last, we assessed whether ITR expression was conserved following duplication. Among duplicates of ITRs in each species, 18-34% were expressed, compared at least 79% of ETR duplicates (FET, all p<6e-10). In addition, recent (i.e. lower K) ITR duplicates are more likely to be expressed than older duplicates (Fig. 3.4C; U tests, all p<3e-5). These results might suggest highly dynamic expression among ITRs, with frequent expression state gains and losses. However, fewer duplicates of random intergenic sequences were expressed (2-7% in each species; FET, all p<4e-10) compared to ITR duplicates, indicating that the presence of expression between an ITR and its duplicate is likely due to the conservation of an ancestral expression state, rather than two independent gains of expression. Together with high reproducibility of ITR expression across replicate transcriptome datasets (Supplemental Fig. 3.1D), this indicates that intergenic expression is driven in part by the presence of particular regulatory signals (e.g. cis-regulatory elements or chromatin state) that may persist following duplication. Overall, the duplication patterns among ITRs indicate that few ITRs exhibit long- 83 term duplicate retention and may feature highly dynamic expression states, characteristics that could be indicative of sequences that are generally not under selection. Nevertheless, we find rare examples of ITRs with ancient retained duplicates that could represent functional sequences. To more deeply explore the functionality of ITR sequences, particularly those that lack long-term sequence conservation or duplicate retention, we utilized a data integration approach (Lloyd et al., 2017 [preprint]; Tsai et al., 2017) that jointly considers evolutionary and biochemical characteristics of a sequence to classify its functionality. Predicting the functionality of rice transcribed regions Sequence and expression conservation over long periods of time and following duplication provide glimpses into the potential functionality of ITR sequences. To provide a more robust estimation of the functionality of transcribed regions, we next utilized a data integration approach that considers the evolutionary and biochemical characteristics of known functional and non-functional sequences (Lloyd et al., 2017 [preprint]; Tsai et al., 2017). This approach consists of four steps: establishing a set of benchmark functional and non-functional sequences, identifying characteristics associated with benchmark sequences, integrating these characteristics via statistical learning methods to generate function prediction models, and applying function prediction models to all transcribed regions, including ITRs. Due to data availability, we focused solely on rice sequences for this approach. We first established two sets of benchmark functional sequences: 513 transcribed regions that overlap exons of genes with documented loss-of-function phenotypes (referred to as Phen-TRs) and 19 transcribed regions that overlapped high-confidence miRNA gene annotations (referred to as miRNA-TRs). Benchmark non-functional sequences consisted of 743 transcribed regions that overlap pseudogenes (referred to as Pseu-TRs), the disabled remnants of ancient genes. We next 84 identified a set of 44 evolutionary and biochemical features representing five categories: transcription activity (n=12), sequence conservation (n=3), DNA methylation (n=16), histone marks (n=12), and nucleosome occupancy (n=1) (Supplemental Table 3.2; see Materials and Methods). A wide variety of features in multiple categories were informative for distinguishing between benchmark functional and non-functional sequences (Supplemental Table 3.2) and we moved forward with data integration methods to consider all features in combination. We utilized the random forest machine learning algorithm to jointly consider all 44 evolutionary and biochemical features and generate function prediction models. Two models were established: a binary model trained on Phen-TR and Pseu-TR sequences and a three-class model trained on Phen-TR, miRNA-TR, and Pseu-TR sequences. Models were generated under cross-validation; independent sets of sequences were used for training and testing (see Materials and Methods). The resulting binary prediction model was highly capable of distinguishing between Phen-TRs and Pseu-TRs (Fig. 3.5A; Supplemental Fig. 3.5A; AUC-ROC =0.92). We utilized prediction scores from the random forest classifier and a threshold identified by maximizing F-measure to classify sequences as likely functional or non-functional. Based on this approach, we correctly classified 85% of Phen-TRs as functional and 86% of pseudogenes as non-functional. However, only 3 of 19 miRNA-TR sequences (21%) were predicted as functional, indicating substantial false negative rates for RNA gene classification based on the binary model. By comparison, 15 of 19 miRNA-TRs (79%) were classified as either miRNATRs or Phen-TRs based on the three-class prediction model (Fig. 3.5B; Supplemental Fig. 3.5B; see Materials and Methods), indicating that the three-class model is better suited to predicting functionality among RNA genes. Similarly, 82% of Phen-TRs were predicted as functional based 85 Figure 3.5 Function predictions of transcribed regions. (A) Precision-recall curve of the bestperforming random forest classifier based on a binary model distinguishing between transcribed regions that overlap exons of phenotype genes and pseudogenes. (B) Confusion matrix heatmap of actual and predicted classes from a 3-class model trained with transcribed regions that overlap exons of phenotype genes (Phen), high-confidence miRNAs (miRNA), and pseudogenes (Pseu). Predictions as Phen or miRNA were considered putatively functional. (C) Distributions of function predictions among sequence types based on the combination of binary (A) and 3-class (B) prediction models. Combined predictions were based on a priority system: Phen > miRNA > Pseu. (D,E) Distributions of function predictions among conserved (D) and duplicated (E) sequences. Colors represent the same classifications as in (C). (D) Sequences with putative homologs were those with any significant cross-species similarity, while those with syntenic orthologs exhibited significant conservation across a cross-species syntenic gene block. (E) Duplicated ETRs and ITRs were separated according to duplication timing, which was estimated 86 Figure 3.5 (cont’d) using per-base substitution rates (K). Duplicates from the ρ and σ whole genome duplications represent examples of ancient duplicate retention. Ex: exon, In: intron, Ig: intergenic, ETR: exon transcribed region, ITR: intergenic transcribed region. 87 on the three-class model. However, only 61% of pseudogenes were predicted as non-functional (compared to 85% based on the binary model). While a small subset of pseudogenes may represent truly functional sequences (Zou et al., 2009; Poliseno et al., 2010; Karreth et al., 2015), this result likely reflects the difficultly in distinguishing between RNA genes and pseudogenes using the feature set established here. We established a final set of combined predictions by merging the classifications from the binary and three-class function prediction models (Fig. 3.5C). We next applied the binary and three-class function prediction models to all transcribed regions within the rice genome: ETRs that did not overlap a known phenotype gene, intron transcribed regions, and ITRs. Among ETRs, 85% were predicted as functional, either as PhenTRs (77%) or miRNA-TRs (8%). The 15% of ETRs that were predicted as non-functional were enriched for short ETRs (Supplemental Fig. 3.6A; U test; p<9e-83) and those that had minor overlap with an exon (U test, p<2e-39; Supplemental Fig. 3.6B,C). We considered that intron transcribed regions would exhibit similarity to RNA genes, given their location within annotated genes (i.e. potentially-functional genome regions) and lack of protein-coding potential. However, only 27% of intron transcribed regions were classified as miRNA-TRs and an additional 17% were classified as Phen-TRs, indicating that the function prediction models can distinguish between transcribed regions overlapping sequences that encode a final gene product and those that are spliced out. Last, we applied the prediction models to ITRs, and found that 62% were predicted as non-functional. Consistent with previous results (Lloyd et al., 2017 [preprint]; Tsai et al., 2017), the lack of similarity among ITRs to benchmark functional sequences suggests that the majority are the result of transcriptional noise. 88 Comparison of function predictions and evolutionary histories With sets of ITRs predicted as functional and non-functional, we next investigated the extent to which the evolutionary histories of a sequence inform sequence functionality. As expected, ITR sequences that were conserved across species were more frequently predicted as functional compared to those that were not (Fig. 3.5D; FET, p<5e-10), which is due in part to conservation features used in model training. However, despite at least of 47 million years of sequence conservation, 34% of conserved ITRs that were predicted as non-functional. On the other hand, 35% of ITRs without sequence conservation were predicted as functional, suggesting a subset of ITRs may be performing species-specific roles. Overall, relying on sequence conservation alone results in significant false positive and false negative function predictions among ITRs. We next investigated whether the presence of expression conservation in addition to sequence conservation more accurately predicted functional sequences. Surprisingly, a similar proportion of ITRs with expression conservation and sequence conservation only were predicted as functional (Fig. 3.5D; FET, both p>0.37). However, ITRs with expression conservation were more likely to be classified as Phen-TRs, rather than miRNA-TRs (FET, both p<0.005). This pattern could be explained by the miRNA-TR sequence set containing false positive functional benchmarks, as the miRNA annotations we used lack strong functional evidence provided by genetic analysis. Consequently, sequences predicted as miRNA-TR may be more frequently truly non-functional compared to those predicted as Phen-TRs (Fig. 3.5B; Supplemental Fig. 3.5B). Thus, the increased proportion of Phen-TR classifications among ITRs with expression conservation suggests they are enriched for functional sequences relative to ITRs that only exhibit sequence conservation. Of particular note, 89% of ITRs with syntenic, expressed 89 orthologs were predicted as functional, with 67% classified as Phen-TRs, suggesting that expression conservation of a high-confidence ortholog provides strong evidence for sequence functionality among ITRs. Last, we investigated the relationship between function predictions and duplication status. We found that 47% of single-copy ITRs were predicted as functional compared to 24% of those that were duplicated (Fig. 3.5E; FET, p<5e-10). A similar pattern is observed for ETRs (Fig. 3.5E), suggesting that a greater proportion of duplicate sequences may be non-functional and undergoing decay. Among duplicated sequences, ITRs with older duplicates (i.e. higher K) are more frequently predicted as functional compared to those with more recent duplicates (Fig. 3.5E; U test, p<6e-11), indicating that the retention of a duplicate over time is at partially informative for the functionality of a sequence. This notion is supported by the 11 ITRs with a retained duplicate from the >70 MYO ρ or σ WGD events, as all were predicted as functional (Fig. 3.5E). Thus, combining function predictions with evaluation of evolutionary histories confirms that retention of ancient duplicates, as well as cross-species sequences and expression conservation, provide strong evidence that a sequence is likely to be functional. However, the overall number of ITRs that exhibit sequence and expression conservation is low. 90 CONCLUSION We provide a comprehensive, multi-species investigation of the potential functionality of intergenic transcribed regions (ITRs) through evaluation of the cross-species and postduplication evolutionary histories of these sequences in four grass (Poaceae) species. In summary, ITRs are primarily species-specific and the lack of conserved transcription across species indicates that expression among ITRs may be a recent evolutionary event. This is further reflected by the lack of transcription evidence among ITR duplicates, suggesting that the expression of ITRs is frequently gained and lost. We also found very few examples of ITRs with ancient retained duplicates, suggesting there may be little adaptive potential through neo- or subfunctionalization of ITR sequences. Overall, the evolutionary dynamics explored here are consistent with the notion that intergenic expression primarily represents transcriptional noise. Robust function prediction models were established in rice based on 44 evolutionary and biochemical characteristics of transcribed regions that overlap phenotype genes, high-confidence miRNAs, or pseudogenes. These models distinguished between benchmark functional and nonfunctional with high accuracy. Based on model predictions, 2,754 ITRs in rice (38%) were classified as functional. However, high rates of false positive predictions (FPR=42%) suggest this is likely an overestimate of the proportion of functional ITRs. Function predictions could be improved in part by identification of a stronger set of benchmark RNA genes, as the lack of RNA genes with genetic evidence of function required us to use unconfirmed miRNA gene annotations. Using a similar data integration framework, ~40% of ITRs in Arabidopsis thaliana were also predicted as functional (Lloyd et al., 2017 [preprint]), and thus this percentage may represent a general baseline for the proportion of candidate functional ITRs in any species. 91 However, both rice and A. thaliana have relatively small genomes among plants and it will be interesting to see whether a species with a much larger genome (e.g. maize) contains more or fewer ITRs that are likely functional. Function predictions confirmed that ITRs with both sequence and expression conservation and those with ancient retained duplicates are likely to represent functional sequences. Prediction models also classified 35% of rice ITRs that lack sequence conservation as functional. A similar pattern can be seen in A. thaliana, as a greater proportion of ITRs are predicted as functional (Lloyd et al., 2017 [preprint]) than are conserved across species (Moghe et al., 2013). These patterns indicate that it may not be uncommon for ITRs to function in species-specific roles. The data integration framework applied here and described in previous studies (Lloyd et al., 2017 [preprint]; Tsai et al., 2017) can identify likely-functional sequences when sequence conservation cannot be effectively utilized, e.g. species-specific function or weak selective pressure (Ponting, 2017). Ultimately, the majority of ITRs lack extensive cross-species or post-duplication sequence conservation and similarity to benchmark functional sequences, and thus we conclude that intergenic transcription primarily represents transcriptional noise. 92 METHODS Gene, pseudogene, and random intergenic annotation The MAKER-P (r1065) genome annotation pipeline was used to reannotate the Oryza sativa (rice) Nipponbare (IGRSP-1.0 v7), Brachypodium distachyon (v1.2) and Sorghum bicolor (sorghum; v2.1) genome assemblies as previously described (Cantarel et al., 2008; Campbell et al., 2014a) . Repeats were masked using default parameters in RepeatMasker (v 4.0.3) for the sorghum and B. distachyon genomes, and a custom repeat library was created for rice using a method described previously (Campbell et al. 2014). RepeatMasker was run within the MAKER pipeline to mask repetitive elements. To aid in gene prediction, expressed sequence tag (EST) evidence was provided by transcriptome assemblies generated from selected publically available data from the National Center for Biotechnology Information Sequence Read Archive (NCBISRA) (Supplemental Table 3.3) using Trinity (version 2014) with a minimum contig length of 150 bp and the Jaccard clip option (Haas et al., 2013). Protein evidence was provided by the SwissProt plant protein dataset (ftp://ftp.uniprot.org/pub/databases/uniprot/ current_release/knowledgebase/taxonomic_divisions/uniprot_sprot_plants.dat.gz), with the rice, sorghum, or B. distachyon protein sequences removed. For Zea mays (maize; v. 3.21), we utilized the MAKER-P gene annotation described by Law et al. (2015). Pseudogenes were identified in each species using a pipeline similar to that described by Zou et al. (Zou et al., 2009). We considered genome regions at least 40 contiguous ambiguous nucleotides (Ns) as likely-unmappable, a length that represented the size of reads in RNA-sequencing (RNA-seq) datasets used in our analyses (see below).The length of likely-unmappable were excluded when calculating the size of total exon, intron, pseudogene, and intergenic space in Fig. 3.1B. Random 93 sets of intergenic coordinates with equal length distributions as intergenic transcribed regions (ITRs, see below) were identified in each species. Random intergenic coordinates were sampled so that they did not overlap one another, transcribed regions, or likely-unmappable genome regions (as described above). We further filtered random intergenic coordinates to remove those that contained an ambiguous nucleotide, a step that removed 4-9% of random coordinates in each species. Identification and classification of transcribed regions Multiple developmentally-matched RNA-seq datasets from rice, B. distachyon, sorghum, and maize (n = 11, 11, 10, and 14 respectively) were retrieved from the Sequence Read Archive at the National Center for Biotechnology Information (Supplemental Table 3.4) (Davidson et al., 2011; Davidson et al., 2012). Reads were trimmed of low-scoring ends and Illumina adapter sequences using Trimmomatic v.0.33 (LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20). Reads ≥20 nucleotides in length and with an average Phred score ≥20 were mapped to associated genomes using TopHat v.2.1.0 (-i 5000; -I 50000; all other parameters default) (Kim et al., 2013). Transcribed region coordinates were identified by assembling unique-mapped reads with Cufflinks v.2.2.1 (-min-intron-length 5000; -max-intron-length 50000; -m 150; --no-effectivelength-correction) (Trapnell et al., 2010) while providing an associated genome sequence via the -b flag to correct for sequence-specific biases. Transcribed regions from across datasets that overlapped with one another by at least 1 nucleotide were merged. Merged transcribed regions were classified based on overlap with annotated exon, intron, and pseudogene sequences through a priority system: exon > intron > pseudogene (e.g. a transcribed region that overlapped both an exon and an intron was classified as exonic). Transcribed regions that did not overlap with any gene or pseudogene annotation were classified as intergenic. 94 Transcribed regions were further categorized as likely protein coding or repetitive. Likely protein-coding transcribed regions were defined as those that contained a protein domain from Pfam v. 31.0 (Finn et al., 2016) in any translated frame or had significant translated sequence similarity to a plant protein annotated in Phytozome v.10 (BLAST v. 2.2.26; BLASTX E-value < 1e-05) (Goodstein et al., 2012). By contrast, the presence of canonical RNA gene domains were identified within ITRs by searching for significant matches (E<1e-5) with Rfam domain covariance models (v.12) (Nawrocki et al., 2015) using Infernal v.1.1 (Nawrocki and Eddy, 2013) A set of likely repetitive transcribed regions were defined based on the presence of repeat-associated Pfam protein domains or high numbers of duplicate sequences. To identify repeat-associated protein domains and set duplicate thresholds to call sequences as highly repetitive, a set of benchmark interspersed repetitive elements were identified in each unmasked genome using RepeatMasker v.4.0.5 (-nolow -norna -qq) with the default RepeatMasker repeat library and a custom library of rice repeats generated using previously described methods (Campbell et al. 2014). Transcribed regions that contained a Pfam v.31.0 protein domain that was significantly enriched among interspersed repeats relative to exon transcribed regions (Fisher’s exact test; adjusted p<0.05; Benjamini-Hochberg procedure) (Benjamini and Hochberg, 1995) were considered likely-repetitive. The number of duplicate sequences required to call a sequence as highly-repetitive was calculated based on the duplicate counts of long terminal repeats (a subset of the interspersed repetitive elements identified by RepeatMasker) and transcribed regions that overlapped exons, through F-measure maximization. Resulting duplicate count thresholds to consider a transcribed region as repetitive were 10, 15, 33, and 202 for rice, B. distachyon, sorghum, and maize, respectively. For all duplication-related analysis, we 95 excluded sequences that were classified as repetitive, as they were considered duplicated by definition. Sequence and expression conservation Significant sequence matches were identified using BLAST searches (BLAST v. 2.2.26; BLASTN E-value < 1e-05) by searching nucleotide sequences of transcribed regions against repeat-masked whole genome sequences. BLAST searches were performed within-species to identify duplicates and cross-species to identify putative homologs. Per base substitution rates (K) between within-species query and match sequences were calculated with PHAST (Hubisz et al., 2011) using a generalized time reversible model after aligning matched regions with EMBOSS-Needle v.6.5.7.0. Blocks of conserved nucleotides present across all four species were identified using the LASTZ/MULTIZ paradigm with rice as the target genome (Blanchette et al., 2004; Harris, 2007; Hupalo and Kern, 2013) (http://genomewiki.ucsc.edu/index.php/Whole_genome_alignment_howto). Among all conserved nucleotide blocks (CNBs; n=60,801), we identified those that were exonic (n=13,616), intronic (n=125), or intergenic (n=525) in all four species and were only overlapped only by transcribed regions of the associated type (e.g. a conserved nucleotide block that was intergenic in all four species and was only overlapped by ITRs). phastCons scores were calculated for each rice nucleotide within a conserved nucleotide block (Siepel et al., 2005), with conserved and non-conserved states estimated using the --estimate-trees option. Expression conservation was assessed by determining whether a cross-species or withinspecies match was overlapped by a transcribed region (as described above). Reciprocal matches between exon, intron, and intergenic transcribed regions across CNBs or cross-species syntenic gene blocks (see below) were included in percent commonality calculations (Fig. 3.3C,D), which 96 was calculated as the number of expressed tissues in common between a pair of transcribed regions divided by the total number of expressed tissues. The five tissues that were in common among all four species were used: embryo, endosperm, seed, leaf, and anther. When replicate datasets were available, transcription evidence in a single dataset was required to consider the tissue as expressed. Similarly, expression in either seed datasets – 5 days after pollination (DAP) or 10 DAP – were required. Expected percent commonality is affected by the expression breadth of two transcripts. To control for expected percent commonality, exon percent commonality (Fig. 3.3C,D) was calculated using exon transcribed region pairs with expression breadth that matched those of conserved intergenic or intron transcribed regions. For each pair of conserved intron and intergenic transcribed regions, two pairs of exon transcribed regions with the same expression breadths were selected for comparisons. Background expectations of percent commonality (Fig. 3.3C,D, Rn) were calculated by randomly selecting tissues to match the expression breadth of intergenic and intron transcribed region pairs. For each intron and intergenic transcribed region pair, 25 random pairs of tissues were selected. The probability that a tissue would be selected for a random set was proportional to how often it appeared among transcribed regions with expression conservation. Identification of syntenic gene blocks We identified cross-species and within-species syntenic blocks by identifying of sets of collinear genes using MCScanX v.2 (Wang et al., 2012). Multiple minimum gene pair and maximum gap parameters were tested when generating collinear blocks (Supplemental Table 3.5) and we utilized values of 10 and 10, respectively, for within-species collinear blocks and 10 and 2, respectively, for cross-species blocks. Cross-species blocks were identified between the more-closely-related species pairs: rice/B. distachyon and maize/sorghum. The rates of 97 synonymous substitutions (Ks) between homologous protein-coding genes anchoring withinspecies syntenic blocks were calculated using the yn00 package in the Phylogenetic Analysis by Maximum Likelihood software (Yang, 2007). Syntenic blocks with a median Ks ≥0.7 across all anchor genes in the syntenic block in all four species were considered associated with the ρ / σ whole genome duplication (WGD) events (Paterson et al., 2004; Tang et al., 2010), while syntenic blocks in maize the an average Ks < 0.7 were associated with the more recent 12 MYO maize WGD (Supplemental Fig. 3.4) (Swigoňová et al., 2004). An additional 254 gene-pair syntenic block in rice was identified with a median block Ks of 0.13, suggesting recent duplication. However, due to the uncertain nature of the origin and timing of the duplication event (Wang et al., 2011), this block was excluded from further analysis. For exon and intron transcribed regions, syntenic duplicates and orthologs were identified as BLASTN matches (Evalue<1e-5) that overlapped a homologous anchor gene. For pseudogene transcribed regions, ITRs, and random intergenic sequences, syntenic duplicates and orthologs were BLASTN matches that were present within corresponding block regions circumscribed by homologous anchor genes. To provide annotated RNA gene comparisons, rice tRNA gene sequences marked as reliable were retrieved from tRNAdb (Jühling et al., 2009) and high-confidence rice miRNA gene sequences were retrieved from miRBase (Kozomara and Griffiths-Jones, 2014). To identify the location of tRNA and miRNA gene sequences within the rice genome, we identified perfect alignment between RNA gene sequences and the MSU.v.7 rice genome (BLASTN, 100% identity, full length alignment). We further filtered out tRNA and miRNA sequence alignments that overlapped gene or pseudogene annotation. 98 Rice function prediction features Transcription activity features Twelve transcription-related features were generated for use in rice function prediction models (Supplemental Table 3.2). The first 9 features were FPKM values (referred to as expression level; ‘Level’ in Supplemental Table 3.2) from each of the 9 tissues represented in the RNA-seq datasets described above. For replicate datasets (leaf and endosperm), expression level was taken as the average FPKM value if a region was expressed in both replicate datasets, and the single FPKM value otherwise. If a transcribed region was not expressed in an RNA-seq dataset, expression level was set to 0. Two additional features were represented by the maximum expression level among all RNA-seq datasets and the median FPKM among datasets where a transcribed region was expressed. The final feature was the expression breadth of a sequence, represented by the total number of tissues in which a sequence was expressed. For the breadth calculations, both seed datasets (5 and 10 DAP) and inflorescence datasets (early and emerging) were considered as a single tissue. Sequence conservation features Three sequence-conservation-related features were generated. The first feature was the minimum BLASTN E-value to a sequence in B. distachyon, sorghum, or maize. A threshold of 1e-05 was required to consider a match significant. Sequences without a significant cross-species match where given a score of 0. The second sequence-conservation feature was the proportion of a sequence covered by the 4-species conserved nucleotide blocks (described above). The third sequence-conservation feature was the median per-base phastCons score across the proportion of a sequence that overlapped a conserved nucleotide block. If a sequence did not overlap a conserved block the phastCons score was set to 0. 99 Histone mark and nucleosome occupancy features Twelve histone mark-related features among 10 histone marks were calculated based on 18 chromatin immunoprecipitation sequencing (ChIP-seq) datasets from NCBI-SRA (Supplemental Table 3.6). Reads were trimmed as described above and mapped to the rice MSU v.7 genome with Bowtie v2.2.5 (default parameters). Spatial Clustering for Identification of ChIP-Enriched Regions v.1.1 (Xu et al., 2014) was used to identify significant ChIP-seq peaks with a non-overlapping window size of 200, a gap parameter of 600, and an effective genome size of 0.68 (Koehler et al., 2011). For datasets with control total histone or protein datasets, a false discovery rate (FDR) ≤0.05 was utilized (see Supplemental Table 3.6). Peaks for histone marks with multiple datasets were merged. The first 10 histone mark-related features were represented by the percent overlap of a sequence with histone mark peaks for each histone marks (‘coverage’ in Supplemental Table 3.2). The other two features were the number of activationassociated histone marks and repression-associated histone marks with a peak that overlapped a sequence. Eight marks were considered activation-associated and two were considered repression-associated (Supplemental Table 3.6). A single nucleosome occupancy feature was also generated from micrococcal nuclease sequencing (MNase-seq) data. MNase-seq data was generated by Wu et al. (2014) and processed by Liu et al. (2015). The nucleosome occupancy feature was calculated as MNase-seq average read depth across the length of a sequence. DNA methylation features Sixteen DNA methylation features were calculated from bisulfite-sequencing (BS-seq) datasets from four tissues: embryo (SRR059000), endosperm (SRR059005), leaf (SRR618545), and panicle (SRR1520042). BS-seq reads were trimmed as described above and processed with Bismark v.3 (default parameters) to identify the number of reads that call cytosines as 100 methylated and unmethylated in CG, CHG, and CHH (H = A, C, or T) contexts. The first 12 features were represented by the methylation level of a sequence in CG, CHG, and CHH contexts for each of the four tissues. Methylation level of a sequence was calculated as the number of reads that mapped to CG, CHG, or CHH sites that called a cytosine site as methylated divided by the total number of reads mapping to CG, CHG, or CHH sites within the sequence. A minimum of 5 reads across 5 cytosine sites were required to calculate methylation level. Multiple minimum read (range: 1-20) and site (range: 1-10) requirements were tested and found to have little effect on the ability of resulting methylation level features to distinguish between Phen-TR and Pseu-TR sequences (Supplemental Table 3.7). The final four DNA methylation features were represented by whether a sequence exhibited a methylation pattern consistent with gene body methylation (GBM) in each of the four tissues. The GBM pattern is presence of CG methylation and absence of CHG and CHH methylation. Presence or absence of CG, CHG, and CHH states within a sequence were determined by binomial tests of the methylation level of a sequence (as described above) compared to the background methylation level across the whole genome for a given cytosine context. Binomial test P-values were corrected for multiple testing using the Benjamini and Hochberg method (FDR ≤0.05) (Benjamini and Hochberg, 1995). Sequences that were significantly enriched in CG methylation relative to the genome background and not significantly enriched in CHG and CHH methylation were considered to be gene body methylated. Machine learning approach We utilized the random forest algorithm implemented in the Scikit-learn software (Pedregosa et al., 2011) to perform machine learning runs aimed at distinguishing between functional and non-functional transcribed regions in rice. Two sets of benchmark functional 101 transcribed regions were established: those that overlapped exons of genes with documented loss-of-function phenotypes (referred to as Phen-TRs) (Lloyd et al., 2015; Oellrich et al., 2015) and those that overlapped a set of high-confidence miRNA gene annotations from miRBase (referred to as miRNA-TRs) (Kozomara and Griffiths-Jones, 2014). Benchmark non-functional sequences were represented by transcribed regions that overlapped pseudogene annotation (referred to as Pseu-TRs). Prior to model building, missing data points among features were imputed by randomly-selecting a representative value from another sequence of the same type. For example, an ITR with a missing data point would randomly-select a value for the same feature from another ITR. Two sets of machine learning-based predictions were generated: a binary prediction trained using Phen-TR and Pseu-TR sequences (Fig. 3.5A; Supplemental Fig. 3.5A) and threeclass prediction trained using Phen-TR, miRNA-TR, and Pseu-TR sequences (Fig. 3.5B; Supplemental Fig. 3.5B). For the binary predictions, 100 balanced datasets that included equal proportions of Phen-TR and Pseu-TR sequences were used for training and 10-fold crossvalidation was implemented (i.e. 90% of a dataset was used for training and the held-out 10% used for testing). Parameter sweeps of maximum tree depth (3, 5, 10, and 50) and proportion of random features (10%, 25%, 50%, 75%, square root, and log2) values were performed, with 3 and 10% providing the highest performance, respectively. For each of the 100 balanced datasets, a prediction score for each transcribed region was generated that was equal to the proportion of 500 random forest trees that predicted the region as Phen-TR; a final prediction score was calculated as the median of all scores. A threshold to predict transcribed regions as likelyfunctional or non-functional was generated by identifying the threshold with the maximum Fmeasure, which is calculated as the harmonic mean of precision (proportion of sequences 102 predicted as Phen-TRs that are truly Phen-TR sequences) and recall (proportion of true Phen-TR sequences that are predicted as Phen-TRs). The Python script utilized to generate these predictions (ML_classification.py) is available on GitHub: https://github.com/ShiuLab/MLPipeline. Binary predictions were also generated using the support vector machine and logistic regression algorithms implemented in Scikit-learn, but random forest provided the highest performance by AUC-ROC. Three-class prediction models were generated similarly to binary models, except that 500 balanced datasets were utilized due to low sample sizes among miRNA-TR sequences (n=19). Each of the 500 balanced datasets provided sequences with three prediction scores representing the proportion of 500 random forest trees predicting a sequence as Phen-TR, miRNA-TR, or Pseu-TR, and the final prediction scores were generated by calculating the median all scores. Sequences were classified based on the highest median score. Binary and three-class classifications were merged using a priority system: Phen-TR in either model > miRNA-TR in the 3-class model > Pseu-TR (Fig. 3.5C). 103 ACKNOWLEDGEMENTS I wish to thank Rosalie Sowers for processing rice histone mark data, Ning Jiang for providing a rice repeat library, Robin Buell for helpful discussions, and Jack R. S. DeCatters for inspiration and support. 104 APPENDIX 105 Supplemental Figure 3.1 Expression characteristics of transcribed regions in four Poaceae species. (A) Boxplots of length distributions among transcribed regions that overlap exon (Ex), intron (In), pseudogene (Ps), or intergenic (Ig) regions. Os: rice, Bd: B. distachyon, Sb: sorghum; Zm: maize. Nt: nucleotides. (B) Boxplots of maximum FPKM distributions among all tissues. (C) Distributions of expression breadth (i.e. # of tissues with expression evidence) for transcribed regions. A subset of exons with expression levels ±5% of intergenic transcribed regions (Ex (weak)) are also shown. (D) Percentage of sequences that reproducible across replicate leaf transcriptome datasets. Intergenic sequences were randomly-sampled to determine the background expected reproducibility (Rn). 106 Supplemental Figure 3.2 Distance and expression correlation between intergenic transcribed regions and neighboring genes. (A) Boxplots of distance distributions between genes and intergenic transcribed regions (ITR) or randomly-selected intergenic regions (RAN). Nt: nucleotides. (B) Percentage of ITR and RAN sequences that are within multiple intron length 107 Supplemental Figure 3.2 (cont’d) percentiles to the closest neighboring gene. Intron length percentiles were calculated based on the length distributions of annotated introns within each species. (C) Heatmaps of expression correlation between neighboring pairs of transcribed regions. Colors represent the median Pearson’s correlation coefficient (PCC) of expression levels (FPKM) across tissues between all transcribed regions pairs within a distance bin. Neighboring transcribed region pairs were classified according to whether they overlapped the same gene (within gene), neighboring genes, gene and pseudogene neighbors (Gene/Pseudogene), genes and surrounding intergenic space (Gene/Intergenic), or neighboring ITRs (Intergenic/Intergenic). Neighboring gene pairs were sub-classified according to whether genes were oriented in the same direction (Tandem), or different directions with proximal 5’ (Head-to-head) or 3’ (Tail-to-tail) regions. Gene/Pseudogene and Gene/Intergenic pairs were sub-classified according to whether the pseudogene or intergenic transcribed region was upstream or downstream of a gene neighbor. Os: rice, Bd: B. distachyon, Sb: sorghum; Zm: maize. 108 Supplemental Figure 3.3 Distributions of duplication types. Sequences shown include ETRs (Ex, x-axis), ITRs (Ig, x-axis), and random intergenic (Rn) sequences. Fisher’s exact tests were used to test significance between the proportion of sequences that were duplicated (sum of Ex, In, Ps, and Ig duplicate proportions) or not duplicated (None). *: p<0.05, ***: p<0.001, n.s.: not significant, p≥0.05, In: intron, Ps: pseudogene; Os: rice, Bd: B. distachyon, Sb: sorghum; Zm: maize. 109 Supplemental Figure 3.4 Distributions of synonymous substitution rates between anchor genes of within-species collinear gene blocks. Circled numbers indicate synonymous substitution rate (Ks) peaks associated with (1) a low-Ks large-scale duplication in rice, (2) a recent whole genome duplication (WGD) in maize, and (3) the ρ and σ WGD events. Average Ks values among ρ and σ duplicates have been estimated at 0.9 and 1.7, respectively (Paterson et al., 2004; Tang et al., 2010). Ks distributions for these two events are highly overlapping and cannot be effectively distinguished. Due to uncertain origin and timing of the low-Ks rice duplication (1) (Wang et al., 2011), duplicates from this event were not included in further analysis. Os: rice, Bd: B. distachyon, Sb: sorghum; Zm: maize. 110 Supplemental Figure 3.5 Prediction score distributions based on binary and three-class function prediction models. (A) Functional likelihood distributions of various sequence classes based on the binary model. TR: transcribed region; Phen: phenotype exon; Pseu: pseudogene; ETR: exon transcribed region; ITR: intergenic transcribed region. Higher and lower functional likelihood values indicate greater similarity to Phen-TRs and Pseu-TRs, respectively. Vertical dashed lines indicate the threshold for calling a sequence as functional or non-functional. The percentages to the left and right of the dashed line indicate the percent of sequences predicted as functional or non-functional, respectively. (B) Stacked bar plots indicate the prediction scores among various sequence types for each of the three classes: dark blue: phenotype exon (Phen), cyan: miRNA, red: pseudogene (Pseu). Sequences were classified as one of the three classes according to the highest prediction score. Color bars below the chart indicate the predicted class, with the same color scheme as the prediction score. Sequences classified as Phen or miRNA were considered putatively functional, while those classified as Pseu were considered 111 Supplemental Figure 3.5 (cont’d) non-functional. Percentages below a classification region indicate the proportion of sequences classified as that class. 112 Supplemental Figure 3.6 Relationship between sequence length, % exon overlap, and function predictions. Distributions of function predictions among exon transcribed regions (ETRs) of various lengths (A) and degrees of overlap with exon annotations (B). Sequences were classified as phenotype exon-like (Phen; dark blue), miRNA-like (cyan), or pseudogene-like (Pseu; red). Parentheses and brackets indicate a value was excluded from or included in a range, respectively. Nt: nucleotides. 113 CHAPTER 4: CHARACTERISTICS OF PLANT ESSENTIAL GENES ALLOW FOR WITHIN- AND BETWEEN-SPECIES PREDICTION OF LETHAL MUTANT PHENOTYPES 1 1 The work described in this chapter has been published: John P. Lloyd, Alexander E. Seddon, Gaurav D. Moghe, Matthew C. Simenc, Shin-Han Shiu (2015) Characteristics of plant essential genes allow for within- and between-species prediction of lethal mutant phenotypes. Plant Cell 8: 2133-2147 114 ABSTRACT Essential genes represent critical cellular components whose disruption results in lethality. Characteristics shared among essential genes have been uncovered in fungal and metazoan model systems. However, features associated with plant essential genes are largely unknown and the full set of essential genes remains to be discovered in any plant species. Here we show that essential genes in Arabidopsis thaliana have distinct features useful for constructing within- and cross-species prediction models. Essential genes in A. thaliana are often single copy or derived from older duplications, highly and broadly expressed, slow evolving, and highly connected within molecular networks compared to genes with non-lethal mutant phenotypes. These gene features allowed the application of machine learning methods that predicted known lethal genes as well as an additional 1,970 likely essential genes without documented phenotypes. Prediction models from A. thaliana could also be applied to predict Oryza sativa and Saccharomyces cerevisiae essential genes. Importantly, successful predictions drew upon many features, while any single feature was not sufficient. Our findings show that essential genes can be distinguished from genes with non-lethal phenotypes using features that are similar across kingdoms and indicate the possibility for translational application of our approach to species without extensive functional genomic and phenomic resources. 115 INTRODUCTION In the post-genome era, one major challenge in genetic research is in linking genotypes to phenotypes (Dowell et al., 2010; 1000 Genomes Project Consortium, 2010). Genome-wide phenotype information, obtained through large-scale loss-of-function studies, is available for several eukaryotic models including Saccharomyces cerevisiae (Winzeler et al., 1999), Caenorhabditis elegans (Kamath et al., 2003), Drosophila melanogaster (Boutros et al., 2004) and Schizosaccharomyces pombe (Kim et al., 2010). This information allows systematic analysis of genotype-phenotype connections and provides clues on homologous gene functions in species where large-scale loss-of-function analysis cannot be readily applied. By comparison, only a small proportion (~15%) of genes in the model plant Arabidopsis thaliana are associated with well-curated phenotype information (Lloyd and Meinke, 2012), despite the availability of powerful reverse genetics resources that allow for the potential of near-saturation mutagenesis studies (Kuromori et al., 2009). This is due in large part to the time and resources required for cultivating and phenotyping mutant populations. While S. cerevisiae and C. elegans have generation times measured in hours or days, A. thaliana, a relatively fast-growing plant, requires 5 to 6 weeks to begin seed production (Meyerowitz, 1989). These difficulties are exacerbated by high gene duplication rates in plants, due to both polyploidization (Soltis et al., 2009) and tandem duplications (Rizzon et al., 2006; Hanada et al., 2008), which result in many genes not exhibiting a phenotype under controlled conditions. Thus, the ability to effectively prioritize gene selection by predicting mutant phenotypes would represent an important step towards streamlining intensive and costly phenotypic analysis in plants. 116 Among genes with apparent phenotypes when lost, “essential” genes (lethal-phenotype genes) have been the target of focused analysis because they perform functions required for organismal viability and are critical in the investigation of potential drug targets in microbes (Golling et al., 2002; Firon et al., 2003; Kobayashi et al., 2003; Glass et al., 2006; Meinke et al., 2008; Silva et al., 2008). In S. cerevisiae, a variety of genomic features are associated with essential genes, including but not limited to singleton status, elevated transcription levels, and strong phylogenetic conservation (Winzeler et al., 1999; Kim et al., 2010). Some of these attributes are shared by lethal-phenotype genes in C. elegans, S. pombe, and Mus musculus (Kamath et al., 2003; Kim et al., 2010; Yuan et al., 2012). Using these features, lethal-phenotype genes have been predicted in S. cerevisiae and M. musculus (Seringhaus et al., 2006; Acencio and Lemke, 2009; Yuan et al., 2012). In plants, essential genes tend to be single copy (Mutwil et al., 2010; Lloyd and Meinke, 2012) and have distinct functional biases (Tzafrir et al., 2004; Lloyd and Meinke, 2012). It has also been shown that genes with housekeeping functions that may or may not have lethal phenotypic consequences tend to be present in single copy across many plant species (De Smet et al., 2013). In addition to single copy status, essential genes are often highly-connected in gene functional networks (Mutwil et al., 2010), and genes with embryo-lethal defects tend to be connected with one another in the AraNet functional network (Lee et al., 2010). With these pioneering studies, an outstanding question is what other characteristics plant essential genes possess. For example, although single-copy genes tend to be essential, there are a number of duplicate genes that are essential. Thus, from the gene duplication perspective, it is possible that the extent, timing, and mechanism of duplication may be important. Similarly, one would expect that cross-species conservation, selective pressure, and expression characteristics will be related 117 to whether a gene is essential or not. Nonetheless, these features have not been evaluated for their relationship with plant essential genes. Aside from the studies of essential gene features, Mutwil et al. (2010) identified clusters in their gene network with higher proportions of lethal-phenotype genes and predicted six novel essential genes. Although this study established a set of essential gene predictions in plants, the method will miss any essential genes outside of enriched clusters and therefore is not applicable genome wide. One potential solution to this is to predict lethal-phenotype genes based on many gene features beyond simply presence in a co-expression cluster, as this can produce genomewide and potentially more accurate predictions. A data integration approach that made use of sequence data and expression correlation was successful in predicting functional overlap between A. thaliana duplicates, i.e. the absence of a phenotype due to buffering effects from another gene (Chen et al., 2010). Although the prediction of genetic buffering effects represents the opposite extreme of potential mutant phenotypes, a similar methodological framework could be used to predict essentiality or other detectable phenotypes on a genome scale. However, such a framework is not currently available. To determine the feasibility of large-scale lethal-phenotype gene prediction in A. thaliana we collected loss-of-function phenotype data for ~3,500 genes and assessed relationships between phenotype lethality and gene function, copy number, duplication, expression levels and patterns, rate of evolution, cross-species conservation, and network connectivity, many of which were not explored previously in detail. We generated machine learning models to identify additional lethal-phenotype genes on the basis of multiple gene features, including a predictive model based only on sequence-derived features. Finally, as lethal-phenotype genes share many 118 characteristics between species, we tested whether lethal-phenotype predictions would be possible across species boundaries. 119 RESULTS AND DISCUSSION Phenotype classification and functions of genes with lethal phenotypes To predict lethal mutant phenotypes in Arabidopsis thaliana, loss-of-function phenotype descriptions were collected for 3,443 genes (Supplemental Table 4.1) (Kuromori et al., 2006; Ajjawi et al., 2010; Lloyd and Meinke, 2012; Savage et al., 2013), covering 12.7% of A. thaliana protein-coding genes. A phenotype was considered “lethal” if it resulted in developmental arrest at the gametophytic, embryonic, seedling, or rosette stage prior to bolting or extreme developmental defects that are expected to significantly affect plant growth in laboratory growth conditions. Under this definition, the loss-of-function phenotypes of 705 (20.5%) genes were considered lethal and the remaining (2,738; 79.5%) were considered non-lethal (Supplemental Table 4.1). Genes displaying lethal and non-lethal mutant phenotypes are referred to as “lethal genes” (essential genes) and “non-lethal genes,” respectively. Genes not in our phenotype dataset are referred to as “undocumented genes”. An earlier study demonstrated that genes involved in, for example, RNA synthesis and modification, protein synthesis, and protein degradation tend to have higher essential-to-nonessential gene ratios (Lloyd and Meinke, 2012). However, that study classified genes into 11 categories and included only 5% of A. thaliana genes. In addition, despite the differences in ratios, the statistical significance of such differences is unclear. To assess if there is a statistically significant bias in the function of lethal genes and to assess if gene functions may be useful for generating predictions genome-wide, we tested for over- and under-representation of lethal genes in Gene Ontology (GO) categories (see Methods). We identified 28 terms in which lethal genes are significantly over- or under-represented compared to non-lethal genes (Fisher’s exact tests 120 (FET), adjusted p < 0.05; Supplemental Fig. 4.1). Lethal genes in our dataset tend to be enriched in the translation, nucleolus, mitochondrion, and plastid categories and are rarely associated with signaling and regulation-related terms (signal transduction, cell communication, kinase and transcription factor activity, and response to endogenous, biotic, and abiotic stimulus). We also found that several basic developmental processes, such as reproduction, pollination, and the cell cycle, tend to be over-represented with lethal genes. In total, 27 GO terms that contain over- or under-represented numbers of lethal genes (not including the embryo development term; see Methods) were used in machine learning predictions of lethal-phenotype genes. Copy number of lethal genes In addition to functional bias, the presence or absence of paralogs is correlated with phenotypic severity in fungi (Winzeler et al., 1999; Gu et al., 2003; Kim et al., 2010) as paralogs may compensate for the loss of related genes and buffer the effects of gene loss. It has also been shown that single-copy genes in A. thaliana tend to be lethal genes (Mutwil et al., 2010), a trend also observed in another study (Lloyd and Meinke, 2012),. Consistent with these studies, lethal genes in our phenotype dataset are more commonly present as single copy genes than non-lethal genes (FET, p < 4e-10; Fig. 4.1A). This result provides additional support for the relationship between lethality and singleton status in plants, with a much larger gene set than in a previous study (Mutwil et al., 2010), and also indicates that gene copy number represents a potentially useful feature for lethal gene prediction. While we expected that lethal genes would be overrepresented in other small paralogous groups, both double- and triple-copy genes have a statistically similar proportion of lethal and non-lethal genes (FET, p=0.29 and 0.11 for doublecopy and triple-copy genes, respectively; Fig. 4.1A). Thus, the presence of even a single paralog 121 Figure 4.1 Copy number of phenotype genes in A. thaliana and O. sativa. (A) Frequency distribution of the number of paralogs (copy number) in the sets of lethal, non-lethal, and undocumented (i.e., no documented phenotype) genes. (B) Distributions of orthologous group sizes between A. thaliana and O. sativa. Rows indicate A. thaliana gene copy numbers in the orthologous groups, while columns denote phenotype categories. 122 provides appreciable functional overlap and therefore reduces the likelihood of lethality following disruption of a gene in laboratory conditions. As lethal genes are enriched among certain functional categories and tend to be single copy, it is possible that lethal gene duplicates with particular functions were preferentially reduced to single copy. This preferential reduction to single copy appears to be conserved across species. Single-copy A. thaliana lethal genes tend to more often have one Oryza sativa (rice) ortholog compared to non-lethal and undocumented genes (Fig. 4.1B). More lethal A. thaliana genes also have readily identifiable homologs in O. sativa (87%) compared to non-lethal (77%; FET of lethal vs. non-lethal, p < 5e-10) and undocumented (54%; FET of lethal vs. undocumented, p < 5e-10) genes, which suggests a stronger degree of selective constraint on lethal genes. Considering that there were repeated rounds of whole-genome duplications in both the A. thaliana and the rice lineages (Paterson et al., 2004; Cui et al., 2006), the conserved single-copy status of A. thaliana lethal genes and their rice orthologs suggests that the loss of lethal gene paralogs compared to non-lethal gene paralogs is not completely random. In addition, this conservation of single-copy status suggests that single-copy rice orthologs are likely lethalphenotype genes as well. Such cross-species conservation is explored in greater depth in a later section. Duplication timing of lethal-phenotype genes Although lethal genes are more likely to be single copy compared to non-lethal and undocumented genes, ~67% of lethal genes have paralogs, raising the question: why do some duplicate genes have a lethal phenotype in null mutant backgrounds? For genes with paralogs, a greater period since duplication may allow for a higher degree of functional divergence, which lessens the ability of duplicates to compensate for the loss of one another. An earlier study found 123 that essential genes tend to have greater protein sequence divergence from their paralogs (Lloyd and Meinke, 2012). Accordingly, we asked if lethal genes with paralogs (referred to as “lethal gene duplicates”) would be the product of older duplication events compared to non-lethal genes with paralogs (“non-lethal gene duplicates”). Using synonymous substitution rate (Ks) as a proxy for duplication time, most-similar lethal gene duplicate pairs have significantly higher Ks values (median = 1.69) than those of most-similar non-lethal gene duplicate pairs (median = 1.07; Kolmogorov–Smirnov test (KST), p < 3e-08; Fig. 4.2A). One possible explanation for the lower median Ks among non-lethal gene duplicates is that they tend to be genes that arose after duplication events took place, i.e. lineage-specific genes. To assess this, we eliminated a subset of lineage-specific genes by focusing on genes with homologs in rice and again performed the Ks analysis. The results were almost identical to the results based on the full set of lethal and nonlethal genes (median lethal Ks = 1.7, median non-lethal Ks = 1.03; KST, p < 5e-08), indicating that lineage-specific genes may not fully explain the differences in Ks distributions between lethal and non-lethal genes. Interestingly, the major Ks peak for non-lethal gene duplicates coincides with that for duplicates derived from the whole-genome duplication (WGD) event that took place 50-65 million years ago (Fig. 4.2B) (Beilstein et al., 2010). By contrast, the major Ks distribution peaks for lethal gene duplicates (Fig. 4.2A) coincide with the peak Ks for not only duplicates derived from the  but also the much older  WGD (Fig. 4.2B) (Bowers et al., 2003), contributing to the significantly higher Ks values among lethal gene duplicates compared to non-lethal ones. This suggests that lethal gene duplicates may be generated from both WGD events, raising the question of how often lethal-phenotype genes retain their duplicates from these events. 124 Figure 4.2 Duplication timing and type of A. thaliana phenotype genes. (A) Synonymous substitution rate (Ks) distributions for gene pairs of lethal, non-lethal, and undocumented genes and their most similar paralog. Gene pairs with higher Ks values are expected to be the result of older duplication events. Genes in a pair may not be from the same phenotype category. (B) Ks distributions of genes duplicated via tandem, and the  and  whole genome duplications (WGD). Some genes are derived from both tandem and whole-genome duplications. (C) Percent of duplicated lethal, non-lethal, and undocumented genes that have a paralog derived from  WGD,  WGD, and tandem duplications. Percent of all lethal, non-lethal, and undocumented 125 Figure 4.2 (cont’d) genes with significant sequence similarity (percent identity ≥ 40%) to ≥1 pseudogenes is shown in the right-most portion of the panel. 126 Assuming that duplication rates are similar among all genes, significantly higher Ks values among lethal gene duplicates suggest that duplicates of lethal genes are more frequently lost than non-lethal gene duplicates. This is consistent with the finding that lethal genes tend to be single copy (Mutwil et al., 2010; Lloyd and Meinke, 2012; Fig. 4.1). In addition, we found that significantly fewer lethal genes with paralogs have retained their duplicates generated during WGD events compared to non-lethal genes (FET, p < 4e-10 and 3e-7 for the  and  events, respectively; Fig. 4.2C). This analysis focuses on all possible duplicate pairs and the conclusion remains the same for the  WGD if we examine only the most closely-related paralogs (as in Fig. 4.2A; FET,  p < 3e-10 and p = 0.06; Supplemental Fig. 4.2). Thus, some lethal genes retain their duplicates from WGD events, but the retention rate of lethal genes is lower than that of non-lethal genes. In addition, while a major peak in the lethal gene Ks distribution (Fig. 4.2A) coincides with the Ks peak from the  WGD events (Fig. 4.2B), the lethal gene pairs underlying the peak in Fig. 4.2A may not necessarily represent duplicates retained from the  WGD event. We also found that pseudogenes resembling lethal genes are more often present compared to those resembling non-lethal genes (FET, p = 0.03), although this proportion is not significantly different from that for undocumented genes (p = 0.54; Fig. 4.2C). In addition to WGD, tandem duplication is another major mechanism that contributes to form paralogous genes (Rizzon et al., 2006; Hanada et al., 2008). We found that duplicate lethal genes are less likely to be present in tandem clusters compared to non-lethal duplicates (FET, p < 0.01) and undocumented duplicates (FET, p < 4e-10; Fig. 4.2C). Furthermore, the few lethal genes derived from tandem duplications tend to have larger Ks values (median = 1.22) compared to non-lethal (median = 0.64; KST, p = 0.05) and undocumented (median = 0.69; KST, p < 0.02) tandem duplicates. These results indicate that lethal gene duplicates have a significantly higher loss rate after WGD and a 127 significantly lower proportion of tandem genes compared to non-lethal and undocumented gene duplicates. If lethal gene duplicates cannot be attributed to tandem or whole genome duplications, then what mechanisms were responsible for generating these duplicates? One explanation may be that lethal gene duplicates were generated via WGD, but are not in present in recognizable WGD blocks. However, the WGD blocks cover ~90% of A. thaliana genes (Bowers et al., 2003), and thus the above explanation can only account for few of the lethal gene duplication events. It is also possible that duplicates of lethal genes may be commonly produced through segmental duplication events similar to those found in human (Bailey et al., 2002), but that remains to be verified. In either case, this represents an intriguing question that calls for further study. Although lethal genes tend to be present as singletons, when lethal gene paralogs are present, they are derived from relatively ancient duplication events, consistent with the interpretation that deletion of a gene with a lesser degree of functional overlap with its paralog(s) due to longer divergence time will result in more severe phenotypic effects. Our findings also identify a number of features that can be used for lethal-phenotype gene prediction, including singleton status, Ks with top paralog, presence of duplicates from the  or  WGD or tandem duplication events, presence of pseudogene counterparts, and absence of orthologs in other species (Table 4.1). Relationship between phenotype lethality and gene expression Overrepresentation of lethal genes among older duplicates compared to non-lethal genes suggests a higher degree of functional divergence among lethal gene duplicates. To explore this further, we compared gene expression levels and patterns between lethal and non-lethal genes. Duplicates with a higher degree of expression divergence are expected to perform their 128 Table 4.1 Features of essential genes in A. thaliana. Binary Sign of lethal association1 - 3.17E-10 Seq. based feature3 No Binary - 3.07E-10 No Pseudogene present Binary + 0.035 Tandem duplicate Category Feature Gene duplication  WGD duplicate retained  WGD duplicate retained Expression Evolution and conservation Networks Miscellaneous Data type P-value2 Rice4 Yeast4 No No No No Yes Yes No Binary - 7.93E-06 Yes Yes No Paralog Ks Numeric + 2.17E-08 Yes Yes Yes Gene family size Numeric - 1.20E-24 Yes Yes Yes Median expression Numeric + 1.60E-08 No Yes Yes Expression variation Numeric - 0.002 No Yes Yes Expression breadth Numeric + 5.47E-20 No Yes No Expression correlation Numeric NA 0.072 No No No Expression correlation (Ks < 2) Numeric - 0.004 No No No Core eukaryotic gene Binary + 2.44E-08 No No Yes Homolog not found in O. sativa Binary - 4.04E-10 Yes No No Percent identity in plants Numeric + 2.73E-06 Yes No No Percent identity in metazoans Numeric NA 0.254 Yes No No Percent identity in fungi Numeric NA 0.077 Yes No No A. lyrata homolog Ka/Ks Numeric - 0.012 Yes No No P. trichocarpa homolog Ka/Ks Numeric - 0.008 Yes No No V. vinifera homolog Ka/Ks Numeric - 0.003 Yes No No O. sativa homolog Ka/Ks Numeric - 0.012 Yes No No P. patens homolog Ka/Ks Numeric - 0.038 Yes No No Nucleotide diversity Numeric - 0.001 No No No Paralog Ka/Ks Numeric + 2.51E-14 Yes Yes Yes Expression module size Numeric + 1.94E-34 No No Yes Gene network connections Numeric + 9.84E-11 No No Yes Protein-protein interactions No No No Numeric NA 0.72 Gene body methylated Binary + 3.46E-10 No No No Paralog percent identity Numeric - 2.75E-33 Yes Yes Yes Protein length Numeric + 1.22E-06 Yes Yes Yes Domain number Numeric + 0.023 Yes Yes Yes 1. For each binary feature, + and – indicate that the proportion of lethal genes are significantly higher (overrepresentation) or lower (under-representation) than non-lethal genes, respectively. For each numeric feature, + and indicate that lethal genes have significantly higher or lower feature values compared to non-lethal genes, respectively. NA indicates that there is no significant difference between lethal and non-lethal genes, 2. P-values from Fisher’s exact tests (used for binary data) or Kolmogorov-Smirnov tests (used for numeric data). 3. Sequence-based features, where “Yes” indicates that a feature can be derived from genome sequence data. 4. Feature used (“Yes”) or not used (“No”) in rice or yeast lethal phenotype gene predictions. 129 molecular functions in more distinct temporal, spatial, and conditional contexts. Because of this, we expect lethal genes may show higher degrees of expression divergence with their paralogs compared to non-lethal genes. Consistent with this expectation, lethal gene duplicate pairs have significantly lower expression correlation (thus higher divergence) compared to non-lethal gene duplicates when Ks ≤ 2 (KST; 0 < Ks ≤ 1, p < 4e-4; 1 < Ks ≤ 2, p < 0.01; Fig. 4.3A). However, older lethal and non-lethal genes show similar degrees of expression correlation with paralogs (Ks > 2; KST, p = 0.35). These results are also consistent with previous findings in A. thaliana that, unlike other eukaryotic species, expression divergence is not correlated with duplication timing (Gu et al., 2002; Ganko et al., 2007). One potential explanation for the decreasing differences in expression divergence between lethal and non-lethal duplicates over time is that greater divergence in the proteincoding sequences among older duplicates has contributed to a higher degree of biochemical divergence between duplicates. Therefore the presence of a paralog with a similar expression profile no longer buffers against the consequences of gene loss. We found that lethal gene duplicates with Ks > 2 have a significantly higher ratio of non-synonymous to synonymous substitution rates compared with non-lethal duplicates (Ka/Ks; KST, p < 6e-10; Supplemental Fig. 4.3), indicating that there is increased divergence at the protein-coding level for older lethal genes compared with non-lethal genes. This raises a question: among duplicates with Ks < 1, what underlying mechanisms contribute to the differences in expression correlation between lethal and non-lethal genes? Was there selection pressure driving the expression differences between lethal genes and/or maintaining expression similarity among non-lethals? Alternatively, were the patterns we see predominantly driven by neutral processes such as drift? In this context, the distinction between lethal and non-lethal genes may simply be how paralogs accrued 130 Figure 4.3 Expression characteristics of A. thaliana phenotype genes. (A) Box plots of expression correlations (Pearson’s Correlation Coefficient, PCC) of paralogous gene pairs involving three gene categories - lethal (red), non-lethal (blue), and undocumented (grey) genes across AtGenExpress developmental dataset samples (Schmid et al., 2005). Lower expression correlation indicates increased degree of expression divergence for a gene pair. Genes in a paralog pair may or may not be from the same phenotype category. (B) Box plots of median expression levels (array hybridization intensities) of genes in three categories across array experiments. (C) Box plots of numbers of samples where genes in each category were considered expressed according to multiple thresholds. (D) Box plots of expression variation across samples (median absolute deviation/median) in each gene category. 131 mutations that contribute to expression divergence and have little to do with selection. These possibilities need to be further studied. In addition to expression divergence, expression level of a gene may affect phenotypic severity. In S. cerevisiae and Mus musculus, essential genes tended to be expressed at higher levels (Winzeler et al., 1999; Yuan et al., 2012). Consistent with findings in other species, in A. thaliana the expression levels of lethal genes across the AtGenExpress developmental expression series (n = 64; Schmid et al., 2005) are significantly higher than those of non-lethal genes (KST, p < 2e-8; Fig. 4.3B), suggesting that transcript levels are correlated with gene essentiality. In addition to expression level, lethal genes tend to be more broadly expressed across developmental stages and organs than non-lethal genes (KST, p < 5e-19, 4e-12, 6e-05, and = 0.19 for log2 intensity thresholds of 4, 6, 8 and 10, respectively; Fig. 4.3C). Finally, while lethal genes show a significantly lower degree of expression variation compared to non-lethal genes, the effect size is small (KST, p < 0.01; Fig. 4.3D). Although lethal genes tend to be highly expressed, 7% are expressed among the bottom third of all genes (defined as weakly-expressed, n= 51; log2 median intensity ≤ 4.39). Among 15 GO categories significantly overrepresented in weakly-expressed lethal genes compared to highly-expressed ones, 14 are related to transcriptional regulation due to the contribution of the same 15 genes across categories (Supplemental Table 4.2). These genes exhibit a broad spectrum of lethal phenotypes (gametophytic, embryo, and seedling) with notable developmental defects, including cotyledons with leaf-like characteristics (FUS3, LEC1, LEC2), precocious seed development (FIS2, MEA), and complete loss of the primary root (STIP). To summarize, we found that lethal gene duplicates tend to display higher expression divergence when Ks ≤ 2, and higher proteinsequence divergence when Ks > 2. We also found that lethal genes tend to be more highly and 132 broadly expressed and have lower degrees of expression variation compared to non-lethal genes. Thus, a variety of expression characteristics correlate with phenotype lethality and were incorporated into lethal-phenotype prediction models (Table 4.1). Conservation of lethal genes Due to their severe phenotypic consequences when lost, lethal genes likely experienced greater selective constraint compared to genes with non-lethal phenotypes. The Ka/Ks values between A. thaliana lethal genes and their homologs in five plant species tend to be significantly lower compared to cross-species non-lethal gene homolog pairs (KST, see figure legend for pvalues; Fig. 4.4A). Similarly, lethal genes have a significantly lower degree of nucleotide diversity among 80 accessions of A. thaliana compared to non-lethal genes (KST, p < 7e-4; Fig. 4.4B). Both results suggest that lethal genes are experiencing stronger purifying selection. There are two potential confounding factors. First, lethal genes tend to be expressed at higher levels than non-lethal genes (Fig. 4.3B) and highly-expressed genes often experience greater selective pressure due to disproportionate effects of toxic protein misfolding (Drummond et al., 2005). Second, expression levels can affect calculations of Ka/Ks due to codon usage bias (Duret and Mouchiroud, 1999). Thus we analyzed the relationship between the Ka/Ks values and median expression levels. Consistent with our expectation, we found a negative correlation between median expression levels of lethal genes and Ka/Ks values in each of the five plant species (median Pearson Correlation Coefficient, PCC = -0.23). However this relationship explains only a minor component of the variation in selective pressure experienced by lethal genes (r2 values range from 0.03 to 0.08). Thus, our finding that lethal genes are experiencing stronger negative selection is not simply due to their higher expression levels. 133 Similar to the Ka/Ks based comparison, lethal genes have significantly higher sequence identities to their best matches in other plant lineages compared to non-lethal genes (Fig. 4.4C). Although no significant difference in sequence identity is noted between lethal and non-lethal genes when considering their best matches in animal and fungal species, a significantly higher proportion of lethal genes (25%) are present in orthologous clusters consisting of genes from seven diverse eukaryotes (“core eukaryotic genes”; see Methods) compared to non-lethal genes (15.7%; FET, p < 3e-8). Lethal genes tend to be the result of older duplications and are present in fewer copies than non-lethal genes. As any set of genes with these features may be highly conserved, we assessed the effects of copy number and duplication age on the sequence conservation of lethal genes. We found that both timing of duplication (Ks value, r2 = 0.01) and gene copy number (r2 = 0.03) explain little of the variation in protein conservation across the plant kingdom for lethal genes. These results, along with those from the above analysis of the relationship between expression level and Ka/Ks, show that correlation between features that are expected to be dependent can be far from perfect, highlighting the need to consider them jointly for lethal-phenotype gene predictions. Together with our finding that lethal genes tend to have homologs in O. sativa (Fig. 4.1B), the results from Fig. 4.4 indicate a higher degree of conservation for lethal genes. Because evolutionary rate values and protein conservation metrics could prove useful in a prediction context, they were included in later lethal gene prediction (Table 4.1). However, we should emphasize that the Ka/Ks, nucleotide diversity, and cross-species sequence identity distributions between lethal and non-lethal genes overlap substantially, i.e. the effect sizes are small despite significant differences (Fig. 4.4). One explanation is that the non-lethal genes studied here are those with observable phenotypes in loss-of-function backgrounds. These non-lethal genes thus 134 Figure 4.4 Evolutionary rate and cross-species protein conservation of A. thaliana phenotype genes. (A) Ratios of non-synonymous substitutions (Ka) to synonymous substitutions (Ks) between A. thaliana genes and homologs in the same OrthoMCL cluster from Arabidopsis lyrata (Al; KST of lethal vs. non-lethal, p<0.02), Populus trichocarpa (Pt; p<0.01), Vitis vinifera (Vv; p<0.01), O. sativa (Os; p<0.02), and Physcomitrella patens (Pp; p<0.04). Lower Ka/Ks values are indicative of stronger negative selection pressure. (B) Distributions of nucleotide diversity for lethal, non-lethal, pseudo-, and undocumented genes. Higher nucleotide diversity values indicate higher degree of sequence polymorphism between A. thaliana accessions. (C) Probability density distributions of median % identity of lethal, non-lethal, and undocumented genes to top BLASTP matches in dicotyledonous plants (DC; lethal vs. non-lethal KST, p<2e-6), monocotyledonous plants (MC; p<7e-4), other embryophytic plants (OE; p=0.05), algae (AL; p<7e-6), fungi (FN; p=0.07), and metazoans (ME; p=0.25). 135 are likely subjected to strong selection, although not as strong as the selection against lethal gene mutations. We should also note that none of the examined characteristics that distinguish between lethal and non-lethal genes are perfect. As a result, multiple characteristics are considered jointly in statistical learning models for predicting lethal-phenotype genes (described in a later section). Network connectivity of lethal-phenotype genes In S. cerevisiae, proteins that are highly connected in physical protein-protein interaction networks tend to be essential (Jeong et al., 2001). Similarly, analyses of S. cerevisiae and A. thaliana gene networks based on functional relatedness between genes have demonstrated that highly connected genes tend to have severe loss-of-function phenotypes (Lee et al., 2010; Mutwil et al., 2010), and identification of co-expression clusters in A. thaliana that are enriched in lethal phenotype genes was useful in selecting and validating six novel essential genes (Mutwil et al., 2010). Further, an A. thaliana gene functional network (AraNet; Lee et al., 2010) was used to demonstrate that genes with embryo-lethal phenotypes tend to be connected with one another in the gene network (Lee et al., 2010). However, no formal prediction of essential genes using AraNet data has been performed. To verify that the relationship between network connectivity and gene phenotype lethality also exists in our phenotype dataset, we examined coexpression networks established in this study, connections in the AraNet gene network, and protein-protein interaction data (Arabidopsis Interactome Mapping Consortium, 2011). We found that lethal genes in A. thaliana tend to be found in larger co-expression modules (median size = 19) than those containing non-lethal genes (median = 13; KST, p < 2e-34; Fig. 4.5A). In addition, lethal genes tend to be co-expressed (PCC > 0.86; 99th percentile of all pairwise coexpression coefficients) with a greater number of genes (median = 20) than non-lethal genes 136 Figure 4.5 Network connectivity of A. thaliana phenotype genes. (A) Co-expression module sizes of lethal, non-lethal, and undocumented genes. Modules represent groups of genes clustered via K-means clustering (K=2000) based on expression similarity across A. thaliana development samples in AtGenExpress (Schmid et al., 2005). (B) Number of edges connected to genes in the three categories with a log likelihood ≥ 1 in the AraNet network (Lee et al., 2010). Higher numbers of edges indicate increased connectivity within the network. (C) Distributions of the numbers of genes with 1, 2, 3, 4, or ≥5 protein-protein interactions (PPIs) (Arabidopsis Interactome Mapping Consortium, 2011). 137 (median = 5; KST, p < 8e-8). Similarly, A. thaliana lethal genes tend to have a greater number of interactions (median = 53) in the AraNet gene functional network than non-lethal genes (median = 30; KST, p < 1e-10; Fig. 4.5B). These results corroborate previous findings based on analysis of co-expression networks (Mutwil et al., 2010) and indicate that high interactivity in gene networks may be useful for establishing lethal-phenotype gene predictions. In contrast to the relationship between gene essentiality and centrality in gene networks, connectivity within a physical protein-protein interaction network does not seem to be correlated with phenotypic severity in A. thaliana (Lloyd and Meinke, 2012). There remains no clear relationship between our updated phenotype data and protein-protein interactions (KST, p = 0.73; Fig. 4.5C). It remains to be determined if this is due to the lower coverage in the A. thaliana interactome map (12% of proteins compared to 30% in yeast; Jeong et al., 2001). Taken together, the higher connectivity among phenotype-lethal genes is consistent with the interpretation that their disruption may interfere with the function of many other genes. One additional possibility is low-interacting genes may play more specialized roles than highinteracting ones. Thus, low-interacting genes would tend not to have strong phenotypic consequences when mutated. Prediction of lethal genes using a machine learning framework Based on analysis of functional annotation, gene copy number, duplicate retention patterns, gene expression, evolutionary rates, cross-species conservation, and network connectivity (Table 4.1), we have identified a wide variety of genomic features correlated with phenotype lethality. In addition to these features, genes encoding longer proteins with a larger number of domains and those with CG gene body methylation (Takuno and Gaut, 2012) are more likely to exhibit lethal phenotypes upon disruption (Table 4.1). As these features do not 138 correlate perfectly with whether disruption of a gene results in a lethal or non-lethal phenotype, it raises the question as to whether a meaningful prediction of phenotype lethality is feasible if they are jointly considered. In addition, it remains unclear how these disparate features would differ in their contribution to A. thaliana lethal-phenotype gene prediction. To address these questions, we applied machine learning methods that have been used for essential gene predictions in budding yeast (Seringhaus et al., 2006; Acencio and Lemke, 2009) and mouse (Yuan et al., 2012). A matrix of genes with a documented phenotype and their associated values for different features (Supplemental Table 4.3) was used as input for six machine learning classifiers (see Methods; Fig. 4.6A). To build the classifiers, 90% of our dataset was used for model building (training phase) and 10% was held out for testing the accuracy of the predictive model (validation phase). The model building process was repeated 10 times so that every gene in our phenotype dataset was held out of the model building exactly one time (10-fold cross-validation). We should emphasize that the training data were completely independent from the validation data. Performance was evaluated by calculating the Area Under the Curve - Receiver Operating Characteristic (AUC-ROC), where the AUC-ROC of a model based on random guessing is ~0.5 and that of a perfect model is 1.0. Using the best performing classifier, Random Forest (Ho, 1995), the lethal gene prediction model AUC-ROC is 0.81, which is significantly better than random guessing (Fig. 4.6A; see methods). To provide an alternative interpretation of model performance, we also examined the precision (proportion of predicted genes that are truly lethal) and recall (proportion of true lethal genes recovered) of our model (Fig. 4.6B). Based on this analysis, to correctly recover 50% of lethal genes, our precision is at 57%. Because the proportion of lethal genes in our dataset is 0.2, the precision of random guesses is expected to be ~20% (grey line, Fig. 4.6B), indicating that our methods perform 139 Figure 4.6 Machine learning performance of essential gene predictions. (A) Receiver Operating Characteristic (ROC) curves of the predictive models based on Random Forest (RF), Logistic regression (Log), SMO-SVM (SMO), Naïve Bayes tree (NBtree), Naïve Bayes (NB), and J48 decision tree (J48) using the best-performing parameter sets. Area Under the Curve (AUC)-ROC is indicated in the inset; an AUC-ROC value of ~0.5 is equivalent to random guessing while an AUC-ROC of 1 indicates perfect predictions. Diagonal dashed line: the expected performance of a model based on random guessing (RAN). Curves closer to the upper left corner of the chart represent a better predictive performance than curves that are closer to the diagonal dashed line. Error bars: standard error between 10 cross-validation runs. (B) Precisionrecall curves for the models from (A). Precision: the proportion genes predicted as lethal that are actual lethal genes. Recall: the proportion of actual lethal genes predicted as lethal. Horizontal dashed line: the proportion of lethal genes in the dataset, which represents the expected precision 140 Figure 4.6 (cont’d) based on random guessing. Error bars: standard error between cross-validation runs. (C) AUCROC values of the best-performing Random Forest machine learning classification using all features (All; median of ten cross-validation runs) in comparison to AUC-ROCs of the models based on each of the five most informative features (cyan background) and the median of all single feature predictions. (D) AUC-ROC values of within-species (cyan background) and crossspecies (green and magenta background) predictions in A. thaliana (At), O. sativa (Os), and S. cerevisiae (Sc). Species on the left side of the arrows indicate the species from which data were used to train a prediction model. Independent datasets from species on the right of the arrows were used for testing. Predictions between A. thaliana and S. cerevisiae were performed both with a full feature set (green background) and with a subset of features in which the sign of SMO weights agree (magenta background). (E) Ranks and signs of SMO weights of 29 features available in A. thaliana (At), rice (Os), and yeast (Sc). A lower number rank of a feature weight indicates greater importance for within-species predictions. A more positive weight indicates better association with phenotype lethality and a more negative weight indicates better association with non-lethality. 141 reasonably well. By comparison, an earlier study based on co-expression clusters predicted and validated 6 novel essential genes out of a pool of 20 candidate genes (Mutwil et al., 2010). This represents a precision of 30%. However, this methodology applies only to essential geneenriched clusters (357 genes, ~1.3% of A. thaliana annotated genes). As a result, any essential genes outside of these clusters cannot be predicted using this methodology and recall is expected to be very low. This highlights the need to consider a large suite of gene features for genomewide predictions of essential genes. We next used the best performing model to classify the rest of the 23,763 undocumented genes as potentially lethal or non-lethal when lost. This provided each gene with a “lethalphenotype score”, a value between 0 and 1 where higher values indicate higher confidence that a gene will display a lethal phenotype upon disruption. Notably, the highest lethal-phenotype score for an undocumented gene is 0.72 while the highest scoring lethal gene in our phenotype dataset is almost 0.90, indicating a distinction between the lethal genes in our training dataset and the rest of the genes in the A. thaliana genome and potential biases in our model. Applying the machine learning model and a lethal-phenotype score threshold resulting in the highest Fmeasure (harmonic mean of precision = 0.54 and recall = 0.54; arrow, Fig. 4.6B) in the training data, we identify 1,970 (8%) undocumented genes whose loss is expected to result in a lethal phenotype (Supplemental Table 4.1). Using this lethal-phenotype score threshold (0.31), we expect that 1,059 (1,970*precision) are correctly-predicted lethal genes and that there are 885 (1,059/recall-1,059) additional lethal genes that we fail to detect. Thus, we anticipate an additional 1,944 lethal genes in the undocumented gene set. Together with the 705 known lethal genes, 10% (~2,700) of A. thaliana protein-coding genes are expected to have lethal mutant phenotypes based on the lethal-phenotype score threshold of 0.31. As an additional validation 142 step, we collected an independent set of 60 A. thaliana phenotype genes based on a literature search (17 with a lethal phenotype; Supplemental Table 4.1) that are not included in our initial dataset of 3,443 lethal and non-lethal genes. The AUC-ROC of the best-performing Random Forest model is 0.83 for this independent set. Of the 17 genes with lethal phenotypes in this 60gene dataset, 13 (77%) are correctly predicted as lethal and of the 43 non-lethal genes, 40 (93%) are correctly predicted as non-lethal. To determine what features are among the most important to our predictions, we assessed the performance reduction resulting from the removal of each feature from prediction analysis. We found that no single feature is particularly critical for predictive performance by itself (all leave-one-out models have AUC-ROC ≥ 0.8 compared to 0.81 for the full model; Supplemental Table 4.4). These results are corroborated by the fact that machine learning predictions using all data types perform much better than predictions based on any single feature by itself (median AUC-ROC = 0.54, Fig. 4.6C; Supplemental Table 4.4). We also found that 46 features (80% of all features) are required to achieve an AUC-ROC of 0.80 (Supplemental Fig. 4.4), indicating that the contributions from most features are critical. This is consistent with our observations that, although the features we used are generally significantly distinct between lethal and nonlethal genes, in many cases the effect sizes are small (Fig. 4.1-4.5). In addition, many features we included here are likely dependent, although correlation between features is low (see Methods). In any case, our findings indicate that the predictive models for lethal-phenotype genes are robust and draw upon a wide variety of gene features to generate meaningful classifications of lethal and non-lethal genes. 143 Cross-species predictions of lethal-phenotype genes Considering that some of the features we found to be correlated with gene lethality have been shown to be important in other species (Seringhaus et al., 2006; Yuan et al., 2012), this raises the question whether a prediction model trained with A. thaliana data (A. thaliana model) can be used to predict phenotype lethality across species boundaries. To test this, we first collected rice phenotype data for 92 genes (18 lethal, see Methods; Supplemental Table 4.1) and analogous genomics and functional annotation data (Table 4.1). Then a “rice model” was generated and applied to predict lethal genes within the rice test set using 2-fold cross validation. Surprisingly, this performed as well as within-A. thaliana predictions (AUC-ROC = 0.82; Fig. 4.6D), indicating that a significantly smaller gene set still allows lethal gene classification with comparable accuracy. We also tested if good predictions could be made in A. thaliana using a reduced gene set by randomly sampling 20 lethal and 80 non-lethal genes from our full dataset and making predictions with 2-fold cross-validation. This was repeated 100 times. The AUCROCs of these 100 models range from 0.55 to 0.88 with a median of 0.75. The median AUCROC indicates that few phenotype genes can be used to establish lethal gene prediction model with reasonable performance. The rather large variance in AUC-ROCs indicates that the genes included during model building can have a significant effect, particularly if the sample size is small. Next, to test if prediction across plant species is feasible, we trained prediction models using data from one species and predicted phenotype lethality for genes in test sets from another species (see Methods). Using the A. thaliana model, we can predict rice lethal genes with an AUC-ROC of 0.80 (Fig. 4.6D). A rice model is also capable of identifying A. thaliana lethal genes, although the performance is reduced (AUC-ROC = 0.72; Fig. 4.6D). This is potentially 144 because use of a model trained on a small gene set is ineffective in classifying a large number of genes in another species. Given that cross-species phenotype prediction is feasible between A. thaliana and rice, which diverged over 200 million years ago, we sought to determine if lethal-phenotype genes were predictable across a significantly greater phylogenetic distance by predicting lethal genes in S. cerevisiae. We collected a S. cerevisiae phenotype dataset consisting of 6,075 genes (1,189 lethal, see Methods; Supplemental Table 4.1), 11 types of genomic data (Table 4.1) and assignments to 25 GO terms. Similar to earlier studies, the yeast model performed well in predicting yeast lethal genes (AUC-ROC = 0.82; Fig. 4.6D). Application of the A. thaliana model on yeast data performed reasonably well (AUC-ROC = 0.73), while an S. cerevisiae model on A. thaliana data performed worse (AUC-ROC = 0.65). The reduced performance in cross plant-fungal species predictions prompted us to investigate which features were meaningful for predictions in one species but not the other. The relative importance of features can be assessed according to a weight measure derived by the Support Vector Machine (SVM) classifier, which indicates the importance of a feature for predicting lethal (more positive weight) or non-lethal (more negative weight) genes. Between A. thaliana and S. cerevisiae, we found that 15 out of 36 features (42%) had opposing signs on their SVM weights (Supplemental Table 4.5), suggesting that, despite their importance for distinguishing lethal and non-lethal genes, these features have opposite contributions. For example, genes associated with the reproduction GO term tend to be phenotype-lethal in A. thaliana, but non-lethal in S. cerevisiae. When features with opposing correlations with lethality between the two species were removed, the performance improved in predicting S. cerevisiae lethal genes with the A. thaliana model (AUCROC from 0.73 to 0.75) and in predicting A. thaliana lethal genes with the S. cerevisiae model 145 (AUC-ROC from 0.65 to 0.73; Fig. 4.6D). While not as accurate as A. thaliana-O. sativa crossspecies predictions, these results demonstrate that lethal phenotypes can be predicted between two species separated by 1.4 billion years of evolution. In addition, although many features of essential genes are similar between species, some features are predictive of lethal phenotypes in one species but of non-lethal phenotypes in another. As lethal-phenotype genes tend to be well conserved, it may be expected that crossspecies predictions would perform well. However, we should emphasize that the A. thaliana-rice cross-species predictions do not make use of any conservation-based features, and as a result, sequence conservation is unrelated to the performance of these cross-species predictions. For A. thaliana-yeast cross-species predictions, only one feature is related to gene conservation: presence as a core eukaryotic gene. While this is important for predictions (based on SVM feature weights; Supplemental Table 4.5), if cross-species predictions are performed using only the core eukaryotic gene feature, the AUC-ROC of predictions falls from 0.75 to 0.63 for A. thaliana-to-yeast cross-species predictions and from 0.73 to 0.55 for yeast-to-A. thaliana predictions. Further, if within-A. thaliana predictions are performed using only sequence conservation and evolutionary rate features, the AUC-ROC of essential gene predictions is 0.60 (compared to 0.81 with the full feature set). These results serve to further emphasize that neither protein conservation nor any single feature can sufficiently explain gene essentiality by itself, and that drawing upon a robust set of gene features provides a far more accurate prediction of essential genes. To compare and contrast gene features important for essential gene prediction in all three species, we evaluated the importance of 29 features that are available in A. thaliana, rice, and yeast. Feature importance and relationship with phenotype lethality were assessed using the rank 146 and sign of SVM weights that are akin to the importance of a feature for predicting lethal (more positive weight) or non-lethal (more negative weight) genes (Fig. 4.6E). Features have generally similar importance for essential gene predictions in each species, although their relationship with lethality in each species is often not the same (i.e. opposing signs on SVM weights). We find five features that are relatively important for predictions and have the same sign in each species: median expression level, transcription factor activity, singleton status, cellular component organization, and signal transduction. These features likely represent characteristics shared by essential genes across kingdoms. Despite general similarity in feature importance, there are apparent species-specific features. For example, mitochondrial protein localization represents a feature important for predicting essential genes in plants but not in yeast. In addition, while response to endogenous stimulus and expression variation are relatively unimportant for predictions in plants, they are important for yeast predictions. Some species-specific features are not shared between more closely related taxa. For example, translation is important for lethalphenotype predictions in A. thaliana but not in rice and yeast. For yeast, this may be due to a larger portion of translation-related genes being identified, including factors that are less central and essential to the process of protein synthesis. In rice, the smaller dataset of phenotype genes may not include many genes involved in the translation process, and therefore the term is not relevant to the predictions. Thus, we cannot rule out the possibility that some of the differences we found are due to differences in how functional categories are annotated across species. It will be more informative to examine lethal phenotype status on a gene-by-gene basis by asking whether and why orthologous genes are essential in one species but not the other. Lastly, because few sequenced species have the extensive functional genomic resources found in A. thaliana, rice, and S. cerevisiae, we sought to determine if a model based only on 147 features that can be generated from a genome sequence (Table 4.1) can accurately predict lethalphenotype genes. A machine learning model without input from expression and interactome features was generated for predicting A. thaliana lethal genes and performed with an AUC-ROC of 0.74. This result suggests that essential genes can be predicted with only sequence-based features. Interestingly, it has also been shown that sequence-based features are important in the identification of functional overlap between related genes (Chen et al., 2010). Our finding represents an important step that should prove useful in analyzing newly sequenced organisms that lack robust expression and interactome datasets. 148 CONCLUSION We identified a set of genomic features that significantly correlate with genes that have lethal phenotypes when disrupted in A. thaliana. Similar to findings in yeast and mouse (Seringhaus et al., 2006; Yuan et al., 2012), these features can be used to predict genes with lethal phenotypes in plants. We also show that lethal-phenotype gene prediction models can be applied across species with reasonable performance. This provides strong evidence that the characteristics of essential genes can be defined based on genome sequence features and largescale functional genomics data and, in some cases, are shared between species. We predict that a smaller percentage of A. thaliana genes are essential (10%) in comparison to S. cerevisiae, M. musculus, and S. pombe (18%, 19%, and 26%, respectively; Kim et al., 2010; Yuan et al., 2012). Considering the presence of multiple rounds of genome duplication in the past 100 million years in the A. thaliana lineage, the presence of duplicates is likely a major contributor to the difference. Nonetheless, we should emphasize that although individual characteristics can be used to distinguish between genes with lethal and non-lethal phenotypes, in many cases the effect sizes are rather small. Thus, despite the statistical significance, lethal-phenotype genes are more accurately predicted when many features are considered jointly. Another consideration is that the cause-effect relationship between these features and phenotype lethality are not always obvious. For example, while lethal-phenotype genes tend to be single copy or have ancient duplicates, it is not known if stochastic gene loss simply results in essentiality for the remaining duplicate. Alternatively, there may be preferential loss of duplicates of essential genes, perhaps due to an inability to neo- or subfunctionalize many essential gene functions or because essential genes disproportionately function in dosage- 149 dependent processes. While our finding that lethal-phenotype genes tend to have similar duplicate retention and loss patterns across lineages is consistent with the preferential loss possibility, a more detailed analysis on this topic is warranted. Although the machine learning model performs well with high AUC-ROC, there also remains room for improvement in essential gene prediction. For our analysis, we restricted features to those in which we could provide a priori reasoning for association with phenotype lethality. Alternatively, a more data-driven approach that includes more genomic signatures without apparent relationships to phenotype lethality (e.g. histone marks, cis-regulatory complexity, or chromatin state) may allow the discovery of previously ignored factors. Another potential way to improve prediction is to focus on more narrowly-defined sets of essential genes. Because lethal phenotypes can result from the loss of a broad range of functions, we cannot necessarily expect all essential genes to possess the same sets of characteristics, as suggested by the significant association of lethal-phenotype genes with multiple characteristics but mostly with small effect sizes. As a result, it is reasonable to hypothesize that there exist distinct sets of essential genes where genes in each set share common characteristics. If this is the case, it will be intriguing to uncover the underlying reasons for the existence of such gene sets. Taken together, our findings provide a detailed look at the factors predictive of gene phenotype lethality. Through a joint analysis of evolutionary (duplication, conservation) and functional (expression, Ka/Ks) characteristics of lethal-phenotype genes, this study advances our understanding of the evolution of essential genes. In addition, we provide genome-wide plant essential gene predictions and large-scale validation of cross-species lethal-phenotype predictions, building on earlier results focused on fungal or metazoan species and on smaller plant gene datasets. The predictive performance of our models highlights a promising avenue for 150 prioritizing candidate genes for large-scale phenotyping efforts in A. thaliana, particularly essential genes. The feasibility of cross-species predictions suggests that model plant phenotype data can be useful for the identification of essential genes in other plant species. 151 METHODS Phenotype Data Sources Descriptions of gene-based, loss-of-function mutant phenotypes in Arabidopsis thaliana were retrieved from three sources: (1) a published phenotype dataset (Lloyd and Meinke, 2012), (2) the Chloroplast 2010 Database (Ajjawi et al., 2010; Savage et al., 2013), and (3) the RIKEN Phenome database (Kuromori et al., 2006). Phenotype descriptions for genes in Oryza sativa (rice) were gathered from four sources: (1) a published phenotype dataset (Lloyd and Meinke, 2012), (2) literature search and manual curation (search terms: rice, lethal, mutant, phenotype, null, and knockout), (3) Oryzabase (Kurata and Yamazaki, 2006), and (4) Gramene (Monaco et al., 2014). Saccharomyces cerevisiae (yeast) phenotype annotations were obtained from the Saccharomyces Genome Database (http://www.yeastgenome.org; Cherry et al., 2012). If a gene had conflicting phenotype assignments from multiple sources, the lethal phenotype description was given priority. For yeast, phenotypes annotated to the “inviable” phenotype ontology term were considered lethal, while those annotated to the “viable” term were considered non-lethal. Only phenotypes associated with a null allele were included. An independent set of 60 A. thaliana phenotype genes were identified from recently-published literature by searching for articles in the PubMed database that included the keywords “Arabidopsis” and “lethal” and were published in 2012 or 2013. This independent dataset includes 24 genes for which a homozygous single-gene mutant was viable and included in at least the attempted construction of a double knockout mutant for the GABI-DUPLO project (Bolle et al., 2013). 152 Gene Ontology Functional Annotation Gene Ontology (GO) gene annotations for A. thaliana and yeast were downloaded from the GO database (http://www.geneontology.org/), and version 7 rice annotations were downloaded from the Rice Genome Annotation Project (Kawahara et al., 2013) (http://rice.plantbiology.msu.edu). A. thaliana and yeast annotations were mapped to the plant slim ontology using the map2slim program in the GOperl package (http://search.cpan.org/~cmungall/go-perl/). For A. thaliana, only gene-GO terms associated with experimental or computational evidence codes were utilized, while those based only on curation and author statements were excluded. Of the 97 terms in the plant GO slim subset, three were excluded because they are the root terms (biological process, molecular function, cellular component), 59 were excluded because they are not significantly over- or underrepresented in lethal genes, one was excluded because it is associated with few A. thaliana phenotype genes (<1%), and five were excluded because they are highly overlapping in gene membership with another significantly enriched term (Pearson’s Correlation Coefficient, PCC ≥ 0.50). Among overlapping terms, one representative term was chosen based on the lowest adjusted p-value from FETs, except in the case of pairwise overlap between the “response to stress” term and “response to biotic stimulus”/”response to abiotic stimulus,” where “response to stress” was removed despite having a lower p-value to maintain the distinction in functional responses to biotic and abiotic environmental factors. Because 151 of 329 genes in the embryo development term are included in our phenotype dataset, it was excluded to prevent ascertainment bias. Plant GO slim terms plastid, embryo development, and pollination were excluded from analysis involving yeast data. GO enrichment analysis using the full list of terms beyond the plant slim subset was also performed in A. thaliana to determine enrichment of both highly expressed 153 (genes with the top 1/3 expression levels) and weakly expressed (genes with the bottom 1/3 expression levels) lethal genes (Supplemental Table 4.2). In all GO analyses, p-values were adjusted for multiple testing based on the Benjamini and Hochberg procedure (Benjamini and Hochberg, 1995). Evolutionary Rate Calculations and Analysis of Duplicates and Pseudogenes Paralogs in A. thaliana, O. sativa, and S. cerevisiae and homologs between A. thaliana and five different plant species (Arabidopsis lyrata, Populus trichocarpa, Vitis vinifera, O. sativa, and Physcomitrella patens) were identified with OrthoMCL (inflation parameter = 1.5). Protein sequences for A. thaliana were downloaded from The Arabidopsis Information Resource (Version 10; www.arabidopsis.org), sequences for S. lycopersicum were downloaded from the Sol Genomics Network (Version 2.4; www.solgenomics.net), sequences for O. sativa were downloaded from the Rice Genome Annotation Project (Version 7; rice.plantbiology.msu.edu), sequences for S. cerevisiae were downloaded from the Saccharomyces Genome Database (www.yeastgenome.org), and sequences for A. lyrata, P. trichocarpa, and V. vinifera were downloaded from Phytozome (Version 9; www.phytozome.net). In Fig. 4.1A, the paralog copy number for each A. thaliana gene equaled the size of the OrthoMCL cluster the gene in question resided in. In Fig. 4.1B, to identify orthologs between A. thaliana and O. sativa and to assess duplicate retention and loss, a gene-species tree reconciliation approach was used. First, protein sequences of genes in each A. thaliana-O. sativa OrthoMCL cluster were aligned using MUSCLE (Edgar, 2004). Ten maximum likelihood trees for each aligned cluster were built using RAxML (Stamatakis, 2014) to identify the tree with the highest likelihood. The trees were midpoint rooted with retree in the PHYLIP package and parsed with Notung (Chen et al., 2000) to identify duplication and speciation nodes in the gene 154 trees. A group of genes sharing a speciation node in a gene tree were regarded as an orthologous group. Rates of synonymous (Ks) and nonsynonymous (Ka) substitutions were calculated between homologous gene pairs using the yn00 package in PAML (Yang, 2007). Highly similar or dissimilar sequence pairs (Ks < 0.005 and Ks > 3, respectively) were excluded from further analyses. In cross-species Ka/Ks calculations, the median Ka/Ks value between each A. thaliana gene and genes from other species in the same OrthoMCL cluster was used as a representative value. In A. thaliana, O. sativa, and S. cerevisiae, a paralogous pair for Ks analysis only (e.g. Fig. 4.2A) was defined by identifying a gene and its top-scoring match in BLAST similarity searches (Altschul et al., 1990). Nucleotide diversity between 80 A. thaliana accessions was calculated according to an earlier study (Moghe et al., 2013). Genes with paralogs produced in the  or  whole genome duplication events were identified by Bowers et al. (2003). Two genes were defined as a tandem duplicate pair if they have a BLASTP E-value < 1e-10 and are no more than 10 genes apart. Pseudogenes were identified through the pipeline described by Zou et al. (2009). Clusters of orthologous genes were downloaded from the National Center for Biotechnology Information (Tatusov et al., 2003). Core eukaryotic genes were defined as genes present in clusters that included at least one gene from each of the seven species in the analysis (A. thaliana, C. elegans, D. melanogaster, Encephalitozoon cuniculi, Homo sapiens, S. cerevisiae, and S. pombe). BLAST similarity searches were performed between A. thaliana protein sequences and the protein sequences of 34 other plant species present in Phytozome v9, including 26 dicotyledonous, 6 monocotyledonous, and 2 other embryophyte species. Similarity searches were also performed between A. thaliana and 8 fungal species (Aspergillus nidulans, Coprinopsis 155 cinerea, Cryptococcus neoformans, Fusarium oxysporum f. sp. lycopersici, Neurospora crassa, Puccinia graminis f. sp. tritici, S. cerevisiae, and S. pombe) and 8 metazoan species (C. elegans, Ciona savignyi, Danio rerio, D. melanogaster, Gallus gallus, H. sapiens, M. musculus, and Xenopus tropicalis). Fungal and metazoan protein sequence annotations were retrieved from FungiDB (www.fungidb.org) and Ensembl (www.ensemblgenomes.org), respectively. Expression Data Sources and Processing The AtGenExpress development microarray data (Schmid et al., 2005) was downloaded from the Weigel lab (http://www.weigelworld.org/resources/microarray/AtGenExpress/). Samples involving data from mutant plants were removed and the median value of the replicates was used as a representative expression value for each gene. Pre-processed RNA-seq data from O. sativa were downloaded from the Rice Genome Annotation Project (http://rice.plantbiology.msu.edu/expression.shtml). Data for testing differential gene expression were excluded from further analyses. For S. cerevisiae, a time-course cell-cycle expression dataset was used (Orlando et al., 2008). Median and maximum expression and variation of expression were calculated for each gene in the datasets. Expression variation was represented by median absolute deviation divided by the median as it is a measure that does not require a normality assumption. For O. sativa expression breadth was calculated by counting the number of tissues in which expression was greater than zero fragments per kilobase of exon per million reads mapped (FPKM). For AtGenExpress data, a series of thresholds (log2 intensity = 4~10) for calling whether a gene was expressed or not was tested. The log2 intensity threshold of 4 resulted in the lowest p-value (KST) from testing if the distributions of number of datasets a gene was expressed in were significantly different between lethal and non-lethal genes and was used in machine learning analysis. Expression correlations between a gene and putative paralogs 156 (defined as genes belonging to the same OrthoMCL cluster) were evaluated using PCC and the maximum PCC with a paralogous gene was reported in Fig. 4.3A and used in machine learning analysis. Network Analysis Co-expression modules in the AtGenExpress expression data were identified through Kmeans clustering with K = 5~2000. Clusters generated with K=2000 resulted in the lowest pvalue from KSTs that tested whether the co-expression module size distributions for lethal and non-lethal genes were significantly different and were used in subsequent analysis. Pairwise expression correlations (PCC) between all genes for which expression data was available were calculated. A gene pair with a PCC of 0.86 (99th percentile of all pairwise comparisons) was considered co-expressed. The AraNet gene network dataset (Lee et al., 2010) was downloaded from http://www.functionalnet.org/aranet/ and any gene pair with a log likelihood score ≥ 1 was considered to be functionally related for our analyses. S. cerevisiae gene network data generated by Costanzo et al. (2010) were retrieved from the Saccharomyces Genome Database. Proteinprotein interaction data from the Arabidopsis Interactome Mapping Consortium (Arabidopsis Interactome Mapping Consortium, 2011) were retrieved from the supplemental data associated with the publication. Self-interactions and interactions involving a mitochondrial or plastid gene were excluded from analysis. Machine learning predictions Phenotype predictions were carried out using machine learning algorithms implemented in the Waikato Environment for Knowledge Analysis software (WEKA; Hall et al., 2009). The features we used are shown in Table 4.1, and a complete matrix of all genes and feature values is available in Supplemental Table 4.3. We first tested if the targeted features were correlated with 157 one another through pairwise Spearman rank correlation analysis. We found that 95% and 99% of feature pairs show a correlation of ≤0.22 and ≤0.40, respectively. This indicated that there was no extensive overlap and thus all features were used in subsequent analysis. Six classifiers capable of handling binary, numeric, and missing data were tested: J48 decision tree, logistic regression, naïve Bayes, naïve Bayes tree, Random Forest, and sequential minimal optimization support vector machine (SMO-SVM). Ten-fold cross validation was performed for all machine learning runs, except for those involving rice phenotype data, where a low number of instances necessitated two-fold cross-validation. A grid search was implemented to identify best-performing parameters. Grid searches for each classifier included a parameter for the proportion of lethal-to-non-lethal instances to include in each round of predictions. For the Random Forest classifier, a model trained with a 1-to-1 ratio of lethal to non-lethal genes (AUCROC=0.8) performed similarly as models trained with a dataset containing other ratios of lethal to non-lethal genes (maximum AUC-ROC = 0.81). Additional parameters for the following classifiers were also examined: J48 decision tree, pruning confidence; logistic regression, ridge; SMO-SVM, complexity constant; random forest, number of random features to consider. The – M option was invoked in SMO-SVM runs, which provides a confidence score between 0 and 1 with predictions and was used as the “lethal-phenotype score.” For random forest, 100 trees were built during the parameter search phase. All other parameters were default values. Bestperforming parameter sets for each classifier were determined by AUC-ROC, which was calculated and visualized using the ROCR package (Sing et al., 2005). Models were built using best-performing parameter sets and randomly shuffled lethal and non-lethal gene labels. The AUC-ROC values from 100 iterations of gene label shuffling for all six classifiers ranged from 0.45 to 0.55 with a median of 0.5. 158 To predict whether an undocumented gene was lethal, the lethal-phenotype score resulting in the highest F-measure [harmonic mean of precision (proportion of predictions correct) and recall (proportion of true positives predicted)] was used as the threshold to call potential lethal-phenotype genes. Features most important to the prediction analysis were evaluated by leave-one-out analysis, wherein features were excluded one at a time, and effects on performance in comparison to a full feature set were recorded. To evaluate how many features are required to have comparable performance as the full model, 57 models were built and evaluated with increasing numbers of features. The order in which features were included was based on SVM weight, where features with the highest absolute weight were added first. During cross-species predictions, numeric data were discretized into quantiles (for example, data points in the lowest quantile were set to 1, while data points in the highest quantile were set to 5), to ensure that data were present in similar ranges and distinctions within data drawn by the machine learning algorithms could be applied to data from another species. 159 ACKNOWLEDGEMENTS I thank Sahra Uygun and David Hufnagel for providing illustrious data without which this project would have crumbled. 160 APPENDIX 161 Supplemental Figure 4.1 Over- and under-representation of phenotype genes in Gene Ontology categories. Proportions of lethal, non-lethal, and undocumented genes present in (A) biological processes, (B) molecular functions, or (C) cellular components categories. Only categories that have significantly over- or under-represented numbers of lethal genes relative to non-lethal genes are shown, based on adjusted p-values of Fisher exact tests where *, **, and *** indicate  = 0.05, 0.01, and 0.001, respectively. Categories in which lethal genes are 162 Supplemental Figure 4.1 (cont’d) underrepresented are indicated by a cyan background. 163 Supplemental Figure 4.2 Proportions of most similar paralogs produced in whole-genome duplication events. Proportion of lethal, non-lethal, and undocumented genes with close paralogs (most similar by BLASTP e-value) generated in the  or  WGD events. 164 Supplemental Figure 4.3 Evolutionary rates between paralogs. Box plots of Ka/Ks values between paralogous pairs of lethal, non-lethal, or undocumented genes. Gene pairs are nonexclusive with regard to phenotype category. Distributions of ratios are shown as they relate to the rate of synonymous substitutions (Ks) between a gene pair, where low Ks values indicate recent duplicates and high Ks values indicate older duplicates. 165 Supplemental Figure 4.4 Performance of essential gene predictions with increasing numbers of features. AUC-ROCs of 57 Random Forest models using 1 to 57 features detailed in Table 4.1. Features were included in the order of highest-to-lowest absolute SVM weight. 166 CHAPTER 5: CONCLUSIONS 167 Assessing the functionality of genomic sequences The research presented in Chapters 2 and 3 focused on defining functional regions in genomes, with an emphasis on intergenic transcribed regions (ITRs). This work was guided by three aims: 1) defining functional regions in genomes by generating function prediction models, 2) prediction of likely-functional ITRs through application of function prediction models, and 3) detailed evaluation of the evolutionary histories of ITRs. I find that functional genome sequences can be successfully defined by integrating genetic, biochemical, and evolutionary evidence, as machine learning models accurately distinguish between benchmark functional (phenotype and RNA genes) and non-functional (pseudogenes and random intergenic) sequences in both A. thaliana and rice. Expression data proved highly informative for predictions, as combining 24 transcriptional activity features in A. thaliana provided predictions with an AUC-ROC performance of 0.97, similar to results in human (Tsai et al., 2017). Nevertheless, transcript evidence alone was a poor predictor of functionality, with 80% of A. thaliana pseudogenes being expressed. This underscores that the presence of a transcript alone should not be used as evidence to indicate a sequence is functional. Function prediction models were applied to ITRs and predict 2,754 and 4,427 rice and A. thaliana ITRs as likely functional, respectively. These are likely highly enriched in functional sequences and should be considered strong candidates for future experimental studies. However, only ~40% of ITRs in each species are predicted as functional, indicating that ITRs may primarily represent non-functional sequences. In addition, I anticipated that if ITRs generally represented non-functional sequence, the proportion of intergenic space covered by ITR transcript fragments would increase with genome size. When this hypothesis was tested, I observed that coverage of intergenic regions with transcript evidence has a positive and 168 significant correlation with genome size across 15 flowering plant species, while coverage of genic expression had no correlation. Further, evaluation of the evolutionary histories of ITRs finds that these sequences are predominantly species-specific that generally lack long-term duplicate retention. In addition, expression is rarely present in ITR duplicates or in the rare cases when an ITR ortholog can be identified, indicating that the expression state of ITRs is transient and unstable. Overall, these three lines of evidence suggest most ITRs are not under selection, and thus I conclude that intergenic expression is primarily the result of transcriptional noise. Because of this, I recommend that the null hypothesis for the functionality of intergenic transcripts is that they represent the product of noisy expression, which can be overturned with compelling experimental evidence for the functionality of a sequence. Despite the overall success of function prediction models, interpretation of two sets of predictions remains challenging. First, function prediction models classified 18% of annotated protein-coding genes in A. thaliana and 15% of transcribed regions overlapping annotated exons in rice as non-functional. However, it is unclear whether these predictions should be interpreted as false negatives among truly functional sequences or true negatives among annotated genes that are decaying and en route to pseudogene status. Second, ITRs in A. thaliana and rice that were predicted as functional tend to be close to annotated genes. Given the proximity to genic regions, the influence of the open and active chromatin regions surrounding genes may represent a confounding factor. Additionally, ITRs near genes could be unannotated extensions of genes. Advances in read lengths produced during sequencing could yield evidence to determine whether and how often ITRs are extensions of nearby genes. Analysis of cap analysis gene expression (CAGE) datasets could also be informative. Last, experimental validation via loss-of-function 169 analysis would be particularly useful to assess annotated genes predicted as non-functional and gene-proximal ITRs predicted as functional, but is currently lacking. Pseudogenes functional classifications from prediction models that included benchmark RNA genes resulted in high false positive rates (FPRs; ~30% in A. thaliana, ~40% in rice). A small subset of pseudogenes likely represent truly functional sequences (Poliseno et al., 2010; Karreth et al., 2015), however the majority of pseudogenes are anticipated to be neutrallyevolving, non-functional sequences (Li et al., 1981; Svensson et al., 2006). Two issues likely contribute to high FPRs: 1) lack of high-quality benchmark RNA genes and 2) limited feature sets. For the first issue, there are only rare examples of RNA genes with documented loss-offunction phenotypes in plants. Instead, I turned to community-based curation and utilized “highconfidence” miRNA annotations. Without strong evidence provided by phenotype data, it is possible that these miRNAs contain a substantial proportion of false positive gene annotations, contributing to false positive predictions in pseudogenes. For the second issue, biochemical features utilized were associated with transcription, as DNA methylation, chromatin accessibility, and histone marks act as transcriptional regulators. Thus, if transcription characteristics are similar between sequence classes, functional predictions will be challenging. The identification of feature sets unrelated to transcription could provide information to distinguish between RNA genes and pseudogenes. One additional consideration is that the processing of existing features could be expanded. For example, identification of DNA methylation and histone mark profiles across the length of sequences could prove more informative than basic intensity or coverage measures that have been utilized thus far. It is unlikely that prediction models in their current state are able to distinguish between the functionality of a DNA sequence and the functionality of transcript. For example, trans- 170 acting regulatory enhancers have been associated with transcript evidence (referred to as eRNAs) (Lam et al., 2014). Despite a regulatory role for the enhancer sequence, it is unknown whether eRNA transcripts are functional. Instead, the open chromatin state of enhancers may increase the likelihood of noisy transcription occurring within the region. Further, the physical act of transcription may be important for maintaining the chromatin state of a sequence (Gerstein et al., 2007). In both cases, a DNA sequence may be functional while a transcript derived from the sequence is not. This highlights the limitations of prediction models that have been established to classify likely-genic sequences. Future work to adapt prediction models toward identifying regulatory and other classes of potentially-functional genome sequences would represent an important extension. Given the successful prediction of functional sequences in two distinct plant groups (dicotyledonous and monocotyledonous plants) and metazoans (Tsai et al., 2017), it is likely that the data integration framework described in Chapters 2 and 3 will be applicable in most biological systems. An intriguing next question is whether function prediction models can be applied across species. Currently, I have helped to generate datasets of similar function-related features in A. thaliana, rice, and human. Further, there is extensive transcriptional activity, epigenetic, and evolutionary data are available for numerous model systems (Gerstein et al., 2010; Roy et al., 2010; Cherry et al., 2012; Mouse ENCODE Consortium et al., 2012). Using the wealth of available data, prediction models can be generated and validated using analogous feature sets and benchmark sequences in multiple systems. It would then be straightforward to apply function prediction models across species (see Chapter 4 for an example using lethal phenotype gene predictions). The potential for cross-species predictions might be most exciting for models generated with only transcriptional activity features, which provided highly accurate 171 predictions. Successful transcription-only cross-species predictions would open up any species with a sequenced genome and moderate amount of transcriptome data as a feasible target for prediction models generated in model systems. Predictions of mutant phenotypes The research described in Chapter 4 aims to assess the role of genes (i.e. functional genome sequences) by predicting of loss-of-function mutant phenotypes. Phenotype predictions were focused on essential genes, i.e. those with lethal mutant phenotypes, as this set of genes has been a target of historical, focused study (Meinke et al., 2008). Based on this analysis, I found that essential genes shared characteristics that distinguished them from genes with non-lethal mutant phenotypes. Essential genes tend to be single-copy or derived from older duplication events, highly and broadly expressed, strongly conserved, and highly connected in gene networks. Machine learning-based integration of 57 total features provided accurate predictions of genes with lethal and non-lethal mutant phenotypes (AUC-ROC =0.81). Thus, machine learning approaches represent useful methods to prioritize genes for large-scale phenotyping analysis. Perhaps most intriguing, essential gene prediction models could be successfully applied across species, indicating that essential genes in different species exhibit similar characteristics. Together with successful predictions of essential genes based solely on sequence-based features (i.e. without expression or gene network data), cross-species predictions highlight the potential for a machine learning approach to facilitate translation of phenotype data in model systems to non-model systems, particularly for essential genes. Essential gene predictions, while reasonably accurate, exhibit substantial false positive and false negative rates. One issue is likely that essential genes represent a heterogeneous group of genes. Genes with gametophyte-, embryo-, and seedling-lethal phenotypes, for example, may 172 exhibit distinct characteristics. Predictions may also be improved by employing additional features, particularly biochemical features such as DNA methylation and histone mark patterns, which were largely excluded. In addition, essential genes tend to be connected with one another within gene networks (Lee et al., 2010), but any features assessing putative interaction with known essential genes were not implemented. One last issue with the predictions in Chapter 4 is that it is not clear whether essential gene predictions are predicting lethality specifically or phenotype severity generally. To test this, it would be informative to compare the lethal phenotype prediction scores of genes that exhibit obvious visible phenotypes to those that are associated with subtle biochemical or cellular phenotype. Building on the success of essential gene predictions, the next step will be to determine what other phenotypes can be accurately predicted. A foundational issue for such a task is identifying features that can effectively distinguish sets of functionally-related genes with specificity. Lethality represents a broad phenotype category and therefore properties such as median expression level, duplication patterns, and evolutionary histories were informative. However, similar features would likely provide minimal information for more specific phenotype categories. Instead, gene interactions may prove critical for such predictions. Analysis based on unsupervised clustering has established that genes functioning in similar pathways frequently exhibit co-expression (Eisen et al., 1998; Spellman et al., 1998; Lee et al., 2004a; Uygun et al., 2016). A next step would be to overlay supervised classification techniques on expression clustering to determine the accuracy and limitations in which genes with related functions can be classified through co-expression. One additional possibility could be to adapt function prediction models described in Chapters 2 and 3 toward predicting tissue- or cell type-specific functionality, 173 which could be feasible if appropriate functional genomics datasets are available. However, such an approach is untested to date. Plants feature an abundance of duplication events. As a result, the presence of functionally-redundant paralogs frequently masks the consequences of gene disruption and hampers phenotyping efforts. For example, 10% of A. thaliana genes were predicted as essential, compared to at least 18% of genes in Saccharomyces cerevisiae, Mus musculus, and Schizosaccharomyces pombe. To help counteract pervasive duplication, efforts are underway to catalog phenotypes resulting from the disruption of gene pairs (Bolle et al., 2013). Computational methods could provide additional information needed to streamline phenotyping of genes that exhibit functional overlap. Initial machine learning-based predictions of gene pairs that are likely functionally redundant have been established (Chen et al., 2010). Extensions of these models that incorporate novel features, particularly biochemical signatures across duplicates, and semi-quantitative redundancy classes (e.g. fully-redundant vs. partiallyredundant) could be provided. Overlaying predictions of redundancy on top of phenotype predictions could prove to be a highly effective approach. Concluding remarks Overall, the research described in this dissertation highlights successful computational approaches to merge heterogeneous sets of data and produce accurate biological classifications. These classifications serve not only to prioritize candidates for future experimental characterization, but provide meaningful biological insights into the characteristics that define functional genome sequences and gene essentiality. Such computational approaches work in harmony with experimental approaches by directing the production of high quality data. In turn, these data allow for synergistic refinement and expansion of computational approaches. With 174 increasingly large stores of data produced through the use of modern sequencing and biotechnological techniques, computational approaches will serve a key role in the future of the biological study. 175 REFERENCES 176 REFERENCES 1000 Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061–73 Acencio ML, Lemke N (2009) Towards the prediction of essential genes by integration of network topology, cellular localization and biological process information. BMC Bioinformatics 10: 290 Ajjawi I, Lu Y, Savage LJ, Bell SM, Last RL (2010) Large-scale reverse genetics in Arabidopsis: case studies from the Chloroplast 2010 Project. Plant Physiol 152: 529–40 Altschul SF, Gish W, Miller W, Myers E, Lipman D (1990) Basic local alignment search tool. J Mol Biol 215: 403–410 Amundson R, Lauder G V (1994) Function without purpose. Biol Philos 9: 443–469 Arabidopsis Interactome Mapping Consortium (2011) Evidence for network evolution in an Arabidopsis interactome map. Science 333: 601–7 Bailey J a, Gu Z, Clark R a, Reinert K, Samonte R V, Schwartz S, Adams MD, Myers EW, Li PW, Eichler EE (2002) Recent segmental duplications in the human genome. Science 297: 1003–1007 van Bakel H, Nislow C, Blencowe BJ, Hughes TR (2010) Most “dark matter” transcripts are associated with known genes. PLoS Biol. doi: 10.1371/journal.pbio.1000371 van Bakel H, Nislow C, Blencowe BJ, Hughes TR (2011) Response to “The reality of pervasive transcription.” PLoS Biol 9: 7–10 Bassel GW, Glaab E, Marquez J, Holdsworth MJ, Bacardit J (2011) Functional network construction in Arabidopsis using rule-based machine learning on large-scale data sets. Plant Cell 23: 3101–16 Bazykin GA, Kochetov A V. (2011) Alternative translation start sites are conserved in eukaryotic genomes. Nucleic Acids Res 39: 567–577 Beadle GW, Tatum EL (1941) Genetic Control of Biochemical Reactions in Neurospora. Proc Natl Acad Sci 79: 499–505 Beilstein MA, Nagalingum NS, Clements MD, Manchester SR, Mathews S (2010) Dated molecular phylogenies indicate a Miocene origin for Arabidopsis thaliana. Proc Natl Acad Sci USA 107: 18724–18728 Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc 57: 289–300 177 Berardini TZ, Reiser L, Li D, Mezheritsky Y, Muller R, Strait E, Huala E (2015) The Arabidopsis information resource: Making and mining the “gold standard” annotated reference plant genome. Genesis 53: 474–485 Bernard D, Prasanth K V, Tripathi V, Colasse S, Nakamura T, Xuan Z, Zhang MQ, Sedel F, Jourdren L, Coulpier F, et al (2010) A long nuclear-retained non-coding RNA regulates synaptogenesis by modulating gene expression. EMBO J 29: 3082–3093 Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF a, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al (2004) Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 14: 708–15 Boeck ME, Huynh C, Gevirtzman L, Thompson OA, Wang G, Kasper DM, Reinke V, Hillier LW, Waterston RH (2016) The time-resolved transcriptome of C. elegans. Genome Res 26: 1441–1450 Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30: 2114–2120 Bolle C, Huep G, Kleinbölting N, Haberer G, Mayer K, Leister D, Weisshaar B (2013) GABI-DUPLO: a collection of double mutants to overcome genetic redundancy in Arabidopsis thaliana. Plant J 75: 157–71 Bork P, Jensen LJ, von Mering C, Ramani AK, Lee I, Marcotte EM (2004) Protein interaction networks from yeast to human. Curr Opin Struct Biol 14: 292–9 Boutros M, Kiger AA, Armknecht S, Kerr K, Hild M, Koch B, Haas SA, Paro R, Perrimon N (2004) Genome-wide RNAi analysis of growth and viability in Drosophila cells. Science 303: 832–5 Bowers JE, Chapman BA, Rong J, Paterson AH (2003) Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422: 433– 438 Brown JB, Boley N, Eisman R, May GE, Stoiber MH, Duff MO, Booth BW, Wen J, Park S, Suzuki AM, et al (2014) Diversity and dynamics of the Drosophila transcriptome. Nature 512: 393–399 Bulger M, Groudine M (2010) Enhancers: The abundance and function of regulatory sequences beyond promoters. Dev Biol 339: 250–257 Cai J, Zhao R, Jiang H, Wang W (2008) De novo origination of a new protein-coding gene in Saccharomyces cerevisiae. Genetics 179: 487–496 Campbell MS, Holt C, Moore B, Yandell M (2014a) Genome Annotation and Curation Using MAKER and MAKER-P. Curr Protoc Bioinforma. doi: 10.1002/0471250953.bi0411s48 Campbell MS, Law M, Holt C, Stein JC, Moghe GD, Hufnagel DE, Lei J, Achawanantakun 178 R, Jiao D, Lawrence CJ, et al (2014b) MAKER-P: A Tool Kit for the Rapid Creation, Management, and Quality Control of Plant Genome Annotations. Plant Physiol 164: 513– 524 Cantarel BL, Korf I, Robb SMC, Parra G, Ross E, Moore B, Holt C, Sánchez Alvarado A, Yandell M (2008) MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res 18: 188–96 Cao J, Schneeberger K, Ossowski S, Günther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, et al (2011) Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet 43: 956–963 Carvunis A-R, Rolland T, Wapinski I, Calderwood MA, Yildirim MA, Simonis N, Charloteaux B, Hidalgo CA, Barbette J, Santhanam B, et al (2012) Proto-genes and de novo gene birth. Nature 487: 370–374 Chen F, D’Auria JC, Tholl D, Ross JR, Gershenzon J, Noel JP, Pichersky E (2003) An Arabidopsis thaliana gene for methylsalicylate biosynthesis, identified by a biochemical genomics approach, has a role in defense. Plant J 36: 577–588 Chen H-W, Bandyopadhyay S, Shasha DE, Birnbaum KD (2010) Predicting genome-wide redundancy using machine learning. BMC Evol Biol 10: 357 Chen K, Durand D, Farach-colton M (2000) Notung: a program for dating gene duplications and optimizing gene family trees. Bioinformatics 7: 429–447 Cherry JM, Hong EL, Amundsen C, Balakrishnan R, Binkley G, Chan ET, Christie KR, Costanzo MC, Dwight SS, Engel SR, et al (2012) Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res 40: D700-5 Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, Sevier CS, Ding H, Koh JLY, Toufighi K, Mostafavi S, et al (2010) The genetic landscape of a cell. Science 327: 425– 431 Crick FHC (1963) On the Genetic Code. Science 139: 461–464 Cui L, Wall PK, Leebens-Mack JH, Lindsay BG, Soltis DE, Doyle JJ, Soltis PS, Carlson JE, Arumuganathan K, Barakat A, et al (2006) Widespread genome duplications throughout the history of flowering plants. Genome Res 16: 738–49 Cummins R (1975) Functional Analysis. J Philos 72: 741 Davidson RM, Gowda M, Moghe G, Lin H, Vaillancourt B, Shiu SH, Jiang N, Robin Buell C (2012) Comparative transcriptomics of three Poaceae species reveals patterns of gene expression evolution. Plant J 71: 492–502 Davidson RM, Hansey CN, Gowda M, Childs KL, Lin H, Vaillancourt B, Sekhon RS, de Leon N, Kaeppler SM, Jiang N, et al (2011) Utility of RNA Sequencing for Analysis of 179 Maize Reproductive Transcriptomes. Plant Genome J 4: 191 Doolittle WF (2013) Is junk DNA bunk? A critique of ENCODE. Proc Natl Acad Sci USA 110: 5294–300 Doolittle WF, Brunet TDP, Linquist S, Gregory TR (2014) Distinguishing between “function” and “effect” in genome biology. Genome Biol Evol 6: 1234–1237 Dowell RD, Ryan O, Jansen A, Cheung D, Agarwala S, Danford T, Bernstein DA, Rolfe PA, Heisler LE, Chin B, et al (2010) Genotype to phenotype: a complex problem. Science 328: 469 Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH (2005) Why highly expressed proteins evolve slowly. Proc Natl Acad Sci USA 102: 14338–43 Duret L, Mouchiroud D (1999) Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis. Proc Natl Acad Sci USA 96: 4482– 4487 Eddy SR (2013) The ENCODE project: missteps overshadowing a success. Curr Biol 23: R259-61 Edgar RC (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5: 113 Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95: 14863–14868 ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489: 57–74 Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, et al (2016) The Pfam protein families database: Towards a more sustainable future. Nucleic Acids Res 44: D279–D285 Fire A, Xu S, Montgomery MK, Kostas SA, Driver SE, Mello CC (1998) Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature 391: 806– 811 Firon A, Villalba F, Beffa R, Enfert C (2003) Identification of essential genes in the human fungal pathogen Aspergillus fumigatus by transposon mutagenesis. Eukaryot Cell 2: 247– 256 Force a, Lynch M, Pickett FB, Amores a, Yan YL, Postlethwait J (1999) Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151: 1531–45 Friedel M, Nikolajewa S, Sühnel J, Wilhelm T (2009) DiProDB: a database for dinucleotide properties. Nucleic Acids Res 37: D37--40 180 Furey T, Cristianini N, Duffy N (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16: 906–914 Gaertner FH, Cole KW (1977) A cluster-gene: Evidence for one gene, one polypeptide, five enzymes. Biochem Biophys Res Commun 75: 259–264 Ganko EW, Meyers BC, Vision TJ (2007) Divergence in expression between duplicated genes in arabidopsis. Mol Biol Evol 24: 2298–2309 Gerstein MB, Bruce C, Rozowsky JS, Zheng D, Du J, Korbel JO, Emanuelsson O, Zhang ZD, Weissman S, Snyder M (2007) What is a gene, post-ENCODE? History and updated definition. Genome Res 17: 669–681 Gerstein MB, Lu ZJ, Van Nostrand EL, Cheng C, Arshinoff BI, Liu T, Yip KY, Robilotto R, Rechtsteiner A, Ikegami K, et al (2010) Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Science 330: 1775–87 Glass JI, Assad-Garcia N, Alperovich N, Yooseph S, Lewis MR, Maruf M, Iii CAH, Smith HO, Venter JC (2006) Essential genes of a minimal bacterium. Proc Natl Acad Sci USA 103: 425–430 Golling G, Amsterdam A, Sun Z, Antonelli M, Maldonado E, Chen W, Burgess S, Haldi M, Artzt K, Farrington S, et al (2002) Insertional mutagenesis in zebrafish rapidly identifies genes essential for early vertebrate development. Nat Genet 31: 135–40 Goodstein DM, Shu S, Howson R, Neupane R, Hayes RD, Fazo J, Mitros T, Dirks W, Hellsten U, Putnam N, et al (2012) Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res 40: D1178-86 Graur D (2017) An Upper Limit on the Functional Fraction of the Human Genome. Genome Biol Evol 9: 1880–1885 Graur D, Zheng Y, Price N, Azevedo RBR, Zufall R a, Elhaik E (2013) On the immortality of television sets: “function” in the human genome according to the evolution-free gospel of ENCODE. Genome Biol Evol 5: 578–90 Griffiths PE, Stotz K (2006) Genes in the postgenomic era. Theor Med Bioeth 27: 499–521 Gu Z, Nicolae D, Lu HHS, Li WH (2002) Rapid divergence in expression between duplicate genes inferred from microarray data. Trends Genet 18: 609–613 Gu Z, Steinmetz L, Gu X, Scharfe C, Davis R, Li W (2003) Role of duplicate genes in genetic robustness against null mutations. Nature 42: 63–66 Guil S, Esteller M (2012) Cis-acting noncoding RNAs: friends and foes. Nat Struct Mol Biol 19: 1068–1075 Gulko B, Gronau I, Hubisz MJ, Siepel A (2014) Probabilities of Fitness Consequences for 181 Point Mutations Across the Human Genome. Guo H-S, Xie Q, Fei J-F, Chua N-H (2005) MicroRNA directs mRNA cleavage of the transcription factor NAC1 to downregulate auxin signals for arabidopsis lateral root development. Plant Cell 17: 1376–1386 Guttman M, Amit I, Garber M, French C, Lin MF, Feldser D, Huarte M, Zuk O, Carey BW, Cassady JP, et al (2009) Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 458: 223–227 Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene Selection for Cancer Classification using Support Vector Machines. Mach Learn 46: 389–422 Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, Couger MB, Eccles D, Li B, Lieber M, et al (2013) De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc 8: 1494–1512 Hall M, Frank E, Holmes G (2009) The WEKA data mining software: an update. ACM SIGKDD 11: 10–18 Hanada K, Higuchi-Takeuchi M, Okamoto M, Yoshizumi T, Shimizu M, Nakaminami K, Nishi R, Ohashi C, Iida K, Tanaka M, et al (2013) Small open reading frames associated with morphogenesis are hidden in plant genomes. Proc Natl Acad Sci USA 110: 2395–400 Hanada K, Zou C, Lehti-Shiu MD, Shinozaki K, Shiu S-H (2008) Importance of lineagespecific expansion of plant tandem duplicates in the adaptive response to environmental stimuli. Plant Physiol 148: 993–1003 Hardiman KE, Brewster R, Khan SM, Deo M, Bodmer R (2002) The bereft gene, a potential target of the neural selector gene cut, contributes to bristle morphogenesis. Genetics 161: 231–247 Harris RS (2007) Improved Pairwise Alignment of Genomic DNA. Pennsylvania State Univ. Thesis Heinen TJAJ, Staubach F, Häming D, Tautz D (2009) Emergence of a New Gene from an Intergenic Region. Curr Biol 19: 1527–1531 Hilario M, Kalousis A, Müller M, Pellegrini C (2003) Machine learning approaches to lung cancer prediction from mass spectra. Proteomics 3: 1716–9 Hirai MY, Sugiyama K, Sawada Y, Tohge T, Obayashi T, Suzuki A, Araki R, Sakurai N, Suzuki H, Aoki K, et al (2007) Omics-based identification of Arabidopsis Myb transcription factors regulating aliphatic glucosinolate biosynthesis. Proc Natl Acad Sci 104: 6478–6483 Hittinger CT, Carroll SB (2007) Gene duplication and the adaptive evolution of a classic 182 genetic switch. Nature 449: 677–681 Ho TK (1995) Random decision forests. Proc. 3rd Int. Conf. Doc. Anal. Recognit. pp 278–282 Hsieh L-C, Lin S-I, Shih AC-C, Chen J-W, Lin W-Y, Tseng C-Y, Li W-H, Chiou T-J (2009) Uncovering small RNA-mediated responses to phosphate deficiency in Arabidopsis by deep sequencing. Plant Physiol 151: 2120–2132 Hubisz MJ, Pollard KS, Siepel A (2011) PHAST and RPHAST: phylogenetic analysis with space/time models. Brief Bioinform 12: 41–51 Hughes AL (1994) The Evolution of Functionally Novel Proteins after Gene Duplication. Proc R Soc B, Biol Sci 256: 119–124 Hupalo D, Kern AD (2013) Conservation and functional element discovery in 20 angiosperm plant genomes. Mol Biol Evol 30: 1729–44 Ivanova N, Dobrin R, Lu R, Kotenko I, Levorse J, DeCoste C, Schafer X, Lun Y, Lemischka IR (2006) Dissecting self-renewal in stem cells with {RNA} interference. Nature 442: 533–538 Jacob F (1977) Evolution and tinkering. Science 196: 1161 Jeong H, Mason SP, Barabási A-L, Oltvai ZN (2001) Lethality and centrality in protein networks. Nature 411: 41–2 Johnson JM, Edwards S, Shoemaker D, Schadt EE (2005) Dark matter in the genome: evidence of widespread transcription detected by microarray tiling experiments. Trends Genet 21: 93–102 Johnson ME, Viggiano L, Bailey JA, Abdul-Rauf M, Goodwin G, Rocchi M, Eichler EE (2001) Positive selection of a gene family during the emergence of humans and African apes. Nature 413: 514–519 Jühling F, Mörl M, Hartmann RK, Sprinzl M, Stadler PF, Pütz J (2009) tRNAdb 2009: Compilation of tRNA sequences and tRNA genes. Nucleic Acids Res 37: 159–162 Kaessmann H (2010) Origins, evolution, and phenotypic impact of new genes. Genome Res 20: 1313–1326 Kamath RS, Fraser AG, Dong Y, Poulin G, Durbin R, Gotta M, Kanapin A, Le Bot N, Moreno S, Sohrmann M, et al (2003) Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature 421: 231–7 Karreth FA, Reschke M, Ruocco A, Ng C, Chapuy B, Léopold V, Sjoberg M, Keane TM, Verma A, Ala U, et al (2015) The BRAF pseudogene functions as a competitive endogenous RNA and induces lymphoma in vivo. Cell 161: 319–332 183 Kawahara Y, de la Bastide M, Hamilton JP, Kanamori H, McCombie WR, Ouyang S, Schwartz DC, Tanaka T, Wu J, Zhou S, et al (2013) Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice 6: 4 Kelemen O, Convertini P, Zhang Z, Wen Y, Shen M, Falaleeva M, Stamm S (2013) Function of alternative splicing. Gene 514: 1–30 Kellis M, Wold B, Snyder MP, Bernstein BE, Kundaje A, Marinov GK, Ward LD, Birney E, Crawford GE, Dekker J, et al (2014) Defining functional DNA elements in the human genome. Proc Natl Acad Sci USA 111: 6131–8 Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14: R36 Kim DU, Hayles J, Kim D, Wood V, Park HO, Won M, Yoo HS, Duhig T, Nam M, Palmer G, et al (2010) Analysis of a genome-wide set of gene deletions in the fission yeast Schizosaccharomyces pombe. Nat Biotechnol 28: 617–623 King RD, Sternberg MJ (1990) Machine learning approach for the prediction of protein secondary structure. J Mol Biol 216: 441–57 Knowles DG, Mclysaght A, Knowles DG, Mclysaght A (2009) Recent de novo origin of human protein-coding genes. Genome Res 19: 1–9 Kobayashi K, Ehrlich SD, Albertini A, Amati G, Andersen KK, Arnaud M, Asai K, Ashikaga S, Aymerich S, Bessieres P, et al (2003) Essential Bacillus subtilis genes. Proc Natl Acad Sci USA 100: 4678–83 Kochetov A V. (2008) Alternative translation start sites and hidden coding potential of eukaryotic mRNAs. BioEssays 30: 683–691 Koehler R, Issac H, Cloonan N, Grimmond SM (2011) The uniqueome: a mappability resource for short-tag sequencing. Bioinformatics 27: 272–274 Kozomara A, Griffiths-Jones S (2014) miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Res 42: D68--73 Krishnakumar V, Hanlon MR, Contrino S, Ferlanti ES, Karamycheva S, Kim M, Rosen BD, Cheng C-Y, Moreira W, Mock SA, et al (2015) Araport: the Arabidopsis information portal. Nucleic Acids Res 43: D1003--9 Krueger F, Andrews SR (2011) Bismark: a flexible aligner and methylation caller for BisulfiteSeq applications. Bioinformatics 27: 1571–1572 Kurata N, Yamazaki Y (2006) Oryzabase. An integrated biological and genome information database for rice. Bioinformatics 140: 12–17 184 Kuromori T, Takahashi S, Kondou Y, Shinozaki K, Matsui M (2009) Phenome analysis in plant species using loss-of-function and gain-of-function mutants. Plant Cell Physiol 50: 1215–31 Kuromori T, Wada T, Kamiya A, Yuguchi M, Yokouchi T, Imura Y, Takabe H, Sakurai T, Akiyama K, Hirayama T, et al (2006) A trial of phenome analysis using 4000 Dsinsertional mutants in gene-coding regions of Arabidopsis. Plant J 47: 640–51 Lai K-MV, Gong G, Atanasio A, Rojas J, Quispe J, Posca J, White D, Huang M, Fedorova D, Grant C, et al (2015) Diverse Phenotypes and Specific Transcription Patterns in Twenty Mouse Lines with Ablated LincRNAs. PLoS One 10: e0125522 Lam MTY, Li W, Rosenfeld MG, Glass CK (2014) Enhancer RNAs and regulated transcriptional programs. Trends Biochem Sci 39: 170–182 Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, Muller R, Dreher K, Alexander DL, Garcia-Hernandez M, et al (2012) The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res 40: D120210 Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10: R25 Law M, Childs KL, Campbell MS, Stein JC, Olson AJ, Holt C, Panchy N, Lei J, Jiao D, Andorf CM, et al (2015) Automated Update, Revision, and Quality Control of the Maize Genome Annotations Using MAKER-P Improves the B73 RefGen_v3 Gene Models and Identifies New Genes. Plant Physiol 167: 25–39 Lee HK, Hsu AK, Sajdak J, Qin J, Pavlidis P (2004a) Coexpression Analysis of Human Genes Across Many Microarray Data Sets. Genome Res 14: 1085–1094 Lee I, Ambaru B, Thakkar P, Marcotte EM, Rhee SY (2010) Rational association of genes with traits using a genome-scale gene network for Arabidopsis thaliana. Nat Biotechnol 28: 149–56 Lee I, Date S V, Adai AT, Marcotte EM (2004b) A probabilistic functional network of yeast genes. Science 306: 1555–8 Lehti-Shiu MD, Uygun S, Moghe GD, Panchy N, Fang L, Hufnagel DE, Jasicki HL, Feig M, Shiu S-H (2015) Molecular Evidence for Functional Divergence and Decay of a Transcription Factor Derived from Whole-Genome Duplication in Arabidopsis thaliana. Plant Physiol 168: 1717–1734 Levine MT, Jones CD, Kern AD, Lindfors HA, Begun DJ (2006) Novel genes derived from noncoding DNA in Drosophila melanogaster are frequently X-linked and exhibit testisbiased expression. Proc Natl Acad Sci 103: 9935–9939 Li F, Zheng Q, Vandivier LE, Willmann MR, Chen Y, Gregory BD (2012a) Regulatory 185 impact of RNA secondary structure across the Arabidopsis transcriptome. Plant Cell 24: 4346–4359 Li W, Cui X, Meng Z, Huang X, Xie Q, Wu H, Jin H, Zhang D, Liang W (2012b) Transcriptional regulation of Arabidopsis {MIR168a} and argonaute1 homeostasis in abscisic acid and abiotic stress responses. Plant Physiol 158: 1279–1292 Li W, Gojobori T, Nei M (1981) Pseudogenes as a paradigm of neutral evolution. Nature Liu M-J, Seddon AE, Tsai ZT-Y, Major IT, Floer M, Howe GA, Shiu S-H (2015) Determinants of nucleosome positioning and their influence on plant gene expression. Genome Res 25: 1182–1195 Liu Q, Liu H, Wen J, Peterson PM (2014) Infrageneric phylogeny and temporal divergence of Sorghum (Andropogoneae, Poaceae) based on low-copy nuclear and plastid sequences. PLoS One. doi: 10.1371/journal.pone.0104933 Lloyd J, Meinke D (2012) A comprehensive dataset of genes with a loss-of-function mutant phenotype in Arabidopsis. Plant Physiol 158: 1115–1129 Lloyd JP, Seddon AE, Moghe GD, Simenc MC, Shiu S-H (2015) Characteristics of Plant Essential Genes Allow for within- and between-Species Prediction of Lethal Mutant Phenotypes. Plant Cell 27: 2133–2147 Lloyd JP, Tsai ZT, Sowers RP, Panchy NL (2017) Defining the functional significance of intergenic transcribed regions based on heterogeneous features of phenotype genes and pseudogenes. bioRxiv 1–51 Lynch M, Conery JS (2000) The Evolutionary Fate and Consequences of Duplicate Genes. Science 290: 1151–1155 Lynch M, Conery JS (2003) The Origins of Genome Complexity. Science 302: 1401–1404 Lynch M, Force a (2000) The probability of duplicate gene preservation by subfunctionalization. Genetics 154: 459–73 Maere S, De Bodt S, Raes J, Casneuf T, Van Montagu M, Kuiper M, Van de Peer Y (2005) Modeling gene and genome duplications in eukaryotes. Proc Natl Acad Sci 102: 5454–5459 Makalowska I, Lin CF, Makalowski W (2005) Overlapping genes in vertebrate genomes. Comput Biol Chem 29: 1–12 Marahrens Y, Panning B, Dausman J, Strauss W, Jaenisch R (1997) Xist-deficient mice are defective in dosage compensation but not spermatogenesis. Genes Dev 11: 156–166 Marais DL Des, Rausher MD (2008) Escape from adaptive conflict after duplication in an anthocyanin pathway gene. Nature 454: 762–765 186 Martzen MR, McCraith SM, Spinelli SL, Torres FM, Fields S, Grayhack EJ, Phizicky EM (1999) A Biochemical Genomics Approach for Identifying Genes by the Activity of Their Products. Science 286: 1153–1155 Massa AN, Wanjugi H, Deal KR, O’Brien K, You FM, Maiti R, Chan AP, Gu YQ, Luo MC, Anderson OD, et al (2011) Gene space dynamics during the evolution of aegilops tauschii, brachypodium distachyon, Oryza sativa, and sorghum bicolor genomes. Mol Biol Evol 28: 2537–2547 Meinke D, Muralla R, Sweeney C, Dickerman A (2008) Identifying essential genes in Arabidopsis thaliana. Trends Plant Sci 13: 483–91 Mercer TR, Dinger ME, Mattick JS (2009) Long non-coding RNAs: insights into functions. Nat Rev Genet 10: 155–159 Meyerowitz EM (1989) Arabidopsis, a useful weed. Cell 56: 263–269 Michael TP, Jackson S (2013) The First 50 Plant Genomes. Plant Genome 6: 0 Moghe GD, Hufnagel DE, Tang H, Xiao Y, Dworkin I, Town CD, Conner JK, Shiu S-H (2014) Consequences of Whole-Genome Triplication as Revealed by Comparative Genomic Analyses of the Wild Radish Raphanus raphanistrum and Three Other Brassicaceae Species. Plant Cell 26: 1925–1937 Moghe GD, Lehti-Shiu MD, Seddon AE, Yin S, Chen Y, Juntawong P, Brandizzi F, BaileySerres J, Shiu S-H (2013) Characteristics and significance of intergenic polyadenylated RNA transcription in Arabidopsis. Plant Physiol 161: 210–24 Monaco MK, Stein J, Naithani S, Wei S, Dharmawardhana P, Kumari S, Amarasinghe V, Youens-Clark K, Thomason J, Preece J, et al (2014) Gramene 2013: comparative plant genomics resources. Nucleic Acids Res 42: D1193-9 Mouse ENCODE Consortium, Stamatoyannopoulos JA, Snyder M, Hardison R, Ren B, Gingeras T, Gilbert DM, Groudine M, Bender M, Kaul R, et al (2012) An encyclopedia of mouse DNA elements (Mouse ENCODE). Genome Biol 13: 418 Musso G, Tasan M, Mosimann C, Beaver JE, Plovie E, Carr L a, Chua HN, Dunham J, Zuberi K, Rodriguez H, et al (2014) Novel cardiovascular gene functions revealed via systematic phenotype prediction in zebrafish. Development 141: 224–35 Mutwil M, Usadel B, Schütte M, Loraine A, Ebenhöh O, Persson S (2010) Assembly of an interactive correlation network for the Arabidopsis genome using a novel heuristic clustering algorithm. Plant Physiol 152: 29–43 Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M (2008) The Transcriptional Landscape of the Yeast Genome Defined by RNA Sequencing. Science 320: 1344–1349 187 Nawrocki EP, Burge SW, Bateman A, Daub J, Eberhardt RY, Eddy SR, Floden EW, Gardner PP, Jones TA, Tate J, et al (2015) Rfam 12.0: Updates to the RNA families database. Nucleic Acids Res 43: D130–D137 Nawrocki EP, Eddy SR (2013) Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29: 2933–2935 Neander K (1991) Functions as selected effects: The conceptual analyst’s defense. Philos Sci 58: 168–184 Ngernprasirtsiri J, Kobayashi H, Akazawa T (1988) DNA methylation as a mechanism of transcriptional regulation in nonphotosynthetic plastids in plant cells. Proc Natl Acad Sci U S A 85: 4750–4754 Ning S, Wang P, Ye J, Li X, Li R, Zhao Z, Huo X, Wang L, Li F, Li X (2013) A global map for dissecting phenotypic variants in human lincRNAs. Eur J Hum Genet 21: 1128–1133 Niu D-K, Jiang L (2013) Can ENCODE tell us how much junk DNA we carry in our genome? Biochem Biophys Res Commun 430: 1340–3 Nobuta K, Venu RC, Lu C, Beló A, Vemaraju K, Kulkarni K, Wang W, Pillay M, Green PJ, Wang G-L, et al (2007) An expression atlas of rice mRNAs and small RNAs. Nat Biotechnol 25: 473–477 O’Malley RC, Huang S-SC, Song L, Lewsey MG, Bartlett A, Nery JR, Galli M, Gallavotti A, Ecker JR (2016) Cistrome and Epicistrome Features Shape the Regulatory DNA Landscape. Cell 166: 1598 Oellrich A, Walls RL, Cannon EK, Cannon SB, Cooper L, Gardiner J, Gkoutos G V, Harper L, He M, Hoehndorf R, et al (2015) An ontology approach to comparative phenomics in plants. Plant Methods 11: 10 Ohno S, Wolf U, Atkin NB (1968) Evolution From Fish To Mammals By Gene Duplication. Hereditas 59: 169–187 Orlando DA, Lin CY, Bernard A, Wang JY, Socolar JES, Iversen ES, Hartemink AJ, Haase SB (2008) Global control of cell-cycle transcription by coupled CDK and network oscillators. Nature 453: 944–947 Panchy N, Lehti-Shiu MD, Shiu S-H (2016) Evolution of gene duplication in plants. Plant Physiol 171: pp.00523.2016 Pang KC, Frith MC, Mattick JS (2006) Rapid evolution of noncoding RNAs: lack of conservation does not mean lack of function. Trends Genet 22: 1–5 Paterson AH, Bowers JE, Chapman BA (2004) Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proc Natl Acad Sci USA 101: 9903–8 188 Pearson H (2006) Genetics: what is a gene? Nature 441: 398–401 Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al (2011) Scikit-learn: Machine Learning in Python. J Mach Learn Res 12: 2825–2830 Pei B, Sisu C, Frankish A, Howald C, Habegger L, Mu XJ, Harte R, Balasubramanian S, Tanzer A, Diekhans M, et al (2012) The GENCODE pseudogene resource. Genome Biol 13: R51 Penny GD, Kay GF, Sheardown SA, Rastan S, Brockdorff N (1996) Requirement for Xist in X chromosome inactivation. Nature 379: 131–137 Pesole G (2008) What is a gene? An updated operational definition. Gene 417: 1–4 Poliseno L, Salmena L, Zhang J, Carver B, Haveman WJ, Pandolfi PP (2010) A codingindependent function of gene and pseudogene mRNAs regulates tumour biology. Nature 465: 1033–8 Ponting CP (2017) Biological function in the twilight zone of sequence conservation. BMC Biol 15: 71 Ponting CP, Belgard TG (2010) Transcribed dark matter: meaning or myth? Hum Mol Genet 19: R162-8 Portin P (2015) The Development of Genetics in the Light of Thomas Kuhn’s Theory of Scientific Revolutions. Recent Adv DNA gene Seq 9: 14–25 Portin P, Wilkins A (2017) The Evolving Definition of the Term “Gene.” Genetics 205: 1353– 1364 Raamsdonk LM, Teusink B, Broadhurst D, Zhang N, Hayes A, Walsh MC, Berden JA, Brindle KM, Kell DB, Rowland JJ, et al (2001) A functional genomics strategy that uses metabolome data to reveal the phenotype of silent mutations. Nat Biotechnol 19: 45–50 Raff R (1996) The Shape of Life. University of Chicago Press, Chicago, IL Rands CM, Meader S, Ponting CP, Lunter G (2014) 8.2% of the Human Genome Is Constrained: Variation in Rates of Turnover across Functional Element Classes in the Human Lineage. PLoS Genet 10: e1004525 Rizzon C, Ponger L, Gaut BS (2006) Striking similarities in the genomic distribution of tandemly arrayed genes in Arabidopsis and rice. PLoS Comput Biol 2: 989–1000 Roy S, Ernst J, Kharchenko P V, Kheradpour P, Negre N, Eaton ML, Landolin JM, Bristow C a, Ma L, Lin MF, et al (2010) Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science 330: 1787–97 189 Rutter MT, Cross K V, Van Woert P a (2012) Birth, death and subfunctionalization in the Arabidopsis genome. Trends Plant Sci 17: 204–12 Sauvageau M, Goff LA, Lodato S, Bonev B, Groff AF, Gerhardinger C, Sanchez-Gomez DB, Hacisuleyman E, Li E, Spence M, et al (2013) Multiple knockout mouse models reveal {lincRNAs} are required for life and brain development. Elife 2: e01749 Savage LJ, Imre KM, Hall DA, Last RL (2013) Analysis of essential Arabidopsis nuclear genes encoding plastid-targeted proteins. PLoS One 8: e73291 Schmid M, Davison TS, Henz SR, Pape UJ, Demar M, Vingron M, Schölkopf B, Weigel D, Lohmann JU (2005) A gene expression map of Arabidopsis thaliana development. Nat Genet 37: 501–6 Schreiber SL, Bernstein BE (2002) Signaling Network Model of Chromatin. Cell 111: 771–778 Seringhaus M, Paccanaro A, Borneman A, Snyder M, Gerstein M (2006) Predicting essential genes in fungal genomes. Genome Res 16: 1126–35 Shin H, Shin H-S, Chen R, Harrison MJ (2006) Loss of At4 function impacts phosphate distribution between the roots and the shoots during phosphate starvation. Plant J 45: 712– 726 Shipp M a, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RCT, Gaasenbeek M, Angelo M, Reich M, Pinkus GS, et al (2002) Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med 8: 68–74 Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, et al (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15: 1034–50 Silva JM, Marran K, Parker JS, Silva J, Golding M, Schlabach MR, Elledge SJ, Hannon GJ, Chang K (2008) Profiling essential genes in human mammary cells by multiplex RNAi screening. Science 319: 617–20 Sing T, Sander O, Beerenwinkel N, Lengauer T (2005) ROCR: visualizing classifier performance in R. Bioinformatics 21: 3940–3941 Skendzic EM, Columbus JT, Cerros-Tlatilpa R (2007) Phylogenetics of Chloridoideae (Gramineae): A preliminary study based on nuclear ribosomal internal transcribed spacer and chloroplast trnL-F sequences. Aliso A J Syst Evol Bot 23: 530–544 De Smet R, Adams KL, Vandepoele K, Van Montagu MCE, Maere S, Van de Peer Y (2013) Convergent gene loss following gene and genome duplications creates single-copy families in flowering plants. Proc Natl Acad Sci USA 110: 2898–903 Soltis DE, Albert VA, Leebens-Mack J, Bell CD, Paterson AH, Zheng C, Sankoff D, Depamphilis CW, Wall PK, Soltis PS (2009) Polyploidy and angiosperm diversification. 190 Am J Bot 96: 336–48 Spellman PT, Sherlock G, Zhang MQ, Vishwanath R, Anders K, Eisen MB, Brown PO, Futcher B (1998) Comprehensive Identification of Cell Cycle – regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Mol Biol Cell 9: 3273–3297 Stamatakis A (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30: 1312–3 Stolc V, Samanta MP, Tongprasit W, Sethi H, Liang S, Nelson DC, Hegeman A, Nelson C, Rancour D, Bednarek S, et al (2005) Identification of transcribed sequences in Arabidopsis thaliana by using high-resolution genome tiling arrays. Proc Natl Acad Sci U S A 102: 4453–4458 Stormo G, Schneider T (1982) Use of the “Perceptron” algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids … 10: 2997–3012 Strobl C, Boulesteix A-L, Kneib T, Augustin T, Zeileis A (2008) Conditional variable importance for random forests. BMC Bioinformatics 9: 307 Struhl K (2007) Transcriptional noise and the fidelity of initiation by RNA polymerase II. Nat Struct Mol Biol 14: 103–105 Sullivan AM, Arsovski AA, Lempe J, Bubb KL, Weirauch MT, Sabo PJ, Sandstrom R, Thurman RE, Neph S, Reynolds AP, et al (2014) Mapping and dynamics of regulatory DNA and transcription factor networks in A. thaliana. Cell Rep 8: 2015–2030 Sumner LW, Mendes P, Dixon RA (2003) Plant metabolomics: Large-scale phytochemistry in the functional genomics era. Phytochemistry 62: 817–836 Svensson O, Arvestad L, Lagergren J (2006) Genome-wide survey for biologically functional pseudogenes. PLoS Comput Biol 2: e46 Swarbreck D, Wilks C, Lamesch P, Berardini TZ, Garcia-hernandez M, Foerster H, Li D, Meyer T, Muller R, Ploetz L, et al (2008) The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Res 36: 1009–1014 Swigoňová Z, Lai J, Ma J, Ramakrishna W, Llaca V, Bennetzen JL, Messing J (2004) Close split of sorghum and maize genome progenitors. Genome Res 14: 1916–1923 Takuno S, Gaut BS (2012) Body-methylated genes in Arabidopsis thaliana are functionally important and evolve slowly. Mol Biol Evol 29: 219–27 Tan JY, Sirey T, Honti F, Graham B, Piovesan A, Merkenschlager M, Webber C, Ponting CP, Marques AC (2015) Extensive microRNA-mediated crosstalk between lncRNAs and mRNAs in mouse embryonic stem cells. Genome Res 25: 655–666 Tang H, Bowers JE, Wang X, Paterson AH (2010) Angiosperm genome comparisons reveal 191 early polyploidy in the monocot lineage. Proc Natl Acad Sci USA 107: 472–477 Tarca AL, Carey VJ, Chen X, Romero R, Drăghici S (2007) Machine learning and its applications to biology. PLoS Comput Biol 3: e116 Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin E V, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4: 41 Tautz D, Domazet-Lošo T (2011) The evolutionary origin of orphan genes. Nat Rev Genet 12: 692–702 Toll-Riera M, Bosch N, Bellora N, Castelo R, Armengol L, Estivill X, Mar Albà M (2009) Origin of primate orphan genes: A comparative genomics approach. Mol Biol Evol 26: 603–612 Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28: 511–515 Tsai ZT-Y, Lloyd JP, Shiu S-H (2017) Defining Functional Genic Regions in the Human Genome through Integration of Biochemical, Evolutionary, and Genetic Evidence. Mol. Biol. Evol. Tsai ZT-Y, Shiu S-H, Tsai H-K (2015) Contribution of Sequence Motif, Chromatin State, and DNA Structure Features to Predictive Models of Transcription Factor Binding in Yeast. PLoS Comput Biol 11: e1004418 Tzafrir I, Pena-Muralla R, Dickerman A, Berg M, Rogers R, Hutchens S, Sweeney TC, McElver J, Aux G, Patton D, et al (2004) Identification of genes required for embryo development in Arabidopsis. Plant Physiol 135: 1206–20 Ulitsky I, Shkumatava A, Jan CH, Sive H, Bartel DP (2011) Conserved function of lincRNAs in vertebrate embryonic development despite rapid sequence evolution. Cell 147: 1537– 1550 Upstill-Goddard R, Eccles D, Fliege J, Collins A (2013) Machine learning approaches for the discovery of gene-gene interactions in disease data. Brief Bioinform 14: 251–60 Uygun S, Peng C, Lehti-Shiu MD, Last RL, Shiu SH (2016) Utility and Limitations of Using Gene Expression Data to Identify Functional Associations. PLoS Comput Biol 12: 1–27 VanBuren R, Bryant D, Edger PP, Tang H, Burgess D, Challabathula D, Spittle K, Hall R, Gu J, Lyons E, et al (2015) Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum. Nature 527: 508–511 Veeramachaneni V, Makalowski W, Galdzicki M, Sood R, Makalowska I, Makałowski W, 192 Makałowska I (2004) Mammalian Overlapping Genes : The Comparative Perspective Mammalian Overlapping Genes : The Comparative Perspective Identification of Overlapping Genes. Genome Res 280–286 Wang X, Tang H, Paterson AH (2011) Seventy million years of concerted evolution of a homoeologous chromosome pair, in parallel, in major Poaceae lineages. Plant Cell 23: 27– 37 Wang Y, Tang H, Debarry JD, Tan X, Li J, Wang X, Lee TH, Jin H, Marler B, Guo H, et al (2012) MCScanX: A toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res 40: 1–14 Wang Y, Tetko I V, Hall M a, Frank E, Facius A, Mayer KFX, Mewes HW (2005) Gene selection from microarray data for cancer classification--a machine learning approach. Comput Biol Chem 29: 37–46 Wang Z, Zang C, Rosenfeld JA, Schones DE, Barski A, Cuddapah S, Cui K, Roh T-Y, Peng W, Zhang MQ, et al (2008) Combinatorial patterns of histone acetylations and methylations in the human genome. Nat Genet 40: 897–903 Winzeler E, Shoemaker D, Astromoff A, Liang H, Anderson K, Andre B, Bangham R, Benito R, Boeke J, Bussey H, et al (1999) Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285: 901–6 Wu Y, Zhang W, Jiang J (2014) Genome-Wide Nucleosome Positioning Is Orchestrated by Genomic Regions Associated with DNase I Hypersensitivity in Rice. PLoS Genet. doi: 10.1371/journal.pgen.1004378 Xu S, Grullon S, Ge K, Peng W (2014) Spatial clustering for identification of ChIP-enriched regions (SICER) to map regions of histone methylation patterns in embryonic stem cells. Methods Mol Biol 1150: 97–111 Yamada K, Lim J, Dale JM, Chen H, Shinn P, Palm CJ, Southwick AM, Wu HC, Kim C, Nguyen M, et al (2003) Empirical analysis of transcriptional activity in the Arabidopsis genome. Science 302: 842–846 Yang L, Takuno S, Waters ER, Gaut BS (2011) Lowly expressed genes in Arabidopsis thaliana bear the signature of possible pseudogenization by promoter degradation. Mol Biol Evol 28: 1193–203 Yang Z (2007) PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 24: 1586–91 Ye Q-H, Qin L-X, Forgues M, He P, Kim JW, Peng AC, Simon R, Li Y, Robles AI, Chen Y, et al (2003) Predicting hepatitis B virus-positive metastatic hepatocellular carcinomas using gene expression profiling and supervised machine learning. Nat Med 9: 416–23 You Z-H, Yin Z, Han K, Huang D-S, Zhou X (2010) A semi-supervised learning approach to 193 predict synthetic genetic interactions by combining functional and topological properties of functional gene network. BMC Bioinformatics 11: 343 Yuan Y, Xu Y, Xu J, Ball RL, Liang H (2012) Predicting the lethal phenotype of the knockout mouse by integrating comprehensive genomic data. Bioinformatics 28: 1246–52 Zhang J (2003) Evolution by gene duplication: An update. Trends Ecol Evol 18: 292–298 Zhang J, Rosenberg HF, Nei M (1998) Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proc Natl Acad Sci USA 95: 3708–3713 Zhang T, Jiang M, Chen L, Niu B, Cai Y (2013) Prediction of gene phenotypes based on GO and KEGG pathway enrichment scores. Biomed Res Int 2013: 870795 Zhang X, Yazaki J, Sundaresan A, Cokus S, Chan SW-L, Chen H, Henderson IR, Shinn P, Pellegrini M, Jacobsen SE, et al (2006) Genome-wide high-resolution mapping and functional analysis of {DNA} methylation in arabidopsis. Cell 126: 1189–1201 Zhao Y, Li H, Fang S, Kang Y, Wu W, Hao Y, Li Z, Bu D, Sun N, Zhang MQ, et al (2016) NONCODE 2016: an informative and valuable data source of long non-coding RNAs. Nucleic Acids Res 44: D203--8 Zhou Q, Zhang G, Zhang Y, Xu S, Zhao R, Zhan Z, Li X, Ding Y, Yang S, Wang W (2008) On the origin of new genes in Drosophila. Genome Res 18: 1446–1455 Zou C, Lehti-Shiu MD, Thibaud-Nissen F, Prakash T, Buell CR, Shiu S-H (2009) Evolutionary and expression signatures of pseudogenes in Arabidopsis and rice. Plant Physiol 151: 3–15 Zou C, Sun K, Mackaluso JD, Seddon AE, Jin R, Thomashow MF, Shiu S-H (2011) Cisregulatory code of stress-responsive transcription in Arabidopsis thaliana. Proc Natl Acad Sci USA 108: 14992–7 194