IDENTIFYING THE ACTIVITIES OF RHIZOSPHERE MICROBIAL COMMUNITIES USING METATRANSCRIPTOMICS

By

Aaron Garoutte

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Microbiology and Molecular Genetics, Doctor of Philosophy

2016

ABSTRACT

Soil microbial communities carry out many functions, most of which are beneficial to the planet as well as to humans. Soil microbial communities control the biogeochemical cycling rates of key elements such as carbon, nitrogen, sulfur and phosphorus and can also aid in plant growth and disease defense. Microbial ecologists have studied the functional activities of microbial communities for decades, often using laboratory incubations. Metagenomics has allowed the identification of the microbes and the potentially functional genes in an environmental sample, but does not allow an assessment of activity. Direct observation of microbial community activity in the field is the desired strategy to build the foundational knowledge required to assess, predict and potentially manage soil microbial community activity. In this dissertation I combine metagenomics with metatranscriptomics to identify the functional activity of the microbial community in the soil and rhizosphere of candidate biofuel crops.

First, I assessed the efficiency of a novel rRNA removal method, duplex-specific nuclease normalization, at removing the dominant rRNA from samples of total RNA to allow greater sequence coverage of mRNA. While this method did result in about 17% non-rRNA, it did not provide a major gain in sequencing depth. I also established best practices for computational metatranscriptomic analysis, especially the importance of assembling short reads into longer contigs to improve annotation accuracy.
Second, I examined the activity of the rhizosphere microbial community of switchgrass, a candidate biofuel crop, using a combination of metagenomics, metatranscriptomics and metaproteomics. I defined a minimum core of microbial community functions from both metagenomic and metatranscriptomic sequences to focus the analysis on the most common sequences that were expressed. Beyond the expected housekeeping functions, the ecologically important expressed functions related to biogeochemical cycling were glycoside hydrolases, ligninolytic enzymes, ammonia assimilation and phosphate metabolism, and those related to plant-microbe interactions were production of auxin, trehalose and ACC deaminase. Ecologically important genes had lower abundance than housekeeping functions, indicating that ecologically important genes may represent keystone functions.

I also examined the effect of two plants, switchgrass and corn, on the presence and activity of microbial community functions at various distances from living roots using metagenomics and metatranscriptomics. The metagenomic data differentiated between the microbial communities associated with the two crops, and between communities in direct contact with the roots and those not in direct contact. The metatranscriptomic data were unable to differentiate between bulk and rhizosphere samples, indicating that other factors are stronger determinants of community transcription.

I show that direct observation of the activity of microbial communities associated with biofuel crops in field-collected samples is possible through metatranscriptomics and is aided by metagenomics and metaproteomics. These data allow the detection of microbial activities related to biogeochemical cycling and plant-microbe interactions and reveal differences in genetic potential across different soil treatments.
This thesis is dedicated to my parents, Kristin Magnuson and Michael Garoutte, and to my wife, Jennifer Garoutte.

ACKNOWLEDGEMENTS

They say it takes a whole village to raise a child; I think the same could be said about finishing a PhD. I would like to thank my friends and family (Garoutte and Gable). You have all been a fantastic support! Your interest in my work has always encouraged me. I never [...] and Claire. Watching you grow, learn and explore has been an inspiration to me as a scientist and as a father. You give me immeasurable joy and remind me there is a world outside of the lab. I will always be proud of you. I love you both dearly.

I would also like to thank my wife, Jenny. You have been my rock. Your unwavering confidence in me and your constant support have enabled me to achieve this goal. I love you more than words can convey.

Mom, thank you for all your support over the years. Through all the ups and downs of life you have always been there for me. It is only now, as I have my own children, that I am starting to understand the sacrifices you and Dad made for me. Dad, you were taken from us too soon. You will always be my role model and I will always strive to be the great [...] it is the only thing I hear you say when I remember your voice.

I would like to thank a few specific teachers and professors who have helped me [...] you taught this little dyslexic boy to read and helped me overcome my learning disability. Next, I would like to thank Dr. Richard Cass. You showed me that even though I was a slow reader and not the best writer, I could succeed in a college-level English course. Finally, I would like to thank Drs. Greg and Kathy Murray. You both shaped me as an ecologist and gave me my first research experiences. Your classes were hard, but the time I put in was never work. Thank you for investing in me.
I would like to thank Dion Antonopoulos for taking a chance and hiring a field ecologist with minimal knowledge of microbiology, and next to nothing in terms of lab skills, to work in your lab. You helped ignite my passion for microbial ecology and set me on the path of working with metatranscriptomics. You taught me everything I know about [...] sharing with me your passion for soil. The two of you set me on a path to explore soil microbial communities in my PhD work.

I would also like to thank the Tiedje lab, all current and past members. I was fortunate to have been able to interact with so many people with such diverse backgrounds. Thank you for your support, encouragement and constructive criticism. You [...] thank Adina Howe. Adina, you were like my second mentor. You taught me to code, how to use the HPCC, and how to interview and train students. You helped improve my writing and you were always up for Indian buffet! Thank you for investing in me.

Dr. Tiedje, thank you for taking me on as a student. You had no idea what you were getting into! You have helped me learn how to think about science and how to present my work both in publication and presentation form. You have always had the ability to push me outside of my scientific comfort zone, and your unique perspective has always helped bring my work up to the next level. Thank you.

Finally, I would like to thank my current committee members, Sheng Yang He, Maren Friesen and Ashley Shade. Each of you has helped me grow as a scientist. I value the feedback you have given me and appreciate your time. I would also like to thank a former committee member, Titus Brown. Thank you for supporting me as this hybrid biologist bioinformatician. You have always made me feel like I had a place in your lab. You taught me the value of open science and helped me think critically about the code I write.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1: Bridging the gap between the lab and the field
    Introduction to soil microbial communities
    Soil microbial community functions
    The rhizosphere: a focal point of microbial community activity
    Disconnect between laboratory-based studies and field microbial communities
    Evolution of Sequencing Technology
    Metagenomics
    Metatranscriptomics
    The dark side of high throughput sequencing
    Observation of microbial community functions to answer ecological questions
    Importance of direct observation of microbial community functional activity
    Questions addressed in this thesis
    REFERENCES
CHAPTER 2: Methodologies for probing the metatranscriptome of grassland soil
    Abstract
    Metatranscriptome sample collection an...
    MG-RAST databases used for annotation
    Metatranscriptome assembly
    Previous metagenomes used for comparis...
    Metagenome assembly & ann...
    Estimation of abundance of assembled cont...
    Curation of soil reference genome
    Unassembled read mapping to metagen...
    Characterization of sequences in the unassembled soil metatranscriptome
    Characterization of sequences in the assembled soil metatranscriptome
    Comparison of the metatranscriptome
    APPENDIX
    REFERENCES
CHAPTER 3: Using a multi-omics approach to identify active microbial functions in the rhizosphere of Switchgrass
    Metaproteome sample preparation and ...
    Indirect extractio...
    Metatranscriptome and metagenome ...
    Metatranscriptome and metagenome search ...
    Metatranscriptomic and metagenomic se...
    Defining the minimum functional core and its representation of the field site
    Building minimum function...
    Metaproteome characterization an...
    Comparison of the minimum functional core to rhizoplane ...
    Characterization of the minimum functional core
    Abundant functions within the minimum functional core are related to housekeeping pro...
    Functions of ecological importance within the ...
    Carbon cycling functions within the ...
    APPENDIX
    REFERENCES
CHAPTER 4: Plant root effects on soil microbial community functions as viewed through metagenomics and metatranscriptomics
    Site description and sample c...
    Sample preparation and seq...
    Metatranscriptome analysis
    APPENDIX
    REFERENCES
CHAPTER 5: Conclusions and Future directions

LIST OF TABLES

Table 2.1: Summary of sequence annotations of the unassembled and assembled soil metatranscriptome against various reference databases. Results of annotation by MG-RAST.
Table 2.2: Summary of transcript mapping. Transcripts mapped to reference assemblies (available in MG-RAST with IDs indicated) or genomes with proportion of reads identified as similar to rRNA genes and mapping uniquely to a specific reference assembly.
Table 2.3: Summary of assembled metagenomes in number of assembled contigs and total base pairs represented in the assembly. Results of short read assembly of metagenome samples.
Table 2.4: Manually curated soil-associated genomes comprising the RefSoil database.
Table 2.5: Most abundant functional annotations of the unassembled metatranscriptome against the SEED reference database. Annotations in this table appear as they do in the SEED reference database.
Table 2.6: Most abundant functional annotations of the unassembled metatranscriptome against the GenBank reference database. Annotations in this table appear as they do in the GenBank reference database.
Table 2.7: Most abundant functional annotations of the unassembled metatranscriptome against the RefSeq reference database. Annotations in this table appear as they do in the RefSeq reference database.
Table 2.8: Most abundant functional annotations of the unassembled metatranscriptome against the KEGG reference database. Annotations in this table appear as they do in the KEGG reference database.
Table 2.9: 50 RefSoil genomes with the greatest number of metatranscriptome reads mapping.
Table 2.10: RefSoil genomes with metatranscriptome reads mapping to the most unique genes.
Table 2.11: Most abundant functional annotations of the assembled metatranscriptome against the SEED reference database. Annotations in this table appear as they do in the SEED reference database.
Table 2.12: Most abundant functional annotations of the assembled metatranscriptome against the GenBank reference database. Annotations in this table appear as they do in the GenBank reference database.
Table 2.13: Most abundant functional annotations of the assembled metatranscriptome against the RefSeq reference database. Annotations in this table appear as they do in the RefSeq reference database.
Table 2.14: Most abundant functional annotations of the assembled metatranscriptome against the KEGG reference database. Annotations in this table appear as they do in the KEGG reference database.
Table 2.15: RFam abundance (based on base pair coverage) of the assembled metatranscriptome.
Table 2.16: Top 50 CAZy annotations by number of contigs. Most abundant CAZy annotations by the number of contigs without regard to the abundance or number of reads.
Table 2.17: Top 50 CAZy annotations by abundance. Abundance of CAZy annotation by the number of reads mapping to annotated contigs.
Table 3.1: Summary of rhizosphere sequencing, assembly and annotation. Metagenome sequencing samples did not have rRNA removed before sequencing, therefore -metagenome samples were not mapped to the metatranscriptome. Therefore these ...
Table 3.2: Summary of core annotations. The SEED Subsystems is a hierarchical database, which annotates gene functions, not specific genes. RefSeq database annotates specific genes from model organisms. The Carbohydrate Active Enzyme database (CAZy) specifically annotates enzymes related to synthesis, metabolism and transport of carbohydrates.
Table 3.3: Summary of minimum functional core annotations found in rhizoplane metagenomes. Minimum core represents the functions found in five out of seven samples. Percent core represents the percent of the crop specific core that is found within our established minimum functional core. Percent abundance captured represents the abundance of the crop specific samples found in the minimum functional core.
Table 3.4: Summary of rhizoplane metagenome reads, assembly and assembled read abundance.
Table 4.1: PERMANOVA analysis of metagenome and metatranscriptome samples. Permutational multivariate analysis of variance of samples. SRT is the switchgrass metatranscriptome, CBT is the corn bulk metatranscriptome, switchgrass rhizosphere metagenome, CBG is the corn bulk metagenome, C is for the corn rhizoplane metagenome, S is the switchgrass rhizoplane metagenome, SR is the switchgrass rhizosphere metagenome, CR is the switchgrass rhizosphere metagenome, SRG is the switchgrass rhizosphere metagenome samples collected with the metatranscriptome samples, CBG is the corn bulk metagenome samples collected with the metatranscriptome samples, CNRP is the combination of the corn bulk and rhizosphere metagenome samples, and SNRP is the combination of the two treatments of switchgrass rhizosphere metagenome samples.
Table 4.2: Summary of metagenome and metatranscriptome reads, assembly and assembled read abundance.
Table 4.3: Metatranscriptome protein coding annotations.

LIST OF FIGURES

Figure 2.1: Metatranscriptome data analysis workflow. Various methods for metatranscriptome data analysis are shown. (a) Direct annotation of short reads. (b) Assembly of short reads into longer contigs and subsequent annotation. (c) Short read mapping to genomes compiled in the RefSoil database.
Figure 2.2: Phylogenetic distribution of sequence annotations identified in unassembled and assembled metatranscriptome and associated soil metagenomes. Phylogeny of rRNA from the unassembled metatranscriptome compared to the phylogeny of MG-RA-hit classification of protein-coding genes for the assembled metatranscriptome and the reference metagenomes.
Figure 2.3: Distribution of assembled metatranscriptome annotations. Proportion of assembled metatranscriptome sequences associated with known rRNA, gene function (SEED), or non-coding sequences (RFAM).
Figure 2.4: Comparison of functional profiles of metatranscriptomes and metagenomes. Annotations were identified in the assembled and unassembled metatranscriptome datasets as well as the three metagenome assemblies against the MG-RAST SEED database.
Figure 2.5: Comparison of the total number of gene annotations identified in the unassembled and assembled metatranscriptomes. Results were generated using the MG-RAST Metagenome Analysis page.
Figure 2.6: Comparison of annotation alignment lengths of the assembled and unassembled datasets. Amino acid alignment lengths of SEED subsystem annotations for the assembled and unassembled datasets. The minimum alignment length is set to the MG-RAST default of 15 amino acids.
Figure 2.7: Comparison of annotation E-values of the assembled and unassembled datasets. E-value of SEED Subsystem annotations of the assembled and unassembled datasets. The minimum e-value is set to the MG-RAST default of 1e-5.
Figure 3.1: Rank Abundance Curve of Multi-omics Subsystem Annotations. The metaproteome data set is smaller than metagenome and metatranscriptome data sets as indicated by the shorter lines in the MetaP-MetaG and MetaP-MetaT samples.
Figure 3.2: Diversity of core functions by subsystem. Number of annotations in the minimum functional core as annotated by the SEED Subsystems database.
Figure 3.3: Relative abundance of multi-omics data by SEED Subsystem annotations. Relative abundance here is averaged across each of the three individual samples.
Figure 3.4: Relative abundance of biogeochemical cycling functions in the minimum functional core. Allantoin utilization, Ammonia assimilation, Denitrification and Nitrate and Nitrite ammonification are subsystems within the Nitrogen Metabolism subsystem. Alkylphosphate utilization and Phosphate metabolism are subsystems within the Phosphorus metabolism subsystem.
Figure 3.5: Relative abundance of plant growth promoting functions in the minimum functional core. Auxin biosynthesis and Trehalose Biosynthesis are the second level in the SEED Subsystem hierarchy within the Secondary metabolites subsystem and Carbohydrates subsystems respectively.
Figure 3.6: Relative abundance of CAZy annotations in the minimum functional core. CAZy enzyme classes are: Glycoside Hydrolases (GH), Glycosyl Transferases (GT), Carbohydrate Esterases (CE), Polysaccharide Lyases (PL) and Auxiliary Activities (AA).
Figure 4.1: Average relative abundance of metatranscriptome annotations. Shows annotations based on MG-RAST SEED Subsystem database. (a) Average relative abundance of corn and switchgrass metatranscriptome annotations. (b) Relative abundance of metatranscriptome annotations in the Carbohydrate subsystem level 2.
Figure 4.2: Nonmetric multidimensional scaling (NMDS) analysis of metagenome samples. Metagenome samples were log2 plus one transformed to increase normality.
Figure 4.3: Comparison of differentially abundant annotations in corn rhizoplane and non-rhizoplane samples. Number of differentially abundant annotations in corn rhizoplane and non-rhizoplane samples based on SEED Subsystems.
Figure 4.4: Comparison of differentially abundant annotations in switchgrass rhizoplane and non-rhizoplane samples. Number of differentially abundant annotations in switchgrass rhizoplane and non-rhizoplane samples based on SEED Subsystems.

CHAPTER 1: Bridging the gap between the lab and the field

Introduction to soil microbial communities

[...] grow we now know the many important benefits provided by the microbes that live within soil. Soil microbial communities are unique as compared to other environmental microbial communities. One reason is the sheer size of the microbial community within soil. Soil contains a large microbial community [1]. In addition to its large size, soil is one of the most diverse ecosystems on the planet [2]. Just one gram of soil is thought to contain as many as 52,000 distinct microbial species [3]. Soil microbial communities provide numerous beneficial ecosystem services such as bioremediation, controlling biogeochemical cycles, restoring water quality and aiding in plant growth and development. The extreme diversity and vital ecosystem services provided by soil microbial communities have drawn the interest of many microbial ecologists. Understanding and learning to enhance these natural processes can aid in our ability to detoxify the environment, mitigate global climate [...] input.

Soil microbial community functions

Irresponsible waste disposal practices have led to contamination of soils with toxic chemicals. These toxic chemicals have the potential to leach into ground water and cause illness. One method to mitigate contaminated soils involves the use of microbial communities to detoxify contaminants. This process is called bioremediation [4].
Microbial populations and communities have been shown to be capable of detoxifying a wide range of compounds, including organic pollutants such as hexachlorocyclohexane [5], polycyclic aromatic hydrocarbons [6], polychlorinated biphenyls [7] and dioxins [8], to name the more prominent. Evidence has shown that some genes responsible for bioremediation processes can be transferred via horizontal gene transfer, potentially improving the ability of the microbial community to detoxify contaminated soils [5]. Other evidence has shown that biostimulation via root exudates can enhance bioremediation processes [6]. In addition, microbial communities have been shown to reduce uranium VI, a soluble form, to the insoluble uranium IV, preventing the spread of radioactive uranium through ground water [9]. The ability to detoxify contaminated soils represents one of the many benefits provided by soil microbial communities that could be harnessed and enhanced to provide greater benefit to society.

Soil microbial communities play a key role in biogeochemical cycling processes. Biogeochemistry is an interdisciplinary science that draws on chemistry, geology and biology to study processes that regulate the chemical cycles of elements. Soil microbial communities are major contributors to the biogeochemical cycling of elements such as carbon [10], nitrogen [11, 12], sulfur [13] and phosphorus [14]. Much of the microbially mediated biogeochemical cycling takes place in the rhizosphere, thereby allowing plants to access important macro- and micronutrients needed for growth [15-17]. Plant root exudates have been shown to elicit the rhizosphere priming effect, in which root exudates stimulate the turnover of soil organic matter (SOM) and can increase microbially mediated rates of biogeochemical cycling processes [17-19].
Due to the importance of CO2 and CH4 as greenhouse gases, many biogeochemistry studies focus on carbon trapped in the soil as SOM and seek to understand how a warming global climate will influence the rates at which SOM is respired as CO2 and CH4 [20-22]. Microbially mediated biogeochemical processes play a critical role in influencing climate change and affect the ability of plants to access macro- and micronutrients needed for growth.

Finally, soil microbial communities living in the rhizoplane and rhizosphere provide many important benefits to the plants with which they interact. Rhizosphere microbial communities provide plants with macronutrients such as nitrogen and phosphorus, and micronutrients such as iron [16, 23, 24]. Nitrogen is the most limiting nutrient for plant growth [25]. Microbial communities can make nitrogen available to plants. This occurs via direct fixation of atmospheric nitrogen by both free-living (associative) nitrogen fixers [26] and by rhizobia in a symbiotic relationship with leguminous plants [27]. Microbes can also provide nitrogen to plants by transforming nitrogen-containing compounds into bioavailable compounds, i.e. mineralization [28]. Besides providing nitrogen to plants, microbial communities can promote plant growth via production of the plant growth hormone auxin and the enzyme ACC deaminase [29, 30]. Auxin can aid in root development [31] and suppress plant defense mechanisms, possibly allowing non-pathogenic microbes to colonize plant root systems [32]. ACC deaminase reduces the concentration of ethylene in roots, which at high concentrations can reduce plant growth [29, 33]. While much of the current research focuses on single microbial species interactions with plants, recent work has shown that microbial populations can cooperatively enhance plant growth.
In one study, a secondary metabolite produced by Pseudomonas fluorescens F113 acted as a promoter for many plant genes known to have phytostimulatory effects in Azospirillum brasilense Sp245-Rif [34]. Finally, microbial communities can protect plants from pathogenic microbes [35]. Many bacteria can produce antibiotics against pathogens [28, 36, 37]. Microbes can also protect plants from pathogens by activating the plant's own immune functions [38, 39]. Soil microbial communities provide many benefits to plants, including providing a source of growth-limiting macro- and micronutrients, promoting plant growth through exogenous addition of plant hormones, and protecting plants from pathogens.

The rhizosphere: a focal point of microbial community activity

The rhizosphere is a hotspot of microbial activity, i.e. a small volume of soil with higher process rates and greater interaction among community members compared to average soil [40]. Many studies have shown bioremediation processes to be enhanced in the rhizosphere [41]. These studies include depletion of contaminants such as PCBs [42, 43], petrochemical residues [44], bensulfuron-methyl [45], arsenate, chromate [46], and heavy metals (Zn, Cd and Cu) by bioaccumulation such as in white mustard, i.e. phytoextraction [47]. Many plant-microbe interactions are also tied to biogeochemical cycling processes related to the transformation of carbon and macro- and micronutrients into bioavailable forms for plant utilization [17-19, 48, 49]. [...] compounds, i.e. root exudates, into the rhizosphere provide the fuel for these enhanced microbial processes [50]. As previously stated, rhizosphere microbial communities provide many other benefits to plants, including plant growth promotion, activating plant defense mechanisms, increasing stress tolerance, and protecting plants from pathogens.
Disconnect between laboratory-based studies and field microbial communities

Many of the previously mentioned studies were carried out under laboratory, growth chamber or greenhouse conditions using single populations of microbes and/or a single plant species. However, naturally occurring soil microbes live within a diverse community of microbes and plants. It is unknown how interactions between community members will affect the functional activity of microbes that carry out anthropogenically important functions. Closing this gap in knowledge is central to our ability to predict and manipulate these microbial functions to meet modern-day challenges such as global climate change and an increasing human population on earth. To bridge the gap between laboratory studies and microbial community activity in the environment, microbial ecologists must rely on direct observation of functions actively carried out by microbial community members in the environment. Advances in high-throughput sequencing approaches such as metatranscriptomics allow direct observation of microbial community gene expression, a proxy of functional activity, thereby enabling microbial ecologists to study the activity of microbial communities in the environment.

Evolution of Sequencing Technology

Recent advances in high throughput sequencing, also called next-generation sequencing, have led to an extreme decrease in sequencing cost per nucleotide since the release of these new technologies, beginning in 2005 with 454 Life Sciences [51] and soon followed by Illumina. These advances have opened new avenues of research that were previously infeasible by allowing deeper and more cost-effective sampling. Specifically, advances in high throughput sequencing have allowed microbial ecologists to sample fragments of whole genomes from environmental microbial communities without the bias of culture-based methods [52].
Previously, Sanger sequencing required labor-intensive cloning of environmental DNA into a host cell prior to sequencing of that clone. Compared to current high throughput sequencing methodologies, the yield of Sanger sequencing is very low [53], although its accuracy is much higher.

Metagenomics

Metagenomics is an approach that allows DNA to be sequenced directly from the sampled community. Current technologies enable metagenomic samples to be sequenced, resulting in billions of reads without the need for labor-intensive cloning. [...] approach in which a particular gene of interest is PCR amplified from an environmental DNA sample and then sequenced. This is often done for phylogenetic [...] environmental DNA, fragmented into short pieces, is sequenced [54]. Frequently, shotgun metagenomics is referred to as simply metagenomics. After sequencing, the DNA sequences are typically assigned a functional or taxonomic annotation based on similarity to previously sequenced organisms or genes [53]. Sequencing technologies will likely continue to improve, providing deeper sampling and longer reads to the point at which whole genomes of environmental organisms can be fully assembled, which cannot currently be done. Eventually, metagenomic data will consist of full genomes of organisms present in an environmental sample rather than the current collection of annotations of gene or genome fragments.

It is important to note that additional care must be taken in the interpretation of metagenomic data. DNA represents the genetic potential of an organism or community, not necessarily its activity. Metagenomic sequencing can include sequences from dead or dormant microbes, which may skew data interpretation. Estimates of dormancy in soil microbial communities are as high as 80% [55]. [...] controlled to only occur under specific circumstances, as is the case with many microbes that engage in symbiotic interactions and quorum sensing behaviors.
Therefore, metagenomic functional annotations must be recognized as the functional potential of a microbial community and not the functional activity of a microbial community.

Metatranscriptomics

Advanced high throughput sequencing technologies also enable sequencing of environmental RNA, referred to as metatranscriptomics. Metatranscriptomics presents many unique methodological challenges compared to metagenomics, which has led to low adoption of metatranscriptomics by many microbial community ecologists, especially in complex habitats like soil and sediments. Like metagenomics, metatranscriptomics can be targeted to a specific gene [56], or an untargeted shotgun approach can be used [57]. Often an intermediate approach is desired, as the goal of many metatranscriptomic studies is to sequence mRNA to sample actively transcribed genes. Messenger RNAs typically comprise only 4% of the total RNA, while rRNA is thought to comprise over 90% of total RNA, requiring its removal to increase the sequencing yield of mRNA [58]. Removal of rRNA is challenging. Various methods have been developed to remove rRNA but have met with mixed success depending on the complexity of the environmental system [59-62]. Another challenge unique to metatranscriptomics is the unstable nature of the mRNA molecule and its short half-life, estimated to be between 2.4 and 5 minutes [63]. These factors make sample preservation and extraction of RNA more difficult than that of DNA for metagenomics. Before sequencing, a reverse transcription reaction must be carried out to convert the RNA to cDNA. All of these factors can result in lower sequence yield and lower sequence quality compared to (DNA-based) metagenomics. Despite the challenges inherent in metatranscriptomics, this method has the potential to deliver valuable insights and build foundational knowledge of microbial community functional activity.

As with all methods, there remain limits to metatranscriptomics.
For example, due to the short turnover time of mRNA, metatranscriptomic data represent microbial community activity at the time of sampling only. To overcome this limitation, studies can adopt a time series approach to capture transcription information over a relevant time period. Additionally, after transcription, mRNA molecules may still not be translated into proteins [64], and even if the protein is made, its actual activity depends on the availability of substrate and any other conditions required for function. Consequently, metatranscriptomics represents the likely functional activity of a microbial community, not necessarily the actual functional activity of a microbial community at the sampling time.

The dark side of high throughput sequencing

Annotation of metagenomic and metatranscriptomic reads relies on gene annotations from previously sequenced and typically cultured microbes [65]. It is estimated that only 0.1 to 1% of microbial species have been isolated in culture [66]. Therefore, many of the sequences from metagenomic and metatranscriptomic studies cannot be annotated. Processes such as short read assembly can substantially improve the quality and percentage of sequences annotated [67]. Further improvement in sequence annotation relies on increasing the isolation of the currently uncultivated microbes and the characterization of their gene functions and [...] Dark Matter Project [68] uses single cell genome sequencing to fill in the phylogenetic gaps in the current database of cultured strains. This is a step in the right direction; however, much work remains. Identifying the functions of unknown or hypothetical proteins will lead to the most dramatic improvements in annotation of metagenomic and metatranscriptomic sequences.

Observation of microbial community functions to answer ecological questions

Direct observation of microbial community functions can enable microbial ecologists to answer questions of ecological importance.
For example, one study examined the microbial community structure associated with Ulva australis, a green macroalga. The study showed that only 15% of community members were shared across samples, while 70% of functional gene annotations were shared [69]. These results led the authors to suggest that the functional genes possessed by individuals, rather than their species identification, better explain community structure. In another study, soil and sediments were sampled in space and time along an environmental gradient near a stream corridor. Microbial community membership was measured using denaturing gradient gel electrophoresis. Substrate analogs with fluorescent molecules were used to measure the activity of ten different enzymes. In this study, bacterial community structure was not associated with spatial or temporal components, while enzyme activity was associated with temporal dynamics [70]. The disconnect between bacterial community structure and enzymatic activity provides further evidence that not all microbial functional activities are limited to a particular microbial species. Both of these examples illustrate how the genomic diversity found within a species and gene exchange between species result in a decoupling of microbial species and their functions. This decoupling necessitates the observation of microbial community functions, through more direct methods such as metagenomics and metatranscriptomics, to answer important ecological questions.

Importance of direct observation of microbial community functional activity

Microbial species can differ not only in genomic content but also in the regulation of functional genes. The mere presence of a gene does not imply that the gene is expressed. A simple species survey or a metagenomic study does not provide sufficient information to determine microbial community functional activity.
Direct observation of microbial community functional activity through metatranscriptomics is more directly related to function and hence a better method to answer this question and bridge the knowledge gap between the laboratory and the environment. Although methodologically more challenging than metagenomics, metatranscriptomics has advanced the foundational knowledge needed to understand interactions within microbial communities that shape community composition, richness and the ecosystem. For example, a study of methanogenic microbial communities illustrates that amino acid auxotrophy promoted syntrophic relationships within the community and regulated carbon and energy flow within the community [71]. Without direct observation of microbial community activity, the basis of the syntrophic interactions could only have been postulated instead of directly observed. In another study, the biogeochemistry of hydrothermal plumes was found to be regulated by microbes inhabiting seawater rather than microbes inhabiting the seafloor [72]. Direct observation of microbial community activity allowed the authors to identify which groups of microbes regulate important biogeochemical processes in water surrounding hydrothermal vents. These examples illustrate the power of direct observation of microbial activity through metatranscriptomics to illuminate what microbes are doing and highlight their ecological importance.

Questions addressed in this thesis

Soil microbial communities provide many environmentally and anthropogenically beneficial services. Most of these services, i.e., functions, are carried out or enhanced in the rhizosphere. The ability to predict and control this activity would provide great societal benefit. To see these benefits realized, microbial ecologists must first bridge the gap between laboratory-based studies and direct observation of microbial community activity in the environment.
This requires direct observation of microbial community functions. While metagenomics enables microbial ecologists to examine the functional potential of microbial communities, metatranscriptomics offers direct observation of genes actively transcribed by microbial community members. Hence, I focused my studies on answering the question of which genes are expressed in the rhizosphere of bioenergy crops. The three research chapters of my dissertation focus on i) examination of a novel method of rRNA removal and establishment of best practices for metatranscriptome data analysis, ii) using a multiomics approach to integrate next generation sequencing of DNA and mRNA with advanced proteomic information, and iii) examining gene presence and transcription in three regions associated with the root: the rhizoplane, the rhizosphere and the bulk soil. These studies are conducted in the context of a biofuel cropping system study in which corn (Zea mays), Miscanthus (Miscanthus x giganteus) and switchgrass (Panicum virgatum) are compared. Biofuels, particularly cellulosic biofuels such as those derived from switchgrass and Miscanthus, represent an important energy source that can reduce our dependence on greenhouse gas emitting fossil fuels. Given the previously discussed benefits microbial communities provide to plants, developing a deeper understanding of how to manage microbial functions carried out in the rhizosphere can aid low cost and sustainable cultivation of cellulosic biofuel crops on marginal lands, i.e., lands that are currently not economic for cultivating food crops.

In Chapter two I examine a novel method of rRNA removal, called duplex-specific nuclease (DSN) normalization. This method provides several advantages over current probe-based rRNA removal methods, as it requires less total RNA input. I also examine data analysis approaches to establish best practices. There are no publications that provide a thorough examination of data analysis practices for metatranscriptomic data.
In Chapter three I use a multiomics approach to identify genes present, transcribed and translated in the rhizosphere of switchgrass. These data traverse all the steps within the central dogma of molecular biology and thus present an integrated profile of microbial community activity in the switchgrass rhizosphere. This chapter focuses on microbially mediated biogeochemical cycles and functions related to plant-microbe interactions.

Finally, Chapter four examines differential gene presence in the rhizoplane and rhizosphere of corn and switchgrass as well as bulk soil. This chapter also compares microbial community transcription in the rhizosphere of switchgrass with that of bulk soil. The goals were to: i) identify genes enriched at various distances from living roots, ii) identify microbial genes associated with different plants (corn and switchgrass), and iii) determine if any observed differences in gene abundance correlate with microbial community transcription.

These studies are aimed at directly observing microbial community activity under field conditions with the ultimate goal of learning to enhance the beneficial services provided by the microbial community to the plant. This would aid the sustainable production of cellulosic biofuel crops on marginal lands.

REFERENCES

1. Whitman, W.B., et al., Genomic Encyclopedia of Bacterial and Archaeal Type Strains, Phase III: the genomes of soil and plant-associated and newly described type strains. Standards in Genomic Sciences, 2015. 10(1): p. 8-13.
2. York, L.M., et al., The holistic rhizosphere: integrating zones, processes, and semantics in the soil influenced by roots. Journal of Experimental Botany, 2016. 67(12): p. 3629-43.
3. Roesch, L.F., et al., Pyrosequencing enumerates and contrasts soil microbial diversity. The ISME Journal, 2007. 1(4): p. 283-290.
4. Juwarkar, A.A., S.K. Singh, and A. Mudhoo, A comprehensive overview of elements in bioremediation.
Reviews in Environmental Science and Biotechnology, 2010. 9(3): p. 215-288.
5. Sangwan, N., et al., Comparative Metagenomic Analysis of Soil Microbial Communities across Three Hexachlorocyclohexane Contamination Levels. PLoS One, 2012. 7(9): p. e46219.
6. Techer, D., et al., Contribution of Miscanthus x giganteus root exudates to the biostimulation of PAH degradation: An in vitro study. Science of the Total Environment, 2011. 409(20): p. 4489-4495.
7. Chen, F., et al., Enhanced biodegradation of polychlorinated biphenyls by defined bacteria-yeast consortium. Annals of Microbiology, 2015. 65(4): p. 1847-1854.
8. Benli Chai, T.V.T., Shoko Iwai, Cun Liu, Jordan A. Fish, Cheng Gu, Timothy A. Johnson, Gerben Zylstra, Brian J. Teppen, Hui Li, Syed A. Hashsham, Stephen A. Boyd, James R. Cole, James M. Tiedje, Sphingomonas wittichii Strain RW1 Genome-Wide Gene Expression Shifts in Response to Dioxins and Clay. PLoS One, 2016. 11(6).
9. O'Loughlin, E.J., et al., Reduction of Uranium(VI) by mixed iron(II)/iron(III) hydroxide (green rust): Formation of UO2 nanoparticles. Environmental Science & Technology, 2003. 37(4): p. 721-727.
10. Zhao, M., et al., Microbial mediation of biogeochemical cycles revealed by simulation of global changes with soil transplant and cropping. The ISME Journal, 2014. 8(10): p. 2045-55.
11. Jia, Z. and R. Conrad, Bacteria rather than Archaea dominate microbial ammonia oxidation in an agricultural soil. Environmental Microbiology, 2009. 11(7): p. 1658-71.
12. Reed, S.C., C.C. Cleveland, and A.R. Townsend, Functional Ecology of Free-Living Nitrogen Fixation: A Contemporary Perspective. Annual Review of Ecology, Evolution, and Systematics, 2011. 42(1): p. 489-512.
13. Schmalenberger, A., et al., The role of Variovorax and other Comamonadaceae in sulfur transformations by microbial wheat rhizosphere communities exposed to different sulfur fertilization regimes. Environmental Microbiology, 2008. 10(6): p. 1486-1500.
14. Marschner, P., D.
Crowley, and Z. Rengel, Rhizosphere interactions between microorganisms and plants govern iron and phosphorus acquisition along the root axis model and research methods. Soil Biology and Biochemistry, 2011. 43(5): p. 883-894.
15. Jackson, L.E., M. Burger, and T.R. Cavagnaro, Roots, nitrogen transformations, and ecosystem services. Annual Review of Plant Biology, 2008. 59: p. 341-63.
16. Richardson, A.E., et al., Acquisition of phosphorus and nitrogen in the rhizosphere and plant growth promotion by microorganisms. Plant and Soil, 2009. 321(1-2): p. 305-339.
17. Zhu, B., et al., Rhizosphere priming effects on soil carbon and nitrogen mineralization. Soil Biology and Biochemistry, 2014. 76: p. 183-192.
18. Bird, J.A., D.J. Herman, and M.K. Firestone, Rhizosphere priming of soil organic matter by bacterial groups in a grassland soil. Soil Biology and Biochemistry, 2011. 43(4): p. 718-725.
19. Cheng, W., Rhizosphere priming effect: Its functional relationships with microbial turnover, evapotranspiration, and CN budgets. Soil Biology and Biochemistry, 2009. 41(9): p. 1795-1801.
20. Bracho, R., et al., Temperature sensitivity of organic matter decomposition of permafrost-region soils during laboratory incubations. Soil Biology and Biochemistry, 2016(February): p. 1-14.
21. Conant, R.T., J.M. Steinweg, and M.L. Haddix, Experimental warming shows that decomposition temperature sensitivity increases with soil organic matter recalcitrance. Ecology, 2008. 89(9): p. 2384-2391.
22. Fang, C., et al., Similar response of labile and resistant soil organic matter pools to changes in temperature. Nature, 2005: p. 57-59.
23. Lemanceau, P., et al., Iron dynamics in the rhizosphere as a case study for analyzing interactions between soils, plants and microbes. Plant and Soil, 2009. 321(1-2): p. 513-535.
24. Van Der Heijden, M.G.A., R.D. Bardgett, and N.M. Van Straalen, The unseen majority: Soil microbes as drivers of plant diversity and productivity in terrestrial ecosystems.
Ecology Letters, 2008. 11(3): p. 296-310.
25. Clode, P.L., et al., In situ mapping of nutrient uptake in the rhizosphere using nanoscale secondary ion mass spectrometry. Plant Physiology, 2009. 151(4): p. 1751-7.
26. Saikia, S.P. and V. Jain, Biological nitrogen fixation with non-legumes: an achievable target or a dogma. Current Science, 2007. 92(3): p. 317-322.
27. Friesen, M.L., Widespread fitness alignment in the legume-rhizobium symbiosis. The New Phytologist, 2012. 194(4): p. 1096-111.
28. Morgan, J.A.W., G.D. Bending, and P.J. White, Biological costs and benefits to plant-microbe interactions in the rhizosphere. Journal of Experimental Botany, 2005. 56(417): p. 1729-39.
29. Bhattacharyya, P.N. and D.K. Jha, Plant growth-promoting rhizobacteria (PGPR): emergence in agriculture. World Journal of Microbiology and Biotechnology, 2011: p. 1327-1350.
30. Spaepen, S. and J. Vanderleyden, Auxin and plant-microbe interactions. Cold Spring Harbor Perspectives in Biology, 2011. 3(4).
31. Patten, C.L. and B.R. Glick, Role of Pseudomonas putida indoleacetic acid in development of the host plant root system. Applied and Environmental Microbiology, 2002. 68(8).
32. Spaepen, S., J. Vanderleyden, and R. Remans, Indole-3-acetic acid in microbial and microorganism-plant signaling. FEMS Microbiology Reviews, 2007. 31(4): p. 425-48.
33. Bhattacharjee, R.B., et al., Indole acetic acid and ACC deaminase-producing Rhizobium leguminosarum bv. trifolii SN10 promote rice growth, and in the process undergo colonization and chemotaxis. Biology and Fertility of Soils, 2011. 48(2): p. 173-182.
34. Combes-Meynet, E., et al., The Pseudomonas secondary metabolite 2,4-diacetylphloroglucinol is a signal inducing rhizoplane expression of Azospirillum genes involved in plant-growth promotion. Molecular Plant-Microbe Interactions: MPMI, 2011. 24(2): p. 271-84.
35. Mendes, R., P. Garbeva, and J.M.
Raaijmakers, The rhizosphere microbiome: significance of plant beneficial, plant pathogenic, and human pathogenic microorganisms. FEMS Microbiology Reviews, 2013. 37(5): p. 634-63.
36. Jousset, A., et al., Plants respond to pathogen infection by enhancing the antifungal gene expression of root-associated bacteria. Molecular Plant-Microbe Interactions: MPMI, 2011. 24(3): p. 352-8.
37. Mazurier, S., et al., Phenazine antibiotics produced by fluorescent pseudomonads contribute to natural soil suppressiveness to Fusarium wilt. The ISME Journal, 2009. 3(8): p. 977-91.
38. Rudrappa, T., et al., Root-secreted malic acid recruits beneficial soil bacteria. Plant Physiology, 2008. 148(3): p. 1547-56.
39. Saravanakumar, D., Harish, S., Loganathan, M., Vivekananthan, R., Rajendran, L., Raguchander, T., & Samiyappan, R., Rhizobacterial bioformulation for the effective management of Macrophomina root rot in mungbean. Archives of Phytopathology and Plant Protection, 2007. 40(5): p. 323-337.
40. Kuzyakov, Y. and E. Blagodatskaya, Microbial hotspots and hot moments in soil: Concept & review. Soil Biology and Biochemistry, 2015. 83: p. 184-199.
41. Shukla, K.P. and S. Sharma, Nature and role of root exudates: Efficacy in bioremediation. 10(48): p. 9717-9724.
42. Narasimhan, K., et al., Enhancement of Plant-Microbe Interactions Using a Rhizosphere Metabolomics-Driven Polychlorinated Biphenyls. 2003. 132(May): p. 146-153.
43. Xu, L., et al., Enhanced removal of polychlorinated biphenyls from alfalfa rhizosphere soil in a field study: The impact of a rhizobial inoculum. Science of the Total Environment, 2010. 408(5): p. 1007-1013.
44. Yergeau, E., et al., Microbial expression profiles in the rhizosphere of willows depend on soil contamination. The ISME Journal, 2013: p. 1-15.
45. Yang, C., Y. Wang, and J. Li, Plant Species Mediate Rhizosphere Microbial Activity and Biodegradation Dynamics in a Riparian Soil Treated with Bensulfuron-methyl. Clean - Soil, Air, Water, 2011. 39(4): p.
338-344.
46. Bolan, N., A. Kunhikrishnan, and J. Gibbs, Rhizoreduction of arsenate and chromate in Australian native grass, shrub and tree vegetation. Plant and Soil, 2013. 367(1-2): p. 615-625.
47. rhizospheric bacterial strain Brevibacterium casei MH8a colonizes plant tissues and enhances Cd, Zn, Cu phytoextraction by white mustard. Frontiers in Plant Science, 2016. 7(February): p. 101.
48. Hinsinger, P., C. Plassard, and B. Jaillard, Rhizosphere: A new frontier for soil biogeochemistry. Journal of Geochemical Exploration, 2006. 88(1-3): p. 210-213.
49. Murphy, C.J., et al., Rhizosphere priming can promote mobilisation of N-rich compounds from soil organic matter. Soil Biology and Biochemistry, 2015. 81: p. 236-243.
50. Chaparro, J.M., et al., Manipulating the soil microbiome to increase soil health and plant fertility. Biology and Fertility of Soils, 2012. 48(5): p. 489-499.
51. Lemmon, E.M. and A.R. Lemmon, High-Throughput Genomic Data in Systematics and Phylogenetics. Annual Review of Ecology, Evolution, and Systematics, 2013. 44(1): p. 99-121.
52. Tringe, S.G. and E.M. Rubin, Metagenomics: DNA sequencing of environmental samples. Nature Reviews Genetics, 2005. 6(11): p. 805-14.
53. Teeling, H. and F.O. Glöckner, Current opportunities and challenges in microbial metagenome analysis: a bioinformatic perspective. Briefings in Bioinformatics, 2012.
54. Eisen, J.A., Environmental shotgun sequencing: its potential and challenges for studying the hidden world of microbes. PLoS Biology, 2007. 5(3): p. e82.
55. Lennon, J.T. and S.E. Jones, Microbial seed banks: the ecological and evolutionary implications of dormancy. Nature Reviews Microbiology, 2011. 9(2): p. 119-130.
56. Baldrian, P., et al., Active and total microbial communities in forest soil are largely different and highly stratified during decomposition. The ISME Journal, 2012. 6(2): p. 248-58.
57.
Urich, T., et al., Simultaneous assessment of soil microbial community structure and function through analysis of the meta-transcriptome. PLoS One, 2008. 3(6): p. e2527.
58. Neidhardt, F.C. and H.E. Umbarger, Chemical composition of Escherichia coli. 1996. 13-16.
59. Gilbert, J.A., et al., Detection of large numbers of novel sequences in the metatranscriptomes of complex marine microbial communities. PLoS One, 2008. 3(8): p. e3042.
60. He, S., et al., Validation of two ribosomal RNA removal methods for microbial metatranscriptomics. 7(10).
61. Stewart, F.J., E.A. Ottesen, and E.F. DeLong, Development and quantitative analyses of a universal rRNA-subtraction protocol for microbial metatranscriptomics. The ISME Journal, 2010. 4(7): p. 896-907.
62. Yi, H., et al., Duplex-specific nuclease efficiently removes rRNA for prokaryotic RNA-seq. Nucleic Acids Research, 2011. 39(20): p. e140.
63. Moran, M.A., et al., Sizing up metatranscriptomics. The ISME Journal, 2013. 7(2): p. 237-43.
64. Zhang, Y.P., et al., Regulation of nitrogen fixation in Azospirillum brasilense. FEMS Microbiology Letters, 1997. 152(2): p. 195-204.
65. Yang Y, J.X.-T., Zhang T, Evaluation of a Hybrid Approach Using UBLAST and BLASTX for Metagenomic Sequences Annotation of Specific Functional Genes. PLoS One, 2014. 9(10): p. e110947.
66. Head, I.M., J.R. Saunders, and R.W. Pickup, Microbial evolution, diversity, and ecology: A decade of ribosomal RNA analysis of uncultivated microorganisms. Microbial Ecology, 1998. 35(1): p. 1-21.
67. Wommack, K.E., J. Bhavsar, and J. Ravel, Metagenomics: Read length matters. Applied and Environmental Microbiology, 2008. 74(5): p. 1453-1463.
68. Rinke, C., et al., Insights into the phylogeny and coding potential of microbial dark matter. Nature, 2013. 499(7459): p. 431-437.
69. Burke, C. and P. Steinberg, Bacterial community assembly based on functional genes rather than species.
70.
Frossard, A., et al., Disconnect of microbial structure and function: enzyme activities and bacterial communities in nascent stream corridors. The ISME Journal, 2012. 6(3): p. 680-91.
71. Embree, M., et al., Networks of energetic and metabolic interactions define dynamics in microbial communities. Proceedings of the National Academy of Sciences, 2015. 112(50): p. 201506034-201506034.
72. Lesniewski, R.A., et al., The metatranscriptome of a deep-sea hydrothermal plume is dominated by water column methanotrophs and lithotrophs. The ISME Journal, 2012. 6(12): p. 2257-2268.

Chapter 2: Methodologies for probing the metatranscriptome of grassland soil

This chapter has been published in: Garoutte, A., Cardenas, E., Tiedje, J., & Howe, A. (2016). Methodologies for probing the metatranscriptome of grassland soil. Journal of Microbiological Methods, 131, 122-129. http://doi.org/10.1016/j.mimet.2016.10.018

Abstract

Metatranscriptomics provides an opportunity to identify active microbes and expressed genes in complex soil communities in response to particular conditions. Currently, there are only a limited number of soil metatranscriptome studies to provide guidance for using this approach in this challenging matrix. Hence, we evaluated the technical challenges of applying soil metatranscriptomics to a highly diverse, low activity natural system. We used a non-targeted rRNA removal approach, duplex-specific nuclease (DSN) normalization, to generate a metatranscriptomic library from field collected soil supporting a perennial grass, Miscanthus x giganteus (a biofuel crop), and evaluated its ability to provide insight into the active community members and their expressed protein-coding genes. We also evaluated various bioinformatics approaches for analyzing our soil metatranscriptome, including annotation of unassembled transcripts, de novo assembly, and aligning reads to known genomes.
Further, we evaluated various databases for their ability to provide annotations for our metatranscriptome. Overall, our results emphasize that low activity, highly genetically diverse and relatively stable microbiomes, like soil, require very deep sequencing to sample the transcriptome beyond the common core functions. We identified several key areas from which metatranscriptomic analyses will benefit, including increased rRNA removal, assembly of short read transcripts, and more relevant reference databases, while providing a priority set of expressed genes for functional assessment.

Introduction

Metatranscriptomics holds promise for providing insight into which organisms are active and which gene subsets are expressed within microbial communities, but its use is particularly challenging in complex systems, especially soil. Metatranscriptomics has been used most often in marine ecology studies, where, for example, it has helped identify key nutrient transformations in hydrothermal plumes [1]; patterns of niche diversification in coastal waters [2]; seasonal and diurnal patterns of gene expression in the English Channel [3]; and patterns of diazotroph diversity along salinity and nutrient gradients [4]. In contrast, the application of metatranscriptomics in terrestrial environments has been limited, mostly to targeting specific genes (e.g., phylogenetic markers or functional genes) [5,6], experimentally enriched soil communities [7,8] or greenhouse pot-based experiments [9]. In forest soils, fungal-targeted metatranscriptomics has been used to identify novel hydrolase enzymes [5], and a targeted approach (16S rRNA, ITS, and cellobiohydrolase) has shown that low-abundance species play an important role in carbon decomposition [7]. Metatranscriptomics has also been used to contrast expression in pristine soils and those contaminated with polycyclic aromatic hydrocarbons [10] and to examine domain-level changes in the rhizosphere of potted plants [11].
While these examples demonstrate the feasibility and usefulness of soil transcriptomics, the application of non-targeted metatranscriptomics to field collected agricultural soils, e.g., croplands and pastures, has yet to be demonstrated; these soils comprise over 40% of global land use [12] and are essential to food production and ecosystem services.

Soil metatranscriptomics presents several obstacles. First, soil microbial communities are incredibly diverse; one gram of soil is estimated to contain nearly one million distinct genomes [13], orders of magnitude higher than aquatic and host-associated habitats [14]. Second, reference genomes from soil are limited, making sequence annotation difficult. Third, RNA, especially mRNA, is in low abundance because of the primarily dormant or starved states of the community members, with few perturbations to induce expression. For example, turnover rates of soil microbes have been calculated to be 30- to 300-fold slower than those of microbes in the ocean [15]. Overall, mRNA comprises only about 4% of total RNA [16], highlighting the challenge of isolating or enriching the mRNA prior to sequencing to achieve greater sequence depth. A common approach for mRNA enrichment is to remove rRNA through subtractive hybridization [17]. This approach presents challenges of its own in that it is hindered by the difficulty of obtaining intact rRNA through soil RNA extraction methods. Finally, soil metatranscriptomics is challenged by the high temporal and spatial diversity in soil populations due to habitat complexity at small scales (<1 mm) and various stochastic perturbations (e.g., rainfall, plant litter introduction, micro- and mesofauna movements). Consequently, capturing appropriate snapshots of targeted activity in soil requires sampling high biodiversity within complex and often unknown and unpredictable dynamics.
Furthermore, the lack of soil metatranscriptome reference datasets makes it difficult to evaluate appropriate sampling and experimental strategies and to gain insight into common system responses.

In this study, we evaluated mRNA enrichment as well as various bioinformatic approaches to analyze soil metatranscriptomes, and present recommendations for such studies. Our metatranscriptome originated from bulk soil associated with a Miscanthus x giganteus crop, an important bioenergy crop due to its perenniality and its high biomass yield compared to other crops [18]. Soils were sampled at the period of most active plant growth (early August), at midday when photosynthesis was maximum, and after a rainfall period so that soil water was not limiting, in order to maximize soil microbial community response to potentially new substrates. Our objective was to determine our ability to access and identify actively transcribed genes in this soil metatranscriptome. To address the challenge of low concentration of mRNA, we enriched for mRNA by removing rRNA using duplex-specific nuclease (DSN) normalization. DSN is a non-targeted approach which has several advantages over subtractive hybridization, including less stringent RNA quality requirements, a lower required amount of RNA (100 ng compared to 1 µg), and an increased rRNA removal efficiency [17]. Subtractive hybridization targets the removal of specific rRNAs based on genomic primer targets and therefore removes rRNAs with bias. In contrast, DSN does not target specific rRNAs, and consequently the remaining rRNAs are more likely to reflect their original distributions.
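The motivation for rRNA removal can be made concrete with a small back-of-the-envelope calculation. The sketch below is illustrative only and is not part of the published analysis: the function names are hypothetical, and it assumes that the library contains only rRNA and non-rRNA and that depletion removes a fixed fraction of the rRNA while leaving everything else untouched.

```python
def non_rrna_fraction(rrna_fraction: float, removal_efficiency: float) -> float:
    """Share of the library that is non-rRNA after depletion.

    rrna_fraction: share of total RNA that is rRNA before depletion (0..1).
    removal_efficiency: share of rRNA removed by the treatment (0..1).
    Assumption: non-rRNA molecules are not affected by the treatment.
    """
    if not (0.0 <= rrna_fraction <= 1.0 and 0.0 <= removal_efficiency <= 1.0):
        raise ValueError("both arguments must lie in [0, 1]")
    remaining_rrna = rrna_fraction * (1.0 - removal_efficiency)
    non_rrna = 1.0 - rrna_fraction
    total = remaining_rrna + non_rrna
    if total == 0.0:
        raise ValueError("nothing is left in the library")
    return non_rrna / total


def fold_enrichment(rrna_fraction: float, removal_efficiency: float) -> float:
    """Fold increase of the non-rRNA share relative to the untreated sample."""
    return non_rrna_fraction(rrna_fraction, removal_efficiency) / (1.0 - rrna_fraction)


if __name__ == "__main__":
    # Illustrative starting point: about 4% mRNA, with the rest treated as rRNA.
    for efficiency in (0.0, 0.5, 0.9, 0.99):
        share = non_rrna_fraction(0.96, efficiency)
        print(f"removal efficiency {efficiency:.2f}: non-rRNA share {share:.3f}")
```

Under these assumptions the non-rRNA share rises slowly until removal is nearly complete, which is why even an efficient depletion step can still leave a library dominated by rRNA.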
To assess the gene content of our metatranscriptome, we annotated against several gene or genome reference databases, including MG-RAST M5NR, the Carbohydrate Active Enzyme (CAZy) database, a soil genome dataset termed RefSoil, and three de novo assembled metagenomes obtained from the same plot during the Spring of 2009. Additionally, we assessed the value of assembling longer sequences, or contigs, from the metatranscriptome for improved insight into community activity.

Methods

Various methods that can be used for soil metatranscriptome analysis, including (A) direct annotation of short read sequences, (B) assembly and annotation, and (C) alignment of reads to existing reference genomes, are summarized in Figure 2.1 along with the advantages and disadvantages of each. We analyzed our dataset using all three of these methods and compared the results of each method to the others to determine best practices and to identify the advantages and disadvantages of each method.

Figure 2.1: Metatranscriptome data analysis workflow. Various methods for metatranscriptome data analysis are shown. (a) Direct annotation of short reads. (b) Assembly of short reads into longer contigs and subsequent annotation. (c) Short read mapping to genomes compiled in the RefSoil database.

Metatranscriptome sample collection and library preparation

The bulk soil samples for metatranscriptomics were obtained from a four-year-old stand of Miscanthus (Miscanthus x giganteus), plot G6R1 at the Bioenergy Cropping Systems Experiment. Samples were collected midday on August 1st, 2012. The mean air temperature for the previous week was 24 °C, and there had been 117 mm of rain in the week preceding the sampling with 2 days of no rain directly prior to sampling; the soil was still moist. A composite sample comprised of three soil samples was taken from random points in the plot. The soil was quickly sieved (4 mm) to remove roots, and frozen on dry ice to prevent mRNA degradation.
Samples were stored at -80 °C until RNA extraction. RNA was extracted from 2 g of soil using the PowerSoil RNA kit (MoBio, Carlsbad, CA), and DNA was then removed by DNase treatment (Invitrogen, Carlsbad, CA). RNA (100 ng) was converted to cDNA and treated with duplex-specific nuclease (DSN) to reduce the abundance of rRNA as described in [17]. Samples were sequenced with the Illumina HiSeq sequencing platform at the Research Technology Support Facility, Michigan State University, East Lansing, MI, USA, generating 100 base pair (bp) reads.

MG-RAST databases used for annotation of unassembled raw reads

Ribosomal RNA sequences were identified using riboPicker [21] (Figure 2.1a), the Rfam [22] databases and MG-RAST [23]. The resulting non-rRNA sequences were submitted to MG-RAST (v 3.3.7.3) using the M5NR [24] for gene annotation. Many reads were annotated as Enterobacteria phage phiX174, which is commonly used as a control in sequencing facilities. Hence, the sequences were mapped to the Enterobacteria phage phiX174 sensu lato genome (NC_001422.1) using Bowtie 2 (v2.0.0-beta6, [25]) and removed from the analysis as they were likely the result of contamination. Additionally, rRNA sequences within the unassembled dataset were annotated using the MG-RAST M5RNA database (MG-RAST ID 4554103.3, Unassembled Metatranscriptome). Annotations were identified using the following preset quality filter parameters: maximum e-value cutoff of 1e-5, minimum percent identity cutoff of 60% and minimum alignment length cutoff of 15.

Metatranscriptome assembly & annotation

Sequences were filtered using digital normalization (flags: -C 20, -k 20, -N 4, -x 2e9) as described in [27-29]. Normalized reads were assembled using Velvet (v 1.2.10) [30] with odd numbered k-mers from length 19 to 59 (Figure 2.1b). Assemblies produced from different k-mer lengths were merged using AMOS (v 3.1.0) [31] and CD-HIT (v 4.5.7) [32].
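The filtering and k-mer sweep described above can be sketched as follows. This is a simplified illustration of the idea behind digital normalization and is not the tool used in this study: it counts k-mers exactly in memory rather than with a fixed-size probabilistic table, it ignores reverse complements, and the function names are hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every k-mer of seq in order (no reverse-complement handling)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def digital_normalize(reads, k=20, cutoff=20):
    """Keep a read only if the median count of its k-mers, counted over
    the reads kept so far, is below the coverage cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:  # read shorter than k: nothing to count
            continue
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept


def kmer_sweep(start=19, stop=59, step=2):
    """Odd k-mer lengths for a multi-k assembly, inclusive of both ends."""
    return list(range(start, stop + 1, step))
```

In this sketch, reads from highly covered transcripts are discarded once their estimated coverage reaches the cutoff, while reads from rare transcripts are retained, which reduces the data volume passed to the assembler. The sweep helper simply enumerates the k-mer lengths that would each receive a separate assembly before merging.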
Resulting assembled contigs with lengths greater than 200 bp were annotated with MG-RAST (v 3.3.7.3) [24] (MG-RAST ID 4532564.3, Assembled Metatranscriptome), the CAZy database (date accessed: July 13, 2008) [33], which contains enzymes involved in carbon compound synthesis and decomposition, and the Rfam database, which contains non-coding RNAs.

Previous metagenomes used for comparison to soil metatranscriptome

The reference metagenomes used in this study were obtained from one bulk and two rhizosphere soil samples collected from the same Miscanthus bioenergy plot in October 2009. DNA was extracted from 2.5 g soil as described in [19]. The high molecular weight DNA was then gel purified, electroeluted, and concentrated using methods described in [20]. Samples were sequenced with both the Illumina GAII and 454 sequencing platforms at the Joint Genome Institute (Walnut Creek, CA), generating 100-base reads.

Metagenome assembly & annotation

Sequences were trimmed using a quality score of 20. Reads were assembled using SOAPdenovo [34] with k-mer lengths of 21, 23, 25, 27, 29 and 31. All of the default settings were used for the SOAPdenovo assembly (flags -d 1 and -R). Contigs were then merged using SGA [35] with all default parameters. Contigs greater than 500 bp were annotated by MG-RAST (v3, 2011-02-22) (MG-RAST IDs 4465947.3, Bulk MetaG; 4465942.3, Rhizo MetaG1; and 4465943.3, Rhizo MetaG2).

Estimation of abundance of assembled contigs or reference sequences

The abundance of assembled contigs and reference sequences (e.g., soil genomes) was estimated as the median base pair coverage of all transcript alignments to contigs (assembled metatranscriptome and reference metagenomes) or to genomes within the RefSoil database.
Mapping of unassembled metatranscriptome reads to contigs or genomes was performed using Bowtie 2 (v2.0.0-beta6, [25]) with the following default parameters: end-to-end alignment, a minimum score threshold of -60.6 for 100 bp reads, D 100, and distinct alignments for each read. Base pair coverage was estimated using BedTools (v 2.17.0) [26]. For metagenomic, metatranscriptomic, and soil reference annotated genes, coverage was estimated on the genic region rather than the complete originating contig (which may contain both genic and intergenic sequences).

Curation of soil reference genome database, RefSoil

A manually curated database of soil bacterial genomes was built to provide a soil-specific reference set. Strains with completely sequenced genomes were selected from the GOLD database (http://genomesonline.org) on August 19th, 2011. The inclusion criteria involved both information on the isolation of the sequenced organism and literature searches regarding the ecology of the species. For example, Erwinia amylovora CFBP 1430 was selected even though it was originally isolated from a Crataegus plant, because it is commonly detected in soils. Obligate human pathogens and non-soil-relevant extremophiles were excluded. If redundant genomes were found at the species level, only two per species were kept to reduce database bias. The database comprised a total of 492 organisms, representing 19 different phyla and contributing a total of 1,031 replicons (chromosomes and plasmids). Complete GenBank accessions were downloaded and parsed to extract whole genome sequences and features (gene coordinates and annotations). A complete list of genomes and accession numbers used in the RefSoil database is in Supp Table 2.1.
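The species-level redundancy rule described above (at most two genomes per species) can be sketched as follows. This is an illustration only: the function name, the (species, strain) input format, and the choice to keep the first two entries in input order are assumptions, since the text does not state how the retained strains were chosen.

```python
from collections import defaultdict


def cap_genomes_per_species(genomes, max_per_species=2):
    """Return genomes with at most max_per_species entries per species.

    genomes: iterable of (species, strain) pairs. Input order decides
    which strains are kept (an assumption; the selection rule used for
    RefSoil is not stated in the text).
    """
    seen = defaultdict(int)
    kept = []
    for species, strain in genomes:
        if seen[species] < max_per_species:
            kept.append((species, strain))
            seen[species] += 1
    return kept


# Hypothetical candidate list, not taken from the RefSoil table.
candidates = [
    ("Species A", "strain 1"),
    ("Species A", "strain 2"),
    ("Species A", "strain 3"),
    ("Species B", "strain 1"),
]
curated = cap_genomes_per_species(candidates)
```

In practice the choice of which strains to retain would likely be informed by isolation source and genome quality rather than input order.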
Unassembled read mapping to metagenomes and RefSoil genomes

Sequences were mapped to the three metagenome assemblies and the genomes within the RefSoil database using Bowtie 2 (v2.0.0-beta6, [25]) with the following default parameters: end-to-end alignment, a minimum score threshold of -60.6 for 100 bp reads, D 100, and distinct alignments for each read (Figure 1C). Coverage of annotated regions was estimated using BedTools (v 2.17.0) [26]. Only reads with a minimum alignment length of 100 bp (to references) and contigs (or genes/genomes) with at least two mapped reads were considered.

Results

Characterization of sequences in the unassembled soil metatranscriptome

Genes identified in transcripts include sequences associated with both rRNA and mRNA genes, informative of the active community structure and function. The large majority of transcripts, 169 million reads (82.8%), shared similarity to known rRNA genes. The justification for using the DSN approach for mRNA enrichment over probe-based rRNA removal was the unbiased removal of rRNA gene fragments. Using the DSN metatranscriptome library preparation, the remaining rRNA sequences were evaluated to determine the taxonomic composition of the active community members, resulting in nearest matches to over 22,000 species. The dominant phyla were Actinobacteria and Proteobacteria (Figure 2.2, blue bar), and Ascomycota was the most abundant fungal phylum.

Figure 2.2: Phylogenetic distribution of sequence annotations identified in unassembled and assembled metatranscriptome and associated soil metagenomes. Phylogeny of rRNA from the unassembled metatranscriptome compared to the phylogeny of MG-RAST best-hit classification of protein-coding genes for the assembled metatranscriptome and the reference metagenomes.
* The pink bar in the Assembled MetaT (mRNA) Firmicutes represents the proportion of misannotation in the sample (explained below).
In unassembled transcripts sharing sequence similarity with known genes, the most abundant protein-coding annotations from the SEED database were associated with hypothetical (3.7%) or housekeeping proteins such as GroEL (2.9%), DNA-directed RNA polymerase (1.9%), and the translation elongation factor Tu (1.8%). These non-rRNA sequences account for 179,088 reads, representing 8,345 protein-coding genes (Table 2.1). Similar functional profiles were also obtained when annotating the unassembled metatranscriptome against other databases in MG-RAST, including GenBank, KEGG, and RefSeq (Table 2.1, Supp Table 2-5). Overall, the reads comprising the unassembled metatranscriptome were associated with a few dominant annotations, where the five most abundant annotations represented 12% of the total abundance of annotations in our metatranscriptome.

Table 2.1. Summary of sequence annotations of the unassembled and assembled soil metatranscriptome against various reference databases. Results of annotation by MG-RAST.

Database | Unassembled: abundance | Unassembled: unique annotations | Unassembled: unique features a | Assembled: abundance | Assembled: unique annotations | Assembled: unique features a
SEED | 480,802 | 8,345 | 59,189 | 388,030 | 3,882 | 13,754
GenBank | 681,148 | 45,204 | 116,790 | 174,438 | 6,946 | 16,653
KEGG | 385,794 | 24,479 | 82,977 | 204,263 | 5,442 | 15,687
RefSeq | 470,518 | 24,444 | 94,699 | 319,974 | 5,518 | 17,365

a Note that annotations are defined within the MG-RAST M5NR database, where distinct annotations may be represented by multiple features. Features may be associated with a specific gene in a reference genome.

To further explore both the taxonomic and functional content of the metatranscriptome, transcripts were also compared against the RefSoil database, which resulted in 94 million reads aligning to RefSoil genomes, the large majority of which were associated with rRNA gene annotations (Table 2.2). Similar to SEED-associated annotations, the most represented functions of the metatranscriptome in the RefSoil database were associated with hypothetical proteins, ribosomal structure, and housekeeping genes. Overall, the most abundantly represented RefSoil genomes in the soil metatranscriptome included Syntrophus aciditrophicus SB, Methylococcus capsulatus str. Bath, and Novosphingobium sp. PP1Y (Supp Table 2.6). The largest numbers of genes (i.e., presence rather than abundance) were identified in the genomes of Nocardioides sp. JS614, Bradyrhizobium japonicum USDA 110, and Streptomyces scabiei 87.22 (Supp Table 2.7).

Table 2.2: Summary of transcript mapping. Transcripts mapped to reference assemblies (available in MG-RAST with IDs indicated) or genomes with proportion of reads identified as similar to rRNA genes and mapping uniquely to a specific reference assembly.

Unassembled read mapping sources | Transcripts mapped to reference | Transcripts mapping to protein coding regions in assembled contigs
MetaG Bulk | 30,769,638 (15.0%) | 3,461,504 (1.7%)
MetaG Rhizo1 | 39,728,854 (19.4%) | 277,158 (0.1%)
MetaG Rhizo2 | 35,837,442 (17.5%) | 876,690 (0.4%)
RefSoil | 94,104,227 (45.9%) | 9,693,354 (4.7%)

Characterization of sequences in the assembled soil metatranscriptome

Assembly of the metatranscriptome was highly successful and incorporated 73.8% of the reads into 116,556 contigs totaling 32.4 Mbp. In contrast to the unassembled metatranscriptome, the majority of assembled contigs (78.3%) were not associated with rRNA genes. Overall, a total of 15,032 (13.3%) contigs shared sequence similarity with known proteins in the SEED database (Figure 2.3, Supp Table 2.8-11). To estimate the abundance of assembled sequences, unassembled reads were aligned to contigs, and the median base pair coverage of each contig was calculated.
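The abundance measure used here, median base pair coverage over a contig or genic region, can be sketched as below. This is a minimal illustration and not the BedTools-based procedure used in the study; the function name and the half-open (start, end) interval format are assumptions.

```python
from statistics import median


def median_coverage(region_start, region_end, alignments):
    """Median per-base depth over the half-open region [region_start, region_end).

    alignments: iterable of (start, end) half-open intervals on the same
    contig, e.g. parsed from a read mapper's output (hypothetical format).
    """
    length = region_end - region_start
    if length <= 0:
        raise ValueError("region must have positive length")
    # Difference array: +1 where a read starts covering, -1 where it stops.
    diff = [0] * (length + 1)
    for start, end in alignments:
        lo = max(start, region_start)
        hi = min(end, region_end)
        if lo < hi:
            diff[lo - region_start] += 1
            diff[hi - region_start] -= 1
    # Running sum of the difference array gives the depth at each base.
    depths = []
    depth = 0
    for delta in diff[:length]:
        depth += delta
        depths.append(depth)
    return median(depths)


# Three hypothetical read alignments on a 10 bp contig.
reads = [(0, 10), (0, 5), (2, 8)]
whole_contig = median_coverage(0, 10, reads)
genic_region = median_coverage(2, 8, reads)
```

Restricting the region to gene coordinates gives the genic coverage described in the methods, and using the median rather than the mean keeps short, deeply covered stretches from inflating the estimate.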
The most abundant gene functions identified within the soil metatranscriptome assembly included those associated with hypothetical proteins or with RNA and protein metabolism (Figure 2.4, dark blue and red lines only). Comparing the number of annotations identified in the unassembled and assembled metatranscriptome datasets, we found a larger number of annotations in the unassembled dataset (Figure 2.5). However, the five most abundant subsystems in the SEED annotations were shared between the assembled and unassembled metatranscriptome datasets, though with differing rank abundances (Figure 2.4).

Figure 2.3: Distribution of assembled metatranscriptome annotations. Proportion of assembled metatranscriptome sequences associated with known rRNA, gene function (SEED), or non-coding sequences (Rfam).

Figure 2.4: Comparison of functional profiles of metatranscriptomes and metagenomes. Annotations were identified in the assembled and unassembled metatranscriptome datasets as well as the three metagenome assemblies against the MG-RAST SEED database.

Figure 2.5: Comparison of the total number of gene annotations identified in the unassembled and assembled metatranscriptomes. Results were generated using the MG-RAST Metagenome Analysis page.

Community composition represented by the assembled metatranscriptome was identified by comparing the contigs to the taxonomic origins of proteins in the MG-RAST M5NR, resulting in the identification of 2,200 species. Similar to the rRNA in the unassembled metatranscriptome dataset, the dominant phyla represented were Actinobacteria and Proteobacteria. In contrast to the unassembled metatranscriptome, Firmicutes also represented a large portion of protein annotations in the assembled dataset (Figure 2.2, red bar), but this was mainly due to hypothetical proteins associated with Heliobacterium modesticaldum, Lactobacillus rhamnosus, and Staphylococcus aureus.
The detection of abundant Firmicutes only in the assembled dataset, and their absence from the unassembled transcripts, suggests a database bias in rRNA gene sequences within the MG-RAST M5NR and hence a likely annotation error (as noted by the different shading of the red bar in Figure 2.2). Such a bias is plausible because the three organisms mentioned above are associated with human disease and therefore comprise a larger portion of available genomes.

The large majority of contigs within the soil metatranscriptome (greater than 65%) could not be annotated with any of the reference databases used in this study (MG-RAST, RefSoil, CAZy, or associated metagenomes) (Figure 2.3). To evaluate the possible presence of non-coding RNAs, sequences were compared to known non-coding RNAs in the Rfam database, resulting in a total of 3,036 contigs (2.6%) sharing similarity to RNA genes, regulatory RNAs, or self-splicing RNAs. The major RNA families identified included RNAs associated with transcription and translation (5/5.8S, tmRNA, and RNaseP), signal recognition particles, and riboswitches (Supp Table 2.12).

Further, the longer sequence lengths of assembled contigs significantly improved annotations, doubling the median alignment lengths to known proteins (Figure 2.6). To assess the impacts of sequence length, we evaluated the influence of varying similarity thresholds. Stricter criteria for alignment scores (e.g., a decreased E-value cutoff) reduced the abundance and total number of unique features in the unassembled dataset. Overall, confidence in annotations (e.g., median E-value scores) was much higher (lower E-value) for the assembled dataset than for the unassembled dataset (Figure 2.7), and variations in the E-value thresholds did not have as pronounced an effect on the total number of annotations or the number of unique features. Importantly, assembled contigs provide longer sequence lengths for annotation (62 aa vs.
31 in the unassembled set), allowing for improved annotations (e.g., similarity comparisons to CAZy enzymes) (Figure 2.6). In total, assembly yielded 688 contigs, comprising 194,985 bp, that could be classified into five CAZy enzyme categories: glycoside hydrolases (GH), glycosyltransferases (GT), polysaccharide lyases (PL), carbohydrate esterases (CE), and carbohydrate-binding modules (CBM). The large majority of these sequences (572 contigs, 83%) were associated with GH, GT, and CBM-containing enzymes. The most frequent CAZy gene families were GT2, GH36, CBM13, and GH18 (Supp Table 2.12). Overall, these CAZy-associated contigs were present at relatively low abundances within the metatranscriptome, averaging 4.1-fold coverage. The most abundant enzyme classes included GH19, GH17, and CBM14, with 137-, 47-, and 19-fold coverage, respectively (Supp Table 2.13).

Figure 2.6: Comparison of annotation alignment lengths of the assembled and unassembled datasets. Amino acid alignment lengths of SEED subsystem annotations for the assembled and unassembled datasets. The minimum alignment length is set to the MG-RAST default of 15 amino acids.

Figure 2.7: Comparison of annotation E-values of the assembled and unassembled datasets. E-value of SEED Subsystem annotations of the assembled and unassembled datasets. The minimum e-value is set to the MG-RAST default of 1e-5.

Comparison of the metatranscriptome datasets to metagenomes

To provide further insight into the active subset of microbial communities, we evaluated the membership identified in the soil metatranscriptome (gene expression) and compared this to the membership identified in the soil metagenomes (gene potential). Assemblies of the reference metagenomes produced 1.3 million contigs and represent over one billion bases (Table 2.3). Approximately 30 to 40 million unassembled metatranscriptome reads mapped to each of these metagenomes, with the majority of these reads being rRNA (98%, Table 2.2).
The remaining non-rRNA transcripts mapped to a total of 147 genes, the most abundant of which were related to housekeeping functions, e.g., RNA polymerase, chaperone proteins, and translation elongation factors. The functional profiles of the assembled metatranscriptome and metagenomes from the same site revealed that the metatranscriptome was greatly enriched in genes related to RNA and protein metabolism. In contrast, the metagenomes were enriched in genes related to carbohydrate, amino acid (and derivatives), and DNA metabolism (Figure 2.4). The overlap of functional annotations between the assembled metagenomes and the assembled metatranscriptome (i.e., at the functional level) comprised 2,413 annotations (62% of the metatranscriptome). Comparing taxonomic profiles of the metatranscriptomes to those of the metagenomes, we found that sequences associated with Proteobacteria were enriched in the metagenomes, while sequences associated with Actinobacteria were enriched in both the assembled and unassembled metatranscriptomes (Figure 2.2).

Table 2.3: Summary of assembled metagenomes in number of assembled contigs and total base pairs represented in the assembly. Results of short read assembly of metagenome samples.

Assembled metagenomes | Contigs | Base pairs
MetaG Bulk | 617,602 | 457,810,820
MetaG Rhizo1 | 303,353 | 216,957,151
MetaG Rhizo2 | 453,481 | 360,952,806
Total | 1,374,436 | 1,035,720,777

Discussion

Our aim was to use metatranscriptomics to assess biological information of a (normally) marginally active soil microbiome and to understand the technical and methodological challenges of this approach. Towards this end, we assessed approaches to generate a soil metatranscriptome library (e.g., mRNA enrichment), analysis approaches (e.g., de novo assembly), and the gene content of the dataset.
Overall, we identified multiple expressed genes in our soil metatranscriptome (Figure 2.4), though it was largely dominated by rRNA genes as well as sequences of unknown origin and function (Table 2.2, Figure 2.3). To increase the information gleaned from soil metatranscriptomics in the future, we identify below several areas for improvement.

The abundance of rRNA in metatranscriptomes must be further reduced in order to improve sampling of mRNA from protein-coding genes. The large proportion of rRNA within our soil metatranscriptome library compromised our ability to sample deeply and consequently to access more protein-coding genes. Although DSN normalization was expected to remove diverse rRNA, it did not, with 83% rRNA remaining in our metatranscriptome library. This fraction is comparable to the rRNA removal efficiency in a human gut metatranscriptome following subtractive hybridization [37] and in a sandy soil metatranscriptome with no rRNA removal [17]. Direct comparison of RNA extraction efficiencies in the two soils may not be appropriate because of different soil characteristics and sampling seasons; their much lighter textured (sandy) soil was sampled in winter, and our medium textured (loamy) soil was sampled in late summer [38]. In general, reports of the rRNA remaining across multiple approaches and environments vary from 50 to 85% [2,37-39], evidence that rRNA removal in metatranscriptomes remains inefficient for complex communities, regardless of extraction methodology. Though DSN normalization has improved performance compared to subtractive hybridization in pure cultures, its effectiveness in high-diversity soil systems remains unclear. An alternative approach is to bypass rRNA removal, sequence more deeply, and computationally remove rRNA reads. This approach becomes more feasible as sequencing prices decrease.
One useful consequence of the rRNA remaining in our metatranscriptome was that it allowed us to make taxonomic inferences about active members of the community. Unlike samples prepared using subtractive hybridization, DSN normalization preserves the relative abundance of sequences within the sample [17]. Since the relative abundance of sequences is preserved, taxonomic annotations associated with the remaining rRNA sequences (in unassembled reads) are reflective of the relative abundances in the original sample. Notably, ribosomal RNA sequences from the assembled metatranscriptome dataset were not used because assemblers typically cannot assemble highly conserved sequences like 16S rRNA genes. Therefore, as a proxy, the taxonomic classification of the most similar known homologous protein was used for community analysis of the assembled metatranscriptome. Taxonomic annotation of both metatranscriptome datasets (unassembled rRNA and assembled protein-coding contigs) suggests that they share a similar taxonomic profile that contrasts with those observed in the metagenomes, highlighting the increased activity of sequences associated with Actinobacteria and the diminished activity of sequences associated with Proteobacteria. This result is consistent with other findings that indicate Actinobacteria are more abundant and active in bulk soils while Proteobacteria tend to be more abundant in the rhizosphere [41,42].

The curation and availability of the RefSoil database allows for the evaluation of sequencing datasets in the context of cultivated soil organisms. Despite the diversity of soil microbial communities and the difficulty of cultivating microbial representatives, this database was surprisingly well represented within our soil metatranscriptome. Many transcripts could be aligned to RefSoil genomes, although most were associated with rRNA genes. This result suggests that the RefSoil database captures a large amount of the SSU rRNA (taxonomic) diversity in our sample.
The functions contained within RefSoil were not nearly as well represented in the soil metatranscriptome, suggesting that although this database may capture much of the genus-to-species level of diversity, the genetic diversity within those groups is still very large.

We found de novo assembly of this soil metatranscriptome to be an important step towards providing improved references for soil sequencing approaches, as evidenced by longer sequence lengths, data reduction, improved confidence in annotation, and the development of reference sequences that do not rely on a priori information. Previously, the high diversity of soil communities has resulted in only a fraction of sequences being assembled in metagenomic studies [28]. For this soil metatranscriptome, 73.8% of the reads in the unassembled dataset mapped to our assembly, suggesting that the diversity of soil metatranscriptomes is significantly less than that of metagenomes. As a consequence, if rRNA can be efficiently removed prior to sequencing, metatranscriptomic efforts may require less sequencing depth than previously suggested by soil metagenomes. The longer sequence lengths provided by the assembly also provide higher confidence in annotations as well as the identification of multiple novel and abundant sequences. Importantly, our metatranscriptome assembly provides a specific set of genomic references that can be used for comparative soil studies. The presence of shared (highly) expressed sequences in multiple datasets can be used to prioritize encoded genes for characterization. An indirect advantage of soil metatranscriptome assembly is that it discards many rRNA-associated sequences, because these sequences are difficult to assemble, allowing it to be used as a method for rRNA removal that does not rely on having known references. As datasets continue to grow in volume, assembly may become an increasingly efficient method for both improving gene annotation and removing rRNA.
We evaluated the novel information gained through our metatranscriptome by comparing our soil metatranscriptome to available metagenomes from the same plot. The majority of genes were not shared between these datasets, though the majority of encoded functions were similar. Within functional annotations, 62% of the metatranscriptome annotations were shared with the metagenomes. However, many of these were associated with rRNA genes; relatively few non-rRNA transcripts (~2.7 million) were aligned to the metagenomes, with only 3.7% of non-rRNA annotations shared. This result indicates that the metatranscriptome is functionally similar to the metagenomes but is composed of distinct genotypes with high levels of functional redundancy between members. Previous studies have also shown little overlap between metagenome and metatranscriptome libraries [42,43]. A possible explanation for this observation is a change in the microbial communities over the time (2 years) between sampling the soil metatranscriptome and the metagenomes. While changes during this time are very likely, we expect that these changes are relatively small in the metagenomes, as soil microbial populations are thought to have turnover times ranging from 0.24 to 6.8 years [15,44,45]. Another possible explanation for the low overlap between samples is the soil subhabitat (metatranscriptome of bulk vs. metagenome of rhizosphere). Rhizosphere soils generally contain more active communities than bulk soils [10]. Differences in the sequencing depth of the metatranscriptomic and metagenomic efforts may also have contributed to differences between these datasets. A final explanation for the distinct communities identified between these sequencing efforts is that the biologically active communities in the soil may not be represented in metagenomes due to undersampling or spatial differences.
In this case, metagenomic libraries may be most useful for generating gene references reflecting possible soil diversity, while metatranscriptomics may be most appropriate for targeting active communities.

Overall, our metatranscriptome was dominated by sequences that could not be associated with genes that have previously been studied and for which no function is known (e.g., hypothetical proteins). Abundant hypothetical proteins are observed in other metatranscriptomes [40,44]. Insight into these sequences (these role in function. Additionally, as increasing numbers of metatranscriptomes become available, the development of novel approaches that use unsupervised classification methods to identify patterns of codon usage across microbial communities [47] or co-occurrence of sequences [48-50] within multiple datasets should prove useful in characterizing these sequences.

Conclusion

Based on our evaluation of this Miscanthus soil sample, soil metatranscriptomics holds promise for identifying actively transcribed genes in the soil. The methods for leveraging this technology still require much development to reach genes important to ecological fitness or ecosystem functions. From this relatively small sample (20 Gbp), we were able to produce an assembly that captured the majority of reads in the sequenced dataset. The resulting assembly allowed us to identify, with high confidence, several sequences similar to known genes and soil genomes that are actively transcribed and of interest to carbon cycling. The development of a soil-specific database was helpful for analyzing our soil metatranscriptome, but a large majority of the assembled sequences still lack references in databases. This does show, however, the value of expanding the soil isolate genome and physiology database. Overall, this study illustrates that metatranscriptomic sequencing can be performed on samples of field-collected soil.
Acknowledgements

We thank Tamara Cole for helpful discussions regarding this manuscript and Jeff Landgraf for troubleshooting the duplex-specific nuclease normalization method. This work was funded by the DOE Great Lakes Bioenergy Research Center (DOE BER Office of Science DE-FC02-07ER64494).

APPENDIX

Table 2.4: Manually curated soil-associated genomes comprising the RefSoil database.

Organism Taxonomy GenBank Accession numbers
Acetobacter pasteurianus IFO 3283-01 Proteobacteria-Alpha AP011121, AP011122, AP011123, AP011124, AP011125, AP011126, AP011127 Acholeplasma laidlawii PG-8A Tenericutes CP000896 Achromobacter xylosoxidans A8 Proteobacteria-Beta CP002287, CP002288, CP002289 Acidithiobacillus caldus SM-1 Proteobacteria-Gamma CP002573, CP002574, CP002575, CP002576, CP002577 Acidithiobacillus ferrooxidans ATCC 23270 Proteobacteria-Gamma CP001219 Acidobacterium capsulatum ATCC 51196 Acidobacteria CP001472 Acidovorax avenae avenae ATCC 19860 Proteobacteria-Beta CP002521 Acidovorax avenae citrulli AAC00-1 Proteobacteria-Beta CP000512 Acidovorax ebreus TPSY Proteobacteria-Beta CP001392 Acidovorax sp. JS42 Proteobacteria-Beta CP000539, CP000540, CP000541 Acinetobacter baumannii ATCC 17978 Proteobacteria-Gamma CP000521, CP000522, CP000523 Acinetobacter baylyi ADP1 Proteobacteria-Gamma CR543861 Acinetobacter calcoaceticus PHEA-2 Proteobacteria-Gamma CP002177 Acinetobacter sp. DR1 Proteobacteria-Gamma CP002080 Actinosynnema mirum 101, DSM 43827 Actinobacteria CP001630 Agrobacterium sp.
H13-3 Proteobacteria-Alpha CP002248, CP002249,CP002250 58 Agrobacterium tumefaciens C58-UWash Proteobacteria-Alpha AE007869, AE007870, AE007871, AE007872 Agrobacterium vitis S4 Proteobacteria-Alpha CP000633, CP000634, CP000635, CP000636, CP000637, CP000638, CP000639 Akkermansia muciniphila ATCC BAA-835 Verrucomicrobia CP001071 Alicycliphilus denitrificans BC Proteobacteria-Beta CP002449, CP002450, CP002451 Alicycliphilus denitrificans K601 Proteobacteria-Beta CP002657, CP002658 Alkalilimnicola ehrlichei MLHE-1 Proteobacteria-Gamma CP000453 Alkaliphilus metalliredigens QYMF Firmicutes CP000724 Alkaliphilus oremlandii OhILAs Firmicutes CP000853 Amycolatopsis mediterranei U32 Actinobacteria CP002000 Anabaena variabilis ATCC 29413 Cyanobacteria CP000117, CP000118, CP000119, CP000120, CP000121 Anaeromyxobacter dehalogenans 2CP-C Proteobacteria-Delta CP000251 Anaeromyxobacter sp K Proteobacteria-Delta CP001131 Anaeromyxobacter sp. Fw109-5 Proteobacteria-Delta CP000769 Arcobacter nitrofigilis DSM 7299 Proteobacteria-Epsilon CP001999 Aromatoleum aromaticum EbN1 Proteobacteria-Beta CR555306, CR555307, CR555308 Arthrobacter arilaitensis re117, CIP108037 Actinobacteria FQ311875, FQ311475, FQ311476 59 Arthrobacter aurescens TC1 Actinobacteria CP000474, CP000475, CP000476 Arthrobacter chlorophenolicus A6 Actinobacteria CP001341, CP001342, CP001343 Arthrobacter phenanthrenivorans Sphe3 Actinobacteria CP002379, CP002380, CP002381 Arthrobacter sp. FB24 Actinobacteria CP000454, CP000455, CP000456, CP000457 Asticcacaulis excentricus CB 48 Proteobacteria-Alpha CP002395, CP002396, CP002397, CP002398 Azoarcus sp. BH72 Proteobacteria-Beta AM406670 Azorhizobium caulinodans ORS 571 Proteobacteria-Alpha AP009384 Azospirillum sp. 
B510 Proteobacteria-Alpha AP010946, AP010947, AP010948, AP010949, AP010950, AP010951, AP010952 Azotobacter vinelandii DJ, ATCC BAA-1303 Proteobacteria-Gamma CP001157 Bacillus amyloliquefaciens Campbell F Firmicutes FN597644 Bacillus amyloliquefaciens FZB42 Firmicutes CP000560 Bacillus anthracis Ames Firmicutes AE016879 Bacillus anthracis Ames Ancestor A2084 (0581) Firmicutes AE017334, AE017335, AE017336 Bacillus anthracis CI Firmicutes CP001746, CP001747, CP001748, CP001749 Bacillus atrophaeus 1942 Firmicutes CP002207 Bacillus cellulosilyticus N-4, DSM 2522 Firmicutes CP002394 61 Bacillus cereus ATCC 10987 Firmicutes AE017194, AE017195 Bacillus cereus ATCC 14579 Firmicutes AE016877, AE016878 Bacillus clausii KSM-K16 Firmicutes AP006627 Bacillus coagulans 2-6 Firmicutes CP002472 Bacillus halodurans C-125 Firmicutes BA000004 Bacillus licheniformis DSM 13 Goettingen Firmicutes AE017333 Bacillus megaterium QM B1551 Firmicutes CP001983, CP001984, CP001985, CP001986, CP001987, CP001988, CP001989, CP001990 Bacillus pseudofirmus OF4 Firmicutes CP001878, CP001879, CP001880 Bacillus pumilus SAFR-032 Firmicutes CP000813 Bacillus selenitireducens MLS10 Firmicutes CP001791 Bacillus subtilis BSn5 Firmicutes CP002468 Bacillus subtilis subtilis 168 Firmicutes AL009126 Bacillus thuringiensis CT43 Firmicutes CP001907, CP001908, CP001909, CP001910, CP001911, CP001912 Bacillus thuringiensis sv. finitimus YBT-020 Firmicutes CP002508, CP002509, CP002510 Bacillus tusciae T2, DSM 2912 Firmicutes CP002017 Bacillus weihenstephanensis KBAB4 Firmicutes CP000903, CP000904, CP000905, CP000906, CP000907 62 Beijerinckia indica indica ATCC 9039 Proteobacteria-Alpha CP001016, CP001017, CP001018 Beutenbergia cavernae HKI 0122, DSM 12333 Actinobacteria CP001618 Brachybacterium faecium 6-10, DSM 4810 Actinobacteria CP001643 Bradyrhizobium japonicum USDA 110 Proteobacteria-Alpha BA000040 Bradyrhizobium sp. BTAi1 Proteobacteria-Alpha CP000494, CP000495 Bradyrhizobium sp. 
ORS278 Proteobacteria-Alpha CU234118 Brevibacillus brevis NBRC 100599 Firmicutes AP008955 Brevundimonas subvibrioides ATCC 15264 Proteobacteria-Alpha CP002102 Brucella microti CCM 4915 Proteobacteria-Alpha CP001578, CP001579 Burkholderia ambifaria MC40-6 Proteobacteria-Beta CP001025, CP001026, CP001027, CP001028 Burkholderia cenocepacia HI2424, BCC1 Proteobacteria-Beta CP000458, CP000459, CP000460, CP000461 Burkholderia cenocepacia MC0-3 Proteobacteria-Beta CP000958, CP000959, CP000960 Burkholderia cepacia 383 (R18194) Proteobacteria-Beta CP000151, CP000150, CP000152 Burkholderia cepacia AMMD Proteobacteria-Beta CP000440, CP000441, CP000442, CP000443 Burkholderia gladioli BSR3 Proteobacteria-Beta CP002599, CP002600, CP002601, CP002602, CP002603, CP002604 Burkholderia glumae BGR1 Proteobacteria-Beta CP001503, CP001504, CP001505, CP001506, CP001507, CP001508 63 Burkholderia multivorans ATCC 17616 Proteobacteria-Beta CP000868.1, CP000869.1, CP000870.1, CP000871.1 Burkholderia phymatum STM815 Proteobacteria-Beta CP001043, CP001044, CP001045, CP001046 Burkholderia phytofirmans PsJN Proteobacteria-Beta CP001052, CP001053, CP001054 Burkholderia rhizoxinica HKI 454 Proteobacteria-Beta FR687359, FR687360, FR687361 Burkholderia sp. CCGE1001 Proteobacteria-Beta CP002519, CP002520 Burkholderia sp. 
CCGE1002 Proteobacteria-Beta CP002013, CP002014, CP002015, CP002016 Burkholderia thailandensis E264 Proteobacteria-Beta CP000085, CP000086 Burkholderia vietnamiensis G4 (R1808) Proteobacteria-Beta CP000614, CP000615, CP000616, CP000617, CP000618, CP000619, CP000620, CP000621 Burkholderia xenovorans LB400 Proteobacteria-Beta CP000270, CP000271, CP000272 Campylobacter lari RM2100 Proteobacteria-Epsilon CP000932, CP000933 Candidatus Accumulibacter phosphatis Type IIA UW-1 Proteobacteria-Beta CP001715, CP001716, CP001717, CP0017185, Candidatus Blochmannia floridanus Proteobacteria-Gamma BX248583 Candidatus Blochmannia pennsylvanicus BPEN Proteobacteria-Gamma CP000016 Candidatus Blochmannia vafer BVAF Proteobacteria-Gamma CP002189 Candidatus Carsonella ruddii Proteobacteria-Gamma AP009180 64 Candidatus Cloacamonas acidaminovorans WWE1 CU466930 Candidatus Hamiltonella defensa 5AT Proteobacteria-Gamma CP001277, CP001278 Candidatus Hodgkinia cicadicola Dsem Proteobacteria-Alpha CP001226 Candidatus Liberibacter asiaticus psy62 Proteobacteria-Alpha CP001677 Candidatus Liberibacter solanacearum CLso-ZC1 Proteobacteria-Alpha CP002371 Candidatus Methylomirabilis oxyfera NC10 FP565575 Candidatus Nitrospira defluvii Nitrospirae FP929003.1 Candidatus Phytoplasma aster yellows witches'-broom AY-WB Tenericutes CP000061, CP000062, CP000063, CP000064, CP000065 Candidatus Phytoplasma australiense Tenericutes AM422018 Candidatus Phytoplasma mali AT Tenericutes CU469464 Candidatus Phytoplasma onion yellows OY-M Tenericutes AP006628 Candidatus Protochlamydia amoebophila UWE25 Chlamydiae BX908798 Candidatus Sulcia muelleri DMIN Bacteroidetes CP001981 65 Candidatus Tremblaya princeps PCIT Proteobacteria-Beta CP002244 Candidatus Zinderia insecticola CARI Proteobacteria-Beta CP002161 Catenulispora acidiphila ID139908, DSM 44928 Actinobacteria CP001700 Caulobacter crescentus CB15 Proteobacteria-Alpha AE005673 Caulobacter crescentus NA1000 Proteobacteria-Alpha CP001340 Caulobacter segnis ATCC 
21756 Proteobacteria-Alpha CP002008 Caulobacter sp. K31 Proteobacteria-Alpha CP000927,CP000928,CP000929 Cellulomonas flavigena 134, DSM 20109 Actinobacteria CP001964 Cellulomonas flavigena NRS 133, ATCC 484 Actinobacteria CP002666 Cellvibrio gilvus ATCC 13127 Proteobacteria-Gamma CP002665 Cellvibrio japonicus Ueda107 Proteobacteria-Gamma CP000934 Chelativorans sp. BNC1 Proteobacteria-Alpha CP000390, CP000389, CP000391, CP000392 Chitinophaga pinensis UQM 2034, DSM 2588 Bacteroidetes CP001699 Chromobacterium violaceum ATCC 12472 Proteobacteria-Beta AE016825 Citrobacter koseri ATCC BAA-895 Proteobacteria-Gamma CP000822, CP000823, CP000824 Citrobacter rodentium Proteobacteria-Gamma FN543502, FN543503, FN543504, FN543505 66 Clavibacter michiganensis michiganensis NCPPB 382 Actinobacteria AM711867, AM711866, AM711865 Clavibacter michiganensis sepedonicus ATCC 33113 Actinobacteria AM849034, AM849035, AM849036 Clostridium acetobutylicum ATCC 824 Firmicutes AE001437, AE001438 Clostridium acetobutylicum DSM 1731 Firmicutes CP002660, CP002661, CP002662 Clostridium beijerinckii NCIMB 8052 Firmicutes CP000721 Clostridium botulinum BoNT/B1 Okra Firmicutes CP000939, CP000940 Clostridium botulinum type A - Hall Firmicutes AM412317, AM412318 Clostridium cellulolyticum H10 Firmicutes CP001348 Clostridium cellulovorans 743B, ATCC 35296 Firmicutes CP002160 Clostridium cf. 
saccharolyticum K10 Firmicutes FP929037 Clostridium kluyveri DSM 555 Firmicutes CP000673, CP000674 Clostridium kluyveri NBRC 12016 Firmicutes AP009049,AP009050 Clostridium ljungdahlii PETC, DSM 13528 Firmicutes CP001666 Clostridium novyi NT Firmicutes CP000382 67 Clostridium perfringens 13 Firmicutes BA000016, AP003515.1 Clostridium perfringens ATCC 13124 Firmicutes CP000246 Clostridium phytofermentans ISDg Firmicutes CP000885 Clostridium saccharolyticum WM1, DSM 2544 Firmicutes CP002109 Clostridium sticklandii DSM 519 Firmicutes FP565809 Clostridium tetani Massachusetts E88 Firmicutes AE015927, AF528097.1 Clostridium thermocellum ATCC 27405 Firmicutes CP000568 Clostridium thermocellum LQ8, DSM 1313 Firmicutes CP002416 Comamonas testosteroni CNB-1 Proteobacteria-Beta CP001220, EF079106.1 Conexibacter woesei ID131577, DSM 14684 Actinobacteria CP001854 Coraliomargarita akajimensis DSM 45221 Verrucomicrobia CP001998 Corynebacterium aurimucosum CN-1, ATCC 700975 Actinobacteria CP001601, FM164414 Corynebacterium efficiens YS-314T Actinobacteria BA000035, AP005225, AP005226 Corynebacterium glutamicum Nakagawa, ATCC 13032 Actinobacteria BA000036 68 Corynebacterium glutamicum R Actinobacteria AP009044, AP009045 Corynebacterium jeikeium K411 Actinobacteria CR931997, AF401314.1 Corynebacterium kroppenstedtii DSM 44385 Actinobacteria CP001620 Corynebacterium pseudotuberculosis 1002 Actinobacteria CP001809 Corynebacterium pseudotuberculosis C231 Actinobacteria CP001829 Corynebacterium resistens DSM 45100 Actinobacteria CP002857 Corynebacterium urealyticum DSM 7109 Actinobacteria AM942444 Corynenebacterium ulcerans 809 Actinobacteria CP002790 Corynenebacterium ulcerans BR-AD22 Actinobacteria CP002791 Cronobacter sakazakii ATCC BAA-894 Proteobacteria-Gamma CP000783, CP000784, CP000785 Cronobacter turicensis z3032 Proteobacteria-Gamma FN543093, FN543094, FN543095, FN543096 Cupriavidus metallidurans CH34 Proteobacteria-Beta CP000352, CP000353, CP000354, CP000355 Cupriavidus 
necator JMP134 Proteobacteria-Beta CP000090, CP000091, CP000092, CP000093 Cupriavidus necator N-1 Proteobacteria-Beta CP002877, CP002878, CP002879, CP002880 Cupriavidus taiwanensis LMG 19424 Proteobacteria-Beta CU633749, CU633750, CU633751 Cyanothece sp. PCC 7424 Cyanobacteria CP001291, CP001292, CP001293, CP001294, CP001295, CP001296, CP001297, 69 Cyanothece sp. PCC 8802 Cyanobacteria CP001701, CP001702, CP001703, CP001704, CP001705 Cytophaga hutchinsonii ATCC 33406 Bacteroidetes CP000383 Dechloromonas aromatica RCB Proteobacteria-Beta CP000089 Dehalococcoides ethenogenes 195 Chloroflexi CP000027 Dehalococcoides sp. VS Chloroflexi CP001827 Dehalogenimonas lykanthroporepellens BL-DC-9 Chloroflexi CP002084 Deinococcus deserti VCD115 Thermi CP001114, CP001115, CP001116, CP001117 Deinococcus maricopensis LB-34, DSM 21211 Thermi CP002454 Deinococcus proteolyticus MRP, DSM 20540 Thermi CP002536, CP002537, CP002538, CP002539, CP002540 Deinococcus radiodurans USUHS (R1) Thermi AE000513, AE001825, AE001826, AE001827 Delftia acidovorans SPH-1 Proteobacteria-Beta CP000884 Delftia sp. 
Cs1-4 Proteobacteria-Beta CP002735 Desulfarculus baarsii 2st14, DSM 2075 Proteobacteria-Delta CP002085 Desulfitobacterium hafniense DCB-2 Firmicutes CP001336 Desulfitobacterium hafniense Y51 Firmicutes AP008230 Desulfobacca acetoxidans ASRB2 Proteobacteria-Delta CP002629 70 Desulfobacterium autotrophicum HRM2, DSM 3382 Proteobacteria-Delta CP001087, CP001088 Desulfobulbus propionicus 1pr3, DSM 2032 Proteobacteria-Delta CP002364 Desulfotomaculum acetoxidans 5575, DSM 771 Firmicutes CP001720 Desulfotomaculum carboxydivorans CO-1-SRB, DSM 14880 Firmicutes CP002736 Desulfotomaculum kuznetsovii 17 Firmicutes CP002770 Desulfotomaculum reducens MI-1 Firmicutes CP000612 Desulfotomaculum ruminis DL, DSM 2154 Firmicutes CP002780 Desulfovibrio aespoeensis Aspo-2 Proteobacteria-Delta CP002431 Desulfovibrio alaskensis G20 Proteobacteria-Delta CP000112 Desulfovibrio desulfuricans desulfuricans 27774 Proteobacteria-Delta CP001358 Desulfovibrio magneticus RS-1 Proteobacteria-Delta AP010904, AP010905, AP010906 Desulfovibrio salexigens DSM 2638 Proteobacteria-Delta CP001649 Desulfovibrio vulgaris RCH1 Proteobacteria-Delta CP002297, CP002298 Desulfovibrio vulgaris vulgaris Hildenborough Proteobacteria-Delta AE017285, AE017286 Desulfurispirillum indicum S5 Chrysiogenetes CP002432 71 Desulfurivibrio alkaliphilus AHT2 Proteobacteria-Delta CP001940 Dickeya dadantii 3937 Proteobacteria-Gamma CP002038 Dickeya dadantii Ech703 Proteobacteria-Gamma CP001654 Dickeya zea Ech1591 Proteobacteria-Gamma CP001655 Dyadobacter fermentans NS114, DSM 18053 Bacteroidetes CP001619 Ensifer medicae WSM419 Proteobacteria-Alpha CP000738, CP000739, CP000740, CP000741 Ensifer meliloti AK83 Proteobacteria-Alpha CP002781, CP002782, CP002783, CP002784, CP002785 Ensifer meliloti BL225C Proteobacteria-Alpha CP002740, CP002741,CP002742 Enterobacter aerogenes KCTC 2190 Proteobacteria-Gamma CP002824 Enterobacter cloacae cloacae ATCC 13047 Proteobacteria-Gamma CP001918, CP001919, CP001920 Enterobacter cloacae cloacae 
NCTC 9394 Proteobacteria-Gamma FP929040 Enterobacter sp. 638 Proteobacteria-Gamma CP000653, CP000654 Erwinia amylovora CFBP1430 Proteobacteria-Gamma FN434113, FN434114 Erwinia amylovora Ea273, ATCC 49946 Proteobacteria-Gamma FN666575, FN666576, FN666577 Erwinia billingiae Eb661 Proteobacteria-Gamma FP236843, FP236826, FP236830, Erwinia pyrifoliae Ep1/96 Proteobacteria-Gamma FP236842, FP236827, FP236828, FP236829, FP928999 Erwinia pyrifoliae Ep1/96 Proteobacteria-Gamma FN392235, FN392236, FN392237, FN392238, FN392239 72 Erwinia tasmaniensis Et1/99 Proteobacteria-Gamma CU468135, CU468128.1, CU468130, CU468131, CU468132, CU468133 Escherichia coli W, ATCC 9739 Proteobacteria-Gamma CP002185, AY639886 Eubacterium cylindroides T2-87 Firmicutes FP929041 Eubacterium limosum KIST612 Firmicutes CP002273 Eubacterium rectale M104/1 Firmicutes FP929043 Eubacterium siraeum 70/3 Firmicutes FP929044 Exiguobacterium sibiricum 255-15 Firmicutes CP001022, CP001023, CP001024 Exiguobacterium sp. AT1b Firmicutes CP001615 Flavobacterium johnsoniae UW101, ATCC 17061 Bacteroidetes CP000685 Frankia sp EuI1c Actinobacteria CP002299 Gallionella capsiferriformans ES-2 Proteobacteria-Beta CP002159 Gemmatimonas aurantiaca T-27T Gemmatimonadetes AP009153 Geobacter bemidjiensis Bem, DSM 16622 Proteobacteria-Delta CP001124 Geobacter lovleyi SZ Proteobacteria-Delta CP001089, CP001090 Geobacter metallireducens GS-15 Proteobacteria-Delta CP000148, CP000149 Geobacter sp. 
FRC-32 Proteobacteria-Delta CP001390 Geobacter sulfurreducens Proteobacteria-Delta AE017180 73 Geobacter uraniireducens Rf4 Proteobacteria-Delta CP000698 Geodermatophilus obscurus G-20, DSM 43160 Actinobacteria CP001867 Gloeobacter violaceus PCC 7421 Cyanobacteria BA000045 Gluconacetobacter diazotrophicus PAl 5, DSM 5601 Proteobacteria-Alpha AM889285, AM889286, AM889287 Gluconacetobacter diazotrophicus PAl 5,DSM 5601 Proteobacteria-Alpha CP001189, CP001190 Gluconobacter oxydans 621H Proteobacteria-Alpha CP000009, CP000004, CP000005, CP000006, CP000007, CP000008 Gordonia bronchialis 3410, DSM 43247 Actinobacteria CP001802, CP001803 Gordonibacter pamelaeae 7-10-1-bT Actinobacteria FP929047 Granulibacter bethesdensis CGDNIH1 Proteobacteria-Alpha CP000394 Granulicella tundricola MP5ACTX9 Acidobacteria CP002480, CP002481, CP002482, CP002483, CP002484, CP002485 Herbaspirillum seropedicae SmR1 Proteobacteria-Beta CP002039 Herminiimonas arsenicoxydans ULPAs1 Proteobacteria-Beta CU207211 Hyphomicrobium denitrificans ATCC 51888 Proteobacteria-Alpha CP002083 Intrasporangium calvum 7KIP, DSM 43043 Actinobacteria CP002343 Isoptericola variabilis 225 Actinobacteria CP002810 74 Janthinobacterium sp. 
Marseille Proteobacteria-Beta CP000269 Ketogulonicigenium vulgare Y25 Proteobacteria-Alpha CP002224, CP002225, CP002226 Kineococcus radiotolerans SRS30216 Actinobacteria CP000750, CP000751, CP000752 Kitasatospora setae KM-6054, NBRC 14216, DSM 43861 Actinobacteria AP010968 Klebsiella pneumoniae 342 Proteobacteria-Gamma CP000964, CP000965, CP000966 Klebsiella pneumoniae pneumoniae MGH78578 Proteobacteria-Gamma CP000647, CP000648, CP000649, CP000650, CP000651, CP000652 Klebsiella variicola At-22 Proteobacteria-Gamma CP001891 Kocuria rhizophila DC2201 Actinobacteria AP009152 Korebacter versatilis Ellin345 Acidobacteria CP000360 Kribbella flavida IFO 14399, DSM 17836 Actinobacteria CP001736 Lactobacillus brevis ATCC 367 Firmicutes CP000416, CP000417, CP000418 Leadbetterella byssophila 4M15, DSM 17132 Bacteroidetes CP002305 Legionella longbeachae NSW150 Proteobacteria-Gamma FN650140, FN650141 Legionella pneumophila 2300/99 Alcoy Proteobacteria-Gamma CP001828 Legionella pneumophila Paris Proteobacteria-Gamma CR628336, CR628338 Leifsonia xyli xyli CTCB07 Actinobacteria AE016822 Leptospira biflexa Spirochaetes CP000777,CP000778,CP000779 75 Leptospira borgpetersenii JB197 Spirochaetes CP000350, CP000351 Leptospira borgpetersenii L550 Spirochaetes CP000348, CP000349 Leptospira interrogans 56601 Spirochaetes AE010301, AE010300 Leptospira interrogans Fiocruz L1-130 Spirochaetes AE016823, AE016824 Leptothrix cholodnii SP-6 Proteobacteria-Beta CP001013 Leuconostoc citreum KM20 Firmicutes DQ489736, DQ489737, DQ489738, DQ489739, DQ489740 Leuconostoc gasicomitatum LMG 18811 Firmicutes FN822744 Leuconostoc kimchii IMSNU11154 Firmicutes CP001758, CP001753, CP001754, CP001755, CP001756, CP001757 Leuconostoc mesenteroides mesenteroides ATCC 8293 Firmicutes CP000414, CP000415 Leuconostoc sp. 
Firmicutes CP002898 Listeria monocytogenes 4b F2365 Firmicutes AE017262 Listeria monocytogenes HCC23 Firmicutes CP001175 Listeria monocytogenes M7 Firmicutes CP002816 Listeria seeligeri SLCC3954 Firmicutes FN557490 Listeria welshimeri SLCC5334 Firmicutes AM263198 Lysinibacillus sphaericus C3-41 Firmicutes CP000817, CP000818 76 Mesoplasma florum L1 Firmicutes AE017263 Mesorhizobium ciceri bv biserrulae WSM1271 Proteobacteria-Alpha CP002447, CP002448 Mesorhizobium loti MAFF303099 Proteobacteria-Alpha BA000012, BA000013, AP003017 Mesorhizobium opportunistum WSM2075 Proteobacteria-Alpha CP002279 Methylacidiphilum infernorum V4 Verrucomicrobia CP000975 Methylibium petroleiphilum PM1 Proteobacteria-Beta CP000555, CP000556 Methylobacterium chloromethanicum CM4 Proteobacteria-Alpha CP001298, CP001299, CP001300 Methylobacterium extorquens AM1 Proteobacteria-Alpha CP001510,CP001511,CP001512,CP001513,CP001514 Methylobacterium extorquens DM4 Proteobacteria-Alpha FP103042, FP103043, FP103044 Methylobacterium nodulans ORS 2060 Proteobacteria-Alpha CP001349, CP001350, CP001351, CP001352, CP001353, CP001354, CP001355, CP001356 Methylobacterium populi BJ001 Proteobacteria-Alpha CP001029, CP001030, CP001031 Methylobacterium radiotolerans JCM 2831 Proteobacteria-Alpha CP001001, CP001002, CP001003, CP001004, CP001005, CP001006, CP001007, CP001008, CP001009 Methylocella silvestris BL2 Proteobacteria-Alpha CP001280 Methylococcus capsulatus Bath Proteobacteria-Gamma AE017282 Methylotenera mobilis JLW8 Proteobacteria-Beta CP001672 77 Methylovorus glucosetrophus SIP3-4 Proteobacteria-Beta CP001674, CP001675, CP001676 Methylovorus sp. MP688 Proteobacteria-Beta CP002252 Microbacterium testaceum StLB037 Actinobacteria AP012052 Microlunatus phosphovorus NM-1 Actinobacteria AP012204 Micromonospora aurantiaca ATCC 27029 Actinobacteria CP002162 Micromonospora sp. 
L5 Actinobacteria CP002399 Moorella thermoacetica ATCC 39073 Firmicutes CP000232 Mycobacterium abscessus CIP 104536 Actinobacteria CU458896, CU458745 Mycobacterium avium 104 Actinobacteria CP000479 Mycobacterium avium paratuberculosis K-10 Actinobacteria AE016958 Mycobacterium bovis BCG Moreau RDJ Actinobacteria AM412059 Mycobacterium bovis BCG Tokyo 172 Actinobacteria AP010918 Mycobacterium gilvum PYR-GCK Actinobacteria CP000656, CP000657, CP000658, CP000659 Mycobacterium leprae Br4923 Actinobacteria FM211192 Mycobacterium smegmatis MC2 155 Actinobacteria CP000480 Mycobacterium tuberculosis F11 (ExPEC) Actinobacteria CP000717.1 Mycobacterium tuberculosis KZN 1435 (MDR) Actinobacteria CP001658 78 Mycobacterium vanbaalenii PYR-1 Actinobacteria CP000511 Myxococcus xanthus DK 1622 Proteobacteria-Delta CP000113 Nakamurella multipartita Y-104, DSM 44233 Actinobacteria CP001737 Nitrobacter hamburgensis X14 Proteobacteria-Alpha CP000319, CP000320, CP000321, CP000322 Nitrobacter winogradskyi Nb-255 Proteobacteria-Alpha CP000115 Nitrosomonas europaea ATCC 19718 Proteobacteria-Beta AL954747 Nitrosomonas eutropha C91 Proteobacteria-Beta CP000450, CP000451, CP000452 Nitrosomonas sp. AL212 Proteobacteria-Beta CP002552, CP002553,CP002554 Nitrosomonas sp. Is79A3 Proteobacteria-Beta CP002876 Nitrosospira multiformis ATCC 25196 Proteobacteria-Beta CP000103, CP000104, CP000105, CP000106 Nocardia farcinica IFM 10152 Actinobacteria AP006618, AP006619, AP006620 Nocardioides sp. JS614 Actinobacteria CP000509, CP000508 Nocardiopsis dassonvillei dassonvillei DSM 43111 Actinobacteria CP002040, CP002041 Nostoc azollae 0708 Cyanobacteria CP002059, CP002060, CP002061 Nostoc punctiforme ATCC 29133 Cyanobacteria CP001037, CP001038, CP001039, CP001040, CP001041, CP001042 Nostoc sp. PCC 7120 Cyanobacteria BA000019, BA000020, AP003602, AP003603, AP003604, AP003605, AP003606, Novosphingobium aromaticivorans DSM 12444 Proteobacteria-Alpha CP000248, CP000676, CP000677 79 Novosphingobium sp. 
PP1Y Proteobacteria-Alpha FR856862, FR856859, FR856860, FR856861 Oceanobacillus iheyensis HTE831 Firmicutes BA000028 Ochrobactrum anthropi ATCC 49188 Proteobacteria-Alpha CP000758, CP000759, CP000760, CP000761, CP000762, CP000763 Oligotropha carboxidovorans OM4 Proteobacteria-Alpha CP002821, CP002822, CP002823 Oligotropha carboxidovorans OM5 Proteobacteria-Alpha CP002826, CP002827, CP002828 Opitutus terrae PB90-1 Verrucomicrobia CP001032 Paenibacillus mucilaginosus KNP414 Firmicutes CP002869 Paenibacillus polymyxa E681 Firmicutes CP000154 Paenibacillus polymyxa SC2 Firmicutes CP002213, CP002214 Paludibacter propionicigenes WB4, DSM 17365 Bacteroidetes CP002345 Pantoea ananatis AJ13355 Proteobacteria-Gamma AP012032, AP012033 Pantoea ananatis LMG 20103 Proteobacteria-Gamma CP001875 Pantoea sp. At-9b. Proteobacteria-Gamma CP002433, CP002434, CP002435, CP002436, CP002437, CP002438 Pantoea vagans C9-1 Proteobacteria-Gamma CP002206.1,CP001893, CP001894, CP001895 Paracoccus denitrificans PD1222 Proteobacteria-Alpha CP000489, CP000490, CP000491 Pectobacterium atrosepticum SCRI1043 Proteobacteria-Gamma BX950851 Pectobacterium carotovorum PC1 Proteobacteria-Gamma CP001657 Pectobacterium wasabiae WPP163 Proteobacteria-Gamma CP001790 Pediococcus pentosaceus Firmicutes CP000422 80 Pedobacter heparinus HIM 762-3, DSM 2366 Bacteroidetes CP001681 Pedobacter saltans Stey 113, DSM 12145 Bacteroidetes CP002545 Pelobacter carbinolicus DSM 2380 Proteobacteria-Delta CP000142 Pelobacter propionicus DSM 2379 Proteobacteria-Delta CP000482, CP000483, CP000484 Pirellula staleyi DSM 6068 Planctomycetes CP001848 Planctomyces brasiliensis IFAM 1448, DSM 5305 Planctomycetes CP002546 Planctomyces limnophilus Mu 290, DSM 3776 Planctomycetes CP001744, CP001745 Polaromonas naphthalenivorans CJ2 Proteobacteria-Beta CP000529, CP000530, CP000531, CP000532, CP000533, CP000534, CP000535, CP000536, CP000537 Polaromonas sp. 
JS666 Proteobacteria-Beta CP000316, CP000317, CP000318 Polymorphum gilvum SL003B-26A1 Proteobacteria-Alpha CP002568, CP002569 Polynucleobacter necessarius asymbioticus QLW-P1DMWA-1 Proteobacteria-Beta CP000655 Polynucleobacter necessarius necessarius STIR1 Proteobacteria-Beta CP001010 Pseudomonas aeruginosa PA7 Proteobacteria-Gamma CP000744 Pseudomonas aeruginosa PAO1 Proteobacteria-Gamma AE004091 Pseudomonas brassicacearum brassicacearum NFM421 Proteobacteria-Gamma CP002585 81 Pseudomonas entomophila L48 Proteobacteria-Gamma CT573326 Pseudomonas fluorescens Pf0-1 Proteobacteria-Gamma CP000094 Pseudomonas fluorescens SBW25 Proteobacteria-Gamma AM181176, AM235768.1 Pseudomonas fulva 12-X Proteobacteria-Gamma CP002727 Pseudomonas mendocina NK-01 Proteobacteria-Gamma CP002620 Pseudomonas mendocina ymp Proteobacteria-Gamma CP000680 Pseudomonas putida BIRD-1 Proteobacteria-Gamma CP002290 Pseudomonas putida F1 Proteobacteria-Gamma CP000712 Pseudomonas putida KT2440 Proteobacteria-Gamma AE015451 Pseudomonas stutzeri ATCC 17588 Proteobacteria-Gamma CP002881 Pseudomonas stutzeri CMT.A.9, DSM 4166 Proteobacteria-Gamma CP002622 Pseudomonas syringae 1448A Proteobacteria-Gamma CP000058, CP000059, CP000060 Pseudomonas syringae B728a Proteobacteria-Gamma CP000075 Pseudomonas syringae tomato DC3000 Proteobacteria-Gamma AE016853, AE016854, AE016855 Pseudoxanthomonas suwonensis 11-1 Proteobacteria-Gamma CP002446 Psychrobacter arcticus 273-4 Proteobacteria-Gamma CP000082 Psychrobacter cryohalolentis K5 Proteobacteria-Gamma CP000323, CP000324 Psychrobacter sp. PRwf-1 Proteobacteria-Gamma CP000713, CP000714, CP000715 Rahnella sp. 
Y9602 Proteobacteria-Gamma CP002505, CP002506, CP002507 Ralstonia eutropha H16 Proteobacteria-Beta AM260479, AM260480, AY305378 82 Ralstonia pickettii 12D Proteobacteria-Beta CP001644, CP001645, CP001646, CP001647, CP001648 Ralstonia pickettii 12J Proteobacteria-Beta CP001068, CP001069, CP001070 Ralstonia solanacearum CFBP2957 Proteobacteria-Beta FP885897, FP885907 Ralstonia solanacearum Po82 Proteobacteria-Beta CP002819, CP002820 Rhizobium etli CFN 42, DSM 11541 Proteobacteria-Alpha CP000133, CP000134, CP000135, CP000136, CP000137, CP000138, U80928 Rhizobium etli CIAT 652 Proteobacteria-Alpha CP001074, CP001075, CP001076, CP001077 Rhizobium leguminosarum bv. trifolii WSM1325 Proteobacteria-Alpha CP001622, CP001623, CP001624, CP001625, CP001626 , CP001627 Rhizobium leguminosarum bv. viciae 3841 Proteobacteria-Alpha AM236080, AM236081, AM236082, AM236083, AM236084, AM236085, AM236086 Rhizobium rhizogenes K84 Proteobacteria-Alpha CP000628, CP000629, CP000630, CP000631, CP000632 Rhizobium sp. 
NGR234 (ANU265) Proteobacteria-Alpha CP001389, CP000874, U00090 Rhodobacter capsulatus SB1003 Proteobacteria-Alpha CP001312, CP001313 Rhodobacter sphaeroides 2.4.1 Proteobacteria-Alpha CP000143, CP000144, , CP000145, CP000146, CP000147, DQ232586, DQ232587 Rhodobacter sphaeroides ATCC 17025 Proteobacteria-Alpha CP000661, CP000662, CP000663, CP000664, CP000665, CP000666 Rhodocista centenaria SW Proteobacteria-Alpha CP000613 Rhodococcus equi 103S Actinobacteria FN563149 Rhodococcus erythropolis PR4 Actinobacteria AP008957, AP008931, AP008932, AP008933 Rhodococcus jostii RHA1 Actinobacteria CP000431, CP000432, CP000433, CP000434 83 Rhodococcus opacus B4 Actinobacteria AP011115, AP011116, AP011117, AP011118, AP011119, AP011120 Rhodopseudomonas palustris BisB5 Proteobacteria-Alpha CP000283 Rhodopseudomonas palustris CGA009 Proteobacteria-Alpha BX571963, BX571964 Rubrobacter xylanophilus DSM 9941 Actinobacteria CP000386 Runella slithyformis LSU4, DSM 19594 Bacteroidetes CP002859, CP002860, CP002861, CP002862, CP002863, CP002864 Saccharomonospora viridis P101, DSM 43017 Actinobacteria CP001683 Saccharophagus degradans 2-40 Proteobacteria-Gamma CP000282 Saccharopolyspora erythraea NRRL2338 Actinobacteria AM420293 Salmonella bongori NCTC 12419 Proteobacteria-Gamma FR877557 Salmonella enterica Agona SL483 Proteobacteria-Gamma CP001138, CP001137 Salmonella enterica Newport SL254 Proteobacteria-Gamma CP001113, CP000604.1, CP001112 Serratia proteamaculans 568 Proteobacteria-Gamma CP000826, CP000827 Serratia sp. AS9 Proteobacteria-Gamma CP002773 Shewanella amazonensis SB2B Proteobacteria-Gamma CP000507 Shewanella denitrificans OS217 Proteobacteria-Gamma CP000302 Shewanella halifaxensis HAW-EB4 Proteobacteria-Gamma CP000931 Shewanella oneidensis MR-1 Proteobacteria-Gamma AE014299, AE014300 Shewanella putrefaciens 200 Proteobacteria-Gamma CP002457 Shewanella putrefaciens CN-32 Proteobacteria-Gamma CP000681 84 Shewanella sediminis HAW-EB3 Proteobacteria-Gamma CP000821 Shewanella sp. 
ANA-3 Proteobacteria-Gamma CP000469, CP000470 Shewanella violacea DSS12 Proteobacteria-Gamma AP011177 Sideroxydans lithotrophicus ES-1 Proteobacteria-Gamma CP001965 Solibacter usitatus Ellin6076 Acidobacteria CP000473 Sorangium cellulosum So ce 56 Proteobacteria-Delta AM746676 Sphaerobacter thermophilus 4ac11, DSM 20745 Chloroflexi CP001823, CP001824 Sphingobacterium sp. 21 Bacteroidetes CP002584 Sphingobium chlorophenolicum L-1 Proteobacteria-Alpha CP002798, CP002799, CP002800, Sphingobium japonicum UT26S Proteobacteria-Alpha AP010803, AP010804, AP010805, AP010806, AP010807 Sphingomonas wittichii RW1 Proteobacteria-Alpha CP000699, CP000700, CP000701 Spirosoma linguale DSM 74 Bacteroidetes CP001769, CP001770, CP001771, CP001772, CP001773, CP001774, CP001775, CP001776, CP001777 Stackebrandtia nassauensis LLR-40K-21, DSM 44728 Actinobacteria CP001778 Staphylococcus aureus aureus Newman Firmicutes AP009351 Staphylococcus aureus RF122 Firmicutes AJ938182 Staphylococcus carnosus carnosus TM300 Firmicutes AM295250 Staphylococcus epidermidis ATCC 12228 Firmicutes AE015929, AE015930, AE015931, AE015932, AE015933, AE015934, AE015935 Staphylococcus epidermidis RP62A Firmicutes CP000029, CP000028 Staphylococcus haemolyticus JCSC1435 Firmicutes AP006716, AP006717, AP006718, AP006719 Staphylococcus lugdunensis HKU09-01 Firmicutes CP001837 Staphylococcus lugdunensis N920143 Firmicutes FR870271 Staphylococcus pseudintermedius ED99 Firmicutes CP002478 Staphylococcus pseudintermedius HKU10-03 Firmicutes CP002439 Staphylococcus saprophyticus saprophyticus ATCC 15305 Firmicutes AP008934, AP008935, AP008936 Starkeya novella DSM 506 Proteobacteria-Alpha CP002026 Stenotrophomonas maltophilia K279a Proteobacteria-Gamma AM743169 Stenotrophomonas maltophilia R551-3 Proteobacteria-Gamma CP001111.1 Stigmatella aurantiaca DW4/3-1 Proteobacteria-Delta CP002271 Streptomyces avermitilis MA-4680 Actinobacteria BA000030, AP005645.1 Streptomyces bingchenggensis BCW-1 Actinobacteria
CP002047 Streptomyces flavogriseus IAF 45 CD, ATCC 33331 Actinobacteria CP002475, CP002476, CP002477 Streptomyces griseus griseus NBRC 13350 Actinobacteria AP009493 Streptomyces scabiei 87.22 Actinobacteria FN554889.1 86 Streptomyces venezuelae Actinobacteria FR845719 Streptosporangium roseum NI 9100, DSM 43021 Actinobacteria CP001814, CP001815 Sulfurihydrogenibium azorense Az-Fu1 Aquificae CP001229 Sulfurihydrogenibium sp. YO3AOP1 Aquificae CP001080 Sulfurospirillum deleyianum 5175, DSM 6946 Proteobacteria-Epsilon CP001816 Symbiobacterium thermophilum IAM 14863 Firmicutes AP006840 Syntrophobacter fumaroxidans MPOB Proteobacteria-Delta CP000478 Syntrophomonas wolfei Goettingen, DSM 2245B Firmicutes CP000448 Syntrophothermus lipocalidus DSM 12680 Firmicutes CP002048 Syntrophus aciditrophicus SB Proteobacteria-Delta CP000252 Terriglobus saanensis SP1PR4 Acidobacteria CP002467 Thauera sp. MZ1T Proteobacteria-Beta CP001281, CP001282 Thermobaculum terrenum YNP1, ATCC BAA-798 Chloroflexi CP001825, CP001826 Thermobifida fusca YX Actinobacteria CP000088 Thermobispora bispora R51, DSM 43833 Actinobacteria CP001874 Thiobacillus denitrificans ATCC 25259 Proteobacteria-Beta CP000116 Variovorax paradoxus EPS Proteobacteria-Beta CP002417 87 Variovorax paradoxus S110 Proteobacteria-Beta CP001635, CP001636 Verminephrobacter eiseniae EF01-2 Proteobacteria-Beta CP000542,CP000543 Xanthobacter autotrophicus Py2 Proteobacteria-Alpha CP000781, CP000782 Xanthomonas albilineans GPE PC73 Proteobacteria-Gamma FP565176 Xanthomonas axonopodis 306 Proteobacteria-Gamma AE008923, AE008924, AE008925 Xanthomonas campestris ATCC 33913 Proteobacteria-Gamma AE008922 Xanthomonas campestris B100 Proteobacteria-Gamma AM920689 Xanthomonas oryzae MAFF 311018 Proteobacteria-Gamma AP008229 Xanthomonas oryzae pv. 
oryzae PXO99A Proteobacteria-Gamma CP000967 Xenorhabdus bovienii SS-2004 Proteobacteria-Gamma FN667741 Xenorhabdus nematophila ATCC19061 Proteobacteria-Gamma FN667742, FN667743 Xylanimonas cellulosilytica XIL07, DSM 15894 Actinobacteria CP001821, CP001822 Xylella fastidiosa CVC 9a5c Proteobacteria-Gamma AE003849, AE003850, AE003851 Xylella fastidiosa Temecula1 Proteobacteria-Gamma AE009442, AE009443 Yersinia pestis CO-92 Proteobacteria-Gamma AL590842, AL109969.1, AL117189.1, AL117211.1 Yersinia pestis KIM 10 Proteobacteria-Gamma AE009952, AF074611.1 Yersinia pseudotuberculosis IP 32953 Proteobacteria-Gamma BX936398, BX936399, BX936400 Yersinia pseudotuberculosis YPIII Proteobacteria-Gamma CP000950 Zymomonas mobilis mobilis NCIB 11163 Proteobacteria-Alpha CP001722, CP001723, CP001724, CP001725 Zymomonas mobilis mobilis T.H.Delft 1, ATCC 10988 Proteobacteria-Alpha CP002850, CP002851, CP002852, CP002853, CP002854, CP002855, CP002856 Zymomonas mobilis pomaceae Barker 1, ATCC 29192 Proteobacteria-Alpha CP002865, CP002866, CP002867

Table 2.5: Most abundant functional annotations of the unassembled metatranscriptome against the SEED reference database. Annotations in this table appear as they do in the SEED reference database.
Rank | SEED function | Abundance
1 | hypothetical protein | 21,668
2 | Heat shock protein 60 family chaperone GroEL | 16,685
3 | DNA-directed RNA polymerase beta subunit (EC 2.7.7.6) | 11,413
4 | Translation elongation factor Tu | 10,495
5 | DNA-directed RNA polymerase beta' subunit (EC 2.7.7.6) | 8,849
6 | Translation elongation factor G | 7,646
7 | Chaperone protein DnaK | 7,644
8 | SSU ribosomal protein S1p | 6,270
9 | Aldehyde dehydrogenase (EC 1.2.1.3) | 6,034
10 | RNA polymerase sigma factor RpoD | 3,845
11 | hyphothetical protein | 3,806
12 | Iron-sulfur cluster assembly protein SufB | 3,410
13 | Glutamine synthetase type I (EC 6.3.1.2) | 3,340
14 | Cell division protein FtsH (EC 3.4.24.-) | 3,196
15 | DNA-directed RNA polymerase alpha subunit (EC 2.7.7.6) | 2,962

Table 2.6: Most abundant functional annotations of the unassembled metatranscriptome against the GenBank reference database. Annotations in this table appear as they do in the GenBank reference database.

Rank | GenBank function | Abundance
1 | conserved hypothetical protein | 59012
2 | chaperonin GroEL | 14315
3 | DNA-directed RNA polymerase, beta subunit | 9891
4 | DNA-directed RNA polymerase, beta' subunit | 6669
5 | translation elongation factor Tu | 6144
6 | chaperone protein DnaK | 6023
7 | predicted protein | 5755
8 | translation elongation factor G | 5060
9 | DNA-directed RNA polymerase subunit beta | 3644
10 | ATPase AAA-2 domain protein | 3474
11 | LOW QUALITY PROTEIN: conserved hypothetical protein | 3107
12 | adenosylhomocysteinase | 3059
13 | ABC transporter related | 2970
14 | translation elongation factor 2 (EF-2/EF-G) | 2957
15 | SSU ribosomal protein S1P | 2450

Table 2.7: Most abundant functional annotations of the unassembled metatranscriptome against the RefSeq reference database. Annotations in this table appear as they do in the RefSeq reference database.
Rank | RefSeq function | Abundance
1 | 18S ribosomal RNA | 477043
2 | hypothetical protein | 89899
3 | conserved hypothetical protein | 45183
4 | chaperonin GroEL | 19437
5 | DNA-directed RNA polymerase subunit beta | 12153
6 | elongation factor Tu | 9373
7 | DNA-directed RNA polymerase subunit beta' | 6977
8 | 28S ribosomal RNA | 6880
9 | elongation factor G | 6160
10 | 30S ribosomal protein S1 | 5776
11 | aldehyde dehydrogenase | 5632
12 | molecular chaperone DnaK | 4757
13 | chaperone protein DnaK | 3882
14 | DNA-directed RNA polymerase, beta subunit | 3405
15 | translation elongation factor Tu | 3175

Table 2.8: Most abundant functional annotations of the unassembled metatranscriptome against the KEGG reference database. Annotations in this table appear as they do in the KEGG reference database.

Rank | KEGG function | Abundance
1 | hypothetical protein | 69437
2 | chaperonin GroEL | 18763
3 | DNA-directed RNA polymerase subunit beta (EC:2.7.7.6) | 8857
4 | elongation factor G | 5880
5 | elongation factor Tu (EC:3.6.5.3) | 5725
6 | 30S ribosomal protein S1 | 5510
7 | DNA-directed RNA polymerase subunit beta' (EC:2.7.7.6) | 4748
8 | molecular chaperone DnaK | 4330
9 | aldehyde dehydrogenase | 3403
10 | S-adenosyl-L-homocysteine hydrolase (EC:3.3.1.1) | 2710
11 | ABC transporter related | 2628
12 | DNA-directed RNA polymerase subunit beta | 2622
13 | elongation factor Tu | 2617
14 | chaperone protein DnaK | 2351
15 | ATPase | 2348

Table 2.9: 50 RefSoil genomes with the greatest number of metatranscriptome reads mapping

GenBank Accession No. | Avg Median Coverage (bp) | # of annotated regions similar to transcriptome | Description
CP000252 | 7459 | 1 | Syntrophus aciditrophicus SB
AE017282 | 3454 | 4 | Methylococcus capsulatus str. Bath
FR856862 | 2407 | 12 | Novosphingobium sp. PP1Y
AP010904 | 1126 | 3 | Desulfovibrio magneticus RS-1
AE015929 | 808 | 2 | Staphylococcus epidermidis ATCC 12228
AP012204 | 747 | 20 | Microlunatus phosphovorus NM-1
CP002472 | 458 | 16 | Bacillus coagulans 2-6
CP002629 | 243 | 1 | Desulfobacca acetoxidans DSM 11109
AE015927 | 128 | 4 | Clostridium tetani E88
CP000382 | 93 | 10 | Clostridium novyi NT
CP000352 | 70 | 1 | Cupriavidus metallidurans CH34
BA000035 | 59 | 2 | Corynebacterium efficiens YS-314
CP000783 | 40 | 3 | Cronobacter sakazakii ATCC BAA-894
AE016958 | 30 | 2 | Mycobacterium avium subsp. paratuberculosis K-10
CP000822 | 17 | 5 | Citrobacter koseri ATCC BAA-895
CP002213 | 11 | 1 | Paenibacillus polymyxa SC2
AE017263 | 11 | 1 | Mesoplasma florum L1
CP000061 | 10 | 2 | Candidatus Phytoplasma aster yellows witches'-broom AY-WB
CP000509 | 9 | 76 | Nocardioides sp. JS614
AP006628 | 8 | 2 | Candidatus Phytoplasma onion yellows OY-M
FP565176 | 7 | 1 | Xanthomonas albilineans GPE PC73
AE016877 | 6 | 3 | Bacillus cereus ATCC 14579
CP000903 | 6 | 2 | Bacillus weihenstephanensis KBAB4
CP001983 | 5 | 29 | Bacillus megaterium QM B1551
CP001854 | 5 | 8 | Conexibacter woesei DSM 14684
CP000512 | 5 | 1 | Acidovorax citrulli AAC00-1
CP002343 | 5 | 32 | Intrasporangium calvum DSM 43043
CP002000 | 5 | 10 | Amycolatopsis mediterranei U32
BA000012 | 5 | 6 | Mesorhizobium loti MAFF303099
FP929003 | 5 | 4 | Candidatus Nitrospira defluvii
CP000813 | 4 | 3 | Bacillus pumilus SAFR-032
AP012052 | 4 | 40 | Microbacterium testaceum StLB037
CP002810 | 4 | 2 | Isoptericola variabilis 225
CP002821 | 4 | 1 | Oligotropha carboxidovorans OM4
CP001736 | 4 | 43 | Kribbella flavida DSM 17836
CP001630 | 4 | 16 | Actinosynnema mirum DSM 43827
CP001700 | 4 | 2 | Catenulispora acidiphila DSM 44928
CP001341 | 4 | 2 | Arthrobacter chlorophenolicus A6
AE017194 | 4 | 1 | Bacillus cereus ATCC 10987
CP000656 | 3 | 16 | Mycobacterium gilvum PYR-GCK
FN554889 | 3 | 47 | Streptomyces scabiei 87.22
AP009493 | 3 | 5 | Streptomyces griseus subsp. griseus NBRC 13350
AM746676 | 3 | 9 | Sorangium cellulosum So ce56
CP001821 | 3 | 3 | Xylanimonas cellulosilytica DSM 15894
BA000040 | 3 | 48 | Bradyrhizobium japonicum USDA 110
CP001867 | 3 | 40 | Geodermatophilus obscurus DSM 43160
CP002665 | 3 | 10 | [Cellvibrio] gilvus ATCC 13127
CP002279 | 3 | 8 | Mesorhizobium opportunistum WSM2075
AE016822 | 3 | 4 | Leifsonia xyli subsp. xyli str. CTCB07
CP002162 | 3 | 22 | Micromonospora aurantiaca ATCC 27029

Table 2.10: RefSoil genomes with metatranscriptome reads mapping to the most unique genes

GenBank Accession No. | # of annotated regions similar to transcriptome | Avg Median Coverage (bp) | Description
CP000509 | 76 | 9 | Nocardioides sp. JS614
BA000040 | 48 | 3 | Bradyrhizobium japonicum USDA 110
FN554889 | 47 | 3 | Streptomyces scabiei 87.22
CP001736 | 43 | 4 | Kribbella flavida DSM 17836
CP000454 | 42 | 3 | Arthrobacter sp. FB24
BA000030 | 41 | 3 | Streptomyces avermitilis MA-4680
CP001867 | 40 | 3 | Geodermatophilus obscurus DSM 43160
AP012052 | 40 | 4 | Microbacterium testaceum StLB037
CP002343 | 32 | 5 | Intrasporangium calvum DSM 43043
CP001635 | 32 | 3 | Variovorax paradoxus S110
CP001983 | 29 | 5 | Bacillus megaterium QM B1551
CP000511 | 25 | 3 | Mycobacterium vanbaalenii PYR-1
CP002162 | 22 | 3 | Micromonospora aurantiaca ATCC 27029
CP002666 | 22 | 2 | Cellulomonas fimi ATCC 484
CP000480 | 22 | 3 | Mycobacterium smegmatis str. MC2 155
CP002399 | 21 | 3 | Micromonospora sp. L5
AP012204 | 20 | 747 | Microlunatus phosphovorus NM-1
CP000555 | 18 | 2 | Methylibium petroleiphilum PM1
CP000474 | 18 | 3 | Arthrobacter aurescens TC1
CP001630 | 16 | 4 | Actinosynnema mirum DSM 43827
CP000656 | 16 | 3 | Mycobacterium gilvum PYR-GCK
CP002472 | 16 | 458 | Bacillus coagulans 2-6
CP001737 | 15 | 3 | Nakamurella multipartita DSM 44233
FR845719 | 14 | 3 | Streptomyces venezuelae ATCC 10712
FR856862 | 12 | 2407 | Novosphingobium sp. PP1Y
CP002417 | 12 | 3 | Variovorax paradoxus EPS
CP001013 | 11 | 3 | Leptothrix cholodnii SP-6
CP002000 | 10 | 5 | Amycolatopsis mediterranei U32
CP002665 | 10 | 3 | [Cellvibrio] gilvus ATCC 13127
AP010968 | 10 | 2 | Kitasatospora setae KM-6054
CP000382 | 10 | 93 | Clostridium novyi NT
AM746676 | 9 | 3 | Sorangium cellulosum So ce56
CU234118 | 9 | 3 | Bradyrhizobium sp. ORS 278
CP002047 | 9 | 3 | Streptomyces bingchenggensis BCW-1
CP001854 | 8 | 5 | Conexibacter woesei DSM 14684
CP000699 | 8 | 2 | Sphingomonas wittichii RW1
CP000115 | 8 | 3 | Nitrobacter winogradskyi Nb-255
CP002279 | 8 | 3 | Mesorhizobium opportunistum WSM2075
CP002379 | 7 | 3 | Arthrobacter phenanthrenivorans Sphe3
CP000494 | 6 | 2 | Bradyrhizobium sp. BTAi1
CP001814 | 6 | 2 | Streptosporangium roseum DSM 43021
CP000319 | 6 | 3 | Nitrobacter hamburgensis X14
CP002475 | 6 | 3 | Streptomyces flavogriseus ATCC 33331
CP001096 | 6 | 2 | Rhodopseudomonas palustris TIE-1
BA000012 | 6 | 5 | Mesorhizobium loti MAFF303099
FN563149 | 5 | 3 | Rhodococcus equi 103S
CP000283 | 5 | 2 | Rhodopseudomonas palustris BisB5
CP000316 | 5 | 3 | Polaromonas sp. JS666
AP009493 | 5 | 3 | Streptomyces griseus subsp. griseus NBRC 13350
CP002447 | 5 | 3 | Mesorhizobium ciceri biovar biserrulae WSM1271

Table 2.11: Most abundant functional annotations of the assembled metatranscriptome against the SEED reference database. Annotations in this table appear as they do in the SEED reference database.
Rank  SEED Functions  Abundance
1  hypothetical protein  256 556
2  hyphothetical protein  45 493
3  Retron-type reverse transcriptase  24 961
4  Cell wall-associated hydrolase  3 636
5  FOG: WD40 repeat  1 213
6  Heat shock protein 60 family chaperone GroEL  1 207
7  Hypothetical ORF  1 042
8  predicted protein  1 004
9  DNA-directed RNA polymerase beta subunit (EC 2.7.7.6)  948
10  Translation elongation factor Tu  728
11  SSU ribosomal protein S1p  689
12  DNA-directed RNA polymerase beta' subunit (EC 2.7.7.6)  598
13  Aldehyde dehydrogenase (EC 1.2.1.3)  587
14  Translation elongation factor G  499
15  lipoprotein, putative  438

Table 2.12: Most abundant functional annotations of the assembled metatranscriptome against the GenBank reference database. Annotations in this table appear as they do in the GenBank reference database.

Rank  GenBank Functions  Abundance
1  conserved hypothetical protein  277 402
2  hypothetical protein BACCAP_03833  67 029
3  hypothetical protein BACCAP_04473  67 029
4  predicted protein  49 548
5  hypothetical protein HMPREF9529_01276  26 513
6  hypothetical protein BACUNI_00158  20 088
7  hypothetical protein BACUNI_02471  20 088
8  unknown  15 933
9  hypothetical protein RAZWK3B_00595  10 419
10  hypothetical protein RAZWK3B_11306  10 419
11  cell wall-associated hydrolase  7 180
12  LOW QUALITY PROTEIN: hypothetical protein SSBG_02741  5 110
13  LOW QUALITY PROTEIN: hypothetical protein SSBG_06429  5 110
14  conserved domain protein  4 878
15  hypothetical protein ANACOL_02136  4 130

Table 2.13: Most abundant functional annotations of the assembled metatranscriptome against the RefSeq reference database. Annotations in this table appear as they do in the RefSeq reference database.
Rank  RefSeq Functions  Abundance
1  hypothetical protein  773 813
2  conserved hypothetical protein  190 198
3  3,4-dihydroxy-2-butanone-4-phosphate synthase  5 462
4  Senescence-associated protein  4 171
5  ORF58e  3 813
6  chaperonin GroEL  1 070
7  GLP_748_1200_211  984
8  Putative protein of unknown function; overexpression confers resistance to the antimicrobial peptide MiAMP1  927
9  DNA-directed RNA polymerase subunit beta  665
10  putative cytoplasmic protein  506
11  multi-sensor hybrid histidine kinase  501
12  elongation factor Tu  475
13  hypothetical  440
14  30S ribosomal protein S1  433
15  methane monooxygenase  402

Table 2.14: Most abundant functional annotations of the assembled metatranscriptome against the KEGG reference database. Annotations in this table appear as they do in the KEGG reference database.

Rank  KEGG Functions  Abundance
1  hypothetical protein  509 073
2  Senescence-associated protein  4 171
3  hypothetical protein LOC100337426  1 523
4  chaperonin GroEL  1 107
5  Putative protein of unknown function; overexpression confers resistance to the antimicrobial peptide MiAMP1  927
6  putative cytoplasmic protein  506
7  DNA-directed RNA polymerase subunit beta (EC:2.7.7.6)  500
8  multi-sensor hybrid histidine kinase  493
9  hypothetical LOC783710  468
10  hypothetical protein LOC100335677  468
11  hypothetical protein LOC100336571  468
12  hypothetical protein LOC100336585  468
13  hypothetical protein LOC100337004  468
14  hypothetical protein LOC100337158  468
15  30S ribosomal protein S1  405

Table 2.15: RFam abundance (based on base pair coverage) of the assembled metatranscriptome.
RFam Annotation  RFam ID number  Abundance
5_8S_rRNA  RF00002  46,190.5
tmRNA  RF00023  44,298.0
PK-G12rRNA  RF01118  36,375.5
RNaseP_bact_a  RF00010  18,232.0
Bacteria_small_SRP  RF00169  11,915.0
Bacteria_large_SRP  RF01854  11,728.5
5S_rRNA  RF00001  6,622.0
c-di-GMP-I  RF01051  2,458.5
Metazoa_SRP  RF00017  1,716.5
Fungi_SRP  RF01502  1,486.5
Archaea_SRP  RF01857  1,456.0
beta_tmRNA  RF01850  1,446.0
tRNA  RF00005  824.5
Plant_SRP  RF01855  725.5
6S  RF00013  356.0
SSU_rRNA_bacteria  RF00177  225.0
RNaseP_bact_b  RF00011  200.0
RNaseP_arch  RF00373  148.0
SSU_rRNA_archaea  RF01959  129.0
alpha_tmRNA  RF01849  63.0
ydaO-yuaA  RF00379  56.0
RNaseP_nuc  RF00009  49.0
group-II-D1D4-2  RF01999  33.0
c-di-GMP-II  RF01786  31.5
Intron_gpII  RF00029  28.0
group-II-D1D4-6  RF02005  21.0
GOLLD  RF02032  19.0
speF  RF00518  16.0
Alpha_RBS  RF00140  16.0
Afu_309  RF01512  15.0
U3  RF00012  13.0
U2  RF00004  12.0
cspA  RF01766  11.0
ROSE  RF00435  9.0
Cobalamin  RF00174  9.0
Intron_gpI  RF00028  9.0
Glycine  RF00504  7.0
group-II-D1D4-7  RF02012  7.0
group-II-D1D4-4  RF02003  6.0
RNase_MRP  RF00030  6.0
HEARO  RF02033  6.0
group-II-D1D4-3  RF02001  5.0
FMN  RF00050  4.0
Rhizobiales-2  RF01723  4.0
Hammerhead_3  RF00008  4.0
msiK  RF01747  3.0
SAH_riboswitch  RF01057  3.0
CrcZ  RF01675  3.0
T-box  RF00230  3.0
suhB  RF00519  2.0
Acido-Lenti-1  RF01687  1.0

Table 2.16: Top 50 CAZy annotations by number of contigs. Most abundant CAZy annotations by the number of contigs without regard to the abundance or number of reads mapping to the contigs.
CAZy enzyme class  # of Contigs Matching  Abundance
GT2  72  3.5
GH36  41  3.2
CBM13  36  3.5
GH18  34  4.9
CBM33  30  5.4
CE11  29  2.4
CE1  26  2.2
GH13  22  3.1
GT4  20  2.6
GT51  20  2.4
GH76  18  8.4
CE10  18  4.1
CBM2  15  7.4
CBM47  15  3.1
GH16  14  2.2
CBM50  13  3.8
GT41  13  2.5
GH23  12  2.8
GH35  11  6.2
GH15  9  3.0
CBM14  8  19.1
GT22  8  7.3
CBM5  8  2.6
GT35  8  2.3
CBM32  7  2.9
GH5  6  4.2
GT87  6  4.2
CBM48  6  2.5
GH3  6  2.3
GH6  5  2.8
CE9  5  2.6
GT26  5  2.6
CBM20  5  2.4
CBM12  5  2.0
CE4  5  2.0
GH28  5  2.0
GH92  4  3.3
GT20  4  2.3
GT55  4  2.3
GT1  4  2.0
GH9  3  3.3
GH73  3  2.7
GH48  3  2.3
CBM3  3  2.2
GH1  3  2.0
GH38  3  2.0
GH43  3  2.0
GH19  2  137.0
GH17  2  47.0
GH102  2  7.3

Table 2.17: Top 50 CAZy annotations by abundance. Abundance of CAZy annotation by the number of reads mapping to annotated contigs.

CAZy enzyme class  Abundance  # of Contigs Matching
GH19  137.0  2
GH17  47.0  2
CBM14  19.1  8
GH76  8.4  18
CBM2  7.4  15
GT22  7.3  8
GH102  7.3  2
GH35  6.2  11
CBM33  5.4  30
GH66  5.0  1
GH18  4.9  34
GH5  4.2  6
GT87  4.2  6
CE10  4.1  18
GH14  4.0  1
PL11  4.0  1
CBM50  3.8  13
GT2  3.5  72
CBM13  3.5  36
GH9  3.3  3
GH92  3.3  4
GH36  3.2  41
GH13  3.1  22
CBM47  3.1  15
GH15  3.0  9
CBM1  3.0  1
GT77  3.0  1
CBM32  2.9  7
GH6  2.8  5
GH23  2.8  12
GH73  2.7  3
CBM5  2.6  8
GT4  2.6  20
CE9  2.6  5
GT26  2.6  5
GT41  2.5  13
CBM48  2.5  6
CBM43  2.5  2
GH62  2.5  2
CE11  2.4  29
CBM20  2.4  5
GT51  2.4  20
GH3  2.3  6
GH48  2.3  3
GT35  2.3  8
GT20  2.3  4
GT55  2.3  4
GH16  2.2  14

REFERENCES

1. Lesniewski RA, Jain S, Anantharaman K, Schloss PD, Dick GJ (2012) The metatranscriptome of a deep-sea hydrothermal plume is dominated by water column methanotrophs and lithotrophs. ISME J 6: 2257-2268. Available: http://www.ncbi.nlm.nih.gov/pubmed/22695860. Accessed 26 October 2012.
2. Gifford SM, Sharma S, Booth M, Moran MA (2013) Expression patterns reveal niche diversification in a marine microbial assemblage. ISME J 7: 281-298. Available: http://www.ncbi.nlm.nih.gov/pubmed/22931830. Accessed 28 February 2013.
3. Gilbert J, Meyer F, Schriml L (2010) Metagenomes and metatranscriptomes from the L4 long-term coastal monitoring station in the Western English Channel. Stand 193. Available: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3035373/. Accessed 19 September 2012.
4. Hilton JA, Satinsky BM, Doherty M, Zielinski B, Zehr JP (2014) Metatranscriptomics of N2-fixing cyanobacteria in the Amazon River plume. ISME J 9: 1557-1569. Available: http://www.nature.com/doifinder/10.1038/ismej.2014.240.
5. Baldrian P, Ktotal microbial communities in forest soil are largely different and highly stratified during decomposition. ISME J 6: 248-258. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3260513&tool=pmcentrez&rendertype=abstract. Accessed 5 March 2012.
6. Geisen S, Tveit AT, Clark IM, Richter A, Svenning MM, et al. (2015) Metatranscriptomic census of active protists in soils. ISME J: 1-13. Available: http://www.nature.com/doifinder/10.1038/ismej.2015.30.
7. Takasaki K, Miura T, Kanno M, Tamaki H, Hanada S, et al. (2013) Discovery of glycoside hydrolase enzymes in an avicel-adapted forest soil fungal community by a metatranscriptomic approach. PLoS One 8: e55485. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3564753&tool=pmcentrez&rendertype=abstract. Accessed 16 September 2013.
8. Yergeau E, Schoondermark-Stolk SA, Brodie EL, Déjean S, DeSantis TZ, et al. (2009) Environmental microarray analyses of Antarctic soil microbial communities. ISME J 3: 340-351. Available: http://www.ncbi.nlm.nih.gov/pubmed/19020556. Accessed 12 March 2012.
9. Ofek-Lalzar M, Sela N, Goldman-Voronov M, Green SJ, Hadar Y, et al. (2014) Niche and host-associated functional signatures of the root surface microbiome. Nat Commun 5: 1-9. Available: http://dx.doi.org/10.1038/ncomms5950.
10. Yergeau E, Sanschagrin S, Maynard C, St-Arnaud M, Greer CW (2013) Microbial expression profiles in the rhizosphere of willows depend on soil contamination. ISME J: 1-15. Available: http://www.ncbi.nlm.nih.gov/pubmed/24067257. Accessed 6 November 2013.
11. Turner TR, Ramakrishnan K, Walshaw J, Heavens D, Alston M, et al. (2013) Comparative metatranscriptomics reveals kingdom level changes in the rhizosphere microbiome of plants. ISME J: 1-11. Available: http://www.nature.com/doifinder/10.1038/ismej.2013.119. Accessed 19 July 2013.
12. Foley JA, Defries R, Asner GP, Barford C, Bonan G, et al. (2005) Global consequences of land use. Science 309: 570-574. Available: http://www.ncbi.nlm.nih.gov/pubmed/16040698. Accessed 6 November 2013.
13. Gans J, Wolinsky M, Dunbar J (2005) Computational improvements reveal great bacterial diversity and high metal toxicity in soil. Science 309: 1387-1390. Available: http://www.ncbi.nlm.nih.gov/pubmed/16123304. Accessed 19 March 2012.
14. Rodriguez-R LM, Konstantinidis KT (2013) Nonpareil: a redundancy-based approach to assess the level of coverage in metagenomic datasets. Bioinformatics: 1-7. Available: http://www.ncbi.nlm.nih.gov/pubmed/24123672. Accessed 10 December 2013.
15. 6583. Available: http://www.pnas.org/content/95/12/6578.full.pdf&embedded=true. Accessed 23 July 2014.
16. Neidhardt F, Umbarger H (1996) Chemical composition of Escherichia coli.
17. Yi H, Cho Y-J, Won S, Lee J-E, Jin Yu H, et al. (2011) Duplex-specific nuclease efficiently removes rRNA for prokaryotic RNA-seq. Nucleic Acids Res 39: e140. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3203590&tool=pmcentrez&rendertype=abstract. Accessed 22 March 2012.
18. Heaton EA, Dohleman FG, Long SP (2008) Meeting US biofuel goals with less land: the potential of Miscanthus. Glob Chang Biol 14: 2000-2014. Available: http://doi.wiley.com/10.1111/j.1365-2486.2008.01662.x. Accessed 29 February 2012.
19. Zhou J, Bruns M, Tiedje J (1996) DNA recovery from soils of diverse composition. short. Accessed 11 February 2014.
20. Brady S (2007) Construction of soil environmental DNA cosmid libraries and screening for clones that produce biologically active small molecules. Nat Protoc 2: 1297-1305.
21. Schmieder R, Lim YW, Edwards R (2012) Identification and removal of ribosomal RNA sequences from metatranscriptomes. Bioinformatics 28: 433-435. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3268242&tool=pmcentrez&rendertype=abstract. Accessed 16 September 2013.
22. Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, et al. (2011) Rfam: Wikipedia, clans 5. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3013711&tool=pmcentrez&rendertype=abstract. Accessed 18 September 2013.
23. RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 9: 386. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2563014&tool=pmcentrez&rendertype=abstract. Accessed 3 March 2013.
24. Wilke A, Harrison T, Wilkening J, Field D, Glass EM, et al. (2012) The M5nr: a novel non-redundant database containing protein sequences and annotations from multiple sources and associated tools. BMC Bioinformatics 13: 141. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3410781&tool=pmcentrez&rendertype=abstract. Accessed 25 September 2013.
25. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10: R25. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2690996&tool=pmcentrez&rendertype=abstract. Accessed 4 October 2012.
26. Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26: 841-842. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2832824&tool=pmcentrez&rendertype=abstract. Accessed 20 September 2013.
27. Brown CT, Howe A, Zhang Q, Pyrkosz A, Brom T (2012) A Reference-Free Algorithm for Computational Normalization of Shotgu118. Available: http://ged.msu.edu/downloads/2012-diginorm.pdf. Accessed 11 February 2014.
28. Howe AC, Jansson J, Malfatti S (2012) Assembling large, complex environmental metagenomes. Available: http://adsabs.harvard.edu/abs/2012arXiv1212.2832C. Accessed 11 February 2014.
29. Pell J, Hintze A, Canino-Koning R (2012) Scaling metagenome sequence assembly 12. Available: http://buonmathuot.vn/ws/r/www.pnas.org/content/109/33/13272.full. Accessed 25 April 2013.
30. Zerbino DR, Birney E (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18: 821-829. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2336801&tool=pmcentrez&rendertype=abstract. Accessed 17 September 2013.
31. Treangen TJ, Sommer DD, Angly FE, Koren S, Pop M (2011) Next generation sequence assembly with AMOS. Curr Protoc Bioinformatics Chapter 11: Unit 11.8. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3072823&tool=pmcentrez&rendertype=abstract. Accessed 20 September 2013.
32. Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22: 1658-1659. Available: http://www.ncbi.nlm.nih.gov/pubmed/16731699. Accessed 28 February 2013.
33. Cantarel BL, Coutinho PM, Rancurel C, Bernard T, Lombard V, et al. (2009) The Carbohydrate-Active EnZymes database (CAZy): an expert resource for Glycogenomics. Nucleic Acids Res 37: D233-8. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2686590&tool=pmcentrez&rendertype=abstract. Accessed 19 September 2013.
34. Li R, Zhu H, Ruan J, Qian W, Fang X, et al. (2010) De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 20: 265-272. doi:10.1101/gr.097261.109.
35. Simpson JT, Durbin R (2010) Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26: i367-73. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2881401&tool=pmcentrez&rendertype=abstract. Accessed 16 September 2013.
36. Burge SW, Daub J, Eberhardt R, Tate J, Barquist L, et al. (2013) Rfam 11.0: 10 years of RNA families. Nucleic Acids Res 41: D226-32. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3531072&tool=pmcentrez&rendertype=abstract. Accessed 19 February 2014.
37. Leimena MM, Ramiro-Garcia J, Davids M, van den Bogert B, Smidt H, et al. (2013) A comprehensive metatranscriptome analysis pipeline and its validation using human small intestine microbiota datasets. BMC Genomics 14: 530. Available: http://www.biomedcentral.com/1471-2164/14/530. Accessed 6 August 2013.
38. Urich T, Lanzén A, Qi J, Huson DH, Schleper C, et al. (2008) Simultaneous assessment of soil microbial community structure and function through analysis of the meta-transcriptome. PLoS One 3: e2527. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2424134&tool=pmcentrez&rendertype=abstract. Accessed 6 August 2013.
39. Gifford SM, Sharma S, Rinta-Kanto JM, Moran MA (2011) Quantitative analysis of a deeply sequenced marine microbial metatranscriptome. ISME J 5: 461-472. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3105723&tool=pmcentrez&rendertype=abstract. Accessed 9 March 2012.
40. Singh, B.K., et al., Influence of grass species and soil type on rhizosphere microbial community structure in grassland soils. Applied Soil Ecology, 2007. 36(2-3): p. 147-155.
41. Ridl, J., et al., Plants Rather than Mineral Fertilization Shape Microbial Community Structure and Functional Potential in Legacy Contaminated Soil. Frontiers in Microbiology, 2016. 7.
42. Frias-Lopez J, Shi Y, Tyson GW, Coleman ML, Schuster SC, et al. (2008) Microbial community gene expression in ocean surface waters. Proc Natl Acad Sci U S A 105: 3805-3810. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2268829&tool=pmcentrez&rendertype=abstract.
43. Gilbert JA, Field D, Huang Y, Edwards R, Li W, et al. (2008) Detection of large numbers of novel sequences in the metatranscriptomes of complex marine microbial communities. PLoS One 3: e3042. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2518522&tool=pmcentrez&rendertype=abstract. Accessed 1 March 2012.
44. Paul EA, Voroney RP (1984) Field interpretation of microbial biomass activity measurements. Curr Perspect Microb Ecol p 509-514.
45. Clark FE, Paul EA (1970) The Microflora of Grassland. Adv Agron. doi:10.1016/S0065-2113(08)60273-4.
46. Shi Y, Tyson GW, DeLong EF (2009) Metatranscriptomics reveals unique microbial 269. Available: http://www.ncbi.nlm.nih.gov/pubmed/19444216. Accessed 9 March 2012.
47. Roller M, Lucic V, Nagy I, Perica T, Vlahovicek K (2013) Environmental shaping of codon usage and functional adaptation across microbial communities. Nucleic Acids Res: 1-11. Available: http://www.ncbi.nlm.nih.gov/pubmed/23921637. Accessed 20 September 2013.
48. Raes J, Bork P (2008) Molecular eco-systems biology: towards an understanding of community function. Nat Rev Microbiol 6: 693-699. Available: http://www.ncbi.nlm.nih.gov/pubmed/18587409.
49. Fuhrman JA (2009) Microbial community structure and its functional implications. Nature 459: 193-199. Available: http://www.ncbi.nlm.nih.gov/pubmed/19444205. Accessed 1 March 2012.
50. Williams RJ, Howe A, Hofmockel K (2014) Demonstrating Microbial Co-occurrence Pattern Analyses Within and Between Ecosystems. doi:10.3389/fmicb.2014.00358.
Chapter 3: Using a multi-omics approach to identify active microbial functions in the rhizosphere of switchgrass

Abstract

Rhizosphere microbial communities provide many ecologically important services, such as regulating the biogeochemical cycles of carbon, nitrogen, phosphorus and iron. They are also known to aid in plant growth and defense. Advances in high-throughput sequencing have allowed the use of metagenomics to survey microbial community functions without culture-based bias. To assay the functional activity of rhizosphere microbial communities we used a multi-omics approach combining metagenomics, metatranscriptomics and metaproteomics. In this article we establish a minimum functional core from the multi-omics data collected from the rhizosphere microbial community associated with the biofuel crop switchgrass (Panicum virgatum) grown in agricultural soil. The minimum functional core is defined by annotations found in both the metagenomic and metatranscriptomic data from the switchgrass rhizosphere, which represent ubiquitous and dominant functions in the microbial community. We compare the minimum functional core to rhizoplane metagenomes from switchgrass, Miscanthus (Miscanthus giganteus) and corn (Zea mays) to determine whether the minimum functional core is representative of the field. Functions related to cellular maintenance and housekeeping were highly abundant within the minimum functional core, while functions related to biogeochemical cycling and plant growth promotion were present in low abundance. Specifically, we found evidence that, at the time of sampling, the most abundant nitrogen cycling processes were related to ammonia assimilation. Phosphate metabolism was a highly active component of the phosphorus cycle.
Because of their low abundance and their great importance to the community, these biogeochemical functions likely represent keystone functions. Carbon cycling enzymes related to glycoside hydrolases and lignin breakdown were abundant, especially in the metaproteome. The multi-omics approach used in this article enabled the identification of active microbial processes from field-collected rhizosphere soil.

Introduction

Soil microbes in the rhizosphere play a key role in environmental processes such as the cycling of carbon, nitrogen, sulfur, phosphorus and iron [1-3]. Rhizosphere microbes also aid plant growth and development through protection from pathogens [4, 5], liberation of micronutrients [1] and secretion of plant growth-promoting compounds [6, 7]. Metagenomic surveys of soil microbial communities have been instrumental in identifying the presence of microbial functions of environmental importance [8, 9]. However, metagenomic surveys cannot distinguish active cells from inactive ones. In a study of microbial dormancy, 80% of microbial cells and 60% of operational taxonomic units (OTUs) in soil microbial communities were identified as dormant [10]. These data indicate that a significant portion of the information gathered from metagenomic studies comes from non-active (dead or dormant) cells. To characterize the contribution of the microbial community to biogeochemical cycles and the types of plant-microbe interactions taking place in soil, it is paramount to know which functions are actively being carried out.

In this study we used a multi-omics approach, treating metagenomic data as functional potential, metatranscriptomic data as a measure of transcriptional activity and metaproteomic data as a measure of translated proteins. We used this approach to provide an integrated view of the stages of microbial community activity in the rhizosphere of switchgrass (Panicum virgatum), a crop of major interest for biofuel production.
Because of the complexity of the multi-omic soil data, we identified a minimum functional core representing the dominant functions found in the metagenome and expressed in the metatranscriptome. This provides several benefits. First, it identifies the functional activity of the microbial community carried out in the majority of samples, which provides an overview of general activity. For example, many functions related to central metabolism are present in the minimum functional core, while there are few functions related to cell division. Second, it relieves the effect of undersampling the microbial community, which occurs because the genetic diversity in soil is extensive, i.e. in the terabase range [11, 12]. Current sequencing technologies and computational methods cannot accommodate this depth of sequencing, and as a result of undersampling the metagenomic data are incomplete [13]. Third, the minimum functional core will identify functions necessary for microbial community survival in agricultural soil. These core functions will best characterize the functional diversity present at the site and identify likely central functions in soil microbial community activities of the sampled area [14]. While this core approach will undoubtedly include many housekeeping genes, it should also capture information about the most abundant, ecologically important, non-housekeeping functions. This approach has been successfully applied to the human microbiome [15], where the authors were able to identify functional genes related to housekeeping functions as well as functional genes potentially specific to the gut.

The metaproteomic data were used to assess which of the annotations in the minimum functional core were translated as well as transcribed.
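The relationship between the three data types can be illustrated with a minimal Python sketch. It assumes each data set has already been reduced to a set of annotation names; the annotation names, variable names and function names below are hypothetical and are not taken from the analysis code used in this chapter.

```python
# Hypothetical annotation sets, one per data type (illustrative values only).
metagenome = {"GroEL", "glutamine synthetase", "ACC deaminase", "nitrogenase"}
metatranscriptome = {"GroEL", "glutamine synthetase", "ACC deaminase", "EF-Tu"}
metaproteome = {"GroEL", "EF-Tu"}


def minimum_functional_core(dna_annotations, rna_annotations):
    """Annotations seen in both the metagenome and the metatranscriptome."""
    return set(dna_annotations) & set(rna_annotations)


def translated_core(core, protein_annotations):
    """Core annotations that also have metaproteome support."""
    return set(core) & set(protein_annotations)


core = minimum_functional_core(metagenome, metatranscriptome)
print(sorted(core))
print(sorted(translated_core(core, metaproteome)))
```

In this toy example "nitrogenase" is excluded because it is present only as genetic potential, and "EF-Tu" is excluded because it was not recovered in the metagenome even though it was transcribed and translated.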
We also compare the minimum functional core to metagenomes from field-collected rhizoplane soils associated with two other major biofuel candidate crops, corn (Zea mays) and Miscanthus (Miscanthus giganteus), to determine whether the minimum functional core is broadly representative of the field site. While housekeeping functions are expected to be prominent in the multi-omics data set, these functions will provide insight into the state of cellular maintenance of the microbial community at the time of sampling. More central to this study is the examination of microbial community functions related to carbon and plant nutrient cycling and to plant-microbe interactions, since these processes are central to a sustainable and renewable fuel production ecosystem.

Methods

Soil collection

All soil samples were collected from the Great Lakes Bioenergy Research Center (GLBRC) Biofuel Cropping Systems Experiment (BCSE) at the Kellogg Biological Station on July 31, 2013 at midday, a time near maximum plant photosynthesis. The sky was cloudless and the soil moist from an 11 mm rain 2 days before. Rhizosphere samples used for metatranscriptomics, metagenomics and metaproteomics were collected from switchgrass plot G5R2 (http://lter.kbs.msu.edu/research/long-term-experiments/glbrc-intensive-experiment). Three switchgrass root systems were dug up and vigorously shaken to remove excess soil. Each root system was placed in a sterile bag and vigorously shaken again. The soil that fell into the bag was subdivided into Whirl-Pak bags and placed in liquid nitrogen for rapid freezing. This process was completed in less than 5 min to minimize transcript turnover.

Rhizoplane samples for metagenome sequencing were collected at the same time as the rhizosphere samples, but from corn, Miscanthus and switchgrass plants in plots G1R2 through R4, G6R2 through R4 and G5R2 through R4, respectively. Plants were dug up and the root systems were placed in gallon bags that were placed on ice for transport to the laboratory, where they were stored at -20°C.
At the R2 sites, three samples were collected from two adjacent replicate plants. At the R3 and R4 sites, two samples were collected from two adjacent replicate plants. Later, excess soil and macroaggregates were removed from the roots so that only a very thin film of soil and microaggregates surrounding the roots remained. Each sample, which contained approximately 5 g of root and soil material, was placed in phosphate buffer. Samples were gently shaken to remove soil attached to the roots and then spun down to pellet the suspended soil particles and microbes. Root material was carefully removed from the soil pellet, and the soil was saved for DNA extraction.

Sample preparation and sequencing

RNA was extracted from the three replicates of rhizosphere soil using the MoBio PowerSoil RNA extraction kit (MoBio, Carlsbad, CA). Samples were treated with DNase (Invitrogen, Carlsbad, CA) to remove any potential contaminating DNA. Sample quality was checked by NanoDrop and RNA was quantified using the Qubit RNA quantification kit (Invitrogen, Carlsbad, CA). The three RNA samples collected from switchgrass are termed SRT1, SRT2 and SRT3. DNA was extracted from the same aliquot of rhizosphere soil as was used for the metatranscriptome samples. Approximately 0.5 g of soil was used for DNA extraction with the MoBio PowerSoil DNA kit (MoBio, Carlsbad, CA) according to the manufacturer's instructions, and DNA was quantified using the Qubit DNA quantification kit (Invitrogen, Carlsbad, CA). The three rhizosphere DNA samples are termed SRG1, SRG2 and SRG3. DNA was also extracted from an additional 21 samples collected from the rhizoplanes of corn, Miscanthus and switchgrass (seven replicates for each plant, termed C1-7, M1-7 and S1-7, respectively).

Both DNA and RNA samples were sequenced by the Joint Genome Institute (JGI) in Walnut Creek, CA. Ribosomal RNA subtraction was performed by JGI on the metatranscriptome samples using the RiboZero kit [16]. All samples were sequenced using the HiSeq-1TB.
All samples were filtered for quality and trimmed to remove adapters using BBDuk (ktrim=r, k=25, mink=12, tpe=t, tbo=t, qtrim=10, maq=10, maxns=3, minlen=50). Reads were then filtered for artifacts using BBDuk (k=16). For metatranscriptomic samples, rRNA was removed by mapping to the Silva database with BBMap (fast=t, minid=0.90, local=t). Reads were then assembled using MEGAHIT (v 0.2.0) [16] (--cpu-only -m 100e9 --k-max 123 -l 155).

Metaproteome sample preparation and characterization¹

¹ This work was performed by Angela M Norbeck, Carrie Nicora, Sam Purvine and Ljiljana Pasa-Tolic.

Indirect extraction

Metaproteomic sample preparation and data analysis were performed at the Environmental Molecular Science Laboratory (EMSL) at the Pacific Northwest National Laboratory in Richland, WA. Rhizosphere soil was sieved through a 35 mm mesh and weighed into 20 g aliquots in 50 mL tubes with 20 mL of ice-cold phosphate buffered saline (PBS), pH 8. The samples were kept on ice and homogenized at full speed with a hand-held OMNI tool and disposable probes (OMNI, Kennesaw, GA) for 30 s, allowed to cool and homogenized again. The samples were then centrifuged at 2,500 x g for 5 min at 4°C to remove large soil particulates. The supernatants were transferred to a fresh 50 mL tube. Another 20 mL of buffer was added to the soil pellet and the samples were homogenized and centrifuged as described above. The supernatants were combined and centrifuged at 10,000 x g for 15 min at 4°C to pellet the intact microbial cells. The cell pellet was washed with 1 mL of ammonium bicarbonate buffer, pH 8.0 (ABC), and transferred into a 2 mL snap-cap centrifuge tube (Eppendorf, Hamburg, Germany). The sample was then centrifuged at 10,000 x g for 10 min at 4°C and the supernatant removed. Then 200 µl of ABC was added along with 0.1 mm zirconia beads, and the sample was bead beaten in a Bullet Blender (Next Advance, Averill Park, NY) at speed 8 for 3 min at 4°C.
After bead beating, the lysate was spun into a 15 mL Falcon tube at 2,000 x g for 10 min at 4°C. The sample was transferred to a clean tube and a methanol/chloroform extraction was performed to separate the protein, metabolites and lipids. Ice-cold (-20°C) chloroform:methanol mix (prepared 2:1 (v/v)) was added to the sample in a 5:1 ratio over sample volume and vigorously vortexed. The sample was then placed on ice for 5 min, vortexed for 10 s and centrifuged at 10,000 x g for 10 min at 4°C. The upper, water-soluble metabolite phase was collected into a glass vial, the lower, lipid-soluble phase was collected into another fresh glass vial, and both samples were dried completely in a SpeedVac and then stored at -80°C until analysis. The remaining protein interlayer was washed with 100% ice-cold methanol and, after pelleting, placed in a fume hood to dry. The protein pellet was solubilized by adding up to 100 µl of SDS-Tris buffer (4% SDS, 100 mM DTT in 100 mM Tris-HCl, pH 8.0), gently sonicated into solution and then added to the microbial pellet. The solution was incubated at 95°C for 5 min to reduce and denature the protein and allowed to cool at 4°C for 10 min.

Filter Aided Sample Preparation (FASP) [17] kits were used for protein digestion 00 µl of 8 M urea (all reagents included in the kit) was added to each 500 µl 30K molecular weight cut off (MWCO) FASP spin column, up to 100 µl of the sample in SDS buffer was added, and the column was centrifuged at 14,000 x g for 30 min to bring the sample to the dead volume. The waste was removed from the bottom of the tube, another 400 µl of 8 M urea was added to the column and the column was centrifuged again at 14,000 x g for 30 min; this step was repeated once more. Each column was then prepared twice with 400 µl of 50 mM ABC followed by centrifugation for 30 min. The column was placed into a fresh, clean, labeled collection tube. Digestion the sample.
Each sample was incubated for 3 h at 37°C with 800 rpm shaking on a thermomixer with a thermotop (Eppendorf, Hamburg, Germany) to reduce condensation into the cap. Additional ABC (40 µl) was added to the filter and the resultant peptides were centrifuged through the filter and into the collection tube at 14,000 x g for 15 min; this was repeated twice. The peptides were then concentrated to ~30 µl using a SpeedVac and stored in vials until analysis. Final peptide concentrations were determined using a bicinchoninic acid (BCA) assay (Thermo Scientific, Waltham, MA USA).

Metatranscriptome and metagenome peptide analysis

Three contig files were converted to amino acids via six-frame translation using Python, with 50 amino acids as the shortest sequence allowed. All sequences were converted to tab delimited text using Protein Digestion Simulator (PDS) (http://omics.pnl.gov/software/protein-digestion-simulator) and imported into Microsoft SQL Server 2008. Contig source names were appended to the contig names to eliminate name collisions downstream. Assuming at most one protein sequence per contig, the longest resultant sequence per contig was retained. For all protein collections, 16 common contaminant protein sequences were added, including porcine and bovine trypsin, human and bovine serum albumin, and commonly observed keratin sequences. For all metaproteome searches, the fasta files were split into 25 roughly equally sized files to accommodate the memory limitations of the MSGFPlus search program. Top scoring identifications for each searched MS/MS spectrum were retained for the final output.

Metatranscriptome and metagenome search and peptide identification

Spectra to peptide identification: Peptide mass spectra (MS/MS) were searched against the metatranscriptome using the MS-GF+ algorithm [18], accepting MSGF scores [19] of less than or equal to 1e-12. This yielded a false discovery rate (FDR) for the entire set of 0.81%.
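The six-frame translation step described under "Metatranscriptome and metagenome peptide analysis" can be illustrated with a minimal Python sketch. It is not the script used in this work. It assumes that a "sequence" is a stop-free stretch of translated residues, that the standard genetic code applies, and that codons containing ambiguous bases are written as X. The function names are hypothetical.

```python
from typing import Optional

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# alternative translation tables are needed).
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate one reading frame; codons not in the table become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_segments(seq: str, min_len: int = 50) -> list[str]:
    """Stop-free peptide segments of at least min_len residues, all six frames."""
    segments = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            for segment in translate(strand[offset:]).split("*"):
                if len(segment) >= min_len:
                    segments.append(segment)
    return segments


def longest_translation(seq: str, min_len: int = 50) -> Optional[str]:
    """Keep one sequence per contig: the longest segment passing the cutoff."""
    segments = six_frame_segments(seq, min_len)
    return max(segments, key=len) if segments else None
```

With the default cutoff of 50 residues, a contig shorter than 150 bp can never yield a retained sequence, so `longest_translation` returns `None` for it.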
In the metagenome searches, an MSGF score threshold of 1e-12 was used, resulting in an FDR of 2.47%. Peptide sequences and redundant matches to protein parents are reported. Because of the large file sizes, the fasta files containing sequences were split into 25 pieces and separate searches were performed on a distributed CPU system. The results were then merged for each metaproteome or transcriptome.

Metatranscriptomic and metagenomic sequence data analysis

To determine the median number of reads mapping to the assembled contigs in each sample, JGI quality filtered reads were mapped to the assembled contigs using bowtie2 (v2.0.0-beta6, [20]) with the following default parameters: end-to-end alignment, minimum score threshold of -60.6 for 100 bp reads, D 100, distinct alignments for each read. Median base pair coverage was estimated using BedTools [21]. All contigs shorter than 300 bp were removed. Samples were submitted to MG-RAST (v3.6, [22]) for annotation using the assembled pipeline and no other filtering methods or quality controls. To examine carbon cycling genes more closely, a BLAST search (blastx, e-value 1e-5) was used to identify contigs that could be annotated using the Carbohydrate Active Enzyme (CAZy) database [23] (code used can be found at: https://github.com/Garoutte/Chapter_3/tree/master).

Defining the minimum functional core and its representation of the field site dominant genetic composition

Annotations from the three rhizosphere metagenomes were compared. Annotations present in two out of the three samples were considered core. The criterion of two out of three samples was chosen for two reasons: first, the metatranscriptome sample SRT-1 had a high percentage of rRNA sequences, resulting in many fewer non-rRNA contigs (Table 3.1); second, requiring a function to be present in all three samples was too stringent and would result in a lack of diverse metabolic functions.
The same selection criterion, presence in two of the three replicates, was also applied to the three rhizosphere metatranscriptomes, all of which had similar sequence yields. The metatranscriptome and metagenome cores were compared; functions found in both cores were considered to comprise the minimum functional core. This process was carried out for annotations based on the SEED Subsystems, RefSeq and CAZy databases (code used can be found at: https://github.com/Garoutte/Chapter_3/tree/master).

A functional core was also established for each of the three rhizoplane metagenome plant treatments: corn, switchgrass and Miscanthus. Annotations found in five of the seven rhizoplane replicate metagenomes were considered core. We chose this more stringent cutoff for the rhizoplane samples to enhance the rigor of the core while still allowing for undersampling. The minimum functional core was compared to each rhizoplane functional core to determine whether the minimum functional core is broadly representative of the microbial functional diversity of the field site.

Table 3.1: Summary of switchgrass rhizosphere metagenome (SRG) and metatranscriptome (SRT) sequence yield and its assembly and annotation.
Sample | Total reads | Non-rRNA reads | Assembled contigs | Percent assembled (a) | Annotated contigs (percent) | Reads mapping to MetaG contigs
SRT-1 | 246,895,742 | 68,949,934 | 440,213 | 81.23% | 45% | 10.92%
SRT-2 | 284,791,354 | 166,978,397 | 1,825,857 | 80.34% | 46.3% | 27.70%
SRT-3 | 397,351,240 | 250,124,715 | 2,237,997 | 82.42% | 46.90% | 22.90%
SRG-1 | 298,716,384 | NA (b) | 6,606,700 | 40.82% | 68.70% | NA (c)
SRG-2 | 338,846,620 | NA | 4,076,354 | 44.87% | 68.7% | NA
SRG-3 | 298,364,910 | NA | 6,207,377 | 40.93% | 68.7% | NA
(a) For SRT samples percent assembled is based on non-rRNA reads.
(b) -
(c) Metagenome samples were not mapped to the metatranscriptome.

Results

Building minimum functional core

To define the minimum functional core, a metagenomic and a metatranscriptomic core were created separately and subsequently combined. Because samples used for metatranscriptomics must be frozen quickly to prevent mRNA turnover, samples used to define the minimum functional core were taken from the rhizosphere of switchgrass rather than from the rhizoplane. Collection of rhizoplane samples requires time-consuming removal of roots from the plant root system and is therefore not amenable to metatranscriptomic sampling. The metatranscriptome samples comprised approximately 486 million non-rRNA reads and the metagenome samples comprised approximately 935 million reads (Table 3.1). Both switchgrass rhizosphere metatranscriptome (SRT) and switchgrass rhizosphere metagenome (SRG) samples have a similar rank abundance curve (Figure 3.1), indicating that the occurrence of their annotations in the core is similar. Both the SRT and SRG have a similar slope, indicating a similar level of evenness across both core datasets. This shows that a variety of functions are transcribed at varying abundance levels, whereas one might expect a few transcripts to be extremely abundant with the remaining transcripts at very low abundance.
The datasets do differ in some respects: the SRT slope is steeper, indicating fewer annotations with high abundance, while the SRG line extends farther than the SRT line, indicating a greater number of annotations. For the proteomics, however, the lines are much shorter and the slopes steeper than those of the metagenome and metatranscriptome samples. This indicates that the metaproteome samples contain fewer annotations and have a greater disparity in abundance.

Figure 3.1: Rank abundance curve of multi-omics subsystem annotations. The metaproteome data set is smaller than the metagenome and metatranscriptome data sets, as indicated by the shorter lines of the MetaP-MetaG and MetaP-MetaT samples.

Near complete sampling of soil microbial communities is calculated to require terabytes of sequencing [24]; hence we are likely undersampling with our current dataset. To minimize undersampling effects, we defined the metagenome and metatranscriptome functional cores as the presence of a functional annotation in two of the three samples. The minimum functional core was then composed of functional annotations present in both the metagenome and metatranscriptome cores. Table 3.2 shows the number of core functional annotations by the SEED Subsystems, RefSeq and CAZy databases. The metagenome core is larger than the metatranscriptome core as defined by all annotation databases. The minimum functional core accounts for 99% of the abundance of the SEED Subsystems annotations, 92% of RefSeq annotations and 99% of CAZy annotations. The RefSeq minimum functional core is much larger than the SEED Subsystems and CAZy minimum functional cores because RefSeq annotations are more fine scale, leading to more redundancies in functional annotations. The CAZy minimum functional core is much smaller than those of the other annotation databases because the CAZy database is specific to proteins that act on carbohydrates.
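The core definition above (presence in two of three replicates, followed by the intersection of the metagenome and metatranscriptome cores) can be sketched in a few lines of Python. This is an illustrative sketch only, not the code from the repository cited in the Methods. It assumes each replicate is represented as a dictionary mapping an annotation name to its abundance; all function names and the toy values are hypothetical.

```python
from collections import Counter


def replicate_core(replicates, min_present):
    """Annotations detected (abundance > 0) in at least min_present replicates."""
    presence = Counter()
    for counts in replicates:
        presence.update(name for name, n in counts.items() if n > 0)
    return {name for name, k in presence.items() if k >= min_present}


def minimum_functional_core(metag_reps, metat_reps, min_present=2):
    """Annotations that are core to both the metagenomes and metatranscriptomes."""
    return replicate_core(metag_reps, min_present) & replicate_core(
        metat_reps, min_present
    )


def abundance_captured(counts, core):
    """Fraction of a sample's total annotation abundance that falls in the core."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for name, n in counts.items() if name in core) / total


# Toy data (hypothetical annotation names and abundances).
metag = [{"A": 10, "B": 5, "C": 1}, {"A": 8, "B": 2}, {"A": 9, "C": 3, "D": 1}]
metat = [{"A": 4, "B": 1}, {"A": 6, "D": 2}, {"B": 3, "D": 1}]

mfc = minimum_functional_core(metag, metat)   # {"A", "B"}
fraction = abundance_captured(metag[0], mfc)  # 15 of 16 counts
```

The same `replicate_core` function covers the stricter rhizoplane criterion by passing seven replicates with `min_present=5`.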
When we refer to the minimum functional core from this point on, we mean the minimum functional core derived from the SEED Subsystems unless specified otherwise.

Table 3.2: Summary of minimal core annotations. SEED Subsystems is a hierarchical database that annotates gene functions rather than specific genes. The RefSeq database annotates specific genes from model organisms. The Carbohydrate Active Enzyme database (CAZy) specifically annotates enzymes related to synthesis, metabolism and transport of carbohydrates.

Reference database | Number of annotations*, SRT | Number of annotations*, SRG | Minimum Functional Core (MFC) annotations** | Percent of MFC annotation abundance represented by reference database
SEED Subsystems | 8,180 | 9,729 | 7,781 | 0.99
RefSeq | 38,672 | 85,193 | 27,988 | 0.92
CAZy | 380 | 410 | 375 | 0.99
* Represents the number of annotations in the core of each data type.
** Represents the combination of the SRG and SRT functional cores.

Metaproteome characterization and core comparison

Analysis of the metaproteome data sets found 460 unique SEED Subsystem annotations with a total abundance of 876,429 in the metatranscriptome derived metaproteome and 766 unique SEED Subsystem annotations with a total abundance of 607,281 in the metagenome derived metaproteome. The rank abundance curves of the metaproteome data (Figure 3.1) are fairly steep, indicating that there are a few highly abundant proteins, although the metagenome derived metaproteome is more diverse. When compared to the minimum functional core, 448 of the 460 SEED subsystem annotations from the metatranscriptome derived metaproteome are found within the core, while 727 of the 766 SEED subsystem annotations from the metagenome derived metaproteome are found within the minimum functional core. Of the 12 annotations absent from the core, most are eukaryotic or archaeal ribosomal proteins. These 12 proteins represent only 0.25% of the total relative abundance.
These data illustrate that our sequencing effort was insufficient to adequately sample the archaeal and eukaryotic portion of the microbial community. The 39 annotations from the metagenome-derived metaproteome that are absent from the core come from 15 different SEED subsystems and represent only 0.3% of the total relative abundance. The metaproteome derived from the metatranscriptome contained 102 CAZy annotations, all of which were found in the CAZy minimum functional core, while the metaproteome derived from the metagenome contained 271 CAZy annotations, all but one of which were found in the minimum functional core.

Comparison of the minimum functional core to rhizoplane functional cores

Twenty-one metagenomic samples originating from corn, Miscanthus and switchgrass associated rhizoplane soils, seven from each plant, were compared to the minimum functional core to determine whether the minimum functional core is representative of the functional diversity of the field site or specific to the crop. Samples averaged approximately 363 million reads each and formed an average of 5.5 million contigs (Table 3.4). Rhizoplane core functions were defined as those found in five of seven replicates for each plant. This criterion is more stringent than the criterion of two out of three used in the construction of the minimum functional core. Of the 7,781 SEED Subsystem functional annotations present in the minimum functional core, between 97.7% and 98.4% were found in the core of the three plant metagenome samples (Table 3.3). The minimum functional core captured approximately 99.3% of the total annotation abundance of each plant, indicating that the minimum functional core is representative of the functional diversity of the field site irrespective of the plant. For the RefSeq minimum functional core, composed of 27,988 functional annotations, between 92.8% and 94.1% were found in the core of the three plant metagenomes.
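As a worked check of the percentages above, the "Percent of Core" values can be recomputed from the counts reported in Table 3.3. The short Python sketch below assumes that the "Minimum Core" column of Table 3.3 gives the number of minimum functional core annotations also found in each crop's rhizoplane core, and that each percentage is that count divided by the size of the minimum functional core; the variable and function names are hypothetical.

```python
# Size of the minimum functional core (MFC) per database, from the text.
MFC_SIZE = {"SEED Subsystems": 7781, "RefSeq": 27988}

# MFC annotations found in each crop's rhizoplane core, from Table 3.3.
SHARED = {
    "SEED Subsystems": {"Corn": 7628, "Miscanthus": 7606, "Switchgrass": 7656},
    "RefSeq": {"Corn": 26106, "Miscanthus": 25993, "Switchgrass": 26333},
}


def percent_of_core(shared: int, mfc_size: int) -> float:
    """Percent of MFC annotations recovered in a crop core, to one decimal."""
    return round(100.0 * shared / mfc_size, 1)


percentages = {
    database: {
        crop: percent_of_core(count, MFC_SIZE[database])
        for crop, count in crops.items()
    }
    for database, crops in SHARED.items()
}
```

Under these assumptions the recomputed values agree with the table to one decimal place, for example 98.0% for corn with SEED Subsystems and 94.1% for switchgrass with RefSeq.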
The RefSeq functional annotations represent approximately 91% of the annotation abundance of the rhizoplane metagenomes.

Table 3.3: Summary of minimum functional core annotations found in rhizoplane metagenomes of three crops. Minimum core represents the functions found in five out of seven samples. Percent core represents the percent of the crop specific core that is found within our established minimum functional core. Percent abundance captured represents the abundance of the crop specific samples found in the minimum functional core.

SEED Subsystems
Crop | Minimum Core | Percent of Core | Percent Abundance Captured
Corn | 7628 | 98.0% | 99.3%
Miscanthus | 7606 | 97.8% | 99.4%
Switchgrass | 7656 | 98.4% | 99.3%

RefSeq
Crop | Minimum Core | Percent of Core | Percent Abundance Captured
Corn | 26106 | 93.3% | 91.2%
Miscanthus | 25993 | 92.9% | 91.7%
Switchgrass | 26333 | 94.1% | 91.3%

Characterization of the minimum functional core of switchgrass

Abundant functions within the minimum functional core are related to housekeeping processes

The two subsystems with the greatest number of functions within the minimum functional core are Carbohydrates and Clustering-based subsystems, with 1151 and 1049 functional annotations respectively (Figure 3.2). The Carbohydrates subsystem includes functions related to central metabolism, which can be classified as housekeeping functions, as well as functions related to the utilization of organic compounds, which are important when considering carbon cycling in the rhizosphere. The Clustering-based subsystem is defined as genes that evidence suggests belong together but that have no known function. Other subsystems with many functional annotations also represent housekeeping related functions; these include Amino Acids and Derivatives, Protein Metabolism and RNA Metabolism. Functions within these subsystems are related to the expression of proteins, which is an important process for all microbes.
The Protein Metabolism subsystem is the only one where the number of functions in the metatranscriptome core is greater than in the metagenome core, 561 and 474, respectively.

Figure 3.2: Diversity of switchgrass rhizosphere core functions by subsystem. Number of annotations in the minimum functional core as annotated by the SEED Subsystems database. (x-axis: SEED Subsystem; y-axis: Number of Functional Annotations; series: Core, MetaG core, MetaT core.)

When the relative abundance of functions in the minimum functional core is taken into account (Figure 3.3), Carbohydrates and Clustering-based subsystems remain the most abundant in the metagenome, with 14.3% and 13.9% relative abundance respectively. In the metatranscriptome the Clustering-based subsystem, Protein Metabolism and Carbohydrates have the greatest relative abundance, 15%, 14.6% and 12.2% respectively. Protein Metabolism and RNA Metabolism subsystems are the most abundant in the metatranscriptome-derived metaproteome, with 45% and 19.6% relative abundance respectively. The most abundant subsystems in the metagenome-derived metaproteome are RNA Metabolism and Respiration, with relative abundances of 28.7% and 22.4% respectively. Considering the important role that functions within the Protein Metabolism, RNA Metabolism and Respiration subsystems have in basic cellular maintenance, their abundance in the metatranscriptome and metaproteome is not surprising.
Taken together these data indicate that the rhizosphere microbial community is actively carrying out housekeeping functions related to transcription and translation.

Figure 3.3. Relative abundance of switchgrass rhizosphere core multi-omics data by SEED Subsystem annotations. Relative abundance is averaged across each of the three replicates. (x-axis: SEED Subsystem; y-axis: Average Relative Abundance; series: MetaG, MetaT, MetaP-MetaT, MetaP-MetaG.)

Functions of ecological importance within the minimum functional core

Rhizosphere microbes are known to play an important role in many ecologically important functions such as biogeochemical cycling and plant growth and defense. Subsystems representing biogeochemical cycling include Carbohydrates, Nitrogen Metabolism, Phosphorus Metabolism, and Iron Acquisition and Metabolism. The Secondary Metabolism subsystem contains functions related to plant defense and growth promotion. With the exception of Carbohydrates, all of these subsystems have a low relative abundance (>1%) in the minimum functional core (Figure 3.3). However, the minimum functional core accounts for the majority of the relative abundance of each subsystem, ranging from 97% to 99.8%. While these functions of ecological importance have low relative abundance in our samples compared to other more ubiquitous housekeeping functions, they are present and active in the minimum functional core.
Nitrogen Metabolism is one of the most important microbial community functions for plant growth. Major subsystem categories within the Nitrogen Metabolism subsystem found in the minimum functional core include Allantoin Utilization, Ammonia Assimilation, Denitrification, Nitrate and Nitrite Ammonification, Nitrogen Fixation and Nitrosative Stress. The nitrogen fixation functions in the minimum functional core are related to nitrogenase transcription factors. The protein components of nitrogenase are found in the metagenome functional core but are not expressed and are therefore not part of the minimum functional core. Noticeably absent from these data are genes related to nitrification. The metaproteome supports these findings with annotations related to the Nitrogen Metabolism subcategories Allantoin Utilization, Ammonia Assimilation, Denitrification and Nitrate and Nitrite Ammonification (Figure 3.4). Ammonia Assimilation was the most active process related to nitrogen cycling in the rhizosphere at the time of sampling, as the metatranscriptome and both metaproteome data sets have their greatest relative abundance in the Ammonia Assimilation subcategory within the Nitrogen Metabolism subsystem.

Figure 3.4: Relative abundance of biogeochemical cycling functions in the minimum functional core of the switchgrass rhizosphere. Allantoin utilization, Ammonia assimilation, Denitrification and Nitrate and Nitrite ammonification are subsystems within the Nitrogen Metabolism subsystem. Alkylphosphonate utilization and Phosphate metabolism are subsystems within the Phosphorus Metabolism subsystem.

The Phosphorus Metabolism subsystem in the minimum functional core contains 75% of the functional annotations and 99.8% of the relative abundance of this subsystem. Major subcategories within the Phosphorus Metabolism subsystem include Phosphate Metabolism, Alkylphosphonate Utilization and High Affinity Phosphate Transporters.
The metaproteomes contained functional annotations for the Alkylphosphonate Utilization and Phosphate Metabolism subcategories within the Phosphorus Metabolism subsystem (Figure 3.4). The Iron Acquisition and Metabolism subsystem is a very diverse subsystem with 230 functional annotations that are core to the metagenome. The minimum functional core contains 117 functional annotations representing 97% of the relative abundance. The major subcategories of the Iron Acquisition and Metabolism subsystem are related to Siderophores and to Heme and Hemin Uptake and Utilization.

Many beneficial services provided by rhizosphere microbes are also reflected in the minimum functional core. Aside from providing plants with bioavailable sources of nitrogen, phosphorus and iron, microbes also produce plant growth hormones such as auxin. Many functions related to auxin biosynthesis are found in the Secondary Metabolism subsystem. Additionally, rhizosphere microbes have been shown to reduce levels of ethylene, a plant stress hormone, through the production of ACC-deaminase. This function was found in the minimum functional core and is classified as belonging to the Miscellaneous subsystem. Finally, rhizosphere microbes produce the sugar trehalose, which can be used by plants to protect against drought stress. Many functions related to trehalose biosynthesis are found in the Carbohydrates subsystem (Figure 3.5). While these functions comprise a rather small portion of the overall abundance, they can have a significant effect on plant microbe interactions within the microbial community.

Figure 3.5. Relative abundance of plant growth promoting functions in the minimum functional core of the switchgrass rhizosphere.
Auxin Biosynthesis and Trehalose Biosynthesis are the second level in the SEED Subsystem hierarchy within the Secondary Metabolism and Carbohydrates subsystems respectively. *ACC-deaminase is the fourth and lowest level in the SEED Subsystem hierarchy.

Carbon cycling functions within the minimum functional core

The Carbohydrates subsystem is highly abundant in the metagenome and the metatranscriptome functional cores (Figure 3.3). In both derivations of the metaproteome the Carbohydrates subsystem falls in relative abundance, but it still represents 6.5% and 5.7% of the relative abundance, respectively. To further examine carbon cycling processes, we used the Carbohydrate Active Enzyme (CAZy) database to annotate our sequences. The CAZy minimum functional core is composed of 375 different functional annotations. The most common enzyme class was the Glycoside Hydrolases (GH) with 200 functional annotations. These enzymes are common and are known to hydrolyze or rearrange glycosidic bonds. When relative abundance is taken into account, the Glycoside Hydrolases class again stands out with the greatest relative abundance of all classes (Figure 3.6). Glycosyl Transferases (GT) show a downward trend in relative abundance, with the metagenome representing 33% of the relative abundance and the metatranscriptome and the metatranscriptome derived metaproteome showing lower relative abundances, 17% and 8%, respectively. However, in the metagenome derived metaproteome the relative abundance of GT annotations is 39%. Finally, the relative abundance of the Auxiliary Activities class is 3% in the metagenome, while in the metatranscriptome it comprises 11% of the relative abundance. The Auxiliary Activities class comprises 29% of the relative abundance in the metatranscriptome derived metaproteome and 9% in the metagenome derived metaproteome.
This indicates that the Auxiliary Activities class, which has low representation in the metagenomes, was being highly expressed and translated at the time of sampling. The Auxiliary Activities class is described as redox enzymes that act in conjunction with CAZy enzymes. This class is predominantly related to ligninolytic enzymes.

Figure 3.6: Relative abundance of CAZy annotations in the minimum functional core of the switchgrass rhizosphere. CAZy enzyme classes are: Glycoside Hydrolases (GH), Glycosyl Transferases (GT), Carbohydrate Esterases (CE), Polysaccharide Lyases (PL) and Auxiliary Activities (AA).

Discussion

Many methods can be used to build a minimum functional core, ranging from conservative (functions must be found in all samples) to lenient (all functions are included regardless of how many times they were found). We chose an intermediate approach requiring functions to be present in two of the three samples for each of the metagenome and metatranscriptome datasets. This removes singletons while still preserving much of the functional diversity found within the samples. We refer to this core as the minimum functional core for several reasons. Only a portion of the contigs are annotated (Table 3.6). In the case of the metagenomes used to build the minimum functional core, less than 50% of the contigs could be annotated by MG-RAST. This represents a large portion of the dataset from which no functional information can be obtained. Typically a core of functions shrinks when more data are included. However, we define this dataset as the minimum functional core since the unannotated portion of the data is high. Based on our comparison of the minimum functional core to 21 rhizoplane metagenomic samples from the same field site, we are confident that our minimum functional core is representative of the major functional processes being carried out in the rhizosphere of our field site regardless of the associated plant.
In our comparison we used a highly-bred, high-nutrient responsive annual crop (corn) and two recently domesticated low nutrient input perennials (switchgrass and Miscanthus). Using a more stringent core identification method (defined as five of seven replicates per plant treatment) we identified over 97% of the minimum functional core annotations and accounted for over 99% of the relative abundance of each plant treatment based on the SEED Subsystem annotations. In the RefSeq minimum functional core over 92% of the functional annotations were observed and comprised over 91% of the rhizoplane abundance (Table 3.3).

This multi-omics approach has shown that many environmentally important microbially mediated biogeochemical cycling functions are active in the rhizosphere. The Nitrogen Metabolism subsystem within the minimum functional core contains many key elements of the nitrogen cycle. The greater abundance of the Ammonia Assimilation subsystem, especially in the metatranscriptomic and metaproteomic data (Figure 3.4), highlights the relative importance of this function to the community compared to other subcategories of the Nitrogen Metabolism subsystem. The Phosphorus Metabolism subsystem functions within the minimum functional core also contain many elements of phosphorus cycling. The Phosphate Metabolism subcategory has the highest relative abundance across the multi-omics data set, indicating its importance to the rhizosphere microbial community. The Iron Acquisition and Metabolism subsystem is represented in the minimum functional core built from the metagenomes and metatranscriptomes. No metatranscriptome derived metaproteome functional annotations were found and very few annotations were found in the metagenome-derived metaproteome. This may be the result of undersampling in the metaproteomic data combined with the methodology used to obtain the metaproteome.
Despite their low proportional abundance, microbially mediated biogeochemical cycling functions are vital to the microbial community. These functions provide bioavailable micronutrients to the rhizosphere microbial community and the associated plant. We speculate that these functional annotations related to biogeochemical cycling, found in low abundance within the minimum functional core, should be classified as rare keystone community functions, as they have a disproportionately large effect on the microbial community. These rare functions and their associated taxa may be involved in synergistic crossfeeding. Syntrophy occurs in various habitats including hot springs, fresh water sediments, marine sediments, eutrophic bog sediments, marsh sediments, rumen [25], sewage treatment plants [26] and petroleum muck [27] as well as in constructed communities [28, 29]. To support these claims a more direct and quantitative approach is required.

Plant growth promoting functions are also found within the minimum functional core at low abundance. It was once thought that 80% of microbes living in the rhizosphere could produce the plant growth hormone auxin [30]. These data also suggest that many or most of the organisms in the rhizosphere have a commensal relationship with the plant. These organisms benefit from carbon secreted through root exudates while not providing a benefit to the plant, other than perhaps outcompeting plant pathogens in the rhizosphere. An alternative explanation is that timing plays a more relevant role in the abundance and expression of microbially mediated element cycles and plant-microbe interactions. As can be seen in Figure 3.5, the relative abundance in the metagenomes is greater than in the metatranscriptome, indicating that the system has a greater capacity to aid plant growth than was observed at the time of sampling.
Sampling at different times throughout the growing season may reveal different trends in the abundance and expression of elemental cycling functions and functions related to plant-microbe interactions.

Annotation of our data with the CAZy minimum functional core suggests that proteins related to lignin breakdown are highly active in the rhizosphere microbial community. The Auxiliary Activities class, which has the lowest relative abundance in the metagenome minimum functional core and intermediate abundance in the metatranscriptome, has the second highest relative abundance in the metatranscriptome derived metaproteome. The metagenome-derived metaproteome has lower abundance than the metatranscriptome-derived metaproteome, suggesting that many of the annotated proteins were recently synthesized by active microbes. The breakdown of lignin indicates that a portion of the microbial community was actively involved in the breakdown of plant biomass. Our use of a multi-omics approach greatly contributed to this finding; without the metaproteomic data this insight into rhizosphere microbial community function would not have been obtained. These data also show that Glycoside Hydrolases are the predominant CAZy class that was active in the rhizosphere community across all samples. The high relative abundance of the metagenome derived metaproteome GT annotations indicates that these proteins are older and originate from microbes that were not actively transcribing these enzymes at the time of sampling.

Annotations in the Clustering-based subsystem were found at high abundance in both the metagenome and metatranscriptome (Figures 3.2 and 3.3). The metagenomic data clearly show that many genomes possess a wide range of different annotations belonging to the Clustering-based subsystem (Figure 3.2). Metatranscriptomic data also show a wide range of highly expressed transcripts in this class.
Meanwhile both metaproteomic data sets show few annotations belonging to the Clustering-based subsystem, indicating that in the recent past few of the proteins of Clustering-based subsystem functions were produced in detectable quantities. The cumulative evidence here also reinforces the need to further investigate the actual functions found in the Clustering-based subsystem, as they are abundant in rhizosphere community genomes and are actively transcribed. More evidence is needed to determine whether functions found in the Clustering-based subsystem are translated to protein at levels similar to their transcription levels.

Conclusion

To fully understand the dynamics of environmentally important microbially mediated processes, microbes must be studied in a community context. Our multi-omics approach for establishing a core of active microbial functions has led to several insights into microbial community function within the rhizosphere. While the most abundant functional annotations within the minimum functional core are related to housekeeping functions, biogeochemical cycling and plant growth promoting functions are present and active in the rhizosphere. Our use of metaproteomics greatly contributed to the findings that the Ammonia Assimilation and Phosphate Metabolism subsystems were highly active during the sampling period compared to other biogeochemical functions. To further increase our understanding of biogeochemical cycling and plant microbe interactions in the rhizosphere, sampling at multiple time points throughout the season may reveal seasonal patterns of activity. The use of a multi-omics approach allowed the identification of microbial community activity in the rhizosphere.
APPENDIX

Table 3.4: Summary of rhizoplane metagenome reads, assembly and assembled read abundance

Sample           Total Reads    Assembled contigs   Reads Assembled
Corn-1           335,539,202    4,985,554           58.37%
Corn-2           362,890,032    5,795,619           48.31%
Corn-3           340,306,582    5,159,075           49.56%
Corn-4           353,097,824    5,306,152           44.89%
Corn-5           396,880,408    5,227,891           44.44%
Corn-6           419,729,170    5,989,941           43.18%
Corn-7           411,276,622    6,235,593           49.49%
Miscanthus-1     349,441,596    5,733,645           47.42%
Miscanthus-2     356,902,550    6,365,326           47.73%
Miscanthus-3     355,019,026    6,058,969           44.76%
Miscanthus-4     334,711,394    5,764,967           50.37%
Miscanthus-5     282,058,246    4,566,237           40.79%
Miscanthus-6     213,309,852    2,879,035           30.27%
Miscanthus-7     399,148,934    6,517,537           44.98%
Switchgrass-1    449,331,454    6,606,700           58.36%
Switchgrass-2    352,965,222    4,076,354           41.81%
Switchgrass-3    405,940,264    6,207,377           48.30%
Switchgrass-4    415,253,680    6,156,833           45.96%
Switchgrass-5    389,928,900    5,430,532           43.88%
Switchgrass-6    353,644,618    5,843,003           49.98%
Switchgrass-7    364,517,026    5,715,056           47.57%

REFERENCES

1. Philippot, L., et al., Going back to the roots: the microbial ecology of the rhizosphere. Nature Reviews Microbiology, 2013. 11: p. 789-99.
2. Bird, J.A., D.J. Herman, and M.K. Firestone, Rhizosphere priming of soil organic matter by bacterial groups in a grassland soil. Soil Biology and Biochemistry, 2011. 43: p. 718-725.
3. Marschner, P., D. Crowley, and Z. Rengel, Rhizosphere interactions between microorganisms and plants govern iron and phosphorus acquisition along the root axis model and research methods. Soil Biology and Biochemistry, 2011. 43: p. 883-894.
4. Berendsen, R.L., C.M.J. Pieterse, and P.A.H.M. Bakker, The rhizosphere microbiome and plant health. Trends in Plant Science, 2012: p. 1-9.
5. Mendes, R., P. Garbeva, and J.M. Raaijmakers, The rhizosphere microbiome: significance of plant beneficial, plant pathogenic, and human pathogenic microorganisms. FEMS Microbiology Reviews, 2013. 37: p. 634-63.
6. Spaepen, S. and J.
Vanderleyden, Auxin and plant-microbe interactions. Cold Spring Harbor Perspectives in Biology, 2011. 3.
7. Dodd, I.C., et al., Rhizobacterial mediation of plant hormone status. Annals of Applied Biology, 2010. 157: p. 361-379.
8. Fierer, N., et al., Cross-biome metagenomic analyses of soil microbial communities and their functional attributes. Proceedings of the National Academy of Sciences of the United States of America, 2012. 109(52): p. 21390-21395.
9. Mendes, L.W., et al., Taxonomical and functional microbial community selection in soybean rhizosphere. ISME Journal, 2014. 8(8): p. 1577-1587.
10. Lennon, J.T. and S.E. Jones, Microbial seed banks: the ecological and evolutionary implications of dormancy. Nature Reviews Microbiology, 2011. 9(2): p. 119-130.
11. Howe, A.C., et al., Tackling soil diversity with the assembly of large, complex metagenomes. Proceedings of the National Academy of Sciences of the United States of America, 2014. 111(13): p. 4904-4909.
12. Rodriguez-R, L.M. and K.T. Konstantinidis, Nonpareil: a redundancy-based approach to assess the level of coverage in metagenomic datasets. Bioinformatics, 2014. 30(5): p. 629-635.
13. Delmont, T.O., P. Simonet, and T.M. Vogel, Describing microbial communities and performing global comparisons in the 'omic era. ISME Journal, 2012. 6(9): p. 1625-1628.
14. Shade, A. and J. Handelsman, Beyond the Venn diagram: the hunt for a core microbiome. Environmental Microbiology, 2012. 14(1): p. 4-12.
15. Qin, J., et al., A human gut microbial gene catalogue established by metagenomic sequencing. Nature, 2010. 464(7285): p. 59-U70.
16. Li, D., C.-M. Liu, R. Luo, K. Sadakane, and T.-W. Lam, MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics, 2015. 31(10): p. 1674-1676.
17. Wisniewski, J.R., A. Zougman, and M.
Mann, Combination of FASP and StageTip-Based Fractionation Allows In-Depth Analysis of the Hippocampal Membrane Proteome. Journal of Proteome Research, 2009. 8(12): p. 5674-5678.
18. Kim, S., et al., The Generating Function of CID, ETD, and CID/ETD Pairs of Tandem Mass Spectra: Applications to Database Search. Molecular & Cellular Proteomics, 2010. 9(12): p. 2840-2852.
19. Kim, S., N. Gupta, and P.A. Pevzner, Spectral probabilities and generating functions of tandem mass spectra: A strike against decoy databases. Journal of Proteome Research, 2008. 7(8): p. 3354-3363.
20. Langmead, B., et al., Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol, 2009. 10(3): p. R25.
21. Quinlan, A.R. and I.M. Hall, BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 2010. 26(6): p. 841-2.
22. Meyer, F., et al., The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics, 2008. 9: p. 386.
23. Cantarel, B.L., et al., The Carbohydrate-Active EnZymes database (CAZy): an expert resource for Glycogenomics. Nucleic Acids Res, 2009. 37(Database issue): p. D233-8.
24. Rodriguez, R.L. and K.T. Konstantinidis, Nonpareil: a redundancy-based approach to assess the level of coverage in metagenomic datasets. Bioinformatics, 2014. 30(5): p. 629-35.
25. McInerney, M.J., et al., Physiology, ecology, phylogeny, and genomics of microorganisms capable of syntrophic metabolism. Annals of the New York Academy of Sciences, 2008. 1125: p. 58-72.
26. Jackson, B.E., et al., a new anaerobic bacterium that degrades fatty acids and benzoate in syntrophic association with hydrogen-using microorganisms. 1999: p. 107-114.
27. Joshi, M.N., et al., Metagenomics of petroleum muck: Revealing microbial diversity and depicting microbial syntrophy. Archives of Microbiology, 2014. 196: p. 531-544.
28.
Mee, M.T., et al., Syntrophic exchange in synthetic microbial communities. Proceedings of the National Academy of Sciences of the United States of America, 2014. 111: p. E2149-56.
29. D'Souza, G., et al., Less is more: Selective advantages can explain the prevalent loss of biosynthetic genes in bacteria. Evolution, 2014. 68: p. 2559-2570.
30. Spaepen, S. and J. Vanderleyden, Auxin and plant-microbe interactions. Cold Spring Harb Perspect Biol, 2011. 3(4).

Chapter 4

Plant root effects on soil microbial community functions as viewed through metagenomics and metatranscriptomics

Abstract

We used metagenomics and metatranscriptomics to identify microbial community functions enriched in the rhizoplane, the rhizosphere and bulk soil (soil not influenced by living plant roots). We postulated that the metagenomes of all three soil zones would show differences, as would the metatranscriptomes, and that bulk versus rhizosphere soil would show the greatest difference. To test this, we obtained metagenome sequence from the rhizoplane of switchgrass (Panicum virgatum) and corn (Zea mays), and the rhizosphere of corn. For metatranscriptomics we obtained sequence from rhizosphere samples of switchgrass and bulk samples from between corn rows. Contrary to our hypothesis, the metagenomes of rhizosphere and bulk soil showed no statistical differences, but there was a significant difference between rhizoplane and rhizosphere soils. We therefore combined the corn bulk and rhizosphere samples and termed them non-rhizoplane samples; the rhizosphere samples from switchgrass were also termed non-rhizoplane samples. Additionally, bulk corn and switchgrass rhizosphere metagenomes showed significant differences in terms of functional composition and gene abundances; however, the associated metatranscriptomes were not statistically different. In this study we show that bulk soil microbial communities are affected by the plant even though samples were collected away from living roots.
This study also illustrates that factors other than proximity to living plant roots more strongly affect microbial community functions.

Introduction

Soils exist in a continuum with regard to their exposure to living plant roots and plant detritus. As rhizodeposition occurs, carbon-rich compounds are first accessible to organisms living in the rhizoplane, the external surface of plant roots together with any closely adhering particles of soil or debris [1]. Carbon-rich compounds then diffuse to the surrounding soil, i.e. the rhizosphere, the soil influenced by living plant roots [1]. Bulk soil is defined in this study as soil not influenced by living plant roots. Differences between bulk and rhizosphere soils have been widely studied, while comparisons to the rhizoplane are relatively rare. The rhizosphere is characterized as having lower species diversity than bulk soils [2] but also as having a larger and more complex network of interactions than bulk soil [3]. These differences in microbial community composition are also accompanied by differences in the functional potential of bulk and rhizosphere microbial communities [4]. These results have not been universally consistent. In studies of rice (Oryza sativa) and Merlot grapevine, the microbial community structure of the rhizosphere and bulk soil could not be differentiated [5, 6]. Many experiments in plant-microbe interactions utilize bulk soil as a control [7-9]; however, there is disagreement in the literature about the effect of plant roots and their extension into the rhizosphere.

In this study we utilize metagenomics as well as metatranscriptomics to more deeply examine functional differences in microbial communities associated with the rhizoplane, rhizosphere and bulk soil of switchgrass and corn. We postulate that rhizoplane and rhizosphere samples from corn and switchgrass associated soils will differ from bulk soil, with the bulk soil being intermediate between corn and switchgrass associated soils.
We also postulate that while the metagenome samples will show a significant difference between bulk soil and the rhizoplane and rhizosphere soils, the metatranscriptome will show a more statistically significant difference between bulk and rhizosphere samples, as it is not obfuscated by dead or dormant cells.

Methods

Site description and sample collection

Samples were collected from the Kellogg Biological Station Great Lakes Bioenergy Research Center BSCE. Prior to 2008, dating back to 1988, the site was conventionally farmed, mostly for soybeans. In 2008 the site was set up in a randomized block design to study various bioenergy cropping systems including continuous corn and switchgrass, the focus of this study. Soil samples were collected on July 31st, 2013. Bulk and rhizosphere samples used for metatranscriptomic and metagenomic sequencing were collected from corn plot G1R4 and adjacent switchgrass plot G5R2, respectively. Bulk soil samples were collected from in between crop rows using a 3 cm diameter soil corer; only the top 10 cm of soil were collected. For the switchgrass rhizosphere samples, the root systems were dug up and vigorously shaken to remove excess, potentially non-rhizosphere, soil. The switchgrass root system was then placed in a sterile bag and vigorously shaken to loosen the closely attached rhizosphere soil. The separated soil was placed in whirlpacks and frozen in liquid nitrogen. This process was completed in less than 5 min to minimize transcript turnover.

Rhizoplane metagenome samples were collected at the same time and place as the previous samples. The root systems were placed in sealed bags and kept on ice for transport to the laboratory. Samples were stored at -20 °C until needed. Small roots with their surrounding thin layer of soil and microaggregates were removed from the root system and placed in a phosphate buffer. Each sample contained about five grams of root and soil material. Samples were shaken to remove soil attached to the roots.
The suspended soil and microbes were then pelleted. Root material was then carefully removed from the soil pellet. Soil that fell off of the roots during transport or rhizoplane root picking was considered rhizosphere soil, as it was likely not in direct contact with the plant root system. Therefore this soil was collected for both switchgrass and corn samples for use as rhizosphere metagenomes.

Sample preparation and sequencing

RNA was extracted from three replicate rhizosphere soils using the MoBio PowerSoil RNA extraction kit (MoBio, Carlsbad, CA). Samples were treated with DNase (Invitrogen, Carlsbad, CA) to remove any potentially co-extracted DNA. Sample quality was checked using a nanodrop and was quantified using the Qubit RNA quantification kit (Invitrogen, Carlsbad, CA). The three switchgrass RNA replicates were termed SRT1, SRT2 and SRT3 (SRT, switchgrass rhizosphere treatment), while the three corn samples were termed CBT1, CBT2 and CBT3 (CBT, corn bulk (soil) treatment).

Six DNA samples were extracted from the same samples used for metatranscriptomics. About 0.5 grams of soil was used for DNA extraction using the MoBio PowerSoil DNA kit (MoBio, Carlsbad, CA). Sample quality was checked using a nanodrop and was quantified using the Qubit DNA quantification kit (Invitrogen, Carlsbad, CA). The three rhizosphere DNA samples were termed SRG1, SRG2 and SRG3 (SRG, switchgrass rhizosphere (meta)genome), while the three corn metagenome samples were termed CBG1, CBG2 and CBG3 (CBG, corn bulk (soil) (meta)genome). DNA was also extracted from an additional 20 samples (as above). Seven were collected from the rhizoplane of corn and seven from the rhizoplane of switchgrass; these were termed C1-7 (C refers to corn plot) and S1-7 (S refers to switchgrass plot). The remaining six samples, three corn rhizosphere and three switchgrass rhizosphere samples, were termed CR1, CR2 and CR3 (CR, corn rhizosphere) and SR1, SR2 and SR3 (SR, switchgrass rhizosphere).
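Because the sample labels defined above are reused throughout the results, the naming scheme can be summarized as a small lookup table. The following Python sketch is only an illustration: the dictionary layout and the helper name `describe` are assumptions introduced here, while the prefixes, replicate counts and descriptions come from the text above (with the corn bulk metagenome written as CBG, as in Table 4.1).

```python
# Sample-label prefixes as described in the text:
# prefix -> (data type, plant, soil zone, number of replicates).
# The tuple layout is an illustrative assumption, not part of the original analysis.
SAMPLE_SETS = {
    "SRT": ("metatranscriptome", "switchgrass", "rhizosphere", 3),
    "CBT": ("metatranscriptome", "corn", "bulk", 3),
    "SRG": ("metagenome", "switchgrass", "rhizosphere", 3),
    "CBG": ("metagenome", "corn", "bulk", 3),
    "C": ("metagenome", "corn", "rhizoplane", 7),
    "S": ("metagenome", "switchgrass", "rhizoplane", 7),
    "CR": ("metagenome", "corn", "rhizosphere", 3),
    "SR": ("metagenome", "switchgrass", "rhizosphere", 3),
}


def describe(label):
    """Expand a label such as 'SRT2' or 'C5' into a readable description."""
    prefix = label.rstrip("0123456789")
    replicate = int(label[len(prefix):])
    data_type, plant, zone, n_reps = SAMPLE_SETS[prefix]
    if not 1 <= replicate <= n_reps:
        raise ValueError(f"{label}: replicate outside 1..{n_reps}")
    return f"{plant} {zone} {data_type}, replicate {replicate}"


if __name__ == "__main__":
    for label in ("SRT2", "CBG1", "C5", "SR3"):
        print(label, "->", describe(label))
```

Keeping the scheme in one place makes it easier to check which comparisons in Table 4.1 involve samples from the same plant, the same soil zone, or the same data type.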
All DNA samples were sent to the Joint Genome Institute (JGI) in Walnut Creek, CA for sequencing. Ribosomal RNA subtraction was performed on the metatranscriptome samples using the RiboZero kit (Illumina, San Diego, CA) at JGI. All samples were sequenced using the HiSeq [...] assembled using MEGAHIT (v 0.2.0) [10] (--cpu-only -m 100e9 --k-max 123 -l 155).

Data analysis

The median coverage of reads mapping to the contigs was identified as follows. Quality filtered reads were mapped to the contigs using bowtie2 (v2.0.0-beta6, [11]) with the following default parameters: end-to-end alignment, minimum score threshold for 100 bp reads was -60.6, D 100, distinct alignments for each read. Median base pair coverage was estimated using BedTools [12]. All contigs of length less than 300 bp were removed from the data set. Samples were then submitted to MG-RAST (v3.6, [13]) for annotation using the assembled pipeline and no other filtering methods or quality controls (code available at https://github.com/Garoutte/Chapter_4).

Statistical analysis was carried out in the R statistics environment (v3.1.3) using the vegan package (v2.2-1). Nonmetric multidimensional scaling (NMDS) analysis was used to visualize the data, which were log2 plus one transformed to increase normality. To establish significance of sampling groups, permutational multivariate analysis of variance (PERMANOVA) was used with a Bray-Curtis distance matrix. Differential abundance of functional annotations was determined according to [14] using the R package edgeR (v3.8.6). Functional annotations were considered differentially abundant if the log fold change was one or greater and the false discovery rate was 0.05 or less (code available at https://github.com/Garoutte/Chapter_4).

Results

Metatranscriptome analysis

Ribosomal RNA subtraction by JGI was mostly successful, with the majority of samples having approximately 30-40% rRNA after sequencing (Table 4.2).
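The rRNA percentages cited from Table 4.2 follow from the total and non-rRNA read counts. As a minimal sketch, assuming that every read not counted as non-rRNA is rRNA, the percentage can be recomputed in Python; the variable and function names are illustrative, and the read counts are copied from Table 4.2.

```python
# (total reads, non-rRNA reads) per metatranscriptome sample, copied from Table 4.2.
READ_COUNTS = {
    "SRT-1": (246_895_742, 68_949_934),
    "SRT-2": (284_791_354, 166_978_397),
    "SRT-3": (397_351_240, 250_124_715),
    "CBT-1": (271_044_518, 168_104_823),
    "CBT-2": (395_933_348, 272_908_023),
    "CBT-3": (272_295_622, 147_577_500),
}


def rrna_percent(total_reads, non_rrna_reads):
    """Percent of reads that are rRNA, assuming rRNA = total - non-rRNA."""
    return 100.0 * (total_reads - non_rrna_reads) / total_reads


if __name__ == "__main__":
    for sample, (total, non_rrna) in READ_COUNTS.items():
        print(f"{sample}: {rrna_percent(total, non_rrna):.1f}% rRNA")
```

Running the loop prints one line per metatranscriptome sample, which makes it straightforward to see which samples fall inside or outside the range quoted in the text.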
However, for one sample, SRT1, rRNA removal was less successful, with rRNA reads making up approximately 72% of the sequences. The low number of non-rRNA sequences resulted in a much lower number of contigs (Table 4.2) and fewer overall annotations (Table 4.3). Therefore, sample SRT1 was not used for further analysis, as it is a technical outlier. Bulk soil samples were not collected from the switchgrass field because no root-free area of soil could be found.

Metatranscriptome samples were tested using PERMANOVA to determine if there was a statistically significant difference between SRT and CBT samples. The PERMANOVA shows that the corn and switchgrass metatranscriptome samples are not significantly different from one another (Table 4.1). Metagenome samples of the switchgrass rhizosphere (SRG) and the corn bulk soil (CBG), collected from the same samples as the metatranscriptome samples, were statistically different, with a PERMANOVA p-value of approximately 0.001 (Table 4.1).

Table 4.1: PERMANOVA analysis of metagenome and metatranscriptome samples. Permutational multivariate analysis of variance of samples. SRT is the switchgrass rhizosphere metatranscriptome, CBT is the corn bulk metatranscriptome, C is the corn rhizoplane metagenome, S is the switchgrass rhizoplane metagenome, SR is the switchgrass rhizosphere metagenome, CR is the corn rhizosphere metagenome, SRG is the switchgrass rhizosphere metagenome samples collected with the metatranscriptome samples, CBG is the corn bulk metagenome samples collected with the metatranscriptome samples, CNRP is the combination of the corn bulk and rhizosphere metagenome samples, and SNRP is the combination of the two treatments of switchgrass rhizosphere metagenome samples.
Comparison    PERMANOVA p-value
SRT-CBT       0.1
SRG-CBG       0.001389
C-S           0.031
S-SR          0.01
S-SRG         0.029
SR-SRG        0.2014
C-CR          0.001
C-CBG         0.001
CR-CBG        0.1
CNRP-SNRP     0.002

Both metatranscriptome sample sets share the top three subsystem annotations, namely the Clustering-based subsystem, Protein Metabolism and Carbohydrates (Figure 4.1a). The Clustering-based subsystem is defined as a subsystem in which there is evidence of functional coupling between annotations with no known function. The high [...] functions. Another very abundant subsystem is Protein Metabolism, which contains mostly housekeeping functions such as translation. The majority of the relatively abundant functional annotations within the Carbohydrates subsystem are related to housekeeping functions such as central metabolism (Figure 4.1b). Other level two functions within the Carbohydrates subsystem relate to the metabolism of various carbohydrates. Other abundant housekeeping-related functions include Amino Acids and Derivatives and RNA Metabolism.

Figure 4.1: Average relative abundance of metatranscriptome annotations. Shows annotations based on the MG-RAST SEED Subsystem database. (a) Average relative abundance of corn and switchgrass metatranscriptome annotations. (b) Relative abundance of metatranscriptome annotations in the Carbohydrate subsystem level 2.
[Figure 4.1a. Axes: SEED Subsystems; Average Relative Abundance. Series: Switchgrass Rhizosphere Metatranscriptome, Corn Bulk Metatranscriptome.]

[Figure 4.1b. Axes: Carbohydrate Subsystem level 2; Average relative abundance. Series: CBT, SRT.]

Metagenome analysis

To explore the continuum of root effects on microbial communities we used nonmetric multidimensional scaling (NMDS) analysis and PERMANOVA on the rhizoplane metagenomes (S and C) and rhizosphere metagenomes (SR and CR). The two plant rhizoplane metagenomes (C and S) were statistically different (Table 4.1) even though they appear to cluster together in the graph of the NMDS analysis (Figure 4.2). Furthermore, the switchgrass rhizoplane metagenomic samples (S) are also statistically different from both the switchgrass rhizosphere metagenomic samples (SR) and the rhizosphere samples collected during the metatranscriptome sampling (SRG) (Table 4.1). The two sets of switchgrass rhizosphere metagenomic samples collected by different methods (SRG and SR) are, as expected, not statistically different (Table 4.1). Like the switchgrass rhizosphere metagenomic samples, the corn bulk (CBG) and corn rhizosphere (CR) metagenomic samples are both significantly different (Table 4.1) from the corn rhizoplane (C). However, the corn bulk (CBG) and rhizosphere (CR) metagenomic samples are not statistically different from one another (Table 4.1) even though they appear to cluster independently in the graph of the NMDS analysis (Figure 4.2).

Figure 4.2: Nonmetric multidimensional scaling (NMDS) analysis of metagenome samples. Metagenome samples were log2 plus one transformed to increase normality.
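The transformation named in the caption, together with the Bray-Curtis dissimilarity used for the PERMANOVA, can be illustrated with a short Python sketch. This is not the vegan code used in the study; the function names and the toy abundance vectors are assumptions made for illustration.

```python
import math


def log2_plus_one(counts):
    """Apply the log2(x + 1) transform to a list of non-negative abundances."""
    return [math.log2(x + 1) for x in counts]


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    total = sum(x + y for x, y in zip(a, b))
    if total == 0:
        return 0.0  # two empty profiles are treated as identical
    return sum(abs(x - y) for x, y in zip(a, b)) / total


if __name__ == "__main__":
    # Toy annotation counts for two hypothetical samples (not data from this study).
    sample_1 = log2_plus_one([1, 3, 0])
    sample_2 = log2_plus_one([3, 1, 0])
    print(round(bray_curtis(sample_1, sample_2), 4))
```

Adding one before taking the logarithm keeps annotations with zero counts defined (they map to zero), and the transform compresses the influence of a few very abundant annotations on the dissimilarity.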
Since the two switchgrass rhizosphere metagenomic samples (SR and SRG) and the corn bulk (CBG) and rhizosphere (CR) metagenomic samples are not statistically different from one another, the samples were combined by associated plant and termed non-rhizoplane samples (CNRP and SNRP). A PERMANOVA analysis of the two non-rhizoplane treatments shows that the two treatments are significantly different from one another (Table 4.1).

We used edgeR to identify functional annotations that are differentially abundant in the various treatments. The comparison of the corn rhizoplane metagenomic samples to the non-rhizoplane corn metagenomic samples identified 294 enriched functional annotations in the rhizoplane versus 73 enriched in the non-rhizoplane samples (Figure 4.3). Many of the differentially abundant functions are the only members of the subcategories within the hierarchical structure of the SEED Subsystem to which they belong. These functional annotations do not comprise a complete or even partial functional pathway and hence do not reveal likely major functional changes. They may be the result of undersampling. Therefore we present only differentially abundant annotations with reasonable representation within a pathway.

The corn rhizoplane metagenomic samples are enriched for many functions thought to be common in plant-microbe interactions. These include subcategories of the Carbohydrate subsystem (Di- and oligosaccharides, Monosaccharides, and Organic Acids); four chemotaxis and five flagellum associated functional annotations; and protein secretion systems with 14 annotations. Interestingly, there are many functional annotations related to DNA exchange, with five functions related to plasmid-encoded T-DNA and five annotations associated with conjugative transfer. The subsystem with the greatest number of enriched functional annotations in the non-rhizoplane corn samples is Phages, Prophages, Transposable Elements and Plasmids.
All but one of these annotations are related to Phage replication and reproduction. The non-rhizoplane corn metagenomic samples are also enriched for Phage shock proteins and have two CRISPR-associated hypothetical protein annotations.

Figure 4.3: Comparison of differentially abundant annotations in corn rhizoplane and non-rhizoplane metagenomic samples. Number of differentially abundant annotations in corn rhizoplane and non-rhizoplane samples based on SEED Subsystems.

[Figure 4.3. Axes: SEED Subsystems; Number of differentially abundant annotations. Series: Corn Rhizoplane, Corn Non-rhizoplane.]

When the switchgrass rhizoplane was compared to the non-rhizoplane switchgrass associated samples using edgeR, 119 functional annotations were associated with the non-rhizoplane switchgrass metagenomic samples, while 391 functional annotations were associated with the rhizoplane switchgrass metagenomic samples (Figure 4.4). The annotations enriched in the switchgrass rhizoplane metagenomic samples, like the corn rhizoplane enriched functions, relate to plant-microbe interactions such as the utilization of root exudates. The Carbohydrate subsystem contains enriched functions related to utilization of di- and oligosaccharides, monosaccharides, organic acids and sugar alcohols. The Iron acquisition and metabolism subsystem contains 21 enriched functions related to siderophores. The Membrane Transport subsystem contains 33 enriched functions related to type IV secretion, 17 of which are related to conjugative transfer.
There are 11 enriched functions related to T-DNA and ten related to resistance to antibiotics and toxins. Functional annotations associated with non-rhizoplane switchgrass metagenomic samples include seven related to central metabolism and 11 related to phage.

Figure 4.4: Comparison of differentially abundant annotations in switchgrass rhizoplane and non-rhizoplane metagenomic samples. Number of differentially abundant annotations in switchgrass rhizoplane and non-rhizoplane samples based on SEED Subsystems.

[Figure 4.4. Axes: SEED Subsystems; Number of differentially abundant annotations. Series: Switchgrass Rhizoplane, Switchgrass Non-rhizoplane.]

Discussion

Analysis of corn rhizosphere and bulk soil associated samples showed (via PERMANOVA analysis) that the two sample sets are not statistically different (Table 4.1) even though the samples appear to cluster separately in the NMDS analysis (Figure 4.2). Additional sampling may aid in resolving this discrepancy, as a PERMANOVA p-value of 0.1 is considered by some to be marginally significant. This result led us to combine the corn bulk and rhizosphere samples into a single treatment called corn associated non-rhizoplane samples, as we could not reliably differentiate between bulk and rhizosphere soils. We subsequently reclassified and combined the two treatments of switchgrass rhizosphere into switchgrass associated non-rhizoplane soil.
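For readers unfamiliar with PERMANOVA, the test behind the p-values in Table 4.1 can be sketched for the simplest one-factor case. The Python code below is a generic illustration under stated assumptions (a precomputed symmetric distance matrix, random label permutations, hypothetical function names and toy data); it is not the vegan implementation used in this study and is not expected to reproduce the values in Table 4.1.

```python
import random


def pseudo_f(dist, groups):
    """One-way PERMANOVA pseudo-F from a square distance matrix and group labels."""
    n = len(groups)
    labels = sorted(set(groups))
    # Total sum of squares: squared distances over all pairs, divided by N.
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares: the same quantity computed inside each group.
    ss_within = 0.0
    for g in labels:
        idx = [i for i in range(n) if groups[i] == g]
        pairs = sum(dist[i][j] ** 2 for k, i in enumerate(idx) for j in idx[k + 1:])
        ss_within += pairs / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (len(labels) - 1)) / (ss_within / (n - len(labels)))


def permanova(dist, groups, n_perm=999, seed=0):
    """Return (pseudo-F, permutation p-value) using random label shuffles."""
    rng = random.Random(seed)
    f_obs = pseudo_f(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Relative tolerance so relabelings equivalent to the observed grouping count as ties.
        if pseudo_f(dist, shuffled) >= f_obs * (1 - 1e-9):
            hits += 1
    return f_obs, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical one-dimensional "samples"; distances are absolute differences.
    points = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
    groups = ["A"] * 4 + ["B"] * 4
    dist = [[abs(p - q) for q in points] for p in points]
    f_stat, p_value = permanova(dist, groups)
    print(f"pseudo-F = {f_stat:.2f}, p = {p_value:.3f}")
```

Because the p-value is a proportion of label permutations, its resolution depends on how many distinct relabelings the design allows, which is one reason additional sampling can help resolve borderline comparisons.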
These results indicate that [...] We initially predicted that bulk soil samples (not influenced by plants) should ordinate at an intermediate position between the corn and switchgrass treatments. It may be that [...] since the tops are harvested for biofuel) were found near the bulk soil sample collection [...] communities associated with an annual crop such as corn would also come into contact with dead plant material during a growing season. This continuous exposure to decaying plant material may influence the bulk, rhizosphere and rhizoplane communities to possess similar functional traits, making it difficult to distinguish one from the other.

Non-rhizoplane samples could easily be differentiated from their associated rhizoplane samples based on their PERMANOVA p-values (Table 4.1) as well as the NMDS (Figure 4.2). In the rhizoplane samples we see enrichment of functions commonly associated with plant-microbe interactions. Both corn and switchgrass are enriched for annotations related to carbohydrate utilization (many of which are potentially related to utilization of root exudates) and protein secretion, which can be a form of chemical communication between plants and microbes [15]. The corn rhizoplane is also enriched for chemotaxis and flagellum related annotations. Root exudates have been shown to act as a chemo-attractant to some bacteria; therefore it is not surprising to find the rhizoplane enriched with functional annotations related to chemotaxis and flagella [16]. Furthermore, bacterial movement is more common in the rhizosphere than in bulk soil. The switchgrass rhizoplane is enriched for capsular polysaccharide biosynthesis, indicating bacterial cell growth, as well as for phosphorus and iron utilization, nutrients commonly liberated by microbes and utilized by associated plants [17].

Many of the non-rhizoplane samples were collected as rhizosphere samples. By labeling them non-rhizoplane samples we are not suggesting that they are devoid of information.
Furthermore, the presence of differentially abundant functions associated with plant-microbe interactions in the rhizoplane samples does not negate the possibility that these samples originate from the rhizosphere. Instead these data only show that plant-microbe associated functions are more abundant in the rhizoplane. Comparing the corn rhizoplane to the corn non-rhizoplane sequences, all of the differentially abundant functions in the rhizoplane were also present in the corn non-rhizoplane annotations. The differentially abundant switchgrass rhizoplane sequences contain only two functions not found in the switchgrass non-rhizoplane sequences. We simply lack sufficient data, [...] rhizosphere samples.

The non-rhizoplane samples have fewer differentially abundant annotations. The differentially abundant annotations found in non-rhizoplane [...] many annotations for the same pathways. The lack of differentially abundant functions in the non-rhizoplane samples indicates the functional similarity between the two samples. It can be inferred that the non-rhizoplane samples are a reflection of the rhizoplane samples with a lower abundance of classical functional annotations associated with plant-microbe interactions.

In our initial hypothesis we postulated that metagenomic techniques would allow us to differentiate among bulk, rhizosphere and rhizoplane soils associated with corn and switchgrass. Additionally, we postulated that the metatranscriptome would show a greater statistical difference between bulk and rhizosphere soils associated with switchgrass. Interestingly, when compared using PERMANOVA the two metatranscriptome sample sets did not show a statistically significant difference (p = 0.1, Table 4.1). Counter to our hypothesis, the metagenome sequence associated with the metatranscriptome (SRG and CBG) showed a very significant difference. The assumption underlying our hypothesis was that the main driver of microbial community transcription (activity) is plant-microbe interactions.
However, the relative abundance data from the metatranscriptome suggest that microbial community activity was mostly related to housekeeping functions such as transcription, translation and central metabolism. Influence from plant roots or detritus seems to have had little impact on microbial community transcription at the time of sampling. Other factors common to both samples, such as temperature and moisture content of the soil, may be playing a larger role in transcription by the microbial communities. Another factor could be that some of the roots we collected were older, lignified roots and therefore secreting fewer exudates. We did, however, collect only the small roots, i.e. < 1 mm diameter. Sampling throughout the growing season or closer to root tips may reveal different patterns of transcriptional activity.

Conclusion

We investigated the effect of plant root exudates on microbial communities across a continuum of samples spanning the rhizoplane, rhizosphere and bulk soil. We were able to differentiate rhizoplane soil samples from non-rhizoplane soil samples using metagenomic sequence. Rhizoplane samples showed differential abundance of functions related to plant-microbe interactions such as carbohydrate utilization (potentially related to root exudates), protein secretion and biogeochemical cycling. Bulk soil samples could not be differentiated from rhizosphere soil, possibly due to undersampling. However, the bulk soil is clearly influenced by the presence of nearby or recent plants, not necessarily by [...] between bulk and rhizosphere soils, while the associated metagenomes did. The metatranscriptome samples were enriched for functional annotations related to housekeeping processes, indicating that plant-microbe interactions were not the main driver of microbial community transcription at the time of sampling.
Taken together these data illustrate the complexity of natural soil systems and the need for further efforts to develop more accurate conceptual models of plant effects on soil microbial communities or to develop better methods to sample active root tips.

APPENDIX

Table 4.2: Summary of metagenome and metatranscriptome reads, assembly and assembled read abundance

Sample            Total Reads    Non-rRNA reads   Assembled contigs   Percent Assembled*
SRT-1             246,895,742    68,949,934       440,213             81.23%
SRT-2             284,791,354    166,978,397      1,825,857           80.34%
SRT-3             397,351,240    250,124,715      2,237,997           82.42%
CBT-1             271,044,518    168,104,823      2,070,236           77.22%
CBT-2             395,933,348    272,908,023      2,884,984           81.24%
CBT-3             272,295,622    147,577,500      1,972,841           79.44%
SRG-1             298,716,384    NA               6,606,700           40.82%
SRG-2             338,846,620    NA               4,076,354           44.87%
SRG-3             298,364,910    NA               6,207,377           40.93%
CBG-1             367,768,170    NA               5,471,300           54.25%
CBG-2             343,375,648    NA               5,254,567           46.66%
CBG-3             342,045,434    NA               4,960,472           57.39%
Corn-1            335,539,202    NA               4,985,554           58.37%
Corn-2            362,890,032    NA               5,795,619           48.31%
Corn-3            340,306,582    NA               5,159,075           49.56%
Corn-4            353,097,824    NA               5,306,152           44.89%
Corn-5            396,880,408    NA               5,227,891           44.44%
Corn-6            419,729,170    NA               5,989,941           43.18%
Corn-7            411,276,622    NA               6,235,593           49.49%
Corn-R1           416,788,124    NA               6,209,886           35.57%
Corn-R2           379,476,104    NA               6,254,363           39.26%
Corn-R3           362,767,630    NA               6,119,938           49.40%
Miscanthus-1      349,441,596    NA               5,733,645           47.42%
Miscanthus-2      356,902,550    NA               6,365,326           47.73%
Miscanthus-3      355,019,026    NA               6,058,969           44.76%
Miscanthus-4      334,711,394    NA               5,764,967           50.37%
Miscanthus-5      282,058,246    NA               4,566,237           40.79%
Miscanthus-6      213,309,852    NA               2,879,035           30.27%
Miscanthus-7      399,148,934    NA               6,517,537           44.98%
Switchgrass-1     449,331,454    NA               6,606,700           58.36%
Switchgrass-2     352,965,222    NA               4,076,354           41.81%
Switchgrass-3     405,940,264    NA               6,207,377           48.30%
Switchgrass-4     415,253,680    NA               6,156,833           45.96%
Switchgrass-5     389,928,900    NA               5,430,532           43.88%
Switchgrass-6     353,644,618    NA               5,843,003           49.98%
Switchgrass-7     364,517,026    NA               5,715,056           47.57%
Switchgrass-R1    305,009,580    NA               4,592,503           35.57%
Switchgrass-R2    332,653,200    NA               5,545,283           39.29%
Switchgrass-R3    312,867,640    NA               4,239,083           34.52%

Table 4.3: Metatranscriptome protein coding annotations

Sample    Protein Coding Annotations
CBT-1     8,780
CBT-2     9,110
CBT-3     8,798
SRT-1     6,606
SRT-2     8,522
SRT-3     8,967

Chapter 5 Conclusions and Future Directions

Conclusions

Soil microbial communities provide many beneficial services that humans rely on. For many years microbial ecologists have studied these beneficial services in laboratory conditions, in growth chambers and greenhouses.
However, we lack foundational knowledge of the beneficial services provided by soil microbes under natural environmental conditions and in communities. This dissertation attempts to narrow the knowledge gap through the development and use of a novel technological method, metatranscriptomics. This dissertation first evaluates the practicality and use of metatranscriptomics on field-collected agricultural soil samples. To my knowledge this has not been done before. Additionally, this dissertation attempts to establish best practices for metatranscriptomic analysis. Second, this dissertation utilizes metatranscriptomics to identify the actively transcribed genes through the use of a minimum functional core composed of annotations found in both metagenome and metatranscriptome samples collected from the same site. Finally, this dissertation explores the relationship between microbial potential activity and actual activity in relation to distance from living plant roots.

In chapter two of this dissertation a novel method of rRNA removal, called duplex specific nuclease (DSN) normalization, is explored using a soil sample collected from an agricultural field site. The DSN is not as efficient at removing rRNA as other probe-based methods such as the RiboZero kit. However, the DSN still offers several advantages over probe-based methods. The DSN requires only 10 ng of total RNA as input, while the RiboZero kit requires at least one microgram of material. Another advantage of the DSN is that the rRNA that is not removed from the sample can be used for phylogenetic analysis. This is because the DSN is a normalization procedure that decreases the relative abundance of the most abundant sequences while preserving the relative abundance of all sequences across the sample. The RiboZero kit removes rRNA based on probes, so any contaminating rRNA in the metatranscriptome sample is present because there was not a probe in the RiboZero kit with a close enough match to remove the sequence.
Chapter two of this dissertation also identifies best practices for metatranscriptomic analysis, namely the need for short read assembly. Assembly was shown to greatly improve the confidence in annotation.

Chapter three of this dissertation utilized a multi-omics approach to identify actively transcribed and translated genes. To accomplish this goal a minimum functional core of functional annotations present in two of three metagenome and metatranscriptome samples was established. The metaproteomes derived from the metagenomic and metatranscriptomic data were compared to the minimum functional core. Finally, the minimum functional core was compared to plant-specific functional cores derived from rhizoplane soil taken from switchgrass, corn and Miscanthus. This comparison showed that over 90% of functions in the minimum functional core were found throughout the field site, indicating that it is broadly representative of the site. The minimum functional core was composed of many housekeeping functions mostly related to transcription, translation and central metabolism. The minimum functional core also contained functions related to important biogeochemical cycles such as carbon, nitrogen, phosphorus and iron. Functions related to plant-microbe interactions were also found within the minimum functional core.

Chapter four of this dissertation examines the effect of proximity to living roots on the functional composition of the microbial community. In this study metagenomes and metatranscriptomes from corn bulk soil and switchgrass rhizosphere are compared. Corn bulk soil (from the root-free soil between the corn rows) was used to approximate switchgrass bulk soil, as roots from the 5-year-old switchgrass stand had fully penetrated the plot and suitable bulk (root-free) soil could not be found. Analysis of the corn bulk and switchgrass rhizosphere metagenomes showed a significant difference while the metatranscriptomes did not show a significant difference.
These data show that while the metagenomes underlying the metatranscriptome show differences in functional composition, the activity of the community is very similar. These data indicate that at the time of sampling plant treatment was not a strong driver of community function. The similarity in community activity could be due to environmental conditions at the time of sampling.

Metagenome samples collected from the rhizoplane of switchgrass and corn show a statistically significant difference. Metagenomes from both corn and switchgrass rhizoplanes were compared to their respective rhizosphere samples and were statistically different. However, the corn bulk soil samples were not statistically different from the corn rhizosphere samples. Because of these results, the rhizosphere samples were labeled as non-rhizoplane samples to better reflect the fact that they were not statistically different from the bulk samples. Nonmetric multidimensional scaling analysis shows that the corn bulk samples do not cluster between rhizosphere samples as expected. Instead, the bulk soil samples cluster along a shared trajectory with the corn rhizosphere and rhizoplane samples. This indicates that the bulk soil is under the influence of the plant treatment, possibly due to root exudates but more likely due to corn plant residues from previous seasons.

Overall, my dissertation highlights some of the challenges of using field-collected samples for plant-microbe interaction omics sequencing studies. Studies conducted in a growth chamber or a greenhouse have the luxury of a relatively stable, controlled and more homogeneous environment, in contrast to the heterogeneous environment experienced by microbes in nature. Laboratory-based studies of plant-microbe interactions are also better able to collect the desired sample types, as access to them may be built into the study design.
For example, one can have a design providing for more confidence in collecting a rhizosphere sample with a more defined relationship to living active roots, and to roots of a particular age. In the field one must uproot plants to collect samples, and the most accessible roots may be older and lignified. The more actively growing root tips may not be accessible in the field. Greater care must be given to field sample collection in order to collect the most informative samples. For these reasons field-collected samples may show additional variation in microbial activity. Any patterns of activity may be further obscured by variance in environmental conditions and sampling procedures. These findings, while not ideal for experimental design, identify factors that must be taken into account when collecting field samples.

Future directions

A common theme in this dissertation, and a common problem among all metagenome and metatranscriptome studies, is the existence of hypothetical proteins. A meta-analysis of published metagenomes and metatranscriptomes could be conducted to describe the distribution of hypothetical genes across different habitats. This work could potentially identify hypothetical proteins as enriched in a particular habitat or ubiquitous throughout many different types of microbiomes. Of equal or greater importance are unannotated contigs. A similar meta-analysis approach could be used to bin common unannotated contigs by habitat based on sequence identity. Again, this work could establish links between unannotated contigs and particular environments or conditions, or identify them as cosmopolitan.

The minimum functional core described in chapter three could be applied to other soils and environments.
Comparing the minimum functional core to other soils and environments could aid in determining which functions are cosmopolitan in all environments, such as many housekeeping functions, and identify functions that are specific to a given environment such as soil, gut, aquatic habitat, etc. This would also allow for better comparison between soil types to identify how environment shapes microbial community functions.

Identification of biotic and abiotic factors that influence the activity of plant-microbe interaction genes and functions involved in biogeochemical cycling is of particular importance. The ability to promote and control these functions could have a variety of environmental impacts such as reducing fertilizer use, promoting plant growth and mitigating climate change by sequestering carbon in soils. A time series sampling approach combined with metatranscriptomics could provide valuable insight into biotic and abiotic factors affecting microbial community activity related to plant-microbe interactions and biogeochemical cycling. If full metatranscriptomics is not deemed feasible, a targeted metatranscriptome approach could be used. The program Xander could be used to assemble low abundance genes of interest, and primers could then be designed to target the desired set of genes.

Finally, chapter four of this dissertation raises questions about the effect of plants, both through living roots and potentially through leftover residues, on the microbial community functional composition. To further investigate the effect of plant treatment in agricultural soils, the metagenomics portion of the experiment in chapter four could be repeated for the corn plot only. Additional samples would be collected from both the rhizosphere and the bulk soil (between crop rows). I would recommend that at least six samples be taken from each soil zone to ensure enough statistical power. Alternatively, soil could be collected from between crop rows of corn, sieved and taken to the lab.
The soil could then be divided into three treatments: a corn treatment where corn stover is mixed into the soil, a switchgrass treatment where cuttings from switchgrass are mixed into the soil, and a control where nothing is added to the soil. Metagenome sequence could be collected before the treatment and again after approximately six months.
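
The group comparisons in chapter four relied on PERMANOVA, and the follow-up design above proposes six samples per soil zone. As a rough illustration only, the sketch below runs a one-way PERMANOVA on synthetic count data for such a six-versus-six design. The Bray-Curtis dissimilarity, the Poisson-simulated counts, the 999 permutations and all function names (bray_curtis, pseudo_f, permanova) are assumptions made for this example and are not taken from the analysis in this dissertation.

```python
import numpy as np


def bray_curtis(counts):
    """Pairwise Bray-Curtis dissimilarity between rows of a samples x functions matrix."""
    counts = np.asarray(counts, dtype=float)
    # Sum of absolute differences over functions for every pair of samples
    diff = np.abs(counts[:, None, :] - counts[None, :, :]).sum(axis=2)
    # Sum of the two sample totals for every pair of samples
    totals = counts.sum(axis=1)
    return diff / (totals[:, None] + totals[None, :])


def pseudo_f(dist, labels):
    """Pseudo-F statistic of a one-way PERMANOVA computed from a distance matrix."""
    labels = np.asarray(labels)
    n = len(labels)
    groups = np.unique(labels)
    d2 = np.asarray(dist, dtype=float) ** 2
    # Total sum of squares: squared distances over all sample pairs, divided by n
    ss_total = d2[np.triu_indices(n, k=1)].sum() / n
    # Within-group sum of squares: same quantity computed inside each group
    ss_within = 0.0
    for g in groups:
        idx = np.where(labels == g)[0]
        sub = d2[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(len(idx), k=1)].sum() / len(idx)
    ss_between = ss_total - ss_within
    a = len(groups)
    return (ss_between / (a - 1)) / (ss_within / (n - a))


def permanova(dist, labels, permutations=999, seed=0):
    """Return the observed pseudo-F and a permutation p-value."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = pseudo_f(dist, labels)
    # Count label shufflings whose pseudo-F is at least as large as the observed one
    hits = sum(
        pseudo_f(dist, rng.permutation(labels)) >= observed
        for _ in range(permutations)
    )
    return observed, (hits + 1) / (permutations + 1)


# Synthetic example (assumed data): 6 bulk and 6 rhizosphere samples, 50 functions,
# with the first 10 functions simulated as more abundant in the rhizosphere.
rng = np.random.default_rng(42)
n_functions = 50
bulk = rng.poisson(lam=20, size=(6, n_functions))
rhizo_mean = np.full(n_functions, 20.0)
rhizo_mean[:10] = 60.0
rhizosphere = rng.poisson(lam=rhizo_mean, size=(6, n_functions))

counts = np.vstack([bulk, rhizosphere])
labels = ["bulk"] * 6 + ["rhizosphere"] * 6

f_stat, p_value = permanova(bray_curtis(counts), labels)
print(f"pseudo-F = {f_stat:.2f}, p = {p_value:.3f}")
```

Rerunning the sketch with smaller group sizes or weaker simulated shifts gives a rough sense of how the number of samples affects the ability to detect a difference; it is not a substitute for a formal power analysis.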