RHIZOSPHEREMETAGENOMICSOFTHREEBIOFUELCROPSByJiarongGuoADISSERTATIONSubmittedtoMichiganStateUniversityinpartialoftherequirementsforthedegreeofMicrobiologyandMolecularGeneticsŒDoctorofPhilosophy2016ABSTRACTRHIZOSPHEREMETAGENOMICSOFTHREEBIOFUELCROPSByJiarongGuoSoilmicrobesformassociationswithcropsintherhizosphereandalsoplayamajorroleinecosystemfunctions,suchastheNandCcycles.Thuslarge-scalecultivationofbiofuelcropswillhaveaimpactonecosystemfunctionsatleastregionally.Inrecentyears,advancesinhighthroughputsequencingtechnologieshaveenabledmetagenomics,whichinturnopensnewwaystoaccesstheunknownmajorityinmicrobiologybutposesgreatchallengesfordataanalysisduetothelargedatasizeandshortreadlengthofsequencedatasets.Wegeneratedabout1Tbofshotgunmetagenomicdatafromrhizopheresoilsamplesofthreebiofuelcrops:corn,switchgrass,andMiscanthus.Mycentralgoalistodevisemethodstoextractmeaningfromthisrhizospheremetagenomicdata,withafocusonNcyclegenessinceNisthemostlimitingresourceforsustainableforbiofuelproduction.Iinitiallyprovideareviewofgene-targetedmethodsforanalyzingshotgunmetagenomics.InthesecondchapterIdevelopamethodthatimprovesthespeedwithwhichrRNAgenesfrag-mentscanbefoundandanalyzedinlargeshotgunmetagenomedatasets,therebyavoidingprimerbiasandchimerasthatareproblematicwithPCR-basedmethods.Ipresentapipeline,SSUsearch,toachievefasterofshortsubunitrRNAgenefragmentsplusprovidesunsupervisedcommunityanalysis.Thepipelinealsoincludesandcopynumbercorrection,andtheoutputcanbeusedintraditionalamplicondownstreamanalysisplatforms.Shotgunderivedrhizo-spheredatafromthispipelineyieldedhigherdiversityestimatesthanamplicondatabutretainedthegroupingofsamplesinordinationanalysis.OuranalysistheknownbiasagainstVerrucomicrobiainacommonlyusedV6-V8primersetaswellasdiscoveredlikelybiasesagainstActinobacteriaandforVerrucomicrobiainacommonlyusedV4primerset.Inthethirdchapter,IexploreanalternativephylogeneticmarkertothewidelyusedSSUrRNAgene,whichhasseverallimitationsincludingmultiplecopiesinthesamegenomesandlowresolu-tionfordifferentiatingstrains.IdemonstratethatrplB,asinglecopyproteincodinggene,providesresolutionmoreakintospeciesandsubspecieslevelandalsoscale(OTU)diversityanal-ysis.Themethodrequiresshotgunsequencesincethegeneisnotconservedenoughforrecoverybyprimers.Whenusedontherhizospheresequencedata,itrevealedmoremicrobialdiversityandbetterdifferentiatedthecommunitiesamongthethreecropsthantheSSUrRNAgeneanalysis.InthelastchapterIaddressmycentralbiologicalquestiononrhizospheremetagenomics:dotheydifferamongthethreecropsandwhatdoesthisinformationsuggestsaboutfunction?Icom-paretherhizospheremetagenomesforoverallcommunitystructure(SSUrRNAgene),overallfunction(annotationfromglobalassembly),andNcyclegenes(usingXander,atargetedgeneassemblytool).AllthreelevelsshowedcornhadadifferentcommunityfromMis-canthusandswitchgrass(exceptforammonia-oxidizingArchaea),andthatthetwoperennialsshowedatrendofseparation.Intermsoflifehistorystrategy,thecornrhizospherewasenrichedincopiotrophswhiletheperennialswereenrichedinoligotrophs.ThisisfurthersupportedbyhigherabundanceofgenesinthefiCarbohydratesflsubsystemcategoryandhigherfungi/bacteriaratios.Additionally,thenitrogencommunityofcornwasdominatedbynifHgenesmostcloselyaftoRhizobiumandBradyrhizobiumwhiletheperennialshadnifHsequencesmostrelatedtoCoraliomargarita,NovosphingobiumandAzospirillum,indicatingthattheperennialsindependentlyselectedmembers.Moreover,highernumbersofnitrogengenesandlowernumberofnitritereductiongenessuggestbetternitrogensustainabilityoftheperennials.Thesedataindicatethatperennialbioenergycropshaveadvantagesovercorninhighermicrobialspeciesrichnessandfunctionaldiversityaswellasinselectingmemberswithtraits,consistentwithNuseefy.CopyrightbyJIARONGGUO2016ACKNOWLEDGEMENTSFirst,IamverygratefultohaveDr.JamesTiedjeandDr.TitusBrownasmyco-advisers.TheynotonlyhaveprovidedcontinuoussupportofmywholePh.Dstudywiththeirgreatpatienceandimmenseknowledgeintheir(MicrobiologyandDataScience,respectively)butalsotailortheirmentoringstylesbasedonmyknowledgebackgroundandpersonality.Iamespeciallygrate-fulforthefreedomtheygivemeforindependentcriticalthinkingandexploratoryresearch,which,combinedwiththejoyfromsolvingchallengingproblems,arethemostenjoyablethingsformeinresearch.Ialsowouldliketothankmycommitteemembers(Dr.JamesCole,Dr.ChrisAdami,Dr.ShengYangHe,Dr.YanniSunfortheirinsightcomments,encouragement,andhardquestionsinmycomprehensiveexamthatwidenmyresearchfromvariousperspectivesandhelpmegrowasascientist.MysincerethanksalsogoestomyfellowlabmatesinbothBrownlabandTiedjelabforstimulatingdiscussionsandfriendships.Lastbutnottheleast,IwouldliketothankmyparentsforsupportingmespirituallythroughoutmyPh.Dstudyandmylifeingeneral.Thewisdomandcapacitytosympathizetheyhavetaughtmearekeytoovercomingadversitiesduringthisjourney.vTABLEOFCONTENTSLISTOFTABLES.......................................ixLISTOFFIGURES.......................................xCHAPTER1REVIEWOFGENE-TARGETEDMETHODSINSHOTGUNMETAGE-NOMICS....................................11.1Introduction......................................11.1.1Shotgunmetagenomics............................11.1.2Assemblygraphs...............................21.1.3Globalassembly...............................31.2Gene-targetedmethodsinshotgunmetagenomics..................31.2.1SSUrRNAgene................................41.2.1.1EMIRGE..............................41.2.1.2REAGO..............................51.2.1.3SSUsearch.............................61.2.2Functionalgenes...............................71.2.2.1Xander...............................71.2.2.2SAT-Assembler...........................81.3issueswithgene-targetedmethods....................91.3.1Referencedependency............................91.3.1.1Lackingreferencesequencesforfunctionalgenes.........101.3.1.2Largesequencevariationamongsubgroupswithincertaingenes.101.3.2Shortreadvs.contig.............................111.3.3Difofread-basedmethodforsomefunctionalgenes.........121.4Summaryandoutlook.................................12CHAPTER2MICROBIALCOMMUNITYANALYSISUSINGRIBOSOMALFRAG-MENTSINSHOTGUNMETAGENOMES..................142.1Abstract........................................142.2Introduction......................................152.3MaterialsandMethods................................162.3.1Soilsamples,DNAextractionandsequencing................162.3.2Datapreprocessing..............................172.3.3BuildingSSUandLSUrRNAgenemodels.................172.3.4ofrRNAgenefragmentsfrommetagenomicdata.......182.3.5Testingtheeffectoftargetregionsizeandvariableregiononclustering..192.3.6CommunitystructurecomparisonbasedonOTUsfromdenovoclustering.192.3.7ComparisonofOTU-basedmicrobialcommunitystructuresinferredfromshotgunandampliconSSUrRNAgenesequences..............202.3.8ComparisonoftaxonomybasedmicrobialcommunitystructuresinferredfromshotgunandampliconSSUrRNAgenesequences...........202.3.9Copynumbercorrection...........................21vi2.3.10Implementation,reproducibilityandsequencedataaccession........212.4Results.........................................212.5Discussion.......................................372.6Conclusion......................................43CHAPTER3GENOMICANDMETAGENOMICCOMPARISONOFrplBANDSSUrRNAGENEINSPECIESRESOLUTION..................453.1Introduction......................................453.2Methods........................................463.2.1Downloadingbacterialgenomes.......................463.2.2SSUrRNAgeneandrplBextractionandpairwisecomparison.......463.2.3Rhizospheresoilmetagenomicdatadownloadandpreprocess........473.2.4SSUrRNAgeneandrplBandcomparisoninsoilmetagenomes473.3Results.........................................493.3.1SSUrRNAgeneandrplBcopynumberandintra-genomeSSUidentity..493.3.2PairwisecomparisonofgenomeswithSSUrRNAgeneandrplB......503.3.3ComparisonofSSUrRNAgeneandrplBondiversityanalysesinlargesoilshotgunmetagenomes..........................553.4Disscussion......................................563.4.1Choosinggenomicdatasets..........................583.4.2AdvantagesofrplBoverSSUrRNAgeneasaphylogeneticmarker....593.4.3Noveltiesinthisstudy............................603.4.4LowerstrainlevelresolutionofSSUrRNAgeneduetoitsintra-genomevariations...................................603.4.5ComparisonoftwogenesindenovoOTUbasedcommunitydiversityanalyses....................................613.4.6SequencingdepthneededforrplBinmetagenomes.............613.4.7Lackofuniversalprimersforandreferencesequences....623.5Conclusion......................................62CHAPTER4RHIZOSPHEREMETAGENOMICSOFTHREEBIOFUELCROPS....644.1Abstract........................................644.2Introduction......................................654.3Methods........................................674.3.1Soilsamples,DNAextraction,andsequencing................674.3.2Datapreprocessing..............................674.3.3DiversityanalysisusingSSUrRNAgene..................694.3.4Globalmetagenomeassembly,annotationandfunctionaldiversity.....694.3.5Ncyclegeneassemblyanddiversity.....................704.3.6Dataaccession................................704.4Results.........................................714.4.1SSUrRNAgenebasedanalysis........................714.4.2Globalassembly...............................764.4.3Functionaldiversityusingannotationfromassembly............764.4.4Ncyclegenediversity............................78vii4.4.5Ncyclegeneabundanceandtaxonomycomposition.............784.4.6Metadata...................................844.5Discussion.......................................844.6Conclusion......................................93BIBLIOGRAPHY........................................94viiiLISTOFTABLESTable1.1Summaryofgene-targetedmethodsforshotgunmetagenomes..........13Table2.1ComparisonofsearchresultsfromSSUsearch,metaxaandmeta-rna.......24Table2.2Higherfungi/bacteriaratioandpercentofAMFfungiareinrhizospheresam-ple(M1)thanbulksample(SB1)..........................37Table3.1Rhizospheresoilshotgunmetagenomicdata....................48Table3.2SummaryofSSUrRNAgeneandrplBcopynumberingenomes.........50Table3.3SummaryofreadsasSSUrRNAgeneandrplBinsoilmetagenomes.56Table4.1SizeandJGIprojectIDofrawreaddatafor21samples..............71Table4.2SummaryofSSUsearchpipelineresultsfor21samples..............72Table4.3SharedOTUs,genera,andcontigamongrhizospheremicrobialcommunitiesofcorn(C),Miscanthus(M),andswitchgrass(S).................73Table4.4AveragepercentageofBacteria,Archaea,Eukaryota,andfungiinrhizospherecommunitiesofcorn(C),Miscanthus(M)andswitchgrass(S)..........75Table4.5Globalassemblystatistics.............................76Table4.6Soilchemistrytestresultsfor21samples.....................88ixLISTOFFIGURESFigure2.1FlowchartofSSUsearchpipeline.........................23Figure2.2Testingtheeffectoftargetregionsizeandvariableregiononclustering.....25Figure2.3TestingtheeffectoftargetregionsizeonV4offull-lengthSSUrRNAgenesinclusteringfromasyntheticcommunitywith64species............26Figure2.4TechnicalreproducibilitytestofourunsupervisedclusteringandcomparisonofOTUabundancesbetweenpairedshotgunandamplicondata.........27Figure2.5PhylaofOTUsdifferentbetweenshotgundataandamplicondata.29Figure2.6Principlecoordinatesanalysis(PCoA)ofampliconandshotgunderiveddatafromsevenreplicates.............................30Figure2.7Comparsionofordinationanalysiswithdifferentvariableregions........31Figure2.8AlphadiversitycomparisonsbetweencornandMiscanthususingV2,V4,V6andV8regions...................................32Figure2.9Bacterialphylumcomparisonusingdifferentvariableregions......33Figure2.10TaxonomyofBacterialphylaofshotgunfragmentsfromSSUandLSUrRNAgenesandofampliconreads.....................34Figure2.11TaxonomycomparisonatdomainlevelusingSSUandLSUrRNAgenes35Figure2.12BacterialphylumleveltaxonomysummarybeforeandafterSSUrRNAgenecopycorrection..................................36Figure2.13Ordinationplotofampliconandshotgunderiveddataaftercopycorrection...38Figure2.14Lengthdistributionoftrimmedreadsafterqualitytrimmingandpaired-endmerging......................................40Figure3.1Intra-genomevaritaionofSSUrRNAgeneofR.leguminosarum,P.putida,andE.coliinboxplot...............................49Figure3.2SSUrRNAgeneandrplBidentitiesamongspeciesinthesamegenus......51Figure3.3SSUrRNAgeneandrplBidentitiesamongR.leguminosarumrelatedgenomes.52Figure3.4SSUrRNAgeneandrplBamongP.putidarelatedgenomes...........53Figure3.5SSUrRNAgeneandrplBidentitiesamongE.colirelatedgenomes.......54xFigure3.6SSUrRNAgeneandrplBidentitiesamongallE.coligenomes.........55Figure3.7ComparisonofOTUnumbersusingSSUrRNAgeneandrplBinlargesoilmetagenomes...................................57Figure3.8ComparisonofSSUrRNAgeneandrplBinbetadiversityanalysis(ordina-tion)usinglargesoilmetagenomes........................57Figure3.9ComparisonofSSUrRNAgeneandrplBinalphadiversityanalysis(Shan-nondiversityindex)usinglargesoilmetagenomes................58Figure4.1Roots(Miscanthus)andrhizospheresoil.....................68Figure4.2Species(OTU)andfunctionaldiversitycomparsionamongthreecrops.....73Figure4.3Rankabundancecurveofmicrobialcommunitiesofcorn(C),Miscanthus(M)andswitchgrass(S)usingOTUfromSSUrRNAgene...........74Figure4.4Relativeabundanceoftoptenmostabundantbacterialphylainthreecrops...75Figure4.5Functionaldiversitycomparisonamongthreecropsusingsubsystemlevel1..77Figure4.6FunctioncomparisonbetweencornandMiscanthus...............79Figure4.7OrdinationanalysisofeachNcyclegeneandrplBusingOTU..........80Figure4.8OTUnumbercomparisonamongcorn(C),Miscanthus(M)andswitchgrass(S)usingOTUtablesofNcyclegenes......................81Figure4.9Diversity(Shannon)comparisonamongcorn(C),Miscanthus(M)andswitch-grass(S)usingOTUtablesofNcyclegenes...................82Figure4.10geneabundancesaftercombiningthegenesencodingenzymeswiththesamefunction..............................83Figure4.11RatiosofinterestrelatedtoNcyclegenesincorn(C),Miscanthus(M)andswitchgrass(S)..................................83Figure4.12AbundanceandphylumlevelassociationoftheindicatedNcyclegenesincorn(C),Miscanthus(M)andswitchgrass(S)..................84Figure4.13AbundanceandgenusleveldistributionofnifH.................85Figure4.14SequenceidentityofassembledNcyclegenestotheirbest-hitreferences....86Figure4.15DistributionofidentityofassemblednifHsequencestotheirbest-hitreferences87Figure4.16Ratioof2012yieldover2011andN2Omeasurednearsamplingtime...87xiFigure4.17Species(OTU)diversityofthreecrops......................89xiiCHAPTER1REVIEWOFGENE-TARGETEDMETHODSINSHOTGUNMETAGENOMICS1.1IntroductionSequencingtechnologyhasadvancedfromSangersequencingtomassiveshortread(finextflorsecondgeneration)sequencingsuchas454andIllumina,andthenagaintomassivelongread(thirdgeneration)sequencingsuchasPacBioandOxfordNanopore,justinthepasttenyears[1,2].Secondgenerationsequencingtechnologiesprovideveryhighthroughputandacceptablesequencingerror,andhastransformedbiologyresearchbecauseoftheirlowcostandlargeamountsofdata.Thirdgenerationsequencingtechnologiesarepromisingtoprovidelongreadalternativestosecond-generationtechnology,atthecostofhighersequencingerrorrates[3].1.1.1ShotgunmetagenomicsMetagenomics,thedirectsequencingofDNAextractedfromenvironmentalsamples,providesnewwaystoaccesstheunculturedmajorityinmicrobiomestudies.Techniquesbasedonampliconsequencingorgene-targetedmetagenomicsarelimitedbythelackofuniversalprimersetsandthepresenceofbothprimerbiasandPCRartifacts[4Œ6].Sincemetagenomeshotgunsequencingsamplesfromallgenomesinthecommunity,itcanpotentiallyovercometheseproblems,pro-vidingamorecompletepictureofthemicrobialcommunity.Massiveshortreadsequencingofshotgunmetagenomeshasadvancedourknowledgeaboutmicrobialcommunitymembersinmanyenvironments[7,8].Withthelargedatasetsproducedfromsecondgenerationdata,dataanalysisbecomethemainchallengeinmanyresearchprojects[9,10].Therearetwomainmethodsforanalysisofshortreaddata:readmappingandassembly.Readmappingrequiresareferencesequence(genomesandgenes)whileassemblydoesnot;assemblyisaprocessthatjoinreadsoffragmentedDNAtorecoverthesequenceoftheoriginalfulllengthDNA(usuallychromosomesorgenomes).1Forseveralreasons,metagenomeassemblygenerallydoesnotproducecompletegenomes,butjustlongersequencescalledcontigs[7,9],whichcouldbefurthermergedintoscaffoldsbypaired-endormate-pairinformation[11].1.1.2AssemblygraphsAssemblyoutputsareusuallylinearsequencesbuttheassemblyprocessrequiresaconnectivitygraphandalgorithmsthatoperateonthatgraphtoproduceassembledcontigs.TwoofthemostcommondatastructuresusedfortheassemblyprocessaredeBruijngraph(DBG)andstringgraph(SG).ThedeBruijngraphchopsreadsintoevensmallerk-mersandthenbuildsagraphconnectingk-mersthatsharek-1bases,whilethestringgraphoverlaps(largerthanalengthcutoff)amongallreadsandthenconnectreadsbasedontheoverlappinginformation[11,12].Inthestringgraph,overlapcalculationrequiresall-against-allpairwisereadcomparisonandthusisveryexpensiveinCPUtime.ThedeBruijngraphisunintuitivebecauseitbreaksdownthereadsbutitachievesfasterCPUtimebyavoidingtheexpensiveall-against-allpairwisecomparisons.Thereareonlyeightpossibleneighboringk-mersforeachk-merbyextendingabase(A,T,C,orG)oneithersideandneighboringk-merconnectionscanbecalculatedveryefjustbycheckingwhetherthoseeightpossibleneighboringk-mersarepresentinthedata.Intermsofmemorycost,thestringgraphincreaseswithsequencingdepthwhilethedeBruijngraphisnotrelatedtosequencingdepth(onlydeterminedbytotalgenomesize)[13],iftherearenosequencingerrors.BothstringgraphanddeBruijngrapharesensitivetosequencingerrors,whichgreatlyincreasesthecomplexityofthegraphsandmemoryusage.Eachsequencingerrorcancausekspuriousnodes(k-mers)indeBruijngraphandonespuriousnode(string)instringgraph.Thestringgraphworkswellwithlongreadsbypreservingtheintegrityofreadsbutdoesnotscalewithhighdepthdata,whereasthedeBruijngraphwellwithhighdepthcoverageshortreadsthatsecondgenerationsequencingplatformproduces(454andIllumina).ThisresultedinaboomofdeBruijngraphassemblers,suchasVelvet,ABySS,AllPathsandSOAPdenovo2[11,14Œ16].Alternatively,Simpsonetal.[12]haveappliedFMindexderivedfromBurrows-Wheelertransformationtooverlapsamongreadswhichreducesthetimecomplexityofstringgraphassemblytobelinearwithgenomesizeandindependentofthesequencingdepth.Thismakesstringgraphsanoptionwithsecondgenerationsequencing.1.1.3GlobalassemblyInanalyzingshotgunmetagenomes,anintuitiveapproachistorecoverallgenomesinthemetagenomes,i.e.,globalassemblyincontrasttogene-targeted(local)assembly.GlobalassemblyhasbecomeacommonstepforshotgunmetagenomicanalysesandtherearemanymetagenomeassemblersavailablesuchasMetaVelvet,IDBA-UD,andMEGAHIT[17Œ19].Althoughtherehavebeenmanymajorimprovementsmadeinrecentyearswithreadutilizationinassemblies[17],globalassemblyresultsarestilllimitedbychallengessuchasrepeats,sequencingerrors,lowcoverageandlargedatasets,especiallyforcomplexenvironmentslikesoil[7].Moreover,globalassemblydoespoorlyongeneswithconservedregionssuchSSUrRNAgenesduetothecomplexitythisintroducesintheassemblygraph;typically,globalassemblycollapsessequenceswithhighidentitiesthatareusefulfordiversityanalysis[6,20].1.2Gene-targetedmethodsinshotgunmetagenomicsManystudiesonlyfocusonimportantgenesorpathwayssuchasthenitrogencycle,biodegradationorantibioticresistancegenessoglobalassemblyisnotnecessaryforthesestudies.Gene-targetedmethodscanalsobemorecomputationallyefbecausetheyfocusonlyonreadsasthetargetgene.Herewediscussseveralgenetargetedmethodsthatareassembly-orread-basedandgroupthembytheirdesiredoutputŒSSUrRNAgeneorfunctional(protein-coding)gene.31.2.1SSUrRNAgeneSSUrRNAgeneisthestandardphylogeneticmarkerfordiversityanalyses[21,22].Italsohasthemostcomprehensivereferencesequencedatabases(RDPandSILVA)whichareimportantfortax-onomyanddiversityanalyses[23,24].Thusitisageneofinterestformostmetagenomicstudies.ThereareseveraltoolsdevelopedtoextracttheSSUrRNAgenefragmentsand/orassemblethemintofulllengthforfurtherdiversityanalyses[6,20,25].1.2.1.1EMIRGESSUrRNAgenesarediftoassembleusingglobalassemblymethodsbecausetheyhavebothhighlyconservedregionsandalsohypervariableregions;thisinturnformsacomplexgraphstruc-turethatitishardforassemblerstodisentangle.EMIRGE[26]overcomesthisproblembyavoid-ingassembly,andusesamapping-basedstrategyinstead.Itappliesanexpectationmaximization(EM)algorithmtorecoverfulllengthSSUrRNAgenesequences.IttakesrawreadsinFASTQformat(eitherpairedendsorsingleend),aSSUrRNAgenereferencedatabaseinFASTQformat,andaBowtieindexofthereferencedatabaseasinput.First,readsaremappedtothereferencesbyBowtie[27].Second,EMIRGErunsanEMalgorithm.Theexpectationstepcalculatestheprobabilityareadisfromareferencesequenceandthemaximizationstepthereferencesequencesbasedonthemostfrequentbaseateachpositionfromreadmappingwhilealsocalcu-latingtherelativeabundanceofeachreferencesequence.Thentheabovemapping,expectation,andmaximizationstepsarerepeateduntilthereferencesequencesandtheirrelativeabundancesstabilize.Bydefault,EMIRGEstopsafter40iterationsifthereferencesequencesandrelativeabundancesdonotstabilize.Forcomplexcommunities,theauthorsrecommendincreasingthenumberofiterations.Duringaniteration,tworeferencesaremergediftheybecomeverysimilarandtheiridentityexceedsacertainthreshold(defaultis0.97).Ontheotherhand,asequenceissplitintotwoifthevariantfractionexceedsathreshold(defaultof0.04)basedonreadmapping,suggestingtwodifferentstrainshavethebesthitonthesamereference.ThemostimportantcontributionofEMIRGEisthatitprovidesaprobabilisticframework4toaddresstheuncertaintyofmappingSSUrRNAgenefragmentstothereferencesequences.Forexample,somefragmentsfromconservedregionscouldmaptomanyreferences,andevenfragmentsfromvariableregionscouldmaptomanyreferencesifmultiplecloselyrelatedstrainsarepresent.Thisprobabilisticframingenablesreferencestomergeandtodivergeratherthanbekeptascompositeorconsensussequencesfromdifferentstrainsastheglobalassemblersdo.Thesequenceoutputalsoprovidesaprobabilitythatcanguideuserintheresults.Aswithglobalassemblytools,EMIRGEalsosometimesproduceschimericsequences[28].ErroneouslychimericsequencesinthereferencedatabasemightbepassedontothereconstructedSSUrRNAgenesiftherearereadsmappedacrossthechimericreference.AnotherscenarioisthatSSUrRNAgenesequencesfromcloselyrelatedstrainsbutwithlowcoveragemightformchimerasbecausetherearenotenoughreadsmappedtotheirclosestreference;inthiscase,EMIRGEdoesnothaveenoughinformationonvariantfractionstosplitthereferenceintotwoormorecandidatesequences.1.2.1.2REAGOREAGO[25]isanothermethodforreconstructingSSUrRNAcontentfromasample,butitisassemblybased,unlikeEMIRGE.REAGOtheSSUrRNAgenereadswithacalledcovariancemodel(CM)usingINFERNAL,whichisawareofsecondarystructure[29].ThentheSSUrRNAgenereadsareusedtobuildastringgraphbyoverlapsamongreads.Thisstringgraphisreducedwithsequencingerrorcorrection,node-collapsing,topology-basedpruningandtaxonomy-basedremovalofbadedges.Sequencesarethenassembledbyapathalgorithmguidedbypaired-endinformationtoavoidchimeras.Finally,partialassembliesshorterthanthelengthcutoffaremergedintolongeronesbypaired-endinformation.REAGOusesstringgraphsratherthandeBruijngraphssincethemultipleconservedregionsandvariableregionsintheSSUrRNAgenecausethegraphstobeverycomplex.BecauseDeBruijngraphsbreakreadsdownintok-mers,theydealpoorlywiththiskindofcomplexity.AlthoughthisstringgraphapproachislesspronetochimerasthandeBruijngraphs,theSSU5rRNAgeneisstillaparticularlydifchallengeforstringgraphassembly.ThisisbecausereadssharingoverlapsarenotnecessarilyfromthesamegenesequenceduetothehighconservationoftheSSUgene.REAGOutilizestaxonomyinformationfromeachnodealongwithpaired-endinformationtothecorrectpathsforassembliesandreducechimeras.Despitetheadvancedalgorithmusedtoresolvechimeras,REAGOisveryslowinthesearchCMsearchstep.1.2.1.3SSUsearchSincetheSSUrRNAgeneisdiftoassemble,therearealsoalternativesthatavoidassemblyanddodiversityanalysesdirectlyontheshortreadsfromshotgunsequence.Earlypilotamplicon-basedstudiesused80bpreads,whichissufcientforcommonmicrobialdiversityanalyses[30].But,asIlluminareadsincreasedinlengthpast300bp,mergingpairedendsyieldssequencesthatarecomparabletoampliconsequencesintheirlengthand.SSUsearch[6]isatoolthatusesshotgunreadsdirectlyfordiversityanalyses.Itincludestwomainsteps.ThestepisfastofSSUrRNAgenefragmentsusingHMMER,whichshowsspeedandmemoryimprovementcomparedtoothertoolssuchasBLAST,ribopicker,andMeta-rna[31Œ33].Thesecondstepistheselectionofreadsmappedtoaregion(V4,E.coliposition:577to727bydefault)soallreadshaveoverlapswitheachother.Thenthosereadscanbefurtherusedfortypicalamplicon-likediversityanalyses,e.g.asprovidedinMothurandQIIME[34,35].ThemainadvantageofSSUsearchisthatitenablesdenovoOTU(OperationalTaxonomicUnit)baseddiversityanalyseswithIlluminashotgunsequencingdata.Secondly,SSUsearchisdesignedtoscalewithlargedatasets,e.g.,5CPUhoursandlessthan5Gbmemoryonsearchstep(thebottleneckstep)for38Gb(onelane)oftrimmedIlluminaHiSeq2500data.SSUsearchmakesaspeedimprovementbymergingbacteriaandarchaeamodelssincethemergedmodelgivesalmostthesameresultsastwoseparatemodelswiththeiroutputscombined.Second,thereisalargememoryimprovementbycodeoptimization,comparedtoMeta-rna[33],commonlyusedandHMM-basedtoolforSSUrRNAgeneThesearchstep(HMMER)ofSSUse-6arch,however,hasnotbeencomparedtotheoneofREAGO(INFERNO).ItiswellknownthatINFERNO(cmsearch)ismuchslowerandmighthavehighersensitivityandonsearchresultsthanHMMER(hmmsearch)butnotestshavebeendoneonsimulatedandrealmetagenomicdata.1.2.2FunctionalgenesSincefunctionalgenescanbeanalyzedinproteinspace,analysessuchashomologysearchandgenealignmentaredonewithdifferenttoolsthanthoseforSSUrRNAgenes(analyzedinnu-cleotidespace).Moreover,proteindomainsarecommonlynotuniquetoonegeneandoftensharedamongmanydifferentgenes,thusdirectlyaligningshortreadsmaynotworkwellbecausereadsfromothergenessharingaproteindomaincanbeincorrectlyassigned.Thereforeassembly-basedmethodsaremoresuitableduetohigherprovidedbylongerassembledsequences.1.2.2.1XanderXander[36]isagene-targeteddeBruijngraphassemblerthatusesHiddenMarkovModels(HMMs)ofproteinsequencesofthetargetedgenetoguidegraphtraversal.XanderloadsreadsintoadeBruijngraphusingalow-memoryprobabilisticdatastructurecalledaBloom,whichiscommonlyusedtoreducethememorycostoflargesequencedata[10].Thenitk-mersinnucleotidereferencesequencesoftargetgenesinthegraphasthestartingpointforgraphtraversal.AforwardHMMandreverseHMMbuiltrespectivelyfromthealignedproteinsequencesandthereverseofthealignedproteinsequencesenablesgraphtraversalinbothdirectionsfromstartingk-mers.GuidedbytheproteinHMM,thegraphtraversaladvancesthreebasesatatime,whichequalstoonecodon.XanderuseanA*algorithmtothepathwiththehighestscoreaccordingtheproteinHMM.Further,thecandidateassembledsequencesarethenclusteredat99%toremoveredundantsequences,andchimerasareremovedbyUCHIME[37].Xanderalsohaspost-assemblyutilitiesincludingassembledsequencecoveragecalculationandtaxonomyassignmentbynearestneighbor.7TorunXander,proteinHMMs(bothforwardandreverse)andalargesetofproteinandnu-cleotidereferencesequenceswithtaxonomyinformationarerequired.Currently,XanderfocusesonrplB(50SribosomalsubunitproteinL2,asinglecopyhousekeepinggene)andNcyclegenesincludingAOA,AOB,nifH,nirK,nirS,norB_cNor,norB_qNor,nosZ_cladeIandnosZ_cladeIIareincludedinthetool.Itisfairlyeasytoaddmoregenes(https://github.com/rdpstaff/Xander_assembler#per-gene-preparation-requires-biological-insight).Insummary,XanderappliesHMMsearchontopofthecommonlyuseddeBruijngraphparadigm.HMMguidedgraphtraversalnotonlyreducedthesearchspacecomparedtoglobalassemblybutalsoprovidesaprobabilitymeasureonhowlikelyacontigbasedonthemodel,andthusreducechimeras.Italsorecoversfromshotgundatamoreofthetargetedgenesandassemblesthemtolongerlengthsthandoglobalassemblers[36].1.2.2.2SAT-Assembler.SimilartoXander,SAT-AssembleralsousesHMMsbutitisastringgraphbasedassembler.ItisfromthesamegroupasREAGOandhasasimilardesignthatincludestwomainsteps.ThestepsearchesfortargetgenefragmentsinreadsusingHMMwithHMMER3,whichgreatlyreducestheinputdatasizeforthenextstep.Thesecondstepbuildsstringgraphsforeachtargetedgeneandthenassembles.ThereadalignmentlocationinformationtotheHMMmodelfromthestepisusedtohelpguideoverlapcalculationamongreads.Multipletypesofinformationsuchaspairedend,coverage,andalignmentlocationontheHMMareusedtoguidegraphtraversalandavoidchimeras.Attheend,contigsaremergedintoscaffoldsusingpairedendinformation.TorunSAT-Assembler,acontainingHMMsoftargetedgenesisrequired.TheHMMforagenecanbebuiltfromalignedproteinsequencesofthegenebyhmmbuildcommandinHMMER3.Additionally,SAT-AssembleralsoworkswithHMMsinthePfamdatabase,whichhas16306HMMsinversion30.0andcoversabout80%ofproteinssequencesinUniProtKB[38].BothXanderandSAT-AssemblerusesHMMsbutindifferentways.XanderloadsallreadsintothedeBruijngraphandthenusesHMMstoguidethegraphstraversal.Thusthegraphtraversal8islimitedtoasmallerpartofthewholegraphrelatedtothetargetedgene.Thegraphtraversalspaceisreducedbutitisstillcomputationallyexpensive(CPUtimeandmemory)toloadallreadsintothegraphandidentifythestartingk-mers.Ontheotherhand,SAT-AssembleronlyusesHMMstoidentifythereadsastargetgenefragments,whicharethenusedtobuildamuchsmallergraph.SAT-AssemblerappliespairedendandcoverageinformationbutnotHMMstoguidegraphtraversal.Thesetwotoolscouldbecombinedtomakesomeimprovements.TheHMMsearchstepprovidesthealignmentlocationinformationtospeeduptheoverlapcalculationofstringgraphandalsoassistsinestablishingthecorrectconnectionamongreadswithlowcoverage.Althoughthesearchstepiscriticaltoreducingtheproblemsize,itisstillthebottleneckstepoftheSAT-Assembler.HMMbasedsearchprovidesafasterandmoreeffectivewaytosearchthegenefragmentscomparedtoothertypesofsearch,e.g.pairwisealignmentsuchasBLAST,becauseeachHMMqueryisanaggregateformedfrommanysequencesandthusthequerysizeisgreatlyreduced.Additionally,HMMbasedsearchcanimprovethesensitivityofremotelyrelatedprotein[39].1.3issueswithgene-targetedmethods1.3.1ReferencedependencyForallthetoolsmentionedabove,acertaintypeofreferenceinformationisrequired,eitherHMM,CM,oracollectionofgenesequenceswithassociatedtaxonomyinformation.Thusthequalityandnumberofthereferencesiskeytotheperformanceofthesetools.FortheSSUrRNAgene,millionsofreferencesequenceshavebeencollectedindatabasessuchRDPandSILVA[23,24];thisprovidesrichphylogeneticcontextfordiversityanalysessuchastaxonomyassignmentandchimeracheck.Incomparison,functionalgenesarestilllackinginreferencesequences.91.3.1.1LackingreferencesequencesforfunctionalgenesTherearetypicallytwocategoriesoffunctionalgenesofinterest,genescodingforcriticalecosys-temfunctionssuchastheNcycle,anduniversalproteincodinggenesthatcanserveasphylogeneticmarkerssuchasrplB,rpoBandrecA.ThelatterareofinterestsincetheyhavehigherresolutionforspeciesthanSSUrRNAgeneandarealsosinglecopy[40,41],unlikeSSUrRNAgenes.Asmicrobialecologymovesbeyondtaxonomyandcommunitystructuretoecosystemfunctions,alargenumberofreferencesequencesforfunctionalgenesarealsoneeded.UnlikeSSUrRNAgenes,itismorediftomaintaindatabasesoffunctionalgenesbecausetherearetoomanyofthemandthustheefforthastobedistributedtogroupswithspecialinterestincertaingenes.OneeffortinthisdirectionistheRDPFunGenepipeline,whichprovidesaplatformforresearcherstomaintainfunctionalgenedatabasesinasemi-automaticway.AswiththeSSUrRNAgene,therearetwomainsourcesofreferencesequences.Thesourceiswholelengthgenesfromgenomes(about60,000inNCBIRefSeqwhenaccessedinSep2016)andthesecondsourceispartialgenesequencesamplbyprimers(themajority).Thenumberofsequencedgenomesareaccumulatingfastsincesecondgenerationsequencingstarted,sowearemakinggoodprogressonthesource,althoughmoreeffortstoextractgenesforecosystemfunctionsinphylogeneticallyundersampledregionsofthetaxonomyareneeded.Forthesecondsource,therearemanyprevi-ousstudiestargetingfunctionalgenesusingprimers[42Œ45].Alargereffortondatabasecurationisneededtocollectthereferencesfromtheabovetwosources.Inaddition,newprimerdesignisneededtooptimalprimersforgene-targetedampliconsequencingsincethisapproachcansamplethemicrobiometogreaterdepth.OnesuccessfulexampleofsuchaneffortisnifH-codingfornitrogenasereductase[46,47].1.3.1.2LargesequencevariationamongsubgroupswithincertaingenesAlthoughfunctionalgenessharesomeconservedregions(domains),thereislikelylargevariationsamongsequences,whichcouldbefurtherdividedintosubgroups(clades).NotethattheSSUrRNAgenecouldalsobedividedintosubgroups(phyla)butitisnotusuallynecessaryduetothehigh10sequenceidentityamongallsequences.However,functionalgenesaremuchlessconservedthantheSSUrRNAgene.Thusalignmentandhomologysearcharemoreaccuratewhensubgroupsaretreatedseparately.Therearetwocaseswheregenetargetedanalysisissuboptimal.Thecaseisthatwearemissingrepresentativesequencesfromsomesubgroups,whichleadstobiasedresultsondiversity.Thenitrousoxidereductase(nosZ)issuchanexample:studiesnowshowthatthereareatleasttwocladesofnosZgenes,withthesecondcladeveryunderstudied[48].Follow-upstudiesshowthatcladeIIisveryabundantintheenvironmentanditsenzymeismoreefinlownitrogenenvironments[49].Thesecondcaseisthatwearetreatingsubgroupswithsequencedivergenceasagrouptogether,whichcancausepoorperformancesongenesearchandsequencealignment.Forexample,thenosZcladeIandcladeIIshouldbetreatedseparatelyduetolowsequencesimilaritybetweentwogroupsandthecombinedsequencesfromtwocladescouldleadtopoormultiplesequencealignmentandfurtherasinglenosZHMMbuiltfromthealignmentwithpoorsensitivityand(conserveddomainsarecriticalforHMMbasedsearch).1.3.2Shortreadvs.contigBothshortreadsandassembledcontigsareusedinmetagenomicstudies[7,9,50],butthelatteraremorecommon,sinceannotatingshortreadsisnotscalablecomputationallyandprovideslow(lowe-valueorscoreinBLASTsearch).Additionally,assembledcontigscanprovidemoreinformationsuchasgenesyntenyandgenomicinformationfromnovelspeciesthatcanfurtherbeusedforisolatingthosenovelspecies.Ontheotherhand,theassembledcontigsstillhavesomekeydrawbacks.First,therearechimerasinassembledcontigs.Eventhoughsomecanberesolvedbypairedendinformation,thereisnomethodtoverifytheminsilicogenerally.Second,sequencevariationsfromcloselyrelatedstrainsarecollapsedorignoredintheassemblyprocess.ThustheassembledcontigsarenotsuitableforSNP,primerdesign,ordiversityanalysisthatinvolvestaxonomy(subspecies,ecotypeorstrain)levels.Third,lowcoveragestrainsareoftentoolow-coveragetoassemble.Alltheabovebecomesabarrierincomplexmetagenomesfrom11environmentssuchassoilthatcontainsnotonlyhighdiversitybutmanycloselyrelatedstrains,andmanystrainswithlowcoverage.Therefore,contigsmaynotreliablyrepresentthepopulation,althoughtheannotationsfromthemdohavehigh(lowere-valueforhomologysearch).Incontrast,read-basedmethodsdonotencounteranyoftheabovethreeissuesandthusarestillvaluableforshotgunmetagenomics.Despiteofthelowinannotationcausedbyshortlength,readsfromsecondgenerationsequencingareshowntobelongenoughtogivereliableannotationsusingHMMbasedsearch[51].Additionally,thescalabilityofread-basedannotationisnotasabigissueingene-targetedanalysisasinglobalfunctionaldiversityanalysissinceonlyreadsbelongingtotargetedgenesareannotated.1.3.3ofread-basedmethodforsomefunctionalgenesAsmentionedabove,conservedbutnotuniquedomains,suchassulfurandironbindingdomainsinnifH,confoundapplicationofreadbasedmethodsinfunctionalgenes,sincereadsfromadifferentgenewiththesamedomaincanfalselyincreasethecoverageandsequencevariation.However,read-basedmethodsarefavoredbyavoidingthosenon-uniquedomainsandlookingatotherregionsinthetargetgeneincludinggenedomainsandvariableregions.YetthechallengesarethatnotallgeneshaveuniquedomainsandphylogeneticallydiversereferencesequencesarerequiredtosearchvariableregionsofageneusingBLASTstylepairwisecomparisons,whichisnotfeasibleformostfunctionalgenescurrently.1.4SummaryandoutlookInsummary,wereviewafewtoolsforgene-targetedanalyseswithshotgunsequence(metagenomes)(Table1.1).Wealsodiscusssomeissuesrelatedtogene-targetedanalyseswithshotgunmetagenomes,includingthelackofreferencesequencedatabasesandthepresenceofsubgroupsforfunctionalgenes,anddifofapplyingshortreadbasedmethodsonfunctionalgenesduetowidespreadandhighlyconservedbutdomains.12Table1.1:Summaryofgene-targetedmethodsforshotgunmetagenomes.ToolGeneSearchAssemblyRead/ContigDiversityanalysisEMIRGESSUBowtieEMContigNoREAGOSSUCMSGContigNoSSUsearchSSUHMMHMMReadYesXanderProteinKmerDBG&HMMContigYesSAT-AssemblerProteinHMMSGContigNoHerewemainlydiscussinsilicogenetargetedanalysesinshotgundatabutitisalsopossibletotargetgenesinvitrousingprobesdesignedtothatgeneandthensequence,whichcangivemuchhighercoverageforlowabundancegenes[52].Theprobedesign(similartoprimerdesign)willrequireaphylogeneticallydiversesetofreferencesequences,orelsethemethodwillbebiased.Inthefuture,withthefurtherdevelopmentoflong-readsequencingtechnologies(withreducederrorrates),read-basedmethodswillfromlongerreads.Especiallyforasinglegene,itmightbenotnecessarytoassemblewithlongreads.Anotherwaytothinkabouttheproblemisthatlong-readsequencingandassemblyservesthesamepurpose,i.e.,toobtainlongsequences.Additionally,stringgraphmaygainmorepopularityasoccurredinSangersequencingeracomparedtothedominanceofdeBruijngraphwithshortreaddatafromsecondgenerationsequencing,whichappliestobothglobalassemblyandgene-targetedassembly.Meanwhile,short-readsequencinganddeBruijngraphbasedmethodswillcontinuetoplayanimportantroleinmetagenomicanalysisduetoitshighsequencingdepth.13CHAPTER2MICROBIALCOMMUNITYANALYSISUSINGRIBOSOMALFRAGMENTSINSHOTGUNMETAGENOMES12.1AbstractShotgunmetagenomicsequencingdoesnotdependongene-targetedprimersorPCRtionandthusisnotaffectedbyprimerbiasorchimeras.However,searchingrRNAgenesfromlargeshotgunIlluminadatasetiscomputationallyexpensiveandthereisnoexistingapproachforunsupervisedcommunityanalysisofSSUrRNAgenefragmentsretrievedfromshotgundata.Wepresentapipeline,SSUsearch,toachievefasterofshortsubunitrRNAgenefrag-mentsandenabledunsupervisedcommunityanalysiswithshotgundata.Italsoincludescationandcopynumbercorrection,andtheoutputcanbeusedbytraditionalampliconanalysisplatforms.Shotgunmetagenomedatausingthispipelineyieldedhigherdiversityestimatesthanamplicondatabutretainedthegroupingofsamplesinordinationanalyses.Weappliedtothispipelinetosoilsampleswithpairedshotgunandamplicondata,andbiasagainstVer-rucomicrobiainacommonlyusedV6-V8primersetaswellasdiscoveringlikelybiasagainstActinobacteriaandforVerrucomicrobiainacommonlyusedV4primerset.Thispipelinecanuti-lizeallvariableregionsinSSUrRNAandcanalsobeappliedtolargesubunitrRNA(LSU)genesforofcommunitystructure.Thepipelinecanscaletolargesoilmetagenomicdata(5Gbmemoryand5CPUhourstoprocess38GB(1lane)oftrimmedIlluminaHiSeq2500data)andisfreelyavailableathttps://github.com/dib-lab/SSUsearchundertheBSDLicense.1PublishedinGuoJ,ColeJR,ZhangQ,BrownCT,TiedjeJM.MicrobialCommunityAnalysiswithRiboso-malGeneFragmentsfromShotgunMetagenomes.AppliedandEnvironmentalMicrobiology.2016;82(1):157-166.doi:10.1128/AEM.02772-15.142.2IntroductionMicrobialphylogeny,andevolutionstudieswererevolutionizedbytheintroductionofSSUrRNAanalysis25yearsago[21],andwiththeadventofPCRandhighthroughputse-quencing,communitystructurestudiesarenowcommonplace[22,30,53,54].ThegrowingsizeofSSUrRNAgenedatabasesprovidearichecologicalandphylogeneticcontextforSSUrRNAgene-basedcommunitystructuresurveys[24,55].However,theaccuracyofPCR-basedampliconapproachesarereducedbyprimerbiasandchimeras[5,56].Unlikegene-targetedampliconsequencing,shotgunsequencingsamplesfromtheentirecom-munity,bysequencingrandomlyshearedfragmentsofDNA[9,57].Hence,whileampliconse-quencingcanprovidefardeepercoverageofSSUrRNAgeneswithsameamountofsequencing,shotgunsequencingmayprovideamoreaccuratecharacterizationofmicrobialdiversityincludingfunctionaldiversity[58].Inparticular,shotgunsequencingmayprovideanimprovedmeanstodetectdivergentsequencesnotrecoveredbystandardSSUrRNAgeneprimers,suchastheVer-rucomicrobia,aswellaseukaryoticmembersofthecommunity[4,56,58,59].NotethatbothapproachesremainpronetosequencingerrorandbiasfromenvironmentalDNAextraction[5].ThechallengesforusingshotgunDNAforrRNAanalysesareinefsearchingforthesefragmentsinlargesequencedatasets,andthesubsequentanalysisofthematchingshortreads.SeveralmethodshavebeendevelopedforSSUrRNAretrievalinlargedatasets[32,33,60,61],butspeedimprovementsarestillneededtomatchthegrowthindatasize;moreover,noneofthemprovidefurthercommunityanalysisusingtherRNAgenesequences.Inaddition,traditionalcommunityanalysistools[23,34,35]arelargelydesignedtohandlesequencesthatarebyPCRprimers.Therearetwoprimarytypesofcommunityanalyses:reference-based(supervised)andOTU-based(unsupervised).Thereference-basedmethodassignsSSUrRNAgenesequencestobinsbasedontaxonomyoftheirclosestreferencesequences,whileOTU-basedmethodsassignoverlappinggenesequencestobinsbasedondenovoclusteringwithasimilaritycutoff(e.g.97%).Thereference-basedmethodcanbeeasilyappliedtoshotgundataonceSSUrRNAgenefragmentsareretrieved[62]andthereareseveralexistingtoolsavailable15[63Œ67],buttheOTU-basedapproachstillremainschallengingwithshotgundatabecausereadsarefromrandomlyshearedfragments.ThemaingoalofthisstudyistoenableunsupervisedOTU-basedanalysisoflargeshotgunmetagenomicdatafromsoil.WeimprovedspeedandmemoryefywithaHiddenMarkovModel(HMM)basedmethod,alreadyshowntobefastandaccurateforSSUrRNAsearch[33,60,61],usingawell-curatedandup-to-datetrainingreferencesequencecollectionfromSILVA[24].Ourunsupervisedclusteringmethodwastestedonasyntheticcommunitywithshotgundataof100bpreads.Wenextappliedthemethodtosoildatasets,whereweassembledlongerreadsfromtheoverlappingpairedendIlluminaHiSeqreadsandmappedthoseto150bpsmallhyper-variableregionsofSSUrRNAgenesfordenovoclusteringandfurtherdiversityanalysis.Weretrievedandanalyzedthelargeribosomalsubunit(LSU)geneforanalysis.Finally,weevaluateprimerbiasesusingthepairedshotgunandamplicondataproducedfromthesameDNAextract,whichisbeyondtraditionalprimerevaluationusinginsilicodatabasesearch[68,69].2.3MaterialsandMethods2.3.1Soilsamples,DNAextractionandsequencingTwosetsofsoilsampleswereused.Thesample,whichwasusedtodevelopthemethod,wasabulk(non-rootsoilsample(SB1)takenin2009frombetweentherowsofswitchgrass.Themethodwasthenappliedtothesecondsamplesettakenin2012,whichconsistedofsevenreplicaterhizospheresamplesfrombothcorn(C)andMiscanthus(M)plots.AllsampleswerefromtheGreatLakesBioenergyResearchCenter(GLBRC)CroppingSystemComparisonsiteattheKelloggBiologicalStationinSouthwestMichigan,http://data.sustainability.glbrc.org/pages/1.html.Therhizospheresampleswerecloselyassociatedwiththeroots,<1mm.DNAextractionandSSUrRNAgenemethodswerepreviouslydescribed[70].TheSSUrRNAgeneampliconsfromthesampleweresequencedbytheJointGenomeInstitute(JGI)intheirstandardworkw,whichused454GSFLXandTitaniumplatformsandprimerset16(926F:AAACTYAAAKGAATTGACGG1392R:ACGGGCGGTGTGTRC)thattargetedtheV6-V8variableregionofBacteria,Archaea,andEukaryotes.ThesecondsetwasalsosequencedatJGIbutatalatertime,sousedtheIlluminaMiSeqplatformandprimerset(515F:GTGCCAGCMGC-CGCGGTAA806R:GGACTACHVGGGTWTCTAAT)thattargetedtheV4variableregion.Shot-gunsequencingwasalsodonebyJGIusingIlluminaGAIIplatformsforthesetandHiSeq2500using250bpinsertlibrariesand2x150bpreadsforthesecondset.Wehadabout8Gbofdataforthesetandabout300GBofdataeachforcornandMiscanthusforthesecondset.2.3.2DatapreprocessingDatapreprocessingisnecessaryforbothshotgunandamplicondataduetosequencingerrors.However,itisnotincludedaspartofthecorepipelinebecauseusershavetheirownpreferences.Wetrimmedtrailingbaseswithqualityscore2calledReadSegmentQualityControlIndicatorencodedbyASCII66fiBflinIlluminaGAIIorASCII35fi#flinIlluminaHiSeqshotgundataanddiscardedreadsshorterthan30bpandwithNs.Thereadswerethenqualitytrimmedwithfastq-mcf(version1.04.662)(http://code.google.com/p/ea-utils)withfi-l50-q30-w4-x10Œmax-ns0-Xfl.Thepaired-endreadsoverlappingbymorethan10bpwereassembledintoonelongreadbyFLASH(version1.2.7)[71]withfi-m10-M120-x0.20-r140-f250-s25fl.Roche454pyrotagamplicondatawasprocessedusingtheRDPPipelinefiPIPELINEINITIALPROCESSflandfiCHIMERACHECKfl[23].Readsweresequencedfromthereverseprimerendtotheforwardprimerend.Sincethetargetedregionisabout467bp(926F/1392R),mostreadswerenotlongenoughtoreachtheforwardprimer.Thusonlythereverseprimerproductwasusedforqualitytrimming.Theminimumlengthwassetto400bpanddefaultswereusedforotherparameters.2.3.3BuildingSSUandLSUrRNAgenemodelsWequalitytrimmedSILVA[24]SSUandLSURefNRdatabase(version115)sequencesbydis-cardingallsequenceswithambiguousDNAbasesandconvertingUtoT.Wethenclusteredthem17ata97%similaritycutoffusingpick_otus.py(defaultwithUCLUST)andpick_rep_set.pyfromQIIME(version1.8.0)[35].WechosethelongestrepresentativeineachOTUtobefurtherclus-teredat80%similaritycutoff.WecollectedthelongestsequenceineachOTU,resultingin4027representativesequencesfortheSSUrRNAgeneand1295fortheLSUgenetoobtainaphyloge-neticallydiversesetofreferencegenes.Wefurthergroupedthesesequencesintotwogroups,onecombiningBacteriaandArchaeaandtheothercontainingonlyEukaryota(alsoseeDiscussion).EachgroupwasusedtomaketwoHMMs(HiddenMarkovModel):onewithsequencesfrompre-viousstepandtheotherwithreversecomplement,usinghmmbuildinHMMERversion3.1[39].Atlast,theHMMwereconcatenatedintoasingleforeachgene.ThisstepisnotpartofthepipelineandtheresultingHMMswereincludedinthedatabaseofthispipeline.2.3.4ofrRNAgenefragmentsfrommetagenomicdataTheanalysisframeworkisshownin(Fig.2.1),aswellasthereasonsforourchoicesofpipelinecomponents.WesearchedIlluminashotgunmetagenomicdatawithhmmsearchinHMMERver-sion3.1[39]usingtheLSUandSSUHMMmodels.Fortestingthesensitivityofnewlybuiltmodels,wecomparedourtoolwithmeta-rna[33]andmetaxa[61].Weusedane-valueof10forhmmsearchwiththenewlybuiltHMMs.Meta-rna(rna_hmm3.py)(packagewasnotversionedandlatestversionupdateonOct.21,2011wasused)wasrunwithfi-keuk,bac,arc-e0.00001fl,andmetaxa(metaxa_x)version2.0.2wasrunwithfiŒallow_single_domain1e-5,0-N1-E1e-5fl.Abulksoil(SB1),arhizospheresoil(M1)andasyntheticcommunitysample[58]wereusedastestdata.WealignedtheHMMERhitsfromthee-valuecutoffof10usingthemultiplesequencealignerfialign.seqsflinMothur(version1.33.3)[72].ForSSUrRNAgenefragments,18491full-lengthSSUrRNAgenesequences(14956fromBacteria,2297fromArchaea,and1238fromEukaryota)fromtheSILVAdatabase(release102)[24]downloadedfromtheMothurweb-site(http://www.mothur.org/wiki/Silva_reference_files)wereusedasthetemplatewithfithreshold=0.5fl,and=tfl.ForLSUrRNA,MultipleSequenceAlignment(MSA)ofrepresentativesequencesofSILVALSURefNRdatabaseclusteredat97%similaritycutoffwere18usedastemplatewiththesameasforSSU.Basedonalignmentinformationprovidedinthefialign.seqsfloutputreportthoseshotgunreadswithmorethan50%mappedtoareferencegeneweredesignatedasSSUrRNAorLSUrRNAgenefragments.2.3.5TestingtheeffectoftargetregionsizeandvariableregiononclusteringWeusedshotgundataofasyntheticcommunitycomprisedof64species,whichweresequencedbypairedend100bpmethodonIlluminaHiSeq2000[58].Whentestingtargetregionsizeeffectonclustering,wepickedV4withstartingpositionat577inE.coli.Sizesfrom50to180bpwitha10bpincrementwerechosen.Theminimumreadlengthwassettothetargetregionsizeminus5bpiftheregionsizewaslessthan100bp,and95bpwhentheregionsizewaslongerthan100bp.Weusedpre.clustercommandinMothurwith1editdistancetocollapsereadswitherrorsandtheiroriginalreads.ThendenovoclusteringwasachievedbyRDPMcClustwithupgmaalgorithmandminimumreadoverlappinglengthof25bp[23].WechoseMcClustastheclusteringtoolduetoitsspeedandmemoryefy[23,73].Next,wechooseV2,V3,V4,V5,V6,andV8,startingatposition127,427,577,787,987,and1227inE.colirespectively[74],totestthehypervariableregioneffectonclusteringresult.Targetregionsizesof80bpand120bpandadistancecutoffof5%werechosen.Further,theaboveanalyseswerealsoappliedto16SrRNAgenesfromthe64speciescomprisingthecommunitytogetthetrueOTUnumbers.2.3.6CommunitystructurecomparisonbasedonOTUsfromdenovoclusteringFortheclusteringanalysisofshotgunandamplicondata,150basescorrespondingtotheV8region(E.colipositions:1227-1377)werealigned.Readsshorterthan100bpwereremovedfromthealignmentandtheremainderclusteredusingMcClustwithminimumoverlapof25bpandupgmamethod(6).TheclusteringresultwasconvertedtotheMothurformatandcommunitystructurecomparisonwasdoneusingthemake.sharedwithlabel=0.05,dist.shared(calc=thetayc),andusingthepcoacommand.Whencomparingdifferentregions,E.colipositions:127-277,577-727,and997-1147werechosenforV2,V4,andV6,respectively[74].Procrustesanalysisasimplemented19inQIIMEwasusedtotransformV2,V6andV8PCoAresultsandminimizethedistancesbetweencorrespondingpointsinV4.Thebulksoilsample(SB1)wassequencedinsixlanesfromoneIlluminaplateusingDNAfromthesameextraction.WeusedtheseastechnicalreplicatesfortestingthereproducibilityofdenovoOTU-basedanalysisonshotgundata.Sincesequencingdepthiscriticalforreproducibilitytesting,wepooledtheseintotwosamplesofthreelaneseach,thethreelanesasSB1_123andtheremainingthreelanesasSB1_456.2.3.7ComparisonofOTU-basedmicrobialcommunitystructuresinferredfromshotgunandampliconSSUrRNAgenesequencesTheabundanceofeachOTUinshotgundataandamplicondata(V6-V8forSB1andV4forM1)fromthesameDNAextractionwerecomparedtochecktheconsistencybetweenthetwosequencingapproaches.Pearson'scorrelationcoefandR2oflinearregressionswereusedtoevaluatethecorrelationbetweenthetwotypesofdataandbetweentechnicalreplicates.AlltheabundancesofeachOTUwereincreasedbyapseudo-countofonetoallowdisplayonalogscale(avoidingzeros).2.3.8Comparisonoftaxonomybasedmicrobialcommunitystructuresinferredfromshot-gunandampliconSSUrRNAgenesequencesTheSSUrRNAfragmentsfromshotgundataandamplicondatawereusingRDPClas-[62].ThereferenceSSUrRNAgenesfromRDPandSILVAareprovidedontheMothurwebsiteandwereusedastrainingsets,withabootstrapcutoffforof50%.RepresentativesequencesofSILVALSURefNRclusteredat97%similaritycutoffwereusedastrainingsetwithtaxonomyinformationbuiltfromthesequencefortheLSUrRNAgene.Thebacterialtaxonomyfromshotgundataandamplicondatawerecomparedatphylumlevel.202.3.9CopynumbercorrectionWeusedSSUrRNAgenecopynumberdatabaseinCopyrighter[75]thatprovidescopynumberforeachtaxoninGreengenesdatabase[76].Inthetaxonomicsummary,theabundanceofeachtaxonisweightedbytheinverseofitsSSUrRNAgenecopynumber.SimilarlyinOTU-basedanalysis,theabundanceofeachOTUisweightedbytheinverseofSSUrRNAgenecopynumberofitsconsensustaxon.sequencesareweightedbytheinverseofaveragecopynumberofalltaxainthedataset.2.3.10Implementation,reproducibilityandsequencedataaccessionThepipelinecanbefoundinhttps://github.com/dib-lab/SSUsearchasatutorialwithipythonnotebooks[77].ScriptsforreproducingtheinthispapercanbefoundatGitHub(https://github.com/dib-lab/2014-ssu-search/blob/master/analysis-in-paper.Makefile).ThesyntheticcommunitydatafortestingcanbedownloadedfromNCBISRAunderSRR606249.TheamplicondataforC1-7andM1-7aredepositedinJGIgenomeportalunderprojectID1025756withlibraryIDM2094andM2113respectively,andSB1depositedinNCBIunderSRX902929.TheshotgundataforthesamethreedatasetsaredepositedinJGIportal(C1-7areunderprojectID1023764,1023767,1023770,1023773,1023776,1023779,and1023782;M1-7areunderprojectID1023785,1023788,1023791,1023794,1023797,1023800and1018623;SB1areunderprojectID402775).2.4ResultsWedevelopedanoptimizedpipelinethatreadilyanalyseslargedatasets(Fig.2.1).Thepipelinehastwomajorsteps:SSUrRNAgenefragmentsearchandunsupervisedOTUanalysis.HMMER-basedmethodssearchwithHMMsandthusscaleswithincreasingsizeofSSUrRNAgenedatabase[23],sowechosethemforthesearchstep.Weusedmeta-rna[33]butcouldnotrunitonlargedatasetsduetoitspoormemorymanagement.Wethereforeandoptimizedtheapproach21usedbymeta-rna.Sincethesearchstepisstillthecomputationalbottleneck(Fig.2.1),ourinterestherewastomakeanimprovementonsearchspeedandmemoryefywhileretainingaccuracy.Ourimplementationisabout4timesfasterand100timesmorememoryefthanmeta-rna,and10timesfasterand15timesmorememoryefthanmetaxa[61](Table2.1).Thespeedimprovementisrealizedfromtwo1)WereducethenumberofHMMsmodelstosearchwithbymergingBacteriaandArchaeamodels.SSUrRNAgenesarehighlyconservedandthusthemergedmodelstillhashighsensitivity.Evenmoreso,wecanincreasethesensitivitybyusingmorerelaxedE-valuecutoff,sincefalsepositivesaretolerableinthisinitialsearchstep.2)Weusereverse-complementHMMsratherthanreversecomplementingthereads,becausethelatterscalespoorlywithlargedatasets.ThenewlybuiltHMMmodels(HMMsforBacteriaandArchaeatogetherandHMMsforEukaryota)covermosthitsfoundbyseparateHMMsofthethreedomains(Bacteria,Archaea,andEukaryota)inmeta-rnainallthreetestdatasets,thetwosoildatasetsinthisstudyandonesyntheticcommunity(Table2.1).Ourmethod15566(0.03%)of44,787,632qualitytrimmedandpairedendmergedsequencesasSSUrRNAgenefragmentsinthebulksoilsample(SB1)and112402(0.04%)of274,060,925readsintheMiscanthusreplicate(M1).UnsupervisedOTUanalysiswithshotgundataisnotavailableinanycurrentpipelineandwedevelopamethodforOTUclusteringaroundasmallregionwhereallreadsoverlap.Toshowthevalidityofourunsupervisedmethod,wedidtestsoneffectsoftargetregionsizesanddifferentvariableregionswithshotgundatafromasyntheticcommunity.Wefoundallregionsizesfrom50bpto160bpinV4hadanOTUnumberthatapproachedthespeciesnumberatadistancecutoffof4%or5%(Fig.2.2)whentestingtargetregionsizeeffectonOTUnumber.WealsodidasimilartestwithonlyfulllengthSSUrRNAgenesfrom64speciestomakesureOTUnumberisclosetospeciesnumberandthatOTUnumberwasclosetospeciesnumber(atarangefrom50to60)whenacutoffof0to0.06waschosen(Fig.2.3).Whentargetregionsizewaslargerthan170bp,theclusteringtool(McClust)[23]didnotclusterbecausepercentageofnon-overlappingreadsexceededitsthreshold.Basedontheaboveresults,wechose80bpor120bpasthetargetregion22Figure2.1:FlowchartofSSUsearchpipeline.SSUrRNAgenefragmentswereretrievedbyanhmmsearchandalignmentstep,whichcouldbefurtherusedforreference-based(supervised)diversityanalysis(taxonomy).Thosefragmentsalignedto150bpofavariableregioncouldbeusedforOTU-based(unsupervised)diversityanalysis.SSUrRNAgenehmmsearchisthemosttimeconsumingsteps.For1laneoftrimmedHiSeqdata(38Gb)fromMiscanthusrhizospheresample(M1),SSUrRNAgenetookabout4hourswithpeakmemoryusageofabout4.5Gb.Intheanalysispipeline,wheretherewasnotaclearperformancedifferencebetweentools,wemostlyusedMothurincludingdatabases.For(denovo)OTUbasedanalysis,SSUrRNAgenereferencesequenceswithtaxonomyinformationarerequiredforAnothersmallersetofalignedreferencesisrequiredtoalignthegenefragmentsfromshotgundata.TheSILVAdatabase(theofone,nottheoneincludedwithQIIMEorMothur)wasusedtobuildtheHMMsincethereferencesetfromSILVAwasmoreuptodate.TwoscriptsinQIIMEwereusedtocluster(UCLUST,thedefault)andpickrepresentativesequences.BuildingtheHMMisnotpartofthepipelinebutusingthebuiltHMMis.Wefoundthatcomplete-linkageclusteringisfasterandrequireslessmemorywithMcClustthanwithMothur(dist.seqsandcluster).Addi-tionally,weusetwoscriptsinQIIME(pick_otus.pyandpick_rep_set.py)toselectrepresentativesequencesforbuildingtheHMMduetoeaseofuseandtheGreenGenesdatabaseisincludedforusewiththeCopyrightercopynumbercorrectiontool.23Table2.1:ComparisonofsearchresultsfromSSUsearch,metaxaandmeta-rna.Subsetsof5millionreadsfrommetagenomeofasyntheticcommunityof48Bacteriaand16Archaea,SB1(bulksoilmetagenome)andM1(Miscanthusrhizospheremetagenome)areusedastestingdata.Symbolfi\flindicatesoverlap.MockSB1M1SSUsearch643227892612Metarna645527812600Metaxa532226492444SSUsearch\metarna630027592576SSUsearch\metaxa528626422442metarna\metaxa530426492436SSUsearch\metarna\metaxa526826422435SSUsearchcputime(min)1.644.8Meta-rnacputime(min)8.616.516.9Metaxacputime(min)17.547.234.8SSUsearchmemory(Mb)353330Meta-rnamemory(Mb)340640054234Metaxamemory(Mb)452456572sizeand5%distancecutofffortestingdifferenthypervariableregions.ThenumberofOTUscreatedinallvariableregionsisclosetotherealnumberofspeciesinthesyntheticcommunityexceptwhenusingV3.24Figure2.2:Testingtheeffectoftargetregionsizeandvariableregiononclustering.25Figure2.2(cont'd):Ashotgunmetagenomicdatasetwith100bpreadsfromsyntheticcommunitywith64speciesisused.Ashowsadistancecutoffof4%or5%isproperforallregionssizesfrom50bpto160bpinV4(OTUnumberapproachedthespeciesnumber64asindicatedbytheblackline).Differentcolorsindicatesregionsizesinbasepairs(bp).BshowsmoredetailsinthemethodusedforA.PanelfiReadlengthcutoffflinBshowsminimumreadlengthwassettothetargetregionsizeminus5bpiftheregionsizewaslessthan100bp,and95bpwhentheregionsizewaslongerthan100bp.Asaresult,thenumberofreadsaligneddecreasedasthetargetregionsizeincreaseduntil100bp,andthenthenumberofreadsalignedincreasedwithtargetregionasshowninPanelfiMappedreadnumberflinB.PanelfiOTUnumberflinCshowsOTUnumberatdistancecutoffof0.05andourmethodworkswellfroma50bpregionto160bpregioninV4.Ctestsourunsupervisedmethodonmultiplehyper-variableregions(V2,V3,V4,V5,V6,V8)withregionsizeof120bp(circle)and80bp(triangle).PanelfiMappedreadnumberflinCshowsthenumberofreadsmappedtoeachchosenregion.PanelfiOTUnumberflinCshowsthenumberofOTUsineachregionatdistancecutoffof0.05.AllregionshaveconsistentmappedreadnumberandOTUsexceptV3.Figure2.3:TestingtheeffectoftargetregionsizeonV4offull-lengthSSUrRNAgenesinclusteringfromasyntheticcommunitywith64species.TheOTUnumberisclosetospeciesnumber(64)withalltheregionsizes.Differentcolorsindicatesregionsizesinbasepairs(bp).Reproducibilitybetweentechnicalreplicatesisimportantandabasicfeatureofamethod[78,79].WeevaluateditbycomparingthecorrelationofOTUabundancebetweentechnicalreplicatesfromthebulksoilsampleandfoundhighcorrelationbetweenthem(Pearson'scorre-lationcoef=0.997).TheconsistencywasbetterforthemoreabundantOTUs(Fig.2.4A).LogtransformationcouldreducetheeffectofhighabundanceOTUsontheoverallcorrelation.Evenwithlog-transformedabundance,thetworeplicateshaveaPearson'scorrelationcoef26of0.91andlinearregressionR2of0.88whenOTUswithlessthan25totalcountswerediscarded(Fig.2.4B).Thechoiceof25cutoffwaschosentocomparetoanotherstudyonampliconrepro-ducibility[80]andasmentionedintheDiscussion.Figure2.4:TechnicalreproducibilitytestofourunsupervisedclusteringandcomparisonofOTUabundancesbetweenpairedshotgunandamplicondata.27Figure2.4(cont'd):AshowsconsistentOTUabundanceintwotechnicalrepli-cates(Pearson'scorrelationcoefis0.997).XaxisshowsnumberofreadsineachOTUinreplicateSB1_123,andyaxisshowsnumberofreadsineachOTUinreplicateSB1_456.ThesizeofcircleisproportionaltonumberofOTUsatthesamelocationintheplot(withthesamecountsinSB1_123andalsoinSB1_456).TheconsistencyofcountsofeachOTUintworeplicatesbecomesbetterwhentheabundanceofOTUsarehigher.Bshowsprogressivedropoutanalysisoftwotechnicalreplicatesofshotgundata.ThereiscorrelationofcountsofeachOTUbetweentechnicalreplicates;XaxisisthethresholdofOTUabundanceandyaxisistheR2oflinearregressionoflogtransformedOTUabundancesintworeplicates.OTUswithlowerabundancethanthethresholds(xaxis)werediscardedbeforeregressionanalysis.CandDshowscomparisonofOTUabundancebetweenpairedshotgunandamplicondatainbulksoilsample(SB1)andrhizospheresample(M1),respectively.Thereisinconsistencybetweenshotgundataandamplicondatainbothsamples.XaxisshowsnumberofSSUrRNAgenefrag-mentsinshotgundataperOTUinlogscale,andyaxisshowsnumberofampliconsequencesineachOTUinlogscale.TheOTUabundanceinbothampliconandshotgundatawereincreasedby1toavoid0countsthatcanbedisplacedinlogscale.ThesizeofcircleisproportionaltonumberofOTUswiththesameabundanceinbothtypesofdata.ThereareOTUswithdiffer-entabundancesinthetwotypesofdata(circlesdeviatefromdiagonalline).Pearson'scorrelationbetweentwotypesofdatais0.873inSB1and0.581inM1.WealsocomparedOTU-basedmicrobialcommunitystructuresinferredfromshotgunandam-pliconSSUrRNAgenesequences.OTUabundancesinshotgunandamplicondata,however,donotcorrelateaswell(Pearson'scorrelationcoefof0.87forthebulksoilsampleSB1(Fig.2.4C)and0.58fortheMiscanthusrhizospheresampleM1(Fig.2.4D),showingthattheam-pliconandshotgunmethodsarenotprovidingthesameinformation.TheofOTUswithtotalabundancehigherthan10andwithratiobetweentwodatatypehigherthanvefoldshowsVerrucomicrobiawerebiasedagainstinthebulksoilsample(SB1_PT)byV6-V8primer,whiletheywerebiasedtointherhizospheresample(M1_PT)byV4primer.ActinobacteriawasbiasedagainstinM1_PTbyV4primer(Fig.2.5).Theabovere-sultswereconsistentwiththetaxonomy-basedcomparisonofthetwodatatypes(seebelow)thatsuggestingprimerbiasinamplicondata.WeappliedordinationanalysistoOTUtablesfromunsupervisedanalysisofcornandMis-canthusrhizospheresamples.OTUsfromshotgunandamplicondatabothshowedseparationofrhizospherecommunitiesofcornandMiscanthus(horizontaldimensioninFig.2.6),aswellasa28Figure2.5:PhylaofOTUsdifferentbetweenshotgundataandamplicondata.SB_123areshotgundataandSB1_PTareamplicondatabothfromthesameDNAfrombulksoilsample.M1areshotgundataandM1_PTareamplicondatabothfromthesameDNAfromMis-canthusrhizospheresample.OTUsydifferentwereasthosewithtotalabun-dance>10andfoldchangebetweentwotypesofdata>5or<0.2.VerrucomicrobiawasbiasedagainstinbulksoilsamplebyV6-V8primer(SB1_PT)butbiasedforinrhizospheresamplewithV4primer(M1_PT).Actinobacteriawasbiasedagainstinrhizospheresam-ple(M1).differencebetweenthetwodatatypes(verticaldimensioninFig.2.6),thedifferencebetweenshotgunandamplicondata.separation(p<0.001byAMOVAtestinMothur)ofcornandMiscanthussampleswasalsoobservedwhenV2,V4,V6,andV8shotgundatawereusedforclustering(Fig.2.7)butthesamplegroupingswerethesameforallvariableregions.Figures2.6and2.7showedthatthedispersionamongthesevencornreplicateswasmuchhigherthanfortheMiscanthusreplicates.MiscanthussampleshadhigheralphadiversitythancornsamplesasshownforallofV2,V4,V6andV8regions,althoughtherewerevariationsamongtheseregions(Fig.2.8).Wecomparedthetaxonomy-basedmicrobialcommunitystructuresinferredfromshotgundatawiththosefromampliconSSUrRNAgenesequences(12163ampliconforSB1and60148forM1)andknownprimerbiasesandrevealedanewbias.Beforecomparingtwodatatypes,welookedatthetaxonomyofshotgundatausingdifferentvariableregions.ShotgundatamappedtodifferentvariableregionsshowssimilartaxonomyattheBacterialphylumlevel29Figure2.6:Principlecoordinatesanalysis(PCoA)ofampliconandshotgunderiveddatafromsevenreplicates.Therearedifferencesbetweenampliconandshotgun(un-deriveddata(alongy-axis)andbetweencorn(circle)andMiscanthus(square)rhizospheresamples(alongx-axis),(AMOVAp-value<0.001).PCoAwasappliedtoOTUtableresultingfromclusteringwithshotgundataandamplicondatausing150bpofV4region.Labelswithsuffi_PTflareamplicondataandothersareshotgundata.(Pearson'scorrelations>0.96)exceptthatV6hasmoreunclassequences(Fig.2.9).Sincedifferentvariableregionsmayprovidedifferenttaxonomicprecisiontocertaingroups[81],taxon-omyinformationfromallregionsmaybetterrepresentthetaxonomyThusweusedallSSUrRNAgenefragmentsfortaxonomycomparisonwithamplicondata.BothshotgunandamplicondatashowActinobacteria,ProteobacteriaandAcidobacteriaasthethreemostabundantphyla,asisexpectedforsoil[82].Sinceshotgundataaremoreaccurateonestimatingcommunitystruc-ture,thusweaccepttheshotgundataasthereference[5,58].The926F/1392R(V6-V8)primersetisbiasedagainstVerrucomicrobia(0.3%inamplicondatavs.5.8%inshotgundatabyRDP30Figure2.7:Comparsionofordinationanalysiswithdifferentvariableregions.PCoAofOTUsfromdifferentSSUrRNAvariableregions(V2,V4,V6,andV8)wasappliedoncornmarkers,fi_Cfl)andMiscanthusmarkers,fi_Mfl)rhizospheresamples.Differentcolorsindicatethesevenreplicates.OTUtablesfromclusteringofshotgundatausing150bpofV2,V4,V6andV8regionswereusedforPCoAandprocrustesanalysisinQIIMEwasusedtotransformthePCoAresultsfromdifferentregionsandplottheminthesamedatabase)inbulksoilsampleSB1(Fig.2.10),andthe515F/806R(V4)primersetbiasedagainstActinobacteria(11.6%inamplicondatavs.26.6%inshotgundata)andtowardsVerrucomicrobia(5.9%inamplicondatavs.3.2%inshotgundata)inrhizospheresampleM1(Fig.2.10).Totakeadvantageofthefactthatshotgundataisun-targeted,weretrievedandtheLargeSubunit(LSU)rRNAgenes,whichareco-transcribedwithSSUrRNAgenes.Theirtaxon-omywassimilartothatofSSUrRNAgene(Pearson'scorrelationcoefof0.87forSB1and0.91forM1),exceptthatmorereads(19.6%)remain(Fig.2.10).ThisisexpectedbecauseofthemuchlowernumberofreferenceLSUrRNAgenesintheSILVAdatabase.31Figure2.8:AlphadiversitycomparisonsbetweencornandMiscanthususingV2,V4,V6andV8regions.AllvariableregionsshowedMiscanthusismorediversethancorn,eventhoughtherewasvariationofdiversityamongrRNAgeneregions.AlphadiversitywascalculatedwithinverseSimpsonusingOTUsresultingfromclusteringwithdifferentvariableregions.ThetwogenesshowconsistentcommunityattheBacterialphylumlevelandalsocon-theknownprimerbiasagainstVerrrucomicrobiain926F/1392R(V6-V8)primersetandalsothebiasagainstActinobacteriain515F/806R(V4)primerset(Fig.2.10).Further,boththeLSUandSSUHMMsshowtheabilitytoidentifyEukaryotamembersandgiveaboutthesametaxon-omyatdomainlevel(Fig.2.11).ItisnoteworthythatbothLSUandSSUshotgundatashowArchaeatobetwiceasnumerous(2%vs.1%)inthebulksoilastherhizosphereandthatEukaryotaweremuchmorenumerous(6%vs.1%)intheMiscanthusrhizospheresoilthanthebulksoil(Fig.2.11).Higherfungalpercentage(2.54%vs.0.39%),fungi/bacteriaratio(0.027vs.0.004),andArbuscularMycorrhizalFungi(AMF)percentageinfungi(0.18%vs.0)arefoundin32Figure2.9:Bacterialphylumprcomparisonusingdifferentvariableregions.DifferentvariableregionshavesimilartaxonomyexceptthatV6hasmoresequences.TheminimumPearson'scorrelationbetweentheregionsis0.96.weredoneusingSSUrRNAgenefragmentsfromMiscanthusrhizospheresoilsample(M1)andSILVAdatabaseasreference.rhizospheresample(M1)thanbulksoilsample(SB1)(Table2.2).Copycorrectionwasappliedontwosoilsamples(SB1andM1).BothsamplesshowedthatFirmicutesandBacteroideteshadthehighestfoldchangeaftercopynumbercorrection(Fig.2.12).Despitecopynumbercorrection,theclusteringofoursoilsamplesdidnotchange(Fig.2.13)comparedtothatwithoutcopynum-bercorrection(Fig.2.6),probablybecauseofthelowproportionoftaxawithlargerrnnumbercorrections.33Figure2.10:TaxonomyprofBacterialphylaofshotgunfragmentsfromSSUandLSUrRNAgenesandofampliconreads.weredoneusingbothRDPandSILVAreferencedatabases.fi_PTflindicatesamplicondata.Lowerpanelisfrombulksoilsample(SB1)anditsamplicondata(_PT)usesV6-V8primer(515F/806R)andshowslessVerrucomicrobiadetectedusingbothdatabases.Upperpanelisrhizospheresample(M1)anditsamplicondatausesV4primer(515F/806R)andshowslessActinobacteriaandmoreVerrucomicrobiausingbothdatabases.34Figure2.11:TaxonomyprcomparisonatdomainlevelusingSSUandLSUrRNAgenes.Forboththebulksoilsample(SB1)andrhizospheresample(M1),SSUandLSUshowconsis-tentdomainleveltaxonomydistribution(Pearson'scorrelationcoef=1).fi_LSUflindicatestaxonomyfromLSUrRNASILVAdatabaseandtherestarebySSUrRNAdatabase.35Figure2.12:BacterialphylumleveltaxonomysummarybeforeandafterSSUrRNAgenecopycorrection.36Figure2.12(cont'd):Leftverticalaxiswithbarplotshowspercentageintotalcommunity,whilerightverticalaxiswithlineplotshowsfoldchangeaftercopynumbercorrection.Taxawithrelativeabundancesofmorethan0.1%beforecopycorrectionwerechosenandwereorderedbasedonfoldchange.Aisforbulksoilsample(SB1)andBisforMiscanthusrhizospheresample(M1).Table2.2:Higherfungi/bacteriaratioandpercentofAMFfungiareinrhizospheresample(M1)thanbulksample(SB1).%Bacteria%Fungi%AMFinFungiFungi/BacteriaSB1-SSU97.00%0.36%0.00%0.0037SB1-LSU96.75%0.42%0.00%0.0044M1-SSU92.38%2.60%0.18%0.0281M1-LSU92.94%2.48%0.18%0.02672.5DiscussionWepresent,characterizeandvalidateanefmethodforretrievingandanalyzingSSUrRNAgenefragmentsfromshotgunmetagenomicsequences.Thepipelineenablesunsuperviseddiver-sityanalysiswithcopynumbercorrectiononmultiplevariableregions,hasthescalabilitytohandlelargesoilmetagenomes,isexpandabletootherphylogeneticmarkergenesandispubliclyavailableonGitHub.Weapplyatwo-stepapproachforretrievingSSUrRNAgenefragments,alooseHMMteringstepfollowedbyamorestringentstepthatscreensbyidentitytothebestmatchreference.ThestepleveragesHMMER[39]andshouldthushavebetterscalabilitythanexistingshotgunanalysispipelinesthatuseBLAST-liketoolssuchasMG-RAST[83].MG-RASTannotatesshotgunreadsbyBLATsearchagainstrRNAdatabasesandthetaxonomyofreadsisinferredfromthebesthitorleastcommonancestorofseveraltophits[31,65,84].BLATorBLAST-liketoolsarenotscalableforlargedatasetsandmustberuninparallelonlargecomputerclusters,becauseBLAST-liketoolstypicallydopairwisecomparisonofreadsagainstlargeandgrowingrRNAdatabasesuch37Figure2.13:Ordinationplotofampliconandshotgunderiveddataaftercopycorrection.Theplotissimilartotheonewithoutcopycorrection(Fig.2.8).Thereweredifferencesinampliconandshotgunderiveddata(y-axis)andofcornandMiscanthusrhizospheresamples(x-axis),(AMOVAp-value<0.001),aftercopynumbercorrection.PCoAwasappliedtoOTUtableresultingfromdenovoclusteringwithshotgundataandamplicondatausing150bpofV4region.Themarkers(fi_PTfl)areamplicondataandthemarkersareshotgundata.asRDP,SILVAandGreenGenes[23,24,76],whileHMM-basedmethodscomparereadstoonlyednumberofmodels(commonlyoneforeachdomain)andthusmorescalable[83].Moreover,thesepipelineslackunsupervisedcommunityanalysis.HMM-basedsearchhasbeenusedbeforeasitisfastandsensitiveforrrnretrieval[33,60,61,63]andcurrentexistingimplementationssuchasmeta-rna,RNASelectorandmetaxaareallwrappersaroundHMMER[39].Wechosemeta-rnaandmetaxaforcomparison,forthereasonthatRNASelectorcanonlyruningraphicinterfacethatisnotsuitableforlargedatasets.Thesecondstepevaluateshmmsearchresultsbasedonidentitytotheirbest-matchreference38andalsopreparesthealignmentofSSURNAgenefragmentsforclustering.SincethereisnoclearsequenceidentitythresholdforSSUrRNAgenes,thechoiceofidentitycutoffisarbitrary(thedefaultis50%).Thisisalsocommonpracticeinampliconanalysisplatformswherereadswithlowidentitytoreferencesequencesarediscardedpriortoclustering[34,35].Forconsistencyincomparison,sequencesinouramplicondatasetswithlessthan50%identitytoreferencese-quencesarealsodiscarded.AnalignmentoftheSSUrRNAgenefragmentsisessentialforthelaterunsupervisedanalyses.Incomparisontomethodsthatuseonlythe16SrRNAgeneinE.coliasalignmenttemplate[85],ourmethodtakesadvantagesoftherichphylogeneticdiversityofSSUrRNAgenesprovidedbytheSILVAdatabase.Increasingthenumberofreferencesequencescanimprovethequalityofalignment,butalsolinearlyincreasesthememoryrequired[72,86].Ourunsupervisedanalysiswithshotgundataisanovelandimportantpartofthepipeline.Ourtestsshowsthatregionsassmallas50bpcouldbeappliedtoclustering.Thusshortreadsaround50bpcouldbeappliedtothismethodaslongastherearesufnumbersofreadsalignedtothetargetregion(sequencingdepthisalimitingfactor;seebelow).Thisisconsistentwithpilotstudiesfrom454ampliconsequencing[30,87].Whensequencingdepthislimited,thereisxibilitytocontrolthenumberofreadstoincludeforclusteringbyadjustingthetargetregionsizeandreadlengthcutoffwithincertainlimits(Fig.2.2B).Thecaveatofusingveryshortorverylargetargetregionsisthattheoverlappingportionofreadswilldecreaseandthusdecreasetheaccuracyofclusteringandtheimpactofsequencingerrorwillincrease;forexample,anerrorina50bpreadcancause2%distanceandwewillaccordinglyneedtosetalargerdistancecutoffforclustering.Inaddition,wecanobtainlongersequencesfromoverlappingpairedendsasshowninoursoildata(Fig.2.14).ThusreadsfromIlluminashotgundata(rangingfrom75to250bp)canbeusedforunsupervisedanalysis.Notethatthexibilityonchoiceofvariableregionforanalysisisanotheradvantageofshotgundata(Fig.2.2C).WealsofoundgoodreproducibilityofOTUabundancebetweentechnicalreplicates,whichiscriticalforthevalidityofourmethod(Fig.2.4A).Generally,OTU-basedanalysisprovideshigherresolutionthantaxonomy-baseddiversityanalysisforcommunitycomparison,largelyduetothe39Figure2.14:Lengthdistributionoftrimmedreadsafterqualitytrimmingandpaired-endmerging.SB1isthebulksoildataandM1istherhizospheredata.Thereadswith>150bpresultfromthemergedpairedends,whichandclusteringindownstreamanalyses.Readslessthan150bparealsousedintheanalysisandcomefromunmergedpairedreads.databaseslackingreferencesequencesfromunculturedmicrobes[88].ThehighcorrelationofOTUabundanceintwotechnicalreplicatesshowsthereproducibilityoftheanalysisofshotgundata,whichiscomparabletothereproducibilityofamplicondatashowninanotherstudyintermsofPearson'scorrelationandR2oflinearregression[80].Further,comparisonofOTUabundancesinshotgundataandamplicondatasequencedfromthesameDNAextractionalsoshowthatmanyOTUshaveinconsistentabundancesbetweenthetwotypesofdata(Fig.2.4CandD),whichagreeswiththedifferencesseeninthetaxonomy-basedcomparison(Fig.2.10).CommunitycomparisonbyordinationmethodssuchasPCoAandNMDSisoneofthemostcommonanalysesinmicrobialecology.Tothebestofourknowledge,themethodsusedintwostudies[85,89]aretheonlyexistingtoolsthataredesignedtodealwithclusteringofSSUrRNAgenefragmentsfromIlluminashotgundata.Theformercouldresultinpooralignmentbyusingonlythe16SrRNAgeneinE.coliasthealignmenttemplate.Thelatter(PhylOTU)wasappliedtolongershotgunsequencesfromSangersequencingintheirstudy.ItdoestheOTUclusteringofSSUrRNAgenefragmentsalignedtogetheroverthewholegenelength,whichcanbeproblematic40becausefragmentsalignedtodifferentregionsdonotoverlapandthustheclusteringresultsarenotreliable,eventhoughthereferencesequencesincludedintheclusteringprocesscanimprovetheresults.Sinceourtestsshowsthataslowas50bphyper-variableregioncanbeusedforunsupervisedanalyses(Fig.2.2AandC),wemakesureallthesequencesoverlapbypickingonesmallregion(150bp)andallsequencesincludedinclusteringhavelengthslongerthan100bpinourclusteringmethod.Inaddition,longerreadsobtainedbyassemblingoverlappingpairedendreads(Fig.2.14)canmakeuseofmoreoverlapamongreadsandthusaremoresuitableforclustering.Asreadlengthsincreasewithimprovementofsequencingtechnology,longerregionscanbechosenandtheclusteringresultswillbeevenmorereliable.Also,shotgundataprovidesthexibilitytochooseanyvariableregion(Figs.2.2and2.7to2.9),andtheconsistencyofresultsfromdifferentvariableregionsprovidesmoreinthebiologicalconclusionaswellasthemethoditself.Primerbiasisamajorlimitationofampliconmethods,whichwasapparentinourcomparisonsofcommunityfromampliconversusshotgundata(Fig.2.10).Commonly,itisdiftotellabiasiscausedbyprimerorDNAextraction.ThepairedampliconandshotgundatafromthesameDNAextractprovidesusanewopportunitytoevaluatethisissue,sincethedifferenceisonlyfromthesequencingstep.WeusedtwomainSSUrRNAgenedatabases,RDPandSILVA,tomakesurethetaxonomydistributionwasnotbiasedbythechoiceofreferencedatabasesandfurthercon-bytaxonomydistributionbytheLSUrRNAgene.ThebiasagainstVerrucomicrobiawithV6-V8primersisconsistentwithotherstudiesshowingthatVerrucomicrobia'sabundanceinsoilsamplesisunderestimatedduetoprimerbias[56].Meanwhile,thebiastowardsVerrucomicrobiawithV4primersagreeswithstudiesshowingthatV4primersethasbettercoverageofVerrucomi-crobia[56].Furthermore,theV4primersetshowsbiasagainstActinobacteria.TheV4primersetisreportedtocover92.4%ofActinobacteriainreferencedatabases,thelowestamongninecom-monbacterialphyla[56].BiasagainstActinobacteriahasalsobeenreportedinastudyonsyntheticcommunitywherenoprimermismatchwasfoundwithmembersfromActinobacteria[58],andan-otherstudyonenvironmentalsamplesusingSangersequencing(24F/1492R)[90],suggestingthat41in-silicoevaluationofprimersisnotsufandsomefactorsotherthanprimermismatcharecausingthebias,forexample,competitionbetweenprimersandmeltingtemperature[91].Thusprimerbiasdetectionbycomparingpairedampliconandshotgundataissuperiortomethodsthatonlysearchprimersinthereferencedatabases.AnotheradvantageofourmethodoverampliconapproachesisthatwecanidentifytheSSUrRNAgenefromBacteria,ArchaeaandEukaryota.Fungiareofspecialinterestinmicrobialecologyduetheircriticalroleinecosystems[92,93].TheFungi/BacteriaratioisanimportantindicatorofC/Nratioandsoilhealth[94,95].Ourshotgunmetagenome(DNAbased)showsaratioof0.004inbulksoil(SB1)and0.027inrhizospheresoil(M1)(Table2.2),whilestudiesusingPhospholipidFatty-acidAnalysis(PLFA)atthesamesamplingsite(KBS)typicallyshowsaratioof1to1.3(58).ThedifferencecanbeexplainedbyhigherbiomassoffungirelativetoDNA,sincesomefungihyphaemaynotbewithnuclei.Inaddition,wealsofoundhigherpercentageofAMF(AbuscularMycorrhizaFungi)ofrhizospheresoil(M1)comparedtobulksoil(SB1),whichisconsistentwiththeirsymbioticrelationshipwithgrassroots.Wealsoshowthatcopynumbercorrectioncanbeachievedinourpipeline.Genecopynumberisanothersourceofbiasthatlimitsone'sabilitytoaccuratelymicrobialcommunities.Thereareupto15SSUrRNAgenecopiesinsomeBacteriaandupto5inArchaea[96].ThispipelineutilizestheSSUrRNAcopydatabaseinCopyRighter[75].Asexpected,theFirmicutesinsoilsampleshavethehighestfoldchange(Fig.2.12).But,duetotheirlowproportioninthesesoils,theimpactofcopynumbercorrectionontheoverallcommunitywasminor(Fig.2.13).Copynumbercorrection,however,isstillanopenproblembecauseSSUrRNAgenecopynumberdataformostspeciesand/orOTUsarelackingandcopynumbercanbeincorrectevenforspecieswithcompletegenomesequencesbecauseofmis-assemblyoftheserepeatedregions.Sequencingdepthisanotherimportantfactorinconsideringthismethodfordiversityanaly-sis.ThepercentageofSSUrRNAgenefragmentsinshotgundatavariesdependingontheSSUrRNAgenecopynumberandgenomesizeofeachmember.Inourbulksoilsample(SB1)andtheMiscanthusrhizospheresoilsample(M1),weabout0.03%and0.04%ofthetotal42shotgundataasSSUrRNA,respectively.InanidealsituationwewanttogetenoughSSUrRNAgenefragmentstoseesaturationofrarefactioncurveinOTU-basedanalysis,whichisdifforsoilsamplesbecauseoftheirhighdiversityandthepresenceofsequencingerror.However,studieshaveshownthatnearsaturationsequencingofSSUrRNAampliconsisnotnecessaryforbeta-diversityanalysis[54,97].Thustmpiricalfoldcoverageof3000basedonthewholelength(about1500bp)ofSSUrRNAgeneissuggestedforsurfacesoilsamples,whichrequires11.2(1500*3000/0.04%)Gbpofshotgundata,assumingtheSSUrRNAgenecomprisesabout0.04%oftotaldata.Inthisstudy,theLSUrRNAgenewasusedmainlyasforSSUrRNAgene-baseddiversityanalysis(Figs.2.10and2.11).However,theLSUrRNAgeneoffersadditionalstretchesofvariableandcharacteristicsequenceregionsduetoitslongersequencelength,andthusyieldsbetterphylogeneticresolution[98].Fortheabovereasonandalsothereasonthattherearemoreavailablereferencesforfungi,theLSUrRNAgeneismorecommonlyusedforfungalcommunitystudies[99Œ102].CurrentlytheuseoftheLSUrRNAgeneislimitedbyreferencesequencesanduniversalprimersetsavailable[98],butitsincreasedresolutionshouldnotbeoverlookedsincetoolimitedresolutionoftheSSUrRNAgeneisabarrierinmanyecologicalstudies[92].Inthefuture,othersinglecopygeneswithphylogeneticreferenceandresolution,suchasrplB,gyrB,andrecA,couldalsoberecoveredfrommetagenomicsequenceandusedforcommunitystructureanalysisbyasimilarpipeline[41].2.6ConclusionWedevelopedafastandefpipelinethatenablesunsuperviseddiversityanalysiswithIllu-minashotgundata.Thepipelinehasthescalabilitytoanalyzelargedatasets(5CPUhoursfor38Gbdata,with4.8Gbpeakmemory)andcanberunonmostdesktopswithmorethan5GBofmemory.SinceSSUrRNAbasedcommunityanalysisisanimportantmethodinmicrobialecol-ogy,thismethodcansaveprojectswithexistingshotgunsequencedatafromtheadditionalcostofSSUrRNAampliconsequencing.Moreover,shotgunsequencingarenotasaffectedbyprimerbias43andchimerasasampliconsequencingandthuscanimprovethemeasurementofmicrobialcommu-nitystructure.Asreadlengthandsequencingdepthincreases,longerandmoreSSUrRNAgenefragmentscanberecovered.Thusclusteringanddiversityanalysisbythispipelinewillbecomeevenmorereliable.44CHAPTER3GENOMICANDMETAGENOMICCOMPARISONOFrplBANDSSUrRNAGENEINSPECIESRESOLUTION3.1IntroductionShapedby3.5billionyearsofevolution,microorganismsareestimatedtocompriseuptoonetrillionspeciesandthemajorityofgeneticdiversityinthebiosphere[103].However,ourun-derstandingofthisdiversityislimited,mainlybecausethemajorityofmicroorganismscannotbeculturedunderlaboratoryconditions.SincethepioneeringworkofCarlWoeseinthelate1970s,theSSUrRNAgenehasbecomethedominantmarkerusedininmicrobialcommunitydiversityanalysis[21,22,54,104].ThefactthattheSSUrRNAgeneishighlyconservedandthatthereareoftenmultiplecopiesinthesamegenomewithintra-genomicvariationsmakestheSSUrRNAgeneproblematicfortaxonomicatspeciesandstrainlevels[40,41,105].WiththeacceleratedaccumulationofmicrobialgenomesinNCBIinrecentyears[106],wholegenomebasedcomparisonisconsideredtobethemostaccuratemethodforspeciesandstrain[106Œ110].However,wholegenomebasedcomparisoniscomputationallymoreex-pensivecomparedtomarkergenecomparison,andmoreimportantly,weareunabletoreliablyobtaingenomesequencesofmembersinamicrobialcommunity,especiallyforcomplexenvi-ronmentssuchassoil.Thus,genomebasedtaxonomicdearelessusefulforcommunitydiversityanalysis,wheremarkergeneanalysisisstillessential.BecauseofthedisadvantagesofusingtheSSUrRNAgene,microbialecologistshavebeenlookingforalternatives.Singlecopyproteincodinghousekeepinggenesstandoutforthefollowingreasons.First,theyaresinglecopyinthegenomeandthusallowmoreaccuratespeciesandstrainandOTUclusteringthantheSSUrRNAgenewhichislimitedbyintra-genomicheterogeneity.Second,theyarepresentinvirtuallyallmembersofthethreedomainsoflife.Third,proteincodinggenesevolvefasterthanrRNAgenesnotonlybecauserRNAgenesaremoreconservedduetotheircriticalroleinribosomes[111],butalsobecauseoftheredundancyinthe45geneticcode,especiallyatthethirdcodonposition[40].Here,weuseasinglecopyproteincodinggene,rplB(codingribosomallargesubunitproteinL2)toshowcasethepotentialofusingsinglecopyhousekeepinggenesasaphylogeneticmarkerformicrobialcommunitydiversityanalysis.Althoughtherearestudiesusingeithergenomicdataormetagenomicdata[40,41],wecomparerplBandSSUrRNAgeneswithallcompletedbacterialgenomes(agenomeasasinglesequence)andalsolargesoilshotgunmetagenomicdatathatareordersofmagnitudelargerthanthoseusedinpreviousstudies.Thenoveltyofourgenomicdataanalysesisthatweevaluatedifferenceinquantity(i.e.percentidentitiesamongtaxonsingenese-quence)ratherthanwhethertaxonsaredifferentingenesequence).Inmetagenomicdataanalyses,wefocusonthedifferenteffectsofthetwodifferentmarkergenesondenovoOTUbasedanalyses(acommonpracticeinmicrobialdiversityanalyses),whilepreviousstudiesfocusontaxonomic[40,41].3.2Methods3.2.1DownloadingbacterialgenomesBacterialgenomeassemblyinformationfromNCBI(ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt)wasusedtoconstructtheurltodownloadeachgenomebasedontheinstructionsdescribedinthislink(http://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#allcomplete).Commandlinefiwgetflwasthenusedtoretrievethegenomese-quenceswithurlsobtainedfromtheabovestep.3.2.2SSUrRNAgeneandrplBextractionandpairwisecomparisonTheSSUrRNAgeneHMM(HiddenMokovModel)fromSSUsearchwasused[6].AlignedrplBnucleotidesequencesofthefitrainingsetflretrievedfromtheRDPFunGenedatabasewereusedtobuildHMMusinghmmbuildcommandinHMMER(version3.1b2)[39,112].ThenhmmercommandinHMMERwasthenusedtoextractSSUrRNAgenesequencesandrplBsequences46fromtheBacterialgenomesobtainedfromNCBIusingscorecutoff(-T)of60.Next,nhmmeratleast90%aslongastheHMMmodelwereacceptedasthecorrecttargetgene.ForthepurposeofcomparingSSUrRNAgenedistanceandrplBgenedistance,onecopyoftheSSUrRNAgenewasrandomlypickedforeachgenome.Pairwisecomparisonamonggenesequenceswasdoneusingvsearch(version1.1.3)[113]withfiŒallpairs_globalŒacceptallfl.Threespeciesofinter-est:Rhizobiumleguminosarum,PseudomonasputidaandEscherichiacoli,werechosenforclosercomparisonofrplBpairwisedistancesandSSUrRNAgenedistances.3.2.3RhizospheresoilmetagenomicdatadownloadandpreprocessShotgunsequencingmetagenomicdatafor21samplesweredownloadedfromtheJGIwebportal(http://genome.jgi.doe.gov/).JGIProjectIDsforthe21samplesarelistedinTable3.1.Thesamplesarefromtherhizosphereofthreebiofuelcrops:corn(C),switchgrass(S)andMiscanthus(M),withsevenreplicatesforeachcrop.Rawreadswerequalitytrimmedusingfastq-mcf(verison1.04.662)(http://code.google.com/p/ea-utils)fi-l50-q30-w4-k0-x0Œmax-ns0-Xfl.Overlappingpaired-endreadsweremergedbyFLASH(version1.2.7)[71]withfi-m10-M120-x0.20-r140-f250-s25-t1fl,whichhasalsobeendescribedin[6].3.2.4SSUrRNAgeneandrplBandcomparisoninsoilmetagenomesSSUrRNAgenefragmentsandthosealignedtotheV4region(E.coliposition:577-727)ofeachsamplewereusingtheSSUsearchpipeline[6]andclusteredusingtheRDP'sMcClusttool[23].InordertoevaluatetheeffectofthetwogenesonOTUnumber,SSUrRNAgenereadsusingSSUsearchwereassembledintofulllengthSSUrRNAgenesequencesusingemirge_amplicon.pyfromtheEMIRGEpackage(version0.60.3)[114]withoptionsfi-l150Œphred33-a4-i275-s25fl,tocomparewithfulllengthrplBsequencesdecribedbelowonOTUnumber.FulllengthrplBsequenceswereassembledusingXanderwithfiMAX_JVM_HEAP=500G,FILTER_SIZE=40,K_SIZE=45,genes=rplB,THREADS=9fl[36].Dataforeachcropwereassembledseparately.TheassembledrplBsequences(nucleotideand47Table3.1:Rhizospheresoilshotgunmetagenomicdata.SampleJGIprojectIDDatasize(Gb)C1102376446C2102376739C3102377057C4102377353C5102377657C6102377951C7102378246M1102378556M2102378857M3102379150M4102379442M5102379745M6102380043M71018623,101861132S1102380340S21018626,101861428S3102380644S4102380949S51018629,101861727S6102381250S7102381539protein)fromthreecropswerepooledandclusteredusingRDP'sMcClusttool[23].AnOTUtableforeachgenewithOTUcountsofeachsampleweremadebasedonmeankmercoverageoftherepresentativesequenceofeachOTU.Further,diversityanalysesweredonebyveganpackageinRusingfunctionsfirdaflforordi-nationandfidiversityflforShannondiversityindex,resectively,fromtheOTU(count)tables.Forcomparingtheresolutionoftwogenesassembledfromthemetagenomicdata,sincetheSSUrRNAgenehadhighersamplingdepththanrplBduetothemultiplecopynatureofSSUrRNAgenes,wetheuniquefulllengthsequencesofbothgenesbasedontheabundanceinformation(i.e.tomultiplyeachuniquesequencebyitsabundnce)andthenweevenlysubsampledthetwogenedatasets.TheabundanceinformationofrplBsequenceswasinawiththenamefi*_cover-age.txtflintherplB/clusterdirectoryfromXander.Meanwhile,therelativeabundanceinformationofSSUrRNAgeneassembliescanbefoundthedescriptionofassembledsequenceein48fastaformatasdescribedinREADME.txt(NormPrior)oftheEMIRGEpackage.WiththetotalcoverageofthegeneestimatedfromthereadsasSSUrRNAgenesequences(total_bp/1500),theabundanceofeachgenewasestimatedbymultiplyingrelativeabundanceandtotalcoverageoftheSSUrRNAgene.3.3Results3.3.1SSUrRNAgeneandrplBcopynumberandintra-genomeSSUidentityAtotalof53865Bacterialgenomesweredownloadedwith4457ofthemcomplete.ThemaximumandmeanofSSUrRNAgenecopynumberdetectedwas20and2amongallgenomes,and16and4amongcompletegenomes(Table3.2).MultiplecopiesofrplBwithinthesamegenomewerealsodetected(Table3.2),withamaximumof6amongallgenomesand2amongcompletegenomes.Inordertocomparethesetwogenes,theyneedtobothbepresentingenomes.Wefound46600genomesand4440completegenomeswithbothgenes.Whenlookingatintra-genomicvariationamongcopiesincompletedgenomesofR.leguminosarum,P.putidaandE.coli,wefoundthatE.colihasthelargestvariationamongthethree,withaminimumof95.4%(Fig.3.1).Forthepairwisecomparisonofgenomesbelow,onecopyofSSUrRNAgenewasrandomlypickedasarepresentativeforgenomeswithmultiplecopies.Figure3.1:Intra-genomevaritaionofSSUrRNAgeneofR.leguminosarum,P.putida,andE.coliinboxplot.E.colihaslargestvariationamongSSUrRNAgenecopieswithinthethree.Circlesareoutliesinboxplots.EachdotisacomparisonbetweentwocopiesofSSUrRNAgeneinthesamegenome.49Table3.2:SummaryofSSUrRNAgeneandrplBcopynumberingenomes.CopySSU_allSSU_completerplB_allrplB_complete132576635532894443231367119043264956014NA422645853NA52006372NANA616034841NA71333556NANA8525194NANA922292NANA10187107NANA1111458NANA127227NANA135826NANA144838NANA1584NANA1671NANA181NANANA201NANANA3.3.2PairwisecomparisonofgenomeswithSSUrRNAgeneandrplBOverall,rplBgenesequenceswerelesssimilarthanSSUrRNAgenesequenceformostgenomepairs(Fig.3.2).SSUrRNAgenesequencepairwisesimilaritiesweremainlybetween90%and100%,whilerplBdistanceswereinalargerrangebetween70%and100%.Further,wetargetedseveralspeciesofinterestfordetailedexamination:R.leguminosarum,P.putidaandE.coliandfoundallselectedspeciesexceptE.colirplBgeneslesssimilarthanSSUrRNAgeneswithintheircorrespondingfamily,genus,andspecies(amongstrains)(Figs.3.3to3.5).Incontrast,mostcomparisonswithingenusandspeciesforE.colishowedtheSSUrRNAgeneslesssimilarthanthecorrespondingrplBgenes.SSUrRNAgenesimilaritiesweremostly99%whilemostrplBgeneswereidentical(Fig.3.5BandC).Theaboveresultswerefromcompletegenomes.WealsoappliedthesameanalyseswithallgenomesbutsomestrainswithintheE.colishowedlowidentities(<80%)intheirSSUrRNAgenes(Fig.3.6),indicatingpoorgenomequalityorstrainThuswedecidedtouseonlyresultsfromcompletedgenomesonly.50Figure3.2:SSUrRNAgeneandrplBidentitiesamongspeciesinthesamegenus.XaxisisSSUrRNAgeneidentitiesandYaxisisrplBidentities.Thedashedlineisy=x.ThehistogramsonthetopandrightaxesareidentitydistributionsforSSUrRNAgeneandrplBrespectively.SSUrRNAgeneidentitiesarelargerthanrplBinmostgenomes(mostdotsarebelowthedashedline),suggestingrplBhasmorevariationamongspeciesinthesamegenus.51Figure3.3:SSUrRNAgeneandrplBamongR.leguminosarumrelatedgenomes.SSUrRNAgeneidentitiesarelargerthanrplBinmostgenomepairs.XaxisisSSUrRNAgeneidentitiesandYaxisisrplBidentities.Thedashedlineisy=x.SizeofdotsindicatesnumberofgenomepairsthatsharethesameSSUrRNAgeneidentityandrplBidentity.A.AllcompletedgenomesinRhizobialesareincludedinpairwisecomparison;B.AllcompletedgenomesinRhizobiumareincluded;C.AllcompletedgenomesinR.leguminosarumareincluded.52Figure3.4:SSUrRNAgeneandrplBamongP.putidarelatedgenomes.SSUrRNAgeneiden-titiesarelargerthanrplBinmostgenomepairs.XaxisisSSUrRNAgeneidentitiesandYaxisisrplBidentities.Thedashedlineisy=x.SizeofdotsindicatesnumberofgenomepairsthatsharethesameSSUrRNAgeneidentityandrplBidentity.A.AllcompletedgenomesinPseu-domonadalesareincludedinpairwisecomparison;B.AllcompletedgenomesinPseudomonasareincluded;C.AllcompletedgenomesinP.putidaareincluded.53Figure3.5:SSUrRNAgeneandrplBidentitiesamongE.colirelatedgenomes.XaxisisSSUrRNAgeneidentitiesandYaxisisrplBidentities.Thedashedlineisy=x.SizeofdotsindicatesnumberofgenomepairsthatsharethesameSSUrRNAgeneidentityandrplBidentity.A.AllcompletedgenomesinEnterobacterialesareincludedinpairwisecomparison;B.AllcompletedgenomesinEscherichiaareincluded;C.AllcompletedgenomesinE.coliareincluded.54Figure3.6:SSUrRNAgeneandrplBidentitiesamongallE.coligenomes.SSUrRNAgeneidentitiesaresmallerthanrplBinmostgenomepairsandsomeareaslowas70%.XaxisisSSUrRNAgeneidentitiesandYaxisisrplBidentities.Thedashedlineisy=x.SizeofdotsindicatesnumberofgenomepairsthatsharethesameSSUrRNAgeneidentityandrplBidentity.3.3.3ComparisonofSSUrRNAgeneandrplBondiversityanalysesinlargesoilshotgunmetagenomesOnaverage,0.04%oftotalreadswereasSSUrRNAgenefragmentsand0.004%oftotalreadsalignedtotheV4regionofthegenewithSSUsearch.Another0.01%oftotalreadswereasrplBwithXander(Table3.3).Further,wefoundthattheSSUrRNAgenehadmoreOTUsthanrplBwithoutevenlysubsamplingwithadistancecutoffrangeof0to10%(Fig.3.7lowerpanel).However,afterevenlysubsamplingsequencesto6000foreachgene,weobservedrplBnucleotidesequenceshadmoreOTUsthanSSUrRNAgenesequencesconsistently,andrplBproteinsequenceshadmoreOTUsthanSSUrRNAgenesequencesexceptfordistanceoffsrangingfrom2%to5%but(Fig.3.7upperpanel).RegardlessofthedifferencesinOTUsres-olution,bothgenesproducedthesamebiologicalconclusionsintwocommondiversityanalyses:themicrobialcommunitiesinrhizospheresoilofcornweredifferentfromthoseinMiscanthusandswitchgrass(betadiversitybyordination)(Fig.3.8);andcornhadalowerdiversitythanMiscanthusandswitchgrass(alphadiversity)(Fig.3.9).Inaddition,rplBalsoseparatedthe55microbialcommunitiesinMiscanthusandswitchgrasswhileSSUrRNAgenedidnot,suggestinghigherresolutionoftherplBgenes(Fig.3.8).Table3.3:SummaryofreadsasSSUrRNAgeneandrplBinsoilmetagenomes.TotalReadsSSUSSU%V4V4%rplBrplB%C12322340951171600.050%116150.005%297430.013%C2220844998954930.043%81370.004%226050.010%C32821233351192100.042%115790.004%330760.012%C42605426621071140.041%102950.004%284270.011%C52858732321433020.050%135200.005%334280.012%C62504776171245980.050%124870.005%296680.012%C72629439301141830.043%94490.004%297290.011%M12740609251020490.037%96930.004%272100.010%M22782788681004980.036%97670.004%280410.010%M3244772969926240.038%89630.004%249440.010%M4206129129695350.034%64740.003%207770.010%M5225964704747780.033%68920.003%227130.010%M6215320045730620.034%68000.003%220740.010%M7160726636712490.044%65800.004%151820.009%S1192776425680800.035%64810.003%192620.010%S2143660127572290.040%54690.004%132700.009%S3218743879763680.035%72100.003%214210.010%S4249773480784220.031%72150.003%236640.009%S5140239637499620.036%45980.003%121800.009%S6254749228821050.032%72820.003%243640.010%S7194863138680940.035%65040.003%191840.010%3.4DisscussionAlthoughsinglephylogeneticmarkerbasedspeciesarelimitedcomparedtowholegenomebasedspeciesitstillisthemosteffectiveandpracticalapproachformicrobialcommunityanalyses.Herewecompareasinglecopyproteincodinggene(rplB)withthecom-monlyusedSSUrRNAgeneanddemonstratethatsinglecopyproteincodinggenearesuitedforcommunityspeciesdiversityanalysesusingallbaterialgenomesandlargesoilshotgunmetagenomes.56Figure3.7:ComparisonofOTUnumbersusingSSUrRNAgeneandrplBinlargesoilmetagenomes.ThelowerpanelshowsthattherearemoreOTUsusingSSUrRNAgenethanrplBwithoutevensampling,buttheupperpanelshowsthattherearelessOTUsusingSSUrRNAgeneafterevensubsampling.XaxisispercentagedistancecutoffforOTUclusteringandYaxisisOTUnumber.Figure3.8:ComparisonofSSUrRNAgeneandrplBinbetadiversityanalysis(ordination)usinglargesoilmetagenomes.Bothgenesshowmicrobialcommunityincornrhizosphereissig-differentfromthoseinMiscanthusandswitchgrass.Additionaly,rplB(bothnucleotideandprotein)alsoseparatesmicrobialcommunitiesofMiscanthusandswitchgrass,suggestingthehigherresolutionofrplB.57Figure3.9:ComparisonofSSUrRNAgeneandrplBinalphadiversityanalysis(Shannondiversityindex)usinglargesoilmetagenomes.BothgenesshowmicrobialcommunityincornrhizosphereislessdiversethanthoseinMiscanthusandswitchgrass.3.4.1ChoosinggenomicdatasetsForthecomparisonofrplBandtheSSUrRNAgenewithingenomes,genome(assembly)qualityiscriticalbecauseerroneoussequencesofSSUrRNAgenesandrplBextractedfromlowqualityassemblygenomescouldleadtoproblematicresults.Ontheothehand,itisimportanttoincludeasmanygenomesaspossibletomaketheconclusionmoregeneralizable.Wetestedtwosetsofgenomes.Thesetcontainedonlycompletedgenomes(genomesthathasbeenclosedasonesequence)(4457),whichareexpectedtohavehigherquality.ThesecondsetislargerwithalltheavailableNCBIRefSeqBacterialgenomes(53865),mostofwhicharenotcompleted(withmulti-plecontigs).ThefactthatthesecondsethasahigherfractionofgenomeswithmultiplecopiesoftherplBgenes(expectedinsingle-copy)andmanyE.colistrainsintheseconddatasetshowlowerthan80%similarityfortheSSUrRNAgenesbetweenstrains(Fig.3.5),imply,onaverage,lowerqualitygenomeassembliesinthesecondset.Thuswepicktheset(completedgenomes)forouranalyses.Shortreadsequencingtechnologyhasmadesequencingbacterialgenomescheaperandfaster,whilenotprovidingaparalleldecreaseinthecostofagenome,somostassem-bledgenomesarenotcomplete[106].EspeciallyforgenomicregionswithmultiplecopiessuchasSSUrRNAgenes,assemblyfromshortreadsarepronetoassemblyerrorandlong-readsequencingtechnologysuchasPacBioandOxfordNanoPorearepromisingwaytosolvetheproblem.583.4.2AdvantagesofrplBoverSSUrRNAgeneasaphylogeneticmarkerFirst,theSSUrRNAgene(amultiplecopygene)postsdifforinterpretingspecies(OTU)abundance,whilerplB(asinglecopygene)doesnothavethesameissue.Additionally,varia-tionsamongmultipleSSUcopiescancausemultipleOTUs(sequenceclusters)fromthesamespecies[115],andthusleadtooverestimationofspeciesrichness.Since,asinglecopyoftherplBgeneiscontainedineverycellinacommunity,rplBtherelativeabundanceofrplBgenesequencesprovidesamechanismforestimatingthefractionoforganismsposessingotherlessuniversalsingle-copygenes.Second,SSUrRNAgenesarealsomorepronetoassemblyerrors(chimera)thansinglecopygenesduetotheirhigheroverallsimilarityandthepresenceofhighlyconservedregionsinterspersedinSSUrRNAgenes.Multiplecopiesandvariationamongcopies,whichisalsoconsistentwiththeresultsthatthereareunexpectedlylowsimilarities(<80%)ob-servedamongSSUrRNAgenesofE.colistrainswhilerplBhashighsimilarities(>95%)amongE.colistrainswhenincompletegenomesareincluded(Fig.3.6).Notethattheseerroneousse-quencesmightbefurthercollectedbydatabasesandusedasreferencesfortaxonomy,alignment,andchimeradetection,andthushaveanimpactoncommonmicrobialecologydiversityanalyses;switchingtoasinglecopygenethatislesspronetoassemblyerrorcanmitigatetheaboveprob-lem.Third,wehaveshowntherplBgeneisbetterabletodifferentiatecloselyrelatedspeciesthanistheSSUrRNAgene,duetolowersequencesimilarityoftherplBgenecomparedtotheSSUrRNAgeneinpairwisecomparisonsofgenomesequences(Figs.3.2to3.5)andalsorplBgenesassembledfromsoilmetagenomicdataclusteredintomoreOTUsthanSSUrRNAgenesindicatingagainmoreresolution(Fig.3.7).ThisisconsistentwiththecrucialroleSSUrRNAplaysintranslationaspartoftheribosome(ensuringtranslationaccuracybyworkinginconcertwithtwootherrRNAspecies,tRNAs,ribosomalproteins,translationinitiation,elongationandter-minationfactorsetc.[111],whichhasalsobeenbyanotherstudyshowingSSUrRNAgenes(alongwithLSUrRNAgenes,tRNAandABCtransportergenes)havebeenreportedtobethemostconservedgenes[116].Further,therearemultiplecasesofclearlydifferentspecieswithidenticalSSUrRNAgene(Fig.3.2),andtheheterogeneityamongcopiesofSSUrRNAgenesin59thesamegenomecanbehigherthanthoseamongdifferentspecies[115],whichmeanstheSSUrRNAgeneisnotsuitabletoresolvelowertaxonomiessuchasspeciesandstrains.Andlast,rplBhastheadvantageofhavingsequencesinbothnucleotidespaceandproteinspace.Higherleveltaxonomiescanberesolvedatproteinspace,andlowerleveltaxonomiescanberesolvedatnu-cleotidespacewithproteinsequencesguidingthenucleotidealignment(codonalignment),whichiscriticalforphylogeneticanalysisandalignment-basedOTUclustering.3.4.3NoveltiesinthisstudyAlthoughthereareotherstudiesoncomparingsinglecopyhousekeepinggenesandSSUrRNAgenes[40,41],thenoveltyinouranalysesisthatweuseOTUs(fromdenovoclustering),whichismorecommonthantaxonomyinmicrobialdiversityanalysesandenabledbyrecentadvancementinshotgunmetagenomicanalysismethodology,SSUsearchandEMIRGE(forSSUrRNAgenes)[6,114]andXander(forrplBandotherproteincodinggenes)[36].Incontrast,otherstudiesdocomparisonattaxonomylevel,whichbecauseexistingreferencesetscoveronlyasmallfractionofdiversity,manyreadscannotbeathighandarethusbedicarded,especiallyforgenesotherthanSSUrRNAgenes.Second,wecomparethesetwogenesusingmuchlargergenomicdatasetsandlargershotgunmetagenomicdatasetsthanwerepreviouslyavailabl.Forexample,Caseetal.comparedtherpoBgene(codingforthebetasubunitofRNApolymerase)totheSSUrRNAgeneusingonly111completedgenomesfromNCBI[40],whileweinclude4457completedgenomes,socanresultinmoregeneralizableconclusions.Similarlywithshotgunmetagenomicdata,Rouxetal.makecomparisonsusingonlyasubsetofsamplesfromtheGlobalOceanSurvey,whichisabout6billionbpintotalfromSangersequencing[41,117],whileoursoilmetagenomicdatasetsareabout900billionbpintotal.3.4.4LowerstrainlevelresolutionofSSUrRNAgeneduetoitsintra-genomevariationsAdditionally,weinconsistentresultswhencomparingtwogenesatstrainlevelwithdiffer-entspecies(Figs.3.3to3.5).TheexceptioninE.colicouldbeexplainedbytheintra-genome60variationsofSSUrRNAgenes(Fig.3.1)thatleadtolargerinter-strainvariationthanrplBamongstrains,whichcouldnotbeinterpretedasbetterresolution.Consistently,weobservedOTUnum-bersfromSSUdecreaseslowerwhendistancecutoffislessthan4%(slopeoflineinFig.3.7),whichcausesOTUnumberofSSUrRNAgenestobehigherthanrplBproteinandsuggeststhatintra-genomevariationofSSUrRNAgenesarecausingofOTUnumbers[115]andalsomakeSSUrRNAgenesnotsuitableforstrain3.4.5ComparisonoftwogenesindenovoOTUbasedcommunitydiversityanalysesAnothergoalofthisstudyistoevaluatetheeffectoftwodifferentgenesoncommonOTUbaseddiversityanalysessuchasalphadiversityandbetadiversity.InadditiontomoreOTUsresultingfromrplBnucleotidesequencesthanSSUrRNAgenesequencesusingthesamedistancecutoff(0.03),diversityanalysesoftwogenesprovidethesamebiologicalconclusion,exceptthatrplBalsoseparatesMiscanthusandswitchgrasswhileitisnotforcorn,suggestingrplBhashigherresolutionthantheSSUrRNAgene.ThisissimilartocomparisonoftaxononybaseddiversityanalysisanddenovoOTUbaseddiversityanalyses.Althoughspeciesorgenuslevelisenoughtodifferentiatethebiologicaltreatmentssometimes,higherresolutionlevels(OTU)arestillpreferredwhenthedifferencesarenotobvious,e.g.,thereisnodifferenceincommunitycompositionatspecieslevelbutstraincompositionsinsomespeciesaredifferent.Forthesamereason,highervariationinrplBmakesthegeneabetterphylogeneticmarkerthanSSUrRNAgenesforanalyzingmoresimilarcommunities.3.4.6SequencingdepthneededforrplBinmetagenomesNowthatdenovoOTUbaseddiversityanalyseswithrplBinshotgunmetagenomeshavebeenenabledbyXander,anothercommonquestioninexperimentdesignis:HowmuchsequencingisneededtoensurethereisenoughcoverageontherplBgenefordiversityanalyses?Becauseofthehighmicrobialdiversityinsoilandsequencingerror,thethoroughsequencingofsoilmetagenomesisdifAdditionally,saturationofOTUs(species)inrarefactioncurveisalsonotnecessary61forbeta-diversityanalysis[54],soanempiricalfoldcoverageof3000isrecommendedforsoilsamplesbasedonexperienceswiththeSSUrRNAgene[6],whichmeansabout25Gbp(3000*850bp/0.01%)shotgunsequencesareneededwith850bpastheaveragelengthofrplBand0.01%astheaveragepercentageofrplBreadsintotalreads(Table3.3).3.4.7LackofuniversalprimersforandreferencesequencesThereareafewdrawbackstousingsinglecopyhousekeepinggenessuchasrplBfordiversityanalysis.First,unlikeSSUrRNAgenes,whichhavehighlyconservedregionsinterspersedwithhighlyvariableregions,theredundanceyinthegeneticcode,especiallythehighvariabilityatthethirdcodonpositionmakesitdiftodesignuniversalprimers.Althoughbetterbulkassemblymethodsandnewgene-targetedassemblymethods(suchasXander)enableustoavoidprimer[6,36,114],thecostofshotgunsequencingoflargenumbersofsamplesatsufsequencingdepth,asmentionedabove,isstillhighandthusalimitation.Therefore,availabilityofuniversalprimersarestilldesirabletraitsforagoodphylogeneticmarker.Second,rplBandothersinglecopyhousekeepinggeneslackdeepsequencereferencesets,whiletheSSUrRNAgenehasaccumulatedalargeamountofreferencesequencesanddedicateddatabases,suchasthoseinRDPandSILVA[23,24]thatarecriticalfordiversityanalyses.3.5ConclusionAlthoughgenomelevelcomparisonshavethehighestresolution,thephylogeneticmarkermethodisstillessentialforcommunitydiversityanalysis.WedemostratethatrplB,asinglecopypro-teincodinggene,canprovideresolutiontoidentifylowleveltaxonomysuchasspeciesandstrainsthanthemorecommonlyusedSSUrRNAgene(Fig.3.2)andalsoscaledenovo(OTU)diversityanalysis(Figs.3.7and3.8).Inspiteofthelackofreferences,morereferencesarebecommingavailableasbacterialgenomesequencingbecomesfasterandcheaper[106].Inaddition,astheswitchtohigher-qualitylongreadsequencingforbacterialgenomesequencingcontinues,thequalityofthesereferencegenomeswillincrease.Thus,singlecopyproteincoding62genessuchasrplBhavegreatpotentialtocomplementorevenreplacetheSSUrRNAgeneasaphylogeneticmarker.63CHAPTER4RHIZOSPHEREMETAGENOMICSOFTHREEBIOFUELCROPS4.1AbstractSoilmicrobesformassociationwithcropsinrhizosphereandalsoplayamajorroleinecosystemfunctions,suchasNandCcycle.Croprootshadstrongonthesoilmicrobialcommunity.Thuslarge-scaleplantationofbiofuelcropswillhaveimpactonecosys-temfunctionsregionallyandbeyond.Wecomparedrhizospheremicrobialcommunitiesofcorn(annual)andswitchgrassandMiscanthus(perennials).Thisisthecomparativestudyofthesebiofuelcropsusingshotgunmetagenomicsandoneofthelargestsequencingeffortstodate(about1TBbpintotal).Wecomparedtherhizospheremetagenomesatthreelevels:overallcommunitystructure(SSUrRNAgene),overallfunction(annotationfromglobalassembly),andNcyclegenes(fromXander).AllthreelevelsshowedcornhadadifferentcommunityfromMis-canthusandswitchgrass(exceptforAOA).Intermsoflifehistorystrategy,thecornrhizospherewasenrichedwithmorecopiotrophswhiletheperennialswereenrichedwitholigotrophs,whichisfurthersupportedbyhigherabundanceofgenesinfiCarbohydratesflandhigherfungi/bacteriaratios.Inaddition,cornalsohadalessrichandevencommunity,sotheperennialsmanagedtomaintainamorediversecommunityeventhoughinvestinglessCintherhizosphere.Moreover,alargerdispersionofcorndatainordinationplotsandenrichedPenicilliumfungi)alsoindicatecornmaynotbedoingaswellincontrollingitscommunityandselectingmember.Furthermore,thenitrogencommunityofcornwasdominatedbyRhizobium(per-hapsalegacyfrompriorlegumecrops)whiletheperennialshadNifHsequencesmostrelatedtoCoraliomargarita,NovosphingobiumandAzospirillum,indicatingthattheperennialscanbetterselectmembers.Moreover,highernumbersofgenesfornitrogenandlowernumberofgenesfornitritereductionsuggestbetternitrogensustainabilityoftheperennials.Thusourstudyprovidescomprehensiveevidenceshowingperennialbioenergycropshaveadvantages64overcorninhighermicrobialspeciesandfunctionaldiversityandinselectingmemberswithben-traits,consistentwithahigherlevelofsustainabilityofperennialbiofuelcrops.4.2IntroductionBioenergycropshavethepotentialtoproducerenewabletransportationfuelsandwithanetneg-ativegreenhousegasproductionifmanagedappropriately[118],andtoprovideotherecosystemservices[119,120].BioenergycropgrowthstillrequiressomeenergyinputssuchasNfertilizer,acomponentthatthecroppingsystem'smicrobiomecanpotentiallyminimize.Themicrobiome,andespeciallytherhizosphere,isanimportantpartofthesoilwhereplantsprovidethemicrobescarbonandmicrobesprovidestheplantnutrients(NandP),growthhormones,anddiseasere-sistance[121Œ123].Thusrhizospheremicrobialcommunityhasthepotentialreducethecostofgrowingbioenergyplantsandiskeyforsustainability,especiallywhenbioenergycropsaregrownonmarginallands,i.e.,notsuitableforhumanfoodcrops.Vegetationtypehasanonmicrobialcommunitystructurethroughitsrootexudatesanddetritus[124,125].Herewestudiedtherhizosphereofthreeprominentbiofuelcrops(corn,Miscanthusandswitchgrass)fromthesamefamily(truegrasses)butvaryingmarkedlyinpheno-type.Miscanthusandswitchgrassareperennialgrasseswithroots,whilecornisanannualwithlessroot.Miscanthusisthelargestinsizeandisanexoticoriginatingfromtropi-calAsia,whileswitchgrassisnativetoUSMidwest.Largescaleplantingofthesebiofuelcropsmaychangesoilmicrobialcommunitystructureandfunction,andimpactglobalnutrientcyclepro-cesses.Corntypicallyneedsfertilizerformaximumyields,andtheNfertilizersarereportedtocostabout40%oftotalenergyinputforthecroppingsystem[126].Ontheotherhand,MiscanthusandswitchgrasscangrowwithlittleornoNfertilizer[127Œ129],sotheperennialgrassesarethoughttohavebetternitrogensustainability.Itisstillnotknownhowmuchrhizospheremicrobescontributetothebetternitrogensustainabilityofperennialssincenutrienttranslocationbeforesenescenceandphyllosphere(leaf)microbesarealsopossiblecontributors.HerewenotonlystudytheoverallmicrobialcommunitystructureofthreebiofuelcropsbutalsofocusongroupsinvolvedintheN65cycle.WehypothesizethattherhizospheremicrobialcommunitiesofperennialsareadaptedtoalowNfertilizerinputenvironmentandareenrichedinNgenes,andhaslessgenes.Despitelimitedunderstandingofrhizospheremicroorganisms,thechemistryofthenutrientcyclescarriedoutbythosemicroorganismsisrelativelyclear,includingnitrogennitri-andThenitrogenmicrobialcommunityiscommonlymeasuredwithnifHasmarkerandthenitrifyingcommunitywithAOA(archaealamoA)andAOB(bacterialamoA)astheirmarkergenes[47,130].hasmultiplestepsincludingnitratereduc-tion(Nar),nitritereduction(Nir),nitricoxidereduction(Nor)andnitrousoxidereduction(Nos).Nirhastwotypes,Cu-Nir(copper-based)andcd1-Nir(heme-based),encodedbynirKandnirS,respectively.NorisencodedbycNor(withcytochromec)orqNor(withquinoneinsteadofcy-tochromec).Nosalsohasdifferenttypes,encodedbytypicalnosZ(cladeI)andatypicalnosZ(CladeII)[48,49,131,132].Thusampliconbasedstudiesonlytargetingonesingletypeofageneforandlikelyunderestimateabundanceofthemicrobesinvolved.MostsurveysonNcyclerelatedcommunitiesinsoilsuseampliconbasedmethods[47,48,133],whichhasprimerbiasproblems,especiallyforthoseNcyclegenesthatlackwellcuratedrefer-encedatabases[47,48,133].Incontrast,theshotgunmetagenomicmethodproducesshortreadsfragmentsrepresentingallthegeneticinformationofcommunitymembersandthushasnoprimerbias.Thechallengeoftheshotgunapproachistheanalysisofbigdataanditsshortreadcharac-ter[9,10].Thereisacross-disciplinarystudyshowingtheperennialgrasscanincreasebiodiversityacrossmultipletaxaandenhancemultipleecosystems,butonlymethanotrophicbacteriaismea-suredwithinmicrobiome[120].OurstudycoversoverallmicrobialdiversityandalsosubgroupsinvolvedinNcyclethatareofeconomicandenvironmentalinterest,andthusprovidesamorecompleteviewofthemicrobiome.Further,weexpecttoseestrongermicrobialresponsebyusingrhizospheresoil(<1mmtoroot)ratherthanbulksoilusedinotherstudies[120,128].Inthisstudy,weappliedshotgunmetagenomics(about1TBDNAdata)incombinationwithecologytheory(lifehistorystrategy)tostudyrhizospheremicrobialspeciesandfunctionalre-66sponsetothreedifferentcrops(corn,Miscanthus,andswitchgrass).Weusecomputationallyefmethodstoassembleglobally(khmer)[10,134,135]andalsoextractgenesofinterest(SSUsearchandXander)[6,36]inthisbigdata.ComparativemetagenomicsisappliedatwholecommunityspeciesandfunctionlevelandalsoNcyclegeneleveltounderstandmicrobialre-sponsestodifferentbioenergycrops.4.3Methods4.3.1Soilsamples,DNAextraction,andsequencingRhizospheresamplesofcorn,switchgrassandMiscanthuswerecollectedatGreatLakeBioen-ergyResearchCenter(GLBRC)intensiveCroppingSystemComparisonsiteinKellogBiologicalStation(KBS)(42.394828,-42.394828)(http://data.sustainability.glbrc.org/pages/1.html)inMichiganon10/12/2012.Wesampledrhizosphereofeachcropatsevenplotareas.Eachreplicatewasacompositeoftwoorthreeplantclosetothesamplingspot.Rootswereshakenvigorouslytoremovelooselyattachedsoil,thencutofffromstemandplacedinplasticbags4de-greesC.Soilcloselyattachedtoroots(<1mm)werecollectedandwashedoffrootsasrhizospheresoilinlab(Fig.4.1).Soilleftweresendtomeasuresoilchemistry.DNAwasextractedwithPow-erSoilDNAIsolationKitfromMOBIO,USAasdescribedin[70]andthensenttoJointGenomeInstitute(JGI)forIlluminaHiSeq2500sequencing(250bpinsertlibrariesand2x150bpreads).4.3.2DatapreprocessingRawIlluminareadswerequalitytrimmedusingfastq-mcf(version1.04.662,http://code.google.com/p/ea-utils)withfi-l50-q30-w4-x10Œmax-ns0-Xfl.ThenpairedendsweremergedbyFLASH(version1.2.7)[71]usingfi-m10-M120-x0.20-r140-f250-s25fl[6].67Figure4.1:Roots(Miscanthus)andrhizospheresoil.Rhizospheresoiliscollectedbywashingsoilfromtheroots.Theimageshowsthatsoilisveryclose(<1mm)tosmallroots.684.3.3DiversityanalysisusingSSUrRNAgeneSSUrRNAgenefragmentsinshotgunmetagenomedatawereandanOTUtablewasmadeusingfragmentsalignedtoa150bpregionofV4(Escherichiacoliposition577to727)bytheSSUsearchpipeline[6]followingatutorialinhttps://github.com/dib-lab/SSUsearch.Ordination,richnessanddiversityanalyseswerecalculatedbyphyloseq[136],simba[137]andvegan[138]packagesinR[139].4.3.4Globalmetagenomeassembly,annotationandfunctionaldiversityPreprocessedreadsofeachcropwerepooledandprocessedwithdigitalnormalization(-k20-C10-N4-x2.14e11)andpartitioning(-k20-N4-x6e11)pipeline[7]follow-ingtutorialsinhttp://nbviewer.jupyter.org/github/ngs-docs/ngs-notebooks/blob/master/ngs-70-hmp-diginorm.ipynbandhttp://nbviewer.jupyter.org/github/ngs-docs/ngs-notebooks/blob/master/ngs-71-hmp-partition.ipynbusingkhmer(version0.6.1).Thepartitioneddatawerethenassembledwithvelvet(version1.2.08)[11]andSGA(ver-sion0.9.35)[12].Forvelvet,kmersizeof29,39,49,59and69wereusedandnalassemblywasmergedwithSGA.ForSGA,overlapsizeof29,39,49,59,and69wereusedandassemblieswereclusteredusingcd-hit-est(-c0.99)inCD-HIT[140].Contigsshorterthan300bpweredis-carded.SincevelvetandSGAproducedverysimilarassemblyintermsoftotalsizeandlongestcontigs,theassemblyfromSGAwaschosenforfurtheranalysis.TheassemblywasthenuploadedtoMG-RAST[65]forgenecallingandannotation.Genecall-ing(withsuf.genecalling.coding.faa),clustering(withsuf.cluster.aa90.mapping),andBLAT[84]tabularoutput(withsuf.superblat.sims)weredownloaded.BLAThitswithalignmentlongerthan30aminoacids,identityhigherthan70%,andE-valuelowerthan0.00001wereusedfordownstreamanalysis.NumberofhitstoeachreferenceinMG-RASTM5NRdatabaseforeachsamplewassummarizedbasedonBLAThits,genecallingIDs,andclus-teringinfo.ThenM5NRIDswereconvertedintosubsystemIDsandfurthersubsystemontologiesusingMG-RASTAPI(http://api.metagenomics.anl.gov/api.html).Acounttableofsub-69systemontologiesacrosseachsamplewasmade.Level3andLevel1subsystemswereusedforor-dinationanalysiswithRphyloseqpackage[136].Forenrichmentanalysis,yinQIIME[35]wasusedtolevel3subsystems(pathways)thatweredifferentbe-tweentwocrops(p<0.05,n=7)andthosepathwayswithmorethan10totalcountsinallsamplesandafoldchangelessthan2/3orlargerthan3/2weretreatedas4.3.5NcyclegeneassemblyanddiversityPreprocessedreadsofthesamecropwerecombinedandusedasinputtoXander(genetargetedmetagenomicassembler)[36]withMAX_JVM_HEAP=300G,FILTER_SIZE=40,K_SIZE=45,genes=finifHamoA_AOAamoA_AOBnirKnirSnorB_cNornorB_qNornosZnosZ_a2rplBflfollowingtutorialinhttps://github.com/rdpstaff/Xander_assembler.Outputwithtaxonabundance(fi*_taxonabund.txtfl)wereusedfortaxonomy-basedanalysis.TheabundanceofeachNcyclegenewasnormalizedbythenumberofrplBgenes(singlecopygene,per10,000).ForOTU-basedanalysis,assembledgenesequenceswereclusteredbyMcClust[23].MeankmercoverageofeachassembledsequenceineachsamplewascalculatedbyKmerFilter.jarinXanderandthenwasusedtoadjusttheabundanceofOTUsinthesample.DiversityanalysiswasdonebyphyloseqandveganpackageinR.AtypicalnosZ(cladeII)weredividedintotwogroups(nosZ_a1andnosZ_a2)duetolargesequencevariationwithinthegroup.However,nosZ_a1werenotyetincludedinXander.4.3.6DataaccessionThescriptsusedtoanalyzethedataareavailableathttps://github.com/jiarong/2015-glbrc/tree/master/scripts.RawreadsandassembledcontigscanbeaccessedonJGIportalandMG-RASTrespectively(Table4.1).70Table4.1:SizeandJGIprojectIDofrawreaddatafor21samples.SampleJGIprojectIDDatasizeC1102376446C2102376739C3102377057C4102377353C5102377657C6102377951C7102378246M1102378556M2102378857M3102379150M4102379442M5102379745M6102380043M71018623,101861132S1102380340S21018626,101861428S3102380644S4102380949S51018629,101861727S6102381250S71023815394.4Results4.4.1SSUrRNAgenebasedanalysisTheaveragereadnumberafterqualitytrimmingandpairedendassemblywasabout228millions(rangefrom140millionsto285millions)foreachreplicate.Thereadssizesrangedfrom50bpto290bp.Anaverageof0.039%ofdatawereasSSUrRNAgenefragments(Table4.2).ThenumberofSSUrRNAgenefragmentsalignedto150bpofV4regionwas9.4%onaverage(rangefrom8.3%to10.0%).Allsamplesweresubsampledtothesamenumber(4598)ofSSUrRNAgenefragmentsandatotalof12841OTUswereclustered.TheOTUnumberineachreplicatewas2164onaverage(rangefrom1707to2355).OrdinationanalysisusingOTUsclusteredfromfragmentsalignedtoV4regionofSSUrRNA71Table4.2:SummaryofSSUsearchpipelineresultsfor21samples.TotalReadsTotalSSU%SSUV4OTUsChao1ShannonV4%C12322340951171600.05%11615170740136.69.90%C2220844998954930.04%81372014503778.50%C32821233351192100.04%115792009475979.70%C42605426621071140.04%10295214651317.29.60%C52858732321433020.05%13520190743656.99.40%C62504776171245980.05%12487203352286.910.00%C72629439301141830.04%94492016481878.30%M12740609251020490.04%9693231554567.49.50%M22782788681004980.04%9767232958187.39.70%M3244772969926240.04%8963224553587.39.70%M4206129129695350.03%6474228949847.39.30%M5225964704747780.03%6892221348607.39.20%M6215320045730620.03%6800228452587.39.30%M7160726636712490.04%6580229055067.39.20%S1192776425680800.04%6481227852057.39.50%S2143660127572290.04%5469233654267.39.60%S3218743879763680.04%7210235552087.49.40%S4249773480784220.03%7215227853647.39.20%S5140239637499620.04%4598208646397.19.20%S6254749228821050.03%7282215448887.28.90%S7194863138680940.04%6504215647117.29.60%geneshowedthatperennials(Miscanthusandswitchgrass)hadsdifferentcommunitystructurefromtheannual(corn)(ADONIS:p<0.001,R2=44.4%)(Fig.4.2A).Cornalsohadgreaterdispersionamongreplicatesthantheperennials(Fig.4.2B).Consistently,relativeabun-danceofsharedOTUandgenerawerehighestbetweenMiscanthusandswitchgrass(Table4.3).Thethreecropsshared21.5%oftheirOTUs,but73.1%whenabundanceofeachOTUwascon-sidered.Atgenuslevel,theyshared55.2%ofgenera,but98.1%whenabundanceofeachgenuswasconsidered.Microbialcommunitiesintherhizosphereoftheperennialshadhigherspeciesrichnessanddiversitythandidcornasmeasuredbybothalphadiversity(individualsample)andgammadiver-sity(samplesmergedbyplant)(Figs.4.2and4.17).Communitymembersinperennialswerealsomoreevenlydistributedthanincorn(Fig.4.3).CornhadmoreProteobacteriainrhizospherewhileperennialshadmoreAcidobacteria(Fig.4.4).72Table4.3:SharedOTUs,genera,andcontigamongrhizospheremicrobialcommunitiesofcorn(C),Miscanthus(M),andswitchgrass(S).MiscanthusandswitchgrasshavemoresimilaritytoeachotherthantocorninOTU,genus,andcontig.OTUGenusContigCandM0.340.80.17CandS0.330.80.17MandS0.40.820.26Figure4.2:Species(OTU)andfunctionaldiversitycomparsionamongthreecrops.Theseab-breviationsforthethreecrops,C(corn),M(Miscanthus),andS(switchgrass),areusedthroughoutthepaper.CornhasadifferentcommunitycompositionfromMiscanthusandswitchgrassinbothOTU(A)andfunction(level3subsystems)(subplotC),andcornhaslowerdiversityinbothOTU(B)andfunction(D).Thesymbolattopeachgraphindicatesanceofdifference(fi***flindicatesp<0.001;fi**flindicatesp<0.01;fi*flindicatesp<0.05).73Figure4.3:Rankabundancecurveofmicrobialcommunitiesofcorn(C),Miscanthus(M)andswitchgrass(S)usingOTUfromSSUrRNAgene.Miscanthusandswitchgrasshavemoreevenmicrobialcommunityinrhizospherethancorn.TopthreegeneraareSphingomonas(7.5%,Proteobacteria),Penicillium(3.5%,fungi),Burkholde-ria(3.4%,Proteobacteria)incorn,Da023(5.0%,Acidobacteria),Akiw543(4.1%,Actinobacte-ria),Sphingomonas(3.1%,Proteobacteria)inMiscanthus,andDa023(6%,Acidobacteria),Sph-ingomonas(4.0%,Proteobacteria),andAkiw543(3.2%,Actinobacteria).Bacteria,Archaea,andEukaryotawere91.9%,0.7%,and7.2%respectivelyonaverageacrossallsamples(Table4.4).Corn(4.9%)hadmorefungithanMiscanthus(2.2%)andswitchgrass(2.0%).However,rela-tiveabundanceofAMF(arbuscularmycorrhizalfungi)waslowerincorn(0.1%)thanMiscanthus(0.3%)andswitchgrass(0.4%).74Figure4.4:Relativeabundanceoftoptenmostabundantbacterialphylainthreecrops.Corn(C)isenrichedinProteobacteria(p<0.01,n=7),andMiscanthus(M)andswitchgrass(S)areenrichedinAcidobacteria(p<0.001,n=7).AverageofProteobacteriaincorn,Miscanthus,andswitchgrassare0.387,0.282,and0.312respectively.AverageofActinobacteriaincorn,Miscant-hus,andswitchgrassare0.050,0.101,and0.117respectively.AverageofAcidobacteriaincorn,Miscanthus,andswitchgrassare0.050,0.101,and0.117respectively.Table4.4:AveragepercentageofBacteria,Archaea,Eukaryota,andfungiinrhizospherecommunitiesofcorn(C),Miscanthus(M)andswitchgrass(S).fiAMF/fungiflispercentageofAMF(arbuscularmycorrhizalfungi)oftotalfungi.CornhasmorefungiinthewholecommunitybuthaslesspercentageofAMFwithinfungi.BacArcEukFungiAMF/fungiC91.90%0.70%7.20%4.90%0.10%M94.10%1.20%4.50%2.20%0.30%S95.00%0.90%4.00%2.00%0.40%754.4.2GlobalassemblyWeappliedthedigitalnormalizationandpartitioningmethod[7]toassemblepooleddata(300Gb)foreachplant.Theaveragecomputationtimewas1510CPUhours(63daysintotal,Diginorm:71h,Filterbelow:25h,Loadgraph:1166h,Partgraph:205h,Mergegraph:8h,Annograph:27h,Extrgraph:8h).Memoryusagevariedindifferencesteps.Thepeakmemoriesusedwas1TBindigitalnormalization,whichproducedanaveragefalsepositiverateoflessthan0.01ink-mercounting(falsepositiveratesbelow20%areacceptablefordigitalnormalization)[134].Ouras-semblymethodproducedassemblysizesrangingfrom5.3to6.4Gbwithminimumcontigsizecutoffof300bp,andreadmappingbackrates(suggestingpercentageoftrimmedreaddataas-sembled)weresimilaramongthethreecrops(Table4.5).Wealsocomparedthreecropsbasedonsequencesimilarityamongassembliesandfoundcornwasthemostdifferentamongthreebiofuelcrops(Table4.3).Table4.5:Globalassemblystatistics.Mapping%isthepercentageofreadsmappedtocontigs.Cstandsforcorn,MstandsforMiscanthus,andSstandsforswitchgrass.Contigno.(M)Total(Gbp)Max(Kbp)Median(bp)Mapping%C12.96.42140020.7M12.9615.739119.1S11.45.31639120.34.4.3FunctionaldiversityusingannotationfromassemblyAftertheassembledsequenceswereannotatedbyMG-RAST[65],weusedfiSubsystemflannota-tionforfurtheranalysis.Ordinationanalysiswithbothsubsystemlevel1and3showedcornwasdifferentfromtheperennialsandnodifferencebetweenMiscanthusandswitchgrass(ADONIS:p<0.01).Thetotalvariationexplainedbycropswas74.7%and77.2%atlevel1and3subsystemsrespectively(Figs.4.2and4.5).Theperennialshadhigherinfunctionaldiversity(Shannon)atbothlevels(p<0.001,n=7).Sincetheaboveordinationanalysisshowedcornwasthemostdifferentandperennialswere76Figure4.5:Functionaldiversitycomparisonamongthreecropsusingsubsystemlevel1.CornhasadifferentcommunityfunctionalcompositionfromMiscanthusandswitchgrass(level1sub-system)(A),andcornhaslowerdiversity(B).Cstandsforcorn,MstandsforMiscanthus,andSstandsforswitchgrass.Symbolsattopeachsubplotindicatesofdifference(fi***flindicatesp<0.001;fi**flindicatesp<0.01;fi*flindicatesp<0.05).similar,wepickedMiscanthusasarepresentativeofperennialsandfocusedonourcomparisonbetweencornandMiscanthus.Wefocusedonlevel1andlevel3subsystemsforenrichmentanal-ysissincelevel1shouldgiveusanoverviewoffunctionsandlevel3shouldshowpathwaylevelinformation.Usinglevel1subsystems,wefoundtherhizosphereofcornwasenrichedinfiIronac-quisitionandmetabolismfl,fiMembraneTransportfl,fiCellWallandCapsulefl,fiPhages,Prophages,Transposableelements,Plasmidsfl,fiMotilityandChemotaxisfl,andfiCarbohydratefl,whileMis-canthuswasenrichedinfiAminoAcidsandDerivativesfl,fiCellDivisionandCellCyclefl,fiNucle-osidesandNucleotidesfl,fiProteinMetabolismfl,fiRNAMetabolismfl,fiSecondaryMetabolismfl,fiPotassiummetabolismfl,fiRespirationfl,fiPhotosynthesisfl,fiRegulationandCellsignalingfl,andfiStressResponsefl(FDRadjustedp<0.05,n=7).Atlevel3,wefoundcornhadmorepathwaysenrichedbelongingtolevel1subsystemsoffiStressResponsefl,fiCarbohydratefl,andfiCellWall/Capsulefl,whileMiscanthushadmorepathwaysenrichedinfiAminoAcidsandDerivativesfl,andfiRespirationfl(Fig.4.6).WithinfiStressresponsefl,fiDesiccationstressflandfiOxidativestressflwerehigherincorn,whilefiColdshockflandfiOsmoticstressflwerehigherinMiscant-hus.WithinfiCarbohydratesfl,fiSugaralcoholsfl,fiAminosugarsfl,andfiMonosaccharidesflwereen-richedincorn,whilefiCO2fiCentralcarbohydratemetabolismfl,andfiPolysaccharidesfl77wereenrichedinMiscanthus.WithinfiPhages,Prophages,Transposableelements,Plasmidsfl,fiGeneTransferAgent(GTA)fl,fiPlasmidrelatedfunctionsfl,andfiPhagesandprophagesflwereenrichedincorn,whilefiTransposableelementsflwereenrichedinMiscanthus.WithinfiNitrogenMetabolismfl,fiNitrilaseflandfiCyanatehydrosisflwereenrichedincorn,whilefiDissimilatoryni-tratereductasefl,andfiAllantoinUtilizationfl,fiNitrogenFixationflwereenrichedinMiscanthus(FDRadjustedp<0.05,n=7).4.4.4NcyclegenediversityAlltheNcyclegenesexceptAOAshoweddifferentamongthethreecrops(Fig.4.7).Intermsofrichnessanddiversity,nosZ,nosZa2,cNor,andqNorshoweddifferenceamongthreecrops(Figs.4.8and4.9.Fortheabovefourgenes,theirquantitieswerehigherinMiscanthusandswitchgrassthanincorn(p<0.05,n=7).4.4.5NcyclegeneabundanceandtaxonomycompositionWeusedXander[36]toassembletheNcyclegenesandfortheorganismtaxonomyoftheirnearestreferences.Figure4.12showsNcyclegenesabundances,includingnitrogengenes(nifH),genes(AOAandAOB),andgenes(nirK,nirS,nosZ,nosZa2,qNor,andcNor)normalizedbyrplBabundance.WefoundnifH,AOB,nirK,nosZa2,andqNorweredifferentbetweencornandMiscanthus.Additionally,nirK,nosZandqNorweredifferentbetweencornandswitchgrass.AOAandnosZweredifferentbetweenMiscanthusandswitchgrass(p<0.05,n=7).Whencombiningthegenescodingenzymeswiththesamefunction,nitritereductasegenes(nirKandnirS)aremoreabundantincorn(Cvs.M:p=0.003,Cvs.S:p=0.05,onewayttestacrossthethreecrops:p<0.01),butnitricoxidereductasegenes(qNorandcNor)andnitrousoxidereductasegenes(nosZandnosZa2)aremoreabundantintheperennials(finorBflCvs.M:p=0.001,finorBflCvs.S:p=0.0004,finorBflonewayttestacrossthethreecrops:p<0.01,finosZallflCvs.M:p=0.0003,finosZallflCvs.S:p=0.1,finosZallflonewayttest:78Figure4.6:FunctioncomparisonbetweencornandMiscanthus.Cornhadmorepathwaysenrichedbelongingtolevel1subsystemsoffiStressResponsefl,fiCarbohydratefl,andfiCellWall/Capsulefl,whileMiscanthushadmorepathwaysenrichedinfiAminoAcidsandDerivativesfl,andfiRespirationfl.SignifcanceofenrichmentisdonebyKruskal-Wallistest.Thoselevel3pathwayswithFDR(FalseDiscoveryRate)adjustp<0.05,totalcountlargerthan10andratioofcornoverMiscanthus(inlog2space)largerthan1.5orsmallerthan-1.5arechosenas79Figure4.7:OrdinationanalysisofeachNcyclegeneandrplBusingOTU.AllgenesexceptAOAshowseparationofsamplesbycorn(C),Miscanthus(M)andswitchgrass(S).OTUsofeachgeneareclusteredat5%distancecutoffusingproteinsequence.p<0.01)(Fig.4.10).Theratioofxation(amoA/nifH)and((nirK+nirS)/amoA,(qNor+cNor)/amoA,and(nosZ+nosZa2)/amoA)werenotdiffer-entata=0:05level.However,themedianofamoA/nifHwasthehighestincornandthemediansof(nirK+nirS)/amoA,(cNor+qNor)/amoA,and(nosZ+nosZa2)/amoAwerehighestinswitchgrass(Fig.4.11).Wealsofoundaconsistentpatternacrossallsamplesforthosegenepairscodingenzymeswithsamefunction:AOA>AOB,nirK>nirS,nosZa2>nosZ,qNor>cNor(Figs.4.11and4.12).Moreover,nitgene(AOAandAOB)andgene(nirKandnirS,qNorandcNor,ornosZandnosZa2)abundanceswerehigherthannitrogengenes(nifH),anddeni-geneabundances(nirKandnirS,qNorandcNor,ornosZandnosZa2)werehigherthan80Figure4.8:OTUnumbercomparisonamongcorn(C),Miscanthus(M)andswitchgrass(S)usingOTUtablesofNcyclegenes.YaxisofeachplotshowsOTUnumber.Symbolsattopeachsubplotindicateofdifference(fi***flindicatesp<0.001;fi**flindicatesp<0.01;fi*flindicatesp<0.05,fi.flindicatesp<0.1).genes(AOAandAOB)(Figs.4.11and4.12).WethenlookedatdetailedgenusleveldistributionsofnifH.Therewere7,9,and5generaincorn,Miscanthus,andswitchgrass,respectively(Fig.4.13).Thetaxonomiccompositionsofthecommunitiesweredifferentamongthethreecrops.Onaverage,Rhizobium,Bradyrhizobium,andMethylobacteriumwerethemostabundantincorn,Coraliomargarita,Azospirillum,andNostocwerethemostabundantinMiscanthus,andNovosphigobium,Gluconacetobacter,andAzospiril-lumwerethemostabundantinswitchgrass(Fig.4.13).SinceXanderassignedthetaxonomybasedonbesthittoreferencesequences,someassembledsequencesmightbequitedifferentfromitsbest-hitreferencesequenceandthusdistantfromits81Figure4.9:Diversity(Shannon)comparisonamongcorn(C),Miscanthus(M)andswitchgrass(S)usingOTUtablesofNcyclegenes.Symbolsattopeachsubplotindicateofdifference(fi***flindicatesp<0.001;fi**flindicatesp<0.01;fi*flindicatesp<0.05,fi.flindicatesp<0.1).assignedtaxon.Thusweexaminedthesimilaritydistributionofourassembledsequencestorefer-ences.Identitiesofassembledsequencetoreferencesrangefrom79.3%to100%fornifH,93.5%to97.5%forAOA,95.6%to98.9%forAOB,58.5%to97.3%fornirK,68.5%to94.8%fornirS,65.7%to99.2%fornosZ,54.7%to97.7%fornosZa2,71.1%to98.4%forcNor,and51.9%to98.3%forqNor(Fig.4.14).FornifH,themostabundantgeneraincorn,Miscanthus,andswitch-grasshasassembledsequenceswithlowestidentityof89.5%(Rhizobium),83.1%(Coraliomar-garita),92.4%(Novosphingobium),respectively(Fig.4.15).82Figure4.10:geneabundancesaftercombiningthegenesencodingenzymeswiththesamefunction.fiNirKSfl=nirKandnirS.fiNorBfl=cNorandqNor.fiNosZallfl=nosZandnosZa2.Eachgene'sabundanceisnormalizedrelativetorplBcopies(per10,000).ThreecropsaredifferentinabundancesoffinirKSfl,finorBfl,andfinosZallfl(p<0.05,n=7).CornishigherthanperennialsinabundanceoffinirKSflbutlowerinabundanceoffinorBflandfinosZallfl.Figure4.11:RatiosofinterestrelatedtoNcyclegenesincorn(C),Miscanthus(M)andswitch-grass(S).fiAmoAflisthesumofAOAandAOB;finirKSflisthesumofnirKandnirS;finorBflisthesumofcNorandqNor;finosZallflisthesumofnosZandnosZa2.Eachratioisgroupedbyplant(7replicates).83Figure4.12:AbundanceandphylumlevelassociationoftheindicatedNcyclegenesincorn(C),Miscanthus(M)andswitchgrass(S).GeneabundancesarenormalizedrelativetorplBcopies(per10,000).TherearedifferencesamongthreecropsinnirK,nosZ,nosZa2andqNor(p<0.05,n=7).Additionally,pairwisettestshowsMiscanthusishigherthancorninnifHandAOB(p<0.05,n=7).4.4.6MetadataSoilchemistryanalysesshowedhigherNO3-andNH4+incornandnodifferencesinSOM(SoilOrganicMatter)andpHinsoilfromthree(Table4.6).Foldchangeofyieldin2012over2011was(p<0.01,n=5)lowerincorn(moreaffectedbydrought)thanperennials(Fig.4.16A).N2Owashigherincornonaveragebutnotata=0:05levelduetolargevariationincornandnegativevaluescausedbylowatsamplingseason(Fig.4.16B).4.5DiscussionInthisstudy,weapplyshotgunmetagenomicstoinvestigaterhizospheremicrobialcommunitystructureandfunctionofthreebiofuelcrops-corn,Miscanthus,andswitchgrass-withafocuson84Figure4.13:AbundanceandgenusleveldistributionofnifH.Yaxisisgeneabundancenormal-izedrelativetorplBcopies(per10,000).Xaxisissamplelabelsinwhichletterstandsforcrop(Cforcorn,MforMiscanthus,andSforswitchgrass).Threecropsrhizospheresarequitedifferentingenerawiththeclosestsequencematch.Onaverage,Rhizobium,Bradyrhizobium,andMethylobacterium-likesequencesarethemostabundantincorn,Coraliomargarita,Azospiril-lum,andNostocthemostabundantinMiscanthus,andNovosphigobium,Gluconacetobacter,andAzospirillumthemostabundantinswitchgrass.Ncyclegenes.Wecornisthemostdifferentamongthethreecrops,incommunitystructure(4.2A),assembledgenomicsequences(Table4.3),andfunctional(4.2C).Itisexpectedthatdifferentplantshavedifferentrhizospheremicrobialcommunities[124,125],butourcompar-ativemetagenomicanalysesofassembliesandfunctionalprovidesinformationbeyondtheexistingstudiesrelyingon16SrRNAorafewotherfunctionalgenes[125,128].Metagenomeassemblycouldrepresentgenomiccontentofcommunitiesandthusgeneticdiversity.Speciesdiversity,geneticdiversity,andfunctionaldiversityallshowcornhasdifferentrhizo-spheremicrobialcommunitythanperennials(Fig.4.2AandC,Table4.3),suggestingthatlargescalebiofuelcroppingcanimpactmicrobialcommunitiesandtheirassociatedecosystemservices.Thecornrhizospherecommunityalsoshowsmorevariationamongreplicatesinordinationplot85Figure4.14:SequenceidentityofassembledNcyclegenestotheirbest-hitreferences.Percent-ageidentities(Yaxis)aregroupedbyplant(Cforcorn,MforMiscanthus,andSforswitchgrass).Theloweridentiyis,themorelikelyaassembledsequenceisincorrectly(Fig.4.2A).Incontrast,communityturnoverwithinreplicates(betadiversity)basedonnumberofuniqueOTUsissimilaramongthethreecrops(Fig.4.17),sothelargerdispersionincornisprobablycausedbyvariationinabundanceofsharedOTUs.Therearetwopossibleexplanations:First,cornisannualandgrowsfromseedsandestablishesanewrootsystemeveryyear,whileperennials(Miscanthusandswitchgrass)havemorestablerootsystemsandprobablyalessvariedcommunityofvariedrhizospherecommunity.Second,cornhasbeenbredforgrainyieldunderintensivemanagement(e.g.fertilization)andsometraitsforrecruitingamicrobiomemayhavebeenlost.Thuslargervariationofcornrhizospherecommunitymaysuggestweakerandselectionfromcornroots.Inspiteofthelargerdispersionincorn,perennialshavehigheralpha(local)diversityand86Figure4.15:DistributionofidentityofassemblednifHsequencestotheirbest-hitrefer-ences.Percentidentities(Yaxis)aregroupedbygenera.Assemblieswithbest-hitreferencefromCoraliomargaritaandFrankiahaspercentageidentitylowerthanothergenera(<85%),indicatesthesesequencesaremorelikelytobeincorrectlyFigure4.16:Ratioof2012yieldover2011andN2Omeasurednearsamplingtime.Ratioof2012yieldover2011ofcornislylowerthanMiscanthusandswitchgrass,suggestingcornyieldisimpactedmoreby2012drought(A).AverageofN2OishigherincornthanMiscanthusandswitchgrass,butnotata=0:05levelduetohighvariationincorn(B).87Table4.6:Soilchemistrytestresultsfor21samples.SampleIDNO3NH4%OMZnMnCuFepHppmppmppmppmppmppmC17.83.03.12.747.02.136.36.1C21.73.42.81.936.01.528.75.7C311.55.53.22.746.12.829.16.1C44.73.03.32.942.52.622.36.2C518.36.03.92.950.52.022.26.1C64.65.53.93.138.82.627.46.2C76.43.93.52.840.63.127.66.2S11.23.63.62.332.02.121.56.2S23.43.13.52.533.91.720.36.4S32.73.14.02.934.22.725.06.2S43.84.04.02.758.23.020.46.2S51.12.94.93.156.33.122.96.1S62.92.93.51.922.72.831.26.5S71.13.13.62.227.02.328.16.1M17.34.23.72.658.53.123.56.1M22.44.93.51.834.53.625.05.8M33.63.33.72.443.42.220.86.0M41.53.23.93.148.63.125.56.0M53.13.73.82.730.53.122.36.2M62.14.84.33.551.93.323.86.2M71.63.73.42.330.12.722.96.5alsogammadiversity(whenreplicatesarecombinedbyplant)(Fig.4.17).Thuschangefromannualtoperennialsasbiofuelcropscanincreasetheoverallmicrobialdiversity,whichcouldalsohaveaimpactontheecosystem.Accordingtoastudyondisturbanceanddiversitymodels[141],evennessshouldincreasewhendisturbanceisintroduced.Cornhastore-growrootseveryyear,whichcouldbeconsideredasaregulardisturbance.Theannuallifecycleofcornshouldencourageamoreevencommunity,whichcontradictstoourresultthatcornislesseventhanperennials.Thusthelowerevennessofcornrhizospherecommunityshouldbeattributedtorootexudates,rootdetritus(especiallyforcorn),oragriculturalmanagement(e.g.fertilization),whicharefurtherdiscussedbelow.Consistentwithdiversity,therearealsodifferencesintaxonomiccompositionbe-tweencornandtheothertwoperennialcrops.Proteobacteria,whichwasenrichedincorn,arecopi-88Figure4.17:Species(OTU)diversityofthreecrops.MiscanthusandswitchgrassarehigherinbothobservedOTUnumber(q=0forfitrudivflfunctioninsimbapackageinR)(A)andanotherdiversityindex(B,q=1forfitrudivflfunction)(B).Cstandsforcorn,MstandsforMiscanthusandSstandsforswitchgrass.Alpha(diversity)referstodiversityofeachreplicate,gamma(diversity)referstodiversitywhenallreplicatesofeachcroparecombined,andbeta(diversity)isratioofgammaoveralpha.otrophictaxaandAcidobacteria,whichwasenrichedinperennialsareoligotrophictaxa[142,143].IncreasedNinputcancauseashiftfromoligotrophictocopiotrophictaxa[144,145].ThusmorecopiotrophictaxaincornareconsistentwithhigherNfertilizationandhigherNfoundinitsrhizo-spheresoil(Table4.6).Further,sincecopiotrophsneedmorecarbonaswellnitrogen,wepredictthatcornisalsoprovidingmorecarbonsource(rootexudateanddetritus)comparedtoperennials,whichisconsistentwithmorefiCarbohydratesflrelatedpathwaygenesenrichedincornrhizosphereasshowninthefunctionaldiversityanalysis(Fig.4.6).Sincecopiotrophshavearelativelyfastergrowthrateinnutrientrichenvironment[144],theyoutnumberedtheremainder,whichexplainsthelowerevennessincornmentionedabove(Fig.4.3).Additionally,thehigherfungal-to-bacterialratioincornalsosuggestscornprovidesmorecar-bonsourcetotherhizospherecomparedtoperennialsandencouragescopiotrophicbacteria,whichisconsistentwiththeabove.Higherfungal-to-bacterialratiocommonlyindicateshigherC/NratiosincefungalbiomasshashigherC/Nratiothanbacteria[94,95].Duetothefactthatfungihas89largerbiomasstoDNAratiothanbacteria(somefungalhyphaemaynothavenuclei)andfun-galDNAarehardertoextract[146],theratios(DNAbased)herearemuchsmallerthanreportedinotherstudiesusingbiomass[70].Moreover,Penicillium,acommonlysaprotrophbuthardlyknownastoplants(tothebestofourknowledge),ishighlyenrichedincorn(thethirdmostabundantgenus),whichsuggeststhatcornprovidesmorecarbonsourcetorhizospherebutnoteffectivelyrecruitingmembers(selectingonesinthiscase).ThereismorefiStressresponsesflrelatedpathwaygenesenrichedincorn,whichindicatesrhizospherecommunityofcornisundermorestress.Especially,theenrichedarefiDesiccationstressflrelatedgeneswhichisconsistentwith2012beingadroughtyearandthecornyieldwasaffectedbythedroughtthemost(Fig.4.16A).Moreover,highernumberofphageandprophagerelatedgenesincornrhizospherealsoindicatesthatcornrhizospherecommunityisathigherriskofphageinfectionandthusunstable,consistentwithlowerevennessanddiversityofcornrhizospherecommunitycomparedtoperennials.OurglobalassemblyandannotationpipelineisnotsuitableforevaluatingNcyclegenes,sincetheannotationdatabasethatweuse,SEEDsubsystem[65],doesnotincludeanygenes.Moreover,Xanderprovidesbettersensitivityand,andrecoverslongerassemblies[36].OTUbasedordinationanalysesofallNcyclegenesexceptAOAshowstheseparationofallthreecrops(Fig.4.7),whichisthesameasSSUrRNAgeneandfunctional(Fig.4.2AandC)andsuggeststhatnitrogennitrifying,anddenitrifyingcommunityarealsochangedwhencropsareswitchedfromcorntoMiscanthusandswitchgrass,consistentwiththeoverallcommunity,exceptthattheseparationofMiscanthusandswitchgrassbecomesmorePerennialshasmorenifHthancornonaverageforMiscanthusbutnotforswitchgrassata=0:05level),suggestingperennialshavemorenitrogenmicrobes,whichcouldexplainwhyMiscanthusandswitchgrassgrowthdoesnotrespondtonitrogenfertil-ization[127,129].OurcollaboratorsinKBS(whereoursampleswerecollected)switchgrasshasnotrespondedtofertilizationafter7years(S.Roley,unpublisheddata).ThisisalsoconsistentwithanotherstudyusingqPCR[128].Threecropsareverydifferentintaxawithnearestse-90quencematchwithRhizobium-likesequencesasthemostabundantincorn,Coraliomargarita-likesequencesasthemostabundantinMiscanthus,andNovosphingobium-likesequencesasthemostabundantinswitchgrass(Fig.4.13),whichsuggestseachcrophasselecteddifferentnitrogenmembers,consistentwithordinationanalysisusingOTUs(Fig.4.7).Sincetheseplotswereplantedwithsoybeanandalfalfabeforetheexperimentsitewasestablished,RhizobiumandBradyrhizo-bium,alsoenrichedincornsamples,werepresentbeforethebiofuelcropswereplanted.Ontheotherhand,perennialscanselectedmicrobessuchasCoraliomargarita,Azospirillum,Nostoc,Novosphingobium,andGluconacetobacter,notRhizobia.Additionally,notethatthereisalotofvariationintaxonomiccompositioninreplicates(Fig.4.13),butclearlytherearedifferentgenerathatstandoutforallthreecropsthoughweakinareplicateortwo,whichunderscorestheimportanceofreplicationinsoilstudies.Twonitrifyinggroups(AOAandAOB)bothhaveverylowdiversity(threeOTUs)comparedtoothergenes(Figs.4.8and4.9).TheAOBcommunitiesweredifferentamongthethreecrops,whileAOAwerenot(Fig.4.7),indicatingthatAOBcommunitycompositionweremoreresponsivetocrop,consistentwithastudyusingampliconmethods[147,148].Further,theresultthatcornwithaddedNfertilizerdoesnothavethehighestabundanceofnitrifyingmembers(Fig.4.12)suggestsabundancedoesnotpositivelyrespondtoNfertilizer,whichmaybeexplainedbydroughtthatcausedtoolowsoilmoistureforConversely,theperennialswerelessaffectedbythedroughtduetotheirdeeperandlargerrootsystems.hasmultiplesteps(nitritereduction,nitricoxidereduction,andnitrousoxidere-ductionincludedouranalysis)andeachstephasmorethanonegenecodingenzymeswiththesamefunction,whichmakesitdiftodrawfunctionalconclusionswhencomparingthethreecrops.Forexample,nirKiscantlyhigherincornbutnirSisnotdifferentamongthethreecrops.Thuswecombinedthosegenescodingenzymeswiththesamefunction,whichleadstothethattherearemorenitritereductasegenesbutlessnitricoxidereductasegenesandnitrousoxidereductasegenesincorn(Fig.4.10).Sincenitritereductionisconsideredastheratelimitingstepin[131]andtheabundanceofnitritereductiongenesare91lowerthandownstream(Fig.4.10),cornhasselectedahigheroverallpotential,whichisconsistentwithmoreNfertilizerinputincorn(highersubstrateavailability)andhigheraverageN2Oincorn(Fig.4.16B)[149].Otherthancommunitystructureandfunctioncomparisonsamongthethreecrops,wealsohavesomegeneralaboutrhizospheresoilatoursamplingsite.First,ammonia-oxidizingarchaeawereaboutthreetimesmoreabundantthanammonia-oxidizingbacteriaonaverage,con-sistentwithotherstudiesonsoilwithsimilarconditions[130,150,151].BacteriawithnirKtypenitritereductaseareaboutninetimesmoreabundantthanthosewithnirS.BacteriawithqNortypenitricoxidereductaseare14timesmoreabundantthanthosewithcNor.BacteriawithnosZa2arefourtimesmoreabundantthannosZ(Fig.4.11).Theaboveratiosalsosuggestthattraditionalampliconstudiesthatonlytargetonegeneinthoseprocessesmightyieldmisleadingconclusions,e.g.nosZofnitrousreductionwashigherincornbutnosZa2washigherinMiscanthusandswitch-grass.Second,nitrogenbacteriawereabout0.4%oftotalcommunityonaverage;wereabout1.4%onaverage;wereabout10.6%,22.4%and15.6%oftotalcommunityfornitritereduction,nitricoxidereduction,andnitrousoxidereduction,respectively(Figs.4.10and4.12).Boththeaboveratiosandabundancearevaluableinformationforfuturestudiesofthissamplingsite.AnothercontributionofthisstudyisthatwefulllengthNcyclegenesthataredistantfromexistingreferences(Fig.4.14),whichcanhelpimproveprimerdesignforthesegenes.Further,sincetaxonassignmentisdonebybesthitreferencesequence,taxonomyinformationoftheas-sembledNcyclegenemaybenotcorrect(especiallyforthosewithlowidentity),butsuggeststhesegenesareintaxawhosegenomeshavenotyetbeensequenced(orperhapsthestrainsnotyetisolated).ItisalsoworthmentioningthatthemostabundantgroupofassemblednifHsequences(thoseassignedtothemostabundantgenusCoraliomargarita)inMiscanthushaveanidentityofonlyabout85%tobesthitreference,whichisfromcoralreefs(Fig.4.15).Thissuggeststhatamajorstraininsoilremaintobediscovered.924.6ConclusionInthisstudy,weshowcasethepowerofshotgunmetagenomicstostudymicrobialecologyatbothcommunitystructureandfunctionlevels.Wesequencedabout1TBofrhizospheremetagenome,whichisoneofthelargestsequencingeffortsonrhizospheresoilandenablesustostudyfunctionaltraitsthatarenotabundant.Overallcommunitystructure(SSUrRNAgene),overallfunction(annotationfromglobalassembly),andNcyclegenes(exceptAOA)allshowthecornrhizospheremicrobiomewasdifferentfromMiscanthusandswitchgrass,suggestingchangingbioenergycropfromcorn(annual)toMiscanthusandswitchgrass(perennial)willhaveimpactonecosystemfunctionscarriedoutbymicrobiome.Further,wecornhavelessandselectiononitsrhizospherecommunity,supportedbylargervariationincommunitycomposition,enrichedPenicilliumfungi),andpredominanceofRhizobiumandBradyrhizobium(leftfrombeforeestablishingthestudysite).Further,perennialsmanagetomaintainmorediverserhizospheremicrobialcommunitiesbyinvestinglesscarbonsource(rootdetritusandexudate).Moreover,weperennialshavemoreNgenes(nifH)andnitritereducinggenes(sumofnirKandnirS),whichagreeswithbetterNsustainabilityofMiscanthusandswitchgrass[127,129].Thusperennialsbioenergycropshaveadvantageovercorninmaintainmicrobialspeciesandfunctionaldiversityandalsoselectingmemberswithtraits.93BIBLIOGRAPHY94BIBLIOGRAPHY[1]E.R.Mardis.Theimpactofnext-generationsequencingtechnologyongenetics.TrendsGenet.,24(3):133Œ141,March2008.[2]E.Pettersson,J.Lundeberg,andA.Ahmadian.Generationsofsequencingtechnologies.Genomics,93(2):105Œ111,February2009.[3]A.RhoadsandK.F.Au.PacBioSequencingandItsApplications.GenomicsProteomicsBioinformatics,13(5):278Œ289,October2015.[4]J.A.Frank,C.I.Reich,S.Sharma,J.S.Weisbaum,B.A.Wilson,andG.J.Olsen.Criticalevaluationoftwoprimerscommonlyusedforofbacterial16srRNAgenes.Appl.Environ.Microbiol.,74(8):2461Œ2470,April2008.[5]B.J.Haas,D.Gevers,A.M.Earl,M.Feldgarden,D.V.Ward,G.Giannoukos,D.Ciulla,D.Tabbaa,S.K.Highlander,E.Sodergren,B.Methe,T.Z.DeSantis,J.F.Petrosino,R.Knight,andB.W.Birren.Chimeric16srRNAsequenceformationanddetectioninSangerand454-pyrosequencedPCRamplicons.GenomeRes.,21(3):494Œ504,March2011.[6]JiarongGuo,JamesR.Cole,QingpengZhang,C.TitusBrown,andJamesM.Tiedje.Mi-crobialcommunityanalysiswithribosomalgenefragmentsfromshotgunmetagenomes.AppliedandEnvironmentalMicrobiology,pagesAEM.02772Œ15,October2015.[7]AdinaChuangHowe,JanetK.Jansson,StephanieA.Malfatti,SusannahG.Tringe,JamesM.Tiedje,andC.TitusBrown.Tacklingsoildiversitywiththeassemblyoflarge,complexmetagenomes.ProceedingsoftheNationalAcademyofSciences,111(13):4904Œ4909,April2014.[8]S.Sunagawa,L.P.Coelho,S.Chaffron,J.R.Kultima,K.Labadie,G.Salazar,B.Djahan-schiri,G.Zeller,D.R.Mende,A.Alberti,F.M.Cornejo-Castillo,P.I.Costea,C.Cruaud,F.d'Ovidio,S.Engelen,I.Ferrera,J.M.Gasol,L.Guidi,F.Hildebrand,F.Kokoszka,C.Lepoivre,G.Lima-Mendez,J.Poulain,B.T.Poulos,M.Royo-Llonch,H.Sarmento,S.Vieira-Silva,C.Dimier,M.Picheral,S.Searson,S.Kandels-Lewis,C.Bowler,C.deVar-gas,G.Gorsky,N.Grimsley,P.Hingamp,D.Iudicone,O.Jaillon,F.Not,H.Ogata,S.Pe-sant,S.Speich,L.Stemmann,M.B.Sullivan,J.Weissenbach,P.Wincker,E.Karsenti,J.Raes,S.G.Acinas,P.Bork,S.G.Acinas,P.Bork,E.Boss,C.Bowler,C.deVargas,M.Follows,G.Gorsky,N.Grimsley,P.Hingamp,D.Iudicone,O.Jaillon,S.Kandels-Lewis,L.Karp-Boss,E.Karsenti,U.Krzic,F.Not,H.Ogata,S.Pesant,J.Raes,E.G.Reynaud,C.Sardet,M.Sieracki,S.Speich,L.Stemmann,M.B.Sullivan,S.Sunagawa,D.Velayoudon,J.Weissenbach,andP.Wincker.Oceanplankton.Structureandfunctionoftheglobaloceanmicrobiome.Science,348(6237):1261359,May2015.[9]J.Qin,R.Li,J.Raes,M.Arumugam,K.S.Burgdorf,C.Manichanh,T.Nielsen,N.Pons,F.Levenez,T.Yamada,D.R.Mende,J.Li,J.Xu,S.Li,D.Li,J.Cao,B.Wang,H.Liang,H.Zheng,Y.Xie,J.Tap,P.Lepage,M.Bertalan,J.M.Batto,T.Hansen,D.LePaslier,A.Linneberg,H.B.Nielsen,E.Pelletier,P.Renault,T.Sicheritz-Ponten,K.Turner,H.Zhu,C.Yu,S.Li,M.Jian,Y.Zhou,Y.Li,X.Zhang,S.Li,N.Qin,H.Yang,J.Wang,S.Brunak,95J.Dore,F.Guarner,K.Kristiansen,O.Pedersen,J.Parkhill,J.Weissenbach,P.Bork,S.D.Ehrlich,J.Wang,M.Antolin,F.Artiguenave,H.Blottiere,N.Borruel,T.Bruls,F.Casel-las,C.Chervaux,A.Cultrone,C.Delorme,G.Denariaz,R.Dervyn,M.Forte,C.Friss,M.vandeGuchte,E.Guedon,F.Haimet,A.Jamet,C.Juste,G.Kaci,M.Kleerebezem,J.Knol,M.Kristensen,S.Layec,K.LeRoux,M.Leclerc,E.Maguin,R.M.Minardi,R.Oozeer,M.Rescigno,N.Sanchez,S.Tims,T.Torrejon,E.Varela,W.deVos,Y.Wino-gradsky,andE.Zoetendal.Ahumangutmicrobialgenecatalogueestablishedbymetage-nomicsequencing.Nature,464(7285):59Œ65,March2010.[10]JasonPell,ArendHintze,RosangelaCanino-Koning,AdinaHowe,JamesM.Tiedje,andC.TitusBrown.ScalingmetagenomesequenceassemblywithprobabilisticdeBruijngraphs.ProceedingsoftheNationalAcademyofSciences,109(33):13272Œ13277,August2012.[11]DanielR.ZerbinoandEwanBirney.Velvet:algorithmsfordenovoshortreadassemblyusingdeBruijngraphs.GenomeResearch,18(5):821Œ829,May2008.[12]JaredT.SimpsonandRichardDurbin.Efdenovoassemblyoflargegenomesusingcompresseddatastructures.GenomeResearch,22(3):549Œ556,March2012.[13]Z.Li,Y.Chen,D.Mu,J.Yuan,Y.Shi,H.Zhang,J.Gan,N.Li,X.Hu,B.Liu,B.Yang,andW.Fan.Comparisonofthetwomajorclassesofassemblyalgorithms:overlap-layout-consensusandde-bruijn-graph.BriefFunctGenomics,11(1):25Œ37,January2012.[14]J.T.Simpson,K.Wong,S.D.Jackman,J.E.Schein,S.J.Jones,andI.Birol.ABySS:aparallelassemblerforshortreadsequencedata.GenomeRes.,19(6):1117Œ1123,June2009.[15]J.Butler,I.MacCallum,M.Kleber,I.A.Shlyakhter,M.K.Belmonte,E.S.Lander,C.Nus-baum,andD.B.Jaffe.ALLPATHS:denovoassemblyofwhole-genomeshotgunmicrore-ads.GenomeRes.,18(5):810Œ820,May2008.[16]R.Luo,B.Liu,Y.Xie,Z.Li,W.Huang,J.Yuan,G.He,Y.Chen,Q.Pan,Y.Liu,J.Tang,G.Wu,H.Zhang,Y.Shi,Y.Liu,C.Yu,B.Wang,Y.Lu,C.Han,D.W.Cheung,S.M.Yiu,S.Peng,Z.Xiaoqian,G.Liu,X.Liao,Y.Li,H.Yang,J.Wang,T.W.Lam,andJ.Wang.SOAPdenovo2:anempiricallyimprovedmemory-efshort-readdenovoassembler.Gigascience,1(1):18,2012.[17]D.Li,C.M.Liu,R.Luo,K.Sadakane,andT.W.Lam.MEGAHIT:anultra-fastsingle-nodesolutionforlargeandcomplexmetagenomicsassemblyviasuccinctdeBruijngraph.Bioinformatics,31(10):1674Œ1676,May2015.[18]T.Namiki,T.Hachiya,H.Tanaka,andY.Sakakibara.MetaVelvet:anextensionofVelvetassemblertodenovometagenomeassemblyfromshortsequencereads.NucleicAcidsRes.,40(20):e155,November2012.[19]Y.Peng,H.C.Leung,S.M.Yiu,andF.Y.Chin.IDBA-UD:adenovoassemblerforsingle-cellandmetagenomicsequencingdatawithhighlyunevendepth.Bioinformatics,28(11):1420Œ1428,June2012.96[20]C.S.Miller,B.J.Baker,B.C.Thomas,S.W.Singer,andJ.F.EMIRGE:recon-structionoffull-lengthribosomalgenesfrommicrobialcommunityshortreadsequencingdata.GenomeBiol.,12(5):R44,2011.[21]D.J.Lane,B.Pace,G.J.Olsen,D.A.Stahl,M.L.Sogin,andN.R.Pace.Rapiddetermi-nationof16sribosomalRNAsequencesforphylogeneticanalyses.Proc.Natl.Acad.Sci.U.S.A.,82(20):6955Œ6959,October1985.[22]S.M.Huse,L.Dethlefsen,J.A.Huber,D.MarkWelch,D.M.Welch,D.A.Relman,andM.L.Sogin.ExploringmicrobialdiversityandtaxonomyusingSSUrRNAhypervariabletagsequencing.PLoSGenet.,4(11):e1000255,November2008.[23]J.R.Cole,Q.Wang,J.A.Fish,B.Chai,D.M.McGarrell,Y.Sun,C.T.Brown,A.Porras-Alfaro,C.R.Kuske,andJ.M.Tiedje.RibosomalDatabaseProject:dataandtoolsforhighthroughputrRNAanalysis.NucleicAcidsRes.,42(Databaseissue):D633Œ642,January2014.[24]C.Quast,E.Pruesse,P.Yilmaz,J.Gerken,T.Schweer,P.Yarza,J.Peplies,andF.O.Glockner.TheSILVAribosomalRNAgenedatabaseproject:improveddataprocessingandweb-basedtools.NucleicAcidsRes.,41(Databaseissue):D590Œ596,January2013.[25]C.Yuan,J.Lei,J.Cole,andY.Sun.Reconstructing16srRNAgenesinmetagenomicdata.Bioinformatics,31(12):35Œ43,June2015.[26]C.S.Miller.Assemblingfull-lengthrRNAgenesfromshort-readmetagenomicsequencedatasetsusingEMIRGE.Meth.Enzymol.,531:333Œ352,2013.[27]B.Langmead.AligningshortsequencingreadswithBowtie.CurrProtocBioinformatics,Chapter11:Unit11.7,December2010.[28]L.Rajeev,U.N.daRocha,N.Klitgord,E.G.Luning,J.Fortney,S.D.Axen,P.M.Shih,N.J.Bouskill,B.P.Bowen,C.A.Kerfeld,F.Garcia-Pichel,E.L.Brodie,T.R.Northen,andA.Mukhopadhyay.Dynamiccyanobacterialresponsetohydrationanddehydrationinadesertbiologicalsoilcrust.ISMEJ,7(11):2178Œ2191,November2013.[29]E.P.Nawrocki,D.L.Kolbe,andS.R.Eddy.Infernal1.0:inferenceofRNAalignments.Bioinformatics,25(10):1335Œ1337,May2009.[30]M.L.Sogin,H.G.Morrison,J.A.Huber,D.MarkWelch,S.M.Huse,P.R.Neal,J.M.Arrieta,andG.J.Herndl.Microbialdiversityinthedeepseaandtheunderexplored"rarebiosphere".Proc.Natl.Acad.Sci.U.S.A.,103(32):12115Œ12120,August2006.[31]S.F.Altschul,T.L.Madden,A.A.Schaffer,J.Zhang,Z.Zhang,W.Miller,andD.J.Lipman.GappedBLASTandPSI-BLAST:anewgenerationofproteindatabasesearchprograms.NucleicAcidsRes.,25(17):3389Œ3402,September1997.[32]R.Schmieder,Y.W.Lim,andR.Edwards.andremovalofribosomalRNAsequencesfrommetatranscriptomes.Bioinformatics,28(3):433Œ435,February2012.97[33]Y.Huang,P.Gilna,andW.Li.ofribosomalRNAgenesinmetagenomicfragments.Bioinformatics,25(10):1338Œ1340,May2009.[34]P.D.Schloss,S.L.Westcott,T.Ryabin,J.R.Hall,M.Hartmann,E.B.Hollister,R.A.Lesniewski,B.B.Oakley,D.H.Parks,C.J.Robinson,J.W.Sahl,B.Stres,G.G.Thallinger,D.J.VanHorn,andC.F.Weber.Introducingmothur:open-source,platform-independent,community-supportedsoftwarefordescribingandcomparingmicrobialcommunities.Appl.Environ.Microbiol.,75(23):7537Œ7541,December2009.[35]J.Kuczynski,J.Stombaugh,W.A.Walters,A.Gonzalez,J.G.Caporaso,andR.Knight.UsingQIIMEtoanalyze16srRNAgenesequencesfrommicrobialcommunities.CurrProtocMicrobiol,Chapter1:Unit1E.5.,November2012.[36]QiongWang,JordanA.Fish,MariahGilman,YanniSun,C.TitusBrown,JamesM.Tiedje,andJamesR.Cole.Xander:employinganovelmethodforefgene-targetedmetage-nomicassembly.Microbiome,3:32,2015.[37]R.C.Edgar,B.J.Haas,J.C.Clemente,C.Quince,andR.Knight.UCHIMEimprovessen-sitivityandspeedofchimeradetection.Bioinformatics,27(16):2194Œ2200,August2011.[38]R.D.Finn,P.Coggill,R.Y.Eberhardt,S.R.Eddy,J.Mistry,A.L.Mitchell,S.C.Potter,M.Punta,M.Qureshi,A.Sangrador-Vegas,G.A.Salazar,J.Tate,andA.Bateman.ThePfamproteinfamiliesdatabase:towardsamoresustainablefuture.NucleicAcidsRes.,44(D1):D279Œ285,January2016.[39]S.R.Eddy.Anewgenerationofhomologysearchtoolsbasedonprobabilisticinference.GenomeInform,23(1):205Œ211,October2009.[40]R.J.Case,Y.Boucher,I.Dahllof,C.Holmstrom,W.F.Doolittle,andS.Kjelleberg.Useof16srRNAandrpoBgenesasmolecularmarkersformicrobialecologystudies.Appl.Environ.Microbiol.,73(1):278Œ288,January2007.[41]S.Roux,F.Enault,G.Bronner,andD.Debroas.Comparisonof16srRNAandprotein-codinggenesasmolecularmarkersforassessingmicrobialdiversity(BacteriaandArchaea)inecosystems.FEMSMicrobiol.Ecol.,78(3):617Œ628,December2011.[42]C.R.Penton,T.A.Johnson,J.F.Quensen,S.Iwai,J.R.Cole,andJ.M.Tiedje.Func-tionalgenestoassessnitrogencyclingandaromatichydrocarbondegradation:primersandprocessingmatter.FrontMicrobiol,4:279,2013.[43]BrigitteHai,NdeyeHélèneDiallo,SaidouSall,FelixHaesler,KristinaSchauss,MoussaBonzi,KomiAssigbetse,Jean-LucChotte,JeanCharlesMunch,andMichaelSchloter.ofKeyGenesSteeringtheMicrobialNitrogenCycleintheRhizosphereofSorghumCultivarsinTropicalAgroecosystems.AppliedandEnvironmentalMicrobiology,75(15):4993Œ5000,August2009.[44]A.H.Treusch,S.Leininger,A.Kletzin,S.C.Schuster,H.P.Klenk,andC.Schleper.NovelgenesfornitritereductaseandAmo-relatedproteinsindicatearoleofuncultivatedmesophiliccrenarchaeotainnitrogencycling.Environ.Microbiol.,7(12):1985Œ1995,De-cember2005.98[45]L.N.Huang,F.Z.Tang,Y.S.Song,C.Y.Wan,S.L.Wang,W.Q.Liu,andW.S.Shu.Biodiversity,abundance,andactivityofbacteriaduringprimarysuccessiononacopperminetailings.FEMSMicrobiol.Ecol.,78(3):439Œ450,December2011.[46]JohnChristianGabyandDanielH.Buckley.AcomprehensivealignednifHgenedatabase:amultipurposetoolforstudiesofbacteria.Database,2014:bau001,January2014.[47]JohnChristianGabyandDanielH.Buckley.AComprehensiveEvaluationofPCRPrimerstoAmplifythenifHGeneofNitrogenase.PLoSONE,7(7):e42149,July2012.[48]RobertA.Sanford,DarleneD.Wagner,QingzhongWu,JoanneC.Chee-Sanford,SaraH.Thomas,ClaribelCruz-García,GinaRodríguez,ArturoMassol-Deyá,KishoreK.Krish-nani,KirstiM.Ritalahti,SilkeNissen,KonstantinosT.Konstantinidis,andFrankE.Löf-.Unexpectednitrousoxidereductasegenediversityandabundanceinsoils.ProceedingsoftheNationalAcademyofSciencesoftheUnitedStatesofAmerica,109(48):19709Œ19714,November2012.[49]S.Yoon,S.Nissen,D.Park,R.A.Sanford,andF.E.Lof.NitrousOxideReductionKi-neticsDistinguishBacteriaHarboringCladeINosZfromThoseHarboringCladeIINosZ.Appl.Environ.Microbiol.,82(13):3793Œ3800,July2016.[50]N.Fierer,J.W.Leff,B.J.Adams,U.N.Nielsen,S.T.Bates,C.L.Lauber,S.Owens,J.A.Gilbert,D.H.Wall,andJ.G.Caporaso.Cross-biomemetagenomicanalysesofsoilmicrobialcommunitiesandtheirfunctionalattributes.Proc.Natl.Acad.Sci.U.S.A.,109(52):21390Œ21395,December2012.[51]Y.ZhangandY.Sun.MetaDomain:aHMM-basedproteindomaintoolforshortsequences.PacSympBiocomput,pages271Œ282,2012.[52]T.R.Mercer,M.B.Clark,J.Crawford,M.E.Brunck,D.J.Gerhardt,R.J.Taft,L.K.Nielsen,M.E.Dinger,andJ.S.Mattick.TargetedsequencingforgenediscoveryandusingRNACaptureSeq.NatProtoc,9(5):989Œ1009,May2014.[53]W.R.StreitandR.A.Schmitz.MetagenomicsŒthekeytotheunculturedmicrobes.Curr.Opin.Microbiol.,7(5):492Œ498,October2004.[54]J.G.Caporaso,C.L.Lauber,W.A.Walters,D.Berg-Lyons,J.Huntley,N.Fierer,S.M.Owens,J.Betley,L.Fraser,M.Bauer,N.Gormley,J.A.Gilbert,G.Smith,andR.Knight.Ultra-high-throughputmicrobialcommunityanalysisontheIlluminaHiSeqandMiSeqplat-forms.ISMEJ,6(8):1621Œ1624,August2012.[55]J.R.Cole,Q.Wang,E.Cardenas,J.Fish,B.Chai,R.J.Farris,A.S.Kulam-Syed-Mohideen,D.M.McGarrell,T.Marsh,G.M.Garrity,andJ.M.Tiedje.TheRibosomalDatabaseProject:improvedalignmentsandnewtoolsforrRNAanalysis.NucleicAcidsRes.,37(Databaseissue):D141Œ145,January2009.[56]G.T.Bergmann,S.T.Bates,K.G.Eilers,C.L.Lauber,J.G.Caporaso,W.A.Walters,R.Knight,andN.Fierer.Theunder-recognizeddominanceofVerrucomicrobiainsoilbacterialcommunities.SoilBiol.Biochem.,43(7):1450Œ1455,July2011.99[57]G.W.Tyson,J.Chapman,P.Hugenholtz,E.E.Allen,R.J.Ram,P.M.Richardson,V.V.Solovyev,E.M.Rubin,D.S.Rokhsar,andJ.F.Communitystructureandmetabolismthroughreconstructionofmicrobialgenomesfromtheenvironment.Nature,428(6978):37Œ43,March2004.[58]MigunShakya,ChristopherQuince,JamesH.Campbell,ZaminK.Yang,ChristopherW.Schadt,andMirceaPodar.ComparativemetagenomicandrRNAmicrobialdiversitycharac-terizationusingarchaealandbacterialsyntheticcommunities.EnvironmentalMicrobiology,15(6):1882Œ1899,June2013.[59]G.C.Baker,J.J.Smith,andD.A.Cowan.Reviewandre-analysisof16sprimers.J.Microbiol.Methods,55(3):541Œ555,December2003.[60]J.H.Lee,H.Yi,andJ.Chun.rRNASelector:acomputerprogramforselectingribosomalRNAencodingsequencesfrommetagenomicandmetatranscriptomicshotgunlibraries.J.Microbiol.,49(4):689Œ691,August2011.[61]J.Bengtsson,K.M.Eriksson,M.Hartmann,Z.Wang,B.D.Shenoy,G.A.Grelet,K.Abarenkov,A.Petri,M.A.Rosenblad,andR.H.Nilsson.Metaxa:asoftwaretoolforautomateddetectionanddiscriminationamongribosomalsmallsubunit(12s/16s/18s)sequencesofarchaea,bacteria,eukaryotes,mitochondria,andchloroplastsinmetagenomesandenvironmentalsequencingdatasets.AntonieVanLeeuwenhoek,100(3):471Œ475,Octo-ber2011.[62]Q.Wang,G.M.Garrity,J.M.Tiedje,andJ.R.Cole.NaiveBayesianforrapidassignmentofrRNAsequencesintothenewbacterialtaxonomy.Appl.Environ.Microbiol.,73(16):5261Œ5267,August2007.[63]N.Shah,H.Tang,T.G.Doak,andY.Ye.Comparingbacterialcommunitiesinferredfrom16srRNAgenesequencingandshotgunmetagenomics.PacSympBiocomput,pages165Œ176,2011.[64]A.E.Darling,G.Jospin,E.Lowe,F.A.Matsen,H.M.Bik,andJ.A.Eisen.PhyloSift:phylogeneticanalysisofgenomesandmetagenomes.PeerJ,2:e243,2014.[65]F.Meyer,D.Paarmann,M.D'Souza,R.Olson,E.M.Glass,M.Kubal,T.Paczian,A.Ro-driguez,R.Stevens,A.Wilke,J.Wilkening,andR.A.Edwards.ThemetagenomicsRASTserver-apublicresourcefortheautomaticphylogeneticandfunctionalanalysisofmetagenomes.BMCBioinformatics,9:386,2008.[66]D.H.Huson,A.F.Auch,J.Qi,andS.C.Schuster.MEGANanalysisofmetagenomicdata.GenomeRes.,17(3):377Œ386,March2007.[67]R.Logares,S.Sunagawa,G.Salazar,F.M.Cornejo-Castillo,I.Ferrera,H.Sarmento,P.Hingamp,H.Ogata,C.deVargas,G.Lima-Mendez,J.Raes,J.Poulain,O.Jaillon,P.Wincker,S.Kandels-Lewis,E.Karsenti,P.Bork,andS.G.Acinas.Metagenomic16srDNAIlluminatagsareapowerfulalternativetoampliconsequencingtoexplorediversityandstructureofmicrobialcommunities.Environ.Microbiol.,16(9):2659Œ2671,September2014.100[68]D.P.Mao,Q.Zhou,C.Y.Chen,andZ.X.Quan.Coverageevaluationofuniversalbacterialprimersusingthemetagenomicdatasets.BMCMicrobiol.,12:66,2012.[69]A.Klindworth,E.Pruesse,T.Schweer,J.Peplies,C.Quast,M.Horn,andF.O.Glock-ner.Evaluationofgeneral16sribosomalRNAgenePCRprimersforclassicalandnext-generationsequencing-baseddiversitystudies.NucleicAcidsRes.,41(1):e1,January2013.[70]EdersondaC.Jesus,ChaoLiang,JohnF.Quensen,EndangSusilawati,RandallD.Jackson,TeresaC.Balser,andJamesM.Tiedje.ofcorn,switchgrass,andprairiecroppingsystemsonsoilmicrobialcommunitiesintheupperMidwestoftheUnitedStates.GCBBioenergy,pagesn/aŒn/a,2015.[71]T.MagocandS.L.Salzberg.FLASH:fastlengthadjustmentofshortreadstoimprovegenomeassemblies.Bioinformatics,27(21):2957Œ2963,November2011.[72]P.D.Schloss.Ahigh-throughputDNAsequencealignerformicrobialecologystudies.PLoSONE,4(12):e8230,2009.[73]Y.Loewenstein,E.Portugaly,M.Fromer,andM.Linial.Efalgorithmsforaccuratehierarchicalclusteringofhugedatasets:tacklingtheentireproteinspace.Bioinformatics,24(13):i41Œ49,July2008.[74]J.M.Neefs,Y.VandePeer,P.DeRijk,S.Chapelle,andR.DeWachter.CompilationofsmallribosomalsubunitRNAstructures.NucleicAcidsRes.,21(13):3025Œ3049,July1993.[75]F.E.Angly,P.G.Dennis,A.Skarshewski,I.Vanwonterghem,P.Hugenholtz,andG.W.Tyson.CopyRighter:arapidtoolforimprovingtheaccuracyofmicrobialcommunitythroughgenecopynumbercorrection.Microbiome,2:11,2014.[76]T.Z.DeSantis,P.Hugenholtz,N.Larsen,M.Rojas,E.L.Brodie,K.Keller,T.Huber,D.Dalevi,P.Hu,andG.L.Andersen.Greengenes,achimera-checked16srRNAgenedatabaseandworkbenchcompatiblewithARB.Appl.Environ.Microbiol.,72(7):5069Œ5072,July2006.[77]FernandoPérezandBrianE.Granger.IPython:aSystemforInteractiveComput-ing.ComputinginScienceandEngineering,9(3):21Œ29,May2007.[78]J.Zhou,L.Wu,Y.Deng,X.Zhi,Y.H.Jiang,Q.Tu,J.Xie,J.D.VanNostrand,Z.He,andY.Yang.Reproducibilityandquantitationofampliconsequencing-baseddetection.ISMEJ,5(8):1303Œ1313,August2011.[79]J.Zhou,Z.He,Y.Yang,Y.Deng,S.G.Tringe,andL.Alvarez-Cohen.High-throughputmetagenomictechnologiesforcomplexmicrobialcommunityanalysis:openandclosedformats.MBio,6(1),2015.[80]D.S.Lundberg,S.L.Lebeis,S.H.Paredes,S.Yourstone,J.Gehring,S.Malfatti,J.Trem-blay,A.Engelbrektson,V.Kunin,T.G.delRio,R.C.Edgar,T.Eickhorst,R.E.Ley,P.Hugenholtz,S.G.Tringe,andJ.L.Dangl.thecoreArabidopsisthalianarootmicrobiome.Nature,488(7409):86Œ90,August2012.101[81]F.Guo,F.Ju,L.Cai,andT.Zhang.Taxonomicprecisionofdifferenthypervariableregionsof16srRNAgeneandannotationmethodsforfunctionalbacterialgroupsinbiologicalwastewatertreatment.PLoSONE,8(10):e76185,2013.[82]P.H.Janssen.Identifyingthedominantsoilbacterialtaxainlibrariesof16srRNAand16srRNAgenes.Appl.Environ.Microbiol.,72(3):1719Œ1728,March2006.[83]S.Sunagawa,D.R.Mende,G.Zeller,F.Izquierdo-Carrasco,S.A.Berger,J.R.Kultima,L.P.Coelho,M.Arumugam,J.Tap,H.B.Nielsen,S.Rasmussen,S.Brunak,O.Pedersen,F.Guarner,W.M.deVos,J.Wang,J.Li,J.Dore,S.D.Ehrlich,A.Stamatakis,andP.Bork.Metagenomicspeciesusinguniversalphylogeneticmarkergenes.Nat.Methods,10(12):1196Œ1199,December2013.[84]W.J.Kent.BLATŒtheBLAST-likealignmenttool.GenomeRes.,12(4):656Œ664,April2002.[85]C.Luo,L.M.Rodriguez-R,E.R.Johnston,L.Wu,L.Cheng,K.Xue,Q.Tu,Y.Deng,Z.He,J.Z.Shi,M.M.Yuan,R.A.Sherry,D.Li,Y.Luo,E.A.Schuur,P.Chain,J.M.Tiedje,J.Zhou,andK.T.Konstantinidis.Soilmicrobialcommunityresponsestoadecadeofwarmingasrevealedbycomparativemetagenomics.Appl.Environ.Microbiol.,80(5):1777Œ1786,March2014.[86]J.G.Caporaso,K.Bittinger,F.D.Bushman,T.Z.DeSantis,G.L.Andersen,andR.Knight.PyNAST:axibletoolforaligningsequencestoatemplatealignment.Bioinformatics,26(2):266Œ267,January2010.[87]Z.Liu,C.Lozupone,M.Hamady,F.D.Bushman,andR.Knight.Shortpyrosequencingreadssufforaccuratemicrobialcommunityanalysis.NucleicAcidsRes.,35(18):e120,2007.[88]P.D.SchlossandS.L.Westcott.Assessingandimprovingmethodsusedinoperationaltaxonomicunit-basedapproachesfor16srRNAgenesequenceanalysis.Appl.Environ.Microbiol.,77(10):3219Œ3226,May2011.[89]T.J.Sharpton,S.J.Riesenfeld,S.W.Kembel,J.Ladau,J.P.O'Dwyer,J.L.Green,J.A.Eisen,andK.S.Pollard.PhylOTU:ahigh-throughputproceduremicrobialcom-munitydiversityandresolvesnoveltaxafrommetagenomicdata.PLoSComput.Biol.,7(1):e1001061,2011.[90]M.H.FarrisandJ.B.Olson.DetectionofActinobacteriacultivatedfromenvironmentalsamplesrevealsbiasinuniversalprimers.Lett.Appl.Microbiol.,45(4):376Œ381,October2007.[91]A.Parada,D.M.Needham,andJ.A.Fuhrman.Everybasematters:assessingsmallsubunitrRNAprimersformarinemicrobiomeswithmockcommunities,time-seriesandglobalsamples.Environ.Microbiol.,August2015.102[92]B.D.Lindahl,R.H.Nilsson,L.Tedersoo,K.Abarenkov,T.Carlsen,R.Kj?ller,U.Koljalg,T.Pennanen,S.Rosendahl,J.Stenlid,andH.Kauserud.Fungalcommunityanalysisbyhigh-throughputsequencingofammarkersŒauser'sguide.NewPhytol.,199(1):288Œ299,July2013.[93]A.Porras-Alfaro,K.L.Liu,C.R.Kuske,andG.Xie.Fromgenustophylum:large-subunitandinternaltranscribedspacerrRNAoperonregionsshowsimilaraccuraciesbydatabasecomposition.Appl.Environ.Microbiol.,80(3):829Œ840,February2014.[94]FranciskaT.deVries,EllisHofNickvanEekeren,LijbertBrussaard,andJaapBloem.Fungal/bacterialratiosingrasslandswithcontrastingnitrogenmanagement.SoilBiologyandBiochemistry,38(8):2092Œ2103,August2006.[95]BonnieG.Waring,ColinAverill,andChristineV.Hawkes.Differencesinfungalandbac-terialphysiologyaltersoilcarbonandnitrogencycling:insightsfrommeta-analysisandtheoreticalmodels.EcologyLetters,16(7):887Œ894,2013.[96]S.G.Acinas,L.A.Marcelino,V.Klepac-Ceraj,andM.F.Polz.Divergenceandredundancyof16srRNAsequencesingenomeswithmultiplerrnoperons.J.Bacteriol.,186(9):2629Œ2635,May2004.[97]J.Kuczynski,Z.Liu,C.Lozupone,D.McDonald,N.Fierer,andR.Knight.Microbialcom-munityresemblancemethodsdifferintheirabilitytodetectbiologicallyrelevantpatterns.Nat.Methods,7(10):813Œ819,October2010.[98]D.E.Hunt,V.Klepac-Ceraj,S.G.Acinas,C.Gautier,S.Bertilsson,andM.F.Polz.Evalu-ationof23srRNAPCRprimersforuseinphylogeneticstudiesofbacterialdiversity.Appl.Environ.Microbiol.,72(3):2221Œ2225,March2006.[99]D.L.MummeyandM.C.Rillig.EvaluationofLSUrRNA-genePCRprimersforanaly-sisofarbuscularmycorrhizalfungalcommunitiesviaterminalrestrictionfragmentlengthpolymorphismanalysis.J.Microbiol.Methods,70(1):200Œ204,July2007.[100]K.L.Liu,A.Porras-Alfaro,C.R.Kuske,S.A.Eichorst,andG.Xie.Accurate,rapidtaxonomicoffungallarge-subunitrRNAgenes.Appl.Environ.Microbiol.,78(5):1523Œ1533,March2012.[101]T.M.PorterandG.B.Golding.FactorsthataffectlargesubunitribosomalDNAampliconsequencingstudiesoffungalcommunities:method,primerchoice,anderror.PLoSONE,7(4):e35749,2012.[102]D.Begerow,H.Nilsson,M.Unterseher,andW.Maier.CurrentstateandperspectivesoffungalDNAbarcodingandrapidprocedures.Appl.Microbiol.Biotechnol.,87(1):99Œ108,June2010.[103]K.J.LoceyandJ.T.Lennon.Scalinglawspredictglobalmicrobialdiversity.Proc.Natl.Acad.Sci.U.S.A.,113(21):5970Œ5975,May2016.103[104]C.R.WoeseandG.E.Fox.Phylogeneticstructureoftheprokaryoticdomain:theprimarykingdoms.Proc.Natl.Acad.Sci.U.S.A.,74(11):5088Œ5090,November1977.[105]M.WuandJ.A.Eisen.Asimple,fast,andaccuratemethodofphylogenomicinference.GenomeBiol.,9(10):R151,2008.[106]M.Land,L.Hauser,S.R.Jun,I.Nookaew,M.R.Leuze,T.H.Ahn,T.Karpinets,O.Lund,G.Kora,T.Wassenaar,S.Poudel,andD.W.Ussery.Insightsfrom20yearsofbacterialgenomesequencing.Funct.Integr.Genomics,15(2):141Œ161,March2015.[107]J.Goris,K.T.Konstantinidis,J.A.Klappenbach,T.Coenye,P.Vandamme,andJ.M.Tiedje.DNA-DNAhybridizationvaluesandtheirrelationshiptowhole-genomesequencesimilarities.Int.J.Syst.Evol.Microbiol.,57(Pt1):81Œ91,January2007.[108]C.Luo,S.T.Walk,D.M.Gordon,M.Feldgarden,J.M.Tiedje,andK.T.Konstan-tinidis.GenomesequencingofenvironmentalEscherichiacoliexpandsunderstandingoftheecologyandspeciationofthemodelbacterialspecies.Proc.Natl.Acad.Sci.U.S.A.,108(17):7200Œ7205,April2011.[109]N.J.Varghese,S.Mukherjee,N.Ivanova,K.T.Konstantinidis,K.Mavrommatis,N.C.Kyr-pides,andA.Pati.Microbialspeciesdelineationusingwholegenomesequences.NucleicAcidsRes.,43(14):6761Œ6771,August2015.[110]MarcoScortichini,SimoneMarcelletti,PatriziaFerrante,andGiuseppeFirrao.AGenomicofPseudomonasavellanaespecies.PLOSONE,8(9):e75794,September2013.[111]A.P.Carter,W.M.Clemons,D.E.Brodersen,R.J.Morgan-Warren,B.T.Wimberly,andV.Ramakrishnan.Functionalinsightsfromthestructureofthe30sribosomalsubunitanditsinteractionswithantibiotics.Nature,407(6802):340Œ348,September2000.[112]JordanA.Fish,BenliChai,QiongWang,YanniSun,C.TitusBrown,JamesM.Tiedje,andJamesR.Cole.FunGene:thefunctionalgenepipelineandrepository.FrontiersinMicrobiology,4,2013.[113]TorbjørnRognes,TomáıFlouri,BenNichols,ChristopherQuince,andFrédéricMahé.VSEARCH:aversatileopensourcetoolformetagenomics.PeerJ,4:e2584,October2016.[114]C.S.Miller,K.M.Handley,K.C.Wrighton,K.R.Frischkorn,B.C.Thomas,andJ.F.Short-readassemblyoffull-length16sampliconsrevealsbacterialdiversityinsubsurfacesediments.PLoSONE,8(2):e56018,2013.[115]D.L.Sun,X.Jiang,Q.L.Wu,andN.Y.Zhou.Intragenomicheterogeneityof16srRNAgenescausesoverestimationofprokaryoticdiversity.Appl.Environ.Microbiol.,79(19):5962Œ5969,October2013.[116]T.A.Isenbarger,C.E.Carr,S.S.Johnson,M.Finney,G.M.Church,W.Gilbert,M.T.Zuber,andG.Ruvkun.ThemostconservedgenomesegmentsforlifedetectiononEarthandotherplanets.OrigLifeEvolBiosph,38(6):517Œ533,December2008.104[117]D.B.Rusch,A.L.Halpern,G.Sutton,K.B.Heidelberg,S.Williamson,S.Yooseph,D.Wu,J.A.Eisen,J.M.Hoffman,K.Remington,K.Beeson,B.Tran,H.Smith,H.Baden-Tillson,C.Stewart,J.Thorpe,J.Freeman,C.Andrews-Pfannkoch,J.E.Venter,K.Li,S.Kravitz,J.F.Heidelberg,T.Utterback,Y.H.Rogers,L.I.Falcon,V.Souza,G.Bonilla-Rosso,L.E.Eguiarte,D.M.Karl,S.Sathyendranath,T.Platt,E.Bermingham,V.Gallardo,G.Tamayo-Castillo,M.R.Ferrari,R.L.Strausberg,K.Nealson,R.Friedman,M.Frazier,andJ.C.Venter.TheSorcererIIGlobalOceanSamplingexpedition:northwestAtlanticthrougheasterntropicalPPLoSBiol.,5(3):e77,March2007.[118]I.Gelfand,R.Sahajpal,X.Zhang,R.C.Izaurralde,K.L.Gross,andG.P.Robert-son.SustainablebioenergyproductionfrommarginallandsintheUSMidwest.Nature,493(7433):514Œ517,January2013.[119]D.A.Landis,M.M.Gardiner,W.vanderWerf,andS.M.Swinton.Increasingcornforbiofuelproductionreducesbiocontrolservicesinagriculturallandscapes.Proc.Natl.Acad.Sci.U.S.A.,105(51):20552Œ20557,December2008.[120]BenP.Werling,TimothyL.Dickson,RufusIsaacs,HannahGaines,ClaudioGratton,KatherineL.Gross,HeidiLiere,CarolynM.Malmstrom,TimothyD.Meehan,LeileiRuan,BruceA.Robertson,G.PhilipRobertson,ThomasM.Schmidt,AbbieC.Schrotenboer,TracyK.Teal,JuliannaK.Wilson,andDouglasA.Landis.Perennialgrasslandsenhancebiodiversityandmultipleecosystemservicesinbioenergylandscapes.ProceedingsoftheNationalAcademyofSciencesoftheUnitedStatesofAmerica,111(4):1652Œ1657,January2014.[121]DavideBulgarelli,KlausSchlaeppi,StijnSpaepen,EmielVerLorenvanThemaat,andPaulSchulze-Lefert.StructureandFunctionsoftheBacterialMicrobiotaofPlants.AnnualReviewofPlantBiology,64(1):807Œ838,2013.[122]EzékielBaudoin,EmileBenizri,andArmandGuckert.Impactofrootexudatesonthebacterialcommunitystructureinbulksoilandmaizerhizosphere.SoilBiologyandBiochemistry,35(9):1183Œ1192,September2003.[123]PaulG.Dennis,AnthonyJ.Miller,andPennyR.Hirsch.Arerootexudatesmoreimportantthanothersourcesofrhizodepositsinstructuringrhizospherebacterialcommunities?FEMSMicrobiologyEcology,72(3):313Œ327,June2010.[124]K.Smalla,G.Wieland,A.Buchner,A.Zock,J.Parzy,S.Kaiser,N.Roskot,H.Heuer,andG.Berg.BulkandRhizosphereSoilBacterialCommunitiesStudiedbyDenaturingGradientGelElectrophoresis:Plant-DependentEnrichmentandSeasonalShiftsRevealed.AppliedandEnvironmentalMicrobiology,67(10):4742Œ4751,October2001.[125]YuejianMao,AnthonyC.Yannarell,andRoderickI.Mackie.ChangesinN-TransformingArchaeaandBacteriainSoilduringtheEstablishmentofBioenergyCrops.PLoSONE,6(9):e24750,September2011.[126]GustavoG.T.Camargo,MatthewR.Ryan,andTomL.Richard.EnergyUseandGreen-houseGasEmissionsfromCropProductionUsingtheFarmEnergyAnalysisTool.Bio-Science,63(4):263Œ273,April2013.105[127]H.Schwarz,P.Liebhard,K.Ehrendorfer,andP.Ruckenbauer.TheeffectoffertilizationonyieldandqualityofMiscanthussinensis`Giganteus'.IndustrialCropsandProducts,2(3):153Œ159,May1994.[128]YuejianMao,AnthonyC.Yannarell,SarahC.Davis,andRoderickI.Mackie.ImpactofdifferentbioenergycropsonN-cyclingbacterialandarchaealcommunitiesinsoil.Environ-mentalMicrobiology,15(3):928Œ942,March2013.[129]DavidJ.ParrishandJohnH.Fike.TheBiologyandAgronomyofSwitchgrassforBiofuels.CriticalReviewsinPlantSciences,24(5-6):423Œ459,September2005.[130]JamesI.ProsserandGraemeW.Nicol.Archaealandbacterialammonia-oxidisersinsoil:thequestfornichespecialisationanddifferentiation.TrendsinMicrobiology,20(11):523Œ531,November2012.[131]WGZumft.CellbiologyandmolecularbasisofMicrobiologyandMolec-ularBiologyReviews,61(4):533Œ616,December1997.[132]J.Hendriks,A.Oubrie,J.Castresana,A.Urbani,S.Gemeinhardt,andM.Saraste.Nitricoxidereductasesinbacteria.Biochim.Biophys.Acta,1459(2-3):266Œ273,August2000.[133]KimHeylen,DirkGevers,BramVanparys,LievenWittebolle,JokeGeets,NicoBoon,andPaulDeVos.TheincidenceofnirSandnirKandtheirgeneticheterogeneityincultivatedEnvironmentalMicrobiology,8(11):2012Œ2021,November2006.[134]QingpengZhang,JasonPell,RosangelaCanino-Koning,AdinaChuangHowe,andC.TitusBrown.TheseAreNottheK-mersYouAreLookingFor:EfOnlineK-merCountingUsingaProbabilisticDataStructure.PLoSONE,9(7):e101271,July2014.[135]MichaelR.Crusoe,HussienF.Alameldin,SherineAwad,ElmarBoucher,AdamCald-well,ReedCartwright,AmandaCharbonneau,BedeConstantinides,GregEdvenson,ScottFay,JacobFenton,ThomasFenzl,JordanFish,LeonorGarcia-Gutierrez,PhillipGarland,JonathanGluck,IvánGonzález,SarahGuermond,JiarongGuo,AditiGupta,JoshuaR.Herr,AdinaHowe,AlexHyer,AndreasHärpfer,LuizIrber,RhysKidd,DavidLin,JustinLippi,TamerMansour,PamelaMcA'Nulty,EricMcDonald,JessicaMizzi,KevinD.Mur-ray,JoshuaR.Nahum,KabenNanlohy,AlexanderJohanNederbragt,HumbertoOrtiz-Zuazaga,JeramiaOry,JasonPell,CharlesPepe-Ranney,ZacharyN.Russ,ErichSchwarz,CamilleScott,JosiahSeaman,ScottSievert,JaredSimpson,ConnorT.Skennerton,JamesSpencer,RamakrishnanSrinivasan,DanielStandage,JamesA.Stapleton,SusanR.Stein-man,JoeStein,BenjaminTaylor,WillTrimble,HeatherL.Wiencko,MichaelWright,BrianWyss,QingpengZhang,enzyme,andC.TitusBrown.Thekhmersoftwarepackage:en-ablingefnucleotidesequenceanalysis.F1000Research,September2015.[136]PaulJ.McMurdieandSusanHolmes.phyloseq:AnRPackageforReproducibleInteractiveAnalysisandGraphicsofMicrobiomeCensusData.PLoSONE,8(4):e61217,April2013.[137]GeraldJurasinskiandwithcontributionsfromVroniRetzer.simba:ACollectionoffunc-tionsforsimilarityanalysisofvegetationdata.2012.Rpackageversion0.3-5.106[138]JariOksanen,F.GuillaumeBlanchet,RoelandKindt,PierreLegendre,PeterR.Minchin,R.B.O'Hara,GavinL.Simpson,PeterSolymos,M.HenryH.Stevens,andHeleneWagner.vegan:CommunityEcologyPackage.2015.Rpackageversion2.3-0.[139]RCoreTeam.R:ALanguageandEnvironmentforStatisticalComputing.RFoundationforStatisticalComputing,Vienna,Austria,2014.[140]WeizhongLiandAdamGodzik.Cd-hit:afastprogramforclusteringandcomparinglargesetsofproteinornucleotidesequences.Bioinformatics,22(13):1658Œ1659,July2006.[141]J.RobinSvensson,MatsLindegarth,PerR.Jonsson,andHenrikPavia.Disturbance-diversitymodels:whatdotheyreallypredictandhowaretheytested?Proceedings.Bi-ologicalSciences/TheRoyalSociety,279(1736):2163Œ2170,June2012.[142]NoahFierer,MarkA.Bradford,andRobertB.Jackson.Towardanecologicalofsoilbacteria.Ecology,88(6):1354Œ1364,June2007.WOS:000247203100003.[143]KathrynG.Eilers,ChristianL.Lauber,RobKnight,andNoahFierer.Shiftsinbacterialcommunitystructureassociatedwithinputsoflowmolecularweightcarboncompoundstosoil.SoilBiologyandBiochemistry,42(6):896Œ903,June2010.[144]NoahFierer,ChristianL.Lauber,KellyS.Ramirez,JesseZaneveld,MarkA.Bradford,andRobKnight.Comparativemetagenomic,phylogeneticandphysiologicalanalysesofsoilmicrobialcommunitiesacrossnitrogengradients.TheISMEjournal,6(5):1007Œ1017,May2012.[145]EllaWessén,SaraHallin,andLaurentPhilippot.Differentialresponsesofbacterialandarchaealgroupsathightaxonomicalrankstosoilmanagement.SoilBiologyandBiochem-istry,42(10):1759Œ1765,October2010.[146]Frank-MichaelC.Müller,KatherineE.Werner,MikiKasai,AndreaFrancesconi,StephenJ.Chanock,andThomasJ.Walsh.RapidExtractionofGenomicDNAfromMedicallyIm-portantYeastsandFilamentousFungibyHigh-SpeedCellDisruption.JournalofClinicalMicrobiology,36(6):1625Œ1629,June1998.[147]Ju-peiShen,Li-meiZhang,Yong-guanZhu,Jia-baoZhang,andJi-zhengHe.Abundanceandcompositionofammonia-oxidizingbacteriaandammonia-oxidizingarchaeacommuni-tiesofanalkalinesandyloam.EnvironmentalMicrobiology,10(6):1601Œ1611,June2008.[148]YananWang,XiubinKe,LiqinWu,andYahaiLu.Communitycompositionofammonia-oxidizingbacteriaandarchaeainricesoilasaffectedbynitrogenfertilization.System-aticandAppliedMicrobiology,32(1):27Œ36,February2009.[149]LawrenceG.Oates,DavidS.Duncan,IlyaGelfand,NevilleMillar,G.PhilipRobertson,andRandallD.Jackson.NitrousoxideemissionsduringestablishmentofeightalternativecellulosicbioenergycroppingsystemsintheNorthCentralUnitedStates.GCBBioenergy,pagesn/aŒn/a,May2015.107[150]S.Leininger,T.Urich,M.Schloter,L.Schwark,J.Qi,G.W.Nicol,J.I.Prosser,S.C.Schuster,andC.Schleper.Archaeapredominateamongammonia-oxidizingprokaryotesinsoils.Nature,442(7104):806Œ809,August2006.[151]CécileGubry-Rangin,GraemeW.Nicol,andJamesI.Prosser.Archaearatherthanbac-teriacontrolintwoagriculturalacidicsoils.FEMSMicrobiologyEcology,74(3):566Œ574,December2010.108