COMPUTATIONALIDENTIFICATIONANDANALYSISOFNON-CODINGRNASIN LARGE-SCALEBIOLOGICALDATA By JikaiLei ADISSERTATION Submittedto MichiganStateUniversity inpartialentoftherequirements forthedegreeof ComputerScience{DoctorofPhilosophy 2015 ABSTRACT COMPUTATIONALIDENTIFICATIONANDANALYSISOF NON-CODINGRNASINLARGE-SCALEBIOLOGICALDATA By JikaiLei Non-protein-codingRNAs(ncRNAs)areRNAmoleculesthatfunctiondirectlyatthe levelofRNAwithouttranslatingintoprotein.Theyplayimportantbiologicalfunctionsin allthreedomainsoflife,i.e.Eukarya,BacteriaandArchaea.Tounderstandtheworking mechanismsandthefunctionsofncRNAsinvariousspecies,afundamentalstepistoidentify bothknownandnovelncRNAsfromlarge-scalebiologicaldata. Large-scalegenomicdataincludesbothgenomicsequencedataandNGSsequencingdata. BothtypesofgenomicdataprovidegreatopportunityforidentifyingncRNAs.Forgenomic sequencedata,alotofncRNAidentoolsthatusecomparativesequenceanaly- sishavebeendeveloped.ThesemethodsworkwellforncRNAsthathavestrongsequence similarity.However,theyarenotwell-suitedfordetectingncRNAsthatareremotelyho- mologous.Nextgenerationsequencing(NGS),whileitopensanewhorizonforannotating andunderstandingknownandnovelncRNAs,alsointroducesmanychallenges.First,exist- inggenomicsequencesearchingtoolscannotbereadilyappliedtoNGSdatabecauseNGS technologyproducesshort,fragmentaryreads.Second,mostNGSdatasetsarelarge-scale. ExistingalgorithmsareinfeasibleonNGSdatabecauseofhighresourcerequirements.Third, metagenomicsequencing,whichutilizesNGStechnologytosequenceuncultured,complex microbialcommunitiesdirectlyfromtheirnaturalinhabitants,furtheraggravatesthe culties.Thus,massiveamountofgenomicsequencedataandNGSdatacallsfornt algorithmsandtoolsforncRNAannotation. Inthisdissertation,Ipresentthreecomputationalmethodsandtoolstotlyidentify ncRNAsfromlarge-scalebiologicaldata.Chain-RNAisatoolthatcombinesbothsequence similarityandstructuresimilaritytolocatecross-speciesconservedRNAelementswithlow sequencesimilarityingenomicsequencedata.Itcanachievecantlyhighersensitivity inidentifyingremotelyconservedncRNAelementsthansequencebasedmethodssuchas BLAST,andismuchfasterthanexistingstructuralalignmenttools.miR-PREFeR(miRNA PREdictionFromsmallRNA-Seqdata)utilizesexpressionpatternsofmiRNAandfollows thecriteriaforplantmicroRNAannotationtoaccuratelypredictplantmiRNAsfromone ormoresmallRNA-Seqdatasamples.Itissensitive,accurate,fastandhaslow-memory footprint.metaCRISPRfocusesonidentifyingClusteredRegularlyInterspacedShortPalin- dromicRepeats(CRISPRs)fromlarge-scalemetagenomicsequencingdata.Itusesakmer hashtabletotlydetectreadsthatbelongtoCRISPRsfromtherawmetagonmic dataset.Overlapgraphbasedclusteringisthenconductedonthereduceddatasettosep- aratetCRSIPRs.Asetofgraphbasedalgorithmsareusedtoassembleandrecover CRISPRsfromtheclusters. Tomybelovedfamily,especiallymylovelyboy,Evan. iv ACKNOWLEDGMENTS Firstandformost,IwouldliketothankmyadvisorDr.YanniSunforherexcellentsuper- visionandprofessionalguidanceonmyresearchanddissertationwork.Noneofthesewould bepossiblewithoutherpatience,time,encouragementandinspiration. Mysincerethanksalsogotomythesiscommitteemembers:Dr.EricTorng,Dr.Yuehua CuiandDr.JamesCole,fortheirprecioustime,helpfuldiscussionsandbeingalways approachable.IamalsogratefultoallthefacultyandmembersintheDepartmentof ComputerScienceandEngineeringwhohavetaughtmeorassistedme. Iwouldalsoliketothankmywife,TaoHe,forherunconditionallove,supportand encouragementduringthepastyears,aswellasforbeingagreatmotherforourson.Thanks mylittleboy,EvanY.Lei,formakingmylifemorecompleteandmeaningful. Mythanksalsogotomyacademicfamilymembers:Dr.YuanZhang,Dr.Rujira Achawanantakun,Dr.ChengYuan,JiaoChenandNanDu,fortheirvaluablehelpinmy dailylifeandusefuldiscussionsalltheseyears.Ialsothankmygreatfriends,YiyanLin,Qiu Chen,RanXuandallothergraduatestudentsinthedepartmentfortheirencouragements andcompanythroughoutthejourney. Lastbutnotleast,Iwouldexpressmyprofoundgratitudetomybelovedparents,who alwaysbelieveinme,givemecouragetopursuemydreams,andtookverygoodcareofmy sonwhenIwasnotthere. v TABLEOFCONTENTS LISTOFTABLES .................................... ix LISTOFFIGURES ................................... x Chapter1Introduction ............................... 1 1.1Overview......................................1 1.2NCRNAsecondarystructure...........................2 1.3Nextgenerationsequencingtechnology.....................3 1.3.1Wholegenomesequencing........................4 1.3.2RNA-Seq.................................5 1.3.3Metagenomicsequencing.........................5 1.4ExistingmethodsforidentifyingncRNAsfromlarge-scalegenomicdata...6 1.5Contributions...................................7 Chapter2Chain-RNA:acomparativencRNAsearchtoolbasedonthe two-dimensionalchainalgorithm .................. 9 2.1Introduction....................................9 2.2RelatedWork...................................11 2.3Methods......................................13 2.3.1Localbase-pairingprobabilities.....................15 2.3.2Stemextractionfromdotplots.....................15 2.3.3Stemalignmentasatwo-dimensionalchainproblem..........17 2.3.3.1Thebasicchainproblem....................17 2.3.3.2TheExtendedchainproblem.................19 2.3.3.3Weightofarectangle......................20 2.3.3.4Extendedchainalgorithm...................22 2.3.3.5Filtration............................25 2.4ResultsandDiscussion..............................26 2.4.1Evaluatingtheperformanceofchain-RNA...............26 2.4.1.1BenchmarkConstruction....................26 2.4.1.1.1SelectionofncRNAs..................27 2.4.1.1.2Addingcontextsequences...............27 2.4.1.2ofsensitivityandfalsepositiverate.......30 2.4.1.3PerformanceofncRNAsearchtools..............31 2.4.1.4Filtration.......................34 2.4.1.5Analysisofthefalsepositivehits...............35 2.4.2Usingchain-RNAinnovelncRNAsearch................37 2.5Conclusion.....................................41 vi Chapter3miR-PREFeR:anaccurate,fast,andeasy-to-useplantmiRNA predictiontoolusingsmallRNA-Seqdata ............ 43 3.1Introduction....................................43 3.2DescriptionofthemiR-PREFeRpipeline....................44 3.2.1Ww.................................44 3.2.1.1Candidateregioniden................46 3.2.1.1.1Peaks:.........................47 3.2.1.1.2Candidateregions:..................47 3.2.1.1.3Candidatematuresequences:............47 3.2.1.2Folding.............................48 3.2.1.3MiRNAprediction.......................48 3.2.1.4Checkpointing.........................50 3.2.2Implementationandexternalprogramdependency...........50 3.3Results.......................................51 3.3.1Evaluatingpredictionperformanceonsixdatasets..........51 3.3.2AnalyzingthebetweentheannotatedandpredictedmiR- NAs....................................56 3.3.3miR-PFEFeRisfastandt...............57 Chapter4IdentifyandassembleCRISPRsfrommetagenomicdata ... 61 4.1Introduction....................................61 4.2Relatedworksandtheirlimitations.......................62 4.3Method......................................64 4.3.1Overview.................................64 4.3.2IdenofCRISPRreads.....................66 4.3.2.1Directrepeatidention..................66 4.3.2.2Filtermetagenomicdatausingk-merhashtable.......68 4.3.3ClusteringofCRISPRreads.......................69 4.3.3.1Overlapgraph.........................70 4.3.3.2OverlapgraphbasedCRISPRreadsclustering........71 4.3.3.3Correcttheorientationofreads................74 4.3.3.4Filterfalsepositiveclusters..................74 4.3.4AssembleCRISPRsofheterogeneouscoverage.............75 4.3.4.1Step1:cleangraphandrecoverhighlysequencedCRISPRs.77 4.3.4.2Step2:cleangraphbyremovingrecoverednodesandedges.78 4.3.4.3Step3:breakcycles.......................78 4.3.4.4Step4:recoverlowlysequencedCRISPRs..........80 4.4Experimentalresults...............................81 4.4.1EvaluatingtheperformanceofCRISPRreadsiden......82 4.4.2Theperformanceofreadclustering...................83 4.4.3PerformanceofCRISPRassemblyonthemockdataset.......85 4.4.4Experimentalresultonthesoildataset.................87 4.4.5Experimentalresultonthehumangutdataset............89 Chapter5Conclusionandfuturework ...................... 91 vii BIBLIOGRAPHY .................................... 93 viii LISTOFTABLES Table2.1Informationofthedatasetandtotalrunningtimeofeachbench- markedtooloneachdataset.Column length showstheaverage lengthofgenomicsequencesinadatasetaftermaskingtheprotein codinggenes.Theaveragenumberofrectanglesareshownforapair ofwindowsofsizes100nt 100nt.Thetimeofchain-RNAincludes therunningtimeofRNAplfold.....................29 Table2.2Selectedbacterialstrainsandtheirchromosomeinformation.....37 Table2.3Transcribedregionsandtargetchromosomesareinputtochain-RNA. \all"inthe TargetChrs columnmeansthatallgenomicsequencesfor the7strainsinTable2.2areusedastargetsequences........40 Table3.1ComparisonoffeaturesofmiRNAannotationtools..........53 Table3.2InformationofthesmallRNA-Seqdatausedintheexperiments..54 Table3.3Performancecomparisonofbenchmarkedtools............58 Table3.4PerformancecomparisionofmiR-RPEFeRandShortStackonmaize59 Table3.5Runningtime(hour)..........................59 Table3.6Memoryusage(MB)..........................60 Table4.1TheoftheCRISPRreadsiden...........82 Table4.2Theperformanceofclusteringonmockdataset...........84 Table4.3Theperformanceofthebenchmarkedtoolsonthemockdataset..88 Table4.4Theperformanceofthebenchmarkedtoolsonthesoildataset...88 Table4.5Theperformanceofthebenchmarkedtoolsonthehumangutdataset89 ix LISTOFFIGURES Figure1.1AtRNAsequenceanditssecondarystructure............3 Figure1.2workwofnextgenerationsequencingtechnologythat producespaired-endreads.ThegenomicDNAsequenceisfrag- mented.ThetwoendsofthefragmentedDNAarethensequenced. Thesequencebetweenthetworeadsofeachpairisunknown,butthe distanceisinarangethatisspebythesequencingprotocol..4 Figure2.1Pipelineofchain-RNA..........................13 Figure2.2(A).AdotplotforatRNAsequence.Thelower-lefttrianglerep- resentstheMFEstructure.Theupper-righttriangleencodesbase- pairingprobabilitiesandisusedinourcomputation.Astemishigh- lightedinabox.(B).PartialoutputofRNAplfoldforalongsequence. Ahighlightedtriangleshowsthelocalbase-pairingprobabilitydot plotforasubstringofsizet.......................14 Figure2.3Propertiesofastem.Thestemconsistsofebasepairsandabulge ofsizeone.................................17 Figure2.4Thebasicchainproblem.Twochainsstartingfromrectanglewith weight3areshowedandthechainthatincludesrectangleswith weights3,2,and9isthemaximumweightedchaininthegiven rectangleset................................18 Figure2.5(A).Theformationofarectanglebytwostems.Thinarcsrepresent basepairs.Fourrectanglesareformedbytwostems.Thenumber ofbasepairsinthetwostemsformingarectanglecouldbet. r 2 isanexample.(B).Asubsetofrectanglesbetweentwosequences. Stemsarerepresentedbythickarcs.Thebasicweightislistedfor eachrectangle.Themaximumweightedchainandthecorresponding stemsarehighlighted...........................19 Figure2.6Legalrelationsbetweentworectanglesinavalidstemalignmentwith- outpseudoknots.Eacharcrepresentsabasepair.(A).Parallelstems leadtonon-overlappingrectangles.(B).Nestedstemsleadtonested rectangles.................................20 Figure2.7Algorithmforcomputingmaximumweightedextendedchainonaset ofrectangles...............................24 x Figure2.8Anexampleshowinghowrecursivelynestedrectanglesarehandled intheproposedextendedchainalgorithm. r 1 :bw =120, r 2 :bw =30, r 3 :bw =10.................................24 Figure2.9ROCcurvesforbenchmarkedtoolsundertoverlap onthedatasetwithsequenceidentityfrom0%to60%.\X/N" representstheROCcurveforthetoolXundertheoverlapN%. Forexample,\Chain-RNA/20"meanstheROCcurveisgeneratedfor Chain-RNAusingtheoverlap20%................35 Figure2.10ROCcurvesforbenchmarkedtoolsundertoverlap onthedatasetwithsequenceidentityfrom60%to70%.\X/N" representstheROCcurveforthetoolXundertheoverlapN%. Forexample,\Chain-RNA/20"meanstheROCcurveisgeneratedfor Chain-RNAusingtheoverlap20%................35 Figure2.11ROCcurvesforbenchmarkedtoolsundertoverlap onthedatasetwithsequenceidentityfrom70%to100%.\X/N" representstheROCcurveforthetoolXundertheoverlapN%. Forexample,\Chain-RNA/20"meanstheROCcurveisgeneratedfor Chain-RNAusingtheoverlap20%................36 Figure2.12Derivedstructuresforthreepairsoffalsepositivehits.........36 Figure2.13DerivedsecondarystructuresforhomologousncRNAsinall11chro- mosomesforthetranscribedregionwithID18.............41 Figure3.1WorwofmiR-PREFeR........................45 Figure3.2IdenofpeaksfromtwoinputRNA-Seqsamples.Eachcol- oredarrowrepresentsareadfromoneRNA-Seqsample.Redand bluearrowsrepresentreadsfromtwotsamples,respectively. Twopeaksareidenaccordingtothedepth D cutoff ,which is2inthisexample...........................46 Figure3.3Generatingcandidateregionsfrompeaks.(A).Thegap g between twoneighbouringpeaks p 1 and p 2 issmallerthanthegaplength G max .Thus,peaks p 1 and p 2 areclusteredintoonecandidate region.(B).Peak p 3 hasnoneighbouringpeaksandisextendedto theleftandtotherighttoformtwocandidateregionswithlength L max ...................................46 Figure3.4VenndiagramsthatshowtheintersectionsofthepredictedmiRNAs fromttoolsonedatasetsfromfourspecies.ForArabidop- sis,Medicagoandpeach,theannotationsfrommiRBaseareincluded.59 xi Figure3.5ThedistributionofthemiRNAstartpositions(toppanel) andlengths(bottompanel)betweenannotatedandpredictedmiR- NAsbythreetools.ThewasgeneratedusingannotatedmiR- NAsofArabidopsisthatarepredictedbyallthreetoolsondataset Athl-6...................................60 Figure4.1Acas-CRISPRexample.Greenboxesinthetopgurerepresent directrepeats(DRs).AlinebetweentwoDRsrepresentsaspacer sequence,whichsharesequenceidentitywithfragmentsofplasmids andbacteriophagegenomesandspecifythetargetsofCRISPRinter- ference.DRsandspacersformanDR-spacerarray.Thenumberof DRscanvarysubstantially,fromaminimumoftwotoafewhundred. AsetofCRISPR-associatedgenes(cas-genes),whichencodethepro- teinmachineryresponsibleforCRISPRactivity,arelocatedupstream ofthearray.ThebottomboxshowsthesequenceoftheDR-spacer array.GreencharactersareforDR,othercoloredcharactersarefor spacers.DRsequencesarehighlyconservedinaCRISPR.......62 Figure4.2MajorcomponentsofmetaCRISPR.Paired-endreadsarerepresented bytworectanglesconnectedbyadottedline.Readsofredandgreen colorsrepresentreadsfromtwotCRISPRs,respectively.Gray readsrepresentreadsfromothergenomicregions...........65 Figure4.3Illustrationofhownovelrepeatsareideninaread.Thegreen regionsareextendedregions(seemaintext).Blackcrossesrepresent mismatchesinthesequence.......................67 Figure4.4Thedistributionofgroundtruthreads.................69 Figure4.5Anexampleofoverlapgraphandtransitiveedge.Threereadsform threeedges.Therededgeisatransitiveedgecanbesafelyremoved.70 Figure4.6Top:twoCRISPRsthatshareverysimilarrepeatsequences.The readmappingagainstthetwoCRISPRslociareshown.Bot- tom:theinitialoverlapgraph G .Readsofredandpurplearese- quencedfromthetwoloci,respectively.Eachnoderepresentsaread ofthesameletter.Blackandgreyedgesdenoterandomoverlaps. Thenumberalongeachedgerepresentstheoverlap.Readsinapair r arerepresentedusing r: 1and r: 2....................71 xii Figure4.7Leftpanel:majorstepstoassembleCRISPRs.Rightpanelshowsan examplewiththreeCRISPRs,whicharerepresentedbyred,purple, andgreennodes/edges,respectively.TheCRISRPrepresentedusing redcolorhashighsequencecoverage,theothertwohavelowsequence coverage.Readsinapairareconnectedbyablack,dotted,arrowline. Nodeswhichformsacyclearecircledusingadottedline.......76 Figure4.8Leftpanel:twocasesincyclebreaking.Case1:thereadwhichpairs withreadinthecycleisinthedownstreamofthegraph;case2:the readthatpairswiththereadinthecycleisintheupstreamofthe graph.RightPanel:mainstepstobreakthecycleforcase1.Paired- endreadsareconnectedusingablack,dotted,arrowline.Edges representedbyred,dottedlinesaretemporallyremovededges.Red, solidlinesformthelongestpathbetweentwopairedreads......80 Figure4.9SensitivityoftheCRISPRreadsidenonstep.CRISPRsare sortedalongthex-axisbythenumberofrepeatstheycontains.See maintextfordetails...........................83 Figure4.10Performanceofthereadclustering...................85 xiii Chapter1 Introduction 1.1Overview Non-protein-codingRNAs(ncRNAs)areRNAmoleculesthatfunctiondirectlyatthelevel ofRNAwithouttranslatingintoprotein.Overthelastdecade,theimportanceofthissur- prisinglydiverseclassofRNAmoleculeshasbeenwidelyrecognized.Theyplayimportant biologicalfunctionsinallthreedomainsoflife,i.e.Eukarya,BacteriaandArchaea.Ithas longbeenknownthattRNAandrRNAplayimportantrolesinproteinsynthesis.Small nucleolarRNAs(snoRNAs)canregulatethealternativesplicingofmRNA[55].MicroRNAs (miRNAs),whicharefoundinalleukaryotes,regulatetheexpressionofgenesbybase-pairing withcomplementarysequenceswithinthetargetmessagerRNAs[24,125].16srRNAscan beusedasmarkergenesforphylogeneticreconstruction[121,63].Inaddition,ncRNAs newopportunitiesforgenetherapyandgenomeengineering.Forexample,usingsmallinter- feringRNA(siRNA)andtheassociatedRNAinterference(RNAi)mechanism,scientistscan designtherapytodown-regulatethemRNAlevelbyengineeringRNAsthatcansp andelytargetthedisease-relatedgene(s)[108,67,65,76,95].ClusteredRegularly InterspacedShortPalindromicRepeats(CRISPRs),togetherwiththeassociatedcasgenes, constituteaprokaryoticadaptiveimmunesystemthatprotectsagainstexogenousgenetic elementssuchasvirusesandplasmids[7,72,44,17,43,5,88].Recentadvancesofstudying CRISPRsinprokaryoticorganismshaveledtopromisingbiotechnologyapplicationssuchas 1 immunizationandaccurategeneediting(silencing,enhancingorchangingsp genes)[102,37,16,19,46,25,50,113,62].Tounderstandtheworkingmechanismsandthe functionsofncRNAsinvariousspecies,afundamentalstepistoidentifyncRNAsfrom biologicaldata. ThereexisttkindsofbiologicaldatawhichscientistscanusetoidentifyncRNAs. Onetypeisgenomicsequencedatasuchaswholegenomicsequences.Genomicsequences areavailableformodelspeciesandmoreandmorenon-modelspecies.Withtheadvent ofnextgenerationsequencing(NGS)technology,hugeamountofsequencingdata(whole genomesequencingdata,RNA-Seqdata,metagenomicdata)arebecomingavailable.This opensanewhorizonforannotatingandunderstandingknownandnovelncRNAs.Inthe followingsectionsofthischapter,Irstgiveabriefintroductiontothecharacteristicsof ncRNAs.ThenIdescribenextgenerationsequencingtechnologyandmetagenomics.Then IreviewgeneralapproachestoidentifyncRNAsfromlarge-scalegenomicdata.Finally,I introducethemethodsandtoolsIdevelopedduringmyPh.D.study. 1.2NCRNAsecondarystructure MostncRNAsfunctionthroughtheirthreedimensionalstructures.Toreachitsfunctional form,theRNAmoleculesequenceundergoesfoldingdrivenbyWatson-Crick(C-GandA-U) andwobble(G-U)base-pairingandstackinginteractions.Thisprocessformsshortstems andvarioussinglestrandedloopsthatitssecondarystructure.Figure1.1showsa tRNAsequenceanditssecondarystructure.Thesecondarystructuresofmanytypesof ncRNAsaremoreconservedthantheirprimarysequences.Thisprovidesusefulinformation foridentifyingncRNAsfrombiologicaldata. 2 Figure1.1AtRNAsequenceanditssecondarystructure 1.3Nextgenerationsequencingtechnology Sequencingistheprocessofdeterminingtheorderofnucleotidebaseswithinastenchof DNA/RNAsequence.Sanger'smethod,whichinvolvescopyingaDNAtemplateusingchain- terminatingdideoxynucleotides,dominatedthesequencingmarketformorethan30years andremainedinwideuse[96,97].Thoughitisaccurateandgeneratesreadsoflonglength, Sangersequencinghasseverallimitations.First,thecloningprocessislabor-intensive.Sec- ond,itreliesoncloningDNAtogeneratesinglestrandedDNA.However,DNAsequences suchascentromeres,arepracticallyunclonable,whichmadethemhardtosequence[79]. Third,thethroughputofSangersequencingisquitelowcomparedwithnewlydeveloped methods,thusthedepthofcoverageisrelativelylow.Thismakesitttostudies suchasinvestigatingmicrobialpopulationdiversity[66].Forth,theoverallcostofSanger sequencingishigh[66]. Overthepastfewyears,next-generationsequencing(NGS)technologiesevolvedvery 3 Figure1.2workwofnextgenerationsequencingtechnologythatproduces paired-endreads.ThegenomicDNAsequenceisfragmented.Thetwoendsofthe fragmentedDNAarethensequenced.Thesequencebetweenthetworeadsofeachpairis unknown,butthedistanceisinarangethatisspbythesequencingprotocol. fastandignitedarevolutioningenomicscience[57,77,115].Figure1.2showsa workwofnextgenerationsequencing.ThemainadvantageofNGStechnologiesisthat theyaremassivelyparallel.Thustheyhaveveryhighthroughputandareabletosequence longstretchesofDNA/RNAsequencesspanningentiregenomesrapidlyatalowcost.For example,asofJan,2014,Illumina'sHiSeqXsequencerisabletogenerateupto1.8terabyte datainasingleruninlessthanthreedays.Thesystemcansequenceahumangenomeat30x coverageorgreaterforlessthan$1000whenusedatscale[47].Therearetsequencing strategies,ofwhichthreeCCprovidevaluableinformationforncRNAidenThey arewholegenomesequencing[85],RNA-Seq[115],andmetagenomicsequencing[111,104, 23,40]. 1.3.1Wholegenomesequencing WholegenomesequencingusesSanger'smethodorNGStechnologiestosequencethewhole genomeofatargetspecies.Thesequencedreadsareassembledoralignedtoanexisting, closelyrelatedgenometoformthegenomeofthetargetspecies.Theproducedgenomeor 4 genomicsequencescanbeusedtoidentifyncRNAsforthespecies. 1.3.2RNA-Seq RNA-SequsesthecapabilitiesofNGStechnologiestosequencecomplementaryDNA(cDNA) generatedfromtranscribedRNAs[115].BecauseitonlysequencestranscribedRNAs,notthe wholegenome,whichcontainsmanyun-transcribedregions,itprovidesfundamentalinsights toannotateandstudynon-codingRNAs.Inaddition,RNA-SeqwhichusesasmallRNA preparationprotocol(smallRNA-seq)providesdeepinsightstounderstandsmallncRNAs, whichrefertoncRNAswhoselengthsarebetween18-26nt(miRNA,siRNA,etc).Deeply sequencedsmallRNAlibrariesprovideinformationtoaccuratelyidentifyexpressedsmall RNAs,aswellasquantifytheirexpressionlevelsintsamplesbasedonreadmapping [69,80,53,6]. 1.3.3Metagenomicsequencing WiththeadvanceofNGStechnologiesandthereductioninthecostofsequencing,now scientistshavetheabilitytocloneandsequenceDNA/mRNAdirectlyfrommicrobialcom- munitiesintheirnaturalhabitatswithoutcultivation.Thisculture-independentmicrobial communityanalysisapproachiscommonlyreferredtoasmetagenomics.ByusingNGS technologies,nowitispossibletosequencetheorganismsofmicrobialcommunitiesdeeply enoughtocoverthespecieswithtabundance.Metagenomicsprovidesaunique opportunitytostudyvariousaspectsofcomplexmicrobialpopulations.Forexample,itis possibletousemetagenomicdatatoidentifycas-CRISPRsystems.Thiscangreatlyenhance ourunderstandingofthephage-hostinteractionsandco-evolutionwithinnaturalmicrobial 5 communities[5,91,101]. 1.4ExistingmethodsforidentifyingncRNAsfromlarge- scalegenomicdata Large-scalegenomicdataincludesgenomicsequencedataandNGSsequencingdata.Alotof toolshavebeendevelopedforsearchingncRNAsingenomicsequences.Traditionally,ncR- NAsareidenbysearchingavailablegenomicsequencesbasedoncomparativesequence analysis,inwhichhomologysearchisusedtoexaminewhetherputativencRNAsareevolu- tionarilyconservedinrelatedgenomesandthusprovideskeyevidenceforncRNAannotation. Conventionalcomparativesequenceanalysisbasedmethodsarenotwell-suitedfordetecting ncRNAslackingstrongsequenceconservation.Thus,methodsthatutilizestructureconser- vationinformationofncRNAfamiliesarealsodeveloped.SearchingncRNAsdirectlyfrom genomicsequenceshasitslimitations.First,withoutinformationaboutwhichregionsare transcribed,toolsbelongstothiscategoryneedtoscanthewholegenomicsequence.Thus thesearchspaceisverylargeforevensmallgenomes.Thisleadstoverylongrunningtime, especiallyfortoolsbasedonstructureanalysis.Second,duetothelargesearchspace,these methodsincurhighfalsepositiverate.Forexmaple,thereexistover100,000regionswhich canbefoldedasstem-looplikestructuresinE.coligenome[89].Ifasimplestructure-based approachisusedtoidentifymiRNA,whichhasaverysimplestem-loopstructure,fromE. coligenome,onegetsalargenumberoffalsepositivepredictions. WithNGSdata,especiallyRNA-Seqdata,theselimitationscanbedrasticallyreduced, ifnottotallyeliminated.RNA-SeqsequencescDNAsthataregeneratedfromRNAs.We knowthatRNA-Seqreadsaregeneratedfromtranscribedregions.Mappingthesereadsback 6 tothereferencegenomeandworkingonregionsthatonlyhavetlymappingscan greatlynarrowthesearchspace,improverunningtime,andreducethefalsepositiverate.In addition,RNA-SeqwhichusesasmallRNApreparationprotocol(smallRNA-seq)provides deepinsightstounderstandsmallncRNAs,whichrefertoncRNAswhoselengthsarebetween 18-26nt(miRNA,siRNA,etc).DeeplysequencedsmallRNAlibrariesprovideinformation toaccuratelyidentifyexpressedsmallRNAs,aswellasquantifyingtheirexpressionlevels intlibrarysamplesbasedonread-counts[6]. ThepreviouslypresentedmethodsthatutilizeNGSdataneedtohavethereference genome/genomicsequence.However,withthedevelopmentofmetagenomics,thecomplexity ofmicrobialcommunitiesandthehugeamountofmetagenomicdatamakesassembling microbialgenomesachallengingproblem.Thepreviouslydescribedapproachescannotbe readilyappliedtometagenomicdata.Inrecentyears,anewcategoryofncRNAsearch methodsthatdonotrequireareferencegenomeisemerging: denovo ncRNAassembly.In thistypeofapproach,readsthatbelongtotheinterestedncRNA(s)aregatheredfrom RNA-Seqormetagenomicdata,andthenthesereadsareassembledtoproducethe ncRNAsequences. 1.5Contributions Inthisdissertation,Ipresentthreecomputationalmethodsandtoolstotlyidentify ncRNAsfromlarge-scalebiologicaldata.Chain-RNAisatoolthatcombinesbothsequence similarityandstructuresimilaritytolocatecross-speciesconservedRNAelementswithlow sequencesimilarityingenomicsequencedata.Itcanachievecantlyhighersensitivity inidentifyingremotelyconservedncRNAelementsthansequencebasedmethodssuchas 7 BLAST,andismuchfasterthanexistingstructuralalignmenttools.miR-PREFeR(miRNA PREdictionFromsmallRNA-Seqdata)utilizesexpressionpatternsofmiRNAandfollows thecriteriaforplantmicroRNAannotationtoaccuratelypredictplantmiRNAsfromone ormoresmallRNA-Seqdatasamples.Itissensitive,accurate,fastandhaslow-memory footprint.metaCRISPRfocusesonidentifyingClusteredRegularlyInterspacedShortPalin- dromicRepeats(CRISPRs)fromlarge-scalemetagenomicsequencingdata.Itusesakmer hashtabletotlydetectreadsthatbelongtoCRISPRsfromtherawmetagonmic dataset.Overlapgraphbasedclusteringisthenconductedonthereduceddatasettosep- aratetCRSIPRs.Asetofgraphbasedalgorithmsareusedtoassembleandrecover CRISPRsfromtheclusters. 8 Chapter2 Chain-RNA:acomparativencRNA searchtoolbasedonthe two-dimensionalchainalgorithm 2.1Introduction NoncodingRNAs(ncRNAs),whichfunctiondirectlyasRNAswithouttranslatingintopro- teins,ishighlyimportanttomodernbiology.nttypesofncRNAplaydiverseand importantbiologicalfunctions.Forexample,ithaslongbeenknownthattRNAandrRNA playimportantrolesinproteinsynthesis.Morerecently,regulatoryrolesofsomesmall ncRNAshavebeenrevealed[51,70],showingthatthefunctionsofncRNAsaremuchmore complicatedthanwethought.Inaddition,advancesofRNA-seqandthenext-generation sequencingtechnologieshaverevealedthatalargeportionoftranscriptomicdatacannotbe mappedbacktoannotatedprotein-codinggenesinthereferencegenome.Examiningwhether thesetranscriptscontainfunctionalncRNAsisimportanttoelucidatethecomplexityofthe JikaiLei,PrapapornTecha-angkoon,YanniSun.Chain-RNA:acomparativencRNAsearchtoolbased onthetwo-dimensionalchainalgorithm.IEEE/ACMTransactionsonComputationalBiologyandBioinfor- matics(TCBB).10:2,274-285.2013. 9 transcriptomeofvariousorganisms. Thestate-of-the-artmethodforncRNAidenisbasedoncomparativegenomics, inwhichregionsshowinghighsequenceorstructuralsimilaritiesacrossrelatedgenomesare examinedtoidentifyputativencRNAs.TherearetwotypesofncRNAsearchproblems.The oneisoftenreferredtoasknownncRNAsearch.GivenasinglencRNAsequenceora groupofhomologousncRNAswithannotatedstructures,stochasticcontext-freegrammar (SCFG)-basedmethodscanaccuratelyhomologstotheannotatedncRNAs.Infernal[83] andRSEARCH[56]aretwocommonlyusedtoolsinthiscategory.Thesecondtypeof ncRNAsearchisoftenreferredtoas denovo ncRNAsearch[29,9],whichdoesnotrequire priorinformationonthestructuresorsequencesofthencRNAstobefoundandthuscan identifynovelncRNAs.Forgenome-scalepredictionofncRNAs, denovo searchusuallystarts fromaligningsubstringswithtsequenceandstructuralsimilaritiesfromrelated genomes.ThesealignmentswillgothroughfurtherexaminationforncRNAiden ThisworkaimsatdevelopingalocalstructuralalignmentapproachfordenovoncRNA search.Forconvenienceofdiscussion,ncRNAsearchinthisworkspreferstothe secondproblem. Existinggenome-scalencRNAsearchprogramssuchasQRNA[93],RNAz[116],and EvoFold[87]stillusesequencealignmenttoolssuchasBLAST,ClustalW[109],orMUL- TIZ[11]asthestepforncRNAsearch.Forallalignmentsgeneratedbypairwiseor multiplesequencealignmenttools,acarefulscreeningisconductedtoclassifytheminto functionalncRNAsorotherfeaturesusingheterogeneousevidence.Thus,theoverallsensi- tivityoftheseprogramsislimitedbythesensitivityofthesesequencealignmentprograms onncRNAsearch.However,conventionalsequencealignmentprogramsarenotwell-suited fordetectingncRNAslackingstrongsequenceconservation.Itisknownthatsometypesof 10 ncRNAsmightexhibitabetterconservationofsecondarystructures.Forexample,highly structuredRNasePandlineagespecncRNAssuchasXistorAirposehardcasesfor BLAST-liketools[13,86].EvensomesmallncRNAssuchastRNAhavesequenceidentity aslowas45%intheirannotatedhomologsfromthehumanandmousegenomes[2].Asa result,whensequencealignmentprogramsareusedforncRNAsearch,ncRNAelementslack- ingstrongsequenceconservationcanbemissed.Therefore,thereisaneedforagenome-scale ncRNAsearchtoolthatcanincorporatebothsequenceandstructuralsimilarities. Inthiswork,weintroduceanncRNAsearchtool chain-RNA ,whichformulatesstructural alignmentproblemasanextendedtwo-dimensionalchainproblem.Ourtoolusesthelocal dotplotgenerationtoolRNAplfold[10]togeneratelocalbase-pairingprobabilitiesacrosstwo genomes,basedonwhichweextractstems.Byrepresentingapairofpredictedstemsfrom twoinputsequencesasarectangle,wecantakeadvantageoftheexistingchainalgorithm tosolvethencRNAstructuralalignmentproblem.Chain-RNAprovidesacomplementary ncRNAalignmentmethodtoconventionalalignmenttools.Chain-RNAismuchfasterthan existingstructuralalignmenttoolsforncRNAsearch.ForncRNAswithlowsequencesim- ilarity,ourexperimentsshowthatchain-RNAhasatlybetterbetween sensitivityandfalsepositiveratethanBLAST. 2.2RelatedWork SequencealignmenttoolshavebeenappliedtoncRNAsearch.Forexample,pairwisealign- mentsproducedbyBLAST[3]areusedforncRNAsearchbyQRNA[93].Thepairwise sequencealignmentsinthedatabaseforcomparativeregulatorygenomics(CORG)[20]and themultiplesequencealignmentsgeneratedbyCLUSTALWareusedasthetestsetsfor 11 ncRNAsearchinRNAz[116].MULTIZ[11]isadoptedbyEvoFold[87]togenerateinitial sequencealignmentsforncRNAexamination.Thepairwiseormultiplesequencealignments generatedbyvarioussequencealignmenttoolsarescreenedforputativefunctionalncR- NAsbasedonfeaturessuchasminimumfreeenergyandconservedsecondarystructures. However,usingconventionalsequencecomparisontoolsasthebasisforncRNAsearchisin dangerofmissingstructuralncRNAs. AnumberofncRNAstructuralalignmentprogramsexist.Thesealignmentmethods accepthomologousncRNAsasinputandtargetatgeneratingtheconsensusstructuresof inputsequences.Tojustgiveafewexamples,thereareexpensivebutaccurateSankalgo- rithm[98],fasteralignmenttoolssuchasSCARNA[107],MXSCARNA[106],Murlet[54], FoldalignM[110],LocARNA[118],RAF[21],GrammarString[1],etc.Amajorityofthese pairwiseormultipleglobalalignmenttoolsaredesignedtomaximizetheaccuracyofstruc- turepredictionandarenotreadilyusableforgenome-widencRNAsearch.Inordertoapply theseglobalalignmenttoolstogenome-scalencRNAsearch,the\slidingwindow"tech- nique[112,29]isusuallyused.Thegenomesaredividedintosmallandoverlappingwindows thatareusedasinputtotheglobalncRNAstructuralalignmenttools. AveryrelatedworktooursisPLAST-ncRNA[15],whichconductslocalalignmentbe- tweenaqueryRNAsequenceandalargegenomesequenceusingpartitionfunctionposterior probabilities.TheauthorsshowthatPLAST-ncRNAachieveshighaccuracyinidentifying distantlysimilarncRNAshiddeninthetargetgenome.AmajorbetweenPLAST- ncRNAandthemoregeneralschemeofncRNAsearchisthattheformerisdesignedto compareashortqueryandalongsequence.Itlimitsthesizeofthequerysequence. SCARNA LM[105]cangeneratelocalmultiplealignmentforthedetectionofncRNAs.It isbasedonadiscriminativepairwisealignmentmodelwhichincorporatessecondarystructure 12 Figure2.1Pipelineofchain-RNA. featuresasbasepairingprobabilities.Althoughitsstrengthliesinmultiplealignment,it canbeappliedtopairwisehomologysearch. FOLDALIGN[38,39]isahighlysensitivelocalstructuralalignmenttoolthatcanidentify ncRNAswithverylowsequencesimilarity( < 40%).Usingheuristicssuchasdynamic programmingmatrixpruning,FOLDALIGNisfasterthantheaccurateimplementationof theSankalgorithm[98].However,itisstillCPU-intensiveonlargedatasets. 2.3Methods Inthissection,wedescribethemethodbehindchain-RNAinlocatingconservedncRNAshid- denintwogenomicsequences.Thebasicrationalebehindourmethodisthatthesecondary structuresareimportanttothefunctionsandthusarewell-conservedduringevolution.Iden- 13 Figure2.2(A).AdotplotforatRNAsequence.Thelower-lefttrianglerepresentsthe MFEstructure.Theupper-righttriangleencodesbase-pairingprobabilitiesandisusedin ourcomputation.Astemishighlightedinabox.(B).PartialoutputofRNAplfoldfora longsequence.Ahighlightedtriangleshowsthelocalbase-pairingprobabilitydotplotfora substringofsizet. tifyingtheregionsdemonstratingsimilarstructuresprovidesthebasisforncRNAprediction. ItisknownthatsomehomologousncRNAsdonotsharehighstructuralsimilarity.These ncRNAsposehardcasesforchain-RNAandotherncRNAsearchtoolsthatrelyonstructural similarity. Ourpipelineconsistsoffourmainstages:generatinglocalbase-pairingprobabilities, extractingputativestemsfrombase-pairingprobabilities,andaligningstemsusing anextendedtwo-dimensionalchainalgorithm.Fig4.7showstherelationshipandinformation wbetweenthemainstages.Forapairoflonggenomicsequences,weapplyalocal foldingalgorithmtocomputelocalbase-pairingprobabilities,basedonwhichalargenumber ofcandidatestemsareextracted.Aslidingwindowstrategy[112]isusedtodividethe genomicsequenceintosmall,overlapping,anddwindows.Foreachwindowpair extractedfromtwoinputsequences,structuralalignmentisperformedusinganextended chainalgorithm.Inordertoimprovetheaccuracyofthefoldingstep,weallowstemswith lowprobabilities.Astageinthepipelineisusedtodecreasethesearchspaceand reducetheinputsizetothestructuralalignment.Themainnoveltyofouralgorithmisto 14 formulatethestructuralalignmentasachainproblem,whichhastsolutions.The detailofthealgorithmwillbedescribedinthefollowingsections. 2.3.1Localbase-pairingprobabilities WeuseRNAplfold[10]togeneratelocalbase-pairingprobabilitiesforinputgenomicse- quences.RNAplfoldimplementsalocalfoldingalgorithmthatcomputeslocalbase-pairing probabilitiesforbasepairswithamaximalspan t .Forasequenceoflength L andawindowsizeof t ,RNAplfoldhasmemoryandtimecomplexity O ( L + t 2 )and O ( Lt 2 ), respectively.As t isusuallymuchsmallerthan L ,itispracticaltoapplyRNAplfoldto verylargegenomes.Thebase-pairingprobabilitiesareencodedinadotplot,inwhichthe valueat( x , y )givestheprobabilitythatbaseatposition x ispairedwithbaseatposition y . Fig2.2.(A)showsthedotplotforatRNAsequence.Thesizesofdotsareproportionalto thebase-pairingprobabilities.Thus,thebasepairsinawell-conservedstemformanotable anti-diagonalinadotplot.Therearefourprominentanti-diagonalsinFig2.2.(A),corre- spondingtothefourstemsinatypicaltRNAstructure.Fig2.2.(B)showsapartialdotplot outputbyRNAplfoldonalargesequence.Thislongdotplotcanbedividedintosmaller trianglesforderivinglocalbase-pairingprobabilitiesforasub-stringofsize t . 2.3.2Stemextractionfromdotplots FollowingpreviousworksuchasSCARNA[107]andDotKnot[103],weevaluatethesimi- laritybetweendotplotsbyaligningthestem-loopstructuresextractedfromthedotplots. Astem S isformedbystackingbasepairsthatmeetthefollowingcriteria: Theprobabilityofeachbasepairislargerthanaspthreshold P base .Bydefault, 15 P base =0 : 05inchain-RNA. Theaverageoftheprobabilitiesofallbasepairsislargerthanaspthreshold P avg .Bydefault, P avg =0 : 25inchain-RNA. Atmostoneinteriorlooporbulgeofsizeoneisallowedinthestem. Foreachstem S inasequence x ,epropertiesare Startposition S :P start :thepositionofthebaseinthe5'partofthestem. Endposition S :P end :thepositionofthelastbaseinthe3'partofthestem. Giventhestartandendposition,wedenotethestem-loopsizeusing S :d ,whichis S :P end S :P start +1. probability S :C :thesumofthebase-pairingprobabilitiesofallbasepairs inthestem. S :seq :thesubsequencein x from S :P start and S :P end . Fig2.3illustratespropertiesofastem.Whenextractingstemsfromdotplots,theprobability threshold P base issettobeaverysmallnumber(0.05bydefault),enablingustoinclude stemswithsmallprobabilities.Thereasonisthatdotplotsgeneratedbydenovo structurepredictiontoolsdonotalwaysthetruestructure.Someconservedbase pairsmayhavesmallpredictedprobabilities.Whenastemwithsmall S :C iswellconserved acrosstspecies,ourstemalignmentprograminthenextstepwillbeabletorecognize them. 16 Figure2.3Propertiesofastem.Thestemconsistsofvebasepairsandabulgeofsizeone. 2.3.3Stemalignmentasatwo-dimensionalchainproblem Afterextractingallstemsandgeneratingtheirproperties,wecanconductstructuralalign- mentbyaligningstems.HomologousncRNAswithstrongstructuralsimilarityareexpected togenerateahigh-scoringstemalignment.Thusthestemalignmentscorecanbeusedto discriminatetruehomologsfromunrelatedsequences.Notethatalthoughonlybase-pairing probabilitiesareforeachstem,thesequencesimilaritywillbeevaluatedduringalign- ment.Apseudoknot-freencRNAincludesparallelornestedstems.Inordertogeneratea legalalignmentwiththeaboveconstraints,weformulatethestemalignmentasanextended chainalignmentproblem.Inthissection,westreviewthebasicchainproblemas inRef.[33]. 2.3.3.1Thebasicchainproblem Let R beasetof n rectanglesinaplane.Weuse r todenotearectangle. v istheweightof r .Whenthereisaneedtodistinguishtrectangles,subscriptsareused.Forexample, the i th rectanglein R is r i ,whichhasanassociatedweight v i .Asubsetof R iscalleda chain ifnohorizontalorverticallineintersectsmorethanonerectangleinthesubsetandif therectanglescanbeorderedsothateachoneisbelowandtotherightofitspredecessor. Theweightofachainisthesumoftheweightsoftherectanglesinthechain.Thebasic chainproblemistoachain C withthemaximumweight C:weight overallchainsin R . 17 Figure2.4Thebasicchainproblem.Twochainsstartingfromrectanglewithweight3are showedandthechainthatincludesrectangleswithweights3,2,and9isthemaximum weightedchaininthegivenrectangleset. Fig2.4showsthemaximumweightedchaingivenasetofrectanglesandtheirweights. Thebasicchainproblemcanbetlysolvedin O ( n log( n ))timebyemployingabalanced binarysearchtree[33].Somebioinformaticsproblemssuchassubstringsimilaritysearch havebeenformulatedandsolvedasthebasicchainproblem[33]. Inordertoillustratetherelationshipbetweenthechainbasicproblemandstemalign- ment,weewhatmakesarectangleinourproblem.Foreachinputsequence,we extractasetofcandidatestemsfromthedotplotoutputbyRNAplfold.Whenthereare twoinputsequences,wehavetwosetsofcandidatestems,eachofwhichaputative stemstructureinoneinputsequence.Arectangle r isformedbyanytwostem-loopsfrom thetwoinputsequences.Inordertodescribethepositionandthesizeofarectangle r ,we denoteitsupper-leftcorneras r:left and r:top .Similarly,thelower-rightcornerisdenoted by r:right and r:bottom .Foreachstem S i fromthesequenceandeachstem S j fromthe secondsequence,( r:left , r:top )=( S i :P start , S j :P start )and( r:right , r:bottom )=( S i :P end , S j :P end ).Fig2.5.(A)showshowstemsformrectangles. 18 Thepositionalrelationshipbetweenrectanglesrepresentsntstructuralmotifs. Fig2.6showsalltwopermittedrelationsbetweentworectanglesinapseudoknot-freestem alignment.RectanglesthatdonotoverlapinXorY-axisrepresentparallelstems.Nested rectanglesrepresentnestedstems.Pseudoknotscanalsoberepresentedbyoverlappingrect- angles.Ourcurrentalignmentprogramdoesnotpermitpseudoknots. Figure2.5(A).Theformationofarectanglebytwostems.Thinarcsrepresentbasepairs. Fourrectanglesareformedbytwostems.Thenumberofbasepairsinthetwostemsforming arectanglecouldbet. r 2 isanexample.(B).Asubsetofrectanglesbetweentwo sequences.Stemsarerepresentedbythickarcs.Thebasicweightislistedforeachrectangle. Themaximumweightedchainandthecorrespondingstemsarehighlighted. Thebasicchainproblemcanonlyhandleparallelstems.Inordertoincorporatenested stems,weformulatestemalignmentasanextendedchainproblemasfollows: 2.3.3.2TheExtendedchainproblem Let R beasetof n rectanglesinaplane.Eachrectangle r i in R hasanassociatedweight v i . Asubsetof R iscalleda chain ifanytworectanglesinthesubsetsatisfy:1)oneisnested intheother,or2)theupper-leftcornerofonerectangleislocatedbelowandtotherightof thelower-rightcorneroftheother.Theweightofachainisthesumoftheweightsofthe 19 Figure2.6Legalrelationsbetweentworectanglesinavalidstemalignmentwithoutpseu- doknots.Eacharcrepresentsabasepair.(A).Parallelstemsleadtonon-overlapping rectangles.(B).Nestedstemsleadtonestedrectangles. rectanglesinthechain.Theextendedchainproblemistoachain C withthemaximum weight C:weight overallchainsin R . Fig2.5.(B)showsanexampleforillustratingthebetweenthemaximumweighted basicchainandmaximumweightedextendedchain.Forthegivensetofrectangles,themax- imumweightedbasicchaincontainstworectangleswithweights130and30.Themaximum weightedextendedchaincontainsthreerectangleswithweights130,100,and30.Andthe extendedchaincorrectlythecommonstemstructuressharedbythetwosequences whilethebasicchainmissesanestedstem. 2.3.3.3Weightofarectangle Bycarefullyngtheweightofarectangle,themaximumweightedextendedchainrep- resentsthecommonsecondarystructuressharedbytwosequences.Ifthetotalweightof thechainisaboveagivenweoutputthestemalignment.Inordertoutilizingthe 20 tbasicchainalgorithmtosolvethestemalignmentproblem,wethreeweights foreachrectangle r ,thebasicweight r:bw ,theinsideweight r:iw ,andtheweight r:fw . Thebasicweightofarectanglerepresentsthematchingscoreofthetwostemsformingthe rectangle.Itisindependentofotherrectangles.Wethebasicweightofarectangle r formedbystem S 1 andstem S 2 as: r:bw = w 1 SeqScore + w 2 BPprobScore w 3 DistancePenalty Here w 1 , w 2 , w 3 areadjustableweightparameters. SeqScore isthenormalizedglobal alignmentscorebetweentheloopregionsofstem S 1 andstem S 2 usingtheRIBOSUM scorematrix[56]. BPprobScore isthesumoftheprobabilitiesofthetwo stems. DistanceScore isusedtopenalizestempairsthathaveabigintheir stem-loopsizes.Weusethesameformulafor DistancePenalty between S 1 and S 2 isasin SCARNA[107]: DistancePenalty = p jS 1 :d S 2 :d j Theinsideweight r:iw givestheweightofthemaximumweightedchainformedbyall rectanglesinside r .Inordertoincorporatenestedstems,wetheweightofa rectangle r asbelow: r:fw = 8 < : r:bw + r:iw; if I r 6 = ? and r:iw> 0 r:bw; otherwise 21 Here, I r isthesetofrectanglesthatlocateinside r .Whentheweightofthemaximum weightedchaininside r ispositive,theweightof r includesboththebasicweightof r andtheweightofthemaximumweightedchaininside r .Whentherearenorectangles inside r ,ortheweightorthemaximumweightedchainisnon-positive,wethe weightof r asitsbasicweight.If r ischoseninthemaximumweightedchain,boththe stemsforming r andthestructurescorrespondingtotheinternalchainareusedasthe structureannotation. 2.3.3.4Extendedchainalgorithm Inthissectionwepresentthepseudocodefortheextendedchainalgorithm.Thebasicidea isthatweusetheweightofarectangleinthebasicchainalgorithm.Whenarectangle ischosenintheoptimalalignment,itsmaximumweightedinsidechainisalsoincluded (ifthereexistsone).Thus,thebasicchainalgorithmcanbeappliedtothestemalignment whentherearenestedstems. Inouralgorithm,inputrectanglesaresortedby r:left inascendingorder.Foreach rectangle r ,welocateallrectanglesinside r .Thenthebasicchainalgorithmisusedto themaximumweightedchaininsideeach r (Line6-17inFig2.7)fromthelastrectangleto theone.Astherectanglesweresortedandweprocessrectanglesfromthelastoneto thewhenwestarttoprocessthe i th rectangle r i ,themaximumweightedchainsand theweightsofalltherectanglesinside r i havealreadybeencomputed.Usingthe weightsofprocessedrectangles,thebasicchainalgorithmcanbereadilyappliedtocompute themaximumweightedchaininside r i (Line18inFig2.7).Procedure BasicChain inFig 2.7computesthemaximumweightedbasicchainanditsweightonasetofrectangles.As thealgorithmforthebasicchainproblemhasbeenpresentedinpreviouswork,wereferusers 22 to[33]forthedetailedpseudocode. Recursivelynestedrectangles,whichcorrespondtonestedstemsformorethantwolevels, areautomaticallyprocessedbythealgorithm.Figure2.8showsanexample.Accordingto theresultofsorting, r 3 isprocessedBecausethereisnointernalrectangleinsideit,its weightis10.Then r 2 isprocessed.Ithasoneinternalrectangle r 3 .Thebasicchain algorithmreturns r 3 asthemaximumweightedchaininside r 2 .So r 2 hasaweightof 30+10=40.Whenprocessing r 1 ,ithastwointernalrectangles, r 2 and r 3 ,withweights 40and10,respectively. r 2 hasalreadyincorporated r 3 asitssubstructure.Byusingthe basicchainalgorithmon r 2 and r 3 , r 2 isselectedasthemaximumweightedinternalchain for r 1 .Theweightof r 1 is120+40=160andthechainconsistsof r 1 , r 2 ,and r 3 . Theconsensusstructureoftheinputsequencescontains3recursivelynestedstems. Hereweanalyzetherunningtimeofthealgorithm.Supposethenumberofrectangles inthesetis n andthenumberofrectanglesinsidearectangle r i is m i .Thesortingin line1takes O ( n log( n ))time.Decidingtheinternalrectanglesetsofallrectanglesisthe2- dimensionalrectangularrangesearchproblemincomputationalgeometryandcanbesolved in O ( P n 1 m i + n log( n ))time[18].Line7-14takes O ( m i log( m i ))time,soline6-16takes O ( P n 1 m i log( m i ))time.Line17takes O ( n log( n ))time.Sothetotalrunningtimeofthe algorithmis O ( n log( n )+ P n 1 m i + P n 1 m i log( m i )).Thenumberofinternalrectanglesis usuallysmallformostrectangles.Inaddition,thebasepairingmatrixforasequenceis usuallyasparsematrix.Weapplythreshold P base and P avg todiscardstemswithsmall probabilitiesandadoptseveralstrategies(SeeFiltrationSection)toremoverect- anglesthatarenotlikelytobeinacorrectalignment.Thus,thenumberofrectanglesis smallerthantheinputsequencelengthonaverage.Someexamplesof n willbegiveninthe ExperimentalResultsSection. 23 Figure2.7Algorithmforcomputingmaximumweightedextendedchainonasetofrectangles Figure2.8Anexampleshowinghowrecursivelynestedrectanglesarehandledintheproposed extendedchainalgorithm. r 1 :bw =120, r 2 :bw =30, r 3 :bw =10. 24 2.3.3.5Filtration Inchain-RNA,weusetwotypesofrationtospeedupourtool: Filtrationofwindowpairsthatareusedtoformrectangles.Allwindowpairswith lowstructuralandsequencesimilaritywillbediscarded.Fortwogenomicsequences oflength M and N ,thetotalnumberofwindowpairsis b M W c +1 b N W c +1 .Here W and areslidingwindowsizeandstep,respectively.Thisbecomesverylargeifthe inputsequencesarelong.Infact,mostofthewindowpairscontainnon-homologous sequences.Inchain-RNA,weuseasimpletoremovewindowpairsthatarenon- homologousbasedontheirsequenceandstructuralsimilarity.First,ifthesequence identityissmallerthanathreshold,thewindowpairisdiscarded.Second, Wediscardwindowpairsthathavesmallnumberofrectangles.Therationalehere isthatweexpecttoseeatleastsomenumberofwell-conservedstemsinapairof homologousncRNAs. Filtrationofrectanglesthatareusedasinputtotheextendedchainalgorithm.As weallowstemswithsmallbase-pairingprobabilities,alargenumberofstemscanbe generatedusingRNAplfold.However,manyofthemarenotlikelytobeincludedin theoptimalchain.First,weexpectthatrectanglessharedbytruehomologstend tohavesimilarsagainstthestartingpositionsofthehiddenncRNAs.Thus,we excluderectangleswith j r:left r:top j >˝ or j r:right r:bottom j >˝ .Second,all rectangleswithnegativeweightsareremovedfromfurtherprocessing.Asweusethe slidingwindowstrategytoconductlocalalignment,usingasmall ˝ willjeopardizethe sensitivity.Ontheotherhand,usingabig ˝ willtlyincreasethecomputa- tionalcostandalsoleadtohighFPrate.Inourexperiments,wechooseavaluefor ˝ 25 inordertoachieveacceptablesensitivityand. Thedetailedparametersincludingthewindowsize,thestepsize,and ˝ willbegivenalong theexperimentsetup. 2.4ResultsandDiscussion Weconductedtwosetsofexperimentstoevaluatetheperformanceofchain-RNA.Asthe maingoalofncRNAsearchistolocatethepositionsofncRNAgeneshiddeninlonggenomic sequences,ourperformanceevaluationexamineshowmanyoutputsoverlapwithactual ncRNAs.Intheexperiment,weusedchain-RNAtosearchannotatedhomologous ncRNAsingenomicsequencesandevaluateditssensitivityandfalsepositive(FP)rate.In thesecondexperiment,weappliedchain-RNAtonovelncRNAsearchinthetranscriptome ofthe Burkholderiacenocepacia genomebycomparingputativetranscribedregionsagainst asetofrelatedbacterialgenomes.Ourexperimentalresultsshowthatchain-RNAhasa tbetterbetweensensitivityandFPrateforncRNAsearchthanBLAST andcanidentifynovelncRNAs. 2.4.1Evaluatingtheperformanceofchain-RNA 2.4.1.1BenchmarkConstruction Inordertoevaluatetheperformanceofchain-RNA,weconstructedabenchmarkdataset thatcontains36typesofncRNAsand1,664pairofhomologousgenomicsequencesfromthe BRAliBase2.1[119].ThemajorstepstoconstructourbenchmarkaresimilartoPLAST- ncRNA[15]. 26 2.4.1.1.1SelectionofncRNAs. AsBRAliBase2.1containsadiversesetofncRNAs, weusedtheK2subsetofBRAliBase2.1asthestartingpoint.TheK2datasetcontains thousandsofhomologousncRNApairsfor36ncRNAfamilies.Foreachfamily,werandomly selected50pairsfromK2.Thisgeneratedadatasetcontaining1,664pairs.Toevaluate theperformanceofthebenchmarkedtoolsontsequencesimilarities,weseparatethe datasettothreedatasetsaccordingtotheirpairwisesequenceidentity.Asitiscommonly believedthatthesequenceidentityaround60%isthetwilightzoneforncRNAalignment[27], weuse60%and70%asThedatasetcontainsncRNAswhichhavesequence identityfrom0-60%.TheseconddatasetcontainsncRNAshavingsequenceidentityfrom 60%-70%.ThelastdatasetcontainsncRNAswithsequenceidentityrangingbetween70% and100%.Thenumberofpairofsequencesofthethreesubsetinourbenchmarkdatacan befoundinTable2.1. 2.4.1.1.2Addingcontextsequences. AsncRNAsearchshouldbeabletodetecthid- denncRNAsingenomicsequences,werecoveredthegenomiccontextforallchosenncRNA homologsfromthepreviousstep.ForeachncRNAsequence,itssurroundinggenomicse- quencewasobtainedfromtheEMBLwebsite 1 accordingtotheaccessionnumber,and thencRNAwasembeddedinthecontextsequences.Weadded2000ntcontextgenomic sequencestoeachendofthencRNA(Ifthegenomiccontextisshorterthan2000,weused theactualsize).Foreachcontextgenomicsequence,regionsannotatedasprotein-coding genesinEMBLweremaskedfromfurtheranalysis.Thisbreaksagenomicsequenceintoa setofintergenicregions(IGRs).WediscardedallIGRsshorterthanthewindowsizesetfor RNAplfold(i.e.100inourexperiment).WealsocheckedtheexistingncRNAannotation 1 http://www.ebi.ac.uk/embl/ 27 onthecontextgenomicsequencestomakesureourchosenncRNAsaretheonlyncRNAsin thesequencepairs. 28 29 Table2.1Informationofthedatasetandtotalrunningtimeofeachbenchmarkedtooloneachdataset.Column length shows theaveragelengthofgenomicsequencesinadatasetaftermaskingtheproteincodinggenes.Theaveragenumberofrectangles areshownforapairofwindowsofsizes100nt 100nt.Thetimeofchain-RNAincludestherunningtimeofRNAplfold. sequencesequencerunningtime(h) identitypairslengthrectangleschain-RNAFOLDALIGNMXSCARNABLASTPLAST-ncRNA 0%-60%822219923.284.92055.9512.80.240.53 60%-70%438166119.516.1601.2147.90.110.25 70%-100%304156625.17.57324.0108.20.070.14 2.4.1.2ofsensitivityandfalsepositiverate WeevaluatetheperformanceofhomologysearchtoolsusingsensitivityandFPrate.First wea hit .Mosthomologysearchtoolsrelyonascoretodeterminewhetheran alignmentscoreindicatesahomologouspair.Forexample,BLASTusesanE-value Forchain-RNA,FOLDALIGN,PLAST-ncRNA,andMXSCARNA,weuseascore AnypairofsubstringswithalignmentscoreorE-valuesatisfyingthecis a hit .Foranalignmenttool,letthesetofhitsoutputbythetoolbe H .Letthesetof truehomologoussequencepairsembeddedinthegenomicsequencesbe H + .Ahitisatrue hitifandonlyiftheoverlappingpercentagebetweenahitandthetruehomologouspairsis aboveagiventhreshold.Forapairofinputsequenceswithsizes M and N ,wethe sensitivityandFPrateforapairofinputsasfollows: sensitivity = j H \ H + j j H + j FPrate = j H n H + j MN H \ H + isthesetoftruehits. H n H + isthesetofallfalsehits.OurtionofFP ratefollowsthestandardinwhich FPrate = FP FP + TN .Inourproblem, FP is j H n H + j .The\conditionnegative"setbyFP+TNincludesallthehitsthatdonot sharerequiredoverlapwiththeactualncRNAs.AsthenumberofTPhitsisfarsmaller thanthespaceofTN,weuse MN asanestimateofFP+TN.Thisestimationleadstoan underestimatedFPrateforallbenchmarktoolsbecause MN actuallyincludestheTPhits. ForeachtypeofncRNA,wehavemultiplepairsofinputgenomicsequences.Thesen- sitivityandFPrateforatypeofncRNAiscomputedastheaverageofsensitivityandFP 30 rateforeachpairofsequences. 2.4.1.3PerformanceofncRNAsearchtools Weappliedchain-RNA,BLAST,PLAST-ncRNA,MXSCARNA,andFOLDALIGNtothe benchmarkdataset.Forchain-RNA,thewindowsizeforthegenome-scalefoldingby RNAplfoldissetto100.Duringthesearchofchain-RNA,weusedaslidingwindowof size100withaslidingstepof20.Allwindowpairsthatdonotmeetthestrategy werediscarded.Weusedthebl2seqprogramoftheNCBIBLASTsuite(version2.2.25)to searchhomologousncRNAsbetweeneverypairofgenomicsequencesinourbenchmarkdata set.Inordertoboostthesensitivity,weusedwordsize4,whichistheminimumwordsize allowed.Otherparametersaresettotheirdefaultvalues.PLAST-ncRNAisdesignedto generatelocalalignmentsbetweenashortqueryandalonggenomicsequences.Itdoesnot acceptanyofourbenchmarksequencepairsasinputbecausetheinputsequencesexceedthe lengthlimit.Thus,foreachpair,weusedaslidingwindowofsize100withstep20onone ncRNAsequenceandusedtheothergenomicsequenceasanotherinput.MXSCARNAisa globalstructuralalignmenttoolforncRNA.Ithastheadvantageofincorporatingposterior base-pairingandalignmentprobabilitiesinthestructurealignment.SinceMXSCARNAis aglobalalignmenttool,wemanuallycutthelonggenomicsequencesintowindowsofsize 100withaslidingstepsizeof20.Globalalignmentwasperformedoneachpairofwindows fromtwoinputsequences.FOLDALIGNcanoutputlocalstructuralalignments.Wecutthe longgenomicsequencesintowindowsoflength200,whichisthemaximumlengthallowed byFOLDALIGN.ThelocalmodewasusedtosearchhomologousncRNAsoneachwindow pair.Wealsotestedthe\scan"modeofFOLDALIGN.Butitwasmuchslowerthanthe localmodeonourdatasetandthuswechosetoreporttheresultsunderthelocalmode. 31 Wetestedtheperformanceofthebenchmarkedtoolsusingtoverlapthresholds. Fig.2.9,Fig.2.10,andFig.2.11showtheROCcurvesofeachbenchmarkedtoolonthe threedatasetsusingoverlapthreshold20%and50%.ThedatapointsforBLASTwere generatedusingword-size4andE-value0.00001,0.0001,0.001,0.01,0.1,1,10,100, 200,500,and1000.SomepointsoverlapbecauseofhighsimilarityinsensitivityandFP rates.Thedatapointsforchain-RNAweregeneratedusingscorefrom20to800with stepsize50.WemotheMXSCARNAsourcecodetooutputthealignmentscoresand usedscoreonthealignmentscorestogeneratedatapoints.ForPLAST-ncRNA,we usefrom0.1to0.9withastepof0.1ontheoutputaverageposteriorprobabilities. TherunningtimeofeachtooloneachdatasetisshowninTable2.1. TheROCcurvesshowthattheperformanceofthetestedtoolsimproveswiththeincrease ofthesequenceidentity.Theperformancencebetweenthestructuralalignmenttools andBLASTalsobecomessmallerwiththeincreaseofthesequenceidentity.Chain-RNA archivedslightlybetterperformancethanthetwostructuralalignmenttoolsMXSCARNA andFOLDALIGNonthedatasetwiththelowestsequenceidentitywhentheoverlap is20%.FOLDALIGNperformedbestandChain-RNAwasthesecondbestwhentheoverlap is50%onthisdataset.Withtheincreaseofsequenceidentity,thethreestructural alignmenttoolstendtohavesimilarperformance.However,therunningtimeofchain-RNA ismuchshorterthanMXSCARNAandFOLDALIGN. BLASTandposteriorprobabilitybasedtoolPLAST-ncRNAareverytimetcom- paredtostructuralalignmenttools.OnereasonisthatneitherBLASTnorPLAST-ncRNA outputssecondarystructuresofthealignments.BLASTshowstheworstbetween thesensitivityandFPrateofthetestedtools.WeobservedthatBLASThasgoodper- formanceonsometypesofncRNAswithhighsequenceidentity,suchasU2andSRP euk 32 elements.Theiraverageidentityinourtestsetisover80%andthusBLASTistin thatcase.However,whenthesequenceidentitydecreases,itbecomesforBLASTto distinguishthencRNAsfromthesurroundinggenomicsequences.Asweusedthesmallest wordsizeallowed,thelowsensitivityismorelikelytobecausedbythedynamicprogramming stepratherthanthewordstep.Thelowsequenceidentityleadstolowalignment scoresandthusbigE-values.Ontheotherhand,basedonposteriorprobability,PLAST- ncRNAshowssuperiorperformanceinidentifyingthencRNAswithlowsequenceidentity. However,theirhighestsensitivityislimitedevenunderthelowestavailablescoreThe highestsensitivityisaround43%forthedatasetwiththesequenceidentitybelow60%with overlap50%.NotethatwecannotfurtherimprovethesensitivityofPLAST-ncRNA becausewehaveusedtheminimumposteriorprobabilityoutputbyPLAST-ncRNAasa Still,usingthealignmentposteriorprobabilityispromisingaccordingtoexperimen- talresults.Ontheotherhand,whileMXSCARNAincorporatedboththealignmentand base-pairingposteriorprobability,itdoesnotdemonstratesuperiorperformance. Wealsoevaluatedtheperformanceofttoolsusingthestartpositionandtheend positionbetweenthepredictedandactualncRNAs.ForMXSCARNA,FOLDALIGN andChain-RNA,weusedallthehitsoutputunderthedefaultparametersettings.For BLAST,weusedallthehitsgeneratedwiththedefaultE-value10.0.ForPLAST-ncRNA, allhitsoutputundertheminimumposteriorprobabilityareused.Foranytoolsthatoutput multiplehitsforapairofncRNAs,wepickedthehitthathastheminimumstartposition tothestartpositionoftheactualncRNAs,andthehitthathastheminimum endpositionditotheendpositionoftheactualncRNA.Thedistributionsofthe ofthepredictedpositionsandtheactualncRNApositionsareshowninFig.1and Fig.2intheAppendix.ThedistributionsforChain-RNA,MXSCARNA,andFOLDALIGN 33 aresimilar,whileChain-RNAandMXSCARNAhaveloweststandarddeviationsandmin- imumabsolutemeanvalues.However,Chain-RNAismuchfasterthanMXSCARNAand FOLDALIGN.BLASThaslargestandarddeviationsandthemeanvaluesarefarawayfrom zerocomparedtoothertools.ThereasonisthatBLASTislocalandtakesthewholegnomic pairsasinput.ForsomencRNApairs,especiallyoneswithlowsequenceidentity,thepre- dictedstartandendpositionsarehundredsorthousandsbasesfarawayfromtheactual ncRNApositions.ThisindicatethatBLASTisnotsuitableforlocatingremotehomologous ncRNAs. WealsotestedSCARNA LM,whichisamultiplelocalalignmenttool.Forallpairwise sequencesinourbenchmarkdataset,SCARNA LMdidnotoutputanyhit.Changing parametersdidnothelp.However,SCARNA LMoutputlocalalignmentwhentherewere multiplesequencesasinput.Insummary,wewerenotabletomakeSCARNA LMworkon thebenchmarkdataset. 2.4.1.4Filtration Ourtoolusesantonstrategytoremovewindowpairsthatarenotlikelyto containtruencRNAs.Inaddition,rectanglesthatarenotlikelytobeinthemaximum weightedchainarealsodiscarded.Usingthedefaultsetting,about30%ofthewindowpairs arewithoutfurtherprocessing.Thisimprovesthetimeofourpipeline. Table2.1showstheaveragenumberofrectanglesforeachwindowpairofsize100ntafter applyingtherectanglestrategy.Theaveragenumberofrectanglesisusuallysmaller thantheinputsequencelength. 34 Figure2.9ROCcurvesforbenchmarkedtoolsundertoverlaponthedata setwithsequenceidentityfrom0%to60%.\X/N"representstheROCcurveforthetool XundertheoverlapN%.Forexample,\Chain-RNA/20"meanstheROCcurveis generatedforChain-RNAusingtheoverlap20%. Figure2.10ROCcurvesforbenchmarkedtoolsundertoverlaponthedata setwithsequenceidentityfrom60%to70%.\X/N"representstheROCcurveforthetool XundertheoverlapN%.Forexample,\Chain-RNA/20"meanstheROCcurveis generatedforChain-RNAusingtheoverlap20%. 2.4.1.5Analysisofthefalsepositivehits Theresultsoftheexperimentshowchain-RNAgenerateshighscorehitsthatsharenooverlap withannotatedncRNAhomologs.Thesehitsareintofalsepositivehits.However, acloserlookdemonstratesthatthesehitsshowstrongstructuralsimilarityandthusthey tendtogainahighstemalignmentscore.Fig2.12showsthederivedstructuresforthree 35 Figure2.11ROCcurvesforbenchmarkedtoolsundertoverlaponthedata setwithsequenceidentityfrom70%to100%.\X/N"representstheROCcurveforthetool XundertheoverlapN%.Forexample,\Chain-RNA/20"meanstheROCcurveis generatedforChain-RNAusingtheoverlap20%. falsepositivehitswithhighscores.Allthreeexamplesshowthattheinputsequenceshave strongstructuralconservation.Thus,additionalevidencebeyondsequenceandstructural similarityisneededtoscreenfunctionalncRNAs. Figure2.12Derivedstructuresforthreepairsoffalsepositivehits. 36 Table2.2Selectedbacterialstrainsandtheirchromosomeinformation. StrainChrGenBankIDLength(nt) CupriavidusChr1CP0003523,928,089 metalliduransCH34 CupriavidustaiwanensisChr1CU6337493,416,911 LMG19424Chr2CU6337502,502,411 PolynucleobacterChr1CP0010101,560,469 necessarius RalstoniaeutrophaChr1AM2604794,052,032 H16Chr2AM2604802,912,490 RalstoniaeutrophaChr1CP0000903,806,533 JMP134Chr2CP0000912,726,152 RalstoniaeutrophaChr1CP0010683,942,557 pickettiiChr2CP0010691,302,238 RalstoniaChr1AL6460523,716,413 solanacearum 2.4.2Usingchain-RNAinnovelncRNAsearch AnalysisonavailableRNA-seqdatahasshownthatasizablefractionoftheRNA-seqreads canonlybemappedbacktotheintergenicregionsofthereferencegenome.Itremains animportantquestiontounderstandwhetherthesereadsaregeneratedbytranscriptional noise,novelprotein-codinggenes,ornovelncRNAs.Inthisexperiment,weaimatnovel ncRNAsearchintheunannotatedtranscribedregionsbycombiningheterogeneousevidence, ofwhichexistenceofncRNAhomologsinrelatedgenomesisexaminedbychain-RNA. WedownloadedthetranscriptomicdatageneratedusingRNA-seqforthestrainAU1054 of B.cenocepacia fromthethewebsiteofpreviouswork[122].Intotal,thereare11,163,555 nonpoly(A)readsofaveragelength41nt.Wedownloadedtheannotatedproteinsand 37 ncRNAsforthereferencegenomesfromtheIntegratedMicrobialGenomes(IMG)system[71] andRfamdatabase[30].Inaddition,weappliedInfernal[83]toidentifymembersofexisting ncRNAfamiliesinRfam[30].Thesetwostepsensurethatwehavearelativelycomprehensive annotationofprotein-codingandncRNAgenes.Allreadsweremappedbacktothereference genomeusingBowtie[61]withatmost2mismatchesallowed.Readsthatoverlapwith anyannotatedncRNAsorprotein-codingregionsareexcludedfromfurtheranalysis.The remainingreadswereclusteredusingBlockbuster[60].Notallclustersarelikelytranscribed. Someofthemcontainfewreads.Someofthemhavehighlybiasedreaddistribution.In addition,weareonlyinterestedinncRNAs.Thus,wedidaonall27,240clusters outputbyBlockbusterinordertoidentifyasetoftranscribedregionsthatcouldcontain ncRNAgenes.Thecriteriaare: Thelengthoftheclusterisbetween50ntand200nt. Thetotalnumberofreadsoftheclusterisbiggerthan40. Thecoverageoftheclusterisbiggerthan0.1. DistancesfromtheclustertoadjacentproteincodinggenesorannotatedncRNAgenes arebiggerthan150nt. Theclustermustexhibitastablesecondarystructure,whichismeasuredbytheMFE valueandthepercentageofpredictedbasepairsbyafoldingtool[41,126,75,42].We computedtheMFEstructureofeachcluster,andusedtheMFEcof-20.00to outclustersthathaverelativehighMFE. Herethecoverageisnedasthetotalnumberofdistinctreadsintheclusteroverthe lengthofthecluster.Aftern,wehave35clustersleftforfurtheranalysis.As 38 theseclustersexhibitsmallMFEandstablestructure,theycouldbencRNAgenes.Wethus furtherinvestigatedwhethertherearehomologouscounterpartsinrelatedbacterialgenomes. AsthencRNAmightbehiddeninthesetranscribedregions,alocalalignmenttoolsuchas chain-RNAisneeded. WedownloadedallgenomesinthebetasubdivisionfromtheNCBImicrobialFTP 2 . TheassociatedphylogenetictreewasobtainedfromthePATRCwebsite 3 withtaxonomyID 119060.AfterdiscardingstrainsthataretooclosetotheAU1054(startingwithBurkholde- ria)and12strainswhosecompletegenomesarenotavailableatthetimeofwriting,7strains remained.Thereare11chromosomesforthe7strainsintotal.Theinformationofthese strainsissummarizedinTable2.2.WeappliedBLASTtoclosehomologsinrelated genomes.WeranBLASTusing35transcribedregionsasqueriesand11chromosomesas targets.Forall35queries,25ofthemhaveBLASThitswithall11chromosomes,showing strongsequenceconservationinrelatedgenomes.Ofthe10remainingqueries,onehasjust onetargetchromosomelackingBLASThits,alltheothershavenoBLASThitsreportedin anyofthe11chromosomes.Thusthe10queriesmightlacktsequencesimilarity andcannotbedetectedbyBLAST.Wethusappliedchain-RNAtoexaminewhetherthere arehomologswithstrongstructuralsimilarityinrelatedgenomes.Theinformationofthe 10transcribedregions,theinputandoutputofchain-RNA,andtherunningtimeare summarizedinTable2.3.Foraquerywhoselengthissmallerthan100,weusedthelength ofthetranscribedregionasthewindowsizeforRNAplfold.Forquerieswithsizebiggerthan 100,thewindowsizeofRNAplfoldissetto100.Thewindowsizeusedforthesliding-window strategyinchain-RNAissetasthesamewiththefoldingwindowsizeforRNAplfold.Other 2 ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ 3 http://www.patricbrc.org 39 Table2.3Transcribedregionsandtargetchromosomesareinputtochain-RNA.\all"inthe TargetChrs columnmeansthatallgenomicsequencesforthe7strainsinTable2.2areused astargetsequences. TranscribedLengthTargetNumberRunning regionIDChrsofhitstime(h) 18137all5895.71 6096all312.96 8058all01.15 12092CP001010110.11 146160all235.59 16792all03.32 182113all272.77 187103all94.14 204165all5119.58 272177all6233.54 parametersarekeptthesameastheexperiment.Wekeepallalignmentswithscore abovethechosencuto380,which,basedontheresultsoftheexperiment,gaveboth theacceptablesensitivityandFPrate. Table2.3showsthatnotallputativetranscribedregionshavehomologsinrelated genomes.LackinganytalignmentbyBLASTorchain-RNAdoesnotsupport thefunctionalrolesofthoseregions.Threetranscribedregionsincuralargenumberofhits withhighscores.AnumberofthesehitscontainsimpleAUrepeatsandthushavehigh probabilitiestoformbasepairs.However,theirtophitsshowstrongstructuralconserva- tioninthepredictedhomologouspairs.Thus,despiteexistenceoffalsepositivehits,these regionsareverylikelytobefunctionalncRNAs.Asanexample,Fig2.13showsthede- rivedsecondarystructuresofthetranscribedregionwithID18anditshomologsinall11 targetchromosomeswiththetopalignmentscore.Accordingtothisallstructures 40 arehighlysimilar,providingevidencethatthederivedstructureistheirnativesecondary structure.Thehighstructuralsimilarityalsoindicatesthatthesesequencesarehomologous. Thelocationsofthe6putativenovelncRNAscanbefoundatourwebsite. Figure2.13DerivedsecondarystructuresforhomologousncRNAsinall11chromosomesfor thetranscribedregionwithID18. 2.5Conclusion Inthiswork,weintroduceanew denovo ncRNAsearchmethodanditsimplementation calledchain-RNA.ThemajornoveltyofouralgorithmisthatitformulatesncRNAstruc- turalalignmentproblemasanextendedchainproblem.Solvingtheextendedchainproblem basedontheexisting O ( n log( n ))algorithmprovidesthebasisforthegenome-scalestruc- turalalignmentscheme,where n isthenumberofrectangles.Byusingthegenome-scale localfoldingalgorithmandtheslidingwindowstrategy,chain-RNAisabletolocal structuralalignmentsbetweenlonggenomicsequences.Wetestedchain-RNAinlocating homologousncRNAsinalargedatasetcontainshomologousncRNApairsofvariousse- quencesimilarity.Inaddition,weappliedittonovelncRNAsearchinatranscriptomic 41 dataset.Theexperimentalresultsshowthatchain-RNAcanachievetlyhigher sensitivityinidentifyingconservedncRNAelementsthanBLAST.ByscarifyingasmallFP rate,chain-RNAcanachievebettersensitivitythanPLAST-ncRNAonsequencepairswith identityabove60%.Chain-RNAismuchfasterthanotherstructuralalignmenttoolswhich havecomparablebetweensensitivityandFPrate. Ourworkdemonstratestheutilityofchain-RNA.Inaddition,itshowsthatgenome- scale denovo ncRNAsearchisstillachallengingproblemdespitethepromisingprogress.In particular,theperformanceofalltestedtoolsonidentifyingncRNAswithidentitybelow60% isnotsatisfactory.Weplantoimprovechain-RNAinthefollowingaspects.First,weplan todesignabetterdiscriminationmetric.Currently,weareusingascoretodetermine whetheralocalalignmentshouldbeoutput.Amoreadvancedmethodistocomputethe E-valuesforlocalalignmentscoresasinBLAST.WeexpectthatusingE-valuecanimprove thediscriminationpowerofchain-RNA.Second,Amoreestrategyisneeded tocontrolfalsepositiverateandimprovethe.weplantousemachinelearning methodssuchasSVMtodoinasystematicmanner.Finally,incorporatingthe posteriorprobabilitybasedsequencealignmentandthestructuralalignmentinchain-RNA mightleadtobetterncRNAdetectionperformance. 42 Chapter3 miR-PREFeR:anaccurate,fast,and easy-to-useplantmiRNAprediction toolusingsmallRNA-Seqdata 3.1Introduction PlantmicroRNAs(miRNAs)are ˘ 21ntlongnon-codingRNAsthatplayimportantroles intranscriptionalandpost-transcriptionalregulationofgeneexpression[124].Recently developedgenome-widemiRNAannotationtoolsallutilizesmallRNA-Seqdatatoquantify theexpressionofannotatedmiRNAsandtopredictnovelones[114,120,34,6].These toolsfromseveralofthefollowingproblems.First,thesetoolsusuallyhavevariable sensitivityandhighfalsepositive(FP)ratewhenappliedtoditspecies.Second,most ofexistingNGS-basedtoolsareslow.Third,mostexistingcommand-linebasedtoolsarenot user-friendly.Web-servertoolssuchasmiRanalyzer[34],areeasytouse.Buttheyusually onlyworkforgenomesintheirdatabases,whichinhibituserstoanalyzenewgenomes.In addition,mostofwebservertoolsalsohaveotherproblemslistedhere.Thus,thereisaneed JikaiLei,YanniSun.miR-PREFeR:anaccurate,fastandeasy-to-useplantmiRNApredictiontool usingsmallRNA-Seqdata.Bioinformatics.19:30,2837{2839.2014. 43 foraplantmiRNApredictiontoolthathasgoodperformance(highsensitivity,lowFPrate, andaccurate),worksforallplantgenomes,runsfast,hasasmallmemoryfootprint,andis easytouse. miR-PFEFeRutilizesexpressionpatternsofmiRNAsandfollowsthecriteriaforplant microRNAannotation[78]toaccuratelypredictplantmiRNAsfromoneormoresmall RNA-Seqdatasamples.IthashighsensitivityandverylowFPrate.miR-PREFeRismuch fasteranduseslessmemorythanexistingtools.UsingmiR-PREFeRrequiresminimum informaticsexpertise:ithaslowdependencyonotherprograms;thereisnoneedtocompile orinstallthepipeline;itprovidesacheckpointfeature,whichmakesitveryeasytocontinue anjobfromwhereitwasstopped;thedocumentationispubliclyavailable.The miR-PREFeRpipelineisintheprocessofbeingincorporatedintotheMAKER-Pgenome annotationengine[14]. 3.2DescriptionofthemiR-PREFeRpipeline 3.2.1Ww ThemiR-PREFeRpipelinepredictsplantmiRNAsfromoneormoresmallRNA-Seqsamples. Figure3.1illustratesthewwofmiR-PREFeR.Theinputincludesafastaformatgenome (acompletegenomeorcontigs/supercontigsofangenome/ESTsequencesof aspecies)andoneormoresmallRNA-SeqalignmentThealignmentshouldbein theSAMformat,whichissupportedbyalmostallpopularshortreadaligners.Userscan alsoinputanoptionalGFF3whichcontainsexistingannotationsforthegenome(protein codingregions,repeatedsequences,etc).ThepipelineexcludesfeatureslistedintheGFF3 fromfurtheranalysis.Notethatusersshouldusethisoptionwithcautionbecauseby 44 Figure3.1WwofmiR-PREFeR. excludingproteincodingregionsuserswillnotgetpredictedmiRNAsoverlappingcoding regions.Alltheinputlesandparametersfortheprogramarespina whichistheonlyinputtomiR-PREFeR.Theoutputincludes:1).AnHTMLtablethat summarizestheresultsofthepredictedmiRNAs.Itcontainsthenumberofpredictions, thelengthdistributionofthepredictedmaturemiRNAsequences,thelistofallpre-miRNA sequencesandtheirsecondarystructures,andthedetailedreadmappingagainstthe pre-miRNAsequences.Inaddition,thistableprovidessearchlinkstomiRBase.2).Separate foreasydownstreamanalysis.TheseincludesaCSVformatthatsummarizing predictioninformation,aGFF3annotationforthepredictedmiRNAs,twothat containthepredictedmaturemiRNAsequencesandprecursorsequences,athatcontains thestructureofeachprediction,anamed\miRNA.stat.txt"thatcontainsstatisticsof thepredictedmiRNAs,andafolder\readmapping"thatcontainsreadmapping foreachpre-miRNAprediction.Anexamplelistofalloutputcanbefoundat http: 45 Figure3.2IdenofpeaksfromtwoinputRNA-Seqsamples.Eachcoloredarrow representsareadfromoneRNA-Seqsample.Redandbluearrowsrepresentreadsfrom twotsamples,respectively.Twopeaksareidendaccordingtothedepth D cutoff ,whichis2inthisexample. Figure3.3Generatingcandidateregionsfrompeaks.(A).Thegap g betweentwoneigh- bouringpeaks p 1 and p 2 issmallerthanthegaplength G max .Thus,peaks p 1 and p 2 areclusteredintoonecandidateregion.(B).Peak p 3 hasnoneighbouringpeaksandis extendedtotheleftandtotherighttoformtwocandidateregionswithlength L max . //www.cse.msu.edu/ ~ leijikai/mir-prefer/example-output/ . Thepipelineconsistsofthreemainsteps:idenofcandidateregions,foldingof thecandidateregions,andmiRNAlociprediction.Herewepresentthedetailsofeachstep. 3.2.1.1Candidateregioniden Thestepofthepipelineistoidentifycandidateregionsthatarelikelytoharborex- pressedmiRNAgenes.BasedontheassumptionthatexpressedmiRNAsshouldhaverea- sonableamountofreadsmappedtotheirloci,weusethemappingdepthtoextractregions thatshouldbeanalyzed.SimilarmethodisusedbyothermiRNApredictiontoolssuchas ShortStack.Readsfromallinputsamplesareconsideredwhencountingthemappingdepth. 46 3.2.1.1.1Peaks: Thestepistoidentify peaks .A peak isacontinuousregionina sequencewithminimummappingdepthabove D cutoff ateachnucleotideposition,where mappingdepthisthenumberofreadscoveringthatposition.Here D cutoff isapipeline parameterthatuserscanadjust.Figure3.2givesanexampleonhowtoidentifypeaks,with D cutoff =2. 3.2.1.1.2Candidateregions: Inthenextstep,peaksarecombinedtoform candidate regions ,whicharetheninputtothenextstepofthepipeline.Therearetwoparametersto controlthegenerationofcandidateregions, G max and L max .Apairofneighboringpeaks withgapsizesmallerthan G max areclusteredintoonecandidateregion(seeFigure3.3). Thisneighborjoiningiscontinueduntilnoneighboringpeakwithindistance G max exists. Ifapeakhasnoneighboringpeakswithgapsizesmallerthan G max ,itisextendedtothe leftandtotherightto L max nt,andtwocandidateregionsaregeneratedforthispeak (Figure3.3(B)). L max isthemaximumlengthofacandidateregion.Regionslongerthan L max arediscarded. 3.2.1.1.3Candidatematuresequences: AccordingtotheknowbiogenesisofmiRNA expression,thematureregionofapre-miRNAstem-loopshouldhavehighmappingdepth. Thus,weidentifycandidatematuresequencesaccordingtothenumberofoccurrence(abun- dance)ofeachuniquereadinthepeak.Inparticular,foracandidateregion,wechoosethe topNmostabundantreadsofeachpeakascandidatematuresequencesforthecandidate region.Inthecurrentimplementation,weuse N =2. 47 3.2.1.2Folding Thesecondarystructureofapre-microRNAsequenceisusuallyastem-loop.Byfolding eachcandidatesequenceanddetectingwhetherthesequencecanbefoldedintostem-loop structures,alargenumberofcandidateregionsthatarenotmiRNAlocicanberemoved.In miR-PREFeR,weuseRNALfoldfromtheViennaRNApackagetofoldcandidateregions[68]. RNALfoldcomputesalistoflocallystableRNAsecondarystructuresforaninputsequence. Candidateregionswithatleastonestem-loopstructurearekeptandanalyzedbythenext step. 3.2.1.3MiRNAprediction ThepredictionstepusesthecriteriaforannotatingplantmiRNAasitsbasicprincipleto predictmicroRNAs[78]. 1.TheprimarycriteriaisthatthesmallRNA-Seqdatashouldprovideexceptionalev- idenceofprecisemiRNA/miRNA excision:thereshouldexistabundantreadsthat correspondtothematuremiRNAsequence,andthereshouldbeatleastonereadthat canbepreciselymappedbacktothemiRNA sequence.ThemiRNAandthemiRNA sequencesformaduplexwithtwont,3'overhangs. 2.StructurecharacteristicsofthemiRNA/miRNA duplex.Therearetypicallyfouror fewermismatchedmiRNAbasesinthemiRNA/miRNA duplex.Asymmetricbulges arerareandsmallinsize. Tomeetthecriteria,miR-PREFeRselectsreadsthatareabundantineachpeakas thecandidatematuresequencesasdescribedearlier,checksthestem-loopstructuresofthe 48 candidateregiontoidentifystarsequenceposition,thencheckswhetherthestarsequence ispresentornotintheRNA-Seqsamples. AcharacteristicfeatureofmiR-PREFeRisthatitisabletotakemorethanoneRNA-Seq samplesasinput.ByusingmultipleRNA-Seqsamplesfromttissues/conditions/developmental stages,moreinformationcanbeobtainedfromtheRNA-Seqdata: ForsomemicroRNAloci,it'spossiblethatthereisonlyasmallamountofreadsthat canbemappedbacktotheseloci.It'sossibletopredicttheselociusing asinglesample.Bycombiningallreadsfrommultiplesamplestogether,it'spossible topredicttheselowlyexpressedmiRNAs.ForthepredictedmiRNA,westillrecord theirexpressionineachsample. ForsomemiRNAs,averylargenumberofreadsfrommultiplesamplescanbeprecisely mappedbacktothematureregions.Inaddition,thesecondarystructureofthelocus isanstem-loopthatmeetsthestructurecriteria.However,thereisnoreadthat correspondstothestarsequence.Thesetypeoflociwouldbemissedifwestrictly followtheprimarycriteriawhichrequiresthatthereshouldbeatleastasingleread forthestarsequence.Informationfromtheexpressionpatternsfrommultiplesamples andthewell-formedstructureindicatethatthislocuscontainsatruemiRNAgene. Thusweloosetheprimarycriteriatodroptherequirementofthepresenceofthestar sequenceinthecasethatthereareaverylargenumberofreadsineachsamplethat canbepreciselymappedbacktoamatureregion.Inthecurrentimplementation, whenthereisnoreadcorrespondstothestarsequence,thethresholdsaresetto1000 and100.Thatis,thereshouldbeatleast1000readsinallsamples,andatleast100 readsineachsample. 49 ByusingtheannotationcriteriaofplantmiRNA,miR-PREFeRisabletoprecisely predictplantmiRNAswithhighspeyandhighsensitivity.Withthehelpofinformation frommultiplesamples,thesensitivitycanbefurtherimprovedwithouttincrease ofthefalsepositiverate. 3.2.1.4Checkpointing Thepipelinemakesacheckpointaftereachmajorstepinthepipelineandmakescheckpoints periodicallyinthetime-consumingfoldingstage(seeFigure3.1).Userscaneasilysetthe checkpointgranularityinthenThisprovidesusersaneasywaytorestart jobsfromwherethejobswerestopped.Thisfeatureisparticularlyusefulin varioussituations.Forexample,it'spossiblethatajobrunningonahighperformance computingsystemiskilledduetoresourcelimits,ortheuser'slaptopisrunningoutof battery.Byrestartingajobfromthelatestcheckpointratherthanstartingitfromthe beginning,alotoftime/resourcescanbesavedforlong-runningjobs.Asfarasweknow, miR-PREFeRisthemicroRNApredictiontoolthatprovidessuchausefulfeature. 3.2.2Implementationandexternalprogramdependency miR-PREFeRisimplementedusingPython.Thereisnoneedfortheuserstoinstallitandit worksunderLinux/MacOS.ThepipelineisabletousemultipleCPUs/coresautomatically. Thismakesthepipelinemuchfasterthanexistingprogramsandiseasierfortheusersto utilizetheircomputationalresources. ThemiR-PREFeRpipelinereliesontwoexternaltools.TheoneistheSAMTools alignmentsmanipulatingpackage[64].It'swidelyusedintheNGSdataanalysiscommunity anditiseasytoinstallanduse.ThesecondprogramistheRNALfoldprogramfrom 50 theViennaRNApackage.ItisusedbyalmostallmicroRNApredictiontools.Itiswell documentedandiseasytoinstall. 3.3Results 3.3.1Evaluatingpredictionperformanceonsixdatasets WebenchmarkedmiR-PREFeRandseveralpopularornewlydevelopedNGS-basedmiRNA predictiontoolsincludingShortStack[6],miRDeep-P[120],miRanalyzer[34],miRDeep2 [26],miRDeep [4],andMIReNA[74].ThefeaturesofthetoolscanbefoundinTable3.1. Weusedsixdatasetsfrom Arabidopsisthaliana ,Medicago,papaya,peach,andmaizeto evaluatetheperformanceofthetools.Thedetailsofthedatasetscanbefoundinthe Table3.2.MicroRNAsinthe Arabidopsisthaliana genomeiswellannotated.Thusweknow thegroundtruthmiRNAsinthegenome.Weusedtwodatasetsfrom Arabidopsis thaliana toevaluatethesensitivityandfalsepositiverateofthebenchmarkedtools.The dataset, Athl-2 ,whichcontainstwosamples,isthedatasetusedintheShortStackpaper[6]. Theseconddataset, Athl-6 ,containssixpublishedsamplesfromttissues/conditions. Table3.3showstheperformanceofeachtoolonthetwodatasets.miRanalyzerhasthebest sensitivitybutthenumbersofpredictionsareverylargeonbothdatasets.Itisverylikely thatmostofthesepredictionsareFPpredictions.ItshouldbenotedthatmiRanalyzer triestodetectannotatedmiRNAsthataresavedinitsowndatabase.Thisiswhyits sensitivityishigherthanallothertools.miRDeep2,miRDeep*,andMIReNAhavevery lowsensitivityonbothdatasets,whichindicatesthattheyshouldnotbethechoicefor annotatingplantmiRNA.ShortStackisdesignedtobespthusithasthelowestFP rateandreasonablesensitivity.miR-PREFeRhasthesecondhighestsensitivityandvery 51 lowFPrate.ThepredictionsfrommiR-PREFeRhavelargeintersectionswithpredictions fromShortStack(SeeFigure3.4)Accordingtothecomparison,miR-PREFeR,ShortStack, andmiRDeep-Phavebettersensitivity,FPrate,runningtime,andcanbeappliedtoany plantspecies.Thus,tofurtherevaluatetheperformanceofthetools,weranmiR-PREFeR, ShortStack,andmiRDeep-Ponthreemorespecies:Medicago(4RNA-Seqsamples),peach (4RNA-Seqsamples),andpapaya(3RNA-Seqsamples).Theinformationofallthesmall RNA-SeqdatausedintheexperimentscanbefoundinTable3.2.Thelinkstodownloadthe datasetscanbefoundat http://www.cse.msu.edu/ ~ leijikai/mir-prefer/dataset/ . 52 53 Table3.1ComparisonoffeaturesofmiRNAannotationtools miR- PREFeR ShortStackmiRDeep- P miRanalyzermirDeep2miRDeep*MIReNA miRNAannota- tion YesYesYesYesYesYesYes Othersmall RNAannotation NoYesNoNoNoNoNo Inputmulti- pleRNASeq samples YesYesNoNoNoNoNo Speicies PlantAllPlantAllanimalanimalAll Server/Local Local (command- line) Local (command- line) Local (command- line) Server, Local (command- line) Local (command- line) Local(GUI, command- line) Local (command-line) Checkpoint YesNoNoNoNoNoNo Canusemuliple processors/cores AutomaticManualManualManualManualManualManual algorithm miRNA annotaton criteria miRNA annotaton criteria Naive Bayes Random foresttree Naive Bayes NaiveBayes5criteriausedto secondary structures Table3.2InformationofthesmallRNA-Seqdatausedintheexperiments SpiesesAccessIDDescription Athl-6Genomesize:121MB Arabidopsisthaliana SRX065858Root,endodermis Arabidopsisthaliana SRX065854Root,stele Arabidopsisthaliana SRX065857Root,cortex Arabidopsisthaliana SRX107308Seedlings,coldstress Arabidopsisthaliana SRX107309Seedlings,saltstress Arabidopsisthaliana SRX107310Seedlings,saltstress Athl-2Genomesize:121MB Arabidopsisthaliana SRR275588rosetteleaves Arabidopsisthaliana SRR051927above-groundtissues MedicagoGenomesize:425MB Medicagotruncatula GSM769272Seeding Medicagotruncatula GSM769273Root Medicagotruncatula GSM769274Flower Medicagotruncatula GSM769277Leaf papayaAssemblysize:349MB Caricapapaya GSM712525Caricapapayaleaves Caricapapaya GSM712526Caricapapayawers Caricapapaya GSM712527Caricapapayainfectedleaves peachAssemblysize:231MB Prunuspersica GSM944971Peachroot Prunuspersica GSM944972Peachleaf Prunuspersica GSM944973Peachwer Prunuspersica GSM944974Peachfruit maizeGenomesize:2095MB Zeamays GSM1178887Maizehen1-1mutant Zeamays GSM1178886Maizewildtypecontrolforhen1-1mutant Wecomparedtheoverlapsofthepredictionsofthethreepredictiontools.Figure3.4 (a)-(e)illustratestheintersectionsofthepredictionsofttoolsandexistingmiRNA databasesusingVenndiagrams.Foralledatasets,thenumbersofpredictionsfrom miRDeep-Pareverylarge.ItisverylikelythatmostofthesepredictionsareFPpre- dictions.ThepredictionsfrommiR-PREFeRhavelargeintersectionswithpredictionsfrom ShortStack.BothtoolsusethecriteriaforplantmiRNAannotationastheirpredictionguide- lineandthepredictionsareconservative.Thus,theyhavelesspredictionsthanmiRDeep-P. 54 Toevaluatetheperformanceofthepredictionstoolsonlargeandcomplexplantgenomes, weappliedmiR-PREFeRtomiRNApredictioninmaize,whichhasamuchlargergenome sizethantheothertestedgenomes(seeTable3.2).WeusedtwosmallRNA-Seqsamples, andtheirinformationcanbefoundinTable3.2.Likepreviousexperiments,weranmiR- PREFeR,ShortStack,andmiRDeep-Ponthedataset.FormiRDeep-P,wedidnotget anyresultbecauseitwasveryslowandwewereunabletoitin5days.Thus,we onlyreporttheresultsfrommiR-PREFeRandShortStack.Toevaluatethesensitivity,we comparedthepredictionswithboththemaizemiRNAannotationandtheplantmiRNA annotationinmiRBaseversion20.Table3.4summarizedtheperformanceofmiR-PREFeR andShortStack.miR-PRFeRshowedhighersensitivitythanShortStackonthisdataset. For172maizemiRNAannotationinmiRBase[58],miR-PREFeRsuccessfullyiden74 ofthem,whileShortStackiden66.WeusedBLASTtosearchpredictionsfromtwo toolsagainstthemiRBaseplantmiRNAannotations.FormiR-PREFeR,89predictions havetoverlaps(E-valuewassetto1e-40)withplantmiRNAannotations. ForShortStack,thenumberis82.TheexperimentshowedthatmiR-PREFeRhasbetter sensitivitythanShortStackonthemaizedataset.BecausemiR-PREFeRcanautomatically takesadvantageofmultipleCPUs/coresavailableonamachine,ittlyreducedthe runningtime.ForShortStack,ittook156hourstoanalyzethetwomaizesamples.For miR-PREFeR,itonlytook1.2hoursbyusing4coresautomatically.Thismakesiteasier foruserstoanalyzelargeplantgenomes.TherunningtimeisshowninTable3.5. 55 3.3.2Analyzingthebetweentheannotatedandpre- dictedmiRNAs Figure3.5quantheofthestartpositionsandlengthsbetweentheannotated andpredictedmiRNAsinArabidopsis.Thewasgeneratedusing63annotatedmiRNA lociofArabidopsisthatarepredictedbyall3toolsontheAthl-6dataset.Eachofthese locicontainsoneannotatedmaturesequenceaccordingtomiRBase.Alltoolsiden someannotatedmiRNAsandmiRNAsthathavetstartpositionsorlengthsasthe annotatedones.Buttherencedistributionsarequitet.77.8%(49outof63) ofthematuresequencespredictedbymiR-PREFeRhavethesamestartpositionastheir annotations.ForShortStackandmiRDeep-P,thevaluesare49.2%and15.9%,respectively. ForthemajorityofthepredictionsfrommiR-PREFeR(85.7%),theofthestart positionsbetweenthepredictedmiRNAmaturesequencesandtheirannotationsarewithin 1nt.ThelengthsofthemiRNAmaturesequencespredictedbymiR-PREFeRaremostly consistentwithannotatedones:81%ofthemarethesameastheannotationsandfor98.4% ofthepredictionsthearewithin1nt. SomeofthepredictedmiRNAsthataretfromtheannotatedonesarelikelytobe isomiRs.ItisknownthatmiRNAscanbemobytheaddition,editingorsubtraction ofnucleotides,resultingindistinctmiRNAisoforms(i.e.isomiRs)[49,84].Inparticular, 5'and3'isomiRscontainminorchangesinthestart/endpositionsandlengthsofmiRNAs. AcomprehensiveanalysisoftheisomiRsinmultipletranscriptomicsamplesofArabidopsis showsthetrimming/extensionofthestartpositionin1or2ntsin5'sisomiRs[49].Thus, thepredictionsthatcontainminorpositionfromtheannotatedonesarelikelyto beisomiRs. 56 However,existinganalysisalsoshowsthatthemostabundantreadscorrespondedto conserved,previouslyidenmiRNAs[49,84].Inthelarge-scaleanalysisoftheisomiRsin Arabidopsis,114outof144highlyrepresentedmiRNAsinmultiplesmallRNA-Seqsamples aretheannotatedmiRNAs[49].ThepredictionsbymiR-PREFeRareconsistentwiththis observation.Forpredictionsthatcontainlargerpositionfromtheannotated ones,suchasthosewithatleast5-ntdwecannotexcludethepossibilityofwrong predictions.Figure3.5showsthatmiR-PREFeR,ShortStack,andmiRDeep-Phave14.3%, 46.0%,and25.4%predictionsthathaveatleast5-ntfromtheannotatedmiRNAs, respectively.Althoughfurtherexperimentsareneededtoshowhowmanyofthesearewrong predictions,ahighpercentageofmiRNApredictionswithlargetrimming/extensionmight indicateloweraccuracy. 3.3.3miR-PFEFeRisfastandt Table3.5andTable3.6showtherunningtimeandthememoryusageofthebenchmarked tools.TherunningtimeformiR-PREFeRincludesboththetimeforgeneratingthealign- mentandthetimeforrunningthepipeline.miR-PREFeRachieves4 37 speedup comparedtoShortStack,whichisthesecondfastesttool.NotethatShortStackcanpredict othertypesofsmallRNAbesidesmiRNA.Thismightbeonereasonthatitisslowerfor justmiRNAprediction.MIReNAusesBLASTtodoreadalignmentandisveryslow.For theAthl-2andAthl-6datasets,wedidnotgetanyresultsfor4outof8samplesbecause themappingstageistooslow.miRDeep-Palsohasverylongrunningtimeonalldatasets. Withoutmanuallyparallelisingthejobs,itiseventorunthetwotoolsonsmall genomesonapersonalcomputer.Asformemoryusage,miR-PREFeRuseslessmemory thantheothertoolsonalldatasets. 57 58 Table3.3Performancecomparisonofbenchmarkedtools miR-PREFeRShortStackmirDeep-PmiRanalyzermirDeep2mirDeep*MIReNA Softwareversionv0.090-5-01.3unversioned0.0.5v312.0 Athl-2(2samples) NumberofknownmiRNAsexpressed a 240240240240240240240 NumberofpredictedmiRs155113126321821822018152 NumberofexpressedmiRspredicted12710786201641035 Numberofnovelpredictions286117719811182008117 Sensitivity0.530.450.360.840.270.040.15 Athl-6(6samples) NumberofknownmiRNAsexpressed a 243243243243243243243 NumberofpredictedmiRs1851363021131142911472411 NumberofexpressedmiRspredicted13612512820979744 Numberofnovelpredictions49112893123062121465367 Sensitivity0.560.510.530.860.330.030.18 (a)Athl-2 (b)Athl-6 (c)Medicago (d)Peach (e)Papaya Figure3.4VenndiagramsthatshowtheintersectionsofthepredictedmiRNAsfromdif- ferenttoolsonedatasetsfromfourspecies.ForArabidopsis,Medicagoandpeach,the annotationsfrommiRBaseareincluded. Table3.4PerformancecomparisionofmiR-RPEFeRandShortStackonmaize miR-PREFeRShortStack Numberofpredictions220147 OverlapwithmiRBasemaizeannotation 74/172 66/172 OverlapwithmiRBaseplantannotation 89 82 Table3.5Runningtime(hour) miR-PREFeRShortStackmiRDeep-PmirDeep2mirDeep*MIReNA Athl-2 1.25 5.619710.78Unknown a Athl-6 1.21 11.0315727.427.8Unknow a Medicago 0.91 4.25110 papaya 0.75 4.883 peach 1.65 41.6197 maize 4.22 156 Bold numberindicatesthebestrunningtime. a MIReNAdidnotonseveralinputsamplesafterrunning72hoursandwaskilledbytheHigh PerformanceComputingsystem. 59 Figure3.5ThedistributionofthemiRNAstartpositions(toppanel)andlengths (bottompanel)betweenannotatedandpredictedmiRNAsbythreetools.Thewas generatedusingannotatedmiRNAsofArabidopsisthatarepredictedbyallthreetoolson datasetAthl-6. Table3.6Memoryusage(MB) miR-PREFeRShortStackmiRDeep-PmirDeep2mirDeep*MIReNA Athl-2 850 1449155915383040Unknow a Athl-6 854 1266391724811825Unknow a Medicago 263 11551800 papaya 238 12201177 peach 850 14491559 maize 1968 3857 b Bold numberindicatesthebestmemoryusage. a MIReNAdidnotseveralinputsamplesafterrunning72hoursandwaskilledby theHighPerformanceComputingsystem. b ShortStackdidnotcompletelyonthelargerRNA-Seqsampleandwaskilledright afteritoutputthepredictionsofmiRNAs.Thenumberhereisthememoryusageonthe smallerRNA-Seqsample. 60 Chapter4 IdentifyandassembleCRISPRsfrom metagenomicdata 4.1Introduction CRISPRsarefoundinalargenumberofprokaryoticgenomes(approximately40%ofse- quencedbacteriaandalmostallsequencedarchaea[32]).TheCRISPR/Cas-genesystem (cas-CRISPR)isanadaptiveimmunesysteminbacteriaandarchaeathatprotectsthe targetgenomefromforeigninvadinggeneticelementssuchasvirusplasmidsandphages [7,72,44,17,5,88].EachCRISPRslocuscontainsasetofvariablespacersregularlysep- aratedbydirectrepeats(DRs),whicharehighlyconservedwithinspeciesandhavetypical sizerangeof23-47bp.A100bp-500bplongDNAsequencecalled\leader"isalwaysfound ononesideoftheCRISPR.Theleaderactsasapromoteranddirectsthetranscription oftheCRISPR,whichisessentialfortheCRISPRssystemstoconferspresistanceto viruseswhosegenomicsequencesmatchonespacer[35,5,90].Figure4.1showsanexample ofacas-CRISPRsystem. EvenwithextensivestudiesofCRISPRs,thedetailsoftheCRISPR-basedmechanism andthedynamicsofinteractionsbetweenvirusesandthehostCRISPRslociarenotyetfully known[5].OnekeysteptoadvanceourunderstandingofCRISPRsworkingmechanismsis todetectCRISPRslociinalargevarietyofprokaryotes.Inparticular,metagenomicdata, 61 Figure4.1Acas-CRISPRexample.Greenboxesinthetoprepresentdirectrepeats (DRs).AlinebetweentwoDRsrepresentsaspacersequence,whichsharesequenceidentity withfragmentsofplasmidsandbacteriophagegenomesandspecifythetargetsofCRISPR interference.DRsandspacersformanDR-spacerarray.ThenumberofDRscanvary substantially,fromaminimumoftwotoafewhundred.AsetofCRISPR-associatedgenes (cas-genes),whichencodetheproteinmachineryresponsibleforCRISPRactivity,arelocated upstreamofthearray.ThebottomboxshowsthesequenceoftheDR-spacerarray.Green charactersareforDR,othercoloredcharactersareforspacers.DRsequencesarehighly conservedinaCRISPR. whichcontaincommunityshotgunsequencesofalargenumberofunculturedmicrobial speciesfromvariablenaturalsamples,provideauniqueopportunitytothoroughlyannotate CRISPRsandtoenhanceourunderstandingofthephage-hostinteractionsandco-evolution withinnaturalsystems[5,91,101] 4.2Relatedworksandtheirlimitations MostofexistingtoolsforidentifyingCRISPRsaredevelopedforsequencedgenomes.These methodssearchinterestedgenomesforregularlyspacedrepeatstoidentifyCRISPRloci. PopularimplementationsinthiscategoryincludeCRISPRFinder,metaCRT/CRISPRAlign [92],PILER CR[22],andCRT[12].Theyworkwellforassembledgenomesorcontigs butcannotbereadilyusedonshortsequencingdata.Acommonwaytoutilizethistype oftoolsformetagenomicdataisusedenovometagenomicassemblytoolstooutput assembledcontigs,thenusethecontigsastheirinput.Thisapproachhasitslimitations. 62 First,metagenomicassemblyisaproblem.Metagenomicassemblerstrytorecover individualgenomesinanenvironmentalsample.Organismsinthemicrobialcommunityof asamplehavehighlyunevenrepresentationandsomeorganismsareverycloselyrelated. Thehighlycomplexcompositionofmicrobialcommunitiesandsequencingerrormakeit impossibletorecoverallgenomesinasample.Previousworkshowsthatsequencingofhigh- complexitymicrobialcommunitiesresultsinlittleornoassemblyofreads[81,73].Second, theheterogeneoussequencecoverageofCRISPRscausedbyvariablespeciesabundanceor expressionlevelsoftenmakesexistingtoolsoutputonlythehighlyrepresentedCRISPRs. OnereportedproblemisthatCRISPRslocifoundinmetagenmoicassembliesmayonly representthemostdominantstraininthecommunity,andnotindicativeofthetruespacer diversity[101].Third,CRISPRsinthesamespeciesorevenintspeciescanshare similarrepeats.ManyCRISPRsthatsharethesamerepeatinaspecieshaveveryhighly conservedleadersequences[117,52,48].Thisleadstoverytangledassemblygraphs.Existing assemblytoolsarenotdesignedtodistinguishCRISPRswithsimilarrepeats.Inaddition, thesizeofmetagenomicdataposesachallengeoncomputationalresources.Thus,the commonlyusedpipelineofcombiningdenovometagenomicassemblyandgenome-wide CRISPRsdetectiontoolsisconvenientbutnotidealforCRISPRsdetectioninmetagenomic data. ThesecondtypeoftoolsidenCRISPRsfrommetagenomicdatawithoutassembling ofthewholedataset.Toourknowledge,theonlyavailabletoolinthiscategoryisCrass[101]. CrassidenreadsthatbelongtoCRISPRsusingtherepeatstructuralcharacteristicof CRISPRsandusesagraphbasedalgorithmtorecoverthespacerarrangementofCRISPRs thatsharethesamedirectrepeatsequence.Asasteptooutputthefull-lengthCRISPR sequences,Crassrequiresuserstoparsetheinformationoutputbythepreviousstepand 63 inputittoexistingdenovoassemblyprogramssuchasVelvet[123].Likeotherwholegenome assemblytools,VelvetisnotdesignedtoaddressspassemblychallengesforCRISPRs. Inaddition,becauseCrassreliesontheassumptionthatthereexistsomereadsthatcontains atleasttwocopiesofarepeat,itfailstoidentifyCRISPRrepeatwhenreadlengthisshort orwhenrepeat-spacerunitislong[8].Thus,thereisaneedtodevelopnovelalgorithmsand toolstotlydetectandassemblebothhighlyandlowlyrepresentedCRISPRsfrom largeamountofshortanderror-pronemetagenomicreads. Inthiswork,wedesignanddevelopnovelmethodstoidentifyandassemblesequencesof CRISPRsthatcontainbothknownandnovelrepeatsinmetagenomicdata.Thedeveloped tool,metaCRISPR,incorporatessppropertiesoftheDR-spacerarrayinCRISPRsand addressesthechallengesincludingthelargesizeofmetagenomicdata,theheterogeneous sequencecoverageofCRISPRs,andmultiplecopiesofrepeatsinCRISPRs. 4.3Method Inthissection,weintroducemetaCRISPRpipeline.Wegiveanoverviewofthewhole pipeline.Thenthedesignofeachcomponentofthepipelineiselaborated. 4.3.1Overview Figure4.2sketchesthemainalgorithmiccomponentsinthemetaCRISPRpipelinethatis designedforrecoveringCRISPRscontainingbothknownandnovelrepeatsinmetagenomic dataset.Itconsistsofthreemajorcomponents:clusteringandassembly.First,we identifyCRISPRreadsfromtheoriginalmetagenomicdataset.Allreadsthathaveatleast twocopiesofthedirectrepeatsareidenedtogathernovelrepeats.Ak-merhashtableis 64 Figure4.2MajorcomponentsofmetaCRISPR.Paired-endreadsarerepresentedbytwo rectanglesconnectedbyadottedline.Readsofredandgreencolorsrepresentreadsfrom twotCRISPRs,respectively.Grayreadsrepresentreadsfromothergenomicregions. 65 constructedusingthegatheredtherepeatsandknownrepeatssuchasthoseinCRISPRdb [32].Theoriginalmetagenomicdatasetisusingthek-merhashtable.Thisstep removesalargeportionofreadsthatarenotpartofCRISPRsandtlyreduces inputsizefordownstreamanalysis.Inthesecondcomponent,webuildanoverlapgraph withlargeoverlapthreshold.Theideaisthatbyusingverylargeoverlapthreshold,most randomoverlapscanbeeliminatedandreadsthatbelongtoCRISPRscomefromt species,orCRISPRscomefromthesamespeciesbutdonotsharerepeatsequencecanbe separatedintorentclusters.Clusterswithpairedendreadssupportarecombinedto recoverreadsthatbelongtothesameCRISPRbutwereclusteredtomultipleclusters.In thestage,anoverlapgraphisconstructedforreadsineachcluster.Variousgraph cleaningtechniquesareappliedtocorrectsequencingerrorsanderroneousbranchesdueto randomoverlaps.Longestpathisusedtorecovercontigsonthedirected acyclicgraph(DAG).Recoveredcontigsareconnectedusingpaired-endinformationtoform supercontigsorscAfterassembly,asimplevalidationstepisappliedtoout contigsthatarenotCRISPRs. 4.3.2IdenofCRISPRreads 4.3.2.1Directrepeatiden OurpipelineidenreadsthatarelikelysequencedfromCRISPRs.Theevenly spaceddirectrepeatstructureinCRISPRsprovidesthebasicideaforidentifyingnovel repeatsfrommetagenomicreads.Sincethelengthsofmostknowndirectrepeatslieinthe rangefrom23to47[31],itisareasonableassumptionthatatportionofreads willcontainatleasttwocopiesofdirectrepeats,especiallywiththeincreasingreadlength 66 Figure4.3Illustrationofhownovelrepeatsareideninaread.Thegreenregionsare extendedregions(seemaintext).Blackcrossesrepresentmismatchesinthesequence. oftheNGStechnology.Basedonthisobservation,weidentifydirectrepeatsbysearching maximalrepeatswithmismatchesinreads.Figure4.3illustrateshowaputativedirect repeatisideninaread.Foreachread,wecheckwhetheritcontainsatleasttwo copiesofaninitialk-merwhoselength k issetto20bydefault.Thedistancebetweenthe twocopiesofthek-mershouldbeatleasttheminimumlengthofaspacer.Afterapairof matchingk-mersarefound,theyareextendedtotherightuntiltheextendedregionscontain atmosttwomismatches.Thesequencefromthestartoftheinitialk-mertotheendofthe extendedregionisregardedasaputativedirectrepeat.UnlikeCrass,weallowatmosttwo mismatcheswhenidentifyingrepeats.Thisisbasedontheobservationthatalthoughrepeats inaCRISPRarewellconserved,therestillexistrepeatsthathaveoneortwomismatchesin someCRISPRs.Byallowingmismatches,metaCRISPRisabletogatherrepeatsthatCrass mayfailtoidentify. Therepeatidenmethodcanrecruitalargeportionofrepeatsinametagenomic dataset.However,itmaymisssomelongrepeats,especiallywhenthereadlengthisshort (e.g.75nt)[8].Toovercomethisproblem,weresorttotheknownrepeatssuchasthosein CRISPRdb.DirectrepeatsinannotatedCRISPRslociareectivetemplatesforidentifying CRISPRsreadsbecauserepeatsarewell-conservedamongCRISPRsinthesamespeciesor evenbetweentspecies[59].Byincludingrepeatsfromknowndatabases,weare abletoidentifysomeoftherepeatsthataremissedbytherepeatidenprocessand increasethesensitivityofthepipeline. 67 4.3.2.2Filtermetagenomicdatausingk-merhashtable Usingthepreviousstep,wecanidentifyrepeatsandrecruitreadsthatcontainatleast twocopiesofdirectrepeats.However,manyreadsmayonlycontainasinglecopyofthe repeat.Figure4.4showsthedistributionofthegroundtruthreadsthatcanbemapped totheCRISPRregionsofthegenomesinasyntheticmetagenomicdata(SRR606249)[99] sequencedfromamixtureof16archaealspeciesand48bacterialspecies.Thelengthofthe readsinthedatasetis101nt.Theyare220knownCRISPRsin48genomesinthedata set.ForeachCRISPR,weshowthepercentageofgroundtruthreadsthathaveatleast twocopiesofarepeat20-mer,thepercentageofgroundtruthreadsthathaveonlyonecopy ofarepeat20-mer,andtheaveragespacer-repeatlengthoftheCRISPR.Wecanseethat althoughwecanidentifyrepeatsandrecruitreadsthatcontainsatleasttwocopiesofdirect repeatsformostCRISPRs,mostgroundtruthreadsforeachCRISPRonlycontainonecopy ofarepeat20-mer.Wecannotrecruitthesereadsinthepreviousstep.Withoutincluding thesereads,thedepthofcoverageoftheCRISPRisquitelowandit'squitelikelythat forthoselowlyrepresentedCRISPRs,therewillbemanygapswhichpreventustorecover thewholesequenceoftheCRISPR.Therefore,weneedantmethodtorecruitreads containsonlyonecopyofdirectrepeatbeforedownstreamanalysis. Tobuildthetprocedure,weuseak-merhashtablebasedmethodtodetectreads thatcontainonlyonecopyofdirectrepeat.Weconstructahashtablefromallthek-mers oftheknownandnewlyidendirectrepeats.Foreachk-mer,boththek-merand thereversecomplementarysequenceofthek-merarekeptinthehashtable.Theoriginal metagenomicdatasetisscannedtoidentifythosereadsthatcontainatleastonek-merin thehashtable.Forpaired-endreads,ifoneendhasamatchingk-mer,theotherendis 68 Figure4.4Thedistributionofgroundtruthreads automaticallyincludedregardlessofwhetherithasamatchingk-mer.Allreadsfromthis steparepassedtothenextclusteringstep.Thisstepcanrapidlyoutmostofreadsthat arenotpartofaCRISPRandtlyreducethedatasizefordownstreamanalysis. Itshouldbenotedthatboththerepeatidenstepandthek-merhashtable basedonstepinevitablyrecruitasmallportionofreadsthatdonotbelongtoany CRISPR.Thesefalsepositivereadsslightlyincreasetheinputsizetothenextstepbutwill beeasilyremovedduringeithertheclusteringsteportheassembly/validationstep. 4.3.3ClusteringofCRISPRreads TheCRISPRreadsidenstepproducesamixtureofreadsthatcomefromCRISPRs inditspeciesortCRISPRsofthesamespecies.Assemblythemixtureofreads directlymayintroduceerroneouscontigsbecausetherepeat-spacerstructureofCRISPRs leadstoaverytangledassemblygraph.Thusit'sbettertoseparatetCRISPRsand assemblethemindividually.Weuseanoverlapgraphbasedapproachtoclusterreadsto 69 Figure4.5Anexampleofoverlapgraphandtransitiveedge.Threereadsformthreeedges. Therededgeisatransitiveedgecanbesafelyremoved. tclusters.Firstweintroducetheconceptofoverlapgraphandrelatedterminology, thenwepresentthemethodweusetoseparatetCRISPRs. 4.3.3.1Overlapgraph Anoverlapgraphisandirectedgraphthatoverlaprelationsbetweenreads.Each nodeintheoverlapgraphrepresentsaread.Givenanytworeads r 1 , r 2 ,andanoverlap threshold T ,iftheoverlapsizebetweentheof r 1 andtheof r 2 isgreaterthan T , adirectededgewhichgoesfrom r 1 to r 2 isadded.Theoverlapthreshold T hashighimpact ontheconnectivityandcomplexityofthegraph.Asmaller T isabletoconnectreadsin lowlysequencedregions,butitproducesverytangledgraphsincetheremaybemanywrong, randomedgesformedbetweenreadsthatarenottrulyconnectedinthegenomes.Alarger T producessimpleroverlapgraphwithtedges,butitmaymissconnectionsbetween readsinthelowlysequencedregions.Afteraninitialoverlapgraphisbuilt,transitiveedges, whichareredundantedgestorepresentgraphconnectivity,canberemovedtosimplifythe graph.Atransitiveedgeisformedbetweentwonodesifthetwonodeareconnectedby alternativepathsotherthanthetransitiveedge.Removingsuchedgesdoesnotchange reachabilityofthegraphbutcangreatlysimplifytheoverlapgraph.Anexampleofoverlap 70 Figure4.6Top:twoCRISPRsthatshareverysimilarrepeatsequences.Thereadmapping againstthetwoCRISPRslociareshown.Bottom:theinitialoverlapgraph G . Readsofredandpurplearesequencedfromthetwoloci,respectively.Eachnoderepresents areadofthesameletter.Blackandgreyedgesdenoterandomoverlaps.Thenumberalong eachedgerepresentstheoverlap.Readsinapair r arerepresentedusing r: 1and r: 2. graphandtransitiveedgeisillustratedinFigure4.5.Betweennodes a 1and a 3,thereisan alternativepath a 1 ! a 2 ! a 3,edge a 1 ! a 3canbesafelyremoved.Inourimplementation, weconstructoverlapgraphusingreadjoiner[28],whichisanoverlapgraphbasedassembler. Itprovidesasetofcientalgorithmstocreateandmanipulateoverlapgraph. 4.3.3.2OverlapgraphbasedCRISPRreadsclustering ToseparateCRISPRs,weusetwoimportantpropertiesspctoCRISPRs. Spacersareusuallyunique.ForaCRISPRlocus,thearrayofspacersarequitet fromeachother.Theyalsohaveverylowsimilaritywiththerepeatsequence.In addition,spacersarenotconservedacrosserentCRISPRs. LeadersequencesofCRISPRsthatsharethesamerepeatinonespeciesmayhavehigh similarity,buttheyareusuallyquitetforCRISPRsfromtspecies,even theseCRISPRssharethesamerepeatsequence[117,52,48]. 71 Fromtheproperty,weknowthatrandomoverlapscausedbynearidenticalrepeats cannotexceedthelengthofatypicalrepeat.AsshowninFigure4.6,forreadsthatstart orendwithinrepeats,theytendtoformrandomoverlapswithreadsofsimilarproperties. However,asthespacersareusuallyunique,theoverlapwillnotlikelytoextendbeyond therepeats.Thus,byapplyinganoverlapthresholdthatismuchlargerthanthetypical repeatsize,amajorityofrandomoverlapswillberemoved.Inaddition,itisrarethattwo tCRISPRswillproducethesamereadunlessthereadiscompletelysequencedfrom therepeats.However,asthelengthofreadsisusuallymuchlongerthanrepeats,theread willcoveratleastpartofthespacer,whichcandistinguishreadsfromtloci.Hence, it'sreasonabletoassumethattCRISPRsdonotsharethesamereadandwecould useanonoverlappingclustermethodtoseparateCRISPRreads. Fromthesecondproperty,weknowthattheremayexistlargeoverlapedgeswhichare formedbetweenreadssequencedfromtheleaderregionsoftCRISPRsthatshare repeatsequenceinthesamespecies.Weareunabletoremovetheseerroneousedgeseven usingalargeoverlapthreshold.However,thiswillnotposeaproblemforthepipelineforthe followingreasons.First,becausewebuildtheoverlapgraphwithoutallowingmismatches, mostCRISPRsinthesamespeciescanstillbeseparatedintotclustersaslongas theleadersequencearenottotallythesame.ThisistrueforalargeportionofCRISPRs thatsharerepeatsequenceinthesamespecies[52].Second,evenifseveralCRISPRsare clusteredintothesamecluster,wecanstillseparatethemintheassemblystage. BasedontheseCRISPRspproperties,webuildanoverlapgraphwithalargeover- lapthreshold(Thedefaultvalueis80ntor80%ofthelengthofthereads,whicheveris smaller)toclusterCRISPRreads.Edgesinsuchoverlapgrapharehighlyreliable.Thatis, anedgeconnectingtwonodes(reads)inthegraphhasveryhighprobabilitytothe 72 trueoverlapbetweenthereads.Readsinthesameweaklyconnectedcomponentareusually comefromthesameCRISPR.Forasmallnumberofconnectedcomponents,eachcon- nectedcomponentharborsseveralCRISPRsfromthesamespeciesbecausetheseCRISPRs shareleadersequence.Thus,wecanclusterthereadsusingtheweaklyconnected componentsintheoverlapgraph.Eachconnectedcomponentformsaninitialcluster. AllreadsfromthesameCRISPRmaybeseparatedintoseveralinitialclusters.This happensforthefollowingreasons.First,therearebothhighlyrepresentedandlowlyrepre- sentedspeciesinametagenomicsample.Forlowlyrepresentedspecies,thesequencingdepth ismuchlowerthanthehighlyrepresentedspecies.ForCRISPRsinlowlyrepresentedspecies, mostoverlapsbetweenreadsmaybesmallerthantheoverlapthreshold,thusthereadsare clusteredintotsmallclusters.Second,evenforahighlyrepresentedCRISPR,lowly sequencedregionsmayexistlocallyandpreventallreadsfortheCRISPRgoesintoasingle cluster.Thus,weconductamergingsteptocombineclusterswhosereadsaresequenced fromthesameCRISPR.Weprocessclustersthatonlycontainasingleread.Alarge portionofreadsfromtheseclustersarenotCRISPRreads.However,therearestillsome readswhicharepartofCRISPRsbutfailedtogointoanylargerclustersbecausetheyhave asmalleroverlapwithotherreads.Weusepaired-endinformationtorecruitsuchreads.For areadinanyclusterofsizeone,iftheotherendofthepaired-endreadisinaclusterwhose sizeisatleasttwo,thisreadisintothiscluster.Allsinglyclusteredreadsthatcan notbeusingpaired-endinformationarediscarded. Wethenbuildanundirectedgraphinwhichanoderepresentsaninitialclusterwhosesize islargerthanone.Anedgeisformedbetweentwonodes v 1 and v 2 iftwoclustersdenoted bythesenodeshaveenoughpaired-endreadssupport.Apairedendissaidtosupportnodes v 1 and v 2 ifoneendofthepairisintheclusterdenotedby v 1 andtheotherendisincluster 73 denotedby v 2 .Thenallinitialclustersinoneconnectedcomponentinthegraphforma newcluster. 4.3.3.3Correcttheorientationofreads ReadsinaNGSsamplecouldcomefrombothstrainsofthegenomesinthesample.Without knowledgeoftheorientation,overlapscanbeformedbetweenofareadandtheof anotherread(P ! S),orbetweenthexofareadandthereversecomplementofanother read(S ! P),orbetweenthereversecomplementofareadandtheofanotherread (P ! P).Thisrequireswehaveatwodirectionaloverlapgraphwhendoinggraphtravesal [82].Unlikegeneralassemblytoolswhichhavenoideaofthedirectionofeachread,wecan correcttheorientationofthereadsineachclusterwiththehelpofthedirectrepeat.Foreach cluster,weknowthedirectrepeatsequencesofthisclusterfromtherepeatiden step.Byusingonedirectrepeatasreference,weareabletocorrecttheorientationofall thereadsinthecluster.Afterre-orientationofthereads,weonlyneedtopayattentionto overlapsintheassemblystep. 4.3.3.4Filterfalsepositiveclusters Throwingawayalargeportionofthesinglyclusteredreadsonlyremovessomeofnon CRISPRreads.TheremayexistlargerclustersthatdonotharboranyCRISPR.Because assemblyisusuallytimeconsuming,wetrytoreducethenumberofclustersbeforegoing totheassemblystep.Foreachcluster,wehaveanoverlapgraphbuiltusingalargeoverlap threshold.Theoverlapgraphisusuallysimpleandedgesinthegrapharereliable.Wedoa quicktraversalofthegraphtoformlongercontigsandcheckwhetherlongcontigsexhibit repeat-spacerstructureofCRISPRs.Ifatleastoneofthecontigshassuchstructure,then 74 theclusteriskeptforfurtherprocessing.Thisstepcanremovealotofclustersthatdonot containanyCRISPR. 4.3.4AssembleCRISPRsofheterogeneouscoverage Overlapgraphbasedclusteringdividesreadsintotclusters,eachofwhichcontains readssequencedfromasingleCRISPRorCRISPRssharingrepeatsequencesinthesame species.Foreachcluster,webuildanoverlapgraph,andanovelassemblypipelineoptimized foridentifyingCRISPRswillbeappliedtoeachoverlapgraphtorecoverCRISPRs.Inorder torecoverCRISPRsoflowsequencecoverage,weconstructtheinitialoverlapgraphusing asmalloverlapthreshold. RecoveringCRISPRsfrommetagenomicdataneedstoaddressthechallengesofhetero- geneoussequencecoverageandahighlytangledgraphcausedbyrandomoverlapsbetween repeats.Figure4.7sketchesthemainstepsandtheresultsofapplyingthemtoanoverlap graphthatcontainsthreeCRISPRs,ofwhichonehashighsequencecoverage(red),andthe othertwohavelowsequencecoverage.ThesethreeCRISPRsareinthesamespeciesand shareverysimilarrepeatsequencesthustherearemanyedgesgeneratedbyrandomoverlaps intheinitialoverlapgraph.TheserandomedgesinthesameCRISPRandbetweent CRISPRslocileadtohighlytangledandcyclicgraphs.Ourgoalistorecoverthethree correctpathscorrespondingtothethreeCRISPRsloci.WeformulateCRISPRsassembly problemasthelongestsimplepathsthathavesupportfrompairedendreadsinthe overlapgraph. FromtheCRISRPsppropertiesstatedinsection4.3.3.2,weknowthatbyapplying anoverlapthresholdlargerthanthetypicalrepeatsize,amajorityofrandomoverlapscan beremoved.Wereferedgeswhoseoverlapsizeislargerthanthreshold ˝ as c 75 Figure4.7Leftpanel:majorstepstoassembleCRISPRs.Rightpanelshowsanexamplewith threeCRISPRs,whicharerepresentedbyred,purple,andgreennodes/edges,respectively. TheCRISRPrepresentedusingredcolorhashighsequencecoverage,theothertwohave lowsequencecoverage.Readsinapairareconnectedbyablack,dotted,arrowline.Nodes whichformsacyclearecircledusingadottedline. 76 edges .Here ˝> repeatsize.Inaddition,wecanremovereadsthathavebeenincludedin assembledCRISPRslociwithoutthesearchfortheremainingCRISPRsbecause areadonlybelongstooneCRISPR.Basedontheseobservations,weproposethefollowing methodtoassembleCRISPRs. 4.3.4.1Step1:cleangraphandrecoverhighlysequencedCRISPRs Wecleanthegraph.First,bubbles,whichareintroducedbysequencingerrors,canbe easilyidenandremoved.Node r inFigure4.7isanexampleofbubble.Second,we cleanthegraphusingtedges.Forexample,inFigure4.7,theedgefrom b 3to b 4 isatedge,thusedgefrom b 3to c 2andtheedgefrom b 3to b 2 : 2canberemoved. Third,wecanfurtherusepairedendinformationtoremoveerroneousbranches.Theideais thatifoneendofapairedendreadisinonenode,andtheotherisinanother,thenthepath betweenthemismorereliablethanpathsbetweennodeswithoutpairedendreadsupport. InFigure4.7,node a 6isabranchingnode.Theerroneousbranchfromnode a 6to r 1can beremovedusingcotedge.Becausethereisapairedendreadfornode a 4 : 1and a 4 : 2, thuspathfrom a 4 : 1to a 4 : 2ismorecotthanpathfrom a 4 : 1to r 1,edgefrom a 6to r 1 canberemoved. Aftercleaningthegraph,CRISPRswithhighsequencecoverageformaverysimple graphwithoutcycle.LongestpathdingcanbeappliedtorecovertheseCRISPRs.A post-processingcanbedonetohandleaveryrarecasewhenthesameCRISPRisdiscon- nectedbecausetheoverlapbetweentwonodesissmallerthan ˝ .Thepathsthatrepresent segmentsofadisconnectedCRISPRcanbeconnectedusingpaired-endreadsinformation. Forexample,inFigure4.7,node a 7and a 8canbeconnectedusinginformationfrompaired- endreads a 4 : 1and a 4 : 2. 77 4.3.4.2Step2:cleangraphbyremovingrecoverednodesandedges GiventheobservationthattCRISPRsdon'tsharethesamereads,allthereads,their correspondingnodes,andassociatededgesfromrecoveredCRISPRscanberemovedfrom G .Thisleadstograph G 2 ,whichmaycontainseveralCRISPRswithlowsequencecoverage. In G 2 ,thereexistalottrueoverlapsthathavesimilarsizeswiththerandomoverlaps,and cannotbeeasilydistinguishedusingcoverageinformation.Evenworse,therandomoverlaps formedbyrepeatswithinthesamelociorbetweentwolowcoverageCRISPRsmayform cyclesinahighlytangledgraph,whichmakeslongestpathahardproblem. 4.3.4.3Step3:breakcycles Inthisstep,wewillfurthersimplifythecyclicgraph G 2 bybreakingcycleusinginformation providedbypairedendreads.Firstwemakethefollowingassumptions: Assumption1:ifreads R: 1, R: 2arepaired-endreads,and R: 1and R: 2areinthesame weaklyconnectedcomponentofanoverlapgraph,thentheremustbeapathfrom R: 1to R: 2(assume R: 1isintheupstreamof R: 2).Adirectedgraphisweaklycon- nectedifreplacingallofitsdirectededgeswithundirectededgesproducesaconnected undirectedgraph. Assumption2:ifaweaklyconnectedgraphhascycles,thenforeachcycle,thereexists atleastonepaired-endreadsuchthat R: 1(or R: 2)isanodeinthecycle,and R: 2(or R: 1)isanodeoutofthecycle. Inmostcases,tworeadsinapairaresupposedtobesequencedfromthesameCRISPR locus.Iftheyareinthesameweaklyconnectedcomponent,thenthereshouldbeatleastone pathbetweenthem(assumption1).Thesecondassumptiongivesusinformationtobreak 78 thecycleusingthemethodthatwillbepresentedhere.Ifthisassumptiondonotholdfor somecycles,wesimplydoadeeptraversalofthegraphandremoveallbackedges. Withtheassumptionswemade,herewepresentthealgorithmtobreakcyclesinan overlapgraph.Suppose R: 1and R: 2arepaired-endreads, R: 1isinacycle,and R: 2isnot inthecyclebutinthedownstreamof R: 1.Basedontheassumptions,thereexistatleasta pathfrom R: 1to R: 2.Becauseeverynodecanonlybevisitedonce(weassumethataread isnotsharedbetweentCRISPRs)whenformingthepath,allingoingedgesto R: 1 shouldnotbeonthepathfrom R: 1to R: 2.Otherwise, R: 1willbevisitedatleasttwice: oncefrom R: 1tosomeothernodewhenstartingwalkingfrom R: 1,oncefromtheingoing edgeto R: 1.Basedonthisobservation,wecantemporallyremovealltheingoingedgesto R: 1tobreakthecycle.Onceallcyclesarebroken,thegraphbecomesadirectedacyclic graph(DAG),thelongestpathalgorithmcanbereadilyusedtorecovertheCRISPRs. Figure4.8showstwocasesincyclebreakingandthestepstobreakthecycleinonecase. Thetwocasesareonlyntatlocationofthecycle,whichdeterminesedges(outgoing oringoing)thatshouldbetemporallyremoved.Incase1,thecycleisintheupstreamof a 1 : 2,thenweneedtotemporallyremoveallingoingedgesof a 1 : 1.Incase2,thecycleisin thedownstreamof a 1 : 1,thenweneedtotemporallyremovealloutgoingedgesof a 1 : 2.All theotheroperationsareexactlythesame. TherightpanelofFigure4.8showsanexampleofhowtobreakacycleincase1. e: 2is temporallyremovedbecauseitisaningoingedgeof a 1 : 1.ThegraphbecomesaDAG,the longestpathbetween a 1 : 1and a 1 : 2canbeeasilyfound(thepathformedbyrededges). e: 5 isremovedbecauseit'sabranchingedgeonthepath.Afteradding e 2back,wecana longestpathwhichcontains e 2, e 1, e 4,and e 3. 79 Figure4.8Leftpanel:twocasesincyclebreaking.Case1:thereadwhichpairswithread inthecycleisinthedownstreamofthegraph;case2:thereadthatpairswiththereadin thecycleisintheupstreamofthegraph.RightPanel:mainstepstobreakthecyclefor case1.Paired-endreadsareconnectedusingablack,dotted,arrowline.Edgesrepresented byred,dottedlinesaretemporallyremovededges.Red,solidlinesformthelongestpath betweentwopairedreads. 4.3.4.4Step4:recoverlowlysequencedCRISPRs Nowallcyclesarebroken,wehavealistofDAGs.ForeachDAG,alongestpathcanbe found.Thenthenodes,edgesassociatedwithpathcanberemovedfromtheDAGtoform anewDAG.Thestepisrepeateduntilthegraphisempty. Aftertheassembly,allcontigslongerthanauserthreshold L isvalidated.For eachcontig,thevalidationstepidenrepeatsequencesinthecontig,thenvalidates thesizeoftherepeats,thesizeofthespacers,andthenumberofrepeatsequences.Ifallof themmeettheuservalues,thecontigisasaCRISPR.Usercanalsouse othergenome-wideCRISPRdetectiontoolssuchasPILER-CR,CRISPRFinder,tovalidate thecontigs. 80 4.4Experimentalresults ToevaluatetheperformanceofmetaCRISPR,weconductedseveralexperimentsonthree metagenomicdatasets.Thedatasetisamockcommunitydatasetwhichwassequenced fromsyntheticcommunities.Thedatasetcontains16archaeamembersand48bacteria members[99].Ofthe64speciesinthedataset,48ofthemhave220CRISPRannotations inCRISPRdb.Thedatasetcontainsabout105millionof101bppaired-endreads.The fastqsizeis34G.Asthespecies,theirgenomes,andtheCRISPRannotationsinthe genomesareknown,weareabletoevaluatetheperformanceofCRISPRreadiden readclustering,andassembly.ThesecondoneisaRhizosphereSoildataset.Thefastq ofthedatasetisabout117Gaftercleaning.Itcontains271millionof150ntpaired-end reads.WeusethisdatasettoshowmetaCRISPR'sabilitytohandlehugemetagenomic data.Thelastoneisasix-samplehumangutdatasetsequencedfromfamilymembersofa patientwithdiarrhea.ByaligningspacersidenbymetaCRISPRtotheviruscontigs inthemetagenomicsample,weareabletoshowthatCRISPRsidenbymetaCRISPR arereliable. TherearetwoapproachestoidentifyCRISPRsfrommetagenomicdata.Thecat- egoryreliesondenovoassemblyandthenidenCRISPRsfromtheassembledcontigs. ThesecondcategorydirectlyidenCRISPRsfrommetagenomicreaddata.Webench- markedmetaCRISPRwithtoolsinbothcategories.Forthecategory,wechoseIDBA-ud asthebulkmetagenomicassembler,andusedCRISPRFinderasthegenome-wideCRISPR detector.Forthesecondcategory,webenchmarkedCrass,whichistheonlytoolwenoticed toidentifyCRISPRsdirectlyinmetagenomicreaddata. 81 Table4.1TheoftheCRISPRreadsidencation datasetreadsinrawrecruitedreadsratetime(min) a memory(G) mock 105m11334441.08%18.73.7 soil 271m41366441.52%41.65.2 gut 28m1486360.53%4.40.47 a Walltimeusingfourthreads,averageofthreeruns. 4.4.1EvaluatingtheperformanceofCRISPRreadsiden TheCRISPRreadidenisthekeystepformetaCRISPRtohandlehugemetage- nomicdataset.Weevaluateditstoreducethesizeoftherawdataset. Table4.1showsthattheCRISPRreadsidenstepcandiscardatamount ofreadsrapidly.Forthethreedatasets,thesizesofthereduceddatasetsrangefrom0.53% to1.52%oftherawdatasets.AsshownintheTable,runningtimeandmemoryusageis smallwithrespecttothesizeoftherawdatasets. AsweknowthespeciesandtheannotatedCRISPRsinthemockcommunitydataset,we canevaluatetheaccuracyofCRISPRreadsidenusingsensitivity.Thesetofreads thataresequencedfromCRISPRregionsaretruepositivereads(TP).Let P bethesetof readsrecruitedbymetaCRISPR.Sensitivityisas P \ TP TP .Toshowthatincluding knownrepeatscanimprovetheperformanceofmetaCRISPR,weranthereadsiden withknownrepeatsandwithoutknownrepeats,respectively.Fortherunwithoutknown repeats,thepipelinerecruited228036outof237162TPreads.Thesensitivityis96.2%.For therunthatincludesknownrepeats,thesensitivityincreasedto98%.Tofurtherinvestigate theperformance,wecomparedthesensitivityofthetworunsonCRISPRlevel.InFigure4.9, x-axisshowstheindexof220CRISPRsinthedataset.TheCRISPRsaresortedbythe numberofrepeatstheyhave.Thebluelinewithsquaredmarkershowsthesensitivityof therunwithknownrepeats.Theredlinewithtrianglemarkershowsthesensitivityofthe 82 Figure4.9SensitivityoftheCRISPRreadsidenstep.CRISPRsaresortedalong thex-axisbythenumberofrepeatstheycontains.Seemaintextfordetails. runwithoutknownrepeats.Thelinewithcyclemarkershowstheaveragelengthofthe repeat-spacerunit.Wecanseetheperformanceofthetworunsareverysimilarformostof theCRISPRs.WithoutincludingknownCRISPRs,thepipelinefailedtorecruitanyreads forsixCRISPRs.TheindicefortheseCRISPRsare9,13,46,84,93,103,respectively.The reasonisthattheaveragelengthsoftherepeat-spacerunitsoftheseCRISPRsarelonger thantheothers.Asthereadlengthofthedatasetis101bp,it'simpossibletoseetwo copiesof20-mersoftherepeatinonereadforsuchCRISPRswithlongrepeat-spacerunits. ByincludingknownrepeatsfromCRISPRdb,weareabletomostofthereadsforthese CRISPRs.ItshouldbenotedthatalthoughweusedknownrepeatsonlyfromCRISPRdb, userscanuseanyknownrepeatsfromotherresources. 4.4.2Theperformanceofreadclustering ThegoalofthereadclusteringstepistoseparatetCRISPRsaswellastoreduce falsepositivereadsrecruitedinthestep.Weusedthemockdatasettoevaluatethe 83 Table4.2Theperformanceofclusteringonmockdataset readtypereadsafterreadsafterclusteringreduction TPreads2324982322870.1% FPreads90094619678778.2% Total1133444429074 performanceofreadclustering.Theclusterstepproduced1533clustersontherecruited mockdata.InTable4.2,wecanseethataftertheclusteringstep,mostofthereads sequencingfromCRISPRregions(TPreads)werekept,whilealargeportionofFPreads wereremoved.Althoughtherewerestill196787FPreadsintheclusters,theycanbeeasily idenandremovedinthevalidationstepafterassembly. WethenevaluatedtheabilityofmetaCRISPRtoseparatetCRISPRs.After runningthereadclusteringstepontherecruitedreadsonmockdataset,readsbelongto oneCRISPRsmaygotoseveralclusters.Inaddition,oneclustermayharbormultiple CRISPRs.InFigure4.10,weshowthenumberofclustersthateachCRISPRhasusing theredlinewithtrianglemarker.Thebluelinewithsquaredmarkershowsthenumberof CRISPRsthateachclusterharbors.Hereweonlyincludedclustersthathavemorethan5 TPreads.Thereare175suchclusters.FromthewecanseethatmostCRISPRsare separatedtotheirownclusters.Forthe220CRISPRs,only14ofthemhavemorethanone cluster.Thereare29outof175clustershavemorethanoneCRISPRs.Ofthese29clusters, 28ofthemonlycontainCRISPRsfromonespecies.Theleftclustercontains15CRISPRs fromtwospecies.Fromtheanalysiswecanseethereadclusteringstepperformedverywell onseparatingerentCRISPRs. 84 Figure4.10Performanceofthereadclustering. 4.4.3PerformanceofCRISPRassemblyonthemockdataset Next,weappliedtheassemblyalgorithmtotheclustersthepipelineproduced,andcom- paredtheperformanceofmetaCRISPRwithIDBA-ud/CRISPRFinderandCrass.The initialoverlapthresholdandthetoverlapthresholdfortheassemblyweresetto 30and50,respectively.ForIDBA-ud/CRISPRFinder,weranIDBA-udwithdefault parametersonthefullmetagenomicdataset,theassembledcontigswereuploadedtothe CRISPRFinderwebsite.AllCRISPRspredictedbyCRISPRFinderweredownloadedand usedasthepredictionofthepipeline.ForCrass,weranitontherawmetagenomicdataset withdefaultparameters.Crassonlyreportsspacersandtheirarrangement.Itrequiresusers tomanuallychecktheoutputoutwhichsegmentsoutputbyCrassshouldbeas- sembledtogether,andgivetheinformationtoawrapperofvelvetorcap3[45]forassembling thedetectedreads.It'snoteasyforuserstogetcompleteCRISPRsequencesifthereare manygroups.Thus,insteadofusingthewrapperprovidedbyCrass,wedirectlyusedvelvet andcap3toassemblethegroupsCrassdetected.Forvelvet,weusedtheVelvetOptimiser 85 pipeline[100].Theminimumandmaximumk-mersizeweresetto50and80,respectively. Thestepsizewassetto5.Forcap3,alltheparametersuseddefaultvalues.Thecontigs producedbytheassemblerswereuploadedtoCRISPRFindertoannotateCRISPRs. Weusedthefollowingmetricstomeasuretheperformance: FullyrecoveredCRISPRs( full ):thesequenceoftheannotatedCRISPRisfullycovered byoneormoreoutputofthepipeline. PartiallyrecoveredCRISPRs( partial ):thesequenceoftheannotatedCRISPRsis partlycoveredbyoneormoreoutputofthepipeline. ChimericCRISPRs( chimeric ):It'spossiblethattheassemblerconnectstwopartsof twotCRISPRstogetherandformsachimericCRISPR.Thesepredictionsare wrong. Truespacersrecovered:thenumberofannotatedspacersthatarerecovered. Resourceusage:thisincludesCPUusageandmemoryusage. ThesummaryoftheperformancecomparisonwasshowninTable4.3.Thenumbersof fullyrecoveredCRISPRsformetaCRISPRandIDBA-ud/CRISPRFinderwereverysimi- lar:metaCRISPRfullyrecovered151CRISPRswhileIDBA-ud/CRISPRFinderfullyrecov- ered153.IDBA-udpartiallyrecovered40CRISPRs,whilemetaCRISPRpartiallyrecov- ered37CRISPRs.IDBA-udonlyproducedonechimericCRISPR.metaCRISPRwasonly slightlyworsethanIDBA-ud:itproducedthreechimericCRISPRs.Overall,theaccuracy ofmetaCRISPRiscomparabletotheaccuracyofIDBA-ud/CRISPRFinderpipelineonthe rawmetagenomicdataset.However,asforresourceusage,theproposedpipelinewasmuch fasterandusedmuchlessmemorythanIDBA-ud/CRISPRFinder.Themocksampleisa 86 medium-sizedmetagenomicdataset.Toassembleit,IDBA-udusedabout34GBmemory andtookmorethan18hours.TheresourcesrequiredbyIDBA-udarenotavailablefor mostoftoday'sdesktopmachines.metaCRISPRwasmuchmoreresourcet:itonly took1.32hoursandused3.9GBmemory.Thismakesitmoreconvenientforidentifying CRISPRsonlargescalemetagenomicdatasetonevenapersonalcomputer.Thisisoneof themotivationsofmetaCRISPR.Crasswithcap3generated661CRISPRs.Thisismuch largerthanthenumberofannotatedCRISPRs(220).Itonlyfullyrecovered89andpartially recovered102oftheannotatedCRISPRs.TheperformanceforCrasswithvelvetwasthe lowestamongallthebenchmarkedpipelines.Thisindicatedthattheassemblystagefor Crassdidnotworkquitewell.Thisiswhyweneedanassemblerthatcantakeadvantage ofthepropertiesofCRISPRs. Anotheraspecttoevaluatetheperformanceofthebenchmarkedtoolsistocheckhow manyoftheidenspacersoverlapwithspacersintheannotatedCRISPRs.FromTa- ble4.3wecanseenumbersofspacersgeneratedbymetaCRISPRandIDBA-ud/CRISPRFinder arequitesimilar.FormetaCRISPR,83.4%oftheidenspacersoverlappedwithannota- tions.ForIDBA-ud/CRISPRFinder,thenumberis84.7%.Crassgeneratedsimilarnumber ofspacersasIDBA-ud/CRISPRFinder.However,thenumberofspacersthatareinthean- notatedCRISPRswerelowerthanbothmetaCRISPRandIDBA-ud/CRISPRFinder:only 79.7%oftheidenspacersoverlappedwithannotations.Thismeansalotofspacers detectedbyCrasswerefalsepositive. 4.4.4Experimentalresultonthesoildataset Thedatasetisasyntheticdatasetandthesizeisnotbigenoughtochallengethe benchmarkedtools.Inthisexperiment,weappliedmetaCRISPRtoalargerandmore 87 Table4.3Theperformanceofthebenchmarkedtoolsonthemockdataset fullpartialchimericspacers a timememory toolpredictions(h)(G) metaCRISPR3211513735353/64141.323.9 IDBA-ud/ CRISPRFinder2881534015374/634318.534 Crass(cap3)6618910225028/63042.290.9 Crass(velvet)5787410505028/63043.420.9 a ThenumberofspacersintheannotatedCRISPRs/Thenumberoftotalidenspacers. Table4.4Theperformanceofthebenchmarkedtoolsonthesoildataset toolmaxaveragetimememory predictionsspacerslengthlength(h)(G) metaCRISPR12396850305271.083.2 IDBA-ud/CRISPRFinder10882950305661.13.1 Crass(cap3)839247648472230.890.78 complexsoilmetagenomicdataset.Thedatasetcontains271millionof150ntpaired- endreadsandthefastqofthedatasetisabout117G.Wetriedtoassemblethedata setusingIDBA-udonahighperformancecomputingnodewith120Gmemorybutfailed. Theprogramwaskilledbecauseitusedmorememorythanrequested.WeranCrasson therawdataset.Itdidnotafter24hours.Thus,herewereportedtheresultof IDBA-ud/CRISPRFinderandCrassonthedataset.ForCrass,weranitwithcap3 becausetheexperimentshowedbetterperformancethanvelvet. Unliketheexperiment,thegenomesinthedatasetareunknown.Insteadofcom- paringwithannotatedCRISPRs,weusedthefollowingmetrics:thenumberofCRISPRs identhenumberofspacersidenmaxlengthoftheidenCRISPRs,av- eragelengthoftheidenCRISPRs,andresourceusage.Theresultwasshownin Table4.4.metaCRISPRdetected5moreCRISPRsand139morespacersthanIDBA- ud/CRISPRFinderonthisdataset.TheoverlapbetweenthespacersdetectedmetaCRISPR andIDBA-ud/CRISPRFinderwas669,whichisabout69%ofallthespacersdetectedby 88 Table4.5Theperformanceofthebenchmarkedtoolsonthehumangutdataset metaCRISPRIDBA-ud/CRISPRFinder sampleviromecontigsCRISPRsspacershitsCRISPRsspacershit patient6535 181017 4370 follow-up4680 15131 014 133 0 healthy19416 5336427 3928820 healthy27620 353433 282622 healthy336 30293 0202550 healthy4273820138 6 21 149 0 metaCRISPR.ThemaximumlengthoftheCRISPRsoutputbymetaCRISPRandIDBA- ud/CRISPRFinderwasthesame.Onaverage,metaCRISPRoutputshorterCRISPRsthan IDBA-ud/CRISPRFinder.Crasswithcap3predicted839CRISPRs.Thenumberwasabout 8timesofthenumberofCRISPRspredictedbyIDBA-ud/CRISPRFinder.Theaverage lengthofthepredictedCRISPRsoutputbyCrasswaslessthanhalfoftheaverageCRISPR lengthoftheothertools.Thisindicatedthattheassemblydidnotworkwell,whichis consistentwiththeresultonthemockdataset.Crassdetected2476spacers,whichwas muchlargerthantheoutputofmetaCRISPRandIDBA-ud/CRISPRFinder.Wemapped thespacerspredictedbyCrasstotheinputreadsdata.Abouthalfofthespacerscanonly bemappedtooneread,and1739ofthespacerscanonlybemappedtolessthanthreereads. Thereadcountsforthesespacersaretoolowtodistinguishthembetweentruespacersfrom lowlysequencedCRISPRsandfalsepositives. 4.4.5Experimentalresultonthehumangutdataset Inthelastexperiment,weappliedmetaCRISPRtoahumangutmetagenomicdataset.The datasetwassequencedusingIlluminafromfamilymembersofapatientwithdiarrhea.It containssixsamples.One(case)wassequencedfromthepatientwithillness,anothersample (followup)wassequencedafterthepatienthaverecoveredfromtheillness.Theotherfour 89 samples(control)weresequencedfromotherfamilymembers.Thenumbersofreadsinthe samplesrangefrom3millionto7million.Inthisexperiment,weevaluatedthereliableof theidenspacers.Inthecas-CRISPRsystem,thestageintheimmuneresponseto aninvadingphageiscapturingashortpartoftheDNAofthephage,insertingthesequence tooneendofaCRISPRlocusandformingaspacer.Thusifaspacerwasnewlyacquired fromaninvadingphageintheenvironment,thespacersequenceshouldmatchpartofthe genomeofaphageintheenvironment.Thusbysearchingtheviralcontigsassembledfrom themetagenomicdataset,it'spossibletovalidatewhetherthepredictedspacersarereliable ornot. WepredictedCRISPRsusingmetaCRISPRandIDBA-ud/CRISPRFinderonallsix samples.ThenumberofCRISPRsandspacerspredictedwereshowninTable4.5.Onmost samples,metaCRISPRpredictedmoreCRISPRsandmorespacers.Wethenassembled thesixmetagenomicsamplesusingIDBA-udwithdefaultparameters.Thecontigswere uploadedtoMetavir[94],whichisawebservertoannotateviralmetagenomicsequences. It'sabletoannotateeachsequenceasviromeornon-viromebasedonsimilarityscorewith knownviromes.Thenumberofvirome-likecontigsforeachsamplewereshowninTable4.5. WeusedBowtietomapthepredictedspacerstothevirome-likecontigsineachsample.The numbersofhitsareshowninthe'hits'columnofTable4.5.Wecanseethatfor4outof6 samples,thereweresomeproportionsofspacerscanbemappedtothevirome-likecontigs assembledfromthesamesample.Forallthefoursamples,metaCRISPRhadmorehits thanIDBA-ud/CRISPRFinder.Especiallyforthesamplesequencedfromthepatient,and thesamplesequencedfromahealthyfamilymember(healthy4),IDBA-ud/CRISPRFinder hadnohits,whilemetaCRISPRhad7and6hits,respectively.Theexperimentshowsthat spacersidenbymetaCRISPRarereliable. 90 Chapter5 Conclusionandfuturework Inthisdissertation,Iintroducednon-codingRNAandtheimportancetoidentifyncR- NAsfromlarge-scalebiologicaldataset,whichincludesgenomicsequencedataandnext generationsequencingdata.AstheaccumulationofNGSdata,especiallymetagenomic data,ncRNAidenbecameachallengeproblem.First,traditionalsequenceanalysis toolsarenotsensitiveenoughtodetectremotelyhomologousncRNAs.Second,NGSpro- ducesshortsequences,mostexistingncRNAidentoolscannotbereadilyapplied tothistypeofdata.Third,NGS,especiallymetagenomicsequencing,producesmassive amountofshortreaddata.Alotofexistingalgorithmsandanalysispipelinesarenotable tohandledataofsuchscale.Toaddressthesechallenges,Iproposedasetofalgorithmsin thisdissertation. ThetoolisChain-RNA,whichcombinessequencesimilarityandstructuresimilarity toidentifylowlyconservednon-codingRNAsingenomicsequencedata.It'sfasterthan structurebasedtoolsandismoresensitivethansequenceanalysistoolssuchasBLAST. Thesecondtool,miR-PREFeR,predictsplantmicroRNAsfromnextgenerationsmallRNA sequencingdata.It'saccurate,fast,andveryuser-friendly.ThelasttoolIdevelopedis metaCRISPR,whichisatargetedCRISPRidentoolthatworksonmetagenomic data.ItcanrapidlyrecoverCRISPRsfromlarge-scalemetagenomicdatasetaccurately. Thereareseveraldirectionsthatthepreviousstudiescanbeimproved: 91 miR-PREFeRcurrentlyonlyworksforplantgenomesbecauseitwasdevelopedby utilizingplantmicroRNAannotationcriteria.SincemanysmallRNAseqdatafor animalsareavailable,itisusefultoextendmiR-PREFeRtoanimalgenomes.In addition,miR-PREFeRrequiresuserstomapthesmallRNASeqdatatoareference genomeorcontigs,whichmaynotbeavailableforsomenewlystudiedspecies.Whether wecanextendmiR-PREFeRtouseonlysmallRNASeqdataisaninterestingdirection forfurtherstudy. TheststepofmetaCRISPRistorecruitCRISPRreadsfromlarge-scalemetagenomic dataset.Nowthealgorithmdependsontheassumptionthatthereexisttwocopies ofthesamek-merintherepeatregioninaread.Thisisnottrueforsampleswhose readlengthisshort(i.e.75nt).Italsohasproblemwhentherepeat-spacerunitis long.Althoughbyincludingknownrepeatscanimprovetheperformance,itdidnot completelysolvetheproblemsincetheremayexistCRISPRswithnew,longrepeats inametagenomicsample.Weneedtodesignalgorithmsthatdonotdependonthe repeatstructureassumption. TheresourceusageofmetaCRISPRcanbefurtherimproved.First,theassemblystep wasimplementedinPythonandwassinglethreaded.Becausetclusterscanbe assembledindependently,oneimprovementistoparalleltheassemblysteptoutilize themulti-corearchitectureofmostoftoday'sdesktopcomputers.Inaddition,the pipelineusedthePythonlibraryofNetworkX[36]tomanipulateoverlapgraphsinthe assemblystage.Itisslowwhentheoverlapgraphislarge.Theperformancecanbe furtherimprovedifweimplementthepipelineusingC++. 92 BIBLIOGRAPHY 93 BIBLIOGRAPHY [1]R.Achawanantakun,Y.Sun,andS.S.Takyar.NcRNAconsensussecondarystruc- turederivationusinggrammarstrings. JournalofBioinformaticsandComputational Biology ,9(2):317{337,2011. [2]O.Aljawad,Y.Sun,A.Liu,andJ.Lei.NcRNAhomologysearchusingHammingdis- tanceseeds.In ProceedingsoftheACMConferenceonBioinformatics,Computational BiologyandBiomedicine(ACM-BCB11):1-3August2011;Chicago ,2011. [3]S.Altschul,W.Gish,W.Miller,E.Myers,andD.Lipman.Basiclocalalignment searchtool. JournalofMolecularBiology ,215:403{410,1990. [4]J.An,J.Lai,M.L.Lehman,andC.C.Nelson.miRDeep*:anintegratedapplica- tiontoolformiRNAidenfromRNAsequencingdata. NucleicAcidsRes. , 41(2):727{737,Jan2013. [5]A.F.AnderssonandJ.F.Viruspopulationdynamicsandacquiredvirus resistanceinnaturalmicrobialcommunities. Science ,320(5879):1047{1050,May2008. [6]M.J.Axtell.ShortStack:comprehensiveannotationandquanofsmallRNA genes. RNA ,19(6):740{751,Jun2013. [7]R.Barrangou,C.Fremaux,H.Deveau,M.Richards,P.Boyaval,S.Moineau,D.A. Romero,andP.Horvath.CRISPRprovidesacquiredresistanceagainstvirusesin prokaryotes. Science ,315(5819):1709{1712,Mar2007. [8]I.Ben-BassatandB.Chor.CRISPRdetectionfromshortreadsusingpartialoverlap graphs. LectureNotesinComputerScience ,9029:16{27,Mar2015. [9]S.H.BernhartandI.L.Hofacker.FromconsensusstructurepredictiontoRNAgene infunctionalgenomicsproteomics ,8(6):461{471,2009. [10]S.H.Bernhart,I.L.Hofacker,andP.F.Stadler.LocalRNAbasepairingprobabilities inlargesequences. Bioinformatics ,22(5):614{615,2006. [11]M.Blanchette,W.J.Kent,C.Riemer,L.Elnitski,A.F.Smit,K.M.Roskin, R.Baertsch,K.Rosenbloom,H.Clawson,E.D.Green,D.Haussler,andW.Miller. 94 AligningmultiplegenomicsequenceswiththeThreadedBlocksetAligner. Genome Research ,14(4):708{715,2004. [12]C.Bland,T.L.Ramsey,F.Sabree,M.Lowe,K.Brown,N.C.Kyrpides,andP.Hugen- holtz.Crisprrecognitiontool(crt):atoolforautomaticdetectionofclusteredregularly interspacedpalindromicrepeats. BMCBioinformatics ,8:209,2007. [13]A.F.Bompfunewerer,C.Flamm,C.Fried,G.Fritzsch,I.L.Hofacker,J.Lehmann, K.Missal,A.Mosig,B.Muller,S.J.Prohaska,B.M.Stadler,P.F.Stadler,A.Tanzer, S.Washietl,andC.Witwer.Evolutionarypatternsofnon-codingRNAs. Theoryin Biosciences ,123(4):301{369,2005. [14]M.S.Campbell,M.Law,C.Holt,J.C.Stein,G.D.Moghe,D.E.Hufnagel,J.Lei, R.Achawanantakun,D.Jiao,C.J.Lawrence,D.Ware,S.-H.Shiu,K.L.Childs, Y.Sun,N.Jiang,andM.Yandell.MAKER-P:atool-kitfortherapidcreation, management,andqualitycontrolofplantgenomeannotations. PlantPhysiology , 164(2):513{524,Feb2014. [15]S.Chikkagoudar,D.R.Livesay,andU.Roshan.PLAST-ncRNA:Partitionfunction LocalAlignmentSearchToolfornon-codingRNAsequences. NucleicAcidsResearch , 38:W59{W63,2010. [16]L.Cong,F.A.Ran,D.Cox,S.Lin,R.Barretto,N.Habib,P.D.Hsu,X.Wu,W.Jiang, L.A.andF.Zhang.MultiplexgenomeengineeringusingCRISPR/Cas systems. Science ,339(6121):819{823,Feb2013. [17]Y.Cui,Y.Li,O.Gorg,M.E.Platonov,Y.Yan,Z.Guo,C.Pourcel,S.V.Dentovskaya, S.V.Balakhonov,X.Wang,Y.Song,A.P.Anisimov,G.Vergnaud,andR.Yang. InsightintomicroevolutionofYersiniapestisbyclusteredregularlyinterspacedshort palindromicrepeats. PLoSOne ,3(7):e2652,2008. [18]M.deBerg,O.Cheong,M.vanKreveld,andM.Overmars. ComputationalGeometry: AlgorithmsandApplications(3rded.) .Springer,USA,2008. [19]J.E.DiCarlo,J.E.Norville,P.Mali,X.Rios,J.Aach,andG.M.Church.Genome engineeringinSaccharomycescerevisiaeusingCRISPR-Cassystems. NucleicAcids Res ,41(7):4336{4343,Apr2013. [20]C.Dieterich,H.Wang,K.Rateitschak,H.Luz,andM.Vingron.CORG:adatabase forCOmparativeRegulatoryGenomics. NucleicAcidsResearch ,pages55{57,2003. 95 [21]C.B.Do,C.-S.Foo,andS.Batzoglou.Amax-marginmodelforetsimultaneous alignmentandfoldingofRNAsequences. Bioinformatics ,24(13):168{176,2008. [22]R.C.Edgar.PILER-CR:fastandaccurateidenofCRISPRrepeats. BMC Bioinformatics ,8:18,2007. [23]J.A.Eisen.Environmentalshotgunsequencing:itspotentialandchallengesforstudy- ingthehiddenworldofmicrobes. PLoSBiol ,5(3):e82,Mar2007. [24]W.Filipowicz,S.N.Bhattacharyya,andN.Sonenberg.Mechanismsofpost- transcriptionalregulationbymicroRNAs:aretheanswersinsight? NatRevGenet , 9(2):102{114,Feb2008. [25]A.E.Friedland,Y.B.Tzur,K.M.Esvelt,M.P.Colaicovo,G.M.Church,andJ.A. Calarco.HeritablegenomeeditinginC.elegansviaaCRISPR-Cas9system. Nat Methods ,10(8):741{743,Aug2013. [26]M.R.Fander,S.D.Mackowiak,N.Li,W.Chen,andN.Rajewsky.miRDeep2 accuratelyidenknownandhundredsofnovelmicroRNAgenesinsevenanimal clades. NucleicAcidsRes. ,40(1):37{52,Jan2012. [27]P.P.Gardner,A.Wilm,andS.Washietl.Abenchmarkofmultiplesequencealignment programsuponstructuralRNAs. NucleicAcidsResearch ,33(8):2433{2439,2005. [28]G.GonnellaandS.Kurtz.Readjoiner:afastandmemorytstringgraph-based sequenceassembler. BMCBioinformatics ,13:82,2012. [29]J.Gorodkin,I.L.Hofacker,E.Torarinsson,Z.Yao,J.H.Havgaard,andW.L.Ruzzo. DenovopredictionofstructuredRNAsfromgenomicsequences. TrendsinBiotech- nology ,28:9{19,2010. [30]S.ths-Jones,A.Bateman,M.Marshall,A.Khanna,andS.R.Eddy.Rfam:an RNAfamilydatabase. NucleicAcidsResearch ,31:439{441,2003. [31]I.Grissa,G.Vergnaud,andC.Pourcel.CRISPRFinder:awebtooltoidentifyclustered regularlyinterspacedshortpalindromicrepeats. NucleicAcidsRes ,35(WebServer issue):W52{W57,Jul2007. [32]I.Grissa,G.Vergnaud,andC.Pourcel.TheCRISPRdbdatabaseandtoolstodisplay CRISPRsandtogeneratedictionariesofspacersandrepeats. BMCBioinformatics , 8:172,2007. 96 [33]D.G AlgorithmsonStrings,Trees,andSequences:ComputerScienceandCom- putationalBiology .CambridgeUniversityPress,UK,1997. [34]M.Hackenberg,N.Rodreleta,andA.M.Aransay.miRanalyzer:anupdate onthedetectionandanalysisofmicroRNAsinhigh-throughputsequencingexperi- ments. NucleicAcidsRes. ,39:W132{138,Jul2011. [35]D.H.Haft,J.Selengut,E.F.Mongodin,andK.E.Nelson.Aguildof45CRISPR- associated(Cas)proteinfamiliesandmultipleCRISPR/Cassubtypesexistinprokary- oticgenomes. PLoSComputBiol ,1(6):e60,Nov2005. [36]A.A.Hagberg,D.A.Schult,andP.J.Swart.Exploringnetworkstructure,dynamics, andfunctionusingnetworkx.InG.Varoquaux,T.Vaught,andJ.Millman,editors, Proceedingsofthe7thPythoninScienceConference ,pages11{15,Pasadena,CA USA,2008. [37]C.R.Hale,S.Majumdar,J.Elmore,N.M.Compton,S.Olson,A.M.Resch, C.V.C.Glover,3rd,B.R.Graveley,R.M.Terns,andM.P.Terns.Essentialfeatures andrationaldesignofCRISPRRNAsthatfunctionwiththeCasRAMPmodule complextocleaveRNAs. MolCell ,45(3):292{302,Feb2012. [38]J.H.Havgaard,R.B.Lyngso,G.D.Stormo,andJ.Gorodkin.Pairwiselocalstructural alignmentofRNAsequenceswithsequencesimilaritylessthan40%. Bioinformatics , 21(9):1815{1824,2005. [39]J.H.Havgaard,E.Torarinsson,andJ.Gorodkin.FastpairwisestructuralRNAalign- mentsbypruningofthedynamicalprogrammingmatrix. PLOScomputationalbiology , 3(10):1896{1908,2007. [40]M.Hess,A.Sczyrba,R.Egan,T.-W.Kim,H.Chokhawala,G.Schroth,S.Luo,D.S. Clark,F.Chen,T.Zhang,R.I.Mackie,L.A.Pennacchio,S.G.Tringe,A.Visel, T.Woyke,Z.Wang,andE.M.Rubin.Metagenomicdiscoveryofbiomass-degrading genesandgenomesfromcowrumen. Science ,331(6016):463{467,Jan2011. [41]I.Hofacker,W.Fontana,P.Stadler,S.BonhoM.Tacker,andP.Schuster.Fast foldingandcomparisonofRNAsecondarystructures. Monatsheftef.Chemie ,125:167{ 188,1994. [42]I.L.HofackerandP.F.Stadler.MemoryntfoldingalgorithmsforcircularRNA secondarystructures.In ProceedingsoftheGermanConferenceonBioinformatics. GCB2003 ,pages15{25,2006. 97 [43]P.HorvathandR.Barrangou.CRISPR/Cas,theimmunesystemofbacteriaand archaea. Science ,327(5962):167{170,Jan2010. [44]P.Horvath,D.A.Romero,A.-C.Cot-Monvoisin,M.Richards,H.Deveau,S.Moineau, P.Boyaval,C.Fremaux,andR.Barrangou.Diversity,activity,andevolutionof CRISPRlociinStreptococcusthermophilus. JBacteriol ,190(4):1401{1412,Feb2008. [45]X.HuangandA.Madan.Cap3:Adnasequenceassemblyprogram. GenomeRes , 9(9):868{877,Sep1999. [46]W.Y.Hwang,Y.Fu,D.Reyon,M.L.Maeder,S.Q.Tsai,J.D.Sander,R.T. Peterson,J.-R.J.Yeh,andJ.K.Joung.tgenomeeditinginusinga CRISPR-Cassystem. NatBiotechnol ,31(3):227{229,Mar2013. [47]Illumina.HiSeqXTenPreliminaryProductInformationSheet.[Online;accessed 25-April-2014]. [48]R.Jansen,J.D.A.vanEmbden,W.Gaastra,andL.M.Schouls.Idenofa novelfamilyofsequencerepeatsamongprokaryotes. OMICS ,6(1):23{33,2002. [49]D.-H.Jeong,S.R.Thatcher,R.S.H.Brown,J.Zhai,S.Park,L.A.Rymarquis,B.C. Meyers,andP.J.Green.ComprehensiveinvestigationofmicroRNAsenhancedby analysisofsequencevariants,expressionpatterns,ARGONAUTEloading,andtarget cleavage. PlantPhysiol. ,162(3):1225{1245,Jul2013. [50]W.Jiang,H.Zhou,H.Bi,M.Fromm,B.Yang,andD.P.Weeks.Demonstrationof CRISPR/Cas9/sgRNA-mediatedtargetedgenemoinArabidopsis,tobacco, sorghumandrice. NucleicAcidsRes ,41(20):e188,Nov2013. [51]M.W.W.Jones-Rhoades,D.P.P.Bartel,andB.Bartel.MicroRNAsandtheir regulatoryrolesinplants. AnnualReviewofPlantBiology ,57:19{53,2006. [52]F.V.KarginovandG.J.Hannon.TheCRISPRsystem:smallrna-guideddefensein bacteriaandarchaea. MolCell ,37(1):7{19,Jan2010. [53]A.Keller,C.Backes,P.Leidinger,N.Kefer,V.Boisguerin,C.Barbacioru,B.Vogel, M.Matzas,H.Huwer,H.A.Katus,C.Sthler,B.Meder,andE.Meese.Next-generation sequencingidennovelmicroRNAsinperipheralbloodoflungcancerpatients. Mol Biosyst ,7(12):3187{3199,Dec2011. [54]H.Kiryu,Y.Tabei,T.Kin,andK.Asai.Murlet:apracticalmultiplealignmenttool forstructuralRNAsequences. Bioinformatics ,23(13):1588{1598,2007. 98 [55]S.Kishore,A.Khanna,Z.Zhang,J.Hui,P.J.Balwierz,M.Stefan,C.Beach,R.D. Nicholls,M.Zavolan,andS.Stamm.ThesnoRNAMBII-52(SNORD115)isprocessed intosmallerRNAsandregulatesalternativesplicing. HumMolGenet ,19(7):1153{ 1164,Apr2010. [56]R.J.KleinandS.R.Eddy.RSEARCH:homologsofsinglestructuredRNA sequences. BMCBioinformatics ,4:44{59,2003. [57]D.C.Koboldt,K.M.Steinberg,D.E.Larson,R.K.Wilson,andE.R.Mardis.The next-generationsequencingrevolutionanditsimpactongenomics. Cell ,155(1):27{38, Sep2013. [58]A.KozomaraandS.miRBase:integratingmicroRNAannotationand deep-sequencingdata. NucleicAcidsRes. ,39(Databaseissue):D152{157,Jan2011. [59]V.Kunin,R.Sorek,andP.Hugenholtz.Evolutionaryconservationofsequenceand secondarystructuresinCRISPRrepeats. GenomeBiol ,8(4):R61,2007. [60]D.Langenberger,C.Bermudez-Santana,J.Hertel,S.P.Khaitovich,and P.F.Stadler.EvidenceforhumanRNAsinsmallRNAsequencing data. Bioinformatics ,25(18):2298{2301,2009. [61]B.Langmead,C.Trapnell,M.Pop,andS.L.Salzberg.Ultrafastandt alignmentofshortDNAsequencestothehumangenome. GenomeBiology ,10(3):R25{ 34,2009. [62]M.H.Larson,L.A.Gilbert,X.Wang,W.A.Lim,J.S.Weissman,andL.S.Qi. CRISPRinterference(CRISPRi)forsequence-spcontrolofgeneexpression. Nat Protoc ,8(11):2180{2196,Nov2013. [63]N.Leblond-Bourget,H.Philippe,I.Mangin,andB.Decaris.16SrRNAand16S to23Sinternaltranscribedspacersequenceanalysesrevealinter-andintrasp phylogeny. IntJSystBacteriol ,46(1):102{111,Jan1996. [64]H.Li,B.Handsaker,A.Wysoker,T.Fennell,J.Ruan,N.Homer,G.Marth,G.Abeca- sis,R.Durbin,and.G.P.D.P.S..TheSequenceAlignment/Mapformatand SAMtools. Bioinformatics ,25(16):2078{2079,Aug2009. [65]S.-D.Li,S.Chono,andL.Huang.toncogenesilencingandmetastasisinhibition viasystemicdeliveryofsiRNA. MolTher ,16(5):942{946,May2008. 99 [66]L.Liu,Y.Li,S.Li,N.Hu,Y.He,R.Pong,D.Lin,L.Lu,andM.Law.Comparison ofnext-generationsequencingsystems. JBiomedBiotechnol ,2012:251364,2012. [67]Y.P.Liu,J.Haasnoot,O.terBrake,B.Berkhout,andP.Konstantinova.Inhibition ofHIV-1bymultiplesiRNAsexpressedfromasinglemicroRNApolycistron. Nucleic AcidsRes ,36(9):2811{2824,May2008. [68]R.Lorenz,S.H.Bernhart,C.onerZuSiederdissen,H.Tafer,C.Flamm,P.F.Stadler, andI.L.Hofacker.ViennaRNAPackage2.0. AlgorithmsMolBiol ,6:26,2011. [69]C.Lu,S.S.Tej,S.Luo,C.D.Haudenschild,B.C.Meyers,andP.J.Green.Elucidation ofthesmallRNAcomponentofthetranscriptome. Science ,309(5740):1567{1569,Sep 2005. [70]S.Lu,R.Shi,C.-C.Tsao,X.Yi,L.Li,andV.L.Chiang.RNAsilencinginplantsby theexpressionofsiRNAduplexes. Nucl.AcidsRes. ,32(21):e171{177,2004. [71]V.M.Markowitz,I.-M.A.Chen,K.Chu,E.Szeto,K.Palaniappan,M.Pillay,A.Rat- ner,J.Huang,I.Pagani,S.Tringe,M.Huntemann,K.Billis,N.Varghese,K.Ten- nessen,K.Mavromatis,A.Pati,N.N.Ivanova,andN.C.Kyrpides.Img/m4ver- sionoftheintegratedmetagenomecomparativeanalysissystem. NucleicAcidsRes , 42(Databaseissue):D568{D573,Jan2014. [72]L.A.andE.J.Sontheimer.CRISPRinterferencelimitshorizontalgene transferinstaphylococcibytargetingDNA. Science ,322(5909):1843{1845,Dec2008. [73]J.A.MartinandZ.Wang.Next-generationtranscriptomeassembly. NatRevGenet , 12(10):671{682,Oct2011. [74]A.MathelierandA.Carbone.MIReNA:microRNAswithhighaccuracyandno learningatgenomescaleandfromdeepsequencingdata. Bioinformatics ,26(18):2226{ 2234,Sep2010. [75]J.McCaskill.Theequilibriumpartitionfunctionandbasepairbindingprobabilities forRNAsecondarystructures. Biopolymers ,29:1105{1119,1990. [76]M.C.Mendez-Ortega,S.Restrepo,L.M.RodrI.Perez,J.C.Mendoza,A.P. MartR.Sierra,andG.J.Rey-Benito.AnRNAiinsilicoapproachtoan optimalshRNAcocktailagainstHIV-1. VirolJ ,7:369,2010. [77]M.L.Metzker.Sequencingtechnologies-thenextgeneration. NatRevGenet , 11(1):31{46,Jan2010. 100 [78]B.C.Meyers,M.J.Axtell,B.Bartel,D.P.Bartel,D.Baulcombe,J.L.Bowman, X.Cao,J.C.Carrington,X.Chen,P.J.Green,S.GS.E.Jacobsen, A.C.Mallory,R.A.Martienssen,R.S.Poethig,Y.Qi,H.Vaucheret,O.Voinnet, Y.Watanabe,D.Weigel,andJ.-K.Zhu.CriteriaforannotationofplantMicroRNAs. PlantCell ,20(12):3186{3190,Dec2008. [79]C.Mlot.Centromeres.Ajourneytothecenterofthechromosome. Science , 290(5499):2057{2059,Dec2000. [80]R.D.Morin,M.D.O'Connor,M.th,F.Kuchenbauer,A.Delaney,A.-L.Prabhu, Y.Zhao,H.McDonald,T.Zeng,M.Hirst,C.J.Eaves,andM.A.Marra.Applicationof massivelyparallelsequencingtomicroRNAanddiscoveryinhumanembryonic stemcells. GenomeRes ,18(4):610{621,Apr2008. [81]E.W.Myers.Thefragmentassemblystringgraph. Bioinformatics ,21Suppl2:ii79{ ii85,Sep2005. [82]E.W.Myers.Thefragmentassemblystringgraph. Bioinformatics ,21Suppl2:ii79{ ii85,Sep2005. [83]E.P.Nawrocki,D.L.Kolbe,andS.R.Eddy.Infernal1.0:InferenceofRNAalign- ments. Bioinformatics ,25:1335{1337,2009. [84]C.T.Neilsen,G.J.Goodall,andC.P.Bracken.IsomiRs{theoverlookedrepertoirein thedynamicmicroRNAome. TrendsGenet. ,28(11):544{549,Nov2012. [85]P.C.NgandE.F.Kirkness.Wholegenomesequencing. MethodsMolBiol ,628:215{ 226,2010. [86]K.C.Pang,M.C.Fritha,andJ.S.Mattick.RapidevolutionofnoncodingRNAs: lackofconservationdoesnotmeanlackoffunction. TrendsinGenetics ,22(1):1{5, 2005. [87]J.S.Pedersen,G.Bejerano,A.Siepel,K.Rosenbloom,K.Lindblad-Toh,E.S.Lander, J.Kent,W.Miller,andD.Haussler.IdenandofconservedRNA secondarystructuresintheHumanGenome. PLoSComputBiol ,2(4):251{262,2006. [88]E.Pennisi.TheCRISPRcraze. Science ,341(6148):833{836,Aug2013. [89]M.Petrillo,G.Silvestro,P.P.D.Nocera,A.Boccia,andG.Paolella.Stem-loop structuresinprokaryoticgenomes. BMCGenomics ,7:170,2006. 101 [90]K.Pougach,E.Semenova,E.Bogdanova,K.A.Datsenko,M.Djordjevic,B.L.Wan- ner,andK.Severinov.Transcription,processingandfunctionofCRISPRcassettesin Escherichiacoli. MolMicrobiol ,77(6):1367{1379,Sep2010. [91]M.Rho,Y.Wu,H.Tang,T.G.Doak,andY.Ye.DiverseCRISPRsevolvinginhuman microbiomes. PLoSGenet ,8(6):e1002441,2012. [92]M.Rho,Y.-W.Wu,H.Tang,T.G.Doak,andY.Ye.DiverseCRISPRsevolvingin humanmicrobiomes. PLoSGenet ,8(6):e1002441,2012. [93]E.RivasandS.R.Eddy.NoncodingRNAgenedetectionusingcomparativesequence analysis. BMCBioinformatics ,2(1):8{26,2001. [94]S.Roux,J.Tournayre,A.Mahul,D.Debroas,andF.Enault.Metavir2:newtoolsfor viralmetagenomecomparisonandassembledviromeanalysis. BMCBioinformatics , 15:76,2014. [95]R.L.Rungta,H.B.Choi,P.J.Lin,R.W.Ko,D.Ashby,J.Nair,M.Manoharan, P.R.Cullis,andB.A.Macvicar.LipidNanoparticleDeliveryofsiRNAtoSilence NeuronalGeneExpressionintheBrain. MolTherNucleicAcids ,2:e136,2013. [96]F.SangerandA.R.Coulson.ArapidmethodfordeterminingsequencesinDNAby primedsynthesiswithDNApolymerase. JMolBiol ,94(3):441{448,May1975. [97]F.Sanger,S.Nicklen,andA.R.Coulson.DNAsequencingwithchain-terminating inhibitors. ProcNatlAcadSciUSA ,74(12):5463{5467,Dec1977. [98]D.SankSimultaneoussolutionoftheRNAfolding,alignmentandprotosequence problems. SIAMJournalonAppliedMathematics ,45(5):810{825,1985. [99]M.Shakya,C.Quince,J.H.Campbell,Z.K.Yang,C.W.Schadt,andM.Podar. ComparativemetagenomicandrRNAmicrobialdiversitycharacterizationusingar- chaealandbacterialsyntheticcommunities. EnvironMicrobiol ,15(6):1882{1899,Jun 2013. [100]T.S.SimonGladman.Velvetoptimiser,2012. [101]C.T.Skennerton,M.Imelfort,andG.W.Tyson.Crass:idenandreconstruc- tionofCRISPRfromunassembledmetagenomicdata. NucleicAcidsRes ,41(10):e105, May2013. 102 [102]R.Sorek,V.Kunin,andP.Hugenholtz.CRISPR{awidespreadsystemthatpro- videsacquiredresistanceagainstphagesinbacteriaandarchaea. NatRevMicrobiol , 6(3):181{186,Mar2008. [103]J.SperschneiderandA.Datta.DotKnot:pseudoknotpredictionusingtheprobability dotplotunderaenergymodel. NucleicAcidsResearch ,38(7):e103{115,2010. [104]A.Stern,E.Mick,I.Tirosh,O.Sagy,andR.Sorek.CRISPRtargetingrevealsa reservoirofcommonphagesassociatedwiththehumangutmicrobiome. GenomeRes , 22(10):1985{1994,Oct2012. [105]Y.TabeiandK.Asai.Alocalmultiplealignmentmethodfordetectionofnon-coding RNAsequences. Bioinformatics ,25(12):1498{1505,2009. [106]Y.Tabei,H.Kiryu,T.Kin,andK.Asai.Afaststructuralmultiplealignmentmethod forlongRNAsequences. BMCBioinformatics ,9:33{49,2008. [107]Y.Tabei,K.Tsuda,T.Kin,andK.Asai.SCARNA:fastandaccuratestructural alignmentofRNAsequencesbymatchingstemfragments. Bioinformatics , 22(14):1723{1729,2006. [108]O.terBrake,P.Konstantinova,M.Ceylan,andB.Berkhout.SilencingofHIV-1with RNAinterference:amultipleshRNAapproach. MolTher ,14(6):883{892,Dec2006. [109]J.D.Thompson,D.G.Higgins,andT.J.Gibson.CLUSTALW:improvingthesensi- tivityofprogressivemultiplesequencealignmentthroughsequenceweighting,position- spgappenaltiesandweightmatrixchoice. NucleicAcidsResearch ,22(22):4673{ 4680,1994. [110]E.Torarinsson,J.H.Havgaard,andJ.Gorodkin.Multiplestructuralalignmentand clusteringofRNAsequences. Bioinformatics ,23(8):926{932,2007. [111]G.W.Tyson,J.Chapman,P.Hugenholtz,E.E.Allen,R.J.Ram,P.M.Richardson, V.V.Solovyev,E.M.Rubin,D.S.Rokhsar,andJ.F.Communitystructure andmetabolismthroughreconstructionofmicrobialgenomesfromtheenvironment. Nature ,428(6978):37{43,Mar2004. [112]A.V.Uzilov,J.M.Keegan,andD.H.Mathews.Detectionofnon-codingRNAs onthebasisofpredictedsecondarystructureformationfreeenergychange. BMC Bioinformatics ,7(1):173{202,2006. 103 [113]H.Wang,H.Yang,C.S.Shivalila,M.M.Dawlaty,A.W.Cheng,F.Zhang,and R.Jaenisch.One-stepgenerationofmicecarryingmutationsinmultiplegenesby CRISPR/Cas-mediatedgenomeengineering. Cell ,153(4):910{918,May2013. [114]W.-C.Wang,F.-M.Lin,W.-C.Chang,K.-Y.Lin,H.-D.Huang,andN.-S.Lin.miREx- press:analyzinghigh-throughputsequencingdataformicroRNAexpression. BMCBioinformatics ,10:328,2009. [115]Z.Wang,M.Gerstein,andM.Snyder.RNA-Seq:arevolutionarytoolfortranscrip- tomics. NatRevGenet ,10(1):57{63,Jan2009. [116]S.Washietl,I.L.Hofacker,andP.F.Stadler.Fastandreliablepredictionofnoncoding RNAs. ProcNatlAcadSciUSA ,102(7):2454{2459,2005. [117]Y.Wei,M.T.Chesne,R.M.Terns,andM.P.Terns.Sequencesspanningtheleader- repeatjunctionmediateCRISPRadaptationtophageinstreptococcusthermophilus. NucleicAcidsRes ,43(3):1749{1758,Feb2015. [118]S.Will,K.Reiche,I.L.Hofacker,P.F.Stadler,andR.Backofen.Inferringnoncoding RNAfamiliesandclassesbymeansofgenome-scalestructure-basedclustering. PLoS ComputationalBiology ,3(4):680{691,2007. [119]A.Wilm,I.Mainz,andG.Steger.AnenhancedRNAalignmentbenchmarkfor sequencealignmentprograms. AlgorithmsforMolecularBiology ,1:19{29,2006. [120]X.YangandL.Li.miRDeep-P:acomputationaltoolforanalyzingthemicroRNA transcriptomeinplants. Bioinformatics ,27(18):2614{2615,Sep2011. [121]P.Yarza,M.Richter,J.Peplies,J.Euzeby,R.Amann,K.-H.Schleifer,W.Ludwig, F.O.Glckner,andR.Rossell-Mra.TheAll-SpeciesLivingTreeproject:a16SrRNA- basedphylogenetictreeofallsequencedtypestrains. SystApplMicrobiol ,31(4):241{ 250,Sep2008. [122]D.R.Yoder-Himes,P.S.G.Chain,Y.Zhu,O.Wurtzel,E.M.Rubin,J.M.Tiedje,and R.Sorek.MappingtheBurkholderiacenocepacianicheresponseviahigh-throughput sequencing. ProcNatlAcadSciUSA. ,106(10):3976{3981,2009. [123]D.R.ZerbinoandE.Birney.Velvet:algorithmsfordenovoshortreadassemblyusing deBruijngraphs. GenomeRes ,18(5):821{829,May2008. [124]B.Zhang,X.Pan,G.P.Cobb,andT.A.Anderson.PlantmicroRNA:asmallregu- latorymoleculewithbigimpact. Dev.Biol. ,289(1):3{16,Jan2006. 104 [125]S.ZhaoandM.-F.Liu.MechanismsofmicroRNA-mediatedgeneregulation. SciChina CLifeSci ,52(12):1111{1116,Dec2009. [126]M.ZukerandP.Stiegler.OptimalcomputerfoldingoflargeRNAsequencesusing thermodynamicandauxiliaryinformation. NuclAcidRes ,9:133{148,1981. 105