INFERENCEOFVIRALSTRAINSUSINGMETAGENOMICSDATA By JiaoChen ADISSERTATION Submittedto MichiganStateUniversity inpartialoftherequirements forthedegreeof ComputerScienceŠDoctorofPhilosophy 2018 ABSTRACT INFERENCEOFVIRALSTRAINSUSINGMETAGENOMICSDATA ByJiaoChen RNAViruses,suchashumanyvirus(HIV),andhepatitisCvirus (HCV),havegreatimpactonhumanhealth.ThehighmutationrateofRNAvirusescanproduce apopulationofdifferentbutcloselyrelatedvirussequencescalledviralquasispecies.Todevelop preventionandtreatmentstrategiesforviralpathogens,anessentialandfundamentalstepisto characterizetheirsequencesandabundancesatthestrain-levelforaviralquasispecies. Advancesinnext-generationsequencing(NGS)technologieshaveopenedupnewopportuni- tiestostudyviruses.Viralmetagenomicdata,whichcontainsthegeneticinformationforabunch ofvirusesinthesamehabitat,havebecomethemajorresourceforcharacterizingRNAviruses. Althoughtherearemanypipelinesforanalyzingvirusesinmetagenomicdata,theyusuallylack threefunctions.First,novelvirusesorvirusesthatlackcloselyrelatedreferencegenomescannot bedetectedwithhighsensitivity.Second,strain-levelanalysisisusuallymissing.Althoughthere areseveralassemblytoolsdesignedforviralhaplotypes, denovo assemblyofvirus genomesisstillacomputationallychallengetaskduetotheerror-proneshortreads,highsimilar- itiesbetweenrelatedstrains,andunknownnumberorabundanceofvirushaplotypes.Third,itis hardtoestimatethenumberofhaplotypesandtheirabundancesinaquasispecies. Inthisdissertation,wehavedevelopedapipelinewiththreetools,TAR-VIRPEHaploand VirBintoaddressthechallengesexistingforviralmetagenomicassemblyandanalysis.TAR- VIRisatoolforclassifyingandenrichingviralreadsfrommetagenomicdatawithoutrelying oncompleteorhigh-qualityreferencegenomes.ItisoptimizedforidentifyingRNAviruses frommetagenomicdatawithanefBurrows-WheelerTransform(BWT)basedoverlapping method.TAR-VIRwastestedonbothsimulatedandrealviralmetagenomicdatasets.Theresults demonstratedTAR-VIRcompetesfavorablywithbenchmarkedtools.PEHaploisa denovo hap- lotypereconstructiontool,whichemployspaired-endreadstodistinguishhighlysimilarstrainsfor viralquasispeciesdata.Itwasappliedonbothsimulatedandrealquasispeciesdata,andtheresults werebenchmarkedagainstseveralrecentlypublished denovo haplotypereconstructiontools.The comparisonshowsthatPEHaplooutperformsthebenchmarkedtoolsinacomprehensivesetof metrics.Withassembledviralcontigs,VirBinfocusesonestimatingthenumberofhaplotypes andclusteringthecontigstodifferenthaplotypes.VirBinwindowsfromcontigs alignmenttoestimatehaplotypenumberandbettercalculatethecontigabundances.Then, itappliesanExpectation-Maximizationmethodtoclusterthecontigsbasedoncontigabundance levels.TheexperimentalresultsofVirBinonbothsimulatedandrealdatasetsshowthatthe window-basedmethodcanpreciselyestimatethehaplotypenumber,andgeneratemoreaccurate abundanceestimationforcontigsthatwilleventuallyleadtosuperiorclusteringresults.Inaddi- tion,thisdissertationalsocontainsonechapterfortheotherworkwehavedoneforidentifying theprimarytranscriptionstartsitesformiRNAgenesinCaenorhabditiselegansandmousewith Cap-seqdata. ACKNOWLEDGMENTS Firstandforemost,IwouldliketothankmyadvisorDr.YanniSun.Herperfectsupervisionhas beenguidingmethroughallmyresearchanddissertationworkduringmyPh.D.study.Nothing couldbeachievedwithoutherdevotedtime,patience,andinspiration.Duringthepastveandhalf yearsstudyingatMichiganStateUniversity,Dr.Sunhelpedmeonmyresearchwork,courses, andhowtowriteorpresentresearchresults.Ihavebeenkeptimprovinginmakingpresentations toaudiencesfromdifferentbackgrounds,modelingbioinformaticproblemswithcomputational techniques,andreadingorwritingpapers. Next,IwanttothankmyothercommitteemembersDr.DavidArnosti,Dr.KevinLiu,andDr. JinChen.Dr.Arnosticarefullyreadandprovidedmewithmanyvaluablesuggestionsformy paperonmiRNAs.HealsoinvitedmetojoinandtalkonJournalClub,whichwasanimportant partofmyPh.D.training.Dr.LiugavemethelectureofCSE836(ComputationalBiology)on mysecond-yearstudyinMSU,whichisacoursecloselyrelatedtomyPh.D.research.Healso provideduswithseveralchancesofpresentingpapers.Dr.Chenhelpedmealotforeithermy comprehensiveexamordefense.IalsothankDr.JianrongWang.Hehadbeenhelpingmeon thebinningprojectbyprovidinghelpfuldiscussionsandbeingalwaysapproachable. IalsothankmylabmatesDr.YuanZhang,Dr.ChengYuan,Dr.JikaiLei,Dr.Rujira Achawanantakun,NanDuandYumengWen,fortheirhelponmyeitherresearchworkordaily life.MythankalsogoestomypreviousroommateChaoyueLiuandShaohuaYang. Finally,Iwanttoexpressmysinceregratitudetomyparentsandsister.Theyhavealwaysbeen encouragingandsupportingmeforpursuingmycareer,andarealwaystherewheneverIneed them. iv TABLEOFCONTENTS LISTOFTABLES ....................................... viii LISTOFFIGURES ...................................... x LISTOFALGORITHMS ................................... xv Chapter1Introduction .................................. 1 1.1Overviewofthepipeline...............................3 1.2Next-generationsequencing..............................5 1.2.1Librarypreparation..............................6 1.2.2ClusterGeneration..............................6 1.2.3CyclicReversibleTermination(CRT)Sequencing..............7 1.2.4Dataanalysis.................................8 1.2.5Paired-EndSequencing............................8 1.3Virusquasispecies...................................10 1.3.1Thequasispeciestheory............................11 Chapter2TAR-VIR:TARgetedVIRalreadsandstrainreconstruc- tionfrommetagenomicdata ......................... 14 2.1Background......................................14 2.1.1Relatedwork.................................15 2.1.2Overviewofourwork.............................17 2.2Methods........................................18 2.2.1Twoscenarios.................................18 2.2.2Validityofreadrecruitmentusingoverlapdetection.............20 2.2.2.1Sequencingerrors.........................21 2.2.2.2Chimericreads...........................22 2.2.3Readrecruiting................................22 2.2.3.1Uniqueimplementationstrategies.................25 2.2.4Iterativesearch................................26 2.2.4.1Runningtimeandmemoryusage.................27 2.2.5Strain-levelassembly.............................28 2.3Resultsanddiscussion................................28 2.3.1Sizesofcommonregionsbetweenhumanvirusesandothermicrobialspecies29 2.3.2Exp1:reconstructtheSARShaplotypesusingthebatcoronavirusasthe reference...................................30 2.3.2.1Datapropertiesandevaluationmetrics..............30 2.3.2.2Performanceofreadrecruitment..................33 2.3.3Exp2:characterizinghepatitisvirusesfromthehumanplasmadata.....37 2.3.3.1Preprocessing............................38 v 2.3.3.2RecruitedreadsbyTAR-VIRcanimprovetheperformanceof de novo assembly...........................38 2.3.3.3Comparisonwithreference-basedandextension-basedassembly methods..............................41 2.3.3.4Assemblingthewholedatasetdirectly..............42 2.3.4Identifyingvirusescontainingtargetgenes..................42 2.3.5Computationaltimeandmemoryusage...................44 2.4Conclusions......................................45 Chapter3Denovohaplotypereconstructioninvirusquasispeciesusingpaired- endreads ................................... 47 3.1Introduction......................................47 3.2Methods........................................50 3.2.1Overlaphgraphandpaired-endgraph....................51 3.2.2MutationRateforSequenceReplicationandProbabilityofLongestCom- monSubstring(LCS).............................51 3.2.3Usepaired-endreadstodistinguishdifferenthaplotypes...........55 3.2.4ThewholepipelineofPEHaplo.......................56 3.2.4.1Datapre-processing........................58 3.2.4.2Overlapgraphconstruction....................59 3.2.4.3Graphpruning...........................60 3.2.4.4Paired-endguidedpath...................63 3.2.4.5Correctingcontigswithpaired-endreaddistribution.......69 3.3Results.........................................69 3.3.1LCSprobabilitysimulation..........................70 3.3.2BenchmarkonsimulatedHIVdataset....................71 3.3.2.1Paired-endguidedpathisabletogenerateaccuratelong contigs...............................73 3.3.3BenchmarkonMiSeqdataset........................74 3.3.4BechmarkonsimulatedbiasedHIVdatasets................77 3.3.5Bechmarkondataset........................78 3.3.6Computationaltimeandmemoryusage...................79 3.4DiscussionandConclusion..............................79 Chapter4Alignmentwindowsbasedviralstrain-levelcontigsbinning ....... 81 4.1Introduction......................................81 4.2Methods........................................82 4.2.1Problem..............................82 4.2.2TheVirBinalgorithmoverview........................83 4.2.3Estimatehaplotypenumberbycontigsalignmentandwindowsextraction.84 4.2.4Expectation-Maximizationmethodforbinningcontigs...........86 4.3Results.........................................88 4.3.1HIVsimulateddataset............................88 4.3.1.1Datasimulation...........................88 4.3.1.2Resultsfor5HIVhaplotypes...................90 vi 4.3.1.3Resultsfor10HIVhaplotypes...................93 4.3.2HIVrealMiSeqdataset...........................95 4.3.2.1HIVrealdatasetandcontigs...................95 4.3.2.2ResultsforHIVrealdataset....................96 4.4DiscussionandConclusion..............................99 Chapter5StudyingtranscriptionalregulationsofmiRNAgeneswithCap-seq ... 100 5.1Introduction......................................100 5.2MaterialsandMethods................................105 5.2.1Datasetsandprocessing............................105 5.2.2Clusteringof5'endreads..........................105 5.2.3Statisticalanalysis..............................106 5.2.4miRNAclusterandintergenicpre-miRNAs..........107 5.2.5FindingbidirectionalandmultipleTSSspromoters.............107 5.3Results.........................................108 5.3.1ofprimarymiRNATSSsin C.elegans andmouse......108 5.3.1.1OverviewofprimarymiRNATSSsannotation..........108 5.3.1.2Comparisonwithpreviouswork..................110 5.3.25'm 7 Gcappedpre-miRNAsarein C.elegans ..........111 5.3.2.1Possible5'recessedRNAsenrichedbyCap-seq.........111 5.3.2.2ng5'm 7 Gcappedpre-miRNAswithpre-capTICs.....113 5.3.3M 7 Gcappedpre-miRNAsoftenhaveupstreamprimaryTICs........115 5.3.45p-miRNAsareproducedfromthem 7 Gcappedpre-miRNAs..117 5.3.5MultipletranscriptioninitiationsitesformiRNAclustersin C.elegans ...118 5.3.5.15're-capafterpost-transcriptionalprocessing...........119 5.3.5.2MultipleTSSscanbegeneratedfromthesamepromoter.....121 5.3.5.3Pre-miRNAsinaclustercanbetranscribedindependently....123 5.3.6ChromatinandPolIIofprimarymiRNApromoters........124 5.3.6.1IdentiofdivergentandmultipleTSSspromoters.....124 5.3.6.2ChromatinandPolIIsurroundingmiRNApromoters..125 5.4Discussion.......................................126 Chapter6ConclusionandFuturework ........................ 130 6.1Conlusions.......................................130 6.2Futurework......................................132 6.2.1Reference-freeviralsequences.................132 6.2.2Virusquasispeciesassemblyfuturework...................132 6.2.3Binninglow-qualitycontigs.........................133 BIBLIOGRAPHY ....................................... 134 vii LISTOFTABLES Table2.1:ReadrecruitmentresultsbyusingseedsetsconstructedwithBowtie2and BWA.ThefiAlignmentflsectioncontainsresultsforalignedreads.The fiRecruitmentflsectioncontainsresultsforrecruitedreadsbyTAR-VIR usingthealignedreads.Foreachrow,thealignedreadsinfiAlignment" sectionaretheseedsetfortherecruitedreadsinfiRecruitmentflsection. ForBowtie2,thefiŒscore-minflparameterwassettoallowdifferentalign- menterrorratescorrespondingto5%,10%,15%,and20%,respectively. ForBWA,fi-Aflisedasitsdefaultvalue1.fi-Bflwastoal- lowdifferenterrorratesimilartoBowtie2.fiNumberflisthenumberof alignedorrecruitedreads.fiDepthflistheaveragesequencingcoverage. fiCoverageflisthepercentageofgenomecoveringbyatleastoneread.h1 toh5representtheveSARS-CoVhaplotypes...............34 Table2.2:AssemblyresultsonSARS-CoValignedandrecruitedmetagenomicdata. ThedefaultassemblytoolinTAR-VIRisPEHaplo.Theofthe metricscanbefoundinSection2.3.2.1...................37 Table2.3:AssemblyresultsonSARS-CoVmetagenomicdataforTAR-VIRand PRICE....................................37 Table2.4:Overlapextensionresultsusingdifferentseedset R 0 .`#'represents`num- ber'.TheshadedregionsinthistableandTable2.5highlightthecase wherelessrecruitedreadscanproducebetterassemblyresultsthanaligned readsonly..................................39 Table2.5:Assemblyevaluationresultsonalignedandrecruitedreadsusingthegenomes ofHBV,HCV,andHPgVasreferences.`cov.'istheabbreviationfor `coverage'.ThedefaultassemblycomponentinTAR-VIRisPEHaplo..40 Table2.6:Assemblyresultscomparisonwithreference-andextension-basedmeth- ods......................................41 Table2.7:AssemblyresultsonrecruitedreadswithapartialCDSsequencefor HPgVasreference..............................43 Table2.8:Timeandmemoryusageforoverlapextensionandassemblyonviral metagenomicdatafromhumanplasam.The denovo assemblytimeand memoryusagewereevaluatedonrecruitedreadsbasedonmismatchrate from5%to20%...............................44 Table3.1:Pairwisesequencesimilaritybetween5HIV-1strains...........70 viii Table3.2:Longestcommonsubstring(LCS)between5HIV-1strains.Thesestrains havesimilarlengthsofabout10kbp....................70 Table3.3:AssemblyresultsonsimulatedHIVdatasetforIVA,MLEHaplo,SAV- AGEandPEHaplo.Contigsthatareatleast500bparealignedtothe truehaplotypesequenceswithasimilaritycutoffof98%.TheN50value isthemaximallengththatallcontigsofatleastthislengthcoveratleast halfofthetotalassemblylength.......................73 Table3.4:AssemblyresultsonrealHIVMiSeqdatasetforIVA,MLEHaplo,SAV- AGEandPEHaplo.ContigsareevaluatedwithMetaQuastusingthesame parameterstosimulateddataset.......................75 Table3.5:AssemblyresultsonsimulatedbiasedHXB2-NL43MiSeqdatasetfo SAVAGEPEHaplo.ContigsareevaluatedwithMetaQuast.........77 Table3.6:AssemblyresultsonMiSeqdatasetforSAVAGEandPEHaplo. ContigsareevaluatedwithMetaQuastusingthesameparameterstoHIV datasets...................................78 Table4.1:Haplotypeabundancesforsimulated5HIVhaplotypescalculatedfrom readsmapping................................90 Table4.2:EMclusteringresultsonsimulated5HIVhaplotypecontigsforVirBin andMaxBin..................................92 Table4.3:Haplotypeabundancesforsimulated10HIVhaplotypescalculatedfrom readsmapping................................93 Table4.4:EMclusteringresultsonsimulated10HIVhaplotypecontigsforVirBin andMaxBin..................................95 Table4.5:Haplotypeabundancescalculatedfromreadsmapping...........96 Table4.6:EMclusteringresultsonassembled5realhaplotypecontigsforVirBin andMaxBin..................................98 Table4.7:EMclusteringresultsonassembled5realhaplotypecontigsforVirBin andMaxBin..................................98 Table5.1:GCcontent,OEratioandCpGnumberformiRNApromoterswithmul- tipleTSSs.(A)miRNAclusters.(B)individualmiRNAs..........122 ix LISTOFFIGURES Figure1.1:Pipelineoverview.(1)TAR-VIRforenrichingviralreadsofinterestfrom metagenomicdata.(2)PEHaplofor denovo assemblyofviralhaplo- types.(3)VirBinforbinningassembledcontigsandestimatethehaplo- typeabundances...............................4 Figure1.2:Librarypreparation.GenomicDNAisfragmentedandcommonadaptors areligatedtobothendsofthem.......................6 Figure1.3:Solid-phaseamplication.Eachsingle-stranded,single-moleculetemplate iscapturedbyprimerandimmobolizetothesolidsurface.Eachbound templateisintoaclonalclusterthroughbridgeamplication...7 Figure1.4:TheIllumina/Solexafour-colourcyclicreversibletermination(CRT)method. Ituses3'-O-azidomethylreversibleterminatorchemistryonsolid-phase- templateclusters.Afterimaging,acleavagestepremovesthe dyesandregeneratesthe3'-OHgroupfornext-roundincor- poration....................................9 Figure1.5:Paired-endsequencingenablesbothendsoftheDNAfragmenttobese- quenced.Thedistancebetweeneachpairedreadsisknownandalignment algorithmscantakeadvantageoftomapthereadsoverrepetitiveregions moreprecisely................................10 Figure1.6:Quasispeciesadaption.Quasispeciestendstogouphillsinhigh-dimensional landscape.Thex-axisisthesequencespaceandthey-axisisthe Thehighertheyget,thetheyare................13 Figure2.1:Twoscenarios.(A).Thereferenceisageneorotherfunctionalsite(long greenbar).Thereadsarerepresentedbyshortlines.Shortgreenlines canbemappedtothereferencesequenceandthesetofseedreads. Theiterationofoverlapdetectionwillidentifynewreads(bluelines) overlappingwiththeseedreads.Theseconditerationofoverlapdetec- tionwillidentifymorereads(redlines).(B).Thereferenceisaremotely relatedgenome(longgreenbar).Theseedreadscanbemappedtothe referencegenomeandarerepresentedbyshortgreenlines.Twoitera- tionsofoverlapdetectionwillrecruitnewreads(bluelinesandredlines, respectively).................................19 x Figure2.2:Thevisualizationoftheoutputofeachstepofthebackwardsearchfora querysequenceAAAT.TheSAandthecorrespondingsufesaregrey becausetheyarenotactuallyusedinthesearch.Thecomputedrange foreachiterationishighlightedusinggrayboxencompassedbyarrows. When t is3,thesearchshouldreturn r 2 asitformsanoverlapofsize3 withAAAT..................................24 Figure2.3:HistogramoftheLCSsizesbetweenhumanviruses(A),betweenhuman virusesandnon-humanviruses(B),andbetweenhumanvirusesandbac- teria(C).Thex-axisisthe log 10oftheLCSlength.They-axisisthe numberofpairswithinthegivenrangeofLCSsize.OnlyLCSsthatare longerthan10bparepresented.(D)ProbabilitydistributionforLCSs betweentwosimulatedHIVstrainsthatare50,100,200,and500gen- erationsapart.The x -axisisthelengthofLCS,witharangefrom0to 10,000bp.The y -axisisthecorrespondingprobabilitiesforthoseLCS sizes......................................31 Figure2.4:EnrichingSARS-CoVreadsusingthebatcoronavirusgenomeastheref- erence.(A)and(B)showthealignedandrecruitedreadsThe datasetwasalignedbyBWAwiththedefaultparameter("-B4,-A1"). BWAischosentoincludemorelocallyalignedreadsin(A).Thereads wererecruitedusingtheoverlapcutoffof150bp.(C)displaysthese- quenceidentitybetweenSARS-Covandthebatcoronavirus.Thepr wasgeneratedusingVISTA[45].......................35 Figure3.1:MultiplesequencealignmentofveHIV-1haplotypes...........49 Figure3.2:(A)Thebottomtwolonglines(redandblack)representtwohaplotypes, whichonlydifferbytwomutationsattwoloci(G-CandA-G).Shortlines representreadssequencedfromthetwostrains.Redreadsaresequenced fromredstrainwhileblackreadsaresequencedfromblackstrain.The readsaresortedbytheirreadmappingpositionsagainsttheirnativestrain. a.1anda.2areareadpairfromtheblackstrain.d.1andd.2arethe readpairfromtheredstrain.(B)Theoverlapgraphandthepaired-end grapharecombinedinoneNodesb,c,e,andforiginatefromthe commonregionofthetwostrains.Thedashedlinesrepresentpaired-end readconnection.(C).Thesuper-readsgeneratedbymergingreadsinve cliquesofsize4.Eachsuper-readisnamedusingthestartingnodeand endingnode.(D).Thenewoverlapgraphgeneratedusingthesuper-reads.57 Figure3.3:Removingfalseedgesusingpaired-endinformation.Solidlinesrepre- sentoverlapsbetweennodesanddashedlinesrepresentthepaired-end connectionsbetweennodes.Theoverlapedgeswithredcrosswillbe removedifinsufpaired-endinformationexistbetweentheirends..62 xi Figure3.4:Paired-endinformationguidethepathextensiontogooverbifurcation nodes.Withnodescomingfromtwostrains a and b inthethose nodesbelongingtothesamestraincanbecorrectlycombinedtoonepath basedonthepaired-endconnections.Solidlinesrepresentoverlapsbe- tweennodesanddashedlinesrepresentthepaired-endconnections.....64 Figure3.5:Pathprotocol.Pathsareoutputtedwhenmeetinganendnodeor avisitednode.................................65 Figure3.6:Inthisexampleasshowninthetheendingnode p n ofcurrentpath hastwosuccessors v 1 and v 2 .Thesolidlinesareoverlapsanddashed linesarepaired-endconnectionsbetweennodes.Fivefeaturesarecom- putedtochoosetherightsuccessortoextendthepath.Thesefeatures arecomputedaspaired-endgraphedgeweightsbetweennodesincur- rentpathandnodesassociatedwiththesuccessor.Asfor v 1 ,theSISO scoreiscalculatedbetweenSISOnodesand v 1 ;planscorebetweenSISO nodesandplannodes;PE_planscorebetweenSISOnodesandPE_plan nodes;pathscorebetweenpathnodesand v 1 ;path_planscorebetween pathnodesandplannodes..........................66 Figure3.7:Decisiontreetoselecttherightnodeforextensionbasedonpaired-end scorefeatures.If N feature = 1,weaddtheonlynodewithscoregreater than0tothepath.If N feature > 1,weselectthenodewithmaximumscore value.Otherwise,lookatthenextfeature.Ifallthe N feature valuesare0, wewillselectthesuccessorwithsimilarreadscoveragetothepath.....68 Figure3.8:Readpairmappingonamisjoinedcontig.Thecontigisshown asthelongbaratthebottom,whichismisjoinedwithtwosequences fromstrains a and b .Theredandbluelinesrepresenttwoendsofaread pair.Fewreadpairswillgoacrossthemisjoinedlocation,thusrevealing avalleyinthealignedreadswhichcanbeusedtosplitthecontig.69 Figure3.9:ProbabilitydistributionforLCSsbetweentwostrainsthatare50,100, 200,and500generationsapart.Thex-axisisthelengthofLCS,which rangesfrom0to10,000.They-axisisthecorrespondingprobabilityfor eachLCS...................................71 xii Figure3.10:ContigsalignmentresultonHXB2strainforPEHaploandSAVAGE. ThesecontigswereproducedfromtherealHIVMiSeqdataandaligned toreferencegenomewithMetaQuast.Thex-axisisthecoordinationsof HXB2genome,withtheregionscoveredbycontigsingreenandothers inblack.They-axisrepresentsthenumberofcontigs,withcontignames listedontheleftpanel.Oneachcontig,thegreennumberattheleftis thestartingcoordinateofthealignedcontigandthenumberinsideofthe parenthesisshowsthestartingcoordinateonthereferencegenome,the blackvalueattherightistheendingcoordinateofthealignedcontigand thenumberinsideoftheparenthesisshowstheendingcoordinateonthe reference...................................76 Figure4.1:ThewwofVirBin...........................84 Figure4.2:Twokindsofalignmentareallowedbetweentwocontigsto.(A)Overlap. (B)Inclusion................................85 Figure4.3:Top5windowsoutputfromsimulatedcontigsalignmentfor5HIVhaplo- types.Eachlinestartingwith'>'representsonewindow.Thefollowing columnsrepresentreferencecontigname,windowheight,windowstart- ingpositiononreferencecontig,windowendingpositiononereference contig,referencehaplotype,andhaplotypeabundance,respectively.The rowsafter'>'lineshowtheinformationforeachcontiginsideofthis window.Thecolumnsrepresentcontigname,relativeabundance,sum- mationofreadscoverage,referencehaplotype,andhaplotypeabundance, respectively..................................91 Figure4.4:Top3windowsoutputfromsimulatedcontigsalignmentfor10HIVhap- lotypes.TheoutputformataresameasFigure4.3..............94 Figure4.5:Top5windowsoutputforcontigsassembledfromHIVrealMiSeqdata. TheoutputformataresameasFigure4.3..................97 Figure5.1:PrimaryTICislocatedupstreamofthepre-miRNA,whilepre-capTIC islocatedinsideofthepre-miRNA.Theannotationofapre-miRNAis usuallyastem-loopthatincludesthepre-miRNAandthelowerstems. However,therealpre-miRNAonlyincludesthered,purplesequences (maturemiRNAs)andtheloopbetweenthem.Therefore,thepre-capTIC startsfromthe5'endofa5p-miRNA.Eachblueorgreenbarcorresponds toamappedread,wheregreenindicatestheplusstrandandbluethe minusstrand.TheCap-seqdatasetsweresequencedinastrand-speci way.OnlythereadsonthesamestrandofthemiRNAareconsidered...109 xiii Figure5.2:SequencemotifanalysisofnucleotidesaroundputativemiRNATSSs (+1).Thenucleotideheight(inbits)standsforthe log 2 ratioofthe observednucleotidesfrequencyrelativetothebackgroundgenomicnu- cleotidecomposition.TheYRmotifisobservedattheputativeprimary miRNATSSsandindependentmiRNApre-capTSSs............114 Figure5.3:PrimaryTICsaredetectedupstreamof5'-cappedmiRNAscel-mir-51 andcel-mir-53.MultipleCap-seqpeaksareobservedinpre-miRNAre- gions.ThemappedreadswerevisualizedbyGenomeView[1].Each blueorgreenbarcorrespondstoamappedread,wheregreenindicates theplusstrandandbluetheminusstrand.Thereadsinupperpanelare fromCap-seqdataset,withuniformlengthof36nt.Thereadsinlower panelarefromsmallRNA-seqdataset,withlengthinrangefrom14ntto 26nt......................................116 Figure5.4:UpstreamprimaryTICsalsoexistform 7 Gcappedpre-miRNAs.(A) TwoalternativepromotersforthesamemiRNA.Bothofthemareableto generatethetranscriptsthatproducethesamematuremiRNA.(B)The miRNAtranscriptisgeneratedbythepre-capTIC.UpstreamTIC(s)cor- respondtotranscribedenhancer(s)......................117 Figure5.5:CappedTICsdistributioninmiRNAclustercel-mir-35-41.Multiplestrong cappedpeakshavebeenobservedinpre-miRNAsinthecluster.Asillus- tratedbythereddashedline,theCap-seqpeaksonthe5parmhavethe samestartpositionasthe5p-miRNAs,whilethepeaksonthe3parmstart fromtheendofthe3p-miRNAs.ThelengthofCap-seqreadsis36nt...119 Figure5.6:ModelformiRNAcytoplasmicre-capping.Inthisnon-canonicalmiRNA pathway,pri-miRNAsareexportedtocytoplasmbyXPO1andprocessed there.Duringthepre-miRNAgeneratingprocess,m 7 G-capsareaddedto the5'endsofnewlygeneratedpre-miRNAandpri-miRNAleftoverby cytoplasmiccappingenzyme........................121 Figure5.7:(A)Promotersdistributionin C.elegans .(B)Detectingbidirectionaland multipletranscriptionpromotersin C.elegans ................124 Figure5.8:(A,B):H3K4me3andPolIIsignalsurroundingbidirectionaland broadpromoters.(C,D):H3K4me3andPolIIsignalsurrounding miRNAbidirectionalandbroadpromoters..................126 xiv LISTOFALGORITHMS Algorithm1 OverlapdetectionusingBWT'sbackwardsearch 25Algorithm2 Defaultmode:createBWTforallreads 27Algorithm3 WindowsfromcontigsalignmentxvChapter1 Introduction Manyofmostimportanthumanviralpathogens,suchashumanyvirus(HIV), hepatitisCvirus(HCV),SevereAcuteRespiratorySyndrome(SARS)coronavirus(SARS-CoV), andH1N1virus,arehighlyvariableRNAviruses.Theyusuallymutatequicklyininfectedcells ororganismsandexistasacollectionofdifferentbutcloselyrelatedviralgenomeswithdynamic distribution,whichisreferredtoasviralquasispecies[41,3].Thequasispeciesdynamicsprovides RNAviruseswithhighadaptivepotentialtoovercomeinternalorexternalselectiveconstraints, suchasimmuneresponses,antiviralagents,etc.,makingdiseasepreventionandtreatmentdif Thesevirusesstillclaimthelivesofmillionseachyeardespitecenturiesstudiesofthevaccineand treatment[158,140].Thus,characterizinghumanviralpathogens,includingrecognizingnovel ones,remainscrucial. Developmentofnext-generationsequencing(NGS)technologiesshedslightoncharacterizing theviralgenomesandtheirabundancesinenvironment.Thedeepsequencingdataofmicrobial communities(metagenomicdata)containsthegeneticinformationofthemicrobeslivinginthe samehabitatandhasbecometheprimarysourceforvirusdiscovery[165].Viralmetagenomic datahasadvantagesovertraditionalmethodssinceitisunnecessarytoisolateviral speciesandcultivatetheminthelaboratory.Also,multiplepathogens,includingnewones,canbe inasingleassay.Currently,themostpopularNGSstrategiesformetagenomics,suchas IlluminaMiSeqandHiSeq,arecost-effectiveandwithhigh-throughput.Theycanproducebillions ofshort,highlyaccurate100-300bpreadswithinafewdaysforlarge-scalemicrobiomeresearch 1 projects[109]. Whilewiththeseadvantages,assemblingmicrobialgenomesandcharacterizingtheircompo- sitionsfrommetagenomicdatameetwithseveralcomputationalchallenges.Thechallenge isthesheerdatasizefromhigh-throughputsequencing,whichrequiresmassivecomputationalre- sourcesforassemblingandanalyzingthemicrobes.Thesecondoneistheshortreadlength,which makesitdiftoresolvetherepeatedsequenceswithinthesameorganism,orsharedbetween differentorganisms.Third,itisdiftoassemblethelowabundancemicrobialgenomeswith theheterogeneouscoveragesfordistinctmicrobes.Inadditiontothesewell-knownchallenges, reconstructingRNAvirusesinmetagenomicsfacesseveraluniqueobstacles.First,thereferences ofvirusesarestillverylimited.Usually,onlyasmallpercentageofreadsinmetagenomicdatacan bemappedtoavailableviralgenomes,eitherbecausethevirusesaremutatingquicklyorarenovel ones.CharacterizingRNAviruseswithoutqualityreferencegenomesisthusneeded.Second,most clinicallyimportantvirusesareoftenfastmutatingRNAvirusesthatcanleadtoviralpopulations ofhighgeneticdiversity(quasispecies)[107].Asdifferentgenomesinthesameviralpopulation canhavedifferentpropertiessuchasresistancetoanti-viraldrugs,distinguishingthesedifferent buthighlysimilarviralstrains(haplotypes)isessentialbutcomputationallydifThird,there areusuallyonlysmallamountsofeukaryoticvirusesinviralmetagenomicdata.Although tiontechniquesandmethodsmaybeappliedtoenrichRNAviruses,thesestrategies maycausevarioustypesofbiases.Thus,effectivepreprocessingmethodsarestillimportantto identifyreadsforthetargetedviruses. Inthiswork,totackleallthesechallengesforcharacterizingviralhaplotypesandtheirabun- dancesfrommetagenomicdata,wehaveproposedapipelinewiththreetoolsto(1)enrichtarget viruses'readswithlow-qualityorpartialreferencesequences;(2) denovo assemblyofthevi- ralhaplotypesfromtheenrichedreads;(3)analyzehaplotypeabundancesbybinningassembled 2 contigstodifferentgroups. Inthefollowingpartofthischapter,IwillgiveabriefintroductiontothepipelineIhave developedduringmyPh.D.study,andthenintroducethenecessarybackgroundfornext-generation sequencingandviralquasispeciestheory. 1.1Overviewofthepipeline AsshowninFigure1.1,therearethreemaincomponentsforthispipeline.Inthestep,we utilizeanoverlap-basedextensionmethodtoisolateandenrichthereadsfortargetvirusesbefore assembly.Withthebigscaleofviralmetagenomicdata,thereareusuallyasmallproportionofviral readsamongthem,withmanycontaminationsfromhostgenomesorphages.Directlyassembling virusesfromthemetagenomicdataisdifandtime-consuming.Therefore,itisimportantto isolateandenrichtheviralreadsweareinterestedinbeforeassembly.Thispartisdetailedin Chapter2.Second,weassemblethedifferentviralhaplotypesfromtheenrichedviralreadswitha paired-endinformationincorporatedoverlapgraph.ThedetailsforthispartareinChapter3.Third, withtheassembledviralcontigs,weuseanExpectation-Maximizationalgorithmtoestimatethe abundancesofdifferentviralhaplotypesbyclusteringtheassembledcontigstodifferentgroups, whereeachgrouprepresentsonehaplotype.ThedetailsforthispartareinChapter4.Finally,in Chapter5,IintroducetheotherworkIhavedoneforstudyingthetranscriptionalregulationsof miRNAgeneswithCap-seqdata. 3 Figure1.1:Pipelineoverview.(1)TAR-VIRforenrichingviralreadsofinterestfrommetagenomic data.(2)PEHaplofor denovo assemblyofviralhaplotypes.(3)VirBinforbinningassembled contigsandestimatethehaplotypeabundances. 4 1.2Next-generationsequencing ThoughtheautomatedSangersequencingtechnologyhasproveditspowerwithanumberofmon- umentalaccomplishments,includingthecompletionoftheHumanGenomeProject(HGP),its limitationsshowedaneedfornewandimprovedtechnologiesforlarge-scaleandlow-costse- quencing[98].AstheautomatedSangermethodisconsideredtobethetechnol- ogy,newermethodsarereferredtoasnext-generationsequencing(NGS).WhileNGStechnologies constitutevariousstrategiesandarequitediverseinsequencingbiochemistry,theirwwsare conceptuallysimilar.Thesemethodshavesimilarbasicmechanismsandareclascyclic- arraysequencing,whichcanbesummarizedasthesequencingofadensearrayofDNAfeatures byiterativecyclesofenzymaticmanipulationandimaging-baseddatacollection[101].Thebasic protocolofNGStechnologiesincludestemplatepreparation,sequencingandimaging,anddata analysis.ThemajoradvanceofNGSistheabilitytoproducelargevolumeofdatawithlowcost- insomecasesoveronebillionshortreadsperinstrumentrun[98]. Thereareseveralcommerciallyavailabletechnologies,includingRoche/454,Illumina/Solexa, Life/APGandHelicosBioSciences,etc.Theuniquecombinationofprotocolsdistin- guishesonetechnologyfromanotheranddeterminesthetypeofdataproducedfromeachplatform. HerewewillgiveadetailedintroductiontoIllumina/Solexasequencingtechnologyandanalyze thedifferentcomputationaltechniquesfordealingwiththedata. TheIllumina/Solexasequencingtechnologyiscalledsequencingbysynthesis(SBS).Thereare fourbasicstepsforIlluminaNGSwws: 5 Figure1.2:Librarypreparation.GenomicDNAisfragmentedandcommonadaptorsareligatedto bothendsofthem. 1.2.1Librarypreparation BeforesequencingtheDNAorcDNAsample,thelongDNAsequencesarerandomlybro- kenintosmallersizesbysonication.TheseshortDNAfragmentsproducedcannotbesequenced directly;adaptorsareligatedtothe5'and3'endsoftheseDNAfragmentsfromwhicheither fragmenttemplatesormate-pairtemplatesarecreated(Figure1.2).Adaptersareshortoligonu- cleotideswhichareattachedtotheDNAtobesequenced.Anadaptercanprovideaprimingsite forbothandsequencing.AlltheDNAtemplatesarecalledasequencinglibrary. 1.2.2ClusterGeneration WhilebasesofaDNAsequencearedetectedbymostimagingsystemshavenotbeen designedtodetectsingleevents.oftemplatesarerequired.Illumina appliesamethodcalledsolid-phase[43].High-densityforwardandreverseprimers 6 Figure1.3:Solid-phaseamplication.Eachsingle-stranded,single-moleculetemplateiscaptured byprimerandimmobolizetothesolidsurface.Eachboundtemplateisintoaclonal clusterthroughbridgeamplication. arecovalentlyattachedtoaglassslide,andtheratiooftheprimerstothetemplatethe surfacedensityoftheclusters(Figure1.3).Thelibraryisloadedintoawcellwhere fragmentsarecapturedbytheprimers.Eachfragmentisthenamintodistinct,clonalclusters throughbridgeWhenclustergenerationiscomplete,thefreeendsoftemplatescan behybridizedbyauniversalsequencingprimertoinitiatetheNGSreaction. 1.2.3CyclicReversibleTermination(CRT)Sequencing Clonalresultsinapopulationofidenticaltemplates,eachofwhichhasundergonethe sequencingreaction.Uponimaging,theobservedsignalisaconsensusofthenucleotidesadded totheidenticaltemplatesforagivencycle. Illuminatechnologyutilizesaproprietaryreversibleterminator-basedmethodthatdetectssin- 7 glebasesastheyareincorporatedintoDNAtemplatestrands.Inthestep,aDNApolymerase bindstotheprimedtemplateandaddsorincorporatesjustonenucleotide, whichrepresentsthecomplementofthetemplatebase.Areversibleterminatorisoneverymodi- nucleotidetopreventmultipleadditionsinoneround.Followingincorporation,theremaining unincorporatednucleotidesarewashedaway.Usingthefour-colourchemistry,eachofthefour baseshasauniqueemission.Aftereachround,imagingisusedtodeterminetheidentityofthein- corporatednucleotide.Acleavagestepisfollowedtoremovetheterminating/inhibitinggroupand thedye.Thenextincorporationstepwillstartafteranadditionalwashing(Figure1.4). 1.2.4Dataanalysis AfterNGSreadshavebeengenerated,theyarealignedtoaknownreferencesequenceorassembled denovo[149,72,70,22,83,123].Theproductionoflargenumbersoflow-costreadsmakesthe NGSplatformsusefulformanyapplications.Theseincludevariantdiscoverybyresequencing targetedregionsorwholegenomes,suchasdetectingsinglenucleotidepolymorphisms(SNPs)or structuralvariants(SVs);cataloguingthetranscriptomesofcells,tissuesandorganisms(RNA-seq) [155];genome-wideofepigeneticmarksandchromatinstructure(ChIP-seq,MNase-seq, etc.)[138,113];andspeciesand/orgenediscoverybymetagenomicsstudies[120]. 1.2.5Paired-EndSequencing Paired-endsequencingsequencesbothendsoflongDNAfragmentsinasequencinglibraryand aligningtheforwardandreversereadsasreadpairs(Figure1.5).Inadditiontoproducingtwice thenumberofreadsforthesametimeandeffortinlibrarypreparation,readpairsfromtwoends ofDNAfragmentsenablemoreaccuratereadalignmentandtheabilitytoovercomerepetitive 8 Figure1.4:TheIllumina/Solexafour-colourcyclicreversibletermination(CRT)method.Ituses 3'-O-azidomethylreversibleterminatorchemistryontemplateclusters.Af- terimaging,acleavagestepremovesthedyesandregeneratesthe3'-OHgroupfor next-roundincorporation. 9 Figure1.5:Paired-endsequencingenablesbothendsoftheDNAfragmenttobesequenced.The distancebetweeneachpairedreadsisknownandalignmentalgorithmscantakeadvantageofto mapthereadsoverrepetitiveregionsmoreprecisely. regions. Aligningreadswithinrepetitiveregionsorincorrespondingregionsthatmaynotexistinthe referencegenomeisdifTherepeatsandgapsinthereferencegenomecanalsocauseabnor- malextremelyhighpeaksinthemappingwhichmayleadtofalsepositivesfordownstream functionalanalysis[25].Paired-endreadscanresolvethecorrectgenomeassignmentforsome repetitiveregionsaslongasonereadinthepairisuniquetothegenome[98]. 1.3Virusquasispecies Avastnumberofmedicallyimportantviruses,includinghumaniyvirus(HIV), hepatitisCvirus(HCV),andareRNAviruses.Theyreplicatewithextremelyhighmu- tationratesandexhibitgeneticdiversity,whichallowthemtorapidlyadapttodynamic environmentsandevolveresistancetovaccinesorantiviraldrugs[75].TheRNAvirusevolution anddynamicscanbemodeledbyquasispeciestheory.Aquasispeciesisacloudofdiversevari- antsthataregeneticallylinkedthroughmutations,interactcooperativelyonafunctionalleveland 10 collectivelycontributetothecharacteristicsofthepopulation. 1.3.1Thequasispeciestheory Imagineaquasispecieswithanlargepopulationofdifferentstrains,eachcarryinga genomeoflength L .Let x i denotestherelativeabundance/frequencyofstrains i .Wehave å n i = 0 x i = 1.Thegenomicstructureofthepopulationcanbecharacterizedbythevector ~ x =( x 0 ; x 1 ;:::; x n ) . Denoteby f i theofgenome i ,whichmeansgenomesoftype i arebeingreproducedat rate f i .Thelandscapeisgivenbythevector ~ f =( f 0 ; f 1 ;:::; f n ) .Theaverageofthe populationistheinnerproductofthevectors ~ x and ~ f , f = ~ x ~ f . Mistakescanhappenwhenagenomereplicates.Weusemutationmatrix Q =[ q i ; j ] todenote theprobabilitythatreplicationofgenome i resultsingenome j . Q isastochasticmatrix,with whicheachrowisaprobabilityandeachrowsumstoone, å n j = 0 q ij = 1. Thequasispeciesequationisgivenby[107] x 0 i = n å j = 0 x j f j q ji f x i (1.1) where x 0 i isthedifferentiationof x i withrespecttotime.Sequence i isobtainedbyreplicating anysequence j atrate f j timestheprobabilitythatreplicationofsequence j generates sequence i .Toensurethatthetotalpopulationsizeremainsconstant,thatis å n i = 0 x 0 i = 0,each sequenceisremovedatrate f (Proof: å n i = 0 x 0 i = å n j = 0 x j f j q ji f å n i = 0 x i = å n i = 0 q ji å n j = 0 x j f j f = å n j = 0 x j f j f = 0). Ifwecombinethelandscape, ~ f ,andthemutationmatrix, Q ,wecanobtainthemutation- selectionmatrix, W =[ w ji ]=[ f j q ji ] (1.2) 11 Withthemutation-selectionmatrix,thequasispeciesequationcanbewritteninvectornotationas ~ x 0 = ~ x W f ~ x (1.3) Thereforetheequilibriumofquasispeciesdynamicsisgivenby ~ x W = f ~ x (1.4) Thisisastandardeigenvalueproblem,wheretheaverage f isthelargesteigenvalueof thematrix W and ~ x istheeigenvectorassociatedwiththiseigenvalue.Thus,theeigenvector ~ x ofmutation-selectionmatrix W providestheequilibriumstructureofthequasispecies,withthe propernormalization å n i = 0 x i = 1. Errorcatastrophe Thequasispeciesequation1.1describesthemovementofapopulationthroughsequencespace. Thequasispeciesattemptstoclimbuphillinthemountainrangeofthelandscapeandreach localorglobalpeaks(Figure1.6).Theerrorthresholdplaysakeyroleinthesuccessofthe evolutionarywalk. Whenthemutationrate m isabovetheerrorthreshold,theabilityofthequasispeciestoclimb uphillandtoremainontopofamountainpeakisimpaired[42,13].Smallincreasesinmuta- tionratewillupsetthisbalanceasthemastersequenceitselfdisappearsandmeaningfulgenetic informationislostinanavalancheoferrors.Earlystudieswithvesicularstomatitisvirus(VSV) showedthatchemicalmutagensgenerallyreducedviralinfectivity,andstudieswithpoliovirus demonstratedthatmutagenicnucleosideanalogspushviralpopulationstoextinction-a4-fold increaseinmutationrateresultedina95%reductioninviraltiter[52,153]. 12 Figure1.6:Quasispeciesadaption.Quasispeciestendstogouphillsinhigh-dimensional landscape.Thex-axisisthesequencespaceandthey-axisistheThehighertheyget,the theyare. 13 Chapter2 TAR-VIR:TARgetedVIRalreads andstrainreconstructionfrom metagenomicdata 2.1Background Pathogenichumanvirusessuchashumanyvirus(HIV),hepatitisCvirus(HCV), SevereAcuteRespiratorySyndrome(SARS)coronavirus(SARS-CoV),andH1N1virus,still claimthelivesofmillionseachyeardespitecenturiesstudiesofthevaccineandtreatment[158, 140].Thus,characterizinghumanviralpathogens,includingrecognizingnovelones,remains crucial.Thedeepsequencingdataofmicrobialcommunities(metagenomicdata)containsthe geneticinformationofthemicrobeslivinginthesamehabitatandhasbecometheprimarysource forvirusdiscovery[165].Forexample,Lietal.[80]conductedcomprehensivevirusscreening inplasmasamplesof19antiretroviral-treatedHIVpatientsusingmetagenomicsequencing.Lim etal.[85]studiedthedynamicsofeukaryoticRNAandDNAvirusesfortheyearinfants. Therearealsoglobal-scalestudiesonvirusesinnaturalenvironmentalsamplessuchasocean water[102,126].Viralmetagenomicdatahasadvantagesovertraditionalmethods sinceitisunnecessarytoisolateviralspeciesandcultivatetheminthelaboratory.Also,multiple 14 pathogens,includingnewones,canbeinasingleassay. WeareparticularlyinterestedincharacterizingRNAvirusesviametagenomicsequencingbe- causemanyofthemareclinicallyimportantandtherearestillurgentneedsforbetterprevention andtreatmentstrategies.Inadditiontosomewell-knownchallengesformetagenomicassembly suchassheerdatasize,shortreadlength,andheterogeneouscoverage,reconstructingRNAviruses inmetagenomicsfacesseveraluniqueobstacles.First,thereferencesofvirusesarestillverylim- ited.Usually,onlyasmallpercentageofreadsinmetagenomicdatacanbemappedtoavailable viralgenomes,eitherbecausethevirusesaremutatingquicklyorarenovelones.Characterizing RNAviruseswithoutqualityreferencegenomesisthusneeded.Second,mostclinicallyimportant virusesareoftenfastmutatingRNAvirusesthatcanleadtoviralpopulationsofhighgeneticdi- versity(quasispecies)[107].Asdifferentgenomesinthesameviralpopulationcanhavedifferent propertiessuchasresistancetoanti-viraldrugs,distinguishingthesedifferentbuthighlysimilar viralstrains(haplotypes)isimportantbutcomputationallydifThird,thereareusuallyonly smallamountsofeukaryoticvirusesinviralmetagenomicdata.Althoughtechniquesand methodsmaybeappliedtoenrichRNAviruses,thesestrategiesmaycausevarious typesofbiases.Thus,effectivepreprocessingmethodsarestillimportanttoidentifyreadsforthe targetedviruses. Inthisstudy,ourgoalistodevelopanewpipelinethatcanclassifyRNAviralreadswithhigh sensitivityandalsoproducetheassembledviralstrains(i.e.haplotypes)frommetagenomicdata. 2.1.1Relatedwork Thereexistsomepipelinesforconductingthecompositionandfunctionalanalysisofviralmetage- nomicdata[99,128,105,122].Dependingontherequiredinputs,existingprogramscanbe roughlydividedintotwogroups. 15 Someviralmetagenomicanalysistoolstakemetagenomicassembliesasinputandconduct reference-basedtaxonomicandfunctionalanalysis.Forexample,Espinoetal.[112]identifyviral sequencesfromassembledmetagenomiccontigsofsizesgreaterthan5kb.VirSorter[127]detects virusesinassembledcontigsatleast3kb.Theviralsequencesareusuallyscreenedbycomparing thecontigswithacuratedsetofviralproteinfamilies. Althoughvirusdetectionusingassembledcontigsisusuallymoresensitivethanusingshort reads,conducting denovo assemblyforrawviralmetagenomicdatarequiresmassivecomputing resources.Also,applyinggeneric denovo assemblytoolsdirectlytometagenomicdatacanlead toshortorchimericcontigsduetotheheterogeneouscoveragealongthegenomesinmetage- nomicdataandhighsimilaritybetweenstrains.Therefore,mostoftheassemblymethodsforviral metagenomicscombinereference-basedand denovo assembly.Thisstrategyusu- allyreadsintodifferenttaxonomiesorfunctionalgroupsusingreference-basedmethods andthenconduct denovo assemblyforreadswithinthesamegroup.Forexample,VIP[84], drVM[86],andVirusTAP[162]allapplythisstrategy.Theyclassify/enrichviralreadsbyeither aligningreadstoavailableviralreferencesorremovinghostandotherunrelatedmicrobialreads. Next,existingassemblytoolssuchasSPAdes[8]areemployedtothevirus-likereadstoproduce theassemblyresults.Whilethesetoolsmadecontributionsinpurifyingthedataby removingnon-virusreadsandthenclassifyingvirus-likereadsintofunctional/taxonomicalgroups, theirperformanceheavilydependsonthequalityofthereferences. ForRNAviruscharacterization,relatedtoolsalsoincludehaplotypereconstructionpipelines designedtoassembleviralstrainsinaquasispecies.Amajorityofthesetoolsarereference-based andtakethealignmentsofreadsagainstreferencegenomesasinput.HaploClique[147],Vi- Quas[59],VGA[95]allbelongtothisgroup. Inaddition,therearegenericassemblytoolsthatapplyextensionstrategiestoiterativelyextend 16 thefiseedflcontigs.Forexample,PRICE[129]takesadvantageofpaired-endreadstoiteratively alignreadsontheinitialcontigsandextendthem. Thelimitationofreferencegenomesisthecriticalchallengeofapplyingtheabovereference- basedtoolsforRNAvirusesanalysisinmetagenomicdata.Whileregardedasthemostabundant biologicalentitiesonearth,onlyasmallportionofviruseshavebeensequencedandcharacterized. Besides,forRNAviruseswithhighmutationrates,high-qualityreferencegenomesofaviral populationarenotalwaysavailable.Forexample,manyemergingviraldiseasesarecausedby zoonoticviruses,whichoriginateinvertebratesbutcaninfecthumans.Thegenomesofsome emergingvirusesmayonlysharemediumsequenceidentitywiththeirpeersinanimals,creating difcircumstancesforreference-basedvirusreads 2.1.2Overviewofourwork HereweintroduceTAR-VIR,whichprovidesausefuladditiontoexistingtoolsforidentifyingtar- getedRNAvirusesandtheirhaplotypesinmetagenomicdata.Thefitargetedflvirusesarethosethat stillpossesslocalsequencesimilaritywiththeirhomologsinthereferencegenomes.Acompletely newvirusthatdoesnotshareanyconservationwithanyreferencegenomewon'tbedetectedby ourmethod. Ourpipelinecombinesreference-basedstrategyand denovo assemblyandisoptimizedforthe followingapplications.1)Identifyinghost-switchingvirusessuchasSARS-CoVusingremotely relatedvirusesinotherhostsasthereferences.2)Reconstructingviralhaplotypesthatarediver- gentfromaknownvirusfamily.3)Recoveringvirusesandtheirgenomesthatcontaingenesor functionalsitesofinteresttousers.TAR-VIRisfasterandmoreeffectiveinidentifyingtargeted virusesthangenericassembly,whichisparticularlyimportantforlargeandcomplicatedmetage- nomicdatasetscontainingasmallpercentageofviruses.Meanwhile,TAR-VIRismoretolerantto 17 incompleteorlow-similarityreferencesthanexistingreference-basedtools. WeappliedTAR-VIRtoasimulatedmetagenomicdatasetcontainingvehaplotypesofSARS- CoVandarealhumanbloodplasmametagenomicdataset.Thebenchmarkedresultswithboth de novo assemblytoolsandreference-basedhaplotypereconstructiontoolsdemonstratedtheutility ofTAR-VIRinrecoveringRNAvirusesfrommetagenomicdatawithlimitedreferences. 2.2Methods Followingastand-aloneerror-correctionstep,ourpipelineperformsthefollowingthreesteps. First ,weconstructthesetoffiseedreadsflbymappingthereadsagainstprovidedreferencese- quences,whichcouldbereferencegenomesinavailabledatabasesorfunctionalsitessuchasgenes. Allthereadsthatcanbemappedtothereferenceconstitutethesetoffiseedreadsfl. Second ,were- cruitreadsthatformoverlapswiththeseedreads.Newlyrecruitedreadswillbeadded totheseedset.Thisprocesswilliterateuntilnonewreadscanberecruited. Third ,weconduct strain-levelassemblyusingthereadsinthesecondstep. 2.2.1Twoscenarios TheabovepipelineisvisualizedfortwoscenariosinFigure2.1.Inscenario1,usersaretryingto detectvirusesthatcontainafunctionalsite,suchasagene.Unlikethewell-studiedgene-centric assemblyinmetagenomicdata,ourgoalistorecoverthewholegenomethatcontainsaparticular gene.Inthismethod,thegeneisprovidedasareference,andreadsmappedtoitaretheseedreads. Overlapdetectionisthenappliedtorecruitmorereadsthatbelongtothesamevirusesastheseed reads.ThereadrecruitmentprocessispresentedinFigure2.1(A). Inscenario2,thegoalistoidentifyvirusesthatlackqualityreferencegenomes.Thisispartic- 18 ularlyimportantforhost-switchingviruses,whichmaynotalwaysconservehighsequencesimilar- itywiththeirrelatedpeersinotherhosts.Forexample,SARS-CoVsharesabout80%ofsequence identitywiththebatcoronavirusaccordingtoBLAST[20].Andtheidentityislowerthan50% atdifferentloci.Thus,conventionalreadmappingmethodscannotcaptureallreadsfromthetar- getedviruseswhentheylackhighsimilaritywiththeavailablereferences.Figure2.1(B)presents theprocessofidentifyingreadsofthetargetviruswitharemotelyrelatedvirusasthereference. Althoughthemappedreadsarescatteredalongthereferencegenomewithlowcoverage,suf readsbelongingtothetargetviruscanberecruitedthroughoverlapdetection. Figure2.1:Twoscenarios.(A).Thereferenceisageneorotherfunctionalsite(longgreenbar). Thereadsarerepresentedbyshortlines.Shortgreenlinescanbemappedtothereferencesequence andthesetofseedreads.Theiterationofoverlapdetectionwillidentifynewreads(blue lines)overlappingwiththeseedreads.Theseconditerationofoverlapdetectionwillidentifymore reads(redlines).(B).Thereferenceisaremotelyrelatedgenome(longgreenbar).Theseedreads canbemappedtothereferencegenomeandarerepresentedbyshortgreenlines.Twoiterationsof overlapdetectionwillrecruitnewreads(bluelinesandredlines,respectively). 19 2.2.2Validityofreadrecruitmentusingoverlapdetection Inthissection,wewillconductcarefulanalysistoexaminewhetherusingoverlapswillbeboth sensitiveandaccurateforclassifyingreadsinthesamequasispecies.Inaddition,wewillusethe analysistodeterminetheappropriateoverlapsize. Anidealreadrecruitingprocessshouldonlycapturethereadsfromthevirusesofinterest.At thesametime,itshouldcaptureallreadsbelongingtothetargetedviralpopulations(i.e.,qua- sispecies).Thesuccessoftheoverlapextension-basedreaddependsonthequality ofseedreads,sequencingdepthofthetargetedviruses,andsequencesimilaritybetweendifferent microbes.Inparticular,ifmanymicrobessharelongcommonregionswiththetargetedviruses, theoverlapextensionwillrecruitalargenumberofreadsfromunrelatedspecies. Asanyreadthatformsanoverlapofsizeaboveagiventhreshold t withaseedreadisrecruited, contamination,whichreferstoreadsnotsequencedfromthetargetedviralpopulations,canbe checkedbyexaminingwhethergenomesofdifferentvirusessharecommonregionswithsizes above t .Therefore,wecomputedthesizesoflongestcommonsubstrings(LCSs)betweendifferent viruses.TheLCSsbetweenvirusesandothermicrobialspecieswerealsoexamined.Thedetails forLCScalculationcanbefoundinSupplementaryMaterialsSection1.Theresultsareshownin SupplementaryFigureS1(A-C). Insummary,thesizesofLCSsbetweendifferentviralgenomesorbetweenhumanvirusesand bacteriaareusuallysmallerthan100bp.LCSslongerthan100bparemostlybetweenviruses fromthesamegenusordifferentgenotypesofthesamevirus.Forexample,Vacciniavirusand VariolavirusshareanLCSof469bp,andHCVgenotype7andHCVgenotype5shareanLCSof 154bp. Meanwhile,itisalsonecessarytoevaluatewhetherreadsbelongingtothesamequasispecies 20 canberecruitedusingoverlapdetection.AsthecharacterizedhaplotypesfordifferentRNAviruses areverylimited,insteadofcomputingtheLCSusingavailabledata,weestimatedtheLCSswithin aquasispeciesusingaprobabilitymodel.Withthemutationrate m ateachbaseduringvirus replication,theprobabilitydistributionofLCSlengthbetweentwoviralstrainsthatare n gener- ationsapartcanbecalculatedwithdynamicprogramming[26].Asanexample,theprobability distributionofLCSsizesbetweentwoHIVstrainsisshowninSupplementaryFigureS1(D). TheresultrevealsthattheLCSsbetweendifferenthaplotypesofthesameviralpopulationare usuallymuchlongerthanLCSsbetweendifferentvirusesoranIlluminareadsize.Thus,evenwith theinitialseedreadsalignedtoonlyonehaplotype,thereadsofotherhaplotypescanberecruited throughthelongcommonregionssharedbydifferenthaplotypes.Thereadssequencedfromthe commonregionsactlikebaitstorecruitreadsfromdifferenthaplotypes. TheshortLCSsbetweendifferentvirusesandthelongLCSsbetweendifferenthaplotypes withinthesamequasispeciesenablehighsensitivityandofthereadrecruitingprocess. BychoosingaproperoverlapthresholdusingthederivedLCSsizedistribution,weareabletore- cruitreadsforthesameviralquasispecieswithoutintroducingcontaminationfromothermicrobes. 2.2.2.1Sequencingerrors Withoutconsideringthesequencingerrors,unevensequencingcoverage,andthevirusrecombi- nation,theaboveanalysisforviralquasispeciesprovidesanupperboundofreads'overlapsizes. Sequencingerrorswillshortenoverlapsbetweenreadsandmaypreventrecruitingallreadsbe- longingtothesamequasispecies.Torecruitsufreadsforassembly,wecanconstructeither approximateoverlapsbyallowingmismatches/gapsorexactoverlapsonerror-correctedreads. Consideringtheriskofintroducingcontaminationsbyapproximateoverlapandextensiveresearch inerrorcorrectionwechosetousestand-aloneerrorcorrectiontoolspairedwithexactover- 21 lapdetection.Lowsequencingdepthwillcauseinsufoverlapsandthusaffectstheperfor- manceofreadrecruitment.Thereisthepossibilitythelowcoverageregionsofagenomecannot beassembled. 2.2.2.2Chimericreads Onerecentstudyrevealedthatchimericreads,whichcontainsequencesfrommorethanone species,canbegenerated invitro duringthepreparationofhigh-throughputsequencinglibraries [114].Thesechimerasmayhaveoverlapswithmorethanonespecies,thusintroducingcontamina- tionsfromthehostorunrelatedmicrobes.Inourexperiments,wesettheoverlapthresholdlonger thanhalfofthereadsizetopreventrecruitingthesechimericreadsorextendingfromthem. 2.2.3Readrecruiting Let r i and r j betworeads.Ifthereisapropersufof r i thatistheof r j orviceversa, r i and r j formanoverlap.Inpractice,wewillalsoaccountfortheoverlapsformedby r i and r 0 j sreverse complement.Naivealgorithmforoverlapstakesquadratictimeandthuswillnotbeableto scaletolargedatasets.Thereareafewdatastructuresandmethodsavailableforefoverlap detection[47,142].WeapplythemethodswithBWTandFM-index[142]forefsearch.In thestep,allreadsareconcatenatedintoasinglesequence T [ 1 :: n ] using$asadelimiter,where n isthenumberofreadsin T .Then,multi-keyfiquicksortflisappliedtosortallthesufesof T forconstructingageneralizedsufarray SA ( T ) [121].Then BWT ( T ) canbeconstructedusing thefollowingequation,where BWT [ i ] and SA [ i ] areabbreviatedrepresentationsof BWT ( T )[ i ] and 22 SA ( T )[ i ] ,respectively. BWT [ i ]= 8 > < > : T [ SA [ i ] 1 ] ; if SA [ i ] > 0 $ ; if SA [ i ]= 0 (2.1) With T and BWT ( T ) ,thebackwardsearchcanbeusedtodetectoverlapsbetweenaqueryread andallotherreads.Aftermatching t (theoverlapthreshold)characters,wesearchforthedelimiter `$'totheesoverlappingwiththequery'ssuf ThebackwardsearchalgorithmforBWTneedstousetwoauxiliarydatastructuresbuilton T andBWT.Onedatastructureisanarray C [ 1 ;:::; j S j ] .Foreachcharacter c inthealphabet S , C [ c ] containsthetotalnumberofcharacterin T thatarealphabeticallysmallerthanc(including repeatingcharacters).Theseconddatastructureisatwo-dimensionalarray Occ ( c ; i ) ,where c 2 S and1 i j T j .Forthe i thpositioninBWT, Occ [ c ][ i ] isthenumberofoccurrencesof c in BWT [ 1 ::: i ] .ExamplesoftheSA,BWT, C ,and Occ areprovidedinFigure ?? .Notethatinorder tomainthecorrectrankingofthereadIDs,thesmallestsuf$willbeappendedbytheread in T forsorting.Thus,allthesufesstartingwith$aresortedbythereadsfollowingthe$.In addition,foranysufstartingwith$,theirreadIDsaredeterminedbythereadfollowingthe$. OncetheBWTandtheauxiliarydatastructuresareconstructed,backwardsearchisconducted foranyqueryread,startingfromtheendingcharacter.Ifsomereadsin T haveamatchwith thequeryandthematchsizeisaboveagiventhreshold,thebackwardsearchwillreturnarange containingthosereads.UsingthecorrespondingRIDarray,wecanobtaintheirreadIDs.Note thatbidirectionalsearchalgorithmexists[142].Inourwork,wereversethequeryinorderto theoverlapformedbytheofthequeryandsufofareadintheconstructedBWT.The pseudocodeofthebackwardsearchcanbefoundinAlgorithm1.Theinputparametersarethe BWT,Occ,C,thequeryread r ,theoverlapthreshold t ,andthereadIDarrayRID.Theoutputs 23 Figure2.2:Thevisualizationoftheoutputofeachstepofthebackwardsearchforaquerysequence AAAT.TheSAandthecorrespondingsufesaregreybecausetheyarenotactuallyusedinthe search.Thecomputedrangeforeachiterationishighlightedusinggrayboxencompassedby arrows.When t is3,thesearchshouldreturn r 2 asitformsanoverlapofsize3withAAAT. 24 arereadswhosematchthesufofrwithsize t .Anexampleofthebackwardsearch followingthepseudocodeisprovidedinFigure2.2. Algorithm1 OverlapdetectionusingBWT'sbackwardsearch Function: overlap (( BWT ( T ) ; C ( S ) ; Occ ( S ; BWT ) ; r ; t ; RID ( T )) Input: T :inputtext; r :queryread; C and Occ :theauxiliarystructures; t :theoverlapthreshold; RID :readIDarray Output: Readswhosematchthesufofrwithsize t 1: i j r j 2: c r [ i ] 3: start C [ c ]+ 1 4: end C [ c + 1 ] . characterc+1isthenextcharacterinalphabeticallysortedalphabet 5: while start end and i 2 do 6: c r [ i 1 ] 7: start C [ c ]+ Occ [ c ; start 1 ]+ 1 8: end C [ c ]+ Occ [ c ; end ] 9: i 10: if j r j i t then 11: start C [ 0 $ 0 ]+ Occ [ 0 $ 0 ; start 1 ]+ 1 12: end C [ 0 $ 0 ]+ Occ [ 0 $ 0 ; end ] 13: if end start then 14: outputreadswithIDsbetween RID [ start ] and RID [ end ] 15: endif 16: endif 17: endwhile 2.2.3.1Uniqueimplementationstrategies AlthoughthereareavailableimplementationsoftheBWT-basedoverlapdetection,oursdiffers fromtheexistingonesinthefollowingaspects.ThedifferenceisthestorageofthereadID information.ForaconstructedBWTandaquery,theoutputofthebackwardsearchisthesetof reads(i.e.,theirIDs)thatformoverlapswiththequery.Theoretically,differentreadscanbedis- tinguishedbyappendinguniquedelimitersattheendofeachread.Someexistingimplementations savereadIDsforeachsufForexample,SGAassembler[142]savesreadIDinformationalong withthesufstartingpositioninthegeneralizedsufarray SA ( T ) [142].Inourimplementation, 25 weuse`$'asthedelimiterforallreads.ThereadIDarray RID iscreatedonlyforsufesstarting with`$'.ThisworksbecausethebackwardsearchalgorithmneedstoretrievethereadIDinthe step,wherethecharactertosearchis`$'.Thisreducedthememory usagebyreducingthesizeof RID from j T j integersto n (numberofreads)integers,where T is roughlytheproductof n andthereadsize. 2.2.4Iterativesearch Overlapdetectionwillbeiterativelyappliedtorecruitreadssequencedfromtargetedviruses.Let R 0 bethesetofseedreadsthatcanbemappedtogivenreferencesequences(i.e.,seedreadset). First,the BWT ( T ) for T isbuilt.Theseedreadsin R 0 areusedasqueriestothe BWT ( T ) .Then newlyreadsthatoverlapwiththeseedreadswillbeusedasnewqueriestotheBWT. Theiterationswillcontinueuntilnonewreadscanberetrieved.Itspseudocodeisdescribedin2. 26 Algorithm2 Defaultmode:createBWTforallreads Input: seedreadset R 0 ,theinputtextT,theoverlapthreshold t Output: Readsthataresequencedfromthetargetedviruses 1: output R 0 2: R R 0 3: CreateBWTandRIDfor T : BWT ( T ) , RID ( T ) 4: while R notempty do 5: Backwardsearchon BWT ( T ) toallreadsthatoverlapwithreadsin R 6: Savethemtoset R 0 7: output output [ R 0 8: R R 0 9: endwhile 10: returnoutput 2.2.4.1Runningtimeandmemoryusage Intheabovepipeline,oncetheBWTisconstructed,thesufarraywillbedeleted.Therunning timeofsufxarrayandBWTconstructionislinearto j T j .ThememoryusageofBWTisthe productof j T j andthesizeofeachcharacterandthusislinearto j T j .Thememoryusageofthe RID istheproductof n andthesizeofsavingareadID. WhencreatingBWTforallreadsbecomestooexpensive,ourprogramsupportsdistributed constructionoftheBWTandFM-indexforlargeinput.,theprogramcanautomatically partitioninputdataintomultiplesmallerBWTisthenconstructedforeachdivideddataset. ThereadoverlapdetectioncanberuninparallelforeachBWT.Thereadsarecombined andusedasthequeryforthenextiterationofreadrecruitment.Inthiscase,thelargestmemory 27 footprintisdeterminedbythesizeofeachdividedreadset.Bydefault,thenumberofpartitionsis ve.Thisnumbercanbebyusers. 2.2.5Strain-levelassembly Theoutputsofourprogramareassembliesofviralstrains.Allrecruitedreadswillbeusedas inputtoassemblyprograms.Asourprogramhasamodularstructure,thisstepcanbeexecutedby anyassemblytoolchosenbytheusers.Bydefault,weincludeinthepackageourin-housedevel- opedtoolPEHaplo[26]forviralhaplotypereconstruction.PEHaplodoesnotrequireanyreference sequencesandconductsstrain-levelassemblyusingpaired-endreads.Fortheinputpaired-end reads,PEHaploconstructsapaired-endoverlapgraph,whichaugmentedstandardoverlapgraphs byaddingedgesconnectingnodesthatcanformendsofreadpairs.Then,agreedypath algorithmisappliedtosearchforthepathswiththebestsupportsfrompaired-endreads,wherethe supportsarebythenumberofcontainedreadpairsandalsotheirdistances.Thedetailed algorithmandimplementationofPEHaploaredescribedinChapter3. 2.3Resultsanddiscussion WehavedevelopedamodularstructuredtoolnamedTAR-VIRforreconstructingviralhaplotypes frommetagenomicsdata.Theoutputsofthistoolareassembledviralcontigscorresponding todifferentstrains.Wefocusonevaluatingtheperformanceofthereadrecruitingstageandalso itsimpactontheassembly. BeforepresentingtheresultsofTAR-VIRfordifferentsetsofinputdata,weshowthe resultsofthelongestcommonsubstring(LCS)sizecomputationfordifferentmicrobes. 28 2.3.1Sizesofcommonregionsbetweenhumanvirusesandothermicrobial species Toevaluatethesimilaritybetweendifferentmicrobes,wecalculatedthesizesofLCSsbetween humanvirusesandothermicrobialgenomes.Asbacteriainfecthumansaswellasviruses,the sizesofLCSsbetweenhumanviruses,humanvs.non-humanviruses,andhumanvirusesvs. bacteriawerecalculated.ThevirusreferencegenomesweredownloadedfromNCBIViruses (https://www.ncbi.nlm.nih.gov/genome/viruses/).Todate(June2018),thereareintotal7,456 completeviralgenomes,ofwhich481havehumanasthenaturalhost(denoteashumanviruses). ThehumanbacterialreferencegenomesweredownloadedfromHumanMicrobiomeProject(HMP) onNCBI.Intotal,2,314bacterialreferencegenomesweredownloaded. Asthereisalargenumberofmicrobialspeciesavailable,weconductedLCSsearchforavail- ablemicrobialgenomesbyconstructinggeneralizedsufarrayandthecorrespondinglongest common(LCP)array[50]. First ,webuildageneralizedSA[121]. Then ,theLCParray, whichcontainsLCPsbetweeneachtwoadjacentsufes,canbecalculatedinlineartime[61,60]. BytheLCSforeachtwosequencesisthemaximumLCPbetweenallpairsofsufes fromthetwosequences.Thefollowinglemmaisemployedinordertoavoidcheckingallthe LCPvaluesbetweentwosequences.Forthesufstartingat SA [ i ] ,theLCPbetween SA [ i ] and SA [ j ] ( j > i )isnolessthantheLCPbetween SA [ i ] and SA [ k ] if k > j .Withthispropertyanda userLCScutoff,theLCPcalculationbetween SA [ i ] andallothersufesafter i canbe calculatedinconstanttime.Theoveralltimecomplexityis O ( N ) . TheresultsoftheLCShistogramsareshowninFigure2.3(A-C).Onemayalsoexamine whetherthereadrecruitmentprocesscanincurcontaminationbyusingsimulatedorrealsequenc- ingdata.However,theempiricalstudiesusingrealdataarelimitedtothevirusesinthesamples. 29 Meanwhile,producingsimulatedsequencingdataforallmicrobesisnotpractical.Usingsuf arraybasedLCScomputationallowsustoobtainamorecomprehensiveviewofthecommon regionsbetweendifferentmicrobes. WealsocomparedthesizesoftheLCSsbetweendifferentmicrobeswiththeoneswithin aviralpopulation.AsthecharacterizedhaplotypesfordifferentRNAvirusesareverylimited, insteadofcomputingtheLCSusingavailabledata,weestimatedtheLCSswithinaquasispecies usingaprobabilitymodelanddynamicprogramming[26].Withthemutationrateof3e-5ateach baseduringvirusreplication,theprobabilitydistributionofLCSlengthbetweentwoHIVstrains thatare n generationsapartwerecalculatedandthedistributionofLCSprobabilitiesisshownin Figure2.3(D). 2.3.2Exp1:reconstructtheSARShaplotypesusingthebatcoronavirusas thereference Inthisexperiment,wemimicthescenarioinwhichSARS-CoV[23]isanemergingvirusinfecting humans.OurgoalistoreconstructtheSARS-CoVhaplotypesusingothercoronavirusesasrefer- ences.DuringthebreakoutofSARS,electronmicroscopeimagerevealsthecrown-likeshapeof theinfectiousagent,providinghintstousecoronavirusesasreferences. Totestthis,weassumethatthebatcoronavirus(NC_014470.1)wassequencedandavailable touseasareference,althoughitwasactuallysequencedafterthebreakoutofSARS. 2.3.2.1Datapropertiesandevaluationmetrics Aviralmetagenomicdatasetcontaining(NC_002023.1),hepatitisCvirus(HCV,NC_004102.1), and5SARS-CoVhaplotypes,wassimulated.TheSARS-CoVhaplotypeswerecreatedfromthe 30 Figure2.3:HistogramoftheLCSsizesbetweenhumanviruses(A),betweenhumanvirusesand non-humanviruses(B),andbetweenhumanvirusesandbacteria(C).Thex-axisisthe log 10ofthe LCSlength.They-axisisthenumberofpairswithinthegivenrangeofLCSsize.OnlyLCSsthat arelongerthan10bparepresented.(D)ProbabilitydistributionforLCSsbetweentwosimulated HIVstrainsthatare50,100,200,and500generationsapart.The x -axisisthelengthofLCS,with arangefrom0to10,000bp.The y -axisisthecorrespondingprobabilitiesforthoseLCSsizes. 31 SARS-CoVreference(NC_004718.3)genomebymutatingbasesatrandomlyselectedlocations. Thesequencesimilaritybetweenanytwohaplotypesisabove96%.Theabundanceofeachhap- lotypeiscalculatedbasedonapowerlawequation[9].Thetotalsequencingdepthsforthe5 SARS-CoVhaplotypesare1000-x,with438-x,219-x,146-x,109-x,and88-xforeachhaplotype, respectively.ThesequencingdepthsforandHBVare700-xand300-x,respectively.All thedatasetsweresimulatedbyART-illumina[55]aserror-containingMiSeqpaired-endreads, withthereadlengthof250bp,theaverageinsertsizeof600bp,andthestandarddeviationof 150bp.Intotal,thereare173,703simulatedreads,ofwhich119,002readsarefromtheve SARS-CoVhaplotypes. Withtheavailablebatcoronavirusasthereference,thesimulatedreadswerealignedwithboth Bowtie2[71]andBWA[79].WethenappliedtheoverlapextensioncomponentofTAR-VIR toisolateandenrichSARS-CoVreads,andassembledthemwith denovo assemblytools.Both thereadrecruitmentandtheassemblyperformancewereevaluated.Forthesimulateddata,the groundtruthoftheoriginatinghaplotypeandpositionofeachreadisknown.Thusreadrecruiting performancecanbeevaluatedusingthereads'positionsandoriginatinghaplotypes.Insummary, weexaminehowmanyreadsarecorrectlyrecruitedforeachhaplotypeandreportthehaplotype coverageanddepth. Theassemblyperformancewasevaluatedusingtheknowngenomesofthe5SARS-CoVhap- lotypesandMetaQuast[100].Similartootherworks,wetheassemblycontinuity, completeness,andaccuracyintermsofnumberofcontigs,N50,genomecoverage,andmismatch rate.N50isasthemaximallengthsothatallcontigsabovethislengthcontainatleast50% ofallthecontigbases.Genomecoverageisthepercentageofthevehaplotypes'genomesbeing alignedbyatleastonecontig.Mismatchrateisthepercentageofmismatchesbetweenthealigned contigsandthereferences.Inallcases,contigsofatleast500bparealignedtotheviralrefer- 32 encesequencesforevaluation.Theassemblyresultswerealsobenchmarkedwithotherpopular assemblytoolsSGA[142],SPAdes[8],andSAVAGE[6]. 2.3.2.2Performanceofreadrecruitment WeappliedbothBowtieandBWAinthereadmappingstage.Byadjustingthescoringfunction relatedparameters,weconstructeddifferentsetsofseedreadsthatcanbealignedtothereferences withdifferentapproximatematchconstraints.Foreachseedset,therecruitedreadsgeneratedby TAR-VIRarerecorded.Table2.1comparesthealignedandrecruitedreadsforeachSARS-CoV haplotype.Besidesapproximatematchrates,wealsoconsideredlocalandfiglocalflalignment mode,wheretheglocalmoderequirestheend-to-endalignmentofthereadagainstthereference. Usinglocalalignmentmodeforreadmappingcanusuallyproducealargerseedset.However,itis possiblethatsomeofthelocallyalignedreadsarenotsequencedfromtheunderlyinghaplotypes. InTable2.1,weusedlocalalignmentmodeforBWAandglocalmodelforBowtie.Thus,theseed setsconstructedbyBWAislargerthanBowtie. Evenwiththeleaststringentthreshold,thealignedreadshavelowergenomecoveragethan recruitedreads,whichisexpectedbecauseSARS-CoVdoesnotsharegenome-scalehighsimilarity withthebatcoronavirus(Figure2.4(C)).Inparticular,withtheparameterfi-B1",BWAcan alignslightlymorereadsthanwhatBowtie2canrecruitwiththeparameterfiL,0,-0.9"((52,688 vs.52,377).Therecruitedreads(52,377),however,cover20%-30%moregenomesfortheve haplotypes.Thisindicatesthatalignment-basedmethodstendtoidentifyreadssequencedfrom highlysimilarregionsbetweenthetargetandthereferenceviruses,whiletherecruitmentmethod ismorelikelytoobtainreadsfromthewholegenomeofthetargetviruses.Worthnotingisallthe recruitedreadsarefromSARS-CoV(nocontaminationfromandHBV). Figure2.4(A)andFigure2.4(B)comparedthegenomecoverageofseedreadsandrecruited 33 Table2.1:ReadrecruitmentresultsbyusingseedsetsconstructedwithBowtie2andBWA.The fiAlignmentflsectioncontainsresultsforalignedreads.ThefiRecruitmentflsectioncontainsresults forrecruitedreadsbyTAR-VIRusingthealignedreads.Foreachrow,thealignedreadsinfiAlign- ment"sectionaretheseedsetfortherecruitedreadsinfiRecruitmentflsection.ForBowtie2,the fiŒscore-minflparameterwassettoallowdifferentalignmenterrorratescorrespondingto5%,10%, 15%,and20%,respectively.ForBWA,fi-Aflisedasitsdefaultvalue1.fi-Bflwasto allowdifferenterrorratesimilartoBowtie2.fiNumberflisthenumberofalignedorrecruitedreads. fiDepthflistheaveragesequencingcoverage.fiCoverageflisthepercentageofgenomecoveringby atleastoneread.h1toh5representtheveSARS-CoVhaplotypes. Bowtie2 Alignment NumberDepthCoverage h1h2h3h4h5 h1h2h3h4h5 L,0,-0.3 550.130.010.060.150.12 0.010.010.010.010.01 L,0,-0.6 9253.61.50.90.90.9 0.070.060.050.070.08 L,0,-0.9 8,1543214987 0.310.310.270.270.3 L,0,-1.2 13,2214924151310 0.430.450.420.440.43 Recruitment NumberDepthCoverage h1h2h3h4h5 h1h2h3h4h5 L,0,-0.3 45,50418289594111 1.01.01.01.00.37 L,0,-0.6 46,57618390604218 1.01.01.00.960.55 L,0,-0.9 52,337 198 96 63 46 37 1.0 1.0 1.0 0.98 0.99 L,0,-1.2 55,48518289594139 1.01.01.01.00.99 BWA Alignment NumberDepthCoverage h1h2h3h4h5 h1h2h3h4h5 B:8 24,5858946282018 0.40.370.330.310.34 B:4 41,56415278503732 0.630.570.560.570.53 B:2 51,99519594634639 0.790.780.770.700.74 B:1 52,688 199 94 64 47 39 0.81 0.78 0.80 0.71 0.76 Recruitment NumberDepthCoverage h1h2h3h4h5 h1h2h3h4h5 B:8 62,609235117755544 1.01.01.01.00.99 B:4 72,901270135896553 1.01.01.01.01.0 B:2 79,755299146977158 1.01.01.01.01.0 B:1 78,540294143967057 1.01.01.01.01.0 reads.Directlyaligningthereadstothebatcoronaviruscoversonlyasmallproportionofthewhole genome(Figure2.4(A)),leadingtoincompleteassembly.Usingthesealignedreadsasseeds,TAR- 34 Figure2.4:EnrichingSARS-CoVreadsusingthebatcoronavirusgenomeasthereference.(A) and(B)showthealignedandrecruitedreadse.ThedatasetwasalignedbyBWAwiththe defaultparameter("-B4,-A1").BWAischosentoincludemorelocallyalignedreadsin(A).The readswererecruitedusingtheoverlapcutoffof150bp.(C)displaysthesequenceidentitybetween SARS-Covandthebatcoronavirus.ThewasgeneratedusingVISTA[45]. VIRisabletorecruitmanymorereadsthatnearlycoverthewholegenomeofSARS-Cov,asshown inFigure2.4(B). Table2.1alsoshowsthatthenumbersofrecruitedreadsdonotheavilyrelyonthenumberof theseedreads.Evenwhentheseedsetissmall(e.g.theseedsetsconstructedusingBowtie2), manynewreadscanberecruitedduringeachiteration.Aftermultipleiterations,thesetof recruitedreadscanbelargerthantheseedset,boundedbythesequencingdepthof thehaplotypes.Ontheotherhand,iftheseedsetscontainmanyreadsfromnon-relevantspecies, thesetofrecruitedreadscouldevenincludeallthereadsfromtheinput,whichmakesthe readrecruitinguseless.Becauseofthis,weprefertoconstructtheseedsetusingglocalmodeto ensurehighquality. 35 Recruitedreadsleadtobetterassemblyperformance Boththealignedandrecruitedreadswereassembledwith denovo assemblytools.PEHaplois thedefaultassemblycomponentinTAR-VIR.AsTAR-VIRhasamodularstructure,other de novo assemblytoolsincludingSGA,SPAdes,andSAVAGEarealsousedtoreplacePEHaplofor haplotyereconstruction.SPAdeswasrunwithŒmetaoption,whichissameasmetaSPAdes[108]. Asthealignedreadscoveratmost80%ofthegenomesevenwiththeleaststringentalignment threshold,itisnotpropertoapplytheconventionalreference-basedassemblymethodsforthis dataset. Thecomplete denovo assemblyresultsusingbothalignedandrecruitedreadsarepresentedin SupplementaryTableS1.PartoftheresultsareshowninTable2.2duetospacelimiation.Forall assemblytools,usingrecruitedreadsproducesbetterresults:longercontigsandhighergenome coverage.,thisisnotsimplyduetotheincreasednumberofreads.Forexample,as showninTable2.2,thereadsrecruitedbyBowtie2withparameterfiL,0,-0.9flislessthanBWA- alignedreadswhenBis1(52,337vs.52,668).Buttherecruitedreadsproducecontigsatleastten timeslongerthanthealignedreads,andtwicehigherinthegenomecoverage.Bycomparingthe assemblyperformanceofalltestedtoolsontherecruitedreads,ourassemblycomponentPEHaplo consistentlyhashigherN50andgenomecoveragethanothers.Overall,PEHaploandSGAperform betterthantheothertwoassemblytools. PRICEappliesextension-basedstrategiesforcontigassembly.Usingtheseedreadsasinitial contigs,PRICEcanbereadilyusedtoperformtargetedviralassemblyfrommetagenomicdata. Therefore,theresultsofTAR-VIRwasalsobenchmarkedwithPRICE'sresults,asshowninTa- ble2.3.PRICEproducedonelongcontig(similarN50toours)forthemostabundanthaplotype. Thus,itsgenomecoverageisonlyabout20%. 36 Table2.2:AssemblyresultsonSARS-CoValignedandrecruitedmetagenomicdata.Thede- faultassemblytoolinTAR-VIRisPEHaplo.TheofthemetricscanbefoundinSec- tion2.3.2.1. Aligned Tool#ContigsN50 Genomes covered(%) Mismatch rate(%) Bowtie2 L,0,-0.9 TAR-VIR5850519.70.02 SGA5650520.10.03 SPAdes3456912.90.16 SAVAGE5445517.50.0 Recruited Tool#ContigsN50 Genomes Covered(%) Mismatch rate(%) TAR-VIR 7 29,676 98.9 0.0 SGA 13 26,729 98.9 0.0 SPAdes 14 15,882 92.1 0.51 Bowtie2 L,0,-0.9 SAVAGE 22 12,445 97.0 0.0 Aligned Tool#ContigsN50 Genomes covered(%) Mismatch rate(%) TAR-VIR 84 1,192 55.1 0.0 SGA 85 1,027 56.5 0.0 SPAdes 67 1,012 44.6 0.12 BWA B:1 SAVAGE 68 669 32.3 0.0 Recruited Tool#ContigsN50 Genomes Covered(%) Mismatch rate(%) BWA B:1 TAR-VIR629,70699.50.0 SGA1812,63899.50.0 SPAdes2110,35389.20.39 SAVAGE565,14089.30.0 Table2.3:AssemblyresultsonSARS-CoVmetagenomicdataforTAR-VIRandPRICE. Tool#ContigsN50GenomesCovered(%)Mismatch(%) Bowtie2 L,0,-0.9 TAR-VIR729,67698.90.0 PRICE129,74920.01.7 BWA B:1 TAR-VIR629,70699.50.0 PRICE129,75020.01.66 2.3.3Exp2:characterizinghepatitisvirusesfromthehumanplasmadata Inthisexperiment,TAR-VIRistestedonarealmetagenomicdataset,whichwassequenced fromtheplasmaof19antiretroviral-treatedHIVpatients(SRR2083204)[80].Thesampleswere byrandomRT-PCR(RA)forbothviralRNAsandDNAsandthen 37 sequencedbyIlluminaMiseq,producingabout23millionreads.Allthesesamplescontainlow levelsofHIVbecauseoftheantiretroviraltreatment.Butitmaycontainotherhumanpathogens. Inourstudy,wefocusedonidentifyinghepatitisviruses.Althoughourpipelineisdesignedto tacklethechallengesofcharacterizingRNAviralquasispecies,wealsoincludeinthereferences DNAhepatitisvirusessuchasHBV. 2.3.3.1Preprocessing Therawdatasetcontainsreadsthatcomefromvariedsources:human,bacteria,phages,etc. Thereadsofthetargetvirusescompriselessthan30%oftheentiredataset.Sincetheprimary focusishumanviruses,removingthosereadsfromthehost(human),bacteria,andphagesisideal beforepathogendetection.Followingcanonicalqualitycontrolandtrimming,weusedbamtagger [124]toremovehumanreads,andBowtie2toremovereadsfrombacteriaandphagebyaligning readsagainsttheirreferencegenomes.Theremainingreadswerecorrectedbyerrorcorrectiontool Karect[2].Afterpreprocessing,8,145,722readswereleft. 2.3.3.2RecruitedreadsbyTAR-VIRcanimprovetheperformanceof denovo assembly Inthestep,weconductedreadmappingtoobtaintheseedreads.BothBWAandBowtie2 couldbeused.However,althoughBWAalignedmorereads,manyreadsyieldedonlyshortlocal alignmentsandareunlikelytobesequencedfromthetargetviruses.Usingthesereadsasseeds tendtocausecontaminationduringthereadrecruitmentstage.Forexample,whenBWA(fi-B8,-A 1")wasusedtogeneratetheseedset,roughly3.5millionreadswererecruited,whileaportionof themcanbealignedtoothergenomes(suchasphages).AlthoughBWA'soutputcanbeprocessed toremovelocalalignments,theseedsetcanbemorereliablyproducedusingBowtie2'soutput. Therefore,Bowtie2waschosenasthealignerforallrealdataexperiments. 38 WedownloadedthereferencegenomesofHBV(NC_003977.2),multiplegenotypesofHCV (NC_009827.1,NC_009823.1,NC_009825.1,NC_030791.1,NC_004102.1,NC_009826.1,NC_009824.1), andhumanpegivirus(HPgV,NC_001710.1)fromtheviralgenomedatabaseofNCBI.Thepre- processedreadswerethenalignedtothereferencesundermismatchratesof5%,10%,15%,and 20%,respectively.Theseinitiallyalignedreadswereusedastheseedreadsets.Althoughthere aremultiplegenotypesforHCV,onlygenotype1hasadecentamountofalignedreads.Other genotypeshavelessthan50readsmapped.Thus,toproduceareliableevaluationoftheassembly results,onlytheresultsofHCVgenotype1wereused.Thenumbersofreadsbeforeandafterread recruitingareshowninTable2.4. Table2.4:Overlapextensionresultsusingdifferentseedset R 0 .`#'represents`number'.The shadedregionsinthistableandTable2.5highlightthecasewherelessrecruitedreadscanproduce betterassemblyresultsthanalignedreadsonly. Align mismatch Seed # Recruited # Align mismatch Seed # Recruited # 5%21,925200,650 10%67,973222,065 15%162,454 263,029 20% 294,448340,705 Asthisisarealmetagenomicdatasetwithoutknowngroundtruthoftheviralhaplotypes,the evaluationmetricsforreadrecruitingaredifferentfromthesimulationdataset.Wecannotevaluate whethereveryrecruitedreadiscorrectbecauseitsoriginatinglocationisunknown.Thus,instead ofevaluatingthedepthandgenomecoverageforeachhaplotype,wefocusonevaluatingwhether usingrecruitedreadscanimprovetheperformanceofgenomeassembly. Therefore,boththealignedreadsandtherecruitedreadsfromTAR-VIRwereassembledby de novo assemblytools,andtheresultswerecomparedinTable2.5.Theassemblyresultsdemonstrate thatthereadsrecruitedbyTAR-VIRusuallyimprovetheassemblyresultsbyproducinglonger contigsandhighergenomecoverageforitsPEHaplo,SGA,andSPAdes.Theimprovementisnot 39 simplyduetotheincreasednumberofreadsaftertherecruitmentstage.Forexample,according toTable2.4,byusing15%mismatchrate,therecruitedreadsarelessthanthealignedreadsunder 20%mismatchrate(263,029vs.294,448).However,theassemblyresultsusingtherecruited readsarebetterthanorcomparabletotheresultsusingthealignedreadsforalltheassemblytools. Amongthefourassemblersused,PEHaploofTAR-VIRandSPAdesproducedgoodresultswith largeN50andhighgenomecoverage.SGAgeneratedlargernumberofcontigswithlowN50 value.WhilewehavetriedthebestparametersforSAVAGEwithourempiricalexperience,its resultsarenotconsistentwiththeotherthreetools.BetterparametersmayexistforSAVAGEto producebetterresults.However,thelong-runningtimeandhighmemoryusageofthistoolmade continuingtotunetheparametersdif Table2.5:AssemblyevaluationresultsonalignedandrecruitedreadsusingthegenomesofHBV, HCV,andHPgVasreferences.`cov.'istheabbreviationfor`coverage'.Thedefaultassembly componentinTAR-VIRisPEHaplo. Align Tool Bowtie2aligned Bowtie2Recruited Contig # N50 Genome cov.(%) Contig # N50 Genome cov.(%) 5% TAR-VIR 1192027.3 973,64382.3 SGA 1464526.8 6367568.4 SPAdes 51,17727.6 153,63679.6 SAVAGE 1369821.6 4980640.4 10% TAR-VIR 6179467.4 312,63584.0 SGA 2666356.4 7270669.5 SPAdes 151,25165.4 193,37379.3 SAVAGE 3063140.6 3291526.5 15% TAR-VIR 9793980.9 14 3,579 83.1 SGA 5661757.7 74 722 70.2 SPAdes 201,68977.6 16 3,986 81.0 SAVAGE 3263929.9 24 999 27.6 20% TAR-VIR 38 1,852 84.5 775,67886.4 SGA 78 661 59.5 37453764.5 SPAdes 23 2,710 83.8 104,83084.6 SAVAGE 19 671 19.1 158235.5 40 2.3.3.3Comparisonwithreference-basedandextension-basedassemblymethods Withthereferencegenomesavailable,reference-basedtoolscanbeappliedforviralmetagenomic dataanalysis.Therefore,wealsobenchmarkedTAR-VIRwithreference-basedhaplotyperecon- structiontoolsincludingHaploclique[147],drVM[86],andViQuas[59].VirusTap[162]canalso conductreadsandthenassembly.WhilewewereplanningtocompareTAR-VIR withVirusTap,alargedatasetcouldnotbeuploadedtothewebsite-basedVirusTap.Inaddition, about3,000jobswerewaitingatthewebsite.Therefore,theresultsfromVirusTapcouldnotbe reported. Thereadsalignedwiththemismatchrateof15%wereusedasinputforHaplocliqueand ViQuas.FordrVM,thereferencegenomeswerebuiltfromhumanviruses,anditranonthe rawfastq(withsimplequalitycontrolandtrimming)dumpedfromSRAwithdefault parameters.TheseedreadsetofTAR-VIRwerealsothereadsmappedwith15%mismatchrate. TheassemblyresultsareshowninTable2.6. Table2.6:Assemblyresultscomparisonwithreference-andextension-basedmethods. Tool Contig # N50 Genome cov.(%) Tool Contig # N50 Genome cov.(%) TAR-VIR143,57983.1 ViQuas3969,646100.0 Haploclique50,41930471.1 drVM41382981.9 TheresultsshowthatTAR-VIRperformsbetterthanHaplocliqueanddrVMbyproducing fewerbutlongercontigswithhighergenomecoverage.Withthecompleteandalsothelikely fitrueflvirusgenomesasthereference,ViQuashasproducednear-completegenomes.However,it producesalmost400contigswithsimilarlengths(fullgenomes),indicatingahighprobabilityof overestimationofthehaplotypes.Sincethegroundtruthoftheactualnumberofhaplotypesinthis datasetisunknown,weintendedtotestthishypothesisusingadatasetwithknownhaplotypes. Therefore,wetestedViQuasontheSARS-CoVsimulateddatasetwith5haplotypes.Itreported 41 113contigs,eachcovering99.98%ofthegenomewithhighmismatchrate( > 9.0%).Thus,the longcontigsproducedbyViQuasarenotlikelythetruehaplotypes. SimilartoSARS-CoVdata,wealsobenchmarkedourresultswiththeextension-basedtool PRICE.TheinitialcontigsofPRICEwerealsothereadsmappedwith15%mismatchrate.PRICE generated164contigs,withaN50valueof791,andgenomecoverageof87.3%.PRICE'sresults haveaslightlylargergenomecoveragebutamuchsmallerN50valuecomparingtoTAR-VIR. 2.3.3.4Assemblingthewholedatasetdirectly AsSGAandSPAdesarehighlyefandhavebeenusedbyvariousvirusanalysispipelines,it maybepossibletodirectlyapplythemtoallthepreprocessedreadsforrecoveringthethreeviruses ((HBV,HCVgenotype1,andHPgV).Thus,weappliedSGAandSPAdestothepreprocessed reads.Theassembledcontigswerecomparedwiththereferencegenomesofthethreeviruses. SGAtookabout1hourtoItgenerated2,659contigs,fromwhich123contigscanbealigned tothethreeviruseswiththesimilaritythresholdof90%.The123contigscancover42.36%of thereferencegenomes.SPAdesfailedtoreporttheresultswithin24hours.Theresultsfrom SGAvthatalthoughthepreprocesseddatasetcontainsallthereadsfromthetargetviruses, thesheerdatasizeandthelowproportionofthethreevirusesmakegenericassemblydif Meanwhile,assemblingalargedatasetconsumescomputingresources. 2.3.4Identifyingvirusescontainingtargetgenes Insomesituations,theresearchersareonlyinterestedintheviralgenomescontainingapartialor completegene.Inthesecases,itisdifforexistingreference-basedvirustoolsto constructthewholeviralgenome.Here,wedemonstratethatwiththeoverlapextensionmethod, themostofagenomecanbebuiltfromapartialgenereference. 42 Inthisexperiment,weshowthatwithanon-completegenesequenceoflength1,073bpfor HPgVasreference,mostofthegenomecanbeassembled.Thereferencesequence(Sequence name:10MYKJ037)wasdownloadedfromVirusPathogenDatabaseandAnalysisResource (ViPR)[117],whichisapartialcodingDNAsequence(CDS)ofHPgVisolatedfromMalaysia in2010.ThetotallengthofHPgVgenomeis9,392bp.Fromtheresultsofourpreviousex- periments,overlapextensionfromalignedreadsundermismatchrateof15%wasabletorecruit adequatereadswhilekeepingawayunreliablereadsasseeds.Therefore,wealignedtherawreads tothisCDSreferencebyallowingmismatchrateof15%,fromwhich19,714readswerealigned. ViQuasanddrVMwereusedtoassemblethealignedreads.However,ViQuascouldonlyproduce contigssimilartotheshortCDSsequence.InadditiontoprovidedCDSreference,drVMalso downloadedreferencesfromInternet.ItcorrectlyrecognizedtheHPgVbutfailedtoproduceany contig.Theresultsthatreference-basedmethodsdonotapplyinthiscase.Withoverlap cutoffof150bp,118,339readswererecruitedfromtheoverlapextensionstep.Theywerethen assembledbyPEHaploofTAR-VIR,SGA,SPAdes,andSAVAGE,asshowninTable2.7. Table2.7:AssemblyresultsonrecruitedreadswithapartialCDSsequenceforHPgVasreference. Tool Contig # N50 HPgV cov.(%) Tool Contig # N50 HPgV cov.(%) TAR-VIR57,95986.0 SPAdes68,95794.0 SGA4159149.0 SAVAGE3558049.5 Whilethelengthofreferencestrainbeingonly11.42%ofthewholeHPgVgenome,thecontigs assembledfromrecruitedreadsbyTAR-VIRandSPAdesareabletocoverthenearlycomplete genome.Theresultsrevealthatevenwithagene/CDSsequenceasreference,sufreadscan stillbecollectedtoconstructthevirusatthewholegenomelevel.Asthereisonlyonetargetvirus, SPAdesproducedthebestresults.ApplyingSPAdestothewholehumanplasmadatasetfailed 43 toontheclusterafter24hours,butbyusingrecruitedreads,SPAdescanproducebetter assemblieswiththeminimumamountofresources. 2.3.5Computationaltimeandmemoryusage WeevaluatedthetimeandmemoryusageofTAR-VIRontherealhumanplasmadata.After preprocessing,8,145,722readswereleft.Thedatasizeis2.9GB,andthetotallengthofthe sequencesare2,447,741,491bp.Toreducememoryusage,therawdatawassplitinto5parts,with 5BWTsbeingbuiltforthewholedata.Thesplittingprocessisembeddedinourprogram,andthe numberofsegmentscanbesetbyusers.Foreachpartition,thesizesfortheBWT,Occarray, andthereadIDarrayare490M,200M,and13M,respectively.Thetotalsizeofbuiltindexesis3.5 GB.ThedetailedtimeandmemoryusagefortheoverlapextensionisshowninTable2.8below. Ausercanloadeachpartitionseparatelytoreducethememoryusage.Inthatcase,thememory usageisaboutthesizeofeachpartition.Inaddition,wemayfurtherreducethememoryusageof therecruitingprocessbyapplyingmorecompactimplementationoftheBWT[18]. Table2.8:Timeandmemoryusageforoverlapextensionandassemblyonviralmetagenomicdata fromhumanplasam.The denovo assemblytimeandmemoryusagewereevaluatedonrecruited readsbasedonmismatchratefrom5%to20%. TimeMemory(GB) Overlap extension Buildingindex 127m2.4 Recruitment < 20m ˘ 3.5 Denovo Assembly TAR-VIR 5m-20m2-4 SGA < 10m < 1 SPAdes < 10m < 1 SAVAGE 2h-72h64 PRICE ˘ 2h ˘ 5.2 Reference-based assembly Haploclique 33h5 ViQuas > 72h < 4 drVM < 30m < 1 AlltheexperimentsweretestedonanMSUHPCCCentOS6.8nodewithTwo2.4Ghz14-core 44 IntelXeonE5-2680v4CPUsand128GBmemory.Weused4threadsfortheassemblycompo- nentofTAR-VIR,8threadsforSGA,16threadsforSAVAGE,8threadsforPRICE,1threadfor HaplocliqueandViQuas,and2threadsfordrVM. 2.4Conclusions InthisChapter,wepresentedanovelpipelineforviralreadscationandstrain-levelassembly fromviralmetagenomicdatanamedTAR-VIR.Whenavirusinametagenomicdatasetisonly remotelyrelatedtoacharacterizedvirusinpublicdatabases,ourpipelinecanbeappliedto classifythereadsbelongingtothesevirusesandthenconductstrain-levelassembly.Orifauseris interestedindetectingavirusthatcontainsagivengene,ourmethodcanbeemployedtorecover thewholegenomeofthegene-containingvirus. Wealsomadecontributionsbyconductingcarefulanalysisofthecommonregionsizesbe- tweenandwithinviralquasispecies.Theseanalyseslaidthefoundationforusingoverlapdetection toclassifyreadsofthesamequasispecieswithoutintroducingcontamination.Ouruniqueimple- mentationoftheindexingstructurealsomakeourmethodeconomicalinbothmemoryandCPU usages. Wedemonstratedthetool'sutilitiesonasimulatedviralmetagenomicdatacontainingSARS- Covandarealviralmetagenomicdatasetsequencedfromhumanplasma.Thesimulateddata enablesustoevaluatetheperformanceofreadclassitotheresolutionofeachsingleread. ItshowsthatTAR-VIRcansuccessfullyclassifyenoughreadstocoverthewholegenome.In addition,itproducedcontigscoveringvedifferenthaplotypes. Onthehumanplasmadata,wewereabletoenrichenoughreadsfromthetargetvirusesfor downstreamassemblyevenwithasmallseedreadset.WithapartialCDSsequenceforHPgVas 45 reference,TAR-VIRwasabletoproducenearcompletegenomeassemblies.Theresultsclearly showedtheeffectivenessofTAR-VIR.Insummary,TAR-VIRprovidescomplementaryfunctions toexistingvirusdetectiontoolswhenthequalityorcompletereferencesarenotavailable. 46 Chapter3 Denovohaplotypereconstructioninvirus quasispeciesusingpaired-endreads 3.1Introduction Virusquasispeciesdescribeapopulationofdifferentbutcloselyrelatedviruswithdynamicdistri- bution.Eachstraininquasispeciesisbyitshaplotypesequence.Natureselection,point mutation,andrecombinationscanallchangethehaplotypesandtheirabundanceinsideaqua- sispecies.RNAvirusandsomeDNAvirusevolvefollowingthequasispeciesmodel.Commonly knownexamplesincludehumanyvirus(HIV-1),thehepatitisCvirus(HCV),the foot-and-mouthvirus(FMV),etc.Figure3.1showsalocalmultiplealignmentofvehaplotypes inHIVquasispecies. Astheselectionworksonasetofsequencesratherthanone,quasispecieshaveabilitiesto escapehostimmuneresponsesordevelopdrugresistance.Reconstructionoftheviralhaplotypes isafundamentalsteptocharacterizethequasispecies,predictviralphenotypes,andprovide importantinformationforclinicaltreatmentandprevention. Developmentofnext-generationsequencingtechnologiesshedslightoncharacterizingthehap- lotypesandtheirabundanceinquasispecies.Ifthereferencesequencesareavailable,readmapping canbeconductedtoinferhaplotypes.However,duetothehighmutationrate,usuallyqualityref- erencegenomesofquasispeciesarenotavailable.Thus,thereisaneedfor denovo haplotypere- 47 constructionmethod.Arecentreviewofchosenhaplotypeassemblytoolshasshownthat denovo haplotyperecoveryisacomputationallychallengingproblem.Itsharessomecommonchallenges withmetagenomicassembly,whichaimstoassembleshortreadsintofullgenomesofmember species.Theessentialideaistobuildagraph(e.g.DeBruijnoroverlapgraph)usingreadsinthe givensampleandthenrecoverthegenomebywalkingthroughapath.Intheidealcase,where therearenosequencingerrors,thereissufcoverage,andthegenomesdiffersubstantially,it istrivialtoidentifyreadsbelongingtothesamehaplotypeandstitchthemintoalongsequence. However,inreality,haplotypeconstructionmusthandleshorterror-pronereads,highsimilarity betweendifferentstrains,raremutations,andlowabundanceofsomestrains.Wediscuss howtheseissuescomplicatetheassemblyproblem.Asweconstructoverlapgraphforhaplotypes recoveryinthiswork,wefocusonhowtheseissuesmake denovo assemblyinoverlapgraphs dif First,ifthecommonregionbetweenanytwostrainsisshorterthanthereadlength,thecon- nectionofallreadswillleadtodifferentpaths,eachofwhichcorrespondstothefullsequenceof ahaplotype.However,thecommonregionsbetweenstrainscaneasilyexceedthesizeofaread andthusintroducemanysharednodesandbifurcationsintheoverlapgraph.Second,sequencing errorscanintroducewrongconnections,tips,bubblesandaddtothecomplexityof thegraph.Third,alignment-basederrorcorrectionhavedifindistinguishingraremutations fromsequencingerrors.Finally,readsoriginatingfromhaplotypesoflowabundancetendtohave smalloverlapsandthusonlyfragmentsofhaplotypesbereconstructed.Indeed,arecentbench- marking[136]demonstratedthattheperformanceofallthetestedprogramsispoorwhensequence divergenceislow.Inaddition,theseprogramsfailedtorecoverrarehaplotypes.Thus,thereisa needfornewmethodsandtoolsformoreaccuratehaplotypeassembly. Thereexistanumberofgenericassemblytools[106,150,73,116,90,133,159],whichcanbe 48 Figure3.1:MultiplesequencealignmentofveHIV-1haplotypes. appliedforsequenceassemblyinviralquasispeciesdata.However,thesegenericassemblytools arenotdesignedforrecoveringviralhaplotypesinquasispecies.Highsimilaritybetweendifferent strains,plussequencingerrors,canleadtohighlytangledgraphs.Applyingexistingtoolseither leadtoveryshortcontigsorchimericcontigs. Thereareagroupoftoolsdesignedforreconstructionofhaplotypes[166,92,118, 148,119,4,146,54].Sometoolsfocusedonestimatinglocalhaplotypes.Ourgoalofthiswork isglobalreconstruction,whichaimsatreconstructingthefull-lengthviralhaplotypesinasample. Comparedtolocalreconstruction,itprovidesamorecompleteandaccuratecharacterizationof thequansispecies.SometoolssuchasShoRAHrelyonreferencesequencesforconductingalign- ments.Formoregeneralapplicationofhaplytypereconstruction,weneeda denovo assemblytool thatdoesnotrelyonreferencesequences. Oftheavailablehaplotypereconstructionprograms,therearethreehighlyrelatedonestoours. TheyareHaploClique[147],SAVAGE[5],andMLEHaplo[93].LikePEHaplo,allofthemare designedforpaired-endreads.HaploCliqueusesenumerationofmaximalcliquesasageneral approachtoclusteringNGSpaired-endreads.AlthoughPEHaploisalsodesignedforpaired- endreads,thetwoarehighlydifferentinusingpaired-endreadsinformation.First,HaploClique needsareferencesequenceforgeneratingalignmentgraphswhileoursdoesnotrelyonreference sequences.Second,HaploCliqueusestheinsertsizesfordetectionoflargeindels.However,insert sizeusuallyhasadistributionandcanvarysubstantiallybetweendifferentfragments.Thusthis 49 informationisusedmoreconservativelyinPEHaplo.MLEHaploalsoexplicitlyemployspaired- endreadsfortop-scorepaths.Itsusageofpaired-endinformationinaDeBruijngraphis highlydifferentfromourmethodbasedonanoverlapgraph.HaploCliqueprovidesasourceof inspirationforSAVAGE[5],whichisthetoolforrecoveringviralhaplotypesusingoverlap graphs.SAVAGEalsotookadvantageofpaired-endreadsandmergeshortreadsusingcliques. TheauthorsbenchmarkedSAVAGEwithothervirusassemblytoolsandshowedthatSAVAGE outperformedothertoolsinacomprehensivesetofassemblymetrics.Aswefocuson denovo assemblytools,wewillbenchmarkPEHaploagainstSAVAGEandMLEHaplo. Inthiswork,wedesignedandimplementedPEHaplo,whichassemblesvirushaplotypesfrom virusmetagenomicsequencingdata.Sequenceassemblyhasbeenanintensiveresearchareaand newmethodsorimplementationsareemergingquickly.PEHaploadoptedrelevanttechniquessuch aserrorcorrectionandefoverlapconstruction.Mostimportantly,wemadecontributionsto distinguishhighlysimilarhaplotypesusingpaired-endreads.Therationalewillbedetailedin theMethodsSection.WeappliedPEHaplotobothsimulatedandrealviralquasispeciesdataand comparedtheassemblyperformancewiththerecentlypublishedtools.Theexperimentalresults showthatPEHaplocanrecoverviralhaplotypeswithlongercontigsandhigheraccuracy. 3.2Methods Inthissection,wewilldescribethegraphmodelsweuseforhaplotypeconstruction.Then wewillpresentthekeyideaofapplyingpaired-endreadstodistinguishdifferentviralhaplotypes. Finally,weshowthewholepipelineanddetaileachmaincomponent. 50 3.2.1Overlaphgraphandpaired-endgraph Anoverlapgraph G ( V ; E ) isaweighteddirectedgraphthatoverlapsbetweenreads.Each node v 2 V representsaread.Anoverlapbetweentworeadsisformedifthesufofareadmatches theofanotherread.Givenanytworeads r 1 , r 2 ,andanoverlapthreshold l ,iftheoverlap sizebetween r 1 and r 2 isgreaterthan l ,adirectededgeisaddedfromthenodesrepresenting r 1 and r 2 in G .Theedgeweightistheoverlapsize. Whileanoverlapgraphrecordstheconnectivitybetweenreads,wealsorecordthenumberof paired-endreadsbetweennodesusingapaired-endgraph PE _ G .Apaired-endgraphisaweighted undirectedgraph,whichhasthesamenodesetastheoverlapgraph.Twonodesareconnectedby anedgeifthecorrespondingreadsformareadpair.Inpractice,anodeintheoverlapgraphcan containmultiplereadsafterwecombinenodeswithoutbifurcations.Thus,multiplereadpairscan existbetweennodes.Thetotalnumberofreadpairsbetweentwonodesislabeledastheedge weightin PE _ G . InFigure3.2(B),theedgesinoverlapgraphareshownusingsolidlineswhiletheedgesinthe paired-endgraphareshownusingdashedlines.Nodesa.1anda.2formareadpairandthushave anedgeofweight1in PE _ G .Similarly,nodesd.1andd.2haveanedgeofweight1becaused.1 andd.2areareadpair. 3.2.2MutationRateforSequenceReplicationandProbabilityofLongest CommonSubstring(LCS) Differentstrainsofavirusquasispeciesusuallysharehighsequencesimilarity.Butthemuta- tions,insertionsordeletionscanhappenrandomly.Insteadofsequencesimilarity,weuselongest commonsubstring(LCS)tocharacterizethesimilaritybetweenstrains,fromwhichstrain-levelas- 51 semblyispossibleiftheLCSisshorterthanthepaired-endinsertsize.Herewewillinferthe mutationrateaccumulationbetweenaninitialstrainandits n thoffspring,thenprovideadynamic programmingalgorithmforcalculatingtheprobabilityofLCSwithlength m betweentheinitial strainanditsoffspring. Mutationrateaccumulationforsequencereplication Imaginetheenvironmentinitiallywith onlyonevirusstrain x 0 oflength L ,mistakescanhappeneachtimewhenthegenomereplicates andwillresultinanequilibriumwithmultiplestrainsofconstantabundances,whichis describedbythequasispeciestheory. Reproducingmistakescanbeinsertions,deletionsorsubstitutions.Tosimplifytheproblem, weassumeonlysubstituionscanhappenduringreplicationandthevirusgenomelengthremains unchanged.Assumetheprobabilityofmutationateachpositionwhenreplicatingisconstantand independanttoeachother,denotedas m ,weaskwhatismutationprobabilityateachlocation between x 0 andits n thoffspring x n ? Thisproblemcanbesolvedbydynamicprogramming.Let f ( i ) betheprobabilitythat x n [ i ] is differentfrom x 0 [ i ] ,and g ( i ) betheprobabilitythat x n [ i ] issameto x 0 [ i ] .Weget 8 > < > : f ( i + 1 )=( 1 f ( i )) m + f ( i )( 1 m = 3 ) ; i = 1 ; 2 ; 3 ;:::; L g ( i + 1 )= g ( i )( 1 m )+ f ( i ) m = 3 ; i = 1 ; 2 ; 3 ;:::; L (3.1) Notethat f ( 1 )= m and g ( 1 )= 1 m ,and f ( i )+ g ( i )= 1,equations3.1canbesolved 8 > < > : f ( i )=( 1 4 m 3 ) i 1 ( m 3 4 )+ 3 4 ; i = 1 ; 2 ; 3 ;:::; L g ( i )=( 1 4 m 3 ) i 1 ( 3 4 m )+ 1 4 ; i = 1 ; 2 ; 3 ;:::; L (3.2) ProbabilityofLCS Fortheinitialvirusstrain x 0 oflength L ,weaskwhatistheprobabilityof LCSwithlength m between x 0 andits n thoffspring x n ? 52 Againthisproblemcanbesolvedbydynamicprogramming. f ( i ) astheprobability that x n [ 1 toi ] has LCS m with x 0 and x n [ i ] mutated, g ( i ) astheprobabilitythat x n [ 1 toi ] has LCS m with x 0 and x n [ i ] notmutated. t astheprobabilitythat x n [ i ] isdifferent to x 0 [ i ] ,whichcanbecalculatedusingequation3.1.Nowwelookatthecasewhen m = 2 f ( i + 1 )= t f ( i )+ t g ( i ) g ( i + 1 )=( 1 t ) f ( i )+( 1 t ) 2 f ( i 1 ) m = 3 f ( i + 1 )= t f ( i )+ t g ( i ) g ( i + 1 )=( 1 t ) f ( i )+( 1 t ) 2 f ( i 1 )+( 1 t ) 3 f ( i 2 ) . . . m f ( i + 1 )= t f ( i )+ t g ( i ) g ( i + 1 )= å m 1 j = 0 f ( i j )( 1 t ) j + 1 If i m ,theLCSwillalwaysbelessorequalto m .Thus, f ( i )= t ; g ( i )= 1 t .Therefore,the recursiveequationsforcalculating f ( i ) and g ( i ) are When1 i m 8 > < > : f ( i )= t g ( i )= 1 t (3.3) When L i m + 1 53 8 > < > : f ( i )= t ( f ( i 1 )+ g ( i 1 )) g ( i )= å m 1 j = 0 f ( i j 1 )( 1 t ) j + 1 (3.4) Thetimecomplexitytocalculatetheprobabilityfor LCS = m is O ( 2 L + mL ) .Sincethesequence lengthis L , m canbeanyintegersbetween0and L .Tocalculatetheprobabilityforeach m 2 [ 0 ; L ] , thetimecomplexityis O ( 2 L 2 +( L + 1 ) L 2 2 )= O ( L 3 ) .ForHIV,itsgenomelengthisabout10 4 . Calculatingdirectlywithequations3.4forallthepossibleLCSistimeconsuming.Itcostshours tocalculatetheprobabilitiesforalltheavailableLCSwithatypicalsingle-coreCPU. Infact,theresultof g ( i ) canbeusedtocalculatefor g ( i + 1 ) ,whichwillreducethetotaltime complexityto O ( L 2 ) .Let S ( i )= f ( i )+ g ( i ) ,wewillhavetherecursiverelationship(Proofomitted) S ( i + 1 )= S ( i ) t ( 1 t ) m + 1 S ( i m 1 ) ; i m + 2(3.5) Therefore,weget S ( i )= 8 > > > > > > > > > < > > > > > > > > > : 1 ; if1 i m 1 ( 1 t ) m + 1 ; if i = m + 1 1 ( 1 + t )( 1 t ) m + 1 ; if i = m + 2 S ( i 1 ) t ( 1 t ) m + 1 S ( i m 2 ) ; if i m + 3 (3.6) ItislineartocalculatetheprobabilityforoneLCS.Thetimecomplextiyforcalculating LCS = m is O ( L ) .Tocalculateeach m 2 [ 0 ; L ] ,thetimecomplexityis O ( L 2 ) . 54 3.2.3Usepaired-endreadstodistinguishdifferenthaplotypes Thescattereddistributionofthemutationsorinsertions/deletionsbetweendifferentstrainsmake strain-levelassemblypossiblebyemployingpaired-endreadsinformation.Wequantifythe strainsimilaritywithsizesofthelongestcommonsubstring(LCS).IftheLCSsizeissmallerthan readsize,pathbecomestrivialbecausedifferentstrainscorrespondtodifferentpathsinthe overlapgraph G .Theanalysisofavailabledatashowsthatthesizeofthelongestcommonregions arelargerthantypicalreadsizebutsmallerthanfragmentsize,whichistheend-to-endsizefor areadpair.Thus,itishighlylikelythatthetwoendsofareadpaircanencompassthecommon regionsbetweendifferentstrains.Aspaired-endreadsaresequencedfromthesamefragment,they shouldbeassembledintothesamecontig.Thus,byusingthepaired-endinformation,twodifferent strainscanbedistinguishedfromeachother. Figure3.2sketchesthebasicideabehindstrain-levelassemblyusingpaired-endreads.There areonlytwomutationsbetweenthetwostrainsinthisexample.TheLCSisthecommonregion betweenthetwomutationloci.Theoverlapthresholdissetashalfofthereadsize.Theoverlap graph G andthepaired-endgraph PE _ G areconstructedaccordingly.In G ,therearefourlong paths: a : 1 ! b ! c ! e ! f ! a : 2, a : 1 ! b ! c ! e ! f ! d : 2, d : 1 ! b ! c ! e ! f ! a : 2, and d : 1 ! b ! c ! e ! f ! a : 2.Thegoalofassemblyistooutputthetwocorrectpaths(i.e. a : 1 to a : 2and d : 1to d : 2).Threetypesofinformationcanbeused.1):Coverage:ifthetwostrains havehighlydifferentcoverage,wemaydistinguishthetwostrainscorrectly.2):Enumerationof cliques.Thismethodwasrecentlyadoptedbyseveraltools[147,93,5].Forreadswithsuf coverage,readsformingcliquestendtocomefromthesamehaplotypeandthuscanbemergedas asuper-read.Thisprocesscanbeiterativelyappliedtoextendlocalhaplotypetoglobalone.3): paired-endinformation. 55 Byusingthepaired-endinformation,apathstartingwitha.1inFigure3.2willendwitha.2 becausea.1anda.2shouldbeassembledintothesamecontig.Forthesamereason,apathstarting withd.1willonlyextendtod.2.Thus,thepathwilloutputtwopaths,correctlyrepresenting thetwohaplotypes.Ofthethreetypesofinformation,paired-endinformationisthemostaccurate oneandcantolerateLCSuptothefragmentsize.Coveragedifferencehaslimitedapplications becauseitrequiresrelativelyuniformcoverageandcanonlydistinguishstrainsofhighlydifferent abundance.Cliqueenumerationcanleadto"chimeric"superreadevenwhentheLCSisonly longerthanreadsize.Inthisexample,thecliqueenumerationwillnotbeabletodistinguishthe twostrains.AsshowninFigure3.2.(B),therearevecliquesofsize4.Bymergingthereadsinto super-readsinsideeachclique,weobtainedvesuper-readsinFigure3.2.(C).Theiroverlapgraph isshowninFigure3.2.(D).Nomatterwhethercliquesofsize3areusedforiterativemerge,using cliqueswon'tbeabletodistinguishthetwostrainswithoutusingpaired-endinformation. 3.2.4ThewholepipelineofPEHaplo TherearevemajorcomponentsinthepipelineofPEHaplo:errorcorrection,strandcorrection, overlapgraphconstruction,pathandcontigcorrection.Inthepre-processstage,reads withmultiplelow-qualityorambiguousbasecallsareortrimmed.Base-callingerrorsor indelsarecorrectedfromthesetofreadsusingalignment-basederrorcorrection.Dupli- catedreadsandsubstringreadsarethenremovedfromthecorrectedreads.Second,anoverlap graphisbuiltfromthepre-processedreadsandthestrandofreadsareadjustedbytraversingthe graph.Theoutputreadswillhavethesameorientation.Thethirdstagewillbuildtheoverlapgraph againfromthestrand-adjustedreadsandutilizevariousgraphpruningmethodstoremovepossi- blerandomoverlapsandsimplifythegraphforefassembly.Inthefourthstage,paired-end guidedpathalgorithmsareappliedtoproducecontigsfromtheoverlapgraph.Finally,we 56 Figure3.2:(A)Thebottomtwolonglines(redandblack)representtwohaplotypes,whichonly differbytwomutationsattwoloci(G-CandA-G).Shortlinesrepresentreadssequencedfrom thetwostrains.Redreadsaresequencedfromredstrainwhileblackreadsaresequencedfrom blackstrain.Thereadsaresortedbytheirreadmappingpositionsagainsttheirnativestrain.a.1 anda.2areareadpairfromtheblackstrain.d.1andd.2arethereadpairfromtheredstrain. (B)Theoverlapgraphandthepaired-endgrapharecombinedinoneNodesb,c,e,andf originatefromthecommonregionofthetwostrains.Thedashedlinesrepresentpaired-endread connection.(C).Thesuper-readsgeneratedbymergingreadsinvecliquesofsize4.Eachsuper- readisnamedusingthestartingnodeandendingnode.(D).Thenewoverlapgraphgenerated usingthesuper-reads. 57 alignpaired-endreadsagainstproducedcontigstoidentifyandcorrectpotentialmis-joinerrors. 3.2.4.1Datapre-processing Errorcorrectionbasedoncoverageinformation,usuallyimproves denovo assembly.Analignment- basederrorcorrectiontoolKarect[2]isusedtocorrectsubstitution,insertion,anddeletionerrors. Besidesadoptingexistingerrorcorrectiontools,wealsoremovedreadswithverylowabundance. Onlyreadsthatareduplicatedatleast n timesintheoriginaldatasetwillbeusedfordownstream analysis.Usingaprobabilitymodel,weshowthatthismethodcanfurthererouterrorcontaining readswhilemaintainingsufreadsfor denovo assembly. Let N bethetotalreadsnumber, r bethereadlengthand L bethegenomesize,theprobability that k readsstartfromthesamelocationonthegenomecanbeestimatedbyaPoissondistribution: Pr [ k ]= l k e l k ! ; l = N = L (3.7) Pr [ k > = n ]= 1 n 1 å i = 0 l i e l i ! ; n 1(3.8) Wedenote e base betheprobabilityofsequencingerrorateachbase, N e bethenumberoferror baseforoneread,theprobabilitythatareadcontainsatleastonesequencingerroris Pr [ N e 1 ]= 1 ( 1 e base ) r .For n readsoriginatingfromthesamelocationongenome,theprobabilitythatthey haveatleastonesequencingerroronthesameplacecanbecalculatedas Pr = Pr [ N e 1 ] n = r n 1 . IntheHIVMiSeqdatasetweused(detaileddescriptioninResults), ˘ 700kerrorcorrectedreads areleftfor5HIVstrains.Thegenomelengthis ˘ 10kbp. l = 7 10 5 = ( 5 10 4 )= 14.For eachbaseonthegenome,theprobabilitythatatleast3readsstartfromitisover99.9%.Ifwe assume e base = 0 : 01 ; r = 200,theprobabilitythatareadsequenceduplicatedatleast3timesbut containsatleastonesequencingerroris Pr ( 1 0 : 99 200 ) 3 = 200 2 ˇ 1 : 62 e 5.Theresultsreveal 58 thatwhensequencingdepthisdeepenough,keepingreadswithduplicationsisabletoout errorcontainingreadsandwillnotreduceconnectivity. 3.2.4.2Overlapgraphconstruction Allreadsremainedafterpre-processingareusedtoconstructtheoverlapgraph.Astraightforward overlapdetectionmethodrequires O ( n 2 ) comparisons,whichiscomputationallyexpensivefor largesequencingdatasets.Thereareefimplementationofall-pairssufcomparison algorithmsbasedondatastructuressuchashashingtableorcompacttree[47,51].In PEHaplopipeline,weuseReadjoiner[47]tocomputeallsufmatchesamongreads forreadstrandadjustment,andthenuseApsp[51]fortheoverlapgraphconstruction. Strandcorrectionwithoverlapgraph Readsintherawdatasetcancomefromdifferent strands,andthetworeadsinareadpaircomefromtwodifferentstrands.WeuseReadjoinerto constructtheoverlapgraphandtraverseeachnodeinthegraphforstrandadjustment.Readjoiner providesasetoffastandspaceefalgorithmstocomputesufmatchesamongall pairsofreadsusingsufsortingandscanningmethods.Fortworeads r i and r j ,wedenote r 0 i and r 0 j astheirreversecomplements.Readjoinercomputesallpossibleoverlapsbetween ( r i ; r j ) , ( r 0 i ; r j ) and ( r i ; r 0 j ) ,andlabeltheoverlapswith`++',`-+'and`+-'respectivelyiftheirsuf matchesarelongerorequaltotheoverlapthreshold.Notethattransitiveedgesareremovedinthe outputbyReadjoiner. Weuseasearch(BFS)traversalmethodtoadjustthestrandofeachread.The traversalstartsatastartnode(within-degreeof0)andlabelitas`+',thenrecursivelylabels allitssuccessorsandpredecessorsbasedontheedgetype.The`++'typemeanscurrentnode anditsneighborcomefromthesamestrand,while`-+'or`+-'typesmeanadifferentstrand. Aftertraversingthewholegraph,allthereadslabeledwith`-'willbereplacedwiththeirreverse 59 complements. Withstrand-adjustedreads,weuseanothertoolApsp[51]toconstructanewoverlapgraph sincewewillneedalloverlapedgesthistime. 3.2.4.3Graphpruning TheoriginaloverlapgraphgeneratedfromtheoutputofApspisusuallyverycomplexbecause ofthelargedatasize,transitiveedges,sequencingerrors,andhighlysimilarregionssharedby haplotypes.Weapplyaniterativegraphpruningproceduretorepeatedlysimplifythegraphat eachiteration. Mergereadsincliques Weareinterestedincliquesintheoverlapgraphbecausereadswithin acliquecansharetruemutationswhilesequencingerrorsareusuallymorerandomandarenot sharedbythemajorityofreads.Therefore,cliquescanbeusedtodistinguishtruemutations fromsequencingerrors.Severalrecenthaplotypereconstructiontools[5,147]mergereadsinside cliquesassuper-readsandconductiterativelyhaplotypeextension. Wehandlecliquesdifferentlytotheseexistingtools.Insteadofdirectlymeringeachcliqueto asuper-read,wemergeagroupoflinkedcliquessimultaneously.Thelinkedcliquesisas aclusterofcliqueswhichsharecommonnodeswithatleastoneothercliqueinthecluster.Each cliquecluster,denotedas G c ,isaconnectedsubgraph.Wesimplifythissubgraphbyremoving transitiveedgesandperformingcollapseoperationsasdescribedbelowandthenconnect G c tothe originaloverlapgraphbyreconstructingconnectionstoitsstartingandendingnodes.Theedges connectedtoothercliquenodesintheoverlapgraphareremoved.Thosecommonnodessharedby cliqueswillbekeptandwefurtherapplypaired-endinformationtoidentifystrainsinpath step.Bymerginglinkedcliques,weareabletosimplifytheoverlapgraphwhileavoidresultingin "chimericsuper-read"(Figure3.2(D): ( a : 1 e ) ! ( b f ) ! ( c d : 2 ) ). 60 Removetransitiveedgesandnodecollapsing Anedge u ! v iscalledtransitiveifthereexist otherpathsfrom u to v inthegraph.Transitiveedgesareusuallyremovedingraph-basedassembly algorithmstosimplifythegraphwhilestillkeepingtheconnectivity.Weuseasearch (DFS)basedalgorithmtoremovetransitiveedgesfromtheoverlapgraph G =( V ; E ) : 1.Foreachvertex u 2 G ,startDFSfromeachofitssuccessor v . 2.Foreachvertex w thatcouldbereachedbyDFSfrom v ,removeedge u ! w iftheedgeexists. Theoverallcomplexityofthealgorithmaboveis O ( j V j ( j V j + j E j )) ,whichruns j V j DFSstoremove alltransitiveedges. Linearlyconnectednodescanbemergedwithoutlossofreachability.Aftertransitiveedge removal,theoverlapgraphtendstohavechainsoflinearlyconnectedvertices,whicharecollapsed tofurthersimplifythegraph. Removefalseedgesusingreadpairs Duetothenatureofvirusquasispecies,differenthap- lotypesofonespeciesusuallyhaveveryhighsequencesimilarity(possiblyover90%),whichcan easilycauseoverlapsbetweenreadsoriginatingfromdifferentstrains.Therefore,havingasuf matchdoesnotguaranteethatthetworeadsoriginatefromthesamevirusstrain.Wrong edgesincreasethecomplexityofgraphandmayalsoproducechimericcontigs.Weemploypaired- endinformationtoremovepotentiallywrongedges. Theoverlapcutoff l isanimportantparameter.Asmall l tendstokeepmosttrueoverlapsbut alsointroducesmorefalseconnectededges,whilelarge l islikelytoeliminatemostfalseoverlaps butcanpossiblymisstrueconnectionsforreadsfromlowlysequencedregions.InPEHaplo,we initiallychoosearelativelysmalloverlapcutoff,thenapplyasetofpaired-endbasedheuristic methodstoremovepotentialfalseedges.Thekeyideaisthatforanedgeformedbetweenreads fromdifferenthaplotypes,thesetwonodesortheirpredecessorsandsuccessorsarenotusually supportedbypairedendconnections.Thus,wewillusepaired-endconnectionasevidencetore- 61 Figure3.3:Removingfalseedgesusingpaired-endinformation.Solidlinesrepresentoverlaps betweennodesanddashedlinesrepresentthepaired-endconnectionsbetweennodes.Theoverlap edgeswithredcrosswillberemovedifinsufpaired-endinformationexistbetweentheir ends. movefalseedges.,wefocusonthreecasesasshowninFigure3.3,andremoveedges withoutpaired-endinformationsupport.Toaidtheexplanation,weintroducethefollowingterms fortheoverlapgraphs.Let u ! v beanedgebetweennodes u and v .Thus, u isthepredecessorof v and v isthesuccessorof u .Anodecanhavemultiplepredecessorsorsuccessors.Letfunction succ ( u ) representthesetofallsuccessorsof u andlet pred ( u ) bethesetofallpredecessorsof u . Weexamineedgesinthefollowingcases. 1.Foranedge u ! v ,ifout-degreeof u islargerthan1andin-degreeof v islargerthan1,check paired-endconnectionsbetween u and v , pred ( u ) and v , u and succ ( v ) .Ifnopaired-endconnec- tionsbetweenthesenodesandtheoverlapbetween u ; v islessthanauseredgeremoval cutoff l 1 ( l 1 l ),removeedge u ! v (Figure3.3(A)). 2.Foranode u within-degreebeing1andout-degreelargerthan1,wecheckthepaired-end connectionsbetween u andallnodesin succ ( u ) .Foranedge u ! v ,weremovethisedgeifthe followingthreeconditionsaremet.1)Thereisnopaired-endconnectionbetween u and v .2) Thereexistsanode v 0 2 succ ( u ) ,( v 0 6 = v ),andthereispaired-endconnectionbetween u and v 0 .3) 62 Thein-degreeof v islargerthan1.Intuitively,ifanode u hasalargenumberofout-goingedges, itishighlypossiblethatsomeofthemareonlyrandomconnections.Thus,weonlykeeptheones withpaired-endsupports(Figure3.3(B)). 3.Similartocase2,ifanodehasalargenumberofincomingedgesbutonlyhasoneout-going edge,weremovetheoneswithoutpaired-endsupport.Foranode u without-degreebeing1and in-degreelargerthan1,weexaminethepaired-endconnectionsbetweenallnodesin pred ( u ) and u .Foranedge w ! u ,weremovethisedgeifthefollowingconditionsaremet.1)Thereisno paired-endconnectionbetween w and u .2)Therearepaired-endconnectionsbetweenotherpre- decessornodesand u .3)Theout-degreeof w islargerthan1(Figure3.3(C)). 3.2.4.4Paired-endguidedpath Withsufcoverage,virushaplotypescanberecoveredbypathsfromthegraph.Al- thoughthegraphhasbeengreatlywithprevioussteps,thehighsequencesimilarity betweenvirushaplotypescanstillresultincomplexoverlapgraphswithalotofbifurcations.To recovermoreaccuratecontigs,paired-endinformationofreadsarewidelyusedinmanyassembly toolsforguidingthecreationofcontigs[169,163,93]orscaffolds.Therationaleisthattwoends ofareadpairaresequencedfromtwoendsofafragment,thusshouldbeassembledinthesame contigorscaffold.Paired-endinformationcanguidethepathextensiontogoovercommonre- gionssharedbytwogenomessincethelengthoffragmentisusuallymuchlongerthanthelength ofreads(Figure3.4). Inourmethod,weuseaDFS-basedpathextensionalgorithmandselecttherightnodeto appendtothepathwithadecisiontreeateachbifurcation.Pathstartsatastartnode(in- degreeof0)andterminatesatanendnode(out-degreeof0)orwhenmeetingavisitednode.At 63 Figure3.4:Paired-endinformationguidethepathextensiontogooverbifurcationnodes.With nodescomingfromtwostrains a and b inthethosenodesbelongingtothesamestrain canbecorrectlycombinedtoonepathbasedonthepaired-endconnections.Solidlinesrepresent overlapsbetweennodesanddashedlinesrepresentthepaired-endconnections. 64 Figure3.5:Pathprotocol.Pathsareoutputtedwhenmeetinganendnodeoravisitednode. eachvertexwithmultiplesuccessors,wetransformtheavailablepaired-endandoverlapinforma- tionasfeaturesanddecidewhichsuccessortoappendtothecurrentpathwithapre-traineddecision tree.ThewchartofpathalgorithmisshowninFigure3.5.Thepaired-endconnections betweennodescanbeefaccessedfromtheconstructedpaired-endgraph PE _ G . Featuredesignanddecisiontree Tothecorrectpathfromtheoverlap graph,weneedtoselecttherightnodeeachtimeweextendthepath.Inparticular,whenanode hasmultiplesuccessors,achoiceneedstobemadeforpathextension.Ingeneral,wechoose asuccessorwiththemostnumberofpaired-endconnectionswiththenodesinthecurrentpath. However,differenttypesofpaired-endconnectionsshouldbetreateddifferentlyindistinguishing haplotypes.Inparticular,weneedtodifferentiatesingle-insingle-outnodes(SISO)fromother nodes.SISOnodeshavein-degreeof1andout-degreeof1.Wehavehighthatthese 65 Figure3.6:Inthisexampleasshowninthetheendingnode p n ofcurrentpathhastwo successors v 1 and v 2 .Thesolidlinesareoverlapsanddashedlinesarepaired-endconnections betweennodes.Fivefeaturesarecomputedtochoosetherightsuccessortoextendthepath.These featuresarecomputedaspaired-endgraphedgeweightsbetweennodesincurrentpathandnodes associatedwiththesuccessor.Asfor v 1 ,theSISOscoreiscalculatedbetweenSISOnodesand v 1 ; planscorebetweenSISOnodesandplannodes;PE_planscorebetweenSISOnodesandPE_plan nodes;pathscorebetweenpathnodesand v 1 ;path_planscorebetweenpathnodesandplannodes. 66 nodesbelongtoonehaplotypeandthusanypaired-endconnectionincidenttoSISOnodescan recruitnodesbelongingtothesamehaplotype.Onthecontrary,thenodesoriginatingfromcom- monregionsoftwoormorehaplotypesarenotSISOnodes.Paired-endconnectionsbetween thosenodescannotprovideusefulguidanceinpathInourfeaturedesign,wedistinguish paired-endconnectionsinvolvingSISOnodesandothernodes. Let PE _ G bethepaired-endgraph. Path = f p 1 ; p 2 ;:::; p n g bethecurrentpath.Theending node p n inthecurrentpathhasmultiplesuccessors.Foreachsuccessor v of p n ,wecomputethe scoresofthefollowing5featuresfornode v .Foreachfeature,weuse N feature torecoredthe numberofsuccessorswithscorevaluegreaterthan0.Thefeatureswillthenbeusedtomakea choiceforpathextension.Tobetterillustratehowthesefeaturesarecalculated,wegiveanexample asshowninFigure3.6. 1.SISOscore:wecalculatetheSISOscoreasthesummationofedgeweightsinpaired-end graph PE _ G betweenSISOnodesinthecurrentpath Path and v .Notethatedgeweightin PE _ G isthenumberofpaired-endreadsbetweentwonodes.Weuse N SISO todenotethenumberofsuc- cessorsthathaveaSISOscoregreaterthan0. 2.Planscore:similartoSISOscore,wecalculatetheplanscoreasthesummationof PE _ G edgeweightsbetweenSISOnodesandtheplannodes.Theplannodesareasfollows. Startingfrom v , v 'ssuccessorasplannodeif v hasonlyonesuccessor.Repeatingthispro- cedureuntilreachinganodewithmorethanonesuccessororout-degreeof0.Planscoreconsiders thepaired-endreadconnectionsbetweenallpotentialSISOnodesinsideapath,includingthechil- drennodesof p n . N plan denotesthenumberofsuccessorsthathaveaplanscoregreaterthan0. 3.PE_planscore: v 'sadjacentnodesin PE _ G thatarenotinthepath Path areas 67 Figure3.7:Decisiontreetoselecttherightnodeforextensionbasedonpaired-endscorefeatures. If N feature = 1,weaddtheonlynodewithscoregreaterthan0tothepath.If N feature > 1,weselect thenodewithmaximumscorevalue.Otherwise,lookatthenextfeature.Ifallthe N feature values are0,wewillselectthesuccessorwithsimilarreadscoveragetothepath. PE_plannodes.ThePE_planscoreiscalculatedasthesummationof PE _ G edgeweightsbetween SISOnodesinpath Path andthePE_plannodes. N PE _ plan denotesthenumberofsuccessorsthat haveaPE_planscoregreaterthan0. WhiletheabovethreefeaturesarefocusedonSISOnodes,thefollowingfeaturesextendto paired-endconnectionsincidenttoothertypesofnodes. 4.Pathscore:thepathnodesareallthenodesinthepath Path except p n if n > 1or p 1 if n = 1. Pathscoreiscalculatedasthesummationof PE _ G edgeweightsbetweenpathnodesand v . N path denotesthenumberofsuccessorswithpath_scoregreaterthan0. 5.Path_planscore:thesummationof PE _ G edgeweightsbetweenpathnodesand v 'splan nodes. N path _ plan denotesthenumberofsuccessorswithpath_planscoregreaterthan0. Withthe5featuresshownabove,wedesignthedecisiontreeasshowninFigure3.7toclassify therightnodeforpathextension. 68 Figure3.8:Readpairmappingonamisjoinedcontig.Thecontigisshownasthelong baratthebottom,whichismisjoinedwithtwosequencesfromstrains a and b .Theredandblue linesrepresenttwoendsofareadpair.Fewreadpairswillgoacrossthemisjoinedlocation,thus revealingavalleyinthealignedreadswhichcanbeusedtosplitthecontig. 3.2.4.5Correctingcontigswithpaired-endreaddistribution OurmethodsusuallygeneratelongcontigsafterpathWhilepaired-endguidedpath methodcangooverthecommonregionsthatareshortthanfragmentlength,itislimitedwhentwo strainshaveaLCSlongerthantheinsertsizeofreadpairs.Tofurtherimprovethequalityof assembledcontigs,weapplyacontigcorrectionmethodsimilartothetoolPECC[81].Withthe contigsgeneratedafterpathwealigntherawreadstothemandsplitcontigsfromthe locationswithlowreadpairscoverage(Figure3.8). 3.3Results WehavedevelopedandimplementedPEHaploa denovo virusquasispeciesassemblymethod basedonpaired-endguidedpathonoverlapgraph.Toevaluatetheperformanceofour method,weappliedPEHaploonbothsimulatedHIVquasispeciesdatasetsandHIVillumina MiSeqsequencingdataset.Bothofthesimulateddataandrealdataweregeneratedfromamixture ofvewell-studiedHIV-1strains(HXB2,JRCSF,89.6,NL43andYU2).Thesestrainshave pairwisesequencesimilaritiesfrom91.8%to97.4%(Table3.1)andalongestcommonsubstring 69 of427bpbetweenHXB2andNL43(Table3.2).WechooseHIVdatasetbecauseHIV-1hashigh intra-patientgeneticdiversity.Tofurtherevaluatetheefyofthetool,wealsotestedPEHaplo onarealMiSeqsequencingdata. Asthehaplotypesequencesandcompositionsareknowninthesedatasets,weareabletoeval- uatetheperformanceofourmethodswiththeassembledcontigs.Wealsocomparedtheproduced resultstorecentlypublished denovo assemblytoolsIVA[57],MLEHaplo[93]andSAVAGE[5]. Table3.1:Pairwisesequencesimilaritybetween5HIV-1strains. 89.6HXB2JRCSFNL43YU2 89.693.991.893.593.6 HXB292.8 97.4 95.2 JRCSF92.692.9 NL4394.9 YU2 Table3.2:Longestcommonsubstring(LCS)between5HIV-1strains.Thesestrainshavesimilar lengthsofabout10kbp. 89.6HXB2JRCSFNL43YU2 89.6195201164234 HXB2180 427 216 JRCSF157201 NL43185 YU2 3.3.1LCSprobabilitysimulation ToestimatetheLCSbetweentwostrainswithinaquasispecies,wecalculatedtheprobabilitydistri- butionforLCSbetweentwostrainsthatare n generationsapartfromequation3.2andequation3.6. Letthemutationrate m betweentwogenerationsbe3e-5,andthegenomelength L be10,000.The probabilitydistributionforLCSsbetweentwostrainsisshowninFigure3.9.Theshows thatasthemutationrateincreases,theLCSlengthdecreases. 70 Figure3.9:ProbabilitydistributionforLCSsbetweentwostrainsthatare50,100,200,and500 generationsapart.Thex-axisisthelengthofLCS,whichrangesfrom0to10,000.They-axisis thecorrespondingprobabilityforeachLCS. 3.3.2BenchmarkonsimulatedHIVdataset WeappliedPEHaploonasimulatedvirusquasispeciesdataset.WeusedART-illumina[56] tosimulate1.9e+5paired-end,250bperror-containingMiSeqreadsfromtheveHIV-1strains withaveragefragmentlengthof600bpandtotalcoverageof5000x.Abasedpowerlaw equation[9]wasusedtosimulatethecoveragedistributionamongvestrains: C i = bf a i ,where C i and f i denotethecoverageandofstrain i ,respectively.Thecoveragesforeachstraininthe simulateddatasetare:89.6-2190x,HXB2-1095x,JRCSF-730x,NL43-547x,YU2-438x. FollowingthePEHaplopipeline,weperformederrorcorrectionandduplicatedsequences removalontherawsimulateddataset.With1.9e+5errorcorrectedreads,48,833readswere keptafterremovingduplicates.Wethenonlykeptthosereadsthatareduplicatedatleast3times intherawdata,furtherreducingthereadsnumberto26,961.Afteradjustingreadsorientation, 71 weconstructedtheoverlapgraphwithtoolApsp.Theoriginaloverlapgraphhas26,961nodes and977,570edges,aftermerginglinkedcliques,removingtransitiveedges,collapsingnodesand removingbadedges,125nodesand166edgeswereleft.Wethenrecoveredpathsandgenerated contigsfromthisgraphbasedonpaired-endinformation. PEHaplogenerated14contigsfromthesimulateddataset.Thegeneratedcontigswerealigned andevaluatedtovereferencegenomeswithMetaQuast[100]andtheresultsaresummarizedin Table3.3.Thecontigsareabletocoverover97%onthe5virusstrains,withaN50of7170bp. Thelargestcontighasalengthof9667bp,whichcanalmostcoverawholeHIVstrain.Meanwhile, thesecontigshavelowmismatchandindelrates. WealsoassembledthesimulatedreadswithbenchmarktoolsIVA,MLEHaploandSAVAGE andsummarizetheirresultsinTable3.3.Withthedefaultparameters,IVAproducedasingle,long contigfromtheerrorcorrectedreads.Thislongcontighasalengthof13,434bpandcancoverthe wholegenomeof89.6strainandabout43%oftheHXB2strain.TheresultsofIVArevealthatit tendstogenerateoneconsensusgenomesequencecorrespondingtothehaplotypewiththehighest coverage.Otherstrainsarelargelymissed,indicatingthatitmaynotforvirusquasispecies assembly.Usingk-mersizeof55,MLEHaploproduced205contigsthatcover78%oftheve HIV-1strains.Thecontigsitproducedarequitefragmented,withalowN50valueof671bpand largestcontigof1,716bp.Followingtheguidancefromthetutorial,wesettheoverlapcutoffas 190bptorunSAVAGE.Itproduces43contigsthatcoverover99%ofthereferencegenomes,with aN50above2,000bp,andlargestcontigsof9,594bp. ComparingtoIVAandMLEHaplo,PEHaploisabletoproducelongercontigswithlessmis- matchesandindelsonthesimulatedHIV-1vestrainsdataset.SAVAGEproducedmorecontigs thatcouldcoveralmostallthevestrains.However,thesecontigshaveamuchlowerN50value (2,228bp)thanPEHaplo(7,170bp). 72 Table3.3:AssemblyresultsonsimulatedHIVdatasetforIVA,MLEHaplo,SAVAGEandPE- Haplo.Contigsthatareatleast500bparealignedtothetruehaplotypesequenceswithasimilarity cutoffof98%.TheN50valueisthemaximallengththatallcontigsofatleastthislengthcoverat leasthalfofthetotalassemblylength. Tools#contigsN50 Genomes covered(%) Unaligned length Mismatches (%) Indels (%) IVA113,43428.7 0 0.8090.051 MLEHaplo20567178.081,1250.5420.008 SAVAGE432228 99.4 52850.0190.002 PEHaplo14717097.8 00.0150 3.3.2.1Paired-endguidedpathisabletogenerateaccuratelongcontigs Inouralgorithms,wehaveappliedmultiplemethodstosimplifytheoverlapgraphandcollapse nodesbeforepathWhilethesestepsgreatlyreducethecomplexityofthegraph,theycan- notdistinguishdifferentstrainsthatsharelongcommonregions.Thus,thepathalgorithm basedonpaired-endconnectionsplayacrucialroleforproducinghighqualitylongcontigsfrom theoverlapgraph. Toevaluatetheperformanceofourpathalgorithm,weassembledthesequencesinthe overlapgraphbeforepathwiththepopularassemblytoolsIDBA-UD[115]andRayMeta [14].Wethenalignedthegeneratedcontigstoreferencegenomesandcomparedwiththeresults producedbyPEHaplo.TheresultsrevealthatthosecontigsassembledbyIDBA-UDandRayMeta fromthereducedoverlapgrapharefragmentedandmaycontainmanymisjoinedsegments:they coverlargeproportionsofthevereferencegenomes,butwithhighmismatchandindelratesand theiraveragelengthsaremuchshorterthanPEHaplo.Theexperimentshowsthatthepaired-end guidedpathalgorithminPEHaploisabletocorrectlyrecoverlonghaplotypesequences fromthevirusquasispeciessequencingdata. 73 3.3.3BenchmarkonMiSeqdataset Tofurtherassesstheperformanceofassemblymethods,weappliedPEHaploonarealHIVquasis- peciesdataset(SRR961514)sequencedfromthemixofthesameveHIV-1strainsasdescribed abovewithIlluminMiSeqsequencingtechnology[39].Thisdatasetcontains714,994pairs(2x250 bp)ofreadsthatcoverthevestrainsto20,000x. WeusedPEHaplotoperformsimilarprocessingproceduresontherealHIVquasispeciesdata. With774,044anderrorcorrectedreads,98,947readswerekeptafterremovingduplicates andsubstrings.Sincetherawdatasethasextremelyhighcoverageonthevestrains,westillkept thosereadsthatareduplicatedatleastthreetimesintherawdataset.Afterthesepre-processing procedures,26,691readswerekeptforstrandadjustmentandassembly. PEHaploproduced33contigsfromtherealMiSeqHIVdatasetthatcancoverover92%of theveHIV-1strains.ThesecontigshaveaN50valueabout2,500bpandthelongestcontigis 9108bp.TheresultsaresummarizedinTable3.4.ComparedtosimulatedHIVdataset,PEHaplo hasgeneratedmorecontigsbutwithalowerN50valueandhighermismatchesandindelsonthe realdataset.WenoticethattherealHIVdatasetcontainsmoresequencingerrorsandhasamore variableinsertsizethanthesimulateddataset. WeagaincomparedtheperformanceofPEHaplowithIVA,MLEHaploandSAVAGE.IVA generated10contigsthatcoverabout20%ofthevestrains.Thesecontigsstillcoverlonger partsonhaplotypeswithhighersequencingcoverage.Buttheyspreadtofourstrainsthistime, likelybecausethevestrainshaveclosesequencingcoveragesintherealdataset.Withthe sameparametersasabove,MLEHaploproduced234contigsthatcancoverover53%oftheve genomeswithsimilarmismatchandindelratestothesimulateddataset.Strikingly,itgenerated muchlongercontigsontherealdata,withaN50valueof6,501bpandthelargestcontigof8,470 74 bp.However,thesecontigscontainmanymisjoinedsegments.Over150contigswithtotallength of787,272cannotaligntoanyreferencegenomes.SincetheSAVAGEpaper[5]hasshowntheir resultsonthesamedataset,weusethemetricsintheirliteratureforevaluation.Fromtheirresults, SAVAGEproduced482contigsthatcoverover90%ofthereferencegenomes,withaN50of1,062 bp,andlargestcontigof4,256bp(Table3.4). OntherealHIVdataset,PEHaplocanstillproducelongercontigswithfewermismatchesand indelsthanallthreebenchmarkedtools.Overall,PEHaploisabletoassembleabunchofreads thataresequencedfrommultiplevirusstrainssharinghighsimilarities,generatelong,highquality sequencesandrecovermostofthetargethaplotypes.Thetoolconsumeslessrunningtimewhile stillproduceshighqualitycontigs.Comparedtootherstate-of-the-artmethods,ourtoolusually producesfewerbutlongercontigs.In3.10,weshowthecontigsalignmentresultonHXB2 strainforPEHaploandSAVAGE.ThosecontigswerealignedtoHXB2strainwithasimilarity cutoffof98%byMetaQuastevaluationtool.Thisshowsthatwhilebothtoolsproduced contigsthatcoverthevirusgenometoasimilarproportion,PEHaplowasabletogeneratelessbut longercontigs. Table3.4:AssemblyresultsonrealHIVMiSeqdatasetforIVA,MLEHaplo,SAVAGEandPE- Haplo.ContigsareevaluatedwithMetaQuastusingthesameparameterstosimulateddataset. Tools#contigsN50 Genomes covered(%) Unaligned length Mismatches (%) Indels (%) IVA10115020.111500.6600.052 MLEHaplo234650153.6786,2720.588 0.035 SAVAGE482106290.5 0 0.1470.048 PEHaplo332553 92.400.117 0.049 75 Figure3.10:ContigsalignmentresultonHXB2strainforPEHaploandSAVAGE.Thesecontigs wereproducedfromtherealHIVMiSeqdataandalignedtoreferencegenomewithMetaQuast. Thex-axisisthecoordinationsofHXB2genome,withtheregionscoveredbycontigsingreenand othersinblack.They-axisrepresentsthenumberofcontigs,withcontignameslistedontheleft panel.Oneachcontig,thegreennumberattheleftisthestartingcoordinateofthealignedcontig andthenumberinsideoftheparenthesisshowsthestartingcoordinateonthereferencegenome, theblackvalueattherightistheendingcoordinateofthealignedcontigandthenumberinsideof theparenthesisshowstheendingcoordinateonthereference. 76 3.3.4BechmarkonsimulatedbiasedHIVdatasets Toevaluatetheperformanceofourmethodsonassemblinglowabundancestrains,weusedHIV strainsHXB2andNL43tosimulatethreegroupsofdatasetswithextremlybiasedcoverages betweenthem.Thetotalcoverageforeachgroup1000x,withHXB2-900x,NL43-100x;HXB2- 950x,NL43-50x;andHXB2-990x,NL43-10xforeachgroup,respectively.Thesedatasetswere alsosimulatedbyART-illuminawithMiSeqplatform. WiththesimilarprocessingproceduresonHIV5strainsdata,weusedPEHaplotoassemble contigsfromthesedatasetsandcomparedtheresultswithSAVAGE,whichworksmuchbetter thanMLEHaploandIVA.TheresultsareshowninTable3.5. Table3.5:AssemblyresultsonsimulatedbiasedHXB2-NL43MiSeqdatasetfoSAVAGEPE- Haplo.ContigsareevaluatedwithMetaQuast. HXB2-NL43 coveragesTools#contigsN50 Genomes covered(%) Unaligned length Mismatches (%) Indels (%) 900:100 SAVAGE7250046.765810.0220 PEHaplo11816380.2600.0380 950:50 SAVAGE8803246.76181700 PEHaplo1947046.7600.0330.011 990:10 SAVAGE13213046.7515900.0220 PEHaplo1950948.95000 Theresultsrevealthatbothtoolsfailedtoassembletherarestrainwith5%or1%abundance. However,PEHaplowasabletobetterassemblethedominantstrainwithproducingonelongcontig. Inaddition,whentherarestrainconstituted10%(100x)ofthetotalcoverage,PEHaplocould partiallyassembleit,whileSAVAGEonlyassembledthedominantone. 77 3.3.5Bechmarkondataset InadditionofHIVdata,wealsoappliedPEHaploonarealH1N1dataset(SRR1766219) sequencedfromthemixofawildtype(99%)andamutanttype(1%).Thisdatasetissequenced withIlluminaMiSeqsequencingtechnology,containing646,879pairs(2x250)ofreadsthatcover thetwostrainsto23,000x.ThemutanttypecarriestwosilentmutationsintheM1ORF(C354T andA645T,segment7). Weperformsimilarpre-processingonthedata.With851,988anderror correctedreads,27,888readswerekeptafterremovingduplicatesandsubstrings.Westillkept thosereadsthatduplicateatleastthreetimesintherawdataset.Afterpre-processing,11,940 readswerekeptforstrandadjustmentandassembly. PEHaploproduced10contigsfromtheMiSeqdata,with8contigscoverover99%of the8segmentsofgenomeand2contigsunaligned.Againwecomparedtheassembled resultswithSAVAGE.TheresultsaresummarizedinTable3.6.Ondata,SAVAGEhas produced220contigswithaN50valueof620bp.Thesecontigsareabletocoverover96%ofthe genome.TheresultsshowthatPEHaploworksmuchbetterthanSAVAGEon dataasitsuccessfullyassembledallthe8segments. Table3.6:AssemblyresultsonMiSeqdatasetforSAVAGEandPEHaplo.Contigsare evaluatedwithMetaQuastusingthesameparameterstoHIVdatasets. Tools#contigsN50 Genomes covered(%) Unaligned length Mismatches (%) Indels (%) SAVAGE22062096.3383030.8180.046 PEHaplo10179099.512700.8360.007 78 3.3.6Computationaltimeandmemoryusage Toevaluatethecomputationalcostofourtool,wealsocomparetherunningtimeandmemory usagewiththethreebenchmarktools.PEHaploused43minuteswithapeakmemoryusageof2.9 GBonsimulateddataand19minuteswithapeakmemoryusageof1.3GBonrealdataset.IVA runsveryfast:ittookabout17minutesonbothsimulatedandrealdataset,withpeakmemory usageabout2.6GB.MLEHaploandSAVAGEaremuchslower.MLEHaploconsumedover10 hourswithpeakmemoryusageof6.4GBonsimulateddatasetandabout4hourswithpeak memoryusageof2.4GBonrealdataset.SAVAGEused54minutesforsimulateddataand1331 minutesforrealdata.Itisabletokeepalowmemoryusageof0.5GBasshownintheirpaper[5]. 3.4DiscussionandConclusion Longreadswillreducethecomplexityof denovo assemblyofgeneralmetagenomicdata,including virusquasispeciesdata.Forpaired-endreads,onemayconsidertocombineshortreadsintoa relativelylongerreadbeforeconductingassembly.Weactuallyappliedexistingreadjoiningtools forthispurpose.However,joiningreadsisnotatrivialproblemastheoverlappingpartofthe readpairsmaynotalwaysbeidentical.Thus,existingmethodsofjoiningtwoendsmayintroduce errors.Inaddition,mergingpaired-endreadswilldiscardthepaired-endinformationforguiding thepathprocess.Asaresult,theexperimentalresultsusingPEAR[167]andotherend mergingtoolsshowinferiorperformance.Thuswedidnotincludethatstepinourpipeline. Thethird-generationsequencingplatformssuchasPacBiocanproduceverylongreads,which cancoverthewholelengthofviralgenomes.However,thehighsequencingerrorrate(about 10%)andthelowerthroughputthanIlluminastillhampertheirwideapplicationformetage- nomicsequencing.Theadvantagesandlimitationsofapplyingcurrentlongreadstechnologiesfor 79 virushaplotypesreconstructionarediscussedinBAsE-Seq[53].Withtheincreasedreadquality, longreadsequencingtechnologieswillgreatlysimplifytheassemblymethodsformetagenomic data[39].However,atthismoment,virushaplotypereconstructionusingshortreadsisstillneeded. Ifthedistributionoffragmentsizeisknown,wecanfurtherimprovetheaccuracyofpath ingandfalseedgeremoval.Forexample,withknownfragmentsize,wecanaccuratelycompute howfarweshouldexaminethesuccessorsorpredecessorsforfalseedgeremoval.Currently,for computationalefy,wedidnotincorporatefragmentsizedistributioninthesetwosteps. Ourmethodcanbeextendedtometagenomicdataifthememberspecies'genomeshavecom- monregionswithlengthsmallerthanfragmentsize.However,ouranalysishasshownthatmany genesinmetagenomicdatacanhaveLCSsizesmuchgreaterthantypicalfragmentsize.Forthose metagenomicdata,largeinsertsizesshouldbechosenforthesequencingprotocol. Inconclusion,wepresenteda denovo virushaplotypereconstructiontoolforviralquasis- pecies.Weappliedittobothsimulatedandrealquasispeciesdataandachievedbetterresultsthan severalbenchmarkedtools. 80 Chapter4 Alignmentwindowsbasedviralstrain-level contigsbinning 4.1Introduction Aftercontigsofdifferentviralstrainsbeingassembledfromviralquasispeciesdata,itisstillun- knownthathowmanyviralhaplotypesarethereinthequasispeciesandwhataretherecorrespond- ingabundances.Therefore,anothercrucialstepinrecoveringviralhaplotypesfrommetagenomic dataistheestimationofnumberofviralhaplotypesandofcontigsassembledinto differentgroups,whichisoftenreferredtoasbinning.Thegeneralbinningformicrobialsamples isasfurtherinvestigatingthetaxonomicstructureofcontigsandclustering/classifying themintooperationaltaxonomicgroups.Forviralquasispeciesassembly,thesegroupsrepresent compositestrainsofanindividualviralspeciesthatcompriseaviralquasispecies. Manybinningmethodsexisttobinassembledcontigsfrommetagenomicdata[160,88].These methodsusuallyestimatethebinnumberbyaligningmetagenomicdatatoapre-establishedmarker genedatabase,andthenassignassembledcontigstodifferentbinsusingsequencecomposition informationandreadcoveragelevels.Forexample,MaxBin[160]usesbothtetranucleotidefre- quenciesandcontigcoveragelevelstoassignassembledcontigsintodifferentbins. Inaddition,therearealsomethodsofbinningcontigsusingthecoverageofthecontigs acrossmultiplemetagenomicsamples.Theideaisthatiftwocontigsarefromthesamebin,their 81 coverageacrossmultiplesamplesshouldbehighlycorrelated.Forexample,COCACOLA [88]incorporatessequencecompositionandreadcoverageacrossmultiplesamplesforbinning. Whilethesebinningtoolsexist,binningofcontigsinaviralquasispecieshasitsuniquechal- lenges,andthoseexistingtoolsareeithernotapplicableordonotperformwell.(1)Viralhaplo- typesinaquasispeciesbelongtothesamespecies,andtheysharethesamemarkergene.Aligning thereadstomarkergenedatabasewillonlyidentifyonespeciesbutnotthenumberofhaplotypes. (2)Viralhaplotypesinaquasispceisusuallysharehighsequencesimilarity(mayover90%).The sequencecompositionsfordifferenthaplotypesarethusverysimilar,andcanhardlybeusedto differentiatethem.3)Theremaynotbemultiplesamplesforaviralquasispecies. HerewepresentVirBin,amethoddesignedforbinningcontigsassembledfroma viralquasispeciesdata.Ittakesadvantageofcontigsalignmentandrelativecontigabundancesin alignedwindowstoaccuratelyestimatethenumberofhaplotypesandclustercontigsintodiffer- entgroups.Thismethodworkswithasinglesequencingsampleanddoesnotrequirereference genomes.VirBinwasappliedontwosimulateddatasetsandarealdataset,andbenchmarkedwith therecentapproachMaxBin.TheresultsshowthatVirBinrevealssuperiorityintermsofboth precisionandrecall. 4.2Methods 4.2.1Problem Aviralquasispeciesiscomposedofasetofhighlysimilarviralstrainswithdifferentabundance levels.Withassembledcontigsfromaquasispceis,theobjectiveisto(1)estimatethenumberof haplotypesinthequasispecies;(2)clustercontigsintogroupssothatcontigsfromthesameviral strainwillbegroupedtogether;(3)calculatetheabundanceforeachviralstraininquasispecies. 82 Mathematically,theproblemcanbeas:Givencontigs C 0 ; C 1 ;:::; C n fromaviralquasis- pecies,thegoalistoestimatethenumberofhaplotypes N ,clusterthecontigsinto N + 1groups (onemoregroupassothatcontigsfromthesamestrainwillbeinthesamegroup,and calculatetheabundanceforeachhaplotype. Sinceviralstrainsinaquasispeciesbelongtothesamespeciesarehighlysimilartoeachother, aligningthemtomarkergenesusuallyonlythespecies.Therefore,canonicalbinning methodsbasedonmarkergenesdonotapplyforviralquaspecies.However,takingadvantage ofthehighsequencesimilaritiesbetweenhaplotypes,contigsfromdifferentstrainsbutderived fromthesamegenomiclocationscanbealignedtogether.Withthealignmentofcontigs,VirBin usesawindow-basedmethodtoestimatethehaplotypenumberandmoreaccuratelycalculatethe relativeabundanceforeachcontig.Thewindowisasacontinuousregioninthealignment withthesamenumberofcontigs.Thismethodbeginswithwindowsfromcontigs alignment.Thenhigh-qualitywindows,whichhaveahighprobabilityofcontainingcontigsfrom allhaplotypes,areTheconsensusheights(numberofcontigs h )inthesewindowsare usedtoestimatethenumberofhaplotypes N .Foreachwindow,therelativeabundancelevelsfor contigscanbecalculatedfromthereadsalignment 4.2.2TheVirBinalgorithmoverview TheoverallpipelineofourmethodisshowninFigure4.1.Therearemainlytwosteps:(1)align contigsandidentifywindows;(2)calculaterelativeabundancesineachwindowandapplyan expectation-maximizationmethodtoclusterthecontigs. 83 Figure4.1:ThewwofVirBin. 4.2.3Estimatehaplotypenumberbycontigsalignmentandwindowsextrac- tion Inthestep,contigsarealignedwitheachotherusingblast[20].Eachtimeacontig C i is chosenasareferencecontig,andallothercontigsarealignedagainstthereferencetogeneratean alignmentsimilartomultiplesequencealignment.Althoughblastperformslocalalignment, therawalignmentresultswerewheretwotypesofalignmentrelationshipsareallowed betweeneverytwocontigs.Oneisoverlap,whereonecontig'ssufalignwiththeotherone's (Figure4.2(A)).Anotheroneisinclusion,whereonewholecontigalignstoasubstring ofanothercontig(Figure4.2(B)).Inthisway,randomalignmentsoralignmentbetweenrepeated sequencescanbeavoided.Thenstartingfrom5'endof C i ,windowsareasthecontinuous regionswiththesamenumberofcontigs.Thealgorithmforidentifyingwindowsfromcontigs alignmentisshowninAlgorithm3. Ideally,ifthecontigsarecompleteandallalignmentsbetweenthemarecorrect,theheightof 84 Figure4.2:Twokindsofalignmentareallowedbetweentwocontigsto.(A)Overlap.(B)Inclu- sion Algorithm3 Windowsfromcontigsalignment Function: Alignedcontigsdepth Height ( T ) Input: Contigsalignmentresultonareferencecontig C ,withindexfrom1to L ; Output: Windowsoncontig C ,alistoftuples 1: i 1 2: j 2 3: start 1 4: end 0 5: while i L do 6: if i==LorHeight[i]!=Height[j] then 7: end i 8: Windows+=[start,end] 9: start j 10: i ++ 11: j ++ 12: endif 13: endwhile 85 eachwindow( h )canbeusedtorepresentthehaplotypenumberinthequasispecies.Inpractice,the contigsmaynotcovereveryregionofhaplotypesandtheremayexistchimericcontigs.Therefore, weusetheconsensuswindowheightasthenumberofhaplotypes. 4.2.4Expectation-Maximizationmethodforbinningcontigs Therawreadscanbealignedtocontigstocalculatetheirabundancelevels.However,dueto thepotentialincompletenessoftheassembledcontigsandhighsequencesimilaritybetweenviral haplotypes,directlycalculatingcontigabundancesisnotaccurate.Becausereadsfromstrains withmissingcontigsmaybemappedtocontigsfromotherstrains,whichwilleventuallylead toinaccurateestimationofcontigabundances.Withwindows,denotetheregionofa contiginsideofawindowas"sub-contig" c .Thealignedreadsoneach sub-contigcanbeusedtocalculatethe"relative"abundancesforeachofthemwithinawindow, whichisamoreaccurateestimationforthecontigabundances.Theaveragerelativeabundance (denoteas¯ c )forasub-contig c i inawindowiscalculatedas ¯ c i = S ( c i ) å h j = 1 S ( c j ) (4.1) where S ( c i ) isthesummationofreadscoverageonsub-contig c i .Therationaleofthismethodis thatwhiletheabundanceestimationstillmaynotbeaccurateinwindowswithmissingcontigs,we canhavemoreaccurateabundanceestimationforwindowswithcorrectnumberofcontigs.Using therelativeabundancesinthese"correct"windowsforclustering,wecanhavemoreaccurate binningresults. Similarly,wecancalculatetherelativeabundance(denoteas ~ c )forasub- contig c i as 86 ~ c i [ k ]= c i [ k ] å h j = 1 c j [ k ] ; k = 1 :: w (4.2) where c i [ k ] representsthereadscoverageatthe k positionofsub-contig c i . w isthewindowwidth. Letthehaplotypenumberestimatedbywindowsheightas N .Withsub-contigs c ,theposition- readscoverage ~ c ,andtheaverageabundance¯ c ,VirBinutilizesthe relativeabundancesofsub-contigsinwindowswithheighttoestimatetheprobabilitythata contigbelongstoabinwithanExpectation-Maximization(EM)method.Letthehaplotypes be H 1 ; H 2 ;:::; H N ,theirabundances x 1 ; x 2 ;:::; x N berandomvariablesandfollowdistributions E 1 ( x ) ; E 2 ( x ) ;:::; E N ( x ) ,respectively.Let C betheoriginatingcontigof c i ,intotal n sub-contigs from C .WeusedEMalgorithmtomaximizetheposteriorprobability P ( c i 2 H j j ¯ c i ) .Thealgorithm containsfourstepsasshownbelow: 1.Initialize N groupsbyrandomlyassignsub-contigstothemorwithguidance.Thesub- contigsinthesamewindowareassignedtodifferentgroups. 2. Expectation Foreachgroup,thecompoentsub-contigs'relativeabundance ~ c sareaggregatedtocalcu- latetheempiricalprobabilitydensityfunction E ( x ) .Theaggregationisperformedbycalculating thenormalizedhistogramsfortheserelativeabundancesothatthesummationofhis- togramvalueswillbe1.Thelikelihoodthat c i beingproducedfromahaplotypecanbedenotedas P ( x j = ¯ c i j c i 2 H j ) .Itcanbecaclulatedas E j ( ¯ c i ) . Thepriorprobability P ( c i 2 H j ) issameas P ( C 2 H j ) ,whichiscalculatedas å n k = 1 P ( c i 2 H j j ¯ c i ) P ( c i ) .Here, P ( c i ) isthepriorprobabilityof c i ,whichiscalculatedas length ( c i ) length ( C ) , P ( c i 2 H j j ¯ c i ) istheposteriorprobabilitythat c i belongstohaplotype H j fromlastiteration. Withbothlikelihoodandprior,theexpectedprobabilitythat c i belongstohaplotype H j canbe 87 calculatedas likelihood prior ,thatis P t + 1 ( c i 2 H j j ¯ c i )= P t ( x j = ¯ c i j c i 2 H j ) n å k = 1 P t ( c i 2 H j j ¯ c i ) length ( c i ) length ( C ) (4.3) 3. Maximization Withtheposteriorprobabilitiescalculatedforeachgroupdistribution,wecanreassignthesub- contig c i tothehaplotypewiththemaximumposteriorprobabilityorusingGibbssampling.The samereassigningproceduresareappliedforallthesub-contigs.Withtheassignmentresults,the distribution E ( H j ) andpriorprobability P ( c i 2 H j ) canbeupdated. 4.Iteratestep2and3untiltheclusteringresultsdonotchangeorthemaximumnumberof runshavebeenachieved.Thedefaultmaximumnumberofrunsis100. 4.3Results 4.3.1HIVsimulateddataset VirBinwastestedontwosimulatedHIVquasispeciesdatasetsforitseffectiveness.One simulateddatasethave5HIVhaplotypesandtheotherhave10. 4.3.1.1Datasimulation HIVhaplotypessimulation MultipleHIV-1strainsareavailableonHIVsequencedatabase(https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html). However,thesequencesimilaritiesbetweenthemareusuallybelow90%,whicharelowerthanse- quencesimilaritiesbetweenviralstrainsinaquasispceis.Therefore,wedownloadedfourHIV strains(FJ061,FJ064,FJ065,andFJ066)fromthedatabase,andgeneratesimulatedhaplotypes fromthembyrandomlymutatingbasesatrandomlyselectedlocations.Fortheviralquasispecies 88 with5haplotypes,wegenerated3simulatedhaplotypesfromFJ061strainandmixedthemwith FJ066tocomposeaquasispecies(denoteas5HIVhaplotypes).Thesequencesimilaritiesbetween eachsimulatedhaplotypeandFJ061is97%.Forthequasispecieswith10haplotypes,2simulated haplotypesweregeneratedfromeachFJ061,FJ065,andFJ066strains.Theyweremixedwith FJ064tocomposethequasispecies(denoteas10HIVhaplotypes). Contigssimulation Whileviralcontigscanbeassembledfromsimulatedreadswithavailable assemblytools,thequalityofgeneratedcontigsdependsontheassemblyalgorithmsandparam- etersused.Tofocusonthebinningproblem,wesimulatederror-freecontigsdirectlyfromthe referencegenomes.Foreachreferencegenome(denoteitslengthas L ),werandomlygenerated alistof20locationpairs ( p 1 ; p 2 ) ,where1 p 1 < p 2 L and p 2 p 1 + 1 500.Eachlocation pairrepresentsacandidatecontig.Thesepairswerethensortedby p 1 andpairswere outfromthelist.Thestartsfromlookingatthesecondpairinthesortedlist(the pairisalwayskept),ifithasanoverlapless100bpwiththepreviouspairandisnotasubstring ofthepreviouspair,keepitandmovetothenextpair.Otherwise,removethispairfromthelist andmovetothenext.Afterthecontigswerethengeneratedfromthereferencegenome withleftlocationpairs. HIVreadssimulation WithavailableHIVhaplotypes,simulatedreadsweregeneratedfrom thembyART-illumina[56]aserror-containingMiSeqpaired-endreads,withreadlengthof250bp, averageinsertsizeof600bp,andstandarddeviationof150bp.Theabundanceofeachhaplotypeis calculatedbasedonapowerlawequation[9].Thetotalsequencingdepthsforthe5HIVhaplotypes are1000-x,with438-x,219-x,146-x,109-x,and88-xforeachhaplotype,respectively.The sequencingdepthsfor10HIVhaplotypesare2000-x,with683-x,341-x,228-x,171-x137-x,114- x,98-x,85-x,76-x,68-xforeachhaplotype,respectively.Intotal,thereare38,914simulatedreads for5HIVhaplotypes,and76,974readsfor10HIVhaplotypes. 89 Table4.1:Haplotypeabundancesforsimulated5HIVhaplotypescalculatedfromreadsmapping. HaplotypeFJ061FJ061-h1FJ061-h2FJ061-h3FJ066 Abundance(%)43.9021.9514.6310.938.59 4.3.1.2Resultsfor5HIVhaplotypes Haplotypeabundances Bymappingsimulatedreadstothesimulated5HIVhaplotypesbybowtie2,wewereableto estimatetheabundanceforeachhaplotypeasshowninTable4.1. Windows Withtheavailablesimulatedcontigsandreads,wealignedcontigstoeachotherbyblastand identifywindowsbyVirBin.For5HIVhaplotypes,VirBingenerated121windowsfromallthe contigs.Thewindowsweresortedbytheirlengthsandthetop5windowsareshowninFigure4.3. Theresultsrevealthattheheightofthesehigh-qualitywindowscanbeusedtoaccurately estimatethenumberofhaplotypesinthequasispecies.Forexample,outofthetop100windows, 50contain5contigs,8contain6contigs,and32contain4contigs.Leftwindowscontain2or 3contigs.Outofthetop50windows,39have5contigs,1contains6contigs,and8contain4 contigs.Outofthetop25windows,20contain5contigs,1contain6contigs,and3contain4 contigs.Applyingasimplevotingprocessonthetop-kwindows,thenumberofhaplotypescanbe estimatedastheheightofmostabundancewindows,whichis5. ByaligningreadstothesecontigswithBowtie2,therelativeabundancelevelsforeachcontig withinawindowcanbecalculatedfollowingtheequation4.1.Relativeabundancesforcontigsin thetop5windowsareshowninFigure4.3. EMclustering Withtherelativeabundancescalculatedforcontigswithinwindows,weappliedtheEMclustering 90 Figure4.3:Top5windowsoutputfromsimulatedcontigsalignmentfor5HIVhaplotypes.Each linestartingwith'>'representsonewindow.Thefollowingcolumnsrepresentreferencecontig name,windowheight,windowstartingpositiononreferencecontig,windowendingpositionone referencecontig,referencehaplotype,andhaplotypeabundance,respectively.Therowsafter'>' lineshowtheinformationforeachcontiginsideofthiswindow.Thecolumnsrepresentcontig name,relativeabundance,summationofreadscoverage,referencehaplotype,andhaplotypeabun- dance,respectively. 91 Table4.2:EMclusteringresultsonsimulated5HIVhaplotypecontigsforVirBinandMaxBin. virbinMaxBin Precision(%)Recall(%)Precision(%)Recall(%) FJ066 84.692.931.452.5 FJ061 96.689.00.00.0 FJ061-h1 88.282.90.00.0 FJ061-h2 91.899.60.00.0 FJ061-h3 92.789.00.00.0 methodinVirBintoclustercontigsinto5groups.Sincethegroundtruthforwhichhaplotype eachcontigbelongstoisknow,wewereabletoevaluatetheclusteringresultsbycalculatingthe precisionandrecallathebaselevel.TheevaluationresultsareshowninTable4.2. TheresultswerebenchmarkedwithMaxBin,whichisabinningtoolformetagenomicscaf- folds/contigsbasedontetranucleotidefrequenciesandreadscoveragelevels.TheMaxBinprogram requiresmarkergenestoidentifyseedcontigsforbinning,whichisnotapplicableforviralqua- sispeciessincethedifferentviralhaplotypesbelongtothesamespeciesandhavethesamemarker gene.However,wewerestillabletorunthecoreclusteringprogramofMaxBinbyprovidingthe seedcontigsmanually.Werandomlychoseonecontigfromeachhaplotypeastheseedcontigs andcalculatedthecontigabundancesbymappingreadstothem.TheresultsfromMaxBinarealso showninTable4.2.Forallthe24contigs,itassigned17to5clusters,leaving7unassigned.One contigwascorrectlyassignedtotheclustercorrespondingtoFJ066strain,butothercontigswere notcorrectlyclustered. StrainPhlAn[152]isalsoatooltocharacterizethegeneticstructureofviralstrainsinmetagenomes. IttakestherawsequencingreadsandMetaPhlAn2[151]databaseofreferencese- quencesasinputandaimstooutputthemostabundantstrainforeachsample.However,itfailed todetectanyviralspeciesatthesteprunningMetaPhlAn2.ConStrains[89]isanothertool designedtoidentifystrainstructuresfrommetagenomicsequencedata.Itusesbowtie2 92 Table4.3:Haplotypeabundancesforsimulated10HIVhaplotypescalculatedfromreadsmapping. HaplotypeFJ061FJ066FJ065FJ064FJ061-h1 Abundance(%)34.6116.8311.268.446.94 HaplotypeFJ061-h2FJ066-h1FJ066-h2FJ065-h1FJ065-h2 Abundance(%)5.784.844.203.753.36 tomapreadstoasetofuniversalgenesandinferthewithin-speciesstrainsabundancesbyutiliz- ingsingle-nucleotidepolymorphism(SNP)patterns.Thistoolagaindidnotgetenoughmapped readstoreportanystrainabundances.Thus,wecannotreporttheresultsfromStrainPhlAnor ConStrains. 4.3.1.3Resultsfor10HIVhaplotypes Haplotypeabundances Similarto5HIVhaplotypes,bymappingsimulatedreadstothe10HIVhaplotypesbybowtie2, wewereabletoestimatetheabundanceforeachhaplotypeasshowninTable4.3. Windows Similarto5HIVhaplotypes,thecontigsalignmentandwindowswasalsoapplied onsimulatedcontigsfor10HIVhaplotypes.VirBingenerated319windowsfromallthecontigs. Sortingthesewindowsbytheirlengths,thetop3areshowninFigure4.4. Fromtheresults,outofthetop100windows,36contain10contigs,34contain9contigs,and 20contain8contigs,4oftheleftwindowscontaincontigsnumbergreaterthan10and6contain contigsnumberlessthan8.Outofthetop50windows,26have10contigs,16contain9contigs, and6contain8contigs.Outofthetop25windows,15contain10contigs,8contain9contigs, and1contains8contigs.Therefore,thehaplotypenumber10canstillbecorrectlyoutby theheightsofmostabundancewindows. Similarto5HIVhaplotypes,therelativeabundancesforcontigswithinwindowscanbecalcu- 93 latedbyaligningreadstocontigswithbowtie2.Therelativeabundancesforcontigsinthetop3 windowsareshowninFigure4.4. Figure4.4:Top3windowsoutputfromsimulatedcontigsalignmentfor10HIVhaplotypes.The outputformataresameasFigure4.3. EMclustering Withthewindowsandrelativeabundances,VirBinwasalsoappliedtoclustercontigs into10groupsfor10HIVhaplotypes.TheevaluationresultsforclusteringareshowninTable4.4. TheresultswerealsobenchmarkedwithMaxBin.Similartosimulated5HIVhaplotypes,10 seedcontigsfromeachhaplotypewererandomlyselectedandprovidedtoMaxBin.It 25outof34contigs,with9TheresultsareshowninTable4.4.WhileMaxBin correctlyonecontigtoFJ065-h1andonetoFJ066,mostoftheothercontigswerenot 94 Table4.4:EMclusteringresultsonsimulated10HIVhaplotypecontigsforVirBinandMaxBin. virbinMaxBin Precision(%)Recall(%)Precision(%)Recall(%) FJ064 73.098.00.00.0 FJ061 100.080.00.00.0 FJ061-h1 98.188.10.00.0 FJ061-h2 99.589.20.00.0 FJ065 86.461.50.00.0 FJ065-h1 18.610.717.6100.0 FJ065-h2 42.194.90.00.0 FJ066 84.695.843.469.2 FJ066-h1 90.979.20.00.0 FJ066-h2 60.047.60.00.0 appropriatelyclustered.WeagaintriedStrainPhlAnandConStrainsonthissimulateddataset,but still,noreadscanbemappedtoavailablereferencegenes. 4.3.2HIVrealMiSeqdataset 4.3.2.1HIVrealdatasetandcontigs Tofurtherassesstheperformanceofthebinningmethod,weappliedVirBinontherealHIV quasispeciesdataset(SRR961514),sequencedfromthemixofveHIV-1(89.6,HXB2,JRCSF, NL43,YU2)strainswithIlluminaMiSeqsequencingtechnology[39].Thisdatasetcontains 714,994pairs(2x250bp)ofreadsthatcoverthevestrainsto20,000x.Therawdatasetwas pre-processedwithFaQCs/1.3[87]andTrimmomatic[15]totrimandlow-qualityreadsor adapters.Theleftreadswerethenerror-correctedwithKarect[2].Afterpre-processing,774,044 readswereleft. Bymappingpre-processedreadstotheavailable5referencegenomesbybowtie2,wewere abletoestimatetheabundanceforeachhaplotypeasshowninTable4.5. WeusethecontigsassembledbyPEHaploasinputforVirBin.PEHaploproduced24contigs 95 Table4.5:Haplotypeabundancescalculatedfromreadsmapping. HaplotypeJRCSFNL4389.6YU2HXB2 Abundance(%)29.7925.1821.6812.6210.87 fromtherealMiSeqHIVdatasetthatcancoverover92%oftheveHIV-1strains.Thesecontigs haveaN50valueof2,223bpandthelongestcontigis9,133bp. 4.3.2.2ResultsforHIVrealdataset Windows Afteraligningcontigstoeachotherbyblast,VirBinwasappliedonthealignedandgen- erated197windows.Sortingthesewindowsbylength,thetop5areshowninFigure4.5. Fromtheresults,outofthetop100windows,44contain5contigs,39contain6contigs,and 5contain4contigs.Outofthetop50windows,27contain5contigs,16contain6contigs,and 2contain4contigs.Outofthetop25windows,17contain5contigs,5contain6contigs,and 1contains4contigs.Similartothesimulatedhaplotypesresults,thehaplotypenumber5canbe estimatedfromtheheightsofmostabundantwindows. Whilethewindowsonsimulatedcontigstendtohavewindowsheightslessthan thehaplotypenumber,thewindowsproducedoncontigsassembledhavealotofheightsgreater thanthehaplotypenumber.Onereasoncouldbethattheremaybe"chimericcontigs"assembled, whereacontigisjoinedbysequencesfromtwoormorehaplotypes.Thechimericcontigscan alignwiththosecorrectcontigs,thusincreasingtheheightofthewindows.Thesimulatedcontigs areerror-freebutnotnecessarilybecompleteduetotherandomness.Therefore,therearealot ofwindowswithaheightlessthanthehaplotypenumber.Anotherpossiblereasonisthatthere arerepeatedsequencesforHIVrealhaplotypes.Letusdenoteasubstringofaviralhaplotype X as X [ start _ position ; end _ position ] .TheblastresultsbetweentheseHIVhaplotypesshowthat 96 Figure4.5:Top5windowsoutputforcontigsassembledfromHIVrealMiSeqdata.Theoutput formataresameasFigure4.3. 97 Table4.6:EMclusteringresultsonassembled5realhaplotypecontigsforVirBinandMaxBin. VirBinMaxBin Precision(%)Recall(%)Precision(%)Recall(%) HXB2 48.570.239.127.0 YU2 34.030.056.650.9 89.6 58.056.50.00.0 NL43 18.517.49.621.6 JRCSF 65.150.80.00.0 Table4.7:EMclusteringresultsonassembled5realhaplotypecontigsforVirBinandMaxBin. VirBinMaxBin Precision(%)Recall(%)Precision(%)Recall(%) HXB2 70.087.533.325.0 YU2 25.020.033.333.3 89.6 33.3100.00.00.0 NL43 66.728.625.020.0 JRCSF 33.333.30.00.0 89.6[9046,9669]canbealignedwithJRCSF[1,630]withanidentityof92%.Therefore,the alignmentbetweencontigsmaynotmeanthattheycomefromthesamelocationoftheconsensus genomesequence,whichcanleadtoanoverestimationoftheheightofthewindows. EMresults ThesimilarclusteringprocedureswereappliedonwindowsbyVirBin withrelativeabundances.MaxBinwasrunagainasBenchmarkTheresultsfrombothtoolsare showninTable4.6.StrainPhlAnandConStrainswereappliedonthisrealHIVdatasettoo. StrainPhlAnwasabletoidentifytheHIVspecies,butcouldnotreportanystraininformation. ConStrainscouldnotalignenoughreadstomarkergenesforfurtherreportingstrainabundances. VirBin ´ sresultsarenotasgoodasforsimulatedHIVdatasets.Oneofthereasonmaybethe existenceofimperfectassembledcontigs,whichmayleadtotheincorrectcalculationofabundance levelsandaffecttheclustering. 98 4.4DiscussionandConclusion Withinthesameviralquasispecies,binningthecontigsfromdifferenthaplotypeswithk-mer frequencies(suchastetranucleotidefrequencies)isnotapplicablebecauseofthehighsequence similaritiesbetweenthem.Theonlypossibleinformationtodifferentiatethesecontigsistheir abundancesinferredfromreadscoveragelevelsbymapping.However,theinhomogeneousreads coveragealongtheviralgenomeandcloseabundancesbetweenviralstrainsincreasethedif cultytodirectlyapplyingcoveragelevelsforbinningviralcontigs.InthisChapter,weproposed anovelmethodtomoreaccuratelycalculatetherelativeabundancesforsub-contigswithinwin- dows.Moreover,thenumberofhaplotypesinthequasispeciescanbeestimatedfromtheconsensus heightofwindows. Wehaveshowntheutilityofourtoolontwosimulatedandonerealviralquasispeciesdatasets, andbenchmarkedtheresultswithMaxBin.Thesuccessofthismethodreliesonthequalityofinput contigsandtheabundancedifferencebetweenviralhaplotypes.Whentheassembledcontigscan coverthemostpartofviralstrains,thenumberofhaplotypesinthequasispeciescanbeaccurately Theempiricalexperienceshowsthatitisdiftoclassifytwoviralstrainswhenthe abundancedifferencebetweenthemisbelow3%. 99 Chapter5 Studyingtranscriptionalregulationsof miRNAgeneswithCap-seq 5.1Introduction MicroRNAs(miRNAs)arealargefamilyof ˘ 21nucleotide-longRNAsthathavebeenuncov- eredaskeyregulatorsofgeneexpressionatpost-transcriptionallevelinmetazoans,plants,and viruses[63,66,10].Inmetazoans,maturemiRNAsandargonaute(AGO)proteinsformintothe miRNA-inducedsilencingcomplex(miRISC),withinwhichmiRNAsbase-pairtothe3'-UTRof targetmRNAsandinhibitproteinsynthesisbyeitherrepressingtranslationorpromotingmRNA degradation.Itwasinferredthatmorethanone-thirdofallprotein-codinggenesareregulatedby miRNAs[94].miRNAshavealsobeendiscoveredtoplayacrucialroleinprecisionmedicine. Precisionmedicineattemptstocharacterizethegeneticbackgroundofpatientsandclassifythem intosubpopulationsthatdifferintheirsusceptibilitytoaparticulardisease[30,58,91].Thecapa- bilityofmodulatingavastnumberofprotein-codinggenesmakesmiRNApowerfulregulatorsof thedifferentcellularprocessesinvolvedinthepathogenesisofvarioustypesofdiseases,including cardiovasculardiseasesandcancer.Forexample,livermiRNAmiR-122,asthemostabundantand mostlivermiRNA,ismostlikelytorepresentanovelbiomarkerforcardiovascularand metabolicdiseasesasitplaysacentralroleinlipidandglucosehomeostasisandisdetectablein serumandplasma[156].DifferentialexpressionofmiRNAshavealsobeenobservedintumor 100 tissues.Theiralterationexpressioninprostatecancerhasbeenwelldocumented[96].Becauseof theirimportantregulatoryfunctions,manystudieshavefocusedonmiRNAannotationandidenti- fyingtheirtargets[32,164,78].However,howmiRNAsthemselvesareexpressedandregulated isnotfullyunderstood. InthecanonicalmiRNAbiogenesispathway,miRNAsareprocessedfromlongertranscripts, whicharereferredtoasprimarymiRNAs(pri-miRNAs)[66].Pri-miRNAsareeithertranscribed bypolymeraseII(PolII)fromindependentgenesorderivedfromtheintronsofprotein-coding genes[77,16].TwomembersoftheRNaseIIIfamilyofenzymes,DroshaandDicer,further processpri-miRNAstomaturemiRNAs[76,28,68].First,Droshacleavesthehairpinstructureof apri-miRNAtoan ˘ 70-nucleotideprecursormiRNA(pre-miRNA)inthenucleus.Pre-miRNAs arethenexportedtothecytoplasmbyXPO5,whereDicercleavesofftheloopregionofthehairpin andfurtherprocessesitto ˘ 21-bpmaturemiRNA(s).Recentstudieshaveuncoveredseveral non-canonicalwaysofgeneratingmiRNAs,demonstratingthecomplexityofmiRNAbiogenesis. OneclassofunconventionalmiRNAsiscalledmirtrons,whichareencodedinintrons,bypass Droshaprocessorbutrelyonsplicingmachineryforpre-miRNAgeneration[11,130].miRNAsin mammalshavebeenshowntofrequentlyutilizealternativepromotersindifferentcelltypes,and pri-miRNAsmayencodesubsetsofclusteredmiRNAs[24].Pri-miRNAtranscriptscanbecleaved bycytoplasmicDroshainhumancells[35].Anotherstudyonmicehasuncoveredasecondclass ofnon-canonicalmiRNAs,ofwhichthepre-miRNAsare5'-cappedandgenerateddirectlyby transcription[161]. AlthoughthegenomiccoordinatesofmatureandprecursormiRNAshavebeenannotatedin databasessuchasmiRBase[48],verylittleisknownaboutthecoordinatesofpri-miRNAs.RNA- seqtechnology[155]hasbeenprovedasanefwaytoannotateprotein-codinggenes.Mature mRNAscontaina5'7-methylguanosine(m 7 G)capandalong3'polyadenylated(poly(A))tail 101 andarerelativelystable,sotheycanbewellextractedfromcellsandsequenced.Thesequenced RNAfragmentsarethenmappedtothereferencegenomeforgeneannotation.However,sincethe original5'endsofprimarymiRNAtranscriptsarerapidlycleavedoffbyDroshaduringmiRNA maturation,regularRNA-seqtechnologycannotbeusedtotheprimaryTSSsofmiRNAgenes. Pri-miRNAsareusuallytranscribedbyPolIIandalsocontaina5'm 7 Gcapanda3'poly(A)tail [77],indicatingthatthebiologicalfeaturesrelatedtoPolIItranscriptioncanbeusedtoidentify thetranscriptioninitiationsitesformiRNAgenes. ToidentifytheprimaryTSSsofmiRNAs,somecomputationalmethodshavebeenimple- mentedbasedonfeaturesrelatedtoPolIItranscribedgenes,suchastranscriptionfactorbinding sites(TFBSs),PolIIbinding,andchromatinstatesincludinghistoneandnucleo- somepositioning[132,111,31,29].Typically,PolIIandH3K4me3arehighlyenrichedatactive promoters,whilenucleosomesaredepletedattheTSSs.Wangetal.[154]designedastatistical modeltomimicPolIIbindingpatternsatthepromotersofhighlyexpressedprotein-codinggenes andusedittosearchforsimilarPolIIbindingpatternsupstreamofallintergenicmiRNAsinhu- manbreastcancercellstoidentifyprimarypromoters.Theyvtheirbycheckingthe conservation,CpGcontent,andactivatinghistonemarksinthepromoterregions.Ozso- laketal.[111]combinednucleosomemappingwithChIP-chipscreensforH3K4me3,H3K9/14ac, PolIIandPolIIIsignaturestoidentifytheproximalpromoterregionsofpri-miRNAsinhuman genome.Theytestedtheiralgorithmonhumanannotatedprotein-codinggenesandpredictedthe transcriptioninitiationregionstoaresolutionof150bp.Withthesamemethod,thetranscrip- tioninitiationregionsof175transcriptionallyactivemiRNAsweredetermined.Sainietal.[131] predictedthe5'endsofintergenicpri-miRNAsinhuman,mouseandratgenomesbycombining thefeaturesofTSSspredictions,CpGislandsand5'capanalysisofgeneexpression(CAGE) tags.miRStart[29]builtaSVMmodelusingthefeaturesofCAGEtags,TSSSeqlibrariesand 102 H3K4me3chromatinsignaturefromChIP-seqtoidentifytheTSSsofhumanmiRNAs.Themodel wastrainedon7,268protein-codinggeneswithuniqueTSSand847putativeTSSsfor the940humanpre-miRNAsobtainedfrommiRBase. Whilethemethodsdiscussedabove[154,111,131]havepredictedTSSsformiRNAgenes inmammalgenomes,theirpredictionresultshavelowresolution(hundredsofbps)becausethe typicaldistributionpatternsofPolIIandchromatinfeaturessurroundingpromotersmaynothold foranyparticulargene.Evenforsomeactivelytranscribedgenes(29/85),thedistancebetween theTSSandtheclosestPolIIpeakcanbeover1000bp(FigureS1(B)).Amoreaccuratemethod istotakeadvantageofthecapstructureat5'endsofPolIItranscribedRNAs.Mappingthe cappedsequencestoreferencegenomeswillenabledirectdiscoveryoftheTSSs.CAGEandCap- seqhavebeenusedtodirectlysequenceRNAswith5'm 7 Gcaps,whichareusedtoidentifythe candidateTSSs.CAGE[65]usesaso-calledficaptrapper"methodtocapturefull-lengthmRNAs andsequencethe5'endswithSangersequencingtechnology.However,CAGEisnotwidelyused tomapTSSsforeachgenebecauseofthecostandsequencingdepth.Withthedevelopmentofhigh throughputsequencingtechnology,enrichingcappedRNAtranscriptsfollowedbynext-generation sequencing(NGS)technology(Cap-seqordeepCAGE[38])hasbeenusedtosequencethecapped RNAsinthewholegenome. Themouseisapopularmammalianmodelsystemforgeneticresearch,forwhichsomeCap-seq datasetshavebeengenerated[161,49].RecentlytheCap-seqstudyinmice[161]hasuncovered anon-canonicalwayofgeneratingpre-miRNAs,inwhichthepre-miRNAsaregenerateddirectly bytranscriptionandtheir5'endsarem 7 Gcapped.These5'm 7 Gcappedpre-miRNAspreferto beexportedfromthenucleustothecytoplasmbyexportin1(XPO1)andafterDicerprocessing, only3p-miRNAisefloadedontotheAGOcomplex.Thisspecialclassof5'-cappedpre- miRNAshavealsobeendiscoveredinthehumangenome(miR-320a)[161],butwhethertheyalso 103 existinothernon-mammalianspeciesisstillunknown. Caenorhabditiselegans( C.elegans )isalsoawellestablishedmodelorganismforgenomic studies.Thewormisasimplemulticellularorganismbutwithavarietyoftissuetypesanda shortlifecycle[33].Therefore,manyfunctionalgenomicsequencingdatasets,includingCap-seq [49,27,67],havebeengeneratedonthisspecies.Thetranscriptionregulationinthisanimalis quitedifferentfromthatinmammals.Forexample,theprimarytranscriptsofabout70%ofits protein-codinggenesundergotrans-splicing[144];anditspri-miRNAtranscriptsareexportedby XPO1andpossiblyprocessedinnuclearpore[19]. Inthisstudy,weutilizedavailableCap-seqdatasetstostudythetranscriptionregulationof miRNAsin C.elegans andmouse.Themainresultsaresummarizedbelow. Weagroupofcandidate5'm 7 Gcappedpre-miRNAsin C.elegans . WeanotherclassofmiRNAswithnon-canonicaltranscriptionmechanisms,for whichthepre-miRNAsmaybegeneratedbyboththecanonicalmiRNApathwaywith Droshaandthenon-canonicalpathwaywithoutDrosha. BasedonthecappingsignalsformiRNAgenesinclusters,weproposedahypothesisthat thesepri-miRNAtranscriptsmightundergocytoplasmicre-cappingduringthepre-miRNAs generationprocess. WedevelopedamethodtoseparatetheseprimarymiRNApromotersasbroador divergentandcharacterizedthembyanalyzingtheH3K4me3andPolIIbindingsurrounding them. 104 5.2MaterialsandMethods 5.2.1Datasetsandprocessing ThesmallRNACap-seqdatafornewbornmousewasretrievedfromthestudybyXieetal.[161], andwasdownloadedfromSequenceReadArchive(SRA)withrunnumberSRR1022391.The smallRNACap-seqdataformixedstagedembryosof C.elegans isfromthestudybyChenetal. [27],andwasdownloadedfromNCBIGeneExpressionOmnibus(GEO)withaccessionnumber GSE42819.ThesmallRNA-seqdataof C.elegans L4stagewasalsodownloadedfromGEO databasewithaccessionnumberGSM916519.TheChIP-seqdataofH3K4me3andPolIIfor C.elegans werealsoobtainedfromNCBIGEOdatabase,withaccessionnumberGSE28770and GSE15535,respectively.SmallRNA-seqdatafordetectingmaturemiRNAsin C.elegans was downloadedfromGEOwithaccessionnumberGSM916519. RawsequencingdatasetsareinSRAformatandweredumpedtoFASTQformatbySRA Toolkit[145].TheFASTQwerethenmappedto C.elegans (WBcel235)andMouse(GRCm38) referencegenomeusingbowtie[72]allowing2mismatchesand3mismatchesforreadslengthof 36ntand50nt,respectively.OnlyuniquelymappedreadswerereportedintheoutputSAM andwerevisualizedbyGenomeView[1]. 5.2.2Clusteringof5'endreads SmallRNACap-seqdatafor C.elegans andmousearestrandMappedreadsonforward andreversestrandwereanalyzedindependently.Wethetranscriptioninitiationclusters (TICs)withsimilarmethodsfromthestudybyChen,etal[27].First,mappedreadswithsame strandand5'endpositionswerecombinedanddenotedascap-stacks.Second,allcap-stacks containingveormoretagswereclusteredusingasingle-linkageapproach:twoormorestacks 105 wereclusteredtogetherifthedistancebetweentwoadjacentstacksislessorequalto50bp.The positioncoveredbythemost5'endswithintheTICwasasthemode,whichrepresents theTSSfortheTIC.Inthecaseoftwoormorepositionswiththesamenumberoftags,theone furthestupstreamwasselectedasthemode.Hereintotalweobtained32,530TICsin C.elegans and4,903TICsonmousechromosome7. 5.2.3Statisticalanalysis BecausetheCap-seqisnotperfectandmayinvolvecontaminationorartifacts,weusedaPoisson distributiontomodelthebackgroundnoisefollowingthepreviouswork[168].ThoseTICsthat areenrichedwithsequencingreadsarereported(Poissondistribution p -valuebased on l ).SincethereadsofCap-seqarenotrandomlydistributedalongthegenome,weestimatea dynamicparameter l local ,foreachTICas: l local = min [ l 5k ; l 10k ] (5.1) where l 5k and l 10k areestimatedfromthe5kbor10kbwindowcenteredatthemodeoftheTIC. OurmodelisbuiltontheassumptionthataTICisreliableifthenumberofreadsenrichedinside oftheTICishigher(defaultwith p -value<10 5 )thanthenumberofreadslocatedin theTICwhentheyarerandomlydistributedinthelocalregion. ResultsofCap-seqandsmallRNA-seqreadsmappedtotRNAsweresubjecttoatwo-tailed pairedsamplet-testtoestimatetheirdifferences,with p -value<0.01consideredstatisticallysig- BothCap-seqandsmallRNA-seqmappingresultsontRNAswerenormalizedbytheir coveragedepthsonthegenome.Sincetheyonlycoverasmallpartofthewholegenome,the coveragedepthwascalculatedasthetotalnumberofreadsmappedmultiplythereadslengthand 106 dividedbythelengthofcoveredgenomeregion.ThesimilarPoissonmodelisalsousedtoestimate theenrichmentofCap-seqreadsontRNAs. 5.2.4miRNAclusterandintergenicpre-miRNAs miRNAclusterswereretrievedfrommiRBasewithadistancethresholdof1000bp.Thedistance betweenpre-miRNAsinmiceareusuallylongerthan1000bp.Thus,wedidnotanymiRNA clusterinmicewithourthreshold. ToidentifyintergenicmiRNAsfrom C.elegans andmousegenome,wedownloadedthemiRNA annotationsfrommiRBaseRelease21andgeneannotaiongff3Thegeneannotationfor C.elegans andmouseweredownloadedfromEnsemblwebsite(http://www.ensembl.org/index.html). Pre-miRNAsfrombothmiRBaseandEnsemblgeneannotationeswereisolatedandthosemiR- NAsthatarenotcoveredbyprotein-codinggenes,non-codingRNAgenes,snoRNAgenesand snRNAgeneswereasintergenicmiRNAs.Inthecasethatsamepre-miRNAisanno- tatedbothinmiRBaseandgeneannotationweonlykepttheoneinmiRBase.Intotal,we 134intergenicmiRNAsin C.elegans and80intergenicmiRNAsonmousechromosome 7.Theregionupstreamofthe5'endofintergenicpre-miRNAwasalsoand TICsdetectedinthisregionwereannotatedasthecandidateprimaryTSSsforthemiRNA.The TICswithinthepre-miRNAregionwereannotatedasthepre-capTICs. 5.2.5FindingbidirectionalandmultipleTSSspromoters WiththecappedRNAreads,wetranscriptioninitiationsitesandannotatedtheirpromot- ersasbidirectionalorbroadpromoters(promoterswithmultipleTSSs)in C.elegans .Weused theTICstoannotatethesepromoters.FirstforeachTIC( A inFigure5.7(B))onthe 107 plusstrand,themostclosedownstream( C inFigure5.7(B))andupstream( B inFigure5.7(B)) minusstrandTICsweresearched.IfthedistancebetweentheupstreamminusTIC( B )andthe plusstrandTIC( A )islessthanthreshold(hereweuse300bp),thetwoTICsweretreatedasfrom thesamepromoterandthepromoterwasannotatedasbidirectional.Thedownstreamminusstrand cappedpeak( C )actsastheboundaryfordetectingmultipletranscripts.AlltheplusstrandTICs upstreamofitwereannotatedasfromthesamepromoter.Toidentifythemultipletranscriptson theminusstrand,themostcloseupstreamplusstrandTIC( D )of B wasfoundandalltheminus strandTICsbetween D and B wereannotatedasfromthispromoter.ThoseleftminusstrandTICs wereannotatedwiththesimilarmethod. Withthisapproach,wedetected11,272promotersin C.elegans ,ofwhich,6,149arebidirec- tionalpromotersand2,359arebroadpromoters.Themostupstream5'endsofbothplusand minusstrandTICswereusedtorepresentfortheTSSsofthesepromoters.OnlyoneTSSoneach strandwasselectedasrepresentingTSSforonepromoter. 5.3Results 5.3.1ofprimarymiRNATSSsin C.elegans andmouse 5.3.1.1OverviewofprimarymiRNATSSsannotation Weusedthesingle-linkageclusteringmethodtodetecttranscriptioninitiationclusters(TICs)[27] fromCap-seqdatain C.elegans andmice.ForeachTIC,wealsousedaPoissondistribution tomodelthelocalbackgroundnoiseandtestwhetheritisenrichedwithCap-seq readsbycalculatinga p -value(seeMaterialsandMethods).TheintronicmiRNAsareusually processedaspartoftheirhost-genemRNA[10]andthustheirtranscriptionscoordinatewiththe 108 protein-codinggenes.Therefore,inthisstudy,wehavefocusedonidentifyingtheprimaryTSSs forintergenicmiRNAs.Inbothspecies,wetheintergenicmiRNAsthatarenot coveredbyprotein-codinggenes,non-codingRNAgenes,smallnucleolarRNA(snoRNA)genes, orsmallnuclearRNA(snRNA)genes.Thentheregionbetween5'endofpre-miRNA andtheclosestupstreamgenewassearchedforTICs.TheTICswereannotatedas candidate primaryTICs (Figure5.1).Theregionwithinpre-miRNAwasalsosearchedforTICs, andtheTICswereannotatedas pre-capTICs (Figure5.1).Moreover,wealsosearched formiRNAclustersfrommiRBasewithadistancethresholdof1000bp.FormiRNAclusters, primaryTSS(s)areannotatedastheTIC(s)upstreamofthe5'endofthepre-miRNAinthe clusterandpre-capTIC(s)areannotatedastheTIC(s)withinthepre-miRNAsinthecluster. Figure5.1:PrimaryTICislocatedupstreamofthepre-miRNA,whilepre-capTICislocated insideofthepre-miRNA.Theannotationofapre-miRNAisusuallyastem-loopthatincludes thepre-miRNAandthelowerstems.However,therealpre-miRNAonlyincludesthered,purple sequences(maturemiRNAs)andtheloopbetweenthem.Therefore,thepre-capTICstartsfrom the5'endofa5p-miRNA.Eachblueorgreenbarcorrespondstoamappedread,wheregreen indicatestheplusstrandandbluetheminusstrand.TheCap-seqdatasetsweresequencedina way.OnlythereadsonthesamestrandofthemiRNAareconsidered. In C.elegans ,we134intergenicmiRNAsand16miRNAclusters.Withtheretrieved Cap-seqdata[27],70intergenicmiRNAswerewithatleastonecandidateprimaryTIC 109 intheregions,and9miRNAswerewithatleastonepre-capTICinsideofthe precursors.ForthosemiRNAswithcandidateprimaryTICsorpre-capTICs,8were withbothofthem.Themodes(highestcoverageofreads'5'ends)oftheprimaryTICsareusedto representthecandidateTSSsofmiRNAs.Inthe16miRNAclusters,6werefoundtocontainboth primaryTSSsandpre-capTICs,and2clusterswerefoundtocontainonlyprimaryTSSs(Table S1). Inmice,weusedtheCap-seqdatafromthestudybyXieetal.[161].TheyperformedCap- seqtotheunconventionalpre-miRNAswhose5'endsaregenerateddirectlybytranscription initiationwithPolIIandthus5'm 7 Gcapped.Basedontheirresults,therearethehighestnumber of5'-cappedpre-miRNAsonchromosome7,sowefocusedonidentifyingtheprimaryTSSsfor miRNAswithinmousechromosome7.We80intergenicmiRNAsonthischromosome, ofwhich10miRNAswerewithatleastonepre-capTICand37miRNAswere withatleastonecandidateprimaryTICs.OfthosemiRNAswithcandidateprimaryTICsorpre- capTICs,2werefoundtocontainbothofthem(TableS2). 5.3.1.2Comparisonwithpreviouswork Inthepreviousstudy[49],Guetal.developedtheCapSeqprotocoltoenrichandsequencelonger (70-90nt)5'-cappedRNAtranscripts,andCIP-TAPcloningtoisolateandsequence5'-capped small(18-40nt)RNAs.Theyappliedthesetwo5'anchoredRNAdeep-sequencingapproaches onto C.elegans andmousegenomesandannotatedtheprimaryTSSsformiRNAsinbothofthem. Asaresult,theyatleastoneTSSfor55individualpre-miRNAsand9miRNAclusters in C.elegans ,and134individualpre-miRNAsinmice,with7miRNAsannotatedonchromosome 7.Comparingwiththeirresults,wetheprimaryTSSsfor43additionalmiRNAs,2 additionalmiRNAclustersin C.elegans and37additionalmiRNAsonmousechromosome7 110 (TableS3). Inanotherstudy[67],Kruesietal.devisedaglobalrun-oncapsequencing(GRO-cap)method tocaptureandsequenceonlythose5'm 7 GcappedRNAsin C.elegans embryos,starvedL1lar- vae,andL3larvae.WiththeGRO-capsequencingdata,theyannotatedtheprimaryTSSsfor52 individualpre-miRNAsand5miRNAclusters.WesimilarprimaryTSSsforthoseover- lappingpre-miRNAs(24miRNAs).Furthermore,weannotatedtheTSSsfor46moreindividual miRNAsand3moremiRNAclusters(TableS4). 5.3.25'm 7 Gcappedpre-miRNAsarein C.elegans AsshowninFigure5.1,readsinpre-capTICsareusuallyalignedverywellwiththe5'endsof pre-miRNAs.Thereisapossibilitythatthesereadswereactuallysequencedfromuncappedpre- miRNAsratherthancappedRNAs.Inthissection,weinvestigatewhetherusuallyuncapped pre-miRNAsarehighlyenrichedbyCap-seqprotocol. 5.3.2.1Possible5'recessedRNAsenrichedbyCap-seq ToenrichforcappedRNAinCap-seqexperimentsfor C.elegans ,exonucleaseTerminatorandcalf intestinalalkaline(CIP)wereusedtoremovetheuncappedRNAs[27].However,someuncapped RNAs,suchaspre-miRNAsandtRNAs,maynotbeaccessedefbyTerminator/CIPbe- causeoftheir5'recessedendsintheirsecondarystructures.Asaresult,someoftheCap-seqreads inpre-capTICsmayactuallycomefrompre-miRNAs.Toinvestigatethecontaminationofpre- miRNAreadsinCap-seqdata,wedownloadedsmallRNA-seqdata(GSM916519)for C.elegans andmappedthereadstothereferencegenomeasthecontrol.ThosemiRNAsthataredetectedas expressedbysmallRNA-seqdatamightbeobservedinCap-seqdataaswell.Wethen thenumberofCap-seqreadsalignedtothoseexpressedpre-miRNAs(mappedwithsufsmall 111 RNA-seqreads)(FigureS2(A)).TheresultsshowedthatthenumbersofCap-seqreadsmappedto pre-miRNAsarenotproportionaltothenumbersofsmallRNA-seqreadsmapped.Formany highlyexpressedmiRNAs(22/33),therearefewornoCap-seqreadsalignedtotheirprecursors, indicatingthatmanyCap-seqreadsmaynotbepre-miRNAs.Inaddition,pre-miRNAs,serving asanintermediateduringmiRNAmaturation,arequicklyprocessedbyDicerandthustendnotto beenrichedbyCap-seq.Previousworkhasshownthatthedataofcarefullydesignedpre-miRNA sequencingonlycontainslessthan1%readsthatcanbemappedtopre-miRNAs[82]. AccordingtoChenetal.[27],therewasnosteptoremovetRNAsintheCap-seqprotocol.We thendidthesimilarcomparisonfortRNAsbecausetRNAsmayescapetreatmentoftheTerminator andCIPforthesamereasonaspre-miRNAs.Theresultsshowedthat,althoughalmostallthe annotatedtRNAgenesinthewormwereexpressed(604/605),mostofthesetRNAs(463/605)do nothaveanyCap-seqreadsmapped(FigureS2(B)).Atwotailedpairedsamplet-testwasused toestimatethemeandifferencebetweenthenormalizedCap-seqreadsandsmallRNA-seqreads mappedtotRNAs.Wegota p -valueas1.12e-185,showingthatthetRNAreadscapturedbytwo protocolsaredifferent(alphalevelas0.01).Pearson'scorrelationandSpearman's rankcorrelationbetweentwomappingresultswerealsocalculated.Thecorrelationcoef results(-0.042and-0.125,respectively)suggestthattheyarepoorlycorrelated.Mostofthose tRNAsmappedwithCap-seqreadshavelessthan10reads(124/142,FigureS3),implyingthat thesereadsmightbecausedbyrandomcontamination.WethenusedthePoissondistributionto modelthebackgroundnoise(seeMaterialsandMethods).31tRNAswerereportedas enrichedwithCap-seqreadsatthecutoffof10 5 .ThosetRNAsthatarehighlyenrichedforCap- seqreadsareusuallyoverlappedwiththerepeatregions.ConsideringthatmostoftheothertRNAs havefewornoCap-seqreadsmapped,weinferthattheseregionsmightbetranscribedbyPolII andproducecappedRNAsundercertainconditions. 112 WealsosuspectedthatsomeoftheCap-seqreadswithinpre-miRNAsmightbematuremiR- NAs.However,maturemiRNAsareshortandusuallydonotpossesscomplexsecondarystruc- tures.TheycouldbeefremovedbyTerminator/CIPtreatment[44].TheofCap-seq readsonmostmaturemiRNAsaddssupporttothis.Hencethepre-capTICsarenotlikelyformed bymaturemiRNAseither.TheCap-seqstudyonmicehavealsoshownthatuncappedRNAscon- stitutelessthan10%ofthetotalsequencedreads[161].Consideringtheaboveanalysistogether, wepositedthatalthoughCap-seqinevitablycontainssomeuncappedRNAs,manyofthereads mappedtopre-miRNAsarelikelysequencedfromcappedRNAs. 5.3.2.25'm 7 Gcappedpre-miRNAswithpre-capTICs RNAssynthesizedbyPolIIare5'm 7 Gcappedcotranscriptionally.Apreviousstudy[161]has documentedanewclassofunusualmiRNAsinnew-bornmice,forwhichthe5'endsofthe pre-miRNAsarem 7 GcappedandcoincidewiththeirTSSs.ThesemiRNAsweresuggestedto begeneratedwithoutDroshaprocessing,withtheir5'endsdetermineddirectlybytranscription initiationandthe3'endsgeneratedbytranscriptiontermination. ToexaminewhethertherearethesameclassofmiRNAsin C.elegans ,weanalysedthosepre- miRNAswithpre-capTICsmappedinside.The5'endsofpre-capTICsareusuallyconsistentwith the5'endsofpre-miRNAsor5p-miRNAs.Therefore,thosepre-miRNAsthatweredetectedwith pre-capTICsshouldacquirem 7 Gcapattheir5'endsandwereannotatedas5'-cappedmiRNAs. WealsoappliedourmethodtomouseCap-seqdataobtainedfromthestudybyXieetal.[161] andcomparedourresultswiththeirs.Ourresultshaveuncoveredallthe9intergenic5'-capped pre-miRNAsonmousechromosome7asshowninthepaper,validatingourmethod.Besides,we alsoannotatedtheprimaryTSSsfor37miRNAswiththesamedataset.Strikingly,newcandidate TSSswerealsofoundupstreamoftwo5'cappedpre-miRNAs(mmu-mir-344candmir-344i).In 113 total,we95'-cappedmiRNAsin C.elegans (TableS1)and10onmousechromosome7 (TableS2). WealsousedstatisticalanalysistoevaluatetheenrichmentofcappedRNAreadsonthe5'ends ofpre-miRNAs.Withthecalculated p -values,7/95'-cappedpre-miRNAsin C.elegans and9/10 inmiceareenrichedwithCap-seqreads( p -value<10 5 ;TableS1). TolookforsequencemotifssurroundingtheseputativemiRNATSSs,weplottedthe nucleotidecompositionaroundthembyWeblogo[34].AstrongYRmotifwasobservedatboth theputativeprimarymiRNATSSsorpre-capTSSsofindependentmiRNAs,inwhichYrepresents pyrimidine,RrepresentspurineandRlocatesattheTSSs(+1position,Figure5.2). Figure5.2:SequencemotifanalysisofnucleotidesaroundputativemiRNATSSs(+1).Thenu- cleotideheight(inbits)standsforthe log 2 ratiooftheobservednucleotidesfrequencyrelativeto thebackgroundgenomicnucleotidecomposition.TheYRmotifisobservedattheputativeprimary miRNATSSsandindependentmiRNApre-capTSSs. 114 5.3.3M 7 Gcappedpre-miRNAsoftenhaveupstreamprimaryTICs Inboth C.elegans andmice,wenoticedthatsomeofthese5'-cappedpre-miRNAsalsohave primaryTICsintheregionfromthepre-miRNA5'endtotheclosestupstreamgene.In C.elegans , 8outof95'-cappedpre-miRNAshavebeendetectedwithprimaryTICs;whileinmice,2outof10 5'-cappedpre-miRNAshaveupstreamprimaryTICs.TheseupstreamTICsareusuallynotveryfar fromthepre-miRNAsandthePolIIbindingattheseTICselongatetothedownstreammiRNAs, indicatingthattheyareconnectedwiththemiRNAs.Toourknowledge,thisphenomenonhasnot beendescribedinotherstudies.Twoexamplesin C.elegans areshowninFigure5.3.Sincepre- miRNAscanbegeneratedbothinthecanonicalwayasfromalongpri-miRNAbyDroshaandin thenon-canonicalwaybytranscriptioninitiationandtermination,weaskwhichTICcorresponds totherealprimaryTSSofthemiRNAorcanbothofthemproducepre-miRNAs? Weproposedtwopossibleexplanationsforthisphenomenon.Theexplanationisthatthere couldbemultipleisoformsforthesemiRNAgenes.Forexample,theseconddiscoveredmiRNAin C.elegans (let-7)hasbeendetectedwithatleastthreeprimarytranscripts[17].Itwasalsoreported thatgenesoftenusealternativepromotersinadevelopmentalstageorcelltypewaythat canspreaduptothousandsofkilobases.Asanexample,aboutonehalfoftheprotein-coding genesinhumanandmousegenomeshavemultiplealternativepromoters[37].Therefore,these miRNAgenesmayalsocontainmulitplealternativepromotersthatcangenerateseveralisoforms (Figure5.4(A)).TheotherprimaryTICsmayproducelongermiRNAtranscriptsthataresubjectto thecanonicalmiRNAprocessingproceduresinvolvingDrosha.Sincethe5'-cappedpre-miRNAs aremostlikelygenerateddirectlybytranscription,twopathsmaybeabletoleadtothematuration ofthesamemiRNA:onewithDroshaandtheotherwithout.Becauseweusedthedatasetfrom mixed-stageembryos,thesemiRNAgenesmayemployalternativepromotersandproducediverse 115 Figure5.3:PrimaryTICsaredetectedupstreamof5'-cappedmiRNAscel-mir-51andcel-mir-53. MultipleCap-seqpeaksareobservedinpre-miRNAregions.Themappedreadswerevisualized byGenomeView[1].Eachblueorgreenbarcorrespondstoamappedread,wheregreenindicates theplusstrandandbluetheminusstrand.ThereadsinupperpanelarefromCap-seqdataset,with uniformlengthof36nt.ThereadsinlowerpanelarefromsmallRNA-seqdataset,withlengthin rangefrom14ntto26nt. isoformsatdifferentstages.WealsoobservedthatforsomemiRNAs(SupplementaryFiguresS5 -S7:cel-mir-235,cel-mir-244,cel-mir-238,andcel-mir-228),theupstreamTICsareverycloseto thepre-miRNA,likelythattheyaregeneratedfromthesamepromoterasthepre-capTIC. ThesecondexplanationisthattheTICsupstreamofpre-miRNAmaybetranscribedenhancers orpromoters,whichgeneratetranscriptsthatwillnotproducemiRNAs.Recentlyseveralstudies haveshownthatpromoterandenhancerregionscanbetranscribedinhuman,mouseand C.elegans genomes[40,141,27].Thetranscriptioninpromotersandenhancersareusuallyassociatedwith downstreamgenes.Itissuggestedin C.elegans thattheelongationfromanupstreamenhancer towardadownstreamgenemayhavethepotentialtodeliverPolIItoaproximalpromoter,oral- 116 ternativelyfunctiondirectlyasadistalpromoter[27].Thus,theupstreamTICsmaybetranscribed intheenhancerregionsandhavearegulatoryeffectonthedownstreammiRNAgenes(Figure5.4 (B)). Figure5.4:UpstreamprimaryTICsalsoexistform 7 Gcappedpre-miRNAs.(A)Twoalternative promotersforthesamemiRNA.Bothofthemareabletogeneratethetranscriptsthatproducethe samematuremiRNA.(B)ThemiRNAtranscriptisgeneratedbythepre-capTIC.UpstreamTIC(s) correspondtotranscribedenhancer(s). 5.3.45p-miRNAsareproducedfromthem 7 Gcappedpre-miRNAs Them 7 Gcappedpre-miRNAshavebeenreportedtoproducesingle3pmaturemiRNAsinmice [161].Themainexplanationisthatthecapped5p-miRNAisnotefloadedontoAgocom- plex.However,wenoticedthatsome5'-cappedpre-miRNAsin C.elegans alsogenerate 117 5pmaturemiRNAsbasedontheannotationsinmiRBase(TableS1).Whilethereisapossibility thatsomeoftheCap-seqreadsmappedtopre-miRNAsmaybeuncappedduetocontamination ortechnicalartifacts,mostofthesem 7 Gcappedpre-miRNAsarelyenriched withcappedRNAreadsaccordingtoourpreviousanalysis.Wesurmisethattheremightbeal- ternativepathwaysforproducing5p-miRNAsfromthose5'm 7 Gcappedpre-miRNAs.Similar observationswerealsomadefor5'-cappedpre-miRNAsinmice:fourofthem(mmu- mir-484,mmu-mir-1903,mmu-mir-344fandmmu-mir-344i)actuallyprefertogenerate5pmature miRNAs[161]. The5pmaturemiRNAscouldbegeneratedbyalternativeprimaryTSSs.Asmanyofthese 5'-cappedpre-miRNAsalsohavecandidateprimaryTSSs,canonicalprimarymiRNA transcriptsmaybegeneratedfromthemandproduce5p-miRNAs.Studieshaveshownthatmature miRNAselectionfrom5'and3'strandsofthesameprecursorishighlyregulatedandvariesunder differentcelltypes,developmentalstagesanddiseasestates[12,97].Capped5pmaturemiRNAs maybeproducedunderconditionstopromoteitstargetgene'sexpression. 5.3.5MultipletranscriptioninitiationsitesformiRNAclustersin C.elegans miRNAsinaclusterareclosetoeachother,usuallycoexpressedandtranscribedasasinglepri- miRNA[74,10].PreviouslyeachmiRNAclusterin C.elegans hasbeenannotatedwithone primaryTSSusingCap-seq[49]orGRO-cap[67]datasets.Strikingly,thedatasetsweusedhere haveshownthattheTICsformiRNAclustersin C.elegans canhaveabroaddistributionwith multiplestrongpeaksacrossthewholecluster.WeprimaryTICsfor8clusters,ofwhich 7haveTICsinsideoftheclusters.Oneexampleofclustercel-mir-35-41isshowninFigure5.5. Thesimilarphenomenonisalsoobservedinindividualpre-miRNAs,asshowninFigure5.3,the Cap-seqsignalwithincel-mir-51andcel-mir-53alsodisplaysmultiplestrongpeaks.Wenoticed 118 thatthesecappedreadspeaksinsideofclusterswerelocatedonbotharmsofpre-miRNAs,with 5p-peakshavethesame5'endsof5p-miRNAsand3p-peaksstartfromthe3'endsof3p-miRNAs (Figure5.5).Asanalyzedabove,notmanypre-miRNAsormaturemiRNAswerekeptintheCap- seqexperimentsandmanyofthereadsmappedarelikelytobecappedRNAs.Thisisfurther supportedbythecoordinatedifferencesbetweenthecappedpeaksandthemiRNA/miRNA*on3p arms.Weproposedthreehypothesestoexplainthisphenomenon. Figure5.5:CappedTICsdistributioninmiRNAclustercel-mir-35-41.Multiplestrongcapped peakshavebeenobservedinpre-miRNAsinthecluster.Asillustratedbythereddashedline,the Cap-seqpeaksonthe5parmhavethesamestartpositionasthe5p-miRNAs,whilethepeakson the3parmstartfromtheendofthe3p-miRNAs.ThelengthofCap-seqreadsis36nt. 5.3.5.15're-capafterpost-transcriptionalprocessing Ithasbeenreportedthatmaturelongtranscriptsofbothprotein-codingmRNAsandlongncR- NAsinhumancellscanbeprocessedpost-transcriptionallytoyieldsmallRNAs,whicharethen 119 bytheadditionofa5'-capstructure[44].Lateron,acytoplasmiccappingenzyme, whichisabletoadd5'-captotheendsofcleavedRNAswasinmurineerythroidand nonerythroidcells[110].Thiscytoplasmiccappingenzyme,togetherwithakinase,cantrans- fercovalentlyboundGMPontoa5'-monophosphateRNAtocreatea5'-GpppXRNA,butitcan notfunctiononRNAswith5'-hydroxylends[137].Thisphenomenonisknownascytoplasmic capping,whichhasbeenfoundinbothmurineandhumancells[110,103,64]. Thewellcoordinatedcappedreadsat5pand3parmsofpre-miRNAhairpinsindicatethatthe exposed5'endsofthemiRNAtranscriptsafterDroshacleavagemaybere-capped:thatiswhythe cappedreadswereobservedatthe5'endsof5pmiRNAsandthe3'endsof3pmiRNAs.Certainly, inthecanonicalmiRNAbiogenesis,pre-miRNAsareproducedfrompri-miRNAsinthenucleusby Drosha.Therefore,the5'monophosphateendsatthecleavagesitesmaybere-cappedbynuclear cappingenzymeinthenucleus.However,pri-miRNAsin C.elegans areexportedtothecytoplasm byXPO1andbeprocessedtopre-miRNAseitherinthenuclearporeorinthecytoplasm[19]. Consideringthediscoveryofcytoplasmiccappingenzyme,thepre-miRNAre-cappingismore likelytohappeninthecytoplasm. Theprocessingofpri-miRNAsinthecytoplasmrequirescytoplasmicmiRNAprocessors.It hasbeenshownthatcytoplasmicRNAviruseswhichencodemiRNAswereabletoproducefunc- tionalmiRNAsinthecytoplasmofBHK-21cells[125].Theprocessingofthesevirus-generated cytoplasmicpri-miRNAsalsoreliesonDroshabuttakesplaceinthecytoplasm[139].Basedon thisdiscovery,althoughwithoutdirectexperimentalvthereisapossibilitythatthesim- ilarcytoplasmicmiRNAprocessorsinvolvingDroshaalsoexistin C.elegans cells.Allthese ingssupportanewmodelofmiRNAbiogenesisin C.elegans inwhichpri-miRNAsareexported tothecytoplasmbyXPO1,wheretheyarecleavedbyDroshaandfurtherprocessed.Transcripts ofmiRNAclustersmayundergocytoplasmicre-cappingduringthecleavageprocess(Figure5.6). 120 Figure5.6:ModelformiRNAcytoplasmicre-capping.Inthisnon-canonicalmiRNApathway, pri-miRNAsareexportedtocytoplasmbyXPO1andprocessedthere.Duringthepre-miRNA generatingprocess,m 7 G-capsareaddedtothe5'endsofnewlygeneratedpre-miRNAandpri- miRNAleftoverbycytoplasmiccappingenzyme. ThecytoplasmicrecappingofclustermiRNAscouldproducecapped5p-miRNAswhichmay notbeefloadedonAgo.Indeed,fromtheannotationsinmiRBase,mostoftheclusters withcappedTICsinsidefavorablygenerate3pmaturemiRNAs.However,theclustersmir-41- 44andmir-86-8211doprefertogenerate5pmaturemiRNAs.Heresimilartothe5'-capped independentpre-miRNAs,therecappingmaybeacontrolledprocesswhichonlyoccuratcertain celltypes,developmentalstages,ordiseasestates.Forexample,withthecappedshortRNAs datafrom C.elegans youngadultstage[49],wedidnotobservecappedreadspeaksinsideofthe miRNAclusters.TherecappingcouldservetoregulatethestrandselectiononthesemiRNAs, suppressingthe5p-miRNAsandpromotingtheexpressionoftheirtargets. 5.3.5.2MultipleTSSscanbegeneratedfromthesamepromoter PreviousstudieshaveshownthatmostcorepromotersdonothaveasingleTSS,butmultiplestart sitesthatarecloselylocated[135,46].ThebroadTSSpromotersin C.elegans areoftenenriched forCpGisland,whilesharpTSSpromotersoftenhaveTATA-box.Forsomepromoters,although 121 TSSsaredistributedoveralargeregion,mosttranscriptioninitiatesatonenucleotide position[135].Therefore,multiplecappedpeaksinindividualpre-miRNAsormiRNAclusters maybegeneratedfromthebroadpromoterasmultipleTSSs.Toevidencesupportingthis,we scannedtheproximalpromotersofthesemiRNAgenesformotifsofTATA-box,Inr,DPEandBRE withpositionweightmatricesderivedfromdatabaseJASPAR[134].GCcontentandCpGnumber arealsosearchedinthepromoterregions(Table5.1).Weobservethatthepromotersofthese miRNAclustersandindividual5'-cappedmiRNAsareusuallyGCrich:theGCcontentoneach chromosomein C.elegans isalmostthesame,withavalueof36% 1%.ButtheGCcontentis usuallyabove40%forpromotersofindividualpre-miRNAs,andabove50%formiRNAclusters (Table5.1).Inaddition,theretendstobeapositivecorrelationbetweentheGCcontent/CpG numberandthenumberofcappedpeaks:higherGCcontent/moreCpGresultinmorestrong cappedpeaks. Table5.1:GCcontent,OEratioandCpGnumberformiRNApromoterswithmultipleTSSs.(A) miRNAclusters.(B)individualmiRNAs. Promoters Location GCcontent OEratio CPGnumber (A)ClustermiRNAs II:11537326-11537576 0.55 1.18 18 II:11889425-11889675 0.56 1.09 13 III:11937020-11937270 0.52 0.93 9 III:2172175-2172425 0.55 1.0 15 X:2368553-2368803 0.51 0.76 9 X:13145204-13145454 0.6 1.34 24 (B)IndividualmiRNAs cel-mir-244I:4684364-4684614 0.4 0.82 8 cel-mir-235I:6162337-6162587 0.41 0.76 8 cel-mir-238III:8867375-8867625 0.4 0.66 6 cel-mir-228IV:5561825-5562075 0.43 1.18 12 cel-mir-51IV:11026062-11026312 0.51 1.09 17 cel-mir-53IV:11027641-11027891 0.45 1.38 16 cel-mir-49X:9989082-9989332 0.51 0.75 12 122 ThesecappedreadspeaksarecorrelatedtothematuremiRNAsequences(Figure5.3,5.5). ThereisalowprobabilitythatmultiplepeakswithinmiRNAclusteraresimplygeneratedran- domly.Ithasbeensuggestedthattranscriptionstartusageofeachnucleotidecanbepredicted fromlocalDNAsequence[46].Therefore,itislikelythatthepositionsoftheseTSSsaremainly determinedbynucleotidesequence. 5.3.5.3Pre-miRNAsinaclustercanbetranscribedindependently Pre-miRNAsfromthesamegenomicclustercanbetranscribedandregulatedindependently[157]. Forexample,althoughtheprimarytranscriptsofmousemir-433andmir-127weredetectedas overlappingina5'-3'unidirectionalway,experimentshavevedthattheyweretranscribed independentlyfromeachother[143].Accordingtothismodel,thepre-miRNAsinthesamecluster maytranscribeindependently,producingmultiple5'm 7 GcappedRNApeakslocatedinsideofthe cluster.ThecappedRNApeaksatthe5parmsindicatethatthe5'endofthepre-miRNAmaybe determinedbytranscriptioninitiation[161],whilethecappedpeaksatthe3parmsmaycorrespond totheTSSsforthesubsequentpre-miRNAs.Wenoticedthatmanypre-miRNAsintheclusterhave boththecappedRNApeaksonitsown5'endandthe3parmofupstreampre-miRNA.Wehave speculatedthatpre-miRNAsmightbeabletobegeneratedinthecanonicalwaywithDroshaor inthenon-canonicalwaybytranscriptioninitiationandtermination,sohereagain,thesamepre- miRNAmaybegeneratedbydifferentmechanisms:onewithDroshaandtheotherwithout. 123 5.3.6ChromatinandPolIIprofprimarymiRNApromoters 5.3.6.1ofdivergentandmultipleTSSspromoters Tocharacterizethesepri-miRNATSSs,weanalysedthedistributionofchromatinand PolIIfeaturessurroundingthem.Weobservedthatpromotersin C.elegans oftengeneratediver- gentormultipletranscriptswiththeCap-seqdatasetsused.Toidentifythesepromoters,we thedivergent/bidirectionalandbroadpromoters(promoterswithmultipleTSSs)fromthosetran- scriptioninitiationclusters(TICs).Ifthedistancebetweentwoadjacentplusstrandandminus strandTICsislessthanorequalto300bp,theyarecombinedasfromthesamedivergentpro- moter.TICsonthesamestrandareclusteredtogetherifthedistancebetweentwoadjacentTICs iswithin500bp.TheseclusteredTICsonthesamestrandwillthebroadpromoters(see detailsinMaterialsandMethods).Inthewholegenomeof C.elegans ,wedetected11,272pro- moters,ofwhich6,149aredivergentpromotersand2,359arebroadpromoters(Figure5.7(A)). Themostupstream5'endsofbothplusandminusstrandTICswereusedtorepresenttheTSSsof thepromoter. Figure5.7:(A)Promotersdistributionin C.elegans .(B)Detectingbidirectionalandmultiple transcriptionpromotersin C.elegans . Withallthepromoters,wefocusedonthosewitheithermultipletranscripts(atleast 3TSSs)orbidirectionality.Thatis,thebroadpromotersarenon-bidirectionalandbidirectional 124 promotersdonotcontainmultipleTSSs.WealignedtheTSSsofthesepromoters,andplotted H3K4me3andPolIIsignalsurroundingthem(Figure5.8(A,B)).Theresultshowsthat H3K4me3signalismuchstrongerinthedownstreamofbroadpromotersthanbidirectionalpro- moters,indicatingthatH3K4me3isstronglycorrelatedwithtranscriptioninitiations.Interestingly, weobservethatPolIIsignalinthedownstreamofbroadpromotersisalsoslightlystrongerthan intheupstream,whichwouldnotbeobservedifweuseallpromoterswithmultipleTSSs(in- cludingbidirectionalpromoters(FigureS2(D)).ThehigherlevelofdownstreamPolIIsignalmay becausedbythePolIIpausingatthetranscriptioninitiationsites.IthasbeenshownthatPolII promoter-proximalpausingisrarein C.elegans [67],whichmayexplainwhythePolIIsignal differencebetweenupstreamanddownstreamisnotbig. 5.3.6.2ChromatinandPolIIprsurroundingmiRNApromoters WethenassignedthesepromoterstotheintergenicmiRNAsin C.elegans .Foreach ofthem,theregionbetweenits5'endandthemostcloseupstreamgene's3'endis searchedforpromoters,andthemostclosepromoterisassignedtothemiRNAasitsprimary promoter.Again,wealignedthesemiRNAswiththeirprimaryTSSsacquiredfromtheassigned promotersandplottedtheH3K4me3andPolIIsignalsurroudingtheTSSs.Theresultsareshown inFigure5.8(C,D).H3K4me3signalsurroundingmiRNAbroadpromoterspeaksatthe TSSswhilethesignalsurroundingbidirectionalpromoterspeaksattheupstream600bpposition, indicatingthattheremaybemuchstrongerminustranscriptsforthesebidirectionalpromoters. ForPolIIsignal,theintensitiesaresimilarintheupstreamofbothmiRNAbroadpromotersand bidirectionalpromoters.However,againthedownstreamofbroadpromotershasastrongerPolII signal,suggestingPolIIisalsopausedintheproximalpromotersofmiRNAgenes. 125 Figure5.8:(A,B):H3K4me3andPolIIsignalsurroundingbidirectionalandbroadpro- moters.(C,D):H3K4me3andPolIIsignalsurroundingmiRNAbidirectionalandbroad promoters. 5.4Discussion Inthisstudy,weusedCap-seqdatasetstoannotatetheprimaryTSSsforintergenicmiRNAgenesto 1baseresolutionin C.elegans andmouse.Intotal,weannotatedtheprimaryTSSsfor70miRNAs and8miRNAclustersin C.elegans ,and37miRNAsonmousechromosome7.Comparingwith previousworkusingcappedRNA-seqmethods[49,67],wehaveannotatedtheTSSsformany moremiRNAsinbothspecies.WenoticedthattheCap-seqdatasetsweusedareeithergenerated frommixed-stageembryos( C.elegans )ormixedtissues(newbornmouse),whiletheworkswe comparedwithonlyutilizedfewstageddatasetsin C.elegans [49,67]orsingletissuedataset inmouse[49].Therefore,thelimitedtissuecontextmayexplainwhyfewermiRNAgeneswere 126 expressedanddetectedintheirstudies. Similartothepreviousstudy[161],whichdetectedaspecialclassofpre-miRNAsthatare5' m 7 Gcappedinmice,herewealsothistypeofpre-miRNAsin C.elegans .5'-capped pre-miRNAsinmiceprefertobeexportedtothecytoplasmbyXPO1insteadofXPO5[161]. Comparingtomammals,XPO5oritshomologueisnotencodedin C.elegans butXPO1is[104]. Therefore,manymore5'-cappedpre-miRNAswereexpectedtobein C.elegans .But usingtheCap-seqdatasetsofmixed-stageembrosof C.elegans ,wehaveonlydetected9candi- date5'-cappedpre-miRNAs.TheXPO1andcap-bindingcomplex(CBC)havebeenreportedtoact jointlytoexportpri-miRNAsin C.elegans andDrosophila,buthowthesubsequentpre-miRNAs areprocessedfromthesepri-miRNAsinthecytoplasmremainsunclear[19].Accordingly,both 5'-cappedpre-miRNAsandnormalpri-miRNAscanbeexportedtothecytoplasmandtheirsub- sequentprocessingmaybedifferentfromthatinthecanonicalmiRNAbiogenesispathway.Since XPO1doesnotexport5'-cappedpre-miRNAsin C.elegans ,itsexistencedoesnot necessarilyresultinmore5'-cappedpre-miRNAs.Thus,itisnotoddthatonlyasmallnumberof theseunusualpre-miRNAsareobservedin C.elegans . IthasbeensuggestedthatXPO1-dependentm 7 Gcappedpre-miRNAsmayrepresentagroupof ancientmiRNAsthatappearedbeforetheemergenceofXPO5[161].Therefore,thesem 7 Gcapped pre-miRNAsmaybewellconservedindifferentlineages.Wecheckedtheconservationofthose 5'-cappedpre-miRNAsin C.elegans andmousefrommiRviewer[62].Surprisingly, almostallofthesem 7 Gcappedpre-miRNAsarelineagetheyaredetected eitheronlyincaenorhhabditisormuroidea. Inbothmouseand C.elegans ,wehaveaclassofpre-miRNAsthatare5'm 7 G cappedbutalsopossesupstreamcandidateprimaryTSSs.Ithasbeensuggestedthatmostgenes inmammalsarenottranscribedfromasingleTSS,butmultipleTSSsthatarecloselylocated 127 over50to100nucleotides[21,135,46,36].Indeed,weobservedthatsomeupstreamTICsare veryclosetothepre-miRNAs,suggestingthattheproximalcorepromoterofthesemiRNAgenes maygeneratemultipletranscriptioninitiationsthatarecloselylocatedtoeachother.However,the otherfurtherupstreamTICs(over500nt)areunlikelytobegeneratedfromthesamepromoteras thepre-capTICs.Manygeneshavealternativepromotersandcangeneratemultipleisoformsin differentcelltypes,tissuesordevelopmentalstages[69,7].SinceweusedtheCap-seqdatasets ofmixedstagedembryosof C.elegans ,thesemiRNAgenesmayutilizealternativepromotersat differentstagestogeneratediversetranscriptswhicharesubjecttodistinctprocessingprocedures. Thatis,these5'-cappedpre-miRNAsmaybeonlygeneratedincelltypes,developmental stagesordiseasestates.ConsideringthattheseXPO1-dependent5'm 7 Gcappedpre-miRNAs maybelongtoagroupofancientmiRNAs,itisreasonablethatatleastsomeofthemshouldbe conservedalongdifferentlineagesduringevolution.However,wehaveshownthatalmostallof these5'm 7 Gcappedpre-miRNAsarelineageThereisapossibilitythatthese miRNAgenesarenewlygeneratedandacquiretheirabilityofproducing5'-cappedpre-miRNAs laterintheevolution.Inextremecases,mostofthese5'-cappedpre-miRNAsmayhaveupstream distalpromotersthatcanproducethenormalprimarytranscriptsattheothertime.Thesituation wherem 7 Gcappedpre-miRNAsareexpressedmayduetothelackofDroshaorotherrelated microprocessor.TheabilityofgeneratingthesamemiRNAswithorwithoutDroshamayhelpthe organismtobetteradapttothecomplexandchangeableenvironment. Manyofthe5'm 7 Gcappedpre-miRNAsin C.elegans preferentiallygenerate5p- miRNAs,whichisdifferentfromthatinmiceaspreviouslyreported.Thesepre-miRNAsusually alsohaveupstreamcandidateprimaryTSSs.Wesurmisedthatthese5'-cappedpre-miRNAsmay producethe5p-miRNAsusingthealternativeupstreamprimaryTSSs.However,thisstillneeds furthereffortstoclarify. 128 Multiplecappedpeakshavebeenobservedwithinpre-miRNAsinacluster,andthewellco- ordinatedrelationshipbetweenthecappedpeaksandthematuremiRNAssuggestthatthe5'm 7 G capmaybeaddedduringthepre-miRNAgeneratingprocess.WeproposedanewmodelofmiRNA biogenesisinwhichthepri-miRNAsarecleavedandcappedinthecytoplasmsimultaneously(Fig- ure5.6).WenoticedthatthecytoplasmiccappingisprominentinmiRNAclusters,likelybecause theexposed5'endofmiRNAclustertranscriptsduringcleavageneedtobeprotected.Therefore, them 7 Gcapisaddedandprotectsthecleavedtranscriptsfrombeingdegradedduringthecleavage process.Then,thesubsequentpre-miRNAscanbegeneratedsuccessfully. Thepre-miRNAsinclustersmaybetranscribedindependently.Togainmoreevidenceforthis model,wealsolookedatthethesequencemotifsurroundingthemodesofpre-capTICswithin clusteredpre-miRNAs.TheYRmotif,whichwasshownattheputativeprimarymiRNATSSsand pre-capTSSsofindependentpre-miRNAs,isnotobserved(FigureS8).Instead,aweakconsensus sequenceoffiTNGG"isdetected,inwhichfiN"locatesatthe+1positionrepresentingthemodes oftheTICs.Therefore,atleastforsomemiRNAclusters,theindependenttranscriptionmodelis notsupportedbytheYRmotif. 129 Chapter6 ConclusionandFuturework 6.1Conlusions Characterizingthecompositionandfunctionsofmicrobialcommunitiesfromhost-relatedorenvi- ronmentalsamplesusingmetagenomicsequencinghasbecomeafavorablemethod.Inthisstudy, wefocusedoncharacterizingthevirusquasispeciesfromviralmetagenomicdata.Towardsthis goal,wedevelopedapipelineconsistingofthreetoolsforassemblingviralstrainsinaquasispecies andestimatinghaplotypenumberandcorrespondingabundances.ThethreetoolsareTAR-VIR, PEHaplo,andVirBin. TAR-VIRisthetoolforenrichingreadsoftargetedviruseswithremotehomologorpartial referencesfrommetagenomicdata.Comparedtothestandardmethodthatreliesonreadmapping toclassifyreadsofdifferentviruses,TAR-VIRhashighersensitivitytoidentifydivergentstrains ofknownvirusfamilies,whichisakeyfunctionforfast-mutatingvirusessuchasHIV,HCVetc.It appliedanexternaloverlapextensionsteptoiterativelyrecruitreadsfrominitiallyaligned"seed" reads.Theoverlapdetectionbetweenlarge-scalemetagenomicsequencingreadswasimplemented intheefBurrows-WheelerTransform(BWT)andFM-indexalgorithms.TAR-VIRcanbe usedasapreliminarysteptoisolateandenrichreadsfrommetagenomicdatabeforeassemblyeven withouthigh-qualityreferences. PEHaploisthe denovo assemblytoolsforreconstructingviralhaplotypesfromvirusqua- sispeciessequencingdata.Ittakesadvantageofthelonginsertsizeofpaired-endsequencingto 130 differentiatehaplotypessharinglongcommonregions.ThemethodsofPEHaploarebasedon overlapgraph,whichreconstructhaplotypesbythepathswell-supportedbypaired-end reads.Whilepaired-endinformationhasbeenusedbymanyotherassemblytools,PEHaplodida thoroughanalysisoftherelationshipbetweenLCSandtheinsertsize,anddiveddeeplyintothe utilizationofpaired-endreadsforremovingfalseedgesandndingcorrectpaths.Thebenchmark resultswithseveralrecentlypublishedviralquasispeciesassemblytoolsdemonstratethatPEHaplo isabletoproduceaccurateandmuchlongercontigs. TheoutputofPEHaploarecontigs,whicharepartialsequencesforviralgenomes.Tobetter characterizeaviralquasispecies,weneedtoknowthenumberofviralhaplotypesandtheircorre- spondingabundances.VirBinisthebinningtoolthattakesassembledcontigsasinput,estimates thenumberofhaplotypes,clusterscontigsintogroups,andcalculatestheabundancesforeachhap- lotype.Thebinningproblemforviralquasispeciesisverydifferenttothecanonicalspecies-level binningproblemsbecausecommonlyadoptedfeaturessuchask-merfrequencyarenotabletodis- tinguishdifferentstrains.VirBintakesadvantageofthehighsequencesimilaritiesbetweenviral strainsinaquasispecies,estimatesthenumberofhaplotypesbyaligningcontigstoeachother.The relativeabundancesforeachcontigcanalsobemoreaccuratelycalculatedfromthecontigalign- mentresults.Basedonthecontigabundances,VirBinappliesanEMalgorithmtoclustercontigs intodifferentgroups.ThebenchmarkresultswithtoolMaxBin[160]haveshownitssuperiority onbothsimulatedorrealviralquasispeciesdata. AtthebeginningofmyPh.D.study,Ialsoworkedonaprojectforstudyingthetranscriptional regulationofmiRNAgenesinC.elegansandmice.Inthisproject,weutilizedCap-seqdatato identifytheprimarytranscriptionalstartsitesformiRNAgenes,characterizedaspecialgroupof miRNAswhoseprecursorsare5'-cappedandmaybedirectlygeneratedbytranscription. 131 6.2Futurework 6.2.1Reference-freeviralsequences WhileTAR-VIRisabletorecruitviralreadswithpartialorremotelyrelatedhomologousrefer- ences,themethodisnotapplicabletonewviruseswithoutanyavailablereferences.Therefore, directlyclassifyingviralreadsfrommetagenomicdatawithoutreferencesmaystillbeneeded.For directlyclassifyingviralreads,afundamentalquestiontoaskiswhetherviralsequencescanbe differentiatedfromotherbacterialoreukaryoticsequences.However,thecanonicalsequenceanal- ysis(suchask-merfrequencies)resultsonNGSreadsshowedthatviralreadsarehighlydiverse andarediftobedifferentiatedfromothersequences. Thedeeplearningmethods,withmultiplelayersofneuralnetworksandnumerous trainableparameters,arepowerfulforTheydonotrequireelaboratelydesigned sequencefeaturesasinputandcantakesimplyencodedrawsequencesasinputfor Whetherwecantakeadvantageofdeeplearningmodelsfordirectlyclassifyingviralreadsfrom metagenomicdatasetsisworthexploring. 6.2.2Virusquasispeciesassemblyfuturework Therearestillsomecomputationalchallengesremaininginvirusquasispeciesassembly. Assemblyoflowabundancehaplotypes Oneofthechallengesisthatrarehaplotypesare diftoassemblefromthedata.Whenthesequencingcoverageislowforaviralstrain,there maynotbeenoughoverlapsbetweenreads,resultinginadisjoinedoverlapgraphwithmultiple smallconnectedsubgraphs.Inaddition,evenifsequencingdepthsforrarehaplotypesaredeep enough,theedgesbetweensharednodesandrarehaplotypenodesmaybesuppressedbytheedges betweensharednodesandabundanthaplotypenodes,resultinginfragmentedcontigs. 132 LCSislongerthanpaired-endinsertsize AnotherchallengeisthesituationwhenLCSbe- tweentwohaplotypesismuchlongerthantheaveragepaired-endinsertsize.Inthiscase,few readpairscangoacrossthecommonnodes,thuscorrectpathsfromthegraphisdif Theonlyavailablefeatureisthereadscoverage.Whenthecoveragesforthesetwohaplotypes aredifthealgorithmisabletochoosethesuccessorwithsimilarcoverageto thecurrentpathforappending.However,ifthesequencingcoveragesofthetwohaplotypesare similar,thepathextendingishard.WithshortreadsfromNGS,oneofthepossiblesolutionsis toincreasetheinsertsize.Alternatively,wemaytakeadvantageofthelongreadsfromthird- generationsequencing. 6.2.3Binninglow-qualitycontigs Tocorrectlyestimatethenumberofhaplotypesandcalculatetherelativeabundancesforcon- tigs,thecontigsneedtobealigned,andwindowsheavilydependsonthealignment. However,whentherearechimericcontigsorcontigswitherrors,eitherindels/mismatches,the alignmentmaynotbeaccurateandeventuallyresultinaninaccurateestimationofhaplotypenum- berorclustering.Howtodevelop/improvethebinningalgorithmstorobustlyhandlelow-quality contigsisanotherproblemtobefurtherinvestigated. 133 BIBLIOGRAPHY 134 BIBLIOGRAPHY [1] T.Abeel,T.VanParys,Y.Saeys,J.Galagan,andY.VandePeer.GenomeView:anext- generationgenomebrowser. Nucl.AcidsRes. ,40(2):e12,2012. [2] A.Allam,P.Kalnis,andV.Solovyev.Karect:accuratecorrectionofsubstitution,insertion anddeletionerrorsfornext-generationsequencingdata. Bioinformatics ,pagebtv415,2015. [3] R.AndinoandE.Domingo.Viralquasispecies. Virology ,479:46Œ51,2015. [4] I.Astrovskaya,B.Tork,S.Mangul,K.Westbrooks,I.M andoiu,P.Balfe,andA.Zelikovsky. Inferringviralquasispeciesspectrafrom454pyrosequencingreads. BMCbioinformatics , 12(6):S1,2011. [5] J.Baaijens,A.Z.ElAabidine,E.Rivals,andA.Schoenhuth.Denovoassemblyofviral quasispeciesusingoverlapgraphs. bioRxiv ,page080341,2016. [6] J.A.Baaijens,A.Z.ElAabidine,E.Rivals,andA.Schönhuth.Denovoassemblyofviral quasispeciesusingoverlapgraphs. Genomeresearch ,27(5):835Œ848,2017. [7] D.Baek,C.Davis,B.Ewing,D.Gordon,andP.Green.Characterizationandpredictive discoveryofevolutionarilyconservedmammalianalternativepromoters. GenomeRes. , 17(2):145Œ155,2007. [8] A.Bankevich,S.Nurk,D.Antipov,A.A.Gurevich,M.Dvorkin,A.S.Kulikov,V.M. Lesin,S.I.Nikolenko,S.Pham,A.D.Prjibelski,etal.SPAdes:anewgenomeassembly algorithmanditsapplicationstosingle-cellsequencing. Journalofcomputationalbiology , 19(5):455Œ477,2012. [9] V.C.Barbosa,R.Donangelo,andS.R.Souza.Quasispeciesdynamicswithnetworkcon- straints. Journaloftheoreticalbiology ,312:114Œ119,2012. [10] E.Berezikov.EvolutionofmicroRNAdiversityandregulationinanimals. NatureRev. Genet. ,12(12):846Œ860,Dec.2011. [11] E.Berezikov,W.-J.Chung,J.Willis,E.Cuppen,andE.C.Lai.Mammalianmirtrongenes. Mol.Cell ,28(2):328Œ336,2007. [12] M.Biasiolo,G.Sales,M.Lionetti,L.Agnelli,K.Todoerti,A.Bisognin,A.Coppe,C.Ro- mualdi,A.Neri,andS.Bortoluzzi.ImpactofhostgenesandstrandselectiononmiRNA andmiRNA*expression. PLoSONE ,6(8):1Œ11,2011. [13] C.K.BiebricherandM.Eigen.Theerrorthreshold. Virusresearch ,107(2):117Œ127,2005. 135 [14] S.Boisvert,F.Raymond,É.Godzaridis,F.Laviolette,andJ.Corbeil.Raymeta:scalable denovometagenomeassemblyand Genomebiology ,13(12):R122,2012. [15] A.M.Bolger,M.Lohse,andB.Usadel.Trimmomatic:axibletrimmerforIllumina sequencedata. Bioinformatics ,30(15):2114Œ2120,2014. [16] G.M.Borchert,W.Lanier,andB.L.Davidson.RNApolymeraseIIItranscribeshuman microRNAs. NatureStruct.Mol.Biol. ,13(12):1097Œ1101,2006. [17] J.Bracht,S.Hunter,R.Eachus,P.Weeks,andA.E.Pasquinelli.Trans-splicingand polyadenylationoflet-7microRNAprimarytranscripts. RNA ,10(10):1586Œ1594,2004. [18] M.BurrowsandD.J.Wheeler.Ablock-sortinglosslessdatacompressionalgorithm.1994. [19] I.Büssing,S.YangJr,E.C.Lai,andH.Großhans.ThenuclearexportreceptorXPO-1 supportsprimarymiRNAprocessinginC.elegansandDrosophila. EMBOJ. ,29(11):1830Œ 1839,2010. [20] C.Camacho,G.Coulouris,V.Avagyan,N.Ma,J.Papadopoulos,K.Bealer,andT.L.Mad- den.BLAST+:architectureandapplications. BMCbioinformatics ,10(1):421,2009. [21] P.Carninci,A.Sandelin,B.Lenhard,S.Katayama,K.Shimokawa,J.Ponjavic,C.A.Sem- ple,M.S.Taylor,P.G.Engström,M.C.Frith,etal.Genome-wideanalysisofmammalian promoterarchitectureandevolution. NatureGenet. ,38(6):626Œ635,2006. [22] M.J.Chaisson,D.Brinza,andP.A.Pevzner.Denovofragmentassemblywithshortmate- pairedreads:Doesthereadlengthmatter? Genomeresearch ,19(2):336Œ346,2009. [23] M.CHAN-YEUNGandR.-H.XU.SARS:epidemiology. Respirology ,8(s1),2003. [24] T.-C.Chang,M.Pertea,S.Lee,S.L.Salzberg,andJ.T.Mendell.Genome-wideannotation ofmicrornaprimarytranscriptstructuresrevealsnovelregulatorymechanisms. Genome research ,25(9):1401Œ1409,2015. [25] J.Chen,Z.Dai,C.Cao,Q.Zhang,H.Liu,andX.Sun.Next-generationsequencingdata processing:Analysisofunmappedreadsandextremelyhighmappedpeaks.In Biomedical EngineeringandInformatics(BMEI),20125thInternationalConferenceon ,pages893Œ 897.IEEE,2012. [26] J.Chen,Y.Zhao,andY.Sun.Denovohaplotypereconstructioninviralquasispeciesusing paired-endreadguidedpath Bioinformatics ,1:9,2018. [27] R.A.-J.Chen,T.A.Down,P.Stempor,Q.B.Chen,T.A.Egelhofer,L.W.Hillier,T.E. Jeffers,andJ.Ahringer.ThelandscapeofRNApolymeraseIItranscriptioninitiationin C.elegansrevealspromoterandenhancerarchitectures. GenomeRes. ,23(8):1339Œ1347, 136 2013. [28] T.P.Chendrimada,R.I.Gregory,E.Kumaraswamy,J.Norman,N.Cooch,K.Nishikura, andR.Shiekhattar.TRBPrecruitstheDicercomplextoAgo2formicroRNAprocessing andgenesilencing. Nature ,436(7051):740Œ744,2005. [29] C.-H.Chien,Y.-M.Sun,W.-C.Chang,P.-Y.Chiang-Hsieh,T.-Y.Lee,W.-C.Tsai,J.-T. Horng,A.-P.Tsou,andH.-D.Huang.Identifyingtranscriptionalstartsitesofhumanmi- croRNAsbasedonhigh-throughputsequencingdata. Nucl.AcidsRes. ,39(21):9345Œ9356, 2011. [30] F.S.CollinsandH.Varmus.Anewinitiativeonprecisionmedicine. NewEnglandJournal ofMedicine ,372(9):793Œ795,2015. [31] D.L.Corcoran,K.V.Pandit,B.Gordon,A.Bhattacharjee,N.Kaminski,andP.V.Benos. FeaturesofmammalianmicroRNApromotersemergefrompolymeraseIIchromatinim- munoprecipitationdata. PLoSONE ,4(4):e5279,2009. [32] C.CoronnelloandP.V.Benos.Comir:combinatorialmicrornatargetpredictiontool. Nucl. AcidsRes. ,41(W1):W159ŒW164,2013. [33] A.K.Corsi.Abiochemist'sguidetoCaenorhabditiselegans. Anal.Biochem. ,359(1):1Œ17, 2006. [34] G.E.Crooks,G.Hon,J.-M.Chandonia,andS.E.Brenner.WebLogo:asequencelogo generator. GenomeRes. ,14(6):1188Œ1190,2004. [35] L.Dai,K.Chen,B.Youngren,J.Kulina,A.Yang,Z.Guo,J.Li,P.Yu,andS.Gu.Cy- toplasmicdroshaactivitygeneratedbyalternativesplicing. NucleicAcidsResearch ,page gkw668,2016. [36] Y.M.Danino,D.Even,D.Ideses,andT.Juven-Gershon.Thecorepromoter:attheheart ofgeneexpression. Biochim.Biophys.Acta ,1849(8):1116Œ1131,2015. [37] R.V.Davuluri,Y.Suzuki,S.Sugano,C.Plass,andT.H.-M.Huang.Thefunctionalconse- quencesofalternativepromoteruseinmammaliangenomes. TrendsGenet. ,24(4):167Œ177, 2008. [38] M.deHoonandY.Hayashizaki.Deepcapanalysisgeneexpression(CAGE):genome- wideofpromoters,oftheirexpression,andnetworkinference. Biotechniques ,44(5):627,2008. [39] F.DiGiallonardo,A.Töpfer,M.Rey,S.Prabhakaran,Y.Duport,C.Leemann,S.Schmutz, N.K.Campbell,B.Joos,M.R.Lecca,etal.Full-lengthhaplotypereconstructiontoinfer thestructureofheterogeneousviruspopulations. Nucleicacidsresearch ,42(14):e115Œe115, 137 2014. [40] S.Djebali,C.A.Davis,A.Merkel,A.Dobin,T.Lassmann,A.Mortazavi,A.Tanzer,J.La- garde,W.Lin,F.Schlesinger,etal.Landscapeoftranscriptioninhumancells. Nature , 489(7414):101Œ108,2012. [41] E.Domingo,J.Sheldon,andC.Perales.Viralquasispeciesevolution. Microbiologyand MolecularBiologyReviews ,76(2):159Œ216,2012. [42] M.Eigen.Errorcatastropheandantiviralstrategy. ProceedingsoftheNationalAcademyof Sciences ,99(21):13374Œ13376,2002. [43] M.Fedurco,A.Romieu,S.Williams,I.Lawrence,andG.Turcatti.Bta,anovelreagent fordnaattachmentonglassandefgenerationofsolid-phasednacolonies. Nucleicacidsresearch ,34(3):e22Œe22,2006. [44] K.Fejes-Toth,V.Sotirova,R.Sachidanandam,G.Assaf,G.J.Hannon,P.Kapranov,S.Fois- sac,A.T.Willingham,R.Duttagupta,E.Dumais,etal.Post-transcriptionalprocessing generatesadiversityoflongandshortRNAs. Nature ,457(7232):1028Œ1032, 2009. [45] K.A.Frazer,L.Pachter,A.Poliakov,E.M.Rubin,andI.Dubchak.VISTA:computational toolsforcomparativegenomics. Nucleicacidsresearch ,32(suppl_2):W273ŒW279,2004. [46] M.C.Frith,E.Valen,A.Krogh,Y.Hayashizaki,P.Carninci,andA.Sandelin.Acodefor transcriptioninitiationinmammaliangenomes. GenomeRes. ,18(1):1Œ12,2008. [47] G.GonnellaandS.Kurtz.Readjoiner:afastandmemoryefstringgraph-based sequenceassembler. BMCbioinformatics ,13(1):82,2012. [48] S.GrifR.J.Grocock,S.VanDongen,A.Bateman,andA.J.Enright.miRBase: microRNAsequences,targetsandgenenomenclature. Nucl.AcidsRes. ,34(suppl1):D140Œ D144,2006. [49] W.Gu,H.-C.Lee,D.Chaves,E.M.Youngman,G.J.Pazour,D.ConteJr,andC.C.Mello. CapSeqandCIP-TAPIdentifyPolIIStartSitesandRevealCappedSmallRNAsasC. eleganspiRNAPrecursors. Cell ,151(7):1488Œ1500,2012. [50] D.eld. Algorithmsonstrings,treesandsequences:computerscienceandcomputa- tionalbiology .Cambridgeuniversitypress,1997. [51] M.HajRachidandQ.Malluhi.Apracticalandscalabletooltooverlapsbetween sequences. BioMedresearchinternational ,2015,2015. [52] J.J.Holland,E.Domingo,J.C.delaTorre,andD.A.Steinhauer.Mutationfrequenciesat 138 singlecodonsitesinvesicularstomatitisvirusandpolioviruscanbeincreasedonly slightlybychemicalmutagenesis. Journalofvirology ,64(8):3960,1990. [53] L.Z.Hong,S.Hong,H.T.Wong,P.P.Aw,Y.Cheng,A.Wilm,P.F.deSessions,S.G.Lim, N.Nagarajan,M.L.Hibberd,etal.Base-seq:amethodforobtaininglongviralhaplotypes fromshortsequencereads. Genomebiology ,15(11):517,2014. [54] A.Huang,R.Kantor,A.DeLong,L.Schreier,andS.Istrail.Qcolors:Analgorithmforcon- servativeviralquasispeciesreconstructionfromshortandnon-contiguousnextgeneration sequencingreads. Insilicobiology ,11(5,6):193Œ201,2011. [55] W.Huang,L.Li,J.R.Myers,andG.T.Marth.ART:anext-generationsequencingread simulator. Bioinformatics ,28(4):593Œ594,2011. [56] W.Huang,L.Li,J.R.Myers,andG.T.Marth.Art:anext-generationsequencingread simulator. Bioinformatics ,28(4):593Œ594,2012. [57] M.Hunt,A.Gall,S.H.Ong,J.Brener,B.Ferns,P.Goulder,E.Nastouli,J.A.Keane,P.Kel- lam,andT.D.Otto.Iva:accuratedenovoassemblyofrnavirusgenomes. Bioinformatics , 31(14):2374Œ2376,2015. [58] J.L.JamesonandD.L.Longo.PrecisionmedicineŠpersonalized,problematic,and promising. Obstetrical&GynecologicalSurvey ,70(10):612Œ614,2015. [59] D.Jayasundara,I.Saeed,S.Maheswararajah,B.Chang,S.-L.Tang,andS.K.Halgamuge. ViQuaS:animprovedreconstructionpipelineforviralquasispeciesspectrageneratedby next-generationsequencing. Bioinformatics ,31(6):886Œ896,2014. [60] J.KärkkäinenandP.Sanders.Simplelinearworksufarrayconstruction.In International ColloquiumonAutomata,Languages,andProgramming ,pages943Œ955.Springer,2003. [61] T.Kasai,G.Lee,H.Arimura,S.Arikawa,andK.Park.Linear-time computationinsufarraysanditsapplications.In AnnualSymposiumonCombinatorial PatternMatching ,pages181Œ192.Springer,2001. [62] A.Kiezun,S.Artzi,S.Modai,N.Volk,O.Isakov,andN.Shomron.miRviewer:amulti- speciesmicroRNAhomologousviewer. BMCRes.Notes ,5(1):1Œ6,2012. [63] V.N.KimandJ.-W.Nam.GenomicsofmicroRNA. TrendsGenet. ,22(3):165Œ173,Mar. 2006. [64] D.L.Kiss,K.Oman,R.Bundschuh,andD.R.Schoenberg.Uncapped5'endsofmRNAs targetedbycytoplasmiccappingmaptothevicinityofdownstreamCAGEtags. FEBSLett. , 589(3):279Œ284,2015. 139 [65] R.Kodzius,M.Kojima,H.Nishiyori,M.Nakamura,S.Fukuda,M.Tagami,D.Sasaki, K.Imamura,C.Kai,M.Harbers,etal.CAGE:capanalysisofgeneexpression. Nat. Methods ,3(3):211Œ222,2006. [66] J.Krol,I.Loedige,andW.Filipowicz.ThewidespreadregulationofmicroRNAbiogenesis, functionanddecay. NatureRev.Genet. ,11(9):597Œ610,Sept.2010. [67] W.S.Kruesi,L.J.Core,C.T.Waters,J.T.Lis,andB.J.Meyer.Condensincontrolsre- cruitmentofRNApolymeraseIItoachievenematodeX-chromosomedosagecompensation. Elife ,2:e00808,2013. [68] A.Kuehbacher,C.Urbich,A.M.Zeiher,andS.Dimmeler.RoleofDicerandDroshafor endothelialmicroRNAexpressionandangiogenesis. Circ.Res. ,101(1):59Œ68,2007. [69] J.-R.Landry,D.L.Mager,andB.T.Wilhelm.Complexcontrols:theroleofalternative promotersinmammaliangenomes. TrendsGenet. ,19(11):640Œ648,2003. [70] B.LangmeadandS.L.Salzberg.Fastgapped-readalignmentwithbowtie2. Naturemeth- ods ,9(4):357Œ359,2012. [71] B.LangmeadandS.L.Salzberg.Fastgapped-readalignmentwithbowtie2. Naturemeth- ods ,9(4):357Œ359,2012. [72] B.Langmead,C.Trapnell,M.Pop,andS.L.Salzberg.Ultrafastandmemory-ef alignmentofshortdnasequencestothehumangenome. Genomebiology ,10(3):R25,2009. [73] J.Laserson,V.Jojic,andD.Koller.Genovo:denovoassemblyformetagenomes. JComput Biol ,18(3):429Œ443,2011. [74] N.C.Lau,L.P.Lim,E.G.Weinstein,andD.P.Bartel.AnabundantclassoftinyRNAswith probableregulatoryrolesinCaenorhabditiselegans. Science ,294(5543):858Œ862,2001. [75] A.S.LauringandR.Andino.Quasispeciestheoryandthebehaviorofrnaviruses. PLoS Pathog ,6(7):e1001005,2010. [76] Y.Lee,C.Ahn,J.Han,H.Choi,J.Kim,J.Yim,J.Lee,P.Provost,O.Rådmark,S.Kim,etal. ThenuclearRNaseIIIDroshainitiatesmicroRNAprocessing. Nature ,425(6956):415Œ419, 2003. [77] Y.Lee,M.Kim,J.Han,K.-H.Yeom,S.Lee,S.H.Baek,andV.N.Kim.MicroRNAgenes aretranscribedbyRNApolymeraseII. EMBOJ. ,23(20):4051Œ4060,2004. [78] J.LeiandY.Sun.miR-PREFeR:anaccurate,fastandeasy-to-useplantmiRNAprediction toolusingsmallRNA-Seqdata. Bioinformatics ,30(19):2837Œ2839,2014. 140 [79] H.LiandR.Durbin.FastandaccurateshortreadalignmentwithBurrowsŒWheelertrans- form. bioinformatics ,25(14):1754Œ1760,2009. [80] L.Li,X.Deng,A.C.DaCosta,R.Bruhn,S.G.Deeks,andE.Delwart.Viromeanalysis ofantiretroviral-treatedHIVpatientsshowsnocorrelationbetweenT-cellactivationand anelloviruseslevels. JournalofClinicalVirology ,72:106Œ113,2015. [81] M.Li,B.Wu,X.Yan,J.Luo,Y.Pan,F.-X.Wu,andJ.Wang.Pecc:correctingcontigsbased onpaired-endreaddistribution. Computationalbiologyandchemistry ,69:178Œ184,2017. [82] N.Li,X.You,T.Chen,S.D.Mackowiak,M.R.Friedländer,M.Weigt,H.Du,A.Gogol- Döring,Z.Chang,C.Dieterich,etal.GlobalofmiRNAsandthehairpinpre- cursors:insightsintomiRNAprocessingandnovelmiRNAdiscovery. Nucl.AcidsRes. , 41(6):3619Œ3634,2013. [83] R.Li,H.Zhu,J.Ruan,W.Qian,X.Fang,Z.Shi,Y.Li,S.Li,G.Shan,K.Kristiansen, etal.Denovoassemblyofhumangenomeswithmassivelyparallelshortreadsequencing. Genomeresearch ,20(2):265Œ272,2010. [84] Y.Li,H.Wang,K.Nie,C.Zhang,Y.Zhang,J.Wang,P.Niu,andX.Ma.VIP:anintegrated pipelineformetagenomicsofvirusanddiscovery. reports ,6,2016. [85] E.S.Lim,Y.Zhou,G.Zhao,I.K.Bauer,L.Droit,I.M.Ndao,B.B.Warner,P.I.Tarr, D.Wang,andL.R.Holtz.Earlylifedynamicsofthehumangutviromeandbacterial microbiomeininfants. Naturemedicine ,21(10):1228Œ1234,2015. [86] H.-H.LinandY.-C.Liao.drVM:anewtoolforefgenomeassemblyofknown eukaryoticvirusesfrommetagenomes. GigaScience ,6(2):1Œ10,2017. [87] C.-C.LoandP.S.Chain.Rapidevaluationandqualitycontrolofnextgenerationsequencing datawithFaQCs. BMCbioinformatics ,15(1):366,2014. [88] Y.Y.Lu,T.Chen,J.A.Fuhrman,andF.Sun.Cocacola:binningmetagenomiccontigsusing sequencecomposition,readcoverage,co-alignmentandpaired-endreadlinkage. Bioinfor- matics ,33(6):791Œ798,2017. [89] C.Luo,R.Knight,H.Siljander,M.Knip,R.J.Xavier,andD.Gevers.Constrains microbialstrainsinmetagenomicdatasets. Naturebiotechnology ,33(10):1045,2015. [90] C.Luo,D.Tsementzi,N.Kyrpides,andK.Konstantinidis.Individualgenomeassembly fromcomplexcommunityshort-readmetagenomicdatasets. ISMEJ. ,6(4):898Œ901,2012. [91] T.F.Lüscher.Frontiersinprecisionmedicine:genesandtheirmodulationbymirnas. Eu- ropeanHeartJournal ,37(43):3247Œ3250,2016. 141 [92] R.Malhotra,S.Prabhakara,M.Poss,andR.Acharya.Estimatingviralhaplotypesina populationusingk-mercounting.In IAPRInternationalConferenceonPatternRecognition inBioinformatics ,pages265Œ276.Springer,2013. [93] R.Malhotra,M.M.S.Wu,A.Rodrigo,M.Poss,andR.Acharya.Maximumlikelihoodde novoreconstructionofviralpopulationsusingpairedendsequencingdata. arXivpreprint arXiv:1502.04239 ,2015. [94] S.K.MallannaandA.Rizzino.EmergingrolesofmicroRNAsinthecontrolofembryonic stemcellsandthegenerationofinducedpluripotentstemcells. Dev.Biol. ,344(1):16Œ25, 2010. [95] S.Mangul,N.C.Wu,N.Mancuso,A.Zelikovsky,R.Sun,andE.Eskin.VGA:amethodfor viralquasispeciesassemblyfromultra-deepsequencingdata.In ComputationalAdvances inBioandMedicalSciences(ICCABS),2014IEEE4thInternationalConferenceon ,pages 1Œ1.IEEE,2014. [96] F.Matin,V.Jeet,J.A.Clements,G.M.Yousef,andJ.Batra.Micrornatheranosticsin prostatecancerprecisionmedicine. ClinicalChemistry ,pagesclinchemŒ2015,2016. [97] H.A.Meijer,E.M.Smith,andM.Bushell.RegulationofmiRNAstrandselection:follow theleader? Biochem.Soc.Trans. ,42(4):1135Œ1140,2014. [98] M.L.Metzker.SequencingtechnologiesŠthenextgeneration. Naturereviewsgenetics , 11(1):31Œ46,2010. [99] F.Meyer,D.Paarmann,M.D'Souza,R.Olson,E.M.Glass,M.Kubal,T.Paczian,A.Ro- driguez,R.Stevens,A.Wilke,etal.ThemetagenomicsRASTserverŒapublicresourcefor theautomaticphylogeneticandfunctionalanalysisofmetagenomes. BMCbioinformatics , 9(1):386,2008. [100] A.Mikheenko,V.Saveliev,andA.Gurevich.Metaquast:evaluationofmetagenomeassem- blies. Bioinformatics ,pagebtv697,2015. [101] R.D.MitraandG.M.Church.Insitulocalizedamplandcontactreplicationof manyindividualdnamolecules. NucleicAcidsResearch ,27(24):e34Œe39,1999. [102] C.M.Mizuno,F.Rodriguez-Valera,N.E.Kimes,andR.Ghai.Expandingthemarine virosphereusingmetagenomics. PLoSgenetics ,9(12):e1003987,2013. [103] C.Mukherjee,D.P.Patil,B.A.Kennedy,B.Bakthavachalu,R.Bundschuh,andD.R. Schoenberg.ofcytoplasmiccappingtargetsrevealsaroleforcaphomeostasis intranslationandmRNAstability. CellRep. ,2(3):674Œ684,2012. [104] D.Murphy,B.Dancis,andJ.R.Brown.Theevolutionofcoreproteinsinvolvedinmi- 142 croRNAbiogenesis. BMCEvol.Biol. ,8(1):1Œ18,2008. [105] S.N.Naccache,S.Federman,N.Veeraraghavan,M.Zaharia,D.Lee,E.Samayoa,J.Bou- quet,A.L.Greninger,K.-C.Luk,B.Enge,etal.Acloud-compatiblebioinformaticspipeline forultrarapidpathogenonfromnext-generationsequencingofclinicalsamples. Genomeresearch ,24(7):1180Œ1192,2014. [106] T.Namiki,T.Hachiya,H.Tanaka,andY.Sakakibara.Metavelvet:anextensionofvelvet assemblertodenovometagenomeassemblyfromshortsequencereads. NucleicAcids Research ,40(20):e155,2012. [107] M.A.Nowak. Evolutionarydynamics .HarvardUniversityPress,CambridgeMas- sachusetts,2006. [108] S.Nurk,D.Meleshko,A.Korobeynikov,andP.A.Pevzner.metaSPAdes:anewversatile metagenomicassembler. Genomeresearch ,pagesgrŒ213959,2017. [109] N.D.Olson,T.J.Treangen,C.M.Hill,V.Cepeda-Espinoza,J.Ghurye,S.Koren,and M.Pop.Metagenomicassemblythroughthelensofvalidation:recentadvancesinas- sessingandimprovingthequalityofgenomesassembledfrommetagenomes. in bioinformatics ,2017. [110] Y.Otsuka,N.L.Kedersha,andD.R.Schoenberg.ofacytoplasmiccomplex thataddsacaponto5'-monophosphateRNA. Mol.CellBiol. ,29(8):2155Œ2167,2009. [111] F.Ozsolak,L.L.Poling,Z.Wang,H.Liu,X.S.Liu,R.G.Roeder,X.Zhang,J.S.Song, andD.E.Fisher.ChromatinstructureanalysesidentifymiRNApromoters. GenesDev. , 22(22):3172Œ3183,2008. [112] D.Paez-Espino,G.A.Pavlopoulos,N.N.Ivanova,andN.C.Kyrpides.Nontargetedvirus sequencediscoverypipelineandvirusclusteringformetagenomicdata. natureprotocols , 12(8):1673,2017. [113] P.J.Park.ChipŒseq:advantagesandchallengesofamaturingtechnology. NatureReviews Genetics ,10(10):669Œ680,2009. [114] J.Peccoud,S.Lequime,I.Moltini-Conclois,I.Giraud,L.Lambrechts,andC.Gilbert.A SurveyofVirusRecombinationUncoversCanonicalFeaturesoflChimerasGen- eratedDuringDeepSequencingLibraryPreparation. G3:Genes,Genomes,Genetics , 8(4):1129Œ1138,2018. [115] Y.Peng,H.C.Leung,S.-M.Yiu,andF.Y.Chin.Idba-ud:adenovoassemblerforsingle-cell andmetagenomicsequencingdatawithhighlyunevendepth. Bioinformatics ,28(11):1420Œ 1428,2012. 143 [116] Y.Peng,H.C.M.Leung,S.M.Yiu,andF.Y.L.Chin.Meta-IDBA:adenovoassembler formetagenomicdata. Bioinformatics ,27(13):i94Œi101,2011. [117] B.E.Pickett,E.L.Sadat,Y.Zhang,J.M.Noronha,R.B.Squires,V.Hunt,M.Liu,S.Ku- mar,S.Zaremba,Z.Gu,etal.ViPR:anopenbioinformaticsdatabaseandanalysisresource forvirologyresearch. Nucleicacidsresearch ,40(D1):D593ŒD598,2011. [118] S.Prabhakaran,M.Rey,O.Zagordi,N.Beerenwinkel,andV.Roth.Hiv-haplotypein- ferenceusingaconstraint-baseddirichletprocessmixturemodel.In MachineLearningin ComputationalBiology(MLCB)NIPSWorkshop ,pages1Œ4,2010. [119] M.C.ProsperiandM.Salemi.Qure:softwareforviralquasispeciesreconstructionfrom next-generationsequencingdata. Bioinformatics ,28(1):132Œ133,2012. [120] J.Qin,R.Li,J.Raes,M.Arumugam,K.S.Burgdorf,C.Manichanh,T.Nielsen,N.Pons, F.Levenez,T.Yamada,etal.Ahumangutmicrobialgenecatalogueestablishedbymetage- nomicsequencing. nature ,464(7285):59Œ65,2010. [121] S.RajasekaranandM.Nicolae.Anelegantalgorithmfortheconstructionofsufarrays. JournalofDiscreteAlgorithms ,27:21Œ28,2014. [122] S.Rampelli,M.Soverini,S.Turroni,S.Quercia,E.Biagi,P.Brigidi,andM.Candela. ViromeScan:anewtoolformetagenomicviralcommunity BMCgenomics , 17(1):165,2016. [123] G.Robertson,J.Schein,R.Chiu,R.Corbett,M.Field,S.D.Jackman,K.Mungall,S.Lee, H.M.Okada,J.Q.Qian,etal.Denovoassemblyandanalysisofrna-seqdata. Nature methods ,7(11):909Œ912,2010. [124] K.RotmistrovskyandR.Agarwala.BMTagger:BestMatchTaggerforremovinghuman readsfrommetagenomicsdatasets.2011. [125] H.Rouha,C.Thurner,andC.W.Mandl.FunctionalmicroRNAgeneratedfromacytoplas- micRNAvirus. Nucl.AcidsRes. ,38(22):8328Œ8337,2010. [126] S.Roux,J.R.Brum,B.E.Dutilh,S.Sunagawa,M.B.Duhaime,A.Loy,B.T.Poulos, N.Solonenko,E.Lara,J.Poulain,etal.Ecogenomicsandpotentialbiogeochemicalimpacts ofgloballyabundantoceanviruses. Nature ,537(7622):689Œ693,2016. [127] S.Roux,F.Enault,B.L.Hurwitz,andM.B.Sullivan.VirSorter:miningviralsignalfrom microbialgenomicdata. PeerJ ,3:e985,2015. [128] S.Roux,J.Tournayre,A.Mahul,D.Debroas,andF.Enault.Metavir2:newtoolsforviral metagenomecomparisonandassembledviromeanalysis. BMCbioinformatics ,15(1):76, 2014. 144 [129] J.G.Ruby,P.Bellare,andJ.L.DeRisi.Price:softwareforthetargetedassemblyofcom- ponentsof(meta)genomicsequencedata. G3:Genes|Genomes|Genetics ,3(5):865Œ880, 2013. [130] J.G.Ruby,C.H.Jan,andD.P.Bartel.IntronicmicroRNAprecursorsthatbypassDrosha processing. Nature ,448(7149):83Œ86,2007. [131] H.K.Saini,A.J.Enright,andS.GrifAnnotationofmammalianprimarymi- croRNAs. BMCgenomics ,9(1):564,2008. [132] H.K.Saini,S.GrifandA.J.Enright.GenomicanalysisofhumanmicroRNA transcripts. Proc.Natl.Acad.Sci.USA ,104(45):17719Œ17724,2007. [133] S.L.Salzberg,D.D.Sommer,D.Puiu,andV.T.Lee.Gene-boostedassemblyofanovel bacterialgenomefromveryshortreads. PLOSComputBiol ,4(9):e1000186,092008. [134] A.Sandelin,W.Alkema,P.Engström,W.W.Wasserman,andB.Lenhard.JASPAR:an open-accessdatabaseforeukaryotictranscriptionfactorbinding Nucl.AcidsRes. , 32(suppl1):D91ŒD94,2004. [135] A.Sandelin,P.Carninci,B.Lenhard,J.Ponjavic,Y.Hayashizaki,andD.A.Hume.Mam- malianRNApolymeraseIIcorepromoters:insightsfromgenome-widestudies. NatureRev. Genet. ,8(6):424Œ436,2007. [136] M.Schirmer,W.T.Sloan,andC.Quince.Benchmarkingofviralhaplotypereconstruc- tionprogrammes:anoverviewofthecapacitiesandlimitationsofcurrentlyavailablepro- grammes. inbioinformatics ,pagebbs081,2012. [137] D.R.SchoenbergandL.E.Maquat.Re-cappingthemessage. TrendsBiochem.Sci. , 34(9):435Œ442,2009. [138] D.E.Schones,K.Cui,S.Cuddapah,T.-Y.Roh,A.Barski,Z.Wang,G.Wei,andK.Zhao. Dynamicregulationofnucleosomepositioninginthehumangenome. Cell ,132(5):887Œ 898,2008. [139] J.S.Shapiro,R.A.Langlois,A.M.Pham,etal.Evidenceforacytoplasmicmicroprocessor ofpri-miRNAs. RNA ,18(7):1338Œ1346,2012. [140] D.Sharma,P.Priyadarshini,andS.Vrati.Unravelingthewebofviroinformatics:compu- tationaltoolsanddatabasesinvirusresearch. Journalofvirology ,89(3):1489Œ1501,2015. [141] A.A.Sigova,A.C.Mullen,B.Molinie,S.Gupta,D.A.Orlando,M.G.Guenther,A.E. Almada,C.Lin,P.A.Sharp,C.C.Giallourakis,etal.Divergenttranscriptionoflong noncodingRNA/mRNAgenepairsinembryonicstemcells. Proc.Natl.Acad.Sci.USA , 110(8):2876Œ2881,2013. 145 [142] J.T.SimpsonandR.Durbin.Efdenovoassemblyoflargegenomesusingcompressed datastructures. Genomeresearch ,22(3):549Œ556,2012. [143] G.SongandL.Wang.MiR-433andmiR-127arisefromindependentoverlappingprimary transcriptsencodedbythemiR-433-127locus. PLoSONE ,3(10):e3574,2008. [144] J.Spieth,D.Lawson,P.Davis,G.Williams,andK.Howe.Overviewofgenestructurein C.elegans. WormBook ,.:1Œ18,2014. [145] S.R.A.S.Staff.Usingthesratoolkittoconvert.sraintootherformats.In:SRA KnowledgeBase[Internet],2011.Availableat: http://www.ncbi.nlm.nih.gov/ books/NBK158900/ .(Accessed:25August2015). [146] S.TO'NeilandS.J.Emrich.Haplotypeandminimum-chimerismconsensusdetermination usingshortsequencedata. BMCgenomics ,13(2):S4,2012. [147] A.Töpfer,T.Marschall,R.A.Bull,F.Luciani,A.Schönhuth,andN.Beerenwinkel. Viralquasispeciesassemblyviamaximalcliqueenumeration. PLoSComputBiol , 10(3):e1003515,2014. [148] A.Töpfer,O.Zagordi,S.Prabhakaran,V.Roth,E.Halperin,andN.Beerenwinkel.Proba- bilisticinferenceofviralquasispeciessubjecttorecombination. JournalofComputational Biology ,20(2):113Œ123,2013. [149] C.TrapnellandS.L.Salzberg.Howtomapbillionsofshortreadsontogenomes. Nature biotechnology ,27(5):455,2009. [150] T.Treangen,S.Koren,D.Sommer,B.Liu,I.Astrovskaya,B.Ondov,A.Darling, A.Phillippy,andM.Pop.MetAMOS:amodularandopensourcemetagenomicassem- blyandanalysispipeline. GenomeBiology ,14(1):R2,2013. [151] D.T.Truong,E.A.Franzosa,T.L.Tickle,M.Scholz,G.Weingart,E.Pasolli,A.Tett, C.Huttenhower,andN.Segata.Metaphlan2forenhancedmetagenomictaxonomic ing. Naturemethods ,12(10):902,2015. [152] D.T.Truong,A.Tett,E.Pasolli,C.Huttenhower,andN.Segata.Microbialstrain-level populationstructureandgeneticdiversityfrommetagenomes. Genomeresearch ,2017. [153] M.Vignuzzi,J.K.Stone,J.J.Arnold,C.E.Cameron,andR.Andino.Quasispeciesdiversity determinespathogenesisthroughcooperativeinteractionsinaviralpopulation. Nature , 439(7074):344Œ348,2006. [154] G.Wang,Y.Wang,C.Shen,Y.-w.Huang,K.Huang,T.H.Huang,K.P.Nephew,L.Li,and Y.Liu.RNApolymeraseIIbindingpatternsrevealgenomicregionsinvolvedinmicroRNA generegulation. PLoSONE ,5(11):e13798,2010. 146 [155] Z.Wang,M.Gerstein,andM.Snyder.Rna-seq:arevolutionarytoolfortranscriptomics. Naturereviewsgenetics ,10(1):57Œ63,2009. [156] P.Willeit,P.Skroblin,S.Kiechl,C.Fernández-Hernando,andM.Mayr.Livermicrornas: potentialmediatorsandbiomarkersformetabolicandcardiovasculardisease? European heartjournal ,pageehw146,2016. [157] J.Winter,S.Jung,S.Keller,R.I.Gregory,andS.Diederichs.Manyroadstomaturity: microRNAbiogenesispathwaysandtheirregulation. NatureCellBiol. ,11(3):228Œ234, 2009. [158] M.E.Woolhouse,A.Rambaut,andP.Kellam.LessonsfromEbola:Improvinginfec- tiousdiseasesurveillancetoinformoutbreakmanagement. Sciencetranslationalmedicine , 7(307):307rv5Œ307rv5,2015. [159] Y.Wu,M.Rho,T.G.Doak,andY.Ye.Stitchinggenefragmentswithanetworkmatching algorithmimprovesgeneassemblyformetagenomics. Bioinformatics ,28(18):i363Œi369, 2012. [160] Y.-W.Wu,Y.-H.Tang,S.G.Tringe,B.A.Simmons,andS.W.Singer.MaxBin:an automatedbinningmethodtorecoverindividualgenomesfrommetagenomesusingan expectation-maximizationalgorithm. Microbiome ,2(1):26,2014. [161] M.Xie,M.Li,A.Vilborg,N.Lee,M.-D.Shu,V.Yartseva,N.−estan,andJ.a.Steitz. Mammalian5'-cappedmicroRNAprecursorsthatgenerateasinglemicroRNA. Cell , 155(7):1568Œ1580,Dec.2013. [162] A.Yamashita,T.Sekizuka,andM.Kuroda.VirusTAP:viralgenome-targetedassembly pipeline. Frontiersinmicrobiology ,7:32,2016. [163] C.Yuan,J.Lei,J.Cole,andY.Sun.Reconstructing16srrnagenesinmetagenomicdata. Bioinformatics ,31(12):i35Œi43,2015. [164] C.YuanandY.Sun.RNA-CODE:ANoncodingRNAToolforShortReads inNGSDataLackingReferenceGenomes. PLoSONE ,8:e77596,102013. [165] N.Yutin,K.S.Makarova,A.B.Gussow,M.Krupovic,A.Segall,R.A.Edwards,andE.V. Koonin.Discoveryofanexpansivebacteriophagefamilythatincludesthemostabundant virusesfromthehumangut. Naturemicrobiology ,3(1):38,2018. [166] O.Zagordi,A.Bhattacharya,N.Eriksson,andN.Beerenwinkel.Shorah:estimatingthe geneticdiversityofamixedsamplefromnext-generationsequencingdata. BMCbioinfor- matics ,12(1):119,2011. [167] J.Zhang,K.Kobert,T.Flouri,andA.Stamatakis.Pear:afastandaccurateilluminapaired- 147 endreadmerger. Bioinformatics ,30(5):614Œ620,2014. [168] Y.Zhang,T.Liu,C.A.Meyer,J.Eeckhoute,D.S.Johnson,B.E.Bernstein,C.Nusbaum, R.M.Myers,M.Brown,W.Li,etal.Model-basedanalysisofChIP-Seq(MACS). Genome Biol. ,9(9):R137,2008. [169] Y.Zhang,Y.Sun,andJ.R.Cole.Ascalableandaccuratetargetedgeneassemblytool (sat-assembler)fornext-generationsequencingdata. PLoSComputBiol ,10(8):e1003737, 2014. 148