NOVELCOMPUTATIONALMETHODSFORIMPROVINGFUNCTIONALANALYSISFOR LONGNOISYREADS By NanDu ADISSERTATION Submittedto MichiganStateUniversity inpartialoftherequirements forthedegreeof ComputerScienceŠDoctorofPhilosophy 2019 ABSTRACT NOVELCOMPUTATIONALMETHODSFORIMPROVINGFUNCTIONALANALYSISFOR LONGNOISYREADS By NanDu Single-molecule,real-timesequencing(SMRT)developedbyPBioSciences(PacBio)and NanoporesequencingdevelopedbyOxfordNanoporeTechnologies(Nanopore)producelonger readsthansecond-generationsequencingtechnologiessuchasIllumina.Theincreasedreadlength enablesPacBiosequencingtoclosegapsingenomeassembly,revealstructuralvariations,and characterizetheintra-speciesvariations.Italsoholdsthepromisetodecipherthecommunitystruc- tureincomplexmicrobialcommunitiesbecauselongreadshelpmetagenomicassembly.However, comparedwithdataproducedbypopularshortreadsequencingtechnologies(suchasIllumina), PacBioandNanoporedatahaveahighersequencingerrorrateandlowercoverage.Therefore,new algorithmsareneededtotakefulladvantageofthird-generationsequencingtechnologies. Forexample,duringanalignment-basedhomologysearch,insertionordeletionerrorsingenes willcauseframeshifts,whichmayleadtomarginalalignmentscoresandshortalignments.In thiscase,itishardtodistinguishcorrectalignmentsfromrandomalignments,andtheambiguity willincurerrorsinstructuralandfunctionalannotation.Existingframeshiftcorrectiontoolsare designedfordatawithamuchlowererrorrate,andtheyarenotoptimizedforPacBiodata.As anincreasingnumberofgroupsareusingSMRT,thereisanurgentneedfordedicatedhomology searchtoolsforPacBioandNanoporedata. Anotherexampleisoverlapdetection.ForbothPacBioreadsandNanoporereads,thereis stillaneedtoimprovethesensitivityofdetectingsmalloverlapsoroverlapswithhigherrorrates. Addressingthisneedwillenablebetterassemblyformetagenomicdataproducedbythethird- generationsequencingtechnologies. Inthisarticle,wearegoingtodiscussthepossiblemethodforhomologysearchandoverlapde- tectionforthethird-generationsequencing.Foroverlapdetection,wedesignedandimplemented anoverlapdetectionprogramnamedGroupK.GroupKtakesagroupofshort k -merhits,whichsat- isfystatisticallyderiveddistanceconstraintstoincreasethesensitivityofsmalloverlapdetection. Forhomologysearch,wedesignedandimplementedahomologysearchtoolnamedFrame- ProbasedonthehiddenMarkovmodel(pHMM)andconsensussequencesmethod. However,Frame-proisstillrelyingonmultiplesequencealignment.SoweimplementedDeep- Frame,adeeplearningmodelthatpredictsthecorrespondingproteinfunctionforthird-generation sequencingreads.Intheexperimentonsimulatedreadsofproteincodingsequencesandrealreads fromthehumangenome,ourmodeloutperformspHMM-basedmethodsandthedeeplearning basedmethod.OurmodelcanalsorejectunrelatedDNAreadsandachieveshigherrecallwiththe precisioncomparabletothestate-of-the-artmethod. ThisthesisisdedicatedtomybeautifulwifeWenjingandmyparents,whoseencouragement alongthewaywasparamounttomakingitthusfar. iv ACKNOWLEDGMENTS Firstandforemost,IwouldliketothankmyadvisorDr.YanniSun.ItwascoincidentlythatI heardthereisanopportunitytoworkwithDr.Sunat2014winter,andIalmostdon'thaveany bioinformaticsbackgroundwhenItalkedtoDr.Sunfourandhalfyearsago.Butshestill patientlydiscussedwithmeaboutpossibleprojectsandgivemethechancetostudythealgorithms ofsequenceanalysisunderherguidance.Sheisveryniceandpatient,neverblamedmebutinstead triedtohelpmewhenImademistakes.Dr.Sunhelpedmealotinmyresearchwork,courses,and howtowriteandpresentresearchresults.Withoutherhelp,nothingcouldbeachieved. Next,IwanttothankmyPh.D.guidancecommitteemembers,includingDr.KevinLiu,Dr. JamesCole,andDr.JiayuZhou.Dr.Liugavememanyvaluablesuggestionsinmyqualifying exam,mycomprehensiveexam,andmydefense.Iamverygratefulthathecanserveasco- chairofmycommittee.Dr.ZhougavemethelectureofMachineLearningandsharedhisinsights ondeeplearningwithmeonmyproject.Dr.Colehelpedmealotforbothmycomprehensive examanddefense. IwouldalsoliketothankNationalScienceFoundationandSummerFellowshipfromCollege ofEngineeringforfundingtheresearchpresentedherein,andalsoMSUDissertationCompletion Fellowship(Spring2019)whichallowedmetocompletethisdissertation.Manythankstostaffat CSE,andespeciallystaffattheHigh-PerformanceComputingCenteratMichiganStateUniversity, fortheirgenerallysupportforalltheresourcesthatIneedduringmyPh.D.study. Iwanttothanksmylabmates,includingDr.JikaiLei,Dr.PrapapornTecha-angkoon,Dr. JiaoChen,andYumengWen,fortheirhelponmyeitherresearchworkordailylife.Jikaigave memanyusefulsuggestionswhenIjoinedthegroup.Healsohelpedmealotinmyinternship JiaoandIhavemanydiscussionsonresearch,dailylife,andjobIwouldalso v thankallmyfriendswhohelpedmeatMSUinthepastsevenyears.Thankstomylabmatesat UEClabinPhysics,includingDr.ZhenshengTao,Dr.KiseokChang,andFaranZhou.They alwaysencouragedmeforchallengesandcomfortedmewhenIdepressed.ThankstoChaoyue Liu,whohelpedmetomakethedecisionthattransferstoComputerScienceandEngineering. ThankstoDr.ZhenZhuandYananChen,whohavehelpedmeinbothmyacademicandpersonal life.ThankstomyfriendsintheMSUPhysicsQQgroup,Dr.XuLu,Dr.YanhaoTang,Dr. JieGuan,Dr.MengzeZhu,andXukunXiang.Wereallyhavehadagoodtimediscussingmany interestingtopicsfromfrontiersofphysicstodailylifeintheUnitedStates.ThankstoDr.Yuan Gao,Dr.LiKe,Dr.XiaoranZhang,Dr.ShuangLiang,andDr.ChuanpengJiang,fortheir companyfromthetimewhenwearrivedatEastLansingtogether.Thankstomyfriendsand classmatesatComputerScienceandEngineeringwhostudied,worked,andhadfuntogetherwith me,includingDr.JianpengXu,Dr.NanLiu,Dr.ChenQiu,Dr.ShaohuaYang,DingWang,Nan Xia,WeiWang,ZhuangdiZhu,JunChen,QiWang,KaixiangLin,DeliangYang,TianXie,Jiawei Li,SihanWang,ZhiweiWang,YaoMa,andManniLiu. Finally,Iwanttoexpressmysinceregratitudetomywifeandmyparents.Theyhavealways beenencouragingandsupportingmeforpursuingmycareer,andarealwaystherewheneverIneed them. vi TABLEOFCONTENTSLISTOFTABLES ....................................... xLISTOFFIGURES ...................................... xiChapter1Introduction .................................. 11.1Third-generationsequencing.............................2 1.1.1SMRT.....................................2 1.1.2Nanoporesequencing.............................4 1.2Sequenceanalysisalgorithms.............................5 1.2.1Alignment...................................5 1.2.2Errorcorrection................................7 1.2.3Wholegenomeassembly...........................8 1.3Challengesandoverview...............................12 Chapter2OverlapDetectionOnLongNoisyReads 1................. 142.1Background......................................15 2.1.1Relatedwork.................................15 2.1.2Overviewofourwork.............................17 2.2Methods........................................17 2.2.1Pipeline....................................19 2.2.2Estimatetheexpectednumberof k-mersforthestage.......21 2.2.3GroupHitCriteria...............................25 2.2.3.1Constrainton k-merdistance....................26 2.2.3.2Constrainton k-merdiagonaldistance...............27 2.2.4GroupChaining................................28 2.2.5Implementationdetailsofthemajorcomponents..............30 2.2.5.1Filtrationby k-mercounts.....................30 2.2.5.2Groupseedmatchandchaining..................30 2.2.5.3Timecomplexityanalysis.....................32 2.3Results.........................................32 2.3.1Simulated E.coli dataset...........................33 2.3.1.1Runningtimeandmemoryusage.................35 2.3.2RealPB E.coli dataset............................35 2.3.2.1Performancewithdifferentoverlapsize..............37 2.3.3HumanFootMetagenomicDataset......................37 2.3.4RealONT E.coli dataset...........................41 2.4Discussion.......................................41 2.5Conclusions......................................42 1Du,Nan,JiaoChen,andYanniSun."Improvingthesensitivityoflongreadoverlapdetectionusinggroupedshort k-mermatches."BMCgenomics20.2(2019):190. viiLISTOFALGORITHMS ................................... Chapter3prHMMmodelandhomologysearch 2 ................ 43 3.1Introduction......................................43 3.2Relatedwork.....................................45 3.2.1homologysearch...........................45 3.2.2Relatedworkonframeshiftcorrection....................46 3.3hiddenMarkovmodel.............................47 3.3.1Markovchains................................48 3.3.2HiddenMarkovmodels............................49 3.3.3hiddenMarkovmodelsHMMs)...............50 3.3.3.1Viterbialgorithm..........................51 3.3.3.2Forwardalgorithm.........................53 3.3.3.3FrommultiplesequencealignmenttoHMM........54 3.3.3.4Forward-Backwardalgorithm...................55 3.4Methods........................................57 3.4.1Generatesequencealignmentgraph.....................57 3.4.2ViterbialgorithmforaligninganalignmentgraphwithaHMM...59 3.4.3Filtrationstageforremovingunrelevantproteindomainfamilies......63 3.5Experimentalresults..................................64 3.5.1Simulated Escherichiacoli sequences....................65 3.5.1.1Performanceoferrorcorrection..................65 3.5.1.2PerformanceofHMMsearch................68 3.5.2 Meiothermusruber DSM1279sequences..................70 3.5.2.1Evaluateerrorcorrectionperformancebyaligningoutputsagainst thereferencegenome........................71 3.5.2.2Evaluatetheperformanceofhomologysearch.......71 3.5.3Humanarmmetagenomicdataset......................73 3.5.3.1GenerateSAGandHMMdomain.........73 3.5.3.2ProteindomainsearchusingGAcutoff..............75 3.6Conclusionanddiscussion..............................76 Chapter4DeepLearningBasedApproachForProteinDomainPrediction ..... 78 4.1Introduction......................................78 4.1.1Deeplearningforsequentialdata.......................79 4.1.1.1ConvolutionalNeuralNetworks..................79 4.1.1.1.1Convolutionoperation..................79 4.1.1.1.2MajorfeaturesofCNNs.................80 4.1.1.1.3CNNforsequence.............81 4.1.1.2RecurrentNeuralNetworks....................82 4.1.1.2.1BasicstructureofRNNs.................82 4.1.1.2.2LSTM..........................84 4.1.1.2.3GRU...........................85 4.1.1.3Proteinfunctionpredictionusingdeeplearning..........86 2 Du,Nan,andYanniSun."ImprovehomologysearchsensitivityofPacBiodatabycorrectingframeshifts."Bioin- formatics32.17(2016):i529-i537. viii 4.1.2Relatedwork.................................88 4.1.2.1ehomologysearch......................88 4.1.2.2Homologysearchwitherrorcorrection..............89 4.1.3Overviewofourwork.............................89 4.2Methods........................................90 4.2.1Encoding...................................92 4.2.2ConvolutionalNeuralNetworks.......................93 4.2.3Out-of-distributionexamplesdetection....................95 4.2.3.1Thethresholdbaseline.......................95 4.2.3.2OutlierExposure..........................96 4.2.4Implementationsandhyperparameters....................98 4.3Experimentsandresults................................99 4.3.1SimulatedPacBioGPCRcdsdataset.....................100 4.3.1.1Performancewithdifferentarchitectures.............100 4.3.1.1.1Comparisonsofencoding.................100 4.3.1.1.2Comparisonsofconvolutionalsetup........102 4.3.1.2ComparisonwithHMMERandDeepFam.............104 4.3.1.3Comparisonoftimecomplexity..................106 4.3.2Humangenomedataset............................107 4.3.2.0.1Traindataset........................107 4.3.2.0.2Thresholdcalibrationdataset...............107 4.3.2.0.3Out-of-distributiontestdataset..............107 4.3.2.1Out-of-distributiontestusingPacBioandNanoporereads....108 4.3.3Visualizeandunderstandingconvolution...............109 4.4Discussion.......................................110 4.5Conclusion......................................114 Chapter5Conclusionsandfutureworks ........................ 115BIBLIOGRAPHY....................................... 118ixLISTOFTABLESTable2.1:Computationalperformanceonthesimulated E.coli dataset.Thecomputa- tionalperformanceofoverlapdetectionusingGroupK,Minimap,Minimap2, DALIGNER,MHAP,andGraphMaponthesimulated E.coli dataset.......35 Table2.2:Overlapdetectiononthereal E.coli dataset.Theperformanceofoverlapdetec- tionusingGroupK,Minimap,Minimap2,DALIGNER,MHAP,andGraphMap ontherealPBRSII(P6-C4) E.coli dataset.Hereweonlyreporttheexperiment resultswiththehighestF1scoreforeachtool...................36 Table2.3:Overlapdetectiononthehumanmetagenomicdataset.Theperformanceof overlapdetectionusingGroupK,Minimap,Minimap2,DALIGNER,MHAP, andGraphMaponthemockmetagenomicdataset.Hereweonlyreportthe experimentalresultswiththehighestF1scoreforeachtool............40 Table2.4:OverlapdetectionontheONT E.coli dataset.Theperformanceofoverlapde- tectionusingGroupK,Minimap,Minimap2,DALIGNER,MHAP,andGraphMap ontherealONTSQK-MAP-006 E.coli dataset.Minimap2usestheava-ont setup,whichisoptimizedforONTdata......................41 Table4.1:ThenameandbriefdescriptionofvariantsofCNNmodelswecompared.....101 Table4.2:Theaverageelapsedtimetopredictfamiliesof10,000simulatedPacBioreads foreachmethod..................................106 Table4.3:Theperformanceofproteindomainpredictionwithout-of-distributionexam- plesusingDeepFramethresholdbaseline,DeepFramewithOutlierExposure (OE),andHMMERontherealPacBioandNanoporedataset..........108 xLISTOFFIGURESFigure1.1:Principleofsingle-molecule,real-timeDNAsequencing.Reprintedfrom[39].3 Figure1.2:Principleofnanoporesequencing.Itdeterminethetypeofnucleotidebymea- suringthecurrentlevelofthebasepassespassingthroughthenanohole. ReprintedfromNaturejobswebsite........................4 Figure1.3:Adynamicprogrammingalignmentmatrixforaligningsequence GATTACAagainst GCATGCUusingNeedleman-Wunschalgorithm[105].Inthisexam- ple,asimplescoringsystem,match=1,mismatch=-1,gap=-1,isused.(a) Startfromthecellinthesecondrow,secondcolumn.Movethroughthecell rowbyrow,calculatingthescoreforeachcell.(b)Thescoreisthemaximal scorecalculatedfromthecellfromthetop,left,orleft-top.Ifthescoreis calculatedusingleft-topcell,meansamatchormismatch.Otherwiseisan insertionordeletion.(c)Aftertheofalltable,thescoreof theright-bottomcellofthematrixisthescoreforglobalalignmentofthetwo sequence.(d)Tracebackfromthelastcellisneededtoobtainthealignment. Thealignmentis G-ATTACAvs.GCA-TGCU.ReprintedfromWikipedia...6 Figure1.4:RapidoverlappingofnoisyreadsusingMinHashsketches.(a)Tocreatea MinHashsketchofaDNAsequenceS,weobtainedallthek-mersof thesequencek-mers.Intheexampleabove, k=3,generate12k-merseach forS1andS2.(b)Multiplehashfunctionsareusedtoconvertk-merstoin- tegerThenumberofhashfunctionsHdeterminestheresulting sketchsize.Here,wechoose H=4.Theminimumkmerwhichhashingfunc- tionvalueisminimalforeachhashisreferredasmin-mer.(c)Thesketch ofasequenceiscomposedoftheHmin-merinorder,whichis muchsmaller(size4)thanthesetofallk-mers(size12).Inthisexample,the sketchesofS1andS2sharetwosameminimum(underlined).(d) RealJaccardsimilarity(0.22)isestimatedusingthefractionofsharedmin- mer(0.5)betweenthesketches.Tohaveanaccurateestimation, H˛4is needed.(e)Ifthesimilaritymeetsthethreshold,thepositionofsharedmin- mersintheoriginalsequenceiscomputedtodeterminetheoverlapoffsetof theS1andS2.Reprintedfrom[12]........................10 Figure1.5:Threestage(correction,trimming,andassembly)inaPacbioSMRTassem- bler(Canu).Thecorrectionstepselectsthebestoverlapstouseforsequence correctionandgeneratescorrectedreadsfromconsensussequences.Thetrim- mingstepunsupportedregionsintheinputandtrimsorsplitsreads basedonthelongestsupportedrange.Theassemblystepmakesapass todeterminesequencingerrors;recomputeoverlapalignments;andoutputs contigs,anassemblygraph,andsummarystatistics.Reprintedfrom[79]...11 xiFigure2.1:Histogramsofirreducibleoverlapsizes( top )andtheratioofoverlapsizeto thereadlength( bottom )whencomparingadjacentoverlappingreadsonsim- ulatedPB E.coli datasetsgivendifferentcoverages.Thebinwidthforthe overlapsizeis500.Forthe30Xdataset,theaveragereadlengthis8366and thenumberofreadsis16644.Forthe15Xdataset,theaveragereadlengthis 8253andthenumberofreadsis8436.Forthe8Xdataset,theaverageread lengthis8414andthenumberofreadsis4413..................18 Figure2.2:ThepipelineofGroupK..............................20 Figure2.3:Thedotplotof k -merhitsandgroupseedsmatchesforoneoverlappingpair fromthe E.coli PBdatasetusedinSection2.3.Eachdotisa k -merhit.Thex- axisandy-axisshowthelocationsofthehitsonthereads.Theoverlapregion isroughlyfrom0to2800onthex-axis,andfrom1000to3800onthey-axis. Top: all9-merhits. Bottom: 9-merhitsthatpassedthegrouphitcriteria.A groupseedisrepresentedbycloselylocateddotsofthesamecolor.......22 Figure2.4:Twocasescontributedtotheidenticalcharactersatanalignedpositioninthe overlappingregionof S 1 and S 2 . S 1 and S 2 aretworeadssequencedfromthe sameregionoftheunderlyinggenomeandformanoverlap. Top: bothbases on S 1 and S 2 arecorrect,formingamatch. Bottom: bothbaseson S 1 and S 2 aresequencingerrors,andsubstitutedbythesamecharacter..........23 Figure2.5:Thechangeof E [ X o ] and E [ X r ] (y-axis)withtheincreaseofthereadlength (x-axis),whichisobtainedfromarealPBdataset.Theoverlapsizeissetas the1/4ofthereadlengthaswefocusonidentifyingthehardcaseofsmall overlaps. k = 15..................................24 Figure2.6:Anexampleoftheoverlapsizeestimation.Thesufofread1andthe ofread2formanoverlap.Eachshortsolidlinerepresentsagroupseedmatch intheoptimalchain.Theblackdashedlineindicatesthetrueoverlapalign- mentregionbetweenthetworeads.Thegraydashedline,whichisformedby thetwoendinggroupseedsintheoptimalchain,canoverestimatetheoverlap size........................................29 Figure2.7:TheROC-likeplotusingGroupK,Minimap,Minimap2,DALIGNER,MHAP, andGraphMaponthesimulatedPB E.coli dataset.Thex-axisrepresentsthe falsediscoveryrate(FDR = 1 precision).Y-axisisthesensitivity(0.5to1)..34 Figure2.8:SensitivityofGroupKandMinimapfordetectingoverlapsofdifferentsize onPB E.coli dataset.Thex-axisrepresentstheoverlapsize.Y-axisisthe correspondingsensitivityofthebin.TheX-axisbinwidthis500andthe onlyincludedthe6bins(i.e.uptooverlapsize3000)astheirsensitivity becomesmoresimilarwiththeincreaseoftheoverlapsize............38 xii Figure3.1:AMarkovchainexampleofDNA.Therearefourstatesforeachofnucleotides A,C,GandT.Atransitionprobabilityisassociatedwitheacharrowinthe48 Figure3.2:ThetransitionstructureofaHMM.Weusesqurestoindicatematch states,diamondstoindicateinsertionstates,andcirclestoindicatedeletion states........................................51 Figure3.3:Tencolumnsfromthemultiplesequencealignmentsofsevenglobinproteins. Thestarredcolumnsareonesthatwillbetreated`matches'inthe HMM.Reprintedfrom[35]............................54 Figure3.4:BuildanexampleSAGfromanalignment.Toppanel:multiplesequence alignmentofsixsequences.Thetopsequenceistheseedsequence.Bottom panel:sequencealignmentgraph.Fornode x0,ifwetracebackfortwomore edges,wecanidentifythreecodonsendingwith x0:TCA,ATA,andGCA. Eachpathhasitsnodes x(1)0andx(2)0marked.Theconsensuspathof thispartofgraphisATCA............................58 Figure3.5:ComparisonoferrorcorrectionperformanceforFrame-ProandDAG-Conin thesimulated E.coli datasetand M.ruber dataset.Frame-Procorrectsmore errorsinlargerfractionofPBreads........................66 Figure3.6:Histogramofederrorsaftererrorcorrectioninthesimulated E.coli dataset.X-axisrepresentsthenumberofremainingerrorsineachread.Y- axisisthecorrespondingnumbersofreads.Binwidthis1andtheonly includedthe40bins(i.e.upto40errors)duetospacelimitation......67 Figure3.7:Thecomparisonofalignments'bitscores(A),lengthsina.a.(B),andE-values (C)forreferencesequencesandcorrectedsequencesproducedbyFrame-Pro andDAG-Coninthe E.coli simulateddataset.X-axisrepresentsthedomains. AlldomainsaresortedbythereferencevaluesfromPfam.Redcirclesrepre- sentthevaluesofHMMalignmentsforcorrectedsequencesoutputbyFrame- Pro.BluecirclesrepresentthevaluesofHMMalignmentsforcorrectedse- quencesoutputbyDAG-Con.Astherearemanydatapoints,allnumbers producedbyonetoolareprocessedbySavityzkyGolaytogeneratea smoothedcurveforclarityofpresentation....................69 xiiiFigure3.8:Thecomparisonofalignments'bitscoresforreferencesequencesandcor- rectedsequencesproducedbyFrame-ProandDAG-Coninthe M.ruber dataset. X-axisrepresentsthedomains.Alldomainsaresortedbythereferencebit scoresfromPfam.RedcirclesrepresentthevaluesofHMMalignmentsfor correctedsequencesoutputbyFrame-Pro.Bluecirclesrepresentthevaluesof HMMalignmentsforcorrectedsequencesoutputbyDAG-Con.Asthereare manydatapoints,allnumbersproducedbyonetoolareprocessedbySavi- tyzkyGolaytogenerateasmoothedcurveforclarityofpresentation....72 Figure3.9:Thecomparisonofalignments'bitscores(A),lengthsina.a.(B),andE-values (C)for48domainscommonlybyFrame-ProandDAG-Coninthe armdataset.X-axisrepresentsthedomains...................74 Figure4.1:Howakerneloperatesononepixel.Reprintedfrom[6].............80 Figure4.2:StructureofTextCNNmodel.Reprintedfrom[74]...............82 Figure4.3:Therepeatingmoduleinainasimplerecurrentneuralnetworkstructure. Reprintedfrom[113]...............................83 Figure4.4:TherepeatingmoduleinaLSTMcell.Reprintedfrom[113]...........84 Figure4.5:TherepeatingmoduleinaGRUcell.Reprintedfrom[113]............86 Figure4.6:TheoverviewofDeepFramemodel.Theinputsequencewastranslatedand encodedtoa3-channeltensor.Twoconvolutionallayerandmaxpoolinglayer extractsequencefeatures.Thesefeatureswereusedbyafullyconnectedlayer withsoftmaxfunctiontoinfertheprobabilityofeachproteinfamilies.In thetask,themodeldirectlyoutputsthefamilieswiththelargest scoreastheprediction.Inthedetectiontask,themaximumofsoftmaxscore needstocomparewiththethresholdtodeterminewhethertheinputcontainsa trainedproteinfamilyorshouldberejected....................91 Figure4.7:Thedistributionofsoftmaxscoresfrombasemodel( top )andmodelwithOut- lierExposure( bottom ).Greenbarrepresentthecountofin-distributionex- amplesthatpredictedcorrectly.Bluebarrepresentthecountofin-distribution examplesthatpredictedincorrectly.Redbarrepresentthecountofout-of- distributionexamples...............................97 Figure4.8:Themeanandstandarddeviationofaccuracyofdifferentnet- workarchitectures.Forconvenient,weuseddifferentnames( 3-frameencod- ing , 2048s , s8 ,and 2layer )forthesamebasemodel.Weused differentcolorsfordifferentgroupofcomparisons:greenbarsforencoding; bluebarsfornumberofpurplebarsfordifferentsizes;orangebars fordifferentconvolutionlayers..........................101 xiv Figure4.9:2Dt-SNEplotoffeaturesextractedfromtheconvolutionallayeroutput.(Top) 3-frameencodingmodel.(Middle)DNAone-hotencodingmodel.(Bottom) 3-branchmodel..................................103 Figure4.10:ComparisonofproteinperformanceofDeepFrame,HMMER, andDeepFamacrossdifferenterrorrates.....................105 Figure4.11:Mostactivatedconvolutional(index:1622)forLatrophilin........110 Figure4.12:Mostactivatedconvolutional(index:1877)forCannabinoid.......110 Figure4.13:Mostactivatedconvolutional(index:1689)forPlatelet..........111 Figure4.14:Mostactivatedconvolutional(index:1382)forProstacyclin........111 Figure4.15:Mostactivatedconvolutional(index:1033)forCholecystokinin.....112 Figure4.16:Mostactivatedconvolutional(index:1776)forEMR1...........112 Figure4.17:Thedistributionsofsumofacivationvaluesofconvolutionalunits.Xaxisis theindexoftheconvolutiontersinDeepFram.Yaxisisthesumofactiva- tionvaluesofthegivenconvolutional..................113 xv LISTOFALGORITHMS Chapter1 Introduction Secondgenerationsequencingtechnologiesofferedseveralimprovementsovertradi- tionalSangersequencingandbecameanobviouschoiceformetagenomicsanalysis.Usingshot- gunsequencing,peoplecaninterrogatethecompositionandfunctionofmicrobialcommunities ef.Thisinformationcanleadtodiscoveryonnewbiologyactivecompounds,antimicro- bials,virulencefactors,ormetabolicpathways[144].Thetraditionaldownstreammethoduses referencegenomestoannotatethemetagenomicsdataset.However,inmostcases,themetage- nomicsdatasetcannotbecorrespondedtotheknownreferencegenomes.Forexample,2%to 96%ofmetagenomicssequencefromhumanskincannotbeannotatedbythetraditionalmethod, basedontheoriginsofsample[112]. Denovo approach,orareference-freeapproach,isonealternativemethodthatcanbeusedto overcomethelackofreference.Withshortreadsfromthesecond-generationsequencingplatform, denovo assemblycangeneratecontigsfollowedbythetaxonomybinningtodeterminethecom- positionsofthesample.However,withthecomplexandsimilargenomesinmetagenomics,itis challengingtouseshortreadsfromthesecond-generationsequencingtechnologytoassembly.In suchcircumstances,longreadsfromthethirdgenerationsequencingplatformshaveitsadvantages togenerateabetteroverlapgraphfortheassembler. 1Anotheralternativemethodisanalyzingtheaveragepropertiesofwholecommunitiesrather thanfocusingonseveralgenomes.Suchanapproachisirreplaceableasinparticularcircumstance assemblymaynotevenpossible.Inthismethod,longerreadsarestillpreferredasitcanprovide moreaccurateinformationthanshorterfragments. Here,wepresenttheprinciplesofthethirdgenerationsequencingtechnologies,somewidely- usedalgorithmsdesignedforthirdgenerationsequencingingenomeassembly,thechallengefor existingfunctionalanalysisalgorithmswhenusingthirdgenerationsequencingdata. 1.1Third-generationsequencing Severalthirdgenerationsequencingtechnologieswereproposed[15,39].Theysharealotof similarcharacteristic,likelongreadlengthandhigherrorrate.Inthesetechniques,themost intensivelystudiedarethesingle-molecule,real-timesequencing(SMRT),developedbyP BioSciences(PacBio),andNanoporesequencing,developedbyOxfordNanoporeTechnologies (Nanopore).HerewewillintroducethetheoremofSMRTandNanoporesequencingtechnologies. AlthoughsomealgorithmswediscussedherewereoriginallyproposedtoSMRT,itcanalsobe usedforNanoporesequencing[93]. 1.1.1SMRT Toachievereal-timesequencingatthesinglemoleculelevel,SMRTadoptedanapproachtogen- eratesequencinginformationbasedonthesequentialsignalgeneratedduringDNA synthesisbyapolymerasemolecule. 2 Figure1.1:Principleofsingle-molecule,real-timeDNAsequencing.Reprintedfrom[39]. ThebasicsequencingunitinSMRTisazero-modewaveguide(ZMW).Thepropagationmode inthewaveguidewilllimitthewave,satisfyingthemodetopropagateinthewaveguide(precisely determinethefrequencyandform).The"zero-modewaveguide"meansthewaveguideprovides zeroguidedmodesforthepropagationoftheelectromagneticwave.Observationinthezero-mode waveguidecanreachsingleeventscaleeveninthehighconcentrationof moleculespresent[81].InthebottomofeachZMWthereislocatedasinglemoleculeofDNA template-bound F 29DNApolymerase. Duringanobservation,aphospholinkednucleotideformsahydrogenbondwiththenucleotide inthetemplate.ThisprocesswillproduceasignalatthebottomofZMW.Thelight pulsewillenduponthecleavageofthedye-linker-pyrophosphategroup.Bydetectingthelight pulseandrecordingaseriesofmoviesofthesignal,wecanhaveallinformationwhichlatercan betranslatedtobasesincontinuouslongreads(CLR).TheaverageCLRreadlengthfromPacbio RSIIusingC4chemistrycanreachmorethan10kbps[145],withan11%to15%errorrate[8,80]. Usingthecirculartemplate,SMRTcansequenceDNAmultipletimes.Bydoingthis,ahigh qualityreadcalledcircularconsensussequence(CCS)canbegeneratedwiththemorethan99% 3 accuracywith15Xcoverage.However,thetrade-offofCCSistheshort-readlength(averagelyis about1kbps[114])duetothelimitationlifetimeoftheDNApolymerase[134]. 1.1.2Nanoporesequencing OxfordNanoporeisanotherpromisingthird-generationsequencingtechnology.InNanoporese- quencing,aDNAstrandisthreadedthroughananopore,andtheioniccurrentcanbe usedtoidentifythecorrespondingDNAsequences. Figure1.2:Principleofnanoporesequencing.Itdeterminethetypeofnucleotidebymeasuringthe currentlevelofthebasepassespassingthroughthenanohole.ReprintedfromNaturejobswebsite. Intheory,thenanoholescanbecreatedbyproteinspuncturingmembranes(biologicalnanopores) 4 orinsolidmaterials(solid-statenanopores).InNanoporesequencing,thenanoporeiscreateduti- lizingtheheptamericprotein a -hemolysin( a HL)[28].Duringthesequencing,DNAwillbindto thebindingsiteofthenanopore.Thesequencerreadsthedisruptioninthecurrentpassedthrough themembraneaseachbasepassesthrough.Sincethebaseshavedifferentelectroniccharacteris- tics,eachnucleotidehasadifferentsignatureofdisruption,allowingforbasecalls. Nanoporesequencingcanproduceverylongsequencesthatlengthupto2millionbasesin theory.InarecentexperimentwithMinIONR9.41Dchemistry,peoplewereabletosequence ultralongreadswith882,000bp,withamedianlengthof10,589bp.TheerrorratesofNanopore sequencingareclosetoPacBio,whicharearound18%[67]. 1.2Sequenceanalysisalgorithms Inthissection,weoverviewedthecommonsequenceanalysisalgorithmanditsapplicationin third-generationsequencingreads. 1.2.1Alignment AlignmentprogramisoneofthefundamentalalgorithmsforPacbioSMRTapplication,asright nowtheerrorcorrectionandwholegenomeassemblybothrelyonbuildingthealignment Thefundamentalideaofaligningisusingdynamicprogrammingandselectingamaximal scorepathgiventhedifferentpenaltyofinsertion,deletion,andmismatch(Figure1.3).Forexam- ple,inBLAST[4],anidentitywillscore5andamismatchwillgetpenaltyof4.Severaldifferent approachesareadoptedtoacceleratetheperformanceofaligning.Forexample,the"seedand extend"methodexactmatchesin k -merscaleandthenreducesthesearchspacebyonly 5 Figure1.3:Adynamicprogrammingalignmentmatrixforaligningsequence GATTACA against GCATGCU usingNeedleman-Wunschalgorithm[105].Inthisexample,asimplescoringsystem, match=1,mismatch=-1,gap=-1,isused.(a)Startfromthecellinthesecondrow,second column.Movethroughthecellrowbyrow,calculatingthescoreforeachcell.(b)Thescoreis themaximalscorecalculatedfromthecellfromthetop,left,orleft-top.Ifthescoreiscalculated usingleft-topcell,meansamatchormismatch.Otherwiseisaninsertionordeletion.(c)After theofalltable,thescoreoftheright-bottomcellofthematrixisthescorefor globalalignmentofthetwosequence.(d)Tracebackfromthelastcellisneededtoobtainthe alignment.Thealignmentis G-ATTACA vs. GCA-TGCU .ReprintedfromWikipedia. allowingthesequencetokeepwiththenumberofmatchesof k -mersabovethethreshold.Other methodslikeBurrows-WheelerTransform[83,84]andHashingfunction[90]canbealsoapplied totraditionalalignmenttools.AlthoughpeoplecanusethosetoolsonPacbioSMRTreads,the sensitivity,accuracyandspeedofalignmentarecompromisedduetothelongerroneousreads. 6 Rightnow,BLASR[21]isapopularalignmenttooldesignedforlongreadswithhigherror rates.BLASRcombineddatastructureusedforshortreadmappingandthealignmentmethod forwholegenomemapping.Attheprogramwillclustertheshortexactmatchesfoundus- ingtheBWT-FMindex[42,84]orsufarray[97].Bydoingthis,theprogramcandetermine theapproximatecoordinatethatreadsshouldaligntothegenome.Aroughalignmentusingthe sparsedynamicprogrammingonthesetofshortexactmatcheswillbethefollowingsteps.Fi- nally,withtheguideofthesparsedynamicprogramming,anaccuratealignmentwillbegenerated usingdynamicprogramming.ComparewithothertoolslikeBWA-SW(aligning132Mbasesin 434minutes)[89]andBLAT(aligning181Mbasesin4724minutes)[73],BLASRcanalignthe largestnumberofreads(230Mbases)tothereferencegenomewiththefastestspeed(20minutes) usingan E.coli O104:H4dataset. Therearemoretoolsdesignedforthirdgenerationsequencingcoming.Forexample,GraphMap [137]isanothertooldesignedforthirdgenerationsequencingwithagapped-seedstrategy.Wewill discussalignmenttoolsforthirdgenerationsequencinglaterindetail. 1.2.2Errorcorrection ErrorcorrectionisanecessarystepforusingSMRTsequencingastheerrorrateofrawreadsistoo high.However,longerreadswithsmallbiasmakeconsensuserrorcorrectionpossible.Rightnow, twomajorapproachesareavailable:hybrid[8,55,58,78,99,130]andnon-hybrid[24,93]error correction.Hybriderrorcorrectionusesshortbutaccuratesecondgenerationreadstohelpcorrecterrorin 7longthirdgenerationreads.Thettypeofhybriderrorcorrectiontools(LSC[8],PacBioToCA [78],andporovread[55])alignedaccurateshortreadstothirdgenerationlongreads,andthenus- ingtheconsensusalgorithmtogenerateerror-correctedconsensussequences.Thesecondtypeof tools(LoRDEC[130],Jabba[99])exploitsthecontextinshortreadsbybuildadeBruijngraph fromthoseshortreads.LongreadsfromPacbioSMRTarethenaddedastheguidetohelpthe programdecidethebestpathinthedeBruijngraph.Othermethods[58]beyondthesetwotypes arealsoavailable,butstillsharealotofsimilaritytothetoolsmentionedabove. UnlikehybriderrorcorrectionwhichneedssecondgenerationreadslikeIllumina,non-hybrid errorcorrectionmethodsonlyusereadsfromthethirdgenerationsequencing.HGAP(hierarchical genomeassemblyprocess)[24]usednon-hybriderrorcorrectionbeforestartingtheassembly process.ThemethodselectsthelongestseedreadsfromPacbioSMRTrawreads.Then alignedother"short"readtothoseseedreadsandbuiltamultiplesequencealignmentgraph. Theprogramcaneasilythemaximalscorepathinthegraphusingdynamicprogramming. Afterthepre-assemblystage,theaverageaccuracyofthedatasetisincreasedfrom86.9%to99.9 %[24].Similarmethods[93]werealsoappliedtoreadsfromOxfordNanoporesuccessfully.The nonhybridmethodreliesonthecoverageofdataset[34]. 1.2.3Wholegenomeassembly Thelongreadsfromthirdgenerationsequencingismorefavorableforwholegenomeassembly aslongreadscanovercomemanylimitationsfromshortreads,suchashandlinghighlyrepetitive genomicregions. Thenon-hybrid Denovo assembly[24]needspre-assemblyerrorcorrectionsteptocompensate 8 thehigherrorratesofPacbiodata.Highqualityreadsaftererrorcorrectionarefeasibletoapply traditionalassemblymethodthatcantakelong-readinput.Inthiscase,overlap-layout-consensus (OLC)methodisbetterthantheDeBruijnGraph(DBG),althoughthelatteroneisquitepopular forthesecondgenerationsequencingdata.IntheOLCmethod[101,103],theassemblywill trytoallversusallmatchandthendetermineifthereadpairhasoverlap.Thenagraphwillbe constructedbytreatingeachreadasanodeandconnectingoverlapreadsbyedges.Anassembly algorithmisavailablebyconsideringalltheinformationinthegraphandproducesconsensusout- put.Givenenoughcoverage,thequalityofHGAPassemblycanreachbeyond99.999%(QV50). FastoverlappingmethodisdesiredforOLCmethodassemblertoreducethehugecomputa- tionalcostforcalculatingtheoverlapbyall-versus-allalignment.Thetime-consumingover- lappingisthebottleneckof Denovo assemblyusingSMRTandpreventstheapplicationforlarger genome[12].MinHashAlignmentProcess(MHAP)isanewlydevelopedalgorithmcanef detectoverlapbetweenerroneouslongreads.MinHash[16]isthecoremethodofthealgorithm thatusedtoestimatethesimilarityofthetwosequenceswithoutacompletealignment(Figure 1.4).Toevaluatethesimilarityofthetwosequences,theMinHashstartbyconvertallkmersin eachofthesequencestointegerusingmultiplerandomizedhashingfunctions.Then thealgorithmwillcollecttheminimalvaluekmerforeachhashingfunctiontoconstructthesketch ofthesequence.Thesizeofthesketchisdeterminedbythenumberofhashingfunctionandis muchsmallerthanthesizeofallkmers.TheJaccardSimilarityofthesketchofthetwosequences canbetreatasanapproximationofthesharednumberofkmersbetweentwosequences.Thisis acomputationaleftoestimatethesimilarityoftwosequenceswithoutalignment.Inthe test,MHAPmethod(3.6CPUhoursforE.coliK12dataset)ismuchfasterthantheBLASR(87 CPUhours)tooverlappingforassembly.Intheexperiment,MHAPachievedacomparableor 9 Figure1.4:RapidoverlappingofnoisyreadsusingMinHashsketches.(a)TocreateaMinHash sketchofaDNAsequenceS,weobtainedallthek-mersofthesequencek-mers.Inthe exampleabove, k = 3,generate12k-merseachforS1andS2.(b)Multiplehashfunctionsare usedtoconvertk-merstointegerThenumberofhashfunctionsHdeterminesthe resultingsketchsize.Here,wechoose H = 4.Theminimumkmerwhichhashingfunctionvalue isminimalforeachhashisreferredasmin-mer.(c)Thesketchofasequenceiscomposedofthe Hmin-merinorder,whichismuchsmaller(size4)thanthesetofallk-mers(size12). Inthisexample,thesketchesofS1andS2sharetwosameminimumints(underlined). (d)RealJaccardsimilarity(0.22)isestimatedusingthefractionofsharedmin-mer(0.5)between thesketches.Tohaveanaccurateestimation, H ˛ 4isneeded.(e)Ifthesimilaritymeetsthe threshold,thepositionofsharedmin-mersintheoriginalsequenceiscomputedtodeterminethe overlapoffsetoftheS1andS2.Reprintedfrom[12]. improvedassemblythanBLASRwithlesstime. 10 Figure1.5:Threestage(correction,trimming,andassembly)inaPacbioSMRTassembler(Canu). Thecorrectionstepselectsthebestoverlapstouseforsequencecorrectionandgeneratescorrected readsfromconsensussequences.Thetrimmingstepunsupportedregionsintheinputand trimsorsplitsreadsbasedonthelongestsupportedrange.Theassemblystepmakesapass todeterminesequencingerrors;recomputeoverlapalignments;andoutputscontigs,anassembly graph,andsummarystatistics.Reprintedfrom[79] 11 1.3Challengesandoverview Althoughmanynovelalgorithmsforalignment,errorcorrection,andwholegenomeassembly havebeendevelopedforhandlinglongnoisyreads,therearestillmanychallengesforthethird- generationsequenceanalysis.First,mostnovelalgorithmsaredesignedtotargetcompletegenome assembly.However,inmanyscenarios,likemetagenomicsandtranscriptomicsequencing,the wholegenomeassemblyisnotrealistic,ortheresultoftheassemblyisnotgiventhe coverageofsequencing.Inbothcases,downstreamanalysisalgorithmneedstoprocesslongread withhigherrorrates.Forexample,forhomologysearchandproteindomainannotation,current methodslikepHMMcannothandletheframeshiftsintroducedbyalargenumberofinsertionsand deletionsinthethird-generationsequencingreads,leadingtoshortoralignments inthedownstreamanalysis.Second,forapplicationslikemetagenomics,theremaycoexistmany similarproteindomainsinthedataset,whichisnoteasytodistinguish,evenwithouterrors.Third, existingalgorithmsdesignedforPacBioorNanoporestillhavethepotentialforimprovementin performanceandefy. Inordertoaddressthesechallengesandimprovetheperformanceofsequenceanalysis,es- peciallyproteindomainpredictionandannotation,wehaveproposedanddevelopedthreetools: GroupK,FramePro,andDeepFrame.GroupKisanoverlapdetectiontooldesignedforPacBio andNanoporedata.TheexperimentalresultsshowedthatGroupKenablesmoresensitiveoverlap detection,especiallyfordatasetsoflowsequencingcoverage.BothFrameProandDeepFrame aredesignedforimprovingproteinpredictionfromthird-generationsequencing.FrameProisa homologysearchtooldesignedforPacBioreads.ItutilizesbothhiddenMarkov modelandsequencealignmentgraphconsensustoenablemoresensitivehomologysearch.Deep- FramefocusesonidentifyingencodedproteindomainsfromasinglelongnoisyDNAread.Ituses 12 aconvolutionalneuralnetworktoextractusefulfeaturestodistinguishprotein-codingsequences automatically.Intheexperiment,itoutperformspHMM-basedmethodsinbothon multipleproteinfamilies,andthedetectionofproteindomainfromotherunrelatedDNAreads. 13 Chapter2 OverlapDetectionOnLongNoisyReads 1 Genomeassemblyusingthird-generationsequencingdatarequiresdedicatedmethodsandtools. Existinggenomeassemblytoolsmainlyutilizetwotypesofgraphmodels:overlapgraphandde Bruijngraph.Whentheerrorrateislow,deBruijngraphhasthetheoreticaladvantagethatthe graphsizedoesnotincreasewiththesequencingcoverage,whichisusuallyhigh forIlluminadatasets.Forthird-generationsequencingdata,thehigherrorrateandlowcoverage maketheoverlapgraphasensiblechoiceforgenomeassembly[78].Akeystepinconstructing theoverlapgraphistoidentifyreadpairsthatshareoverlaps,whichindicatesthatthesereadsare sequencedfromthesamelociintheunderlyinggenome.Althoughthereareanumberofsequence alignmentprogramsavailableforconductingoverlapalignment[4,132],amajorityofthemrely ondynamicprogrammingandaretoocomputationallyexpensiveforhighthroughputsequencing data.Duetohigherrorrates,existingshortreadoverlapdetectionsoftwareusingBWT(Burrows- Wheelertransform)orhashtable[51,135]cannotbedirectlyappliedtolongreads.Toaddressthis challenge,manynewmethodwasproposed.Here,wesummarizethemajorexistingmethodand alsointroduceournewmethodfordetectingoverlapfromlongnoisysequences. 1 Du,Nan,JiaoChen,andYanniSun."Improvingthesensitivityoflongreadoverlapdetectionusinggroupedshort k-mermatches."BMCgenomics20.2(2019):190. 14 2.1Background 2.1.1Relatedwork Twostrategiesarecurrentlybeingemployedtodetectoverlapsforerror-pronelongreads.One strategytriestocorrectsequencingerrorsinPB(PBiosciences)andONT(OxfordNanopore) databeforeoverlapdetection.Thereexistanumberofsequencingerrorcorrectiontools[24,78]. Someofthemrelyonhybridsequencing,whichrequirespreparationofatleasttwosequencing librariesandseveraltypesofsequencingrunsandthusisnotcost-effectiveformanyapplications. Othersconducterrorcorrectionusinglongreadsonly.Onerepresentativemethodisdescribedin Chinetal.'shierarchicalgenome-assemblyprocessHGAP[24],whoseperformanceimproveswith higherreadcoverage.Itisworthnotingthatforalignment-basederrorcorrectionmethodssuchas theoneinHGAP,animportantstepistoidentifyreadsthatcanbealignedquickly.Essentially, techniquesusedforoverlapdetectioncanbeusedforalignmentdetectionaswell. Thesecondcategorybypassesthedifoferrorcorrectionandoverlapsusing rawreads.VariousapproximatesimilaritysearchmethodshavebeenappliedonPBandONT data[26].Theygenerallyfollowseed-chain-alignprocedure[88].Seed-basedltrationstepplays anessentialroleincontrollingthetrade-offbetweensensitivityandcomputationalefy.Usu- ally,thesemethodsuseshortstringmatchesasthetionstep.Ashortstringor k -mermatch requiresexactmatchesof k consecutivecharactersbetweentwosequences.Intuitively,overlapping readstendtosharemorecommon k -mersthannon-overlappingreads.Strategiesthatcanquickly thenumberofshared k -merscanthusbeapplied.Inthissection,wesummarizethemain strategiesofseveralstate-of-the-artoverlapdetectiontools.Wehighlightthedifferencesbetween ourmethodandtheexistingonesinthefollowingsection. MHAP[12],Minimap[87,88],andDALIGNER[102]alluse k -mermatchesforidentifying 15 candidateoverlappingpairs.Duetothehigherrorrate,usuallyonlyshort k -merswillbeapplied inordertoachievehighsensitivity.However,identifyingshort k -mermatchesbetweenallpairsof readsiscomputationallyexpensive.Thus,theleadingtoolsemployeddifferentdatastructuresand algorithmsforestimating k -mer-basedsimilarity.MHAPconvertslongreadsintosetsof k -mers andsketchesusingminHash.Thenthesimilaritybetweenreadsisestimatedusingthecompact sketches.Minimapalsousesacompactrepresentationoftheoriginalreadsbykeepingminimizers ratherthanallpossible k -mersofaread.Thencollinear k -merswillbeclusteredandusedfor checkingpossibleoverlaps.DALIGNERdirectlysorts k -mersbasedontheirpositionsandthen utilizesmergesorttoidentifythenumberofshared k -mers.Asthesortingiscacheef DALIGNERispracticallyveryef BLASR[21]wasinitiallydesignedformappingPBreadstoareferencegenome.Itisalso widelyusedasanoverlapdetectiontoolforPBdata.BLASRusesBWTandFMindextoidentify short k -mermatchesandthenclusters k -mermatcheswithinagivendistancerange.Theclustered k -mersarerankedbasedonafunctionofthe k -merfrequency.Onlyhighlyrankedclusterswillbe keptfordownstreamanalysis. Beingdifferentfromtheabovetools,GraphMap[137]usesspacedseedsthatallowmatches ofnon-consecutivecharacters.Spacedseedswereinitiallyusedinhomologysearchforimproving thetrade-offbetweensensitivityandefy[18,95,139].Inparticular,spacedseeds containingthepatternfi11*flhavehighsensitivityincapturinghomologousprotein-codinggenes becauseofthecodonstructure.However,designingoptimalspacedseeds(i.e.,decidingtheposi- tionsofthewildcardcharacters)isNP-hard[94,106].GraphMapempiricallychoosestwospaced seeds.Ideally,differentsetsofseedsmaybedesignedforinputdataofdifferenterror 16 2.1.2Overviewofourwork ThehigherrorratesandalsothedifferenterrorofPBandONTdatamotivateustouse amorexibleseedingstrategycalledgrouphitcriteria[109],whichagroupofpossibly overlapping k -merssatisfyingstatisticallyderiveddistanceconstrains.Forbrevity,wewillcall the k -merssetsatisfyingthegrouphitcriteriaasafigroupseedfl.Agroupseedwasinitially proposedandusedforhomologysearch.Giventheerrorsuchastheestimatedindelsand mismatchprobabilities,thresholdsforgroupingshort k -merscanbecomputedusingthewaiting timedistributionandtheone-dimensionalrandomwalk[109].Agroupseedcaneffectivelyhandle alltypesoferrorsandisidealtodetectsmalloverlaps.Withgroupseeds,wecanachievehigh sensitivityusingshort k -mers(e.g.,9-mer)whilestillmaintainingadesirable. Inthiswork,weemploygroupseedsfordetectingoverlappinglongreadsfor denovo genome assembly.Ourimplementation,namedGroupK,providesacomplementarytooltoexistingmeth- odsfordetectingsmalloverlapsoroverlapscompoundedbyhigherrorratesofbothreads.This abilityenablesourtoolasensiblechoiceforgenomeassemblyinmetagenomicdatasequencedby third-generationsequencingplatforms.Asthesecommunitysamplesusuallycontainmicroorgan- ismswithheterogeneouscoverage,beingabletoidentifysmalloverlapswillbeveryimportantfor reconstructinggenomesofrarespecies. 2.2Methods GroupKisdesignedforimprovingthesensitivityofdetectingsmalloverlapsoroverlapswithlow identity.Currently,third-generationsequencingdatastillhashigherrorrates.Theoverlapping regionsformedbytwoerror-pronelongreadscanhavelowersequenceidentitythanmappinga longreadagainstareferencegenome.Figure2.1presentsthehistogramoftheoverlapsizeandthe 17 Figure2.1:Histogramsofirreducibleoverlapsizes( top )andtheratioofoverlapsizetotheread length( bottom )whencomparingadjacentoverlappingreadsonsimulatedPB E.coli datasets givendifferentcoverages.Thebinwidthfortheoverlapsizeis500.Forthe30Xdataset,the averagereadlengthis8366andthenumberofreadsis16644.Forthe15Xdataset,theaverage readlengthis8253andthenumberofreadsis8436.Forthe8Xdataset,theaveragereadlengthis 8414andthenumberofreadsis4413. 18 correspondingratioofoverlapsizetothereadlengthbetweentwoadjacentreads.Thereadsare simulatedusingPBSIM[114]from E.coli withthreedifferentcoverages.Asweknowtheposition ofeachsimulatedreadinthegenome,theoverlapsizecanbeeasilydecided.Notethatareadcan formoverlapswithmultiplereadssequencedfromthesameregion.However,thetwoare generatedusingoverlapsbetweentwoadjacentreads,whichanfiirreduciblefledge[100]in anoverlapgraph.Thus,theseoverlapscandecidethecontinuityofthegenomeassembly. Theshowthattherearestillsubstantialregionswithsmalloverlaps.Forexample,thereare 45.46%,34.76%,and31.19%oftheoverlapsshorterthanthe50%ofthereadlengthfordatawith coverageof8X,15X,and30X,respectively.Itwillbeidealtodetectrelativelysmalloverlapsto fullytakeadvantageofthelongreadsforgeneratingmorecompleteassemblies. 2.2.1Pipeline IdentifyingsmalloverlapsiscomputationallydifThus,weuseacarefullydesignedhier- archicalstrategytodistinguishtrueoverlappingreadsfromnon-overlappingones.The pipelineofGroupKconsistsofthreekeysteps:groupseedmatching,andchaining(Fig- ure2.2).Filtrationisusedtoreducethesearchspacebyquicklyidentifyingreadpairssharing aminimumnumberof k -mers.Highinsertion/deletionerrorratestendtoproduceshort k -mer matchesondifferentdiagonals.Thus,weadoptgroupseedmatchingtoidentifyagroupofshort k -mermatchesincloseproximity.Therearetwotypesofdistanceconstraints.1)Thedistance (numberofnucleotides)betweenthe k -mersonx-axisandy-axismustbesmallerthanagiven threshold;2)thediagonaldistance,whichisthedifferenceofthediagonalsoftwo k -mermatches, mustbewithinagivenrange.Chainingisusedtoestimatetheoverlapregion.Figure2.3 showsthatapplyinggroupseedcanremovealargenumberofrandom k -merhitswhilekeeping the k -mermatcheswithintheoverlappingregion.When k = 15,thereareonlytwohits,anditis 19 Figure2.2:ThepipelineofGroupK. 20 diftodeterminewhetherthereisanoverlap.When k = 9,thereisclearlyachainformedby hitsintheoverlappingregion.However,therearealsoalargenumberofrandomhits.Withgroup seedmatchingcriteria,mostoftherandom9-merhitsareout.Sothedownstreamanalysis becomesmorestraightforward. 2.2.2Estimatetheexpectednumberof k -mersforthestage Inthissection,weanalyzetheexpectednumberofrandom k -merhitsbetweentworeadsandalso k -merhitsinoverlaps.Theanalysiswillbeusedfordeterminingthe k -mersizeandalsoother parametersforthestage.Giventworeads( S 1 and S 2 )withlength L anderrorrate e ,we wanttodeterminehowmany k -merhitsweexpecttobetween S 1 and S 2 . Weconsiderthecasethat S 1 and S 2 arenotrelated(nooverlap).Byassumingthatthe basesin S 1 and S 2 arerandomlydistributed,theexpectednumberofrandom k -mermatches E [ X r ] isroughly: E [ X r ]= 1 j S j k L 2 (2.1) Notethatthisequationisdifferentfromtheexpectednumberofshared k -mersinMHAP[12] becausewedistinguish k -merhitsbasedontheirlocationsratherthanthe k -mersthemselves.Also, weassumethatoverlapping k -mersareindependent. Inthesecondcaseof S 1 and S 2 forminganoverlap,weestimatetheexpectednumberof k -mer matchesintheoverlapbycomputing P o ,whichistheprobabilityofobservinga k -mermatch withintheoverlap.Inthecaseofnosequencingerror(i.e. e = 0),theprobabilityofobservinga k -mermatchatanalignedpositioninanoverlapissimply1.0.Butinthepracticalcaseof e > 0, weneedtoconsidertwoscenariosinordertodeterminetheprobabilityofobservingtwoidentical charactersatanalignedpositionintheoverlap:1)thecharactersfrom S 1 and S 2 arecorrect;2)the 21 Figure2.3:Thedotplotof k -merhitsandgroupseedsmatchesforoneoverlappingpairfromthe E.coli PBdatasetusedinSection2.3.Eachdotisa k -merhit.Thex-axisandy-axisshowthe locationsofthehitsonthereads.Theoverlapregionisroughlyfrom0to2800onthex-axis,and from1000to3800onthey-axis. Top: all9-merhits. Bottom: 9-merhitsthatpassedthegroup hitcriteria.Agroupseedisrepresentedbycloselylocateddotsofthesamecolor. 22 Figure2.4:Twocasescontributedtotheidenticalcharactersatanalignedpositionintheoverlap- pingregionof S 1 and S 2 . S 1 and S 2 aretworeadssequencedfromthesameregionoftheunderlying genomeandformanoverlap. Top: bothbaseson S 1 and S 2 arecorrect,formingamatch. Bottom: bothbaseson S 1 and S 2 aresequencingerrors,andsubstitutedbythesamecharacter. twocharactersfrom S 1 and S 2 areerrorsandarerandomlysubstitutedbythesamecharacter.The twocasesarevisualizedinFigure2.4.Sotheprobabilityisgivenby[12]: P o = ( 1 e ) 2 + e 2 1 j S j 1 k (2.2) Consideringbothrandom k -mermatchesand k -mermatchesinanoverlap,theexpectednumber ofshared k -mersbetweentwooverlappingreadsisestimatedby: E [ X o ]= P o M + 1 j S j k L 2 (2.3) M isthesizeoftheoverlap.Notethattheaboveequationslightlyover-countsthenumberof k -mer hitsinanoverlapbecausetherandom k -merhitsinsidetheoverlapmaybecountedtwicewith 23 probability P o ( 1 j S j ) k .(Also,weassumethattheprobabilitiesofsubstitutionandinsertion/deletion areonthesameorderandthusdonotdistinguishthemintheaboveequation.) Figure2.5:Thechangeof E [ X o ] and E [ X r ] (y-axis)withtheincreaseofthereadlength(x-axis), whichisobtainedfromarealPBdataset.Theoverlapsizeissetasthe1/4ofthereadlengthaswe focusonidentifyingthehardcaseofsmalloverlaps. k = 15. Aswearemainlyinterestedinsmalloverlapsoroverlapswithlowsequenceidentity, weplottheexpectednumberof k -merhitswiththeoverlapsizebeing1/4ofthereadsizein Figure2.5.Inordertoplotthewecompute E [ X o ] and E [ X r ] usingthereadlengthsfrom a15X E.coli PBdataset(thedataofoursecondexperimentinSection2.3).Weonlyconsider readsoflengthabove2,000.Theseallowustochoosetheappropriatethresholdfor k - mer-countingbasedForexample,Figure2.5showstheexpectednumberof15-mers betweenreadsofdifferenterrorrates. E [ X r ] startedas4when e = 0 : 15.And,forlarger e , E [ X r ] is evensmaller.Thus,ourdefaultrationthresholdistwo15-mersinordertoensurehigh sensitivity.Theimplementationdetailsofthe k -mercountingstagecanbefoundtowardstheend 24 oftheSection2.2. 2.2.3GroupHitCriteria Sequencingerrorstendtoproduceshort k -mermatches.Inaddition,theinsertion/deletionerrors leadto k -mermatchesondifferentdiagonals.Thus,insteadofusingrelativelylong k -mers(suchas 15or16-mers)asexistingtoolsdo,weuseagroupofshort k -mers(suchas9-mer)toaccommodate thehighinsertionordeletionerrorrates.Agroupseedisasetofpossiblyoverlapping k -merhits withstatisticallycalculatedconstraints.Theregioncontaininggroupseedsismorelikelytobe insideanoverlapthanasingle k -merhit.Reference[109]introducedthegrouphitcriteriaand alsoderivedthemethodtocalculatethecriteriastatistically.Weapplytheirmethodforoverlap detection. Assumethatwehavetworeads S 1 and S 2 oflength m and n ,respectively.Thenumbersof k -mersatdifferentpositionsin S 1 and S 2 are m k + 1and n k + 1,respectively.A k -mer hitatposition ( i ; j ) isby S 1 [ i ::: i + k 1 ]= S 2 [ j ::: j + k 1 ] ,where i m k + 1and j n k + 1.Fortwo k -merhitsat ( i 1 ; j 1 ) and ( i 2 ; j 2 ) ,theirinter-seeddistance D (( i 1 ; j 1 ) ; ( i 2 ; j 2 )) isthemaximumof j i 2 i 1 j and j j 2 j 1 j .The k -merdiagonalofa k -merhitat ( i ; j ) , d ( i ; j ) ,is as j i . Withthesenotations,thegoalistosolvethefollowinginequalitiesgivenlevel1 a bylevel a : D (( i 1 ; j 1 ) ; ( i 2 ; j 2 )) r (2.4) j d ( i 1 ; j 1 ) d ( i 2 ; j 2 ) j d (2.5) r and d areintegersweneedtothegrouphitcriteria.Forexample,when a = 0 : 05,our 25 goalistoderive r and d sothatwith95%chancetheinter-seeddistanceandthediagonalshift betweentwo k -mersinoverlappingreadsareatmost r and d ,respectively. 2.2.3.1Constrainton k -merdistance Wefollowedthemodeldescribedinthearticle[109].Runsofheadin n -independentBernoulli trialsareusedtomodel k -mermatches,withtheprobability p foramatchand ( 1 p ) foramis- match.Inthismodel,a k -mermatchcanbetreatedas k consecutivematchrunswithprobabilityof p k .Fromthewaitingtimedistribution[2,11],theprobabilitiesoftheinter-seeddistance x oftwo k -mersinanoverlapare: P [ D k = x ]= 8 > > > > > < > > > > > : 0for0 x < k p k for x = k ( 1 p ) p k ( 1 å x k 1 i = 0 P [ D k = i ]) for x > k (2.6) Withthelevel1 a , r canbesolvedusingfollowingequation: P [ D k r ]= 1 a (2.7) Intheactualimplementationweuse a = 0 : 05.Thus,thereis95%chancethattheinter-seed distanceofthetwoseedsinanoverlapislessthanthe r calculatedusingEquation(2.7). 26 2.2.3.2Constrainton k -merdiagonaldistance Thediagonalshiftbetweentwo k -mers, j d ( i 1 ; j 1 ) d ( i 2 ; j 2 ) j ,intwooverlappingreadsiscaused byinsertionsanddeletions.Notethattheinsertionsanddeletionsarebycomparingtwo reads,notbetweenareadtoareferencegenome.Sotheinsertionrateanddeletionratearetreated equally. Thediagonalshiftbetweentwo k -merhitscanbemodeledbyadiscreteone-dimensionalran- domwalkmodel[41,109].Thediagonalshiftstartsfrom0.Letthestepsoftherandomwalkbe l .Assumethattheinsertionanddeletionrateis q acrossthewholeread.Thus,theprobabilityof adiagonalchangeis q ,andtheprobabilityofstayinginplaceis1 2 q .Also,weassumethatin l steps,thereare n i insertednucleotides(increaseshift), n d deletednucleotides(decreaseshift),and n m matchednucleotides(noimpactonshift).Ifthediagonalshiftis i ,wehavethefollowing equations: 8 > > > < > > > : n i + n d + n m = l n i n d = i (2.8) Reference[109]calculatedtheprobabilityofobtainingadiagonalshift i after l stepsinthe randomwalk.AccordingtoEquation(8),wehave n i = i + n d and n m = l ( i + 2 n d ) .Fora n d ,theprobabilityofarandomwalkproducingdiagonalshift i canbecalculatedasthenumber ofthepossiblepaths l i + 2 n d i + 2 n d i + n d ,timestheprobabilityproductofallinsertions,deletions,and noshiftchangeateachstep q n d q n d + i ( 1 2 q ) l ( i + 2 n d ) .Tocalculatetheprobabilityofgeneratinga diagonalshift i given l , P [ i ; l ] ,weneedtoconsiderallpossiblevaluesof n d ,whichisfrom0(no deletion)to ( l i ) = 2(nomatch).Sowehave: 27 P [ i ; l ]= å ( l i ) = 2 n d = 0 l i + 2 n d i + 2 n d i + n d q n d q n d + i ( 1 2 q ) l ( i + 2 n d ) (2.9) Tocalculate d ,wesumuptheprobabilities P [ i ; l ] for i = 0 ; 1 ; 2 ;:::; l untilwereachthelevel 1 a .Wereferthereadertotheoriginalarticle[109]foramoredetaileddiscussionofthemodel andapracticalimplementationusinggeneratingfunctions. Inourexperiment,wesetthesequencingaccuracy p = 0 : 85andtheindelrate q = 0 : 06,asPB readstendtohavehigherindelratesthansubstitutionerrorrates.Userscanadjusttheseparameters basedontheirdataproperties.Thelevel a is0.05and k is9.UsingEquation(2.6), wecanobtain r = 54.UsingEquation(2.9)and r asanestimationof l ,wecanobtain d = 5given r = 54.Thus,when k = 9,seedswithinter-seeddistance D (( i 1 ; j 1 ) ; ( i 2 ; j 2 )) 54anddiagonal shift j d ( i 1 ; j 1 ) d ( i 2 ; j 2 ) j 5areclusteredinthesamegroup.Inaddition,if k -mers ( i 1 ; j 1 ) and ( i 2 ; j 2 ) areinthesamegroupand k -mers ( i 2 ; j 2 ) and ( i 3 ; j 3 ) areinthesamegroup,wewillcluster ( i 1 ; j 1 ) and ( i 3 ; j 3 ) aswell. 2.2.4GroupChaining Withgrouphitcriteria,GroupKcanshortsimilarregions.Toidentifytheoverlappingregion, weaimtoachainofgroupseedmatchesthatmaximizesthenumberofmatchedbases.We usedthesparsedynamicprogrammingforchaining[54,68]. Aftergeneratingachainofgroupseedmatches,weneedtodeterminewhetherthischaining anoverlap.Wedeveloptwocriteriaforthispurpose.First,wecalculatetheexpected numberofmatchedbasesfromthegrouphitcriteria,assumingthatthechaincoversthepossible overlappingregionwithlength L O ( L O canbeestimatedbytheextensionofbothendsofthe 28 optimalchain).Weusedthefollowingequationtocalculatetheexpectednumberofthematched bases n e : n e = 1 c L O r k (2.10) Where c isacoeftocontrolthecriteria, k isthesizeof k -mer, r isthegrouphitcriteriafor theinter- k -merdistance.Weonlyreportthechainingresultifthenumberofmatchedbases n n e . Second,werequirethatbothreadshavesimilarsizesinsidetheoverlappingregion. Figure2.6:Anexampleoftheoverlapsizeestimation.Thesufofread1andtheofread 2formanoverlap.Eachshortsolidlinerepresentsagroupseedmatchintheoptimalchain.The blackdashedlineindicatesthetrueoverlapalignmentregionbetweenthetworeads.Thegray dashedline,whichisformedbythetwoendinggroupseedsintheoptimalchain,canoverestimate theoverlapsize. Inourexperiment,wefoundthatsometimesusingtheoptimalchaingeneratedfromsparse dynamicprogrammingmayoverestimatetheoverlapregion,asshowninFigure2.6.Thisover- estimationcanjeopardizethesensitivityofdetectingsmalloverlaps.Wethisproblembyonly keepingthecollineargroupseeds,whichareusedtoestimatetheoverlapsize. 29 2.2.5Implementationdetailsofthemajorcomponents 2.2.5.1Filtrationby k -mercounts Inthestepofourpipeline,weuse k -mercounting-basedtoremovelargenumbers ofreadpairsthatarenotlikelysequencedfromthesamelociontheunderlyinggenome.Weim- plemented k -mercountingusingageneralizedsufarrayandthederivedlongestcommon (LCP)array.Thegeneralizedsufarray SA iscreatedfromtheconcatenatedreads(delimitedby specialcharacterssuchas$)usingalinearalgorithm[125].Then,wecreatetheLCPusingboth thesufarray SA andthereversedsufarray SA 0 [70,71].Letthesequenceofconcatenated readsbe T .Followingtheofthereversedsufarray,forasufstartingatposition SA [ i ] ,wehave SA 0 [ SA [ i ]]= i .Foreachposition i intheLCP,LCP[i]containsthesizeofthelongest commonbetween SA [ i ] and SA [ i 1 ] .Thekeyobservation[125]forefcomputation ofLCP[i]is:foraposition j in T ,if LCP [ SA 0 [ j 1 ]] is L , LCP [ SA 0 [ j ]] L 1.ThewholeLCP arrayconstructiontakeslineartimetothesizeof T [125]. Inordertocounttheshared k -mersbetweenreadsandalsoreportreadpairspassingthe k - mercountingthreshold,weusebothLCPandanauxiliarydatastructurerecordingthereadIDs (denotedasarray readID ).Forasufstartingatposition SA [ i ] ,itsreadIDisat readID [ i ] .The pseudocodeofthenumberofshared k -merscanbefoundinAlgorithm1.Inpractice,we alsocount k -mersbetweenareadandtheotherread'sreversecomplement. 2.2.5.2Groupseedmatchandchaining Anypairofreadsthatpasstheabovestagewillbeusedasinputforinggroupseed matches.Allotherpairswillbediscarded.Currently,weareusingthecodesofYASS[108,110] forthegroupseedmatches.Theprogramuseshashingtabletoshortexactmatches 30 Algorithm1 Counting k -mersbetweenreads Input: LCParray LCP [ 1 :: n ] ,readIDarray readID [ 0 :: n ] , k -mersize k , k -mercountsthreshold t Output: Readpairswhoseshared k -mer counts t 1: Initializeamap counts [ key ][ value ] forrecording k -mercounts 2: for i = 1to n do 3: readi readID [ i ] 4: j i + 1 5: L LCP [ j ] 6: while L k and j n do 7: readj readID [ j ] 8: if readi < readj then 9: key readi : readj 10: endif 11: if readj < readi then 12: key readj : readi 13: endif 14: if counts [ key ] doesnotexist then 15: counts [ key ] 1 16: else 17: counts [ key ] ++ 18: endif 19: j ++ 20: L min( L , LCP [ j ] ) . min( L , LCP [ j ] )returnstheminimumof L and LCP [ j ] 21: endwhile 22: endfor 23: forall key in counts do 24: if counts [ key ] t then 25: output key . keycontainsreadpairIDspassingthe k -mercountthreshold 26: endif 27: endfor 31 andcreatesthegroupsofmatchesonthe.Itisourfutureworktore-implementgroupseed matchingusingmoreeftindexing-basedmethods.Implementationofchainingalgorithmhas beenfromtheprogramofglobalchainingalgorithminSeqAnlibrary[32]. 2.2.5.3Timecomplexityanalysis Ifwehave N readswithaveragereadlength L ,ourtextsizeis NL .Therefore,wehave NL elements inthesufarrayandthecorrespondingLCParray.Foreachsufxinthesufarray,supposeon average,itcanformLCPswith m othersufeswithsizeabove k ,whichisthesizeof k -merused inthe k -mer-countsteps.Sothetimecomplexityofalltheshared k -mersforall possiblereadingpairsis O ( mNL ) .If k islargeenough(e.g., k = 15and11inourexperiment),we have m ˝ NL ,sothetimecomplexitywillbedominatedby NL . For N 0 readpairsthatpassthestage,lettheaveragenumberof k -merhitsforeach pairbe q .Sortingthehitswillneed O ( q log q ) anditeratingthroughallhitstogroupsislinear to q .Forallthereadpairs,thetimecomplexityis O ( N 0 q log q ) . Assumingwehave r groupseeds,thechainingprocedurehascomplexityin O ( r log r ) foreachreadpairs.Forallthereadpairs,thetimecomplexityis O ( N 0 r log r ) .Itispracticallyvery fastbecausethenumberofgroupseedmatchesisverysmallcomparedtotheoriginalseedhits (indicatedbyFigure2.3). 2.3Results Wefocusonevaluatingthesensitivityandprecisionofoverlapdetection.WeappliedGroupKto threePBdatasetsandoneONTdataset:asimulatedPBRSII E.coli sequencingdataset,arealPB RSII E.coli sequencingdataset,aPBRSIIhumanfootmetagenomicsequencingdataset,andan 32 ONT E.coli (SQK-MAP-006)dataset.Forsimulated E.coli dataset,wehavethetruesampling positionforeachreadasourgroundtruth.Forthereal E.coli datasetandhumanfootdataset,we determinethegroundtruthviaBLASR's[21]alignmentsagainstthereferencegenome. WebenchmarkedGroupK'sperformancewithMinimap[87],Minimap2[88],DALIGNER [102],MHAP[12],andGraphMap[137].Thosetoolsandmethodsarerepresentativeoverlap detectiontoolsforlongerroneousreadsfromPBorONT[26].Allthedetailedparameterscan befoundatthewebsitelistedin[33].Ourmainmetricsinclude:(1)sensitivity,whichmeasures theratioofthetrueoverlapsbyeachprogramtothewholesetofoverlappingpairs;(2) precision,whichtheratiooftrueoverlapdetectedbyeachprogramtothetotalreported overlappingpairs;and(3)F1score,whichistheharmonicmeanofsensitivityandprecision.A reportedoverlappingpairisregardedascorrectifitisalsopresentinourgroundtruth.Thedetailed overlapregionandoverlaplengthwerenotconsideredinthecurrentevaluationbecausetheseread pairscangothroughamoreaccuratealignmentprogramforgeneratingtheoverlapalignment. Aswediscussedbefore,ourgoalistoidentifyoverlappingreadswithoutusingerrorcorrection. Alltestedtoolswillthusbeappliedtotheirrawdataset. 2.3.1Simulated E.coli dataset Weevaluatedtheperformanceofourmethodonasimulated E.coli dataset.Thedataset wasgeneratedusingPBSIM[114]with E.coli K-12MG1655asthereferencegenome[60].The lengthdistributionandthequalitywerederivedfromrealPBP6-C4 E.coli dataset[118]. Thesimulateddatasethas5,620reads,withaveragelengthof8,344.78bpsand14.5%average errorrate(8.6%insertions,4.4%deletions,and1.4%substitutions). FromthereportbyPBSIM,wecanobtaintheexactlocationswherethesimulatedreadsare sampledinthegenome.Thisinformationprovidesuswiththegroundtruthforreads'overlapsso 33 thatwecancalculatethesensitivityandprecision. FollowingthepipelinewediscussedinSection2.2,werstused k -mer-countingasthe tionstage.AccordingtoEquation(2.1),Equation(2.3),andFigure2.5,wediscardedallreadpairs withlessthantwo15-mermatches.Thesensitivityoftheis0.979andonlyabout6%of readpairsarekeptfordownstreamanalysis. Figure2.7:TheROC-likeplotusingGroupK,Minimap,Minimap2,DALIGNER,MHAP,and GraphMaponthesimulatedPB E.coli dataset.Thex-axisrepresentsthefalsediscoveryrate (FDR = 1 precision).Y-axisisthesensitivity(0.5to1). Weevaluatedtheperformanceofourtoolbyadjustingthegroupseedmatchcriteriacoefient c ,whichisintroducedinSection2.2.Withtheincreaseof c ,sensitivitywillbecomehigher,andthe precisionwillbecomelower.AsshowninFigure2.7,GroupKcanachieve5%to6%improvement onthesensitivitywithsimilarprecisiontootheroverlapdetectiontools. 34 GroupKMinimapMinimap2DALIGNERGraphMapMHAP Time(seconds)1871301639171858 Memory(GB)1.9941.7541.0972.2881.8732.562 Table2.1:Computationalperformanceonthesimulated E.coli dataset.Thecomputationalper- formanceofoverlapdetectionusingGroupK,Minimap,Minimap2,DALIGNER,MHAP,and GraphMaponthesimulated E.coli dataset. 2.3.1.1Runningtimeandmemoryusage Weevaluatedtherunningtimeandthepeakmemoryusageofthetestedtoolsinthisexperiment. Werunalloverlapdetectiontoolswithasinglecoreof2.4Ghz14-coreIntelXeonE5-2680v4 CPUand32GBmemoryrequestedfromtheHigh-PerformanceComputingCenteratMichigan StateUniversity.TheperformanceismeasuredwiththebestF1score.Theresultsarereported inTable2.1.Forthememoryusage,Minimap2isthemostefonebutallothersarecom- parable.GroupKisslowerthanothertools,partiallybecauseweusesmall k -mers.Wefound thatthebottleneckofourprogramisthegroupmatchingstage,whichaccountsforabout1200of 1871seconds.Byimplementingamoreefindexing-basedmethod,weexpecttoreducethe runningtimeofthisstage.Forexample,wecanspeedup k -mercountingbyadoptingthemethod usedinKMC[77]. 2.3.2RealPB E.coli dataset Afterusingthesimulateddatasettoevaluateourmethod'sperformance,weappliedGroupKtoa realPBRSII(P6-C4) E.coli dataset[118].Thecoverageofthewholedatasetis150X.Totestthe performanceoflowcoveragedata,wesampleda15Xcoveragedatasetbasedonthereadlength distributionofthewholedataset.Thedatasethas14,262reads,withtheaveragelengthof4,882.09 bpsandaverageerrorrateof14.14%(errorrateisestimatedusingqualityscore). WeappliedBLASRtomapthereadstothereferencegenometoestimatethegroundtruth 35 (BLASRwasrunwithparameters:minReadLength,2000;maxScore,1000;maxLCPLength,16; minMatch,12;m4andnCandidates/bestnsetto10 sequencingcoverage).Themappingre- sultfromBLASRmaycontainshortalignmentsduetotherepeatinthegenomes.Sowhenwe determinethegroundtruth,weonlyconsiderthealignmentsthatcoveratleast80%oftheread. Byremovingshortnoisyalignments,wecanmakesuretheBLASRalignmentsareclosetothe underlyinggroundtruth. GroupKMinimapMinimap2DALIGNERGraphMapMHAP BestF1score 0.9311 0.90370.84260.83400.77580.7023 Sensitivity 0.9330 0.89390.87780.90660.67410.7258 Precision 0.92920.9138 0.81010.7722 0.9137 0.6802 Table2.2:Overlapdetectiononthereal E.coli dataset.Theperformanceofoverlapdetection usingGroupK,Minimap,Minimap2,DALIGNER,MHAP,andGraphMapontherealPBRSII (P6-C4) E.coli dataset.HereweonlyreporttheexperimentresultswiththehighestF1scorefor eachtool. Weusedthesamesetupadoptedinthesimulated E.coli experiment.Amongall overlapdetectiontoolswetested,GroupKstillachievedthehighestsensitivitywithcomparable precision(Table2.2).Withslightlyhigherprecision,oursensitivityis4%betterthanthenext besttool,Minimap.Comparedtothepreviousexperiment,thedifferenceinsensitivityissmaller. Onereasonliesintheconstructionofthegroundtruthdataset.Inthesimulateddataset,weused thesamplepositionsofallreadstodeterminewhethertworeadsformanoverlap.Thus,that datasetcanincludereadswithsmalloverlapsorreadswithhighererrorrates.Inthisdataset,our methoddiscardedBLASRalignmentswithhigherrorratesandtheremainingalignmentshave highersimilaritieswiththereferencegenomeandthusproducefewerfihardcasesfl.AsMinimapis thesecondbesttoolforthisdataset,wefurtheranalyzedtheperformanceofGroupKandMinimap onreadpairsofdifferentoverlapsizeinthenextsection. 36 2.3.2.1Performancewithdifferentoverlapsize Wedividealloverlappingpairsintobinsofwidth500basedontheoverlapsize.Forexample, thebinhasthereadpairswithoverlapsizesfrom0to499,andthesecondbinhasreadpairs withoverlapsizesfrom500to999,andsoon.Foreachbin,wecomparedthesensitivityof GroupKandMinimapusingparametersyieldingthesimilarprecisioninFigure2.8.ForMinimap, weshowedtheresultwiththehighestF1score.ForGroupK,weselectedaparametersothatit achievessimilarprecisiontoMinimap(F1score:0.9241,sensitivity:0.9344,precision:0.9140). AccordingtoFigure2.8,GroupKhasmuchbettersensitivitywhentheoverlapsizeislessthan 2,000.AsweshowedinFigure2.1,thereareanumberofoverlapswithoverlapsize smallerthan2,000evenfor30Xcoverage.Beingabletoidentifysmalloverlapsallowsusto generatemorecompleteassembliesusinglongreads.Thisisparticularlyusefulforlowcoverage data,suchaswhatweusuallyhaveinmetagenomicdatasets. 2.3.3HumanFootMetagenomicDataset Oneofthemajorutilitiesofourtoolistoidentifyoverlapsbetweenreadsinmetagenomicdatathat aresequencedusingthePBplatform.Forcomplicatedmicrobialcommunities,metagenomicdata containingonlyshortreadsposesseriouscomputationalchallengesfor denovo assemblyandthe downstreamcomposition/functionalanalysis.Longreadsholdthepromisetoproducemorecom- pleteandaccuratemicrobialgenomeassembliesforthemetagenomicdataset.Inthisexperiment, weevaluatedtheperformanceofoverlapdetectionforamockmetagenomicdatasetconstructed fromarealhumanfootdataset[144].Aparticularchallengeforthisexperimentisthelowcover- ageofthecomponentspeciesinthemetagenomicdataset,whichcouldbecausedbysequencing throughputandcomplexityofthesample. 37 Figure2.8:SensitivityofGroupKandMinimapfordetectingoverlapsofdifferentsizeonPB E. coli dataset.Thex-axisrepresentstheoverlapsize.Y-axisisthecorrespondingsensitivityofthe bin.TheX-axisbinwidthis500andtheonlyincludedthe6bins(i.e.uptooverlap size3000)astheirsensitivitybecomesmoresimilarwiththeincreaseoftheoverlapsize. 38 ThehumanfootsamplewassequencedbylinearPBRSIITdT(terminaldeoxynucleotidyl transferase).Sequencesthatcanbemappedtothehumangenomewereremovedashost-derived DNA.AccordingtotheSupplementaryMaterialsof[144],thereareabout1,000bacteriaand virusesinthismetagenomicdataset.However,wecannotevaluatetheperformanceofoverlap detectiononallthereadsfromthe1,000microbesbecausethecoveragesofmanyspeciesaretoo lowtoyieldmeaningfuloverlaps.Inordertoconstructthegroundtruth,weneedtoalignthereads againstspecieswithknownreferencegenomesandreasonablecoverage.Thus,weonlychoose readssatisfyingthefollowingcriteria:1)thereadsaresequencedfromaspecieswithknownrefer- encegenome;and2)thecoverageofthespeciescannotbetoosmall(e.g., > 3Xcoverage).Based onthesecriteria,wekeepthereadssequencedfromthreebacteria: Corynebacteriumaurimucosum (6.3XCoverage), Corynebacteriumtuberculostearicum (8.5XCoverage),and Staphylococcusho- minis (3.2XCoverage).ThereadsarerecruitedviaBLASR.Thealignmentpositionsareusedto determinewhichreadsformanoverlap.Notethat Corynebacteriumaurimucosum and Corynebac- teriumtuberculostearicum belongtothesamegenusandmaycontributetothefalsepositiveover- lapdetectionduetotheirsharedregions.Theaveragelengthofthereadsis1696.25bps,whichis muchshorterthanthereadsinthepreviousexperiments. Asthisdatasetcontainsmuchshorterreads,theexpectednumberof k -merhitswillchange. Intuitivelyweneedtouseshorter k -merstoensurehighsensitivity.Usingthereadlength distribution,wecalculated E [ X r ] (Equation2.1)and E [ X o ] (Equation2.3)anddeterminedthe k - mercounting-basedcriteria.Inthisexperiment,weonlykeptreadpairsthatshareatleast three11-mers. Forthismockmetagenomicdataset,GroupKyieldedbetterperformancethan othertoolsonmetricsincludingF1score,sensitivity,andprecision(Table2.3).Comparedto othertools,GroupKcanproducemuchhighersensitivitywithoutprecision,leadingto 39 GroupKMinimapMinimap2DALIGNERGraphMapMHAP Total: BestF1score 0.9163 0.83060.87760.69110.71880.7812 Sensitivity 0.8954 0.78020.83520.60120.57210.6803 Precision0.93810.88800.92450.8027 0.9666 0.9174 C.aurimucosum : BestF1score 0.9512 0.80720.91170.74320.73970.8545 Sensitivity 0.9228 0.68580.84670.62660.58920.8045 Precision 0.98140.98060.9874 0.9131 0.9937 0.9111 C.tubercu- lostearicum : BestF1score 0.9454 0.86880.90500.83150.72760.7961 Sensitivity 0.9105 0.79580.83460.71500.57270.7355 Precision 0.9830 0.9567 0.98840.99340.9977 0.8675 S.hominis : BestF1score 0.9163 0.76510.80240.87540.63430.6861 Sensitivity 0.8733 0.68870.6813 0.7967 0.46580.6348 Precision0.92450.86060.97590.9713 0.9938 0.7464 Table2.3:Overlapdetectiononthehumanmetagenomicdataset.Theperformanceofoverlap detectionusingGroupK,Minimap,Minimap2,DALIGNER,MHAP,andGraphMaponthemock metagenomicdataset.HereweonlyreporttheexperimentalresultswiththehighestF1scorefor eachtool. thehigherF1score.Besidesevaluatingtheperformanceofvarioustoolsonallthereadsfrom thethreespecies,wealsoreportedtheperformanceofdifferentoverlapdetectiontoolsoneach singlebacteriadatasetwithoutmixingwithotherspecies(Table2.3).Inthesetools,GraphMap hashighforallthreewithofsensitivity.However,GroupKstillachievesthe bestperformanceoverall.Thecomparisonssuggestthatourmethodhasgreatpotentialtodetect overlapsfordatawithverylowcoverage(around5X).ThiswillenablebetterassemblyforPB sequencedmetagenomicdata,whichwillbecomemoreavailablewiththeadvancesoflongread sequencingtechnologies. 40 2.3.4RealONT E.coli dataset WealsotestedourmethodononeONTdataset.Weuseddownsampled15Xcoverage2Dreads fromtheSQK-MAP-006datasetas2Dreadsprovidehigherqualitythan1Dreads.Wefollowed thesamepipelinesweusedforthereal E.coli PBdataset.As2DONTreadshavesimilarerror ratestothePBreads,weexpectthatourtoolcanstillachievereasonableperformancegiventhe samesetupforthePBdataset.Therefore,weusedthesameparametersastheonesweusedforthe PB E.coli dataset. GroupKMinimapMinimap2DALIGNERGraphMapMHAP F1score 0.93830.9310 0.90900.8272 0.9280 0.8122 Sensitivity 0.95970.9546 0.93620.87300.89910.8871 Precision0.9178 0 .90850.88330.7860 0.9589 0.7490 Table2.4:OverlapdetectionontheONT E.coli dataset.Theperformanceofoverlapdetection usingGroupK,Minimap,Minimap2,DALIGNER,MHAP,andGraphMapontherealONTSQK- MAP-006 E.coli dataset.Minimap2usestheava-ontsetup,whichisoptimizedforONTdata. Amongthesetools,GraphMapwasdesignedforONTdata,Minimap2providesas setupfortheoverlaponONTdataset.AllothertoolsarenotdesignedforONT datasets.GroupKachievesthebestF1scorecomparedtoothertools'defaultsetupwhilekeeping thehighestsensitivity(Table2.4).Thisresultsuggeststhatourstrategyisrobustwithdifferent typesoflongreads. 2.4Discussion Seedingisakeystepforoverlapdetectionbecauseofthehigherrorrateoflongreads.Successful seedingstrategiesshouldbalancethesensitivityandthetoachievetheoptimalperfor- mance.Popularseedingmethodsincludemaximalexactmatches,spacedseeds,andgappedspaced 41 seeds.However,tosuccessfullyahitbetweentworeads,thesemethodsstillneedeitherto relativelylongcontinuousexactmatches(large k -mer)ortoinexactmatchesfollowingcertain errorpatterns(spacedseed).Comparedtothesemethods,groupseedmatchingismorexibleas itrequiresmultipleshortexactmatcheswithoutspecifyingtheerrorpatterns.Thisxibilityleads tohighsensitivity,andmeanwhiletheisstillguaranteedwiththegroupseedmatch criteria. Currentlythegroupseedmatchingstepbasedonhashtableisthebottleneckofouroverlap detectionpipeline.Anewmethodthatcanimprovetherunningtimeefyofthisstepis neededtomakethealgorithmachievethesamespeedasotherfasteroverlapdetectiontools. 2.5Conclusions Inthiswork,wedevelopedanoverlapdetectiontoolforthird-generationsequencingdata.By adoptingthegrouphitcriteriatoclusteragroupofshort k -merhitsthatsatisfystatisticallyderived distanceconstrains,ourmethodcanimprovethesensitivityofoverlapdetectionwithout precision.Ourexperimentalresultshaveshownthatfordatasetswithlowsequencingcoverage, ourprogramcandetectsmoreoverlappingpairswhilekeepinghighprecision.One utilityofourapproachistodetectsmalloverlapsbetweenlongreadsofrarespeciesinamicrobial community. 42 Chapter3 prHMMmodelandhomologysearch 1 3.1Introduction ComparedtoIllumina,therepresentativesecondarygenerationsequencingplatform,themajor disadvantagesofPBincludehighsequencingerrorrate(11-15%),lowerthroughput,andhigher costperbase[124,126].Similartopyrosequencingdata,mostoftheerrorsareinsertionordeletion errors.Thehigherrorrateposeschallengesforalldownstreamsequenceanalysis.Inparticular, duringhomologysearchforgenomeannotation,sequencesarealignedtocharacterizedprotein sequencesorfamilies.Insertionordeletionerrorsingeneswillcauseframeshiftsandmayonly leadtomarginalalignmentscoresandshortalignments[154].Asaresult,itishardtodistinguish truealignmentsfromrandomalignments.Theinaccuratehomologysearchresultscanincurerrors instructuralandfunctionalannotation. Differentstrategieshavebeenproposedorimplementedtoavoidorcorrectsequencinger- rorsinPBdata.TherearevariousPBsequencingprojectsthatmainlyusecircularconsensus sequencing(CCS)readswithsufsequencingpasses.Acoverageof15passesyields > 99% accuracy[128].However,CCSreadsaremuchshorterthanthecontinuouslongreadsofPBdata. Inaddition,theamountofCCSreadsismuchlessthanalltheoutputofPBdata.Thus,usingonly CCSreadsdoesnottakefulladvantageofthesequencingpowerandstrengthofPBdata. 1 Du,Nan,andYanniSun."ImprovehomologysearchsensitivityofPacBiodatabycorrectingframeshifts."Bioin- formatics32.17(2016):i529-i537. 43 OnepopularstrategytohandlesequencingerrorsofPBdataisbasedonhybridsequenc- ing[78].AsIlluminaproducesmanymoreaccuratebutshorterreads,methodsaredeveloped tocorrecterrorsbyaligningshortreadstolongPBreads.Yet,thismethodneedspreparationofat leasttwosequencinglibrariesandseveraltypesofsequencingruns,whichisnotcost-effectivefor manyapplications. Unlikehybridsequencing,therearemethodsthatdonotrequirehighlyaccurateshortreads forerrorcorrection.OnerepresentativemethodisdescribedinChinetal.'shierarchicalgenome- assemblyprocessHGAP[24],whichalignsshortsequencestothelongestreadsofthesamese- quencinglibraryofPB.AsthesequencingerrorsinPBreadsoccurrandomly,theinferredconsen- sussequencefromthealignmentbetweentheshortreadsandlongreadsrepresentthehigh-quality sequence.Despiteitssuccess,thereisstillroomtoimprovetheerrorcorrectionperformancefor theconsensussequenceextractionstageinHGAP.Inparticular,itsperformanceisheavilyaffected bythecoverageofthealignedshortsequencesagainstthelongseedsequences.Theregionswith moreshortsequencesalignedhavebettererrorcorrectionperformancethanotherregions. Aftererrorcorrection,correctedPBreadscanusuallyachievemoresensitivehomologysearch resultscomparedtotherawdata.Inparticular,whenthecoverageishigh,thecorrectedreads fromHGAPcanachievealignmentscoressimilartothegroundtruth.However,inpractice,notall PBsequencingprojectscanhavesufcoverageforallregions,transcripts,orgenomes.For example,HGAPfailedtoassemblethedatafromthearmsampleinhumanskinmicrobialcom- munity[144]becauseoflowcoverageofthedataset.Figure3.7inourexperimentalresultsshow thatthedifferenceofthealignments'scores,lengths,andE-valuesbetweenHGAP'scorrected readsandthegroundtruthisstillThus,thereisstillaneedforhomologysearchtools designedforPBdata. Inthiswork,wedesignedandimplementedFrame-Pro,ahomologysearchtoolforPBreads. 44 Theexperimentalresultsshowedthatourtoolcanimprovethehomologysearchsen- sitivitywhilealsocorrectingsequencingerrors.Ourmethodincorporatedtwokeyobservations. First,asshownbyHGAP,sequencingerrorsinPBaredistributedrandomlyandthustheconsen- sussequencestendtobeclosertogroundtruth.Ourworkincorporatedthismethod.Second,we identifyframeshiftscausedbysequencingerrorsusingcharacterizedproteinfamiliesastheguid- ance.Essentiallyourmethodcorrectserrorsbymaximizingbothalignmentscoreagainstprotein familiesandlocalcoveragescoreinaconstructedalignmentgraph.Bothobservationsareused togethertoboosttheperformanceofbothhomologysearchanderrorcorrection. TheremainderofthisChapterisorganizedasfollows.Section3.2reviewsother frameshifterrordetectiontoolsandtheirlimitationsinproteindomaininPBdata sets.Section3.4describesthedynamicprogrammingalgorithmthatincorporatesconsensusse- quenceandViterbialgorithmforerrorcorrectionandsequencealignment.InSection3.5, wedemonstratetheresultsoferrorcorrectionandhomologysearchbyapplyingourtooltosim- ulatedandrealPBdata.WealsobenchmarkourtoolwithHGAP,asuccessfulerrorcorrection methodwithoutrelyingonhybridsequencing.Finally,Section3.6concludesandsuggestsdirec- tionsforfuturework. 3.2Relatedwork 3.2.1Prhomologysearch Homologysearchisstillanimportantstepinsequence-basedfunctionalanalysisforgenomicdata. Bycomparingquerysequencesagainstreferencesequencesori.e.,afamilyofhomolo- gousreferencesequences,functionsandstructurescanbeinferred.Therepresentativetoolsfor sequencehomologysearchandhomologysearchareBLAST[4]andHMMER[38],re- 45 spectively.homologysearchhasseveraladvantagesoverpairwisealignmenttoolssuchas BLAST.First,thenumberofgenefamiliesissmallerthanthenumberofsequences, renderingmuchfastersearchtime.Forexample,thereareonlyabout13,000manuallycurated proteinfamiliesinPfam,butthesecovernearly80%oftheUniProtKnowledgebaseandthecov- erageisincreasingeveryyearasenoughinformationbecomesavailabletoformnewfamilies[45]. ThenewestversionofHMMER[38]ismoresensitivethanBLASTandisabout10%faster. Second,previouswork[35]hasdemonstratedthatusingfamilyinformationcanimprovethe sensitivityofaremoteproteinhomologysearch,whichisveryimportantforvarioussequencing datasuchasmetagenomicdataanalysis.Thesedatasetsmaycontainspeciesremotelyrelatedto onesinthereferencedatabaseandrequiresensitivehomologysearch.Thus,inthiswork,wefocus onimplementinghomologysearchforPBdata.Themethodcanbeextendedtopairwise sequencealignment.AsHMMERisthemostwidelyusedalignmenttool,wefocuson evaluatingthealignmentperformanceusingHMMER. TheproteindomainsfamiliesusedinourexperimentsaredownloadedfromPfam.Other databasessuchasTIGRFAMs[57],FIGfams[98],InterProScan[152],andFOAM[123]canbe usedtooaslongashiddenMarkovmodelscanbetrained. 3.2.2Relatedworkonframeshiftcorrection Usually,whencomparingaDNAsequencewithaproteinsequenceorfamily,six-frametransla- tionsareconductedandoneofthereadingframeshouldleadtostatisticallyalignment ifthequeryandthereferencearehomologous.However,frameshiftscausedbyinsertionordele- tionerrorsmakethecorrecttranslationconsistofalternatingreadingframes.Withoutknowingthe errorpositions,choosingthecorrectframesforeachfragmentbetweenerrorsischallenging. AnumberofprogramsexisttohandleframeshiftsthroughDNAvs.proteinsequencealign- 46 ment.SimplemethodssuchasBLASTXdiscardsequencesthatmightcontainframeshiftsrather thantryingtothem.Othertools[17,22,49,50,53,59,120,121,155]areavailabletodetectand frameshifterrorsautomatically.Besidesdetectingframeshiftinsequencealignment,somepro- grams[5,14,76,131]focusonframeshiftdetectionduringgeneanduse abinitio methods. TheseprogramsarenotdesignedforPBdata.Inaddition,theycannotbeappliedtoprotein homologysearch. Alternatively,Genewise[13],awidelyusedDNAvs.proteinalignmenttoolallowscompar- isonofaDNAsequencewithaproteinfamily.Butitdoesnotconsiderthesequencingerror propertiesofNGSdata.Themostrelevantworkstooursinclude[154],whichViterbi algorithmtoimprovehomologysearchforpyrosequencingdata.Butitdoesnothavesatisfactory performanceforPBdatabecausethesequencingerrorpropertiesofpyrosequencingandPBare different.Pyrosequencingreadshavelowererrorrateandmostoftheerrorsarelocatedinside homopolymerregions.ThesepropertiesmakeerrorcorrectioneasierthanPB,whichhavehigher errorratesandtheerrorscanoccurmorerandomly.FrameBot[150]isanotherrelevantworkfor correctingframeshiftscausedbysequencingerrors.Butitisnotdesignedforhomology search.And,likeothertools,itisonlyoptimizedandtestedonsequencingdatawithlowererror ratethanPB. HGAP[24]isanotherhighlyrelevantworkbecauseitcontainserrorcorrectionstageforPB data.AsdiscussedinSection3.1,itsperformanceisheavilyaffectedbysequencingcoverage. 3.3PrhiddenMarkovmodel BeforeintroducingtheaugmentedViterbialgorithmthatusedinFrame-Pro,wereviewedthe conceptesandalgorithmsofhiddenMarkovmodel.Thecontentofthissectionismainly 47 adaptedfrom[35,36]. 3.3.1Markovchains WestartfromMarkovchain,whichisthesimplestprobabilisticmodelofsequence.Inthemodel, theprobabilityofasymbolinthesequencedependsontheprevioussymbol.InMarkovchain, eachsymbol(e.g.,residueofDNA)correspondstoastateinthechain. Figure3.1:AMarkovchainexampleofDNA.Therearefourstatesforeachofnucleotides A , C , G and T .Atransitionprobabilityisassociatedwitheacharrowinthe Thetransitionprobabilities a st istheprobabilitythatacertainstatefollowinganotherstate: a st = P ( x i = t j x i 1 = s ) (3.1) Theprobabilityofthesequencecanbewriteas P ( x )= P ( x L ; x L 1 ;:::; x 1 ) = P ( x L j x L 1 ;:::; x 1 ) P ( x L 1 j x L 2 ;:::; x 1 ) ::: P ( x 1 ) (3.2) 48 byrepeatapplying P ( X ; Y )= P ( X j Y ) P ( Y ) .InMarkovchain,theprobabilityofeachsymbol x i de- pendsonlyontheprevioussymbol x i 1 ,notonentireprevioussequence.Sothepreviousequation canberewriteas P ( x )= P ( x L j x L 1 ) P ( x L 1 j x L 2 ) ::: P ( x 1 ) = P ( x 1 ) L Õ i = 2 a x i 1 x i (3.3) ThisisthegeneralequationfortheprobabilityofasequenceofanyMarkovchain. 3.3.2HiddenMarkovmodels IntheMarkovchain,thereisaone-to-onecorrespondencebetweenthestatesandthesymbols. Sincewecandirectlyobservethesymbolsinthesequences(e.g.,residuesinDNAsequences),we canalsothestatefromtheobservation.However,inthehiddenMarkovmodel,onestate mayemitmanydifferentsymbols.Sofromtheobservationofsymbols,wecannotthe statefromourobservation.Inotherwords,thestateishiddennow,sowecallsuchmodelhidden Markovmodel. WefomulatethehiddenMarkovmodelasfollowing:Wecallthesequenceofthestates path , p .ThepathitselfisasimpleMarkovchain,theprobabilityofastateinthepathonlydependson thepreviousstate.The i thstateinthepathiscalled p i .Thetransitionprobabilityofthechainis: a kl = P ( p i = k j p i 1 = k ) (3.4) Becausewehavedecoupledthesymbols b fromthestates k ,wemustneedanewsetofpa- rameters,emissionprobabilities e k ( b ) ,tomodeltheprobabilitythatsymbol b isobservedwhenin 49 state k : e k ( b )= P ( x i = b j p i = k ) (3.5) Wecanwritetheprobabilityofanobservedsequence x andastatesequence p as: P ( x ; p )= a 0 p 1 L Õ i = 1 e p i ( x i ) a p i p i + 1 (3.6) wherewerequire p L + 1 = 0.TheEquation3.6istheHMManalogueofEquation3.3. InrealapplicationofHMM,therearethreefundamentalproblemswecareabout[69]:(1) GivenanHMMmodelwithparameters a kl and e k ( b ) ,andanobservationsequence x ,determine thelikelihoodof P ( x ) ( likelihood problem);(2)GivenanHMMmodelwithparameters a kl and e k ( b ) ,andanobservationsequence x ,decodethemostprobablestatepath p ( decoding problem); (3)Givenanobservationsequence x andstatesofHMM,learntheHMMparameters a kl and e k ( b ) ( learning problem).TheViterbialgorithm,Forwardalgorithm,andForward-Backwardalgorithm weredevelopedtosolvethoseproblems,respectively.Wewillintroducethesealgorithmsfor hiddenMarkovmodelinthefollowingsections. 3.3.3PrhiddenMarkovmodels(prHMMs) HMMsisaparticularHMMdevelopedtomodelmultiplesequencealignments[82].Here wewilluseHMMsdesignedforproteinfamilymultiplealignmentsasourexample.In HMMs,foreachpositionofstate i inthepath p ,therearethreepossiblestates,matchstate M i ,insertionstate I i ,anddeletionstate D i (Figure3.2).Insertionstate I i isusedtomatchinsertions aftertheresiduematchingthe i thcolumnofthemultiplealignments.Therearetransitionsfrom M i to I i ,alooptransitionfrom I i toitselftoaccommodatemulti-residueinsertions,andatransition 50 Figure3.2:ThetransitionstructureofaHMM.Weusesqurestoindicatematchstates, diamondstoindicateinsertionstates,andcirclestoindicatedeletionstates. from I i backto M i .Deletionstate D i isusedtohandlethejumptransitionbetweennon-neighboring matchstateswithoutanyemission.Thesumofthecostsofantransition M to D followedbya numberof D to D transitions,thena D to M transitionswillbethecostofadeletion. 3.3.3.1Viterbialgorithm Viterbialgorithm[148]isthemostcommonalgorithmforthemostprobablestatepath: p = argmax p P ( x ; p ) (3.7) Themostprobablepath p canbefoundrecursively.ForHMMs,let V M j ( i ) bethemaximum log-oddsscoreofaligningabestsub-sequence x 1 ; 2 ;::; i totheHMMpathuptostate j ,endingwith x i beingemittedbystate M j , V I j ( i ) bethescoreofthebestpathendingwith x i beingemittedby 51 state I j similarly,and V D j ( i ) forthebestpathendinginstate D j . q x i istheemissionprobabilityof astandardrandommodel(backgrounddistribution).ThentherecursiveViterbiequationscanbe writeas: V M j ( i )= log e M j ( x i ) q x i + max 8 > > > > > < > > > > > : V M j 1 ( i 1 )+ log a M j 1 ; M j V I j 1 ( i 1 )+ log a I j 1 ; M j V D j 1 ( i 1 )+ log a D j 1 ; M j V I j ( x i )= log e I j ( x i ) q x i + max 8 > < > : V M j ( i 1 )+ log a M j ; I j V I j ( i 1 )+ log a I j ; I j (3.8) V D j ( x i )= max 8 > < > : V M j 1 ( i )+ log a M j 1 ; D j V D j 1 ( i )+ log a D j 1 ; D j Inacommonsituation,theemissionscorelog e I j ( x i ) q x i for V I j ( x i ) willbecanceledbecausewe assumetheemissiondistribution e I j ( x i ) fromtheinsertionstates I j isthesameasthebackground distribution q x i .ThewholeprocessofViterbialgorithmisdescribedinAlgorithm2. Algorithm2 Viterbialgorithm Input: DNAsequence x withlength L ,HMMwithlength n Output: Mostprobablestatepath p withlog-oddsscorelog P ( x ; p ) 1: V M 0 ( 0 )= 0, V S j ( 0 )= inf for0 < j and S 2 {M,I,D}. . Initialization 2: for i = 1to L do . Recursion 3: V S j ( i ) Equation3.8 4: ptr i ( j ) argmax S j 1 ( V S j ( i )) 5: endfor 6: log P ( x ; p ) V M n ( L ) . Termination 7: p L argmax n 1 V M n ( L ) 8: for i = L to1 do 9: p i 1 ptr i ( p i ) . Traceback 10: endfor 52 3.3.3.2Forwardalgorithm Toevaluatethelikelihoodofasequence x ,weneedsumtheprobabilitiesofallpossiblestatepath toobtaintheprobabilityof x : P ( x )= å p P ( x ; p ) (3.9) Adynamicprogrammingalgorithmcanbeusedtocalculatethisfullprobabilityrecursively. Thisiscalledtheforwardalgorithm. WeusedthesimilarnotationthatusedinViterbialgorithm.Theforwardlog-oddsscore F M j ( i ) , F I j ( i ) ,and F D j ( i ) arecorrespondingto V M j ( i ) , V I j ( i ) ,and V D j ( i ) .Sotherecurrenceequations offorwardalgorithmare: F M j ( i )= log e M j ( x i ) q x i + log [ a M j 1 ; M j exp ( F M j 1 ( i 1 )) + a I j 1 ; M j exp ( F I j 1 ( i 1 ))+ a D j 1 ; M j exp ( F D j 1 ( i 1 ))] F I j ( x i )= log e I j ( x i ) q x i + log [ a M j ; I j exp ( F M j ( i 1 )) (3.10) + a I j ; I j exp ( F I j ( i 1 ))] F D j ( x i )= log [ a M j 1 ; D j exp ( F M j 1 ( i ))+ a D j 1 ; D j exp ( F D j 1 ( i ))] Theinitializationoftheforwardalgorithmrequiresthat F M 0 ( 0 )= 0,whichissimilartoViterbi algorithm.Sinceweonlyneedtheprobabilitiesof x sothereisnotracebackintheforwardalgo- rithm. 53 3.3.3.3FrommultiplesequencealignmenttoprHMM Usually,wewanttobuildHMMsfromamultiplesequencealignmentwithmultiplecolumns. Weneedtodeterminewhichmultiplealignmentcolumnstoassigntomatchstates,andwhich toassigninsertionstates(Figure3.3).Weoftenusetheheuristicruletoignorethecolumnfor whichthefractionofgapcharactersisgreaterthanorequaltoacolumnremovalthreshold q [29]. Usually,thecolumnremovalthreshold q equalsto0 : 5.Afterthis,wecanconstructHMMs asallthestatesaredetermined. Figure3.3:Tencolumnsfromthemultiplesequencealignmentsofsevenglobinproteins.The starredcolumnsareonesthatwillbetreated`matches'intheHMM.Reprintedfrom[35]. Thenweneedtoestimatetheprobabilityparameters.Frommultiplesequencealignment,we canaligneachoftherowtotheHMM.Wecandirectlyestimatetheparametersfromthe alignments.Bycountingupthenumberoftimeseachtransitionoremissionisused,theprobability parametersareassignedby: a kl = A kl å l 0 A kl 0 and e k ( a )= E k ( a ) å a 0 E k ( a ) (3.11) where k and l areindicesoverstates,and a kl and e k arethetransitionandemissionprobabilities, and A kl and E k arethecorrespondingcounts. 54 Inmanysituations,weonlyhavealimitednumberofsequencesinthetrainingalignments. Themajorchallengeshereisthatsometransitionsandemissionsthatneverseeninthetraining alignmentwillbeassignedzeroprobability.Tosolvetheproblem,onecanaddpseudocountsto theobservedcounts.ThemoststraightforwardLaplace'sruleistoaddonetoeachcountweusedin Equation3.11.AbettersolutionisusingamixtureofDirichletdistributionsastheprior.Readers areencouragedtolearnmoredetailsinChapter5of[35]. 3.3.3.4Forward-Backwardalgorithm Whenthepathisunknownforthetrainingprocess,wearenolongerabletoestimatetheparameter valueandmustiterativeproceduretolearntheparameters.Forward-backwardalgorithmisthe standardalgorithmforleaningtheparametersofHMM.Itisintroducedfrom[9]byLeonardBaum, soitisalsocalledBaum-Welchalgorithm.ItisaspecialcaseoftheExpectation-Maximization (EM)algorithm.Thealgorithmestimatedtheprobabilities,thenderivedabetterestimates. Theestimationisimprovediterativelyuntilsomestoppingcriterionisreached. Wealreadycalculatedforwardprobabilityusingforwardalgorithm.TotrainHMM,wealso needtocalculatetheposteriorprobabilitythatobservation x i camefromstate k giventheobserved sequence P ( p i = k j x ) usingbackwardalgorithm.Wecalculatedtheprobabilityofproducing theentireobservedsequencewiththe i thsymbolbeingproducedbystatek: P ( x ; p = k )= P ( x 1 ::: x i ; p = k ) P ( x i + 1 ::: x L j x 1 ::: x i ; p = k ) = P ( x 1 ::: x i ; p = k ) P ( x i + 1 ::: x L j p = k ) = f k ( i ) b k ( i ) (3.12) 55 bk(i)= P(xi+1::: xLjp=k)(3.13)bk(i)canbeobtainedbyabackwardrecursionstartfromendofthesequence. Theposteriorprobabilitiescanbecalculatedbystraightforwardconditioning, P(p=kjx)= fk(i)bk(i)P(X)(3.14)whereP(x)istheresultofforwardorbackwardalgorithm. Thenwecancalculatetheexpectednumberoftimesoftransitionsandemission, Akl andEk,usingtheforwardprobabilitiesandbackwardprobabilities.Thedetailedderivationcanbefound fromtextbooks[29,35,69].Theexpectednumberoftimesthat akl isusedisgivenby: Akl =åj1P(xj)åifjk(i)akl el(xji+1)bjl(i+1)(3.15)wherefjk(i)istheforwardvariable fk(i)calculatedforsequence j,and bjl(i)isthebackward variablecorrespondingly.Similarly,wehavetheequationforexpectednumberoftimesthat bappearsinstate k:Ek(b)= åj1P(xj)åfijxji=bgfjk(i)bjk(i)(3.16)wherethesecondsumcalculateoverthepositions iforwhichthesymbolemittedis b.ThewholeprocessofBaum-Welchalgorithmlikethis: 56Algorithm3 Baum-Welchalgorithm 1: Initialize a lk and e k . 2: while D > threshold Q do 3: Setallthe A and E totheirpseudocountvalues. 4: for eachsequence j = 1to n do 5: for i = 1 ;:::; L do 6: f k ( i )= e k ( x i ) å h f h ( i 1 ) a ik 7: endfor . forwardalgorithm 8: for i = L 1 ;:::; 1 do 9: b k ( i )= å l a kl e l ( x i + 1 ) b l ( i + 1 ) 10: endfor . backwardalgorithm 11: Addthecontributionofsequence j to A and E inEquation3.15and3.16 12: endfor 13: Calculatenew a lk and e k usingEquation3.11 14: Calculatethenewloglikelihoodofthemodel 15: D changeoftheloglikelihood 16: endwhile 3.4Methods Frame-ProisdesignedtoimprovehomologysearchperformanceforPBdata.Itincorpo- ratesconsensus-basederrorcorrectionandaViterbialgorithmforoptimalalign- ment.WhileastandardViterbialgorithmalignsasinglesequencetoahiddenMarkovmodel (HMM),ouralgorithmalignsanalignmentgraphtoanHMM.Inaddition,weanewscor- ingsystemthatcombinesthepathscorefromthealignmentgraphandthealignmentscoreinthe HMM.Weintroducetheconstructionofthealignmentgraph. 3.4.1Generatesequencealignmentgraph FollowingHGAP,weconstructasequencealignmentgraph(SAG)fromPBdata.Thedetails ofthegraphconstructioncanbefoundinthesupplementarymaterialofChinet.al'swork[24]. Herewesimplysummarizethemajorsteps.Readswithalengthlongerthanachosencut-offare selectedasseedsequences.OtherreadsarethenalignedtotheseedsequencesbyBLASR[21]for 57 Figure3.4:BuildanexampleSAGfromanalignment.Toppanel:multiplesequencealignmentof sixsequences.Thetopsequenceistheseedsequence.Bottompanel:sequencealignmentgraph. Fornode x 0 ,ifwetracebackfortwomoreedges,wecanidentifythreecodonsendingwith x 0 : TCA,ATA,andGCA.Eachpathhasitsnodes x ( 1 ) 0 and x ( 2 ) 0 marked.Theconsensuspath ofthispartofgraphisATCA. 58 constructionofanalignmentgraph(seetoppanelofFigure3.4).Ineachalignedcolumn,different basesaremodeledasnodesforthecorrespondingpositioninanSAG.Consecutivebasesare connectedbyanedgeinthegraph.Theedgeweightrepresentsthenumberofalignedsequences thatgothroughthisedge. Theconsensussequencefromthealignmentgraphrepresentsthemostreliablesequence.Fig- ure3.4showsanexampleSAGandtheconsensussequence.Usually,whentherearesuf readsaligningtotheseedsequences,theconsensussequencecanbereliablyusedfordownstream analysisincludingconductinghomologysearch.However,whenthecoverageisnotsufthe chosenconsensussequencedoesnotshowhigherscorethanalternativesequences. Thus,itisdiftoextractapaththatisclosesttothereferencesequence. Ouralgorithmismuchlesssensitivetosequencingcoverageasitusescharacterizedprotein familiesasreferencetochooseoptimalpathinthealignmentgraph.Essentially,italignsanalign- mentgraphwithaHMM,whichrepresentsaproteinfamily.Duringthealignment,the algorithmchoosesasequencepathinSAGandastatepathintheHMMinordertomaximizethe combinedcoveragescoreandalignmentscore.BelowwedescribetheViterbialgorithm. 3.4.2ViterbialgorithmforaligninganalignmentgraphwithaprHMM Let p beastatepathinaHMMModel M .Let p 0 beasequencepathinanSAG G .Ourgoal istosearchfortheoptimalpathpair ( p ; p 0 ) suchthat ( p ; p 0 )= argmax ( a S M ( p ; p 0 )+ b S G ( p 0 )) , where S M ( p ; p 0 ) isthealignmentscorebetweenasequencein G andtheHMM. a and b aretheweightsoftheHMMscoreandthecoveragescorein G ,respectively.Intuitivelythisalgo- rithmsearchesforanoptimalalignmentbetweenaDNAsequenceinthealignmentgraph G and aHMM M bysimultaneouslyconsidering1)theprobabilityofaHMMalignment, representedbylog-oddsscore S M ( p ; p 0 ) ,and2)pathweightin G ,representedby S G ( p 0 ) .Tosolve 59 theaboveequation,wedevelopedthefollowingdynamicprogrammingalgorithm,whichisan augmentedViterbialgorithm[35].Therecursiveequationscanbeextendedtoforwardalgorithm andalsoposteriorprobabilitycalculationforanHMM. Input: aSAG G generatedbyalignmentsbetweenalongDNAseedsequence x seed andmulti- pleshortersequences,aHMM M .Notationsof M willbedescribedbelow. Output: theerror-correctedDNAsequence x 0 seed anditsoptimalalignmentwith M . Algorithm: wenotationsthatwillbeusedintherecursiveequations. NotationsoftheprHMMmodel M : ThedetaileddescriptionsofaHMM modelcanbefoundintheliterature[35,36].AHMMmodel M consistsofmatch states M j ,deletionstates D j ,andinsertionstates I j foreachposition j ,whichistheindex ofaconservedcolumninamultiplesequencealignment. a s i s j isthetransitionprobability fromstate s i to s j .AstheHMMisconstructedbyalignedproteinsequenceswhilethe graphmodelisconstructedbyalignedDNAsequences,weneedtotranslatethesequences in G intoaminoacids.Here,let T ( x i 2 x i 1 x i ) betheaminoacidtranslatedfromacodon x i 2 x i 1 x i . e s ( T ) istheemissionprobabilityforstate s toemit T .Comparedtothetopology ofPlan7modelusedbyHMMER[38],oneofthemajorchangeswemadeisthatNandC areresponsibleforemittingallDNAbasesthatareoutsideoftheproteindomain. NotationsoftheSAG: Let x i representthe i thnodeinthetopologicallysortedlistofthe graph.AsthealignmentgraphisconstructedinDNAspace, x i alsorepresentsabasefrom thealignedsequences. x ( k ) i representsanode,fromwhichtonode x i thereexistsapath consistingof k edges.Accordingtothistion,when k = 1,thereisanedgefrom x ( 1 ) i to x i .When k = 2,acodonisformedbythreebases: x ( 2 ) i , x ( 1 ) i ,and x i .Forexample,in Figure3.4,bothnodeslabeledwithTandCare x ( 1 ) 0 . S G ( p 0 ) representsthepathscorefor 60 p 0 ,whichcanbeasequenceofanylengthin G .Wewillthepathscorefollowingthe equations. Subproblemsandtherecursiveequations: Forasequencepath p 0 endingatnode x i in G , andastatepath p endingwithindex j in M ,thedynamicprogrammingalgorithmintends tomaximizethecombinedpathscoreandalignmentscore: a S G ( p 0 1 :: i )+ b S M ( p 0 1 :: i ; p 1 :: j ) . Dependingontheendingstates,weneedtoconsidermultiplecases. V M j ( x i ) ismaximumscoreofaligningasequencepathendingwith x i in G totheHMMup tostate M j ,undertheconstraintthattheanimoacidtranslatedbythelastthreebases x i , x ( 1 ) i ,and x ( 2 ) i in G isemittedbymatchstate M j .Notethattherecanbemultiplesequence pathsin G endingwith x i .Sothelastthreebasesofanysuchsequencepathcanbegenerally representedby x ( 2 ) i x ( 1 ) i x i .Andthetranslatedaminoacidisthus T ( x ( 2 ) i x ( 1 ) i x i ) . V I j ( x i ) isthemaximumscoreofaligningasequencepathendingwith x i in G totheHMM uptostate V j ,undertheconstraintthattheanimoacidtranslatedbythelastthreebases x i , x ( 1 ) i ,and x ( 2 ) i in G isemittedbyinsertionstate I j . V D j ( x i ) isthemaximumscoreofaligningasequencepathendingwith x i in G totheHMM uptostate D j . V N ( x i ) isthemaximumscoreofaligningasequencepathendingwith x i in G totheHMMuptostate N ,given x i beingemittedbythespecialstate N .Similarly, V C ( x i ) is themaximumscoreofaligningasequencepathendingwith x i in G totheHMMuptostate C ,given x i beingemittedbythespecialstate C . Forbrevityofpresentation, S inthefollowingequationsrepresents S ( x ( 3 ) i x ( 2 ) i x ( 1 ) i x i ) ,which isthepathscorefromthepossiblethirdbaseoftheprecedingcodontothecurrentcodon endingwith x i .Notethatthepathscoreisnotasimplesummationofedgeweightsasina standardgraph.Wewillthepathscoreaftertherecursiveequations. T istheabbrevi- 61 ationof T ( x ( 2 ) i x ( 1 ) i x i ) infollowingequations. V M j ( x i )= max 8 > > > > > > > > > < > > > > > > > > > : V M j 1 ( x i ( 3 ) )+ a e M j ( T )+ a log a M j 1 ; M j + b S V I j 1 ( x i ( 3 ) )+ a e M j ( T )+ a log a I j 1 ; M j + b S V D j 1 ( x i ( 3 ) )+ a e M j ( T )+ a log a D j 1 ; M j + b S V N ( x i ( 3 ) )+ a e M j ( T )+ b S V I j ( x i )= max 8 > < > : V M j ( x i ( 3 ) )+ a e I j ( T )+ a log a M j ; I j + b S V I j ( x i ( 3 ) )+ a e I j ( T )+ a log a I j ; I j + b S V D j ( x i )= max 8 > < > : V M j 1 ( x i )+ a log a M j 1 ; D j V D j 1 ( x i )+ a log a D j 1 ; D j V N ( x i )= max ˆ V N ( x ( 1 ) i )+ b S x i ( 1 ) x i V C ( x i )= max 8 > < > : V C ( x ( 1 ) i )+ b S x i ( 1 ) x i V M ( x ( 1 ) i )+ b S x i ( 1 ) x i Calculatepathscores S : wefollowedthesameequationinHGAPpaper[24]tocalculate apathscore,whichisthesumofthescoresofallnodesinthepath.InaSAG G ,thelocal coverageforapositioninthealignedgraphisbythenumberofreadsalignedto thatposition.Forexample,thelocalcoverageis5forallpositionsinFigure3.4asthereare 5reads(excludingseedsequence)alignedtotheseedsequence.Foranode,ifoneofthe incomingedge'sweightismorethanhalfofthelocalcoverageatthatposition,thenodegets apositivescore.Otherwise,wewillassignanegativescore.Thepathscoreisthesumofthe scoresofnodesinthepath.ThedetailedpseudocodecanbefoundinHGAP. CombineHMMscoreandpathscore: userscanchoosetheweightsofthepathscorein 62 graph G andtheHMMscorein M .Ideally, a and b shouldbeadjustedbasedonthelocal coverage.Forhighlocalcoverage,wecanassignbigger a than b .Bydefault,weusethe sameweight,whichisthechosenparametersforalloftheexperimentsinthiswork. Runningtimeanalysis Thetimecomplexityofouralgorithmis O ( d j N jj M j ) ,where j N j isthenumberofnodesinthe G and j M j isthenumberofstatesin M . d istheaverage numberofpossiblepathsthatcontainacodonendingwithanode.Basedonourtest,for 20Xcoverage, d isabout4to6pernode.Wealsofoundathighlocalcoveragearea,the edgewithweight1canbeignoredastheprobabilitytoincludethatedgeintheoptimalpath isextremelylow.Thus,weremovedthoseedgestofurtherspeedupthealgorithm.After thispruning, d canbeaslowasnear1pernodeonaverage. 3.4.3Filtrationstageforremovingunrelevantproteindomainfamilies Duringhomologysearch,wealigntheconstructedalignmentgraphsfromPBdatawithallcharac- terizeddomainsinPfam.Butforanygivenspeciesoracommunity,notalldomainsarerelevant.In ordertoreducethesearchspace,weapplyastagetoremovedomainsthatarenotrelevant tothegivendata.Inpractice,weapplyHMMERtoallconsensussequencesextractedfromthe alignmentgraphwithaverylooseE-valuecutoff(1000isusedinallexperiments).Onlydomains thatyieldatleastpartialalignmentswillbeusedasinputtoourdynamicprogramming.Otherdo- mainswithoutanyhitwillbediscardedbecauseitisunlikelythattheywillbetruedomainsinthis dataset.Eachconsensussequencemightbealignedtomultipledomains,forregionsthatcannot bealignedtoanydomain,wealsoremovethemfromnextstage.Thus,afterthetrimmed alignmentgraphandthecorrespondingdomainwillbeusedasinputtoourtool.Notethatthisdo- mainmayjustincuraverysmallscoreandE-value.Itwillbere-alignedusingour 63 algorithmandthealignmentscorewilldecidewhetheritisatruedomain.Inourexperiments, werefertotheinputassequence-domainpair.Inallofourexperiments,truedomainsfoundby correctedsequencesusingsuggestedcutoffsisalwayslessthantheinputsequence-domainpairs. 3.5Experimentalresults Ourmethodisusedtoimprovethesensitivityofhomologysearchbycorrectinginsertion ordeletionerrorsinPBsequencingdata.Thus,wewillfocusonevaluatingtheperformance ofourimplementationinbotherrorcorrectionandhomologysearch.WeappliedFrame-Proto threedatasets:asimulated E.coli PBRSsequencingdataset,areal M.ruber PBRSsequencing dataset,andaHumanarmPBRSIImetagenomicsequencingdataset.Thethreedatasetsenable ustoevaluatetheperformanceoferrorcorrectionfordatawithdifferentsequencingcoverage. Asboththegenomesandtheirproteindomainannotationsofthetwodatasetsareavailable atNCBIandPfam,weareabletoquantifytheaccuracyoferrorcorrectionandHMM search.,weusedBLASTtoevaluateitserrorcorrectionperformancebyaligning correctedsequencestoreferencesequence.Wealsotheperformanceofproteindomain annotationbyapplyingHMMER3tocorrectedsequencesandthereferencegenomes. WebenchmarkedourmethodwiththeerrorcorrectionstageDAG-ConinHGAP[24].The errorcorrectioninHGAPandourmethoddonotrelyonshortsequencesgeneratedbyanother platform.AndtheerrorcorrectioninHGAPhasbeenextensivelytestedandhassatisfactoryper- formanceforsequencingdatawithreasonablecoverage.DAG-Cononlyoutputscorrectedse- quences.Inordertoevaluatetheperformanceofhomologysearch,weapplyHMMERto thecorrectedsequencesofDAG-Con.AlthoughFrame-Prooutputsbothcorrectedsequencesand thealignments,wererunthecorrectedsequencesagainstinputdomainsusingHMMERto 64 ensureafaircomparisonbythesamealignmenttool. Forourexperiments,alldetailedcommands,parametersandoutputcanbefoundalongwith thesourcecodeofFrame-Pro.Toachieveafaircomparisonondataoflowcoverage,wesetthe coveragethresholdofDAG-Conas1tomakesuretheoutputsarecomparable. 3.5.1Simulated Escherichiacoli sequences InordertoevaluatetheaccuracyofFrame-Proindetectingandcorrectinginsertionanddeletion errors,wegeneratedasimulated E.coli K-12MG1655(NCBItax.ID511145)sequencedataset byPBSIM[114].Weusedthereferencegenomesequence(NC_000913.3,genomesize4,641,652 basepairs,[60])generatedbySangersequencingasinputforPBSIM.Thequalityinformationand thesequencinglengthdistributionfromrealPBRSsequencingaftersecondaryanalysis[115]were usedasthesimulationparameters.PBSIMgenerated34,898sequenceswith92,810,130basepairs withaveragesequencingcoverageof20X.Theaveragelengthofreadsis2,660bpandthereads' averageerrorrateis14.42%. Inthedataset,3,280sequencesseedcriterion.Tocontrolthescaleofexperiment,we randomlyselect451seedsequencesforgraphconstruction.Aftergraphconstructionandprotein domain7,093sequence-domainpairswerekeptasinputtoourprogram. 3.5.1.1Performanceoferrorcorrection BothFrame-ProandDAG-Conproduced7,093correctedsequencesfromtheinputPBsimulation data.Weevaluatedtheperformanceoferrorcorrectionofbothtoolsbycomparingtheir correctedsequencestothereferencesequences.ThecomparisonisconductedusingBLAST[4]. Figure3.5summarizedthecomparisonofbothtoolsoneachread.Forthisdataset,ourtoolcan correctmoreerrorsinabouthalfofthereadswhileDAG-Concorrectsmoreerrorsinlessthan 65 Figure3.5:ComparisonoferrorcorrectionperformanceforFrame-ProandDAG-Coninthesim- ulated E.coli datasetand M.ruber dataset.Frame-ProcorrectsmoreerrorsinlargerfractionofPB reads. 66 Figure3.6:Histogramofederrorsaftererrorcorrectioninthesimulated E.coli dataset.X- axisrepresentsthenumberofremainingerrorsineachread.Y-axisisthecorrespondingnumbers ofreads.Binwidthis1andtheonlyincludedthe40bins(i.e.upto40errors)dueto spacelimitation. 67 5%ofreads.Intotal,ourmethodcorrects7,884moreerrorsthanDAG-Con.Ifweonlycount insertionanddeletionerrors,ourmethodcorrects8,374moreerrorsthanDAG-Con.DAG-Con corrects490moremismatchesthanourmethod.ItisexpectedbecausetheHMMsearchis muchmoresensitivetoframeshiftscausedbygapsastheywillimpactthealignment lengthandscore.Figure3.6comparesthenumberofremainingerrorsincorrectedreadsproduced bybothtools.Itisclearthatourmethodcanproducemorereadswithnoerrororjust1error. 3.5.1.2PerformanceofprHMMsearch Oneofourmajorgoalsistoimproveperformanceofhomologysearch.Thissectionwill focusonevaluatingtheperformanceofproteindomainhomologysearchofcorrectedsequences. Aftercorrectingframeshifts,weexpectthatthehomologysearchprogramcangeneratelonger alignmentswithhigherscoresandsmallerE-valuesforproteindomainsofinterest.Sotheusers candistinguishtruedomainsfromrandomalignmentswithhigher HMMER3.1b2wasusedtogenerateHMMalignmentsfromcorrectedsequences.The domaincompositionof Escherichiacoli K12(NCBItax.ID83333)proteomefromPfam(Re- lease29.0,[45])wasusedasthereference.2,347proteinHMMdomainswerefoundin 7,093outputsequences.Somesequenceshavemultiplehits.Forallthedata,includingthePB simulationdata,andthecorrectedsequencesbybothtools,weonlykeepthebestalignmentfor eachdomaininourcomparison.Thechangesofalignments'E-values,alignmentlengths,andbit scoresduetoerrorcorrectionarepresentedinFigure3.7.Onaverage,thelengthofthedomain alignmentincreasesfromDAG-Con's108.51aminoacids(a.a)to150.36a.a,whichiscloserto theaveragealignmentlengthof163.62a.ainthereferenceproteomefromPfam.Atwo-sample Kolmogorov-Smirnovtest(K-Stest)onthealignments'lengthsandE-valuesfromourmethod andDAG-Conwasappliedtoexaminethestatisticalofdifferencefromtwomethods. 68 Figure3.7:Thecomparisonofalignments'bitscores(A),lengthsina.a.(B),andE-values(C)for referencesequencesandcorrectedsequencesproducedbyFrame-ProandDAG-Coninthe E.coli simulateddataset.X-axisrepresentsthedomains.Alldomainsaresortedbythereferencevalues fromPfam.RedcirclesrepresentthevaluesofHMMalignmentsforcorrectedsequencesoutputby Frame-Pro.BluecirclesrepresentthevaluesofHMMalignmentsforcorrectedsequencesoutput byDAG-Con.Astherearemanydatapoints,allnumbersproducedbyonetoolareprocessedby SavityzkyGolaytogenerateasmoothedcurveforclarityofpresentation. 69 Thep-valuesforthealignments'length,E-value,andbitscoredistributionswere5 : 4436 10 25 , 1 : 3237 10 25 ,and1 : 8881 10 25 ,respectively.Thus,theimprovementbyapplyingourmethod isstatistically 3.5.2 Meiothermusruber DSM1279sequences Thesimulateddataenablesustothoroughlytesttheparametersofourtoolandalsotoevaluate variousaspectsoftheperformance.Forthenexttwoexperiments,weapplyourtooltorealPB datawithdifferentcoverage.Astherealsequencingprojects,inparticularthetranscriptomic sequencingprojectsandmetagenomicsequencingprojects,cancontaintranscriptsorgenomes withheterogeneouscoverage,itisthusimportanttoevaluatetoolsfordataofdifferentcoverage. Inthisexperiment,wetestedFrame-Proonreal Meiothermusruber DSM1279PBSequencing data. Meiothermusruber (NCBItax.ID504728)isusuallyfoundinhotsprings[141].Ithas genomesizeof3,098,881bps.Therawsequencingdatafrom1SMRTcellinHGAP[24]was usedforthisexperiment.Therawdataintotalcontains177.4Mbpswith59.6%GCcontentwith approximatecoverageof60X. StandardSMRTdataprocessingpipeline[117]wasusedtotherawsequencingdatato generatesubreads,whichcontainsequencespassingthecommonlyusedlengthandquality criteria.Thedatasethas90,114,302bpsand36,180sequenceswith30Xcoverage.After proteindomainfamilies33,911sequencedomainpairspassedthethresholdandwere inputtodownstreamerrorcorrectionpipelines. 70 3.5.2.1Evaluateerrorcorrectionperformancebyaligningoutputsagainstthereference genome TocomparetheerrorcorrectionperformanceofFrame-ProandDAG-CononthisrealPBsequenc- ingdataset,weusedBLASTtosearchalloutputsagainstthereferencegenomeof Meiothermus ruber (NC_021081.1).ThecomparisonoferrorcorrectionforeachreadissummarizedinFig- ure3.5.Intotal,ourmethodcorrects14,613moreerrorsthanDAG-Con.Ifweonlycountinsertion anddeletionerrors,ourmethodcorrected15,924moreerrorsthanDAG-Con.DAG-Concorrected 1,311moremismatchesthanourmethod. Comparingtotheprevioussimulationexperiment,thedifferenceoftheerrorcorrectionperfor- mancebetweenFrame-ProandDAG-Condecreased.Themainreasonisthattheperformanceof DAG-Conusuallyimproveswithincreasedcoverage(20Xto30X). 3.5.2.2Evaluatetheperformanceofprhomologysearch ThecorrectedsequencesfromFrame-ProandtheconsensussequencesfromDAG-Conweresearched againstproteindomainsinPfambyHMMER3.1b2.Thehighercoverageofthisdatasetdidim- provetheperformanceofDAG-Con. For2,984domainsbybothtools,wecomparedthebesthitforeachofthemfrom twotools.TheannotateddomainsfromthereferenceproteomeweredownloadedfromPfam andwereusedasthereference.Thechangesofalignments'bitscoresduetoerrorcorrection arepresentedinFigure3.8.Noimprovementcanbeobservedforbitscores.Similar observationsweremadeforalignments'lengthsandE-values.Thus,theothertwowere omitted.Comparedwiththeaveragealignmentlengthof148.68a.afromDAG-Con'sconsensus sequence,theaveragelengthof157.92a.abyFrame-Proismuchclosetothereference's158.69 71 Figure3.8:Thecomparisonofalignments'bitscoresforreferencesequencesandcorrectedse- quencesproducedbyFrame-ProandDAG-Coninthe M.ruber dataset.X-axisrepresentsthe domains.AlldomainsaresortedbythereferencebitscoresfromPfam.Redcirclesrepresentthe valuesofHMMalignmentsforcorrectedsequencesoutputbyFrame-Pro.Bluecirclesrepresent thevaluesofHMMalignmentsforcorrectedsequencesoutputbyDAG-Con.Astherearemany datapoints,allnumbersproducedbyonetoolareprocessedbySavityzkyGolaytogeneratea smoothedcurveforclarityofpresentation. 72 a.a.Weconductedatwo-sampleK-Stestonthealignments'lengths,E-values,andbitscores fromourmethodandDAG-Con.Thep-valuesforalignments'lengths,E-values,andbitscores' distributionswere0.2859,0.2321,and0.2007,respectively.AlthoughFrame-Prostillgenerated longeralignmentswithbetterscores,DAG-Conachievedsatisfactoryperformanceofhomology searchforthisdatasetbecauseofthesufcoverage. 3.5.3Humanarmmetagenomicdataset WhenPBisappliedtometagenomicsequencing,onechallengeisthatsomedatasetsdonothave enoughcoverageforeffectivedownstreamanalysis.Here,weappliedFrame-Protoanalyzethe proteindomaincompositioninthehumanskinmetagenomicdata,whichweresequencedfromthe humanarmandfoot[144]. ThehumanarmsamplewassequencedbylinearPBRSIITdT(terminaldeoxynucleotidyl transferase)sequencingplatform.SequencescanbemappedwithCHM1humangenomewere removedashosthuman-derivedDNA.ThisdatasetcannotbefurtherassembledbytheHGAP pipelineduetotheinsufcoverage,providingchallengestodownstreamanalysisincluding proteindomainThusinthisexperiment,wefocusonthearmdataset,whichin- cludes16,388sequenceswith2,662,7191bps. 3.5.3.1GenerateSAGandprHMMdomain Comparedwithprevioustwodatasets,theaveragereadlengthofthehumanarmmetagenomics datasetisonly1,629bps.Togeneratesufseedreads,thecut-offwaschangedto3,000bps. AlthoughFrame-ProisnotassensitiveasDAG-Contothesequences'coverage,thesteps wereaffectedasweuseHMMERtosearchtheconsensusagainstPfamwithE-value1000.After 602sequencedomainpairswerekeptforfurtheranalysis.Foreachsequencedomain 73 pair,acorrectedsequenceusingFrame-ProandaconsensussequenceusingDAG-Con,bothfrom thesamegraph,weregenerated. Figure3.9:Thecomparisonofalignments'bitscores(A),lengthsina.a.(B),andE-values(C) for48domainscommonlybyFrame-ProandDAG-Coninthearmdataset.X-axis representsthedomains. 74 3.5.3.2ProteindomainsearchusingGAcutoff Unlikepreviousdatasets,wedon'tknowallthecompositespeciesinthearmsample.Thuswe cannotobtainallthegroundtruthproteindomainsinthisdataset,makingthecomparisonwith thereferencedomainsdifInaddition,withoutcompletereferencegenomesofallspecies, wedon'thavetheactualsequencescorrespondingtothesereadsandthuswecannotevaluatethe numberofederrors.Inordertoexaminetheresults,wethususeastringentthresholdto examinethedomainsetsforcorrectedsequencesoutputbyFrame-proandDAG-Con. Weusethegathering(GA)cutoffsfromPfamasthethresholdfordomaincompositionanalysis becauseGAcutoffsarefamilybitscorethresholdsaimingtominimizefalsepositiverate andtomaximizethecoverage[44].InFrame-Pro'sresults,84domainsareabovetheGAcutoff, comparingto49domainsinDAG-Con'sresults.Frame-ProonlymissedonedomaininDAG- Con'sresultwithscoreslightlylessthanthecorrespondingGAcutoff.AccordingtoPfam,all domainsuniquelyfoundbyFrame-Prowereannotatedinmicrobialspecies.Thelistofdomains canbefoundinourwebsite. Forthecommonly48domains,wecomparedthealignments'bitscores,alignment length,andE-valuesoncorrectedsequencesoutputbytwotools(SeeFigure3.9).Weconducted atwo-sampleK-Stestonthethreemetricsforallalignments.Thequalityofalignmentimprove .Thep-valuesforthethealignments'length,E-value,andbitscoredistributionswere 6 : 05 10 6 ,3 : 66 10 12 ,and7 : 64 10 13 ,respectively. Withoutreferencesequences,therecouldbeonepossibilitythatFrame-Proover-correctsthe errorsinordertomaximizethealignmentscore.Inordertotestthispossibility,weexamined thedomainsforonereferencespecies: Corynebacteriumaurimucosum .Althoughthe completecompositespeciesofthearmsampleisunknown,thephylogeneticanalysisin[144] 75 andotherarticles[47,142]showedthat Corynebacteriumaurimucosum isrelativelyabundantin thissample.Inaddition,thisspecieshasannotateddomains(NCBItax.ID548476)inthePfam database.Thus,wefocusonevaluatinghowmanyoftheannotateddomainsinthisspeciescan bebytestedmethods.ByusingGAcutoff,Frame-Procanrecover41domainswhile DAG-Concanrecover24domains.ThesetbyDAG-Conisasubsetofours.Thus,this experimentshowsthatextradomainsbyusarenotlikelyfalsepredictions.Thisexper- imentaddsevidencethatFrame-Procanidentifymoredomainsfordatawithverylowcoverage andthusprovidescomplementaryanalysisformetagenomicdata. 3.6Conclusionanddiscussion Inthiswork,wedevelopedahomologysearchtoolforPBsequencingdata.Bycorrect- inginsertionordeletionerrors,ourimplementationcanimprovehomologysearchperformance, includingalignmentscores,lengths,andE-values.Inparticular,forsequencingdatawithlow sequencingcoverage(aroundorlowerthan20X),ourtoolcancorrectmoreerrors andimprovethesensitivityofhomologysearchbymorecorrectdomains.Beingableto conductsensitivehomologysearchforsequencingdataoflowcoverageisimportantforvarious sequencingprojectsincludingmetagenomicandtranscriptomicsequencing.Usually,asthetran- scriptsorspecieshavehighlydifferentabundanceandthusheterogeneouscoverage,conducting homologysearchneedstoconsidertheinputofafullspectrumofcoverage.Manyrarespeciesin anenvironmentalcommunityorraretranscriptsareparticularlyinterestingforbiologicaldiscov- ery. Althoughourcurrentimplementationisbasedonhomologysearch,whichcompares sequenceswithHMMs.Ourmethodcanbeeasilyextendedtopairwisealignmentaswell. 76 Inthatcase,theHMMwillbereplacedasinglesequence.Existingpairwisealignment algorithmcanbeextendedtoalignanalignmentgraphwithasinglesequence. LiketheerrorcorrectionstageinHGAP,ourtooldoesnotrelyonhybridsequencingeither, makingitconvenientforvariousapplications.However,onelimitationisthatourtoolisnota generalerrorcorrectiontoolbecauseitcanonlycorrecterrorsinregionsthatarehomologousto referencesequences.Thisisnotaproblemforspecieswithhighlypackedcodingregions.Butfor genomeswithlargefractionsofnoncodingregions,ourtoolisnotdesignedforerrorcorrectionin thewholegenome. Finally,weonlytestedourapplicationinPBdata.Butourmethodcanbeextendedtootherse- quencingdatawithsimilartypesoferrors.Forexample,wewilltestourtoolonthedataproduced byNanoporetechnology. 77 Chapter4 DeepLearningBasedApproachForProtein DomainPrediction 4.1Introduction Third-generationsequencingtechnologies,suchasPBiosciencessingle-moleculereal-time sequencing(PacBio)andOxfordNanoporesequencing(Nanopore),producelongerreadsthan thetraditionalsequencingtechnologies.Theincreasedlengthenablesclosinggapsingenome assembly[67,149],detectingepigenetic[151],andquantifyinggeneisoformswith higheraccuracyinthetranscriptomicsequencing[107,140].Italsoshowsgreatpotentialsin sequencingthemicrobialcommunities[23,144]. Oneofthemostchallengestoutilizelongreadsfromthird-generationsequencing isthehighsequencingerrorrate[7,67].Mosterrorsareinsertionsanddeletions,whichcancause frameshiftsduringtranslations.Withoutknowingtheerrorsandtheirpositions,theframeshifts canleadtoonlyshortoralignmentsinthedownstreamanalysis[154].Asaresult, traditionaldownstreamanalysispipelineontherawPacBioorNanoporereadsleadstoincorrect results[34]. Characterizingtheproteindomaininsequencesisoneoftypicaldownstreamanalysiswitha continuingconcernasproteinsplaypivotalrolesinmanybiologicalprocesses.Homologysearch 78 isoneofthestandardmethodstoannotatethefunctionsofproteinsequencesbycomparingquery sequencesagainstreferencesequences[4]orhomologoussequencefamilies[35].Algorithms ofmachinelearningisanalternativeapproachwithoutalignmenttopredicttheproteindomain insequences[30,147].Themajorchallengeofthesemethodsishowtoconvertthesequenceto featuresthatcancapturetheinformationofproteinfamilies.Recentdevelopmentsinthealgorithm ofdeeplearningledtoanotherapproachwithautomaticfeatureextractionfromthesequence [91,133]. 4.1.1Deeplearningforsequentialdata Differentdeeplearningmodelhavetheirownadvantagestoresolvetypesofproblems forsequentialdata.Soweexpectthatthosealgorithmscanalsohelpondifferentproblemsin bioinformatics.Here,wewilldescribetwomajordeeplearningstructures,convolutionalneural networks(CNNs),recurrentneuralnetworks(RNNs),thatalreadybeenprovedachievestateofthe artperformanceonsequentialdata. 4.1.1.1ConvolutionalNeuralNetworks OneofthemostsuccessfulmodelinrecentyearsisConvolutionalNeuralNetworks(CNNs).The majordifferencebetweenCNNsandtraditionalneuralnetworksistoreplacethegeneralmatrix operationinthetraditionlayerwiththemathematicaloperationcalledconvolution. 4.1.1.1.1Convolutionoperation Inmathematics,theconvolutionoffunction x and w , s ( t ) ,is asfollowingequation: s ( t )= Z x ( a ) w ( t a ) dt =( x w )( t ) 79 Inconvolutionalnetworkterminology,theargument(inthisexample,thefunctionx)tothe convolutionisoftenreferredtoastheinputandthesecondargument(inthisexample,thefunction w)asthekernel.Theoutputissometimesreferredtoasthe featuremap . Figure4.1:Howakerneloperatesononepixel.Reprintedfrom[6]. Theconvolutionoperationinthematrixcantreatasakindofmatrixmultiplication.Asthe Figure4.1showed,thekernelconvolutiontakesourcepixelfrominputmatrixandsimplymulti- pliedthekernelmatrixtogeneratetheelementfortheconvolutionallayeroutput. 4.1.1.1.2MajorfeaturesofCNNs TherearethreemajorfeaturesthatmakeCNNssosuccess- ful:sparseinteractions,parametersharing,andequivariantrepresentations. Sparseinteractionsmeansthattheneuronintheconvolutionallayersonlyneedstointeractwith severalneuronsintheinputandoutputlayers.Inthecontrast,aneuroninthetraditionalneural 80 networkneedstointeracteveryneuronintheinputandoutputlayers.Inthisway,theCNNscan easilylearnsomesmallandmeaningfulfeatureswhichareonlyfromtinypartoftheinputmatrix. Also,itcansavealotofcomputationtime. Parametersharingreferstousingthesameparameterformorethanonefunctioninaneural network.AteverypositionoftheinputinCNNs,itwillusetheeachelementofthekernelmatrix. Bydoingthis,thelayercanlearnsomethingwithawholepicture. Equivariantrepresentationsdescribesthephenomenonthatiftheoutputwillchangecorre- spondingtothechangeofinput. 4.1.1.1.3CNNforsequence Foracharacterinthesequence,wecanuseak- dimensionvector x torepresentit.Soasequenceoflengthnisrepresentedas: x 1: n = x 1 x 2 ::: x n Hereoperator istheconcatenationoperator.Weappliedconvolution w onawindowof sizeh,totransformtheinput x 1: n toanewfeature c i : c i = f ( w x i : i + h + b ) Herebisthebiastermandfistheactivationfunctionusuallyusingnon-linearfunctionlikesigmoid orhyperbolictangent.Byrepeatedapplytheoperation,wecangeneratefeaturemap.Then themax-over-timepoolingwasappliedtokeepthemostimportantfeatures. TextCNN[74]isoneoftheearliestmodelsadoptingthestructurewediscussedaboveforsen- tenceThekeyideaofTextCNNisusingkernelswithdifferentsizestocapture 81 Figure4.2:StructureofTextCNNmodel.Reprintedfrom[74]. patternindifferentranges.Afterglobalmaxpooling,thefeaturesthenusedasinputforafully- connectedlayerandthenconverttotheprobabilityoflabels.ThevariationofTextCNNmodel hasbeenadoptedforsolvingdifferentbioinformaticsproblem,suchasproteindomain tion[133],virussequence[127],andDNA-proteinbindingprediction[153] 4.1.1.2RecurrentNeuralNetworks RecurrentNeuralNetworks ,orRNNs[129]arethefamilyofthedeeplearningstructuresto processsequentialdata.Parametersharingacrossthedifferentpartsofthemodelisthekeyidea thatmakesRNNstobeabletodealwiththesequentialdata.,theRNNswetalked aboutisoperatingonasequencethatcontainsvectors x ( t ) withthetimestepindextrangingfrom 1to t . 4.1.1.2.1BasicstructureofRNNs Asimplerecurrentneuralnetworkwilljustprocessthe informationfromtheinput x andincorporateitintothestate h ofhiddenunitandpassedtothe 82 nextunit.Thehiddenunitstate h attimestept, h ( t ) , h ( t ) = f ( h ( t 1 ) ; x ( t ) ; ) RNNalsousuallyhaveextraarchitectureslikeoutputlayersthatoutputtheresultof h forfurther processingorpredictions. Figure4.3:Therepeatingmoduleinainasimplerecurrentneuralnetworkstructure.Reprinted from[113]. Wecanuseafunction g ( t ) torepresenttheunfoldedrecurrenceaftertsteps,givenusanother representationof h ( t ) , h ( t ) = g ( t ) ( x ( t ) ; x ( t 1 ) ;::: x ( 2 ) ; x ( 1 ) ) Inthisform, g ( t ) takestheinputofwholepastsequences ( x ( t ) ; x ( t 1 ) ;::: x ( 2 ) ; x ( 1 ) )) asinput andoutputthecurrentstate.However,wecanunfoldingthestructureandrepeatedapplyfunction f toachievesameoutput.Bydoingthiswecanusethesametransitionfunction f withthesame parametersateverystep. However,asimpleRNNscannotlearnlongtimedependencyasintheoptimization.Aswede- scribedabove,ateachtimestepwerepeatedlyapplythesameoperation f .Wecanuseamultipli- cationonmatrix W torepresentsuchoperation.Supposethatmatrix W hasaneigendecomposition 83 W = V diag ( ) V 1 .After t steps,theinputalreadymultiply W t ,so: W t = V diag ( ) t V 1 Meansthatgivenenoughlargestepst,anyeigenvalue l thatnotcloseto1willeithervanish (lessthan1)orexplode(largerthan1).Thegradientinsuchgraphalsoscaledaccordingto diag ( ) t [52].Inthiscase,itdiftooptimizesuchgradient.Tosolvethischallenge,gated RNNsisproposedandbecomesoneofthemosteffectivepracticalmodelsthatusedforsequential data. 4.1.1.2.2LSTM Thelongshort-termmemory(LSTM)architecture[66]isonebranchofsuch gatedRNNsthatisextremelysuccessfulintheapplicationlikespeechrecognition,machinetrans- lation,andhandwritinggeneration.ThekeyideaofLSTMistointroduceaselfloopsothat gradientcanwforlongduration. Figure4.4:TherepeatingmoduleinaLSTMcell.Reprintedfrom[113]. Theselfloop(internalrecurrence)islocatedinfiLSTMcellsflwithouterrecurrencelikeordi- naryrecurrentnetwork.Thecelliscontrolledbyacombinationofinputgateandmemorycontrol gates,formulatedbyequationsbelow: 84 it=s(Uiht1+Wixt+bi)ft=s(Ufht1+Wfxt+bf)ot=s(Uoht1Woxt+bo)Ÿct=s(Wcxt+Ucht1+bc)ct=ftct1+itŸctht=ots(ct)whereit,ftandotarethreegatesthatcontrolinput,forgetandoutputrespectively.Eachofthe gateoutputsignaldependsonitsstatetransitionmatrix U,inputweightingmatrix Wandbias b.Thelcellstate cisupdatedbyelement-wisemultiplication,denotedbyoperatorfi fl.sisthe activationfunction,whichcanbechosenfromsigmoid,ReLU,tanhandsoon. Aswecansee,LSTMaddstwomoregatestocontroltheoutputandpreviousinputs,thushas theabilitytocapturethelongtermdependenciesinthesequence.Sofar,LSTMsarethemost successfulvariationofRNNs. 4.1.1.2.3GRU Thegatedrecurrentunit(GRU)architecture[25]isanalternativeandthemain competitortotheLSTM.TheperformanceofGRUiscomparabletotheLSTMonmanysequential datasets,withsimpleunitstructuresandlesscomputationrequirement. ThemaindifferenceofGRUwiththeLSTMisthatasingleupdategatereplacetheroleofthe forgetandinputgates.Sotheequationsareupdatedtofollowing: 85Figure4.5:TherepeatingmoduleinaGRUcell.Reprintedfrom[113].. z t = s ( U z h t + W z x t + b z ) r t = s ( U r h t + W r x t + b r ) Ÿ h t = s ( W h xt 1 + U h ( r t h t 1 )) h t = z t 1 h t 1 +( 1 z t 1 ) Ÿ h t 4.1.1.3Proteinfunctionpredictionusingdeeplearning Recently,deeplearningtechniqueshavebeenshowntoachievestate-of-the-artresultsinmany machinelearningproblemswithouttheneedforcomplexfeatureengineering[52,85].Deeplearn- ingbasedmethodalsohavedemonstratedtheirabilityonmanybioinformaticsproblems,suchas DNA-proteinandRNA-proteinbindingsitesprediction[3,153],taxonomic[19,43], andSNPandsmall-indelvariantdetection[122]. 86 Thesuccessofdeeplearningtechniquesinspiredresearcherstoapplytheapproachtomodeling theproteinsequences.Longshort-termmemory(LSTM)wasintroducedby[65]tomodelthe proteinhomologydetection.Themodelcanautomaticallyextractthelong-termandshort-term dependencyofthesequencesandpredictwhetherthesequencesbelongtothemodeledprotein family.Recently,abidirectionalLSTMmodel,ProDec-BLSTM[91],wasproposedtoimprove thepredictioncapabilityofthemodelfurther.Inthemodel,everyhiddenvalueoftheLSTMwas usedtocapturethelongandshortdependencyinformationoftheproteins.Theexperimentresult showsProDec-BLSTMoutperformsvariousexistingmethods,includingHMMER. DeepFam[133]adoptedconvolutionalneuralnetworks(CNN)tomodelfamiliesofprotein sequences.UnlikeLSTMmodelswementionedearlierthatcanonlymodelsingleproteinfamily, DeepFamusedasingleCNNmodeltopredictmultipleproteinfamilies.Varioussizesofconvo- lutionalwereusedinDeepFamtoextractconservedregionstomodelproteinfamilies.It outperformsHMMERandpreviousalignment-freemethod.Also,theexecutiontimeofDeepFam isfastasitnotaffectedmuchbythenumberoffamilieswhileHMMERneedslongertimeasthe numberoffamiliesincreased.Forexample,DeepFanismorethantentimesfastercomparedwith HMMERwhen1,000querysequencessearchingagainstthousandsofproteinfamilies[133]. However,existingmethodslikeDeepFamcannothandleerrorsandincompletesequences.All thedeeplearningmethodwementionedwereonlytestedusingcompleteandcorrectproteinse- quences.Moreover,DeepFamusedaclosedatasettotesttheirperformance,whichassumesthat testdatasetareheld-outsamplesfromthesamedistributionsofthetrainingdataset.Inarealappli- cation,asequencemaybesampledfromnon-codingsequenceorotherproteinfamiliesthatnotbe modeledintheneuralnetwork.Wecallthistaskdetectiontaskhereafter,asthemaingoalistode- tectthetargetedprotein-encodingsequencesfromotherirrelevantsequences.Intheory,irrelevant inputissupposedtoberejected.Butcurrentdeeplearningmodelsemploysoftmax 87 functiontoassignthelabelandarenotabletodeclinesuchcases,yieldinghighfalse-positive rate[10,63]. 4.1.2Relatedwork 4.1.2.1Prhomologysearch Topredicttheproteindomainofagivensequence,wecandirectlycomparethesequencewith protein-encodingsequencesusingpairwisealignmenttoollikeBLAST[4].However,inmany applications,weusuallydonotcareifthequerysequenceclosetooneencodingsequence ofthetargetprotein.Instead,wewanttoalignthequerysequencetoafamilyofproteinsequences. AhiddenMarkovmodel(pHMM)canbebuiltfrommultiplesequencealignmentstomodel aproteinfamily.Byexplicitlythematch,insertion,anddeletionstate,pHMMcanhandle thecomparisonofthequerysequencetotheofproteinfamiliesaccurately,evenwhenthe queryisremotehomologyofthetargetproteinfamilies.HMMERisoneofthemostwidelyused searchtoolsbasedonpHMM[36].oftheproteinfamiliescanbedownloadedfrom manyproteindatabases,suchasPfam[40],andTIGRFAMS[56]. Asanalignment-basedmethod,thespeedofthehomologysearchsufferswhenthe numberoffamiliesincreases.Extensiveresearchhasstudiedtoimprovetheefyofthe homologysearch.Thenewestversion(3.1b2)ofHMMER[37]ismoresensitivethan BLASTandisabout10%faster.However,recentresearch[133]suggestsitisstillslowerthanthe alignment-freemethod. 88 4.1.2.2Homologysearchwitherrorcorrection Duetothehigherrorrates,thehomologysearchresultonrawPacBioorNanoporereadsusually isnotdesiredwithwrongannotationandshortalignments.Inthissituation,errorcorrectionon rawreadscaneffectivelyimprovethehomologysearchresult. Therearetwomajorcorrectionmethodsforthird-generationsequencingdata:hybriderror correctionsandnon-hybriderrorcorrections.Thehybridmethodcorrectlongnoisyreadbycom- biningshort,accuratesecond-generationsequencingreadswiththird-generationreadstoremove errors[78,136].Unlikehybridsequencing,non-hybriderrorcorrection,orself-correction,only relyonthelongreadsthemselvesforerrorcorrection[24,48,93].Asthesequencingerrorsinthird- generationreadsoccurrandomly,theinferredconsensussequencefromthealignmentbetweenthe longestbackbonereadsandshorterreadsfromthird-generationsequencingdatasetrepresentthe high-qualitysequence.However,itsperformanceisprofoundlyaffectedbythecoverageofthe alignedsequencesagainstthebackbonesequences[34]. 4.1.3Overviewofourwork Aswediscussedintheprevioussection,thecurrentproteinpredictalgorithmscannothandlethe higherrorrateofthethird-generationreads.Thus,wedesignedandimplementedDeepFrame,a deeplearningbasedmethodtopredicttheproteindomainofthird-generationsequences. WeoptimizedtheneuralnetworksarchitectureofDeepFramebasedonrigorousvalidation performedontheofasimulatedPacBiosequencedataset.Theationaccuracy ofDeepFrameconsistentlyoutperformsthestate-of-the-artmethodlikeHMMERacrossvarious errorrates.WethenextendedtheframeworktothedetectiontaskwithOutlierExposureandtested onrealthird-generationsequencingdataset.DeepFrameachieveshigherF1scoreandhigherrecall 89 withcomparableprecisionwhencomparedwithHMMER. Comparedtopreviousworks,DeepFramehasseveralmerits.First,wedesignedathree- channelencodingforthetaskoffunctionpredictionusingDNAreads.Comparedtootherencod- ingmethods(discussedinExperimentsandresults),modelwiththree-channelencodingachieved betterperformance.Second,evaluatedbysimulatedPacBiodataandrealPacBioandNanopore data,DeepFramecansuccessfullyprocesslongerroneousreadswithoutanyprepossessing.Al- thoughweonlyusedsimulatedPacBioreadsthatareeasytogeneratetotrainourmodel,the performanceonrealPacbioandNanoporedataisstillrobustandconsistent.Third,unlikepre- viousdeeplearningworkthatonlydesignedforDeepFramecanalsobeusedfor detectiontasks.Withthethresholdonsoftmaxpredictionprobability[63]andOutlierExposure method[64],DeepFramecandistinguishtheDNAsequenceswithtargetedproteindomainfrom otherrandomcodingornon-codingDNAsequences. 4.2Methods DeepFrameisdesignedtopredictproteindomainsforthird-generationsequencingdata.Itin- corporates3-frameencodingtoconvertDNAreadsinto3-channeltensorastheinputofneural networks.Fromtheinput,DeepFrameautomaticallyextractsfeaturesbyusingconvolutionlayer andmax-over-timepoolinglayer.Awithtwofullyconnectedlayerwasusedtogenerate theprobabilitiesofthesequenceagainstallpossibleproteindomains.Toexcludetheunrelated codingornon-codingDNAsequences,wethencomparethemaxvalueoftheoutputprobabilities withathreshold.Wewilldiscussthedetailsofourmodelinthefollowingsection. 90 Figure4.6:TheoverviewofDeepFramemodel.Theinputsequencewastranslatedandencoded toa3-channeltensor.Twoconvolutionallayerandmaxpoolinglayerextractsequencefeatures. Thesefeatureswereusedbyafullyconnectedlayerwithsoftmaxfunctiontoinfertheprobability ofeachproteinfamilies.Inthetask,themodeldirectlyoutputsthefamilieswith thelargestscoreastheprediction.Inthedetectiontask,themaximumofsoftmaxscoreneeds tocomparewiththethresholdtodeterminewhethertheinputcontainsatrainedproteinfamilyor shouldberejected. 91 4.2.1Encoding ToprocesstheDNAreads,weneedtoencodesequencestonumericalvalues.Wedesigneda 3-frameencodingschemefortheinputsequences.EachDNAsequencesis3-frametranslated to3proteinsequences.Alltheresiduesintheproteinsequenceareone-hotencodedusinga 21-dimensionalvectorfollowingIUPACaminoacidcodenotation.Thenweconcatenatethree matricesintoasingle3-channeltensorlikeanRGBimage.Therationalebehindthisisthatwe hopetheconvolutioncanautomaticallydistinguishtherightfragmentoneachframeafter translationandextractthecorrespondingfeatures.Also,therelativeorderoftheresiduesatthe samepositionofthethreeframesincorporatetheinformationoftheoriginalDNAsequence.Given thesequencelength L,thentheencodedinputisatensorwithsize3 L21.Thepseudo-codeof 3-frameencodingcanbefoundinAlgorithm4. Algorithm4 3-frameencoding Input:DNAsequence xwithlength L,peptidetoindexdict idx ,peptidealphabetsize jSj,output sequencelength n.Output:Inputtensorforneuralnetworkswithsize3 n21.1:Initializeanarray arr withall0valuesanddimensions[3, n,jSj]2:for i=1to3 do3:xi x[i:]4:yi translationof xi.translatenucleotidesin xitoaminoacidsin yi5:for residueaatposition kinyido6:ifknthen7:arr [i;k;idx [a]] 1.one-hotencodingforeachframe 8:endif 9:endfor 10:endfor 11:arr isinputtensorforneuralnetworks Wealsotestedotherencodingmethods:(1)DNAone-hotencoding,whichdirectlytrasnfer DNAsequencetoonehotvectorsofsize L4.Forfaircomparison,weusedsizesthatare3 timesaslongasweusedfor3-frameencoding;(2)3-branchmodel,weconstructedanetworkar- 92chitecturewiththreeseparatebranchesprocessingeachofthe3-frametranslatedproteinsequence respectively.Eachofthebranchesconsistsofidenticallayers,andalltheparametersareshared betweenthesamelayerof3branches.Inthe3-branchmodel,eachofthebranchesmodelsthe translatedproteinsequencesseparatelybeforethemerginglayerrightbeforethetwo-layerclassi- .Incontrast,in3-frameencoding,allthreetranslatedproteinsequenceswereprocessedand combinedbythe3-channelconvolutionalintheconvolutionallayers.Ourexperimental resultsshowthat3-frameencodingisabetterencodingschemeasitcaneffectivelyencodethe originalDNAsequenceinformationandalsohelpsconvolutiontoextractusefulfeaturesfor predictionoftheproteindomain(Section4.3.1.1). 4.2.2ConvolutionalNeuralNetworks DeepFrameconsistsoftwoconvolutionallayers,onemax-over-timepoolinglayer,onehidden linearlayer,andonelinearoutputlayerwithsoftmaxfunction. Theconvolutionallayer[46,74,86]isthemostimportantbuildingblockinthemodel.Fora residueinthesingleframeproteinsequence,wecanuseak-dimension(k= j S j inourcase)vector y torepresentit.Soasequenceoflength n isrepresentedasaone-hotencodingmatrix.Weapplied convolution w 2 R hk onasize h windowofthematrix,totransformtheinput y i : i + h 1 to c i : c i = f ( w y i : i + h 1 + b ) (4.1) Here b isthebiastermand f istheactivationfunctionusuallyusingnon-linearfunctionlike sigmoidorhyperbolictangent.Inourmodel,weuseReLU[104]asactivationfunctionofconvo- lutionallayer: f ( x )= s ( x )= max ( 0 ; x ) (4.2) 93 WecanextendEq.4.1tomultiplechannelinput(inourcase,channel = 3: c i = f ( 3 å j = 1 w j y i : i + h 1 ; j + b ) (4.3) ForbothEq.4.1and4.3,weappliedconvolutionalrepeatedlytoeachpossiblewindow oftheone-hotmatrix y 1: h ; y 2:1 + h ;:::; y n h + 1: n toproduceafeaturemap,whichisthevectorof extractedfeaturevalues: c =[ c 1 ; c 2 ;:::; c n h + 1 ] (4.4) Thenthemax-over-timepoolingisappliedonthefeaturemaptocapturethemaximumvalue ‹ c = max c asthefeaturefromthisparticular.Therationalebehindthisoperationistoextract thepeaksignalasthemostimportantfeatureforeachfeaturemap.Themax-over-timepoolingis xiblewithdifferentinputlength. Wedescribedhowasingleintheconvolutionallayerworks.Inourapplication,weused multiplewithvaryingsize(orwindowsize)toextractvariousfeatureswithdifferent ranges. DeepFramehastwoconvolutionallayers.Theconvolutionallayerusesconsistentconvo- lutionsizetoextractlow-levelshortdistancepatternsdirectlyfrom3-frameencodinginput. Thensecondconvolutionallayerextractshigh-level,intricatepatternswithvaryingdistancefrom theoutputoftheconvolutionlayer.Byrepeatedlyapplyingtheoperations,wecan generateafeaturemap.Thenthemax-over-timepoolingwasappliedtokeepthemostimportant features.Dropout[138]isalsousedafterpoolingtopreventovandtolearnrobustfeatures. Atwo-layerwithsoftmaxfunctiontransfersthefeaturestoavectorofprobabilitiesover eachlabel.Forweselectthelabelwiththemaximumprobabilityastheprediction fromDeepFrame. 94 4.2.3Out-of-distributionexamplesdetection WeshowthatDeepFramewithsoftmaxfunctioncanpredicttheproteinlabelsofgiven DNAreads.However,thewillalwaysassignalabelfortheinputsample,evenifthe inputisnotrelatedtoanylabelinthemodel(wecallsuchinputsout-of-distributionexamples, comparedtoin-distributionexamples).Forexample,inRNA-seqdata,noteveryreadencodes targetedproteinfamiliesinthemodel.Whenprocessingsuchunrelatedinputs,themodelwillstill assigntheinputtooneoftheproteinfamiliesinthemodelasthereisnosuchlabelfortheout-of- distributioninput.Inarealapplication,thisbehaviorwillleadtoanundesiredhighfalse-positive rate.Toaddresstheproblem,weadoptOutlierExposure(OE)[64]withathresholdonsoftmax predictionprobabilities[63]todistinguishtheout-of-distributionexamples. 4.2.3.1Thethresholdbaseline Usually,theexamplesfromout-of-distributiondatasettendtohavesmallsoftmaxvaluebecauseits probabilitiesvectortendstobemoreuniformlydistributedthantheexamplesfromin-distribution dataset.Thus,byspecifyingathresholdonmaximumprobabilityinthesoftmax'soutput,itispos- sibletodistinguishout-of-distributionexamplesformin-distributionexamples.Following[63], weextractthemaximumvalueofthesoftmaxprobabilityfromtheoutputofDeepFrameforeach inputexmaple.Weseparatethein-distributionexamplesfromtheout-of-distributionexamples byspecifyingathresholdofthemaximumsoftmaxprobability.Todeterminethethreshold,an- othersimulatedout-of-distributiondatasetiscreatedandcombinedwiththesimulatedholdouttest dataset.Foreachexample,wecomputethemaximumsoftmaxprobability.WeobtainF1score fromthedataset: F 1 = 2 precision recall recall + precision (4.5) 95 TheF1scoresummarizestheperformanceofthebinaryclofin-distributionexamples andout-of-distributionexamplesforagiventhreshold.WecomputetheF1scoreacrossdifferent thresholdsandselectthethresholdofbestF1scoreasthethresholdusedinourmodel. 4.2.3.2OutlierExposure Tofurtherimprovetheperformanceofout-of-distributionexamplesdetection,weadopttheOut- lierExposure(OE)methodintroducedby[64].Aswediscussedpreviously,weexpecttheout-of- distributionexamplestohaveuniformsoftmaxoutputfromthemodel.However,assuchinputs neverbeprocessedinthetraining,sometimesthemodelwillyieldunexpectedhighpre- dictionforout-of-distributioninput(Figure4.7).Toaddresstheproblem,weexposethemodelto out-of-distributionexamplesinthetrainingprocesstoletthemodeleffectivelylearntheheuristics oftheout-of-distributioninputs. Givenamodel g andthelearningobjective L ,theobjectiveofOEistominimizetheoriginal lossfunctionwithanauxiliarylosstermstoregularizetheout-of-distributionexamples[64].In originaltask,weusecross-entropylossfunctionas L .Sowesettheauxiliaryloss L OE totheuniformdistribution: L OE = log exp ‹ y 0 å j exp y 0 j ! = y 0 + log å j exp y 0 j (4.6) Here, y 0 isthemeanvalueoftheoutputlogits(theoutputofCNNsmodelbeforethesoftmax function).Andweuse l = 0 : 5inourexperiment. Figure4.7presentsthedistributionofthesoftmaxscoreofDeepFramemodelwithandwithout OutlierExposureforthethresholdcalibrationdatasetweusedinSection4.3.2.WithoutOutlier Exposure,therearestillalotofout-of-distributionexampleswithlargesoftmaxscores(0.5to1). 96 Figure4.7:Thedistributionofsoftmaxscoresfrombasemodel( top )andmodelwithOutlier Exposure( bottom ).Greenbarrepresentthecountofin-distributionexamplesthatpredictedcor- rectly.Bluebarrepresentthecountofin-distributionexamplesthatpredictedincorrectly.Redbar representthecountofout-of-distributionexamples. 97 AftertrainedwithOutlierExposure,mostoftheout-of-distributionexamplesaccumulatedwith smallsoftmaxscores(0to0.4).Wealsofoundthatthescoresofin-distributionexamplesthat incorrectlypredictedbyourmodelareclosetothescoresofout-of-distributionexamples.Soan optimizedthresholdcanalsoexcludethoseincorrectpredictions. 4.2.4Implementationsandhyperparameters DeepFramewasimplementedbyPython3usingPyTorch[119]withApex[111]acceleration.The DeepFramemodelwastrainedusingagradient-descentoptimizationalgorithmwithadaptivees- timatesofmomentscalledAdam[75].Thetrainingtargetistominimizemultipleclasscross- entropylossfunction. Wealsoimplementedseveralcommonlypracticedtechniquesforoptimizingdeepneuralnet- works,suchasmini-batchgradientdescent,Heinitialization[61],andhyperparametertuning. Thereareseveralhyperparametersincludingthenumberandthesizeofconvolutionthe numberofinputandoutputfeaturesinthelinearlayer,dropoutrate,learningrate,andbatchsize. Wetriedourbesttosearchtheparameters,andweusedtheparametersthataretestedtoperform wellintheexperiment.Inourexperiment,thehyperparametersweresetasfollows:batchsize = 256,learningrate = 0 : 001,dropoutrate = 0 : 5.Fortheconvolutionallayer,thesizeof theis3,andthenumberofis64.Forthesecondconvolutionallayer,thesizeofthe is8 ; 12 ; 16 ; 20 ; 24 ; 28 ; 32 ; 36with256ltersofeachsize.Also,forthehiddenlinearlayer, thenumberofoutputfeaturesis512.Giventhedatasizeweused,wefoundourmodelconverges around1to2epochstraining,sowetrainedallmodelstwoepochswithvalidatingevery102400 samplestoselectthebestmodel. 98 4.3Experimentsandresults ToevaluateDeepFramemodel,weappliedDeepFrameontwodataset:asimulatedPacBioG protein-coupledreceptor(GPCR)codingsequences(cds)dataset[30],andarealthird-generation sequencingdatasetofhumangenome[20,116].Gprotein-coupledreceptorisalargeprotein familythatisinvolvedinmanycriticalphysiologicalprocesses,suchlikevisualsense,gustatory sense,senseofsmell,regulationofimmunesystemactivity,andsoon[143].WeusedGDS[30] forourexperimentasitisalreadyusedtoevaluateDeepFam[133].Itconsistof8222protein sequencesbelongingto5families,38subfamilies,and86sub-subfamilies.ForsimulatedGPCR cdsdataset,wehavetheproteinfamilyinformationofreferencereadforeachsimulatedreadas ourgroundtruth.Forthehumangenomedataset,wedeterminethegroundtruthviaBLASR's[21] alignmentsagainsttheGPCRcodingsequences. WebenchmarkedDeepFrame'sperformancewithHMMERandDeepFam.Thosetoolsand methodsarerepresentativesofcurrentstate-of-the-artmethodsofproteindomainprediction.In experimentsonsimulatedPacBioGPCRcdsdataset,weevaluatetheperformanceofallmethods usingaccuracy.Forhumangenomedataset,weevaluatedtheperformanceofdetec- tionofGPCRproteinfamiliesfromotherunrelatedDNAsequencesusingfollowingmetrics:(1) recall,whichmeasurestheratioofthecorrectproteindomainspredictedbyeachprogramtothe wholesetofexpectedproteindomains;(2)precision,whichtheratioofcorrectprotein domainsdetectedbyeachprogramtothetotalreportedproteindomains;and(3)F1score,which istheharmonicmeanofrecallandprecision.BecauseDeepFamisnotdesignedtoprocessthedata without-of-distributionsamples,wedidnotevaluateitintheexperimentonthehumangenome dataset. Forourexperiments,allcommands,parameters,andoutputcanbefoundalongwith 99 thesourcecodeofDeepFrame. 4.3.1SimulatedPacBioGPCRcdsdataset WeevaluatedourmodelusingsimulatedPacBioreadsgeneratedusingPBSIM[114]with GPCRcdsdatasetwithdefaultsetupanderrorratesfrom1%to15%.Wecreateacorresponding GPCRprotein-codingsequences(cds)datasetbydownloadingcdsfromNCBIbysearchingthe keywordofthecorrespondingsub-subfamiliesinGDS.WealsobuildpHMMmodelsfromGDS usingHMMERandtestallcdsagainstthesemodels.Wediscardedthecdsthatnotpredicted correctlybyGPCRHMMmodelstocleanthedataset.Wedeterminedthegroundtruthofsub- subfamilylabelofasimulatedreadbyassignthelabelofthereferencecodingsequencetoit. 4.3.1.1Performancewithdifferentarchitectures Weconductedaseriesofexperimentsbyvaryingthekeyopponentsinourbasemodels:thenumber ofconvolutionlayers,thenumberofconvolutionkernels,thesizeofconvolutionkernels.Wealso testeddifferentencodingstrategies.WelistedallvariationswetestedinTable4.1.Thereference codingsequencesofeachsub-subfamilieswerespittedto80%trainingsamplesand20%test samples.ThenweusedPBSIMtogeneratesimulatedPacBioreadsfortrainingsamplesandtest sampleswith15%errorrates.Allthemodelsweretrainedusingthesamehyperparameterswe discussedinSection4.2.4withvetimesrepeat.Figure4.8comparesaccuracyof allthevariantsinthetestsamplesofsimulatedGPCRcdsdataset. 4.3.1.1.1Comparisonsofencoding. Wecomparedthreemodelswithdifferentencodingstrate- gies:(1)3-frameencodingmodel,(2)DNAone-hotencodingmodel,and(3)3-branchmodel.We describethearchitectureofallthreeencodingmodelsinSection4.2.1. 100 NameArchitecture(comparetobasemodel) BasemodelThemodelwedescribedinMethods Use512intotalinthe2ndconvolutionlayer Use1024intotalinthe2ndconvolutionlayer Use4096intotalinthe2ndconvolutionlayer 1layerOnlykeepthelastconvolutionlayer 3layerAddanextraconvolutionlayerwith64ofsize3 sizesof2ndconvolutionlayer =[ 6 ; 9 ; 12 ; 15 ; 18 ; 21 ; 24 ; 27 ] sizesof2ndconvolutionlayer =[ 10 ; 15 ; 20 ; 25 ; 30 ; 35 ; 40 ; 45 ] sizesof2ndconvolutionlayer =[ 12 ; 18 ; 24 ; 30 ; 36 ; 42 ; 48 ; 54 ] 3-branch3branchesstructurefortranslatedreads DNA- encoding Useone-hotencodingofDNAreadsasinput Table4.1:ThenameandbriefdescriptionofvariantsofCNNmodelswecompared. Figure4.8:Themeanandstandarddeviationofaccuracyofdifferentnetworkarchi- tectures.Forconvenient,weuseddifferentnames( 3-frameencoding , 2048s , s8 ,and 2 layer )forthesamebasemodel.Weuseddifferentcolorsfordifferentgroupofcomparisons:green barsforencoding;bluebarsfornumberofpurplebarsfordifferentsizes;orangebars fordifferentconvolutionlayers. 101 FromFigure4.1,themodelwith3-frameencodingachievedmuchbetterperformancecom- paredtothe3-branchmodelandDNAencoding.Inthe3-branchmodel,theoriginalDNAse- quencesinformationislostastheordersofthethreeframesnotkeptinthemodel.Incontrast, DNAencodinghastheoriginalinformation,buttheindirectinformationofproteininDNAis noteasytoextractusefulfeaturesthatcanseparatedifferentproteinfamilies.TheDNAencoding modelalsoconvergesmuchslowerthantheothertwopeptideencodingmodelsintraining.Finally, bothDNAencodingmodeland3-branchmodelhavealargernumberofparameterscomparedto 3-frameencoding,requiringmorecomputingresources. Toinvestigatethefeaturesextractedfromthesequencesbyconvolutionallayers,weobtainthe featurematrixafterthemax-over-timepoolingforallthreemodels.Wepicksixsub-subfamilies withtopF1scoresforallthreemodelsforvisualizingthefeaturesusingt-SNE[96].Thefeatures fromdifferentproteinsub-subfamilieshavebeenmappertomanysmallclustersshownin4.9. Thedistancesbetweenclustersfromdifferentsub-subfamiliesin3-frameencodingmodelsare relativelylarger,andeachsmallclusteronlyrepresentsonesub-subfamilies.Manyclustersin DNAone-hotencodingmodelareclosetoeachother.Andclustersfromthe3-branchmodelhave overlaps.Theseplotsmayexplainwhytheperformanceofthe3-frameencodingisbetterasit extractsfeaturesthatcaneasilydistinguishdifferentproteinsub-subfamilies. 4.3.1.1.2Comparisonsofconvolutionalsetup. Comparedtotheone-layermodel,our basemodel(2layers)achievedhigheraccuracy.Theextralayerhelpstheneuralnetworktoextract morecomplexpatternssuchasinteractionsofthelowerlevelfeatures.However,the"deeper" modelwithmorelayersaremorediftooptimize.Thatisthepossiblereasonwhytheaverage accuracyof3layermodelislowerthanthebasemodel,butthehighestaccuracyachievedishigher thanourbasemodel. 102 Figure4.9:2Dt-SNEplotoffeaturesextractedfromtheconvolutionallayeroutput.(Top)3-frame encodingmodel.(Middle)DNAone-hotencodingmodel.(Bottom)3-branchmodel. 103 Wefoundthattheadditionalconvolutionalincreasedperformanceforproteindomain prediction.Theimprovementissaturatedwhenwehavemorethan2048 Increasingthesizeofcanalsohelptoimprovetheperformanceofthemodels.With thelargersize,theneuralnetworkcancapturelong-rangefeatures.Theresultsuggeststhe importanceofchoosingtherightsize,whichisnotexploredinpreviousworks[133,153]. 4.3.1.2ComparisonwithHMMERandDeepFam TheaccuracyofofGPCRcdsofDeepFrameandDeepFamwasmeasuredusing 5-foldcross-validation.Thereferencecodingsequencesofeachsub-subfamilieswereequally splitintovefoldswhilepreservingtheratioofthefamilies.ThenweusePBSIMsimulateeach foldofthecodingsequencestothesimulatedPacBioreadsthatusedtotrainDeepFrame.For DeepFam,weusedthetranslatedproteinsequencesofeachfoldofcodingsequencestoretrainit. Wethentestedallthreetranslationsusingourretrainedmodel.ForHMMER,weusedall5-fold translatedproteinsequencestoretrainthepHMMmodel.Togeneratemultipleseuqnceaignment, MAFFT[72]wasusedforthesequencesofeachsub-subfamilies.Thenweused hmmbuild in HMMERtobuildpHMMmodel.ForeachtestDNAsequences,3-frametranslationswasapplied togetthreeproteinsequences,thenwastestedwith hmmscan againstall86pHMMmodelswebuilt. InbothexperimentsofDeepFamandHMMER,ifatleastoneofthethreetranslatedreads totherightsub-subfamilies,wetreatedthiscaseasacorrectprediction.HMMERwasnotableto assignfamilylabelforsomesequences,sosuchcasesweretreatedasincorrectpredictions. Figure4.10summarizedthecomparisonofofallmethodsonthesimulated PacBioreads.Forthisdataset,ourmethodachievedconsistentperformancewhentestondataset withdifferenterrorrates.Ourmethodachievedmuchhigheraccuracycomparedtotheothertwo methodswhentheerrorratesarehigh.Thehigherrorratesheavilyimpactedtheperformanceof 104 Figure4.10:ComparisonofproteinperformanceofDeepFrame,HMMER,and DeepFamacrossdifferenterrorrates. 105 HMMERandDeepFam.ItisexpectedbecausetheHMMsearchismuchmoresensitiveto frameshiftscausedbygaps,andDeepFamisdesignedforclassifyingrelativelycompleteerror-free proteinsequences. 4.3.1.3Comparisonoftimecomplexity Wemeasuredtheexecutiontimesofthetestedmethods.Withalargeamountofdatagenerated bythethird-generationsequencingplatform,werequirenotonlyhighaccuracyofthealgorithm butalsothehighefyofthealgorithms.WerunbothDeepFrame,DeepFam,andHMMER usingIntel R Xeon R Gold6148CPUwith20coresattheHigh-PerformanceComputingCenter atMichiganStateUniversity.WealsotestedDeepFrameandDeepFamwithNVIDIA R Tesla R V100GPUwithApexaccelerationlibrary(HMMERdoesn'tsupportGPU).Foreachmethod, Wemeasureditsexecutiontimebyaveraging5independenttrialswithrandomlyselected10,000 sequences. SetupDeepFrameDeepFamHMMERHMMER max CPU1168.78s276.74s312.13s3470.04s GPU25.71s20.37sunavailableunavailable Table4.2:Theaverageelapsedtimetopredictfamiliesof10,000simulatedPacBioreadsforeach method. InCPU,HMMERwiththedefaultsetuprunsmuchfasterthanDeepFrameandDeepFam.With highsequencingerrorrates,thealignmentagainstmanycandidatesub-subfamiliescannotpassthe stageofHMMER,speedinguppHMMsearch.With --max (turnofftheexecution timeofHMMERisevenslowerthandeeplearningmethods.WithGPUacceleration,therunning timeofDeepFrameismuchsmallerthantherunningtimeofHMMERwiththedefaultsetup. 106 4.3.2Humangenomedataset ToevaluateDeepFrame'sperformanceonrealthird-generationsequencingdataset,wetestedDeep- FrameontheH.sapiens10xSequenceCoveragewithPacBiodata[116]andOxfordNanopore HumanReferenceDatasetsRel6[67].Wedescribedthedatasetweusedinthisexperiment. 4.3.2.0.1Traindataset. Forthethresholdbaseline,weusedthemodeltrainedintheprevious experiment.ToapplyingOutlierExposure,weconstructedadatasetthatmixingthe previous5-foldtraindatasetwithadatasetofoutliers.Weconstructedtheoutelierdatasetusing simulatedPacBioreads.Toclosetotherealout-of-distributionexamples,wesimulatedaPacBio humangenomedatasetusingGRCh37/hg19humanreferencegenomeasreference[27].Weonly keptthesimulatedreadsthatcannotbealignedtocdsbyBLASRintheout-of-distribution dataset. 4.3.2.0.2Thresholdcalibrationdataset. Tocalibratethethreshold,weneedadatasetthat mixingin-distributionexampleswithout-of-distributionexamples.Wesimulatedin-distribution examplesfromGPCRcdsdataset.Also,wesimulatedout-of-distributionexamplesusingATP synthasefamiliescds.WeusedATPsynthasefamiliesbecausebothGPCRandATPsynthase familiesbelongtothemembraneandcellsurfaceproteinsandpeptidesclassinSCOP [92].Intuitively,proteinfamiliescloselyrelatedmaybemorechallengingtodistinguish,andthis leadtothethresholdwecalibratedmorerobust. 4.3.2.0.3Out-of-distributiontestdataset. Weconstructedtwotestdataset:aPacBioRSII testdatasetfromPacBioSMRTSequencingforCHM1TERThumancellline;andaNanoporetest datasetfromOxfordNanoporeMinIONonCEPH1463(NA12878/GM12878,Ceph/Utahpedi- gree)humangenomereferencestandardusingR9.4chemistry.Todeterminethegroundtruthof 107 thesetestreads,wealignedallreadsagainstGPCRcdsdatasetusingBLASR[21].Weex- tractedthealignedpartofthereadsofwhichalignmentlengthlongerthan60%ofalignedcds lengthasourin-distributiontestsamples.Foreachsample,thegroundtruthisgivenbythelabel ofalignedcds.Thesereadsarein-distributionexamplesinthetestdataset.Wealsorandomly selectedreadsthatnotalignedtoanycdsasourout-of-distributionsamples.Thedatasetconsists of50%readsthatmaybecodingGPCRand50%out-of-distributionreads.Forreadlongerthan 3000bpsinthetestdataset,wecroppedreadintofragmentsthatnotlongerthan3000bpstobe consistentwithtraindataset. 4.3.2.1Out-of-distributiontestusingPacBioandNanoporereads MethodPacBioNanopore RecallPrecisionF-1scoreRecallPrecisionF-1score HMMER0.14940.95070.25830.3984 0.9825 0.5670 DeepFrame threshold 0.42350.54070.4667 0.4866 0.62340.5384 DeepFrameOE 0.44790.98370.6154 0.48360.9731 0.6458 Table4.3:Theperformanceofproteindomainpredictionwithout-of-distributionexamplesusing DeepFramethresholdbaseline,DeepFramewithOutlierExposure(OE),andHMMERonthereal PacBioandNanoporedataset. FordetecttheGPCRcodingsequencesamplefromout-of-distributionsamples,weretrain DeepFramemodelwithOutlierExposurediscussedinMethods.Weusedthesametraindataset intheprevioussimulatedPacBioreadsexperiment.WebenchmarkedtheOEmodelwithHM- MERandabaselinemodelwithonlyasoftmaxthreshold.InbothPacBioandNanoporedataset, DeepFramewithOEachievesthebestF1scorecomparedtoHMMERdefaultsetupandthreshold baseline(Table4.3).DeepFramewithOEachievedsignimprovementonrecallwhilethe precisioniscomparablewithHMMER.Ingeneral,allthreemethodshavebetterperformanceon 108 Nanoporedataset.Noticedthatinthewholepipeline,weonlyusesimulatedPacBioreadsbutno datafromNanopore.Thisresultsuggeststhatourstrategyisrobustwithdifferenttypesoflong reads. 4.3.3Visualizeandunderstandingconvolution Tobetterunderstandwhatourconvolutionalneuralnetworkslearned,wefurtherinvestigatethe featureextractedbyconvolutionalWevisualizedtheconvolutionunitsofourmodeland foundthatsomeconservedregionswerecaptured,althoughtheinputdataisverynoisy. Followingthemethodsadoptedbypreviousresearch[3,133],wevisualizetheconvolution unitsactivatedforeachfamily.Weusedthemodelthatwastrainedinpreviouslyexperiments andfeedthetestsequencesthatbelongedtothefamilytoourmodel.Thenwecollectallthe sequencefragmentsthatactivatetheconvolutionunits.Weextractedtheresultsfrommostacti- vatedconvolutionunitsandusedWeblogotogeneratethelogosfromsub-sequencesoftheresults. SincewetranslatedtheoriginalinputDNAsequencesinto3-framesproteinsequences,soforeach convolutionalunits,wehave3logosassociatedwith3framesrespectively. Fromthelogos,wecanfoundthattheconvolutionaldidlearnsomeconservedregions ofproteinfamilies.However,thelogoisverynoisycomparedtotheresultsreportedinprevious studies.Whentheinputhasinsertionsanddeletions,awell-conservedregionmayappearinall threeframesduetoframeshifts,sotheofallthreechannelssometimesmayneedtoextract similarpatterns.Thatiswhysomeofthe3-framelogosshowsimilarpatternsacrossdifferent frames. Figure4.17presentedtheactivationvaluedistributionofconvolutionalforthesixsub- subfamilieswediscussedbefore.Itclearlyshowsthatdifferentproteinfamilieswillactivatedif- ferentconvolutional 109 Figure4.11:Mostactivatedconvolutional(index:1622)forLatrophilin. Figure4.12:Mostactivatedconvolutional(index:1877)forCannabinoid. 4.4Discussion Inthiswork,weadoptedthethresholdonsoftmaxandOutlierExposuretodetectrelevanttarget readsfromotherrandomreads.Previousresearch[3,65,91]usuallysolvethechallengebyusing abinaryframework.However,foragivensequence,weneedtorunthebinary modelforeachpossibletarget.Instead,ourmodelscalesbetter 110 Figure4.13:Mostactivatedconvolutional(index:1689)forPlatelet. Figure4.14:Mostactivatedconvolutional(index:1382)forProstacyclin. withalargenumberofcandidatetargets. Weaddressedthechallengeofpredictionofproteindomainswithasupervisedlearningmethod, whichneedsavastamountoflabeleddata.However,itischallengingtoobtainalargenumberof 111 Figure4.15:Mostactivatedconvolutional(index:1033)forCholecystokinin. Figure4.16:Mostactivatedconvolutional(index:1776)forEMR1. labeledreadsfromthird-generationsequencing.Inthiswork,weshowthatdeeplearningmethod achievesacceptableresults,evenwhentrainingonsimulateddataandtestingonrealdata.Webe- 112 Figure4.17:Thedistributionsofsumofacivationvaluesofconvolutionalunits.Xaxisisthe indexoftheconvolutioninDeepFram.Yaxisisthesumofactivationvaluesofthegiven convolutional 113 lievetheperformanceofourmethodcanbefurtherimprovedifmorelabeledrealthird-generation readsavailable. 4.5Conclusion Inthiswork,wedevelopedaproteindomainpredictiontoolforthird-generationsequencingdata. Byincorporating3-frameencodingandmultiple-layerconvolutionalourCNNmodelcan directlypredicttheproteinfamiliesoflongerroneousread.Ourexperimentalresultshaveshown thatforthedatasetfromthird-generationsequencingtechnologies,ourprogramcandetectrele- vantproteindomainfromotherrandomDNAreads.Beingabletopredictproteindomainfrom asinglereadfromthird-generationsequencingdirectlyisessentialforapplicationsliketran- scriptomicandmetagenomicsequencing.DeepFrameprovidesacomplementarytooltocur- rentthird-generationsequenceanalysispipelines,whichusuallyrequirehighcoverageforer- rorcorrection.Thesourcecodeandthemodelstrainedandareavailableat https://github.com/strideradu/DeepFrame. 114 Chapter5 Conclusionsandfutureworks Inthisdissertation,Iintroducedthedevelopmentofthird-generationsequencingtechnologies andthepossibleapplicationofPacBioandNanoporereads.Asmoreandmorethird-generation sequencingdatasetaccumulating,especiallymetagenomicdataandtranscriptomicdata,theanal- ysisoflongnoisyreadsbecameachallengingproblem.First,traditionalsequenceanalysistools cannothandlehighinsertionanddeletionrates,leadingtotheshortannotationthatcannotbe distinguishedfromrandomsequences.Second,applicationslikemetagenomicsequencingmay containmanycloseproteinsequencesthatarenoteasytodistinguish.Third,althoughsomealgo- rithmshavebeendevelopedforPacBioorNanoporereads,thereisstillagreatpotentialtoimprove theperformanceandefyofthesequenceanalysisalgorithmforthird-generationreads.To addressthesechallenges,Iproposedasetofalgorithmsinthisdissertation. InChapter2,IintroducedGroupK,whichwasdesignedforthehighsensitivityoverlapdetec- tionoflongnoisyreads.GroupKincorporatedthegroupedhitscriteriawhichhasbeensuccessfully appliedtoremotehomologydetectionandachievesbettersensitivitywhencomparedwithother overlapdetectiontoolsforthird-generationsequencingreads. InChapter3,IintroducedFrame-Pro,whichwasdesignedforthehomologysearchofPacBio reads.Itcorrectssequencingerrorsandalsooutputsthealignmentsofthecorrectedse- quencesagainstcharacterizedproteinfamiliessimultaneously.Comparedwithhomologysearch onerrorcorrectionreads,Frame-Proenablesmoresensitivehomologysearchandcorrectsmore 115 sequencingerrors. InChapter4,IintroduceDeepFrame,whichwasdesignedforproteindomainpredictionand detectiononasinglelongnoisyread.DeepFrameisbasedonthedeepconvolutionalneuralnet- workswithsoftmaxthresholdandOutlierExposure.ItoutperformsHMMERonboth accuracyanddetectionF1scorewithfasterexecutiontimeusingGPU. Thereareseveraldirectionsthatthepreviousstudiescanbeimproved: InGroupK,thegroupseedmatchingstepcurrentlyisbasedonhashtableimplementation fromYASS[108].ThisstepisthebottleneckoftheGroupKoverlapdetectionpipeline.In thefuture,wecanimplementthegroupmatchingstepusingnew k -mermethodto improvetherunningtimeefytobecomparablewithotherfasteroverlapdetection tools.Wecanalsoinvestinthestepstoallowlessfalse-positiveoverlappairstoenter thedownstreamsteps. Frame-ProiscomputationallymoreexpensivethanHMMERaswedonotadoptanyaccel- erationapproach.WecanreducetherunningtimeofFrame-Probypruningthedynamicpro- grammingmatrix.WecanalsoincorporatethefastViterbialgorithmasastageto outimpossibleproteindomaincandidates.Also,Frame-Proiscurrentlyimplemented withPython.TheexecutiontimeofFrame-Procanbegreatlyreducedifweimplement itusingC++,orotheracceleratedmachinelearninglibrarieslikeTensorFlow[1],orPy- Torch[119]. InDeepFrame,weimplementedtwoconvolutionallayerneuralnetworks.Inthefuture,we maytestamorecomplexmodelwithfideeperflconvolutionalneuralnetworks.However, usuallyfideeperflneuralnetworksarenoteasytooptimize,sowemayneedusetechniques likeresidueblocks,asusedinResNet[62].Anothercandidateistransformerstructure[146], 116 whichalreadyproveditssuccessonnaturallanguageprocessing[31].Wemayalsoadda steptoimprovetheefyofouralgorithmfurther,especiallywhenonlyCPU available.Besides,currentlyweusemulti-classationframeworkthatdirectlyclassi- onthesub-subfamilieslevel.Wemayimplementahierarchicalframework thatclassifyondifferentproteinlevelslikefamilies,sub-families,andsub-subfamiliessi- multaneously. 117 BIBLIOGRAPHY 118 BIBLIOGRAPHY[1]M.Abadi,A.Agarwal,P.Barham,E.Brevdo,Z.Chen,C.Citro,G.S.Corrado,A.Davis, J.Dean,M.Devin,S.Ghemawat,I.Goodfellow,A.Harp,G.Irving,M.Isard,Y.Jia, R.Jozefowicz,L.Kaiser,M.Kudlur,J.Levenberg,D.Mané,R.Monga,S.Moore,D.Mur- ray,C.Olah,M.Schuster,J.Shlens,B.Steiner,I.Sutskever,K.Talwar,P.Tucker,V.Van- houcke,V.Vasudevan,F.Viégas,O.Vinyals,P.Warden,M.Wattenberg,M.Wicke,Y.Yu, andX.Zheng.TensorFlow:Large-scalemachinelearningonheterogeneoussystems,2015. Softwareavailablefromw.org. S.Aki,H.Kuboki,andK.Hirano.Ondiscretedistributionsoforderk.AnnalsoftheInstituteofStatisticalMathematics,36(1):431Œ440,1984.B.Alipanahi,A.Delong,M.T.Weirauch,andB.J.Frey. Predictingthesequencespeci-ofdna-andrna-bindingproteinsbydeeplearning.Naturebiotechnology,33(8):831,2015.S.F.Altschul,W.Gish,W.Miller,E.W.Myers,andD.J.Lipman.Basiclocalalignmentsearchtool.Journalofmolecularbiology,215(3):403Œ410,1990.I.AntonovandM.Borodovsky. Genetack:frameshiftinprotein-codingse-quencesbytheviterbialgorithm.Journalofbioinformaticsandcomputationalbiology,8(03):535Œ551,2010.AppleInc.Performingconvolutionoperations.,2018.Sep,2016.S.Ardui,A.Ameur,J.R.Vermeesch,andM.S.Hestand.Singlemoleculereal-time(smrt)sequencingcomesofage:applicationsandutilitiesformedicaldiagnostics.Nucleicacidsresearch,46(5):2159Œ2168,2018.K.F.Au,J.G.Underwood,L.Lee,andW.H.Wong.Improvingpacbiolongreadaccuracybyshortreadalignment.PLoSOne,7(10):e46679,2012.L.Baum.Aninequalityandassociatedmaximizationtechniqueinstatisticalestimationofprobabilisticfunctionsofamarkovprocess.Inequalities,3:1Œ8,1972.A.BendaleandT.E.Boult.Towardsopensetdeepnetworks.InProceedingsoftheIEEEconferenceoncomputervisionandpatternrecognition,pages1563Œ1572,2016.G.Bensonetal.Tandemrepeatsaprogramtoanalyzednasequences.Nucleicacidsresearch,27(2):573Œ580,1999.119[12]K.Berlin,S.Koren,C.-S.Chin,J.P.Drake,J.M.Landolin,andA.M.Phillippy.Assem- blinglargegenomeswithsingle-moleculesequencingandlocality-sensitivehashing. Nature biotechnology ,33(6):623Œ630,2015. [13]E.Birney,M.Clamp,andR.Durbin.Genewiseandgenomewise. Genomeresearch , 14(5):988Œ995,2004. [14]M.BorodovskyandJ.McIninch.Genmark:parallelgenerecognitionforbothdnastrands. Computers&chemistry ,17(2):123Œ133,1993. [15]D.Branton,D.W.Deamer,A.Marziali,H.Bayley,S.A.Benner,T.Butler,M.DiVentra, S.Garaj,A.Hibbs,X.Huang,etal.Thepotentialandchallengesofnanoporesequencing. Naturebiotechnology ,26(10):1146Œ1153,2008. [16]A.Z.Broder.Identifyingandnear-duplicatedocuments.In AnnualSymposiumon CombinatorialPatternMatching ,pages1Œ10.Springer,2000. [17]N.P.Brown,C.Sander,andP.Bork.Frame:detectionofgenomicsequencingerrors. Bioinformatics(Oxford,England) ,14(4):367Œ371,1998. [18]J.Buhler,U.Keich,andY.Sun.DesigningseedsforsimilaritysearchingenomicDNA. JournalofComputerandSystemSciences ,70(3):342Œ363,2005. [19]A.Busia,G.E.Dahl,C.Fannjiang,D.H.Alexander,E.Dorfman,R.Poplin,C.Y.McLean, P.-C.Chang,andM.DePristo.Adeeplearningapproachtopatternrecognitionforshort dnasequences. bioRxiv ,page353474,2019. [20]M.J.Chaisson,J.Huddleston,M.Y.Dennis,P.H.Sudmant,M.Malig,F.Hormozdiari, F.Antonacci,U.Surti,R.Sandstrom,M.Boitano,etal.Resolvingthecomplexityofthe humangenomeusingsingle-moleculesequencing. Nature ,517(7536):608Œ611,2015. [21]M.J.ChaissonandG.Tesler.Mappingsinglemoleculesequencingreadsusingbasiclocal alignmentwithsuccessive(blasr):applicationandtheory. BMCbioinformatics , 13(1):238,2012. [22]W.I.ChangandE.Lawler.Sublinearexpectedtimeapproximatestringmatchingandbio- logicalapplications. Algorithmica ,12:327Œ44,1994. [23]T.Charalampous,G.L.Kay,H.Richardson,A.Aydin,R.Baldan,C.Jeanes,D.Rae, S.Grundy,D.J.Turner,J.Wain,etal.Nanoporemetagenomicsenablesrapidclinical diagnosisofbacteriallowerrespiratoryinfection. NatureBiotechnology ,page1,2019. [24]C.-S.Chin,D.H.Alexander,P.Marks,A.A.Klammer,J.Drake,C.Heiner,A.Clum, A.Copeland,J.Huddleston,E.E.Eichler,etal.Nonhybrid,microbialgenome assembliesfromlong-readsmrtsequencingdata. Naturemethods ,10(6):563Œ569,2013. 120 K.Cho,B.VanMerriënboer,C.Gulcehre,D.Bahdanau,F.Bougares,H.Schwenk,andBengio.Learningphraserepresentationsusingrnnencoder-decoderforstatisticalma-chinetranslation.arXivpreprintarXiv:1406.1078,2014.J.Chu,H.Mohamadi,R.L.Warren,C.Yang,andI.Birol.Innovationsandchallengesinde-tectinglongreadoverlaps:anevaluationofthestate-of-the-art.Bioinformatics,33(8):1261Œ1270,2016.D.M.Church,V.A.Schneider,T.Graves,K.Auger,F.Cunningham,N.Bouk,H.-C.Chen,Agarwala,W.M.McLaren,G.R.Ritchie,etal.Modernizingreferencegenomeassem-blies.PLoSbiology,9(7):e1001091,2011.J.Clarke,H.-C.Wu,L.Jayasinghe,A.Patel,S.Reid,andH.Bayley. Continuousbaseiden-forsingle-moleculenanoporednasequencing.Naturenanotechnology,4(4):265,2009.P.CompeauandP.Pevzner.Bioinformaticsalgorithms:anactivelearningapproach,vol-ume1.ActiveLearningPublishers,LaJolla,California,USA,2015.M.N.Davies,A.Secker,A.A.Freitas,M.Mendao,J.Timmis,andD.R.Flower.Onthehierarchicalofgprotein-coupledreceptors.Bioinformatics,23(23):3113Œ3118,2007.J.Devlin,M.-W. Chang,K.Lee,andK.Toutanova.Bert:Pre-trainingofdeepbidirectionaltransformersforlanguageunderstanding.arXivpreprintarXiv:1810.04805,2018.A.Döring,D.Weese,T.Rausch,andK.Reinert.SeqAnangenericc++libraryforsequenceanalysis.BMCbioinformatics,9(1):11,2008.N.Du,J.Chen,andY.Sun.Improvingthesensitivityoflongreadoverlapdetectionusinggroupedshortk-mermatches.BMCgenomics,20(2):190,2019.N.DuandY.Sun.ImprovehomologysearchsensitivityofPacBiodatabycorrectingframeshifts.Bioinformatics,32(17):i529Œi537,2016.R.Durbin,S.R.Eddy, A.Krogh,andG.Mitchison.Biologicalsequenceanalysis:prob-abilisticmodelsofproteinsandnucleicacids.CambridgeUniversityPress,Cambridge,UnitedKingdom,1998.S.R.Eddy. hiddenmarkovmodels.Bioinformatics(Oxford,England),14(9):755Œ763,1998.S.R.Eddy, T.J.Wheeler,andtheHMMERdevelopmentteam.Hmmer3.1b2,2015.S.R.Eddy, T.J.Wheeler,andtheHMMERdevelopmentteam.Hmmer3.2.1.121,2018.June,2018.[39]J.Eid,A.Fehr,J.Gray,K.Luong,J.Lyle,G.Otto,P.Peluso,D.Rank,P.Baybayan, B.Bettman,etal.Real-timednasequencingfromsinglepolymerasemolecules. Science,323(5910):133Œ138,2009. [40]S.El-Gebali,J.Mistry,A.Bateman,S.R.Eddy,A.Luciani,S.C.Potter,M.Qureshi,L.J. Richardson,G.A.Salazar,A.Smart,etal.Thepfamproteinfamiliesdatabasein2019. Nucleicacidsresearch ,47(D1):D427ŒD432,2018. [41]W.Feller. Anintroductiontoprobabilitytheoryanditsapplications ,volume2.JohnWiley &Sons,2008. [42]P.FerraginaandG.Manzini.Opportunisticdatastructureswithapplications.In Foundations ofComputerScience,2000.Proceedings.41stAnnualSymposiumon ,pages390Œ398.IEEE, 2000.[43]A.Fiannaca,L.LaPaglia,M.LaRosa,G.Renda,R.Rizzo,S.Gaglio,A.Urso,etal. Deeplearningmodelsforbacteriataxonomicofmetagenomicdata. BMCbioinformatics,19(7):198,2018. [44]R.D.Finn,A.Bateman,J.Clements,P.Coggill,R.Y.Eberhardt,S.R.Eddy,A.Heger, K.Hetherington,L.Holm,J.Mistry,etal.Pfam:theproteinfamiliesdatabase. Nucleicacidsresearch ,42(D1):D222ŒD230,2013. [45]R.D.Finn,P.Coggill,R.Y.Eberhardt,S.R.Eddy,J.Mistry,A.L.Mitchell,S.C.Potter, M.Punta,M.Qureshi,A.Sangrador-Vegas,etal.Thepfamproteinfamiliesdatabase: towardsamoresustainablefuture. Nucleicacidsresearch ,44(D1):D279ŒD285,2015. [46]K.Fukushima.Neocognitron:Aself-organizingneuralnetworkmodelforamechanismof patternrecognitionunaffectedbyshiftinposition. Biologicalcybernetics ,36(4):193Œ202, 1980.[47]Z.Gao,C.-h.Tseng,Z.Pei,andM.J.Blaser.Molecularanalysisofhumanforearmsuper- skinbacterialbiota. ProceedingsoftheNationalAcademyofSciences ,104(8):2927Œ 2932,2007. [48]F.Giordano,L.Aigrain,M.A.Quail,P.Coupland,J.K.R.M.Davies,G.Tischler, D.K.Jackson,T.M.Keane,J.Li,etal.Denovoyeastgenomeassembliesfromminion, pacbioandmiseqplatforms. reports ,7(1):3935,2017. [49]M.Gîrdea,L.Noé,andG.Kucherov.Back-translationfordiscoveringdistantproteinho- mologies.In InternationalWorkshoponAlgorithmsinBioinformatics ,pages108Œ120. Springer,2009. 122[50]M.Gîrdea,L.Noé,andG.Kucherov.Back-translationfordiscoveringdistantproteinho- mologiesinthepresenceofframeshiftmutations. AlgorithmsforMolecularBiology ,5(1):6, 2010. [51]G.GonnellaandS.Kurtz.Readjoiner:afastandmemoryefstringgraph-based sequenceassembler. BMCbioinformatics ,13(1):82,2012. [52]I.Goodfellow,Y.Bengio,andA.Courville. Deeplearning .MITpress,Cambridge,Mas- sachusetts,UnitedStates,2016. [53]X.GuanandE.C.Uberbacher.Alignmentsofdnaandproteinsequencescontaining frameshifterrors. Bioinformatics ,12(1):31Œ40,1996. [54]D.Gu Algorithmsonstrings,treesandsequences:computerscienceandcomputa- tionalbiology .Cambridgeuniversitypress,1997. [55]T.Hackl,R.Hedrich,J.Schultz,andF.Förster.proovread:large-scalehigh-accuracypacbio correctionthroughiterativeshortreadconsensus. Bioinformatics ,30(21):3004Œ3011,2014. [56]D.H.Haft,B.J.Loftus,D.L.Richardson,F.Yang,J.A.Eisen,I.T.Paulsen,andO.White. Tigrfams:aproteinfamilyresourceforthefunctionalofproteins. Nucleic acidsresearch ,29(1):41Œ43,2001. [57]D.H.Haft,J.D.Selengut,andO.White.Thetigrfamsdatabaseofproteinfamilies. Nucleic acidsresearch ,31(1):371Œ373,2003. [58]E.Haghshenas,F.Hach,S.C.Sahinalp,andC.Chauve.Colormap:Correctinglongreads bymappingshortreads. Bioinformatics ,32(17):i545Œi551,2016. [59]E.Halperin,S.Faigler,andR.Gill-More.Frameplus:aligningdnatoproteinsequences. Bioinformatics ,15(11):867Œ873,1999. [60]K.Hayashi,N.Morooka,Y.Yamamoto,K.Fujita,K.Isono,S.Choi,E.Ohtsubo,T.Baba, B.L.Wanner,H.Mori,etal.Highlyaccurategenomesequencesofescherichiacolik-12 strainsmg1655andw3110. Molecularsystemsbiology ,2(1),2006. [61]K.He,X.Zhang,S.Ren,andJ.Sun.DelvingdeepintoSurpassinghuman- levelperformanceonimagenetIn ProceedingsoftheIEEEinternational conferenceoncomputervision ,pages1026Œ1034,2015. [62]K.He,X.Zhang,S.Ren,andJ.Sun.Deepresiduallearningforimagerecognition.In ProceedingsoftheIEEEconferenceoncomputervisionandpatternrecognition ,pages 770Œ778,2016. [63]D.HendrycksandK.Gimpel.Abaselinefordetectingandout-of-distribution 123 examplesinneuralnetworks. arXivpreprintarXiv:1610.02136 ,2016. [64]D.Hendrycks,M.Mazeika,andT.G.Dietterich.Deepanomalydetectionwithoutlier exposure. arXivpreprintarXiv:1812.04606 ,2018. [65]S.Hochreiter,M.Heusel,andK.Obermayer.Fastmodel-basedproteinhomologydetection withoutalignment. Bioinformatics ,23(14):1728Œ1736,2007. [66]S.HochreiterandJ.Schmidhuber.Longshort-termmemory. Neuralcomputation , 9(8):1735Œ1780,1997. [67]M.Jain,S.Koren,K.H.Miga,J.Quick,A.C.Rand,T.A.Sasani,J.R.Tyson,A.D.Beggs, A.T.Dilthey,I.T.Fiddes,etal.Nanoporesequencingandassemblyofahumangenome withultra-longreads. Naturebiotechnology ,36(4):338,2018. [68]D.Joseph,J.Meidanis,andP.Tiwari.Determiningdnasequencesimilarityusingmaximum independentsetalgorithmsforintervalgraphs.In ScandinavianWorkshoponAlgorithm Theory ,pages326Œ337.Springer,1992. [69]D.JurafskyandJ.H.Martin. SpeechandLanguageProcessing(2ndEdition) .Prentice-Hall, Inc.,UpperSaddleRiver,NJ,USA,2009. [70]J.KärkkäinenandP.Sanders.Simplelinearworksufarrayconstruction.In International ColloquiumonAutomata,Languages,andProgramming ,pages943Œ955.Springer,2003. [71]T.Kasai,G.Lee,H.Arimura,S.Arikawa,andK.Park.Linear-time computationinsufarraysanditsapplications.In AnnualSymposiumonCombinatorial PatternMatching ,pages181Œ192.Springer,2001. [72]K.KatohandD.M.Standley.Mafftmultiplesequencealignmentsoftwareversion7:im- provementsinperformanceandusability. Molecularbiologyandevolution ,30(4):772Œ780, 2013. [73]W.J.Kent.BlatŠtheblast-likealignmenttool. Genomeresearch ,12(4):656Œ664,2002. [74]Y.Kim.Convolutionalneuralnetworksforsentence arXivpreprint arXiv:1408.5882 ,2014. [75]D.P.KingmaandJ.Ba.Adam:Amethodforstochasticoptimization. arXivpreprint arXiv:1412.6980 ,2014. [76]A.Kislyuk,A.Lomsadze,A.L.Lapidus,andM.Borodovsky.Frameshiftdetectionin prokaryoticgenomicsequences. Internationaljournalofbioinformaticsresearchandap- plications ,5(4):458Œ477,2009. 124 [77]M.Kokot,M.andS.Deorowicz.Kmc3:countingandmanipulatingk-merstatis- tics. Bioinformatics ,33(17):2759Œ2761,2017. [78]S.Koren,M.C.Schatz,B.P.Walenz,J.Martin,J.T.Howard,G.Ganapathy,Z.Wang,D.A. Rasko,W.R.McCombie,E.D.Jarvis,etal.Hybriderrorcorrectionanddenovoassembly ofsingle-moleculesequencingreads. Naturebiotechnology ,30(7):693Œ700,2012. [79]S.Koren,B.P.Walenz,K.Berlin,J.R.Miller,N.H.Bergman,andA.M.Phillippy.Canu: scalableandaccuratelong-readassemblyviaadaptivek-merweightingandrepeatsepara- tion. Genomeresearch ,27(5):722Œ736,2017. [80]J.Korlach.Understandingaccuracyinsmrt R sequencing,2013. [81]J.KorlachandS.W.Turner.Zero-modewaveguides.In EncyclopediaofBiophysics ,pages 2793Œ2795.Springer,2013. [82]A.Krogh,M.Brown,I.S.Mian,K.Sjölander,andD.Haussler.Hiddenmarkovmodels incomputationalbiology:Applicationstoproteinmodeling. Journalofmolecularbiology , 235(5):1501Œ1531,1994. [83]B.LangmeadandS.L.Salzberg.Fastgapped-readalignmentwithbowtie2. Naturemeth- ods ,9(4):357Œ359,2012. [84]B.Langmead,C.Trapnell,M.Pop,andS.L.Salzberg.Ultrafastandmemory-ef alignmentofshortdnasequencestothehumangenome. Genomebiology ,10(3):1,2009. [85]Y.LeCun,Y.Bengio,andG.Hinton.Deeplearning. nature ,521(7553):436,2015. [86]Y.LeCun,B.E.Boser,J.S.Denker,D.Henderson,R.E.Howard,W.E.Hubbard,and L.D.Jackel.Handwrittendigitrecognitionwithaback-propagationnetwork.In Advances inneuralinformationprocessingsystems ,pages396Œ404,1990. [87]H.Li.Minimapandminiasm:fastmappinganddenovoassemblyfornoisylongsequences. Bioinformatics ,32(14):2103Œ2110,2016. [88]H.Li.Minimap2:pairwisealignmentfornucleotidesequences. Bioinformatics ,1:7,2018. [89]H.LiandR.Durbin.Fastandaccuratelong-readalignmentwithburrowsŒwheelertrans- form. Bioinformatics ,26(5):589Œ595,2010. [90]H.LiandN.Homer.Asurveyofsequencealignmentalgorithmsfornext-generationse- quencing. inbioinformatics ,11(5):473Œ483,2010. [91]S.Li,J.Chen,andB.Liu.Proteinremotehomologydetectionbasedonbidirectionallong short-termmemory. BMCbioinformatics ,18(1):443,2017. 125 [92]L.LoConte,B.Ailey,T.J.Hubbard,S.E.Brenner,A.G.Murzin,andC.Chothia.Scop:a structuralofproteinsdatabase. Nucleicacidsresearch ,28(1):257Œ259,2000. [93]N.J.Loman,J.Quick,andJ.T.Simpson.Acompletebacterialgenomeassembleddenovo usingonlynanoporesequencingdata. Naturemethods ,2015. [94]B.MaandM.Li.Onthecomplexityofthespacedseeds. JournalofComputerandSystem Sciences ,73(7):1024Œ1034,2007. [95]B.Ma,J.Tromp,andM.Li.PatternHunter:fasterandmoresensitivehomologysearch. Bioinformatics ,18(3):440Œ445,2002. [96]L.v.d.MaatenandG.Hinton.Visualizingdatausingt-sne. Journalofmachinelearning research ,9(Nov):2579Œ2605,2008. [97]U.ManberandG.Myers.Sufarrays:anewmethodforon-linestringsearches. siam JournalonComputing ,22(5):935Œ948,1993. [98]F.Meyer,R.Overbeek,andA.Rodriguez.Figfams:yetanothersetofproteinfamilies. Nucleicacidsresearch ,37(20):6643Œ6654,2009. [99]G.Miclotte,M.Heydari,P.Demeester,S.Rombauts,Y.VandePeer,P.Audenaert,and J.Fostier.Jabba:hybriderrorcorrectionforlongsequencingreads. AlgorithmsforMolec- ularBiology ,11(1):1,2016. [100]E.W.Myers.Thefragmentassemblystringgraph. Bioinformatics ,21(suppl_2):ii79Œii85, 2005. [101]E.W.Myers,G.G.Sutton,A.L.Delcher,I.M.Dew,D.P.Fasulo,M.J.Flanigan,S.A. Kravitz,C.M.Mobarry,K.H.Reinert,K.A.Remington,etal.Awhole-genomeassembly ofdrosophila. Science ,287(5461):2196Œ2204,2000. [102]G.Myers.Eflocalalignmentdiscoveryamongstnoisylongreads.In International WorkshoponAlgorithmsinBioinformatics ,pages52Œ67.Springer,2014. [103]N.NagarajanandM.Pop.Sequenceassembly NatureReviewsGenetics , 14(3):157Œ167,2013. [104]V.NairandG.E.Hinton.linearunitsimproverestrictedboltzmannmachines. In Proceedingsofthe27thinternationalconferenceonmachinelearning(ICML-10) ,pages 807Œ814,2010. [105]S.B.NeedlemanandC.D.Wunsch.Ageneralmethodapplicabletothesearchforsimilar- itiesintheaminoacidsequenceoftwoproteins. Journalofmolecularbiology ,48(3):443Œ 453,1970. 126 F.NicolasandE.Rivals.Hardnessofoptimalspacedseeddesign.JournalofComputerandSystemSciences,74(5):831Œ849,2008.S.K.Nielsen,T.L.Koch,F.Hauser,A.Garm,andC.J.Grimmelikhuijzen.Denovotranscriptomeassemblyofthecubomedusatripedaliacystophora,includingtheanalysisofasetofgenesinvolvedinpeptidergicneurotransmission.BMCgenomics,20(1):175,2019.L.NoéandG.Kucherov. YASS:SimilaritysearchinDNAsequences.PhDthesis,INRIA,2003.L.NoéandG.Kucherov. ImprovedhitcriteriaforDNAlocalalignment.BMCbioinfor-matics,5(1):149,2004.L.NoéandG.Kucherov. YASS:enhancingthesensitivityofDNAsimilaritysearch.Nucleicacidsresearch,33(suppl_2):W540ŒW543,2005.NVIDIACorporation.Apex,2019.J.Oh,A.L.Byrd,C.Deming,S.Conlan,B.Barnabas,R.Blakesley, G.Bouffard,S.Brooks,Coleman,M.Dekhtyar,etal.Biogeographyandindividualityshapefunctioninthehumanskinmetagenome.Nature,514(7520):59,2014.C.Olah.Understandinglstmnetworks.,2018.Aug,2015.Y.Ono,K.Asai,andM.Hamada.Pbsim:PacbioreadssimulatorŠtowardaccurategenomeassembly. Bioinformatics,29(1):119Œ121,2013.[115]PBiosciences.Ecolik12mg1655resequencing. Jul22,2013.[116]PBiosciences.H.sapiens10xsequencecoveragewithpacbiodata,2014. [117]PBiosciences.Smrtanalysis. 2014.Oct15,2014.[118]Piences.E.colibacterialassemblyprimaryanalysis(instrumentoutput)data, 2016.AccessedMay13,2016. [119]A.Paszke,S.Gross,S.Chintala,G.Chanan,E.Yang,Z.DeVito,Z.Lin,A.Desmaison, L.Antiga,andA.Lerer.AutomaticdifferentiationinPyTorch.In NIPSAutodiffWorkshop ,2017.127[120]M.PellegriniandT.O.Yeates.Searchingforframeshiftevolutionaryrelationshipsbetween proteinsequencefamilies. Proteins:Structure,Function,andBioinformatics ,37(2):278Œ 283,1999. [121]H.Peltola,H.Söderlund,andE.Ukkonen.Algorithmsforthesearchofaminoacidpatterns innucleicacidsequences. Nucleicacidsresearch ,14(1):99Œ107,1986. [122]R.Poplin,P.-C.Chang,D.Alexander,S.Schwartz,T.Colthurst,A.Ku,D.Newburger, J.Dijamco,N.Nguyen,P.T.Afshar,etal.Auniversalsnpandsmall-indelvariantcaller usingdeepneuralnetworks. Naturebiotechnology ,36(10):983,2018. [123]E.Prestat,M.M.David,J.Hultman,N.Ta¸s,R.Lamendella,J.Dvornik,R.Mackelprang, D.D.Myrold,A.Jumpponen,S.G.Tringe,etal.Foam(functionalontologyassignmentsfor metagenomes):ahiddenmarkovmodel(hmm)databasewithenvironmentalfocus. Nucleic acidsresearch ,42(19):e145Œe145,2014. [124]M.A.Quail,M.Smith,P.Coupland,T.D.Otto,S.R.Harris,T.R.Connor,A.Bertoni,H.P. Swerdlow,andY.Gu.Ataleofthreenextgenerationsequencingplatforms:comparisonof iontorrent,biosciencesandilluminamiseqsequencers. BMCgenomics ,13(1):341, 2012. [125]S.RajasekaranandM.Nicolae.Anelegantalgorithmfortheconstructionofsufarrays. JournalofDiscreteAlgorithms ,27:21Œ28,2014. [126]D.A.Rasko,D.R.Webster,J.W.Sahl,A.Bashir,N.Boisen,F.Scheutz,E.E.Paxinos, R.Sebra,C.-S.Chin,D.Iliopoulos,etal.Originsofthee.colistraincausinganoutbreakof hemolyticŒuremicsyndromeingermany. NewEnglandJournalofMedicine ,365(8):709Œ 717,2011. [127]J.Ren,K.Song,C.Deng,N.A.Ahlgren,J.A.Fuhrman,Y.Li,X.Xie,andF.Sun.Iden- tifyingvirusesfrommetagenomicdatabydeeplearning. arXivpreprintarXiv:1806.07810 , 2018. [128]A.RhoadsandK.F.Au.Pacbiosequencinganditsapplications. Genomics,proteomics& bioinformatics ,13(5):278Œ289,2015. [129]D.E.Rumelhart,G.E.Hinton,andR.J.Williams.Learningrepresentationsbyback- propagatingerrors. nature ,323(6088):533,1986. [130]L.SalmelaandE.Rivals.Lordec:accurateandeflongreaderrorcorrection. Bioin- formatics ,pagebtu538,2014. [131]T.Schiex,J.Gouzy,A.Moisan,andY.deOliveira.Framed:axibleprogramforquality checkandgenepredictioninprokaryoticgenomesandnoisymaturedeukaryoticsequences. Nucleicacidsresearch ,31(13):3738Œ3741,2003. 128 [132]S.Schwartz,W.J.Kent,A.Smit,Z.Zhang,R.Baertsch,R.C.Hardison,D.Haussler,and W.Miller.HumanŒmousealignmentswithBLASTZ. Genomeresearch ,13(1):103Œ107, 2003. [133]S.Seo,M.Oh,Y.Park,andS.Kim.Deepfam:deeplearningbasedalignment-freemethod forproteinfamilymodelingandprediction. Bioinformatics ,34(13):i254Œi262,2018. [134]D.Sharon,H.Tilgner,F.Grubert,andM.Snyder.Asingle-moleculelong-readsurveyof thehumantranscriptome. Naturebiotechnology ,31(11):1009,2013. [135]J.T.SimpsonandR.Durbin.Efconstructionofanassemblystringgraphusingthe FM-index. Bioinformatics ,26(12):i367Œi373,2010. [136]I.Sovi ´ c,K.vi ´ c,K.Skala,andM.−iki ´ c.Evaluationofhybridandnon-hybridmeth- odsfordenovoassemblyofnanoporereads. Bioinformatics ,32(17):2582Œ2589,2016. [137]I.Sovi ´ c,M.−iki ´ c,A.Wilm,S.N.Fenlon,S.Chen,andN.Nagarajan.Fastandsensitive mappingofnanoporesequencingreadswithgraphmap. Naturecommunications ,7,2016. [138]N.Srivastava,G.Hinton,A.Krizhevsky,I.Sutskever,andR.Salakhutdinov.Dropout:a simplewaytopreventneuralnetworksfromov Thejournalofmachinelearning research ,15(1):1929Œ1958,2014. [139]Y.SunandJ.Buhler.DesigningmultiplesimultaneousseedsforDNAsimilaritysearch. JournalofComputationalBiology ,12(6):847Œ861,2005. [140]H.Tilgner,F.Grubert,D.Sharon,andM.P.Snyder.apersonal,and single-moleculelong-readtranscriptome. ProceedingsoftheNationalAcademyofSciences , 111(27):9869Œ9874,2014. [141]B.J.Tindall,J.Sikorski,S.Lucas,E.Goltsman,A.Copeland,T.GlavinaDelRio,M.Nolan, H.Tice,J.F.Cheng,C.Han,S.Pitluck,K.Liolios,N.Ivanova,K.Mavromatis,G.Ovchin- nikova,A.Pati,R.Faehnrich,L.Goodwin,A.Chen,K.Palaniappan,M.Land,L.Hauser, Y.J.Chang,C.D.Jeffries,M.Rohde,M.Goeker,T.Woyke,J.Bristow,J.A.Eisen, V.Markowitz,P.Hugenholtz,N.C.Kyrpides,H.P.Klenk,andA.Lapidus.Complete genomesequenceofmeiothermusrubertypestrain(21). Standardsingenomicsciences , 3(1):26Œ36,2010. [142]E.Trost,S.Götker,J.Schneider,S.Schneiker-Bekel,R.Szczepanowski,A.Tilker,P.Vieho- ever,W.Arnold,T.Bekel,J.Blom,etal.Completegenomesequenceandlifestyleofblack- pigmentedcorynebacteriumaurimucosumatcc700975(formerlyc.nigricanscn-1)isolated fromavaginalswabofawomanwithspontaneousabortion. BMCgenomics ,11(1):91, 2010. [143]B.Trzaskowski,D.Latek,S.Yuan,U.Ghoshdastider,A.Debinski,andS.Filipek.Action 129 ofmolecularswitchesingpcrs-theoreticalandexperimentalstudies. Currentmedicinal chemistry ,19(8):1090Œ1109,2012. [144]Y.-C.Tsai,S.Conlan,C.Deming,J.A.Segre,H.H.Kong,J.Korlach,J.Oh,N.C.S.Pro- gram,etal.Resolvingthecomplexityofhumanskinmetagenomesusingsingle-molecule sequencing. MBio ,7(1):e01948Œ15,2016. [145]R.VanBuren,D.Bryant,P.P.Edger,H.Tang,D.Burgess,D.Challabathula,K.Spittle, R.Hall,J.Gu,E.Lyons,etal.Single-moleculesequencingofthedesiccation-tolerantgrass oropetiumthomaeum. Nature ,527(7579):508,2015. [146]A.Vaswani,N.Shazeer,N.Parmar,J.Uszkoreit,L.Jones,A.N.Gomez,Kaiser,and I.Polosukhin.Attentionisallyouneed.In Advancesinneuralinformationprocessing systems ,pages5998Œ6008,2017. [147]S.VingaandJ.Almeida.Alignment-freesequencecomparisonŠareview. Bioinformatics , 19(4):513Œ523,2003. [148]A.Viterbi.Errorboundsforconvolutionalcodesandanasymptoticallyoptimumdecoding algorithm. IEEEtransactionsonInformationTheory ,13(2):260Œ269,1967. [149]M.Wang,L.Tu,D.Yuan,D.Zhu,C.Shen,J.Li,F.Liu,L.Pei,P.Wang,G.Zhao,etal. Referencegenomesequencesoftwocultivatedallotetraploidcottons,gossypiumhirsutum andgossypiumbarbadense. Naturegenetics ,51(2):224,2019. [150]Q.Wang,J.F.Quensen,J.A.Fish,T.K.Lee,Y.Sun,J.M.Tiedje,andJ.R.Cole.Ecological patternsofnifhgenesinfourterrestrialclimaticzonesexploredwithtargetedmetagenomics usingframebot,anewinformaticstool. MBio ,4(5):e00592Œ13,2013. [151]C.-L.Xiao,S.Zhu,M.He,D.Chen,Q.Zhang,Y.Chen,G.Yu,J.Liu,S.-Q.Xie,F.Luo, etal.N6-methyladeninednainthehumangenome. Molecularcell ,71(2):306Œ 318,2018. [152]E.M.ZdobnovandR.Apweiler.InterproscanŒanintegrationplatformforthesignature- recognitionmethodsininterpro. Bioinformatics ,17(9):847Œ848,2001. [153]H.Zeng,M.D.Edwards,G.Liu,andD.K.Gifford.Convolutionalneuralnetworkarchi- tecturesforpredictingdnaŒproteinbinding. Bioinformatics ,32(12):i121Œi127,2016. [154]Y.ZhangandY.Sun.Hmm-frame:accurateproteindomainformetagenomic sequencescontainingframeshifterrors. BMCbioinformatics ,12(1):198,2011. [155]Z.Zhang,W.R.Pearson,andW.Miller.Aligningadnasequencewithaproteinsequence. JournalofComputationalBiology ,4(3):339Œ349,1997. 130