HIDDENMARKOVMODEL-BASEDHOMOLOGYSEARCHANDGENEPREDICTIONINNGSERAByPrapapornTecha-angkoonADISSERTATIONSubmittedtoMichiganStateUniversityinpartialoftherequirementsforthedegreeofComputerScienceŒDoctorofPhilosophy2017ABSTRACTHIDDENMARKOVMODEL-BASEDHOMOLOGYSEARCHANDGENEPREDICTIONINNGSERAByPrapapornTecha-angkoonTheexponentialcostreductionofnext-generationsequencing(NGS)enabledresearcherstosequencealargenumberoforganismsinordertoanswervariousquestionsinbiology,ecology,health,etc.Fornewlysequencedgenomes,genepredictionandhomologysearchagainstcharac-terizedproteinsequencedatabasesaretwofundamentaltasksforannotatingfunctionalelementsinthegenomes.Themaingoalofgenepredictionistoidentifythegenelocusandtheirstructures.AsthereisaccumulatingevidenceshowingimportantfunctionsofRNAs(ncRNAs),comprehen-sivegenepredictionshouldincludebothprotein-codinggenesandncRNAs.Homologysearchagainstproteinsequencescanaidoffunctionalelementsingenomes.Althoughthereareintensiveresearchintheofgeneprediction,ncRNAsearch,andhomologysearch,therearestillunaddressedchallenges.Inthisdissertation,Imadecontributionsinthesethreeareas.Forgeneprediction,IdesignedanHMM-basedabinitiogenepredictiontoolthatconsidersG+Cgradientingrassgenomes.Forhomologysearch,IdesignedamethodthatcanalignshortreadsagainstproteinfamiliesusingHMMs.ForncRNAsearch,IdesignedanncRNAalignmenttoolthatcanalignhighlystructuredncRNAsusingonlysequencesimilarity.BelowIsummarizemycontributions.Despitedecadesofresearchaboutgeneprediction,existinggenepredictiontoolsarenotcare-fullydesignedtodealwithvariantG+Ccontentand50-30changingpatternsinsidecodingregions.Thus,thesetoolscanmissgeneswithpositiveornegativeG+Cgradientingrassgenomessuchasrice,maize,sorghum,etc.IimplementedatoolnamedAUGUSTUS-GCthataccountsfor50-30G+Cgradient.Ourtoolcanaccuratelypredictprotein-codinggenesinplantgenomesespeciallygrassgenomes.Alargenumberofsequencingprojectsproducedshortreadsfromthewholegenomesortran-scriptomicdata.Idesignedashortreadshomologysearchtoolthatemployspaired-endreadstoimprovehomologysearchsensitivity.Theexperimentalresultsshowthatourtoolcanachievesig-bettersensitivityandaccuracyinaligningshortreadsthatarepartofremotehomologs.DespitetheextensivestudiesofncRNAsearch,theexistingtoolsthatheavilydependonthesecondarystructureinhomologysearchcannotefhandleRNA-seqdatathatisaccumu-latingrapidly.ItwillbeidealifwecanhaveafasterncRNAhomologysearchtoolwithsimilaraccuracyasthoseadoptingsecondarystructure.IimplementedanaccuratencRNAalignmenttoolcalledglu-RNAthatcanachievesimilaraccuracytostructuralalignmenttoolswhilekeepingthesamerunningtimecomplexityassequencealignmenttools.Theexperimentalresultsdemonstratethatourtoolcanachievemoreaccuratealignmentsthanthepopularsequencealignmenttoolsandawell-knownstructuralalignmentprogram.Tomybelovedfamily,mytirelessteachers,myencouragingfriends,andtheinspiredscientistsivACKNOWLEDGMENTSThecompletionofthisdissertationwouldnothavebeenpossiblewithoutthesupport,encourage-ment,andguidanceofseveralpeople.Iwouldliketoexpressmysinceregratitudetoallofthem.Firstofall,Iwouldliketoexpressthedeepestappreciationandthankstomyadvisor,Dr.YanniSun,foracceptingmeintoherresearchgroupandallherexcellentsupport,valuableguidance,understanding,patience,andencouragementthatshegavemethroughoutmyresearchanddisser-tationworkintheofComputationalBiologyandBioinformatics.Itwasapleasureworkingwithherandlearningfromherresearchexpertise.IamalsogratefulmyPh.D.guidancecommitteemembers:Dr.Shin-HanShiu,Dr.EricTorng,andDr.JinChen,fortheirinterestinmyworkandtheirprecioustime.Theygavemealotofimportantsuggestions,helpfuldiscussions,andinformationthroughoutmyPh.D.program.IalsogratefullyacknowledgeDr.C.TitusBrown,Dr.KevinChilds,andDr.TiffanyLiufortheirvaluabletime,usefulsuggestions,andallresearchsupport.IwouldalsoliketothankallthefacultyandstaffmembersintheDepartmentofComputerScienceandEngineeringwhohavetaughtmeorassistedme.Withouttheirsupport,myworkwouldhaveundoubtedlybeenmuchmorechallenging.Ialsowanttothankmylabmates,Dr.YuanZhang,Dr.RujiraAchawanantakun,Dr.JikaiLei,Dr.ChengYuan,JiaoChen,andNanDuforprovidingafriendlyworkingenvironment,discussingandcooperatingonvariousresearchtopics.IwouldalsoliketothankYuanchunZhao,XiaoqingZhu,KeShen,Dr.AdityaK.Sood,Dr.SurangkanaRawangyot,WorawutSrisukkham,ChotiratRattanasinchai,SuttipunSungsuwan,othergraduatestudentsinthedepartment,andmyfriendsinThailandfortheirsupportandhelpinmydailylifeandusefuldiscussionsthroughouttheseyears.IwouldalsoliketoacknowledgePanjasawatwongfamilyandTecha-angkoonfamilyespeciallyvmybelovedparentsfortheirlove,care,andsupport.Ithankyoufortakingcareofmylittlesonanddaughter.Ithankyoufromthebottomofmyheart.Iwouldalsoliketogivemyspecialthankstomyhusband,KritPanjasawatwong,forpatientlove,hardwork,endlesshelp,supportanden-couragementthroughoutmylifeandmyPh.D.program.Ithankyouforworkinghardandraisingourlittlesonanddaughter.Ialsothankmylittlesonanddaughter,DanuphatPanjasawatwongandChanyaphachPanjasawatwong,formakingmylifemorecompleteandmeaningfulandbeingmygreatmotivation.Iloveyouwithallmyheart.Lastbutnotleast.IgratefullyacknowledgetheRoyalThaiGovernment,FacultyofScienceatChiangMaiUniversity,NSFCAREERGrantDBI-0953738,NSFIOS-1126998,Dr.YanniSun,andmyhusbandforsupportthroughouttheseyears.IwouldliketothankthestaffattheOfOfEducationAffairs(OEADC),allfacultyandstaffmembersatFacultyofScience,ChiangMaiUniversityforallhelpandsupportduringmygraduatestudy.viTABLEOFCONTENTSLISTOFTABLES.......................................xLISTOFFIGURES......................................xiiChapter1Introduction..................................11.1Geneprediction....................................11.1.1Prokaryoticgeneprediction..........................31.1.2Eukaryoticgeneprediction..........................31.2Homologysearch...................................41.2.1Thechallengesofhomologysearchinlarge-scalegenomicdata......51.2.1.1Next-generationsequencingtechnology..............51.2.1.2Thechallengesofhomologysearchfornext-generationsequenc-ingdata...............................71.2.1.3Non-codingRNAhomologysearchinlarge-scalegenomicdata.81.3Contributions.....................................10Chapter2aReviewofMarkovChainandHiddenMarkovModels.........122.1HiddenMarkovModels(HMMs)...........................122.1.1MarkovChains................................122.1.2HiddenMarkovmodels............................152.1.2.1Themostprobablestatepath:ViterbiAlgorithm.........152.1.2.2TheForwardAlgorithm......................172.1.2.3TheBackwardAlgorithm.....................18Chapter3AUGUSTUS-GC:aGenePredictionModelaccountingfor50-30G+CGradient....................................193.1Introduction......................................193.2Method........................................213.2.1TheHiddenMarkovModelofAUGUSTUS-GC...............213.3ExperimentalResults.................................273.3.1GenePredictioninArabidopsisThaliana..................293.3.1.1TrainingtheHMMmodel.....................293.3.1.2Performancecomparisonbetweendifferentgenepredictiontools303.3.2GenePredictioninOryzaSativa.......................333.3.2.1GeneinOryzaSativaobtainedfromChildsLab,DepartmentofPlantBiology,MichiganStateUniversity.....343.3.2.2FindinggenesinOryzaSativaretrievedfromStankeetal....38Chapter4SensitiveShortReadHomologySearchforPaired-EndReadSequenc-ingData....................................444.1Introduction......................................44vii4.1.0.3IsHMMERgoodenoughforshort-readhomologysearch?....464.2Methods........................................484.2.1Constructingfragmentlengthdistribution..................504.2.2Probabilisticmodel..............................514.3Experimentalresults..................................534.3.1shortreadhomologysearchinArabidopsisThalianaRNA-Seqdataset..................................544.3.1.1Determinationfortruemembershipofpaired-endreads.....544.3.1.2Performanceoffragmentlengthdistribution...........554.3.1.3Ourmethodcanalignmorereads..........554.3.1.4Sensitivityandaccuracyofshortreadhomologysearch.....564.3.1.4.1Performanceofcase1:..................574.3.1.4.2Performanceofcase2:..................574.3.1.5Performanceevaluationondomain-level.............594.3.1.6Runningtimeanalysis.......................614.3.2homologysearchforshortreadsinametagenomicdatasetfromsyntheticcommunities............................624.3.2.1Dataset...............................624.3.2.2Determinationoftruemembershipofpaired-endreads......624.3.2.3Performanceoffragmentlengthdistribution...........634.3.2.4Ourmethodcanalignmorereads.................634.3.2.5Sensitivityandaccuracyofshortreadhomologysearch.....634.3.2.5.1Case1:oneendisalignedbyHMMER.........634.3.2.5.2Case2:bothendsarealignedbyHMMER.......654.3.2.6Domain-levelperformanceevaluation...............654.3.2.7Runningtimeonametagenomicdatasetofsyntheticcommunities674.4Discussionandconclusion..............................68Chapter5glu-RNA:aliGnhighLystrUcturedncRNAsusingonlysequencesimi-larity......................................705.1Introduction......................................705.2RelatedWork.....................................725.3Method........................................745.3.1Alignmentpathandsimilaritybetweentwoalignments...........785.3.2Choosingthebasealignment.........................805.3.3Improvingalignmentaccuracyusingmachinelearningmethods......825.3.3.1Constructingalegalalignment...................845.3.4Runningtimeanalysis............................855.4ExperimentalResults.................................855.4.1Data......................................855.4.1.1Trainingdata............................895.4.1.2Testingdata.............................905.4.2Performancecomparisonbetweendifferentandregressionmodels.....................................905.4.3Performancecomparisonofdifferentalignmentprograms..........92viii5.4.4Discussion...................................965.5Conclusionandfuturework..............................97Chapter6ConclusionandFutureWork........................99BIBLIOGRAPHY.......................................101ixLISTOFTABLESTable3.1:PerformancecomparisonofgenepredictiontoolsonArabidopsisThaliana.Thetransitionprobabilitiesweredividedintothreeequalportions.BoldnumberindicatesthatsensitivityorofAUGUSTUS-GCarehigherthanthoseofAUGUSTUS...................................31Table3.2:PerformancecomparisonofgenepredictiontoolsonArabidopsisThaliana.Thetransitionprobabilitieswastrainedbycomputingmaximumlikelihoodestima-tion.BoldnumberindicatesthatsensitivityorofAUGUSTUS-GCarehigherthanthoseofAUGUSTUS........................31Table3.3:PerformancecomparisonofgenepredictionapproachesonOryzaSativaob-tainedfromChildsLab.Thetransitionprobabilitiesweredividedintothreeequalparts.BoldnumberindicatesthatsensitivityorofAUGUSTUS-GCarehigherthanthoseofAUGUSTUS.....................36Table3.4:PerformancecomparisonofgenepredictionapproachesonOryzaSativaob-tainedfromChildsLab.Thetransitionprobabilitieswascalculatedbyusingmaximumlikelihoodestimation.BoldnumberindicatesthatsensitivityorofAUGUSTUS-GCarehigherthanthoseofAUGUSTUS.......36Table3.5:PerformancecomparisonofgenepredictiontoolsonOryzaSativaprovidedbyStankeetal.Thetransitionprobabilitiesweredividedintothreeequalparts.BoldnumberindicatesthatsensitivityorofAUGUSTUS-GCarehigherthanthoseofAUGUSTUS..........................39Table3.6:PerformancecomparisonofgenepredictiontoolsonOryzaSativaprovidedbyStankeetal.Thetransitionprobabilitieswastrainedusingmaximumlike-lihoodestimation.BoldnumberindicatesthatsensitivityorofAUGUSTUS-GCarehigherthanthoseofAUGUSTUS..............39Table4.1:TheperformanceofHMMERunderdifferentcutoffsonclassifyingreadsintheArabidopsisThalianaRNA-Seqdataset.Thereare2,972,809paired-endreadsbeinguniquelymappedto3,577annotatedproteindomainfamiliesintherefer-encegenome.However,HMMERunderE-valuecutoff10missedatleasthalfofthereadswhenaligningthesereadstoinputdomains.ThealignmentsbyHMMERcanbedividedintothreecases.Case1:onlyoneendcanbealignedtooneormultipledomains.Case2:bothendscanbealignedtooneormultipledomains.Case3:noendisalignedtoanydomain.Inthetable,thepercent-ageofacasecanberepresentedbyeachnumber.fiHMMERw/orepresentsshuttingoffallionstepsandrunningfullForward/Backward.fiHMMERGAcutoffflrepresentsapplyingHMMERwithgatheringthresholds..46xTable4.2:Thepercentagesofallthreecasesofpaired-endreadalignmentsbyHMMERandourtoolfortheRNA-Seqdata.Case1representsoneend.Case2representsbothends.Case3representsnoend........................56Table4.3:TherunningtimeofHMMERunderdifferentcutoffsandourtoolontheAra-bidopsisThalianaRNA-Seqdataset.Note:Therunningtimeisthetotalrun-ningtimeofhomologysearchtooltoalign9,559,784paired-endreadswith3962domains....................................61Table4.4:Thepercentagesofallthreecasesofpaired-endreadalignmentsbyHMMERandourtoolforthesyntheticmetagenomicdata.Case1representsoneend.Case2representsbothends.Case3representsnoend...............64Table4.5:TherunningtimeofHMMERunderdifferentcutoffsandourtoolonametage-nomicdatasetofsyntheticcommunities.Note:Therunningtimeistheoverallrunningtimeofhomologysearchtooltoalign52,486,341readswith111do-mains........................................68Table5.1:Numberofbestalignmentsgeneratedbyfoursequencealignmenttoolsfor662pairsoftRNAsandRiboswitches.NotethatN-WisNeedleman-Wunsch.ClustalW,amultiplesequencealignmenttool,canalsobeappliedtoapairofsequences......................................81Table5.2:InformationoftheTrainingDataSet.......................87Table5.3:InformationoftheTestingDataSet.Notethatthereisnooverlapbetweenthetrainingandtestingdata...............................91Table5.4:Theaccuracyofdifferentmodels..........................92Table5.5:Averagedistancesofdifferentalignmentprogramson360pairsofncRNAsinthetestingdataset.BoldnumberrepresentsthelowestaveragedistanceforanncRNAfamily.UnderlinednumbershowsthesecondlowestaveragedistanceforanncRNAfamily................................95Table5.6:Averagedistancesofdifferentalignmentprogramson58pairsofsequencesinvefamilieswithsequencesimilaritybelow40%.Boldnumberrepresentsthelowestaveragedistance.Underlinednumbershowsthesecondlowestaveragedistance.......................................95xiLISTOFFIGURESFigure1.1:Prokaryoticgenestructure.............................2Figure1.2:Eukaryoticgenestructure.............................2Figure1.3:representationofnextgenerationsequencingtechnologythatpro-ducesingle-endreadsorpaired-endreads.First,thereferencegenomese-quenceisfragmented.Forsingle-endsequencing,oneendofDNAfragmentissequenced.Forpaired-endsequencing,bothendsoffragmentedDNAaresequenced.InsertsizeisthelengthofsequenceRead1andRead2aswellasunknownsequencebetweenreadpair.......................7Figure1.4:AtRNAsequenceanditssecondarystructure..................9Figure2.1:AnexampleofMarkovChain..........................13Figure2.2:AnexampleofMarkovChain.BeginandendstatesareaddedtoaMarkovchainmodelformodellingthebeginningandtheendingofasequence.....14Figure3.1:ThestatediagramofabinitioAUGUSTUS[101].Thestatesbeginningwithrrepresentsthereversestrand.Esingle:asingleexon.Einit:theinitialcodingexonofamulti-exongene.DSS:adonorsplicesite.Ishort:anintronemittingatmostdnucleotides.Ied:alongerintronwiththednucleotides.Igeo:alongerintronemittingonenucleotideatatimeafterthednucleotides.ASS:anacceptorsplicesitewithbranchpoint.E:aninternalcodingexonofamulti-exongene.Eterm:thelastcodingexonofamulti-exongene.IR:intergenicregion.Diamondsrepresentthestatesthatemitedlengthstrings.Ovalsrepresentthestatesincludingexplicitlengthdistribution.Thenumbersatthearrowsshowthetransitionprobabilities.Theexponents0,1,and2representthereadingframephase.Foranexonstate,thisisthepositionofthelastbaseoftheexoninitscodon.Fortheotherstates,theexponentarethepreceding-exonphase.Thesmallcirclesrepresentsilentstates..........22xiiFigure3.2:ThestatediagramofAUGUSTUS-GC.Thestatesbeginningwithrrepresentsthereversestrand.EHsingle:asingleexonofhighG+Ccontent.EMsingle:asingleexonofmediumG+Ccontent.ELsingle:asingleexonoflowG+Ccon-tent.EinitH:theinitialcodingexonofamulti-exongenewithhighG+Ccon-tent.EinitM:theinitialexonofamulti-exongenewithmediumG+Ccontent.EinitL:theinitialexonofamulti-exongenewithlowG+Ccontent.DSS:adonorsplicesite.Ishort:anintronemittingatmostdnucleotides.Ied:alongerintronwiththednucleotides.Igeo:alongerintronemittingonenu-cleotideatatimeafterthednucleotides.ASS:anacceptorsplicesitewithbranchpoint.EH:aninternalcodingexonofamulti-exongenewithhighG+Ccontent.EM:theinternalexonofamulti-exongenewithmediumG+Ccon-tent.EL:theinternalexonofamulti-exongenewithlowG+Ccontent.EHterm:thelastcodingexonofamulti-exongenewithhighG+Ccontent.EMterm:theterminalexonofamulti-exongenewithmediumG+Ccontent.ELterm:theter-minalexonofamulti-exongenewithlowG+Ccontent.IR:intergenicregion.Diamondsrepresentthestatesthatemitedlengthstrings.Ovalsrepresentthestatesincludingexplicitlengthdistribution.Thenumbersatthearrowsshowthetransitionprobabilities.Theexponents0,1,and2representtheread-ingframephase.Foranexonstate,thisisthepositionofthelastbaseoftheexoninitscodon.Fortheotherstates,theexponentarethepreceding-exonphase.Thesmallcirclesrepresentsilentstates..................23Figure3.3:TheexampleofagenewithG+CcontentgradientinOryzaSativadataset.X-axisrepresentsexonindex.Y-axisrepresentsG+Ccontent..........27Figure3.4:G+CContentofArabidopsisThalianadataset.................30Figure3.5:AnexampleofapredictedgenewithG+CcontentgradientofArabidopsisThalianadataset.ThisgenewaspredictedbyAUGUSTUS-GC.X-axisrep-resentsexonindex.Y-axisrepresentsG+Ccontent................32Figure3.6:G+CContentofthedatasetofOryzaSativa................34Figure3.7:G+CContentofthesecondOryzaSativadataset................35Figure3.8:ApredictedgeneonforwardstrandwithG+CcontentgradientoftheOryzaSativadataset.X-axisrepresentsexonindex.Y-axisrepresentsG+Ccontent.......................................37Figure3.9:ThreepredictedgenesonreversestrandwithG+CcontentgradientoftheOryzaSativadataset.X-axisrepresentsexonindex.Y-axisrepresentsG+Ccontent.......................................37xiiiFigure3.10:AnexampleofauniquelypredictedgeneonreversestrandwithG+Ccon-tentgradientofsecondOryzaSativadataset.However,regularAUGUSTUScannotdetectit.X-axisrepresentsexonindex.Y-axisrepresentsG+Ccontent.41Figure3.11:FirstexampleofapredictedgeneonforwardstrandwithG+Ccontentgra-dientofsecondOryzaSativadataset.X-axisrepresentsexonindex.Y-axisrepresentsG+Ccontent..............................41Figure3.12:SecondexampleofapredictedgeneonforwardstrandwithG+Ccontentgra-dientofsecondOryzaSativadataset.X-axisrepresentsexonindex.Y-axisrepresentsG+Ccontent..............................42Figure3.13:FirstexampleofapredictedgeneonreversestrandwithG+Ccontentgradi-entofsecondOryzaSativadataset.X-axisrepresentsexonindex.Y-axisrepresentsG+Ccontent..............................42Figure3.14:SecondexampleofapredictedgeneonreversestrandwithG+Ccontentgra-dientofsecondOryzaSativadataset.X-axisrepresentsexonindex.Y-axisrepresentsG+Ccontent..............................43Figure3.15:ThirdexampleofapredictedgeneonreversestrandwithG+Ccontentgra-dientofsecondOryzaSativadataset.X-axisrepresentsexonindex.Y-axisrepresentsG+Ccontent..............................43Figure4.1:Anexampleofaproteinfamily,itsalignmentwithagene,andreadmappingpositionsofareadpairagainstthegene.ThePkinasemodelhadannotationlineofconsensusstructure.ThelinebeginningwithPkinaseistheconsensusofthequerymodel.Capitallettersshowpositionsofthemostconservation.Dots(.)inthislinerepresentinsertionsinthetargetgenesequencewithrespecttothemodel.ThemidlinerepresentsmatchesbetweenthePkinasemodelandtheAT2G28930.1genesequence.A+representspositivescore.ThelinebeginningwithAT2G28930.1isthetargetgenesequence.Dashes(-)inthislinerepresentsdeletionsinthegenesequencewithrespecttothemodel.Thebottomlineindicatestheposteriorprobabilityofeachalignedresidue.A0represents0-5%,1represents5-15%,...,9represents85-95%,and*represents95-100%posteriorprobability.Thelinestartingwithr1andendingwithr2isreadmappingregionsonthegenesequence.A-indicateswherethepositionofthereadcanbemappedtothegenesequence.................48xivFigure4.2:HMMalignmentsofareadpair.Paired-endreadsr1andr2representedbytwogreyscalelinesarealignedagainstmodelsM1,M2,andM3withdifferentscoresofalignments.Thedarkerlinesrepresentbiggerscores.Thefragmentsizedistributionisprovidedaboveeachmodel.Thedistancebetweenthetwoalignmentsiscomputedandisusedtocomputethelikelihoodofthecorre-spondingfragmentsize.Inthisexample,M1ismostlikelytobethenativefamily.......................................50Figure4.3:Anexampleoffragmentlengthcalculation.ThealignmentpositionsalongtheHMMcanbeconvertedintopositionsineachseedsequences.Thefragmentsizeiscomputedastheaveragesizeofthosemappedregions.....52Figure4.4:Comparingfragmentlengthdistributionofourmethod(blue)tofragmentlengthdistributionconstructedfromreadmappingresults(red).X-axisrep-resentsthelengthoffragmentinaminoacids.Y-axisrepresentsprobabilityofthecorrespondingfragmentsize........................56Figure4.5:ROCcurvesofshortreadhomologysearchforArabidopsisRNA-Seqdata.WecomparedHMMERandourtooloncase1,whereoneendcanbealignedbyHMMERwithdefaultE-value.NotethatHMMERwithGAcutoffhasonedatapoint..............................58Figure4.6:ROCcurvesofshortreadhomologysearchforArabidopsisRNA-Seqdata.WecomparedHMMERandourtooloncase2,wherebothendsarealignedbyHMMERwithdefaultE-value.NotethatHMMERwithGAcut-offhasonedatapoint.UsingposteriorprobabilityhelpsremovefalsealigneddomainsandthusleadstobettertradeoffbetweensensitivityandFPrate....58Figure4.7:ThedistancecomparisonbetweenourmethodandHMMERoncase1oftheRNA-SeqdatasetofArabidopsis.377domainswiththelargestdistanceval-uesarelistedinthefoursubplots.X-axisshowstheindicesofthedomains.Smallervalueindicatescloserdomainabundancetothegroundtruth.Theav-eragedistancesofHMMER,HMMERw/oHMMERGAcutoff,andourtoolare704.92,781.80,1,054.77,and522.12,respectively.........60Figure4.8:ThedistancecomparisonbetweenourmethodandHMMERoncase2oftheRNA-SeqdatasetofArabidopsis.358domainswiththelargestdistancesarelistedinthefoursubplots.X-axisshowstheindicesofthedomains.Smallervalueindicatescloserdomainabundancetothegroundtruth.TheaveragedistancesofHMMER,HMMERw/oHMMERGAcutoff,andourmethodare818.09,704.65,1084.50,and558.60,respectively..........61xvFigure4.9:Comparingfragmentlengthdistributionofourtool(blue)tofragmentlengthdistributionconstructedfromreadmappingresults(red).X-axisrepresentsfragmentlengthinaminoacids.Y-axisrepresentstheprobabilityofthecor-respondingfragmentsize.............................64Figure4.10:ROCcurvesofshortreadhomologysearchforthesyntheticmetagenomicdataset.WecomparedHMMERandourtooloncase1,whereoneendcanbealignedbyHMMERwithdefaultE-value.NotethatHMMERunderGAcutoffhasonedatapoint........................65Figure4.11:ROCcurvesofshortreadhomologysearchforthesyntheticmetagenomicdata.WecomparedHMMERandourtooloncase2,wherebothendsarealignedbyHMMERunderdefaultE-value.NotethatHMMERunderGAcutoffhasonedatapoint.UsingposteriorprobabilityhelpsremovefalsealigneddomainsandthusleadstobettertradeoffbetweensensitivityandFPrate.......................................66Figure4.12:ThedistancecomparisonbetweenourmethodandHMMERoncase1ofthemetagenomicdataset.X-axisshowstheindicesofthedomains.Smallervalueindicatescloserdomainabundancetothegroundtruth.Domainsaresortedbasedonthedistanceofourtool.Duetoscalingissues,domainswiththelargestdistancesareplottedintheembeddedwindow............67Figure4.13:ThedistancecomparisonbetweenourmethodandHMMERoncase2ofthemetagenomicdataset.X-axisshowstheindicesofdomains.Smallervalueindicatescloserdomainabundancetothegroundtruth..............68Figure5.1:AlignmentsgeneratedbyfouralignmentprogramsforapairoftRNAswithsequencesimilarity80%.X-axisrepresentssequence2andY-axisrepresentssequence1.Thealignmentstartswith(0,0)....................74Figure5.2:AlignmentsgeneratedbyfouralignmentprogramsforapairoftRNAswithsequencesimilarity40%.X-axisrepresentssequence2andY-axisrepresentssequence1.Thealignmentstartswith(0,0)...................75Figure5.3:Thereferencealignmentandalignmentsgeneratedbyfouralignmentpro-gramsforapairoftRNAswithsequencesimilarity40%.Thepredictedsec-ondarystructuresareshownwheneachalignmentisusedasinputtoPfold,astructurepredictionprogram...........................77Figure5.4:Constructalegalalignmentusinglocalsearch.representnodesinthebasealignment.representnewpositionsofnodesinbasealignmentafterexten-sion.(A).node(3,3)ismovedto(3,4),whichisrepresentedby.(B).representinsertednodes..............................84xviFigure5.5:TypesofncRNAsandtheirconsensussecondarystructures(tRNAandri-boswitches)....................................86Figure5.6:Sequencesimilarityhistogramoftrainingdataset.X-axisrepresentsthese-quencesimilarity.Y-axisrepresentsthenumberofncRNApairs.........88Figure5.7:ConsensussecondarystructuresofU2andgcvT.................90Figure5.8:Comparingthealignmentsgeneratedbyglu-RNA,dpswalign,andDynaligntothereferencealignmentofaTHIpairwithsequencesimilarity45%.glu-RNAcanproducemoreaccuratealignmentthandpswalignandDynalign.......97xviiChapter1IntroductionTheexponentialcostreductionofNGSenablesustosequencemanymoreorganismsincludingnon-modelspeciesoranyspeciesthatareofinteresttobiologists.Genomeannotation,whichlabelsfunctionalelementsinsidegenomes,isafundamentalstepforunderstandingthefunctionsofnewlysequencedgenomes.Inparticular,geneannotationisakeystepinwholegenomean-notation.WithincreasedknowledgeabouttheimportanceofnoncodingRNAs,geneannotationshouldincludebothprotein-codinggenesandnon-codingRNAgenes.Homologysearchisalsoanessentialtaskforannotatingfunctionalelementsinnewlysequencedgenomes.Therehasbeenex-tensiveresearchinareasofgeneprediction,homologysearch,andncRNAsearch.However,therearestillunaddressedchallenges.DuringmyPh.D.study,Iconductedresearchinthesethreeareas.Iwilldescribethebackgroundofthethreeproblems.ThenIwillhighlighttheunaddressedchallenges.Finally,Isummarizemywork.1.1GenepredictionWithadvancedsequencingtechnology,thenumberofnewgenomeshasrapidlyincreased.Theabundanceofnewlysequencedgenomesneedsbetterandefcomputationalsequenceanal-ysismethodsandtoolstounderstandmoreaboutawidevarietyoforganisms.Oneofthemostimportantstepsingenomeanalysisistheofallgenesinagenome.Theobjectiveofgenepredictionistoidentifythegenelocusandtheirstructure.1Figure1.1:Prokaryoticgenestructure.Figure1.2:Eukaryoticgenestructure.Therearetwotypesofgenepredictiondistinguishedbytheorganisms:prokaryotesandeu-karyotes.Prokaryoteslikebacteriaandarchaeaareorganismswhosecellshavenonucleusoranyothermembrane-boundcompartments.Eukaryotesareorganismswhosecellscontainnucleusandmembrane-boundorganelles[56].Eukaryotescanbesingle-celledormulti-celled,andincludeanimals,plants,fungi,microsporidia,etc.Identifyinggenesinprokaryoticgenomesislesscom-plicatedowingtothehighgenedensityandnointronsintheproteincodingregions.Figure1.1showsthestructureoftheprokaryoticgene.Completeprokaryoticgenestructurebeginsatpro-moterregion.Transcriptionstartsite(TSS)indicatesthestartlocationoftranscriptionofthegene.Transcriptionstopsiteindicatesthestopoftranscription.Transcriptioncomprisesof50Untrans-latedregion(50UTR),Openreadingframes(ORFs),and30UTR.Fortranslationinprokaryotes,acontinuouscodingsegmentfromstartcodontostopcodonwillbetranslated.Genepredictionineukaryotesismorechallengingduetotheexon-intronstructure.Figure1.2illustrateseukaryoticgenestructure.Ineukaryotes,thecodinggenesarenotcontinuous.Mostofthegenesaredividedintoseveralpiecescalledexons(initialexons,internalexons,terminal2exons)thatareinterspersedwithintrons.Somegeneshaveonlysingleexonwithoutintrons.Unlikeprokaryoticgenestructure,eukaryoticgenestructuredoesnothaveORFwithcontinuoussegment.Because,amountsofnon-codingsequencesthatdonottranslateintoproteinsbuttheproportionofproteincodinggenesintheentiregenomeisrelativesmall[44].Thegenomeineukaryotesisenormousandmorecomplicatedthanthegenomeinprokaryotes.1.1.1ProkaryoticgenepredictionIdentifyinggenesinasequenceofprokaryotesislessdifthanpredictinggenesinagenomeofeukaryotes.Thegenestructureofprokaryotesislesscomplexthanthatofeukaryotesbecauseitdoesnotcontainintrons.Oneprotein-codinggeneinagenomecorrespondstooneORFstartingwithastartcodonandendingwithastopcodon.Becausethestartcodoncanoccurinsidease-quenceofageneorappearinupstreamofthetranslationstartsite.Thegoalofprokaryoticgenepredictiontoolsistodistinguishbetweenthecodingandnon-codingregionsinthegenomeanddeterminethecorrectstartcodonofeachgene.Inordertoidentifygenes,therearemanycompu-tationalmethods.ThesemethodsconsistofprobabilisticmodelssuchasHiddenMarkovModels(HMMs)[61,92,23,11,55],andmachinelearningmethodssuchasSupportVectorMachines(SVMs)[49]andSelf-OrganizingMaps(SOMs)[64].1.1.2EukaryoticgenepredictionComparingwithprokaryoticgeneprediction,thegeneineukaryoticorganismsisamorechallengingproblem.Becausethegenomesareverylarge,buttheycontainonlyasmallpro-portionofprotein-codinggenessurroundedbythelargeamountsofnon-codingsequence.More-over,thecomplicatedofgenestructureineukaryoticmakesofthecorrectposition3ofgenestructureinagenomemorechallenging.Severalmethodshavebeenimplemented.OneofthemostwidelyusedapproachesforgeneisbasedonHiddenMarkovModels(HMMs).Existinggenepredictiontoolsroughlyfallintotwogroups.Thecategorycomprisesofabini-tiotoolswhichutilizestatisticalpatternsandanintrinsicinformationextractingfromtrainingdatasettoaccuratelylocatetheboundariesofcodingregionsinaqueryDNAsequence.Manyabinitiotoolsbasedonprobabilisticmodels.ThesuccessofHMM-basedgenepredictionhasbeenappliedinHMMgene[50,51]andVEIL[39].AverysuccessfulextensionofHMMsforgenetionisGeneralizedHMM(GHMM).ThegeneralizationofstandardHMMistoallowvariablestatelength.GHMMwasimplementedforgenepredictiontoolinGENIE[53]followedbyoneofthemostfamousgeneGENSCAN[14,15].Then,therearemanyGHMM-basedgeneincludingPhat[18],AUGUSTUS[98],Exonomy[65],SNAP[47],TIGRscan[66],GlimmerHMM[66],etc.Thesecondcategoryconsistsofcomparativetoolswhichemployinfor-mationotherthanaquerysequence.Someextrinsicmethodsincorporateoneormoresequencesfromotherspeciesandcompareaquerysequencewiththegenomicsequencesofcloselyrelatedspeciestoidentifygenes.ThetoolsinthisgroupincludeSLAM[1],SGP2[81],TWINSCAN[48],N-SCAN[34],DOUBLESCAN[73],AGenDA[105,75],etc.1.2HomologysearchIncomputationalbiology,homologysearchisoneoftheimportantstepsinbiologicalsequenceanalysis.Thestepofhomologysearchcomparesaquerysequencetoanothersequenceorase-quencefamilydatabasetostatisticallysimilaritybetweensequences.Homologoussequencesshareacommonancestryandhavethesameorsimilarfunctions.Thestate-of-the-artmethodforhomologysearchisbasedoncomparativesequenceanalysis.Therearetwomain4comparativemethodsforhomologysearch.Oneapproachforhomologysearchisbasedonpair-wisesequencecomparison.Thisapproachcomparesaquerysequenceagainsteachsequenceinadatabaseandgeneratespairwisesequencealignmentwithsimilarity.OneofwidelyusedpairwisehomologysearchtoolsisBLAST[3].However,conventionalpairwisesequencecompar-isontoolssuchasBLASTyieldlowsensitivityforremotelyrelatedsequences[80].Thus,moresensitivityhomologysearchtoolsarerequired[63,13,102,20,2].Pairwisesequencecomparisonhasbeenextendedtomultiplesequencesandprobabilisticmodels.Thesecondapproachforho-mologysearchishomologysearch.homologysearchismoresensitivethanpairwisehomologysearch.homologysearchcomparesaquerysequencetopro-teindomaindatabasessuchasPfam[31],InterProScan[116],FIGfams[72],TIGRFAMs[35],etc.ThemajorityofhomologysearchtoolsmakeuseofHiddenMarkovModels(pHMMs).HiddenMarkovModelisaprobabilisticmodelofmultiplesequencealignment.pHMMrepresentsinformationembeddedinamultiplesequencealignment.IntheproteinfamilydatabasesuchasPfam,apHMM[52]isusedtorepresentsequenceconserva-tionofeachcolumnofamultiplesequencealignment.TheexampleofawidelyusedpackageforhomologysearchisHMMER[29].1.2.1Thechallengesofhomologysearchinlarge-scalegenomicdata1.2.1.1Next-generationsequencingtechnologySequencingisthemethodofdeterminingtheorderofbasepairswithinastrandofDNA/RNA.PriortoNext-GenerationSequencing(NGS)technologies,Sanger'smethoddominatedthese-quencingmarketforover30years[93,94].Sanger'smethodisDNAsequencingmethodbasedonchain-terminatinginhibitors.However,Sangersequencinghasseverallimitations.First,Sanger's5methodrequiresalargeamountoftimeandresourcestosequenceandanalyze.UsingSanger'sse-quencing,thehumangenomerequired10yearstosequenceandthreemoreyearstocompletetheanalysis[5,109].DuetothehighcostofSangersequencing,thisdrawbackmakesmanyscien-tistsinsmalllaboratoriesinaccessiblewiththebudgetconstraints[60].Thus,faster,cheaper,andhigherthroughputtechnologiesarerequired.ToovercometheselimitationsofSangersequencing,next-generationsequencingtechnologiesaredeveloped.NGStechnologiesaremassivelyparallel.Thus,NGStechnologiescanproducealargeamountofdatainashortperiodoftimewithlowcosts[121,67,71,17].TheNGStechnologiesaredominatedbythefollowingplatformscommer-cializedaround2005:Roche454GenomeSequencer,theIlluminaGenomeAnalyzer,SolexaandtheLifeTechnologiesSOLiDSystem[121].Currently,IlluminasequencingistheleaderofNGSplatformwhichoffersthehighestthroughputperrunandthelowestcostperbase[60,109].ManyNGStechnologiescanproducesingle-endandpaired-endreads.Single-endsequencingreadsaDNAfragmentonlyoneendgeneratingasequenceofbasepairs.Forpaired-endreads,thepaired-endsequencingenablesbothendsofaDNAfragmenttobesequencedtogenerateapairofreads.Themainadvantageofpaired-endsequencingisthatthepaired-endreadcontainspositionalinformationwhichiscalledinsertsize.InsertsizeisthelengthofsequenceRead1andRead2aswellastheunknowngapbetweenbothends.Manyalignmentalgorithmscanusethisinformationtomapthereadstoareferencegenomeleadingtomoreaccuratealignmentalgorithmsforshortreads[62,97].Arepresentationofsingle-endsequencingandpaired-endsequencingisillustratedinFigure1.3.Withadvancesinnextgenerationsequencing(NGS)technologies,usingreadssequencedbyNGStechnologiesenableustostudy,analyze,andunderstandinmanyaspectsofBiologicalresearch.6Figure1.3:representationofnextgenerationsequencingtechnologythatproducesingle-endreadsorpaired-endreads.First,thereferencegenomesequenceisfragmented.Forsingle-endsequencing,oneendofDNAfragmentissequenced.Forpaired-endsequencing,bothendsoffragmentedDNAaresequenced.InsertsizeisthelengthofsequenceRead1andRead2aswellasunknownsequencebetweenreadpair.1.2.1.2Thechallengesofhomologysearchfornext-generationsequencingdataWithanincreasingnumberofgenomicdata(DNA,RNA,andproteinsequences),homologysearchhasbecomeanessentialtaskforsequenceanalysis.Oneexampleoflarge-scalegenomicdataisnext-generationsequencing(NGS)data.ApplicationsofNGSinsequencingthetranscriptomesofvariousnon-modelspeciesenableustoidentifygenesandtheirisoformsthatarelowlytranscribedoronlytranscribedintissuesorconditions.NewhomologsofexistinggenefamiliesandevennewgenefamiliesareexpectedbyanalyzingthedataproducedbyNGS.However,high-throughputdatafromsequencingmachinespresentnewchallengesforhomologysearch.Homol-ogysearchisstillthekeysteptoannotatevariousNGSdata.Homologysearchcanbeconducted7intwoways.Thedefactoapproachistoconductdenovoassemblyandthenapplyexistingho-mologysearchtoolstotheassembledsequences.However,denovoassemblyfrombillionsofreads,whichareveryshortandhighaswellaslow-qualityreads,posesachallenge.Furthermore,datasetsofNGSareverylarge.Theselargedatasetsrequirelargememoryusageand/orlongcomputationaltime.Thesechallengesneedtobetackled[4,117,120].Thesecondmethodistoconducthomologysearchdirectlytoshortreads.Thismethodavoidsthecomplicatedanderror-proneassemblystep.However,shortreadsposegreatchallengesforhomologysearch.Withtheincreaseofreadlengthbythesecondandthird-generationsequencingtechnologies,newopportunitiesappear.However,theshortreadssequencedfrompoorlyconservedregionsposeagreatchallengeforexistinghomologysearchtools.Therefore,weneedtodevelopthele-basedhomologysearchmethodtoimprovesensitivityforshortreads.1.2.1.3Non-codingRNAhomologysearchinlarge-scalegenomicdataNon-codingRNAs(ncRNAs)haveplayedimportantrolesinmanybiologicalprocesses.ncRNAsarefunctionalmoleculesthatarenottranslatedintoproteinbutfunctiondirectlyasRNAs.TherearemanydifferenttypesofncRNAssuchastRNA,rRNA,miRNA,etc.AsncRNAsperformtheirfunctionsthroughboththeirsequencesandsecondarystructures,theseprovideusefulinformationforhomologysearchofncRNAs.AnexampleofatRNAsequenceanditssecondarystructureareshowninFigure1.4.ThedevelopmentofNGStechnologiesenablesresearcherstosequencegenomesandtranscriptomesofvariousmodelandnon-modelspecies,providinguniqueopportu-nitiesforncRNAmining.ConductinghomologysearchisusuallythestepforncRNAgeneTherearetwotypesofhomologysearchtoolsforncRNAgeneThetypeofmethodsfocusesonidentifyingflknownflncRNAsthatarehomologoustocharacterizedncRNA8Figure1.4:AtRNAsequenceanditssecondarystructuresequencesorfamilies.Therepresentativetoolthatcaneffectivelytakeadvantageofbothsequenceandstructuralsimilarityisbasedoncontext-freegrammar(CFG)[77].Thismethodusesaprob-abilisticmodeltorepresentancRNAfamilyandalignsaquerysequencewiththemodel.ThealignmentscoreindicateshowlikelythequeryisamemberofthencRNAfamily.ThesecondtypeofmethodisnotrestrictedtoknownncRNAs.Bycomparingtwosequences,suchasintergenicsequencesoftwogenomes,orunannotatedtranscriptsoftworelatedspecies,novelncRNAsmightbedetected.Usually,secondarystructure-awarepairwisecomparisontoolsareappliedtoncRNAsecondarystructurepredictionandncRNAgeneDespitetheextensivestudiesofncRNAgenethetoolsthatheavilydependonthesecondarystructureinhomologysearchcannotefhandleRNA-Seqdatathatisaccumulatingrapidly.ItwillbeidealifwecanhaveafasterncRNAhomologysearchtoolwithsimilaraccuracyasthoseadoptingsecondarystructures.91.3ContributionsInthiswork,wemainlyfocusongenepredictionandhomologysearch.Geneisoneoftheessentialstepsforgenomeannotation.Genepredictioncanbeappliedintheannotationofgenomicdataproducedbygenomesequencingmethods.Forgenepredictiontoolsingrassgenomes,existinggenetoolsarenotcarefullymodeledfor50-30G+Cgradient.Asaconsequence,theexistingtoolshavelimitedsensitivityandaccuracyforidentifyinggeneswithG+Cgradient.Homologysearchisalsoessentialforproteindomainanalysis.Oneoftheimportantstepsoffunctionalanalysisisconductinghomologysearchorclassifyingshortreadsintodifferenttypesofproteinfamiliesorproteindomainfamilies.Theexistingtoolsofhomologysearchhavebeensuccessfullyusedforgenome-widedomainanalysis.However,readssequencedfrompoorlyconservedregionsmaystillbemissedbyexistinghomologysearchtools.Thus,thereisaneedtoflrescueflmissingreadsandtoimprovethesensitivityofhomologysearchforshortreadslackingreferencegenomes.Inthefollowingsectionsofthedissertation,Wepresentcontributionsinthreeareas:genepre-diction,homologysearch,andncNRAsearch.First,thegenemethodthataccountsfor50-30G+CgradientispresentedinChapter3.OurmethodforgrassgenomeswhichhavebeeninvestigatedthatthereisG+Cgradientinsideexonsofagene.Ourtoolcanincreasetheaccuracyofpredictinggenesinplantgenomesespeciallygrassgenomessuchasrice,maize,etc.Second,shortreadhomologysearchusingpaired-endreadinformationisproposedinChapter4.WeuseanapproximateBayesianapproachapplyingalignmentscoresandthefragmentlengthdistributiontorescuemissingendandindicatethetruedomains.Wecanimprovetheaccuracyforshortreadsthatarepartofremotehomologs.Third,Chapter5introducesglu-RNA:aliGnhighLystrUcturedncRNAsusingonlysequencesimilaritywhichcanbeappliedtoncRNAgene10WedescribeanaccuratencRNAalignmentmethodtoalignhighlystructuredncRNAsusingonlysequencesimilarity.WeshowthatwecangenerateaccuratealignmentsofhighlystructuredncR-NAswithoutusingstructuralinformationbyincorporatingposteriorprobabilityandamachinelearningapproach.WealsotestourapproachonoverthreehundredsofpairsofhighlystructuredncRNAsfromBRAliBase2.1.11Chapter2aReviewofMarkovChainandHiddenMarkovModels2.1HiddenMarkovModels(HMMs)AHiddenMarkovModel(HMM)isastochasticmodelwhichiswidelyusedformanyareasincomputationalbiology.Inthelate1960s,HMMswerepresentedinstatisticalpapersbyBaumetal.[10,9,22].HMMshavebeenappliedinspeechrecognitionsincetheearly1970s[88,87].Sincethelate1980s,HMMshavebeendevelopedintheanalysisofbiologicalsequences.HMMshavebeenappliedtovariousareasofcomputationalbiology.Theycanbeusedtomodelasequenceorafamilyofsequences[27]foranalysis,predictionofproteincodinggenesingenomesequences,etc.ThissectionisadaptedfromDurbinetal.[27].MarkovchainsandHMMSaredescribed.2.1.1MarkovChainsAMarkovmodelisasetofstatesconnectedbytransitionswhichcantransitfromonestatetootherstates.Eachtransitionhasaprobability.AMarkovchainforgenomicsequenceX=(x1;x2;:::;xL)2SwhereS=fA;C;G;Tgcanberepresentedas1)aweighteddirectedgraph,2)anoderepresentsastate,3)anedgerepresentstransitionbetweenstatesassociatedwithtransition12probabilityast=Pr(xi=tjxi1=s)withåt=fA;C;G;Tgact=1.Figure2.1showsanexampleofaMarkovChainModel.NodesarestateswithnamesA,C,G,andT.Edgesarepossibletransitionsassociatedwithtransitionprobabilities.GivenasequencexoflengthL,wecanasktheprobabilityFigure2.1:AnexampleofMarkovChainofgeneratingasequencex,jxj=LgivenaMarkovChainModel.Pr(x)=Pr(xL;xL1;:::;x1)=Pr(xLjxL1;:::;x1)Pr(xL1jxL2;:::;x1):::Pr(x1)KeypropertyofaMarkovchainorder):theprobabilityofeachsymbolxidependsonly13onxi1.Pr(x)=Pr(xLjxL1)Pr(xL1jxL2):::Pr(x2jx1)Pr(x1)=Pr(x1)LÕi=2axi1xiFigure2.2:AnexampleofMarkovChain.BeginandendstatesareaddedtoaMarkovchainmodelformodellingthebeginningandtheendingofasequence.Figure2.2showsthefullMarkovchainmodelfromFigure2.1.aBeginstateandanEndstateareaddedtotheMarkovchainmodeltodescribeaprobabilitydistributionoverallpossiblesequencesofanysequencelength.TheobjectiveofaddingEnd(E)stateisthatwewouldliketomodelsequencelengthdistribution.LetBbetheBeginstate.Assumethatx0=B.Thus,theprobabilityofthecharacterinasequenceisPr(x1=s)=aBs=Pr(s)(2.1)wherePr(s)isthebackgroundprobabilityofasymbols.Wealsomodeltheendstate.LetEbetheEndstate.Thus,theprobabilityoftheendingwithstatetisPr(xL=t)=axLE(2.2)142.1.2HiddenMarkovmodelsHiddenMarkovmodelsaretheextensionmodelsoftheclassicalMarkovChainswhichoutputsymbolsarenotthesameasthestates.:AHiddenMarkovModel(HMM)isasatripletM=(S;Q;Q).SisanalphabetofsymbolssuchasS=fA;C;G;TgQisasetofstates.ThestatesarecapableofemittingsymbolsfromSQisthesetofprobabilities,includingStatetransitionprobabilities=aklforeachk;l2QEmissionprobabilities=ek(b)foreachk2Qandb2SLetapathP=(p1;p2;:::;pL)beasequenceofstatesinthemodelM.GivenasequenceX=(x1;x2;:::;xL)andapathP=(p1;p2;:::;pL).TheprobabilityofsequenceXandasequenceofstatesPisPr(X;P)=a0p1LÕi=1epi(xi)apipi+1(2.3)2.1.2.1Themostprobablestatepath:ViterbiAlgorithmAswetheprobabilityPr(X;P),wedonotknowwhatthestatesequencethatemittedthegivensequence.Thus,thegeneratingpathofagivensequenceXishidden.ThedecodingproblemcanbeasgivenaHMMM=(S;Q;Q)andasequenceX2S,wewanttothemostprobablepathPbetweenstatesofamodelforXthatmaximizesPr(X;P)P=argmaxpfPr(X;P)g(2.4)15Totheoptimalpath,wecancomputethemostprobablepathinHMMemployinganalgorithmofdynamicprogrammingcalledViterbiAlgorithm.Supposeuk(i)istheprobabilityofthemostprobablepathendinginstatek,forsequencex1x2:::xiandxiisgeneratedbystatek.Then,ul(i+1)=el(xi+1)maxk2Quk(i)akl(2.5)Algorithm:ViterbialgorithmInput:aHMMM=(S;Q;Q)andasequenceX2SOutput:themostprobablepathPInitialization:u0(0)=1(2.6)uk(0)=0;fork>0(2.7)Recursion:fori=1;:::;Landforl2Qul(i)=el(xi)maxk2Quk(i1)akl(2.8)ptri(l)=argmaxk2Quk(i1)akl(2.9)Termination:Pr(X;P)=maxk2Quk(L)ak0(2.10)pL=argmaxk2Quk(L)ak0(2.11)Traceback:fori=L;:::;1pi1=ptri(pi)(2.12)162.1.2.2TheForwardAlgorithmTheevaluationproblemcanbeasgivenaHMMM=(S;Q;Q)andasequenceX2S,wewanttocomputetheprobabilityPr(XjM).ThisprobabilitycanbeefcalculatedusingForwardAlgorithm.Thisalgorithmissim-ilartoViterbialgorithmbyreplacingmaximizationstepsbysums.Theforwardvariablefk(i)canbeasfk(i)=Pr(x1x2:::xi;pi=k)(2.13)Here,fk(i)istheprobabilityoftheobservedsequencex1x2:::xi,requiringthatpi=k(xiisgener-atedbypi).Therecursionisasfollow.fl(i+1)=el(xi+1)Sk2Qfk(i)akl(2.14)Algorithm:ForwardalgorithmInput:aHMMM=(S;Q;Q)andasequenceX2SOutput:theprobabilityPr(XjM)Initialization(i=0):f0(0)=1(2.15)fk(0)=0;fork>0(2.16)Recursion:fori=1;:::;Landforl2Qfl(i)=el(xi)Sk2Qfk(i1)akl(2.17)17Termination:Pr(X)=Sk2Qfk(L)ak0(2.18)2.1.2.3TheBackwardAlgorithmForthebackwardalgorithm,thevariableofbackwardissimilartotheforwardvariable.However,thebackwardcalculationisdifferentfromthecalculationofforward.Insteadofcalculatingfromthebeginningofasequence,thebackwardrecursionisstartedattheendofthesequence.Thebackwardvariablebk(i)canbeasbk(i)=Pr(xi+1xi+2:::xL;pi=k)(2.19)wherebk(i)istheprobabilityoftheobservedsequencexi+1xi+2:::xL,giventhatpi=kAlgorithm:BackwardalgorithmInput:aHMMM=(S;Q;Q)andasequenceX2SOutput:theprobabilityPr(XjM)Initialization(i=L):forallkbk(L)=ak0(2.20)Recursion:fori=L1;:::;1andfork2Qbk(i)=Sl2Qaklel(xi+1)bl(i+1)(2.21)Termination:Pr(X)=Sl2Qa0lel(x1)bl(1)(2.22)18Chapter3AUGUSTUS-GC:aGenePredictionModelaccountingfor50-30G+CGradient3.1IntroductionComputationalgenepredictionplaysanimportantroleintheannotationofnewlysequencedgenomes.Thegoalofgenepredictionistoidentifythelocationandstructureofprotein-codinggenesingenomicsequences.TheandannotationofgenesingenomicDNAse-quenceswillhelpusdetectandunderstandtheessentialfunctionalelementsinasequencedgenome.Computationalgenepredictionmethodscanbebroadlydividedintotwomaincategories:abinitiomethodsandhomology-basedmethods.Thecategorycontainsabinitiogenepredictiontoolsthatcanpredictgenesusingthequerysequenceastheonlyinputsequence.MajorabinitiogenepredictiontoolsuseHiddenMarkovModels(HMMs),whicharetheprobabilisticmodelsthatcanrepresentthegenestructure.HMMscanlabeldifferentsegmentsinanunknownsequenceasgenestructureelementssuchasintergenicregion,exons,andintrons.Givenasequence,wecanuseHMMstoinferthemostprobablepathcorrespondingtoanannotationofgenestructure.TheexamplesofabinitiogenepredictiontoolsincludeGENSCAN[14,15],GENEID[82],HMMGene[50],GeneMark.hmm[61],GlimmerHMM[66],FGENESH[91],SNAP[47],andAUGUSTUS[98]etal.Thesecondcategorycontainscomparativegenepredictiontools,which19compareaquerysequencewithhomologoussequencesofrelatedspeciesandemploytheirse-quencesimilarityforgeneannotation.TheexamplesofthesecondgroupincludeGENEWISE[12],GENOMESCAN[113],AGenDA[105,75],TWINSCAN[48],SGP2[81],DOUBLES-CAN[73],CEM[8],SLAM[1],etc.Usingmoreinformationsuchashomologoussequenceshaspotentialtoproducebetterresultsbutrequiresaclosely-relatedspecies.Similarity-basedmethodsperformnotverywellwhenthetargetsequencehasnohomology.Thus,abinitiogenepredictiontoolsplayaroletonovelgenesinthesequenceswithoutaprioriknownhomologies.AsthebasecompositionandtheexonlengthdistributionscandifferforgeneswithdifferentG+Ccontents,mostgenepredictiontoolsemployG+Ccontent-dependenttraining[98,101,100,14,15].,traininggenesaredividedintoseveralsubsetsaccordingtotheirG+Ccontent(e.g.low,medium,andhigh).Thendifferentsetsofmodelparametersarederivedfromdifferentsubsetsoftrainingdata.Foraninputsequence,thegenepredictiontoolchoosestheparametersetwithmeanG+Ccontentclosesttotheinputsequence[98].BoththeoreticalanalysisandempiricalresultshaveshownthatG+Ccontent-dependenttraininggreatlyimprovesthegenepredictionaccuracyandsensitivity.However,thesetoolshavetwolimitationsforgenepredictioningrassgenomessuchasrice,barley,maize,sorghum,andsoon.First,manygrassgenesexhibitasharp50-30decreasingG+Ccontentgradient[43,90],whichisnotcarefullymodeledbyexistinggenepredictiontools.Asaresult,thesetoolshavelimitedsensitivityandaccuracyforpredictinggeneswithG+Cgradients.Second,unlikemanyanimalgenomes,G+Ccontentofgenesinplantgenomesarenotwellcor-relatedwiththecontextG+Ccontent(i.e.lackingisochorestructures)[28];itisthusnottrivialtodeterminewhichparametersetisoptimalforpredictinggenesinaninputsequence.Toaddresstheselimitations,weproposeanewgenepredictionmodelwithtwoadvantages:1)ourmodel20canpredictgeneswithG+Cgradientwithhighersensitivityandaccuracy;2)ourmodelisoptimizedforgenesofvariantG+Ccontentand50-30changingpatterns.3.2MethodInthissection,wedescribeAUGUSTUS-GC,atoolthatpredictsgeneswith50-30G+Cgradient.ThemainnoveltyofourmethodisahiddenMarkovmodel(HMM)thatdistinguishesexonsofdifferentG+Ccontent.TheHMMhasasimilartopologytotheoneusedinAUGUS-TUS[101,100]andmanyotherdenovogenepredictiontools[14,50].ButthemajordifferenceisthatourmodelisdesignedtohandlevariousG+Ccontentand50-30changingpatternsinsidecodingregions.3.2.1TheHiddenMarkovModelofAUGUSTUS-GCWemodifythegeneralizedHiddenMarkovModel(GHMM)fromAUGUSTUS[101],anabinitiogenepredictiontool.TheAUGUSTUS-GCissimilartoAUGUSTUSabinitiogenepredictionmodel.However,insteadofusingonestatetorepresentanexon,wehavethreestatestomodelexonsofhigh,medium,andlowG+Ccontents.Figure3.1andFigure3.2illustratethedifferenceofstatesbetweentheexistingAUGUSTUSHMMandHMMofAUGUSTUS-GCforgeneprediction.Here,wemakeareasonableassumptionthattheG+Ccontentchangeinsideexonsofmulti-exongenesisrelativelysmall.Thisisthetradeoffbetweenmodelcomplexityandtheresolutionofthemodel.Forsingleexongenes,threestates(EHsingle,EMsingle,ELsingle)arecreated.Forinitialexon,threestates(E0initH,E0initM,E0initL)areusedtomodelexonsofhigh,medium,andlowG+Ccontent.Moreover,theinitialexonsofotherphases,theinternalexonsofallphases,andtheterminalexon21Figure3.1:ThestatediagramofabinitioAUGUSTUS[101].Thestatesbeginningwithrrepresentsthereversestrand.Esingle:asingleexon.Einit:theinitialcodingexonofamulti-exongene.DSS:adonorsplicesite.Ishort:anintronemittingatmostdnucleotides.Ied:alongerintronwiththednucleotides.Igeo:alongerintronemittingonenucleotideatatimeafterthednucleotides.ASS:anacceptorsplicesitewithbranchpoint.E:aninternalcodingexonofamulti-exongene.Eterm:thelastcodingexonofamulti-exongene.IR:intergenicregion.Diamondsrepresentthestatesthatemitedlengthstrings.Ovalsrepresentthestatesincludingexplicitlengthdistribution.Thenumbersatthearrowsshowthetransitionprobabilities.Theexponents0,1,and2representthereadingframephase.Foranexonstate,thisisthepositionofthelastbaseoftheexoninitscodon.Fortheotherstates,theexponentarethepreceding-exonphase.Thesmallcirclesrepresentsilentstates.22Figure3.2:ThestatediagramofAUGUSTUS-GC.Thestatesbeginningwithrrepresentsthereversestrand.EHsingle:asingleexonofhighG+Ccontent.EMsingle:asingleexonofmediumG+Ccontent.ELsingle:asingleexonoflowG+Ccontent.EinitH:theinitialcodingexonofamulti-exongenewithhighG+Ccontent.EinitM:theinitialexonofamulti-exongenewithmediumG+Ccontent.EinitL:theinitialexonofamulti-exongenewithlowG+Ccontent.DSS:adonorsplicesite.Ishort:anintronemittingatmostdnucleotides.Ied:alongerintronwiththednucleotides.Igeo:alongerintronemittingonenucleotideatatimeafterthednucleotides.ASS:anacceptorsplicesitewithbranchpoint.EH:aninternalcodingexonofamulti-exongenewithhighG+Ccontent.EM:theinternalexonofamulti-exongenewithmediumG+Ccontent.EL:theinternalexonofamulti-exongenewithlowG+Ccontent.EHterm:thelastcodingexonofamulti-exongenewithhighG+Ccontent.EMterm:theterminalexonofamulti-exongenewithmediumG+Ccontent.ELterm:theterminalexonofamulti-exongenewithlowG+Ccontent.IR:intergenicregion.Diamondsrepresentthestatesthatemitedlengthstrings.Ovalsrepresentthestatesincludingexplicitlengthdistribution.Thenumbersatthearrowsshowthetransitionprobabilities.Theexponents0,1,and2representthereadingframephase.Foranexonstate,thisisthepositionofthelastbaseoftheexoninitscodon.Fortheotherstates,theexponentarethepreceding-exonphase.Thesmallcirclesrepresentsilentstates.23allhavethreestatesforhigh,medium,andlowG+Ccontent.GenesofvariantG+Cchangingpatternscanberepresentedbythenewexonstates.TheaddedexonstatesallowtheHMMtopredictgenesofvariousG+Ccontentandchangingpatternswithhigheraccuracy.Forexample,genesofnegativeG+CgradienttendtoberepresentedbyapathstartingwithEHandendingwithEL.GeneswithhighG+CcontentandmoderategradienttendtobeproducedbyapathmainlyconsistingofEH.ForpurposeofgenethestatesofAUGUSTUS-GCinthestatesetQareconsis-tentwithmeaningofbiologysuchasintron,exon,andsplicesite,etc.Weallowastatetransitingtootherstatesinbiologicallymeaningfulways.Forexample,anintronmustfollowandonorsplicesite.themodelofAUGUSTUS-GChasthefollowing79states:Q=fIR;EHsingle;EMsingle;ELsingle;EHterm;EMterm;ELterm;rEHsingle;rEMsingle;rELsingle;rEHinit;rEMinit;rELinitg[2[i=0fEiinitH;EiinitM;EiinitL;DSSi;Iishort;Iied;Iigeo;ASSi;EiH;EiM;EiLg[2[i=0frEitermH;rEitermM;rEitermL;rDSSi;rIishort;rIied;rIigeo;rASSi;rEiH;rEiM;rEiLg(3.1)Figure3.2showsthestatesinthestatesetQ.IntheupperhalfofFigure3.2,thestatesrepresentprotein-codinggenesontheforwardstrand.ThestateIRstandsfortheintergenicregion.Thediamond-shapedstatesarethestatesemittinged-lengthstringsatatime.Theoval-shapedstatesarethestatesemittingvariable-lengthstrings.ThestateEHsingle,EMsingle,andELsinglerepresentagenecontainingjustonlyoneexonofhigh,medium,andlowG+Ccontents.Otherstatesintheupperhalfassociatewithgeneswithseveralexons.Regardingthereadingframeposition,threestatesforeachtypeofexonofhigh,medium,andlowG+CcontentsareshownexcepttheterminalexonEHterm,EMterm,andELterm.TheinitialexonstatesEiinitH,EiinitM,EiinitLfori=0;1;and2arerelatedto24thecodingexonofagene.TheinternalexonstatesEiH,EiM,EiLfori=0;1;and2arerelatedtotheinteriorcodingexonsofagene.Thesuperscriptifori=0;1;and2denotesthephaseofreadingframewheretheexonends.Thesuperscript0meansthattheendoftheexonisattheboundaryofacompletecodon.Thesuperscript1representsthattheboundaryofcodonis1positionpriortotheexonend.Thesuperscript2meanstheboundaryofcodonis2positionspriortotheexonend.DSSk(0k2)isthestateofdonorsplicesite.ASSk(0k2)representthestateofacceptorsplicesite.Ikshort(0k2)isthestateofintronwiththelengthatmostdbases.Ikedisthestateofalongerintronwiththedbases.Ikgeoisthestateofalongerintronwithemittingsinglenucleotidesatatimeafterthedbases.Thesuperscriptkdenotesthephaseofreadingframeofpreviousexon.InthelowerhalfofFigure3.2,thestatesrepresentprotein-codinggenesonthereversestrand.Foreachstateofreversestrand,thenamebeginswith'r'.Theyhavetheconsistentbiologicalmeaningwiththestatesontheforwardstrand.Thesuperscriptonthereversestrandrepresentsthephaseofthereadingframeattherightmostpositionofexon.Thus,theyhaveninestatesforaterminalexonandthreestatesforaninitialexonconsideringhigh,medium,andlowG+Ccontents.InFigure3.2,thearrowsrepresentthetransitionsbetweenstatesinthestatesetQhavingnon-zeroprobabilities.ThetransitionsfromandtotheintronstatesarethesameasthoseofstatesdescribedintheAUGUSTUSmodel[101].Figure3.1showsthestatemodeloforiginalAUGUSTUS[101].ThetransitionprobabilitiesareedvaluesgiveninFigure3.1roundedupto6decimalplaces.ForAUGUSTUS-GC,weconsidertwostrategiesforcomputingtransitionprobabilities.First,weuseaverysimplestrategybydividingtheoldprobabilitiesofAUGUSTUSequallyforthreestatesofhigh,medium,andlowG+Ccontents.Thisstrategycanbeusedwhenwehaveverylimitedtrainingdata.Ourhypothesisisthattheystartwithequalprobabilities.Figure3.2presentsthetransitionprobabilitiesofstrategy.Thesecondstrategyisamorestandardmethod.25Weconsidertrainingdataandcomputethetransitionprobabilitiesusingthemaximumlikelihoodestimation.Letaklbethetransitionprobabilityfork;l2Q,Aklbethenumberoftimesthattransitsfromthestatektostatelintrainingdata.Themaximumlikelihoodestimatorisasakl=Aklåq2QAkq(3.2)Forsomestates,therearemanytransitions.Whenwehavemanytransitions,manyparameters,weneedalargenumberoftrainingsamples.Otherwise,thetrainingisverybiased.Toavoidzeroprobabilitiesduetosparse/insuftrainingdata,weusepseudocountsbyaddingoneextracountstotheobservedfrequenciestopriorbiasesregardingtheprobabilityvalues.Givenpseudocountsrkl,weA0klasA0kl=Akl+rkl(3.3)Usually,inLaplacemethod,allrklequalto1.TheperformancecomparisonsoftwostrategiesforcomputingtransitionprobabilitiesareshowninExperimentResultssection.Foreachexonstateonbothforwardstrandandreversestrand,theexonlengthdistributionofEHsingle,EMsingle,ELsingle,EiinitH,EiinitM,EiinitL,EiH,EiM,EiL,EHterm,EMterm,ELterm,rEHsingle,rEMsingle,rELsingle,rEitermH,rEitermM,rEitermL,rEiH,rEiM,rEiL,rEHinit,rEMinit,andrELinitfori=0;1;and2arecomputedonthecorrespondingexons,respectively.Similarly,differentinhomogeneouskth-orderMarkovmodels(bydefaultk=4)foreachexonstatearederivedseparately.263.3ExperimentalResultsToevaluatetheperformanceofAUGUSTUS-GC,wetestedAUGUSTUS-GConthreedifferenttestsetsinplantgenomes:ArabidopsisThaliana[7],OryzaSativaobtainedfromChildsLab,DepartmentofPlantBiology,MichiganStateUniversityandOryzaSativaprovidedbyStankeetal.[99].WecomparedourresultstotheregularAUGUSTUSabinitiogeneprogram.OurmethodisdesignedforgeneswithG+Cgradient.Ithasbeeninvestigatedthatmanygrassgenespresentasharp50-30decreasinggradientofG+Ccontent.Figure3.3:TheexampleofagenewithG+CcontentgradientinOryzaSativadataset.X-axisrepresentsexonindex.Y-axisrepresentsG+Ccontent.Figure3.3showsanexampleofdescendingslopeofG+CcontentinOryzaSativadatasetbystartingwithhighG+CcontentandendingwithlowG+Ccontent.WealsotestedAUGUSTUS-GConArabidopsisThaliana[7].ArabidopsisThalianaisnotgrassgenome.ItmaynotexhibitG+Ccontentgradient.However,ArabidopsisThalianahasagoodqualityofgenomeannotation.Thus,wefocusontestingAUGUSTUS-GCinbothArabidopsisThalianaandOryzaSativa.WeexpecttoseeabiggerimprovementofgenepredictionperformanceinOryzaSativathaninArabidopsisThaliana.Toquantifytheperformanceofgenepredictiontools,weevaluatedthegenepredictionaccu-27racybyusingtwomeasures:sensitivityand.Wetheperformanceofgenepredictionprogramsatthreedifferentlevels:thenucleotidelevel,theexonlevel,andthegenelevel.WeadoptedcommonlyusedmetricsfromStankeandWaack[101].ThesensitivityandtheareasSensitivity(Sen)=TPTP+FN(3.4)Specificity(Spe)=TPTP+FP(3.5)whereTPrepresentsthenumberofcorrectlypredictedfeatures(codingnucleotides,exons,orgenes).FNrepresentsthenumberofannotatedfeaturesthatarenotthepredictedfeatures.FPrepresentsthenumberofpredictedfeaturesthatarenottheannotatedfeatures.Ateachlevel,twobasicmetricsarebeingused.Sensitivityistheproportionofcorrectlypredictedfeaturesinthesetofallannotatedfeatures.istheproportionofcorrectlypredictedfeaturesinthesetofallpredictedfeatures.Attheexonlevel,apredictedexonwillbecorrectifbothsplicesitesareidenticaltothetruepositionofanexon.Atthegenelevel,apredictedgeneisconsideredcorrectlypredictedifallcodingexonsarecorrectlyandnoadditionalexonsareinthegene.Thepredictedpartialgenesareconsideredasthepredictedgenes.Theforwardandreversestrandsareconsideredasdifferentsequences.Of,theofisasSpecificity(Spe)=TNTN+FP(3.6)However,becausetheamountofnon-codingnucleotidesinaDNAsequenceismuchgreaterthantheamountofcodingnucleotides,TNbecomesmuchlargerthanFP.Hence,ingenepredictionhastypicallybeencalculatedbytheof.283.3.1GenePredictioninArabidopsisThalianaToshowtheutilityofAUGUSTUS-GC,thetrainingdatasetwasdownloadedfromStanke'sweb-site[7].WetrainedthemodelonArabidopsisThalianadataset.Thedatasetcontained249genbanksequences.Thereweretwosingleexongenesand247multi-exongenesinthedataset.3.3.1.1TrainingtheHMMmodelTotrainourHMM,wecalculatedG+Ccontentsforalloftheexonsandthemashigh,medium,andlowusingcutoffs.ForAUGUSTUS-GC,wehavetwocutoffs:LowTandhighT.AnexonwillbeaslowG+CcontentgroupiftheG+CcontentofexonislowerthanLowT.Similarly,anexonwillbeashighG+CcontentgroupiftheG+CcontentofexonisabovethehighT.AnexonwillbeasmediumG+CcontentgroupiftheG+CcontentofexonishigherthanorequaltolowTandlowerthanhighT.Table3.1showslowTandhighTthatweusedintheexperiments.TherearethreegroupsofexonsaccordingtoG+CcontentsthatweForeachgroup,ithaddifferentparametersets.Weused10-foldcross-validation.Itdividedthedatasetrandomlyinto10subsets.Theevaluationmethodisrepeated10times.Eachround,oneof10subsetsisevaluatedasthetestsetandtheotherninesubsetsareputtogetherfortraining.Thentheaveragepredictionaccuracyofall10trialsiscalculated.Forcomputingtransitionprobabilities,wetestedontwostrategies.First,weequallydividedtheprobabilitiesofAUGUSTUSforthreestatesofhigh,medium,andlowG+Ccontents.TheresultsbyusingthestrategyareshowninTable3.1.Second,weusedmaximumlikelihoodestimationtocalculatetransitionprobabilities.TheresultsbyusingthesecondstrategyforcomputingthetransitionprobabilitiesareshowninTable3.2.Moreover,thelengthdistributionforeachnew29exonstatewascomputed.Correspondingly,wecalculatedkth-orderMarkovModel(bydefaultk=4)foreachnewexonstate.Figure3.4illustratesthedistributionofexonsbytheirG+Ccontentfor1431exonsinArabidopsisThalianadataset.Figure3.4:G+CContentofArabidopsisThalianadataset3.3.1.2PerformancecomparisonbetweendifferentgenepredictiontoolsWetestedoriginalAUGUSTUSandAUGUSTUS-GConthetestingdatasetofArabidopsisThaliana[7].Thetestingdatasetwasdifferentfromthetrainingdataset.Therewere74genbanksequenceswith168genesonforwardandreversestrand.OurprogramwasfromAUGUSTUSversion2.4downloadedfrom[6].BothoriginalAUGUSTUSandAUGUSTUS-GCweretestedusingdefaultinputparameters.Table3.1showsthecomparisonoftheaccuracyofAUGUSTUSandAUGUSTUS-GCwithdifferentthresholdsofG+CcontentsonthesetofArabidopsisThaliana.Weusedthestrat-30Table3.1:PerformancecomparisonofgenepredictiontoolsonArabidopsisThaliana.Thetran-sitionprobabilitiesweredividedintothreeequalportions.BoldnumberindicatesthatsensitivityorofAUGUSTUS-GCarehigherthanthoseofAUGUSTUS.ProgramAUGUSTUSAUGUSTUS-GClowT=0.47lowT=0.30lowT=0.30lowT=0.60highT=0.63highT=0.60highT=0.70highT=0.70BaseSen0.9680.9620.9630.9620.963levelSpe0.7080.7090.710.710.71ExonSen0.870.8480.8480.8450.848levelSpe0.6660.6690.6790.680.679GeneSen0.5540.5650.5480.5480.548levelSpe0.3520.360.3540.3540.354Table3.2:PerformancecomparisonofgenepredictiontoolsonArabidopsisThaliana.Thetransi-tionprobabilitieswastrainedbycomputingmaximumlikelihoodestimation.Boldnumberindi-catesthatsensitivityorofAUGUSTUS-GCarehigherthanthoseofAUGUSTUS.ProgramAUGUSTUSAUGUSTUS-GClowT=0.47lowT=0.30lowT=0.30lowT=0.60highT=0.63highT=0.60highT=0.70highT=0.70BaseSen0.9680.960.9720.9720.972levelSpe0.7080.7110.7090.7090.709ExonSen0.870.8510.8820.8820.882levelSpe0.6660.6740.6770.6770.677GeneSen0.5540.560.5650.5650.565levelSpe0.3520.3460.3510.3510.35131egyforcomputingthetransitionprobabilities.Thetransitionprobabilitieswereequallyseparatedintothreestatesofhigh,medium,andlow.TheseexperimentalresultsshowthatAUGUSTUS-GCachievedbettersensitivityandforgenelevel.Forbaselevelandexonlevel,AUGUSTUS-GChashigherthanAUGUSTUS.Inaddition,weshoweddifferentcutoffsforAUGUSTUS-GC.ChangingcutoffsforG+Ccontentscanimprovetheperformance.Figure3.5:AnexampleofapredictedgenewithG+CcontentgradientofArabidopsisThalianadataset.ThisgenewaspredictedbyAUGUSTUS-GC.X-axisrepresentsexonindex.Y-axisrepresentsG+Ccontent.WefurtherimplementedtheAUGUSTUS-GCbycomputingthetransitionprobabilitiesbasedonthetrainingset.Thetransitionprobabilitiesfromintrontodifferentexonstateswerecomputedusingmaximumlikelihoodestimation.Table3.2presentsthecomparisonofaccuracyofAU-GUSTUSandAUGUSTUS-GCwithdifferentcutoffsandtrainedtransitionprobabilities.Usingmaximumlikelihoodestimationforcomputingthetransitionprobabilitiesgavethebetteroverallperformance.WeconductedadditionalanalysisusingtheresultsoflowT=0.30andhighT=0.60.Weobservedthat,forsomesequences,wecanpredictmoregenesandwecanidentifycorrectlyposition.Forexample,wecandetectthegeneinFigure3.5becauseitcontainedG+Ccontentgra-dient.However,theregularAUGUSTUScannotidentifyitcorrectly.Moreover,wecomparedtheG+CchangebetweentheuniquelyfoundgenesbyourtoolandthecommononessharedbyAU-32GUSTUSandAUGUSTUS-GC.Theresultsshowedthattherewere14uniquelygenesbyourtooland149sharedgenes.WefurtherexaminewhetherourtoolcangeneswithmoreG+Cchange.Wecomputedthestandarddeviation(SD)andthedistancebetweentheexonthathashighestG+CcontentandtheexonthathaslowestG+Ccontentforeachprotein-codinggeneandreportedtheaverageofallgenes.TheexperimentalresultsdemonstratedthattheaverageSDoftheuniquelypredictedgeneswas0.046.However,theaverageSDofthecommononeswas0.033whichissmallerthantheuniquelypredictedgenes.Also,theaveragedistancebetweentheexonthathashighestG+CcontentandtheexonthathaslowestG+Ccontentforeachprotein-codinggeneoftheuniquelyfoundgeneswas0.111buttheaveragedistanceofthecommongeneswas0.087.Theaveragedistanceofourtooliswiderthanthatofcommonones.3.3.2GenePredictioninOryzaSativaForOryzaSativadatasets,weconductedtwoexperiments.FirstOryzaSativadatasetwaspro-videdbyChildsLab,DepartmentofPlantBiology,MichiganStateUniversity.SecondOryzaSativadatasetwasobtainedfromStankeetal.[99].Thesetwodatasetswereconstructedfromdifferentways.Thegroundtruthsequencesoftwodatasetsweredifferent.Thedatasetissmallerthantheseconddataset.Figure3.6presentsthedistributionofexonsbytheirG+Ccontentfor844exonsinOryzaSativadataset.Figure3.7showsthedistributionofexonsbytheirG+Ccontentfor16199exonsinsecondRicedataset.AccordingtoFigure3.6andFigure3.7,theG+CcontentchangefortheexonshaswiderrangethanthatofArabidopsisThaliana.Ourhypothesisisthatinsomegrassgenessuchasrice,theyhaveG+Cgradientofexonsratherthanconsistent.Thus,weconductedtheexperimentstotesthowourtoolperformedtohandlethechangeofG+Ccontentforeachseparatedexons.33Figure3.6:G+CContentofthedatasetofOryzaSativa3.3.2.1GeneinOryzaSativaobtainedfromChildsLab,DepartmentofPlantBiology,MichiganStateUniversityIntheexperimentofOryzaSativadataset,wetestedtheperformanceofgenetoolsinOryzaSativadatasetobtainedfromChildsLab,DepartmentofPlantBiology,MichiganStateUniversity.ThisdatasetisapartofMSURiceGenomeAnnotationProject[79,114].Thetrainingdatasetconsistedofasetof150genbanksequenceswith11single-exongenesand139multi-exongenesonforwardstrandandreversestrand.Again,weused10-foldcrossvalidationstrategyfortraining.Weexaminedtheaccuracyofgenepredictionmethods.Therewere150genbanksequenceswith13singleexongenesand137multi-exongenesonforwardandreversestrandfortestingdataset.Ingrassgenomes,thesharp50-30decreasingG+Ccontentgradientisexhibitedasweshowed34Figure3.7:G+CContentofthesecondOryzaSativadatasetinFigure3.3.Therefore,weassessedtheperformanceonOryzaSativadatasettodemonstratetheutilityofourAUGUSTUS-GC.WecomparedAUGUSTUS-GCwithAUGUSTUSinTable3.3.Inthisexperiment,weob-servedchangingpatternsofG+Ccontentofexonsinsideeachgene.Forexample,somegenestendtostartwithhighG+CcontentofexonandendwithlowG+Ccontentofexon.Astheresult,AUGUSTUS-GCachievedhighersensitivitythanAUGUSTUS.AUGUSTUS-GCcanimproveboththesensitivityandtheatalllevelsusingacutoffoflowG+Ccontent(0.40)andthecutoffofhighG+Ccontent(0.60).Withtrainingtransitionprobabilitiesusingmaximumlike-lihood,theresultsareshowninTable3.4.Furthermore,wecomparedchangesinG+CcontentsoftheuniquelygenesbyAUGUSTUS-GCandthecommongenessharedbyAUGUS-TUSandAUGUSTUS-GC.TherewerefouruniquelypredictedgenesbyAUGUSTUS-GCand143commongenes.WecomparedtheuniquelygenesandcommononesintermsofSD35Table3.3:PerformancecomparisonofgenepredictionapproachesonOryzaSativaobtainedfromChildsLab.Thetransitionprobabilitiesweredividedintothreeequalparts.Boldnumberindi-catesthatsensitivityorofAUGUSTUS-GCarehigherthanthoseofAUGUSTUS.ProgramAUGUSTUSAUGUSTUS-GClowT=0.39lowT=0.35lowT=0.50lowT=0.40highT=0.61highT=0.61highT=0.60highT=0.60BaseSen0.8390.840.8310.9210.841levelSpe0.8920.9020.8980.8830.901ExonSen0.6130.6170.5890.6980.617levelSpe0.6940.7330.7150.6920.725GeneSen0.260.280.2670.2670.287levelSpe0.2350.2610.250.2340.267Table3.4:PerformancecomparisonofgenepredictionapproachesonOryzaSativaobtainedfromChildsLab.Thetransitionprobabilitieswascalculatedbyusingmaximumlikelihoodestimation.BoldnumberindicatesthatsensitivityorofAUGUSTUS-GCarehigherthanthoseofAUGUSTUS.ProgramAUGUSTUSAUGUSTUS-GClowT=0.39lowT=0.35lowT=0.50lowT=0.40highT=0.61highT=0.61highT=0.60highT=0.60BaseSen0.8390.850.8470.9230.847levelSpe0.8920.8980.8980.8760.899ExonSen0.6130.6330.620.7070.629levelSpe0.6940.7320.7160.670.727GeneSen0.260.2530.2530.2670.267levelSpe0.2350.2350.2350.2270.24736Figure3.8:ApredictedgeneonforwardstrandwithG+CcontentgradientoftheOryzaSativadataset.X-axisrepresentsexonindex.Y-axisrepresentsG+Ccontent.Figure3.9:ThreepredictedgenesonreversestrandwithG+CcontentgradientoftheOryzaSativadataset.X-axisrepresentsexonindex.Y-axisrepresentsG+Ccontent.andthedistancebetweentheexonthathashighestG+CcontentandtheexonthathaslowestG+Ccontentforeachprotein-codinggene.TheresultsshowedthattheaverageSDoftheuniquelypre-dictedgenes(0.098)washigherthanthatofcommmongenes(0.075).Also,theaveragedistanceoftheuniquelyfoundgenesbyAUGUSTUS-GC(0.241)waswiderthanthatofcommongenes(0.171).Figure3.8andFigure3.9illustratesthegenesthatAUGUSTUS-GCcandetectaccurately,buttheregularAUGUSTUScannotidentifyitcorrectly.BecausetheycontainedG+Ccontentgradientinsideeachgene,theG+Ccontentofexonchangeratherthansimilar.TheexistinggenepredictiontoolsdonotworkwellforthiscasebecausetheyassumedtheG+Ccontentsofexonsinsideageneareconsistent.373.3.2.2FindinggenesinOryzaSativaretrievedfromStankeetalInthesecondexperimentofOryzaSativadataset,theRicetestsetwasretrievedfromStankeetal.[99].Thisisthebriefdescriptionabouthowtheypickedthosegenes.First,theymadeagenbankfromthegenomeandthegffSecond,theyconstructedasetwithbothUTRsannotated.Then,theygeneratedselectedgeneswithbothUTRsandCDSplausible(fromavisualinspectioninJBrowseagainstpanicleandleafRNA-SeqSTARalignments).Fourth,theytrainedthedatasettoidentifygeneswitherrorsonlyandremovedthesequenceswitherrors.Finally,the1000genesconsistof128manuallyselectedand872randomlychosenones(fromamonggeneswithbothUTRsannotated).TheOryzaSativaparametersetwastrainedonaselectionof800geneswithUTRsfromphytozome.Fortrainingdataset,therewere187singleexongenesand613multi-exongenesonforwardandreversestrand.Toassesstheperformance,200geneswhichwereasubsetofthe1000genesweretested.Thetestingconsistedof40single-exongenesand160multi-exongenesonforwardandreversestrands.Table3.5showsthesummaryoftheaccuracyresultsofAUGUSTUSandAUGUSTUS-GConthetestset.ByapplyingAUGUSTUS-GC,thesensitivityofAUGUSTUS-GCatbaselevelwasenhancedfrom0.859to0.942for0.31lowG+Ccontentcutoffand0.52highG+Ccontentcutoff.However,ofAUGUSTUSisslightlybetterthanthatofAUGUSTUS-GCfor0.31lowG+Ccontentcutoffand0.52highG+Ccontentcutoff.Moreover,AUGUSTUS-GChadthebettersensitivitythanAUGUSTUSatexonlevel.Atexonlevel,thesensitivityofAUGUSTUS-GCwasimprovedfrom0.67to0.768for0.31lowG+Ccontentcutoffand0.52highG+Ccontentcutoff.Atthegenelevel,AUGUSTUS-GChadbettersensitivityand(0.4and0.211,respectively).WecomputedthetransitionprobabilitiesusingmaximumlikelihoodinTable3.6.ThesensitivityandofAUGUSTUS-GCwereimproved.38Table3.5:PerformancecomparisonofgenepredictiontoolsonOryzaSativaprovidedbyStankeetal.Thetransitionprobabilitiesweredividedintothreeequalparts.BoldnumberindicatesthatsensitivityorofAUGUSTUS-GCarehigherthanthoseofAUGUSTUS.ProgramAUGUSTUSAUGUSTUS-GClowT=0.31lowT=0.49lowT=0.30lowT=0.60highT=0.52highT=0.52highT=0.50highT=0.70BaseSen0.8590.9420.950.9370.84levelSpe0.6190.6070.5970.590.622ExonSen0.670.7680.7810.7480.63levelSpe0.5520.5460.5530.520.559GeneSen0.3550.40.40.3550.365levelSpe0.1910.2110.2050.1770.204Table3.6:PerformancecomparisonofgenepredictiontoolsonOryzaSativaprovidedbyStankeetal.Thetransitionprobabilitieswastrainedusingmaximumlikelihoodestimation.BoldnumberindicatesthatsensitivityorofAUGUSTUS-GCarehigherthanthoseofAUGUSTUS.ProgramAUGUSTUSAUGUSTUS-GClowT=0.31lowT=0.49lowT=0.30lowT=0.60highT=0.52highT=0.52highT=0.50highT=0.70BaseSen0.8590.9550.9450.9480.858levelSpe0.6190.6070.6010.5860.62ExonSen0.670.7980.7690.7650.665levelSpe0.5520.5650.5440.5720.547GeneSen0.3550.4250.360.350.37levelSpe0.1910.2170.1860.170.20439WeconductedadditionalanalysisusingtheresultsoflowT=0.31andhighT=0.52.WefoundthatAUGUSTUS-GCcandetectthegenescontainingG+Ccontentgradient.Inaddition,therewere26uniquelypredictedgenesbyAUGUSTUS-GCand163commongenessharedbyAUGUS-TUSandAUGUSTUS-GC.ComparinguniquelypredictedgeneswithcommongenesbyaverageSDandtheaveragedistancebetweentheexonthathashighestG+CcontentandtheexonthathaslowestG+Ccontent.TheexperimentalresultsshowedthatuniquelypredictedgeneshadhigherSDthancommongeneswhichwere0.099and0.080,respectively.Besides,theaveragedistancebetweentheexonthathashighestG+CcontentandtheexonthathaslowestG+Ccontentforeachprotein-codinggeneoftheuniquelyfoundgeneshadlargerthanthatofthecommongeneswhichwere0.211and0.171,respectively.Figure3.10illustratesagenepredictedbyAUGUSTUS-GC.However,regularAUGUSTUScannotidentify.ThisgeneisonreversestrandwithG+Ccontentgradient.Thisisstrongevi-dencethatourtoolcanpredictthegeneasweexpected.Figure3.11,Figure3.12,Figure3.13,Figure3.14,andFigure3.15aretheexamplesofthegenesthathaveG+Ccontentgradient.Thesegeneswerepredictedbyourtool,AUGUSTUS-GC.However,theregularAUGUSTUScannotidentifythemcorrectly.40Figure3.10:AnexampleofauniquelypredictedgeneonreversestrandwithG+CcontentgradientofsecondOryzaSativadataset.However,regularAUGUSTUScannotdetectit.X-axisrepresentsexonindex.Y-axisrepresentsG+Ccontent.Figure3.11:FirstexampleofapredictedgeneonforwardstrandwithG+CcontentgradientofsecondOryzaSativadataset.X-axisrepresentsexonindex.Y-axisrepresentsG+Ccontent.41Figure3.12:SecondexampleofapredictedgeneonforwardstrandwithG+CcontentgradientofsecondOryzaSativadataset.X-axisrepresentsexonindex.Y-axisrepresentsG+Ccontent.Figure3.13:FirstexampleofapredictedgeneonreversestrandwithG+CcontentgradientofsecondOryzaSativadataset.X-axisrepresentsexonindex.Y-axisrepresentsG+Ccontent.42Figure3.14:SecondexampleofapredictedgeneonreversestrandwithG+CcontentgradientofsecondOryzaSativadataset.X-axisrepresentsexonindex.Y-axisrepresentsG+Ccontent.Figure3.15:ThirdexampleofapredictedgeneonreversestrandwithG+CcontentgradientofsecondOryzaSativadataset.X-axisrepresentsexonindex.Y-axisrepresentsG+Ccontent.43Chapter4SensitiveShortReadHomologySearchforPaired-EndReadSequencingData4.1IntroductionHomologysearchhasbeenoneofthemostwidelyusedmethodsforinferringthestructureandfunctionofnewlysequenceddata.Forexample,thestate-of-the-arthomologysearchtool,HMMER[29]hasbeensuccessfullyappliedforgenome-scaledomainannotation.Themajorho-mologysearchtoolsweredesignedforlongsequences,includinggenomiccontigs,near-completegenes,orlongreadsproducedbyconventionalsequencingtechnologies.Theyarenotoptimizedfordataproducedbynext-generationsequencing(NGS)platforms.Forreadsproducedbypy-rosequencingormorerecentPacBioandnanoporetechnologies,frameshiftcausedbysequencingerrorsarethemajorchallengesforhomologysearch.FordatasetsproducedbyIllumina,shortreadswillleadtomarginalalignmentscoresandthusmanyreadscouldbemissedbyconventionalhomologysearchtools.InordertoapplyhomologysearcheffectivelytoNGSdataproducedbyIl-PrapapornTecha-Angkoon,YanniSun,andJikaiLei,flImproveShortReadHomologySearchusingPaired-EndReadInformationfl,ProceedingsofInternationalSymposiumonBioinformaticsResearchandApplications(ISBRA2016),Minsk,Belarus,June5-8,2016.(AppearedinLectureNotesforBioinformatics(LNBI)9683)PrapapornTecha-Angkoon,YanniSun,andJikaiLei,flSensitiveShortReadHomologySearchforPaired-EndReadSequencingDatafl,Underreview44lumina,manyofwhichcontainshortreads,readmappingordenovoassembly[84,76,42,117,58]isemployedtoassembleshortreadsintocontigs.Thenexistinghomologysearchtoolscanbeappliedtothecontigstoinferfunctionsorstructures.However,itisnotalwaysfeasibletoobtainassembledcontigsfromshortreads.Forexample,complexmetagenomicdataposesseriouscomputationalchallengesforassembly.Just1gramofsoilcancontain4petabasepairs(11015bps)ofDNA[115]andtensofthousandsofspecies.Readmappingisnotveryusefulinthenativegenomesorgenesofthesereadsasmostreferencegenomesarenotavailable.Denovoassemblyalsohaslimitedsuccessduetothecom-plexitiesandlargesizesofthesedata[108,42,117].Besidesmetagenomicdata,whichusuallylackcompletereferencegenomes,RNA-Seqdataofnon-modelspeciesalsofacessimilarcompu-tationalchallenges.AssemblingshortreadsintocorrecttranscriptswithoutusinganyreferencegenomeiscomputationallydifThus,inordertoanalyzetheNGSdatawithoutreferencegenomes,awidelyadoptedmethodforfunctionalanalysisistoclassifyreadsintocharacterizedfunctionalclasses,suchasprotein/domainfamiliesinPfam[31,86],TIGRFAM[35],FIGfams[72],InterProScan[116],FOAM[85],etc.Thereadassignmentisusuallyconductedbysequencehomologysearchthatcomparesreadswithreferencesequencesori.e.,afamilyofhomologousreferencesequences.Therepresen-tativetoolsforsequencehomologysearchandhomologysearchareBLAST[3]andHM-MER[29],respectively.homologysearchhasseveraladvantagesoverpairwisealignmenttoolssuchasBLAST.First,thenumberofgenefamiliesissmallerthanthenumberofsequences,renderingmuchfastersearchtime.Forexample,thereareonlyabout13,000manuallycuratedproteinfamiliesinPfam,butthesecovernearly80%oftheUniProtKnowledgebaseandthecoverageisincreasingeveryyearasenoughinformationbecomesavailabletoformnewfam-ilies[86].ThenewestversionofHMMER[29]ismoresensitivethanBLASTandisabout10%45faster.Second,previouswork[27]hasdemonstratedthatusingfamilyinformationcanimprovethesensitivityofaremoteproteinhomologysearch,whichisveryimportantformetagenomicanalysisbecausemanydatasetscontainspeciesremotelyrelatedtoonesinthereferencedatabase.4.1.0.3IsHMMERgoodenoughforshort-readhomologysearch?HMMERhasbeensuccessfullyusedingenome-scaleproteindomainannotationinmanyspecies.Ithasbothhighyandsensitivityinidentifyingdomains.Thus,itisalsowidelyadoptedforhomologysearchinanumberofexistingNGSanalysispipelinesorwebsites(e.g.IMG/M[41],EBImetagenomicsportal[74],CoMet[59],etc.).However,HMMERisnotopti-mizedforshort-readhomologysearches.Shortreadssequencedfromregionsoflowconservationtendtobemissed.WehavetheperformanceofHMMERonseveralNGSdatasetsandtheresultsshowedthatHMMERhasmuchlowersensitivitythanwhenappliedtocompletegenesorgenomes.Belowwedescribeoneoftheexperiments.Table4.1:TheperformanceofHMMERunderdifferentcutoffsonclassifyingreadsintheAra-bidopsisThalianaRNA-Seqdataset.Thereare2,972,809paired-endreadsbeinguniquelymappedto3,577annotatedproteindomainfamiliesinthereferencegenome.However,HMMERunderE-valuecutoff10missedatleasthalfofthereadswhenaligningthesereadstoinputdomains.ThealignmentsbyHMMERcanbedividedintothreecases.Case1:onlyoneendcanbealignedtooneormultipledomains.Case2:bothendscanbealignedtooneormultipledomains.Case3:noendisalignedtoanydomain.Inthetable,thepercentageofacasecanberepresentedbyeachnumber.fiHMMERw/orepresentsshuttingoffallstepsandrunningfullFor-ward/Backward.fiHMMERGAcutoffflrepresentsapplyingHMMERwithgatheringthresholds.CaseHMMERHMMERHMMERunderE-value10w/oGAcutoffunderE-value10Case134.51%32.83%22.51%Case228.42%31.58%8.84%Case337.07%35.59%68.65%WeappliedHMMERtoannotateproteindomainsinreadssequencedfromanormalizedcDNA46libraryofArabidopsisThaliana[68,119].Therewereatotalof9,559,784paired-endreadsof76bpbyIllumina.Theknownreferencegenome,thegeneanddomainannotations,andthereadmappingresultsarecombinedtodeterminethetruemembershipofreads.Wedownloadedallcodingsequences(CDS)ofArabidopsisThalianafromTAIR10[106].WethepositionsofdomainsinCDSusingHMMERwithgatheringthresholds(GAs)byaligningCDSto3,962plant-relatedPfamdomains[31].Then,bowtieallowinguptotwomismatches[54]wasusedtomapseparatelypaired-endreadstoCDS.WecomparedthepositionsofuniquelymappedreadsinCDStoannotateddomainsinCDS.Wechoseallreadswithbothendsbeinguniquelymappedtoaproteindomainasthetruepositiveset.ThenthesensitivityofHMMER,whichhowmanyofthesetruepositivereadscanbewascomputed.Therewere2,972,809readpairsinthistruepositiveset.TheiralignmentsbyHMMERunderdifferentcutoffscanbedividedintothreecases.Case1:onlyoneendcanbealigned.Case2:bothendscanbealignedtothecorrespondingproteinfamilybyHMMER.Case3:neitherendcanbealigned.TheresultsofthisexperimentwereshowninTable4.1.Case2istheidealcase.Butitonlytakesaboutone-thirdofallreadpairs.Turningoffdoesnotimprovethepercentageofcase2.Usinggatheringthresholds(GA)cutoffisrecommendedforaccuratedomainannotationingenomes.However,near70%ofreadpairscannotbealignedunderGAcutoff.Acloserlookrevealsthatformanyofthosereads,themissingendisusuallysequencedfromaregionwithlowsequenceconservationwiththeunderlyingproteinfamily.Byusingthewhole-genealignmentagainsttheproteinfamilyandthereadmappingpositionsonthegene,werevealthealignmentofthereadsagainstthemodel.OneexampleisshowninFigure4.1.Inthisexample,oneendr1canbealignedtothedomainusingHMMERwithon.However,theotherendr2cannotbeoutputbyHMMERbecauseofitspooralignment.Inordertoimprovethesensitivity,onemayconsidertouseloosecutoffssuchasalowscore47Figure4.1:Anexampleofaproteinfamily,itsalignmentwithagene,andreadmappingposi-tionsofareadpairagainstthegene.ThePkinasemodelhadannotationlineofconsensusstruc-ture.ThelinebeginningwithPkinaseistheconsensusofthequerymodel.Capitallettersshowpositionsofthemostconservation.Dots(.)inthislinerepresentinsertionsinthetargetgenesequencewithrespecttothemodel.ThemidlinerepresentsmatchesbetweenthePkinasemodelandtheAT2G28930.1genesequence.A+representspositivescore.ThelinebeginningwithAT2G28930.1isthetargetgenesequence.Dashes(-)inthislinerepresentsdeletionsinthegenesequencewithrespecttothemodel.Thebottomlineindicatestheposteriorprobabilityofeachalignedresidue.A0represents0-5%,1represents5-15%,...,9represents85-95%,and*repre-sents95-100%posteriorprobability.Thelinestartingwithr1andendingwithr2isreadmappingregionsonthegenesequence.A-indicateswherethepositionofthereadcanbemappedtothegenesequence.orhighE-valuecutoff.However,usingloosecutoffscanleadtofalsepositivedomainalignments.Inthiswork,wewilldescribeanewmethodtoimprovethesensitivityofhomologysearchforshortreadswithoutjeopardizingthealignmentaccuracy.4.2MethodsInthissection,wedescribeashortreadhomologysearchmethodthatincorporatespropertiesofpaired-endreadsequencing.Paired-endsequencingisthepreferredsequencingmodeandiswidelyadoptedbymanysequencingprojects.Wehaveobservedthatforalargenumberofreadpairs,onlyoneendcanbealignedbyHMMERwhiletheotherendismissed.Thus,weexploitthesequencingpropertyofpaired-endreadstorescuethemissingend.Ourprobabilistichomologysearchmodeltheofthealignmentbetweenareadpairandaproteindomainfamily.Thecomputationincorporatesthedistributionoffragmentlengths(orinsertsizes)ofpaired-endreadsandthealignmentscores.Similarapproacheshavebeenappliedtomappingpaired-endDNAreadstoareferencegenome[62,97].Buttoourknowledge,48thisisthetimethatanapproximateBayesianapproachhasbeenemployedtoalignpaired-endreadstoproteinfamilies.Therearethreemajorsteps.Inthestep,wewillaligneachend(all-frametranslations)togivenproteinfamiliesusingHMMERunderE-valuecutoff10.NotethatalthoughGA-cutoffistherecommendedcutoffbyHMMERforaccuratedomainannotation,onlyasmallpercentageofshortreadscanpassGAcutoff.Thus,weuseE-valuecutoff10inthestepinordertorecruitmorereads.Asthereadsareshort,thisstepwillusuallyaligneachreadtooneormultipleproteinfamilies.Notallofthealignmentsarepartofthegroundtruth.Inthesecondstep,forallread-pairswhereonlyoneendisalignedbyHMMER,weusethemostsensitivemodeofHMMERtoaligntheotherendtotheproteinfamiliesinthestep.AlthoughthesensitivesearchmodeofHMMERisslow,itisonlyappliedtotheproteinfamiliesthatarefewerthantotalproteinfamiliesinthedatasetandthuswillnotbecomethebottleneckoflarge-scalehomologysearch.Inthelaststep,theposteriorprobabilityofthealignmentbetweenapairofreadsandaproteindomainfamilyiscalculated.Thefalselyaligneddomainsinthestepwillberemovedinthelastthroughthecomputationoftheposterioralignmentprobability.Figure4.2showsanexampleaboutdeterminingthetrueproteinfamilyifbothendscanbealignedtoseveralfamilies.Inthisexample,M1isthemostlikelytobethenativefamilyduetothebiggeralignmentscoresandthehigherprobabilityoftheobservedfragmentlength.Wequantifytheposteriorprobabilityofeachreadpairbeingcorrectlyalignedtoaproteinfamily.AstheexampleinFigure4.2shows,inordertocalculatetheposteriorprobabilityofanalign-ment,weneedtoknowthesizedistributionoffragments,fromwhichpaired-endreadsarese-quenced.Usuallywemayhavetheinformationabouttherangeofthefragments(shortestandlongest).However,thesizedistributionisunknown.FormetagenomicdataandRNA-Seqdataof49Figure4.2:HMMalignmentsofareadpair.Paired-endreadsr1andr2representedbytwogreyscalelinesarealignedagainstmodelsM1,M2,andM3withdifferentscoresofalignments.Thedarkerlinesrepresentbiggerscores.Thefragmentsizedistributionisprovidedaboveeachmodel.Thedistancebetweenthetwoalignmentsiscomputedandisusedtocomputethelike-lihoodofthecorrespondingfragmentsize.Inthisexample,M1ismostlikelytobethenativefamily.non-modelsspecieswhosecompleteorqualityreferencegenomesarenotavailable,itisnottrivialtoderivethefragmentsizedistribution.Inthiswork,wetakeadvantageoftheproteinalignmentandthetrainingsequencestoestimatethefragmentsizedistribution.Thenexttwosectionswilldescribethedetailsaboutcomputingfragmentsizedistributionandthemethodtorankalignmentsusingposteriorprobabilities.4.2.1ConstructingfragmentlengthdistributionPairedendreadsaresequencedfromtheendsoffragments.Whenthereferencegenomeisavail-able,thefragmentsizecanbecomputedusingthedistancebetweenthemappingpositionsofthereadpair.Thus,thedistributioncanbecomputed[62,97]fromalarge-scaleofreadmappingpositions.However,thismethodisnotapplicabletoourworkbecausewearefocusingonthehomologysearchofNGSdatathatlackreferencegenomes.Forthesedata,weproposeamodel-basedmethodtoestimatefragmentsizedistribution.Thekeyobservationisthatifareadpaircanbeuniquelyalignedtoaproteinfamily,itisverylikelythatthispairissequencedfromagenethatishomologoustothemembersequencesoftheproteinfamily.Thehomologyisinferred50fromstatisticallysequencesimilarity.Thus,wewillusethealignmentpositionsandthehomologousseedsequencestoinferthefragmentsize.Thismethodisnotaccurateaswearenotusinganyreferencegenomes/genes.However,ourexperimentalresultshaveshownthattheestimateddistributionisveryclosetothetruedistribution.Figure4.3sketchesthemainstepsofinferringafragment'ssizefromthealignmentofareadpairagainstaproteinfamilymodel.Areadpairr1andr2areuniquelyalignedtoaproteinfamilyM.ThealignmentpositionsalongthemodelMarefromwtoxandytoz,respectively.ModelMistrainedonagroupofhomologoussequences(fiseedsequence1fltofiseedsequenceNfl).Notethattheactualsequencefromwhichr1andr2aresequencedisnotinthetrainingsetofmodelM.ThealignmentpositionsalongthemodelMwillbeconvertedintothecolumnindicesinthemultiplesequencealignmentconstructedbyallseedsequences.Thenafteraccountingfordeletionsandinsertions,thecolumnindiceswillbeconvertedintopositionsalongeachseedsequence.Asitisunknownwhichseedsequencesharesthehighestsequencesimilaritywiththegenecontainingthefragment,wecalculatethefragmentsizeastheaverageofthedistancesbetweenconvertedalignmentpositions.Thefragmentsizeestimationisconductedforallpairedendreadsthatareuniquelyalignedtoproteindomainfamilies.Alltheestimatedfragmentsizesareusedtoconstructthefragmentsizedistribution.Wewillcomparetheestimatedfragmentsizedistributionwiththeonesthatarederivedbasedonreadmappingresults.4.2.2ProbabilisticmodelForeachalignedpaired-endread,anapproximateBayesianapproach[62,97]isusedtoestimatethefialignmentquality.flThequalityofalignmentisastheprobabilityofapairofreadsbeingaccuratelyalignedtoitsnativeproteindomainfamily.Becauseapairofreadscouldbe51Figure4.3:Anexampleoffragmentlengthcalculation.ThealignmentpositionsalongtheHMMcanbeconvertedintopositionsineachseedsequences.Thefragmentsizeiscomputedastheaveragesizeofthosemappedregions.alignedtomultipledomainfamiliesandsomeofthemmightnotbeingroundtruth,wecanrankallalignmentsusingcomputedposteriorprobabilitiesandkeepthealignmentswithhighprobability.Letr1andr2beareadpair.LetA1andA2bethecandidatealignmentsetsofr1andr2againstoneormoreproteinfamilymodels.Foreachalignmentpaira12A1anda22A2witha1anda2beingalignedtothesameproteinfamilyM,wecalculatetheposteriorprobabilityofa1anda2beingthetruealignmentsgeneratedbythereadpairr1;r2againstMas:Pr(a1;a2jr1;r2)µesa1=Tesa2=TPr(fr1;r2)(4.1)whereesa1=Tisthetargetprobabilityofgeneratinganalignmentscoreofa1againstM[29,45].TisthescalingfactorusedinE-valuecomputation.Pr(fr1;r2)istheprobabilityofobservedfragment52sizebetweenr1andr2.Theposteriorprobabilitydependsonthefragmentlengthcomputedfroma1anda2aswellastheiralignmentscores.WecomputeEquation(1)foreachreadpair'salignmentsandkeepthealignmentsaboveagiventhreshold.Foreachreadpair,supposethemaximumposteriorprobabilityofitsalignmentsagainstallalignedmodelsispmax.Wekeepallalignmentswithprobabilitiesabovepmaxt,wheretis40%bydefault.Userscanchangettokeepmoreorlessalignments.4.3ExperimentalresultsWedesignedhomologysearchmethodforNGSdatalackingreferencegenomes,in-cludingRNA-Seqdataofnon-modelspeciesandmetagenomicdata.Inordertodemonstrateitsutilityindifferenttypesofdata,weappliedourtooltoaRNA-Seqdatasetandametagenomicdataset.Inbothexperiments,wechoosedatasetswithknownreferencegenomessothatwecanquantifytheperformanceofhomologysearch.Itisimportanttonotethatthegroundtruthinthisworkisasthehomologysearchresultsforcompletegenes.Weareawarethatcom-putationalproteindomainannotationforcompletegenesorgenomesarenotalwaysaccurate.Butwhole-genedomainannotationhashighersensitivityandaccuracythanshortreadhomologysearchandhasbeenextensivelytestedinvariousspecies.Thus,ourgoalistodecreasetheperformancegapbetweenshortreadhomologysearchandwhole-genehomologysearch.HMMERcanberunindifferentmodes.Inthiswork,wechoosethemostcommonlyusedmodes:HMMERwithdefaultE-value,HMMERwithgatheringthresholds(GAs)cutoff,andHMMERwithoutGAcutoffistherecommendedcutoffbecauseofitsaccuracy.Turningoffwillyieldthehighestsensitivitywithofspeed.ThedatasetinourexperimentistheRNA-SeqdatasetofArabidopsisThaliana.The53secondoneismetagenomicdatasetsequencedfrombacterialandarchaealsyntheticcommunities.WewillcarefullyexaminewhetherourmethodandHMMERcancorrectlyassigneachreadtoitscorrectdomainfamilies.Thenwewillevaluatetheperformanceofhomologysearchfromusers'perspective.Auserneedstoknowthecompositionofdomainsandalsotheirabundanceinadataset.ThuswewillcompareHMMERandourmethodinbothaspects.4.3.1PrshortreadhomologysearchinArabidopsisThalianaRNA-SeqdatasetTheRNA-SeqdatasetwassequencedfromanormalizedcDNAlibraryofArabidopsisusingpaired-endsequencingofIlluminaplatform[68].Therewere9,559,784paired-endreadsintotalandthelengthofeachreadis76bp.Theauthors[68]indicatedthatthefragmentlengthsarebetween198and801bps.However,thefragmentsizedistributionisunknown.4.3.1.1Determinationfortruemembershipofpaired-endreadsThetruemembershipofpaired-endreadswasdeterminedusingreadmappinganddomainanno-tationoncompletecodingsequences.First,allcodingsequences(CDS)ofArabidopsisThalianagenomeweredownloadedfromTAIR10[106].Second,wedownloaded3,912plant-relatedpro-teinordomainmodelsfromPfam[31].WenoticethatsomeofthesedomainfamiliesaretrainedongenesofArabidopsis.Thus,inordertoconductafairevaluationofhomologysearchperfor-mance,weremovedallgenesofArabidopsisfromthedomainseedfamiliesandre-trainedthePfamHMMs.Third,CDSwerealignedagainstPfamdomains[31]usingHMMERwithgatheringthresholds(GAs)[29].ThealignmentresultscontainthepositionsofdomainsinCDS.Notethatitispossiblethatseveraldomainsarepartiallyalignedtothesameregioninacodingse-54quence.Thishappensoftenfordomainsinthesameclan[32]becausethesedomainsarerelatedinstructuresandfunctions.Inthiscase,wewillkeepalldomainalignmentspassingtheGAcutoffinthegroundtruth.Fourth,paired-endreadsweremappedseparatelytoCDSusingBowtieallowingupto2mismatches[54].ThepositionsofuniquelymappedreadsinCDSwerecomparedtoanno-tateddomainsinCDS.Ifthemappingpositionsofreadpairsarewithinannotateddomainregions,weassignedthereadstothosePfamdomains.Thereadsandtheirassigneddomainsconstitutethetruemembershipofthesereads.4.3.1.2PerformanceoffragmentlengthdistributionWecomparedourestimatedfragmentlengthdistributionwiththetruefragmentlengthdistributioninFigure4.4.Thetruefragmentsizedistributionisderivedbymappingallpaired-endreadsbacktothereferencegenome.Thecomparisonshowsthat,foragivenlength,themaximumprobabilitydifferencebetweenourfragmentlengthdistributionandthetruefragmentlengthdistributionis0.02,whichslightlydecreasestheaccuracyoftheposteriorprobabilitycalculation.4.3.1.3OurmethodcanalignmorereadsAsshowninthesection,HMMERmissedoneendofatleasthalfofthereadpairsintheRNA-SeqdatasetofArabidopsis.Byapplyingourtool,thepercentageofcase2(bothends)ofpaired-endreadalignmentsincreasesfrom28.42%to62.51%.Table1presentsthecomparison.Importantly,theimprovementisnotachievedby.Asweusetheposteriorprobabilitytodiscardfalsealignments,thetradeoffbetweensensitivityandisactuallyimproved,asshowninthenextsection.55Figure4.4:Comparingfragmentlengthdistributionofourmethod(blue)tofragmentlengthdis-tributionconstructedfromreadmappingresults(red).X-axisrepresentsthelengthoffragmentinaminoacids.Y-axisrepresentsprobabilityofthecorrespondingfragmentsize.Table4.2:Thepercentagesofallthreecasesofpaired-endreadalignmentsbyHMMERandourtoolfortheRNA-Seqdata.Case1representsoneend.Case2representsbothends.Case3representsnoend.CaseHMMER,HMMER,HMMER,OurtoolE-value10w/oGAcutoffE-value10Case134.51%32.83%22.51%0.42%Case228.42%31.58%8.84%62.51%Case337.07%35.59%68.65%37.07%4.3.1.4SensitivityandaccuracyofshortreadhomologysearchAlthoughGAcutoffistherecommendedthresholdfordomainannotationbyHMMER,ityieldslowsensitivityforshortreadhomologysearch.Inordertoalignasmanyreadsaspossible,thedefaultE-valuecutoffischosen.However,evenforcase2,wherebothendscanbealignedbyHMMER,thesereadsmaybealignedtomultipledomainsandnotallofthemarecorrect.Ourmethodcanbeusedtoimprovethetradeoffbetweensensitivityandaccuracyforbothcase1and56case2.Inthissection,theperformanceofhomologysearchforeachreadisquantibycomparingitstrueproteindomainfamilymembershipandpredictedmembership.Foreachreadpair,supposeitissequencedfromdomainsetTP=fTP1;TP2;:::;TPng,whichisderivedfromthereadmappingresults.ThehomologysearchtoolalignsthisreadpairtodomainsetC=fC1;C2;:::;Cmg.Thesensitivityandfalsepositive(FP)rateforthisreadpairareusingthefollowingequations:Sensitivity=jTP\CjjTPj(4.2)FPrate=jCTPjjTNj(4.3)NotethatTNrepresentsthetruenegativedomainset.LetUrepresentalldomainswedownloadedfromPfam(jUj=3,962).Then,foreachreadpair,TN=UTP.Inthissection,thesensitivityandFPrateforeachpairofreadsarecomputedandthentheaverageofallpairsofreadsisreportedusingROCcurves.4.3.1.4.1Performanceofcase1:Thereare1,025,982paired-endreads,whereonlyoneendcanbealignedtooneormultipledomainfamiliesbyHMMERwithon.Figure4.5showsROCcurvesofshortreadhomologysearchusingHMMERunderdifferentcutoffsandourmethod.ForHMMER,wechangedtheE-valuecutofffrom1000to105withratio0.1.AssomeE-valuecutoffsyieldthesameoutput,severaldatapointsoverlapcompletely.Forourmethod,eachdatapointcorrespondstodifferenttvalues(10%to70%)asinProbabilisticModelSection.Unlessotherwise,alltheROCcurvesaregeneratedusingthesame4.3.1.4.2Performanceofcase2:Thereare844,796paired-endreadswithbothendsbeingalignedbyHMMERwithon.Somereadpairsarealignedtofalsefamilies.Thefalsely57Figure4.5:ROCcurvesofshortreadhomologysearchforArabidopsisRNA-Seqdata.WecomparedHMMERandourtooloncase1,whereoneendcanbealignedbyHMMERwithdefaultE-value.NotethatHMMERwithGAcutoffhasonedatapoint.aligneddomainfamiliescanberemovedbyourmethod.Therefore,ourmethodhavebettertrade-offbetweensensitivityandfalsepositiverate.InFigure4.6,weplottedROCcurvesofHMMERandourmethod.Figure4.6:ROCcurvesofshortreadhomologysearchforArabidopsisRNA-Seqdata.WecomparedHMMERandourtooloncase2,wherebothendsarealignedbyHMMERwithdefaultE-value.NotethatHMMERwithGAcutoffhasonedatapoint.UsingposteriorprobabilityhelpsremovefalsealigneddomainsandthusleadstobettertradeoffbetweensensitivityandFPrate.58TheresultsshowedthatHMMERwithGAcutoffyieldslowsensitivityandlowFPrate.OurmethodhasbettertradeoffbetweensensitivityandFPrateforbothcases.WealsocomputedothermetricsincludingF-score(2sensitivityPPVsensitivity+PPV)andPPV(PositivePredictiveValue,jTP\CjjCj).ComparingalltoolsintermsofF-ScoreandPPVunderdifferentthresholdsforcase1,ourtoolachievesthehighestF-Score81.98%;thecorrespondingPPVis80.41%.HMMERw/ohasthesecondhighestF-Score75.39%anditsPPVis65.17%.Forcase2,ourmethodhasthehighestF-score86.33%withPPV94.34%.HMMERwithdefaultE-valuecutoffhasthesecondhighestF-Score76.45%withPPV67.50%.4.3.1.5Performanceevaluationondomain-levelInordertoassessthehomologysearchperformanceondomain-level,wefocusedoncomparingthesetofdomainsfoundbyHMMERandourtool.Wefurtherthedomainabundance,whichisthenumberofreadsineachdomainbygiventools.Thepredicteddomainsetandtheirabundancearealsocomparedtothegroundtruth,whichisderivedusingthereadmappingresultsandthewhole-genedomainannotation.OurexperimentalresultsshowedthatthesetofdomainsreportedbyHMMERunderthedefaultE-valuecutoffandourmethodarealmostidentical.Theyonlydifferby1outof3,962domains.Bothtoolscanidentifyalmostalltheground-truthdomains.TheonlyexceptionisHMMERwithGAcutoff,whichreturns84%oftruedomains.AlthoughHMMERandourmethodreportednearidenticaldomainsets,theygeneratedverydifferentdomainabundance.Wecomparedthepredictedabundancetothegroundtruthbycomput-ingtheirdistance,whichisthedifferenceinthenumberofreadstoadomain.Accordingtothesmalldistanceindicateshighersimilaritytothegroundtruth.Forcase1,ourtoolhassmallerdistancetothegroundtruththanHMMER,withaveragedistancebeing65.39.Our59toolproducedthesameabundanceasthegroundtruthfor1,185domains.TheaveragedistancesofHMMER,HMMERwithoutandHMMERwithGAcutoffare107.60,126.85,and153.64,respectively.Figure4.7showsthedistanceof377domainsforwhichourtoolhasdistanceabove86.Figure4.7:ThedistancecomparisonbetweenourmethodandHMMERoncase1oftheRNA-SeqdatasetofArabidopsis.377domainswiththelargestdistancevaluesarelistedinthefoursubplots.X-axisshowstheindicesofthedomains.Smallervalueindicatescloserdomainabundancetothegroundtruth.TheaveragedistancesofHMMER,HMMERw/oHMMERGAcutoff,andourtoolare704.92,781.80,1,054.77,and522.12,respectively.Figure4.8illustratesthedistancebetweenthepredicteddomainabundanceandthegroundtruthforcase2,wherebothendscanbealignedbyHMMERunderthedefaultE-valuecutoff.TheaveragedistancesforHMMER,HMMERwithoutHMMERwithGAcutoff,andourtoolare121.61,107.81,139.56,and96.34respectively.Figure4.8onlyincludes358domainsforwhichourtoolhasthedistanceabove30.Insummary,beingconsistentwiththeresultsshowninFigures4.5and4.6,ourmethodcanassignreadstotheirnativedomainswithhigheraccuracy.60Figure4.8:ThedistancecomparisonbetweenourmethodandHMMERoncase2oftheRNA-SeqdatasetofArabidopsis.358domainswiththelargestdistancesarelistedinthefoursubplots.X-axisshowstheindicesofthedomains.Smallervalueindicatescloserdomainabundancetothegroundtruth.TheaveragedistancesofHMMER,HMMERw/oHMMERGAcutoff,andourmethodare818.09,704.65,1084.50,and558.60,respectively.4.3.1.6RunningtimeanalysisWecomparedtherunningtimeoftestedtoolsinTable4.3.HMMERwithGAcutoffisthefastestbutyieldslowsensitivity.HMMERwithoutiscomputationallyexpensiveandistheslowest.WeareinbetweenaswerelyonthefullViterbialgorithmtoalignthemissingendofareadpair.Table4.3:TherunningtimeofHMMERunderdifferentcutoffsandourtoolontheArabidopsisThalianaRNA-Seqdataset.Note:Therunningtimeisthetotalrunningtimeofhomologysearchtooltoalign9,559,784paired-endreadswith3962domains.CaseHMMER,HMMER,HMMER,OurtoolE-value10w/oGAcutoffE-value10Time(m)2,628.9758,825.21,955.19,857.3614.3.2PrhomologysearchforshortreadsinametagenomicdatasetfromsyntheticcommunitiesInthesecondexperiment,wetestedtheperformanceofshortreadhomologysearchinametage-nomicdataset.Inordertoquantifytheperformanceofourmethod,wechoseamockmetagenomicdatawithknowncomposition.4.3.2.1DatasetThechosenmetagenomicdatasetissequencedfromdiversesyntheticcommunitiesofArchaeaandBacteria.Thesyntheticcommunitiesconsistof16Archaeaand48Bacteria[96].AllknowngenomesweredownloadedfromNCBI.ThemetagenomicdatasetofsyntheticcommunitiesweredownloadedfromNCBISequenceReadArchive(SRA)(accessionNo.SRA059004).Thereare52,486,341paired-endreadsintotalandthelengthofeachreadis101bp.Allofreadsarealignedagainstasetofsinglecopygenes.ThesegenesincludesnearlyallribosomalproteinsaswellastRNAsynthasesexistedinnearlyallfree-livingbacteria[26].Theseproteinfamilieshavebeenusedforphylogeneticanalysisinvariousmetagenomicstudiesandthusitisimportanttostudytheircompositionandabundanceinvariousmetagenomicdata.Wedownloaded111domainsfromPfamdatabase[31]andTIGRFAMs[35].4.3.2.2Determinationoftruemembershipofpaired-endreadsThetruemembershipofpaired-endreadsisdeterminedbasedonwholecodingsequenceannota-tionandreadmappingresults.First,allcodingsequences(CDS)of64genomesofArchaeaandBacteriaweredownloadedfromNCBI.Second,CDSwerealignedagainst111domainsdown-loadedfromTIGRFAMs[35]andPfamdatabase[31]usingHMMERwithgatheringthresholds62(GAs)[29].ThepositionsofaligneddomainsinallinCDSwererecorded.Third,paired-endreadsweremappedbacktothegenomesusingBowtie[54].Thereadmappingpositionsandtheannotateddomainpositionsarecompared.Ifbothendsareuniquelymappedwithinanannotateddomain,weassignthereadpairtothedomainfamily.Thetruepositivesetcontainsallreadpairswithbothendsbeinguniquelymappedtoaproteindomain.Wewillonlyevaluatethehomologysearchperformanceofchosentoolsforthesereads.4.3.2.3PerformanceoffragmentlengthdistributionAgain,weneedtoexaminetheaccuracyofourfragmentsizecomputation.Figure4.9showsthefragmentlengthdistributionconstructedfromourtoolandthefragmentlengthdistributionderivedfromthereadmappingresults.Foragivenlength,themaximumprobabilitydifferencebetweenourmethodandthegroundtruthis0.01,whichslightlyreducestheaccuracyofposteriorprobabilitycomputation.4.3.2.4OurmethodcanalignmorereadsInthisexperiment,thereadlengthislongerthanthoseintheexperiment.Consequently,HMMERcanalignmorereadsagainsttheirnativedomainfamilies.Nevertheless,itstillhasonethirdofpairsofreadswithoneendbeingalignedtotheproteindomainfamilies.Byapplyingourtool,thepercentageofcase2(bothends)ofpaired-endreadalignmentsisenhancedfrom65.82%to88.71%.ThepercentagesofthreecasesbyourtoolandHMMERareshowninTable4.4.4.3.2.5Sensitivityandaccuracyofshortreadhomologysearch4.3.2.5.1Case1:oneendisalignedbyHMMERTherewere213,668paired-endreadswithonlyoneendbeingalignedtooneormultipledomains.Figure4.10showstheROCcurvesofshort63Figure4.9:Comparingfragmentlengthdistributionofourtool(blue)tofragmentlengthdistri-butionconstructedfromreadmappingresults(red).X-axisrepresentsfragmentlengthinaminoacids.Y-axisrepresentstheprobabilityofthecorrespondingfragmentsize.Table4.4:Thepercentagesofallthreecasesofpaired-endreadalignmentsbyHMMERandourtoolforthesyntheticmetagenomicdata.Case1representsoneend.Case2representsbothends.Case3representsnoend.CaseHMMER,HMMER,HMMER,OurtoolE-value10w/oGAcutoffE-value10Case123.15%21.63%3.76%0.26%Case265.82%68.46%2.46%88.71%Case311.03%9.91%93.77%11.03%readhomologysearchusingHMMERandourtool.HMMERwithGAcutoffhasthelowestFPrate(0.0).However,thesensitivityofHMMERwithGAcutoffisonly4.11%.Inaddition,wefurthercomputedPPVandF-ScoreofeachdatapointinROCcurves.Comparingalltools,ourtoolhasthehighestF-ScoreandPPV(90.87%and88.01%,respectively).HMMERwithE-value10hasthenexthighestF-scoreandPPV(64.79%and48.07%,respectively).64Figure4.10:ROCcurvesofedshortreadhomologysearchforthesyntheticmetage-nomicdataset.WecomparedHMMERandourtooloncase1,whereoneendcanbealignedbyHMMERwithdefaultE-value.NotethatHMMERunderGAcutoffhasonedatapoint.4.3.2.5.2Case2:bothendsarealignedbyHMMER607,558paired-endreadswereclassi-tocase2.Wedivideddataintotwogroups:1bothendsbeingalignedtoonedomainand2bothendsbeingalignedtomultipledomains.Therewere515,586paired-endreadsand91,972paired-endreads,respectively.Whenbothendsarealignedtoonesingledomain,theisusuallycorrect.Thus,wefocusonevaluatingtheperformanceofthesecondgroup,wherereadpairsarealignedtomorethanonedomain.Figure4.11showstheaverageperformancecomparisonbetweenHMMERandourtoolon91,972paired-endreads.ComparingalltoolsintermofF-ScoreandPPV,ourtoolachievesthehighestF-Scoreof96.05%anditsPPVis92.42%.HMMERw/oachievesthesecondhighestF-Score80.28%withPPV80.45%.4.3.2.6Domain-levelperformanceevaluationForwholedataset,wecomparedthesetofdomainsbyHMMERandourtool.Theresultsshowedthateverytoolallgroundtruthdomains(111domains)exceptHMMERwithGAcutoff,whichonlyfound26domains.65Figure4.11:ROCcurvesofedshortreadhomologysearchforthesyntheticmetage-nomicdata.WecomparedHMMERandourtooloncase2,wherebothendsarealignedbyHMMERunderdefaultE-value.NotethatHMMERunderGAcutoffhasonedatapoint.UsingposteriorprobabilityhelpsremovefalsealigneddomainsandthusleadstobettertradeoffbetweensensitivityandFPrate.Inaddition,thedomainabundancewasandcomparedtothegroundtruth.Foreachdomain,wecomputethefidistancefl,whichisthedifferenceinthenumberofreadstoadomainbyatoolandinthegroundtruth.Smallerdistanceindicatescloserdomainabundancetothegroundtruth.Forcase1,theaveragedistancesofHMMER,HMMERw/oHMMERwithGAcutoff,andourmethodare272.74,280.65,505.56,and178.60,respectively.Ourtoolhasthesameabundanceasthegroundtruthin43domains.Weremovedthose43domainsandshoweddistanceofotherdomainsinFigure4.12.Forcase2,wherebothendscanbealigned,alltoolshaveworsedomainabundanceestimation.TheaveragedistancesofHMMER,HMMERw/oHMMERwithGAcutoff,andourmethodare702.39,1698.79,1831.55,and666.96,respectively.Ourtoolstillhastheclosestdomainabundancetothegroundtruth.Ithasthesamedomainabundanceasthegroundtruthfor68domains.Weremovedthe68domainsandplottedthedistancesofotherdomainsinFigure4.13.Althoughthereadlengthsofthisdatasetarelongerthanthedataset,theaveragesequence66Figure4.12:ThedistancecomparisonbetweenourmethodandHMMERoncase1ofthemetage-nomicdataset.X-axisshowstheindicesofthedomains.Smallervalueindicatescloserdomainabundancetothegroundtruth.Domainsaresortedbasedonthedistanceofourtool.Duetoscalingissues,domainswiththelargestdistancesareplottedintheembeddedwindow.conservationofthedomainfamiliesisaslowas30%.Thepoorlyconservedfamiliescontainlargenumbersofsubstitutions,longinsertionsanddeletions,leadingtoeitherover-predictionorunder-predictionofthetestedtools.HMMERwithE-valuecutoff10,HMMERw/oandourmethodallmorereadsintothedomainfamiliesthangroundtruth.HMMERwithGAcutoffundershortreadsintotheunderlyingfamilies.Thus,thedistancesofallthesetoolsarelarge.4.3.2.7RunningtimeonametagenomicdatasetofsyntheticcommunitiesTherunningtimesofHMMERunderdifferentcutoffsandourtoolarecomparedinTable4.5.Asexpected,HMMERwithisthefastest.OurmethodisslowerthanHMMERwithbutmuchfasterthanHMMERw/o67Figure4.13:ThedistancecomparisonbetweenourmethodandHMMERoncase2ofthemetage-nomicdataset.X-axisshowstheindicesofdomains.Smallervalueindicatescloserdomainabundancetothegroundtruth.Table4.5:TherunningtimeofHMMERunderdifferentcutoffsandourtoolonametagenomicdatasetofsyntheticcommunities.Note:Therunningtimeistheoverallrunningtimeofhomologysearchtooltoalign52,486,341readswith111domains.CaseHMMER,HMMER,HMMER,OurtoolE-value10w/oGAcutoffE-value10Time(m)353.13152,864378.394,468.404.4DiscussionandconclusionHomologysearchhasbeenwidelyusedforsequence-basedfunctionalanalysisinvariousNGSsequencingprojects.Inparticular,forgene-centricanalysis,readsareintocharacterizedprotein/domainfamiliesusinghomologysearch.WhileHMMERisthestate-of-the-arttoolforhomologysearch,itsperformanceonshortreadshasnotbeensystematicallyexamined.OurtestofHMMERinvariousNGSdatacontainingshortreadsshowsthatitcouldmissalargenumberofshortreads.Inthiswork,wedescribedaprobabilistichomologysearchmodelforpaired-endreads.Thegoalistoimprovetheperformanceofshortreadhomologysearch.68ItisbuiltonHMMERandcanbeusedasacomplementarytooltoHMMERformoresensitivereadOnefuturedirectionistoimprovetheshortreadhomologysearchperformanceforpoorlycon-servedfamilies.Near4,000domainfamiliesintheexperimenthavehigheraveragesequenceidentityandthusleadtoreasonabledomainabundanceestimation.The100+familiesinthesecondexperimenthavelowsequenceidentityandthetestedtoolstendtoeitherover-classifyorunder-classifyheavilyforsomefamilies.Thus,bettermethodsneedtobedesignedtoalignshortreadstopoorlyconservedproteinfamilies.TheadvancesofNGStechnologiesenableoutputoflongerreads.TheincreasedlengthwillleadtobettersensitivityofHMMER.However,beforethereadsreachthelengthofnearcompletetranscriptsorgenes,thereisstillaneedforimprovingshortreadhomologysearch.Inaddition,ex-istingsequencingprojectsarestillheavilyrelyingontoday'ssequencingtechnologies.Weexpectourtoolcanbeusedtoimprovethefunctionalanalysis.69Chapter5glu-RNA:aliGnhighLystrUcturedncRNAsusingonlysequencesimilarity5.1IntroductionNoncodingRNAs(ncRNAs),whichfunctiondirectlyasRNAswithouttranslatingintoproteins,ishighlyimportanttomodernbiology.DifferenttypesofncRNAhavediverseandimportantbiologicalfunctions.Forexample,house-keepingRNAssuchastRNAsandrRNAs,andsmallregulatoryRNAsincludingmiRNAs,havebeenextensivelystudiedbecauseoftheirimportantbi-ologicalroles.Inparticular,thedevelopmentofnext-generationsequencing(NGS)technologiesshedslightsonmorereliableandcomprehensivencRNAannotation.Deepsequencingoftran-scriptomsofvariousorganismshasrevealedthatalargeportionoftranscriptomicdatacannotbemappedbacktoannotatedprotein-codinggenesinthereferencegenome,indicatingthatthosetranscriptsmaycontaintranscribedncRNAs.ThefunctionsofmanytypesofncRNAsareassociatedwithboththeirsequencesandsecondarystructures,whichcontaininteractingbasepairssuchasWatson-CrickbasepairsandG-U.ForPrapapornTecha-angkoonandYanniSun,flglu-RNA:aliGnhighLystrUcturedncRNAsusingonlysequencesimilarityfl,ProceedingsofACMConferenceonBioinformatics,ComputationalBiologyandBiomedicalInformatics(ACMBCB2013),WashingtonDC,USA,2013.70example,functionalpre-miRNAsinvariousspeciesfoldintohairpinstructures.AmajorityoftRNAssharethecharacteristicclover-leafstructure.Thus,ncRNAshouldincorporateinformationfrombothsequencesandsecondarystructures.Successfulgenome-scalencRNAgenetoolscanberoughlydividedintotwogroups.ThecategoryincludestoolsforfiknownncRNAsearchfl,whichcomparequerysequenceswithannotatedncRNAsorncRNAfamiliesandusetheirsequenceandstructuralsimilarityforncRNAannotation.AscharacterizedncRNAsorncRNAfamiliesareneededasreference,thistypeoftoolsarelimitedtofiknownncRNAsfl,suchastRNA,rRNAs,annotatedmiRNAsetc.ThesecondcategoryincludesdenovoncRNAgenetoolsthatareabletoidentifynovelncRNAs.EfimplementationsinthiscategoryacceptsequencealignmentsasinputandexaminehowlikelytheinputsequencesencodencRNAs.PopulartoolsinthiscategoryincludeRNAz[110],EvoFold[83],QRNA[89],etc.Forexample,RNAzandEvoFoldhavebeenusedforncRNAgeneinhumangenome[30].OthertoolsthatdonotrelyonalignmentsarealsoavailablefornovelncRNAHowever,conductingstructuralalignmentmakesthesetoolshighlyexpensiveforgenome-scalencRNAsearch[107,38].AmajorityofpopulardenovoncRNAgenetoolsinthesecondcategoryrequirepair-wiseormultiplealignmentasinput.ThequalifyofthealignmentdirectlyaffectstheaccuracyofncRNAgeneThereexistaccuratestructuralalignmentalgorithmsthatmaximizebothse-quenceandsecondarystructuresimilarity[95,104,69].UsingstructuralalignmentsleadstomorereliablencRNAgeneHowever,moststructuralalignmenttoolscannotbeconvenientlyappliedtowholegenomesorNGSdatabecauseoftworeasons.First,genomescalencRNAgeneneedslocalalignmenttoolswhilemanystructuralalignmentprogramsareeitherglobaloruseasimpleslidingwindowstrategytogeneratelocalalignments.Second,thetimecomplex-ityofstructuralalignmentisstillhigherthansequencealignment.Generatinglocal71structuralalignmentsbetweenlargegenomesorlargenumberofshortreadsgeneratedbyNGStechnologiesisnotpractical.Itisthusdesirabletohaveanalignmentprogramthatcanachievesimilaraccuracytostructuralalignmenttoolswhilekeepingthesameorderofrunningtimecom-plexityassequencealignmenttools.Inthiswork,wedesignedandimplementedanaccuratencRNAalignmentprogramthatonlyreliesonsequencesimilarity.Weappliedthistoolto360pairsofhighlystructuredncRNAsinclud-ingtRNAsandriboswitchelements.TheexperimentalresultsdemonstratethatouralignmenttoolcangeneratemoreaccuratencRNAalignmentsthanseveralpopularsequencealignmenttools.Inparticular,forhighlystructuredncRNAswithlowsequencesimilarity(<%40),weachievedbetteralignmentaccuracythanapopularstructuralalignmentprogramwhilekeepingthesamerunningtimecomplexityastypicalsequencealignmentprograms.5.2RelatedWorkRecently,Willetal.[111]reportedmanynovelncRNAsusingstructured-basedwholegenomerealignment.Intheirmethod,existingwholegenomesequencealignmentwasusedasthebasealignment.Regionsthatwerepoorlyalignedwererealignedusingabandedstructuralalignmentalgorithm,whichconductsstructuralalignmentinarestrictedsearchspacearoundthesequencealignment.ThenewlygeneratedstructuralalignmentswereusedforncRNAexamination.ItwasfoundthatmanyofthemmaycontainnovelncRNAs.Ourtoolsharesthesimilarframework,whichintendstogenerateabetteralignmentusingasequencealignmentasthestartingpoint.However,thereareseveralmajordifferences.First,thebasealignmentweuseisgeneratedusingposteriorprobability,whichisexpectedtobemorereliableforsequenceswithlowsimilarity.Second,unlikeWilletal.'sworkthatreliesonabandedstructuralalignment,weusesequencesimilarityanda72machinelearningapproachtoconvertthebasealignmentintoanewalignment.Nostructuralinformationisused.Theprocedureofmodifyingabasealignmentinourworkcanalsobeemployedtocreateabandeddynamicprogrammingspace.Theideaofbandedalignmenthasbeenusedinbothse-quencealignmentandstructuralalignmentprograms[69,111,37,24,19].Byapplyingdynamicprogramminginbandedsearchspace,therunningtimecomplexityofalignmentprogramscanbereduced.ThemostrelatedworkisDynalign[69],whichxedawindowofwidthMaroundthealignmentdiagonal.Thus,apositioninsequencexisrestrictedtoalignwithinM-distanceofthesamepositioninsequencey.Thisspacereductionreducestherunningtimecomplexityofstructuralalignment.Thisheuristicsisadoptedbyotherstructuralalignmentpro-grams[38,111,24].Forexample,aconstantdeviation(D)isusedinWilletal.'sworktoabandedregionforstructuralalignment.ThevalueofMisempiricallydeterminedinDynalign.InanupdatedversionofDynalign[37],theauthorsthesearchspacebyapplyingacutofftoposterioralignmentprobabilities.Onlypositionpairswithposterioralignmentprobabilitiesabovethegivencutoffwillbeusedinsearch.Thus,differentvaluesofMareusedforeachrow.OurworkdiffersfromDynaligninboththeapplicationandthemethod.First,ourworkaimstogenerateaccuratencRNAalignmentwithoutusingstructuralinformation.Dynalignconductsstructuralalignmentwithinreducedsearchspace.Second,ourworkemploysposteriorprobabilityandmorefeaturesinamachinelearningapproachtoproduceanewalignment.Dynalignonlyusesposteriorprobabilitytoasearchspace.73Figure5.1:AlignmentsgeneratedbyfouralignmentprogramsforapairoftRNAswithsequencesimilarity80%.X-axisrepresentssequence2andY-axisrepresentssequence1.Thealignmentstartswith(0,0).5.3MethodOurncRNAalignmentprogramisbuiltonseveralkeyobservations.First,forncRNAswithhighsequencesimilarity,thesequencealignmentisusuallyhighlyconsistentwithstructuralalign-ment[33].ForhighlystructuredncRNAsthatlackstrongsequencesimilarity,thedivergencebetweenthesequenceandstructuralalignmentstendstobelarge.Second,differentsequencealignmenttoolscangeneratedifferentalignments,dependingonthechoiceofscoringmatrices,74Figure5.2:AlignmentsgeneratedbyfouralignmentprogramsforapairoftRNAswithsequencesimilarity40%.X-axisrepresentssequence2andY-axisrepresentssequence1.Thealignmentstartswith(0,0).gappenalties,andalignmentalgorithms.Previouswork[37,21,70,40,57]suggestedthatalign-mentsgeneratedusingposteriorprobabilitiestendtobemoreaccuratethanalignmentsgeneratedusingascoringtable.Third,whendifferentalignmentprogramsagreewitheachother,thereishighchancethatthealignedbasesarepartofthetruealignment.TheaboveobservationscanbevisualizedinFigure5.1and5.2throughtwopairsofhomolo-gousncRNAs.Inbotheverypathcorrespondstoanalignment.Besidesthereferencealign-ment,whichisthetruealignmentfromreliablencRNAannotationdatabases,thealignmentspro-75ducedusingfoursequencealignmentprogramsincludingNeedleman-Wunsch[78],ClustalW[25],ProbA[70],anddpswalign[40]arepresented.AllpathsoverlapinFigure5.1be-causethesequencesimilarityishigh.ForncRNAswithlowsequencesimilarity(Figure5.2),twoalignmentprogramsthatusescoringmatrix(Needleman-WunschandClustalW)producedhighlydifferentalignments.Similarly,althoughProbAanddpswalignbothemployposteriorprobabili-ties,theyproduceddivergentalignments.Inaddition,thedifferencebetweensequencealignmentsandthereferencealignmentispronounced.Moreover,thedifferenceisnotaconstantalongthepath.Itischangingfordifferentrowsinthematrixgraph.Towardstheendofthealignmentpaths,differentalignmentprogramsconverge;thedivergencebetweensequencealignmentsandthereferencealignmentisdiminished.7677Figure5.3:ThereferencealignmentandalignmentsgeneratedbyfouralignmentprogramsforapairoftRNAswithsequencesimilarity40%.ThepredictedsecondarystructuresareshownwheneachalignmentisusedasinputtoPfold,astructurepredictionprogram.ErrorsinalignmentswillbepropagatedintodownstreamanalysissuchasncRNAgeneandsecondarystructureprediction.Anumberofsecondarystructurepredictionprogramstakealignmentsasinputandoutputtheconsensussecondarystructure.Erroneousalignmentsleadtowrongstructureprediction.Figure5.3detailsthealignmentscorrespondingtothepathsinFig-ure5.2andthepredictedstructuresusingthesealignmentsasinputtoPfold[46].Thestructurepre-dictioncanoutputthecorrectclover-leafstructureforthereferencealignment.Otheralignmentsdidnotleadtocorrectsecondarystructures.Thus,itisimportanttohaveaccuratealignments.Inthiswork,ourgoalistodevelopasequencealignmentprogramthatcanproducealignmentsasclosetothereferencealignmentaspossible.Weachievedthisgoalbychoosinganexistingsequencealignmentasthebasealignment.Then,weemployedmachinelearningmethodstode-terminehowweshouldchangethebasealignmentsothatitoverlapsorgetsclosertothereferencealignment.Thefeaturesusedinthemachinelearningmethodsweregeneratedfromthekeyob-servations,includingsequencesimilarityandtheconsistencybetweenvarioussequencealignmentprograms.TheremainingoftheMethodsectionisorganizedasfollows.First,weformallythealignmentpathandtheofthesimilaritybetweenalignments.Then,wewillintroduceposteriorprobabilityandourchoiceofthebasealignment.Finally,wefocusonthemethodofmodifyingandimprovingthebasealignment.5.3.1AlignmentpathandsimilaritybetweentwoalignmentsForapairofsequencesxandy,adirectedmatrixgraphGofsizemncanbecreated,wheremandnarethesizesofxandy,respectively.Theithbaseinxandthejthbaseinyanode,representedbyatuple,(x;y)inG.Threetypesofdirectededgesexist,correspondingtomatch/mismatch,insertion,anddeletioninanalignment.Thus,anodewithindex(x;y)hasthree78directededgestonodes(x+1;y+1),(x+1;y),and(x;y+1).AnylegalalignmentbetweenxandycanbevisualizedasapathinG.Thesimilaritybetweentwoalignmentscanthusberepresentedbythedistanceoftheircorrespondingpaths,assuggestedbyWilletal.[111].Foreachrowrinthematrixgraph,supposethenodeshaveposition(r;c)and(r;c0)inalignment1and2,respectively.Thedistancefromalignment1to2atrowristhusc0c,whichalsotheextensionneededtoconvertonealignmentintotheotheroneatrowr.Moregenerally,bothalignmentsmayspanmultiplecolumnsforarowr.Letthenodesetofalignment1atrowrbef(r;c1);(r;c2);:::;(r;cm)g;thenodesetforalignment2atrowrbef(r;c01);(r;c02);:::;(r;c0n)g.Thedistancefromalignment1toalignment2atrowrisas:dr=maxi=1tom(minj=1ton(c0jci));(5.1)whichmimicstheextensionneededbetweenconsecutiveinsertionblocksinsequence1.Inallexperiments,weusedthelongersequenceastheY-axisandthusmaximumlyavoidedthiscase.Asonealignmentcontainsmultiplerows(equaltothelengthofoneinputsequence),thedistancebetweentwoalignmentsis:sumr=1tomdr(5.2)wheremisthenumberofrows.Intuitively,thisdistanceissimilartoeditdistancebetweentwosequences.Itapproximatesthenumberofmovementsneededtoconvertonealignmentpathintotheother.Forexample,Figure5.4.(B)showstwoalignmentpaths,oneconsistingofnodeswithshapeandtheotherconsistingofnodesand.Accordingtotheabovethedistancefromthealignmentrepresentedbytotheotheristhus(31)+(41)+(52)+(72)+(72)=18.Notethatforrow3,thedistanceis3accordingtotheofdr.795.3.2ChoosingthebasealignmentOuralgorithmneedstouseasequencealignmentasthestartingalignment,whichisreferredtoasfibasealignmentfl.Previouswork[37,21,57]suggestedthatsequencealignmentsproducedusingposteriorprobabilitiessharethesmallestdistancewiththetruestructuralalignments.Were-evaluatethisconclusiononhighlystructuredncRNAs.Inparticular,wecomparevariousse-quencealignmentprogramsandchoosetheonewiththebestaccuracy.Inthisparagraph,weintroducetheposteriorprobabilityandtwoalignmentprogramswechoseforgeneratingthisalignment.Forinputsequencesxandy,theposteriorprobabilityoftheithbaseofx(i.e.xi)beingalignedwiththejthbaseofy(i.e.yj)isdenotedasPr(xi˘yjjx;y);(5.3)where˘representsanalignment.Thisprobabilitycanbecomputedusingtwomethods.Oneisbasedonapair-HMM[27]andtheotherisbasedonpartitionfunction.Bothmethodscomputetheposteriorprobabilityforallpairsofbasesinxandy.Theposteriorprobabilitytablereplacesthescoringmatrixduringthedynamicprogrammingequationsforconductingsequencealignment.Inaddition,thegappenaltyisusuallysettozero.Bothglobalandlocalposteriorprobability-basedalignmentalgorithmsexist.TherunningtimecomplexityforglobalalignmentisO(jxjjyj),wherejxjandjyjarethesizesofsequencexandy,respectively.Asthereareliteraturesavailablefordescribingbothmethods,wereferusersto[27]fordetailsaboutcomputingposteriorprobabilies.Inthiswork,wetestedtwodifferentimplementationsofposteriorprobability-basedsequencealignment.OneisprobA[70],whichisbasedonpartitionfunction.Theotherisdpswalign[40],whichisbasedonpair-HMMs.Becauseofdifferentparametersandmethodsusedforcalculating80Table5.1:Numberofbestalignmentsgeneratedbyfoursequencealignmenttoolsfor662pairsoftRNAsandRiboswitches.NotethatN-WisNeedleman-Wunsch.ClustalW,amultiplesequencealignmenttool,canalsobeappliedtoapairofsequences.Alignmentprogram#bestalignmentsNeedleman-Wunsch72ClustalW107ProbA73dpswalign289N-WandProbA3ProbAandClustalW6ProbAanddpswalign41dpswalignandClustalW22ProbA,dpswalign,andClustalW41N-W,ProbA,anddpswalign1N-W,ClustalW,ProbA,anddpswalign7posteriorprobabilities,theytendtooutputdifferentalignments.Wecomparedfoursequencealignmentprogramson662pairsofhomologoustRNAsandRi-boswitches.WeusedtRNAsandRiboswitcheswithconsistentsecondarystructureannotationfromRfam[16]andBRAliBase2.1[112].Ofthefouralignmentsgeneratedforeachpairbyfourprograms,thebestalignmentisastheonewiththesmallestdistancetothereferencealignment.Table5.1liststhenumberofbestalignmentsgeneratedbyeachtool.Insomepairsofsequences,sometoolsgivethesamesmallestdistancetoreferencealignment.Forexample,therearethreepairsoftRNAsandRiboswitchesthatNeedleman-Wunsch(N-W)andProbAhavethesamesmallestdistancetothereferencealignment.InTable5.1,thesenumbersshowthatposteriorprobability-basedalignmentstendtobeclosertothereferencealignments.OfProbAanddpswalign,thelatteryieldsbetterperformance.Thus,wechoosethealignmentgeneratedbydpswalignasthebasealignment.815.3.3ImprovingalignmentaccuracyusingmachinelearningmethodsThebasealignmentsharessimilaritywiththereferencealignment.Wemodifythebasealignmenttoconstructanewalignmentthatisexpectedtolargelyoverlapthereferencealignment.Themod-isconductedbyestimatingthedistancebetweenthebasealignmentandthereferencealignmentforeachrowinthematrixgraph.ThedistanceiscomputedusingmachinelearningmethodsbyincorporatingthekeyobservationsdescribedatthebeginningofSection5.3.callyspeaking,letrbearowindexinthematrixgraph.Supposethatthenodeatrowrinthebasealignmentpathis(r;x).Thenodeatrowrinthereferencealignmentpathis(r;y).Thegoalistochangexintox0sothatthedistancebetweenthenewnode(r;x0)andthenode(r;y)inthereferencealignmentisminimized.Intheidealcase,ifx0==y,thedistanceiszero,indicatingtheoverlapofthenodesatrowr.Weappliedmachinelearningmethodstotrainingdata,whichcontainsbothbasealignmentsandreferencealignmentsandthustheknowndistanceyxforeachrow.Thetrainedmodelisthenappliedtotestdatatoconstructnewalignments.Therearetwostepsinconstructinganewalignment.First,featuresextractedfromtheseinfor-mationareusedasinputtomachinelearningmethodsandtheextensionisthencalculatedforeachrow.Alltheseextensionsroughlysketchanewalignment.However,someofthemmaynotbepartofalegalalignmentpath.Forexample,theextensioninrow3inFigure5.4isnotpartofalegalalignmentpath.Thus,inthesecondstep,weextensionstoconstructalegalalignment.Thereare4featuresweusedasinputtoorregressionmodels.F1:sequencesimilarity,anumberbetween0to1.F2-F4:pairwisedistancebetweenthethreealignmentsgeneratedbyClustalW,ProbA,anddpswalign.Therearethreedistancesbetweenthreealignments.ThesefeaturesarerowForexample,forrowrinthematrixgraph,letthenodesinpathscorrespondingto82thealignmentsgeneratedbyClustalWanddpswalignbe(r;w)and(r;z),respectively.WehaveF2=wz.Notethatthedirectionofthedistanceisalsoincluded.Forthetrainingdata,wefoundthatthisnumbercanbequitelargeforsomeinputpairs.Thelargestdistanceforarowis71,whichmeansthattwoalignmentsareveryfarawayfromeachother.Foreachpairofsequences,F1isevaluatedbycountingtheratioofidenticalbasesintheglobalalignmentgeneratedbydpswalign.FeaturesF2,F3,andF4arecomputedforeachrow.Thetrainedmodelisappliedtoeachrowinamatrixgraph.Theoutputvaluewillbetheextensionvalueforthenodeinthebasealignment.Intuitively,thesmallerthesequencesimilarityis,thelargerthedistancebetweenthesequencealignmentandthereferencealignmentpathsis.Weonlychosethreesequencealignmentprogramsbecauseoftheiroverallgoodperformance.Moreover,thethreeprogramshaveverydifferentimplementationsandthuscomplementwitheachother.Whenthepairwisedistancesbetweenthethreeprogramsareallsmall,threedifferentprogramshaveaconsensusonthealignmentandthusthedistancebetweenthebaseandreferencealignmentstendstobesmall.Theoreticallyspeaking,bothmodelsandregressionmodelscanbetrained.Formodels,weused10classeswithlabels:1,2,:::,10.Notethattheactualdifferencecanbelargerthan10.However,amajorityofthemshouldbesmallerthan10.Wetestedmultiplemodelsincludingdecisiontree,NaiveBayes,andSVM.The10-foldcross-validationtrainingshowsthatasimpledecisiontreeworksbestforthisproblem.Thus,wereporttheresultsusingdecisiontree.Forregressionmodels,weusedthevaluesforF1toF4asinputandtheoutputisusedastheextension.Wetesteddifferenttypesofsimpleandcomplicatedregressionmodels.Here,wereporttheresultoflinearregressionmodelbecauseityieldsbetteraccuracythanotherregressionmodels.AllmodeltrainingwasconductedusingWEKApackage[36].83Figure5.4:Constructalegalalignmentusinglocalsearch.representnodesinthebasealign-ment.representnewpositionsofnodesinbasealignmentafterextension.(A).node(3,3)ismovedto(3,4),whichisrepresentedby.(B).representinsertednodes.5.3.3.1ConstructingalegalalignmentTheabovemachinelearningmethodsgenerateanewpositionforeachnodeinthebasealignmentifthecomputedextensionvalueisnotzero.Theextensionstepdoesnotevaluatewhetherthenewpositionsarepartofalegalalignment.Itispossiblethatthenewnodesarenotinanylegalalignmentpath.Forexample,Figure5.4showstwocasesofillegalalignments.Inbothpanels,thenodesindicatedbycirclesaretheresultsofextension.Theydonotformlegalalignmentpaths.Moreover,thetwoexamplesrepresenttwogeneralcases.Accordingtothestandardglossaryofgraphtheory,anodeuisthedirectsuccessorofanodevifthereisanedgefromvtouinthepathofinterest.Figure5.4.(A)illustratesthecasethatthenewpositionofanodeissmallerthanthecolumnindexoftheexpecteddirectsuccessornode.,node(3,2)ischangedintonode(3,3)becausethetrainedmodeloutputsanextensionvalueof1forrow2.However,node(2,4)onlyacceptsthreepossiblepositionsforitssuccessornode:(3,5),(2,5),or(3,4).Figure5.4.(B)showsthecasethattheextendednodehasmuchlargercolumnindexthantheexpecteddirectsuccessornode.,forrow3,theexpecteddirectsuccessornodeof(2,4)shouldbeoneof(3;5);(2;5);(3;4).However,theextensioncreatesanewposition(3,7).Wetheillegalpathusingdifferentstrategiesforthesetwocases.Forcase1,wechangethepositionofthe84nodethatisnotpartofthelegalalignment.Inparticular,wemoveittorightbecauseitscolumnindexissmallerthanthecorrespondingnodeinalegalalignment.Thus,node(3,3)ismovedtonode(3,4),asindicatedbythesquarenodeinFigure5.4.(A).Forcase2,asthecolumnindexismuchlargerthantheexpectednode,weinsertnewnodes,asindicatedbythetwosquarenodesinFigure5.4.(B).5.3.4RunningtimeanalysisTherunningtimeofthismethodincludesthreeparts.Thepartdependsontheprogramthatgeneratesthebasealignmentusingposteriorprobabilities.Thesecondpartistoapplythetrainedregressionormodeltocomputetheextensionvalueforeachrow.Becauseweonlyhavefourfeatures,thisstepisefTherunningtimeisdecidedbythelengthofthealignmentpath.Thethirdstepistoexaminewhethertheextendednodesformalegalalignment.Ifnot,thenodeswillbetoformalegalalignment.Thetimecomplexityofthelaststepislineartothesizeofthealignmentpath.Overall,thestepdominatestherunningtimeofglu-RNA.Asweuseposteriorprobabilitytogeneratethebasealignment,theoverallrunningtimecomplexityisO(mn),wheremandnarethesizesofinputsequences,respectively.5.4ExperimentalResults5.4.1DataForncRNAswithhighsequencesimilarity,suchashomologousmaturemiRNAs,existingse-quencealignmentprogramscanproducereliablealignments.OurfocusishighlystructuredncR-NAsthatmaylackstrongsequencesimilarity.ForthesetypesofncRNAs,theirsequencesmay85evolvefasterthantheirsecondarystructuresandthusposehardcasesforsequencealignmentprograms.Forexample,intherecentrealignmentworkbyWilletal.[111],thewholegenomealignmentswereexamined.Onlyregionsthatshowedlowalignmentqualitywillbere-aligned.Thus,wefocusontestingourmethodonhighlystructuredncRNAs.Figure5.5:TypesofncRNAsandtheirconsensussecondarystructures(tRNAandriboswitches).86Table5.2:InformationoftheTrainingDataSetRfamncRNANumberofSequenceAverageFamilyTypencRNApairsIdentityLengthRF00005tRNA5025%-65%74.01RF00050FMN10035%-85%134.50RF00059THI5020%-85%109.99RF00162Sbox1250%-85%108.08RF00167Purine10045%-80%100.85RF00174Cobalamin5030%-95%203.17RF00521SAMalpha10055%-95%79.00RF01055MOCORNAmotif10025%-95%141.90RF01057SAHriboswitch10025%-90%83.508788Figure5.6:Sequencesimilarityhistogramoftrainingdataset.X-axisrepresentsthesequencesimilarity.Y-axisrepresentsthenumberofncRNApairs.5.4.1.1TrainingdataTotrainthesupervisedlearningmodels,weconstructedourtrainingdatasetusingninetypesofstructuredncRNAsfromBRAliBase2.1[112]andRfam11.0[16].InBRAliBase2.1,fourtypesofncRNAswereselected:tRNA,THI,Sbox,andCobalamin.AsRfamcontainsmorestructuredncRNAs,wechosevemoreriboswitchelementsfromseedalignmentofRfam.Intotal,wehavetRNAandotherriboswitchstructuresfromFMN,THI,Sbox,Purine,Cobalamin,SAMalpha,MOCORNAmotif,andSAHriboswitch.Riboswitchelementsplayanimportantroleintranscriptionandtranslationalregulationinprokaryotes.Inaddition,riboswitchfamilieshavecomplicatedsecondarystructuresandoveralllowsequencesimilarity[103,118].Thus,theywereincludedinourtrainingdata.TheconsensussecondarystructuresoftheninetypesoffamiliesareshowninFigure5.5.ThenumbersofpairsofsequencesoftheninetypesofncRNAsinourtrainingdatasetareshowninTable5.2.ThepairwisesequencesimilarityhistogramofthetrainingdataispresentedinFigure5.6.Foreachpairofsequencesinthetrainingdata,thetruestructuralalignmentisneeded.BRAl-iBaseprovidesqualityalignments.ForalignmentsextractedfromtheseedalignmentofRfam,wehavehighintheirpairwisealignments.First,thesepairwisealignmentsareextractedfromthemultiplealignmentofseedsequencesinRfam.Second,exceptSAMalpha,thesecondarystructuresofallothertypesofncRNAsarepublished(asshowninRfamalignmentThus,beingpartofqualifymultiplesequencealignment,thesepairwisealignmentsarenotgeneratedbyanyofthetestedsequencealignmentprogramandshouldnotbebiasedtowardsanyalignmentprogram.895.4.1.2TestingdataToevaluatetheperformanceofglu-RNAonqualityalignmentsonly,weconstructedourtestingdatasetfromBRAliBase2.1.FivetypesofncRNAswerechosen:tRNA,THI,Cobalamin,gcvT,andU2.ThreetypesofncRNAs(tRNA,THI,andCobalamin)havethesametypesastrainingdataset.However,allpairsofncRNAsfromthistestingdatasetaredifferentfromthoseinthetrainingdataset.WechosetRNA,THI,andCobalaminbecauseoftheircomplicatedsecondarystructure.Moreover,thesequencesofTHIandCobalaminarelong.TwotypesofncRNAs(gcvTandU2)werenewandwerenotincludedintrainingdataset.ThesecondarystructuresofgcvTandU2areshowninFigure5.7.ThenumbersofpairsofsequencesofthevetypesofncRNAsinourtestingdatasetarepresentedinTable5.3.Figure5.7:ConsensussecondarystructuresofU2andgcvT.5.4.2Performancecomparisonbetweendifferentandregres-sionmodelsTherearemultipleandregressionmodelstochoosefrom.Withoutknowingwhichmodelperformsbestforthisproblem,weusedtheoutputofcross-validationonthetrainingdatato90Table5.3:InformationoftheTestingDataSet.Notethatthereisnooverlapbetweenthetrainingandtestingdata.RfamncRNANumberofSequenceAverageFamilyTypencRNApairsIdentityLengthRF00004U26040%-95%185.82RF00005tRNA6025%-85%74.76RF00059THI12020%-75%108.55RF00174Cobalamin5025%-70%204.09RF00504gcvT7025%-80%99.72choosethebestmodel.AllmodelsweretrainedusingWeka3.6[36].Foreachpairofsequencesinthetrainingset,theirsequencealignmentsweregeneratedusingdpswalign,ProbA,andClustalW.Foreachrowinthematrixgraph,thedistancebetweenthenodeinthebasealignment,generatedbydpswalign,andthenodeinthereferencealignment,wascomputedandusedtoevaluatethepredictionperformanceofdifferentmethods.Weused10-foldcross-validation,where10isthenumberofdatasetpartitions.Thetrainingdataineachfoldmustcontaindataofallclasses.Duringeachrun,ninepartitionswereusedfortrainingandtheremainingonewasusedfortesting.Thisprocedurewasrepeated10timessothateachpartitionwasusedfortestingexactlyone.Theaccuracyforeachrunwasreportedusingthefollowingequation:Accuracy=XY(5.4)whereXisthenumberofrowsforwhichthepredicteddistance(i.e.predictedextension)isthesameastheknowndistance.Yisthetotalnumberofrowsofallalignmentsinthetrainingdataset.Theaverageaccuracywasreportedfor10-foldcross-validation.Weconductedtheexperimentswithfourstandardmodels:DecisionTrees,NaiveBayes,k-NearestNeighbor(k-NN),andSupportVectorMachines(SVM).AsindicatedintheMethodsSection,theclassesinclude0,1,2,etc.Forregressionmodels,wedidtheexperiments91Table5.4:Theaccuracyofdifferentmodels.MachineLearningModelAccuracyDecisionTree84.64%NaiveBayes70.14%k-NN79.95%SVM76.20%Linearregression80.99%Isotonicregression79.08%Paceregression73.37%usingLinearregression,Isotonicregression,andPaceregression.WeusedWekatoevaluatetheperformanceofmodelsandregressionmodelswithdefaultparameters.TheaccuracyissummarizedinTable5.4.AccordingtoTable5.4,decisiontreehasthebestaccuracyamongNaiveBayes,k-NN,andSVM.Thus,weuseddecisiontreetodeterminetherowextensioninourimplementationnamedglu-RNA.Forregressionmodels,linearregressionhashigheraccuracythanisotonicre-gressionandpaceregression.Todistinguishtheimplementationofouralignmentprogramusinglinearregressionmodelfromdecisiontree,wenametheformerREglu-RNA,whereRErepresentsregressionmodel.5.4.3PerformancecomparisonofdifferentalignmentprogramsInthisexperiment,webenchmarkedtheperformanceofglu-RNAwithotherpopularalignmentprogramsonthetestingdataset.Weappliedglu-RNA,REglu-RNA,EMBOSS-Needle[78],ClustalW[25],ProbA[70],dpswalign[40],andDynalign[37]to360ncRNApairsofourtest-ingdatasetpresentedinTable5.3.EMBOSSNeedle,ClustalW,ProbA,anddpswalignareglobalsequencealignmenttools.Dynalignisaglobalstructuralalignmenttool.ForstandardNeedleman-Wunsch,weutilizedEMBOSSNeedleinEMBOSSpackageversion6.4.0withgapopeningpenaltyof10andgapextensionpenaltyof0.5.ClustalWwasdesignedtogenerateglobal92multiplesequencealignment.Inthisexperiment,weappliedClustalWtoapairofncRNAse-quences.WeusedClustalW2fromClustalWpackageversion2.1withdefaultparameters.ProbAisapairwisesequencealignmenttoolusingpartitionfunction-basedposteriorprobabilities.WeemployedprobAversion0.1.1withdefaultparameters.dpswalignisanotherposteriorprobability-basedsequencealignment.However,itusesapair-HMMtocomputetheposteriorprobabilities.WeuseddpswalignprogramfromIanHolmes'Dartpackage[40]withdefaultparametersandtwooptions:-ptand-oa.fi-ptfloptionproducesposterior-probabilitytablebetweentwosequences.fi-oaflisusedtogeneratetheoptimalaccuracyalignment.AsClustalW,ProbA,anddpswalignarefeaturesinourmodels,wecomparedglu-RNAandREglu-RNAwiththesetools.Thegoalistoevaluatewhetherglu-RNAandREglu-RNAcangeneratemoreaccuratealignmentthanthesetools.Dynalignsimultaneouslypredictsstructureandalignment.Ithastheadvantageofemploy-ingthemutualinformationofapairofsequencestorestrictthepredictionofsecondarystructure.WeusedDynalignprogramfromRNAstructurepackageversion5.3andranDynalignintheglobalmode.Notethatthereareanumberofstructuralalignmentprograms.WechoseDynalignbecauseithadcarefullydesignedbandingstrategyforfaststructuralalignment.Thus,itisthemostrelevantstructuralalignmentprogramforthiswork.Toevaluatetheperformanceofdifferentalignmentprograms,weusedthethedistancebetweentheproducedalignmentandthereferencealignmentasinSection5.3.1.Thedistanceeval-uateshowclosethepredictedalignmenttoreferencealignment.Smalldistanceindicateshighaccuracy.Thereare360pairsofncRNAsinthetestingdataset.Eachprogramproducesanalignmentforeachpair.Werecordedtheaveragedistanceonallpairs.glu-RNAhasthebestover-allperformanceon360pairs,withaveragedistance273.219.ThenextlowestaveragedistanceisREglu-RNA(283.958).dpswalignandDynalignhavethethirdandfourthlowestaveragedistance:298.606and377.397.Theaveragedistancesofdifferentalignmentprogramsoneachfamilyare93showninTable5.5.Wehave4outof5familiesthatglu-RNAandREglu-RNAachievedthebestperformance.9495Table5.5:Averagedistancesofdifferentalignmentprogramson360pairsofncRNAsinthetestingdataset.BoldnumberrepresentsthelowestaveragedistanceforanncRNAfamily.UnderlinednumbershowsthesecondlowestaveragedistanceforanncRNAfamily.Type#testpairsNeedleman-WunschClustalWprobAdpswalignDynalignglu-RNAREglu-RNAU260397.100282.433435.233159.033196.950155.217157.817tRNA60440.967184.800232.167156.400104.183133.583134.633THI120373.892430.283520.083256.742396.742234.125239.225Cobalamin501007.5801323.9601620.100762.500902.300645.120733.020gcvT70421.200608.429670.943280.543358.157295.429276.000Table5.6:Averagedistancesofdifferentalignmentprogramson58pairsofsequencesinvefamilieswithsequencesimilaritybelow40%.Boldnumberrepresentsthelowestaveragedistance.Underlinednumbershowsthesecondlowestaveragedistance.#testpairsNeedleman-WunschClustalWprobAdpswalignDynalignglu-RNAREglu-RNA581182.6551510.7931626.431726.1211101.517605.534667.345Moreover,wetestedtheperformanceofalignmentprogramsonsequenceswithsequencesim-ilaritybelow40%.Thereare58pairsofsequenceswithsequencesimilaritybelow40%.TheaveragedistancesoftestedalignmentprogramsonthesepairsareshowninTable5.6.Thisex-perimentshowsthatglu-RNAcangeneratemoreaccuratencRNAalignmentsforsequencepairswithsimilaritybelow40%thantestedsequencealignmentprograms.Inaddition,alltheaboveexperimentsshowthatglu-RNAgeneratedmoreaccuratealignmentsthanDynalign,whichisapopularstructuralalignmentprogram.5.4.4DiscussionOurprogram,glu-RNA,fromincorporatingposteriorprobabilityandamachinelearningapproach.Figure5.8showsthealignmentsofaTHIpairwithsequencesimilarity45%generatedbyglu-RNA,dpswalign,andDynalign.Weusethealignmentgeneratebydpswalignasthebasealignmentandimprovetheaccuracybyusingtrainedmodels.Thisisarepresentativeexam-pleshowingthatglu-RNAcangeneratemoreaccuratealignmentsthandpswalignandDynalign.Ouralignmentprogram,glu-RNA,cangenerateanaccuratencRNAalignmentrelyingonlyonsequencesimilarity.Thus,ithasthesamerunningtimecomplexityasothersequencealignmentprograms.Here,wecompareglu-RNAandthestructuralalignmentprogramDynalign.OnshortncRNAssuchastRNAinthetestingdataset,theaveragerunningtimeofglu-RNAis32.88swhiletheaveragerunningtimeofDynalignis490.49s.Whenthelengthofsequenceincreases,theaveragerunningofDynalignincreasesfast.IntheCobalamindataset,theaveragerunningtimeofglu-RNAis36.26s.However,theaveragerunningtimeofDynalignis143,014s.Thus,forlargeinputdatasuchasthosefrommetagenomics,applyingglu-RNAisexpectedtobemorepromisingtoidentifynovelncRNAclusters.96Figure5.8:Comparingthealignmentsgeneratedbyglu-RNA,dpswalign,andDynaligntothereferencealignmentofaTHIpairwithsequencesimilarity45%.glu-RNAcanproducemoreaccuratealignmentthandpswalignandDynalign.5.5ConclusionandfutureworkWepresentedanaccuratencRNAalignmentprogramthatonlyreliesonsequencesimilarity.Althoughitisconceptuallyeasy,itproducesalignmentswithhigheraccuracythancommonlyusedsequencealignmentprograms.Evenwithoutusinganystructuralinformation,ourprogramachievedslightlybetterperformancethanawidelyusedstructuralalignmentprogram.Thealign-mentprogramhasquadraticrunningtimecomplexity,whichisboththeoreticallyandpractically97fasterthanstandardstructuralalignmentprograms.ThealignmentscanbeusedasinputtoncRNAgenetoolsandcanalsobeusedforncRNAsecondarystructureprediction.Forexample,thepoorlyalignedregionsinexistingpairwisegenome-scalealignmentscanberealignedbyourmethodbeforencRNAgenetoolsareapplied.Ourprogramcanbefurtherimproved.First,weusedthemodelsandregressionmodelsconvenientlyincorporatedinWeka.Amorerigorousstudyaboutchoosingthebestmodelforthisproblemisneeded.Second,thecurrentmethodtreatseachrowseparately,whichisnottrueintruealignments.Thepredictedextensionvaluesforconsecutiverowshavestrongdependencewitheachother.Thus,akth-orderMarkovmodelshouldbeincorporatedtoimproveglu-RNA.Third,thecurrentimplementationofourmethodcanonlybeappliedtoapairofsequences.Wealsoplantoextendthismethodtohandlemultiplesequencessimultaneously.Aspartofourfuturework,weplantoapplyourpairwisealignmentprogramglu-RNAtoshortsequencesproducedinNGSdata.Oneconvenientapplicationistoapplyglu-RNAtoclustersequencesfrommetagenomicdatawiththegoalofdetectingnovelstructuralncRNAs.98Chapter6ConclusionandFutureWorkInthisdissertation,weintroducedgenepredictionandhomologysearchaligningasequencetoproteinfamilieswhicharetwoessentialstepsforannotatingfunctionalelementsinnewlyse-quencedgenomes.Then,threechallengesintheofgeneprediction,homologysearch,andncRNAsearcharediscussed.First,traditionalgenepredictiontoolsarenotcarefullydesignedtoidentifycodingregionsthathave50-30changingpatterns.Second,NGSproducesplentyshortreads,existinghomologysearchtoolsarenotsensitiveenoughtoalignshortreadsthatarepartofremotehomologs.Third,mostexistingstructuralalignmentprogramscannotbeconvenientlyap-pliedtoNGSdata.Inordertoaddressthechallenges,wehavedevelopedthreetoolsandcomparedtheperformancewithexistingprograms.First,IdevelopedatoolnamedAUGUSTUS-GC,whichwasdesignedtoaccuratelypredictprotein-codingregionsthathavevariousG+Ccontentand50-30changingpatterns.ItismoreaccuratethantheexistinggeneprogramssuchasAUGUSTUS.Second,Idevelopedanapproachtoalignshortreadsusingpaired-endreadinformation.Itcanrecoverthemissingendanddeterminethetruedomains.Third,Idevelopedaprogramnamedglu-RNA,whichwasdesignedtoaccuratelyalignhighlystructuredncRNAsemployingonlysequencesimilarity.Accordingtothepreviousstudies,severaldirectionscanbeimproved.Forgeneprediction,someexistinggenepredictiontoolscanidentify50UTRand30UTRregions.Currently,AUGUSTUS-GCisnotabletoidentifyUTRregions.WeplantoextendAUGUSTUS-GCtoaccuratelypredict995'UTRand3'UTRregions.Existinggenepredictiontoolsdemonstratedthatusingextrinsicevi-dencederivedfrommatchestoanESTorproteindatabasecanimprovetheaccuracyofgenepre-diction.WewillfurtherimproveAUGUSTUS-GCaccuracybyusinghintsfromexternalsourcesifthedataisavailable.Finally,weplantoprovideuserswithasystematicalwaytobetteres-timatelowTandhighTforG+Ccontentsbasedonthetrainingdata.Forhomologysearch,ourcurrentmethodreliesonHMMER3.0atthestep.Weplantoapplybanded-viterbialgorithmtoimprovethesensitivityofshortreadhomologysearchtool.ForncRNAsearch,currentglu-RNAemployedmodelsandregressionmodelsfromWeka.Wecanfurtherimproveglu-RNAbythebetterwaytochoosethebestmodelforthisproblem.Furthermore,thecurrentglu-RNAcanbeconductedpairwisealignment,weplantoextendtoconductmultiplese-quencealignments.Wealsoplantoapplyglu-RNAtoreadsgeneratedbyNGSdata.Forexample,wecanuseglu-RNAtoclustershortsequencesfrommetagenomicdatatodetectnovelstructuralncRNAs.100BIBLIOGRAPHY101BIBLIOGRAPHY[1]M.Alexandersson,S.Cawley,andL.Pachter.SLAM:Cross-SpeciesGeneFindingandAlignmentwithaGeneralizedPairHiddenMarkovModel.GenomeResearch,13(3):496Œ502,2003.[2]O.Aljawad,Y.Sun,A.Liu,andJ.Lei.NcRNAHomologySearchUsingHammingDis-tanceSeeds.InProceedingsofthe2NdACMConferenceonBioinformatics,ComputationalBiologyandBiomedicine,BCB'11,pages209Œ217,NewYork,NY,USA,2011.ACM.[3]S.F.Altschul,W.Gish,W.Miller,E.W.Myers,andD.J.Lipman.Basiclocalalignmentsearchtool.JournalofMolecularBiology,215(3):403Œ410,Oct1990.[4]J.A.MartinandZ.Wang.Next-generationtranscriptomeassembly.NatRevGenet,12(10):671Œ682,2011.[5]AnIntroductiontoNext-GenerationSequencingTechnology.http://www.illumina.com.[6]Augustusdownload.http://augustus.gobics.de/binaries/old/.[7]Augustusserver.http://augustus.gobics.de/datasets/.[8]V.BafnaandD.Huson.TheconservedexonmethodforgeneProcIntConfIntellSystMolBiol.,8:3Œ12,2000.[9]L.E.Baum,J.A.Eagon,etal.AninequalitywithapplicationstostatisticalestimationforprobabilisticfunctionsofMarkovprocessesandtoamodelforecology.Bull.Amer.Math.Soc,73(3):360Œ363,1967.[10]L.E.BaumandT.Petrie.StatisticalInferenceforProbabilisticFunctionsofFiniteStateMarkovChains.TheAnnalsofMathematicalStatistics,37(6):1554Œ1563,1966.[11]J.Besemer,A.Lomsadze,andM.Borodovsky.GeneMarkS:aself-trainingmethodforpredictionofgenestartsinmicrobialgenomes.Implicationsforsequencemotifsinregulatoryregions.NucleicAcidsResearch,29(12):2607Œ2618,2001.[12]E.BirneyandR.Durbin.Dynamite:axiblecodegeneratinglanguagefordynamicpro-grammingmethodsusedinsequencecomparison.ProcIntConfIntellSystMolBiol.,5:56Œ64,1997.[13]J.Buhler,U.Keich,andY.Sun.DesigningSeedsforSimilaritySearchinGenomicDNA.InProceedingsoftheSeventhAnnualInternationalConferenceonResearchinComputationalMolecularBiology,RECOMB'03,pages67Œ75,NewYork,NY,USA,2003.ACM.102[14]C.BurgeandS.Karlin.Predictionofcompletegenestructuresinhumangenomicdna.JournalofMolecularBiology,268(1):78Œ94,1997.[15]C.B.BurgeandS.Karlin.Findingthegenesingenomicdna.CurrentOpinioninStructuralBiology,8(3):346Œ354,1998.[16]S.W.Burge,J.Daub,R.Eberhardt,J.Tate,L.Barquist,E.P.Nawrocki,S.R.Eddy,P.P.Gardner,andA.Bateman.Rfam11.0:10yearsofrnafamilies.NucleicAcidsResearch,41(D1):D226ŒD232,2013.[17]F.Casals,Y.Idaghdour,J.Hussin,andP.Awadalla.Next-generationsequencingapproachesforgeneticmappingofcomplexdiseases.JournalofNeuroimmunology,248(1-2):10Œ22,2012.SpecialIssueonNewtechnologiesforbiomarkerdiscoveryinmultiplesclerosis.[18]S.E.Cawley,A.I.Wirth,andT.P.Speed.PhatŒageneprogramforPlasmodiumfalciparum.MolecularandBiochemicalParasitology,118(2):167Œ174,2001.[19]K.-M.Chao,W.R.Pearson,andW.Miller.Aligningtwosequenceswithinadiagonalband.ComputApplBiosci.,8(5):481Œ7,1992.[20]S.Chikkagoudar,D.R.Livesay,andU.Roshan.PLAST-ncRNA:PartitionfunctionLocalAlignmentSearchToolfornon-codingRNAsequences.NucleicAcidsResearch,38(Web-Server-Issue):59Œ63,2010.[21]S.Chikkagoudar,D.R.Livesay,andU.Roshan.PLAST-ncRNA:PartitionfunctionLocalAlignmentSearchToolfornon-codingRNAsequences.NucleicAcidsResearch,38:W59ŒW63,2010.[22]P.G.Chis,TiberiuandHarrison.AnalysingandPredictingPatientArrivalTimes,pages77Œ85.SpringerInternationalPublishing,Cham,2013.[23]A.L.Delcher,D.Harmon,S.Kasif,O.White,andS.L.Salzberg.ImprovedmicrobialgenewithGLIMMER.NucleicAcidsResearch,27(23):4636Œ4641,1999.[24]R.D.DowellandS.R.Eddy.EfpairwiseRNAstructurepredictionandalignmentusingsequencealignmentconstraints.BMCBioinformatics,7(400),2006.[25]J.D.Thompson,D.G.Higgins,andT.J.Gibson.CLUSTALW:improvingthesensitivityofprogressivemultiplesequencealignmentthroughsequenceweighting,gappenaltiesandweightmatrixchoice.NucleicAcidsResearch,22(22):4673Œ4680,1994.[26]C.Dupont,D.Rusch,S.Yooseph,M.Lombardo,R.Richter,R.Valas,M.Novotny,J.Yee-Greenbaum,J.Selengut,D.Haft,A.Halpern,R.Lasken,K.Nealson,R.Friedman,andJ.Venter.GenomicinsightstoSAR86,anabundantanduncultivatedmarinebacteriallin-eage.TheISMEJournal,6(6):1186Œ1199,2012.103[27]R.Durbin,S.R.Eddy,A.Krogh,andG.Mitchison.BiologicalSequenceAnalysisProba-bilisticModelsofProteinsandNucleicAcids.CambridgeUniversityPress,UK,1998.[28]L.Duret,D.Mouchiroud,andC.Gautier.Statisticalanalysisofvertebratesequencesrevealsthatlonggenesarescarceingc-richisochores.JournalofMolecularEvolution,40(3):308Œ317,1995.[29]S.R.Eddy.AcceleratedHMMSearches.PLoSComputBiol,7(10):e1002195,102011.[30]E.B.etal.andanalysisoffunctionalelementsin1%ofthehumangenomebytheencodepilotproject.Nature,447(7146):799Œ816,2007.[31]R.D.Finn,A.Bateman,J.Clements,P.Coggill,R.Y.Eberhardt,S.R.Eddy,A.Heger,K.Hetherington,L.Holm,J.Mistry,E.L.L.Sonnhammer,J.Tate,andM.Punta.Pfam:theproteinfamiliesdatabase.NucleicAcidsResearch,42(D1):D222ŒD230,2014.[32]R.D.Finn,J.Mistry,B.Schuster-Bockler,S.GrifV.Hollich,T.Lassmann,S.Moxon,M.Marshall,A.Khanna,R.Durbin,S.R.Eddy,E.L.L.Sonnhammer,andA.Bateman.Pfam:clans,webtoolsandservices.NucleicAcidsResearch,34(suppl1):D247ŒD251,2006.[33]P.P.Gardner,A.Wilm,andS.Washietl.AbenchmarkofmultiplesequencealignmentprogramsuponstructuralRNAs.NucleicAcidsResearch,33(8):2433Œ2439,2005.[34]S.S.GrossandM.R.Brent.UsingMultipleAlignmentstoImproveGenePrediction.JournalofComputationalBiology,13(2):379Œ393,2006.[35]D.H.Haft,J.D.Selengut,andO.White.TheTIGRFAMsdatabaseofproteinfamilies.NucleicAcidsResearch,31(1):371Œ373,2003.[36]M.Hall,E.Frank,G.Holmes,B.Pfahringer,P.Reutemann,andI.H.Witten.TheWEKAdataminingsoftware:Anupdate.SIGKDDExplorations,11(1):10Œ18,2009.[37]A.Harmanci,G.Sharma,andD.Mathews.EfpairwiseRNAstructurepredictionusingprobabilisticalignmentconstraintsinDynalign.BMCBioinformatics,8(1):130,2007.[38]J.H.Havgaard,R.B.Lyngso,andJ.Gorodkin.TheFOLDALIGNwebserverforpairwisestructuralRNAalignmentandmutualmotifsearch.Nucl.AcidsRes.,33(suppl.2):W650Œ653,2005.[39]J.HENDERSON,S.SALZBERG,andK.H.FASMAN.FindingGenesinDNAwithaHiddenMarkovModel.JournalofComputationalBiology,4(2):127Œ141,1997.[40]I.Holmes.Studiesinprobabilisticsequencealignmentandevolution,1998.104[41]IMG:IntegratedMicrobialGenomes.http://img.jgi.doe.gov/,2011.[42]A.M.JeffreyandW.Zhong.Next-generationtranscriptomeassembly.NatureReviewsGenetics,12:671Œ682,2011.[43]N.Jiang,A.A.Ferguson,R.K.Slotkin,andD.Lisch.Pack-Mutatorliketransposableele-ments(Pack-MULEs)inducedirectionalofgenesthroughbiasedinsertionandDNAacquisition.ProceedingsoftheNationalAcademyofSciencesoftheUnitedStatesofAmerica,108(4):1537Œ1542,2011.[44]N.C.JonesandP.Pevzner.Anintroductiontobioinformaticsalgorithms.Computationalmolecularbiology.MITPress,Cambridge(Mass.),London,2004.[45]S.KarlinandS.F.Altschul.Methodsforassessingthestatisticalofmolecularsequencefeaturesbyusinggeneralscoringschemes.ProcNatlAcadSciUSA.,87(6):2264Œ2268,1990.[46]B.KnudsenandJ.Hein.Pfold:RNAsecondarystructurepredictionusingstochasticcontext-freegrammars.NucleicAcidsRes,31(13):3423Œ3428,2003.[47]I.Korf.Genendinginnovelgenomes.BMCBioinformatics,5(1):59,2004.[48]I.Korf,P.Flicek,D.Duan,andM.R.Brent.Integratinggenomichomologyintogenestructureprediction.Bioinformatics,17(suppl1):S140ŒS148,2001.[49]L.Krause,A.C.McHardy,T.W.Nattkemper,A.Puhler,J.Stoye,andF.Meyer.GISMO-geneusingasupportvectormachineforORFNucleicAcidsResearch,35(2):540Œ549,2007.[50]A.Krogh.TwomethodsforimprovingperformanceofanHMMandtheirapplicationforgeneProcIntConfIntellSystMolBiol,5:179Œ186,1997.[51]A.Krogh.UsingDatabaseMatcheswithHMMGeneforAutomatedGeneDetectioninDrosophila.GenomeResearch,10(4):523528,2000.[52]A.Krogh,M.Brown,I.S.Mian,K.Sjlander,andD.Haussler.HiddenMarkovmodelsincomputationalbiology:applicationstoproteinmodeling.JOURNALOFMOLECULARBIOLOGY,235:1501Œ1531,1994.[53]D.Kulp,D.Haussler,M.Reese,andF.Eeckman.AgeneralizedhiddenMarkovmodelfortherecognitionofhumangenesinDNA.ProcIntConfIntellSystMolBiol,4:134Œ142,1996.[54]B.Langmead,C.Trapnell,M.Pop,andS.Salzberg.Ultrafastandmemory-efalign-mentofshortDNAsequencestothehumangenome.GenomeBiology,10(3):R25,2009.105[55]T.S.LarsenandA.Krogh.EasyGeneŒaprokaryoticgenethatranksORFsbystatisticalBMCBioinformatics,4(1):21,2003.[56]A.L.Lehninger,D.L.Nelson,andM.M.Cox.Lehningerprinciplesofbiochemistry.W.H.Freeman,NewYork,2013.[57]J.Lei,P.Techa-angkoon,andY.Sun.NCRNAhomologysearchbasedonanextendedtwo-dimensionalchainalgorithm.IEEE/ACMTransComputBiolBioinform.,2012.[58]D.Li,C.-M.Liu,R.Luo,K.Sadakane,andT.-W.Lam.MEGAHIT:anultra-fastsingle-nodesolutionforlargeandcomplexmetagenomicsassemblyviasuccinctdeBruijngraph.Bioinformatics,31(10):1674Œ1676,2015.[59]T.Lingner,K.P.Aßhauer,S.F.,andP.Meinicke.CoMet-awebserverforcomparativefunctionalofmetagenomes.NucleicAcidsResearch,39(suppl2):W518,2011.[60]L.Liu,Y.Li,S.Li,N.Hu,Y.He,R.Pong,D.Lin,L.Lu,andM.Law.ComparisonofNext-GenerationSequencingSystems.JournalofBiomedicineandBiotechnology,251364,2012.[61]A.V.LukashinandM.Borodovsky.GeneMark.hmm:NewsolutionsforgeneNucleicAcidsResearch,26(4):1107Œ1115,1998.[62]G.LunterandM.Goodson.Stampy:AstatisticalalgorithmforsensitiveandfastmappingofIlluminasequencereads.GenomeResearch,21(6):936Œ939,2011.[63]B.Ma,J.Tromp,andM.Li.PatternHunter:fasterandmoresensitivehomologysearch.BIOINFORMATICS,18(3):440Œ445,2002.[64]S.Mahony,J.O.McInerney,T.J.Smith,andA.Golden.GenepredictionusingtheSelf-OrganizingMap:automaticgenerationofmultiplegenemodels.BMCBioinformatics,5(1):23,2004.[65]W.H.Majoros,M.Pertea,C.Antonescu,andS.L.Salzberg.GlimmerM,ExonomyandUnveil:ThreeabInitioEukaryoticNucleicAcidsResearch,31(13):36013604,2003.[66]W.H.Majoros,M.Pertea,andS.L.Salzberg.TigrScanandGlimmerHMM:twoopensourceabinitioeukaryoticBioinformatics,20(16):2878Œ2879,2004.[67]E.R.Mardis.Next-GenerationDNASequencingMethods.AnnualReviewofGenomicsandHumanGenetics,9(1):387Œ402,2008.PMID:18576944.[68]Y.Marquez,J.W.Brown,C.Simpson,A.Barta,andM.Kalyna.TranscriptomesurveyrevealsincreasedcomplexityofthealternativesplicinglandscapeinArabidopsis.GenomeResearch,22(6):1184Œ1195,2012.106[69]D.H.MathewsandD.H.Turner.Dynalign:analgorithmforthesecondarystructurecommontotwoRNAsequences.JournalofMolecularBiology,317(2):191Œ203,2002.[70]U.Mckstein,I.L.Hofacker,andP.F.Stadler.Stochasticpairwisealignments.Bioinformat-ics,18(suppl2):S153ŒS160,2002.[71]M.L.Metzker.Sequencingtechnologies-thenextgeneration.NatRevGenet,11(11):31Œ46,2010.[72]F.Meyer,R.Overbeek,andA.Rodriguez.FIGfams:yetanothersetofproteinfamilies.NucleicAcidsResearch,37(20):6643Œ6654,2009.[73]I.M.MeyerandR.Durbin.ComparativeabinitiopredictionofgenestructuresusingpairHMMs.Bioinformatics,18(10):1309Œ1318,2002.[74]A.Mitchell,F.Bucchini,G.Cochrane,H.Denise,P.t.Hoopen,M.Fraser,S.Pesseat,S.Pot-ter,M.Scheremetjew,P.Sterk,andR.D.Finn.EBImetagenomicsin2016-anexpandingandevolvingresourcefortheanalysisandarchivingofmetagenomicdata.NucleicAcidsResearch,2015.[75]B.Morgenstern,O.Rinner,S.Abdeddam,D.Haase,K.F.X.Mayer,A.W.M.Dress,andH.-W.Mewes.Exondiscoverybygenomicsequencealignment.Bioinformatics,18(6):777Œ787,2002.[76]T.Namiki,T.Hachiya,H.Tanaka,andY.Sakakibara.Metavelvet:anextensionofvelvetassemblertodenovometagenomeassemblyfromshortsequencereads.NucleicAcidsResearch,40(20):e155,2012.[77]E.P.NawrockiandS.R.Eddy.Infernal1.1:100-foldfasterRNAhomologysearches.Bioinformatics,29(22):2933Œ2935,2013.[78]S.B.NeedlemanandC.D.Wunsch.Ageneralmethodapplicabletothesearchforsimilar-itiesintheaminoacidsequenceoftwoproteins.JournalofMolecularBiology,48(3):443Œ53,1970.[79]S.Ouyang,W.Zhu,J.Hamilton,H.Lin,M.Campbell,K.Childs,F.Thibaud-Nissen,R.L.Malek,Y.Lee,L.Zheng,J.Orvis,B.Haas,andJ.Wortman.TheTIGRRiceGenomeAnnotationResource:improvementsandnewfeatures.NucleicAcidsResearch,35:D883ŒD887,2007.[80]J.Park,K.Karplus,C.Barrett,R.Hughey,D.Haussler,T.Hubbard,andC.Chothia.Se-quenceComparisonsUsingMultipleSequencesDetectThreeTimesasManyRemoteHo-mologuesAsPairwiseMethods.jmb,284(4):1201Œ1210,1998.107[81]G.Parra,P.Agarwal,J.F.Abril,T.Wiehe,J.W.Fickett,andR.Guig.ComparativeGenePredictioninHumanandMouse.GenomeResearch,13(1):108Œ117,2003.[82]G.Parra,E.Blanco,andR.Guigo.GeneIDinDrosophila.GenomeResearch,10(4):511Œ515,2000.[83]J.S.Pedersen,G.Bejerano,A.Siepel,K.Rosenbloom,K.Lindblad-Toh,E.S.Lander,J.Kent,W.Miller,andD.Haussler.andofconservedRNAsecondarystructuresintheHumanGenome.PLoSComputBiol,2(4):251Œ262,2006.[84]Y.Peng,H.C.M.Leung,S.M.Yiu,andF.Y.L.Chin.Meta-IDBA:adenovoassemblerformetagenomicdata.Bioinformatics,27(13):i94Œi101,2011.[85]E.Prestat,M.M.David,J.Hultman,N.Tas,R.Lamendella,J.Dvornik,R.Mackelprang,D.D.Myrold,A.Jumpponen,S.G.Tringe,E.Holman,K.Mavromatis,andJ.K.Jansson.FOAM(FunctionalOntologyAssignmentsforMetagenomes):aHiddenMarkovModel(HMM)databasewithenvironmentalfocus.NucleicAcidsResearch,2014.[86]M.Punta,P.C.Coggill,R.Y.Eberhardt,J.Mistry,J.Tate,C.Boursnell,N.Pang,K.Forslund,G.Ceric,J.Clements,A.Heger,L.Holm,E.L.L.Sonnhammer,S.R.Eddy,A.Bateman,andR.D.Finn.ThePfamproteinfamiliesdatabase.NucleicAcidsResearch,40(D1):D290ŒD301,2012.[87]L.RabinerandB.-H.Juang.Fundamentalsofspeechrecognition.1993.[88]L.R.Rabiner.AtutorialonhiddenMarkovmodelsandselectedapplicationsinspeechrecognition.ProceedingsoftheIEEE,77(2):257Œ286,1989.[89]E.RivasandS.R.Eddy.NoncodingRNAgenedetectionusingcomparativesequenceanalysis.BMCBioinformatics,2(1):8Œ26,2001.[90]E.P.Rocha.CodonusagebiasfromtRNA'spointofview:Redundancy,specialization,andefdecodingfortranslationoptimization.GenomeResearch,14(11):2279Œ2286,2004.[91]A.A.SalamovandV.V.Solovyev.AbinitioGeneFindinginDrosophilaGenomicDNA.GenomeResearch,10(4):516Œ522,2000.[92]S.L.Salzberg,A.L.Delcher,S.Kasif,andO.White.MicrobialgeneusinginterpolatedMarkovmodels.NucleicAcidsResearch,26(2):544Œ548,1998.[93]F.SangerandA.Coulson.ArapidmethodfordeterminingsequencesinDNAbyprimedsynthesiswithDNApolymerase.JournalofMolecularBiology,94(3):441Œ448,1975.108[94]F.Sanger,S.Nicklen,andA.Coulson.DNAsequencingwithchain-terminatingin-hibitors.ProceedingsofTheNationalAcademyofSciencesofTheUnitedStatesOfAmer-ica,74(12):5463Œ5467,1977.[95]D.Sankoff.SimultaneoussolutionoftheRNAfolding,alignmentandprotosequenceprob-lems.SIAMJournalonAppliedMathematics,45(5):810Œ825,1985.[96]M.Shakya,C.Quince,J.H.Campbell,Z.K.Yang,C.W.Schadt,andM.Podar.Com-parativemetagenomicandrRNAmicrobialdiversitycharacterizationusingarchaealandbacterialsyntheticcommunities.Environmentalmicrobiology,15(6):1882Œ1899,2013.[97]A.M.S.ShresthaandM.C.Frith.AnapproximateBayesianapproachformappingpaired-endDNAreadstoareferencegenome.Bioinformatics,29(8):965Œ972,2013.[98]M.Stanke.GenePredictionwithaHidden-MarkovModel.PhDthesis,UniversittGttingen,2003.[99]M.Stanke,M.Diekhans,R.Baertsch,andD.Haussler.UsingnativeandsyntenicallymappedcDNAalignmentstoimprovedenovogeneBioinformatics,24(5):637Œ644,2008.[100]M.StankeandB.Morgenstern.Augustus:awebserverforgenepredictionineukaryotesthatallowsuserconstraints.NucleicAcidsResearch,33(suppl2):W465ŒW467,2005.[101]M.StankeandS.Waack.GenepredictionwithahiddenMarkovmodelandanewintronsubmodel.Bioinformatics,19(suppl2):ii215Œii225,2003.[102]Y.SunandJ.Buhler.DesigningmultiplesimultaneousseedsforDNAsimilaritysearch.InP.E.BourneandD.editors,RECOMB,pages76Œ84.ACM,2004.[103]Y.Sun,J.Buhler,andC.Yuan.Designingforfast-knownNcRNAIEEE/ACMTransactionsonComputationalBiologyandBioinformatics,9(3):774Œ787,2012.[104]Y.Tabei,K.Tsuda,T.Kin,andK.Asai.SCARNA:fastandaccuratestructuralalignmentofRNAsequencesbymatchinged-lengthstemfragments.Bioinformatics,(22):1723Œ1729,2006.[105]L.Taher,O.Rinner,S.Garg,A.Sczyrba,M.Brudno,S.Batzoglou,andB.Morgenstern.AGenDA:homology-basedgeneprediction.Bioinformatics,19(12):1575Œ1577,2003.[106]TheArabidopsisInformationResource(TAIR).http://www.arabidopsis.org.109[107]E.Torarinsson,M.Sawera,J.H.Havgaard,M.Fredholm,andJ.Gorodkin.ThousandsofcorrespondinghumanandmousegenomicregionsunalignableinprimarysequencecontaincommonRNAstructure.GenomeResearch,16:885Œ889,2006.[108]T.Treangen,S.Koren,D.Sommer,B.Liu,I.Astrovskaya,B.Ondov,A.Darling,A.Phillippy,andM.Pop.MetAMOS:amodularandopensourcemetagenomicassem-blyandanalysispipeline.GenomeBiology,14(1):R2,2013.[109]E.L.vanDijk,H.Auger,Y.Jaszczyszyn,andC.Thermes.Tenyearsofnext-generationsequencingtechnology.TrendsinGenetics,30(9):418Œ426,2014.[110]S.Washietl,I.L.Hofacker,andP.F.Stadler.FastandreliablepredictionofnoncodingRNAs.ProcNatlAcadSciUSA,102(7):2454Œ2459,2005.[111]S.Will,M.Yu,andB.Berger.Structure-basedwholegenomerealignmentrevealsmanynovelnon-codingRNAs.GenomeResearch,2013.[112]A.Wilm,I.Mainz,andG.Steger.AnenhancedRNAalignmentbenchmarkforsequencealignmentprograms.AlgorithmsforMolecularBiology,1:19,2006.[113]R.-F.Yeh,L.P.Lim,andC.B.Burge.ComputationalInferenceofHomologousGeneStructuresintheHumanGenome.GenomeResearch,11(5):803Œ816,2001.[114]Q.Yuan,S.Ouyang,A.Wang,W.Zhu,R.Maiti,H.Lin,J.Hamilton,B.Haas,R.Sultana,F.Cheung,J.Wortman,andC.R.Buell.TheInstituteforGenomicResearchOsa1RiceGenomeAnnotationDatabase.PlantPhysiology,138(1):18Œ26,2005.[115]I.Zarraonaindia,D.P.Smith,andJ.A.Gilbert.Beyondthegenome:community-levelanalysisofthemicrobialworld.BiolPhilos,28(2):261Œ282,2013.[116]E.M.ZdobnovandR.Apweiler.InterProScananintegrationplatformforthesignature-recognitionmethodsinInterPro.Bioinformatics,17(9):847Œ848,2001.[117]D.R.ZerbinoandE.Birney.Velvet:algorithmsfordenovoshortreadassemblyusingdeBruijngraphs.GenomeResearch,18(5):821Œ829,2008.[118]S.Zhang,I.Borovok,Y.Aharonowitz,R.Sharan,andV.Bafna.Sequence-basedmethodforncRNAanditsapplicationtosearchingforriboswitchelements.Bioinf.,22:e557Œ65,2006.[119]Y.Zhang,Y.Sun,andJ.R.Cole.ASensitiveandAccurateproteindomainTool(SALT)forshortreads.Bioinformatics,29(17):2103Œ2111,2013.[120]Q.-Y.Zhao,Y.Wang,Y.-M.Kong,D.Luo,X.Li,andP.Hao.Optimizingdenovotranscrip-tomeassemblyfromshort-readRNA-Seqdata:acomparativestudy.BMCBioinformatics,12(Suppl14):S2,2011.110[121]X.Zhou,L.Ren,Q.Meng,Y.Li,Y.Yu,andJ.Yu.Thenext-generationsequencingtech-nologyandapplication.ProteinandCell,1(6):520Œ536,2010.111