NOVELCOMPUTATIONALMETHODSFORIMPROVINGFUNCTIONALANALYSISFOR
LONGNOISYREADS
By
NanDu
ADISSERTATION
Submittedto
MichiganStateUniversity
inpartialoftherequirements
forthedegreeof
ComputerScienceŠDoctorofPhilosophy
2019
ABSTRACT
NOVELCOMPUTATIONALMETHODSFORIMPROVINGFUNCTIONALANALYSISFOR
LONGNOISYREADS
By
NanDu
Single-molecule,real-timesequencing(SMRT)developedbyPBioSciences(PacBio)and
NanoporesequencingdevelopedbyOxfordNanoporeTechnologies(Nanopore)producelonger
readsthansecond-generationsequencingtechnologiessuchasIllumina.Theincreasedreadlength
enablesPacBiosequencingtoclosegapsingenomeassembly,revealstructuralvariations,and
characterizetheintra-speciesvariations.Italsoholdsthepromisetodecipherthecommunitystruc-
tureincomplexmicrobialcommunitiesbecauselongreadshelpmetagenomicassembly.However,
comparedwithdataproducedbypopularshortreadsequencingtechnologies(suchasIllumina),
PacBioandNanoporedatahaveahighersequencingerrorrateandlowercoverage.Therefore,new
algorithmsareneededtotakefulladvantageofthird-generationsequencingtechnologies.
Forexample,duringanalignment-basedhomologysearch,insertionordeletionerrorsingenes
willcauseframeshifts,whichmayleadtomarginalalignmentscoresandshortalignments.In
thiscase,itishardtodistinguishcorrectalignmentsfromrandomalignments,andtheambiguity
willincurerrorsinstructuralandfunctionalannotation.Existingframeshiftcorrectiontoolsare
designedfordatawithamuchlowererrorrate,andtheyarenotoptimizedforPacBiodata.As
anincreasingnumberofgroupsareusingSMRT,thereisanurgentneedfordedicatedhomology
searchtoolsforPacBioandNanoporedata.
Anotherexampleisoverlapdetection.ForbothPacBioreadsandNanoporereads,thereis
stillaneedtoimprovethesensitivityofdetectingsmalloverlapsoroverlapswithhigherrorrates.
Addressingthisneedwillenablebetterassemblyformetagenomicdataproducedbythethird-
generationsequencingtechnologies.
Inthisarticle,wearegoingtodiscussthepossiblemethodforhomologysearchandoverlapde-
tectionforthethird-generationsequencing.Foroverlapdetection,wedesignedandimplemented
anoverlapdetectionprogramnamedGroupK.GroupKtakesagroupofshort
k
-merhits,whichsat-
isfystatisticallyderiveddistanceconstraintstoincreasethesensitivityofsmalloverlapdetection.
Forhomologysearch,wedesignedandimplementedahomologysearchtoolnamedFrame-
ProbasedonthehiddenMarkovmodel(pHMM)andconsensussequencesmethod.
However,Frame-proisstillrelyingonmultiplesequencealignment.SoweimplementedDeep-
Frame,adeeplearningmodelthatpredictsthecorrespondingproteinfunctionforthird-generation
sequencingreads.Intheexperimentonsimulatedreadsofproteincodingsequencesandrealreads
fromthehumangenome,ourmodeloutperformspHMM-basedmethodsandthedeeplearning
basedmethod.OurmodelcanalsorejectunrelatedDNAreadsandachieveshigherrecallwiththe
precisioncomparabletothestate-of-the-artmethod.
ThisthesisisdedicatedtomybeautifulwifeWenjingandmyparents,whoseencouragement
alongthewaywasparamounttomakingitthusfar.
iv
ACKNOWLEDGMENTS
Firstandforemost,IwouldliketothankmyadvisorDr.YanniSun.ItwascoincidentlythatI
heardthereisanopportunitytoworkwithDr.Sunat2014winter,andIalmostdon'thaveany
bioinformaticsbackgroundwhenItalkedtoDr.Sunfourandhalfyearsago.Butshestill
patientlydiscussedwithmeaboutpossibleprojectsandgivemethechancetostudythealgorithms
ofsequenceanalysisunderherguidance.Sheisveryniceandpatient,neverblamedmebutinstead
triedtohelpmewhenImademistakes.Dr.Sunhelpedmealotinmyresearchwork,courses,and
howtowriteandpresentresearchresults.Withoutherhelp,nothingcouldbeachieved.
Next,IwanttothankmyPh.D.guidancecommitteemembers,includingDr.KevinLiu,Dr.
JamesCole,andDr.JiayuZhou.Dr.Liugavememanyvaluablesuggestionsinmyqualifying
exam,mycomprehensiveexam,andmydefense.Iamverygratefulthathecanserveasco-
chairofmycommittee.Dr.ZhougavemethelectureofMachineLearningandsharedhisinsights
ondeeplearningwithmeonmyproject.Dr.Colehelpedmealotforbothmycomprehensive
examanddefense.
IwouldalsoliketothankNationalScienceFoundationandSummerFellowshipfromCollege
ofEngineeringforfundingtheresearchpresentedherein,andalsoMSUDissertationCompletion
Fellowship(Spring2019)whichallowedmetocompletethisdissertation.Manythankstostaffat
CSE,andespeciallystaffattheHigh-PerformanceComputingCenteratMichiganStateUniversity,
fortheirgenerallysupportforalltheresourcesthatIneedduringmyPh.D.study.
Iwanttothanksmylabmates,includingDr.JikaiLei,Dr.PrapapornTecha-angkoon,Dr.
JiaoChen,andYumengWen,fortheirhelponmyeitherresearchworkordailylife.Jikaigave
memanyusefulsuggestionswhenIjoinedthegroup.Healsohelpedmealotinmyinternship
JiaoandIhavemanydiscussionsonresearch,dailylife,andjobIwouldalso
v
thankallmyfriendswhohelpedmeatMSUinthepastsevenyears.Thankstomylabmatesat
UEClabinPhysics,includingDr.ZhenshengTao,Dr.KiseokChang,andFaranZhou.They
alwaysencouragedmeforchallengesandcomfortedmewhenIdepressed.ThankstoChaoyue
Liu,whohelpedmetomakethedecisionthattransferstoComputerScienceandEngineering.
ThankstoDr.ZhenZhuandYananChen,whohavehelpedmeinbothmyacademicandpersonal
life.ThankstomyfriendsintheMSUPhysicsQQgroup,Dr.XuLu,Dr.YanhaoTang,Dr.
JieGuan,Dr.MengzeZhu,andXukunXiang.Wereallyhavehadagoodtimediscussingmany
interestingtopicsfromfrontiersofphysicstodailylifeintheUnitedStates.ThankstoDr.Yuan
Gao,Dr.LiKe,Dr.XiaoranZhang,Dr.ShuangLiang,andDr.ChuanpengJiang,fortheir
companyfromthetimewhenwearrivedatEastLansingtogether.Thankstomyfriendsand
classmatesatComputerScienceandEngineeringwhostudied,worked,andhadfuntogetherwith
me,includingDr.JianpengXu,Dr.NanLiu,Dr.ChenQiu,Dr.ShaohuaYang,DingWang,Nan
Xia,WeiWang,ZhuangdiZhu,JunChen,QiWang,KaixiangLin,DeliangYang,TianXie,Jiawei
Li,SihanWang,ZhiweiWang,YaoMa,andManniLiu.
Finally,Iwanttoexpressmysinceregratitudetomywifeandmyparents.Theyhavealways
beenencouragingandsupportingmeforpursuingmycareer,andarealwaystherewheneverIneed
them.
vi
TABLEOFCONTENTSLISTOFTABLES
.......................................
xLISTOFFIGURES
......................................
xiChapter1Introduction
..................................
11.1Third-generationsequencing.............................2
1.1.1SMRT.....................................2
1.1.2Nanoporesequencing.............................4
1.2Sequenceanalysisalgorithms.............................5
1.2.1Alignment...................................5

1.2.2Errorcorrection................................7

1.2.3Wholegenomeassembly...........................8
1.3Challengesandoverview...............................12
Chapter2OverlapDetectionOnLongNoisyReads
1.................
142.1Background......................................15
2.1.1Relatedwork.................................15

2.1.2Overviewofourwork.............................17
2.2Methods........................................17
2.2.1Pipeline....................................19
2.2.2Estimatetheexpectednumberof
k-mersforthestage.......21
2.2.3GroupHitCriteria...............................25
2.2.3.1Constrainton
k-merdistance....................26
2.2.3.2Constrainton
k-merdiagonaldistance...............27
2.2.4GroupChaining................................28
2.2.5Implementationdetailsofthemajorcomponents..............30
2.2.5.1Filtrationby
k-mercounts.....................30
2.2.5.2Groupseedmatchandchaining..................30

2.2.5.3Timecomplexityanalysis.....................32
2.3Results.........................................32
2.3.1Simulated
E.coli
dataset...........................33
2.3.1.1Runningtimeandmemoryusage.................35
2.3.2RealPB
E.coli
dataset............................35
2.3.2.1Performancewithdifferentoverlapsize..............37
2.3.3HumanFootMetagenomicDataset......................37

2.3.4RealONT
E.coli
dataset...........................41
2.4Discussion.......................................41

2.5Conclusions......................................42
1Du,Nan,JiaoChen,andYanniSun."Improvingthesensitivityoflongreadoverlapdetectionusinggroupedshort
k-mermatches."BMCgenomics20.2(2019):190.
viiLISTOFALGORITHMS
...................................
Chapter3prHMMmodelandhomologysearch
2
................
43
3.1Introduction......................................43
3.2Relatedwork.....................................45
3.2.1homologysearch...........................45
3.2.2Relatedworkonframeshiftcorrection....................46
3.3hiddenMarkovmodel.............................47
3.3.1Markovchains................................48
3.3.2HiddenMarkovmodels............................49
3.3.3hiddenMarkovmodelsHMMs)...............50
3.3.3.1Viterbialgorithm..........................51
3.3.3.2Forwardalgorithm.........................53
3.3.3.3FrommultiplesequencealignmenttoHMM........54
3.3.3.4Forward-Backwardalgorithm...................55
3.4Methods........................................57
3.4.1Generatesequencealignmentgraph.....................57
3.4.2ViterbialgorithmforaligninganalignmentgraphwithaHMM...59
3.4.3Filtrationstageforremovingunrelevantproteindomainfamilies......63
3.5Experimentalresults..................................64
3.5.1Simulated
Escherichiacoli
sequences....................65
3.5.1.1Performanceoferrorcorrection..................65
3.5.1.2PerformanceofHMMsearch................68
3.5.2
Meiothermusruber
DSM1279sequences..................70
3.5.2.1Evaluateerrorcorrectionperformancebyaligningoutputsagainst
thereferencegenome........................71
3.5.2.2Evaluatetheperformanceofhomologysearch.......71
3.5.3Humanarmmetagenomicdataset......................73
3.5.3.1GenerateSAGandHMMdomain.........73
3.5.3.2ProteindomainsearchusingGAcutoff..............75
3.6Conclusionanddiscussion..............................76
Chapter4DeepLearningBasedApproachForProteinDomainPrediction
.....
78
4.1Introduction......................................78
4.1.1Deeplearningforsequentialdata.......................79
4.1.1.1ConvolutionalNeuralNetworks..................79
4.1.1.1.1Convolutionoperation..................79
4.1.1.1.2MajorfeaturesofCNNs.................80
4.1.1.1.3CNNforsequence.............81
4.1.1.2RecurrentNeuralNetworks....................82
4.1.1.2.1BasicstructureofRNNs.................82
4.1.1.2.2LSTM..........................84
4.1.1.2.3GRU...........................85
4.1.1.3Proteinfunctionpredictionusingdeeplearning..........86
2
Du,Nan,andYanniSun."ImprovehomologysearchsensitivityofPacBiodatabycorrectingframeshifts."Bioin-
formatics32.17(2016):i529-i537.
viii
4.1.2Relatedwork.................................88
4.1.2.1ehomologysearch......................88
4.1.2.2Homologysearchwitherrorcorrection..............89
4.1.3Overviewofourwork.............................89
4.2Methods........................................90
4.2.1Encoding...................................92

4.2.2ConvolutionalNeuralNetworks.......................93

4.2.3Out-of-distributionexamplesdetection....................95
4.2.3.1Thethresholdbaseline.......................95

4.2.3.2OutlierExposure..........................96
4.2.4Implementationsandhyperparameters....................98
4.3Experimentsandresults................................99
4.3.1SimulatedPacBioGPCRcdsdataset.....................100
4.3.1.1Performancewithdifferentarchitectures.............100
4.3.1.1.1Comparisonsofencoding.................100

4.3.1.1.2Comparisonsofconvolutionalsetup........102
4.3.1.2ComparisonwithHMMERandDeepFam.............104

4.3.1.3Comparisonoftimecomplexity..................106
4.3.2Humangenomedataset............................107
4.3.2.0.1Traindataset........................107

4.3.2.0.2Thresholdcalibrationdataset...............107

4.3.2.0.3Out-of-distributiontestdataset..............107
4.3.2.1Out-of-distributiontestusingPacBioandNanoporereads....108
4.3.3Visualizeandunderstandingconvolution...............109
4.4Discussion.......................................110

4.5Conclusion......................................114
Chapter5Conclusionsandfutureworks
........................
115BIBLIOGRAPHY.......................................
118ixLISTOFTABLESTable2.1:Computationalperformanceonthesimulated
E.coli
dataset.Thecomputa-
tionalperformanceofoverlapdetectionusingGroupK,Minimap,Minimap2,
DALIGNER,MHAP,andGraphMaponthesimulated
E.coli
dataset.......35
Table2.2:Overlapdetectiononthereal
E.coli
dataset.Theperformanceofoverlapdetec-
tionusingGroupK,Minimap,Minimap2,DALIGNER,MHAP,andGraphMap

ontherealPBRSII(P6-C4)
E.coli
dataset.Hereweonlyreporttheexperiment
resultswiththehighestF1scoreforeachtool...................36
Table2.3:Overlapdetectiononthehumanmetagenomicdataset.Theperformanceof
overlapdetectionusingGroupK,Minimap,Minimap2,DALIGNER,MHAP,
andGraphMaponthemockmetagenomicdataset.Hereweonlyreportthe
experimentalresultswiththehighestF1scoreforeachtool............40
Table2.4:OverlapdetectionontheONT
E.coli
dataset.Theperformanceofoverlapde-
tectionusingGroupK,Minimap,Minimap2,DALIGNER,MHAP,andGraphMap

ontherealONTSQK-MAP-006
E.coli
dataset.Minimap2usestheava-ont
setup,whichisoptimizedforONTdata......................41
Table4.1:ThenameandbriefdescriptionofvariantsofCNNmodelswecompared.....101
Table4.2:Theaverageelapsedtimetopredictfamiliesof10,000simulatedPacBioreads
foreachmethod..................................106
Table4.3:Theperformanceofproteindomainpredictionwithout-of-distributionexam-
plesusingDeepFramethresholdbaseline,DeepFramewithOutlierExposure
(OE),andHMMERontherealPacBioandNanoporedataset..........108
xLISTOFFIGURESFigure1.1:Principleofsingle-molecule,real-timeDNAsequencing.Reprintedfrom[39].3
Figure1.2:Principleofnanoporesequencing.Itdeterminethetypeofnucleotidebymea-
suringthecurrentlevelofthebasepassespassingthroughthenanohole.
ReprintedfromNaturejobswebsite........................4
Figure1.3:Adynamicprogrammingalignmentmatrixforaligningsequence
GATTACAagainst
GCATGCUusingNeedleman-Wunschalgorithm[105].Inthisexam-
ple,asimplescoringsystem,match=1,mismatch=-1,gap=-1,isused.(a)
Startfromthecellinthesecondrow,secondcolumn.Movethroughthecell
rowbyrow,calculatingthescoreforeachcell.(b)Thescoreisthemaximal
scorecalculatedfromthecellfromthetop,left,orleft-top.Ifthescoreis
calculatedusingleft-topcell,meansamatchormismatch.Otherwiseisan

insertionordeletion.(c)Aftertheofalltable,thescoreof

theright-bottomcellofthematrixisthescoreforglobalalignmentofthetwo
sequence.(d)Tracebackfromthelastcellisneededtoobtainthealignment.
Thealignmentis
G-ATTACAvs.GCA-TGCU.ReprintedfromWikipedia...6
Figure1.4:RapidoverlappingofnoisyreadsusingMinHashsketches.(a)Tocreatea
MinHashsketchofaDNAsequenceS,weobtainedallthek-mersof
thesequencek-mers.Intheexampleabove,
k=3,generate12k-merseach
forS1andS2.(b)Multiplehashfunctionsareusedtoconvertk-merstoin-
tegerThenumberofhashfunctionsHdeterminestheresulting

sketchsize.Here,wechoose
H=4.Theminimumkmerwhichhashingfunc-
tionvalueisminimalforeachhashisreferredasmin-mer.(c)Thesketch

ofasequenceiscomposedoftheHmin-merinorder,whichis
muchsmaller(size4)thanthesetofallk-mers(size12).Inthisexample,the
sketchesofS1andS2sharetwosameminimum(underlined).(d)

RealJaccardsimilarity(0.22)isestimatedusingthefractionofsharedmin-
mer(0.5)betweenthesketches.Tohaveanaccurateestimation,
H˛4is
needed.(e)Ifthesimilaritymeetsthethreshold,thepositionofsharedmin-
mersintheoriginalsequenceiscomputedtodeterminetheoverlapoffsetof
theS1andS2.Reprintedfrom[12]........................10
Figure1.5:Threestage(correction,trimming,andassembly)inaPacbioSMRTassem-
bler(Canu).Thecorrectionstepselectsthebestoverlapstouseforsequence
correctionandgeneratescorrectedreadsfromconsensussequences.Thetrim-

mingstepunsupportedregionsintheinputandtrimsorsplitsreads
basedonthelongestsupportedrange.Theassemblystepmakesapass
todeterminesequencingerrors;recomputeoverlapalignments;andoutputs

contigs,anassemblygraph,andsummarystatistics.Reprintedfrom[79]...11
xiFigure2.1:Histogramsofirreducibleoverlapsizes(
top
)andtheratioofoverlapsizeto
thereadlength(
bottom
)whencomparingadjacentoverlappingreadsonsim-
ulatedPB
E.coli
datasetsgivendifferentcoverages.Thebinwidthforthe
overlapsizeis500.Forthe30Xdataset,theaveragereadlengthis8366and
thenumberofreadsis16644.Forthe15Xdataset,theaveragereadlengthis
8253andthenumberofreadsis8436.Forthe8Xdataset,theaverageread
lengthis8414andthenumberofreadsis4413..................18
Figure2.2:ThepipelineofGroupK..............................20
Figure2.3:Thedotplotof
k
-merhitsandgroupseedsmatchesforoneoverlappingpair
fromthe
E.coli
PBdatasetusedinSection2.3.Eachdotisa
k
-merhit.Thex-
axisandy-axisshowthelocationsofthehitsonthereads.Theoverlapregion
isroughlyfrom0to2800onthex-axis,andfrom1000to3800onthey-axis.
Top:
all9-merhits.
Bottom:
9-merhitsthatpassedthegrouphitcriteria.A
groupseedisrepresentedbycloselylocateddotsofthesamecolor.......22
Figure2.4:Twocasescontributedtotheidenticalcharactersatanalignedpositioninthe
overlappingregionof
S
1
and
S
2
.
S
1
and
S
2
aretworeadssequencedfromthe
sameregionoftheunderlyinggenomeandformanoverlap.
Top:
bothbases
on
S
1
and
S
2
arecorrect,formingamatch.
Bottom:
bothbaseson
S
1
and
S
2
aresequencingerrors,andsubstitutedbythesamecharacter..........23
Figure2.5:Thechangeof
E
[
X
o
]
and
E
[
X
r
]
(y-axis)withtheincreaseofthereadlength
(x-axis),whichisobtainedfromarealPBdataset.Theoverlapsizeissetas
the1/4ofthereadlengthaswefocusonidentifyingthehardcaseofsmall
overlaps.
k
=
15..................................24
Figure2.6:Anexampleoftheoverlapsizeestimation.Thesufofread1andthe
ofread2formanoverlap.Eachshortsolidlinerepresentsagroupseedmatch
intheoptimalchain.Theblackdashedlineindicatesthetrueoverlapalign-
mentregionbetweenthetworeads.Thegraydashedline,whichisformedby
thetwoendinggroupseedsintheoptimalchain,canoverestimatetheoverlap
size........................................29
Figure2.7:TheROC-likeplotusingGroupK,Minimap,Minimap2,DALIGNER,MHAP,
andGraphMaponthesimulatedPB
E.coli
dataset.Thex-axisrepresentsthe
falsediscoveryrate(FDR
=
1

precision).Y-axisisthesensitivity(0.5to1)..34
Figure2.8:SensitivityofGroupKandMinimapfordetectingoverlapsofdifferentsize
onPB
E.coli
dataset.Thex-axisrepresentstheoverlapsize.Y-axisisthe
correspondingsensitivityofthebin.TheX-axisbinwidthis500andthe
onlyincludedthe6bins(i.e.uptooverlapsize3000)astheirsensitivity
becomesmoresimilarwiththeincreaseoftheoverlapsize............38
xii
Figure3.1:AMarkovchainexampleofDNA.Therearefourstatesforeachofnucleotides
A,C,GandT.Atransitionprobabilityisassociatedwitheacharrowinthe48
Figure3.2:ThetransitionstructureofaHMM.Weusesqurestoindicatematch
states,diamondstoindicateinsertionstates,andcirclestoindicatedeletion
states........................................51
Figure3.3:Tencolumnsfromthemultiplesequencealignmentsofsevenglobinproteins.
Thestarredcolumnsareonesthatwillbetreated`matches'inthe

HMM.Reprintedfrom[35]............................54
Figure3.4:BuildanexampleSAGfromanalignment.Toppanel:multiplesequence
alignmentofsixsequences.Thetopsequenceistheseedsequence.Bottom

panel:sequencealignmentgraph.Fornode
x0,ifwetracebackfortwomore
edges,wecanidentifythreecodonsendingwith
x0:TCA,ATA,andGCA.
Eachpathhasitsnodes
x(1)0andx(2)0marked.Theconsensuspathof
thispartofgraphisATCA............................58
Figure3.5:ComparisonoferrorcorrectionperformanceforFrame-ProandDAG-Conin
thesimulated
E.coli
datasetand
M.ruber
dataset.Frame-Procorrectsmore
errorsinlargerfractionofPBreads........................66
Figure3.6:Histogramofederrorsaftererrorcorrectioninthesimulated
E.coli
dataset.X-axisrepresentsthenumberofremainingerrorsineachread.Y-

axisisthecorrespondingnumbersofreads.Binwidthis1andtheonly
includedthe40bins(i.e.upto40errors)duetospacelimitation......67
Figure3.7:Thecomparisonofalignments'bitscores(A),lengthsina.a.(B),andE-values
(C)forreferencesequencesandcorrectedsequencesproducedbyFrame-Pro
andDAG-Coninthe
E.coli
simulateddataset.X-axisrepresentsthedomains.
AlldomainsaresortedbythereferencevaluesfromPfam.Redcirclesrepre-
sentthevaluesofHMMalignmentsforcorrectedsequencesoutputbyFrame-
Pro.BluecirclesrepresentthevaluesofHMMalignmentsforcorrectedse-

quencesoutputbyDAG-Con.Astherearemanydatapoints,allnumbers

producedbyonetoolareprocessedbySavityzkyGolaytogeneratea

smoothedcurveforclarityofpresentation....................69
xiiiFigure3.8:Thecomparisonofalignments'bitscoresforreferencesequencesandcor-
rectedsequencesproducedbyFrame-ProandDAG-Coninthe
M.ruber
dataset.
X-axisrepresentsthedomains.Alldomainsaresortedbythereferencebit
scoresfromPfam.RedcirclesrepresentthevaluesofHMMalignmentsfor
correctedsequencesoutputbyFrame-Pro.Bluecirclesrepresentthevaluesof
HMMalignmentsforcorrectedsequencesoutputbyDAG-Con.Asthereare
manydatapoints,allnumbersproducedbyonetoolareprocessedbySavi-
tyzkyGolaytogenerateasmoothedcurveforclarityofpresentation....72
Figure3.9:Thecomparisonofalignments'bitscores(A),lengthsina.a.(B),andE-values
(C)for48domainscommonlybyFrame-ProandDAG-Coninthe
armdataset.X-axisrepresentsthedomains...................74
Figure4.1:Howakerneloperatesononepixel.Reprintedfrom[6].............80
Figure4.2:StructureofTextCNNmodel.Reprintedfrom[74]...............82
Figure4.3:Therepeatingmoduleinainasimplerecurrentneuralnetworkstructure.
Reprintedfrom[113]...............................83
Figure4.4:TherepeatingmoduleinaLSTMcell.Reprintedfrom[113]...........84
Figure4.5:TherepeatingmoduleinaGRUcell.Reprintedfrom[113]............86
Figure4.6:TheoverviewofDeepFramemodel.Theinputsequencewastranslatedand
encodedtoa3-channeltensor.Twoconvolutionallayerandmaxpoolinglayer
extractsequencefeatures.Thesefeatureswereusedbyafullyconnectedlayer
withsoftmaxfunctiontoinfertheprobabilityofeachproteinfamilies.In
thetask,themodeldirectlyoutputsthefamilieswiththelargest
scoreastheprediction.Inthedetectiontask,themaximumofsoftmaxscore
needstocomparewiththethresholdtodeterminewhethertheinputcontainsa
trainedproteinfamilyorshouldberejected....................91
Figure4.7:Thedistributionofsoftmaxscoresfrombasemodel(
top
)andmodelwithOut-
lierExposure(
bottom
).Greenbarrepresentthecountofin-distributionex-
amplesthatpredictedcorrectly.Bluebarrepresentthecountofin-distribution
examplesthatpredictedincorrectly.Redbarrepresentthecountofout-of-
distributionexamples...............................97
Figure4.8:Themeanandstandarddeviationofaccuracyofdifferentnet-
workarchitectures.Forconvenient,weuseddifferentnames(
3-frameencod-
ing
,
2048s
,
s8
,and
2layer
)forthesamebasemodel.Weused
differentcolorsfordifferentgroupofcomparisons:greenbarsforencoding;
bluebarsfornumberofpurplebarsfordifferentsizes;orangebars
fordifferentconvolutionlayers..........................101
xiv
Figure4.9:2Dt-SNEplotoffeaturesextractedfromtheconvolutionallayeroutput.(Top)
3-frameencodingmodel.(Middle)DNAone-hotencodingmodel.(Bottom)
3-branchmodel..................................103
Figure4.10:ComparisonofproteinperformanceofDeepFrame,HMMER,
andDeepFamacrossdifferenterrorrates.....................105
Figure4.11:Mostactivatedconvolutional(index:1622)forLatrophilin........110
Figure4.12:Mostactivatedconvolutional(index:1877)forCannabinoid.......110
Figure4.13:Mostactivatedconvolutional(index:1689)forPlatelet..........111
Figure4.14:Mostactivatedconvolutional(index:1382)forProstacyclin........111
Figure4.15:Mostactivatedconvolutional(index:1033)forCholecystokinin.....112
Figure4.16:Mostactivatedconvolutional(index:1776)forEMR1...........112
Figure4.17:Thedistributionsofsumofacivationvaluesofconvolutionalunits.Xaxisis
theindexoftheconvolutiontersinDeepFram.Yaxisisthesumofactiva-
tionvaluesofthegivenconvolutional..................113
xv
LISTOFALGORITHMS


Chapter1
Introduction
Secondgenerationsequencingtechnologiesofferedseveralimprovementsovertradi-
tionalSangersequencingandbecameanobviouschoiceformetagenomicsanalysis.Usingshot-

gunsequencing,peoplecaninterrogatethecompositionandfunctionofmicrobialcommunities

ef.Thisinformationcanleadtodiscoveryonnewbiologyactivecompounds,antimicro-

bials,virulencefactors,ormetabolicpathways[144].Thetraditionaldownstreammethoduses

referencegenomestoannotatethemetagenomicsdataset.However,inmostcases,themetage-

nomicsdatasetcannotbecorrespondedtotheknownreferencegenomes.Forexample,2%to

96%ofmetagenomicssequencefromhumanskincannotbeannotatedbythetraditionalmethod,

basedontheoriginsofsample[112].
Denovo
approach,orareference-freeapproach,isonealternativemethodthatcanbeusedto
overcomethelackofreference.Withshortreadsfromthesecond-generationsequencingplatform,

denovo
assemblycangeneratecontigsfollowedbythetaxonomybinningtodeterminethecom-
positionsofthesample.However,withthecomplexandsimilargenomesinmetagenomics,itis

challengingtouseshortreadsfromthesecond-generationsequencingtechnologytoassembly.In

suchcircumstances,longreadsfromthethirdgenerationsequencingplatformshaveitsadvantages

togenerateabetteroverlapgraphfortheassembler.
1Anotheralternativemethodisanalyzingtheaveragepropertiesofwholecommunitiesrather
thanfocusingonseveralgenomes.Suchanapproachisirreplaceableasinparticularcircumstance
assemblymaynotevenpossible.Inthismethod,longerreadsarestillpreferredasitcanprovide
moreaccurateinformationthanshorterfragments.
Here,wepresenttheprinciplesofthethirdgenerationsequencingtechnologies,somewidely-
usedalgorithmsdesignedforthirdgenerationsequencingingenomeassembly,thechallengefor
existingfunctionalanalysisalgorithmswhenusingthirdgenerationsequencingdata.
1.1Third-generationsequencing
Severalthirdgenerationsequencingtechnologieswereproposed[15,39].Theysharealotof
similarcharacteristic,likelongreadlengthandhigherrorrate.Inthesetechniques,themost
intensivelystudiedarethesingle-molecule,real-timesequencing(SMRT),developedbyP
BioSciences(PacBio),andNanoporesequencing,developedbyOxfordNanoporeTechnologies
(Nanopore).HerewewillintroducethetheoremofSMRTandNanoporesequencingtechnologies.
AlthoughsomealgorithmswediscussedherewereoriginallyproposedtoSMRT,itcanalsobe
usedforNanoporesequencing[93].
1.1.1SMRT
Toachievereal-timesequencingatthesinglemoleculelevel,SMRTadoptedanapproachtogen-
eratesequencinginformationbasedonthesequentialsignalgeneratedduringDNA
synthesisbyapolymerasemolecule.
2
Figure1.1:Principleofsingle-molecule,real-timeDNAsequencing.Reprintedfrom[39].
ThebasicsequencingunitinSMRTisazero-modewaveguide(ZMW).Thepropagationmode
inthewaveguidewilllimitthewave,satisfyingthemodetopropagateinthewaveguide(precisely
determinethefrequencyandform).The"zero-modewaveguide"meansthewaveguideprovides
zeroguidedmodesforthepropagationoftheelectromagneticwave.Observationinthezero-mode
waveguidecanreachsingleeventscaleeveninthehighconcentrationof
moleculespresent[81].InthebottomofeachZMWthereislocatedasinglemoleculeofDNA
template-bound
F
29DNApolymerase.
Duringanobservation,aphospholinkednucleotideformsahydrogenbondwiththenucleotide
inthetemplate.ThisprocesswillproduceasignalatthebottomofZMW.Thelight
pulsewillenduponthecleavageofthedye-linker-pyrophosphategroup.Bydetectingthelight
pulseandrecordingaseriesofmoviesofthesignal,wecanhaveallinformationwhichlatercan
betranslatedtobasesincontinuouslongreads(CLR).TheaverageCLRreadlengthfromPacbio
RSIIusingC4chemistrycanreachmorethan10kbps[145],withan11%to15%errorrate[8,80].
Usingthecirculartemplate,SMRTcansequenceDNAmultipletimes.Bydoingthis,ahigh
qualityreadcalledcircularconsensussequence(CCS)canbegeneratedwiththemorethan99%
3
accuracywith15Xcoverage.However,thetrade-offofCCSistheshort-readlength(averagelyis
about1kbps[114])duetothelimitationlifetimeoftheDNApolymerase[134].
1.1.2Nanoporesequencing
OxfordNanoporeisanotherpromisingthird-generationsequencingtechnology.InNanoporese-
quencing,aDNAstrandisthreadedthroughananopore,andtheioniccurrentcanbe
usedtoidentifythecorrespondingDNAsequences.
Figure1.2:Principleofnanoporesequencing.Itdeterminethetypeofnucleotidebymeasuringthe
currentlevelofthebasepassespassingthroughthenanohole.ReprintedfromNaturejobswebsite.
Intheory,thenanoholescanbecreatedbyproteinspuncturingmembranes(biologicalnanopores)
4
orinsolidmaterials(solid-statenanopores).InNanoporesequencing,thenanoporeiscreateduti-
lizingtheheptamericprotein
a
-hemolysin(
a
HL)[28].Duringthesequencing,DNAwillbindto
thebindingsiteofthenanopore.Thesequencerreadsthedisruptioninthecurrentpassedthrough
themembraneaseachbasepassesthrough.Sincethebaseshavedifferentelectroniccharacteris-
tics,eachnucleotidehasadifferentsignatureofdisruption,allowingforbasecalls.
Nanoporesequencingcanproduceverylongsequencesthatlengthupto2millionbasesin
theory.InarecentexperimentwithMinIONR9.41Dchemistry,peoplewereabletosequence
ultralongreadswith882,000bp,withamedianlengthof10,589bp.TheerrorratesofNanopore
sequencingareclosetoPacBio,whicharearound18%[67].
1.2Sequenceanalysisalgorithms
Inthissection,weoverviewedthecommonsequenceanalysisalgorithmanditsapplicationin
third-generationsequencingreads.
1.2.1Alignment
AlignmentprogramisoneofthefundamentalalgorithmsforPacbioSMRTapplication,asright
nowtheerrorcorrectionandwholegenomeassemblybothrelyonbuildingthealignment
Thefundamentalideaofaligningisusingdynamicprogrammingandselectingamaximal
scorepathgiventhedifferentpenaltyofinsertion,deletion,andmismatch(Figure1.3).Forexam-
ple,inBLAST[4],anidentitywillscore5andamismatchwillgetpenaltyof4.Severaldifferent
approachesareadoptedtoacceleratetheperformanceofaligning.Forexample,the"seedand
extend"methodexactmatchesin
k
-merscaleandthenreducesthesearchspacebyonly
5
Figure1.3:Adynamicprogrammingalignmentmatrixforaligningsequence
GATTACA
against
GCATGCU
usingNeedleman-Wunschalgorithm[105].Inthisexample,asimplescoringsystem,
match=1,mismatch=-1,gap=-1,isused.(a)Startfromthecellinthesecondrow,second
column.Movethroughthecellrowbyrow,calculatingthescoreforeachcell.(b)Thescoreis
themaximalscorecalculatedfromthecellfromthetop,left,orleft-top.Ifthescoreiscalculated
usingleft-topcell,meansamatchormismatch.Otherwiseisaninsertionordeletion.(c)After
theofalltable,thescoreoftheright-bottomcellofthematrixisthescorefor
globalalignmentofthetwosequence.(d)Tracebackfromthelastcellisneededtoobtainthe
alignment.Thealignmentis
G-ATTACA
vs.
GCA-TGCU
.ReprintedfromWikipedia.
allowingthesequencetokeepwiththenumberofmatchesof
k
-mersabovethethreshold.Other
methodslikeBurrows-WheelerTransform[83,84]andHashingfunction[90]canbealsoapplied
totraditionalalignmenttools.AlthoughpeoplecanusethosetoolsonPacbioSMRTreads,the
sensitivity,accuracyandspeedofalignmentarecompromisedduetothelongerroneousreads.
6
Rightnow,BLASR[21]isapopularalignmenttooldesignedforlongreadswithhigherror
rates.BLASRcombineddatastructureusedforshortreadmappingandthealignmentmethod
forwholegenomemapping.Attheprogramwillclustertheshortexactmatchesfoundus-
ingtheBWT-FMindex[42,84]orsufarray[97].Bydoingthis,theprogramcandetermine
theapproximatecoordinatethatreadsshouldaligntothegenome.Aroughalignmentusingthe
sparsedynamicprogrammingonthesetofshortexactmatcheswillbethefollowingsteps.Fi-
nally,withtheguideofthesparsedynamicprogramming,anaccuratealignmentwillbegenerated
usingdynamicprogramming.ComparewithothertoolslikeBWA-SW(aligning132Mbasesin
434minutes)[89]andBLAT(aligning181Mbasesin4724minutes)[73],BLASRcanalignthe
largestnumberofreads(230Mbases)tothereferencegenomewiththefastestspeed(20minutes)
usingan
E.coli
O104:H4dataset.
Therearemoretoolsdesignedforthirdgenerationsequencingcoming.Forexample,GraphMap
[137]isanothertooldesignedforthirdgenerationsequencingwithagapped-seedstrategy.Wewill
discussalignmenttoolsforthirdgenerationsequencinglaterindetail.
1.2.2Errorcorrection
ErrorcorrectionisanecessarystepforusingSMRTsequencingastheerrorrateofrawreadsistoo
high.However,longerreadswithsmallbiasmakeconsensuserrorcorrectionpossible.Rightnow,
twomajorapproachesareavailable:hybrid[8,55,58,78,99,130]andnon-hybrid[24,93]error
correction.Hybriderrorcorrectionusesshortbutaccuratesecondgenerationreadstohelpcorrecterrorin
7longthirdgenerationreads.Thettypeofhybriderrorcorrectiontools(LSC[8],PacBioToCA
[78],andporovread[55])alignedaccurateshortreadstothirdgenerationlongreads,andthenus-
ingtheconsensusalgorithmtogenerateerror-correctedconsensussequences.Thesecondtypeof
tools(LoRDEC[130],Jabba[99])exploitsthecontextinshortreadsbybuildadeBruijngraph
fromthoseshortreads.LongreadsfromPacbioSMRTarethenaddedastheguidetohelpthe
programdecidethebestpathinthedeBruijngraph.Othermethods[58]beyondthesetwotypes
arealsoavailable,butstillsharealotofsimilaritytothetoolsmentionedabove.
UnlikehybriderrorcorrectionwhichneedssecondgenerationreadslikeIllumina,non-hybrid
errorcorrectionmethodsonlyusereadsfromthethirdgenerationsequencing.HGAP(hierarchical
genomeassemblyprocess)[24]usednon-hybriderrorcorrectionbeforestartingtheassembly
process.ThemethodselectsthelongestseedreadsfromPacbioSMRTrawreads.Then
alignedother"short"readtothoseseedreadsandbuiltamultiplesequencealignmentgraph.
Theprogramcaneasilythemaximalscorepathinthegraphusingdynamicprogramming.
Afterthepre-assemblystage,theaverageaccuracyofthedatasetisincreasedfrom86.9%to99.9
%[24].Similarmethods[93]werealsoappliedtoreadsfromOxfordNanoporesuccessfully.The
nonhybridmethodreliesonthecoverageofdataset[34].
1.2.3Wholegenomeassembly
Thelongreadsfromthirdgenerationsequencingismorefavorableforwholegenomeassembly
aslongreadscanovercomemanylimitationsfromshortreads,suchashandlinghighlyrepetitive
genomicregions.
Thenon-hybrid
Denovo
assembly[24]needspre-assemblyerrorcorrectionsteptocompensate
8
thehigherrorratesofPacbiodata.Highqualityreadsaftererrorcorrectionarefeasibletoapply
traditionalassemblymethodthatcantakelong-readinput.Inthiscase,overlap-layout-consensus
(OLC)methodisbetterthantheDeBruijnGraph(DBG),althoughthelatteroneisquitepopular
forthesecondgenerationsequencingdata.IntheOLCmethod[101,103],theassemblywill
trytoallversusallmatchandthendetermineifthereadpairhasoverlap.Thenagraphwillbe
constructedbytreatingeachreadasanodeandconnectingoverlapreadsbyedges.Anassembly
algorithmisavailablebyconsideringalltheinformationinthegraphandproducesconsensusout-
put.Givenenoughcoverage,thequalityofHGAPassemblycanreachbeyond99.999%(QV50).
FastoverlappingmethodisdesiredforOLCmethodassemblertoreducethehugecomputa-
tionalcostforcalculatingtheoverlapbyall-versus-allalignment.Thetime-consumingover-
lappingisthebottleneckof
Denovo
assemblyusingSMRTandpreventstheapplicationforlarger
genome[12].MinHashAlignmentProcess(MHAP)isanewlydevelopedalgorithmcanef
detectoverlapbetweenerroneouslongreads.MinHash[16]isthecoremethodofthealgorithm
thatusedtoestimatethesimilarityofthetwosequenceswithoutacompletealignment(Figure
1.4).Toevaluatethesimilarityofthetwosequences,theMinHashstartbyconvertallkmersin
eachofthesequencestointegerusingmultiplerandomizedhashingfunctions.Then
thealgorithmwillcollecttheminimalvaluekmerforeachhashingfunctiontoconstructthesketch
ofthesequence.Thesizeofthesketchisdeterminedbythenumberofhashingfunctionandis
muchsmallerthanthesizeofallkmers.TheJaccardSimilarityofthesketchofthetwosequences
canbetreatasanapproximationofthesharednumberofkmersbetweentwosequences.Thisis
acomputationaleftoestimatethesimilarityoftwosequenceswithoutalignment.Inthe
test,MHAPmethod(3.6CPUhoursforE.coliK12dataset)ismuchfasterthantheBLASR(87
CPUhours)tooverlappingforassembly.Intheexperiment,MHAPachievedacomparableor
9
Figure1.4:RapidoverlappingofnoisyreadsusingMinHashsketches.(a)TocreateaMinHash
sketchofaDNAsequenceS,weobtainedallthek-mersofthesequencek-mers.Inthe
exampleabove,
k
=
3,generate12k-merseachforS1andS2.(b)Multiplehashfunctionsare
usedtoconvertk-merstointegerThenumberofhashfunctionsHdeterminesthe
resultingsketchsize.Here,wechoose
H
=
4.Theminimumkmerwhichhashingfunctionvalue
isminimalforeachhashisreferredasmin-mer.(c)Thesketchofasequenceiscomposedofthe
Hmin-merinorder,whichismuchsmaller(size4)thanthesetofallk-mers(size12).
Inthisexample,thesketchesofS1andS2sharetwosameminimumints(underlined).
(d)RealJaccardsimilarity(0.22)isestimatedusingthefractionofsharedmin-mer(0.5)between
thesketches.Tohaveanaccurateestimation,
H
˛
4isneeded.(e)Ifthesimilaritymeetsthe
threshold,thepositionofsharedmin-mersintheoriginalsequenceiscomputedtodeterminethe
overlapoffsetoftheS1andS2.Reprintedfrom[12].
improvedassemblythanBLASRwithlesstime.
10
Figure1.5:Threestage(correction,trimming,andassembly)inaPacbioSMRTassembler(Canu).
Thecorrectionstepselectsthebestoverlapstouseforsequencecorrectionandgeneratescorrected
readsfromconsensussequences.Thetrimmingstepunsupportedregionsintheinputand
trimsorsplitsreadsbasedonthelongestsupportedrange.Theassemblystepmakesapass
todeterminesequencingerrors;recomputeoverlapalignments;andoutputscontigs,anassembly
graph,andsummarystatistics.Reprintedfrom[79]
11
1.3Challengesandoverview
Althoughmanynovelalgorithmsforalignment,errorcorrection,andwholegenomeassembly
havebeendevelopedforhandlinglongnoisyreads,therearestillmanychallengesforthethird-
generationsequenceanalysis.First,mostnovelalgorithmsaredesignedtotargetcompletegenome
assembly.However,inmanyscenarios,likemetagenomicsandtranscriptomicsequencing,the
wholegenomeassemblyisnotrealistic,ortheresultoftheassemblyisnotgiventhe
coverageofsequencing.Inbothcases,downstreamanalysisalgorithmneedstoprocesslongread
withhigherrorrates.Forexample,forhomologysearchandproteindomainannotation,current
methodslikepHMMcannothandletheframeshiftsintroducedbyalargenumberofinsertionsand
deletionsinthethird-generationsequencingreads,leadingtoshortoralignments
inthedownstreamanalysis.Second,forapplicationslikemetagenomics,theremaycoexistmany
similarproteindomainsinthedataset,whichisnoteasytodistinguish,evenwithouterrors.Third,
existingalgorithmsdesignedforPacBioorNanoporestillhavethepotentialforimprovementin
performanceandefy.
Inordertoaddressthesechallengesandimprovetheperformanceofsequenceanalysis,es-
peciallyproteindomainpredictionandannotation,wehaveproposedanddevelopedthreetools:
GroupK,FramePro,andDeepFrame.GroupKisanoverlapdetectiontooldesignedforPacBio
andNanoporedata.TheexperimentalresultsshowedthatGroupKenablesmoresensitiveoverlap
detection,especiallyfordatasetsoflowsequencingcoverage.BothFrameProandDeepFrame
aredesignedforimprovingproteinpredictionfromthird-generationsequencing.FrameProisa
homologysearchtooldesignedforPacBioreads.ItutilizesbothhiddenMarkov
modelandsequencealignmentgraphconsensustoenablemoresensitivehomologysearch.Deep-
FramefocusesonidentifyingencodedproteindomainsfromasinglelongnoisyDNAread.Ituses
12
aconvolutionalneuralnetworktoextractusefulfeaturestodistinguishprotein-codingsequences
automatically.Intheexperiment,itoutperformspHMM-basedmethodsinbothon
multipleproteinfamilies,andthedetectionofproteindomainfromotherunrelatedDNAreads.
13
Chapter2
OverlapDetectionOnLongNoisyReads
1
Genomeassemblyusingthird-generationsequencingdatarequiresdedicatedmethodsandtools.
Existinggenomeassemblytoolsmainlyutilizetwotypesofgraphmodels:overlapgraphandde
Bruijngraph.Whentheerrorrateislow,deBruijngraphhasthetheoreticaladvantagethatthe
graphsizedoesnotincreasewiththesequencingcoverage,whichisusuallyhigh
forIlluminadatasets.Forthird-generationsequencingdata,thehigherrorrateandlowcoverage
maketheoverlapgraphasensiblechoiceforgenomeassembly[78].Akeystepinconstructing
theoverlapgraphistoidentifyreadpairsthatshareoverlaps,whichindicatesthatthesereadsare
sequencedfromthesamelociintheunderlyinggenome.Althoughthereareanumberofsequence
alignmentprogramsavailableforconductingoverlapalignment[4,132],amajorityofthemrely
ondynamicprogrammingandaretoocomputationallyexpensiveforhighthroughputsequencing
data.Duetohigherrorrates,existingshortreadoverlapdetectionsoftwareusingBWT(Burrows-
Wheelertransform)orhashtable[51,135]cannotbedirectlyappliedtolongreads.Toaddressthis
challenge,manynewmethodwasproposed.Here,wesummarizethemajorexistingmethodand
alsointroduceournewmethodfordetectingoverlapfromlongnoisysequences.
1
Du,Nan,JiaoChen,andYanniSun."Improvingthesensitivityoflongreadoverlapdetectionusinggroupedshort
k-mermatches."BMCgenomics20.2(2019):190.
14
2.1Background
2.1.1Relatedwork
Twostrategiesarecurrentlybeingemployedtodetectoverlapsforerror-pronelongreads.One
strategytriestocorrectsequencingerrorsinPB(PBiosciences)andONT(OxfordNanopore)
databeforeoverlapdetection.Thereexistanumberofsequencingerrorcorrectiontools[24,78].
Someofthemrelyonhybridsequencing,whichrequirespreparationofatleasttwosequencing
librariesandseveraltypesofsequencingrunsandthusisnotcost-effectiveformanyapplications.
Othersconducterrorcorrectionusinglongreadsonly.Onerepresentativemethodisdescribedin
Chinetal.'shierarchicalgenome-assemblyprocessHGAP[24],whoseperformanceimproveswith
higherreadcoverage.Itisworthnotingthatforalignment-basederrorcorrectionmethodssuchas
theoneinHGAP,animportantstepistoidentifyreadsthatcanbealignedquickly.Essentially,
techniquesusedforoverlapdetectioncanbeusedforalignmentdetectionaswell.
Thesecondcategorybypassesthedifoferrorcorrectionandoverlapsusing
rawreads.VariousapproximatesimilaritysearchmethodshavebeenappliedonPBandONT
data[26].Theygenerallyfollowseed-chain-alignprocedure[88].Seed-basedltrationstepplays
anessentialroleincontrollingthetrade-offbetweensensitivityandcomputationalefy.Usu-
ally,thesemethodsuseshortstringmatchesasthetionstep.Ashortstringor
k
-mermatch
requiresexactmatchesof
k
consecutivecharactersbetweentwosequences.Intuitively,overlapping
readstendtosharemorecommon
k
-mersthannon-overlappingreads.Strategiesthatcanquickly
thenumberofshared
k
-merscanthusbeapplied.Inthissection,wesummarizethemain
strategiesofseveralstate-of-the-artoverlapdetectiontools.Wehighlightthedifferencesbetween
ourmethodandtheexistingonesinthefollowingsection.
MHAP[12],Minimap[87,88],andDALIGNER[102]alluse
k
-mermatchesforidentifying
15
candidateoverlappingpairs.Duetothehigherrorrate,usuallyonlyshort
k
-merswillbeapplied
inordertoachievehighsensitivity.However,identifyingshort
k
-mermatchesbetweenallpairsof
readsiscomputationallyexpensive.Thus,theleadingtoolsemployeddifferentdatastructuresand
algorithmsforestimating
k
-mer-basedsimilarity.MHAPconvertslongreadsintosetsof
k
-mers
andsketchesusingminHash.Thenthesimilaritybetweenreadsisestimatedusingthecompact
sketches.Minimapalsousesacompactrepresentationoftheoriginalreadsbykeepingminimizers
ratherthanallpossible
k
-mersofaread.Thencollinear
k
-merswillbeclusteredandusedfor
checkingpossibleoverlaps.DALIGNERdirectlysorts
k
-mersbasedontheirpositionsandthen
utilizesmergesorttoidentifythenumberofshared
k
-mers.Asthesortingiscacheef
DALIGNERispracticallyveryef
BLASR[21]wasinitiallydesignedformappingPBreadstoareferencegenome.Itisalso
widelyusedasanoverlapdetectiontoolforPBdata.BLASRusesBWTandFMindextoidentify
short
k
-mermatchesandthenclusters
k
-mermatcheswithinagivendistancerange.Theclustered
k
-mersarerankedbasedonafunctionofthe
k
-merfrequency.Onlyhighlyrankedclusterswillbe
keptfordownstreamanalysis.
Beingdifferentfromtheabovetools,GraphMap[137]usesspacedseedsthatallowmatches
ofnon-consecutivecharacters.Spacedseedswereinitiallyusedinhomologysearchforimproving
thetrade-offbetweensensitivityandefy[18,95,139].Inparticular,spacedseeds
containingthepatternﬁ11*ﬂhavehighsensitivityincapturinghomologousprotein-codinggenes
becauseofthecodonstructure.However,designingoptimalspacedseeds(i.e.,decidingtheposi-
tionsofthewildcardcharacters)isNP-hard[94,106].GraphMapempiricallychoosestwospaced
seeds.Ideally,differentsetsofseedsmaybedesignedforinputdataofdifferenterror
16
2.1.2Overviewofourwork
ThehigherrorratesandalsothedifferenterrorofPBandONTdatamotivateustouse
amorexibleseedingstrategycalledgrouphitcriteria[109],whichagroupofpossibly
overlapping
k
-merssatisfyingstatisticallyderiveddistanceconstrains.Forbrevity,wewillcall
the
k
-merssetsatisfyingthegrouphitcriteriaasaﬁgroupseedﬂ.Agroupseedwasinitially
proposedandusedforhomologysearch.Giventheerrorsuchastheestimatedindelsand
mismatchprobabilities,thresholdsforgroupingshort
k
-merscanbecomputedusingthewaiting
timedistributionandtheone-dimensionalrandomwalk[109].Agroupseedcaneffectivelyhandle
alltypesoferrorsandisidealtodetectsmalloverlaps.Withgroupseeds,wecanachievehigh
sensitivityusingshort
k
-mers(e.g.,9-mer)whilestillmaintainingadesirable.
Inthiswork,weemploygroupseedsfordetectingoverlappinglongreadsfor
denovo
genome
assembly.Ourimplementation,namedGroupK,providesacomplementarytooltoexistingmeth-
odsfordetectingsmalloverlapsoroverlapscompoundedbyhigherrorratesofbothreads.This
abilityenablesourtoolasensiblechoiceforgenomeassemblyinmetagenomicdatasequencedby
third-generationsequencingplatforms.Asthesecommunitysamplesusuallycontainmicroorgan-
ismswithheterogeneouscoverage,beingabletoidentifysmalloverlapswillbeveryimportantfor
reconstructinggenomesofrarespecies.
2.2Methods
GroupKisdesignedforimprovingthesensitivityofdetectingsmalloverlapsoroverlapswithlow
identity.Currently,third-generationsequencingdatastillhashigherrorrates.Theoverlapping
regionsformedbytwoerror-pronelongreadscanhavelowersequenceidentitythanmappinga
longreadagainstareferencegenome.Figure2.1presentsthehistogramoftheoverlapsizeandthe
17
Figure2.1:Histogramsofirreducibleoverlapsizes(
top
)andtheratioofoverlapsizetotheread
length(
bottom
)whencomparingadjacentoverlappingreadsonsimulatedPB
E.coli
datasets
givendifferentcoverages.Thebinwidthfortheoverlapsizeis500.Forthe30Xdataset,the
averagereadlengthis8366andthenumberofreadsis16644.Forthe15Xdataset,theaverage
readlengthis8253andthenumberofreadsis8436.Forthe8Xdataset,theaveragereadlengthis
8414andthenumberofreadsis4413.
18
correspondingratioofoverlapsizetothereadlengthbetweentwoadjacentreads.Thereadsare
simulatedusingPBSIM[114]from
E.coli
withthreedifferentcoverages.Asweknowtheposition
ofeachsimulatedreadinthegenome,theoverlapsizecanbeeasilydecided.Notethatareadcan
formoverlapswithmultiplereadssequencedfromthesameregion.However,thetwoare
generatedusingoverlapsbetweentwoadjacentreads,whichanﬁirreducibleﬂedge[100]in
anoverlapgraph.Thus,theseoverlapscandecidethecontinuityofthegenomeassembly.
Theshowthattherearestillsubstantialregionswithsmalloverlaps.Forexample,thereare
45.46%,34.76%,and31.19%oftheoverlapsshorterthanthe50%ofthereadlengthfordatawith
coverageof8X,15X,and30X,respectively.Itwillbeidealtodetectrelativelysmalloverlapsto
fullytakeadvantageofthelongreadsforgeneratingmorecompleteassemblies.
2.2.1Pipeline
IdentifyingsmalloverlapsiscomputationallydifThus,weuseacarefullydesignedhier-
archicalstrategytodistinguishtrueoverlappingreadsfromnon-overlappingones.The
pipelineofGroupKconsistsofthreekeysteps:groupseedmatching,andchaining(Fig-
ure2.2).Filtrationisusedtoreducethesearchspacebyquicklyidentifyingreadpairssharing
aminimumnumberof
k
-mers.Highinsertion/deletionerrorratestendtoproduceshort
k
-mer
matchesondifferentdiagonals.Thus,weadoptgroupseedmatchingtoidentifyagroupofshort
k
-mermatchesincloseproximity.Therearetwotypesofdistanceconstraints.1)Thedistance
(numberofnucleotides)betweenthe
k
-mersonx-axisandy-axismustbesmallerthanagiven
threshold;2)thediagonaldistance,whichisthedifferenceofthediagonalsoftwo
k
-mermatches,
mustbewithinagivenrange.Chainingisusedtoestimatetheoverlapregion.Figure2.3
showsthatapplyinggroupseedcanremovealargenumberofrandom
k
-merhitswhilekeeping
the
k
-mermatcheswithintheoverlappingregion.When
k
=
15,thereareonlytwohits,anditis
19
Figure2.2:ThepipelineofGroupK.
20
diftodeterminewhetherthereisanoverlap.When
k
=
9,thereisclearlyachainformedby
hitsintheoverlappingregion.However,therearealsoalargenumberofrandomhits.Withgroup
seedmatchingcriteria,mostoftherandom9-merhitsareout.Sothedownstreamanalysis
becomesmorestraightforward.
2.2.2Estimatetheexpectednumberof
k
-mersforthestage
Inthissection,weanalyzetheexpectednumberofrandom
k
-merhitsbetweentworeadsandalso
k
-merhitsinoverlaps.Theanalysiswillbeusedfordeterminingthe
k
-mersizeandalsoother
parametersforthestage.Giventworeads(
S
1
and
S
2
)withlength
L
anderrorrate
e
,we
wanttodeterminehowmany
k
-merhitsweexpecttobetween
S
1
and
S
2
.
Weconsiderthecasethat
S
1
and
S
2
arenotrelated(nooverlap).Byassumingthatthe
basesin
S
1
and
S
2
arerandomlydistributed,theexpectednumberofrandom
k
-mermatches
E
[
X
r
]
isroughly:
E
[
X
r
]=

1
j
S
j

k

L
2
(2.1)
Notethatthisequationisdifferentfromtheexpectednumberofshared
k
-mersinMHAP[12]
becausewedistinguish
k
-merhitsbasedontheirlocationsratherthanthe
k
-mersthemselves.Also,
weassumethatoverlapping
k
-mersareindependent.
Inthesecondcaseof
S
1
and
S
2
forminganoverlap,weestimatetheexpectednumberof
k
-mer
matchesintheoverlapbycomputing
P
o
,whichistheprobabilityofobservinga
k
-mermatch
withintheoverlap.Inthecaseofnosequencingerror(i.e.
e
=
0),theprobabilityofobservinga
k
-mermatchatanalignedpositioninanoverlapissimply1.0.Butinthepracticalcaseof
e
>
0,
weneedtoconsidertwoscenariosinordertodeterminetheprobabilityofobservingtwoidentical
charactersatanalignedpositionintheoverlap:1)thecharactersfrom
S
1
and
S
2
arecorrect;2)the
21
Figure2.3:Thedotplotof
k
-merhitsandgroupseedsmatchesforoneoverlappingpairfromthe
E.coli
PBdatasetusedinSection2.3.Eachdotisa
k
-merhit.Thex-axisandy-axisshowthe
locationsofthehitsonthereads.Theoverlapregionisroughlyfrom0to2800onthex-axis,and
from1000to3800onthey-axis.
Top:
all9-merhits.
Bottom:
9-merhitsthatpassedthegroup
hitcriteria.Agroupseedisrepresentedbycloselylocateddotsofthesamecolor.
22
Figure2.4:Twocasescontributedtotheidenticalcharactersatanalignedpositionintheoverlap-
pingregionof
S
1
and
S
2
.
S
1
and
S
2
aretworeadssequencedfromthesameregionoftheunderlying
genomeandformanoverlap.
Top:
bothbaseson
S
1
and
S
2
arecorrect,formingamatch.
Bottom:
bothbaseson
S
1
and
S
2
aresequencingerrors,andsubstitutedbythesamecharacter.
twocharactersfrom
S
1
and
S
2
areerrorsandarerandomlysubstitutedbythesamecharacter.The
twocasesarevisualizedinFigure2.4.Sotheprobabilityisgivenby[12]:
P
o
=

(
1

e
)
2
+
e
2
1
j
S
j
1

k
(2.2)
Consideringbothrandom
k
-mermatchesand
k
-mermatchesinanoverlap,theexpectednumber
ofshared
k
-mersbetweentwooverlappingreadsisestimatedby:
E
[
X
o
]=
P
o

M
+

1
j
S
j

k

L
2
(2.3)
M
isthesizeoftheoverlap.Notethattheaboveequationslightlyover-countsthenumberof
k
-mer
hitsinanoverlapbecausetherandom
k
-merhitsinsidetheoverlapmaybecountedtwicewith
23
probability
P
o

(
1
j
S
j
)
k
.(Also,weassumethattheprobabilitiesofsubstitutionandinsertion/deletion
areonthesameorderandthusdonotdistinguishthemintheaboveequation.)
Figure2.5:Thechangeof
E
[
X
o
]
and
E
[
X
r
]
(y-axis)withtheincreaseofthereadlength(x-axis),
whichisobtainedfromarealPBdataset.Theoverlapsizeissetasthe1/4ofthereadlengthaswe
focusonidentifyingthehardcaseofsmalloverlaps.
k
=
15.
Aswearemainlyinterestedinsmalloverlapsoroverlapswithlowsequenceidentity,
weplottheexpectednumberof
k
-merhitswiththeoverlapsizebeing1/4ofthereadsizein
Figure2.5.Inordertoplotthewecompute
E
[
X
o
]
and
E
[
X
r
]
usingthereadlengthsfrom
a15X
E.coli
PBdataset(thedataofoursecondexperimentinSection2.3).Weonlyconsider
readsoflengthabove2,000.Theseallowustochoosetheappropriatethresholdfor
k
-
mer-countingbasedForexample,Figure2.5showstheexpectednumberof15-mers
betweenreadsofdifferenterrorrates.
E
[
X
r
]
startedas4when
e
=
0
:
15.And,forlarger
e
,
E
[
X
r
]
is
evensmaller.Thus,ourdefaultrationthresholdistwo15-mersinordertoensurehigh
sensitivity.Theimplementationdetailsofthe
k
-mercountingstagecanbefoundtowardstheend
24
oftheSection2.2.
2.2.3GroupHitCriteria
Sequencingerrorstendtoproduceshort
k
-mermatches.Inaddition,theinsertion/deletionerrors
leadto
k
-mermatchesondifferentdiagonals.Thus,insteadofusingrelativelylong
k
-mers(suchas
15or16-mers)asexistingtoolsdo,weuseagroupofshort
k
-mers(suchas9-mer)toaccommodate
thehighinsertionordeletionerrorrates.Agroupseedisasetofpossiblyoverlapping
k
-merhits
withstatisticallycalculatedconstraints.Theregioncontaininggroupseedsismorelikelytobe
insideanoverlapthanasingle
k
-merhit.Reference[109]introducedthegrouphitcriteriaand
alsoderivedthemethodtocalculatethecriteriastatistically.Weapplytheirmethodforoverlap
detection.
Assumethatwehavetworeads
S
1
and
S
2
oflength
m
and
n
,respectively.Thenumbersof
k
-mersatdifferentpositionsin
S
1
and
S
2
are
m

k
+
1and
n

k
+
1,respectively.A
k
-mer
hitatposition
(
i
;
j
)
isby
S
1
[
i
:::
i
+
k

1
]=
S
2
[
j
:::
j
+
k

1
]
,where
i

m

k
+
1and
j

n

k
+
1.Fortwo
k
-merhitsat
(
i
1
;
j
1
)
and
(
i
2
;
j
2
)
,theirinter-seeddistance
D
((
i
1
;
j
1
)
;
(
i
2
;
j
2
))
isthemaximumof
j
i
2

i
1
j
and
j
j
2

j
1
j
.The
k
-merdiagonalofa
k
-merhitat
(
i
;
j
)
,
d
(
i
;
j
)
,is
as
j

i
.
Withthesenotations,thegoalistosolvethefollowinginequalitiesgivenlevel1

a
bylevel
a
:
D
((
i
1
;
j
1
)
;
(
i
2
;
j
2
))

r
(2.4)
j
d
(
i
1
;
j
1
)

d
(
i
2
;
j
2
)
j
d
(2.5)
r
and
d
areintegersweneedtothegrouphitcriteria.Forexample,when
a
=
0
:
05,our
25
goalistoderive
r
and
d
sothatwith95%chancetheinter-seeddistanceandthediagonalshift
betweentwo
k
-mersinoverlappingreadsareatmost
r
and
d
,respectively.
2.2.3.1Constrainton
k
-merdistance
Wefollowedthemodeldescribedinthearticle[109].Runsofheadin
n
-independentBernoulli
trialsareusedtomodel
k
-mermatches,withtheprobability
p
foramatchand
(
1

p
)
foramis-
match.Inthismodel,a
k
-mermatchcanbetreatedas
k
consecutivematchrunswithprobabilityof
p
k
.Fromthewaitingtimedistribution[2,11],theprobabilitiesoftheinter-seeddistance
x
oftwo
k
-mersinanoverlapare:
P
[
D
k
=
x
]=
8
>
>
>
>
>
<
>
>
>
>
>
:
0for0

x
<
k
p
k
for
x
=
k
(
1

p
)
p
k
(
1

å
x

k

1
i
=
0
P
[
D
k
=
i
])
for
x
>
k
(2.6)
Withthelevel1

a
,
r
canbesolvedusingfollowingequation:
P
[
D
k

r
]=
1

a
(2.7)
Intheactualimplementationweuse
a
=
0
:
05.Thus,thereis95%chancethattheinter-seed
distanceofthetwoseedsinanoverlapislessthanthe
r
calculatedusingEquation(2.7).
26
2.2.3.2Constrainton
k
-merdiagonaldistance
Thediagonalshiftbetweentwo
k
-mers,
j
d
(
i
1
;
j
1
)

d
(
i
2
;
j
2
)
j
,intwooverlappingreadsiscaused
byinsertionsanddeletions.Notethattheinsertionsanddeletionsarebycomparingtwo
reads,notbetweenareadtoareferencegenome.Sotheinsertionrateanddeletionratearetreated
equally.
Thediagonalshiftbetweentwo
k
-merhitscanbemodeledbyadiscreteone-dimensionalran-
domwalkmodel[41,109].Thediagonalshiftstartsfrom0.Letthestepsoftherandomwalkbe
l
.Assumethattheinsertionanddeletionrateis
q
acrossthewholeread.Thus,theprobabilityof
adiagonalchangeis
q
,andtheprobabilityofstayinginplaceis1

2
q
.Also,weassumethatin
l
steps,thereare
n
i
insertednucleotides(increaseshift),
n
d
deletednucleotides(decreaseshift),and
n
m
matchednucleotides(noimpactonshift).Ifthediagonalshiftis
i
,wehavethefollowing
equations:
8
>
>
>
<
>
>
>
:
n
i
+
n
d
+
n
m
=
l
n
i

n
d
=
i
(2.8)
Reference[109]calculatedtheprobabilityofobtainingadiagonalshift
i
after
l
stepsinthe
randomwalk.AccordingtoEquation(8),wehave
n
i
=
i
+
n
d
and
n
m
=
l

(
i
+
2
n
d
)
.Fora
n
d
,theprobabilityofarandomwalkproducingdiagonalshift
i
canbecalculatedasthenumber
ofthepossiblepaths

l
i
+
2
n
d


i
+
2
n
d
i
+
n
d

,timestheprobabilityproductofallinsertions,deletions,and
noshiftchangeateachstep
q
n
d
q
n
d
+
i
(
1

2
q
)
l

(
i
+
2
n
d
)
.Tocalculatetheprobabilityofgeneratinga
diagonalshift
i
given
l
,
P
[
i
;
l
]
,weneedtoconsiderallpossiblevaluesof
n
d
,whichisfrom0(no
deletion)to
(
l

i
)
=
2(nomatch).Sowehave:
27
P
[
i
;
l
]=
å
(
l

i
)
=
2
n
d
=
0

l
i
+
2
n
d


i
+
2
n
d
i
+
n
d


q
n
d
q
n
d
+
i
(
1

2
q
)
l

(
i
+
2
n
d
)
(2.9)
Tocalculate
d
,wesumuptheprobabilities
P
[
i
;
l
]
for
i
=
0
;

1
;

2
;:::;

l
untilwereachthelevel
1

a
.Wereferthereadertotheoriginalarticle[109]foramoredetaileddiscussionofthemodel
andapracticalimplementationusinggeneratingfunctions.
Inourexperiment,wesetthesequencingaccuracy
p
=
0
:
85andtheindelrate
q
=
0
:
06,asPB
readstendtohavehigherindelratesthansubstitutionerrorrates.Userscanadjusttheseparameters
basedontheirdataproperties.Thelevel
a
is0.05and
k
is9.UsingEquation(2.6),
wecanobtain
r
=
54.UsingEquation(2.9)and
r
asanestimationof
l
,wecanobtain
d
=
5given
r
=
54.Thus,when
k
=
9,seedswithinter-seeddistance
D
((
i
1
;
j
1
)
;
(
i
2
;
j
2
))

54anddiagonal
shift
j
d
(
i
1
;
j
1
)

d
(
i
2
;
j
2
)
j
5areclusteredinthesamegroup.Inaddition,if
k
-mers
(
i
1
;
j
1
)
and
(
i
2
;
j
2
)
areinthesamegroupand
k
-mers
(
i
2
;
j
2
)
and
(
i
3
;
j
3
)
areinthesamegroup,wewillcluster
(
i
1
;
j
1
)
and
(
i
3
;
j
3
)
aswell.
2.2.4GroupChaining
Withgrouphitcriteria,GroupKcanshortsimilarregions.Toidentifytheoverlappingregion,
weaimtoachainofgroupseedmatchesthatmaximizesthenumberofmatchedbases.We
usedthesparsedynamicprogrammingforchaining[54,68].
Aftergeneratingachainofgroupseedmatches,weneedtodeterminewhetherthischaining
anoverlap.Wedeveloptwocriteriaforthispurpose.First,wecalculatetheexpected
numberofmatchedbasesfromthegrouphitcriteria,assumingthatthechaincoversthepossible
overlappingregionwithlength
L
O
(
L
O
canbeestimatedbytheextensionofbothendsofthe
28
optimalchain).Weusedthefollowingequationtocalculatetheexpectednumberofthematched
bases
n
e
:
n
e
=
1
c

L
O
r

k
(2.10)
Where
c
isacoeftocontrolthecriteria,
k
isthesizeof
k
-mer,
r
isthegrouphitcriteriafor
theinter-
k
-merdistance.Weonlyreportthechainingresultifthenumberofmatchedbases
n

n
e
.
Second,werequirethatbothreadshavesimilarsizesinsidetheoverlappingregion.
Figure2.6:Anexampleoftheoverlapsizeestimation.Thesufofread1andtheofread
2formanoverlap.Eachshortsolidlinerepresentsagroupseedmatchintheoptimalchain.The
blackdashedlineindicatesthetrueoverlapalignmentregionbetweenthetworeads.Thegray
dashedline,whichisformedbythetwoendinggroupseedsintheoptimalchain,canoverestimate
theoverlapsize.
Inourexperiment,wefoundthatsometimesusingtheoptimalchaingeneratedfromsparse
dynamicprogrammingmayoverestimatetheoverlapregion,asshowninFigure2.6.Thisover-
estimationcanjeopardizethesensitivityofdetectingsmalloverlaps.Wethisproblembyonly
keepingthecollineargroupseeds,whichareusedtoestimatetheoverlapsize.
29
2.2.5Implementationdetailsofthemajorcomponents
2.2.5.1Filtrationby
k
-mercounts
Inthestepofourpipeline,weuse
k
-mercounting-basedtoremovelargenumbers
ofreadpairsthatarenotlikelysequencedfromthesamelociontheunderlyinggenome.Weim-
plemented
k
-mercountingusingageneralizedsufarrayandthederivedlongestcommon
(LCP)array.Thegeneralizedsufarray
SA
iscreatedfromtheconcatenatedreads(delimitedby
specialcharacterssuchas$)usingalinearalgorithm[125].Then,wecreatetheLCPusingboth
thesufarray
SA
andthereversedsufarray
SA
0
[70,71].Letthesequenceofconcatenated
readsbe
T
.Followingtheofthereversedsufarray,forasufstartingatposition
SA
[
i
]
,wehave
SA
0
[
SA
[
i
]]=
i
.Foreachposition
i
intheLCP,LCP[i]containsthesizeofthelongest
commonbetween
SA
[
i
]
and
SA
[
i

1
]
.Thekeyobservation[125]forefcomputation
ofLCP[i]is:foraposition
j
in
T
,if
LCP
[
SA
0
[
j

1
]]
is
L
,
LCP
[
SA
0
[
j
]]

L

1.ThewholeLCP
arrayconstructiontakeslineartimetothesizeof
T
[125].
Inordertocounttheshared
k
-mersbetweenreadsandalsoreportreadpairspassingthe
k
-
mercountingthreshold,weusebothLCPandanauxiliarydatastructurerecordingthereadIDs
(denotedasarray
readID
).Forasufstartingatposition
SA
[
i
]
,itsreadIDisat
readID
[
i
]
.The
pseudocodeofthenumberofshared
k
-merscanbefoundinAlgorithm1.Inpractice,we
alsocount
k
-mersbetweenareadandtheotherread'sreversecomplement.
2.2.5.2Groupseedmatchandchaining
Anypairofreadsthatpasstheabovestagewillbeusedasinputforinggroupseed
matches.Allotherpairswillbediscarded.Currently,weareusingthecodesofYASS[108,110]
forthegroupseedmatches.Theprogramuseshashingtabletoshortexactmatches
30
Algorithm1
Counting
k
-mersbetweenreads
Input:
LCParray
LCP
[
1
::
n
]
,readIDarray
readID
[
0
::
n
]
,
k
-mersize
k
,
k
-mercountsthreshold
t
Output:
Readpairswhoseshared
k
-mer
counts

t
1:
Initializeamap
counts
[
key
][
value
]
forrecording
k
-mercounts
2:
for
i
=
1to
n
do
3:
readi
 
readID
[
i
]
4:
j
 
i
+
1
5:
L
 
LCP
[
j
]
6:
while
L

k
and
j

n
do
7:
readj
 
readID
[
j
]
8:
if
readi
<
readj
then
9:
key
 
readi
:
readj
10:
endif
11:
if
readj
<
readi
then
12:
key
 
readj
:
readi
13:
endif
14:
if
counts
[
key
]
doesnotexist
then
15:
counts
[
key
]
 
1
16:
else
17:
counts
[
key
]
++
18:
endif
19:
j
++
20:
L
 
min(
L
,
LCP
[
j
]
)
.
min(
L
,
LCP
[
j
]
)returnstheminimumof
L
and
LCP
[
j
]
21:
endwhile
22:
endfor
23:
forall
key
in
counts
do
24:
if
counts
[
key
]

t
then
25:
output
key
.
keycontainsreadpairIDspassingthe
k
-mercountthreshold
26:
endif
27:
endfor
31
andcreatesthegroupsofmatchesonthe.Itisourfutureworktore-implementgroupseed
matchingusingmoreeftindexing-basedmethods.Implementationofchainingalgorithmhas
beenfromtheprogramofglobalchainingalgorithminSeqAnlibrary[32].
2.2.5.3Timecomplexityanalysis
Ifwehave
N
readswithaveragereadlength
L
,ourtextsizeis
NL
.Therefore,wehave
NL
elements
inthesufarrayandthecorrespondingLCParray.Foreachsufxinthesufarray,supposeon
average,itcanformLCPswith
m
othersufeswithsizeabove
k
,whichisthesizeof
k
-merused
inthe
k
-mer-countsteps.Sothetimecomplexityofalltheshared
k
-mersforall
possiblereadingpairsis
O
(
mNL
)
.If
k
islargeenough(e.g.,
k
=
15and11inourexperiment),we
have
m
˝
NL
,sothetimecomplexitywillbedominatedby
NL
.
For
N
0
readpairsthatpassthestage,lettheaveragenumberof
k
-merhitsforeach
pairbe
q
.Sortingthehitswillneed
O
(
q
log
q
)
anditeratingthroughallhitstogroupsislinear
to
q
.Forallthereadpairs,thetimecomplexityis
O
(
N
0
q
log
q
)
.
Assumingwehave
r
groupseeds,thechainingprocedurehascomplexityin
O
(
r
log
r
)
foreachreadpairs.Forallthereadpairs,thetimecomplexityis
O
(
N
0
r
log
r
)
.Itispracticallyvery
fastbecausethenumberofgroupseedmatchesisverysmallcomparedtotheoriginalseedhits
(indicatedbyFigure2.3).
2.3Results
Wefocusonevaluatingthesensitivityandprecisionofoverlapdetection.WeappliedGroupKto
threePBdatasetsandoneONTdataset:asimulatedPBRSII
E.coli
sequencingdataset,arealPB
RSII
E.coli
sequencingdataset,aPBRSIIhumanfootmetagenomicsequencingdataset,andan
32
ONT
E.coli
(SQK-MAP-006)dataset.Forsimulated
E.coli
dataset,wehavethetruesampling
positionforeachreadasourgroundtruth.Forthereal
E.coli
datasetandhumanfootdataset,we
determinethegroundtruthviaBLASR's[21]alignmentsagainstthereferencegenome.
WebenchmarkedGroupK'sperformancewithMinimap[87],Minimap2[88],DALIGNER
[102],MHAP[12],andGraphMap[137].Thosetoolsandmethodsarerepresentativeoverlap
detectiontoolsforlongerroneousreadsfromPBorONT[26].Allthedetailedparameterscan
befoundatthewebsitelistedin[33].Ourmainmetricsinclude:(1)sensitivity,whichmeasures
theratioofthetrueoverlapsbyeachprogramtothewholesetofoverlappingpairs;(2)
precision,whichtheratiooftrueoverlapdetectedbyeachprogramtothetotalreported
overlappingpairs;and(3)F1score,whichistheharmonicmeanofsensitivityandprecision.A
reportedoverlappingpairisregardedascorrectifitisalsopresentinourgroundtruth.Thedetailed
overlapregionandoverlaplengthwerenotconsideredinthecurrentevaluationbecausetheseread
pairscangothroughamoreaccuratealignmentprogramforgeneratingtheoverlapalignment.
Aswediscussedbefore,ourgoalistoidentifyoverlappingreadswithoutusingerrorcorrection.
Alltestedtoolswillthusbeappliedtotheirrawdataset.
2.3.1Simulated
E.coli
dataset
Weevaluatedtheperformanceofourmethodonasimulated
E.coli
dataset.Thedataset
wasgeneratedusingPBSIM[114]with
E.coli
K-12MG1655asthereferencegenome[60].The
lengthdistributionandthequalitywerederivedfromrealPBP6-C4
E.coli
dataset[118].
Thesimulateddatasethas5,620reads,withaveragelengthof8,344.78bpsand14.5%average
errorrate(8.6%insertions,4.4%deletions,and1.4%substitutions).
FromthereportbyPBSIM,wecanobtaintheexactlocationswherethesimulatedreadsare
sampledinthegenome.Thisinformationprovidesuswiththegroundtruthforreads'overlapsso
33
thatwecancalculatethesensitivityandprecision.
FollowingthepipelinewediscussedinSection2.2,werstused
k
-mer-countingasthe
tionstage.AccordingtoEquation(2.1),Equation(2.3),andFigure2.5,wediscardedallreadpairs
withlessthantwo15-mermatches.Thesensitivityoftheis0.979andonlyabout6%of
readpairsarekeptfordownstreamanalysis.
Figure2.7:TheROC-likeplotusingGroupK,Minimap,Minimap2,DALIGNER,MHAP,and
GraphMaponthesimulatedPB
E.coli
dataset.Thex-axisrepresentsthefalsediscoveryrate
(FDR
=
1

precision).Y-axisisthesensitivity(0.5to1).
Weevaluatedtheperformanceofourtoolbyadjustingthegroupseedmatchcriteriacoefient
c
,whichisintroducedinSection2.2.Withtheincreaseof
c
,sensitivitywillbecomehigher,andthe
precisionwillbecomelower.AsshowninFigure2.7,GroupKcanachieve5%to6%improvement
onthesensitivitywithsimilarprecisiontootheroverlapdetectiontools.
34
GroupKMinimapMinimap2DALIGNERGraphMapMHAP
Time(seconds)1871301639171858
Memory(GB)1.9941.7541.0972.2881.8732.562
Table2.1:Computationalperformanceonthesimulated
E.coli
dataset.Thecomputationalper-
formanceofoverlapdetectionusingGroupK,Minimap,Minimap2,DALIGNER,MHAP,and
GraphMaponthesimulated
E.coli
dataset.
2.3.1.1Runningtimeandmemoryusage
Weevaluatedtherunningtimeandthepeakmemoryusageofthetestedtoolsinthisexperiment.
Werunalloverlapdetectiontoolswithasinglecoreof2.4Ghz14-coreIntelXeonE5-2680v4
CPUand32GBmemoryrequestedfromtheHigh-PerformanceComputingCenteratMichigan
StateUniversity.TheperformanceismeasuredwiththebestF1score.Theresultsarereported
inTable2.1.Forthememoryusage,Minimap2isthemostefonebutallothersarecom-
parable.GroupKisslowerthanothertools,partiallybecauseweusesmall
k
-mers.Wefound
thatthebottleneckofourprogramisthegroupmatchingstage,whichaccountsforabout1200of
1871seconds.Byimplementingamoreefindexing-basedmethod,weexpecttoreducethe
runningtimeofthisstage.Forexample,wecanspeedup
k
-mercountingbyadoptingthemethod
usedinKMC[77].
2.3.2RealPB
E.coli
dataset
Afterusingthesimulateddatasettoevaluateourmethod'sperformance,weappliedGroupKtoa
realPBRSII(P6-C4)
E.coli
dataset[118].Thecoverageofthewholedatasetis150X.Totestthe
performanceoflowcoveragedata,wesampleda15Xcoveragedatasetbasedonthereadlength
distributionofthewholedataset.Thedatasethas14,262reads,withtheaveragelengthof4,882.09
bpsandaverageerrorrateof14.14%(errorrateisestimatedusingqualityscore).
WeappliedBLASRtomapthereadstothereferencegenometoestimatethegroundtruth
35
(BLASRwasrunwithparameters:minReadLength,2000;maxScore,1000;maxLCPLength,16;
minMatch,12;m4andnCandidates/bestnsetto10

sequencingcoverage).Themappingre-
sultfromBLASRmaycontainshortalignmentsduetotherepeatinthegenomes.Sowhenwe
determinethegroundtruth,weonlyconsiderthealignmentsthatcoveratleast80%oftheread.
Byremovingshortnoisyalignments,wecanmakesuretheBLASRalignmentsareclosetothe
underlyinggroundtruth.
GroupKMinimapMinimap2DALIGNERGraphMapMHAP
BestF1score
0.9311
0.90370.84260.83400.77580.7023
Sensitivity
0.9330
0.89390.87780.90660.67410.7258
Precision
0.92920.9138
0.81010.7722
0.9137
0.6802
Table2.2:Overlapdetectiononthereal
E.coli
dataset.Theperformanceofoverlapdetection
usingGroupK,Minimap,Minimap2,DALIGNER,MHAP,andGraphMapontherealPBRSII
(P6-C4)
E.coli
dataset.HereweonlyreporttheexperimentresultswiththehighestF1scorefor
eachtool.
Weusedthesamesetupadoptedinthesimulated
E.coli
experiment.Amongall
overlapdetectiontoolswetested,GroupKstillachievedthehighestsensitivitywithcomparable
precision(Table2.2).Withslightlyhigherprecision,oursensitivityis4%betterthanthenext
besttool,Minimap.Comparedtothepreviousexperiment,thedifferenceinsensitivityissmaller.
Onereasonliesintheconstructionofthegroundtruthdataset.Inthesimulateddataset,weused
thesamplepositionsofallreadstodeterminewhethertworeadsformanoverlap.Thus,that
datasetcanincludereadswithsmalloverlapsorreadswithhighererrorrates.Inthisdataset,our
methoddiscardedBLASRalignmentswithhigherrorratesandtheremainingalignmentshave
highersimilaritieswiththereferencegenomeandthusproducefewerﬁhardcasesﬂ.AsMinimapis
thesecondbesttoolforthisdataset,wefurtheranalyzedtheperformanceofGroupKandMinimap
onreadpairsofdifferentoverlapsizeinthenextsection.
36
2.3.2.1Performancewithdifferentoverlapsize
Wedividealloverlappingpairsintobinsofwidth500basedontheoverlapsize.Forexample,
thebinhasthereadpairswithoverlapsizesfrom0to499,andthesecondbinhasreadpairs
withoverlapsizesfrom500to999,andsoon.Foreachbin,wecomparedthesensitivityof
GroupKandMinimapusingparametersyieldingthesimilarprecisioninFigure2.8.ForMinimap,
weshowedtheresultwiththehighestF1score.ForGroupK,weselectedaparametersothatit
achievessimilarprecisiontoMinimap(F1score:0.9241,sensitivity:0.9344,precision:0.9140).
AccordingtoFigure2.8,GroupKhasmuchbettersensitivitywhentheoverlapsizeislessthan
2,000.AsweshowedinFigure2.1,thereareanumberofoverlapswithoverlapsize
smallerthan2,000evenfor30Xcoverage.Beingabletoidentifysmalloverlapsallowsusto
generatemorecompleteassembliesusinglongreads.Thisisparticularlyusefulforlowcoverage
data,suchaswhatweusuallyhaveinmetagenomicdatasets.
2.3.3HumanFootMetagenomicDataset
Oneofthemajorutilitiesofourtoolistoidentifyoverlapsbetweenreadsinmetagenomicdatathat
aresequencedusingthePBplatform.Forcomplicatedmicrobialcommunities,metagenomicdata
containingonlyshortreadsposesseriouscomputationalchallengesfor
denovo
assemblyandthe
downstreamcomposition/functionalanalysis.Longreadsholdthepromisetoproducemorecom-
pleteandaccuratemicrobialgenomeassembliesforthemetagenomicdataset.Inthisexperiment,
weevaluatedtheperformanceofoverlapdetectionforamockmetagenomicdatasetconstructed
fromarealhumanfootdataset[144].Aparticularchallengeforthisexperimentisthelowcover-
ageofthecomponentspeciesinthemetagenomicdataset,whichcouldbecausedbysequencing
throughputandcomplexityofthesample.
37
Figure2.8:SensitivityofGroupKandMinimapfordetectingoverlapsofdifferentsizeonPB
E.
coli
dataset.Thex-axisrepresentstheoverlapsize.Y-axisisthecorrespondingsensitivityofthe
bin.TheX-axisbinwidthis500andtheonlyincludedthe6bins(i.e.uptooverlap
size3000)astheirsensitivitybecomesmoresimilarwiththeincreaseoftheoverlapsize.
38
ThehumanfootsamplewassequencedbylinearPBRSIITdT(terminaldeoxynucleotidyl
transferase).Sequencesthatcanbemappedtothehumangenomewereremovedashost-derived
DNA.AccordingtotheSupplementaryMaterialsof[144],thereareabout1,000bacteriaand
virusesinthismetagenomicdataset.However,wecannotevaluatetheperformanceofoverlap
detectiononallthereadsfromthe1,000microbesbecausethecoveragesofmanyspeciesaretoo
lowtoyieldmeaningfuloverlaps.Inordertoconstructthegroundtruth,weneedtoalignthereads
againstspecieswithknownreferencegenomesandreasonablecoverage.Thus,weonlychoose
readssatisfyingthefollowingcriteria:1)thereadsaresequencedfromaspecieswithknownrefer-
encegenome;and2)thecoverageofthespeciescannotbetoosmall(e.g.,
>
3Xcoverage).Based
onthesecriteria,wekeepthereadssequencedfromthreebacteria:
Corynebacteriumaurimucosum
(6.3XCoverage),
Corynebacteriumtuberculostearicum
(8.5XCoverage),and
Staphylococcusho-
minis
(3.2XCoverage).ThereadsarerecruitedviaBLASR.Thealignmentpositionsareusedto
determinewhichreadsformanoverlap.Notethat
Corynebacteriumaurimucosum
and
Corynebac-
teriumtuberculostearicum
belongtothesamegenusandmaycontributetothefalsepositiveover-
lapdetectionduetotheirsharedregions.Theaveragelengthofthereadsis1696.25bps,whichis
muchshorterthanthereadsinthepreviousexperiments.
Asthisdatasetcontainsmuchshorterreads,theexpectednumberof
k
-merhitswillchange.
Intuitivelyweneedtouseshorter
k
-merstoensurehighsensitivity.Usingthereadlength
distribution,wecalculated
E
[
X
r
]
(Equation2.1)and
E
[
X
o
]
(Equation2.3)anddeterminedthe
k
-
mercounting-basedcriteria.Inthisexperiment,weonlykeptreadpairsthatshareatleast
three11-mers.
Forthismockmetagenomicdataset,GroupKyieldedbetterperformancethan
othertoolsonmetricsincludingF1score,sensitivity,andprecision(Table2.3).Comparedto
othertools,GroupKcanproducemuchhighersensitivitywithoutprecision,leadingto
39
GroupKMinimapMinimap2DALIGNERGraphMapMHAP
Total:
BestF1score
0.9163
0.83060.87760.69110.71880.7812
Sensitivity
0.8954
0.78020.83520.60120.57210.6803
Precision0.93810.88800.92450.8027
0.9666
0.9174
C.aurimucosum
:
BestF1score
0.9512
0.80720.91170.74320.73970.8545
Sensitivity
0.9228
0.68580.84670.62660.58920.8045
Precision
0.98140.98060.9874
0.9131
0.9937
0.9111
C.tubercu-
lostearicum
:
BestF1score
0.9454
0.86880.90500.83150.72760.7961
Sensitivity
0.9105
0.79580.83460.71500.57270.7355
Precision
0.9830
0.9567
0.98840.99340.9977
0.8675
S.hominis
:
BestF1score
0.9163
0.76510.80240.87540.63430.6861
Sensitivity
0.8733
0.68870.6813
0.7967
0.46580.6348
Precision0.92450.86060.97590.9713
0.9938
0.7464
Table2.3:Overlapdetectiononthehumanmetagenomicdataset.Theperformanceofoverlap
detectionusingGroupK,Minimap,Minimap2,DALIGNER,MHAP,andGraphMaponthemock
metagenomicdataset.HereweonlyreporttheexperimentalresultswiththehighestF1scorefor
eachtool.
thehigherF1score.Besidesevaluatingtheperformanceofvarioustoolsonallthereadsfrom
thethreespecies,wealsoreportedtheperformanceofdifferentoverlapdetectiontoolsoneach
singlebacteriadatasetwithoutmixingwithotherspecies(Table2.3).Inthesetools,GraphMap
hashighforallthreewithofsensitivity.However,GroupKstillachievesthe
bestperformanceoverall.Thecomparisonssuggestthatourmethodhasgreatpotentialtodetect
overlapsfordatawithverylowcoverage(around5X).ThiswillenablebetterassemblyforPB
sequencedmetagenomicdata,whichwillbecomemoreavailablewiththeadvancesoflongread
sequencingtechnologies.
40
2.3.4RealONT
E.coli
dataset
WealsotestedourmethodononeONTdataset.Weuseddownsampled15Xcoverage2Dreads
fromtheSQK-MAP-006datasetas2Dreadsprovidehigherqualitythan1Dreads.Wefollowed
thesamepipelinesweusedforthereal
E.coli
PBdataset.As2DONTreadshavesimilarerror
ratestothePBreads,weexpectthatourtoolcanstillachievereasonableperformancegiventhe
samesetupforthePBdataset.Therefore,weusedthesameparametersastheonesweusedforthe
PB
E.coli
dataset.
GroupKMinimapMinimap2DALIGNERGraphMapMHAP
F1score
0.93830.9310
0.90900.8272
0.9280
0.8122
Sensitivity
0.95970.9546
0.93620.87300.89910.8871
Precision0.9178
0
.90850.88330.7860
0.9589
0.7490
Table2.4:OverlapdetectionontheONT
E.coli
dataset.Theperformanceofoverlapdetection
usingGroupK,Minimap,Minimap2,DALIGNER,MHAP,andGraphMapontherealONTSQK-
MAP-006
E.coli
dataset.Minimap2usestheava-ontsetup,whichisoptimizedforONTdata.
Amongthesetools,GraphMapwasdesignedforONTdata,Minimap2providesas
setupfortheoverlaponONTdataset.AllothertoolsarenotdesignedforONT
datasets.GroupKachievesthebestF1scorecomparedtoothertools'defaultsetupwhilekeeping
thehighestsensitivity(Table2.4).Thisresultsuggeststhatourstrategyisrobustwithdifferent
typesoflongreads.
2.4Discussion
Seedingisakeystepforoverlapdetectionbecauseofthehigherrorrateoflongreads.Successful
seedingstrategiesshouldbalancethesensitivityandthetoachievetheoptimalperfor-
mance.Popularseedingmethodsincludemaximalexactmatches,spacedseeds,andgappedspaced
41
seeds.However,tosuccessfullyahitbetweentworeads,thesemethodsstillneedeitherto
relativelylongcontinuousexactmatches(large
k
-mer)ortoinexactmatchesfollowingcertain
errorpatterns(spacedseed).Comparedtothesemethods,groupseedmatchingismorexibleas
itrequiresmultipleshortexactmatcheswithoutspecifyingtheerrorpatterns.Thisxibilityleads
tohighsensitivity,andmeanwhiletheisstillguaranteedwiththegroupseedmatch
criteria.
Currentlythegroupseedmatchingstepbasedonhashtableisthebottleneckofouroverlap
detectionpipeline.Anewmethodthatcanimprovetherunningtimeefyofthisstepis
neededtomakethealgorithmachievethesamespeedasotherfasteroverlapdetectiontools.
2.5Conclusions
Inthiswork,wedevelopedanoverlapdetectiontoolforthird-generationsequencingdata.By
adoptingthegrouphitcriteriatoclusteragroupofshort
k
-merhitsthatsatisfystatisticallyderived
distanceconstrains,ourmethodcanimprovethesensitivityofoverlapdetectionwithout
precision.Ourexperimentalresultshaveshownthatfordatasetswithlowsequencingcoverage,
ourprogramcandetectsmoreoverlappingpairswhilekeepinghighprecision.One
utilityofourapproachistodetectsmalloverlapsbetweenlongreadsofrarespeciesinamicrobial
community.
42
Chapter3
prHMMmodelandhomologysearch
1
3.1Introduction
ComparedtoIllumina,therepresentativesecondarygenerationsequencingplatform,themajor
disadvantagesofPBincludehighsequencingerrorrate(11-15%),lowerthroughput,andhigher
costperbase[124,126].Similartopyrosequencingdata,mostoftheerrorsareinsertionordeletion
errors.Thehigherrorrateposeschallengesforalldownstreamsequenceanalysis.Inparticular,
duringhomologysearchforgenomeannotation,sequencesarealignedtocharacterizedprotein
sequencesorfamilies.Insertionordeletionerrorsingeneswillcauseframeshiftsandmayonly
leadtomarginalalignmentscoresandshortalignments[154].Asaresult,itishardtodistinguish
truealignmentsfromrandomalignments.Theinaccuratehomologysearchresultscanincurerrors
instructuralandfunctionalannotation.
Differentstrategieshavebeenproposedorimplementedtoavoidorcorrectsequencinger-
rorsinPBdata.TherearevariousPBsequencingprojectsthatmainlyusecircularconsensus
sequencing(CCS)readswithsufsequencingpasses.Acoverageof15passesyields
>
99%
accuracy[128].However,CCSreadsaremuchshorterthanthecontinuouslongreadsofPBdata.
Inaddition,theamountofCCSreadsismuchlessthanalltheoutputofPBdata.Thus,usingonly
CCSreadsdoesnottakefulladvantageofthesequencingpowerandstrengthofPBdata.
1
Du,Nan,andYanniSun."ImprovehomologysearchsensitivityofPacBiodatabycorrectingframeshifts."Bioin-
formatics32.17(2016):i529-i537.
43
OnepopularstrategytohandlesequencingerrorsofPBdataisbasedonhybridsequenc-
ing[78].AsIlluminaproducesmanymoreaccuratebutshorterreads,methodsaredeveloped
tocorrecterrorsbyaligningshortreadstolongPBreads.Yet,thismethodneedspreparationofat
leasttwosequencinglibrariesandseveraltypesofsequencingruns,whichisnotcost-effectivefor
manyapplications.
Unlikehybridsequencing,therearemethodsthatdonotrequirehighlyaccurateshortreads
forerrorcorrection.OnerepresentativemethodisdescribedinChinetal.'shierarchicalgenome-
assemblyprocessHGAP[24],whichalignsshortsequencestothelongestreadsofthesamese-
quencinglibraryofPB.AsthesequencingerrorsinPBreadsoccurrandomly,theinferredconsen-
sussequencefromthealignmentbetweentheshortreadsandlongreadsrepresentthehigh-quality
sequence.Despiteitssuccess,thereisstillroomtoimprovetheerrorcorrectionperformancefor
theconsensussequenceextractionstageinHGAP.Inparticular,itsperformanceisheavilyaffected
bythecoverageofthealignedshortsequencesagainstthelongseedsequences.Theregionswith
moreshortsequencesalignedhavebettererrorcorrectionperformancethanotherregions.
Aftererrorcorrection,correctedPBreadscanusuallyachievemoresensitivehomologysearch
resultscomparedtotherawdata.Inparticular,whenthecoverageishigh,thecorrectedreads
fromHGAPcanachievealignmentscoressimilartothegroundtruth.However,inpractice,notall
PBsequencingprojectscanhavesufcoverageforallregions,transcripts,orgenomes.For
example,HGAPfailedtoassemblethedatafromthearmsampleinhumanskinmicrobialcom-
munity[144]becauseoflowcoverageofthedataset.Figure3.7inourexperimentalresultsshow
thatthedifferenceofthealignments'scores,lengths,andE-valuesbetweenHGAP'scorrected
readsandthegroundtruthisstillThus,thereisstillaneedforhomologysearchtools
designedforPBdata.
Inthiswork,wedesignedandimplementedFrame-Pro,ahomologysearchtoolforPBreads.
44
Theexperimentalresultsshowedthatourtoolcanimprovethehomologysearchsen-
sitivitywhilealsocorrectingsequencingerrors.Ourmethodincorporatedtwokeyobservations.
First,asshownbyHGAP,sequencingerrorsinPBaredistributedrandomlyandthustheconsen-
sussequencestendtobeclosertogroundtruth.Ourworkincorporatedthismethod.Second,we
identifyframeshiftscausedbysequencingerrorsusingcharacterizedproteinfamiliesastheguid-
ance.Essentiallyourmethodcorrectserrorsbymaximizingbothalignmentscoreagainstprotein
familiesandlocalcoveragescoreinaconstructedalignmentgraph.Bothobservationsareused
togethertoboosttheperformanceofbothhomologysearchanderrorcorrection.
TheremainderofthisChapterisorganizedasfollows.Section3.2reviewsother
frameshifterrordetectiontoolsandtheirlimitationsinproteindomaininPBdata
sets.Section3.4describesthedynamicprogrammingalgorithmthatincorporatesconsensusse-
quenceandViterbialgorithmforerrorcorrectionandsequencealignment.InSection3.5,
wedemonstratetheresultsoferrorcorrectionandhomologysearchbyapplyingourtooltosim-
ulatedandrealPBdata.WealsobenchmarkourtoolwithHGAP,asuccessfulerrorcorrection
methodwithoutrelyingonhybridsequencing.Finally,Section3.6concludesandsuggestsdirec-
tionsforfuturework.
3.2Relatedwork
3.2.1Prhomologysearch
Homologysearchisstillanimportantstepinsequence-basedfunctionalanalysisforgenomicdata.
Bycomparingquerysequencesagainstreferencesequencesori.e.,afamilyofhomolo-
gousreferencesequences,functionsandstructurescanbeinferred.Therepresentativetoolsfor
sequencehomologysearchandhomologysearchareBLAST[4]andHMMER[38],re-
45
spectively.homologysearchhasseveraladvantagesoverpairwisealignmenttoolssuchas
BLAST.First,thenumberofgenefamiliesissmallerthanthenumberofsequences,
renderingmuchfastersearchtime.Forexample,thereareonlyabout13,000manuallycurated
proteinfamiliesinPfam,butthesecovernearly80%oftheUniProtKnowledgebaseandthecov-
erageisincreasingeveryyearasenoughinformationbecomesavailabletoformnewfamilies[45].
ThenewestversionofHMMER[38]ismoresensitivethanBLASTandisabout10%faster.
Second,previouswork[35]hasdemonstratedthatusingfamilyinformationcanimprovethe
sensitivityofaremoteproteinhomologysearch,whichisveryimportantforvarioussequencing
datasuchasmetagenomicdataanalysis.Thesedatasetsmaycontainspeciesremotelyrelatedto
onesinthereferencedatabaseandrequiresensitivehomologysearch.Thus,inthiswork,wefocus
onimplementinghomologysearchforPBdata.Themethodcanbeextendedtopairwise
sequencealignment.AsHMMERisthemostwidelyusedalignmenttool,wefocuson
evaluatingthealignmentperformanceusingHMMER.
TheproteindomainsfamiliesusedinourexperimentsaredownloadedfromPfam.Other
databasessuchasTIGRFAMs[57],FIGfams[98],InterProScan[152],andFOAM[123]canbe
usedtooaslongashiddenMarkovmodelscanbetrained.
3.2.2Relatedworkonframeshiftcorrection
Usually,whencomparingaDNAsequencewithaproteinsequenceorfamily,six-frametransla-
tionsareconductedandoneofthereadingframeshouldleadtostatisticallyalignment
ifthequeryandthereferencearehomologous.However,frameshiftscausedbyinsertionordele-
tionerrorsmakethecorrecttranslationconsistofalternatingreadingframes.Withoutknowingthe
errorpositions,choosingthecorrectframesforeachfragmentbetweenerrorsischallenging.
AnumberofprogramsexisttohandleframeshiftsthroughDNAvs.proteinsequencealign-
46
ment.SimplemethodssuchasBLASTXdiscardsequencesthatmightcontainframeshiftsrather
thantryingtothem.Othertools[17,22,49,50,53,59,120,121,155]areavailabletodetectand
frameshifterrorsautomatically.Besidesdetectingframeshiftinsequencealignment,somepro-
grams[5,14,76,131]focusonframeshiftdetectionduringgeneanduse
abinitio
methods.
TheseprogramsarenotdesignedforPBdata.Inaddition,theycannotbeappliedtoprotein
homologysearch.
Alternatively,Genewise[13],awidelyusedDNAvs.proteinalignmenttoolallowscompar-
isonofaDNAsequencewithaproteinfamily.Butitdoesnotconsiderthesequencingerror
propertiesofNGSdata.Themostrelevantworkstooursinclude[154],whichViterbi
algorithmtoimprovehomologysearchforpyrosequencingdata.Butitdoesnothavesatisfactory
performanceforPBdatabecausethesequencingerrorpropertiesofpyrosequencingandPBare
different.Pyrosequencingreadshavelowererrorrateandmostoftheerrorsarelocatedinside
homopolymerregions.ThesepropertiesmakeerrorcorrectioneasierthanPB,whichhavehigher
errorratesandtheerrorscanoccurmorerandomly.FrameBot[150]isanotherrelevantworkfor
correctingframeshiftscausedbysequencingerrors.Butitisnotdesignedforhomology
search.And,likeothertools,itisonlyoptimizedandtestedonsequencingdatawithlowererror
ratethanPB.
HGAP[24]isanotherhighlyrelevantworkbecauseitcontainserrorcorrectionstageforPB
data.AsdiscussedinSection3.1,itsperformanceisheavilyaffectedbysequencingcoverage.
3.3PrhiddenMarkovmodel
BeforeintroducingtheaugmentedViterbialgorithmthatusedinFrame-Pro,wereviewedthe
conceptesandalgorithmsofhiddenMarkovmodel.Thecontentofthissectionismainly
47
adaptedfrom[35,36].
3.3.1Markovchains
WestartfromMarkovchain,whichisthesimplestprobabilisticmodelofsequence.Inthemodel,
theprobabilityofasymbolinthesequencedependsontheprevioussymbol.InMarkovchain,
eachsymbol(e.g.,residueofDNA)correspondstoastateinthechain.
Figure3.1:AMarkovchainexampleofDNA.Therearefourstatesforeachofnucleotides
A
,
C
,
G
and
T
.Atransitionprobabilityisassociatedwitheacharrowinthe
Thetransitionprobabilities
a
st
istheprobabilitythatacertainstatefollowinganotherstate:
a
st
=
P
(
x
i
=
t
j
x
i

1
=
s
)
(3.1)
Theprobabilityofthesequencecanbewriteas
P
(
x
)=
P
(
x
L
;
x
L

1
;:::;
x
1
)
=
P
(
x
L
j
x
L

1
;:::;
x
1
)
P
(
x
L

1
j
x
L

2
;:::;
x
1
)
:::
P
(
x
1
)
(3.2)
48
byrepeatapplying
P
(
X
;
Y
)=
P
(
X
j
Y
)
P
(
Y
)
.InMarkovchain,theprobabilityofeachsymbol
x
i
de-
pendsonlyontheprevioussymbol
x
i

1
,notonentireprevioussequence.Sothepreviousequation
canberewriteas
P
(
x
)=
P
(
x
L
j
x
L

1
)
P
(
x
L

1
j
x
L

2
)
:::
P
(
x
1
)
=
P
(
x
1
)
L
Õ
i
=
2
a
x
i

1
x
i
(3.3)
ThisisthegeneralequationfortheprobabilityofasequenceofanyMarkovchain.
3.3.2HiddenMarkovmodels
IntheMarkovchain,thereisaone-to-onecorrespondencebetweenthestatesandthesymbols.
Sincewecandirectlyobservethesymbolsinthesequences(e.g.,residuesinDNAsequences),we
canalsothestatefromtheobservation.However,inthehiddenMarkovmodel,onestate
mayemitmanydifferentsymbols.Sofromtheobservationofsymbols,wecannotthe
statefromourobservation.Inotherwords,thestateishiddennow,sowecallsuchmodelhidden
Markovmodel.
WefomulatethehiddenMarkovmodelasfollowing:Wecallthesequenceofthestates
path
,
p
.ThepathitselfisasimpleMarkovchain,theprobabilityofastateinthepathonlydependson
thepreviousstate.The
i
thstateinthepathiscalled
p
i
.Thetransitionprobabilityofthechainis:
a
kl
=
P
(
p
i
=
k
j
p
i

1
=
k
)
(3.4)
Becausewehavedecoupledthesymbols
b
fromthestates
k
,wemustneedanewsetofpa-
rameters,emissionprobabilities
e
k
(
b
)
,tomodeltheprobabilitythatsymbol
b
isobservedwhenin
49
state
k
:
e
k
(
b
)=
P
(
x
i
=
b
j
p
i
=
k
)
(3.5)
Wecanwritetheprobabilityofanobservedsequence
x
andastatesequence
p
as:
P
(
x
;
p
)=
a
0
p
1
L
Õ
i
=
1
e
p
i
(
x
i
)
a
p
i
p
i
+
1
(3.6)
wherewerequire
p
L
+
1
=
0.TheEquation3.6istheHMManalogueofEquation3.3.
InrealapplicationofHMM,therearethreefundamentalproblemswecareabout[69]:(1)
GivenanHMMmodelwithparameters
a
kl
and
e
k
(
b
)
,andanobservationsequence
x
,determine
thelikelihoodof
P
(
x
)
(
likelihood
problem);(2)GivenanHMMmodelwithparameters
a
kl
and
e
k
(
b
)
,andanobservationsequence
x
,decodethemostprobablestatepath
p

(
decoding
problem);
(3)Givenanobservationsequence
x
andstatesofHMM,learntheHMMparameters
a
kl
and
e
k
(
b
)
(
learning
problem).TheViterbialgorithm,Forwardalgorithm,andForward-Backwardalgorithm
weredevelopedtosolvethoseproblems,respectively.Wewillintroducethesealgorithmsfor
hiddenMarkovmodelinthefollowingsections.
3.3.3PrhiddenMarkovmodels(prHMMs)
HMMsisaparticularHMMdevelopedtomodelmultiplesequencealignments[82].Here
wewilluseHMMsdesignedforproteinfamilymultiplealignmentsasourexample.In
HMMs,foreachpositionofstate
i
inthepath
p
,therearethreepossiblestates,matchstate
M
i
,insertionstate
I
i
,anddeletionstate
D
i
(Figure3.2).Insertionstate
I
i
isusedtomatchinsertions
aftertheresiduematchingthe
i
thcolumnofthemultiplealignments.Therearetransitionsfrom
M
i
to
I
i
,alooptransitionfrom
I
i
toitselftoaccommodatemulti-residueinsertions,andatransition
50
Figure3.2:ThetransitionstructureofaHMM.Weusesqurestoindicatematchstates,
diamondstoindicateinsertionstates,andcirclestoindicatedeletionstates.
from
I
i
backto
M
i
.Deletionstate
D
i
isusedtohandlethejumptransitionbetweennon-neighboring
matchstateswithoutanyemission.Thesumofthecostsofantransition
M
to
D
followedbya
numberof
D
to
D
transitions,thena
D
to
M
transitionswillbethecostofadeletion.
3.3.3.1Viterbialgorithm
Viterbialgorithm[148]isthemostcommonalgorithmforthemostprobablestatepath:
p

=
argmax
p
P
(
x
;
p
)
(3.7)
Themostprobablepath
p
canbefoundrecursively.ForHMMs,let
V
M
j
(
i
)
bethemaximum
log-oddsscoreofaligningabestsub-sequence
x
1
;
2
;::;
i
totheHMMpathuptostate
j
,endingwith
x
i
beingemittedbystate
M
j
,
V
I
j
(
i
)
bethescoreofthebestpathendingwith
x
i
beingemittedby
51
state
I
j
similarly,and
V
D
j
(
i
)
forthebestpathendinginstate
D
j
.
q
x
i
istheemissionprobabilityof
astandardrandommodel(backgrounddistribution).ThentherecursiveViterbiequationscanbe
writeas:
V
M
j
(
i
)=
log
e
M
j
(
x
i
)
q
x
i
+
max
8
>
>
>
>
>
<
>
>
>
>
>
:
V
M
j

1
(
i

1
)+
log
a
M
j

1
;
M
j
V
I
j

1
(
i

1
)+
log
a
I
j

1
;
M
j
V
D
j

1
(
i

1
)+
log
a
D
j

1
;
M
j
V
I
j
(
x
i
)=
log
e
I
j
(
x
i
)
q
x
i
+
max
8
>
<
>
:
V
M
j
(
i

1
)+
log
a
M
j
;
I
j
V
I
j
(
i

1
)+
log
a
I
j
;
I
j
(3.8)
V
D
j
(
x
i
)=
max
8
>
<
>
:
V
M
j

1
(
i
)+
log
a
M
j

1
;
D
j
V
D
j

1
(
i
)+
log
a
D
j

1
;
D
j
Inacommonsituation,theemissionscorelog
e
I
j
(
x
i
)
q
x
i
for
V
I
j
(
x
i
)
willbecanceledbecausewe
assumetheemissiondistribution
e
I
j
(
x
i
)
fromtheinsertionstates
I
j
isthesameasthebackground
distribution
q
x
i
.ThewholeprocessofViterbialgorithmisdescribedinAlgorithm2.
Algorithm2
Viterbialgorithm
Input:
DNAsequence
x
withlength
L
,HMMwithlength
n
Output:
Mostprobablestatepath
p

withlog-oddsscorelog
P
(
x
;
p

)
1:
V
M
0
(
0
)=
0,
V
S
j
(
0
)=

inf
for0
<
j
and
S
2
{M,I,D}.
.
Initialization
2:
for
i
=
1to
L
do
.
Recursion
3:
V
S
j
(
i
)
 
Equation3.8
4:
ptr
i
(
j
)
 
argmax
S
j

1
(
V
S
j
(
i
))
5:
endfor
6:
log
P
(
x
;
p

)
 
V
M
n
(
L
)
.
Termination
7:
p

L
 
argmax
n

1
V
M
n
(
L
)
8:
for
i
=
L
to1
do
9:
p

i

1
 
ptr
i
(
p

i
)
.
Traceback
10:
endfor
52
3.3.3.2Forwardalgorithm
Toevaluatethelikelihoodofasequence
x
,weneedsumtheprobabilitiesofallpossiblestatepath
toobtaintheprobabilityof
x
:
P
(
x
)=
å
p
P
(
x
;
p
)
(3.9)
Adynamicprogrammingalgorithmcanbeusedtocalculatethisfullprobabilityrecursively.
Thisiscalledtheforwardalgorithm.
WeusedthesimilarnotationthatusedinViterbialgorithm.Theforwardlog-oddsscore
F
M
j
(
i
)
,
F
I
j
(
i
)
,and
F
D
j
(
i
)
arecorrespondingto
V
M
j
(
i
)
,
V
I
j
(
i
)
,and
V
D
j
(
i
)
.Sotherecurrenceequations
offorwardalgorithmare:
F
M
j
(
i
)=
log
e
M
j
(
x
i
)
q
x
i
+
log
[
a
M
j

1
;
M
j
exp
(
F
M
j

1
(
i

1
))
+
a
I
j

1
;
M
j
exp
(
F
I
j

1
(
i

1
))+
a
D
j

1
;
M
j
exp
(
F
D
j

1
(
i

1
))]
F
I
j
(
x
i
)=
log
e
I
j
(
x
i
)
q
x
i
+
log
[
a
M
j
;
I
j
exp
(
F
M
j
(
i

1
))
(3.10)
+
a
I
j
;
I
j
exp
(
F
I
j
(
i

1
))]
F
D
j
(
x
i
)=
log
[
a
M
j

1
;
D
j
exp
(
F
M
j

1
(
i
))+
a
D
j

1
;
D
j
exp
(
F
D
j

1
(
i
))]
Theinitializationoftheforwardalgorithmrequiresthat
F
M
0
(
0
)=
0,whichissimilartoViterbi
algorithm.Sinceweonlyneedtheprobabilitiesof
x
sothereisnotracebackintheforwardalgo-
rithm.
53
3.3.3.3FrommultiplesequencealignmenttoprHMM
Usually,wewanttobuildHMMsfromamultiplesequencealignmentwithmultiplecolumns.
Weneedtodeterminewhichmultiplealignmentcolumnstoassigntomatchstates,andwhich
toassigninsertionstates(Figure3.3).Weoftenusetheheuristicruletoignorethecolumnfor
whichthefractionofgapcharactersisgreaterthanorequaltoacolumnremovalthreshold
q
[29].
Usually,thecolumnremovalthreshold
q
equalsto0
:
5.Afterthis,wecanconstructHMMs
asallthestatesaredetermined.
Figure3.3:Tencolumnsfromthemultiplesequencealignmentsofsevenglobinproteins.The
starredcolumnsareonesthatwillbetreated`matches'intheHMM.Reprintedfrom[35].
Thenweneedtoestimatetheprobabilityparameters.Frommultiplesequencealignment,we
canaligneachoftherowtotheHMM.Wecandirectlyestimatetheparametersfromthe
alignments.Bycountingupthenumberoftimeseachtransitionoremissionisused,theprobability
parametersareassignedby:
a
kl
=
A
kl
å
l
0
A
kl
0
and
e
k
(
a
)=
E
k
(
a
)
å
a
0
E
k
(
a
)
(3.11)
where
k
and
l
areindicesoverstates,and
a
kl
and
e
k
arethetransitionandemissionprobabilities,
and
A
kl
and
E
k
arethecorrespondingcounts.
54
Inmanysituations,weonlyhavealimitednumberofsequencesinthetrainingalignments.
Themajorchallengeshereisthatsometransitionsandemissionsthatneverseeninthetraining
alignmentwillbeassignedzeroprobability.Tosolvetheproblem,onecanaddpseudocountsto
theobservedcounts.ThemoststraightforwardLaplace'sruleistoaddonetoeachcountweusedin
Equation3.11.AbettersolutionisusingamixtureofDirichletdistributionsastheprior.Readers
areencouragedtolearnmoredetailsinChapter5of[35].
3.3.3.4Forward-Backwardalgorithm
Whenthepathisunknownforthetrainingprocess,wearenolongerabletoestimatetheparameter
valueandmustiterativeproceduretolearntheparameters.Forward-backwardalgorithmisthe
standardalgorithmforleaningtheparametersofHMM.Itisintroducedfrom[9]byLeonardBaum,
soitisalsocalledBaum-Welchalgorithm.ItisaspecialcaseoftheExpectation-Maximization
(EM)algorithm.Thealgorithmestimatedtheprobabilities,thenderivedabetterestimates.
Theestimationisimprovediterativelyuntilsomestoppingcriterionisreached.
Wealreadycalculatedforwardprobabilityusingforwardalgorithm.TotrainHMM,wealso
needtocalculatetheposteriorprobabilitythatobservation
x
i
camefromstate
k
giventheobserved
sequence
P
(
p
i
=
k
j
x
)
usingbackwardalgorithm.Wecalculatedtheprobabilityofproducing
theentireobservedsequencewiththe
i
thsymbolbeingproducedbystatek:
P
(
x
;
p
=
k
)=
P
(
x
1
:::
x
i
;
p
=
k
)
P
(
x
i
+
1
:::
x
L
j
x
1
:::
x
i
;
p
=
k
)
=
P
(
x
1
:::
x
i
;
p
=
k
)
P
(
x
i
+
1
:::
x
L
j
p
=
k
)
=
f
k
(
i
)
b
k
(
i
)
(3.12)
55
bk(i)=
P(xi+1:::
xLjp=k)(3.13)bk(i)canbeobtainedbyabackwardrecursionstartfromendofthesequence.
Theposteriorprobabilitiescanbecalculatedbystraightforwardconditioning,
P(p=kjx)=
fk(i)bk(i)P(X)(3.14)whereP(x)istheresultofforwardorbackwardalgorithm.
Thenwecancalculatetheexpectednumberoftimesoftransitionsandemission,
Akl
andEk,usingtheforwardprobabilitiesandbackwardprobabilities.Thedetailedderivationcanbefound
fromtextbooks[29,35,69].Theexpectednumberoftimesthat
akl
isusedisgivenby:
Akl
=åj1P(xj)åifjk(i)akl
el(xji+1)bjl(i+1)(3.15)wherefjk(i)istheforwardvariable
fk(i)calculatedforsequence
j,and
bjl(i)isthebackward
variablecorrespondingly.Similarly,wehavetheequationforexpectednumberoftimesthat
bappearsinstate
k:Ek(b)=
åj1P(xj)åfijxji=bgfjk(i)bjk(i)(3.16)wherethesecondsumcalculateoverthepositions
iforwhichthesymbolemittedis
b.ThewholeprocessofBaum-Welchalgorithmlikethis:
56Algorithm3
Baum-Welchalgorithm
1:
Initialize
a
lk
and
e
k
.
2:
while
D
>
threshold
Q
do
3:
Setallthe
A
and
E
totheirpseudocountvalues.
4:
for
eachsequence
j
=
1to
n
do
5:
for
i
=
1
;:::;
L
do
6:
f
k
(
i
)=
e
k
(
x
i
)
å
h
f
h
(
i

1
)
a
ik
7:
endfor
.
forwardalgorithm
8:
for
i
=
L

1
;:::;
1
do
9:
b
k
(
i
)=
å
l
a
kl
e
l
(
x
i
+
1
)
b
l
(
i
+
1
)
10:
endfor
.
backwardalgorithm
11:
Addthecontributionofsequence
j
to
A
and
E
inEquation3.15and3.16
12:
endfor
13:
Calculatenew
a
lk
and
e
k
usingEquation3.11
14:
Calculatethenewloglikelihoodofthemodel
15:
D
 
changeoftheloglikelihood
16:
endwhile
3.4Methods
Frame-ProisdesignedtoimprovehomologysearchperformanceforPBdata.Itincorpo-
ratesconsensus-basederrorcorrectionandaViterbialgorithmforoptimalalign-
ment.WhileastandardViterbialgorithmalignsasinglesequencetoahiddenMarkovmodel
(HMM),ouralgorithmalignsanalignmentgraphtoanHMM.Inaddition,weanewscor-
ingsystemthatcombinesthepathscorefromthealignmentgraphandthealignmentscoreinthe
HMM.Weintroducetheconstructionofthealignmentgraph.
3.4.1Generatesequencealignmentgraph
FollowingHGAP,weconstructasequencealignmentgraph(SAG)fromPBdata.Thedetails
ofthegraphconstructioncanbefoundinthesupplementarymaterialofChinet.al'swork[24].
Herewesimplysummarizethemajorsteps.Readswithalengthlongerthanachosencut-offare
selectedasseedsequences.OtherreadsarethenalignedtotheseedsequencesbyBLASR[21]for
57
Figure3.4:BuildanexampleSAGfromanalignment.Toppanel:multiplesequencealignmentof
sixsequences.Thetopsequenceistheseedsequence.Bottompanel:sequencealignmentgraph.
Fornode
x
0
,ifwetracebackfortwomoreedges,wecanidentifythreecodonsendingwith
x
0
:
TCA,ATA,andGCA.Eachpathhasitsnodes
x
(
1
)
0
and
x
(
2
)
0
marked.Theconsensuspath
ofthispartofgraphisATCA.
58
constructionofanalignmentgraph(seetoppanelofFigure3.4).Ineachalignedcolumn,different
basesaremodeledasnodesforthecorrespondingpositioninanSAG.Consecutivebasesare
connectedbyanedgeinthegraph.Theedgeweightrepresentsthenumberofalignedsequences
thatgothroughthisedge.
Theconsensussequencefromthealignmentgraphrepresentsthemostreliablesequence.Fig-
ure3.4showsanexampleSAGandtheconsensussequence.Usually,whentherearesuf
readsaligningtotheseedsequences,theconsensussequencecanbereliablyusedfordownstream
analysisincludingconductinghomologysearch.However,whenthecoverageisnotsufthe
chosenconsensussequencedoesnotshowhigherscorethanalternativesequences.
Thus,itisdiftoextractapaththatisclosesttothereferencesequence.
Ouralgorithmismuchlesssensitivetosequencingcoverageasitusescharacterizedprotein
familiesasreferencetochooseoptimalpathinthealignmentgraph.Essentially,italignsanalign-
mentgraphwithaHMM,whichrepresentsaproteinfamily.Duringthealignment,the
algorithmchoosesasequencepathinSAGandastatepathintheHMMinordertomaximizethe
combinedcoveragescoreandalignmentscore.BelowwedescribetheViterbialgorithm.
3.4.2ViterbialgorithmforaligninganalignmentgraphwithaprHMM
Let
p
beastatepathinaHMMModel
M
.Let
p
0
beasequencepathinanSAG
G
.Ourgoal
istosearchfortheoptimalpathpair
(
p

;
p
0

)
suchthat
(
p

;
p
0

)=
argmax
(
a
S
M
(
p
;
p
0
)+
b
S
G
(
p
0
))
,
where
S
M
(
p
;
p
0
)
isthealignmentscorebetweenasequencein
G
andtheHMM.
a
and
b
aretheweightsoftheHMMscoreandthecoveragescorein
G
,respectively.Intuitivelythisalgo-
rithmsearchesforanoptimalalignmentbetweenaDNAsequenceinthealignmentgraph
G
and
aHMM
M
bysimultaneouslyconsidering1)theprobabilityofaHMMalignment,
representedbylog-oddsscore
S
M
(
p
;
p
0
)
,and2)pathweightin
G
,representedby
S
G
(
p
0
)
.Tosolve
59
theaboveequation,wedevelopedthefollowingdynamicprogrammingalgorithm,whichisan
augmentedViterbialgorithm[35].Therecursiveequationscanbeextendedtoforwardalgorithm
andalsoposteriorprobabilitycalculationforanHMM.
Input:
aSAG
G
generatedbyalignmentsbetweenalongDNAseedsequence
x
seed
andmulti-
pleshortersequences,aHMM
M
.Notationsof
M
willbedescribedbelow.
Output:
theerror-correctedDNAsequence
x
0
seed
anditsoptimalalignmentwith
M
.
Algorithm:
wenotationsthatwillbeusedintherecursiveequations.

NotationsoftheprHMMmodel
M
:
ThedetaileddescriptionsofaHMM
modelcanbefoundintheliterature[35,36].AHMMmodel
M
consistsofmatch
states
M
j
,deletionstates
D
j
,andinsertionstates
I
j
foreachposition
j
,whichistheindex
ofaconservedcolumninamultiplesequencealignment.
a
s
i
s
j
isthetransitionprobability
fromstate
s
i
to
s
j
.AstheHMMisconstructedbyalignedproteinsequenceswhilethe
graphmodelisconstructedbyalignedDNAsequences,weneedtotranslatethesequences
in
G
intoaminoacids.Here,let
T
(
x
i

2
x
i

1
x
i
)
betheaminoacidtranslatedfromacodon
x
i

2
x
i

1
x
i
.
e
s
(
T
)
istheemissionprobabilityforstate
s
toemit
T
.Comparedtothetopology
ofPlan7modelusedbyHMMER[38],oneofthemajorchangeswemadeisthatNandC
areresponsibleforemittingallDNAbasesthatareoutsideoftheproteindomain.

NotationsoftheSAG:
Let
x
i
representthe
i
thnodeinthetopologicallysortedlistofthe
graph.AsthealignmentgraphisconstructedinDNAspace,
x
i
alsorepresentsabasefrom
thealignedsequences.
x
(
k
)
i
representsanode,fromwhichtonode
x
i
thereexistsapath
consistingof
k
edges.Accordingtothistion,when
k
=
1,thereisanedgefrom
x
(
1
)
i
to
x
i
.When
k
=
2,acodonisformedbythreebases:
x
(
2
)
i
,
x
(
1
)
i
,and
x
i
.Forexample,in
Figure3.4,bothnodeslabeledwithTandCare
x
(
1
)
0
.
S
G
(
p
0
)
representsthepathscorefor
60
p
0
,whichcanbeasequenceofanylengthin
G
.Wewillthepathscorefollowingthe
equations.

Subproblemsandtherecursiveequations:
Forasequencepath
p
0
endingatnode
x
i
in
G
,
andastatepath
p
endingwithindex
j
in
M
,thedynamicprogrammingalgorithmintends
tomaximizethecombinedpathscoreandalignmentscore:
a
S
G
(
p
0
1
::
i
)+
b
S
M
(
p
0
1
::
i
;
p
1
::
j
)
.
Dependingontheendingstates,weneedtoconsidermultiplecases.
V
M
j
(
x
i
)
ismaximumscoreofaligningasequencepathendingwith
x
i
in
G
totheHMMup
tostate
M
j
,undertheconstraintthattheanimoacidtranslatedbythelastthreebases
x
i
,
x
(
1
)
i
,and
x
(
2
)
i
in
G
isemittedbymatchstate
M
j
.Notethattherecanbemultiplesequence
pathsin
G
endingwith
x
i
.Sothelastthreebasesofanysuchsequencepathcanbegenerally
representedby
x
(
2
)
i
x
(
1
)
i
x
i
.Andthetranslatedaminoacidisthus
T
(
x
(
2
)
i
x
(
1
)
i
x
i
)
.
V
I
j
(
x
i
)
isthemaximumscoreofaligningasequencepathendingwith
x
i
in
G
totheHMM
uptostate
V
j
,undertheconstraintthattheanimoacidtranslatedbythelastthreebases
x
i
,
x
(
1
)
i
,and
x
(
2
)
i
in
G
isemittedbyinsertionstate
I
j
.
V
D
j
(
x
i
)
isthemaximumscoreofaligningasequencepathendingwith
x
i
in
G
totheHMM
uptostate
D
j
.
V
N
(
x
i
)
isthemaximumscoreofaligningasequencepathendingwith
x
i
in
G
totheHMMuptostate
N
,given
x
i
beingemittedbythespecialstate
N
.Similarly,
V
C
(
x
i
)
is
themaximumscoreofaligningasequencepathendingwith
x
i
in
G
totheHMMuptostate
C
,given
x
i
beingemittedbythespecialstate
C
.
Forbrevityofpresentation,
S
inthefollowingequationsrepresents
S
(
x
(
3
)
i
x
(
2
)
i
x
(
1
)
i
x
i
)
,which
isthepathscorefromthepossiblethirdbaseoftheprecedingcodontothecurrentcodon
endingwith
x
i
.Notethatthepathscoreisnotasimplesummationofedgeweightsasina
standardgraph.Wewillthepathscoreaftertherecursiveequations.
T
istheabbrevi-
61
ationof
T
(
x
(
2
)
i
x
(
1
)
i
x
i
)
infollowingequations.
V
M
j
(
x
i
)=
max
8
>
>
>
>
>
>
>
>
>
<
>
>
>
>
>
>
>
>
>
:
V
M
j

1
(
x
i
(
3
)
)+
a
e
M
j
(
T
)+
a
log
a
M
j

1
;
M
j
+
b
S
V
I
j

1
(
x
i
(
3
)
)+
a
e
M
j
(
T
)+
a
log
a
I
j

1
;
M
j
+
b
S
V
D
j

1
(
x
i
(
3
)
)+
a
e
M
j
(
T
)+
a
log
a
D
j

1
;
M
j
+
b
S
V
N
(
x
i
(
3
)
)+
a
e
M
j
(
T
)+
b
S
V
I
j
(
x
i
)=
max
8
>
<
>
:
V
M
j
(
x
i
(
3
)
)+
a
e
I
j
(
T
)+
a
log
a
M
j
;
I
j
+
b
S
V
I
j
(
x
i
(
3
)
)+
a
e
I
j
(
T
)+
a
log
a
I
j
;
I
j
+
b
S
V
D
j
(
x
i
)=
max
8
>
<
>
:
V
M
j

1
(
x
i
)+
a
log
a
M
j

1
;
D
j
V
D
j

1
(
x
i
)+
a
log
a
D
j

1
;
D
j
V
N
(
x
i
)=
max
ˆ
V
N
(
x
(
1
)
i
)+
b
S
x
i
(
1
)
x
i
V
C
(
x
i
)=
max
8
>
<
>
:
V
C
(
x
(
1
)
i
)+
b
S
x
i
(
1
)
x
i
V
M
(
x
(
1
)
i
)+
b
S
x
i
(
1
)
x
i

Calculatepathscores
S
:
wefollowedthesameequationinHGAPpaper[24]tocalculate
apathscore,whichisthesumofthescoresofallnodesinthepath.InaSAG
G
,thelocal
coverageforapositioninthealignedgraphisbythenumberofreadsalignedto
thatposition.Forexample,thelocalcoverageis5forallpositionsinFigure3.4asthereare
5reads(excludingseedsequence)alignedtotheseedsequence.Foranode,ifoneofthe
incomingedge'sweightismorethanhalfofthelocalcoverageatthatposition,thenodegets
apositivescore.Otherwise,wewillassignanegativescore.Thepathscoreisthesumofthe
scoresofnodesinthepath.ThedetailedpseudocodecanbefoundinHGAP.

CombineHMMscoreandpathscore:
userscanchoosetheweightsofthepathscorein
62
graph
G
andtheHMMscorein
M
.Ideally,
a
and
b
shouldbeadjustedbasedonthelocal
coverage.Forhighlocalcoverage,wecanassignbigger
a
than
b
.Bydefault,weusethe
sameweight,whichisthechosenparametersforalloftheexperimentsinthiswork.

Runningtimeanalysis
Thetimecomplexityofouralgorithmis
O
(
d
j
N
jj
M
j
)
,where
j
N
j
isthenumberofnodesinthe
G
and
j
M
j
isthenumberofstatesin
M
.
d
istheaverage
numberofpossiblepathsthatcontainacodonendingwithanode.Basedonourtest,for
20Xcoverage,
d
isabout4to6pernode.Wealsofoundathighlocalcoveragearea,the
edgewithweight1canbeignoredastheprobabilitytoincludethatedgeintheoptimalpath
isextremelylow.Thus,weremovedthoseedgestofurtherspeedupthealgorithm.After
thispruning,
d
canbeaslowasnear1pernodeonaverage.
3.4.3Filtrationstageforremovingunrelevantproteindomainfamilies
Duringhomologysearch,wealigntheconstructedalignmentgraphsfromPBdatawithallcharac-
terizeddomainsinPfam.Butforanygivenspeciesoracommunity,notalldomainsarerelevant.In
ordertoreducethesearchspace,weapplyastagetoremovedomainsthatarenotrelevant
tothegivendata.Inpractice,weapplyHMMERtoallconsensussequencesextractedfromthe
alignmentgraphwithaverylooseE-valuecutoff(1000isusedinallexperiments).Onlydomains
thatyieldatleastpartialalignmentswillbeusedasinputtoourdynamicprogramming.Otherdo-
mainswithoutanyhitwillbediscardedbecauseitisunlikelythattheywillbetruedomainsinthis
dataset.Eachconsensussequencemightbealignedtomultipledomains,forregionsthatcannot
bealignedtoanydomain,wealsoremovethemfromnextstage.Thus,afterthetrimmed
alignmentgraphandthecorrespondingdomainwillbeusedasinputtoourtool.Notethatthisdo-
mainmayjustincuraverysmallscoreandE-value.Itwillbere-alignedusingour
63
algorithmandthealignmentscorewilldecidewhetheritisatruedomain.Inourexperiments,
werefertotheinputassequence-domainpair.Inallofourexperiments,truedomainsfoundby
correctedsequencesusingsuggestedcutoffsisalwayslessthantheinputsequence-domainpairs.
3.5Experimentalresults
Ourmethodisusedtoimprovethesensitivityofhomologysearchbycorrectinginsertion
ordeletionerrorsinPBsequencingdata.Thus,wewillfocusonevaluatingtheperformance
ofourimplementationinbotherrorcorrectionandhomologysearch.WeappliedFrame-Proto
threedatasets:asimulated
E.coli
PBRSsequencingdataset,areal
M.ruber
PBRSsequencing
dataset,andaHumanarmPBRSIImetagenomicsequencingdataset.Thethreedatasetsenable
ustoevaluatetheperformanceoferrorcorrectionfordatawithdifferentsequencingcoverage.
Asboththegenomesandtheirproteindomainannotationsofthetwodatasetsareavailable
atNCBIandPfam,weareabletoquantifytheaccuracyoferrorcorrectionandHMM
search.,weusedBLASTtoevaluateitserrorcorrectionperformancebyaligning
correctedsequencestoreferencesequence.Wealsotheperformanceofproteindomain
annotationbyapplyingHMMER3tocorrectedsequencesandthereferencegenomes.
WebenchmarkedourmethodwiththeerrorcorrectionstageDAG-ConinHGAP[24].The
errorcorrectioninHGAPandourmethoddonotrelyonshortsequencesgeneratedbyanother
platform.AndtheerrorcorrectioninHGAPhasbeenextensivelytestedandhassatisfactoryper-
formanceforsequencingdatawithreasonablecoverage.DAG-Cononlyoutputscorrectedse-
quences.Inordertoevaluatetheperformanceofhomologysearch,weapplyHMMERto
thecorrectedsequencesofDAG-Con.AlthoughFrame-Prooutputsbothcorrectedsequencesand
thealignments,wererunthecorrectedsequencesagainstinputdomainsusingHMMERto
64
ensureafaircomparisonbythesamealignmenttool.
Forourexperiments,alldetailedcommands,parametersandoutputcanbefoundalongwith
thesourcecodeofFrame-Pro.Toachieveafaircomparisonondataoflowcoverage,wesetthe
coveragethresholdofDAG-Conas1tomakesuretheoutputsarecomparable.
3.5.1Simulated
Escherichiacoli
sequences
InordertoevaluatetheaccuracyofFrame-Proindetectingandcorrectinginsertionanddeletion
errors,wegeneratedasimulated
E.coli
K-12MG1655(NCBItax.ID511145)sequencedataset
byPBSIM[114].Weusedthereferencegenomesequence(NC_000913.3,genomesize4,641,652
basepairs,[60])generatedbySangersequencingasinputforPBSIM.Thequalityinformationand
thesequencinglengthdistributionfromrealPBRSsequencingaftersecondaryanalysis[115]were
usedasthesimulationparameters.PBSIMgenerated34,898sequenceswith92,810,130basepairs
withaveragesequencingcoverageof20X.Theaveragelengthofreadsis2,660bpandthereads'
averageerrorrateis14.42%.
Inthedataset,3,280sequencesseedcriterion.Tocontrolthescaleofexperiment,we
randomlyselect451seedsequencesforgraphconstruction.Aftergraphconstructionandprotein
domain7,093sequence-domainpairswerekeptasinputtoourprogram.
3.5.1.1Performanceoferrorcorrection
BothFrame-ProandDAG-Conproduced7,093correctedsequencesfromtheinputPBsimulation
data.Weevaluatedtheperformanceoferrorcorrectionofbothtoolsbycomparingtheir
correctedsequencestothereferencesequences.ThecomparisonisconductedusingBLAST[4].
Figure3.5summarizedthecomparisonofbothtoolsoneachread.Forthisdataset,ourtoolcan
correctmoreerrorsinabouthalfofthereadswhileDAG-Concorrectsmoreerrorsinlessthan
65
Figure3.5:ComparisonoferrorcorrectionperformanceforFrame-ProandDAG-Coninthesim-
ulated
E.coli
datasetand
M.ruber
dataset.Frame-ProcorrectsmoreerrorsinlargerfractionofPB
reads.
66
Figure3.6:Histogramofederrorsaftererrorcorrectioninthesimulated
E.coli
dataset.X-
axisrepresentsthenumberofremainingerrorsineachread.Y-axisisthecorrespondingnumbers
ofreads.Binwidthis1andtheonlyincludedthe40bins(i.e.upto40errors)dueto
spacelimitation.
67
5%ofreads.Intotal,ourmethodcorrects7,884moreerrorsthanDAG-Con.Ifweonlycount
insertionanddeletionerrors,ourmethodcorrects8,374moreerrorsthanDAG-Con.DAG-Con
corrects490moremismatchesthanourmethod.ItisexpectedbecausetheHMMsearchis
muchmoresensitivetoframeshiftscausedbygapsastheywillimpactthealignment
lengthandscore.Figure3.6comparesthenumberofremainingerrorsincorrectedreadsproduced
bybothtools.Itisclearthatourmethodcanproducemorereadswithnoerrororjust1error.
3.5.1.2PerformanceofprHMMsearch
Oneofourmajorgoalsistoimproveperformanceofhomologysearch.Thissectionwill
focusonevaluatingtheperformanceofproteindomainhomologysearchofcorrectedsequences.
Aftercorrectingframeshifts,weexpectthatthehomologysearchprogramcangeneratelonger
alignmentswithhigherscoresandsmallerE-valuesforproteindomainsofinterest.Sotheusers
candistinguishtruedomainsfromrandomalignmentswithhigher
HMMER3.1b2wasusedtogenerateHMMalignmentsfromcorrectedsequences.The
domaincompositionof
Escherichiacoli
K12(NCBItax.ID83333)proteomefromPfam(Re-
lease29.0,[45])wasusedasthereference.2,347proteinHMMdomainswerefoundin
7,093outputsequences.Somesequenceshavemultiplehits.Forallthedata,includingthePB
simulationdata,andthecorrectedsequencesbybothtools,weonlykeepthebestalignmentfor
eachdomaininourcomparison.Thechangesofalignments'E-values,alignmentlengths,andbit
scoresduetoerrorcorrectionarepresentedinFigure3.7.Onaverage,thelengthofthedomain
alignmentincreasesfromDAG-Con's108.51aminoacids(a.a)to150.36a.a,whichiscloserto
theaveragealignmentlengthof163.62a.ainthereferenceproteomefromPfam.Atwo-sample
Kolmogorov-Smirnovtest(K-Stest)onthealignments'lengthsandE-valuesfromourmethod
andDAG-Conwasappliedtoexaminethestatisticalofdifferencefromtwomethods.
68
Figure3.7:Thecomparisonofalignments'bitscores(A),lengthsina.a.(B),andE-values(C)for
referencesequencesandcorrectedsequencesproducedbyFrame-ProandDAG-Coninthe
E.coli
simulateddataset.X-axisrepresentsthedomains.Alldomainsaresortedbythereferencevalues
fromPfam.RedcirclesrepresentthevaluesofHMMalignmentsforcorrectedsequencesoutputby
Frame-Pro.BluecirclesrepresentthevaluesofHMMalignmentsforcorrectedsequencesoutput
byDAG-Con.Astherearemanydatapoints,allnumbersproducedbyonetoolareprocessedby
SavityzkyGolaytogenerateasmoothedcurveforclarityofpresentation.
69
Thep-valuesforthealignments'length,E-value,andbitscoredistributionswere5
:
4436

10

25
,
1
:
3237

10

25
,and1
:
8881

10

25
,respectively.Thus,theimprovementbyapplyingourmethod
isstatistically
3.5.2
Meiothermusruber
DSM1279sequences
Thesimulateddataenablesustothoroughlytesttheparametersofourtoolandalsotoevaluate
variousaspectsoftheperformance.Forthenexttwoexperiments,weapplyourtooltorealPB
datawithdifferentcoverage.Astherealsequencingprojects,inparticularthetranscriptomic
sequencingprojectsandmetagenomicsequencingprojects,cancontaintranscriptsorgenomes
withheterogeneouscoverage,itisthusimportanttoevaluatetoolsfordataofdifferentcoverage.
Inthisexperiment,wetestedFrame-Proonreal
Meiothermusruber
DSM1279PBSequencing
data.
Meiothermusruber
(NCBItax.ID504728)isusuallyfoundinhotsprings[141].Ithas
genomesizeof3,098,881bps.Therawsequencingdatafrom1SMRTcellinHGAP[24]was
usedforthisexperiment.Therawdataintotalcontains177.4Mbpswith59.6%GCcontentwith
approximatecoverageof60X.
StandardSMRTdataprocessingpipeline[117]wasusedtotherawsequencingdatato
generatesubreads,whichcontainsequencespassingthecommonlyusedlengthandquality
criteria.Thedatasethas90,114,302bpsand36,180sequenceswith30Xcoverage.After
proteindomainfamilies33,911sequencedomainpairspassedthethresholdandwere
inputtodownstreamerrorcorrectionpipelines.
70
3.5.2.1Evaluateerrorcorrectionperformancebyaligningoutputsagainstthereference
genome
TocomparetheerrorcorrectionperformanceofFrame-ProandDAG-CononthisrealPBsequenc-
ingdataset,weusedBLASTtosearchalloutputsagainstthereferencegenomeof
Meiothermus
ruber
(NC_021081.1).ThecomparisonoferrorcorrectionforeachreadissummarizedinFig-
ure3.5.Intotal,ourmethodcorrects14,613moreerrorsthanDAG-Con.Ifweonlycountinsertion
anddeletionerrors,ourmethodcorrected15,924moreerrorsthanDAG-Con.DAG-Concorrected
1,311moremismatchesthanourmethod.
Comparingtotheprevioussimulationexperiment,thedifferenceoftheerrorcorrectionperfor-
mancebetweenFrame-ProandDAG-Condecreased.Themainreasonisthattheperformanceof
DAG-Conusuallyimproveswithincreasedcoverage(20Xto30X).
3.5.2.2Evaluatetheperformanceofprhomologysearch
ThecorrectedsequencesfromFrame-ProandtheconsensussequencesfromDAG-Conweresearched
againstproteindomainsinPfambyHMMER3.1b2.Thehighercoverageofthisdatasetdidim-
provetheperformanceofDAG-Con.
For2,984domainsbybothtools,wecomparedthebesthitforeachofthemfrom
twotools.TheannotateddomainsfromthereferenceproteomeweredownloadedfromPfam
andwereusedasthereference.Thechangesofalignments'bitscoresduetoerrorcorrection
arepresentedinFigure3.8.Noimprovementcanbeobservedforbitscores.Similar
observationsweremadeforalignments'lengthsandE-values.Thus,theothertwowere
omitted.Comparedwiththeaveragealignmentlengthof148.68a.afromDAG-Con'sconsensus
sequence,theaveragelengthof157.92a.abyFrame-Proismuchclosetothereference's158.69
71
Figure3.8:Thecomparisonofalignments'bitscoresforreferencesequencesandcorrectedse-
quencesproducedbyFrame-ProandDAG-Coninthe
M.ruber
dataset.X-axisrepresentsthe
domains.AlldomainsaresortedbythereferencebitscoresfromPfam.Redcirclesrepresentthe
valuesofHMMalignmentsforcorrectedsequencesoutputbyFrame-Pro.Bluecirclesrepresent
thevaluesofHMMalignmentsforcorrectedsequencesoutputbyDAG-Con.Astherearemany
datapoints,allnumbersproducedbyonetoolareprocessedbySavityzkyGolaytogeneratea
smoothedcurveforclarityofpresentation.
72
a.a.Weconductedatwo-sampleK-Stestonthealignments'lengths,E-values,andbitscores
fromourmethodandDAG-Con.Thep-valuesforalignments'lengths,E-values,andbitscores'
distributionswere0.2859,0.2321,and0.2007,respectively.AlthoughFrame-Prostillgenerated
longeralignmentswithbetterscores,DAG-Conachievedsatisfactoryperformanceofhomology
searchforthisdatasetbecauseofthesufcoverage.
3.5.3Humanarmmetagenomicdataset
WhenPBisappliedtometagenomicsequencing,onechallengeisthatsomedatasetsdonothave
enoughcoverageforeffectivedownstreamanalysis.Here,weappliedFrame-Protoanalyzethe
proteindomaincompositioninthehumanskinmetagenomicdata,whichweresequencedfromthe
humanarmandfoot[144].
ThehumanarmsamplewassequencedbylinearPBRSIITdT(terminaldeoxynucleotidyl
transferase)sequencingplatform.SequencescanbemappedwithCHM1humangenomewere
removedashosthuman-derivedDNA.ThisdatasetcannotbefurtherassembledbytheHGAP
pipelineduetotheinsufcoverage,providingchallengestodownstreamanalysisincluding
proteindomainThusinthisexperiment,wefocusonthearmdataset,whichin-
cludes16,388sequenceswith2,662,7191bps.
3.5.3.1GenerateSAGandprHMMdomain
Comparedwithprevioustwodatasets,theaveragereadlengthofthehumanarmmetagenomics
datasetisonly1,629bps.Togeneratesufseedreads,thecut-offwaschangedto3,000bps.
AlthoughFrame-ProisnotassensitiveasDAG-Contothesequences'coverage,thesteps
wereaffectedasweuseHMMERtosearchtheconsensusagainstPfamwithE-value1000.After
602sequencedomainpairswerekeptforfurtheranalysis.Foreachsequencedomain
73
pair,acorrectedsequenceusingFrame-ProandaconsensussequenceusingDAG-Con,bothfrom
thesamegraph,weregenerated.
Figure3.9:Thecomparisonofalignments'bitscores(A),lengthsina.a.(B),andE-values(C)
for48domainscommonlybyFrame-ProandDAG-Coninthearmdataset.X-axis
representsthedomains.
74
3.5.3.2ProteindomainsearchusingGAcutoff
Unlikepreviousdatasets,wedon'tknowallthecompositespeciesinthearmsample.Thuswe
cannotobtainallthegroundtruthproteindomainsinthisdataset,makingthecomparisonwith
thereferencedomainsdifInaddition,withoutcompletereferencegenomesofallspecies,
wedon'thavetheactualsequencescorrespondingtothesereadsandthuswecannotevaluatethe
numberofederrors.Inordertoexaminetheresults,wethususeastringentthresholdto
examinethedomainsetsforcorrectedsequencesoutputbyFrame-proandDAG-Con.
Weusethegathering(GA)cutoffsfromPfamasthethresholdfordomaincompositionanalysis
becauseGAcutoffsarefamilybitscorethresholdsaimingtominimizefalsepositiverate
andtomaximizethecoverage[44].InFrame-Pro'sresults,84domainsareabovetheGAcutoff,
comparingto49domainsinDAG-Con'sresults.Frame-ProonlymissedonedomaininDAG-
Con'sresultwithscoreslightlylessthanthecorrespondingGAcutoff.AccordingtoPfam,all
domainsuniquelyfoundbyFrame-Prowereannotatedinmicrobialspecies.Thelistofdomains
canbefoundinourwebsite.
Forthecommonly48domains,wecomparedthealignments'bitscores,alignment
length,andE-valuesoncorrectedsequencesoutputbytwotools(SeeFigure3.9).Weconducted
atwo-sampleK-Stestonthethreemetricsforallalignments.Thequalityofalignmentimprove
.Thep-valuesforthethealignments'length,E-value,andbitscoredistributionswere
6
:
05

10

6
,3
:
66

10

12
,and7
:
64

10

13
,respectively.
Withoutreferencesequences,therecouldbeonepossibilitythatFrame-Proover-correctsthe
errorsinordertomaximizethealignmentscore.Inordertotestthispossibility,weexamined
thedomainsforonereferencespecies:
Corynebacteriumaurimucosum
.Althoughthe
completecompositespeciesofthearmsampleisunknown,thephylogeneticanalysisin[144]
75
andotherarticles[47,142]showedthat
Corynebacteriumaurimucosum
isrelativelyabundantin
thissample.Inaddition,thisspecieshasannotateddomains(NCBItax.ID548476)inthePfam
database.Thus,wefocusonevaluatinghowmanyoftheannotateddomainsinthisspeciescan
bebytestedmethods.ByusingGAcutoff,Frame-Procanrecover41domainswhile
DAG-Concanrecover24domains.ThesetbyDAG-Conisasubsetofours.Thus,this
experimentshowsthatextradomainsbyusarenotlikelyfalsepredictions.Thisexper-
imentaddsevidencethatFrame-Procanidentifymoredomainsfordatawithverylowcoverage
andthusprovidescomplementaryanalysisformetagenomicdata.
3.6Conclusionanddiscussion
Inthiswork,wedevelopedahomologysearchtoolforPBsequencingdata.Bycorrect-
inginsertionordeletionerrors,ourimplementationcanimprovehomologysearchperformance,
includingalignmentscores,lengths,andE-values.Inparticular,forsequencingdatawithlow
sequencingcoverage(aroundorlowerthan20X),ourtoolcancorrectmoreerrors
andimprovethesensitivityofhomologysearchbymorecorrectdomains.Beingableto
conductsensitivehomologysearchforsequencingdataoflowcoverageisimportantforvarious
sequencingprojectsincludingmetagenomicandtranscriptomicsequencing.Usually,asthetran-
scriptsorspecieshavehighlydifferentabundanceandthusheterogeneouscoverage,conducting
homologysearchneedstoconsidertheinputofafullspectrumofcoverage.Manyrarespeciesin
anenvironmentalcommunityorraretranscriptsareparticularlyinterestingforbiologicaldiscov-
ery.
Althoughourcurrentimplementationisbasedonhomologysearch,whichcompares
sequenceswithHMMs.Ourmethodcanbeeasilyextendedtopairwisealignmentaswell.
76
Inthatcase,theHMMwillbereplacedasinglesequence.Existingpairwisealignment
algorithmcanbeextendedtoalignanalignmentgraphwithasinglesequence.
LiketheerrorcorrectionstageinHGAP,ourtooldoesnotrelyonhybridsequencingeither,
makingitconvenientforvariousapplications.However,onelimitationisthatourtoolisnota
generalerrorcorrectiontoolbecauseitcanonlycorrecterrorsinregionsthatarehomologousto
referencesequences.Thisisnotaproblemforspecieswithhighlypackedcodingregions.Butfor
genomeswithlargefractionsofnoncodingregions,ourtoolisnotdesignedforerrorcorrectionin
thewholegenome.
Finally,weonlytestedourapplicationinPBdata.Butourmethodcanbeextendedtootherse-
quencingdatawithsimilartypesoferrors.Forexample,wewilltestourtoolonthedataproduced
byNanoporetechnology.
77
Chapter4
DeepLearningBasedApproachForProtein
DomainPrediction
4.1Introduction
Third-generationsequencingtechnologies,suchasPBiosciencessingle-moleculereal-time
sequencing(PacBio)andOxfordNanoporesequencing(Nanopore),producelongerreadsthan
thetraditionalsequencingtechnologies.Theincreasedlengthenablesclosinggapsingenome
assembly[67,149],detectingepigenetic[151],andquantifyinggeneisoformswith
higheraccuracyinthetranscriptomicsequencing[107,140].Italsoshowsgreatpotentialsin
sequencingthemicrobialcommunities[23,144].
Oneofthemostchallengestoutilizelongreadsfromthird-generationsequencing
isthehighsequencingerrorrate[7,67].Mosterrorsareinsertionsanddeletions,whichcancause
frameshiftsduringtranslations.Withoutknowingtheerrorsandtheirpositions,theframeshifts
canleadtoonlyshortoralignmentsinthedownstreamanalysis[154].Asaresult,
traditionaldownstreamanalysispipelineontherawPacBioorNanoporereadsleadstoincorrect
results[34].
Characterizingtheproteindomaininsequencesisoneoftypicaldownstreamanalysiswitha
continuingconcernasproteinsplaypivotalrolesinmanybiologicalprocesses.Homologysearch
78
isoneofthestandardmethodstoannotatethefunctionsofproteinsequencesbycomparingquery
sequencesagainstreferencesequences[4]orhomologoussequencefamilies[35].Algorithms
ofmachinelearningisanalternativeapproachwithoutalignmenttopredicttheproteindomain
insequences[30,147].Themajorchallengeofthesemethodsishowtoconvertthesequenceto
featuresthatcancapturetheinformationofproteinfamilies.Recentdevelopmentsinthealgorithm
ofdeeplearningledtoanotherapproachwithautomaticfeatureextractionfromthesequence
[91,133].
4.1.1Deeplearningforsequentialdata
Differentdeeplearningmodelhavetheirownadvantagestoresolvetypesofproblems
forsequentialdata.Soweexpectthatthosealgorithmscanalsohelpondifferentproblemsin
bioinformatics.Here,wewilldescribetwomajordeeplearningstructures,convolutionalneural
networks(CNNs),recurrentneuralnetworks(RNNs),thatalreadybeenprovedachievestateofthe
artperformanceonsequentialdata.
4.1.1.1ConvolutionalNeuralNetworks
OneofthemostsuccessfulmodelinrecentyearsisConvolutionalNeuralNetworks(CNNs).The
majordifferencebetweenCNNsandtraditionalneuralnetworksistoreplacethegeneralmatrix
operationinthetraditionlayerwiththemathematicaloperationcalledconvolution.
4.1.1.1.1Convolutionoperation
Inmathematics,theconvolutionoffunction
x
and
w
,
s
(
t
)
,is
asfollowingequation:
s
(
t
)=
Z
x
(
a
)
w
(
t

a
)
dt
=(
x

w
)(
t
)
79
Inconvolutionalnetworkterminology,theargument(inthisexample,thefunctionx)tothe
convolutionisoftenreferredtoastheinputandthesecondargument(inthisexample,thefunction
w)asthekernel.Theoutputissometimesreferredtoasthe
featuremap
.
Figure4.1:Howakerneloperatesononepixel.Reprintedfrom[6].
Theconvolutionoperationinthematrixcantreatasakindofmatrixmultiplication.Asthe
Figure4.1showed,thekernelconvolutiontakesourcepixelfrominputmatrixandsimplymulti-
pliedthekernelmatrixtogeneratetheelementfortheconvolutionallayeroutput.
4.1.1.1.2MajorfeaturesofCNNs
TherearethreemajorfeaturesthatmakeCNNssosuccess-
ful:sparseinteractions,parametersharing,andequivariantrepresentations.
Sparseinteractionsmeansthattheneuronintheconvolutionallayersonlyneedstointeractwith
severalneuronsintheinputandoutputlayers.Inthecontrast,aneuroninthetraditionalneural
80
networkneedstointeracteveryneuronintheinputandoutputlayers.Inthisway,theCNNscan
easilylearnsomesmallandmeaningfulfeatureswhichareonlyfromtinypartoftheinputmatrix.
Also,itcansavealotofcomputationtime.
Parametersharingreferstousingthesameparameterformorethanonefunctioninaneural
network.AteverypositionoftheinputinCNNs,itwillusetheeachelementofthekernelmatrix.
Bydoingthis,thelayercanlearnsomethingwithawholepicture.
Equivariantrepresentationsdescribesthephenomenonthatiftheoutputwillchangecorre-
spondingtothechangeofinput.
4.1.1.1.3CNNforsequence
Foracharacterinthesequence,wecanuseak-
dimensionvector
x
torepresentit.Soasequenceoflengthnisrepresentedas:
x
1:
n
=
x
1

x
2

:::

x
n
Hereoperator

istheconcatenationoperator.Weappliedconvolution
w
onawindowof
sizeh,totransformtheinput
x
1:
n
toanewfeature
c
i
:
c
i
=
f
(
w

x
i
:
i
+
h
+
b
)
Herebisthebiastermandfistheactivationfunctionusuallyusingnon-linearfunctionlikesigmoid
orhyperbolictangent.Byrepeatedapplytheoperation,wecangeneratefeaturemap.Then
themax-over-timepoolingwasappliedtokeepthemostimportantfeatures.
TextCNN[74]isoneoftheearliestmodelsadoptingthestructurewediscussedaboveforsen-
tenceThekeyideaofTextCNNisusingkernelswithdifferentsizestocapture
81
Figure4.2:StructureofTextCNNmodel.Reprintedfrom[74].
patternindifferentranges.Afterglobalmaxpooling,thefeaturesthenusedasinputforafully-
connectedlayerandthenconverttotheprobabilityoflabels.ThevariationofTextCNNmodel
hasbeenadoptedforsolvingdifferentbioinformaticsproblem,suchasproteindomain
tion[133],virussequence[127],andDNA-proteinbindingprediction[153]
4.1.1.2RecurrentNeuralNetworks
RecurrentNeuralNetworks
,orRNNs[129]arethefamilyofthedeeplearningstructuresto
processsequentialdata.Parametersharingacrossthedifferentpartsofthemodelisthekeyidea
thatmakesRNNstobeabletodealwiththesequentialdata.,theRNNswetalked
aboutisoperatingonasequencethatcontainsvectors
x
(
t
)
withthetimestepindextrangingfrom
1to
t
.
4.1.1.2.1BasicstructureofRNNs
Asimplerecurrentneuralnetworkwilljustprocessthe
informationfromtheinput
x
andincorporateitintothestate
h
ofhiddenunitandpassedtothe
82
nextunit.Thehiddenunitstate
h
attimestept,
h
(
t
)
,
h
(
t
)
=
f
(
h
(
t

1
)
;
x
(
t
)
;

)
RNNalsousuallyhaveextraarchitectureslikeoutputlayersthatoutputtheresultof
h
forfurther
processingorpredictions.
Figure4.3:Therepeatingmoduleinainasimplerecurrentneuralnetworkstructure.Reprinted
from[113].
Wecanuseafunction
g
(
t
)
torepresenttheunfoldedrecurrenceaftertsteps,givenusanother
representationof
h
(
t
)
,
h
(
t
)
=
g
(
t
)
(
x
(
t
)
;
x
(
t

1
)
;:::
x
(
2
)
;
x
(
1
)
)
Inthisform,
g
(
t
)
takestheinputofwholepastsequences
(
x
(
t
)
;
x
(
t

1
)
;:::
x
(
2
)
;
x
(
1
)
))
asinput
andoutputthecurrentstate.However,wecanunfoldingthestructureandrepeatedapplyfunction
f
toachievesameoutput.Bydoingthiswecanusethesametransitionfunction
f
withthesame
parametersateverystep.
However,asimpleRNNscannotlearnlongtimedependencyasintheoptimization.Aswede-
scribedabove,ateachtimestepwerepeatedlyapplythesameoperation
f
.Wecanuseamultipli-
cationonmatrix
W
torepresentsuchoperation.Supposethatmatrix
W
hasaneigendecomposition
83
W
=
V
diag
(

)
V

1
.After
t
steps,theinputalreadymultiply
W
t
,so:
W
t
=
V
diag
(

)
t
V

1
Meansthatgivenenoughlargestepst,anyeigenvalue
l
thatnotcloseto1willeithervanish
(lessthan1)orexplode(largerthan1).Thegradientinsuchgraphalsoscaledaccordingto
diag
(

)
t
[52].Inthiscase,itdiftooptimizesuchgradient.Tosolvethischallenge,gated
RNNsisproposedandbecomesoneofthemosteffectivepracticalmodelsthatusedforsequential
data.
4.1.1.2.2LSTM
Thelongshort-termmemory(LSTM)architecture[66]isonebranchofsuch
gatedRNNsthatisextremelysuccessfulintheapplicationlikespeechrecognition,machinetrans-
lation,andhandwritinggeneration.ThekeyideaofLSTMistointroduceaselfloopsothat
gradientcanwforlongduration.
Figure4.4:TherepeatingmoduleinaLSTMcell.Reprintedfrom[113].
Theselfloop(internalrecurrence)islocatedinﬁLSTMcellsﬂwithouterrecurrencelikeordi-
naryrecurrentnetwork.Thecelliscontrolledbyacombinationofinputgateandmemorycontrol
gates,formulatedbyequationsbelow:
84
it=s(Uiht1+Wixt+bi)ft=s(Ufht1+Wfxt+bf)ot=s(Uoht1Woxt+bo)Ÿct=s(Wcxt+Ucht1+bc)ct=ftct1+itŸctht=ots(ct)whereit,ftandotarethreegatesthatcontrolinput,forgetandoutputrespectively.Eachofthe
gateoutputsignaldependsonitsstatetransitionmatrix
U,inputweightingmatrix
Wandbias
b.Thelcellstate
cisupdatedbyelement-wisemultiplication,denotedbyoperatorﬁ
ﬂ.sisthe
activationfunction,whichcanbechosenfromsigmoid,ReLU,tanhandsoon.
Aswecansee,LSTMaddstwomoregatestocontroltheoutputandpreviousinputs,thushas
theabilitytocapturethelongtermdependenciesinthesequence.Sofar,LSTMsarethemost
successfulvariationofRNNs.
4.1.1.2.3GRU
Thegatedrecurrentunit(GRU)architecture[25]isanalternativeandthemain
competitortotheLSTM.TheperformanceofGRUiscomparabletotheLSTMonmanysequential
datasets,withsimpleunitstructuresandlesscomputationrequirement.
ThemaindifferenceofGRUwiththeLSTMisthatasingleupdategatereplacetheroleofthe
forgetandinputgates.Sotheequationsareupdatedtofollowing:
85Figure4.5:TherepeatingmoduleinaGRUcell.Reprintedfrom[113]..
z
t
=
s
(
U
z
h
t
+
W
z
x
t
+
b
z
)
r
t
=
s
(
U
r
h
t
+
W
r
x
t
+
b
r
)
Ÿ
h
t
=
s
(
W
h
xt

1
+
U
h
(
r
t

h
t

1
))
h
t
=
z
t

1

h
t

1
+(
1

z
t

1
)

Ÿ
h
t
4.1.1.3Proteinfunctionpredictionusingdeeplearning
Recently,deeplearningtechniqueshavebeenshowntoachievestate-of-the-artresultsinmany
machinelearningproblemswithouttheneedforcomplexfeatureengineering[52,85].Deeplearn-
ingbasedmethodalsohavedemonstratedtheirabilityonmanybioinformaticsproblems,suchas
DNA-proteinandRNA-proteinbindingsitesprediction[3,153],taxonomic[19,43],
andSNPandsmall-indelvariantdetection[122].
86
Thesuccessofdeeplearningtechniquesinspiredresearcherstoapplytheapproachtomodeling
theproteinsequences.Longshort-termmemory(LSTM)wasintroducedby[65]tomodelthe
proteinhomologydetection.Themodelcanautomaticallyextractthelong-termandshort-term
dependencyofthesequencesandpredictwhetherthesequencesbelongtothemodeledprotein
family.Recently,abidirectionalLSTMmodel,ProDec-BLSTM[91],wasproposedtoimprove
thepredictioncapabilityofthemodelfurther.Inthemodel,everyhiddenvalueoftheLSTMwas
usedtocapturethelongandshortdependencyinformationoftheproteins.Theexperimentresult
showsProDec-BLSTMoutperformsvariousexistingmethods,includingHMMER.
DeepFam[133]adoptedconvolutionalneuralnetworks(CNN)tomodelfamiliesofprotein
sequences.UnlikeLSTMmodelswementionedearlierthatcanonlymodelsingleproteinfamily,
DeepFamusedasingleCNNmodeltopredictmultipleproteinfamilies.Varioussizesofconvo-
lutionalwereusedinDeepFamtoextractconservedregionstomodelproteinfamilies.It
outperformsHMMERandpreviousalignment-freemethod.Also,theexecutiontimeofDeepFam
isfastasitnotaffectedmuchbythenumberoffamilieswhileHMMERneedslongertimeasthe
numberoffamiliesincreased.Forexample,DeepFanismorethantentimesfastercomparedwith
HMMERwhen1,000querysequencessearchingagainstthousandsofproteinfamilies[133].
However,existingmethodslikeDeepFamcannothandleerrorsandincompletesequences.All
thedeeplearningmethodwementionedwereonlytestedusingcompleteandcorrectproteinse-
quences.Moreover,DeepFamusedaclosedatasettotesttheirperformance,whichassumesthat
testdatasetareheld-outsamplesfromthesamedistributionsofthetrainingdataset.Inarealappli-
cation,asequencemaybesampledfromnon-codingsequenceorotherproteinfamiliesthatnotbe
modeledintheneuralnetwork.Wecallthistaskdetectiontaskhereafter,asthemaingoalistode-
tectthetargetedprotein-encodingsequencesfromotherirrelevantsequences.Intheory,irrelevant
inputissupposedtoberejected.Butcurrentdeeplearningmodelsemploysoftmax
87
functiontoassignthelabelandarenotabletodeclinesuchcases,yieldinghighfalse-positive
rate[10,63].
4.1.2Relatedwork
4.1.2.1Prhomologysearch
Topredicttheproteindomainofagivensequence,wecandirectlycomparethesequencewith
protein-encodingsequencesusingpairwisealignmenttoollikeBLAST[4].However,inmany
applications,weusuallydonotcareifthequerysequenceclosetooneencodingsequence
ofthetargetprotein.Instead,wewanttoalignthequerysequencetoafamilyofproteinsequences.
AhiddenMarkovmodel(pHMM)canbebuiltfrommultiplesequencealignmentstomodel
aproteinfamily.Byexplicitlythematch,insertion,anddeletionstate,pHMMcanhandle
thecomparisonofthequerysequencetotheofproteinfamiliesaccurately,evenwhenthe
queryisremotehomologyofthetargetproteinfamilies.HMMERisoneofthemostwidelyused
searchtoolsbasedonpHMM[36].oftheproteinfamiliescanbedownloadedfrom
manyproteindatabases,suchasPfam[40],andTIGRFAMS[56].
Asanalignment-basedmethod,thespeedofthehomologysearchsufferswhenthe
numberoffamiliesincreases.Extensiveresearchhasstudiedtoimprovetheefyofthe
homologysearch.Thenewestversion(3.1b2)ofHMMER[37]ismoresensitivethan
BLASTandisabout10%faster.However,recentresearch[133]suggestsitisstillslowerthanthe
alignment-freemethod.
88
4.1.2.2Homologysearchwitherrorcorrection
Duetothehigherrorrates,thehomologysearchresultonrawPacBioorNanoporereadsusually
isnotdesiredwithwrongannotationandshortalignments.Inthissituation,errorcorrectionon
rawreadscaneffectivelyimprovethehomologysearchresult.
Therearetwomajorcorrectionmethodsforthird-generationsequencingdata:hybriderror
correctionsandnon-hybriderrorcorrections.Thehybridmethodcorrectlongnoisyreadbycom-
biningshort,accuratesecond-generationsequencingreadswiththird-generationreadstoremove
errors[78,136].Unlikehybridsequencing,non-hybriderrorcorrection,orself-correction,only
relyonthelongreadsthemselvesforerrorcorrection[24,48,93].Asthesequencingerrorsinthird-
generationreadsoccurrandomly,theinferredconsensussequencefromthealignmentbetweenthe
longestbackbonereadsandshorterreadsfromthird-generationsequencingdatasetrepresentthe
high-qualitysequence.However,itsperformanceisprofoundlyaffectedbythecoverageofthe
alignedsequencesagainstthebackbonesequences[34].
4.1.3Overviewofourwork
Aswediscussedintheprevioussection,thecurrentproteinpredictalgorithmscannothandlethe
higherrorrateofthethird-generationreads.Thus,wedesignedandimplementedDeepFrame,a
deeplearningbasedmethodtopredicttheproteindomainofthird-generationsequences.
WeoptimizedtheneuralnetworksarchitectureofDeepFramebasedonrigorousvalidation
performedontheofasimulatedPacBiosequencedataset.Theationaccuracy
ofDeepFrameconsistentlyoutperformsthestate-of-the-artmethodlikeHMMERacrossvarious
errorrates.WethenextendedtheframeworktothedetectiontaskwithOutlierExposureandtested
onrealthird-generationsequencingdataset.DeepFrameachieveshigherF1scoreandhigherrecall
89
withcomparableprecisionwhencomparedwithHMMER.
Comparedtopreviousworks,DeepFramehasseveralmerits.First,wedesignedathree-
channelencodingforthetaskoffunctionpredictionusingDNAreads.Comparedtootherencod-
ingmethods(discussedinExperimentsandresults),modelwiththree-channelencodingachieved
betterperformance.Second,evaluatedbysimulatedPacBiodataandrealPacBioandNanopore
data,DeepFramecansuccessfullyprocesslongerroneousreadswithoutanyprepossessing.Al-
thoughweonlyusedsimulatedPacBioreadsthatareeasytogeneratetotrainourmodel,the
performanceonrealPacbioandNanoporedataisstillrobustandconsistent.Third,unlikepre-
viousdeeplearningworkthatonlydesignedforDeepFramecanalsobeusedfor
detectiontasks.Withthethresholdonsoftmaxpredictionprobability[63]andOutlierExposure
method[64],DeepFramecandistinguishtheDNAsequenceswithtargetedproteindomainfrom
otherrandomcodingornon-codingDNAsequences.
4.2Methods
DeepFrameisdesignedtopredictproteindomainsforthird-generationsequencingdata.Itin-
corporates3-frameencodingtoconvertDNAreadsinto3-channeltensorastheinputofneural
networks.Fromtheinput,DeepFrameautomaticallyextractsfeaturesbyusingconvolutionlayer
andmax-over-timepoolinglayer.Awithtwofullyconnectedlayerwasusedtogenerate
theprobabilitiesofthesequenceagainstallpossibleproteindomains.Toexcludetheunrelated
codingornon-codingDNAsequences,wethencomparethemaxvalueoftheoutputprobabilities
withathreshold.Wewilldiscussthedetailsofourmodelinthefollowingsection.
90
Figure4.6:TheoverviewofDeepFramemodel.Theinputsequencewastranslatedandencoded
toa3-channeltensor.Twoconvolutionallayerandmaxpoolinglayerextractsequencefeatures.
Thesefeatureswereusedbyafullyconnectedlayerwithsoftmaxfunctiontoinfertheprobability
ofeachproteinfamilies.Inthetask,themodeldirectlyoutputsthefamilieswith
thelargestscoreastheprediction.Inthedetectiontask,themaximumofsoftmaxscoreneeds
tocomparewiththethresholdtodeterminewhethertheinputcontainsatrainedproteinfamilyor
shouldberejected.
91
4.2.1Encoding
ToprocesstheDNAreads,weneedtoencodesequencestonumericalvalues.Wedesigneda
3-frameencodingschemefortheinputsequences.EachDNAsequencesis3-frametranslated

to3proteinsequences.Alltheresiduesintheproteinsequenceareone-hotencodedusinga

21-dimensionalvectorfollowingIUPACaminoacidcodenotation.Thenweconcatenatethree

matricesintoasingle3-channeltensorlikeanRGBimage.Therationalebehindthisisthatwe

hopetheconvolutioncanautomaticallydistinguishtherightfragmentoneachframeafter

translationandextractthecorrespondingfeatures.Also,therelativeorderoftheresiduesatthe

samepositionofthethreeframesincorporatetheinformationoftheoriginalDNAsequence.Given

thesequencelength
L,thentheencodedinputisatensorwithsize3
L21.Thepseudo-codeof
3-frameencodingcanbefoundinAlgorithm4.
Algorithm4
3-frameencoding
Input:DNAsequence
xwithlength
L,peptidetoindexdict
idx
,peptidealphabetsize
jSj,output
sequencelength
n.Output:Inputtensorforneuralnetworkswithsize3
n21.1:Initializeanarray
arr
withall0valuesanddimensions[3,
n,jSj]2:for
i=1to3
do3:xi x[i:]4:yi translationof
xi.translatenucleotidesin
xitoaminoacidsin
yi5:for
residueaatposition
kinyido6:ifknthen7:arr
[i;k;idx
[a]]
 1.one-hotencodingforeachframe
8:endif
9:endfor
10:endfor
11:arr
isinputtensorforneuralnetworks
Wealsotestedotherencodingmethods:(1)DNAone-hotencoding,whichdirectlytrasnfer
DNAsequencetoonehotvectorsofsize
L4.Forfaircomparison,weusedsizesthatare3
timesaslongasweusedfor3-frameencoding;(2)3-branchmodel,weconstructedanetworkar-
92chitecturewiththreeseparatebranchesprocessingeachofthe3-frametranslatedproteinsequence
respectively.Eachofthebranchesconsistsofidenticallayers,andalltheparametersareshared
betweenthesamelayerof3branches.Inthe3-branchmodel,eachofthebranchesmodelsthe
translatedproteinsequencesseparatelybeforethemerginglayerrightbeforethetwo-layerclassi-
.Incontrast,in3-frameencoding,allthreetranslatedproteinsequenceswereprocessedand
combinedbythe3-channelconvolutionalintheconvolutionallayers.Ourexperimental
resultsshowthat3-frameencodingisabetterencodingschemeasitcaneffectivelyencodethe
originalDNAsequenceinformationandalsohelpsconvolutiontoextractusefulfeaturesfor
predictionoftheproteindomain(Section4.3.1.1).
4.2.2ConvolutionalNeuralNetworks
DeepFrameconsistsoftwoconvolutionallayers,onemax-over-timepoolinglayer,onehidden
linearlayer,andonelinearoutputlayerwithsoftmaxfunction.
Theconvolutionallayer[46,74,86]isthemostimportantbuildingblockinthemodel.Fora
residueinthesingleframeproteinsequence,wecanuseak-dimension(k=
j
S
j
inourcase)vector
y
torepresentit.Soasequenceoflength
n
isrepresentedasaone-hotencodingmatrix.Weapplied
convolution
w
2
R
hk
onasize
h
windowofthematrix,totransformtheinput
y
i
:
i
+
h

1
to
c
i
:
c
i
=
f
(
w

y
i
:
i
+
h

1
+
b
)
(4.1)
Here
b
isthebiastermand
f
istheactivationfunctionusuallyusingnon-linearfunctionlike
sigmoidorhyperbolictangent.Inourmodel,weuseReLU[104]asactivationfunctionofconvo-
lutionallayer:
f
(
x
)=
s
(
x
)=
max
(
0
;
x
)
(4.2)
93
WecanextendEq.4.1tomultiplechannelinput(inourcase,channel
=
3:
c
i
=
f
(
3
å
j
=
1
w
j

y
i
:
i
+
h

1
;
j
+
b
)
(4.3)
ForbothEq.4.1and4.3,weappliedconvolutionalrepeatedlytoeachpossiblewindow
oftheone-hotmatrix
y
1:
h
;
y
2:1
+
h
;:::;
y
n

h
+
1:
n
toproduceafeaturemap,whichisthevectorof
extractedfeaturevalues:
c
=[
c
1
;
c
2
;:::;
c
n

h
+
1
]
(4.4)
Thenthemax-over-timepoolingisappliedonthefeaturemaptocapturethemaximumvalue
‹
c
=
max
c
asthefeaturefromthisparticular.Therationalebehindthisoperationistoextract
thepeaksignalasthemostimportantfeatureforeachfeaturemap.Themax-over-timepoolingis
xiblewithdifferentinputlength.
Wedescribedhowasingleintheconvolutionallayerworks.Inourapplication,weused
multiplewithvaryingsize(orwindowsize)toextractvariousfeatureswithdifferent
ranges.
DeepFramehastwoconvolutionallayers.Theconvolutionallayerusesconsistentconvo-
lutionsizetoextractlow-levelshortdistancepatternsdirectlyfrom3-frameencodinginput.
Thensecondconvolutionallayerextractshigh-level,intricatepatternswithvaryingdistancefrom
theoutputoftheconvolutionlayer.Byrepeatedlyapplyingtheoperations,wecan
generateafeaturemap.Thenthemax-over-timepoolingwasappliedtokeepthemostimportant
features.Dropout[138]isalsousedafterpoolingtopreventovandtolearnrobustfeatures.
Atwo-layerwithsoftmaxfunctiontransfersthefeaturestoavectorofprobabilitiesover
eachlabel.Forweselectthelabelwiththemaximumprobabilityastheprediction
fromDeepFrame.
94
4.2.3Out-of-distributionexamplesdetection
WeshowthatDeepFramewithsoftmaxfunctioncanpredicttheproteinlabelsofgiven
DNAreads.However,thewillalwaysassignalabelfortheinputsample,evenifthe
inputisnotrelatedtoanylabelinthemodel(wecallsuchinputsout-of-distributionexamples,
comparedtoin-distributionexamples).Forexample,inRNA-seqdata,noteveryreadencodes
targetedproteinfamiliesinthemodel.Whenprocessingsuchunrelatedinputs,themodelwillstill
assigntheinputtooneoftheproteinfamiliesinthemodelasthereisnosuchlabelfortheout-of-
distributioninput.Inarealapplication,thisbehaviorwillleadtoanundesiredhighfalse-positive
rate.Toaddresstheproblem,weadoptOutlierExposure(OE)[64]withathresholdonsoftmax
predictionprobabilities[63]todistinguishtheout-of-distributionexamples.
4.2.3.1Thethresholdbaseline
Usually,theexamplesfromout-of-distributiondatasettendtohavesmallsoftmaxvaluebecauseits
probabilitiesvectortendstobemoreuniformlydistributedthantheexamplesfromin-distribution
dataset.Thus,byspecifyingathresholdonmaximumprobabilityinthesoftmax'soutput,itispos-
sibletodistinguishout-of-distributionexamplesformin-distributionexamples.Following[63],
weextractthemaximumvalueofthesoftmaxprobabilityfromtheoutputofDeepFrameforeach
inputexmaple.Weseparatethein-distributionexamplesfromtheout-of-distributionexamples
byspecifyingathresholdofthemaximumsoftmaxprobability.Todeterminethethreshold,an-
othersimulatedout-of-distributiondatasetiscreatedandcombinedwiththesimulatedholdouttest
dataset.Foreachexample,wecomputethemaximumsoftmaxprobability.WeobtainF1score
fromthedataset:
F
1
=
2

precision

recall
recall
+
precision
(4.5)
95
TheF1scoresummarizestheperformanceofthebinaryclofin-distributionexamples
andout-of-distributionexamplesforagiventhreshold.WecomputetheF1scoreacrossdifferent
thresholdsandselectthethresholdofbestF1scoreasthethresholdusedinourmodel.
4.2.3.2OutlierExposure
Tofurtherimprovetheperformanceofout-of-distributionexamplesdetection,weadopttheOut-
lierExposure(OE)methodintroducedby[64].Aswediscussedpreviously,weexpecttheout-of-
distributionexamplestohaveuniformsoftmaxoutputfromthemodel.However,assuchinputs
neverbeprocessedinthetraining,sometimesthemodelwillyieldunexpectedhighpre-
dictionforout-of-distributioninput(Figure4.7).Toaddresstheproblem,weexposethemodelto
out-of-distributionexamplesinthetrainingprocesstoletthemodeleffectivelylearntheheuristics
oftheout-of-distributioninputs.
Givenamodel
g
andthelearningobjective
L
,theobjectiveofOEistominimizetheoriginal
lossfunctionwithanauxiliarylosstermstoregularizetheout-of-distributionexamples[64].In
originaltask,weusecross-entropylossfunctionas
L
.Sowesettheauxiliaryloss
L
OE
totheuniformdistribution:
L
OE
=

log
 
exp
‹
y
0
å
j
exp
y
0
j
!
=

y
0
+
log
å
j
exp
y
0
j
(4.6)
Here,
y
0
isthemeanvalueoftheoutputlogits(theoutputofCNNsmodelbeforethesoftmax
function).Andweuse
l
=
0
:
5inourexperiment.
Figure4.7presentsthedistributionofthesoftmaxscoreofDeepFramemodelwithandwithout
OutlierExposureforthethresholdcalibrationdatasetweusedinSection4.3.2.WithoutOutlier
Exposure,therearestillalotofout-of-distributionexampleswithlargesoftmaxscores(0.5to1).
96
Figure4.7:Thedistributionofsoftmaxscoresfrombasemodel(
top
)andmodelwithOutlier
Exposure(
bottom
).Greenbarrepresentthecountofin-distributionexamplesthatpredictedcor-
rectly.Bluebarrepresentthecountofin-distributionexamplesthatpredictedincorrectly.Redbar
representthecountofout-of-distributionexamples.
97
AftertrainedwithOutlierExposure,mostoftheout-of-distributionexamplesaccumulatedwith
smallsoftmaxscores(0to0.4).Wealsofoundthatthescoresofin-distributionexamplesthat
incorrectlypredictedbyourmodelareclosetothescoresofout-of-distributionexamples.Soan
optimizedthresholdcanalsoexcludethoseincorrectpredictions.
4.2.4Implementationsandhyperparameters
DeepFramewasimplementedbyPython3usingPyTorch[119]withApex[111]acceleration.The
DeepFramemodelwastrainedusingagradient-descentoptimizationalgorithmwithadaptivees-
timatesofmomentscalledAdam[75].Thetrainingtargetistominimizemultipleclasscross-
entropylossfunction.
Wealsoimplementedseveralcommonlypracticedtechniquesforoptimizingdeepneuralnet-
works,suchasmini-batchgradientdescent,Heinitialization[61],andhyperparametertuning.
Thereareseveralhyperparametersincludingthenumberandthesizeofconvolutionthe
numberofinputandoutputfeaturesinthelinearlayer,dropoutrate,learningrate,andbatchsize.
Wetriedourbesttosearchtheparameters,andweusedtheparametersthataretestedtoperform
wellintheexperiment.Inourexperiment,thehyperparametersweresetasfollows:batchsize
=
256,learningrate
=
0
:
001,dropoutrate
=
0
:
5.Fortheconvolutionallayer,thesizeof
theis3,andthenumberofis64.Forthesecondconvolutionallayer,thesizeofthe
is8
;
12
;
16
;
20
;
24
;
28
;
32
;
36with256ltersofeachsize.Also,forthehiddenlinearlayer,
thenumberofoutputfeaturesis512.Giventhedatasizeweused,wefoundourmodelconverges
around1to2epochstraining,sowetrainedallmodelstwoepochswithvalidatingevery102400
samplestoselectthebestmodel.
98
4.3Experimentsandresults
ToevaluateDeepFramemodel,weappliedDeepFrameontwodataset:asimulatedPacBioG
protein-coupledreceptor(GPCR)codingsequences(cds)dataset[30],andarealthird-generation
sequencingdatasetofhumangenome[20,116].Gprotein-coupledreceptorisalargeprotein
familythatisinvolvedinmanycriticalphysiologicalprocesses,suchlikevisualsense,gustatory
sense,senseofsmell,regulationofimmunesystemactivity,andsoon[143].WeusedGDS[30]
forourexperimentasitisalreadyusedtoevaluateDeepFam[133].Itconsistof8222protein
sequencesbelongingto5families,38subfamilies,and86sub-subfamilies.ForsimulatedGPCR
cdsdataset,wehavetheproteinfamilyinformationofreferencereadforeachsimulatedreadas
ourgroundtruth.Forthehumangenomedataset,wedeterminethegroundtruthviaBLASR's[21]
alignmentsagainsttheGPCRcodingsequences.
WebenchmarkedDeepFrame'sperformancewithHMMERandDeepFam.Thosetoolsand
methodsarerepresentativesofcurrentstate-of-the-artmethodsofproteindomainprediction.In
experimentsonsimulatedPacBioGPCRcdsdataset,weevaluatetheperformanceofallmethods
usingaccuracy.Forhumangenomedataset,weevaluatedtheperformanceofdetec-
tionofGPCRproteinfamiliesfromotherunrelatedDNAsequencesusingfollowingmetrics:(1)
recall,whichmeasurestheratioofthecorrectproteindomainspredictedbyeachprogramtothe
wholesetofexpectedproteindomains;(2)precision,whichtheratioofcorrectprotein
domainsdetectedbyeachprogramtothetotalreportedproteindomains;and(3)F1score,which
istheharmonicmeanofrecallandprecision.BecauseDeepFamisnotdesignedtoprocessthedata
without-of-distributionsamples,wedidnotevaluateitintheexperimentonthehumangenome
dataset.
Forourexperiments,allcommands,parameters,andoutputcanbefoundalongwith
99
thesourcecodeofDeepFrame.
4.3.1SimulatedPacBioGPCRcdsdataset
WeevaluatedourmodelusingsimulatedPacBioreadsgeneratedusingPBSIM[114]with
GPCRcdsdatasetwithdefaultsetupanderrorratesfrom1%to15%.Wecreateacorresponding
GPCRprotein-codingsequences(cds)datasetbydownloadingcdsfromNCBIbysearchingthe
keywordofthecorrespondingsub-subfamiliesinGDS.WealsobuildpHMMmodelsfromGDS
usingHMMERandtestallcdsagainstthesemodels.Wediscardedthecdsthatnotpredicted
correctlybyGPCRHMMmodelstocleanthedataset.Wedeterminedthegroundtruthofsub-
subfamilylabelofasimulatedreadbyassignthelabelofthereferencecodingsequencetoit.
4.3.1.1Performancewithdifferentarchitectures
Weconductedaseriesofexperimentsbyvaryingthekeyopponentsinourbasemodels:thenumber
ofconvolutionlayers,thenumberofconvolutionkernels,thesizeofconvolutionkernels.Wealso
testeddifferentencodingstrategies.WelistedallvariationswetestedinTable4.1.Thereference
codingsequencesofeachsub-subfamilieswerespittedto80%trainingsamplesand20%test
samples.ThenweusedPBSIMtogeneratesimulatedPacBioreadsfortrainingsamplesandtest
sampleswith15%errorrates.Allthemodelsweretrainedusingthesamehyperparameterswe
discussedinSection4.2.4withvetimesrepeat.Figure4.8comparesaccuracyof
allthevariantsinthetestsamplesofsimulatedGPCRcdsdataset.
4.3.1.1.1Comparisonsofencoding.
Wecomparedthreemodelswithdifferentencodingstrate-
gies:(1)3-frameencodingmodel,(2)DNAone-hotencodingmodel,and(3)3-branchmodel.We
describethearchitectureofallthreeencodingmodelsinSection4.2.1.
100
NameArchitecture(comparetobasemodel)
BasemodelThemodelwedescribedinMethods
Use512intotalinthe2ndconvolutionlayer
Use1024intotalinthe2ndconvolutionlayer
Use4096intotalinthe2ndconvolutionlayer
1layerOnlykeepthelastconvolutionlayer
3layerAddanextraconvolutionlayerwith64ofsize3
sizesof2ndconvolutionlayer
=[
6
;
9
;
12
;
15
;
18
;
21
;
24
;
27
]
sizesof2ndconvolutionlayer
=[
10
;
15
;
20
;
25
;
30
;
35
;
40
;
45
]
sizesof2ndconvolutionlayer
=[
12
;
18
;
24
;
30
;
36
;
42
;
48
;
54
]
3-branch3branchesstructurefortranslatedreads
DNA-
encoding
Useone-hotencodingofDNAreadsasinput
Table4.1:ThenameandbriefdescriptionofvariantsofCNNmodelswecompared.
Figure4.8:Themeanandstandarddeviationofaccuracyofdifferentnetworkarchi-
tectures.Forconvenient,weuseddifferentnames(
3-frameencoding
,
2048s
,
s8
,and
2
layer
)forthesamebasemodel.Weuseddifferentcolorsfordifferentgroupofcomparisons:green
barsforencoding;bluebarsfornumberofpurplebarsfordifferentsizes;orangebars
fordifferentconvolutionlayers.
101
FromFigure4.1,themodelwith3-frameencodingachievedmuchbetterperformancecom-
paredtothe3-branchmodelandDNAencoding.Inthe3-branchmodel,theoriginalDNAse-
quencesinformationislostastheordersofthethreeframesnotkeptinthemodel.Incontrast,
DNAencodinghastheoriginalinformation,buttheindirectinformationofproteininDNAis
noteasytoextractusefulfeaturesthatcanseparatedifferentproteinfamilies.TheDNAencoding
modelalsoconvergesmuchslowerthantheothertwopeptideencodingmodelsintraining.Finally,
bothDNAencodingmodeland3-branchmodelhavealargernumberofparameterscomparedto
3-frameencoding,requiringmorecomputingresources.
Toinvestigatethefeaturesextractedfromthesequencesbyconvolutionallayers,weobtainthe
featurematrixafterthemax-over-timepoolingforallthreemodels.Wepicksixsub-subfamilies
withtopF1scoresforallthreemodelsforvisualizingthefeaturesusingt-SNE[96].Thefeatures
fromdifferentproteinsub-subfamilieshavebeenmappertomanysmallclustersshownin4.9.
Thedistancesbetweenclustersfromdifferentsub-subfamiliesin3-frameencodingmodelsare
relativelylarger,andeachsmallclusteronlyrepresentsonesub-subfamilies.Manyclustersin
DNAone-hotencodingmodelareclosetoeachother.Andclustersfromthe3-branchmodelhave
overlaps.Theseplotsmayexplainwhytheperformanceofthe3-frameencodingisbetterasit
extractsfeaturesthatcaneasilydistinguishdifferentproteinsub-subfamilies.
4.3.1.1.2Comparisonsofconvolutionalsetup.
Comparedtotheone-layermodel,our
basemodel(2layers)achievedhigheraccuracy.Theextralayerhelpstheneuralnetworktoextract
morecomplexpatternssuchasinteractionsofthelowerlevelfeatures.However,the"deeper"
modelwithmorelayersaremorediftooptimize.Thatisthepossiblereasonwhytheaverage
accuracyof3layermodelislowerthanthebasemodel,butthehighestaccuracyachievedishigher
thanourbasemodel.
102
Figure4.9:2Dt-SNEplotoffeaturesextractedfromtheconvolutionallayeroutput.(Top)3-frame
encodingmodel.(Middle)DNAone-hotencodingmodel.(Bottom)3-branchmodel.
103
Wefoundthattheadditionalconvolutionalincreasedperformanceforproteindomain
prediction.Theimprovementissaturatedwhenwehavemorethan2048
Increasingthesizeofcanalsohelptoimprovetheperformanceofthemodels.With
thelargersize,theneuralnetworkcancapturelong-rangefeatures.Theresultsuggeststhe
importanceofchoosingtherightsize,whichisnotexploredinpreviousworks[133,153].
4.3.1.2ComparisonwithHMMERandDeepFam
TheaccuracyofofGPCRcdsofDeepFrameandDeepFamwasmeasuredusing
5-foldcross-validation.Thereferencecodingsequencesofeachsub-subfamilieswereequally
splitintovefoldswhilepreservingtheratioofthefamilies.ThenweusePBSIMsimulateeach
foldofthecodingsequencestothesimulatedPacBioreadsthatusedtotrainDeepFrame.For
DeepFam,weusedthetranslatedproteinsequencesofeachfoldofcodingsequencestoretrainit.
Wethentestedallthreetranslationsusingourretrainedmodel.ForHMMER,weusedall5-fold
translatedproteinsequencestoretrainthepHMMmodel.Togeneratemultipleseuqnceaignment,
MAFFT[72]wasusedforthesequencesofeachsub-subfamilies.Thenweused
hmmbuild
in
HMMERtobuildpHMMmodel.ForeachtestDNAsequences,3-frametranslationswasapplied
togetthreeproteinsequences,thenwastestedwith
hmmscan
againstall86pHMMmodelswebuilt.
InbothexperimentsofDeepFamandHMMER,ifatleastoneofthethreetranslatedreads
totherightsub-subfamilies,wetreatedthiscaseasacorrectprediction.HMMERwasnotableto
assignfamilylabelforsomesequences,sosuchcasesweretreatedasincorrectpredictions.
Figure4.10summarizedthecomparisonofofallmethodsonthesimulated
PacBioreads.Forthisdataset,ourmethodachievedconsistentperformancewhentestondataset
withdifferenterrorrates.Ourmethodachievedmuchhigheraccuracycomparedtotheothertwo
methodswhentheerrorratesarehigh.Thehigherrorratesheavilyimpactedtheperformanceof
104
Figure4.10:ComparisonofproteinperformanceofDeepFrame,HMMER,and
DeepFamacrossdifferenterrorrates.
105
HMMERandDeepFam.ItisexpectedbecausetheHMMsearchismuchmoresensitiveto
frameshiftscausedbygaps,andDeepFamisdesignedforclassifyingrelativelycompleteerror-free
proteinsequences.
4.3.1.3Comparisonoftimecomplexity
Wemeasuredtheexecutiontimesofthetestedmethods.Withalargeamountofdatagenerated
bythethird-generationsequencingplatform,werequirenotonlyhighaccuracyofthealgorithm
butalsothehighefyofthealgorithms.WerunbothDeepFrame,DeepFam,andHMMER
usingIntel
R

Xeon
R

Gold6148CPUwith20coresattheHigh-PerformanceComputingCenter
atMichiganStateUniversity.WealsotestedDeepFrameandDeepFamwithNVIDIA
R

Tesla
R

V100GPUwithApexaccelerationlibrary(HMMERdoesn'tsupportGPU).Foreachmethod,
Wemeasureditsexecutiontimebyaveraging5independenttrialswithrandomlyselected10,000
sequences.
SetupDeepFrameDeepFamHMMERHMMER

max
CPU1168.78s276.74s312.13s3470.04s
GPU25.71s20.37sunavailableunavailable
Table4.2:Theaverageelapsedtimetopredictfamiliesof10,000simulatedPacBioreadsforeach
method.
InCPU,HMMERwiththedefaultsetuprunsmuchfasterthanDeepFrameandDeepFam.With
highsequencingerrorrates,thealignmentagainstmanycandidatesub-subfamiliescannotpassthe
stageofHMMER,speedinguppHMMsearch.With
--max
(turnofftheexecution
timeofHMMERisevenslowerthandeeplearningmethods.WithGPUacceleration,therunning
timeofDeepFrameismuchsmallerthantherunningtimeofHMMERwiththedefaultsetup.
106
4.3.2Humangenomedataset
ToevaluateDeepFrame'sperformanceonrealthird-generationsequencingdataset,wetestedDeep-
FrameontheH.sapiens10xSequenceCoveragewithPacBiodata[116]andOxfordNanopore
HumanReferenceDatasetsRel6[67].Wedescribedthedatasetweusedinthisexperiment.
4.3.2.0.1Traindataset.
Forthethresholdbaseline,weusedthemodeltrainedintheprevious
experiment.ToapplyingOutlierExposure,weconstructedadatasetthatmixingthe
previous5-foldtraindatasetwithadatasetofoutliers.Weconstructedtheoutelierdatasetusing
simulatedPacBioreads.Toclosetotherealout-of-distributionexamples,wesimulatedaPacBio
humangenomedatasetusingGRCh37/hg19humanreferencegenomeasreference[27].Weonly
keptthesimulatedreadsthatcannotbealignedtocdsbyBLASRintheout-of-distribution
dataset.
4.3.2.0.2Thresholdcalibrationdataset.
Tocalibratethethreshold,weneedadatasetthat
mixingin-distributionexampleswithout-of-distributionexamples.Wesimulatedin-distribution
examplesfromGPCRcdsdataset.Also,wesimulatedout-of-distributionexamplesusingATP
synthasefamiliescds.WeusedATPsynthasefamiliesbecausebothGPCRandATPsynthase
familiesbelongtothemembraneandcellsurfaceproteinsandpeptidesclassinSCOP
[92].Intuitively,proteinfamiliescloselyrelatedmaybemorechallengingtodistinguish,andthis
leadtothethresholdwecalibratedmorerobust.
4.3.2.0.3Out-of-distributiontestdataset.
Weconstructedtwotestdataset:aPacBioRSII
testdatasetfromPacBioSMRTSequencingforCHM1TERThumancellline;andaNanoporetest
datasetfromOxfordNanoporeMinIONonCEPH1463(NA12878/GM12878,Ceph/Utahpedi-
gree)humangenomereferencestandardusingR9.4chemistry.Todeterminethegroundtruthof
107
thesetestreads,wealignedallreadsagainstGPCRcdsdatasetusingBLASR[21].Weex-
tractedthealignedpartofthereadsofwhichalignmentlengthlongerthan60%ofalignedcds
lengthasourin-distributiontestsamples.Foreachsample,thegroundtruthisgivenbythelabel
ofalignedcds.Thesereadsarein-distributionexamplesinthetestdataset.Wealsorandomly
selectedreadsthatnotalignedtoanycdsasourout-of-distributionsamples.Thedatasetconsists
of50%readsthatmaybecodingGPCRand50%out-of-distributionreads.Forreadlongerthan
3000bpsinthetestdataset,wecroppedreadintofragmentsthatnotlongerthan3000bpstobe
consistentwithtraindataset.
4.3.2.1Out-of-distributiontestusingPacBioandNanoporereads
MethodPacBioNanopore
RecallPrecisionF-1scoreRecallPrecisionF-1score
HMMER0.14940.95070.25830.3984
0.9825
0.5670
DeepFrame
threshold
0.42350.54070.4667
0.4866
0.62340.5384
DeepFrameOE
0.44790.98370.6154
0.48360.9731
0.6458
Table4.3:Theperformanceofproteindomainpredictionwithout-of-distributionexamplesusing
DeepFramethresholdbaseline,DeepFramewithOutlierExposure(OE),andHMMERonthereal
PacBioandNanoporedataset.
FordetecttheGPCRcodingsequencesamplefromout-of-distributionsamples,weretrain
DeepFramemodelwithOutlierExposurediscussedinMethods.Weusedthesametraindataset
intheprevioussimulatedPacBioreadsexperiment.WebenchmarkedtheOEmodelwithHM-
MERandabaselinemodelwithonlyasoftmaxthreshold.InbothPacBioandNanoporedataset,
DeepFramewithOEachievesthebestF1scorecomparedtoHMMERdefaultsetupandthreshold
baseline(Table4.3).DeepFramewithOEachievedsignimprovementonrecallwhilethe
precisioniscomparablewithHMMER.Ingeneral,allthreemethodshavebetterperformanceon
108
Nanoporedataset.Noticedthatinthewholepipeline,weonlyusesimulatedPacBioreadsbutno
datafromNanopore.Thisresultsuggeststhatourstrategyisrobustwithdifferenttypesoflong
reads.
4.3.3Visualizeandunderstandingconvolution
Tobetterunderstandwhatourconvolutionalneuralnetworkslearned,wefurtherinvestigatethe
featureextractedbyconvolutionalWevisualizedtheconvolutionunitsofourmodeland
foundthatsomeconservedregionswerecaptured,althoughtheinputdataisverynoisy.
Followingthemethodsadoptedbypreviousresearch[3,133],wevisualizetheconvolution
unitsactivatedforeachfamily.Weusedthemodelthatwastrainedinpreviouslyexperiments
andfeedthetestsequencesthatbelongedtothefamilytoourmodel.Thenwecollectallthe
sequencefragmentsthatactivatetheconvolutionunits.Weextractedtheresultsfrommostacti-
vatedconvolutionunitsandusedWeblogotogeneratethelogosfromsub-sequencesoftheresults.
SincewetranslatedtheoriginalinputDNAsequencesinto3-framesproteinsequences,soforeach
convolutionalunits,wehave3logosassociatedwith3framesrespectively.
Fromthelogos,wecanfoundthattheconvolutionaldidlearnsomeconservedregions
ofproteinfamilies.However,thelogoisverynoisycomparedtotheresultsreportedinprevious
studies.Whentheinputhasinsertionsanddeletions,awell-conservedregionmayappearinall
threeframesduetoframeshifts,sotheofallthreechannelssometimesmayneedtoextract
similarpatterns.Thatiswhysomeofthe3-framelogosshowsimilarpatternsacrossdifferent
frames.
Figure4.17presentedtheactivationvaluedistributionofconvolutionalforthesixsub-
subfamilieswediscussedbefore.Itclearlyshowsthatdifferentproteinfamilieswillactivatedif-
ferentconvolutional
109
Figure4.11:Mostactivatedconvolutional(index:1622)forLatrophilin.
Figure4.12:Mostactivatedconvolutional(index:1877)forCannabinoid.
4.4Discussion
Inthiswork,weadoptedthethresholdonsoftmaxandOutlierExposuretodetectrelevanttarget
readsfromotherrandomreads.Previousresearch[3,65,91]usuallysolvethechallengebyusing
abinaryframework.However,foragivensequence,weneedtorunthebinary
modelforeachpossibletarget.Instead,ourmodelscalesbetter
110
Figure4.13:Mostactivatedconvolutional(index:1689)forPlatelet.
Figure4.14:Mostactivatedconvolutional(index:1382)forProstacyclin.
withalargenumberofcandidatetargets.
Weaddressedthechallengeofpredictionofproteindomainswithasupervisedlearningmethod,
whichneedsavastamountoflabeleddata.However,itischallengingtoobtainalargenumberof
111
Figure4.15:Mostactivatedconvolutional(index:1033)forCholecystokinin.
Figure4.16:Mostactivatedconvolutional(index:1776)forEMR1.
labeledreadsfromthird-generationsequencing.Inthiswork,weshowthatdeeplearningmethod
achievesacceptableresults,evenwhentrainingonsimulateddataandtestingonrealdata.Webe-
112
Figure4.17:Thedistributionsofsumofacivationvaluesofconvolutionalunits.Xaxisisthe
indexoftheconvolutioninDeepFram.Yaxisisthesumofactivationvaluesofthegiven
convolutional
113
lievetheperformanceofourmethodcanbefurtherimprovedifmorelabeledrealthird-generation
readsavailable.
4.5Conclusion
Inthiswork,wedevelopedaproteindomainpredictiontoolforthird-generationsequencingdata.
Byincorporating3-frameencodingandmultiple-layerconvolutionalourCNNmodelcan
directlypredicttheproteinfamiliesoflongerroneousread.Ourexperimentalresultshaveshown
thatforthedatasetfromthird-generationsequencingtechnologies,ourprogramcandetectrele-
vantproteindomainfromotherrandomDNAreads.Beingabletopredictproteindomainfrom
asinglereadfromthird-generationsequencingdirectlyisessentialforapplicationsliketran-
scriptomicandmetagenomicsequencing.DeepFrameprovidesacomplementarytooltocur-
rentthird-generationsequenceanalysispipelines,whichusuallyrequirehighcoverageforer-
rorcorrection.Thesourcecodeandthemodelstrainedandareavailableat
https://github.com/strideradu/DeepFrame.
114
Chapter5
Conclusionsandfutureworks
Inthisdissertation,Iintroducedthedevelopmentofthird-generationsequencingtechnologies
andthepossibleapplicationofPacBioandNanoporereads.Asmoreandmorethird-generation
sequencingdatasetaccumulating,especiallymetagenomicdataandtranscriptomicdata,theanal-
ysisoflongnoisyreadsbecameachallengingproblem.First,traditionalsequenceanalysistools
cannothandlehighinsertionanddeletionrates,leadingtotheshortannotationthatcannotbe
distinguishedfromrandomsequences.Second,applicationslikemetagenomicsequencingmay
containmanycloseproteinsequencesthatarenoteasytodistinguish.Third,althoughsomealgo-
rithmshavebeendevelopedforPacBioorNanoporereads,thereisstillagreatpotentialtoimprove
theperformanceandefyofthesequenceanalysisalgorithmforthird-generationreads.To
addressthesechallenges,Iproposedasetofalgorithmsinthisdissertation.
InChapter2,IintroducedGroupK,whichwasdesignedforthehighsensitivityoverlapdetec-
tionoflongnoisyreads.GroupKincorporatedthegroupedhitscriteriawhichhasbeensuccessfully
appliedtoremotehomologydetectionandachievesbettersensitivitywhencomparedwithother
overlapdetectiontoolsforthird-generationsequencingreads.
InChapter3,IintroducedFrame-Pro,whichwasdesignedforthehomologysearchofPacBio
reads.Itcorrectssequencingerrorsandalsooutputsthealignmentsofthecorrectedse-
quencesagainstcharacterizedproteinfamiliessimultaneously.Comparedwithhomologysearch
onerrorcorrectionreads,Frame-Proenablesmoresensitivehomologysearchandcorrectsmore
115
sequencingerrors.
InChapter4,IintroduceDeepFrame,whichwasdesignedforproteindomainpredictionand
detectiononasinglelongnoisyread.DeepFrameisbasedonthedeepconvolutionalneuralnet-
workswithsoftmaxthresholdandOutlierExposure.ItoutperformsHMMERonboth
accuracyanddetectionF1scorewithfasterexecutiontimeusingGPU.
Thereareseveraldirectionsthatthepreviousstudiescanbeimproved:

InGroupK,thegroupseedmatchingstepcurrentlyisbasedonhashtableimplementation
fromYASS[108].ThisstepisthebottleneckoftheGroupKoverlapdetectionpipeline.In
thefuture,wecanimplementthegroupmatchingstepusingnew
k
-mermethodto
improvetherunningtimeefytobecomparablewithotherfasteroverlapdetection
tools.Wecanalsoinvestinthestepstoallowlessfalse-positiveoverlappairstoenter
thedownstreamsteps.

Frame-ProiscomputationallymoreexpensivethanHMMERaswedonotadoptanyaccel-
erationapproach.WecanreducetherunningtimeofFrame-Probypruningthedynamicpro-
grammingmatrix.WecanalsoincorporatethefastViterbialgorithmasastageto
outimpossibleproteindomaincandidates.Also,Frame-Proiscurrentlyimplemented
withPython.TheexecutiontimeofFrame-Procanbegreatlyreducedifweimplement
itusingC++,orotheracceleratedmachinelearninglibrarieslikeTensorFlow[1],orPy-
Torch[119].

InDeepFrame,weimplementedtwoconvolutionallayerneuralnetworks.Inthefuture,we
maytestamorecomplexmodelwithﬁdeeperﬂconvolutionalneuralnetworks.However,
usuallyﬁdeeperﬂneuralnetworksarenoteasytooptimize,sowemayneedusetechniques
likeresidueblocks,asusedinResNet[62].Anothercandidateistransformerstructure[146],
116
whichalreadyproveditssuccessonnaturallanguageprocessing[31].Wemayalsoadda
steptoimprovetheefyofouralgorithmfurther,especiallywhenonlyCPU
available.Besides,currentlyweusemulti-classationframeworkthatdirectlyclassi-
onthesub-subfamilieslevel.Wemayimplementahierarchicalframework
thatclassifyondifferentproteinlevelslikefamilies,sub-families,andsub-subfamiliessi-
multaneously.
117
BIBLIOGRAPHY
118
BIBLIOGRAPHY[1]M.Abadi,A.Agarwal,P.Barham,E.Brevdo,Z.Chen,C.Citro,G.S.Corrado,A.Davis,
J.Dean,M.Devin,S.Ghemawat,I.Goodfellow,A.Harp,G.Irving,M.Isard,Y.Jia,
R.Jozefowicz,L.Kaiser,M.Kudlur,J.Levenberg,D.Mané,R.Monga,S.Moore,D.Mur-
ray,C.Olah,M.Schuster,J.Shlens,B.Steiner,I.Sutskever,K.Talwar,P.Tucker,V.Van-
houcke,V.Vasudevan,F.Viégas,O.Vinyals,P.Warden,M.Wattenberg,M.Wicke,Y.Yu,
andX.Zheng.TensorFlow:Large-scalemachinelearningonheterogeneoussystems,2015.
Softwareavailablefromw.org.
S.Aki,H.Kuboki,andK.Hirano.Ondiscretedistributionsoforderk.AnnalsoftheInstituteofStatisticalMathematics,36(1):431Œ440,1984.B.Alipanahi,A.Delong,M.T.Weirauch,andB.J.Frey.
Predictingthesequencespeci-ofdna-andrna-bindingproteinsbydeeplearning.Naturebiotechnology,33(8):831,2015.S.F.Altschul,W.Gish,W.Miller,E.W.Myers,andD.J.Lipman.Basiclocalalignmentsearchtool.Journalofmolecularbiology,215(3):403Œ410,1990.I.AntonovandM.Borodovsky.
Genetack:frameshiftinprotein-codingse-quencesbytheviterbialgorithm.Journalofbioinformaticsandcomputationalbiology,8(03):535Œ551,2010.AppleInc.Performingconvolutionoperations.,2018.Sep,2016.S.Ardui,A.Ameur,J.R.Vermeesch,andM.S.Hestand.Singlemoleculereal-time(smrt)sequencingcomesofage:applicationsandutilitiesformedicaldiagnostics.Nucleicacidsresearch,46(5):2159Œ2168,2018.K.F.Au,J.G.Underwood,L.Lee,andW.H.Wong.Improvingpacbiolongreadaccuracybyshortreadalignment.PLoSOne,7(10):e46679,2012.L.Baum.Aninequalityandassociatedmaximizationtechniqueinstatisticalestimationofprobabilisticfunctionsofamarkovprocess.Inequalities,3:1Œ8,1972.A.BendaleandT.E.Boult.Towardsopensetdeepnetworks.InProceedingsoftheIEEEconferenceoncomputervisionandpatternrecognition,pages1563Œ1572,2016.G.Bensonetal.Tandemrepeatsaprogramtoanalyzednasequences.Nucleicacidsresearch,27(2):573Œ580,1999.119[12]K.Berlin,S.Koren,C.-S.Chin,J.P.Drake,J.M.Landolin,andA.M.Phillippy.Assem-
blinglargegenomeswithsingle-moleculesequencingandlocality-sensitivehashing.
Nature
biotechnology
,33(6):623Œ630,2015.
[13]E.Birney,M.Clamp,andR.Durbin.Genewiseandgenomewise.
Genomeresearch
,
14(5):988Œ995,2004.
[14]M.BorodovskyandJ.McIninch.Genmark:parallelgenerecognitionforbothdnastrands.
Computers&chemistry
,17(2):123Œ133,1993.
[15]D.Branton,D.W.Deamer,A.Marziali,H.Bayley,S.A.Benner,T.Butler,M.DiVentra,
S.Garaj,A.Hibbs,X.Huang,etal.Thepotentialandchallengesofnanoporesequencing.
Naturebiotechnology
,26(10):1146Œ1153,2008.
[16]A.Z.Broder.Identifyingandnear-duplicatedocuments.In
AnnualSymposiumon
CombinatorialPatternMatching
,pages1Œ10.Springer,2000.
[17]N.P.Brown,C.Sander,andP.Bork.Frame:detectionofgenomicsequencingerrors.
Bioinformatics(Oxford,England)
,14(4):367Œ371,1998.
[18]J.Buhler,U.Keich,andY.Sun.DesigningseedsforsimilaritysearchingenomicDNA.
JournalofComputerandSystemSciences
,70(3):342Œ363,2005.
[19]A.Busia,G.E.Dahl,C.Fannjiang,D.H.Alexander,E.Dorfman,R.Poplin,C.Y.McLean,
P.-C.Chang,andM.DePristo.Adeeplearningapproachtopatternrecognitionforshort
dnasequences.
bioRxiv
,page353474,2019.
[20]M.J.Chaisson,J.Huddleston,M.Y.Dennis,P.H.Sudmant,M.Malig,F.Hormozdiari,
F.Antonacci,U.Surti,R.Sandstrom,M.Boitano,etal.Resolvingthecomplexityofthe
humangenomeusingsingle-moleculesequencing.
Nature
,517(7536):608Œ611,2015.
[21]M.J.ChaissonandG.Tesler.Mappingsinglemoleculesequencingreadsusingbasiclocal
alignmentwithsuccessive(blasr):applicationandtheory.
BMCbioinformatics
,
13(1):238,2012.
[22]W.I.ChangandE.Lawler.Sublinearexpectedtimeapproximatestringmatchingandbio-
logicalapplications.
Algorithmica
,12:327Œ44,1994.
[23]T.Charalampous,G.L.Kay,H.Richardson,A.Aydin,R.Baldan,C.Jeanes,D.Rae,
S.Grundy,D.J.Turner,J.Wain,etal.Nanoporemetagenomicsenablesrapidclinical
diagnosisofbacteriallowerrespiratoryinfection.
NatureBiotechnology
,page1,2019.
[24]C.-S.Chin,D.H.Alexander,P.Marks,A.A.Klammer,J.Drake,C.Heiner,A.Clum,
A.Copeland,J.Huddleston,E.E.Eichler,etal.Nonhybrid,microbialgenome
assembliesfromlong-readsmrtsequencingdata.
Naturemethods
,10(6):563Œ569,2013.
120
K.Cho,B.VanMerriënboer,C.Gulcehre,D.Bahdanau,F.Bougares,H.Schwenk,andBengio.Learningphraserepresentationsusingrnnencoder-decoderforstatisticalma-chinetranslation.arXivpreprintarXiv:1406.1078,2014.J.Chu,H.Mohamadi,R.L.Warren,C.Yang,andI.Birol.Innovationsandchallengesinde-tectinglongreadoverlaps:anevaluationofthestate-of-the-art.Bioinformatics,33(8):1261Œ1270,2016.D.M.Church,V.A.Schneider,T.Graves,K.Auger,F.Cunningham,N.Bouk,H.-C.Chen,Agarwala,W.M.McLaren,G.R.Ritchie,etal.Modernizingreferencegenomeassem-blies.PLoSbiology,9(7):e1001091,2011.J.Clarke,H.-C.Wu,L.Jayasinghe,A.Patel,S.Reid,andH.Bayley.
Continuousbaseiden-forsingle-moleculenanoporednasequencing.Naturenanotechnology,4(4):265,2009.P.CompeauandP.Pevzner.Bioinformaticsalgorithms:anactivelearningapproach,vol-ume1.ActiveLearningPublishers,LaJolla,California,USA,2015.M.N.Davies,A.Secker,A.A.Freitas,M.Mendao,J.Timmis,andD.R.Flower.Onthehierarchicalofgprotein-coupledreceptors.Bioinformatics,23(23):3113Œ3118,2007.J.Devlin,M.-W.
Chang,K.Lee,andK.Toutanova.Bert:Pre-trainingofdeepbidirectionaltransformersforlanguageunderstanding.arXivpreprintarXiv:1810.04805,2018.A.Döring,D.Weese,T.Rausch,andK.Reinert.SeqAnangenericc++libraryforsequenceanalysis.BMCbioinformatics,9(1):11,2008.N.Du,J.Chen,andY.Sun.Improvingthesensitivityoflongreadoverlapdetectionusinggroupedshortk-mermatches.BMCgenomics,20(2):190,2019.N.DuandY.Sun.ImprovehomologysearchsensitivityofPacBiodatabycorrectingframeshifts.Bioinformatics,32(17):i529Œi537,2016.R.Durbin,S.R.Eddy,
A.Krogh,andG.Mitchison.Biologicalsequenceanalysis:prob-abilisticmodelsofproteinsandnucleicacids.CambridgeUniversityPress,Cambridge,UnitedKingdom,1998.S.R.Eddy.
hiddenmarkovmodels.Bioinformatics(Oxford,England),14(9):755Œ763,1998.S.R.Eddy,
T.J.Wheeler,andtheHMMERdevelopmentteam.Hmmer3.1b2,2015.S.R.Eddy,
T.J.Wheeler,andtheHMMERdevelopmentteam.Hmmer3.2.1.121,2018.June,2018.[39]J.Eid,A.Fehr,J.Gray,K.Luong,J.Lyle,G.Otto,P.Peluso,D.Rank,P.Baybayan,
B.Bettman,etal.Real-timednasequencingfromsinglepolymerasemolecules.
Science,323(5910):133Œ138,2009.
[40]S.El-Gebali,J.Mistry,A.Bateman,S.R.Eddy,A.Luciani,S.C.Potter,M.Qureshi,L.J.
Richardson,G.A.Salazar,A.Smart,etal.Thepfamproteinfamiliesdatabasein2019.
Nucleicacidsresearch
,47(D1):D427ŒD432,2018.
[41]W.Feller.
Anintroductiontoprobabilitytheoryanditsapplications
,volume2.JohnWiley
&Sons,2008.
[42]P.FerraginaandG.Manzini.Opportunisticdatastructureswithapplications.In
Foundations
ofComputerScience,2000.Proceedings.41stAnnualSymposiumon
,pages390Œ398.IEEE,
2000.[43]A.Fiannaca,L.LaPaglia,M.LaRosa,G.Renda,R.Rizzo,S.Gaglio,A.Urso,etal.
Deeplearningmodelsforbacteriataxonomicofmetagenomicdata.
BMCbioinformatics,19(7):198,2018.
[44]R.D.Finn,A.Bateman,J.Clements,P.Coggill,R.Y.Eberhardt,S.R.Eddy,A.Heger,
K.Hetherington,L.Holm,J.Mistry,etal.Pfam:theproteinfamiliesdatabase.
Nucleicacidsresearch
,42(D1):D222ŒD230,2013.
[45]R.D.Finn,P.Coggill,R.Y.Eberhardt,S.R.Eddy,J.Mistry,A.L.Mitchell,S.C.Potter,
M.Punta,M.Qureshi,A.Sangrador-Vegas,etal.Thepfamproteinfamiliesdatabase:
towardsamoresustainablefuture.
Nucleicacidsresearch
,44(D1):D279ŒD285,2015.
[46]K.Fukushima.Neocognitron:Aself-organizingneuralnetworkmodelforamechanismof
patternrecognitionunaffectedbyshiftinposition.
Biologicalcybernetics
,36(4):193Œ202,
1980.[47]Z.Gao,C.-h.Tseng,Z.Pei,andM.J.Blaser.Molecularanalysisofhumanforearmsuper-
skinbacterialbiota.
ProceedingsoftheNationalAcademyofSciences
,104(8):2927Œ
2932,2007.
[48]F.Giordano,L.Aigrain,M.A.Quail,P.Coupland,J.K.R.M.Davies,G.Tischler,
D.K.Jackson,T.M.Keane,J.Li,etal.Denovoyeastgenomeassembliesfromminion,
pacbioandmiseqplatforms.
reports
,7(1):3935,2017.
[49]M.Gîrdea,L.Noé,andG.Kucherov.Back-translationfordiscoveringdistantproteinho-
mologies.In
InternationalWorkshoponAlgorithmsinBioinformatics
,pages108Œ120.
Springer,2009.
122[50]M.Gîrdea,L.Noé,andG.Kucherov.Back-translationfordiscoveringdistantproteinho-
mologiesinthepresenceofframeshiftmutations.
AlgorithmsforMolecularBiology
,5(1):6,
2010.
[51]G.GonnellaandS.Kurtz.Readjoiner:afastandmemoryefstringgraph-based
sequenceassembler.
BMCbioinformatics
,13(1):82,2012.
[52]I.Goodfellow,Y.Bengio,andA.Courville.
Deeplearning
.MITpress,Cambridge,Mas-
sachusetts,UnitedStates,2016.
[53]X.GuanandE.C.Uberbacher.Alignmentsofdnaandproteinsequencescontaining
frameshifterrors.
Bioinformatics
,12(1):31Œ40,1996.
[54]D.Gu
Algorithmsonstrings,treesandsequences:computerscienceandcomputa-
tionalbiology
.Cambridgeuniversitypress,1997.
[55]T.Hackl,R.Hedrich,J.Schultz,andF.Förster.proovread:large-scalehigh-accuracypacbio
correctionthroughiterativeshortreadconsensus.
Bioinformatics
,30(21):3004Œ3011,2014.
[56]D.H.Haft,B.J.Loftus,D.L.Richardson,F.Yang,J.A.Eisen,I.T.Paulsen,andO.White.
Tigrfams:aproteinfamilyresourceforthefunctionalofproteins.
Nucleic
acidsresearch
,29(1):41Œ43,2001.
[57]D.H.Haft,J.D.Selengut,andO.White.Thetigrfamsdatabaseofproteinfamilies.
Nucleic
acidsresearch
,31(1):371Œ373,2003.
[58]E.Haghshenas,F.Hach,S.C.Sahinalp,andC.Chauve.Colormap:Correctinglongreads
bymappingshortreads.
Bioinformatics
,32(17):i545Œi551,2016.
[59]E.Halperin,S.Faigler,andR.Gill-More.Frameplus:aligningdnatoproteinsequences.
Bioinformatics
,15(11):867Œ873,1999.
[60]K.Hayashi,N.Morooka,Y.Yamamoto,K.Fujita,K.Isono,S.Choi,E.Ohtsubo,T.Baba,
B.L.Wanner,H.Mori,etal.Highlyaccurategenomesequencesofescherichiacolik-12
strainsmg1655andw3110.
Molecularsystemsbiology
,2(1),2006.
[61]K.He,X.Zhang,S.Ren,andJ.Sun.DelvingdeepintoSurpassinghuman-
levelperformanceonimagenetIn
ProceedingsoftheIEEEinternational
conferenceoncomputervision
,pages1026Œ1034,2015.
[62]K.He,X.Zhang,S.Ren,andJ.Sun.Deepresiduallearningforimagerecognition.In
ProceedingsoftheIEEEconferenceoncomputervisionandpatternrecognition
,pages
770Œ778,2016.
[63]D.HendrycksandK.Gimpel.Abaselinefordetectingandout-of-distribution
123
examplesinneuralnetworks.
arXivpreprintarXiv:1610.02136
,2016.
[64]D.Hendrycks,M.Mazeika,andT.G.Dietterich.Deepanomalydetectionwithoutlier
exposure.
arXivpreprintarXiv:1812.04606
,2018.
[65]S.Hochreiter,M.Heusel,andK.Obermayer.Fastmodel-basedproteinhomologydetection
withoutalignment.
Bioinformatics
,23(14):1728Œ1736,2007.
[66]S.HochreiterandJ.Schmidhuber.Longshort-termmemory.
Neuralcomputation
,
9(8):1735Œ1780,1997.
[67]M.Jain,S.Koren,K.H.Miga,J.Quick,A.C.Rand,T.A.Sasani,J.R.Tyson,A.D.Beggs,
A.T.Dilthey,I.T.Fiddes,etal.Nanoporesequencingandassemblyofahumangenome
withultra-longreads.
Naturebiotechnology
,36(4):338,2018.
[68]D.Joseph,J.Meidanis,andP.Tiwari.Determiningdnasequencesimilarityusingmaximum
independentsetalgorithmsforintervalgraphs.In
ScandinavianWorkshoponAlgorithm
Theory
,pages326Œ337.Springer,1992.
[69]D.JurafskyandJ.H.Martin.
SpeechandLanguageProcessing(2ndEdition)
.Prentice-Hall,
Inc.,UpperSaddleRiver,NJ,USA,2009.
[70]J.KärkkäinenandP.Sanders.Simplelinearworksufarrayconstruction.In
International
ColloquiumonAutomata,Languages,andProgramming
,pages943Œ955.Springer,2003.
[71]T.Kasai,G.Lee,H.Arimura,S.Arikawa,andK.Park.Linear-time
computationinsufarraysanditsapplications.In
AnnualSymposiumonCombinatorial
PatternMatching
,pages181Œ192.Springer,2001.
[72]K.KatohandD.M.Standley.Mafftmultiplesequencealignmentsoftwareversion7:im-
provementsinperformanceandusability.
Molecularbiologyandevolution
,30(4):772Œ780,
2013.
[73]W.J.Kent.BlatŠtheblast-likealignmenttool.
Genomeresearch
,12(4):656Œ664,2002.
[74]Y.Kim.Convolutionalneuralnetworksforsentence
arXivpreprint
arXiv:1408.5882
,2014.
[75]D.P.KingmaandJ.Ba.Adam:Amethodforstochasticoptimization.
arXivpreprint
arXiv:1412.6980
,2014.
[76]A.Kislyuk,A.Lomsadze,A.L.Lapidus,andM.Borodovsky.Frameshiftdetectionin
prokaryoticgenomicsequences.
Internationaljournalofbioinformaticsresearchandap-
plications
,5(4):458Œ477,2009.
124
[77]M.Kokot,M.andS.Deorowicz.Kmc3:countingandmanipulatingk-merstatis-
tics.
Bioinformatics
,33(17):2759Œ2761,2017.
[78]S.Koren,M.C.Schatz,B.P.Walenz,J.Martin,J.T.Howard,G.Ganapathy,Z.Wang,D.A.
Rasko,W.R.McCombie,E.D.Jarvis,etal.Hybriderrorcorrectionanddenovoassembly
ofsingle-moleculesequencingreads.
Naturebiotechnology
,30(7):693Œ700,2012.
[79]S.Koren,B.P.Walenz,K.Berlin,J.R.Miller,N.H.Bergman,andA.M.Phillippy.Canu:
scalableandaccuratelong-readassemblyviaadaptivek-merweightingandrepeatsepara-
tion.
Genomeresearch
,27(5):722Œ736,2017.
[80]J.Korlach.Understandingaccuracyinsmrt
R

sequencing,2013.
[81]J.KorlachandS.W.Turner.Zero-modewaveguides.In
EncyclopediaofBiophysics
,pages
2793Œ2795.Springer,2013.
[82]A.Krogh,M.Brown,I.S.Mian,K.Sjölander,andD.Haussler.Hiddenmarkovmodels
incomputationalbiology:Applicationstoproteinmodeling.
Journalofmolecularbiology
,
235(5):1501Œ1531,1994.
[83]B.LangmeadandS.L.Salzberg.Fastgapped-readalignmentwithbowtie2.
Naturemeth-
ods
,9(4):357Œ359,2012.
[84]B.Langmead,C.Trapnell,M.Pop,andS.L.Salzberg.Ultrafastandmemory-ef
alignmentofshortdnasequencestothehumangenome.
Genomebiology
,10(3):1,2009.
[85]Y.LeCun,Y.Bengio,andG.Hinton.Deeplearning.
nature
,521(7553):436,2015.
[86]Y.LeCun,B.E.Boser,J.S.Denker,D.Henderson,R.E.Howard,W.E.Hubbard,and
L.D.Jackel.Handwrittendigitrecognitionwithaback-propagationnetwork.In
Advances
inneuralinformationprocessingsystems
,pages396Œ404,1990.
[87]H.Li.Minimapandminiasm:fastmappinganddenovoassemblyfornoisylongsequences.
Bioinformatics
,32(14):2103Œ2110,2016.
[88]H.Li.Minimap2:pairwisealignmentfornucleotidesequences.
Bioinformatics
,1:7,2018.
[89]H.LiandR.Durbin.Fastandaccuratelong-readalignmentwithburrowsŒwheelertrans-
form.
Bioinformatics
,26(5):589Œ595,2010.
[90]H.LiandN.Homer.Asurveyofsequencealignmentalgorithmsfornext-generationse-
quencing.
inbioinformatics
,11(5):473Œ483,2010.
[91]S.Li,J.Chen,andB.Liu.Proteinremotehomologydetectionbasedonbidirectionallong
short-termmemory.
BMCbioinformatics
,18(1):443,2017.
125
[92]L.LoConte,B.Ailey,T.J.Hubbard,S.E.Brenner,A.G.Murzin,andC.Chothia.Scop:a
structuralofproteinsdatabase.
Nucleicacidsresearch
,28(1):257Œ259,2000.
[93]N.J.Loman,J.Quick,andJ.T.Simpson.Acompletebacterialgenomeassembleddenovo
usingonlynanoporesequencingdata.
Naturemethods
,2015.
[94]B.MaandM.Li.Onthecomplexityofthespacedseeds.
JournalofComputerandSystem
Sciences
,73(7):1024Œ1034,2007.
[95]B.Ma,J.Tromp,andM.Li.PatternHunter:fasterandmoresensitivehomologysearch.
Bioinformatics
,18(3):440Œ445,2002.
[96]L.v.d.MaatenandG.Hinton.Visualizingdatausingt-sne.
Journalofmachinelearning
research
,9(Nov):2579Œ2605,2008.
[97]U.ManberandG.Myers.Sufarrays:anewmethodforon-linestringsearches.
siam
JournalonComputing
,22(5):935Œ948,1993.
[98]F.Meyer,R.Overbeek,andA.Rodriguez.Figfams:yetanothersetofproteinfamilies.
Nucleicacidsresearch
,37(20):6643Œ6654,2009.
[99]G.Miclotte,M.Heydari,P.Demeester,S.Rombauts,Y.VandePeer,P.Audenaert,and
J.Fostier.Jabba:hybriderrorcorrectionforlongsequencingreads.
AlgorithmsforMolec-
ularBiology
,11(1):1,2016.
[100]E.W.Myers.Thefragmentassemblystringgraph.
Bioinformatics
,21(suppl_2):ii79Œii85,
2005.
[101]E.W.Myers,G.G.Sutton,A.L.Delcher,I.M.Dew,D.P.Fasulo,M.J.Flanigan,S.A.
Kravitz,C.M.Mobarry,K.H.Reinert,K.A.Remington,etal.Awhole-genomeassembly
ofdrosophila.
Science
,287(5461):2196Œ2204,2000.
[102]G.Myers.Eflocalalignmentdiscoveryamongstnoisylongreads.In
International
WorkshoponAlgorithmsinBioinformatics
,pages52Œ67.Springer,2014.
[103]N.NagarajanandM.Pop.Sequenceassembly
NatureReviewsGenetics
,
14(3):157Œ167,2013.
[104]V.NairandG.E.Hinton.linearunitsimproverestrictedboltzmannmachines.
In
Proceedingsofthe27thinternationalconferenceonmachinelearning(ICML-10)
,pages
807Œ814,2010.
[105]S.B.NeedlemanandC.D.Wunsch.Ageneralmethodapplicabletothesearchforsimilar-
itiesintheaminoacidsequenceoftwoproteins.
Journalofmolecularbiology
,48(3):443Œ
453,1970.
126
F.NicolasandE.Rivals.Hardnessofoptimalspacedseeddesign.JournalofComputerandSystemSciences,74(5):831Œ849,2008.S.K.Nielsen,T.L.Koch,F.Hauser,A.Garm,andC.J.Grimmelikhuijzen.Denovotranscriptomeassemblyofthecubomedusatripedaliacystophora,includingtheanalysisofasetofgenesinvolvedinpeptidergicneurotransmission.BMCgenomics,20(1):175,2019.L.NoéandG.Kucherov.
YASS:SimilaritysearchinDNAsequences.PhDthesis,INRIA,2003.L.NoéandG.Kucherov.
ImprovedhitcriteriaforDNAlocalalignment.BMCbioinfor-matics,5(1):149,2004.L.NoéandG.Kucherov.
YASS:enhancingthesensitivityofDNAsimilaritysearch.Nucleicacidsresearch,33(suppl_2):W540ŒW543,2005.NVIDIACorporation.Apex,2019.J.Oh,A.L.Byrd,C.Deming,S.Conlan,B.Barnabas,R.Blakesley,
G.Bouffard,S.Brooks,Coleman,M.Dekhtyar,etal.Biogeographyandindividualityshapefunctioninthehumanskinmetagenome.Nature,514(7520):59,2014.C.Olah.Understandinglstmnetworks.,2018.Aug,2015.Y.Ono,K.Asai,andM.Hamada.Pbsim:PacbioreadssimulatorŠtowardaccurategenomeassembly.
Bioinformatics,29(1):119Œ121,2013.[115]PBiosciences.Ecolik12mg1655resequencing.
Jul22,2013.[116]PBiosciences.H.sapiens10xsequencecoveragewithpacbiodata,2014.
[117]PBiosciences.Smrtanalysis.
2014.Oct15,2014.[118]Piences.E.colibacterialassemblyprimaryanalysis(instrumentoutput)data,
2016.AccessedMay13,2016.
[119]A.Paszke,S.Gross,S.Chintala,G.Chanan,E.Yang,Z.DeVito,Z.Lin,A.Desmaison,
L.Antiga,andA.Lerer.AutomaticdifferentiationinPyTorch.In
NIPSAutodiffWorkshop
,2017.127[120]M.PellegriniandT.O.Yeates.Searchingforframeshiftevolutionaryrelationshipsbetween
proteinsequencefamilies.
Proteins:Structure,Function,andBioinformatics
,37(2):278Œ
283,1999.
[121]H.Peltola,H.Söderlund,andE.Ukkonen.Algorithmsforthesearchofaminoacidpatterns
innucleicacidsequences.
Nucleicacidsresearch
,14(1):99Œ107,1986.
[122]R.Poplin,P.-C.Chang,D.Alexander,S.Schwartz,T.Colthurst,A.Ku,D.Newburger,
J.Dijamco,N.Nguyen,P.T.Afshar,etal.Auniversalsnpandsmall-indelvariantcaller
usingdeepneuralnetworks.
Naturebiotechnology
,36(10):983,2018.
[123]E.Prestat,M.M.David,J.Hultman,N.Ta¸s,R.Lamendella,J.Dvornik,R.Mackelprang,
D.D.Myrold,A.Jumpponen,S.G.Tringe,etal.Foam(functionalontologyassignmentsfor
metagenomes):ahiddenmarkovmodel(hmm)databasewithenvironmentalfocus.
Nucleic
acidsresearch
,42(19):e145Œe145,2014.
[124]M.A.Quail,M.Smith,P.Coupland,T.D.Otto,S.R.Harris,T.R.Connor,A.Bertoni,H.P.
Swerdlow,andY.Gu.Ataleofthreenextgenerationsequencingplatforms:comparisonof
iontorrent,biosciencesandilluminamiseqsequencers.
BMCgenomics
,13(1):341,
2012.
[125]S.RajasekaranandM.Nicolae.Anelegantalgorithmfortheconstructionofsufarrays.
JournalofDiscreteAlgorithms
,27:21Œ28,2014.
[126]D.A.Rasko,D.R.Webster,J.W.Sahl,A.Bashir,N.Boisen,F.Scheutz,E.E.Paxinos,
R.Sebra,C.-S.Chin,D.Iliopoulos,etal.Originsofthee.colistraincausinganoutbreakof
hemolyticŒuremicsyndromeingermany.
NewEnglandJournalofMedicine
,365(8):709Œ
717,2011.
[127]J.Ren,K.Song,C.Deng,N.A.Ahlgren,J.A.Fuhrman,Y.Li,X.Xie,andF.Sun.Iden-
tifyingvirusesfrommetagenomicdatabydeeplearning.
arXivpreprintarXiv:1806.07810
,
2018.
[128]A.RhoadsandK.F.Au.Pacbiosequencinganditsapplications.
Genomics,proteomics&
bioinformatics
,13(5):278Œ289,2015.
[129]D.E.Rumelhart,G.E.Hinton,andR.J.Williams.Learningrepresentationsbyback-
propagatingerrors.
nature
,323(6088):533,1986.
[130]L.SalmelaandE.Rivals.Lordec:accurateandeflongreaderrorcorrection.
Bioin-
formatics
,pagebtu538,2014.
[131]T.Schiex,J.Gouzy,A.Moisan,andY.deOliveira.Framed:axibleprogramforquality
checkandgenepredictioninprokaryoticgenomesandnoisymaturedeukaryoticsequences.
Nucleicacidsresearch
,31(13):3738Œ3741,2003.
128
[132]S.Schwartz,W.J.Kent,A.Smit,Z.Zhang,R.Baertsch,R.C.Hardison,D.Haussler,and
W.Miller.HumanŒmousealignmentswithBLASTZ.
Genomeresearch
,13(1):103Œ107,
2003.
[133]S.Seo,M.Oh,Y.Park,andS.Kim.Deepfam:deeplearningbasedalignment-freemethod
forproteinfamilymodelingandprediction.
Bioinformatics
,34(13):i254Œi262,2018.
[134]D.Sharon,H.Tilgner,F.Grubert,andM.Snyder.Asingle-moleculelong-readsurveyof
thehumantranscriptome.
Naturebiotechnology
,31(11):1009,2013.
[135]J.T.SimpsonandR.Durbin.Efconstructionofanassemblystringgraphusingthe
FM-index.
Bioinformatics
,26(12):i367Œi373,2010.
[136]I.Sovi
´
c,K.vi
´
c,K.Skala,andM.−iki
´
c.Evaluationofhybridandnon-hybridmeth-
odsfordenovoassemblyofnanoporereads.
Bioinformatics
,32(17):2582Œ2589,2016.
[137]I.Sovi
´
c,M.−iki
´
c,A.Wilm,S.N.Fenlon,S.Chen,andN.Nagarajan.Fastandsensitive
mappingofnanoporesequencingreadswithgraphmap.
Naturecommunications
,7,2016.
[138]N.Srivastava,G.Hinton,A.Krizhevsky,I.Sutskever,andR.Salakhutdinov.Dropout:a
simplewaytopreventneuralnetworksfromov
Thejournalofmachinelearning
research
,15(1):1929Œ1958,2014.
[139]Y.SunandJ.Buhler.DesigningmultiplesimultaneousseedsforDNAsimilaritysearch.
JournalofComputationalBiology
,12(6):847Œ861,2005.
[140]H.Tilgner,F.Grubert,D.Sharon,andM.P.Snyder.apersonal,and
single-moleculelong-readtranscriptome.
ProceedingsoftheNationalAcademyofSciences
,
111(27):9869Œ9874,2014.
[141]B.J.Tindall,J.Sikorski,S.Lucas,E.Goltsman,A.Copeland,T.GlavinaDelRio,M.Nolan,
H.Tice,J.F.Cheng,C.Han,S.Pitluck,K.Liolios,N.Ivanova,K.Mavromatis,G.Ovchin-
nikova,A.Pati,R.Faehnrich,L.Goodwin,A.Chen,K.Palaniappan,M.Land,L.Hauser,
Y.J.Chang,C.D.Jeffries,M.Rohde,M.Goeker,T.Woyke,J.Bristow,J.A.Eisen,
V.Markowitz,P.Hugenholtz,N.C.Kyrpides,H.P.Klenk,andA.Lapidus.Complete
genomesequenceofmeiothermusrubertypestrain(21).
Standardsingenomicsciences
,
3(1):26Œ36,2010.
[142]E.Trost,S.Götker,J.Schneider,S.Schneiker-Bekel,R.Szczepanowski,A.Tilker,P.Vieho-
ever,W.Arnold,T.Bekel,J.Blom,etal.Completegenomesequenceandlifestyleofblack-
pigmentedcorynebacteriumaurimucosumatcc700975(formerlyc.nigricanscn-1)isolated
fromavaginalswabofawomanwithspontaneousabortion.
BMCgenomics
,11(1):91,
2010.
[143]B.Trzaskowski,D.Latek,S.Yuan,U.Ghoshdastider,A.Debinski,andS.Filipek.Action
129
ofmolecularswitchesingpcrs-theoreticalandexperimentalstudies.
Currentmedicinal
chemistry
,19(8):1090Œ1109,2012.
[144]Y.-C.Tsai,S.Conlan,C.Deming,J.A.Segre,H.H.Kong,J.Korlach,J.Oh,N.C.S.Pro-
gram,etal.Resolvingthecomplexityofhumanskinmetagenomesusingsingle-molecule
sequencing.
MBio
,7(1):e01948Œ15,2016.
[145]R.VanBuren,D.Bryant,P.P.Edger,H.Tang,D.Burgess,D.Challabathula,K.Spittle,
R.Hall,J.Gu,E.Lyons,etal.Single-moleculesequencingofthedesiccation-tolerantgrass
oropetiumthomaeum.
Nature
,527(7579):508,2015.
[146]A.Vaswani,N.Shazeer,N.Parmar,J.Uszkoreit,L.Jones,A.N.Gomez,Kaiser,and
I.Polosukhin.Attentionisallyouneed.In
Advancesinneuralinformationprocessing
systems
,pages5998Œ6008,2017.
[147]S.VingaandJ.Almeida.Alignment-freesequencecomparisonŠareview.
Bioinformatics
,
19(4):513Œ523,2003.
[148]A.Viterbi.Errorboundsforconvolutionalcodesandanasymptoticallyoptimumdecoding
algorithm.
IEEEtransactionsonInformationTheory
,13(2):260Œ269,1967.
[149]M.Wang,L.Tu,D.Yuan,D.Zhu,C.Shen,J.Li,F.Liu,L.Pei,P.Wang,G.Zhao,etal.
Referencegenomesequencesoftwocultivatedallotetraploidcottons,gossypiumhirsutum
andgossypiumbarbadense.
Naturegenetics
,51(2):224,2019.
[150]Q.Wang,J.F.Quensen,J.A.Fish,T.K.Lee,Y.Sun,J.M.Tiedje,andJ.R.Cole.Ecological
patternsofnifhgenesinfourterrestrialclimaticzonesexploredwithtargetedmetagenomics
usingframebot,anewinformaticstool.
MBio
,4(5):e00592Œ13,2013.
[151]C.-L.Xiao,S.Zhu,M.He,D.Chen,Q.Zhang,Y.Chen,G.Yu,J.Liu,S.-Q.Xie,F.Luo,
etal.N6-methyladeninednainthehumangenome.
Molecularcell
,71(2):306Œ
318,2018.
[152]E.M.ZdobnovandR.Apweiler.InterproscanŒanintegrationplatformforthesignature-
recognitionmethodsininterpro.
Bioinformatics
,17(9):847Œ848,2001.
[153]H.Zeng,M.D.Edwards,G.Liu,andD.K.Gifford.Convolutionalneuralnetworkarchi-
tecturesforpredictingdnaŒproteinbinding.
Bioinformatics
,32(12):i121Œi127,2016.
[154]Y.ZhangandY.Sun.Hmm-frame:accurateproteindomainformetagenomic
sequencescontainingframeshifterrors.
BMCbioinformatics
,12(1):198,2011.
[155]Z.Zhang,W.R.Pearson,andW.Miller.Aligningadnasequencewithaproteinsequence.
JournalofComputationalBiology
,4(3):339Œ349,1997.
130