INFERENCEOFVIRALSTRAINSUSINGMETAGENOMICSDATA
By
JiaoChen
ADISSERTATION
Submittedto
MichiganStateUniversity
inpartialoftherequirements
forthedegreeof
ComputerScienceŠDoctorofPhilosophy
2018
ABSTRACT
INFERENCEOFVIRALSTRAINSUSINGMETAGENOMICSDATA
ByJiaoChen
RNAViruses,suchashumanyvirus(HIV),andhepatitisCvirus
(HCV),havegreatimpactonhumanhealth.ThehighmutationrateofRNAvirusescanproduce
apopulationofdifferentbutcloselyrelatedvirussequencescalledviralquasispecies.Todevelop

preventionandtreatmentstrategiesforviralpathogens,anessentialandfundamentalstepisto

characterizetheirsequencesandabundancesatthestrain-levelforaviralquasispecies.
Advancesinnext-generationsequencing(NGS)technologieshaveopenedupnewopportuni-
tiestostudyviruses.Viralmetagenomicdata,whichcontainsthegeneticinformationforabunch

ofvirusesinthesamehabitat,havebecomethemajorresourceforcharacterizingRNAviruses.

Althoughtherearemanypipelinesforanalyzingvirusesinmetagenomicdata,theyusuallylack

threefunctions.First,novelvirusesorvirusesthatlackcloselyrelatedreferencegenomescannot

bedetectedwithhighsensitivity.Second,strain-levelanalysisisusuallymissing.Althoughthere

areseveralassemblytoolsdesignedforviralhaplotypes,
denovo
assemblyofvirus
genomesisstillacomputationallychallengetaskduetotheerror-proneshortreads,highsimilar-

itiesbetweenrelatedstrains,andunknownnumberorabundanceofvirushaplotypes.Third,itis

hardtoestimatethenumberofhaplotypesandtheirabundancesinaquasispecies.
Inthisdissertation,wehavedevelopedapipelinewiththreetools,TAR-VIRPEHaploand
VirBintoaddressthechallengesexistingforviralmetagenomicassemblyandanalysis.TAR-

VIRisatoolforclassifyingandenrichingviralreadsfrommetagenomicdatawithoutrelying

oncompleteorhigh-qualityreferencegenomes.ItisoptimizedforidentifyingRNAviruses
frommetagenomicdatawithanefBurrows-WheelerTransform(BWT)basedoverlapping
method.TAR-VIRwastestedonbothsimulatedandrealviralmetagenomicdatasets.Theresults
demonstratedTAR-VIRcompetesfavorablywithbenchmarkedtools.PEHaploisa
denovo
hap-
lotypereconstructiontool,whichemployspaired-endreadstodistinguishhighlysimilarstrainsfor
viralquasispeciesdata.Itwasappliedonbothsimulatedandrealquasispeciesdata,andtheresults
werebenchmarkedagainstseveralrecentlypublished
denovo
haplotypereconstructiontools.The
comparisonshowsthatPEHaplooutperformsthebenchmarkedtoolsinacomprehensivesetof
metrics.Withassembledviralcontigs,VirBinfocusesonestimatingthenumberofhaplotypes
andclusteringthecontigstodifferenthaplotypes.VirBinwindowsfromcontigs
alignmenttoestimatehaplotypenumberandbettercalculatethecontigabundances.Then,
itappliesanExpectation-Maximizationmethodtoclusterthecontigsbasedoncontigabundance
levels.TheexperimentalresultsofVirBinonbothsimulatedandrealdatasetsshowthatthe
window-basedmethodcanpreciselyestimatethehaplotypenumber,andgeneratemoreaccurate
abundanceestimationforcontigsthatwilleventuallyleadtosuperiorclusteringresults.Inaddi-
tion,thisdissertationalsocontainsonechapterfortheotherworkwehavedoneforidentifying
theprimarytranscriptionstartsitesformiRNAgenesinCaenorhabditiselegansandmousewith
Cap-seqdata.
ACKNOWLEDGMENTS
Firstandforemost,IwouldliketothankmyadvisorDr.YanniSun.Herperfectsupervisionhas
beenguidingmethroughallmyresearchanddissertationworkduringmyPh.D.study.Nothing
couldbeachievedwithoutherdevotedtime,patience,andinspiration.Duringthepastveandhalf
yearsstudyingatMichiganStateUniversity,Dr.Sunhelpedmeonmyresearchwork,courses,
andhowtowriteorpresentresearchresults.Ihavebeenkeptimprovinginmakingpresentations
toaudiencesfromdifferentbackgrounds,modelingbioinformaticproblemswithcomputational
techniques,andreadingorwritingpapers.
Next,IwanttothankmyothercommitteemembersDr.DavidArnosti,Dr.KevinLiu,andDr.
JinChen.Dr.Arnosticarefullyreadandprovidedmewithmanyvaluablesuggestionsformy
paperonmiRNAs.HealsoinvitedmetojoinandtalkonJournalClub,whichwasanimportant
partofmyPh.D.training.Dr.LiugavemethelectureofCSE836(ComputationalBiology)on
mysecond-yearstudyinMSU,whichisacoursecloselyrelatedtomyPh.D.research.Healso
provideduswithseveralchancesofpresentingpapers.Dr.Chenhelpedmealotforeithermy
comprehensiveexamordefense.IalsothankDr.JianrongWang.Hehadbeenhelpingmeon
thebinningprojectbyprovidinghelpfuldiscussionsandbeingalwaysapproachable.
IalsothankmylabmatesDr.YuanZhang,Dr.ChengYuan,Dr.JikaiLei,Dr.Rujira
Achawanantakun,NanDuandYumengWen,fortheirhelponmyeitherresearchworkordaily
life.MythankalsogoestomypreviousroommateChaoyueLiuandShaohuaYang.
Finally,Iwanttoexpressmysinceregratitudetomyparentsandsister.Theyhavealwaysbeen
encouragingandsupportingmeforpursuingmycareer,andarealwaystherewheneverIneed
them.
iv
TABLEOFCONTENTS
LISTOFTABLES
.......................................
viii
LISTOFFIGURES
......................................
x
LISTOFALGORITHMS
...................................
xv
Chapter1Introduction
..................................
1
1.1Overviewofthepipeline...............................3
1.2Next-generationsequencing..............................5
1.2.1Librarypreparation..............................6
1.2.2ClusterGeneration..............................6
1.2.3CyclicReversibleTermination(CRT)Sequencing..............7
1.2.4Dataanalysis.................................8
1.2.5Paired-EndSequencing............................8
1.3Virusquasispecies...................................10
1.3.1Thequasispeciestheory............................11
Chapter2TAR-VIR:TARgetedVIRalreadsandstrainreconstruc-
tionfrommetagenomicdata
.........................
14
2.1Background......................................14
2.1.1Relatedwork.................................15
2.1.2Overviewofourwork.............................17
2.2Methods........................................18
2.2.1Twoscenarios.................................18
2.2.2Validityofreadrecruitmentusingoverlapdetection.............20
2.2.2.1Sequencingerrors.........................21
2.2.2.2Chimericreads...........................22
2.2.3Readrecruiting................................22
2.2.3.1Uniqueimplementationstrategies.................25
2.2.4Iterativesearch................................26
2.2.4.1Runningtimeandmemoryusage.................27
2.2.5Strain-levelassembly.............................28
2.3Resultsanddiscussion................................28
2.3.1Sizesofcommonregionsbetweenhumanvirusesandothermicrobialspecies29
2.3.2Exp1:reconstructtheSARShaplotypesusingthebatcoronavirusasthe
reference...................................30
2.3.2.1Datapropertiesandevaluationmetrics..............30
2.3.2.2Performanceofreadrecruitment..................33
2.3.3Exp2:characterizinghepatitisvirusesfromthehumanplasmadata.....37
2.3.3.1Preprocessing............................38
v
2.3.3.2RecruitedreadsbyTAR-VIRcanimprovetheperformanceof
de
novo
assembly...........................38
2.3.3.3Comparisonwithreference-basedandextension-basedassembly
methods..............................41
2.3.3.4Assemblingthewholedatasetdirectly..............42
2.3.4Identifyingvirusescontainingtargetgenes..................42
2.3.5Computationaltimeandmemoryusage...................44
2.4Conclusions......................................45
Chapter3Denovohaplotypereconstructioninvirusquasispeciesusingpaired-
endreads
...................................
47
3.1Introduction......................................47
3.2Methods........................................50
3.2.1Overlaphgraphandpaired-endgraph....................51
3.2.2MutationRateforSequenceReplicationandProbabilityofLongestCom-
monSubstring(LCS).............................51
3.2.3Usepaired-endreadstodistinguishdifferenthaplotypes...........55
3.2.4ThewholepipelineofPEHaplo.......................56
3.2.4.1Datapre-processing........................58
3.2.4.2Overlapgraphconstruction....................59
3.2.4.3Graphpruning...........................60
3.2.4.4Paired-endguidedpath...................63
3.2.4.5Correctingcontigswithpaired-endreaddistribution.......69
3.3Results.........................................69
3.3.1LCSprobabilitysimulation..........................70
3.3.2BenchmarkonsimulatedHIVdataset....................71
3.3.2.1Paired-endguidedpathisabletogenerateaccuratelong
contigs...............................73
3.3.3BenchmarkonMiSeqdataset........................74
3.3.4BechmarkonsimulatedbiasedHIVdatasets................77
3.3.5Bechmarkondataset........................78
3.3.6Computationaltimeandmemoryusage...................79
3.4DiscussionandConclusion..............................79
Chapter4Alignmentwindowsbasedviralstrain-levelcontigsbinning
.......
81
4.1Introduction......................................81
4.2Methods........................................82
4.2.1Problem..............................82
4.2.2TheVirBinalgorithmoverview........................83
4.2.3Estimatehaplotypenumberbycontigsalignmentandwindowsextraction.84
4.2.4Expectation-Maximizationmethodforbinningcontigs...........86
4.3Results.........................................88
4.3.1HIVsimulateddataset............................88
4.3.1.1Datasimulation...........................88
4.3.1.2Resultsfor5HIVhaplotypes...................90
vi
4.3.1.3Resultsfor10HIVhaplotypes...................93
4.3.2HIVrealMiSeqdataset...........................95
4.3.2.1HIVrealdatasetandcontigs...................95
4.3.2.2ResultsforHIVrealdataset....................96
4.4DiscussionandConclusion..............................99
Chapter5StudyingtranscriptionalregulationsofmiRNAgeneswithCap-seq
...
100
5.1Introduction......................................100
5.2MaterialsandMethods................................105
5.2.1Datasetsandprocessing............................105
5.2.2Clusteringof5'endreads..........................105
5.2.3Statisticalanalysis..............................106
5.2.4miRNAclusterandintergenicpre-miRNAs..........107
5.2.5FindingbidirectionalandmultipleTSSspromoters.............107
5.3Results.........................................108
5.3.1ofprimarymiRNATSSsin
C.elegans
andmouse......108
5.3.1.1OverviewofprimarymiRNATSSsannotation..........108
5.3.1.2Comparisonwithpreviouswork..................110
5.3.25'm
7
Gcappedpre-miRNAsarein
C.elegans
..........111
5.3.2.1Possible5'recessedRNAsenrichedbyCap-seq.........111
5.3.2.2ng5'm
7
Gcappedpre-miRNAswithpre-capTICs.....113
5.3.3M
7
Gcappedpre-miRNAsoftenhaveupstreamprimaryTICs........115
5.3.45p-miRNAsareproducedfromthem
7
Gcappedpre-miRNAs..117
5.3.5MultipletranscriptioninitiationsitesformiRNAclustersin
C.elegans
...118
5.3.5.15're-capafterpost-transcriptionalprocessing...........119
5.3.5.2MultipleTSSscanbegeneratedfromthesamepromoter.....121
5.3.5.3Pre-miRNAsinaclustercanbetranscribedindependently....123
5.3.6ChromatinandPolIIofprimarymiRNApromoters........124
5.3.6.1IdentiofdivergentandmultipleTSSspromoters.....124
5.3.6.2ChromatinandPolIIsurroundingmiRNApromoters..125
5.4Discussion.......................................126
Chapter6ConclusionandFuturework
........................
130
6.1Conlusions.......................................130
6.2Futurework......................................132
6.2.1Reference-freeviralsequences.................132
6.2.2Virusquasispeciesassemblyfuturework...................132
6.2.3Binninglow-qualitycontigs.........................133
BIBLIOGRAPHY
.......................................
134
vii
LISTOFTABLES
Table2.1:ReadrecruitmentresultsbyusingseedsetsconstructedwithBowtie2and
BWA.TheﬁAlignmentﬂsectioncontainsresultsforalignedreads.The
ﬁRecruitmentﬂsectioncontainsresultsforrecruitedreadsbyTAR-VIR
usingthealignedreads.Foreachrow,thealignedreadsinﬁAlignment"
sectionaretheseedsetfortherecruitedreadsinﬁRecruitmentﬂsection.
ForBowtie2,theﬁŒscore-minﬂparameterwassettoallowdifferentalign-
menterrorratescorrespondingto5%,10%,15%,and20%,respectively.
ForBWA,ﬁ-Aﬂisedasitsdefaultvalue1.ﬁ-Bﬂwastoal-
lowdifferenterrorratesimilartoBowtie2.ﬁNumberﬂisthenumberof
alignedorrecruitedreads.ﬁDepthﬂistheaveragesequencingcoverage.
ﬁCoverageﬂisthepercentageofgenomecoveringbyatleastoneread.h1
toh5representtheveSARS-CoVhaplotypes...............34
Table2.2:AssemblyresultsonSARS-CoValignedandrecruitedmetagenomicdata.
ThedefaultassemblytoolinTAR-VIRisPEHaplo.Theofthe
metricscanbefoundinSection2.3.2.1...................37
Table2.3:AssemblyresultsonSARS-CoVmetagenomicdataforTAR-VIRand
PRICE....................................37
Table2.4:Overlapextensionresultsusingdifferentseedset
R
0
.`#'represents`num-
ber'.TheshadedregionsinthistableandTable2.5highlightthecase
wherelessrecruitedreadscanproducebetterassemblyresultsthanaligned
readsonly..................................39
Table2.5:Assemblyevaluationresultsonalignedandrecruitedreadsusingthegenomes
ofHBV,HCV,andHPgVasreferences.`cov.'istheabbreviationfor
`coverage'.ThedefaultassemblycomponentinTAR-VIRisPEHaplo..40
Table2.6:Assemblyresultscomparisonwithreference-andextension-basedmeth-
ods......................................41
Table2.7:AssemblyresultsonrecruitedreadswithapartialCDSsequencefor
HPgVasreference..............................43
Table2.8:Timeandmemoryusageforoverlapextensionandassemblyonviral
metagenomicdatafromhumanplasam.The
denovo
assemblytimeand
memoryusagewereevaluatedonrecruitedreadsbasedonmismatchrate
from5%to20%...............................44
Table3.1:Pairwisesequencesimilaritybetween5HIV-1strains...........70
viii
Table3.2:Longestcommonsubstring(LCS)between5HIV-1strains.Thesestrains
havesimilarlengthsofabout10kbp....................70
Table3.3:AssemblyresultsonsimulatedHIVdatasetforIVA,MLEHaplo,SAV-
AGEandPEHaplo.Contigsthatareatleast500bparealignedtothe
truehaplotypesequenceswithasimilaritycutoffof98%.TheN50value
isthemaximallengththatallcontigsofatleastthislengthcoveratleast
halfofthetotalassemblylength.......................73
Table3.4:AssemblyresultsonrealHIVMiSeqdatasetforIVA,MLEHaplo,SAV-
AGEandPEHaplo.ContigsareevaluatedwithMetaQuastusingthesame
parameterstosimulateddataset.......................75
Table3.5:AssemblyresultsonsimulatedbiasedHXB2-NL43MiSeqdatasetfo
SAVAGEPEHaplo.ContigsareevaluatedwithMetaQuast.........77
Table3.6:AssemblyresultsonMiSeqdatasetforSAVAGEandPEHaplo.
ContigsareevaluatedwithMetaQuastusingthesameparameterstoHIV
datasets...................................78
Table4.1:Haplotypeabundancesforsimulated5HIVhaplotypescalculatedfrom
readsmapping................................90
Table4.2:EMclusteringresultsonsimulated5HIVhaplotypecontigsforVirBin
andMaxBin..................................92
Table4.3:Haplotypeabundancesforsimulated10HIVhaplotypescalculatedfrom
readsmapping................................93
Table4.4:EMclusteringresultsonsimulated10HIVhaplotypecontigsforVirBin
andMaxBin..................................95
Table4.5:Haplotypeabundancescalculatedfromreadsmapping...........96
Table4.6:EMclusteringresultsonassembled5realhaplotypecontigsforVirBin
andMaxBin..................................98
Table4.7:EMclusteringresultsonassembled5realhaplotypecontigsforVirBin
andMaxBin..................................98
Table5.1:GCcontent,OEratioandCpGnumberformiRNApromoterswithmul-
tipleTSSs.(A)miRNAclusters.(B)individualmiRNAs..........122
ix
LISTOFFIGURES
Figure1.1:Pipelineoverview.(1)TAR-VIRforenrichingviralreadsofinterestfrom
metagenomicdata.(2)PEHaplofor
denovo
assemblyofviralhaplo-
types.(3)VirBinforbinningassembledcontigsandestimatethehaplo-
typeabundances...............................4
Figure1.2:Librarypreparation.GenomicDNAisfragmentedandcommonadaptors
areligatedtobothendsofthem.......................6
Figure1.3:Solid-phaseamplication.Eachsingle-stranded,single-moleculetemplate
iscapturedbyprimerandimmobolizetothesolidsurface.Eachbound
templateisintoaclonalclusterthroughbridgeamplication...7
Figure1.4:TheIllumina/Solexafour-colourcyclicreversibletermination(CRT)method.
Ituses3'-O-azidomethylreversibleterminatorchemistryonsolid-phase-
templateclusters.Afterimaging,acleavagestepremovesthe
dyesandregeneratesthe3'-OHgroupfornext-roundincor-
poration....................................9
Figure1.5:Paired-endsequencingenablesbothendsoftheDNAfragmenttobese-
quenced.Thedistancebetweeneachpairedreadsisknownandalignment
algorithmscantakeadvantageoftomapthereadsoverrepetitiveregions
moreprecisely................................10
Figure1.6:Quasispeciesadaption.Quasispeciestendstogouphillsinhigh-dimensional
landscape.Thex-axisisthesequencespaceandthey-axisisthe
Thehighertheyget,thetheyare................13
Figure2.1:Twoscenarios.(A).Thereferenceisageneorotherfunctionalsite(long
greenbar).Thereadsarerepresentedbyshortlines.Shortgreenlines
canbemappedtothereferencesequenceandthesetofseedreads.
Theiterationofoverlapdetectionwillidentifynewreads(bluelines)
overlappingwiththeseedreads.Theseconditerationofoverlapdetec-
tionwillidentifymorereads(redlines).(B).Thereferenceisaremotely
relatedgenome(longgreenbar).Theseedreadscanbemappedtothe
referencegenomeandarerepresentedbyshortgreenlines.Twoitera-
tionsofoverlapdetectionwillrecruitnewreads(bluelinesandredlines,
respectively).................................19
x
Figure2.2:Thevisualizationoftheoutputofeachstepofthebackwardsearchfora
querysequenceAAAT.TheSAandthecorrespondingsufesaregrey
becausetheyarenotactuallyusedinthesearch.Thecomputedrange
foreachiterationishighlightedusinggrayboxencompassedbyarrows.
When
t
is3,thesearchshouldreturn
r
2
asitformsanoverlapofsize3
withAAAT..................................24
Figure2.3:HistogramoftheLCSsizesbetweenhumanviruses(A),betweenhuman
virusesandnon-humanviruses(B),andbetweenhumanvirusesandbac-
teria(C).Thex-axisisthe
log
10oftheLCSlength.They-axisisthe
numberofpairswithinthegivenrangeofLCSsize.OnlyLCSsthatare
longerthan10bparepresented.(D)ProbabilitydistributionforLCSs
betweentwosimulatedHIVstrainsthatare50,100,200,and500gen-
erationsapart.The
x
-axisisthelengthofLCS,witharangefrom0to
10,000bp.The
y
-axisisthecorrespondingprobabilitiesforthoseLCS
sizes......................................31
Figure2.4:EnrichingSARS-CoVreadsusingthebatcoronavirusgenomeastheref-
erence.(A)and(B)showthealignedandrecruitedreadsThe
datasetwasalignedbyBWAwiththedefaultparameter("-B4,-A1").
BWAischosentoincludemorelocallyalignedreadsin(A).Thereads
wererecruitedusingtheoverlapcutoffof150bp.(C)displaysthese-
quenceidentitybetweenSARS-Covandthebatcoronavirus.Thepr
wasgeneratedusingVISTA[45].......................35
Figure3.1:MultiplesequencealignmentofveHIV-1haplotypes...........49
Figure3.2:(A)Thebottomtwolonglines(redandblack)representtwohaplotypes,
whichonlydifferbytwomutationsattwoloci(G-CandA-G).Shortlines
representreadssequencedfromthetwostrains.Redreadsaresequenced
fromredstrainwhileblackreadsaresequencedfromblackstrain.The
readsaresortedbytheirreadmappingpositionsagainsttheirnativestrain.
a.1anda.2areareadpairfromtheblackstrain.d.1andd.2arethe
readpairfromtheredstrain.(B)Theoverlapgraphandthepaired-end
grapharecombinedinoneNodesb,c,e,andforiginatefromthe
commonregionofthetwostrains.Thedashedlinesrepresentpaired-end
readconnection.(C).Thesuper-readsgeneratedbymergingreadsinve
cliquesofsize4.Eachsuper-readisnamedusingthestartingnodeand
endingnode.(D).Thenewoverlapgraphgeneratedusingthesuper-reads.57
Figure3.3:Removingfalseedgesusingpaired-endinformation.Solidlinesrepre-
sentoverlapsbetweennodesanddashedlinesrepresentthepaired-end
connectionsbetweennodes.Theoverlapedgeswithredcrosswillbe
removedifinsufpaired-endinformationexistbetweentheirends..62
xi
Figure3.4:Paired-endinformationguidethepathextensiontogooverbifurcation
nodes.Withnodescomingfromtwostrains
a
and
b
inthethose
nodesbelongingtothesamestraincanbecorrectlycombinedtoonepath
basedonthepaired-endconnections.Solidlinesrepresentoverlapsbe-
tweennodesanddashedlinesrepresentthepaired-endconnections.....64
Figure3.5:Pathprotocol.Pathsareoutputtedwhenmeetinganendnodeor
avisitednode.................................65
Figure3.6:Inthisexampleasshowninthetheendingnode
p
n
ofcurrentpath
hastwosuccessors
v
1
and
v
2
.Thesolidlinesareoverlapsanddashed
linesarepaired-endconnectionsbetweennodes.Fivefeaturesarecom-
putedtochoosetherightsuccessortoextendthepath.Thesefeatures
arecomputedaspaired-endgraphedgeweightsbetweennodesincur-
rentpathandnodesassociatedwiththesuccessor.Asfor
v
1
,theSISO
scoreiscalculatedbetweenSISOnodesand
v
1
;planscorebetweenSISO
nodesandplannodes;PE_planscorebetweenSISOnodesandPE_plan
nodes;pathscorebetweenpathnodesand
v
1
;path_planscorebetween
pathnodesandplannodes..........................66
Figure3.7:Decisiontreetoselecttherightnodeforextensionbasedonpaired-end
scorefeatures.If
N
feature
=
1,weaddtheonlynodewithscoregreater
than0tothepath.If
N
feature
>
1,weselectthenodewithmaximumscore
value.Otherwise,lookatthenextfeature.Ifallthe
N
feature
valuesare0,
wewillselectthesuccessorwithsimilarreadscoveragetothepath.....68
Figure3.8:Readpairmappingonamisjoinedcontig.Thecontigisshown
asthelongbaratthebottom,whichismisjoinedwithtwosequences
fromstrains
a
and
b
.Theredandbluelinesrepresenttwoendsofaread
pair.Fewreadpairswillgoacrossthemisjoinedlocation,thusrevealing
avalleyinthealignedreadswhichcanbeusedtosplitthecontig.69
Figure3.9:ProbabilitydistributionforLCSsbetweentwostrainsthatare50,100,
200,and500generationsapart.Thex-axisisthelengthofLCS,which
rangesfrom0to10,000.They-axisisthecorrespondingprobabilityfor
eachLCS...................................71
xii
Figure3.10:ContigsalignmentresultonHXB2strainforPEHaploandSAVAGE.
ThesecontigswereproducedfromtherealHIVMiSeqdataandaligned
toreferencegenomewithMetaQuast.Thex-axisisthecoordinationsof
HXB2genome,withtheregionscoveredbycontigsingreenandothers
inblack.They-axisrepresentsthenumberofcontigs,withcontignames
listedontheleftpanel.Oneachcontig,thegreennumberattheleftis
thestartingcoordinateofthealignedcontigandthenumberinsideofthe
parenthesisshowsthestartingcoordinateonthereferencegenome,the
blackvalueattherightistheendingcoordinateofthealignedcontigand
thenumberinsideoftheparenthesisshowstheendingcoordinateonthe
reference...................................76
Figure4.1:ThewwofVirBin...........................84
Figure4.2:Twokindsofalignmentareallowedbetweentwocontigsto.(A)Overlap.
(B)Inclusion................................85
Figure4.3:Top5windowsoutputfromsimulatedcontigsalignmentfor5HIVhaplo-
types.Eachlinestartingwith'>'representsonewindow.Thefollowing
columnsrepresentreferencecontigname,windowheight,windowstart-
ingpositiononreferencecontig,windowendingpositiononereference
contig,referencehaplotype,andhaplotypeabundance,respectively.The
rowsafter'>'lineshowtheinformationforeachcontiginsideofthis
window.Thecolumnsrepresentcontigname,relativeabundance,sum-
mationofreadscoverage,referencehaplotype,andhaplotypeabundance,
respectively..................................91
Figure4.4:Top3windowsoutputfromsimulatedcontigsalignmentfor10HIVhap-
lotypes.TheoutputformataresameasFigure4.3..............94
Figure4.5:Top5windowsoutputforcontigsassembledfromHIVrealMiSeqdata.
TheoutputformataresameasFigure4.3..................97
Figure5.1:PrimaryTICislocatedupstreamofthepre-miRNA,whilepre-capTIC
islocatedinsideofthepre-miRNA.Theannotationofapre-miRNAis
usuallyastem-loopthatincludesthepre-miRNAandthelowerstems.
However,therealpre-miRNAonlyincludesthered,purplesequences
(maturemiRNAs)andtheloopbetweenthem.Therefore,thepre-capTIC
startsfromthe5'endofa5p-miRNA.Eachblueorgreenbarcorresponds
toamappedread,wheregreenindicatestheplusstrandandbluethe
minusstrand.TheCap-seqdatasetsweresequencedinastrand-speci
way.OnlythereadsonthesamestrandofthemiRNAareconsidered...109
xiii
Figure5.2:SequencemotifanalysisofnucleotidesaroundputativemiRNATSSs
(+1).Thenucleotideheight(inbits)standsforthe
log
2
ratioofthe
observednucleotidesfrequencyrelativetothebackgroundgenomicnu-
cleotidecomposition.TheYRmotifisobservedattheputativeprimary
miRNATSSsandindependentmiRNApre-capTSSs............114
Figure5.3:PrimaryTICsaredetectedupstreamof5'-cappedmiRNAscel-mir-51
andcel-mir-53.MultipleCap-seqpeaksareobservedinpre-miRNAre-
gions.ThemappedreadswerevisualizedbyGenomeView[1].Each
blueorgreenbarcorrespondstoamappedread,wheregreenindicates
theplusstrandandbluetheminusstrand.Thereadsinupperpanelare
fromCap-seqdataset,withuniformlengthof36nt.Thereadsinlower
panelarefromsmallRNA-seqdataset,withlengthinrangefrom14ntto
26nt......................................116
Figure5.4:UpstreamprimaryTICsalsoexistform
7
Gcappedpre-miRNAs.(A)
TwoalternativepromotersforthesamemiRNA.Bothofthemareableto
generatethetranscriptsthatproducethesamematuremiRNA.(B)The
miRNAtranscriptisgeneratedbythepre-capTIC.UpstreamTIC(s)cor-
respondtotranscribedenhancer(s)......................117
Figure5.5:CappedTICsdistributioninmiRNAclustercel-mir-35-41.Multiplestrong
cappedpeakshavebeenobservedinpre-miRNAsinthecluster.Asillus-
tratedbythereddashedline,theCap-seqpeaksonthe5parmhavethe
samestartpositionasthe5p-miRNAs,whilethepeaksonthe3parmstart
fromtheendofthe3p-miRNAs.ThelengthofCap-seqreadsis36nt...119
Figure5.6:ModelformiRNAcytoplasmicre-capping.Inthisnon-canonicalmiRNA
pathway,pri-miRNAsareexportedtocytoplasmbyXPO1andprocessed
there.Duringthepre-miRNAgeneratingprocess,m
7
G-capsareaddedto
the5'endsofnewlygeneratedpre-miRNAandpri-miRNAleftoverby
cytoplasmiccappingenzyme........................121
Figure5.7:(A)Promotersdistributionin
C.elegans
.(B)Detectingbidirectionaland
multipletranscriptionpromotersin
C.elegans
................124
Figure5.8:(A,B):H3K4me3andPolIIsignalsurroundingbidirectionaland
broadpromoters.(C,D):H3K4me3andPolIIsignalsurrounding
miRNAbidirectionalandbroadpromoters..................126
xiv
LISTOFALGORITHMS
Algorithm1
OverlapdetectionusingBWT'sbackwardsearch
25Algorithm2
Defaultmode:createBWTforallreads
27Algorithm3
WindowsfromcontigsalignmentxvChapter1
Introduction
Manyofmostimportanthumanviralpathogens,suchashumanyvirus(HIV),
hepatitisCvirus(HCV),SevereAcuteRespiratorySyndrome(SARS)coronavirus(SARS-CoV),
andH1N1virus,arehighlyvariableRNAviruses.Theyusuallymutatequicklyininfectedcells
ororganismsandexistasacollectionofdifferentbutcloselyrelatedviralgenomeswithdynamic
distribution,whichisreferredtoasviralquasispecies[41,3].Thequasispeciesdynamicsprovides
RNAviruseswithhighadaptivepotentialtoovercomeinternalorexternalselectiveconstraints,
suchasimmuneresponses,antiviralagents,etc.,makingdiseasepreventionandtreatmentdif
Thesevirusesstillclaimthelivesofmillionseachyeardespitecenturiesstudiesofthevaccineand
treatment[158,140].Thus,characterizinghumanviralpathogens,includingrecognizingnovel
ones,remainscrucial.
Developmentofnext-generationsequencing(NGS)technologiesshedslightoncharacterizing
theviralgenomesandtheirabundancesinenvironment.Thedeepsequencingdataofmicrobial
communities(metagenomicdata)containsthegeneticinformationofthemicrobeslivinginthe
samehabitatandhasbecometheprimarysourceforvirusdiscovery[165].Viralmetagenomic
datahasadvantagesovertraditionalmethodssinceitisunnecessarytoisolateviral
speciesandcultivatetheminthelaboratory.Also,multiplepathogens,includingnewones,canbe
inasingleassay.Currently,themostpopularNGSstrategiesformetagenomics,suchas
IlluminaMiSeqandHiSeq,arecost-effectiveandwithhigh-throughput.Theycanproducebillions
ofshort,highlyaccurate100-300bpreadswithinafewdaysforlarge-scalemicrobiomeresearch
1
projects[109].
Whilewiththeseadvantages,assemblingmicrobialgenomesandcharacterizingtheircompo-
sitionsfrommetagenomicdatameetwithseveralcomputationalchallenges.Thechallenge
isthesheerdatasizefromhigh-throughputsequencing,whichrequiresmassivecomputationalre-
sourcesforassemblingandanalyzingthemicrobes.Thesecondoneistheshortreadlength,which
makesitdiftoresolvetherepeatedsequenceswithinthesameorganism,orsharedbetween
differentorganisms.Third,itisdiftoassemblethelowabundancemicrobialgenomeswith
theheterogeneouscoveragesfordistinctmicrobes.Inadditiontothesewell-knownchallenges,
reconstructingRNAvirusesinmetagenomicsfacesseveraluniqueobstacles.First,thereferences
ofvirusesarestillverylimited.Usually,onlyasmallpercentageofreadsinmetagenomicdatacan
bemappedtoavailableviralgenomes,eitherbecausethevirusesaremutatingquicklyorarenovel
ones.CharacterizingRNAviruseswithoutqualityreferencegenomesisthusneeded.Second,most
clinicallyimportantvirusesareoftenfastmutatingRNAvirusesthatcanleadtoviralpopulations
ofhighgeneticdiversity(quasispecies)[107].Asdifferentgenomesinthesameviralpopulation
canhavedifferentpropertiessuchasresistancetoanti-viraldrugs,distinguishingthesedifferent
buthighlysimilarviralstrains(haplotypes)isessentialbutcomputationallydifThird,there
areusuallyonlysmallamountsofeukaryoticvirusesinviralmetagenomicdata.Although
tiontechniquesandmethodsmaybeappliedtoenrichRNAviruses,thesestrategies
maycausevarioustypesofbiases.Thus,effectivepreprocessingmethodsarestillimportantto
identifyreadsforthetargetedviruses.
Inthiswork,totackleallthesechallengesforcharacterizingviralhaplotypesandtheirabun-
dancesfrommetagenomicdata,wehaveproposedapipelinewiththreetoolsto(1)enrichtarget
viruses'readswithlow-qualityorpartialreferencesequences;(2)
denovo
assemblyofthevi-
ralhaplotypesfromtheenrichedreads;(3)analyzehaplotypeabundancesbybinningassembled
2
contigstodifferentgroups.
Inthefollowingpartofthischapter,IwillgiveabriefintroductiontothepipelineIhave
developedduringmyPh.D.study,andthenintroducethenecessarybackgroundfornext-generation
sequencingandviralquasispeciestheory.
1.1Overviewofthepipeline
AsshowninFigure1.1,therearethreemaincomponentsforthispipeline.Inthestep,we
utilizeanoverlap-basedextensionmethodtoisolateandenrichthereadsfortargetvirusesbefore
assembly.Withthebigscaleofviralmetagenomicdata,thereareusuallyasmallproportionofviral
readsamongthem,withmanycontaminationsfromhostgenomesorphages.Directlyassembling
virusesfromthemetagenomicdataisdifandtime-consuming.Therefore,itisimportantto
isolateandenrichtheviralreadsweareinterestedinbeforeassembly.Thispartisdetailedin
Chapter2.Second,weassemblethedifferentviralhaplotypesfromtheenrichedviralreadswitha
paired-endinformationincorporatedoverlapgraph.ThedetailsforthispartareinChapter3.Third,
withtheassembledviralcontigs,weuseanExpectation-Maximizationalgorithmtoestimatethe
abundancesofdifferentviralhaplotypesbyclusteringtheassembledcontigstodifferentgroups,
whereeachgrouprepresentsonehaplotype.ThedetailsforthispartareinChapter4.Finally,in
Chapter5,IintroducetheotherworkIhavedoneforstudyingthetranscriptionalregulationsof
miRNAgeneswithCap-seqdata.
3
Figure1.1:Pipelineoverview.(1)TAR-VIRforenrichingviralreadsofinterestfrommetagenomic
data.(2)PEHaplofor
denovo
assemblyofviralhaplotypes.(3)VirBinforbinningassembled
contigsandestimatethehaplotypeabundances.
4
1.2Next-generationsequencing
ThoughtheautomatedSangersequencingtechnologyhasproveditspowerwithanumberofmon-
umentalaccomplishments,includingthecompletionoftheHumanGenomeProject(HGP),its
limitationsshowedaneedfornewandimprovedtechnologiesforlarge-scaleandlow-costse-
quencing[98].AstheautomatedSangermethodisconsideredtobethetechnol-
ogy,newermethodsarereferredtoasnext-generationsequencing(NGS).WhileNGStechnologies
constitutevariousstrategiesandarequitediverseinsequencingbiochemistry,theirwwsare
conceptuallysimilar.Thesemethodshavesimilarbasicmechanismsandareclascyclic-
arraysequencing,whichcanbesummarizedasthesequencingofadensearrayofDNAfeatures
byiterativecyclesofenzymaticmanipulationandimaging-baseddatacollection[101].Thebasic
protocolofNGStechnologiesincludestemplatepreparation,sequencingandimaging,anddata
analysis.ThemajoradvanceofNGSistheabilitytoproducelargevolumeofdatawithlowcost-
insomecasesoveronebillionshortreadsperinstrumentrun[98].
Thereareseveralcommerciallyavailabletechnologies,includingRoche/454,Illumina/Solexa,
Life/APGandHelicosBioSciences,etc.Theuniquecombinationofprotocolsdistin-
guishesonetechnologyfromanotheranddeterminesthetypeofdataproducedfromeachplatform.
HerewewillgiveadetailedintroductiontoIllumina/Solexasequencingtechnologyandanalyze
thedifferentcomputationaltechniquesfordealingwiththedata.
TheIllumina/Solexasequencingtechnologyiscalledsequencingbysynthesis(SBS).Thereare
fourbasicstepsforIlluminaNGSwws:
5
Figure1.2:Librarypreparation.GenomicDNAisfragmentedandcommonadaptorsareligatedto
bothendsofthem.
1.2.1Librarypreparation
BeforesequencingtheDNAorcDNAsample,thelongDNAsequencesarerandomlybro-
kenintosmallersizesbysonication.TheseshortDNAfragmentsproducedcannotbesequenced
directly;adaptorsareligatedtothe5'and3'endsoftheseDNAfragmentsfromwhicheither
fragmenttemplatesormate-pairtemplatesarecreated(Figure1.2).Adaptersareshortoligonu-
cleotideswhichareattachedtotheDNAtobesequenced.Anadaptercanprovideaprimingsite
forbothandsequencing.AlltheDNAtemplatesarecalledasequencinglibrary.
1.2.2ClusterGeneration
WhilebasesofaDNAsequencearedetectedbymostimagingsystemshavenotbeen
designedtodetectsingleevents.oftemplatesarerequired.Illumina
appliesamethodcalledsolid-phase[43].High-densityforwardandreverseprimers
6
Figure1.3:Solid-phaseamplication.Eachsingle-stranded,single-moleculetemplateiscaptured
byprimerandimmobolizetothesolidsurface.Eachboundtemplateisintoaclonal
clusterthroughbridgeamplication.
arecovalentlyattachedtoaglassslide,andtheratiooftheprimerstothetemplatethe
surfacedensityoftheclusters(Figure1.3).Thelibraryisloadedintoawcellwhere
fragmentsarecapturedbytheprimers.Eachfragmentisthenamintodistinct,clonalclusters
throughbridgeWhenclustergenerationiscomplete,thefreeendsoftemplatescan
behybridizedbyauniversalsequencingprimertoinitiatetheNGSreaction.
1.2.3CyclicReversibleTermination(CRT)Sequencing
Clonalresultsinapopulationofidenticaltemplates,eachofwhichhasundergonethe
sequencingreaction.Uponimaging,theobservedsignalisaconsensusofthenucleotidesadded
totheidenticaltemplatesforagivencycle.
Illuminatechnologyutilizesaproprietaryreversibleterminator-basedmethodthatdetectssin-
7
glebasesastheyareincorporatedintoDNAtemplatestrands.Inthestep,aDNApolymerase
bindstotheprimedtemplateandaddsorincorporatesjustonenucleotide,
whichrepresentsthecomplementofthetemplatebase.Areversibleterminatorisoneverymodi-
nucleotidetopreventmultipleadditionsinoneround.Followingincorporation,theremaining
unincorporatednucleotidesarewashedaway.Usingthefour-colourchemistry,eachofthefour
baseshasauniqueemission.Aftereachround,imagingisusedtodeterminetheidentityofthein-
corporatednucleotide.Acleavagestepisfollowedtoremovetheterminating/inhibitinggroupand
thedye.Thenextincorporationstepwillstartafteranadditionalwashing(Figure1.4).
1.2.4Dataanalysis
AfterNGSreadshavebeengenerated,theyarealignedtoaknownreferencesequenceorassembled
denovo[149,72,70,22,83,123].Theproductionoflargenumbersoflow-costreadsmakesthe
NGSplatformsusefulformanyapplications.Theseincludevariantdiscoverybyresequencing
targetedregionsorwholegenomes,suchasdetectingsinglenucleotidepolymorphisms(SNPs)or
structuralvariants(SVs);cataloguingthetranscriptomesofcells,tissuesandorganisms(RNA-seq)
[155];genome-wideofepigeneticmarksandchromatinstructure(ChIP-seq,MNase-seq,
etc.)[138,113];andspeciesand/orgenediscoverybymetagenomicsstudies[120].
1.2.5Paired-EndSequencing
Paired-endsequencingsequencesbothendsoflongDNAfragmentsinasequencinglibraryand
aligningtheforwardandreversereadsasreadpairs(Figure1.5).Inadditiontoproducingtwice
thenumberofreadsforthesametimeandeffortinlibrarypreparation,readpairsfromtwoends
ofDNAfragmentsenablemoreaccuratereadalignmentandtheabilitytoovercomerepetitive
8
Figure1.4:TheIllumina/Solexafour-colourcyclicreversibletermination(CRT)method.Ituses
3'-O-azidomethylreversibleterminatorchemistryontemplateclusters.Af-
terimaging,acleavagestepremovesthedyesandregeneratesthe3'-OHgroupfor
next-roundincorporation.
9
Figure1.5:Paired-endsequencingenablesbothendsoftheDNAfragmenttobesequenced.The
distancebetweeneachpairedreadsisknownandalignmentalgorithmscantakeadvantageofto
mapthereadsoverrepetitiveregionsmoreprecisely.
regions.
Aligningreadswithinrepetitiveregionsorincorrespondingregionsthatmaynotexistinthe
referencegenomeisdifTherepeatsandgapsinthereferencegenomecanalsocauseabnor-
malextremelyhighpeaksinthemappingwhichmayleadtofalsepositivesfordownstream
functionalanalysis[25].Paired-endreadscanresolvethecorrectgenomeassignmentforsome
repetitiveregionsaslongasonereadinthepairisuniquetothegenome[98].
1.3Virusquasispecies
Avastnumberofmedicallyimportantviruses,includinghumaniyvirus(HIV),
hepatitisCvirus(HCV),andareRNAviruses.Theyreplicatewithextremelyhighmu-
tationratesandexhibitgeneticdiversity,whichallowthemtorapidlyadapttodynamic
environmentsandevolveresistancetovaccinesorantiviraldrugs[75].TheRNAvirusevolution
anddynamicscanbemodeledbyquasispeciestheory.Aquasispeciesisacloudofdiversevari-
antsthataregeneticallylinkedthroughmutations,interactcooperativelyonafunctionalleveland
10
collectivelycontributetothecharacteristicsofthepopulation.
1.3.1Thequasispeciestheory
Imagineaquasispecieswithanlargepopulationofdifferentstrains,eachcarryinga
genomeoflength
L
.Let
x
i
denotestherelativeabundance/frequencyofstrains
i
.Wehave
å
n
i
=
0
x
i
=
1.Thegenomicstructureofthepopulationcanbecharacterizedbythevector
~
x
=(
x
0
;
x
1
;:::;
x
n
)
.
Denoteby
f
i
theofgenome
i
,whichmeansgenomesoftype
i
arebeingreproducedat
rate
f
i
.Thelandscapeisgivenbythevector
~
f
=(
f
0
;
f
1
;:::;
f
n
)
.Theaverageofthe
populationistheinnerproductofthevectors
~
x
and
~
f
,
f
=
~
x
~
f
.
Mistakescanhappenwhenagenomereplicates.Weusemutationmatrix
Q
=[
q
i
;
j
]
todenote
theprobabilitythatreplicationofgenome
i
resultsingenome
j
.
Q
isastochasticmatrix,with
whicheachrowisaprobabilityandeachrowsumstoone,
å
n
j
=
0
q
ij
=
1.
Thequasispeciesequationisgivenby[107]
x
0
i
=
n
å
j
=
0
x
j
f
j
q
ji

f
x
i
(1.1)
where
x
0
i
isthedifferentiationof
x
i
withrespecttotime.Sequence
i
isobtainedbyreplicating
anysequence
j
atrate
f
j
timestheprobabilitythatreplicationofsequence
j
generates
sequence
i
.Toensurethatthetotalpopulationsizeremainsconstant,thatis
å
n
i
=
0
x
0
i
=
0,each
sequenceisremovedatrate
f
(Proof:
å
n
i
=
0
x
0
i
=
å
n
j
=
0
x
j
f
j
q
ji

f
å
n
i
=
0
x
i
=
å
n
i
=
0
q
ji
å
n
j
=
0
x
j
f
j

f
=
å
n
j
=
0
x
j
f
j

f
=
0).
Ifwecombinethelandscape,
~
f
,andthemutationmatrix,
Q
,wecanobtainthemutation-
selectionmatrix,
W
=[
w
ji
]=[
f
j
q
ji
]
(1.2)
11
Withthemutation-selectionmatrix,thequasispeciesequationcanbewritteninvectornotationas
~
x
0
=
~
x
W

f
~
x
(1.3)
Thereforetheequilibriumofquasispeciesdynamicsisgivenby
~
x
W
=
f
~
x
(1.4)
Thisisastandardeigenvalueproblem,wheretheaverage
f
isthelargesteigenvalueof
thematrix
W
and
~
x
istheeigenvectorassociatedwiththiseigenvalue.Thus,theeigenvector
~
x

ofmutation-selectionmatrix
W
providestheequilibriumstructureofthequasispecies,withthe
propernormalization
å
n
i
=
0
x
i
=
1.
Errorcatastrophe
Thequasispeciesequation1.1describesthemovementofapopulationthroughsequencespace.
Thequasispeciesattemptstoclimbuphillinthemountainrangeofthelandscapeandreach
localorglobalpeaks(Figure1.6).Theerrorthresholdplaysakeyroleinthesuccessofthe
evolutionarywalk.
Whenthemutationrate
m
isabovetheerrorthreshold,theabilityofthequasispeciestoclimb
uphillandtoremainontopofamountainpeakisimpaired[42,13].Smallincreasesinmuta-
tionratewillupsetthisbalanceasthemastersequenceitselfdisappearsandmeaningfulgenetic
informationislostinanavalancheoferrors.Earlystudieswithvesicularstomatitisvirus(VSV)
showedthatchemicalmutagensgenerallyreducedviralinfectivity,andstudieswithpoliovirus
demonstratedthatmutagenicnucleosideanalogspushviralpopulationstoextinction-a4-fold
increaseinmutationrateresultedina95%reductioninviraltiter[52,153].
12
Figure1.6:Quasispeciesadaption.Quasispeciestendstogouphillsinhigh-dimensional
landscape.Thex-axisisthesequencespaceandthey-axisistheThehighertheyget,the
theyare.
13
Chapter2
TAR-VIR:TARgetedVIRalreads
andstrainreconstructionfrom
metagenomicdata
2.1Background
Pathogenichumanvirusessuchashumanyvirus(HIV),hepatitisCvirus(HCV),
SevereAcuteRespiratorySyndrome(SARS)coronavirus(SARS-CoV),andH1N1virus,still
claimthelivesofmillionseachyeardespitecenturiesstudiesofthevaccineandtreatment[158,
140].Thus,characterizinghumanviralpathogens,includingrecognizingnovelones,remains
crucial.Thedeepsequencingdataofmicrobialcommunities(metagenomicdata)containsthe
geneticinformationofthemicrobeslivinginthesamehabitatandhasbecometheprimarysource
forvirusdiscovery[165].Forexample,Lietal.[80]conductedcomprehensivevirusscreening
inplasmasamplesof19antiretroviral-treatedHIVpatientsusingmetagenomicsequencing.Lim
etal.[85]studiedthedynamicsofeukaryoticRNAandDNAvirusesfortheyearinfants.
Therearealsoglobal-scalestudiesonvirusesinnaturalenvironmentalsamplessuchasocean
water[102,126].Viralmetagenomicdatahasadvantagesovertraditionalmethods
sinceitisunnecessarytoisolateviralspeciesandcultivatetheminthelaboratory.Also,multiple
14
pathogens,includingnewones,canbeinasingleassay.
WeareparticularlyinterestedincharacterizingRNAvirusesviametagenomicsequencingbe-
causemanyofthemareclinicallyimportantandtherearestillurgentneedsforbetterprevention
andtreatmentstrategies.Inadditiontosomewell-knownchallengesformetagenomicassembly
suchassheerdatasize,shortreadlength,andheterogeneouscoverage,reconstructingRNAviruses
inmetagenomicsfacesseveraluniqueobstacles.First,thereferencesofvirusesarestillverylim-
ited.Usually,onlyasmallpercentageofreadsinmetagenomicdatacanbemappedtoavailable
viralgenomes,eitherbecausethevirusesaremutatingquicklyorarenovelones.Characterizing
RNAviruseswithoutqualityreferencegenomesisthusneeded.Second,mostclinicallyimportant
virusesareoftenfastmutatingRNAvirusesthatcanleadtoviralpopulationsofhighgeneticdi-
versity(quasispecies)[107].Asdifferentgenomesinthesameviralpopulationcanhavedifferent
propertiessuchasresistancetoanti-viraldrugs,distinguishingthesedifferentbuthighlysimilar
viralstrains(haplotypes)isimportantbutcomputationallydifThird,thereareusuallyonly
smallamountsofeukaryoticvirusesinviralmetagenomicdata.Althoughtechniquesand
methodsmaybeappliedtoenrichRNAviruses,thesestrategiesmaycausevarious
typesofbiases.Thus,effectivepreprocessingmethodsarestillimportanttoidentifyreadsforthe
targetedviruses.
Inthisstudy,ourgoalistodevelopanewpipelinethatcanclassifyRNAviralreadswithhigh
sensitivityandalsoproducetheassembledviralstrains(i.e.haplotypes)frommetagenomicdata.
2.1.1Relatedwork
Thereexistsomepipelinesforconductingthecompositionandfunctionalanalysisofviralmetage-
nomicdata[99,128,105,122].Dependingontherequiredinputs,existingprogramscanbe
roughlydividedintotwogroups.
15
Someviralmetagenomicanalysistoolstakemetagenomicassembliesasinputandconduct
reference-basedtaxonomicandfunctionalanalysis.Forexample,Espinoetal.[112]identifyviral
sequencesfromassembledmetagenomiccontigsofsizesgreaterthan5kb.VirSorter[127]detects
virusesinassembledcontigsatleast3kb.Theviralsequencesareusuallyscreenedbycomparing
thecontigswithacuratedsetofviralproteinfamilies.
Althoughvirusdetectionusingassembledcontigsisusuallymoresensitivethanusingshort
reads,conducting
denovo
assemblyforrawviralmetagenomicdatarequiresmassivecomputing
resources.Also,applyinggeneric
denovo
assemblytoolsdirectlytometagenomicdatacanlead
toshortorchimericcontigsduetotheheterogeneouscoveragealongthegenomesinmetage-
nomicdataandhighsimilaritybetweenstrains.Therefore,mostoftheassemblymethodsforviral
metagenomicscombinereference-basedand
denovo
assembly.Thisstrategyusu-
allyreadsintodifferenttaxonomiesorfunctionalgroupsusingreference-basedmethods
andthenconduct
denovo
assemblyforreadswithinthesamegroup.Forexample,VIP[84],
drVM[86],andVirusTAP[162]allapplythisstrategy.Theyclassify/enrichviralreadsbyeither
aligningreadstoavailableviralreferencesorremovinghostandotherunrelatedmicrobialreads.
Next,existingassemblytoolssuchasSPAdes[8]areemployedtothevirus-likereadstoproduce
theassemblyresults.Whilethesetoolsmadecontributionsinpurifyingthedataby
removingnon-virusreadsandthenclassifyingvirus-likereadsintofunctional/taxonomicalgroups,
theirperformanceheavilydependsonthequalityofthereferences.
ForRNAviruscharacterization,relatedtoolsalsoincludehaplotypereconstructionpipelines
designedtoassembleviralstrainsinaquasispecies.Amajorityofthesetoolsarereference-based
andtakethealignmentsofreadsagainstreferencegenomesasinput.HaploClique[147],Vi-
Quas[59],VGA[95]allbelongtothisgroup.
Inaddition,therearegenericassemblytoolsthatapplyextensionstrategiestoiterativelyextend
16
theﬁseedﬂcontigs.Forexample,PRICE[129]takesadvantageofpaired-endreadstoiteratively
alignreadsontheinitialcontigsandextendthem.
Thelimitationofreferencegenomesisthecriticalchallengeofapplyingtheabovereference-
basedtoolsforRNAvirusesanalysisinmetagenomicdata.Whileregardedasthemostabundant
biologicalentitiesonearth,onlyasmallportionofviruseshavebeensequencedandcharacterized.
Besides,forRNAviruseswithhighmutationrates,high-qualityreferencegenomesofaviral
populationarenotalwaysavailable.Forexample,manyemergingviraldiseasesarecausedby
zoonoticviruses,whichoriginateinvertebratesbutcaninfecthumans.Thegenomesofsome
emergingvirusesmayonlysharemediumsequenceidentitywiththeirpeersinanimals,creating
difcircumstancesforreference-basedvirusreads
2.1.2Overviewofourwork
HereweintroduceTAR-VIR,whichprovidesausefuladditiontoexistingtoolsforidentifyingtar-
getedRNAvirusesandtheirhaplotypesinmetagenomicdata.Theﬁtargetedﬂvirusesarethosethat
stillpossesslocalsequencesimilaritywiththeirhomologsinthereferencegenomes.Acompletely
newvirusthatdoesnotshareanyconservationwithanyreferencegenomewon'tbedetectedby
ourmethod.
Ourpipelinecombinesreference-basedstrategyand
denovo
assemblyandisoptimizedforthe
followingapplications.1)Identifyinghost-switchingvirusessuchasSARS-CoVusingremotely
relatedvirusesinotherhostsasthereferences.2)Reconstructingviralhaplotypesthatarediver-
gentfromaknownvirusfamily.3)Recoveringvirusesandtheirgenomesthatcontaingenesor
functionalsitesofinteresttousers.TAR-VIRisfasterandmoreeffectiveinidentifyingtargeted
virusesthangenericassembly,whichisparticularlyimportantforlargeandcomplicatedmetage-
nomicdatasetscontainingasmallpercentageofviruses.Meanwhile,TAR-VIRismoretolerantto
17
incompleteorlow-similarityreferencesthanexistingreference-basedtools.
WeappliedTAR-VIRtoasimulatedmetagenomicdatasetcontainingvehaplotypesofSARS-
CoVandarealhumanbloodplasmametagenomicdataset.Thebenchmarkedresultswithboth
de
novo
assemblytoolsandreference-basedhaplotypereconstructiontoolsdemonstratedtheutility
ofTAR-VIRinrecoveringRNAvirusesfrommetagenomicdatawithlimitedreferences.
2.2Methods
Followingastand-aloneerror-correctionstep,ourpipelineperformsthefollowingthreesteps.
First
,weconstructthesetofﬁseedreadsﬂbymappingthereadsagainstprovidedreferencese-
quences,whichcouldbereferencegenomesinavailabledatabasesorfunctionalsitessuchasgenes.
Allthereadsthatcanbemappedtothereferenceconstitutethesetofﬁseedreadsﬂ.
Second
,were-
cruitreadsthatformoverlapswiththeseedreads.Newlyrecruitedreadswillbeadded
totheseedset.Thisprocesswilliterateuntilnonewreadscanberecruited.
Third
,weconduct
strain-levelassemblyusingthereadsinthesecondstep.
2.2.1Twoscenarios
TheabovepipelineisvisualizedfortwoscenariosinFigure2.1.Inscenario1,usersaretryingto
detectvirusesthatcontainafunctionalsite,suchasagene.Unlikethewell-studiedgene-centric
assemblyinmetagenomicdata,ourgoalistorecoverthewholegenomethatcontainsaparticular
gene.Inthismethod,thegeneisprovidedasareference,andreadsmappedtoitaretheseedreads.
Overlapdetectionisthenappliedtorecruitmorereadsthatbelongtothesamevirusesastheseed
reads.ThereadrecruitmentprocessispresentedinFigure2.1(A).
Inscenario2,thegoalistoidentifyvirusesthatlackqualityreferencegenomes.Thisispartic-
18
ularlyimportantforhost-switchingviruses,whichmaynotalwaysconservehighsequencesimilar-
itywiththeirrelatedpeersinotherhosts.Forexample,SARS-CoVsharesabout80%ofsequence
identitywiththebatcoronavirusaccordingtoBLAST[20].Andtheidentityislowerthan50%
atdifferentloci.Thus,conventionalreadmappingmethodscannotcaptureallreadsfromthetar-
getedviruseswhentheylackhighsimilaritywiththeavailablereferences.Figure2.1(B)presents
theprocessofidentifyingreadsofthetargetviruswitharemotelyrelatedvirusasthereference.
Althoughthemappedreadsarescatteredalongthereferencegenomewithlowcoverage,suf
readsbelongingtothetargetviruscanberecruitedthroughoverlapdetection.
Figure2.1:Twoscenarios.(A).Thereferenceisageneorotherfunctionalsite(longgreenbar).
Thereadsarerepresentedbyshortlines.Shortgreenlinescanbemappedtothereferencesequence
andthesetofseedreads.Theiterationofoverlapdetectionwillidentifynewreads(blue
lines)overlappingwiththeseedreads.Theseconditerationofoverlapdetectionwillidentifymore
reads(redlines).(B).Thereferenceisaremotelyrelatedgenome(longgreenbar).Theseedreads
canbemappedtothereferencegenomeandarerepresentedbyshortgreenlines.Twoiterationsof
overlapdetectionwillrecruitnewreads(bluelinesandredlines,respectively).
19
2.2.2Validityofreadrecruitmentusingoverlapdetection
Inthissection,wewillconductcarefulanalysistoexaminewhetherusingoverlapswillbeboth
sensitiveandaccurateforclassifyingreadsinthesamequasispecies.Inaddition,wewillusethe
analysistodeterminetheappropriateoverlapsize.
Anidealreadrecruitingprocessshouldonlycapturethereadsfromthevirusesofinterest.At
thesametime,itshouldcaptureallreadsbelongingtothetargetedviralpopulations(i.e.,qua-
sispecies).Thesuccessoftheoverlapextension-basedreaddependsonthequality
ofseedreads,sequencingdepthofthetargetedviruses,andsequencesimilaritybetweendifferent
microbes.Inparticular,ifmanymicrobessharelongcommonregionswiththetargetedviruses,
theoverlapextensionwillrecruitalargenumberofreadsfromunrelatedspecies.
Asanyreadthatformsanoverlapofsizeaboveagiventhreshold
t
withaseedreadisrecruited,
contamination,whichreferstoreadsnotsequencedfromthetargetedviralpopulations,canbe
checkedbyexaminingwhethergenomesofdifferentvirusessharecommonregionswithsizes
above
t
.Therefore,wecomputedthesizesoflongestcommonsubstrings(LCSs)betweendifferent
viruses.TheLCSsbetweenvirusesandothermicrobialspecieswerealsoexamined.Thedetails
forLCScalculationcanbefoundinSupplementaryMaterialsSection1.Theresultsareshownin
SupplementaryFigureS1(A-C).
Insummary,thesizesofLCSsbetweendifferentviralgenomesorbetweenhumanvirusesand
bacteriaareusuallysmallerthan100bp.LCSslongerthan100bparemostlybetweenviruses
fromthesamegenusordifferentgenotypesofthesamevirus.Forexample,Vacciniavirusand
VariolavirusshareanLCSof469bp,andHCVgenotype7andHCVgenotype5shareanLCSof
154bp.
Meanwhile,itisalsonecessarytoevaluatewhetherreadsbelongingtothesamequasispecies
20
canberecruitedusingoverlapdetection.AsthecharacterizedhaplotypesfordifferentRNAviruses
areverylimited,insteadofcomputingtheLCSusingavailabledata,weestimatedtheLCSswithin
aquasispeciesusingaprobabilitymodel.Withthemutationrate
m
ateachbaseduringvirus
replication,theprobabilitydistributionofLCSlengthbetweentwoviralstrainsthatare
n
gener-
ationsapartcanbecalculatedwithdynamicprogramming[26].Asanexample,theprobability
distributionofLCSsizesbetweentwoHIVstrainsisshowninSupplementaryFigureS1(D).
TheresultrevealsthattheLCSsbetweendifferenthaplotypesofthesameviralpopulationare
usuallymuchlongerthanLCSsbetweendifferentvirusesoranIlluminareadsize.Thus,evenwith
theinitialseedreadsalignedtoonlyonehaplotype,thereadsofotherhaplotypescanberecruited
throughthelongcommonregionssharedbydifferenthaplotypes.Thereadssequencedfromthe
commonregionsactlikebaitstorecruitreadsfromdifferenthaplotypes.
TheshortLCSsbetweendifferentvirusesandthelongLCSsbetweendifferenthaplotypes
withinthesamequasispeciesenablehighsensitivityandofthereadrecruitingprocess.
BychoosingaproperoverlapthresholdusingthederivedLCSsizedistribution,weareabletore-
cruitreadsforthesameviralquasispecieswithoutintroducingcontaminationfromothermicrobes.
2.2.2.1Sequencingerrors
Withoutconsideringthesequencingerrors,unevensequencingcoverage,andthevirusrecombi-
nation,theaboveanalysisforviralquasispeciesprovidesanupperboundofreads'overlapsizes.
Sequencingerrorswillshortenoverlapsbetweenreadsandmaypreventrecruitingallreadsbe-
longingtothesamequasispecies.Torecruitsufreadsforassembly,wecanconstructeither
approximateoverlapsbyallowingmismatches/gapsorexactoverlapsonerror-correctedreads.
Consideringtheriskofintroducingcontaminationsbyapproximateoverlapandextensiveresearch
inerrorcorrectionwechosetousestand-aloneerrorcorrectiontoolspairedwithexactover-
21
lapdetection.Lowsequencingdepthwillcauseinsufoverlapsandthusaffectstheperfor-
manceofreadrecruitment.Thereisthepossibilitythelowcoverageregionsofagenomecannot
beassembled.
2.2.2.2Chimericreads
Onerecentstudyrevealedthatchimericreads,whichcontainsequencesfrommorethanone
species,canbegenerated
invitro
duringthepreparationofhigh-throughputsequencinglibraries
[114].Thesechimerasmayhaveoverlapswithmorethanonespecies,thusintroducingcontamina-
tionsfromthehostorunrelatedmicrobes.Inourexperiments,wesettheoverlapthresholdlonger
thanhalfofthereadsizetopreventrecruitingthesechimericreadsorextendingfromthem.
2.2.3Readrecruiting
Let
r
i
and
r
j
betworeads.Ifthereisapropersufof
r
i
thatistheof
r
j
orviceversa,
r
i
and
r
j
formanoverlap.Inpractice,wewillalsoaccountfortheoverlapsformedby
r
i
and
r
0
j
sreverse
complement.Naivealgorithmforoverlapstakesquadratictimeandthuswillnotbeableto
scaletolargedatasets.Thereareafewdatastructuresandmethodsavailableforefoverlap
detection[47,142].WeapplythemethodswithBWTandFM-index[142]forefsearch.In
thestep,allreadsareconcatenatedintoasinglesequence
T
[
1
::
n
]
using$asadelimiter,where
n
isthenumberofreadsin
T
.Then,multi-keyﬁquicksortﬂisappliedtosortallthesufesof
T
forconstructingageneralizedsufarray
SA
(
T
)
[121].Then
BWT
(
T
)
canbeconstructedusing
thefollowingequation,where
BWT
[
i
]
and
SA
[
i
]
areabbreviatedrepresentationsof
BWT
(
T
)[
i
]
and
22
SA
(
T
)[
i
]
,respectively.
BWT
[
i
]=
8
>
<
>
:
T
[
SA
[
i
]

1
]
;
if
SA
[
i
]
>
0
$
;
if
SA
[
i
]=
0
(2.1)
With
T
and
BWT
(
T
)
,thebackwardsearchcanbeusedtodetectoverlapsbetweenaqueryread
andallotherreads.Aftermatching
t
(theoverlapthreshold)characters,wesearchforthedelimiter
`$'totheesoverlappingwiththequery'ssuf
ThebackwardsearchalgorithmforBWTneedstousetwoauxiliarydatastructuresbuilton
T
andBWT.Onedatastructureisanarray
C
[
1
;:::;
j
S
j
]
.Foreachcharacter
c
inthealphabet
S
,
C
[
c
]
containsthetotalnumberofcharacterin
T
thatarealphabeticallysmallerthanc(including
repeatingcharacters).Theseconddatastructureisatwo-dimensionalarray
Occ
(
c
;
i
)
,where
c
2
S
and1

i
j
T
j
.Forthe
i
thpositioninBWT,
Occ
[
c
][
i
]
isthenumberofoccurrencesof
c
in
BWT
[
1
:::
i
]
.ExamplesoftheSA,BWT,
C
,and
Occ
areprovidedinFigure
??
.Notethatinorder
tomainthecorrectrankingofthereadIDs,thesmallestsuf$willbeappendedbytheread
in
T
forsorting.Thus,allthesufesstartingwith$aresortedbythereadsfollowingthe$.In
addition,foranysufstartingwith$,theirreadIDsaredeterminedbythereadfollowingthe$.
OncetheBWTandtheauxiliarydatastructuresareconstructed,backwardsearchisconducted
foranyqueryread,startingfromtheendingcharacter.Ifsomereadsin
T
haveamatchwith
thequeryandthematchsizeisaboveagiventhreshold,thebackwardsearchwillreturnarange
containingthosereads.UsingthecorrespondingRIDarray,wecanobtaintheirreadIDs.Note
thatbidirectionalsearchalgorithmexists[142].Inourwork,wereversethequeryinorderto
theoverlapformedbytheofthequeryandsufofareadintheconstructedBWT.The
pseudocodeofthebackwardsearchcanbefoundinAlgorithm1.Theinputparametersarethe
BWT,Occ,C,thequeryread
r
,theoverlapthreshold
t
,andthereadIDarrayRID.Theoutputs
23
Figure2.2:Thevisualizationoftheoutputofeachstepofthebackwardsearchforaquerysequence
AAAT.TheSAandthecorrespondingsufesaregreybecausetheyarenotactuallyusedinthe
search.Thecomputedrangeforeachiterationishighlightedusinggrayboxencompassedby
arrows.When
t
is3,thesearchshouldreturn
r
2
asitformsanoverlapofsize3withAAAT.
24
arereadswhosematchthesufofrwithsize

t
.Anexampleofthebackwardsearch
followingthepseudocodeisprovidedinFigure2.2.
Algorithm1
OverlapdetectionusingBWT'sbackwardsearch
Function:
overlap
((
BWT
(
T
)
;
C
(
S
)
;
Occ
(
S
;
BWT
)
;
r
;
t
;
RID
(
T
))
Input:
T
:inputtext;
r
:queryread;
C
and
Occ
:theauxiliarystructures;
t
:theoverlapthreshold;
RID
:readIDarray
Output:
Readswhosematchthesufofrwithsize

t
1:
i
 j
r
j
2:
c
 
r
[
i
]
3:
start
 
C
[
c
]+
1
4:
end
 
C
[
c
+
1
]
.
characterc+1isthenextcharacterinalphabeticallysortedalphabet
5:
while
start

end
and
i

2
do
6:
c
 
r
[
i

1
]
7:
start
 
C
[
c
]+
Occ
[
c
;
start

1
]+
1
8:
end
 
C
[
c
]+
Occ
[
c
;
end
]
9:
i

10:
if
j
r
j
i

t
then
11:
start
 
C
[
0
$
0
]+
Occ
[
0
$
0
;
start

1
]+
1
12:
end
 
C
[
0
$
0
]+
Occ
[
0
$
0
;
end
]
13:
if
end

start
then
14:
outputreadswithIDsbetween
RID
[
start
]
and
RID
[
end
]
15:
endif
16:
endif
17:
endwhile
2.2.3.1Uniqueimplementationstrategies
AlthoughthereareavailableimplementationsoftheBWT-basedoverlapdetection,oursdiffers
fromtheexistingonesinthefollowingaspects.ThedifferenceisthestorageofthereadID
information.ForaconstructedBWTandaquery,theoutputofthebackwardsearchisthesetof
reads(i.e.,theirIDs)thatformoverlapswiththequery.Theoretically,differentreadscanbedis-
tinguishedbyappendinguniquedelimitersattheendofeachread.Someexistingimplementations
savereadIDsforeachsufForexample,SGAassembler[142]savesreadIDinformationalong
withthesufstartingpositioninthegeneralizedsufarray
SA
(
T
)
[142].Inourimplementation,
25
weuse`$'asthedelimiterforallreads.ThereadIDarray
RID
iscreatedonlyforsufesstarting
with`$'.ThisworksbecausethebackwardsearchalgorithmneedstoretrievethereadIDinthe
step,wherethecharactertosearchis`$'.Thisreducedthememory
usagebyreducingthesizeof
RID
from
j
T
j
integersto
n
(numberofreads)integers,where
T
is
roughlytheproductof
n
andthereadsize.
2.2.4Iterativesearch
Overlapdetectionwillbeiterativelyappliedtorecruitreadssequencedfromtargetedviruses.Let
R
0
bethesetofseedreadsthatcanbemappedtogivenreferencesequences(i.e.,seedreadset).
First,the
BWT
(
T
)
for
T
isbuilt.Theseedreadsin
R
0
areusedasqueriestothe
BWT
(
T
)
.Then
newlyreadsthatoverlapwiththeseedreadswillbeusedasnewqueriestotheBWT.
Theiterationswillcontinueuntilnonewreadscanberetrieved.Itspseudocodeisdescribedin2.
26
Algorithm2
Defaultmode:createBWTforallreads
Input:
seedreadset
R
0
,theinputtextT,theoverlapthreshold
t
Output:
Readsthataresequencedfromthetargetedviruses
1:
output
 
R
0
2:
R
 
R
0
3:
CreateBWTandRIDfor
T
:
BWT
(
T
)
,
RID
(
T
)
4:
while
R
notempty
do
5:
Backwardsearchon
BWT
(
T
)
toallreadsthatoverlapwithreadsin
R
6:
Savethemtoset
R
0
7:
output
 
output
[
R
0
8:
R
 
R
0
9:
endwhile
10:
returnoutput
2.2.4.1Runningtimeandmemoryusage
Intheabovepipeline,oncetheBWTisconstructed,thesufarraywillbedeleted.Therunning
timeofsufxarrayandBWTconstructionislinearto
j
T
j
.ThememoryusageofBWTisthe
productof
j
T
j
andthesizeofeachcharacterandthusislinearto
j
T
j
.Thememoryusageofthe
RID
istheproductof
n
andthesizeofsavingareadID.
WhencreatingBWTforallreadsbecomestooexpensive,ourprogramsupportsdistributed
constructionoftheBWTandFM-indexforlargeinput.,theprogramcanautomatically
partitioninputdataintomultiplesmallerBWTisthenconstructedforeachdivideddataset.
ThereadoverlapdetectioncanberuninparallelforeachBWT.Thereadsarecombined
andusedasthequeryforthenextiterationofreadrecruitment.Inthiscase,thelargestmemory
27
footprintisdeterminedbythesizeofeachdividedreadset.Bydefault,thenumberofpartitionsis
ve.Thisnumbercanbebyusers.
2.2.5Strain-levelassembly
Theoutputsofourprogramareassembliesofviralstrains.Allrecruitedreadswillbeusedas
inputtoassemblyprograms.Asourprogramhasamodularstructure,thisstepcanbeexecutedby
anyassemblytoolchosenbytheusers.Bydefault,weincludeinthepackageourin-housedevel-
opedtoolPEHaplo[26]forviralhaplotypereconstruction.PEHaplodoesnotrequireanyreference
sequencesandconductsstrain-levelassemblyusingpaired-endreads.Fortheinputpaired-end
reads,PEHaploconstructsapaired-endoverlapgraph,whichaugmentedstandardoverlapgraphs
byaddingedgesconnectingnodesthatcanformendsofreadpairs.Then,agreedypath
algorithmisappliedtosearchforthepathswiththebestsupportsfrompaired-endreads,wherethe
supportsarebythenumberofcontainedreadpairsandalsotheirdistances.Thedetailed
algorithmandimplementationofPEHaploaredescribedinChapter3.
2.3Resultsanddiscussion
WehavedevelopedamodularstructuredtoolnamedTAR-VIRforreconstructingviralhaplotypes
frommetagenomicsdata.Theoutputsofthistoolareassembledviralcontigscorresponding
todifferentstrains.Wefocusonevaluatingtheperformanceofthereadrecruitingstageandalso
itsimpactontheassembly.
BeforepresentingtheresultsofTAR-VIRfordifferentsetsofinputdata,weshowthe
resultsofthelongestcommonsubstring(LCS)sizecomputationfordifferentmicrobes.
28
2.3.1Sizesofcommonregionsbetweenhumanvirusesandothermicrobial
species
Toevaluatethesimilaritybetweendifferentmicrobes,wecalculatedthesizesofLCSsbetween
humanvirusesandothermicrobialgenomes.Asbacteriainfecthumansaswellasviruses,the
sizesofLCSsbetweenhumanviruses,humanvs.non-humanviruses,andhumanvirusesvs.
bacteriawerecalculated.ThevirusreferencegenomesweredownloadedfromNCBIViruses
(https://www.ncbi.nlm.nih.gov/genome/viruses/).Todate(June2018),thereareintotal7,456
completeviralgenomes,ofwhich481havehumanasthenaturalhost(denoteashumanviruses).
ThehumanbacterialreferencegenomesweredownloadedfromHumanMicrobiomeProject(HMP)
onNCBI.Intotal,2,314bacterialreferencegenomesweredownloaded.
Asthereisalargenumberofmicrobialspeciesavailable,weconductedLCSsearchforavail-
ablemicrobialgenomesbyconstructinggeneralizedsufarrayandthecorrespondinglongest
common(LCP)array[50].
First
,webuildageneralizedSA[121].
Then
,theLCParray,
whichcontainsLCPsbetweeneachtwoadjacentsufes,canbecalculatedinlineartime[61,60].
BytheLCSforeachtwosequencesisthemaximumLCPbetweenallpairsofsufes
fromthetwosequences.Thefollowinglemmaisemployedinordertoavoidcheckingallthe
LCPvaluesbetweentwosequences.Forthesufstartingat
SA
[
i
]
,theLCPbetween
SA
[
i
]
and
SA
[
j
]
(
j
>
i
)isnolessthantheLCPbetween
SA
[
i
]
and
SA
[
k
]
if
k
>
j
.Withthispropertyanda
userLCScutoff,theLCPcalculationbetween
SA
[
i
]
andallothersufesafter
i
canbe
calculatedinconstanttime.Theoveralltimecomplexityis
O
(
N
)
.
TheresultsoftheLCShistogramsareshowninFigure2.3(A-C).Onemayalsoexamine
whetherthereadrecruitmentprocesscanincurcontaminationbyusingsimulatedorrealsequenc-
ingdata.However,theempiricalstudiesusingrealdataarelimitedtothevirusesinthesamples.
29
Meanwhile,producingsimulatedsequencingdataforallmicrobesisnotpractical.Usingsuf
arraybasedLCScomputationallowsustoobtainamorecomprehensiveviewofthecommon
regionsbetweendifferentmicrobes.
WealsocomparedthesizesoftheLCSsbetweendifferentmicrobeswiththeoneswithin
aviralpopulation.AsthecharacterizedhaplotypesfordifferentRNAvirusesareverylimited,
insteadofcomputingtheLCSusingavailabledata,weestimatedtheLCSswithinaquasispecies
usingaprobabilitymodelanddynamicprogramming[26].Withthemutationrateof3e-5ateach
baseduringvirusreplication,theprobabilitydistributionofLCSlengthbetweentwoHIVstrains
thatare
n
generationsapartwerecalculatedandthedistributionofLCSprobabilitiesisshownin
Figure2.3(D).
2.3.2Exp1:reconstructtheSARShaplotypesusingthebatcoronavirusas
thereference
Inthisexperiment,wemimicthescenarioinwhichSARS-CoV[23]isanemergingvirusinfecting
humans.OurgoalistoreconstructtheSARS-CoVhaplotypesusingothercoronavirusesasrefer-
ences.DuringthebreakoutofSARS,electronmicroscopeimagerevealsthecrown-likeshapeof
theinfectiousagent,providinghintstousecoronavirusesasreferences.
Totestthis,weassumethatthebatcoronavirus(NC_014470.1)wassequencedandavailable
touseasareference,althoughitwasactuallysequencedafterthebreakoutofSARS.
2.3.2.1Datapropertiesandevaluationmetrics
Aviralmetagenomicdatasetcontaining(NC_002023.1),hepatitisCvirus(HCV,NC_004102.1),
and5SARS-CoVhaplotypes,wassimulated.TheSARS-CoVhaplotypeswerecreatedfromthe
30
Figure2.3:HistogramoftheLCSsizesbetweenhumanviruses(A),betweenhumanvirusesand
non-humanviruses(B),andbetweenhumanvirusesandbacteria(C).Thex-axisisthe
log
10ofthe
LCSlength.They-axisisthenumberofpairswithinthegivenrangeofLCSsize.OnlyLCSsthat
arelongerthan10bparepresented.(D)ProbabilitydistributionforLCSsbetweentwosimulated
HIVstrainsthatare50,100,200,and500generationsapart.The
x
-axisisthelengthofLCS,with
arangefrom0to10,000bp.The
y
-axisisthecorrespondingprobabilitiesforthoseLCSsizes.
31
SARS-CoVreference(NC_004718.3)genomebymutatingbasesatrandomlyselectedlocations.
Thesequencesimilaritybetweenanytwohaplotypesisabove96%.Theabundanceofeachhap-
lotypeiscalculatedbasedonapowerlawequation[9].Thetotalsequencingdepthsforthe5
SARS-CoVhaplotypesare1000-x,with438-x,219-x,146-x,109-x,and88-xforeachhaplotype,
respectively.ThesequencingdepthsforandHBVare700-xand300-x,respectively.All
thedatasetsweresimulatedbyART-illumina[55]aserror-containingMiSeqpaired-endreads,
withthereadlengthof250bp,theaverageinsertsizeof600bp,andthestandarddeviationof
150bp.Intotal,thereare173,703simulatedreads,ofwhich119,002readsarefromtheve
SARS-CoVhaplotypes.
Withtheavailablebatcoronavirusasthereference,thesimulatedreadswerealignedwithboth
Bowtie2[71]andBWA[79].WethenappliedtheoverlapextensioncomponentofTAR-VIR
toisolateandenrichSARS-CoVreads,andassembledthemwith
denovo
assemblytools.Both
thereadrecruitmentandtheassemblyperformancewereevaluated.Forthesimulateddata,the
groundtruthoftheoriginatinghaplotypeandpositionofeachreadisknown.Thusreadrecruiting
performancecanbeevaluatedusingthereads'positionsandoriginatinghaplotypes.Insummary,
weexaminehowmanyreadsarecorrectlyrecruitedforeachhaplotypeandreportthehaplotype
coverageanddepth.
Theassemblyperformancewasevaluatedusingtheknowngenomesofthe5SARS-CoVhap-
lotypesandMetaQuast[100].Similartootherworks,wetheassemblycontinuity,
completeness,andaccuracyintermsofnumberofcontigs,N50,genomecoverage,andmismatch
rate.N50isasthemaximallengthsothatallcontigsabovethislengthcontainatleast50%
ofallthecontigbases.Genomecoverageisthepercentageofthevehaplotypes'genomesbeing
alignedbyatleastonecontig.Mismatchrateisthepercentageofmismatchesbetweenthealigned
contigsandthereferences.Inallcases,contigsofatleast500bparealignedtotheviralrefer-
32
encesequencesforevaluation.Theassemblyresultswerealsobenchmarkedwithotherpopular
assemblytoolsSGA[142],SPAdes[8],andSAVAGE[6].
2.3.2.2Performanceofreadrecruitment
WeappliedbothBowtieandBWAinthereadmappingstage.Byadjustingthescoringfunction
relatedparameters,weconstructeddifferentsetsofseedreadsthatcanbealignedtothereferences
withdifferentapproximatematchconstraints.Foreachseedset,therecruitedreadsgeneratedby
TAR-VIRarerecorded.Table2.1comparesthealignedandrecruitedreadsforeachSARS-CoV
haplotype.Besidesapproximatematchrates,wealsoconsideredlocalandﬁglocalﬂalignment
mode,wheretheglocalmoderequirestheend-to-endalignmentofthereadagainstthereference.
Usinglocalalignmentmodeforreadmappingcanusuallyproducealargerseedset.However,itis
possiblethatsomeofthelocallyalignedreadsarenotsequencedfromtheunderlyinghaplotypes.
InTable2.1,weusedlocalalignmentmodeforBWAandglocalmodelforBowtie.Thus,theseed
setsconstructedbyBWAislargerthanBowtie.
Evenwiththeleaststringentthreshold,thealignedreadshavelowergenomecoveragethan
recruitedreads,whichisexpectedbecauseSARS-CoVdoesnotsharegenome-scalehighsimilarity
withthebatcoronavirus(Figure2.4(C)).Inparticular,withtheparameterﬁ-B1",BWAcan
alignslightlymorereadsthanwhatBowtie2canrecruitwiththeparameterﬁL,0,-0.9"((52,688
vs.52,377).Therecruitedreads(52,377),however,cover20%-30%moregenomesfortheve
haplotypes.Thisindicatesthatalignment-basedmethodstendtoidentifyreadssequencedfrom
highlysimilarregionsbetweenthetargetandthereferenceviruses,whiletherecruitmentmethod
ismorelikelytoobtainreadsfromthewholegenomeofthetargetviruses.Worthnotingisallthe
recruitedreadsarefromSARS-CoV(nocontaminationfromandHBV).
Figure2.4(A)andFigure2.4(B)comparedthegenomecoverageofseedreadsandrecruited
33
Table2.1:ReadrecruitmentresultsbyusingseedsetsconstructedwithBowtie2andBWA.The
ﬁAlignmentﬂsectioncontainsresultsforalignedreads.TheﬁRecruitmentﬂsectioncontainsresults
forrecruitedreadsbyTAR-VIRusingthealignedreads.Foreachrow,thealignedreadsinﬁAlign-
ment"sectionaretheseedsetfortherecruitedreadsinﬁRecruitmentﬂsection.ForBowtie2,the
ﬁŒscore-minﬂparameterwassettoallowdifferentalignmenterrorratescorrespondingto5%,10%,
15%,and20%,respectively.ForBWA,ﬁ-Aﬂisedasitsdefaultvalue1.ﬁ-Bﬂwasto
allowdifferenterrorratesimilartoBowtie2.ﬁNumberﬂisthenumberofalignedorrecruitedreads.
ﬁDepthﬂistheaveragesequencingcoverage.ﬁCoverageﬂisthepercentageofgenomecoveringby
atleastoneread.h1toh5representtheveSARS-CoVhaplotypes.
Bowtie2
Alignment
NumberDepthCoverage
h1h2h3h4h5
h1h2h3h4h5
L,0,-0.3
550.130.010.060.150.12
0.010.010.010.010.01
L,0,-0.6
9253.61.50.90.90.9
0.070.060.050.070.08
L,0,-0.9
8,1543214987
0.310.310.270.270.3
L,0,-1.2
13,2214924151310
0.430.450.420.440.43
Recruitment
NumberDepthCoverage
h1h2h3h4h5
h1h2h3h4h5
L,0,-0.3
45,50418289594111
1.01.01.01.00.37
L,0,-0.6
46,57618390604218
1.01.01.00.960.55
L,0,-0.9
52,337
198
96
63
46
37
1.0
1.0
1.0
0.98
0.99
L,0,-1.2
55,48518289594139
1.01.01.01.00.99
BWA
Alignment
NumberDepthCoverage
h1h2h3h4h5
h1h2h3h4h5
B:8
24,5858946282018
0.40.370.330.310.34
B:4
41,56415278503732
0.630.570.560.570.53
B:2
51,99519594634639
0.790.780.770.700.74
B:1
52,688
199
94
64
47
39
0.81
0.78
0.80
0.71
0.76
Recruitment
NumberDepthCoverage
h1h2h3h4h5
h1h2h3h4h5
B:8
62,609235117755544
1.01.01.01.00.99
B:4
72,901270135896553
1.01.01.01.01.0
B:2
79,755299146977158
1.01.01.01.01.0
B:1
78,540294143967057
1.01.01.01.01.0
reads.Directlyaligningthereadstothebatcoronaviruscoversonlyasmallproportionofthewhole
genome(Figure2.4(A)),leadingtoincompleteassembly.Usingthesealignedreadsasseeds,TAR-
34
Figure2.4:EnrichingSARS-CoVreadsusingthebatcoronavirusgenomeasthereference.(A)
and(B)showthealignedandrecruitedreadse.ThedatasetwasalignedbyBWAwiththe
defaultparameter("-B4,-A1").BWAischosentoincludemorelocallyalignedreadsin(A).The
readswererecruitedusingtheoverlapcutoffof150bp.(C)displaysthesequenceidentitybetween
SARS-Covandthebatcoronavirus.ThewasgeneratedusingVISTA[45].
VIRisabletorecruitmanymorereadsthatnearlycoverthewholegenomeofSARS-Cov,asshown
inFigure2.4(B).
Table2.1alsoshowsthatthenumbersofrecruitedreadsdonotheavilyrelyonthenumberof
theseedreads.Evenwhentheseedsetissmall(e.g.theseedsetsconstructedusingBowtie2),
manynewreadscanberecruitedduringeachiteration.Aftermultipleiterations,thesetof
recruitedreadscanbelargerthantheseedset,boundedbythesequencingdepthof
thehaplotypes.Ontheotherhand,iftheseedsetscontainmanyreadsfromnon-relevantspecies,
thesetofrecruitedreadscouldevenincludeallthereadsfromtheinput,whichmakesthe
readrecruitinguseless.Becauseofthis,weprefertoconstructtheseedsetusingglocalmodeto
ensurehighquality.
35
Recruitedreadsleadtobetterassemblyperformance
Boththealignedandrecruitedreadswereassembledwith
denovo
assemblytools.PEHaplois
thedefaultassemblycomponentinTAR-VIR.AsTAR-VIRhasamodularstructure,other
de
novo
assemblytoolsincludingSGA,SPAdes,andSAVAGEarealsousedtoreplacePEHaplofor
haplotyereconstruction.SPAdeswasrunwithŒmetaoption,whichissameasmetaSPAdes[108].
Asthealignedreadscoveratmost80%ofthegenomesevenwiththeleaststringentalignment
threshold,itisnotpropertoapplytheconventionalreference-basedassemblymethodsforthis
dataset.
Thecomplete
denovo
assemblyresultsusingbothalignedandrecruitedreadsarepresentedin
SupplementaryTableS1.PartoftheresultsareshowninTable2.2duetospacelimiation.Forall
assemblytools,usingrecruitedreadsproducesbetterresults:longercontigsandhighergenome
coverage.,thisisnotsimplyduetotheincreasednumberofreads.Forexample,as
showninTable2.2,thereadsrecruitedbyBowtie2withparameterﬁL,0,-0.9ﬂislessthanBWA-
alignedreadswhenBis1(52,337vs.52,668).Buttherecruitedreadsproducecontigsatleastten
timeslongerthanthealignedreads,andtwicehigherinthegenomecoverage.Bycomparingthe
assemblyperformanceofalltestedtoolsontherecruitedreads,ourassemblycomponentPEHaplo
consistentlyhashigherN50andgenomecoveragethanothers.Overall,PEHaploandSGAperform
betterthantheothertwoassemblytools.
PRICEappliesextension-basedstrategiesforcontigassembly.Usingtheseedreadsasinitial
contigs,PRICEcanbereadilyusedtoperformtargetedviralassemblyfrommetagenomicdata.
Therefore,theresultsofTAR-VIRwasalsobenchmarkedwithPRICE'sresults,asshowninTa-
ble2.3.PRICEproducedonelongcontig(similarN50toours)forthemostabundanthaplotype.
Thus,itsgenomecoverageisonlyabout20%.
36
Table2.2:AssemblyresultsonSARS-CoValignedandrecruitedmetagenomicdata.Thede-
faultassemblytoolinTAR-VIRisPEHaplo.TheofthemetricscanbefoundinSec-
tion2.3.2.1.
Aligned
Tool#ContigsN50
Genomes
covered(%)
Mismatch
rate(%)
Bowtie2
L,0,-0.9
TAR-VIR5850519.70.02
SGA5650520.10.03
SPAdes3456912.90.16
SAVAGE5445517.50.0
Recruited
Tool#ContigsN50
Genomes
Covered(%)
Mismatch
rate(%)
TAR-VIR
7
29,676
98.9
0.0
SGA
13
26,729
98.9
0.0
SPAdes
14
15,882
92.1
0.51
Bowtie2
L,0,-0.9
SAVAGE
22
12,445
97.0
0.0
Aligned
Tool#ContigsN50
Genomes
covered(%)
Mismatch
rate(%)
TAR-VIR
84
1,192
55.1
0.0
SGA
85
1,027
56.5
0.0
SPAdes
67
1,012
44.6
0.12
BWA
B:1
SAVAGE
68
669
32.3
0.0
Recruited
Tool#ContigsN50
Genomes
Covered(%)
Mismatch
rate(%)
BWA
B:1
TAR-VIR629,70699.50.0
SGA1812,63899.50.0
SPAdes2110,35389.20.39
SAVAGE565,14089.30.0
Table2.3:AssemblyresultsonSARS-CoVmetagenomicdataforTAR-VIRandPRICE.
Tool#ContigsN50GenomesCovered(%)Mismatch(%)
Bowtie2
L,0,-0.9
TAR-VIR729,67698.90.0
PRICE129,74920.01.7
BWA
B:1
TAR-VIR629,70699.50.0
PRICE129,75020.01.66
2.3.3Exp2:characterizinghepatitisvirusesfromthehumanplasmadata
Inthisexperiment,TAR-VIRistestedonarealmetagenomicdataset,whichwassequenced
fromtheplasmaof19antiretroviral-treatedHIVpatients(SRR2083204)[80].Thesampleswere
byrandomRT-PCR(RA)forbothviralRNAsandDNAsandthen
37
sequencedbyIlluminaMiseq,producingabout23millionreads.Allthesesamplescontainlow
levelsofHIVbecauseoftheantiretroviraltreatment.Butitmaycontainotherhumanpathogens.
Inourstudy,wefocusedonidentifyinghepatitisviruses.Althoughourpipelineisdesignedto
tacklethechallengesofcharacterizingRNAviralquasispecies,wealsoincludeinthereferences
DNAhepatitisvirusessuchasHBV.
2.3.3.1Preprocessing
Therawdatasetcontainsreadsthatcomefromvariedsources:human,bacteria,phages,etc.
Thereadsofthetargetvirusescompriselessthan30%oftheentiredataset.Sincetheprimary
focusishumanviruses,removingthosereadsfromthehost(human),bacteria,andphagesisideal
beforepathogendetection.Followingcanonicalqualitycontrolandtrimming,weusedbamtagger
[124]toremovehumanreads,andBowtie2toremovereadsfrombacteriaandphagebyaligning
readsagainsttheirreferencegenomes.Theremainingreadswerecorrectedbyerrorcorrectiontool
Karect[2].Afterpreprocessing,8,145,722readswereleft.
2.3.3.2RecruitedreadsbyTAR-VIRcanimprovetheperformanceof
denovo
assembly
Inthestep,weconductedreadmappingtoobtaintheseedreads.BothBWAandBowtie2
couldbeused.However,althoughBWAalignedmorereads,manyreadsyieldedonlyshortlocal
alignmentsandareunlikelytobesequencedfromthetargetviruses.Usingthesereadsasseeds
tendtocausecontaminationduringthereadrecruitmentstage.Forexample,whenBWA(ﬁ-B8,-A
1")wasusedtogeneratetheseedset,roughly3.5millionreadswererecruited,whileaportionof
themcanbealignedtoothergenomes(suchasphages).AlthoughBWA'soutputcanbeprocessed
toremovelocalalignments,theseedsetcanbemorereliablyproducedusingBowtie2'soutput.
Therefore,Bowtie2waschosenasthealignerforallrealdataexperiments.
38
WedownloadedthereferencegenomesofHBV(NC_003977.2),multiplegenotypesofHCV
(NC_009827.1,NC_009823.1,NC_009825.1,NC_030791.1,NC_004102.1,NC_009826.1,NC_009824.1),
andhumanpegivirus(HPgV,NC_001710.1)fromtheviralgenomedatabaseofNCBI.Thepre-
processedreadswerethenalignedtothereferencesundermismatchratesof5%,10%,15%,and
20%,respectively.Theseinitiallyalignedreadswereusedastheseedreadsets.Althoughthere
aremultiplegenotypesforHCV,onlygenotype1hasadecentamountofalignedreads.Other
genotypeshavelessthan50readsmapped.Thus,toproduceareliableevaluationoftheassembly
results,onlytheresultsofHCVgenotype1wereused.Thenumbersofreadsbeforeandafterread
recruitingareshowninTable2.4.
Table2.4:Overlapextensionresultsusingdifferentseedset
R
0
.`#'represents`number'.The
shadedregionsinthistableandTable2.5highlightthecasewherelessrecruitedreadscanproduce
betterassemblyresultsthanalignedreadsonly.
Align
mismatch
Seed
#
Recruited
#
Align
mismatch
Seed
#
Recruited
#
5%21,925200,650
10%67,973222,065
15%162,454
263,029
20%
294,448340,705
Asthisisarealmetagenomicdatasetwithoutknowngroundtruthoftheviralhaplotypes,the
evaluationmetricsforreadrecruitingaredifferentfromthesimulationdataset.Wecannotevaluate
whethereveryrecruitedreadiscorrectbecauseitsoriginatinglocationisunknown.Thus,instead
ofevaluatingthedepthandgenomecoverageforeachhaplotype,wefocusonevaluatingwhether
usingrecruitedreadscanimprovetheperformanceofgenomeassembly.
Therefore,boththealignedreadsandtherecruitedreadsfromTAR-VIRwereassembledby
de
novo
assemblytools,andtheresultswerecomparedinTable2.5.Theassemblyresultsdemonstrate
thatthereadsrecruitedbyTAR-VIRusuallyimprovetheassemblyresultsbyproducinglonger
contigsandhighergenomecoverageforitsPEHaplo,SGA,andSPAdes.Theimprovementisnot
39
simplyduetotheincreasednumberofreadsaftertherecruitmentstage.Forexample,according
toTable2.4,byusing15%mismatchrate,therecruitedreadsarelessthanthealignedreadsunder
20%mismatchrate(263,029vs.294,448).However,theassemblyresultsusingtherecruited
readsarebetterthanorcomparabletotheresultsusingthealignedreadsforalltheassemblytools.
Amongthefourassemblersused,PEHaploofTAR-VIRandSPAdesproducedgoodresultswith
largeN50andhighgenomecoverage.SGAgeneratedlargernumberofcontigswithlowN50
value.WhilewehavetriedthebestparametersforSAVAGEwithourempiricalexperience,its
resultsarenotconsistentwiththeotherthreetools.BetterparametersmayexistforSAVAGEto
producebetterresults.However,thelong-runningtimeandhighmemoryusageofthistoolmade
continuingtotunetheparametersdif
Table2.5:AssemblyevaluationresultsonalignedandrecruitedreadsusingthegenomesofHBV,
HCV,andHPgVasreferences.`cov.'istheabbreviationfor`coverage'.Thedefaultassembly
componentinTAR-VIRisPEHaplo.
Align
Tool
Bowtie2aligned
Bowtie2Recruited
Contig
#
N50
Genome
cov.(%)
Contig
#
N50
Genome
cov.(%)
5%
TAR-VIR
1192027.3
973,64382.3
SGA
1464526.8
6367568.4
SPAdes
51,17727.6
153,63679.6
SAVAGE
1369821.6
4980640.4
10%
TAR-VIR
6179467.4
312,63584.0
SGA
2666356.4
7270669.5
SPAdes
151,25165.4
193,37379.3
SAVAGE
3063140.6
3291526.5
15%
TAR-VIR
9793980.9
14
3,579
83.1
SGA
5661757.7
74
722
70.2
SPAdes
201,68977.6
16
3,986
81.0
SAVAGE
3263929.9
24
999
27.6
20%
TAR-VIR
38
1,852
84.5
775,67886.4
SGA
78
661
59.5
37453764.5
SPAdes
23
2,710
83.8
104,83084.6
SAVAGE
19
671
19.1
158235.5
40
2.3.3.3Comparisonwithreference-basedandextension-basedassemblymethods
Withthereferencegenomesavailable,reference-basedtoolscanbeappliedforviralmetagenomic
dataanalysis.Therefore,wealsobenchmarkedTAR-VIRwithreference-basedhaplotyperecon-
structiontoolsincludingHaploclique[147],drVM[86],andViQuas[59].VirusTap[162]canalso
conductreadsandthenassembly.WhilewewereplanningtocompareTAR-VIR
withVirusTap,alargedatasetcouldnotbeuploadedtothewebsite-basedVirusTap.Inaddition,
about3,000jobswerewaitingatthewebsite.Therefore,theresultsfromVirusTapcouldnotbe
reported.
Thereadsalignedwiththemismatchrateof15%wereusedasinputforHaplocliqueand
ViQuas.FordrVM,thereferencegenomeswerebuiltfromhumanviruses,anditranonthe
rawfastq(withsimplequalitycontrolandtrimming)dumpedfromSRAwithdefault
parameters.TheseedreadsetofTAR-VIRwerealsothereadsmappedwith15%mismatchrate.
TheassemblyresultsareshowninTable2.6.
Table2.6:Assemblyresultscomparisonwithreference-andextension-basedmethods.
Tool
Contig
#
N50
Genome
cov.(%)
Tool
Contig
#
N50
Genome
cov.(%)
TAR-VIR143,57983.1
ViQuas3969,646100.0
Haploclique50,41930471.1
drVM41382981.9
TheresultsshowthatTAR-VIRperformsbetterthanHaplocliqueanddrVMbyproducing
fewerbutlongercontigswithhighergenomecoverage.Withthecompleteandalsothelikely
ﬁtrueﬂvirusgenomesasthereference,ViQuashasproducednear-completegenomes.However,it
producesalmost400contigswithsimilarlengths(fullgenomes),indicatingahighprobabilityof
overestimationofthehaplotypes.Sincethegroundtruthoftheactualnumberofhaplotypesinthis
datasetisunknown,weintendedtotestthishypothesisusingadatasetwithknownhaplotypes.
Therefore,wetestedViQuasontheSARS-CoVsimulateddatasetwith5haplotypes.Itreported
41
113contigs,eachcovering99.98%ofthegenomewithhighmismatchrate(
>
9.0%).Thus,the
longcontigsproducedbyViQuasarenotlikelythetruehaplotypes.
SimilartoSARS-CoVdata,wealsobenchmarkedourresultswiththeextension-basedtool
PRICE.TheinitialcontigsofPRICEwerealsothereadsmappedwith15%mismatchrate.PRICE
generated164contigs,withaN50valueof791,andgenomecoverageof87.3%.PRICE'sresults
haveaslightlylargergenomecoveragebutamuchsmallerN50valuecomparingtoTAR-VIR.
2.3.3.4Assemblingthewholedatasetdirectly
AsSGAandSPAdesarehighlyefandhavebeenusedbyvariousvirusanalysispipelines,it
maybepossibletodirectlyapplythemtoallthepreprocessedreadsforrecoveringthethreeviruses
((HBV,HCVgenotype1,andHPgV).Thus,weappliedSGAandSPAdestothepreprocessed
reads.Theassembledcontigswerecomparedwiththereferencegenomesofthethreeviruses.
SGAtookabout1hourtoItgenerated2,659contigs,fromwhich123contigscanbealigned
tothethreeviruseswiththesimilaritythresholdof90%.The123contigscancover42.36%of
thereferencegenomes.SPAdesfailedtoreporttheresultswithin24hours.Theresultsfrom
SGAvthatalthoughthepreprocesseddatasetcontainsallthereadsfromthetargetviruses,
thesheerdatasizeandthelowproportionofthethreevirusesmakegenericassemblydif
Meanwhile,assemblingalargedatasetconsumescomputingresources.
2.3.4Identifyingvirusescontainingtargetgenes
Insomesituations,theresearchersareonlyinterestedintheviralgenomescontainingapartialor
completegene.Inthesecases,itisdifforexistingreference-basedvirustoolsto
constructthewholeviralgenome.Here,wedemonstratethatwiththeoverlapextensionmethod,
themostofagenomecanbebuiltfromapartialgenereference.
42
Inthisexperiment,weshowthatwithanon-completegenesequenceoflength1,073bpfor
HPgVasreference,mostofthegenomecanbeassembled.Thereferencesequence(Sequence
name:10MYKJ037)wasdownloadedfromVirusPathogenDatabaseandAnalysisResource
(ViPR)[117],whichisapartialcodingDNAsequence(CDS)ofHPgVisolatedfromMalaysia
in2010.ThetotallengthofHPgVgenomeis9,392bp.Fromtheresultsofourpreviousex-
periments,overlapextensionfromalignedreadsundermismatchrateof15%wasabletorecruit
adequatereadswhilekeepingawayunreliablereadsasseeds.Therefore,wealignedtherawreads
tothisCDSreferencebyallowingmismatchrateof15%,fromwhich19,714readswerealigned.
ViQuasanddrVMwereusedtoassemblethealignedreads.However,ViQuascouldonlyproduce
contigssimilartotheshortCDSsequence.InadditiontoprovidedCDSreference,drVMalso
downloadedreferencesfromInternet.ItcorrectlyrecognizedtheHPgVbutfailedtoproduceany
contig.Theresultsthatreference-basedmethodsdonotapplyinthiscase.Withoverlap
cutoffof150bp,118,339readswererecruitedfromtheoverlapextensionstep.Theywerethen
assembledbyPEHaploofTAR-VIR,SGA,SPAdes,andSAVAGE,asshowninTable2.7.
Table2.7:AssemblyresultsonrecruitedreadswithapartialCDSsequenceforHPgVasreference.
Tool
Contig
#
N50
HPgV
cov.(%)
Tool
Contig
#
N50
HPgV
cov.(%)
TAR-VIR57,95986.0
SPAdes68,95794.0
SGA4159149.0
SAVAGE3558049.5
Whilethelengthofreferencestrainbeingonly11.42%ofthewholeHPgVgenome,thecontigs
assembledfromrecruitedreadsbyTAR-VIRandSPAdesareabletocoverthenearlycomplete
genome.Theresultsrevealthatevenwithagene/CDSsequenceasreference,sufreadscan
stillbecollectedtoconstructthevirusatthewholegenomelevel.Asthereisonlyonetargetvirus,
SPAdesproducedthebestresults.ApplyingSPAdestothewholehumanplasmadatasetfailed
43
toontheclusterafter24hours,butbyusingrecruitedreads,SPAdescanproducebetter
assemblieswiththeminimumamountofresources.
2.3.5Computationaltimeandmemoryusage
WeevaluatedthetimeandmemoryusageofTAR-VIRontherealhumanplasmadata.After
preprocessing,8,145,722readswereleft.Thedatasizeis2.9GB,andthetotallengthofthe
sequencesare2,447,741,491bp.Toreducememoryusage,therawdatawassplitinto5parts,with
5BWTsbeingbuiltforthewholedata.Thesplittingprocessisembeddedinourprogram,andthe
numberofsegmentscanbesetbyusers.Foreachpartition,thesizesfortheBWT,Occarray,
andthereadIDarrayare490M,200M,and13M,respectively.Thetotalsizeofbuiltindexesis3.5
GB.ThedetailedtimeandmemoryusagefortheoverlapextensionisshowninTable2.8below.
Ausercanloadeachpartitionseparatelytoreducethememoryusage.Inthatcase,thememory
usageisaboutthesizeofeachpartition.Inaddition,wemayfurtherreducethememoryusageof
therecruitingprocessbyapplyingmorecompactimplementationoftheBWT[18].
Table2.8:Timeandmemoryusageforoverlapextensionandassemblyonviralmetagenomicdata
fromhumanplasam.The
denovo
assemblytimeandmemoryusagewereevaluatedonrecruited
readsbasedonmismatchratefrom5%to20%.
TimeMemory(GB)
Overlap
extension
Buildingindex
127m2.4
Recruitment
<
20m
˘
3.5
Denovo
Assembly
TAR-VIR
5m-20m2-4
SGA
<
10m
<
1
SPAdes
<
10m
<
1
SAVAGE
2h-72h64
PRICE
˘
2h
˘
5.2
Reference-based
assembly
Haploclique
33h5
ViQuas
>
72h
<
4
drVM
<
30m
<
1
AlltheexperimentsweretestedonanMSUHPCCCentOS6.8nodewithTwo2.4Ghz14-core
44
IntelXeonE5-2680v4CPUsand128GBmemory.Weused4threadsfortheassemblycompo-
nentofTAR-VIR,8threadsforSGA,16threadsforSAVAGE,8threadsforPRICE,1threadfor
HaplocliqueandViQuas,and2threadsfordrVM.
2.4Conclusions
InthisChapter,wepresentedanovelpipelineforviralreadscationandstrain-levelassembly
fromviralmetagenomicdatanamedTAR-VIR.Whenavirusinametagenomicdatasetisonly
remotelyrelatedtoacharacterizedvirusinpublicdatabases,ourpipelinecanbeappliedto
classifythereadsbelongingtothesevirusesandthenconductstrain-levelassembly.Orifauseris
interestedindetectingavirusthatcontainsagivengene,ourmethodcanbeemployedtorecover
thewholegenomeofthegene-containingvirus.
Wealsomadecontributionsbyconductingcarefulanalysisofthecommonregionsizesbe-
tweenandwithinviralquasispecies.Theseanalyseslaidthefoundationforusingoverlapdetection
toclassifyreadsofthesamequasispecieswithoutintroducingcontamination.Ouruniqueimple-
mentationoftheindexingstructurealsomakeourmethodeconomicalinbothmemoryandCPU
usages.
Wedemonstratedthetool'sutilitiesonasimulatedviralmetagenomicdatacontainingSARS-
Covandarealviralmetagenomicdatasetsequencedfromhumanplasma.Thesimulateddata
enablesustoevaluatetheperformanceofreadclassitotheresolutionofeachsingleread.
ItshowsthatTAR-VIRcansuccessfullyclassifyenoughreadstocoverthewholegenome.In
addition,itproducedcontigscoveringvedifferenthaplotypes.
Onthehumanplasmadata,wewereabletoenrichenoughreadsfromthetargetvirusesfor
downstreamassemblyevenwithasmallseedreadset.WithapartialCDSsequenceforHPgVas
45
reference,TAR-VIRwasabletoproducenearcompletegenomeassemblies.Theresultsclearly
showedtheeffectivenessofTAR-VIR.Insummary,TAR-VIRprovidescomplementaryfunctions
toexistingvirusdetectiontoolswhenthequalityorcompletereferencesarenotavailable.
46
Chapter3
Denovohaplotypereconstructioninvirus
quasispeciesusingpaired-endreads
3.1Introduction
Virusquasispeciesdescribeapopulationofdifferentbutcloselyrelatedviruswithdynamicdistri-
bution.Eachstraininquasispeciesisbyitshaplotypesequence.Natureselection,point
mutation,andrecombinationscanallchangethehaplotypesandtheirabundanceinsideaqua-
sispecies.RNAvirusandsomeDNAvirusevolvefollowingthequasispeciesmodel.Commonly
knownexamplesincludehumanyvirus(HIV-1),thehepatitisCvirus(HCV),the
foot-and-mouthvirus(FMV),etc.Figure3.1showsalocalmultiplealignmentofvehaplotypes
inHIVquasispecies.
Astheselectionworksonasetofsequencesratherthanone,quasispecieshaveabilitiesto
escapehostimmuneresponsesordevelopdrugresistance.Reconstructionoftheviralhaplotypes
isafundamentalsteptocharacterizethequasispecies,predictviralphenotypes,andprovide
importantinformationforclinicaltreatmentandprevention.
Developmentofnext-generationsequencingtechnologiesshedslightoncharacterizingthehap-
lotypesandtheirabundanceinquasispecies.Ifthereferencesequencesareavailable,readmapping
canbeconductedtoinferhaplotypes.However,duetothehighmutationrate,usuallyqualityref-
erencegenomesofquasispeciesarenotavailable.Thus,thereisaneedfor
denovo
haplotypere-
47
constructionmethod.Arecentreviewofchosenhaplotypeassemblytoolshasshownthat
denovo
haplotyperecoveryisacomputationallychallengingproblem.Itsharessomecommonchallenges
withmetagenomicassembly,whichaimstoassembleshortreadsintofullgenomesofmember
species.Theessentialideaistobuildagraph(e.g.DeBruijnoroverlapgraph)usingreadsinthe
givensampleandthenrecoverthegenomebywalkingthroughapath.Intheidealcase,where
therearenosequencingerrors,thereissufcoverage,andthegenomesdiffersubstantially,it
istrivialtoidentifyreadsbelongingtothesamehaplotypeandstitchthemintoalongsequence.
However,inreality,haplotypeconstructionmusthandleshorterror-pronereads,highsimilarity
betweendifferentstrains,raremutations,andlowabundanceofsomestrains.Wediscuss
howtheseissuescomplicatetheassemblyproblem.Asweconstructoverlapgraphforhaplotypes
recoveryinthiswork,wefocusonhowtheseissuesmake
denovo
assemblyinoverlapgraphs
dif
First,ifthecommonregionbetweenanytwostrainsisshorterthanthereadlength,thecon-
nectionofallreadswillleadtodifferentpaths,eachofwhichcorrespondstothefullsequenceof
ahaplotype.However,thecommonregionsbetweenstrainscaneasilyexceedthesizeofaread
andthusintroducemanysharednodesandbifurcationsintheoverlapgraph.Second,sequencing
errorscanintroducewrongconnections,tips,bubblesandaddtothecomplexityof
thegraph.Third,alignment-basederrorcorrectionhavedifindistinguishingraremutations
fromsequencingerrors.Finally,readsoriginatingfromhaplotypesoflowabundancetendtohave
smalloverlapsandthusonlyfragmentsofhaplotypesbereconstructed.Indeed,arecentbench-
marking[136]demonstratedthattheperformanceofallthetestedprogramsispoorwhensequence
divergenceislow.Inaddition,theseprogramsfailedtorecoverrarehaplotypes.Thus,thereisa
needfornewmethodsandtoolsformoreaccuratehaplotypeassembly.
Thereexistanumberofgenericassemblytools[106,150,73,116,90,133,159],whichcanbe
48
Figure3.1:MultiplesequencealignmentofveHIV-1haplotypes.
appliedforsequenceassemblyinviralquasispeciesdata.However,thesegenericassemblytools
arenotdesignedforrecoveringviralhaplotypesinquasispecies.Highsimilaritybetweendifferent
strains,plussequencingerrors,canleadtohighlytangledgraphs.Applyingexistingtoolseither
leadtoveryshortcontigsorchimericcontigs.
Thereareagroupoftoolsdesignedforreconstructionofhaplotypes[166,92,118,
148,119,4,146,54].Sometoolsfocusedonestimatinglocalhaplotypes.Ourgoalofthiswork
isglobalreconstruction,whichaimsatreconstructingthefull-lengthviralhaplotypesinasample.
Comparedtolocalreconstruction,itprovidesamorecompleteandaccuratecharacterizationof
thequansispecies.SometoolssuchasShoRAHrelyonreferencesequencesforconductingalign-
ments.Formoregeneralapplicationofhaplytypereconstruction,weneeda
denovo
assemblytool
thatdoesnotrelyonreferencesequences.
Oftheavailablehaplotypereconstructionprograms,therearethreehighlyrelatedonestoours.
TheyareHaploClique[147],SAVAGE[5],andMLEHaplo[93].LikePEHaplo,allofthemare
designedforpaired-endreads.HaploCliqueusesenumerationofmaximalcliquesasageneral
approachtoclusteringNGSpaired-endreads.AlthoughPEHaploisalsodesignedforpaired-
endreads,thetwoarehighlydifferentinusingpaired-endreadsinformation.First,HaploClique
needsareferencesequenceforgeneratingalignmentgraphswhileoursdoesnotrelyonreference
sequences.Second,HaploCliqueusestheinsertsizesfordetectionoflargeindels.However,insert
sizeusuallyhasadistributionandcanvarysubstantiallybetweendifferentfragments.Thusthis
49
informationisusedmoreconservativelyinPEHaplo.MLEHaploalsoexplicitlyemployspaired-
endreadsfortop-scorepaths.Itsusageofpaired-endinformationinaDeBruijngraphis
highlydifferentfromourmethodbasedonanoverlapgraph.HaploCliqueprovidesasourceof
inspirationforSAVAGE[5],whichisthetoolforrecoveringviralhaplotypesusingoverlap
graphs.SAVAGEalsotookadvantageofpaired-endreadsandmergeshortreadsusingcliques.
TheauthorsbenchmarkedSAVAGEwithothervirusassemblytoolsandshowedthatSAVAGE
outperformedothertoolsinacomprehensivesetofassemblymetrics.Aswefocuson
denovo
assemblytools,wewillbenchmarkPEHaploagainstSAVAGEandMLEHaplo.
Inthiswork,wedesignedandimplementedPEHaplo,whichassemblesvirushaplotypesfrom
virusmetagenomicsequencingdata.Sequenceassemblyhasbeenanintensiveresearchareaand
newmethodsorimplementationsareemergingquickly.PEHaploadoptedrelevanttechniquessuch
aserrorcorrectionandefoverlapconstruction.Mostimportantly,wemadecontributionsto
distinguishhighlysimilarhaplotypesusingpaired-endreads.Therationalewillbedetailedin
theMethodsSection.WeappliedPEHaplotobothsimulatedandrealviralquasispeciesdataand
comparedtheassemblyperformancewiththerecentlypublishedtools.Theexperimentalresults
showthatPEHaplocanrecoverviralhaplotypeswithlongercontigsandhigheraccuracy.
3.2Methods
Inthissection,wewilldescribethegraphmodelsweuseforhaplotypeconstruction.Then
wewillpresentthekeyideaofapplyingpaired-endreadstodistinguishdifferentviralhaplotypes.
Finally,weshowthewholepipelineanddetaileachmaincomponent.
50
3.2.1Overlaphgraphandpaired-endgraph
Anoverlapgraph
G
(
V
;
E
)
isaweighteddirectedgraphthatoverlapsbetweenreads.Each
node
v
2
V
representsaread.Anoverlapbetweentworeadsisformedifthesufofareadmatches
theofanotherread.Givenanytworeads
r
1
,
r
2
,andanoverlapthreshold
l
,iftheoverlap
sizebetween
r
1
and
r
2
isgreaterthan
l
,adirectededgeisaddedfromthenodesrepresenting
r
1
and
r
2
in
G
.Theedgeweightistheoverlapsize.
Whileanoverlapgraphrecordstheconnectivitybetweenreads,wealsorecordthenumberof
paired-endreadsbetweennodesusingapaired-endgraph
PE
_
G
.Apaired-endgraphisaweighted
undirectedgraph,whichhasthesamenodesetastheoverlapgraph.Twonodesareconnectedby
anedgeifthecorrespondingreadsformareadpair.Inpractice,anodeintheoverlapgraphcan
containmultiplereadsafterwecombinenodeswithoutbifurcations.Thus,multiplereadpairscan
existbetweennodes.Thetotalnumberofreadpairsbetweentwonodesislabeledastheedge
weightin
PE
_
G
.
InFigure3.2(B),theedgesinoverlapgraphareshownusingsolidlineswhiletheedgesinthe
paired-endgraphareshownusingdashedlines.Nodesa.1anda.2formareadpairandthushave
anedgeofweight1in
PE
_
G
.Similarly,nodesd.1andd.2haveanedgeofweight1becaused.1
andd.2areareadpair.
3.2.2MutationRateforSequenceReplicationandProbabilityofLongest
CommonSubstring(LCS)
Differentstrainsofavirusquasispeciesusuallysharehighsequencesimilarity.Butthemuta-
tions,insertionsordeletionscanhappenrandomly.Insteadofsequencesimilarity,weuselongest
commonsubstring(LCS)tocharacterizethesimilaritybetweenstrains,fromwhichstrain-levelas-
51
semblyispossibleiftheLCSisshorterthanthepaired-endinsertsize.Herewewillinferthe
mutationrateaccumulationbetweenaninitialstrainandits
n
thoffspring,thenprovideadynamic
programmingalgorithmforcalculatingtheprobabilityofLCSwithlength
m
betweentheinitial
strainanditsoffspring.
Mutationrateaccumulationforsequencereplication
Imaginetheenvironmentinitiallywith
onlyonevirusstrain
x
0
oflength
L
,mistakescanhappeneachtimewhenthegenomereplicates
andwillresultinanequilibriumwithmultiplestrainsofconstantabundances,whichis
describedbythequasispeciestheory.
Reproducingmistakescanbeinsertions,deletionsorsubstitutions.Tosimplifytheproblem,
weassumeonlysubstituionscanhappenduringreplicationandthevirusgenomelengthremains
unchanged.Assumetheprobabilityofmutationateachpositionwhenreplicatingisconstantand
independanttoeachother,denotedas
m
,weaskwhatismutationprobabilityateachlocation
between
x
0
andits
n
thoffspring
x
n
?
Thisproblemcanbesolvedbydynamicprogramming.Let
f
(
i
)
betheprobabilitythat
x
n
[
i
]
is
differentfrom
x
0
[
i
]
,and
g
(
i
)
betheprobabilitythat
x
n
[
i
]
issameto
x
0
[
i
]
.Weget
8
>
<
>
:
f
(
i
+
1
)=(
1

f
(
i
))
m
+
f
(
i
)(
1

m
=
3
)
;
i
=
1
;
2
;
3
;:::;
L
g
(
i
+
1
)=
g
(
i
)(
1

m
)+
f
(
i
)
m
=
3
;
i
=
1
;
2
;
3
;:::;
L
(3.1)
Notethat
f
(
1
)=
m
and
g
(
1
)=
1

m
,and
f
(
i
)+
g
(
i
)=
1,equations3.1canbesolved
8
>
<
>
:
f
(
i
)=(
1

4
m
3
)
i

1
(
m

3
4
)+
3
4
;
i
=
1
;
2
;
3
;:::;
L
g
(
i
)=(
1

4
m
3
)
i

1
(
3
4

m
)+
1
4
;
i
=
1
;
2
;
3
;:::;
L
(3.2)
ProbabilityofLCS
Fortheinitialvirusstrain
x
0
oflength
L
,weaskwhatistheprobabilityof
LCSwithlength
m
between
x
0
andits
n
thoffspring
x
n
?
52
Againthisproblemcanbesolvedbydynamicprogramming.
f
(
i
)
astheprobability
that
x
n
[
1
toi
]
has
LCS

m
with
x
0
and
x
n
[
i
]
mutated,
g
(
i
)
astheprobabilitythat
x
n
[
1
toi
]
has
LCS

m
with
x
0
and
x
n
[
i
]
notmutated.
t
astheprobabilitythat
x
n
[
i
]
isdifferent
to
x
0
[
i
]
,whichcanbecalculatedusingequation3.1.Nowwelookatthecasewhen
m
=
2
f
(
i
+
1
)=
t
f
(
i
)+
t
g
(
i
)
g
(
i
+
1
)=(
1

t
)
f
(
i
)+(
1

t
)
2
f
(
i

1
)
m
=
3
f
(
i
+
1
)=
t
f
(
i
)+
t
g
(
i
)
g
(
i
+
1
)=(
1

t
)
f
(
i
)+(
1

t
)
2
f
(
i

1
)+(
1

t
)
3
f
(
i

2
)
.
.
.
m
f
(
i
+
1
)=
t
f
(
i
)+
t
g
(
i
)
g
(
i
+
1
)=
å
m

1
j
=
0
f
(
i

j
)(
1

t
)
j
+
1
If
i

m
,theLCSwillalwaysbelessorequalto
m
.Thus,
f
(
i
)=
t
;
g
(
i
)=
1

t
.Therefore,the
recursiveequationsforcalculating
f
(
i
)
and
g
(
i
)
are
When1

i

m
8
>
<
>
:
f
(
i
)=
t
g
(
i
)=
1

t
(3.3)
When
L

i

m
+
1
53
8
>
<
>
:
f
(
i
)=
t
(
f
(
i

1
)+
g
(
i

1
))
g
(
i
)=
å
m

1
j
=
0
f
(
i

j

1
)(
1

t
)
j
+
1
(3.4)
Thetimecomplexitytocalculatetheprobabilityfor
LCS
=
m
is
O
(
2
L
+
mL
)
.Sincethesequence
lengthis
L
,
m
canbeanyintegersbetween0and
L
.Tocalculatetheprobabilityforeach
m
2
[
0
;
L
]
,
thetimecomplexityis
O
(
2
L
2
+(
L
+
1
)
L
2
2
)=
O
(
L
3
)
.ForHIV,itsgenomelengthisabout10
4
.
Calculatingdirectlywithequations3.4forallthepossibleLCSistimeconsuming.Itcostshours
tocalculatetheprobabilitiesforalltheavailableLCSwithatypicalsingle-coreCPU.
Infact,theresultof
g
(
i
)
canbeusedtocalculatefor
g
(
i
+
1
)
,whichwillreducethetotaltime
complexityto
O
(
L
2
)
.Let
S
(
i
)=
f
(
i
)+
g
(
i
)
,wewillhavetherecursiverelationship(Proofomitted)
S
(
i
+
1
)=
S
(
i
)

t
(
1

t
)
m
+
1
S
(
i

m

1
)
;
i

m
+
2(3.5)
Therefore,weget
S
(
i
)=
8
>
>
>
>
>
>
>
>
>
<
>
>
>
>
>
>
>
>
>
:
1
;
if1

i

m
1

(
1

t
)
m
+
1
;
if
i
=
m
+
1
1

(
1
+
t
)(
1

t
)
m
+
1
;
if
i
=
m
+
2
S
(
i

1
)

t
(
1

t
)
m
+
1
S
(
i

m

2
)
;
if
i

m
+
3
(3.6)
ItislineartocalculatetheprobabilityforoneLCS.Thetimecomplextiyforcalculating
LCS
=
m
is
O
(
L
)
.Tocalculateeach
m
2
[
0
;
L
]
,thetimecomplexityis
O
(
L
2
)
.
54
3.2.3Usepaired-endreadstodistinguishdifferenthaplotypes
Thescattereddistributionofthemutationsorinsertions/deletionsbetweendifferentstrainsmake
strain-levelassemblypossiblebyemployingpaired-endreadsinformation.Wequantifythe
strainsimilaritywithsizesofthelongestcommonsubstring(LCS).IftheLCSsizeissmallerthan
readsize,pathbecomestrivialbecausedifferentstrainscorrespondtodifferentpathsinthe
overlapgraph
G
.Theanalysisofavailabledatashowsthatthesizeofthelongestcommonregions
arelargerthantypicalreadsizebutsmallerthanfragmentsize,whichistheend-to-endsizefor
areadpair.Thus,itishighlylikelythatthetwoendsofareadpaircanencompassthecommon
regionsbetweendifferentstrains.Aspaired-endreadsaresequencedfromthesamefragment,they
shouldbeassembledintothesamecontig.Thus,byusingthepaired-endinformation,twodifferent
strainscanbedistinguishedfromeachother.
Figure3.2sketchesthebasicideabehindstrain-levelassemblyusingpaired-endreads.There
areonlytwomutationsbetweenthetwostrainsinthisexample.TheLCSisthecommonregion
betweenthetwomutationloci.Theoverlapthresholdissetashalfofthereadsize.Theoverlap
graph
G
andthepaired-endgraph
PE
_
G
areconstructedaccordingly.In
G
,therearefourlong
paths:
a
:
1
!
b
!
c
!
e
!
f
!
a
:
2,
a
:
1
!
b
!
c
!
e
!
f
!
d
:
2,
d
:
1
!
b
!
c
!
e
!
f
!
a
:
2,
and
d
:
1
!
b
!
c
!
e
!
f
!
a
:
2.Thegoalofassemblyistooutputthetwocorrectpaths(i.e.
a
:
1
to
a
:
2and
d
:
1to
d
:
2).Threetypesofinformationcanbeused.1):Coverage:ifthetwostrains
havehighlydifferentcoverage,wemaydistinguishthetwostrainscorrectly.2):Enumerationof
cliques.Thismethodwasrecentlyadoptedbyseveraltools[147,93,5].Forreadswithsuf
coverage,readsformingcliquestendtocomefromthesamehaplotypeandthuscanbemergedas
asuper-read.Thisprocesscanbeiterativelyappliedtoextendlocalhaplotypetoglobalone.3):
paired-endinformation.
55
Byusingthepaired-endinformation,apathstartingwitha.1inFigure3.2willendwitha.2
becausea.1anda.2shouldbeassembledintothesamecontig.Forthesamereason,apathstarting
withd.1willonlyextendtod.2.Thus,thepathwilloutputtwopaths,correctlyrepresenting
thetwohaplotypes.Ofthethreetypesofinformation,paired-endinformationisthemostaccurate
oneandcantolerateLCSuptothefragmentsize.Coveragedifferencehaslimitedapplications
becauseitrequiresrelativelyuniformcoverageandcanonlydistinguishstrainsofhighlydifferent
abundance.Cliqueenumerationcanleadto"chimeric"superreadevenwhentheLCSisonly
longerthanreadsize.Inthisexample,thecliqueenumerationwillnotbeabletodistinguishthe
twostrains.AsshowninFigure3.2.(B),therearevecliquesofsize4.Bymergingthereadsinto
super-readsinsideeachclique,weobtainedvesuper-readsinFigure3.2.(C).Theiroverlapgraph
isshowninFigure3.2.(D).Nomatterwhethercliquesofsize3areusedforiterativemerge,using
cliqueswon'tbeabletodistinguishthetwostrainswithoutusingpaired-endinformation.
3.2.4ThewholepipelineofPEHaplo
TherearevemajorcomponentsinthepipelineofPEHaplo:errorcorrection,strandcorrection,
overlapgraphconstruction,pathandcontigcorrection.Inthepre-processstage,reads
withmultiplelow-qualityorambiguousbasecallsareortrimmed.Base-callingerrorsor
indelsarecorrectedfromthesetofreadsusingalignment-basederrorcorrection.Dupli-
catedreadsandsubstringreadsarethenremovedfromthecorrectedreads.Second,anoverlap
graphisbuiltfromthepre-processedreadsandthestrandofreadsareadjustedbytraversingthe
graph.Theoutputreadswillhavethesameorientation.Thethirdstagewillbuildtheoverlapgraph
againfromthestrand-adjustedreadsandutilizevariousgraphpruningmethodstoremovepossi-
blerandomoverlapsandsimplifythegraphforefassembly.Inthefourthstage,paired-end
guidedpathalgorithmsareappliedtoproducecontigsfromtheoverlapgraph.Finally,we
56
Figure3.2:(A)Thebottomtwolonglines(redandblack)representtwohaplotypes,whichonly
differbytwomutationsattwoloci(G-CandA-G).Shortlinesrepresentreadssequencedfrom
thetwostrains.Redreadsaresequencedfromredstrainwhileblackreadsaresequencedfrom
blackstrain.Thereadsaresortedbytheirreadmappingpositionsagainsttheirnativestrain.a.1
anda.2areareadpairfromtheblackstrain.d.1andd.2arethereadpairfromtheredstrain.
(B)Theoverlapgraphandthepaired-endgrapharecombinedinoneNodesb,c,e,andf
originatefromthecommonregionofthetwostrains.Thedashedlinesrepresentpaired-endread
connection.(C).Thesuper-readsgeneratedbymergingreadsinvecliquesofsize4.Eachsuper-
readisnamedusingthestartingnodeandendingnode.(D).Thenewoverlapgraphgenerated
usingthesuper-reads.
57
alignpaired-endreadsagainstproducedcontigstoidentifyandcorrectpotentialmis-joinerrors.
3.2.4.1Datapre-processing
Errorcorrectionbasedoncoverageinformation,usuallyimproves
denovo
assembly.Analignment-
basederrorcorrectiontoolKarect[2]isusedtocorrectsubstitution,insertion,anddeletionerrors.
Besidesadoptingexistingerrorcorrectiontools,wealsoremovedreadswithverylowabundance.
Onlyreadsthatareduplicatedatleast
n
timesintheoriginaldatasetwillbeusedfordownstream
analysis.Usingaprobabilitymodel,weshowthatthismethodcanfurthererouterrorcontaining
readswhilemaintainingsufreadsfor
denovo
assembly.
Let
N
bethetotalreadsnumber,
r
bethereadlengthand
L
bethegenomesize,theprobability
that
k
readsstartfromthesamelocationonthegenomecanbeestimatedbyaPoissondistribution:
Pr
[
k
]=
l
k
e

l
k
!
;
l
=
N
=
L
(3.7)
Pr
[
k
>
=
n
]=
1

n

1
å
i
=
0
l
i
e

l
i
!
;
n

1(3.8)
Wedenote
e
base
betheprobabilityofsequencingerrorateachbase,
N
e
bethenumberoferror
baseforoneread,theprobabilitythatareadcontainsatleastonesequencingerroris
Pr
[
N
e

1
]=
1

(
1

e
base
)
r
.For
n
readsoriginatingfromthesamelocationongenome,theprobabilitythatthey
haveatleastonesequencingerroronthesameplacecanbecalculatedas
Pr
=
Pr
[
N
e

1
]
n
=
r
n

1
.
IntheHIVMiSeqdatasetweused(detaileddescriptioninResults),
˘
700kerrorcorrectedreads
areleftfor5HIVstrains.Thegenomelengthis
˘
10kbp.
l
=
7

10
5
=
(
5

10
4
)=
14.For
eachbaseonthegenome,theprobabilitythatatleast3readsstartfromitisover99.9%.Ifwe
assume
e
base
=
0
:
01
;
r
=
200,theprobabilitythatareadsequenceduplicatedatleast3timesbut
containsatleastonesequencingerroris
Pr

(
1

0
:
99
200
)
3
=
200
2
ˇ
1
:
62
e

5.Theresultsreveal
58
thatwhensequencingdepthisdeepenough,keepingreadswithduplicationsisabletoout
errorcontainingreadsandwillnotreduceconnectivity.
3.2.4.2Overlapgraphconstruction
Allreadsremainedafterpre-processingareusedtoconstructtheoverlapgraph.Astraightforward
overlapdetectionmethodrequires
O
(
n
2
)
comparisons,whichiscomputationallyexpensivefor
largesequencingdatasets.Thereareefimplementationofall-pairssufcomparison
algorithmsbasedondatastructuressuchashashingtableorcompacttree[47,51].In
PEHaplopipeline,weuseReadjoiner[47]tocomputeallsufmatchesamongreads
forreadstrandadjustment,andthenuseApsp[51]fortheoverlapgraphconstruction.
Strandcorrectionwithoverlapgraph
Readsintherawdatasetcancomefromdifferent
strands,andthetworeadsinareadpaircomefromtwodifferentstrands.WeuseReadjoinerto
constructtheoverlapgraphandtraverseeachnodeinthegraphforstrandadjustment.Readjoiner
providesasetoffastandspaceefalgorithmstocomputesufmatchesamongall
pairsofreadsusingsufsortingandscanningmethods.Fortworeads
r
i
and
r
j
,wedenote
r
0
i
and
r
0
j
astheirreversecomplements.Readjoinercomputesallpossibleoverlapsbetween
(
r
i
;
r
j
)
,
(
r
0
i
;
r
j
)
and
(
r
i
;
r
0
j
)
,andlabeltheoverlapswith`++',`-+'and`+-'respectivelyiftheirsuf
matchesarelongerorequaltotheoverlapthreshold.Notethattransitiveedgesareremovedinthe
outputbyReadjoiner.
Weuseasearch(BFS)traversalmethodtoadjustthestrandofeachread.The
traversalstartsatastartnode(within-degreeof0)andlabelitas`+',thenrecursivelylabels
allitssuccessorsandpredecessorsbasedontheedgetype.The`++'typemeanscurrentnode
anditsneighborcomefromthesamestrand,while`-+'or`+-'typesmeanadifferentstrand.
Aftertraversingthewholegraph,allthereadslabeledwith`-'willbereplacedwiththeirreverse
59
complements.
Withstrand-adjustedreads,weuseanothertoolApsp[51]toconstructanewoverlapgraph
sincewewillneedalloverlapedgesthistime.
3.2.4.3Graphpruning
TheoriginaloverlapgraphgeneratedfromtheoutputofApspisusuallyverycomplexbecause
ofthelargedatasize,transitiveedges,sequencingerrors,andhighlysimilarregionssharedby
haplotypes.Weapplyaniterativegraphpruningproceduretorepeatedlysimplifythegraphat
eachiteration.
Mergereadsincliques
Weareinterestedincliquesintheoverlapgraphbecausereadswithin
acliquecansharetruemutationswhilesequencingerrorsareusuallymorerandomandarenot
sharedbythemajorityofreads.Therefore,cliquescanbeusedtodistinguishtruemutations
fromsequencingerrors.Severalrecenthaplotypereconstructiontools[5,147]mergereadsinside
cliquesassuper-readsandconductiterativelyhaplotypeextension.
Wehandlecliquesdifferentlytotheseexistingtools.Insteadofdirectlymeringeachcliqueto
asuper-read,wemergeagroupoflinkedcliquessimultaneously.Thelinkedcliquesisas
aclusterofcliqueswhichsharecommonnodeswithatleastoneothercliqueinthecluster.Each
cliquecluster,denotedas
G
c
,isaconnectedsubgraph.Wesimplifythissubgraphbyremoving
transitiveedgesandperformingcollapseoperationsasdescribedbelowandthenconnect
G
c
tothe
originaloverlapgraphbyreconstructingconnectionstoitsstartingandendingnodes.Theedges
connectedtoothercliquenodesintheoverlapgraphareremoved.Thosecommonnodessharedby
cliqueswillbekeptandwefurtherapplypaired-endinformationtoidentifystrainsinpath
step.Bymerginglinkedcliques,weareabletosimplifytheoverlapgraphwhileavoidresultingin
"chimericsuper-read"(Figure3.2(D):
(
a
:
1

e
)
!
(
b

f
)
!
(
c

d
:
2
)
).
60
Removetransitiveedgesandnodecollapsing
Anedge
u
!
v
iscalledtransitiveifthereexist
otherpathsfrom
u
to
v
inthegraph.Transitiveedgesareusuallyremovedingraph-basedassembly
algorithmstosimplifythegraphwhilestillkeepingtheconnectivity.Weuseasearch
(DFS)basedalgorithmtoremovetransitiveedgesfromtheoverlapgraph
G
=(
V
;
E
)
:
1.Foreachvertex
u
2
G
,startDFSfromeachofitssuccessor
v
.
2.Foreachvertex
w
thatcouldbereachedbyDFSfrom
v
,removeedge
u
!
w
iftheedgeexists.
Theoverallcomplexityofthealgorithmaboveis
O
(
j
V
j
(
j
V
j
+
j
E
j
))
,whichruns
j
V
j
DFSstoremove
alltransitiveedges.
Linearlyconnectednodescanbemergedwithoutlossofreachability.Aftertransitiveedge
removal,theoverlapgraphtendstohavechainsoflinearlyconnectedvertices,whicharecollapsed
tofurthersimplifythegraph.
Removefalseedgesusingreadpairs
Duetothenatureofvirusquasispecies,differenthap-
lotypesofonespeciesusuallyhaveveryhighsequencesimilarity(possiblyover90%),whichcan
easilycauseoverlapsbetweenreadsoriginatingfromdifferentstrains.Therefore,havingasuf
matchdoesnotguaranteethatthetworeadsoriginatefromthesamevirusstrain.Wrong
edgesincreasethecomplexityofgraphandmayalsoproducechimericcontigs.Weemploypaired-
endinformationtoremovepotentiallywrongedges.
Theoverlapcutoff
l
isanimportantparameter.Asmall
l
tendstokeepmosttrueoverlapsbut
alsointroducesmorefalseconnectededges,whilelarge
l
islikelytoeliminatemostfalseoverlaps
butcanpossiblymisstrueconnectionsforreadsfromlowlysequencedregions.InPEHaplo,we
initiallychoosearelativelysmalloverlapcutoff,thenapplyasetofpaired-endbasedheuristic
methodstoremovepotentialfalseedges.Thekeyideaisthatforanedgeformedbetweenreads
fromdifferenthaplotypes,thesetwonodesortheirpredecessorsandsuccessorsarenotusually
supportedbypairedendconnections.Thus,wewillusepaired-endconnectionasevidencetore-
61
Figure3.3:Removingfalseedgesusingpaired-endinformation.Solidlinesrepresentoverlaps
betweennodesanddashedlinesrepresentthepaired-endconnectionsbetweennodes.Theoverlap
edgeswithredcrosswillberemovedifinsufpaired-endinformationexistbetweentheir
ends.
movefalseedges.,wefocusonthreecasesasshowninFigure3.3,andremoveedges
withoutpaired-endinformationsupport.Toaidtheexplanation,weintroducethefollowingterms
fortheoverlapgraphs.Let
u
!
v
beanedgebetweennodes
u
and
v
.Thus,
u
isthepredecessorof
v
and
v
isthesuccessorof
u
.Anodecanhavemultiplepredecessorsorsuccessors.Letfunction
succ
(
u
)
representthesetofallsuccessorsof
u
andlet
pred
(
u
)
bethesetofallpredecessorsof
u
.
Weexamineedgesinthefollowingcases.
1.Foranedge
u
!
v
,ifout-degreeof
u
islargerthan1andin-degreeof
v
islargerthan1,check
paired-endconnectionsbetween
u
and
v
,
pred
(
u
)
and
v
,
u
and
succ
(
v
)
.Ifnopaired-endconnec-
tionsbetweenthesenodesandtheoverlapbetween
u
;
v
islessthanauseredgeremoval
cutoff
l
1
(
l
1

l
),removeedge
u
!
v
(Figure3.3(A)).
2.Foranode
u
within-degreebeing1andout-degreelargerthan1,wecheckthepaired-end
connectionsbetween
u
andallnodesin
succ
(
u
)
.Foranedge
u
!
v
,weremovethisedgeifthe
followingthreeconditionsaremet.1)Thereisnopaired-endconnectionbetween
u
and
v
.2)
Thereexistsanode
v
0
2
succ
(
u
)
,(
v
0
6
=
v
),andthereispaired-endconnectionbetween
u
and
v
0
.3)
62
Thein-degreeof
v
islargerthan1.Intuitively,ifanode
u
hasalargenumberofout-goingedges,
itishighlypossiblethatsomeofthemareonlyrandomconnections.Thus,weonlykeeptheones
withpaired-endsupports(Figure3.3(B)).
3.Similartocase2,ifanodehasalargenumberofincomingedgesbutonlyhasoneout-going
edge,weremovetheoneswithoutpaired-endsupport.Foranode
u
without-degreebeing1and
in-degreelargerthan1,weexaminethepaired-endconnectionsbetweenallnodesin
pred
(
u
)
and
u
.Foranedge
w
!
u
,weremovethisedgeifthefollowingconditionsaremet.1)Thereisno
paired-endconnectionbetween
w
and
u
.2)Therearepaired-endconnectionsbetweenotherpre-
decessornodesand
u
.3)Theout-degreeof
w
islargerthan1(Figure3.3(C)).
3.2.4.4Paired-endguidedpath
Withsufcoverage,virushaplotypescanberecoveredbypathsfromthegraph.Al-
thoughthegraphhasbeengreatlywithprevioussteps,thehighsequencesimilarity
betweenvirushaplotypescanstillresultincomplexoverlapgraphswithalotofbifurcations.To
recovermoreaccuratecontigs,paired-endinformationofreadsarewidelyusedinmanyassembly
toolsforguidingthecreationofcontigs[169,163,93]orscaffolds.Therationaleisthattwoends
ofareadpairaresequencedfromtwoendsofafragment,thusshouldbeassembledinthesame
contigorscaffold.Paired-endinformationcanguidethepathextensiontogoovercommonre-
gionssharedbytwogenomessincethelengthoffragmentisusuallymuchlongerthanthelength
ofreads(Figure3.4).
Inourmethod,weuseaDFS-basedpathextensionalgorithmandselecttherightnodeto
appendtothepathwithadecisiontreeateachbifurcation.Pathstartsatastartnode(in-
degreeof0)andterminatesatanendnode(out-degreeof0)orwhenmeetingavisitednode.At
63
Figure3.4:Paired-endinformationguidethepathextensiontogooverbifurcationnodes.With
nodescomingfromtwostrains
a
and
b
inthethosenodesbelongingtothesamestrain
canbecorrectlycombinedtoonepathbasedonthepaired-endconnections.Solidlinesrepresent
overlapsbetweennodesanddashedlinesrepresentthepaired-endconnections.
64
Figure3.5:Pathprotocol.Pathsareoutputtedwhenmeetinganendnodeoravisitednode.
eachvertexwithmultiplesuccessors,wetransformtheavailablepaired-endandoverlapinforma-
tionasfeaturesanddecidewhichsuccessortoappendtothecurrentpathwithapre-traineddecision
tree.ThewchartofpathalgorithmisshowninFigure3.5.Thepaired-endconnections
betweennodescanbeefaccessedfromtheconstructedpaired-endgraph
PE
_
G
.
Featuredesignanddecisiontree
Tothecorrectpathfromtheoverlap
graph,weneedtoselecttherightnodeeachtimeweextendthepath.Inparticular,whenanode
hasmultiplesuccessors,achoiceneedstobemadeforpathextension.Ingeneral,wechoose
asuccessorwiththemostnumberofpaired-endconnectionswiththenodesinthecurrentpath.
However,differenttypesofpaired-endconnectionsshouldbetreateddifferentlyindistinguishing
haplotypes.Inparticular,weneedtodifferentiatesingle-insingle-outnodes(SISO)fromother
nodes.SISOnodeshavein-degreeof1andout-degreeof1.Wehavehighthatthese
65
Figure3.6:Inthisexampleasshowninthetheendingnode
p
n
ofcurrentpathhastwo
successors
v
1
and
v
2
.Thesolidlinesareoverlapsanddashedlinesarepaired-endconnections
betweennodes.Fivefeaturesarecomputedtochoosetherightsuccessortoextendthepath.These
featuresarecomputedaspaired-endgraphedgeweightsbetweennodesincurrentpathandnodes
associatedwiththesuccessor.Asfor
v
1
,theSISOscoreiscalculatedbetweenSISOnodesand
v
1
;
planscorebetweenSISOnodesandplannodes;PE_planscorebetweenSISOnodesandPE_plan
nodes;pathscorebetweenpathnodesand
v
1
;path_planscorebetweenpathnodesandplannodes.
66
nodesbelongtoonehaplotypeandthusanypaired-endconnectionincidenttoSISOnodescan
recruitnodesbelongingtothesamehaplotype.Onthecontrary,thenodesoriginatingfromcom-
monregionsoftwoormorehaplotypesarenotSISOnodes.Paired-endconnectionsbetween
thosenodescannotprovideusefulguidanceinpathInourfeaturedesign,wedistinguish
paired-endconnectionsinvolvingSISOnodesandothernodes.
Let
PE
_
G
bethepaired-endgraph.
Path
=
f
p
1
;
p
2
;:::;
p
n
g
bethecurrentpath.Theending
node
p
n
inthecurrentpathhasmultiplesuccessors.Foreachsuccessor
v
of
p
n
,wecomputethe
scoresofthefollowing5featuresfornode
v
.Foreachfeature,weuse
N
feature
torecoredthe
numberofsuccessorswithscorevaluegreaterthan0.Thefeatureswillthenbeusedtomakea
choiceforpathextension.Tobetterillustratehowthesefeaturesarecalculated,wegiveanexample
asshowninFigure3.6.
1.SISOscore:wecalculatetheSISOscoreasthesummationofedgeweightsinpaired-end
graph
PE
_
G
betweenSISOnodesinthecurrentpath
Path
and
v
.Notethatedgeweightin
PE
_
G
isthenumberofpaired-endreadsbetweentwonodes.Weuse
N
SISO
todenotethenumberofsuc-
cessorsthathaveaSISOscoregreaterthan0.
2.Planscore:similartoSISOscore,wecalculatetheplanscoreasthesummationof
PE
_
G
edgeweightsbetweenSISOnodesandtheplannodes.Theplannodesareasfollows.
Startingfrom
v
,
v
'ssuccessorasplannodeif
v
hasonlyonesuccessor.Repeatingthispro-
cedureuntilreachinganodewithmorethanonesuccessororout-degreeof0.Planscoreconsiders
thepaired-endreadconnectionsbetweenallpotentialSISOnodesinsideapath,includingthechil-
drennodesof
p
n
.
N
plan
denotesthenumberofsuccessorsthathaveaplanscoregreaterthan0.
3.PE_planscore:
v
'sadjacentnodesin
PE
_
G
thatarenotinthepath
Path
areas
67
Figure3.7:Decisiontreetoselecttherightnodeforextensionbasedonpaired-endscorefeatures.
If
N
feature
=
1,weaddtheonlynodewithscoregreaterthan0tothepath.If
N
feature
>
1,weselect
thenodewithmaximumscorevalue.Otherwise,lookatthenextfeature.Ifallthe
N
feature
values
are0,wewillselectthesuccessorwithsimilarreadscoveragetothepath.
PE_plannodes.ThePE_planscoreiscalculatedasthesummationof
PE
_
G
edgeweightsbetween
SISOnodesinpath
Path
andthePE_plannodes.
N
PE
_
plan
denotesthenumberofsuccessorsthat
haveaPE_planscoregreaterthan0.
WhiletheabovethreefeaturesarefocusedonSISOnodes,thefollowingfeaturesextendto
paired-endconnectionsincidenttoothertypesofnodes.
4.Pathscore:thepathnodesareallthenodesinthepath
Path
except
p
n
if
n
>
1or
p
1
if
n
=
1.
Pathscoreiscalculatedasthesummationof
PE
_
G
edgeweightsbetweenpathnodesand
v
.
N
path
denotesthenumberofsuccessorswithpath_scoregreaterthan0.
5.Path_planscore:thesummationof
PE
_
G
edgeweightsbetweenpathnodesand
v
'splan
nodes.
N
path
_
plan
denotesthenumberofsuccessorswithpath_planscoregreaterthan0.
Withthe5featuresshownabove,wedesignthedecisiontreeasshowninFigure3.7toclassify
therightnodeforpathextension.
68
Figure3.8:Readpairmappingonamisjoinedcontig.Thecontigisshownasthelong
baratthebottom,whichismisjoinedwithtwosequencesfromstrains
a
and
b
.Theredandblue
linesrepresenttwoendsofareadpair.Fewreadpairswillgoacrossthemisjoinedlocation,thus
revealingavalleyinthealignedreadswhichcanbeusedtosplitthecontig.
3.2.4.5Correctingcontigswithpaired-endreaddistribution
OurmethodsusuallygeneratelongcontigsafterpathWhilepaired-endguidedpath
methodcangooverthecommonregionsthatareshortthanfragmentlength,itislimitedwhentwo
strainshaveaLCSlongerthantheinsertsizeofreadpairs.Tofurtherimprovethequalityof
assembledcontigs,weapplyacontigcorrectionmethodsimilartothetoolPECC[81].Withthe
contigsgeneratedafterpathwealigntherawreadstothemandsplitcontigsfromthe
locationswithlowreadpairscoverage(Figure3.8).
3.3Results
WehavedevelopedandimplementedPEHaploa
denovo
virusquasispeciesassemblymethod
basedonpaired-endguidedpathonoverlapgraph.Toevaluatetheperformanceofour
method,weappliedPEHaploonbothsimulatedHIVquasispeciesdatasetsandHIVillumina
MiSeqsequencingdataset.Bothofthesimulateddataandrealdataweregeneratedfromamixture
ofvewell-studiedHIV-1strains(HXB2,JRCSF,89.6,NL43andYU2).Thesestrainshave
pairwisesequencesimilaritiesfrom91.8%to97.4%(Table3.1)andalongestcommonsubstring
69
of427bpbetweenHXB2andNL43(Table3.2).WechooseHIVdatasetbecauseHIV-1hashigh
intra-patientgeneticdiversity.Tofurtherevaluatetheefyofthetool,wealsotestedPEHaplo
onarealMiSeqsequencingdata.
Asthehaplotypesequencesandcompositionsareknowninthesedatasets,weareabletoeval-
uatetheperformanceofourmethodswiththeassembledcontigs.Wealsocomparedtheproduced
resultstorecentlypublished
denovo
assemblytoolsIVA[57],MLEHaplo[93]andSAVAGE[5].
Table3.1:Pairwisesequencesimilaritybetween5HIV-1strains.
89.6HXB2JRCSFNL43YU2
89.693.991.893.593.6
HXB292.8
97.4
95.2
JRCSF92.692.9
NL4394.9
YU2
Table3.2:Longestcommonsubstring(LCS)between5HIV-1strains.Thesestrainshavesimilar
lengthsofabout10kbp.
89.6HXB2JRCSFNL43YU2
89.6195201164234
HXB2180
427
216
JRCSF157201
NL43185
YU2
3.3.1LCSprobabilitysimulation
ToestimatetheLCSbetweentwostrainswithinaquasispecies,wecalculatedtheprobabilitydistri-
butionforLCSbetweentwostrainsthatare
n
generationsapartfromequation3.2andequation3.6.
Letthemutationrate
m
betweentwogenerationsbe3e-5,andthegenomelength
L
be10,000.The
probabilitydistributionforLCSsbetweentwostrainsisshowninFigure3.9.Theshows
thatasthemutationrateincreases,theLCSlengthdecreases.
70
Figure3.9:ProbabilitydistributionforLCSsbetweentwostrainsthatare50,100,200,and500
generationsapart.Thex-axisisthelengthofLCS,whichrangesfrom0to10,000.They-axisis
thecorrespondingprobabilityforeachLCS.
3.3.2BenchmarkonsimulatedHIVdataset
WeappliedPEHaploonasimulatedvirusquasispeciesdataset.WeusedART-illumina[56]
tosimulate1.9e+5paired-end,250bperror-containingMiSeqreadsfromtheveHIV-1strains
withaveragefragmentlengthof600bpandtotalcoverageof5000x.Abasedpowerlaw
equation[9]wasusedtosimulatethecoveragedistributionamongvestrains:
C
i
=
bf
a
i
,where
C
i
and
f
i
denotethecoverageandofstrain
i
,respectively.Thecoveragesforeachstraininthe
simulateddatasetare:89.6-2190x,HXB2-1095x,JRCSF-730x,NL43-547x,YU2-438x.
FollowingthePEHaplopipeline,weperformederrorcorrectionandduplicatedsequences
removalontherawsimulateddataset.With1.9e+5errorcorrectedreads,48,833readswere
keptafterremovingduplicates.Wethenonlykeptthosereadsthatareduplicatedatleast3times
intherawdata,furtherreducingthereadsnumberto26,961.Afteradjustingreadsorientation,
71
weconstructedtheoverlapgraphwithtoolApsp.Theoriginaloverlapgraphhas26,961nodes
and977,570edges,aftermerginglinkedcliques,removingtransitiveedges,collapsingnodesand
removingbadedges,125nodesand166edgeswereleft.Wethenrecoveredpathsandgenerated
contigsfromthisgraphbasedonpaired-endinformation.
PEHaplogenerated14contigsfromthesimulateddataset.Thegeneratedcontigswerealigned
andevaluatedtovereferencegenomeswithMetaQuast[100]andtheresultsaresummarizedin
Table3.3.Thecontigsareabletocoverover97%onthe5virusstrains,withaN50of7170bp.
Thelargestcontighasalengthof9667bp,whichcanalmostcoverawholeHIVstrain.Meanwhile,
thesecontigshavelowmismatchandindelrates.
WealsoassembledthesimulatedreadswithbenchmarktoolsIVA,MLEHaploandSAVAGE
andsummarizetheirresultsinTable3.3.Withthedefaultparameters,IVAproducedasingle,long
contigfromtheerrorcorrectedreads.Thislongcontighasalengthof13,434bpandcancoverthe
wholegenomeof89.6strainandabout43%oftheHXB2strain.TheresultsofIVArevealthatit
tendstogenerateoneconsensusgenomesequencecorrespondingtothehaplotypewiththehighest
coverage.Otherstrainsarelargelymissed,indicatingthatitmaynotforvirusquasispecies
assembly.Usingk-mersizeof55,MLEHaploproduced205contigsthatcover78%oftheve
HIV-1strains.Thecontigsitproducedarequitefragmented,withalowN50valueof671bpand
largestcontigof1,716bp.Followingtheguidancefromthetutorial,wesettheoverlapcutoffas
190bptorunSAVAGE.Itproduces43contigsthatcoverover99%ofthereferencegenomes,with
aN50above2,000bp,andlargestcontigsof9,594bp.
ComparingtoIVAandMLEHaplo,PEHaploisabletoproducelongercontigswithlessmis-
matchesandindelsonthesimulatedHIV-1vestrainsdataset.SAVAGEproducedmorecontigs
thatcouldcoveralmostallthevestrains.However,thesecontigshaveamuchlowerN50value
(2,228bp)thanPEHaplo(7,170bp).
72
Table3.3:AssemblyresultsonsimulatedHIVdatasetforIVA,MLEHaplo,SAVAGEandPE-
Haplo.Contigsthatareatleast500bparealignedtothetruehaplotypesequenceswithasimilarity
cutoffof98%.TheN50valueisthemaximallengththatallcontigsofatleastthislengthcoverat
leasthalfofthetotalassemblylength.
Tools#contigsN50
Genomes
covered(%)
Unaligned
length
Mismatches
(%)
Indels
(%)
IVA113,43428.7
0
0.8090.051
MLEHaplo20567178.081,1250.5420.008
SAVAGE432228
99.4
52850.0190.002
PEHaplo14717097.8
00.0150
3.3.2.1Paired-endguidedpathisabletogenerateaccuratelongcontigs
Inouralgorithms,wehaveappliedmultiplemethodstosimplifytheoverlapgraphandcollapse
nodesbeforepathWhilethesestepsgreatlyreducethecomplexityofthegraph,theycan-
notdistinguishdifferentstrainsthatsharelongcommonregions.Thus,thepathalgorithm
basedonpaired-endconnectionsplayacrucialroleforproducinghighqualitylongcontigsfrom
theoverlapgraph.
Toevaluatetheperformanceofourpathalgorithm,weassembledthesequencesinthe
overlapgraphbeforepathwiththepopularassemblytoolsIDBA-UD[115]andRayMeta
[14].Wethenalignedthegeneratedcontigstoreferencegenomesandcomparedwiththeresults
producedbyPEHaplo.TheresultsrevealthatthosecontigsassembledbyIDBA-UDandRayMeta
fromthereducedoverlapgrapharefragmentedandmaycontainmanymisjoinedsegments:they
coverlargeproportionsofthevereferencegenomes,butwithhighmismatchandindelratesand
theiraveragelengthsaremuchshorterthanPEHaplo.Theexperimentshowsthatthepaired-end
guidedpathalgorithminPEHaploisabletocorrectlyrecoverlonghaplotypesequences
fromthevirusquasispeciessequencingdata.
73
3.3.3BenchmarkonMiSeqdataset
Tofurtherassesstheperformanceofassemblymethods,weappliedPEHaploonarealHIVquasis-
peciesdataset(SRR961514)sequencedfromthemixofthesameveHIV-1strainsasdescribed
abovewithIlluminMiSeqsequencingtechnology[39].Thisdatasetcontains714,994pairs(2x250
bp)ofreadsthatcoverthevestrainsto20,000x.
WeusedPEHaplotoperformsimilarprocessingproceduresontherealHIVquasispeciesdata.
With774,044anderrorcorrectedreads,98,947readswerekeptafterremovingduplicates
andsubstrings.Sincetherawdatasethasextremelyhighcoverageonthevestrains,westillkept
thosereadsthatareduplicatedatleastthreetimesintherawdataset.Afterthesepre-processing
procedures,26,691readswerekeptforstrandadjustmentandassembly.
PEHaploproduced33contigsfromtherealMiSeqHIVdatasetthatcancoverover92%of
theveHIV-1strains.ThesecontigshaveaN50valueabout2,500bpandthelongestcontigis
9108bp.TheresultsaresummarizedinTable3.4.ComparedtosimulatedHIVdataset,PEHaplo
hasgeneratedmorecontigsbutwithalowerN50valueandhighermismatchesandindelsonthe
realdataset.WenoticethattherealHIVdatasetcontainsmoresequencingerrorsandhasamore
variableinsertsizethanthesimulateddataset.
WeagaincomparedtheperformanceofPEHaplowithIVA,MLEHaploandSAVAGE.IVA
generated10contigsthatcoverabout20%ofthevestrains.Thesecontigsstillcoverlonger
partsonhaplotypeswithhighersequencingcoverage.Buttheyspreadtofourstrainsthistime,
likelybecausethevestrainshaveclosesequencingcoveragesintherealdataset.Withthe
sameparametersasabove,MLEHaploproduced234contigsthatcancoverover53%oftheve
genomeswithsimilarmismatchandindelratestothesimulateddataset.Strikingly,itgenerated
muchlongercontigsontherealdata,withaN50valueof6,501bpandthelargestcontigof8,470
74
bp.However,thesecontigscontainmanymisjoinedsegments.Over150contigswithtotallength
of787,272cannotaligntoanyreferencegenomes.SincetheSAVAGEpaper[5]hasshowntheir
resultsonthesamedataset,weusethemetricsintheirliteratureforevaluation.Fromtheirresults,
SAVAGEproduced482contigsthatcoverover90%ofthereferencegenomes,withaN50of1,062
bp,andlargestcontigof4,256bp(Table3.4).
OntherealHIVdataset,PEHaplocanstillproducelongercontigswithfewermismatchesand
indelsthanallthreebenchmarkedtools.Overall,PEHaploisabletoassembleabunchofreads
thataresequencedfrommultiplevirusstrainssharinghighsimilarities,generatelong,highquality
sequencesandrecovermostofthetargethaplotypes.Thetoolconsumeslessrunningtimewhile
stillproduceshighqualitycontigs.Comparedtootherstate-of-the-artmethods,ourtoolusually
producesfewerbutlongercontigs.In3.10,weshowthecontigsalignmentresultonHXB2
strainforPEHaploandSAVAGE.ThosecontigswerealignedtoHXB2strainwithasimilarity
cutoffof98%byMetaQuastevaluationtool.Thisshowsthatwhilebothtoolsproduced
contigsthatcoverthevirusgenometoasimilarproportion,PEHaplowasabletogeneratelessbut
longercontigs.
Table3.4:AssemblyresultsonrealHIVMiSeqdatasetforIVA,MLEHaplo,SAVAGEandPE-
Haplo.ContigsareevaluatedwithMetaQuastusingthesameparameterstosimulateddataset.
Tools#contigsN50
Genomes
covered(%)
Unaligned
length
Mismatches
(%)
Indels
(%)
IVA10115020.111500.6600.052
MLEHaplo234650153.6786,2720.588
0.035
SAVAGE482106290.5
0
0.1470.048
PEHaplo332553
92.400.117
0.049
75
Figure3.10:ContigsalignmentresultonHXB2strainforPEHaploandSAVAGE.Thesecontigs
wereproducedfromtherealHIVMiSeqdataandalignedtoreferencegenomewithMetaQuast.
Thex-axisisthecoordinationsofHXB2genome,withtheregionscoveredbycontigsingreenand
othersinblack.They-axisrepresentsthenumberofcontigs,withcontignameslistedontheleft
panel.Oneachcontig,thegreennumberattheleftisthestartingcoordinateofthealignedcontig
andthenumberinsideoftheparenthesisshowsthestartingcoordinateonthereferencegenome,
theblackvalueattherightistheendingcoordinateofthealignedcontigandthenumberinsideof
theparenthesisshowstheendingcoordinateonthereference.
76
3.3.4BechmarkonsimulatedbiasedHIVdatasets
Toevaluatetheperformanceofourmethodsonassemblinglowabundancestrains,weusedHIV
strainsHXB2andNL43tosimulatethreegroupsofdatasetswithextremlybiasedcoverages
betweenthem.Thetotalcoverageforeachgroup1000x,withHXB2-900x,NL43-100x;HXB2-
950x,NL43-50x;andHXB2-990x,NL43-10xforeachgroup,respectively.Thesedatasetswere
alsosimulatedbyART-illuminawithMiSeqplatform.
WiththesimilarprocessingproceduresonHIV5strainsdata,weusedPEHaplotoassemble
contigsfromthesedatasetsandcomparedtheresultswithSAVAGE,whichworksmuchbetter
thanMLEHaploandIVA.TheresultsareshowninTable3.5.
Table3.5:AssemblyresultsonsimulatedbiasedHXB2-NL43MiSeqdatasetfoSAVAGEPE-
Haplo.ContigsareevaluatedwithMetaQuast.
HXB2-NL43
coveragesTools#contigsN50
Genomes
covered(%)
Unaligned
length
Mismatches
(%)
Indels
(%)
900:100
SAVAGE7250046.765810.0220
PEHaplo11816380.2600.0380
950:50
SAVAGE8803246.76181700
PEHaplo1947046.7600.0330.011
990:10
SAVAGE13213046.7515900.0220
PEHaplo1950948.95000
Theresultsrevealthatbothtoolsfailedtoassembletherarestrainwith5%or1%abundance.
However,PEHaplowasabletobetterassemblethedominantstrainwithproducingonelongcontig.
Inaddition,whentherarestrainconstituted10%(100x)ofthetotalcoverage,PEHaplocould
partiallyassembleit,whileSAVAGEonlyassembledthedominantone.
77
3.3.5Bechmarkondataset
InadditionofHIVdata,wealsoappliedPEHaploonarealH1N1dataset(SRR1766219)
sequencedfromthemixofawildtype(99%)andamutanttype(1%).Thisdatasetissequenced
withIlluminaMiSeqsequencingtechnology,containing646,879pairs(2x250)ofreadsthatcover
thetwostrainsto23,000x.ThemutanttypecarriestwosilentmutationsintheM1ORF(C354T
andA645T,segment7).
Weperformsimilarpre-processingonthedata.With851,988anderror
correctedreads,27,888readswerekeptafterremovingduplicatesandsubstrings.Westillkept
thosereadsthatduplicateatleastthreetimesintherawdataset.Afterpre-processing,11,940
readswerekeptforstrandadjustmentandassembly.
PEHaploproduced10contigsfromtheMiSeqdata,with8contigscoverover99%of
the8segmentsofgenomeand2contigsunaligned.Againwecomparedtheassembled
resultswithSAVAGE.TheresultsaresummarizedinTable3.6.Ondata,SAVAGEhas
produced220contigswithaN50valueof620bp.Thesecontigsareabletocoverover96%ofthe
genome.TheresultsshowthatPEHaploworksmuchbetterthanSAVAGEon
dataasitsuccessfullyassembledallthe8segments.
Table3.6:AssemblyresultsonMiSeqdatasetforSAVAGEandPEHaplo.Contigsare
evaluatedwithMetaQuastusingthesameparameterstoHIVdatasets.
Tools#contigsN50
Genomes
covered(%)
Unaligned
length
Mismatches
(%)
Indels
(%)
SAVAGE22062096.3383030.8180.046
PEHaplo10179099.512700.8360.007
78
3.3.6Computationaltimeandmemoryusage
Toevaluatethecomputationalcostofourtool,wealsocomparetherunningtimeandmemory
usagewiththethreebenchmarktools.PEHaploused43minuteswithapeakmemoryusageof2.9
GBonsimulateddataand19minuteswithapeakmemoryusageof1.3GBonrealdataset.IVA
runsveryfast:ittookabout17minutesonbothsimulatedandrealdataset,withpeakmemory
usageabout2.6GB.MLEHaploandSAVAGEaremuchslower.MLEHaploconsumedover10
hourswithpeakmemoryusageof6.4GBonsimulateddatasetandabout4hourswithpeak
memoryusageof2.4GBonrealdataset.SAVAGEused54minutesforsimulateddataand1331
minutesforrealdata.Itisabletokeepalowmemoryusageof0.5GBasshownintheirpaper[5].
3.4DiscussionandConclusion
Longreadswillreducethecomplexityof
denovo
assemblyofgeneralmetagenomicdata,including
virusquasispeciesdata.Forpaired-endreads,onemayconsidertocombineshortreadsintoa
relativelylongerreadbeforeconductingassembly.Weactuallyappliedexistingreadjoiningtools
forthispurpose.However,joiningreadsisnotatrivialproblemastheoverlappingpartofthe
readpairsmaynotalwaysbeidentical.Thus,existingmethodsofjoiningtwoendsmayintroduce
errors.Inaddition,mergingpaired-endreadswilldiscardthepaired-endinformationforguiding
thepathprocess.Asaresult,theexperimentalresultsusingPEAR[167]andotherend
mergingtoolsshowinferiorperformance.Thuswedidnotincludethatstepinourpipeline.
Thethird-generationsequencingplatformssuchasPacBiocanproduceverylongreads,which
cancoverthewholelengthofviralgenomes.However,thehighsequencingerrorrate(about
10%)andthelowerthroughputthanIlluminastillhampertheirwideapplicationformetage-
nomicsequencing.Theadvantagesandlimitationsofapplyingcurrentlongreadstechnologiesfor
79
virushaplotypesreconstructionarediscussedinBAsE-Seq[53].Withtheincreasedreadquality,
longreadsequencingtechnologieswillgreatlysimplifytheassemblymethodsformetagenomic
data[39].However,atthismoment,virushaplotypereconstructionusingshortreadsisstillneeded.
Ifthedistributionoffragmentsizeisknown,wecanfurtherimprovetheaccuracyofpath
ingandfalseedgeremoval.Forexample,withknownfragmentsize,wecanaccuratelycompute
howfarweshouldexaminethesuccessorsorpredecessorsforfalseedgeremoval.Currently,for
computationalefy,wedidnotincorporatefragmentsizedistributioninthesetwosteps.
Ourmethodcanbeextendedtometagenomicdataifthememberspecies'genomeshavecom-
monregionswithlengthsmallerthanfragmentsize.However,ouranalysishasshownthatmany
genesinmetagenomicdatacanhaveLCSsizesmuchgreaterthantypicalfragmentsize.Forthose
metagenomicdata,largeinsertsizesshouldbechosenforthesequencingprotocol.
Inconclusion,wepresenteda
denovo
virushaplotypereconstructiontoolforviralquasis-
pecies.Weappliedittobothsimulatedandrealquasispeciesdataandachievedbetterresultsthan
severalbenchmarkedtools.
80
Chapter4
Alignmentwindowsbasedviralstrain-level
contigsbinning
4.1Introduction
Aftercontigsofdifferentviralstrainsbeingassembledfromviralquasispeciesdata,itisstillun-
knownthathowmanyviralhaplotypesarethereinthequasispeciesandwhataretherecorrespond-
ingabundances.Therefore,anothercrucialstepinrecoveringviralhaplotypesfrommetagenomic
dataistheestimationofnumberofviralhaplotypesandofcontigsassembledinto
differentgroups,whichisoftenreferredtoasbinning.Thegeneralbinningformicrobialsamples
isasfurtherinvestigatingthetaxonomicstructureofcontigsandclustering/classifying
themintooperationaltaxonomicgroups.Forviralquasispeciesassembly,thesegroupsrepresent
compositestrainsofanindividualviralspeciesthatcompriseaviralquasispecies.
Manybinningmethodsexisttobinassembledcontigsfrommetagenomicdata[160,88].These
methodsusuallyestimatethebinnumberbyaligningmetagenomicdatatoapre-establishedmarker
genedatabase,andthenassignassembledcontigstodifferentbinsusingsequencecomposition
informationandreadcoveragelevels.Forexample,MaxBin[160]usesbothtetranucleotidefre-
quenciesandcontigcoveragelevelstoassignassembledcontigsintodifferentbins.
Inaddition,therearealsomethodsofbinningcontigsusingthecoverageofthecontigs
acrossmultiplemetagenomicsamples.Theideaisthatiftwocontigsarefromthesamebin,their
81
coverageacrossmultiplesamplesshouldbehighlycorrelated.Forexample,COCACOLA
[88]incorporatessequencecompositionandreadcoverageacrossmultiplesamplesforbinning.
Whilethesebinningtoolsexist,binningofcontigsinaviralquasispecieshasitsuniquechal-
lenges,andthoseexistingtoolsareeithernotapplicableordonotperformwell.(1)Viralhaplo-
typesinaquasispeciesbelongtothesamespecies,andtheysharethesamemarkergene.Aligning
thereadstomarkergenedatabasewillonlyidentifyonespeciesbutnotthenumberofhaplotypes.
(2)Viralhaplotypesinaquasispceisusuallysharehighsequencesimilarity(mayover90%).The
sequencecompositionsfordifferenthaplotypesarethusverysimilar,andcanhardlybeusedto
differentiatethem.3)Theremaynotbemultiplesamplesforaviralquasispecies.
HerewepresentVirBin,amethoddesignedforbinningcontigsassembledfroma
viralquasispeciesdata.Ittakesadvantageofcontigsalignmentandrelativecontigabundancesin
alignedwindowstoaccuratelyestimatethenumberofhaplotypesandclustercontigsintodiffer-
entgroups.Thismethodworkswithasinglesequencingsampleanddoesnotrequirereference
genomes.VirBinwasappliedontwosimulateddatasetsandarealdataset,andbenchmarkedwith
therecentapproachMaxBin.TheresultsshowthatVirBinrevealssuperiorityintermsofboth
precisionandrecall.
4.2Methods
4.2.1Problem
Aviralquasispeciesiscomposedofasetofhighlysimilarviralstrainswithdifferentabundance
levels.Withassembledcontigsfromaquasispceis,theobjectiveisto(1)estimatethenumberof
haplotypesinthequasispecies;(2)clustercontigsintogroupssothatcontigsfromthesameviral
strainwillbegroupedtogether;(3)calculatetheabundanceforeachviralstraininquasispecies.
82
Mathematically,theproblemcanbeas:Givencontigs
C
0
;
C
1
;:::;
C
n
fromaviralquasis-
pecies,thegoalistoestimatethenumberofhaplotypes
N
,clusterthecontigsinto
N
+
1groups
(onemoregroupassothatcontigsfromthesamestrainwillbeinthesamegroup,and
calculatetheabundanceforeachhaplotype.
Sinceviralstrainsinaquasispeciesbelongtothesamespeciesarehighlysimilartoeachother,
aligningthemtomarkergenesusuallyonlythespecies.Therefore,canonicalbinning
methodsbasedonmarkergenesdonotapplyforviralquaspecies.However,takingadvantage
ofthehighsequencesimilaritiesbetweenhaplotypes,contigsfromdifferentstrainsbutderived
fromthesamegenomiclocationscanbealignedtogether.Withthealignmentofcontigs,VirBin
usesawindow-basedmethodtoestimatethehaplotypenumberandmoreaccuratelycalculatethe
relativeabundanceforeachcontig.Thewindowisasacontinuousregioninthealignment
withthesamenumberofcontigs.Thismethodbeginswithwindowsfromcontigs
alignment.Thenhigh-qualitywindows,whichhaveahighprobabilityofcontainingcontigsfrom
allhaplotypes,areTheconsensusheights(numberofcontigs
h
)inthesewindowsare
usedtoestimatethenumberofhaplotypes
N
.Foreachwindow,therelativeabundancelevelsfor
contigscanbecalculatedfromthereadsalignment
4.2.2TheVirBinalgorithmoverview
TheoverallpipelineofourmethodisshowninFigure4.1.Therearemainlytwosteps:(1)align
contigsandidentifywindows;(2)calculaterelativeabundancesineachwindowandapplyan
expectation-maximizationmethodtoclusterthecontigs.
83
Figure4.1:ThewwofVirBin.
4.2.3Estimatehaplotypenumberbycontigsalignmentandwindowsextrac-
tion
Inthestep,contigsarealignedwitheachotherusingblast[20].Eachtimeacontig
C
i
is
chosenasareferencecontig,andallothercontigsarealignedagainstthereferencetogeneratean
alignmentsimilartomultiplesequencealignment.Althoughblastperformslocalalignment,
therawalignmentresultswerewheretwotypesofalignmentrelationshipsareallowed
betweeneverytwocontigs.Oneisoverlap,whereonecontig'ssufalignwiththeotherone's
(Figure4.2(A)).Anotheroneisinclusion,whereonewholecontigalignstoasubstring
ofanothercontig(Figure4.2(B)).Inthisway,randomalignmentsoralignmentbetweenrepeated
sequencescanbeavoided.Thenstartingfrom5'endof
C
i
,windowsareasthecontinuous
regionswiththesamenumberofcontigs.Thealgorithmforidentifyingwindowsfromcontigs
alignmentisshowninAlgorithm3.
Ideally,ifthecontigsarecompleteandallalignmentsbetweenthemarecorrect,theheightof
84
Figure4.2:Twokindsofalignmentareallowedbetweentwocontigsto.(A)Overlap.(B)Inclu-
sion
Algorithm3
Windowsfromcontigsalignment
Function:
Alignedcontigsdepth
Height
(
T
)
Input:
Contigsalignmentresultonareferencecontig
C
,withindexfrom1to
L
;
Output:
Windowsoncontig
C
,alistoftuples
1:
i
 
1
2:
j
 
2
3:
start
 
1
4:
end
 
0
5:
while
i

L
do
6:
if
i==LorHeight[i]!=Height[j]
then
7:
end
 
i
8:
Windows+=[start,end]
9:
start
 
j
10:
i
++
11:
j
++
12:
endif
13:
endwhile
85
eachwindow(
h
)canbeusedtorepresentthehaplotypenumberinthequasispecies.Inpractice,the
contigsmaynotcovereveryregionofhaplotypesandtheremayexistchimericcontigs.Therefore,
weusetheconsensuswindowheightasthenumberofhaplotypes.
4.2.4Expectation-Maximizationmethodforbinningcontigs
Therawreadscanbealignedtocontigstocalculatetheirabundancelevels.However,dueto
thepotentialincompletenessoftheassembledcontigsandhighsequencesimilaritybetweenviral
haplotypes,directlycalculatingcontigabundancesisnotaccurate.Becausereadsfromstrains
withmissingcontigsmaybemappedtocontigsfromotherstrains,whichwilleventuallylead
toinaccurateestimationofcontigabundances.Withwindows,denotetheregionofa
contiginsideofawindowas"sub-contig"
c
.Thealignedreadsoneach
sub-contigcanbeusedtocalculatethe"relative"abundancesforeachofthemwithinawindow,
whichisamoreaccurateestimationforthecontigabundances.Theaveragerelativeabundance
(denoteas¯
c
)forasub-contig
c
i
inawindowiscalculatedas
¯
c
i
=
S
(
c
i
)
å
h
j
=
1
S
(
c
j
)
(4.1)
where
S
(
c
i
)
isthesummationofreadscoverageonsub-contig
c
i
.Therationaleofthismethodis
thatwhiletheabundanceestimationstillmaynotbeaccurateinwindowswithmissingcontigs,we
canhavemoreaccurateabundanceestimationforwindowswithcorrectnumberofcontigs.Using
therelativeabundancesinthese"correct"windowsforclustering,wecanhavemoreaccurate
binningresults.
Similarly,wecancalculatetherelativeabundance(denoteas
~
c
)forasub-
contig
c
i
as
86
~
c
i
[
k
]=
c
i
[
k
]
å
h
j
=
1
c
j
[
k
]
;
k
=
1
::
w
(4.2)
where
c
i
[
k
]
representsthereadscoverageatthe
k
positionofsub-contig
c
i
.
w
isthewindowwidth.
Letthehaplotypenumberestimatedbywindowsheightas
N
.Withsub-contigs
c
,theposition-
readscoverage
~
c
,andtheaverageabundance¯
c
,VirBinutilizesthe
relativeabundancesofsub-contigsinwindowswithheighttoestimatetheprobabilitythata
contigbelongstoabinwithanExpectation-Maximization(EM)method.Letthehaplotypes
be
H
1
;
H
2
;:::;
H
N
,theirabundances
x
1
;
x
2
;:::;
x
N
berandomvariablesandfollowdistributions
E
1
(
x
)
;
E
2
(
x
)
;:::;
E
N
(
x
)
,respectively.Let
C
betheoriginatingcontigof
c
i
,intotal
n
sub-contigs
from
C
.WeusedEMalgorithmtomaximizetheposteriorprobability
P
(
c
i
2
H
j
j
¯
c
i
)
.Thealgorithm
containsfourstepsasshownbelow:
1.Initialize
N
groupsbyrandomlyassignsub-contigstothemorwithguidance.Thesub-
contigsinthesamewindowareassignedtodifferentgroups.
2.
Expectation
Foreachgroup,thecompoentsub-contigs'relativeabundance
~
c
sareaggregatedtocalcu-
latetheempiricalprobabilitydensityfunction
E
(
x
)
.Theaggregationisperformedbycalculating
thenormalizedhistogramsfortheserelativeabundancesothatthesummationofhis-
togramvalueswillbe1.Thelikelihoodthat
c
i
beingproducedfromahaplotypecanbedenotedas
P
(
x
j
=
¯
c
i
j
c
i
2
H
j
)
.Itcanbecaclulatedas
E
j
(
¯
c
i
)
.
Thepriorprobability
P
(
c
i
2
H
j
)
issameas
P
(
C
2
H
j
)
,whichiscalculatedas
å
n
k
=
1
P
(
c
i
2
H
j
j
¯
c
i
)
P
(
c
i
)
.Here,
P
(
c
i
)
isthepriorprobabilityof
c
i
,whichiscalculatedas
length
(
c
i
)
length
(
C
)
,
P
(
c
i
2
H
j
j
¯
c
i
)
istheposteriorprobabilitythat
c
i
belongstohaplotype
H
j
fromlastiteration.
Withbothlikelihoodandprior,theexpectedprobabilitythat
c
i
belongstohaplotype
H
j
canbe
87
calculatedas
likelihood

prior
,thatis
P
t
+
1
(
c
i
2
H
j
j
¯
c
i
)=
P
t
(
x
j
=
¯
c
i
j
c
i
2
H
j
)

n
å
k
=
1
P
t
(
c
i
2
H
j
j
¯
c
i
)
length
(
c
i
)
length
(
C
)
(4.3)
3.
Maximization
Withtheposteriorprobabilitiescalculatedforeachgroupdistribution,wecanreassignthesub-
contig
c
i
tothehaplotypewiththemaximumposteriorprobabilityorusingGibbssampling.The
samereassigningproceduresareappliedforallthesub-contigs.Withtheassignmentresults,the
distribution
E
(
H
j
)
andpriorprobability
P
(
c
i
2
H
j
)
canbeupdated.
4.Iteratestep2and3untiltheclusteringresultsdonotchangeorthemaximumnumberof
runshavebeenachieved.Thedefaultmaximumnumberofrunsis100.
4.3Results
4.3.1HIVsimulateddataset
VirBinwastestedontwosimulatedHIVquasispeciesdatasetsforitseffectiveness.One
simulateddatasethave5HIVhaplotypesandtheotherhave10.
4.3.1.1Datasimulation
HIVhaplotypessimulation
MultipleHIV-1strainsareavailableonHIVsequencedatabase(https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html).
However,thesequencesimilaritiesbetweenthemareusuallybelow90%,whicharelowerthanse-
quencesimilaritiesbetweenviralstrainsinaquasispceis.Therefore,wedownloadedfourHIV
strains(FJ061,FJ064,FJ065,andFJ066)fromthedatabase,andgeneratesimulatedhaplotypes
fromthembyrandomlymutatingbasesatrandomlyselectedlocations.Fortheviralquasispecies
88
with5haplotypes,wegenerated3simulatedhaplotypesfromFJ061strainandmixedthemwith
FJ066tocomposeaquasispecies(denoteas5HIVhaplotypes).Thesequencesimilaritiesbetween
eachsimulatedhaplotypeandFJ061is97%.Forthequasispecieswith10haplotypes,2simulated
haplotypesweregeneratedfromeachFJ061,FJ065,andFJ066strains.Theyweremixedwith
FJ064tocomposethequasispecies(denoteas10HIVhaplotypes).
Contigssimulation
Whileviralcontigscanbeassembledfromsimulatedreadswithavailable
assemblytools,thequalityofgeneratedcontigsdependsontheassemblyalgorithmsandparam-
etersused.Tofocusonthebinningproblem,wesimulatederror-freecontigsdirectlyfromthe
referencegenomes.Foreachreferencegenome(denoteitslengthas
L
),werandomlygenerated
alistof20locationpairs
(
p
1
;
p
2
)
,where1

p
1
<
p
2

L
and
p
2

p
1
+
1

500.Eachlocation
pairrepresentsacandidatecontig.Thesepairswerethensortedby
p
1
andpairswere
outfromthelist.Thestartsfromlookingatthesecondpairinthesortedlist(the
pairisalwayskept),ifithasanoverlapless100bpwiththepreviouspairandisnotasubstring
ofthepreviouspair,keepitandmovetothenextpair.Otherwise,removethispairfromthelist
andmovetothenext.Afterthecontigswerethengeneratedfromthereferencegenome
withleftlocationpairs.
HIVreadssimulation
WithavailableHIVhaplotypes,simulatedreadsweregeneratedfrom
thembyART-illumina[56]aserror-containingMiSeqpaired-endreads,withreadlengthof250bp,
averageinsertsizeof600bp,andstandarddeviationof150bp.Theabundanceofeachhaplotypeis
calculatedbasedonapowerlawequation[9].Thetotalsequencingdepthsforthe5HIVhaplotypes
are1000-x,with438-x,219-x,146-x,109-x,and88-xforeachhaplotype,respectively.The
sequencingdepthsfor10HIVhaplotypesare2000-x,with683-x,341-x,228-x,171-x137-x,114-
x,98-x,85-x,76-x,68-xforeachhaplotype,respectively.Intotal,thereare38,914simulatedreads
for5HIVhaplotypes,and76,974readsfor10HIVhaplotypes.
89
Table4.1:Haplotypeabundancesforsimulated5HIVhaplotypescalculatedfromreadsmapping.
HaplotypeFJ061FJ061-h1FJ061-h2FJ061-h3FJ066
Abundance(%)43.9021.9514.6310.938.59
4.3.1.2Resultsfor5HIVhaplotypes
Haplotypeabundances
Bymappingsimulatedreadstothesimulated5HIVhaplotypesbybowtie2,wewereableto
estimatetheabundanceforeachhaplotypeasshowninTable4.1.
Windows
Withtheavailablesimulatedcontigsandreads,wealignedcontigstoeachotherbyblastand
identifywindowsbyVirBin.For5HIVhaplotypes,VirBingenerated121windowsfromallthe
contigs.Thewindowsweresortedbytheirlengthsandthetop5windowsareshowninFigure4.3.
Theresultsrevealthattheheightofthesehigh-qualitywindowscanbeusedtoaccurately
estimatethenumberofhaplotypesinthequasispecies.Forexample,outofthetop100windows,
50contain5contigs,8contain6contigs,and32contain4contigs.Leftwindowscontain2or
3contigs.Outofthetop50windows,39have5contigs,1contains6contigs,and8contain4
contigs.Outofthetop25windows,20contain5contigs,1contain6contigs,and3contain4
contigs.Applyingasimplevotingprocessonthetop-kwindows,thenumberofhaplotypescanbe
estimatedastheheightofmostabundancewindows,whichis5.
ByaligningreadstothesecontigswithBowtie2,therelativeabundancelevelsforeachcontig
withinawindowcanbecalculatedfollowingtheequation4.1.Relativeabundancesforcontigsin
thetop5windowsareshowninFigure4.3.
EMclustering
Withtherelativeabundancescalculatedforcontigswithinwindows,weappliedtheEMclustering
90
Figure4.3:Top5windowsoutputfromsimulatedcontigsalignmentfor5HIVhaplotypes.Each
linestartingwith'>'representsonewindow.Thefollowingcolumnsrepresentreferencecontig
name,windowheight,windowstartingpositiononreferencecontig,windowendingpositionone
referencecontig,referencehaplotype,andhaplotypeabundance,respectively.Therowsafter'>'
lineshowtheinformationforeachcontiginsideofthiswindow.Thecolumnsrepresentcontig
name,relativeabundance,summationofreadscoverage,referencehaplotype,andhaplotypeabun-
dance,respectively.
91
Table4.2:EMclusteringresultsonsimulated5HIVhaplotypecontigsforVirBinandMaxBin.
virbinMaxBin
Precision(%)Recall(%)Precision(%)Recall(%)
FJ066
84.692.931.452.5
FJ061
96.689.00.00.0
FJ061-h1
88.282.90.00.0
FJ061-h2
91.899.60.00.0
FJ061-h3
92.789.00.00.0
methodinVirBintoclustercontigsinto5groups.Sincethegroundtruthforwhichhaplotype
eachcontigbelongstoisknow,wewereabletoevaluatetheclusteringresultsbycalculatingthe
precisionandrecallathebaselevel.TheevaluationresultsareshowninTable4.2.
TheresultswerebenchmarkedwithMaxBin,whichisabinningtoolformetagenomicscaf-
folds/contigsbasedontetranucleotidefrequenciesandreadscoveragelevels.TheMaxBinprogram
requiresmarkergenestoidentifyseedcontigsforbinning,whichisnotapplicableforviralqua-
sispeciessincethedifferentviralhaplotypesbelongtothesamespeciesandhavethesamemarker
gene.However,wewerestillabletorunthecoreclusteringprogramofMaxBinbyprovidingthe
seedcontigsmanually.Werandomlychoseonecontigfromeachhaplotypeastheseedcontigs
andcalculatedthecontigabundancesbymappingreadstothem.TheresultsfromMaxBinarealso
showninTable4.2.Forallthe24contigs,itassigned17to5clusters,leaving7unassigned.One
contigwascorrectlyassignedtotheclustercorrespondingtoFJ066strain,butothercontigswere
notcorrectlyclustered.
StrainPhlAn[152]isalsoatooltocharacterizethegeneticstructureofviralstrainsinmetagenomes.
IttakestherawsequencingreadsandMetaPhlAn2[151]databaseofreferencese-
quencesasinputandaimstooutputthemostabundantstrainforeachsample.However,itfailed
todetectanyviralspeciesatthesteprunningMetaPhlAn2.ConStrains[89]isanothertool
designedtoidentifystrainstructuresfrommetagenomicsequencedata.Itusesbowtie2
92
Table4.3:Haplotypeabundancesforsimulated10HIVhaplotypescalculatedfromreadsmapping.
HaplotypeFJ061FJ066FJ065FJ064FJ061-h1
Abundance(%)34.6116.8311.268.446.94
HaplotypeFJ061-h2FJ066-h1FJ066-h2FJ065-h1FJ065-h2
Abundance(%)5.784.844.203.753.36
tomapreadstoasetofuniversalgenesandinferthewithin-speciesstrainsabundancesbyutiliz-
ingsingle-nucleotidepolymorphism(SNP)patterns.Thistoolagaindidnotgetenoughmapped
readstoreportanystrainabundances.Thus,wecannotreporttheresultsfromStrainPhlAnor
ConStrains.
4.3.1.3Resultsfor10HIVhaplotypes
Haplotypeabundances
Similarto5HIVhaplotypes,bymappingsimulatedreadstothe10HIVhaplotypesbybowtie2,
wewereabletoestimatetheabundanceforeachhaplotypeasshowninTable4.3.
Windows
Similarto5HIVhaplotypes,thecontigsalignmentandwindowswasalsoapplied
onsimulatedcontigsfor10HIVhaplotypes.VirBingenerated319windowsfromallthecontigs.
Sortingthesewindowsbytheirlengths,thetop3areshowninFigure4.4.
Fromtheresults,outofthetop100windows,36contain10contigs,34contain9contigs,and
20contain8contigs,4oftheleftwindowscontaincontigsnumbergreaterthan10and6contain
contigsnumberlessthan8.Outofthetop50windows,26have10contigs,16contain9contigs,
and6contain8contigs.Outofthetop25windows,15contain10contigs,8contain9contigs,
and1contains8contigs.Therefore,thehaplotypenumber10canstillbecorrectlyoutby
theheightsofmostabundancewindows.
Similarto5HIVhaplotypes,therelativeabundancesforcontigswithinwindowscanbecalcu-
93
latedbyaligningreadstocontigswithbowtie2.Therelativeabundancesforcontigsinthetop3
windowsareshowninFigure4.4.
Figure4.4:Top3windowsoutputfromsimulatedcontigsalignmentfor10HIVhaplotypes.The
outputformataresameasFigure4.3.
EMclustering
Withthewindowsandrelativeabundances,VirBinwasalsoappliedtoclustercontigs
into10groupsfor10HIVhaplotypes.TheevaluationresultsforclusteringareshowninTable4.4.
TheresultswerealsobenchmarkedwithMaxBin.Similartosimulated5HIVhaplotypes,10
seedcontigsfromeachhaplotypewererandomlyselectedandprovidedtoMaxBin.It
25outof34contigs,with9TheresultsareshowninTable4.4.WhileMaxBin
correctlyonecontigtoFJ065-h1andonetoFJ066,mostoftheothercontigswerenot
94
Table4.4:EMclusteringresultsonsimulated10HIVhaplotypecontigsforVirBinandMaxBin.
virbinMaxBin
Precision(%)Recall(%)Precision(%)Recall(%)
FJ064
73.098.00.00.0
FJ061
100.080.00.00.0
FJ061-h1
98.188.10.00.0
FJ061-h2
99.589.20.00.0
FJ065
86.461.50.00.0
FJ065-h1
18.610.717.6100.0
FJ065-h2
42.194.90.00.0
FJ066
84.695.843.469.2
FJ066-h1
90.979.20.00.0
FJ066-h2
60.047.60.00.0
appropriatelyclustered.WeagaintriedStrainPhlAnandConStrainsonthissimulateddataset,but
still,noreadscanbemappedtoavailablereferencegenes.
4.3.2HIVrealMiSeqdataset
4.3.2.1HIVrealdatasetandcontigs
Tofurtherassesstheperformanceofthebinningmethod,weappliedVirBinontherealHIV
quasispeciesdataset(SRR961514),sequencedfromthemixofveHIV-1(89.6,HXB2,JRCSF,
NL43,YU2)strainswithIlluminaMiSeqsequencingtechnology[39].Thisdatasetcontains
714,994pairs(2x250bp)ofreadsthatcoverthevestrainsto20,000x.Therawdatasetwas
pre-processedwithFaQCs/1.3[87]andTrimmomatic[15]totrimandlow-qualityreadsor
adapters.Theleftreadswerethenerror-correctedwithKarect[2].Afterpre-processing,774,044
readswereleft.
Bymappingpre-processedreadstotheavailable5referencegenomesbybowtie2,wewere
abletoestimatetheabundanceforeachhaplotypeasshowninTable4.5.
WeusethecontigsassembledbyPEHaploasinputforVirBin.PEHaploproduced24contigs
95
Table4.5:Haplotypeabundancescalculatedfromreadsmapping.
HaplotypeJRCSFNL4389.6YU2HXB2
Abundance(%)29.7925.1821.6812.6210.87
fromtherealMiSeqHIVdatasetthatcancoverover92%oftheveHIV-1strains.Thesecontigs
haveaN50valueof2,223bpandthelongestcontigis9,133bp.
4.3.2.2ResultsforHIVrealdataset
Windows
Afteraligningcontigstoeachotherbyblast,VirBinwasappliedonthealignedandgen-
erated197windows.Sortingthesewindowsbylength,thetop5areshowninFigure4.5.
Fromtheresults,outofthetop100windows,44contain5contigs,39contain6contigs,and
5contain4contigs.Outofthetop50windows,27contain5contigs,16contain6contigs,and
2contain4contigs.Outofthetop25windows,17contain5contigs,5contain6contigs,and
1contains4contigs.Similartothesimulatedhaplotypesresults,thehaplotypenumber5canbe
estimatedfromtheheightsofmostabundantwindows.
Whilethewindowsonsimulatedcontigstendtohavewindowsheightslessthan
thehaplotypenumber,thewindowsproducedoncontigsassembledhavealotofheightsgreater
thanthehaplotypenumber.Onereasoncouldbethattheremaybe"chimericcontigs"assembled,
whereacontigisjoinedbysequencesfromtwoormorehaplotypes.Thechimericcontigscan
alignwiththosecorrectcontigs,thusincreasingtheheightofthewindows.Thesimulatedcontigs
areerror-freebutnotnecessarilybecompleteduetotherandomness.Therefore,therearealot
ofwindowswithaheightlessthanthehaplotypenumber.Anotherpossiblereasonisthatthere
arerepeatedsequencesforHIVrealhaplotypes.Letusdenoteasubstringofaviralhaplotype
X
as
X
[
start
_
position
;
end
_
position
]
.TheblastresultsbetweentheseHIVhaplotypesshowthat
96
Figure4.5:Top5windowsoutputforcontigsassembledfromHIVrealMiSeqdata.Theoutput
formataresameasFigure4.3.
97
Table4.6:EMclusteringresultsonassembled5realhaplotypecontigsforVirBinandMaxBin.
VirBinMaxBin
Precision(%)Recall(%)Precision(%)Recall(%)
HXB2
48.570.239.127.0
YU2
34.030.056.650.9
89.6
58.056.50.00.0
NL43
18.517.49.621.6
JRCSF
65.150.80.00.0
Table4.7:EMclusteringresultsonassembled5realhaplotypecontigsforVirBinandMaxBin.
VirBinMaxBin
Precision(%)Recall(%)Precision(%)Recall(%)
HXB2
70.087.533.325.0
YU2
25.020.033.333.3
89.6
33.3100.00.00.0
NL43
66.728.625.020.0
JRCSF
33.333.30.00.0
89.6[9046,9669]canbealignedwithJRCSF[1,630]withanidentityof92%.Therefore,the
alignmentbetweencontigsmaynotmeanthattheycomefromthesamelocationoftheconsensus
genomesequence,whichcanleadtoanoverestimationoftheheightofthewindows.
EMresults
ThesimilarclusteringprocedureswereappliedonwindowsbyVirBin
withrelativeabundances.MaxBinwasrunagainasBenchmarkTheresultsfrombothtoolsare
showninTable4.6.StrainPhlAnandConStrainswereappliedonthisrealHIVdatasettoo.
StrainPhlAnwasabletoidentifytheHIVspecies,butcouldnotreportanystraininformation.
ConStrainscouldnotalignenoughreadstomarkergenesforfurtherreportingstrainabundances.
VirBin
´
sresultsarenotasgoodasforsimulatedHIVdatasets.Oneofthereasonmaybethe
existenceofimperfectassembledcontigs,whichmayleadtotheincorrectcalculationofabundance
levelsandaffecttheclustering.
98
4.4DiscussionandConclusion
Withinthesameviralquasispecies,binningthecontigsfromdifferenthaplotypeswithk-mer
frequencies(suchastetranucleotidefrequencies)isnotapplicablebecauseofthehighsequence
similaritiesbetweenthem.Theonlypossibleinformationtodifferentiatethesecontigsistheir
abundancesinferredfromreadscoveragelevelsbymapping.However,theinhomogeneousreads
coveragealongtheviralgenomeandcloseabundancesbetweenviralstrainsincreasethedif
cultytodirectlyapplyingcoveragelevelsforbinningviralcontigs.InthisChapter,weproposed
anovelmethodtomoreaccuratelycalculatetherelativeabundancesforsub-contigswithinwin-
dows.Moreover,thenumberofhaplotypesinthequasispeciescanbeestimatedfromtheconsensus
heightofwindows.
Wehaveshowntheutilityofourtoolontwosimulatedandonerealviralquasispeciesdatasets,
andbenchmarkedtheresultswithMaxBin.Thesuccessofthismethodreliesonthequalityofinput
contigsandtheabundancedifferencebetweenviralhaplotypes.Whentheassembledcontigscan
coverthemostpartofviralstrains,thenumberofhaplotypesinthequasispeciescanbeaccurately
Theempiricalexperienceshowsthatitisdiftoclassifytwoviralstrainswhenthe
abundancedifferencebetweenthemisbelow3%.
99
Chapter5
Studyingtranscriptionalregulationsof
miRNAgeneswithCap-seq
5.1Introduction
MicroRNAs(miRNAs)arealargefamilyof
˘
21nucleotide-longRNAsthathavebeenuncov-
eredaskeyregulatorsofgeneexpressionatpost-transcriptionallevelinmetazoans,plants,and
viruses[63,66,10].Inmetazoans,maturemiRNAsandargonaute(AGO)proteinsformintothe
miRNA-inducedsilencingcomplex(miRISC),withinwhichmiRNAsbase-pairtothe3'-UTRof
targetmRNAsandinhibitproteinsynthesisbyeitherrepressingtranslationorpromotingmRNA
degradation.Itwasinferredthatmorethanone-thirdofallprotein-codinggenesareregulatedby
miRNAs[94].miRNAshavealsobeendiscoveredtoplayacrucialroleinprecisionmedicine.
Precisionmedicineattemptstocharacterizethegeneticbackgroundofpatientsandclassifythem
intosubpopulationsthatdifferintheirsusceptibilitytoaparticulardisease[30,58,91].Thecapa-
bilityofmodulatingavastnumberofprotein-codinggenesmakesmiRNApowerfulregulatorsof
thedifferentcellularprocessesinvolvedinthepathogenesisofvarioustypesofdiseases,including
cardiovasculardiseasesandcancer.Forexample,livermiRNAmiR-122,asthemostabundantand
mostlivermiRNA,ismostlikelytorepresentanovelbiomarkerforcardiovascularand
metabolicdiseasesasitplaysacentralroleinlipidandglucosehomeostasisandisdetectablein
serumandplasma[156].DifferentialexpressionofmiRNAshavealsobeenobservedintumor
100
tissues.Theiralterationexpressioninprostatecancerhasbeenwelldocumented[96].Becauseof
theirimportantregulatoryfunctions,manystudieshavefocusedonmiRNAannotationandidenti-
fyingtheirtargets[32,164,78].However,howmiRNAsthemselvesareexpressedandregulated
isnotfullyunderstood.
InthecanonicalmiRNAbiogenesispathway,miRNAsareprocessedfromlongertranscripts,
whicharereferredtoasprimarymiRNAs(pri-miRNAs)[66].Pri-miRNAsareeithertranscribed
bypolymeraseII(PolII)fromindependentgenesorderivedfromtheintronsofprotein-coding
genes[77,16].TwomembersoftheRNaseIIIfamilyofenzymes,DroshaandDicer,further
processpri-miRNAstomaturemiRNAs[76,28,68].First,Droshacleavesthehairpinstructureof
apri-miRNAtoan
˘
70-nucleotideprecursormiRNA(pre-miRNA)inthenucleus.Pre-miRNAs
arethenexportedtothecytoplasmbyXPO5,whereDicercleavesofftheloopregionofthehairpin
andfurtherprocessesitto
˘
21-bpmaturemiRNA(s).Recentstudieshaveuncoveredseveral
non-canonicalwaysofgeneratingmiRNAs,demonstratingthecomplexityofmiRNAbiogenesis.
OneclassofunconventionalmiRNAsiscalledmirtrons,whichareencodedinintrons,bypass
Droshaprocessorbutrelyonsplicingmachineryforpre-miRNAgeneration[11,130].miRNAsin
mammalshavebeenshowntofrequentlyutilizealternativepromotersindifferentcelltypes,and
pri-miRNAsmayencodesubsetsofclusteredmiRNAs[24].Pri-miRNAtranscriptscanbecleaved
bycytoplasmicDroshainhumancells[35].Anotherstudyonmicehasuncoveredasecondclass
ofnon-canonicalmiRNAs,ofwhichthepre-miRNAsare5'-cappedandgenerateddirectlyby
transcription[161].
AlthoughthegenomiccoordinatesofmatureandprecursormiRNAshavebeenannotatedin
databasessuchasmiRBase[48],verylittleisknownaboutthecoordinatesofpri-miRNAs.RNA-
seqtechnology[155]hasbeenprovedasanefwaytoannotateprotein-codinggenes.Mature
mRNAscontaina5'7-methylguanosine(m
7
G)capandalong3'polyadenylated(poly(A))tail
101
andarerelativelystable,sotheycanbewellextractedfromcellsandsequenced.Thesequenced
RNAfragmentsarethenmappedtothereferencegenomeforgeneannotation.However,sincethe
original5'endsofprimarymiRNAtranscriptsarerapidlycleavedoffbyDroshaduringmiRNA
maturation,regularRNA-seqtechnologycannotbeusedtotheprimaryTSSsofmiRNAgenes.
Pri-miRNAsareusuallytranscribedbyPolIIandalsocontaina5'm
7
Gcapanda3'poly(A)tail
[77],indicatingthatthebiologicalfeaturesrelatedtoPolIItranscriptioncanbeusedtoidentify
thetranscriptioninitiationsitesformiRNAgenes.
ToidentifytheprimaryTSSsofmiRNAs,somecomputationalmethodshavebeenimple-
mentedbasedonfeaturesrelatedtoPolIItranscribedgenes,suchastranscriptionfactorbinding
sites(TFBSs),PolIIbinding,andchromatinstatesincludinghistoneandnucleo-
somepositioning[132,111,31,29].Typically,PolIIandH3K4me3arehighlyenrichedatactive
promoters,whilenucleosomesaredepletedattheTSSs.Wangetal.[154]designedastatistical
modeltomimicPolIIbindingpatternsatthepromotersofhighlyexpressedprotein-codinggenes
andusedittosearchforsimilarPolIIbindingpatternsupstreamofallintergenicmiRNAsinhu-
manbreastcancercellstoidentifyprimarypromoters.Theyvtheirbycheckingthe
conservation,CpGcontent,andactivatinghistonemarksinthepromoterregions.Ozso-
laketal.[111]combinednucleosomemappingwithChIP-chipscreensforH3K4me3,H3K9/14ac,
PolIIandPolIIIsignaturestoidentifytheproximalpromoterregionsofpri-miRNAsinhuman
genome.Theytestedtheiralgorithmonhumanannotatedprotein-codinggenesandpredictedthe
transcriptioninitiationregionstoaresolutionof150bp.Withthesamemethod,thetranscrip-
tioninitiationregionsof175transcriptionallyactivemiRNAsweredetermined.Sainietal.[131]
predictedthe5'endsofintergenicpri-miRNAsinhuman,mouseandratgenomesbycombining
thefeaturesofTSSspredictions,CpGislandsand5'capanalysisofgeneexpression(CAGE)
tags.miRStart[29]builtaSVMmodelusingthefeaturesofCAGEtags,TSSSeqlibrariesand
102
H3K4me3chromatinsignaturefromChIP-seqtoidentifytheTSSsofhumanmiRNAs.Themodel
wastrainedon7,268protein-codinggeneswithuniqueTSSand847putativeTSSsfor
the940humanpre-miRNAsobtainedfrommiRBase.
Whilethemethodsdiscussedabove[154,111,131]havepredictedTSSsformiRNAgenes
inmammalgenomes,theirpredictionresultshavelowresolution(hundredsofbps)becausethe
typicaldistributionpatternsofPolIIandchromatinfeaturessurroundingpromotersmaynothold
foranyparticulargene.Evenforsomeactivelytranscribedgenes(29/85),thedistancebetween
theTSSandtheclosestPolIIpeakcanbeover1000bp(FigureS1(B)).Amoreaccuratemethod
istotakeadvantageofthecapstructureat5'endsofPolIItranscribedRNAs.Mappingthe
cappedsequencestoreferencegenomeswillenabledirectdiscoveryoftheTSSs.CAGEandCap-
seqhavebeenusedtodirectlysequenceRNAswith5'm
7
Gcaps,whichareusedtoidentifythe
candidateTSSs.CAGE[65]usesaso-calledﬁcaptrapper"methodtocapturefull-lengthmRNAs
andsequencethe5'endswithSangersequencingtechnology.However,CAGEisnotwidelyused
tomapTSSsforeachgenebecauseofthecostandsequencingdepth.Withthedevelopmentofhigh
throughputsequencingtechnology,enrichingcappedRNAtranscriptsfollowedbynext-generation
sequencing(NGS)technology(Cap-seqordeepCAGE[38])hasbeenusedtosequencethecapped
RNAsinthewholegenome.
Themouseisapopularmammalianmodelsystemforgeneticresearch,forwhichsomeCap-seq
datasetshavebeengenerated[161,49].RecentlytheCap-seqstudyinmice[161]hasuncovered
anon-canonicalwayofgeneratingpre-miRNAs,inwhichthepre-miRNAsaregenerateddirectly
bytranscriptionandtheir5'endsarem
7
Gcapped.These5'm
7
Gcappedpre-miRNAspreferto
beexportedfromthenucleustothecytoplasmbyexportin1(XPO1)andafterDicerprocessing,
only3p-miRNAisefloadedontotheAGOcomplex.Thisspecialclassof5'-cappedpre-
miRNAshavealsobeendiscoveredinthehumangenome(miR-320a)[161],butwhethertheyalso
103
existinothernon-mammalianspeciesisstillunknown.
Caenorhabditiselegans(
C.elegans
)isalsoawellestablishedmodelorganismforgenomic
studies.Thewormisasimplemulticellularorganismbutwithavarietyoftissuetypesanda
shortlifecycle[33].Therefore,manyfunctionalgenomicsequencingdatasets,includingCap-seq
[49,27,67],havebeengeneratedonthisspecies.Thetranscriptionregulationinthisanimalis
quitedifferentfromthatinmammals.Forexample,theprimarytranscriptsofabout70%ofits
protein-codinggenesundergotrans-splicing[144];anditspri-miRNAtranscriptsareexportedby
XPO1andpossiblyprocessedinnuclearpore[19].
Inthisstudy,weutilizedavailableCap-seqdatasetstostudythetranscriptionregulationof
miRNAsin
C.elegans
andmouse.Themainresultsaresummarizedbelow.

Weagroupofcandidate5'm
7
Gcappedpre-miRNAsin
C.elegans
.

WeanotherclassofmiRNAswithnon-canonicaltranscriptionmechanisms,for
whichthepre-miRNAsmaybegeneratedbyboththecanonicalmiRNApathwaywith
Droshaandthenon-canonicalpathwaywithoutDrosha.

BasedonthecappingsignalsformiRNAgenesinclusters,weproposedahypothesisthat
thesepri-miRNAtranscriptsmightundergocytoplasmicre-cappingduringthepre-miRNAs
generationprocess.

WedevelopedamethodtoseparatetheseprimarymiRNApromotersasbroador
divergentandcharacterizedthembyanalyzingtheH3K4me3andPolIIbindingsurrounding
them.
104
5.2MaterialsandMethods
5.2.1Datasetsandprocessing
ThesmallRNACap-seqdatafornewbornmousewasretrievedfromthestudybyXieetal.[161],
andwasdownloadedfromSequenceReadArchive(SRA)withrunnumberSRR1022391.The
smallRNACap-seqdataformixedstagedembryosof
C.elegans
isfromthestudybyChenetal.
[27],andwasdownloadedfromNCBIGeneExpressionOmnibus(GEO)withaccessionnumber
GSE42819.ThesmallRNA-seqdataof
C.elegans
L4stagewasalsodownloadedfromGEO
databasewithaccessionnumberGSM916519.TheChIP-seqdataofH3K4me3andPolIIfor
C.elegans
werealsoobtainedfromNCBIGEOdatabase,withaccessionnumberGSE28770and
GSE15535,respectively.SmallRNA-seqdatafordetectingmaturemiRNAsin
C.elegans
was
downloadedfromGEOwithaccessionnumberGSM916519.
RawsequencingdatasetsareinSRAformatandweredumpedtoFASTQformatbySRA
Toolkit[145].TheFASTQwerethenmappedto
C.elegans
(WBcel235)andMouse(GRCm38)
referencegenomeusingbowtie[72]allowing2mismatchesand3mismatchesforreadslengthof
36ntand50nt,respectively.OnlyuniquelymappedreadswerereportedintheoutputSAM
andwerevisualizedbyGenomeView[1].
5.2.2Clusteringof5'endreads
SmallRNACap-seqdatafor
C.elegans
andmousearestrandMappedreadsonforward
andreversestrandwereanalyzedindependently.Wethetranscriptioninitiationclusters
(TICs)withsimilarmethodsfromthestudybyChen,etal[27].First,mappedreadswithsame
strandand5'endpositionswerecombinedanddenotedascap-stacks.Second,allcap-stacks
containingveormoretagswereclusteredusingasingle-linkageapproach:twoormorestacks
105
wereclusteredtogetherifthedistancebetweentwoadjacentstacksislessorequalto50bp.The
positioncoveredbythemost5'endswithintheTICwasasthemode,whichrepresents
theTSSfortheTIC.Inthecaseoftwoormorepositionswiththesamenumberoftags,theone
furthestupstreamwasselectedasthemode.Hereintotalweobtained32,530TICsin
C.elegans
and4,903TICsonmousechromosome7.
5.2.3Statisticalanalysis
BecausetheCap-seqisnotperfectandmayinvolvecontaminationorartifacts,weusedaPoisson
distributiontomodelthebackgroundnoisefollowingthepreviouswork[168].ThoseTICsthat
areenrichedwithsequencingreadsarereported(Poissondistribution
p
-valuebased
on
l
).SincethereadsofCap-seqarenotrandomlydistributedalongthegenome,weestimatea
dynamicparameter
l
local
,foreachTICas:
l
local
=
min
[
l
5k
;
l
10k
]
(5.1)
where
l
5k
and
l
10k
areestimatedfromthe5kbor10kbwindowcenteredatthemodeoftheTIC.
OurmodelisbuiltontheassumptionthataTICisreliableifthenumberofreadsenrichedinside
oftheTICishigher(defaultwith
p
-value<10

5
)thanthenumberofreadslocatedin
theTICwhentheyarerandomlydistributedinthelocalregion.
ResultsofCap-seqandsmallRNA-seqreadsmappedtotRNAsweresubjecttoatwo-tailed
pairedsamplet-testtoestimatetheirdifferences,with
p
-value<0.01consideredstatisticallysig-
BothCap-seqandsmallRNA-seqmappingresultsontRNAswerenormalizedbytheir
coveragedepthsonthegenome.Sincetheyonlycoverasmallpartofthewholegenome,the
coveragedepthwascalculatedasthetotalnumberofreadsmappedmultiplythereadslengthand
106
dividedbythelengthofcoveredgenomeregion.ThesimilarPoissonmodelisalsousedtoestimate
theenrichmentofCap-seqreadsontRNAs.
5.2.4miRNAclusterandintergenicpre-miRNAs
miRNAclusterswereretrievedfrommiRBasewithadistancethresholdof1000bp.Thedistance
betweenpre-miRNAsinmiceareusuallylongerthan1000bp.Thus,wedidnotanymiRNA
clusterinmicewithourthreshold.
ToidentifyintergenicmiRNAsfrom
C.elegans
andmousegenome,wedownloadedthemiRNA
annotationsfrommiRBaseRelease21andgeneannotaiongff3Thegeneannotationfor
C.elegans
andmouseweredownloadedfromEnsemblwebsite(http://www.ensembl.org/index.html).
Pre-miRNAsfrombothmiRBaseandEnsemblgeneannotationeswereisolatedandthosemiR-
NAsthatarenotcoveredbyprotein-codinggenes,non-codingRNAgenes,snoRNAgenesand
snRNAgeneswereasintergenicmiRNAs.Inthecasethatsamepre-miRNAisanno-
tatedbothinmiRBaseandgeneannotationweonlykepttheoneinmiRBase.Intotal,we
134intergenicmiRNAsin
C.elegans
and80intergenicmiRNAsonmousechromosome
7.Theregionupstreamofthe5'endofintergenicpre-miRNAwasalsoand
TICsdetectedinthisregionwereannotatedasthecandidateprimaryTSSsforthemiRNA.The
TICswithinthepre-miRNAregionwereannotatedasthepre-capTICs.
5.2.5FindingbidirectionalandmultipleTSSspromoters
WiththecappedRNAreads,wetranscriptioninitiationsitesandannotatedtheirpromot-
ersasbidirectionalorbroadpromoters(promoterswithmultipleTSSs)in
C.elegans
.Weused
theTICstoannotatethesepromoters.FirstforeachTIC(
A
inFigure5.7(B))onthe
107
plusstrand,themostclosedownstream(
C
inFigure5.7(B))andupstream(
B
inFigure5.7(B))
minusstrandTICsweresearched.IfthedistancebetweentheupstreamminusTIC(
B
)andthe
plusstrandTIC(
A
)islessthanthreshold(hereweuse300bp),thetwoTICsweretreatedasfrom
thesamepromoterandthepromoterwasannotatedasbidirectional.Thedownstreamminusstrand
cappedpeak(
C
)actsastheboundaryfordetectingmultipletranscripts.AlltheplusstrandTICs
upstreamofitwereannotatedasfromthesamepromoter.Toidentifythemultipletranscriptson
theminusstrand,themostcloseupstreamplusstrandTIC(
D
)of
B
wasfoundandalltheminus
strandTICsbetween
D
and
B
wereannotatedasfromthispromoter.ThoseleftminusstrandTICs
wereannotatedwiththesimilarmethod.
Withthisapproach,wedetected11,272promotersin
C.elegans
,ofwhich,6,149arebidirec-
tionalpromotersand2,359arebroadpromoters.Themostupstream5'endsofbothplusand
minusstrandTICswereusedtorepresentfortheTSSsofthesepromoters.OnlyoneTSSoneach
strandwasselectedasrepresentingTSSforonepromoter.
5.3Results
5.3.1ofprimarymiRNATSSsin
C.elegans
andmouse
5.3.1.1OverviewofprimarymiRNATSSsannotation
Weusedthesingle-linkageclusteringmethodtodetecttranscriptioninitiationclusters(TICs)[27]
fromCap-seqdatain
C.elegans
andmice.ForeachTIC,wealsousedaPoissondistribution
tomodelthelocalbackgroundnoiseandtestwhetheritisenrichedwithCap-seq
readsbycalculatinga
p
-value(seeMaterialsandMethods).TheintronicmiRNAsareusually
processedaspartoftheirhost-genemRNA[10]andthustheirtranscriptionscoordinatewiththe
108
protein-codinggenes.Therefore,inthisstudy,wehavefocusedonidentifyingtheprimaryTSSs
forintergenicmiRNAs.Inbothspecies,wetheintergenicmiRNAsthatarenot
coveredbyprotein-codinggenes,non-codingRNAgenes,smallnucleolarRNA(snoRNA)genes,
orsmallnuclearRNA(snRNA)genes.Thentheregionbetween5'endofpre-miRNA
andtheclosestupstreamgenewassearchedforTICs.TheTICswereannotatedas
candidate
primaryTICs
(Figure5.1).Theregionwithinpre-miRNAwasalsosearchedforTICs,
andtheTICswereannotatedas
pre-capTICs
(Figure5.1).Moreover,wealsosearched
formiRNAclustersfrommiRBasewithadistancethresholdof1000bp.FormiRNAclusters,
primaryTSS(s)areannotatedastheTIC(s)upstreamofthe5'endofthepre-miRNAinthe
clusterandpre-capTIC(s)areannotatedastheTIC(s)withinthepre-miRNAsinthecluster.
Figure5.1:PrimaryTICislocatedupstreamofthepre-miRNA,whilepre-capTICislocated
insideofthepre-miRNA.Theannotationofapre-miRNAisusuallyastem-loopthatincludes
thepre-miRNAandthelowerstems.However,therealpre-miRNAonlyincludesthered,purple
sequences(maturemiRNAs)andtheloopbetweenthem.Therefore,thepre-capTICstartsfrom
the5'endofa5p-miRNA.Eachblueorgreenbarcorrespondstoamappedread,wheregreen
indicatestheplusstrandandbluetheminusstrand.TheCap-seqdatasetsweresequencedina
way.OnlythereadsonthesamestrandofthemiRNAareconsidered.
In
C.elegans
,we134intergenicmiRNAsand16miRNAclusters.Withtheretrieved
Cap-seqdata[27],70intergenicmiRNAswerewithatleastonecandidateprimaryTIC
109
intheregions,and9miRNAswerewithatleastonepre-capTICinsideofthe
precursors.ForthosemiRNAswithcandidateprimaryTICsorpre-capTICs,8were
withbothofthem.Themodes(highestcoverageofreads'5'ends)oftheprimaryTICsareusedto
representthecandidateTSSsofmiRNAs.Inthe16miRNAclusters,6werefoundtocontainboth
primaryTSSsandpre-capTICs,and2clusterswerefoundtocontainonlyprimaryTSSs(Table
S1).
Inmice,weusedtheCap-seqdatafromthestudybyXieetal.[161].TheyperformedCap-
seqtotheunconventionalpre-miRNAswhose5'endsaregenerateddirectlybytranscription
initiationwithPolIIandthus5'm
7
Gcapped.Basedontheirresults,therearethehighestnumber
of5'-cappedpre-miRNAsonchromosome7,sowefocusedonidentifyingtheprimaryTSSsfor
miRNAswithinmousechromosome7.We80intergenicmiRNAsonthischromosome,
ofwhich10miRNAswerewithatleastonepre-capTICand37miRNAswere
withatleastonecandidateprimaryTICs.OfthosemiRNAswithcandidateprimaryTICsorpre-
capTICs,2werefoundtocontainbothofthem(TableS2).
5.3.1.2Comparisonwithpreviouswork
Inthepreviousstudy[49],Guetal.developedtheCapSeqprotocoltoenrichandsequencelonger
(70-90nt)5'-cappedRNAtranscripts,andCIP-TAPcloningtoisolateandsequence5'-capped
small(18-40nt)RNAs.Theyappliedthesetwo5'anchoredRNAdeep-sequencingapproaches
onto
C.elegans
andmousegenomesandannotatedtheprimaryTSSsformiRNAsinbothofthem.
Asaresult,theyatleastoneTSSfor55individualpre-miRNAsand9miRNAclusters
in
C.elegans
,and134individualpre-miRNAsinmice,with7miRNAsannotatedonchromosome
7.Comparingwiththeirresults,wetheprimaryTSSsfor43additionalmiRNAs,2
additionalmiRNAclustersin
C.elegans
and37additionalmiRNAsonmousechromosome7
110
(TableS3).
Inanotherstudy[67],Kruesietal.devisedaglobalrun-oncapsequencing(GRO-cap)method
tocaptureandsequenceonlythose5'm
7
GcappedRNAsin
C.elegans
embryos,starvedL1lar-
vae,andL3larvae.WiththeGRO-capsequencingdata,theyannotatedtheprimaryTSSsfor52
individualpre-miRNAsand5miRNAclusters.WesimilarprimaryTSSsforthoseover-
lappingpre-miRNAs(24miRNAs).Furthermore,weannotatedtheTSSsfor46moreindividual
miRNAsand3moremiRNAclusters(TableS4).
5.3.25'm
7
Gcappedpre-miRNAsarein
C.elegans
AsshowninFigure5.1,readsinpre-capTICsareusuallyalignedverywellwiththe5'endsof
pre-miRNAs.Thereisapossibilitythatthesereadswereactuallysequencedfromuncappedpre-
miRNAsratherthancappedRNAs.Inthissection,weinvestigatewhetherusuallyuncapped
pre-miRNAsarehighlyenrichedbyCap-seqprotocol.
5.3.2.1Possible5'recessedRNAsenrichedbyCap-seq
ToenrichforcappedRNAinCap-seqexperimentsfor
C.elegans
,exonucleaseTerminatorandcalf
intestinalalkaline(CIP)wereusedtoremovetheuncappedRNAs[27].However,someuncapped
RNAs,suchaspre-miRNAsandtRNAs,maynotbeaccessedefbyTerminator/CIPbe-
causeoftheir5'recessedendsintheirsecondarystructures.Asaresult,someoftheCap-seqreads
inpre-capTICsmayactuallycomefrompre-miRNAs.Toinvestigatethecontaminationofpre-
miRNAreadsinCap-seqdata,wedownloadedsmallRNA-seqdata(GSM916519)for
C.elegans
andmappedthereadstothereferencegenomeasthecontrol.ThosemiRNAsthataredetectedas
expressedbysmallRNA-seqdatamightbeobservedinCap-seqdataaswell.Wethen
thenumberofCap-seqreadsalignedtothoseexpressedpre-miRNAs(mappedwithsufsmall
111
RNA-seqreads)(FigureS2(A)).TheresultsshowedthatthenumbersofCap-seqreadsmappedto
pre-miRNAsarenotproportionaltothenumbersofsmallRNA-seqreadsmapped.Formany
highlyexpressedmiRNAs(22/33),therearefewornoCap-seqreadsalignedtotheirprecursors,
indicatingthatmanyCap-seqreadsmaynotbepre-miRNAs.Inaddition,pre-miRNAs,serving
asanintermediateduringmiRNAmaturation,arequicklyprocessedbyDicerandthustendnotto
beenrichedbyCap-seq.Previousworkhasshownthatthedataofcarefullydesignedpre-miRNA
sequencingonlycontainslessthan1%readsthatcanbemappedtopre-miRNAs[82].
AccordingtoChenetal.[27],therewasnosteptoremovetRNAsintheCap-seqprotocol.We
thendidthesimilarcomparisonfortRNAsbecausetRNAsmayescapetreatmentoftheTerminator
andCIPforthesamereasonaspre-miRNAs.Theresultsshowedthat,althoughalmostallthe
annotatedtRNAgenesinthewormwereexpressed(604/605),mostofthesetRNAs(463/605)do
nothaveanyCap-seqreadsmapped(FigureS2(B)).Atwotailedpairedsamplet-testwasused
toestimatethemeandifferencebetweenthenormalizedCap-seqreadsandsmallRNA-seqreads
mappedtotRNAs.Wegota
p
-valueas1.12e-185,showingthatthetRNAreadscapturedbytwo
protocolsaredifferent(alphalevelas0.01).Pearson'scorrelationandSpearman's
rankcorrelationbetweentwomappingresultswerealsocalculated.Thecorrelationcoef
results(-0.042and-0.125,respectively)suggestthattheyarepoorlycorrelated.Mostofthose
tRNAsmappedwithCap-seqreadshavelessthan10reads(124/142,FigureS3),implyingthat
thesereadsmightbecausedbyrandomcontamination.WethenusedthePoissondistributionto
modelthebackgroundnoise(seeMaterialsandMethods).31tRNAswerereportedas
enrichedwithCap-seqreadsatthecutoffof10

5
.ThosetRNAsthatarehighlyenrichedforCap-
seqreadsareusuallyoverlappedwiththerepeatregions.ConsideringthatmostoftheothertRNAs
havefewornoCap-seqreadsmapped,weinferthattheseregionsmightbetranscribedbyPolII
andproducecappedRNAsundercertainconditions.
112
WealsosuspectedthatsomeoftheCap-seqreadswithinpre-miRNAsmightbematuremiR-
NAs.However,maturemiRNAsareshortandusuallydonotpossesscomplexsecondarystruc-
tures.TheycouldbeefremovedbyTerminator/CIPtreatment[44].TheofCap-seq
readsonmostmaturemiRNAsaddssupporttothis.Hencethepre-capTICsarenotlikelyformed
bymaturemiRNAseither.TheCap-seqstudyonmicehavealsoshownthatuncappedRNAscon-
stitutelessthan10%ofthetotalsequencedreads[161].Consideringtheaboveanalysistogether,
wepositedthatalthoughCap-seqinevitablycontainssomeuncappedRNAs,manyofthereads
mappedtopre-miRNAsarelikelysequencedfromcappedRNAs.
5.3.2.25'm
7
Gcappedpre-miRNAswithpre-capTICs
RNAssynthesizedbyPolIIare5'm
7
Gcappedcotranscriptionally.Apreviousstudy[161]has
documentedanewclassofunusualmiRNAsinnew-bornmice,forwhichthe5'endsofthe
pre-miRNAsarem
7
GcappedandcoincidewiththeirTSSs.ThesemiRNAsweresuggestedto
begeneratedwithoutDroshaprocessing,withtheir5'endsdetermineddirectlybytranscription
initiationandthe3'endsgeneratedbytranscriptiontermination.
ToexaminewhethertherearethesameclassofmiRNAsin
C.elegans
,weanalysedthosepre-
miRNAswithpre-capTICsmappedinside.The5'endsofpre-capTICsareusuallyconsistentwith
the5'endsofpre-miRNAsor5p-miRNAs.Therefore,thosepre-miRNAsthatweredetectedwith
pre-capTICsshouldacquirem
7
Gcapattheir5'endsandwereannotatedas5'-cappedmiRNAs.
WealsoappliedourmethodtomouseCap-seqdataobtainedfromthestudybyXieetal.[161]
andcomparedourresultswiththeirs.Ourresultshaveuncoveredallthe9intergenic5'-capped
pre-miRNAsonmousechromosome7asshowninthepaper,validatingourmethod.Besides,we
alsoannotatedtheprimaryTSSsfor37miRNAswiththesamedataset.Strikingly,newcandidate
TSSswerealsofoundupstreamoftwo5'cappedpre-miRNAs(mmu-mir-344candmir-344i).In
113
total,we95'-cappedmiRNAsin
C.elegans
(TableS1)and10onmousechromosome7
(TableS2).
WealsousedstatisticalanalysistoevaluatetheenrichmentofcappedRNAreadsonthe5'ends
ofpre-miRNAs.Withthecalculated
p
-values,7/95'-cappedpre-miRNAsin
C.elegans
and9/10
inmiceareenrichedwithCap-seqreads(
p
-value<10

5
;TableS1).
TolookforsequencemotifssurroundingtheseputativemiRNATSSs,weplottedthe
nucleotidecompositionaroundthembyWeblogo[34].AstrongYRmotifwasobservedatboth
theputativeprimarymiRNATSSsorpre-capTSSsofindependentmiRNAs,inwhichYrepresents
pyrimidine,RrepresentspurineandRlocatesattheTSSs(+1position,Figure5.2).
Figure5.2:SequencemotifanalysisofnucleotidesaroundputativemiRNATSSs(+1).Thenu-
cleotideheight(inbits)standsforthe
log
2
ratiooftheobservednucleotidesfrequencyrelativeto
thebackgroundgenomicnucleotidecomposition.TheYRmotifisobservedattheputativeprimary
miRNATSSsandindependentmiRNApre-capTSSs.
114
5.3.3M
7
Gcappedpre-miRNAsoftenhaveupstreamprimaryTICs
Inboth
C.elegans
andmice,wenoticedthatsomeofthese5'-cappedpre-miRNAsalsohave
primaryTICsintheregionfromthepre-miRNA5'endtotheclosestupstreamgene.In
C.elegans
,
8outof95'-cappedpre-miRNAshavebeendetectedwithprimaryTICs;whileinmice,2outof10
5'-cappedpre-miRNAshaveupstreamprimaryTICs.TheseupstreamTICsareusuallynotveryfar
fromthepre-miRNAsandthePolIIbindingattheseTICselongatetothedownstreammiRNAs,
indicatingthattheyareconnectedwiththemiRNAs.Toourknowledge,thisphenomenonhasnot
beendescribedinotherstudies.Twoexamplesin
C.elegans
areshowninFigure5.3.Sincepre-
miRNAscanbegeneratedbothinthecanonicalwayasfromalongpri-miRNAbyDroshaandin
thenon-canonicalwaybytranscriptioninitiationandtermination,weaskwhichTICcorresponds
totherealprimaryTSSofthemiRNAorcanbothofthemproducepre-miRNAs?
Weproposedtwopossibleexplanationsforthisphenomenon.Theexplanationisthatthere
couldbemultipleisoformsforthesemiRNAgenes.Forexample,theseconddiscoveredmiRNAin
C.elegans
(let-7)hasbeendetectedwithatleastthreeprimarytranscripts[17].Itwasalsoreported
thatgenesoftenusealternativepromotersinadevelopmentalstageorcelltypewaythat
canspreaduptothousandsofkilobases.Asanexample,aboutonehalfoftheprotein-coding
genesinhumanandmousegenomeshavemultiplealternativepromoters[37].Therefore,these
miRNAgenesmayalsocontainmulitplealternativepromotersthatcangenerateseveralisoforms
(Figure5.4(A)).TheotherprimaryTICsmayproducelongermiRNAtranscriptsthataresubjectto
thecanonicalmiRNAprocessingproceduresinvolvingDrosha.Sincethe5'-cappedpre-miRNAs
aremostlikelygenerateddirectlybytranscription,twopathsmaybeabletoleadtothematuration
ofthesamemiRNA:onewithDroshaandtheotherwithout.Becauseweusedthedatasetfrom
mixed-stageembryos,thesemiRNAgenesmayemployalternativepromotersandproducediverse
115
Figure5.3:PrimaryTICsaredetectedupstreamof5'-cappedmiRNAscel-mir-51andcel-mir-53.
MultipleCap-seqpeaksareobservedinpre-miRNAregions.Themappedreadswerevisualized
byGenomeView[1].Eachblueorgreenbarcorrespondstoamappedread,wheregreenindicates
theplusstrandandbluetheminusstrand.ThereadsinupperpanelarefromCap-seqdataset,with
uniformlengthof36nt.ThereadsinlowerpanelarefromsmallRNA-seqdataset,withlengthin
rangefrom14ntto26nt.
isoformsatdifferentstages.WealsoobservedthatforsomemiRNAs(SupplementaryFiguresS5
-S7:cel-mir-235,cel-mir-244,cel-mir-238,andcel-mir-228),theupstreamTICsareverycloseto
thepre-miRNA,likelythattheyaregeneratedfromthesamepromoterasthepre-capTIC.
ThesecondexplanationisthattheTICsupstreamofpre-miRNAmaybetranscribedenhancers
orpromoters,whichgeneratetranscriptsthatwillnotproducemiRNAs.Recentlyseveralstudies
haveshownthatpromoterandenhancerregionscanbetranscribedinhuman,mouseand
C.elegans
genomes[40,141,27].Thetranscriptioninpromotersandenhancersareusuallyassociatedwith
downstreamgenes.Itissuggestedin
C.elegans
thattheelongationfromanupstreamenhancer
towardadownstreamgenemayhavethepotentialtodeliverPolIItoaproximalpromoter,oral-
116
ternativelyfunctiondirectlyasadistalpromoter[27].Thus,theupstreamTICsmaybetranscribed
intheenhancerregionsandhavearegulatoryeffectonthedownstreammiRNAgenes(Figure5.4
(B)).
Figure5.4:UpstreamprimaryTICsalsoexistform
7
Gcappedpre-miRNAs.(A)Twoalternative
promotersforthesamemiRNA.Bothofthemareabletogeneratethetranscriptsthatproducethe
samematuremiRNA.(B)ThemiRNAtranscriptisgeneratedbythepre-capTIC.UpstreamTIC(s)
correspondtotranscribedenhancer(s).
5.3.45p-miRNAsareproducedfromthem
7
Gcappedpre-miRNAs
Them
7
Gcappedpre-miRNAshavebeenreportedtoproducesingle3pmaturemiRNAsinmice
[161].Themainexplanationisthatthecapped5p-miRNAisnotefloadedontoAgocom-
plex.However,wenoticedthatsome5'-cappedpre-miRNAsin
C.elegans
alsogenerate
117
5pmaturemiRNAsbasedontheannotationsinmiRBase(TableS1).Whilethereisapossibility
thatsomeoftheCap-seqreadsmappedtopre-miRNAsmaybeuncappedduetocontamination
ortechnicalartifacts,mostofthesem
7
Gcappedpre-miRNAsarelyenriched
withcappedRNAreadsaccordingtoourpreviousanalysis.Wesurmisethattheremightbeal-
ternativepathwaysforproducing5p-miRNAsfromthose5'm
7
Gcappedpre-miRNAs.Similar
observationswerealsomadefor5'-cappedpre-miRNAsinmice:fourofthem(mmu-
mir-484,mmu-mir-1903,mmu-mir-344fandmmu-mir-344i)actuallyprefertogenerate5pmature
miRNAs[161].
The5pmaturemiRNAscouldbegeneratedbyalternativeprimaryTSSs.Asmanyofthese
5'-cappedpre-miRNAsalsohavecandidateprimaryTSSs,canonicalprimarymiRNA
transcriptsmaybegeneratedfromthemandproduce5p-miRNAs.Studieshaveshownthatmature
miRNAselectionfrom5'and3'strandsofthesameprecursorishighlyregulatedandvariesunder
differentcelltypes,developmentalstagesanddiseasestates[12,97].Capped5pmaturemiRNAs
maybeproducedunderconditionstopromoteitstargetgene'sexpression.
5.3.5MultipletranscriptioninitiationsitesformiRNAclustersin
C.elegans
miRNAsinaclusterareclosetoeachother,usuallycoexpressedandtranscribedasasinglepri-
miRNA[74,10].PreviouslyeachmiRNAclusterin
C.elegans
hasbeenannotatedwithone
primaryTSSusingCap-seq[49]orGRO-cap[67]datasets.Strikingly,thedatasetsweusedhere
haveshownthattheTICsformiRNAclustersin
C.elegans
canhaveabroaddistributionwith
multiplestrongpeaksacrossthewholecluster.WeprimaryTICsfor8clusters,ofwhich
7haveTICsinsideoftheclusters.Oneexampleofclustercel-mir-35-41isshowninFigure5.5.
Thesimilarphenomenonisalsoobservedinindividualpre-miRNAs,asshowninFigure5.3,the
Cap-seqsignalwithincel-mir-51andcel-mir-53alsodisplaysmultiplestrongpeaks.Wenoticed
118
thatthesecappedreadspeaksinsideofclusterswerelocatedonbotharmsofpre-miRNAs,with
5p-peakshavethesame5'endsof5p-miRNAsand3p-peaksstartfromthe3'endsof3p-miRNAs
(Figure5.5).Asanalyzedabove,notmanypre-miRNAsormaturemiRNAswerekeptintheCap-
seqexperimentsandmanyofthereadsmappedarelikelytobecappedRNAs.Thisisfurther
supportedbythecoordinatedifferencesbetweenthecappedpeaksandthemiRNA/miRNA*on3p
arms.Weproposedthreehypothesestoexplainthisphenomenon.
Figure5.5:CappedTICsdistributioninmiRNAclustercel-mir-35-41.Multiplestrongcapped
peakshavebeenobservedinpre-miRNAsinthecluster.Asillustratedbythereddashedline,the
Cap-seqpeaksonthe5parmhavethesamestartpositionasthe5p-miRNAs,whilethepeakson
the3parmstartfromtheendofthe3p-miRNAs.ThelengthofCap-seqreadsis36nt.
5.3.5.15're-capafterpost-transcriptionalprocessing
Ithasbeenreportedthatmaturelongtranscriptsofbothprotein-codingmRNAsandlongncR-
NAsinhumancellscanbeprocessedpost-transcriptionallytoyieldsmallRNAs,whicharethen
119
bytheadditionofa5'-capstructure[44].Lateron,acytoplasmiccappingenzyme,
whichisabletoadd5'-captotheendsofcleavedRNAswasinmurineerythroidand
nonerythroidcells[110].Thiscytoplasmiccappingenzyme,togetherwithakinase,cantrans-
fercovalentlyboundGMPontoa5'-monophosphateRNAtocreatea5'-GpppXRNA,butitcan
notfunctiononRNAswith5'-hydroxylends[137].Thisphenomenonisknownascytoplasmic
capping,whichhasbeenfoundinbothmurineandhumancells[110,103,64].
Thewellcoordinatedcappedreadsat5pand3parmsofpre-miRNAhairpinsindicatethatthe
exposed5'endsofthemiRNAtranscriptsafterDroshacleavagemaybere-capped:thatiswhythe
cappedreadswereobservedatthe5'endsof5pmiRNAsandthe3'endsof3pmiRNAs.Certainly,
inthecanonicalmiRNAbiogenesis,pre-miRNAsareproducedfrompri-miRNAsinthenucleusby
Drosha.Therefore,the5'monophosphateendsatthecleavagesitesmaybere-cappedbynuclear
cappingenzymeinthenucleus.However,pri-miRNAsin
C.elegans
areexportedtothecytoplasm
byXPO1andbeprocessedtopre-miRNAseitherinthenuclearporeorinthecytoplasm[19].
Consideringthediscoveryofcytoplasmiccappingenzyme,thepre-miRNAre-cappingismore
likelytohappeninthecytoplasm.
Theprocessingofpri-miRNAsinthecytoplasmrequirescytoplasmicmiRNAprocessors.It
hasbeenshownthatcytoplasmicRNAviruseswhichencodemiRNAswereabletoproducefunc-
tionalmiRNAsinthecytoplasmofBHK-21cells[125].Theprocessingofthesevirus-generated
cytoplasmicpri-miRNAsalsoreliesonDroshabuttakesplaceinthecytoplasm[139].Basedon
thisdiscovery,althoughwithoutdirectexperimentalvthereisapossibilitythatthesim-
ilarcytoplasmicmiRNAprocessorsinvolvingDroshaalsoexistin
C.elegans
cells.Allthese
ingssupportanewmodelofmiRNAbiogenesisin
C.elegans
inwhichpri-miRNAsareexported
tothecytoplasmbyXPO1,wheretheyarecleavedbyDroshaandfurtherprocessed.Transcripts
ofmiRNAclustersmayundergocytoplasmicre-cappingduringthecleavageprocess(Figure5.6).
120
Figure5.6:ModelformiRNAcytoplasmicre-capping.Inthisnon-canonicalmiRNApathway,
pri-miRNAsareexportedtocytoplasmbyXPO1andprocessedthere.Duringthepre-miRNA
generatingprocess,m
7
G-capsareaddedtothe5'endsofnewlygeneratedpre-miRNAandpri-
miRNAleftoverbycytoplasmiccappingenzyme.
ThecytoplasmicrecappingofclustermiRNAscouldproducecapped5p-miRNAswhichmay
notbeefloadedonAgo.Indeed,fromtheannotationsinmiRBase,mostoftheclusters
withcappedTICsinsidefavorablygenerate3pmaturemiRNAs.However,theclustersmir-41-
44andmir-86-8211doprefertogenerate5pmaturemiRNAs.Heresimilartothe5'-capped
independentpre-miRNAs,therecappingmaybeacontrolledprocesswhichonlyoccuratcertain
celltypes,developmentalstages,ordiseasestates.Forexample,withthecappedshortRNAs
datafrom
C.elegans
youngadultstage[49],wedidnotobservecappedreadspeaksinsideofthe
miRNAclusters.TherecappingcouldservetoregulatethestrandselectiononthesemiRNAs,
suppressingthe5p-miRNAsandpromotingtheexpressionoftheirtargets.
5.3.5.2MultipleTSSscanbegeneratedfromthesamepromoter
PreviousstudieshaveshownthatmostcorepromotersdonothaveasingleTSS,butmultiplestart
sitesthatarecloselylocated[135,46].ThebroadTSSpromotersin
C.elegans
areoftenenriched
forCpGisland,whilesharpTSSpromotersoftenhaveTATA-box.Forsomepromoters,although
121
TSSsaredistributedoveralargeregion,mosttranscriptioninitiatesatonenucleotide
position[135].Therefore,multiplecappedpeaksinindividualpre-miRNAsormiRNAclusters
maybegeneratedfromthebroadpromoterasmultipleTSSs.Toevidencesupportingthis,we
scannedtheproximalpromotersofthesemiRNAgenesformotifsofTATA-box,Inr,DPEandBRE
withpositionweightmatricesderivedfromdatabaseJASPAR[134].GCcontentandCpGnumber
arealsosearchedinthepromoterregions(Table5.1).Weobservethatthepromotersofthese
miRNAclustersandindividual5'-cappedmiRNAsareusuallyGCrich:theGCcontentoneach
chromosomein
C.elegans
isalmostthesame,withavalueof36%

1%.ButtheGCcontentis
usuallyabove40%forpromotersofindividualpre-miRNAs,andabove50%formiRNAclusters
(Table5.1).Inaddition,theretendstobeapositivecorrelationbetweentheGCcontent/CpG
numberandthenumberofcappedpeaks:higherGCcontent/moreCpGresultinmorestrong
cappedpeaks.
Table5.1:GCcontent,OEratioandCpGnumberformiRNApromoterswithmultipleTSSs.(A)
miRNAclusters.(B)individualmiRNAs.
Promoters
Location
GCcontent
OEratio
CPGnumber
(A)ClustermiRNAs
II:11537326-11537576
0.55
1.18
18
II:11889425-11889675
0.56
1.09
13
III:11937020-11937270
0.52
0.93
9
III:2172175-2172425
0.55
1.0
15
X:2368553-2368803
0.51
0.76
9
X:13145204-13145454
0.6
1.34
24
(B)IndividualmiRNAs
cel-mir-244I:4684364-4684614
0.4
0.82
8
cel-mir-235I:6162337-6162587
0.41
0.76
8
cel-mir-238III:8867375-8867625
0.4
0.66
6
cel-mir-228IV:5561825-5562075
0.43
1.18
12
cel-mir-51IV:11026062-11026312
0.51
1.09
17
cel-mir-53IV:11027641-11027891
0.45
1.38
16
cel-mir-49X:9989082-9989332
0.51
0.75
12
122
ThesecappedreadspeaksarecorrelatedtothematuremiRNAsequences(Figure5.3,5.5).
ThereisalowprobabilitythatmultiplepeakswithinmiRNAclusteraresimplygeneratedran-
domly.Ithasbeensuggestedthattranscriptionstartusageofeachnucleotidecanbepredicted
fromlocalDNAsequence[46].Therefore,itislikelythatthepositionsoftheseTSSsaremainly
determinedbynucleotidesequence.
5.3.5.3Pre-miRNAsinaclustercanbetranscribedindependently
Pre-miRNAsfromthesamegenomicclustercanbetranscribedandregulatedindependently[157].
Forexample,althoughtheprimarytranscriptsofmousemir-433andmir-127weredetectedas
overlappingina5'-3'unidirectionalway,experimentshavevedthattheyweretranscribed
independentlyfromeachother[143].Accordingtothismodel,thepre-miRNAsinthesamecluster
maytranscribeindependently,producingmultiple5'm
7
GcappedRNApeakslocatedinsideofthe
cluster.ThecappedRNApeaksatthe5parmsindicatethatthe5'endofthepre-miRNAmaybe
determinedbytranscriptioninitiation[161],whilethecappedpeaksatthe3parmsmaycorrespond
totheTSSsforthesubsequentpre-miRNAs.Wenoticedthatmanypre-miRNAsintheclusterhave
boththecappedRNApeaksonitsown5'endandthe3parmofupstreampre-miRNA.Wehave
speculatedthatpre-miRNAsmightbeabletobegeneratedinthecanonicalwaywithDroshaor
inthenon-canonicalwaybytranscriptioninitiationandtermination,sohereagain,thesamepre-
miRNAmaybegeneratedbydifferentmechanisms:onewithDroshaandtheotherwithout.
123
5.3.6ChromatinandPolIIprofprimarymiRNApromoters
5.3.6.1ofdivergentandmultipleTSSspromoters
Tocharacterizethesepri-miRNATSSs,weanalysedthedistributionofchromatinand
PolIIfeaturessurroundingthem.Weobservedthatpromotersin
C.elegans
oftengeneratediver-
gentormultipletranscriptswiththeCap-seqdatasetsused.Toidentifythesepromoters,we
thedivergent/bidirectionalandbroadpromoters(promoterswithmultipleTSSs)fromthosetran-
scriptioninitiationclusters(TICs).Ifthedistancebetweentwoadjacentplusstrandandminus
strandTICsislessthanorequalto300bp,theyarecombinedasfromthesamedivergentpro-
moter.TICsonthesamestrandareclusteredtogetherifthedistancebetweentwoadjacentTICs
iswithin500bp.TheseclusteredTICsonthesamestrandwillthebroadpromoters(see
detailsinMaterialsandMethods).Inthewholegenomeof
C.elegans
,wedetected11,272pro-
moters,ofwhich6,149aredivergentpromotersand2,359arebroadpromoters(Figure5.7(A)).
Themostupstream5'endsofbothplusandminusstrandTICswereusedtorepresenttheTSSsof
thepromoter.
Figure5.7:(A)Promotersdistributionin
C.elegans
.(B)Detectingbidirectionalandmultiple
transcriptionpromotersin
C.elegans
.
Withallthepromoters,wefocusedonthosewitheithermultipletranscripts(atleast
3TSSs)orbidirectionality.Thatis,thebroadpromotersarenon-bidirectionalandbidirectional
124
promotersdonotcontainmultipleTSSs.WealignedtheTSSsofthesepromoters,andplotted
H3K4me3andPolIIsignalsurroundingthem(Figure5.8(A,B)).Theresultshowsthat
H3K4me3signalismuchstrongerinthedownstreamofbroadpromotersthanbidirectionalpro-
moters,indicatingthatH3K4me3isstronglycorrelatedwithtranscriptioninitiations.Interestingly,
weobservethatPolIIsignalinthedownstreamofbroadpromotersisalsoslightlystrongerthan
intheupstream,whichwouldnotbeobservedifweuseallpromoterswithmultipleTSSs(in-
cludingbidirectionalpromoters(FigureS2(D)).ThehigherlevelofdownstreamPolIIsignalmay
becausedbythePolIIpausingatthetranscriptioninitiationsites.IthasbeenshownthatPolII
promoter-proximalpausingisrarein
C.elegans
[67],whichmayexplainwhythePolIIsignal
differencebetweenupstreamanddownstreamisnotbig.
5.3.6.2ChromatinandPolIIprsurroundingmiRNApromoters
WethenassignedthesepromoterstotheintergenicmiRNAsin
C.elegans
.Foreach
ofthem,theregionbetweenits5'endandthemostcloseupstreamgene's3'endis
searchedforpromoters,andthemostclosepromoterisassignedtothemiRNAasitsprimary
promoter.Again,wealignedthesemiRNAswiththeirprimaryTSSsacquiredfromtheassigned
promotersandplottedtheH3K4me3andPolIIsignalsurroudingtheTSSs.Theresultsareshown
inFigure5.8(C,D).H3K4me3signalsurroundingmiRNAbroadpromoterspeaksatthe
TSSswhilethesignalsurroundingbidirectionalpromoterspeaksattheupstream600bpposition,
indicatingthattheremaybemuchstrongerminustranscriptsforthesebidirectionalpromoters.
ForPolIIsignal,theintensitiesaresimilarintheupstreamofbothmiRNAbroadpromotersand
bidirectionalpromoters.However,againthedownstreamofbroadpromotershasastrongerPolII
signal,suggestingPolIIisalsopausedintheproximalpromotersofmiRNAgenes.
125
Figure5.8:(A,B):H3K4me3andPolIIsignalsurroundingbidirectionalandbroadpro-
moters.(C,D):H3K4me3andPolIIsignalsurroundingmiRNAbidirectionalandbroad
promoters.
5.4Discussion
Inthisstudy,weusedCap-seqdatasetstoannotatetheprimaryTSSsforintergenicmiRNAgenesto
1baseresolutionin
C.elegans
andmouse.Intotal,weannotatedtheprimaryTSSsfor70miRNAs
and8miRNAclustersin
C.elegans
,and37miRNAsonmousechromosome7.Comparingwith
previousworkusingcappedRNA-seqmethods[49,67],wehaveannotatedtheTSSsformany
moremiRNAsinbothspecies.WenoticedthattheCap-seqdatasetsweusedareeithergenerated
frommixed-stageembryos(
C.elegans
)ormixedtissues(newbornmouse),whiletheworkswe
comparedwithonlyutilizedfewstageddatasetsin
C.elegans
[49,67]orsingletissuedataset
inmouse[49].Therefore,thelimitedtissuecontextmayexplainwhyfewermiRNAgeneswere
126
expressedanddetectedintheirstudies.
Similartothepreviousstudy[161],whichdetectedaspecialclassofpre-miRNAsthatare5'
m
7
Gcappedinmice,herewealsothistypeofpre-miRNAsin
C.elegans
.5'-capped
pre-miRNAsinmiceprefertobeexportedtothecytoplasmbyXPO1insteadofXPO5[161].
Comparingtomammals,XPO5oritshomologueisnotencodedin
C.elegans
butXPO1is[104].
Therefore,manymore5'-cappedpre-miRNAswereexpectedtobein
C.elegans
.But
usingtheCap-seqdatasetsofmixed-stageembrosof
C.elegans
,wehaveonlydetected9candi-
date5'-cappedpre-miRNAs.TheXPO1andcap-bindingcomplex(CBC)havebeenreportedtoact
jointlytoexportpri-miRNAsin
C.elegans
andDrosophila,buthowthesubsequentpre-miRNAs
areprocessedfromthesepri-miRNAsinthecytoplasmremainsunclear[19].Accordingly,both
5'-cappedpre-miRNAsandnormalpri-miRNAscanbeexportedtothecytoplasmandtheirsub-
sequentprocessingmaybedifferentfromthatinthecanonicalmiRNAbiogenesispathway.Since
XPO1doesnotexport5'-cappedpre-miRNAsin
C.elegans
,itsexistencedoesnot
necessarilyresultinmore5'-cappedpre-miRNAs.Thus,itisnotoddthatonlyasmallnumberof
theseunusualpre-miRNAsareobservedin
C.elegans
.
IthasbeensuggestedthatXPO1-dependentm
7
Gcappedpre-miRNAsmayrepresentagroupof
ancientmiRNAsthatappearedbeforetheemergenceofXPO5[161].Therefore,thesem
7
Gcapped
pre-miRNAsmaybewellconservedindifferentlineages.Wecheckedtheconservationofthose
5'-cappedpre-miRNAsin
C.elegans
andmousefrommiRviewer[62].Surprisingly,
almostallofthesem
7
Gcappedpre-miRNAsarelineagetheyaredetected
eitheronlyincaenorhhabditisormuroidea.
Inbothmouseand
C.elegans
,wehaveaclassofpre-miRNAsthatare5'm
7
G
cappedbutalsopossesupstreamcandidateprimaryTSSs.Ithasbeensuggestedthatmostgenes
inmammalsarenottranscribedfromasingleTSS,butmultipleTSSsthatarecloselylocated
127
over50to100nucleotides[21,135,46,36].Indeed,weobservedthatsomeupstreamTICsare
veryclosetothepre-miRNAs,suggestingthattheproximalcorepromoterofthesemiRNAgenes
maygeneratemultipletranscriptioninitiationsthatarecloselylocatedtoeachother.However,the
otherfurtherupstreamTICs(over500nt)areunlikelytobegeneratedfromthesamepromoteras
thepre-capTICs.Manygeneshavealternativepromotersandcangeneratemultipleisoformsin
differentcelltypes,tissuesordevelopmentalstages[69,7].SinceweusedtheCap-seqdatasets
ofmixedstagedembryosof
C.elegans
,thesemiRNAgenesmayutilizealternativepromotersat
differentstagestogeneratediversetranscriptswhicharesubjecttodistinctprocessingprocedures.
Thatis,these5'-cappedpre-miRNAsmaybeonlygeneratedincelltypes,developmental
stagesordiseasestates.ConsideringthattheseXPO1-dependent5'm
7
Gcappedpre-miRNAs
maybelongtoagroupofancientmiRNAs,itisreasonablethatatleastsomeofthemshouldbe
conservedalongdifferentlineagesduringevolution.However,wehaveshownthatalmostallof
these5'm
7
Gcappedpre-miRNAsarelineageThereisapossibilitythatthese
miRNAgenesarenewlygeneratedandacquiretheirabilityofproducing5'-cappedpre-miRNAs
laterintheevolution.Inextremecases,mostofthese5'-cappedpre-miRNAsmayhaveupstream
distalpromotersthatcanproducethenormalprimarytranscriptsattheothertime.Thesituation
wherem
7
Gcappedpre-miRNAsareexpressedmayduetothelackofDroshaorotherrelated
microprocessor.TheabilityofgeneratingthesamemiRNAswithorwithoutDroshamayhelpthe
organismtobetteradapttothecomplexandchangeableenvironment.
Manyofthe5'm
7
Gcappedpre-miRNAsin
C.elegans
preferentiallygenerate5p-
miRNAs,whichisdifferentfromthatinmiceaspreviouslyreported.Thesepre-miRNAsusually
alsohaveupstreamcandidateprimaryTSSs.Wesurmisedthatthese5'-cappedpre-miRNAsmay
producethe5p-miRNAsusingthealternativeupstreamprimaryTSSs.However,thisstillneeds
furthereffortstoclarify.
128
Multiplecappedpeakshavebeenobservedwithinpre-miRNAsinacluster,andthewellco-
ordinatedrelationshipbetweenthecappedpeaksandthematuremiRNAssuggestthatthe5'm
7
G
capmaybeaddedduringthepre-miRNAgeneratingprocess.WeproposedanewmodelofmiRNA
biogenesisinwhichthepri-miRNAsarecleavedandcappedinthecytoplasmsimultaneously(Fig-
ure5.6).WenoticedthatthecytoplasmiccappingisprominentinmiRNAclusters,likelybecause
theexposed5'endofmiRNAclustertranscriptsduringcleavageneedtobeprotected.Therefore,
them
7
Gcapisaddedandprotectsthecleavedtranscriptsfrombeingdegradedduringthecleavage
process.Then,thesubsequentpre-miRNAscanbegeneratedsuccessfully.
Thepre-miRNAsinclustersmaybetranscribedindependently.Togainmoreevidenceforthis
model,wealsolookedatthethesequencemotifsurroundingthemodesofpre-capTICswithin
clusteredpre-miRNAs.TheYRmotif,whichwasshownattheputativeprimarymiRNATSSsand
pre-capTSSsofindependentpre-miRNAs,isnotobserved(FigureS8).Instead,aweakconsensus
sequenceofﬁTNGG"isdetected,inwhichﬁN"locatesatthe+1positionrepresentingthemodes
oftheTICs.Therefore,atleastforsomemiRNAclusters,theindependenttranscriptionmodelis
notsupportedbytheYRmotif.
129
Chapter6
ConclusionandFuturework
6.1Conlusions
Characterizingthecompositionandfunctionsofmicrobialcommunitiesfromhost-relatedorenvi-
ronmentalsamplesusingmetagenomicsequencinghasbecomeafavorablemethod.Inthisstudy,
wefocusedoncharacterizingthevirusquasispeciesfromviralmetagenomicdata.Towardsthis
goal,wedevelopedapipelineconsistingofthreetoolsforassemblingviralstrainsinaquasispecies
andestimatinghaplotypenumberandcorrespondingabundances.ThethreetoolsareTAR-VIR,
PEHaplo,andVirBin.
TAR-VIRisthetoolforenrichingreadsoftargetedviruseswithremotehomologorpartial
referencesfrommetagenomicdata.Comparedtothestandardmethodthatreliesonreadmapping
toclassifyreadsofdifferentviruses,TAR-VIRhashighersensitivitytoidentifydivergentstrains
ofknownvirusfamilies,whichisakeyfunctionforfast-mutatingvirusessuchasHIV,HCVetc.It
appliedanexternaloverlapextensionsteptoiterativelyrecruitreadsfrominitiallyaligned"seed"
reads.Theoverlapdetectionbetweenlarge-scalemetagenomicsequencingreadswasimplemented
intheefBurrows-WheelerTransform(BWT)andFM-indexalgorithms.TAR-VIRcanbe
usedasapreliminarysteptoisolateandenrichreadsfrommetagenomicdatabeforeassemblyeven
withouthigh-qualityreferences.
PEHaploisthe
denovo
assemblytoolsforreconstructingviralhaplotypesfromvirusqua-
sispeciessequencingdata.Ittakesadvantageofthelonginsertsizeofpaired-endsequencingto
130
differentiatehaplotypessharinglongcommonregions.ThemethodsofPEHaploarebasedon
overlapgraph,whichreconstructhaplotypesbythepathswell-supportedbypaired-end
reads.Whilepaired-endinformationhasbeenusedbymanyotherassemblytools,PEHaplodida
thoroughanalysisoftherelationshipbetweenLCSandtheinsertsize,anddiveddeeplyintothe
utilizationofpaired-endreadsforremovingfalseedgesandndingcorrectpaths.Thebenchmark
resultswithseveralrecentlypublishedviralquasispeciesassemblytoolsdemonstratethatPEHaplo
isabletoproduceaccurateandmuchlongercontigs.
TheoutputofPEHaploarecontigs,whicharepartialsequencesforviralgenomes.Tobetter
characterizeaviralquasispecies,weneedtoknowthenumberofviralhaplotypesandtheircorre-
spondingabundances.VirBinisthebinningtoolthattakesassembledcontigsasinput,estimates
thenumberofhaplotypes,clusterscontigsintogroups,andcalculatestheabundancesforeachhap-
lotype.Thebinningproblemforviralquasispeciesisverydifferenttothecanonicalspecies-level
binningproblemsbecausecommonlyadoptedfeaturessuchask-merfrequencyarenotabletodis-
tinguishdifferentstrains.VirBintakesadvantageofthehighsequencesimilaritiesbetweenviral
strainsinaquasispecies,estimatesthenumberofhaplotypesbyaligningcontigstoeachother.The
relativeabundancesforeachcontigcanalsobemoreaccuratelycalculatedfromthecontigalign-
mentresults.Basedonthecontigabundances,VirBinappliesanEMalgorithmtoclustercontigs
intodifferentgroups.ThebenchmarkresultswithtoolMaxBin[160]haveshownitssuperiority
onbothsimulatedorrealviralquasispeciesdata.
AtthebeginningofmyPh.D.study,Ialsoworkedonaprojectforstudyingthetranscriptional
regulationofmiRNAgenesinC.elegansandmice.Inthisproject,weutilizedCap-seqdatato
identifytheprimarytranscriptionalstartsitesformiRNAgenes,characterizedaspecialgroupof
miRNAswhoseprecursorsare5'-cappedandmaybedirectlygeneratedbytranscription.
131
6.2Futurework
6.2.1Reference-freeviralsequences
WhileTAR-VIRisabletorecruitviralreadswithpartialorremotelyrelatedhomologousrefer-
ences,themethodisnotapplicabletonewviruseswithoutanyavailablereferences.Therefore,
directlyclassifyingviralreadsfrommetagenomicdatawithoutreferencesmaystillbeneeded.For
directlyclassifyingviralreads,afundamentalquestiontoaskiswhetherviralsequencescanbe
differentiatedfromotherbacterialoreukaryoticsequences.However,thecanonicalsequenceanal-
ysis(suchask-merfrequencies)resultsonNGSreadsshowedthatviralreadsarehighlydiverse
andarediftobedifferentiatedfromothersequences.
Thedeeplearningmethods,withmultiplelayersofneuralnetworksandnumerous
trainableparameters,arepowerfulforTheydonotrequireelaboratelydesigned
sequencefeaturesasinputandcantakesimplyencodedrawsequencesasinputfor
Whetherwecantakeadvantageofdeeplearningmodelsfordirectlyclassifyingviralreadsfrom
metagenomicdatasetsisworthexploring.
6.2.2Virusquasispeciesassemblyfuturework
Therearestillsomecomputationalchallengesremaininginvirusquasispeciesassembly.
Assemblyoflowabundancehaplotypes
Oneofthechallengesisthatrarehaplotypesare
diftoassemblefromthedata.Whenthesequencingcoverageislowforaviralstrain,there
maynotbeenoughoverlapsbetweenreads,resultinginadisjoinedoverlapgraphwithmultiple
smallconnectedsubgraphs.Inaddition,evenifsequencingdepthsforrarehaplotypesaredeep
enough,theedgesbetweensharednodesandrarehaplotypenodesmaybesuppressedbytheedges
betweensharednodesandabundanthaplotypenodes,resultinginfragmentedcontigs.
132
LCSislongerthanpaired-endinsertsize
AnotherchallengeisthesituationwhenLCSbe-
tweentwohaplotypesismuchlongerthantheaveragepaired-endinsertsize.Inthiscase,few
readpairscangoacrossthecommonnodes,thuscorrectpathsfromthegraphisdif
Theonlyavailablefeatureisthereadscoverage.Whenthecoveragesforthesetwohaplotypes
aredifthealgorithmisabletochoosethesuccessorwithsimilarcoverageto
thecurrentpathforappending.However,ifthesequencingcoveragesofthetwohaplotypesare
similar,thepathextendingishard.WithshortreadsfromNGS,oneofthepossiblesolutionsis
toincreasetheinsertsize.Alternatively,wemaytakeadvantageofthelongreadsfromthird-
generationsequencing.
6.2.3Binninglow-qualitycontigs
Tocorrectlyestimatethenumberofhaplotypesandcalculatetherelativeabundancesforcon-
tigs,thecontigsneedtobealigned,andwindowsheavilydependsonthealignment.
However,whentherearechimericcontigsorcontigswitherrors,eitherindels/mismatches,the
alignmentmaynotbeaccurateandeventuallyresultinaninaccurateestimationofhaplotypenum-
berorclustering.Howtodevelop/improvethebinningalgorithmstorobustlyhandlelow-quality
contigsisanotherproblemtobefurtherinvestigated.
133
BIBLIOGRAPHY
134
BIBLIOGRAPHY
[1]
T.Abeel,T.VanParys,Y.Saeys,J.Galagan,andY.VandePeer.GenomeView:anext-
generationgenomebrowser.
Nucl.AcidsRes.
,40(2):e12,2012.
[2]
A.Allam,P.Kalnis,andV.Solovyev.Karect:accuratecorrectionofsubstitution,insertion
anddeletionerrorsfornext-generationsequencingdata.
Bioinformatics
,pagebtv415,2015.
[3]
R.AndinoandE.Domingo.Viralquasispecies.
Virology
,479:46Œ51,2015.
[4]
I.Astrovskaya,B.Tork,S.Mangul,K.Westbrooks,I.M

andoiu,P.Balfe,andA.Zelikovsky.
Inferringviralquasispeciesspectrafrom454pyrosequencingreads.
BMCbioinformatics
,
12(6):S1,2011.
[5]
J.Baaijens,A.Z.ElAabidine,E.Rivals,andA.Schoenhuth.Denovoassemblyofviral
quasispeciesusingoverlapgraphs.
bioRxiv
,page080341,2016.
[6]
J.A.Baaijens,A.Z.ElAabidine,E.Rivals,andA.Schönhuth.Denovoassemblyofviral
quasispeciesusingoverlapgraphs.
Genomeresearch
,27(5):835Œ848,2017.
[7]
D.Baek,C.Davis,B.Ewing,D.Gordon,andP.Green.Characterizationandpredictive
discoveryofevolutionarilyconservedmammalianalternativepromoters.
GenomeRes.
,
17(2):145Œ155,2007.
[8]
A.Bankevich,S.Nurk,D.Antipov,A.A.Gurevich,M.Dvorkin,A.S.Kulikov,V.M.
Lesin,S.I.Nikolenko,S.Pham,A.D.Prjibelski,etal.SPAdes:anewgenomeassembly
algorithmanditsapplicationstosingle-cellsequencing.
Journalofcomputationalbiology
,
19(5):455Œ477,2012.
[9]
V.C.Barbosa,R.Donangelo,andS.R.Souza.Quasispeciesdynamicswithnetworkcon-
straints.
Journaloftheoreticalbiology
,312:114Œ119,2012.
[10]
E.Berezikov.EvolutionofmicroRNAdiversityandregulationinanimals.
NatureRev.
Genet.
,12(12):846Œ860,Dec.2011.
[11]
E.Berezikov,W.-J.Chung,J.Willis,E.Cuppen,andE.C.Lai.Mammalianmirtrongenes.
Mol.Cell
,28(2):328Œ336,2007.
[12]
M.Biasiolo,G.Sales,M.Lionetti,L.Agnelli,K.Todoerti,A.Bisognin,A.Coppe,C.Ro-
mualdi,A.Neri,andS.Bortoluzzi.ImpactofhostgenesandstrandselectiononmiRNA
andmiRNA*expression.
PLoSONE
,6(8):1Œ11,2011.
[13]
C.K.BiebricherandM.Eigen.Theerrorthreshold.
Virusresearch
,107(2):117Œ127,2005.
135
[14]
S.Boisvert,F.Raymond,É.Godzaridis,F.Laviolette,andJ.Corbeil.Raymeta:scalable
denovometagenomeassemblyand
Genomebiology
,13(12):R122,2012.
[15]
A.M.Bolger,M.Lohse,andB.Usadel.Trimmomatic:axibletrimmerforIllumina
sequencedata.
Bioinformatics
,30(15):2114Œ2120,2014.
[16]
G.M.Borchert,W.Lanier,andB.L.Davidson.RNApolymeraseIIItranscribeshuman
microRNAs.
NatureStruct.Mol.Biol.
,13(12):1097Œ1101,2006.
[17]
J.Bracht,S.Hunter,R.Eachus,P.Weeks,andA.E.Pasquinelli.Trans-splicingand
polyadenylationoflet-7microRNAprimarytranscripts.
RNA
,10(10):1586Œ1594,2004.
[18]
M.BurrowsandD.J.Wheeler.Ablock-sortinglosslessdatacompressionalgorithm.1994.
[19]
I.Büssing,S.YangJr,E.C.Lai,andH.Großhans.ThenuclearexportreceptorXPO-1
supportsprimarymiRNAprocessinginC.elegansandDrosophila.
EMBOJ.
,29(11):1830Œ
1839,2010.
[20]
C.Camacho,G.Coulouris,V.Avagyan,N.Ma,J.Papadopoulos,K.Bealer,andT.L.Mad-
den.BLAST+:architectureandapplications.
BMCbioinformatics
,10(1):421,2009.
[21]
P.Carninci,A.Sandelin,B.Lenhard,S.Katayama,K.Shimokawa,J.Ponjavic,C.A.Sem-
ple,M.S.Taylor,P.G.Engström,M.C.Frith,etal.Genome-wideanalysisofmammalian
promoterarchitectureandevolution.
NatureGenet.
,38(6):626Œ635,2006.
[22]
M.J.Chaisson,D.Brinza,andP.A.Pevzner.Denovofragmentassemblywithshortmate-
pairedreads:Doesthereadlengthmatter?
Genomeresearch
,19(2):336Œ346,2009.
[23]
M.CHAN-YEUNGandR.-H.XU.SARS:epidemiology.
Respirology
,8(s1),2003.
[24]
T.-C.Chang,M.Pertea,S.Lee,S.L.Salzberg,andJ.T.Mendell.Genome-wideannotation
ofmicrornaprimarytranscriptstructuresrevealsnovelregulatorymechanisms.
Genome
research
,25(9):1401Œ1409,2015.
[25]
J.Chen,Z.Dai,C.Cao,Q.Zhang,H.Liu,andX.Sun.Next-generationsequencingdata
processing:Analysisofunmappedreadsandextremelyhighmappedpeaks.In
Biomedical
EngineeringandInformatics(BMEI),20125thInternationalConferenceon
,pages893Œ
897.IEEE,2012.
[26]
J.Chen,Y.Zhao,andY.Sun.Denovohaplotypereconstructioninviralquasispeciesusing
paired-endreadguidedpath
Bioinformatics
,1:9,2018.
[27]
R.A.-J.Chen,T.A.Down,P.Stempor,Q.B.Chen,T.A.Egelhofer,L.W.Hillier,T.E.
Jeffers,andJ.Ahringer.ThelandscapeofRNApolymeraseIItranscriptioninitiationin
C.elegansrevealspromoterandenhancerarchitectures.
GenomeRes.
,23(8):1339Œ1347,
136
2013.
[28]
T.P.Chendrimada,R.I.Gregory,E.Kumaraswamy,J.Norman,N.Cooch,K.Nishikura,
andR.Shiekhattar.TRBPrecruitstheDicercomplextoAgo2formicroRNAprocessing
andgenesilencing.
Nature
,436(7051):740Œ744,2005.
[29]
C.-H.Chien,Y.-M.Sun,W.-C.Chang,P.-Y.Chiang-Hsieh,T.-Y.Lee,W.-C.Tsai,J.-T.
Horng,A.-P.Tsou,andH.-D.Huang.Identifyingtranscriptionalstartsitesofhumanmi-
croRNAsbasedonhigh-throughputsequencingdata.
Nucl.AcidsRes.
,39(21):9345Œ9356,
2011.
[30]
F.S.CollinsandH.Varmus.Anewinitiativeonprecisionmedicine.
NewEnglandJournal
ofMedicine
,372(9):793Œ795,2015.
[31]
D.L.Corcoran,K.V.Pandit,B.Gordon,A.Bhattacharjee,N.Kaminski,andP.V.Benos.
FeaturesofmammalianmicroRNApromotersemergefrompolymeraseIIchromatinim-
munoprecipitationdata.
PLoSONE
,4(4):e5279,2009.
[32]
C.CoronnelloandP.V.Benos.Comir:combinatorialmicrornatargetpredictiontool.
Nucl.
AcidsRes.
,41(W1):W159ŒW164,2013.
[33]
A.K.Corsi.Abiochemist'sguidetoCaenorhabditiselegans.
Anal.Biochem.
,359(1):1Œ17,
2006.
[34]
G.E.Crooks,G.Hon,J.-M.Chandonia,andS.E.Brenner.WebLogo:asequencelogo
generator.
GenomeRes.
,14(6):1188Œ1190,2004.
[35]
L.Dai,K.Chen,B.Youngren,J.Kulina,A.Yang,Z.Guo,J.Li,P.Yu,andS.Gu.Cy-
toplasmicdroshaactivitygeneratedbyalternativesplicing.
NucleicAcidsResearch
,page
gkw668,2016.
[36]
Y.M.Danino,D.Even,D.Ideses,andT.Juven-Gershon.Thecorepromoter:attheheart
ofgeneexpression.
Biochim.Biophys.Acta
,1849(8):1116Œ1131,2015.
[37]
R.V.Davuluri,Y.Suzuki,S.Sugano,C.Plass,andT.H.-M.Huang.Thefunctionalconse-
quencesofalternativepromoteruseinmammaliangenomes.
TrendsGenet.
,24(4):167Œ177,
2008.
[38]
M.deHoonandY.Hayashizaki.Deepcapanalysisgeneexpression(CAGE):genome-
wideofpromoters,oftheirexpression,andnetworkinference.
Biotechniques
,44(5):627,2008.
[39]
F.DiGiallonardo,A.Töpfer,M.Rey,S.Prabhakaran,Y.Duport,C.Leemann,S.Schmutz,
N.K.Campbell,B.Joos,M.R.Lecca,etal.Full-lengthhaplotypereconstructiontoinfer
thestructureofheterogeneousviruspopulations.
Nucleicacidsresearch
,42(14):e115Œe115,
137
2014.
[40]
S.Djebali,C.A.Davis,A.Merkel,A.Dobin,T.Lassmann,A.Mortazavi,A.Tanzer,J.La-
garde,W.Lin,F.Schlesinger,etal.Landscapeoftranscriptioninhumancells.
Nature
,
489(7414):101Œ108,2012.
[41]
E.Domingo,J.Sheldon,andC.Perales.Viralquasispeciesevolution.
Microbiologyand
MolecularBiologyReviews
,76(2):159Œ216,2012.
[42]
M.Eigen.Errorcatastropheandantiviralstrategy.
ProceedingsoftheNationalAcademyof
Sciences
,99(21):13374Œ13376,2002.
[43]
M.Fedurco,A.Romieu,S.Williams,I.Lawrence,andG.Turcatti.Bta,anovelreagent
fordnaattachmentonglassandefgenerationofsolid-phasednacolonies.
Nucleicacidsresearch
,34(3):e22Œe22,2006.
[44]
K.Fejes-Toth,V.Sotirova,R.Sachidanandam,G.Assaf,G.J.Hannon,P.Kapranov,S.Fois-
sac,A.T.Willingham,R.Duttagupta,E.Dumais,etal.Post-transcriptionalprocessing
generatesadiversityoflongandshortRNAs.
Nature
,457(7232):1028Œ1032,
2009.
[45]
K.A.Frazer,L.Pachter,A.Poliakov,E.M.Rubin,andI.Dubchak.VISTA:computational
toolsforcomparativegenomics.
Nucleicacidsresearch
,32(suppl_2):W273ŒW279,2004.
[46]
M.C.Frith,E.Valen,A.Krogh,Y.Hayashizaki,P.Carninci,andA.Sandelin.Acodefor
transcriptioninitiationinmammaliangenomes.
GenomeRes.
,18(1):1Œ12,2008.
[47]
G.GonnellaandS.Kurtz.Readjoiner:afastandmemoryefstringgraph-based
sequenceassembler.
BMCbioinformatics
,13(1):82,2012.
[48]
S.GrifR.J.Grocock,S.VanDongen,A.Bateman,andA.J.Enright.miRBase:
microRNAsequences,targetsandgenenomenclature.
Nucl.AcidsRes.
,34(suppl1):D140Œ
D144,2006.
[49]
W.Gu,H.-C.Lee,D.Chaves,E.M.Youngman,G.J.Pazour,D.ConteJr,andC.C.Mello.
CapSeqandCIP-TAPIdentifyPolIIStartSitesandRevealCappedSmallRNAsasC.
eleganspiRNAPrecursors.
Cell
,151(7):1488Œ1500,2012.
[50]
D.eld.
Algorithmsonstrings,treesandsequences:computerscienceandcomputa-
tionalbiology
.Cambridgeuniversitypress,1997.
[51]
M.HajRachidandQ.Malluhi.Apracticalandscalabletooltooverlapsbetween
sequences.
BioMedresearchinternational
,2015,2015.
[52]
J.J.Holland,E.Domingo,J.C.delaTorre,andD.A.Steinhauer.Mutationfrequenciesat
138
singlecodonsitesinvesicularstomatitisvirusandpolioviruscanbeincreasedonly
slightlybychemicalmutagenesis.
Journalofvirology
,64(8):3960,1990.
[53]
L.Z.Hong,S.Hong,H.T.Wong,P.P.Aw,Y.Cheng,A.Wilm,P.F.deSessions,S.G.Lim,
N.Nagarajan,M.L.Hibberd,etal.Base-seq:amethodforobtaininglongviralhaplotypes
fromshortsequencereads.
Genomebiology
,15(11):517,2014.
[54]
A.Huang,R.Kantor,A.DeLong,L.Schreier,andS.Istrail.Qcolors:Analgorithmforcon-
servativeviralquasispeciesreconstructionfromshortandnon-contiguousnextgeneration
sequencingreads.
Insilicobiology
,11(5,6):193Œ201,2011.
[55]
W.Huang,L.Li,J.R.Myers,andG.T.Marth.ART:anext-generationsequencingread
simulator.
Bioinformatics
,28(4):593Œ594,2011.
[56]
W.Huang,L.Li,J.R.Myers,andG.T.Marth.Art:anext-generationsequencingread
simulator.
Bioinformatics
,28(4):593Œ594,2012.
[57]
M.Hunt,A.Gall,S.H.Ong,J.Brener,B.Ferns,P.Goulder,E.Nastouli,J.A.Keane,P.Kel-
lam,andT.D.Otto.Iva:accuratedenovoassemblyofrnavirusgenomes.
Bioinformatics
,
31(14):2374Œ2376,2015.
[58]
J.L.JamesonandD.L.Longo.PrecisionmedicineŠpersonalized,problematic,and
promising.
Obstetrical&GynecologicalSurvey
,70(10):612Œ614,2015.
[59]
D.Jayasundara,I.Saeed,S.Maheswararajah,B.Chang,S.-L.Tang,andS.K.Halgamuge.
ViQuaS:animprovedreconstructionpipelineforviralquasispeciesspectrageneratedby
next-generationsequencing.
Bioinformatics
,31(6):886Œ896,2014.
[60]
J.KärkkäinenandP.Sanders.Simplelinearworksufarrayconstruction.In
International
ColloquiumonAutomata,Languages,andProgramming
,pages943Œ955.Springer,2003.
[61]
T.Kasai,G.Lee,H.Arimura,S.Arikawa,andK.Park.Linear-time
computationinsufarraysanditsapplications.In
AnnualSymposiumonCombinatorial
PatternMatching
,pages181Œ192.Springer,2001.
[62]
A.Kiezun,S.Artzi,S.Modai,N.Volk,O.Isakov,andN.Shomron.miRviewer:amulti-
speciesmicroRNAhomologousviewer.
BMCRes.Notes
,5(1):1Œ6,2012.
[63]
V.N.KimandJ.-W.Nam.GenomicsofmicroRNA.
TrendsGenet.
,22(3):165Œ173,Mar.
2006.
[64]
D.L.Kiss,K.Oman,R.Bundschuh,andD.R.Schoenberg.Uncapped5'endsofmRNAs
targetedbycytoplasmiccappingmaptothevicinityofdownstreamCAGEtags.
FEBSLett.
,
589(3):279Œ284,2015.
139
[65]
R.Kodzius,M.Kojima,H.Nishiyori,M.Nakamura,S.Fukuda,M.Tagami,D.Sasaki,
K.Imamura,C.Kai,M.Harbers,etal.CAGE:capanalysisofgeneexpression.
Nat.
Methods
,3(3):211Œ222,2006.
[66]
J.Krol,I.Loedige,andW.Filipowicz.ThewidespreadregulationofmicroRNAbiogenesis,
functionanddecay.
NatureRev.Genet.
,11(9):597Œ610,Sept.2010.
[67]
W.S.Kruesi,L.J.Core,C.T.Waters,J.T.Lis,andB.J.Meyer.Condensincontrolsre-
cruitmentofRNApolymeraseIItoachievenematodeX-chromosomedosagecompensation.
Elife
,2:e00808,2013.
[68]
A.Kuehbacher,C.Urbich,A.M.Zeiher,andS.Dimmeler.RoleofDicerandDroshafor
endothelialmicroRNAexpressionandangiogenesis.
Circ.Res.
,101(1):59Œ68,2007.
[69]
J.-R.Landry,D.L.Mager,andB.T.Wilhelm.Complexcontrols:theroleofalternative
promotersinmammaliangenomes.
TrendsGenet.
,19(11):640Œ648,2003.
[70]
B.LangmeadandS.L.Salzberg.Fastgapped-readalignmentwithbowtie2.
Naturemeth-
ods
,9(4):357Œ359,2012.
[71]
B.LangmeadandS.L.Salzberg.Fastgapped-readalignmentwithbowtie2.
Naturemeth-
ods
,9(4):357Œ359,2012.
[72]
B.Langmead,C.Trapnell,M.Pop,andS.L.Salzberg.Ultrafastandmemory-ef
alignmentofshortdnasequencestothehumangenome.
Genomebiology
,10(3):R25,2009.
[73]
J.Laserson,V.Jojic,andD.Koller.Genovo:denovoassemblyformetagenomes.
JComput
Biol
,18(3):429Œ443,2011.
[74]
N.C.Lau,L.P.Lim,E.G.Weinstein,andD.P.Bartel.AnabundantclassoftinyRNAswith
probableregulatoryrolesinCaenorhabditiselegans.
Science
,294(5543):858Œ862,2001.
[75]
A.S.LauringandR.Andino.Quasispeciestheoryandthebehaviorofrnaviruses.
PLoS
Pathog
,6(7):e1001005,2010.
[76]
Y.Lee,C.Ahn,J.Han,H.Choi,J.Kim,J.Yim,J.Lee,P.Provost,O.Rådmark,S.Kim,etal.
ThenuclearRNaseIIIDroshainitiatesmicroRNAprocessing.
Nature
,425(6956):415Œ419,
2003.
[77]
Y.Lee,M.Kim,J.Han,K.-H.Yeom,S.Lee,S.H.Baek,andV.N.Kim.MicroRNAgenes
aretranscribedbyRNApolymeraseII.
EMBOJ.
,23(20):4051Œ4060,2004.
[78]
J.LeiandY.Sun.miR-PREFeR:anaccurate,fastandeasy-to-useplantmiRNAprediction
toolusingsmallRNA-Seqdata.
Bioinformatics
,30(19):2837Œ2839,2014.
140
[79]
H.LiandR.Durbin.FastandaccurateshortreadalignmentwithBurrowsŒWheelertrans-
form.
bioinformatics
,25(14):1754Œ1760,2009.
[80]
L.Li,X.Deng,A.C.DaCosta,R.Bruhn,S.G.Deeks,andE.Delwart.Viromeanalysis
ofantiretroviral-treatedHIVpatientsshowsnocorrelationbetweenT-cellactivationand
anelloviruseslevels.
JournalofClinicalVirology
,72:106Œ113,2015.
[81]
M.Li,B.Wu,X.Yan,J.Luo,Y.Pan,F.-X.Wu,andJ.Wang.Pecc:correctingcontigsbased
onpaired-endreaddistribution.
Computationalbiologyandchemistry
,69:178Œ184,2017.
[82]
N.Li,X.You,T.Chen,S.D.Mackowiak,M.R.Friedländer,M.Weigt,H.Du,A.Gogol-
Döring,Z.Chang,C.Dieterich,etal.GlobalofmiRNAsandthehairpinpre-
cursors:insightsintomiRNAprocessingandnovelmiRNAdiscovery.
Nucl.AcidsRes.
,
41(6):3619Œ3634,2013.
[83]
R.Li,H.Zhu,J.Ruan,W.Qian,X.Fang,Z.Shi,Y.Li,S.Li,G.Shan,K.Kristiansen,
etal.Denovoassemblyofhumangenomeswithmassivelyparallelshortreadsequencing.
Genomeresearch
,20(2):265Œ272,2010.
[84]
Y.Li,H.Wang,K.Nie,C.Zhang,Y.Zhang,J.Wang,P.Niu,andX.Ma.VIP:anintegrated
pipelineformetagenomicsofvirusanddiscovery.
reports
,6,2016.
[85]
E.S.Lim,Y.Zhou,G.Zhao,I.K.Bauer,L.Droit,I.M.Ndao,B.B.Warner,P.I.Tarr,
D.Wang,andL.R.Holtz.Earlylifedynamicsofthehumangutviromeandbacterial
microbiomeininfants.
Naturemedicine
,21(10):1228Œ1234,2015.
[86]
H.-H.LinandY.-C.Liao.drVM:anewtoolforefgenomeassemblyofknown
eukaryoticvirusesfrommetagenomes.
GigaScience
,6(2):1Œ10,2017.
[87]
C.-C.LoandP.S.Chain.Rapidevaluationandqualitycontrolofnextgenerationsequencing
datawithFaQCs.
BMCbioinformatics
,15(1):366,2014.
[88]
Y.Y.Lu,T.Chen,J.A.Fuhrman,andF.Sun.Cocacola:binningmetagenomiccontigsusing
sequencecomposition,readcoverage,co-alignmentandpaired-endreadlinkage.
Bioinfor-
matics
,33(6):791Œ798,2017.
[89]
C.Luo,R.Knight,H.Siljander,M.Knip,R.J.Xavier,andD.Gevers.Constrains
microbialstrainsinmetagenomicdatasets.
Naturebiotechnology
,33(10):1045,2015.
[90]
C.Luo,D.Tsementzi,N.Kyrpides,andK.Konstantinidis.Individualgenomeassembly
fromcomplexcommunityshort-readmetagenomicdatasets.
ISMEJ.
,6(4):898Œ901,2012.
[91]
T.F.Lüscher.Frontiersinprecisionmedicine:genesandtheirmodulationbymirnas.
Eu-
ropeanHeartJournal
,37(43):3247Œ3250,2016.
141
[92]
R.Malhotra,S.Prabhakara,M.Poss,andR.Acharya.Estimatingviralhaplotypesina
populationusingk-mercounting.In
IAPRInternationalConferenceonPatternRecognition
inBioinformatics
,pages265Œ276.Springer,2013.
[93]
R.Malhotra,M.M.S.Wu,A.Rodrigo,M.Poss,andR.Acharya.Maximumlikelihoodde
novoreconstructionofviralpopulationsusingpairedendsequencingdata.
arXivpreprint
arXiv:1502.04239
,2015.
[94]
S.K.MallannaandA.Rizzino.EmergingrolesofmicroRNAsinthecontrolofembryonic
stemcellsandthegenerationofinducedpluripotentstemcells.
Dev.Biol.
,344(1):16Œ25,
2010.
[95]
S.Mangul,N.C.Wu,N.Mancuso,A.Zelikovsky,R.Sun,andE.Eskin.VGA:amethodfor
viralquasispeciesassemblyfromultra-deepsequencingdata.In
ComputationalAdvances
inBioandMedicalSciences(ICCABS),2014IEEE4thInternationalConferenceon
,pages
1Œ1.IEEE,2014.
[96]
F.Matin,V.Jeet,J.A.Clements,G.M.Yousef,andJ.Batra.Micrornatheranosticsin
prostatecancerprecisionmedicine.
ClinicalChemistry
,pagesclinchemŒ2015,2016.
[97]
H.A.Meijer,E.M.Smith,andM.Bushell.RegulationofmiRNAstrandselection:follow
theleader?
Biochem.Soc.Trans.
,42(4):1135Œ1140,2014.
[98]
M.L.Metzker.SequencingtechnologiesŠthenextgeneration.
Naturereviewsgenetics
,
11(1):31Œ46,2010.
[99]
F.Meyer,D.Paarmann,M.D'Souza,R.Olson,E.M.Glass,M.Kubal,T.Paczian,A.Ro-
driguez,R.Stevens,A.Wilke,etal.ThemetagenomicsRASTserverŒapublicresourcefor
theautomaticphylogeneticandfunctionalanalysisofmetagenomes.
BMCbioinformatics
,
9(1):386,2008.
[100]
A.Mikheenko,V.Saveliev,andA.Gurevich.Metaquast:evaluationofmetagenomeassem-
blies.
Bioinformatics
,pagebtv697,2015.
[101]
R.D.MitraandG.M.Church.Insitulocalizedamplandcontactreplicationof
manyindividualdnamolecules.
NucleicAcidsResearch
,27(24):e34Œe39,1999.
[102]
C.M.Mizuno,F.Rodriguez-Valera,N.E.Kimes,andR.Ghai.Expandingthemarine
virosphereusingmetagenomics.
PLoSgenetics
,9(12):e1003987,2013.
[103]
C.Mukherjee,D.P.Patil,B.A.Kennedy,B.Bakthavachalu,R.Bundschuh,andD.R.
Schoenberg.ofcytoplasmiccappingtargetsrevealsaroleforcaphomeostasis
intranslationandmRNAstability.
CellRep.
,2(3):674Œ684,2012.
[104]
D.Murphy,B.Dancis,andJ.R.Brown.Theevolutionofcoreproteinsinvolvedinmi-
142
croRNAbiogenesis.
BMCEvol.Biol.
,8(1):1Œ18,2008.
[105]
S.N.Naccache,S.Federman,N.Veeraraghavan,M.Zaharia,D.Lee,E.Samayoa,J.Bou-
quet,A.L.Greninger,K.-C.Luk,B.Enge,etal.Acloud-compatiblebioinformaticspipeline
forultrarapidpathogenonfromnext-generationsequencingofclinicalsamples.
Genomeresearch
,24(7):1180Œ1192,2014.
[106]
T.Namiki,T.Hachiya,H.Tanaka,andY.Sakakibara.Metavelvet:anextensionofvelvet
assemblertodenovometagenomeassemblyfromshortsequencereads.
NucleicAcids
Research
,40(20):e155,2012.
[107]
M.A.Nowak.
Evolutionarydynamics
.HarvardUniversityPress,CambridgeMas-
sachusetts,2006.
[108]
S.Nurk,D.Meleshko,A.Korobeynikov,andP.A.Pevzner.metaSPAdes:anewversatile
metagenomicassembler.
Genomeresearch
,pagesgrŒ213959,2017.
[109]
N.D.Olson,T.J.Treangen,C.M.Hill,V.Cepeda-Espinoza,J.Ghurye,S.Koren,and
M.Pop.Metagenomicassemblythroughthelensofvalidation:recentadvancesinas-
sessingandimprovingthequalityofgenomesassembledfrommetagenomes.
in
bioinformatics
,2017.
[110]
Y.Otsuka,N.L.Kedersha,andD.R.Schoenberg.ofacytoplasmiccomplex
thataddsacaponto5'-monophosphateRNA.
Mol.CellBiol.
,29(8):2155Œ2167,2009.
[111]
F.Ozsolak,L.L.Poling,Z.Wang,H.Liu,X.S.Liu,R.G.Roeder,X.Zhang,J.S.Song,
andD.E.Fisher.ChromatinstructureanalysesidentifymiRNApromoters.
GenesDev.
,
22(22):3172Œ3183,2008.
[112]
D.Paez-Espino,G.A.Pavlopoulos,N.N.Ivanova,andN.C.Kyrpides.Nontargetedvirus
sequencediscoverypipelineandvirusclusteringformetagenomicdata.
natureprotocols
,
12(8):1673,2017.
[113]
P.J.Park.ChipŒseq:advantagesandchallengesofamaturingtechnology.
NatureReviews
Genetics
,10(10):669Œ680,2009.
[114]
J.Peccoud,S.Lequime,I.Moltini-Conclois,I.Giraud,L.Lambrechts,andC.Gilbert.A
SurveyofVirusRecombinationUncoversCanonicalFeaturesoflChimerasGen-
eratedDuringDeepSequencingLibraryPreparation.
G3:Genes,Genomes,Genetics
,
8(4):1129Œ1138,2018.
[115]
Y.Peng,H.C.Leung,S.-M.Yiu,andF.Y.Chin.Idba-ud:adenovoassemblerforsingle-cell
andmetagenomicsequencingdatawithhighlyunevendepth.
Bioinformatics
,28(11):1420Œ
1428,2012.
143
[116]
Y.Peng,H.C.M.Leung,S.M.Yiu,andF.Y.L.Chin.Meta-IDBA:adenovoassembler
formetagenomicdata.
Bioinformatics
,27(13):i94Œi101,2011.
[117]
B.E.Pickett,E.L.Sadat,Y.Zhang,J.M.Noronha,R.B.Squires,V.Hunt,M.Liu,S.Ku-
mar,S.Zaremba,Z.Gu,etal.ViPR:anopenbioinformaticsdatabaseandanalysisresource
forvirologyresearch.
Nucleicacidsresearch
,40(D1):D593ŒD598,2011.
[118]
S.Prabhakaran,M.Rey,O.Zagordi,N.Beerenwinkel,andV.Roth.Hiv-haplotypein-
ferenceusingaconstraint-baseddirichletprocessmixturemodel.In
MachineLearningin
ComputationalBiology(MLCB)NIPSWorkshop
,pages1Œ4,2010.
[119]
M.C.ProsperiandM.Salemi.Qure:softwareforviralquasispeciesreconstructionfrom
next-generationsequencingdata.
Bioinformatics
,28(1):132Œ133,2012.
[120]
J.Qin,R.Li,J.Raes,M.Arumugam,K.S.Burgdorf,C.Manichanh,T.Nielsen,N.Pons,
F.Levenez,T.Yamada,etal.Ahumangutmicrobialgenecatalogueestablishedbymetage-
nomicsequencing.
nature
,464(7285):59Œ65,2010.
[121]
S.RajasekaranandM.Nicolae.Anelegantalgorithmfortheconstructionofsufarrays.
JournalofDiscreteAlgorithms
,27:21Œ28,2014.
[122]
S.Rampelli,M.Soverini,S.Turroni,S.Quercia,E.Biagi,P.Brigidi,andM.Candela.
ViromeScan:anewtoolformetagenomicviralcommunity
BMCgenomics
,
17(1):165,2016.
[123]
G.Robertson,J.Schein,R.Chiu,R.Corbett,M.Field,S.D.Jackman,K.Mungall,S.Lee,
H.M.Okada,J.Q.Qian,etal.Denovoassemblyandanalysisofrna-seqdata.
Nature
methods
,7(11):909Œ912,2010.
[124]
K.RotmistrovskyandR.Agarwala.BMTagger:BestMatchTaggerforremovinghuman
readsfrommetagenomicsdatasets.2011.
[125]
H.Rouha,C.Thurner,andC.W.Mandl.FunctionalmicroRNAgeneratedfromacytoplas-
micRNAvirus.
Nucl.AcidsRes.
,38(22):8328Œ8337,2010.
[126]
S.Roux,J.R.Brum,B.E.Dutilh,S.Sunagawa,M.B.Duhaime,A.Loy,B.T.Poulos,
N.Solonenko,E.Lara,J.Poulain,etal.Ecogenomicsandpotentialbiogeochemicalimpacts
ofgloballyabundantoceanviruses.
Nature
,537(7622):689Œ693,2016.
[127]
S.Roux,F.Enault,B.L.Hurwitz,andM.B.Sullivan.VirSorter:miningviralsignalfrom
microbialgenomicdata.
PeerJ
,3:e985,2015.
[128]
S.Roux,J.Tournayre,A.Mahul,D.Debroas,andF.Enault.Metavir2:newtoolsforviral
metagenomecomparisonandassembledviromeanalysis.
BMCbioinformatics
,15(1):76,
2014.
144
[129]
J.G.Ruby,P.Bellare,andJ.L.DeRisi.Price:softwareforthetargetedassemblyofcom-
ponentsof(meta)genomicsequencedata.
G3:Genes|Genomes|Genetics
,3(5):865Œ880,
2013.
[130]
J.G.Ruby,C.H.Jan,andD.P.Bartel.IntronicmicroRNAprecursorsthatbypassDrosha
processing.
Nature
,448(7149):83Œ86,2007.
[131]
H.K.Saini,A.J.Enright,andS.GrifAnnotationofmammalianprimarymi-
croRNAs.
BMCgenomics
,9(1):564,2008.
[132]
H.K.Saini,S.GrifandA.J.Enright.GenomicanalysisofhumanmicroRNA
transcripts.
Proc.Natl.Acad.Sci.USA
,104(45):17719Œ17724,2007.
[133]
S.L.Salzberg,D.D.Sommer,D.Puiu,andV.T.Lee.Gene-boostedassemblyofanovel
bacterialgenomefromveryshortreads.
PLOSComputBiol
,4(9):e1000186,092008.
[134]
A.Sandelin,W.Alkema,P.Engström,W.W.Wasserman,andB.Lenhard.JASPAR:an
open-accessdatabaseforeukaryotictranscriptionfactorbinding
Nucl.AcidsRes.
,
32(suppl1):D91ŒD94,2004.
[135]
A.Sandelin,P.Carninci,B.Lenhard,J.Ponjavic,Y.Hayashizaki,andD.A.Hume.Mam-
malianRNApolymeraseIIcorepromoters:insightsfromgenome-widestudies.
NatureRev.
Genet.
,8(6):424Œ436,2007.
[136]
M.Schirmer,W.T.Sloan,andC.Quince.Benchmarkingofviralhaplotypereconstruc-
tionprogrammes:anoverviewofthecapacitiesandlimitationsofcurrentlyavailablepro-
grammes.
inbioinformatics
,pagebbs081,2012.
[137]
D.R.SchoenbergandL.E.Maquat.Re-cappingthemessage.
TrendsBiochem.Sci.
,
34(9):435Œ442,2009.
[138]
D.E.Schones,K.Cui,S.Cuddapah,T.-Y.Roh,A.Barski,Z.Wang,G.Wei,andK.Zhao.
Dynamicregulationofnucleosomepositioninginthehumangenome.
Cell
,132(5):887Œ
898,2008.
[139]
J.S.Shapiro,R.A.Langlois,A.M.Pham,etal.Evidenceforacytoplasmicmicroprocessor
ofpri-miRNAs.
RNA
,18(7):1338Œ1346,2012.
[140]
D.Sharma,P.Priyadarshini,andS.Vrati.Unravelingthewebofviroinformatics:compu-
tationaltoolsanddatabasesinvirusresearch.
Journalofvirology
,89(3):1489Œ1501,2015.
[141]
A.A.Sigova,A.C.Mullen,B.Molinie,S.Gupta,D.A.Orlando,M.G.Guenther,A.E.
Almada,C.Lin,P.A.Sharp,C.C.Giallourakis,etal.Divergenttranscriptionoflong
noncodingRNA/mRNAgenepairsinembryonicstemcells.
Proc.Natl.Acad.Sci.USA
,
110(8):2876Œ2881,2013.
145
[142]
J.T.SimpsonandR.Durbin.Efdenovoassemblyoflargegenomesusingcompressed
datastructures.
Genomeresearch
,22(3):549Œ556,2012.
[143]
G.SongandL.Wang.MiR-433andmiR-127arisefromindependentoverlappingprimary
transcriptsencodedbythemiR-433-127locus.
PLoSONE
,3(10):e3574,2008.
[144]
J.Spieth,D.Lawson,P.Davis,G.Williams,andK.Howe.Overviewofgenestructurein
C.elegans.
WormBook
,.:1Œ18,2014.
[145]
S.R.A.S.Staff.Usingthesratoolkittoconvert.sraintootherformats.In:SRA
KnowledgeBase[Internet],2011.Availableat:
http://www.ncbi.nlm.nih.gov/
books/NBK158900/
.(Accessed:25August2015).
[146]
S.TO'NeilandS.J.Emrich.Haplotypeandminimum-chimerismconsensusdetermination
usingshortsequencedata.
BMCgenomics
,13(2):S4,2012.
[147]
A.Töpfer,T.Marschall,R.A.Bull,F.Luciani,A.Schönhuth,andN.Beerenwinkel.
Viralquasispeciesassemblyviamaximalcliqueenumeration.
PLoSComputBiol
,
10(3):e1003515,2014.
[148]
A.Töpfer,O.Zagordi,S.Prabhakaran,V.Roth,E.Halperin,andN.Beerenwinkel.Proba-
bilisticinferenceofviralquasispeciessubjecttorecombination.
JournalofComputational
Biology
,20(2):113Œ123,2013.
[149]
C.TrapnellandS.L.Salzberg.Howtomapbillionsofshortreadsontogenomes.
Nature
biotechnology
,27(5):455,2009.
[150]
T.Treangen,S.Koren,D.Sommer,B.Liu,I.Astrovskaya,B.Ondov,A.Darling,
A.Phillippy,andM.Pop.MetAMOS:amodularandopensourcemetagenomicassem-
blyandanalysispipeline.
GenomeBiology
,14(1):R2,2013.
[151]
D.T.Truong,E.A.Franzosa,T.L.Tickle,M.Scholz,G.Weingart,E.Pasolli,A.Tett,
C.Huttenhower,andN.Segata.Metaphlan2forenhancedmetagenomictaxonomic
ing.
Naturemethods
,12(10):902,2015.
[152]
D.T.Truong,A.Tett,E.Pasolli,C.Huttenhower,andN.Segata.Microbialstrain-level
populationstructureandgeneticdiversityfrommetagenomes.
Genomeresearch
,2017.
[153]
M.Vignuzzi,J.K.Stone,J.J.Arnold,C.E.Cameron,andR.Andino.Quasispeciesdiversity
determinespathogenesisthroughcooperativeinteractionsinaviralpopulation.
Nature
,
439(7074):344Œ348,2006.
[154]
G.Wang,Y.Wang,C.Shen,Y.-w.Huang,K.Huang,T.H.Huang,K.P.Nephew,L.Li,and
Y.Liu.RNApolymeraseIIbindingpatternsrevealgenomicregionsinvolvedinmicroRNA
generegulation.
PLoSONE
,5(11):e13798,2010.
146
[155]
Z.Wang,M.Gerstein,andM.Snyder.Rna-seq:arevolutionarytoolfortranscriptomics.
Naturereviewsgenetics
,10(1):57Œ63,2009.
[156]
P.Willeit,P.Skroblin,S.Kiechl,C.Fernández-Hernando,andM.Mayr.Livermicrornas:
potentialmediatorsandbiomarkersformetabolicandcardiovasculardisease?
European
heartjournal
,pageehw146,2016.
[157]
J.Winter,S.Jung,S.Keller,R.I.Gregory,andS.Diederichs.Manyroadstomaturity:
microRNAbiogenesispathwaysandtheirregulation.
NatureCellBiol.
,11(3):228Œ234,
2009.
[158]
M.E.Woolhouse,A.Rambaut,andP.Kellam.LessonsfromEbola:Improvinginfec-
tiousdiseasesurveillancetoinformoutbreakmanagement.
Sciencetranslationalmedicine
,
7(307):307rv5Œ307rv5,2015.
[159]
Y.Wu,M.Rho,T.G.Doak,andY.Ye.Stitchinggenefragmentswithanetworkmatching
algorithmimprovesgeneassemblyformetagenomics.
Bioinformatics
,28(18):i363Œi369,
2012.
[160]
Y.-W.Wu,Y.-H.Tang,S.G.Tringe,B.A.Simmons,andS.W.Singer.MaxBin:an
automatedbinningmethodtorecoverindividualgenomesfrommetagenomesusingan
expectation-maximizationalgorithm.
Microbiome
,2(1):26,2014.
[161]
M.Xie,M.Li,A.Vilborg,N.Lee,M.-D.Shu,V.Yartseva,N.−estan,andJ.a.Steitz.
Mammalian5'-cappedmicroRNAprecursorsthatgenerateasinglemicroRNA.
Cell
,
155(7):1568Œ1580,Dec.2013.
[162]
A.Yamashita,T.Sekizuka,andM.Kuroda.VirusTAP:viralgenome-targetedassembly
pipeline.
Frontiersinmicrobiology
,7:32,2016.
[163]
C.Yuan,J.Lei,J.Cole,andY.Sun.Reconstructing16srrnagenesinmetagenomicdata.
Bioinformatics
,31(12):i35Œi43,2015.
[164]
C.YuanandY.Sun.RNA-CODE:ANoncodingRNAToolforShortReads
inNGSDataLackingReferenceGenomes.
PLoSONE
,8:e77596,102013.
[165]
N.Yutin,K.S.Makarova,A.B.Gussow,M.Krupovic,A.Segall,R.A.Edwards,andE.V.
Koonin.Discoveryofanexpansivebacteriophagefamilythatincludesthemostabundant
virusesfromthehumangut.
Naturemicrobiology
,3(1):38,2018.
[166]
O.Zagordi,A.Bhattacharya,N.Eriksson,andN.Beerenwinkel.Shorah:estimatingthe
geneticdiversityofamixedsamplefromnext-generationsequencingdata.
BMCbioinfor-
matics
,12(1):119,2011.
[167]
J.Zhang,K.Kobert,T.Flouri,andA.Stamatakis.Pear:afastandaccurateilluminapaired-
147
endreadmerger.
Bioinformatics
,30(5):614Œ620,2014.
[168]
Y.Zhang,T.Liu,C.A.Meyer,J.Eeckhoute,D.S.Johnson,B.E.Bernstein,C.Nusbaum,
R.M.Myers,M.Brown,W.Li,etal.Model-basedanalysisofChIP-Seq(MACS).
Genome
Biol.
,9(9):R137,2008.
[169]
Y.Zhang,Y.Sun,andJ.R.Cole.Ascalableandaccuratetargetedgeneassemblytool
(sat-assembler)fornext-generationsequencingdata.
PLoSComputBiol
,10(8):e1003737,
2014.
148