COMPUTATIONALIDENTIFICATIONANDANALYSISOFNON-CODINGRNASIN
LARGE-SCALEBIOLOGICALDATA
By
JikaiLei
ADISSERTATION
Submittedto
MichiganStateUniversity
inpartialentoftherequirements
forthedegreeof
ComputerScience{DoctorofPhilosophy
2015
ABSTRACT
COMPUTATIONALIDENTIFICATIONANDANALYSISOF
NON-CODINGRNASINLARGE-SCALEBIOLOGICALDATA
By
JikaiLei
Non-protein-codingRNAs(ncRNAs)areRNAmoleculesthatfunctiondirectlyatthe
levelofRNAwithouttranslatingintoprotein.Theyplayimportantbiologicalfunctionsin
allthreedomainsoflife,i.e.Eukarya,BacteriaandArchaea.Tounderstandtheworking
mechanismsandthefunctionsofncRNAsinvariousspecies,afundamentalstepistoidentify
bothknownandnovelncRNAsfromlarge-scalebiologicaldata.
Large-scalegenomicdataincludesbothgenomicsequencedataandNGSsequencingdata.
BothtypesofgenomicdataprovidegreatopportunityforidentifyingncRNAs.Forgenomic
sequencedata,alotofncRNAidentoolsthatusecomparativesequenceanaly-
sishavebeendeveloped.ThesemethodsworkwellforncRNAsthathavestrongsequence
similarity.However,theyarenotwell-suitedfordetectingncRNAsthatareremotelyho-
mologous.Nextgenerationsequencing(NGS),whileitopensanewhorizonforannotating
andunderstandingknownandnovelncRNAs,alsointroducesmanychallenges.First,exist-
inggenomicsequencesearchingtoolscannotbereadilyappliedtoNGSdatabecauseNGS
technologyproducesshort,fragmentaryreads.Second,mostNGSdatasetsarelarge-scale.
ExistingalgorithmsareinfeasibleonNGSdatabecauseofhighresourcerequirements.Third,
metagenomicsequencing,whichutilizesNGStechnologytosequenceuncultured,complex
microbialcommunitiesdirectlyfromtheirnaturalinhabitants,furtheraggravatesthe
culties.Thus,massiveamountofgenomicsequencedataandNGSdatacallsfornt
algorithmsandtoolsforncRNAannotation.
Inthisdissertation,Ipresentthreecomputationalmethodsandtoolstotlyidentify
ncRNAsfromlarge-scalebiologicaldata.Chain-RNAisatoolthatcombinesbothsequence
similarityandstructuresimilaritytolocatecross-speciesconservedRNAelementswithlow
sequencesimilarityingenomicsequencedata.Itcanachievecantlyhighersensitivity
inidentifyingremotelyconservedncRNAelementsthansequencebasedmethodssuchas
BLAST,andismuchfasterthanexistingstructuralalignmenttools.miR-PREFeR(miRNA
PREdictionFromsmallRNA-Seqdata)utilizesexpressionpatternsofmiRNAandfollows
thecriteriaforplantmicroRNAannotationtoaccuratelypredictplantmiRNAsfromone
ormoresmallRNA-Seqdatasamples.Itissensitive,accurate,fastandhaslow-memory
footprint.metaCRISPRfocusesonidentifyingClusteredRegularlyInterspacedShortPalin-
dromicRepeats(CRISPRs)fromlarge-scalemetagenomicsequencingdata.Itusesakmer
hashtabletotlydetectreadsthatbelongtoCRISPRsfromtherawmetagonmic
dataset.Overlapgraphbasedclusteringisthenconductedonthereduceddatasettosep-
aratetCRSIPRs.Asetofgraphbasedalgorithmsareusedtoassembleandrecover
CRISPRsfromtheclusters.
Tomybelovedfamily,especiallymylovelyboy,Evan.
iv
ACKNOWLEDGMENTS
Firstandformost,IwouldliketothankmyadvisorDr.YanniSunforherexcellentsuper-
visionandprofessionalguidanceonmyresearchanddissertationwork.Noneofthesewould
bepossiblewithoutherpatience,time,encouragementandinspiration.
Mysincerethanksalsogotomythesiscommitteemembers:Dr.EricTorng,Dr.Yuehua
CuiandDr.JamesCole,fortheirprecioustime,helpfuldiscussionsandbeingalways
approachable.IamalsogratefultoallthefacultyandmembersintheDepartmentof
ComputerScienceandEngineeringwhohavetaughtmeorassistedme.
Iwouldalsoliketothankmywife,TaoHe,forherunconditionallove,supportand
encouragementduringthepastyears,aswellasforbeingagreatmotherforourson.Thanks
mylittleboy,EvanY.Lei,formakingmylifemorecompleteandmeaningful.
Mythanksalsogotomyacademicfamilymembers:Dr.YuanZhang,Dr.Rujira
Achawanantakun,Dr.ChengYuan,JiaoChenandNanDu,fortheirvaluablehelpinmy
dailylifeandusefuldiscussionsalltheseyears.Ialsothankmygreatfriends,YiyanLin,Qiu
Chen,RanXuandallothergraduatestudentsinthedepartmentfortheirencouragements
andcompanythroughoutthejourney.
Lastbutnotleast,Iwouldexpressmyprofoundgratitudetomybelovedparents,who
alwaysbelieveinme,givemecouragetopursuemydreams,andtookverygoodcareofmy
sonwhenIwasnotthere.
v
TABLEOFCONTENTS
LISTOFTABLES
....................................
ix
LISTOFFIGURES
...................................
x
Chapter1Introduction
...............................
1
1.1Overview......................................1
1.2NCRNAsecondarystructure...........................2
1.3Nextgenerationsequencingtechnology.....................3
1.3.1Wholegenomesequencing........................4
1.3.2RNA-Seq.................................5
1.3.3Metagenomicsequencing.........................5
1.4ExistingmethodsforidentifyingncRNAsfromlarge-scalegenomicdata...6
1.5Contributions...................................7
Chapter2Chain-RNA:acomparativencRNAsearchtoolbasedonthe
two-dimensionalchainalgorithm
..................
9
2.1Introduction....................................9
2.2RelatedWork...................................11
2.3Methods......................................13
2.3.1Localbase-pairingprobabilities.....................15
2.3.2Stemextractionfromdotplots.....................15
2.3.3Stemalignmentasatwo-dimensionalchainproblem..........17
2.3.3.1Thebasicchainproblem....................17
2.3.3.2TheExtendedchainproblem.................19
2.3.3.3Weightofarectangle......................20
2.3.3.4Extendedchainalgorithm...................22
2.3.3.5Filtration............................25
2.4ResultsandDiscussion..............................26
2.4.1Evaluatingtheperformanceofchain-RNA...............26
2.4.1.1BenchmarkConstruction....................26
2.4.1.1.1SelectionofncRNAs..................27
2.4.1.1.2Addingcontextsequences...............27
2.4.1.2ofsensitivityandfalsepositiverate.......30
2.4.1.3PerformanceofncRNAsearchtools..............31
2.4.1.4Filtration.......................34
2.4.1.5Analysisofthefalsepositivehits...............35
2.4.2Usingchain-RNAinnovelncRNAsearch................37
2.5Conclusion.....................................41
vi
Chapter3miR-PREFeR:anaccurate,fast,andeasy-to-useplantmiRNA
predictiontoolusingsmallRNA-Seqdata
............
43
3.1Introduction....................................43
3.2DescriptionofthemiR-PREFeRpipeline....................44
3.2.1Ww.................................44
3.2.1.1Candidateregioniden................46
3.2.1.1.1Peaks:.........................47
3.2.1.1.2Candidateregions:..................47
3.2.1.1.3Candidatematuresequences:............47
3.2.1.2Folding.............................48
3.2.1.3MiRNAprediction.......................48
3.2.1.4Checkpointing.........................50
3.2.2Implementationandexternalprogramdependency...........50
3.3Results.......................................51
3.3.1Evaluatingpredictionperformanceonsixdatasets..........51
3.3.2AnalyzingthebetweentheannotatedandpredictedmiR-
NAs....................................56
3.3.3miR-PFEFeRisfastandt...............57
Chapter4IdentifyandassembleCRISPRsfrommetagenomicdata
...
61
4.1Introduction....................................61
4.2Relatedworksandtheirlimitations.......................62
4.3Method......................................64
4.3.1Overview.................................64
4.3.2IdenofCRISPRreads.....................66
4.3.2.1Directrepeatidention..................66
4.3.2.2Filtermetagenomicdatausingk-merhashtable.......68
4.3.3ClusteringofCRISPRreads.......................69
4.3.3.1Overlapgraph.........................70
4.3.3.2OverlapgraphbasedCRISPRreadsclustering........71
4.3.3.3Correcttheorientationofreads................74
4.3.3.4Filterfalsepositiveclusters..................74
4.3.4AssembleCRISPRsofheterogeneouscoverage.............75
4.3.4.1Step1:cleangraphandrecoverhighlysequencedCRISPRs.77
4.3.4.2Step2:cleangraphbyremovingrecoverednodesandedges.78
4.3.4.3Step3:breakcycles.......................78
4.3.4.4Step4:recoverlowlysequencedCRISPRs..........80
4.4Experimentalresults...............................81
4.4.1EvaluatingtheperformanceofCRISPRreadsiden......82
4.4.2Theperformanceofreadclustering...................83
4.4.3PerformanceofCRISPRassemblyonthemockdataset.......85
4.4.4Experimentalresultonthesoildataset.................87
4.4.5Experimentalresultonthehumangutdataset............89
Chapter5Conclusionandfuturework
......................
91
vii
BIBLIOGRAPHY
....................................
93
viii
LISTOFTABLES
Table2.1Informationofthedatasetandtotalrunningtimeofeachbench-
markedtooloneachdataset.Column
length
showstheaverage
lengthofgenomicsequencesinadatasetaftermaskingtheprotein
codinggenes.Theaveragenumberofrectanglesareshownforapair
ofwindowsofsizes100nt

100nt.Thetimeofchain-RNAincludes
therunningtimeofRNAplfold.....................29
Table2.2Selectedbacterialstrainsandtheirchromosomeinformation.....37
Table2.3Transcribedregionsandtargetchromosomesareinputtochain-RNA.
\all"inthe
TargetChrs
columnmeansthatallgenomicsequencesfor
the7strainsinTable2.2areusedastargetsequences........40
Table3.1ComparisonoffeaturesofmiRNAannotationtools..........53
Table3.2InformationofthesmallRNA-Seqdatausedintheexperiments..54
Table3.3Performancecomparisonofbenchmarkedtools............58
Table3.4PerformancecomparisionofmiR-RPEFeRandShortStackonmaize59
Table3.5Runningtime(hour)..........................59
Table3.6Memoryusage(MB)..........................60
Table4.1TheoftheCRISPRreadsiden...........82
Table4.2Theperformanceofclusteringonmockdataset...........84
Table4.3Theperformanceofthebenchmarkedtoolsonthemockdataset..88
Table4.4Theperformanceofthebenchmarkedtoolsonthesoildataset...88
Table4.5Theperformanceofthebenchmarkedtoolsonthehumangutdataset89
ix
LISTOFFIGURES
Figure1.1AtRNAsequenceanditssecondarystructure............3
Figure1.2workwofnextgenerationsequencingtechnologythat
producespaired-endreads.ThegenomicDNAsequenceisfrag-
mented.ThetwoendsofthefragmentedDNAarethensequenced.
Thesequencebetweenthetworeadsofeachpairisunknown,butthe
distanceisinarangethatisspebythesequencingprotocol..4
Figure2.1Pipelineofchain-RNA..........................13
Figure2.2(A).AdotplotforatRNAsequence.Thelower-lefttrianglerep-
resentstheMFEstructure.Theupper-righttriangleencodesbase-
pairingprobabilitiesandisusedinourcomputation.Astemishigh-
lightedinabox.(B).PartialoutputofRNAplfoldforalongsequence.
Ahighlightedtriangleshowsthelocalbase-pairingprobabilitydot
plotforasubstringofsizet.......................14
Figure2.3Propertiesofastem.Thestemconsistsofebasepairsandabulge
ofsizeone.................................17
Figure2.4Thebasicchainproblem.Twochainsstartingfromrectanglewith
weight3areshowedandthechainthatincludesrectangleswith
weights3,2,and9isthemaximumweightedchaininthegiven
rectangleset................................18
Figure2.5(A).Theformationofarectanglebytwostems.Thinarcsrepresent
basepairs.Fourrectanglesareformedbytwostems.Thenumber
ofbasepairsinthetwostemsformingarectanglecouldbet.
r
2
isanexample.(B).Asubsetofrectanglesbetweentwosequences.
Stemsarerepresentedbythickarcs.Thebasicweightislistedfor
eachrectangle.Themaximumweightedchainandthecorresponding
stemsarehighlighted...........................19
Figure2.6Legalrelationsbetweentworectanglesinavalidstemalignmentwith-
outpseudoknots.Eacharcrepresentsabasepair.(A).Parallelstems
leadtonon-overlappingrectangles.(B).Nestedstemsleadtonested
rectangles.................................20
Figure2.7Algorithmforcomputingmaximumweightedextendedchainonaset
ofrectangles...............................24
x
Figure2.8Anexampleshowinghowrecursivelynestedrectanglesarehandled
intheproposedextendedchainalgorithm.
r
1
:bw
=120,
r
2
:bw
=30,
r
3
:bw
=10.................................24
Figure2.9ROCcurvesforbenchmarkedtoolsundertoverlap
onthedatasetwithsequenceidentityfrom0%to60%.\X/N"
representstheROCcurveforthetoolXundertheoverlapN%.
Forexample,\Chain-RNA/20"meanstheROCcurveisgeneratedfor
Chain-RNAusingtheoverlap20%................35
Figure2.10ROCcurvesforbenchmarkedtoolsundertoverlap
onthedatasetwithsequenceidentityfrom60%to70%.\X/N"
representstheROCcurveforthetoolXundertheoverlapN%.
Forexample,\Chain-RNA/20"meanstheROCcurveisgeneratedfor
Chain-RNAusingtheoverlap20%................35
Figure2.11ROCcurvesforbenchmarkedtoolsundertoverlap
onthedatasetwithsequenceidentityfrom70%to100%.\X/N"
representstheROCcurveforthetoolXundertheoverlapN%.
Forexample,\Chain-RNA/20"meanstheROCcurveisgeneratedfor
Chain-RNAusingtheoverlap20%................36
Figure2.12Derivedstructuresforthreepairsoffalsepositivehits.........36
Figure2.13DerivedsecondarystructuresforhomologousncRNAsinall11chro-
mosomesforthetranscribedregionwithID18.............41
Figure3.1WorwofmiR-PREFeR........................45
Figure3.2IdenofpeaksfromtwoinputRNA-Seqsamples.Eachcol-
oredarrowrepresentsareadfromoneRNA-Seqsample.Redand
bluearrowsrepresentreadsfromtwotsamples,respectively.
Twopeaksareidenaccordingtothedepth
D
cutoff
,which
is2inthisexample...........................46
Figure3.3Generatingcandidateregionsfrompeaks.(A).Thegap
g
between
twoneighbouringpeaks
p
1
and
p
2
issmallerthanthegaplength

G
max
.Thus,peaks
p
1
and
p
2
areclusteredintoonecandidate
region.(B).Peak
p
3
hasnoneighbouringpeaksandisextendedto
theleftandtotherighttoformtwocandidateregionswithlength
L
max
...................................46
Figure3.4VenndiagramsthatshowtheintersectionsofthepredictedmiRNAs
fromttoolsonedatasetsfromfourspecies.ForArabidop-
sis,Medicagoandpeach,theannotationsfrommiRBaseareincluded.59
xi
Figure3.5ThedistributionofthemiRNAstartpositions(toppanel)
andlengths(bottompanel)betweenannotatedandpredictedmiR-
NAsbythreetools.ThewasgeneratedusingannotatedmiR-
NAsofArabidopsisthatarepredictedbyallthreetoolsondataset
Athl-6...................................60
Figure4.1Acas-CRISPRexample.Greenboxesinthetopgurerepresent
directrepeats(DRs).AlinebetweentwoDRsrepresentsaspacer
sequence,whichsharesequenceidentitywithfragmentsofplasmids
andbacteriophagegenomesandspecifythetargetsofCRISPRinter-
ference.DRsandspacersformanDR-spacerarray.Thenumberof
DRscanvarysubstantially,fromaminimumoftwotoafewhundred.
AsetofCRISPR-associatedgenes(cas-genes),whichencodethepro-
teinmachineryresponsibleforCRISPRactivity,arelocatedupstream
ofthearray.ThebottomboxshowsthesequenceoftheDR-spacer
array.GreencharactersareforDR,othercoloredcharactersarefor
spacers.DRsequencesarehighlyconservedinaCRISPR.......62
Figure4.2MajorcomponentsofmetaCRISPR.Paired-endreadsarerepresented
bytworectanglesconnectedbyadottedline.Readsofredandgreen
colorsrepresentreadsfromtwotCRISPRs,respectively.Gray
readsrepresentreadsfromothergenomicregions...........65
Figure4.3Illustrationofhownovelrepeatsareideninaread.Thegreen
regionsareextendedregions(seemaintext).Blackcrossesrepresent
mismatchesinthesequence.......................67
Figure4.4Thedistributionofgroundtruthreads.................69
Figure4.5Anexampleofoverlapgraphandtransitiveedge.Threereadsform
threeedges.Therededgeisatransitiveedgecanbesafelyremoved.70
Figure4.6Top:twoCRISPRsthatshareverysimilarrepeatsequences.The
readmappingagainstthetwoCRISPRslociareshown.Bot-
tom:theinitialoverlapgraph
G
.Readsofredandpurplearese-
quencedfromthetwoloci,respectively.Eachnoderepresentsaread
ofthesameletter.Blackandgreyedgesdenoterandomoverlaps.
Thenumberalongeachedgerepresentstheoverlap.Readsinapair
r
arerepresentedusing
r:
1and
r:
2....................71
xii
Figure4.7Leftpanel:majorstepstoassembleCRISPRs.Rightpanelshowsan
examplewiththreeCRISPRs,whicharerepresentedbyred,purple,
andgreennodes/edges,respectively.TheCRISRPrepresentedusing
redcolorhashighsequencecoverage,theothertwohavelowsequence
coverage.Readsinapairareconnectedbyablack,dotted,arrowline.
Nodeswhichformsacyclearecircledusingadottedline.......76
Figure4.8Leftpanel:twocasesincyclebreaking.Case1:thereadwhichpairs
withreadinthecycleisinthedownstreamofthegraph;case2:the
readthatpairswiththereadinthecycleisintheupstreamofthe
graph.RightPanel:mainstepstobreakthecycleforcase1.Paired-
endreadsareconnectedusingablack,dotted,arrowline.Edges
representedbyred,dottedlinesaretemporallyremovededges.Red,
solidlinesformthelongestpathbetweentwopairedreads......80
Figure4.9SensitivityoftheCRISPRreadsidenonstep.CRISPRsare
sortedalongthex-axisbythenumberofrepeatstheycontains.See
maintextfordetails...........................83
Figure4.10Performanceofthereadclustering...................85
xiii
Chapter1
Introduction
1.1Overview
Non-protein-codingRNAs(ncRNAs)areRNAmoleculesthatfunctiondirectlyatthelevel
ofRNAwithouttranslatingintoprotein.Overthelastdecade,theimportanceofthissur-
prisinglydiverseclassofRNAmoleculeshasbeenwidelyrecognized.Theyplayimportant
biologicalfunctionsinallthreedomainsoflife,i.e.Eukarya,BacteriaandArchaea.Ithas
longbeenknownthattRNAandrRNAplayimportantrolesinproteinsynthesis.Small
nucleolarRNAs(snoRNAs)canregulatethealternativesplicingofmRNA[55].MicroRNAs
(miRNAs),whicharefoundinalleukaryotes,regulatetheexpressionofgenesbybase-pairing
withcomplementarysequenceswithinthetargetmessagerRNAs[24,125].16srRNAscan
beusedasmarkergenesforphylogeneticreconstruction[121,63].Inaddition,ncRNAs
newopportunitiesforgenetherapyandgenomeengineering.Forexample,usingsmallinter-
feringRNA(siRNA)andtheassociatedRNAinterference(RNAi)mechanism,scientistscan
designtherapytodown-regulatethemRNAlevelbyengineeringRNAsthatcansp
andelytargetthedisease-relatedgene(s)[108,67,65,76,95].ClusteredRegularly
InterspacedShortPalindromicRepeats(CRISPRs),togetherwiththeassociatedcasgenes,
constituteaprokaryoticadaptiveimmunesystemthatprotectsagainstexogenousgenetic
elementssuchasvirusesandplasmids[7,72,44,17,43,5,88].Recentadvancesofstudying
CRISPRsinprokaryoticorganismshaveledtopromisingbiotechnologyapplicationssuchas
1
immunizationandaccurategeneediting(silencing,enhancingorchangingsp
genes)[102,37,16,19,46,25,50,113,62].Tounderstandtheworkingmechanismsandthe
functionsofncRNAsinvariousspecies,afundamentalstepistoidentifyncRNAsfrom
biologicaldata.
ThereexisttkindsofbiologicaldatawhichscientistscanusetoidentifyncRNAs.
Onetypeisgenomicsequencedatasuchaswholegenomicsequences.Genomicsequences
areavailableformodelspeciesandmoreandmorenon-modelspecies.Withtheadvent
ofnextgenerationsequencing(NGS)technology,hugeamountofsequencingdata(whole
genomesequencingdata,RNA-Seqdata,metagenomicdata)arebecomingavailable.This
opensanewhorizonforannotatingandunderstandingknownandnovelncRNAs.Inthe
followingsectionsofthischapter,Irstgiveabriefintroductiontothecharacteristicsof
ncRNAs.ThenIdescribenextgenerationsequencingtechnologyandmetagenomics.Then
IreviewgeneralapproachestoidentifyncRNAsfromlarge-scalegenomicdata.Finally,I
introducethemethodsandtoolsIdevelopedduringmyPh.D.study.
1.2NCRNAsecondarystructure
MostncRNAsfunctionthroughtheirthreedimensionalstructures.Toreachitsfunctional
form,theRNAmoleculesequenceundergoesfoldingdrivenbyWatson-Crick(C-GandA-U)
andwobble(G-U)base-pairingandstackinginteractions.Thisprocessformsshortstems
andvarioussinglestrandedloopsthatitssecondarystructure.Figure1.1showsa
tRNAsequenceanditssecondarystructure.Thesecondarystructuresofmanytypesof
ncRNAsaremoreconservedthantheirprimarysequences.Thisprovidesusefulinformation
foridentifyingncRNAsfrombiologicaldata.
2
Figure1.1AtRNAsequenceanditssecondarystructure
1.3Nextgenerationsequencingtechnology
Sequencingistheprocessofdeterminingtheorderofnucleotidebaseswithinastenchof
DNA/RNAsequence.Sanger'smethod,whichinvolvescopyingaDNAtemplateusingchain-
terminatingdideoxynucleotides,dominatedthesequencingmarketformorethan30years
andremainedinwideuse[96,97].Thoughitisaccurateandgeneratesreadsoflonglength,
Sangersequencinghasseverallimitations.First,thecloningprocessislabor-intensive.Sec-
ond,itreliesoncloningDNAtogeneratesinglestrandedDNA.However,DNAsequences
suchascentromeres,arepracticallyunclonable,whichmadethemhardtosequence[79].
Third,thethroughputofSangersequencingisquitelowcomparedwithnewlydeveloped
methods,thusthedepthofcoverageisrelativelylow.Thismakesitttostudies
suchasinvestigatingmicrobialpopulationdiversity[66].Forth,theoverallcostofSanger
sequencingishigh[66].
Overthepastfewyears,next-generationsequencing(NGS)technologiesevolvedvery
3
Figure1.2workwofnextgenerationsequencingtechnologythatproduces
paired-endreads.ThegenomicDNAsequenceisfragmented.Thetwoendsofthe
fragmentedDNAarethensequenced.Thesequencebetweenthetworeadsofeachpairis
unknown,butthedistanceisinarangethatisspbythesequencingprotocol.
fastandignitedarevolutioningenomicscience[57,77,115].Figure1.2showsa
workwofnextgenerationsequencing.ThemainadvantageofNGStechnologiesisthat
theyaremassivelyparallel.Thustheyhaveveryhighthroughputandareabletosequence
longstretchesofDNA/RNAsequencesspanningentiregenomesrapidlyatalowcost.For
example,asofJan,2014,Illumina'sHiSeqXsequencerisabletogenerateupto1.8terabyte
datainasingleruninlessthanthreedays.Thesystemcansequenceahumangenomeat30x
coverageorgreaterforlessthan$1000whenusedatscale[47].Therearetsequencing
strategies,ofwhichthreeCCprovidevaluableinformationforncRNAidenThey
arewholegenomesequencing[85],RNA-Seq[115],andmetagenomicsequencing[111,104,
23,40].
1.3.1Wholegenomesequencing
WholegenomesequencingusesSanger'smethodorNGStechnologiestosequencethewhole
genomeofatargetspecies.Thesequencedreadsareassembledoralignedtoanexisting,
closelyrelatedgenometoformthegenomeofthetargetspecies.Theproducedgenomeor
4
genomicsequencescanbeusedtoidentifyncRNAsforthespecies.
1.3.2RNA-Seq
RNA-SequsesthecapabilitiesofNGStechnologiestosequencecomplementaryDNA(cDNA)
generatedfromtranscribedRNAs[115].BecauseitonlysequencestranscribedRNAs,notthe
wholegenome,whichcontainsmanyun-transcribedregions,itprovidesfundamentalinsights
toannotateandstudynon-codingRNAs.Inaddition,RNA-SeqwhichusesasmallRNA
preparationprotocol(smallRNA-seq)providesdeepinsightstounderstandsmallncRNAs,
whichrefertoncRNAswhoselengthsarebetween18-26nt(miRNA,siRNA,etc).Deeply
sequencedsmallRNAlibrariesprovideinformationtoaccuratelyidentifyexpressedsmall
RNAs,aswellasquantifytheirexpressionlevelsintsamplesbasedonreadmapping
[69,80,53,6].
1.3.3Metagenomicsequencing
WiththeadvanceofNGStechnologiesandthereductioninthecostofsequencing,now
scientistshavetheabilitytocloneandsequenceDNA/mRNAdirectlyfrommicrobialcom-
munitiesintheirnaturalhabitatswithoutcultivation.Thisculture-independentmicrobial
communityanalysisapproachiscommonlyreferredtoasmetagenomics.ByusingNGS
technologies,nowitispossibletosequencetheorganismsofmicrobialcommunitiesdeeply
enoughtocoverthespecieswithtabundance.Metagenomicsprovidesaunique
opportunitytostudyvariousaspectsofcomplexmicrobialpopulations.Forexample,itis
possibletousemetagenomicdatatoidentifycas-CRISPRsystems.Thiscangreatlyenhance
ourunderstandingofthephage-hostinteractionsandco-evolutionwithinnaturalmicrobial
5
communities[5,91,101].
1.4ExistingmethodsforidentifyingncRNAsfromlarge-
scalegenomicdata
Large-scalegenomicdataincludesgenomicsequencedataandNGSsequencingdata.Alotof
toolshavebeendevelopedforsearchingncRNAsingenomicsequences.Traditionally,ncR-
NAsareidenbysearchingavailablegenomicsequencesbasedoncomparativesequence
analysis,inwhichhomologysearchisusedtoexaminewhetherputativencRNAsareevolu-
tionarilyconservedinrelatedgenomesandthusprovideskeyevidenceforncRNAannotation.
Conventionalcomparativesequenceanalysisbasedmethodsarenotwell-suitedfordetecting
ncRNAslackingstrongsequenceconservation.Thus,methodsthatutilizestructureconser-
vationinformationofncRNAfamiliesarealsodeveloped.SearchingncRNAsdirectlyfrom
genomicsequenceshasitslimitations.First,withoutinformationaboutwhichregionsare
transcribed,toolsbelongstothiscategoryneedtoscanthewholegenomicsequence.Thus
thesearchspaceisverylargeforevensmallgenomes.Thisleadstoverylongrunningtime,
especiallyfortoolsbasedonstructureanalysis.Second,duetothelargesearchspace,these
methodsincurhighfalsepositiverate.Forexmaple,thereexistover100,000regionswhich
canbefoldedasstem-looplikestructuresinE.coligenome[89].Ifasimplestructure-based
approachisusedtoidentifymiRNA,whichhasaverysimplestem-loopstructure,fromE.
coligenome,onegetsalargenumberoffalsepositivepredictions.
WithNGSdata,especiallyRNA-Seqdata,theselimitationscanbedrasticallyreduced,
ifnottotallyeliminated.RNA-SeqsequencescDNAsthataregeneratedfromRNAs.We
knowthatRNA-Seqreadsaregeneratedfromtranscribedregions.Mappingthesereadsback
6
tothereferencegenomeandworkingonregionsthatonlyhavetlymappingscan
greatlynarrowthesearchspace,improverunningtime,andreducethefalsepositiverate.In
addition,RNA-SeqwhichusesasmallRNApreparationprotocol(smallRNA-seq)provides
deepinsightstounderstandsmallncRNAs,whichrefertoncRNAswhoselengthsarebetween
18-26nt(miRNA,siRNA,etc).DeeplysequencedsmallRNAlibrariesprovideinformation
toaccuratelyidentifyexpressedsmallRNAs,aswellasquantifyingtheirexpressionlevels
intlibrarysamplesbasedonread-counts[6].
ThepreviouslypresentedmethodsthatutilizeNGSdataneedtohavethereference
genome/genomicsequence.However,withthedevelopmentofmetagenomics,thecomplexity
ofmicrobialcommunitiesandthehugeamountofmetagenomicdatamakesassembling
microbialgenomesachallengingproblem.Thepreviouslydescribedapproachescannotbe
readilyappliedtometagenomicdata.Inrecentyears,anewcategoryofncRNAsearch
methodsthatdonotrequireareferencegenomeisemerging:
denovo
ncRNAassembly.In
thistypeofapproach,readsthatbelongtotheinterestedncRNA(s)aregatheredfrom
RNA-Seqormetagenomicdata,andthenthesereadsareassembledtoproducethe
ncRNAsequences.
1.5Contributions
Inthisdissertation,Ipresentthreecomputationalmethodsandtoolstotlyidentify
ncRNAsfromlarge-scalebiologicaldata.Chain-RNAisatoolthatcombinesbothsequence
similarityandstructuresimilaritytolocatecross-speciesconservedRNAelementswithlow
sequencesimilarityingenomicsequencedata.Itcanachievecantlyhighersensitivity
inidentifyingremotelyconservedncRNAelementsthansequencebasedmethodssuchas
7
BLAST,andismuchfasterthanexistingstructuralalignmenttools.miR-PREFeR(miRNA
PREdictionFromsmallRNA-Seqdata)utilizesexpressionpatternsofmiRNAandfollows
thecriteriaforplantmicroRNAannotationtoaccuratelypredictplantmiRNAsfromone
ormoresmallRNA-Seqdatasamples.Itissensitive,accurate,fastandhaslow-memory
footprint.metaCRISPRfocusesonidentifyingClusteredRegularlyInterspacedShortPalin-
dromicRepeats(CRISPRs)fromlarge-scalemetagenomicsequencingdata.Itusesakmer
hashtabletotlydetectreadsthatbelongtoCRISPRsfromtherawmetagonmic
dataset.Overlapgraphbasedclusteringisthenconductedonthereduceddatasettosep-
aratetCRSIPRs.Asetofgraphbasedalgorithmsareusedtoassembleandrecover
CRISPRsfromtheclusters.
8
Chapter2
Chain-RNA:acomparativencRNA
searchtoolbasedonthe
two-dimensionalchainalgorithm
2.1Introduction
NoncodingRNAs(ncRNAs),whichfunctiondirectlyasRNAswithouttranslatingintopro-
teins,ishighlyimportanttomodernbiology.nttypesofncRNAplaydiverseand
importantbiologicalfunctions.Forexample,ithaslongbeenknownthattRNAandrRNA
playimportantrolesinproteinsynthesis.Morerecently,regulatoryrolesofsomesmall
ncRNAshavebeenrevealed[51,70],showingthatthefunctionsofncRNAsaremuchmore
complicatedthanwethought.Inaddition,advancesofRNA-seqandthenext-generation
sequencingtechnologieshaverevealedthatalargeportionoftranscriptomicdatacannotbe
mappedbacktoannotatedprotein-codinggenesinthereferencegenome.Examiningwhether
thesetranscriptscontainfunctionalncRNAsisimportanttoelucidatethecomplexityofthe
JikaiLei,PrapapornTecha-angkoon,YanniSun.Chain-RNA:acomparativencRNAsearchtoolbased
onthetwo-dimensionalchainalgorithm.IEEE/ACMTransactionsonComputationalBiologyandBioinfor-
matics(TCBB).10:2,274-285.2013.
9
transcriptomeofvariousorganisms.
Thestate-of-the-artmethodforncRNAidenisbasedoncomparativegenomics,
inwhichregionsshowinghighsequenceorstructuralsimilaritiesacrossrelatedgenomesare
examinedtoidentifyputativencRNAs.TherearetwotypesofncRNAsearchproblems.The
oneisoftenreferredtoasknownncRNAsearch.GivenasinglencRNAsequenceora
groupofhomologousncRNAswithannotatedstructures,stochasticcontext-freegrammar
(SCFG)-basedmethodscanaccuratelyhomologstotheannotatedncRNAs.Infernal[83]
andRSEARCH[56]aretwocommonlyusedtoolsinthiscategory.Thesecondtypeof
ncRNAsearchisoftenreferredtoas
denovo
ncRNAsearch[29,9],whichdoesnotrequire
priorinformationonthestructuresorsequencesofthencRNAstobefoundandthuscan
identifynovelncRNAs.Forgenome-scalepredictionofncRNAs,
denovo
searchusuallystarts
fromaligningsubstringswithtsequenceandstructuralsimilaritiesfromrelated
genomes.ThesealignmentswillgothroughfurtherexaminationforncRNAiden
ThisworkaimsatdevelopingalocalstructuralalignmentapproachfordenovoncRNA
search.Forconvenienceofdiscussion,ncRNAsearchinthisworkspreferstothe
secondproblem.
Existinggenome-scalencRNAsearchprogramssuchasQRNA[93],RNAz[116],and
EvoFold[87]stillusesequencealignmenttoolssuchasBLAST,ClustalW[109],orMUL-
TIZ[11]asthestepforncRNAsearch.Forallalignmentsgeneratedbypairwiseor
multiplesequencealignmenttools,acarefulscreeningisconductedtoclassifytheminto
functionalncRNAsorotherfeaturesusingheterogeneousevidence.Thus,theoverallsensi-
tivityoftheseprogramsislimitedbythesensitivityofthesesequencealignmentprograms
onncRNAsearch.However,conventionalsequencealignmentprogramsarenotwell-suited
fordetectingncRNAslackingstrongsequenceconservation.Itisknownthatsometypesof
10
ncRNAsmightexhibitabetterconservationofsecondarystructures.Forexample,highly
structuredRNasePandlineagespecncRNAssuchasXistorAirposehardcasesfor
BLAST-liketools[13,86].EvensomesmallncRNAssuchastRNAhavesequenceidentity
aslowas45%intheirannotatedhomologsfromthehumanandmousegenomes[2].Asa
result,whensequencealignmentprogramsareusedforncRNAsearch,ncRNAelementslack-
ingstrongsequenceconservationcanbemissed.Therefore,thereisaneedforagenome-scale
ncRNAsearchtoolthatcanincorporatebothsequenceandstructuralsimilarities.
Inthiswork,weintroduceanncRNAsearchtool
chain-RNA
,whichformulatesstructural
alignmentproblemasanextendedtwo-dimensionalchainproblem.Ourtoolusesthelocal
dotplotgenerationtoolRNAplfold[10]togeneratelocalbase-pairingprobabilitiesacrosstwo
genomes,basedonwhichweextractstems.Byrepresentingapairofpredictedstemsfrom
twoinputsequencesasarectangle,wecantakeadvantageoftheexistingchainalgorithm
tosolvethencRNAstructuralalignmentproblem.Chain-RNAprovidesacomplementary
ncRNAalignmentmethodtoconventionalalignmenttools.Chain-RNAismuchfasterthan
existingstructuralalignmenttoolsforncRNAsearch.ForncRNAswithlowsequencesim-
ilarity,ourexperimentsshowthatchain-RNAhasatlybetterbetween
sensitivityandfalsepositiveratethanBLAST.
2.2RelatedWork
SequencealignmenttoolshavebeenappliedtoncRNAsearch.Forexample,pairwisealign-
mentsproducedbyBLAST[3]areusedforncRNAsearchbyQRNA[93].Thepairwise
sequencealignmentsinthedatabaseforcomparativeregulatorygenomics(CORG)[20]and
themultiplesequencealignmentsgeneratedbyCLUSTALWareusedasthetestsetsfor
11
ncRNAsearchinRNAz[116].MULTIZ[11]isadoptedbyEvoFold[87]togenerateinitial
sequencealignmentsforncRNAexamination.Thepairwiseormultiplesequencealignments
generatedbyvarioussequencealignmenttoolsarescreenedforputativefunctionalncR-
NAsbasedonfeaturessuchasminimumfreeenergyandconservedsecondarystructures.
However,usingconventionalsequencecomparisontoolsasthebasisforncRNAsearchisin
dangerofmissingstructuralncRNAs.
AnumberofncRNAstructuralalignmentprogramsexist.Thesealignmentmethods
accepthomologousncRNAsasinputandtargetatgeneratingtheconsensusstructuresof
inputsequences.Tojustgiveafewexamples,thereareexpensivebutaccurateSankalgo-
rithm[98],fasteralignmenttoolssuchasSCARNA[107],MXSCARNA[106],Murlet[54],
FoldalignM[110],LocARNA[118],RAF[21],GrammarString[1],etc.Amajorityofthese
pairwiseormultipleglobalalignmenttoolsaredesignedtomaximizetheaccuracyofstruc-
turepredictionandarenotreadilyusableforgenome-widencRNAsearch.Inordertoapply
theseglobalalignmenttoolstogenome-scalencRNAsearch,the\slidingwindow"tech-
nique[112,29]isusuallyused.Thegenomesaredividedintosmallandoverlappingwindows
thatareusedasinputtotheglobalncRNAstructuralalignmenttools.
AveryrelatedworktooursisPLAST-ncRNA[15],whichconductslocalalignmentbe-
tweenaqueryRNAsequenceandalargegenomesequenceusingpartitionfunctionposterior
probabilities.TheauthorsshowthatPLAST-ncRNAachieveshighaccuracyinidentifying
distantlysimilarncRNAshiddeninthetargetgenome.AmajorbetweenPLAST-
ncRNAandthemoregeneralschemeofncRNAsearchisthattheformerisdesignedto
compareashortqueryandalongsequence.Itlimitsthesizeofthequerysequence.
SCARNA
LM[105]cangeneratelocalmultiplealignmentforthedetectionofncRNAs.It
isbasedonadiscriminativepairwisealignmentmodelwhichincorporatessecondarystructure
12
Figure2.1Pipelineofchain-RNA.
featuresasbasepairingprobabilities.Althoughitsstrengthliesinmultiplealignment,it
canbeappliedtopairwisehomologysearch.
FOLDALIGN[38,39]isahighlysensitivelocalstructuralalignmenttoolthatcanidentify
ncRNAswithverylowsequencesimilarity(
<
40%).Usingheuristicssuchasdynamic
programmingmatrixpruning,FOLDALIGNisfasterthantheaccurateimplementationof
theSankalgorithm[98].However,itisstillCPU-intensiveonlargedatasets.
2.3Methods
Inthissection,wedescribethemethodbehindchain-RNAinlocatingconservedncRNAshid-
denintwogenomicsequences.Thebasicrationalebehindourmethodisthatthesecondary
structuresareimportanttothefunctionsandthusarewell-conservedduringevolution.Iden-
13
Figure2.2(A).AdotplotforatRNAsequence.Thelower-lefttrianglerepresentsthe
MFEstructure.Theupper-righttriangleencodesbase-pairingprobabilitiesandisusedin
ourcomputation.Astemishighlightedinabox.(B).PartialoutputofRNAplfoldfora
longsequence.Ahighlightedtriangleshowsthelocalbase-pairingprobabilitydotplotfora
substringofsizet.
tifyingtheregionsdemonstratingsimilarstructuresprovidesthebasisforncRNAprediction.
ItisknownthatsomehomologousncRNAsdonotsharehighstructuralsimilarity.These
ncRNAsposehardcasesforchain-RNAandotherncRNAsearchtoolsthatrelyonstructural
similarity.
Ourpipelineconsistsoffourmainstages:generatinglocalbase-pairingprobabilities,
extractingputativestemsfrombase-pairingprobabilities,andaligningstemsusing
anextendedtwo-dimensionalchainalgorithm.Fig4.7showstherelationshipandinformation
wbetweenthemainstages.Forapairoflonggenomicsequences,weapplyalocal
foldingalgorithmtocomputelocalbase-pairingprobabilities,basedonwhichalargenumber
ofcandidatestemsareextracted.Aslidingwindowstrategy[112]isusedtodividethe
genomicsequenceintosmall,overlapping,anddwindows.Foreachwindowpair
extractedfromtwoinputsequences,structuralalignmentisperformedusinganextended
chainalgorithm.Inordertoimprovetheaccuracyofthefoldingstep,weallowstemswith
lowprobabilities.Astageinthepipelineisusedtodecreasethesearchspaceand
reducetheinputsizetothestructuralalignment.Themainnoveltyofouralgorithmisto
14
formulatethestructuralalignmentasachainproblem,whichhastsolutions.The
detailofthealgorithmwillbedescribedinthefollowingsections.
2.3.1Localbase-pairingprobabilities
WeuseRNAplfold[10]togeneratelocalbase-pairingprobabilitiesforinputgenomicse-
quences.RNAplfoldimplementsalocalfoldingalgorithmthatcomputeslocalbase-pairing
probabilitiesforbasepairswithamaximalspan
t
.Forasequenceoflength
L
andawindowsizeof
t
,RNAplfoldhasmemoryandtimecomplexity
O
(
L
+
t
2
)and
O
(
Lt
2
),
respectively.As
t
isusuallymuchsmallerthan
L
,itispracticaltoapplyRNAplfoldto
verylargegenomes.Thebase-pairingprobabilitiesareencodedinadotplot,inwhichthe
valueat(
x
,
y
)givestheprobabilitythatbaseatposition
x
ispairedwithbaseatposition
y
.
Fig2.2.(A)showsthedotplotforatRNAsequence.Thesizesofdotsareproportionalto
thebase-pairingprobabilities.Thus,thebasepairsinawell-conservedstemformanotable
anti-diagonalinadotplot.Therearefourprominentanti-diagonalsinFig2.2.(A),corre-
spondingtothefourstemsinatypicaltRNAstructure.Fig2.2.(B)showsapartialdotplot
outputbyRNAplfoldonalargesequence.Thislongdotplotcanbedividedintosmaller
trianglesforderivinglocalbase-pairingprobabilitiesforasub-stringofsize
t
.
2.3.2Stemextractionfromdotplots
FollowingpreviousworksuchasSCARNA[107]andDotKnot[103],weevaluatethesimi-
laritybetweendotplotsbyaligningthestem-loopstructuresextractedfromthedotplots.
Astem
S
isformedbystackingbasepairsthatmeetthefollowingcriteria:

Theprobabilityofeachbasepairislargerthanaspthreshold
P
base
.Bydefault,
15
P
base
=0
:
05inchain-RNA.

Theaverageoftheprobabilitiesofallbasepairsislargerthanaspthreshold
P
avg
.Bydefault,
P
avg
=0
:
25inchain-RNA.

Atmostoneinteriorlooporbulgeofsizeoneisallowedinthestem.
Foreachstem
S
inasequence
x
,epropertiesare

Startposition
S
:P
start
:thepositionofthebaseinthe5'partofthestem.

Endposition
S
:P
end
:thepositionofthelastbaseinthe3'partofthestem.

Giventhestartandendposition,wedenotethestem-loopsizeusing
S
:d
,whichis
S
:P
end
S
:P
start
+1.

probability
S
:C
:thesumofthebase-pairingprobabilitiesofallbasepairs
inthestem.
S
:seq
:thesubsequencein
x
from
S
:P
start
and
S
:P
end
.
Fig2.3illustratespropertiesofastem.Whenextractingstemsfromdotplots,theprobability
threshold
P
base
issettobeaverysmallnumber(0.05bydefault),enablingustoinclude
stemswithsmallprobabilities.Thereasonisthatdotplotsgeneratedbydenovo
structurepredictiontoolsdonotalwaysthetruestructure.Someconservedbase
pairsmayhavesmallpredictedprobabilities.Whenastemwithsmall
S
:C
iswellconserved
acrosstspecies,ourstemalignmentprograminthenextstepwillbeabletorecognize
them.
16
Figure2.3Propertiesofastem.Thestemconsistsofvebasepairsandabulgeofsizeone.
2.3.3Stemalignmentasatwo-dimensionalchainproblem
Afterextractingallstemsandgeneratingtheirproperties,wecanconductstructuralalign-
mentbyaligningstems.HomologousncRNAswithstrongstructuralsimilarityareexpected
togenerateahigh-scoringstemalignment.Thusthestemalignmentscorecanbeusedto
discriminatetruehomologsfromunrelatedsequences.Notethatalthoughonlybase-pairing
probabilitiesareforeachstem,thesequencesimilaritywillbeevaluatedduringalign-
ment.Apseudoknot-freencRNAincludesparallelornestedstems.Inordertogeneratea
legalalignmentwiththeaboveconstraints,weformulatethestemalignmentasanextended
chainalignmentproblem.Inthissection,westreviewthebasicchainproblemas
inRef.[33].
2.3.3.1Thebasicchainproblem
Let
R
beasetof
n
rectanglesinaplane.Weuse
r
todenotearectangle.
v
istheweightof
r
.Whenthereisaneedtodistinguishtrectangles,subscriptsareused.Forexample,
the
i
th
rectanglein
R
is
r
i
,whichhasanassociatedweight
v
i
.Asubsetof
R
iscalleda
chain
ifnohorizontalorverticallineintersectsmorethanonerectangleinthesubsetandif
therectanglescanbeorderedsothateachoneisbelowandtotherightofitspredecessor.
Theweightofachainisthesumoftheweightsoftherectanglesinthechain.Thebasic
chainproblemistoachain
C
withthemaximumweight
C:weight
overallchainsin
R
.
17
Figure2.4Thebasicchainproblem.Twochainsstartingfromrectanglewithweight3are
showedandthechainthatincludesrectangleswithweights3,2,and9isthemaximum
weightedchaininthegivenrectangleset.
Fig2.4showsthemaximumweightedchaingivenasetofrectanglesandtheirweights.
Thebasicchainproblemcanbetlysolvedin
O
(
n
log(
n
))timebyemployingabalanced
binarysearchtree[33].Somebioinformaticsproblemssuchassubstringsimilaritysearch
havebeenformulatedandsolvedasthebasicchainproblem[33].
Inordertoillustratetherelationshipbetweenthechainbasicproblemandstemalign-
ment,weewhatmakesarectangleinourproblem.Foreachinputsequence,we
extractasetofcandidatestemsfromthedotplotoutputbyRNAplfold.Whenthereare
twoinputsequences,wehavetwosetsofcandidatestems,eachofwhichaputative
stemstructureinoneinputsequence.Arectangle
r
isformedbyanytwostem-loopsfrom
thetwoinputsequences.Inordertodescribethepositionandthesizeofarectangle
r
,we
denoteitsupper-leftcorneras
r:left
and
r:top
.Similarly,thelower-rightcornerisdenoted
by
r:right
and
r:bottom
.Foreachstem
S
i
fromthesequenceandeachstem
S
j
fromthe
secondsequence,(
r:left
,
r:top
)=(
S
i
:P
start
,
S
j
:P
start
)and(
r:right
,
r:bottom
)=(
S
i
:P
end
,
S
j
:P
end
).Fig2.5.(A)showshowstemsformrectangles.
18
Thepositionalrelationshipbetweenrectanglesrepresentsntstructuralmotifs.
Fig2.6showsalltwopermittedrelationsbetweentworectanglesinapseudoknot-freestem
alignment.RectanglesthatdonotoverlapinXorY-axisrepresentparallelstems.Nested
rectanglesrepresentnestedstems.Pseudoknotscanalsoberepresentedbyoverlappingrect-
angles.Ourcurrentalignmentprogramdoesnotpermitpseudoknots.
Figure2.5(A).Theformationofarectanglebytwostems.Thinarcsrepresentbasepairs.
Fourrectanglesareformedbytwostems.Thenumberofbasepairsinthetwostemsforming
arectanglecouldbet.
r
2
isanexample.(B).Asubsetofrectanglesbetweentwo
sequences.Stemsarerepresentedbythickarcs.Thebasicweightislistedforeachrectangle.
Themaximumweightedchainandthecorrespondingstemsarehighlighted.
Thebasicchainproblemcanonlyhandleparallelstems.Inordertoincorporatenested
stems,weformulatestemalignmentasanextendedchainproblemasfollows:
2.3.3.2TheExtendedchainproblem
Let
R
beasetof
n
rectanglesinaplane.Eachrectangle
r
i
in
R
hasanassociatedweight
v
i
.
Asubsetof
R
iscalleda
chain
ifanytworectanglesinthesubsetsatisfy:1)oneisnested
intheother,or2)theupper-leftcornerofonerectangleislocatedbelowandtotherightof
thelower-rightcorneroftheother.Theweightofachainisthesumoftheweightsofthe
19
Figure2.6Legalrelationsbetweentworectanglesinavalidstemalignmentwithoutpseu-
doknots.Eacharcrepresentsabasepair.(A).Parallelstemsleadtonon-overlapping
rectangles.(B).Nestedstemsleadtonestedrectangles.
rectanglesinthechain.Theextendedchainproblemistoachain
C
withthemaximum
weight
C:weight
overallchainsin
R
.
Fig2.5.(B)showsanexampleforillustratingthebetweenthemaximumweighted
basicchainandmaximumweightedextendedchain.Forthegivensetofrectangles,themax-
imumweightedbasicchaincontainstworectangleswithweights130and30.Themaximum
weightedextendedchaincontainsthreerectangleswithweights130,100,and30.Andthe
extendedchaincorrectlythecommonstemstructuressharedbythetwosequences
whilethebasicchainmissesanestedstem.
2.3.3.3Weightofarectangle
Bycarefullyngtheweightofarectangle,themaximumweightedextendedchainrep-
resentsthecommonsecondarystructuressharedbytwosequences.Ifthetotalweightof
thechainisaboveagivenweoutputthestemalignment.Inordertoutilizingthe
20
tbasicchainalgorithmtosolvethestemalignmentproblem,wethreeweights
foreachrectangle
r
,thebasicweight
r:bw
,theinsideweight
r:iw
,andtheweight
r:fw
.
Thebasicweightofarectanglerepresentsthematchingscoreofthetwostemsformingthe
rectangle.Itisindependentofotherrectangles.Wethebasicweightofarectangle
r
formedbystem
S
1
andstem
S
2
as:
r:bw
=
w
1

SeqScore
+
w
2

BPprobScore

w
3

DistancePenalty
Here
w
1
,
w
2
,
w
3
areadjustableweightparameters.
SeqScore
isthenormalizedglobal
alignmentscorebetweentheloopregionsofstem
S
1
andstem
S
2
usingtheRIBOSUM
scorematrix[56].
BPprobScore
isthesumoftheprobabilitiesofthetwo
stems.
DistanceScore
isusedtopenalizestempairsthathaveabigintheir
stem-loopsizes.Weusethesameformulafor
DistancePenalty
between
S
1
and
S
2
isasin
SCARNA[107]:
DistancePenalty
=
p
jS
1
:d
S
2
:d
j
Theinsideweight
r:iw
givestheweightofthemaximumweightedchainformedbyall
rectanglesinside
r
.Inordertoincorporatenestedstems,wetheweightofa
rectangle
r
asbelow:
r:fw
=
8
<
:
r:bw
+
r:iw;
if
I
r
6
=
?
and
r:iw>
0
r:bw;
otherwise
21
Here,
I
r
isthesetofrectanglesthatlocateinside
r
.Whentheweightofthemaximum
weightedchaininside
r
ispositive,theweightof
r
includesboththebasicweightof
r
andtheweightofthemaximumweightedchaininside
r
.Whentherearenorectangles
inside
r
,ortheweightorthemaximumweightedchainisnon-positive,wethe
weightof
r
asitsbasicweight.If
r
ischoseninthemaximumweightedchain,boththe
stemsforming
r
andthestructurescorrespondingtotheinternalchainareusedasthe
structureannotation.
2.3.3.4Extendedchainalgorithm
Inthissectionwepresentthepseudocodefortheextendedchainalgorithm.Thebasicidea
isthatweusetheweightofarectangleinthebasicchainalgorithm.Whenarectangle
ischosenintheoptimalalignment,itsmaximumweightedinsidechainisalsoincluded
(ifthereexistsone).Thus,thebasicchainalgorithmcanbeappliedtothestemalignment
whentherearenestedstems.
Inouralgorithm,inputrectanglesaresortedby
r:left
inascendingorder.Foreach
rectangle
r
,welocateallrectanglesinside
r
.Thenthebasicchainalgorithmisusedto
themaximumweightedchaininsideeach
r
(Line6-17inFig2.7)fromthelastrectangleto
theone.Astherectanglesweresortedandweprocessrectanglesfromthelastoneto
thewhenwestarttoprocessthe
i
th
rectangle
r
i
,themaximumweightedchainsand
theweightsofalltherectanglesinside
r
i
havealreadybeencomputed.Usingthe
weightsofprocessedrectangles,thebasicchainalgorithmcanbereadilyappliedtocompute
themaximumweightedchaininside
r
i
(Line18inFig2.7).Procedure
BasicChain
inFig
2.7computesthemaximumweightedbasicchainanditsweightonasetofrectangles.As
thealgorithmforthebasicchainproblemhasbeenpresentedinpreviouswork,wereferusers
22
to[33]forthedetailedpseudocode.
Recursivelynestedrectangles,whichcorrespondtonestedstemsformorethantwolevels,
areautomaticallyprocessedbythealgorithm.Figure2.8showsanexample.Accordingto
theresultofsorting,
r
3
isprocessedBecausethereisnointernalrectangleinsideit,its
weightis10.Then
r
2
isprocessed.Ithasoneinternalrectangle
r
3
.Thebasicchain
algorithmreturns
r
3
asthemaximumweightedchaininside
r
2
.So
r
2
hasaweightof
30+10=40.Whenprocessing
r
1
,ithastwointernalrectangles,
r
2
and
r
3
,withweights
40and10,respectively.
r
2
hasalreadyincorporated
r
3
asitssubstructure.Byusingthe
basicchainalgorithmon
r
2
and
r
3
,
r
2
isselectedasthemaximumweightedinternalchain
for
r
1
.Theweightof
r
1
is120+40=160andthechainconsistsof
r
1
,
r
2
,and
r
3
.
Theconsensusstructureoftheinputsequencescontains3recursivelynestedstems.
Hereweanalyzetherunningtimeofthealgorithm.Supposethenumberofrectangles
inthesetis
n
andthenumberofrectanglesinsidearectangle
r
i
is
m
i
.Thesortingin
line1takes
O
(
n
log(
n
))time.Decidingtheinternalrectanglesetsofallrectanglesisthe2-
dimensionalrectangularrangesearchproblemincomputationalgeometryandcanbesolved
in
O
(
P
n
1
m
i
+
n
log(
n
))time[18].Line7-14takes
O
(
m
i
log(
m
i
))time,soline6-16takes
O
(
P
n
1
m
i
log(
m
i
))time.Line17takes
O
(
n
log(
n
))time.Sothetotalrunningtimeofthe
algorithmis
O
(
n
log(
n
)+
P
n
1
m
i
+
P
n
1
m
i
log(
m
i
)).Thenumberofinternalrectanglesis
usuallysmallformostrectangles.Inaddition,thebasepairingmatrixforasequenceis
usuallyasparsematrix.Weapplythreshold
P
base
and
P
avg
todiscardstemswithsmall
probabilitiesandadoptseveralstrategies(SeeFiltrationSection)toremoverect-
anglesthatarenotlikelytobeinacorrectalignment.Thus,thenumberofrectanglesis
smallerthantheinputsequencelengthonaverage.Someexamplesof
n
willbegiveninthe
ExperimentalResultsSection.
23
Figure2.7Algorithmforcomputingmaximumweightedextendedchainonasetofrectangles
Figure2.8Anexampleshowinghowrecursivelynestedrectanglesarehandledintheproposed
extendedchainalgorithm.
r
1
:bw
=120,
r
2
:bw
=30,
r
3
:bw
=10.
24
2.3.3.5Filtration
Inchain-RNA,weusetwotypesofrationtospeedupourtool:

Filtrationofwindowpairsthatareusedtoformrectangles.Allwindowpairswith
lowstructuralandsequencesimilaritywillbediscarded.Fortwogenomicsequences
oflength
M
and
N
,thetotalnumberofwindowpairsis
b
M

W
c
+1


b
N

W
c
+1

.Here
W
and

areslidingwindowsizeandstep,respectively.Thisbecomesverylargeifthe
inputsequencesarelong.Infact,mostofthewindowpairscontainnon-homologous
sequences.Inchain-RNA,weuseasimpletoremovewindowpairsthatarenon-
homologousbasedontheirsequenceandstructuralsimilarity.First,ifthesequence
identityissmallerthanathreshold,thewindowpairisdiscarded.Second,
Wediscardwindowpairsthathavesmallnumberofrectangles.Therationalehere
isthatweexpecttoseeatleastsomenumberofwell-conservedstemsinapairof
homologousncRNAs.

Filtrationofrectanglesthatareusedasinputtotheextendedchainalgorithm.As
weallowstemswithsmallbase-pairingprobabilities,alargenumberofstemscanbe
generatedusingRNAplfold.However,manyofthemarenotlikelytobeincludedin
theoptimalchain.First,weexpectthatrectanglessharedbytruehomologstend
tohavesimilarsagainstthestartingpositionsofthehiddenncRNAs.Thus,we
excluderectangleswith
j
r:left

r:top
j
>˝
or
j
r:right

r:bottom
j
>˝
.Second,all
rectangleswithnegativeweightsareremovedfromfurtherprocessing.Asweusethe
slidingwindowstrategytoconductlocalalignment,usingasmall
˝
willjeopardizethe
sensitivity.Ontheotherhand,usingabig
˝
willtlyincreasethecomputa-
tionalcostandalsoleadtohighFPrate.Inourexperiments,wechooseavaluefor
˝
25
inordertoachieveacceptablesensitivityand.
Thedetailedparametersincludingthewindowsize,thestepsize,and
˝
willbegivenalong
theexperimentsetup.
2.4ResultsandDiscussion
Weconductedtwosetsofexperimentstoevaluatetheperformanceofchain-RNA.Asthe
maingoalofncRNAsearchistolocatethepositionsofncRNAgeneshiddeninlonggenomic
sequences,ourperformanceevaluationexamineshowmanyoutputsoverlapwithactual
ncRNAs.Intheexperiment,weusedchain-RNAtosearchannotatedhomologous
ncRNAsingenomicsequencesandevaluateditssensitivityandfalsepositive(FP)rate.In
thesecondexperiment,weappliedchain-RNAtonovelncRNAsearchinthetranscriptome
ofthe
Burkholderiacenocepacia
genomebycomparingputativetranscribedregionsagainst
asetofrelatedbacterialgenomes.Ourexperimentalresultsshowthatchain-RNAhasa
tbetterbetweensensitivityandFPrateforncRNAsearchthanBLAST
andcanidentifynovelncRNAs.
2.4.1Evaluatingtheperformanceofchain-RNA
2.4.1.1BenchmarkConstruction
Inordertoevaluatetheperformanceofchain-RNA,weconstructedabenchmarkdataset
thatcontains36typesofncRNAsand1,664pairofhomologousgenomicsequencesfromthe
BRAliBase2.1[119].ThemajorstepstoconstructourbenchmarkaresimilartoPLAST-
ncRNA[15].
26
2.4.1.1.1SelectionofncRNAs.
AsBRAliBase2.1containsadiversesetofncRNAs,
weusedtheK2subsetofBRAliBase2.1asthestartingpoint.TheK2datasetcontains
thousandsofhomologousncRNApairsfor36ncRNAfamilies.Foreachfamily,werandomly
selected50pairsfromK2.Thisgeneratedadatasetcontaining1,664pairs.Toevaluate
theperformanceofthebenchmarkedtoolsontsequencesimilarities,weseparatethe
datasettothreedatasetsaccordingtotheirpairwisesequenceidentity.Asitiscommonly
believedthatthesequenceidentityaround60%isthetwilightzoneforncRNAalignment[27],
weuse60%and70%asThedatasetcontainsncRNAswhichhavesequence
identityfrom0-60%.TheseconddatasetcontainsncRNAshavingsequenceidentityfrom
60%-70%.ThelastdatasetcontainsncRNAswithsequenceidentityrangingbetween70%
and100%.Thenumberofpairofsequencesofthethreesubsetinourbenchmarkdatacan
befoundinTable2.1.
2.4.1.1.2Addingcontextsequences.
AsncRNAsearchshouldbeabletodetecthid-
denncRNAsingenomicsequences,werecoveredthegenomiccontextforallchosenncRNA
homologsfromthepreviousstep.ForeachncRNAsequence,itssurroundinggenomicse-
quencewasobtainedfromtheEMBLwebsite
1
accordingtotheaccessionnumber,and
thencRNAwasembeddedinthecontextsequences.Weadded2000ntcontextgenomic
sequencestoeachendofthencRNA(Ifthegenomiccontextisshorterthan2000,weused
theactualsize).Foreachcontextgenomicsequence,regionsannotatedasprotein-coding
genesinEMBLweremaskedfromfurtheranalysis.Thisbreaksagenomicsequenceintoa
setofintergenicregions(IGRs).WediscardedallIGRsshorterthanthewindowsizesetfor
RNAplfold(i.e.100inourexperiment).WealsocheckedtheexistingncRNAannotation
1
http://www.ebi.ac.uk/embl/
27
onthecontextgenomicsequencestomakesureourchosenncRNAsaretheonlyncRNAsin
thesequencepairs.
28
29
Table2.1Informationofthedatasetandtotalrunningtimeofeachbenchmarkedtooloneachdataset.Column
length
shows
theaveragelengthofgenomicsequencesinadatasetaftermaskingtheproteincodinggenes.Theaveragenumberofrectangles
areshownforapairofwindowsofsizes100nt

100nt.Thetimeofchain-RNAincludestherunningtimeofRNAplfold.
sequencesequencerunningtime(h)
identitypairslengthrectangleschain-RNAFOLDALIGNMXSCARNABLASTPLAST-ncRNA
0%-60%822219923.284.92055.9512.80.240.53
60%-70%438166119.516.1601.2147.90.110.25
70%-100%304156625.17.57324.0108.20.070.14
2.4.1.2ofsensitivityandfalsepositiverate
WeevaluatetheperformanceofhomologysearchtoolsusingsensitivityandFPrate.First
wea
hit
.Mosthomologysearchtoolsrelyonascoretodeterminewhetheran
alignmentscoreindicatesahomologouspair.Forexample,BLASTusesanE-value
Forchain-RNA,FOLDALIGN,PLAST-ncRNA,andMXSCARNA,weuseascore
AnypairofsubstringswithalignmentscoreorE-valuesatisfyingthecis
a
hit
.Foranalignmenttool,letthesetofhitsoutputbythetoolbe
H
.Letthesetof
truehomologoussequencepairsembeddedinthegenomicsequencesbe
H
+
.Ahitisatrue
hitifandonlyiftheoverlappingpercentagebetweenahitandthetruehomologouspairsis
aboveagiventhreshold.Forapairofinputsequenceswithsizes
M
and
N
,wethe
sensitivityandFPrateforapairofinputsasfollows:
sensitivity
=
j
H
\
H
+
j
j
H
+
j
FPrate
=
j
H
n
H
+
j
MN
H
\
H
+
isthesetoftruehits.
H
n
H
+
isthesetofallfalsehits.OurtionofFP
ratefollowsthestandardinwhich
FPrate
=
FP
FP
+
TN
.Inourproblem,
FP
is
j
H
n
H
+
j
.The\conditionnegative"setbyFP+TNincludesallthehitsthatdonot
sharerequiredoverlapwiththeactualncRNAs.AsthenumberofTPhitsisfarsmaller
thanthespaceofTN,weuse
MN
asanestimateofFP+TN.Thisestimationleadstoan
underestimatedFPrateforallbenchmarktoolsbecause
MN
actuallyincludestheTPhits.
ForeachtypeofncRNA,wehavemultiplepairsofinputgenomicsequences.Thesen-
sitivityandFPrateforatypeofncRNAiscomputedastheaverageofsensitivityandFP
30
rateforeachpairofsequences.
2.4.1.3PerformanceofncRNAsearchtools
Weappliedchain-RNA,BLAST,PLAST-ncRNA,MXSCARNA,andFOLDALIGNtothe
benchmarkdataset.Forchain-RNA,thewindowsizeforthegenome-scalefoldingby
RNAplfoldissetto100.Duringthesearchofchain-RNA,weusedaslidingwindowof
size100withaslidingstepof20.Allwindowpairsthatdonotmeetthestrategy
werediscarded.Weusedthebl2seqprogramoftheNCBIBLASTsuite(version2.2.25)to
searchhomologousncRNAsbetweeneverypairofgenomicsequencesinourbenchmarkdata
set.Inordertoboostthesensitivity,weusedwordsize4,whichistheminimumwordsize
allowed.Otherparametersaresettotheirdefaultvalues.PLAST-ncRNAisdesignedto
generatelocalalignmentsbetweenashortqueryandalonggenomicsequences.Itdoesnot
acceptanyofourbenchmarksequencepairsasinputbecausetheinputsequencesexceedthe
lengthlimit.Thus,foreachpair,weusedaslidingwindowofsize100withstep20onone
ncRNAsequenceandusedtheothergenomicsequenceasanotherinput.MXSCARNAisa
globalstructuralalignmenttoolforncRNA.Ithastheadvantageofincorporatingposterior
base-pairingandalignmentprobabilitiesinthestructurealignment.SinceMXSCARNAis
aglobalalignmenttool,wemanuallycutthelonggenomicsequencesintowindowsofsize
100withaslidingstepsizeof20.Globalalignmentwasperformedoneachpairofwindows
fromtwoinputsequences.FOLDALIGNcanoutputlocalstructuralalignments.Wecutthe
longgenomicsequencesintowindowsoflength200,whichisthemaximumlengthallowed
byFOLDALIGN.ThelocalmodewasusedtosearchhomologousncRNAsoneachwindow
pair.Wealsotestedthe\scan"modeofFOLDALIGN.Butitwasmuchslowerthanthe
localmodeonourdatasetandthuswechosetoreporttheresultsunderthelocalmode.
31
Wetestedtheperformanceofthebenchmarkedtoolsusingtoverlapthresholds.
Fig.2.9,Fig.2.10,andFig.2.11showtheROCcurvesofeachbenchmarkedtoolonthe
threedatasetsusingoverlapthreshold20%and50%.ThedatapointsforBLASTwere
generatedusingword-size4andE-value0.00001,0.0001,0.001,0.01,0.1,1,10,100,
200,500,and1000.SomepointsoverlapbecauseofhighsimilarityinsensitivityandFP
rates.Thedatapointsforchain-RNAweregeneratedusingscorefrom20to800with
stepsize50.WemotheMXSCARNAsourcecodetooutputthealignmentscoresand
usedscoreonthealignmentscorestogeneratedatapoints.ForPLAST-ncRNA,we
usefrom0.1to0.9withastepof0.1ontheoutputaverageposteriorprobabilities.
TherunningtimeofeachtooloneachdatasetisshowninTable2.1.
TheROCcurvesshowthattheperformanceofthetestedtoolsimproveswiththeincrease
ofthesequenceidentity.Theperformancencebetweenthestructuralalignmenttools
andBLASTalsobecomessmallerwiththeincreaseofthesequenceidentity.Chain-RNA
archivedslightlybetterperformancethanthetwostructuralalignmenttoolsMXSCARNA
andFOLDALIGNonthedatasetwiththelowestsequenceidentitywhentheoverlap
is20%.FOLDALIGNperformedbestandChain-RNAwasthesecondbestwhentheoverlap
is50%onthisdataset.Withtheincreaseofsequenceidentity,thethreestructural
alignmenttoolstendtohavesimilarperformance.However,therunningtimeofchain-RNA
ismuchshorterthanMXSCARNAandFOLDALIGN.
BLASTandposteriorprobabilitybasedtoolPLAST-ncRNAareverytimetcom-
paredtostructuralalignmenttools.OnereasonisthatneitherBLASTnorPLAST-ncRNA
outputssecondarystructuresofthealignments.BLASTshowstheworstbetween
thesensitivityandFPrateofthetestedtools.WeobservedthatBLASThasgoodper-
formanceonsometypesofncRNAswithhighsequenceidentity,suchasU2andSRP
euk
32
elements.Theiraverageidentityinourtestsetisover80%andthusBLASTistin
thatcase.However,whenthesequenceidentitydecreases,itbecomesforBLASTto
distinguishthencRNAsfromthesurroundinggenomicsequences.Asweusedthesmallest
wordsizeallowed,thelowsensitivityismorelikelytobecausedbythedynamicprogramming
stepratherthanthewordstep.Thelowsequenceidentityleadstolowalignment
scoresandthusbigE-values.Ontheotherhand,basedonposteriorprobability,PLAST-
ncRNAshowssuperiorperformanceinidentifyingthencRNAswithlowsequenceidentity.
However,theirhighestsensitivityislimitedevenunderthelowestavailablescoreThe
highestsensitivityisaround43%forthedatasetwiththesequenceidentitybelow60%with
overlap50%.NotethatwecannotfurtherimprovethesensitivityofPLAST-ncRNA
becausewehaveusedtheminimumposteriorprobabilityoutputbyPLAST-ncRNAasa
Still,usingthealignmentposteriorprobabilityispromisingaccordingtoexperimen-
talresults.Ontheotherhand,whileMXSCARNAincorporatedboththealignmentand
base-pairingposteriorprobability,itdoesnotdemonstratesuperiorperformance.
Wealsoevaluatedtheperformanceofttoolsusingthestartpositionandtheend
positionbetweenthepredictedandactualncRNAs.ForMXSCARNA,FOLDALIGN
andChain-RNA,weusedallthehitsoutputunderthedefaultparametersettings.For
BLAST,weusedallthehitsgeneratedwiththedefaultE-value10.0.ForPLAST-ncRNA,
allhitsoutputundertheminimumposteriorprobabilityareused.Foranytoolsthatoutput
multiplehitsforapairofncRNAs,wepickedthehitthathastheminimumstartposition
tothestartpositionoftheactualncRNAs,andthehitthathastheminimum
endpositionditotheendpositionoftheactualncRNA.Thedistributionsofthe
ofthepredictedpositionsandtheactualncRNApositionsareshowninFig.1and
Fig.2intheAppendix.ThedistributionsforChain-RNA,MXSCARNA,andFOLDALIGN
33
aresimilar,whileChain-RNAandMXSCARNAhaveloweststandarddeviationsandmin-
imumabsolutemeanvalues.However,Chain-RNAismuchfasterthanMXSCARNAand
FOLDALIGN.BLASThaslargestandarddeviationsandthemeanvaluesarefarawayfrom
zerocomparedtoothertools.ThereasonisthatBLASTislocalandtakesthewholegnomic
pairsasinput.ForsomencRNApairs,especiallyoneswithlowsequenceidentity,thepre-
dictedstartandendpositionsarehundredsorthousandsbasesfarawayfromtheactual
ncRNApositions.ThisindicatethatBLASTisnotsuitableforlocatingremotehomologous
ncRNAs.
WealsotestedSCARNA
LM,whichisamultiplelocalalignmenttool.Forallpairwise
sequencesinourbenchmarkdataset,SCARNA
LMdidnotoutputanyhit.Changing
parametersdidnothelp.However,SCARNA
LMoutputlocalalignmentwhentherewere
multiplesequencesasinput.Insummary,wewerenotabletomakeSCARNA
LMworkon
thebenchmarkdataset.
2.4.1.4Filtration
Ourtoolusesantonstrategytoremovewindowpairsthatarenotlikelyto
containtruencRNAs.Inaddition,rectanglesthatarenotlikelytobeinthemaximum
weightedchainarealsodiscarded.Usingthedefaultsetting,about30%ofthewindowpairs
arewithoutfurtherprocessing.Thisimprovesthetimeofourpipeline.
Table2.1showstheaveragenumberofrectanglesforeachwindowpairofsize100ntafter
applyingtherectanglestrategy.Theaveragenumberofrectanglesisusuallysmaller
thantheinputsequencelength.
34
Figure2.9ROCcurvesforbenchmarkedtoolsundertoverlaponthedata
setwithsequenceidentityfrom0%to60%.\X/N"representstheROCcurveforthetool
XundertheoverlapN%.Forexample,\Chain-RNA/20"meanstheROCcurveis
generatedforChain-RNAusingtheoverlap20%.
Figure2.10ROCcurvesforbenchmarkedtoolsundertoverlaponthedata
setwithsequenceidentityfrom60%to70%.\X/N"representstheROCcurveforthetool
XundertheoverlapN%.Forexample,\Chain-RNA/20"meanstheROCcurveis
generatedforChain-RNAusingtheoverlap20%.
2.4.1.5Analysisofthefalsepositivehits
Theresultsoftheexperimentshowchain-RNAgenerateshighscorehitsthatsharenooverlap
withannotatedncRNAhomologs.Thesehitsareintofalsepositivehits.However,
acloserlookdemonstratesthatthesehitsshowstrongstructuralsimilarityandthusthey
tendtogainahighstemalignmentscore.Fig2.12showsthederivedstructuresforthree
35
Figure2.11ROCcurvesforbenchmarkedtoolsundertoverlaponthedata
setwithsequenceidentityfrom70%to100%.\X/N"representstheROCcurveforthetool
XundertheoverlapN%.Forexample,\Chain-RNA/20"meanstheROCcurveis
generatedforChain-RNAusingtheoverlap20%.
falsepositivehitswithhighscores.Allthreeexamplesshowthattheinputsequenceshave
strongstructuralconservation.Thus,additionalevidencebeyondsequenceandstructural
similarityisneededtoscreenfunctionalncRNAs.
Figure2.12Derivedstructuresforthreepairsoffalsepositivehits.
36
Table2.2Selectedbacterialstrainsandtheirchromosomeinformation.
StrainChrGenBankIDLength(nt)
CupriavidusChr1CP0003523,928,089
metalliduransCH34
CupriavidustaiwanensisChr1CU6337493,416,911
LMG19424Chr2CU6337502,502,411
PolynucleobacterChr1CP0010101,560,469
necessarius
RalstoniaeutrophaChr1AM2604794,052,032
H16Chr2AM2604802,912,490
RalstoniaeutrophaChr1CP0000903,806,533
JMP134Chr2CP0000912,726,152
RalstoniaeutrophaChr1CP0010683,942,557
pickettiiChr2CP0010691,302,238
RalstoniaChr1AL6460523,716,413
solanacearum
2.4.2Usingchain-RNAinnovelncRNAsearch
AnalysisonavailableRNA-seqdatahasshownthatasizablefractionoftheRNA-seqreads
canonlybemappedbacktotheintergenicregionsofthereferencegenome.Itremains
animportantquestiontounderstandwhetherthesereadsaregeneratedbytranscriptional
noise,novelprotein-codinggenes,ornovelncRNAs.Inthisexperiment,weaimatnovel
ncRNAsearchintheunannotatedtranscribedregionsbycombiningheterogeneousevidence,
ofwhichexistenceofncRNAhomologsinrelatedgenomesisexaminedbychain-RNA.
WedownloadedthetranscriptomicdatageneratedusingRNA-seqforthestrainAU1054
of
B.cenocepacia
fromthethewebsiteofpreviouswork[122].Intotal,thereare11,163,555
nonpoly(A)readsofaveragelength41nt.Wedownloadedtheannotatedproteinsand
37
ncRNAsforthereferencegenomesfromtheIntegratedMicrobialGenomes(IMG)system[71]
andRfamdatabase[30].Inaddition,weappliedInfernal[83]toidentifymembersofexisting
ncRNAfamiliesinRfam[30].Thesetwostepsensurethatwehavearelativelycomprehensive
annotationofprotein-codingandncRNAgenes.Allreadsweremappedbacktothereference
genomeusingBowtie[61]withatmost2mismatchesallowed.Readsthatoverlapwith
anyannotatedncRNAsorprotein-codingregionsareexcludedfromfurtheranalysis.The
remainingreadswereclusteredusingBlockbuster[60].Notallclustersarelikelytranscribed.
Someofthemcontainfewreads.Someofthemhavehighlybiasedreaddistribution.In
addition,weareonlyinterestedinncRNAs.Thus,wedidaonall27,240clusters
outputbyBlockbusterinordertoidentifyasetoftranscribedregionsthatcouldcontain
ncRNAgenes.Thecriteriaare:

Thelengthoftheclusterisbetween50ntand200nt.

Thetotalnumberofreadsoftheclusterisbiggerthan40.

Thecoverageoftheclusterisbiggerthan0.1.

DistancesfromtheclustertoadjacentproteincodinggenesorannotatedncRNAgenes
arebiggerthan150nt.

Theclustermustexhibitastablesecondarystructure,whichismeasuredbytheMFE
valueandthepercentageofpredictedbasepairsbyafoldingtool[41,126,75,42].We
computedtheMFEstructureofeachcluster,andusedtheMFEcof-20.00to
outclustersthathaverelativehighMFE.
Herethecoverageisnedasthetotalnumberofdistinctreadsintheclusteroverthe
lengthofthecluster.Aftern,wehave35clustersleftforfurtheranalysis.As
38
theseclustersexhibitsmallMFEandstablestructure,theycouldbencRNAgenes.Wethus
furtherinvestigatedwhethertherearehomologouscounterpartsinrelatedbacterialgenomes.
AsthencRNAmightbehiddeninthesetranscribedregions,alocalalignmenttoolsuchas
chain-RNAisneeded.
WedownloadedallgenomesinthebetasubdivisionfromtheNCBImicrobialFTP
2
.
TheassociatedphylogenetictreewasobtainedfromthePATRCwebsite
3
withtaxonomyID
119060.AfterdiscardingstrainsthataretooclosetotheAU1054(startingwithBurkholde-
ria)and12strainswhosecompletegenomesarenotavailableatthetimeofwriting,7strains
remained.Thereare11chromosomesforthe7strainsintotal.Theinformationofthese
strainsissummarizedinTable2.2.WeappliedBLASTtoclosehomologsinrelated
genomes.WeranBLASTusing35transcribedregionsasqueriesand11chromosomesas
targets.Forall35queries,25ofthemhaveBLASThitswithall11chromosomes,showing
strongsequenceconservationinrelatedgenomes.Ofthe10remainingqueries,onehasjust
onetargetchromosomelackingBLASThits,alltheothershavenoBLASThitsreportedin
anyofthe11chromosomes.Thusthe10queriesmightlacktsequencesimilarity
andcannotbedetectedbyBLAST.Wethusappliedchain-RNAtoexaminewhetherthere
arehomologswithstrongstructuralsimilarityinrelatedgenomes.Theinformationofthe
10transcribedregions,theinputandoutputofchain-RNA,andtherunningtimeare
summarizedinTable2.3.Foraquerywhoselengthissmallerthan100,weusedthelength
ofthetranscribedregionasthewindowsizeforRNAplfold.Forquerieswithsizebiggerthan
100,thewindowsizeofRNAplfoldissetto100.Thewindowsizeusedforthesliding-window
strategyinchain-RNAissetasthesamewiththefoldingwindowsizeforRNAplfold.Other
2
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/
3
http://www.patricbrc.org
39
Table2.3Transcribedregionsandtargetchromosomesareinputtochain-RNA.\all"inthe
TargetChrs
columnmeansthatallgenomicsequencesforthe7strainsinTable2.2areused
astargetsequences.
TranscribedLengthTargetNumberRunning
regionIDChrsofhitstime(h)
18137all5895.71
6096all312.96
8058all01.15
12092CP001010110.11
146160all235.59
16792all03.32
182113all272.77
187103all94.14
204165all5119.58
272177all6233.54
parametersarekeptthesameastheexperiment.Wekeepallalignmentswithscore
abovethechosencuto380,which,basedontheresultsoftheexperiment,gaveboth
theacceptablesensitivityandFPrate.
Table2.3showsthatnotallputativetranscribedregionshavehomologsinrelated
genomes.LackinganytalignmentbyBLASTorchain-RNAdoesnotsupport
thefunctionalrolesofthoseregions.Threetranscribedregionsincuralargenumberofhits
withhighscores.AnumberofthesehitscontainsimpleAUrepeatsandthushavehigh
probabilitiestoformbasepairs.However,theirtophitsshowstrongstructuralconserva-
tioninthepredictedhomologouspairs.Thus,despiteexistenceoffalsepositivehits,these
regionsareverylikelytobefunctionalncRNAs.Asanexample,Fig2.13showsthede-
rivedsecondarystructuresofthetranscribedregionwithID18anditshomologsinall11
targetchromosomeswiththetopalignmentscore.Accordingtothisallstructures
40
arehighlysimilar,providingevidencethatthederivedstructureistheirnativesecondary
structure.Thehighstructuralsimilarityalsoindicatesthatthesesequencesarehomologous.
Thelocationsofthe6putativenovelncRNAscanbefoundatourwebsite.
Figure2.13DerivedsecondarystructuresforhomologousncRNAsinall11chromosomesfor
thetranscribedregionwithID18.
2.5Conclusion
Inthiswork,weintroduceanew
denovo
ncRNAsearchmethodanditsimplementation
calledchain-RNA.ThemajornoveltyofouralgorithmisthatitformulatesncRNAstruc-
turalalignmentproblemasanextendedchainproblem.Solvingtheextendedchainproblem
basedontheexisting
O
(
n
log(
n
))algorithmprovidesthebasisforthegenome-scalestruc-
turalalignmentscheme,where
n
isthenumberofrectangles.Byusingthegenome-scale
localfoldingalgorithmandtheslidingwindowstrategy,chain-RNAisabletolocal
structuralalignmentsbetweenlonggenomicsequences.Wetestedchain-RNAinlocating
homologousncRNAsinalargedatasetcontainshomologousncRNApairsofvariousse-
quencesimilarity.Inaddition,weappliedittonovelncRNAsearchinatranscriptomic
41
dataset.Theexperimentalresultsshowthatchain-RNAcanachievetlyhigher
sensitivityinidentifyingconservedncRNAelementsthanBLAST.ByscarifyingasmallFP
rate,chain-RNAcanachievebettersensitivitythanPLAST-ncRNAonsequencepairswith
identityabove60%.Chain-RNAismuchfasterthanotherstructuralalignmenttoolswhich
havecomparablebetweensensitivityandFPrate.
Ourworkdemonstratestheutilityofchain-RNA.Inaddition,itshowsthatgenome-
scale
denovo
ncRNAsearchisstillachallengingproblemdespitethepromisingprogress.In
particular,theperformanceofalltestedtoolsonidentifyingncRNAswithidentitybelow60%
isnotsatisfactory.Weplantoimprovechain-RNAinthefollowingaspects.First,weplan
todesignabetterdiscriminationmetric.Currently,weareusingascoretodetermine
whetheralocalalignmentshouldbeoutput.Amoreadvancedmethodistocomputethe
E-valuesforlocalalignmentscoresasinBLAST.WeexpectthatusingE-valuecanimprove
thediscriminationpowerofchain-RNA.Second,Amoreestrategyisneeded
tocontrolfalsepositiverateandimprovethe.weplantousemachinelearning
methodssuchasSVMtodoinasystematicmanner.Finally,incorporatingthe
posteriorprobabilitybasedsequencealignmentandthestructuralalignmentinchain-RNA
mightleadtobetterncRNAdetectionperformance.
42
Chapter3
miR-PREFeR:anaccurate,fast,and
easy-to-useplantmiRNAprediction
toolusingsmallRNA-Seqdata
3.1Introduction
PlantmicroRNAs(miRNAs)are
˘
21ntlongnon-codingRNAsthatplayimportantroles
intranscriptionalandpost-transcriptionalregulationofgeneexpression[124].Recently
developedgenome-widemiRNAannotationtoolsallutilizesmallRNA-Seqdatatoquantify
theexpressionofannotatedmiRNAsandtopredictnovelones[114,120,34,6].These
toolsfromseveralofthefollowingproblems.First,thesetoolsusuallyhavevariable
sensitivityandhighfalsepositive(FP)ratewhenappliedtoditspecies.Second,most
ofexistingNGS-basedtoolsareslow.Third,mostexistingcommand-linebasedtoolsarenot
user-friendly.Web-servertoolssuchasmiRanalyzer[34],areeasytouse.Buttheyusually
onlyworkforgenomesintheirdatabases,whichinhibituserstoanalyzenewgenomes.In
addition,mostofwebservertoolsalsohaveotherproblemslistedhere.Thus,thereisaneed
JikaiLei,YanniSun.miR-PREFeR:anaccurate,fastandeasy-to-useplantmiRNApredictiontool
usingsmallRNA-Seqdata.Bioinformatics.19:30,2837{2839.2014.
43
foraplantmiRNApredictiontoolthathasgoodperformance(highsensitivity,lowFPrate,
andaccurate),worksforallplantgenomes,runsfast,hasasmallmemoryfootprint,andis
easytouse.
miR-PFEFeRutilizesexpressionpatternsofmiRNAsandfollowsthecriteriaforplant
microRNAannotation[78]toaccuratelypredictplantmiRNAsfromoneormoresmall
RNA-Seqdatasamples.IthashighsensitivityandverylowFPrate.miR-PREFeRismuch
fasteranduseslessmemorythanexistingtools.UsingmiR-PREFeRrequiresminimum
informaticsexpertise:ithaslowdependencyonotherprograms;thereisnoneedtocompile
orinstallthepipeline;itprovidesacheckpointfeature,whichmakesitveryeasytocontinue
anjobfromwhereitwasstopped;thedocumentationispubliclyavailable.The
miR-PREFeRpipelineisintheprocessofbeingincorporatedintotheMAKER-Pgenome
annotationengine[14].
3.2DescriptionofthemiR-PREFeRpipeline
3.2.1Ww
ThemiR-PREFeRpipelinepredictsplantmiRNAsfromoneormoresmallRNA-Seqsamples.
Figure3.1illustratesthewwofmiR-PREFeR.Theinputincludesafastaformatgenome
(acompletegenomeorcontigs/supercontigsofangenome/ESTsequencesof
aspecies)andoneormoresmallRNA-SeqalignmentThealignmentshouldbein
theSAMformat,whichissupportedbyalmostallpopularshortreadaligners.Userscan
alsoinputanoptionalGFF3whichcontainsexistingannotationsforthegenome(protein
codingregions,repeatedsequences,etc).ThepipelineexcludesfeatureslistedintheGFF3
fromfurtheranalysis.Notethatusersshouldusethisoptionwithcautionbecauseby
44
Figure3.1WwofmiR-PREFeR.
excludingproteincodingregionsuserswillnotgetpredictedmiRNAsoverlappingcoding
regions.Alltheinputlesandparametersfortheprogramarespina
whichistheonlyinputtomiR-PREFeR.Theoutputincludes:1).AnHTMLtablethat
summarizestheresultsofthepredictedmiRNAs.Itcontainsthenumberofpredictions,
thelengthdistributionofthepredictedmaturemiRNAsequences,thelistofallpre-miRNA
sequencesandtheirsecondarystructures,andthedetailedreadmappingagainstthe
pre-miRNAsequences.Inaddition,thistableprovidessearchlinkstomiRBase.2).Separate
foreasydownstreamanalysis.TheseincludesaCSVformatthatsummarizing
predictioninformation,aGFF3annotationforthepredictedmiRNAs,twothat
containthepredictedmaturemiRNAsequencesandprecursorsequences,athatcontains
thestructureofeachprediction,anamed\miRNA.stat.txt"thatcontainsstatisticsof
thepredictedmiRNAs,andafolder\readmapping"thatcontainsreadmapping
foreachpre-miRNAprediction.Anexamplelistofalloutputcanbefoundat
http:
45
Figure3.2IdenofpeaksfromtwoinputRNA-Seqsamples.Eachcoloredarrow
representsareadfromoneRNA-Seqsample.Redandbluearrowsrepresentreadsfrom
twotsamples,respectively.Twopeaksareidendaccordingtothedepth
D
cutoff
,whichis2inthisexample.
Figure3.3Generatingcandidateregionsfrompeaks.(A).Thegap
g
betweentwoneigh-
bouringpeaks
p
1
and
p
2
issmallerthanthegaplength
G
max
.Thus,peaks
p
1
and
p
2
areclusteredintoonecandidateregion.(B).Peak
p
3
hasnoneighbouringpeaksandis
extendedtotheleftandtotherighttoformtwocandidateregionswithlength
L
max
.
//www.cse.msu.edu/
~
leijikai/mir-prefer/example-output/
.
Thepipelineconsistsofthreemainsteps:idenofcandidateregions,foldingof
thecandidateregions,andmiRNAlociprediction.Herewepresentthedetailsofeachstep.
3.2.1.1Candidateregioniden
Thestepofthepipelineistoidentifycandidateregionsthatarelikelytoharborex-
pressedmiRNAgenes.BasedontheassumptionthatexpressedmiRNAsshouldhaverea-
sonableamountofreadsmappedtotheirloci,weusethemappingdepthtoextractregions
thatshouldbeanalyzed.SimilarmethodisusedbyothermiRNApredictiontoolssuchas
ShortStack.Readsfromallinputsamplesareconsideredwhencountingthemappingdepth.
46
3.2.1.1.1Peaks:
Thestepistoidentify
peaks
.A
peak
isacontinuousregionina
sequencewithminimummappingdepthabove
D
cutoff
ateachnucleotideposition,where
mappingdepthisthenumberofreadscoveringthatposition.Here
D
cutoff
isapipeline
parameterthatuserscanadjust.Figure3.2givesanexampleonhowtoidentifypeaks,with
D
cutoff
=2.
3.2.1.1.2Candidateregions:
Inthenextstep,peaksarecombinedtoform
candidate
regions
,whicharetheninputtothenextstepofthepipeline.Therearetwoparametersto
controlthegenerationofcandidateregions,
G
max
and
L
max
.Apairofneighboringpeaks
withgapsizesmallerthan
G
max
areclusteredintoonecandidateregion(seeFigure3.3).
Thisneighborjoiningiscontinueduntilnoneighboringpeakwithindistance
G
max
exists.
Ifapeakhasnoneighboringpeakswithgapsizesmallerthan
G
max
,itisextendedtothe
leftandtotherightto
L
max
nt,andtwocandidateregionsaregeneratedforthispeak
(Figure3.3(B)).
L
max
isthemaximumlengthofacandidateregion.Regionslongerthan
L
max
arediscarded.
3.2.1.1.3Candidatematuresequences:
AccordingtotheknowbiogenesisofmiRNA
expression,thematureregionofapre-miRNAstem-loopshouldhavehighmappingdepth.
Thus,weidentifycandidatematuresequencesaccordingtothenumberofoccurrence(abun-
dance)ofeachuniquereadinthepeak.Inparticular,foracandidateregion,wechoosethe
topNmostabundantreadsofeachpeakascandidatematuresequencesforthecandidate
region.Inthecurrentimplementation,weuse
N
=2.
47
3.2.1.2Folding
Thesecondarystructureofapre-microRNAsequenceisusuallyastem-loop.Byfolding
eachcandidatesequenceanddetectingwhetherthesequencecanbefoldedintostem-loop
structures,alargenumberofcandidateregionsthatarenotmiRNAlocicanberemoved.In
miR-PREFeR,weuseRNALfoldfromtheViennaRNApackagetofoldcandidateregions[68].
RNALfoldcomputesalistoflocallystableRNAsecondarystructuresforaninputsequence.
Candidateregionswithatleastonestem-loopstructurearekeptandanalyzedbythenext
step.
3.2.1.3MiRNAprediction
ThepredictionstepusesthecriteriaforannotatingplantmiRNAasitsbasicprincipleto
predictmicroRNAs[78].
1.TheprimarycriteriaisthatthesmallRNA-Seqdatashouldprovideexceptionalev-
idenceofprecisemiRNA/miRNA

excision:thereshouldexistabundantreadsthat
correspondtothematuremiRNAsequence,andthereshouldbeatleastonereadthat
canbepreciselymappedbacktothemiRNA

sequence.ThemiRNAandthemiRNA

sequencesformaduplexwithtwont,3'overhangs.
2.StructurecharacteristicsofthemiRNA/miRNA

duplex.Therearetypicallyfouror
fewermismatchedmiRNAbasesinthemiRNA/miRNA

duplex.Asymmetricbulges
arerareandsmallinsize.
Tomeetthecriteria,miR-PREFeRselectsreadsthatareabundantineachpeakas
thecandidatematuresequencesasdescribedearlier,checksthestem-loopstructuresofthe
48
candidateregiontoidentifystarsequenceposition,thencheckswhetherthestarsequence
ispresentornotintheRNA-Seqsamples.
AcharacteristicfeatureofmiR-PREFeRisthatitisabletotakemorethanoneRNA-Seq
samplesasinput.ByusingmultipleRNA-Seqsamplesfromttissues/conditions/developmental
stages,moreinformationcanbeobtainedfromtheRNA-Seqdata:

ForsomemicroRNAloci,it'spossiblethatthereisonlyasmallamountofreadsthat
canbemappedbacktotheseloci.It'sossibletopredicttheselociusing
asinglesample.Bycombiningallreadsfrommultiplesamplestogether,it'spossible
topredicttheselowlyexpressedmiRNAs.ForthepredictedmiRNA,westillrecord
theirexpressionineachsample.

ForsomemiRNAs,averylargenumberofreadsfrommultiplesamplescanbeprecisely
mappedbacktothematureregions.Inaddition,thesecondarystructureofthelocus
isanstem-loopthatmeetsthestructurecriteria.However,thereisnoreadthat
correspondstothestarsequence.Thesetypeoflociwouldbemissedifwestrictly
followtheprimarycriteriawhichrequiresthatthereshouldbeatleastasingleread
forthestarsequence.Informationfromtheexpressionpatternsfrommultiplesamples
andthewell-formedstructureindicatethatthislocuscontainsatruemiRNAgene.
Thusweloosetheprimarycriteriatodroptherequirementofthepresenceofthestar
sequenceinthecasethatthereareaverylargenumberofreadsineachsamplethat
canbepreciselymappedbacktoamatureregion.Inthecurrentimplementation,
whenthereisnoreadcorrespondstothestarsequence,thethresholdsaresetto1000
and100.Thatis,thereshouldbeatleast1000readsinallsamples,andatleast100
readsineachsample.
49
ByusingtheannotationcriteriaofplantmiRNA,miR-PREFeRisabletoprecisely
predictplantmiRNAswithhighspeyandhighsensitivity.Withthehelpofinformation
frommultiplesamples,thesensitivitycanbefurtherimprovedwithouttincrease
ofthefalsepositiverate.
3.2.1.4Checkpointing
Thepipelinemakesacheckpointaftereachmajorstepinthepipelineandmakescheckpoints
periodicallyinthetime-consumingfoldingstage(seeFigure3.1).Userscaneasilysetthe
checkpointgranularityinthenThisprovidesusersaneasywaytorestart
jobsfromwherethejobswerestopped.Thisfeatureisparticularlyusefulin
varioussituations.Forexample,it'spossiblethatajobrunningonahighperformance
computingsystemiskilledduetoresourcelimits,ortheuser'slaptopisrunningoutof
battery.Byrestartingajobfromthelatestcheckpointratherthanstartingitfromthe
beginning,alotoftime/resourcescanbesavedforlong-runningjobs.Asfarasweknow,
miR-PREFeRisthemicroRNApredictiontoolthatprovidessuchausefulfeature.
3.2.2Implementationandexternalprogramdependency
miR-PREFeRisimplementedusingPython.Thereisnoneedfortheuserstoinstallitandit
worksunderLinux/MacOS.ThepipelineisabletousemultipleCPUs/coresautomatically.
Thismakesthepipelinemuchfasterthanexistingprogramsandiseasierfortheusersto
utilizetheircomputationalresources.
ThemiR-PREFeRpipelinereliesontwoexternaltools.TheoneistheSAMTools
alignmentsmanipulatingpackage[64].It'swidelyusedintheNGSdataanalysiscommunity
anditiseasytoinstallanduse.ThesecondprogramistheRNALfoldprogramfrom
50
theViennaRNApackage.ItisusedbyalmostallmicroRNApredictiontools.Itiswell
documentedandiseasytoinstall.
3.3Results
3.3.1Evaluatingpredictionperformanceonsixdatasets
WebenchmarkedmiR-PREFeRandseveralpopularornewlydevelopedNGS-basedmiRNA
predictiontoolsincludingShortStack[6],miRDeep-P[120],miRanalyzer[34],miRDeep2
[26],miRDeep

[4],andMIReNA[74].ThefeaturesofthetoolscanbefoundinTable3.1.
Weusedsixdatasetsfrom
Arabidopsisthaliana
,Medicago,papaya,peach,andmaizeto
evaluatetheperformanceofthetools.Thedetailsofthedatasetscanbefoundinthe
Table3.2.MicroRNAsinthe
Arabidopsisthaliana
genomeiswellannotated.Thusweknow
thegroundtruthmiRNAsinthegenome.Weusedtwodatasetsfrom
Arabidopsis
thaliana
toevaluatethesensitivityandfalsepositiverateofthebenchmarkedtools.The
dataset,
Athl-2
,whichcontainstwosamples,isthedatasetusedintheShortStackpaper[6].
Theseconddataset,
Athl-6
,containssixpublishedsamplesfromttissues/conditions.
Table3.3showstheperformanceofeachtoolonthetwodatasets.miRanalyzerhasthebest
sensitivitybutthenumbersofpredictionsareverylargeonbothdatasets.Itisverylikely
thatmostofthesepredictionsareFPpredictions.ItshouldbenotedthatmiRanalyzer
triestodetectannotatedmiRNAsthataresavedinitsowndatabase.Thisiswhyits
sensitivityishigherthanallothertools.miRDeep2,miRDeep*,andMIReNAhavevery
lowsensitivityonbothdatasets,whichindicatesthattheyshouldnotbethechoicefor
annotatingplantmiRNA.ShortStackisdesignedtobespthusithasthelowestFP
rateandreasonablesensitivity.miR-PREFeRhasthesecondhighestsensitivityandvery
51
lowFPrate.ThepredictionsfrommiR-PREFeRhavelargeintersectionswithpredictions
fromShortStack(SeeFigure3.4)Accordingtothecomparison,miR-PREFeR,ShortStack,
andmiRDeep-Phavebettersensitivity,FPrate,runningtime,andcanbeappliedtoany
plantspecies.Thus,tofurtherevaluatetheperformanceofthetools,weranmiR-PREFeR,
ShortStack,andmiRDeep-Ponthreemorespecies:Medicago(4RNA-Seqsamples),peach
(4RNA-Seqsamples),andpapaya(3RNA-Seqsamples).Theinformationofallthesmall
RNA-SeqdatausedintheexperimentscanbefoundinTable3.2.Thelinkstodownloadthe
datasetscanbefoundat
http://www.cse.msu.edu/
~
leijikai/mir-prefer/dataset/
.
52
53
Table3.1ComparisonoffeaturesofmiRNAannotationtools
miR-
PREFeR
ShortStackmiRDeep-
P
miRanalyzermirDeep2miRDeep*MIReNA
miRNAannota-
tion
YesYesYesYesYesYesYes
Othersmall
RNAannotation
NoYesNoNoNoNoNo
Inputmulti-
pleRNASeq
samples
YesYesNoNoNoNoNo
Speicies
PlantAllPlantAllanimalanimalAll
Server/Local
Local
(command-
line)
Local
(command-
line)
Local
(command-
line)
Server,
Local
(command-
line)
Local
(command-
line)
Local(GUI,
command-
line)
Local
(command-line)
Checkpoint
YesNoNoNoNoNoNo
Canusemuliple
processors/cores
AutomaticManualManualManualManualManualManual
algorithm
miRNA
annotaton
criteria
miRNA
annotaton
criteria
Naive
Bayes
Random
foresttree
Naive
Bayes
NaiveBayes5criteriausedto
secondary
structures
Table3.2InformationofthesmallRNA-Seqdatausedintheexperiments
SpiesesAccessIDDescription
Athl-6Genomesize:121MB
Arabidopsisthaliana
SRX065858Root,endodermis
Arabidopsisthaliana
SRX065854Root,stele
Arabidopsisthaliana
SRX065857Root,cortex
Arabidopsisthaliana
SRX107308Seedlings,coldstress
Arabidopsisthaliana
SRX107309Seedlings,saltstress
Arabidopsisthaliana
SRX107310Seedlings,saltstress
Athl-2Genomesize:121MB
Arabidopsisthaliana
SRR275588rosetteleaves
Arabidopsisthaliana
SRR051927above-groundtissues
MedicagoGenomesize:425MB
Medicagotruncatula
GSM769272Seeding
Medicagotruncatula
GSM769273Root
Medicagotruncatula
GSM769274Flower
Medicagotruncatula
GSM769277Leaf
papayaAssemblysize:349MB
Caricapapaya
GSM712525Caricapapayaleaves
Caricapapaya
GSM712526Caricapapayawers
Caricapapaya
GSM712527Caricapapayainfectedleaves
peachAssemblysize:231MB
Prunuspersica
GSM944971Peachroot
Prunuspersica
GSM944972Peachleaf
Prunuspersica
GSM944973Peachwer
Prunuspersica
GSM944974Peachfruit
maizeGenomesize:2095MB
Zeamays
GSM1178887Maizehen1-1mutant
Zeamays
GSM1178886Maizewildtypecontrolforhen1-1mutant
Wecomparedtheoverlapsofthepredictionsofthethreepredictiontools.Figure3.4
(a)-(e)illustratestheintersectionsofthepredictionsofttoolsandexistingmiRNA
databasesusingVenndiagrams.Foralledatasets,thenumbersofpredictionsfrom
miRDeep-Pareverylarge.ItisverylikelythatmostofthesepredictionsareFPpre-
dictions.ThepredictionsfrommiR-PREFeRhavelargeintersectionswithpredictionsfrom
ShortStack.BothtoolsusethecriteriaforplantmiRNAannotationastheirpredictionguide-
lineandthepredictionsareconservative.Thus,theyhavelesspredictionsthanmiRDeep-P.
54
Toevaluatetheperformanceofthepredictionstoolsonlargeandcomplexplantgenomes,
weappliedmiR-PREFeRtomiRNApredictioninmaize,whichhasamuchlargergenome
sizethantheothertestedgenomes(seeTable3.2).WeusedtwosmallRNA-Seqsamples,
andtheirinformationcanbefoundinTable3.2.Likepreviousexperiments,weranmiR-
PREFeR,ShortStack,andmiRDeep-Ponthedataset.FormiRDeep-P,wedidnotget
anyresultbecauseitwasveryslowandwewereunabletoitin5days.Thus,we
onlyreporttheresultsfrommiR-PREFeRandShortStack.Toevaluatethesensitivity,we
comparedthepredictionswithboththemaizemiRNAannotationandtheplantmiRNA
annotationinmiRBaseversion20.Table3.4summarizedtheperformanceofmiR-PREFeR
andShortStack.miR-PRFeRshowedhighersensitivitythanShortStackonthisdataset.
For172maizemiRNAannotationinmiRBase[58],miR-PREFeRsuccessfullyiden74
ofthem,whileShortStackiden66.WeusedBLASTtosearchpredictionsfromtwo
toolsagainstthemiRBaseplantmiRNAannotations.FormiR-PREFeR,89predictions
havetoverlaps(E-valuewassetto1e-40)withplantmiRNAannotations.
ForShortStack,thenumberis82.TheexperimentshowedthatmiR-PREFeRhasbetter
sensitivitythanShortStackonthemaizedataset.BecausemiR-PREFeRcanautomatically
takesadvantageofmultipleCPUs/coresavailableonamachine,ittlyreducedthe
runningtime.ForShortStack,ittook156hourstoanalyzethetwomaizesamples.For
miR-PREFeR,itonlytook1.2hoursbyusing4coresautomatically.Thismakesiteasier
foruserstoanalyzelargeplantgenomes.TherunningtimeisshowninTable3.5.
55
3.3.2Analyzingthebetweentheannotatedandpre-
dictedmiRNAs
Figure3.5quantheofthestartpositionsandlengthsbetweentheannotated
andpredictedmiRNAsinArabidopsis.Thewasgeneratedusing63annotatedmiRNA
lociofArabidopsisthatarepredictedbyall3toolsontheAthl-6dataset.Eachofthese
locicontainsoneannotatedmaturesequenceaccordingtomiRBase.Alltoolsiden
someannotatedmiRNAsandmiRNAsthathavetstartpositionsorlengthsasthe
annotatedones.Buttherencedistributionsarequitet.77.8%(49outof63)
ofthematuresequencespredictedbymiR-PREFeRhavethesamestartpositionastheir
annotations.ForShortStackandmiRDeep-P,thevaluesare49.2%and15.9%,respectively.
ForthemajorityofthepredictionsfrommiR-PREFeR(85.7%),theofthestart
positionsbetweenthepredictedmiRNAmaturesequencesandtheirannotationsarewithin
1nt.ThelengthsofthemiRNAmaturesequencespredictedbymiR-PREFeRaremostly
consistentwithannotatedones:81%ofthemarethesameastheannotationsandfor98.4%
ofthepredictionsthearewithin1nt.
SomeofthepredictedmiRNAsthataretfromtheannotatedonesarelikelytobe
isomiRs.ItisknownthatmiRNAscanbemobytheaddition,editingorsubtraction
ofnucleotides,resultingindistinctmiRNAisoforms(i.e.isomiRs)[49,84].Inparticular,
5'and3'isomiRscontainminorchangesinthestart/endpositionsandlengthsofmiRNAs.
AcomprehensiveanalysisoftheisomiRsinmultipletranscriptomicsamplesofArabidopsis
showsthetrimming/extensionofthestartpositionin1or2ntsin5'sisomiRs[49].Thus,
thepredictionsthatcontainminorpositionfromtheannotatedonesarelikelyto
beisomiRs.
56
However,existinganalysisalsoshowsthatthemostabundantreadscorrespondedto
conserved,previouslyidenmiRNAs[49,84].Inthelarge-scaleanalysisoftheisomiRsin
Arabidopsis,114outof144highlyrepresentedmiRNAsinmultiplesmallRNA-Seqsamples
aretheannotatedmiRNAs[49].ThepredictionsbymiR-PREFeRareconsistentwiththis
observation.Forpredictionsthatcontainlargerpositionfromtheannotated
ones,suchasthosewithatleast5-ntdwecannotexcludethepossibilityofwrong
predictions.Figure3.5showsthatmiR-PREFeR,ShortStack,andmiRDeep-Phave14.3%,
46.0%,and25.4%predictionsthathaveatleast5-ntfromtheannotatedmiRNAs,
respectively.Althoughfurtherexperimentsareneededtoshowhowmanyofthesearewrong
predictions,ahighpercentageofmiRNApredictionswithlargetrimming/extensionmight
indicateloweraccuracy.
3.3.3miR-PFEFeRisfastandt
Table3.5andTable3.6showtherunningtimeandthememoryusageofthebenchmarked
tools.TherunningtimeformiR-PREFeRincludesboththetimeforgeneratingthealign-
mentandthetimeforrunningthepipeline.miR-PREFeRachieves4

37

speedup
comparedtoShortStack,whichisthesecondfastesttool.NotethatShortStackcanpredict
othertypesofsmallRNAbesidesmiRNA.Thismightbeonereasonthatitisslowerfor
justmiRNAprediction.MIReNAusesBLASTtodoreadalignmentandisveryslow.For
theAthl-2andAthl-6datasets,wedidnotgetanyresultsfor4outof8samplesbecause
themappingstageistooslow.miRDeep-Palsohasverylongrunningtimeonalldatasets.
Withoutmanuallyparallelisingthejobs,itiseventorunthetwotoolsonsmall
genomesonapersonalcomputer.Asformemoryusage,miR-PREFeRuseslessmemory
thantheothertoolsonalldatasets.
57
58
Table3.3Performancecomparisonofbenchmarkedtools
miR-PREFeRShortStackmirDeep-PmiRanalyzermirDeep2mirDeep*MIReNA
Softwareversionv0.090-5-01.3unversioned0.0.5v312.0
Athl-2(2samples)
NumberofknownmiRNAsexpressed
a
240240240240240240240
NumberofpredictedmiRs155113126321821822018152
NumberofexpressedmiRspredicted12710786201641035
Numberofnovelpredictions286117719811182008117
Sensitivity0.530.450.360.840.270.040.15
Athl-6(6samples)
NumberofknownmiRNAsexpressed
a
243243243243243243243
NumberofpredictedmiRs1851363021131142911472411
NumberofexpressedmiRspredicted13612512820979744
Numberofnovelpredictions49112893123062121465367
Sensitivity0.560.510.530.860.330.030.18
(a)Athl-2
(b)Athl-6
(c)Medicago
(d)Peach
(e)Papaya
Figure3.4VenndiagramsthatshowtheintersectionsofthepredictedmiRNAsfromdif-
ferenttoolsonedatasetsfromfourspecies.ForArabidopsis,Medicagoandpeach,the
annotationsfrommiRBaseareincluded.
Table3.4PerformancecomparisionofmiR-RPEFeRandShortStackonmaize
miR-PREFeRShortStack
Numberofpredictions220147
OverlapwithmiRBasemaizeannotation
74/172
66/172
OverlapwithmiRBaseplantannotation
89
82
Table3.5Runningtime(hour)
miR-PREFeRShortStackmiRDeep-PmirDeep2mirDeep*MIReNA
Athl-2
1.25
5.619710.78Unknown
a
Athl-6
1.21
11.0315727.427.8Unknow
a
Medicago
0.91
4.25110
papaya
0.75
4.883
peach
1.65
41.6197
maize
4.22
156
Bold
numberindicatesthebestrunningtime.
a
MIReNAdidnotonseveralinputsamplesafterrunning72hoursandwaskilledbytheHigh
PerformanceComputingsystem.
59
Figure3.5ThedistributionofthemiRNAstartpositions(toppanel)andlengths
(bottompanel)betweenannotatedandpredictedmiRNAsbythreetools.Thewas
generatedusingannotatedmiRNAsofArabidopsisthatarepredictedbyallthreetoolson
datasetAthl-6.
Table3.6Memoryusage(MB)
miR-PREFeRShortStackmiRDeep-PmirDeep2mirDeep*MIReNA
Athl-2
850
1449155915383040Unknow
a
Athl-6
854
1266391724811825Unknow
a
Medicago
263
11551800
papaya
238
12201177
peach
850
14491559
maize
1968
3857
b
Bold
numberindicatesthebestmemoryusage.
a
MIReNAdidnotseveralinputsamplesafterrunning72hoursandwaskilledby
theHighPerformanceComputingsystem.
b
ShortStackdidnotcompletelyonthelargerRNA-Seqsampleandwaskilledright
afteritoutputthepredictionsofmiRNAs.Thenumberhereisthememoryusageonthe
smallerRNA-Seqsample.
60
Chapter4
IdentifyandassembleCRISPRsfrom
metagenomicdata
4.1Introduction
CRISPRsarefoundinalargenumberofprokaryoticgenomes(approximately40%ofse-
quencedbacteriaandalmostallsequencedarchaea[32]).TheCRISPR/Cas-genesystem
(cas-CRISPR)isanadaptiveimmunesysteminbacteriaandarchaeathatprotectsthe
targetgenomefromforeigninvadinggeneticelementssuchasvirusplasmidsandphages
[7,72,44,17,5,88].EachCRISPRslocuscontainsasetofvariablespacersregularlysep-
aratedbydirectrepeats(DRs),whicharehighlyconservedwithinspeciesandhavetypical
sizerangeof23-47bp.A100bp-500bplongDNAsequencecalled\leader"isalwaysfound
ononesideoftheCRISPR.Theleaderactsasapromoteranddirectsthetranscription
oftheCRISPR,whichisessentialfortheCRISPRssystemstoconferspresistanceto
viruseswhosegenomicsequencesmatchonespacer[35,5,90].Figure4.1showsanexample
ofacas-CRISPRsystem.
EvenwithextensivestudiesofCRISPRs,thedetailsoftheCRISPR-basedmechanism
andthedynamicsofinteractionsbetweenvirusesandthehostCRISPRslociarenotyetfully
known[5].OnekeysteptoadvanceourunderstandingofCRISPRsworkingmechanismsis
todetectCRISPRslociinalargevarietyofprokaryotes.Inparticular,metagenomicdata,
61
Figure4.1Acas-CRISPRexample.Greenboxesinthetoprepresentdirectrepeats
(DRs).AlinebetweentwoDRsrepresentsaspacersequence,whichsharesequenceidentity
withfragmentsofplasmidsandbacteriophagegenomesandspecifythetargetsofCRISPR
interference.DRsandspacersformanDR-spacerarray.ThenumberofDRscanvary
substantially,fromaminimumoftwotoafewhundred.AsetofCRISPR-associatedgenes
(cas-genes),whichencodetheproteinmachineryresponsibleforCRISPRactivity,arelocated
upstreamofthearray.ThebottomboxshowsthesequenceoftheDR-spacerarray.Green
charactersareforDR,othercoloredcharactersareforspacers.DRsequencesarehighly
conservedinaCRISPR.
whichcontaincommunityshotgunsequencesofalargenumberofunculturedmicrobial
speciesfromvariablenaturalsamples,provideauniqueopportunitytothoroughlyannotate
CRISPRsandtoenhanceourunderstandingofthephage-hostinteractionsandco-evolution
withinnaturalsystems[5,91,101]
4.2Relatedworksandtheirlimitations
MostofexistingtoolsforidentifyingCRISPRsaredevelopedforsequencedgenomes.These
methodssearchinterestedgenomesforregularlyspacedrepeatstoidentifyCRISPRloci.
PopularimplementationsinthiscategoryincludeCRISPRFinder,metaCRT/CRISPRAlign
[92],PILER
CR[22],andCRT[12].Theyworkwellforassembledgenomesorcontigs
butcannotbereadilyusedonshortsequencingdata.Acommonwaytoutilizethistype
oftoolsformetagenomicdataisusedenovometagenomicassemblytoolstooutput
assembledcontigs,thenusethecontigsastheirinput.Thisapproachhasitslimitations.
62
First,metagenomicassemblyisaproblem.Metagenomicassemblerstrytorecover
individualgenomesinanenvironmentalsample.Organismsinthemicrobialcommunityof
asamplehavehighlyunevenrepresentationandsomeorganismsareverycloselyrelated.
Thehighlycomplexcompositionofmicrobialcommunitiesandsequencingerrormakeit
impossibletorecoverallgenomesinasample.Previousworkshowsthatsequencingofhigh-
complexitymicrobialcommunitiesresultsinlittleornoassemblyofreads[81,73].Second,
theheterogeneoussequencecoverageofCRISPRscausedbyvariablespeciesabundanceor
expressionlevelsoftenmakesexistingtoolsoutputonlythehighlyrepresentedCRISPRs.
OnereportedproblemisthatCRISPRslocifoundinmetagenmoicassembliesmayonly
representthemostdominantstraininthecommunity,andnotindicativeofthetruespacer
diversity[101].Third,CRISPRsinthesamespeciesorevenintspeciescanshare
similarrepeats.ManyCRISPRsthatsharethesamerepeatinaspecieshaveveryhighly
conservedleadersequences[117,52,48].Thisleadstoverytangledassemblygraphs.Existing
assemblytoolsarenotdesignedtodistinguishCRISPRswithsimilarrepeats.Inaddition,
thesizeofmetagenomicdataposesachallengeoncomputationalresources.Thus,the
commonlyusedpipelineofcombiningdenovometagenomicassemblyandgenome-wide
CRISPRsdetectiontoolsisconvenientbutnotidealforCRISPRsdetectioninmetagenomic
data.
ThesecondtypeoftoolsidenCRISPRsfrommetagenomicdatawithoutassembling
ofthewholedataset.Toourknowledge,theonlyavailabletoolinthiscategoryisCrass[101].
CrassidenreadsthatbelongtoCRISPRsusingtherepeatstructuralcharacteristicof
CRISPRsandusesagraphbasedalgorithmtorecoverthespacerarrangementofCRISPRs
thatsharethesamedirectrepeatsequence.Asasteptooutputthefull-lengthCRISPR
sequences,Crassrequiresuserstoparsetheinformationoutputbythepreviousstepand
63
inputittoexistingdenovoassemblyprogramssuchasVelvet[123].Likeotherwholegenome
assemblytools,VelvetisnotdesignedtoaddressspassemblychallengesforCRISPRs.
Inaddition,becauseCrassreliesontheassumptionthatthereexistsomereadsthatcontains
atleasttwocopiesofarepeat,itfailstoidentifyCRISPRrepeatwhenreadlengthisshort
orwhenrepeat-spacerunitislong[8].Thus,thereisaneedtodevelopnovelalgorithmsand
toolstotlydetectandassemblebothhighlyandlowlyrepresentedCRISPRsfrom
largeamountofshortanderror-pronemetagenomicreads.
Inthiswork,wedesignanddevelopnovelmethodstoidentifyandassemblesequencesof
CRISPRsthatcontainbothknownandnovelrepeatsinmetagenomicdata.Thedeveloped
tool,metaCRISPR,incorporatessppropertiesoftheDR-spacerarrayinCRISPRsand
addressesthechallengesincludingthelargesizeofmetagenomicdata,theheterogeneous
sequencecoverageofCRISPRs,andmultiplecopiesofrepeatsinCRISPRs.
4.3Method
Inthissection,weintroducemetaCRISPRpipeline.Wegiveanoverviewofthewhole
pipeline.Thenthedesignofeachcomponentofthepipelineiselaborated.
4.3.1Overview
Figure4.2sketchesthemainalgorithmiccomponentsinthemetaCRISPRpipelinethatis
designedforrecoveringCRISPRscontainingbothknownandnovelrepeatsinmetagenomic
dataset.Itconsistsofthreemajorcomponents:clusteringandassembly.First,we
identifyCRISPRreadsfromtheoriginalmetagenomicdataset.Allreadsthathaveatleast
twocopiesofthedirectrepeatsareidenedtogathernovelrepeats.Ak-merhashtableis
64
Figure4.2MajorcomponentsofmetaCRISPR.Paired-endreadsarerepresentedbytwo
rectanglesconnectedbyadottedline.Readsofredandgreencolorsrepresentreadsfrom
twotCRISPRs,respectively.Grayreadsrepresentreadsfromothergenomicregions.
65
constructedusingthegatheredtherepeatsandknownrepeatssuchasthoseinCRISPRdb
[32].Theoriginalmetagenomicdatasetisusingthek-merhashtable.Thisstep
removesalargeportionofreadsthatarenotpartofCRISPRsandtlyreduces
inputsizefordownstreamanalysis.Inthesecondcomponent,webuildanoverlapgraph
withlargeoverlapthreshold.Theideaisthatbyusingverylargeoverlapthreshold,most
randomoverlapscanbeeliminatedandreadsthatbelongtoCRISPRscomefromt
species,orCRISPRscomefromthesamespeciesbutdonotsharerepeatsequencecanbe
separatedintorentclusters.Clusterswithpairedendreadssupportarecombinedto
recoverreadsthatbelongtothesameCRISPRbutwereclusteredtomultipleclusters.In
thestage,anoverlapgraphisconstructedforreadsineachcluster.Variousgraph
cleaningtechniquesareappliedtocorrectsequencingerrorsanderroneousbranchesdueto
randomoverlaps.Longestpathisusedtorecovercontigsonthedirected
acyclicgraph(DAG).Recoveredcontigsareconnectedusingpaired-endinformationtoform
supercontigsorscAfterassembly,asimplevalidationstepisappliedtoout
contigsthatarenotCRISPRs.
4.3.2IdenofCRISPRreads
4.3.2.1Directrepeatiden
OurpipelineidenreadsthatarelikelysequencedfromCRISPRs.Theevenly
spaceddirectrepeatstructureinCRISPRsprovidesthebasicideaforidentifyingnovel
repeatsfrommetagenomicreads.Sincethelengthsofmostknowndirectrepeatslieinthe
rangefrom23to47[31],itisareasonableassumptionthatatportionofreads
willcontainatleasttwocopiesofdirectrepeats,especiallywiththeincreasingreadlength
66
Figure4.3Illustrationofhownovelrepeatsareideninaread.Thegreenregionsare
extendedregions(seemaintext).Blackcrossesrepresentmismatchesinthesequence.
oftheNGStechnology.Basedonthisobservation,weidentifydirectrepeatsbysearching
maximalrepeatswithmismatchesinreads.Figure4.3illustrateshowaputativedirect
repeatisideninaread.Foreachread,wecheckwhetheritcontainsatleasttwo
copiesofaninitialk-merwhoselength
k
issetto20bydefault.Thedistancebetweenthe
twocopiesofthek-mershouldbeatleasttheminimumlengthofaspacer.Afterapairof
matchingk-mersarefound,theyareextendedtotherightuntiltheextendedregionscontain
atmosttwomismatches.Thesequencefromthestartoftheinitialk-mertotheendofthe
extendedregionisregardedasaputativedirectrepeat.UnlikeCrass,weallowatmosttwo
mismatcheswhenidentifyingrepeats.Thisisbasedontheobservationthatalthoughrepeats
inaCRISPRarewellconserved,therestillexistrepeatsthathaveoneortwomismatchesin
someCRISPRs.Byallowingmismatches,metaCRISPRisabletogatherrepeatsthatCrass
mayfailtoidentify.
Therepeatidenmethodcanrecruitalargeportionofrepeatsinametagenomic
dataset.However,itmaymisssomelongrepeats,especiallywhenthereadlengthisshort
(e.g.75nt)[8].Toovercomethisproblem,weresorttotheknownrepeatssuchasthosein
CRISPRdb.DirectrepeatsinannotatedCRISPRslociareectivetemplatesforidentifying
CRISPRsreadsbecauserepeatsarewell-conservedamongCRISPRsinthesamespeciesor
evenbetweentspecies[59].Byincludingrepeatsfromknowndatabases,weare
abletoidentifysomeoftherepeatsthataremissedbytherepeatidenprocessand
increasethesensitivityofthepipeline.
67
4.3.2.2Filtermetagenomicdatausingk-merhashtable
Usingthepreviousstep,wecanidentifyrepeatsandrecruitreadsthatcontainatleast
twocopiesofdirectrepeats.However,manyreadsmayonlycontainasinglecopyofthe
repeat.Figure4.4showsthedistributionofthegroundtruthreadsthatcanbemapped
totheCRISPRregionsofthegenomesinasyntheticmetagenomicdata(SRR606249)[99]
sequencedfromamixtureof16archaealspeciesand48bacterialspecies.Thelengthofthe
readsinthedatasetis101nt.Theyare220knownCRISPRsin48genomesinthedata
set.ForeachCRISPR,weshowthepercentageofgroundtruthreadsthathaveatleast
twocopiesofarepeat20-mer,thepercentageofgroundtruthreadsthathaveonlyonecopy
ofarepeat20-mer,andtheaveragespacer-repeatlengthoftheCRISPR.Wecanseethat
althoughwecanidentifyrepeatsandrecruitreadsthatcontainsatleasttwocopiesofdirect
repeatsformostCRISPRs,mostgroundtruthreadsforeachCRISPRonlycontainonecopy
ofarepeat20-mer.Wecannotrecruitthesereadsinthepreviousstep.Withoutincluding
thesereads,thedepthofcoverageoftheCRISPRisquitelowandit'squitelikelythat
forthoselowlyrepresentedCRISPRs,therewillbemanygapswhichpreventustorecover
thewholesequenceoftheCRISPR.Therefore,weneedantmethodtorecruitreads
containsonlyonecopyofdirectrepeatbeforedownstreamanalysis.
Tobuildthetprocedure,weuseak-merhashtablebasedmethodtodetectreads
thatcontainonlyonecopyofdirectrepeat.Weconstructahashtablefromallthek-mers
oftheknownandnewlyidendirectrepeats.Foreachk-mer,boththek-merand
thereversecomplementarysequenceofthek-merarekeptinthehashtable.Theoriginal
metagenomicdatasetisscannedtoidentifythosereadsthatcontainatleastonek-merin
thehashtable.Forpaired-endreads,ifoneendhasamatchingk-mer,theotherendis
68
Figure4.4Thedistributionofgroundtruthreads
automaticallyincludedregardlessofwhetherithasamatchingk-mer.Allreadsfromthis
steparepassedtothenextclusteringstep.Thisstepcanrapidlyoutmostofreadsthat
arenotpartofaCRISPRandtlyreducethedatasizefordownstreamanalysis.
Itshouldbenotedthatboththerepeatidenstepandthek-merhashtable
basedonstepinevitablyrecruitasmallportionofreadsthatdonotbelongtoany
CRISPR.Thesefalsepositivereadsslightlyincreasetheinputsizetothenextstepbutwill
beeasilyremovedduringeithertheclusteringsteportheassembly/validationstep.
4.3.3ClusteringofCRISPRreads
TheCRISPRreadsidenstepproducesamixtureofreadsthatcomefromCRISPRs
inditspeciesortCRISPRsofthesamespecies.Assemblythemixtureofreads
directlymayintroduceerroneouscontigsbecausetherepeat-spacerstructureofCRISPRs
leadstoaverytangledassemblygraph.Thusit'sbettertoseparatetCRISPRsand
assemblethemindividually.Weuseanoverlapgraphbasedapproachtoclusterreadsto
69
Figure4.5Anexampleofoverlapgraphandtransitiveedge.Threereadsformthreeedges.
Therededgeisatransitiveedgecanbesafelyremoved.
tclusters.Firstweintroducetheconceptofoverlapgraphandrelatedterminology,
thenwepresentthemethodweusetoseparatetCRISPRs.
4.3.3.1Overlapgraph
Anoverlapgraphisandirectedgraphthatoverlaprelationsbetweenreads.Each
nodeintheoverlapgraphrepresentsaread.Givenanytworeads
r
1
,
r
2
,andanoverlap
threshold
T
,iftheoverlapsizebetweentheof
r
1
andtheof
r
2
isgreaterthan
T
,
adirectededgewhichgoesfrom
r
1
to
r
2
isadded.Theoverlapthreshold
T
hashighimpact
ontheconnectivityandcomplexityofthegraph.Asmaller
T
isabletoconnectreadsin
lowlysequencedregions,butitproducesverytangledgraphsincetheremaybemanywrong,
randomedgesformedbetweenreadsthatarenottrulyconnectedinthegenomes.Alarger
T
producessimpleroverlapgraphwithtedges,butitmaymissconnectionsbetween
readsinthelowlysequencedregions.Afteraninitialoverlapgraphisbuilt,transitiveedges,
whichareredundantedgestorepresentgraphconnectivity,canberemovedtosimplifythe
graph.Atransitiveedgeisformedbetweentwonodesifthetwonodeareconnectedby
alternativepathsotherthanthetransitiveedge.Removingsuchedgesdoesnotchange
reachabilityofthegraphbutcangreatlysimplifytheoverlapgraph.Anexampleofoverlap
70
Figure4.6Top:twoCRISPRsthatshareverysimilarrepeatsequences.Thereadmapping
againstthetwoCRISPRslociareshown.Bottom:theinitialoverlapgraph
G
.
Readsofredandpurplearesequencedfromthetwoloci,respectively.Eachnoderepresents
areadofthesameletter.Blackandgreyedgesdenoterandomoverlaps.Thenumberalong
eachedgerepresentstheoverlap.Readsinapair
r
arerepresentedusing
r:
1and
r:
2.
graphandtransitiveedgeisillustratedinFigure4.5.Betweennodes
a
1and
a
3,thereisan
alternativepath
a
1
!
a
2
!
a
3,edge
a
1
!
a
3canbesafelyremoved.Inourimplementation,
weconstructoverlapgraphusingreadjoiner[28],whichisanoverlapgraphbasedassembler.
Itprovidesasetofcientalgorithmstocreateandmanipulateoverlapgraph.
4.3.3.2OverlapgraphbasedCRISPRreadsclustering
ToseparateCRISPRs,weusetwoimportantpropertiesspctoCRISPRs.

Spacersareusuallyunique.ForaCRISPRlocus,thearrayofspacersarequitet
fromeachother.Theyalsohaveverylowsimilaritywiththerepeatsequence.In
addition,spacersarenotconservedacrosserentCRISPRs.

LeadersequencesofCRISPRsthatsharethesamerepeatinonespeciesmayhavehigh
similarity,buttheyareusuallyquitetforCRISPRsfromtspecies,even
theseCRISPRssharethesamerepeatsequence[117,52,48].
71
Fromtheproperty,weknowthatrandomoverlapscausedbynearidenticalrepeats
cannotexceedthelengthofatypicalrepeat.AsshowninFigure4.6,forreadsthatstart
orendwithinrepeats,theytendtoformrandomoverlapswithreadsofsimilarproperties.
However,asthespacersareusuallyunique,theoverlapwillnotlikelytoextendbeyond
therepeats.Thus,byapplyinganoverlapthresholdthatismuchlargerthanthetypical
repeatsize,amajorityofrandomoverlapswillberemoved.Inaddition,itisrarethattwo
tCRISPRswillproducethesamereadunlessthereadiscompletelysequencedfrom
therepeats.However,asthelengthofreadsisusuallymuchlongerthanrepeats,theread
willcoveratleastpartofthespacer,whichcandistinguishreadsfromtloci.Hence,
it'sreasonabletoassumethattCRISPRsdonotsharethesamereadandwecould
useanonoverlappingclustermethodtoseparateCRISPRreads.
Fromthesecondproperty,weknowthattheremayexistlargeoverlapedgeswhichare
formedbetweenreadssequencedfromtheleaderregionsoftCRISPRsthatshare
repeatsequenceinthesamespecies.Weareunabletoremovetheseerroneousedgeseven
usingalargeoverlapthreshold.However,thiswillnotposeaproblemforthepipelineforthe
followingreasons.First,becausewebuildtheoverlapgraphwithoutallowingmismatches,
mostCRISPRsinthesamespeciescanstillbeseparatedintotclustersaslongas
theleadersequencearenottotallythesame.ThisistrueforalargeportionofCRISPRs
thatsharerepeatsequenceinthesamespecies[52].Second,evenifseveralCRISPRsare
clusteredintothesamecluster,wecanstillseparatethemintheassemblystage.
BasedontheseCRISPRspproperties,webuildanoverlapgraphwithalargeover-
lapthreshold(Thedefaultvalueis80ntor80%ofthelengthofthereads,whicheveris
smaller)toclusterCRISPRreads.Edgesinsuchoverlapgrapharehighlyreliable.Thatis,
anedgeconnectingtwonodes(reads)inthegraphhasveryhighprobabilitytothe
72
trueoverlapbetweenthereads.Readsinthesameweaklyconnectedcomponentareusually
comefromthesameCRISPR.Forasmallnumberofconnectedcomponents,eachcon-
nectedcomponentharborsseveralCRISPRsfromthesamespeciesbecausetheseCRISPRs
shareleadersequence.Thus,wecanclusterthereadsusingtheweaklyconnected
componentsintheoverlapgraph.Eachconnectedcomponentformsaninitialcluster.
AllreadsfromthesameCRISPRmaybeseparatedintoseveralinitialclusters.This
happensforthefollowingreasons.First,therearebothhighlyrepresentedandlowlyrepre-
sentedspeciesinametagenomicsample.Forlowlyrepresentedspecies,thesequencingdepth
ismuchlowerthanthehighlyrepresentedspecies.ForCRISPRsinlowlyrepresentedspecies,
mostoverlapsbetweenreadsmaybesmallerthantheoverlapthreshold,thusthereadsare
clusteredintotsmallclusters.Second,evenforahighlyrepresentedCRISPR,lowly
sequencedregionsmayexistlocallyandpreventallreadsfortheCRISPRgoesintoasingle
cluster.Thus,weconductamergingsteptocombineclusterswhosereadsaresequenced
fromthesameCRISPR.Weprocessclustersthatonlycontainasingleread.Alarge
portionofreadsfromtheseclustersarenotCRISPRreads.However,therearestillsome
readswhicharepartofCRISPRsbutfailedtogointoanylargerclustersbecausetheyhave
asmalleroverlapwithotherreads.Weusepaired-endinformationtorecruitsuchreads.For
areadinanyclusterofsizeone,iftheotherendofthepaired-endreadisinaclusterwhose
sizeisatleasttwo,thisreadisintothiscluster.Allsinglyclusteredreadsthatcan
notbeusingpaired-endinformationarediscarded.
Wethenbuildanundirectedgraphinwhichanoderepresentsaninitialclusterwhosesize
islargerthanone.Anedgeisformedbetweentwonodes
v
1
and
v
2
iftwoclustersdenoted
bythesenodeshaveenoughpaired-endreadssupport.Apairedendissaidtosupportnodes
v
1
and
v
2
ifoneendofthepairisintheclusterdenotedby
v
1
andtheotherendisincluster
73
denotedby
v
2
.Thenallinitialclustersinoneconnectedcomponentinthegraphforma
newcluster.
4.3.3.3Correcttheorientationofreads
ReadsinaNGSsamplecouldcomefrombothstrainsofthegenomesinthesample.Without
knowledgeoftheorientation,overlapscanbeformedbetweenofareadandtheof
anotherread(P
!
S),orbetweenthexofareadandthereversecomplementofanother
read(S
!
P),orbetweenthereversecomplementofareadandtheofanotherread
(P
!
P).Thisrequireswehaveatwodirectionaloverlapgraphwhendoinggraphtravesal
[82].Unlikegeneralassemblytoolswhichhavenoideaofthedirectionofeachread,wecan
correcttheorientationofthereadsineachclusterwiththehelpofthedirectrepeat.Foreach
cluster,weknowthedirectrepeatsequencesofthisclusterfromtherepeatiden
step.Byusingonedirectrepeatasreference,weareabletocorrecttheorientationofall
thereadsinthecluster.Afterre-orientationofthereads,weonlyneedtopayattentionto
overlapsintheassemblystep.
4.3.3.4Filterfalsepositiveclusters
Throwingawayalargeportionofthesinglyclusteredreadsonlyremovessomeofnon
CRISPRreads.TheremayexistlargerclustersthatdonotharboranyCRISPR.Because
assemblyisusuallytimeconsuming,wetrytoreducethenumberofclustersbeforegoing
totheassemblystep.Foreachcluster,wehaveanoverlapgraphbuiltusingalargeoverlap
threshold.Theoverlapgraphisusuallysimpleandedgesinthegrapharereliable.Wedoa
quicktraversalofthegraphtoformlongercontigsandcheckwhetherlongcontigsexhibit
repeat-spacerstructureofCRISPRs.Ifatleastoneofthecontigshassuchstructure,then
74
theclusteriskeptforfurtherprocessing.Thisstepcanremovealotofclustersthatdonot
containanyCRISPR.
4.3.4AssembleCRISPRsofheterogeneouscoverage
Overlapgraphbasedclusteringdividesreadsintotclusters,eachofwhichcontains
readssequencedfromasingleCRISPRorCRISPRssharingrepeatsequencesinthesame
species.Foreachcluster,webuildanoverlapgraph,andanovelassemblypipelineoptimized
foridentifyingCRISPRswillbeappliedtoeachoverlapgraphtorecoverCRISPRs.Inorder
torecoverCRISPRsoflowsequencecoverage,weconstructtheinitialoverlapgraphusing
asmalloverlapthreshold.
RecoveringCRISPRsfrommetagenomicdataneedstoaddressthechallengesofhetero-
geneoussequencecoverageandahighlytangledgraphcausedbyrandomoverlapsbetween
repeats.Figure4.7sketchesthemainstepsandtheresultsofapplyingthemtoanoverlap
graphthatcontainsthreeCRISPRs,ofwhichonehashighsequencecoverage(red),andthe
othertwohavelowsequencecoverage.ThesethreeCRISPRsareinthesamespeciesand
shareverysimilarrepeatsequencesthustherearemanyedgesgeneratedbyrandomoverlaps
intheinitialoverlapgraph.TheserandomedgesinthesameCRISPRandbetweent
CRISPRslocileadtohighlytangledandcyclicgraphs.Ourgoalistorecoverthethree
correctpathscorrespondingtothethreeCRISPRsloci.WeformulateCRISPRsassembly
problemasthelongestsimplepathsthathavesupportfrompairedendreadsinthe
overlapgraph.
FromtheCRISRPsppropertiesstatedinsection4.3.3.2,weknowthatbyapplying
anoverlapthresholdlargerthanthetypicalrepeatsize,amajorityofrandomoverlapscan
beremoved.Wereferedgeswhoseoverlapsizeislargerthanthreshold
˝
as
c
75
Figure4.7Leftpanel:majorstepstoassembleCRISPRs.Rightpanelshowsanexamplewith
threeCRISPRs,whicharerepresentedbyred,purple,andgreennodes/edges,respectively.
TheCRISRPrepresentedusingredcolorhashighsequencecoverage,theothertwohave
lowsequencecoverage.Readsinapairareconnectedbyablack,dotted,arrowline.Nodes
whichformsacyclearecircledusingadottedline.
76
edges
.Here
˝>
repeatsize.Inaddition,wecanremovereadsthathavebeenincludedin
assembledCRISPRslociwithoutthesearchfortheremainingCRISPRsbecause
areadonlybelongstooneCRISPR.Basedontheseobservations,weproposethefollowing
methodtoassembleCRISPRs.
4.3.4.1Step1:cleangraphandrecoverhighlysequencedCRISPRs
Wecleanthegraph.First,bubbles,whichareintroducedbysequencingerrors,canbe
easilyidenandremoved.Node
r
inFigure4.7isanexampleofbubble.Second,we
cleanthegraphusingtedges.Forexample,inFigure4.7,theedgefrom
b
3to
b
4
isatedge,thusedgefrom
b
3to
c
2andtheedgefrom
b
3to
b
2
:
2canberemoved.
Third,wecanfurtherusepairedendinformationtoremoveerroneousbranches.Theideais
thatifoneendofapairedendreadisinonenode,andtheotherisinanother,thenthepath
betweenthemismorereliablethanpathsbetweennodeswithoutpairedendreadsupport.
InFigure4.7,node
a
6isabranchingnode.Theerroneousbranchfromnode
a
6to
r
1can
beremovedusingcotedge.Becausethereisapairedendreadfornode
a
4
:
1and
a
4
:
2,
thuspathfrom
a
4
:
1to
a
4
:
2ismorecotthanpathfrom
a
4
:
1to
r
1,edgefrom
a
6to
r
1
canberemoved.
Aftercleaningthegraph,CRISPRswithhighsequencecoverageformaverysimple
graphwithoutcycle.LongestpathdingcanbeappliedtorecovertheseCRISPRs.A
post-processingcanbedonetohandleaveryrarecasewhenthesameCRISPRisdiscon-
nectedbecausetheoverlapbetweentwonodesissmallerthan
˝
.Thepathsthatrepresent
segmentsofadisconnectedCRISPRcanbeconnectedusingpaired-endreadsinformation.
Forexample,inFigure4.7,node
a
7and
a
8canbeconnectedusinginformationfrompaired-
endreads
a
4
:
1and
a
4
:
2.
77
4.3.4.2Step2:cleangraphbyremovingrecoverednodesandedges
GiventheobservationthattCRISPRsdon'tsharethesamereads,allthereads,their
correspondingnodes,andassociatededgesfromrecoveredCRISPRscanberemovedfrom
G
.Thisleadstograph
G
2
,whichmaycontainseveralCRISPRswithlowsequencecoverage.
In
G
2
,thereexistalottrueoverlapsthathavesimilarsizeswiththerandomoverlaps,and
cannotbeeasilydistinguishedusingcoverageinformation.Evenworse,therandomoverlaps
formedbyrepeatswithinthesamelociorbetweentwolowcoverageCRISPRsmayform
cyclesinahighlytangledgraph,whichmakeslongestpathahardproblem.
4.3.4.3Step3:breakcycles
Inthisstep,wewillfurthersimplifythecyclicgraph
G
2
bybreakingcycleusinginformation
providedbypairedendreads.Firstwemakethefollowingassumptions:

Assumption1:ifreads
R:
1,
R:
2arepaired-endreads,and
R:
1and
R:
2areinthesame
weaklyconnectedcomponentofanoverlapgraph,thentheremustbeapathfrom
R:
1to
R:
2(assume
R:
1isintheupstreamof
R:
2).Adirectedgraphisweaklycon-
nectedifreplacingallofitsdirectededgeswithundirectededgesproducesaconnected
undirectedgraph.

Assumption2:ifaweaklyconnectedgraphhascycles,thenforeachcycle,thereexists
atleastonepaired-endreadsuchthat
R:
1(or
R:
2)isanodeinthecycle,and
R:
2(or
R:
1)isanodeoutofthecycle.
Inmostcases,tworeadsinapairaresupposedtobesequencedfromthesameCRISPR
locus.Iftheyareinthesameweaklyconnectedcomponent,thenthereshouldbeatleastone
pathbetweenthem(assumption1).Thesecondassumptiongivesusinformationtobreak
78
thecycleusingthemethodthatwillbepresentedhere.Ifthisassumptiondonotholdfor
somecycles,wesimplydoadeeptraversalofthegraphandremoveallbackedges.
Withtheassumptionswemade,herewepresentthealgorithmtobreakcyclesinan
overlapgraph.Suppose
R:
1and
R:
2arepaired-endreads,
R:
1isinacycle,and
R:
2isnot
inthecyclebutinthedownstreamof
R:
1.Basedontheassumptions,thereexistatleasta
pathfrom
R:
1to
R:
2.Becauseeverynodecanonlybevisitedonce(weassumethataread
isnotsharedbetweentCRISPRs)whenformingthepath,allingoingedgesto
R:
1
shouldnotbeonthepathfrom
R:
1to
R:
2.Otherwise,
R:
1willbevisitedatleasttwice:
oncefrom
R:
1tosomeothernodewhenstartingwalkingfrom
R:
1,oncefromtheingoing
edgeto
R:
1.Basedonthisobservation,wecantemporallyremovealltheingoingedgesto
R:
1tobreakthecycle.Onceallcyclesarebroken,thegraphbecomesadirectedacyclic
graph(DAG),thelongestpathalgorithmcanbereadilyusedtorecovertheCRISPRs.
Figure4.8showstwocasesincyclebreakingandthestepstobreakthecycleinonecase.
Thetwocasesareonlyntatlocationofthecycle,whichdeterminesedges(outgoing
oringoing)thatshouldbetemporallyremoved.Incase1,thecycleisintheupstreamof
a
1
:
2,thenweneedtotemporallyremoveallingoingedgesof
a
1
:
1.Incase2,thecycleisin
thedownstreamof
a
1
:
1,thenweneedtotemporallyremovealloutgoingedgesof
a
1
:
2.All
theotheroperationsareexactlythesame.
TherightpanelofFigure4.8showsanexampleofhowtobreakacycleincase1.
e:
2is
temporallyremovedbecauseitisaningoingedgeof
a
1
:
1.ThegraphbecomesaDAG,the
longestpathbetween
a
1
:
1and
a
1
:
2canbeeasilyfound(thepathformedbyrededges).
e:
5
isremovedbecauseit'sabranchingedgeonthepath.Afteradding
e
2back,wecana
longestpathwhichcontains
e
2,
e
1,
e
4,and
e
3.
79
Figure4.8Leftpanel:twocasesincyclebreaking.Case1:thereadwhichpairswithread
inthecycleisinthedownstreamofthegraph;case2:thereadthatpairswiththereadin
thecycleisintheupstreamofthegraph.RightPanel:mainstepstobreakthecyclefor
case1.Paired-endreadsareconnectedusingablack,dotted,arrowline.Edgesrepresented
byred,dottedlinesaretemporallyremovededges.Red,solidlinesformthelongestpath
betweentwopairedreads.
4.3.4.4Step4:recoverlowlysequencedCRISPRs
Nowallcyclesarebroken,wehavealistofDAGs.ForeachDAG,alongestpathcanbe
found.Thenthenodes,edgesassociatedwithpathcanberemovedfromtheDAGtoform
anewDAG.Thestepisrepeateduntilthegraphisempty.
Aftertheassembly,allcontigslongerthanauserthreshold
L
isvalidated.For
eachcontig,thevalidationstepidenrepeatsequencesinthecontig,thenvalidates
thesizeoftherepeats,thesizeofthespacers,andthenumberofrepeatsequences.Ifallof
themmeettheuservalues,thecontigisasaCRISPR.Usercanalsouse
othergenome-wideCRISPRdetectiontoolssuchasPILER-CR,CRISPRFinder,tovalidate
thecontigs.
80
4.4Experimentalresults
ToevaluatetheperformanceofmetaCRISPR,weconductedseveralexperimentsonthree
metagenomicdatasets.Thedatasetisamockcommunitydatasetwhichwassequenced
fromsyntheticcommunities.Thedatasetcontains16archaeamembersand48bacteria
members[99].Ofthe64speciesinthedataset,48ofthemhave220CRISPRannotations
inCRISPRdb.Thedatasetcontainsabout105millionof101bppaired-endreads.The
fastqsizeis34G.Asthespecies,theirgenomes,andtheCRISPRannotationsinthe
genomesareknown,weareabletoevaluatetheperformanceofCRISPRreadiden
readclustering,andassembly.ThesecondoneisaRhizosphereSoildataset.Thefastq
ofthedatasetisabout117Gaftercleaning.Itcontains271millionof150ntpaired-end
reads.WeusethisdatasettoshowmetaCRISPR'sabilitytohandlehugemetagenomic
data.Thelastoneisasix-samplehumangutdatasetsequencedfromfamilymembersofa
patientwithdiarrhea.ByaligningspacersidenbymetaCRISPRtotheviruscontigs
inthemetagenomicsample,weareabletoshowthatCRISPRsidenbymetaCRISPR
arereliable.
TherearetwoapproachestoidentifyCRISPRsfrommetagenomicdata.Thecat-
egoryreliesondenovoassemblyandthenidenCRISPRsfromtheassembledcontigs.
ThesecondcategorydirectlyidenCRISPRsfrommetagenomicreaddata.Webench-
markedmetaCRISPRwithtoolsinbothcategories.Forthecategory,wechoseIDBA-ud
asthebulkmetagenomicassembler,andusedCRISPRFinderasthegenome-wideCRISPR
detector.Forthesecondcategory,webenchmarkedCrass,whichistheonlytoolwenoticed
toidentifyCRISPRsdirectlyinmetagenomicreaddata.
81
Table4.1TheoftheCRISPRreadsidencation
datasetreadsinrawrecruitedreadsratetime(min)
a
memory(G)
mock
105m11334441.08%18.73.7
soil
271m41366441.52%41.65.2
gut
28m1486360.53%4.40.47
a
Walltimeusingfourthreads,averageofthreeruns.
4.4.1EvaluatingtheperformanceofCRISPRreadsiden
TheCRISPRreadidenisthekeystepformetaCRISPRtohandlehugemetage-
nomicdataset.Weevaluateditstoreducethesizeoftherawdataset.
Table4.1showsthattheCRISPRreadsidenstepcandiscardatamount
ofreadsrapidly.Forthethreedatasets,thesizesofthereduceddatasetsrangefrom0.53%
to1.52%oftherawdatasets.AsshownintheTable,runningtimeandmemoryusageis
smallwithrespecttothesizeoftherawdatasets.
AsweknowthespeciesandtheannotatedCRISPRsinthemockcommunitydataset,we
canevaluatetheaccuracyofCRISPRreadsidenusingsensitivity.Thesetofreads
thataresequencedfromCRISPRregionsaretruepositivereads(TP).Let
P
bethesetof
readsrecruitedbymetaCRISPR.Sensitivityisas
P
\
TP
TP
.Toshowthatincluding
knownrepeatscanimprovetheperformanceofmetaCRISPR,weranthereadsiden
withknownrepeatsandwithoutknownrepeats,respectively.Fortherunwithoutknown
repeats,thepipelinerecruited228036outof237162TPreads.Thesensitivityis96.2%.For
therunthatincludesknownrepeats,thesensitivityincreasedto98%.Tofurtherinvestigate
theperformance,wecomparedthesensitivityofthetworunsonCRISPRlevel.InFigure4.9,
x-axisshowstheindexof220CRISPRsinthedataset.TheCRISPRsaresortedbythe
numberofrepeatstheyhave.Thebluelinewithsquaredmarkershowsthesensitivityof
therunwithknownrepeats.Theredlinewithtrianglemarkershowsthesensitivityofthe
82
Figure4.9SensitivityoftheCRISPRreadsidenstep.CRISPRsaresortedalong
thex-axisbythenumberofrepeatstheycontains.Seemaintextfordetails.
runwithoutknownrepeats.Thelinewithcyclemarkershowstheaveragelengthofthe
repeat-spacerunit.Wecanseetheperformanceofthetworunsareverysimilarformostof
theCRISPRs.WithoutincludingknownCRISPRs,thepipelinefailedtorecruitanyreads
forsixCRISPRs.TheindicefortheseCRISPRsare9,13,46,84,93,103,respectively.The
reasonisthattheaveragelengthsoftherepeat-spacerunitsoftheseCRISPRsarelonger
thantheothers.Asthereadlengthofthedatasetis101bp,it'simpossibletoseetwo
copiesof20-mersoftherepeatinonereadforsuchCRISPRswithlongrepeat-spacerunits.
ByincludingknownrepeatsfromCRISPRdb,weareabletomostofthereadsforthese
CRISPRs.ItshouldbenotedthatalthoughweusedknownrepeatsonlyfromCRISPRdb,
userscanuseanyknownrepeatsfromotherresources.
4.4.2Theperformanceofreadclustering
ThegoalofthereadclusteringstepistoseparatetCRISPRsaswellastoreduce
falsepositivereadsrecruitedinthestep.Weusedthemockdatasettoevaluatethe
83
Table4.2Theperformanceofclusteringonmockdataset
readtypereadsafterreadsafterclusteringreduction
TPreads2324982322870.1%
FPreads90094619678778.2%
Total1133444429074
performanceofreadclustering.Theclusterstepproduced1533clustersontherecruited
mockdata.InTable4.2,wecanseethataftertheclusteringstep,mostofthereads
sequencingfromCRISPRregions(TPreads)werekept,whilealargeportionofFPreads
wereremoved.Althoughtherewerestill196787FPreadsintheclusters,theycanbeeasily
idenandremovedinthevalidationstepafterassembly.
WethenevaluatedtheabilityofmetaCRISPRtoseparatetCRISPRs.After
runningthereadclusteringstepontherecruitedreadsonmockdataset,readsbelongto
oneCRISPRsmaygotoseveralclusters.Inaddition,oneclustermayharbormultiple
CRISPRs.InFigure4.10,weshowthenumberofclustersthateachCRISPRhasusing
theredlinewithtrianglemarker.Thebluelinewithsquaredmarkershowsthenumberof
CRISPRsthateachclusterharbors.Hereweonlyincludedclustersthathavemorethan5
TPreads.Thereare175suchclusters.FromthewecanseethatmostCRISPRsare
separatedtotheirownclusters.Forthe220CRISPRs,only14ofthemhavemorethanone
cluster.Thereare29outof175clustershavemorethanoneCRISPRs.Ofthese29clusters,
28ofthemonlycontainCRISPRsfromonespecies.Theleftclustercontains15CRISPRs
fromtwospecies.Fromtheanalysiswecanseethereadclusteringstepperformedverywell
onseparatingerentCRISPRs.
84
Figure4.10Performanceofthereadclustering.
4.4.3PerformanceofCRISPRassemblyonthemockdataset
Next,weappliedtheassemblyalgorithmtotheclustersthepipelineproduced,andcom-
paredtheperformanceofmetaCRISPRwithIDBA-ud/CRISPRFinderandCrass.The
initialoverlapthresholdandthetoverlapthresholdfortheassemblyweresetto
30and50,respectively.ForIDBA-ud/CRISPRFinder,weranIDBA-udwithdefault
parametersonthefullmetagenomicdataset,theassembledcontigswereuploadedtothe
CRISPRFinderwebsite.AllCRISPRspredictedbyCRISPRFinderweredownloadedand
usedasthepredictionofthepipeline.ForCrass,weranitontherawmetagenomicdataset
withdefaultparameters.Crassonlyreportsspacersandtheirarrangement.Itrequiresusers
tomanuallychecktheoutputoutwhichsegmentsoutputbyCrassshouldbeas-
sembledtogether,andgivetheinformationtoawrapperofvelvetorcap3[45]forassembling
thedetectedreads.It'snoteasyforuserstogetcompleteCRISPRsequencesifthereare
manygroups.Thus,insteadofusingthewrapperprovidedbyCrass,wedirectlyusedvelvet
andcap3toassemblethegroupsCrassdetected.Forvelvet,weusedtheVelvetOptimiser
85
pipeline[100].Theminimumandmaximumk-mersizeweresetto50and80,respectively.
Thestepsizewassetto5.Forcap3,alltheparametersuseddefaultvalues.Thecontigs
producedbytheassemblerswereuploadedtoCRISPRFindertoannotateCRISPRs.
Weusedthefollowingmetricstomeasuretheperformance:

FullyrecoveredCRISPRs(
full
):thesequenceoftheannotatedCRISPRisfullycovered
byoneormoreoutputofthepipeline.

PartiallyrecoveredCRISPRs(
partial
):thesequenceoftheannotatedCRISPRsis
partlycoveredbyoneormoreoutputofthepipeline.

ChimericCRISPRs(
chimeric
):It'spossiblethattheassemblerconnectstwopartsof
twotCRISPRstogetherandformsachimericCRISPR.Thesepredictionsare
wrong.

Truespacersrecovered:thenumberofannotatedspacersthatarerecovered.

Resourceusage:thisincludesCPUusageandmemoryusage.
ThesummaryoftheperformancecomparisonwasshowninTable4.3.Thenumbersof
fullyrecoveredCRISPRsformetaCRISPRandIDBA-ud/CRISPRFinderwereverysimi-
lar:metaCRISPRfullyrecovered151CRISPRswhileIDBA-ud/CRISPRFinderfullyrecov-
ered153.IDBA-udpartiallyrecovered40CRISPRs,whilemetaCRISPRpartiallyrecov-
ered37CRISPRs.IDBA-udonlyproducedonechimericCRISPR.metaCRISPRwasonly
slightlyworsethanIDBA-ud:itproducedthreechimericCRISPRs.Overall,theaccuracy
ofmetaCRISPRiscomparabletotheaccuracyofIDBA-ud/CRISPRFinderpipelineonthe
rawmetagenomicdataset.However,asforresourceusage,theproposedpipelinewasmuch
fasterandusedmuchlessmemorythanIDBA-ud/CRISPRFinder.Themocksampleisa
86
medium-sizedmetagenomicdataset.Toassembleit,IDBA-udusedabout34GBmemory
andtookmorethan18hours.TheresourcesrequiredbyIDBA-udarenotavailablefor
mostoftoday'sdesktopmachines.metaCRISPRwasmuchmoreresourcet:itonly
took1.32hoursandused3.9GBmemory.Thismakesitmoreconvenientforidentifying
CRISPRsonlargescalemetagenomicdatasetonevenapersonalcomputer.Thisisoneof
themotivationsofmetaCRISPR.Crasswithcap3generated661CRISPRs.Thisismuch
largerthanthenumberofannotatedCRISPRs(220).Itonlyfullyrecovered89andpartially
recovered102oftheannotatedCRISPRs.TheperformanceforCrasswithvelvetwasthe
lowestamongallthebenchmarkedpipelines.Thisindicatedthattheassemblystagefor
Crassdidnotworkquitewell.Thisiswhyweneedanassemblerthatcantakeadvantage
ofthepropertiesofCRISPRs.
Anotheraspecttoevaluatetheperformanceofthebenchmarkedtoolsistocheckhow
manyoftheidenspacersoverlapwithspacersintheannotatedCRISPRs.FromTa-
ble4.3wecanseenumbersofspacersgeneratedbymetaCRISPRandIDBA-ud/CRISPRFinder
arequitesimilar.FormetaCRISPR,83.4%oftheidenspacersoverlappedwithannota-
tions.ForIDBA-ud/CRISPRFinder,thenumberis84.7%.Crassgeneratedsimilarnumber
ofspacersasIDBA-ud/CRISPRFinder.However,thenumberofspacersthatareinthean-
notatedCRISPRswerelowerthanbothmetaCRISPRandIDBA-ud/CRISPRFinder:only
79.7%oftheidenspacersoverlappedwithannotations.Thismeansalotofspacers
detectedbyCrasswerefalsepositive.
4.4.4Experimentalresultonthesoildataset
Thedatasetisasyntheticdatasetandthesizeisnotbigenoughtochallengethe
benchmarkedtools.Inthisexperiment,weappliedmetaCRISPRtoalargerandmore
87
Table4.3Theperformanceofthebenchmarkedtoolsonthemockdataset
fullpartialchimericspacers
a
timememory
toolpredictions(h)(G)
metaCRISPR3211513735353/64141.323.9
IDBA-ud/
CRISPRFinder2881534015374/634318.534
Crass(cap3)6618910225028/63042.290.9
Crass(velvet)5787410505028/63043.420.9
a
ThenumberofspacersintheannotatedCRISPRs/Thenumberoftotalidenspacers.
Table4.4Theperformanceofthebenchmarkedtoolsonthesoildataset
toolmaxaveragetimememory
predictionsspacerslengthlength(h)(G)
metaCRISPR12396850305271.083.2
IDBA-ud/CRISPRFinder10882950305661.13.1
Crass(cap3)839247648472230.890.78
complexsoilmetagenomicdataset.Thedatasetcontains271millionof150ntpaired-
endreadsandthefastqofthedatasetisabout117G.Wetriedtoassemblethedata
setusingIDBA-udonahighperformancecomputingnodewith120Gmemorybutfailed.
Theprogramwaskilledbecauseitusedmorememorythanrequested.WeranCrasson
therawdataset.Itdidnotafter24hours.Thus,herewereportedtheresultof
IDBA-ud/CRISPRFinderandCrassonthedataset.ForCrass,weranitwithcap3
becausetheexperimentshowedbetterperformancethanvelvet.
Unliketheexperiment,thegenomesinthedatasetareunknown.Insteadofcom-
paringwithannotatedCRISPRs,weusedthefollowingmetrics:thenumberofCRISPRs
identhenumberofspacersidenmaxlengthoftheidenCRISPRs,av-
eragelengthoftheidenCRISPRs,andresourceusage.Theresultwasshownin
Table4.4.metaCRISPRdetected5moreCRISPRsand139morespacersthanIDBA-
ud/CRISPRFinderonthisdataset.TheoverlapbetweenthespacersdetectedmetaCRISPR
andIDBA-ud/CRISPRFinderwas669,whichisabout69%ofallthespacersdetectedby
88
Table4.5Theperformanceofthebenchmarkedtoolsonthehumangutdataset
metaCRISPRIDBA-ud/CRISPRFinder
sampleviromecontigsCRISPRsspacershitsCRISPRsspacershit
patient6535
181017
4370
follow-up4680
15131
014
133
0
healthy19416
5336427
3928820
healthy27620
353433
282622
healthy336
30293
0202550
healthy4273820138
6
21
149
0
metaCRISPR.ThemaximumlengthoftheCRISPRsoutputbymetaCRISPRandIDBA-
ud/CRISPRFinderwasthesame.Onaverage,metaCRISPRoutputshorterCRISPRsthan
IDBA-ud/CRISPRFinder.Crasswithcap3predicted839CRISPRs.Thenumberwasabout
8timesofthenumberofCRISPRspredictedbyIDBA-ud/CRISPRFinder.Theaverage
lengthofthepredictedCRISPRsoutputbyCrasswaslessthanhalfoftheaverageCRISPR
lengthoftheothertools.Thisindicatedthattheassemblydidnotworkwell,whichis
consistentwiththeresultonthemockdataset.Crassdetected2476spacers,whichwas
muchlargerthantheoutputofmetaCRISPRandIDBA-ud/CRISPRFinder.Wemapped
thespacerspredictedbyCrasstotheinputreadsdata.Abouthalfofthespacerscanonly
bemappedtooneread,and1739ofthespacerscanonlybemappedtolessthanthreereads.
Thereadcountsforthesespacersaretoolowtodistinguishthembetweentruespacersfrom
lowlysequencedCRISPRsandfalsepositives.
4.4.5Experimentalresultonthehumangutdataset
Inthelastexperiment,weappliedmetaCRISPRtoahumangutmetagenomicdataset.The
datasetwassequencedusingIlluminafromfamilymembersofapatientwithdiarrhea.It
containssixsamples.One(case)wassequencedfromthepatientwithillness,anothersample
(followup)wassequencedafterthepatienthaverecoveredfromtheillness.Theotherfour
89
samples(control)weresequencedfromotherfamilymembers.Thenumbersofreadsinthe
samplesrangefrom3millionto7million.Inthisexperiment,weevaluatedthereliableof
theidenspacers.Inthecas-CRISPRsystem,thestageintheimmuneresponseto
aninvadingphageiscapturingashortpartoftheDNAofthephage,insertingthesequence
tooneendofaCRISPRlocusandformingaspacer.Thusifaspacerwasnewlyacquired
fromaninvadingphageintheenvironment,thespacersequenceshouldmatchpartofthe
genomeofaphageintheenvironment.Thusbysearchingtheviralcontigsassembledfrom
themetagenomicdataset,it'spossibletovalidatewhetherthepredictedspacersarereliable
ornot.
WepredictedCRISPRsusingmetaCRISPRandIDBA-ud/CRISPRFinderonallsix
samples.ThenumberofCRISPRsandspacerspredictedwereshowninTable4.5.Onmost
samples,metaCRISPRpredictedmoreCRISPRsandmorespacers.Wethenassembled
thesixmetagenomicsamplesusingIDBA-udwithdefaultparameters.Thecontigswere
uploadedtoMetavir[94],whichisawebservertoannotateviralmetagenomicsequences.
It'sabletoannotateeachsequenceasviromeornon-viromebasedonsimilarityscorewith
knownviromes.Thenumberofvirome-likecontigsforeachsamplewereshowninTable4.5.
WeusedBowtietomapthepredictedspacerstothevirome-likecontigsineachsample.The
numbersofhitsareshowninthe'hits'columnofTable4.5.Wecanseethatfor4outof6
samples,thereweresomeproportionsofspacerscanbemappedtothevirome-likecontigs
assembledfromthesamesample.Forallthefoursamples,metaCRISPRhadmorehits
thanIDBA-ud/CRISPRFinder.Especiallyforthesamplesequencedfromthepatient,and
thesamplesequencedfromahealthyfamilymember(healthy4),IDBA-ud/CRISPRFinder
hadnohits,whilemetaCRISPRhad7and6hits,respectively.Theexperimentshowsthat
spacersidenbymetaCRISPRarereliable.
90
Chapter5
Conclusionandfuturework
Inthisdissertation,Iintroducednon-codingRNAandtheimportancetoidentifyncR-
NAsfromlarge-scalebiologicaldataset,whichincludesgenomicsequencedataandnext
generationsequencingdata.AstheaccumulationofNGSdata,especiallymetagenomic
data,ncRNAidenbecameachallengeproblem.First,traditionalsequenceanalysis
toolsarenotsensitiveenoughtodetectremotelyhomologousncRNAs.Second,NGSpro-
ducesshortsequences,mostexistingncRNAidentoolscannotbereadilyapplied
tothistypeofdata.Third,NGS,especiallymetagenomicsequencing,producesmassive
amountofshortreaddata.Alotofexistingalgorithmsandanalysispipelinesarenotable
tohandledataofsuchscale.Toaddressthesechallenges,Iproposedasetofalgorithmsin
thisdissertation.
ThetoolisChain-RNA,whichcombinessequencesimilarityandstructuresimilarity
toidentifylowlyconservednon-codingRNAsingenomicsequencedata.It'sfasterthan
structurebasedtoolsandismoresensitivethansequenceanalysistoolssuchasBLAST.
Thesecondtool,miR-PREFeR,predictsplantmicroRNAsfromnextgenerationsmallRNA
sequencingdata.It'saccurate,fast,andveryuser-friendly.ThelasttoolIdevelopedis
metaCRISPR,whichisatargetedCRISPRidentoolthatworksonmetagenomic
data.ItcanrapidlyrecoverCRISPRsfromlarge-scalemetagenomicdatasetaccurately.
Thereareseveraldirectionsthatthepreviousstudiescanbeimproved:
91

miR-PREFeRcurrentlyonlyworksforplantgenomesbecauseitwasdevelopedby
utilizingplantmicroRNAannotationcriteria.SincemanysmallRNAseqdatafor
animalsareavailable,itisusefultoextendmiR-PREFeRtoanimalgenomes.In
addition,miR-PREFeRrequiresuserstomapthesmallRNASeqdatatoareference
genomeorcontigs,whichmaynotbeavailableforsomenewlystudiedspecies.Whether
wecanextendmiR-PREFeRtouseonlysmallRNASeqdataisaninterestingdirection
forfurtherstudy.

TheststepofmetaCRISPRistorecruitCRISPRreadsfromlarge-scalemetagenomic
dataset.Nowthealgorithmdependsontheassumptionthatthereexisttwocopies
ofthesamek-merintherepeatregioninaread.Thisisnottrueforsampleswhose
readlengthisshort(i.e.75nt).Italsohasproblemwhentherepeat-spacerunitis
long.Althoughbyincludingknownrepeatscanimprovetheperformance,itdidnot
completelysolvetheproblemsincetheremayexistCRISPRswithnew,longrepeats
inametagenomicsample.Weneedtodesignalgorithmsthatdonotdependonthe
repeatstructureassumption.

TheresourceusageofmetaCRISPRcanbefurtherimproved.First,theassemblystep
wasimplementedinPythonandwassinglethreaded.Becausetclusterscanbe
assembledindependently,oneimprovementistoparalleltheassemblysteptoutilize
themulti-corearchitectureofmostoftoday'sdesktopcomputers.Inaddition,the
pipelineusedthePythonlibraryofNetworkX[36]tomanipulateoverlapgraphsinthe
assemblystage.Itisslowwhentheoverlapgraphislarge.Theperformancecanbe
furtherimprovedifweimplementthepipelineusingC++.
92
BIBLIOGRAPHY
93
BIBLIOGRAPHY
[1]R.Achawanantakun,Y.Sun,andS.S.Takyar.NcRNAconsensussecondarystruc-
turederivationusinggrammarstrings.
JournalofBioinformaticsandComputational
Biology
,9(2):317{337,2011.
[2]O.Aljawad,Y.Sun,A.Liu,andJ.Lei.NcRNAhomologysearchusingHammingdis-
tanceseeds.In
ProceedingsoftheACMConferenceonBioinformatics,Computational
BiologyandBiomedicine(ACM-BCB11):1-3August2011;Chicago
,2011.
[3]S.Altschul,W.Gish,W.Miller,E.Myers,andD.Lipman.Basiclocalalignment
searchtool.
JournalofMolecularBiology
,215:403{410,1990.
[4]J.An,J.Lai,M.L.Lehman,andC.C.Nelson.miRDeep*:anintegratedapplica-
tiontoolformiRNAidenfromRNAsequencingdata.
NucleicAcidsRes.
,
41(2):727{737,Jan2013.
[5]A.F.AnderssonandJ.F.Viruspopulationdynamicsandacquiredvirus
resistanceinnaturalmicrobialcommunities.
Science
,320(5879):1047{1050,May2008.
[6]M.J.Axtell.ShortStack:comprehensiveannotationandquanofsmallRNA
genes.
RNA
,19(6):740{751,Jun2013.
[7]R.Barrangou,C.Fremaux,H.Deveau,M.Richards,P.Boyaval,S.Moineau,D.A.
Romero,andP.Horvath.CRISPRprovidesacquiredresistanceagainstvirusesin
prokaryotes.
Science
,315(5819):1709{1712,Mar2007.
[8]I.Ben-BassatandB.Chor.CRISPRdetectionfromshortreadsusingpartialoverlap
graphs.
LectureNotesinComputerScience
,9029:16{27,Mar2015.
[9]S.H.BernhartandI.L.Hofacker.FromconsensusstructurepredictiontoRNAgene

infunctionalgenomicsproteomics
,8(6):461{471,2009.
[10]S.H.Bernhart,I.L.Hofacker,andP.F.Stadler.LocalRNAbasepairingprobabilities
inlargesequences.
Bioinformatics
,22(5):614{615,2006.
[11]M.Blanchette,W.J.Kent,C.Riemer,L.Elnitski,A.F.Smit,K.M.Roskin,
R.Baertsch,K.Rosenbloom,H.Clawson,E.D.Green,D.Haussler,andW.Miller.
94
AligningmultiplegenomicsequenceswiththeThreadedBlocksetAligner.
Genome
Research
,14(4):708{715,2004.
[12]C.Bland,T.L.Ramsey,F.Sabree,M.Lowe,K.Brown,N.C.Kyrpides,andP.Hugen-
holtz.Crisprrecognitiontool(crt):atoolforautomaticdetectionofclusteredregularly
interspacedpalindromicrepeats.
BMCBioinformatics
,8:209,2007.
[13]A.F.Bompfunewerer,C.Flamm,C.Fried,G.Fritzsch,I.L.Hofacker,J.Lehmann,
K.Missal,A.Mosig,B.Muller,S.J.Prohaska,B.M.Stadler,P.F.Stadler,A.Tanzer,
S.Washietl,andC.Witwer.Evolutionarypatternsofnon-codingRNAs.
Theoryin
Biosciences
,123(4):301{369,2005.
[14]M.S.Campbell,M.Law,C.Holt,J.C.Stein,G.D.Moghe,D.E.Hufnagel,J.Lei,
R.Achawanantakun,D.Jiao,C.J.Lawrence,D.Ware,S.-H.Shiu,K.L.Childs,
Y.Sun,N.Jiang,andM.Yandell.MAKER-P:atool-kitfortherapidcreation,
management,andqualitycontrolofplantgenomeannotations.
PlantPhysiology
,
164(2):513{524,Feb2014.
[15]S.Chikkagoudar,D.R.Livesay,andU.Roshan.PLAST-ncRNA:Partitionfunction
LocalAlignmentSearchToolfornon-codingRNAsequences.
NucleicAcidsResearch
,
38:W59{W63,2010.
[16]L.Cong,F.A.Ran,D.Cox,S.Lin,R.Barretto,N.Habib,P.D.Hsu,X.Wu,W.Jiang,
L.A.andF.Zhang.MultiplexgenomeengineeringusingCRISPR/Cas
systems.
Science
,339(6121):819{823,Feb2013.
[17]Y.Cui,Y.Li,O.Gorg,M.E.Platonov,Y.Yan,Z.Guo,C.Pourcel,S.V.Dentovskaya,
S.V.Balakhonov,X.Wang,Y.Song,A.P.Anisimov,G.Vergnaud,andR.Yang.
InsightintomicroevolutionofYersiniapestisbyclusteredregularlyinterspacedshort
palindromicrepeats.
PLoSOne
,3(7):e2652,2008.
[18]M.deBerg,O.Cheong,M.vanKreveld,andM.Overmars.
ComputationalGeometry:
AlgorithmsandApplications(3rded.)
.Springer,USA,2008.
[19]J.E.DiCarlo,J.E.Norville,P.Mali,X.Rios,J.Aach,andG.M.Church.Genome
engineeringinSaccharomycescerevisiaeusingCRISPR-Cassystems.
NucleicAcids
Res
,41(7):4336{4343,Apr2013.
[20]C.Dieterich,H.Wang,K.Rateitschak,H.Luz,andM.Vingron.CORG:adatabase
forCOmparativeRegulatoryGenomics.
NucleicAcidsResearch
,pages55{57,2003.
95
[21]C.B.Do,C.-S.Foo,andS.Batzoglou.Amax-marginmodelforetsimultaneous
alignmentandfoldingofRNAsequences.
Bioinformatics
,24(13):168{176,2008.
[22]R.C.Edgar.PILER-CR:fastandaccurateidenofCRISPRrepeats.
BMC
Bioinformatics
,8:18,2007.
[23]J.A.Eisen.Environmentalshotgunsequencing:itspotentialandchallengesforstudy-
ingthehiddenworldofmicrobes.
PLoSBiol
,5(3):e82,Mar2007.
[24]W.Filipowicz,S.N.Bhattacharyya,andN.Sonenberg.Mechanismsofpost-
transcriptionalregulationbymicroRNAs:aretheanswersinsight?
NatRevGenet
,
9(2):102{114,Feb2008.
[25]A.E.Friedland,Y.B.Tzur,K.M.Esvelt,M.P.Colaicovo,G.M.Church,andJ.A.
Calarco.HeritablegenomeeditinginC.elegansviaaCRISPR-Cas9system.
Nat
Methods
,10(8):741{743,Aug2013.
[26]M.R.Fander,S.D.Mackowiak,N.Li,W.Chen,andN.Rajewsky.miRDeep2
accuratelyidenknownandhundredsofnovelmicroRNAgenesinsevenanimal
clades.
NucleicAcidsRes.
,40(1):37{52,Jan2012.
[27]P.P.Gardner,A.Wilm,andS.Washietl.Abenchmarkofmultiplesequencealignment
programsuponstructuralRNAs.
NucleicAcidsResearch
,33(8):2433{2439,2005.
[28]G.GonnellaandS.Kurtz.Readjoiner:afastandmemorytstringgraph-based
sequenceassembler.
BMCBioinformatics
,13:82,2012.
[29]J.Gorodkin,I.L.Hofacker,E.Torarinsson,Z.Yao,J.H.Havgaard,andW.L.Ruzzo.
DenovopredictionofstructuredRNAsfromgenomicsequences.
TrendsinBiotech-
nology
,28:9{19,2010.
[30]S.ths-Jones,A.Bateman,M.Marshall,A.Khanna,andS.R.Eddy.Rfam:an
RNAfamilydatabase.
NucleicAcidsResearch
,31:439{441,2003.
[31]I.Grissa,G.Vergnaud,andC.Pourcel.CRISPRFinder:awebtooltoidentifyclustered
regularlyinterspacedshortpalindromicrepeats.
NucleicAcidsRes
,35(WebServer
issue):W52{W57,Jul2007.
[32]I.Grissa,G.Vergnaud,andC.Pourcel.TheCRISPRdbdatabaseandtoolstodisplay
CRISPRsandtogeneratedictionariesofspacersandrepeats.
BMCBioinformatics
,
8:172,2007.
96
[33]D.G
AlgorithmsonStrings,Trees,andSequences:ComputerScienceandCom-
putationalBiology
.CambridgeUniversityPress,UK,1997.
[34]M.Hackenberg,N.Rodreleta,andA.M.Aransay.miRanalyzer:anupdate
onthedetectionandanalysisofmicroRNAsinhigh-throughputsequencingexperi-
ments.
NucleicAcidsRes.
,39:W132{138,Jul2011.
[35]D.H.Haft,J.Selengut,E.F.Mongodin,andK.E.Nelson.Aguildof45CRISPR-
associated(Cas)proteinfamiliesandmultipleCRISPR/Cassubtypesexistinprokary-
oticgenomes.
PLoSComputBiol
,1(6):e60,Nov2005.
[36]A.A.Hagberg,D.A.Schult,andP.J.Swart.Exploringnetworkstructure,dynamics,
andfunctionusingnetworkx.InG.Varoquaux,T.Vaught,andJ.Millman,editors,
Proceedingsofthe7thPythoninScienceConference
,pages11{15,Pasadena,CA
USA,2008.
[37]C.R.Hale,S.Majumdar,J.Elmore,N.M.Compton,S.Olson,A.M.Resch,
C.V.C.Glover,3rd,B.R.Graveley,R.M.Terns,andM.P.Terns.Essentialfeatures
andrationaldesignofCRISPRRNAsthatfunctionwiththeCasRAMPmodule
complextocleaveRNAs.
MolCell
,45(3):292{302,Feb2012.
[38]J.H.Havgaard,R.B.Lyngso,G.D.Stormo,andJ.Gorodkin.Pairwiselocalstructural
alignmentofRNAsequenceswithsequencesimilaritylessthan40%.
Bioinformatics
,
21(9):1815{1824,2005.
[39]J.H.Havgaard,E.Torarinsson,andJ.Gorodkin.FastpairwisestructuralRNAalign-
mentsbypruningofthedynamicalprogrammingmatrix.
PLOScomputationalbiology
,
3(10):1896{1908,2007.
[40]M.Hess,A.Sczyrba,R.Egan,T.-W.Kim,H.Chokhawala,G.Schroth,S.Luo,D.S.
Clark,F.Chen,T.Zhang,R.I.Mackie,L.A.Pennacchio,S.G.Tringe,A.Visel,
T.Woyke,Z.Wang,andE.M.Rubin.Metagenomicdiscoveryofbiomass-degrading
genesandgenomesfromcowrumen.
Science
,331(6016):463{467,Jan2011.
[41]I.Hofacker,W.Fontana,P.Stadler,S.BonhoM.Tacker,andP.Schuster.Fast
foldingandcomparisonofRNAsecondarystructures.
Monatsheftef.Chemie
,125:167{
188,1994.
[42]I.L.HofackerandP.F.Stadler.MemoryntfoldingalgorithmsforcircularRNA
secondarystructures.In
ProceedingsoftheGermanConferenceonBioinformatics.
GCB2003
,pages15{25,2006.
97
[43]P.HorvathandR.Barrangou.CRISPR/Cas,theimmunesystemofbacteriaand
archaea.
Science
,327(5962):167{170,Jan2010.
[44]P.Horvath,D.A.Romero,A.-C.Cot-Monvoisin,M.Richards,H.Deveau,S.Moineau,
P.Boyaval,C.Fremaux,andR.Barrangou.Diversity,activity,andevolutionof
CRISPRlociinStreptococcusthermophilus.
JBacteriol
,190(4):1401{1412,Feb2008.
[45]X.HuangandA.Madan.Cap3:Adnasequenceassemblyprogram.
GenomeRes
,
9(9):868{877,Sep1999.
[46]W.Y.Hwang,Y.Fu,D.Reyon,M.L.Maeder,S.Q.Tsai,J.D.Sander,R.T.
Peterson,J.-R.J.Yeh,andJ.K.Joung.tgenomeeditinginusinga
CRISPR-Cassystem.
NatBiotechnol
,31(3):227{229,Mar2013.
[47]Illumina.HiSeqXTenPreliminaryProductInformationSheet.[Online;accessed
25-April-2014].
[48]R.Jansen,J.D.A.vanEmbden,W.Gaastra,andL.M.Schouls.Idenofa
novelfamilyofsequencerepeatsamongprokaryotes.
OMICS
,6(1):23{33,2002.
[49]D.-H.Jeong,S.R.Thatcher,R.S.H.Brown,J.Zhai,S.Park,L.A.Rymarquis,B.C.
Meyers,andP.J.Green.ComprehensiveinvestigationofmicroRNAsenhancedby
analysisofsequencevariants,expressionpatterns,ARGONAUTEloading,andtarget
cleavage.
PlantPhysiol.
,162(3):1225{1245,Jul2013.
[50]W.Jiang,H.Zhou,H.Bi,M.Fromm,B.Yang,andD.P.Weeks.Demonstrationof
CRISPR/Cas9/sgRNA-mediatedtargetedgenemoinArabidopsis,tobacco,
sorghumandrice.
NucleicAcidsRes
,41(20):e188,Nov2013.
[51]M.W.W.Jones-Rhoades,D.P.P.Bartel,andB.Bartel.MicroRNAsandtheir
regulatoryrolesinplants.
AnnualReviewofPlantBiology
,57:19{53,2006.
[52]F.V.KarginovandG.J.Hannon.TheCRISPRsystem:smallrna-guideddefensein
bacteriaandarchaea.
MolCell
,37(1):7{19,Jan2010.
[53]A.Keller,C.Backes,P.Leidinger,N.Kefer,V.Boisguerin,C.Barbacioru,B.Vogel,
M.Matzas,H.Huwer,H.A.Katus,C.Sthler,B.Meder,andE.Meese.Next-generation
sequencingidennovelmicroRNAsinperipheralbloodoflungcancerpatients.
Mol
Biosyst
,7(12):3187{3199,Dec2011.
[54]H.Kiryu,Y.Tabei,T.Kin,andK.Asai.Murlet:apracticalmultiplealignmenttool
forstructuralRNAsequences.
Bioinformatics
,23(13):1588{1598,2007.
98
[55]S.Kishore,A.Khanna,Z.Zhang,J.Hui,P.J.Balwierz,M.Stefan,C.Beach,R.D.
Nicholls,M.Zavolan,andS.Stamm.ThesnoRNAMBII-52(SNORD115)isprocessed
intosmallerRNAsandregulatesalternativesplicing.
HumMolGenet
,19(7):1153{
1164,Apr2010.
[56]R.J.KleinandS.R.Eddy.RSEARCH:homologsofsinglestructuredRNA
sequences.
BMCBioinformatics
,4:44{59,2003.
[57]D.C.Koboldt,K.M.Steinberg,D.E.Larson,R.K.Wilson,andE.R.Mardis.The
next-generationsequencingrevolutionanditsimpactongenomics.
Cell
,155(1):27{38,
Sep2013.
[58]A.KozomaraandS.miRBase:integratingmicroRNAannotationand
deep-sequencingdata.
NucleicAcidsRes.
,39(Databaseissue):D152{157,Jan2011.
[59]V.Kunin,R.Sorek,andP.Hugenholtz.Evolutionaryconservationofsequenceand
secondarystructuresinCRISPRrepeats.
GenomeBiol
,8(4):R61,2007.
[60]D.Langenberger,C.Bermudez-Santana,J.Hertel,S.P.Khaitovich,and
P.F.Stadler.EvidenceforhumanRNAsinsmallRNAsequencing
data.
Bioinformatics
,25(18):2298{2301,2009.
[61]B.Langmead,C.Trapnell,M.Pop,andS.L.Salzberg.Ultrafastandt
alignmentofshortDNAsequencestothehumangenome.
GenomeBiology
,10(3):R25{
34,2009.
[62]M.H.Larson,L.A.Gilbert,X.Wang,W.A.Lim,J.S.Weissman,andL.S.Qi.
CRISPRinterference(CRISPRi)forsequence-spcontrolofgeneexpression.
Nat
Protoc
,8(11):2180{2196,Nov2013.
[63]N.Leblond-Bourget,H.Philippe,I.Mangin,andB.Decaris.16SrRNAand16S
to23Sinternaltranscribedspacersequenceanalysesrevealinter-andintrasp
phylogeny.
IntJSystBacteriol
,46(1):102{111,Jan1996.
[64]H.Li,B.Handsaker,A.Wysoker,T.Fennell,J.Ruan,N.Homer,G.Marth,G.Abeca-
sis,R.Durbin,and.G.P.D.P.S..TheSequenceAlignment/Mapformatand
SAMtools.
Bioinformatics
,25(16):2078{2079,Aug2009.
[65]S.-D.Li,S.Chono,andL.Huang.toncogenesilencingandmetastasisinhibition
viasystemicdeliveryofsiRNA.
MolTher
,16(5):942{946,May2008.
99
[66]L.Liu,Y.Li,S.Li,N.Hu,Y.He,R.Pong,D.Lin,L.Lu,andM.Law.Comparison
ofnext-generationsequencingsystems.
JBiomedBiotechnol
,2012:251364,2012.
[67]Y.P.Liu,J.Haasnoot,O.terBrake,B.Berkhout,andP.Konstantinova.Inhibition
ofHIV-1bymultiplesiRNAsexpressedfromasinglemicroRNApolycistron.
Nucleic
AcidsRes
,36(9):2811{2824,May2008.
[68]R.Lorenz,S.H.Bernhart,C.onerZuSiederdissen,H.Tafer,C.Flamm,P.F.Stadler,
andI.L.Hofacker.ViennaRNAPackage2.0.
AlgorithmsMolBiol
,6:26,2011.
[69]C.Lu,S.S.Tej,S.Luo,C.D.Haudenschild,B.C.Meyers,andP.J.Green.Elucidation
ofthesmallRNAcomponentofthetranscriptome.
Science
,309(5740):1567{1569,Sep
2005.
[70]S.Lu,R.Shi,C.-C.Tsao,X.Yi,L.Li,andV.L.Chiang.RNAsilencinginplantsby
theexpressionofsiRNAduplexes.
Nucl.AcidsRes.
,32(21):e171{177,2004.
[71]V.M.Markowitz,I.-M.A.Chen,K.Chu,E.Szeto,K.Palaniappan,M.Pillay,A.Rat-
ner,J.Huang,I.Pagani,S.Tringe,M.Huntemann,K.Billis,N.Varghese,K.Ten-
nessen,K.Mavromatis,A.Pati,N.N.Ivanova,andN.C.Kyrpides.Img/m4ver-
sionoftheintegratedmetagenomecomparativeanalysissystem.
NucleicAcidsRes
,
42(Databaseissue):D568{D573,Jan2014.
[72]L.A.andE.J.Sontheimer.CRISPRinterferencelimitshorizontalgene
transferinstaphylococcibytargetingDNA.
Science
,322(5909):1843{1845,Dec2008.
[73]J.A.MartinandZ.Wang.Next-generationtranscriptomeassembly.
NatRevGenet
,
12(10):671{682,Oct2011.
[74]A.MathelierandA.Carbone.MIReNA:microRNAswithhighaccuracyandno
learningatgenomescaleandfromdeepsequencingdata.
Bioinformatics
,26(18):2226{
2234,Sep2010.
[75]J.McCaskill.Theequilibriumpartitionfunctionandbasepairbindingprobabilities
forRNAsecondarystructures.
Biopolymers
,29:1105{1119,1990.
[76]M.C.Mendez-Ortega,S.Restrepo,L.M.RodrI.Perez,J.C.Mendoza,A.P.
MartR.Sierra,andG.J.Rey-Benito.AnRNAiinsilicoapproachtoan
optimalshRNAcocktailagainstHIV-1.
VirolJ
,7:369,2010.
[77]M.L.Metzker.Sequencingtechnologies-thenextgeneration.
NatRevGenet
,
11(1):31{46,Jan2010.
100
[78]B.C.Meyers,M.J.Axtell,B.Bartel,D.P.Bartel,D.Baulcombe,J.L.Bowman,
X.Cao,J.C.Carrington,X.Chen,P.J.Green,S.GS.E.Jacobsen,
A.C.Mallory,R.A.Martienssen,R.S.Poethig,Y.Qi,H.Vaucheret,O.Voinnet,
Y.Watanabe,D.Weigel,andJ.-K.Zhu.CriteriaforannotationofplantMicroRNAs.
PlantCell
,20(12):3186{3190,Dec2008.
[79]C.Mlot.Centromeres.Ajourneytothecenterofthechromosome.
Science
,
290(5499):2057{2059,Dec2000.
[80]R.D.Morin,M.D.O'Connor,M.th,F.Kuchenbauer,A.Delaney,A.-L.Prabhu,
Y.Zhao,H.McDonald,T.Zeng,M.Hirst,C.J.Eaves,andM.A.Marra.Applicationof
massivelyparallelsequencingtomicroRNAanddiscoveryinhumanembryonic
stemcells.
GenomeRes
,18(4):610{621,Apr2008.
[81]E.W.Myers.Thefragmentassemblystringgraph.
Bioinformatics
,21Suppl2:ii79{
ii85,Sep2005.
[82]E.W.Myers.Thefragmentassemblystringgraph.
Bioinformatics
,21Suppl2:ii79{
ii85,Sep2005.
[83]E.P.Nawrocki,D.L.Kolbe,andS.R.Eddy.Infernal1.0:InferenceofRNAalign-
ments.
Bioinformatics
,25:1335{1337,2009.
[84]C.T.Neilsen,G.J.Goodall,andC.P.Bracken.IsomiRs{theoverlookedrepertoirein
thedynamicmicroRNAome.
TrendsGenet.
,28(11):544{549,Nov2012.
[85]P.C.NgandE.F.Kirkness.Wholegenomesequencing.
MethodsMolBiol
,628:215{
226,2010.
[86]K.C.Pang,M.C.Fritha,andJ.S.Mattick.RapidevolutionofnoncodingRNAs:
lackofconservationdoesnotmeanlackoffunction.
TrendsinGenetics
,22(1):1{5,
2005.
[87]J.S.Pedersen,G.Bejerano,A.Siepel,K.Rosenbloom,K.Lindblad-Toh,E.S.Lander,
J.Kent,W.Miller,andD.Haussler.IdenandofconservedRNA
secondarystructuresintheHumanGenome.
PLoSComputBiol
,2(4):251{262,2006.
[88]E.Pennisi.TheCRISPRcraze.
Science
,341(6148):833{836,Aug2013.
[89]M.Petrillo,G.Silvestro,P.P.D.Nocera,A.Boccia,andG.Paolella.Stem-loop
structuresinprokaryoticgenomes.
BMCGenomics
,7:170,2006.
101
[90]K.Pougach,E.Semenova,E.Bogdanova,K.A.Datsenko,M.Djordjevic,B.L.Wan-
ner,andK.Severinov.Transcription,processingandfunctionofCRISPRcassettesin
Escherichiacoli.
MolMicrobiol
,77(6):1367{1379,Sep2010.
[91]M.Rho,Y.Wu,H.Tang,T.G.Doak,andY.Ye.DiverseCRISPRsevolvinginhuman
microbiomes.
PLoSGenet
,8(6):e1002441,2012.
[92]M.Rho,Y.-W.Wu,H.Tang,T.G.Doak,andY.Ye.DiverseCRISPRsevolvingin
humanmicrobiomes.
PLoSGenet
,8(6):e1002441,2012.
[93]E.RivasandS.R.Eddy.NoncodingRNAgenedetectionusingcomparativesequence
analysis.
BMCBioinformatics
,2(1):8{26,2001.
[94]S.Roux,J.Tournayre,A.Mahul,D.Debroas,andF.Enault.Metavir2:newtoolsfor
viralmetagenomecomparisonandassembledviromeanalysis.
BMCBioinformatics
,
15:76,2014.
[95]R.L.Rungta,H.B.Choi,P.J.Lin,R.W.Ko,D.Ashby,J.Nair,M.Manoharan,
P.R.Cullis,andB.A.Macvicar.LipidNanoparticleDeliveryofsiRNAtoSilence
NeuronalGeneExpressionintheBrain.
MolTherNucleicAcids
,2:e136,2013.
[96]F.SangerandA.R.Coulson.ArapidmethodfordeterminingsequencesinDNAby
primedsynthesiswithDNApolymerase.
JMolBiol
,94(3):441{448,May1975.
[97]F.Sanger,S.Nicklen,andA.R.Coulson.DNAsequencingwithchain-terminating
inhibitors.
ProcNatlAcadSciUSA
,74(12):5463{5467,Dec1977.
[98]D.SankSimultaneoussolutionoftheRNAfolding,alignmentandprotosequence
problems.
SIAMJournalonAppliedMathematics
,45(5):810{825,1985.
[99]M.Shakya,C.Quince,J.H.Campbell,Z.K.Yang,C.W.Schadt,andM.Podar.
ComparativemetagenomicandrRNAmicrobialdiversitycharacterizationusingar-
chaealandbacterialsyntheticcommunities.
EnvironMicrobiol
,15(6):1882{1899,Jun
2013.
[100]T.S.SimonGladman.Velvetoptimiser,2012.
[101]C.T.Skennerton,M.Imelfort,andG.W.Tyson.Crass:idenandreconstruc-
tionofCRISPRfromunassembledmetagenomicdata.
NucleicAcidsRes
,41(10):e105,
May2013.
102
[102]R.Sorek,V.Kunin,andP.Hugenholtz.CRISPR{awidespreadsystemthatpro-
videsacquiredresistanceagainstphagesinbacteriaandarchaea.
NatRevMicrobiol
,
6(3):181{186,Mar2008.
[103]J.SperschneiderandA.Datta.DotKnot:pseudoknotpredictionusingtheprobability
dotplotunderaenergymodel.
NucleicAcidsResearch
,38(7):e103{115,2010.
[104]A.Stern,E.Mick,I.Tirosh,O.Sagy,andR.Sorek.CRISPRtargetingrevealsa
reservoirofcommonphagesassociatedwiththehumangutmicrobiome.
GenomeRes
,
22(10):1985{1994,Oct2012.
[105]Y.TabeiandK.Asai.Alocalmultiplealignmentmethodfordetectionofnon-coding
RNAsequences.
Bioinformatics
,25(12):1498{1505,2009.
[106]Y.Tabei,H.Kiryu,T.Kin,andK.Asai.Afaststructuralmultiplealignmentmethod
forlongRNAsequences.
BMCBioinformatics
,9:33{49,2008.
[107]Y.Tabei,K.Tsuda,T.Kin,andK.Asai.SCARNA:fastandaccuratestructural
alignmentofRNAsequencesbymatchingstemfragments.
Bioinformatics
,
22(14):1723{1729,2006.
[108]O.terBrake,P.Konstantinova,M.Ceylan,andB.Berkhout.SilencingofHIV-1with
RNAinterference:amultipleshRNAapproach.
MolTher
,14(6):883{892,Dec2006.
[109]J.D.Thompson,D.G.Higgins,andT.J.Gibson.CLUSTALW:improvingthesensi-
tivityofprogressivemultiplesequencealignmentthroughsequenceweighting,position-
spgappenaltiesandweightmatrixchoice.
NucleicAcidsResearch
,22(22):4673{
4680,1994.
[110]E.Torarinsson,J.H.Havgaard,andJ.Gorodkin.Multiplestructuralalignmentand
clusteringofRNAsequences.
Bioinformatics
,23(8):926{932,2007.
[111]G.W.Tyson,J.Chapman,P.Hugenholtz,E.E.Allen,R.J.Ram,P.M.Richardson,
V.V.Solovyev,E.M.Rubin,D.S.Rokhsar,andJ.F.Communitystructure
andmetabolismthroughreconstructionofmicrobialgenomesfromtheenvironment.
Nature
,428(6978):37{43,Mar2004.
[112]A.V.Uzilov,J.M.Keegan,andD.H.Mathews.Detectionofnon-codingRNAs
onthebasisofpredictedsecondarystructureformationfreeenergychange.
BMC
Bioinformatics
,7(1):173{202,2006.
103
[113]H.Wang,H.Yang,C.S.Shivalila,M.M.Dawlaty,A.W.Cheng,F.Zhang,and
R.Jaenisch.One-stepgenerationofmicecarryingmutationsinmultiplegenesby
CRISPR/Cas-mediatedgenomeengineering.
Cell
,153(4):910{918,May2013.
[114]W.-C.Wang,F.-M.Lin,W.-C.Chang,K.-Y.Lin,H.-D.Huang,andN.-S.Lin.miREx-
press:analyzinghigh-throughputsequencingdataformicroRNAexpression.
BMCBioinformatics
,10:328,2009.
[115]Z.Wang,M.Gerstein,andM.Snyder.RNA-Seq:arevolutionarytoolfortranscrip-
tomics.
NatRevGenet
,10(1):57{63,Jan2009.
[116]S.Washietl,I.L.Hofacker,andP.F.Stadler.Fastandreliablepredictionofnoncoding
RNAs.
ProcNatlAcadSciUSA
,102(7):2454{2459,2005.
[117]Y.Wei,M.T.Chesne,R.M.Terns,andM.P.Terns.Sequencesspanningtheleader-
repeatjunctionmediateCRISPRadaptationtophageinstreptococcusthermophilus.
NucleicAcidsRes
,43(3):1749{1758,Feb2015.
[118]S.Will,K.Reiche,I.L.Hofacker,P.F.Stadler,andR.Backofen.Inferringnoncoding
RNAfamiliesandclassesbymeansofgenome-scalestructure-basedclustering.
PLoS
ComputationalBiology
,3(4):680{691,2007.
[119]A.Wilm,I.Mainz,andG.Steger.AnenhancedRNAalignmentbenchmarkfor
sequencealignmentprograms.
AlgorithmsforMolecularBiology
,1:19{29,2006.
[120]X.YangandL.Li.miRDeep-P:acomputationaltoolforanalyzingthemicroRNA
transcriptomeinplants.
Bioinformatics
,27(18):2614{2615,Sep2011.
[121]P.Yarza,M.Richter,J.Peplies,J.Euzeby,R.Amann,K.-H.Schleifer,W.Ludwig,
F.O.Glckner,andR.Rossell-Mra.TheAll-SpeciesLivingTreeproject:a16SrRNA-
basedphylogenetictreeofallsequencedtypestrains.
SystApplMicrobiol
,31(4):241{
250,Sep2008.
[122]D.R.Yoder-Himes,P.S.G.Chain,Y.Zhu,O.Wurtzel,E.M.Rubin,J.M.Tiedje,and
R.Sorek.MappingtheBurkholderiacenocepacianicheresponseviahigh-throughput
sequencing.
ProcNatlAcadSciUSA.
,106(10):3976{3981,2009.
[123]D.R.ZerbinoandE.Birney.Velvet:algorithmsfordenovoshortreadassemblyusing
deBruijngraphs.
GenomeRes
,18(5):821{829,May2008.
[124]B.Zhang,X.Pan,G.P.Cobb,andT.A.Anderson.PlantmicroRNA:asmallregu-
latorymoleculewithbigimpact.
Dev.Biol.
,289(1):3{16,Jan2006.
104
[125]S.ZhaoandM.-F.Liu.MechanismsofmicroRNA-mediatedgeneregulation.
SciChina
CLifeSci
,52(12):1111{1116,Dec2009.
[126]M.ZukerandP.Stiegler.OptimalcomputerfoldingoflargeRNAsequencesusing
thermodynamicandauxiliaryinformation.
NuclAcidRes
,9:133{148,1981.
105