NON-CODING RNA IDENTIFICATION IN LARGE-SCALE GENOMIC DATA
By
Cheng Yuan

A DISSERTATION
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
Computer Science - Doctor of Philosophy
2014

ABSTRACT
NON-CODING RNA IDENTIFICATION IN LARGE-SCALE GENOMIC
DATA
By
Cheng Yuan
Noncoding RNAs (ncRNAs), which function directly as RNAs without translating into
proteins, play diverse and important biological functions. ncRNAs function not only through
their primary structures, but also secondary structures, which are defined by interactions
between Watson-Crick and wobble base pairs. Common types of ncRNA include microRNA,
rRNA, snoRNA, tRNA. Functions of ncRNAs vary among different types. Recent studies
suggest the existence of large number of ncRNA genes. Identification of novel and known
ncRNAs becomes increasingly important in order to understand their functionalities and the
underlying communities.
Next-generation sequencing (NGS) technology sheds lights on more comprehensive and
sensitive ncRNA annotation. Lowly transcribed ncRNAs or ncRNAs from rare species with
low abundance may be identified via deep sequencing. However, there exist several challenges
in ncRNA identification in large-scale genomic data. First, the massive volume of datasets
could lead to very long computation time, making existing algorithms infeasible. Second,
NGS has relatively high error rate, which could further complicate the problem. Third, high
sequence similarity among related ncRNAs could make them difficult to identify, resulting
in incorrect output. Fourth, while secondary structures should be adopted for accurate
ncRNA identification, they usually incur high computational complexity. In particular,
some ncRNAs contain pseudoknot structures, which cannot be effectively modeled by the
state-of-the-art approach. As a result, ncRNAs containing pseudoknots are hard to annotate.

In my PhD work, I aimed to tackle the above challenges in ncRNA identification. First, I
designed a progressive search pipeline to identify ncRNAs containing pseudoknot structures.
The algorithms are more efficient than the state-of-the-art approaches and can be used for
large-scale data. Second, I designed a ncRNA classification tool for short reads in NGS data
lacking quality reference genomes. The initial homology search phase significantly reduces
size of the original input, making the tool feasible for large-scale data. Last, I focused on
identifying 16S ribosomal RNAs from NGS data. 16S ribosomal RNAs are very important
type of ncRNAs, which can be used for phylogenic study. A set of graph based assembly
algorithms were applied to form longer or full-length 16S rRNA contigs. I utilized paired-end
information in NGS data, so lowly abundant 16S genes can also be identified. To reduce the
complexity of problem and make the tool practical for large-scale data, I designed a list of
error correction and graph reduction techniques for graph simplification.

TABLE OF CONTENTS

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

vi

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.1 Introduction to noncoding RNAs . . . . . . . . . . . . . . . . . . . . . . . .
1.2 Related works to ncRNA identification in large-scale genomic data . . . . . .
1.3 Challenges in ncRNA identification in large-scale genomic data . . . . . . . .
1.3.1 Sanger sequencing and NGS technology . . . . . . . . . . . . . . . . .
1.3.2 NCRNA identification in NGS data . . . . . . . . . . . . . . . . . . .
1.3.3 Pseudoknot structures and challenges in finding ncRNAs containing
pseudoknot structures in large-scale genomic data . . . . . . . . . . .
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Chapter 2 Efficient known ncRNA search including pseudoknots
2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.3.1 Sub-structure derivation . . . . . . . . . . . . . . . . . . .
2.3.2 Search performance of different sub-structures . . . . . . .
2.3.2.1 Sort sub-structures according to their E-values .
2.3.2.2 Choose sub-structures for progressive search . . .
2.3.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . .
2.4 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . .
2.4.1 Pseudoknot sequences in Rfam . . . . . . . . . . . . . . .
2.4.2 Data set preparation . . . . . . . . . . . . . . . . . . . . .
2.4.3 Results and comparisons . . . . . . . . . . . . . . . . . . .
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Chapter 3 RNA-CODE: a noncoding RNA Classification
reaDs in NGS data lacking rEference genomes
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2.1 Stage 1: SCFG-based filtration . . . . . . . . . . .
3.2.2 Stage 2: family-specific de novo assembly . . . . . .
3.2.3 Stage 3: contig selection . . . . . . . . . . . . . . .
3.2.4 MIRNA families . . . . . . . . . . . . . . . . . . . .

iv

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

tOol for
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

1
1
2
4
4
5
6
8

.
.
.
.
.
.
.
.
.
.
.
.
.
.

9
9
11
13
14
17
18
20
22
23
23
24
26
31

short
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .

33
33
37
41
43
45
47

3.3

3.4

Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.3.1 Detecting reads of 16s ribosomal RNAs . . . . . . . . . . . . . . . . .
3.3.1.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.3.1.2 Experimental results . . . . . . . . . . . . . . . . . . . . . .
3.3.2 NCRNA classification in RNA-seq data . . . . . . . . . . . . . . . . .
3.3.2.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.3.2.2 Experimental results . . . . . . . . . . . . . . . . . . . . . .
3.3.2.2.1 Performance of filtration . . . . . . . . . . . . . . .
3.3.2.2.2 Performance comparison with SSAKE . . . . . . .
3.3.2.2.3 Using multiple overlap thresholds improves performance of RNA-CODE . . . . . . . . . . . . . . . .
3.3.2.2.4 Performance of microRNA families . . . . . . . . .
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

47
49
49
50
51
52
53
54
55
57
58
58

Chapter 4 Reconstructing 16S rRNA genes in metagenomic data . . . . .
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2.1 Overview of REAGO . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2.2 16S rRNA reads identification . . . . . . . . . . . . . . . . . . . . . .
4.2.3 Overlap Graph Creation and Graph Pruning . . . . . . . . . . . . . .
4.2.3.0.5 Node collapsing . . . . . . . . . . . . . . . . . . . .
4.2.3.0.6 Alignment-based error correction . . . . . . . . . .
4.2.3.0.7 Topology-based graph reduction . . . . . . . . . . .
4.2.4 Bad edge removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2.5 Guided path finding using paired-end information . . . . . . . . . . .
4.2.6 Scaffolding 16S rRNA segments . . . . . . . . . . . . . . . . . . . . .
4.3 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.3.1 Experiment 1: simulated metagenomic dataset . . . . . . . . . . . . .
4.3.1.0.8 Performance of rRNA reads classification using cmsearch . . . . . . . . . . . . . . . . . . . . . . . . .
4.3.1.0.9 Overlap graph construction . . . . . . . . . . . . .
4.3.1.0.10 Assembly performance evaluation . . . . . . . . . .
4.3.2 Experiment 2: synthetic metagenomic data . . . . . . . . . . . . . . .
4.3.3 Experiment 3: human stool metagenomic data . . . . . . . . . . . . .

60
60
63
63
64
66
67
67
69
70
72
74
76
77

Chapter 5 Conclusion and future work . . . . . . . . . . . . . . . . . . . . . .

88

BIBLIOGRAPHY

90

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

v

78
79
80
82
85

LIST OF TABLES

Table 2.1

Table 2.2

The order of E-values is highly consistent to the order of number of
the FP hits. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

20

Sequences that do not contain annotated pseudoknots and thus may not
be real members. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

24

Table 2.3

Sensitivity, FP hits, and running time comparison between RNAv, RNATOPS,
Infernal, and sub-structure. Bold font is applied to the highest sensitivity,
the lowest FP hit, or the shortest running time for each RNA family. The
empty cells indicate that the corresponding tools did not generate any
output within 4 CPU days. . . . . . . . . . . . . . . . . . . . . . . . . 28

Table 3.1

Performance comparison of RNA-CODE vs Metaxa. Both tools were
applied using the default parameters. . . . . . . . . . . . . . . . . .

51

Table 3.2

Number of reads that are mapped to chromosome 2 of Arabidopsis.

53

Table 3.3

Filtration statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . .

55

Table 3.4

Performance of RNA-CODE (multiple-k), SSAKE, and RNA-CODE
(single-k) on transcribed ncRNA families. . . . . . . . . . . . . . . .

57

Table 4.1

Species abundance. . . . . . . . . . . . . . . . . . . . . . . . . . . .

77

Table 4.2

Pairwise sequence similarity. Bold numbers indicate sequence similarity above 90%.

78

Table 4.3

Performance of 16S rRNA gene recovery. . . . . . . . . . . . . . . .

81

Table 4.4

Performance of 16S rRNA gene recovery on true positive simulated
data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

81

The performance of cmsearch on the synthetic community data. . .

83

Table 4.5

vi

Table 4.6

The performance of rRNA recovery on synthetic community data. Due to near
identical outputs and very high sequence similarity among some genes in some
genera, the sum of incorrect assemblies and correct assemblies may not sum to
the total number of output sequences. . . . . . . . . . . . . . . . . . . . .

84

The performance of rRNA recovery on only true positive reads from synthetic
community data. Due to near identical outputs and very high sequence similarity
among some genes in some genera, the sum of incorrect assemblies and correct
assemblies may not sum to the total number of output sequences. . . . . . . .

85

Table 4.8

Assembly tool performance on human stool data . . . . . . . . . . .

86

Table 4.9

Species recovered by REAGO and EMIRGE. “Y” indicates the species
of the row is identified by the tool labeled by the column title. . . .

87

Table 4.7

vii

LIST OF FIGURES

Figure 1.1

An example of the (A) secondary structure and (B) fundamental elements of a hypothetical ncRNA. . . . . . . . . . . . . . . . . . . .

2

Figure 1.2

An example of a psuedoknot-containing ncRNA . . . . . . . . . . .

7

Figure 2.1

Consensus secondary structure of tmRNA and the secondary structure described by SCFG (pseudoknots missing). A. Consensus secondary structure of RF00023 (tmRNA) in Rfam. Stacking base pairs in
1 are parallel to base pairs in 2 and 3. 1, 2, and 3 are nested in 4. 2
and 3 form a pseudoknot. B. Secondary structure described by SCFG
(pseudoknots missing). . . . . . . . . . . . . . . . . . . . . . . . . . .

10

Figure 2.2

The pipeline of the SCFG construction and the progressive search. .

14

Figure 2.3

Five candidate sub-structures can be constructed from three stems in
a pseudoknot structure. Each arc represents a stem containing nested
base pairs and possible internal/bulge loops. Single-stranded regions are
represented using solid lines. . . . . . . . . . . . . . . . . . . . . . . .

Figure 2.4

15

Number of TP hits and FP matches of each sub-structure under different score thresholds. For each sub-structure, the length and the search
time corresponding to the highest sensitivity is listed. Time format is
hr:min:sec. Due to highly different number of FP hits, two sub-structures
are plotted in the embedded figure. . . . . . . . . . . . . . . . . . . .

18

Figure 2.5

Sensitivity comparison on 71 families. . . . . . . . . . . . . . . . . .

29

Figure 2.6

Comparison of false positive hits on 71 families. . . . . . . . . . . .

30

Figure 2.7

Running Time comparison. There are 4 families on which Infernal run
much longer than on other families. To keep an appropriate scale, there
running times are not displayed on the figure. . . . . . . . . . . . . . .

Figure 3.1

30

Reads sequenced from pre-miRNAs cannot be assembled into contigs.
Three different miRNAs show highly different expression levels in the same
RNA-seq data. A. mir-156 B. mir-160 C. mir-166 . . . . . . . . . . . .

viii

38

Figure 3.2

The pipeline of RNA-CODE. The pipeline of RNA-CODE. For miRNAs,
the output of the first stage (SCFG-based filtration) and the whole pipeline
will be used together for reads classification. . . . . . . . . . . . . . . .

Figure 3.3

40

ROC curves of short reads classification using trCYK and BLAST.
ROC curves of short reads classification using trCYK and BLAST. Sensitivity measures the ratio of correctly found true tRNAs to the total number
of true tRNA reads. False positive rate measures the ratio of falsely found
tRNA reads to the total number of false tRNA reads. . . . . . . . . . .

42

Figure 3.4

Three types of contigs. . . . . . . . . . . . . . . . . . . . . . . . . .

46

Figure 4.1

Pipeline of the 16S rRNA gene assembly. Short black and gray bars
represent reads originated from different 16S rRNA genes. Short
white bars represent reads from non-16S regions. Long bars represent
contigs assembled from short reads. . . . . . . . . . . . . . . . . . .

63

Graph reduction is conducted iteratively until there is no change on
the graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

66

Figure 4.3

Two types of bifurcation where error correction is applied. . . . . . .

68

Figure 4.4

An example of error correction (applied on V2 and V3 ). The sequence
represented by each node is given beside the node. (A) Ungapped
alignment of reads from bifurcating vertices. (B) Mutate rare bases.
(C) Remove bifurcation. . . . . . . . . . . . . . . . . . . . . . . . . .

69

Figure 4.5

Topology-based graph reduction. . . . . . . . . . . . . . . . . . . . .

70

Figure 4.6

Path finding using paired-end information. Solid lines represent overlaps between nodes and dashed lines represent the existence of pairedend reads. The numbers beside dashed lines are the numbers of paired
end reads between the corresponding nodes. . . . . . . . . . . . . .

74

Calculate the score between two segments. Arcs represent paired-end
match between vertices and thickness of arcs indicate weight of the
paired-end match. Actual weights also labeled beside each arc. . . .

75

Figure 4.2

Figure 4.7

Figure 4.8

Phylogenetic tree of the 64 species in the synthetic metagenomic data. 82

Figure 4.9

Phylogenetic tree of the 12 representative sequences. . . . . . . . . .

ix

85

Chapter 1
Introduction

1.1

Introduction to noncoding RNAs

Noncoding RNAs (ncRNAs) are funtional RNA modules that are not translated into protein.
They function directly as RNA molecules and play diverse and important biological functions.
Studies have hint at an RNA world, in which the roles of ncRNA genes are as important as
that of protein coding genes [1, 2, 3]. Many types of ncRNAs function through both their
sequences and secondary structures, which are defined by interactions between WatsonCrick and wobble base pairs as well as non-canonical Primary sequences of ncRNAs are
assembled as a chain of nucleotides, adenine (A), cytosine (C), guanine (G) and uracil (U).
Adenine and guanine are purines, which can form hydrogen bond with pyrimidines uracil
and cytosine, respectively. Other base interactions, such guanine-uracil interaction, are also
possible. Because of these interactions, primary sequence of ncRNAs can fold into twodimensional structures. The structure is composed from several fundamental elements, such
as hairpin loop, interior loop, stem and bifurcation, as demonstrated in Figure 1.1 [4].
Common types of ncRNA include snoRNA, microRNA (miRNA), ribosomal RNA (rRNA)
and short interfering RNAs (siRNAs). SnoRNAs are normally between 70 and 250 nt long.
There exist a very large number of snoRNAs, in which most function in rRNA modificaiton
and some function in rRNA processing. [5, 6, 7]. MicroRNA s are very short RNA molecules,
usually 21 to 22 nt long, that functions in RNA silencing and post-transcriptional regulation
1

5'

5'
A
A

A
A

3'

UU

3'

C. G
bulge loop
G .C
G
.C
branched loop G
hairpin loop
A .U
A
G
CC
G
A
GU
CG
ACC
..
.. C
C ...
C A U G CU U
AUGGA
U UA

C. G
G .C
G
G .C
A .U
A
G

A
ACC
C ...
AUGGA

uns tructured single strand

CC
G
GU
CG
..
.. C
C A U G CU U
A

interior loop

stem

(A)

(B)

Figure 1.1: An example of the (A) secondary structure and (B) fundamental elements of a
hypothetical ncRNA.
of gene expression [8, 9]. Ribosomal RNAs, as predominant material within ribosome, play
essential roles in protein synthesis. SiRNAs is a class of double-stranded RNAs, which are
involved in in gene silencing in RNA interference (RNAi) pathway [10].

1.2

Related works to ncRNA identification in largescale genomic data

Recent studies suggest the existence of surprisingly large amount of ncRNA genes in various
genomes. In particular, human genomes could possibly contains thousands of ncRNA genes,
which play very important role in many biological processes [2]. Thus it becomes increasingly
important to identify ncRNA in modern biology. A number of specialized tools are available
to identify each type of ncRNAs. For example, tRNAscan-SE [11] is designed to identify
tRNAs. snoscan and snoGPS are specialized for snoRNA identification [12]. And miRDeep*
[13] can be used to detect miRNAs, while miRPlant [14] aims to find miRNAs only in
plants. Each tool may has its specialty in identifying certain type of ncRNAs in certain
2

domain. There also exist universal tools for ncRNA identification. BLASTN [15] is one of the
widely used approaches for nucleotide sequences identification. A BLASTN search performs
primary sequence alignments between a query sequences and a library of sequences. A query
sequence is identified to be originated from a sequence in library if their alignment score
is greater than a threshold. However, only primary sequences of ncRNAs are modeled by
BLASTN. NCRNAs having high secondary structure conservation but low primary sequence
conservation tend to be missed by BLASTN [16]. The state-of-the-art approach for ncRNA
identification approaches are based on stochastic context-free grammar (SCFG). SCFG not
only models primary sequences of ncRNAs but also their secondary structures. A detailed
explanation can be found in the introduction and method section of Chapter 3. NCRNAs
with low sequence conservation but high structural conservation can still be identified. The
state-of-the-art implementation of SCFG is covariance model (CM) in Infernal [17]. Each
fundamental elements of ncRNAs can be modeled as a state in CM. A query sequence can be
optimally aligned to the underlying model using inside-outside algorithms. If the alignment
score is above a threshold, the query is then considered to be originated from this model. A
number of studies have demonstrated the advantages of incorporating secondary structural
information in various types of non-coding RNA homology search [18, 19, 20]. Experiments
have been conducted to compare between BLAST and CM-based approaches [18, 20]. The
results indicate that BLAST tends to miss ncRNAs with low sequence conservation. The
performance of BLAST is even worse with fragmentary query sequences, since the some
information is missing, resulting in marginal alignment scores. The latest distribution of
Infernal [19], on the other hand, is designed to recover the possible missing bases that could
be otherwise form base pairs.

3

1.3

Challenges in ncRNA identification in large-scale
genomic data

1.3.1

Sanger sequencing and NGS technology

Invented in 1977, Sangers method dominated the sequencing market for almost 30 years, and
is considered as the “gold standard” for sequencing [21]. The traditional sequencing method
relies on DNA synthesis from a DNA template of interest. DNA fragments are polymerized by
catalysed enzymic reaction to generate complementary of target DNA sequences. There are
some limitation in Sanger sequencing. First, one has to uses gels or polymers as separation
media. Second, the number of samples which could be handled in parallel is very limited,
making the sequencing process relatively slow. Third, the cost of Sanger sequencing if very
high. The limitations triggered the effort to develop more efficient and economical sequencing
technologies.
Next generation sequencing (NGS) overcomes the drawbacks of Sanger sequencing. First,
with very high throughput, NGS platform can generate massive amount of data in parallel.
Whole small genome can now be sequenced in a day [22]. Second, the sequencing cost is
greatly reduced. An estimated cost per million sequenced bases using NGS technologies is
1000-fold less than that using traditional chain termination based Sanger sequencing [23].
The entire sequencing process can be done on bench-top instrumentation. Widely used NGS
platforms include Roche (454) GS FLX sequencer, Illumina genome analyzer and Applied
Biosystems SOLiD sequencer. New platforms such as PacBio and Ion Torrent PGM are also
available. The advent of NGS technologies has revolutionized the microbiology study.
NGS technology is utilized by various applications. RNA-sequencing, also called Whole

4

Transcriptome Shotgun Sequencing (WTSS) [24], is a NGS based application to study RNA.
It aims to reveal a snapshot of presenting RNAs in an organism at a given moment. Metagenomics [25] is another application that utilizes NGS techonology and it allows people to
study uncultured microorganims in environment samples. Cheaper and faster technologies
enable sequencing of uncultured microorganisms directly from their inhabitants. Instead
of studying a few carefully cultured species, we are able to screen thousands of species together and discover their relationships. The sequenced data represents a comprehensive and
unbiased picture of species diversity in the environmental sample.

1.3.2

NCRNA identification in NGS data

Despite a range of advantages, NGS-based application also has its difficulties and limitations
in data analysis. First, volume of NGS data could be very high, making many existing
ncRNA identification tools infeasible. For example, RNAv [26] and RNATOP [27] are tools
designed for identification of ncRNA with pseudoknot structures. When the size of input
data is small, the tools could be very effective. However, if the data volume is high, both
tools cannot produce any output in reasonable amount of time [20]. Second, sequences
obtained by NGS technologies are usually fragmentary. There is no guarantee that entire
sequence of an underlying ncRNA gene is thoroughly sequenced, even with very high coverage
during sequencing process. Identifying such ncRNA genes with missing information could
be difficult. Sequencing errors could further deteriorate the performance of identification
of such genes [28]. Fourth, in particular, metagenomic datasets, which are usually from
heterogeneous microbial communities, could contain tens of thousands of species. We do not
generally know the ground truth about origin of each fragmentary sequence in the dataset,
nor do we know the true microbial composition in the environmental sample, making it
5

difficult to separate similar ncRNAs genes from highly related species [29]. As a result,
effective and efficient algorithms are in need to meet the computational challenges posed by
the nature of NGS data.

1.3.3

Pseudoknot structures and challenges in finding ncRNAs
containing pseudoknot structures in large-scale genomic data

Pseudoknot is a functionally important structural motif in ncRNA secondary structures [30].
In pseudoknots, bases in loop regions can form base pairs with bases outside the stem loop,
as displayed in Figure 1.2. It is already known that pseudoknots play important functions in
telomerase RNA, tmRNA, rRNA, some riboswitches, some protein-biding RNAs, Viral ribosomal frameshifting signals, etc [31]. Different research groups [32, 33] have shown that the
pseudoknot structure in the telomerase RNA is essential for telomerase activity. Gilley and
Blackburn [32] experimentally demonstrated that disruptions of the pseudoknot base pairing
within the telomerase RNA from Tetrahymena thermophila prevent the stable assembly in
vivo of an active telomerase. They further concluded that the pseudoknot topology rather
than sequence is critical for an active telomerase. Similarly, biologists reported that the
pseudoknots in tmRNA are highly important for protein biding, tmRNA maturation, and
proper folding of the tRNA-like domain [34].

6

A

G

C
3'
A
C C
G. C
G. C
C. G
U

A .U
G .C
C
C .G
U .A
G .C C

5'

Figure 1.2: An example of a psuedoknot-containing ncRNA

The state-of-the-art approach for ncRNA identification is based on SCFGs. However,
SCFGs are not able to model pseudoknot. Thus, the implementations of SCFG by Infernal
neglect pseudoknots in the structures. As a result, Infernal could misclassify sequences as
members of families containing pseudoknots. In addition, Infernal [19] has high computational cost, limiting its usage in large-scale data sets, such as those generated by the nextgeneration sequencing technologies. Other approaches include as RNAv [27] and RNATOPS
[26]. RNATOPS designs a graph model for RNA pseudoknots and solves the structure sequence alignment by graph optimization.RNAv is a profile based RNA secondary structure
variation search program that detects distant ncRNA structural homologs, which might be
missed by RNATOPS. Because of the high algorithm complexity, both tools are extremely
computationally expensive and are not feasible for large-scale data.

7

1.4

Contributions

During my Ph.D. study, I tried to tackle a few challenges in ncRNA identification in largescale genomic data. First, I designed and implemented an efficient progressive search pipeline
that can model and search for ncRNAs including pseudoknots. Using the pipeline, one could
examine a query to detect each component of a model trained from a family of psuedoknotcontaining ncRNAs. Second, I designed and implemented a computational pipeline to identify small ncRNAs in RNA-Seq data. An SCFG-based homology search was applied to
significantly reduced, making the pipeline very efficient in processing large dataset. Third,
I designed and implemented a set of graph-based algorithms to recover 16s rRNA genes in
metagenomic data. I designed a set of graph based algorithms with a list of graph reduction
techniques. By utilizing paired-end information, one can separate similar genes from highly
related species. Genes with very low abundance can also be recovered.

8

Chapter 2
Efficient known ncRNA search
including pseudoknots

2.1

Background

Noncoding RNAs (ncRNAs), which function directly as RNAs without translating into proteins, play diverse and important biological functions [35]. Many types of ncRNAs function
through both their sequences and secondary structures, which are defined by interactions
between Watson-Crick and wobble base pairs. Pseudoknot is a functionally important structural motif in ncRNA secondary structures. In pseudoknots, bases in loop regions can form
base pairs with bases outside the stem loop. In a graphical representation where arcs connect
base pairs, pseudoknot-free secondary structures only contain parallel or nested base pairs
while pseudoknot structures allow “crossing” base pairs, shown by an example in Figure
2.1.A.
Because the functions of ncRNAs are determined by both the sequence and structure,
successful ncRNA homology search tools must consider both sequence and structural conservations. Existing ncRNA search tools can be divided into two categories. One is commonly
referred to “known ncRNA search”, which aims to detecting homologs of ncRNAs with
annotated secondary structures. The second category includes tools for identifying novel
ncRNA genes. This work belongs to the first category and focuses on ncRNAs containing
9

pseudoknots.
For pseudoknot free ncRNAs, the state-of-the-art search method is based on stochastic
context-free grammars (SCFGs), which can accurately model the evolutionary changes of
both the sequences and structures of a group of homologous ncRNAs. Commonly used general and specialized known ncRNA search tools such as Infernal [19], RSEARCH [36], and
tRNAScan-SE [37] are all based on SCFG. In conjunction with the ncRNA family database
Rfam, Infernal has been successfully applied to classify query sequences into different types
of ncRNA. However, SCFGs are not able to model pseudoknot. Thus, the implementations
of SCFG by Infernal neglect pseudoknots in the structures. For example, although RF00023
(tmRNA) has four pseudoknots, its SCFG only models the knot-free structures, shown in
Figure 2.1.B. As a result, Infernal could misclassify sequences as members of families containing pseudoknots. In addition, Infernal has high computational cost, limiting its usage in
large-scale data sets, such as those generated by the next-generation sequencing technologies.

Figure 2.1: Consensus secondary structure of tmRNA and the secondary structure described
by SCFG (pseudoknots missing). A. Consensus secondary structure of RF00023 (tmRNA) in
Rfam. Stacking base pairs in 1 are parallel to base pairs in 2 and 3. 1, 2, and 3 are nested in 4. 2
and 3 form a pseudoknot. B. Secondary structure described by SCFG (pseudoknots missing).

More complicated grammars such as context-sensitive Grammars (CSGs) [4] exist to
faithfully model pseudoknots. However, the computational cost of the parsing algorithms of
a CSG is even higher than using a CFG [4]. Besides CSGs, other grammars such as parallel communicating grammar systems [38], RNA pseudoknot grammars [39], tree adjoining
10

grammars (TAGs) [40, 41], and multiple context-free grammars [42] have been proposed to
model pseudoknot structures. These work described the grammars and associated parsing
algorithms. However, they have not been widely used in pseudoknot search in large-scale
databases. First, although the parsing algorithms are polynomial, their cubic or even higher
time or memory complexity [42] limits their large-scale applications. Second, these methods
were designed for and tested on secondary structure derivation rather than homology search.
In order to conduct large-scale homology search, local parsing algorithms are needed. As
there are no source codes or executable implementations of these grammars, it is not clear
whether they can be automatically applied to known ncRNA search including pseudoknots.
In this work, we design an efficient pseudoknot search algorithm for all types of pseudoknots. Our method is based on a set of carefully chosen simple sub-structures (or substructures for short), which do not contain pseudoknots or bifurcations. The time complexity
of the parsing and probability computation algorithms for an SCFG including the CYK, the
inside, and the outside algorithm will be significantly reduced when the secondary structure does not contain any bifurcation [4, 43]. Thus, these simple sub-structures can be
searched efficiently using existing implementations of SCFGs. For multiple sub-structures
extracted from one ncRNA family, we choose a set of sub-structures according to their sizes
and false positive (FP) rates in order to maximize the search performance. These chosen
sub-structures will be used in a progressive search. Our experimental results show that our
tool competes favorably with other pseudoknot search methods.

2.2

Related Work

Brown and Wilson [44] proposed an RNA pseudoknot search method using intersections of
SCFGs. Both Brown’s method and our approach try to decompose pseudoknot into knot-free
11

structures for SCFG modeling. There are two major differences. First, our sub-structures
are not only knot-free, but also bifurcation free, which enables faster search. Second, while
Brown and Wilson’s method focused on the model construction and parsing algorithm, we
focus on choosing an optimal set of sub-structures to optimize the search performance.
The model construction and the parsing algorithms can be conveniently implemented using
Infernal, which has gone through extensive testing.
Structural motifs similar to sub-structures have been used as filters to speed up Infernal.
FastR [45] relies on stem-loops ((k, w)-stack) that do not contain bulge or interior loops to
search for ncRNAs. Weinberg et al. [46] use more flexible structural motifs based on subCMs and profile HMMs for ncRNA classification. Smith [43] used a decision tree to organize
partial SCFG models for fast ncRNA search. Currently, these filters are only designed and
tested for speeding up SCFG search.
Available pseudoknot search tools include RNAv [27] and RNATOPS [26]. RNATOPS
designs a graph model for RNA pseudoknots and solves the structure sequence alignment by
graph optimization.RNAv is a profile based RNA secondary structure variation search program that detects distant ncRNA structural homologs, which might be missed by RNATOPS.
The chain filter designed by Zhang et al. [45] consists of a collection of short conserved
words in an ncRNA family. In our work, we use a collection of simple sub-structures for
pseudoknot search. Similar to Zhang et al.’s work, we find that using a collection of simple
structures can achieve a good tradeoff between sensitivity and false positive rate during
search.

12

2.3

Approach

There are two components in the method. The first component is the design of a set of
sub-structures to represent an ncRNA family. The second component is a progressive search
strategy using the designed sub-structures. Different regions of an ncRNA sequence have
different sequence and structural conservations. Well-conserved structural and sequence
motifs tend to yield better search performance than poorly conserved motifs. Our approach
sorts sub-structures extracted from different regions according to their lengths and predicted
FP rates in order to choose a set of sub-structures with the optimal search performance.
For a chosen set of sub-structures, we conduct a progressive search according to a predetermined order. During the progressive search, one sub-structure is only applied to regions
containing matches to all previous sub-structures. A sequence is classified into the pseudoknot family if and only if 1) it passes the score thresholds of all the chosen sub-structures;
2) the position relationship between matched substrings is consistent with the relationship
between the sub-structures. Thus the false positive rate of the chosen set of sub-structures
is bounded by the product of the false positive rates of all component sub-structures. The
pipeline of the approach is illustrated in Figure 2.2

13

Figure 2.2: The pipeline of the SCFG construction and the progressive search.

2.3.1

Sub-structure derivation

In order to use SCFG-based models for pseudoknot search, we decompose a pseudoknot
structure into simple sub-structures. Each sub-structure contains at least one stem, which
includes a set of stacking base pairs allowing short bulge and interior loops. A full secondary
structure of an ncRNA family can be decomposed into multiple stems. Combinations of
stems define different sub-structures. Figure 2.3 shows all five simple sub-structures derived
from the given pseudoknot.

14

Figure 2.3: Five candidate sub-structures can be constructed from three stems in a pseudoknot structure. Each arc represents a stem containing nested base pairs and possible internal/bulge
loops. Single-stranded regions are represented using solid lines.

We describe a method to systematically extract all simple sub-structures from a pseudoknot. In the first step, all stems are extracted and sorted in increasing order of their starting
positions (i.e. 5’ end of the outmost base pair in the stem). Second, we build a bit table R
of size N by N for N stems extracted from the first step. For each cell R[i, j], if stem i and
stem j are nested, R[i, j] = 1; otherwise, R[i, j] = 0. Table R provides us information about
whether given stems can form one sub-structure. Given the stem set and their relationship
table R, we use pseudocode in Algorithm 1 to extract all simple sub-structures. In the
pseudocode, H x is the set of sub-structures containing x stems. Thus, the union of H x for
x = 1 to N consists of all simple sub-structures for a given secondary structure. The number
of sub-structures depends on the number of nested stems. Suppose the average number of
nested stems inside a stem is n. The total number of sub-structures is O(N + N 2n ).
Algorithm 1 only outputs the combination of stems. For each stem (or stem set) in a
sub-structure, we add loop and flanking regions using the following three rules. Let the 5’

15

Algorithm 1 ExtractSubstructures Input: a secondary structure containing pseudoknots
Output: all simple sub-structures
1: for each stem i = 1 to N do
2:
/* h: a sub-structure containing a set of stems */
3:
h = {i}
4:
H 1 = H 1 {h}
5: end for
6: for L = 2 to N do
7:
HL = ∅
8:
for each sub-structure h ∈ H L−1 do
9:
for each stem i ∈ h do
10:
/* h[i] is the ith stem in a sub-structure h */
11:
if R[h[1],i] and R[h[2],i] ... and R[h[L-1], i] then
12:
/* construct a new sub-structure h */
13:
h = h {i}
14:
H L = H L {h }
15:
end if
16:
end for
17:
end for
18: end for
19: output all sub-structures H = H 1 H 2 . . . H N
and 3’ ends of the outmost base pair in a sub-structure be I5 and I3 , respectively. Thus,
I5 < I3 .
• Rule 1: Add all single-stranded regions including bulge and internal loops between I5
and I3 .
• Rule 2: Except the base pairs inside the chosen stems in a sub-structure, all other base
pairs will be treated as single-stranded regions.
• Rule 3: Extend the flanking single-stranded regions to the left of I5 and to the right
of I3 until the first base pair in other sub-structures.

16

2.3.2

Search performance of different sub-structures

Each sub-structure can be conveniently modeled by an SCFG. As different sub-structures are
derived from regions with different sequence and structural conservations, their corresponding SCFGs have different performance in database search. In this section, we use an example
to illustrate this. We built SCFGs for eight sub-structures derived from RF00373 (Ribonuclease P) and evaluated the sensitivity, FP rates, and running time of the eight SCFGs when
applying them to a to a 22.5M Maize chromosome (data is described in “Experimental results”). The sensitivity and FP rates of different sub-structures from the same family can
be compared using true positive (TP) hits and FP hits respectively, because the condition
positive and condition negative sets are the same for all sub-structures derived from the same
family. For any SCFG Mi , let the set of matched sequences be Hit(Mi ). Let the set of true
pseudoknot sequences be S, which are the sequences in seed families containing pseudoknots
in Rfam. The number of true positive and FP matches of a sub-SCFG is |Hit(Mi )

S| and

|Hit(Mi ) \ S|, respectively. We summarized the TP hits and FP matches of eight SCFGs
under different score thresholds in Figure 2.4. In addition, the search times are included for
the score thresholds corresponding to the highest sensitivity. It is clear that different SCFGs
have highly search performance. During a progressive search using a series of sub-structures,
the number of matches of the preceding sub-structure determines the search space of the
current sub-structure. Thus, the total search time depends on both the FP hits and the
model running time, which is heavily affected by the model length. In order to maximize
the search efficiency, it is important to sort all candidate sub-structures according to their
FP rates. When the FP rates of two or more sub-structures are similar (same order), we
prefer shorter models because they incur less search times.

17

Figure 2.4: Number of TP hits and FP matches of each sub-structure under different score
thresholds. For each sub-structure, the length and the search time corresponding to the highest
sensitivity is listed. Time format is hr:min:sec. Due to highly different number of FP hits, two
sub-structures are plotted in the embedded figure.

2.3.2.1

Sort sub-structures according to their E-values

There are two methods to calculate the FP rates of sub-structures. Theoretically, by assuming a background model for random sequences and applying the CYK algorithm [4],
we can directly calculate the probability that a random sequence matches an SCFG model.

18

Empirically, we can apply the SCFGs to a large annotated sequence database and record
the number of FP matches. However, as it is more important to compare the FP rates of
different sub-structures than knowing their exact values, it is not necessary to directly calculate FP rates. By assuming that the SCFG alignment scores for random sequences follow
an exponential distribution, as implemented by Infernal, we can use E-values of the designed
score cutoffs to sort all sub-structures.
For an alignment score and a database size, an E-value indicates how many random hits a
user can expect to see with the same or better score in a random sequence database of similar
size. Thus, E-value indicates FP hits when it can be computed accurately. Currently, we
are using the E-value calculation method provided by Infernal. Although the assumed score
distribution is not accurate, we found that the estimated E-values allow us to compare FP
rates of different sub-structures with high accuracy. In order to estimate E-value, Infernal
generates a set of N random sequences whose GC content depends on the covariance model.
These N random sequences then are aligned against the model. In this process, all searching
result with score > 0 will be considered as hits. Scores of the top X hits are assumed to
follow an exponential distribution with two parameters, µ and λ. The maximum likelihood
approach is then taken to fit scores of hits into an exponential distribution.

E = db ∗ e−λ(score−µ)

where db is adjusted database size and is defined as

db =

dbtarget
(randhit + 0.5).
dbsizerandom

19

In the E-value computation, µ and λ are parameters trained in Infernal. sc is the score for
which one needs to calculate E-value. dbtarget is the size of target database. dbrandom is the
number of random sequences generated for curve fitting. At last, randhit is the number of
random sequences found by the covariance model. We can directly obtain µ and λ from each
calibrated covariance model, which is built for a sub-structure. With these two parameters
available, we can use the above equation to compute E-values for given scores.
Our experiments show that although the change of E-values does not scale with the
change of the FP rates, the order of E-values is highly consistent to the order of FP rates for
all 71 families we tested. Only for SCFGs with similarly small FP rates, their E-values cannot
accurately reflect their order. Table 2.1 presents an example. It is worth noting that we also
considered to use the average entropy to sort the sub-structures. However, our experiments
show that there is no systematic relationship between entropy-based measurements and the
FP rates of sub-structures.
sub-structure

E-value

FP hits

sub-structure

E-value

FP hits

RF00373 part2

1.71e+03

4894

RF00373 part1

7.33e+02

2606

RF00373 part5

7.30e-02

41

RF00373 part3

3.58e-02

39

RF00373 part4

3.18e-06

39

RF00373 part6

5.29e-09

37

RF00373 part8

4.52e-09

39

RF00373 part7

3.40e-15

34

Table 2.1: The order of E-values is highly consistent to the order of number of the FP hits.
2.3.2.2

Choose sub-structures for progressive search

During a progressive search based on multiple sub-structures, the final sensitivity is bounded
by the lowest sensitivity of all sub-structures. The final search time and FP rates heavily

20

depend on the order of applying these sub-structures. Let the final array of sub-structures
be SUB=(H1 , . . . , Hi , . . . , Hn ), where Hi will be applied before Hj if i < j. Let the size of
the original database be L. For a sub-structure Hi , let ti and f pi be its search time per hit
n

and FP rate, respectively. The final FP rate is bounded by

f pi . The final search time
i=1

n

is roughly T = L

i−1

ti (
i=1

j=1

i−1

f pj ), where L

f pj is roughly the search space for the subj=1

structure Hi . Minimizing T requires the accurate computation of ti or quantification of the
relationship between ti and f pi , which is not known as a priori. Although Infernal provides
estimated running time, it can be quite different from the true running time. According to
the equations, it is clear that we should apply short sub-structures with small FP rates before
long sub-structures with high FP rates. Thus we develop a greedy algorithm to generate a
set of sub-structures for progressive search based on our empirical observations.
We split sub-structures into short group and long group, which contain short and long
sub-structures respectively. For each group of sub-structures, we sort the sub-structures
according to their E-values and apply a greedy algorithm to choose a set of sub-structures
for search. The main steps of the greedy algorithm are outlined below, starting from the
short group:
1. In each iteration, choose the sub-structure with the smallest E-value. Remove it and
append it to the final sub-structure list SUB.
2. Remove any remaining sub-structure in both groups that only contains stems in this
sub-structure.
3. Repeat the first step until all stems are covered by one chosen sub-structure or the Evalues of all remaining sub-structures are bigger than a pre-determined cutoff (default
is 1).
21

If SUB has not included all stems, we apply the same process to the long group and
append the chosen sub-structures to SUB. We require all stems covered by the chosen substructures in order to ensure the representation of the annotated pseudoknot structure. It
is possible that this constraint will exclude homologous ncRNAs that lack annotated stem
loop structures. Currently, we use size 150 as the threshold to divide sub-structures into the
short and the long group.

2.3.3

Implementation

For each sub-structure, we train an SCFG-based model based on the corresponding alignment
in the training data using Infernal. Let the SCFGs trained from n sub-structures of an
ncRNA family SUB = (H1 , . . . , Hi , . . . , Hn ) be Π = (M1 , . . . , Mi , . . . , Mn ), where Mi
represents a single SCFG. A sequence can be classified into the corresponding family if the
following conditions are satisfied. First, the sequence contains matches to all designed SCFGs
in Π. SCFG match will be defined in the following text. Second, for every pair of strings
that match two SCFGs, their position relationship must be consistent with the annotated
relationship between two SCFGs in the underlying ncRNA family. There are three types of
position relationship between two sub-structures: parallel, nested, and cross-over. Cross-over
indicates existence of pseudoknots.
We determine SCFG match using score thresholds. For all sequences in the training
set, its alignment score with a given SCFG is computed. The minimum score of all the seed
sequences is used as the score threshold. This score cutoff is similar to the NC (trusted cutoff)
bit score thresholds used in HMMER [47] or Infernal. When the training data contains a good
representation of the family member sequences, the computed score threshold can ensure a
high sensitivity during homology search. If the training set only contains close homologs of
22

this ncRNA family, the designed cutoff may be too high for remotely related homologs.

2.4

Experimental results

In order to test the performance of our tool for pseudoknot search in sequence databases,
we conducted two experiments. First, we examined the automatically classified pseudoknot
sequences in Rfam. Second, we applied it to part of the Maize genome. On the same data
set, we compared our tool with RNAv, RNATOPS, and Infernal.

2.4.1

Pseudoknot sequences in Rfam

Because CFG cannot model pseudoknots, the implementations of Stochastic CFG (SCFG),
covariance models (CMs) in Rfam neglect pseudoknots in the structures. As a result, tools
that use SCFG for ncRNA search such as Infernal could misclassify sequences as members
of pseudoknot families. Each Rfam family contains a seed sequence set and a full sequence
set. While the seed sequence set contains manually validated homologous sequences, the
full sets are automatically produced using SCFG-based search against RFAMSEQ database
[48]. Thus, some of the sequences in the full set may not contain pseudoknot structures that
are annotated in the seed sequences. We examined the full member set of the 71 ncRNA
families containing pseudoknots in Rfam using our tool. Many families contain dozens of
sequences that lack the annotated pseudoknot structures. For all those sequences that cannot
be matched by our tool, we also utilized the Infernal alignments and a RNA stem finding tool
RNAmotif [49] to double check whether the base pairs in pseudoknot structures are missing.
The SCFG alignments output by Infernal contains annotations of all base pairs that do not
form pseudoknots. By comparing the annotated base pairs and the consensus secondary
structure of the seed alignments, we can extract the regions that should form pseudoknots.
23

Then, we applied RNAmotif to output all stems of size at least two in the chosen regions.
Failing to output any stems validated our findings that these sequences do not have the
annotated pseudoknots. The results are summarized in Table 2.2. Although homologous
ncRNAs may not share the same set of stems, simply ignoring pseudoknots without knowing
their impacts on the function can introduce a large number of false members. In particular,
it was already experimentally shown that pseudoknot structures are vital to the functions
of some types of ncRNAs [32, 33, 34]. For these well-studied pseudoknot structures, it is
important to include them during homology search.
ID

seqs

ID

seqs

ID

seqs

ID

seqs

without knots/

without knots/

without knots/

without knots/

num of seqs

num of seqs

num of seqs

num of seqs

RF00009

37/500

RF00010

3/3864

RF00011

26/460

RF00023

53/2871

RF00024

56/233

RF00028

2587/39045

RF00030

47/470

RF00041

2/151

RF00140

81/524

RF00176

37/64

RF00216

25/126

RF00233

22/76

RF00259

78/124

RF00261

43/78

RF00499

1/16

RF00523

2/5177

RF00622

1/94

RF01050

3/60

RF01072

21/271

RF01073

1/7006

RF01086

15/1093

RF01087

1/31

RF01089

4/25

RF01096

2/45

Table 2.2: Sequences that do not contain annotated pseudoknots and thus may not be real members.

2.4.2

Data set preparation

We created a simulated data set based on a contiguous 22-Mb region of the Maize Genome
[50]. The annotation of the 22-Mb region does not contain any hit to the 71 pseudoknot
families in Rfam. In order to evaluate the sensitivity of pseudoknot search tools, we randomly
chose 1,586 out of 26,704 seed sequences from 71 pseudoknot families and inserted them
in the 22-Mb region. The remaining seed sequences are used as the training data. In

24

order to examine the FP rate of SCFG-based tools, we also created 1,586 sequences without
pseudoknots. Specifically, for each of the 1,586 seed sequences, we altered the bases to disrupt
the base pairs that can form pseudoknots. Similarly RNAmotif is applied again to ensure
these sequences lose the annotated pseudoknot structure. These modified 1,586 sequences
and the original 22-Mb region of the Maize Genome constitute the negative training data.
Any hit to them is an FP hit. Note that by changing the bases, the modified sequences
might share lower sequence similarity to the trained model and thus pose an easier case
for all tools. Even so, our experimental results still show that different tools exhibit highly
difference performance on this data set. Thus, we feel this data set is a reasonable test set.
There are two major advantages of using this simulated data set for testing pseudoknot
search tools. First, as the 22-Mb region of the Maize genome does not harbor any reported
ncRNA that contains pseudoknots, we can measure the empirical FP rates of pseudoknot
search tools with higher reliability than using simulated sequences, which are usually generated using a simple i.i.d. model or low-order Markov model. In particular, the Maize genome
contains a high percentage of repeats and low-complexity regions, which could not be simply simulated and can pose a challenge for ncRNA search as warned by the Rfam website
(http://rfam.sanger.ac.uk/). Second, using thousands of seed members of the pseudoknot
families provides us adequate test data for evaluating the sensitivity.
Besides using the seed sequences of Rfam, we also considered another pseudoknot sequence database Pseudobase [51]. This database contains 304 RNA sequences with pseudoknot structures. A majority of them are sub-strings of Rfam seed sequences. Thus, we
choose to use Rfam seed sequences as the true label.

25

2.4.3

Results and comparisons

In order to separate the training set and the test set, we removed the sequences that were
inserted in the Maize genome from the seed alignments. For the alignments composed of
the remaining sequences, we trained the full covariance model and the models for the substructures. We used the designed sub-structure sets for pseudoknot search. We evaluated
the performance of pseudoknot search tools using three metrics: the sensitivity, FP hits, and
running time. For each ncRNA family represented by an SCFG M, let Hit(M) be the set
of output sequences by a search tool. Let S be the set of true pseudoknot sequences, which,
in this data set, includes seed sequences of each pseudoknot family. The sensitivity is thus
defined as:

sensitivity =

|Hit(M)
|S|

S|

Any output that does not overlap with true pseudoknot sequences is a false positive hit. The
number of FP hits of a search tool on one family is computed as:

F P hits = |Hit(M) \ S|

We report the FP hits instead of the FP rates for two reasons. First, the condition negative
set is family specific and thus is the same for all search tools for a given family. Second, the
size of the condition negative set is mainly determined by the size of the genome minus the
size of all true pseudoknot sequences. For a large genomic sequence, the FP rate becomes
very small and cannot reflect the difference between different tools.
On the same dataset, we run RNAv, RNATOPS, and Infernal 1.0.2. Of the three, RNAv
and RNATOPS are designed for pseudoknot search. For Infernal and sub-structure, no
26

hidden Markov model-based filtration was used in order to maximize the sensitivity. Other
parameters were set as default for Infernal. We used the default parameters to run RNAv
and RNATOPS. All experiments were run on the main cluster of the High Performance
Computing Center on campus (http://www.icer.msu.edu/?q=hpcc). Each experiment was
allocated four CPU days at most. There are 65 families and 31 families that failed RNAv
and RNATOPS, respectively. The search jobs for those families were killed by the cluster
after four CPU days. No results were produced. Thus we could not report the results for
those families. RNAPTOPS output results for 22 families by the end of the allocated time.
The performance of these four tools is recorded in Table 2.3. The results show that our
tool is significantly faster than RNATOPS and RNAv. For a majority of families, the running
time is smaller than half an hour. A closer examination reveals that 99% of the running time
is attributed to the first sub-structure, which is expected. Of the six families for which RNAv
successfully generated outputs, they all have the sensitivity of 1.0, equal to the sensitivity
of sub-structure based search. Of the 40 families for which RNATOPS reported results, 14
of them have equal sensitivity to ours. 1 family yields slightly better sensitivity than ours
while other 24 families have significantly worse sensitivity. Thus, overall, our search achieves
higher sensitivity than RNAv and RNATOPS. In addition, sub-structure based search tool
incurs lower FP rate than RNATOPS and RNAv. Table 2.3 shows that RNATOPS yields
low FP hits. Of the 40 families, RNATOPS has the same number of FP hits as ours for only
one family and significantly more FP hits for the rest. In particular, RNATOPS outputs
over 1,000 hits for 9 families.
We compared the sensitivity, FP hits, and running time of Infernal and our tool in Figure
2.5, Figure 2.6, and Figure 2.7 using X-Y scatter plots. As Infernal and our tool generate
the same sensitivity or other metrics for some families, we use the bubble plot to visualize
27

RNA fam
ID
RF00009
RF00010
RF00011
RF00023
RF00024
RF00028
RF00030
RF00041
RF00094
RF00140
RF00165
RF00176
RF00216
RF00233
RF00259
RF00261
RF00373
RF00381
RF00390
RF00458
RF00499
RF00505
RF00507
RF00523
RF00622
RF01050
RF01072
RF01073
RF01074
RF01075
RF01076
RF01077
RF01078
RF01079
RF01080
RF01081
RF01082
RF01083
RF01084
RF01086
RF01087
RF01088
RF01089
RF01090
RF01091
RF01092
RF01093
RF01094
RF01095
RF01096
RF01097

sen

1.0

FP

58077

time
RNAv

sen

FP

time
RNATOPS

0.21

4

4days

0.26

0

4days

0.38
1.0

30
763

4days
4days

0.2
0.41
0.29

2
7
160

4days
4days
4days

0.52
0.11
0.5

273
3
91

4days
4days
4days

1.0
0.5
0.67
0.5
1.0

333
135
284
2934
3002

4days
4days
4days
4days
4days

0.5

3

4days

0.33
0.43

1
4

4days
4days

19:54:40

1.0

196631

13:13:32

1.0

139249

16:20:29

1.0

165990

10:58:02

1.0
0.42

0
67

4days
4days

1.0

166314

16:44:20

0.5
0.25

1
1

4days
4days

sen

FP

1.0
0.58
0.84
0.4
0.95
0.83
0.38
0.95
0.88
0.97
1.0
1.0
0.87
0.96
1.0
1.0
0.92
1.0
1.0
1.0
1.0
1.0
0.95
0.95
1.0
1.0
0.96
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
0.67
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0

5
95
25
1
24
6
26
0
0
0
0
0
0
0
0
0
27
0
0
0
0
0
0
24
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
11
0
0
3
0
0
0
0
0
0
0
0

time
Substructure
01:47:37
00:18:47
00:06:51
00:06:54
00:06:33
22:30:56
02:35:01
00:10:37
00:09:21
01:05:08
00:22:10
00:05:48
00:03:03
00:09:06
00:05:41
00:13:53
01:35:35
00:17:10
00:05:21
00:09:37
00:09:51
00:32:27
00:34:44
00:20:31
00:05:15
00:41:32
00:08:37
00:18:36
00:06:59
00:07:59
00:20:36
00:53:32
00:12:38
00:07:37
00:08:04
00:06:47
00:05:47
00:04:34
00:10:46
05:18:38
01:19:41
00:39:09
01:03:09
00:23:11
00:13:06
00:17:57
00:13:59
00:52:46
00:10:56
00:23:04
00:12:18

sen

FP hits

1.0
0.97
0.97
1.0
0.81
0.72
0.98
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
0.95
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0

38
318
179
180
86
37
87
64
35
33
14
21
30
29
5
20
363
15
6
10
115
5
23
145
14
13
30
13
10
7
5
4
3
3
110
3
4
7
8
13
12
4
20
8
4
15
23
3
2
2
4

time
INFERNAL
26:16:07
17:54:31
09:09:52
13:40:31
20:36:42
79:05:16
83:37:31
01:27:52
00:54:20
01:52:09
00:32:25
00:50:54
04:42:29
00:47:38
02:09:52
02:50:11
14:15:43
00:33:42
00:07:35
02:18:47
01:33:43
00:29:55
00:52:44
00:19:24
00:42:40
39:22:21
00:10:13
00:29:04
00:15:00
01:00:46
00:35:33
00:37:23
00:26:04
00:16:01
00:13:41
00:08:44
00:09:13
00:07:05
01:53:25
05:39:23
01:37:01
00:37:14
01:21:27
00:36:35
00:28:51
00:30:08
00:29:56
01:10:47
00:27:12
00:24:45
00:22:09

Table 2.3: Sensitivity, FP hits, and running time comparison between RNAv, RNATOPS, Infernal,
and sub-structure. Bold font is applied to the highest sensitivity, the lowest FP hit, or the shortest
running time for each RNA family. The empty cells indicate that the corresponding tools did not
generate any output within 4 CPU days.

28

the number of the same data points. As expected, Infernal is highly sensitive. However, it
reported dozens of hits on the pseudoknot-free sequences which we inserted as false positive
sequences. For all families, Infernal reported equal or more FP hits than our tool. In
addition, it is generally slower than sub-structure-based tool. Out of 71 RNA families, substructure-based tool has shorter running time on 66 families. For 14 families, it yields 10x
speed up over Infernal.

Figure 2.5: Sensitivity comparison on 71 families.

29

Figure 2.6: Comparison of false positive hits on 71 families.

Figure 2.7: Running Time comparison. There are 4 families on which Infernal run much longer
than on other families. To keep an appropriate scale, there running times are not displayed on the
figure.

There is no significant difference in the sensitivity between Infernal and sub-structure30

based tool when the average sequence length in a family is not too long. Infernal has better
sensitivity on longer and more complicated RNA families including RF00010, RF00011,
RF00023, and RF00030. The major reason behind our worse sensitivity on the long families
is that we use sub-structure that cover every stem. Thus, we only classify sequences that
have all characterized stems from the underlying structure. However, some remote homologs
may lose base pairs in stems during evolution. Thus while we guarantee to find sequences
that have the same structures as the annotated pseudoknots, we can miss some homologs,
leading to lower sensitivity for some families.

2.5

Conclusion

Although Infernal is highly sensitive in known ncRNA search, caution must be taken when
applying Infernal to ncRNA families containing pseudoknots. In this work, we designed a
pseudoknot search method based on a set of carefully chosen sub-structures. These substructures do not contain pseudoknots or bifurcations. SCFGs can be conveniently built
on them and searched with high efficiency. In order to minimize the overall FP rate and
the running time, we sorted sub-structures according to their lengths and their E-values for
designed trusted cutoff (NC) bit score thresholds. We designed a greedy algorithm to choose
a set of sub-structures and applied the progressive search to minimize search time. Our
experimental results showed that our tool competes favorably with RNAv and RNATOPs,
both of which have been used for pseudoknot search in large databases. This work provides
a complementary pseudoknot search tool to existing SCFG-based knot-free ncRNA search
methods.
Currently our tool only reports homologous ncRNAs with the same number of charac-

31

terized stems as the training data. As a result, some true homologs that have lost one or
multiple stems will be ignored. As part of the future work, we plan to incorporate available
RNA-seq data for remote homology search.

32

Chapter 3
RNA-CODE: a noncoding RNA
Classification tOol for short reaDs in
NGS data lacking rEference genomes

3.1

Introduction

Noncoding RNAs (ncRNAs), which function directly as RNAs without translating into proteins, play diverse and crucial roles in many biochemical processes. For example, tRNAs and
rRNAs aid protein synthesis. SnoRNAs guide rRNA modifications. MicroRNAs (miRNAs)
regulate gene expression [52]. Short interfering RNAs (siRNAs) involve in gene silencing in
RNAi process [53].
In particular, the development of next-generation sequencing (NGS) technologies sheds
light on more sensitive and comprehensive ncRNA annotation. Deep sequencing of transcriptomes of various organisms has revealed that a large portion of transcriptomic data cannot
be mapped back to annotated protein-coding genes in the reference genome, indicating that
those transcripts may contain transcribed ncRNAs [2]. Total RNA-seq and small RNAseq data generated by numerous transcriptomic sequencing projects are still accumulating
rapidly. Identifying different types of ncRNAs and quantifying their expression levels in

33

different tissues, conditions, and developmental stages have generated new knowledge about
functions of ncRNAs. Besides RNA-seq data, ncRNA identification is also important for
analyzing metagenomic data, which contain sequenced metagenomes from various environmental samples. For example, 16s rRNA classification [54, 55] and assembly [29, 56] is a
fundamental step for studying phylogenies in a sample. NCRNA annotation is, therefore,
an important component in post-NGS analysis.
There are two different ncRNA identification problems for NGS data. One focuses on
identifying homologs of annotated ncRNAs, such as tRNA, rRNAs, snoRNAs, and many
types of miRNAs. Some example applications include comparing expression level changes
of let-7 miRNA genes in different developmental stages of C. elegans [57], identifying all
homologs to annotated miRNAs in the small RNA-seq data of a non-model species [58], and
16s rRNA annotation in metagenomic data [29]. These studies aim to annotate all known
ncRNAs or novel members of characterized ncRNA families. The second category focuses
on reporting novel ncRNA genes. One possible strategy is to cluster sequences and then
apply de novo ncRNA gene finding tools such as RNAz[59]. This work belongs to the first
category.
The state-of-the-art method for ncRNA homology search is still based on comparative
ncRNA identification, which searches for ncRNAs through evidence of evolutionary conservation. As the function of an ncRNA is determined not only by its sequence but also by
its secondary structure, which contains interacting base pairs, such as Watson-Crick base
pairs and G-U base pairs, successful ncRNA search should take advantage of both sequence
and secondary structural similarity. A number of such tools are available such as a general ncRNA search tool Infernal [17] and specialized tools for tRNA [11] and snoRNA [60]
etc. However, most existing homology search strategies use complete secondary structure of
34

annotated ncRNAs and are not optimized for NGS data. When applied to short and fragmentary sequences, these tools generate marginal scores and thus cannot distinguish reads
sequenced from ncRNAs or other regions. To our best knowledge, trCYK [16] is the only
tool that conducts homology search for fragmentary reads sequenced from various types of
ncRNAs. However, using it alone tends to incur high false positive rate according to our
experimental results.
It is worth noting that although NGS platforms are producing longer reads, many reads
sequenced from ncRNAs are still fragmentary. First, many ncRNAs are very long, including
mRNA-like long ncRNAs [61], 16s rRNAs [55], etc. Second, the biogenesis shows that some
types of small ncRNAs are cleavaged into short products (such as mature miRNAs from
their precursors). The sizes of these short products are not increasing with read length.
In order to apply existing ncRNA identification tools to NGS data, read mapping or de
novo sequence assembly tools are usually applied first to connect short reads into contigs.
When the reference genome is available, short reads can be mapped back to the reference
genome. Existing ncRNA identification tools can then be applied to the blocks containing
overlapping reads along the reference genome. When there is no quality reference genome
available, which is often the case for metagenomic data and RNA-seq data of non-model
organisms, de novo sequence assembly tools can be employed first to connect fragmentary
reads into contigs. However, using sequence assembly tools as the first step is not always
ideal for ncRNA classification.
First, the quality of read assembly deteriorates significantly in complicated NGS data
sets [28]. Different sequence assembly tools generate different sets of contigs. There is no
consensus on the best assembly tool. Error-containing contigs often affect down-stream
analysis.
35

Second, successful de novo assembly requires relatively high sequencing depth, which
is difficult to achieve for many ncRNAs. It is shown [62, 63] that the transcription levels
of many types of ncRNA genes are low and condition-dependent. Often it is difficult to
foresee which ncRNA genes are lowly transcribed. Thus there lacks information to optimize
the parameters of de novo assembly tools to produce complete or partial ncRNAs of highly
divergent expression levels or abundance.
Third, some types of ncRNAs are cleaved during the maturation process (mature miRNAs). The observed reads sequenced from these ncRNAs do not share any overlap and
cannot be assembled. Thus, there is a need for an alternative and better ncRNA search tool
for NGS data lacking reference genomes.
In this work, we introduce a comprehensive ncRNA classification tool for short reads:
RNA-CODE, which is specifically designed for ncRNA identification in NGS data sets that
lack reference genomes. Given a set of short reads, RNA-CODE classifies the reads into
different types of ncRNA families. The classification results can be used to quantify the
expression levels of different types of ncRNAs in RNA-seq data and ncRNA composition
profiles in metagenomic data, respectively. RNA-CODE integrates secondary structure based
homology search with family-specific de novo assembly. The parameters of de novo assembly
tools can be adjusted in a family-specific fashion.
The remaining of this manuscript is organized as follows. The Methods section describes
the design rationale of RNA-CODE and the three main stages. The Results section benchmarks RNA-CODE with other ncRNA classification frameworks. We present experiments
results on real metagenomic data and RNA-seq data. For the small-scale metagenomic data,
we compare RNA-CODE with Metaxa [54] on 16s rRNA read classification. Then, we compare RNA-CODE with de novo sequence assembly on small RNA-seq data annotation of a
36

well-annotated organism.

3.2

Methods

We propose a method that combines homology search and family-specific de novo assembly
to identify reads sequenced from ncRNAs. In particular, the homology search is applied to
both the short reads and contigs produced by assembly programs. This method is designed
based on two key observations. First, reads sequenced from ncRNAs tend to share higher
sequence and structural similarity with the their native families than reads sequenced from
other families. Thus, higher alignment scores by ncRNA homology search tools are expected.
In particular, homology search is vital for identifying ncRNAs that go through cleavage
and degradation. Reads sequenced from miRNAs are hard to assemble because only reads
corresponding to mature miRNAs can be largely captured into RNA-seq data. None or a
few can be mapped to other regions of the pre-miRNA due to fast degradation. Figure 3.1
shows the mapping results of reads sequenced from pre-miRNAs obtained from Arabidopsis.
No contig or very short contigs can be produced based on the typical read mapping pattern.
In addition, this read mapping pattern does not change with increase of expression levels, as
shown in the three miRNAs in Figure 3.1. For these types of ncRNAs, applying homology
search on short reads directly is indispensable.

37

9000

A

8000
7000

coverage

6000
5000
4000
3000
2000
1000
0

0

10

20

30

40
50
posit ion

60

70

80

90

25

B

coverage

20

15

10

5

0

0

10

20

30

40
50
posit ion

60

70

80

90

4000
C

3500

3000

coverage

2500

2000

1500

1000

500

0

0

20

40

60

80
posit ion

100

120

140

160

Figure 3.1: Reads sequenced from pre-miRNAs cannot be assembled into contigs. Three
different miRNAs show highly different expression levels in the same RNA-seq data. A. mir-156 B.
mir-160 C. mir-166
38

While homology search is important, applying it to short and fragmentary reads may
introduce high false positive rate when detecting remote ncRNA homologs (data will be
shown in Methods Section). Thus RNA-CODE employs the second observation that true
ncRNA reads sequenced from the same gene can be assembled into contigs with significantly
high alignment scores against their native families. On the contrary, reads aligned by chance
are not likely to be assembled because they tend to share poor overlaps. Both properties are
important in boosting sensitivity and accuracy of short reads classification.
RNA-CODE consists of three key stages. First, RNA-CODE coarsely classifies reads into
different ncRNA families using both secondary structure and sequence similarity. Then, a
family-specific sequence assembly is used to assemble aligned reads into contigs. Because the
numbers of reads that are coarsely classified in the first step indicate the expression levels
or abundance of ncRNA genes in this family, this step chooses de novo assembly parameters
(such as kmer size or overlap threshold) accordingly. The produced contigs are generally
longer than input reads and thus can be classified into ncRNA families with better sensitivity
and accuracy in the last step. For miRNAs which cannot be assembled into contigs, we use
their biogenesis-based property and homology search results for classification.
The three-stage workflow with chosen tool for each stage is illustrated in Figure 2.2.
Here, we highlight the rationale behind the design of the three stages. The first stage
aims to classify a large number of input reads into different ncRNA families with high
sensitivity. It employs existing homology search tools. For short and fragmentary reads,
this stage can incur high FP rate. Thus, downstream analysis is needed to remove those
falsely classified reads. In the second stage, de novo sequence assembly tools are employed to
assembly classified reads into contigs. The family-specific sequence assembly is expected to
produce contigs corresponding to complete or partial ncRNA genes. However, because of the
39

extremely uneven or low transcriptional levels of many types of ncRNAs or low abundance, a
small overlap cutoff or kmer is needed to ensure appropriate connectivity for some families.
As a result, some contigs are chimeric or simply consist of randomly aligned reads. The
third stage is used to remove the false positives. All contigs are aligned to ncRNA families.
Only ones with scores or lengths above given cutoffs are kept. For miRNAs that cannot form
contigs, we use stringent homology search scores and known biogenesis-related properties as
classification criteria. Every stage will be described in great detail below.
Short Reads

Align reads to Family1 using SCFG-based
tool trCYK

Align reads to Family2 using SCFG-based
tool trCYK

...

Align reads to FamilyN using SCFG-based
tool trCYK

Reads Classified into
Family-1

Reads Classified into
Family-2

...

Reads Classified into
Family-N

Family-specific
assembly

Family-specific
assembly

...

Family-specific
assembly

Contig Set-1

Contig Set-2

Contig Set-N

...
Align Contigs to
Family-1

Align Contigs to
Family-2

Contig Set-1

Contig Set-2

...

Align Contigs to
Family-N

Contig Set-N

...
Read Classification

Figure 3.2: The pipeline of RNA-CODE. The pipeline of RNA-CODE. For miRNAs, the output
of the first stage (SCFG-based filtration) and the whole pipeline will be used together for reads
classification.

40

3.2.1

Stage 1: SCFG-based filtration

To maximize classification sensitivity, short reads are aligned to an SCFG model built from
an RNA family of interest. An SCFG describes not only primary sequence of an RNA
family but also its secondary structure formed by base pair interaction. The state-of-the-art
implementation of SCFG model is Covariance Model (CM). The software suite Infernal [17]
builds a CM on a family of RNA sequences, and searches for homologs using inside-outside
algorithms. A CM in Infernal is implemented as a tree-like structure in which each node
models a single base or a base pair. Infernal is able to optimally align a sequence to this
tree with the highest possible score. Short reads, however, pose challenges to the search
algorithms because they are fragmentary sequences in which nucleotides expected to form
base pairs could be missing. Due to missing bases, base pairs that could have been aligned
to a base-pair node in a parse tree are not alignable any more. As a result, reads sequenced
from this family may not be well aligned to the underlying CM. Truncated-CYK (trCYK)
[16] is a specialized tool designed for fragmentary sequence search. It performs local RNA
alignment against a CM of interest, recovering base pairs that are possibly missing and would
otherwise be base paired. For every alignment, a score is provided by trCYK indicating the
goodness of alignment. Homologous reads tend to yield higher scores and longer alignments
than random reads.
Here we report the performance comparison of two homology search tools that can be
applied to short and fragmentary reads. One is the mostly commonly used homology search
tool BLAST [15], which relies on sequence similarity only. The second tool is trCYK [16].
The goal is to compare the performance of trCYK with BLAST in classifying ncRNA reads
of different lengths. Thus, for read length 25, 30 and 50, we sampled 5000 true reads from

41

tRNA sequences obtained from Rfam. Another 5000 random reads generated from other
RNA families were mixed with true tRNA reads. Seed sequences from Rfam were excluded
from the test data. Covariance Model used in trCYK and formatted database used in BLAST
were both built from seed sequences of tRNA. We then searched for tRNA reads in the mixed
reads using trCYK and BLAST. The performance of both tools is visualized in the ROC
curves in Figure 3.3. The figure demonstrates that trCYK has better performance than
BLAST. However, both tools have high FP rates, showing the need for further screening.
1

0.9

0.8

Sensitivity

0.7

0.6

0.5

0.4
BLAST LEN=25
BLAST LEN=30
0.3

BLAST LEN=50
trCYK LEN=30

0.2

trCYK LEN=50
trCYK LEN=25

0.1

0
0

0. 05

0. 1

0. 15

0. 2

False Positive Rate

Figure 3.3: ROC curves of short reads classification using trCYK and BLAST. ROC curves
of short reads classification using trCYK and BLAST. Sensitivity measures the ratio of correctly
found true tRNAs to the total number of true tRNA reads. False positive rate measures the ratio
of falsely found tRNA reads to the total number of false tRNA reads.

Like all alignment programs, a score cutoff is needed to distinguish homologous sequences
from others. We set two cutoffs s and l, on alignment score and alignment length, respectively.
s and l determine the strength of filtration. A low cutoff will lead to an overwhelming number
of negative reads which could significantly slow down the next two stages. A high cutoff,
however, will exclude remote homologous reads with poor conservation from further analysis.
As trCYK does not provide such thresholds, we considered two strategies to determine the

42

cutoffs. First, the expected alignment score for a homologous sequence of length L can be
used as the cutoff. In order to ensure high sensitivity, the actual cutoff can be smaller than
the expected score. Second, Monte-Carlo method can be used to evaluate the sensitivity and
FP rate of a score cutoff using a large number of sequences that are generated from both
ncRNAs and non-ncRNAs. In this work, we used the second strategy. Figure 3.3 shows that
some very short homologous reads have poor alignment scores. As the first stage defines
the upper-bound of the classification sensitivity, we chose a loose cutoff s = 1 to guarantee
that most positive reads can pass the filtration stage. We found that this threshold also
applies to reads sequenced from other types of ncRNAs. With the increase of read length,
this score threshold needs to be improved as well. As the first stage is designed to achieve
high sensitivity, the default cutoff is set to 1. To further increase the sensitivity of filtration,
we also accept reads with alignment score s greater than -1 and with alignment length l ≥ 30
bases.

3.2.2

Stage 2: family-specific de novo assembly

For reads that are coarsely classified to a family by the first stage, they will be input to de
novo assembly tools. Compared to conducting de novo assembly on all the reads, the input
sizes to assembly tools are significantly reduced. Thus, even memory intensive assembly
tools can be applied.
Multiple de novo assembly tools exist. Depending on the data properties, such as read
length and sequencing error rates, sensible choices can be made. In this work, the de novo
assembly programs are applied to RNA-seq data of non-model organisms or metagenomic
data. Thus, specific properties of these two data should be considered when choosing assembly tools. Unlike genome assembly, highly diverse sequencing coverage is expected in
43

both data sets. In RNA-seq data, heterogeneous expression levels of ncRNAs contribute to
highly diverse sequencing coverage. In metagenomic data, different abundance of ncRNA
genes lead to different sequencing coverage. Choosing one set of parameters (such as overlap
threshold in overlap graph or kmer size in de Bruijn graph) for the whole data set is not
likely to produce optimal results for downstream ncRNA analysis. Thus, the first requirement for the chosen assembly program is that users can adjust the parameters according to
the output of the filtration stage. Specifically, although the first stage only coarsely classifies reads into gene families, it can be used to estimate the expression levels or abundance
of genes in a family. For families with large number of reads classified, RNA-CODE assumes high sequencing coverage and thus uses stringent assembly parameters. On the other
hand, for families with fewer number of classified reads, small overlap cutoffs (in an overlap
graph-based assembly tools) or small kmers (in de Bruijn graph) should be used to ensure
connectivity for lowly transcribed regions or low abundance genes. Second, many assembly
programs removed kmers with low coverage as they may contain sequencing errors. In order
to assembly ncRNAs of low expression or abundance, we use an assembly tool that can keep
reads/kmers with low coverage.
In this work, for very short reads (read length < 50bp), we applied and compared several
assembly programs [64, 65] and chose SSAKE [66] because it delivers better assembly performance on our experimental data. SSAKE is a specialized de novo assembly tool for unpaired
short reads assembly. It is a graph-based greedy assembler that efficiently assembles millions
of short reads following near-linear time complexity. During the assembly process, the 3’
end of a contig is extended if its suffix overlaps with the prefix of another read. SSAKE
generates contigs progressively by searching all prefixes stored in a hash table for the longest
possible prefix-suffix match whose overlap is above a threshold. We modified the codes of
44

SSAKE to make it accept any overlap cutoff.
To follow standard notation for assembly algorithms, we use k to represent either the
kmer size in De Bruijn graph or the overlap threshold in an overlap graph for assembly.
The length of overlap threshold k is an important parameter in SSAKE. A higher k usually
results in fewer but more accurate contigs. A lower k leads to higher contiguity. But incorrect
extension may happen because the probability of a random prefix-suffix match is high. In
de novo assembly, there does not exist optimal overlap threshold. Although the expression
or abundance is not known a prior, we estimate it using the number of classified reads in
the first stage and choose k accordingly. In addition, it is observed that using a single-k will
lead to suboptimal performance of de novo assemblers. In the second stage, a range of k’s
are chosen and used on short reads assembly. For each k, all reads are used in assembly. The
contigs from different assemblies are subsequently pooled together for further analysis.

3.2.3

Stage 3: contig selection

Some randomly aligned reads in stage 1 can be removed by stage 2 because they are not part
of any contig. However, due to the loose overlap cutoff or small kmers, some reads can still
be assembled into contigs and thus produced by stage 2. There are three types of contigs
with reference to an ncRNA gene family: 1) positive contigs that are assembled by reads
originated from this family, 2) negative contigs that are assembled from false reads that are
not part of the underlying gene family, and 3) chimeric contigs that are formed by both
true and false reads. Negative and chimeric contigs can be formed due to small overlaps we
allowed in the multiple-k assembly. The probability that a contig is extended with a negative
read due to a random prefix-suffix match is high for a small overlap cutoff. Sketches of the
three types of contigs are presented in Figure 3.2.3.
45

Negative read

Positive read

Positive contig

Negative contig

Chimeric contig

Chimeric contig

Figure 3.4: Three types of contigs.

To distinguish positive contigs from negative ones, we align contigs to the underlying
CM. We chose to use trCYK for the following reasons. 1) Both sequence and structural
information of a contig should be utilized. 2) Many contigs may not be complete RNA genes
especially when the gene transcription level is low. Thus we need to consider missing bases
while aligning contigs to the underlying CM.
After trCYK is applied to all contigs from stage 2, if there exist contigs with alignment
scores greater than a pre-determined cutoff, the gene of interest is considered to be tran46

scribed. As trCYK is a local alignment tool, it is common that only part of the contig is
aligned to the underlying CM. Thus only reads that assemble the aligned part are classified
into this RNA family. This feature could be very effective when multiple correct segments
exist in a chimeric contig, although we did not observe such cases in our experiment. Bad
segments interleaved by correct ones could potentially be removed.

3.2.4

MIRNA families

Normally, all classified reads need to pass through the entire pipeline. But for miRNA
families, as no contigs might be formed, we used two criteria for read classification. First,
the alignment score and length of the trCYK alignments in the first stage must pass the predetermined threshold. Second, for all reads that align to the miRNA* region, we examined
whether they can form a stem with reads aligned to the mature miRNA region. If not, the
reads aligning to miRNA* region will be removed. As a result, many reads that cannot form
any contig can be still classified into miRNAs based only on the homology search results in
the first stage.

3.3

Results and Discussion

RNA-CODE can be applied to ncRNA classification in both metagenomic data and RNA-seq
data of non-model species. To demonstrate the utility of RNA-CODE, we conducted two
experiments. In the first experiment, we tested RNA-CODE on identifying reads sequenced
from 16s rRNAs in a small-scale metagenomic data set. It is widely known that 16s rRNA is
an important genetic marker for taxonomic identification in metagenomes. Identifying 16s
rRNA reads can be used as the first step to assemble full-length 16s rRNAs [54, 29, 56].

47

This experiment aims to detect reads that are sequenced from 16s rRNAs. In the second
experiment, we tested RNA-CODE on annotating reads sequenced from different ncRNA
genes including house-keeping RNAs, miRNAs etc. in RNA-seq data of the model organism
Arabidopsis thaliana.
For the first experiment, the performance of RNA-CODE was benchmarked with Metaxa
[54], which is designed for classifying short reads into different rRNA families. For the second
experiment, the performance of RNA-CODE was benchmarked with standard annotation
pipeline for NGS data, which is de novo assembly tools plus existing ncRNA annotation
tools.
To evaluate the performance of all tools, we compared the true membership and the
predicted membership of reads. Two metrics are used in evaluation: read-level sensitivity
and positive predictive value (PPV), which indicates accuracy. For an RNA family of interest,
let T P be the set of true positive reads originated from this ncRNA family. Let P be the
set of reads predicted to be positive. Sensitivity is defined as

sensitivity =

P ∩ TP
.
TP

PPV is defined as re appropriate

PPV =

P ∩ TP
.
P

A good ncRNA identification tool should have both high sensitivity and high PPV.

48

3.3.1

Detecting reads of 16s ribosomal RNAs

RNA-CODE can detect and discriminate among multiple ncRNA types. This experiment
tests RNA-CODE in recognition of one type of ncRNA, 16s rRNA in a metagenomic data set.
Various tools exist for 16s rRNA search [17, 54, 67, 68], of which, only Metaxa is designed
for short reads. Thus, we benchmark RNA-CODE with Metaxa in this experiment.

3.3.1.1

Data

In order to accurately evaluate the performance of RNA-CODE, we need to know the groundtruth membership of reads in metagenomic data. Thus, we constructed a small-scale real
metagenomic data set, for which we knew which reads were sequenced from 16s rRNAs.
We obtained human gut microbial metagenomics data from European Nucleotide Archive
(http://www.ebi.ac.uk/ena/data/view/ERA000116). The data were sequenced from fecal
specimens of obese individual human adults using Illumina Genome Analyser [69]. We selected two pair-ended metagenomics data sets of different read lengths. There were 9,633,603
reads of length 44 in the one dataset and 14,822,431 reads of length 75 in the other. Without knowing the genomes of the species in this sample and their 16s rRNA annotations, we
cannot obtain the true membership of all the reads. Thus, we need to construct a small
metagenomic data set using reads that can be reliably labeled as 16s rRNAs in fully sequenced genomes. To do this, we first chose species that have whole genomes and 16s rRNA
annotations using the species catalog in [70]. Then, all reads that can be mapped to these
genomes were included in the small metagenomic data set. In total, 11 strains with complete
genomes available were selected from 67 strains designed by Turnbaugh [70]. To determine
whether a read was originated from 16s ribosomal genes, we mapped each read back to the
genomes of the 11 selected bacteria strains. The read mapping positions and the annotation
49

of 16s rRNA genes are combined to determine whether a read is sequenced from 16s rRNAs.
If a mapped read share at least 80% of bases with an underlying 16s genes, we defined this
read as positive. If a read has no overlap with any 16s gene, we defined it as negative. The
reads having less than 80% of their bases overlapped with a 16s gene are considered to be
ambiguous and were discarded. Additionally, the reads that cannot be mapped to any of
the 11 genomes were also discarded, because we did not have enough information of their
true membership. The small-scale metagenomic data consists of both positive and negative
reads. For the data set of read length 44, there are 606 positive reads and 71993 negative
reads, respectively. For the data set of read length 75, there are 379 positive reads and 61086
negative reads, respectively.

3.3.1.2

Experimental results

We applied RNA-CODE to the small-scale metagenomic data sets constructed above for
16s rRNA classification. We then compared RNA-CODE with Metaxa, a specialized tool
for detecting reads originated from SSU rRNAs such as 12s, 16s and 18s. Same as RNACODE, input to Metaxa is also a set of short reads in FASTA format. Output of Metaxa
contains short reads presumably originated from SSU rRNAs. Short reads are categorized
by the species or organelles that they are originated from. The categories include bacteria,
archaea, eukaryota, mitochondria, and chloroplast. Because specific categories of reads are
not concerned in this experiment, we considered a read to be a true positive if it can be
categorized in any of the 5 categories.
The above experiments were conducted on reads of 44 bases and 75 bases using default parameters. As displayed in Table 3.1, both tools achieved high specificity for both
read lengths. The sensitivity of RNA-CODE out-performed Metaxa for both read lengths.
50

Specifically, RNA-CODE performed well on shorter reads. As Metaxa does read classification using hidden Markov models (HMMs), short reads tend to produce marginal scores and
thus are hard to distinguish from non-rRNA reads. RNA-CODE considers both sequence
and secondary structure similarity and is more sensitive for short read classification. For the
same reason, as Metaxa relies on HMMs, it is not expected to perform well on other types
of ncRNAs that lack strong sequence similarity.
RNA-CODE
read length

sen

PPV

Metaxa
sen

PPV

44

0.999 1.000 0.786 1.000

75

1.000 1.000 0.986 1.000

Table 3.1: Performance comparison of RNA-CODE vs Metaxa. Both tools were applied
using the default parameters.

3.3.2

NCRNA classification in RNA-seq data

To demonstrate the utility of RNA-CODE on detecting reads sequenced from various ncRNA
genes, we conducted the second experiment on real RNA-seq data. RNA-CODE classifies reads into different ncRNA families. For house-keeping ncRNAs such as tRNAs and
rRNAs, which contain multiple gene members, the number of classified reads show the
overall expression levels of this type of ncRNA. For single-member ncRNA families such
as many miRNA families in some species, the number of classified reads quantifies the expression level of this gene. We chose to use RNA-seq of the model species Arabidopsis
thaliana, which has high-quality genome assembly and gene annotation available in TAIR 10
(http://www.arabidopsis.org), enabling us to determine the true membership of reads with

51

high confidence.

3.3.2.1

Data

An RNA-seq dataset obtained from NCBI SRA (accession number GSM706704) was used
in this experiment. This dataset was sampled from transcriptome of inflorescence tissues
of Arabidopsis thaliana. The sample was sequenced using Illumina platform and contains
2,327,100 short reads. After removing adaptor sequences and quality trimming, the average
length of reads is 23.5, which poses a great challenge for both homology search tools and
de novo assembly tools. Using ncRNA annotation from TAIR 10, there are hundreds of
ncRNAs annotated in Arabidopsis. In this work, we present the results on classifying reads
into ncRNAs annotated on chromosome 2 of Arabidopsis.
According to TAIR10 annotation and read mapping results, there are 15 transcribed
ncRNAs on chromosome 2 in this RNA-seq data. Out of the 15, there is one miRNA mir825a that does not have corresponding family in Rfam. In addition, miRBase shows that
there are only two sequences annotated as mir-825, which are not enough to train a model.
Thus, we excluded this miRNA from our test. After removing this family, we had in total 14
transcribed ncRNAs on chromosome 2 in this data set. These families were used to evaluate
the sensitivity of ncRNA classification. The number of mapped reads for the 14 ncRNA
families can be found in Table 3.2. In order to evaluate both the sensitivity and accuracy
of RNA-CODE, we randomly chose 32 non-transcribed but annotated ncRNA families as
negative test data. The non-transcribed families have zero mapped read and are used to
evaluate the accuracy of RNA-CODE. Ideally, an accurate ncRNA detection tool should not
classify any read in this RNA-seq data into these ncRNAs.

52

ID in Rfam

gene name

num of mapped reads

num mapped reads(unique)

RF00002

5.8S ribosomal RNA

27984

417

RF00005

tRNA

35602

1339

RF00055

Small nucleolar RNA SNORD96

29

12

RF00073

mir-156 microRNA precursor

8875

13

RF00075

Mir-166 microRNA precursor

3657

18

RF00247

Mir-160 microRNA precursor

31

4

RF00268

Small nucleolar RNA snoZ7/snoR77

7

5

RF00300

Small nucleolar RNA Z221/R21b

79

13

RF00452

mir-172 microRNA precursor

101581

15

RF00647

microRNA MIR164

2746

5

RF00689

microRNA MIR390

933

11

RF00690

microRNA MIR408

322

12

RF00893

microRNA MIR854

1367

313

RF01280

Small nucleolar RNA snoR14

9

8

Table 3.2: Number of reads that are mapped to chromosome 2 of Arabidopsis.

For each of the 46 ncRNA families (14 positive + 32 negative), we used the corresponding SCFG-based models in Rfam 10.1 as input to RNA-CODE. Read mapping results and
TAIR10 annotation are used to determine the membership of reads. Only if a read has more
than 80% of its bases overlapping with an annotated gene, we consider the read to be a
member of this gene. Reads with no overlapping bases are unlikely to be valid transcripts of
the gene of interest. Such reads are considered to be negative.

3.3.2.2

Experimental results

We evaluate the performance of RNA-CODE from four aspects. First, as the filtration stage
of using trCYK is important to the performance of RNA-CODE, we analyze the performance
of trCYK in this experiment. Second, we compare the performance of RNA-CODE with de

53

novo assembly tools. Third, we demonstrate the utility of using multiple overlap thresholds
in RNA-CODE. Finally, we present case studies for miRNA genes, which produce reads that
share no overlaps.

3.3.2.2.1

Performance of filtration In this experiment, we report the performance of

the filtration stage for different types of ncRNAs. The first stage of RNA-CODE uses trCYK
to coarsely classify reads into different families. Only reads that pass this filtration stage will
be further analyzed. Thus, the filtration stage determined the upper bound of the sensitivity
of RNA-CODE. A low score cutoff is used to keep high sensitivity. The price paid, however,
is low specificity, as displayed in Table3.3. trCYK did not show good discriminative power
on some genes because the transcripts sequenced from these genes are not well conserved
and cannot form statistically significant alignments with the underlying CM. For example,
a majority of transcripts originated from Small nucleolar RNA Z221/R21b had alignment
scores lower than the defined threshold due to poor conservation. Thus most reads cannot
pass the filtration.

54

gene name

sensitivity

PPV

5.8S ribosomal RNA

0.891

0.415

tRNA

0.882

0.5

Small nucleolar RNA SNORD96

1

0.018

mir-156 microRNA precursor

1

0.039

Mir-166 microRNA precursor

1

0.007

Mir-160 microRNA precursor

1

0.012

Small nucleolar RNA snoZ7/snoR77

1

0.003

Small nucleolar RNA Z221/R21b

0.077

0.001

mir-172 microRNA precursor

1

0.015

microRNA MIR164

1

0.002

microRNA MIR390

1

0.011

microRNA MIR408

1

0.018

microRNA MIR854

0.721

0.447

Small nucleolar RNA snoR14

1

0.007

Table 3.3: Filtration statistics.

3.3.2.2.2

Performance comparison with SSAKE Reads that are classified into each

family are used as input to de novo assembly programs. For N input families, N de novo
assembly programs can be run in parallel. A number of de novo assembly tools are available.
However, due to short read length and low coverage for many types of ncRNAs, some popular
tools such as Velvet [64] only produces a few contigs. We empirically compared several de
novo assembly tools and chose SSAKE 3.8 for two reasons. First, it produced more contigs

55

than others. Second, the source codes of SSAKE is relatively easy to modify to address
specific needs for this data. SSAKE was designed for short reads assembly and the minimum
length of an input read is 22 bases. In this data, there exist reads of less than 22 bases.
So we customized SSAKE to make it accept reads as short as 19 bases. SSAKE requires a
minimum overlap of 16 bases when extending contigs. When gene expression level is low,
the overlap between positive reads is often lower than 16 bases. To assemble reads of lowly
transcribed genes, we customized SSAKE to extend a contig when its suffix has more than
5 overlapping bases with a prefix of a read. We set Minimum Number of Reads Needed to
Call a Base During an Extension to be 1, as opposed to 2, the default value. The rationale
of using 1 is that in the transcripts of poorly expressed genes, base coverage is low. It is rare
to find duplicates of a read when extending with this read.
The chosen assembly tool is run for each family separately. Thus, family-specific assembly
parameters can be chosen. In particular, the overlap threshold of SSAKE can be adjusted
according to the number of classified reads by trCYK. Although the number of classified
reads is not an accurate indication of depth of read coverage, it is preferred to choose a small
overlap threshold if the number of classified reads is small. In addition, it has been shown
that using multiple kmers can improve RNA-seq assembly [65]. As described in our method,
RNA-CODE also uses multiple overlap thresholds. Thus, for SSAKE, we allow overlap from
6 to 16 if the number of classified reads by trCYK is less than 10,000. Otherwise, we use 10
to 20. Users can adjust these parameters according to any known knowledge.
The performance of RNA-CODE on the 14 transcribed families is listed in Table3.4.
For the 32 control families, which were not considered to be transcribed, RNA-CODE did
not find any reads, yielding 0% false positive rate. This indicates that RNA-CODE can
successfully distinguish transcribed families from un-transcribed ones. Table 3.4 also includes
56

the performance comparison with SSAKE. More specifically, all reads are used as input to
SSAKE. Then, all contigs produced by SSAKE are searched against input ncRNA families
using trCYK. The comparison shows that RNA-CODE yielded better performance on all
genes except for 5.8S ribosomal RNA. Without the first stage, a large number of negative
reads may be assembled together with positive reads and form chimeric contigs. In chimeric
contigs, positive reads could be interleaved by negative reads, making the alignment score
between itself and the underlying CM low.
gene name

RNA-CODE
sen
PPV
5.8S ribosomal RNA
0.778 0.95
tRNA
0.437 0.984
Small nucleolar RNA SNORD96
1
1
mir-156 microRNA precursor
0.929
1
Mir-166 microRNA precursor
1
0.947
Mir-160 microRNA precursor
0.75 0.214
Small nucleolar RNA snoZ7/snoR77
1
1
Small nucleolar RNA Z221/R21b
0
N/A
mir-172 microRNA precursor
1
1
microRNA MIR164
1
1
microRNA MIR390
0.909
1
microRNA MIR408
1
0.75
microRNA MIR854
0.599 0.978
Small nucleolar RNA snoR14
0.25
1

SSAKE
sen
PPV
0.99 0.884
0.274 0.996
0.857
1
0.786
1
0.944
1
0
N/A
0
N/A
0
N/A
0.733
1
1
1
0.909
1
1
1
0.468 0.698
0
N/A

single-k
sen
PPV
0.999 0.978
0.044
1
0.903
1
0
N/A
0.996
1
0
N/A
0
N/A
0
N/A
1
1
1
1
0.996
1
1
1
0.415
1
0
N/A

Table 3.4: Performance of RNA-CODE (multiple-k), SSAKE, and RNA-CODE (single-k)
on transcribed ncRNA families.

The reason why RNA-CODE was out-performed on 5.8S ribosomal RNA is that many
short reads originated from 5.8S ribosomal RNA did not pass the filtration due to poor
conservation. As a result, the overall sensitivity is lower than de novo assembly tools.

3.3.2.2.3

Using multiple overlap thresholds improves performance of RNA-

CODE Multiple-k approach is another important component in the pipeline. To demon-

57

strate the effectiveness of this approach, we compared RNA-CODE using multiple-k and
single-k approach. Sensitivity of RNA-CODE using multiple-k is generally better than using single-k, except for 5.8S ribosomal RNA and MIR-390. We used the default k value
(i.e. 16) in single-k approach. Many contigs assembled from short reads originated from
lowly expressed genes were not recovered in the single-k approach; since overlap between
two positive reads may be less than 16 bases. However, using multiple overlap thresholds
may also introduce chimeric contigs, which explains the worse performance of RNA-CODE
using multiple k than single k on 5.8S ribosomal RNA and MIR-390.

3.3.2.2.4

Performance of microRNA families For eight transcribed miRNA genes,

RNA-CODE performs better in six of them and has the same sensitivity and PPV for the
other two. This is expected because miRNA reads usually cannot be assembled. In this
data set, because of the extreme short reads, some of them are assembled into the mature
miRNA and thus are able to be output by SSAKE. For example, reads of length between 19
and 22 are assembled into the mature miRNA of length 25 for mir-156. For most RNA-seq
data that have longer reads after quality trimming, de novo assembly tools will not be able
to assemble them into contigs. Thus, using both homology search and de novo assembly is
important to generate a comprehensive catalog of ncRNAs.

3.4

Conclusion

We presented an ncRNA classification tool that can determine the membership of reads that
are sequenced from ncRNA genes. By combining homology-based ncRNA search method
and family-specific de novo assembly, we can classify reads into different types of ncRNAs,
including those that cannot be assembled because of cleavage and degradation. This tool can
58

be applied to NGS data that do not have quality reference genomes, such as metagenomic
data and RNA-seq data of non-model organisms.
RNA-CODE relies on trCYK as the ncRNA homology search tool for very short reads.
When reads are longer, more efficient ncRNA homology search tool such as Infernal [17] can
replace trCYK. For very short reads, trCYK is still the best choice in order to yield high
sensitivity.

59

Chapter 4
Reconstructing 16S rRNA genes in
metagenomic data

4.1

Introduction

Microbes are ubiquitous species existing in all environments on earth, including extreme
conditions [71]. From the desert to acid waste water, from pine forest soil to mine drainage,
they sustain themselves using various mechanisms [72]. Human bodies are also habitants
of microbes. It is estimated that there are 104 bacterial cell inhabiting on our body, which
is 10 times more than our own cells [73] [74]. Human life as well as the entire ecosystem
are profoundly affected by microbes. There is a strong need to understand the function of
microbial community and how they interact with the environments where they inhabit. It
has been discovered that function of microbial community is defined by its composition and
diversity [75]. In particular, metagenomic data, which contains sequenced DNA reads of a
large number of uncultured microbial species from a wide range of environmental samples,
provide a unique opportunity to thoroughly analyze microbial species that have never been
identified before.
A commonly adopted approach for analyzing the microbial species in environmental samples is to conduct comparative analysis of ribosomal RNA sequences [76]. The use of rRNA
for microbial phylogenetic analysis had become such a relied-upon methodology that by
60

2008, 77% of all INSDC [77, 78, 79] bacterial DNA sequence submissions described an rRNA
gene sequence [80]! 16S rRNA reads from metagenomic studies provide a source of sequences
that is not subject to PCR primer bias and therefore covers taxa that might be missed by
existing popular primer sets [81]. A fundamental step in using rRNA to analyze the microbial species is to recover the rRNA genes from the large amount of reads in metagenomic
data sets, which is the goal of this work. The rRNA genes are a patchwork of hypervariable
(rapidly evolving) and universally conserved regions, complicating phylogenetic analysis of
rRNA genes in metagenomes both by the lack of usable phylogenetic signal in many unassembled reads and by the difficulty in separately assembling genes for the individual populations
in a matagenome. In addition, the massive data volume, short read length, skewed species
abundance, and high similarity of 16S rRNAs genes all make rRNA recovery in metagenomic
data very difficult. Currently, there is a rapid accumulation of metagenomic data from a wide
range of environmental samples. For example, there are roughly 24,367 Gbases of data from
Human Microbiome Project (HMP) alone. A tool that can efficiently and accurately recover
rRNAs from the large amount of data is in great need for analyzing microbial composition
in the underlying samples.
Existing pipelines for annotating rRNAs in metagenomic data can be divided into two
groups. The first type of pipelines relies on existing de novo assembly tools to output
assembled contigs, which are then used as input to available genome-wide rRNA search tools.
There are various bulk metagenomic assembly programs [82, 83, 84, 85, 86, 87, 88], which
intend to recover individual genomes in an environmental sample. Metagenomic assembly is
computationally difficult [83]. A recent review [89] summarized the disadvantages of existing
de novo assembly. First, the quality of read assembly deteriorates significantly in complicated
NGS data sets. In particular, previous work shows that metagenomic sequencing of high61

complexity microbial communities results in little or no assembly of reads [90, 89]. Second,
successful de novo assembly requires high sequencing depth, which is difficult to achieve for
all rRNAs in one sample. In addition, our goal is to detect rRNAs in metagenomic data
while much of bulk de novo assemblies consist of other genomic regions. Thus, the commonly
used pipeline of combining bulk metagenomic assembly and genome-wide rRNA detection
tools is convenient but not optimized for rRNA detection in metagenomic data. The second
type of pipelines avoids bulk metagenomic assembly and incorporates properties of rRNAs
[29, 56]. The most promising one is perhaps EMIRGE [29], which uses an expectation
maximization approach along with a set of reference gene sequences to assemble rRNA
genes from metagenomic data. The method has been used to assemble complete genes from
organisms at apparent phylum-level diversity to known cultured organisms [91]. However,
EMIRGE requires a large number of known rRNA genes for read mapping and does not
separate rRNA genes from closely related organisms.
Therefore, lacking reference genomes, recovering full-length rRNAs from a large number
of short and error-prone reads is still an outstanding challenge. In this work, we propose and
implement a targeted rRNA assembly program REAGO 1 that is optimized for rRNA gene
recovery in metagenomic data. It has three advantages comparing with existing methods.
First, it significantly reduces the problem size by first discarding reads that are not likely
sequenced from rRNA genes. Secondary structure-aware homology search is adopted in this
step to recognize rRNA reads, which will be kept for downstream analysis. Second, paired
end information is carefully applied to guide gene assembly and thus distinguish rRNAs
from different species. Third, incorporating profile-based homology search, scaffolding step
is used to connect contigs with low coverage and thus is able to recover rRNAs in species of
1 Scoure code is available at https://github.com/chengyuan/reago

62

low abundance. We applied our tool to both synthetic and real metagenomic data sets and
benchmarked its performance with several other programs. The experimental results show
that our tool competes favorably in recovery rRNA genes in metagenomic data sets.

Identify 16S
reads

Create overlap
graphs

Reconstruct 16S
rRNA genes

Scaffold 16S
rRNA fragments

Figure 4.1: Pipeline of the 16S rRNA gene assembly. Short black and gray bars represent
reads originated from different 16S rRNA genes. Short white bars represent reads from
non-16S regions. Long bars represent contigs assembled from short reads.

4.2
4.2.1

Method
Overview of REAGO

Figure 4.1 is a schematic representation of the pipeline, which contains four stages. First we
identify 16S rRNA reads from the original metagenomic dataset using secondary-structure
aware homology search. The majority of non-16S reads are discarded in this process, significantly reducing the problem size. Second, REAGO constructs overlap graphs. Various
graphs reduction techniques are applied to remove possible sesquencing errors and prepare
the graph for efficient assembly. The third stage assembles reads into contigs by our path
finding algorithm. The path finding procedure is guided using paired-end information and
63

aims to avoid generating chimeric 16S rRNA genes by maximizing Weighted Paired-End
Match Score (WPEMS) (see Section4.2.5). Finally, paired-end information again is used to
scaffold incomplete 16S fragments, if any, into longer contigs or full-length genes.

4.2.2

16S rRNA reads identification

In our method, we first conduct homology search to identify reads originated from 16S
rRNA genes. To utilize both the sequence and structural conservation of 16S rRNA genes,
we align metagenomic reads to a Stochastic Context-Free Grammar (SCFG) based model
[92], which is trained from a selected set of 16S rRNA sequences. The model describes
not only primary sequence conservation of a gene family, but also its secondary structure
conservation. The-state-of-the-art implementation of SCFG is covariance model (CM) in
Infernal [17]. For a gene family, Infernal first builds a CM from a set of training genes from
this family and classifies queries by aligning them to the model. A CM is implemented in
a tree-like structure in which each node models a single base or a base pair. Inside-outside
algorithm [92] is then used to optimally align a query to the CM by maximizing alignment
scores. Higher alignment scores are likely to be assigned to reads originated from the family
while reads from other genes tend to receive low scores. Fragmentary sequences pose great
challenges to the alignment algorithm since structural information in short reads is likely
to be partially missing. As a result, such short reads tend to received marginal alignment
scores and are not identified. Infernal handles this problem by recovering possibly missing
bases while performing the inside-outside algorithms. Thus, short 16S rRNA reads can still
receive significant alignment scores.
BLAST [15] is also available for 16S read identification. By aligning reads with the
reference 16S rRNA database, rRNA reads may be recognized using BLAST alignment scores
64

or E-values. We choose to use SCFG-based homology search over BLAST for two reasons.
First, BLAST conducts homology search based on sequence similarity and will miss reads
lacking primary sequence conservation. Commonly used as the phylogenetic marker gene
[93], 16S ribosomal RNAs share high sequence similarity on many regions across different
species. However, there also exist a number of variable regions where secondary structures
are better conserved than primary sequences. Secondary structural information could then
be very helpful to provide additional evidence for 16S sequence identification. A number of
studies have demonstrated the advantages of incorporating secondary structural information
in various types of non-coding RNA homology search [19]. Experiments were conducted
to compare BLAST and CM-based approaches [16] on identifying short rRNA reads. The
results indicate that BLAST tends to miss short rRNA reads that are sequenced from nonconserved regions. CM-based approaches generally improve the identification accuracy by
including secondary structure information. The second reason behind choosing SCFG-based
homology is that the single SCFG-based model provides a convenient reference for deciding
the relative positions among contigs during the scaffolding stage. We use the alignment
positions of reads inside a contig to determine the contig’s rough position along an SCFG
model.

65

Node
collapsing

Error
correction

Topologybased
graph reduction

Bad edge
removal

Figure 4.2: Graph reduction is conducted iteratively until there is no change on the graph.

4.2.3

Overlap Graph Creation and Graph Pruning

All reads that are possibly sequenced from rRNA genes are used to construct an overlap graph
for assembly. An overlap between two reads is formed if the suffix of a read matches the prefix
of another read. A straightforward overlap detection method performs pairwise comparison
among all reads, requiring O(n2 ) comparisons. Efficient implementations of overlap graphs
based on data structures such as hash table and BTW [94, 95] exist to handle NGS data.
We choose Readjoiner [94], which provides a set of time and space efficient algorithms for
detecting all prefix-suffix matches among a set of reads. In created overlap graphs, each
vertex represents a read and each edge represents a prefix-suffix match of size at least l,
a pre-determined overlap threshold. l has high impact on complexity and connectivity of
graphs. Small l tends to increase the connectivity, but also complexity, of the overlap graph.
Larger l is likely to produce less tangled graph, but can possibly miss connections between
reads from lowly sequenced regions. Note that transitive edges are automatically removed

66

in the output by Readjoiner.
The original graphs generated from the output of Readjoiner could be very complex because of the large data size, sequencing errors, and highly similar regions shared by different
genes. We apply an iterative graph pruning procedure, as depicted in Figure 4.2, to gradually simplify the graph at each iteration. The procedure terminates when the graph stop
changing. Below we detail each stage.

4.2.3.0.5

Node collapsing The original overlap graph tends to have chains of linearly

connected vertices. In such chains, each vertex has only a single in-coming edge and out-going
edge. Such vertices can be merged without loss of reachability.

4.2.3.0.6

Alignment-based error correction Sequencing errors and highly similar

regions shared by different genes can contribute to a large number bifurcations, greatly
complicating the graph. Error correction in metagenomic data is an unsolved problem.
Rare reads may come from lowly abundant genes rather than contain sequencing errors.
Nevertheless, we still follow the error correction rationale commonly used in de novo genome
assembly and correct bases in rare reads. As shown in Figure 4.3, we applied a heuristic but
efficient alignment-based error correction to two types of bifurcations. Reads or contigs from
sibling nodes V1 , V2 , · · · , Vn , which share the same predecessor or successor, are aligned.
Specifically, based on the known overlaps with the contig in the common predecessor or
successor, the contigs in the sibling nodes will be aligned first. Then the reads inside the
contigs can be aligned using their positions inside the contig.
For each column in the read alignment, a base is corrected if it is overwhelmingly outvoted by other bases that are aligned to it. An assumption made here is, if a base is sequenced
multiple times, it is correctly sequenced majority of the time. A base a is corrected into a ,
67

if and only if the number of a is at least τ times more than the number of a. It is worth
noting that this strategy could potentially incorrectly mutate bases from lowly abundant
genes, since the number of those bases could be out-voted by bases from the related and
abundant genes. In order to make our tool more practically useful, we sacrifice some accuracy
for assembly efficiency. The value of τ is customizable so the user can change the strictness
of error correction according to different datasets by adjusting the value of τ . A larger τ
tends to yield more conserved error correction. A smaller τ tends to correct all erroneous
bases, but may also falsely modify correct bases into incorrect ones, especially in regions
with low coverage.
An example is depicted in Figure 4.4. Sequences represented in Vertices V2 and V3 are
very similar and only differ in one base. The bifurcation may be caused by sequencing error
or simply represent similar regions from highly related species. Reads in both vertices are
aligned. In the highlighted column, the number of base T in V3 is far less than that of the
base A in V2 , so we mutate the T into A then merge V2 and V3 in to a single vertex.
.....

V0

.....

V1
.
.
.
.
.

.....

Vs

Vp

V0

.....

V1

.....

.
.
.
.
.

Vn

Vn

(A)

.....

(B)

Figure 4.3: Two types of bifurcation where error correction is applied.

68

CACGTCACAGTTAGAG

V2
CAGTCACGTCACAGT

V1

V3
CACGTCTCAGTT
(A)

CACGTCACAGT
ACGTCACAGTT
CGTCACAGTTAG
GTCACAGTTAGA
CACAGTTAGAG

CACGTCACAGTTAGAG

V2
CAGTCACGTCACAGT

V1

CACGTCTCAGT
ACGTCTCAGTT

V3

CACGTCACAGT
ACGTCACAGTT
CGTCACAGTTAG
GTCACAGTTAGA
CACAGTTAGAG

CACGTCACAGT
ACGTCACAGTT

CACGTCACAGT
CACGTCACAGT
ACGTCACAGTT
V23
ACGTCACAGTT
CGTCACAGTTAG
GTCACAGTTAGA
CACGTCACAGTTAGAG
CACAGTTAGAG

CAGTCACGTCACAGT

V1

CACGTCTCAGTT
(B)

(C)

Figure 4.4: An example of error correction (applied on V2 and V3 ). The sequence represented
by each node is given beside the node. (A) Ungapped alignment of reads from bifurcating
vertices. (B) Mutate rare bases. (C) Remove bifurcation.

4.2.3.0.7

Topology-based graph reduction Alignment-based error correction is only

applied to nodes sharing the same predecessor or successor. Following existing assembly
methods, we continue to conduct topology-based graph reduction, as depicted in Figure 4.5.
We examine all tips and bubbles and remove them if certain criteria are satisfied. Tips are
vertices with one end disconnected and must have at least one sibling. Bubbles are formed
by two vertices sharing the same predecessor and successor. Tips and bubbles are formed
primarily due to sequencing errors and highly similar region shared among closely related
species. A tip can be simply dropped from the graphs if two criteria are satisfied. First, the
sequence represented by the tip are shorter than a user-defined value. Second, the number
of reads collapsed in the tip should be less than a user-defined value. A bubble is resolved by
removing the vertex containing less reads if the contigs represented by both vertices share a
sequence similarity greater than a threshold, which is set to 98% by default. Local gapped
alignment is generated between contigs represented by vertices forming the bubbles. Similar
to the bifurcation removal procedure, the tip and bubble removal could potentially remove
contigs from low abundant genes. So we also allow the threshold to be adjusted.

69

r1

r1

r1

Tip
removal

r2

r3

r5

r4

Bubble
removal

r2

r3
r4

Node
collapsing

r2r3

r1r2r3r4

r4

Figure 4.5: Topology-based graph reduction.

4.2.4

Bad edge removal

Due to the nature of metagenomic dataset, there may exist multiple species sharing very
high sequence similarity. Thus reads originated from different species have high chance to
form edges in the graph. In another word, having prefix-suffix match does not necessarily
guarantee correct connection. Wrong edges not only increase the complexity of graph but
also lead to chimera.
Thus after graph reduction, we applied a Nave Bayes Classifier, similarly to the RDP
classifier [96], on each vertex and approximately annotate it with one or more genera. If
the vertices on either of an edge do not share any common annotated genus, we remove this
edge from the graph. Annotation of each vertex can be obtained by calculating a posterior
probability P (Gi |C), where C is the contig the vertex represents and Gi is the potential
genus that C is originated from. The probability indicates the likelihood that the C is
originated from Gi . If the probability is higher than a threshold, we annotate C with Gi .
To calculate P (Gi |C), we first decompose the contig C into a set of k-mers k1 , k2 , ..., kn .
Same as the RDP classifier, the default value of k is 8. Then the likelihood P (C|Gi ) can be
extended as P (k1 , k2 , ..., kn |Gi ). Assuming the independence of ki ’s, we can further simplify
the likelihood to P (k1 |Gi )P (k2 |Gi ) · · · P (kn |Gi ), in which each term can be pre-calculated

70

based on the RDP database [96]. The prior probability P (Gi ) can also be calculated as the
proportion of sequences in Gi to the total number of sequences in the RDP database. The
posterior probability P (C|G¯i ), which indicates the likelihood that C is not originated from
Gi , can also be calculated in the same way.
The detailed calculation of posterior probability is list as follows.

P (Gi |C) = P (Gi |k1 , k2 , ..., kn )
=

P (C|Gi )P (Gi )
P (C|Gi )P (Gi ) + P (C|G¯i )P (G¯i )

=

P (k1 , k2 , ..., kn |Gi )P (Gi )
P (k1 , k2 , ..., kn |Gi )P (Gi ) + P (k1 , k2 , ..., kn |G¯i )P (G¯i )

=

P (k1 |Gi )P (k2 |Gi ) · · · P (kn |Gi )P (Gi )
P (k1 |Gi )P (k2 |Gi ) · · · P (kn |Gi )P (Gi ) + P (k1 |G¯i )P (k2 |G¯i ) · · · P (kn |G¯i )P (G¯i )

where
Gi =

number of sequences in genus Gi
total number of sequences in RDP database

number of sequences in genus other than Gi
G¯i =
total number of sequences in RDP database
and
P (kj |Gi ) =
P (kj |Gi ) =

number of sequences in genus Gi containing kj
total number of sequence in genus Gi

number of sequences in genus other than Gi containing kj
total number of sequence in genus Gi

For a vertex, a genus Gi is included into its annotation if P (Gi |C) is greater than a
threshold. As most vertices in the graphs represent only short and partial 16S genes, the
classifier may not have enough evidence to uniquely and accurately annotate them. As a
result, each vertex is generally associated with multiple genera. Yet, base on our observation,
the annotation always include the true positive genus. The edge removal algorithm works
71

better on longer sequences. So we apply another round of graph collapsing, tip and bubble
removal, which potentially could generate vertices representing longer sequences.

4.2.5

Guided path finding using paired-end information

We then recover 16S rRNA sequences by finding paths that represent a full or partial rRNA
gene. Path finding starts at a vertex with no in-coming edges and terminates at a vertex
with no out-going edges. Paired-end information of reads are widely used in many assembly
tools for guiding the creation of contigs or scaffolds. The rationale is that two ends of a
read pair should to be assembled in the same contig or scaffold. To utilize the paired-end
information, a common approach adopted by SOAPdenovo [97], ABySS [98] and ALLPATHS
[99] creates graphs from a set of contigs where each vertex represents a contig and an edge is
formed between two vertices if more than a certain number of read pairs exist between their
reads. Then the graphs are searched, using various constraints and heuristics, to extend
contigs into longer scaffolds. Velvet [64], on the other hand, assumes a small variance of
insert size distribution and aims to create “long nodes” that are longer than all inserts. The
objective function intends to maximize the number of “long nodes” while minimizing the
number of read pairs spanning over such nodes. As an extension of Velvet, MetaVelvet [100]
uses paired-end information to guide the creation of contigs by checking their consistency.
Number of paired-end reads connecting the origin node and the extension node is used to
resolve chimeric node candidates.
In our algorithm, we use paired-end information to guide the path finding. At each vertex
with multiple successors, we decide which one to visit based on a new metric, weighted pairedend match score (WPEMS), which gives higher weights to paired end reads located in distant
nodes than those located in nearby nodes. Suppose (r, r ) is a mate pair, which are located
72

in vertices V and V (V = V ), respectively. The WPEMS of this read pair is thus 2d(r,r ) ,
where d(r, r ) is the number of vertices between V and V . If V = V , we define its WPEMS
as 0. The WPEMS of a path P is the sum of WPEMS of all paired end reads in P.

2d(r,r )

W P EM S(P ) =
(r,r ) in dif f erent vertices

When deciding among a set of vertices to visit next, we select the vertex that maximizes
the WPEMS of the current path in a greedy way. The rationale behind the design of
WPEMS is that only read pairs existing in non-adjacent vertices provide extra evidence for
path finding. Read pairs included in the same vertex or spanning adjacent vertices do not
provide additional path finding information beyond the existing overlap graph. In practice,
the graph pruning procedure leads to many vertices representing relatively long contigs.
Thus, for average fragment sizes such as hundreds of bases, the paired-end reads usually
exist within the same vertex or vertices connected by an edge already. Figure 4.6 describes
a typical example of read pair distribution in a graph. There are usually many paired-end
reads spanning adjacent vertices such as from n6 to n8 . Much less paired-end reads span
non-adjacent vertices, such as those from n1 to n7 and from n4 to n7 . Paired end reads
located in two adjacent nodes will have WPEMS of 1, because there is no intermediate
nodes between the adjacent nodes.
We use the example in Figure 4.6 to explain our path finding procedure. Nodes in the
graph are formed by reads from two different 16S rRNA genes A and B, colored in blue
and yellow, respectively. Nodes n3 and n6 are represented using shaded color since they
are formed by reads from common regions of A and B. Genes A and B are represented by
n1 → n3 → n4 → n6 → n7 and n2 → n3 → n5 → n6 → n8 . Suppose the path finding

73

process has successfully identified the path n1 → n3 → n4 → n6 , which is highlighted in
yellow, and needs to choose between n7 and n8 as the next node to include. Although n8
and n6 share a large number of paired-end reads, WPEMS will choose n7 , the correct node,
because of the paired-end reads shared between non-adjacent nodes.
In order to produce all 16S ribosomal RNA sequences, we apply the path finding algorithm
on each vertex with no in-coming edges. If the length of a contig represented by a path is
greater than a user-defined threshold, we consider the contig as a full-length gene. Otherwise,
we include it in the input to the next scaffolding stage. As this approach is a greedy
algorithm, the time complexity is very low. The entire path finding procedure take O(mn)
time to complete, where m is the average out-degree of each vertex and n is the average
distance between nodes with no in-coming edges and nodes with no out-going edges.

n1

n2

8

n3
12

15

n4

n5

15

n6
30

70

n7

n8

Figure 4.6: Path finding using paired-end information. Solid lines represent overlaps between
nodes and dashed lines represent the existence of paired-end reads. The numbers beside
dashed lines are the numbers of paired end reads between the corresponding nodes.

4.2.6

Scaffolding 16S rRNA segments

In the path finding stage, contigs longer than a user-defined parameter L, are output directly. Contigs shorter than L are selected for further processing. Short contigs are usually
74

created due to regionally low coverage of 16S genes in metagenomic data. Reads from lowly
sequenced regions failed to create connection to other region due to small overlap. As a
result, 16S segments originated from the same gene may be broken apart. To produce fulllength 16S sequences, we can utilize the WPEMS, with a minor modification, to scaffold
shorter segments. Orientation of segments can be inferred from the reads they contain. As
we have the alignment position of each read on the CM used the filtration stage, we can then
infer the alignment position of segments on CM. This information allows us to determine
orientation of segments as well as their relative position on CM. For two segments SA and
SB with SA being in the upstream of SB (see Figure 4.7), the WPEMS between SA and SB
is defined as

2d(r,r )

W P EM S(SA , SB ) =
(r,r )

where (r, r ) is a mate pair. r is in SA and r is in SB ,

and d(r, r ) is total number of vertices after r and before r , in SA and SB , respectively

24

22

21

Segment A

Segment B

Scaffold

Figure 4.7: Calculate the score between two segments. Arcs represent paired-end match
between vertices and thickness of arcs indicate weight of the paired-end match. Actual
weights also labeled beside each arc.

We only calculate W P EM S(SA , SB ) if SA is on the upstream of SB and their CM
75

alignment position overlap. If SA and SB are from the same gene, W P EM S(SA , SB ) tends
to be the highest among all W P EM S(SA , Si ) and W P EM S(Sj , SB ). So we calculate the
pair-wise WPEMS in all segments. We then connect SA and SB , if

W P EM S(SA , SB ) > W P EM S(SA , Si ) for all Si on the downstream of SA , and
W P EM S(SA , SB ) > W P EM S(Sj , SB ) for all Sj on the upstream of SB

All connected segments with total length longer than L will be output.

4.3

Experimental results

To evaluate the performance of our algorithm, we applied REAGO to three sets of metagenomic data including a simulated data set, a mock community data set, and a human stool
metagenomic data set. As the species and their genomes are largely known in the first two
data sets, we are able to evaluate the accuracy of rRNA assembly. The third experiment
enables us to test the utility of REAGO on a real metagenomic data set. We benchmarked
our tool with EMIRGE, a 16S rRNA identification tool, and two bulk metagenomic assembly tools, IDBA-ud [101] and Meta-Velvet [100]. We also compared the performance of the
three tools on two sets of inputs, the entire datasets and only true positive reads. For all
the tools, we evaluated their performance to reconstruct 16S rRNA sequences at sequencelevel and genus-level. We also recorded and compared their running time on the same high
performance computing node that has 64 bits CPU with Linux operating system.

76

Species
abundance
Bacteroides thetaiotaomicron VPI-5482 (BTV)
24.17%
Bacteroides vulgatus (BVG)
4.13%
Chlorobium phaeobacteroides DSM 266 (CPB)
10.02%
Chlorobium phaeovibrioides DSM 265 (CPV)
6.29%
Chlorobium tepidum TLS (CTT)
12.38%
Salinispora tropica CNB-440 (STC)
1.96%
Sulfurihydrogenibium sp YO3AOP1 (SSY)
4.72%
Bordetella bronchiseptica RB50 (BBR)
7.86%
Burkholderia xenovorans LB400 (BXL)
10.02%
Leptothrix cholodnii SP-6 (LCS)
4.72%
Nitrosomonas europaea ATCC 19718 (NEA)
13.75%
Table 4.1: Species abundance.

4.3.1

Experiment 1: simulated metagenomic dataset

To evaluate the performance of REAGO , we first applied it to a simulated metagenomic
data set containing reads from 11 species of 8 genera. We used WGSIM [102] to generate
9.6e7 paired-end, 110bp long error-containing reads. The sequencing error rate was set to
2% by default and the insert size was set to 100 with a standard deviation of 10 bases. The
11 selected species have very skewed relative abundance, as shown in Table4.1. The most
abundant species is 11 times more than the least abundant species.
To challenge REAGO , we selected some closely related species in the same genus with
highly similar 16S rRNA genes. We listed their pair-wise sequence similarity in Table4.2. For
example, three selected species in Chlorobium share sequence identity above 93%. Chlorobium phaeobacteroides and Chlorobium phaeovibrioides even share similarity as high as 96%.
Reads simulated from these species tend to produce extremely tangled overlap graphs that
pose challenges for assembly. Traversing such graphs will result in a large number of paths,
and most of which are chimeric assemblies.

77

BBR BTV
BBR
BTV
71
BVG
71
91
BXL
91
72
CPB
76
75
CPV
75
75
CTT
75
75
LCS
90
73
NEA
88
73
SSY
73
72
STC
76
72

BVG BXL CPB CPV CTT LCS NEA
71
75
75
75
74
96
74
74
93
94
72
90
76
77
76
72
89
75
75
74
86
72
73
73
73
73
74
74
71
76
77
77
76
77
76

SSY STC
76
-

Table 4.2: Pairwise sequence similarity. Bold numbers indicate sequence similarity above 90%.

4.3.1.0.8

Performance of rRNA reads classification using cmsearch Following

the pipeline in Figure4.1, cmsearch was applied to the simulated dataset to identify reads
originated from rRNA genes. When building the covariance model (CM) for cmsearch, we
only used rRNA genes that are not in the simulated data set. Specifically, we downloaded
2,591 bacterial 16S rRNA genes from the RDP website [96]. Then we removed the rRNA
genes in the 11 species used for simulation and also the ones in the same genera as those 11
species. As a result, we have 2,219 genes in the training set for building the CM in cmsearch.
The performance of cmsearch is quantified using 2 metrics: sensitivity and positive predictive
value (PPV). The set of reads sequenced from 16S rRNA genes are true positive (i.e. TP)
while the set of reads extracted from other regions are true negative. Let P be the set of
reads predicted as positive by cmsearch. Sensitivity is thus defined as P T∩TP P and PPV is
P
defined as P ∩T
P . It is worth noting that we only keep reads that can be globally aligned

to the covariance model. For reads that are partially sequenced from the rRNA genes and
thus produce partial or local alignments, we won’t keep them for downstream analysis.
Correspondingly, during the performance evaluation, we only use reads that are completely

78

sequenced from the rRNA genes and non-rRNA regions. The ones sequenced from boundaries
of rRNA genes will not be used for computation. For this simulated data set, the sensitivity
and PPV of cmsearch in recognizing rRNA reads are both 0.990. Reads originated from 16S
rRNA genes were precisely separated from those from other regions of genomes with only a
small amount of incorrectly classified reads. The output of cmsearch contains 82,638 reads,
which are used as input for overlap graph construction. Compared to the original size of the
data set (9.6e7 reads), the problem size is significantly reduced.

4.3.1.0.9

Overlap graph construction The reads classified as rRNA reads by cm-

search were used as input to Readjoiner for efficient overlap graph construction. By default,
the overlap threshold was set to 70% of the read length, which is 77 in the simulated data set.
Larger overlap may be used when sequencing depth is high for each species, while smaller
overlap should be used if some lowly abundant species are present. Smaller overlap, however,
tend to yield more tangled graphs. We have tried a range of overlaps from 66 to 99. The
results are almost identical on the simulated dataset. The error rate of WGSIM was set to
2%, which means that for every 50 bases generated, 1 error occurs. As a result, the error
correction threshold τ was set to 50, indicating that a base will be corrected if there are at
least 50 bases of a different kind are aligned to it.
We evaluated the efficiency of our error correction and graph reduction algorithms using
the change of graph complexity, which is quantified by the total number of paths in the graph.
The increase of the number of different paths implies the increase of the graph complexity.
We only record “complete” paths that start at nodes with no in-coming edge and end at nodes
with no out-going edges. The original graph contains 16,994,765 paths. After applying our
graph reduction procedures, the graph is significantly simplified, containing only 961 paths.

79

As shown by the assembly results presented in the following section, a majority of the genes
have been kept in the reduced graph.

4.3.1.0.10

Assembly performance evaluation Finally we evaluated the performance

of our assembly algorithms and compare it with EMIRGE, IDBA-ud, and Meta-Velvet.
Metrics used are the number of genes recovered at sequence level, the number of genus
recovered, the number of falsely recovered genes, and the running time. For all tools, all
contigs longer than 1350 nt are considered to be final output. An assembly is considered to
be correct at sequence level if and only if it can be aligned to the true gene with at least
98% of identity.
A genus is correctly recovered as long as one of its genes is recovered. For EMIRGE we
did two experiments using two different rRNA gene databases. The first one is its original
SSU candidate database excluding just the 11 training genes. The second one uses the
original database excluding all genes in the 8 selected genera. In this experiment, there are
83,510 reads fully sequenced from rRNAs. All parameters of IDBA-ud were set as default.
For Meta-Velvet, we conducted the multiple experiments with a wide range of kmer length
from 20 to 65 and pooled the output together as its final output. As REAGO , IDBA-ud,
and Meta-Velvet also output partial genes, we only kept full-length genes.
As displayed in Table4.4, our assembly algorithms demonstrate better performance in
reconstructing 16S rRNAs. EMIRGE achieved the same sensitivity as our tool with the first
database. But it took much longer running time. For the second database, which excluded
all genes in the eight chosen genus as the training set for our tool, it recovered much less
genes. Thus, if a genus in a metagenomic dataset does not exist in the SSU candidate
database at all, it is possible that the genes from this genus cannot be recovered. IDBA-

80

tool
REAGO
EMIRGE, 16S DB
excluding 11 genes
EMIRGE, 16S DB
excluding 8 genera
IDBA-ud
Meta-Velvet

# output genes
12
16

# genes recovered
11/11
10/11

# genus recovered
8/8
7/8

# incorrect assemblies
1
6

running time
4:53:19
96:12:1

24

5/11

4/8

19

96:14:29

416
1258

3/11
1/11

3/8
1/8

413
1257

16:36:28
5:35:30

Table 4.3: Performance of 16S rRNA gene recovery.
tool
EMIRGE, 16S DB
excluding 11 genes
EMIRGE, 16S DB
excluding 8 genera
IDBA-ud
Meta-Velvet

# output genes
17

# genes recovered
11/11

# genus recovered
8/8

# incorrect assemblies
5

running time
0:13:01

21

7/11

6/8

14

0:14:10

3
1

3/11
1/11

3/8
1/8

3
1

0:5:26
0:0:35

Table 4.4: Performance of 16S rRNA gene recovery on true positive simulated data.

ud was correct only on three genes and Meta-Velvet only identified one. The results show
that bulk or general de novo assembly tools are not optimized for recovering 16S rRNAs
and thus are not recommended for this task. We also conducted the same experiments on
EMIRGE, IDBA-ud and Meta-Velvet using only true positive reads. As displayed in Table
4.4, performance of the three tool increased.

81

Figure 4.8: Phylogenetic tree of the 64 species in the synthetic metagenomic data.

4.3.2

Experiment 2: synthetic metagenomic data

To further assess the performance of our assembly algorithms, we applied REAGO to a
metagenomic dataset (SRR606249) sequenced from mixture of archaeal and bacterial synthetic communities [103], which contain 16 archaeal species and 48 bacterial species, covering
50 different genera. The metagenomic dataset was sequenced using Illumina HiSeq-2000 and
contains about 11.1 Gbp in total. Reads in the dataset are all 101-base long and were sequenced in pairs. The annotations of 16S rRNA genes in these species were downloaded
from NCBI. High sequence similarity exist among genes in the same genus and genes across
different genera, as shown in Figure4.8. For instance, species from Leptothrix, Bordetella,
82

CM
remove only exact copies of genes
remove all genes in the 50 genera

# output reads
66,755
66,347

Sensitivity PPV running time
0.982
0.962
6:12:13
0.976
0.958
6:35:18

Table 4.5: The performance of cmsearch on the synthetic community data.

Burkholdera, and Nitrosomonas share sequence similarity above 97%. Three species under
Thermotoga share sequences similarity above 99%. Genes with high sequence similarity lead
to very complicated overlap graphs, yielding a large number of paths. Additionally, abundance of species are skewed in this dataset. The most abundant species is over 20 time more
than the least abundant species. Skewed abundance level, high sequence similarity, and a
large number of error-containing reads pose great challenges for rRNA recovery.
In rRNA read classification, we applied cmsearch to the metagenomic data with default
parameters using two CMs. The first CM was trained on all 16S genes from RDP excluding
the exact copies of rRNA genes in the 64 species of the synthetic community data. Next, we
further removed all genes belonging to any of the 50 genera and trained the second CM. The
read mapping results and the annotation of the 16S rRNA genes in the component genomes
enable us to determine whether a read is part of a rRNA gene. In this dataset, there are
67,979 reads completely sequenced from rRNA genes. Based on the known origins of the
reads, we can evaluate the performance of cmsearch on both CMs using sensitivity, PPV, and
running time. As displayed in Table4.5, cmsearch achieved both high sensitivity and PPV
with both CMs. Removal of all genes in the 50 genera only slightly decreased the sensitivity
and PPV. Running time of cmsearch with eight working threads did not differ much on the
two experiments. Table4.5 also shows the significant reduction of problem size by 99.87%
(from 54,029,186 reads to less than 68,000 reads). The experiments were conducted using a
2.4GHz CPU and 8GB memory.

83

tool
REAGO
EMIRGE, 16S DB
excluding 64 genes
EMIRGE, 16S DB
excluding 50 genera
idba-ud
Meta-Velvet

# output genes
59
76

# genes recovered
58/64
42/64

# genus recovered
47/50
33/50

# incorrect assemblies
2
42

running time
12:28:01
120:30:11

105

19/64

16/50

89

120:25:38

61
135

39/64
4/64

33/50
4/50

28
131

17:26:50
7:56:01

Table 4.6: The performance of rRNA recovery on synthetic community data. Due to near identical outputs
and very high sequence similarity among some genes in some genera, the sum of incorrect assemblies and
correct assemblies may not sum to the total number of output sequences.

Next, we applied our assembly algorithms on the reads that are classified as rRNA reads
by cmsearch. The overlap threshold of REAGO was set to 70 and the error correction
threshold was 30. Training sequences of CM contained neither any of the 64 genes nor
any gene from the 50 genera. We compared the performance of REAGO with EMIRGE,
IDBA-ud, and Meta-Velvet. In this experiment, there are 67,979 reads fully sequenced from
rRNAs. Parameters of IDBA-ud were all set as default. Meta-Velvet was run with a wide
range of kmer length from 20 to 65. The output were pooled together as its final output.
EMIRGE was run twice with two different 16S rRNA databases. The first one is its 16S
rRNA database excluding only exact copies of the 64 genes and the second one is its database
excluding all genes in the 50 genera. For all tools, we only consider assemblies longer than
1,350nt as the final output. Following the first experiment, we quantified the performance
using 4 metrics, including the number of genes recovered at sequence level, the number of
genus recovered, the number of incorrect assemblies, and the running time.
We summarized the results in Table4.6. Our tool correctly recovered more genes and
genus with far less running time. Even with all genes in the 50 genera removed from the
training set of the CM, REAGO can recover 58 out of 64 genes and 47 out of 50 genera. For
REAGO , most of the time was spent in running cmsearch and the actual assembly procedure
finished within a couple of minutes. The filtration with cmsearch can be greatly accelerated
when running in parallel. EMIRGE, on the other hand, recovered less genes and genera
84

tool
EMIRGE, 16S DB
excluding 64 genes
EMIRGE, 16S DB
excluding 50 genera
idba-ud
Meta-Velvet

# output genes
82

# genes recovered
45/64

# genus recovered
36/50

# incorrect assemblies
45

running time
0:52:19

103

25/64

20/50

82

0:51:48

60
65

41/64
16/64

34/50
16/50

26
49

0:1:16
0:0:26

Table 4.7: The performance of rRNA recovery on only true positive reads from synthetic community data.
Due to near identical outputs and very high sequence similarity among some genes in some genera, the sum
of incorrect assemblies and correct assemblies may not sum to the total number of output sequences.

with longer running time. With the removal of all genes in the 50 genera, the performance
of EMIRGE significantly deteriorated, indicating its limited ability on recovering 16S genes
from unknown genera. It is worth nothing that in this experiment EMIRGE outputs some
genes with high sequence similarity but not identical. Thus, several sequences output by
EMIRGE may share high sequence similarity with the same true rRNA gene in the data set.
To evaluate the necessity of applying cmsearch to identify 16S reads before the actual
16S gene recovery process, we applied EMIRGE, idba-ud and Meta-Velvet directly on true
positive 16S reads. As summarized in Table4.7, the number of recovered genes and genera
increased for all tools. Additionally, running time of each tool was significantly reduced.
Reads from non-16S did affect the performance of the 16S gene recovery, thus it is very
necessary to apply cmsearch before recovering 16S genes for better sensitivity and efficiency.

4.3.3

Experiment 3: human stool metagenomic data
0.48
0.92
1

1
0.98
0.57

uncultured_bacterium_16slp98-1b12.w2k_6.052%
uncultured_bacterium_014B-E11_3.05%
uncultured_bacterium_SJTU_G_10_53_3.262%
uncultured_bacterium_SJTU_E_02_29_7.498%
uncultured_bacterium_SJTU_B_06_60_2.405%
uncultured_organism_ELU0073-T501-S-NIPCRAMgANa_000369_2.16%
uncultured_bacterium_C2-121_2.583%
0.94
uncultured_organism_ELU0046-T235-S-NIPCRAMgANb_000196_2.791%
0.94 Bacteroides_sp._S-17_2.287%
uncultured_organism_ELU0078-T372-S-NI_000330_2.274%
1
0.88 Bacteroides_uniformis_23-18_2.579%
Bacteroides_vulgatus_ATCC_8482_60.772%

0.09

Figure 4.9: Phylogenetic tree of the 12 representative sequences.
85

CM
REAGO
EMIRGE
IDBA-ud
Meta-Velvet

number of identified representative sequences
7/12
6/12
1/12
0/12

number of other identified sequences
4
2
0
0

running time
6:10:11
45:12:34
6:02:09
2:07:54

Table 4.8: Assembly tool performance on human stool data

In the first two experiments, we evaluated our tool on two data sets with known composition species and annotated rRNA genes. In this section, we describe the application of
REAGO to a metagenomic data set sequenced from a human stool sample (SRS015264).
The dataset was sequenced using Illumina and contains 31,577,655 100bp paired-end reads.
In this experiment, we applied all tools using their default setup without modifying the given
training SSU database or the given CM. The overlap threshold for REAGO is still set as 70.
Our tool outputs 17 genes and EMIRGE outputs 26 genes. As we don’t know the complete
composition of the stool metagenomic data set, we compared the output with the known
genes identified using read mapping against the reference rRNA databases. Specifically, we
mapped the reads to the RDP rRNA database, which contains 1,354,916 16S ribosomal RNA
genes. There are 387 genes with at least 200 reads mapped to them. Many of these genes
are highly similar. We thus clustered them using CD-HIT-EST [104] using identity cutoff
0.98. 12 clusters are formed by CD-HIT-EST and the phylogenetic tree of their representative sequences with average base coverage is plotted in Figure4.9. There exists a dominant
species Bacteroides vulgatus ATCC 8482, whose average coverage is 204X, taking 60.77% of
the entire dataset. On the other hand, the abundance of other species is very similar with
average coverage ranging from 7X to 10X. Among all reads in the dataset, there are 51.1%
of them can be mapped to the 12 representative genes. Figure4.9 presents the phylogenetic
tree of the 12 representative sequences.
We then compared the output of the tools with the representative sequences of the 12

86

species
uncultured bacterium 16Slp98-1b12.w2k
uncultured bacterium 014B-E11
uncultured bacterium SJTU G 10 53
uncultured bacterium SJTU E 02 29
uncultured bacterium SJTU B 06 60
uncultured organism ELU0073-T501-S-NIPCRAMgANa 000369
uncultured bacterium C2-121
uncultured organism ELU0046-T235-S-NIPCRAMgANb 000196
Bacteroides sp. S-17
uncultured organism ELU0078-T372-S-NI 000330
Bacteroides uniformis 23-18
Bacteroides vulgatus ATCC 8482

abundance
6.052%
3.05%
3.262%
7.498%
2.405%
2.16%
2.583%
2.791%
2.287%
2.274%
2.579%
60.772%

REAGO
Y

EMIRGE
Y

Y
Y
Y

Y
Y
Y
Y

Y
Y
Y

Y

Table 4.9: Species recovered by REAGO and EMIRGE. “Y” indicates the species of the row
is identified by the tool labeled by the column title.

clusters and summarized the results in Table4.8. We evaluated their performance using
3 metrics, number of identified representative sequences, number of other sequences, and
running time. Our tool successfully identified 7 out of 12 representative sequences and
EMIRGE identified 6. Within the 12 representative sequences, the top 3 abundant species
are Bacteroides vulgatus ATCC 8482 and 2 uncultured species SJTU-E-02-29 and 16Slp981b12.w2k. Our tool managed to identify all of them while EMIRGE only identified the most
abundant species. The detailed outputs of EMIRGE and REAGO are compared in Table4.9.
We carefully examined the reasons of REAGO missing 6 representative sequences. During
the path finding process, if a path shares its first node with another path, which happens
when their sequence similarity is high, only one of the two path will be identified. Our tool
failed to identify 6 sequences because of the high sequence similarity between them and the
identified sequences as displayed in the phylogenetic tree in Figure4.9.
As described in Table4.8, both tools output a few assemblies that are not from the
representative sequences. They may be originated from species that are not in the 12 representative sequence or not even in the RDP database. Applying RDP classifier shows that
the uncharacterized assemblies output by EMIRGE and REAGO all belong to Bacteroides
and Sutterella.

87

Chapter 5
Conclusion and future work
Noncoding RNAs play important roles in various important biological functions. Identification of ncRNAs in genomic sequences becomes increasingly important for modern biology.
As many families of RNAs are more conserved in their secondary structures than primary sequences, it is necessary to include both pieces of information when creating models. Modeling
only primary sequences could lead to loss of sensitivity. With the advent of next-generation
sequencing (NGS) technology, massive amount of data has been rapidly accumulating, making many existing algorithms infeasible. As a result, efficient analysis of such data is needed.
Based on objectives of algorithms, it is necessary to exclude unrelated data from being analyzed. For graph-based algorithms, the graph size could be very sensitive to the input size.
Large amount of input data could potentially lead to high computational cost. Due to the
limitation of sequencing technologies, data sequenced is normally fragmentary. Considering
sequencing error and uneven sequencing depth, recovery of longer or full-length genes is still a
challenging task. With the advance of sequencing technologies, reads with increasing length
and lower sequencing error rate could be available. Gene recovery and short reads assembly
in general could be easier and more reliable. Paired-end information sometimes could be
very useful for gene recovery from NGS data. Especially for many assembly algorithms, it
provides additionally information to aid graph traversal.
In the dissertation, a set of algorithms called REAGO is introduced to recover 16S ribosomal RNA from metagenomic data. REAGO first adopts a very effective approach to reduce
88

the input size. Then it represents all 16S genes in an overlap graph. After a set of graph
reduction procedure, it generates a list of 16S rRNA contigs. Short contigs are combined
to produce the final output. On both experiments, REAGO out-performed EMIRGE, the
state-of-the-art tool, and other two bulk assemblers.
Algorithms in REAGO can be readily extended to identify other types of noncoding RNAs
in NGS datasets. They could be especially effective for the families of ncRNAs with variable
primary sequence conservation. If the primary sequences of a RNA family are known to be
conserved, we could still apply REAGO. However, other bulk metagenomic data assemblers,
such as Meta-Velvet and IDBA-ud, should also be good options after we apply the read
identification procedure in REAGO. It is worth noting that, Algorithms in REAGO may fail
if sequences in a RNA family contain repeats, which may form cycles in overlap graphs, as
REAGO is not be able to resolve cycles.
In general, Assembly algorithms for identifying ncRNA in NGS data should be customized
to various sequencing platforms, which may produce short reads with various attributes,
such as read length, error rate and insert size. Assembly algorithms should be adjusted to
these attributes. For example, if the data is obtained from PacBio sequencing platform,
which produces very long reads (up to 1500nt) but with high error rate (12.86%). Assembly
algorithms should be redesigned focusing on erroneous base correction. Accordingly, the
overlap detection using Readjoiner is not feasible anymore. New tools should be developed
for this task.

89

BIBLIOGRAPHY

90

BIBLIOGRAPHY
[1] Storz G (2002) An expanding universe of noncoding rnas. Science 296: 1260–1263.
[2] Eddy SR (2001) Non–coding rna genes and the modern rna world. Nature Reviews
Genetics 2: 919–929.
[3] Vitreschak AG, Rodionov DA, Mironov AA, Gelfand MS (2004) Riboswitches: the
oldest mechanism for the regulation of gene expression? TRENDS in Genetics 20:
44–50.
[4] Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. UK: Cambridge University Press.
[5] Eliceiri G (1999) Small nucleolar rnas. Cellular and Molecular Life Sciences CMLS 56:
22–31.
[6] Maxwell E, Fournier M (1995) The small nucleolar rnas. Annual review of biochemistry
64: 897–934.
[7] Fournier MJ, Stuart Maxwell E (1993) The nucleolar snrnas: catching up with the
spliceosomal snrnas. Trends in biochemical sciences 18: 131–135.
[8] Ambros V (2004) The functions of animal micrornas. Nature 431: 350–355.
[9] Bartel DP (2004) Micrornas: genomics, biogenesis, mechanism, and function. cell 116:
281–297.
[10] Lu S, Shi R, Tsao CC, Yi X, Li L, et al. (2004) Rna silencing in plants by the expression
of sirna duplexes. Nucleic acids research 32: e171–e171.
[11] Lowe TM, Eddy SR (1997) trnascan-se: a program for improved detection of transfer
rna genes in genomic sequence. Nucleic acids research 25: 0955–964.
[12] Schattner P, Brooks AN, Lowe TM (2005) The trnascan-se, snoscan and snogps web
servers for the detection of trnas and snornas. Nucleic acids research 33: W686–W689.

91

[13] An J, Lai J, Lehman ML, Nelson CC (2013) mirdeep*: an integrated application tool
for mirna identification from rna sequencing data. Nucleic acids research 41: 727–737.
[14] An J, Lai J, Sajjanhar A, Lehman ML, Nelson CC (2014) mirplant: an integrated tool
for identification of plant mirna from rna sequencing data. BMC bioinformatics 15:
275.
[15] Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment
search tool. Journal of molecular biology 215: 403–410.
[16] Kolbe DL, Eddy SR (2009) Local rna structure alignment with incomplete sequence.
Bioinformatics 25: 1236–1243.
[17] Nawrocki EP, Kolbe DL, Eddy SR (2009) Infernal 1.0: inference of rna alignments.
Bioinformatics 25: 1335–1337.
[18] Kolbe DL, Eddy SR (2009) Local rna structure alignment with incomplete sequence.
Bioinformatics 25: 1236-1243.
[19] Nawrocki EP, Kolbe DL, Eddy SR (2009) Infernal 1.0: Inference of RNA alignments.
Bioinformatics 25: 1335-1337.
[20] Yuan C, Sun Y (2013) RNA-CODE: a noncoding RNA classification tool for short
reads in NGS data lacking reference genomes. PLOS ONE 8: e77596.
[21] Tsiatis AC, Norris-Kirby A, Rich RG, Hafez MJ, Gocke CD, et al. (2010) Comparison
of sanger sequencing, pyrosequencing, and melting curve analysis for the detection of¡
i¿ kras¡/i¿ mutations: Diagnostic and clinical implications. The Journal of Molecular
Diagnostics 12: 425–432.
[22] Grada A, Weinbrecht K (2013) Next-generation sequencing: methodology and application. Journal of Investigative Dermatology 133: e11.
[23] Liu L, Li Y, Li S, Hu N, He Y, et al. (2012) Comparison of next-generation sequencing
systems. BioMed Research International 2012.
[24] Wang Z, Gerstein M, Snyder M (2009) Rna-seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 10: 57–63.
[25] Handelsman J (2004) Metagenomics: application of genomics to uncultured microorganisms. Microbiology and molecular biology reviews 68: 669–685.
92

[26] Huang Z, Wu Y, Robertson J, Feng L, Malmberg RL, et al. (2008) Fast and accurate
search for non-coding RNA pseudoknot structures in genomes. Bioinformatics 24:
2281-2287.
[27] Huang Z, Malmberg R, Mohebbi M, Cai L (2010) RNAv: Non-coding RNA secondary
structure variation search via graph homomorphism. In: CSB Conference Proceedings,
CA, USA. pp. 56-69.
[28] Martin JA, Wang Z (2011) Next-generation transcriptome assembly. Nature Reviews
Genetics 12: 671–682.
[29] Miller CS, Baker BJ, Thomas BC, Singer SW, Banfield JF, et al. (2011) Emirge:
reconstruction of full-length ribosomal genes from microbial community short read
sequencing data. Genome Biol 12: R44.
[30] Powers T, Noller HF (1991) A functional pseudoknot in 16s ribosomal rna. The EMBO
journal 10: 2203.
[31] Staple DW, Butcher SE (2005) Pseudoknots: RNA Structures with Diverse Functions.
PLoS Biology 3: e213.
[32] Gilley D, Blackburn EH (1999) The telomerase RNA pseudoknot is critical for the
stable assembly of a catalytically active ribonucleoprotein. PNAS 96: 6621-6625.
[33] Chen JL, Greider CW (2005) Functional analysis of the pseudoknot structure in human
telomerase RNA. PNAS 102: 8080-8085.
[34] Wower IK, Zwieb C, Wower J (2004) Contributions of pseudoknots and protein SmpB
to the structure and function of tmRNA in trans-translation. the Journal of Biological
Chemistry 279: 54202-54209.
[35] Griffiths-Jones S (2007) Annotating Noncoding RNA Genes. Annual Review of Genomics and Human Genetics 8: 279-298.
[36] Klein RJ, Eddy SR (2003) RSEARCH: finding homologs of single structured RNA
sequences. BMC Bioinformatics 4: 44.
[37] Lowe T, Eddy SR (1997) TRNAscan-SE: a program for improved detection of transfer
RNA genes in genomic sequence. Nucleic Acids Res 25: 955–64.

93

[38] Cai L, Malmberg RL, Wu Y (2003) Stochastic modeling of RNA pseudoknotted structures: a grammatical approach. Bioinformatics 19: i66-i73.
[39] Rivas E, Eddy SR (2000) The language of RNA: a formal grammar that includes
pseudoknots. Bioinformatics 16: 334-340.
[40] Uemura Y, Hasegawa A, Kobayashi S, Yokomori T (1999) Tree adjoining grammars
for rna structure prediction. Theoretical Computer Science 210: 277 - 303.
[41] Matsui H, Sato K, Sakakibara Y (2005) Pair stochastic tree adjoining grammars for
aligning and predicting pseudoknot rna structures. Bioinformatics 21: 2611-2617.
[42] Kato Y, Seki H, Kasami T (2006) RNA pseudoknotted structure prediction using
stochastic multiple context-free grammar. IPSJ Digital Courier 2: 655-664.
[43] Smith JA (2009) RNA Search with Decision Trees and Partial Covariance Models.
TCBB 6: 517-527.
[44] Brown M, Wilson C (1996) RNA pseudoknot modeling using intersections of stochastic
context free grammars with applications to database search. In: Pacific Symposium
on Biocomputing (PSB ’96), Hawaii, USA. pp. 109-125.
[45] Zhang S, Haas B, Eskin E, Bafna V (2005) Searching genomes for noncoding RNA
using FastR. IEEE/ACM Transactions on Comp Bio and Bioinf (TCBB) 2: 366-79.
[46] Weinberg Z, Ruzzo W (2004) Exploiting conserved structure for faster annotation of
non-coding RNAs without loss of accuracy. Bioinformatics 20 suppl. 1: i334–40.
[47] Eddy S (2007). HMMER - biosequence analysis using profile hidden Markov models.
Http://hmmer.janelia.org/.
[48] Gardner P, Daub J, Tate J, Nawrocki E, Kolbe D, et al. (2008) Rfam: updates to the
RNA families database. Nucleic Acids Research 37(Database issue): D136-D140.
[49] Macke T, Ecker D, Gutell R, Gautheret D, Case D, et al. (2001) RNAMotif – A new
RNA secondary structure definition and discovery algorithm. Nucleic Acids Research
29: 4724-4735.
[50] Wei F, Stein JC, Liang C, et al (2009) Detailed analysis of a contiguous 22-mb region
of the maize genome. PLoS Genet 5: e1000728.

94

[51] van Batenburg FHD, Gultyaev AP, Pleij CWA, Ng J, Oliehoek J (2000) PseudoBase:
a database with RNA pseudoknots. Nucleic Acids Research 28: 201-204.
[52] Jones-Rhoades MWW, Bartel DPP, Bartel B (2006) Micrornas and their regulatory
roles in plants. Annual Review of Plant Biology 57: 19-53.
[53] Lu S, Shi R, Tsao CC, Yi X, Li L, et al. (2004) RNA silencing in plants by the
expression of siRNA duplexes. Nucl Acids Res 32: e171.
[54] Bengtsson J, Eriksson KM, Hartmann M, Wang Z, Shenoy BD, et al. (2011) Metaxa:
a software tool for automated detection and discrimination among ribosomal small
subunit (12s/16s/18s) sequences of archaea, bacteria, eukaryotes, mitochondria, and
chloroplasts in metagenomes and environmental sequencing datasets. Antonie Van
Leeuwenhoek 100: 471–475.
[55] Shah N, Tang H, Doak TG, Ye Y (2011) Comparing bacterial communities inferred
from 16s rRna gene sequencing and shotgun metagenomics. In: Pacific Symposium on
Biocomputing. World Scientific, volume 16, pp. 165–176.
[56] Fan L, McElroy K, Thomas T (2012) Reconstruction of ribosomal rna genes from
metagenomic data. PloS one 7: e39948.
[57] Stricklin Sea (2005) C. elegans noncoding RNA genes, WormBook, ed. The C. elegans
Research Community. WormBook.
[58] Ge X, Zhang Y, Jiang J, Zhong Y, Yang X, et al. (2013) Identification of MicroRNAs in
Helicoverpa armigera and Spodoptera litura based on deep sequencing and homology
analysis. Int J Biol Sci 9: 1-15.
[59] Gruber AR, Neub¨ock R, Hofacker IL, Washietl S (2007) The RNAz web server: prediction of thermodynamically stable and evolutionarily conserved RNA structures.
Nucleic Acids Research 35: W335–W338.
[60] Lowe TM, Eddy SR (1999) A computational screen for methylation guide snoRNAs in
yeast. Science 283: 1168–1171.
[61] Mercer TR, Dinger ME, Mattick JS (2009) Long non-coding RNAs: insights into
functions. Nature Reviews Genetics 10: 155–159.
[62] Sun G, Stewart CNJ, Xiao P, Zhang B (2012) MicroRNA expression analysis in the
cellulosic biofuel crop switchgrass (Panicum virgatum) under abiotic stress. PLoS One
7.
95

[63] Peng X, Gralinski L, Ferris MT, Frieman MB, Thomas MJ, et al. (2011) Integrative
deep sequencing of the mouse lung transcriptome reveals differential expression of
diverse classes of small RNAs in response to respiratory virus infection. MBio 2.
[64] Zerbino DR, Birney E (2008) Velvet: algorithms for de novo short read assembly using
de bruijn graphs. Genome research 18: 821–829.
[65] Schulz MH, Zerbino DR, Vingron M, Birney E (2012) Oases: robust de novo rna-seq
assembly across the dynamic range of expression levels. Bioinformatics 28: 1086–1092.
[66] Warren RL, Sutton GG, Jones SJ, Holt RA (2007) Assembling millions of short DNA
sequences using ssake. Bioinformatics 23: 500–501.
[67] Lagesen K, Hallin P, Rødland EA, Stærfeldt HH, Rognes T, et al. (2007) RNAmmer2:
consistent and rapid annotation of ribosomal rna genes. Nucleic acids research 35:
3100–3108.
[68] Vilo C, Dong Q (2012) Evaluation of the RDP classifier accuracy using 16s rRNA gene
variable regions. Metagenomics 1: 1–5.
[69] Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, et al. (2010) A human gut microbial
gene catalogue established by metagenomic sequencing. Nature 464: 59–65.
[70] Turnbaugh PJ, Quince C, Faith JJ, McHardy AC, Yatsunenko T, et al. (2010) Organismal, genetic, and transcriptional variation in the deeply sequenced gut microbiomes
of identical twins. Proceedings of the National Academy of Sciences 107: 7503–7508.
[71] Rothschild LJ, Mancinelli RL (2001) Life in extreme environments. Nature 409: 1092–
1101.
[72] Konings WN, Albers SV, Koning S, Driessen AJ (2002) The cell membrane plays a
crucial role in survival of bacteria and archaea in extreme environments. Antonie Van
Leeuwenhoek 81: 61–72.
[73] Savage DC (1977) Microbial ecology of the gastrointestinal tract. Annual Reviews in
Microbiology 31: 107–133.
[74] Berg RD (1996) The indigenous gastrointestinal microflora. Trends in microbiology 4:
430–435.

96

[75] Loreau M, Naeem S, Inchausti P, Bengtsson J, Grime J, et al. (2001) Biodiversity and
ecosystem functioning: current knowledge and future challenges. science 294: 804–808.
[76] Woese CR, Kandler O, Wheelis ML (1990) Towards a natural system of organisms:
proposal for the domains Archaea, Bacteria, and Eucarya. Proceedings of the National
Academy of Sciences of the United States of America 87: 4576-9.
[77] Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW (2010) GenBank.
Nucleic Acids Research 38: D46-D51.
[78] Cochrane G, Akhtar R, Bonfield J, Bower L, Demiralp F, et al. (2009) Petabyte-scale
innovations at the European Nucleotide Archive. Nucleic Acids Research 37: D19-D25.
[79] Tateno Y, Imanishi T, Miyazaki S, Fukami-Kobayashi K, Saitou N, et al. (2002) DNA
Data Bank of Japan (DDBJ) for genome scale research in life science. Nucleic Acids
Research 30: 27-30.
[80] Christen R (2008) Global sequencing: a review of current molecular data and new
methods available to assess microbial diversity. Microbes and environments JSME 23:
253-268.
[81] Hamady M, Knight R (2009) Microbial community profiling for human microbiome
projects: tools, techniques, and challenges. Genome Research 19: 1141-1152.
[82] Namiki T, Hachiya T, Tanaka H, Sakakibara Y (2012) Metavelvet: an extension of
velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic
Acids Research 40: e155.
[83] Treangen T, Koren S, Sommer D, Liu B, Astrovskaya I, et al. (2013) MetAMOS:
a modular and open source metagenomic assembly and analysis pipeline. Genome
Biology 14: R2.
[84] Laserson J, Jojic V, Koller D (2011) Genovo: de novo assembly for metagenomes. J
Comput Biol 18: 429-443.
[85] Peng Y, Leung HCM, Yiu SM, Chin FYL (2011) Meta-IDBA: a de novo assembler for
metagenomic data. Bioinformatics 27: i94-i101.
[86] Luo C, Tsementzi D, Kyrpides N, Konstantinidis K (2012) Individual genome assembly
from complex community short-read metagenomic datasets. ISME J 6: 898-901.

97

[87] Salzberg SL, Sommer DD, Puiu D, Lee VT (2008) Gene-boosted assembly of a novel
bacterial genome from very short reads. PLOS Comput Biol 4: e1000186.
[88] Wu Y, Rho M, Doak TG, Ye Y (2012) Stitching gene fragments with a network matching algorithm improves gene assembly for metagenomics. Bioinformatics 28: i363-i369.
[89] Jeffrey AM, Zhong W (2011) Next-generation transcriptome assembly. Nature Reviews
Genetics 12: 671-682.
[90] Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, et al. (2005) Comparative metagenomics of microbial communities. Science 308: 554-557.
[91] Wrighton KC, Thomas BC, Sharon I, Miller CS, Castelle CJ, et al. (2012) Fermentation, hydrogen, and sulfur metabolism in multiple uncultivated bacterial phyla. Science
337: 1661-1665.
[92] Durbin R (1998) Biological sequence analysis: probabilistic models of proteins and
nucleic acids. Cambridge university press.
[93] Woese CR, Fox GE (1977) Phylogenetic structure of the prokaryotic domain: the
primary kingdoms. Proceedings of the National Academy of Sciences 74: 5088–5090.
[94] Gonnella G, Kurtz S (2012) Readjoiner: a fast and memory efficient string graph-based
sequence assembler. BMC bioinformatics 13: 82.
[95] Simpson JT, Durbin R (2012) Efficient de novo assembly of large genomes using compressed data structures. Genome Research 22: 549-556.
[96] Cole JR, Chai B, Farris RJ, Wang Q, Kulam S, et al. (2005) The ribosomal database
project (rdp-ii): sequences and tools for high-throughput rrna analysis. Nucleic acids
research 33: D294–D296.
[97] Simpson JT, Durbin R (2012) Efficient de novo assembly of large genomes using compressed data structures. Genome research 22: 549–556.
[98] Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, et al. (2009) Abyss: a parallel
assembler for short read sequence data. Genome research 19: 1117–1123.
[99] Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, et al. (2008) Allpaths:
de novo assembly of whole-genome shotgun microreads. Genome research 18: 810–820.

98

[100] Namiki T, Hachiya T, Tanaka H, Sakakibara Y (2012) Metavelvet: an extension of
velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic
acids research 40: e155–e155.
[101] Peng Y, Leung HC, Yiu SM, Chin FY (2012) Idba-ud: a de novo assembler for singlecell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28:
1420–1428.
[102] Li H (2011). wgsim-read simulator for next generation sequencing.
[103] Shakya M, Quince C, Campbell JH, Yang ZK, Schadt CW, et al. (2013) Comparative
metagenomic and rrna microbial diversity characterization using archaeal and bacterial
synthetic communities. Environmental microbiology 15: 1882–1899.
[104] Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets
of protein or nucleotide sequences. Bioinformatics 22: 1658–1659.

99