IDENTIFICATION AND ANALYSIS OF NON-CODING RNAS IN LARGE SCALE GENOMIC DATA

By Rujira Achawanantakun

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy

2014

ABSTRACT

IDENTIFICATION AND ANALYSIS OF NON-CODING RNAS IN LARGE SCALE GENOMIC DATA

By Rujira Achawanantakun

High-throughput sequencing technologies have created opportunities for large-scale transcriptome analyses and intensified attention on the study of non-coding RNAs (ncRNAs). NcRNAs play important roles in many cellular processes. For example, transfer RNAs and ribosomal RNAs are involved in the protein translation process; microRNAs regulate gene expression; and long ncRNAs are associated with many human diseases ranging from autism to cancer. Many ncRNAs function through both their sequences and secondary structures. Accurate secondary structure prediction therefore provides important information for understanding the tertiary structures, and thus the functions, of ncRNAs.

State-of-the-art ncRNA identification tools are mainly based on two approaches. The first is comparative structure analysis, which determines the consensus structure from homologous ncRNAs. Structure prediction is a costly process because the size of the set of putative structures increases exponentially with the sequence length [1]; it is thus not practical for very long ncRNAs such as lncRNAs. Moreover, the accuracy of current structure prediction tools is still not satisfactory, especially on sequences containing pseudoknots. An alternative identification approach that has become increasingly popular is sequence-based expression analysis, which relies on next-generation sequencing (NGS) technologies to quantify gene expression on a genome-wide scale. The specific expression patterns are used to identify the type of ncRNA.
This method is therefore limited to ncRNAs that have medium to high expression levels and unique expression patterns that differ from those of other ncRNAs.

In this work, we address the challenges of ncRNA identification using different approaches. Specifically, we have proposed four tools: grammar-string based alignment, KnotShape, KnotStructure, and LncRNA-ID. A grammar string is a novel ncRNA secondary structure representation that encodes an ncRNA's sequence and secondary structure in the parameter space of a context-free grammar, or of a full RNA grammar when pseudoknots are included. It simplifies a complicated structure alignment to a simple grammar-string based alignment. In addition, grammar-string based alignment incorporates both sequence and structure into multiple sequence alignment, so we can enhance the speed of alignment while achieving an accurate consensus structure. KnotShape and KnotStructure focus on reducing the size of the structure search space to speed up structure prediction. KnotShape predicts the best shape by grouping similar structures together and applying SVM classification to select the best representative shape. KnotStructure improves the performance of structure prediction by using grammar-string based alignment and the predicted shape output by KnotShape. LncRNA-ID is specially designed for lncRNA identification. It incorporates balanced random forest learning to construct a classification model that distinguishes lncRNAs from protein-coding sequences. Its major advantage is that it can maintain good predictive performance under limited or imbalanced training data.

ACKNOWLEDGMENTS

First and foremost, I would like to thank my adviser, Dr. Yanni Sun. Her decision to admit me as her PhD student five years ago provided me the precious opportunity to study at Michigan State University and led me to the world of bioinformatics.
During these five years under her guidance, I have made continuous progress in several aspects, including reading research papers, proposing research topics, developing methods, designing experiments, and writing papers. More importantly, I have gradually improved my ability to conduct, both independently and collaboratively, in-depth analysis of sophisticated research problems and to use scientific methodologies to solve challenging problems. She also gave me many suggestions on how to effectively present our work to an audience, especially to people from other research areas. This ability is very important in that it will profoundly determine my capability to collaborate in a team of people from different backgrounds.

I also want to thank my other committee members, Dr. C. Titus Brown, Dr. Pang-Ning Tan, and Dr. James R. Cole, who gave many useful suggestions during the course of my PhD program. I also thank my lab mates Yuan Zhang, Jikai Lei, Cheng Yuan, and Jiao Chen. During these years, we have had productive discussions and cooperation on various research topics, and I have received great help from them. I gratefully acknowledge other faculty and staff of the CSE department, especially Dr. Jin Chen, Dr. Richard J. Enbody, Dr. Joyce Chai, Dr. Li Xiao, Linda Moore, and Norma Teague. I also owe many thanks to my friends at MSU, especially Wenjuan Ma. All of them gave me a great deal of support and help during these years. My final and most important acknowledgment goes to my family in Thailand and to Dr. Alice and Ken Whiren, my host family in the US. They have always given me persistent, devoted love and support.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF PROCEDURES

Chapter 1 Introduction
1.1 Non-coding RNAs (ncRNAs)
1.2 Functional ncRNA identification
1.3 Long non-coding RNA identification

Chapter 2 Grammar string: a novel ncRNA secondary structure representation
2.1 Background
2.2 Method
2.2.1 An unambiguous CFG for ncRNA generation
2.2.2 Grammar string generation algorithm
2.2.3 Grammar pattern for encoding stem structures
2.3 Using grammar strings for multiple ncRNA structural alignment
2.3.1 Score table design for grammar string alignment
2.3.2 Multiple ncRNA alignment using grammar strings
2.3.2.1 Structure prediction
2.3.2.2 Multiple grammar string alignment
2.3.3 Using grammar patterns to reduce errors caused by ab initio structure prediction
2.4 Results
2.5 Discussion and conclusion

Chapter 3 Secondary structure prediction of ncRNAs including pseudoknots
3.1 Background
3.2 Method
3.2.1 Grammar string and grammar pattern
3.2.2 Grammar strings for ncRNAs with pseudoknots
3.2.3 Consensus structure derivation through multiple grammar string alignment
3.3 Results
3.4 Discussion and conclusion

Chapter 4 Shape and secondary structure prediction for ncRNAs including pseudoknots based on linear SVM
4.1 Background
4.2 Related work
4.3 Methods
4.3.1 RNA structures and their representations
4.3.1.1 RNA structures and pseudoknots
4.3.1.2 Abstract shapes
4.3.2 Shape prediction
4.3.2.1 Notation
4.3.2.2 Feature construction and selection
4.3.2.3 Shape ranking using a simple scoring function
4.3.2.4 Time complexity of shape prediction
4.3.3 Consensus structure prediction given a shape
4.3.3.1 Running time of structure prediction
4.4 Results
4.4.1 Data sets
4.4.2 SVM training
4.4.3 Shape prediction comparison
4.4.4 Structure prediction comparison
4.5 Discussion and conclusion
Chapter 5 LncRNA-ID: Long non-coding RNA IDentification using balanced random forest classification
5.1 Background
5.2 Related work
5.3 Methods
5.3.1 ORF features
5.3.2 Ribosome interaction features
5.3.2.1 Initiation
5.3.2.2 Translation
5.3.2.3 Termination
5.3.3 Protein conservation features
5.3.4 Balanced random forest
5.4 Results
5.4.1 Performance of different groups of features
5.4.2 The human data set (H1)
5.4.3 The mouse data set (M)
5.4.4 CPAT's human data set (H2)
5.4.5 Imbalanced training data
5.4.6 Running time
5.5 Discussion and conclusion

Chapter 6 Conclusion and future work
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1 Structure predictions for three tRNA sequences. Multiple structure predictions are output for each sequence. For each prediction, the column named "stems" displays its stem structure denoted by brackets. The corresponding grammar pattern and ∆G are listed in columns 3 and 4, respectively.
Table 3.1 Comparison of grammar string and RNASampler on pseudoknot derivation.
Table 4.1 Accuracy of shape predictions.
Table 4.2 Abstract shapes of ncRNA families in R15.
Table 4.3 Sensitivity and PPV of predicted structures using the predicted shapes and randomly selected shapes.
Table 4.4 Sensitivity for different ncRNA families.
Table 4.5 PPV for different ncRNA families.
Table 4.6 Running time for different ncRNA families (seconds).
Table 5.1 Performance comparison on H1.
Table 5.2 Performance comparison on the mouse data set.
Table 5.3 Performance comparison on H2.

LIST OF FIGURES

Figure 2.1 Two tRNA sequences from the human genome and the alignment of their grammar strings. The stars below the alignment denote exact matches.
Figure 2.2 Algorithm for generating a grammar string for substring X_i..j.
Figure 2.3 Four different stem structures and their grammar patterns. The left column shows the 2D representation of an ncRNA folding. The right column shows the distributions of stems along an ncRNA sequence. All grammar patterns are generated using G4 (our chosen unambiguous context-free grammar).
Figure 2.4 Four highly different structures predicted by UNAFold for the tRNA sequence shown at the bottom. The numbers beside each structure are their ∆G. The cloverleaf structure has a bigger ∆G than the other predictions.
Figure 2.5 The consensus grammar string of the tRNA alignment and the consensus secondary structure derived from the grammar string. X and x represent complementary base pairs; they can easily be translated into nucleotide bases using the input tRNA sequences. All other structural alignment tools were tested under their default parameters except MARNA. For MARNA, the default structure prediction option RNAfold (from the Vienna RNA package) generated no base pair in the consensus structure, so we used RNAsubopt, which yielded a few more base pairs. The structure plotted by pmmulti was generated from its consensus sequence and structure, which included only a very small number of base pairs, although its multiple alignment seemed to contain more. RNAforester detected fewer complementary mutations and included several inconsistent base pairs such as U-U. LocARNA missed one base pair in one stem. Murlet generated the same structure as our grammar string alignment method.
Figure 2.6 The differences between the reference structures (from Rfam) and the predicted consensus structures from grammar string and Murlet alignments are plotted and compared. Lower numbers indicate higher similarity between the predicted structure and the reference structure.
Figure 2.7 Consensus structures are derived for multiple families of each type of ncRNA, resulting in an RNAdistance output vector. For each type of ncRNA, the average RNAdistance output for Murlet and grammar string alignment is compared.
Figure 2.8 Running time comparison between grammar string and Murlet.
The running time of grammar string is largely decided by the popular string set size. As Murlet uses over 2000 seconds for the family gcvT, we did not include this family in this figure in order to keep the fine scale of the Y-axis.
Figure 3.1 (A) An example of a pseudoknot with the most common topology. (B) The dot-bracket representation of the pseudoknot in (A). <> and Aa are used to distinguish base pairs from the two helical segments in the pseudoknot. (C) Three production rules from RE-pseudo. The diagrams are reproduced from Rivas and Eddy's original description of this grammar. (D) The first five steps in (A)'s derivation using the grammar RE-pseudo.
Figure 3.2 Illustration of the reference structures and the structures predicted by grammar string-based alignment (denoted GS) and RNASampler (denoted RS) on 8 families. Note that if RNASampler fails to output a structure containing pseudoknots or outputs errors, its predictions are not displayed.
Figure 4.1 Structure of an RNA pseudoknot. (a-d) show the three-dimensional structure, secondary structure, arc-based representation, and dot-bracket notation of the mouse mammary tumor virus (MMTV) H-type pseudoknot with PDB code 1RNK. The bases in stacking regions are colored red and blue, while the unpaired bases are colored green and brown.
Figure 4.2 Examples of abstract shapes in levels 1, 3, and 5. (a) The abstract shapes of a pseudoknot-free structure. (b) The abstract shapes of a structure with a pseudoknot.
Figure 4.3 The relationship between sequences, structures, and shapes.
Figure 4.4 An example of structure selection based on hierarchical clustering. For each structure S_i^j in the folding space of sequence X_i, the grammar string encoding the structure and the sequence is denoted gs_i^j.
All sequences and their associated structures are converted into grammar strings before clustering. The highlighted rectangles indicate grammar strings that are selected as representative structures.
Figure 4.5 Comparison of the sensitivity and PPV of different tools.
Figure 5.1 Performance comparison among feature groups: ORF features (ORF), ribosome interaction features (ribo), protein conservation features (protein), and the combined feature sets.
Figure 5.2 ROC curves of different tools on H1. The AUCs, and the sensitivity and FPR corresponding to the optimal F-score, are indicated in the legend.
Figure 5.3 ROC curves of different tools on the mouse data set. The AUCs, and the sensitivity and FPR corresponding to the optimal F-score, are indicated in the legend.

LIST OF PROCEDURES

Procedure 1 Representative structures selection
Procedure 2 Balanced random forest learning

Chapter 1 Introduction

1.1 Non-coding RNAs (ncRNAs)

It has been suggested that less than 2% of the human genome codes for proteins [2]. A recent study found that only one-fifth of transcription across the human genome is associated with protein-coding genes [3]. This indicates that the majority of the transcriptome is non-coding RNA [4], which is not translated into protein but functions directly as RNA [5, 6]. NcRNAs have been identified as potential regulatory molecules that play diverse and important roles in many biochemical processes. For instance, two typical housekeeping ncRNAs, tRNA and rRNA, are key components of protein synthesis. MicroRNAs (miRNAs) play critical regulatory roles via interactions with specific target mRNAs in many organisms [7, 8].
The most recently discovered class of ncRNAs is long non-coding RNAs (lncRNAs), generally defined as non-coding transcripts longer than 200 nucleotides [9, 10, 11]. Increasing evidence has shown that lncRNAs perform important and diverse biological functions. For example, the lncRNAs ANRIL and HOTAIR bind to the chromatin-remodeling complexes PRC1 and PRC2 to alter chromatin and transcription. The GAS5 lncRNA acts as a decoy for the GR transcription factor, preventing GR from binding DNA and activating transcription. MALAT1 RNA binds to SR proteins to regulate mRNA alternative splicing, whereas BACE-1AS RNA binds to the complementary BACE-1 mRNA to regulate BACE-1 translation [12]. As a result, dysfunctions of lncRNAs are associated with a wide range of diseases ranging from neurodegeneration to cancer [13].

1.2 Functional ncRNA identification

The functions of many types of ncRNA are determined not only by their sequences but also by their secondary structures, which describe the base pair interactions within ncRNA sequences. For example, the cloverleaf structure and the stem-loop structure are prominent features of tRNAs and pre-miRNAs, respectively. Thus, comparative ncRNA identification must exploit both sequence and structural conservation, and ncRNAs' structural annotation is an important component of their functional annotation. Many computational methods have been used to determine the native structures of ncRNAs. A native structure is the conformation a sequence folds into in its native state before forming the tertiary structure. In computational ncRNA prediction, secondary structural stability provides the characteristic signal for distinguishing real RNA sequences from non-functional transcripts [14]. Two fundamental approaches to structure prediction are ab initio and comparative folding. The ab initio method predicts a structure from a single sequence.
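As a toy illustration of the ab initio idea, the classic Nussinov-style dynamic program below maximizes the number of complementary base pairs over all nested foldings. This is a sketch for intuition only (a simple pair count with a minimum hairpin-loop length), not the algorithm or energy model of any specific tool discussed in this work:

```python
# Nussinov-style base-pair maximization: a deliberately simplified
# stand-in for MFE folding. Watson-Crick pairs plus the G-U wobble.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs with >= min_loop unpaired
    bases inside every hairpin loop."""
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):          # widen interval [i, j]
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                  # case 1: j unpaired
            for k in range(i, j - min_loop):     # case 2: j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + dp[k + 1][j - 1] + 1)
            dp[i][j] = best
    return dp[0][n - 1]

print(max_pairs("GGGAAAUCCC"))  # simple hairpin -> 3
```

Real MFE folders replace the raw pair count with experimentally derived stacking and loop energies, which is what makes their predictions biologically meaningful; the cubic-time search over nested pairings, however, has the same shape.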
The majority of ab initio methods [15, 16, 17, 18, 19] search for putative structures with the minimum free energy (MFE) using experimentally derived energy parameters. However, the gap between the free energy of the stable native state and that of the less stable non-native structures is often small [20], so misfolded conformations can form with high probability [21]. A more accurate strategy is based on homology detection using alignment methods [22]. A homologous alignment relies on conservation evidence from multiple sequences that share a common ancestor; the similarity between sequences provides additional information, yielding higher accuracy in structure prediction. Despite this promising progress, finding the native secondary structure is still difficult. In particular, identifying the pseudoknot, an important structural motif in many types of ncRNAs, poses a great challenge for existing methods: predicting the minimum free energy secondary structure that includes pseudoknots has been proven to be NP-hard [23]. Moreover, many existing secondary structure representations are highly complicated, incurring high computational cost during alignment. Even with various heuristics or pruning techniques to reduce the time complexity, ncRNA structural alignment is still more computationally intensive than pure sequence alignment and scales poorly with the number and length of input sequences. Therefore, it remains important to develop an efficient and accurate structure prediction method for ncRNAs.

1.3 Long non-coding RNA identification

Homologous alignment is an effective method to identify functional ncRNAs that exhibit similar functionality across species. However, the primary limitation of both sequence- and structure-based homologous alignment is that it might not yield satisfactory results for sequences with low conservation, such as lncRNAs.
LncRNAs are found to have low sequence similarity [24] but are functionally conserved [25, 26]. Although there exist functionally related structures of lncRNAs, predicting lncRNA structures is computationally expensive because lncRNAs are generally long, with several hundred to thousands of nucleotides, and the size of the folding space of an ncRNA sequence increases exponentially with the sequence length [1]. Thus, homology-based functional identification is not practical for lncRNAs.

A number of studies have integrated high-throughput genomic technologies such as microarrays and next-generation sequencing (NGS) to identify functional ncRNAs in the past decades [27]. These technologies have expanded the ability of scientists to detect various types of ncRNAs, including lncRNAs. Nevertheless, lncRNAs are generally expressed at low levels [28] and are less conserved [10], which has impeded their discovery and functional study. Moreover, lncRNAs share similarities with protein-coding transcripts, in that they resemble messenger RNAs (mRNAs) with respect to transcript length and splicing structure [29]. Many lncRNAs are found to have coding potential; for instance, H19, Xist, Mirg, Gtl2, and KcnqOT1 all have putative ORFs greater than 100 amino acids, yet have been characterized as functional ncRNAs [30]. These properties pose great challenges to lncRNA identification. Therefore, accurately distinguishing long non-coding transcripts from protein-coding transcripts is a critical first step toward comprehensively understanding their biogenesis and the genetic information underlying them.

To address the challenges in ncRNA identification, we have proposed three tools: grammar-string based alignment, KnotShape and KnotStructure, and LncRNA-ID. A grammar string is a novel secondary structure representation based on a context-free grammar. By encoding secondary structures in grammar strings, ncRNA structural alignment is reduced to a simple sequence alignment.
We can then predict the secondary structures of ncRNAs with less computational complexity. KnotShape and KnotStructure use the shapes of secondary structures to reduce the search space of structure prediction, classifying similar structures into the same group based on the topology of each structure. The number of shapes is much smaller than the number of initial structures; with a small set of potential structures, multiple sequence alignment becomes more practical and efficient, especially for structures with pseudoknots. LncRNA-ID is designed to identify lncRNAs among protein-coding transcripts. It applies balanced random forest learning to construct a classification model, so LncRNA-ID can maintain steady performance with imbalanced training data, where the numbers of protein-coding transcripts and lncRNAs are intrinsically very different [10].

Chapter 2 Grammar string: a novel ncRNA secondary structure representation

2.1 Background

Comparative ncRNA identification, which searches for ncRNAs through evidence of evolutionary conservation, is the state-of-the-art methodology for ncRNA finding. Stochastic context-free grammar (SCFG) [31] provides a powerful way to encode both sequence and structural conservation. A successful application of SCFG is ncRNA classification, which classifies query sequences into annotated ncRNA families such as tRNA, rRNA, and riboswitch families. Other secondary structure representations, such as base pair probability matrices [32, 33, 34], tree profiles [35, 36], and stem graphs [37], have been used in RNA alignment, an important step in novel ncRNA detection. These alignment methods first infer the possible structures of each input sequence and then conduct structural alignment, whose accuracy and efficiency are highly dependent on the structural representation.
Despite the promising output of existing alignment tools, many existing secondary structure representations are highly complicated, incurring high computational cost during alignment. Even with various heuristics or pruning techniques to reduce the time complexity, ncRNA structural alignment is still more computationally intensive than pure sequence alignment and scales poorly with the number and length of input sequences. Therefore, it remains important to develop an efficient and accurate structural modeling and comparison method. In this work, we design a novel secondary structure representation and show its application in consensus structure derivation through multiple ncRNA alignment. Our two contributions are listed below.

First, we design and implement the grammar string, a novel ncRNA secondary structure representation. A grammar string is defined on a special alphabet constructed from a carefully chosen context-free grammar (CFG). It encodes how this CFG generates an ncRNA sequence and its secondary structure. Compared to other secondary structure representations, grammar strings are simple and can take advantage of well-developed algorithms on sequences or strings. For example, grammar strings can convert ncRNA alignment into sequence alignment without losing any structural conservation, enabling highly efficient RNA alignment algorithms. In addition, supporting theories for sequence alignment, such as score table design and Karlin-Altschul statistics [38], can be applied to grammar string alignment. Beyond alignment, grammar strings have potential for applications such as ncRNA sequence database indexing, ncRNA clustering, and profile HMM-based ncRNA classification. It is worth mentioning that other string-based secondary structure representations [39, 40, 41] exist. However, those methods focus on deriving ncRNAs' similarities without resorting to alignment and thus cannot be directly applied to consensus structure derivation from homologous ncRNAs.
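To make the reduction from structural alignment to string alignment concrete, the sketch below runs textbook Needleman-Wunsch global alignment on two plain strings; once structures are encoded as grammar strings, the same machinery applies to them unchanged. The score values here are illustrative placeholders, not the score table designed in Section 2.3.1:

```python
# Global (Needleman-Wunsch) alignment score over two token strings.
# A grammar string is just a string, so the identical routine can align
# encoded structures; the toy scores below are ours, not the thesis's.
def nw_align(a, b, match=2, mismatch=-1, gap=-2):
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # gaps along the left edge
        dp[i][0] = i * gap
    for j in range(1, m + 1):          # gaps along the top edge
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + s,   # (mis)match
                           dp[i - 1][j] + gap,     # gap in b
                           dp[i][j - 1] + gap)     # gap in a
    return dp[n][m]

# Prefixes of the two tRNA grammar strings from Figure 2.1:
print(nw_align("cPPPPPPPUA", "aPPPPPPPAA"))  # -> 14
```

In the dissertation's setting, the alphabet is the grammar-string alphabet rather than nucleotides, and the real score table rewards matches between structural characters (e.g., paired-position symbols) differently from sequence characters.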
The second contribution is that we develop an effective method to exclude errors introduced by ab initio structure prediction. Many ncRNA alignment programs [35, 36, 32, 33, 34, 37] align predicted structures output by RNA folding tools. However, the optimal prediction may not be the native structure [42], creating a need to choose plausible structures as input to multiple alignment. In this work, we propose an efficient pattern matching method to pre-select predicted structures that are highly likely to be the true structure. This pre-screening can be used to reduce errors introduced by ab initio structure prediction and to remove contaminated sequences that are not homologous to the others.

Existing ncRNA alignment methods can be roughly classified into three basic types. The first type aligns and folds simultaneously. The most accurate algorithm of this type was developed by Sankoff [43]. However, it is prohibitively expensive, with time complexity O(L^(3N)) and memory complexity O(L^(2N)), where L and N are the length and number of input sequences, respectively. Variants of the Sankoff algorithm, such as Stemloc [44], Consan [45], and MARNA [46], have been proposed to reduce the computational time of multiple alignment. The second type of methods first builds a sequence alignment and then folds the alignment [47, 48, 22, 49]. They infer structures from pre-aligned sequences generated using MULTIZ [50], ClustalW [51], or other available sequence alignment programs. The accuracy of these tools is largely affected by the alignment quality; in particular, when homologous ncRNA sequences share only structural similarity, building a meaningful sequence alignment becomes difficult. The third type of methods folds the input sequences and then conducts structural alignment, yielding higher accuracy. Tools in this category differ in their secondary structure modeling methods.
Although some of them used a restricted Sankoff algorithm in their implementations, we classify them into the "fold and then align" category because they apply structure prediction in the first step. As our grammar string based alignment belongs to the third category, we discuss related "fold and then align" tools below, focusing on their secondary structure representations. Several programs encode secondary structure using base pair probability matrices derived from McCaskill's approach [18, 52]. NcRNA alignment is then converted into base pair probability matrix alignment. However, base pair probability matrix comparison is highly resource demanding. For example, pmcomp [32] takes O(n^4) memory and O(n^6) operations to align a pair of sequences of length n. More recent implementations such as LocARNA [33] and FOLDALIGNM [34] applied various restrictions or pruning techniques to reduce the time complexity, but they are still much more expensive than sequence alignment.

[Figure 2.1: the 2D cloverleaf drawings of the two tRNA secondary structures are omitted here.]

tRNA_1: GUAAAUAUAGUUUAACCAAAACAUCAGAUUGUGAAUCUGACAACAGAGGCUCACGACCCCUUAUUUACC
(((((((..((((.....)))).(((((.......)))))....((.((.......)).))))))))).
Grammar string: cPPPPPPPUA#caac#aPPPPAACCA|PPPPPUUGUGAA|PPAcPPCUCACGA|

tRNA_2: ACUUUUAAAGGAUAACAGCCAUCCGUUGGUCUUAGGCCCCAAAAAUUUUGGUGCAACUCCAAAUAAAAGUA
(((((((..((((.......)))).((((.........))))....(((((.......)))))))))))).
Grammar string aPPPPPPPAA#uaaa#gPPPPAACAGCC|PPPPUCUUAGGCC|PPPPPUGCAACU|
Pairwise alignment
cPPPPPPPUA#aac#aPPPPAAC--CA|PPPPP-UUGUGAA|PPPPPCUCACGA|
aPPPPPPPAA#aaa#gPPPPAACAGCC|PPPPPCUUAGGC-|PPPPPUGCAACU|
******* **** * ******* * ****** ** * ****** ** *
Figure 2.1 Two tRNA sequences from the human genome and the alignment of their grammar strings. The stars below the alignment denote exact matches.

RNAforester [35, 36] used tree profiles to represent secondary structures. Algorithms for tree alignment are applied for pairwise and multiple alignment computation. The asymptotic efficiency depends on the number of nodes in the tree representation and the maximum degree d of a tree node. For n structures of average size s, their pairwise algorithm has time complexity O(s^2 d^2) and space complexity O(s^2 d). RNAforester can achieve higher efficiency than base pair probability matrix comparison. However, it is reported [33] that it tends to produce many alignment columns that contain mostly gap characters in the multiple alignment mode. Carnac [37] used stem graphs to represent secondary structures. However, its program cannot accept more than 15 input sequences, limiting its practical usage.
2.2 Method
Inspired by Jaakkola and Haussler's discriminative classification method [53], we introduce the grammar string, a representation of an ncRNA sequence in the parameter space of a context-free grammar (CFG). Specifically, each ncRNA sequence and its secondary structure are transformed into a string defined on a new alphabet, where each character corresponds to a production rule in a CFG. We first introduce an unambiguous CFG for ncRNA sequence generation. Using the chosen CFG as an example, we formally define grammar strings for modeling an ncRNA sequence and its secondary structure.
2.2.1 An unambiguous CFG for ncRNA generation
NcRNA structures without pseudoknots can be derived by CFGs [31].
A CFG is defined by a set of nonterminals, a set of terminals, a start nonterminal, and a set of production rules of the form V → α, where V is a single nonterminal symbol and α is a string of terminals and/or nonterminals. By recursively replacing nonterminals on the right hand side of each production rule, an ncRNA sequence and its secondary structure can be derived from a CFG. In this work, all our ncRNA sequences and their structures will be generated from G4, a light-weight CFG introduced by Dowell and Eddy [54], using leftmost derivation. Following the general definition of a CFG, G4 has a finite set of nonterminal symbols V = {S, T}, a finite set of terminal symbols {A, C, G, U, ε}, and a finite set of production rules defined as below:
• S → aS | T | ε
• T → Ta | aSâ | TaSâ
where a ∈ {A, C, G, U} and â ∈ {A, C, G, U}; a and â form complementary base pairs such as A-U and G-C. In order to generate the unstructured single strand 'C' at the 3' end and the two outermost base pairs in sequence tRNA_1 in Figure 2.1, the following production rules from G4 are called: S → T, T → TC, T → GSC, S → T, T → USA. Continuing to replace S by correctly chosen production rules, we can derive tRNA_1. The sequence of production rules used for ncRNA structure generation is called a derivation. Using leftmost derivation, an unambiguous CFG can guarantee a unique derivation for a given ncRNA sequence and its secondary structure. For example, using the unambiguous grammar G4, we have only one choice when choosing a production rule to derive the tRNA_1 secondary structure in Figure 2.1. For a more detailed introduction to unambiguous CFGs, we refer readers to the review by Dowell and Eddy [54], where several light-weight unambiguous CFGs including G4 are discussed.
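To make the leftmost derivation concrete, the following Python sketch replays a G4 derivation and reports the sequence and dot-bracket structure generated so far. The tuple encoding of the rules (e.g. ('T->aSa', 'G', 'C') for T → GSC) is our own illustration, not part of the dissertation's implementation.

```python
def replay(derivation):
    """Replay a leftmost G4 derivation. Each step is a tuple: a rule tag plus
    the bases it emits, e.g. ('T->aSa', 'G', 'C') for T -> G S C.
    Returns the sequence and dot-bracket structure generated so far;
    not-yet-expanded nonterminals show up as 'S' or 'T'."""
    form = [('N', 'S')]                       # sentential form; start symbol S
    for step in derivation:
        # leftmost derivation: expand the leftmost nonterminal
        k = next(i for i, sym in enumerate(form) if sym[0] == 'N')
        tag = step[0]
        if tag == 'S->aS':
            repl = [('t', step[1], '.'), ('N', 'S')]
        elif tag == 'S->T':
            repl = [('N', 'T')]
        elif tag == 'S->eps':
            repl = []
        elif tag == 'T->Ta':
            repl = [('N', 'T'), ('t', step[1], '.')]
        elif tag == 'T->aSa':                 # emits a base pair
            repl = [('t', step[1], '('), ('N', 'S'), ('t', step[2], ')')]
        elif tag == 'T->TaSa':                # bifurcation
            repl = [('N', 'T'), ('t', step[1], '('), ('N', 'S'), ('t', step[2], ')')]
        else:
            raise ValueError(tag)
        form = form[:k] + repl + form[k + 1:]
    seq = ''.join(sym[1] for sym in form)
    struct = ''.join(sym[2] if sym[0] == 't' else sym[1] for sym in form)
    return seq, struct
```

Replaying the five rules of the tRNA_1 example (S → T, T → TC, T → GSC, S → T, T → USA) yields the partial sequence GU...ACC with two outer pairs and an unpaired 3' C.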
2.2.2 Grammar string generation algorithm
Each ncRNA secondary structure has a unique leftmost derivation from an unambiguous CFG, producing a one-to-one mapping between a structure and a production rule sequence. Intuitively, homologous ncRNAs with similar structures will share similar derivations. This motivates us to represent an ncRNA sequence and its secondary structure in the parameter space of a CFG. Thus, ncRNA structural comparison is converted to the comparison of their derivations. In order to represent an ncRNA structure using its derivation, we introduce a new alphabet, where each character corresponds to a production rule in a CFG. One example alphabet derived from G4 is defined below.
• Use the upper case character of a to represent production rule S → aS. For example, use A to represent S → AS.
• Use | to represent S → ε.
• Use the lower case character of a to represent production rule T → Ta. For example, use c to represent T → TC.
• Use P to represent the base pair emission T → aSâ.
• Use a special character # to indicate the branching T → TaSâ.
• No character is needed for production rule S → T.
Thus, the new alphabet is A = {A, C, G, U, a, c, g, u, P, |, #}. If these production rules are used on DNA sequences, we can simply replace U (u) with T (t). For brevity, we name a string defined on the above alphabet a grammar string. As an example, the derivation for generating the unstructured single strand 'C' at the 3' end and the two outermost base pairs in sequence tRNA_1 of Figure 2.1 is: S → T, T → TC, T → GSC, S → T, T → USA. Thus, the corresponding grammar string is "cPP" using the alphabet A. Note that we don't distinguish different base pairs (i.e. A-U, G-C, and G-U if allowed) in a grammar string. All base pairs are represented as 'P' in order to maximize the alignment score between homologous ncRNAs that share high structural similarity but low sequence similarity.
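In practice, a grammar string can be read directly off a sequence and its dot-bracket structure; this is the idea behind the generation algorithm of Figure 2.2. A minimal Python sketch (function names are our own):

```python
def pair_map(struct):
    """For a dot-bracket string, map each position to its partner (or None)."""
    stack, partner = [], [None] * len(struct)
    for i, c in enumerate(struct):
        if c == '(':
            stack.append(i)
        elif c == ')':
            j = stack.pop()
            partner[i], partner[j] = j, i
    return partner

def grammar_string(seq, struct):
    """Emit the grammar string over {A,C,G,U,a,c,g,u,P,|,#} for a sequence
    and its dot-bracket secondary structure."""
    partner, out = pair_map(struct), []

    def parse(i, j):
        if i > j:                        # S -> epsilon
            out.append('|')
        elif partner[i] is None:         # 5' single-stranded base: S -> aS
            out.append(seq[i].upper())
            parse(i + 1, j)
        elif partner[j] is None:         # 3' single-stranded base: T -> Ta
            out.append(seq[j].lower())
            parse(i, j - 1)
        elif partner[i] == j:            # closing pair: T -> aS a-hat
            out.append('P')
            parse(i + 1, j - 1)
        else:                            # bifurcation: T -> Ta S a-hat
            out.append('#')
            k = partner[j]
            parse(i, k - 1)
            parse(k, j)

    parse(0, len(seq) - 1)
    return ''.join(out)
```

For a simple hairpin such as GAAAC with structure (...), this produces PAAA|: one base pair, three 5'-side loop bases, and the terminating S → ε.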
Figure 2.1 shows the utility of grammar strings in detecting structural similarity between two tRNA sequences from the human genome. Because of low sequence similarity, BLAST [55] fails to align them. However, their structural similarity yields a meaningful global alignment between their corresponding grammar strings with 69% identity.

void parse(i, j) {
    if i > j
        print '|'; return;
    else if Xi is a single stranded base
        print uppercase of Xi; i++; parse(i, j);
    else if Xj is a single stranded base
        print lowercase of Xj; j--; parse(i, j);
    else if Xi and Xj form a base pair
        print 'P'; i++ and j--; parse(i, j);
    else
        print '#';
        k = the position that forms a base pair with Xj;
        parse(i, k-1); parse(k, j);
}
Figure 2.2 Algorithm for generating a grammar string for substring Xi..j.

In theory, our grammar string generation process consists of two steps. First, write the production rule sequence for an ncRNA sequence and its secondary structure. Second, transform the sequence of production rules into a grammar string according to the definition of the grammar string alphabet. In practice, we use an efficient dynamic programming algorithm to generate a grammar string for an ncRNA structure directly, skipping the step of parsing an ncRNA sequence using a CFG. The algorithm has time complexity O(L^2), where L is the length of the ncRNA sequence. Let X be an ncRNA sequence with its predicted or annotated secondary structure, let i and j be indexes in X, and let Xi be the base at position i. Figure 2.2 sketches the algorithm generating a grammar string for substring Xi..j. In order to generate the complete grammar string for sequence X, one should call parse(1, L).
2.2.3 Grammar pattern for encoding stem structures
The number of stems and their relationship largely define the basic "shape" of a secondary structure. For example, the cloverleaf structure of a tRNA sequence consists of four stems: acceptor stem, D stem, anticodon stem, and TΨCG stem.
The precursor structure of a miRNA usually contains only one stem. According to the definition of grammar strings, the three characters P, #, and | encode the number and relative positions of all stems in an ncRNA secondary structure. If we simply remove all single stranded regions (i.e. substrings consisting only of A, C, G, U, a, c, g, u) from a grammar string, we can use the simplified grammar string to represent the abstract stem structure of an ncRNA sequence. For brevity, we name a simplified grammar string a grammar pattern, which is a string defined on the reduced alphabet {P, #, |}. A grammar string can be converted into a grammar pattern in two steps: 1) remove all substrings representing single stranded regions, and 2) reduce every substring consisting of only Ps to a single P. Thus, the grammar pattern for sequence tRNA_1 in Figure 2.1 is P##P|P|P|, where each P denotes a stem. There are four Ps, denoting four stems. The end of each stem is marked by |. The number of #s defines the number of bifurcations. Different distributions of the same number of stems can yield highly different secondary structures. Figure 2.3 shows how grammar patterns can account for different structures with the same number of stems. Note that all these grammar patterns are generated using G4 as the chosen CFG. If other unambiguous CFGs are used to generate grammar strings for the same structures, different sets of grammar patterns might be produced. Ignoring all single stranded regions and the length of each stem, grammar patterns only provide a coarse-grained description of ncRNA secondary structures. However, because of the high efficiency of pattern matching, grammar patterns can be used to speed up grammar string comparison. For example, we do not expect significant structural similarities between a tRNA and a miRNA sequence.
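The two conversion steps can be sketched in a few lines of Python (regular expressions are our implementation choice):

```python
import re

def grammar_pattern(gs):
    """Reduce a grammar string to its grammar pattern over {P, #, |}:
    1) drop all single-stranded characters (A,C,G,U,a,c,g,u),
    2) collapse each run of consecutive P's into a single P."""
    stems_only = re.sub(r'[ACGUacgu]', '', gs)
    return re.sub(r'P+', 'P', stems_only)
```

Applied to the two tRNA grammar strings of Figure 2.1, both collapse to the cloverleaf pattern P##P|P|P|.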
Instead of using a Needleman-Wunsch [56] like alignment algorithm between their grammar strings, a constant time grammar pattern matching program can be applied as a filtration step. This filtration is particularly important when we aim to derive the consensus structure of multiple putatively homologous ncRNAs. Although these sequences are expected to be sequenced from the same gene family, it is possible that some of the sequences are from other regions. Thus, we can use the grammar pattern matching technique to exclude contaminated sequences, ensuring a multiple sequence alignment of good quality. The same technique can be used to remove possible errors introduced by MFE-based secondary structure prediction tools.
2.3 Using grammar strings for multiple ncRNA structural alignment
In this work, we show the utility of grammar strings in deriving consensus structures through multiple ncRNA alignment, which has wide applications in both known ncRNA classification and novel ncRNA search.
2.3.1 Score table design for grammar string alignment
Pairwise alignment is a fundamental step to multiple alignment and clustering. Existing alignment algorithms such as Needleman-Wunsch [56] can be directly applied to grammar strings when a score table defined on the grammar string alphabet is imported. Following the common practice in score table design, we use the likelihood ratio to derive the score between every pair of characters in the grammar string alphabet A. For each pair of characters a, b in A, the score between a and b is s(a, b) = log(Pr(a, b) / Pr0(a, b)), where Pr(a, b) is the target probability of a, b in a set of true alignments and Pr0(a, b) is the background probability that a and b are aligned. Assuming that a and b are independent in the background model, we get Pr0(a, b) = Pr0(a) × Pr0(b).
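As an illustration of this log-odds scoring scheme, here is a toy Python sketch; the counts passed to it are hypothetical stand-ins for the Rfam-derived statistics described below, not actual trained values:

```python
import math

def score(target_counts, bg_counts, a, b):
    """Log-odds score s(a, b) = log(Pr(a, b) / (Pr0(a) * Pr0(b))).
    target_counts: counts of aligned character pairs from trusted alignments;
    bg_counts: single-character counts from a background sample."""
    total_pairs = sum(target_counts.values())
    total_chars = sum(bg_counts.values())
    if a == b:
        pr = target_counts.get((a, a), 0) / total_pairs
    else:  # pairs are unordered, so count both orientations
        pr = (target_counts.get((a, b), 0) + target_counts.get((b, a), 0)) / total_pairs
    pr0 = (bg_counts[a] / total_chars) * (bg_counts[b] / total_chars)
    return math.log(pr / pr0)
```

Characters that co-occur in true alignments more often than chance get positive scores; rarer pairings (e.g. P aligned with |) get negative scores.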
Because the ncRNA family database Rfam [57] provides a large number of annotated ncRNA sequences, their alignments, and their associated secondary structures, we obtain both the target and the background probabilities from Rfam. In summary, we present the following steps for designing a score table for grammar string alignment.
1. Build an alignment training set by randomly picking a large number of pairwise ncRNA alignments from Rfam 9.1's seed alignments. Some criteria are applied to select alignments with reasonably high quality. For example, if a pairwise alignment contains too many gaps, it is not included in the training set. After applying the selection criteria, we had 18487 pairwise alignments in the training set.
2. Transform each pairwise ncRNA sequence alignment into an alignment between grammar strings using the secondary structure annotations given by Rfam.
3. Compute the target probability Pr(a, b) for each pair of aligned characters a, b in the above grammar string alignments.
4. Generate grammar strings for a large number of ncRNA sequences that are randomly picked from the full families of Rfam 9.1. Compute the background probabilities Pr0(a) and Pr0(b) from these grammar strings.
The complete score table for grammar string alignment can be found at our website1. All exact matches have large positive scores, and the bifurcation starting character # and the stem ending character | can only be aligned with themselves or cause a gap. This is consistent with our intuition because it is not meaningful to align a bifurcation character with a base pair or a single stranded base.
1 http://www.cse.msu.edu/∼yannisun/grammar-string

[Figure 2.3: four stem layouts, with grammar patterns P##P|P|P|, #P#P|P|P|, #P|P#P|P|, and ###P|P|P|P|]
Figure 2.3 Four different stem structures and their grammar patterns. The left column shows the 2D representation of an ncRNA folding. The right column shows the distributions of stems along an ncRNA sequence.
All grammar patterns are generated using G4 (our chosen unambiguous context-free grammar). Insertions or deletions of 'P' or single stranded characters correspond to insertions or deletions of a base pair or single stranded bases in the ncRNA sequence alignment. Empirical experiments were conducted to choose default values for the gap opening and extension costs. The default gap opening cost is slightly larger in magnitude than the lowest number in the grammar string score table. The default gap extension cost is set to 1/10 of the opening cost. We assign bigger gap penalties to the structural characters # and | in order to force corresponding stems or single stranded regions to be aligned together.
2.3.2 Multiple ncRNA alignment using grammar strings
The major steps of aligning multiple ncRNA sequences are sketched below.
1. Use an ab initio secondary structure prediction tool to predict both the optimal and suboptimal structures of each input sequence.
2. Generate a grammar string for each predicted secondary structure. If an ncRNA sequence has more than one structure predicted, multiple grammar strings will be generated.
3. Transform each grammar string into a grammar pattern. Use a voting mechanism to choose the most popular grammar pattern, which most likely represents the native stem structure shared by the input sequences. All grammar strings that are not consistent with the chosen grammar pattern are discarded.
4. Apply a progressive multiple sequence alignment method to the remaining grammar strings.
5. Derive the consensus secondary structure from the multiple grammar string alignment. Transform the grammar string alignment into an ncRNA sequence alignment using the ncRNA sequences and their predicted structures as references.
2.3.2.1 Structure prediction
Various tools exist to predict the secondary structure of a single input sequence. A majority of them search for structures with the minimum free energy (MFE) using a large number of experimentally derived energy parameters.
The representative implementations include Mfold [15], RNAstructure [16, 17], McCaskill's base pairing probability computation [18], etc. MFE-based methods can also be combined with other probabilistic models such as the conditional log-linear models (CLLMs) in ContraFold [58] for structure prediction. In our experiments, we choose the MFE-based tool UNAFold [59, 60] for structure prediction for the following reasons. 1) It has a user-friendly interface for both the web-site based and standalone tools. 2) It can generate both the optimal and suboptimal structures. It has been shown that a suboptimal prediction rather than the optimal one can be the "correct" structure [42]. Thus, being able to output suboptimal structures increases the chance of correct structure prediction for each input sequence. Empirically, we also tested other folding tools such as ContraFold on our test sequences. However, no clear advantage was observed.
2.3.2.2 Multiple grammar string alignment
We apply progressive alignment to multiple grammar strings. In the first stage, a guide tree is built based on all-against-all pairwise similarities and the unweighted pair group method with arithmetic mean (UPGMA). In the second stage, the multiple sequence alignment is grown using the guide tree. The sum-of-pairs score is used to evaluate the similarity between a character and a column in an alignment or between two columns from two alignments. When we build the guide tree, several methods are used to convert an alignment score to a sequence distance. The first distance definition comes from Feng and Doolittle [61]: D = −ln[(Sreal(i,j) − Srand(i,j)) / (Siden(i,j) − Srand(i,j))], where Sreal(i,j) is the observed alignment score between sequences i and j, Siden(i,j) is the average of the two scores of the two sequences compared with themselves, and Srand(i,j) is the alignment score between two random sequences with the same length and composition as i and j. We applied shuffling to sequences i and j to obtain Srand.
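The Feng and Doolittle conversion can be written directly from the formula (a sketch; the shuffling step needed to estimate Srand is omitted):

```python
import math

def feng_doolittle(s_real, s_iden, s_rand):
    """Feng-Doolittle score-to-distance conversion:
    D = -ln((S_real - S_rand) / (S_iden - S_rand)).
    Identical sequences give D = 0; scores approaching the random
    level give increasingly large distances."""
    return -math.log((s_real - s_rand) / (s_iden - s_rand))
```

For example, an observed score halfway between the random score and the self-alignment score yields D = ln 2.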
Besides the Feng and Doolittle distance conversion method, we also evaluated several other simple distance definitions. The "Simple Distance" model defines D = 1/(Sreal(i,j)/L), where L is the alignment length. The "No-random FD" model defines D = −ln(Sreal(i,j)/Siden(i,j)). Our empirical results show that both "Simple Distance" and "No-random FD" generate better alignments than the more complicated Feng and Doolittle distance.
2.3.3 Using grammar patterns to reduce errors caused by ab initio structure prediction
The predicted structures for the same ncRNA sequence can differ significantly. It is important to align only structures that are likely to be consistent with the native structure of the homologous sequences. UNAFold [59] allows users to control the number of produced suboptimal structures by specifying a range of allowed thermodynamic energy values ∆G. Suboptimal structures can be highly different from the optimal structure for some ncRNA sequences. For example, tRNAs, which have functional cloverleaf structures, can be folded in different ways with reasonably small ∆Gs. Figure 2.4 shows four different structures output by UNAFold for one tRNA sequence. Even worse, the true structure may not always be the optimal prediction with the minimum ∆G. Thus, it is not sufficient to keep only the optimal prediction, as the correct structure may come from the sub-optimal predictions. In this section, a grammar pattern based screening approach is introduced to remove the contamination of wrong predictions before alignment. As MFE-based ab initio structure prediction has limited accuracy, we increase structural prediction accuracy by using both the MFE and sub-optimal predictions. As a result, multiple grammar strings are derived for a single ncRNA. However, only one grammar string from each ncRNA should be used for alignment. We thus choose a set of grammar strings so that the sum of their pairwise similarities is maximized. Let m be the size of the input ncRNA set S.
For each ncRNA s_i, N_i structures and their associated thermodynamic energy values ∆G are predicted. Each predicted structure is converted into a grammar string. Thus, the output of the ab initio structure prediction is a set of grammar strings and their associated ∆G values: {(s_1^1, ∆G_1^1), (s_1^2, ∆G_1^2), ..., (s_1^{N_1}, ∆G_1^{N_1}), ..., (s_m^{N_m}, ∆G_m^{N_m})}, where s_i^j is the grammar string derived from the jth structure prediction for s_i and ∆G_i^j is the associated thermodynamic energy value. The goal is to choose a grammar string x_i for each ncRNA s_i so that the following objective function is maximized:

argmax_{x_1, x_2, ..., x_m} Σ_{i, j ∈ S, i ≠ j} sim(s_i^{x_i}, s_j^{x_j})    (2.1)

where sim(s_i^{x_i}, s_j^{x_j}) is the similarity between two grammar strings derived from different ncRNAs; sim(s_i^{x_i}, s_j^{x_j}) is minus infinity when i == j. Solving the above equation takes exponential running time. Thus, we propose two heuristics to reduce the time complexity. First, based on the observation that predicted structures for the same ncRNA can have highly different topologies (refer to Figure 2.4), we first determine the abstract shape using grammar pattern matching. Grammar strings that can be converted into the chosen grammar pattern constitute the popular string set. Others are discarded. Second, we apply an approximation algorithm on the popular string set to solve the above equation. We first describe the algorithm of using the most popular grammar pattern as the representative abstract shape.

[Figure 2.4: four predicted structures, with ∆G = -23.5, -23.7, -23.3, and -22.6]
tRNA: UCCUCAGUAGCUCAGUGGUAGAGCGGUCGGCUGUUAACUGACUGGUCGUAGGUUCAAAUCCUACUUGGGGAG
Figure 2.4 Four highly different structures predicted by UNAFold for the tRNA sequence shown at the bottom. The numbers beside each structure are their ∆G values. The cloverleaf structure has a bigger ∆G than the other predictions.

1. Denote the grammar pattern of grammar string s_i^x as o_i^x.
Thus, for m ncRNAs, we have a set of grammar patterns and their associated free energy values: {(o_1^1, ∆G_1^1), (o_1^2, ∆G_1^2), ..., (o_1^{N_1}, ∆G_1^{N_1}), ..., (o_m^{N_m}, ∆G_m^{N_m})}. Usually ∆G < 0.
2. Choose the grammar pattern that is shared by most input sequences. For each different grammar pattern o derived from the previous step, compute the function:

f(o) = Σ_{i=1..m} min_{x=1..N_i} {∆G_i^x | o_i^x == o}    (2.2)

When the set {∆G_i^x | o_i^x == o} is empty, min(∅) = 0. The grammar pattern o with the smallest f(o) is the preferred structure of the input ncRNA sequences. Denote this chosen grammar pattern as o*.
3. Of the multiple grammar strings generated for each ncRNA sequence, only grammar strings that can be converted to o* are kept in the popular string set for further processing.
For m input sequences, there are N_total = Σ_{1≤i≤m} N_i structures predicted. Choosing the most popular grammar pattern has linear time complexity O(N_total).

Table 2.1 Structure predictions for three tRNA sequences. Multiple structure predictions are output for each sequence. For each prediction, the column named "stems" displays its stem structure denoted by brackets. The corresponding grammar pattern and ∆G are listed in columns 3 and 4, respectively.
ID      stems       grammar pattern    ∆G
seq 1   (()()())    P##P|P|P|          -38.7 *
        (()()())    P##P|P|P|          -37
seq 2   (()()())    P##P|P|P|          -32.6 *
        (()()())    P##P|P|P|          -32
        (()()())    P##P|P|P|          -31.6
seq 3   ()          P                  -23.6
        (()())      P#P|P|             -23

Once we have the popular string set, we choose m grammar strings from it to solve Eqn. (2.1). We apply an approximation algorithm based on guide tree building to choose x_i. Essentially, we build a guide tree from the popular string set until the tree contains a subtree with m leaves. As the similarity between grammar strings derived from the same ncRNA sequence is minus infinity, the first subtree with m leaves contains m grammar strings from m inputs with a relatively large sum of pairwise similarities.
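The pattern-voting step of Eqn. 2.2 can be sketched as follows; the data layout (a list of (pattern, ∆G) predictions per sequence) is our own assumption:

```python
def choose_pattern(predictions):
    """Energy-weighted voting for the consensus grammar pattern (Eqn. 2.2).
    predictions[i] is a list of (grammar_pattern, delta_G) pairs for the
    i-th input sequence. Since delta_G < 0, the pattern minimizing f(o)
    is the one supported by many low-energy predictions."""
    patterns = {p for preds in predictions for p, _ in preds}
    f = {}
    for o in patterns:
        total = 0.0
        for preds in predictions:
            dgs = [dg for p, dg in preds if p == o]
            if dgs:                # min over this sequence's matching dGs;
                total += min(dgs)  # an empty set contributes 0
        f[o] = total
    return min(f, key=f.get)
```

With the predictions of Table 2.1, the cloverleaf pattern P##P|P|P| wins with f(o) = -38.7 + -32.6 = -71.3.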
In order to build the tree, we need to conduct an all-against-all comparison, which takes O(N_total^2) time. Supposing there are d internal nodes in the tree when it finishes, the total running time is O(N_total^2 + d·N_total). An example of choosing the grammar pattern and building the popular string set for tRNAs is shown in Table 2.1. There are three different grammar patterns in Table 2.1: P##P|P|P|, P#P|P|, and P. Following the definition of f(o) in Eqn. 2.2, f(P##P|P|P|) is the sum of the ∆Gs of the grammar patterns marked with *. Thus, f(P##P|P|P|) = -71.3, which is smaller than f(P) and f(P#P|P|). Therefore, P##P|P|P| is the chosen abstract shape for the three tRNAs. Note that no grammar string is chosen from "seq 3" because none of them can be converted to the chosen grammar pattern. Thus, the popular string set has size 5. We will build a guide tree from the five grammar strings derived from "seq 1" and "seq 2". The guide tree stops at the first step because a subtree containing two leaves is formed. As a result, an alignment will only be conducted between the two grammar strings with ∆G = -38.7 and -32.6. Applying the same method to 20 tRNA sequences, we found that the cloverleaf structure with four stems is the consensus structure shared by a majority of tRNA sequences. We repeated our experiments using different energy parameters. The dominant structure remains the cloverleaf, although the second most popular structure alternates between a long hairpin and a three-stem structure (i.e. "(()())"). After discarding grammar strings that are not consistent with the chosen structure, we align the remaining grammar strings using a progressive alignment method.
2.4 Results
First, we conducted multiple sequence alignment for 20 tRNA sequences, which were used as an example of handling errors introduced by structure prediction programs in Section 2.3.3. Figure 2.5 shows the consensus secondary structure derived from aligning the grammar strings of the given tRNAs.
We also tested other structural alignment programs, including pmmulti [32], Murlet [62], RNAforester [35, 36], MARNA [46], and LocARNA [33]. Figure 2.5 shows that the grammar string alignment and Murlet both generate the best consensus structure for the tRNA sequences. Second, we used grammar strings to generate multiple ncRNA alignments for 452 families that were randomly chosen from BRAliBase 2.1, an enhanced RNA alignment benchmark [63]. This data set contains a diverse set of ncRNA families with different average sequence identity, length, and structural conservation. Each family contains 15 ncRNA sequences. Suboptimal structures with minimum free energy values at most 10% higher than the optimal structure were predicted using UNAFold [59, 60] on over 6700 sequences from these 452 families. The average number of suboptimal structures for each sequence is 20. For longer ncRNA sequences (length around 300), the number of suboptimal structures is close to 50. For short ones (length < 50), there are only a couple of suboptimal structures predicted.

Consensus grammar string using IUPAC code
aPPPPPPPUAu#cugn#rPPPPAGUUGGUA|PPPPPYUNANAA|PPPPPUUCRAAU|
Consensus sequence and secondary structure
XXXXXXXUAXXXXAGUUGGUAxxxxRXXXXXYUNANAAxxxxxNGUCXXXXXUUCRAAUxxxxxUxxxxxxxa
(((((((..((((........)))).(((((.......)))))....(((((.......))))).))))))).
[Figure 2.5 panels: structures produced by Grammar string, LocARNA, RNAforester, MARNA, Murlet, and pmmulti]
Figure 2.5 The consensus grammar string of the tRNA alignment and the consensus secondary structure derived from the grammar string. X and x represent complementary base pairs. They can be easily translated into nucleotide bases using the input tRNA sequences. All other structural alignment tools were tested under their default parameters except MARNA. For MARNA, using the default structure prediction option RNAfold (from the Vienna RNA package) generated no base pairs in the consensus structure; thus we used RNAsubopt, which yielded a few more base pairs in the consensus structure. The structure plotted by pmmulti was generated from their consensus sequence and structure, which included only a very small number of base pairs; however, their multiple alignment seemed to contain more base pairs. RNAforester detected fewer complementary mutations and included several inconsistent base pairs such as U-U. LocARNA missed one base pair in one stem. Murlet generated the same structure as our grammar string alignment method.

[Figure 2.6: per-family distances (0-300) to the reference structure for Murlet and Grammar, plotted over family indexes 1-451]
Figure 2.6 The differences between the reference structures (from Rfam) and the predicted consensus structures from the grammar string and Murlet alignments are plotted and compared. Lower numbers indicate higher similarity between the predicted structure and the reference structure.

As Murlet [62], a Sankoff-based algorithm, competes favorably in consensus structure quality with other ncRNA alignment tools, we compare the accuracy of consensus structures predicted from grammar string alignments and Murlet alignments. Since BRAliBase 2.1 only provides the alignments for each family of ncRNA sequences, but not their secondary structures, we extracted the reference structures from Rfam 9.1. In order to extract the consensus structure from a grammar string alignment, a consensus grammar string is first generated from the alignment (one example consensus grammar string is shown in Figure 2.5). Then this consensus grammar string is translated into a secondary structure using a protocol reversed from the one described in Figure 2.2. Murlet outputs the consensus structure along with each alignment. We compare the predicted secondary structures with the reference structures using RNAdistance from the Vienna RNA package. A small distance indicates high similarity.
The differences between the predicted structures and the reference structures for both the grammar string and Murlet alignments are summarized in Figure 2.6. Of the 452 families, grammar string-based alignment produces consensus structures closer to the reference structures in 216 families, and Murlet produces more accurate structures in 206 families. They generate the same consensus structures for 30 families. Some families pose hard cases for both methods, such as IRES_HCV and IRES_Picorna.
Figure 2.7 Consensus structures are derived for multiple families of each type of ncRNA, resulting in an RNAdistance output vector. For each type of ncRNA, the average RNAdistance output for Murlet and grammar string alignment is compared.
Figure 2.8 compares the running time between Murlet and grammar string-based structure prediction. The running time of grammar string alignment largely depends on the size of the popular string set, from which a set of strings is chosen as input to multiple alignment. In order to analyze how grammar string and Murlet perform on each type of ncRNA, Figure 2.7 compares the average RNAdistance output for 25 types of ncRNAs, each of which contains multiple families in BRAliBase 2.1. The figure shows that the grammar string-based method produces more accurate consensus structures than Murlet for 13 types of ncRNAs: 5S_rRNA, Entero_OriR, gcvT, Hammerhead_3, HCV_SLIV, HepC_CRE, Intron_gpII, S_box, SECIS, SPR_bact, THI, tRNA, and yybP-ykoY. Murlet performs better for 10 types of ncRNAs: Entero_5_CRE, Entero_CRE, HCV_SLVII, HIV_FE, HIV_GSL3, HIV_PBS, IRES_HCV, IRES_Picorna, Retroviral_psi, and U2.
Figure 2.8 Running time comparison between grammar string and Murlet. The running time of grammar string is largely decided by the popular string set size. As Murlet uses over 2000 seconds for the family gcvT, we did not include this family in this figure in order to keep the fine scale of the Y-axis.
Thus, grammar string performs slightly better than Murlet in consensus structure derivation. The major cause of the high structural difference for some families is the inaccuracy of the ab initio structure prediction program; our alignment quality relies on the accuracy of the structure prediction program. The prescreening algorithm can choose structures with the same number of stems and bifurcations. However, the predicted structures of a pair of homologous ncRNAs can contain highly different numbers of base pairs, causing low similarity between the derived grammar strings. Instead of using pure ab initio structure prediction tools, we plan to use variants of the Sankoff algorithm to generate consensus structures between a pair of sequences and then use these structures to derive grammar strings.
2.5 Discussion and conclusion
We have described the grammar string, a novel and simple ncRNA secondary structure representation. By encoding secondary structures in grammar strings, ncRNA structural alignment is transformed into sequence alignment. When there is no structural information available for the ncRNA sequences, ab initio or other structure prediction tools are used to derive the secondary structure information needed for grammar string generation. Thus, grammar string alignment quality relies on the accuracy of structure prediction. When the structure prediction is reasonably accurate, grammar string alignment can be highly accurate and efficient for homologous ncRNA consensus structure derivation. Besides building ncRNA structural alignments, grammar strings can be used to encode characterized ncRNA structures, compare different structures, and search for common structural motifs.
In the current grammar string generation algorithm, we do not distinguish different base pairs (G-C, A-U, and U-G if allowed) in order to maximize the alignment score of homologous ncRNA sequences that share strong structural similarity rather than sequence similarity. However, it is worth testing whether an expanded alphabet can increase alignment accuracy. Thus we plan to: 1) distinguish different base pairs in an expanded grammar string alphabet, and 2) use a set of high quality pairwise ncRNA alignments to train a new substitution score table for the new alphabet. In addition, we will evaluate how different alignment methods (such as iterative vs. progressive) and different gap penalties in stems and single-stranded regions affect the final alignment quality.

Chapter 3
Secondary structure prediction of ncRNAs including pseudoknots

3.1 Background

As knowing the secondary structure provides important information for understanding the tertiary structures and thus the functions of ncRNAs, deriving the secondary structures of ncRNAs remains an important research topic in RNA informatics. The pseudoknot is an important structural motif in the secondary structures of many types of ncRNAs. Formally, a pseudoknot occurs when an RNA has two base pairs, i − j and i′ − j′, such that i < i′ < j < j′. Pseudoknots are known to play important functional roles in telomerase RNA, tmRNA, rRNA, some riboswitches, some protein-binding RNAs, viral ribosomal frameshifting signals, etc. [64]. There are two types of structure prediction methods. One is ab initio folding tools. A majority of them [15, 16, 17, 18, 19] search for structures with the minimum free energy (MFE) using a large number of experimentally derived energy parameters. Despite promising progress in ab initio structure prediction methods, their accuracy is still limited: the predicted structure with the MFE may not be the native structure of an ncRNA.
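The formal crossing condition above can be checked directly on a list of base pairs; a minimal sketch (the function name is ours, not from the dissertation):

```python
def is_pseudoknotted(pairs):
    """Return True if any two base pairs cross, i.e. there exist
    pairs (i, j) and (i2, j2) with i < i2 < j < j2 -- the formal
    pseudoknot condition stated in the text."""
    ps = sorted(pairs)
    for a in range(len(ps)):
        i, j = ps[a]
        for b in range(a + 1, len(ps)):
            i2, j2 = ps[b]
            if i < i2 < j < j2:
                return True
    return False

# Nested stem: no pseudoknot
print(is_pseudoknotted([(0, 10), (1, 9), (2, 8)]))   # False
# Crossing pairs (H-type topology): pseudoknot
print(is_pseudoknotted([(0, 10), (5, 15)]))          # True
```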
More accurate structure prediction methods are based on comparative ncRNA analysis, which aligns homologous ncRNAs and derives their consensus structure. A number of ncRNA alignment and structure derivation tools exist. The most accurate method, developed by Sankoff [43], aligns and folds ncRNAs simultaneously; however, it is prohibitively expensive. Even with various heuristics or pruning techniques, ncRNA structural alignment tools are still computationally intensive and scale poorly with the number and length of input sequences. In addition, most existing tools do not allow pseudoknots, an important structural motif in RNAs. Thus, there is a need for an efficient and accurate comparative structure derivation tool that can handle any secondary structure, including pseudoknots. In this work, we introduce a consensus structure derivation approach based on the grammar string, a novel ncRNA secondary structure representation that encodes an ncRNA’s sequence and secondary structure in the parameter space of a context-free grammar (CFG) and of a full RNA grammar including pseudoknots. The main components of our method include:

• A novel ncRNA secondary structure representation named grammar string, which is defined on a special alphabet constructed from the production rules of a formal grammar such as a context-free grammar (CFG) [31]. It encodes how the grammar generates an ncRNA sequence and its secondary structure. Grammar strings are simple and can take advantage of well-developed algorithms on sequences or strings.

• A systematic method to exclude errors introduced by ab initio structure prediction. As the optimal structure output by existing tools may not be the native structure, we design an algorithm to choose correct structures using both the optimal and sub-optimal predictions.

• A general method that can be applied to any structural alignment as long as there is a formal grammar describing the structure.
We have extended grammar strings to handle pseudoknots using the full RNA grammar introduced by Rivas and Eddy [65].

3.2 Method
3.2.1 Grammar string and grammar pattern

Inspired by Jaakkola and Haussler’s discriminative classification method [53], we introduce the grammar string, a representation of an ncRNA sequence and its structure in the parameter space of a grammar. In this method, each ncRNA sequence and its secondary structure are transformed into a string defined on a new alphabet, where each character corresponds to a production rule in a grammar. We use two grammars in this work. For secondary structures without pseudoknots, a context-free grammar (CFG) is sufficient. For secondary structures containing pseudoknots, we choose the full RNA grammar introduced by Rivas and Eddy [65].

3.2.2 Grammar strings for ncRNAs with pseudoknots

The pseudoknot is an important structural motif for ncRNAs. In pseudoknot-free structures, two helical segments are either parallel or nested, but they can also cross over to form a pseudoknot, as shown in Figure 3.1(A). Describing the language of pseudoknots is beyond the ability of a CFG. Recently, Rivas and Eddy [65] presented a full RNA grammar for ncRNAs with pseudoknots. Instead of a general context-sensitive grammar, this grammar adopts a small number of auxiliary symbols to parse sequences with pseudoknots in polynomial time. In this work, to distinguish it from G4 in the previous chapter, we name the full grammar “RE-pseudo”. RE-pseudo has a larger set of nonterminal symbols than G4, V = {W, WB, V^ab, WH, VH^abcd, IS1, IS2}, where WH and VH^abcd are used to generate pseudoknots. There are 43 different production rules in this grammar, and thus we define grammar strings on an alphabet of size 43. The full list of production rules is not given here due to space limitations; we refer readers to the original paper by Rivas and Eddy.
We only list three production rules, which will be used to generate the prefix of a grammar string for the simple pseudoknot in Figure 3.1(A). The three rules, their diagrams, meanings, and the assigned characters can be found in Figure 3.1(C). Based on these conventions, the prefix for the H-type pseudoknot in Figure 3.1(A) is AA%PP. We can also replace P with different characters to distinguish different base pairs.

Figure 3.1 (A) An example of a pseudoknot with the most common topology. (B) The dot-bracket representation for the pseudoknot in (A): <> and Aa are used to distinguish base pairs from the two helical segments in the pseudoknot. (C) Three production rules from RE-pseudo. The diagrams are reproduced from Rivas and Eddy’s original description of this grammar. (D) The first five steps in (A)’s derivation using the grammar RE-pseudo.

In the previous chapter, we used the pseudoknot-free grammar G4 as an example to introduce the grammar pattern. The same idea can be applied to RE-pseudo: we can transform grammar strings for RE-pseudo into grammar patterns using a similar strategy. Although this grammar can replace G4 for any ncRNA secondary structure derivation, it is much more complicated than G4. Thus, we keep G4 for ncRNAs without pseudoknots. The pseudocode for generating grammar strings based on RE-pseudo parsing can be found at http://www.cse.msu.edu/~achawana/grammar-string.

3.2.3 Consensus structure derivation through multiple grammar string alignment

In this section, we sketch the major steps of consensus structure derivation using grammar string alignment.

1. Choose an ab initio secondary structure prediction tool to predict both the optimal and sub-optimal structures for each input sequence.

2. Generate a grammar string for each predicted secondary structure. If an ncRNA sequence has more than one structure predicted, multiple grammar strings will be generated.

3.
Based on the assumption that the correct structure should be shared by a majority of input sequences, we choose one grammar string from each input ncRNA so that the sum of their pairwise similarities is maximized. This step is used to remove possibly wrong predictions from the previous ab initio structure prediction. Detailed algorithms can be found in the next section.

4. Apply progressive multiple sequence alignment on the grammar strings chosen in the previous step and derive the consensus secondary structure.

Various tools [15, 16, 17, 18] exist to predict the secondary structure of a single input sequence. A majority of them search for structures with the minimum free energy (MFE) using a large number of experimentally derived energy parameters. For structure prediction without pseudoknots, we choose UNAFold [59, 60] because: 1) it has been tested extensively; 2) it is easy to apply in large-scale experiments; 3) it can generate both the optimal and suboptimal structures. Empirically, we also tested other folding tools such as ContraFold [58] on our test sequences, but no clear advantage was observed. For similar reasons, we choose HotKnots 2.0 [19] for pseudoknot prediction. To derive the consensus secondary structure, we conduct progressive multiple alignment on grammar strings using a guide tree, which is built from all-against-all pairwise similarities using the unweighted pair group method with arithmetic mean (UPGMA).
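Step 3 above selects one grammar string per sequence so that the sum of pairwise similarities is maximized; the dissertation defers its actual algorithm to the next section. Purely as a sketch of the objective, an exhaustive version (feasible only for a handful of sequences and candidate structures, with a toy similarity function of our own) looks like:

```python
from itertools import product

def choose_strings(candidates, sim):
    """Pick one grammar string per sequence so that the sum of pairwise
    similarities is maximized. Exhaustive over all combinations -- a
    sketch of the selection objective only, not the dissertation's
    actual (more efficient) algorithm.
    candidates: list of lists, one list of grammar strings per sequence."""
    best, best_score = None, float('-inf')
    for combo in product(*candidates):
        score = sum(sim(combo[a], combo[b])
                    for a in range(len(combo))
                    for b in range(a + 1, len(combo)))
        if score > best_score:
            best, best_score = combo, score
    return list(best)

# toy similarity: matched positions minus a length-difference penalty
def sim(u, v):
    return sum(x == y for x, y in zip(u, v)) - abs(len(u) - len(v))

picked = choose_strings([['LBLEE', 'BLEE'], ['LBLEE'], ['LBLEE', 'LLLEE']], sim)
print(picked)   # ['LBLEE', 'LBLEE', 'LBLEE']
```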
3.3 Results

Table 3.1 Comparison of grammar string and RNASampler on pseudoknot derivation

Rfam ID   Name             Average length   Grammar string   Grammar string     RNASampler
                                            accuracy         running time (s)   running time (s)
RF00165   Corona_pk3        62               4               2.96               2.59
RF00381   Antizyme_FSE      57               8               3.07               5.2
RF00505   RydC              64               0               2.78               0.32
RF00176   Tombus_3_IV       91              11               1.46               12.45
RF00233   Tymo_tRNA-like    82               0               1.21               6.7
RF00499   Parecho_CRE      111               3               2.22               2.77
RF00507   Corona_FSE        82              30               1.34               5.56
RF00523   Prion_pknot       41              11               0.51               0.82

Accuracy is defined as the base pair difference between the reference structure and the predicted structure. RNASampler outputs errors or structures that do not have the same abstract shape as the reference structure for five of these families; thus their accuracy is not measured. Its accuracies for the remaining three families are 29, 20, and 4.

The current version of Rfam contains 71 families with pseudoknots, 25 of which have published rather than computationally predicted structures. We focus on testing grammar strings on these 25 families. For each family, we first apply HotKnots [19] to predict secondary structures; both the optimal and sub-optimal predictions are kept. As these predictions show great variance in their topology, we use the method described in Section 2.3.3 to choose a grammar pattern that is shared by a majority of input sequences. Only grammar strings that can be converted into the chosen grammar pattern are candidates for multiple alignment.

Figure 3.2 Illustration of the reference structures and predicted structures by grammar string-based alignment (denoted as GS) and RNASampler (denoted as RS) on 8 families. Note that if RNASampler fails to output a structure containing pseudoknots or outputs errors, its predictions are not displayed.

For 17 families, we fail to obtain a grammar pattern that is shared by at least 2 input sequences; thus, we can only conduct multiple ncRNA alignment for 8 families. For each family, if the number of available grammar strings is less than 10, we align all of them.
Otherwise, at most 10 grammar strings are aligned. To compare our tool with existing ones that do not require an alignment as input, we apply RNASampler [66] on the 25 families. RNASampler outputs errors or structures without pseudoknots for 14 families, and for three other families, the predicted structures are not as accurate as those of grammar string alignment. ILM [67] can also predict pseudoknots without requiring an alignment as input; however, we were not able to run their program on our input. We summarize the properties of the eight families and the structure derivation results of grammar string alignment and RNASampler in Table 3.1. We also plot the reference structures and the predicted structures from grammar strings and RNASampler in Figure 3.2. The results show that pseudoknot prediction is a highly challenging problem, especially for long ncRNAs (> 100 nt). HotKnots becomes less accurate as input size increases, and neither our program nor RNASampler is able to handle long ncRNAs with pseudoknots. The commands, parameters, and alignment results of RNASampler and our tool can be downloaded from our website.

3.4 Discussion and conclusion

By encoding secondary structures in grammar strings, ncRNA structural alignment is transformed into sequence alignment. We have shown the utility of grammar string alignment in consensus structure derivation for ncRNAs including pseudoknots. Being a type of “fold and align” tool, grammar strings’ performance relies on the quality of the ab initio structure prediction tools. In the worst case, there is no grammar pattern that can be shared by at least two input sequences, rendering low accuracy of structure derivation. Thus, grammar string-based alignment works best when a few input sequences have consistent structure predictions. One future direction is to reduce the running time of grammar string comparison.
In particular, we need a more efficient algorithm to choose the input to multiple sequence alignment from hundreds of sub-optimal structures. Second, we need to improve the quality of grammar string alignment when the ab initio structure prediction tools have poor accuracy. One alternative is to apply simplified Sankoff algorithms to conduct pairwise structure derivation and then use these structures, instead of the output of existing MFE-based tools, as input to grammar string construction.

Chapter 4
Shape and secondary structure prediction for ncRNAs including pseudoknots based on linear SVM

4.1 Background

There are 26,704 sequences in the 71 ncRNA seed families of Rfam 10.0 [68] that contain pseudoknots. With advances in sequencing technologies and structure prediction, more pseudoknot structures are expected to be disclosed. Many computational methods have been used to determine the native structure of ncRNAs, i.e., the conformation an RNA folds into in its native state before forming the tertiary structure. The gap between the free energy of the native state and other non-native structures is often small [20]; thus, misfolded conformations can form with high probabilities [21]. For a review of available tools, please see [69, 70]. Although there is promising progress, finding the native secondary structure is still difficult. In particular, identifying the pseudoknot, an important structural motif in many types of ncRNAs, poses a great challenge for existing methods. Predicting the minimum free energy secondary structure that includes pseudoknots has been proven to be NP-hard [23]. One recent attempt is to first predict the abstract shapes (or shapes for short), which retain the adjacency and nesting of structural features but disregard the length details of helix and loop regions [71]. The predicted shape is then used to guide structure prediction.
The idea of abstract shapes has long been used to characterize different types of structures. For example, most tRNAs have the clover-leaf structure; most pre-miRNAs have the stem-loop structure; many types of pseudoknots have an H-type structure. While the size of the folding space of an RNA sequence increases exponentially with the sequence length [1], many possible foldings only differ in the details of the loop and helix regions and hence have the same abstract shape. Previous analysis shows that the space of abstract shapes is significantly smaller than the complete folding space [72]. Knowing the abstract shape can significantly reduce the search space for structure prediction tools and improve the accuracy of structure prediction [71, 73]. The utility of abstract shapes has been demonstrated in a number of recent publications. The Giegerich group used abstract shapes in comparative structure prediction for pseudoknot-free sequences [73]. Shapes have been used to aid miRNA precursor prediction in large-scale studies [74, 75]. Furthermore, shapes are used to index fast-expanding ncRNA families in Rfam [68], leading to efficient known-ncRNA search [76]. Previous work focused on shape derivation and usage for pseudoknot-free ncRNAs; there is a lack of studies on the usage of shapes in pseudoknot structure prediction. In this work, we predict the consensus shape of a group of homologous ncRNAs that may contain pseudoknots. In addition, we develop a program that uses the consensus shape for consensus pseudoknot structure prediction. A majority of existing pseudoknot structure prediction tools have topology restrictions, such as H-type, recursive H-type [77, 78, 79, 80], kissing hairpin, or pseudoknot complexity levels defined by genus numbers [81]. Therefore, using the predicted abstract shapes of input sequences can help remove the topology restriction and leads to more general and practical pseudoknot structure prediction tools.
Compared with existing tools, our tool has the following properties:

• While most existing shape prediction tools use a single sequence as input, we conduct comparative shape prediction on homologous ncRNAs that might contain pseudoknots. Experiments show that comparative structure or shape prediction, which derives the consensus structure or shape from a group of homologous sequences, can achieve better accuracy than using a single sequence [69, 73, 82].

• We can predict the abstract shapes of both pseudoknot-free and pseudoknot-containing sequences.

• Current tools use the shape probability [83] or the sum of the energies of structures to rank shapes. We use multiple features by combining well-studied feature ranking methods with the support vector machine (SVM) method.

• We demonstrate the usage of the shape by applying it to pseudoknot structure prediction. The whole software package can be directly used to derive the consensus secondary structure of homologous ncRNAs.

The consensus shape prediction tool, named KnotShape, and the corresponding consensus pseudoknot prediction tool, named KnotStructure, are publicly available at our website. We tested our software on hundreds of RNA sequence sets. The experimental results show that we achieve 18% higher accuracy than the state-of-the-art consensus shape prediction tools on pseudoknot-free sequences. For pseudoknot-containing sequences, we achieve approximately 29% higher sensitivity and 10% higher positive predictive value in structure prediction than similar tools.

4.2 Related work

Computational structure prediction can be divided into de novo structure prediction and comparative structure prediction, which derive structures from a single sequence and from multiple homologous ncRNAs, respectively. As our method derives the consensus shape and structure of homologous ncRNAs, we briefly introduce related work in comparative ncRNA structure derivation.
There are three general approaches for structure derivation from multiple sequences: simultaneous align-and-fold, align-then-fold, and fold-then-align. It is computationally expensive to simultaneously align and fold pseudoknot structures. The performance of align-then-fold pseudoknot prediction heavily depends on the quality of the alignment; usually, multiple sequence alignment (MSA) tools such as ClustalW [51] are used to generate the alignment used as input to the folding tool. However, common structures can be missed due to misalignment between sequences lacking significant similarity [63]. In this work, we design a pseudoknot prediction tool using the fold-then-align strategy, which does not require an alignment as input. Tools based on fold-then-align use a de novo folding tool to construct a small but representative sample of the folding space, which consists of the predicted optimal and sub-optimal structures. Structures from the folding space are chosen to maximize the structural and sequence similarity. A number of software packages exist to predict the abstract shape of a single sequence. The sum of energies or the accumulated Boltzmann probabilities of all structures within a shape have been used as the main features for shape prediction; the latter is often referred to as the shape probability. Usually, shapes with a small sum of energies or a high shape probability are more likely to be the correct shapes. It is claimed in RapidShapes [83] that using shape probabilities outperforms the free energy-based approach because of its independence from sequence length and base composition. However, exact computation of the shape probability incurs a computational cost exponential in the sequence length [83]. Thus, various heuristics or restrictions [84, 85] have been adopted for fast shape probability computation. RNAcast [73] derives the consensus shape from homologous pseudoknot-free sequences based on the fold-then-align strategy.
Structures are grouped based on their shapes, and shapes are ranked in ascending order by the sum of the free energies of the structures within each shape. The first-ranked shape is presented as the consensus shape. The consensus structure is derived from the lowest free energy structure of each sequence within the shape.

4.3 Methods
4.3.1 RNA structures and their representations
4.3.1.1 RNA structures and pseudoknots

RNA molecules fold into complex three-dimensional structures by nitrogenous bases that are connected via hydrogen bonds [86] (Figure 4.1a). The secondary structure of an ncRNA is defined by the interacting base pairs. Some RNA molecules fold into pseudoknot structures by pairing bases in loop regions with bases outside the stem loop (Figure 4.1b). In this work, two types of ncRNA secondary structure representations are used. The first type is the arc-based representation, where nucleotides and hydrogen bonds are represented by vertices and arcs, respectively (Figure 4.1c). For pseudoknot-free secondary structures, all arcs are either nested or in parallel; crossover arcs indicate pseudoknots. The second type is based on dot-bracket notation, where ‘.’ represents unpaired bases and matching parentheses ‘(’ and ‘)’ indicate base-pairing nucleotides. Following the annotation of Rfam [68], we use an extended dot-bracket notation to represent pseudoknot structures: the base-pairing nucleotides forming pseudoknots are represented by upper-lower case character pairs, such as A..a or B..b, as shown in Figure 4.1d.

Figure 4.1 Structure of an RNA pseudoknot. (a-d) show the three-dimensional structure, secondary structure, arc-based representation, and dot-bracket notation of the mouse mammary tumor virus (MMTV) H-type pseudoknot with PDB code 1RNK. The bases in stacking regions are colored red and blue, while the unpaired bases are colored green and brown.

4.3.1.2 Abstract shapes

Abstract shapes were formally introduced by Giegerich et al. [71].
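The extended dot-bracket notation described in Section 4.3.1.1 can be parsed into base pairs with a stack for parentheses plus one stack per pseudoknot letter. A minimal sketch (the function name is ours):

```python
def parse_extended_dotbracket(db):
    """Parse Rfam-style extended dot-bracket notation into a sorted
    list of 0-based base pairs. '(' and ')' pair normally; pseudoknot
    pairs use matching upper/lower case letters such as 'A'...'a'."""
    pairs, paren, knots = [], [], {}
    for k, c in enumerate(db):
        if c == '(':
            paren.append(k)
        elif c == ')':
            pairs.append((paren.pop(), k))
        elif c.isupper():                    # open a pseudoknot pair
            knots.setdefault(c, []).append(k)
        elif c.islower():                    # close the matching uppercase
            pairs.append((knots[c.upper()].pop(), k))
    return sorted(pairs)

# H-type pseudoknot annotated in the extended notation
print(parse_extended_dotbracket('((..AA..))..aa'))
# [(0, 9), (1, 8), (4, 13), (5, 12)]
```

Note the crossing pairs (0, 9)/(4, 13): feeding this output to a crossing check confirms the structure is pseudoknotted.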
The folding space of a given RNA sequence is partitioned into different classes of structures by abstracting from structural details. These classes are called abstract shapes, or shapes for short. An RNA secondary structure can be mapped to an abstract shape at different levels of abstraction [73]. In an abstract shape, details about the lengths of the loop and stacking regions are removed (see Figure 4.1 for examples of stacking and loop regions). Stacking regions are represented by pairs of brackets and unpaired regions by underscores; pseudoknots are represented by pairs of upper-lower case characters. Figure 4.2 presents examples of the abstract shapes of levels 1, 3, and 5 for a pseudoknot-free structure and a pseudoknot. Level 5 represents the strongest abstraction and ignores all bulges, internal loops, and single-stranded regions. Level 3 adds the helix interruptions caused by bulges or internal loops. Level 1 only abstracts from loop and stack lengths while retaining all single-stranded regions.

(a) AUCGGCGCACAGGACAUCCUAGGUACAAGGCCGCCCGUU
    ..(((.((..(((....))).(((.....))))))))..
    L1: _[_[_[_]_[]]]_   L3: [[[][]]]   L5: [[][]]

(b) GGCGCAGUGGGCUAGCGCCACUCAAAAGGCCCAU
    (((((..AAAAAA.)))))........aaa.aaa
    L1: [_AA_]_a_a   L3: [AA]aa   L5: [A]a

Figure 4.2 Examples of abstract shapes at levels 1, 3, and 5. (a) The abstract shapes of a pseudoknot-free structure. (b) The abstract shapes of a structure with a pseudoknot.

4.3.2 Shape prediction

In this section we describe KnotShape, a comparative shape prediction tool for homologous ncRNA sequences that allows pseudoknots. The task of shape prediction is to select the best representative shape for the given homologous sequences. In order to identify the best shape, various features such as the shape probability [83], the sum of energies of all structures in the shape [73], and the rank sum score [73] have been used to rank shapes.
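For pseudoknot-free structures, the level-5 abstraction illustrated in Figure 4.2 (keep helix nesting and adjacency, merge helices separated only by bulges or internal loops, drop unpaired regions) can be sketched as:

```python
def shape5(db):
    """Level-5 abstract shape of a pseudoknot-free dot-bracket string:
    unpaired bases are dropped, and nested helices that do not branch
    are merged into a single bracket pair, keeping nesting/adjacency."""
    # build a tree of base pairs; each node is the list of its children
    root, stack = [], None
    stack = [root]
    for c in db:
        if c == '(':
            node = []
            stack[-1].append(node)
            stack.append(node)
        elif c == ')':
            stack.pop()

    def render(children):
        out = ''
        for child in children:
            # collapse unbranched chains of pairs (bulges/internal loops)
            while len(child) == 1:
                child = child[0]
            out += '[' + render(child) + ']'
        return out

    return render(root)

# Figure 4.2(a)'s structure; its level-5 shape is [[][]]
print(shape5('..(((.((..(((....))).(((.....))))))))..'))   # '[[][]]'
```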
It has not been systematically assessed whether combinations of multiple features can lead to better shape prediction. In this work, we incorporate support vector machines (SVM) [87] and feature selection techniques to determine important features for shape identification. In addition, we adopt a machine learning-based scoring function to evaluate the quality of shapes. The method contains two important components: consensus shape prediction (KnotShape) and structure prediction using the predicted shape as input (KnotStructure). We first describe KnotShape, focusing on the feature construction and selection strategy, and then describe how to derive the consensus structure given the consensus shape.

4.3.2.1 Notation

Figure 4.3 illustrates the mapping between sequences, structures, and shapes. The input is a set of homologous ncRNAs and the output is the predicted consensus shape. The notation used in this paper corresponds to this mapping.

Figure 4.3 The relationship between sequences, structures, and shapes.

• The N homologous ncRNAs constitute the input sequence space. X_i represents the ith sequence.

• Each sequence can be folded into different secondary structures. Let S_i represent the set of folded structures of the ith sequence X_i. The set of structures predicted from all N input sequences is the union of the S_i: S = S_1 ∪ S_2 ∪ ... ∪ S_N.

• S_i^j is the jth structure in the folding space of X_i. Its free energy is denoted by ΔG(S_i^j). For a sequence X_i, the minimum free energy MFE(X_i) is the lowest free energy among the energies of all predicted structures of X_i, i.e., MFE(X_i) = min_{1 ≤ j ≤ |S_i|} ΔG(S_i^j).

• All structures in S can be classified into a set of abstract shapes.
For a shape P, we record its associated sequences and structures. P.LX denotes the set of associated sequences, each of which can fold into a structure with shape P. P.LS denotes all structures with shape P.

• P̂ is the predicted shape of the given homologous sequences X_1, X_2, ..., X_N.

In order to explore the large folding space of multiple homologous sequences, we use a de novo folding tool to output the optimal and sub-optimal structures within a given energy cutoff. This heuristic does not allow us to explore the complete folding space; however, given the observation that the correct structure is usually close to the optimal structure, it works well in practice [88].

4.3.2.2 Feature construction and selection

Intuitively, the correct shape tends to possess the following properties. The correct shape should have a high shape probability, meaning that a large number of structures can be classified into this shape. When we have multiple homologous sequences as input, the correct shape should be well-represented by all or a majority of the input sequences. Also, the ranking of the structure with the correct shape in the folding space of each sequence should be high. In addition, some structures with the correct shape have low thermodynamic energies. For the energy-related properties, various measurements can be introduced; for example, instead of using the sum of the energies of all structures within a shape, one can use the smallest energy. Furthermore, more complicated properties, such as the sequence similarity of all sequences associated with a shape P and the structural similarity of the structures associated with a shape P, might contribute to shape prediction too. These similarities can be quantified using different methods such as k-mer profiles, multiple sequence alignment scores, variation of base pairs, and so on. It is not trivial to decide whether a single property is enough to choose the correct shape.
If not, which combination of these properties leads to the best shape prediction performance? In order to systematically choose a set of features (i.e., properties) for shape prediction, we use the F-score [89] to measure the discrimination between a feature and its label. Given the feature vectors x_k, k = 1, ..., m, the F-score of the ith feature is defined as:

F(i) ≡ [ (x̄_i^(+) − x̄_i)^2 + (x̄_i^(−) − x̄_i)^2 ] / [ (1/(n_+ − 1)) Σ_{k=1}^{n_+} (x_{k,i}^(+) − x̄_i^(+))^2 + (1/(n_− − 1)) Σ_{k=1}^{n_−} (x_{k,i}^(−) − x̄_i^(−))^2 ]

where n_+ and n_− are the numbers of positive and negative instances, respectively; x̄_i, x̄_i^(+), and x̄_i^(−) are the average values of the ith feature over the whole, positively labeled, and negatively labeled data; and x_{k,i}^(+) and x_{k,i}^(−) are the values of the ith feature of the kth positive and negative instances, respectively. The F-score reflects the discriminative power of a feature: the higher the F-score, the more discriminative the feature is. A known disadvantage of the F-score is that it does not capture mutual information between features, as it considers each feature separately; however, it is simple and quite effective in practice. Feature selection searches for the optimal subset of features [90]. Different methods for feature selection exist; in this work, we adopt sequential forward search (SFS) [91] because of its simplicity and effectiveness. Starting with an empty set, we iteratively select one feature at a time and add it to the current feature set. Features are considered in descending order of the discriminative power determined by the F-score, and the feature added is the one that gives the highest accuracy. Based on the properties that might be relevant to consensus shape prediction, we construct 17 features and compute the F-score for each of them. The accuracy is evaluated using a linear SVM; the standard grid search approach is used to find optimal SVM parameters. The performance of a feature set is evaluated using 5-fold cross validation.
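The F-score above can be computed per feature; a direct sketch using sample variance, which matches the 1/(n − 1) normalization in the denominator (the function name and toy values are ours):

```python
import statistics

def f_score(pos, neg):
    """F-score of a single feature: pos and neg are the lists of that
    feature's values over positively and negatively labeled instances.
    Numerator: squared deviations of class means from the overall mean.
    Denominator: sum of the two within-class sample variances."""
    all_vals = pos + neg
    m = statistics.mean(all_vals)
    mp, mn = statistics.mean(pos), statistics.mean(neg)
    num = (mp - m) ** 2 + (mn - m) ** 2
    den = statistics.variance(pos) + statistics.variance(neg)  # 1/(n-1) normalized
    return num / den

# a well-separated feature scores higher than an overlapping one
print(f_score([5.0, 5.1, 4.9], [1.0, 1.1, 0.9]) >
      f_score([1.0, 2.0, 3.0], [1.5, 2.5, 3.5]))    # True
```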
The prediction accuracy is the average value over all cross validation sets. The feature set that achieves the highest accuracy includes the following four features.

• F1: the contribution of sequences. We capture the contribution of sequences using the number of sequences mapped to the shape. This feature reveals how widely the shape is shared among the homologous sequences. F1 = |P.LX|.

• F2: the contribution of structures. This feature represents the abundance of structures mapped to the shape. F2 = |P.LS|.

• F3: the average free energy. An energy model is commonly used to determine the stability of predicted structures. The idea behind this feature is that a stable shape is expected to be derived from a group of stable structures. F3 = (Σ_{S ∈ P.LS} ΔG(S)) / |P.LS|.

• F4: the average minimum free energy. This feature differs from F3 in that it considers only the minimum free energy among all predicted structures of each sequence. F4 = (Σ_{X ∈ P.LX} MFE(X)) / |P.LX|.

4.3.2.3 Shape ranking using a simple scoring function

Once the features are determined, they are used together with a trained linear SVM for shape labeling. Multiple shapes might be labeled as “true”. In order to rank these candidate shapes for the final shape selection, we evaluate each candidate shape using a score named sc, which is proportional to the signed distance between the candidate shape and the classification hyperplane [92]. Specifically, sc = w · x + b, where · denotes the dot product, w is the weight vector, and x is the instance vector. w is obtained from the optimization function in the linear SVM. The larger |w_j| is, the more important the jth feature is; this interpretation is restricted to w in a linear SVM model.

4.3.2.4 Time complexity of shape prediction

For N input sequences, there are S predicted structures, which can be grouped into P shapes. As we use de novo folding tools to output near-optimal structures within a given energy range (e.g., 5%), we found that N : S : P ≈ 1 : 10 : 1.375.
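The ranking score sc = w · x + b from Section 4.3.2.3 reduces to a dot product once the linear SVM is trained; in this sketch the weights and the two candidate shapes' feature vectors [F1, F2, F3, F4] are invented for illustration, not trained values from the dissertation:

```python
def sc(w, x, b):
    """Ranking score of a candidate shape: the signed value
    sc = w . x + b, proportional to the distance from the linear
    SVM's classification hyperplane (w, b assumed already trained)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

# hypothetical trained weights and feature vectors [F1, F2, F3, F4]
w, b = [0.8, 0.3, -0.05, -0.1], -1.0
shape_a = [10, 95, -22.4, -25.1]   # shared by many sequences/structures
shape_b = [3, 12, -15.0, -17.2]    # shared by few
ranked = sorted([('A', sc(w, shape_a, b)), ('B', sc(w, shape_b, b))],
                key=lambda t: t[1], reverse=True)
print(ranked[0][0])   # 'A' ranks first
```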
Mapping structures to shapes takes O(SL), where L is the sequence length. As sorting shapes according to their features takes O(P log P), and P ≤ 2N and S ≤ 11N, the procedure of shape prediction has time complexity O(NL + N log N).

4.3.3 Consensus structure prediction given a shape

Once we determine the shape, we predict the structure in the shape class for the given homologous ncRNAs. Structures corresponding to the same shape can differ significantly in the details of the loop and stacking regions. A strategy is needed to choose the correct structure inside the shape class for each input sequence. The simplest strategy is to output the MFE structure for the chosen shape, which has been used in previous work [73]. However, the MFE structure in a shape may not be the native structure. In particular, the accuracy of MFE prediction for ncRNAs containing pseudoknots is low. In this section we describe KnotStructure, a comparative structure prediction method for homologous sequences given the shape. The rationale behind comparative structure prediction is that secondary structures and sequences are conserved during evolution. Thus, finding the structures that maximize both the sequence and the secondary structure similarity among homologous ncRNAs provides the basis for comparative structure prediction. Various methods for evaluating structural and sequence similarity exist. The major challenge is to efficiently select |P̂.LX| representative structures to achieve the highest structural and sequence similarity. As we already derived the consensus shape P̂ using KnotShape, only structures with shape P̂ will be allowed. In addition, for each sequence X_i ∈ P̂.LX, only one structure with shape P̂ can be selected. The total number of combinations of structures for measuring the similarity is thus ∏_{i=1}^{|P̂.LX|} |P̂.LS_i|, where P̂.LS_i contains the structures with shape P̂ for sequence X_i.
Although efficient heuristics exist to measure the similarity among multiple structures and sequences, the sheer number of combinations poses a great computational challenge.

Procedure 1 Representative structure selection
Input: P̂, P̂.LX, P̂.LS
Output: the representative structures
1. Initialization
for every two structures S_x^i and S_y^j do
    // only evaluate the similarity of structures from different sequences
    if x ≠ y then
        evaluate the similarity of S_x^i and S_y^j
    else
        S_x^i and S_y^j have similarity −∞
    end if
end for
2. Select the set of representative structures using hierarchical clustering
// each structure starts as a cluster by itself
repeat
    combine the pair of clusters with the highest similarity
    for any structure S_x^i added to the cluster, remove all other structures S_x^j with j ≠ i
    re-evaluate the similarity between clusters
until the cluster has size |P̂.LX|

In order to efficiently select representative structures, we use a method similar to the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), an agglomerative hierarchical clustering technique [93]. Each object (i.e., secondary structure) starts in its own cluster. The closest pair of clusters is selected and merged into a single cluster as one moves up the hierarchy. The distance between clusters is measured using the arithmetic mean defined in UPGMA. Compared to the standard clustering procedure, we have constraints on the objects that can be selected into the same cluster. Given the shape, only structures that have shape P̂ and come from different ncRNAs can be combined in the same cluster. The detailed clustering process is described in Procedure 1. During clustering, the structural and sequence similarity is evaluated using a grammar string-based approach [94, 95]. Grammar strings encode both the secondary structure and sequence information of an ncRNA sequence. The grammar string alignment score can accurately quantify the structural and sequence similarity of two ncRNAs.
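The constraint in Procedure 1, at most one structure per sequence in the growing cluster, can be sketched with a simple greedy merge. This is our own simplified sketch, not the dissertation's implementation; the function name and the average-linkage scoring against the current cluster are illustrative:

```python
import itertools

def select_representatives(structures, similarity):
    """structures: (seq_id, struct_id) pairs that all fold into the consensus shape.
    similarity(a, b): grammar-string alignment similarity of two structures.
    Greedy sketch of Procedure 1: seed with the most similar cross-sequence pair,
    then grow the cluster keeping at most one structure per sequence."""
    seqs = {seq for seq, _ in structures}
    # Pairs from the same sequence are never compared (similarity -inf in the text).
    pairs = [(a, b) for a, b in itertools.combinations(structures, 2) if a[0] != b[0]]
    a, b = max(pairs, key=lambda p: similarity(*p))
    chosen = {a[0]: a, b[0]: b}          # seq_id -> its representative structure
    while len(chosen) < len(seqs):
        # Only structures from not-yet-represented sequences are candidates.
        candidates = [s for s in structures if s[0] not in chosen]
        best = max(candidates,
                   key=lambda s: sum(similarity(s, r) for r in chosen.values()) / len(chosen))
        chosen[best[0]] = best
    return chosen
```

The average-similarity criterion plays the role of UPGMA's arithmetic-mean linkage; the real procedure re-evaluates similarities between full clusters rather than growing a single cluster greedily.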
In addition, grammar strings can encode pseudoknot structures [94, 95]. For a sequence X_i and one structure S_i^j in the folding space of X_i, X_i and S_i^j are encoded in a grammar string gs_i^j. We measure the similarity between any two grammar strings using the grammar string-based alignment score normalized by the alignment length. The similarity between groups of grammar strings is measured by the arithmetic mean as in UPGMA. Figure 4.4 sketches the representative structure selection based on the clustering procedure. Let gs_i^j be the grammar string converted from X_i and S_i^j, X_i ∈ P̂.LX. Once gs_i^j is selected, all the other grammar strings derived from the folding space of X_i are removed from further processing.

[Figure 4.4 An example of structure selection based on hierarchical clustering. For each structure S_i^j in the folding space of sequence X_i, the grammar string encoding the structure and the sequence is denoted as gs_i^j. All sequences and their associated structures are converted into grammar strings before clustering. The highlighted rectangles indicate grammar strings that are selected as representative structures.]

The progressive MSA is performed on the set of representative structures using the clustering path as a guide tree. We then derive the consensus secondary structure from the alignment. The consensus structure can be mapped to each aligned sequence to obtain the predicted structure of an individual sequence.

4.3.3.1 Running time of structure prediction

Converting a sequence and an associated secondary structure into a GS (grammar string) takes O(L²), where L is the length of the sequence. Let the number of structures in P̂.LS be m. It takes O(L²m) to encode all structures with shape P̂. In the first step of hierarchical clustering, we
measure the similarity between GSs of different ncRNAs by conducting an all-against-all comparison. Conducting a pairwise GS alignment takes O(l²), where l is the length of the GS sequence and l ≤ L. By using the default energy cutoff (5%) for sub-optimal structure generation, we observed that m ≤ 11N. Thus, the all-against-all similarity measure has time complexity O(L²N²). The guide tree generated by the clustering procedure contains at most N representative structures. Thus, the total running time for clustering is O(L²N³), which is the leading term in the time complexity of the consensus structure prediction algorithm.

4.4 Results

4.4.1 Data sets

The training data set is K10 from BaliBASE [40]. It contains 845 sequence sets, each of which has 10 homologous ncRNAs. There are two test data sets. The first one is K15 from BaliBASE. K15 contains 503 sequence sets, each of which has 15 homologous ncRNAs. As existing shape prediction tools are not designed for handling pseudoknots, we use the pseudoknot-free sequence sets in K15 to compare the performance of shape prediction. After removing the sets containing pseudoknots, we have 452 sequence sets left. To test the performance of pseudoknot prediction, we constructed the second test set, R15, from pseudoknot families of Rfam [68]. In Rfam 10.0, there are 71 families containing pseudoknots, 25 of which have published structures. Of the 25 families, only families with at least 15 seed sequences are used for testing our tools. For each chosen family, sets of 15 sequences are chosen randomly to construct the test sets. Finally, R15 contains 160 test sets. The average pairwise sequence identities range from 60-93%. For all sequence sets, the reference shapes were derived from Rfam [68].

4.4.2 SVM training

For both the training and testing data sets, we need to apply de novo folding tools to the sequences. We choose a folding tool using the following criteria.
First, the tool should be able to output both optimal and sub-optimal structures. Second, it should have high accuracy and be efficiently applicable to a large number of ncRNAs. Finally, if the target sequences contain pseudoknots, it should be able to output pseudoknot structures. As a result, we chose TT2NE [81]. Different from many other pseudoknot prediction tools that have constraints on the type of the pseudoknot, TT2NE is more flexible about the types of the target sequences. However, when it was applied to K10, TT2NE failed to output structures for some sequences due to its length limit (200 nt) and the existence of IUPAC ambiguity characters in some sequences. Thus, for the training data set K10, we applied quikfold [15] because K10 rarely contains pseudoknots. Although it is ideal to apply the same folding tool to the training and testing data sets to achieve optimal classification performance, the complexity of the training and test data sets together with the performance of de novo folding tools led to the current combination. In the Discussion section we briefly discuss how de novo folding tools affect the performance. We employed the SVM model implemented in the LIBSVM tool [96] for classification. For each sequence in K10, we applied quikfold with the energy range 5% to obtain both the optimal and sub-optimal structures of each sequence. The predicted structures were grouped based on their corresponding shapes. The associated features were extracted and attached to each shape. We normalized the feature values to bring the different properties of the test sets to the same scale. To label shapes, we used the shapes extracted from the consensus structures in Rfam [68] as the reference. Shapes are labeled according to their correctness: a shape is labeled 1 if it is the same as the reference shape; otherwise, it is labeled -1.

4.4.3 Shape prediction comparison

We compared KnotShape with RNAcast [73], which is part of the RNAshapes package [84].
RNAcast takes the sequences as input and predicts the consensus shape shared by all sequences. As it is not designed for pseudoknot structures, we only applied RNAcast to the 452 pseudoknot-free test sets of K15. TT2NE was applied to the test sets using the default parameters. For each sequence, the optimal structure and 10 sub-optimal structures were kept as the sample of its folding space. We compared our predicted shapes and the first-ranked shapes output by RNAcast with the reference shapes derived from Rfam [68]. The comparison is presented in Table 4.1. Note that RNAcast cannot output shapes containing pseudoknots, and thus its entries for R15 are left blank in Table 4.1. The accuracy of KnotShape is about 18% higher than that of RNAcast.

Table 4.1 Accuracy of shape predictions

                       K15                                R15
            Testset  Correct shapes  %Accuracy   Testset  Correct shapes  %Accuracy
KnotShape     452         311          68.81       160         107          66.88
RNAcast       452         232          51.33        -           -             -

4.4.4 Structure prediction comparison

We applied the predicted shapes to pseudoknot structure prediction and compared the structure prediction performance with IPknot [80], HxMatch [97], and TurboKnot [82], which were chosen because of their popularity, availability, and ease of use on a large number of sequences. Sequence alignments were generated using ClustalW and entered as the input to IPknot and HxMatch. For IPknot, we chose the appropriate levels of prediction according to the test sets. We ran HxMatch with the default parameters. We used the parameters suggested in [82] to run TurboKnot. Sensitivity and Positive Predictive Value (PPV) are used to evaluate the performance:

Sensitivity = TP / (TP + FN),    PPV = TP / (TP + FP)

where TP is the number of correctly predicted base pairs, FN is the number of base pairs that are in the reference structure but not in the predicted structure, and FP is the number of base pairs that are in the predicted structure but not in the reference structure.
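With base pairs represented as sets of (i, j) index pairs, the two metrics are straightforward to compute. A small sketch (our own helper, not KnotStructure code):

```python
def sensitivity_ppv(predicted, reference):
    """Base-pair-level sensitivity and PPV.
    predicted, reference: sets of (i, j) base-pair index tuples."""
    tp = len(predicted & reference)   # pairs present in both structures
    fn = len(reference - predicted)   # reference pairs that were missed
    fp = len(predicted - reference)   # predicted pairs not in the reference
    return tp / (tp + fn), tp / (tp + fp)
```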
Figure 4.5 shows boxplots of the sensitivity and PPV over all 160 test sets.

[Figure 4.5 Comparison of the sensitivity and PPV of different tools (boxplots for KnotStructure, IPknot, TurboKnot, and Hxmatch).]

KnotStructure has the best overall performance on the whole data set. The median values of sensitivity and PPV for KnotStructure are 54.55% and 46.15%. Hxmatch has the next highest sensitivity and PPV (42.11% and 42.86%, respectively). The abstract shapes of the tested families are shown in Table 4.2. Three families contain simple H-type pseudoknots while the other three families contain more complicated pseudoknots. In order to show the effect of shape prediction on structure prediction, we also predicted the structures of R15 using 10 randomly selected shapes. The average sensitivity and PPV of structures predicted with the predicted shapes are higher than those using random shapes, as shown in Table 4.3. Tables 4.4 and 4.5 show the average sensitivity and PPV over all sequences of each family compared to the other tools. The average running time of KnotStructure on each family compared to the other tools is shown in Table 4.6.
Table 4.2 Abstract shapes of ncRNA families in R15

RNA Type         Shape Level 5    Shape Level 3
HDV_ribozyme     [A[B]]b[]a       [AA[B]]b[[[[[[]]]]]]aa
Alpha_RBS        [ABC]bac         [[[ABC]]]bac
Tombus_3_IV      [[]A][][]a       [[[]A]][][]a
Tymo_tRNA-like   [][][]A[a]       [][][[]]AA[aa]
Corona_FSE       [A]a             [AA]aa
Prion_pknot      [A]a             [A]a

Table 4.3 Sensitivity and PPV of predicted structures using the predicted shapes and randomly selected shapes

       Predicted            Randomly selected shape
         shape     1      2      3      4      5      6      7      8      9      10
SEN      79.00   45.68  58.41  53.09  58.06  55.74  45.88  44.75  56.12  61.08  46.54
PPV      67.10   38.81  50.92  42.82  49.15  46.24  40.29  36.33  45.75  52.80  38.53

Table 4.4 Sensitivity for different ncRNA families

RNA Type         Len⊕     TS∗   KnotStructure  IPknot  TurboKnot  Hxmatch
HDV_ribozyme      89.70    12       82.52       50.66    36.47     23.51
Alpha_RBS        110.99    18       74.36       46.59    46.24     24.49
Tombus_3_IV       91.61     4       84.00       65.91    72.00     80.00
Tymo_tRNA-like    85.12     3       96.79       76.41    83.52     75.09
Corona_FSE        82.91     9       96.14       56.55    56.55     73.27
Prion_pknot       40.46   114       40.17       13.70     2.96     30.26

Bold and underlined numbers indicate the highest and the second highest sensitivity for each family. ⊕ Average sequence length. ∗ The number of test sets.

Table 4.5 PPV for different ncRNA families

RNA Type         Len⊕     TS∗   KnotStructure  IPknot  TurboKnot  Hxmatch
HDV_ribozyme      89.70    12       80.13       59.24    39.28     39.77
Alpha_RBS        110.99    18       40.59       25.34    23.70     22.19
Tombus_3_IV       91.61     4       83.14       78.02    73.47     90.91
Tymo_tRNA-like    85.12     3       90.25       73.62    86.85     90.11
Corona_FSE        82.91     9       75.65       60.51    42.19     92.30
Prion_pknot       40.46   114       32.85       15.71     3.03     28.86

Bold and underlined numbers indicate the highest and the second highest PPV for each family. ⊕ Average sequence length. ∗ The number of test sets.
Table 4.6 Running time for different ncRNA families (seconds)

RNA Type         KnotStructure  IPknot  TurboKnot  Hxmatch
HDV_ribozyme         1.55        0.35     28.22      0.41
Alpha_RBS            1.93        0.49     37.25      0.55
Tombus_3_IV          1.80        0.37     13.66      0.42
Tymo_tRNA-like       1.78        0.24     12.24      0.28
Corona_FSE           1.51        0.30     12.75      0.35
Prion_pknot          1.07        0.10      4.23      0.12

4.5 Discussion and conclusion

Under the fold-then-align strategy, the choice of folding tool can play an important role in the performance of shape and structure prediction. For the test set, we tested two folding tools, HotKnots [98] and TT2NE, in three different ways: HotKnots alone, TT2NE alone, and both together. We ran HotKnots and TT2NE with default parameters. The experimental results show that using TT2NE alone achieves the best performance in consensus structure prediction. It is likely that other folding tools exist that yield better performance than TT2NE. However, as the performance of those tools also depends on the input data and the parameters, a systematic study is needed to choose the best tool. For TT2NE, we currently only use 10 sub-optimal structures. Increasing this number moderately does not affect the performance significantly, which indicates that the correct structures have high rankings in the folding space. There are more pseudoknot-free structures available than pseudoknot-containing structures. To achieve a reliable SVM model, more training data is desired. We used K10 for feature selection. This may cause KnotShape to have slightly lower predictive performance on pseudoknot-containing than on pseudoknot-free sequences. Nonetheless, the features used in KnotShape do not rely heavily on the free energy value, which differs between pseudoknot-free and pseudoknot-containing structures. Instead, the feature set is based on multiple RNA properties shared among homologous sequences. Extensive analysis of RNA properties based on SVM allows us to identify important features related to abstract shapes.
The combination of large-scale data analysis and SVM-based feature ranking makes KnotShape a promising tool for shape prediction. By combining the predicted shapes and the multiple structural alignment strategy, KnotStructure demonstrates higher accuracy in pseudoknot structure prediction.

Chapter 5

LncRNA-ID: Long non-coding RNA IDentification using balanced random forest classification

5.1 Background

A recent study indicates that there are at least four times more lncRNA transcripts than protein-coding transcripts [3]. The majority of lncRNAs are transcribed in the sense and antisense directions, and some of them overlap with protein-coding genes. Unlike other ncRNAs such as miRNAs or snoRNAs that are strongly conserved across diverse species [99], lncRNAs are poorly conserved [100]. The poor conservation of ncRNAs may be the result of recent and rapid adaptive selection, as evidenced by the existence of many lineage-specific ncRNAs, such as Xist or Air [101]. LncRNAs exist in many species such as Arabidopsis [102], Zea mays [103], honey bee [104], chicken [105], zebrafish [106], etc. In recent years, a large number of lncRNAs have been identified. GENCODE [10] comprises 9,277 manually annotated lncRNA genes in the human genome. The LncRNADisease database [107] contains 1,564 human lncRNAs that are likely to be associated with diseases. Thus, given the functional importance and ubiquity of lncRNAs, it is important to annotate them on a genome scale in various species. With the advances of next-generation sequencing technologies, the transcriptomes of a large number of organisms have been sequenced, providing us a unique opportunity to mine lncRNAs. The assembled transcripts contain different types of functional elements such as small ncRNAs, lncRNAs, and protein-coding genes. As lncRNAs are usually much longer than small ncRNAs, lncRNAs can be effectively distinguished from small ncRNAs using size as the main criterion.
However, lncRNAs have splicing structures similar to those of protein-coding transcripts and tend to encode putative open reading frames (ORFs). Thus, a major challenge for lncRNA identification is to distinguish lncRNAs from protein-coding genes, especially in non-model organisms lacking comprehensive protein-coding gene annotation.

5.2 Related work

Many efforts have been made to distinguish between lncRNA and protein-coding transcripts, ranging from applying a threshold to a single feature to more complicated supervised machine learning methods. One commonly used feature is the length of the ORF. For example, a simple approach is to classify a transcript containing an ORF longer than 100 amino acids as a protein-coding gene [108]. This criterion is arbitrary and is not always correct [109]. Using this simple criterion, the mouse Xist RNA gene [110], which encodes a putative ORF of 298 amino acids (aa), was mis-classified as a protein-coding gene when it was first discovered [111]. More accurate approaches for identifying lncRNAs use supervised machine learning methods, formulating lncRNA discovery as a binary classification problem. These approaches can be further divided into two types: one relies on sequence alignments and the other is alignment-free. Representative examples of alignment-based methods include the Coding-Potential Calculator (CPC) [112] and Phylogenetic Codon Substitution Frequencies (PhyloCSF) [113]. CPC aligns transcripts against known protein databases, while PhyloCSF uses multiple sequence alignments. As homologous protein-coding genes tend to share higher sequence conservation than lncRNAs, the alignment score or its statistical significance provides useful information to differentiate these two types of transcripts [114, 115, 116]. However, alignment-based methods usually require high-quality alignments [117], which are not trivial to produce and can incur high computational cost.
An alternative and faster approach is alignment-free methods such as CPAT [118], which integrates linguistic features of transcript sequences into a logistic regression model for lncRNA prediction. In addition, rather than only offering pre-built models, CPAT allows users to create a model with their own data. This option, which is not present in CPC and PhyloCSF, is very useful for lncRNA identification in different species. Despite the promising progress in lncRNA identification, there is still a need for better approaches and tools. In particular, existing machine learning-based tools do not carefully handle imbalanced training data, in which one class has far more instances than the other. The issue of imbalanced training data is particularly pronounced for lncRNA identification when it is formulated as a binary classification problem, as in existing tools. For example, due to the poor annotation of lncRNAs, many species have far fewer characterized lncRNAs than protein-coding genes. As a result, a classifier tends to over-predict query transcripts as the majority class [119]. In addition, many existing tools require users to provide a score threshold for lncRNA identification, which is not always obvious from the user's perspective. For example, PhyloCSF and CPAT do not suggest the specific type of an input transcript, but only output a coding potential score. The predefined score cutoffs of PhyloCSF and CPC vary from species to species. PhyloCSF score cutoffs of 50 and 300 were used for mouse [120] and zebrafish [121], respectively. CPAT suggests score cutoffs of 0.364 and 0.44 for human and mouse, respectively. These specific score cutoffs cannot be immediately applied to other species. Even worse, not every tool can be trained on different species to provide users the necessary information for choosing an appropriate score cutoff.
In this paper, we present LncRNA-ID, a lncRNA identification tool that applies random forest (RF) classification [122] to distinguish lncRNAs from protein-coding genes. RF is a classification model that aggregates multiple classification trees generated from bootstrap samples and has been successfully applied in bioinformatics [123, 124, 125]. LncRNA-ID has several advantages over existing tools. First, it still takes advantage of alignment-based features, which have strong discriminative power. However, instead of using genome-scale multiple sequence alignments or pairwise alignments against all existing protein sequences, LncRNA-ID employs profile hidden Markov model (profile HMM)-based alignments, rendering more sensitive homology search and shorter running time than existing alignment-based lncRNA identification tools. Second, LncRNA-ID is easy to use, as it does not require users to provide a score cutoff. It automatically determines the type of a query transcript as well as providing a coding potential score. Third, LncRNA-ID can be applied to various species by providing an option to train the classifier on different data. Fourth, LncRNA-ID does not require a large amount of training data of either protein-coding transcripts or lncRNAs to construct a classifier, and it can handle imbalanced classes in the training data. In our experiments, we evaluated the performance of LncRNA-ID on two different species, human and mouse. It achieved the highest sensitivity and specificity compared with CPC, CPAT, and PhyloCSF on both species.

5.3 Methods

In this section, we first describe the features used in LncRNA-ID and then the method we use to construct our classification model. The features used in LncRNA-ID are derived from three different groups: open reading frame (ORF), ribosome interaction, and protein conservation. Each feature is selected based either on the literature or on empirical observations. Using multiple
Using multiple 63 features can significantly improve the performance of classification. 5.3.1 ORF features ORF is one of the most commonly used criteria to distinguish a lncRNA from a coding transcript. A true protein-coding transcript tends to have longer ORFs than those in lncRNAs. We derive two ORF-related features: ORF length and ORF coverage. The ORF length is defined as the length of the longest reading frame identified in three forward frames. The ORF coverage is defined as the ratio of the length of the chosen ORF to the length of the transcript. From our observation, lncRNAs tend to have shorter ORF and lower ORF coverage than protein-coding transcripts. 5.3.2 Ribosome interaction features These features are based on the interaction mechanism between the ribosome and mRNAs during protein translation [126]. Ribosomes consist of two parts, a large subunit where two tRNA binding sites are located and a small subunit where the mRNA binding site is located. The translation is initiated when the small ribosomal subunit attaches to the mRNA at a start codon. The ribosome starts to translate the mRNA towards 3’ end until it encounters a stop codon. At the end of the protein translation, termination factors release the synthesized protein for use in the cell and the ribosome splits back into large and small subunits [127]. Many studies have successfully applied ribosome footprint to identify functional proteins [128, 129, 130, 131]. In particular, ribosome profiling [129], which sequences mRNA fragments bound to ribosomes, provide a quantitative snapshot of protein translation. However, the availability of ribosome profiling data is still limited. Thus, in this work, we design computational features to quantify the main attributes related to ribosome interaction with mRNAs. We define the features according to these interaction states: 64 initiation, translation, and termination. 
5.3.2.1 Initiation

The initiation features are derived from the Kozak motif. The Kozak consensus is a favorable motif for the ribosome scanning pattern and initiates translation; it greatly impacts protein translation efficiency [132, 133, 134]. The Kozak motif has the consensus GCCRCCAUGG (R represents a purine) and is located in the region around the initiator codon of an ORF. In the experiments that determined the Kozak motif, single-base mutations were introduced into mRNAs and the protein production of the mutant sequences was measured. It has been demonstrated that nearly all ribosomes initiate at the AUG start codon [135]. The highly conserved nucleotides at positions -3 and +4 (where the A of AUG is +1), as well as at -2 and -1, play a major role in the initiation of the translation process. We thus derive two features from the Kozak motif: the nucleotides at positions {-3, +4} and {-2, -1}. The Kozak features determine the potency of a start site. A strong start site, which enhances translation efficiency, occurs when the nucleotides at these positions are conserved, whereas less conservation indicates a weak start site [135, 136].

5.3.2.2 Translation

The interaction between the 3' end of rRNAs and mRNA transcripts exhibits changes of binding energy along the transcript. The binding energy consists of the free energy needed to open the binding site and the energy gained from hybridization. We use RNAup [137] to compute the thermodynamics of the interaction between the 3' end of the 18S rRNA and a transcript. In order to capture the change of the binding energy, we calculate a series of binding energies by moving the 3' end of the 18S rRNA toward the 3' direction of the transcript one nucleotide at a time. Let δ_i be the free energy at position i, and let N_i denote the number of Watson-Crick base pairs of an interaction starting at position i.
The ribosome coverage is thus defined as:

\text{Ribosome coverage} = \sum_{i=1}^{L} \{ N_i \mid \delta_i < 0 \}

where L is the sequence length. The ribosome coverage is computed on three regions: the whole transcript, the ORF, and the 3' UTR. These three features indicate the level of ribosome occupancy on a sequence. For protein-coding transcripts, we expect to see higher ribosome coverage on the whole transcript and the ORF region.

5.3.2.3 Termination

We define the ribosome release score (RRS) to capture the termination signal of ribosomes. The RRS takes advantage of the fact that ribosomes are released when reaching a stop codon. As a result, a sharp drop in ribosome occupancy is seen at the start of the 3' UTR of coding transcripts. In contrast, translational termination should not occur in non-coding transcripts [29, 138]. The RRS can be measured experimentally using quantitative data from deep sequencing of ribosome-protected mRNA fragments, called ribosome profiling [129]. Although ribosome profiling data is of high quality, it requires experiments and is therefore not yet widely available. However, it is expected to become more available with the demand from research communities and the progress in cost-effective sequencing technologies. In the absence of ribosome profiling data, we estimate the RRS as the ratio of the length-normalized ribosome coverage in the putative ORF to that in the corresponding 3' UTR:

RRS = (Ribosome coverage of ORF / length(ORF)) / (Ribosome coverage of 3' UTR / length(3' UTR))

The RRS indicates the relative degree of ribosome occupancy bias at the terminal binding site in a sequence. True protein-coding transcripts are expected to have larger RRS than non-coding transcripts.

5.3.3 Protein conservation features

True protein-coding transcripts tend to show better conservation against characterized proteins. We measure the conservation using profile hidden Markov model (profile HMM)-based alignment scores.
In particular, we chose HMMER [139] to align a transcript against all available protein families, such as the ones in Pfam [140]. Applying profile-based homology search has several advantages compared with pairwise alignment methodologies [141]. First, the number of gene families is significantly smaller than the number of sequences, rendering a much smaller search space. For example, there are only about 14,000 manually curated protein families in Pfam [140], but they cover nearly 80% of the UniProt Knowledgebase [142], and the coverage increases every year as enough information becomes available to form new families [140]. As the profile-based homology search tool HMMER is as fast as BLAST [143], using profile-based search provides a shorter search time. In addition, alignments of query sequences against each protein family are independent from each other and thus can be naturally parallelized on high-performance computing platforms. Second, previous work [31] has shown that using family information can improve the sensitivity of remote protein homology search [144]. For the transcriptomes of non-model organisms, sensitive remote homology search is especially important for identifying possibly new homologs. Specifically, each transcript is aligned to all protein families using HMMER with an E-value cutoff of 0.1. When more than one alignment is generated for a query sequence, the alignment with the best E-value is used. From the chosen alignment of a transcript, we derive the following three features: (i) the alignment score, (ii) the length of the alignment on the profile, and (iii) the length of the alignment on the query sequence. A true protein-coding transcript is likely to produce alignments with higher scores and longer lengths than a lncRNA.
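Assuming the HMMER hits have already been parsed into tuples, selecting the best-E-value alignment and extracting the three features can be sketched as follows. The tuple layout and the zero-filled "no hit" encoding are our assumptions, not LncRNA-ID's actual implementation:

```python
def conservation_features(hits, evalue_cutoff=0.1):
    """hits: list of (evalue, score, profile_aln_len, query_aln_len) tuples
    parsed from HMMER output for one transcript (parsing itself omitted).
    Returns (alignment score, alignment length on profile, on query)."""
    significant = [h for h in hits if h[0] <= evalue_cutoff]  # E-value cutoff from the text
    if not significant:
        # No conservation evidence; an assumed encoding of the missing-data case.
        return (0.0, 0, 0)
    evalue, score, profile_len, query_len = min(significant, key=lambda h: h[0])
    return (score, profile_len, query_len)
```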
In total, we extract 11 features: ORF length, ORF coverage, two Kozak-motif related features, ribosome coverage on three regions (transcript, ORF, and 3' UTR), ribosome release score, alignment score, alignment length on the profile HMM, and alignment length on the transcript. Although each feature exhibits a different value distribution for the two types of transcripts, no single feature is able to fully distinguish lncRNAs from coding transcripts. Thus, it is important to combine multiple features to maximize discriminative power. We formalize this problem as a binary classification problem where lncRNAs are defined as the positive class and protein-coding transcripts are defined as the negative class. All these features are incorporated into the chosen classification model, balanced random forest, which we describe below.

5.3.4 Balanced random forest

A decision tree is a commonly used classification model in machine learning. A random forest (RF) consists of multiple decision trees, each built from a bootstrap sample, i.e., a random sample drawn from the training data. During prediction, the RF outputs the class agreed on by most of the individual trees. We select the RF for the following reasons:

1. It is able to effectively handle missing data, which is common in lncRNA identification. For example, lncRNA transcripts are not likely to show protein conservation, so features such as the alignment score or alignment length can be missing.

2. It natively supports categorical features without requiring any transformation. The typical conversion for categorical data is to create dummy binary variables to represent each category value. However, this may decrease the predictive power of the features and is time-consuming because of the potentially large number of dummy features. With RF, we are able to directly use the Kozak motif features without any conversion.

Inspired by Chen et al.
[145], we extended RF to balanced random forest (BRF), which contains multiple RFs, each built from a subset of the training data. BRF provides LncRNA-ID with the major advantage that it can learn from imbalanced training data, where the numbers of lncRNA and protein-coding samples differ greatly. Imbalanced training data is common in lncRNA identification. A recent study found that lncRNAs are at least four times more numerous than protein-coding genes in the human genome [3]. In practice, the majority class in the training data is the protein-coding transcript, because most organisms have more protein-coding gene annotations than lncRNA annotations. For example, in the GENCODE database [10], there are 12,526 annotated lncRNAs and 95,099 annotated coding transcripts in the human genome. For the mouse genome, there are 6,053 annotated lncRNAs and 47,394 annotated coding transcripts in GENCODE. Thus, there is a need for a classification method that can effectively learn from imbalanced training data, where one of the two classes (the majority) has more samples than the other (the minority). When learning from imbalanced training data, there is a high possibility that a bootstrap sample contains very few or even none of the entities in the minority class, resulting in a classification tree with poor performance on the minority class. A naive solution is to conduct either prior over-sampling of the minority class or prior down-sampling of the majority class. Down-sampling usually performs better than over-sampling [146]. However, prior down-sampling of the majority class may result in loss of information, as a large part of the majority class is not used. In contrast, LncRNA-ID employs BRF, which ensembles multiple RFs, each trained on a subset of the majority class and the full minority class.
This allows us to achieve better classification performance by maximizing the benefits of the abundant protein-coding data and the scarce lncRNA data. Our BRF is different from [145] in that, instead of creating balanced training subsets by random drawing, we divide the majority class into equal subsets according to the imbalance ratio, which is the ratio of the size of the majority class to the size of the minority class. The purpose is to maximize the predictive power by ensuring that all training data are incorporated in constructing the classification model. The balanced random forest learning algorithm is shown below:

Procedure 2 Balanced random forest learning
Input: lncRNAs (P) and protein-coding transcripts (N)
Output: a BRF classifier
  k = |N| / |P|
  Create k non-overlapping subsets, n1, n2, ..., nk, from N
  for i = 1 to k do
    Train a classifier RFi to discriminate P against ni
  end for
  Return the ensemble of RF1, ..., RFk

To create balanced training data, down-sampling is performed on the protein-coding transcripts, creating approximately equal numbers of protein-coding and lncRNA transcripts in each subset ni. Each training subset is then used to create an individual RF. Finally, we integrate all constructed RF classifiers into the BRF. The BRF classifier then predicts the type of a query transcript by aggregating the prediction results of the ensemble classifiers. Integrating the balanced random forest methodology with different types of features gives LncRNA-ID the following advantages: (i) it can effectively handle limited or imbalanced learning data, which are common for most species; (ii) it incorporates different types of features, minimizing bias from any particular group of features. We employ the random forest classification implemented in the Weka [147] software package to construct our classification model.
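A minimal, self-contained sketch of Procedure 2. To keep it runnable without external libraries, a toy nearest-centroid classifier on one-dimensional features stands in for each random forest (LncRNA-ID itself uses Weka's random forest implementation); the subset-splitting and vote-aggregation logic follow the procedure above.

```python
import statistics

def make_subsets(majority, minority):
    """Split the majority class into k = |N| // |P| non-overlapping subsets,
    each paired with the full minority class (Procedure 2)."""
    k = max(1, len(majority) // len(minority))
    size = len(majority) // k
    subsets = [majority[i * size:(i + 1) * size] for i in range(k)]
    subsets[-1].extend(majority[k * size:])  # leftover samples go to the last subset
    return [(minority, sub) for sub in subsets]

class CentroidClassifier:
    """Toy stand-in for one RF: nearest class centroid on a 1-D feature."""
    def fit(self, pos, neg):
        self.pos_centroid = statistics.mean(pos)
        self.neg_centroid = statistics.mean(neg)
        return self
    def predict(self, x):
        return 1 if abs(x - self.pos_centroid) < abs(x - self.neg_centroid) else 0

def train_brf(minority, majority):
    """Train one classifier per balanced subset; return the ensemble."""
    return [CentroidClassifier().fit(pos, neg)
            for pos, neg in make_subsets(majority, minority)]

def brf_predict(ensemble, x):
    """Aggregate votes from the ensemble; ties break toward the positive class."""
    votes = sum(clf.predict(x) for clf in ensemble)
    return 1 if 2 * votes >= len(ensemble) else 0
```

Because the subsets are non-overlapping rather than randomly redrawn, every majority-class sample contributes to exactly one classifier, which is the information-preserving property emphasized above.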
The optimal number of trees used in the random forest classification is determined by the best performance obtained in 10-fold cross validation.

5.4 Results

LncRNA-ID can be applied to different species and can achieve robust classification performance with imbalanced training data, which is a common problem in lncRNA classification. To evaluate the performance of LncRNA-ID, we used two data sets from the human genome and one data set from the mouse genome. The first human data set (H1) and the mouse data set (M) were generated by the GENCODE consortium [10] within the framework of the ENCODE project. GENCODE is known to have the most comprehensive annotation of long noncoding RNAs available to date. The second human data set (H2) is CPAT's original data set, generated from multiple resources: RefSeq [148], a human lncRNA catalog [149], and GENCODE. We further conducted four additional experiments simulating different imbalance ratios in the training data to demonstrate that LncRNA-ID is able to maintain robust performance with imbalanced training data. To quantify the classification performance, we used five standard metrics: sensitivity, specificity, accuracy, false positive rate (FPR), and F-score, which are defined as follows:

Sensitivity = TP / (TP + FN), Specificity = TN / (TN + FP), Accuracy = (TP + TN) / (P + N), FPR = FP / (FP + TN),

F-score = (2 · Sensitivity · Specificity) / (Sensitivity + Specificity)

LncRNAs are regarded as the positive class and protein-coding transcripts as the negative class. TP is the number of correctly classified lncRNAs and TN is the number of correctly classified protein-coding transcripts. Sensitivity is the proportion of correctly classified lncRNAs among all lncRNAs. Specificity is the proportion of correctly classified protein-coding transcripts among all protein-coding transcripts. Accuracy is the proportion of correctly classified transcripts among all predictions.
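These metric definitions (together with FPR and F-score, defined next) can be written out directly as a small helper over the four confusion-matrix counts, with lncRNAs as the positive class:

```python
def metrics(tp, fn, tn, fp):
    """Compute the five evaluation metrics from confusion-matrix counts.
    lncRNAs are the positive class; protein-coding transcripts the negative."""
    sensitivity = tp / (tp + fn)                # fraction of lncRNAs recovered
    specificity = tn / (tn + fp)                # fraction of coding transcripts recovered
    accuracy = (tp + tn) / (tp + fn + tn + fp)  # correct predictions overall
    fpr = fp / (fp + tn)                        # coding transcripts called lncRNA
    f_score = 2 * sensitivity * specificity / (sensitivity + specificity)
    return sensitivity, specificity, accuracy, fpr, f_score
```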
False positive rate (FPR) is the proportion of protein-coding transcripts falsely classified as lncRNAs. F-score is the harmonic mean of sensitivity and specificity and hence can be used as a single measure of overall classification performance.

5.4.1 Performance of different groups of features

Before evaluating the classification performance of LncRNA-ID, we first evaluated the discriminative power of the different types of features. We conducted this experiment using the human data set (H1); details about this data set can be found in the next section. We created classification models from the training data using each individual group of features and combinations of different groups. The performance of each classification model was then evaluated on the test data. The overall performance was measured by the area under the ROC curve (AUC). AUC is a commonly used measure that evaluates performance at all cutoff points, giving better insight into how well the classifier separates the two classes. The greater the AUC, the better the overall classification performance of the classifier. The optimal performance is the FPR and sensitivity pair that maximizes the F-score. Figure 5.1 shows the performance of LncRNA-ID using a single group of features versus multiple groups of features. The three groups of features exhibit highly different performance. The ribosome interaction features have the best discriminative power because they are designed based on the protein translation mechanism. According to Figure 5.1, combining multiple groups of features leads to better performance than using a single group. The best performance comes from the combination of all three groups of features, which are thus used in all our experiments. We compared the performance of LncRNA-ID with three state-of-the-art coding-potential prediction tools: CPC, CPAT, and PhyloCSF.
These tools output the coding potential of a transcript and can be used to classify a query transcript as coding or non-coding. Below we present the experimental results of applying LncRNA-ID and the benchmark tools on three data sets. For each data set, we introduce the training data, test data, and the important parameters used for each tool. We re-trained the classification model of CPAT for each data set to optimize its performance. As CPC and PhyloCSF do not provide a re-training option, we used their pre-built models. It is worth noting that there is no intersection between the training data and test data.

Figure 5.1 Performance comparison among feature groups: ORF features (ORF), ribosome interaction features (ribo), protein conservation features (protein), and the combined feature sets. The legend reports the AUC and the optimal (FPR, sensitivity) of each feature set: all features (AUC=0.9947, optimal=0.0225, 0.9660); ribo+protein (AUC=0.9910, optimal=0.0300, 0.9615); ORF+protein (AUC=0.9885, optimal=0.0315, 0.9585); ORF+ribo (AUC=0.9868, optimal=0.0265, 0.9393); protein (AUC=0.9468, optimal=0.1197, 0.9433); ribo (AUC=0.9827, optimal=0.0510, 0.9405); ORF (AUC=0.9679, optimal=0.0173, 0.8918).

5.4.2 The human data set (H1)

This data set contains randomly selected transcripts from GENCODE [10]. The training data contains 48,600 protein-coding transcripts and 8,300 lncRNAs. The test data contains 4,000 coding transcripts and 4,000 lncRNAs. We ran CPC using UniRef90 [150] as the reference protein database, which is a relatively comprehensive protein database suggested by CPC. We created the classification model of CPAT from the training data using the script provided in CPAT's software package. The created classifier was then used to predict the transcripts in the test data, and the performance was evaluated using CPAT's suggested optimal score cutoff.
A multiple sequence alignment of 45 vertebrate genomes, including the human genome, was downloaded from the UCSC Genome Browser and used as the input alignment for PhyloCSF. We specified the option that allows PhyloCSF to search all three reading frames and report the best result, as suggested on the PhyloCSF website [151]. We used the default score cutoff of PhyloCSF to generate the classification results. Table 5.1 shows the comparison of classification performance of all tools on H1. LncRNA-ID had the best sensitivity and accuracy among all tools. Although CPC had the highest specificity, its sensitivity and accuracy were much lower than those of LncRNA-ID. CPC's classifier is based on six features: three of them are ORF-related and the others are derived from the alignments of a query sequence against existing protein sequences. These features could cause a bias toward protein-coding transcripts if a lncRNA contains an ORF sharing similarity with existing protein sequences. This might be a major reason behind the low sensitivity of CPC. LncRNA-ID also had the highest F-score and classification accuracy among all tools. Therefore, LncRNA-ID demonstrated the best overall performance in distinguishing protein-coding transcripts from lncRNAs. The AUC values for all ROC curves can be found in Figure 5.2; LncRNA-ID also had the best AUC among all the tools. In addition, we give the sensitivity and FPR corresponding to the optimal F-score for each tool in Figure 5.2. With the best F-score of 0.9717, LncRNA-ID had the best sensitivity of 0.9660 and an FPR of 0.0225. The optimal performance of CPAT and CPC was much better than that based on their default score cutoffs, showing that their default score cutoffs are data dependent and may not provide users with satisfactory classification performance. The optimal performance and AUC of PhyloCSF were much worse than those of the other tools. Table 5.1 Performance comparison on H1.
            Sensitivity  Specificity  F-score  Accuracy
LncRNA-ID      96.73        95.40      96.06     96.06
CPC            66.48        99.97      79.85     83.22
CPAT           86.25        99.42      92.37     92.84
PhyloCSF       77.08        85.08      80.89     81.34
CPC, CPAT, and PhyloCSF were evaluated using default score cutoffs. Bold numbers indicate the highest value of each metric.

Figure 5.2 ROC curves of different tools on H1. The AUCs, and the sensitivity and FPR corresponding to the optimal F-score, are indicated in the legend: lncRNA-ID (AUC=0.9947, optimal=0.0225, 0.9660); CPC (AUC=0.9871, optimal=0.0405, 0.9343); CPAT (AUC=0.9905, optimal=0.0328, 0.9490); PhyloCSF (AUC=0.8587, optimal=0.2028, 0.8664).

5.4.3 The mouse data set (M)

LncRNA-ID can be trained for any species with some characterized protein-coding and lncRNA genes. If no such training data is available, a pre-built model can be used. In this experiment, we applied LncRNA-ID to the mouse data set to show its application to a different species. This data set consists of randomly selected transcripts from GENCODE. The training data contains 44,300 protein-coding transcripts and 4,000 lncRNAs. The test data contains 2,000 coding transcripts and 2,000 lncRNAs. The number of lncRNAs in this data set is only half of that in H1 because of the limited lncRNA annotation in the mouse genome. A multiple sequence alignment of 30 genomes, including the mouse genome, was downloaded from the UCSC Genome Browser and used as the input alignment to PhyloCSF. The score cutoff of 50, which was shown to accurately separate known protein-coding genes from known non-coding sequences [130, 120], was used to generate the classification results for PhyloCSF. Table 5.2 shows the performance comparison of different tools on the mouse data set under their default parameters. LncRNA-ID had the best sensitivity and accuracy among all tools.
Although CPC and CPAT had slightly higher specificity than LncRNA-ID, their sensitivity and accuracy were much lower than those of LncRNA-ID. LncRNA-ID had the highest F-score among all tools, showing the best overall classification performance. The sensitivity, F-score, and accuracy of all tools except CPC were lower on the mouse data set than on H1, largely due to the smaller training data set. Note that as we used CPC's pre-built classifier to evaluate the test data, there might be some overlap between CPC's training data and this test data, giving CPC's classifier an advantage over the other tools. Figure 5.3 shows the ROC curves of the different tools. When the sensitivity was lower than 0.834, CPC had a lower false positive rate than the other tools. However, its optimal sensitivity was much lower than that of LncRNA-ID and CPAT. As in the experiment on the human data set H1, LncRNA-ID had the best optimal performance and AUC. CPAT had the second highest optimal performance and AUC. At the point with the optimal F-score, CPAT's sensitivity was 2.6% lower than that of LncRNA-ID and its false positive rate was 0.6% lower than that of LncRNA-ID. The performance of CPAT under its default score cutoff was much lower than its optimal performance obtained by varying the score cutoff. CPC had a slightly lower false positive rate than LncRNA-ID at its optimal performance; however, its sensitivity was much lower than that of LncRNA-ID. PhyloCSF had significantly poorer performance than the other tools.

Table 5.2 Performance comparison on the mouse data set.
            Sensitivity  Specificity  F-score  Accuracy
LncRNA-ID      94.65        92.20      93.41     93.43
CPC            76.55        98.75      86.24     87.65
CPAT           44.55        98.75      61.40     71.65
PhyloCSF       24.50        55.70      34.02     41.43
CPC, CPAT, and PhyloCSF were evaluated using default score cutoffs. Bold numbers indicate the highest value of each metric.
Figure 5.3 ROC curves of different tools on the mouse data set. The AUCs, and the sensitivity and FPR corresponding to the optimal F-score, are indicated in the legend: lncRNA-ID (AUC=0.9735, optimal=0.0800, 0.9515); CPC (AUC=0.9589, optimal=0.0515, 0.8510); CPAT (AUC=0.9657, optimal=0.0740, 0.9255); PhyloCSF (AUC=0.6467, optimal=0.2721, 0.4938).

5.4.4 CPAT's human data set (H2)

In this experiment, we evaluated the performance of LncRNA-ID on the human data set used by CPAT [118]. The training set was originally claimed to contain 10,000 coding transcripts selected from the RefSeq database and 10,000 randomly selected non-coding transcripts from GENCODE. However, some transcripts no longer exist in the databases; they might have been removed because of duplication with existing entries, disqualification as new evidence emerged, etc. As a result, the final training set contains 9,929 coding transcripts and 9,066 non-coding transcripts. The test set is the same as CPAT's original data. It contains 4,000 coding transcripts from the RefSeq database and 4,000 lncRNAs from a human lncRNA catalog. The performance of CPC and PhyloCSF had been benchmarked on this data set in [118], and those results were used in our experiment. We created the classification model of CPAT from the training transcripts using the script provided in CPAT's software package. The created classifier was then used to predict the transcripts in the test data set, and the performance was evaluated using CPAT's suggested optimal score cutoff. The first three columns of Table 5.3 show the performance comparison between LncRNA-ID and CPAT on the original training data.
LncRNA-ID (sensitivity: 93.65%, specificity: 96.15%, F-score: 94.88%) achieved a better overall performance than CPAT (sensitivity: 87.58%, specificity: 97.32%, F-score: 92.20%), CPC (sensitivity: 73.75%, specificity: 99.9%, F-score: 84.85%), and PhyloCSF (sensitivity: 62.8%, specificity: 90.2%, F-score: 74.05%). Please note that we used the performances of CPC and PhyloCSF that were benchmarked on the same test data in [118].

Table 5.3 Performance comparison on H2

Original data:
              LncRNA-ID   CPAT
Sensitivity     93.65     87.58
Specificity     96.15     97.32
F-score         94.88     92.20
Accuracy        94.90     92.45

Imbalanced training subsets:
              LncRNA-ID                     CPAT
              S2     S3     S6     S8      S2     S3     S6     S8
Sensitivity  93.43  93.54  92.72  92.73   79.51  73.01  60.32  54.46
Specificity  95.41  95.23  95.47  95.38   98.14  98.50  99.29  99.42
F-score      94.41  94.38  94.08  94.03   87.84  83.85  75.03  70.32
Accuracy     94.42  94.38  94.10  94.06   88.82  85.75  79.81  76.94

Bold numbers indicate the highest values of the metrics.

5.4.5 Imbalanced training data

Using the H2 data set, we evaluated how imbalanced training data affects the classification performance of LncRNA-ID and CPAT, the two best tools according to the previous experimental results. We constructed four data sets, S2, S3, S6, and S8, from the original training set to simulate imbalanced training data. For S2, the lncRNAs in the training set were randomly divided into two subsets. Each lncRNA subset was combined with the coding transcripts in the original training set, generating two training subsets in total. We created the classification models of LncRNA-ID and CPAT using each of the two training subsets and evaluated their performance using the same test set. The experiments on S3, S6, and S8 were conducted in the same manner except that the lncRNAs were divided into three, six, and eight subsets, respectively. LncRNA-ID had a higher average sensitivity than CPAT in all four experiments (Table 5.3). The performance of CPAT's classifiers trained with the subsets decreased significantly compared with that of the classifier trained with the full training set.
This shows that limited learning data reduced the discriminative power of CPAT's classifier. In contrast, LncRNA-ID, which implements balanced random forest learning, was able to maintain stable performance on all data sets with different imbalance ratios. The average specificity of LncRNA-ID was 3.46% lower than that of CPAT. However, the average sensitivity of CPAT was 26.28% lower than that of LncRNA-ID. The average accuracy and F-score of LncRNA-ID were also much higher than those of CPAT, showing better overall classification performance. CPAT's sensitivity dropped dramatically, by 9.21-37.81%, while LncRNA-ID's sensitivity decreased by less than 1% compared with the models trained with the full training set. This shows that CPAT suffered not only from limited learning data but also from the impact of imbalanced training data.

5.4.6 Running time

We measured the running time of LncRNA-ID compared to CPC, CPAT, and PhyloCSF on the H1 test set, which is the largest test set. All tools were run on the same high-performance computing node with a 64-bit CPU running Linux. CPC, CPAT, and PhyloCSF took 86.51 h, 35.36 s, and 15,097.60 h, respectively, to process the data. Note that the running time of PhyloCSF did not include the time used to prepare the input multiple sequence alignments, which can be computationally expensive. LncRNA-ID took 65.36 s to process the data. Its speed was comparable to CPAT and much faster than CPC and PhyloCSF.

5.5 Discussion and conclusion

We have proposed LncRNA-ID, an accurate lncRNA identification tool based on balanced random forest classification. LncRNA-ID ensembles multiple forest classifiers induced from balanced down-sampled data and thus is able to maintain steady performance under different imbalance ratios and with limited learning data. The results on both the human and mouse genomes demonstrate that the features used by LncRNA-ID have strong discriminative power in distinguishing lncRNAs from protein-coding transcripts.
Our empirical experiments show that the ribosome interaction features are the most discriminative. Among all classification tools, PhyloCSF had the worst performance. One explanation is that, in the human data set, PhyloCSF could not determine the coding status of a substantial fraction of lncRNAs (16.97%) and some coding transcripts (0.03%), because they either are non-conserved transcripts or do not have sufficiently long ORFs. If ribosome profiling data (Ribo-Seq) and mRNA-Seq data are available, a more accurate ribosome release signal could be measured using the numbers of mapped reads on the ORF and 3'UTR. The RRS is then defined as the ratio of the two normalized read-count ratios on these two regions: RRS = (r_ORF / r_3'UTR)_Ribo-Seq / (r_ORF / r_3'UTR)_mRNA-Seq [129], where r is the number of mapped reads on a region. LncRNA-ID achieved the highest overall performance on both human and mouse compared with the other tools. The advantage of LncRNA-ID is even more pronounced when learning from an imbalanced data set. Imbalanced learning data are found in essentially all species, where functional annotations of proteins are abundant while validated annotations of lncRNAs are far scarcer. The ability of LncRNA-ID to maintain steady performance on limited and imbalanced data results from applying balanced random forest learning. LncRNA-ID employs the down-sampling technique to create multiple balanced learning data sets from the original imbalanced data. Balanced learning data prevent a classifier from being biased toward the majority class. Moreover, LncRNA-ID uses all the learning data in classification by integrating the down-sampled data into multiple classifiers and thus prevents loss of information.
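The ribosome release score discussed above reduces to a one-line computation once the four read counts are in hand; the counts in the usage example are made-up values for illustration, not real data.

```python
def rrs(ribo_orf, ribo_utr3, mrna_orf, mrna_utr3):
    """Ribosome release score: the ORF / 3'UTR read-count ratio in Ribo-Seq,
    normalized by the same ratio in mRNA-Seq. A coding transcript shows
    ribosome drop-off after the stop codon, so its Ribo-Seq ratio is
    inflated relative to mRNA-Seq, giving a high RRS."""
    return (ribo_orf / ribo_utr3) / (mrna_orf / mrna_utr3)
```

For example, a transcript with 1,000 Ribo-Seq reads on the ORF and 10 on the 3'UTR, against 500 and 50 mRNA-Seq reads on the same regions, has an RRS of 10.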
Chapter 6 Conclusion and future work

Next Generation Sequencing (NGS) technologies have greatly extended the abilities of researchers to intensively study gene expression and discover new coding and non-coding genes on a genome-wide scale. NcRNAs have gained increasing interest as mounting evidence has revealed their critical roles in various biological processes. Nevertheless, NGS has identified many transcripts whose functions and significance are unclear [152]. In support of the growing attention to ncRNA identification, we have proposed four tools: grammar-string based alignment, KnotShape, KnotStructure, and LncRNA-ID. We proposed the grammar string, a novel secondary structure representation, and showed its application in consensus structure derivation through multiple ncRNA alignment. Compared with existing structure prediction tools, it had better sensitivity with high positive predictive value. We also showed the utility of grammar string alignment in consensus structure derivation for ncRNAs containing pseudoknots. Besides constructing ncRNA structural alignments, grammar strings can be used to encode characterized ncRNA structures (such as those from Rfam), compare different structures, and search for common structural motifs. We plan to explore these uses of grammar strings. KnotShape and KnotStructure were designed specifically to decrease the search space of putative structures of homologous RNA sequences. The combination of mass data analysis and SVM-based feature ranking makes KnotShape a promising tool for shape prediction. By combining the predicted shapes and the multiple structural alignment strategy, KnotStructure demonstrates higher accuracy in pseudoknot structure prediction. There are some improvements that could further increase the sensitivity of the shape ranking step. In shape ranking, there were a few input sequence sets for which the energies of the correct structures were not near-optimal.
Thus, enlarging the sampled folding space will likely increase the sensitivity. However, using a large number of sub-optimal structures increases the computational cost, so a better algorithm is needed to achieve a better tradeoff between sensitivity and running time. LncRNA-ID was specifically designed to identify lncRNAs, which differ from other ncRNAs in that (i) they are longer than 200 nucleotides; (ii) they lack strong sequence conservation across species; (iii) they usually have low to medium expression levels with no specific pattern; and (iv) they may contain ORFs as long as those found in protein-coding transcripts. LncRNA-ID performs lncRNA classification using a balanced random forest based on eleven biologically coherent features. LncRNA-ID focuses on distinguishing lncRNAs from protein-coding transcripts, which is the first critical step toward understanding their underlying biological roles. LncRNA-ID ensembles multiple forest classifiers induced from balanced down-sampled data and thus is able to maintain steady performance under different imbalance ratios and limited learning data. In our experiments, we focused on ribosome interaction in eukaryotes, where more annotated lncRNAs are publicly available. Although LncRNA-ID showed strong classification performance in two important eukaryotes, human and mouse, it has not been tested on prokaryotes. One difference between eukaryotes and prokaryotes is that in eukaryotes the small 40S ribosomal subunit contains the 18S rRNA, whereas in prokaryotes the small 30S ribosomal subunit contains the 16S rRNA. Thus, further study is needed to investigate whether a similar ribosome signal could be captured with the 16S rRNA as with the 18S rRNA. The running time of protein-conservation feature extraction is the bottleneck of lncRNA identification, especially for large numbers of transcripts. As we used profile HMM-based methods to reflect the protein conservation of transcripts, this step can be naturally parallelized.
This methodology has greatly improved the scalability of the analysis of large-scale data.

BIBLIOGRAPHY

[1] P. Schuster, W. Fontana, P. F. Stadler, and I. L. Hofacker. From sequences to shapes and back: a case study in RNA secondary structures. Proceedings of the Royal Society: Biological Sciences, Series B, 255(1344):279–284, 1994.

[2] E. Pennisi. Genomics. ENCODE project writes eulogy for junk DNA. Science, 337(6099):1159, 1161, Sep 2012.

[3] P. Kapranov, J. Cheng, S. Dike, D. A. Nix, R. Duttagupta, A. T. Willingham, P. F. Stadler, J. Hertel, J. Hackermuller, I. L. Hofacker, I. Bell, E. Cheung, J. Drenkow, E. Dumais, S. Patel, G. Helt, M. Ganesh, S. Ghosh, A. Piccolboni, V. Sementchenko, H. Tammana, and T. R. Gingeras. RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science, 316(5830):1484–1488, Jun 2007.

[4] S. Djebali, C. A. Davis, A. Merkel, A. Dobin, T. Lassmann, A. Mortazavi, A. Tanzer, J. Lagarde, W. Lin, F. Schlesinger, C. Xue, G. K. Marinov, J. Khatun, B. A. Williams, C. Zaleski, J. Rozowsky, M. Roder, F. Kokocinski, R. F. Abdelhamid, T. Alioto, I. Antoshechkin, M. T. Baer, N. S. Bar, P. Batut, K. Bell, I. Bell, S. Chakrabortty, X. Chen, J. Chrast, J. Curado, T. Derrien, J. Drenkow, E. Dumais, J. Dumais, R. Duttagupta, E. Falconnet, M. Fastuca, K. Fejes-Toth, P. Ferreira, S. Foissac, M. J. Fullwood, H. Gao, D. Gonzalez, A. Gordon, H. Gunawardena, C. Howald, S. Jha, R. Johnson, P. Kapranov, B. King, C. Kingswood, O. J. Luo, E. Park, K. Persaud, J. B. Preall, P. Ribeca, B. Risk, D. Robyr, M. Sammeth, L. Schaffer, L. H. See, A. Shahab, J. Skancke, A. M. Suzuki, H. Takahashi, H. Tilgner, D. Trout, N. Walters, H. Wang, J. Wrobel, Y. Yu, X. Ruan, Y. Hayashizaki, J. Harrow, M. Gerstein, T. Hubbard, A. Reymond, S. E. Antonarakis, G. Hannon, M. C. Giddings, Y. Ruan, B. Wold, P. Carninci, R. Guigo, and T. R. Gingeras. Landscape of transcription in human cells. Nature, 489(7414):101–108, Sep 2012.

[5] J. E.
Wilusz, H. Sunwoo, and D. L. Spector. Long noncoding RNAs: functional surprises from the RNA world. Genes Dev., 23(13):1494–1504, Jul 2009.

[6] A. Pauli, J. L. Rinn, and A. F. Schier. Non-coding RNAs as regulators of embryogenesis. Nat. Rev. Genet., 12(2):136–149, Feb 2011.

[7] D. P. Bartel. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell, 116(2):281–297, Jan 2004.

[8] Matthew W. Jones-Rhoades, David P. Bartel, and Bonnie Bartel. MicroRNAs and their regulatory roles in plants. Annual Review of Plant Biology, 57:19–53, 2006.

[9] T. Hung and H. Y. Chang. Long noncoding RNA in genome regulation: prospects and mechanisms. RNA Biol, 7(5):582–585, 2010.

[10] T. Derrien, R. Johnson, G. Bussotti, A. Tanzer, S. Djebali, H. Tilgner, G. Guernec, D. Martin, A. Merkel, D. G. Knowles, J. Lagarde, L. Veeravalli, X. Ruan, Y. Ruan, T. Lassmann, P. Carninci, J. B. Brown, L. Lipovich, J. M. Gonzalez, M. Thomas, C. A. Davis, R. Shiekhattar, T. R. Gingeras, T. J. Hubbard, C. Notredame, J. Harrow, and R. Guigo. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res., 22(9):1775–1789, Sep 2012.

[11] D. Bu, K. Yu, S. Sun, C. Xie, G. Skogerb, R. Miao, H. Xiao, Q. Liao, H. Luo, G. Zhao, H. Zhao, Z. Liu, C. Liu, R. Chen, and Y. Zhao. NONCODE v3.0: integrative annotation of long noncoding RNAs. Nucleic Acids Res., 40(Database issue):D210–215, Jan 2012.

[12] Disease-related human lncRNA Profiler. http://www.systembio.com/lncrna-research/disease-long-non-coding-rna.

[13] O. Wapinski and H. Y. Chang. Long noncoding RNAs and human disease. Trends Cell Biol., 21(6):354–361, 2011.

[14] Sean R Eddy. Computational genomics of noncoding RNA genes. Cell, 109(2):137–140, April 2002.

[15] Michael Zuker and Patrick Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucl. Acids Res., 9(1):133–148, 1981.

[16] D.H. Mathews, M.D. Disney, J. L. Childs, S.J. Schroeder, M.
Zuker, and D.H. Turner. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proceedings of the National Academy of Sciences USA, 101:7287–7292, 2004.

[17] D.H. Mathews. Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA, 10:1178–1190, 2004.

[18] J. S. McCaskill. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29(6-7):1105–1119, 1990.

[19] M. Andronescu, C. Pop, and A. Condon. Improved energy parameters for RNA pseudoknotted secondary structure prediction. RNA, 16(1):26–42, 2010.

[20] D K Treiber and J R Williamson. Exposing the kinetic traps in RNA folding. Curr Opin Struct Biol, 9(3):339–45, 1999.

[21] A. Adams, T. Lindahl, and J. R. Fresco. Conformational differences between the biologically active and inactive forms of a transfer ribonucleic acid. Proc. Natl. Acad. Sci. U.S.A., 57(6):1684–1691, Jun 1967.

[22] S. Washietl, I. L. Hofacker, and P. F. Stadler. Fast and reliable prediction of noncoding RNAs. Proc Natl Acad Sci U S A, 102(7):2454–2459, 2005.

[23] R. B. Lyngsø and C. N. Pedersen. RNA pseudoknot prediction in energy-based models. Journal of Computational Biology, 7(3–4):409–427, 2000.

[24] J. N. Hutchinson, A. W. Ensminger, C. M. Clemson, C. R. Lynch, J. B. Lawrence, and A. Chess. A screen for nuclear transcripts identifies two linked noncoding RNAs associated with SC35 splicing domains. BMC Genomics, 8:39, 2007.

[25] H. Sunwoo, M. E. Dinger, J. E. Wilusz, P. P. Amaral, J. S. Mattick, and D. L. Spector. MEN epsilon/beta nuclear-retained non-coding RNAs are up-regulated upon muscle differentiation and are essential components of paraspeckles. Genome Res., 19(3):347–359, Mar 2009.

[26] L. L. Chen and G. G. Carmichael.
Altered nuclear retention of mRNAs containing inverted repeats in human embryonic stem cells: functional role of a nuclear noncoding RNA.Mol. Cell, 35(4):467–478, Aug 2009. [27] Z. Wang, M. Gerstein, and M. Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet., 10(1):57–63, Jan 2009. [28] B. Banfai, H. Jia, J. Khatun, E. Wood, B. Risk, W. E. Gundling, A. Kundaje, H. P. Gunawardena, Y. Yu, L. Xie, K. Krajewski, B. D. Strahl, X. Chen, P. Bickel, M. C. Giddings, J. B. Brown, and L. Lipovich. Long noncoding RNAs are rarely translated in two human cell lines.Genome Res., 22(9):1646–1657, Sep 2012. [29] Ingolia NT Weissman JS Guttman M, Russell P and Lander ES. Ribosome Profiling Provides Evidence that Large Noncoding RNAs Do Not Encode Proteins.Cell, 154:240–251, 2013. [30] K. V. Prasanth and D. L. Spector. Eukaryotic regulatory RNAs: an answer to the ’genome complexity’ conundrum.Genes Dev., 21(1):11–42, Jan 2007. 88 [31] R. Durbin, S.R. Eddy, A. Krogh, and G. Mitchison.Biological Sequence Analysis Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, UK, 1998. [32] Ivo L. Hofacker, Stephan H. F. Bernhart, and Peter F. Stadler. Alignment of RNA base pairing probability matrices.Bioinformatics, 20(14):2222–2227, 2004. [33] Sebastian Will, Kristin Reiche, Ivo L Hofacker, Peter F Stadler, and Rolf Backofen. Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering.PLoS Comput Biol, 3(4):e65, 2007. [34] Elfar Torarinsson, Jakob H. Havgaard, and Jan Gorodkin. Multiple structural alignment and clustering of RNA sequences.Bioinformatics, 23(8):926–932, 2007. [35] Matthias Höchsmann, Thomas Töller, Robert Giegerich, and Stefan Kurtz. Local similarity in RNA secondary structures. InCSB ’03: Proceedings of the IEEE Computer Society Conference on Bioinformatics, page 159, Washington, DC, USA, 2003. IEEE Computer Society. [36] M. Höchsmann, B. Voss, and R. Giegerich. 
Pure multiple RNA secondary structure alignments: a progressive profile approach. IEEE/ACM Trans Comput Biol Bioinform, 1(1):53–62, 2004.
[37] Helene Touzet and Olivier Perriquet. CARNAC: folding families of related RNAs. Nucl. Acids Res., 32(suppl. 2):W142–145, 2004.
[38] S Karlin and S F Altschul. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci, 87(6):2264–2268, 1990.
[39] Comparing RNA secondary structures based on 2D graphical representation.
[40] F Bai, D Li, and T Wang. A new mapping rule for RNA secondary structures with its applications. J. Math. Chem., 43:932–943, 2008.
[41] C Li, A H Wang, and L Xing. Similarity of RNA secondary structures. J. Comput. Chem., 28:508–512, 2007.
[42] Kishore Doshi, Jamie Cannone, Christian Cobaugh, and Robin Gutell. Evaluation of the suitability of free-energy minimization using nearest-neighbour energy parameters for RNA secondary structure prediction. BMC Bioinformatics, 5(1):105, 2004.
[43] David Sankoff. Simultaneous Solution of the RNA Folding, Alignment and Protosequence Problems. SIAM Journal on Applied Mathematics, 45(5):810–825, 1985.
[44] Ian Holmes. A probabilistic model for the evolution of RNA structure. BMC Bioinformatics, 5(1):166, 2004.
[45] Robin D Dowell and Sean R Eddy. Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints. BMC Bioinformatics, 7(400), 2006.
[46] Sven Siebert and Rolf Backofen. MARNA: multiple alignment and consensus structure prediction of RNAs based on sequence structure comparisons. Bioinformatics, 21(16):3352–3359, 2005.
[47] IL Hofacker, M Fekete, and PF Stadler. Secondary structure prediction for aligned RNA sequences. J Mol Biol., 319(5):1059–66, 2002.
[48] Elena Rivas and Sean R. Eddy. Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics, 2(1):8, 2001.
[49] B. Knudsen and J. Hein. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res, 31(13):3423–3428, 2003.
[50] M. Blanchette, W. J. Kent, C. Riemer, L. Elnitski, A. F. Smit, K. M. Roskin, R. Baertsch, K. Rosenbloom, H. Clawson, E. D. Green, D. Haussler, and W. Miller. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res, 14(4):708–715, 2004.
[51] Julie D. Thompson, Desmond G. Higgins, and Toby J. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acids Res., 22(22):4673–4680, 1994.
[52] I.L. Hofacker. Vienna RNA secondary structure server. Nucleic Acids Res., 31:3429–3431, 2003.
[53] Tommi Jaakkola and David Haussler. Exploiting generative models in discriminative classifiers. In NIPS, pages 487–493, 1998.
[54] Robin Dowell and Sean Eddy. Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics, 5(1):71, 2004.
[55] S F Altschul, W Gish, W Miller, E W Myers, and D J Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215:403–410, 1990.
[56] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol, 48:443–453, 1970.
[57] S. Griffiths-Jones, S. Moxon, M. Marshall, A. Khanna, S. R. Eddy, and A. Bateman. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res., 33:D121–D124, 2005.
[58] Chuong B. Do, Daniel A. Woods, and Serafim Batzoglou. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22(14):e90–98, 2006.
[59] N. R. Markham and M. Zuker. DINAMelt web server for nucleic acid melting prediction. Nucleic Acids Res., 33:W577–W581, 2005.
[60] N. R. Markham and M. Zuker. UNAFold: software for nucleic acid folding and hybridization. In Bioinformatics, Volume II: Structure, Functions and Applications, number 453 in Methods in Molecular Biology, pages 3–31, Totowa, NJ, USA, 2008. Humana Press.
[61] D.F. Feng and R.F. Doolittle. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol., 25(4):351–360, 1987.
[62] Hisanori Kiryu, Yasuo Tabei, Taishin Kin, and Kiyoshi Asai. Murlet: a practical multiple alignment tool for structural RNA sequences. Bioinformatics, 23(13):1588–1598, 2007.
[63] A. Wilm, I. Mainz, and G. Steger. An enhanced RNA alignment benchmark for sequence alignment programs. Algorithms Mol Biol., 1(19), 2006.
[64] D. W. Staple and S. E. Butcher. Pseudoknots: RNA structures with diverse functions. PLoS Biol., 3(6):e213, 2005.
[65] E. Rivas and S. Eddy. The language of RNA: a formal grammar that includes pseudoknots. Bioinformatics, 16(4):334–340, 2000.
[66] X. Xu, Y. Ji, and G. D. Stormo. RNA Sampler: a new sampling based algorithm for common RNA secondary structure prediction and structural alignment. Bioinformatics, 23(15):1883–1891, 2007.
[67] J. Ruan, G. D. Stormo, and W. Zhang. ILM: a web server for predicting RNA secondary structures with pseudoknots. Nucleic Acids Res., 32(suppl 2):W146–W149, 2004.
[68] Paul P Gardner, Jennifer Daub, John Tate, Benjamin L Moore, Isabelle H Osuch, Sam Griffiths-Jones, Robert D Finn, Eric P Nawrocki, Diana L Kolbe, Sean R Eddy, and et al. Rfam: Wikipedia, clans and the “decimal” release. Nucleic Acids Research, 39:D141–D145, 2011.
[69] Paul Gardner and Robert Giegerich. A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics, 5(1):140, 2004.
[70] Stephan H. Bernhart and Ivo L. Hofacker. From consensus structure prediction to RNA gene finding. Briefings in Functional Genomics & Proteomics, 8(6):461–471, 2009.
[71] Robert Giegerich, Björn Voß, and Marc Rehmsmeier. Abstract shapes of RNA. Nucleic Acids Research, 32(16):4843–4851, 2004.
[72] Anika Scheid and Markus Nebel. On abstract shapes of RNA. Technical Report 368, Informatik, 2008.
[73] Jens Reeder and Robert Giegerich. Consensus shapes: an alternative to the Sankoff algorithm for RNA consensus structure prediction. Bioinformatics, 21(17):3516–3523, 2005.
[74] Eugene Berezikov, Geert van Tetering, Mark Verheul, Jose van de Belt, Linda van Laake, Joost Vos, Robert Verloop, Marc van de Wetering, Victor Guryev, Shuji Takada, Anton Jan van Zonneveld, Hiroyuki Mano, Ronald Plasterk, and Edwin Cuppen. Many novel mammalian microRNA candidates identified by extensive cloning and RAKE analysis. Genome Res, 16(10):1289–98, 2006.
[75] Jian Lu, Yang Shen, Qingfa Wu, Supriya Kumar, Bin He, Suhua Shi, Richard W Carthew, San Ming Wang, and Chung-I Wu. The birth and death of microRNA genes in Drosophila. Nat Genet, 40(3):351–5, 2008.
[76] S. Janssen, J. Reeder, and R. Giegerich. Shape based indexing for faster search of RNA family databases. BMC Bioinformatics, 9(1):131, 2008.
[77] Chun-Hsiang Huang, Chin Lung Lu, and Hsien-Tai Chiu. A heuristic approach for detecting RNA H-type pseudoknots. Bioinformatics, 21(17):3501–3508, 2005.
[78] Song Cao and Shi-Jie Chen. Predicting structures and stabilities for H-type pseudoknots with interhelix loops. RNA, 15:696–706, 2009.
[79] Jana Sperschneider and Amitava Datta. DotKnot: pseudoknot prediction using the probability dot plot under a refined energy model. Nucleic Acids Research, 38(7):e103, 2010.
[80] Kengo Sato, Yuki Kato, Michiaki Hamada, Tatsuya Akutsu, and Kiyoshi Asai. IPknot: fast and accurate prediction of RNA secondary structures with pseudoknots using integer programming. Bioinformatics, 27(13):i85–i93, 2011.
[81] Michael Bon and Henri Orland. TT2NE: a novel algorithm to predict RNA secondary structures with pseudoknots. Nucleic Acids Research, 39(14):e93, 2011.
[82] Matthew G. Seetin and David H Mathews. TurboKnot: rapid prediction of conserved RNA secondary structures including pseudoknots. Bioinformatics, 28(6):792–798, 2012.
[83] Stefan Janssen and Robert Giegerich. Faster computation of exact RNA shape probabilities. Bioinformatics, 26(5):632–639, 2010.
[84] Peter Steffen, Björn Voß, Marc Rehmsmeier, Jens Reeder, and Robert Giegerich. RNAshapes: an integrated RNA analysis package based on abstract shapes. Bioinformatics, 22(4):500–503, 2006.
[85] Björn Voss, Robert Giegerich, and Marc Rehmsmeier. Complete probabilistic analysis of RNA shapes. BMC Biology, 4(1):5, 2006.
[86] Samuel S. Cho, David L. Pincus, and D. Thirumalai. Assembly mechanisms of RNA pseudoknots are determined by the stabilities of constituent secondary structures. Proceedings of the National Academy of Sciences, 106(41):17349–17354, 2009.
[87] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92, pages 144–152, New York, USA, 1992.
[88] D. H. Mathews, J. Sabina, M. Zuker, and D. H. Turner. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. Journal of Molecular Biology, 288(5):911–940, 1999.
[89] Yi-wei Chen. Combining SVMs with various feature selection strategies. In Feature Extraction: Foundations and Applications (Studies in Fuzziness and Soft Computing). Springer-Verlag, 2005.
[90] K Z Mao. Fast orthogonal forward selection algorithm for feature subset selection. IEEE Transactions on Neural Networks, 13(5):1218–1224, 2002.
[91] T. Marill and D. Green. On the effectiveness of receptors in recognition systems. IEEE Transactions on Information Theory, 9(1):11–17, 1963.
[92] Tong Zhang. Fundamental Statistical Techniques. In Handbook of Natural Language Processing. Chapman and Hall, 2010.
[93] Stephen Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241–254, 1967.
[94] Rujira Achawanantakun, Seyedeh Shohreh Takyar, and Yanni Sun. Grammar string: a novel ncRNA secondary structure representation. Life Sciences Society, pages 2–13, 2010.
[95] Rujira Achawanantakun, Yanni Sun, and Seyedeh Shohreh Takyar. ncRNA consensus secondary structure derivation using grammar strings. J. Bioinformatics and Computational Biology, 9(2):317–337, 2011.
[96] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.
[97] Christina Witwer, Ivo L Hofacker, and Peter F Stadler. Prediction of consensus RNA secondary structures including pseudoknots. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 1(2):66–77, 2004.
[98] Jihong Ren, Baharak Rastegari, Anne Condon, and Holger H Hoos. HotKnots: heuristic prediction of RNA secondary structures including pseudoknots. RNA, 11:1494–1504, 2005.
[99] I. Bentwich, A. Avniel, Y. Karov, R. Aharonov, S. Gilad, O. Barad, A. Barzilai, P. Einat, U. Einav, E. Meiri, E. Sharon, Y. Spector, and Z. Bentwich. Identification of hundreds of conserved and nonconserved human microRNAs. Nat. Genet., 37(7):766–770, Jul 2005.
[100] M. A. Blasco. Telomere length, stem cells and aging. Nat. Chem. Biol., 3(10):640–649, Oct 2007.
[101] P. Carninci et al. The transcriptional landscape of the mammalian genome. Science, 309(5740):1559–1563, Sep 2005.
[102] J. Liu, C. Jung, J. Xu, H. Wang, S. Deng, L. Bernad, C. Arenas-Huertero, and N. H. Chua. Genome-wide analysis uncovers regulation of long intergenic noncoding RNAs in Arabidopsis. Plant Cell, 24(11):4333–4345, Nov 2012.
[103] S. Boerner and K. M. McGinnis. Computational identification and functional predictions of long noncoding RNA in Zea mays. PLoS ONE, 7(8):e43047, 2012.
[104] F. C. Humann, G. J. Tiberio, and K. Hartfelder. Sequence and Expression Characteristics of Long Noncoding RNAs in Honey Bee Caste Development - Potential Novel Regulators for Transgressive Ovary Size. PLoS ONE, 8(10):e78915, 2013.
[105] C. Arriaga-Canon, Y. Fonseca-Guzman, C. Valdes-Quezada, R. Arzate-Mejia, G. Guerrero, and F. Recillas-Targa. A long non-coding RNA promotes full activation of adult gene expression in the chicken globin domain. Epigenetics, 9(1), Nov 2013.
[106] Y. Liu, D. Luo, H. Zhao, Z. Zhu, W. Hu, and C. H. Cheng. Inheritable and Precise Large Genomic Deletions of Non-Coding RNA Genes in Zebrafish Using TALENs. PLoS ONE, 8(10):e76387, 2013.
[107] G. Chen, Z. Wang, D. Wang, C. Qiu, M. Liu, X. Chen, Q. Zhang, G. Yan, and Q. Cui. LncRNADisease: a database for long-non-coding RNA-associated diseases. Nucleic Acids Res., 41(Database issue):D983–986, Jan 2013.
[108] Y. Okazaki, M. Furuno, T. Kasukawa, J. Adachi, H. Bono, S. Kondo, I. Nikaido, N. Osato, R. Saito, H. Suzuki, I. Yamanaka, H. Kiyosawa, K. Yagi, Y. Tomaru, Y. Hasegawa, A. Nogami, C. Schonbach, T. Gojobori, R. Baldarelli, D. P. Hill, C. Bult, D. A. Hume, J. Quackenbush, L. M. Schriml, A. Kanapin, H. Matsuda, S. Batalov, K. W. Beisel, J. A. Blake, D. Bradt, V. Brusic, C. Chothia, L. E. Corbani, S. Cousins, E. Dalla, T. A. Dragani, C. F. Fletcher, A. Forrest, K. S. Frazer, T. Gaasterland, M. Gariboldi, C. Gissi, A. Godzik, J. Gough, S. Grimmond, S. Gustincich, N. Hirokawa, I. J. Jackson, E. D. Jarvis, A. Kanai, H. Kawaji, Y. Kawasawa, R. M. Kedzierski, B. L. King, A. Konagaya, I. V. Kurochkin, Y. Lee, B. Lenhard, P. A. Lyons, D. R. Maglott, L. Maltais, L. Marchionni, L. McKenzie, H. Miki, T. Nagashima, K. Numata, T. Okido, W. J. Pavan, G. Pertea, G. Pesole, N. Petrovsky, R. Pillai, J. U. Pontius, D. Qi, S. Ramachandran, T. Ravasi, J. C. Reed, D. J. Reed, J. Reid, B. Z. Ring, M. Ringwald, A. Sandelin, C. Schneider, C. A. Semple, M. Setou, K. Shimada, R. Sultana, Y. Takenaka, M. S. Taylor, R. D. Teasdale, M. Tomita, R. Verardo, L. Wagner, C. Wahlestedt, Y. Wang, Y. Watanabe, C. Wells, L. G. Wilming, A. Wynshaw-Boris, M. Yanagisawa, I. Yang, L. Yang, Z. Yuan, M. Zavolan, Y. Zhu, A. Zimmer, P. Carninci, N. Hayatsu, T. Hirozane-Kishikawa, H. Konno, M. Nakamura, N. Sakazume, K. Sato, T. Shiraki, K. Waki, J. Kawai, K. Aizawa, T. Arakawa, S. Fukuda, A. Hara, W. Hashizume, K. Imotani, Y. Ishii, M. Itoh, I. Kagawa, A. Miyazaki, K. Sakai, D. Sasaki, K. Shibata, A. Shinagawa, A. Yasunishi, M. Yoshino, R. Waterston, E. S. Lander, J. Rogers, E. Birney, and Y. Hayashizaki. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature, 420(6915):563–573, Dec 2002.
[109] M. E. Dinger, K. C. Pang, T. R. Mercer, and J. S. Mattick. Differentiating protein-coding and noncoding RNA: challenges and ambiguities. PLoS Comput. Biol., 4(11):e1000176, Nov 2008.
[110] N. Brockdorff, A. Ashworth, G. F. Kay, V. M. McCabe, D. P. Norris, P. J. Cooper, S. Swift, and S. Rastan. The product of the mouse Xist gene is a 15 kb inactive X-specific transcript containing no conserved ORF and located in the nucleus. Cell, 71(3):515–526, Oct 1992.
[111] G. Borsani, R. Tonlorenzi, M. C. Simmler, L. Dandolo, D. Arnaud, V. Capra, M. Grompe, A. Pizzuti, D. Muzny, C. Lawrence, H. F. Willard, P. Avner, and A. Ballabio. Characterization of a murine gene expressed from the inactive X chromosome. Nature, 351(6324):325–329, May 1991.
[112] L. Kong, Y. Zhang, Z. Q. Ye, X. Q. Liu, S. Q. Zhao, L. Wei, and G. Gao. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Res., 35(Web Server issue):W345–349, Jul 2007.
[113] M. F. Lin, I. Jungreis, and M. Kellis. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics, 27(13):i275–282, Jul 2011.
[114] A. Marchler-Bauer, C. Zheng, F. Chitsaz, M. K. Derbyshire, L. Y. Geer, R. C. Geer, N. R. Gonzales, M. Gwadz, D. I. Hurwitz, C. J. Lanczycki, F. Lu, S. Lu, G. H. Marchler, J. S. Song, N. Thanki, R. A. Yamashita, D. Zhang, and S. H. Bryant.
CDD: conserved domains and protein three-dimensional structure. Nucleic Acids Res., 41(Database issue):D348–352, Jan 2013.
[115] A. Marchler-Bauer and S. H. Bryant. CD-Search: protein domain annotations on the fly. Nucleic Acids Res., 32(Web Server issue):W327–331, Jul 2004.
[116] R. A. Chodroff, L. Goodstadt, T. M. Sirey, P. L. Oliver, K. E. Davies, E. D. Green, Z. Molnar, and C. P. Ponting. Long noncoding RNA genes: conservation of sequence and brain expression among diverse amniotes. Genome Biol., 11(7):R72, 2010.
[117] Patrick D. Schloss. The effects of alignment quality, distance calculation method, sequence filtering, and region on the analysis of 16S rRNA gene-based studies. PLoS Computational Biology, 6(7), 2010.
[118] L. Wang, H. J. Park, S. Dasari, S. Wang, J. P. Kocher, and W. Li. CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic Acids Res., 41(6):e74, Apr 2013.
[119] Foster Provost. Machine learning from imbalanced data sets 101, 2000.
[120] M. Guttman, M. Garber, J. Z. Levin, J. Donaghey, J. Robinson, X. Adiconis, L. Fan, M. J. Koziol, A. Gnirke, C. Nusbaum, J. L. Rinn, E. S. Lander, and A. Regev. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs. Nature, 458:223–227, February 2009.
[121] Andrea Pauli, Eivind Valen, Michael F. Lin, Manuel Garber, Nadine L. Vastenhouw, Joshua Z. Levin, Lin Fan, Albin Sandelin, John L. Rinn, Aviv Regev, and Alexander F. Schier. Systematic identification of long noncoding RNAs expressed during zebrafish embryogenesis. Genome Res., 22:577–591, 2012.
[122] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[123] J. Wu, H. Liu, X. Duan, Y. Ding, H. Wu, Y. Bai, and X. Sun. Prediction of DNA-binding residues in proteins from amino acid sequences using a random forest model with a hybrid feature. Bioinformatics, 25(1):30–35, Jan 2009.
[124] Y. Y. Leung, P. Ryvkin, L. H. Ungar, B. D. Gregory, and L. S. Wang. CoRAL: predicting non-coding RNAs from small RNA-sequencing data. Nucleic Acids Res., 41(14):e137, Aug 2013.
[125] X. Chen and H. Ishwaran. Random forests for genomic data analysis. Genomics, 99(6):323–329, 2012.
[126] K. Shaw. Biological applications of support vector machines. Nature Education, 1(1):201, 2008.
[127] Ribosomes, Transcription, and Translation. http://www.nature.com/scitable/topicpage/ribosomes-transcription-and-translation-14120660.
[128] C. Xing, D. L. Bitzer, W. E. Alexander, M. A. Vouk, and A. M. Stomp. Identification of protein-coding sequences using the hybridization of 18S rRNA and mRNA during translation. Nucleic Acids Res., 37(2):591–601, Feb 2009.
[129] N. T. Ingolia, L. F. Lareau, and J. S. Weissman. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell, 147(4):789–802, Nov 2011.
[130] M. Guttman, P. Russell, N. T. Ingolia, J. S. Weissman, and E. S. Lander. Ribosome profiling provides evidence that large noncoding RNAs do not encode proteins. Cell, 154(1):240–251, Jul 2013.
[131] Y. Arava, Y. Wang, J. D. Storey, C. L. Liu, P. O. Brown, and D. Herschlag. Genome-wide analysis of mRNA translation profiles in Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. U.S.A., 100(7):3889–3894, Apr 2003.
[132] M. Kozak. Context effects and inefficient initiation at non-AUG codons in eucaryotic cell-free translation systems. Mol. Cell. Biol., 9:5073–5080, 1989.
[133] Heng Xu et al. Length of the ORF, position of the first AUG and the Kozak motif are important factors in potential dual-coding transcripts. Cell Research, 20:445–457, 2010.
[134] M. De Angioletti, G. Lacerra, V. Sabato, and C. Carestia. Beta+45 G -> C: a novel silent beta-thalassaemia mutation, the first in the Kozak sequence. Br. J. Haematol., 124(2):224–231, Jan 2004.
[135] M. Kozak. Initiation of translation in prokaryotes and eukaryotes. Gene, 234:187–208, 1999.
[136] M. Kozak.
Recognition of AUG and alternative initiator codons is augmented by G in position +4 but is not generally affected by the nucleotides in positions +5 and +6. EMBO J, 16:2482–2492, 1997.
[137] Ulrike Mückstein, Hakim Tafer, Jörg Hackermüller, Stephan H. Bernhart, Peter F. Stadler, and Ivo L. Hofacker. Thermodynamics of RNA-RNA binding. Bioinformatics, 22(10):1177–1182, 2006.
[138] J. Vasquez, C. Hon, J. T. Vanselow, A. Schlosser, and T. N. Siegel. Comparative ribosome profiling reveals extensive translational complexity in different Trypanosoma brucei life cycle stages. Nucl. Acids Res., 42(6):3623–3637, 2014.
[139] R. D. Finn, J. Clements, and S. R. Eddy. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res., 39(Web Server issue):29–37, Jul 2011.
[140] M. Punta, P. C. Coggill, R. Y. Eberhardt, J. Mistry, J. Tate, C. Boursnell, N. Pang, K. Forslund, G. Ceric, J. Clements, A. Heger, L. Holm, E. L. Sonnhammer, S. R. Eddy, A. Bateman, and R. D. Finn. The Pfam protein families database. Nucleic Acids Res., 40(Database issue):290–301, Jan 2012.
[141] Y. Zhang, Y. Sun, and J. R. Cole. A Sensitive and Accurate protein domain cLassification Tool (SALT) for short reads. Bioinformatics, 29(17):2103–2111, Sep 2013.
[142] M. Magrane and U. Consortium. UniProt Knowledgebase: a hub of integrated protein data. Database (Oxford), 2011:bar009, 2011.
[143] S. R. Eddy. A new generation of homology search tools based on probabilistic inference. Genome Inform, 23(1):205–211, Oct 2009.
[144] Yuan Zhang and Yanni Sun. MetaDomain: a profile HMM-based protein domain classification tool for short sequences. In Proceedings of Pacific Symposium on Biocomputing (PSB), 2012.
[145] Chao Chen, Andy Liaw, and Leo Breiman. Using Random Forest to Learn Imbalanced Data.
[146] Kai-Zhu Huang, Haiqin Yang, and Michael R. Lyu. Machine Learning: Modeling Data Locally and Globally. Springer Science and Business Media, 2008.
[147] Weka 3: Data Mining Software in Java. http://www.cs.waikato.ac.nz/ml/weka.
[148] Kim D. Pruitt, Tatiana A. Tatusova, and Donna R. Maglott. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research, 35(Database issue):61–65, 2007.
[149] M. N. Cabili, C. Trapnell, L. Goff, M. Koziol, B. Tazon-Vega, A. Regev, and J. L. Rinn. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev., 25(18):1915–1927, Sep 2011.
[150] Baris E. Suzek, Hongzhan Huang, Peter B. McGarvey, Raja Mazumder, and Cathy H. Wu. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics, 23(10):1282–1288, 2007.
[151] PhyloCSF. https://github.com/mlin/PhyloCSF/wiki.
[152] T. R. Mercer, M. E. Dinger, and J. S. Mattick. Long non-coding RNAs: insights into functions. Nat. Rev. Genet., 10(3):155–159, Mar 2009.