FAST NCRNA IDENTIFICATION TECHNIQUES
By
Seyedeh Shohreh Takyar

A THESIS
Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of
MASTER OF SCIENCE
Computer Science
2012

ABSTRACT
FAST NCRNA IDENTIFICATION TECHNIQUES
By
Seyedeh Shohreh Takyar

Many noncoding RNAs (ncRNAs) function through both their sequences and secondary
structures. Thus, secondary structure derivation is an important issue in today’s RNA
research. The state-of-the-art structure annotation tools are based on comparative analysis,
which derives consensus structure of homologous ncRNAs. Despite promising results from
existing ncRNA aligning and consensus structure derivation tools, there is a need for more
eﬃcient and accurate ncRNA secondary structure modeling and alignment methods.
In this thesis, we introduce grammar string, a novel ncRNA secondary structure representation that encodes an ncRNAs sequence and secondary structure in the parameter
space of a context-free grammar (CFG). Being a string deﬁned on a special alphabet constructed from a CFG, it converts ncRNA alignment into sequence alignment with n square
complexity. We explain how this representation is used in derivation of consensus secondary
structure through multiple ncRNA alignment and also how existing clustering methods could
be applied to ncRNAs represented by this model.

Copyright by
SEYEDEH SHOHREH TAKYAR
2012

This thesis is dedicated to my parents, Touba and Shahrokh who taught me the most
important lesson in my life: Honesty.

iv

ACKNOWLEDGEMENTS

It would not have been possible to write this thesis without the help and support of the
kind people around me, to only some of whom it is possible to give particular mention here.
First and foremost, I would like to express my great appreciation to my advisor, Professor
Yanni Sun, for her valuable guidance and advice over the course of this thesis. She has always
given me enough time to discuss ideas and evaluate diﬀerent alternatives in all steps of my
research work.
My most sincere appreciation goes to Professor Titus Brown, as both my teacher and as a
member of my dissertation committee. I really enjoyed the Open Problem in Bioinformatic
course taught by him. He taught me that without having strong biology background, a
computer scientist could never make a great stride in Bioinformatics.
I would also like to extend my deepest gratitude to Professor James Cole, from Microbial Ecology Center, who kindly accepted to serve on my examining committee. I really
appreciate great comments and suggestions by him.
My special thanks must go to Professor Abdol-Hossein Esfahanian, Associate Chair at the
Department of Computer Science and Engineering. He has always supported me, whenever
I needed his help to meet diﬀerent milestone events.
I wish to express my heartfelt thanks to great friends at Michigan State University, for
being the surrogate family during the years I lived at Michigan. I will deﬁnitely miss our
study hours and other gatherings at MSU Dairy Store.
Finally, I am forever indebted to my parents, Touba and Shahrokh, and my family for
their endless supports and sacriﬁces.

v

Table of Contents
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

x

1 Introduction . . . . . . . . . . . . . . . . . . . . . .
1.1 Non-coding RNAs, Structure and functionalities
1.2 Our contribution . . . . . . . . . . . . . . . . .
1.3 Literature review and related works . . . . . . .
1.4 Road map of this thesis . . . . . . . . . . . . .

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

1
2
3
5
7

2 Grammar string design . . . . . . . . . . . . . . . . . . . . . . . .
2.1 Preliminaries and standard deﬁnitions . . . . . . . . . . . . . . .
2.1.1 Formal Language . . . . . . . . . . . . . . . . . . . . . .
2.1.2 Transformational Grammars . . . . . . . . . . . . . . . .
2.1.3 Derivation process . . . . . . . . . . . . . . . . . . . . .
2.1.4 Parse tree . . . . . . . . . . . . . . . . . . . . . . . . . .
2.1.5 ambiguous grammar . . . . . . . . . . . . . . . . . . . .
2.1.6 Grammar classes . . . . . . . . . . . . . . . . . . . . . .
2.1.7 Context free grammars and ncRNA secondary structure .
2.2 Grammar string design . . . . . . . . . . . . . . . . . . . . . . .
2.2.1 An unambiguous CFG for ncRNA generation . . . . . .
2.2.2 Grammar string generation algorithm . . . . . . . . . . .
2.2.3 Design a new grammar string of Rfam database . . . . .
2.2.4 Grammar pattern for encoding stem structures . . . . . .
2.2.5 Other grammar pattern design strategies . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

8
8
8
9
10
11
12
14
15
16
17
18
22
23
25

3 Multiple ncRNA structural alignment based on grammar string . . . . .
3.1 Summary of designed experiment for multiple ncRNA structural alignment
by Achawanantakun [1] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2 Score table design for grammar string alignment . . . . . . . . . . . . . . . .
3.3 Extracting consensus sequence of ncRNA families . . . . . . . . . . . . . . .

27

4 Clustering based on grammar string . . . . . . . . . .
4.1 NcRNA clustering; problem deﬁnition and approach . .
4.2 Step by step design for clustering experiment . . . . . .
4.2.1 Extracting consensus sequences in Rfam families

33
34
35
35

vi

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

27
29
30

. . . .
. . . .
. . . .
. . . .
. . . .
gram. . . .

37
39
41
42
42

Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

55

A Ribosum matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
A.1 Used in grammar string chapter . . . . . . . . . . . . . . . . . . . . . . . . .

57
57

B Old and new score tables . . . . . . . . . . . . . . . . . . . . . . . . . .
B.1 Old grammar string score table . . . . . . . . . . . . . . . . . . . . . . . . .
B.2 New grammar string score table . . . . . . . . . . . . . . . . . . . . . . . . .

60
60
62

C Links to Web Pages of Clan families and their members . . .
C.1 Used in experimental results section . . . . . . . . . . . . . . . . . . . . . . .

66
66

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

67

4.3

4.2.2 Strategy of extracting data sets for our experiment . . . . . .
4.2.3 Pairwise alignment . . . . . . . . . . . . . . . . . . . . . . . .
4.2.4 Applying MCL clustering approach . . . . . . . . . . . . . . .
Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.3.1 Experimental results of old designed grammar string . . . . .
4.3.2 Analysis of experimental results based on using new version of
mar string design . . . . . . . . . . . . . . . . . . . . . . . . .

vii

54

List of Tables
2.1

Observed → Replacement . . . . . . . . . . . . . . . . . . . . . . . . . . . .

23

3.1

Observed → Replacement . . . . . . . . . . . . . . . . . . . . . . . . . . . .

31

3.2

Observed → Replacement . . . . . . . . . . . . . . . . . . . . . . . . . . . .

31

3.3

Observed → Replacement . . . . . . . . . . . . . . . . . . . . . . . . . . . .

31

4.1

Mapping in the consensus sequence . . . . . . . . . . . . . . . . . . . . . . .

41

4.2

Classiﬁcation of conserved clans based on internal average score . . . . . . .

43

4.3

Grammar patterns of 6 Clans . . . . . . . . . . . . . . . . . . . . . . . . . .

47

4.4

Discovered Clusters in Rfam−Clan . . . . . . . . . . . . . . . . . . . . . . .

54

viii

List of Figures
1.1

Central Dogma schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1

1.2

General form of ncRNA secondary structure . . . . . . . . . . . . . . . . . .

2

2.1

“aab” pars trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

12

2.2

Ambiguity in pars tree of “a+a+a” left-most derivation . . . . . . . . . . . .

13

2.3

Equivalent non-ambiguous pars tree of “a+a+a” . . . . . . . . . . . . . . . .

14

2.4

Crossing and non-crossing correlation between pairs of symbols . . . . . . . .

16

2.5

Secondary structure of a sample ncRNA . . . . . . . . . . . . . . . . . . . .

18

2.6

Left most derivation pars tree of Figure 2.5 . . . . . . . . . . . . . . . . . . .

19

2.7

Two tRNA sequences from the human genome and the alignment of their
grammar strings.The stars below the alignment denote exact matches. . . .

20

2.8

Algorithm for generating a grammar string for substring Xi..j . . . . . . . . .

21

2.9

Four diﬀerent stem structures and their grammar patterns. The left column
shows the 2D representation of an ncRNA folding. The right column shows
the distributions of stems along an ncRNA sequence. All grammar patterns
are generated using G4 (our chosen unambiguous context-free grammar). . .

24

4.1

Mapping in the consensus sequence . . . . . . . . . . . . . . . . . . . . . . .

37

4.2

Pairwise alignment conﬁgurations . . . . . . . . . . . . . . . . . . . . . . . .

43

4.3

Summary of clustering process of conserved clans members . . . . . . . . . .

44

4.4

Pairwise alignment scores - Part1 . . . . . . . . . . . . . . . . . . . . . . . .

45

4.5

Pairwise alignment scores - Part2 . . . . . . . . . . . . . . . . . . . . . . . .

46

4.6

External and internal average scores of all conserved clans . . . . . . . . . .

53

ix

A.1 Ribosum matrix - Part1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

57

A.2 Ribosum matrix - Part2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

58

A.3 Ribosum matrix - Part3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

59

B.1 Score table used in pairwise alignments of old garment strings - Part1 . . . .

60

B.2 Score table used in pairwise alignments of old garment strings - Part2 . . . .

61

B.3 Score table used in pairwise alignments of old garment strings - Part1 . . . .

62

B.4 Score table used in pairwise alignments of old garment strings - Part2 . . . .

63

B.5 Score table used in pairwise alignments of old garment strings - Part3 . . . .

64

B.6 Score table generated based on Ribosum matrix and used in pairwise alignments of new garment strings - Part4 . . . . . . . . . . . . . . . . . . . . . .

65

C.1 Links to Web Pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

66

x

Chapter 1
Introduction
DNA (Deoxyribonucleic acid), RNA (Ribonucleic acid), and proteins are three major essential molecules for all known forms of life. DNA is a double-stranded nucleic acid which carries
the entirety of genetic information (called genome) of a living organism. Genome includes
thousands of genes and non-coding sequence segments. During a biological process called
transcription, genome is copied from DNA into a single stranded nucleic acid molecule called
RNA. Transcribed DNA genes are transformed into protein molecules during another biological process called translation and serve as enzymes, hormones, or antibodies for the proper
functioning of an organism. However, transcribed non-coding sequences do not have the
potential to be translated into protein molecules. These function directly as RNA molecules
and are named non-coding RNAs (ncRNA). Figure 1.1 shows the simpliﬁed model of Central
Dogma [5].

Figure 1.1: Central Dogma schema

1

1.1

Non-coding RNAs, Structure and functionalities

NcRNA is a single-stranded chain of nucleotides Adenine (A), Uracil (U), Cytosine (C), and
Guanine (G). There may also be a hydrogen band between some nucleotide pairs of {A, U},
{C, G}, and {G ,U}, causing nucleotides of each pair to be bounded together and making
local folds in ncRNA sequences. These paired-up nucleotides are called Complementary
base pairs. The two-dimensional shape of ncRNA formed by these complimentary base
pairs is called secondary structure of that ncRNA molecule. Figure 1.2 shows the general
form of secondary structure and the corresponding names of possible sub-structures made
by complimentary base pairs.

A

A
Hairpin loops

A
C
C
C

A
A
C

A

A A
A

G

A
A

U

C

U C
G

G

A

A

U

C

A

U

A

UA

A A
A
C
G
G A

A

C U

C
C

C

G

A
C

U

U U

C

C
C C

A

C

G

A

C A A

C

C A A G

U

C

U

A

C
Bulge loop

Multi-branched loop

G

U

Stems

Figure 1.2: General form of ncRNA secondary structure
2

Interior loop

Transcribed non-coding sequences play diverse and important roles in many biochemical
processes. For example, two typical house keeping ncRNAs, tRNA and rRNA, are key
components for protein synthesis while MicroRNAs (miRNAs) play critical regulatory roles
via interactions with speciﬁc target mRNAs in many organisms [23]. Also, short interfering
RNAs (siRNAs) are involved in gene silencing in RNAi process [31].
Considering the vital roles ncRNAs play in living organism, any eﬀort which helps know
ncRNAs and their biological functions is highly important to modern biology. For example,
comparative ncRNA identiﬁcation, which searches for ncRNAs through evidence of evolutionary conservation, is the state-of-the-art methodology for ncRNA ﬁnding. Since the
functions of most ncRNAs are determined by both their sequences and secondary structures,
comparative ncRNA identiﬁcation must exploit not only the sequence, but also the structural
conservations.

1.2

Our contribution

The origin of context free grammar lies in formal language theory. In biology, stochastic
context-free grammar (SCFG) [10] pertains to the problem of alignment and folding homologous families of RNA sequence. A successful application of SCFG is ncRNA classiﬁcation,
which classiﬁes query sequences into annotated ncRNA families such as tRNA, rRNA, riboswitch families. Other secondary structure modeling representations such as base pair
probability matrices [18, 41, 45], tree proﬁles [14, 15], stem graphs [42] etc. have been used
in RNA alignment, an important step in novel ncRNA detection. These alignment methods
ﬁrst infer the possible structures of each input sequence and then conduct structural alignment, whose accuracy and eﬃciently are highly dependent on structural representations.
Despite promising output by existing alignment tools, many existing secondary structure
representations are highly complicated, incurring high computational cost during alignment.
Even with various heuristics or pruning techniques to reduce the time complexity, ncRNA

3

structural alignment are still more computationally intensive than pure sequence alignment
and scale poorly with the number and length of input sequences. Therefore, it remains
important to develop an eﬃcient and accurate structural modeling and comparison method.
In this thesis, we design a novel secondary structure representation and show its application in consensus structure derivation through multiple ncRNA alignment and also in
ncRNA clustering task. The two contributions are listed below.
First, we design and implement grammar string, a novel ncRNA secondary structure
representation. A grammar string is deﬁned on a special alphabet constructed from a carefully chosen context free grammar (CFG). It encodes how this CFG generates an ncRNA
sequence and its secondary structure. Compared to other secondary structure representations, grammar strings are simple and can take advantage of well-developed algorithms on
sequences or strings. For example, grammar strings can convert ncRNA alignment into sequence alignment without losing any structural conservation, rendering highly eﬃcient RNA
alignment algorithm. In addition, supporting theories for sequence alignment such as score
table design and Karlin-Altual statistics [25] can be applied to grammar string alignment.
It is worth mentioning that other string-based secondary structure representations [3, 30, 47]
exist. However, those methods focus on deriving ncRNAs’ similarities without resorting
to alignment and thus cannot be directly applied for consensus structure derivation from
homologous ncRNAs.
An interesting application in biology is to classify ncRNAs into families with structural
homology. The Rfam database is a standard ncRNA database which will heavily be used
during the course of this thesis. This database maintains alignments, consensus secondary
structures and such for RNA families. Each family represents a set of RNA sequences that
function at the RNA level and share a clear common ancestor. However, there is still a
great interest in clustering of these ncRNA families. The reason is Rfam database employs
some quality control steps which prevent two families from annotating the same sequences.
Therefore, homologous ncRNA sequences might be separated from each other and go to

4

diﬀerent families because of restrictions existing in computational steps [12]. Moreover, a
single alignment may not capture all the diversity of a group of homologous RNAs. Our
second contribution in this thesis is to evaluate utilization of grammar string approach in
ncRNA clustering problem. For this purpose, we design a strategy to, ﬁrst, extract consensus sequence of ncRNA families, and then, generate their corresponding grammar strings
based on associated consensus sequence and secondary structure. In the next step, pairwise
alignment method is applied to these grammar strings using empirical score table and gap
penalties. Finally, ncRNA pairwise alignment scores are fed into a clustering application
called MCL to be classiﬁed in diﬀerent clusters.

1.3

Literature review and related works

There are many ncRNAs whose biological functionalities remain unknown to this day. Comparative techniques may help us reveal possible functionalities of novel ncRNAs based on
properties of ncRNAs with already known functionalities. The base line is that the ncRNA
with similar secondary structures present similar biological functionalities. Existing ncRNA
alignment methods can be roughly classiﬁed into three basic types.
The ﬁrst type aligns and folds simultaneously. The most accurate algorithm of this type
was developed by Sankoﬀ [36]. However, it is prohibitively expensive with time complexity
O(L3N ) and memory complexity O(L2N ), where L and N are the length and number of input
sequences, respectively. Variants of the Sankoﬀ algorithm have been proposed to reduce the
computational time of multiple alignment, such as Stemloc [19], Consan [9], MARNA [38].
The second type of methods ﬁrst builds a sequence alignment and then folds the alignment [17, 28, 35, 44, 44]. They infer structures from pre-aligned sequences generated using
MULTIZ [4], ClustalW [40], or other available sequence alignment programs. The accuracy
of these tools is largely aﬀected by the alignment quality. In particular, when homologous
ncRNA sequences only share structural similarity, building a meaningful sequence alignment

5

becomes diﬃcult.
The third type of methods folds input sequences and then conducts structural alignment,
yielding higher accuracy. Diﬀerent tools in this category diﬀer by diﬀerent secondary structure modeling methods. Although some of them used restricted Sankoﬀ algorithm in their
implementations, we classify them into “fold and then align” category because they apply
structure prediction in the ﬁrst step. As our grammar string based alignment belongs to
the third category, we discuss related “fold and then align” tools below, focusing on their
secondary structure representations.
Several programs encode secondary structure using base pair probability matrices derived from McCastkill’s approach [16, 33]. NcRNA alignment is then converted into base
pair probability matrix alignment. However, base pair probability matrix comparison is
highly resource demanding. For example, pmcomp [18] takes O(n4 ) memory and O(n6 ) operations for aligning a pair of sequences with length n. More recent implementations such as
LocARNA [45] and FOLDALIGNM [41] applied various restrictions or pruning techniques to
reduce the time complexity. But they are still much more expensive than sequence alignment.
RNAforester [14, 15] used tree proﬁles to represent secondary structures. Algorithms on
tree alignment are applied for pairwise and multiple alignment computation. The asymptotic
eﬃciency depends on the node number of the tree representation and the maximum degree d
of a tree node. For n structures of average size s, their pairwise algorithm has time complexity
O(s2 d2 ) and space complexity O(s2 d). RNAforester can achieve higher eﬃciency than base
pair probability matrix comparison. However, it is reported [45] that they tend to produce
many alignment columns that contain mostly gap characters in the multiple alignment mode.
Carnac [42] used stem graphs to represent secondary structures. However, their program
cannot accept more than 15 input sequences to remin eﬃcient, limiting its practical usage.

6

1.4

Road map of this thesis

The remainder of this thesis is organized as follows:
• Chapter 2: Preliminaries and standard deﬁnitions of formal language is introduced
ﬁrst. Next, grammar string design is discussed in details.
• Chapter 3: Multiple ncRNA structural alignment experiment designed by our colleague
Rujira Achawanantakum [1] is explained in high level. We then focus on those design
aspects which pertain to this thesis.
• Chapter 4: An experiment is designed and applied to clustering of ncRNA families
using grammar string approach. We will end this chapter with experimental results
and conclusions.

7

Chapter 2
Grammar string design
In this chapter, we ﬁrst brieﬂy introduce some preliminaries and standard deﬁnitions in
Formal Languages and Transformational Grammar. We then discuss the use of grammar
strings borrowed from formal languages in representation of secondary structure for ncRNAs.
Next, a new grammar string will be designed and utilized in a novel representation of these
secondary structures. Finally, the associated generating algorithm and the strategy to encode
stem structure of ncRNAs will be addressed.

2.1

Preliminaries and standard deﬁnitions

2.1.1

Formal Language

We are all familiar with one or more languages and practice at least one of them, e.g.,
English, French, Italian, etc. in our daily life. Informally, each language consists of a set
of symbols and a set of manipulating rules which help us express the facts, feelings, and
concepts. However, we need a formal mechanism to study languages mathematically. To
begin with, we need to know some terms used in the formal deﬁnition of languages.
• A Symbol is an abstract entity having no meaning by itself. The most common symbols
are digits, letters from variant alphabets, and special signs such as asterisk.
8

• An Alphabet is a ﬁnite set of symbols. For example, {a, b} is an alphabet containing
two symbols a and b, and {0, 1} is an alphabet with two members 0 and 1.
• A Word or a String is a ﬁnite sequence of symbols over an alphabet. For instance, abab
and 011001 are two words deﬁned on alphabets σ1 and σ2 , respectively. We use |.| to
denote the length of a word, e.g., |abab| = 4 and |011001| = 6 are the lengths of abab
and 011001 strings, respectively. The length of a null string is considered 0.
we can now formally deﬁne a language using above terms. A language is a set of strings over
a set of predeﬁned symbols called alphabet. This set of strings could be an empty, a ﬁnite
or an inﬁnite one.

2.1.2

Transformational Grammars

Transformational Grammars (Grammars) are used to understand the structure of natural
languages and are one of the formal ways of describing a language. Generally, a grammar is
deﬁned as a 4-tuple G = (V, T, P, S) where:
• V is a set of symbols called nonterminal, typically S,A,B,. . .
• T is a set of symbols called terminals, typically 0,1,a,b,. . .
• P is a set of production rules
• S is the starting nonterminal from V
Each rule in P has the form: A → α where A is a variable and α is any concatenation
of terminals and nonterminals. If there are more than one candidate in right side of a
production rule, the sign | is used to separate them from each other. Moreover, a speciﬁc
symbol such as ε presents a null string. When a nonterminal is replaced by a null string, no
symbol would be added to the string.

9

As an example, a sample grammar could be deﬁned as G = (V, T, P, S) where V = {S, A},
T = {0, 1} and the production rules from set P are:
S → 0A|0
A → 10A|ε

2.1.3

(2.1)
(2.2)

Derivation process

The sequence of production rules in a grammar that transforms the starting symbol of a
grammar into a string out of any nonterminals is called derivation process. The derivation
process begins with the starting symbol. In each step, one nonterminal is replaced with one
of the possible right hand side production rules. During the whole process, all nonterminals
are replaced by terminals and ﬁnally the string is produced. The existence of a derivation
process proves that a string belongs to a grammar language. In the previous example, the
string 0101010 is derived from G using these steps:

S ⇒ 0A ⇒ 010A ⇒ 01010A ⇒ 0101010A ⇒ 0101010A ⇒ 0101010

(2.3)

As it is seen in (2.1) and (2.3) single-line arrow stands for steps in production rules, whereas
double-line arrow shows diﬀerent steps in a derivation process.
There are two types of derivation: The left-most derivation and the right-most derivation.
A left-most derivation is the one in which the left-most nonterminals are always applied in
transforming rules. Similarly, a right-most derivation is the one in which the right-most
nonterminals are always applied in transforming rule processes. For example, suppose the
grammar G1 is deﬁned as G1 = (V, T, P, S) where V = {S, A, B}, T = {a, b}, and the

10

production rules in set P are:

S → AB

(2.4)

A → aA|ε

(2.5)

B → bB|ε

(2.6)

This grammar deﬁnes a language containing empty string, strings of a, strings of b, and
strings starting with one or more ‘a’ and continuing with any number of b. a, b, aaa, bbbbbb,
aab,and aaabbbbb are some sample words belonging to this language. These strings are
generated by production rules in grammar during the derivation process.
Two samples of left-most and right-most derivation steps of the string aab from grammar
G1 are shown as follows:

S ⇒ AB ⇒ aAB ⇒ aaAB ⇒ aaB ⇒ aabB ⇒ aab
S ⇒ AB ⇒ AbB ⇒ AbbB ⇒ Abb ⇒ aAbb ⇒ abb

2.1.4

Parse tree

The derivation process could be presented in the form of a speciﬁc tree called parse tree.
Parse tree represents the syntactic structure of a string according to some formal grammar.
In a parse tree, the root is the starting symbol (S) of the grammar, the interior nodes are
labeled by non-terminals of the grammar, and the leaf nodes are labeled by terminals of the
grammar. In Figure 2.1 two parse trees T 1 and T 2 present the left-most and right-most
derivation of the string aab from G1.

11

S

S

a

A

B

A

B
b

A

a

B
ϵ

A
a

ϵ
b

B
ϵ

A
a

A
ϵ

Right- most derivation tree

Left-most derivation tree

Figure 2.1: “aab” pars trees

2.1.5

ambiguous grammar

A grammar G is an ambiguous grammar if the language generated by G contains a string w
which has at least two diﬀerent parse trees, or equivalently more than one left-most derivation
tree. For example, there are two diﬀerent left-most derivations (Figure 2.2) for generating
the string a + a + a using grammar G2 = (V, T, P, S) where V = {S}, T = {a}, and with
the production rules in set P as:

S → S+S
S → a.

An ambiguous grammar could be turned into a non-ambiguous one by deﬁning some new

12

S

S

S

S

+

S

+

S

a
S

a

+

S

S

a

a

+

S

a

a

Figure 2.2: Ambiguity in pars tree of “a+a+a” left-most derivation
nonterminals. In example above, we can resolve the ambiguity problem by adding a new
nonterminal T to the nonterminal set and modifying the production rules consequently:

S → S + T |F
T →F
F → a.

Figure 2.3 shows the left-most derivation of

a+a+a

and its corresponding pars tree using non-ambiguous grammar above.

13

S

S

S

+

+

T

T

F
a
F
a
F
a

Figure 2.3: Equivalent non-ambiguous pars tree of “a+a+a”

2.1.6

Grammar classes

Chomsky developed a general theory for modeling string of symbols and presented a hierarchy
of grammars called Chomsky hierarchy of transformational grammars [10]. In other
words, Chomsky divided grammars into diﬀerent classes based on the type and attributes
of strings generated by these grammars. Regular grammars, context free grammars, and
context sensitive grammars are some examples of these grammar classes.

Regular grammars
A grammar G = (V, T, P, S) is called a regular grammar if all its production rules have the
form of A → w or A → wB, where A and B are variables and w is any combination of
terminals including empty string. If a language could be described by a regular grammar

14

then it is called regular language.
Regular grammars generate strings from left to right or right to left. Another property
of regular grammars lies in the fact that the generation of a symbol pair requires to be
independent and from two diﬀerent nonterminals, i.e., regular grammars cannot generate a
correlated base pair from a single nonterminal. G introduced in subsection 2.1.2 is such a
regular grammar.

Context free grammars
A grammar G = (V, T, P, S) is said to be a context free grammar (CFG) if all productions are
of the form A → x where A is in V and x is any combination of terminals and nonterminals
including null string. The languages generated by context-free grammars are known as the
context-free languages.
Context free grammars generate string from outside in. For example, G3 = (V, T, P, S)
where V = {S}, T = {a, b}, with the production rules in set P as:
S → aSa|bSb|cSd|aa|bb

is an example of a context free grammars. The derivation of “abaaba” and “aacaadaa” from
G3 is given, below.
S ⇒ aSa ⇒ abSba ⇒ abaaba
S ⇒ aSa ⇒ aaSaa ⇒ aacSdaa ⇒ aacaadaa

2.1.7

Context free grammars and ncRNA secondary structure

A typical context free grammar creates strings from outside in and enables to generate
strings with nested but non-crossing correlation between pairs of symbols.These attributes
make CFG class a proper option to generate Palindrome language. Strings of a such
language have a symmetric correlation between pair of symbols and are often being read the
15

same way in either directions left or right (see Figure 2.4). Since secondary structures in
nsRNAs are considered as a type of palindrome language, they can be generated by properly
designed context free grammars.

a a b b a a

Palindrome language

a a b a a b

Crossing correlation

Figure 2.4: Crossing and non-crossing correlation between pairs of symbols

2.2

Grammar string design

Inspired by Jaakkola and Haussler’s discriminative classiﬁcation method [20], we introduce
grammar string, a representation of an ncRNA sequence in the parameter space of contextfree grammar (CFG). Speciﬁcally, each ncRNA sequence and its secondary structure are
transformed into a string deﬁned on a new alphabet, where each character corresponds to a
production rule in a CFG.
We ﬁrst introduce an unambiguous CFG for ncRNA sequence generation. Using the
16

chosen CFG as an example, we formally deﬁne grammar strings for modeling an ncRNA
sequence and its secondary structure.

2.2.1

An unambiguous CFG for ncRNA generation

NcRNA structures without pseudo-knots can be derived by CFGs [10]. As mentioned in
previous section, a CFG is deﬁned by a set of nonterminals, a set of terminals, a start nonterminal, and a set of production rules of the form V → α. V is a single nonterminal symbol,
and α is a string of terminals and/or nonterminals. By recursively replacing nonterminals on
the right hand side of each production rule, an ncRNA sequence and its secondary structure
can be derived from a CFG. In this work, all our ncRNA sequences and their structures
will be generated from G4, a light-weight CFG introduced by Dowell and Eddy [8], using
leftmost derivation. Following the general deﬁnition of a CFG, G4 has a ﬁnite set of nonterminal V = {S, T }, a ﬁnite set of terminal symbols T = {A, C, G, U, ε}, and a ﬁnite set of
production rules deﬁned as below:
• S → aS|T |ε
• T → T a|aSˆ|T aSˆ
a
a
where a ∈ {A, C, G, U } and a ∈ {A, C, G, U }. a and a form complementary base pairs such
ˆ
ˆ
as A-U and G-C. In order to generate the unstructured single strand ‘C’ at 3’ end and the
two outmost base pairs in sequence tRNA 1 in Figure 2.7, the following production rules
from G4 are called: S → T , T → T C, T → GSC, S → T , T → U SA. Continuing to
replace S by correctly chosen production rules, we can derive tRNA 1. The sequence of
production rules used for ncRNA structure generation is called a derivation.
Using the leftmost derivation, an unambiguous CFG can guaranteer a unique derivation for a given ncRNA sequence and its secondary structure. For example, by using the
unambiguous grammar G4, we have only one choice when choosing a production rule to
derive a sample ncRNA (Figure 2.5) secondary structure in Figure 2.6. For a more de17

tailed introduction about unambiguous CFGs, we refer readers to the review by Dowell and
Eddy [8], where several light-weight unambiguous CFGs including G4 are discussed.

Figure 2.5: Secondary structure of a sample ncRNA

2.2.2

Grammar string generation algorithm

Each ncRNA secondary structure has a unique leftmost derivation from an unambiguous
CFG, producing a one-to-one mapping between a structure and a production rule sequence.
Intuitively, homologous ncRNAs with similar structures will share similar derivations. This
motivates us to represent an ncRNA sequence and its secondary structure in the parameter
space of a CFG. Thus, ncRNA structural comparison is converted to the comparison of their
derivations.
In order to represent an ncRNA structure using its derivation, we introduce a new alphabet, where each character corresponds to a production rule in a CFG. One example alphabet
derived from G4 is deﬁned below.
• Use upper case character of a to represent production rule S → aS. For example, use
A to represent S → AS.
• Use | to represent S → ε.
18

S
1

T
S

A

2

T

U

T

S

A

A
U

T

S

A

C

S

A

S

A

S

S

G

G

S

S

C

T

U

T
C

S

A

A

G

U

S

S

U

S

U

S

U

S

C
S

A

ε
T

C

T

T
1

G

S
2

U

S

ε

C

Figure 2.6: Left most derivation pars tree of Figure 2.5
• Use lower case character of a to represent production rule T → T a. For example, use
c to represent T → T C.
• Use P to represent base pair emission T → aSˆ.
a
• Use a special character # to indicate branching T → T aSˆ.
a
• No character is needed for production rule S → T .
Thus, the new alphabet is A = {A, C, G, U, a, c, g, u, P,—,#}. If these production rules are
used on DNA sequences, we can simply replace U (u) with T (t). For brevity, we name a
string deﬁned on the above alphabet a grammar string. As an example, the derivation for
generating the unstructured single strand ‘C’ at 3’ end and the two outmost base pairs in
sequence tRNA 1 of Figure 2.7 is: S → T , T → T C, T → G S C, S → T , T → U S A.
19

Thus, the corresponding grammar string is cP P using the alphabet A. Note that we don’t
distinguish diﬀerent base pairs (i.e. A-U, G-C, and G-U if allowed) in a grammar string. All
base pairs are represented as P in order to maximize the alignment score between homologous
ncRNAs that share high structural similarity but low sequence similarity. Figure 2.7 shows
the utility of grammar strings in detecting structural similarity between two tRNA sequences
from the human genome. Because of low sequence similarity, BLAST [2] fails to align
them. However, their structural similarity yields a meaningful global alignment between
their corresponding grammar strings with 69% identity.

In theory, our grammar string

3'

3'
A
C
A U
G C
C G
U A
C A A
G C A
U A
A U
U
A
C
U A
A U
C
C
C
G
U A
C GU
A U
C GC U
U A AA G
U A C
G
U
A A UA U
A U UC G A
A A
AU
A
A A
C
A
U
UA G G
U U U G
A
C
U
C
AUC C
C A A A A C
G
A
C
A
C
A
G
U
U A A A
A
C
A C
U
A G
A
G
G U
C
U G
C
C
C
U A
U
U
U
C
A
G
U
C
A G G
U G A
tRNA_1: GUAAAUAUAGUUUAACCAAAACAUCAGAUUGUGAAUCUGACAACAGAGGCUCACGACCCCUUAUUUACC
(((((((..((((.....)))).(((((.......)))))....((.((.......)).))))))))).
5'

5'

Grammar string
cPPPPPPPUA#caac#aPPPPAACCA|PPPPPUUGUGAA|PPAcPPCUCACGA|
tRNA_2:

ACUUUUAAAGGAUAACAGCCAUCCGUUGGUCUUAGGCCCCAAAAAUUUUGGUGCAACUCCAAAUAAAAGUA
(((((((..((((.......)))).((((.........))))....(((((.......)))))))))))).

Grammar string
aPPPPPPPAA#uaaa#gPPPPAACAGCC|PPPPUCUUAGGCC|PPPPPUGCAACU|
Pairwise alignment
cPPPPPPPUA#aac#aPPPPAAC--CA|PPPPP-UUGUGAA|PPPPPCUCACGA|
aPPPPPPPAA#aaa#gPPPPAACAGCC|PPPPPCUUAGGC-|PPPPPUGCAACU|
******* **** * ******* * ****** ** * ****** **
*

Figure 2.7: Two tRNA sequences from the human genome and the alignment of their grammar strings.The stars below the alignment denote exact matches.
20

void parse(i, j)
{
if i >= j
print '|';
return;
else if Xi is a single stranded base
print uppercase of Xi;
i++;
parse(i,j);
else if Xj is a single stranded base
print lowercase of Xj;
j--;
parse(i,j);
else if Xi and Xj form a base pair
print 'P';
i++ and j--;
parse(i,j);
else
print '#';
k = the position that forms a base pair with Xj;
parse(i,k-1);
parse(k,j);
}

Figure 2.8: Algorithm for generating a grammar string for substring Xi..j .
generation process consists of two steps. First, write the production rule sequence for an
ncRNA sequence and its secondary structure. Second, transform the sequence of production
rules into a grammar string according to the deﬁnition of grammar string alphabet. In
practice, we use an eﬃcient dynamic programming algorithm to design a grammar string for
an ncRNA structure directly, skipping the step of parsing an ncRNA sequence using a CFG.
The algorithm has time complexity O(L2 ), where L is the length of the ncRNA sequence.
Let X be an ncRNA sequence with its predicted or annotated secondary structure. i and j
are indexes in X. Xi is the base at position i. Figure 2.8 sketches the dynamic programming
algorithm generating a grammar string for substring Xi..j . In order to generate the complete
grammar string for sequence X, one should call parse(1, L).

21

2.2.3

Design a new grammar string of Rfam database

In the grammar string presented in Section 2.2, all base pairs are mapped to the same
alphabet P . Therefore, all base pairs are considered the same in a grammar string. Here
instead we design a new grammar string which enables us to distinguish between diﬀerent
types of complementary base pairs. Our goal is to compare the performance of a grammar
string when we feed it with more details about base pairs nucleotides and their decoration.
To design a new grammar string we still use G4 context free grammar, but expand its
set of alphabets from A = {A, U, C, G, a, u, c, g, P, #,—} to
A = {A, U, C, G, a, u, c, g, <, >, [, ], (, ), E, F, I, J, L, O, Z, #,—}.

As seen, instead of P which is a single representative for base pairs in our previous grammar
string design, some other new alphabets are added to the new alphabet set. Each new
alphabet is a representative for a speciﬁc type of base pair. In Table 2.1, it is seen how
base pairs are assigned to grammar string alphabets. For more details, read Section 4.2.1.
Both the previous and new generated consensus sequences and grammar strings of family
RF00756 along with the corresponding secondary structure are shown below, respectively.
AAAAAAAA.AAAAAAAAAAAAAAAAAAAAUUUGAAUUUUUUUUUUUUUUUUUUUUUUUUUUUUU,RF00756
<<<<<<<<.<<<<<<<<<<<<<<<<<<<<.......>>>>>>>>>>>>>>>>>>>>>>>>>>>>
PPPPPPPPPPPPPPPPPPPPPPPPPPPPUUUGAAU|

AAGAAGUG.GUUUACCGUCCCACAUACAUUUUGAAUAUGUAUGUGGGACGGUAAACCGUUUCUU,RF00756
<<<<<<<<.<<<<<<<<<<<<<<<<<<<<.......>>>>>>>>>>>>>>>>>>>>>>>>>>>>
<<]<<()]]>>><[[]>[[[<[<><[<>UUUGAAU|

22

Table 2.1: Observed → Replacement
Observed Replacement
AU
<
UA
>
CG
[
GC
]
GU
(
UG
)
EE
E
FF
F
II
I
JJ
J
LL
L
OO
O
ZZ
Z

2.2.4

Grammar pattern for encoding stem structures

The number of stems and their relationship largely deﬁne the basic “shape” of a secondary
structure. For example, the cloverleaf structure of a tRNA sequence consists of four stems:
acceptor stem, D stem, anticodon stem, and TΨCG stem. The precursor structure of a
miRNA usually contains only one stem. According to the deﬁnition of grammar strings, three
characters P, #, and | encode the number and relative positions of all stems in an ncRNA
secondary structure. If we simply remove all single stranded regions (i.e. substrings only
consisting of A, C, G, U, a, c, g, u) from a grammar string, we can use a simpliﬁed grammar
string to represent the abstract stem structure for an ncRNA sequence. For brevity, we name
a simpliﬁed grammar string a grammar pattern, which is a string deﬁned on a reduced
alphabet {P, #, |}. A grammar string can be converted into a grammar pattern in two steps:
1) remove all substrings representing single stranded regions, and 2) reduce every substring
consisting of only P s as a single P . Thus, the grammar pattern for sequence tRNA 1 in
Figure 2.7 is P ##P |P |P |, where each P denotes a stem. There are four Ps, denoting four
stems. The end of each stem is marked by |. Number of # deﬁnes the number of bifurcations.
Diﬀerent distributions of the same number of stems can yield highly diﬀerent secondary
structures. Figure 2.9 shows how grammar patterns can account for diﬀerent structures with
23

the same number of stems. Note that all these grammar patterns are generated using G4
as the chosen CFG. If other unambiguous CFGs are used to generate grammar strings for
the same structures, diﬀerent sets of grammar patterns might be produced. Ignoring all
5'

3'

P##P|P|P|

5'

3'

#P#P|P|P|

5'

3'

#P|P#P|P|

5'

3'

###P|P|P|P|

Figure 2.9: Four diﬀerent stem structures and their grammar patterns. The left column
shows the 2D representation of an ncRNA folding. The right column shows the distributions
of stems along an ncRNA sequence. All grammar patterns are generated using G4 (our
chosen unambiguous context-free grammar).
single stranded regions and length of each stem, grammar patterns only provide a coarsegrained description of ncRNA secondary structures. However, because of the high eﬃciency
of pattern matching, grammar patterns can be used to speed up grammar string comparison.
For example, we do not expect signiﬁcant structural similarities between a tRNA and a
miRNA sequence. Instead of using Needleman-Wunsch [37] like alignment algorithm between
their grammar strings, a constant time grammar pattern matching program can be applied as
a ﬁltration step. This ﬁltration is particularly important when we aim to derive the consensus
structure of multiple putatively homologous ncRNAs. Although these sequences are expected
to be sequenced from the same gene family, it is possible that some of the sequences are
from other regions. Thus, we can use the grammar pattern matching technique to exclude
contaminated sequences, ensuring a multiple sequence alignment with good quality. The
24

same technique can be used to remove possible errors introduced by MFE-based secondary
structure prediction tools.

2.2.5

Other grammar pattern design strategies

Designed grammar pattern in Section 2.2.4 encodes the stem structure of an ncRNA sequence. However, more complicated grammar patterns can be designed to carry information
about both sequence and secondary structure of a sample ncRNA. Following designed strategies are such examples.
• Strategy (A)
1. Preserving all bifurcation (#) and end (|) grammar alphabets.
2. Removing individual bases from grammar string.
3. Keeping only one P as the representative of base pairs if any base pair is seen
between bifurcation and end alphabets.
• Strategy (B)
1. Preserving all bifurcation (#) and end (|) grammar alphabets.
2. Keeping one P as a representative for each segment of base pairs.
• Strategy (C)
1. Preserving all bifurcation and end grammar alphabets.
2. Finding the total number of P s, the total number of left derived individual bases
(capital alphabets), and the total number of right derived individual bases (small
alphabets) existing in the grammar string.
3. Finding the total number of segments in which base pairs (P s) occur, the total
number of segments in which left derived individual bases occur, and the total
number of segments in which right derived individual bases occur.
25

4. Calculating average length of three segments (found in the previous step) by
dividing the total number of the corresponding grammar string alphabet to the
number of segments in which the alphabet occur.
5. Calculating the average of tree averages (calculated in the previous step) and
calling it “total average”.
6. Processing the grammar string again from left to right.
7. Preserving all bifurcation (#) and end (|) grammar alphabets.
8. Putting one P , A, or a for each observed segment if the number of characters in
that segment is greater than or equal to an empirical threshold × total average.

26

Chapter 3
Multiple ncRNA structural alignment
based on grammar string
In this chapter, we present a brief overview on multiple alignment of ncRNAs secondary
structure. We must mention that the main eﬀort in leading this research project has been
done by our colleague Rujira Achawanantakun [1]. Although the ﬁnal goal of this project,
accomplished by Rujira Achawanantakun, has been to derive consensus structure through
multiple ncRNA alignment, in this thesis we only focus on those parts of the research which
have been done by the author of the current thesis. These include creating score table for
grammar strings pairwise alignment and extracting consensus sequence of ncRNA families.
We refer interested readers to [1] for more details on other steps of this experiment as well
as the ﬁnal result of this research work.

3.1

Summary of designed experiment for multiple ncRNA
structural alignment by Achawanantakun [1]

Multiple ncRNA alignment has important applications in homologous ncRNA consensus
structure derivation, novel ncRNA identiﬁcation, and known ncRNA classiﬁcation. As many

27

ncRNAs’ functions are determined by both their sequences and secondary structures, accurate ncRNA alignment algorithms must maximize both sequence and structural similarity
simultaneously, incurring high computational cost. Faster secondary structure modeling
and alignment methods using trees, graphs, probability matrices have thus been developed.
Despite promising results from existing ncRNA alignment tools, there is a need for more
eﬃcient and accurate ncRNA secondary structure modeling and alignment methods. To
this end, Achawanantakun designed an experiment to derive consensus structure through
multiple ncRNA alignment. Major steps of aligning multiple ncRNA sequences experiment
are sketched below:
1. Use an ab initio secondary structure prediction tool to predict both the optimal and
sub-optimal secondary structures of each input sequence based on minimum free energy
criteria.
2. Generate a grammar string for each predicted secondary structure. If an ncRNA
sequence has more than one structure predicted, multiple grammar strings will be
generated.
3. Transform each grammar string into a grammar pattern. Use a voting mechanism to
choose the most popular grammar pattern that mostly likely represents the native stem
structure shared by the input sequences. All grammar strings that are not consistent
with the chosen grammar pattern will be discarded.
4. Apply a progressive multiple sequence alignment method on remaining grammar strings.
5. Derive the consensus secondary structure from multiple grammar string alignment.
Transform grammar string alignment into ncRNA sequence alignment using the ncRNA
sequences and their predicted structures as references.
In Chapter 2, we introduced an approach to generate grammar string from an ncRNA sequence and its secondary structure (Step2). We also addressed, in details, how to extract
28

grammar pattern from a grammar string (Step 3). In the next two sections, we ﬁrst explain
how to create a score table for grammar string alignment and then how to extract consensus
sequence of ncRNA families to generate their corresponding grammar string.

3.2

Score table design for grammar string alignment

Pairwise alignment is a fundamental step to multiple alignment and clustering. Existing
alignment algorithms such as Needleman-Wunsch [37] can be directly applied to grammar
strings when a score table deﬁned on grammar strings’ alphabet is imported. Following the
common practice in score table design, we use maximum-likelihood ratio to derive the score
between every pair of characters in grammar strings’ alphabet A. For each pair of characters
a, b in A, the score between a, b is s(a, b) = log

Pr(a,b)
. Pr(a, b) is the target probability
Pr0 (a,b)

of a, b in a set of true alignments and P r0 (a, b) is the background probability that a and
b are aligned. Assuming that a and b are independent in the background model, we get
Pr0 (a, b) = P r0 (a) × P r0 (b). Because ncRNA family database Rfam [13] provides a large
number of annotated ncRNA sequences, their alignments, and their associated secondary
structures, we obtain both the target and the background probabilities from Rfam. In
summary, we present following steps of designing a score table for grammar string alignment.
1. Build an alignment training set by randomly picking a large number of pairwise ncRNA
alignment from Rfam 9.1’s seed alignments. Some criteria are applied to select alignments with reasonably high quality. For example, if a pairwise alignment contains too
many gaps, it will not be included in the training set. After applying the selection
criteria, we had 18487 pairwise alignments in the training set.
2. Transform each pair of ncRNA sequence alignment into an alignment between grammar
strings using the given secondary structure annotations by Rfam.
3. Compute the target probability P r(a, b) for each pair of aligned characters a, b in the
above grammar string alignments.
29

4. Generate grammar strings for a large number of ncRNA sequences that are randomly
picked from full families of Rfam 9.1. Compute the background probabilities P r0 (a)
and P r0 (b) from these grammar strings.
The complete score table for grammar string alignment can be found at our website (Appendix B). All exact matches have big positive scores. And bifurcation starting character
# and stem ending character | can only be aligned with themselves or cause a gap. This
is consistent with our intuition because it is not meaningful to align a bifurcation character
with a base pair or a single stranded base.
Insertions or deletions of P or single stranded characters correspond to insertions or
deletions of a base pair or single stranded bases in the ncRNA sequence alignment. Empirical
experiments are conducted to choose default values for their gap opening and extension costs.
The default gap opening score is slightly smaller than the lowest number in the grammar
string’s score table. The default gap extension cost is set as 1/10 of the opening cost. We
assign bigger gap penalties for structural characters # and | in order to force corresponding
stems or single stranded regions to be aligned together.

3.3

Extracting consensus sequence of ncRNA families

We select a data set which includes 452 randomly chosen families from BRAliBase 2.1, an
enhanced RNA alignment benchmark [46]. This data set contains a diverse set of ncRNA
families with diﬀerent average sequence identity, length, and structural conservation.
In order to generate grammar string of each family, we ﬁrst need to extract its corresponding consensus sequence. For this purpose, we design an empirical strategy to extract
the consensus sequence of an Rfam family from its multiple sequence alignment and consensus secondary structure. We propose diﬀerent approaches to generate non-complementary
bases and complementary base-pairs of consensus sequence.
To extract consensus individual base of a speciﬁc column in a multiple alignment required
30

steps are as follows:
1. Count the total number of diﬀerent nucleotides occurred in this speciﬁc column of all
sequences. These nucleotides might be any of A, U, C, G nucleotides and in any number
plus dot character (“.”) which is representative of gap in multiple alignments.
2. If the number of counted dots in the column is equal or larger than a fraction of the total
number of sequences, we put “.” in that column of consensus sequence. Otherwise,
we sort non-dot nucleotides in descending order based on the number of occurrence
and then pick them from maximum in descending direction until the sum of selected
nucleotides satisﬁes an empirical threshold. Finally, based on Tables 3.1, 3.2, and 3.3, a
proper alphabet is selected to put in the corresponding column of consensus sequence.
Table 3.1: Observed → Replacement
Observed Replacement
A
A
U
U
C
C
G
G

Table 3.2: Observed → Replacement
Observed
Replacement
anagrams of (U A)
W
anagrams of (AC)
M
anagrams of (AG)
R
anagrams of (U C)
Y
anagrams of (U G)
K
anagrams of (CG)
S

Table 3.3: Observed → Replacement
Observed
Replacement
anagrams of (AU C)
H
anagrams of (AU G)
D
anagrams of (ACG)
V
anagrams of (U CG)
B

31

If none of these cases happen, put N in that column. Anagrams in these tables stand
for all possible rearrangements of letters in the candidate nucleotides, e.g., AU has two
anagrams: AU and U A.
By mapping the observed nucleotides (left column) to our deﬁned alphabets (right column), we are trying to reveal the degree of conservation in that column. For example,
when the number of an observed nucleotide is high enough to satisfy the empirical
threshold on its own, we map that nucleotide to itself and it implies that column is
well conserved. However, when the summation of more than one nucleotide is needed
to satisfy the threshold, we map them to our deﬁned alphabets so as to show the degree of conservation in that column. Obviously, the more selected nucleotides, the less
degree of conservation in that column.
To extract base pairs of a consensus sequence, we simply put AU in associated columns
of that sequence. This is because in our designed grammar string all complementary base
pair types are considered the same and are mapped to a single grammar alphabet P .
Remark 3.1. In the designed alignment pipeline, multiple structures are allowed to be predicted for each input sequence. As a result, multiple grammar strings will exist for a single
ncRNA sequence. However, predicted structures for the same ncRNA sequence can diﬀer
signiﬁcantly. It is important to align only structures that are likely to be consistent with the
native structure of the homologous sequences. For this purpose, grammar pattern introduced
in grammar string chapter is applied to pre-select grammar strings for multiple alignment.

32

Chapter 4
Clustering based on grammar string
Cluster analysis or clustering is generally the task of assigning objects of a set to separate
groups called clusters. Objects in the same cluster are more similar (in some sense) to each
other than to those in other clusters. This technique is vastly used in diﬀerent areas of bioinformatics. More speciﬁcally, a group of ncRNAs are classiﬁed into clusters, called families,
based on their sequential and structural homology. However, as mentioned in Section 1.2,
restrictions existing in computational steps of generating ncRNA families might separate
some of homologous ncRNA sequences from each other by assigning them to diﬀerent families. Therefore, designing techniques that enable us to identify this type of homologos but
separated families is still an interesting research area in bioinformatics.
In this chapter, we utilize grammar strings in ncRNA classiﬁcation (excluding pseudoknots) and ﬁnd clusters in a particular database, Rfam. Organization of this chapter is
as follows: Section 4.1 deﬁnes ncRNA classiﬁcation problem and the big picture of our
methodology to tackle this problem. Section 4.2 discusses steps of designed experiment in
more details. Section 4.3 presents experimental results of our approach.

33

4.1

NcRNA clustering; problem deﬁnition and approach

A. Problem deﬁnition
In Rfam database, non-coding RNAs are classiﬁed into diﬀerent families based on sequence alignments and statistical proﬁles known as covariance models. Examples of such
families are tRNAs, microRNAs, and such. Each family represents a set of RNA sequences
that, ﬁrst, can be reasonably aligned, and second, function at the RNA level and share
a clear common ancestor. These two criteria, however, impose fundamental limitation on
ﬁnding homologous families of RNAs in Rfam.
The problem with Rfam database comes from the fact that there may be two homologous
families with a common ancestor (same functionality) but too divergent to be aligned, or with
acceptable alignment but distinct functionality. To overcome these issues, the clan concept
is borrowed from the MEROPS and Pfam databases [6,11] to take an alternative approach to
ncRNA families classiﬁcation. This approach still get the right balance of sequence sensitivity
and speciﬁcity when it is applied to ncRNA families. Here, some of the internal quality
control measures used by Rfam can be relaxed for the clanned families. These clans describe
explicit relationships between families that either clearly share a common ancestor but are
too divergent to be reasonably aligned, or groups of families that could be aligned, but have
clearly distinct functions.
As an example, CL0002 clan contains ﬁve homologous families RF00030, RF01577,
RF00373, RF00010, RF00011 and RF0009. These families contain ribozyme RNAs involved
in processing of pre-tRNA and pre-rRNA sequences. However, they are diﬃcult to align to
each other. Furthermore, RF01577 and RF00030 are distinct molecules from a functionality
point of view [21].
Another clan of interest is CL00011. This clan contains RF00128 and RF00083 Rfam
families which are homologous, but distinct in functionality. Small RNAs sequences in
RF00128 activate expression of RF00083 while those in RF00083 regulate the translation of
a coding gene [22].
34

B. Our methodology towards ncRNA classiﬁcation process
Our goal is to evaluate the possibility of using grammar string approach to identify
existing clans in Rfam database, and also to discover novel clan families. In other words,
we would like to apply a clustering method to derived grammar strings and evaluate the
performance of grammar string approach in ncRNA classiﬁcation problem.
In our proposed approach, we take the following steps:
i) Designing and applying two consensus sequence derivation strategies to Rfam families.
ii) Generating two diﬀerent corresponding grammar strings GS1 and GS2 based on the
associated consensus sequence and secondary structure.
iii) Extracting two subsets of grammar strings from each set of GS1 and GS2 as data sets
for our experiment.
iv) Applying pairwise alignment method to all the subsets, built in previous step, using
empirical score table and gap penalties.
v) Feeding ncRNA pairwise alignment scores into a clustering application called MCL
(Markov cluster Algorithm) to be classiﬁed into diﬀerent clusters.
In next section, we discuss these steps in more details.

4.2
4.2.1

Step by step design for clustering experiment
Extracting consensus sequences in Rfam families

In order to generate grammar strings in Rfam families, we ﬁrst need to extract their consensus
sequences. We design two empirical strategies to extract the consensus sequence of an Rfam
family from its multiple sequence alignment and consensus secondary structure. In one
strategy, we assume all base pairs are the same, whereas in the other one, diﬀerent base
pairs are separated from each other.
35

To extract consensus sequences of Rfam families without distinguishing between base
pairs, we apply the same strategy introduced in Section 3.3, and therefore, the approach will
not be repeated here again. To extract complementary base-pairs of consensus sequences
with distinguished base pairs from multiple alignment of an Rfam family, a new strategy is
designed and presented here.
Extracting base pairs of a consensus sequence from multiple alignment
In this case, the number of diﬀerent occurrences of base pairs are counted. Possible base
pairs could be any of AU, U A, CG, GC, GU, U G, or even two dots to present un-conserved
base pairs in multiple alignment. If the number of occurring dot base pairs is larger than or
equal to an empirical fraction of family members, we put dot in the corresponding columns of
consensus sequence and secondary structure. Otherwise, we select two most occurring base
pairs. Denoting the ﬁrst and second most repeated base pairs and number of non-dot base
pairs by X, Y, and N , respectively, and assuming three thresholds th1 = 0.6, th2 = 0.45, and
th3 = 0.4, we arrive at ﬁve diﬀerent cases to be handled as follows:
1. If (the count of X ≥ th1 × N ) then we put X in the consensus sequence (e.g., X=AU
then we put AU in corresponding columns of consensus sequence).
2. If (th2 × N ≤ the count of X < th1 × N ) and (the count of Y < th3 × N ) then we
still put X in the consensus sequence (e.g., X=AU then we put AU in corresponding
columns of consensus sequence).
3. If (th2 × N ≤ the count of X < th1 × N ) and (the count of Y ≥ th3 × N ) then we use
the the Figure 4.1 to set the corresponding columns in the consensus sequence.
4. If (the count of X < th2 × N ) and (the count of Y < th3 × N ) then we put ZZ in the
corresponding columns of the consensus sequence.

36

Observed

Replacement

AU

EE

UA

FF

CG

II

GC

JJ

GU

LL

UG

OO

Figure 4.1: Mapping in the consensus sequence

4.2.2

Strategy of extracting data sets for our experiment

In Subsection 4.2.1, we explained how C1 and C2 as two sets of consensus sequences are
built from Rfam families. In C1, all base pairs are shown by single alphabet ‘P’ while in
C2 base pairs are distinguished based on the type of nucleotides and their decoration. By
adding conserved secondary structures of Rfam families to C1 and C2, we can apply two
grammar string approaches to these data sets. The version of grammar string design without
distinguishing between base pairs will be applied to C1 , whereas the one with separated
grammar string alphabets for each type of complementary base pair will be applied to C2.
As a result, two sets GS1 and GS2 of grammar strings will be generated for whole Rfam
families.
So far, we have designed two grammar string approaches in Sections 2.2.1 and 2.2.3. From

37

now on, we refer to these by “old” and “new” labels, respectively. To prepare the data sets
for the experiment, we ﬁrst need to generate two grammar sets of whole Rfam families based
on our old and new grammar string designs. To this end, we extract secondary structures of
Rfam families from Rfam database and associate them with their corresponding consensus
sequences in C1 and C2 and create CS1 and CS2 data sets, respectively. Now, we apply
our old grammar string approach to CS1 and generate grammar string set GS1, and apply
the new grammar string approach to CS2 and generate grammar string set GS2. In next
two subsections, we will describe how to prepare data sets for our experiment from GS1 and
GS2.

Strategy for extracting Rfam families of clan database
To study the performance of two diﬀerent grammar string designs in identifying well-known
clans, they need to be tested on Rfam families whose clusters have already been identiﬁed.
Since generally grammar string approach is a representation for ncRNAs secondary structure,
we would like to see if this approach is able to identify and consequently cluster clan family
members with similar secondary structures but diﬀerent functionalities. For this purpose, we
take advantage of grammar pattern strategy. We deﬁne label “conserved” for clan families
whose Rfam family members have the same grammar pattern. By applying tree strategies in
Section 2.2.5, we extract fourteen conserved clans. Table 4.3, shows these conserved clans,
their family members as well as their corresponding grammar patterns. we now extract
grammar strings of conserved clans from GS1 and GS2 and name them old and new grammar
strings of conserved clans, respectively.

Strategy for extracting Rfam families which are not in clan database
To study the performance of two diﬀerent grammar string designs in discovery of novel
ncRNA clusters, grammar strings of Rfam families whose clusters have not been discovered
yet are required. We extract this subset of grammar strings from both GS1 and GS2 sets

38

and call them old and new grammar strings of Rfam−Clan, respectively. Here and hereafter,
we use “Rfam−Clan” grammar strings to denote the entire grammar strings of Rfam families
excluding members of Clan database.

4.2.3

Pairwise alignment

After extracting four data sets (old/new conserved clans and old/new Rfam−clan grammar
strings), we run pairwise alignment application on these data sets. More speciﬁcally, we
design two empirical score tables called as “old” and “new” score tables. In the old score
table, which will be used in pairwise alignment of old grammar strings, all values are chosen
empirically, following a set of guidelines given below.
• Individual bases do not align with base pairs, bifurcations (#), or end of derivations
(|).
• Left derivation individual bases do not align with right derivation individual bases and
viceversa.
• Base pairs do not align with bifurcations (#) or end of derivations (|).
• Bifurcations (#) do not align with end of derivations and viceversa.
• The score of aligning two same alphabets is positive and that of aligning two diﬀerent
alphabets is negative.
• The score of aligning two bifurcations (#), or two end of derivations (|) is greater than
aligning two other equivalent grammar string alphabets.
• Individual bases {A, G} and {C, U } are aligned better than other combinations of
individual bases. Therefore, alignment score of these individual bases is still negative,
but greater than alignment score of other individual bases.

39

To construct the new score table which is used for pairwise alignment of the new grammar
strings, we take advantage of RIBOSUM matrices (see Appendix A ). These matrices are
designed by Robert.J Klein and Sean.R Eddy [27] to search for homologs of a single RNA
molecule in a sequence database, based on secondary structure. There are the guidelines to
build the new score table:
• The left derivation individual base columns existing in a left derivation individual base
row are ﬁlled in with scores from the corresponding table of RIBOSUM matrix. The
remaining columns in such a row (a left derivation individual base row) will be ﬁlled in
with a very large negative number so as to avoid selecting alignments associated with
those columns.
• The right derivation individual base columns existing in a right derivation individual
base row are ﬁlled in with scores from the corresponding table of RIBOSUM matrix.
The remaining columns in such a row (a right derivation individual base row) will be
ﬁlled in with a very large negative number so as to avoid selecting alignments associated
with those columns.
• To assign scores to columns of base pair rows, we go through diﬀerent strategies based
on the type of participating base pairs:
– If base pair alphabets {<, >, [, ], {, }} are aligned with each other, we use scores
from the corresponding RIBOSUM matrix.
– If base pair alphabets {E,F,I,J,L,O} are aligned with each other, we use a quarter
(1/4) of corresponding base pair scores from RIBOSUM matrix (Based on our
observations, base pair alphabet Z is never seen in any grammar strings).
– If a member of {<, >, [, ], {, }} is aligned with a member of {E, F, I, J, L, O}, we
use half of the corresponding base pair score from RIBOSUM matrix.
– The remaining columns of a base pair row are ﬁlled in with very large negative
numbers.
40

• The alignment score of two bifurcations (#) or two end of derivations (|) is set to a
very large positive number. The remaining columns of such rows are ﬁlled in with very
large negative numbers.
We refer to Appendix B for both old and new score tables. We also need to set the gap
penalties to perform pairwise alignment. Our gap penalty consists of open and extension
gap penalties. The open penalty is assigned to the ﬁrst matched gap with a grammar string
alphabet. If after this gap we have to match another gap with a grammar alphabet then
the negative gap extension score is used. We set gap penalties and extension empirically.
In Table 4.1, each data set is presented along with the corresponding score table and gap
penalties used in pairwise alignment.
The implementation of pairwise alignment is done by our colleague Achawanantakun and
we refer the interested reader to [37] for more details on the subject.
Table 4.1: Mapping in the consensus sequence
Data sets
Score tables
Gap Penalties
Old conserved
clans Old score table Open -8 / Ext -2
New conserved clans New score table Open -8 / Ext -2
Old Rfam-Clans
Old score table
Open -1 / Ext -12
New Rfam-Clans
New score table
Open -1 / Ext -15

4.2.4

Applying MCL clustering approach

The MCL (Markov Cluster Algorithm) is an unsupervised clustering algorithm for networks.
This algorithm which is claimed to be fast and scalable originates from simulation of stochastic ﬂow in graphs. Stijn van Dongen invented this algorithm which applies to diﬀerent areas,
especially in bioinformatics. The algorithm simulates ﬂow using algebraic operations on matrices, namely expansion and inﬂation. The former operation coincides with normal matrix
multiplication to model the spreading out of ﬂow and its homogeneity. The latter (inﬂation)
models the contraction of ﬂow and its thickness/thinness in regions of higher/lower current.
The MCL process causes ﬂow to spread out within natural clusters and evaporate in between
41

diﬀerent clusters. See [43] for mathematical theory behind the MCL algorithm. MCL has
O(N ∗ k 2 ) time complexity in worst case, where N is the number of nodes in the graph, and
k is the maximum number of neighbors tracked during computations.
We use pairwise alignment scores as inputs to MCL clustering application. Our assumption is that the pairwise alignment score between each two grammar strings shows
the degree of similarity between their grammar strings and consequently between their secondary structures. Before feeding pairwise alignment scores of a data set to MCL, we ﬁrst
ﬁnd the minimum score of data set and map it to the origin (zero). All other scores of the
corresponding set are accordingly shifted.
We run MCL application on a set of pairwise alignment scores with various cut-oﬀ thresholds. The cut-oﬀ threshold is used to eliminate scores less than this threshold. This option
helps us limit our search domain especially when we deal with large scale data sets.

4.3
4.3.1

Experimental results
Experimental results of old designed grammar string

Data set: conserved clans
In Figure 4.2, conserved clans along with their Rfam family members are shown.
We calculate the average of pairwise alignment shifted scores between Rfam family members within each clan. This value is called “internal average score”. Table 4.2 shows classiﬁed
conserved clans based on the internal average score factor. Experimental results show that
clustering process is more successful in identifying clans whose members have higher internal
average score than 4. We ﬁrst analyze the behavior of Rfam families which belong to these
clans during clustering process. To begin with, we name and also explain some ﬁgures and
tables that will heavily be used in our analysis.
• Figure 4.3 illustrates the summary of a clustering process (presented in ﬁve levels)

42

Clan name

Clan family members

Description

CL00015

RF01376 RF01377 RF01318

CRISPR

CL00010

RF00008 RF00163

ribozyme

CL00088

RF00685 RF00794

snoRNA; CD-box;

CL00099

RF00643 RF00692

snoRNA; CD-box;

CL00073

RF00046 RF00609

snoRNA; CD-box;

CL00074

RF00186 RF00339

snoRNA; CD-box;

CL00076

RF00610 RF01280

snRNA;

CL00044

RF00581 RF01249

snoRNA; CD-box;

CL00048

RF00569 RF01183

snoRNA; CD-box;

CL00050

RF00087 RF00136

snoRNA; CD-box;

CL00064

RF00151 RF00608

snRNA;

CL00067

RF00270 RF01170 RF01200

snRNA;

CL00072

RF00055 RF01299

snRNA;

CL00008

RF00206 RF01277

snoRNA; CD-box;

Figure 4.2: Pairwise alignment conﬁgurations
Table 4.2: Classiﬁcation of conserved clans based on internal average score
Average score interval
[1,2)
[2,3)
[3,4)
[4,5)
[5,6)

Conserved Clan families

CL00099

CL00008
CL00044
CL00088

CL00067
CL00064
CL00048
CL00073

CL000072
CL00050
CL00076
CL00076
CL00074
CL00015

CL00010

applied to Rfam family members of conserved clans. These clans all have internal
average score of more than 4, i.e, two most right columns in Table 4.2, except for Rfam
families shown in rectangles (in levels 3 and 4) which do not belong to the two most

43

right columns. We will later explain why this is the case.
• The table shown in Figure 4.5 presents shifted pairwise alignment scores between each
two participating Rfam families in the clustering process. We ﬁll in diagonal elements
of the table with zeros as pairwise alignment of each family with itself is not deﬁned.
The two most right columns are also ﬁlled in with zeros. This will be taken up later
in this section.
• Table 4.3 presents grammar patterns of conserved clans with internal average score
higher than 4.

RF01320
RF01376 RF00163
RF00186 RF00087 RF00610 RF01299
RF00055
RF01318
RF01377
RF0008
RF00339 RF00136 RF01280
Summary level: 1
RF01320
RF01318
c RF01377
RF01376

RF0008
c
RF00163

RF00339
c
RF00186

RF00087
RF00610
c
RF01280

RF01299
RF00055

RF00136
Summary level: 2

RF01320
RF01318
c RF01377
RF01376

RF00151
RF01299
RF01280 RF00186
RF00136 RF00055 c
RF00339
RF00610

RF0008
c
RF00163
RF00087

Summary level: 3
RF01320
RF01318
c RF01377
RF01376

RF00151
RF01299
RF01280 RF00186
RF00136 RF00055 c
RF00339
RF00610
RF00087

RF0008
c
RF00163

Figure 4.3: Summary of clustering process of conserved clans members

44

CL00015

CL00074

CL00076

RF01320 RF01376 RF01377 RF01318 RF00186 RF00339 RF00610 RF01280
RF01320
RF01376

0

4.07

5.38

4.74

1.47

1.4

1.39

1.24

0

3.11

5.64

1.42

1.42

1.24

1.17

0

3.39

1.92

1.87

1.82

1.64

0

1.15

1.22

1.16

0.98

0

4.99

4.38

3.8

0

4.38

3.84

0

4.67

CL00015
RF01377
RF01318
RF00186
CL00074
RF00339
RF00610
CL00076
RF01280

0

RF00087
CL00050
RF00136
RF00055
CL00072
RF01299

RF00008
CL00010
RF00163

Figure 4.4: Pairwise alignment scores - Part1

45

CL00050

CL00072

CL00010

RF00087

RF00136

RF00055

RF01299

RF00008

RF00163

RF01320

1.545381

1.497188

1.256737

1.520346

0

0

RF01376

1.25098

1.35261

1.24301

1.403463

0

0

RF01377

1.605622

1.891358

1.500366

1.8

0

0

RF01318

1.074509

1.164285

0.963441

1.15671

0

0

RF00186

3.968627

4.461044

3.820689

3.874796

0

0

RF00339

3.169697

4.086274

3.989855

3.916049

0

0

RF00610

3.567816

3.765891

4.177011

3.634599

0

0

RF01280

3.511355

3.78608

4.064583

3.302564

0

0

RF00087

0

4.122222

3.252083

3.226515

0

0

0

3.764912

4.092857

0

0

0

4.349425

0

0

0

0

0

0

5.690196

CL00015

CL00074

CL00076

CL00050
RF00136
RF00055
CL00072
RF01299
RF00008
CL00010
RF00163

0

Figure 4.5: Pairwise alignment scores - Part2

46

Table 4.3: Grammar patterns of 6 Clans
Rfam families Strategy 1 Strategy 2 Strategy 3 (P: 0.5, A:0.7, a:0.7)

CL00015

RF01320
RF01376
RF01377
RF01318

P|
P|
P|
P|

P|
P|
P|
P|

AaPA|
AP|
AaPA|
AaP|

CL00010

RF00008
RF00163

P#P|P|
P#P|P|

P#P|P|
P#P|P|

Pa#aPA|PA|
PA#aPA|P|

CL00074

RF00186
RF00339

P|
P|

P|
P|

PA|
PA|

CL00076

RF00610
RF01280

P|
P|

P|
P|

PA|
PA|

CL00050

RF00087
RF00136

P|
P|

P|
P|

PA|
PA|

CL00072

RF00055
RF01299

P|
P|

P|
P|

PA|
PA|

A. CL00010 and CL00015 clans
CL00010 and CL0015 are both identiﬁed and remained very well-conserved during clustering process (Figure 4.3). We study these two clans and investigate why our method could
also put them together in the clustering process, very much like what clan database did.
CL00010 is called “Hammerhead clan” in Clan database. This clan carries two RNA
families Hammerhead 1(RF0008) and Hammerhead 3(RF00163). Homology between these
families has been established in the published literature [7,26]. In clustering process CL00010
family members are clustered together and remained isolated from other Rfam families.
The reason is that although RF0008 and RF00163 share a similar secondary structure with
each other, they are very diﬀerent from other participants in structural level. RF0008 and
RF00163 are the only two Rfam families in the Table 4.3 which have bifurcation (#) in
their grammar patterns and as a result their pairwise alignments with other families result
in large negative scores. Based on our own deﬁned rule in pairwise alignment application,
47

these alignment types are ﬁltered out and do not appear at the output. As a result, these
families do not participate in clustering process and remain isolated from other families.
Clan CL00015 is also called “CRISPR-2 clan” and contains RNA families CRISPRDR5 (RF01318), CRISPR-DR7 (RF01320), CRISPR-DR63 (RF01376) and CRISPR-DR64
(RF01377). To verify the relationship between Rfam families of this clan we refer to [32, 34,
39]. CL00015 members are also identiﬁed and remained isolated from other Rfam families
during the clustering process. However, CL00015 clan members are not far from other
participants in structural level that much. As it is seen in the grammar pattern Table 4.3,
members in this clan share the same grammar patterns as other participants (excluding
RF0008 and RF00163). Members of CL00015 have long chains of individual bases starting
at both ends 3’ and 5’. CL00074, CL00076, CL00050, and CL00072 clan members do not
have this attribute. As a result, this clan family members could share a high score pairwise
alignment with each other, have low pairwise alignment with other Rfam families, and ﬁnally
are isolated from other clusters. For more details, one can see the table in Figure 4.5, and
check the pairwise alignment scores of CL00015 family members with inside and outside of
clan Rfam families.

48

B. CL00074 and CL00076 clans
CL00074 and CL00076 are both identiﬁed in clustering process (Figure 4.3), but are not
remained isolated. We start with a short biological introduction of these clans and next will
analyze the outcome of the experimental results for them.
CL00074 is called “SNORD101 clan” and contains two RNA families SNORD101 (RF000186)
and snoR60 (RF00339). Small nucleolar RNAs snoR60 (RF00339) and snoRD101 (RF000186)
both are non-coding RNA (ncRNA) molecules which function to modify other small nuclear
RNAs (snRNAs). They both belong to the C/D box class of snoRNAs containing the
conserved sequence motifs known as the C box (UGAUGA) and the D box (CUGA). Bioinformatic methods discussed in [32, 34, 39] conﬁrm the relationship between these families.
Clan CL00076 is called “SNORD110” and contains two RNA families SNORD110 (RF00610)
and snoR14 (RF01280). Functionality of both families is to modify other small nuclear RNAs
(snRNAs). They also belong to C/D box class of snoRNAs which contain the C (UGAUGA)
and D (CUGA) box motifs. Bioinformatic methods discussed in [24] conﬁrm the relationship
between these families.
Based on the Table 4.3, these two clans not only share very similar secondary structures
with each other, but also they are similar to CL00050 and CL00072 clans in this aspect.
Instead of high degree of similarity between this clan members with other participating
Rfam families in the structural level, our clustering approach is still successful to identify
CL00074 and CL00076 clans. Also in the pairwise alignment shown in Figure 4.5, it is clearly
seen that members of CL00074 or CL00076 share higher pairwise alignment score with their
clan-mates than those of other clans. Furthermore, although secondary structures of these
clans are very similar to each other and to the members of other clans, designed grammar
strings and score table are still successful to distinguish between Rfam family members of
these two clans. However, in further steps of the clustering, RF00136 is merged with CL00074
clan members. This observation will be discussed in the next step.
C. CL00050 clan

49

CL00050 is also called “SNORD26 clan” and contains two RNA families SNORD26
(RF00087) and SNORD81 (RF00136). Small nucleolar RNAs SNORD26 and SNORD81
both are members of the C/D class of snoRNA and contain the C (UGAUGA) and D
(CUGA) box motifs. Bioinformatic methods discussed in [32, 39] conﬁrm the relationship
between these families.
During the clustering process, RF00136 joins the cluster of CL00074 Rfam family members. The reason is that RF00136 is more similar to RF00186 member of CL00074 in both
sequence and structure level. To see this fact better, please visit the corresponding pages
in Clan database which show the secondary structure of CL00074 and CL00050 (Appendix
C). In the following, pairwise alignments of {RF00136,RF00186} and {RF00136,RF00087}
are shown, respectively. As it is seen, RF00136 has a better alignment with RF00186 than
its clan-mate RF00087.
PPPPP---AAUGAUGA-CUUW----AAUUGUCGGAUACCCCUUCACUCCYU--YYA--UGAGUGA
PPPPPPPCA-UGAUGAUCUCACYCCAACU-U--GA-ACUCUCUCAC----UGAUUACUUGA-UGA
-AACAUAAYAG-UCUGA|
YAAUA-AA-AGAUCUGA|
score= 243.00
align length= 83
RF00186-RF00136 pairwise alignment score before shifting:2.927711
RF00186-RF00136 pairwise alignment score after shifting:4.461044

cPPPP------GGGGAUGAUUUYA----AGAAC-UGAACUCUCU--CU--UU-C-UGAUGGWUUA
-PPPPPPPCAUG---AUGAUCUCACYCCA--ACUUGAACUCUCUCACUGAUUACUUGAUGA--YA
GUGGAGAAAASMMAAAAWMUCUGA|
AU----AAAA---GA----UCUGA|
score= 233.00
align length= 90
RF00087-RF00136 pairwise alignment score before shifting:2.588889

50

RF00087-RF00136 pairwise alignment score after shifting:4.122222

D. CL00072 clan
CL00072 is also called “SNORD96 clan” in Clan database and contains the RNA families
SNORD96 (RF00055) and snR39B (RF00608). They are both members of the C/D class of
snoRNA and contain the C (UGAUGA) and D (CUGA) box motifs [29].
Rfam family members of CL00072 are clustered together along with RF00151 Rfam family
which belongs to CL00064. CL00064 is called “SNORD58 clan” and contains RNA families
SNORD58 (RF00151) and SNORD99 (RF00608). These families are both non-coding RNA
(ncRNA) molecules whose functionality is to modify other small nuclear RNAs (snRNAs).
These both belong to the C/D box class of snoRNAs and contain the C (UGAUGA) and
D (CUGA) box motifs. Bioinformatic methods discussed in [32, 39] conﬁrm the relationship
between these families.
RF00151 joins the cluster of CL00072 clan members because this Rfam family is more
similar to RF01299 Rfam family in sequential level than RF00608 which is its clan-mate.
To see this fact better, please visit the corresponding pages in Clan database which show
the secondary structure of CL00072 and CL00064 (Appendix C). In the following, pairwise
alignments of {RF00608, RF00151,RF01299} are shown, respectively. As it is seen, RF00151
has a better pairwise alignment with RF01299 than its clan-mate RF00608.
cPPPP----WGUGAUGA---CU-WUCUUA--GGACACCUUUGGAU--UA---AC---CAUGAAAAS
yPPPPPUCMAG-GAUGAAACCYAAUYUSAGUGGACAUCUMUGGAUGAWAMWUGCGGAUAUGGGA-AAAWMWRUUCU--GA|
---------CUGAGA|
score= 172.00
align length= 82
RF00151-RF00608 pairwise alignment score before shifting:2.588889
RF00151-RF00608 pairwise alignment score after shifting:3.630894

51

cP--PPPWGUGAUGACUWUCUUAGGA-CACCUUU--GGAUUAACC----AUGAAAASAA-AWMWRUU
-PPPPPPWAUGAUGGC------A-WAYCAUCUUUCGGGACUGACCUGAAAUGRAGAGAWUAYWYAUU
-CUGA|
GCUGA|
score= 208.00
align length= 73
RF00151-RF01299 pairwise alignment score before shifting:2.849315
RF00087-RF00136 pairwise alignment score after shifting:4.382648

E. Conserved clans with internal average score less than 4
The clustering process is not successful to identify clans whose members have internal
average score (Section 4.3.1) less than 4. We calculate the average of pairwise alignment
scores between members of each clan and call it “external average score”. In Figure 4.6,
both internal and external average scores of 14 participating clans in this experiment are
presented. Clans whose internal and external average scores are much closer to each other,
have a lesser chance of being identiﬁed or remained isolated during clustering process. Clans
whose members could not stick to their clan-mates and create their own isolated clans have
close external and internal average scores. Also, their internal and external scores are close
to the corresponding average scores of other clans.

Clustering results of Rfam−Clan data set
By applying the clustering approach to the shifted pairwise alignment scores of Rfam−Clan
grammar strings, we could ﬁnd some clusters which remain well-conserved. These clusters are presented in Table 4.4 following with a short description of the common biological
functionality for members of each cluster.
• RF01112 and RF01083 are needed for eﬃcient transcription. For more details on
biological functionalities of RF01112 and RF01083, please visit the corresponding pages
in Clan database (Appendix C).
52

6
5
4
3
2
1

CL
00
0
CL 10
00
0
CL 15
00
07
4
CL
00
0
CL 76
00
0
CL 72
00
05
0
CL
00
0
CL 88
00
09
9
CL
00
0
CL 73
00
0
CL 44
00
04
8
CL
00
0
CL 64
00
0
CL 67
00
00
8

0

Internal average scores

External average score

Figure 4.6: External and internal average scores of all conserved clans
• RF01111 and RF01109 are needed for eﬃcient transcription. For more details on
biological functionalities of RF01112 and RF01083, please visit the corresponding pages
in Clan database (Appendix C).
• RF01025 and RF01017 have multiple roles in negative regulation (transcript degradation and sequestering, translational suppression) and possible involvement in positive
regulation (transcriptional and translational activation). For more details on biological
functionalities of RF01112 and RF01083, please visit the corresponding pages in Clan
database (Appendix C).
• RF01386 and RF00014 are used in translation and transcription terminator. For more
details on biological functionalities of RF01112 and RF01083, please visit the corresponding pages in Clan database (Appendix C).
53

Number

Table 4.4: Discovered Clusters in Rfam−Clan
Rfam families Grammar pattern RNA Type

Domain

1

RF01112
RF01083

APA|
APA|

Cis-reg

Viruses

2

RF01111
RF01109

aA|
aA|

Cis-reg

Viruses
Viruses, Bacteria

3

RF01025
RF01017

PA|
PA|

miRNA

Eukaryota

4

RF01386
RF00014

APA|
APA|

sRNA

Bacteria
Bacteria, Eukaryota

5

RF01340
RF01322

aP|
aP|

CRISPR

Bacteria

6

RF01115
RF01106

APA|
APA|

Cis-reg

Viruses
Viruses, Eukaryota

• RF01340 and RF01322 function as a prokaryotic immune system. For more details
on biological functionalities of RF01112 and RF01083, please visit the corresponding
pages in Clan database (Appendix C).
• RF01115 and RF01106 are needed for eﬃcient transcription. For more details on
biological functionalities of RF01112 and RF01083, please visit the corresponding pages
in Clan database (Appendix C).

4.3.2

Analysis of experimental results based on using new version
of grammar string design

By applying clustering process to the shifted pairwise alignment of conserved clans which
are generated based on new grammar string design, the results are the same as we achieved
before in Section 4.3.1. We can conclude that the performance of old designed grammar
string is as well as the new one. Designing a grammar string with distinguished base pairs
54

or deﬁning a new score table could not enhance our ability to do a better clustering. We
have done the same experiment for the new grammar string design, however, the questions
as to how to interpret the results require more investigation and are parts of the future work
on this subject.

55

APPENDICES

56

Appendix A
Ribosum matrix

A.1

Used in grammar string chapter

A
C
G
U

2.22
-1.86
-1.46
-1.39
A

1.16
-2.48
-1.05
C

1.03
-1.74
G

1.65
U

Figure A.1: Ribosum matrix - Part1

57

AA

-2.49

AC

-7.04

-2.11

AG

-8.24

-8.89

-0.8

AU

-4.32

-2.04

-5.13

4.49

CA

-8.84

-9.37

-10.41

-5.56

-5.13

CC

-14.37

-9.08

-14.53

-6.71

-10.53

-3.59

CG

-4.68

-5.86

-4.57

1.67

-3.57

-5.71

5.36

CU

-12.64

-10.45

-10.14

-5.17

-8.49

-5.77

-4.96

GA

-6.86

-9.73

-8.61

-5.33

-7.98

-12.43

-6

GC

-5.03

-3.81

-5.77

2.7

-5.95

-3.7

2.11

GG

-8.39

-11.05

-5.38

-5.61

-11.36

-12.58

-4.66

GU

-5.84

-4.72

-6.6

0.59

-7.93

-7.88

-0.27

UA

-4.01

-5.33

-5.43

1.61

-2.42

-6.88

2.75

UC

-11.32

-8.67

-8.87

-4.81

-7.08

-7.4

-4.91

UG

-6.16

-6.93

-5.94

-0.51

-5.63

-8.41

1.32

UU

-9.05

-7.83

-11.07

-2.98

-8.39

-5.41

-3.67

AA

AC

AG

AU

CA

CC

CG

Figure A.2: Ribosum matrix - Part2

58

CU
GA
GC
GG
GU
UA
UC
UG
UU

-2.28
-7.71
-5.84
-13.69
-5.61
-4.72
-3.83
-7.36
-5.21
CU

-1.05
-4.88
-8.67
-6.1
-5.85
-6.63
-7.55
-11.54
GA

5.02
-4.13
1.21
1.6
-4.49
-0.08
-3.9
GC

-1.98
-5.77
-5.75
-12.01
-4.27
-10.79
GG

3.47
-0.57
-5.3
-2.09
-4.45
GU

4.97
-2.98
1.14
-3.39
UA

-3.21
-4.76
-5.97
UC

Figure A.3: Ribosum matrix - Part3

59

3.36
-4.28
UG

-0.02
UU

Appendix B
Old and new score tables

B.1

Old grammar string score table

Our website: http://www.cse.msu.edu/ yannisun/grammar-string

A

U

C

G

A

7

-12

-12

-1

U

-12

7

-1

-12

C

-12

-1

7

-12

G

-1

-12

-12

7

a

-1000000

-1000000

-1000000

-1000000

u

-1000000

-1000000

-1000000

-1000000

c

-1000000

-1000000

-1000000

-1000000

g

-1000000

-1000000

-1000000

-1000000

P

-100

-100

-100

-100

#

-1000000

-1000000

-1000000

-1000000

|

-1000000

-1000000

-1000000

-1000000

Figure B.1: Score table used in pairwise alignments of old garment strings - Part1

60

a

u

c

g

P

#

|

A

-1000000

-1000000

-1000000

-1000000

-100

-1000000

-1000000

U

-1000000

-1000000

-1000000

-1000000

-100

-1000000

-1000000

C

-1000000

-1000000

-1000000

-1000000

-100

-1000000

-1000000

G

-1000000

-1000000

-1000000

-1000000

-100

-1000000

-1000000

a

7

-12

-12

-1

-100

-1000000

-1000000

u

-12

7

-1

-12

-100

-1000000

-1000000

c

-12

-1

7

-12

-100

-1000000

-1000000

g

-1

-12

-12

7

-100

-1000000

-1000000

P

-100

-100

-100

-100

7

-1000000

-1000000

#

-1000000

-1000000

-1000000

-1000000

-1000000

72

-1000000

|

-1000000

-1000000

-1000000

-1000000

-1000000

-1000000

58

Figure B.2: Score table used in pairwise alignments of old garment strings - Part2

61

B.2

New grammar string score table

A
U
C
G
a
u
c
g
<
>
[
]
(
)

E
F
I
J
L
O
Z
#
|

A
2.22
-1.39
-1.86
-1.46
-1000000
-1000000
-1000000
-1000000
1.11
-2.78
-3.72
-2.92
-2.92
-2.78
0.555
-5.56
-7.44
-5.84
-5.84
-5.56
-13
-1000000
-1000000

U
-1.39
1.65
-1.05
-1.74
-1000000
-1000000
-1000000
-1000000
-2.75
0.825
-2.1
-3.48
-3.48
0.825
-5.56
0.412
-4.2
-6.69
-6.69
0.412
-13
-1000000
-1000000

C
-1.86
-1.05
1.16
-1.48
-1000000
-1000000
-1000000
-1000000
-3.72
-2.1
-0.58
-4.96
-4.96
-2.1
-7.44
-4.2
0.29
-9.92
-9.92
-4.2
-13
-1000000
-1000000

G
-1.46
-1.74
-2.48
1.03
-1000000
-1000000
-1000000
-1000000
-2.92
-3.48
-4.96
0.515
0.515
-3.48
-5.84
0.29
-9.92
0.257
0.257
-6.96
-13
-1000000
-1000000

a
-1000000
-1000000
-1000000
-1000000
2.22
-1.39
-1.86
-1.46
-2.78
1.11
-2.92
-3.72
-2.78
-2.92
-5.56
0.555
-5.84
-7.44
-5.56
-5.84
-13
-1000000
-1000000

u
-1000000
-1000000
-1000000
-1000000
-1.39
1.65
-1.05
-1.74
0.825
-2.75
-3.48
-2.1
0.825
-3.48
0.412
-5.56
-6.69
-4.2
0.412
-6.69
-13
-1000000
-1000000

c
-1000000
-1000000
-1000000
-1000000
-1.86
-1.05
1.16
-1.48
-2.1
-3.72
-4.96
-0.58
-2.1
-4.96
-4.2
-7.44
-9.92
0.29
-4.2
-9.92
-13
-1000000
-1000000

Figure B.3: Score table used in pairwise alignments of old garment strings - Part1

62

A
U
C
G
a
u
c
g
<
>
[
]
(
)

E
F
I
J
L
O
Z
#
|

g
-1000000
-1000000
-1000000
-1000000
-1.46
-1.74
-2.48
1.03
-3.48
-2.92
0.515
-4.96
-3.48
0.515
0.29
-5.84
0.257
-9.92
-6.96
0.257
-13
-1000000
-1000000

<
1.11
-2.78
-3.72
-2.92
-2.78
0.825
-2.1
-3.48
4.49
1.61
1.67
2.7
0.59
-0.51
2.25
0.805
0.835
1.35
0.295
-1.02
-10
-1000000
-1000000

>
-2.75
0.825
-2.1
-3.48
1.11
-2.78
-3.72
-2.92
1.61
4.97
2.75
1.6
-0.57
1.14
0.805
2.485
1.375
0.8
-1.14
0.57
-10
-1000000
-1000000

[

]

(

)

-3.72
-2.1
-0.58
-4.96
-2.92
1.11
-4.96
0.515
1.67
2.75
5.36
2.11
-0.27
1.32
0.835
1.375
2.68
1.055
-0.54
0.66
-10
-1000000
-1000000

-2.92
-3.48
-4.96
0.515
-3.72
-2.1
0.58
-4.96
2.7
1.6
2.11
5.62
1.21
-0.08
1.35
0.8
1.055
2.81
2.42
-0.16
-10
-1000000
-1000000

-2.92
-3.48
-4.96
0.515
-2.78
0.825
-2.1
-3.48
0.59
-0.57
-0.27
1.21
3.47
-2.09
0.295
-1.14
-0.54
2.42
1.735
-4.18
-10
-1000000
-1000000

-2.78
0.825
-2.1
-3.48
-2.92
-3.48
-4.96
0.515
-0.51
1.14
1.32
-0.08
-2.09
3.36
-1.02
0.57
0.66
-0.16
-4.18
1.68
-10
-1000000
-1000000

Figure B.4: Score table used in pairwise alignments of old garment strings - Part2

63

A
U
C
G
a
u
c
g
<
>
[
]
(
)

E
F
I
J
L
O
Z
#
|

E
0.555
-5.56
-7.44
-5.84
-5.56
0.412
-4.2
-6.96
2.25
0.805
0.835
1.35
0.295
-1.02
1.125
0.402
0.417
0.675
0.147
-2.04
-10
-1000000
-1000000

F
-5.56
0.412
-4.2
-6.96
0.555
-5.56
-7.44
-5.84
0.825
2.485
1.375
0.8
-1.14
0.57
0.402
1.242
0.687
0.4
-2.28
0.285
-10
-1000000
-1000000

I
-7.44
-4.2
0.29
-9.92
-5.84
-6.96
-9.92
0.257
0.835
1.375
2.68
1.055
-0.54
0.66
0.417
0.687
1.34
0.527
-1.08
0.33
-10
-1000000
-1000000

J
-5.84
-6.96
-9.92
0.257
-7.44
-4.2
0.29
-9.92
1.35
0.8
1.055
2.81
0.605
-0.16
0.675
0.4
0.527
1.405
1.21
-0.32
-10
-1000000
-1000000

L
-5.84
-6.96
-9.92
0.257
-5.56
0.412
-4.2
-6.96
0.295
-1.14
-0.54
0.605
1.735
-4.18
0.147
-2.28
-1.08
1.21
0.867
-8.36
-10
-1000000
-1000000

O
-5.56
0.412
-4.2
-6.96
-5.84
-6.96
-9.92
0.257
-1.02
0.57
0.66
-1.6
-4.18
1.68
-2.04
0.285
0.33
-0.32
-8.36
0.84
-10
-1000000
-1000000

Z
-13
-13
-13
-13
-13
-13
-13
-13
-3
-3
-3
-3
-3
-3
-3
-3
-3
-3
-3
-3
-10
-1000000
-1000000

Figure B.5: Score table used in pairwise alignments of old garment strings - Part3

64

A
U
C
G
a
u
c
g
<
>
[
]
(
)

E
F
I
J
L
O
Z
#
|

#
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
100
-1000000

|
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
-1000000
80

Figure B.6: Score table generated based on Ribosum matrix and used in pairwise alignments
of new garment strings - Part4

65

Appendix C
Links to Web Pages of Clan families and their members

C.1

Used in experimental results section
Name

Webpage

CL00074

http://rfam.sanger.ac.uk/clan/CL00074

CL00050

http://rfam.sanger.ac.uk/clan/CL00050

CL00072

http://rfam.sanger.ac.uk/clan/CL00072

CL00064

http://rfam.sanger.ac.uk/clan/CL00064

RF01112

http://rfam.sanger.ac.uk/family/RF01112

RF01083

http://rfam.sanger.ac.uk/family/RF01083

RF01111

http://rfam.sanger.ac.uk/family/RF01111

RF01109

http://rfam.sanger.ac.uk/family/RF01109

RF01025

http://rfam.sanger.ac.uk/family/RF01025

RF01017

http://rfam.sanger.ac.uk/family/RF01017

RF01386

http://rfam.sanger.ac.uk/family/RF01386

RF00014

http://rfam.sanger.ac.uk/family/RF00014

RF01340

http://rfam.sanger.ac.uk/family/RF01340

RF01322

http://rfam.sanger.ac.uk/family/RF01322

RF01115
RF01106

http://rfam.sanger.ac.uk/family/RF01115
http://rfam.sanger.ac.uk/family/RF01106

Figure C.1: Links to Web Pages

66

BIBLIOGRAPHY

67

Bibliography
[1] Rujira Achawanantakun, Seyedeh Shohreh Takyar, and Yanni Sun. Grammar string:
a novel ncRNA secondary structure representation. In 9th Annual International Conference on Computational Systems Bioinformatics, pages 2–13, Stanford, CA, August
2010.
[2] S F Altschul, W Gish, W Miller, E W Myers, and D J Lipman. Basic local alignment
search tool. Journal of Molecular Biology, 215:403–410, 1990.
[3] F Bai, D Li, and T Wang. A new mapping rule for RNA secondary structures with its
applications. J. Math. Chem., 43:932–943, 2008.
[4] M. Blanchette, W. J. Kent, C. Riemer, L. Elnitski, A. F. Smit, K. M. Roskin,
R. Baertsch, K. Rosenbloom, H. Clawson, E. D. Green, D. Haussler, and W. Miller.
Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res,
14(4):708–715, 2004.
[5] Francis Crick. Central dogma of molecular bilogy. Nature, 227:561–563, August 1970.
[6] Neil D. Rawlings and Alan J. Barrett. Evolutionary families of peptidases. Biochem J.,
290:205–218, 1993.
[7] Ruﬀner DE, Stormo GD, and Uhlenbeck OC.
[8] Robin Dowell and Sean Eddy. Evaluation of several lightweight stochastic context-free
grammars for RNA secondary structure prediction. BMC Bioinformatics, 5(1):71, 2004.
[9] Robin D Dowell and Sean R Eddy. Eﬃcient pairwise RNA structure prediction and
alignment using sequence alignment constraints. BMC Bioinformatics, 7(400), 2006.
[10] R. Durbin, S.R. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, UK, 1998.
[11] Robert D. Finn, Jaina Mistry, Benjamin Schuster-Bockler, Sam Griﬃths-Jones, Volker
Hollich, Timo Lassmann, Simon Moxon, Mhairi Marshall, Ajay Khanna, Richard
Durbin, Sean R. Eddy, Erik L. L. Sonnhammer, and Alex Bateman. Pfam: clans,
web tools and services. Nucleic Acids R., 34:247–251, 2006.
[12] P. Gardner, J. Daub, J. Tate, B. Moore, I. Osuch, S. Griﬃths-Jones, R. Finn,
E. Nawrocki, D. Kolbe, S. Eddy, and A. Bateman. Rfam: Wikipedia, clans and the
decimal release. Nucleic Acids Res., 39(Database):D141–D145, 2010.
68

[13] S. Griﬃths-Jones, S. Moxon, M. Marshall, A. Khanna, S. R. Eddy, and A. Bateman.
Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res., 33:D121–
D124, 2005.
[14] M. H¨chsmann, B. Voss, and R. Giegerich. Pure multiple RNA secondary structure
o
alignments: a progressive proﬁle approach. IEEE/ACM Trans Comput Biol Bioinform,
1(1):53–62, 2004.
[15] Matthias H¨chsmann, Thomas T¨ller, Robert Giegerich, and Stefan Kurtz. Local Simo
o
ilarity in RNA Secondary Structures. In CSB ’03: Proceedings of the IEEE Computer
Society Conference on Bioinformatics, page 159, Washington, DC, USA, 2003. IEEE
Computer Society.
[16] I.L. Hofacker. Vienna RNA secondary structure server. Nucleic Acids Res., 31:3429–
3431, 2003.
[17] IL Hofacker, M Fekete, and PF Stadler.
[18] Ivo L. Hofacker, Stephan H. F. Bernhart, and Peter F. Stadler. Alignment of RNA base
pairing probability matrices. Bioinformatics, 20(14):2222–2227, 2004.
[19] Ian Holmes. Accelerated probabilistic inference of RNA structure evolution. BMC
Bioinformatics, 6(1):73, 2005.
[20] Tommi Jaakkola and David Haussler. Exploiting Generative Models in Discriminative
Classiﬁers. In NIPS, pages 487–493, 1998.
[21] Ellis JC. and Brown JW. The RNase P family. RNA BIOLOGY.
[22] Urban JH. and Vogel J. Two seemingly homologous noncoding RNAs act hierarchically
to activate glmS mRNA translation. PLoS Biol., 6:e64, 2008.
[23] Matthew W. Jones-Rhoades, David P. Bartel, and Bonnie Bartel. MicroRNAs and
Their Regulatory Roles in Plants. Annual Review of Plant Biology, 57:19–53, 2006.
[24] Brown JW, Echeverria M, Qu LH, Lowe TM, Bachellerie JP, Httenhofer A, Kastenmayer
JP, Green PJ, Shaw P, and Marshall DF. Plant snoRNA database. Nucleic Acids Res.,
31:432–435, 2003.
[25] S Karlin and S F Altschul.
[26] Hertel KJ, Pardi A, Uhlenbeck OC, Koizumi M, Ohtsuka E, Uesugi S, Cedergren R,
Eckstein F, Gerlach WL, and Hodgson R. Numbering system for the hammerhead.
Nucleic Acids Res., 20:32–52, 92.
[27] Robert J Klein and Sean R Eddy. RSEARCH: Finding homologs of single structured
RNA sequences. BMC Bioinforma, USA, 2003.
[28] B. Knudsen and J. Hein. Pfold: RNA secondary structure prediction using stochastic
context-free grammars. Nucleic Acids Res, 31(13):3423–3428, 2003.
69

[29] Laurent Lestrade and Michel J. Weber. snoRNA-LBME-db, a comprehensive database
of human H/ACA and C/D box snoRNAs. Nucleic Acids Res., 34:158–162, 2006.
[30] C Li, A H Wang, and L Xing. Similarity of RNA secondary structures. J. Comput.
Chem., 28:508–512, 2007.
[31] Shanfa Lu, Rui Shi, Cheng-Chung Tsao, Xiaoping Yi, Laigeng Li, and Vincent L.
Chiang. RNA silencing in plants by the expression of siRNA duplexes. Nucl. Acids
Res., 32(21):e171–, 2004.
[32] Martin Madera. Proﬁle Comparer: a program for scoring and aligning proﬁle hidden
Markov models. Bioinformatics, 24(22):26302631, 2008.
[33] J. S. Mccaskill. The equilibrium partition function and base pair binding probabilities
for RNA secondary structure. Biopolymers, 29(6-7):1105–1119, 1990.
[34] Eric P. Nawrocki, Diana L. Kolbe, and Sean R. Eddy. Infernal 1.0: inference of RNA
alignments. Bioinformatics, 25:13351337, 2009.
[35] Elena Rivas and Sean R. Eddy. Noncoding RNA gene detection using comparative
sequence analysis. BMC Bioinformatics, 2(1):8, 2001.
[36] David Sankoﬀ. Simultaneous Solution of the RNA Folding, Alignment and Protosequence Problems. SIAM Journal on Applied Mathematics, 45(5):810–825, 1985.
[37] Needleman SB and Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol, 48:443–453, 1970.
[38] Sven Siebert and Rolf Backofen. MARNA: multiple alignment and consensus structure prediction of RNAs based on sequence structure comparisons. Bioinformatics,
21(16):3352–3359, 2005.
[39] Eddy SR. A probabilistic model of local sequence alignment that simpliﬁes statistical
signiﬁcance estimation. PLoS Comput Biol.
[40] Julie D. Thompson, Desmond G. Higgins, and Toby J. Gibson. CLUSTAL W: improving
the sensitivity of progressive multiple sequence alignment through sequence weighting,
position-speciﬁc gap penalties and weight matrix choice. Nucl. Acids Res., 22(22):4673–
4680, 1994.
[41] Elfar Torarinsson, Jakob H. Havgaard, and Jan Gorodkin. Multiple structural alignment
and clustering of RNA sequences. Bioinformatics, 23(8):926–932, 2007.
[42] Helene Touzet and Olivier Perriquet. CARNAC: folding families of related RNAs. Nucl.
Acids Res., 32(suppl. 2):W142–145, 2004.
[43] Stijn van Dongen. Graph clustering by ﬂow simulation. PhD thesis, University of
Utrecht, May 2000.

70

[44] S. Washietl, I. L. Hofacker, and P. F. Stadler. Fast and reliable prediction of noncoding
RNAs. Proc Natl Acad Sci U S A, 102(7):2454–2459, 2005.
[45] Sebastian Will, Kristin Reiche, Ivo L Hofacker, Peter F Stadler, and Rolf Backofen.
Inferring Noncoding RNA Families and Classes by Means of Genome-Scale StructureBased Clustering. PLoS Comput Biol, 3(4):e65, 2007.
[46] A. Wilm, I Mainz, and G. Steger. An enhanced RNA alignment benchmark for sequence
alignment programs. Algorithms Mol Biol., 1(19), 2006.
[47] Y Zhang, J Qiu, and Su L. Comparing RNA secondary structures based on 2D graphical
representation. Chemical Physics Letters, 458:180–185, 2008.

71