I. ‘ nu: 0.4..lhutullr — u I. L. I l. .. f. .v‘ . . t «_ n ‘. .. r . '0
*wc ‘8 V. . . .31 .b
I . .II In : .. 1. cl . . T 1.: . S: l «1%»
41‘ :

 

 

THESIS A, m, L.

 

,. I a IlllllllllllllllllI

? Michigan State
University

 

 

This is to certify that the

dissertation entitled
PLLATEDNES‘S or (BECLOQ'ICAL Sewwces

USNCT ALMj—NMENT AND RESTRICTION MAP

DATABASES
presented by

3771 Tim

has been accepted towards fulﬁllment
of the requirements for

3 ‘ A
llMD degreein (Eli‘l‘bklel' gClGhCE

 

 

l

—f

 

Major professor

 

PLACE ll RETURN BOX to remove thi- cheokout from your record.
TO AVOID FINES return on or betore date due.

DATE DUE DATE DUE DATE DUE

    
 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

MSU Is An Afﬁrmative AotionlEquei Opportunity Institution
mm:

RELATEDNESS OF BIOLOGICAL SEQUENCES
USING ALIGNMENT AND RESTRICTION MAP
DATABASES

By

Jin Kim

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1996

ABSTRACT
RELATEDNESS OF BIOLOGICAL SEQUENCES USING
ALIGNMENT AND RESTRICTION MAP DATABASES
By
Jz'n Kim

Comparative analysis of DNA and protein macromolecules has been an important
component of biological research. Sequence alignment and restriction map compari-
son, in particular, have been useful in the study of molecular evolution, RNA folding,
and protein structure-function relationships. In this thesis, we have investigated mul-
tiple alignment problems involving RNA and protein sequences, and the methods to
infer relatedness between sequences using restriction patterns and restriction map

databases.

We ﬁrst consider the multiple sequence alignment problem. Multiple sequence
alignment is used to discover an optimal alignment based on deﬁned criteria. Vari-
ations of dynamic programming provide most commonly used tools for multiple se-
quence alignment methods. However, the biggest obstacle to using dynamic program-
ming for multiple sequence alignment has been the high computational complexity.

Due to this high computational complexity, dynamic programming cannot be effec-

tively used to align more than three sequences and to apply certain types of cost
functions. To overcome the limitations of dynamic programming, an algorithm called
MSASA, based on simulated annealing, was developed. MSASA can overcome these
limitations.

We then consider the multiple RNA sequence alignment problem to identify pos-
sible RNA secondary structures. An algorithm called RNASA, based on simulated
annealing, is suggested for multiple RNA sequence alignment. In this algorithm, RNA
sequences are aligned based on primary sequence similarity and then realigned based
on secondary structure information. Dot matrices generated from intra-sequence com-
parisons are used to obtain possible common secondary structures. Several strategies
to reduce the simulated annealing time are suggested.

Sequence database searches have been used for identifying unknown macro-
molecules. The whole or a part of the primary sequence information is required
to do the sequence database search. However, sequencing a macromolecule is expen-
sive and time consuming. A more crude but faster method, without sequencing an
unidentiﬁed macromolecule, has been developed. The method is based on restriction
patterns of an unknown isolate and restriction map databases. We use a three-phase
approach. In the ﬁrst phase, we obtain a restriction pattern of the unknown organism
while analytically deriving the corresponding restriction maps of the sequences in the
database. In the second phase, we identify a set of sequences, S, which have restric-
tion maps that are most similar to the unknown macromolecule’s restriction pattern.
Maximum Site Matching Problem (MSMP) is deﬁned and is proved to be in the class

of NP-complete problems. In the third phase, we use the set S to infer biological in-

formation about an unknown macromolecule. We demonstrate the usefulness of this
approach by applying it to the rRNA sequences of the Ribosomal Database Project

(app).

© Copyright 1996 by Jin Kim
All Rights Reserved

To my family

ACKNOWLEDGMENTS

I wish to express my gratitude and appreciation to my thesis advisor Dr. Sakti
Pramanik for his consistent guidance and encouragement right from the beginning,
and his ﬁnancial support. I am grateful for many discussions and invaluable comments
he provided.

I would like to thank my guidance committee members, Dr. Moon Jung Chung,
Dr. Jon Sticklen, and Dr. Zachary Burton, for their help and guidance. Also I would
like to appreciate Dr. Eric Torng, Dr. Jame R. Cole their useful advise.

I must express my heartful thanks to my family Myongye Bang, Hyein, Hyejee and
my parents, Yong-Beam Kim and my mother, Yong-Sook Shin for their sincere pray
during my studying here. Also, I never forget endless love from my parent-in-law,

Hwan-moon Bang, J i-Young Kim.

vii

TABLE OF CONTENTS

LIST OF TABLES x
LIST OF FIGURES xi
1 Introduction 1
1.1 Basic Concepts ................................ 1
1.1.1 Sequences .................................. 1
1.1.2 Similarity of Macromolecules . . .' .................... 2
1.1.3 Alignment .................................. 3
1.1.4 Restriction Patterns and Restriction Maps ................ 3
1.2 Problem Deﬁnitions .............................. 4
1.2.1 Multiple Sequence Alignment ....................... 4
1.2.2 Restriction Map Database Searches .................... 6
1.3 Previous Studies ............................... 7
1.3.1 Multiple Sequence Alignment Based on Primary Similarity ....... 7
1.3.2 Multiple Sequence Alignment Based on Secondary Structures ..... 11
1.3.3 Restriction Map Database Search ..................... 12
1.4 Our Studies .................................. 12
1.4.1 Multiple Sequence Alignment ....................... 12
1.4.2 Restriction Map Database Searches .................... 13
2 Inferring Relatedness of Sequences based on Primary Similarity us-
ing Multiple Sequence Alignment 15
2.1 Introduction .................................. 17
2.2 Algorithm ................................... 20
2.2.1 Simulated Annealing ........................... 20
2.2.2 Cost Function ............................... 21
2.2.3 Ttansition Rule .............................. 24
2.2.4 Solution Set ................................ 25
2.2.5 Speedup Strategies in MSASA .................... .. . 28
2.2.6 The MSASA Algorithm .......................... 31
2.3 Implementation ................................ 32
2.4 Results ..................................... 34
2.4.1 Initial Alignment ............................. 34
2.4.2 Transition Rules .............................. 35
2.4.3 Cost Comparison ............................. 36
2.4.4 Time Comparison ............................. 38
2.5 Conclusion ................................... 44

viii

ix

3 Inferring Relatedness of Sequences based on Secondary Structure

using Multiple Sequence Alignment 45
3.1 Introduction .................................. 47
3.2 Algorithm ................................... 50
3.2.1 Determination of Score Function ..................... 50
3.2.2 Simulated Annealing to Optimize the Score Function .......... 55
3.2.3 The Complexities of RNASA ....................... 59
3.3 Results ..................................... 59
3.3.1 Alignment and Secondary Structure .................... 61
3.3.2 Identiﬁcation of Secondary Structures .................. 64
3.3.3 Initial Alignment .............................. 66
3.3.4 Window Sizes ............................... 66
3.3.5 Effect of Double Shufﬂe .......................... 68
3.3.6 Speed of Convergence ........................... 69
3.4 Conclusion ................................... 70
4 Inferring Relatedness of a Macromolecule to a Sequence Database

Without Sequencing 71
4.1 Introduction .................................. 73
4.2 Methods .................................... 76
4.2.1 Overview .................................. 76
4.2.2 Obtaining Restriction Patterns and Restriction Maps .......... 78
4.2.3 Computing Closeness ............................ 81
4.2.4 Inferring Information from Closeness ................... 86
4.3 Results and Discussion ............................ 92
4.3.1 Experimental Procedure .......................... 92
4.3.2 sim(q, C(q, D, k)) ............................. 96
4.3.3 level ((1, C (q, D, k)) ............................. 101
4.3.4 Ordered. vs. Unordered .......................... 105
4.3.5 Simulation of Random Database ..................... 106

5 Conclusions 110

2.1

4.1
4.2
4.3
4.4
4.5

LIST OF TABLES

Comparison of computation time and score in MSA and MSASA . . . . 37
Sequences by number of sites ........................ 94
Hit ratios for each 1:. 5% error bound. ................... 102
Hit ratios for different error bounds. .................... 103
Hit ratios for each k1 + kg. 5% error bound. ................ 104
Hit ratio of level(q, C(q, D, 2),) and level(C(q, D, 2),) .......... 106

2.1

2.2

2.3
2.4

2.5

2.6

2.7

2.8

2.9

LIST OF FIGURES

Examples of the move sets. (a) original alignment A1. (b) new align-
ment A2 after I nsertion(1,6, 2, right) to A1. (c) new alignment
A3 after Deletion(3, 5, 2, left) to A2. (d) new alignment A4 after
Swap(1, 7, 1, right) to A3. ........................

Annealing curve (Energy vs. Iteration). A is the starting point in the
traditional SA approach. B is the starting point obtained from the
fast heuristic approach. C is the minimal point. ............

The MSASA algorithm ............................

Example of the operation swap. (a) original alignment (b) alignment after
swap(2, 4, 3, right) operation. (*) shows the changed part (Al) of the
alignment and (+) shows the null columns. ..............

Alignments of rat mast cell proteinase II, human plasma kallikrein, bovine
chymotrypsin, bovine trypsin, pig elastase, and Streptomyces griseus
trypsin. A1 is the alignment generated from MSA and A2 is generated
from MSASA. The score of A1 is 9663 and the score of A2 is 9648. (*)
shows different alignments generated from MSA and MSASA.

Alignment of human plasma kallikrein, bovine chymotrypsin, bovine
trypsin, and pig elastase generated from MSA and MSASA. .....

Alignment of human plasma kallikrein, bovine chymotrypsin, bovine
trypsin, pig elastase, and Streptomyces griseus trypsin generated from
MSA. ...................................

Alignment of human plasma kallikrein, bovine chymotrypsin, bovine
trypsin, pig elastase, and Streptomyces griseus trypsin generated from
MSASA. The null columns (+) are not removed intentionally.

Alignment of rat mast cell proteinase 11, human plasma kallikrein, bovine
chymotrypsin, bovine trypsin, pig elastase, and Streptomyces griseus
trypsin generated from MSA. ......................

2.10 Alignment of rat mast cell proteinase 11, human plasma kallikrein, bovine

chymotrypsin, bovine trypsin, pig elastase, and Streptomyces gn'seus
trypsin generated from MSASA. The null columns (+) are not removed
intentionally. ........... ' ....................

2.11 Running time for 4 to 20 sequences (1 million iterations) .........

3.1
3.2
3.3

Example of an alignment and its sum matrix. ...............
The transition rule double shufﬂe ......................
The RNASA algorithm ............................

xi

24

29
31

32

37

38

39

40

41

43

52
56
60

3.4

3.5

3.6

3.7

3.8

4.1
4.2

4.3

xii

(a) Alignment of segments of 15 168 RNA sequences and possible sec-
ondary structures. This region corresponds to nucleotides 996 to 1044
in Ecoli 16S RNA. Left ( (, [, {) and right 0, ], }) parts of the align-
ment show the possible stem regions (A-A’ , B-B’ , C-C’). (b) Pos-
sible common secondary structure from the alignment (a). (c) Sec-
ondary structure of Ehr.bovis based on (b). ((1) Secondary structure
of Hir.baltic based on (b). The symbol (e) indictes a 6-0 base pair
and (I) indicates A-U and G-C base pairs. Note that the length of the
loop and base pair regions are variable for each sequence. ......

(3) Alignment of segments of ten 16$ RNA sequences and possible sec-
ondary structures. This region corresponds to nucleotides 133 to 229
in E.coli168 RNA. Left ( (, <, [, { ) and right ( ), >, ], } ) parts of
the alignment show the possible stem regions. (b) Real common sec-
ondary structure from the RDP. There are different possible secondary
structures in the Region B (in the Box). Note that the length of the
loop and base paired regions are variable for each sequence. .....

Alignment of segments of 16S RNA sequences. The names of the used 168
RNA sequences are from RDP. This region corresponds to nucleotides
61 to 106 in E. coli 16S RNA. .41 is the alignment generated by the
progressive pairwise algorithm. A; is the ﬁnal alignment generated by
initial alignment A1 using RNASA with window size=4. A3 is a longer
alignment generated by the progressive pairwise algorithm. A; is the
ﬁnal alignment generated by the initial alignment A3 using RNASA
with window size=4. Left (0 and right (]) parts of possible stem re-
gions (A, A’ , B, B’) are symmetric. Upper case characters indicate
possible secondary structure regions. ..................

Alignment of segments of 168 RNA sequences with different window sizes.
Each alignment was obtained after 300000 iterations. B; is the align-
ment with window size=1. 32 is the alignment with window size=2.
33 is the alignment with window size=3. A2 in Figure 3.6 is the align-
ment with window size=4. B4 is the alignment with window size=5.
A2 in Figure 3.6 gives the best match to a known structure. .....

(a) Annealing curve with single shuffle. (b) Annealing curve with double
shuffle. Initial temperature T,- = 10‘ and ﬁnal temperature T, = 1.
Unit of score is 10“" and Unit of iteration is 10, 000 iterations. . . . .

The overall process of solving the problem ..................
(a) Restriction map for a. (M,(s) = {100,500,400}. (b) Re-
striction pattern for b. (Pz (b) = {100, 200, 300, 400}. (c)
mazcommon(M, (a), P,(b)) = 2 with common sites c1, 02. ......
Pairwise sequence similarity. The pairwise sequence similarity was de-
termined for all pairwise combinations of 1575 sequences in the test
database. Only sequence regions considered in further analysis were
included. The similarity of the aligned pairs was determined using the
pre-aligned sequences supplied by RDP. ................

62

63

65

67

68
77

81

4.4

4.5

4.6

4.7

4.8

4.9

xiii

sim(q, C(q, D, k)) for k=0,1,2 with 5% error bound, These results are
shown by number of query Sites as mean and range, after discard-
ing the 5% highest and lowest values. Bar chart at bottom indicates
the number of queries with result sets with size greater than one. The
y axis of the ﬁrst diagram shows the primary similarity between query
and result set. The y axis shows the number of query sequences. A
site with the number of query sequences _<_ 10 is eliminated. .....
sim(q, C(q, D, 0)) with 10%, 15%, 20% error bounds. The 3; axis of ﬁrst
diagram shows the primary similarity between query and result set.
The 3; axis shows the number of query sequences. These results are
shown by number of query sites as mean and range, after discarding
the 5% highest and lowest values. Bar chart at the bottom of the ﬁgure
indicates the number of queries with result sets with size greater than
one. A site with the number of query sequences _<_ 10 is eliminated. .
sim(q, (C(q, D, k1 + 162))“; for k1 + k2=0 thru 4 with 5% error bound
and two enzymes. These results are shown by number of query sites
as mean and range after discarding the 5% highest and lowest values.
Bar chart at bottom indicates the number of queries with result sets
with size greater than one. A site with the number of query sequences
5 10 is eliminated. ............................
sim(q, C(q, D, 2)), and sim(q,C(q, D, 2)): with 5% error bound. The y
axis of the diagram shows the primary similarity between query and
result set. The curve ”8” is generated by merging s and the curve ”q”
is generated by merging q. .......................
Accumulation of sim(q, s) for random databases with edge lengths 98%,
96%, 94%, 92%. Bar charts for each diagram show the standard devi-
ations. y shows the total number of queries with output. a: shows the
primary similarity between query and matched database sequences.
Accumulation of sim(q, s) for random databases with uniform primary
similarities 99.8%, 99.7%, 99.6%, 99.4%, 99.2%. Bar charts for each
diagram show the standard deviations. .................

98

101

107

109

Chapter 1

Introduction

1.1 Basic Concepts

1.1.1 Sequences

A sequence is a linear array of elements that are symbols or letters of an alphabet. Let
K={a1,a2, - - -,a;} be an alphabet ofl symbols. Then 3 = blbz - - ~b,, is a sequence
where b,- E K. For DNA (RNA) sequences, the alphabet is K = {A,T(U), G, C} where
’A’ for adenine, ’T’ for thymine (or ’U’ for uracil), ’G’ for guanine, ’C’ for cytosine.
For protein sequences, the alphabet contains 20 symbols which represent 20 types of

amino acids.

To sequence a long fragment of DNA, it must be cut into small fragments, as
present techniques do not permit the sequencing of fragments of more than a few
hundred bases. At ﬁrst the DNA is cut into overlapping fragments. These are then se-
quenced individually, before being reassembled by searching for the overlapped edges,

1

to reconstruct the whole sequences.

1.1.2 Similarity of Macromolecules

When two or more sequences are compared, it is not so obvious how to assess the
similarities between them. This requires some sort of natural biological metric which
is a measure of the similarity. AS an alternative to similarity, one can score the
distance between two sequences in a particular alignment based on a metric. Many
different alignments can be generated from multiple sequences maps. We seek an
alignment with an optimal score from these alignments. This optimal alignment gives
us the ability to estimate the biological relatedness between sequences. There are two

issues to identify the similarity of sequences.

1. Score function: It should be clear that appropriate similarity scores can only be
calculated from a speciﬁc model of similarity. The user must either develop his
own, or more likely use a set which is already developed. The score function
should reﬂect the speciﬁc types of similarity between sequences. For example, if
we want to identify the primary similarity of sequences, the optimal alignment
based on the score function should give us the best picture of the primary

similarity.

2. Algorithm to optimize score function: There are several types of algorithms to
score alignments. Exhaustive methods guarantee an optimal alignment. Heuris-
tic methods are used to find, within reasonable time, good alignments that are

not necessarily optimal. Users should decide the algorithm to be applied for a

3

score function based on their application.

1.1.3 Alignment

The term alignment is used to refer to a particular arrangement of two or more se-
quences in sequence comparisons. It is a pattern matching process that ﬁnds a corre-
spondence among the elements of the sequences. Any two macromolecular sequences
in a homologous group are usually not identical due to substitutions, insertions or
deletions of the elements of the set of macromolecular sequences.
Example 2. Let sl=AATAG and 32=AATCAG be two DNA sequences. A possible

alignment of 31 and 32 is

sl=AAT-AG

32=AATCAG
where ’-’ represents an insertion. This alignment represents one possible event in

evolution.

1.1.4 Restriction Patterns and Restriction Maps

A restriction enzyme typically cleaves a fragment of DNA at speciﬁc subsequences.
A restriction map is a set of ordered integers which gives the fragment length along
the strand of the restriction sites at which a restriction enzyme cleaves the DNA.
Example 1. Let s=AGCCCGGCCAAGGCCAA be a DNA sequence. Enzyme Hoe
III identiﬁes the substring GGCC and cleaves in the middle of GGCC. If Hoe 111 is

applied to the sequence 8, s is cleaved to a set of fragments { AGCCCGG. CCAAGG.

 

4
CCAA }. Then the restriction map for s is {7, 6, 4}.

Restriction maps are usually constructed prior to sequencing of DNA and many
mapped DNAs have never been sequenced. One or more restriction enzymes are
applied to cleave the macromolecule into many fragments. At this time only the
lengths of the fragments can be identiﬁed. This set of unordered integers which gives
the fragment length is called restriction pattern. The lengths of these fragments can

be determined by separating them by electrophoresis on a polyacrylamide gel.

1.2 Problem Deﬁnitions

1.2.1 Multiple Sequence Alignment

The comparison of more than two sequences of number or letters is common in sev-
eral ﬁelds, such as molecular biology, speech recognition. Sequence comparison is
particularly important in molecular biology where it has been critical in the study
of evolution and in the analysis of protein structure / function relationships. Multiple
sequence alignment is used to ﬁnd an optimal alignment based on certain criteria.
There may be several different sets of optimality. These are deﬁnitions for the mul-
tiple sequence alignment.

Deﬁnitions

A gap is the symbol ”-”.

Analphabet K=01,02,"',0118388t0fl '

symbols containing gaps. it can be nucleic acid or amino acid sequences.

5

An element is a member of the alphabet K.
A sequences = blbz - - b, is a strings of alphabet if b,- E K for i = l, 2, - - - , n.

A multiple sequence alignment 3 = {31, 32, - - - , sh} is a set of sequences of K.

sl=b},b;,..-,b,l,
32=b¥,b§,...,bi
31 =bfrbgr'nrb:

where bf 6 K for j = 1,2,---,n and i = l,2,---,k. An alignment of the sequences
31, - - - , s. is another set of sequences, SI, - - - , s: is obtained from s,- by inserting gaps in
positions where some of the other sequences have a non-blank character. There exist
many different possible alignments. The multiple sequence alignment problem is to
ﬁnd out the optimal alignment based on a certain score function for the optimality of
an alignment. However, different authors may have different sets of score functions for
the optimality of an alignment. Sometimes authors have to generate their own score
functions for their own purposes. To identify the primary similarity of the sequences,
the score function to reﬂect the primary similarity among sequences should be used.
To identify common secondary structures of the sequences, the score function to
reﬂect the common secondary structures among sequences is used.

Authors have been focused in solving the problem by comparing the N sequences

simultaneously by dynamic programming using an N dimensional matrix. However,

6

the complexity of the problem restricts this approach to N g 6 and n,- = 200 ~ 300.
Hence various heuristic (possibly sub-optimal) approaches are explored to solve the

problem.

1.2.2 Restriction Map Database Searches

Sequence database search is a standard tool in the identiﬁcation and characterization
of new organisms. However, there are many situations in which sequencing of new
organisms is not practical. One simple method to substitute sequence database search
is to use restriction patterns and restriction map databases. Maximum number of
common restriction sites between a new organism and database sequences is calculated
and used as a measure of similarity. To compare restriction patterns and restriction
maps in the restriction map database, a problem called Maximum Site Matching
Problem (MSMP) is deﬁned in this thesis. MSMP is the problem to ﬁnd maximum
common sites between a restriction pattern and a restriction map. In the MSMP, we
have as data the restriction pattern from a new organism a a restriction map from a

sequence b in a sequence database when a same enzyme is used, say,
Restriction pattern A = {a,- : 1 S i 5 n} from the digest of sequence a
Restriction map B = {b,- : l S i S m} from the nucleotide sequence b
A will be an unordered set, B will be an ordered set. In general A,B will be

multi-sets; that is, there may be values of fragment lengths that occur more than

once. Also,

2 -= Z ,-=L (1.1)

7

Given the above data the problem is to ﬁnd orderings for the restriction pattern A

such that the number of common sites implied by this ordering is maximum.

1.3 Previous Studies

1.3.1 Multiple Sequence Alignment Based on Primary Sim-
ilarity

Existing methods of multiple protein sequence alignment based on primary similar-
ity have time complexity varying from 0(Nn2) to 0(nN where N is the number of
sequences and n is the length of the sequences. Most of the methods attempt to ﬁnd,
within a reasonable time, good alignments that are not necessarily optimal. Some
methods guarantee optimal alignment in accordance with some pre-speciﬁed criteria.
The three major multiple sequence alignment methods may be summarized as follows.

The main issue in these methods is how to optimize a score function.

Clustering

This approach either constructs an approximate phylogenetic tree to guide the align-
ment process or arranges the sequences into some sequential order of alignment. Sev-
eral methods [12, 36, 34] adopt the same hierarchical clustering procedure but ues a
different score measure for tree construction. The methods of Feng and Doolittle [20]
and Taylor [78, 79] use less rigorous clustering procedures for tree construction. The

methods of Barton and Sternberg [7] and Martinez [52] arrange the sequences into a

8

certain order based on that the sequences are aligned one by one.

Dynamic Programming

This approach refers to the simultaneous comparison of N sequences using an N -
dimensional dynamic programming matrix. For example, the algorithm of Needleman
and Wunch [59] had been extended directly to the comparison of three sequences using
a three-dimensional matrix [43]. However, Murata et al. [58] showed that the time
complexity of this direct extension would be 0(n5) for three sequences of length n
but they were able to reduce it to 0(n3) by using two three-dimensional matrices. A
diﬂ'erent dynamic programming algorithm of 0(n3) running time for three sequences
was also presented by Fredman [26]. Both algorithms use a gap weight (penalty) which
is independent of gap length. However, this drawback is removed by Gotoh [28] whose
algorithm runs in 0(n3) time and uses a linear gap weighting function.

The algorithm of Murata et al., [58] and the algorithm of Gotoh [28] explicitly
specify the weight of simultaneous comparison of three residues. Such a criterion can
be extended to the evaluation of an alignment of N sequences, i,e. the cost of aligning
the N sequences is taken as the sum of the cost of aligning the N (N — 1) /2 pairs of
sequences. This measure of cost is called the SP measure (Sum of Pairs measure) [49].

The algorithm [22] for two sequences looks for the Optimal path within a strip of
the two-dimensional matrix as deﬁned by an upper bound of the cost of the optimal
path. Based on the SP measure, Carrillo and Lipman [10] have proposed a strategy
for the alignment of N sequences which is similar to that of Fickett’s algorithm. Given

the upper bound of the cost of a multiple sequence alignment, Carrillo and Lipman

9

showed that the upper bound of the alignment cost of each pair of the sequences can
be obtained. The upper-bound of the alignment cost of two sequences, say :I: and
y, deﬁnes a region on the two-dimensional plane, say (2:, y), of the two sequences.
Within this region lies the projection of the optimal path of the N sequences onto
(2:, y). The regions on all two-dimensional planes in turn deﬁne the region in which
the Optimal path in the N -dimensional space would be located. Therefore, the search
of the optimal path of N sequences is limited in a certain region of the N -dimensional

matrix and the search time is cut down.

The algorithm of Lipman et al., [49] adopts the aforementioned strategy of Carrillo
and Lipman. it uses a heuristic procedure to obtain an initial alignment based on
which the upper bounds of the pairwise alignment costs can be set. Alternatively,
the users may specify any set of bounds. The algorithm has two speciﬁc features.
One is the use of quasi-natural gap costs as proposed by Altschul [l] and the other
is the option of either a weighted SP measure or an unweighted SP measure. In the
weighted SP measure, different weights are assigned to the pairwise alignment costs
following the methods proposed by Altschul et al. [1]. The purpose is to discount the
dominance of a set of very similar sequences in the multiple sequence alignment. The
algorithm of Lipman et al. [49] can align six to eight sequences of 200-300 residues

which renders it superior to the foregoing three-sequence methods.

The biggest obstacle to using dynamic programming for multiple sequence align-
ment is its computational requirements of the method. Given these difficulties, heuris-

tics and modiﬁed cost function are applied to dynamic programming.

10
Simulated Annealing

SA is a good heuristic approach to solve combinatorial optimization problems. SA
is a version of a successful statistical model of thermodynamic processes for growing

crystals that has been transformed into computational terms.

A perfectly homogeneous crystal lattice represents a conﬁguration of solid state
material at a global minimum of energy. The solid state material is heated to a high
temperature until it reaches an amorphous liquid state. Then, it is cooled very slowly
and according to a specific schedule of decreasing the temperature. If the cooling is
perfect, then the atoms will arrange themselves in a pattern that closely resembles
the global energy minimum of the perfect crystal. Thermodynamics teaches that the
thermal equilibrium at temperature T is a probability distribution in which a state

with energy E is attained with the Boltzmann probability

e-‘l‘.
In a theoretical model of what occurs inside the material during the annealing process,
states are continually perturbed by the introduction of small random changes of the
positions of atoms in the matter. If the result is a state of lower energy, then the
state perturbation is unconditionally accepted. If not, then the state perturbation

may still be accepted with a probability of eAE/T. This probability decreases with

the temperature.
Several authors have suggested simulated annealing as an alternative approach to
overcome the limitations of dynamic programming. Lukashin et al. [50] applied SA to

human intro sequences with entrapy as a cost function. Ishikawa et al. [38] applied

11

SA to align protein sequences the same cost function as that used in Gotoh [28].

1.3.2 Multiple Sequence Alignment Based on Secondary

Structures

It is well-known that secondary structures of RNA are assumed to play an important
physiological role. The main issues in multiple RNA sequence alignment is how to
generate a score function. Currently, there is no clear global score function for multiple
RNA sequence alignment. Most of methods predict local secondary structures and

sometimes there is no way to to know the optimality of the alignment obtained.

Manual Method

This approach needs an alignment based on primary similarity. Then, base changes
between the compared sequences are noted. And the places where compensating
base changes maintain Watson-Crick complementarity between two potential pairing
regions is taken as evidence for the existence of a true stem at that position. This

method is done by hand [11, 32, 54, 63].

Context ﬂee Grammar Method

Stochastic context-free grammars are applied to the problems of multiple alignment
of RNA families [18, 70, 73]. This approach is related to use of Hidden Markov Models
(HMMs) to model E. coli DNA. It incorporates elements of both the thermodynamic
and phylogenetic approaches, with emphasis on the latter. The method of Sakakibara

et. al [70] requires initial knowledge of secondary structures. In contrast, Eddy and

12

Durbin [18] derive the structure of the grammar directly from unaligned sequences
and estimate the probability parameters of the resulting grammar using Expectation

Maximization (EM).

1.3.3 Restriction Map Database Search

Some work has been done on constructing restriction maps from restriction pat-
terns [64, 24, 47, 17, 8, 30, 89], but this work typically assumes that there are over-
lapping fragments.

Several authors [15, 19, 37, 60, 76] studied the restriction site matching probability
with given primary similarity of the two sequences. Also simple fragments pattern
matching method is used for identiﬁcation of query sequences. However, this simple
pattern matching does not work in some cases. No rigorous works for restriction map

searches using restriction patterns have been done.

1.4 Our Studies

1.4.1 Multiple Sequence Alignment

In this thesis, we propose two algorithms, MSASA for protein sequences and RNASA
for RNA Sequences, based on simulated annealing. In chapter 2, multiple sequence
alignment for protein sequences is studied. In MSASA, two cost functions, natural
gap costs and quasi-natural gap costs, for protein sequence alignment are discussed as

well as the relationships between two methods and two cost functions. We analyzed

13

the solution sets of multiple sequence alignments and suggest a technique to reduce
converging time by conﬁning the solution sets. Another speedup strategy is suggested
that involves using fast heuristic algorithm as the high temperature phase. 'II'ansi-
tion rules to get a new alignment for simulated annealing are generalized, pointin-
gout that any new alignments can be obtained by applying those transition rules.
Speedup strategies related to length of the alignment and high temperature phase are
discussed. We implement MSASA and Show SA can overcome the limitation of dy-
namic programming by comparing the output alignment from MSASA and dynamic
programming

In chapter 3, multiple sequence alignment for RNA sequences is studied. We
proposed an algorithm RN ASA to align multiple RNA sequences to identify possi-
ble secondary structures. Dot matrix generated from intro-sequence comparison is
expanded to ﬁnd alignments with maximally overlapped potential secondary struc-
tures. When the bases are compared, a certain probability of assigning a hit. This
probability is used to make a score function. We use a different transition rule called
a double shuﬂ‘le in RNASA, which can generate faster convergence to optimal. We

show the usefulness of RNASA with experimental results.

1.4.2 Restriction Map Database Searches

In chapter 4, we study the problem of obtaining biological information about a macro-
molecule isolate using only the restriction patterns of unknown macromolecules and

restriction map databases. We propose a three step approach to solve this problem.

14

In the second step We formulate a maximum site matching problem , show this prob-
lem is in the class of NP-complete problems, and suggest a heuristic approach to
attack this problem. We demonstrate that this three step approach can be used to

infer the identiﬁcation of an unknown macromolecule.

Chapter 2

Inferring Relatedness of Sequences
based on Primary Similarity using

Multiple Sequence Alignment

Multiple sequence alignment is a useful technique for studying molecular evolution
and analyzing structure-sequence relationships. Dynamic programming of multiple
sequence alignment has been widely used to ﬁnd an optimal alignment. However,
dynamic programming does not allow for certain types of gap costs, and it limits the
number of sequences that can be aligned due to its high computational complexity.
The focus of this paper is to use simulated annealing as the basis for developing an efﬁ-
cient multiple sequence alignment. An algorithm called Multiple Sequence Alignment
using Simulated Annealing (MSASA). The computational complexity of MSASA is
signiﬁcantly reduced by replacing the high temperature phase of the annealing pro-

15

16

cess by a fast heuristic algorithm. This heuristic algorithm facilitates in minimizing
the solution set of the low temperature phase of the annealing process. Compared
to the dynamic programming approach, MSASA can (a) use natural gap costs which

can generate better solution (b) align more sequences (c) take less computation time.

17
2.1 Introduction

Sequence alignment methods are useful tools to obtain regions of sequence similar-
ities of particular interest. Pairwise alignment has been widely used in sequence
alignment, but multiple alignment reveals more information than pairwise alignment.
After Needleman and Wunch [59] introduced dynamic programming into pairwise
alignment, efforts were made to apply dynamic programming to multiple Sequence

alignment.

Multiple sequence alignment methods can be divided into two different types
of algorithms; heuristic algorithms and exhaustive algorithms. Heuristic algo-
rithms [7, 6, 12, 21, 20, 29, 33, 34, 36, 42, 52, 71, 77, 83] try to ﬁnd out good but
not necessarily optimal alignments within a reasonable time. Most of these heuristic
algorithms construct a phylogenetic tree for the alignment of the sequences or assign
the sequences to a particular order. The sequences are aligned one by one related to

the order.

The exhaustive approach [16, 23, 26, 28, 35, 58, 72, 74] based on dynamic program-
ming tries to compare sequences simultaneously. This approach refers to the simul-
taneous comparison of N sequences using an N -dimensional dynamic programming
matrix [86]. For example, the algorithm of three sequences using a three-dimensional

matrix [43].

The algorithm of Pickett [22] for two sequences looks for the Optimal path within

a strip of the two-dimensional matrix as deﬁned by an upper bound of the cost of the

18

optimal path., Carrillo and Lipman [10] have proposed a strategy for the alignment Of
N sequences which is similar to that Of Fickett’s algorithm. Given the upper bound
of the cost of a multiple sequence alignment, Carrillo and Lipman showed that the
upper bound Of the alignment cost of each pair of the sequences can be obtained. The
upper bound of the alignment cost Of two sequences, say a and b, defines a region on
the two—dimensional plane, say (a, b), Of the two sequences. Within this region lies
the projection Of the optimal path of the N sequences onto (a, b). The regions on
all two-dimensional planes in turn deﬁne the region in which the optimal path in the
N -dimensional space would be located. Therefore, the search of the Optimal path Of
N sequences is limited in a certain region of the N -dimensional matrix and the search
time is cut down. This approach guarantees Optimal alignment. Although variations
of dynamic programming have been widely used to derive optimal alignments, there

are certain limitations.

One important problem in multiple sequence alignment is to deﬁne substitution
costs and gap costs. In pairwise alignment, researchers deﬁne substitution costs and
gap costs and try to minimize or maximize the total cost. Gap costs and substitution
costs in multiple sequence alignment should be deﬁned by using the same rationale
as used in pairwise alignment. Altschul [1] analyzed several types Of gap cost and
substitution cost for multiple alignments. He pointed out that previously deﬁned
gap costs in a multiple alignment were not clearly tied to their substitution costs.
He suggested a natural gap cost which was clearly related to its substitution cost.

Lipman et al. [49] implemented the Multiple Sequence Alignment (MSA) program

19

to align more than three sequences using dynamic programming. In MSA, quasi-
natural gap costs were used instead Of natural-gap costs because natural gap costs
for dynamic programming require impractically long computation time [1] . Due to
the type of gap costs used, MSA cannot guarantee producing an optimal multiple

alignment in some special cases.

Another problem in expanding dynamic programming to multiple sequence align-
ment is its high computational complexity. In pairwise alignment, the computational
complexity is 0(m-n) where m, n are the lengths of the sequences. But when dynamic
programming is used for multiple sequence alignment, its computational complexity
becomes proportional to the product of the lengths of the sequences to be aligned.
Therefore the exponential growth in computational complexity makes dynamic pro-
gramming impractical for aligning more than three sequences [27, 58]. Lipman [49]
has applied dynamic programming to MSA by reducing the solution space using a
heuristic algorithm. By conﬁning the solution space, the MSA program can align four

to six sequences of length 200-300 residues using rigorous bounds.

Several authors [38, 50] have suggested simulated annealing (SA) as an alternative
approach to overcome the limitations of dynamic programming. SA is a good heuristic
method to solve combinatorial optimization problems [46]. Ishikawa et al. [38] applied
SA to align protein sequences with the same cost function as that used in Gotoh [28].
To reduce the long computation time, they utilized a parallel computer for faster
convergence to Optimal solution, and discussed temperature parallel algorithm which

does not require any temperature scheduling. Lukashin et al. [50] applied SA to

20

human intron sequences with entropy as a cost function.

In this study, we developed a method for multiple sequence alignment called Mul-
tiple Sequence Alignment using Simulated Annealing (MSASA). Simulated annealing
is a good heuristic method to solve combinatorial Optimization problems [46]. Be-
cause traditional simulated annealing requires long computation time, several speedup
strategies have been incorporated into MSASA. The results generated from MSASA
were compared to those from MSA. By applying these speedup strategies we show
that MSASA can overcome the problems of high computational complexity and the

inability to use natural gap costs in MSA.

2.2 Algorithm

2.2.1 Simulated Annealing

Simulated annealing (SA), introduced by Kirkpatrick [46], is a probabilistic ap-
proach that can be used to ﬁnd a global minimum of a function in combinatorial
Optimization problems. To apply this algorithm to an optimization problem, a state
space X ={z1,---,:r,,} and a cost function 0 : X -> R, where R is the set of real
number, should be deﬁned. A real value 0 (X ) should be assigned tO each state m.
The goal of the Optimization problem is to ﬁnd the Optimal state 2:9,, whose score is
min(ma:r) {2,- | 1 5 i 5 n}. Simulated annealing continuously generates a new state

2,... from a current state same,“ by applying transition rules and acceptance rules

21

proposed by Metropolis (1953). The criteria Of the acceptance rules are:

1. If AB 5 0, accept a new state raw.

2. If AE > 0, accept a new state mm,” with probability P(AE) = e‘AE/T where

T is a temperature and AE = C(znm) — C (swam) is a cost difference.

Probability P(AE) prevents ﬁxation at local minimum. A state 2:me is called local
minimum if there is no state mm in X that is generated from the state Emma, by

applying the single transition rule and that has a lower cost than that Of the scam"...

Temperature T controls a probability to accept a new state 2m. Initially, T
starts from a high temperature and after every iteration, T decreases to become zero
by applying an annealing schedule. The probability of accepting a new state with a
higher cost than that Of the current state also decreases as temperature T decreases.
If a careful annealing schedule and number Of iterations are given, SA converges to
a global minimum state mm. The main disadvantage of SA is its requirement for a
large amount of computation time. Because SA is based on Monte-Carlo methods
which allow for a new state with a higher cost than that of a current state. To reduce

this computation time, speedup strategies are used in MSASA.

2.2.2 Cost Function

To be used in sequence alignment, a cost function should be explicitly deﬁned as

a measure Of overall alignment quality. Altschul (1989) classiﬁed several global cost

22

functions for multiple sequence alignment. These cost functions were composed of

substitution costs and gap costs.

Substitution Costs

Substitution costs are the costs for aligning n number Of sequences which include costs
for aligning letters which represent amino acids with nulls. The algorithm Of Murata
et al. [58] and the algorithm of Gotoh [28] explicitly Speciﬂy that the weight of the
simultaneous comparison of three residues is the sum Of the weights Of the pairwise
comparisons of the three residues. Such criterion can be extended to the evaluation
Of an alignment Of N sequences, i.e. the cost Of aligning of N sequences is taken as
the sum of the costs Of aligning the N (N — 1) / 2 pairs Of sequences. This measure of
cost is called the SP mwsure (Sum of Pairs measure) [49]. Substitution costs used in
this study were sum Of pairs (SP) substitution costs [5, 28, 58]. In MSA and MSASA,

SP substitution costs were used.

Gap Costs

Gaps are maximal strings Of consecutive nulls (’-’) in one sequence aligned with
letters in the other sequences. Gap costs are assessed separately from null costs. Null

costs are the substitution costs for aligning individual letters with nulls.

The alignment Of two sequences, 3,- and sj, derived from a multiple sequence align-
ment 0 is referred as the projection Of 0: onto 3,- and s,- [1, 3]. If the number Of gaps

in a is deﬁned as the total number of gaps in all the projections of alpha and each

23

gap is assigned as the total number of gaps in all the projections Of a and each gap
is assigned a constant cost, then the total gap cost of a will be called a natural gap
cost [1]. Natural gap costs are deﬁned based on the same rationale that is commonly
adOpted in deﬁning substitution costs. Altschul [1] points out that previously deﬁned
gap costs of multiple sequence alignments are proposed independently of the substitu-
tion costs. He argues that, since gap costs and substitution costs together determine
an Optimal alignment, they should be deﬁned based on the same rationale. He sug-
gested the natural gap cost as a gap cost which is clearly related to the substitution

cost.

But the amount Of information to keep in a cell with dynamic programming tO
apply natural gap costs increases 0([n/ln(2)]"\/h') where n is the number of se-
quences [1]. He deﬁned quasi-natural gap costs similarly to natural gap costs except
for some special cases that massively reduce the amount of information to keep in a

cell. Quasi-natural gap costs were used in MSA instead of natural gap costs.

One important advantage of MSASA over the dynamic programming approach
used in MSA is its ability tO apply any gap costs, including natural gap costs, in
multiple sequence alignment. This is because, unlike dynamic programming MSASA
does not require space to keep the information. After applying the transition rule
to a current alignment, a completely new alignment can be Obtained and all the
information to be applied to any speciﬁc gap costs can be identiﬁed. In contrast,
any complete alignment can not be obtained in dynamic programming until all Of the

computations are ﬁnished.

2.2.3

24

Transition Rule

Several Operations can be applied to a current alignment to generate a new candidate

alignment. Basically, all the Operations are related to change the positions of the

nulls (’-’) in the sequences. The basic operations are as follows.

e Insertion (i, j, 1:, direction): This Operation inserts the lc number of consecutive

nulls from the left /right (direction) Of column j in the sequence i.

e Deletion (i, j, 1:, direction): This Operation deletes the 1: consecutive number of

nulls from the left / right (direction) Of column j where columns j -— a through

j + E (a, ,6 Z 1:) make a gap where there are consecutive nulls in a sequence.

0 Shuffle (i, j, 1:, direction): This Operation shufﬂes the left /right (direction) nulls

from the null column j (including null j) in the sequence i and its left / right

(direction) 1: consecutive characters.

Figure 2.1 shows examples of the move sets. By modifying these move sets, effective

A1

HKQIGGAHGSLA--
HKKIGGATGALG--
HK---IGGAHGSLA

A3

HKQIGG--AHGSLA
HKKIGGATGAL --
HK-IGGAHGSLA--

Figure 2.1: Examples of the move sets.
alignment A2 after I nsertion(1,6, 2, right) to A1.

A2

MKQIGG--AHGSLA
HKKIGGATGALG--
HK---IGGAHGSLA

A4

HKQIGGA--HGSLA
HKKIGGATGAL --
HK-IGGAHGSLA--

(a) original alignment A1.

(b) new

(c) new alignment A3 after

Deletion(3, 5, 2, left) to A2. (d) new alignment A4 after Swap(1, 7, 1, right) to A3.

25

move sets for a different type Of multiple sequence alignment problem can be Ob-
tained. Also move sets can be applied to the different sequences simultaneously. The
parameters i, j and direction in the move set rules may be randomly determined.
But It may be determined by a certain distribution function, for example uniform
distribution or inverse function related to the size of It. Only experiment can tell

which is the best distribution function for It.

Only one basic operation swap is used as a transition rule in MSASA. This Oper-

ation generated a new alignment from the current alignment.

2.2.4 Solution Set

Deﬁnition 1 An alphabet is a ﬁnite set of characters and null(’-’).

A sequence is a ﬁnite string of characters.

A pseudo-sequence is a ﬁnite string of characters and nulls.

A pseudo-alignment of the sequence a1, a2, - - ~ , an is a set of padded pseudo-sequences
dual), - - - , a; of equal length and the removal of the nulls from a; generates a,-.

A null column is a column whose elements are all nulls.

A character column is a column such that at least one of its elements is a character.
An alignment is a pseudo-alignment whose columns are only character columns.

Let 1., be a length of a sequence a,-. The range Of the length l of an alignment of

the sequences a1, a2, - - - , an can be calculated by the deﬁnition of an alignment. The

26

range l is

2...... = manta“) g z s 1...... = in“) (2.1)

i=1

In this paper the length of the pseudo-alignment Of the sequences a1,a2, - - - ,a,, is
conﬁned by the above equation.

Deﬁnition 2 Let S; be the set of all alignments with the same set of sequences, and
each alignment in the S; has its own cost, and the length of each alignment is I. Then
the multiple sequence alignment problem is deﬁned as ﬁnding the alignment with the

smallest cost in the solution set S,:::(= U‘minslslmcc 5.).

In the SA process, the new pseudo-alignment generated from the current pseudo-
alignment, using the transition rule, may have null columns. These null columns do
not aﬁect the cost of a pseudo-alignment. Figure 2.1 shows how the null columns can
be generated during the SA process. These null columns cannot simply be deleted
during the SA process. The removal of the null columns during the SA process re-
duces the length Of the pseudo-alignment and it decreases the size of the solution set.
Therefore optimal solution may be removed from the solution set. The null columns
can be removed only after the SA process.

Lemma 1 Let P; be a set of pseudo-alignments with the same set of sequences such
that the length of any alignment in H is I. Let p; be a member of the set R. Any
pseudo-alignment p[ in H can be generated from a pseudo-alignment p; by applying a
ﬁnite number of swap operations.

Proof: The swap Operation does not decrease or increase the number of nulls. It only

changes the positions of the nulls in the alignment. Any pseudo-alignment p[ in the

27

H has same ﬁnite length l and same ﬁnite number of nulls in each sequence. Only
the positions Of the nulls in each sequence are different from each other. The nulls
in p; can be moved to the positions of the nulls in p[ within a ﬁnite number Of swap
Operations. So the pseudo-alignment p; can be transformed to any pseudo-alignment
p[ in the set P; by applying the swap Operations. 0

Lemma 2 The set of alignment Sim with same set of sequences can be generated
from any alignment p; by applying a ﬁnite number of swap operations.

Proof: From the lemma 1, H can be generated from any p. by applying the swap Op-
erations. The set of pseudo-alignments which have no null columns in P; is equivalent
to $1. And the set Of pseudo-alignments which have I: null columns in P; is equivalent
to 8;... The maximum of k is l — lm,,,. Therefore eliminating the null columns from
the pseudo-alignments in P; generates Sim. . CI

Theorem Let two pseudo-alignments with same set of sequences be p;, and p1, where
II > la. The alignment set, 52".”, obtained from the alignment p;, by applying the
swap operations is the superset of the set, 52...: obtained from the alignment pi, by
applying limited number of swap operations.

Proof: From the Lemma 2, the set 3;, and 8;, can be generated from the alignment

pg, and p;, by applying the swap Operations. The set 5,2.“ can be formed

Slim = Slim U SE,“ (2'2)

Therefore the set 3a,. is the superset of the set 3‘:.’....- C]

In MSASA, the solution set 5:33 is generated from an initial pseudo-alignment

28

pgm by applying the swap operations. The theorem shows that the longer initial
alignment generates the bigger solution set by applying the swap operations. This

bigger solution set increases the probability to ﬁnd an Optimal solution.

2.2.5 Speedup Strategies in MSASA

Heuristic Algorithm as The First Phase

Simulated annealing is composed of roughly two phases : a high-temperature phase
and a low-temperature phase. In the high temperature phase, SA gives a high proba-
bility to all the states with higher costs than that Of a current state. This allows any
state in the solution set to be a current state. At a lower temperature phase, SA gives
a high probability to states with a lower or not much higher temperature than that Of
a current state. This allows only the states near a current state to be a current state.
The high-temperature phase is similar to a random search, and the low-temperature
phase is similar to a greedy local search. Several authors [31, 69, 68, 67] suggest good
heuristic algorithm as a ﬁrst phase and a simulated annealing approach as a second

phase for ﬁne Optimization to the standard-cell-placement algorithm.

In MSASA, the same approach is used. The output alignment generated from
the same heuristic algorithm used in MSA is used as the ﬁrst phase. This heuristic
algorithm is similar to progressive pairwise alignment used in the studies of Water-
man and Perlwitz [85], Feng and Doolittle [20]. MSASA can eliminate the high-

temperature phase and reduce annealing time by using the output alignment gener-

29

ated from the heuristic algorithm. Figure 2.2 shows the annealing curve and different

starting points. SA time P can be saved when the system starts from point B which

 

Energy

A 1

 

 

 

 

|«-—u[ Iteration

P

Figure 2.2: Annealing curve (Energy vs. Iteration). A is the starting point in the
traditional SA approach. B is the starting point Obtained from the fast heuristic
approach. C is the minimal point.

is obtained from the fast heuristic approach instead Of point A. It is clear that SA
time can be reduced if point A is closer to the optimal point C. When the alignment
is Obtained from the heuristic approach, the initial temperature should be lower than

the initial temperature when traditional SA is applied.

Conﬁned Solution Set

To apply dynamic programming for aligning n sequences, a ﬁxed amount Of compu-
tation is required for each cell of the n-dimensional lattice. The total computational
time in dynamic programming is proportional to the product of a and ﬂ, a-ﬁ, where a
is the ﬁxed number Of computations for each cell and ﬂ is the size Of the n-dimensional
lattice. These two factors, a and ﬂ, limit the usage of dynamic programming in mul-

tiple sequence alignment. Natural gap costs make the factor a too large for aligning

30

more than ﬁve sequences. The factor ﬂ becomes too large for aligning more than

three sequences with average protein length (about 200-300 residues).

Fickett [22] suggested a way to reduce the solution space in pairwise alignment. He
searched the Optimal path within a diagonal band Of the two—dimensional matrix as
deﬁned by an upper bound of the cost of the optimal path. Carrillo and Lipman [10])
expanded the idea to reduce the solution space for aligning n sequences. They cal-
culated the upper bounds of the alignment cost of each pair of the sequences and
conﬁned the solution between each pair of the sequences using those upper bounds.
Therefore, they could reduce the computation time by applying dynamic program-

ming in only the limited solution space in an n-dimensional lattice.

In MSASA, the solution set is conﬁned to 3:33 . Therefore, if the length of the
optimal solution is larger than l,-,,,-., the Optimal solution can not be found in the
solution set. It is clear that an Optimal solution can be found if the solution set
5,3: is used. This solution set can be obtained from a pseudo-alignment p1,", by
applying the swap Operations. Due to its long SA time, it is not practical to use a
pseudo-alignment p1,.“ as an initial alignment. Instead of a pseudo-alignment p1,“,
the alignment generated by the progressive pairwise algorithm is used as an initial
alignment. The solution set 8:3; can be expanded with longer initial alignment. Once
the ﬁnal alignment which has null columns found in a given solution set is achieved,
increasing the solution set further rarely leads to improvement. Such an alignment is

Optimal for the given solution set.

31

 

begin ,
current.pseudo.alignment (— Output generated from the heuristic algorithm;
T ‘- Tinitial;

Eamm (— C (current.pseudo_alignment) where C is a cost function;
f inal.pseudo.alignment (— current.pseudo_alignment;
while (T > T1,“)
:1: (— random(column(1..l)) ;
for each sequence i in the current.pseudo.alignment
if column :I: is a null, then apply the swap operation;
if column :1: is a character, than do nothing;
end for;
Em (— C (new.pseudo.alignment);
if ((Em < Emma.) then
current_pseudo_alignment (— new.pseudo.alignment;
Ecurrcnt *— Enew;
if (Em < Emu) then
Ernin +— Enew;
f inal.,pseudo.alignment (— current.pseudo.alignment;
end if;
end if;
T(-'y-T,0<'y< 1;
end while;
Remove null columns from the f inal.pseudo.alignment;
Return f inalhlignment and minimal cost Ema, as an optimal alignment;
end

 

 

 

Figure 2.3: The MSASA algorithm

2.2.6 The MSASA Algorithm

The simplest data structure for sequences is an array. An alignment is a two
dimensional array Of symbols in which each row is a one-to-one collinear representation
of a molecular sequence. When a swap Operation is applied to sequences, only the
positions of the nulls are changed in an alignment. MSASA does not require any

complicated data structures. The algorithm for MSASA is presented in Figure 2.3.

The space complexity of MSASA is 0(n - l), where l is the length of the sequence

32

and n is the number of sequences. In each iteration the time to apply swap Operations
to each sequence is 0(n - Al) where n is the number of the sequences and Al is the

changed part of the alignment when swap operations are applied.

++
tittttt

HKQIGGA--MGSLA- HKQIGGA--HGSLA-

HKK---IGGATGALG HKKIGGA---TGALG

(a) (b)

Figure 2.4: Example of the Operation swap. (a) original alignment (b) alignment
after swap(2, 4, 3, right) operation. (*) shows the changed part (Al) of the alignment
and (+) shows the null columns.

Figure 2.4 (b) showed the changed part (Al) in the alignment. Only the difference
of the current costs and new costs of the columns which are affected by the changed
part is used to compute new costs. Al could be referred to as a constant because it is
not changed by the number Of sequences or length of the sequences to be aligned. The
time to calculate the cost of the changed part takes (n - (n — l) - Al). Therefore, the
time complexity of MSASA is 0(n2 - k) where k is the number Of iterations. The time
complexity Of MSASA does not depend on the length of the sequences in alignment

but to the number of sequences n and the number of iterations k.

2.3 Implementation

The program in this paper was implemented and tested on a Sun SPARC2 running
Solaris 2.2 which is a UNIX Operating system. The MSASA algorithm was coded in
ANSI C (SPARC compiler C 2.0.1). MSASA and MSA were tested and compared

in the same environment. MSASA was implemented to produce multiple alignments

33

and was compared to MSA. The experiments were performed on three serine protease

families - chymotrypsin, trypsin and elastase.

Natural gap costs are used as gap costs in MSASA whereas quasi-natural gap costs
are used in MSA. The cost of one gap was 8. In MSASA, the PAM-250 matrix [14],
derived from a study of amino acid replacements in homologous proteins, was used for
substitution costs. The modiﬁed substitution costs (17 minus the values in the PAM-
250 matrix) were used in both algorithms. The SP substitution costs in MSA have
two options, weighted SP substitution costs and unweighted substitution costs. In
the weighted SP substitution costs, weights are applied to the pairwise alignments in
order to reduce the effect Of the dominance of a set of similar sequences in the multiple
sequence alignment. In MSASA, both Options could be applied. For easy comparison

of the two algorithms, only unweighted SP substitution costs were considered.

The alignment from the heuristic procedure of MSA was used as an initial align-
ment in MSASA. Only the number of nulls Of each sequence in the initial alignment
are allowed to generate a new alignment. In the swap Operation, the maximum value
Of the parameter I: was initially set at 10. The initial temperature T; was decided by
previous experience. At the ﬁnal temperature, the probability to accept a new state

with a higher cost is

e-AE/TI = e (2.3)

34

where 0 < e < 1. This equation is simpliﬁed to

T, = —-AE/log(e) (2.4)

The minimum cost change, —AE, resulting from a swap operation is l. Empirically,

c“ is set to the total number of iterations 1:. Therefore the ﬁnal temperature becomes

T, = l/log(k) (2.5)

The schedule implemented in MSASA is T, = Tra" where "y is a constant for reducing

the temperature. 7 can easily be calculated from T}, T, and k

'7 = (TI/T1)“ (2.6)

2.4 Results

2.4.1 Initial Alignment

A good initial alignment, whose length and cost are as similar as possible to those
of an Optimal alignment, is crucial to MSASA. In the experiment, the length of the
initial alignment generated from the heuristics was longer than that Of the Optimal
alignment. When the ﬁnal pseudo-alignment did not have null columns, the l,-,,,-, was

increased and the SA process was applied again. One way of increasing the length Of

35

an initial alignment is to use a lower gap cost than the gap cost to be applied. If the
same ﬁnal output alignment is obtained, the alignment is the Optimal solution within
the solution set 5:3“: . A lower gap cost (3) than the gap cost (8) to be applied in
the SA process is used to get a longer initial alignment when aligning the sequences
in Figure 2.5. The length Of the initial alignment is 277 with gap cost (8) and is 281
with gap cost (8). The longer initial alignment is used to show this method. In the
Figure 2.8 thru 2.10, the null columns are not removed from the alignments to Show

the positions of the ﬁnal alignments and the lengths of the initial alignments.

Too lower gap costs generates an initial alignment whose length is longer than
that of the Optimal alignment. MSASA with this longer initial alignment requires
longer SA time. Because the solution set becomes bigger and the cost Of the initial
alignment becomes higher than those Of the initial alignment Obtained by the gap

cost to be applied.

2.4.2 'D'ansition Rules

The Operation swap in MSASA is used to change the position Of the nulls in the
alignment. This Operation does not change the number Of nulls in the alignment in
the SA process. The null columns in the alignment gave the effect that the operation
could reduce the nulls. Therefore any alignment whose length is between l,,,,-,, and

l,-,,,-¢ could be generated.

The Operation insertion which could insert the nulls and the operation deletion

 

lC

36

which could delete the nulls in the alignment were implemented and tested. These
Operations were possible to change the number of nulls in the alignment. But these
Operations required too much SA time (order of hours) so that it was not practical

to use these operations in MSASA.

2.4.3 Cost Comparison

Quasi-natural gap costs prevent MSA from generating Optimal alignment. When a
series Of nulls with left and right letters completely imposed on the series of nulls in
other sequences, quasi-natural costs count one more gap than natural gap costs do
(Altschul, 1989). These additional counts prevent generating Optimal alignment from
MSA. There are no additional counts in MSASA because the natural gap costs were

used in MSASA.

MSA and MSASA generate same alignment in some cases. When quasi-natural
gap costs are used in both algorithms, both generate the same alignments. And even
when natural gap costs are used in MSASA, if there are no completely imposed nulls

in the Optimal alignment, both algorithms generate the same alignment.

A comparison of the output alignments using the same sequences generated from

MSASA and MSA is presented in Figure 2.5.

The alignment Al was generated from MSA and the alignment A2 was generated
from MSASA. The score (9645) of the alignment A2 from MSASA is lower than

that (9668) Of the alignment A2 from MSA. The marked (*) parts show the difference

37
A1

eeeeeeeeee (9663)
IIGGVESIPHSRPYHAHLDIVTEKGLRVICGGFLISRQFVLTAAHC
IVGGTNSSHGEHPWQVSLQVKLTAQR-HLCGGSLIGHQHVLTAAHC
IVNGEEAVPGSHPHQVSLQ--DKTGF-HFCGGSLINENHVVTAAHC
IVGGYTCGANTVPYQVSLN----SGY-HFCGGSLINSQHVVSAAHC
VVGGTEAQRNSHPSQISLQYRSGSSHAHTCGGTLIRQNHVHTAAHC
VVGGTRAAQGEFPFHVRLS -------- HGCGGALYAQDIVLTAAHC

A2

sseeeeeess (9648)
IIGGVESIPHSRPYHAHLDIVTEKGLRVICGGFLISRQFVLTAAHC
IVGGTNSSHGEHPHQVSLQVKLTAQ-RHLCGGSLIGHQWVLTAAHC
IVNGEEAVPGSUPHQVSLQDKTG---FHFCGGSLINENHVVTAAHC
IVGGYTCGANTVPYQVSLNSG ----- YHFCGGSLINSQHVVSAAHC
VVGGTEAQRNSHPSQISLQYRSGSSHAHTCGGTLIRQNUVHTAAHC
VVGGTRAAQGEFPFHVRLSMG -------- CGGALYAQDIVLTAAHC

Figure 2.5: Alignments of rat mast cell proteinase II, human plasma kallikrein, bovine
chymotrypsin, bovine trypsin, pig elastase, and Streptomyces griseus trypsin. A1 is
the alignment generated from MSA and A2 is generated from MSASA. The score of
A1 is 9663 and the score of A2 is 9648. (*) shows different alignments generated from
MSA and MSASA.

 

 

 

 

 

sequences MSA MSA SA
score time score time iteration (million)
4 21244 20 sec 21244 5 min 40 sec 1.3
5 35853 48 min 54 sec 35845 10 min 26 sec 2.0
6 54054 3 hour 18 min 37 sec 54050 16 min 40 sec 2.0

 

 

 

 

 

 

 

 

Table 2.1: Comparison of computation time and score in MSA and MSASA

between MSASA and MSA. The difference is due to the different gap costs in MSASA

and MSA.

The results of the alignments of the same set Of four to six sequences generated
from MSA and MSASA are shown in Figure 2.7 through Figure 2.10, and the related

scores and rlmning times are in Table 2.1.

The alignment Of four sequences (Figure 2.5) does not have the regions creating

additional counts for MSA. MSASA and MSA generated the same output in this

{Floor Ell.

 

38

IhlSllSJLtumdihdSAL

IVGGTNSSWGEHPWQVSLQVKLTAQR-HLCGGSLIGHQWVLTAAHCFDGLPLQDVHRIYSGILNLSDITKDTPFSQIKEI
IVNGEEAVPGSHPHQVSLQDK--TGF-HFCGGSLINENHVVTAAHCGVT-TSD---VVVAGEFDQGSSSEKIQKLKIAKV
IVGGYTCGANTVPYQVSLN----SGY-HFCGGSLINSQHVVSAAHCYKS-GIQ----VRLGEDNINVVEGNEQFISASKS
VVGGTEAQRNSWPSQISLQYRSGSSHAHTCGGTLIRQNHVMTAAHCVDR-ELT--FRVVVGEHNLNQNNGTEQYVGVQKI

IIHQNYKVSEGNH--DIALIKLQAPLNYTEFQKPICLPSKGDTSTIYTNCWVTGHGFSK-EKGEIQNILQKVNIPLVTNE
FKNSKYNSLTINN--DITLLKISTAASFSQTVSAVCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNT
IVHPSYNSNTLNN--DIHLIKLKSAASLNSRVASISLPTSCA--SAGTQCLISGWGNTKSSGTSYPDVLKCLKAPILSDS
VVHPYHNTDDVAAGYDIALLRLAQSVTLNSYVQLGVLPRAGTILANNSPCYITGHGLTR-TNGQLAQTLQQAYLPTVDYA
ECQKR-YQDYKITQRHVCAGYKEGGKDACKGDSGGPLVCKHNGHHRLVGITSWCE--GCARREQPGVYTKVAEYHDHILE
NCKK--YHGTKIKDAHICAG--ASGVSSCHGDSGGPLVCKKNGAHTLVGIVSVCS--STCSTSTPGVYARVTALVNHVQQ
SCKSA-YPG-QITSNHFCAGYLEGGKDSCQGDSGGPVVC--SG--KLQGIVSWGS--GCAQKNKPGVYTKVCNYVSVIKQ
ICSSSSYWGSTVKNSHVCAG-GNGVRSGCQGDSGGPLHCLVNGQYAVHGVTSFVSRLGCNVTRKPTVFTRVSAYISUINN
KTOSS

TLAAN

TIASN
VIASN

Figure 2.6: Alignment Of human plasma kallikrein, bovine chymotrypsin, bovine
trypsin, and pig elastase generated from MSA and MSASA.

case. Because this alignment does not have completely imposed nulls. But in the
cases Of ﬁve sequences and six sequences (Figure 2.6 through Figure 2.10), MSASA

can generate the alignment with a lower cost than that of MSA.

2.4.4 Time Comparison

If the lengths of the sequences are short, the number of sequences is small, and the
optimal alignment of the sequences does not have regions making additional counts,
MSA may generate an optimal alignment faster than MSASA. One Of these cases is

shown in Figure 2.6. In this case, MSA took a shorter running time than did MSASA.

But in the case of ﬁve and six sequences, MSASA required a smaller running time

39

hdSJl

IVGGTNSSUGEHPWQVSLQVKLT-AQRHLCGGSLIGHQHVLTAAHCFDGLPLQDVWRIYSGILNLSDITKDTPFSQIKEI
IVNGEEAVPGSWPWQVSLQDKTG---FHFCGGSLINENUVVTAAHCGVT----TSDVVVAGEFDQGSSSEKIQKLKIAKV
IVGGYTCGANTVPYQVSL--NSG---YHFCGGSLINSQHVVSAAHCYKS-----GIQVRLGEDNINVVEGNEQFISASKS
VVGGTEAQRNSHPSQISLQYRSGSSHAHTCGGTLIRQNUVHTAAHCVDR---ELTFRVVVGEHNLNQNNGTEQYVGVQKI
VVGGTRAAQGEFPFHVRL--SHG ------ CGGALYAQDIVLTAAHCVSG----SGNNTSITATGGVVDLQSAVKVRSTKV

IIHQNYKVSEG--NHDIALIKLQAPLNYTEFQKPICLPSKGDTSTIYTNCHVTGHGFSK-EKGEIQNILQKVNIPLVTNE
FKNSKYNSLTI--NNDITLLKISTAASFSQTVSAVCLPSASDDFAAGTTCVTTCWGLTRYTNANTPDRLQOASLPLLSNT
IVHPSYNSNTL--NNDIHLIKLKSAASLNSRVASISLPTSCASAG--TQCLISGHGNTKSSGTSYPDVLKCLKAPILSDS
VVHPYUNTDDVAAGYDIALLRLAQSVTLNSYVQLGVLPRAGTILANNSPCYITGWGLTR-TNGQLAQTLQQAYLPTVDYA
LQAPGYNGT----GKDHALIKLAQPINQPTLKIATTTAYNQGTFT ------ VAGHGANR-EGGSQQRYLLKANVPFVSDA

ECQKR-YQDYKITQRHVCAGYK-EGGKDACKGDSGGPLVCKHN-GHHRLVGITSUGE--GCARREQPGVYTKVAEYHDHI
NCKK--YVGTKIKDAHICAG---ASGVSSCHGDSGGPLVCKKN-GAHTLVGIVSHGS--STCSTSTPGVYARVTALVNHV
SCKSA-YPG-QITSNHFCAGYL-EGGKDSCQGDSGGPVVCSGK ----- LOGIVSHGS--GCAQKNKPGVYTKVCNYVSHI
ICSSSSYHGSTVKNSMVCAGG--NGVRSGCQGDSGGPLHCLVN-GQYAVHGVTSFVSRLGCNVTRKPTVFTRVSAYISUI
ACRSA-YGNELVANEEICAGYPDTGGVDTCOGDSGGPHFRKDNADEUIQVGIVSVGY--GCARPGYPGVYTEVSTFASAI

LEKTDSS
QQTI-MN
KQTIASN
NNVIASN
ASAARTL

Figure 2.7: Alignment of human plasma kallikrein, bovine chymotrypsin, bovine
trypsin, pig elastase, and Streptomyces griseus trypsin generated from MSA.

40

RJSJKSAL

+ ++
IVGGTNSSHGEHPHQVSLQVKLTAQR-HLCGGSLIGHQHVLTAAHCFDGL-PLQDVHRIYSGILNLSDITKDTPF--SQI
IVNGEEAVPGSVPHQVSLODKTGF---HFCGGSLINENHVVTAAHCGVT ----- TSDVVVAGEFDQGSSSEKIQX--LKI

IVGGYTCGANTVPYQVSLN--SGY---HFCGGSLINSQHVVSAAHCYKS ------ GIQVRLGEDNINVVEGNEOF--ISA
VVGGTEAQRNSHPSQISLQYRSGSSHAHTCGGTLIRQNHVMTAAHCVDR----ELTFRVVVGEHNLNQNNGTEQY--VGV
VVGGTRAAQGEFPFHVRLS--MG ------ CGGALYAQDIVLTAAHCVSG ----- SGNNTSITATGGVVDLQSAVK--VRS

KEIIIBQNYKVSEGNH--DIALIKLQAPLNYTEFQKPICLPSKGDTSTIYTNCUVTGHGFSK-EKGEIQNILQKVNIPLV
AKVFKNSKYNSLTINN--DITLLKISTAASFSQTVSAVCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLL
SKSIVHPSYNSNTLNN--DIHLIKLKSAASLNSRVASISLPTSCASAG--TOCLISGHGNTKSSGTSYPDVLKCLKAPIL
QKIVVHPYHNTDDVAAGYDIALLRLAQSVTLNSYVQLGVLPRAGTILANNSPCYITOUGLTR-TNGQLAQTLQQAYLPTV
TKVLQAPGYNGTG--K--DHALIKLAQPINQPTLKIATTTAYNQGTFT ------ VAGHGANR-EGGSQQRYLLKANVPFV

+

TNEECQ-KRYQDYKITDRHVCAGY-KEGGKDACKGDSGGPLVCKHNG-HHRLVGITSHGE---GCARREQPGVYTKVAEY
SNTNCK--KYHCTKIKDAHIGAG---ASGVSSCHEUSGGPLVCKKNG-AWTLVGIVSHGS---STCSTSTPGVYARVTAL
SDSSCK-SAYPG-QITSNHFCAGY-LEGGKDSCQGDSGGPVVCSGK ----- LQGIVSHGS---GCAQKNKPGVYTKVCNY
DYAICSSSSYHGSTVKNSHVCAG--GNGVRSGCQGDSGGPLHCLVNG-QYAVHGVTSFVS-RLGCNVTRKPTVFTRVSAY
SDAACR-SAYGNELVANEEICAGYPDTGGVDTCQGDSGGPHFRKDNADEHIQVGIVSHGY--~GCARPGYPGVYTEVSTF

HDHILEKTDSS
VNUVQQTLAAN
VSUIKQTIASN
ISHINNVIASN
ASAIASAARTL

Figure 2.8: Alignment of human plasma kallikrein, bovine chymotrypsin, bovine
trypsin, pig elastase, and Streptomyces griseus trypsin generated from MSASA. The
null columns (+) are not removed intentionally.

 

 

Brrrrovwv T.IFT....VL VHETSAA

 

41

MSA

IIGGVESIPHSRPYHAHLDIVTEKGLRVICGGFLISRQFVLTAAHCKGR-EIT----VILGAHDVRKRESTQQKIKVEKQ
IVGGTNSSWGEHPHQVSLQVKLTAQR-HLCGGSLIGHQHVLTAAHCFDGLPLQDVHRIYSGILNLSDITKDTPFSQIKEI
IVNGEEAVPGSHPHQVSLQ--DKTGF-HFCGGSLINENHVVTAAHCGVT-TSD---VVVAGEFDQGSSSEKIQKLKIAKV
IVGGYTCGANTVPYQVSLN----SGY-HFCGGSLINSQHVVSAAHCYKS-GIQ----VRLGEDNINVVEGNEQFISASKS
VVGGTEAQRNSUPSQISLQYRSGSSHAHTCGGTLIRQNHVHTAAHCVDR-ELT--FRVVVGEHNLNQNNGTEDYVGVQKI
VVGGTRAAQGEFPFHVRLS -------- HGCGGALYAQDIVLTAAHCVSG-SGN---NTSITATGGVVDLQSAVKVRSTKV

IIHESYNS--VPNLHDIHLLKLEKKVELTPAVNVVPLPSPSDFIHPGAHCUAAGHGKTG-VRDPT-SYTLREVELRIHDE
IIHQNYKV--SEGNHDIALIKLQAPLNYTEFQKPICLPSKGDTSTIYTNCHVTGHGFSK-EKGEI-QNILQKVNIPLVTN
FKNSKYNS--LTINNDITLLKISTAASFSQTVSAVCLPSASDDFAAGTTCVTTCHGLTRYTNANT-PDRLQQASLPLLSN
IVHPSYNS--NTLNNDIHLIKLKSAASLNSRVASISLPTSCA--SAGTQCLISGHGNTK-SSGTSYPDVLKCLKAPILSD
VVHPYHNTDDVAAGYDIALLRLAQSVTLNSYVQLGVLPRAGTILANNSPCYITGHGLTR-TNGQL-AQTLQQAYLPTVDY
LQAPGYNG----TGKDHALIKLAQP ------ INQPTLKIATTTAYNQGTFTVAGHGANR-EGGSQ-QRYLLKANVPFVSD

KACVD--YRYYEYKFQ~VCVGSP-TTLRAAFHGDSGGPLLCAGV-----AHGIVSYGH----PDAKPPAIFTRVSTYVPT
EECQKR-YQDYKITORHVCAGYK-EGGKDACKGDSGGPLVCKHN-GHHRLVGITSUGE--GCARREQPGVYTKVAEYHDH
TNCKK--YWGTKIKDAHIGAG---ASGVSSCHGDSGGPLVCKKN-GAHTLVGIVSHGS--STCSTSTPGVYARVTALVNH
SSCKSA-YPG-QITSNHFCAGYL-EGGKDSCQGDSGGPVVCSGK ----- LOGIVSUGS--GCAQKNKPGVYTKVCNYVSH
AICSSSSYHGSTVKNSHVCAGG--NGVRSGCQGDSGGPLHCLVN-GQYAVBGVTSFVSRLGCNVTRKPTVFTRVSAYISH
AACRSA-YGNELVANEEICAGYPDTGGVDTCQGDSGGPHFRKDNADEHIQVGIVSHGY--GCARPGYPGVYTEVSTFASA

INAVI--N
ILEKTDSS
VQQTLAAN
IKQTIASN
INNVIASN
IASAARTL

Figure 2.9: Alignment of rat mast cell proteinase 11, human plasma kallikrein, bovine
chymotrypsin, bovine trypsin, pig elastase, and Streptomyces griseus trypsin gener-
ated from MSA.

42

MSASA
+

IIGGVESIPHSRPYMAHLDIVTEKGLRVICGGFLISRQFVLTAAHCKGREI ----- TVILGAH-DVRKRESTQQKIKVEK
IVGGTNSSHGEHPHQVSLQVKLTAQ-RHLCGGSLIGHQUVLTAAHCFDGLPLQDVURIYSGIL-NLSDITKDTPFSQIKE
IVNGEEAVPGSHPHQVSLQDKTC---FHFCGGSLINENUVVTAAHCGVTTSD----VVVAGEF-DQGSSSEKIQKLKIAK
IVGGYTCGANTVPYQVSLNSG ----- YHFCGGSLINSQHVVSAAHCYKSGI ----- QVRLGED-NINVVEGNEQFISASK
VVGGTEAQRNSWPSQISLQYRSGSSHAHTCGGTLIRQNWVHTAAHCVDRELT---FRYVVGEH-NLNQNNGTEQYVGVQX
VVGGTRAAQGEFPFHVRLSHG -------- CGGALYAQDIVLTAAHCVSGSGN---NTSITATG-GVVDLQSAVK-VRSTK

+
QIIHESYNSVPNL--HDIHLLKLEKKVELTPAVNVVPLPSPSDFIHPGAHCUAAGHGKTGVRDPT--SYTLREVELRIHD
IIIHQNYKVSEGN--HDIALIKLQAPLNYTEFQKPICLPSKGDTSTIYTNCHVTCHGFSKEKGEI--QNILQKVNIPLVT
VFKNSKYNSLTIN--NDITLLKISTAASFSQTVSAVCLPSASDDFAAGTTCVTTCWGLTRYTNAN-TPDRLQQASLPLLS
SIVHPSYNSNTLN--NDIHLIKLKSAASLNSRVASISLPTSCA--SAGTQCLISGHGNTKSSGTS-YPDVLKCLKAPILS
IVVHPYUNTDDVAAGYDIALLRLAQSVTLNSYVQLGVLPRAGTILANNSPCYITGHGLTRTNGQL-~AQTLQQAYLPTVD
VLQAPGYNGTG----KDHALIKLAQP ------ INQPTLKIATTTAYNQGTFTVAGHGANREGGSQ--QRYLLKANVPFVS

EKACVD--YRYYEYKFQ-VCVGS-PTTLRAAFHGDSGGPLLCAG ----- VAHGIVSYGH----PDAKPPAIFTRVSTYVP
NEECQK-RYQDYKITDRHVCAGY-KEGGKDACKGDSGGPLVCKHNG-HHRLVGITSVGE--GCARREQPGVYTKVAEYHD
NTNCKK--YHGTKIKDAHICAG---ASGVSSCHGDSGGPLVCKKNG-AHTLVGIVSHGS--STCSTSTPGVYARVTALVN
DSSCKS-AYPG-QITSNHFCAGY-LEGGKDSCDGDSGGPVVCSG ----- KLQGIVSHGS--GCAQKNKPGVYTKVCNYVS
YAICSSSSYHGSTVKNSHVCAG--GNGVRSGCQGDSGGPLHCLVNG-QYAVHGVTSFVSRLGCNVTRKPTVFTRVSAYIS
DAACRS-AYGNELVANEEICAGYPDTGGVDTCQGDSGGPHFRKDNADEHIQVGIVSHCY--GCARPGYPGVYTEVSTFAS

TINAVI--N
UILEKTDSS
UVQQTLAAN
HIKQTIASN
HINNVIASN
AIASAARTL

Figure 2.10: Alignment of rat mast cell proteinase II, human plasma kallikrein, bovine
chymotrypsin, bovine trypsin, pig elastase, and Streptomyces griseus trypsin gener-
ated from MSASA. The null columns (+) are not removed intentionally.

43

80
70
60
50
T(min)10

 

 

 

 

0 l l l l J l l

4 6 8101214161820
Number of sequences

 

Figure 2.11: Running time for 4 to 20 sequences (1 million iterations)

than that of MSA. The running time to get an Optimal alignment in MSASA was
not clearly related to the number of sequences. In MSASA, usually 1 to 2 million
iterations which took 5 to 16 minutes were enough to get an Optimal alignment for

four to six sequences.

MSA took an impractically long time to align more than six sequences on a per-
sonal workstation. Therefore, it could not be directly compared to MSASA for more
than six sequences. In MSASA, 1 to 3 million iterations were enough to get a near-
optimal solution when aligning up to twenty sequences. Figure 2.1 1 illustrates the

computation time to run 1 million iterations for aligning several numbers of sequences.

It took about 17 minutes to run 1 million iterations for aligning 10 sequences and
about 76 minutes to run 1 million iterations for aligning 20 sequences. Therefore, the
running time to align more than six sequences in MSASA is more practical than that

in MSA. To reduce the computation time, well-conserved regions may be ﬁxed and

44

only the regions between the ﬁxed regions can be annealed. By using this divide-and-

conquer type approach, the computation time could be greatly reduced.

2.5 Conclusion

It is shown that MSASA based on simulated annealing is more powerful than MSA
based on dynamic programming. MSASA could apply natural gap costs, which can
generate a better alignment, while MSA could not. MSASA could generate alignments

faster and with more sequences than MSA.

Chapter 3

Inferring Relatedness of Sequences
based on Secondary Structure

using Multiple Sequence Alignment

Multiple sequence alignment has been a useful technique for identifying RNA sec-
ondary structures. In this paper, an algorithm for aligning multiple RNA sequences
to identify possible secondary structures is presented. In this algorithm, dot matri-
ces generated from intro-sequence comparisons are used to obtain possible common
secondary structures. A hit probability for dot matrices is calculated and a score
function based on this hit probability is deﬁned. Simulated annealing is applied to
Optimize the score. A solution set for a multiple sequence alignment is introduced
and how the number of nulls and length of an alignment affect the solution set is

analyzed. A method to reduce the computation time is applied to multiple sequence

45

46

alignment based on this solution set. Also several strategies to reduce the simulated
annealing time is suggested. Double shuﬂ‘le is used as a transition rule instead of
single shuffle. This move set generates faster convergence to Optimal. This algorithm

was used to ﬁnd possible common secondary structures in RNA sequences.

47
3.1 Introduction

Just as the evolutionary relationship in proteins is Often seen more in tertiary struc-
ture than primary sequence, RNA molecule relatedness is often seen in preserved
secondary structures. Much effort has been devoted to ﬁnding structural features
of RNA. Predictions of RNA secondary structure give useful information about the
mechanisms of gene expression, gene evolution and the functions of ribosomes. Sev-
eral approaches have been used for ﬁnding secondary structural features. Phylogenetic
analysis of homologous RNA sequences identiﬁes secondary structures which are con-
served during evolution [25, 40, 87]. Another approach is to apply thermodynamics to
compare the free energy of alternative structures [39, 62, 80, 90]. Context-Flee Gram-
mars have been applied to the problems of statistical modeling, multiple alignment,
discrimination and prediction of the secondary structures of RNA families [18, 70, 73].
However, RNA secondary structure prediction is not at a stage where perfect predic-
tion is possible. Zuker and Stiegler [90] pointed out that ’A program based solely
on conformational rules and thermodynamics will not yield a biologically meaningful
folding Of a molecule on its own... More and diﬁerent kinds of additional information
must be incorporated into the algorithm as well’. Users need to choose between sev-
eral methods, each with its own advantages and disadvantages, the one that best ﬁts

their particular problem.

Multiple sequence alignment has been a useful tool for identifying sites important
in enzyme activity and in gene regulation and in phylogenetic comparisons [72, 84].

Dot-matrix methods have been simple and useful tools for examining sequence simi-

48

larities. Several multiple sequence alignment methods [4, 81, 82] are dot-matrix based.
The dot matrix derived in these methods is from inter-sequence comparison based on
primary sequence similarity and additional properties Of molecular similarity. How-
ever, when RNA sequences are distantly related, sequences can not be aligned by just
primary sequence similarity. Secondary structure similarity needs to be considered.
A dot-matrix generated by intro-sequence comparison is one way to identify possible

RNA secondary structures in a single molecule [66].

In this paper, we have extended dot-matrix methods based on intra-sequence com-
parison to ﬁnd alignments with maximally overlapped potential secondary structures.
Individual dot matrices for each sequence are generated and overlapped. Correctly
aligned structures appear as diagonal regions with hits in most matrices. When the
bases are compared, there is a certain probability of assigning a hit even if the base
pairs do not exist as an actual secondary structure. This noise can hide real RNA
secondary structures. A sliding diagonal window is applied to reduce noise in the
dot matrix. The probability of Obtaining an observed number of hits in a cell in a
completely unaligned set of sequences can be calculated from the RNA base-pairing
relationships. The probability of ﬁnding an Observed pattern for a window in a com-
pletely unaligned set Of sequences can be obtained from the hit probability of the cells

in the window. This probability was used to design a score function.

Multiple RNA Sequence Alignment using Simulated Annealing (RNASA) is pro-
posed to optimize the score of RNA sequence alignments. Lukashin et al. [50] in-

troduced the simulated annealing (SA) technique to multiple sequence alignment for

49

nucleotide sequences. He discussed the cost function and several other important as-
pects of SA for multiple sequence alignment. Several authors [38, 45, 44] ) extended
SA to multiple sequence alignment Of amino acid sequence. The SA method was
introduced by Kirkpatrick et al. [46]. It is a probabilistic approach that can be used
to ﬁnd an Optimal state of a function in combinatorial Optimization problems. SA
starts from an initial state with high temperature. By applying the transition rules
and acceptance rules proposed by Metropolis [53], simulated annealing continuously

generates a new state from a current state. The criteria of the acceptance rules are:

1. If AC 5 0, accept a new state saw.

2. If AC > 0, accept a new state saw with probability P(AC) = e’AC/T where T

is a temperature and score difference AC = C(sm) - C(swﬁm).

Probability P(AC) prevents a system from ﬁxing at a local minimum. Temperature
T affects the probability of accepting a new state 3m. In the beginning Of the SA
process, T starts at a high temperature and T gradually decreases iteration by itera-
tion to become zero by applying an annealing schedule. The probability of accepting

a new state with a higher score also gradually decreases as temperature T goes down.

The SA process converges to a global minimum state when a careful annealing
schedule and number of iterations are given. The main disadvantage Of SA is its need
for a large amount Of computational time because SA is based on Monte-Carlo meth-
ods. To reduce this huge computational time, several speedup strategies including an

efﬁcient transition rule, are applied in RN ASA. Finally, experimental results Obtained

50
from RNASA are presented.

3.2 Algorithm

3.2.1 Determination Of Score Function

Base Pair

The following algorithm aligns conserved regions Of possible secondary structure.
It does not attempt to determine the lowest energy secondary structure, or chose
among conﬂicting secondary structure possibilities in a speciﬁc alignment. Instead it
is based on ﬁnding an alignment where the possible secondary structure motifs are
conserved between sequences. For the purposes Of this work, we deﬁne a base pair
possibility as two positions capable Of forming any of the canonical Watson-Crick base
pairs (A-U, C-G. G-C. or U-A) along with the common non-canonical (Hi and 11-6
base pairs. In addition to these common base pairs, RNA secondary structures also
contain less common base pairs (e.g. A-G , A-A etc.). Current secondary structure
prediction methods do not eﬁectively cover these rare base pairs. Their positions
are best established either from the X-ray crystallographic data or from analysis
Of compensatory changes in unambiguously aligned homologous sequences. In the
present alignment method, these rare base pairs are ignored. Since these base pairs

are rare by deﬁnition, this simpliﬁcation seems a reasonable compromise.

51
Dot Matrix

'Il'aditionally a dot matrix for a given RNA sequence is obtained by a square matrix
with a dot at the intersection of row i and j (1 S i < j g n) for the bases i and
j Of the- sequence. A dot matrix from intra-sequence comparison is used to depict
the common RNA secondary structures in the sequences to be aligned. This is a

rectangular array whose rows and columns are labeled with the bases of a sequence.

A dot matrix with I‘m-t x l,-,,,-¢ is used in RNASA. Length l,-,,,-, is the length of the
input alignment, obtained from a progressive pairwise alignment method based on
primary sequence similarity. We deﬁne a hit (1) if the base pair i and j in a RNA
sequence in an alignment is one of these base pairs (A-U, C-G , G-C , U-A, G-U, and

11-0); otherwise the pair is deﬁned as a miss (0).

Individual dot matrices from each aligned sequence in an alignment are generated
and the cells in each individual matrix are calculated. The total number Of hits, h,,

in an aligned sequence, 3, of an alignment is

h. = (IAI X IUI) + (IGI X IC'I) + (IGI X IUI) (3-1)

where |A|, [U], I0] and [C] are numbers of each bases in a sequence. The values of
each cell (i, j) in the individual matrices are summed and the sum of each cell (i, j)
is recorded in cell (i, j) of a matrix called the sum matrix. Clearly the value Of cell
(i, j) in the sum matrix cannot be larger than the number of sequences n. In the sum

matrix, the number of hits, m, in a cell represents the degree of potential common

52

secondary structure. The larger the number of hits, the higher the possibility Of

 

 

common secondary structure. Figure 3.1 shows an alignment and its sum matrix.
a u c o A o o A °
6 u c o A o c A c
G u c a A o e A c
a u c o A a o A c
1 {2731475 [sirloin
I I I I I I l I
G G G G .1-.0J.‘J£L0.LO.L0.IQJQL3.
I I I I I I I I
u u u u .2.-_.9..9.,£..‘!...4...4..£.9
c c c c .3.-_:-,'9',‘_:.°.:_‘_:.‘.:9',9
o o e e _‘.-.'-.'-I.2L°.'.°.'9.'9'.§
A A A A_5.-.i-i-i-i.°.i.°.l9.l9i9
e a e e.;.-J-J-l-L-L°J%ngg.
I I I I I I I I
a G a G’."'I"I'T'I'-I"'I"Ia1’6
A A A A----:-t'-+-:--:--:-i-+-
'CCC’IIIILIIIO

 

 

 

 

Figure 3.1: Example of an alignment and its sum matrix.

Average Hit Probability

The total number of hits in an aligned sequence is given by equation (1). These
hits can be regarded as distributed at random except in regions of real secondary
structure. The probability p, of a hit occurring by chance at any cell (i, j) in an

individual matrix of the aligned sequence 3 is approximately

 

O S p. = ’2 < l (3.2)

where l is the length Of the sequence 3. For the present study, we use the average

probability

p = (fps/n (3.3)

s=1

53

where n is the number Of sequences in an alignment as the base pair probability in
all the aligned sequences in an alignment. The value Of p is less than 1/2 for most

RNA sequences.

When individual matrices are overlapped to form a sum matrix, the number Of
hits at position (i, j) in the sum matrix is between 0 and the total number of bases,
11,-, in column i Of the alignment. This n,, may be less than the total number of
sequences, N , since some of the sequences may have nulls at position i. We can
then approximate the probability of randomly ﬁnding that m Of n,- bases at a speciﬁc

position, i, have base pair possibilities (hits) at some other position as:
. "i .
P(z, m) = -p"‘ - (1 - p)"""' (3-4)

The value of P(i, m) becomes smaller as m goes to n,- or m goes to 0. This means
the probability that a position will have no potential base pairs is relatively low, as

is the probability that a position will have all base pairs.

Sliding Windows on a Sum Matrix

Now the most basic secondary structure motif consists of one or more contiguous base
pairs with non-pairing bases at each end. The probability Of randomly ﬁnding such

a conserved motif in the alignment at positions i through i-i- w - 1 with w positions

54

at another location is simply:

=10

<I> = H P(a: + k — l,mk) (3.5)

k=1

where M = (m1, . . . , In"). The number of hits will be equal to n,- for absolutely
conserved base pairs, and equal to 0 for positions with no base pair possibility, such
as the ﬁrst non paring position deﬁning the end Of a stem region. For incompletely
conserved structures, the Observed values will be between 0 and n,- and thus the prob-
ability of Observing such an incompletely conserved motif will be higher. The method
we use to Optimize the alignment on secondary structure is to ﬁnd an alignment with

a high number Of low probability windows.

A window with size w is slid on the diagonals of the sum matrix for this purpose.
For example, ifthe diagonally adjacent cells (i,j), (i+1,j— 1), - - - , (i+w—1,j—w+ 1)
make a window, then the diagonally adjacent cells (i + 1, j - 1), (i + 2, j - 2), - - - , (i +

w, j — w) make the next window.

Determination of Score Function

To be used in sequence alignment, a score function should be explicitly deﬁned as
a measure Of overall alignment quality. A score function must include important
secondary structure information so that if the score function is mimmized / maximized,

the complete secondary structure in the alignment will be identiﬁed.

In RN ASA the sum of the reciprocal of the probability <I> for all possible windows

55

On a sum matrix is calculated as a scoring function for simulated annealing as follows.

1
C"...§...W ‘3‘”

The §<I>(:r, w, M) values for regions of conserved potential secondary structure are
so much larger than the expected (random) %<I>(:r, w, M) value that they dominate
such a sum. A good alignment will have relatively more of these conserved potential
secondary structures and will thus have a relatively large C value. The sum 0 is
used as a score function in RNASA. In RNASA, at each iteration, a new alignment
is generated and C is calculated to ﬁnd an alignment with maximum score. The goal
for RNASA is

mazimize(C) (3.7)

3.2.2 Simulated Annealing to Optimize the Score Function
Heuristic Algorithm as the High-temperature Phase

SA progresses from a high temperature phase that approximates a random search
to a lower temperature phase approximating a greedy local search. Several au-
thors [38, 45, 44] suggested a good heuristic algorithm could replace the high tem-
perature phase of simulated annealing and provide a substantial speed increase. In
the present algorithm, a heuristic primary sequence alignment method similar to the
progressive pairwise method [20, 77, 84] is used to provide a preliminary alignment

and to partially substitute for the high temperature phase. This preliminary align-

56

ment can also be used to manually select regions of long RNA sequences that are not

well aligned by primary sequence similarity.

Double Shufﬂe as a 'ITansition Rule

In a RNASA process, new alignments are created from an existing, possibly subopti-
mal, alignment by introducing random changes according to an Optimized transition

rule (Figure 3.2).

 

begin doubleshuﬂleﬁ)
for j = 1,2
:I: +— random(1 . . . l); randomly select a column
while column(:c) are null
scan left -or— right; (is a: (— a: — 1 or a: + 1)
wrap around if necessary;
return if no nulls are found in sequence i;
end while;
y (— random(:r+movm---z -movm);
shift right columns a: - 1 thru y
-Or- shift left columns a: + 1 thru y;
column(y) (— null;
end for;
end doubleshuﬂlc

 

 

 

Figure 3.2: The transition rule double shufﬂe

The rule used in the current algorithm allows changing one randomly chosen
sequence in the alignment with each iteration. A gap in the sequence is selected and
moved to a new position. This process is repeated on the chosen sequence a second
time. After both shuﬂles, the score function is recalculated. The reason for applying
two shuffles to a sequence in each iteration is to allow both sides Of a stem structure

to adjust in tandem. This eﬂ'ectively lowers the energy barrier between local points

57

and allows the system tO converge in fewer iterations. If the chosen sequence in the
alignment has no null characters, the iteration is skipped. Occasionally, the same gap
will be chosen for both shufﬂe Operations, producing the equivalent Of a single shufﬂe

move.

Solution Set

The SA process starts from an alignment with length I...“ produced by the heuristic
algorithm. The length l of new alignments produced from the current alignment does
not change during the SA process due to the properties Of the double shufﬂe transition

rule. Therefore RNASA only generates alignments with length l.-,.,-..

If I... is the length of sequence a,-, the range Of potential lengths of an alignment

of the sequences a1, a2, - - - , a,. can be calculated. The range I is conﬁned as below.

1...... = mama...) g l g 1...... = 20...) (3.8)

i=1
Let S; be the set of all alignments Of length l with same set of sequences. Each
alignment in the S; has its own score and the length Of each alignment is identically I.
Then the multiple sequence alignment problem can be deﬁned as ﬁnding the alignment

with an optimal score in the solution set St: (= Ujmmswm 5;).

There will be alignments which contain columns whose elements are all nulls (null
columns). If these null columns are removed from an alignment, the length,l', of the

alignment is reduced (l’ < l). Also, this alignment is a member of the solution set

58

Sy. Thus, the set S; is a superset of the set Sp where l’ < l (Kim et al., 1994).

The number of alignments in S; can be calculated. Let N be the number of
sequences. Let n,— be the number Of nulls placed into the sequence a,- to bring it up
to the alignment length, 1. That is n,- = l — la... The total number of alignments S in
S; is

3m = ° : - ° (3.9)
"1 n2 "N

PTom equations (9) the number of alignments ,8 increases rapidly with each of

n.,l, and N.

To ﬁnd the Optimal alignment, all the alignments in 51,“, should be investigated.
However, the sheer size of Si,” prohibits checking all possible alignments. Therefore,
if the length of the optimal alignment is larger than 1...“, the optimal alignment can
not be found in 51,”. In that case, the ﬁnal alignment from RNASA is near optimal.
Solution set can be expanded with longer initial alignment. Increasing the solution
set further rarely leads to improvement, so such an alignment is generally optimal for

the given solution set. A2 and A4 in Figure 3.7 are the examples.

The SA time can be reduced by this approach.

Schedule for Temperature.

The schedule implemented in RN ASA is T = T; - e‘ where e is a constant deﬁning

the rate Of annealing, i the iteration number, T1 the initial temperature, and T the

59

current temperature. The value Of 6 can easily be calculated from the total number

of iterations, k, the ﬁnal temperature,T,, and T1:

«2 = (Ti/Tl)“ (3.10)

3.2.3 The Complexities of RNASA

Figure 3.3 shows the block diagram of RNASA.

The sum matrix in RNASA is of size l2. Therefore the space complexity of RNASA
is 00"). In each iteration, the sum matrix must be updated in the region correspond-
ing to the columns modiﬁed by the double shuffle Operation. The values of all sliding
windows covering these columns must be recalculated. The number of these windows
is approximately 1 times the number of affected columns. In the experiments pre-
sented here, the maximum number of columns affected by each shufﬂe is limited to 10
(Figure 3.2). Since the number Of columns affected is essentially constant, the time
complexity per iteration is 0(l). It should be noted, however, that the number of

iterations necessary to assure convergence is dependent on several factors, including

l.

3.3 Results

RNASA was implemented and tested on a DEC alpha workstation 3000/ 400 with

32MB main memory size and DEC OSF/l V2.0 which is a UNIX Operating system.

60

 

 

begin RNASA
current.alignment (— Output generated by the heuristic algorithm;
/* Generate initial matrix and sum matrix */
For i=sequence 1 to n
For each nucleotide pair (j,k) in the sequence i
matrix(i, j, k) = 1 if pair(i,j) is in complementary set
matrix(i, j, k) = 0 if pair(i,j) is in not complementary set
summatrix(j, k) = sumJnatrix(j, k) + matrix(i, j, k)
end for;
end for;
/* Calculate the score of the sum matrix*/
Cinit = Z score(window)
/* Initialize parameters */
finalnlignment (— initial.alignmcnt;
Cmax ('- Ciniti C.curr'ent ‘- Cliniti T *- Tatar-ti
/* Annealing process */
while (T > Tm)
i (— random(sequence(1..n));
doubleshuﬂ‘leﬁ) in Figure 3.2;
Calculate individual matrix i;
Calculate sum matrix;
Cm (— C(sum matrix);
if Metropolice conditions are satisﬁed then
current.alignment (- new.alignment;
if (Cm... > Cm) then
Cm {— Cm;
f inalnalignment (— current.alignment;
end if;
end if;
T4—7-T,0<'7< 1;
end while;

Return ﬁnaLalignment, maximum score Cm and sum matrix ,-
end RNASA

 

Figure 3.3: The RN ASA algorithm

 

61
RNASA was written in programming language ANSI C (DEC OSF/ 1 C compiler).

The program will be available on request.

RNASA was implemented to produce multiple sequence alignments for RNA se-
quences. Experiments were performed on segments of bacterial 16S RNA sequences.
rRNA sequences consist of highly conserved regions separated by variable regions.
In these variable regions, secondary structure is often more conserved than primary
sequence similarity. Several such variable regions were chosen for testing RN ASA.
A window size of four was applied in all experiments except 8., Ba, Ba, and B4 in

Figure 3.7.

3.3.1 Alignment and Secondary Structure.

Figure 3.4 shows an alignment of segment Of 16S RNA sequences and their possible
secondary structures. Three stem regions (A-A’ , B-B’ , C-C’) could be identiﬁed.
This alignment is close to the hand alignment in the RDP [51]. From the ﬁnal sum
matrix, we could identify the common secondary structure (Fig 3.5 (b)) and could
overlay each aligned sequence onto this common secondary structures. Figure 3.4
(c) and (d) are the possible secondary structures of the sequence Ehr .bovis and
Hir .baltic in the alignment. These possible secondary structures are compared to
the proposed secondary structure Of Ag . tumef a7 [32] which is close to the aligned

sequences. These structures are matched well.

Figure 3.5 (a) shows an alignment Of a segment of ten 16S RNA sequences. The

Ehr.bovis
Sym.Mscfu
Sym.Trgr1
Sym.Trgr5'
Sym.Trgr6
Par.denitr
Ros.denitr
Hir.ba1tic
Erb.1ongus
R.sa1exige
Ps.diminut
Caw.FwC2
Afp.felis
Afp.c1eve

Figure 3.4: (a) Alignment Of segments of 15 163 RNA sequences and possible sec-
ondary structures. This region corresponds to nucleotides 996 to 1044 in E. coli 168
RNA. Left ( (, [, {) and right 0, ], }) parts of the alignment Show the possible
stem regions (A-A’ , B-B’ , C-C’). (b) Possible common secondary structure from
the alignment (a). (c) Secondary structure of Ehr.bovis based on (b). (d) Sec-
ondary structure of Hir.ba1tic based on (b). The symbol (0) indictes a G-U base
pair and (I) indicates A-U and G-C base pairs. Note that the length of the loop and

 

 

62

A B B’ C C’ A’
(((((((< [[[fiif 111111] {{{{ }}}}))))))))
aUGAAGAUUagaUCCUCCUuaacAGGAGGGchAGch-gGCUGGGUCUUgCA
aUGGAAAUUauaCCUAUUCgaagGGAUAGGguCGGUuc-gGCCGGGUUUCaCA
aUGGAAAUCauaCCUAUUCgaagGGAUAGGgu-CchucggCGCGAUUUUaCA
aUGGAAAUCauaCCUAUUCgaagGGAUAGGguCGGUuc-gGCCGGAUUUUaCA
aUGGAAAUCauaCCUAUUCgaagGGAUAGGguCGGch-gGCCGGAUUUUaCA
aUCCCAGG-acc-GGCCCGgagaCGGGUC-uuUCACuucgGUGA-CCUGGaGA
aUCCUGUG-cua-ACCCNAgagaUCGGGN-guUCUCgcaaGAGA-CGCAGuGA
aUCCUAUG-cu-NACUCCAgagaUGGAGUA--U-C-uucg-G-A-CAUAGuGA
aUCCUAGGAcgguUUCUG-gaga-CAGACuccUUCCcuchGGGaCCUAGuGA
aUCCCGGGAcgacUUCCA-gage-UGGAUuuuUUCAcuchGUGACCCGGnGA
aUGCCUGG-a-cNGCCACGgagaCGUGGCUu—UCCCuucgGGGA-CUAGGaCA
aUGCCUGG-a-cCGCCACAgagaUGUGGNNu—UCCCuucgGGGA-CCAGGaCA
aUGUCCAGGaccGGUCGCAgagaUGUGACC--UUCUcuchGAGCCUGGAgCA
aUGUCCAGGaccGGUCGCAgagaUGUGACC--UUCUcuchGAGCCUGGAgCA

(a)
U A
0‘ A
c c\ A C
CungG U
, c . C. ‘A
A A
51‘??in E. 09699999 :
Accyucuas a"? ohAUAc- A,U-
39 U [G 3' U . Ic-
c O” U G. U
G C Gizu
k) ' is

base pair regions are variable for each sequence.

63

A I C C' I’ D D’ I l’ A’
«(tum « mm 1111)] >> > '«<««« »»»»> «(it )in) 111111)}
(((fiifimlﬂflf 111111!'I’I'I'I'I'I'I'<«««« >»»»» ((((( 1))” 11111111

3-P1 will“
sec60001GGUUgGGgeIegCCCAUOgeeeClUGOGgueanaCCgAeugagCUCUAUGUAC~ugug-OUGUAUAGAG-gseaAGGGCInugGCCCU-GCCUUGAGen
3 1. bejece
azucuuooclﬂcgccgeuegcccA00gussetUGGGgIeeeeccgCeuaagCUlUlUUAUUCeueaaAGUO10161G-geaeGGCBCIuegOCOUCgIPCCULAGeg
Spi. splat:
uscCCCCUUGUUUGGgInagCUUGUOgeeaCAUAAGguaaaecCUAengegalﬂCUUUUAUCenaaBAUAAUAGAAggeeeGGCGCUouechCGUCGCAAGGGGen
Trp. Insole
uecCcUUUG6eUgGGgInagCCGGOAgeAAUACCGGguaanaCCgleueeggUCGGUl60cacggae-GCCACCGAgg-eeeGCGGCIIcgGCCGCg-CCBLAGGee
Trp. spat

useCCUUAGGeDgGGgauagCCACUIgeaaUAGUGGguaanaccgleugnggOUACCGGGC- ugug-GCCUIGUIAag-eeaﬂGAGCarunGCUCCgeCCUBAGGen
Spt. stones

uncCUUUAGGUUgGGgInagCCAUUAgseeULGOGGguaauaccgleuguggOUGCGGGGC- oung-GCCUUGUAAag-eeaﬁdlc-snag-CUCC-GCCDGAAQen
Trp plus-d
useCCUUAAGeUgGGgIne¢CUCCUAgalanineAGguaaueCCgAenec-gCUUAUACGOenaeagCCGUIUAAOga-eeeBAGGCIgsgGCCllg-Cuuclccen

183
LP. .P gﬂGgaaaaCLCABAgaAAUUUGUngaenICDglengugecccuuccucoguag-CAGGipUUGguneee-GCA-gees-06c-GCUUUUAGen
Lpn.1111:1
ugcCCACGGAUG‘66gen-eccuuucgeeaGGAAGGgueauaccgCeueecACCAUOGUUACsac3601100100605:see-GCA-gcearOOC-¢UCUGUGOA‘

8p1.eurent

ugcCUGeCOGeDgGGganagCUUCUGgeesCACAGGguaaneccglenge-ACUGACGAGACunngGUUUUGUUAGU-eeeGGUOCIIcgOCACCgRQCBeCLceg
(a) l .

 

 

 

 

3. 5
A3 .A
E' C:
E a. v III.
0' I)
(b)

Figure 3.5: (a) Alignment of segments Of ten 16S RNA sequences and possible sec-
ondary structures. This region corresponds to nucleotides 133 to 229 in E. coli 16S
RNA. Left ( (, <, [, { ) and right ( ), >, ], } ) parts of the alignment show the
possible stem regions. (b) Real common secondary structure from the RDP. There
are different possible secondary structures in the Region B (in the Box). Note that
the length of the loop and base paired regions are variable for each sequence.

64

possible secondary structure was compared to the proposed secondary structure of
Spi . aurant [32] which is closely related to the aligned sequences. ( second line in
the Figure 3.5 (a) ). The third line shows the possible secondary structure from the
alignment. A secondary structure diagram based on Spi . aurant is shown in Figure
3.5 (b). Fl'om the alignment and dot matrix, stem regions (A-A’ , C-C’ . D-D’ , E-E’
) were correctly aligned. However, in the region (B-B ’ ), there are different possible

stem regions with length=2.

3.3.2 Identiﬁcation of Secondary Structures

An RNASA generated alignment of one variable region from eight 16$ RNA sequences
is shown in Figure 3.6. This region corresponds to nucleotides 61 to 106 in E. coli 163

RNA [32, 61, 88, 87].

Alignment A1 in Figure 3.6 was generated by a progressive pairwise alignment
method and used as an initial alignment for RNASA. This method aligns the se-
quences by primary sequence similarity. Therefore when the sequences are distantly
related, an alignment by this method does not show the secondary structure simi-
larity of the sequences. A2 in Figure 3.6 shows the ﬁnal alignment obtained from
RNASA from initial alignment A1 . The primary similarity of the ﬁnal alignment
Obtained from RNASA was much worse, but the secondary structure similarity Of the
ﬁnal alignment was better than that of the input alignment. The ﬁnal sum matrix
from the output alignment could be used to identify the secondary structures by in-

SPGCtion. This possible secondary structure was compared to the one of My: . xanthu

A1
Organism
Dsb.postga
Dss.variab
Dsn.acetox
Dnn.tiedje
ny.xanthu
Cys.fuscus
Con.crocat
Nan.exeden

A2
Organism
Dsb.postga
Dss.variab
Dam.acetox
Dmn.tiedje
ny.xanthu
Cys.£uscus
Con.crocat
Nan.exeden
A3
Dsb.postga
Dss.variab
Dsn.acetox
Dnn.tiedje
ny.xanthu
Cys.1uscus
Con.crocat
Nan.exeden

A4

Dsb.postga
Dss.variab
Dsm.acetox
Dnn.tiedje
ny.xanthu
Cys.£uscus
Con.crocat
Nan.exeden

Figure 3.6: Alignment Of segments of 168 RNA sequences. The names Of the used
16S RNA sequences are from RDP. This region corresponds to nucleotides 61 to
106 in E. coli 16S RNA. A1 is the alignment generated by the progressive pairwise
algorithm. A; is the ﬁnal alignment generated by initial alignment A. using RN ASA
with window size=4. A3 is a longer alignment generated by the progressive pairwise
algorithm. A. is the ﬁnal alignment generated by the initial alignment A3 using
RNASA with window size=4. Left (D and right (]) parts of possible stem regions
(A, A’ , B . B ’) are symmetric. Upper case characters indicate possible secondary

65

GUCGaacgagaaaGGGAUUGcuugCAAUCCCgaguagagUGGC
GUCGUacgagaacGCUCUAGcuugCUAGAGCaaguaaaGUGGC
GUCGaacgagaaaGUU---UCchgGGAAAUgaguagagUGGC
GUCGUacgagaaaCAUAUCNuuchGGUAUGgaguaaaGUGGC

GUCGagcgcgaaUAGG ------ GGcaaCCCUUAguagagCGGC
GUCGagcgcgaauGGA -------- gcaaUCCuaguagagCGGC
GUCGUgcgagaaaGGG ------ CuuchCCchguaaaGCGGC
GUCGaachGCUAgca -------- aUAGUC ------- agUGGC
[[[[[ [[[[[[[ 111111] 111]]

GUCGaacgagaaaGGGAUUGcuugCAAUCCCgaguagagUGGC
GUCGUacgagaacGCUCUAGcuugCUAGAGCaaguaaaGUGGC
GUCG-aacgagaaaGUUUCch-gGGAAAUgaguagag-UGGC
GUCGUacgagaaaCAUAUCNuucgGGGUAUGgaguaaaGUGGC
GUCG-agcgcgaa-UAGGGGc-aaCCCUUA-g-uagag-CGGC
GUCG-agcgcgaau-GGA--gcaa--UCC-uaguagag-CGGC
GUCGUgcgagaaa--GGGC-uucg-GCCC--cgguaaaGCGGC
GUCG-a--ac--g-GGCUA-gcaa-UAGUC-a-----g-UGGC

GUCGaacgagaaa-G-GGA-UUGcu-ugc-AAUCCCgaguagagUGGc
GUCGUacgag-aacG-CUC-UAGcu-ugCUAGAGC-aaguaaaGUGGC
GUCGaacgagaaa-GUUUC-ngcG--GA-AAU---GaguagagUGGC
GUCGUacgagaaaCAUAUCNuucgG--GG-UAU-G-gaguaaaGUGGC

GUCGagcgcg-aa ------- UAGGG--Gc-aaCCCUUAguagagCGGC
GUCGagcgcg-aa ------- u-GGA-gc-aaUCC-uaguagagCGGC
GUCGUgcgag-aa ------- a-GGGCuuc-gGCCC-cgguaaaGCGGC
GUCGaacg-G-GC 3 A--gc-aaU----AGU-CagUGGC

 

[[[H [[[[[[E 111111] 111]]
GUCG-aacg-agaaaGGGAUUGcu--ugCAAUCCCgaguagag-UGGC
GUCGUa-cga-gaacGCUCUAch-~ugCUAGAGCaagu-aaaGUGGC
GUCG-aacg-agaaa-GUUUCch---gGGAAAUGaguag-ag-UGGC
GUCGUa-cga-gaaaCAUAUCNuu--chGGUAUGgarguaaaGUGGC
GUCG-agcg-cgaa--UAGGGGc---aaCCCUUA-gu--agag-CGGC
GUOG-agcg-cgaau---GGA-gca--a-UCC---uaguagag-CGGC
GUCGUg-c-gagaaa--GGGC-nuc-g-GCCC--c-gguaaaGCGGC
GUCG-aac---g-----GGCUAgc-a-aUAGUC -------- ag-UGGC

structure regions.

66

whose secondary structure is known already and matched well.

3.3.3 Initial Alignment

Alignment A3 in Figure 3.6 was Obtained by progressive pairwise alignment. A3
is longer than A; because a lower penalty for nulls was applied in the progressive
pairwise alignment. Alignments A4 is the ﬁnal alignment using RNASA with input
alignment A3. Alignment A. has more nulls than A2 and these additional nulls give
clearer resolution Of stem and loop regions. When an initial alignment is not long
enough, certain columns in a ﬁnal alignment may contain bases in loop regions and
stem regions. Thus, it is important to ensure that the length of the initial alignment is
appropriate, based on the output alignment. Ishikawa et al. [38] and Kim et al. [45, 44]

have shown the advantages of using an initial alignment Obtained by heuristics in SA.

3.3.4 Window Sizes

Figure 3.7 shows the alignments obtained with different window sizes.

The alignments Bl, Ba, Ba and B4 in Figure 3.7 and A2 in Figure 3.6 were all
Obtained from the same initial alignment (A1 in Figure 3.6) by applying different
window sizes. Clearer separation of paired and unpaired regions was Obtained at
window sizes Of 4 to 5. The alignment A2 is the best match to the known alignment.
By increasing the window size, more weight can be given to windows having diagonally

consecutive hits. Therefore the eﬁ'ect Of noise can be reduced.

67

Bl

GUCGaacgagaaaGGGAUUGcuugCAAUCCCgaguagagUGGC
GUCGUacgagaacGCUCUAGcuugCUAGAGCaaguaaaGUGGC
GUCGaacgagaaa-GUUUC-chgGGAA-AUgaguagagUGGC
GUCGUacgagaaaCAUAUCNuucgGGGUAUGgaguaaaGUGGC
GUCGagcgcgaa ----- UAGGG-GcaaCCCUUAguagagCGGC
GUCGagcgcgaa ----- uG--GAgcaaUCC-uaguagagCGGC
GUCGUgcgagaaaGGGCuu-c--g--GCC-chguaaaGCGGC
GUCGaacg-G-GC----U-A---gc-aaU---AGU-CagUGGC

B2

GUCGaacgagaaaGGGAUUGcuugCAAUCCCgaguagagUGGC
GUCGUacgagaacGCUCUAGcuugCUAGAGCaaguaaaGUGGC
GUCGaacgagaaaGUUUC--chg-GGAAAUgaguagagUGGC
GUCGUacgagaaaCAUAUCNuucgGGGUAUGgaguaaaGUGGC
GUCGagcgcga-aUAGGG-Gc--aaCCCUUA--guagagCGGC
GUCGagcgcgaau-GGA--gc---aa-UCCua-guagagCGGC
GUCGUgcgagaaa--GG--GCuuch-CC-chguaaaGCGGC
GUCGaacg-GGC---UA--gc--aa--UA----GUCa-gUGGC
BS

GUCGaacgagaaaGGGAUUGcungCAAUCCCgaguagagUGGC
GUCGUacgagaacGCUCUAGcuugCUAGAGCaaguaaaGUGGC
GUCGaacgagaaa-GUUUC-chgGGAAAU-gaguagagUGGC
GUCGUacgagaaaCAUAUCNuucgGGGUAUGgaguaaaGUGGC
GUCGagc-gcgaa-UAGGG-GcaaCCCUUA--gu-agagCGGC
GUCGagc-gcgaau-GGA--gc-aa-UCC-uagu-agagCGGC
GUCGUgcgagaaa--GGGC-uucg-GCCC-c-gguaaaGCGGC
GUCGaa---cg---GGCUA-g-caaUAGUC ------ a-gUGGC

B4

GUCGaacgagaaaGGGAUUGcungCAAUCCCgaguagagUGGC
GUCGUacgagaacGCUCUAGcuugCUAGAGCaaguaaaGUGGC
GUCC-aacgagaaaGUUUCU-ucgGGAAAUgaguagag-UGGC
GUCGUacgagaaaCAUAUCNuucgGGGUAUGgaguaaaGUGGC
GUCG-agcgcgaa-UAGGGG-caaCCCUUA--guagag-CGGC
GUCC-agcgcgaa-u-GGA-gcaa-UCCua--guagag-CGGC
GUCGUgcgagaaa--GGGC-uucg-GCCC--cgguaaaGCGGC
GUCG----aacg--GGCUA-gcaa-UAGUC-ag ------ 0660

Figure 3.7: Alignment Of segments of 168 RNA sequences with diﬁerent window
sizes. Each alignment was Obtained after 300000 iterations. BI is the alignment with
window size=1. B: is the alignment with window size=2. 83 is the alignment with
window size=3. A2 in Figure 3.6 is the alignment with window size=4. B4 is the
alignment with window size=5. A2 in Figure 3.6 gives the best match to a known

smructure.

68
3.3.5 Effect of Double Shufﬂe

To compare the efficiency Of the double shuffle and single shuﬂie, RNASA with double
shuﬂle and RN ASA with single shufﬂe were tested with initial alignment A1 in Figure

3.6. Figure 3.8 Shows the annealing curves with single Shuffle and double shufﬂe.

 

 

 

 

 

 

32; till
3.. T—S .................. -
3.6 r "
Score(x1015)3_8 _ -
4.0 - L T
4.2 A l 1 . l l ‘

 

 

 

0 510.15 20 25 30
Iteratron(x 10‘)

T l I I

3.2
3.4

3.6
Score (x 1015):”

 

    

l

4.0
4.2

-l

 

 

1 I

0 5 10 15 20 25 30
Iteration ( x 10“)

. Figure 3.8: (a) Annealing curve with single shufﬂe. (b) Annealing curve with double
shufﬂe. Initial temperature T.- = 10‘ and ﬁnal temperature T, = 1. Unit Of score is
1015 and Unit of iteration is 10, 000 iterations.

RNASA with single shuﬁie took longer to converge to the global maxima score
and had a tendency to stick to local maximum, even with a very large number Of

iterations.

69
3.3.6 Speed of Convergence

The converging time necessary to get satisfactory alignment in RNASA is generally

tied to the number of sequences, length of sequences and heuristic algorithm.

e The length of alignment. It is clear that the converging time for longer align-
ments is longer than that of shorter alignments from equation (9). For example,
it took 12 minutes ( 0.3 million iterations ) to get A, and 1 minute ( 10,000

iterations ) tO get the alignments in Figure 3.7.

It took less than 10 minutes ( .3 million iterations ) to align ﬁrst 10 sequences
in Figure 3.4 (a) (length-=53). However, it took 2hours ( 2 million iterations )

to align in Figure 3.5 (a) (length=107).

e The quality of initial alignment. The quality of initial alignment affects the
converging time. If the initial alignment is closer to the Optimal, it will require
less converging time. Figure 3.6 shows the importance of a good heuristic

algorithm.

e The number of sequences. It is clear that longer SA time is required for a larger
number Of sequences from equation (9). To align 14 sequences in Figure 3.4 (a),

it took approximately 20 minutes ( .5 million iterations ).

TO reduce the computational time, well-conserved regions may be ﬁxed and only the
regions between the ﬁxed regions annealed. Also, by eliminating well aligned regions

the length of long RNA sequences can be shortened and can then be aligned.

70
3.4 Conclusion

In this paper, we describe a method to align RNA sequences and identify possible
secondary structures using Simulated annealing. In our approach, sequences were ﬁrst
aligned based on primary sequence similarity and then realigned based on secondary
structure information. Dot matrices from intra-sequence comparison are generated
and overlapped to form a sum matrix. We built a clearly deﬁned score function for
this sum matrix based on the probability of ﬁnding the Observed pattern in completely
unaligned sequences. We were able to reduce SA time by reducing the search space
as well as using the double shuﬂle move set. We showed that RNASA can generate

alignments with clear secondary structure identiﬁcation.

The main advantage of RNASA is its ability to align conserved secondary struc-
tures in distantly related sequences. One potential disadvantage of this method is
that is is not based on ﬁnding the lowest energy secondary structure (via thermody-
namic energy rules). However, RNASA is able tO handle pseudoknotted structures
and conserved alternative foldings, both of which are not modeled well by most energy

based methods.

 

Chapter 4

Inferring Relatedness Of a
Macromolecule to a Sequence

Database Without Sequencing

Sequence database searches with unidentiﬁed sequence data have been a source of
useful information of identifying the biological information Of a macromolecule iso-
late. TO do the sequence database searches, the whole or part Of the primary sequence
information is required. However, in some situations, sequencing a macromolecule is
not practical. We need more crude but faster methods without sequencing an uniden-
tiﬁed macromolecule to identify the relatedness between database sequences and an
unidentiﬁed macromolecule. TO achieve this goal, we study the problem of Obtain-
ing biological information about a macromolecule isolate using only the restriction

pattern Of that isolate Obtained from digestion with enzymes and a database D of

71

72

sequences. We investigate a three step approach to solving this problem. (1) we Ob-
tain a restriction pattern of the isolate while analytically deriving the corresponding
restriction maps of the sequences in the database. (2) We identify a set S Q D of
sequences which have restriction maps that are most similar to the unknown isolate’s
restriction pattern. This task is complicated by the fact that we have only approxi-
mate fragment lengths for the unknown isolate and that we do not know the actual
ordering of the unknown isolate’s fragments. Despite these difﬁculties, we derive
experimental results which indicate maximum matching techniques are eﬁ'ective in
identifying the correct set most of the time. In this step we introduce Maximum Site
Matching Problem (MSMP) and show MSMP is in the class of N P—complete problems
which are conjectured to have no polynomial time solution. (3) We use the set S to
infer biological information (such as sequence information or hierarchical classiﬁca-
tion information) about the unknown isolate. We demonstrate experimentally that
the closeness Of the sequences in the set S to each other can be used to infer the

relatedness of the unknown isolate to the sequences of the set S.

 

73
4.1 Introduction

Sequence database searches have become a normal ﬁrst step in the identiﬁcation
and characterization of new macromolecule isolates. The information Obtained from
such database searches and comparisons Often yields novel insights into isolate origin,
function, and evolution. A number Of very good tools exist for searching sequence
databases, for example BLAST [2] and FASTA [65]. These tools can be very powerful
in rapidly discovering even weak similarities to a query. However, all of these database

search tools require that the molecular sequence Of the query be known.

Even though improvements in methods for nucleic acid and protein sequencing
have substantially reduced the cost Of Obtaining sequences, there are many situations
in which sequencing costs are still too high or the required specialized equipment is
just not available. Often, a researcher is able to Obtain more macromolecule isolates
than it is practical to sequence. Many of these isolates may be identical, and some

rapid method is needed to Obtain a set of unique isolates.

One simple method Of categorizing multiple isolates is to use the restriction pattern
obtained from digestion with certain enzymes (proteases and restriction endonucle-
ases). These enzymes typically cleave the macromolecular substrate at speciﬁc sub-
sequences. By comparing the pattern Of fragment lengths produced by the isolates,

sets Of non-identical patterns can be selected for further study.

At a further level of sophistication, the number of matching (same size) fragments

between patterns may be used to calculate a measure of (phylogenetic) relatedness

74

between isolates [60, 76, 48]. If the positions of cleavage sites along the molecule
(cleavage site map) are known as well as the fragment sizes, then several more accurate
methods exist for estimate the relatedness between isolates [60, 15, 19, 37]. Some work
has been done on constructing cleavage site maps from digestion patterns [64, 24, 47,
17, 8, 30, 89], but this work typically assumes that there are overlapping fragments.
In our case, there are no overlapping fragments to guide us in constructing a cleavage

site map.

Cleavage site maps can be computed easily from molecular sequences. We are
interested in how such maps generated from a sequence database might be used to
help characterize a molecular isolate. Assume a cleavage pattern for a query molecu-
lar isolate (measured experimentally to within some expected limit of accuracy) and
a database containing sequences with varying degree Of sequence similarity to the
query isolate. We would like to be able tO extract the set of sequences with cleavage
site maps most similar to the (unknown) map of the isolate. A maximum number
of common sites between cleavage site maps is used as a measure of similarity. We
formalize this problem as Maximum Site Mapping Problem (MSMP) and demon-
strate the MSMP to lie in the class Of NP-complete problems conjectured to have
no polynomial time solution. To overcome this difficulty, we suggest a heuristic ap-
proach to solve a conﬁned version of the problem. In addition, we would like some
estimate of how closely related these sequences are to our query molecule in terms of
sequence similarity or other biological measures. Related work has also been done in

identifying the location of restriction maps Of short query probes in longer restriction

75

maps [56, 55]. However, this problem differs from ours in that the ordering of the

query fragments is known.

Grouping isolates by cleavage site maps may not be useful in itself, unless this
grouping implies some more standard measure Of relationship, such as primary se-
quence similarity. Although primary Similarity between two sequences can be used
to directly estimate the expected fraction Of shared enzyme sites, sequence similar-
ity can not be accurately estimated from the fraction of matching sites alone. Any
attempt at performing the reverse calculation requires some knowledge of the under-
lying distribution of similarity in the pool from which the two sequences were drawn.
This leads us to use experimental results for evaluating the effectiveness of different

methods of inferring information about a query isolate.

Similar work has been done in classifying unknown peptides by using mass pro-
ﬁles where the unknown peptide is digested by enzymatic or chemical means and the
masses Of the resulting fragments are determined by mass spectrometry [41, 75, 13].
However, when dealing with peptides, several problem parameters are quite different
than when dealing with nucleotides. First, the masses Of the resulting fragments can
be Obtained with essentially no error. Second, the masses reveal signiﬁcant informa-

tion about the amino acid composition Of each fragment.

We perform our experiments using the Ribosomal Database Project (RDP)
database of bacterial 168 rRNA gene sequences [51]. This database is a member Of
a class Of databases containing sequences derived from that of an unknown common

evolutionary ancestor. In addition, the phylogenetic relationships of the sequences in

76

the database have been estimated and are available from the RDP, as is a biological
classiﬁcation scheme based on these relationships. Also, comparison of restriction
fragment patterns Of rRNA genes (rDNA) is an accepted method of discerning relat-

edness between isolates in micro-biological studies [57].

In this work, we present a method for ﬁnding the set Of database sequences with
the most sites in common with a query restriction fragment pattern. This method
uses sequences in the database as templates to assemble the fragments into putative
restriction site maps. The results Of this method are shown to be similar to results
Obtained when the exact fragment order is known for both the query isolate and the
database sequences. Also, a method to estimate the sequence similarity between the
query and database result set is presented. In addition, we demonstrate that the

results can be used to place the query in an existing biological classiﬁcation scheme.

4.2 Methods

4.2.1 Overview

Our basic problem is to determine biological information such as primary sequence
information about an unknown query isolate q using a database D sequences. Fur-

thermore, we do not allow biological sequencing of the unknown query isolate q.

An overview Of the approach to solve this problem is given in Figure 4.1.

1. We use enzymes to obtain a restriction pattern Of the query isolate q. Simulta-

77

 

 

 

 

 

 

 

 

 

 

 

 

Step1.
F a
so 100 300 400J
i restﬁctimpettemq
nest 31°92. Step3.
5° 30° 2°° _.[Cs|oilstionot H mum” Hintennmotbioiog'csi
uence100 - -

mgwenc‘ggso WotqandD muencetso mionnatlonotq

Swami! Selecteddatabase
100 100 10050 meets

Seaponciu
80 I 20 200400
L " momma J

 

 

 

 

Figure 4.1: The overall process of solving the problem.

neously, we analytically compute restriction maps Of all the sequences 8 6 D.

2. We then compute the closeness of the restriction map Of each sequence 3 E D

to the restriction pattern of q.

3. This closeness information is then used to infer biological information about q.

We describe each of these steps in more detail in the following sections.

We note here that our methods for completing each step, particularly the third
step, are dependent on the characteristics Of the database D. We ﬁrst show that
analytical methods to infer the primary sequence similarity between the query isolate
q and any sequence 3 E D require a priori knowledge Of the relationship of the
underlying probability distribution Of the primary sequence similarity of any sequence
3 E D to the sequence of isolate q. Straightforward analytic methods are further
handicapped by the assumption that for closely related sequences 3, s’ 6 D, the
relationship of the restriction map of s to the restriction pattern of q is independent
of the relationship of the restriction map of s’ to the restriction pattern of q. As a

result, we experimentally evaluate different methods for inferring information about

78

the query isolate q. Our experimental results demonstrate that effective inference
techniques do exist which utilize the set S Q D of sequences which have restriction
maps closest to the restriction pattern of q. This result has good consequences for
step two of our approach where we show that computing how close the restriction
map of any sequence 3 E D to the restriction pattern Of q may be computationally
difficult but that determining whether the restriction map of a sequence 3 6 D is

extremely close to the restriction pattern of q is computationally tractable.

4.2.2 Obtaining Restriction Patterns and Restriction Maps

The basic biological processing we perform on the isolate q is digestion by enzymes
which produces a restriction pattern of q. At the same time, we analytically compute
the restriction maps of all sequences 3 e D that would have been formed if we had
digested these sequences with the same enzymes. To facilitate later comparison of the
restriction pattern of q to the restriction maps of s 6 D, we assume that the database
sequences represent molecules with ends at positions homologous to the ends of the
query molecule. In practice, the end points of query molecules may be known if, for
example, they are produced by the PCR reaction using primers to conserved regions.
The primers deﬁne the query endpoints. For the speciﬁc RDP database, PCR is the

method of choice for isolating rRNA genes.

The two processes of Obtaining a restriction pattern of q and computing restric-
tion maps for s e D differ in their precision. Because computing the restriction

maps is analytical, we can compute the restriction maps exactly. However, while it

79

is technically possible to measure the exact length of the restriction fragments of q,
the methods required are not suitable for a rapid, inexpensive screen. A more real-
istic assumption is that fragment sizes would be estimated by simple gel or capillary
electrophoresis; this leads us to assume that fragments f which have actual length If]
will be measured to have length between (1 - e)| f | and (1 + 6)] f |. In our experiments,

we estimate 6 to be 5%.

Deﬁnitions.

We now formally deﬁne the terms restriction map, restriction pattern, the closeness
Of two restriction maps, and the closeness of a restriction map to a restriction pat-
tern. We begin by deﬁning restriction maps, restriction patterns, and the equivalence

relation that restriction patterns induce on restriction maps.

In the following deﬁnitions, 3 is a nucleotide sequence and z is an enzyme.
Deﬁnition 1 The restriction map M,(s) formed by digesting sequence 3 by enzyme 2
is the ordered multiset of fragment lengths where l,- > 0 for 1 g i _<_ n. We can obtain

the set of locations of cut sites
8: {s:s=2195,lj;1 S r 5 n}.

Deﬁnition 2 Let M,(s) = {11, lg, . . . , l,.}. We deﬁne the restriction pattern P,(s) to be
the unordered multiset Of fragment lengths {l1, l2, . . . , l,.}. We say that M,(s) yields

P, (3).

Note many different restriction maps yield the same restriction pattern. For exam-

80

ple, if there are n different fragment lengths in P,(s), n! number of different restriction
maps can be generated from P,(s).

Definition 3 We deﬁne two restriction patterns to be isomorphic if they yield the same
restriction pattern. We denote the set of isomorphic restriction maps which yield the

restriction pattern P,(s) with the notation [P,(s)].

For example, the restriction map {100, 200,300} and {300, 200, 100} are isomor-
phic but neither is isomorphic to the restriction map {100, 200, 100, 200} It is not
hard to see that this isomorphic relation deﬁnes an equivalence relation on the set of

restriction maps.

We now deﬁne closeness metrics for restriction maps and restriction patterns.
First, we need some notation. Let sites(M, (3)) denote the number of cleavage sites
in the restriction map M,(s). Let sites([P,(s)]) denote the number Of cleavage sites
in any of the restriction maps in P,(s). Note sites(M,(s)) and sites([P,(s)]) is one
less than the number Of fragments in M,(s) and P,(s).

Definition 4 We deﬁne a common site in sequences a and b to be a cleavage site that
appears in the same nucleotide position in both a and b. Let common(M, (a), M, (b))
denote the number Of common cleavage sites in the restriction maps M, (a) and M, (b).
Let maxcommon(M, (a), P, (b)) denote the maximum number of common cleavage
sites in any restriction map in [P, (b)] and M,(a).

Definition 5

We deﬁne the closeness of the two restriction maps, denoted closeness(M,(s), M, (3’)),

to be max(sites(M, (3)), sites(M, (s’)))-common(M, (s), M, (3’)). We deﬁne the close-

81

ness Of a restriction map to a restriction pattern, denoted closeness(M, (s), P,(s’)),
to be max(sites(M,(s), sites(P,(s’)))-

maxcommon(M,(s), P,(s’)).

Deﬁnition 6 We deﬁne the set of sequences 0 (q, D, j) Q D to be the sequences 3 E D
such that closeness(M,(s), P,(q)) S j.

Example Figure 4.2 explains the above deﬁnitions. (a) shows the restriction map
M,(a) = {100,500, 400} for sequence a. (b) shows the restriction pattern P, (b) =
{100, 200, 300, 400} for b. max(sites(M,(a)),sites(M,(b))) is 3. (c) shows the

maxcommon(M,(a), P,(b)) = 2. Clearly, the closeness(M,(a), P,(b)) is 1.

 

 

 

 

 

loo
7'00
300
l-———i
=loo= 500 g 400 i B 400 =
(a) 03)
p09 200 ,- 300 400 =
.100] 500 400 9
ﬁ I I
O] (c) c2

Figure 4.2: (a) Restriction map for a. (M,(s) = {100,500,400}. (b) Restriction
pattern for b. (P,(b) = {100, 200, 300, 400}. (c) maxcommon(M,(a), P,(b)) = 2 with
common sites c1 , 0,.

4.2.3 Computing Closeness

Step two Of our approach is computing closeness (M, (s), P, (q)) for each sequence

3 6 D. While we show that this problem appears to be a computationally intractable,

82

we describe how we can efﬁciently determine if closeness(M,(s), P,(q)) = k where
k=0,l or 2. This allows us to compute the set C(q, D, k) which is sufficient for step

three Of our approach.

We ﬁrst observe that the problem Of computing closeness(M,(sl),M,(sg))
reduces to the problem Of computing common(M,(sl), M,(sg)) and the prob-
lem of computing closeness(M,(sl), P,(sg)) reduces to that Of computing
maxcommon(M,(sl), [P,(32)]). We next note that the second problem, computing
maxcommon(M,(sr), [P,(s,)]), is NP—complete via a reduction from the 3-Partition

problem. The formal proof is as follows:

Maximum site matching problem: We have as data the restriction pattern
from a sequence a and restriction map from a sequence b when a same enzyme is

used, say,

Restriction pattern M (a) = {a,- : 1 g i S n} from the digest Of sequence a

Restriction map P(a) = {b.- : 1 s i 5 m} from the nucleotide sequence b

In general M (a), P(b) will be multisets; that is, there may be values of fragment

lengths that occur more than once. The lengths of sequences a and b is

= : a,-= Z 05 (4.1)

Given the above data the problem is to ﬁnd orderings for the set M (a) such that the

number Of common sites implied by this ordering is maximum.

Let 5,. the set of all permutations Of M (a). For a 6 Sn calls a conﬁguration. By

83

ordering M (0) according tO a, we obtain the set Of locations of common cut Sites
S = {s : s = lejgagu) and s = Zlggmbjio _<_ r g n,0 S t 5 m}.

Since we want to record only the location of cut sites, the set S is not allowed

repetitions, that is, S is not a multi-set. Now label the elements Of S such that

S={3,-:Osj5k} (4-2)

with s.- g s, and 0 _<_ k _<_ min(m,n). The problem is to ﬁnd a conﬁguration with
maximum 1: common sites.

Theorem: Maximum site matching problem is NP-complete.

Proof: Assume k be the the number of maximum common sites between M (a) and
P(b). It is clear that the problem as described above is in NP, as a nondeterministic
algorithm need only guess a conﬁguration a and check in polynomial time if 1: common

sites exist. The number of steps to check this is in fact linear. To show this problem

is NP complete we transform the 3-Partition problem to this problem.

In the 3-Partition problem, known to be NP-complete, we are given a ﬁnite
set M (a) Of 3m elements, a positive bound C, and a positive integer s(a) for
each a 6 M(a) such that C/4 < s(a) < C/2 and such that 2,61%.) s(a) =
mC. We wish to determine whether M (a) can be partitioned into m disjoint
sets M1(a),M2(a),...,M,,.(a) such that, for 1 5 i 5 m, Bag-M,(awm) = C. If
Eng-M(a) s(a) = L is not divisible by m, or |M(a)| is not divisible by three, there can

be no disjoint sets M1(a), M2(a), . . . , M...(a); else, consider as input to this problem

84

the data

M(a) = {s(a,) :1 5 i 5 3m}

P(b) = {0,c,---,0}

This is simply this problem in the case when the set M (b) has m fragments of equal

length and maximum common sites I: = m — 1.

Heuristic Solution for MSMP.

As a result, we have focused on a restricted version Of this problem which can be
solved in polynomial time. In particular, we give a simple polynomial time algorithm
which solves the decision problem, “Is closeness(M, (s), [P,(q)]) = k?” where k=0,1,
or 2. We can apply this algorithm to all sequences 3 e D to efficiently compute the

set C(q, D, k).

We ﬁrst simplify the problem by Observing the answer is no unless sites([P, (q)]) -
2 S sites(M,(s)) 5 sites([P,(q)]) + 2. In other words, we try to ﬁnd the answer
for closeness(M,(s), [P,(q)]) = k where k = 0, 1,2. Thus, we need only consider
sequences s 6 D which satisfy the constraint sites([P, (q)]) - 2 5 sites(M,(s)) S
sites([P,(q)]) + 2. Next, if s and q have the same number of sites, say n, we Observe
that for these sequences 3, the answer is yes if and only if P, (s) is identical to P, (q)
assuming that the fragment lengths are exact. This is easily solved in 0(n log n)

time by sorting both multi-sets of fragment lengths and verifying that the if” shortest

 

 

11'

8%

ti

ire

di

1
l)

 

85

database fragment length matches the if” shortest query fragment length for 1 5
i 5 n. If the answer is yes, which means closeness(M,(s), [P,(q)]) = 0, we stop. If
the answer is no, any two fragments from P,(s) and P, (q) are selected and merged
into a longer fragment. Let P,(s)’ and P, (q)’ be new restriction patterns. Again
we observe the answer is yes if and only if P,(s)’ and P,(q)’. If the answer is yes,
which means closeness(M,(s), [P,(q)]) = 1, we stop. If the answer is no, any two
fragments from P, (s)’ and P, (q)’ are selected and merged into a longer fragment. Let
P,(s)” and P,(q)” be new restriction patterns from P,(s)’ and P,(q)’. We Observe
the answer is yes if and only if P,(s)” and P,(q)". If the answer is yes, which means
closeness(M,(s),[P,(q)]) = 2, we stop. If sites([P,(q)]) and sites(M,(s)) are not
same, fragments in a restriction pattern with a larger number of sites are merged
to be the same number of fragments for both patterns and the answer is checked.

Assume sites([P,(q)]) = n and sites(M,(s)) = m. The number of ways Of merging

n

the fragments into n - 1 is . The number of ways Of merging the fragments
2 .
n n - 1

into n — 2 is approximately . However, only consecutive fragments
2 2

in s can be merged into a single fragment. The number of ways Of merging the
fragments into m — 1 is (m — 1) . The number of ways of merging the fragments into

m - 2 is approximately (m - 1)(m - 2). Therefore in worst case, we have to examine

n n n - 1
closeness(M,(s), [P, (q)]) approximately 1+ x (n-l)+ x (n—

2 2 2
1)(n - 2) cases. These merging process and examining closeness(M,(s), [P,(q)]) = k

where I: = 0, 1, 2 require polynomial time.

86
While we observe closeness(M,(s), [P,(q)]) = k, we assume the fragment length

in q are exact. However, while the set of fragment lengths for P, (s) where s 6 D may
be computed exactly, the fragment lengths for P, (q) are only approximate. We only
know that the actual length of fragment f ranges between (1 - e)| f | and (1 + 6)] f |.
Thus, we must relax our ”yes” condition. In particular, we must consider the length

of a query fragment f, to be identical to the length of any database fragment f, if

(1 - 6)]qu S f. .<. (1 + Ellfql-

Ideally, we would like to ﬁnd the sequences 3 e D such that
closeness(M,(s), M,(q)) = 1:. However, our data is imprecise in two ways which
complicates this task. First, we have only P, (q), not M, (q). Second, we have only
the approximate lengths, not the exact lengths, of each query fragment. The natural
question to ask is, how much do these data imprecisions affect the correctness Of our
algorithm? We perform some experiments with the RDP database to show that these
imprecisions do not seriously affect the accuracy Of our algorithm. These experiments

and results are presented in section 1.5.

4.2.4 Inferring Information from Closeness

We now consider the problem Of inferring information about the isolate q from the
closeness information we have computed. We ﬁrst describe analytic methods for
inferring information about q given closeness(M, (s), P, (q)) for any 8 E D. Unfor-
tunately, these techniques have three drawbacks which limit their applicability to

biological databases such as the RDP database. First, the analytical methods require

87

some a priori knowledge on the relationship between q and D such as the distribution
of similarities between elements Of D and q. Second, the analytic methods typically
assume sequences evolve and change only through mutation or substitution; that is,
insertions and deletions of nucleotides are typically not modeled. Third, the analytic
methods typically assume that two sequences 3, s' E D vary from s(q) independently,
even if we know 3 and s’ are biologically similar. For these reasons, we are forced
to evaluate our methods of inference experimentally. We describe a basic heuristic

approach and experimentally verify the applicability Of this approach.

Theoretical Models.

In this section, we describe two methods for using closeness(M,(s), P,(q)) to infer
primary sequence information about q. In both methods, we assume that we actually

have closeness(M,(s), M,(q)) instead of closeness(M,(s), P,(q)).

1. Nei and Li’s approach. The ﬁrst method is based upon the work Of Nei and
Li [60]. The basic assumption in this model is that all the sequences in D and
the underlying sequence s(q) of isolate q are derived from a common ancestor.
Sequences diverge over time via a Poisson process as nucleotides mutate. As
time goes on, the number of shared sites between q and s typically decreases
as more and more nucleotides mutate. Nei and Li developed methods for using
the number Of common sites present in s and s(q) to determine the amount Of
time that has progressed since 8 and s(q) shared a common ancestor. Using

the Poisson process, it is then possible tO compute the number of changes that

88

have occurred in each nucleotide position from the common ancestor to both 3
and s(q). This can then be used to estimate the primary sequence similarity

between 3 and s(q).

. Bayesian analysis. The second method is based upon Bayesian analysis. Given
the primary sequence similarity a between 3 and s(q) and the assumption that
each nucleotide position in s is identical to the same nucleotide position in s(q)
with probability a, we can compute the probability distribution on the random
variable that represents the number of sites common to both restriction maps.
However, in our setting, we actually have the number of common sites shared
by the restriction maps of s and s(q), and we desire to compute the primary
sequence similarity or between 3 and s(q). If we are also given the fact that
the primary sequence similarity between 3 and s(q) is drawn from a known
probability distribution X, we can compute a probability distribution on or
using a Bayesian analysis. Detailed analysis is as follows.

Notations

q is the query sequence and d is a databwe sequence.

0: denotes the sequence similarity between q and d.

r is the length of the restriction enzyme used.

n, denotes the number of restriction sites in q. i n“ denotes the number Of
shared restriction sites in q and d.

p = a', where p is the probability of getting a same shared restriction site.

89

The random variable n4... is simply the binomial random variable with parame-
ters (n,, p). Thus the mean Of n“ is simply pnq and the variance is nqp(1 — p).
We are more interested in the reverse direction which is deriving a value of
a given an Observed value b for a“. That is, we are interested in the condi-
tional random variable (a | n4... = b). We can apply Bayesian analysis to get a
probability distribution on this conditional random variable. That is, we know
that

P(nM = b I a = a) = (12') a‘b(l - a4)“"‘b.

Thus, the probability that a is in the range al to a; is simply

{:3 (mean - a4)"r"’da
It ('r)a“(1 — .4)..-th '

 

Unfortunately, both Of these methods have several ﬂaws which limit their appli-
cability to biological databases such as RDP. First, the assumptions on evolution are
too restrictive. In both cases, only mutations are considered; insertions and deletions
are not allowed. Fhrthermore, both models assume that nucleotides mutate with a
given probability distribution which we have access to. This is an extremely strong
assumption which does not seem to be justiﬁable in a general setting. Finally, the
Bayesian analysis method does little to account for the fact that biologically similar
sequences s and s’ are likely to either both share a site with s(q) or both not share a
site with s(q). For these reasons, it seems unlikely either method can be used in gen-

eral database settings reliably. Indeed one fundamental assumption that is shared by

90

both models is that the number of common sites Shared by sequences 3 and s(q) is a
binomial random variable with p based upon the primary sequence similarity between
3 and s(q). Our experiments with the RDP database in the results section indicate

the number of common sites does not seem to follow this binomial distribution.

As a result, we explore other methods for inferring information about s(q) and

use experimental means tO evaluate these methods.

Closest Sequence Methods.

In this section, we focus on using the set Of sequences C(q, D, lc) for lc=0,1 or 2 to infer
information about the query isolate q. The basic premise behind this method is that
sequences which have primary sequences that are most identical to s(q) are the ones
most likely to be in the set C (q, D, k). The extended premise is that the similarity
of s(q) to any sequence in C (q, D, k) is, with high probability, lower bounded by the

maximum pairwise dissimilarity between any two sequences in C (q, D, k).

We use this basic premise with two different biological metrics of similarity. The
ﬁrst metric is primary sequence similarity. We say that two sequences 8 and s’ have
primary sequence similarity sim(s, s’) = a for 0 5 a 5 1 if lOOa% of the correspond-
ing nucleotide positions in s and s’ are identical. Deﬁne sim(q, C(q, D, k)) to be the
quantity min,ec(q,p,,) sim(q, 3).

Example Let q be the query and the set C(q, D, 0) = {31,32,33}. Assume 90% =
sim(q,sl), 80% = sim(q,s2), 85% = sim(q, s3). Then the lowest primary similarity

between q and C(q, D, 0) is sim(q, C(q, D, 0)) = 80%

91

The second metric is a biological family similarity that is based on a biological
classiﬁcation hierarchy for the database D. Each sequence in D is classiﬁed by an 2:
digit number for 1 5 x 5 5 (note each digit may have a different range of values).
In addition to the rRNA sequence database, the RDP also distributes phylogenetic
information inferred from these sequences. This includes a ﬁve level hierarchical
classiﬁcation scheme consistent with the inferred phylogeny. For example, by this

classiﬁcation scheme, E. coli is classiﬁed as:

e First level 2 BACTERIA

e Second level 2.13 PURPLEBACTERIA

e Third level 2.13.3 GAMMASUBDIVISION

e Fourth level 2.13.3.15 ENTERICSANDBELATIVES

e Fifth level 2.13.3.15.2 ESCHERICHIASALMONELLAASSEMBLAGE.

We say that two sequences s and s’ have level similarity level (3, s’ ) = i if
their classiﬁcations are identical to the i“ digit. Deﬁne level (C(q, D, k)) =
min..,,jgc(,,p,,) level (s,-, s,). Finally, we deﬁne level(q, C(q, D, k)) to be the quantity
min,gc(q,p,,) level(q, s). The extended premise implies that level(C(q, D, k)) lower
bounds level(q, C(q, D, k)) with high probability.

Example Let q be the query and the set C(q,D,0) = {31,32,33}. Let q be
a level 2.1.10.5 and level(sl)=2.1.7.5, level(s2)=2.1.7.1, level(sg)=2.1.7.6. Then

level (C (q, D, 0))=2.1.7 and level (q, C (q, D, 0))=2.1.

92
If the extended premise is true and if both sim(C(q, D, k)) and level (C(q, D, k))

are high, we can, with high conﬁdence, infer fairly precise primary sequence informa-

tion or classiﬁcation information about q.

In general, it seems that sim(q, C (q, D, k)) and level (q, C (q, D, 1:» should increase
if the number of cleavage sites in P, (q) increases. That is, it seems unlikely that
dissimilar sequences will have restriction maps that are close to that of q. On the
other hand, it also seems likely that sim(q, C(q, D, k)) and level(q, C(q, D, 1:» will
decrease in size as the number of cleavage sites in q increases. These relationships,
however, are difficult to analyze mathematically, and thus it is difﬁcult to generate
conditions where our technique will be effective. As a result, we experimentally
evaluate this basic procedure. As we see in the next section, our experiments indicate
this method can be used to eﬂ'ectively infer useful information about roughly half the

sequences in RDP and that the error rate is quite small.

4.3 Results and Discussion

4.3.1 Experimental Procedure

The sequence data used in this study was Obtained from the Ribosomal Database
Project (RDP) (release number 5 of May 17, 1995 [51]). The RDP provides curated
databases of ribosomal RNA related information and analysis services. The database

used in this study was a subset Of the bacterial 16S rRNA database distributed

93

by the RDP. This database contains 16S rRNA sequence information from about
3000 different bacterial isolates and environmental samples. These sequences are
distributed by the RDP as pre-aligned sequences with alignment gaps inserted so that
homologous residues appear at the same position in all sequences. This alignment
was produced by the RDP curators using, in addition to primary sequence similarity,
secondary structure and other higher-order information. Inspection of the alignment
indicates that highly diverged regions are often aligned tO conserve putative secondary
structures without regard to primary sequence. Most of the sequences in this database
are incomplete, usually at the two ends. TO construct the subset used here, incomplete
sequences were removed after ﬁrst selecting the region corresponding to positions
46 through 1406 of E. coli [9]. The resulting database contained 1575 sequences.
Sequence similarity values were calculated for all pairwise combinations of the 1575
sequences. The similarity values clustered around the mean value of 72% identity,

with a tail stretching toward higher similarity values (Figure 4.3).

 

 

 

 

30 I I T I I

25 - '-
20 .— .—

comparisons15
(x 10‘)
10 - -
0 _L I l
40 50 60 70 80 90 100
pnmary elmdarlty(%)

Figure 4.3: Pairwise sequence similarity. The pairwise sequence similarity was de-
termined for all pairwise combinations of 1575 sequences in the test database. Only
sequence regions considered in further analysis were included. The similarity of the
aligned pairs was determined using the pre-aligned sequences supplied by RDP.

94

 

 

site seq.no site seq.no site seq.no site seq.no
1 6 11 695 21 10 31
2 19 12 727 22 5 32 1
3 86 13 587 23 1 33
4 236 14 429 24 1 34
5 403 15 331 25 1 35
6 626 16 188 26 1 36
7 855 17 80 27 37
8 893 18 49 28 38
9 835 19 22 29 39
10 779 20 9 30 40

 

 

 

 

 

 

 

 

 

 

Table 4.1: Sequences by number of sites

Five separate restriction map data sets were computer generated from the 1575
sequences. For each data set, positions matching recognition sites from two com-
mercial restriction enzymes were identiﬁed and the length between sites calculated
(after removing alignment gaps). Ambiguity codons in the database sequences were
treated as not matching any recognition site. The recognition sites (enzymes) chosen
to generate the ﬁve sets were: AGCT -i- CATG; CCGG + AATT; CTAG + GATC;
GTAC + TCGA; and TTAA + ATAT. The average number of sites in the combined

7875 digests was 9.74 with a median of 9 (Table 4.1).

The maps in a data set were chosen one at a time as queries and are considered as
restriction patterns. When a query had a result set C (q, D, 0), this query did not used
for the calculation Of C(q, D, 1). When a query did not have a result set C(q, D, 0),
this query was used as a query for the calculation Of C (q, D, 1). This process continued
until I: = 2 for single enzyme, 1: = 4 for multiple enzymes. To simulate experimental

data, the fragments sizes of the query were assumed to be accurate to only +/- 5%.

The result set of database sequences where all sites could be matched was deter-

mined for each query, assuming the fragment order for the query was known (ordered

95

query). These result sets were compared with the expected results calculated using
the exact site positions from the aligned sequences. These ordered query result sets
were missing one or more sequences found using pre-aligned sequences in 836 of the
7875 trials (10.6%). In addition to base changes (point mutations), rRNA genes have
accumulated insertions and deletions over the course Of evolution. These insertions
and deletions may have caused homologous fragments to no longer have sizes within
the 5% error bound. The result of these size changes is apparently to cause pattern
mismatches over shorter evolutionary distances than would be predicted from site

conservation alone.

Another 580 of the 7875 trials (7.4%) produced result sets with extra sequences.
Some of these extra matches may be due to peculiarities in the RDP alignment
causing occasional site mismatch when comparing aligned sequences; however some
extra matches are probably due to matching Of non-homologous sites within the 5%
error bound for the ordered query tests. Even if all of the additional matches are due
to incorrectly pairing non-homologous sites, the percentage of query results affected

is still relatively small.

The trials were repeated without assuming the order of query fragments was known
(unordered query). Any diﬂ'erences in result sets between ordered and unordered
queries represent incorrect pairing of non-homologous regions between query and
database sequences (incorrect ordering). Only 263 of the 7875 trials (3.3%) produced
result sets with extra sequences not in the ordered query result sets. However, if

mismatches were allowed in the site matching, the results deteriorated. When up

96

to one query and/ or database Site mismatch was allowed, 36.3% of the unordered
result sets contained sequences not in the ordered query result sets. When up to
two query and / or database Site mismatches were allowed the percentage Of result sets

with incorrect matches increased to 88.4%.

4.3.2 sim(q, C(q, D, 19))

Number of mismatches (k). The lowest primary similarities between query and
sim(q,C(q, D, k)) with k=0,1,2, were calculated as in Figure 4.4. The error bounds
were ﬁxed with 5% for k=0,1,2 in the experiments. The values Of sim(q, C(q, D, k))
were dependent, as expected, on the number of sites in the query and the number
of mismatched sites, In, between query and result sets. The values increase when the
number of matched cleavage sites increases. The value Of sim(q, C (q, D, k)) with given
cleavage sites decrease when It increases. For example, the values Of sim(q, C (q, D, 0))
are higher than 90% when the number of cleavage sites is larger than 6. The numbers
of required cleavage sites for the values of sim(q, C (q, D, 0)) which is higher than 90%

are 12 for k=1 and no site for k=2.

Error bounds. sim(q, C(q, D, k)) for different error bounds (10%, 15%, 20%) was
calculated as in Figure 4.5. In the experiments, the value of I: was set to 0. The
values of sim(q, C (q, D, k)) were dependent, as expected, on the error bounds of the
query fragments. The values increase when the sizes of the error bound decrease.
For example, the values Of sim(q, C (q, D, 0)) for 5% error bound are higher than 90%

when the number Of cleavage sites is larger than 6. The numbers of required cleavage

97

 

 

 

 

 

111L111

.q
0‘
IIIIIII

 

 

 

O
0"
H
O
g—s
0'!
8

 

 

 

IIIII
I— .1

d
A1111]

0 5 II) 15
Number of sites in query sequence (k =
III) I
95

38

 

 

 

JIIIIII

 

 

 

 

 

 

 

(I 5 III 15 EN)
Number of sites in query sequence (’6 = 1)

 

 

 

 

 

 

 

 

 

 

If!)
95. 4
m.- _
85 .— —i
go- A
75- ~
70- ..
65b-
60 T
0 20]
g: I I I :
E___.l_l=i:I:El:El:l=:l==__§
(1 5 If) '15 2!)
Number of sites in query sequence (k = 2)

Figure 4.4: sim(q, C(q, D, k)) for k=0,1,2 with 5% error bound, These results are
shown by number of query sites as mean and range, after discarding the 5% highest
and lowest values. Bar chart at bottom indicates the number of queries with result
sets with size greater than one. The y axis of the ﬁrst diagram shows the primary
similarity between query and result set. The y axis shows the number of query
sequences. A site with the number of query sequences 5 10 is eliminated.

98

 

sassé

IIIIII

e§§§ s a

 

75
70 .l L

 

 

1111111

I l l

0 5 10 15

 

 

 

M
O

 

 

 

 

 

l

0 5 10 15
Number of sites in query sequences (10%)

 

8

 

 

 

 

 

 

lpllll

 

 

 

8

 

 

 

 

 

 

 

 

0 5 10 15 20
Number of sites in query sequences (15%)

.. T ..

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

U!
H
o
..A
U!
8

 

 

 

 

 

 

e§§§ cs

0 5 10 15
Number of sites in query sequences (20%)

8

Figure 4.5: sim(q, C(q, D, 0)) with 10%, 15%, 20% error bounds. The y axis of ﬁrst
diagram shows the primary similarity between query and result set. The y axis shows
the number Of query sequences. These results are shown by number of query sites as
mean and range, after discarding the 5% highest and lowest values. Bar chart at the
bottom of the ﬁgure indicates the number Of queries with result sets with size greater
than one. A site with the number of query sequences 5 10 is eliminated.

 

99
sites for the values of sim(q, C(q, D, 0)) 2 90% are 8 (for 10%), 10 (for 15%), 16 (for

20%).

Multiple enzymes. To increase the values of sim(q, C(q, D, 19)), multiple enzymes
were applied to each query. The result set for each enzyme is Obtained and C (q, D, k)s
for each result sets were calculated. Let C(q, D, k)A and Cg(q, D, k)g be the results
using two different enzymes A and B with same q and D. First, we calculate the
common set of C(q, D, k1)... and 02(q, D, k2)3, which is C(q, D, k1 + k2)AB. The
values Of k1 + k2 are 0 thru 4. Then we examine the value Of C(q, D, k1 + k2)“.

Figure 4.6 shows the results.

100

 

 
   

..¢
' ,l""""'
a.

I l 1 IL I l l

 

 

60 l I I I I I I
0 5 10 15 20 25 30 35
I I

 

8

 

 

 

m
3%1

0 5 10 15 20 25 30 35
Number of sites in query sequence (k1 + k2 = 0)
100 I -~

 

vv' rrrrrr
I I
" ’1'...

  
  

J l I I I l I

 

 

l I I

II
05101520253035
II

T I I I

 

8

 

-

m assesses

IIIII

 

IIIII

 

l

5 10 15 20 25 30 35 40
berofsitesinquerysequence(k1+lc2=1)

' '. . 'O' V; I

 

z
Be

5

IIIerf

 

 

“D
C"

 

 

 

I l l l I I I

l l I l

 

 

 

CD
OI

II
10152025303540
I I

PT I ‘1 I I

 

IIII

25 !:gl

 

I

0 5 10 15 20 25 30 35
Numberofsitesinquerysequence(k1+bz=2)

m assesses
L

 

101

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

1m I I 1 I f
95 +- r
90 -— A
85 .. -
80 r. ‘ _
75 - .. r
70 ._ .-
65 P -I
60 I I I I I I I
0 5 10 I5 20 25 30 35 40
r; I f I I I I I :
m 1 I E
0 5 10 15 20 25 30 35 40
Number Of sites in query sequence (In -I- k2 = 3)
100 | 7 I I
95 P
m b -i
85 .. -i
80 r— —i
75 - -
70 - a
65 - _
60 I I I I I I I
0 5 10 15 20 25 30 35 40
: I I I I Pﬁ I
E I I M l I l :1
0 5 10 15 20 25 30 35 40

Numberofsitesinquerysequence (k1+k2 =4)
Figure 4.6: sim(q, (C(q, D, k1 + k2))43 for k; + k2=0 thru 4 with 5% error bound and
two enzymes. These results are shown by number of query sites as mean and range
after discarding the 5% highest and lowest values. Bar chart at bottom indicates the

number of queries with result sets with size greater than one. A site with the number
of query sequences 5 10 is eliminated.

All the values of sim(q, (C(q, D, k. + k2)) are higher than sim(q, (C(q, D, k)), as

expected, when the values of k1 + kg and k are same.

4.3.3 level (q, C (q, D, k))

In addition to aligned rRNA sequence databases, the RDP also distributes phyloge-
netic information inferred from these sequences. This includes a hierarchical classi-

ﬁcation scheme consistent with the inferred phylogeny. At the most basic level, all

102

 

 

 

=0
x 1 2 3 4 5
t, 146 47 360 351 3065
r, 146 46 354 347 2898
hit ratio 100.0 97.9 98.3 98.9 94.6 ‘
k=1

 

 

x 1 2 3 4 5
t, 446 128 373 148 1239
446

 

 

 

r, 125 362 133 1036
hit ratio 100.0 97.? 97.1 89.9 83.6
=2
x 1 2 3 4 5
t, 552 97 140 39 340

r, 552 88 127 35 221
hit ratio 100.0 90.7 90.1 89.7 65.0

 

 

 

 

 

 

 

 

 

Table 4.2: Hit ratios for each It. 5% error bound. ~

sequences tested here are members of category 2, the Bacteria. There are 15 cate-
gories at the next level (2.1 through 2.15), 24 at the third level, 94 at the fourth, and
99 at the ﬁfth and highest level. Not all sequences are categorized to all ﬁve levels,

we assumed an implied category of 0 for all undeﬁned levels.

Number of mismatches (k). The minimum common digits in the levels Of q,
level(q, C(q, D, k)) with k=0,1 or 2, were calculated as in Table 4.2. The error
bounds were ﬁxed with 5% for k=0,1,2 in the experiments. We deﬁne hit ratio
tO use as a measure of accuracy of this method. Let t, be the number of queries with
level(C(q, D, k)) = x and r, be the number of queries with level(q, C(q, D, k)) = x

Then hit ratio for xth level is
h, = (r,/t,) x 100(%) (4.3)

The hit ratio decreases when It increases. For example, the value of ha are 94.6% for

k=0, 83.6% for Ic=l, 65% for k=2.

 

103

 

error bound=10%

 

I
3:
rx

1
453
453 .

2
128
120

342
332

328

2924
2731

 

hit ratio

100.0

97.9

98.3

98.9

94.6

 

error bound=15%

 

3
t:
r:

l
776
776

2
229
213

3
451
436

4
305
291

5
2809
2418

 

hit ratio

100.0

93.0

96.7

95.4

86.0

 

error bound=20%

 

x
t:
7':

l
1419
1419

2
399
353

3
539
507

4
278
257

5
2565
1919

 

hit ratio

100.0

88.5

94.1

92.4

74.8

 

 

 

 

 

 

 

 

Table 4.3: Hit ratios for diﬂ'erent error bounds.

Error bounds. Hit ratios for different error bounds (10%, 15%, 20%) were calcu-
lated as in Table 4.3. In the experiments, the value of I: was set to 0. It is clear that

the hit ratio decreases when the error bound increases. For example, the value of hr,

are 94.6% for 5%, 94.6% for 10%, 86% for 15%, 74.8% for 20%.

Multiple enzymes. To increase the values of level(q, C(q, D, 1:», multiple en-
zymes were applied to each query. The result set for each enzyme is Obtained and
C(q,D,k)s for each result sets were calculated. Let C(q, D, In) A and C(q, D, k2)3
be the results using two different enzymes with same q and D. First, we calculate
the commonly matched sequences C(q, D, k1 + k2)”. Then we examine the value Of

level(q, (C(q, D, k1 + kg”). Table 4.4 shows the results.

The hit ratio Of multiple enzymes is much higher than single enzyme as expected.

For example, the values of hr, for 0 thru 4 are 99.0%, 96.2%, 89.1%, 75.9% and 58.4%.

104

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

k1 + k2 =0

:1: l 2 3 4 5

t, 4 53 4495

r, 4 53 4450
hit ratio l. 100 120‘ 99.0
k; + k2 =1 _— _

x 1 2 3 4 5

t, 4 3 132 180 3354

r, 4 3 132 179 3222
hit ratio 100.0 100 99.6 99.0 96.1
k1 + k2=2

x 1 2 3 4 5

t, 24 15 259 200 2178

r, 242 15 258 198 1940
hit ratio 100 100 99.6 99.0 89.1
k1 + k2=tll

x 1 2 3 4 5

t, 68 41 183 86 1093

r, 68 41 171 80 830 ‘
hit ratio 100 100 93.4 93.0 7—59‘
*1 "I" kg :4

x 1 2 3 4 5

t, 45 32 72 23 519

r, 45 26 61 15 303

 

 

 

 

 

 

 

 

hit ratio 100 81.2 84.7 65.7 58.4
Table 4.4: Hit ratios for each k1 + kg. 5% error bound.

 

105
4.3.4 Ordered. vs. Unordered

When we compute sim(q, C (q, D, k)) for It > 0 we allow the merges of fragments in q
and/ or s E D to reduce the number Of fragments. We know the order of the fragments
in 8 while we don’t know the order Of the fragments in q (unordered). Only consecutive
fragments in s are allowed to be merged while any combination Of fragments in q are
allowed to be merged. This additional information (order of fragments) gives us more
accurate results. We compare sim(q, C(q, D, 2)) with two different cases. Let m be the
number Of fragments in q and n be the number of fragments in s. sim(q, C(q, D, 2))
can be calculated when m = n + 2 or n = m + 2. Let sim(q, C(q, D, 2),) be the
result by merging the fragments in q and sim(q, C (q, D, 2),) be the result by merging
the fragments in s. Figure 4.6 shows the difference between sim(q, C (q, D, 2),) and

sim(q, C(q, D, 2),). level(q,C(q,D,2),) and level(q,C(q,D,2),) are calculated by

 

 

 

 

100 r r r
95- . _-
90» q”—~
85- A
80.. _
75— _
7o- -
65
60 I I I

o 5 10 15 20

Number of sites in query sequences

Figure 4.7: sim(q, C(q, D, 2)), and sim(q,C(q,D,2llr with 5% error bound. The y
axis of the diagram shows the primary similarity between query and result set. The
curve ”8” is generated by merging s and the curve ”q” is generated by merging q.

doing same process. Hit ratio for each level is calculated. Table 4.5 show the hit

ratios of level (q, C (q, D, 2),) and level (q, C(q, D, 2) .).

 

Table 4.5: Hit ratio of level(q, C(q, D, 2),) and level(C(q, D, 2),)

4.3.5 Simulation of Random Database

To conﬁrm our method can be applied to other databases, we generated random
databases and applied our method and checked the results. To simulate the evolu-
tionary procedure, one original sequence, 8}, is generated randomly. 8: denotes xth

random sequence of yth level. Two random sequences, 9], sf are generated from so

with a similar to so. Again, four random sequences, 3;, ..., s; are generated from s], sf
with u percent similar to origin sequence. We call u as uniform edge length. This
process continues until 2048 (sh, ..., sﬁ“) random sequences are generated. The ﬁnal
2048 random sequences are considered as database sequences. The whole random
sequences resemble binary tree structure. Ten random databases with uniform edge

length u are generated with different orders Of random seeds and different values of

u. get random database with different uniform primary similarity.

Same ﬁve set of enzymes as in experiment of RDP are applied to ten random
databases for each u. All 2048 random sequences in a DB are considered as query
sequences. Therefore total number of query sequences are 10240 (2048 x 5). For
each query, (CLOSE(q, D, 0) is calculated. To identify the properties of selected
database sequences, the value of sim(q, s) are accumulated instead of calculation Of

sim(q, (CLOSE(q, D, 0)). In this experiment error bound is zero percent. Figure

107
4.7 shows the output Of this simulation with edge length 92% thru 98%. The edge

lengths u 5 98% are not enough to generate meaning output. Also u 5 90% are
not biologically meaningful. Therefore we only consider the random databases with

u 5 99%. Figure 4.8 shows the output u 5 99%.

5m I l I I I I I I I

 

888888

150-
100-
wh

0
10098 96 94.92 90.88 86 84 82 80
Primarysumlanty(%)

 

 

 

 

Figure 4.8: Accumulation of sim(q, s) for random databases with edge lengths 98%,
96%, 94%, 92%. Bar charts for each diagram show the standard deviations. y shows
the total number of queries with output. x shows the primary similarity between
query and matched database sequences.

108

First diagram in Figure 4.8 shows the total number of query sequences vs.
sim(q, d). The number Of matched database sequences rapidly decreases when
the primary similarity between query and database sequences decreases. The
value of sim(q,s) for most of s is greater than 90% which means the value of
sim(q, (CLOSE(q, D, 0)) is also greater than 90%. This diagram shows that our
method can be applied to other sequence databases. This diagram shows the useful-

ness Of our method.

The output in the ﬁrst digram of Figure 4.8 is divided by two groups which are
forward and backward mutations. First, we check the ﬁrst common ancestor, o, of the
q and d. Second, we check if q and o are matched. If q and d are matched, we consider
the output as forward. If not, we consider the output as backward. Second diagram
in Figure 4.8 shows the output with forward and third diagram in Figure 4.8 shows
the output with backward. The curves in second diagram are decayed rapidly while
the curves in third diagram are increased until some period and decayed. Phrther

study is required for more detailed analysis for these diagrams.

109

 

8000 _ ”99.8” 9— _
.. ”99.7” i
. ”99.6” _
7000 .994. _,_
6000 ml] _ ”99.2” e— -

 

 

 

96
Primary similarity(%)
9000 I I m l

8“” .1 4 ”99.8” 0_ _,

 

 

 

 

 

 

3000 .. ”99.8” 9— _.
”99.7” g:
7 - ”99.6” _
000 ”99.4” *—
eooo - ”99.2” €— -

    

 

 

100 - 98 l ' I I 94 A I 92 90
Figure 4.9: Accumulation Of sim(q, s) for random databases with uniform primary

similarities 99.8%, 99.7%, 99.6%, 99.4%, 99.2%. Bar charts for each diagram show
the standard deviations.

Chapter 5

Conclusions

In this thesis, we proposed an algorithm called MSASA, which is based on simulated
annealing approach, to align multiple protein sequences. Dynamic programming of
multiple sequence alignment has been widely used to ﬁnd an optimal alignment for
certain cost functions. However, it does not allow certain types Of cost function
including natural gap costs, and limits the number Of sequences that can be aligned
due to its high computational complexity. MSASA overcomes these problems because
it uses any gap costs which generates a better solution. It aligns more sequences,
and takes less computation time compared to a dynamic programming approach. A
solution set for a multiple sequence alignment problem is identiﬁed, and the multiple
sequence alignment problem is reformulated to ﬁnd an Optimal alignment from this
solution set. The computational complexity Of MSASA is signiﬁcantly reduced by
substituting the higher temperature phase of the annealing process by a fast heuristic

algorithm and conﬁning the solution set.

110

111
We then suggested an algorithm called RNASA, which is based on simulated an-

nealing, to align multiple RNA sequences. The output RNA alignment from RNASA
is used to identify possible secondary structures. A transition rule, based on double
shufﬂe, is proposed. This transition rule provides faster convergence to an optimal
solution. We Show that this new transition rule takes less convergence time than a
conventional transitiOn rule, based on single shufﬂe. RN ASA is applied to the se-
quences from RDP and the results are compared to known secondary structures. The

usefulness of RNASA is supported with experimental results.

We then studied the problem of obtaining biological information about a macro-
molecule isolate using only restriction patterns and restriction map databases. Max-
imum site matching problem (MSMP) is deﬁned in a formal way and proved to
be in NP-complete problems. We suggested a heuristic algorithm to solve MSMP. A
three phase approach to obtain relatedness Of unknown macromolecules and database
sequences is suggested. We demonstrate usefulness Of our approach using RDP

database.

Bibliography

[1] S. F. Altschul. Gap costs for multiple sequence alignment. J. Theor. Biol,
138:297-309, 1989.

[2] S. F. Altschul, W. Gish, W. Miller, E. M. Myers, and D. J. Lipman. Basic local
alignment search tool. J. Mol. Biol, 215:403-410, 1990.

[3] S. F. Altschul and D. J. Lipman. Tmes,stars,and multiple biological sequence
alignment. SIAM J. appl. Math., 49:197—209, 1989.

[4] P. Argos. A sensitive procedure to compare amino acid sequences. J. Molec.
Biol, 193:385—396, 1987.

[5] D. J. Bacon and W. F. Anderson. Multiple sequence alignment. J. Molec. Biol,
191:153—161, 1986.

[6] W. Bains. Multan: A program to align multiple dna sequences. Nucl. Acids
Res, 14:159—177, 1986.

[7] G. J. Barton and M. J. E. Sternberg. A startegy for the rapid multiple alignment

of protein sequences: conﬁdence levels from tertiary structure comparisons. J.
Molec. Biol, 198:327-337, 1987.

[8] B. Bellon. Construction of restriction maps. CABIOS, 4:111—115, 1988.

[9] J. Brosius, M. L. Palmer, P. J. Kennedy, and H. F. Noller. Complete nucleotide
sequence of a 168 ribosomal rna gene. Proc. Natl. Acad. Sci. U. S.A, 75:4801—4805,
1978.

[10] H. Carrillo and D. Lipman. The multiple sequence alignment problem in biology.
SIAM J. Appl. Math, 48:1073—1082, 1988.

[11] T. R. Cech. Conserved sequences and structures Of group i intronszbuilding an
active site for ma catalysis-a review. Gene, 73:259-271, 1988.

[12] F. Corpet. Multiple sequence alignment with hierarchical clustering. Nucl. Acids
Res, 16:10881-10890, 1988.

[13] J. Cottrell. Protein identiﬁcation by peptide mass ﬁngerprinting. Peptide Re-
search, 7(3):115—124, 1994.

112

113

[14] M. O. Dayhoff. A model of evolutionary change in proteins. matrices for detect-
ing distance relationships. In Atlas of Protein Sequence and Structure, volume 5
suppl. 3, pages 345-352. Dayhoff. M. O.(ed) Washington. DC: National Biomed-
ical Research Foundation, 1978.

[15] R. DeBry and N. A. Slade. Cladistic analysis of restriction endonuclease cleavage
maps within a maximum-likelihood framework. Syst. Zool, 34(1):21-34, 1985.

[16] A. Delcoigne and P. Hansen. Sequence comparison by dynamic programming.
Biometrika, 62:662—664, 1975.

[17] R. Durand and F. Bregegere. An efﬁcient program to construct restriction maps
from experimental data with realistic error levels. Nucleic Acids Res, 12:703-
716, 1984.

[18] S. R Eddy and R. Durbin. Rna sequence analysis using covariance models. Nucl.
Acids Res, 22:2079—2088, 1994.

[19] Joseph Felsenstein. Phylogenies from restriction sites: a maximum-likelihood
approach. Evolution, 46(1):]59—173, 1992.

[20] D. F. Feng and R. F. Doolittle. Progressive sequence alignment as a prerequisite
to correct phylogenetic trees. J. Molec. Evol, 25:351-360, 1987.

[21] D. F. Feng, M. S. Johnson, and R. F. Doolittle. Aligning amino acid sequences:
comparison Of commonly used methods. J. Molec. Evol, 21:112—125, 1982.

[22] J. W. Fickett. Fast Optimal alignment. Nucl. Acids Res, 12:175-180, 1984.

[23] W. M. Fitch and T. Smith. Optimal sequence alignments. Proc. Natl. Acad. Sci.
USA., 80:1382-1386, 1983.

[24] W. M. Fitch, T. F. Smith, and W. W. Ralph. Mapping the order of dna restriction
fragments. Gene, 22:19-29, 1983.

[25] G. E. Fox and C. R. Woese. 5s rna secondary structure. Nature, 256:505-507,
1975.

[26] M. L. E'edman. Algorithms for computing evolutionary similarity measures with
length independent gap penalties. Bull. Math. Biol, 46:553—566, 1984.

[27] O. Gotoh. An improved algorithm for matching biological sequences. J. M olec.
Biol, 162:705-708, 1982.

[28] O. Gotoh. Alignment Of three biological sequences with an eﬂicient traceback
procedure. J. Theor. Biol, 121:327—333, 1986.

[29] M. Gribskov, R. Luthy, and D. Eisenberg. Proﬁle analysis. J. Theor. Biol,
183:146-159, 1990.

114

[30] A. V. Grigorjev and A. A. Mironov. Mapping dna by stochastic relaxation: a
new approach to fragment sizes. CAB] OS, 6:107—111, 1990.

[31] L. K. Grover. Standard cell placement using simulated sintering. Proc. 24 th
DA 0, pages 56-59, 1987.

[32] R. R Gutell. Collection of small subunit (168 and 168 like) ribosomal structures.
Nucl. Acids Res, pages 3502—3507, 1994.

[33] J. Hein. A new method that simultaneously aligns and reconstructs ancestral

sequences for any number of homologous esequences, when the phylogeny is
given. Molec. Biol. Evol, 6:649—668, 1989.

[34] D. G. Higgins and P. M. Sharp. Clustal: a package for performing multiple
sequence alignment on a microcomputer. Gene, 7:237—244, 1988.

[35] M. Hirosawa, M. Hoshida, M. Ishikawa, and T. Toya. Mascot: Multiple align-
ment system for protein sequences based on three-way dynamic programming.
CABIOS, 9:161-167, 1993.

[36] P. Hogeweg and B. Hesper. The alignment Of sets of sequences and the con-
struction of phyletic trees: an integrated method. J. Molec. Evol, 20:175-186.
1984. ‘

[37] K. E. Holsinger and R. E. Jansen. Phylogenetic analysis Of restriction site data.
Methods in Enzymology, 224:439—455, 1993.

[38] M. Ishikawa, T. Toya, M. Hoshida, K. Nitta, A. Ogiwara, and M. Kanehisa.
Multiple sequence alignment by parallel simulated annealing. CABIOS, 9:267-
273, 1993.

[39] A. B. Jacobson, L. Good, J. Simonetti, and M. Zuker. Some simple computa-
tional methods to improve the folding of large rnas. Nucleic Acids Res., 12:45-52,
1984.

[40] B. D. James, G. J. Olsen, and N. R. Pace. Phylogenetic comparative analysis of
ma secondary structure. Methods in Enzymology, 180:227—230, 1989.

[41] P. James, M. Quadroni, E. Carafoli, and G. Gonnet. Protein identiﬁcation by
mass proﬁle ﬁngerprinting. Biochemical and Physical Research Communications,
195(1):58-64, 1993.

[42] MS. Johnson and R. F. Doolittle. A method for the simulataneous alignment of
three or more amino acid sequences. J. Molec. Evol, 23:267-287. 1986.

[43] R. A. J as, N. W. Woodbury, and R. F. Doolittle. Sequence homologies among
e.coli ribosomal proteins: evidence for evolutionary related groupings and inter-
nal duplication. J. Molec. Biol, 15:129—148. 1980.

115

[44] J. Kim and S. Pramanik. Multiple sequence alignment using simulated annealing.
In Second International Conference on Intelligent Systems for Molecular Biology,
1994.

[45] J. Kim, S. Pramanik, and M. J. Chung. Multiple sequence alignment using
simulated annealing. CABIOS, 10:419—426, 1994.

[46] S. Kirkpatrick, C.D. Gelatt Jr, and M. P. Vecchi. Optimization by simulated
annealing. Scinece, 220:671—680, 1983.

[47] M. Krawczak. Algorithms for the restriction-site mapping Of dna molecules. Proc.
Natl. Acad. Sci. USA., 85:7298—7301, 1988.

[48] W. Li. Evolutionary change of restriction cleavage sites and phylogenetic infer-
ence. Genetics Society of America, 113:187-213, 1986.

[49] D. J. Lipman, S. F. Altschul, and J. D. Kececioglu. A tool for multiple sequence
alignment. Proc. Natl. Acad. Sci. USA., 86:4412-4415, 1989.

[50] A. V. Lukashin, J. Engelbrecht, and S. Brunak. Multiple alignment using simu-
lated annealing: branch point deﬁnition in human mrna splicing tool for multiple
sequence alignment. Nucl. Acids Res, 20:2511-2516, 1992.

[51] B. L. Maidak, N. Larsen, M. J. McCaughey, R. Overbeek, G.J. Olsen, K. Fogel,
J. Blandy, and C. R. Woese. The ribosomal database project. Nucl. Acids Res,
22:3485-3487, 1994.

[52] H. M. Martinez. A ﬂexible multiple sequence alignment program. Nucl. Acids
Res., 16:1683—1691, 1988.

[53] N. Metropolis, A. Rcsenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation
of state calculations by fast computing machines. J. Chem. Phys, 21:1807-1092,
1953.

[54] F. Michel, K. Umesono, and H. Oseki. Comparative and functional anatomy Of
group ii catalytic introns-a review. Gene, 82:5—30, 1989.

[55] W. Miller, J. Barr, and K. Rudd. Improved algorithms for searching restriction
maps. CABIOS, 7(4):447—456, 1991.

[56] W. Miller, J. Ostell, and K. Rudd. An algorithm for searching restriction maps.
CABIOS, 6(3):247—252, 1990.

[57] C. R. Moyer, F. C. Dobbs, and D. M. Karl. Estimation of diversity and com-
munity structure through restriction fragment length polymorphism distribution
analysis Of bacterial 168 rrna genes from a microbial mat at an active, hydrother-
mal vent system. Appl. Environ. Microbiol, 60:871—879, 1993.

116

[58] M. Murata, J. S Richardson, and J. L. Sussman. Simultaneous comparison of
three protein sequences. In Proc. Natl. Acad. Sci. USA., volume 82, pages 3073-
3077, 1985.

[59] S. B. Needleman and Wunch. C. D. A general method applicable to the search
for similarities in the amino acid sequences Of two proteins. J. Molec. Biol,
48:443—453, 1970.

[60] M. Nei and W. Li. Mathematical model for studying genetic variation in terms
of restriction endonucleases. Genetics, 76(10):5269—5273, 1979.

[61] H. F. Noller and C. R. Woese. Secondary structure Of 168 ribosomal rna. Science,
212:519—533, 1981.

[62] R. Nussinov and Jr. Tinoco, I. Sequential folding of a messenger rna molecule.
JMB, 151:519—533, 1981.

[63] N. R. Pace, D. K. Smith, G. J. Olson, and B. D. James. Phylogenetic comparative
analysis and the secondary structure of ribonuclease p rna-a review. Gene, 82:65—
‘75, 1989.

[64] W. Pearson. Automatic construction of restriction site maps. Nucleic Acids Res,
10:217-227, 1982.

[65] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence com-
parison. basic local alignment search tool. Proc. Natl. Acad. Sci. U.S.A, 85:2444—
2448, 1988.

[66] G. Quigley, L. Gehrke, D Roth, and P. Auron. Computer-aided nucleic acid
secondary structure modeling incorporating enzymatic digestion data. Nucleic
Acids. Res, 12:347—366, 1984.

[67] . J. Rose, Wolfgang. Klebsch, and J uergen. Wolf. Tempteraure measurement Of
simulated annealin placements. In Proceedings I CCAD, pages 514-517, 1988.

[68] D. Rose, W. Snelgrove, and Z. Vranesic. Parallel standard cell placement algo-
rithms with quality equivalent to simulated annealing. IEEE Transactions on
CAD, 7(3):387-396, 1988.

[69] J .8 Rose, D. Blyde, W. Snelgrove, and Z. Vranesic. Fast, high quality vlsi place-
ment on an mimd multiprocessor. I CCAD, 86:42-45, November 1986.

[70] Y. Sakakibara, M. Brown, I. S. Mian, R. Underwood, and D. Haussler. Stochastic
context-free grappars for modeling rna. In Proceedings of the Hawaii Interna-
tional Conference on System Sciences, Los Alamitos, CA, 1994. IEEE Computer
Society Press.

[71] D. Sankoﬂ'. Simultaneous comparison of three or more sequences related by a
tree. Addison-Wesley, Reading, MA, 1983.

117

[72] D. Sankoﬂ' and J. B. Kruskal. Time Warps, String Edits and MacromoleculesL
The Theory and practice of Sequence Comparison. Addison-Wesley, Reading,
MA, 1983.

[73] D. B. Searls. The computational linguistics of biological sequences, chapter 2.
AAAI Press, 1993.

[74] P. H Sellers. On the theory and computation Of evolutionary distances. SIAM
J. Appl. Math.,26:787—793,1974.

[75] G. Shaw. Rapid identiﬁcation Of proteins. Proc. Natl. Acad. Sci. USA, 90:5138—
5142, 1993.

[76] P. E. Smouse and Li. W. Likelihood analysis Of mitochondrial restriction-cleavage
patterns for the human-himpanzee-gorilla trichotomy. Evolution, 41(6):1162-—
1176, 1987.

[77] P. Taylor. A fast homology program for aligning biological sequences. Nucl.
Acids Res, 12:447-455, 1984.

[78] W. R. Taylor. Multiple sequence alignment by a pairwise algorithm. CAB] OS,
3:81-87, 1987.

[79] W. R. Taylor. A ﬂexible method to align large numbers of biological sequences.
J. Molec. Evol, 28:161-169, 1988.

[80] L. Tinoco Jr., 0. C. Uhlenbeck, and M. D. Levine. Estimation Of secondary
structure in ribonucleic acids. Nature, 230:363—367, 1971.

[81] M. Vihinan. Simultaneous comparison of several sequences. Methods Enzymol.,
183:447—456, 1990.

[82] M. Vingron and P. Argos. Motif recognition and alignment for many sequences
by comparison Of dot-matrices. J. Molec. Biol, 218:33—43, 1991.

[83] M. S. Waterman. Consensus patterns in sequences. Boca Raton FL: CRC Press,
1989.

[84] M. S. Waterman and M. D. Perlwitz. Consensus methods for folding single-
stranded nucleic acids, chapter 8. CRE Press, 1984.

[85] M. S. Waterman and M. D. Perlwitz. Line geometries for sequence comparisons.
Bull. Math. Biol, 46:567—577, 1984.

[86] MS. Waterman, T.F. Smith, and W.A Beyer. Some biological sequence matrices.
Adv. Math., pages 367-387, 1976.

[87] C. R Woese, R. R Gutell, R. Gupta, and H. F. Noller. Detailed analysis of
the higher-order structure of 168-er ribosomal ribonucleic acids. Microbiology
Reviews, 47(4):221—229, 1983.

118

[88] L. J. Woese, C. R. Magrum, R. Gupta, and R. B. Siege]. Secondayr structure
model for bacterial 168 ribosomal rnazphylogenetic, enzymatic and chemical ev-
idence. NAR, 8:2275-2293, 1985.

[89] L. W. Wright, J. B. Lichter, J Reintz, M. A. Shifman, K. K. Kidd, and P. L.
Miller. Computer-assisted restriction mapping: an integrated approach to han-
dling experimental uncertainty. CABIOS, 10:443—450. 1994.

[90] M. Zuker and P. Stiegler. Optimal computer folding Of large rna sequences using
thermodynamics and auxiliary information. Nucleic Acids Res, 9:133-148, 1981.