I. ‘ nu: 0.4..lhutullr — u I. L. I l. .. f. .v‘ . . t «_ n ‘. .. r . '0 *wc ‘8 V. . . .31 .b I . .II In : .. 1. cl . . T 1.: . S: l «1%» 41‘ : THESIS A, m, L. ,. I a IlllllllllllllllllI ? Michigan State University This is to certify that the dissertation entitled PLLATEDNES‘S or (BECLOQ'ICAL Sewwces USNCT ALMj—NMENT AND RESTRICTION MAP DATABASES presented by 3771 Tim has been accepted towards fulfillment of the requirements for 3 ‘ A llMD degreein (Eli‘l‘bklel' gClGhCE l —f Major professor PLACE ll RETURN BOX to remove thi- cheokout from your record. TO AVOID FINES return on or betore date due. DATE DUE DATE DUE DATE DUE MSU Is An Affirmative AotionlEquei Opportunity Institution mm: RELATEDNESS OF BIOLOGICAL SEQUENCES USING ALIGNMENT AND RESTRICTION MAP DATABASES By Jin Kim A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Department of Computer Science 1996 ABSTRACT RELATEDNESS OF BIOLOGICAL SEQUENCES USING ALIGNMENT AND RESTRICTION MAP DATABASES By Jz'n Kim Comparative analysis of DNA and protein macromolecules has been an important component of biological research. Sequence alignment and restriction map compari- son, in particular, have been useful in the study of molecular evolution, RNA folding, and protein structure-function relationships. In this thesis, we have investigated mul- tiple alignment problems involving RNA and protein sequences, and the methods to infer relatedness between sequences using restriction patterns and restriction map databases. We first consider the multiple sequence alignment problem. Multiple sequence alignment is used to discover an optimal alignment based on defined criteria. Vari- ations of dynamic programming provide most commonly used tools for multiple se- quence alignment methods. However, the biggest obstacle to using dynamic program- ming for multiple sequence alignment has been the high computational complexity. Due to this high computational complexity, dynamic programming cannot be effec- tively used to align more than three sequences and to apply certain types of cost functions. To overcome the limitations of dynamic programming, an algorithm called MSASA, based on simulated annealing, was developed. MSASA can overcome these limitations. We then consider the multiple RNA sequence alignment problem to identify pos- sible RNA secondary structures. An algorithm called RNASA, based on simulated annealing, is suggested for multiple RNA sequence alignment. In this algorithm, RNA sequences are aligned based on primary sequence similarity and then realigned based on secondary structure information. Dot matrices generated from intra-sequence com- parisons are used to obtain possible common secondary structures. Several strategies to reduce the simulated annealing time are suggested. Sequence database searches have been used for identifying unknown macro- molecules. The whole or a part of the primary sequence information is required to do the sequence database search. However, sequencing a macromolecule is expen- sive and time consuming. A more crude but faster method, without sequencing an unidentified macromolecule, has been developed. The method is based on restriction patterns of an unknown isolate and restriction map databases. We use a three-phase approach. In the first phase, we obtain a restriction pattern of the unknown organism while analytically deriving the corresponding restriction maps of the sequences in the database. In the second phase, we identify a set of sequences, S, which have restric- tion maps that are most similar to the unknown macromolecule’s restriction pattern. Maximum Site Matching Problem (MSMP) is defined and is proved to be in the class of NP-complete problems. In the third phase, we use the set S to infer biological in- formation about an unknown macromolecule. We demonstrate the usefulness of this approach by applying it to the rRNA sequences of the Ribosomal Database Project (app). © Copyright 1996 by Jin Kim All Rights Reserved To my family ACKNOWLEDGMENTS I wish to express my gratitude and appreciation to my thesis advisor Dr. Sakti Pramanik for his consistent guidance and encouragement right from the beginning, and his financial support. I am grateful for many discussions and invaluable comments he provided. I would like to thank my guidance committee members, Dr. Moon Jung Chung, Dr. Jon Sticklen, and Dr. Zachary Burton, for their help and guidance. Also I would like to appreciate Dr. Eric Torng, Dr. Jame R. Cole their useful advise. I must express my heartful thanks to my family Myongye Bang, Hyein, Hyejee and my parents, Yong-Beam Kim and my mother, Yong-Sook Shin for their sincere pray during my studying here. Also, I never forget endless love from my parent-in-law, Hwan-moon Bang, J i-Young Kim. vii TABLE OF CONTENTS LIST OF TABLES x LIST OF FIGURES xi 1 Introduction 1 1.1 Basic Concepts ................................ 1 1.1.1 Sequences .................................. 1 1.1.2 Similarity of Macromolecules . . .' .................... 2 1.1.3 Alignment .................................. 3 1.1.4 Restriction Patterns and Restriction Maps ................ 3 1.2 Problem Definitions .............................. 4 1.2.1 Multiple Sequence Alignment ....................... 4 1.2.2 Restriction Map Database Searches .................... 6 1.3 Previous Studies ............................... 7 1.3.1 Multiple Sequence Alignment Based on Primary Similarity ....... 7 1.3.2 Multiple Sequence Alignment Based on Secondary Structures ..... 11 1.3.3 Restriction Map Database Search ..................... 12 1.4 Our Studies .................................. 12 1.4.1 Multiple Sequence Alignment ....................... 12 1.4.2 Restriction Map Database Searches .................... 13 2 Inferring Relatedness of Sequences based on Primary Similarity us- ing Multiple Sequence Alignment 15 2.1 Introduction .................................. 17 2.2 Algorithm ................................... 20 2.2.1 Simulated Annealing ........................... 20 2.2.2 Cost Function ............................... 21 2.2.3 Ttansition Rule .............................. 24 2.2.4 Solution Set ................................ 25 2.2.5 Speedup Strategies in MSASA .................... .. . 28 2.2.6 The MSASA Algorithm .......................... 31 2.3 Implementation ................................ 32 2.4 Results ..................................... 34 2.4.1 Initial Alignment ............................. 34 2.4.2 Transition Rules .............................. 35 2.4.3 Cost Comparison ............................. 36 2.4.4 Time Comparison ............................. 38 2.5 Conclusion ................................... 44 viii ix 3 Inferring Relatedness of Sequences based on Secondary Structure using Multiple Sequence Alignment 45 3.1 Introduction .................................. 47 3.2 Algorithm ................................... 50 3.2.1 Determination of Score Function ..................... 50 3.2.2 Simulated Annealing to Optimize the Score Function .......... 55 3.2.3 The Complexities of RNASA ....................... 59 3.3 Results ..................................... 59 3.3.1 Alignment and Secondary Structure .................... 61 3.3.2 Identification of Secondary Structures .................. 64 3.3.3 Initial Alignment .............................. 66 3.3.4 Window Sizes ............................... 66 3.3.5 Effect of Double Shuffle .......................... 68 3.3.6 Speed of Convergence ........................... 69 3.4 Conclusion ................................... 70 4 Inferring Relatedness of a Macromolecule to a Sequence Database Without Sequencing 71 4.1 Introduction .................................. 73 4.2 Methods .................................... 76 4.2.1 Overview .................................. 76 4.2.2 Obtaining Restriction Patterns and Restriction Maps .......... 78 4.2.3 Computing Closeness ............................ 81 4.2.4 Inferring Information from Closeness ................... 86 4.3 Results and Discussion ............................ 92 4.3.1 Experimental Procedure .......................... 92 4.3.2 sim(q, C(q, D, k)) ............................. 96 4.3.3 level ((1, C (q, D, k)) ............................. 101 4.3.4 Ordered. vs. Unordered .......................... 105 4.3.5 Simulation of Random Database ..................... 106 5 Conclusions 110 2.1 4.1 4.2 4.3 4.4 4.5 LIST OF TABLES Comparison of computation time and score in MSA and MSASA . . . . 37 Sequences by number of sites ........................ 94 Hit ratios for each 1:. 5% error bound. ................... 102 Hit ratios for different error bounds. .................... 103 Hit ratios for each k1 + kg. 5% error bound. ................ 104 Hit ratio of level(q, C(q, D, 2),) and level(C(q, D, 2),) .......... 106 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 LIST OF FIGURES Examples of the move sets. (a) original alignment A1. (b) new align- ment A2 after I nsertion(1,6, 2, right) to A1. (c) new alignment A3 after Deletion(3, 5, 2, left) to A2. (d) new alignment A4 after Swap(1, 7, 1, right) to A3. ........................ Annealing curve (Energy vs. Iteration). A is the starting point in the traditional SA approach. B is the starting point obtained from the fast heuristic approach. C is the minimal point. ............ The MSASA algorithm ............................ Example of the operation swap. (a) original alignment (b) alignment after swap(2, 4, 3, right) operation. (*) shows the changed part (Al) of the alignment and (+) shows the null columns. .............. Alignments of rat mast cell proteinase II, human plasma kallikrein, bovine chymotrypsin, bovine trypsin, pig elastase, and Streptomyces griseus trypsin. A1 is the alignment generated from MSA and A2 is generated from MSASA. The score of A1 is 9663 and the score of A2 is 9648. (*) shows different alignments generated from MSA and MSASA. Alignment of human plasma kallikrein, bovine chymotrypsin, bovine trypsin, and pig elastase generated from MSA and MSASA. ..... Alignment of human plasma kallikrein, bovine chymotrypsin, bovine trypsin, pig elastase, and Streptomyces griseus trypsin generated from MSA. ................................... Alignment of human plasma kallikrein, bovine chymotrypsin, bovine trypsin, pig elastase, and Streptomyces griseus trypsin generated from MSASA. The null columns (+) are not removed intentionally. Alignment of rat mast cell proteinase 11, human plasma kallikrein, bovine chymotrypsin, bovine trypsin, pig elastase, and Streptomyces griseus trypsin generated from MSA. ...................... 2.10 Alignment of rat mast cell proteinase 11, human plasma kallikrein, bovine chymotrypsin, bovine trypsin, pig elastase, and Streptomyces gn'seus trypsin generated from MSASA. The null columns (+) are not removed intentionally. ........... ' .................... 2.11 Running time for 4 to 20 sequences (1 million iterations) ......... 3.1 3.2 3.3 Example of an alignment and its sum matrix. ............... The transition rule double shuffle ...................... The RNASA algorithm ............................ xi 24 29 31 32 37 38 39 40 41 43 52 56 60 3.4 3.5 3.6 3.7 3.8 4.1 4.2 4.3 xii (a) Alignment of segments of 15 168 RNA sequences and possible sec- ondary structures. This region corresponds to nucleotides 996 to 1044 in Ecoli 16S RNA. Left ( (, [, {) and right 0, ], }) parts of the align- ment show the possible stem regions (A-A’ , B-B’ , C-C’). (b) Pos- sible common secondary structure from the alignment (a). (c) Sec- ondary structure of Ehr.bovis based on (b). ((1) Secondary structure of Hir.baltic based on (b). The symbol (e) indictes a 6-0 base pair and (I) indicates A-U and G-C base pairs. Note that the length of the loop and base pair regions are variable for each sequence. ...... (3) Alignment of segments of ten 16$ RNA sequences and possible sec- ondary structures. This region corresponds to nucleotides 133 to 229 in E.coli168 RNA. Left ( (, <, [, { ) and right ( ), >, ], } ) parts of the alignment show the possible stem regions. (b) Real common sec- ondary structure from the RDP. There are different possible secondary structures in the Region B (in the Box). Note that the length of the loop and base paired regions are variable for each sequence. ..... Alignment of segments of 16S RNA sequences. The names of the used 168 RNA sequences are from RDP. This region corresponds to nucleotides 61 to 106 in E. coli 16S RNA. .41 is the alignment generated by the progressive pairwise algorithm. A; is the final alignment generated by initial alignment A1 using RNASA with window size=4. A3 is a longer alignment generated by the progressive pairwise algorithm. A; is the final alignment generated by the initial alignment A3 using RNASA with window size=4. Left (0 and right (]) parts of possible stem re- gions (A, A’ , B, B’) are symmetric. Upper case characters indicate possible secondary structure regions. .................. Alignment of segments of 168 RNA sequences with different window sizes. Each alignment was obtained after 300000 iterations. B; is the align- ment with window size=1. 32 is the alignment with window size=2. 33 is the alignment with window size=3. A2 in Figure 3.6 is the align- ment with window size=4. B4 is the alignment with window size=5. A2 in Figure 3.6 gives the best match to a known structure. ..... (a) Annealing curve with single shuffle. (b) Annealing curve with double shuffle. Initial temperature T,- = 10‘ and final temperature T, = 1. Unit of score is 10“" and Unit of iteration is 10, 000 iterations. . . . . The overall process of solving the problem .................. (a) Restriction map for a. (M,(s) = {100,500,400}. (b) Re- striction pattern for b. (Pz (b) = {100, 200, 300, 400}. (c) mazcommon(M, (a), P,(b)) = 2 with common sites c1, 02. ...... Pairwise sequence similarity. The pairwise sequence similarity was de- termined for all pairwise combinations of 1575 sequences in the test database. Only sequence regions considered in further analysis were included. The similarity of the aligned pairs was determined using the pre-aligned sequences supplied by RDP. ................ 62 63 65 67 68 77 81 4.4 4.5 4.6 4.7 4.8 4.9 xiii sim(q, C(q, D, k)) for k=0,1,2 with 5% error bound, These results are shown by number of query Sites as mean and range, after discard- ing the 5% highest and lowest values. Bar chart at bottom indicates the number of queries with result sets with size greater than one. The y axis of the first diagram shows the primary similarity between query and result set. The y axis shows the number of query sequences. A site with the number of query sequences _<_ 10 is eliminated. ..... sim(q, C(q, D, 0)) with 10%, 15%, 20% error bounds. The 3; axis of first diagram shows the primary similarity between query and result set. The 3; axis shows the number of query sequences. These results are shown by number of query sites as mean and range, after discarding the 5% highest and lowest values. Bar chart at the bottom of the figure indicates the number of queries with result sets with size greater than one. A site with the number of query sequences _<_ 10 is eliminated. . sim(q, (C(q, D, k1 + 162))“; for k1 + k2=0 thru 4 with 5% error bound and two enzymes. These results are shown by number of query sites as mean and range after discarding the 5% highest and lowest values. Bar chart at bottom indicates the number of queries with result sets with size greater than one. A site with the number of query sequences 5 10 is eliminated. ............................ sim(q, C(q, D, 2)), and sim(q,C(q, D, 2)): with 5% error bound. The y axis of the diagram shows the primary similarity between query and result set. The curve ”8” is generated by merging s and the curve ”q” is generated by merging q. ....................... Accumulation of sim(q, s) for random databases with edge lengths 98%, 96%, 94%, 92%. Bar charts for each diagram show the standard devi- ations. y shows the total number of queries with output. a: shows the primary similarity between query and matched database sequences. Accumulation of sim(q, s) for random databases with uniform primary similarities 99.8%, 99.7%, 99.6%, 99.4%, 99.2%. Bar charts for each diagram show the standard deviations. ................. 98 101 107 109 Chapter 1 Introduction 1.1 Basic Concepts 1.1.1 Sequences A sequence is a linear array of elements that are symbols or letters of an alphabet. Let K={a1,a2, - - -,a;} be an alphabet ofl symbols. Then 3 = blbz - - ~b,, is a sequence where b,- E K. For DNA (RNA) sequences, the alphabet is K = {A,T(U), G, C} where ’A’ for adenine, ’T’ for thymine (or ’U’ for uracil), ’G’ for guanine, ’C’ for cytosine. For protein sequences, the alphabet contains 20 symbols which represent 20 types of amino acids. To sequence a long fragment of DNA, it must be cut into small fragments, as present techniques do not permit the sequencing of fragments of more than a few hundred bases. At first the DNA is cut into overlapping fragments. These are then se- quenced individually, before being reassembled by searching for the overlapped edges, 1 to reconstruct the whole sequences. 1.1.2 Similarity of Macromolecules When two or more sequences are compared, it is not so obvious how to assess the similarities between them. This requires some sort of natural biological metric which is a measure of the similarity. AS an alternative to similarity, one can score the distance between two sequences in a particular alignment based on a metric. Many different alignments can be generated from multiple sequences maps. We seek an alignment with an optimal score from these alignments. This optimal alignment gives us the ability to estimate the biological relatedness between sequences. There are two issues to identify the similarity of sequences. 1. Score function: It should be clear that appropriate similarity scores can only be calculated from a specific model of similarity. The user must either develop his own, or more likely use a set which is already developed. The score function should reflect the specific types of similarity between sequences. For example, if we want to identify the primary similarity of sequences, the optimal alignment based on the score function should give us the best picture of the primary similarity. 2. Algorithm to optimize score function: There are several types of algorithms to score alignments. Exhaustive methods guarantee an optimal alignment. Heuris- tic methods are used to find, within reasonable time, good alignments that are not necessarily optimal. Users should decide the algorithm to be applied for a 3 score function based on their application. 1.1.3 Alignment The term alignment is used to refer to a particular arrangement of two or more se- quences in sequence comparisons. It is a pattern matching process that finds a corre- spondence among the elements of the sequences. Any two macromolecular sequences in a homologous group are usually not identical due to substitutions, insertions or deletions of the elements of the set of macromolecular sequences. Example 2. Let sl=AATAG and 32=AATCAG be two DNA sequences. A possible alignment of 31 and 32 is sl=AAT-AG 32=AATCAG where ’-’ represents an insertion. This alignment represents one possible event in evolution. 1.1.4 Restriction Patterns and Restriction Maps A restriction enzyme typically cleaves a fragment of DNA at specific subsequences. A restriction map is a set of ordered integers which gives the fragment length along the strand of the restriction sites at which a restriction enzyme cleaves the DNA. Example 1. Let s=AGCCCGGCCAAGGCCAA be a DNA sequence. Enzyme Hoe III identifies the substring GGCC and cleaves in the middle of GGCC. If Hoe 111 is applied to the sequence 8, s is cleaved to a set of fragments { AGCCCGG. CCAAGG. 4 CCAA }. Then the restriction map for s is {7, 6, 4}. Restriction maps are usually constructed prior to sequencing of DNA and many mapped DNAs have never been sequenced. One or more restriction enzymes are applied to cleave the macromolecule into many fragments. At this time only the lengths of the fragments can be identified. This set of unordered integers which gives the fragment length is called restriction pattern. The lengths of these fragments can be determined by separating them by electrophoresis on a polyacrylamide gel. 1.2 Problem Definitions 1.2.1 Multiple Sequence Alignment The comparison of more than two sequences of number or letters is common in sev- eral fields, such as molecular biology, speech recognition. Sequence comparison is particularly important in molecular biology where it has been critical in the study of evolution and in the analysis of protein structure / function relationships. Multiple sequence alignment is used to find an optimal alignment based on certain criteria. There may be several different sets of optimality. These are definitions for the mul- tiple sequence alignment. Definitions A gap is the symbol ”-”. Analphabet K=01,02,"',0118388t0fl ' symbols containing gaps. it can be nucleic acid or amino acid sequences. 5 An element is a member of the alphabet K. A sequences = blbz - - b, is a strings of alphabet if b,- E K for i = l, 2, - - - , n. A multiple sequence alignment 3 = {31, 32, - - - , sh} is a set of sequences of K. sl=b},b;,..-,b,l, 32=b¥,b§,...,bi 31 =bfrbgr'nrb: where bf 6 K for j = 1,2,---,n and i = l,2,---,k. An alignment of the sequences 31, - - - , s. is another set of sequences, SI, - - - , s: is obtained from s,- by inserting gaps in positions where some of the other sequences have a non-blank character. There exist many different possible alignments. The multiple sequence alignment problem is to find out the optimal alignment based on a certain score function for the optimality of an alignment. However, different authors may have different sets of score functions for the optimality of an alignment. Sometimes authors have to generate their own score functions for their own purposes. To identify the primary similarity of the sequences, the score function to reflect the primary similarity among sequences should be used. To identify common secondary structures of the sequences, the score function to reflect the common secondary structures among sequences is used. Authors have been focused in solving the problem by comparing the N sequences simultaneously by dynamic programming using an N dimensional matrix. However, 6 the complexity of the problem restricts this approach to N g 6 and n,- = 200 ~ 300. Hence various heuristic (possibly sub-optimal) approaches are explored to solve the problem. 1.2.2 Restriction Map Database Searches Sequence database search is a standard tool in the identification and characterization of new organisms. However, there are many situations in which sequencing of new organisms is not practical. One simple method to substitute sequence database search is to use restriction patterns and restriction map databases. Maximum number of common restriction sites between a new organism and database sequences is calculated and used as a measure of similarity. To compare restriction patterns and restriction maps in the restriction map database, a problem called Maximum Site Matching Problem (MSMP) is defined in this thesis. MSMP is the problem to find maximum common sites between a restriction pattern and a restriction map. In the MSMP, we have as data the restriction pattern from a new organism a a restriction map from a sequence b in a sequence database when a same enzyme is used, say, Restriction pattern A = {a,- : 1 S i 5 n} from the digest of sequence a Restriction map B = {b,- : l S i S m} from the nucleotide sequence b A will be an unordered set, B will be an ordered set. In general A,B will be multi-sets; that is, there may be values of fragment lengths that occur more than once. Also, 2 -= Z ,-=L (1.1) 7 Given the above data the problem is to find orderings for the restriction pattern A such that the number of common sites implied by this ordering is maximum. 1.3 Previous Studies 1.3.1 Multiple Sequence Alignment Based on Primary Sim- ilarity Existing methods of multiple protein sequence alignment based on primary similar- ity have time complexity varying from 0(Nn2) to 0(nN where N is the number of sequences and n is the length of the sequences. Most of the methods attempt to find, within a reasonable time, good alignments that are not necessarily optimal. Some methods guarantee optimal alignment in accordance with some pre-specified criteria. The three major multiple sequence alignment methods may be summarized as follows. The main issue in these methods is how to optimize a score function. Clustering This approach either constructs an approximate phylogenetic tree to guide the align- ment process or arranges the sequences into some sequential order of alignment. Sev- eral methods [12, 36, 34] adopt the same hierarchical clustering procedure but ues a different score measure for tree construction. The methods of Feng and Doolittle [20] and Taylor [78, 79] use less rigorous clustering procedures for tree construction. The methods of Barton and Sternberg [7] and Martinez [52] arrange the sequences into a 8 certain order based on that the sequences are aligned one by one. Dynamic Programming This approach refers to the simultaneous comparison of N sequences using an N - dimensional dynamic programming matrix. For example, the algorithm of Needleman and Wunch [59] had been extended directly to the comparison of three sequences using a three-dimensional matrix [43]. However, Murata et al. [58] showed that the time complexity of this direct extension would be 0(n5) for three sequences of length n but they were able to reduce it to 0(n3) by using two three-dimensional matrices. A difl'erent dynamic programming algorithm of 0(n3) running time for three sequences was also presented by Fredman [26]. Both algorithms use a gap weight (penalty) which is independent of gap length. However, this drawback is removed by Gotoh [28] whose algorithm runs in 0(n3) time and uses a linear gap weighting function. The algorithm of Murata et al., [58] and the algorithm of Gotoh [28] explicitly specify the weight of simultaneous comparison of three residues. Such a criterion can be extended to the evaluation of an alignment of N sequences, i,e. the cost of aligning the N sequences is taken as the sum of the cost of aligning the N (N — 1) /2 pairs of sequences. This measure of cost is called the SP measure (Sum of Pairs measure) [49]. The algorithm [22] for two sequences looks for the Optimal path within a strip of the two-dimensional matrix as defined by an upper bound of the cost of the optimal path. Based on the SP measure, Carrillo and Lipman [10] have proposed a strategy for the alignment of N sequences which is similar to that of Fickett’s algorithm. Given the upper bound of the cost of a multiple sequence alignment, Carrillo and Lipman 9 showed that the upper bound of the alignment cost of each pair of the sequences can be obtained. The upper-bound of the alignment cost of two sequences, say :I: and y, defines a region on the two-dimensional plane, say (2:, y), of the two sequences. Within this region lies the projection of the optimal path of the N sequences onto (2:, y). The regions on all two-dimensional planes in turn define the region in which the Optimal path in the N -dimensional space would be located. Therefore, the search of the optimal path of N sequences is limited in a certain region of the N -dimensional matrix and the search time is cut down. The algorithm of Lipman et al., [49] adopts the aforementioned strategy of Carrillo and Lipman. it uses a heuristic procedure to obtain an initial alignment based on which the upper bounds of the pairwise alignment costs can be set. Alternatively, the users may specify any set of bounds. The algorithm has two specific features. One is the use of quasi-natural gap costs as proposed by Altschul [l] and the other is the option of either a weighted SP measure or an unweighted SP measure. In the weighted SP measure, different weights are assigned to the pairwise alignment costs following the methods proposed by Altschul et al. [1]. The purpose is to discount the dominance of a set of very similar sequences in the multiple sequence alignment. The algorithm of Lipman et al. [49] can align six to eight sequences of 200-300 residues which renders it superior to the foregoing three-sequence methods. The biggest obstacle to using dynamic programming for multiple sequence align- ment is its computational requirements of the method. Given these difficulties, heuris- tics and modified cost function are applied to dynamic programming. 10 Simulated Annealing SA is a good heuristic approach to solve combinatorial optimization problems. SA is a version of a successful statistical model of thermodynamic processes for growing crystals that has been transformed into computational terms. A perfectly homogeneous crystal lattice represents a configuration of solid state material at a global minimum of energy. The solid state material is heated to a high temperature until it reaches an amorphous liquid state. Then, it is cooled very slowly and according to a specific schedule of decreasing the temperature. If the cooling is perfect, then the atoms will arrange themselves in a pattern that closely resembles the global energy minimum of the perfect crystal. Thermodynamics teaches that the thermal equilibrium at temperature T is a probability distribution in which a state with energy E is attained with the Boltzmann probability e-‘l‘. In a theoretical model of what occurs inside the material during the annealing process, states are continually perturbed by the introduction of small random changes of the positions of atoms in the matter. If the result is a state of lower energy, then the state perturbation is unconditionally accepted. If not, then the state perturbation may still be accepted with a probability of eAE/T. This probability decreases with the temperature. Several authors have suggested simulated annealing as an alternative approach to overcome the limitations of dynamic programming. Lukashin et al. [50] applied SA to human intro sequences with entrapy as a cost function. Ishikawa et al. [38] applied 11 SA to align protein sequences the same cost function as that used in Gotoh [28]. 1.3.2 Multiple Sequence Alignment Based on Secondary Structures It is well-known that secondary structures of RNA are assumed to play an important physiological role. The main issues in multiple RNA sequence alignment is how to generate a score function. Currently, there is no clear global score function for multiple RNA sequence alignment. Most of methods predict local secondary structures and sometimes there is no way to to know the optimality of the alignment obtained. Manual Method This approach needs an alignment based on primary similarity. Then, base changes between the compared sequences are noted. And the places where compensating base changes maintain Watson-Crick complementarity between two potential pairing regions is taken as evidence for the existence of a true stem at that position. This method is done by hand [11, 32, 54, 63]. Context flee Grammar Method Stochastic context-free grammars are applied to the problems of multiple alignment of RNA families [18, 70, 73]. This approach is related to use of Hidden Markov Models (HMMs) to model E. coli DNA. It incorporates elements of both the thermodynamic and phylogenetic approaches, with emphasis on the latter. The method of Sakakibara et. al [70] requires initial knowledge of secondary structures. In contrast, Eddy and 12 Durbin [18] derive the structure of the grammar directly from unaligned sequences and estimate the probability parameters of the resulting grammar using Expectation Maximization (EM). 1.3.3 Restriction Map Database Search Some work has been done on constructing restriction maps from restriction pat- terns [64, 24, 47, 17, 8, 30, 89], but this work typically assumes that there are over- lapping fragments. Several authors [15, 19, 37, 60, 76] studied the restriction site matching probability with given primary similarity of the two sequences. Also simple fragments pattern matching method is used for identification of query sequences. However, this simple pattern matching does not work in some cases. No rigorous works for restriction map searches using restriction patterns have been done. 1.4 Our Studies 1.4.1 Multiple Sequence Alignment In this thesis, we propose two algorithms, MSASA for protein sequences and RNASA for RNA Sequences, based on simulated annealing. In chapter 2, multiple sequence alignment for protein sequences is studied. In MSASA, two cost functions, natural gap costs and quasi-natural gap costs, for protein sequence alignment are discussed as well as the relationships between two methods and two cost functions. We analyzed 13 the solution sets of multiple sequence alignments and suggest a technique to reduce converging time by confining the solution sets. Another speedup strategy is suggested that involves using fast heuristic algorithm as the high temperature phase. 'II'ansi- tion rules to get a new alignment for simulated annealing are generalized, pointin- gout that any new alignments can be obtained by applying those transition rules. Speedup strategies related to length of the alignment and high temperature phase are discussed. We implement MSASA and Show SA can overcome the limitation of dy- namic programming by comparing the output alignment from MSASA and dynamic programming In chapter 3, multiple sequence alignment for RNA sequences is studied. We proposed an algorithm RN ASA to align multiple RNA sequences to identify possi- ble secondary structures. Dot matrix generated from intro-sequence comparison is expanded to find alignments with maximally overlapped potential secondary struc- tures. When the bases are compared, a certain probability of assigning a hit. This probability is used to make a score function. We use a different transition rule called a double shufl‘le in RNASA, which can generate faster convergence to optimal. We show the usefulness of RNASA with experimental results. 1.4.2 Restriction Map Database Searches In chapter 4, we study the problem of obtaining biological information about a macro- molecule isolate using only the restriction patterns of unknown macromolecules and restriction map databases. We propose a three step approach to solve this problem. 14 In the second step We formulate a maximum site matching problem , show this prob- lem is in the class of NP-complete problems, and suggest a heuristic approach to attack this problem. We demonstrate that this three step approach can be used to infer the identification of an unknown macromolecule. Chapter 2 Inferring Relatedness of Sequences based on Primary Similarity using Multiple Sequence Alignment Multiple sequence alignment is a useful technique for studying molecular evolution and analyzing structure-sequence relationships. Dynamic programming of multiple sequence alignment has been widely used to find an optimal alignment. However, dynamic programming does not allow for certain types of gap costs, and it limits the number of sequences that can be aligned due to its high computational complexity. The focus of this paper is to use simulated annealing as the basis for developing an effi- cient multiple sequence alignment. An algorithm called Multiple Sequence Alignment using Simulated Annealing (MSASA). The computational complexity of MSASA is significantly reduced by replacing the high temperature phase of the annealing pro- 15 16 cess by a fast heuristic algorithm. This heuristic algorithm facilitates in minimizing the solution set of the low temperature phase of the annealing process. Compared to the dynamic programming approach, MSASA can (a) use natural gap costs which can generate better solution (b) align more sequences (c) take less computation time. 17 2.1 Introduction Sequence alignment methods are useful tools to obtain regions of sequence similar- ities of particular interest. Pairwise alignment has been widely used in sequence alignment, but multiple alignment reveals more information than pairwise alignment. After Needleman and Wunch [59] introduced dynamic programming into pairwise alignment, efforts were made to apply dynamic programming to multiple Sequence alignment. Multiple sequence alignment methods can be divided into two different types of algorithms; heuristic algorithms and exhaustive algorithms. Heuristic algo- rithms [7, 6, 12, 21, 20, 29, 33, 34, 36, 42, 52, 71, 77, 83] try to find out good but not necessarily optimal alignments within a reasonable time. Most of these heuristic algorithms construct a phylogenetic tree for the alignment of the sequences or assign the sequences to a particular order. The sequences are aligned one by one related to the order. The exhaustive approach [16, 23, 26, 28, 35, 58, 72, 74] based on dynamic program- ming tries to compare sequences simultaneously. This approach refers to the simul- taneous comparison of N sequences using an N -dimensional dynamic programming matrix [86]. For example, the algorithm of three sequences using a three-dimensional matrix [43]. The algorithm of Pickett [22] for two sequences looks for the Optimal path within a strip of the two-dimensional matrix as defined by an upper bound of the cost of the 18 optimal path., Carrillo and Lipman [10] have proposed a strategy for the alignment Of N sequences which is similar to that Of Fickett’s algorithm. Given the upper bound of the cost of a multiple sequence alignment, Carrillo and Lipman showed that the upper bound Of the alignment cost of each pair of the sequences can be obtained. The upper bound of the alignment cost Of two sequences, say a and b, defines a region on the two—dimensional plane, say (a, b), Of the two sequences. Within this region lies the projection Of the optimal path of the N sequences onto (a, b). The regions on all two-dimensional planes in turn define the region in which the optimal path in the N -dimensional space would be located. Therefore, the search of the Optimal path Of N sequences is limited in a certain region of the N -dimensional matrix and the search time is cut down. This approach guarantees Optimal alignment. Although variations of dynamic programming have been widely used to derive optimal alignments, there are certain limitations. One important problem in multiple sequence alignment is to define substitution costs and gap costs. In pairwise alignment, researchers define substitution costs and gap costs and try to minimize or maximize the total cost. Gap costs and substitution costs in multiple sequence alignment should be defined by using the same rationale as used in pairwise alignment. Altschul [1] analyzed several types Of gap cost and substitution cost for multiple alignments. He pointed out that previously defined gap costs in a multiple alignment were not clearly tied to their substitution costs. He suggested a natural gap cost which was clearly related to its substitution cost. Lipman et al. [49] implemented the Multiple Sequence Alignment (MSA) program 19 to align more than three sequences using dynamic programming. In MSA, quasi- natural gap costs were used instead Of natural-gap costs because natural gap costs for dynamic programming require impractically long computation time [1] . Due to the type of gap costs used, MSA cannot guarantee producing an optimal multiple alignment in some special cases. Another problem in expanding dynamic programming to multiple sequence align- ment is its high computational complexity. In pairwise alignment, the computational complexity is 0(m-n) where m, n are the lengths of the sequences. But when dynamic programming is used for multiple sequence alignment, its computational complexity becomes proportional to the product of the lengths of the sequences to be aligned. Therefore the exponential growth in computational complexity makes dynamic pro- gramming impractical for aligning more than three sequences [27, 58]. Lipman [49] has applied dynamic programming to MSA by reducing the solution space using a heuristic algorithm. By confining the solution space, the MSA program can align four to six sequences of length 200-300 residues using rigorous bounds. Several authors [38, 50] have suggested simulated annealing (SA) as an alternative approach to overcome the limitations of dynamic programming. SA is a good heuristic method to solve combinatorial optimization problems [46]. Ishikawa et al. [38] applied SA to align protein sequences with the same cost function as that used in Gotoh [28]. To reduce the long computation time, they utilized a parallel computer for faster convergence to Optimal solution, and discussed temperature parallel algorithm which does not require any temperature scheduling. Lukashin et al. [50] applied SA to 20 human intron sequences with entropy as a cost function. In this study, we developed a method for multiple sequence alignment called Mul- tiple Sequence Alignment using Simulated Annealing (MSASA). Simulated annealing is a good heuristic method to solve combinatorial Optimization problems [46]. Be- cause traditional simulated annealing requires long computation time, several speedup strategies have been incorporated into MSASA. The results generated from MSASA were compared to those from MSA. By applying these speedup strategies we show that MSASA can overcome the problems of high computational complexity and the inability to use natural gap costs in MSA. 2.2 Algorithm 2.2.1 Simulated Annealing Simulated annealing (SA), introduced by Kirkpatrick [46], is a probabilistic ap- proach that can be used to find a global minimum of a function in combinatorial Optimization problems. To apply this algorithm to an optimization problem, a state space X ={z1,---,:r,,} and a cost function 0 : X -> R, where R is the set of real number, should be defined. A real value 0 (X ) should be assigned tO each state m. The goal of the Optimization problem is to find the Optimal state 2:9,, whose score is min(ma:r) {2,- | 1 5 i 5 n}. Simulated annealing continuously generates a new state 2,... from a current state same,“ by applying transition rules and acceptance rules 21 proposed by Metropolis (1953). The criteria Of the acceptance rules are: 1. If AB 5 0, accept a new state raw. 2. If AE > 0, accept a new state mm,” with probability P(AE) = e‘AE/T where T is a temperature and AE = C(znm) — C (swam) is a cost difference. Probability P(AE) prevents fixation at local minimum. A state 2:me is called local minimum if there is no state mm in X that is generated from the state Emma, by applying the single transition rule and that has a lower cost than that Of the scam"... Temperature T controls a probability to accept a new state 2m. Initially, T starts from a high temperature and after every iteration, T decreases to become zero by applying an annealing schedule. The probability of accepting a new state with a higher cost than that Of the current state also decreases as temperature T decreases. If a careful annealing schedule and number Of iterations are given, SA converges to a global minimum state mm. The main disadvantage of SA is its requirement for a large amount of computation time. Because SA is based on Monte-Carlo methods which allow for a new state with a higher cost than that of a current state. To reduce this computation time, speedup strategies are used in MSASA. 2.2.2 Cost Function To be used in sequence alignment, a cost function should be explicitly defined as a measure Of overall alignment quality. Altschul (1989) classified several global cost 22 functions for multiple sequence alignment. These cost functions were composed of substitution costs and gap costs. Substitution Costs Substitution costs are the costs for aligning n number Of sequences which include costs for aligning letters which represent amino acids with nulls. The algorithm Of Murata et al. [58] and the algorithm of Gotoh [28] explicitly Specifly that the weight of the simultaneous comparison of three residues is the sum Of the weights Of the pairwise comparisons of the three residues. Such criterion can be extended to the evaluation Of an alignment Of N sequences, i.e. the cost Of aligning of N sequences is taken as the sum of the costs Of aligning the N (N — 1) / 2 pairs Of sequences. This measure of cost is called the SP mwsure (Sum of Pairs measure) [49]. Substitution costs used in this study were sum Of pairs (SP) substitution costs [5, 28, 58]. In MSA and MSASA, SP substitution costs were used. Gap Costs Gaps are maximal strings Of consecutive nulls (’-’) in one sequence aligned with letters in the other sequences. Gap costs are assessed separately from null costs. Null costs are the substitution costs for aligning individual letters with nulls. The alignment Of two sequences, 3,- and sj, derived from a multiple sequence align- ment 0 is referred as the projection Of 0: onto 3,- and s,- [1, 3]. If the number Of gaps in a is defined as the total number of gaps in all the projections of alpha and each 23 gap is assigned as the total number of gaps in all the projections Of a and each gap is assigned a constant cost, then the total gap cost of a will be called a natural gap cost [1]. Natural gap costs are defined based on the same rationale that is commonly adOpted in defining substitution costs. Altschul [1] points out that previously defined gap costs of multiple sequence alignments are proposed independently of the substitu- tion costs. He argues that, since gap costs and substitution costs together determine an Optimal alignment, they should be defined based on the same rationale. He sug- gested the natural gap cost as a gap cost which is clearly related to the substitution cost. But the amount Of information to keep in a cell with dynamic programming tO apply natural gap costs increases 0([n/ln(2)]"\/h') where n is the number of se- quences [1]. He defined quasi-natural gap costs similarly to natural gap costs except for some special cases that massively reduce the amount of information to keep in a cell. Quasi-natural gap costs were used in MSA instead of natural gap costs. One important advantage of MSASA over the dynamic programming approach used in MSA is its ability tO apply any gap costs, including natural gap costs, in multiple sequence alignment. This is because, unlike dynamic programming MSASA does not require space to keep the information. After applying the transition rule to a current alignment, a completely new alignment can be Obtained and all the information to be applied to any specific gap costs can be identified. In contrast, any complete alignment can not be obtained in dynamic programming until all Of the computations are finished. 2.2.3 24 Transition Rule Several Operations can be applied to a current alignment to generate a new candidate alignment. Basically, all the Operations are related to change the positions of the nulls (’-’) in the sequences. The basic operations are as follows. e Insertion (i, j, 1:, direction): This Operation inserts the lc number of consecutive nulls from the left /right (direction) Of column j in the sequence i. e Deletion (i, j, 1:, direction): This Operation deletes the 1: consecutive number of nulls from the left / right (direction) Of column j where columns j -— a through j + E (a, ,6 Z 1:) make a gap where there are consecutive nulls in a sequence. 0 Shuffle (i, j, 1:, direction): This Operation shuffles the left /right (direction) nulls from the null column j (including null j) in the sequence i and its left / right (direction) 1: consecutive characters. Figure 2.1 shows examples of the move sets. By modifying these move sets, effective A1 HKQIGGAHGSLA-- HKKIGGATGALG-- HK---IGGAHGSLA A3 HKQIGG--AHGSLA HKKIGGATGAL -- HK-IGGAHGSLA-- Figure 2.1: Examples of the move sets. alignment A2 after I nsertion(1,6, 2, right) to A1. A2 MKQIGG--AHGSLA HKKIGGATGALG-- HK---IGGAHGSLA A4 HKQIGGA--HGSLA HKKIGGATGAL -- HK-IGGAHGSLA-- (a) original alignment A1. (b) new (c) new alignment A3 after Deletion(3, 5, 2, left) to A2. (d) new alignment A4 after Swap(1, 7, 1, right) to A3. 25 move sets for a different type Of multiple sequence alignment problem can be Ob- tained. Also move sets can be applied to the different sequences simultaneously. The parameters i, j and direction in the move set rules may be randomly determined. But It may be determined by a certain distribution function, for example uniform distribution or inverse function related to the size of It. Only experiment can tell which is the best distribution function for It. Only one basic operation swap is used as a transition rule in MSASA. This Oper- ation generated a new alignment from the current alignment. 2.2.4 Solution Set Definition 1 An alphabet is a finite set of characters and null(’-’). A sequence is a finite string of characters. A pseudo-sequence is a finite string of characters and nulls. A pseudo-alignment of the sequence a1, a2, - - ~ , an is a set of padded pseudo-sequences dual), - - - , a; of equal length and the removal of the nulls from a; generates a,-. A null column is a column whose elements are all nulls. A character column is a column such that at least one of its elements is a character. An alignment is a pseudo-alignment whose columns are only character columns. Let 1., be a length of a sequence a,-. The range Of the length l of an alignment of the sequences a1, a2, - - - , an can be calculated by the definition of an alignment. The 26 range l is 2...... = manta“) g z s 1...... = in“) (2.1) i=1 In this paper the length of the pseudo-alignment Of the sequences a1,a2, - - - ,a,, is confined by the above equation. Definition 2 Let S; be the set of all alignments with the same set of sequences, and each alignment in the S; has its own cost, and the length of each alignment is I. Then the multiple sequence alignment problem is defined as finding the alignment with the smallest cost in the solution set S,:::(= U‘minslslmcc 5.). In the SA process, the new pseudo-alignment generated from the current pseudo- alignment, using the transition rule, may have null columns. These null columns do not afiect the cost of a pseudo-alignment. Figure 2.1 shows how the null columns can be generated during the SA process. These null columns cannot simply be deleted during the SA process. The removal of the null columns during the SA process re- duces the length Of the pseudo-alignment and it decreases the size of the solution set. Therefore optimal solution may be removed from the solution set. The null columns can be removed only after the SA process. Lemma 1 Let P; be a set of pseudo-alignments with the same set of sequences such that the length of any alignment in H is I. Let p; be a member of the set R. Any pseudo-alignment p[ in H can be generated from a pseudo-alignment p; by applying a finite number of swap operations. Proof: The swap Operation does not decrease or increase the number of nulls. It only changes the positions of the nulls in the alignment. Any pseudo-alignment p[ in the 27 H has same finite length l and same finite number of nulls in each sequence. Only the positions Of the nulls in each sequence are different from each other. The nulls in p; can be moved to the positions of the nulls in p[ within a finite number Of swap Operations. So the pseudo-alignment p; can be transformed to any pseudo-alignment p[ in the set P; by applying the swap Operations. 0 Lemma 2 The set of alignment Sim with same set of sequences can be generated from any alignment p; by applying a finite number of swap operations. Proof: From the lemma 1, H can be generated from any p. by applying the swap Op- erations. The set of pseudo-alignments which have no null columns in P; is equivalent to $1. And the set Of pseudo-alignments which have I: null columns in P; is equivalent to 8;... The maximum of k is l — lm,,,. Therefore eliminating the null columns from the pseudo-alignments in P; generates Sim. . CI Theorem Let two pseudo-alignments with same set of sequences be p;, and p1, where II > la. The alignment set, 52".”, obtained from the alignment p;, by applying the swap operations is the superset of the set, 52...: obtained from the alignment pi, by applying limited number of swap operations. Proof: From the Lemma 2, the set 3;, and 8;, can be generated from the alignment pg, and p;, by applying the swap Operations. The set 5,2.“ can be formed Slim = Slim U SE,“ (2'2) Therefore the set 3a,. is the superset of the set 3‘:.’....- C] In MSASA, the solution set 5:33 is generated from an initial pseudo-alignment 28 pgm by applying the swap operations. The theorem shows that the longer initial alignment generates the bigger solution set by applying the swap operations. This bigger solution set increases the probability to find an Optimal solution. 2.2.5 Speedup Strategies in MSASA Heuristic Algorithm as The First Phase Simulated annealing is composed of roughly two phases : a high-temperature phase and a low-temperature phase. In the high temperature phase, SA gives a high proba- bility to all the states with higher costs than that Of a current state. This allows any state in the solution set to be a current state. At a lower temperature phase, SA gives a high probability to states with a lower or not much higher temperature than that Of a current state. This allows only the states near a current state to be a current state. The high-temperature phase is similar to a random search, and the low-temperature phase is similar to a greedy local search. Several authors [31, 69, 68, 67] suggest good heuristic algorithm as a first phase and a simulated annealing approach as a second phase for fine Optimization to the standard-cell-placement algorithm. In MSASA, the same approach is used. The output alignment generated from the same heuristic algorithm used in MSA is used as the first phase. This heuristic algorithm is similar to progressive pairwise alignment used in the studies of Water- man and Perlwitz [85], Feng and Doolittle [20]. MSASA can eliminate the high- temperature phase and reduce annealing time by using the output alignment gener- 29 ated from the heuristic algorithm. Figure 2.2 shows the annealing curve and different starting points. SA time P can be saved when the system starts from point B which Energy A 1 |«-—u[ Iteration P Figure 2.2: Annealing curve (Energy vs. Iteration). A is the starting point in the traditional SA approach. B is the starting point Obtained from the fast heuristic approach. C is the minimal point. is obtained from the fast heuristic approach instead Of point A. It is clear that SA time can be reduced if point A is closer to the optimal point C. When the alignment is Obtained from the heuristic approach, the initial temperature should be lower than the initial temperature when traditional SA is applied. Confined Solution Set To apply dynamic programming for aligning n sequences, a fixed amount Of compu- tation is required for each cell of the n-dimensional lattice. The total computational time in dynamic programming is proportional to the product of a and fl, a-fi, where a is the fixed number Of computations for each cell and fl is the size Of the n-dimensional lattice. These two factors, a and fl, limit the usage of dynamic programming in mul- tiple sequence alignment. Natural gap costs make the factor a too large for aligning 30 more than five sequences. The factor fl becomes too large for aligning more than three sequences with average protein length (about 200-300 residues). Fickett [22] suggested a way to reduce the solution space in pairwise alignment. He searched the Optimal path within a diagonal band Of the two—dimensional matrix as defined by an upper bound of the cost of the optimal path. Carrillo and Lipman [10]) expanded the idea to reduce the solution space for aligning n sequences. They cal- culated the upper bounds of the alignment cost of each pair of the sequences and confined the solution between each pair of the sequences using those upper bounds. Therefore, they could reduce the computation time by applying dynamic program- ming in only the limited solution space in an n-dimensional lattice. In MSASA, the solution set is confined to 3:33 . Therefore, if the length of the optimal solution is larger than l,-,,,-., the Optimal solution can not be found in the solution set. It is clear that an Optimal solution can be found if the solution set 5,3: is used. This solution set can be obtained from a pseudo-alignment p1,", by applying the swap Operations. Due to its long SA time, it is not practical to use a pseudo-alignment p1,.“ as an initial alignment. Instead of a pseudo-alignment p1,“, the alignment generated by the progressive pairwise algorithm is used as an initial alignment. The solution set 8:3; can be expanded with longer initial alignment. Once the final alignment which has null columns found in a given solution set is achieved, increasing the solution set further rarely leads to improvement. Such an alignment is Optimal for the given solution set. 31 begin , current.pseudo.alignment (— Output generated from the heuristic algorithm; T ‘- Tinitial; Eamm (— C (current.pseudo_alignment) where C is a cost function; f inal.pseudo.alignment (— current.pseudo_alignment; while (T > T1,“) :1: (— random(column(1..l)) ; for each sequence i in the current.pseudo.alignment if column :I: is a null, then apply the swap operation; if column :1: is a character, than do nothing; end for; Em (— C (new.pseudo.alignment); if ((Em < Emma.) then current_pseudo_alignment (— new.pseudo.alignment; Ecurrcnt *— Enew; if (Em < Emu) then Ernin +— Enew; f inal.,pseudo.alignment (— current.pseudo.alignment; end if; end if; T(-'y-T,0<'y< 1; end while; Remove null columns from the f inal.pseudo.alignment; Return f inalhlignment and minimal cost Ema, as an optimal alignment; end Figure 2.3: The MSASA algorithm 2.2.6 The MSASA Algorithm The simplest data structure for sequences is an array. An alignment is a two dimensional array Of symbols in which each row is a one-to-one collinear representation of a molecular sequence. When a swap Operation is applied to sequences, only the positions of the nulls are changed in an alignment. MSASA does not require any complicated data structures. The algorithm for MSASA is presented in Figure 2.3. The space complexity of MSASA is 0(n - l), where l is the length of the sequence 32 and n is the number of sequences. In each iteration the time to apply swap Operations to each sequence is 0(n - Al) where n is the number of the sequences and Al is the changed part of the alignment when swap operations are applied. ++ tittttt HKQIGGA--MGSLA- HKQIGGA--HGSLA- HKK---IGGATGALG HKKIGGA---TGALG (a) (b) Figure 2.4: Example of the Operation swap. (a) original alignment (b) alignment after swap(2, 4, 3, right) operation. (*) shows the changed part (Al) of the alignment and (+) shows the null columns. Figure 2.4 (b) showed the changed part (Al) in the alignment. Only the difference of the current costs and new costs of the columns which are affected by the changed part is used to compute new costs. Al could be referred to as a constant because it is not changed by the number Of sequences or length of the sequences to be aligned. The time to calculate the cost of the changed part takes (n - (n — l) - Al). Therefore, the time complexity of MSASA is 0(n2 - k) where k is the number Of iterations. The time complexity Of MSASA does not depend on the length of the sequences in alignment but to the number of sequences n and the number of iterations k. 2.3 Implementation The program in this paper was implemented and tested on a Sun SPARC2 running Solaris 2.2 which is a UNIX Operating system. The MSASA algorithm was coded in ANSI C (SPARC compiler C 2.0.1). MSASA and MSA were tested and compared in the same environment. MSASA was implemented to produce multiple alignments 33 and was compared to MSA. The experiments were performed on three serine protease families - chymotrypsin, trypsin and elastase. Natural gap costs are used as gap costs in MSASA whereas quasi-natural gap costs are used in MSA. The cost of one gap was 8. In MSASA, the PAM-250 matrix [14], derived from a study of amino acid replacements in homologous proteins, was used for substitution costs. The modified substitution costs (17 minus the values in the PAM- 250 matrix) were used in both algorithms. The SP substitution costs in MSA have two options, weighted SP substitution costs and unweighted substitution costs. In the weighted SP substitution costs, weights are applied to the pairwise alignments in order to reduce the effect Of the dominance of a set of similar sequences in the multiple sequence alignment. In MSASA, both Options could be applied. For easy comparison of the two algorithms, only unweighted SP substitution costs were considered. The alignment from the heuristic procedure of MSA was used as an initial align- ment in MSASA. Only the number of nulls Of each sequence in the initial alignment are allowed to generate a new alignment. In the swap Operation, the maximum value Of the parameter I: was initially set at 10. The initial temperature T; was decided by previous experience. At the final temperature, the probability to accept a new state with a higher cost is e-AE/TI = e (2.3) 34 where 0 < e < 1. This equation is simplified to T, = —-AE/log(e) (2.4) The minimum cost change, —AE, resulting from a swap operation is l. Empirically, c“ is set to the total number of iterations 1:. Therefore the final temperature becomes T, = l/log(k) (2.5) The schedule implemented in MSASA is T, = Tra" where "y is a constant for reducing the temperature. 7 can easily be calculated from T}, T, and k '7 = (TI/T1)“ (2.6) 2.4 Results 2.4.1 Initial Alignment A good initial alignment, whose length and cost are as similar as possible to those of an Optimal alignment, is crucial to MSASA. In the experiment, the length of the initial alignment generated from the heuristics was longer than that Of the Optimal alignment. When the final pseudo-alignment did not have null columns, the l,-,,,-, was increased and the SA process was applied again. One way of increasing the length Of 35 an initial alignment is to use a lower gap cost than the gap cost to be applied. If the same final output alignment is obtained, the alignment is the Optimal solution within the solution set 5:3“: . A lower gap cost (3) than the gap cost (8) to be applied in the SA process is used to get a longer initial alignment when aligning the sequences in Figure 2.5. The length Of the initial alignment is 277 with gap cost (8) and is 281 with gap cost (8). The longer initial alignment is used to show this method. In the Figure 2.8 thru 2.10, the null columns are not removed from the alignments to Show the positions of the final alignments and the lengths of the initial alignments. Too lower gap costs generates an initial alignment whose length is longer than that of the Optimal alignment. MSASA with this longer initial alignment requires longer SA time. Because the solution set becomes bigger and the cost Of the initial alignment becomes higher than those Of the initial alignment Obtained by the gap cost to be applied. 2.4.2 'D'ansition Rules The Operation swap in MSASA is used to change the position Of the nulls in the alignment. This Operation does not change the number Of nulls in the alignment in the SA process. The null columns in the alignment gave the effect that the operation could reduce the nulls. Therefore any alignment whose length is between l,,,,-,, and l,-,,,-¢ could be generated. The Operation insertion which could insert the nulls and the operation deletion lC 36 which could delete the nulls in the alignment were implemented and tested. These Operations were possible to change the number of nulls in the alignment. But these Operations required too much SA time (order of hours) so that it was not practical to use these operations in MSASA. 2.4.3 Cost Comparison Quasi-natural gap costs prevent MSA from generating Optimal alignment. When a series Of nulls with left and right letters completely imposed on the series of nulls in other sequences, quasi-natural costs count one more gap than natural gap costs do (Altschul, 1989). These additional counts prevent generating Optimal alignment from MSA. There are no additional counts in MSASA because the natural gap costs were used in MSASA. MSA and MSASA generate same alignment in some cases. When quasi-natural gap costs are used in both algorithms, both generate the same alignments. And even when natural gap costs are used in MSASA, if there are no completely imposed nulls in the Optimal alignment, both algorithms generate the same alignment. A comparison of the output alignments using the same sequences generated from MSASA and MSA is presented in Figure 2.5. The alignment Al was generated from MSA and the alignment A2 was generated from MSASA. The score (9645) of the alignment A2 from MSASA is lower than that (9668) Of the alignment A2 from MSA. The marked (*) parts show the difference 37 A1 eeeeeeeeee (9663) IIGGVESIPHSRPYHAHLDIVTEKGLRVICGGFLISRQFVLTAAHC IVGGTNSSHGEHPWQVSLQVKLTAQR-HLCGGSLIGHQHVLTAAHC IVNGEEAVPGSHPHQVSLQ--DKTGF-HFCGGSLINENHVVTAAHC IVGGYTCGANTVPYQVSLN----SGY-HFCGGSLINSQHVVSAAHC VVGGTEAQRNSHPSQISLQYRSGSSHAHTCGGTLIRQNHVHTAAHC VVGGTRAAQGEFPFHVRLS -------- HGCGGALYAQDIVLTAAHC A2 sseeeeeess (9648) IIGGVESIPHSRPYHAHLDIVTEKGLRVICGGFLISRQFVLTAAHC IVGGTNSSHGEHPHQVSLQVKLTAQ-RHLCGGSLIGHQWVLTAAHC IVNGEEAVPGSUPHQVSLQDKTG---FHFCGGSLINENHVVTAAHC IVGGYTCGANTVPYQVSLNSG ----- YHFCGGSLINSQHVVSAAHC VVGGTEAQRNSHPSQISLQYRSGSSHAHTCGGTLIRQNUVHTAAHC VVGGTRAAQGEFPFHVRLSMG -------- CGGALYAQDIVLTAAHC Figure 2.5: Alignments of rat mast cell proteinase II, human plasma kallikrein, bovine chymotrypsin, bovine trypsin, pig elastase, and Streptomyces griseus trypsin. A1 is the alignment generated from MSA and A2 is generated from MSASA. The score of A1 is 9663 and the score of A2 is 9648. (*) shows different alignments generated from MSA and MSASA. sequences MSA MSA SA score time score time iteration (million) 4 21244 20 sec 21244 5 min 40 sec 1.3 5 35853 48 min 54 sec 35845 10 min 26 sec 2.0 6 54054 3 hour 18 min 37 sec 54050 16 min 40 sec 2.0 Table 2.1: Comparison of computation time and score in MSA and MSASA between MSASA and MSA. The difference is due to the different gap costs in MSASA and MSA. The results of the alignments of the same set Of four to six sequences generated from MSA and MSASA are shown in Figure 2.7 through Figure 2.10, and the related scores and rlmning times are in Table 2.1. The alignment Of four sequences (Figure 2.5) does not have the regions creating additional counts for MSA. MSASA and MSA generated the same output in this {Floor Ell. 38 IhlSllSJLtumdihdSAL IVGGTNSSWGEHPWQVSLQVKLTAQR-HLCGGSLIGHQWVLTAAHCFDGLPLQDVHRIYSGILNLSDITKDTPFSQIKEI IVNGEEAVPGSHPHQVSLQDK--TGF-HFCGGSLINENHVVTAAHCGVT-TSD---VVVAGEFDQGSSSEKIQKLKIAKV IVGGYTCGANTVPYQVSLN----SGY-HFCGGSLINSQHVVSAAHCYKS-GIQ----VRLGEDNINVVEGNEQFISASKS VVGGTEAQRNSWPSQISLQYRSGSSHAHTCGGTLIRQNHVMTAAHCVDR-ELT--FRVVVGEHNLNQNNGTEQYVGVQKI IIHQNYKVSEGNH--DIALIKLQAPLNYTEFQKPICLPSKGDTSTIYTNCWVTGHGFSK-EKGEIQNILQKVNIPLVTNE FKNSKYNSLTINN--DITLLKISTAASFSQTVSAVCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNT IVHPSYNSNTLNN--DIHLIKLKSAASLNSRVASISLPTSCA--SAGTQCLISGWGNTKSSGTSYPDVLKCLKAPILSDS VVHPYHNTDDVAAGYDIALLRLAQSVTLNSYVQLGVLPRAGTILANNSPCYITGHGLTR-TNGQLAQTLQQAYLPTVDYA ECQKR-YQDYKITQRHVCAGYKEGGKDACKGDSGGPLVCKHNGHHRLVGITSWCE--GCARREQPGVYTKVAEYHDHILE NCKK--YHGTKIKDAHICAG--ASGVSSCHGDSGGPLVCKKNGAHTLVGIVSVCS--STCSTSTPGVYARVTALVNHVQQ SCKSA-YPG-QITSNHFCAGYLEGGKDSCQGDSGGPVVC--SG--KLQGIVSWGS--GCAQKNKPGVYTKVCNYVSVIKQ ICSSSSYWGSTVKNSHVCAG-GNGVRSGCQGDSGGPLHCLVNGQYAVHGVTSFVSRLGCNVTRKPTVFTRVSAYISUINN KTOSS TLAAN TIASN VIASN Figure 2.6: Alignment Of human plasma kallikrein, bovine chymotrypsin, bovine trypsin, and pig elastase generated from MSA and MSASA. case. Because this alignment does not have completely imposed nulls. But in the cases Of five sequences and six sequences (Figure 2.6 through Figure 2.10), MSASA can generate the alignment with a lower cost than that of MSA. 2.4.4 Time Comparison If the lengths of the sequences are short, the number of sequences is small, and the optimal alignment of the sequences does not have regions making additional counts, MSA may generate an optimal alignment faster than MSASA. One Of these cases is shown in Figure 2.6. In this case, MSA took a shorter running time than did MSASA. But in the case of five and six sequences, MSASA required a smaller running time 39 hdSJl IVGGTNSSUGEHPWQVSLQVKLT-AQRHLCGGSLIGHQHVLTAAHCFDGLPLQDVWRIYSGILNLSDITKDTPFSQIKEI IVNGEEAVPGSWPWQVSLQDKTG---FHFCGGSLINENUVVTAAHCGVT----TSDVVVAGEFDQGSSSEKIQKLKIAKV IVGGYTCGANTVPYQVSL--NSG---YHFCGGSLINSQHVVSAAHCYKS-----GIQVRLGEDNINVVEGNEQFISASKS VVGGTEAQRNSHPSQISLQYRSGSSHAHTCGGTLIRQNUVHTAAHCVDR---ELTFRVVVGEHNLNQNNGTEQYVGVQKI VVGGTRAAQGEFPFHVRL--SHG ------ CGGALYAQDIVLTAAHCVSG----SGNNTSITATGGVVDLQSAVKVRSTKV IIHQNYKVSEG--NHDIALIKLQAPLNYTEFQKPICLPSKGDTSTIYTNCHVTGHGFSK-EKGEIQNILQKVNIPLVTNE FKNSKYNSLTI--NNDITLLKISTAASFSQTVSAVCLPSASDDFAAGTTCVTTCWGLTRYTNANTPDRLQOASLPLLSNT IVHPSYNSNTL--NNDIHLIKLKSAASLNSRVASISLPTSCASAG--TQCLISGHGNTKSSGTSYPDVLKCLKAPILSDS VVHPYUNTDDVAAGYDIALLRLAQSVTLNSYVQLGVLPRAGTILANNSPCYITGWGLTR-TNGQLAQTLQQAYLPTVDYA LQAPGYNGT----GKDHALIKLAQPINQPTLKIATTTAYNQGTFT ------ VAGHGANR-EGGSQQRYLLKANVPFVSDA ECQKR-YQDYKITQRHVCAGYK-EGGKDACKGDSGGPLVCKHN-GHHRLVGITSUGE--GCARREQPGVYTKVAEYHDHI NCKK--YVGTKIKDAHICAG---ASGVSSCHGDSGGPLVCKKN-GAHTLVGIVSHGS--STCSTSTPGVYARVTALVNHV SCKSA-YPG-QITSNHFCAGYL-EGGKDSCQGDSGGPVVCSGK ----- LOGIVSHGS--GCAQKNKPGVYTKVCNYVSHI ICSSSSYHGSTVKNSMVCAGG--NGVRSGCQGDSGGPLHCLVN-GQYAVHGVTSFVSRLGCNVTRKPTVFTRVSAYISUI ACRSA-YGNELVANEEICAGYPDTGGVDTCOGDSGGPHFRKDNADEUIQVGIVSVGY--GCARPGYPGVYTEVSTFASAI LEKTDSS QQTI-MN KQTIASN NNVIASN ASAARTL Figure 2.7: Alignment of human plasma kallikrein, bovine chymotrypsin, bovine trypsin, pig elastase, and Streptomyces griseus trypsin generated from MSA. 40 RJSJKSAL + ++ IVGGTNSSHGEHPHQVSLQVKLTAQR-HLCGGSLIGHQHVLTAAHCFDGL-PLQDVHRIYSGILNLSDITKDTPF--SQI IVNGEEAVPGSVPHQVSLODKTGF---HFCGGSLINENHVVTAAHCGVT ----- TSDVVVAGEFDQGSSSEKIQX--LKI IVGGYTCGANTVPYQVSLN--SGY---HFCGGSLINSQHVVSAAHCYKS ------ GIQVRLGEDNINVVEGNEOF--ISA VVGGTEAQRNSHPSQISLQYRSGSSHAHTCGGTLIRQNHVMTAAHCVDR----ELTFRVVVGEHNLNQNNGTEQY--VGV VVGGTRAAQGEFPFHVRLS--MG ------ CGGALYAQDIVLTAAHCVSG ----- SGNNTSITATGGVVDLQSAVK--VRS KEIIIBQNYKVSEGNH--DIALIKLQAPLNYTEFQKPICLPSKGDTSTIYTNCUVTGHGFSK-EKGEIQNILQKVNIPLV AKVFKNSKYNSLTINN--DITLLKISTAASFSQTVSAVCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLL SKSIVHPSYNSNTLNN--DIHLIKLKSAASLNSRVASISLPTSCASAG--TOCLISGHGNTKSSGTSYPDVLKCLKAPIL QKIVVHPYHNTDDVAAGYDIALLRLAQSVTLNSYVQLGVLPRAGTILANNSPCYITOUGLTR-TNGQLAQTLQQAYLPTV TKVLQAPGYNGTG--K--DHALIKLAQPINQPTLKIATTTAYNQGTFT ------ VAGHGANR-EGGSQQRYLLKANVPFV + TNEECQ-KRYQDYKITDRHVCAGY-KEGGKDACKGDSGGPLVCKHNG-HHRLVGITSHGE---GCARREQPGVYTKVAEY SNTNCK--KYHCTKIKDAHIGAG---ASGVSSCHEUSGGPLVCKKNG-AWTLVGIVSHGS---STCSTSTPGVYARVTAL SDSSCK-SAYPG-QITSNHFCAGY-LEGGKDSCQGDSGGPVVCSGK ----- LQGIVSHGS---GCAQKNKPGVYTKVCNY DYAICSSSSYHGSTVKNSHVCAG--GNGVRSGCQGDSGGPLHCLVNG-QYAVHGVTSFVS-RLGCNVTRKPTVFTRVSAY SDAACR-SAYGNELVANEEICAGYPDTGGVDTCQGDSGGPHFRKDNADEHIQVGIVSHGY--~GCARPGYPGVYTEVSTF HDHILEKTDSS VNUVQQTLAAN VSUIKQTIASN ISHINNVIASN ASAIASAARTL Figure 2.8: Alignment of human plasma kallikrein, bovine chymotrypsin, bovine trypsin, pig elastase, and Streptomyces griseus trypsin generated from MSASA. The null columns (+) are not removed intentionally. Brrrrovwv T.IFT....VL VHETSAA 41 MSA IIGGVESIPHSRPYHAHLDIVTEKGLRVICGGFLISRQFVLTAAHCKGR-EIT----VILGAHDVRKRESTQQKIKVEKQ IVGGTNSSWGEHPHQVSLQVKLTAQR-HLCGGSLIGHQHVLTAAHCFDGLPLQDVHRIYSGILNLSDITKDTPFSQIKEI IVNGEEAVPGSHPHQVSLQ--DKTGF-HFCGGSLINENHVVTAAHCGVT-TSD---VVVAGEFDQGSSSEKIQKLKIAKV IVGGYTCGANTVPYQVSLN----SGY-HFCGGSLINSQHVVSAAHCYKS-GIQ----VRLGEDNINVVEGNEQFISASKS VVGGTEAQRNSUPSQISLQYRSGSSHAHTCGGTLIRQNHVHTAAHCVDR-ELT--FRVVVGEHNLNQNNGTEDYVGVQKI VVGGTRAAQGEFPFHVRLS -------- HGCGGALYAQDIVLTAAHCVSG-SGN---NTSITATGGVVDLQSAVKVRSTKV IIHESYNS--VPNLHDIHLLKLEKKVELTPAVNVVPLPSPSDFIHPGAHCUAAGHGKTG-VRDPT-SYTLREVELRIHDE IIHQNYKV--SEGNHDIALIKLQAPLNYTEFQKPICLPSKGDTSTIYTNCHVTGHGFSK-EKGEI-QNILQKVNIPLVTN FKNSKYNS--LTINNDITLLKISTAASFSQTVSAVCLPSASDDFAAGTTCVTTCHGLTRYTNANT-PDRLQQASLPLLSN IVHPSYNS--NTLNNDIHLIKLKSAASLNSRVASISLPTSCA--SAGTQCLISGHGNTK-SSGTSYPDVLKCLKAPILSD VVHPYHNTDDVAAGYDIALLRLAQSVTLNSYVQLGVLPRAGTILANNSPCYITGHGLTR-TNGQL-AQTLQQAYLPTVDY LQAPGYNG----TGKDHALIKLAQP ------ INQPTLKIATTTAYNQGTFTVAGHGANR-EGGSQ-QRYLLKANVPFVSD KACVD--YRYYEYKFQ~VCVGSP-TTLRAAFHGDSGGPLLCAGV-----AHGIVSYGH----PDAKPPAIFTRVSTYVPT EECQKR-YQDYKITORHVCAGYK-EGGKDACKGDSGGPLVCKHN-GHHRLVGITSUGE--GCARREQPGVYTKVAEYHDH TNCKK--YWGTKIKDAHIGAG---ASGVSSCHGDSGGPLVCKKN-GAHTLVGIVSHGS--STCSTSTPGVYARVTALVNH SSCKSA-YPG-QITSNHFCAGYL-EGGKDSCQGDSGGPVVCSGK ----- LOGIVSUGS--GCAQKNKPGVYTKVCNYVSH AICSSSSYHGSTVKNSHVCAGG--NGVRSGCQGDSGGPLHCLVN-GQYAVBGVTSFVSRLGCNVTRKPTVFTRVSAYISH AACRSA-YGNELVANEEICAGYPDTGGVDTCQGDSGGPHFRKDNADEHIQVGIVSHGY--GCARPGYPGVYTEVSTFASA INAVI--N ILEKTDSS VQQTLAAN IKQTIASN INNVIASN IASAARTL Figure 2.9: Alignment of rat mast cell proteinase 11, human plasma kallikrein, bovine chymotrypsin, bovine trypsin, pig elastase, and Streptomyces griseus trypsin gener- ated from MSA. 42 MSASA + IIGGVESIPHSRPYMAHLDIVTEKGLRVICGGFLISRQFVLTAAHCKGREI ----- TVILGAH-DVRKRESTQQKIKVEK IVGGTNSSHGEHPHQVSLQVKLTAQ-RHLCGGSLIGHQUVLTAAHCFDGLPLQDVURIYSGIL-NLSDITKDTPFSQIKE IVNGEEAVPGSHPHQVSLQDKTC---FHFCGGSLINENUVVTAAHCGVTTSD----VVVAGEF-DQGSSSEKIQKLKIAK IVGGYTCGANTVPYQVSLNSG ----- YHFCGGSLINSQHVVSAAHCYKSGI ----- QVRLGED-NINVVEGNEQFISASK VVGGTEAQRNSWPSQISLQYRSGSSHAHTCGGTLIRQNWVHTAAHCVDRELT---FRYVVGEH-NLNQNNGTEQYVGVQX VVGGTRAAQGEFPFHVRLSHG -------- CGGALYAQDIVLTAAHCVSGSGN---NTSITATG-GVVDLQSAVK-VRSTK + QIIHESYNSVPNL--HDIHLLKLEKKVELTPAVNVVPLPSPSDFIHPGAHCUAAGHGKTGVRDPT--SYTLREVELRIHD IIIHQNYKVSEGN--HDIALIKLQAPLNYTEFQKPICLPSKGDTSTIYTNCHVTCHGFSKEKGEI--QNILQKVNIPLVT VFKNSKYNSLTIN--NDITLLKISTAASFSQTVSAVCLPSASDDFAAGTTCVTTCWGLTRYTNAN-TPDRLQQASLPLLS SIVHPSYNSNTLN--NDIHLIKLKSAASLNSRVASISLPTSCA--SAGTQCLISGHGNTKSSGTS-YPDVLKCLKAPILS IVVHPYUNTDDVAAGYDIALLRLAQSVTLNSYVQLGVLPRAGTILANNSPCYITGHGLTRTNGQL-~AQTLQQAYLPTVD VLQAPGYNGTG----KDHALIKLAQP ------ INQPTLKIATTTAYNQGTFTVAGHGANREGGSQ--QRYLLKANVPFVS EKACVD--YRYYEYKFQ-VCVGS-PTTLRAAFHGDSGGPLLCAG ----- VAHGIVSYGH----PDAKPPAIFTRVSTYVP NEECQK-RYQDYKITDRHVCAGY-KEGGKDACKGDSGGPLVCKHNG-HHRLVGITSVGE--GCARREQPGVYTKVAEYHD NTNCKK--YHGTKIKDAHICAG---ASGVSSCHGDSGGPLVCKKNG-AHTLVGIVSHGS--STCSTSTPGVYARVTALVN DSSCKS-AYPG-QITSNHFCAGY-LEGGKDSCDGDSGGPVVCSG ----- KLQGIVSHGS--GCAQKNKPGVYTKVCNYVS YAICSSSSYHGSTVKNSHVCAG--GNGVRSGCQGDSGGPLHCLVNG-QYAVHGVTSFVSRLGCNVTRKPTVFTRVSAYIS DAACRS-AYGNELVANEEICAGYPDTGGVDTCQGDSGGPHFRKDNADEHIQVGIVSHCY--GCARPGYPGVYTEVSTFAS TINAVI--N UILEKTDSS UVQQTLAAN HIKQTIASN HINNVIASN AIASAARTL Figure 2.10: Alignment of rat mast cell proteinase II, human plasma kallikrein, bovine chymotrypsin, bovine trypsin, pig elastase, and Streptomyces griseus trypsin gener- ated from MSASA. The null columns (+) are not removed intentionally. 43 80 70 60 50 T(min)10 0 l l l l J l l 4 6 8101214161820 Number of sequences Figure 2.11: Running time for 4 to 20 sequences (1 million iterations) than that of MSA. The running time to get an Optimal alignment in MSASA was not clearly related to the number of sequences. In MSASA, usually 1 to 2 million iterations which took 5 to 16 minutes were enough to get an Optimal alignment for four to six sequences. MSA took an impractically long time to align more than six sequences on a per- sonal workstation. Therefore, it could not be directly compared to MSASA for more than six sequences. In MSASA, 1 to 3 million iterations were enough to get a near- optimal solution when aligning up to twenty sequences. Figure 2.1 1 illustrates the computation time to run 1 million iterations for aligning several numbers of sequences. It took about 17 minutes to run 1 million iterations for aligning 10 sequences and about 76 minutes to run 1 million iterations for aligning 20 sequences. Therefore, the running time to align more than six sequences in MSASA is more practical than that in MSA. To reduce the computation time, well-conserved regions may be fixed and 44 only the regions between the fixed regions can be annealed. By using this divide-and- conquer type approach, the computation time could be greatly reduced. 2.5 Conclusion It is shown that MSASA based on simulated annealing is more powerful than MSA based on dynamic programming. MSASA could apply natural gap costs, which can generate a better alignment, while MSA could not. MSASA could generate alignments faster and with more sequences than MSA. Chapter 3 Inferring Relatedness of Sequences based on Secondary Structure using Multiple Sequence Alignment Multiple sequence alignment has been a useful technique for identifying RNA sec- ondary structures. In this paper, an algorithm for aligning multiple RNA sequences to identify possible secondary structures is presented. In this algorithm, dot matri- ces generated from intro-sequence comparisons are used to obtain possible common secondary structures. A hit probability for dot matrices is calculated and a score function based on this hit probability is defined. Simulated annealing is applied to Optimize the score. A solution set for a multiple sequence alignment is introduced and how the number of nulls and length of an alignment affect the solution set is analyzed. A method to reduce the computation time is applied to multiple sequence 45 46 alignment based on this solution set. Also several strategies to reduce the simulated annealing time is suggested. Double shufl‘le is used as a transition rule instead of single shuffle. This move set generates faster convergence to Optimal. This algorithm was used to find possible common secondary structures in RNA sequences. 47 3.1 Introduction Just as the evolutionary relationship in proteins is Often seen more in tertiary struc- ture than primary sequence, RNA molecule relatedness is often seen in preserved secondary structures. Much effort has been devoted to finding structural features of RNA. Predictions of RNA secondary structure give useful information about the mechanisms of gene expression, gene evolution and the functions of ribosomes. Sev- eral approaches have been used for finding secondary structural features. Phylogenetic analysis of homologous RNA sequences identifies secondary structures which are con- served during evolution [25, 40, 87]. Another approach is to apply thermodynamics to compare the free energy of alternative structures [39, 62, 80, 90]. Context-Flee Gram- mars have been applied to the problems of statistical modeling, multiple alignment, discrimination and prediction of the secondary structures of RNA families [18, 70, 73]. However, RNA secondary structure prediction is not at a stage where perfect predic- tion is possible. Zuker and Stiegler [90] pointed out that ’A program based solely on conformational rules and thermodynamics will not yield a biologically meaningful folding Of a molecule on its own... More and difierent kinds of additional information must be incorporated into the algorithm as well’. Users need to choose between sev- eral methods, each with its own advantages and disadvantages, the one that best fits their particular problem. Multiple sequence alignment has been a useful tool for identifying sites important in enzyme activity and in gene regulation and in phylogenetic comparisons [72, 84]. Dot-matrix methods have been simple and useful tools for examining sequence simi- 48 larities. Several multiple sequence alignment methods [4, 81, 82] are dot-matrix based. The dot matrix derived in these methods is from inter-sequence comparison based on primary sequence similarity and additional properties Of molecular similarity. How- ever, when RNA sequences are distantly related, sequences can not be aligned by just primary sequence similarity. Secondary structure similarity needs to be considered. A dot-matrix generated by intro-sequence comparison is one way to identify possible RNA secondary structures in a single molecule [66]. In this paper, we have extended dot-matrix methods based on intra-sequence com- parison to find alignments with maximally overlapped potential secondary structures. Individual dot matrices for each sequence are generated and overlapped. Correctly aligned structures appear as diagonal regions with hits in most matrices. When the bases are compared, there is a certain probability of assigning a hit even if the base pairs do not exist as an actual secondary structure. This noise can hide real RNA secondary structures. A sliding diagonal window is applied to reduce noise in the dot matrix. The probability of Obtaining an observed number of hits in a cell in a completely unaligned set of sequences can be calculated from the RNA base-pairing relationships. The probability of finding an Observed pattern for a window in a com- pletely unaligned set Of sequences can be obtained from the hit probability of the cells in the window. This probability was used to design a score function. Multiple RNA Sequence Alignment using Simulated Annealing (RNASA) is pro- posed to optimize the score of RNA sequence alignments. Lukashin et al. [50] in- troduced the simulated annealing (SA) technique to multiple sequence alignment for 49 nucleotide sequences. He discussed the cost function and several other important as- pects of SA for multiple sequence alignment. Several authors [38, 45, 44] ) extended SA to multiple sequence alignment Of amino acid sequence. The SA method was introduced by Kirkpatrick et al. [46]. It is a probabilistic approach that can be used to find an Optimal state of a function in combinatorial Optimization problems. SA starts from an initial state with high temperature. By applying the transition rules and acceptance rules proposed by Metropolis [53], simulated annealing continuously generates a new state from a current state. The criteria of the acceptance rules are: 1. If AC 5 0, accept a new state saw. 2. If AC > 0, accept a new state saw with probability P(AC) = e’AC/T where T is a temperature and score difference AC = C(sm) - C(swfim). Probability P(AC) prevents a system from fixing at a local minimum. Temperature T affects the probability of accepting a new state 3m. In the beginning Of the SA process, T starts at a high temperature and T gradually decreases iteration by itera- tion to become zero by applying an annealing schedule. The probability of accepting a new state with a higher score also gradually decreases as temperature T goes down. The SA process converges to a global minimum state when a careful annealing schedule and number of iterations are given. The main disadvantage Of SA is its need for a large amount Of computational time because SA is based on Monte-Carlo meth- ods. To reduce this huge computational time, several speedup strategies including an efficient transition rule, are applied in RN ASA. Finally, experimental results Obtained 50 from RNASA are presented. 3.2 Algorithm 3.2.1 Determination Of Score Function Base Pair The following algorithm aligns conserved regions Of possible secondary structure. It does not attempt to determine the lowest energy secondary structure, or chose among conflicting secondary structure possibilities in a specific alignment. Instead it is based on finding an alignment where the possible secondary structure motifs are conserved between sequences. For the purposes Of this work, we define a base pair possibility as two positions capable Of forming any of the canonical Watson-Crick base pairs (A-U, C-G. G-C. or U-A) along with the common non-canonical (Hi and 11-6 base pairs. In addition to these common base pairs, RNA secondary structures also contain less common base pairs (e.g. A-G , A-A etc.). Current secondary structure prediction methods do not efiectively cover these rare base pairs. Their positions are best established either from the X-ray crystallographic data or from analysis Of compensatory changes in unambiguously aligned homologous sequences. In the present alignment method, these rare base pairs are ignored. Since these base pairs are rare by definition, this simplification seems a reasonable compromise. 51 Dot Matrix 'Il'aditionally a dot matrix for a given RNA sequence is obtained by a square matrix with a dot at the intersection of row i and j (1 S i < j g n) for the bases i and j Of the- sequence. A dot matrix from intra-sequence comparison is used to depict the common RNA secondary structures in the sequences to be aligned. This is a rectangular array whose rows and columns are labeled with the bases of a sequence. A dot matrix with I‘m-t x l,-,,,-¢ is used in RNASA. Length l,-,,,-, is the length of the input alignment, obtained from a progressive pairwise alignment method based on primary sequence similarity. We define a hit (1) if the base pair i and j in a RNA sequence in an alignment is one of these base pairs (A-U, C-G , G-C , U-A, G-U, and 11-0); otherwise the pair is defined as a miss (0). Individual dot matrices from each aligned sequence in an alignment are generated and the cells in each individual matrix are calculated. The total number Of hits, h,, in an aligned sequence, 3, of an alignment is h. = (IAI X IUI) + (IGI X IC'I) + (IGI X IUI) (3-1) where |A|, [U], I0] and [C] are numbers of each bases in a sequence. The values of each cell (i, j) in the individual matrices are summed and the sum of each cell (i, j) is recorded in cell (i, j) of a matrix called the sum matrix. Clearly the value Of cell (i, j) in the sum matrix cannot be larger than the number of sequences n. In the sum matrix, the number of hits, m, in a cell represents the degree of potential common 52 secondary structure. The larger the number of hits, the higher the possibility Of common secondary structure. Figure 3.1 shows an alignment and its sum matrix. a u c o A o o A ° 6 u c o A o c A c G u c a A o e A c a u c o A a o A c 1 {2731475 [sirloin I I I I I I l I G G G G .1-.0J.‘J£L0.LO.L0.IQJQL3. I I I I I I I I u u u u .2.-_.9..9.,£..‘!...4...4..£.9 c c c c .3.-_:-,'9',‘_:.°.:_‘_:.‘.:9',9 o o e e _‘.-.'-.'-I.2L°.'.°.'9.'9'.§ A A A A_5.-.i-i-i-i.°.i.°.l9.l9i9 e a e e.;.-J-J-l-L-L°J%ngg. I I I I I I I I a G a G’."'I"I'T'I'-I"'I"Ia1’6 A A A A----:-t'-+-:--:--:-i-+- 'CCC’IIIILIIIO Figure 3.1: Example of an alignment and its sum matrix. Average Hit Probability The total number of hits in an aligned sequence is given by equation (1). These hits can be regarded as distributed at random except in regions of real secondary structure. The probability p, of a hit occurring by chance at any cell (i, j) in an individual matrix of the aligned sequence 3 is approximately O S p. = ’2 < l (3.2) where l is the length Of the sequence 3. For the present study, we use the average probability p = (fps/n (3.3) s=1 53 where n is the number Of sequences in an alignment as the base pair probability in all the aligned sequences in an alignment. The value Of p is less than 1/2 for most RNA sequences. When individual matrices are overlapped to form a sum matrix, the number Of hits at position (i, j) in the sum matrix is between 0 and the total number of bases, 11,-, in column i Of the alignment. This n,, may be less than the total number of sequences, N , since some of the sequences may have nulls at position i. We can then approximate the probability of randomly finding that m Of n,- bases at a specific position, i, have base pair possibilities (hits) at some other position as: . "i . P(z, m) = -p"‘ - (1 - p)"""' (3-4) The value of P(i, m) becomes smaller as m goes to n,- or m goes to 0. This means the probability that a position will have no potential base pairs is relatively low, as is the probability that a position will have all base pairs. Sliding Windows on a Sum Matrix Now the most basic secondary structure motif consists of one or more contiguous base pairs with non-pairing bases at each end. The probability Of randomly finding such a conserved motif in the alignment at positions i through i-i- w - 1 with w positions 54 at another location is simply: =10 = H P(a: + k — l,mk) (3.5) k=1 where M = (m1, . . . , In"). The number of hits will be equal to n,- for absolutely conserved base pairs, and equal to 0 for positions with no base pair possibility, such as the first non paring position defining the end Of a stem region. For incompletely conserved structures, the Observed values will be between 0 and n,- and thus the prob- ability of Observing such an incompletely conserved motif will be higher. The method we use to Optimize the alignment on secondary structure is to find an alignment with a high number Of low probability windows. A window with size w is slid on the diagonals of the sum matrix for this purpose. For example, ifthe diagonally adjacent cells (i,j), (i+1,j— 1), - - - , (i+w—1,j—w+ 1) make a window, then the diagonally adjacent cells (i + 1, j - 1), (i + 2, j - 2), - - - , (i + w, j — w) make the next window. Determination of Score Function To be used in sequence alignment, a score function should be explicitly defined as a measure Of overall alignment quality. A score function must include important secondary structure information so that if the score function is mimmized / maximized, the complete secondary structure in the alignment will be identified. In RN ASA the sum of the reciprocal of the probability for all possible windows 55 On a sum matrix is calculated as a scoring function for simulated annealing as follows. 1 C"...§...W ‘3‘” The §(:r, w, M) values for regions of conserved potential secondary structure are so much larger than the expected (random) %(:r, w, M) value that they dominate such a sum. A good alignment will have relatively more of these conserved potential secondary structures and will thus have a relatively large C value. The sum 0 is used as a score function in RNASA. In RNASA, at each iteration, a new alignment is generated and C is calculated to find an alignment with maximum score. The goal for RNASA is mazimize(C) (3.7) 3.2.2 Simulated Annealing to Optimize the Score Function Heuristic Algorithm as the High-temperature Phase SA progresses from a high temperature phase that approximates a random search to a lower temperature phase approximating a greedy local search. Several au- thors [38, 45, 44] suggested a good heuristic algorithm could replace the high tem- perature phase of simulated annealing and provide a substantial speed increase. In the present algorithm, a heuristic primary sequence alignment method similar to the progressive pairwise method [20, 77, 84] is used to provide a preliminary alignment and to partially substitute for the high temperature phase. This preliminary align- 56 ment can also be used to manually select regions of long RNA sequences that are not well aligned by primary sequence similarity. Double Shuffle as a 'ITansition Rule In a RNASA process, new alignments are created from an existing, possibly subopti- mal, alignment by introducing random changes according to an Optimized transition rule (Figure 3.2). begin doubleshufllefi) for j = 1,2 :I: +— random(1 . . . l); randomly select a column while column(:c) are null scan left -or— right; (is a: (— a: — 1 or a: + 1) wrap around if necessary; return if no nulls are found in sequence i; end while; y (— random(:r+movm---z -movm); shift right columns a: - 1 thru y -Or- shift left columns a: + 1 thru y; column(y) (— null; end for; end doubleshufllc Figure 3.2: The transition rule double shuffle The rule used in the current algorithm allows changing one randomly chosen sequence in the alignment with each iteration. A gap in the sequence is selected and moved to a new position. This process is repeated on the chosen sequence a second time. After both shuflles, the score function is recalculated. The reason for applying two shuffles to a sequence in each iteration is to allow both sides Of a stem structure to adjust in tandem. This efl'ectively lowers the energy barrier between local points 57 and allows the system tO converge in fewer iterations. If the chosen sequence in the alignment has no null characters, the iteration is skipped. Occasionally, the same gap will be chosen for both shuffle Operations, producing the equivalent Of a single shuffle move. Solution Set The SA process starts from an alignment with length I...“ produced by the heuristic algorithm. The length l of new alignments produced from the current alignment does not change during the SA process due to the properties Of the double shuffle transition rule. Therefore RNASA only generates alignments with length l.-,.,-.. If I... is the length of sequence a,-, the range Of potential lengths of an alignment of the sequences a1, a2, - - - , a,. can be calculated. The range I is confined as below. 1...... = mama...) g l g 1...... = 20...) (3.8) i=1 Let S; be the set of all alignments Of length l with same set of sequences. Each alignment in the S; has its own score and the length Of each alignment is identically I. Then the multiple sequence alignment problem can be defined as finding the alignment with an optimal score in the solution set St: (= Ujmmswm 5;). There will be alignments which contain columns whose elements are all nulls (null columns). If these null columns are removed from an alignment, the length,l', of the alignment is reduced (l’ < l). Also, this alignment is a member of the solution set 58 Sy. Thus, the set S; is a superset of the set Sp where l’ < l (Kim et al., 1994). The number of alignments in S; can be calculated. Let N be the number of sequences. Let n,— be the number Of nulls placed into the sequence a,- to bring it up to the alignment length, 1. That is n,- = l — la... The total number of alignments S in S; is 3m = ° : - ° (3.9) "1 n2 "N PTom equations (9) the number of alignments ,8 increases rapidly with each of n.,l, and N. To find the Optimal alignment, all the alignments in 51,“, should be investigated. However, the sheer size of Si,” prohibits checking all possible alignments. Therefore, if the length of the optimal alignment is larger than 1...“, the optimal alignment can not be found in 51,”. In that case, the final alignment from RNASA is near optimal. Solution set can be expanded with longer initial alignment. Increasing the solution set further rarely leads to improvement, so such an alignment is generally optimal for the given solution set. A2 and A4 in Figure 3.7 are the examples. The SA time can be reduced by this approach. Schedule for Temperature. The schedule implemented in RN ASA is T = T; - e‘ where e is a constant defining the rate Of annealing, i the iteration number, T1 the initial temperature, and T the 59 current temperature. The value Of 6 can easily be calculated from the total number of iterations, k, the final temperature,T,, and T1: «2 = (Ti/Tl)“ (3.10) 3.2.3 The Complexities of RNASA Figure 3.3 shows the block diagram of RNASA. The sum matrix in RNASA is of size l2. Therefore the space complexity of RNASA is 00"). In each iteration, the sum matrix must be updated in the region correspond- ing to the columns modified by the double shuffle Operation. The values of all sliding windows covering these columns must be recalculated. The number of these windows is approximately 1 times the number of affected columns. In the experiments pre- sented here, the maximum number of columns affected by each shuffle is limited to 10 (Figure 3.2). Since the number Of columns affected is essentially constant, the time complexity per iteration is 0(l). It should be noted, however, that the number of iterations necessary to assure convergence is dependent on several factors, including l. 3.3 Results RNASA was implemented and tested on a DEC alpha workstation 3000/ 400 with 32MB main memory size and DEC OSF/l V2.0 which is a UNIX Operating system. 60 begin RNASA current.alignment (— Output generated by the heuristic algorithm; /* Generate initial matrix and sum matrix */ For i=sequence 1 to n For each nucleotide pair (j,k) in the sequence i matrix(i, j, k) = 1 if pair(i,j) is in complementary set matrix(i, j, k) = 0 if pair(i,j) is in not complementary set summatrix(j, k) = sumJnatrix(j, k) + matrix(i, j, k) end for; end for; /* Calculate the score of the sum matrix*/ Cinit = Z score(window) /* Initialize parameters */ finalnlignment (— initial.alignmcnt; Cmax ('- Ciniti C.curr'ent ‘- Cliniti T *- Tatar-ti /* Annealing process */ while (T > Tm) i (— random(sequence(1..n)); doubleshufl‘lefi) in Figure 3.2; Calculate individual matrix i; Calculate sum matrix; Cm (— C(sum matrix); if Metropolice conditions are satisfied then current.alignment (- new.alignment; if (Cm... > Cm) then Cm {— Cm; f inalnalignment (— current.alignment; end if; end if; T4—7-T,0<'7< 1; end while; Return finaLalignment, maximum score Cm and sum matrix ,- end RNASA Figure 3.3: The RN ASA algorithm 61 RNASA was written in programming language ANSI C (DEC OSF/ 1 C compiler). The program will be available on request. RNASA was implemented to produce multiple sequence alignments for RNA se- quences. Experiments were performed on segments of bacterial 16S RNA sequences. rRNA sequences consist of highly conserved regions separated by variable regions. In these variable regions, secondary structure is often more conserved than primary sequence similarity. Several such variable regions were chosen for testing RN ASA. A window size of four was applied in all experiments except 8., Ba, Ba, and B4 in Figure 3.7. 3.3.1 Alignment and Secondary Structure. Figure 3.4 shows an alignment of segment Of 16S RNA sequences and their possible secondary structures. Three stem regions (A-A’ , B-B’ , C-C’) could be identified. This alignment is close to the hand alignment in the RDP [51]. From the final sum matrix, we could identify the common secondary structure (Fig 3.5 (b)) and could overlay each aligned sequence onto this common secondary structures. Figure 3.4 (c) and (d) are the possible secondary structures of the sequence Ehr .bovis and Hir .baltic in the alignment. These possible secondary structures are compared to the proposed secondary structure Of Ag . tumef a7 [32] which is close to the aligned sequences. These structures are matched well. Figure 3.5 (a) shows an alignment Of a segment of ten 16S RNA sequences. The Ehr.bovis Sym.Mscfu Sym.Trgr1 Sym.Trgr5' Sym.Trgr6 Par.denitr Ros.denitr Hir.ba1tic Erb.1ongus R.sa1exige Ps.diminut Caw.FwC2 Afp.felis Afp.c1eve Figure 3.4: (a) Alignment Of segments of 15 163 RNA sequences and possible sec- ondary structures. This region corresponds to nucleotides 996 to 1044 in E. coli 168 RNA. Left ( (, [, {) and right 0, ], }) parts of the alignment Show the possible stem regions (A-A’ , B-B’ , C-C’). (b) Possible common secondary structure from the alignment (a). (c) Secondary structure of Ehr.bovis based on (b). (d) Sec- ondary structure of Hir.ba1tic based on (b). The symbol (0) indictes a G-U base pair and (I) indicates A-U and G-C base pairs. Note that the length of the loop and 62 A B B’ C C’ A’ (((((((< [[[fiif 111111] {{{{ }}}})))))))) aUGAAGAUUagaUCCUCCUuaacAGGAGGGchAGch-gGCUGGGUCUUgCA aUGGAAAUUauaCCUAUUCgaagGGAUAGGguCGGUuc-gGCCGGGUUUCaCA aUGGAAAUCauaCCUAUUCgaagGGAUAGGgu-CchucggCGCGAUUUUaCA aUGGAAAUCauaCCUAUUCgaagGGAUAGGguCGGUuc-gGCCGGAUUUUaCA aUGGAAAUCauaCCUAUUCgaagGGAUAGGguCGGch-gGCCGGAUUUUaCA aUCCCAGG-acc-GGCCCGgagaCGGGUC-uuUCACuucgGUGA-CCUGGaGA aUCCUGUG-cua-ACCCNAgagaUCGGGN-guUCUCgcaaGAGA-CGCAGuGA aUCCUAUG-cu-NACUCCAgagaUGGAGUA--U-C-uucg-G-A-CAUAGuGA aUCCUAGGAcgguUUCUG-gaga-CAGACuccUUCCcuchGGGaCCUAGuGA aUCCCGGGAcgacUUCCA-gage-UGGAUuuuUUCAcuchGUGACCCGGnGA aUGCCUGG-a-cNGCCACGgagaCGUGGCUu—UCCCuucgGGGA-CUAGGaCA aUGCCUGG-a-cCGCCACAgagaUGUGGNNu—UCCCuucgGGGA-CCAGGaCA aUGUCCAGGaccGGUCGCAgagaUGUGACC--UUCUcuchGAGCCUGGAgCA aUGUCCAGGaccGGUCGCAgagaUGUGACC--UUCUcuchGAGCCUGGAgCA (a) U A 0‘ A c c\ A C CungG U , c . C. ‘A A A 51‘??in E. 09699999 : Accyucuas a"? ohAUAc- A,U- 39 U [G 3' U . Ic- c O” U G. U G C Gizu k) ' is base pair regions are variable for each sequence. 63 A I C C' I’ D D’ I l’ A’ «(tum « mm 1111)] >> > '«<««« »»»»> «(it )in) 111111)} (((fiifimlflflf 111111!'I’I'I'I'I'I'I'<«««« >»»»» ((((( 1))” 11111111 3-P1 will“ sec60001GGUUgGGgeIegCCCAUOgeeeClUGOGgueanaCCgAeugagCUCUAUGUAC~ugug-OUGUAUAGAG-gseaAGGGCInugGCCCU-GCCUUGAGen 3 1. bejece azucuuooclflcgccgeuegcccA00gussetUGGGgIeeeeccgCeuaagCUlUlUUAUUCeueaaAGUO10161G-geaeGGCBCIuegOCOUCgIPCCULAGeg Spi. splat: uscCCCCUUGUUUGGgInagCUUGUOgeeaCAUAAGguaaaecCUAengegalflCUUUUAUCenaaBAUAAUAGAAggeeeGGCGCUouechCGUCGCAAGGGGen Trp. Insole uecCcUUUG6eUgGGgInagCCGGOAgeAAUACCGGguaanaCCgleueeggUCGGUl60cacggae-GCCACCGAgg-eeeGCGGCIIcgGCCGCg-CCBLAGGee Trp. spat useCCUUAGGeDgGGgauagCCACUIgeaaUAGUGGguaanaccgleugnggOUACCGGGC- ugug-GCCUIGUIAag-eeaflGAGCarunGCUCCgeCCUBAGGen Spt. stones uncCUUUAGGUUgGGgInagCCAUUAgseeULGOGGguaauaccgleuguggOUGCGGGGC- oung-GCCUUGUAAag-eeafidlc-snag-CUCC-GCCDGAAQen Trp plus-d useCCUUAAGeUgGGgIne¢CUCCUAgalanineAGguaaueCCgAenec-gCUUAUACGOenaeagCCGUIUAAOga-eeeBAGGCIgsgGCCllg-Cuuclccen 183 LP. .P gflGgaaaaCLCABAgaAAUUUGUngaenICDglengugecccuuccucoguag-CAGGipUUGguneee-GCA-gees-06c-GCUUUUAGen Lpn.1111:1 ugcCCACGGAUG‘66gen-eccuuucgeeaGGAAGGgueauaccgCeueecACCAUOGUUACsac3601100100605:see-GCA-gcearOOC-¢UCUGUGOA‘ 8p1.eurent ugcCUGeCOGeDgGGganagCUUCUGgeesCACAGGguaaneccglenge-ACUGACGAGACunngGUUUUGUUAGU-eeeGGUOCIIcgOCACCgRQCBeCLceg (a) l . 3. 5 A3 .A E' C: E a. v III. 0' I) (b) Figure 3.5: (a) Alignment of segments Of ten 16S RNA sequences and possible sec- ondary structures. This region corresponds to nucleotides 133 to 229 in E. coli 16S RNA. Left ( (, <, [, { ) and right ( ), >, ], } ) parts of the alignment show the possible stem regions. (b) Real common secondary structure from the RDP. There are different possible secondary structures in the Region B (in the Box). Note that the length of the loop and base paired regions are variable for each sequence. 64 possible secondary structure was compared to the proposed secondary structure of Spi . aurant [32] which is closely related to the aligned sequences. ( second line in the Figure 3.5 (a) ). The third line shows the possible secondary structure from the alignment. A secondary structure diagram based on Spi . aurant is shown in Figure 3.5 (b). Fl'om the alignment and dot matrix, stem regions (A-A’ , C-C’ . D-D’ , E-E’ ) were correctly aligned. However, in the region (B-B ’ ), there are different possible stem regions with length=2. 3.3.2 Identification of Secondary Structures An RNASA generated alignment of one variable region from eight 16$ RNA sequences is shown in Figure 3.6. This region corresponds to nucleotides 61 to 106 in E. coli 163 RNA [32, 61, 88, 87]. Alignment A1 in Figure 3.6 was generated by a progressive pairwise alignment method and used as an initial alignment for RNASA. This method aligns the se- quences by primary sequence similarity. Therefore when the sequences are distantly related, an alignment by this method does not show the secondary structure simi- larity of the sequences. A2 in Figure 3.6 shows the final alignment obtained from RNASA from initial alignment A1 . The primary similarity of the final alignment Obtained from RNASA was much worse, but the secondary structure similarity Of the final alignment was better than that of the input alignment. The final sum matrix from the output alignment could be used to identify the secondary structures by in- SPGCtion. This possible secondary structure was compared to the one of My: . xanthu A1 Organism Dsb.postga Dss.variab Dsn.acetox Dnn.tiedje ny.xanthu Cys.fuscus Con.crocat Nan.exeden A2 Organism Dsb.postga Dss.variab Dam.acetox Dmn.tiedje ny.xanthu Cys.£uscus Con.crocat Nan.exeden A3 Dsb.postga Dss.variab Dsn.acetox Dnn.tiedje ny.xanthu Cys.1uscus Con.crocat Nan.exeden A4 Dsb.postga Dss.variab Dsm.acetox Dnn.tiedje ny.xanthu Cys.£uscus Con.crocat Nan.exeden Figure 3.6: Alignment Of segments of 168 RNA sequences. The names Of the used 16S RNA sequences are from RDP. This region corresponds to nucleotides 61 to 106 in E. coli 16S RNA. A1 is the alignment generated by the progressive pairwise algorithm. A; is the final alignment generated by initial alignment A. using RN ASA with window size=4. A3 is a longer alignment generated by the progressive pairwise algorithm. A. is the final alignment generated by the initial alignment A3 using RNASA with window size=4. Left (D and right (]) parts of possible stem regions (A, A’ , B . B ’) are symmetric. Upper case characters indicate possible secondary 65 GUCGaacgagaaaGGGAUUGcuugCAAUCCCgaguagagUGGC GUCGUacgagaacGCUCUAGcuugCUAGAGCaaguaaaGUGGC GUCGaacgagaaaGUU---UCchgGGAAAUgaguagagUGGC GUCGUacgagaaaCAUAUCNuuchGGUAUGgaguaaaGUGGC GUCGagcgcgaaUAGG ------ GGcaaCCCUUAguagagCGGC GUCGagcgcgaauGGA -------- gcaaUCCuaguagagCGGC GUCGUgcgagaaaGGG ------ CuuchCCchguaaaGCGGC GUCGaachGCUAgca -------- aUAGUC ------- agUGGC [[[[[ [[[[[[[ 111111] 111]] GUCGaacgagaaaGGGAUUGcuugCAAUCCCgaguagagUGGC GUCGUacgagaacGCUCUAGcuugCUAGAGCaaguaaaGUGGC GUCG-aacgagaaaGUUUCch-gGGAAAUgaguagag-UGGC GUCGUacgagaaaCAUAUCNuucgGGGUAUGgaguaaaGUGGC GUCG-agcgcgaa-UAGGGGc-aaCCCUUA-g-uagag-CGGC GUCG-agcgcgaau-GGA--gcaa--UCC-uaguagag-CGGC GUCGUgcgagaaa--GGGC-uucg-GCCC--cgguaaaGCGGC GUCG-a--ac--g-GGCUA-gcaa-UAGUC-a-----g-UGGC GUCGaacgagaaa-G-GGA-UUGcu-ugc-AAUCCCgaguagagUGGc GUCGUacgag-aacG-CUC-UAGcu-ugCUAGAGC-aaguaaaGUGGC GUCGaacgagaaa-GUUUC-ngcG--GA-AAU---GaguagagUGGC GUCGUacgagaaaCAUAUCNuucgG--GG-UAU-G-gaguaaaGUGGC GUCGagcgcg-aa ------- UAGGG--Gc-aaCCCUUAguagagCGGC GUCGagcgcg-aa ------- u-GGA-gc-aaUCC-uaguagagCGGC GUCGUgcgag-aa ------- a-GGGCuuc-gGCCC-cgguaaaGCGGC GUCGaacg-G-GC 3 A--gc-aaU----AGU-CagUGGC [[[H [[[[[[E 111111] 111]] GUCG-aacg-agaaaGGGAUUGcu--ugCAAUCCCgaguagag-UGGC GUCGUa-cga-gaacGCUCUAch-~ugCUAGAGCaagu-aaaGUGGC GUCG-aacg-agaaa-GUUUCch---gGGAAAUGaguag-ag-UGGC GUCGUa-cga-gaaaCAUAUCNuu--chGGUAUGgarguaaaGUGGC GUCG-agcg-cgaa--UAGGGGc---aaCCCUUA-gu--agag-CGGC GUOG-agcg-cgaau---GGA-gca--a-UCC---uaguagag-CGGC GUCGUg-c-gagaaa--GGGC-nuc-g-GCCC--c-gguaaaGCGGC GUCG-aac---g-----GGCUAgc-a-aUAGUC -------- ag-UGGC structure regions. 66 whose secondary structure is known already and matched well. 3.3.3 Initial Alignment Alignment A3 in Figure 3.6 was Obtained by progressive pairwise alignment. A3 is longer than A; because a lower penalty for nulls was applied in the progressive pairwise alignment. Alignments A4 is the final alignment using RNASA with input alignment A3. Alignment A. has more nulls than A2 and these additional nulls give clearer resolution Of stem and loop regions. When an initial alignment is not long enough, certain columns in a final alignment may contain bases in loop regions and stem regions. Thus, it is important to ensure that the length of the initial alignment is appropriate, based on the output alignment. Ishikawa et al. [38] and Kim et al. [45, 44] have shown the advantages of using an initial alignment Obtained by heuristics in SA. 3.3.4 Window Sizes Figure 3.7 shows the alignments obtained with different window sizes. The alignments Bl, Ba, Ba and B4 in Figure 3.7 and A2 in Figure 3.6 were all Obtained from the same initial alignment (A1 in Figure 3.6) by applying different window sizes. Clearer separation of paired and unpaired regions was Obtained at window sizes Of 4 to 5. The alignment A2 is the best match to the known alignment. By increasing the window size, more weight can be given to windows having diagonally consecutive hits. Therefore the efi'ect Of noise can be reduced. 67 Bl GUCGaacgagaaaGGGAUUGcuugCAAUCCCgaguagagUGGC GUCGUacgagaacGCUCUAGcuugCUAGAGCaaguaaaGUGGC GUCGaacgagaaa-GUUUC-chgGGAA-AUgaguagagUGGC GUCGUacgagaaaCAUAUCNuucgGGGUAUGgaguaaaGUGGC GUCGagcgcgaa ----- UAGGG-GcaaCCCUUAguagagCGGC GUCGagcgcgaa ----- uG--GAgcaaUCC-uaguagagCGGC GUCGUgcgagaaaGGGCuu-c--g--GCC-chguaaaGCGGC GUCGaacg-G-GC----U-A---gc-aaU---AGU-CagUGGC B2 GUCGaacgagaaaGGGAUUGcuugCAAUCCCgaguagagUGGC GUCGUacgagaacGCUCUAGcuugCUAGAGCaaguaaaGUGGC GUCGaacgagaaaGUUUC--chg-GGAAAUgaguagagUGGC GUCGUacgagaaaCAUAUCNuucgGGGUAUGgaguaaaGUGGC GUCGagcgcga-aUAGGG-Gc--aaCCCUUA--guagagCGGC GUCGagcgcgaau-GGA--gc---aa-UCCua-guagagCGGC GUCGUgcgagaaa--GG--GCuuch-CC-chguaaaGCGGC GUCGaacg-GGC---UA--gc--aa--UA----GUCa-gUGGC BS GUCGaacgagaaaGGGAUUGcungCAAUCCCgaguagagUGGC GUCGUacgagaacGCUCUAGcuugCUAGAGCaaguaaaGUGGC GUCGaacgagaaa-GUUUC-chgGGAAAU-gaguagagUGGC GUCGUacgagaaaCAUAUCNuucgGGGUAUGgaguaaaGUGGC GUCGagc-gcgaa-UAGGG-GcaaCCCUUA--gu-agagCGGC GUCGagc-gcgaau-GGA--gc-aa-UCC-uagu-agagCGGC GUCGUgcgagaaa--GGGC-uucg-GCCC-c-gguaaaGCGGC GUCGaa---cg---GGCUA-g-caaUAGUC ------ a-gUGGC B4 GUCGaacgagaaaGGGAUUGcungCAAUCCCgaguagagUGGC GUCGUacgagaacGCUCUAGcuugCUAGAGCaaguaaaGUGGC GUCC-aacgagaaaGUUUCU-ucgGGAAAUgaguagag-UGGC GUCGUacgagaaaCAUAUCNuucgGGGUAUGgaguaaaGUGGC GUCG-agcgcgaa-UAGGGG-caaCCCUUA--guagag-CGGC GUCC-agcgcgaa-u-GGA-gcaa-UCCua--guagag-CGGC GUCGUgcgagaaa--GGGC-uucg-GCCC--cgguaaaGCGGC GUCG----aacg--GGCUA-gcaa-UAGUC-ag ------ 0660 Figure 3.7: Alignment Of segments of 168 RNA sequences with difierent window sizes. Each alignment was Obtained after 300000 iterations. BI is the alignment with window size=1. B: is the alignment with window size=2. 83 is the alignment with window size=3. A2 in Figure 3.6 is the alignment with window size=4. B4 is the alignment with window size=5. A2 in Figure 3.6 gives the best match to a known smructure. 68 3.3.5 Effect of Double Shuffle To compare the efficiency Of the double shuffle and single shuflie, RNASA with double shuflle and RN ASA with single shuffle were tested with initial alignment A1 in Figure 3.6. Figure 3.8 Shows the annealing curves with single Shuffle and double shuffle. 32; till 3.. T—S .................. - 3.6 r " Score(x1015)3_8 _ - 4.0 - L T 4.2 A l 1 . l l ‘ 0 510.15 20 25 30 Iteratron(x 10‘) T l I I 3.2 3.4 3.6 Score (x 1015):” l 4.0 4.2 -l 1 I 0 5 10 15 20 25 30 Iteration ( x 10“) . Figure 3.8: (a) Annealing curve with single shuffle. (b) Annealing curve with double shuffle. Initial temperature T.- = 10‘ and final temperature T, = 1. Unit Of score is 1015 and Unit of iteration is 10, 000 iterations. RNASA with single shufiie took longer to converge to the global maxima score and had a tendency to stick to local maximum, even with a very large number Of iterations. 69 3.3.6 Speed of Convergence The converging time necessary to get satisfactory alignment in RNASA is generally tied to the number of sequences, length of sequences and heuristic algorithm. e The length of alignment. It is clear that the converging time for longer align- ments is longer than that of shorter alignments from equation (9). For example, it took 12 minutes ( 0.3 million iterations ) to get A, and 1 minute ( 10,000 iterations ) tO get the alignments in Figure 3.7. It took less than 10 minutes ( .3 million iterations ) to align first 10 sequences in Figure 3.4 (a) (length-=53). However, it took 2hours ( 2 million iterations ) to align in Figure 3.5 (a) (length=107). e The quality of initial alignment. The quality of initial alignment affects the converging time. If the initial alignment is closer to the Optimal, it will require less converging time. Figure 3.6 shows the importance of a good heuristic algorithm. e The number of sequences. It is clear that longer SA time is required for a larger number Of sequences from equation (9). To align 14 sequences in Figure 3.4 (a), it took approximately 20 minutes ( .5 million iterations ). TO reduce the computational time, well-conserved regions may be fixed and only the regions between the fixed regions annealed. Also, by eliminating well aligned regions the length of long RNA sequences can be shortened and can then be aligned. 70 3.4 Conclusion In this paper, we describe a method to align RNA sequences and identify possible secondary structures using Simulated annealing. In our approach, sequences were first aligned based on primary sequence similarity and then realigned based on secondary structure information. Dot matrices from intra-sequence comparison are generated and overlapped to form a sum matrix. We built a clearly defined score function for this sum matrix based on the probability of finding the Observed pattern in completely unaligned sequences. We were able to reduce SA time by reducing the search space as well as using the double shuflle move set. We showed that RNASA can generate alignments with clear secondary structure identification. The main advantage of RNASA is its ability to align conserved secondary struc- tures in distantly related sequences. One potential disadvantage of this method is that is is not based on finding the lowest energy secondary structure (via thermody- namic energy rules). However, RNASA is able tO handle pseudoknotted structures and conserved alternative foldings, both of which are not modeled well by most energy based methods. Chapter 4 Inferring Relatedness Of a Macromolecule to a Sequence Database Without Sequencing Sequence database searches with unidentified sequence data have been a source of useful information of identifying the biological information Of a macromolecule iso- late. TO do the sequence database searches, the whole or part Of the primary sequence information is required. However, in some situations, sequencing a macromolecule is not practical. We need more crude but faster methods without sequencing an uniden- tified macromolecule to identify the relatedness between database sequences and an unidentified macromolecule. TO achieve this goal, we study the problem of Obtain- ing biological information about a macromolecule isolate using only the restriction pattern Of that isolate Obtained from digestion with enzymes and a database D of 71 72 sequences. We investigate a three step approach to solving this problem. (1) we Ob- tain a restriction pattern of the isolate while analytically deriving the corresponding restriction maps of the sequences in the database. (2) We identify a set S Q D of sequences which have restriction maps that are most similar to the unknown isolate’s restriction pattern. This task is complicated by the fact that we have only approxi- mate fragment lengths for the unknown isolate and that we do not know the actual ordering of the unknown isolate’s fragments. Despite these difficulties, we derive experimental results which indicate maximum matching techniques are efi'ective in identifying the correct set most of the time. In this step we introduce Maximum Site Matching Problem (MSMP) and show MSMP is in the class of N P—complete problems which are conjectured to have no polynomial time solution. (3) We use the set S to infer biological information (such as sequence information or hierarchical classifica- tion information) about the unknown isolate. We demonstrate experimentally that the closeness Of the sequences in the set S to each other can be used to infer the relatedness of the unknown isolate to the sequences of the set S. 73 4.1 Introduction Sequence database searches have become a normal first step in the identification and characterization of new macromolecule isolates. The information Obtained from such database searches and comparisons Often yields novel insights into isolate origin, function, and evolution. A number Of very good tools exist for searching sequence databases, for example BLAST [2] and FASTA [65]. These tools can be very powerful in rapidly discovering even weak similarities to a query. However, all of these database search tools require that the molecular sequence Of the query be known. Even though improvements in methods for nucleic acid and protein sequencing have substantially reduced the cost Of Obtaining sequences, there are many situations in which sequencing costs are still too high or the required specialized equipment is just not available. Often, a researcher is able to Obtain more macromolecule isolates than it is practical to sequence. Many of these isolates may be identical, and some rapid method is needed to Obtain a set of unique isolates. One simple method Of categorizing multiple isolates is to use the restriction pattern obtained from digestion with certain enzymes (proteases and restriction endonucle- ases). These enzymes typically cleave the macromolecular substrate at specific sub- sequences. By comparing the pattern Of fragment lengths produced by the isolates, sets Of non-identical patterns can be selected for further study. At a further level of sophistication, the number of matching (same size) fragments between patterns may be used to calculate a measure of (phylogenetic) relatedness 74 between isolates [60, 76, 48]. If the positions of cleavage sites along the molecule (cleavage site map) are known as well as the fragment sizes, then several more accurate methods exist for estimate the relatedness between isolates [60, 15, 19, 37]. Some work has been done on constructing cleavage site maps from digestion patterns [64, 24, 47, 17, 8, 30, 89], but this work typically assumes that there are overlapping fragments. In our case, there are no overlapping fragments to guide us in constructing a cleavage site map. Cleavage site maps can be computed easily from molecular sequences. We are interested in how such maps generated from a sequence database might be used to help characterize a molecular isolate. Assume a cleavage pattern for a query molecu- lar isolate (measured experimentally to within some expected limit of accuracy) and a database containing sequences with varying degree Of sequence similarity to the query isolate. We would like to be able tO extract the set of sequences with cleavage site maps most similar to the (unknown) map of the isolate. A maximum number of common sites between cleavage site maps is used as a measure of similarity. We formalize this problem as Maximum Site Mapping Problem (MSMP) and demon- strate the MSMP to lie in the class Of NP-complete problems conjectured to have no polynomial time solution. To overcome this difficulty, we suggest a heuristic ap- proach to solve a confined version of the problem. In addition, we would like some estimate of how closely related these sequences are to our query molecule in terms of sequence similarity or other biological measures. Related work has also been done in identifying the location of restriction maps Of short query probes in longer restriction 75 maps [56, 55]. However, this problem differs from ours in that the ordering of the query fragments is known. Grouping isolates by cleavage site maps may not be useful in itself, unless this grouping implies some more standard measure Of relationship, such as primary se- quence similarity. Although primary Similarity between two sequences can be used to directly estimate the expected fraction Of shared enzyme sites, sequence similar- ity can not be accurately estimated from the fraction of matching sites alone. Any attempt at performing the reverse calculation requires some knowledge of the under- lying distribution of similarity in the pool from which the two sequences were drawn. This leads us to use experimental results for evaluating the effectiveness of different methods of inferring information about a query isolate. Similar work has been done in classifying unknown peptides by using mass pro- files where the unknown peptide is digested by enzymatic or chemical means and the masses Of the resulting fragments are determined by mass spectrometry [41, 75, 13]. However, when dealing with peptides, several problem parameters are quite different than when dealing with nucleotides. First, the masses Of the resulting fragments can be Obtained with essentially no error. Second, the masses reveal significant informa- tion about the amino acid composition Of each fragment. We perform our experiments using the Ribosomal Database Project (RDP) database of bacterial 168 rRNA gene sequences [51]. This database is a member Of a class Of databases containing sequences derived from that of an unknown common evolutionary ancestor. In addition, the phylogenetic relationships of the sequences in 76 the database have been estimated and are available from the RDP, as is a biological classification scheme based on these relationships. Also, comparison of restriction fragment patterns Of rRNA genes (rDNA) is an accepted method of discerning relat- edness between isolates in micro-biological studies [57]. In this work, we present a method for finding the set Of database sequences with the most sites in common with a query restriction fragment pattern. This method uses sequences in the database as templates to assemble the fragments into putative restriction site maps. The results Of this method are shown to be similar to results Obtained when the exact fragment order is known for both the query isolate and the database sequences. Also, a method to estimate the sequence similarity between the query and database result set is presented. In addition, we demonstrate that the results can be used to place the query in an existing biological classification scheme. 4.2 Methods 4.2.1 Overview Our basic problem is to determine biological information such as primary sequence information about an unknown query isolate q using a database D sequences. Fur- thermore, we do not allow biological sequencing of the unknown query isolate q. An overview Of the approach to solve this problem is given in Figure 4.1. 1. We use enzymes to obtain a restriction pattern Of the query isolate q. Simulta- 77 Step1. F a so 100 300 400J i restfictimpettemq nest 31°92. Step3. 5° 30° 2°° _.[Cs|oilstionot H mum” Hintennmotbioiog'csi uence100 - - mgwenc‘ggso WotqandD muencetso mionnatlonotq Swami! Selecteddatabase 100 100 10050 meets Seaponciu 80 I 20 200400 L " momma J Figure 4.1: The overall process of solving the problem. neously, we analytically compute restriction maps Of all the sequences 8 6 D. 2. We then compute the closeness of the restriction map Of each sequence 3 E D to the restriction pattern of q. 3. This closeness information is then used to infer biological information about q. We describe each of these steps in more detail in the following sections. We note here that our methods for completing each step, particularly the third step, are dependent on the characteristics Of the database D. We first show that analytical methods to infer the primary sequence similarity between the query isolate q and any sequence 3 E D require a priori knowledge Of the relationship of the underlying probability distribution Of the primary sequence similarity of any sequence 3 E D to the sequence of isolate q. Straightforward analytic methods are further handicapped by the assumption that for closely related sequences 3, s’ 6 D, the relationship of the restriction map of s to the restriction pattern of q is independent of the relationship of the restriction map of s’ to the restriction pattern of q. As a result, we experimentally evaluate different methods for inferring information about 78 the query isolate q. Our experimental results demonstrate that effective inference techniques do exist which utilize the set S Q D of sequences which have restriction maps closest to the restriction pattern of q. This result has good consequences for step two of our approach where we show that computing how close the restriction map of any sequence 3 E D to the restriction pattern Of q may be computationally difficult but that determining whether the restriction map of a sequence 3 6 D is extremely close to the restriction pattern of q is computationally tractable. 4.2.2 Obtaining Restriction Patterns and Restriction Maps The basic biological processing we perform on the isolate q is digestion by enzymes which produces a restriction pattern of q. At the same time, we analytically compute the restriction maps of all sequences 3 e D that would have been formed if we had digested these sequences with the same enzymes. To facilitate later comparison of the restriction pattern of q to the restriction maps of s 6 D, we assume that the database sequences represent molecules with ends at positions homologous to the ends of the query molecule. In practice, the end points of query molecules may be known if, for example, they are produced by the PCR reaction using primers to conserved regions. The primers define the query endpoints. For the specific RDP database, PCR is the method of choice for isolating rRNA genes. The two processes of Obtaining a restriction pattern of q and computing restric- tion maps for s e D differ in their precision. Because computing the restriction maps is analytical, we can compute the restriction maps exactly. However, while it 79 is technically possible to measure the exact length of the restriction fragments of q, the methods required are not suitable for a rapid, inexpensive screen. A more real- istic assumption is that fragment sizes would be estimated by simple gel or capillary electrophoresis; this leads us to assume that fragments f which have actual length If] will be measured to have length between (1 - e)| f | and (1 + 6)] f |. In our experiments, we estimate 6 to be 5%. Definitions. We now formally define the terms restriction map, restriction pattern, the closeness Of two restriction maps, and the closeness of a restriction map to a restriction pat- tern. We begin by defining restriction maps, restriction patterns, and the equivalence relation that restriction patterns induce on restriction maps. In the following definitions, 3 is a nucleotide sequence and z is an enzyme. Definition 1 The restriction map M,(s) formed by digesting sequence 3 by enzyme 2 is the ordered multiset of fragment lengths where l,- > 0 for 1 g i _<_ n. We can obtain the set of locations of cut sites 8: {s:s=2195,lj;1 S r 5 n}. Definition 2 Let M,(s) = {11, lg, . . . , l,.}. We define the restriction pattern P,(s) to be the unordered multiset Of fragment lengths {l1, l2, . . . , l,.}. We say that M,(s) yields P, (3). Note many different restriction maps yield the same restriction pattern. For exam- 80 ple, if there are n different fragment lengths in P,(s), n! number of different restriction maps can be generated from P,(s). Definition 3 We define two restriction patterns to be isomorphic if they yield the same restriction pattern. We denote the set of isomorphic restriction maps which yield the restriction pattern P,(s) with the notation [P,(s)]. For example, the restriction map {100, 200,300} and {300, 200, 100} are isomor- phic but neither is isomorphic to the restriction map {100, 200, 100, 200} It is not hard to see that this isomorphic relation defines an equivalence relation on the set of restriction maps. We now define closeness metrics for restriction maps and restriction patterns. First, we need some notation. Let sites(M, (3)) denote the number of cleavage sites in the restriction map M,(s). Let sites([P,(s)]) denote the number Of cleavage sites in any of the restriction maps in P,(s). Note sites(M,(s)) and sites([P,(s)]) is one less than the number Of fragments in M,(s) and P,(s). Definition 4 We define a common site in sequences a and b to be a cleavage site that appears in the same nucleotide position in both a and b. Let common(M, (a), M, (b)) denote the number Of common cleavage sites in the restriction maps M, (a) and M, (b). Let maxcommon(M, (a), P, (b)) denote the maximum number of common cleavage sites in any restriction map in [P, (b)] and M,(a). Definition 5 We define the closeness of the two restriction maps, denoted closeness(M,(s), M, (3’)), to be max(sites(M, (3)), sites(M, (s’)))-common(M, (s), M, (3’)). We define the close- 81 ness Of a restriction map to a restriction pattern, denoted closeness(M, (s), P,(s’)), to be max(sites(M,(s), sites(P,(s’)))- maxcommon(M,(s), P,(s’)). Definition 6 We define the set of sequences 0 (q, D, j) Q D to be the sequences 3 E D such that closeness(M,(s), P,(q)) S j. Example Figure 4.2 explains the above definitions. (a) shows the restriction map M,(a) = {100,500, 400} for sequence a. (b) shows the restriction pattern P, (b) = {100, 200, 300, 400} for b. max(sites(M,(a)),sites(M,(b))) is 3. (c) shows the maxcommon(M,(a), P,(b)) = 2. Clearly, the closeness(M,(a), P,(b)) is 1. loo 7'00 300 l-———i =loo= 500 g 400 i B 400 = (a) 03) p09 200 ,- 300 400 = .100] 500 400 9 fi I I O] (c) c2 Figure 4.2: (a) Restriction map for a. (M,(s) = {100,500,400}. (b) Restriction pattern for b. (P,(b) = {100, 200, 300, 400}. (c) maxcommon(M,(a), P,(b)) = 2 with common sites c1 , 0,. 4.2.3 Computing Closeness Step two Of our approach is computing closeness (M, (s), P, (q)) for each sequence 3 6 D. While we show that this problem appears to be a computationally intractable, 82 we describe how we can efficiently determine if closeness(M,(s), P,(q)) = k where k=0,l or 2. This allows us to compute the set C(q, D, k) which is sufficient for step three Of our approach. We first observe that the problem Of computing closeness(M,(sl),M,(sg)) reduces to the problem Of computing common(M,(sl), M,(sg)) and the prob- lem of computing closeness(M,(sl), P,(sg)) reduces to that Of computing maxcommon(M,(sl), [P,(32)]). We next note that the second problem, computing maxcommon(M,(sr), [P,(s,)]), is NP—complete via a reduction from the 3-Partition problem. The formal proof is as follows: Maximum site matching problem: We have as data the restriction pattern from a sequence a and restriction map from a sequence b when a same enzyme is used, say, Restriction pattern M (a) = {a,- : 1 g i S n} from the digest Of sequence a Restriction map P(a) = {b.- : 1 s i 5 m} from the nucleotide sequence b In general M (a), P(b) will be multisets; that is, there may be values of fragment lengths that occur more than once. The lengths of sequences a and b is = : a,-= Z 05 (4.1) Given the above data the problem is to find orderings for the set M (a) such that the number Of common sites implied by this ordering is maximum. Let 5,. the set of all permutations Of M (a). For a 6 Sn calls a configuration. By 83 ordering M (0) according tO a, we obtain the set Of locations of common cut Sites S = {s : s = lejgagu) and s = Zlggmbjio _<_ r g n,0 S t 5 m}. Since we want to record only the location of cut sites, the set S is not allowed repetitions, that is, S is not a multi-set. Now label the elements Of S such that S={3,-:Osj5k} (4-2) with s.- g s, and 0 _<_ k _<_ min(m,n). The problem is to find a configuration with maximum 1: common sites. Theorem: Maximum site matching problem is NP-complete. Proof: Assume k be the the number of maximum common sites between M (a) and P(b). It is clear that the problem as described above is in NP, as a nondeterministic algorithm need only guess a configuration a and check in polynomial time if 1: common sites exist. The number of steps to check this is in fact linear. To show this problem is NP complete we transform the 3-Partition problem to this problem. In the 3-Partition problem, known to be NP-complete, we are given a finite set M (a) Of 3m elements, a positive bound C, and a positive integer s(a) for each a 6 M(a) such that C/4 < s(a) < C/2 and such that 2,61%.) s(a) = mC. We wish to determine whether M (a) can be partitioned into m disjoint sets M1(a),M2(a),...,M,,.(a) such that, for 1 5 i 5 m, Bag-M,(awm) = C. If Eng-M(a) s(a) = L is not divisible by m, or |M(a)| is not divisible by three, there can be no disjoint sets M1(a), M2(a), . . . , M...(a); else, consider as input to this problem 84 the data M(a) = {s(a,) :1 5 i 5 3m} P(b) = {0,c,---,0} This is simply this problem in the case when the set M (b) has m fragments of equal length and maximum common sites I: = m — 1. Heuristic Solution for MSMP. As a result, we have focused on a restricted version Of this problem which can be solved in polynomial time. In particular, we give a simple polynomial time algorithm which solves the decision problem, “Is closeness(M, (s), [P,(q)]) = k?” where k=0,1, or 2. We can apply this algorithm to all sequences 3 e D to efficiently compute the set C(q, D, k). We first simplify the problem by Observing the answer is no unless sites([P, (q)]) - 2 S sites(M,(s)) 5 sites([P,(q)]) + 2. In other words, we try to find the answer for closeness(M,(s), [P,(q)]) = k where k = 0, 1,2. Thus, we need only consider sequences s 6 D which satisfy the constraint sites([P, (q)]) - 2 5 sites(M,(s)) S sites([P,(q)]) + 2. Next, if s and q have the same number of sites, say n, we Observe that for these sequences 3, the answer is yes if and only if P, (s) is identical to P, (q) assuming that the fragment lengths are exact. This is easily solved in 0(n log n) time by sorting both multi-sets of fragment lengths and verifying that the if” shortest 11' 8% ti ire di 1 l) 85 database fragment length matches the if” shortest query fragment length for 1 5 i 5 n. If the answer is yes, which means closeness(M,(s), [P,(q)]) = 0, we stop. If the answer is no, any two fragments from P,(s) and P, (q) are selected and merged into a longer fragment. Let P,(s)’ and P, (q)’ be new restriction patterns. Again we observe the answer is yes if and only if P,(s)’ and P,(q)’. If the answer is yes, which means closeness(M,(s), [P,(q)]) = 1, we stop. If the answer is no, any two fragments from P, (s)’ and P, (q)’ are selected and merged into a longer fragment. Let P,(s)” and P,(q)” be new restriction patterns from P,(s)’ and P,(q)’. We Observe the answer is yes if and only if P,(s)” and P,(q)". If the answer is yes, which means closeness(M,(s),[P,(q)]) = 2, we stop. If sites([P,(q)]) and sites(M,(s)) are not same, fragments in a restriction pattern with a larger number of sites are merged to be the same number of fragments for both patterns and the answer is checked. Assume sites([P,(q)]) = n and sites(M,(s)) = m. The number of ways Of merging n the fragments into n - 1 is . The number of ways Of merging the fragments 2 . n n - 1 into n — 2 is approximately . However, only consecutive fragments 2 2 in s can be merged into a single fragment. The number of ways Of merging the fragments into m — 1 is (m — 1) . The number of ways of merging the fragments into m - 2 is approximately (m - 1)(m - 2). Therefore in worst case, we have to examine n n n - 1 closeness(M,(s), [P, (q)]) approximately 1+ x (n-l)+ x (n— 2 2 2 1)(n - 2) cases. These merging process and examining closeness(M,(s), [P,(q)]) = k where I: = 0, 1, 2 require polynomial time. 86 While we observe closeness(M,(s), [P,(q)]) = k, we assume the fragment length in q are exact. However, while the set of fragment lengths for P, (s) where s 6 D may be computed exactly, the fragment lengths for P, (q) are only approximate. We only know that the actual length of fragment f ranges between (1 - e)| f | and (1 + 6)] f |. Thus, we must relax our ”yes” condition. In particular, we must consider the length of a query fragment f, to be identical to the length of any database fragment f, if (1 - 6)]qu S f. .<. (1 + Ellfql- Ideally, we would like to find the sequences 3 e D such that closeness(M,(s), M,(q)) = 1:. However, our data is imprecise in two ways which complicates this task. First, we have only P, (q), not M, (q). Second, we have only the approximate lengths, not the exact lengths, of each query fragment. The natural question to ask is, how much do these data imprecisions affect the correctness Of our algorithm? We perform some experiments with the RDP database to show that these imprecisions do not seriously affect the accuracy Of our algorithm. These experiments and results are presented in section 1.5. 4.2.4 Inferring Information from Closeness We now consider the problem Of inferring information about the isolate q from the closeness information we have computed. We first describe analytic methods for inferring information about q given closeness(M, (s), P, (q)) for any 8 E D. Unfor- tunately, these techniques have three drawbacks which limit their applicability to biological databases such as the RDP database. First, the analytical methods require 87 some a priori knowledge on the relationship between q and D such as the distribution of similarities between elements Of D and q. Second, the analytic methods typically assume sequences evolve and change only through mutation or substitution; that is, insertions and deletions of nucleotides are typically not modeled. Third, the analytic methods typically assume that two sequences 3, s' E D vary from s(q) independently, even if we know 3 and s’ are biologically similar. For these reasons, we are forced to evaluate our methods of inference experimentally. We describe a basic heuristic approach and experimentally verify the applicability Of this approach. Theoretical Models. In this section, we describe two methods for using closeness(M,(s), P,(q)) to infer primary sequence information about q. In both methods, we assume that we actually have closeness(M,(s), M,(q)) instead of closeness(M,(s), P,(q)). 1. Nei and Li’s approach. The first method is based upon the work Of Nei and Li [60]. The basic assumption in this model is that all the sequences in D and the underlying sequence s(q) of isolate q are derived from a common ancestor. Sequences diverge over time via a Poisson process as nucleotides mutate. As time goes on, the number of shared sites between q and s typically decreases as more and more nucleotides mutate. Nei and Li developed methods for using the number Of common sites present in s and s(q) to determine the amount Of time that has progressed since 8 and s(q) shared a common ancestor. Using the Poisson process, it is then possible tO compute the number of changes that 88 have occurred in each nucleotide position from the common ancestor to both 3 and s(q). This can then be used to estimate the primary sequence similarity between 3 and s(q). . Bayesian analysis. The second method is based upon Bayesian analysis. Given the primary sequence similarity a between 3 and s(q) and the assumption that each nucleotide position in s is identical to the same nucleotide position in s(q) with probability a, we can compute the probability distribution on the random variable that represents the number of sites common to both restriction maps. However, in our setting, we actually have the number of common sites shared by the restriction maps of s and s(q), and we desire to compute the primary sequence similarity or between 3 and s(q). If we are also given the fact that the primary sequence similarity between 3 and s(q) is drawn from a known probability distribution X, we can compute a probability distribution on or using a Bayesian analysis. Detailed analysis is as follows. Notations q is the query sequence and d is a databwe sequence. 0: denotes the sequence similarity between q and d. r is the length of the restriction enzyme used. n, denotes the number of restriction sites in q. i n“ denotes the number Of shared restriction sites in q and d. p = a', where p is the probability of getting a same shared restriction site. 89 The random variable n4... is simply the binomial random variable with parame- ters (n,, p). Thus the mean Of n“ is simply pnq and the variance is nqp(1 — p). We are more interested in the reverse direction which is deriving a value of a given an Observed value b for a“. That is, we are interested in the condi- tional random variable (a | n4... = b). We can apply Bayesian analysis to get a probability distribution on this conditional random variable. That is, we know that P(nM = b I a = a) = (12') a‘b(l - a4)“"‘b. Thus, the probability that a is in the range al to a; is simply {:3 (mean - a4)"r"’da It ('r)a“(1 — .4)..-th ' Unfortunately, both Of these methods have several flaws which limit their appli- cability to biological databases such as RDP. First, the assumptions on evolution are too restrictive. In both cases, only mutations are considered; insertions and deletions are not allowed. Fhrthermore, both models assume that nucleotides mutate with a given probability distribution which we have access to. This is an extremely strong assumption which does not seem to be justifiable in a general setting. Finally, the Bayesian analysis method does little to account for the fact that biologically similar sequences s and s’ are likely to either both share a site with s(q) or both not share a site with s(q). For these reasons, it seems unlikely either method can be used in gen- eral database settings reliably. Indeed one fundamental assumption that is shared by 90 both models is that the number of common sites Shared by sequences 3 and s(q) is a binomial random variable with p based upon the primary sequence similarity between 3 and s(q). Our experiments with the RDP database in the results section indicate the number of common sites does not seem to follow this binomial distribution. As a result, we explore other methods for inferring information about s(q) and use experimental means tO evaluate these methods. Closest Sequence Methods. In this section, we focus on using the set Of sequences C(q, D, lc) for lc=0,1 or 2 to infer information about the query isolate q. The basic premise behind this method is that sequences which have primary sequences that are most identical to s(q) are the ones most likely to be in the set C (q, D, k). The extended premise is that the similarity of s(q) to any sequence in C (q, D, k) is, with high probability, lower bounded by the maximum pairwise dissimilarity between any two sequences in C (q, D, k). We use this basic premise with two different biological metrics of similarity. The first metric is primary sequence similarity. We say that two sequences 8 and s’ have primary sequence similarity sim(s, s’) = a for 0 5 a 5 1 if lOOa% of the correspond- ing nucleotide positions in s and s’ are identical. Define sim(q, C(q, D, k)) to be the quantity min,ec(q,p,,) sim(q, 3). Example Let q be the query and the set C(q, D, 0) = {31,32,33}. Assume 90% = sim(q,sl), 80% = sim(q,s2), 85% = sim(q, s3). Then the lowest primary similarity between q and C(q, D, 0) is sim(q, C(q, D, 0)) = 80% 91 The second metric is a biological family similarity that is based on a biological classification hierarchy for the database D. Each sequence in D is classified by an 2: digit number for 1 5 x 5 5 (note each digit may have a different range of values). In addition to the rRNA sequence database, the RDP also distributes phylogenetic information inferred from these sequences. This includes a five level hierarchical classification scheme consistent with the inferred phylogeny. For example, by this classification scheme, E. coli is classified as: e First level 2 BACTERIA e Second level 2.13 PURPLEBACTERIA e Third level 2.13.3 GAMMASUBDIVISION e Fourth level 2.13.3.15 ENTERICSANDBELATIVES e Fifth level 2.13.3.15.2 ESCHERICHIASALMONELLAASSEMBLAGE. We say that two sequences s and s’ have level similarity level (3, s’ ) = i if their classifications are identical to the i“ digit. Define level (C(q, D, k)) = min..,,jgc(,,p,,) level (s,-, s,). Finally, we define level(q, C(q, D, k)) to be the quantity min,gc(q,p,,) level(q, s). The extended premise implies that level(C(q, D, k)) lower bounds level(q, C(q, D, k)) with high probability. Example Let q be the query and the set C(q,D,0) = {31,32,33}. Let q be a level 2.1.10.5 and level(sl)=2.1.7.5, level(s2)=2.1.7.1, level(sg)=2.1.7.6. Then level (C (q, D, 0))=2.1.7 and level (q, C (q, D, 0))=2.1. 92 If the extended premise is true and if both sim(C(q, D, k)) and level (C(q, D, k)) are high, we can, with high confidence, infer fairly precise primary sequence informa- tion or classification information about q. In general, it seems that sim(q, C (q, D, k)) and level (q, C (q, D, 1:» should increase if the number of cleavage sites in P, (q) increases. That is, it seems unlikely that dissimilar sequences will have restriction maps that are close to that of q. On the other hand, it also seems likely that sim(q, C(q, D, k)) and level(q, C(q, D, 1:» will decrease in size as the number of cleavage sites in q increases. These relationships, however, are difficult to analyze mathematically, and thus it is difficult to generate conditions where our technique will be effective. As a result, we experimentally evaluate this basic procedure. As we see in the next section, our experiments indicate this method can be used to efl'ectively infer useful information about roughly half the sequences in RDP and that the error rate is quite small. 4.3 Results and Discussion 4.3.1 Experimental Procedure The sequence data used in this study was Obtained from the Ribosomal Database Project (RDP) (release number 5 of May 17, 1995 [51]). The RDP provides curated databases of ribosomal RNA related information and analysis services. The database used in this study was a subset Of the bacterial 16S rRNA database distributed 93 by the RDP. This database contains 16S rRNA sequence information from about 3000 different bacterial isolates and environmental samples. These sequences are distributed by the RDP as pre-aligned sequences with alignment gaps inserted so that homologous residues appear at the same position in all sequences. This alignment was produced by the RDP curators using, in addition to primary sequence similarity, secondary structure and other higher-order information. Inspection of the alignment indicates that highly diverged regions are often aligned tO conserve putative secondary structures without regard to primary sequence. Most of the sequences in this database are incomplete, usually at the two ends. TO construct the subset used here, incomplete sequences were removed after first selecting the region corresponding to positions 46 through 1406 of E. coli [9]. The resulting database contained 1575 sequences. Sequence similarity values were calculated for all pairwise combinations of the 1575 sequences. The similarity values clustered around the mean value of 72% identity, with a tail stretching toward higher similarity values (Figure 4.3). 30 I I T I I 25 - '- 20 .— .— comparisons15 (x 10‘) 10 - - 0 _L I l 40 50 60 70 80 90 100 pnmary elmdarlty(%) Figure 4.3: Pairwise sequence similarity. The pairwise sequence similarity was de- termined for all pairwise combinations of 1575 sequences in the test database. Only sequence regions considered in further analysis were included. The similarity of the aligned pairs was determined using the pre-aligned sequences supplied by RDP. 94 site seq.no site seq.no site seq.no site seq.no 1 6 11 695 21 10 31 2 19 12 727 22 5 32 1 3 86 13 587 23 1 33 4 236 14 429 24 1 34 5 403 15 331 25 1 35 6 626 16 188 26 1 36 7 855 17 80 27 37 8 893 18 49 28 38 9 835 19 22 29 39 10 779 20 9 30 40 Table 4.1: Sequences by number of sites Five separate restriction map data sets were computer generated from the 1575 sequences. For each data set, positions matching recognition sites from two com- mercial restriction enzymes were identified and the length between sites calculated (after removing alignment gaps). Ambiguity codons in the database sequences were treated as not matching any recognition site. The recognition sites (enzymes) chosen to generate the five sets were: AGCT -i- CATG; CCGG + AATT; CTAG + GATC; GTAC + TCGA; and TTAA + ATAT. The average number of sites in the combined 7875 digests was 9.74 with a median of 9 (Table 4.1). The maps in a data set were chosen one at a time as queries and are considered as restriction patterns. When a query had a result set C (q, D, 0), this query did not used for the calculation Of C(q, D, 1). When a query did not have a result set C(q, D, 0), this query was used as a query for the calculation Of C (q, D, 1). This process continued until I: = 2 for single enzyme, 1: = 4 for multiple enzymes. To simulate experimental data, the fragments sizes of the query were assumed to be accurate to only +/- 5%. The result set of database sequences where all sites could be matched was deter- mined for each query, assuming the fragment order for the query was known (ordered 95 query). These result sets were compared with the expected results calculated using the exact site positions from the aligned sequences. These ordered query result sets were missing one or more sequences found using pre-aligned sequences in 836 of the 7875 trials (10.6%). In addition to base changes (point mutations), rRNA genes have accumulated insertions and deletions over the course Of evolution. These insertions and deletions may have caused homologous fragments to no longer have sizes within the 5% error bound. The result of these size changes is apparently to cause pattern mismatches over shorter evolutionary distances than would be predicted from site conservation alone. Another 580 of the 7875 trials (7.4%) produced result sets with extra sequences. Some of these extra matches may be due to peculiarities in the RDP alignment causing occasional site mismatch when comparing aligned sequences; however some extra matches are probably due to matching Of non-homologous sites within the 5% error bound for the ordered query tests. Even if all of the additional matches are due to incorrectly pairing non-homologous sites, the percentage of query results affected is still relatively small. The trials were repeated without assuming the order of query fragments was known (unordered query). Any difl'erences in result sets between ordered and unordered queries represent incorrect pairing of non-homologous regions between query and database sequences (incorrect ordering). Only 263 of the 7875 trials (3.3%) produced result sets with extra sequences not in the ordered query result sets. However, if mismatches were allowed in the site matching, the results deteriorated. When up 96 to one query and/ or database Site mismatch was allowed, 36.3% of the unordered result sets contained sequences not in the ordered query result sets. When up to two query and / or database Site mismatches were allowed the percentage Of result sets with incorrect matches increased to 88.4%. 4.3.2 sim(q, C(q, D, 19)) Number of mismatches (k). The lowest primary similarities between query and sim(q,C(q, D, k)) with k=0,1,2, were calculated as in Figure 4.4. The error bounds were fixed with 5% for k=0,1,2 in the experiments. The values Of sim(q, C(q, D, k)) were dependent, as expected, on the number of sites in the query and the number of mismatched sites, In, between query and result sets. The values increase when the number of matched cleavage sites increases. The value Of sim(q, C (q, D, k)) with given cleavage sites decrease when It increases. For example, the values Of sim(q, C (q, D, 0)) are higher than 90% when the number of cleavage sites is larger than 6. The numbers of required cleavage sites for the values of sim(q, C (q, D, 0)) which is higher than 90% are 12 for k=1 and no site for k=2. Error bounds. sim(q, C(q, D, k)) for different error bounds (10%, 15%, 20%) was calculated as in Figure 4.5. In the experiments, the value of I: was set to 0. The values of sim(q, C (q, D, k)) were dependent, as expected, on the error bounds of the query fragments. The values increase when the sizes of the error bound decrease. For example, the values Of sim(q, C (q, D, 0)) for 5% error bound are higher than 90% when the number Of cleavage sites is larger than 6. The numbers of required cleavage 97 111L111 .q 0‘ IIIIIII O 0" H O g—s 0'! 8 IIIII I— .1 d A1111] 0 5 II) 15 Number of sites in query sequence (k = III) I 95 38 JIIIIII (I 5 III 15 EN) Number of sites in query sequence (’6 = 1) If!) 95. 4 m.- _ 85 .— —i go- A 75- ~ 70- .. 65b- 60 T 0 20] g: I I I : E___.l_l=i:I:El:El:l=:l==__§ (1 5 If) '15 2!) Number of sites in query sequence (k = 2) Figure 4.4: sim(q, C(q, D, k)) for k=0,1,2 with 5% error bound, These results are shown by number of query sites as mean and range, after discarding the 5% highest and lowest values. Bar chart at bottom indicates the number of queries with result sets with size greater than one. The y axis of the first diagram shows the primary similarity between query and result set. The y axis shows the number of query sequences. A site with the number of query sequences 5 10 is eliminated. 98 sassé IIIIII e§§§ s a 75 70 .l L 1111111 I l l 0 5 10 15 M O l 0 5 10 15 Number of sites in query sequences (10%) 8 lpllll 8 0 5 10 15 20 Number of sites in query sequences (15%) .. T .. U! H o ..A U! 8 e§§§ cs 0 5 10 15 Number of sites in query sequences (20%) 8 Figure 4.5: sim(q, C(q, D, 0)) with 10%, 15%, 20% error bounds. The y axis of first diagram shows the primary similarity between query and result set. The y axis shows the number Of query sequences. These results are shown by number of query sites as mean and range, after discarding the 5% highest and lowest values. Bar chart at the bottom of the figure indicates the number Of queries with result sets with size greater than one. A site with the number of query sequences 5 10 is eliminated. 99 sites for the values of sim(q, C(q, D, 0)) 2 90% are 8 (for 10%), 10 (for 15%), 16 (for 20%). Multiple enzymes. To increase the values of sim(q, C(q, D, 19)), multiple enzymes were applied to each query. The result set for each enzyme is Obtained and C (q, D, k)s for each result sets were calculated. Let C(q, D, k)A and Cg(q, D, k)g be the results using two different enzymes A and B with same q and D. First, we calculate the common set of C(q, D, k1)... and 02(q, D, k2)3, which is C(q, D, k1 + k2)AB. The values Of k1 + k2 are 0 thru 4. Then we examine the value Of C(q, D, k1 + k2)“. Figure 4.6 shows the results. 100 ..¢ ' ,l""""' a. I l 1 IL I l l 60 l I I I I I I 0 5 10 15 20 25 30 35 I I 8 m 3%1 0 5 10 15 20 25 30 35 Number of sites in query sequence (k1 + k2 = 0) 100 I -~ vv' rrrrrr I I " ’1'... J l I I I l I l I I II 05101520253035 II T I I I 8 - m assesses IIIII IIIII l 5 10 15 20 25 30 35 40 berofsitesinquerysequence(k1+lc2=1) ' '. . 'O' V; I z Be 5 IIIerf “D C" I l l l I I I l l I l CD OI II 10152025303540 I I PT I ‘1 I I IIII 25 !:gl I 0 5 10 15 20 25 30 35 Numberofsitesinquerysequence(k1+bz=2) m assesses L 101 1m I I 1 I f 95 +- r 90 -— A 85 .. - 80 r. ‘ _ 75 - .. r 70 ._ .- 65 P -I 60 I I I I I I I 0 5 10 I5 20 25 30 35 40 r; I f I I I I I : m 1 I E 0 5 10 15 20 25 30 35 40 Number Of sites in query sequence (In -I- k2 = 3) 100 | 7 I I 95 P m b -i 85 .. -i 80 r— —i 75 - - 70 - a 65 - _ 60 I I I I I I I 0 5 10 15 20 25 30 35 40 : I I I I Pfi I E I I M l I l :1 0 5 10 15 20 25 30 35 40 Numberofsitesinquerysequence (k1+k2 =4) Figure 4.6: sim(q, (C(q, D, k1 + k2))43 for k; + k2=0 thru 4 with 5% error bound and two enzymes. These results are shown by number of query sites as mean and range after discarding the 5% highest and lowest values. Bar chart at bottom indicates the number of queries with result sets with size greater than one. A site with the number of query sequences 5 10 is eliminated. All the values of sim(q, (C(q, D, k. + k2)) are higher than sim(q, (C(q, D, k)), as expected, when the values of k1 + kg and k are same. 4.3.3 level (q, C (q, D, k)) In addition to aligned rRNA sequence databases, the RDP also distributes phyloge- netic information inferred from these sequences. This includes a hierarchical classi- fication scheme consistent with the inferred phylogeny. At the most basic level, all 102 =0 x 1 2 3 4 5 t, 146 47 360 351 3065 r, 146 46 354 347 2898 hit ratio 100.0 97.9 98.3 98.9 94.6 ‘ k=1 x 1 2 3 4 5 t, 446 128 373 148 1239 446 r, 125 362 133 1036 hit ratio 100.0 97.? 97.1 89.9 83.6 =2 x 1 2 3 4 5 t, 552 97 140 39 340 r, 552 88 127 35 221 hit ratio 100.0 90.7 90.1 89.7 65.0 Table 4.2: Hit ratios for each It. 5% error bound. ~ sequences tested here are members of category 2, the Bacteria. There are 15 cate- gories at the next level (2.1 through 2.15), 24 at the third level, 94 at the fourth, and 99 at the fifth and highest level. Not all sequences are categorized to all five levels, we assumed an implied category of 0 for all undefined levels. Number of mismatches (k). The minimum common digits in the levels Of q, level(q, C(q, D, k)) with k=0,1 or 2, were calculated as in Table 4.2. The error bounds were fixed with 5% for k=0,1,2 in the experiments. We define hit ratio tO use as a measure of accuracy of this method. Let t, be the number of queries with level(C(q, D, k)) = x and r, be the number of queries with level(q, C(q, D, k)) = x Then hit ratio for xth level is h, = (r,/t,) x 100(%) (4.3) The hit ratio decreases when It increases. For example, the value of ha are 94.6% for k=0, 83.6% for Ic=l, 65% for k=2. 103 error bound=10% I 3: rx 1 453 453 . 2 128 120 342 332 328 2924 2731 hit ratio 100.0 97.9 98.3 98.9 94.6 error bound=15% 3 t: r: l 776 776 2 229 213 3 451 436 4 305 291 5 2809 2418 hit ratio 100.0 93.0 96.7 95.4 86.0 error bound=20% x t: 7': l 1419 1419 2 399 353 3 539 507 4 278 257 5 2565 1919 hit ratio 100.0 88.5 94.1 92.4 74.8 Table 4.3: Hit ratios for difl'erent error bounds. Error bounds. Hit ratios for different error bounds (10%, 15%, 20%) were calcu- lated as in Table 4.3. In the experiments, the value of I: was set to 0. It is clear that the hit ratio decreases when the error bound increases. For example, the value of hr, are 94.6% for 5%, 94.6% for 10%, 86% for 15%, 74.8% for 20%. Multiple enzymes. To increase the values of level(q, C(q, D, 1:», multiple en- zymes were applied to each query. The result set for each enzyme is Obtained and C(q,D,k)s for each result sets were calculated. Let C(q, D, In) A and C(q, D, k2)3 be the results using two different enzymes with same q and D. First, we calculate the commonly matched sequences C(q, D, k1 + k2)”. Then we examine the value Of level(q, (C(q, D, k1 + kg”). Table 4.4 shows the results. The hit ratio Of multiple enzymes is much higher than single enzyme as expected. For example, the values of hr, for 0 thru 4 are 99.0%, 96.2%, 89.1%, 75.9% and 58.4%. 104 k1 + k2 =0 :1: l 2 3 4 5 t, 4 53 4495 r, 4 53 4450 hit ratio l. 100 120‘ 99.0 k; + k2 =1 _— _ x 1 2 3 4 5 t, 4 3 132 180 3354 r, 4 3 132 179 3222 hit ratio 100.0 100 99.6 99.0 96.1 k1 + k2=2 x 1 2 3 4 5 t, 24 15 259 200 2178 r, 242 15 258 198 1940 hit ratio 100 100 99.6 99.0 89.1 k1 + k2=tll x 1 2 3 4 5 t, 68 41 183 86 1093 r, 68 41 171 80 830 ‘ hit ratio 100 100 93.4 93.0 7—59‘ *1 "I" kg :4 x 1 2 3 4 5 t, 45 32 72 23 519 r, 45 26 61 15 303 hit ratio 100 81.2 84.7 65.7 58.4 Table 4.4: Hit ratios for each k1 + kg. 5% error bound. 105 4.3.4 Ordered. vs. Unordered When we compute sim(q, C (q, D, k)) for It > 0 we allow the merges of fragments in q and/ or s E D to reduce the number Of fragments. We know the order of the fragments in 8 while we don’t know the order Of the fragments in q (unordered). Only consecutive fragments in s are allowed to be merged while any combination Of fragments in q are allowed to be merged. This additional information (order of fragments) gives us more accurate results. We compare sim(q, C(q, D, 2)) with two different cases. Let m be the number Of fragments in q and n be the number of fragments in s. sim(q, C(q, D, 2)) can be calculated when m = n + 2 or n = m + 2. Let sim(q, C(q, D, 2),) be the result by merging the fragments in q and sim(q, C (q, D, 2),) be the result by merging the fragments in s. Figure 4.6 shows the difference between sim(q, C (q, D, 2),) and sim(q, C(q, D, 2),). level(q,C(q,D,2),) and level(q,C(q,D,2),) are calculated by 100 r r r 95- . _- 90» q”—~ 85- A 80.. _ 75— _ 7o- - 65 60 I I I o 5 10 15 20 Number of sites in query sequences Figure 4.7: sim(q, C(q, D, 2)), and sim(q,C(q,D,2llr with 5% error bound. The y axis of the diagram shows the primary similarity between query and result set. The curve ”8” is generated by merging s and the curve ”q” is generated by merging q. doing same process. Hit ratio for each level is calculated. Table 4.5 show the hit ratios of level (q, C (q, D, 2),) and level (q, C(q, D, 2) .). Table 4.5: Hit ratio of level(q, C(q, D, 2),) and level(C(q, D, 2),) 4.3.5 Simulation of Random Database To confirm our method can be applied to other databases, we generated random databases and applied our method and checked the results. To simulate the evolu- tionary procedure, one original sequence, 8}, is generated randomly. 8: denotes xth random sequence of yth level. Two random sequences, 9], sf are generated from so with a similar to so. Again, four random sequences, 3;, ..., s; are generated from s], sf with u percent similar to origin sequence. We call u as uniform edge length. This process continues until 2048 (sh, ..., sfi“) random sequences are generated. The final 2048 random sequences are considered as database sequences. The whole random sequences resemble binary tree structure. Ten random databases with uniform edge length u are generated with different orders Of random seeds and different values of u. get random database with different uniform primary similarity. Same five set of enzymes as in experiment of RDP are applied to ten random databases for each u. All 2048 random sequences in a DB are considered as query sequences. Therefore total number of query sequences are 10240 (2048 x 5). For each query, (CLOSE(q, D, 0) is calculated. To identify the properties of selected database sequences, the value of sim(q, s) are accumulated instead of calculation Of sim(q, (CLOSE(q, D, 0)). In this experiment error bound is zero percent. Figure 107 4.7 shows the output Of this simulation with edge length 92% thru 98%. The edge lengths u 5 98% are not enough to generate meaning output. Also u 5 90% are not biologically meaningful. Therefore we only consider the random databases with u 5 99%. Figure 4.8 shows the output u 5 99%. 5m I l I I I I I I I 888888 150- 100- wh 0 10098 96 94.92 90.88 86 84 82 80 Primarysumlanty(%) Figure 4.8: Accumulation of sim(q, s) for random databases with edge lengths 98%, 96%, 94%, 92%. Bar charts for each diagram show the standard deviations. y shows the total number of queries with output. x shows the primary similarity between query and matched database sequences. 108 First diagram in Figure 4.8 shows the total number of query sequences vs. sim(q, d). The number Of matched database sequences rapidly decreases when the primary similarity between query and database sequences decreases. The value of sim(q,s) for most of s is greater than 90% which means the value of sim(q, (CLOSE(q, D, 0)) is also greater than 90%. This diagram shows that our method can be applied to other sequence databases. This diagram shows the useful- ness Of our method. The output in the first digram of Figure 4.8 is divided by two groups which are forward and backward mutations. First, we check the first common ancestor, o, of the q and d. Second, we check if q and o are matched. If q and d are matched, we consider the output as forward. If not, we consider the output as backward. Second diagram in Figure 4.8 shows the output with forward and third diagram in Figure 4.8 shows the output with backward. The curves in second diagram are decayed rapidly while the curves in third diagram are increased until some period and decayed. Phrther study is required for more detailed analysis for these diagrams. 109 8000 _ ”99.8” 9— _ .. ”99.7” i . ”99.6” _ 7000 .994. _,_ 6000 ml] _ ”99.2” e— - 96 Primary similarity(%) 9000 I I m l 8“” .1 4 ”99.8” 0_ _, 3000 .. ”99.8” 9— _. ”99.7” g: 7 - ”99.6” _ 000 ”99.4” *— eooo - ”99.2” €— - 100 - 98 l ' I I 94 A I 92 90 Figure 4.9: Accumulation Of sim(q, s) for random databases with uniform primary similarities 99.8%, 99.7%, 99.6%, 99.4%, 99.2%. Bar charts for each diagram show the standard deviations. Chapter 5 Conclusions In this thesis, we proposed an algorithm called MSASA, which is based on simulated annealing approach, to align multiple protein sequences. Dynamic programming of multiple sequence alignment has been widely used to find an optimal alignment for certain cost functions. However, it does not allow certain types Of cost function including natural gap costs, and limits the number Of sequences that can be aligned due to its high computational complexity. MSASA overcomes these problems because it uses any gap costs which generates a better solution. It aligns more sequences, and takes less computation time compared to a dynamic programming approach. A solution set for a multiple sequence alignment problem is identified, and the multiple sequence alignment problem is reformulated to find an Optimal alignment from this solution set. The computational complexity Of MSASA is significantly reduced by substituting the higher temperature phase of the annealing process by a fast heuristic algorithm and confining the solution set. 110 111 We then suggested an algorithm called RNASA, which is based on simulated an- nealing, to align multiple RNA sequences. The output RNA alignment from RNASA is used to identify possible secondary structures. A transition rule, based on double shuffle, is proposed. This transition rule provides faster convergence to an optimal solution. We Show that this new transition rule takes less convergence time than a conventional transitiOn rule, based on single shuffle. RN ASA is applied to the se- quences from RDP and the results are compared to known secondary structures. The usefulness of RNASA is supported with experimental results. We then studied the problem of obtaining biological information about a macro- molecule isolate using only restriction patterns and restriction map databases. Max- imum site matching problem (MSMP) is defined in a formal way and proved to be in NP-complete problems. We suggested a heuristic algorithm to solve MSMP. A three phase approach to obtain relatedness Of unknown macromolecules and database sequences is suggested. We demonstrate usefulness Of our approach using RDP database. Bibliography [1] S. F. Altschul. Gap costs for multiple sequence alignment. J. Theor. Biol, 138:297-309, 1989. [2] S. F. Altschul, W. Gish, W. Miller, E. M. Myers, and D. J. Lipman. Basic local alignment search tool. J. Mol. Biol, 215:403-410, 1990. [3] S. F. Altschul and D. J. Lipman. Tmes,stars,and multiple biological sequence alignment. SIAM J. appl. Math., 49:197—209, 1989. [4] P. Argos. A sensitive procedure to compare amino acid sequences. J. Molec. Biol, 193:385—396, 1987. [5] D. J. Bacon and W. F. Anderson. Multiple sequence alignment. J. Molec. Biol, 191:153—161, 1986. [6] W. Bains. Multan: A program to align multiple dna sequences. Nucl. Acids Res, 14:159—177, 1986. [7] G. J. Barton and M. J. E. Sternberg. A startegy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons. J. Molec. Biol, 198:327-337, 1987. [8] B. Bellon. Construction of restriction maps. CABIOS, 4:111—115, 1988. [9] J. Brosius, M. L. Palmer, P. J. Kennedy, and H. F. Noller. Complete nucleotide sequence of a 168 ribosomal rna gene. Proc. Natl. Acad. Sci. U. S.A, 75:4801—4805, 1978. [10] H. Carrillo and D. Lipman. The multiple sequence alignment problem in biology. SIAM J. Appl. Math, 48:1073—1082, 1988. [11] T. R. Cech. Conserved sequences and structures Of group i intronszbuilding an active site for ma catalysis-a review. Gene, 73:259-271, 1988. [12] F. Corpet. Multiple sequence alignment with hierarchical clustering. Nucl. Acids Res, 16:10881-10890, 1988. [13] J. Cottrell. Protein identification by peptide mass fingerprinting. Peptide Re- search, 7(3):115—124, 1994. 112 113 [14] M. O. Dayhoff. A model of evolutionary change in proteins. matrices for detect- ing distance relationships. In Atlas of Protein Sequence and Structure, volume 5 suppl. 3, pages 345-352. Dayhoff. M. O.(ed) Washington. DC: National Biomed- ical Research Foundation, 1978. [15] R. DeBry and N. A. Slade. Cladistic analysis of restriction endonuclease cleavage maps within a maximum-likelihood framework. Syst. Zool, 34(1):21-34, 1985. [16] A. Delcoigne and P. Hansen. Sequence comparison by dynamic programming. Biometrika, 62:662—664, 1975. [17] R. Durand and F. Bregegere. An efficient program to construct restriction maps from experimental data with realistic error levels. Nucleic Acids Res, 12:703- 716, 1984. [18] S. R Eddy and R. Durbin. Rna sequence analysis using covariance models. Nucl. Acids Res, 22:2079—2088, 1994. [19] Joseph Felsenstein. Phylogenies from restriction sites: a maximum-likelihood approach. Evolution, 46(1):]59—173, 1992. [20] D. F. Feng and R. F. Doolittle. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Molec. Evol, 25:351-360, 1987. [21] D. F. Feng, M. S. Johnson, and R. F. Doolittle. Aligning amino acid sequences: comparison Of commonly used methods. J. Molec. Evol, 21:112—125, 1982. [22] J. W. Fickett. Fast Optimal alignment. Nucl. Acids Res, 12:175-180, 1984. [23] W. M. Fitch and T. Smith. Optimal sequence alignments. Proc. Natl. Acad. Sci. USA., 80:1382-1386, 1983. [24] W. M. Fitch, T. F. Smith, and W. W. Ralph. Mapping the order of dna restriction fragments. Gene, 22:19-29, 1983. [25] G. E. Fox and C. R. Woese. 5s rna secondary structure. Nature, 256:505-507, 1975. [26] M. L. E'edman. Algorithms for computing evolutionary similarity measures with length independent gap penalties. Bull. Math. Biol, 46:553—566, 1984. [27] O. Gotoh. An improved algorithm for matching biological sequences. J. M olec. Biol, 162:705-708, 1982. [28] O. Gotoh. Alignment Of three biological sequences with an eflicient traceback procedure. J. Theor. Biol, 121:327—333, 1986. [29] M. Gribskov, R. Luthy, and D. Eisenberg. Profile analysis. J. Theor. Biol, 183:146-159, 1990. 114 [30] A. V. Grigorjev and A. A. Mironov. Mapping dna by stochastic relaxation: a new approach to fragment sizes. CAB] OS, 6:107—111, 1990. [31] L. K. Grover. Standard cell placement using simulated sintering. Proc. 24 th DA 0, pages 56-59, 1987. [32] R. R Gutell. Collection of small subunit (168 and 168 like) ribosomal structures. Nucl. Acids Res, pages 3502—3507, 1994. [33] J. Hein. A new method that simultaneously aligns and reconstructs ancestral sequences for any number of homologous esequences, when the phylogeny is given. Molec. Biol. Evol, 6:649—668, 1989. [34] D. G. Higgins and P. M. Sharp. Clustal: a package for performing multiple sequence alignment on a microcomputer. Gene, 7:237—244, 1988. [35] M. Hirosawa, M. Hoshida, M. Ishikawa, and T. Toya. Mascot: Multiple align- ment system for protein sequences based on three-way dynamic programming. CABIOS, 9:161-167, 1993. [36] P. Hogeweg and B. Hesper. The alignment Of sets of sequences and the con- struction of phyletic trees: an integrated method. J. Molec. Evol, 20:175-186. 1984. ‘ [37] K. E. Holsinger and R. E. Jansen. Phylogenetic analysis Of restriction site data. Methods in Enzymology, 224:439—455, 1993. [38] M. Ishikawa, T. Toya, M. Hoshida, K. Nitta, A. Ogiwara, and M. Kanehisa. Multiple sequence alignment by parallel simulated annealing. CABIOS, 9:267- 273, 1993. [39] A. B. Jacobson, L. Good, J. Simonetti, and M. Zuker. Some simple computa- tional methods to improve the folding of large rnas. Nucleic Acids Res., 12:45-52, 1984. [40] B. D. James, G. J. Olsen, and N. R. Pace. Phylogenetic comparative analysis of ma secondary structure. Methods in Enzymology, 180:227—230, 1989. [41] P. James, M. Quadroni, E. Carafoli, and G. Gonnet. Protein identification by mass profile fingerprinting. Biochemical and Physical Research Communications, 195(1):58-64, 1993. [42] MS. Johnson and R. F. Doolittle. A method for the simulataneous alignment of three or more amino acid sequences. J. Molec. Evol, 23:267-287. 1986. [43] R. A. J as, N. W. Woodbury, and R. F. Doolittle. Sequence homologies among e.coli ribosomal proteins: evidence for evolutionary related groupings and inter- nal duplication. J. Molec. Biol, 15:129—148. 1980. 115 [44] J. Kim and S. Pramanik. Multiple sequence alignment using simulated annealing. In Second International Conference on Intelligent Systems for Molecular Biology, 1994. [45] J. Kim, S. Pramanik, and M. J. Chung. Multiple sequence alignment using simulated annealing. CABIOS, 10:419—426, 1994. [46] S. Kirkpatrick, C.D. Gelatt Jr, and M. P. Vecchi. Optimization by simulated annealing. Scinece, 220:671—680, 1983. [47] M. Krawczak. Algorithms for the restriction-site mapping Of dna molecules. Proc. Natl. Acad. Sci. USA., 85:7298—7301, 1988. [48] W. Li. Evolutionary change of restriction cleavage sites and phylogenetic infer- ence. Genetics Society of America, 113:187-213, 1986. [49] D. J. Lipman, S. F. Altschul, and J. D. Kececioglu. A tool for multiple sequence alignment. Proc. Natl. Acad. Sci. USA., 86:4412-4415, 1989. [50] A. V. Lukashin, J. Engelbrecht, and S. Brunak. Multiple alignment using simu- lated annealing: branch point definition in human mrna splicing tool for multiple sequence alignment. Nucl. Acids Res, 20:2511-2516, 1992. [51] B. L. Maidak, N. Larsen, M. J. McCaughey, R. Overbeek, G.J. Olsen, K. Fogel, J. Blandy, and C. R. Woese. The ribosomal database project. Nucl. Acids Res, 22:3485-3487, 1994. [52] H. M. Martinez. A flexible multiple sequence alignment program. Nucl. Acids Res., 16:1683—1691, 1988. [53] N. Metropolis, A. Rcsenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of state calculations by fast computing machines. J. Chem. Phys, 21:1807-1092, 1953. [54] F. Michel, K. Umesono, and H. Oseki. Comparative and functional anatomy Of group ii catalytic introns-a review. Gene, 82:5—30, 1989. [55] W. Miller, J. Barr, and K. Rudd. Improved algorithms for searching restriction maps. CABIOS, 7(4):447—456, 1991. [56] W. Miller, J. Ostell, and K. Rudd. An algorithm for searching restriction maps. CABIOS, 6(3):247—252, 1990. [57] C. R. Moyer, F. C. Dobbs, and D. M. Karl. Estimation of diversity and com- munity structure through restriction fragment length polymorphism distribution analysis Of bacterial 168 rrna genes from a microbial mat at an active, hydrother- mal vent system. Appl. Environ. Microbiol, 60:871—879, 1993. 116 [58] M. Murata, J. S Richardson, and J. L. Sussman. Simultaneous comparison of three protein sequences. In Proc. Natl. Acad. Sci. USA., volume 82, pages 3073- 3077, 1985. [59] S. B. Needleman and Wunch. C. D. A general method applicable to the search for similarities in the amino acid sequences Of two proteins. J. Molec. Biol, 48:443—453, 1970. [60] M. Nei and W. Li. Mathematical model for studying genetic variation in terms of restriction endonucleases. Genetics, 76(10):5269—5273, 1979. [61] H. F. Noller and C. R. Woese. Secondary structure Of 168 ribosomal rna. Science, 212:519—533, 1981. [62] R. Nussinov and Jr. Tinoco, I. Sequential folding of a messenger rna molecule. JMB, 151:519—533, 1981. [63] N. R. Pace, D. K. Smith, G. J. Olson, and B. D. James. Phylogenetic comparative analysis and the secondary structure of ribonuclease p rna-a review. Gene, 82:65— ‘75, 1989. [64] W. Pearson. Automatic construction of restriction site maps. Nucleic Acids Res, 10:217-227, 1982. [65] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence com- parison. basic local alignment search tool. Proc. Natl. Acad. Sci. U.S.A, 85:2444— 2448, 1988. [66] G. Quigley, L. Gehrke, D Roth, and P. Auron. Computer-aided nucleic acid secondary structure modeling incorporating enzymatic digestion data. Nucleic Acids. Res, 12:347—366, 1984. [67] . J. Rose, Wolfgang. Klebsch, and J uergen. Wolf. Tempteraure measurement Of simulated annealin placements. In Proceedings I CCAD, pages 514-517, 1988. [68] D. Rose, W. Snelgrove, and Z. Vranesic. Parallel standard cell placement algo- rithms with quality equivalent to simulated annealing. IEEE Transactions on CAD, 7(3):387-396, 1988. [69] J .8 Rose, D. Blyde, W. Snelgrove, and Z. Vranesic. Fast, high quality vlsi place- ment on an mimd multiprocessor. I CCAD, 86:42-45, November 1986. [70] Y. Sakakibara, M. Brown, I. S. Mian, R. Underwood, and D. Haussler. Stochastic context-free grappars for modeling rna. In Proceedings of the Hawaii Interna- tional Conference on System Sciences, Los Alamitos, CA, 1994. IEEE Computer Society Press. [71] D. Sankofl'. Simultaneous comparison of three or more sequences related by a tree. Addison-Wesley, Reading, MA, 1983. 117 [72] D. Sankofl' and J. B. Kruskal. Time Warps, String Edits and MacromoleculesL The Theory and practice of Sequence Comparison. Addison-Wesley, Reading, MA, 1983. [73] D. B. Searls. The computational linguistics of biological sequences, chapter 2. AAAI Press, 1993. [74] P. H Sellers. On the theory and computation Of evolutionary distances. SIAM J. Appl. Math.,26:787—793,1974. [75] G. Shaw. Rapid identification Of proteins. Proc. Natl. Acad. Sci. USA, 90:5138— 5142, 1993. [76] P. E. Smouse and Li. W. Likelihood analysis Of mitochondrial restriction-cleavage patterns for the human-himpanzee-gorilla trichotomy. Evolution, 41(6):1162-— 1176, 1987. [77] P. Taylor. A fast homology program for aligning biological sequences. Nucl. Acids Res, 12:447-455, 1984. [78] W. R. Taylor. Multiple sequence alignment by a pairwise algorithm. CAB] OS, 3:81-87, 1987. [79] W. R. Taylor. A flexible method to align large numbers of biological sequences. J. Molec. Evol, 28:161-169, 1988. [80] L. Tinoco Jr., 0. C. Uhlenbeck, and M. D. Levine. Estimation Of secondary structure in ribonucleic acids. Nature, 230:363—367, 1971. [81] M. Vihinan. Simultaneous comparison of several sequences. Methods Enzymol., 183:447—456, 1990. [82] M. Vingron and P. Argos. Motif recognition and alignment for many sequences by comparison Of dot-matrices. J. Molec. Biol, 218:33—43, 1991. [83] M. S. Waterman. Consensus patterns in sequences. Boca Raton FL: CRC Press, 1989. [84] M. S. Waterman and M. D. Perlwitz. Consensus methods for folding single- stranded nucleic acids, chapter 8. CRE Press, 1984. [85] M. S. Waterman and M. D. Perlwitz. Line geometries for sequence comparisons. Bull. Math. Biol, 46:567—577, 1984. [86] MS. Waterman, T.F. Smith, and W.A Beyer. Some biological sequence matrices. Adv. Math., pages 367-387, 1976. [87] C. R Woese, R. R Gutell, R. Gupta, and H. F. Noller. Detailed analysis of the higher-order structure of 168-er ribosomal ribonucleic acids. Microbiology Reviews, 47(4):221—229, 1983. 118 [88] L. J. Woese, C. R. Magrum, R. Gupta, and R. B. Siege]. Secondayr structure model for bacterial 168 ribosomal rnazphylogenetic, enzymatic and chemical ev- idence. NAR, 8:2275-2293, 1985. [89] L. W. Wright, J. B. Lichter, J Reintz, M. A. Shifman, K. K. Kidd, and P. L. Miller. Computer-assisted restriction mapping: an integrated approach to han- dling experimental uncertainty. CABIOS, 10:443—450. 1994. [90] M. Zuker and P. Stiegler. Optimal computer folding Of large rna sequences using thermodynamics and auxiliary information. Nucleic Acids Res, 9:133-148, 1981.