CHIGAN ST I III III IIIIIIIIIIIIIIIIIIIIIIIIIIIII 301570 0747 This is to certify that the thesis entitled PERMUTED SUBSTRING MATCHING, WITH APPLICATIONS TO COMPUTATIONAL BIOLOGY presented by Houman Alborzi has been accepted towards fulfillment of the requirements for Master of Science degree in Compote]: Science Eric Torng &- [-4.39 Major professor Date _&-20—1997 0-7 639 MS U is an Affirmative Action/Equal Opportunity Institution LIBRARY Michigan State Unlversity PLACE ll RETURN BOX to roman this checkout from your rooord. To AVOID FINES return on or More data due. I I I I II I L_I | |L_I_—_IL__I :LJCI MSU In An Affirmulvo AotIon/Equul OpponunIIy Inotltulon I PERMUTED SUBSTRING MATCHING, WITH APPLICATIONS TO COMPUTATIONAL BIOLOGY By H ouman Alborzz' A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Department of Computer Science 1997 ABSTRACT PERMUTED SUBSTRING MATCHING, WITH APPLICATIONS To COMPUTATIONAL BIOLOGY By H ouman Alborzz' We propose a new inexpensive technique to find the DNA sequence of genes. The technique is based upon finding the location of a fragment of a genome using the restriction map of the genome and the restriction pattern of the fragment. As there may be more than one location on the genome matching the restriction pattern of the fragment, we investigate the effectiveness of our technique based on how precisely it can determine the location of the fragment. We Show that the size of the restriction pattern of the fragment is the primary parameter that affects the effectiveness of our technique. We have also designed and implemented fast algorithms for locating the fragment. Dedicated to the advancement of frontiers. iii ACKNOWLEDGMENTS I would like to thank Dr. Eric Torng, my academic advisor who was always trying to put order in my chaotic work. Without his patience and support, completing this thesis would not have been possible. Dr. Sakti Pramanik and Dr. James Cole were always eager to discuss my work. I am grateful for their help and time allotted to this project. My technical thanks to the staff of the computer science department computing services who provided the support and maintenance needed for this project. Finally, I would like to thank the faculty, staff, and students of the computer science department with whom I worked. They were always encouraging. iv TABLE OF CONTENTS LIST OF TABLES LIST OF FIGURES 1 Introduction 1.1 Genome, Genes, and DNA .......................... 1.2 Restriction Enzymes ............................. 1.3 Formal Definitions .............................. 1.4 Problem Statement .............................. 1.5 Background on Searching .......................... 1.6 Outline .................................... 2 A probability distribution model for restriction fragment lengths 2.1 Restriction Fragment Lengths have an Approximate Exponential Distri- bution ................................... 2.2 Normalized fragment lengths ......................... 3 Effectiveness of the Procedure 3.1 Metrics for Effectiveness ........................... 3.2 Resolution Metric ............................... 3.3 Category Metric ............................... 3.4 Score Metric .................................. 4 A Matching Algorithm 4.1 The Basis Theorem .............................. 4.2 Brute Force Algorithm ............................ 4.3 Obtaining Sorted Subsequences in 0(Nk) comparisons .......... 4.4 Limiting the search area to the most promising areas ........... 4.5 More Improvement .............................. 4.6 More on Comparing RP and RM ...................... BIBLIOGRAPHY vi vii (DOOQCHOOI-‘H 10 10 13 17 17 19 21 24 27 27 29 30 31 33 34 38 LIST OF TABLES Chi—Square test confidence levels for distribution of restriction sites Running time (in seconds) of the algorithm. ......... ' ....... Running time (in seconds) of the algorithm with 1 limiting fragment. . . Running time (in seconds) of the algorithm with 3 limiting fragments. . . vi 1.1 1.2 2.1 2.2 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 4.1 4.2 LIST OF FIGURES A schematic of an unfolded DNA double helix ................ 2 An example of a digestion process. ..................... 4 CDF of a restriction pattern and its approximation ............. 12 CDF of a normalized restriction pattern and its approximation. ..... 14 Possible overlapping matches to a restriction pattern. ........... 18 Average resolution of isolates, 6 = 0.04 .................... 20 Average resolution of isolates, 6 = 0.10 .................... 20 Percentage of isolates in category 3 ...................... 22 Percentage of isolates in category 2, 6 = 0.04. ............... 24 Percentage of isolates in category 2, 6 = 0.10. ............... 25 Score metric, 6 = 0.04. ............................ 26 Score metric, 6 = 0.10. ............................ 26 Inner-Matching ................................. 35 Proof of Theorem 4.2. ............................ 37 vii Chapter 1 Introduction In this thesis we investigate the problem of locating a fragment of a previously sequenced genome on the genome. More specifically, we try to develop a new proce- dure to find the genetic sequence of genes. In this procedure, a genome that has been previously sequenced is broken into fragments, perhaps using restriction enzymes. Then, after biological experiments have determined that some of the fragments are biologically interesting, we try to find the DNA sequence of those fragments. Usually finding the DNA sequence is expensive. However, since the particular DNA sequence is part of our previously sequenced genome, the problem of finding the DNA sequence of the gene reduces to finding the location of the gene on the genome. 1.1 Genome, Genes, and DNA Genetics goes back to the pioneering work of Gregor Mendel in 1865 [LW95]. Mendel hypothesized that traits are affected by discrete factors, which an offspring inherits from its parents. Today, we call those factors genes. In 1952, Alfred Hershey and Martha Chase showed that genes are encoded in DNA molecules which exist in any living organism’s cells. Later, in 1953, Watson and Crick presented a model of 1 2 the structure of a DNA molecule. The DNA molecules are linear polymers of four basic chemical structures called nucleotides named adenine, thymine, cytosine, and guanine, and abbreviated as A, T, C, and G. Each gene is usually associated with one or more DNA subsequences. These subsequences will be translated into protein sequences, which are considered to be the main chemicals affecting the traits of an organism. The whole set of genes of an organism constitutes its genome. In this manuscript we use the word genome to refer to a single sequence comprising the sequences of all DNA molecules of an organism. The DNA nucleotides are joined by a sugar-phosphate backbone to form a DNA molecule. While some bacteria have single stranded DNA molecules, the DNA of humans and most other organisms consists of two strands making a double helix in which the bases (nucleotides) pair up (Figure 1.1). Figure 1.1: A schematic of an unfolded DNA double helix. This pairing of bases is not arbitrary. A always pairs with T and C always pairs with G. Hence, the sequence of either strand can be determined from the sequence 3 of the other one. The length of a DNA molecule is measured in units of base pairs (bp), which is the number of bases in a single strand of the DNA molecule. Usually, we represent a strand of DNA by a single string over the alphabet of {A, T, C, G}. DNA molecules are usually linear, that is they have two endpoints. However, in some organisms the DNA is circular, so there is no starting or ending point of the DNA. 1 .2 Restriction Enzymes There are enzymes that can cleave a DNA molecule into several fragments. These enzymes are called restriction enzymes, and the process of cleaving the DNA molecule into fragments is called digestion. Each restriction enzymes has a recognition sequence, which is a sequence of nucleotides where the enzyme cuts the DNA molecule. For example, the EcoRI enzyme cuts a DNA molecule at all occurrences of the sequence GAATTC. The cut site of a restriction enzyme determines exactly where in the recog- nition sequence the DNA molecule will be cut. For the EcoRI enzyme the cut site is 1, which means that the DNA molecule will be cut after the first base of the recognition sequence. For example, if the DNA sequence is GCTGAGAATTCGCAAG- GAATTCGCAG, then the EcoRI enzyme will cleave it into three fragments, with the sequences of: GCTGAG, AATTCGCAAG, AATTCGCAG (Figure 1.2). The restriction sites of a DNA molecule for a given restriction enzyme is the set of locations of all the enzyme’s cut sites on the molecule. The ordered list of fragment lengths is the restriction map of the DNA molecule for the given enzyme. For instance, 4 in our previous example, the restriction map, will be (6, 10,9). A restriction pattern is the multi-set of fragment lengths of a restriction map. That is, the order of fragments in the restriction map is forgotten. For example the restriction pattern of our example is the set {6, 9, 10}. We can obtain the restriction map of a DNA molecule using the gel electrophoresis experiment. This procedure is much simpler and cheaper than existing methods for obtaining the corresponding restriction map. In the gel electrOphoresis experiment, the restriction fragments are placed in an electric field over a gel coated medium. The fragments then are separated based on their length. There is a logarithmic relationship between the size of DNA fragment and the distance it migrates on a gel. By measuring the distance each fragment has migrated, the length of the fragment can be found. Restriction Sequence: GAATTC . A Cut Site: 1 DNA Sequence: GCTGAGIAATTCGCAAGGAAATTCGCAG Restriction Map: Ir 6 I 10 I 9 Resulting Fragments: GCTGAG AATTCGCAAG AATTCGCAG l l Figure 1.2: An example of a digestion process. There is a measurement error associated with this experiment; that is, a fragment of length I will be read with some error e which results in a measured length of l(1+e). Usually, c is a random variable with a normal distribution. However, we can safely assume that it is uniformly distributed over a range of (—6, 6), where 6 is a parameter of the precision of the devices used in the experiment. Wherever needed, we assume 5 that 6 = 0.05. For example, considering this measurement error, a restriction pattern of {100, 105, 300, 400} could be read as {98,100,297,415}. The process of digesting a DNA molecule by an enzyme can be partial or com— plete. In a complete digestion, a DNA molecule is broken at all of its recognition sequences. In a partial digestion, not all of the recognition sequences are broken. For example the result of a partial digestion in Figure 1.2 could be the GCTGAG and AATTCGCAAGAATTCGCAG fragments. Depending on the duration of the digestion process and the amount of restriction enzyme used, we can have various degrees of digestion. The next section formally defines the terms described before. It can be skipped in a first reading. 1.3 Formal Definitions Most definitions in this section are based on those in [KCTP96]. Nevertheless, we use more definitions to facilitate the communication. Definition 1.1. The restriction map RMe(g), for the DNA sequence 9 and the re- striction enzyme e, is the ordered tuple (31,32, ...,sn) such that the first restriction site ofe on g is at 31, the second one at s, + 32, and the ith one at 22:13,. If the DNA sequence 9 is a circular molecule, then we arbitrarily choose 31 as the length of one of the restriction fragments and assume that the cut site of its beginning is at index 0 of 9. Definition 1.2. For the multi-set RP and the ordered tuple RM, we define RP ~ 6 RIM if there exists a one to one mapping between their elements. Definition 1.3. The restriction pattern RPc(g) for the DNA sequence 9 and the restriction enzyme 6 is the multi-set {81, .92, . . . , 8"} such that RPe(g) ~ RAIe(g). Definition 1.4. We define z 3" y if and only ifa: g y(1 + 6). In this case we say a: lower-matches y with error 6. Also, :2: 26 y if and only ifzr 2 y(1 — 6). In this case we say I upper-matches y with error 6. Finally, a: 2‘S y if and only if x 3‘5 y and 2: 26 y; that is, y(1— 6) S :1: S y(1+ 6). In this case we say 3 matches y with error 6. Definition 1.5. We say the restriction map RM’ matches restriction map RM with error 6, denoted RM’ =6 RM, if and only if Vi RM,’ =6 RIVA. I Definition 1.6. We say the restriction pattern RP matches restriction map RM with error 6 if and only if, there exists some permutation of RP, denoted by RM’, such that RM’ =6 RM. We denote this by RP ~6 RM. 1 .4 Problem Statement There is a huge amount of ongoing research to obtain the complete genome of var- ious organisms where the results will be stored in gene banks. Some micro-biologists believe that the functionality of some genes is based on their DNA sequences only; hence, by removing or adding a DNA sequence from or to an organism, we can modify the functionality of that organism. Thus, there is a great deal of research focused on finding the DNA sequence of genes responsible for specific traits. 7 Here, we provide a new procedure for finding the sequence of a gene without trying to sequence it directly. The basic idea is that if we know that a DNA fragment contains genes, and it is already part of an already sequenced genome, we can use its restriction pattern data, which is cheaply available, to find its location in the original genome, and hence learn its underlying DNA sequence. The problem is based on the following procedure: 1. Choose a completely sequenced genome that has some interesting characteris- tics. 2. Break down the genome using a partial digestion experiment into fragments, denoted as isolates. 3. Perform biological tests on the isolates in order to identify some interesting isolates. 4. Completely digest an interesting isolate, and use gel electrophoresis to obtain its restriction pattern. 5. Use the restriction pattern of the isolate to find its location on the genome in order to obtain information about its underlying DNA sequence. The last item in the list above is the main goal of this thesis; that is, as we wish to develop fast and practical algorithms to search for locations on the restriction map of the genome which match the restriction pattern of interesting isolates. However, one important question is “how useful is the proposed procedure?” In other words, given a restriction pattern of an isolate, how many locations on the genome could have the same restriction pattern. We denote this attribute of a restriction pattern as its resolution with respect to the genome. 8 1.5 Background on Searching The problem of pattern matching has been extensively studied. The basic problem is to search a text for all the locations that match a given search pattern [Aoe94]. This match could be an exact or an approximate one. For an exact match algorithm, the result will be all the locations in the text which match the pattern exactly. In most approximate matching contexts, the text and pattern can have up to k mismatches. The search pattern is usually a sequence of characters. However variants where the pattern is a regular expression, a sequence of characters, or even a context free grammar have been investigated [Aoe94, LW95]. In these pattern matching variants, the characters of the search pattern and the text are usually from a finite alphabet. Also, the characters have an equivalence relation, under which any two characters that are matched together will equivalently match or not match any third character. In [KI/IP77], Knuth, Morris and Pratt propose an algorithm that solves the problem in only one scan of the text by building a finite state automata from the search pattern. Their algorithm takes 0(n + m) time, where n is size of the text and m is length of the pattern. In [BM77], Boyer and Moore, provide a faster algorithm that basically skips some parts of the text while scanning it. It is another 0(n+m) algorithm. This work assumes the input text will be stored in random access memory, so the cost of skipping text is zero. By this assumption, the Boyer-Moore algorithm and its variants are considered to be the fastest algorithms available with an average running time of O('—°§nfln) which is optimal [GB91]. Karp and Rabin [KR87] proposed a randomized algorithm to improve the brute force search. In their algorithm, a hash function is 9 used to compare the pattern and substrings of the text. Whenever these two match, the algorithm performs a character by character matching. Our application differs from the other pattern matching problems as follows. Firstly, we are basically doing an unordered pattern matching, where the pattern is a multi-set of characters rather than a sequence of characters. Secondly, the char- acters of the pattern and text not only are from an infinite alphabet but also do not have an equivalence relation. Fortunately they do have a partial-order relation which we exploit in developing efficient algorithms. 1.6 Outline In the following chapters, we first try to provide a probability distribution model for the restriction fragment lengths. After that, we provide discussion and empirical results on the effectiveness of the restriction pattern data. Finally we provide some restriction pattern search algorithms and some empirical results. Chapter 2 A probability distribution model for restriction fragment lengths 2.1 Restriction Fragment Lengths have an Ap- proximate Exponential Distribution It is useful for the analysis and design of our algorithms to have a model of restric- tion site distribution along the genome. We consider a model in which the restriction. sites are uniformly distributed across a genome; that is, we assume that a restric- tion site uniformly occurs at any location on the genome with probability p. Given this assumption, we can use basic probability theory to Show that fragment lengths have a geometric distribution. That is, a fragment has length l with probability p(l) = p(l — p)’. If we consider l to be continuous rather than discrete variable, we obtain, p(l) = Ae‘“ (2.1) where /\ = — ln(1 — p) (2.2) 10 11 The same model for restriction sites has been proposed before in [VVat95]. we attempted to confirm the model by running the Chi-Square test [Dev95] for the Haemophilus influenzae Rd genome [FAW+95] with different recognition sites. The results of the Chi-Square test (Table 2.1) Show that in most cases, the hypothe- sis that restriction sites are uniformly distributed is rejected with a high confidence level. However, we find that the exponential distribution for fragment lengths is quite . acceptable even for enzymes which do not meet the uniform distribution assumption for out sites. For example, we plot the cumulative distribution function (CDF) (Fig- ure 2.1, solid line) of restriction fragment lengths for H.influenzae genome when cut by a recognition sequence of ATTAAT — which was rejected with confidence level of 1.000 in the Chi-Square test - and compare it with the exponential distribution function (solid line) where its A parameter is derived using Equation 2.2. To obtain p for a given genome and a restriction enzyme, we use the following equation: _ Number of restriction sites 1) - Length of genome in bp (2.3) As it can be seen from Figure 2.1, the assumption that the restriction pattern has an exponential distribution is indeed a very good approximation. Throughout this text, we use the parameter p as the probability that a restriction site occurs at any point in the genome and A as the parameter for the exponential distribution function of fragment lengths. Note that p and A are related as in Equation 2.2. For small values of p we have A z p (2.4) l2 1 __ W f I I 0.8 - _ 0.6 ~ A 0.4 T 0.2 1 _ e—Ax _ " Re. Pattern - ° - 0 l l l l 0 5000 10000 15000 20000 25000 Genome Haemophilus influenzae Rd Recognition sequence ATTAAT Dotted line The fragment length data Solid line Estimated exponential distribution Figure 2.1: CDF of a restriction pattern and its approximation. While the measured fragment lengths have an error of :l:6, they essentially preserve their exponential distribution as 6 is very small. We do assume the same probability distribution function for both the exact restriction pattern data and those obtained from gel electrOphoresis experiment. AS the restriction pattern data has a known distribution, we consider ways to use this knowledge in order to speed up our algorithms. Mainly we need to search and sort restriction pattern data. If the data has a uniform distribution, we can use interpolation sort with time complexity of 0(a) and interpolation search with time complexity of 0(log log n) [GB91]. In order to use these fast sort and search algorithms, we should build a key of restriction pattern data that has uniform distri- 13 bution. For any distribution, the function needed to make it a uniform distribution is its CDF function. 111 our case the CDF function is 1 -— e‘“, where l is the fragment length. 2.2 Normalized fragment lengths Definition 2.1. The normalized value of a restriction fragment length l, denoted by .r is defined as :17 = 1 — e'“ (2.5) Also, (2.6) Again, to justify our method, a graph of the normalized fragment lengths is shown in Figure 2.2. The genome and the restriction enzyme are the same as in Figure 2.1. While this plot shows more difference between the real data and the approximation, notice that in interpolation search and sort, the most important characteristic that affects the running time is the linearity of the CDF of the data. Here, for most of the graph, we have a good linear approximation. Furthermore, note that the enzyme we have chosen to plot has the most statistical difference with the approximated distribution function. We usually use I, 1’, ll, etc. to refer to a fragment length, and 3:, z’, 2:1, etc. to refer to a normalized fragment length. It will be easier for us to do our analysis using the normalized fragment length data. Therefore, we state a relationship for matching with error 6 two normalized 14 l I I l I 0.8 r _ 0.6 — . , ~ ' u 0.4 L . - ' z 0.2 ’- _ ' . . CU __ T ' Normalized Re. Pattern ° - - 0 l l l L 0 0.2 0.4 0.6 0.8 l Genome Haemophilus influenzae Rd Recognition sequence ATTAAT Dotted line The normalized fragment length data Solid line Uniform distribution Figure 2.2: CDF of a normalized restriction pattern and its approximation. fragment lengths. It is interesting to note that this relationship is independent of the A parameter of the CDF. Theorem 2.1. For two fragment lengths ll and l2, we have ll 36 12 if and only if 2:1 3 1— (1— 1:2)”5, and [1 2512 if and only if x1 _>_ 1— (1— $2)1“5. Also, ll 2612 if and only if1 — (1 — x2)1‘5 S x151_(1_ $2)1+6 Proof. Proof follows directly from Definition 1.4 and Equation 2.6. C] Theorem 2.2. The probability that two restriction fragments match with error 6 is approximately %. Proof. Two distinct fragments, 11 and 12 will match if their normalized values $1 and 2:2 satisfy the relation 1 — (1 — 1:2)1‘5 _<_ :51 g 1 — (1— 1:2)”6. As 2:1 and 2:2 are uniformly distributed, the probability that two fragments match can be approximated as follows: (1— 1:2)1—6 — (I — $2)1+6 dz: 5. 1 1—(1—22)1+5 / / (1.771(1'1'2 = o 1—(1—12)1-6 [0 IO | I—I Q') C); M .+_ O) I on IN? 22 Mlonullh 16 Pattern Conf. Level Pattern Conf. Level AAATTT 0.9998 GAATTC 0.5023 AACGTT 0.9043 GACGTC 0.1665 AAGCTT 0.6948 GAGCTC 0.2867 AATATT 1.0000 GATATC 0.9973 ACATGT 0.2435 GCATGC 0.2677 ACCGGT 0.7415 GCCGGC 0.9488 ACGCGT 0.5107 GCGCGC 0.9654 ACTAGT 0.0805 GCTAGC 0.3993 AGATCT 0.9548 GGATCC 0.9363 AGCGCT 0.7770 GGCGCC 0.2100 AGGCCT 0.7601 GGGCCC 0.4751 AGTACT 0.2167 GGTACC 0.1388 ATATAT 0.9990 GTATAC 0.7571 ATCGAT 0.7109 GTCGAC N / A ATGCAT 0.8591 GTGCAC 0.5255 ATTAAT 1.0000 GTTAAC 0.7709 CAATTG 0.9975 TAATTA 0.9998 CACGTG 0.5277 TACGTA 0.9942 CAGCTG 0.9710 TAGCTA 0.8844 CATATG 0.9905 TATATA 0.9989 CCATGG 0.6005 TCATGA 0.9744 CCCGGG 0.2212 TCCGGA 0.8858 CCGCGG 0.9713 TCGCGA 0.3433 CCTAGG 0.9922 TCTAGA 0.6418 CGATCG 0.6466 TGATCA 0.7699 CGCGCG 0.9064 TGCGCA 0.4748 CGGCCG 0.8858 TGGCCA 0.9344 CGTACG 0.8777 TGTACA 0.5256 CTATAG 0.4512 TTATAA 0.9998 CTCGAG 0.8510 TTCGAA 0.9806 CTGCAG 0.9498 TTGCAA 0.7565 CTTAAG 0.6770 TTTAAA 0.9965 Hypothesis: Restriction sites are uniformly distributed over the H.influenzia genome. Table 2.1: Chi-Square test confidence levels for distribution of restriction sites Chapter 3 Effectiveness of the Procedure As stated earlier, the restriction pattern obtained from a gel electrophoresis ex- periment is not precise. So, in practice, after searching a genome for the restriction pattern, we can end up with more than one matching location on the genome. As the number of matching locations or matches increases, the procedure becomes less effective. We try to investigate the general effectiveness of the procedure and answer the following questions in this chapter: 0 How does 6 affect the effectiveness of the procedure? 0 Which restriction enzyme(s) should we choose for digesting the isolate? We first need to define a metric for the effectiveness of the procedure. In the next section, we propose three different metrics to be used later in our discussion. 3.1 Metrics for Effectiveness We first assume that the number of matches is a good indicator for the general effectiveness of the procedure. Based on this assumption, we define our first metric 17 18 as follow. Definition 3.1. We define the resolution of a restriction pattern RP over a genome g with restriction enzyme e, denoted r,.(RP, g, 6), as the number of subsequences RM, of Rllfe(g) such that RP ~5 RMS. However, in practice, two different matched locations may overlap (Figure 3.1). We believe that the biologist can still use this procedure even if the search results in multiple matches with significant overlap. Genome: } 100 % 125 § 175 % 100 { lst matchzé ; 2nd match: < > Restriction Pattern {100, 125, 175} can match to two different regions of the genome. Figure 3.1: Possible overlapping matches to a restriction pattern. We conjectured that isolates with resolution greater than one are actually overlap- ping with their false matches in most cases. We developed a second metric to study our conjecture where each possible isolate of the genome for a certain enzyme was put into one of three different categories: 1. The isolates which result only in a single match. 2. The isolates which result in multiple matches, but all false matches overlap with the original isolate. 3. The isolates which result in at least one false match which does not overlap the original isolate. We refer to this metric as the Category metric. 19 The isolates in category 1 are the perfect ones for our case. Based on experiments we observed that the isolates in category 3 can be avoided easily as they are very rare. The main difficulty arises with the numerous isolates in category 2. This metric does not distinguish between isolates that generate two highly overlapped matches and isolates that generate two matches that barely overlap. In order to study this effect better, we developed a third metric(Score). Let I denote the isolate length and l’ denote the portion of the isolate shared by all the matches. An isolate receives a score of %. Note that all isolates in category 1 have a score of 1, and those in category 3 have a score of 0. 3.2 Resolution Metric We experimentally measured the resolution of isolates for the Mycoplasma geni- talium genome [F GW+95] using the TGATCA recognition site which cuts the genome into 500 fragments. We plotted them in Figure 3.2 for 6 = 0.04 and Figure 3.3 for = 0.10. In both cases the resolution of isolates increases with the length of the isolate. However, for the smaller error value, the slope is much lower. Also, in both cases, we have high values of resolution for very short isolates. Note that the resolution metric is very dependent on the error value, and we 20 8 l I I T I 6 = 0.04 7 '- a 6 - _ 5 r _ 4 - _ 3 r _ 2 I. 5 _ i W‘ 1 K A _—-——:v A‘m- I l 0 100000 200000 300000 400000 500000 600000 Isolate Length (bps) Figure 3.2: Average resolution of isolates, 6 = 0.04. 14* =0.10 5 ‘ l2r : — 0 100000 200000 300000 400000 500000 600000 Isolate Length (bps) Figure 3.3: Average resolution of isolates, 6 = 0.10. 21 cannot infer much using it, when we don’t know much about the error value. 3.3 Category Metric Definition 3.2. Let P(N, k, i) denote the probability that an isolate with k restriction fragments obtained from a genome with N restriction fragments is in category i. First, we try to find the probability that an isolate is in category 3, i.e. P(N, k, 3). As stated before in Section 2.2, the normalized fragment lengths have uniform dis- tribution, and it is easier for analysis purposes to use the normalized lengths of fragments. An isolate I is not in category 3 if all other non-overlapping isolates of the genome do not match with it. Computing the exact probability that two independent isolates match is rather difficult; therefore we use the following approximation. We first derive a lower bound for P(N, k, 3). The probability that two restriction fragments match is g, and the probability that two isolates match is at least (ilk- N There are T disjoint isolates on the genome, and the probability that there is no other matching isolate will be less than: N to)“ P(N,k,3)21_ (1_(g)k)s_1 We next derive an upper bound on P(N, k, 3) by assuming that all permutations That is, of the restriction pattern are likely to match and by considering all other N — 2k + 1 22 100 '..l I I I I I I n l. 6 = 0.2 1 Lower Bound ' ° ° ' 80 — '. _ 60 r .1. .. % o 40 — . — 20 - ~ 0 l I . ' ’ - I ..... .I- l - A A- .1 .1 1 2 3 4 5 6 7 8 9 10 Number of Fragments Figure 3.4: Percentage of isolates in category 3. non-overlapping isolates of the genome. 6 k k!(N-2k+1) P(N,k,3) 31- (1— (5)) In Figure 3.4, the lower bound and the real percentage of isolates in category 3 is shown. The genome is Mycoplasma genitalium genome [FGW+95], and the recogni- tion site is TGATCA which cuts the genome into 500 fragments. As it can be seen, if the isolate has more than 8 fragments, we don’t have any isolate in category 3 even for a big error value of 20%. So, by carefully choosing the restriction enzyme, we can can assure that the isolates are not in category 3. We now focus on finding the probability that an isolate is in category 2. We derive a lower bound for P(N, k, 2). First we observe that at least % of the isolates are in category 2. The reason is that the first fragment of an isolate can match the very 23 first fragment of the genome after the isolate with that probability. To improve this lower bound for the probability, we compute the probability that two isolates match in all but one fragment. In other words, isolates that share all the fragments except one. Let’s name these two fragments as 3:1 and 1:2. They could match together with a probability of %, hence resulting in the matching of the isolates. Or, they can match through another common fragment, say 2:3, such that 2:1 matches 1:3 and 2:3 matches 2:2. The probability that this happens is: 1 1—(1—22)1+6 1—(1—23)1+5 8 / f / (1531615133 dl'g it: —62 0 1—(1—.’t2)l-‘S 1-(1-.'1:3)1_‘s 27 There are k — 1 such shared fragments between the two isolates’, hence, we can find the following lower bound on the probability that an isolate is in category 2: > _ _ .. _ _ 2 P(N,k,2) _ 1 (1 2) (1 275 ) We can improve this lower bound by considering a chain of shared fragments to match 2:1 and x2. However our current model is acceptable for small values of k. In Figure 3.5 and Figure 3.6, the lower bound and the observed probability of an isolate being in category 2 are depicted. Note that as the number of fragments increases, it is more likely that an isolate is in category 2. On the other hand, the probability of an isolate being category 3 decreases rapidly with the number of fragments (Figure 3.4). To maximize the percentage of category 1 isolates, we should choose a restriction 24 100 T I I I I I I I I , 6 = 0.04 Lower Bound ° ' ' ° . 80 *- A 60 r -.- % . 40 - ,_ '\ 's ‘00. " \ .'- ' \J‘ét:f-5 6s . .00 ..O .. '.§°:’€‘h;.*'fi . . ..... 20 ’— .‘jfl‘;:o-\.u:":". ..“,' ............ — ° . "fix: ..... . o . .m‘o la" 0' . . o. ‘2‘. 3‘ zfi’wfi. o . 0‘ *ac’g‘ I... "a. J. .’ .0 0 "fispf' °‘ l" 'I I l I I I I 0 50 100 150 200 250 300 350 400 450 500 Number of Fragments Figure 3.5: Percentage of isolates in category 2, 6 = 0.04. enzyme which cuts the isolate into have very few fragments (approximately 10). Since the length of the isolate is usually known before doing the digestion, we can choose an appropriate restriction enzyme. 3.4 Score Metric As stated before, the score metric measures the ratio of overlapped length of matches of an isolate to the length of the isolate. For a relatively high error of 6 = 0.10, computer experiments showed that, in average, most isolates achieve a score of 0.95. This indicates that, in most cases, we can find the location of the isolate with a 95% accuracy. If this approximate knowledge of location of the isolate on the genome is sufficient for the biologist, then this procedure can be used successfully. The maximum score was when the isolate has a small length, which is due to the 100 I I I I I I T I I wgvfli‘f’fsw} ’ 80 r . .' .o,u€‘.?- ' .1 ...","JA:& ...... 7:“.- ' o ' .zf‘fi.’ ' 60 - .' '3‘ T A ‘70 5:5. , 40 _ $1" . . . - " A ’\ o o y'fiI:-. 6 = 0.10 20 - '3; ‘- Lower Bound ° ‘ ° ' -* .3' " Ir" 0 l l I I I I I I I 0 50 100 150 200 250 300 350 400 450 500 Number of Fragments Figure 3.6: Percentage of isolates in category 2, 6 = 0.10. smaller number of fragments it has. This was discussed in Section 3.3. Figure 3.7 and Figure 3.8 show the score metric for isolates when the genome is Mycoplasma genitalium and the recognition site is TGATCA. Notice that with the lower error of 6 we obtain a much better score. Based on the experiments we can conjecture that the score of a moderate length isolate is about 1 — g. 26 1 -.3--..-.'~reshuffle.itwfiwwr'i'axMW'T’c‘wa- 0.99 — =f.".--""°' 6 = 0.04 0.98 ~ ., — 0.97 P. — 0.96 —' — Score 0.95 P. T 0.94 »— - 0.93 —' _ 0.92 —. _ 0.91 - - 0 9 I I l 0 50000 100000 150000 200000 Isolate Length (bps) l Figure 3.7: Score metric, 6 = 0.04. l l l l I . o M. fie . o .o I . . .. O. 0‘. 0.99 _ . '."":?o I.“ 3.030. :Of%~:o‘ .V'étx.‘.:~g 0" "‘ do . a. - -- -- °- -;:.- ,3. {’e.-r!-:£~{<’-.1’:’.~.~.:Ca. 0.98--°° n... 0.97 " 0" . -I 0.96 — . - Score 0.95 _ T 0.94 — : - 0.93 ~ _ 0.92 - a 0.91— 6:0.10 - _ 0.9 A 1 1 0 50000 100000 150000 200000 Isolate Length (bps) _ Figure 3.8: Score metric, 6 = 0.10. Chapter 4 A Matching Algorithm In this chapter we consider the design of an algorithm that searches the restriction ‘ map data of a genome for a location which matches the restriction pattern of a query isolate. The restriction map of the genome contains N fragments, and the restriction pattern of the isolate consists of k fragment lengths. The algorithm should find all locations of the genome which match the isolate data. In the next section we provide a theorem which is the basis of our algorithms. In the following section we provide the algorithm and improvements of that algorithm. 4.1 The Basis Theorem Definition 4.1. For any restriction map RM, we define its sorted restriction map W as an ordered tuple consisting of the elements of RM such that, Vi,j,i 3120—6)le $2392II+5IA$1<$2 => x1§y2(1+6) y1(1—6)S$1/\$1<$2 2} y1(1—6)Sl’2 $2Sy2(1+6)/\yl 2312 => $2SyIII+5l Theorem 4.1. RP ~6 RM if and only ifRP =6 R74. Proof. The converse of the theorem is trivial and simply follows from Definition 1.6. Now, we have to Show that if RP ~” RM, then RP =5 RM. Assume that there exist RP and RM such that RP ~" RM but not RT? =5 R_M. We define an inversion of an ordered tuple S, as the number of occurrences of i, such that S,- < Si“. Notice that when the number of InverSIOns Is 0, then S = S . ConSIder the minimum 29 inversion permutation on RP, the ordered tuple RP’, such that RP’ 2” RM. The number of inversions should be positive by assumption. Let i denote the first index i of RP’ such that RP,’ < 1219;“. As RP’ = (125,375,100.50) matching (l25.375,50,50) Genome: Figure 4.1: Inner-Matching. We cannot use Theorem 4.1 to compare RP and subsequences of RM for this case. However, with a little modification, we can still determine whether or not RP z" RM in 0(k) time. Suppose the elements of RP and RM are already sorted in decreasing order. Two of the elements of RM can have lower values in RP. We then skip those elements of RM when trying to compare RM and RP (Algorithm 2). Theorem 4.2. RP 2" RM if and only if Algorithm 2 terminates with a non-negative skips value. Proof. Using Definition 4.3 it is easy to see that the algorithm terminates with a negative skips value unless RP z" RM. Now, we have to show that if RP z" RM, then the algorithm terminates with a non-negative skips value. If RP z” RM, then there is a matching between elements of RP and RM. Consider such a matching, and assume that the RP,- 55 JET/I; and RTJJ- 5” RT/Ik. Build a new set RP’ such that it has all elements of RP except RP,- and RP], with RM1(1+ 6) 6 RP’, RM),(1 + 6) 6 RP’. By Theorem 4.1, we have R?’ ~5 m. Now replace RM1(1+6) of? with RP,, and RM).(1 +6) of RP’ with RP,-. The order of elements of RP’ and m) are exactly the same except that RP,- and RP]- have higher indexes in W). Consider Figure 4.2 for an illustration where 36 Algorithm 2 Inner Match: Is RP 23" RM ? Sort RP, to obtain We 1: 2: Sort RM, to obtain RT! 3: skips (— 0 4: t (— 1 5: while i S kandskips Z 0 do 6: if BIT/IssaMlorRMk then 7: if RPi—Skips 3" RM: then 8: skips (— skips + 1 9: SkipBack +— False 10: else 11: SkipBack (— True 12: end if 13: else 14: if RPi—skips =6 RM: then 15: SkipBack (— False 16: else 17: SkipBack (— True 18: end if 19: end if 20: if SkipBack then 21: skips +— skips - 1 22: else 23: ’l <— i + 1 24: end if 25: end while 26: returnskip Z 0 the solid lines Show a matching. 37 The algorithm matches all elements of RP and RT! until it reaches RP,- to be matched against RI’IIk. There are two cases. Either (i) they match or (ii) they don’t. In case (i) the algorithm corrects the skips value immediately, and continues. In case (ii), the algorithm does not correct the skips value immediately. In either case it does not terminate with a negative skips value. BEL—e Rein? O——O O——O OW: N O—OR—M N1— RM)c Figure 4.2: Proof of Theorem 4.2. BIBLIOGRAPHY Bibliography [Aoe94] [BM77] [Dev95] [PAW+95] [PGW+95] [GB91] [KCTP96] [KMP77] [KR87] [LW95] Jun-ichi Aoe. Computer Algorithms: String Pattern Matching Strategies. IEEE Computer Society Press, 1994. Robert S. Boyer and J. Strother Moore. A fast string matching algorithm. Communications of the ACM, 20(10):62—72, October 1977. Jay L. Devore. Probability and statistics for engineering and the sciences. Duxbury Press, 4th edition, 1995. R. D. Fleischmann, M. D. Adams, 0. White, R.A. Clayton, E. F. Kirkness, A. R. Kerlavage, C. J. Bult, J. F. Tomb, B. A. Dougherty, J. M. Merrick, et al. Whole-genome random sequencing and assembly of haemophilus infuenzae Rd. Science, 269(5223):496-512, July 1995. C. M. Fraser, J. D. Goycayne, O. White, M. D. Adams, R. A. Clayton, R. D. Fleischmann, C. J. Bult, A. R. Kerlavage, G. Sutton, J. M. Kelley, et al. The minimal gene complement of myc0plasma genitalium. Science, 270(5235):397—403, October 1995. Gaston H. Gonnet and Ricardo Baeza-Yates. Handbook of Algorithms and Data Structures. Assison—Wesley, 2nd edition, 1991. J in Kim, James R. Cole, Eric Trong, and Sakti Pramanik. Inferring relat- edness of a macromolecule to a sequence database without sequencing. In Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology, pages 125—133, June 1996. Donald E. Knuth, James H. Morris, and Vaughan R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323—350, June 1977. Richard M. Karp and Michael O. Rabin. Efficient randomized pattern- macthing algorithms. IBM Journal of Research and Development, 31(2):249—260, March 1987. Eric S. Lander and Michael S. Waterman. Calculating the Secrets of Life. National Academy Press, 1995. 38 39 [W'at95] Michael S. Waterman. Introduction to Computational Biology. Chapman and Hall, 1995. l‘l ICH IGQN STRTE UN IV. L IBRQR IES II I II IIIIII III III III IIIIII IIII IIII II IIIIII IIII II IIIIII II IIII I II 31293015700747