CHIGAN ST

I III

       

 

III

 

 

 

 

 

IIIIIIIIIIIIIIIIIIIIIIIIIIIII

301570 0747

This is to certify that the

thesis entitled

PERMUTED SUBSTRING MATCHING, WITH
APPLICATIONS TO COMPUTATIONAL BIOLOGY

presented by

Houman Alborzi

has been accepted towards fulﬁllment
of the requirements for

Master of Science degree in Compote]: Science

Eric Torng
&- [-4.39

Major professor

 

Date _&-20—1997

0-7 639 MS U is an Afﬁrmative Action/Equal Opportunity Institution

 

 

 

 

LIBRARY
Michigan State
Unlversity

 

 

 

PLACE ll RETURN BOX to roman this checkout from your rooord.
To AVOID FINES return on or More data due.

I I I I
II I
L_I |
|L_I_—_IL__I
:LJCI

MSU In An Affirmulvo AotIon/Equul OpponunIIy Inotltulon

 

 

 

 

 

 

 

    

 

 

 

I

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

PERMUTED SUBSTRING MATCHING, WITH
APPLICATIONS TO COMPUTATIONAL BIOLOGY

By

H ouman Alborzz'

A THESIS

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

MASTER OF SCIENCE

Department of Computer Science

1997

ABSTRACT

PERMUTED SUBSTRING MATCHING, WITH APPLICATIONS To
COMPUTATIONAL BIOLOGY

By

H ouman Alborzz'

We propose a new inexpensive technique to ﬁnd the DNA sequence of genes. The
technique is based upon ﬁnding the location of a fragment of a genome using the
restriction map of the genome and the restriction pattern of the fragment. As there
may be more than one location on the genome matching the restriction pattern of the
fragment, we investigate the effectiveness of our technique based on how precisely it
can determine the location of the fragment. We Show that the size of the restriction
pattern of the fragment is the primary parameter that affects the effectiveness of our
technique. We have also designed and implemented fast algorithms for locating the

fragment.

Dedicated to the advancement of frontiers.

iii

ACKNOWLEDGMENTS

I would like to thank Dr. Eric Torng, my academic advisor who was always trying
to put order in my chaotic work. Without his patience and support, completing this
thesis would not have been possible.

Dr. Sakti Pramanik and Dr. James Cole were always eager to discuss my work. I
am grateful for their help and time allotted to this project.

My technical thanks to the staff of the computer science department computing
services who provided the support and maintenance needed for this project.

Finally, I would like to thank the faculty, staff, and students of the computer

science department with whom I worked. They were always encouraging.

iv

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

1 Introduction

1.1 Genome, Genes, and DNA ..........................
1.2 Restriction Enzymes .............................
1.3 Formal Deﬁnitions ..............................
1.4 Problem Statement ..............................
1.5 Background on Searching ..........................
1.6 Outline ....................................

2 A probability distribution model for restriction fragment lengths
2.1 Restriction Fragment Lengths have an Approximate Exponential Distri-

bution ...................................
2.2 Normalized fragment lengths .........................

3 Effectiveness of the Procedure

3.1 Metrics for Effectiveness ...........................
3.2 Resolution Metric ...............................
3.3 Category Metric ...............................
3.4 Score Metric ..................................

4 A Matching Algorithm

4.1 The Basis Theorem ..............................
4.2 Brute Force Algorithm ............................
4.3 Obtaining Sorted Subsequences in 0(Nk) comparisons ..........
4.4 Limiting the search area to the most promising areas ...........

4.5 More Improvement ..............................
4.6 More on Comparing RP and RM ......................

BIBLIOGRAPHY

vi

vii

(DOOQCHOOI-‘H

10

10
13

17
17
19
21
24

27
27
29
30
31
33
34

38

LIST OF TABLES

Chi—Square test conﬁdence levels for distribution of restriction sites

Running time (in seconds) of the algorithm. ......... ' .......
Running time (in seconds) of the algorithm with 1 limiting fragment. . .
Running time (in seconds) of the algorithm with 3 limiting fragments. . .

vi

1.1
1.2

2.1
2.2

3.1
3.2
3.3
3.4
3.5
3.6
3.7
3.8

4.1
4.2

LIST OF FIGURES

A schematic of an unfolded DNA double helix ................ 2
An example of a digestion process. ..................... 4
CDF of a restriction pattern and its approximation ............. 12
CDF of a normalized restriction pattern and its approximation. ..... 14
Possible overlapping matches to a restriction pattern. ........... 18
Average resolution of isolates, 6 = 0.04 .................... 20
Average resolution of isolates, 6 = 0.10 .................... 20
Percentage of isolates in category 3 ...................... 22
Percentage of isolates in category 2, 6 = 0.04. ............... 24
Percentage of isolates in category 2, 6 = 0.10. ............... 25
Score metric, 6 = 0.04. ............................ 26
Score metric, 6 = 0.10. ............................ 26
Inner-Matching ................................. 35
Proof of Theorem 4.2. ............................ 37

vii

Chapter 1

Introduction

In this thesis we investigate the problem of locating a fragment of a previously
sequenced genome on the genome. More speciﬁcally, we try to develop a new proce-
dure to ﬁnd the genetic sequence of genes. In this procedure, a genome that has been
previously sequenced is broken into fragments, perhaps using restriction enzymes.
Then, after biological experiments have determined that some of the fragments are
biologically interesting, we try to ﬁnd the DNA sequence of those fragments. Usually
ﬁnding the DNA sequence is expensive. However, since the particular DNA sequence
is part of our previously sequenced genome, the problem of ﬁnding the DNA sequence

of the gene reduces to ﬁnding the location of the gene on the genome.

1.1 Genome, Genes, and DNA

Genetics goes back to the pioneering work of Gregor Mendel in 1865 [LW95].
Mendel hypothesized that traits are affected by discrete factors, which an offspring
inherits from its parents. Today, we call those factors genes. In 1952, Alfred Hershey
and Martha Chase showed that genes are encoded in DNA molecules which exist in
any living organism’s cells. Later, in 1953, Watson and Crick presented a model of

1

2

the structure of a DNA molecule. The DNA molecules are linear polymers of four
basic chemical structures called nucleotides named adenine, thymine, cytosine, and
guanine, and abbreviated as A, T, C, and G.

Each gene is usually associated with one or more DNA subsequences. These
subsequences will be translated into protein sequences, which are considered to be
the main chemicals affecting the traits of an organism. The whole set of genes of an
organism constitutes its genome. In this manuscript we use the word genome to refer
to a single sequence comprising the sequences of all DNA molecules of an organism.

The DNA nucleotides are joined by a sugar-phosphate backbone to form a DNA
molecule. While some bacteria have single stranded DNA molecules, the DNA of
humans and most other organisms consists of two strands making a double helix in

which the bases (nucleotides) pair up (Figure 1.1).

Figure 1.1: A schematic of an unfolded DNA double helix.

This pairing of bases is not arbitrary. A always pairs with T and C always pairs

with G. Hence, the sequence of either strand can be determined from the sequence

3

of the other one. The length of a DNA molecule is measured in units of base pairs
(bp), which is the number of bases in a single strand of the DNA molecule. Usually,
we represent a strand of DNA by a single string over the alphabet of {A, T, C, G}.
DNA molecules are usually linear, that is they have two endpoints. However, in
some organisms the DNA is circular, so there is no starting or ending point of the

DNA.

1 .2 Restriction Enzymes

There are enzymes that can cleave a DNA molecule into several fragments. These
enzymes are called restriction enzymes, and the process of cleaving the DNA molecule
into fragments is called digestion. Each restriction enzymes has a recognition sequence,
which is a sequence of nucleotides where the enzyme cuts the DNA molecule. For
example, the EcoRI enzyme cuts a DNA molecule at all occurrences of the sequence
GAATTC. The cut site of a restriction enzyme determines exactly where in the recog-
nition sequence the DNA molecule will be cut. For the EcoRI enzyme the cut site
is 1, which means that the DNA molecule will be cut after the ﬁrst base of the
recognition sequence. For example, if the DNA sequence is GCTGAGAATTCGCAAG-
GAATTCGCAG, then the EcoRI enzyme will cleave it into three fragments, with the
sequences of: GCTGAG, AATTCGCAAG, AATTCGCAG (Figure 1.2).

The restriction sites of a DNA molecule for a given restriction enzyme is the set of
locations of all the enzyme’s cut sites on the molecule. The ordered list of fragment

lengths is the restriction map of the DNA molecule for the given enzyme. For instance,

4
in our previous example, the restriction map, will be (6, 10,9). A restriction pattern is
the multi-set of fragment lengths of a restriction map. That is, the order of fragments
in the restriction map is forgotten. For example the restriction pattern of our example
is the set {6, 9, 10}.

We can obtain the restriction map of a DNA molecule using the gel electrophoresis
experiment. This procedure is much simpler and cheaper than existing methods for
obtaining the corresponding restriction map. In the gel electrOphoresis experiment,
the restriction fragments are placed in an electric ﬁeld over a gel coated medium.
The fragments then are separated based on their length. There is a logarithmic
relationship between the size of DNA fragment and the distance it migrates on a gel.
By measuring the distance each fragment has migrated, the length of the fragment

can be found.

 

 

Restriction Sequence: GAATTC
. A
Cut Site: 1

DNA Sequence: GCTGAGIAATTCGCAAGGAAATTCGCAG

Restriction Map: Ir 6 I 10 I 9

Resulting Fragments: GCTGAG AATTCGCAAG AATTCGCAG

l
l

 

 

 

Figure 1.2: An example of a digestion process.

There is a measurement error associated with this experiment; that is, a fragment
of length I will be read with some error e which results in a measured length of l(1+e).
Usually, c is a random variable with a normal distribution. However, we can safely
assume that it is uniformly distributed over a range of (—6, 6), where 6 is a parameter

of the precision of the devices used in the experiment. Wherever needed, we assume

5

that 6 = 0.05. For example, considering this measurement error, a restriction pattern
of {100, 105, 300, 400} could be read as {98,100,297,415}.

The process of digesting a DNA molecule by an enzyme can be partial or com—
plete. In a complete digestion, a DNA molecule is broken at all of its recognition
sequences. In a partial digestion, not all of the recognition sequences are broken. For
example the result of a partial digestion in Figure 1.2 could be the GCTGAG and
AATTCGCAAGAATTCGCAG fragments. Depending on the duration of the digestion
process and the amount of restriction enzyme used, we can have various degrees of
digestion.

The next section formally deﬁnes the terms described before. It can be skipped

in a ﬁrst reading.

1.3 Formal Deﬁnitions

Most deﬁnitions in this section are based on those in [KCTP96]. Nevertheless, we
use more deﬁnitions to facilitate the communication.
Deﬁnition 1.1. The restriction map RMe(g), for the DNA sequence 9 and the re-
striction enzyme e, is the ordered tuple (31,32, ...,sn) such that the ﬁrst restriction
site ofe on g is at 31, the second one at s, + 32, and the ith one at 22:13,. If the
DNA sequence 9 is a circular molecule, then we arbitrarily choose 31 as the length
of one of the restriction fragments and assume that the cut site of its beginning is at

index 0 of 9.

Deﬁnition 1.2. For the multi-set RP and the ordered tuple RM, we deﬁne RP ~

6

RIM if there exists a one to one mapping between their elements.

Deﬁnition 1.3. The restriction pattern RPc(g) for the DNA sequence 9 and the

restriction enzyme 6 is the multi-set {81, .92, . . . , 8"} such that RPe(g) ~ RAIe(g).

Deﬁnition 1.4. We define z 3" y if and only ifa: g y(1 + 6). In this case we say a:
lower-matches y with error 6. Also, :2: 26 y if and only ifzr 2 y(1 — 6). In this case
we say I upper-matches y with error 6. Finally, a: 2‘S y if and only if x 3‘5 y and
2: 26 y; that is, y(1— 6) S :1: S y(1+ 6). In this case we say 3 matches y with error

6.

Deﬁnition 1.5. We say the restriction map RM’ matches restriction map RM with

error 6, denoted RM’ =6 RM, if and only if Vi RM,’ =6 RIVA.

I Deﬁnition 1.6. We say the restriction pattern RP matches restriction map RM
with error 6 if and only if, there exists some permutation of RP, denoted by RM’,

such that RM’ =6 RM. We denote this by RP ~6 RM.

1 .4 Problem Statement

There is a huge amount of ongoing research to obtain the complete genome of var-
ious organisms where the results will be stored in gene banks. Some micro-biologists
believe that the functionality of some genes is based on their DNA sequences only;
hence, by removing or adding a DNA sequence from or to an organism, we can modify
the functionality of that organism. Thus, there is a great deal of research focused on

ﬁnding the DNA sequence of genes responsible for speciﬁc traits.

7

Here, we provide a new procedure for ﬁnding the sequence of a gene without
trying to sequence it directly. The basic idea is that if we know that a DNA fragment
contains genes, and it is already part of an already sequenced genome, we can use its
restriction pattern data, which is cheaply available, to ﬁnd its location in the original
genome, and hence learn its underlying DNA sequence.

The problem is based on the following procedure:

1. Choose a completely sequenced genome that has some interesting characteris-

tics.

2. Break down the genome using a partial digestion experiment into fragments,
denoted as isolates.

3. Perform biological tests on the isolates in order to identify some interesting
isolates.

4. Completely digest an interesting isolate, and use gel electrophoresis to obtain
its restriction pattern.

5. Use the restriction pattern of the isolate to ﬁnd its location on the genome in

order to obtain information about its underlying DNA sequence.

The last item in the list above is the main goal of this thesis; that is, as we wish
to develop fast and practical algorithms to search for locations on the restriction map
of the genome which match the restriction pattern of interesting isolates. However,
one important question is “how useful is the proposed procedure?” In other words,
given a restriction pattern of an isolate, how many locations on the genome could
have the same restriction pattern. We denote this attribute of a restriction pattern

as its resolution with respect to the genome.

8
1.5 Background on Searching

The problem of pattern matching has been extensively studied. The basic problem
is to search a text for all the locations that match a given search pattern [Aoe94]. This
match could be an exact or an approximate one. For an exact match algorithm, the
result will be all the locations in the text which match the pattern exactly. In most

approximate matching contexts, the text and pattern can have up to k mismatches.

The search pattern is usually a sequence of characters. However variants where
the pattern is a regular expression, a sequence of characters, or even a context free
grammar have been investigated [Aoe94, LW95]. In these pattern matching variants,
the characters of the search pattern and the text are usually from a ﬁnite alphabet.
Also, the characters have an equivalence relation, under which any two characters that
are matched together will equivalently match or not match any third character. In
[KI/IP77], Knuth, Morris and Pratt propose an algorithm that solves the problem in
only one scan of the text by building a ﬁnite state automata from the search pattern.
Their algorithm takes 0(n + m) time, where n is size of the text and m is length of
the pattern. In [BM77], Boyer and Moore, provide a faster algorithm that basically
skips some parts of the text while scanning it. It is another 0(n+m) algorithm. This
work assumes the input text will be stored in random access memory, so the cost of
skipping text is zero. By this assumption, the Boyer-Moore algorithm and its variants
are considered to be the fastest algorithms available with an average running time of
O('—°§nﬂn) which is optimal [GB91]. Karp and Rabin [KR87] proposed a randomized

algorithm to improve the brute force search. In their algorithm, a hash function is

9

used to compare the pattern and substrings of the text. Whenever these two match,
the algorithm performs a character by character matching.

Our application differs from the other pattern matching problems as follows.
Firstly, we are basically doing an unordered pattern matching, where the pattern
is a multi-set of characters rather than a sequence of characters. Secondly, the char-
acters of the pattern and text not only are from an inﬁnite alphabet but also do not
have an equivalence relation. Fortunately they do have a partial-order relation which

we exploit in developing efficient algorithms.

1.6 Outline

In the following chapters, we ﬁrst try to provide a probability distribution model
for the restriction fragment lengths. After that, we provide discussion and empirical
results on the effectiveness of the restriction pattern data. Finally we provide some

restriction pattern search algorithms and some empirical results.

Chapter 2

A probability distribution model
for restriction fragment lengths

2.1 Restriction Fragment Lengths have an Ap-
proximate Exponential Distribution

It is useful for the analysis and design of our algorithms to have a model of restric-
tion site distribution along the genome. We consider a model in which the restriction.
sites are uniformly distributed across a genome; that is, we assume that a restric-
tion site uniformly occurs at any location on the genome with probability p. Given
this assumption, we can use basic probability theory to Show that fragment lengths
have a geometric distribution. That is, a fragment has length l with probability
p(l) = p(l — p)’. If we consider l to be continuous rather than discrete variable, we

obtain,

p(l) = Ae‘“ (2.1)

where
/\ = — ln(1 — p) (2.2)

10

11

The same model for restriction sites has been proposed before in [VVat95]. we
attempted to conﬁrm the model by running the Chi-Square test [Dev95] for the
Haemophilus influenzae Rd genome [FAW+95] with different recognition sites. The
results of the Chi-Square test (Table 2.1) Show that in most cases, the hypothe-
sis that restriction sites are uniformly distributed is rejected with a high conﬁdence
level. However, we ﬁnd that the exponential distribution for fragment lengths is quite .
acceptable even for enzymes which do not meet the uniform distribution assumption
for out sites. For example, we plot the cumulative distribution function (CDF) (Fig-
ure 2.1, solid line) of restriction fragment lengths for H.inﬂuenzae genome when cut by
a recognition sequence of ATTAAT — which was rejected with conﬁdence level of 1.000
in the Chi-Square test - and compare it with the exponential distribution function

(solid line) where its A parameter is derived using Equation 2.2.

To obtain p for a given genome and a restriction enzyme, we use the following

equation:

_ Number of restriction sites
1) - Length of genome in bp

 

(2.3)

As it can be seen from Figure 2.1, the assumption that the restriction pattern has
an exponential distribution is indeed a very good approximation. Throughout this
text, we use the parameter p as the probability that a restriction site occurs at any
point in the genome and A as the parameter for the exponential distribution function
of fragment lengths. Note that p and A are related as in Equation 2.2. For small

values of p we have

A z p (2.4)

l2

 

 

 

 

 

 

1 __ W f I I
0.8 - _
0.6 ~ A
0.4 T
0.2 1 _ e—Ax _ "

Re. Pattern - ° -
0 l l l l
0 5000 10000 15000 20000 25000

Genome Haemophilus inﬂuenzae Rd

Recognition sequence ATTAAT

Dotted line The fragment length data

Solid line Estimated exponential distribution

 

 

 

Figure 2.1: CDF of a restriction pattern and its approximation.

While the measured fragment lengths have an error of :l:6, they essentially preserve
their exponential distribution as 6 is very small. We do assume the same probability
distribution function for both the exact restriction pattern data and those obtained

from gel electrOphoresis experiment.

AS the restriction pattern data has a known distribution, we consider ways to
use this knowledge in order to speed up our algorithms. Mainly we need to search
and sort restriction pattern data. If the data has a uniform distribution, we can
use interpolation sort with time complexity of 0(a) and interpolation search with
time complexity of 0(log log n) [GB91]. In order to use these fast sort and search

algorithms, we should build a key of restriction pattern data that has uniform distri-

13

bution. For any distribution, the function needed to make it a uniform distribution
is its CDF function. 111 our case the CDF function is 1 -— e‘“, where l is the fragment

length.

2.2 Normalized fragment lengths

Deﬁnition 2.1. The normalized value of a restriction fragment length l, denoted by

.r is deﬁned as

:17 = 1 — e'“ (2.5)

Also,

(2.6)

Again, to justify our method, a graph of the normalized fragment lengths is shown
in Figure 2.2. The genome and the restriction enzyme are the same as in Figure 2.1.
While this plot shows more difference between the real data and the approximation,
notice that in interpolation search and sort, the most important characteristic that
affects the running time is the linearity of the CDF of the data. Here, for most of
the graph, we have a good linear approximation. Furthermore, note that the enzyme
we have chosen to plot has the most statistical difference with the approximated

distribution function.
We usually use I, 1’, ll, etc. to refer to a fragment length, and 3:, z’, 2:1, etc. to
refer to a normalized fragment length.

It will be easier for us to do our analysis using the normalized fragment length

data. Therefore, we state a relationship for matching with error 6 two normalized

14

 

 

 

 

 

l I I l I
0.8 r _
0.6 — . , ~ ' u
0.4 L . - ' z
0.2 ’- _ ' . . CU __ T
' Normalized Re. Pattern ° - -
0 l l l L
0 0.2 0.4 0.6 0.8 l
Genome Haemophilus influenzae Rd
Recognition sequence ATTAAT
Dotted line The normalized fragment length data
Solid line Uniform distribution

 

 

 

Figure 2.2: CDF of a normalized restriction pattern and its approximation.
fragment lengths. It is interesting to note that this relationship is independent of the
A parameter of the CDF.

Theorem 2.1. For two fragment lengths ll and l2, we have ll 36 12 if and only if
2:1 3 1— (1— 1:2)”5, and [1 2512 if and only if x1 _>_ 1— (1— $2)1“5. Also, ll 2612

if and only if1 — (1 — x2)1‘5 S x151_(1_ $2)1+6
Proof. Proof follows directly from Deﬁnition 1.4 and Equation 2.6. C]

Theorem 2.2. The probability that two restriction fragments match with error 6 is

approximately %.

Proof. Two distinct fragments, 11 and 12 will match if their normalized values $1

and 2:2 satisfy the relation 1 — (1 — 1:2)1‘5 _<_ :51 g 1 — (1— 1:2)”6. As 2:1 and 2:2 are

uniformly distributed, the probability that two fragments match can be approximated

as follows:

(1— 1:2)1—6 — (I — $2)1+6 dz:

5.

1 1—(1—22)1+5
/ / (1.771(1'1'2 =
o 1—(1—12)1-6

[0
IO | I—I
Q')

C);

M

.+_

O)

 

I
on
IN?

22

Mlonullh

16

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Pattern Conf. Level Pattern Conf. Level
AAATTT 0.9998 GAATTC 0.5023
AACGTT 0.9043 GACGTC 0.1665
AAGCTT 0.6948 GAGCTC 0.2867
AATATT 1.0000 GATATC 0.9973
ACATGT 0.2435 GCATGC 0.2677
ACCGGT 0.7415 GCCGGC 0.9488
ACGCGT 0.5107 GCGCGC 0.9654
ACTAGT 0.0805 GCTAGC 0.3993
AGATCT 0.9548 GGATCC 0.9363
AGCGCT 0.7770 GGCGCC 0.2100
AGGCCT 0.7601 GGGCCC 0.4751
AGTACT 0.2167 GGTACC 0.1388
ATATAT 0.9990 GTATAC 0.7571
ATCGAT 0.7109 GTCGAC N / A
ATGCAT 0.8591 GTGCAC 0.5255
ATTAAT 1.0000 GTTAAC 0.7709
CAATTG 0.9975 TAATTA 0.9998
CACGTG 0.5277 TACGTA 0.9942
CAGCTG 0.9710 TAGCTA 0.8844
CATATG 0.9905 TATATA 0.9989
CCATGG 0.6005 TCATGA 0.9744
CCCGGG 0.2212 TCCGGA 0.8858
CCGCGG 0.9713 TCGCGA 0.3433
CCTAGG 0.9922 TCTAGA 0.6418
CGATCG 0.6466 TGATCA 0.7699
CGCGCG 0.9064 TGCGCA 0.4748
CGGCCG 0.8858 TGGCCA 0.9344
CGTACG 0.8777 TGTACA 0.5256
CTATAG 0.4512 TTATAA 0.9998
CTCGAG 0.8510 TTCGAA 0.9806
CTGCAG 0.9498 TTGCAA 0.7565
CTTAAG 0.6770 TTTAAA 0.9965

 

 

 

 

 

Hypothesis: Restriction sites are uniformly distributed over the H.inﬂuenzia genome.

Table 2.1: Chi-Square test conﬁdence levels for distribution of restriction sites

 

Chapter 3

Effectiveness of the Procedure

As stated earlier, the restriction pattern obtained from a gel electrophoresis ex-
periment is not precise. So, in practice, after searching a genome for the restriction
pattern, we can end up with more than one matching location on the genome. As
the number of matching locations or matches increases, the procedure becomes less
effective. We try to investigate the general effectiveness of the procedure and answer

the following questions in this chapter:

0 How does 6 affect the effectiveness of the procedure?

0 Which restriction enzyme(s) should we choose for digesting the isolate?

We ﬁrst need to deﬁne a metric for the effectiveness of the procedure. In the next

section, we propose three different metrics to be used later in our discussion.

3.1 Metrics for Effectiveness

We ﬁrst assume that the number of matches is a good indicator for the general
effectiveness of the procedure. Based on this assumption, we deﬁne our ﬁrst metric

17

18

as follow.
Deﬁnition 3.1. We deﬁne the resolution of a restriction pattern RP over a genome

g with restriction enzyme e, denoted r,.(RP, g, 6), as the number of subsequences RM,

of Rllfe(g) such that RP ~5 RMS.

However, in practice, two different matched locations may overlap (Figure 3.1).
We believe that the biologist can still use this procedure even if the search results in

multiple matches with signiﬁcant overlap.

 

 

 

 

Genome: } 100 % 125 § 175 % 100 {
lst matchzé ;

2nd match: < >
Restriction Pattern {100, 125, 175} can match to two different regions of the genome.

 

 

 

Figure 3.1: Possible overlapping matches to a restriction pattern.

We conjectured that isolates with resolution greater than one are actually overlap-
ping with their false matches in most cases. We developed a second metric to study
our conjecture where each possible isolate of the genome for a certain enzyme was

put into one of three different categories:

1. The isolates which result only in a single match.

2. The isolates which result in multiple matches, but all false matches overlap with
the original isolate.

3. The isolates which result in at least one false match which does not overlap the
original isolate.

We refer to this metric as the Category metric.

19

The isolates in category 1 are the perfect ones for our case. Based on experiments
we observed that the isolates in category 3 can be avoided easily as they are very
rare. The main difficulty arises with the numerous isolates in category 2. This metric
does not distinguish between isolates that generate two highly overlapped matches
and isolates that generate two matches that barely overlap. In order to study this
effect better, we developed a third metric(Score). Let I denote the isolate length and
l’ denote the portion of the isolate shared by all the matches. An isolate receives a
score of %. Note that all isolates in category 1 have a score of 1, and those in category

3 have a score of 0.

3.2 Resolution Metric

We experimentally measured the resolution of isolates for the Mycoplasma geni-
talium genome [F GW+95] using the TGATCA recognition site which cuts the genome
into 500 fragments. We plotted them in Figure 3.2 for 6 = 0.04 and Figure 3.3 for

= 0.10. In both cases the resolution of isolates increases with the length of the
isolate. However, for the smaller error value, the slope is much lower. Also, in both

cases, we have high values of resolution for very short isolates.

Note that the resolution metric is very dependent on the error value, and we

20

 

 

 

 

 

8 l I I T I
6 = 0.04

7 '- a
6 - _
5 r _
4 - _
3 r _
2 I. 5 _

i W‘
1 K A _—-——:v A‘m- I l

0 100000 200000 300000 400000 500000 600000

Isolate Length (bps)

Figure 3.2: Average resolution of isolates, 6 = 0.04.

 

14* =0.10 5 ‘

l2r : —

 

 

 

 

0 100000 200000 300000 400000 500000 600000
Isolate Length (bps)

Figure 3.3: Average resolution of isolates, 6 = 0.10.

21

cannot infer much using it, when we don’t know much about the error value.

3.3 Category Metric

Deﬁnition 3.2. Let P(N, k, i) denote the probability that an isolate with k restriction

fragments obtained from a genome with N restriction fragments is in category i.

First, we try to find the probability that an isolate is in category 3, i.e. P(N, k, 3).
As stated before in Section 2.2, the normalized fragment lengths have uniform dis-
tribution, and it is easier for analysis purposes to use the normalized lengths of
fragments.

An isolate I is not in category 3 if all other non-overlapping isolates of the genome
do not match with it. Computing the exact probability that two independent isolates
match is rather difficult; therefore we use the following approximation.

We ﬁrst derive a lower bound for P(N, k, 3). The probability that two restriction
fragments match is g, and the probability that two isolates match is at least (ilk-

N

There are T disjoint isolates on the genome, and the probability that there is no

other matching isolate will be less than:

N

to)“
P(N,k,3)21_ (1_(g)k)s_1

We next derive an upper bound on P(N, k, 3) by assuming that all permutations

That is,

of the restriction pattern are likely to match and by considering all other N — 2k + 1

22

 

100 '..l I I I I I I n
l. 6 = 0.2
1 Lower Bound ' ° ° '
80 — '. _
60 r .1. ..
% o

40 — . —
20 - ~

0 l I . ' ’ - I ..... .I- l - A A- .1 .1

 

 

 

1 2 3 4 5 6 7 8 9 10
Number of Fragments

Figure 3.4: Percentage of isolates in category 3.

non-overlapping isolates of the genome.

6 k k!(N-2k+1)
P(N,k,3) 31- (1— (5))

In Figure 3.4, the lower bound and the real percentage of isolates in category 3 is
shown. The genome is Mycoplasma genitalium genome [FGW+95], and the recogni-
tion site is TGATCA which cuts the genome into 500 fragments. As it can be seen, if
the isolate has more than 8 fragments, we don’t have any isolate in category 3 even
for a big error value of 20%. So, by carefully choosing the restriction enzyme, we can

can assure that the isolates are not in category 3.

We now focus on ﬁnding the probability that an isolate is in category 2. We derive
a lower bound for P(N, k, 2). First we observe that at least % of the isolates are in

category 2. The reason is that the ﬁrst fragment of an isolate can match the very

23

ﬁrst fragment of the genome after the isolate with that probability. To improve this
lower bound for the probability, we compute the probability that two isolates match
in all but one fragment. In other words, isolates that share all the fragments except
one. Let’s name these two fragments as 3:1 and 1:2. They could match together with a
probability of %, hence resulting in the matching of the isolates. Or, they can match
through another common fragment, say 2:3, such that 2:1 matches 1:3 and 2:3 matches

2:2. The probability that this happens is:

1 1—(1—22)1+6 1—(1—23)1+5 8
/ f / (1531615133 dl'g it: —62
0 1—(1—.’t2)l-‘S 1-(1-.'1:3)1_‘s 27
There are k — 1 such shared fragments between the two isolates’, hence, we can ﬁnd

the following lower bound on the probability that an isolate is in category 2:

> _ _ .. _ _ 2
P(N,k,2) _ 1 (1 2) (1 275 )

We can improve this lower bound by considering a chain of shared fragments to

match 2:1 and x2. However our current model is acceptable for small values of k.

In Figure 3.5 and Figure 3.6, the lower bound and the observed probability of
an isolate being in category 2 are depicted. Note that as the number of fragments
increases, it is more likely that an isolate is in category 2. On the other hand,
the probability of an isolate being category 3 decreases rapidly with the number of

fragments (Figure 3.4).

To maximize the percentage of category 1 isolates, we should choose a restriction

24

 

 

 

 

100 T I I I I I I I I ,
6 = 0.04

Lower Bound ° ' ' ° .
80 *- A
60 r -.-
% .
40 - ,_
'\ 's

‘00. "

\ .'- ' \J‘ét:f-5 6s

. .00 ..O .. '.§°:’€‘h;.*'ﬁ . . .....

20 ’— .‘jﬂ‘;:o-\.u:":". ..“,' ............ —
° . "ﬁx: .....
. o . .m‘o la" 0'
. . o. ‘2‘. 3‘ zﬁ’wﬁ. o
. 0‘ *ac’g‘ I... "a. J. .’ .0
0 "ﬁspf' °‘ l" 'I I l I I I I

0 50 100 150 200 250 300 350 400 450 500
Number of Fragments

Figure 3.5: Percentage of isolates in category 2, 6 = 0.04.

enzyme which cuts the isolate into have very few fragments (approximately 10). Since
the length of the isolate is usually known before doing the digestion, we can choose

an appropriate restriction enzyme.

3.4 Score Metric

As stated before, the score metric measures the ratio of overlapped length of
matches of an isolate to the length of the isolate. For a relatively high error of
6 = 0.10, computer experiments showed that, in average, most isolates achieve a
score of 0.95. This indicates that, in most cases, we can ﬁnd the location of the isolate
with a 95% accuracy. If this approximate knowledge of location of the isolate on the
genome is sufﬁcient for the biologist, then this procedure can be used successfully.

The maximum score was when the isolate has a small length, which is due to the

 

100 I I I I I I T I I
wgvﬂi‘f’fsw} ’
80 r . .' .o,u€‘.?- ' .1
...","JA:& ......
7:“.- ' o '
.zf‘ﬁ.’ '
60 - .' '3‘ T A
‘70 5:5. ,
40 _ $1" . . . - " A
’\ o
o y'ﬁI:-. 6 = 0.10
20 - '3; ‘- Lower Bound ° ‘ ° ' -*
.3' "
Ir"
0 l l I I I I I I I

 

 

 

0 50 100 150 200 250 300 350 400 450 500
Number of Fragments

Figure 3.6: Percentage of isolates in category 2, 6 = 0.10.

smaller number of fragments it has. This was discussed in Section 3.3.

Figure 3.7 and Figure 3.8 show the score metric for isolates when the genome is
Mycoplasma genitalium and the recognition site is TGATCA. Notice that with the
lower error of 6 we obtain a much better score. Based on the experiments we can

conjecture that the score of a moderate length isolate is about 1 — g.

26

 

 

1 -.3--..-.'~reshufﬂe.itwﬁwwr'i'axMW'T’c‘wa-

0.99 — =f.".--""°' 6 = 0.04
0.98 ~ ., —
0.97 P. —
0.96 —' —
Score 0.95 P. T
0.94 »— -
0.93 —' _
0.92 —. _
0.91 - -

0 9 I I l
0 50000 100000 150000 200000
Isolate Length (bps)

l

 

 

 

Figure 3.7: Score metric, 6 = 0.04.

 

l l l l

I
. o M. ﬁe

. o .o I . . .. O. 0‘.
0.99 _ . '."":?o I.“ 3.030. :Of%~:o‘ .V'étx.‘.:~g 0" "‘ do . a.

- -- -- °- -;:.- ,3. {’e.-r!-:£~{<’-.1’:’.~.~.:Ca.
0.98--°° n...
0.97 " 0" . -I

0.96 — . -

Score 0.95 _ T
0.94 — : -
0.93 ~ _

0.92 - a

0.91— 6:0.10 - _

0.9 A 1 1

0 50000 100000 150000 200000
Isolate Length (bps)

_

 

 

 

Figure 3.8: Score metric, 6 = 0.10.

Chapter 4

A Matching Algorithm

In this chapter we consider the design of an algorithm that searches the restriction ‘

 

map data of a genome for a location which matches the restriction pattern of a query
isolate. The restriction map of the genome contains N fragments, and the restriction
pattern of the isolate consists of k fragment lengths. The algorithm should ﬁnd all
locations of the genome which match the isolate data.

In the next section we provide a theorem which is the basis of our algorithms. In

the following section we provide the algorithm and improvements of that algorithm.

4.1 The Basis Theorem

Deﬁnition 4.1. For any restriction map RM, we deﬁne its sorted restriction map

W as an ordered tuple consisting of the elements of RM such that,

Vi,j,i<j:RM,-2Rll7j

Deﬁnition 4.2. For any restriction pattern RP, we deﬁne its sorted restriction map

27

28

R_P, as an ordered tuple consisting of the elements of RA! such that,

Vi,j,z°<j:fi7’,3lB—P,

Lemma 4.1. For numbers 1:), 172, yl, and y; such that 1:1 3 $2 and y1 Z y2, if

x1 2" y) and 1:2 2” yg, then 172 =5 y1 and 1:1 =6 y2.

Proof.

311(1—5)S$1/\y123/2 => 3120—6)le
$2392II+5IA$1<$2 => x1§y2(1+6)
y1(1—6)S$1/\$1<$2 2} y1(1—6)Sl’2

$2Sy2(1+6)/\yl 2312 => $2SyIII+5l

Theorem 4.1. RP ~6 RM if and only ifRP =6 R74.

Proof. The converse of the theorem is trivial and simply follows from Deﬁnition 1.6.

Now, we have to Show that if RP ~” RM, then RP =5 RM. Assume that
there exist RP and RM such that RP ~" RM but not RT? =5 R_M. We deﬁne an
inversion of an ordered tuple S, as the number of occurrences of i, such that S,- < Si“.

Notice that when the number of InverSIOns Is 0, then S = S . ConSIder the minimum

29

inversion permutation on RP, the ordered tuple RP’, such that RP’ 2” RM. The
number of inversions should be positive by assumption. Let i denote the first index

i of RP’ such that RP,’ < 1219;“. As RP’ =<S BTI we have

RP, 2'6 If._l1,
(and)

RPz‘IH :6 W141

By Lemma 4.1, we obtain

RH... =6 W.-

RP-I :6 mph]

This means that there exists another permutation on RP matching RM with error
6 with one inversion less than RP’. This contradicts our ﬁrst assumption, and thus

the theorem follows. [3

4.2 Brute Force Algorithm

The algorithm is based on Theorem 4.1 which states that we can determine if the
restriction pattern of an isolate matches the restriction map of a subsequence of the
genome with 1: fragments in 0(k) when both the restriction pattern and restriction

map are previously sorted.

 

30

A brute force approach is to sort each size It subsequence of the genome and
compare it against the isolate. In this case we have at most, klogk + k comparisons
for each subsequence of the genome resulting in at most N (k + k log k) comparisons
(Algorithm 1). However, we can optimize this naive approach to perform much better.

We describe these improvements in the remainder of this chapter.

 

Algorithm 1 Brute Force: Matching RP to RM with error 6

1: Sort RP, to obtain RT)
2: for all RM’ subsequences of RM, such that IRM’I = k do

Sort RM’, to obtain RM’

3:

4: Matched (— True

5: for i = 1tok do

6: if notRP, =5 RM,’ then
7: Matched <— False

8: end if

9: end for

10: if Matched then
11: Report RM’
12: end if

13: end for

4.3 Obtaining Sorted Subsequences in 0(Nk) com-
parisons

Algorithm 1 sorts each size It subsequence of RM once; hence it incurs a running
time of 0(Nk log k). We can improve this to 0(Nk) by using the fact that two
overlapping subsequences which differ only in one element will be almost the same
when sorted. We exploit this by making a moving window over the elements of
RM in which we update only one element of RM’ at each pass. If we store enough

information to know which element of RM’ should be removed for each update and

 

31

 

 

 

 

 

 

 

 

 

 

 

 

 

N = 2904 N = 1257
k = 0.04 6 = 0.10 6 = 0.04 = 0.10
10 0.0061 0.0060 0.0026 0.0028
20 0.0081 0.0084 0.0037 0.0036
30 0.0103 0.0103 0.0045 0.0047
40 0.0122 0.0121 0.0056 0.0057

 

 

Table 4.1: Running time (in seconds) of the algorithm.

shift the elements of RM’ as needed, the average number of shifts and comparisons

needed for each update will be

Zizl k ijl kl’l _ jI
k2

 

_k
’3

That means we’ll have a total of (N — 1)§ + k log k + N k comparisons.

In Table 4.1 the running time of the algorithm is shown for various values of N,
k, and 6. Notice that the running time is linearly increasing for both It and N, and

it is independent of 6.

4.4 Limiting the search area to the most promising
areas

The next improvement we make is to limit the search area. Rather than searching
the entire genome, we focus on the areas which are most likely to succeed. We know
that all members of RP should be present in the matched subsequences of RM.
Hence, based on the information about RP, we can limit our search in RM. In
particular, if any fragment of RP is extremely rare in RM, that could be used to

focus our search on only a few areas. However in order to gain the most from this

32

approach, we should know which elements of RP are least likely to have matches in

RM. We use a probabilistic model to help us identify these fragment lengths.

A normalized fragment 516’ matches normalized fragment .r whenever (Theorem 2.1)
(1— 3:)"6 21— Lt' Z (1— 3:)H"

As 36’ is a uniformly distributed random variable, the fraction of RM which match

to :6 will be:

(1__ x)1—6 _ (1 _ $)1+6

Using second order Taylor approximation we derive
—26(1 — x)ln(1— :r) + 0(63)

Ignoring the third order term of 6 and using the fragment length l in place of :r,
we obtain

26Ale_’\’

So, we try to ﬁlter out the RM using any member of RP which achieves the lowest

value of the function above.

We can use all the RP data iteratively to ﬁlter out the RM. However, keeping
track of ﬁltered RM requires more time than is needed to search RM with brute

force. Hence in practice, we use only the ﬁrst few members of RP for ﬁltering.

In Table 4.2 the running time of the algorithm is shown when it uses only one

33

 

 

 

 

 

 

 

 

 

 

 

 

 

N : 2904 N = 1257
k 6 = 0.04 6 = 0.10 6 = 0.04 6 = 0.10
10 0.0019 0.0029 0.0009 0.0014
20 0.0026 0.0037 0.0012 0.0017
30 0.0031 0.0045 0.0015 0.0020
40 0.0042 0.0058 0.0019 0.0023

 

 

Table 4.2: Running time (in seconds) of the algorithm with 1 limiting fragment.

 

 

 

 

 

 

 

 

 

 

 

 

 

N = 2904 N = 1257
k 6 = 0.04 6 = 0.10 6 = 0.04 6 = 0.10
10 0.0025 0.0030 0.0011 0.0013
20 0.0028 0.0037 0.0013 0.0016
30 0.0027 0.0036 0.0014 0.0019
40 0.0030 0.0048 0.0014 0.0020

 

 

Table 4.3: Running time (in seconds) of the algorithm with 3 limiting fragments.

fragment of RP to ﬁlter out the RM. Here, the running time is dependent on the
value of 6. It can be seen that the running time has been improved. In Table 4.3
the running time of the algorithm is shown when it uses three fragments of RP to
ﬁlter out the RM. For smaller values of k, the overhead of using 2 more fragments
to ﬁlter the data does not compensate the gain in limiting the search area. However
for higher values of k we have a better running time. It is still an open question to

ﬁnd the optimal number of limiting fragments for given values of N, k, and 6.

4.5 More Improvement

We observed that, in practice, the RM’ changes locally; that is, new RM’ dif-

fers in only one area from the previous RM’. We divide RM’ into three consecutive

sections. The ﬁrst section RM’I . . . RM’,_1 matches RP} . . . R—P)_1, the second sec-

34

tion is a Single fragment (RTE) that does not match W”), and the third section is
RIM;+1 . . . RAIL. we only compare the elements of RM’ and RP if R11! ’ has changed
in the area of RM; . . . Rll’ ,5, where l’ is the value of 1 from the previous iteration.
There was not much difference in running times after this improvement. It is due
to the fact that the comparison of two sorted restriction patterns takes less time than

generating them.

4.6 More on Comparing RP and RM

Up to this point, we have assumed that the restriction pattern of the isolate is a
subsequence of the genome’s restriction map. However, this assumption is not always
true (Figure 4.1). In general the two tail fragments of RP can be a partial fragment
of their corresponding ones on RM.

Deﬁnition 4.3. We say the restriction pattern RP inner-matches restriction map

RM with error 6, denoted RP z" RM, if and only if

3{x, y} C RP:
(RP — {2, y} =‘5 RM — {RM1, RM.}) A

((2: 35 RM1 A y 5“ RM.) v (y g" RM1 A x g" RMk))

where k = IRMI.

In other words a restriction pattern RP inner-matches a restriction map RM when
there is relaxed matching between elements of RP and RM. Figure 4.1 illustrates

this deﬁnition.

 

100 125 375 100 50

L l l
l l l I l

l
Isolate: < ;
{50,50,125,375}
lst Matchz< a
(100,125,375,100) matching (50,125,375,50)

2nd Match: If _ >
(125,375,100.50) matching (l25.375,50,50)

Genome:

 

 

 

 

Figure 4.1: Inner-Matching.

We cannot use Theorem 4.1 to compare RP and subsequences of RM for this case.
However, with a little modiﬁcation, we can still determine whether or not RP z" RM
in 0(k) time. Suppose the elements of RP and RM are already sorted in decreasing
order. Two of the elements of RM can have lower values in RP. We then skip those

elements of RM when trying to compare RM and RP (Algorithm 2).

Theorem 4.2. RP 2" RM if and only if Algorithm 2 terminates with a non-negative

skips value.

Proof. Using Deﬁnition 4.3 it is easy to see that the algorithm terminates with a
negative skips value unless RP z" RM.

Now, we have to show that if RP z" RM, then the algorithm terminates with a
non-negative skips value. If RP z” RM, then there is a matching between elements
of RP and RM. Consider such a matching, and assume that the RP,- 55 JET/I; and
RTJJ- 5” RT/Ik. Build a new set RP’ such that it has all elements of RP except RP,-
and RP], with RM1(1+ 6) 6 RP’, RM),(1 + 6) 6 RP’. By Theorem 4.1, we have
R?’ ~5 m. Now replace RM1(1+6) of? with RP,, and RM).(1 +6) of RP’ with
RP,-. The order of elements of RP’ and m) are exactly the same except that RP,-

and RP]- have higher indexes in W). Consider Figure 4.2 for an illustration where

36

 

Algorithm 2 Inner Match: Is RP 23" RM ?
Sort RP, to obtain We

1:
2: Sort RM, to obtain RT!
3: skips (— 0
4: t (— 1
5: while i S kandskips Z 0 do
6: if BIT/IssaMlorRMk then
7: if RPi—Skips 3" RM: then
8: skips (— skips + 1
9: SkipBack +— False
10: else
11: SkipBack (— True
12: end if
13: else
14: if RPi—skips =6 RM: then
15: SkipBack (— False
16: else
17: SkipBack (— True
18: end if
19: end if
20: if SkipBack then
21: skips +— skips - 1
22: else
23: ’l <— i + 1
24: end if

25: end while
26: returnskip Z 0

 

the solid lines Show a matching.

37

The algorithm matches all elements of RP and RT! until it reaches RP,- to be

matched against RI’IIk. There are two cases. Either (i) they match or (ii) they don’t.

In case (i) the algorithm corrects the skips value immediately, and continues. In case

(ii), the algorithm does not correct the skips value immediately. In either case it does

not terminate with a negative skips value.

BEL—e

Rein?

O——O
O——O

OW:
N

O—OR—M

 

N1—

RM)c

Figure 4.2: Proof of Theorem 4.2.

BIBLIOGRAPHY

Bibliography

[Aoe94]

[BM77]

[Dev95]

[PAW+95]

[PGW+95]

[GB91]

[KCTP96]

[KMP77]

[KR87]

[LW95]

Jun-ichi Aoe. Computer Algorithms: String Pattern Matching Strategies.
IEEE Computer Society Press, 1994.

Robert S. Boyer and J. Strother Moore. A fast string matching algorithm.
Communications of the ACM, 20(10):62—72, October 1977.

Jay L. Devore. Probability and statistics for engineering and the sciences.
Duxbury Press, 4th edition, 1995.

R. D. Fleischmann, M. D. Adams, 0. White, R.A. Clayton, E. F.
Kirkness, A. R. Kerlavage, C. J. Bult, J. F. Tomb, B. A. Dougherty,
J. M. Merrick, et al. Whole-genome random sequencing and assembly of
haemophilus infuenzae Rd. Science, 269(5223):496-512, July 1995.

C. M. Fraser, J. D. Goycayne, O. White, M. D. Adams, R. A. Clayton,
R. D. Fleischmann, C. J. Bult, A. R. Kerlavage, G. Sutton, J. M. Kelley,
et al. The minimal gene complement of myc0plasma genitalium. Science,
270(5235):397—403, October 1995.

Gaston H. Gonnet and Ricardo Baeza-Yates. Handbook of Algorithms and
Data Structures. Assison—Wesley, 2nd edition, 1991.

J in Kim, James R. Cole, Eric Trong, and Sakti Pramanik. Inferring relat-
edness of a macromolecule to a sequence database without sequencing. In
Proceedings of the Fourth International Conference on Intelligent Systems
for Molecular Biology, pages 125—133, June 1996.

Donald E. Knuth, James H. Morris, and Vaughan R. Pratt. Fast pattern
matching in strings. SIAM Journal on Computing, 6(2):323—350, June
1977.

Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-
macthing algorithms. IBM Journal of Research and Development,
31(2):249—260, March 1987.

Eric S. Lander and Michael S. Waterman. Calculating the Secrets of Life.
National Academy Press, 1995.

38

39

[W'at95] Michael S. Waterman. Introduction to Computational Biology. Chapman
and Hall, 1995.

l‘l ICH IGQN STRTE UN IV. L IBRQR IES
II I II IIIIII III III III IIIIII IIII IIII II IIIIII IIII II IIIIII II IIII I II
31293015700747