11:3,». .. ...v . 4.1: é .1 .4:- svxll: ,1. . . . u.:v.. . ‘0. 39."... :vi. .a , Hui; . ‘ ll. , I- . “1.2;. a iI.:uo\nv‘.Jt {I :1: at :1... Sr ’11 3 a». .fix.y..w~,d.§gi I... :9 a a. 4 3. {a A «all; (xii; .. v: '91 3:: {s .9. . .. , u _ 2.3.3... we? . r; . . ‘ V _. tug”??? . .hmmamkfir V4.3. . _ . rewfiéa1.mu$? . . q. . . . , . a n .ma 1 This is to certify that the dissertation entitled THE HYBRID DIGITAL TREE AND ITS APPLICATIONS TO GENOMIC SEQUENCE DATABASES presented by QIANG XUE has been accepted towards fulfillment of the requirements for the Doctoral degree in Computer Science 2 K ROMA Major Professor’s Signature IZ—léwog‘ Date MSU is an Affirmative Action/Equal Opportunity Institution LIBRARY Michigan State University .-.—-o-o-.-c—.-.-c-o--n—c—o-.--o-.—-—o-o----o—o—-—o _. PLACE IN RETURN BOX to remove this checkout from your record. TO AVOID FINES return on or before date due. MAY BE RECALLED with earlier due date if requested. DATE DUE DATE DUE DATE DUE 2/05 p:/C|RCIDaIeDue.indd—p.1 THE HYBRID DIGITAL TREE AND ITS APPLICATIONS TO GENOMIC SEQUENCE DATABASES By Qiang Xue A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Department of Computer Science and Engineering 2005 ABSTRACT THE HYBRID DIGITAL TREE AND ITS APPLICATIONS TO GENOMIC SEQUENCE DATABASES By Qiang Xue This dissertation focuses on index structures, search algorithms, and applications for large string databases whose indexes cannot fit entirely in the main memory (RAM). String searching is a classic research topic that has received increasing atten- tion in recent years, due to the rapid growth of digital text collections (strings) and the fast expansion of application range and complexity. Traditional string indexing approaches are either RAM-based or disk-based. The RAM-based structures perform poorly when a database index size exceeds that of the available RAM. On the other hand, disk-based structures do not take full advantage of the available RAM, which may result in overwhelmed Input / Output (I/O) operations. In this dissertation, a novel indexing approach, the Hybrid Digital tree (HD—tree), is proposed. The HD- tree index contains two parts: the RAM-index and the disk-index. The RAM-index resides in the RAM to minimize the disk accesses; while the disk—index maintains the rest of the index on disks so that large databases can be indexed. The first half of this dissertation focuses on index structures. The HD-tree is proposed after investigating existing indexing techniques. Construction and search algorithms for the HD-tree are developed, and characteristics of the tree structure are discussed. The HD-tree is applied to prefix and substring searches, and is compared with the Prefix B-tree. The comparison shows that the HD-tree not only reduces I/O operations by a factor of two to three, but also reduces the total query processing time by one order of magnitude. The HD-tree is also applied to approximate string matching based on the Hamming distance, where the performance of the HD-tree surpasses that of the M-tree and the linear-scan approach. In the second half of this dissertation, the HD-tree is applied to indexing and searching genomic sequence databases, such as the entire GenBank protein sequence database. Since the GenBank data is massive, using the standard method to gen- erate an HD-tree index takes dozens of hours. Therefore, the Sort-Merge method is proposed to reduce the construction time by an order of magnitude. Sequence search algorithms using scoring matrices are developed for the HD-tree. Compared with BLAST, a popular sequence search tool, the HD-tree not only reduces query time by a factor of four, but also finds more valid results for short queries. Finally, the HD- tree is applied to sequence searches using the Profile Hidden Markov Model (PHMM), where it Shows great success. Compared with one of the most popular PHMM search tools, HMMER, the HD-tree is orders of magnitude faster for Short queries. In the appendix, the research of ' approximate q-gram matching in genomic se- quence databases is presented. It is Shown that searching genomic sequence databases using longer query word length and larger Hamming distance in the filtering stage provides an excellent opportunity for optimizing the search cost, while improving the quality of the search. This result provides further support and motivation for devel- oping advanced indexing schemes, such as the HD-tree, for large genomic sequence databases. In summary, this dissertation not only develops a new tree structure for string indexing, but also successfully applies the structure to real applications. According to comparisons with existing techniques, the proposed data structure, the HD-tree, is promising for indexing and searching large string databases, especially genomic sequence databases. To God and my Parents: Kongwen Xue and Fangying Zhu iv Acknowledgements Completing this dissertation is the biggest challenge in my life. Without the help of numerous people, I cannot succeed. My first thanks go to my adviser, Dr. Sakti Pramanik, who committed years in training and helping me to be a qualified researcher. He never gave up on me, although I was slow in learning. Many times, he invited me to his house during weekends to provide extra help. He offered valuable insight and suggestions in guiding my research direction. He read my paper and dissertation drafts and edited my grammar errors without a complaint. I am also thankful for Dr. Pramanik’s family, who welcomed me warmly whenever I visited and treated me many nice meals. My thanks also go to Dr. James Cole, a member of my PhD committee. As a biologist, Dr. Cole provided the knowledge and insight I needed to apply the HD- tree to genomic sequence databases. He also offered many suggestions in finding applications and designing experiments. Without his help, the second half of the dissertation is not possible. I am also thankful to Dr. Qiang Zhu and Dr. Gang Qian, who helped me in developing the structure and algorithms of HD-tree, and publishing the research. I appreciate the help from other members of my PhD committee: Dr. Herman Hughes, Dr. William McCarthy, and Dr. Jon Sticklen. They encouraged me in the process of finishing this dissertation and offered valuable revision suggestions. I am especially thankful that Dr. Hughes, though retired, flew from Atlanta to attend my defense meeting. I am indebted to my parents for encouraging me to pursue this degree, and sup— porting me in every step along the way. They sacrificed much more than most parents in helping me to be successful. I Shared many tears and laughs over this disserta- tion with my brothers and Sisters in Christ. I cannot imagine how could I continue the study without their prayers and encouragements. Among them, Caleb Kelly and Steve Herwaldt played the most important role. During years of study at MSU, I was financially supported by a number of re— sources such as the CSE department. Steve Tuckey from the writing center at MSU provided valuable help me on my writings and presentations. I have made many friends who offered enormous help and encouragements. For these, I have not stopped giving thanks. I do not intend to list all the names I need to give thanks. However, in my heart, I am grateful for each of them. vi Table of Contents LIST OF TABLES xii LIST OF FIGURES xiv 1 Introduction 1 1.1 Research Needs .............................. 1 1.2 Basic Concepts for String Matching ................... 2 1.3 Growth of String Databases ....................... 3 1.3.1 Signal Processing ......................... 4 1.3.2 Relational Databases ....................... 4 1.3.3 E—businesses ............................ 5 1.3.4 Computational Biology ...................... 5 1.4 Memory Hierarchy ............................ 6 1.5 Existing Indexing Structures and Their Limitations .......... 8 1.6 The Hybrid Approach .......................... 11 1.7 Contributions ............................... 12 1.8 Overview of the Dissertation ....................... 14 PART ONE: THE HYBRID DIGITAL TREE 15 2 Existing Indexing Techniques 16 2.1 Extendible Hashing ............................ 17 vii 2.2 B—trees and Prefix B-trees ........................ 19 2.3 Tries .................................... 22 2.4 Suffix rflees and Suffix Arrays ...................... 24 2.5 String B-trees ............................... 26 The HD-Tree 28 3.1 Basic Structure .............................. 28 3.2 HD-tree Properties ............................ 29 3.3 Building the HD-Tree ........................... 34 3.3.1 Insertion Procedure ........................ 34 3.3.2 Overflow Processing ....................... 36 3.3.3 Linked Leaf Nodes ........................ 37 3.3.4 Split Heuristics .......................... 39 HD-tree Behavior 44 4.1 Prefix Search ............................... 44 4.2 Data Sources for Experiments ...................... 45 4.3 HD—Tree Behavior ............................. 47 4.3.1 RAM and Disk Usage ...................... 47 4.3.2 Split Heuristics .......................... 48 4.3.3 Threshold Phenomena ...................... 49 4.4 Comparisons with the Prefix B-tree ................... 51 viii 4.5 Approximate String Matching Based on the Hamming Distance . . . 56 4.5.1 Distance Measure ......................... 56 4.5.2 Search Algorithm ......................... 57 4.5.3 Comparisons ........................... 58 PART TWO: INDEXING AND SEARCHING GENOMIC SEQUENCE 5 6 DATABASES 61 Genomic Sequence Analysis 62 5.1 Introduction to Sequence Analysis .................... 63 5.2 Alignment Type .............................. 65 5.3 Scoring System .............................. 66 5.3.1 Gap Penalties ........................... 70 5.4 Alignment Algorithms .......................... 71 5.4.1 Needleman—Wunsch Algorithm .................. 71 5.4.2 Smith—Waterman Algorithm ................... 73 5.4.3 BLAST .............................. 74 5.4.4 PASTA ............................... 75 5.5 Evaluating Alignments .......................... 77 Indexing and Searching Genomic Sequence Databases 79 6.1 Overview of GenBank .......................... 79 6.2 Existing Techniques ............................ 79 6.3 Creating the HD-tree Using the Sort-Merge Method .......... 82 6.3.1 Discussion ............................. 87 6.4 HD-tree Search Algorithm ........................ 88 6.5 Comparisons with BLAST ........................ 93 6.5.1 Closeness of HD-tree Queries .................. 97 6.5.2 Index Size and Construction Time ................ 97 6.5.3 Quality of Query Results ..................... 99 6.6 Query Time ................................ 101 Sequence Search Using the Profile Hidden Markov Model 103 7.1 Profile Analysis .............................. 103 7.2 Markov Chain ............................... 104 7.3 The Hidden Markov Model ........................ 104 7.4 The Most Probable Path ......................... 107 7.5 The Profile Hidden Markov Model .................... 108 7.6 Viterbi Equations ............................. 111 7.7 Searching PHMM Using the HD-tree .................. 112 7.8 The HMMER Package .......................... 116 7.9 HMMER Plan 7 in the HD-tree ..................... 119 7.10 Comparisons ................................ 121 7.10.1 PFAM ............................... 121 7.10.2 Synthetic Queries ......................... 122 7.10.3 Analyzing Query Time ...................... 125 7.10.4 Real Queries ............................ 128 8 Conclusions and Future Work 130 8.1 Conclusions ................................ 130 8.2 Future Work ................................ 133 APPENDIX 136 A Approximate Q-gram Matching in Genomic Sequence Databases 137 A.1 A2 A3 A4 Introduction ................................ 137 Filtering Based on Approximate Matching ............... 141 A. 2.1 Motivation ............................. 14 1 A.2.2 Methodology ........................... 141 A23 Experimental Results ....................... 143 Theoretical Analysis ........................... 149 A31 False Hit Probability (F HP) ................... 150 A.3.2 True Hit Probability (THP) ................... 150 A33 Verifying FHP and THP ..................... 151 A34 Discussion Based on Theoretical THP and FHP ........ 153 Conclusion ................................. 153 xi LIST OF REFERENCES 155 xii 4.1 4.2 4.3 4.4 5.1 5.2 5.3 6.1 6.2 6.3 6.4 7.1 7.2 7.3 7.4 7.5 7.6 List of Tables Statistics of sample databases and queries ............... RAM / disk Usage of HD-trees ...................... Split heuristics on storage utilization .................. the HD-tree vs. the M-tree ........................ 48 49 60 Two pairwise alignments to a fragment of human alpha globin: hba_human 64 An example of multi-sequence alignment: eight fragments from im- munoglobulin sequences ......................... The PAM30 Substitution Matrix .................... Hunt’s Suffix Tree versus the HD-tree .................. Index Size, construction time and storage utilization of the HD-tree for the GenBank protein sequence database .............. Query quality comparison ........................ BLAST query time ............................ HD-tree versus HMMER, E—value = 10 ................. Closeness versus E—value ......................... HD-tree query time for synthetic PHMMS; Query length = 10 . . . . Computation of dynamic programming table columns using synthetic PHMMS, Closeness = 60% ........................ Analyzing query time. Both the HD-tree and HMMER use 10 synthetic queries, and E-value = 10. ....................... Query time for real PHMMS; E—value = 10 .............. xiii 88 100 101 123 123 124 125 126 129 A.1 A2 A3 A4 A.5 A.6 Categories of HSPS returned by Probes ................. 143 Combinations better than 11/0 in both cost and sensitivity ...... 148 Combinations satisfy minimum sensitivity. ............... 149 Alphabet Distribution .......................... 152 False Hit Probability (x 1e-10) ..................... 152 True Hit Probability ........................... 152 xiv 1.1 1.2 1.3 1.4 2.1 2.2 2.3 2.4 2.5 2.6 3.1 3.2 3.3 List of Figures Examples of Non-Approximate String Matching ............ 3 Growth of GenBank (1982-2004) ..................... 6 The memory hierarchy of a typical uniprocessor system. Below each memory level iS the range of typical sizes of that memory level. The value of B at the top of the figure indicates the block transfer size between two adjacent levels of the hierarchy. Sizes are given in units of bytes (B), kilobytes (KB), megabytes (MB), gigabytes (GB), or ter- abytes (TB) ................................. 7 Limitations of Disk-based and RAM-based Index Structures ..... 10 Extendible hashing with block size B = 3. The keys are the numbers inside a block. The hash address of a key consists of its binary representation. For example, the hash address of key 12 is ‘...001100’. (a) After insertion of the keys 4, 8, 12, 23, 40, 41, 42. (b) Insertion of 76 into directory location 100 causes the block with local depth 2 to split into two blocks with local depth 3. (c) Insertion of 52 causes the block with to split into two blocks with local depth 4. The directory doubles in size and the global depth d is increased to 4. ............................... 18 A B-tree Node ............................... 20 B—tree Splitting .............................. 20 An Example of the Shortest String Separator in a Prefix B-tree . . . 22 An Example of a Trie ........................... 24 An Example of a Suffix Tree for “BANANA”. Pointers to the suffix position are shown in leaf node. ..................... 25 An HD-tree ................................ 28 Examples of the SGL and the MGL. .................. 37 Examples of the HD-tree Growth .................... 37 XV 3.4 3.5 3.6 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 4.10 4.11 4.12 5.1 Illustration of a large group hindering a possible merge ........ 39 Illustration of the underflow leaf nodes generated by Simple split (S S plz’t) 40 Examples of HD-tree splitting and linked leaf nodes .......... 43 Examples of String Searches ....................... 46 RAM Usage of HD-trees ......................... 48 The relationship between the ANL and the available RAM as the per- centage of the database size. ....................... 50 The relationship between the number of I/OS and the available RAM when the answer size is fixed. ...................... 50 The relationship between the number of I/OS and the available RAM when the answer Size changes. ...................... 51 I/ O comparison for different query localities; average query length is 6. 52 I/O comparison for different RAM sizes; average query string length is 8. 53 I/ O comparison for different RAM sizes; average query string length is 6. 54 I/ O comparison for different query lengths; y-axis is in Logarithmic scale ..................................... 55 Running time comparison; average query string length is 6 ....... 55 I/O comparison for Similarity searches as RAM size increase; the y- axiS is the number of I/OS as the percentage of the total disk blocks occupied by the database. ........................ 59 I/O comparison for Similarity searches as the Database size increases, the y—axis is the number of I/OS as the percentage of the total disk blocks occupied by the database. .................... 59 Compute a Cell in a Dynamic Programming Table ........... 72 xvi 5.2 5.3 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 6.10 6.11 6.12 7.1 The BLAST Search Algorithm ...................... 74 The FASTA Algorithm .......................... 76 Overlapping versus Non-overlapping Words ............... 80 The Sort-Merge method of creating the HD—tree from the GenBank protein sequence database ........................ 84 Split heuristic in the Sort-Merge method ................ 85 Improvement in Construction Time using the Sort-Merge Method. Over- lapping words of length 20 are used .................... 85 Improvement in Storage Utilization using the Sort-Merge Method. Over- lapping words of length 20 are used .................... 86 An example of the Difference—encoding ................. 87 An Example of a Dynamic Programming Table Used in the HD-tree. 90 An example of a genomic sequence search using the HD-tree. Arrow Shows the traversing path. Each downward arrow computes one col- umn (the right-most) of the table, and upward arrows do not compute ” u matrix. “continue , return”, and “search leaf” correspond to lines 14, 7, and 12 in Algorithm 6, respectively. ................. 94 Motif length distribution using MEX from 7000 enzyme sequences . . 96 Distribution of HD-tree Levels in the RAM. Levels greater than 10 are not shown .................................. 99 Number of insulin sequences among top N results. Query length is 14. 101 HD-tree query time ............................ 102 A Simply HMM. The joint probability P(a, 7r) = t1,1 t1; t2,end p1 (G) p2(C') 193(0). Note that another state path (1-1-2) could have gener- ated the same symbol sequence with a probably different joint probability. 106 xvii 7.2 A small PHMM (right) representing a small multiple alignment of five protein sequences (left) with three consensus columns. Squares repre- sent match states (Ml-M3). The 20 emission probabilities are calcu- lated using Laplace’s rule (i.e., each missing residue is counted one). Insert states (diamonds IO-I3) also have 20 emission probabilities (as- sume to be the same as the background distribution). Delete states (circles labeled D0—D3) are “mute” states that have no emission prob- abilities. State transition probabilities are shown as arrows ....... 7.3 The “Plan 7” architecture of HMMER. Squares indicate match states; diamonds indicate insert states; circles indicate delete states and Spe- cial states; arrows indicate state transitions. .............. 7.4 HD—tree query time for synthetic PHMMS ............... A.1 Example alignments ........................... A.2 E. coli, HSP Category I .......................... A.3 E. coli, HSP Category II ......................... A.4 E. coli, HSP Category III ......................... A.5 E. coli, HSP Category IV ......................... A.6 GenBank, HSP Category V ....................... A.7 GenBank, HSP Category VI ....................... xviii 109 145 145 146 146 147 Chapter 1: Introduction The study of data structures and algorithms has been a fundamental research in the field of computer science Since the era of electronic computing. The increase in raw computing power cannot discount the significance of discovering efficient data structures and algorithms. After all, the faster the computing equipment, the more is the gain from human intelligence. This chapter presents the research needs in string indexing, the basic concepts of string matching, and research challenges in string databases. Existing techniques and their limitations are briefly discussed, so that the motivations for the proposed hybrid indexing approach is clear. The contributions and overview of this dissertation are presented at the end of this chapter. 1.1 Research Needs Electronic text (string) collections have increased dramatically over the last decade, from megabytes of dictionaries, to gigabytes of genomic sequences, to terabytes of web documents. Such expansion has not stopped accelerating as information technologies continue to develop. Hence, string indexing techniques become more and more essential to process the massive available information. Many applications, such as computational biology [1, 2, 3], signal processing [4, 5], and information retrieval [6, 7, 8], must process complex queries (e.g., approximate string matching) on the growing amount of text collections. Due to the limited size of the internal memory (i.e., Random Access Memory, or RAM), partial or entire indexes have to be stored on the external memory (disk). The resulting input / output communication between fast RAM and Slow disk creates a major performance bottleneck. In order to reduce search cost and improve query performance, new ideas must be developed to design efficient data structures and search algorithms for processing such large text collections. Motivated by this fact, the focus of this dissertation is on designing and analyzing index structures and search algorithms for large string databases, such as genomic sequence databases. 1.2 Basic Concepts for String Matching A string consists of a series of symbols (or characters) chosen from an alphabet A of Size [A]. The letters and strings are assumed to have a lexicographic order. In this dissertation, lower-case letters are used to denote symbols from A (e.g., a, b and c), while lower-case Greek letters are used to denote strings (e.g., a, 6 and 7). The combinations of strings or symbols indicate string concatenation. For example, 06 is the concatenation of a and 6, while ab is the concatenation of a and b. Given a string a = a1...an of length [a] = n, a1...a,- is called a prefix of a, aj...an a sufifx of a, and a,...aj a substring of a. For simplicity, a database is considered to be a set of records with the form T,- = (Ki, A;), where n,- is a unique string and A, the descriptive information of Ni, such as a statistic, a string position, or a pointer to another location where such information can be found. Since this dissertation focuses on the issues of string indexing, A; is usually ignored in discussions (i.e., not strictly distinguishing a record and a string). This dissertation divides string matching problems into two categories: non-approximate string matching and approximate string matching, based on their computational complexity. Non-approximate string matching includes exact, prefix, substring, and range searches, which are used in applications such as web search engines, relational databases, and E—businesses. Given a database containing strings n1...nn, an exact search, ExactSearch(a), retrieves n,- such that n,- = a, 1 S i _<_ n; a prefix search, P're f 2255' earch(a), retrieves rt,- where a is a prefix of ’67:; a sub-string Database Queries disk W ExactSearch(hybrid) = {hybrid} hybrid index PrefixSearch(in) = {index, insert, insert intersting} interesting prefix SubstringSearch(in) = {index, insert, string intersting, string, substring} Structure substring RangSearch(in, out) = {index, insert, suffix interesting, prefix} L Figure 1.1: Examples of Non—Approximate String Matching search, SubstringSearch(a), retrieves It; where a is a substring of Hi; and a range search, RangeSearch(a, fl), retrieves n,- such that a S n,- S 6. Examples of non-approximate string matching are Shown in Figure 1.1. Approximate string matching is the string matching with “errors,” and the earliest references are from the late 19603 (Signal processing) and the 19708 (computational biology). The general form of approximate string matching is to find the positions of a text where a given pattern occurs, allowing a limited number of “errors” in the matches [9]. Various error models have been developed to define how different two strings are, by measuring a “distance” between the two strings. A user-defined distance is used to identify one string as the erroneous variant of another. The distance measures are further discussed in Section 4.5.1. 1.3 Growth of String Databases String databases and string matching techniques are used in many applications. The following sections introduce the rapid growth in a few applications, such as signal processing, relational databases, E—businesses, and computational biology. 1.3.1 Signal Processing Signal processing is used in speech recognition to determine text messages transmitted in audio signals. Approximate matching is critical in this application, because parts of a speech may be lost or mispronounced. At the same time, since the physical transmission of signals is error-prone, error correction is important to restore the correct message from a possible error introduced during the transmission. One of the most popular distance measurements, known as the Levenshtein distance [4] (also called edit distance) was originally developed for the purpose of error correction. The fast growing of multi-media databases demands the ability of searching the content of audio data, and the increasing interest in wireless networks keeps looking for stronger error correcting methods. Hence, approximate string matching in signal processing is a very active research area. 1.3.2 Relational Databases The rapid growth of the Internet, the increase in online transaction processing, and the expansion of large database applications have contributed Significantly to the data explosion of relational databases, such as consumer relation management, national white pages, and digital libraries. In an recent survey (September 14, 2005) of the world’s largest and most heavily used databases by Winter Corporation, the size of the largest commercial database tops the 100 TB mark, and has increased three-fold Since the 2003 survey [10]. The largest Windows data warehouse is 19.5 TB and the largest number of rows/ records is 2.8 trillion. For the first time, the peak workload on a system exceeds 1 billion SQL statements per hour. In relational databases, a large portion of queries deal with strings (e.g., searching names or addresses). In a majority of cases, users do not expect an exact match of the query string, therefore, prefix search, substring search, and range search are important. How to index and search these massive string data efficiently and effectively is a challenging research issue. 1.3.3 E-businesses The exponential growth of the Internet continues to encourage many traditional businesses to enter the electronic business (E—business) realm. Consequently, managing electronic documents iS essential for the success of E—business. For example, an important application in E—busineSS is to provide electronic product catalogs (E—catalogs) for buyers to locate and select products [11]. Consider a large E—store, such as Amazon.com, the E—catalog may contain millions of products and may receive hundreds of queries per second. In order to provide prompt query response, the E—catalog data have to be indexed properly. In recent years, XML (eXtensive Markup Language), which is based on plain text, has become a standard for representing and exchanging documents on the Internet. Almost all recent E—business standards are based on XML. Since the amount of XML data is large and increasing, efficient storage and query of XML documents is a growing challenge in computer science research [12]. 1.3.4 Computational Biology Computational biology has experienced tremendous growth over the past decade. It is known that genetic information is encoded in DNA and protein sequences. DNA sequences are represented as strings using a four-letter alphabet {A,C,G,T}. Protein sequences are represented as strings using a twenty-letter alphabet, where each letter represents an amino acid. Searching over these strings is a fundamental operation for problems such as assembling DNA chains from the pieces obtained by experiments, looking for given features in DNA chains, or determining how different two genomic sequences are. In such applications, exact match is of little use due to mutations of genomic sequences. Therefore, 50 I l I I T I y I I I I _ x—x Base Pairs of DNA (billions) ‘ 4o — A—A Sequences (millions) A _ 30 — A _ A 20 - _. .- A _‘ ,/ 10 — A __ - ’A . A AAAAAAAAAAAAA—AA‘A' . 9980 ' z "in 1995 2000 2005 2010 Year Figure 1.2: Growth of GenBank (1982-2004) approximate string matching becomes critical in computational biology. There are several public sequence databases, such as GenBank [13], Swiss-PORT [14], EMBL Nucleotide Sequence Database [15], and DNA Data Bank of Japan [16]. These databases have seen exponential growth in recent years, partly due to the well-known human. genome project [17]. Figure 1.2 shows the growth chart of GenBank. AS of April 2004, GenBank contains approximately 44 billion base pairs, and 40 million sequences. The rapid expansion of genomic sequence databases and the complexity of sequence matching tasks demand the development of efficient and effective data structures and search algorithms. 1 .4 Memory Hierarchy In order to be cost-effective, computer systems usually contain a hierarchy of memory levels, where each memory level has different cost and performance characteristics. The lowest level consists of CPU registers and caches that are built with the fastest but most expensive memory. Above this lowest level is the internal memory, which is also called random-access memory (RAM). At a higher level, inexpensive but slower magnetic disks are used for external mass storage. Finally, even slower but larger-capacity devices such as tapes and optical disks are used for archival storage. Figure 1.3 illustrates a typical memory hierarchy and its characteristics. 3:328 B=8KB \ B = 648 tens of ns ns : \ Internal Memory (RAM) l-4MB 32—64KB 128-512MB Figure 1.3: The memory hierarchy of a typical uniprocessor system. Below each memory level is the range of typical sizes of that memory level. The value of B at the top of the figure indicates the block transfer size between two adjacent levels of the hierarchy. Sizes are given in units of bytes (B), kilobytes (KB), megabytes (MB), gigabytes (GB), or terabytes (TB). In modern programming languages, the notion of Virtual Memory allows the program address-space to be far larger than what can fit in the RAM. Programmers usually assume that all memory references require the same access time. In many cases, such an assumption does no harm, especially when the data sets are small. However, when large address spaces Span multiple levels of the memory hierarchy, the assumption of equal access time may not reflect the actual behavior of the program. This is because accessing data in the lowest level of memory is orders of magnitude faster than accessing data in higher levels. For example, accessing a CPU register takes nanoseconds (10’9 seconds), and accessing RAM takes tens of nanoseconds, but the latency of accessing data from a disk is several milliseconds (10"3 seconds), which is about one million times Slower than that of a CPU register. Since the latency and bandwidth of memory chips are improving more quickly than those of disks (roughly 60% increase per year in processor performance, and 20% increase per year in disk performance [18]), the access gap is continually growing. Therefore, in applications that process massive amounts of data, the input / output communication (I/O) between levels of memory, especially between the RAM and disks, often becomes a bottleneck. As these trends continue, the I/O bottleneck grows ever worse without some changes in the fundamentals of data storage. 1.5 Existing Indexing Structures and Their Limitations Indexing structure is a topic studied throughout the development of computer science. The well-known B-tree structure [19], which is the basis of most disk-based indexing structures, was proposed in 1972. The digital tree, which is widely used in string matching, can be traced back to the 19608 [20]. Modern databases contain various types of data, such as pictures, audios, videos, and spatial data; yet textual data (i.e., strings) are still the major components of database systems. How to efficiently index these strings was, is, and will remain a critical issue for database performance. Over the past few decades, many data structures have been proposed for string indexing. These data structures can be divided into two categories: disk-based structures and RAM-based structures. The first category includes inverted files [7], Prefix B-trees [21] and String B-trees [22, 23, 24]. The second category includes various structures based on digital trees (also known as tries), such as Patricia tries [20], suffix trees [25, 26], suffix arrays [27] and PAT trees [28]. Since string indexes for large applications are often too massive to fit entirely in the RAM, disk Space must be used to store the indexes. However, the latency of accessing data from a disk is much Slower than that from the RAM. Therefore, the resulting I / 0 communication can be the major performance issue. The two important measures that are normally used to evaluate the performance of disk-based data structures are the number of I/OS to answer a query and the storage utilization of disk blocks. Among disk-based data structures, inverted files are popular for keyword-based searches. However, it is difficult to perform substring and similarity searches using inverted files. B-trees [19] and their variations [29], are well known balanced multi-way search trees for manipulating dynamic data on the disk. They are very efficient in handling fixed length keys (e.g., integers). However, the performance of B—treeS degrades dramatically for variable length keys (e.g., strings), since the fan-out (i.e., the number of children) of an internal node depends on the number of strings stored in the node. The Prefix B—tree is designed to improve the performance of string indexing by using the shortest unique prefixes as separators within an internal node. The String B-tree uses the Patricia trie inside its internal nodes to provide the same worst-case performance as the B—tree. However, since the String B-tree stores indexed strings in a separate file, it generally requires more disk accesses than the Prefix B-tree. These disk-based indexing techniques do not require RAM. To use the large amount of available RAM, they rely on caching mechanisms that are usually not optimized for individual data structure. Therefore, there is a need for disk-based data structures to efficiently use the available RAM. The RAM-based structures are useful for indexing strings in the RAM where string queries are performed. Patricia tries and PAT / suffix trees are particularly effective in handling relatively small amount of text. However, as the database size increases, it is no longer feasible to keep the entire trie in the RAM. Various methods have been proposed to reduce the sizes of tries, such as efi‘icient implementations of trie nodes [30] and encoded representations of tries. For example, PaTries [31] and PAT-trees [28, 32] are variants of Patricia tries, in which Jacobson’s encoding is used to reduce Space requirement. The compressed trie structures trade Space with computational complexity. However, for very large databases, it is still not practical to fit the corresponding indexes in the RAM. Another approach is to page tries on disks. Because of the unbalanced topology of a trie, it is shown to be difficult and inefficient to page a trie on disk. For example, in the worst case, a downward path of k nodes will be stored in (We) different pages [24]. Paging tries is especially expensive for dynamic indexing, where inserting or deleting an m-length string may take 9(m) page splits or merges [24]. Updating operations will also cause the storage utilization to degrade quickly. In [33], it is reported that the storage utilization is 43% for 238KB dynamic text and 38% for 5.55MB dynamic text. Therefore, it is concluded that RAM-based index structures are not suitable for indexing large string databases whose indexes cannot fit entirely in the RAM. CPU DISK The U0 Bottleneck (a) (b) Figure 1.4: Limitations of Disk-based and RAM-based Index Structures In summary, disk-based structures can index large databases but usually do not fully utilize the available RAM and may result in I/O bottleneck (see Figure 1.4b). On the other hand, RAM-based structures are efficient for string matching problems. However, as database size increases, indexes may become too large to fit in the RAM (see Figure 1.4a), and RAM—based structures perform poorly when 10 database index size exceeds that of the available RAM. 1.6 The Hybrid Approach According to the above discussion, both RAM-based and disk-based data structures have their strengths and limitations. Since our goal is to index large string databases with the aim of performing complex queries efficiently, the existing data structures cannot provide satisfactory performance. Therefore, a novel approach, the Hybrid Digital tree (HD-tree), is proposed. The basic idea of the HD-tree is to keep the internal nodes (similar to those in a digital tree) in the RAM to minimize the number of I/Os, while maintaining the leaf nodes (which hold the database strings) on disks to maximize the capability of the tree for indexing a large database. Strings stored in a leaf node Share the same prefix. The internal nodes are built on these prefixes and are used to guide the search to the leaf node(s) containing the query answer(s). Unlike a traditional digital tree, the parent of a leaf node in the HD-tree allows a set (“range”) of multiple prefixes so that indexed strings with different prefixes may share the same leaf node (i.e., disk block) to improve storage utilization. Moreover, unlike the traditional concept of range, the above prefix “range” of a node may not be “continuous”, so that the storage utilization can be further improved. It is known that traditional disk-based trees, such as Prefix B-trees, may use the available RAM to cache their internal nodes, so that the number of disk I/OS may be reduced. However, the HD-tree is different from this approach as follows: First, an internal node of disk-based trees is a disk block, which is usually several kilobytes in size. However, an internal node of the HD-tree is a data structure (i.e., a trie node), which is usually several bytes in size. Second, the internal nodes of disk-based trees are stored on disks and have to be read into the RAM whenever ll necessary. However, all internal nodes of the HD-tree are kept in the RAM, so that no disk I/Os are required to access these internal nodes. The HD-tree is Shown to be efficient for both non-approximate and approximate string matching. Using the HD-tree, the number of I/Os is optimal for prefix and substring searches. Although hashing techniques can also achieve optimality for exact searches, it cannot be used effectively in prefix, substring, and approximate string searches. It is observed that for a given database size, a small amount of RAM improves the performance of the HD—tree significantly. Since the data structure of internal nodes in an HD-tree is Similar to that of tries, the HD-tree not only has a data compression property that is not supported by other disk-based structures such as the Prefix B-tree, but also achieves great success in approximate string matching. The HD—tree supports various approximate string matching based on the Hamming distance [34, 9], simple edit distance [4], general edit distance using a scoring matrix [35], and the profile hidden Markov model [36]. The HD—tree has shown to be effective in indexing and searching large genomic sequence databases such as the entire GenBank protein sequence database. 1.7 Contributions This dissertation not only studies two important areas of string databases: indexing Structures and approximate string matching, but also deals with real world applications in genomic sequence databases. A novel index structure, the Hybrid Digital tree (HD-tree), is proposed. The HD-tree is a RAM/diSk-based tree that incorporates and extends indexing strategies of the digital trees and B-trees, taking advantages of their strengths in search performance and index capability. The HD-tree is compared with the Prefix B-tree using real textual data from Text REtrieval Conference (TREC) collections [37]. Queries are generated with different 12 cluster levels to study the effectiveness of the HD-tree. It is shown that given random distinctive queries, the number of disk I/Os is reduced by more than 60%, while the query time is reduced by one order of magnitude. The HD-tree is also applied to approximate string matching based on the Hamming distance, where the HD-tree outperforms existing techniques such as the M-tree [38] and the linear scan approach [39, 40]. The HD-tree is applied to genomic sequence databases. Due to the non-structured feature of genomic sequence data, and the largeness of the database, the standard approach of building HD-tree takes many hours. Heuristics are developed to reduce tree construction time in an order of magnitude. Hence, the HD-tree index can be created for the entire GenBank protein database in reasonable time (e.g., 3-4 hours). Algorithms are developed for genomic sequence search using scoring matrices [35, 41]. The HD-tree is compared with the well-established sequence search algorithm, BLAST [3, 42]. For Short protein sequence queries (e.g., insulin), the HD-tree is not only four times faster than BLAST, but also able to find more valid query results. The speed improvement of the HD-tree is even more impressive in sequence search using the Profile Hidden Markov Models (PHMMS), where heuristic algorithms are not applicable. Experiments are conducted on both synthetic and real queries. The HD-tree is Shown to be orders of magnitude faster than HMMER, a popular PHMM search tool, for short queries. Besides index structures and search algorithms, in the appendix, the research of approximate q-gram matching in genomic sequence databases is presented. It is shown that searching genomic sequence databases using longer word length and larger Hamming distance in the filtering stage provides an excellent opportunity for optimizing the search cost while improving the quality of the search. This result is another proper justification of developing advanced indexing schemes, such as the HD-tree, for large genomic sequence databases. 13 1.8 Overview of the Dissertation The first half of the dissertation focuses on the index structures for large string databases. In Chapter 2, some existing techniques for string indexing are covered. In Chapter 3, the structure of the Hybrid Digital tree (HD-tree) and the algorithms to build the HD-tree are presented. In Chapter 4, the behavior of the HD-tree is discussed and algorithms for prefix searches and approximate string matching based on the Hamming distance are presented. The comparisons between the HD-tree and other index techniques, such as the Prefix B-tree for prefix search and the M-tree for approximate string matching based on the Hamming distance, are also discussed in Chapter 4. The second half of the dissertation focuses on applying the HD-tree to genomic sequence databases. In Chapter 5, the background on genomic sequence analysis is covered. In Chapter 6, the techniques to index and search genomic sequence databases using the HD-tree is presented, and the performance of the HD—tree is compared with that of BLAST. In Chapter 7, sequence searches based on PHMMS are introduced, algorithms to search PHMMS using the HD—tree are presented, and the performance of the HD-tree is compared with HMMER, a popular PHMM search tool. Finally, conclusions and future work are discussed in Chapter 8. In appendix, the research on approximate q-gram matching in genomic sequence databases is presented. 14 PART ONE THE HYBRID DIGITAL TREE 15 Chapter 2: Existing Indexing Techniques String indexing is an increasingly important task in computer science. In the past a few decades, many data structures have been proposed. All data structures can be characterized in two main ways: based on how a search is performed (hashing, complete key, or digital decomposition) and based on where they are used (in the RAM or on a disk). Based on how a search is performed, there are three basic categories: hashing, search trees, and digital trees. Hashing maps a key to an integer in a given range (e. g., extendible hashing [43]). It “randomizes” the key order, and is able to perform an exact search very fast. In search trees, the complete value of a key is used to direct the search (e.g., B-trees [19] and Prefix B-trees [21]). In a digital tree (known as a trie, pronounced “try” [44], or radix search tree), keys are decomposed as a sequence of digits or alphabetic characters to direct the search (e.g., Patricia tries [20], suffix trees [26], PAT trees [28], and suffix arrays [27]). Based on where the structure is used, the above data structures can be divided into two categories: RAM-based and disk-based. The first category (RAM-based) includes various structures based on digital trees, such as Patricia tries, suflix trees and suffix arrays. The second category includes the extendible hashing [43], the Prefix B-tree [21], and the String B—tree[24]. In following sections, these data structures will be introduced as the background for the rest of this dissertation. 16 2. 1 Extendible Hashing The common element of all hashing algorithms is a predefined hash function hash(N possible keys) —> (0, 1, ..., M — 1) (2.1) that maps N keys (e.g., strings) to M hash addresses (i.e., integers) in a uniform manner. Since it is possible that two keys may map into one address, how to resolve the collision makes hashing algorithms differ from each other. Most traditional hashing methods have a statically allocated table and are designed to handle only a fixed range of N. When N becomes large, the static hash table becomes infeasible, due to the increased space requirement. Therefore, dynamic hashing structures (e.g., extendible hashing) are required to handle widely varying values of N. As shown in Figure 2.1, extendible hashing contains a directory and a set of disk blocks storing the keys. Assume M is sufficiently large and the directory consists of a table (i.e., array) of 2d pointers, where d is a non-negative integer and each pointer points to a disk block. Keys are assigned to a table location corresponding to the d least significant bits of its hash value. The value of d is called global depth. It is set to the smallest value for which each block has at most B keys assigned to it. A lookup operation takes at most two I/OS: one to access the directory, and the other to access the block containing the item. If the directory fits in the RAM, only one 1/0 is needed. To minimize storage utilization, several table locations may share one disk block if the total number of keys assigned to these table locations are less than B. These table locations have the same It least significant bits in their corresponding hash value. The value of k is called the local depth. It is chosen to be as small as possible so that the keys assigned to these table locations fit into a single disk block. Therefore, each disk block has its own local depth. Note that local depth is less 17 then or equal to the global depth. If a disk block overflows after a new key is inserted, the block needs to be split and the keys within the block need to be redistributed. If the local depth of the overflowing block is less than the global depth, only the block’s local depth is and the corresponding pointers in the directory need to be modified. If the local depth of the overflowing block is equal to the global depth, the global depth is increased by one, and the directory doubles in Size. The is how extendible hashing adapts to a growing N. The pointers in the new directory are initialized to point to the appropriate disk blocks. These disk blocks themselves do not need to be changed because of directory doubling, except for the block that overflows. An example of extendible hashing is shown in Figure 2.1. local depth 0000 0001 0010 000 4- 8' ’2 000 0011 001 001 0100 010 40 010 0101 011 011 0110 100 101 110 111 100 101 110 111 0lll 1000 1001 1010 1011 1100 1101 1110 1111 4,12, 42 global depth d = 3 global depth d = 3 (a) 0)) global depth d = 4 (C) Figure 2.1: Extendible hashing with block Size B = 3. The keys are the numbers inside a block. The hash address of a key consists of its binary representation. For example, the hash address of key 12 is ‘...001100’. (a) After insertion of the keys 4, 8, 12, 23, 40, 41, 42. (b) Insertion of 76 into directory location 100 causes the block with local depth 2 to Split into two blocks with local depth 3. (c) Insertion of 52 causes the block with to split into two blocks with local depth 4. The directory doubles in size and the global depth d is increased to 4. In extendible hashing, at least Q(n/ B) (It gives lower bounds) blocks are 18 needed to store the directory. On average, the directory uses O(N1+1/B/Bz) (9 gives exact order) blocks [45]. For practical values of N and B, the N 1/ B is a small constant, typically less than 2. The expected number of disk blocks required to store the keys is asymptotically n/ln2 z n/ 0.69 [46]. Besides the extendible hashing, other dynamic hashing schemes include linear hashing [47] and spiral hashing [48]. More detailed surveys and analysis for dynamic hashing can be found in [49, 50]. Hashing works well for exact searches in the average case. However, it does not support sequential searches such as retrieving keys in a specified range, and other advanced searches such as a substring search. A more effective approach for sequential searches is to use search trees, which are explored next. 2.2 B-trees and Prefix B-trees B-trees [19] and B-tree variations [29] are well known balanced multi-way search trees used for manipulating dynamic data on disks. Each node of a B-tree is a disk block that can store 9(8) pointers and keys, where B is the disk block size. A B—tree of order m (m = 6(3)) satisfies the following properties: 1) Each node has at most m children. 2) Each internal node (i.e., non-leaf node), except for the root, has at least m/ 2 children. 3) The root has at least 2 children, unless it is a leaf. 4) All leaves appear on the same level. 5) An internal node with k children contains k — 1 keys. A node that contains 1' keys and r + 1 pointers can be represented as that in Figure 2.2, where 1131 < k2 < < kr and P,- points to the sub—tree for keys between it, and ki+1. Searching in a B-tree starts from the root and requires fetching at most one node at each level into the RAM. For example, after the node in Figure 19 1:0 1‘1 1]1 k2 112] [Pl-1 kr Pr II I I Figure 2.2: A B-tree Node 2.2 has been read into the RAM, the given query key, It, is searched among k1, k2, ..., 19,». If the search is successful, the desired key is found; otherwise, assuming k,- < k < ki+1, the node indicated by P,- is fetched and the search process is repeated. The pointer P0 is used if k is less than k1, and P,— is used if k is greater than hr. The search is unsuccessful if the pointer is null. If a new key is inserted into a B—tree of order m, where all leaves are at level I, the new key is inserted into the appropriate node on level I — 1. If the node overflows (i.e., contains 777. keys), it splits into two nodes (see Figure 2.3) and inserts the key, [Wm/2], into the parent of the original node. If the splitting causes the parent node to overflow, the parent node splits, and so on. A splitting can thus propagate up to the root, and the tree grows in height only when the root splits. Deletions are handled in a Similar way by merging nodes. /" "\ P0"11’1 kIm/ZI-l PIm/le-l I’rrpm kIm/ZI-l PInV2l—1 km le ll 1 l l 1 Figure 2.3: B-tree Splitting The complexity of B-trees has been studied thoroughly, and their behavioral boundaries have been determined. Given a B-tree of order m, assume that there are N keys, the height, h, of the B-tree is bounded by: N+1 h 3 1+ log]m/2](—2—). (2.2) 20 When a new key is being inserted, the average number of nodes, 3, that need to be split is bounded by: s g 1 + —. (2.3) Therefore, the average number of Splits while building a tree of N keys is less than 1/([m/2]) — 1) per insertion [51]. Since the birth of the B-tree, many approaches have been developed to improve the basic B-tree structure. For example, in a BT-tree [51, 29], all keys are stored in leaves. The uppper levels, which are organized as a B-tree, consist only of an index as a roadmap to enable rapid location of a key. The leaves of a B+-tree are linked together in order to facilitate range queries and sequential access. In B*-trees [29], splitting is postponed by “sharing” the overflowing node’s data with one of its adjacent siblings. The overflowing node needs to be split only if the adjacent sibling is also full. When this happens, a new node is created. Data from the overflowing node and its full sibling are evenly redistributed among the three nodes, making each of them approximately 2 / 3 full. This method reduces the number of times new nodes must be created and thus increases the storage utilization. Assume random insertions. In regular B-trees, it is shown that the average storage utilization of nodes is ln2 z 69% [46, 52], while in B*-trees, the average storage utilization increases to about 2 ln(3/ 2) z 81% [53]. Although the B-tree was initially designed for fixed-length keys, such as integers, the basic idea can be used for variable-length keys, such as strings. However, if the keys are variable-length strings, the number of keys stored in each tree node may not reach the upper limit, m, before the node becomes full. Consequently, the height of the tree is not bounded, as in Equation 2.2. As the key length increases, the number of keys per node decreases; therefore, the height of the tree increases. Thus, the performance of the B-tree degrades, Since more internal nodes must be accessed. 21 In order to increase the number of keys stored in internal nodes, the Prefix B-tree is proposed in [21]. In conventional B-trees, a separator has to be a key (e.g., k[m/2] in Figure 2.3) . Assume a group of keys is: { “abstract”, “common”, “define”, “longlonglongword”, “moment”, “people” }. In order to split the group, the key “longlonglongword” has to be used as the separator. However, Prefix B-tree removes such limitation, and uses the shortest unique prefix of a key as a separator. For example, any Shortest string between “define” and “longlonglongword” in lexicographic order (e.g., ‘e’ or ‘f ’) can be a separator (see Figure 2.4). This method increases the number of keys stored in internal nodes. However, it can fail when keys have a long common prefix, Since they are adjacent to each other in lexicographic order. In [54], head compression is used to factor out a common prefix from all keys in an internal node. In [55], another compression scheme is adopted. The idea is: if a key begins with the same it characters as its immediate predecessor, the key is stored with its first n characters replaced by integer n. This approach saves space but it does not prevent a key from having many characters in the rest of its positions. Besides the storage issue, heuristics to improve searching performance within a Prefix B-tree node can be found in [55]. [ ..., . \... 1 /\ [ abstract, common, defineg] [ longlonglongword, moment, people] Figure 2.4: An Example of the Shortest String Separator in a Prefix B-tree 2.3 Tries TIies (pronounced “try”, and derived from “information retrieval” [44]), or digital trees, are recursive tree structures that use digital decomposition (i.e., decompose strings as a sequence of digits or characters) to represent a set of strings 22 and to direct the search. The basic idea is similar to the thumb index on a large dictionary, where from the first letter of a given word, the pages containing all words beginning with that letter can be located immediately. A trie can be defined as the following: assume S is a set of strings, and A = {ai}i=1 is the alphabet (a special symbol $ is included to represent the end of a string), then the trie associated to S is defined recursively by the rule: trie(S) = (trie(S|a1), ..., trie(S|a,~)), (2.4) where S |a,; represents the subset of S consisting of strings that start with a,- and stripped of their initial letter a,. The recursion is halted as soon as 5' contains one string. A trie only maintains the minimal prefix set of strings necessary to distinguish all the strings of S. A basic trie can be represented asan M-way tree, where each node is a vector of M components corresponding to symbols in A. Each edge is labeled with a symbol that leads to the next node. Each node on level I represents the set of all keys that begin with a certain sequence of 1 characters. The sequence is the concatenation of the labels traversing from root to a leaf node. An example of a trie structure is shown in Figure 2.5. Since tries represent strings along the paths in the tree, not in the nodes, considerable compression can be achieved by Sharing paths. The height of a trie is the number of nodes in the longest path from the root to an external node. On average, the height of a trie is logarithmic for any square-integrable probability distribution [56]. Tries provide potentially faster access than search trees, since one comparison may lead to a large fan-out (up to [AD [43]. Compacted tries compresses the unary paths (i.e., each node on the path has only one child) of the tries by storing the labels along the paths into trie nodes [51]. 23 a h I Q . n a . n . d g at have r 5 n . It: I] and e 1 here he angel angle $ indicates the end of a string Figure 2.5: An Example of a Trie Patricia tries (”Practical Algorithm To Retrieve Information Coded In Alphanumeric”) are binary tries where the individual bits of the keys are used to decide on the branching [20]. Each internal node of the Patricia trie has an indication of which bit is to be used for branching. This may be given by an absolute bit position or by a count of the number of bits to skip. 2.4 Suffix Trees and Suffix Arrays A suffix (semi-infinite string) is a substring, which starts from a position in a text and continuing toward the end as far as necessary to make the substring unique within the text. A common method to accelerate string searching is to index all sufifzes of a text using a trie. The resulting trie is known as a suffix tree [25]. For each suffix, a logic pointer pointing to the starting position of the suflix is stored at the leaf node of the suffix tree. The starting position can be either at a character or at a word. To make sure that no suffix in the text is a prefix of another suffix, a unique character that is not in A is appended at the end of the text. An example of a suflix tree is shown in Figure 2.6, where # represents the unique character. 24 # A NA 7 BANANA# l l 2 3 4 5 6 7 # BANANA# NA # NA# 6 5 3 # NA # 4 2 Figure 2.6: An Example of a Suffix TYee for “BANANA”. Pointers to the suffix position are shown in leaf node. Classical algorithms construct a suffix tree for a string of length n in O(n log |A|) time and 0(n) space (0 gives upper bounds) [25, 26, 57]. A recent algorithm removes the dependence on alphabet size [58]. Because the suflix tree indexes all possible suffixes of the text, it occupies a great deal of Space, e.g., 17 bytes per index point. Using Patrica trees in a similar setting, 12 bytes are required for each index point. PaTrieS [31] and PAT-trees [32, 28] are variants of Patricia tries, in which the Jacobson’s encoding [59] of the tries is used to reduce Space. The average space requirement for each index point in such variants is 6 bytes. Suffix trees are powerful data structures. However, they use much space. Even though compression techniques help to reduce the size, suffix trees built on large text may easily exceed RAM Size. Therefore, suffix trees have to be stored on disk. As Shown in [24], since suflix trees have an unbalanced topology that is text-dependent, it is difficult to apply suffix trees on disk efliciently and dynamically. It is also shown in [33] that dynamically paging suffix trees on disk leads to decreasing storage utilization, e.g., 43% for 238KB text and 38% for 5.55MB text. Suffix arrays store all the text suffixes in lexicographic order by their pointers. They are very space-efficient because only one pointer per suffix is stored. Suffix 25 arrays are inherently static. They can be applied on external memory by partitioning the index into disk blocks. The performance on external memory degenerates when the text collection becomes large and changes over time. Searching in a suflix array requires 0(log2 N) number of disk accesses. This binary search may perform poorly because of the number of random disk accesses. To reduce the number of disk accesses, the supra-index: has been used as the first step of the search [7]. The supra-index is a sampling of one out of b suffix array entries, where for each sample, the first I suffix characters are stored in the supra-index. 2.5 String B-trees In a conventional B—tree, 9(3) keys are stored in each internal node. However, if the keys are variable—sized text strings, the keys (i.e., string) can be arbitrarily long, and there may not be enough space to store 9(3) strings per node, even using the heuristics adopted by the Prefix B-tree. Consequently, the performance of the tree degrades as the average fan—out (i.e., the number of children) of internal nodes decreases. The worst case of the B-tree indexing variable-length strings is not bounded as that of the B-tree indexing fixed length integers. In order to provide bounded fan-outs, a straightforward approach is to stored 9(3) pointers to the 9(3) strings in each internal node. However, accessing each string within the internal node during the search may require one disk access in the worse case. Therefore, the number of I/Os required for search is usually too high to make this approach useful. To solve this issue, the String B-tree is proposed so that not only the fan-out of an internal node is bounded by 9(3), but also the number of I/Os required for searching an internal node is bounded by a constant number. Such feature is achieved by using a data structure similar to the Patricia trie within an internal node, and store keys in a separate file. The Patricia trie is used to 26 determine the pointer to follow by accessing the file once. The resulting query time to search in a String B-tree for a string of l characters is therefore 0(logB N + l/B). Insertions and deletions can be done in the same I/ O bound. String B-tree provides a theoretical guaranteed worst-case performance for string searches. However, since the String B—tree stores strings in a separate file, two disk I / Os are required for searching an internal node. In leaf nodes, extra disk I/Os are required to access query answers. The performance of the String B-tree is hence usually worse than that of the Prefix B—tree. This is the reason why the Prefix B-tree is selected to compare with the HD-tree in Chapter 4.4. 27 Chapter 3: The HD-Tree The HD—tree adopts a hybrid RAM / disk-based structure, in which leaf nodes are stored on the disk so that a large database can be indexed, and internal nodes are kept in the RAM to achieve greater efficiency. The HD-tree incorporates and extends some indexing strategies of the digital tree and the B*-tree [29], taking advantage of their strengths in search performance, compression capability, and storage utilization. The structure and construction algorithms of the HD—tree are presented in the following sections. Besides the notation and assumptions introduced in Section 1.2, symbol I] is a Special auxiliary symbol such that t] ¢ A and II < c for any c E A. Assume T is a set of letters, functions M I N (T) and M AX ('1‘) yield the smallest and greatest element in T, respectively. 3. 1 Basic Structure ’/’I ‘ b d e ’I ,I r c m o .8 a o . \‘b e V l, :01. "‘~~-.b e, .' b e. 4 5 l c é 17 8 ” V : 9 1 [I x z d DISK "\c I ,'- ’\"’, 16 11/ 12 Figure 3.1: An HD—tree 28 {a, b, c, d, e} Multi-Group Leaf node Single-Group Leaf node Internal Node Multi—Group Leaf Pointer Single—Group Leaf Pointer Internal Pointer The HD-tree is an unbalanced and ordered tree (see the example in Figure 3.1). An internal node, 6, of the HD-tree contains a list of pairs L(6) = {(a1, P1), (a2, P2), ..., (am, Pm)}, where P,- is a pointer to its child node; ai(1 S i S m) is a letter from A, called the label of Pi; and a1 < a2 < < am, such that the pointers are ordered according to their labels. The order of Siblings (the nodes who have the same parent) are determined by the pointers. For example, the left sibling of node 6 is node 2, while the right sibling of node 6 is node 13. Leaf nodes, which are implemented as disk blocks, contain the suffixes of indexed strings. The path string of a tree node is the concatenation of the labels along the path traversing from the root to the node. The path string of the root is empty. Since an HD—tree node can be uniquely identified by its path string, a path string is also called an id—str'ing (i.e., identification string) of the corresponding node. Let I D(6) denote the id-String of a tree node 6. In Figure 3.1, ID(2) = a, ID(9) = bbe, and ID(15) = db. 3.2 HD-tree Properties An HD-tree must satisfy two basic properties, which determine the proper leaf nodes for the indexed strings. PROPERTY 1 For each internal node, 6, in an HD-tree, I D(6) is a common prefix of all strings contained in any leaf node in the sub-tree with 6 as the root. Property 1 is similar to that of a digital tree. However, the id-string of a leaf node 6’ in an HD-tree represents one or more prefixes (i.e., a set or “range” of prefixes) which strings in the leaf node, 6', may have. Let PS (6' ) be the prefix—set of a leaf node 6’. If |PS(6')| = 1, all strings in 6’ share the same common prefix in PS (5’ ) Such a leaf node is called a Single-Group Leaf (SGL). If [PS(6’)| > 1, 6’ may contain several groups of strings, where the strings in each group share a prefix which is different from the prefix of another group. Such a leaf node is called a 29 Multi- Group Leaf (MGL). The reason for using SGL and MGL is to improve disk utilization; otherwise, some large groups of strings may hinder the grouping of small groups. Based on Property 1, each prefix in PS(5’) differs only in the last letter. An internal node in an HD-tree may have three types of pointers: (1) Internal Pointer (IP) to an internal node; (2) Single-Group Leaf Pointer (SGLP) to an SGL; and (3) Multi-Group Leaf Pointer (MGLP) to an MGL. In a traditional index tree such as the B—tree, any key, k, within a given range is kept in one node. This strategy is incorporated into an HD-tree by storing the keys having their prefixes within a “range” (i.e., a set) in the same leaf node. The reason to adopt this strategy in an HD-tree is based on the following observation: the group of keys with one prefix may be too small, and multiple such groups may fit in a leaf node (disk block), which can improve storage utilization. A key range in a traditional index tree is continuous in that no key between the two boundaries of the range can be excluded. However, the prefix “range” (called the prefix-set) in the HD-tree may not be continuous because one or more prefixes between the two boundaries (minimum and maximum prefixes) of the range may be excluded. The reason to allow the exclusion of some prefixes from the range is that their corresponding key groups may be too large to share one leaf (block) with others. In such cases, one or more separate leaves (disk blocks) are used to store the group of keys corresponding to such a prefix of the large group and keep the remaining small groups of keys (with prefixes within the “range”) in another leaf node. The prefix-set, PS (6' ), for an SGL, 6’, contains the unique prefix I D(5' ), i.e., PS (6’ ) = {ID(6’)}. For example, in Figure 3.1, node 11 is an SGL with PS (11) = {bbcc}; that is, all the strings in this node have the common prefix, bbcc. It is the task of the tree building algorithms to determine which node is an SGL. For example, a percentage of the free Space in a leaf node is used in the HD-tree to 30 determine an SGL. Unlike an SGL, whose prefix-set is directly presented by its id-String, the prefix-set of an MGL needs to be derived as follows: let 6’ be an MGL, and 6 be the parent node of 6’ containing list L(6) = {(a1,P1), ,(ak, Pk), ,(am, Pm)}, where m > 0 and Pk is the pointer to 6'. Let ,8 = I D(6), the prefix-set of MGL 6’ is defined as: PS(6’) = {go | c e rpk}, (3.1) where Tpk is a set of letters obtained through the following steps: 1: T’Pk = {a,- ] (ai, Pi) E L(6),a,- < ak,P,- is an MGLP }; 2 : if (T101: iS empty ) b’ = ll;else b’ = MAX(T'Pk); 3: TA={a|aEA,b'»caa cb cbce cbba cbcd dca dcc . bc cdd ce b d ——“gh‘m°“ caa c ea e ea eee <5 group A Single-Group Leaf A Multi-Group Leaf node 11 node 17 Figure 3.2: Examples of the SGL and the MGL. suffix-strings. A group is a set of suflix-strings whose first letters are the same. The common first letter of a group is called the group-head. Groups are ordered by their corresponding group-heads. The first group iS called the leftmost group and the last group is called the rightmost group. In other words, given a leaf node, 6', whose parent is 6, a group in 6' contains strings having the same prefix, 3c, where c is the group-head and 3 = I D(6). AS discussed in Section 3.1, an SGL contains only one group and an MGL contains one or more groups. For example, in Figure 3.2, node 11 is an SGL which contains one group whose group-head is c. Node 17 is an MGL which contains three groups, whose group—heads are c, d and, e, respectively. 3.3.3 Linked Leaf Nodes :> H U MGL Split *— —l— L—a Linked Leaf Node (a) (b) SGL Split Figure 3.3: Examples of the HD-tree Growth 37 The HD-tree keeps track of the current available RAM whenever adding or deleting an internal node (not shown explicitly in the algorithms). When RAM is available, the tree grows by creating internal nodes through overflow processing and splitting (see Figure 3.3a). When there is no available RAM, the tree stops creating new internal nodes. Hence, if a leaf node exceeds the disk block size after inserting a string, an extra disk block is linked to the original disk block to accommodate the overflowing data (see Figure 3.3b). Consequently, a search within the leaf node accesses all linked disk blocks. Using this approach, the HD-tree works with any Size of RAM. ALGORITHM 2 : HD—OverflowProc(6, 6’) Input: ( 1) an internal node 6, where L(5) = {(01, P1), (akr Pk), (am, Pmll; (2) an overflow leaf node, 6’, pointed to by P,- in L(6). Output: an updated HD-tree Method: 1. if the current RAM is not enough to create a new internal node then 2 link a new disk block to 6'; return; 3. end if; 4. if 6 pi contains only one group then 5. create an internal node 63;; 6 remove (a,, Pi) from L(6); 7 add (a;, P') into L(6) where P’ is an IP pointing to 6x; 8. remove the first letter a,- from each suffix-strings in 6' ; 9. add (b, P”) into L(6x) where b is the greatest 38 group-head in current 6' and P” is a MGL pointer to 6’; 10. if 6’ still overflows then 11. call HD-OverflowProc(6$, 6’); 12. else 13. write (6') back to disk; return; 14. end if; 15. else 16. call HD—Split(6, 6’, user-defined threshold); 17. end if; 18. return; Algorithm 2 (HD-OverflowProc) is a recursive procedure to handle overflow leaf nodes. If the overflow leaf node, 6’, iS an SGL, an internal node is created and the HD-tree grows one level down on the corresponding branch (steps 1-6). HD—OverflowProc may be invoked again if 6’ continues to overflow (steps 7-8). If 6’ is an MGL, Algorithm 3 (HD—Split) is invoked (step 14). 3.3.4 Split Heuristics c e g . node 1 node 2 node 3 Figure 3.4: Illustration of a large group hindering a possible merge In the HD-tree, suffix-strings in overflow leaf nodes are split by group. If a node 39 is split into two as soon as it overflows (called SSplit Algorithm), the resulting storage utilization is very low. This low storage utilization has two main causes. First, Since leaf nodes are ordered by the labels of leaf node pointers, a leaf node containing a large (in Size) group may hinder the possible merging of left and right siblings. For example, in Figure 3.4, leaf node 2 contains a large group ‘e’ (95% of the block Size). Group ‘e’ cannot be stored in either leaf node 1 or 3, since it will cause overflow. If group ‘e’ does not exist, leaf node 1 and 3 can be merged into one leaf node, which will increase storage utilization. Second, since Splitting is done by group, it may divide groups in an unbalanced way. The key range (i.e., the prefix-set) of a leaf node keeps shrinking without any possibility of expanding. Consequently, many underflow leaf nodes (where the storage utilization is less then 50%) are created that may be merged with siblings. Figure 3.5 illustrates a series of Splitting that creates two underflow leaf nodes (b2 and b3), which could be merged into one node. [a:5% [[b210%] [cz20%] _; @ W 6.41% f.20% [d:5%][e:4l%] [fz20%[ ($2095) ((15%) - .- node 1 node a1 node a2 Cassi) (13:10qu - ——> I @5090) ((13% - node a1 node a2 — (35%) 630%) w m ; - node bl node b2 node b3 node b4 Figure 3.5: Illustration of the underflow leaf nodes generated by simple Split (5' Split) According to the above two observations, in order to improve the storage utilization, two heuristics are used: (1) an SGL is created if the size of a group is greater than a user-defined threshold T (e.g., 85% of the disk block size), (2) after 40 an a large group is moved out of an overflow leaf node into an SGL, and before an overflow node is Split, groups may be moved to qualified siblings to avoid splitting. The first heuristic is to move a large group into an SGL, so that it does not interfere with the grouping of an MGL. The second heuristic helps to expand or shrink the key range (i.e., the prefix-set) of a leaf node in order to improve the storage utilization. ALGORITHM 3 : HD-Split(6, 6’, T) Input: (1) an internal node 6, where L(5) = {(01,191) (ale, Pk) (am,Pm)}; (2) an overflow MGL 6’ pointed to by P, in L(6); (3) a threshold T. Output: an updated HD-tree Method: 1. if 6’ contains a group g3; whose size is greater than T then 2. create an empty leaf node 63, as an SGL; 3. move gx from 6’ into 65;; 4. add (a;,;, P3) into L(6) where am is the group-head of ya; and P3; is a SGLP pointing to 65:; 5. adjust(a,-, P;)*; 6. end if; 7. if a left/ right MGL sibling 6 py of 6’ has space to accommodate the leftmost / rightmost group 9 in 6’ then 9° read 6}, from disk; y 9" move from 6’ into 6’ ; . 9 pg 41 10. adjust(a,-, Pi) and (ay, P9); 11. end if; 12. if 6' overflows then 13. if 6' contains only one group then 14. call HD-OverflowProc(6, 6’); 15. else 16. create an empty leaf node 6; as an MGL; 17. move groups from 6' into 6'2 one by one in increasing order until 6; is more than half full or there is only one group left in 6' ; 18. add (az, Pz) into L(6); 19. adjust (ai, Pi) and (az, P2); 20. end if; 21. end if; 22. write new and modified leaf node(s) to disk; 23. return; * adjust(a,', Pi) is a procedure which sets a, to the largest group-head in the leaf node 6}): and marks P, as SGLP or M GLP correspondingly. In Algorithm HD-Split, heuristic (1) is implemented in steps 1-6, while heuristic (2) is implemented in steps 7~11. If 6’ still overflows after the two heuristics are applied, 6’ is split and a new leaf node is created to accommodate some groups in 6’ (steps 12-17). Figure 3.6 shows a few examples of HD-tree splitting and the linked leaf nodes after inserting more strings into the original HD-tree (see Figure 3.1). For example, node 11 in Figure 3.1 grows into an internal node 11 and two leaf nodes: 19 and 20. Node 17 splits into two leaf nodes: 17 and 21. Node 18 grows to three linked leaf nodes: 18a, 18b, and 18c, which happens when the RAM is not available for further 42 ‘ ‘ ‘ ‘b‘ I . b, d , 17, Llnked Disk Blocks 4 5 ’1 C 1' ,' , Of node 18 8 II : 9 [ 9 III 21 I, I I‘ ‘\ d [I DISK {c 0 l. Splitting Overflowlng MGL 17 \ O I 1 ’\‘_’, 6 9 V e 19 20 Prooesslng Overtlowlng SGL 11 Figure 3.6: Examples of HD—tree splitting and linked leaf nodes splitting. After the HD-tree is built, algorithms are needed to search the HD-tree for different types of queries. The next chapter presents the search algorithms and the performance of the HD-tree. 43 Chapter 4: HD-tree Behavior Once the HD-tree is built, various queries can be conducted using the tree. In this chapter, algorithms for prefix search and approximate string matching based on the Hamming distance are presented. Characteristics of the HD-tree are discussed. The experimental results show that HD-tree outperform existing data structures, such as the Prefix B-tree for prefix searches and the M—tree for approximate string matching based on the Hamming distance. 4.1 Prefix Search The HD-tree is able to handle exact, prefix, and sub-string searches. In this section, Algorithm 4 (HD-PrefixSearch) is presented for prefix searches. Other string queries (see Section 1.2) can be performed using the prefix search algorithm. For example, ExactSearch(a) is equivalent to PrefixSearch(atl). RangeSearch(a, B) can be done by first finding the leaf nodes storing a and 3 using PrefixSearch(all) and PrefixSearch(,6]l), respectively, then sequentially access all the leaf nodes between these two leaf nodes. Similar to the approach used in [24], the HD—tree handles sub-string searches within a set of strings {141, ..., an} by indexing all suffix strings of K1, ..., nn. Therefore, SubstringSearch(a) can be performed by Pre f ixSearch(a) on these suffix strings. ALGORITHM 4 HD—PrefixSearchO-z, I, 6) Input: (1) the query string K. = k1...kn; (2) the current level I; (3) the current internal node 6, where L(5) = {(01,131) (ak’Pk) (am,Pm) }; Output: query result(s) 44 Method: 1. if 1 > n then 2 return all strings in leaf nodes under 6; 3. end if; 4. fori=1tomdo 5. if a,- == kl and P, is an [P then 6 return HD-PrefixSearch( It, i + 1, 6132.); 7 else if (P,- is an SGLP and a, == kl) or ( P,- is an MGLP and az- >= kl) then 8. read 6}), from disk 2 9. retrieve strings in 63,2, which have a prefix [elm/en; 10. end if; 11. end for; 12. return; Algorithm HD-PrefixSearch starts from the root of an HD-tree, which is at level 1, and traverses down the tree as far as possible (steps 4-6). If a leaf node is encountered, the search will read the leaf node from disks and linearly search the strings in the leaf (steps 7-10). If the prefix is Shorter than the id-String of a node (see Section 3.1), all the strings in the sub-tree are the answers (steps 1-3). Figure 4.1 shows the paths of searching “aa” (node 1 —- > 2 -— > 3 -— > 4, 4a, 4b, 5) and “bbcda” (node 1 — > 6 — > 7 — > 10 — > 12). 4.2 Data Sources for Experiments The performance of the HD-tree on prefix searches are tested using real textual data. A sample database, WSJ 1, is generated from the Wall Street Journal (entire 45 RAM 1 e ; Alphabet: {a,b,c,d,e} .” 2 /b C d ’l D Multi—Group .’ .' Leaf node I, a/ 6 ,’ ’ f ‘ \ “, l 8/ D Single—Group : 3 b .' 13/ ‘ ‘ ~ . WM ‘. '. ; 14 ‘. Internal Node ~-—. ~-‘\ b C, " I _. Mani—Group 4 ‘ . . ,' ; Lear Pointer 5 : ° .' . 15 . 17 . 8 , , 9 I I SIngle-Group ‘ ,I ', I d ,I’ ° Leaf Pointer DISK :' 10 ‘~ ’ —> ax to ‘\ " 16 ll 12 Figure 4.1: Examples of String Searches year of 1991), which is a part of the Text REtrieval Conference (TREC) collection [37]. Markup tags are removed, texts are split into segments of 5MB each, and unique prefixes of the suffix strings at word boundaries are extracted for every segment. If a prefix string is longer than 32, only the first 32 letters are kept to reduce storage requirement. WSJ 1 can be used for keyword-based document searches [7]. Similarly, database WSJ 2 is generated from the same data source, except that suffix strings start at letters (not spaces). WSJ 2 is used for sub-string searches [28]. The sample query set, Q1, is generated by randomly selecting keywords from the Wall Street Journal (1991). The sample query set, Q2, is generated by randomly selecting substrings from WSJ 2. The sample database, GENO, is generated using DNA sequences from GenBank (see Chapter 6 for GenBank Overview). Overlapping words (strings) of length 28 (i.e., a fixed window size of 28 Shifting from the beginning to the end of a sequence by one letter at a time) are extracted from these DNA sequences. Conducting Similarity searches on these overlapping strings is useful for finding homologous regions in genomic sequence databases [60]. To increase the efficiency, every two symbols from the 46 DNA alphabet, (A, C, G, T}, are encoded into one symbol. Hence, the string length becomes 14. Each of these databases consists of 15 million strings and each string is associated with a four-byte integer containing position information. WSJl is used for prefix searches, while GENO is used for approximate string matching based on the Hamming distance. Statistics of these databases and query sets are shown in Table 4.1, where key# indicates the number of database keys; max, min, and avg are the maximum, minimum, and average length of keys, respectively. The query performance (the number of I/OS) in the experimental results is the average among the 100 queries for text databases and 1000 queries for DAN sequence databases. Table 4.1: Statistics of sample databases and queries FDB size key# [min max avg ] WSJl 260.0MB 15M 3 33 14.18 WSJ2 251.8MB 15M 2 33 13.60 GENO 257.5MB 15M 14 14 14 Q1 856B 100 4 20 8.56 Q2 902B 100 4 10 9.02 Q3 140003 1000 14 14 14 4.3 HD-Tree Behavior Experiments are conduced to analyze the behavior of the HD-tree and evaluate its performance by comparing it with existing techniques. The HD-tree is implemented using C++ and experiments are conducted on a PC with 512MB RAM and 1.8GHz Pentium 4 processor, running Linux OS. The disk block size used in the experiments is 4096 bytes. 47 I —' I I I I I I I I I 700 - -1 . A-A oRam - 600 _ H mRam __ E 500 — — a 400 — — iii r _ E 300 — — 200 — '- 100 l—— ...I _ DBs: Samples from WSJ 1 . 0 l l l l M 1 l L I I l O 50 100 150 200 250 300 Database Size (MB) Figure 4.2: RAM Usage of HD-trees Table 4.2: RAM/disk Usage of HD-trees D3 27.39 54.75 109.4 164.1 219.2 274.3 RAM 0.1123 0.2009 0.3428 0.4623 0.5716 0.6679 disk: 32.08 57.04 96.95 130.5 161.1 188.2 total 32.19 57.24 97.3 131 161.7 188.9 ratio(%) 117.5 104.5 89.93 79.83 73.77 68.87 D3,RAM, disk, total: MB; Databases: Samples from WSJl; ratio = total/ DB; ALN=0 4.3.1 RAM and Disk Usage Figure 4.2 illustrates the RAM usage of the HD-tree, where mRam is the minimum RAM size needed to achieve the minimal-IO (i.e., the optimal I/O achieved by the HD-tree), and oRam is the minimum RAM size required to achieve near-optimal performance. Table 4.2 shows the RAM, disk, and total (RAM+disk) size of HD-trees. Note that the ratio between the total size of an HD-tree and the corresponding database size decreases as the database size increases. This performance gain is achieved by the compression feature of the HD-tree. 48 4.3.2 Split Heuristics A set of experiments is designed to Show the effectiveness of the Split heuristics for building an HD-tree. Table 4.3 shows the comparison of the storage utilization (using one disk block for each leaf node) between the SSplit, which is a B+tree-like approach, and the HD-Split (see Section 3.3.4). Note that the HD-Split adopted two heuristics to improve the storage utilization. One heuristic is to distinguish the SGL from the MGL, which allows the prefix range to be “discontinuous.” The other heuristic is to move groups to the left or right sibling to avoid a Split, so that the prefix set of an MGL is dynamically adjusted. It is Shown that the HD-Split increases the disk utilization by more than 40%, which indicates the effectiveness of the grouping mechanism in the HD-Split. Table 4.3: Split heuristics on storage utilization DBSize(MB) ] 50 ] 100 [ 150 | 200 | 250] SSplit 45.7 44.8 44.6 44.5 44.1 HD-Split 65.1 63.5 63.1 62.7 62.6 Improve 42.5 41.7 41.5 40.1 42.0 Databases: Samples from WSJ 1, Table value: % 4.3.3 Threshold Phenomena AS described in Section 3.3.3, using linked disk blocks, the HD-tree is scalable for any RAM size. Figure 4.3 Shows the relationship between the average number of links (AN L) and the available RAM Size as the percentage of the database size (RAM / DB). The ANL is the total number of linked disk blocks divided by the number of linked leaf nodes. An AN L value of zero means that each leaf node occupies one disk block. As shown in Figure 4.3, the ANL decreases as the RAM / DB increases. Note that there exists a threshold (where the curve becomes flat) in the figure. The threshold is almost invariant of database Sizes. 49 D) O ' I ' I 25 H DB 25MB _ H DB 50MB <>-<> DB 150MB - 20 H DB 250MB - DBs: Samples from WSJ 1 ' Average Number of Links 8 G I U! 00 0.1 0.2 ' V0.3 ‘ 04“ RAM Size as the Percentage of the Database Size Figure 4.3: The relationship between the ANL and the available RAM as the per- centage of the database size. 15 ' l ' l ' l ' X—X DB 25MB H DB 50MB (>0 DB I50MB 10 _ H DB 250MB T DBs: Samples from WSJ 1 Number of Disk I/Os per Query LII l 0 I I I l I l 0 0.1 I 0.2 0.3 0.4 RAM Size as the Percentage of the Database Size Figure 4.4: The relationship between the number of I/Os and the available RAM when the answer size is fixed. 50 When AN L is greater than zero (i.e., the linked disk blocks are used), the query performance of the HD-tree is shown to be closely related to the AN L. Curves in Figures 4.4 and 4.5, where the number of I/OS rather than ANL is used, are similar to those in Figure 4.3. The threshold phenomenon is due to the logarithmic nature of the tree (i.e., lower level contains less nodes). As the HD—tree grows, adding the same amount of RAM (i.e., increasing a certain number of leaf nodes) has a decreased impact on the selectivity of the tree (i.e., the total number of leaf nodes). Therefore, when the available RAM is limited with respect to the database size, it is important to allocate enough RAM at the threshold point where the RAM is most effectively utilized. 20 r I l I I l I I F e 3 ' x—x DB 50MB ‘ O’ 15 _ H DB 100MB _ t6 DB 150MB 3; _ H DB 250MB , o i 10 —- DB8: Samples from WSJl 'c'i _ _ c... 0 g 5 — 4 - E . ,. j” . 0 l I l I l I L I l 0 0.1 0.2 0.3 0.4 0.5 RAM Size as the Percentage of the Database Size Figure 4.5: The relationship between the number of I/Os and the available RAM when the answer Size changes. 4.4 Comparisons with the Prefix B-tree In this section, the performance of the HD-tree is evaluated by comparing it with that of the Prefix B—tree. The Prefix B—tree is widely adopted by database 51 systems and has been shown to be a practical technique for indexing large string databases. The Prefix B-tree used is the experiments was implemented by the popular Berkeley DB [61], which is an Open source database system. As a disk-based index structure, the Prefix B-tree does not require any RAM, while the HD-tree requires a certain amount of RAM to keep its internal nodes. For a fair comparison, the same amount of RAM used by the HD-tree is provided for the Prefix B—tree as a cache. The caching algorithm is based on the popular LRU (least-recently—used) heuristic, which is used by almost all commercial database systems because of its simplicity and effectiveness. The LRU algorithm keeps recently accessed internal nodes in the RAM to reduce the number of disk I/Os. The disk I/Os are compared between the HD-tree and the Prefix B—tree using 1000 queries with different numbers of distinctive queries. This set of experiments is designed to evaluate the effect of the locality of the query results on the performance of the HD-tree and the Prefix B—tree. The queries are generated as follows: ( 1) generate a certain number of distinct queries to form a query pool; (2) randomly generate 1000 queries from the query pool. In one extreme case, the 1000 queries are all the same (i.e., one distinctive query), i.e., all the 1000 queries return the same results. As the number of distinctive queries increases, the level of localities in the query results reduces. The other extreme is when all 1000 queries are different. As shown in Figure 4.6, the performance of the Prefix B-tree is better when the number of distinctive queries are small. However, as the number of distinctive queries increases, the performance of the Prefix B-tree deteriorates quickly. The two curves cross between 10 and 20 distinct queries, where the HD-tree starts to outperform the Prefix B—tree. For 1000 distinctive queries, the HD-tree is almost three times better than the Prefix B-tree in term of the number of disk I/Os. The results show that the performance of the Prefix B-tree using the LRU caching mechanism is very susceptible to the locality of the query results. On the other 52 15% I I I I I I I I I I I I I I I I I I I I I I I I I I m _ X—X HD-tree > g 12500 ” O—O Prefix B-tree '5) '- -. a 10000 _ — a... - .. o H 7500 — — B P- A E 2 5000 — — 3 ~ - E2 2500 — v — )/ ‘ l l l l l l l 1 l I l 1 L I l l l I l 1‘0 _ . . 100 1000 Number of Distinctive Queries Figure 4.6: I/O comparison for different query localities; average query length is 6. hand, the HD-tree is robust to distinctive queries. It is concluded that the HD-tree performs better as queries become more distinctive. In the following comparisons, 1000 distinctive random queries are used. In Figures 4.7 and 4.8, the performance of the HD-tree and the Prefix B-tree is compared for different RAM sizes. In Figure 4.7, it is shown that the HD-tree not only reduces the number of I / Os, but also uses the RAM more effectively than the caching mechanism adopted by the Prefix B-tree. For example, as the RAM increases from 250KB to 1.6MB, the HD-tree reduces more than 50% of I/Os, but the Prefix B-tree only reduces less than 20% of I/Os. For the given database WSJl (252MB) and 1.6MB of RAM, the HD-tree reaches its optimal status where each leaf node occupies only one disk block. In Figure 4.8, more RAM to the HD-tree is served as a cache which is the same as that of the Prefix B-tree. It is shown that the HD-tree is continually better than the Prefix B—tree when the RAM is largely available. In Figure 4.9, the number of I/Os are compared for different query lengths. It is shown that the HD-tree performs increasingly better than the Prefix B-tree as the 53 — — q l ' l ' M r r I n E 5’ O O 0 B ‘ .. a. 4 __ H HD-tree — g _ (>0 Prefix B-tree . .2: 3 — _, .Z’ Q - . “a 2 -— — L. Q) ’ . .o 8 1— - 5 z s. .. O J l l l l J_ l 0 500 1000 1500 2000 RAM Size (KB) Figure 4.7: I / 0 comparison for different RAM sizes; average query string length is 8. '5’ - _ - _ . _ . _. . ><->< HD-tree <>-<> Prefix B-tree — G I 5 l 1 Ln I I Number of DISk I/Os per Query X X 1 . 1 I 10 15 20 25 RAM Size (Mb) 0 0 U1— )- Figure 4.8: I/O comparison for different RAM sizes; average query string length is 6. 54 query string length increases. Since the Prefix B—tree uses the same amount of RAM as that of the HD-tree to cache internal nodes, it is concluded that the hybrid RAM/disk—based index structure (e.g., the HD-tree) is better than the disk-based structure combined with caching (e.g., the Prefix B-tree plus LRU caching), especially when queries are more distinctive. lmo E I I I I I I fir I I I I I I E E = 3 :1 C H HD-tree : C3 . <>—<> Prefix B-tree - O) Q" 100 :- .1 5 E E D I I .54 - . B - v a - ..5 10 :— ‘2 u : : d) 1. " flu .. .o . _ E — .. :3 _ - Z 1 1 1 n 1 . 1 . 1 1 1 r 1 1 2 3 4 5 6 7 8 9 Average Query String Length Figure 4.9: I / 0 comparison for different query lengths; y—axis is in Logarithmic scale. Finally, the HD—tree is compared with the Prefix B-tree in terms of total running time including both the RAM processing time and the I/O time. The experiments are conducted in the same computing environment (a Linux PC with 512MB RAM and 1.8GHz Pentium 4 processor). Figure 4.10 shows the running time of the HD-tree and the Prefix B—tree for 1000 queries with different numbers of distinctive queries. It is noticed that the actual running time of the HD-tree is comparable to that of the Prefix B—tree even when the 1000 queries are the same. The reason is that since a large amount of RAM is available, the operating system provides LRU caching for the HD-tree as well. The HD-tree is shown to be increasingly faster than the Prefix B-tree as the number of distinctive queries 55 125 r l IIIIIII I I IIIIIIr I I IIIIII '3 P ><->< HD-tree 1) 8 100— O—OPrefix B-tree — g _ E 75— E: _ é” .a 50 ._... S. m g 25— o f-4 f m‘ L 10 I ‘100 1000 Number of Distinctive Queries Figure 4.10: Running time comparison; average query string length is 6. increases. For 1000 distinctive queries, the HD-tree is more than one magnitude faster than the Prefix B-tree. 4.5 Approximate String Matching Based on the Hamming Distance Most disk-based index structures, such as the Prefix B-tree, cannot efficiently perform approximate string matching. However, since the HD-tree uses a trie-based structure for internal nodes, it has a great potential to perform well in approximate string matching. In this section, the search algorithms and experimental results for approximate matching based on the Hamming Distance are presented. More complex string matching issues and applications are discussed in Chapter 6 and 7. 56 4.5.1 Distance Measure In order to perform approximate string matching, a distance measure is required. Levenshtein distance [4], which is the most popular distance measure, I allows user to deletion, insertion, or substitution of a symbol in two matching strings. If different operations have different costs, or if the costs depend on the symbols involved, it is called general edit distance. The general edit distance is powerful enough for a wide range of applications, such as genomic sequence analysis. If all the operations cost one, it is called simple edit distance (or just edit distance), denoted as edist(a, 6), where a and 5 are strings. Edit distance is the minimum number of operations to make two strings equal. For example, edist( “string”, “stingy”) = 2. If only substitution is allowed at cost one, it is known as Hamming distance [62], denoted as hamming(a, ,8). In order to compute the Hamming distance, the two strings must have the same length. For example, I, (I hamming(“string , stingy”) = 4 and hamming(“string”, “strict”) = 2. 4.5.2 Search Algorithm ALGORITHM 5 HD-HammingSearchOc, 1, max, 6) Input: (1) the query string It = k1...kn; (2) the current level I; (3) the maximum Hamming distance max 2 0; (4) the current internal node 6, where L(6) = {(a1,P1) (ak, Pk) (am, Pm)}; Output: query result Method: 1. fori=1tomdo 2. if a,- 74 kl and max 2 1 and P,- is an [P then 3. return HD-HammingSearch( It, l+ 1, max -— 1, 6132.); ‘57 4. else if a,- == kl and P,- is an [P then 5. return HD—HammingSearch( It, 1+ 1, max, 6152.); 6. else if P,- is an SGLP and a,- 51$ kl and max =2 0 then 7. return NULL; 8. else if P,- is an M GLP or SGLP 9. read 61% from disk; 10. retrieve suffix strings a in 6h where H Distance(a, kl...kn) 3 max 11. end if; 12. end for; 13. return; Algorithm 5 (HD—HammingSearch) starts from the root of an HD-tree, which is at level 1, and traverses down the tree as far as possible (steps 2-5). Note that the maximum Hamming distance decreases if a mismatch is found while going down the tree (step 3). Once a leaf node is encountered (steps 6—10), the search may continue within the leaf node (steps 9—10), or stop at step 7 if the condition in step 6 is satisfied. 4.5.3 Comparisons One straightforward method to perform similarity searches based on the Hamming distance is to employ the linear scan. Assume that the database is stored sequentially on disk without fragments, which boosts its performance by a factor of 10. The performance of the linear scan is proportional to 10% of the database size. This benchmark is used in [39, 40]. As shown in Figure 4.11, for Hamming distances _<_1 and $2, the HD-tree with RAM as small as 50KB outperforms the 10% linear 58 scan. For Hamming distance 33, the HD-tree outperforms the 10% linear scan when the RAM size is more than 150KB. Figure 4.12 shows that as the database size increases, the HD-tree is increasingly more efficient than the 10% linear scan. 20 I I I I I I I I I I _ x—x Hamming distance <= 1 . H Hamming distance <= 2 15 -— <>—€> Hamming distance <= 3 -— — 10% Linear scan 5 ._ — 0 fl .V . if . 0 250 500 750 1000 1250 RAM Size (KB) Figure 4.11: I / 0 comparison for Similarity searches as RAM size increase; the y-axis is the number of I/Os as the percentage of the total disk blocks occupied by the database. Table 4.4 shows the performance comparison between the HD-tree and the M-tree, a disk-based metric tree for similarity searches [38]. It is shown that the number of I/Os using the HD-tree is much less than that using the M-tree. Since the M-tree is a pure disk based structure, it does not have any RAM requirement. If the RAM is available, the M-tree can cache the top level tree nodes to reduce the number of I/Os. However, the performance of the M-tree does not improve much as the RAM size increases. On the other hand, a small amount of RAM can boost the performance of the HD-tree dramatically. For Hamming distance 32, the M-tree takes an extra 560KB RAM to cache the second level of the tree, but the performance is improved less than 1%. However, 16KB RAM helps the HD-tree to reduce the number of I/Os from 640 to 214, i.e., the performance is improved by a 59 20 T— I I I I I I I I I I - x—x Hamming distance <= 1 - H Hamming distance <= 2 15 "‘ O—O Hamming distance <= 3 ‘ — I 0% Linear scan 1.3 V 10 — — O a 5 — ._ 0 XQL—l - x ' 1 ,‘ l M 1 0 50 100 150 200 250 300 Database Size (MB) Figure 4.12: I/O comparison for Similarity searches as the Database size increases, the y-axis is the number of I/Os as the percentage of the total disk blocks occupied by the database. factor of two. The potential of the HD-tree is not limited in various types of queries discussed in this chapter. In fact, the HD-tree is more useful is approximate string matching based on general edit distance. The potential of the HD-tree is further explored in the second half of this dissertation. Table 4.4: the HD-tree vs. the M-tree H D — tree M — tree ram i0 ram i0 K B S 1 S 2 S 3 K B S 1 S 2 S 3 4 125 641 1684 0*“ 790 1420 2480 20 27 198 802 4** 789 1419 2479 32 18 143 616 564* 649 1279 2339 ***: no cache; **: cache the root; *: cache top two levels 60 PART TWO INDEXING AND SEARCHING GENOMIC SEQUENCE DATABASES 61 Chapter 5: Genomic Sequence Analysis The publication of the first working draft of the entire human genome sequence in February 2001 is considered to be a milestone of scientific research. The journal, Nature, describes this event as “Unravelling the three billion or so base pairs of our entire DNA has been compared to the landing on the moon, the splitting of the atom and even the invention of the wheel [63].” The broadly available genomic sequences provide a great opportunity to advance our understanding of the role of genetic factors in human health and disease, and to apply this insight rapidly to drug development, disease prevention, and genetic tests [64]. The massive amount of sequence information requires careful storage, organization, and analysis. Therefore, bio-informatics, which includes recording, analyzing, and searching of nucleotide and protein sequences, is an emerging and prominent research field, where biology, computer science, and information technology merge into a single discipline. Since the inception of the Human Genome Project, which was completed in 2003, hundreds of other genome sequence projects on microbes, plants, and animals have been completed or are in progress [17]. Advances in molecular biology and sequencing equipment have allowed the increasingly rapid sequencing of large portions of the genomes. For example, a new technology developed at the 454 Life Sciences Corporation may achieve a 100-fold increase in DNA sequencing speed over current technology [65]. This technique could allow one person using one machine to easily sequence the 3 billion base pairs in the human genome within a hundred days. As the genomic sequence database continues to grow exponentially, the current popular linear-scan-based searching method will sooner or later become infeasible. Index-based approaches, such as the HD-tree, could become the choice of sequence 62 searching in the future. In this chapter, existing techniques used in sequence analysis are presented, and the following two chapters will discuss indexing and searching genomic sequence databases using the HD-tree. 5.1 Introduction to Sequence Analysis It is known that DNA (deoxyribonucleic acid) stores complete instructions for all the cellular functions of an organism. The primary structure of DNA is represented as strings using a four-letter alphabet, { A,C,G,T }, where each letter represents a nucleotide. RNA (polynucleotides) is a single-stranded molecule composed of nucleotide sequences that is similar to the double-stranded DNA. RNA helps to transfer information from DNA to the protein-forming system of the cell. Three-letter combinations of the nucleotide form messenger RNA (mRNA), which transcribes amino acids. Amino acids in turn can be combined to create proteins. Therefore, the structure of protein is represented as a string using a twenty-letter alphabet, each letter corresponds to one amino acid. The letters in nucleotide (DNA) or protein sequences are known as base pairs (or residues). Most sequence databases consist of long strings of nucleotides and / or amino acids. Each sequence of nucleotides or amino acids represents a particular gene or protein (or section thereof). There are also databases which include taxonomic information, such as the structural and biochemical characteristics of organisms. Sequence databases provide scientists with a wealth of information. However, the power of a sequence database comes not from the collection of information, but in its analysis, which is the most pressing task in bio-informatics. Scientific research has shown that all genes share some common elements. A new sequence often has significant similarity to a sequence which is already known. Therefore, part of the information about the structure and function of the known sequence can be 63 transfered to the new sequence. If two sequences are related, they are homologous, and one sequence is a homologue of another. The common applications of sequence analysis include: (1) finding genes in DNA sequences of various organisms; (2) aligning similar proteins and generating phylogenetic trees; (3) clustering protein sequences into families to develop of protein models; and (4) developing methods to predict the structure and/ or function of newly discovered proteins and structural RNA sequences [36]. Table 5.1: Two pairwise alignments to a fragment of human alpha globin: hba_human (a) hbaJnunan GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL sequenceii GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL (b) :hbaJnnnan GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL GS+ + G + +0 L ++ H+ D+ A +AL D ++AH+ sequencel) GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE All these applications involve the most basic sequence analysis task: determining if two sequences are homologous. This is usually done by first aligning the sequences (or parts of them), and then deciding whether that alignment is more likely because the sequences are homologous, or just by chance. Table 5.1 shows an example of two pairwise alignments [36], where (a) implies a clear similarity, and (b) is most likely unrelated. In the central line of each alignment, identical positions are indicated with letters, and “related” positions with plus signs (“Related” pairs are those which have a positive score in a substitution matrix, which is discussed in Section 5.3.). Within aligned sequences, symbol ‘-’ represents a gap due to insertion or deletion. In the task of finding homologous sequences, the key issues are: 64 (1) the type of alignments to be considered; (2) the scoring system used to rank alignments; (3) the method used to find alignments; and (4) the method used to evaluate the significance of alignments. The focus of this dissertation is on issue (3), and other issues are resolved by existing techniques. In the following sections, these issues are discussed in detail. 5.2 Alignment Type There are three types of sequence alignment: pairwise alignment, structural alignment, and multi-sequence alignment. Pairwise alignment is the most commonly used alignment type. It is used to find a homologue of a gene (protein or DNA), or a gene-product in a database of known examples. Pairwise alignment can be done locally or globally. A local alignment finds related regions within sequences. They can consist of a sub-sequence within each sequence. For example, positions 22-32 of sequence X might be aligned with positions 64—74 of sequence Y. A global alignment between two sequences is an alignment in which all the characters in both sequences participate in the alignment. That is, both sequences have to be aligned from beginning to end. Global alignment is useful mostly for finding closely-related sequences. Local alignment is more flexible than global alignment and has the advantage to find related regions that appear in a different position in the two proteins. This is not possible with global alignment methods. Global and local alignment can refer to query and database sequences, respectively. For example, both query and database sequences are global, or global on query sequence and local on database sequence. 65 Structural alignment is mostly used for proteins. It is a form of alignment to establish equivalences between two or more protein structures based on their fold. Because protein structure is more conserved than protein sequence, structural alignments can be more reliable, especially when the sequences have diverged so much that simple sequence comparison cannot detect their similarity. Multi-sequence alignment, as shown in Table 5.2 [66], is an extension of pairwise alignment. It incorporates more than two sequences into an alignment and aligns all of the sequences in a specified set. Multi-sequence alignments is used to identify related regions between sequences. It is also very useful in generating profile hidden hidden Markov models to search sequence databases for more distant homologues (see Chapter 7). Table 5.2: An example of multi-sequence alignment: eight fragments from im- munoglobulin sequences VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--YYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG-- ATLVCLISDFYPGA--VTVAWKADS-- AALGCLVKDYFPEP--VTVSWNSG--- VSLTCLVKGFYPSD--IAVEWESNG-- 5.3 ' Scoring System A key element in evaluating the quality of a pairwise sequence alignment is to assess whether a given alignment constitutes evidence for homology by a process of mutation. The basic mutational processes are substitutions, insertions, and deletions. Substitutions change residues in a sequence, while insertions and deletions add and remove residues, respectively. Insertions and deletions are together referred to as gaps. Conservative substitutions are generally defined as amino acid 66 replacements that preserve the structure and functional properties of proteins. In order to distinguish homologous alignments from random alignments, a scoring scheme is needed, where a “substitution matrix” is often used to assign a score for aligning any possible pair of residues. Early sequence analysis used a unitary scoring matrix, where all matches and mismatches are scored or penalized the same. The unitary scoring matrix is equivalent to the simple edit distance (see Section 4.5). Although unitary scoring matrix is sometimes used for DNA and RNA comparisons, it is not appropriate for protein alignments since it ignores the mutation and structure relations between different amino acids. Years of research in protein sequence analysis has shown that matches and mismatches among different amino acid pairs require different scores, and various substitution matrices have been developed to reflect these different scores [67 , 35, 41]. Substitution matrices are generally presented as log-odds matrices, where each score in the matrix is the logarithm of an odds ratio [36]. The odds ratio is the ratio of the number of times residue ‘A’ is observed to replace residue ‘B’, divided by the number of times residue ‘A’ would be expected to replace residue ‘B’ randomly. A positive score in the matrix indicates a pair of residues that replace each other more often than expected by chance. This is evidence in favor of the aligned sequences being homologous. Meanwhile, negative scores in the matrix are evidence against the sequences being homologous. The process of computing log-odds scores is shown as follows. Assume a pair of sequences, a and 6, of lengths m and 17., respectively. Let a,- be the ith symbol in a, and bj be the jth symbol in B. These symbols come from some alphabet A (e.g., the twenty amino acids). Symbols from A are also denoted by low-case letters like a, b. For now, only ungapped (i.e., no gaps) global pairwise alignments are considered. Assume two sequences are drawn from a random match model R, where any 67 symbol a occurs independently with frequency fa. Hence the probability of matching two sequences is: P(a,fl|R) = I Ifa, I I fbj. (5.1) i J' In the alternative match model H, where the two sequences are homologous, aligned pairs of residues occur with a joint probability pab- The value Pab can be thought of as the probability that the residues a and b have each independently been derived from some unknown original residue, c (c might be the same as a and / or b). The probability for the alignment is: P(a,01H> = HM, (5.2) i The ratio of these two probabilities is the odds ratio: P(Q,IB[H) _ Hzpa’ibi :11 17(1th (53) P(011[3|R) — nifainifb, faifb, ° In order to arrive at an additive scoring system, the logarithm of this ratio, known as the log-odds ratio, is computed as: S = 2 8(02', bi), (54) i where Pail)- faifb. is the log likelihood ratio (i.e., score) of pair (a, b) occurring as an aligned pair, as 3(a,,b,)=10g( ) (5.5) opposed to an unaligned pair. In the 1970’s, Dayhoff pioneered this approach to derive the well known PAM (Point Accepted Mutations) family of substitution matrices [67]. In Dayhoff’s 68 method, all of the proteins are aligned in several families. Then, phylogenetic trees (a graphical means to depict the relationships of a group of organisms) are constructed for each family. Each phylogenetic tree is examined for the substitutions found on each branch. This leads to a table of the relative frequencies with which amino acids replace each other. This table is combined with the relative frequencies of each amino acids in the proteins studied to compute the PAM matrices. Each PAM matrix is associated with a number, which is the number of mutations per 100 amino acids in the sample protein data. For example, PAM30 assumes the occurrence of 30 point mutations per 100 amino acids (or 300 nucleotides) in the gene. An example of a substitution matrix, PAM30, is shown in Table 5.3. Using a substitution matrix, the total score assigned to an alignment is a sum of scores for each aligned pair of residues, plus scores for each gap. Such additive scoring schemes assume mutations occur independently at different positions in a sequence, which is a reasonable approximation for DNA and protein sequences [36]. Table 5.3: The PAM30 Substitution Matrix <4€HUJVWZ§7§FHEDWDOUZW> I 05 I (D I ‘1 I H I0 I p 01 I 01 I to I H O I O) I H ‘1 I on H I (D I q I (D I q I O) I N I I0 69 There are several recent attempts to construct scoring matrices based on observed amino acid substitutions. The BLOSUM (BLOcks SUbstitution Matrix) family of matrices are one of these newly developed log-odds scoring matrices [41]. Unlike PAM matrices, which are developed from global alignments, BLOSUM matrices are based on local multi-sequence alignments of more distantly related sequences. These ungapped alignments are obtained from protein families called BLOCKS database [68]. The first stage of building the BLOSUM matrix is to cluster (group) sequences that are identical in more than d% of their amino acids. This is done to avoid bias of the result in favor of a certain protein. The matrix built from blocks with more than d% of similarity is called BLOSUMd. For example, the matrix built using sequences with more than 62% similarity is called BLOSUM62. The second stage is to compute the probability of amino acids in each column of the multiple alignments. Finally, the log odd ratio is calculated and rounded to the nearest integer. In general, different substitution matrices are tailored to detect similarities among sequences that are diverged by differing degrees. It is shown that the BLOSUM62 matrix is among the best for detecting most weak protein similarities [41]. However, the BLOSUM series does not include any matrices suitable for short queries (e.g., < 50). Therefore, PAM matrices are recommended instead [69]. 5.3.1 Gap Penalties Sequence alignments involve gaps, which are caused by insertions or deletions of residues (see Table 5.1). Gaps are expected to be penalized. The cost associated with a gap of length l is given either by a linear score, g(l) = —lo, (5.6) 70 or an afi‘ine score, g(l) = —0 - (l — 1)e, (5.7) where o is called the gap-open penalty, and e is called the gap-extension penalty. The gap—open penalty is usually greater than the gap-extension penalty. This allows long insertions and deletions to be penalized less than they would be by the linear gap cost. For example, for queries shorter than 35 residues, the recommended gap-open and gap-extension penalties are -9 and -1, respectively [69]. 5.4 Alignment Algorithms Given a scoring system, an algorithm is needed to find an optimal alignment for a pair of sequences. Dynamic programming, which solves a problem by caching sub-problem solutions rather than recomputing them, is essential to sequence analysis. Dynamic programming algorithms are guaranteed to find the optimal scoring alignment(s). Since log-odds ratio is used in the scoring scheme, the optimal alignment has the highest score. The following sections introduce two most commonly used dynamic programming algorithms in sequence analysis: the Needleman-Wunsch algorithm[70] and the Smith-Waterman algorithm [71]. 5.4.1 Needleman—Wunsch Algorithm The Needleman-Wunsch algorithm performs a global alignment on two sequences. It is the first instance of dynamic programming being applied to biological sequence comparison. An improved version is introduced in [7 2]. The idea of the Needleman-Wunsch algorithm is to build up an optimal alignment using previous optimal alignment for shorter sub-sequences. Given two sequences: a = a1...am and ,6 = b1...bn, assume an (m +1) x (n + 1) table T, where T(i,j) (the cell at ith row and jth column of T) is the score of the best alignment between 71 the subsequence a1...a,- and b1...bj. T(i, 3') can be built recursively by first initializing T(O, 0) = 0, then proceeding to fill the table from top left to bottom right (see Figure 5.1). The score of T(i, j) is obtained as follows: r030) = —i0 T(OIJ') = -J'0 T(i — 1, j — 1) + s(a,-, 0,) (5-8) T(Li) = max T(i—1,j)—o , T(i,j—1)—o where s(a,-, 0]) is the substitution score of a,- and bj, o is the the gap—open penalty, and the linear gap score is used. The values in the top row, T(O, j), represent alignments of b1...bj to all gaps in 01. Likewise, values in the left-most column, T(i, 0), represent alignments of a1...a,- to all gaps in [3. The value in the table cell, T(m, n), is by definition the best score for aligning a and ,8. The Needleman—Wunsch algorithm takes 0(mn) time and 0(mn) memory. Since m and n are usually comparable, the algorithm is said to be 0(n2). _I._--_ I I I I I l I I. I I I I I I I I l I I ________ -_ -""1--"- I I I I I I Figure 5.1: Compute a Cell in a Dynamic Programming Table 72 5.4.2 Smith-Waterman Algorithm The Smith-Waterman algorithm is a well-known algorithm for performing local sequence alignment. It is similar to the Needleman-Wunsch algorithm, except a. few changes which enable the Smith-Waterman algorithm to find optimal local alignments. First, since a local alignment may start at any position, T(i, j) is allowed to take the value 0, if all other options have value less than 0. This corresponds to starting a new alignment. Hence, the score of T(i, j) is obtained as follows: T(i,0) = 0 T(OIJ') = 0 0 (5.9) Ti—1,'—1 +sa-,b- T(i,j) : max ( .7 1 ( i i) T(i_11j)—0 T(i,j — 1) — 0 Note that the top row, T(O, j), and the left-most column, T(i, 0), are filled with zeros, instead of —jo and —i0 in the Needleman—Wunsch algorithm. The second change is that an alignment can end anywhere in the table. Therefore, instead of taking the value in T(m, n) for the best score, the algorithm looks for the highest score of T (i, j ) over the whole table. In order for Smith—Waterman algorithm to work, the expected score for a random alignment must be negative. Otherwise, a long alignment between random unrelated sequences will have high scores due to their length, and the optimal local alignment would be likely to be masked by a longer but incorrect alignment. At the same time, there must be some s(a, b) greater than 0. Otherwise, the algorithm cannot find any alignment, since zero is always chosen as the maximum score at each cell. The scoring matrices discussed in Section 5.3 satisfy these two conditions. The Needleman-Wunsch and Smith-Waterman algorithms are guaranteed to 73 find the optimal global and local alignments, respectively. However, they require 0(mn) time and space. If a large number of sequences are searched, time rapidly becomes an issue. Therefore, there are many attempts to develop heuristic alignment algorithms with the aim of increasing speed under limited sacrifice of sensitivity (i.e., the optimal alignment may be missed). Two of the best-known heuristic algorithms are BLAST [3, 42] and FASTA [73, 74]. 5.4.3 BLAST The BLAST (Basic Local Alignment Search Tool) is designed for finding high scoring local alignments between a query sequence and a target database. The basic idea of BLAST is that true alignments are very likely to contain a short segment of identities (as in DNA sequences), or very high scoring matches (as in protein sequences). These short segments can be used as “seeds” to find a good longer alignment. By keeping the seed segments short, it is possible to make a hash table for all possible seeds from the query sequence and use the hash table to find the matching segments in the target database. Query: GSVEDTTGSQSLAALLNKCKT-RLVNQWIKQPLMDKNRIEERLNLVE The hash table EEG i: containning ! PR6 l 4 sub—queries (seeds) PKG 1 4 PNG 13 PDG 13 13 score threshold PSG 1 3 T = 13 PQA 12 PQN 12 extension . . . extension ‘ t > Query: 3 2 5 SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRI EERLNLVEA 3 6 5 +LA++L+ TP G R++ +W+ P+ D + ER + A Sbkt 290 TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA 330 Figure 5.2: The BLAST Search Algorithm 74 In Figure 5.2, assuming protein sequences, the BLAST algorithm looks for words (seeds) of length W (default = 3 ) that score at least T (default = 13) using a substitution matrix. Words in the database that score T or greater (called High Scoring Pair or HSP) are extended in both directions using an algorithm similar to the Smith-Waterman algorithm. The extension stops when the score of the extended alignment falls below a threshold. The extended alignment that meets certain criteria (see Section 5.5) will then be reported by BLAST. 5.4.4 FASTA As shown in Figure 5.3, FASTA uses four steps to find local high scoring alignments: (1) Identify all exact matches of length I: (k-tuples) or greater between the two sequences a and 6. Speed is achieved by employing a hash table. For example, for proteins, if k = 3, there are 8000 (203) possible k-tuples. Each element of an array, A, of length 8000 is set to represent one of these k-tuples. Sequence a is scanned once and the location of each k—tuple in a is recorded in the corresponding element of A. Sequence 6 is then scanned. By referring to the locations of all k-tuples in (1, matches that are common to a and 3 are identified. (2) If two k-tuples are present on the same diagonal, then the difference between their starting position (offset) is also the same. The diagonals with the most significant number of matches are identified. The best diagonals are extended to find maximal scoring ungapped regions. (3) If there are several initial regions above a user-defined score, then those that could form a longer alignment are joined, allowing for gaps and a score a: is calculated with a penalty for each gap. These candidate alignments are 75 ranked by :r. (4) The highest scoring candidate alignments are realigned using dynamic programming over a narrow band of the high scoring diagonal to produce an alignment with the final score. . .\ \ s \ \ \ \ \‘~ ‘ I..\\‘\~ \\\ ~ \ %\ ~ \\ I‘H\ \ (a) Step 1: Find Identical Regions (b) Step 2: Locate and Extend Di- agonals —— Sequence a——-> —-— Sequence a——> \ «— g aouanbas —. / <___ 9 aauanbas (c) Step 3: Join Regions ((1) Step 4: Apply Dynamic Pro- gramming Figure 5.3: The FASTA Algorithm 76 5.5 Evaluating Alignments Once an Optimal alignment is found, the next task is to determine if the alignment is a biologically meaningful alignment (i.e., constitutes evidence for homology), or just the best alignment between two entirely unrelated sequences. The most common method is based on the statistical approach of calculating the chance of a match score greater than a randomly observed value (i.e., the sequences are unrelated). Statistics for the scores of local alignments, which are the focus of this dissertation, are well understood [3]. It is known that the asymptotic distribution of the maximum, MN, of a series of N independent random variables has the form P(MN g 2:) 2 exp(—KNe)‘($_“)) (5.10) for some constants K, /\ [36, 75]. This form of distribution is called the extreme value distribution or EVD. For local ungapped alignment, the appropriate EVD is derived analytically in [3]. Given sufficiently large sequence lengths m (query) and n (database), the expected number of random matches with score at least S is given by the formula 133(5) = KmneAS, (5.11) where )1 is the positive root of Z fafbe*s(“’b) = 1, (5.12) a,b and K is a constant which is dependent only on fa and s(a, b). E (S) is called the 77 E—value for the score S. The probability of a. match having score greater than S is P(a > S) = 1 — e—E(S), (5.13) where P(a > S) is called P-value. In practice, E—value is usually used as the measurement to determine the significance of an alignment. The E—value is approximately how many alignments having the given score or higher would expect to be found in the given database by chance. Therefore, the smaller the E-value, the more significant (i.e., more likely to be homologous) is the alignment. The statistics discussed above have a solid theoretical foundation only for ungapped local alignments. However, many computational experiments and some analytic results strongly suggest that the same theory applies as well to gapped alignments [76, 42, 77]. For ungapped alignments, the statistical parameters can be calculated from the substitution scores and the background residue frequencies of the sequences being compared. For gapped alignments, these parameters must be estimated from a large-scale comparison of random sequences. In [76], the values of A and K are provided for a range of standard protein alignment scoring schemes, using a large amount of randomly generated sample data. 78 Chapter 6: Indexing and Searching Genomic Sequence Databases As shown in Chapter 5, genomic sequence databases have seen rapid growth in recent years. There are three major sequence databases: the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and the GenBank. All three databases participate in the International Nucleotide Sequence Database Collaboration (INSDC). They increase in size and exchange data on a daily basis. 6.1 Overview of GenBank GenBank [13] is a genetic sequence database hosted by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH). It is an annotated collection of publicly available DNA and protein sequences. GenBank contains millions of sequences, submitted from individual laboratories or large-scale sequencing projects. As shown in Figure 1.2 (see Chapter 1), GenBank grows at an exponential rate. In a recent press release (August 22, 2005), the INSDC announced that the DNA sequence database had exceeded 100 gigabases (i.e., approximately 100GB) [78]. In this dissertation, the entire GenBank non-redundant protein sequence database (as of April, 2004) is chosen as the testbed for the HD-tree. This database contains approximately 1.9 million sequences, and 661 million residues. 6.2 Existing Techniques To support efficient searches in genomic sequence databases, many algorithms (systems) have been developed. Based on how the search is conducted, these 79 systems can be divided into two categories: linear-scan—based and index-based. Linear—scan—based systems include FASTA and BLAST, which are described in Chapter 5. Index-based systems, such as BLAT [79] and CAFE [80, 81], perform a query using a pre—built index of the database. Although linear-scan-based systems are faster than index-based systems for smaller databases, as the size of the genomic sequence databases continually increases, index-based systems are more and more appealing. Searching homologous regions in a genomic sequence database is usually conducted in two stages: the filtering stage and the alignment stage. The filtering stage detects candidate regions which are likely to be homologous. The alignment stage then examines these regions in detail and reports the regions which are indeed homologous according to some criteria (e.g., the E-value introduced in Section 5.5). Due to the unstructured nature of genomic sequences, words (i.e., sub-strings) of length L (also called q-grams) is often used for indexing and searching in the filtering stage [42, 82, 79, 74]. Words can be either overlapping (i.e., a fixed window size of L shifting from the beginning to the end of a sequence by one letter at a time) or non-overlapping (see Figure 6.1). Hits (positions of the words in the genomic sequence) are located by matching query words with database words. Each hit may produce a candidate region. Dynamic programming is used in the alignment stage to find the true homologous regions from the candidate regions [71, 83]. Sequence: ADEGBCDEA Overlapping words: ADE, DEG, EGB, GBC, BCD, CDE, DEA Non—overlapping words: ADE, GBC, DEA word length = 3 Figure 6.1: Overlapping versus N on-overlapping Words Among index-based methods, BLAT builds an index in the RAM using non—overlapping words (length of 4 or 5 for protein sequences). BLAT is shown to 80 be faster than BLAST; however, it is less sensitive. CAFE uses inverted files [7] to index overlapping words (length of 3 for protein sequences). Candidate regions are generated by searching these overlapping words. Heuristics, such as FRAMES [80], are used to reduce the number of candidate regions passed to the alignment stage. Suffix trees (see Section 2.4) have been used in genomic sequence searches for small databases [2, 9, 84]. Suffix trees can be built in the RAM within 0(n) space and 0(n) time [57]. However, constructing an suffix tree larger than the available RAM is a challenging task [85, 9, 84]. Therefore, algorithms using suffix trees have not been successfully applied to large genomic sequence databases. In [84], a new way of creating suffix trees inexcess of available RAM size for genomic sequences is proposed. Multiple passes over sequences are adopted and each pass processes a sub-range of suffixes. The suffix tree is used to search genomic sequences. However, the search dose not return the high scoring alignments, but the positions of the potential matches. It is concluded that using the suffix tree for genomic sequence searches can significantly reduce the computation compared to the standard dynamic programming methods [84]. In [86], another index structure, suffix sequoia, is proposed to index protein sequences. Suffix sequoia indexes overlapping words (length of 5 is used for 471M residues). Positions of these overlapping words are stored on disks. A data structure, called bitmap, combined with ofi’set files are used to locate the position list of an overlapping word stored on disks (position files). The index size of suffix sequoia is relatively small, which is just over 4 bytes per residue of sequences. This is because only the positions of the overlapping words are stored. Sequence names are indexed, so that given a position, the corresponding sequence name is returned. The search of suffix sequoia returns the sequence names which contains a overlapping word that when aligned with an overlapping query word, may achieve a score above certain threshold. Therefore, the search results of suffix sequoia are not 81 high scoring alignments, such as those returned by BLAST. The HD-tree is a better approach than these existing index-based methods. First, the HD-tree uses longer overlapping words, which eliminate the filtering stage and reduce the search time for short queries. Second, the HD-tree is guaranteed to find the optimal alignments according to user-defined search criteria, since the HD-tree searches each overlapping word. Finally, the HD-tree is also applicable to more complex sequence searches, such as the profile hidden Markov model (see Chapter 7), which none of these index-based methods have attempted to do. 6.3 Creating the HD-tree Using the Sort-Merge Method The success of the HD-tree in prefix searches (see Chapter 4) encourages the application of this structure in approximate string matching, especially in the area of genomic sequence analysis. Since the matching of a sequence may start at any position, overlapping words of length L are used to index genomic sequence databases [87, 82, 60]. This increases the size of string data by at least L times. The standard approach (Brute-Force) of creating an HD-tree is to insert one string at a time. Therefore, at least one disk access may be required. Due to the largeness of the GenBank protein sequence database, using the Brute-Force approach to create an index may result in an overwhelming number of disk accesses and an unreasonable large amount of time. For example, in one experiment, it takes approximately 30 minutes to index 375MB of overlapping words generated from protein sequences containing 15 million residues. In order to index the entire GenBank protein sequence database (661 million residues) within a reasonable time (e.g., several hours), heuristics must be developed to speed up the index process. The HD-tree is an ordered tree; therefore, if the inserted strings are in sorted 82 order, the tree can be built quickly from left to right. Based on this observation, the Sort-Merge method is developed to build the HD-tree for large string databases in the following steps: 1. The string database is divided into segments (e.g., 50MB) depending on the available RAM size. Each segment is loaded into the RAM, and an array of suffix pointers pointing to each position is created in the RAM. 2. An internal sorting process (e.g., quicksort [51]) is conducted so that these suffix pointers are sorted in the lexicographic order of the suffix strings they are pointing to. Each suffix string is associated with a 4-byte integer that is the position of this suffix string in the sequence database. 3. These sorted suffix strings are written into temporary files on disks. Only the first L characters of these suffix strings are reserved. For the suffix strings having the same first L characters, their positions are linked into a position list. 4. After all segments are processed, the temporary files containing the sorted strings are merged into one file using an external merge algorithm [51]. The the final file contains the sorted suffix strings for the entire database. 5. Create partial HD—trees (i.e, sub-trees) in the RAM from the sorted suffix strings. These partial HD-trees are written to disks when they exceed a user-defined threshold (e.g., 32MB). 6. Merge these partial HD-trees into the final HD-tree. The process of creating an HD-tree index from the GenBank protein sequence database using the Sort-Merge method is shown in Figure 6.2. Since the inserted strings are in sorted order, the standard splitting algorithm (see Algorithm 3 in Section 3.3.4) is modified to increase the storage utilization. 83 Segments Sorted Strings Merged Partial The Final Tree Strings on Disks Strings Trees \ . 1 \a e :7 $2 [:1 :e e l\ Partial Trees I Partial Trees Split Internal Wn’te External QuickSort To Disks MergeSort GenBank Protein Database ——-—/ 0 Figure 6.2: The Sort-Merge method of creating the HD-tree from the GenBank pro- tein sequence database The basic idea is to fill up a node as much as possible. If the current leaf node overflows, only the right-most unit is moved to the newly-created leaf node (see Figure 6.3b). Note that in the standard HD-tree splitting algorithm, units in the overflowing leaf node are evenly distributed as much as possible between the overflowing leaf node and the newly-created leaf node (see Figure 6.3a). Using the Sort-Merge method, the construction time of the HD—tree is greatly reduced, and storage utilization is also improved. Figure 6.4 compares the construction time between the Brute-Force method and the Sort-Merge method. It is shown that the BruteForce construction time is increased significantly when the number of indexed words reaches 15 million. This is because the size of the HD-tree containing 15 million words (approximately 375MB) exceeds the available RAM size in this example. When the HD-tree is smaller than the available RAM, leaf nodes are cached in the RAM by the operating system, therefore the actual number of disk accesses is relatively small. However, once the RAM cannot hold the entire 84 ' ' . Split WW W :> GHQ (E3199 (“0%) c:20% f:10% node 1 node a1 node a2 (a) Standard Splitting (Es-13m 612.7%) §> @2592) [111091362095] QISQQMQ (“0%) (0:15th Ce:3l%) “0% node 1 node b1 node b2 (b) Splitting in the Son-Merge Method Figure 6.3: Split heuristic in the Sort-Merge method 2000 1 , T , . , r H Brute-Force “ a A—A Sort-Merge '2 1500 — — o O O 33, . 6) .§ FIHDO— _ I': .9 3 . 5 g 500" A L) 00 20 5 10 15 Number of Indexed Words (million) Figure 6.4: Improvement in Construction Time using the Sort-Merge Method. Over- lapping words of length 20 are used. 85 tree, some leaves has to be stored on disks. Hence, each subsequent insertion may require at least one disk access (i.e., the total number of disk access may be proportional to the number of overlapping words), which increases the construction time significantly. On the other hand, using the Sort-Merge method, the number of disk accesses is roughly proportional to the sizes of the overlapping words (the sorting process) and the HD-tree (the merging process). The Sort-Merge method is almost 10 times faster than the Brute-Force method for large string databases such as the GenBank protein sequence database. As shown in Figure 6.5, storage utilization of the Sort-Merge method is consistently over 85%, which is improved significantly over the Brute-Force method. This improvement is due to the splitting heuristic which fills up a leaf node as much as possible. 100 I I ' I ' I ' 95 _ x—x Brute-Force _ P A—A Sort-Merge 1 90 _ ‘/‘/‘-\_A T 85- - 80- - 75~ - Storage Utilization (%) 70— A fi‘ 65- 4 60 1 l 1 l 1 I 5 10 15 20 Number of Indexed Words (million) Figure 6.5: Improvement in Storage Utilization using the Sort—Merge Method. Over- lapping words of length 20 are used. Typically, some strings in a leaf node share a common prefix. Therefore, another heuristic is to use difierence-encoding [55] to store strings in leaf nodes. Difference—encoding stores strings in lexicographic order. It uses a one-byte integer 86 to indicate the number of prefix characters that are repeated from the previous string (i.e., the common prefix). Figure 6.6 shows an example of this encoding scheme. The difference-encoding can reduce not only the storage requirement of leaf nodes, but also the number of characters to be compared when searching strings stored in a leaf node sequentially, since the length of the common prefix is stored. ACHILECPED 0 ACHILECPED ACHILECPHR Difference—encoding 8 HR ACHILECPHT \ 9 r ACHILECPRG /> 8 RG ACHILECPRR 9 R ACHILERLQE 7 RLQE Storage: 60 bytes Storage: 26 bytes Figure 6.6: An example of the Difference-encoding 6.3.1 Discussion The Sort-Merge method of the HD-tree uses similar construction strategy (i.e., build partial trees) as that in [84] (Hunt’s method) to build large suffix index. However, the two methods have the following differences: First, the Hunt’s method is to create a large persistent suffix tree exceeds the available RAM size. The suffix tree is stored on disks as a disk-image of the RAM. Therefore, it is not considered as a disk-based structure. The Sort-Merge method is to create an HD—tree index for overlapping words (i.e., prefixes of length L for all suffix keys). The HD-tree is a hybrid RAM / disk-based data structure which uses disk blocks as leaf nodes. Second, the Hunt’s method uses multiple passes to create partial suffix trees, while the Sort-Merge method divides the database sequences into smaller segments to create partial HD-trees. Since the implementation of the Hunt’s method is not available, the experimental results provided in the paper are shown in Table 6.1. Although the 87 machine and the database are not the same, the construction time of the Hunt’s method seems to be longer than that of the HD-tree, and the index size seems to be larger. This is partly because the Hunt’s method creates the complete suffix tree, which is space and time consuming. However, as the overlapping word length increases, the construction time and space requirement of the Sort-Merge method may exceed that of the Hunt’s method. Table 6.1: Hunt’s Suffix Tree versus the HD-tree Database Hardware RAM Space Time Hunt’s Suffix Tree 268M DNA Enterprise 450 SUN 2GB 19GB 13.5h HD-tree, suffixes 661M Protein Pentium 1.8G PC 516MB 10.6GB 3.41h of length 20 The HD-tree contains RAM-index and disk-index. The RAM-index of the HD-tree is the top levels of the suffix tree in the Hunt’s method. The disk-index of the HD-tree contains the suffixes of the overlapping words. These suflixes enable the HD-tree to quickly find answers for short queries (i.e., less than the length of overlapping words). Using Hunt’s method, however, if a query cannot be answered in the suflix tree (which happens often), in order to find an answer (i.e., a high scoring alignment), the search must follow the pointer stored in leaf nodes to the original database sequences. This operation is very costly if the sequence database is not reside in the RAM, since at least one random disk access is require for a possible match. This is likely to be the reason why the Hunt’s method does not return high scoring alignments, but only the positions which may be a potential match. 6.4 HD-tree Search Algorithm Using suffix trees and suflix arrays to perform approximate string matching has been studied for years [88, 57, 2, 85]. Most of these researches are focused on simple edit distance (also called k-difference problem), where a constant cost (e.g., unitary 88 cost) is used for insertion, deletion, and substitution. Since genomic sequence searches use substitution matrices and affine gap cost model (see Section 5.3), algorithms based on simple edit distance cannot be directly applied. The HD-tree indexes overlapping words of length L. Therefore, it can be viewed as a suffix tree with maximum hight of L. The algorithm of sequence search in the HD-tree is an extension of the dynamic programming algorithm applied to a suffix tree for approximate string matching based on the edit distance [2, 57, 84]. Substitution matrices and affine gap cost model are integrated in the algorithm. Imagine the need to obtain the best matching score, score(a, 6), between sequences 01 = a1...am and B = bl...bn. A table of (m + 1) x (n + 1) cells is created. Cell CM (i.e., ith row and jth column) contains the value CM, which is the maximum achievable score by matching a1...a,- and bl...bj. The following equations define the calculation of cm- (i.e., the dynamic programming): 0 = the gap-open penalty (negative value) e = the gap-extension penalty (negative value) ma,b = the substitution score for symbol a and b o if from cell 0 - to cell C -/ -/ opens a gap gap(C,-,j, 024,34) = m i ..7 e if from cell Cm- to cell 01" j’ extends a gap C7510 = 0+e*i, 0 O -inf -inf -inf —inf -10 O c: m H 1< m a) g _12 i, j i, j+1 :3 o m [ —14 ' ' ' 1+1, j i+1, j+1 -16 . . . . . . . . . m, n Gap—open penalty: ~10 Gap-extension penalty: -2 Figure 6.7: An Example of a Dynamic Programming Table Used in the HD-tree. Recall that in the HD—tree, each tree node at level :1: corresponds to an id-string, a1...ax, which is the concatenation of the labels along the path from the root to the tree node (see Section 3.1). Since HD-tree indexes overlapping words, every potential match can be found by traversing the HD-tree from root to leaves. Therefore, query answers can be found by starting at the root and following every branch, until a match is found or the id-string of the current tree node cannot be the prefix of a possible match (i.e., any string under this sub-tree is not a answer, 90 and the sub-tree is abandoned for the search). Given a query string, q = q1...qk, and a minimum matching score, S, an algorithm is designed to determine the matching score between q and string (1’, using the dynamic programming described in Equation 6.1. In order to traverse the HD-tree and abandon a sub-tree that does not contain query answers, the algorithm must be able to meet the following criteria: (a) considering the string a' incrementally, (b) determining when score(q, a’) Z S, and (c) determining when score(q, a'fi’) < S for any string, 5’ (see Section 1.2 for string notation). Assume a table of (k + 1) x (h + 1) is created, where k is the query length, and h is the height of the HD-tree. Starting from the root, the algorithm descends recursively by every branch of the HD-tree. When descending by a branch labeled by letter a, the algorithm appends a to current string, a’, and compute a table column corresponding to a (see Figure 6.8). There are three possibilities for a given a': (1) If the score(q, a’) Z S, all the leaves of the current sub-tree are reported as answers. (2) If score(q, a’fl') < S for any string, fi', the sub-tree corresponding to a’ is abandoned immediately. (3) Otherwise, the algorithm continues recursively descending through the HD-tree. In order to fulfill criterion (c), the heuristic to determine whether score(q, o/fi’) < S for any string, ,8’, is as follows: assume the cell Ci 'l contains ,la the maximum score in column |a'| (i.e., cells CO I all to Ck I all), then the maximum achievable score equals to score(q, a' (Ii+1---(1k) (i.e., the rest of matching is an exact match). If score(q,a’q,'+1...qk) < S, then for any string, ,8’, score(q, a’fl’) < S, 91 where I score(q, a'qi+1...qk) = score(q1...q,-, a ) + score(q,;+1...qk, Qi+1---Qk)- (6.2) The HD—tree sequence search algorithm is shown in Algorithm 6. Array msc[1...k] is precomputed, where msc[z'] = score(q,-+1...qk, Qi+1---‘Ik)- String (1 = a1...az is the id—string of the HD-tree node, N, at level 9:. CM is the value in the 2th row and jth column of the table C. In the algorithm, the computation of line 1 is based on the dynamic programming described in Equation 6.1. Lines 2 to 3 correspond to criterion (b), and lines 5 to 8 correspond to criterion (c). The algorithm is recursive and the number of table columns is at most the height of the HD-tree. ALGORITHM 6 HD-SeqSearch(Table C, HDtreeNode N, String (1 = a1...a$) Input: (1) a query string q = (ll-"(1k (2) the minimum score S Output: matched strings Method: 1. Compute column a: of the table C using 0.3;, the resulting table is C"; 2. if Ck,.’l: Z S then 3 Report all the leaves below the tree node N; 4. else 5 Cm = ma$(60,x---Ck,z); 6. if (cm, + msc[z’] < S) then 7 return; 8 end if 9. else 10. for each child N ' labeled b 11. if(N’ is a leaf node); 92 12. search strings in N ’; 13. else 14. HD-SeqSearch(C’, N’, ab); 15. end if 16. end for 17. end if An example of traversing the HD-tree and computing the dynamic programming table for a sequence search is shown in Figure 6.8. Compared with the standard dynamic programming method (which computes 17. columns), the HD-tree significantly reduces the computation. For example, in Figure 6.8, only 10 columns are computed (this does include the columns computed within the shaded leaf node). This is because: (1) since the HD-tree indexes overlapping words, computing one table column at a tree node is equivalent to computing multiple table columns in a standard dynamic programming; (2) a sub-tree may be abandoned if it is impossible to satisfy the search criterion in the sub—tree (lines 6 and 7 in Algorithm 6). 6.5 Comparisons with BLAST BLAST [3, 42] is the most commonly used tool in bio-informatics. It is used by GenBank to provide genomic sequence searches. Although there are more sensitive algorithms (such as FASTA [73, 74]), BLAST is much faster and, in practice, has proved sensitive enough to detect the moderate sequence similarities that imply homology. In our experiments, the local BLAST (version 2.2.6 for Linux, downloaded from the NCBI web site) is used for searching the GenBank protein sequence database with the following parameters: word size is 3, scoring matrix is PAM30, E—value is 100, gap-open penalty is —9, and gap extension penalty is -1 (see Section 5.3 for the meaning of these parameters). These parameters are 93 Substitution Matrix: a b c d e a 10 9 -5 -5 -5 b 15 -5 l -5 c 10 -5 -5 d 10 -5 \\{2 e 10 / ' Gapopen:—10 8c \ . - 7/ I Gap extensron: -2 c d Query: bbee Closeness: 80% Maximum Score=15 +15 + 10+ 10: 50 Minimum Score: 50 * 80% =40 Array msc = {35, 20, 10, O} 1 2 5 6 7 ——> ——> —> -—> —-—> ‘i‘ a a b b b b b b °\"1“f °\""f "'1‘ 0 -inf 0 -inf -inf OX-inf -inf -inf b ‘10 9 b “‘10 9‘4 b -10 15 b -10 15 5 b -10 15 5 3 b "2 ‘1 b ‘12 ‘1 ’8 b -12 5 b -12 5‘30 b —12 5‘30 20 e ‘14 ‘3 e ‘14 ‘3 3 e -14 3 e -14 3 20 e —14 3 20‘25 e “6 ‘5 e ‘16 ‘5 6 e -16 1 e —16 1 18 e -16 1 18 15 9+35>40 18+20<40 15+35>40 30+20>40 25+10<40 continue return continue continue return 9 ll 15 17 19 ——'> —> —> —> —> b b c b b e 9 f1 1‘; 0\-inf —inf —inf 0 —inf -inf -inf 0 (“If 0 {Inf “(“1" b -1015 5 3 b —1015\5 3 b ‘10 ‘5 b '10 1 b '10 '5 b -12 5‘30 20 b -12 5 30 20 b '12 ‘15 b '12 '11 b ‘12 ‘15 e _14 3 20\25 e _14 3 20\4O e —l4 -17 e -l4 -13 e -l4 -17 e -16 1 18 15 c -16 1 18 30 c '16 '19 c “16 “'15 e ‘16 ‘19 25+10<40 40+10>40 -5+35<40 1+35<40 —5+35<4O return search leaf (shaded) return return return Figure 6.8: An example of a genomic sequence search using the HD—tree. Arrow shows the traversing path. Each downward arrow computes one column (the right-most) of the table, and upward arrows do not compute matrix. “continue”, “return”, and “search leaf” correspond to lines 14, 7, and 12 in Algorithm 6, respectively. 94 recommended by BLAST for searching queries shorter than 35 residues [69]. Although sequence analysis involves long queries, (e.g., hundreds or thousands of residues), this dissertation focuses on short queries, such as some insulin protein sequences. Short queries are also useful in primer and probe design packages, where the intent is not to find related genes or gene segments, but to find regions of sequence that might cause cross-priming or cross-hybridization. Therefore, detecting even relatively short homologous regions is useful. Another application of short queries is to find motifs (i.e., recurring patterns) in DNA and protein sequences [89, 90] and use motifs to find active regions of proteins [91]. Motifs are usually short, Figure 6.9 shows the distribution of motif lengths using a motif extraction algorithm, MEX, from 7000 enzyme sequences [91]. It is realized that the index methods in [84, 86] are related to the HD-tree. However, they do not provide valid query results (see Section 6.2). Therefore, the real sequence search tool, BLAST, is chosen as the comparison target. The performance of the HD-tree is compared with that of BLAST in the context of the quality of the query result and the query processing time. For fair comparison of query time, all experiments are conducted on a Linux PC with 512MB RAM and 1.8GHz Pentium 4 processor. In the experiments, the entire query is aligned (i.e., global on query) to each possible position of database sequences (i.e., local on database sequences). Queries of length 10 (2 sequences), 14 (8 sequences), and 18 (8 sequences) are selected from the GenBank protein sequence database. Each query has the word ‘insulin” in their annotation. It is noted that these selected queries may not be insulin sequence, but may be insulin-related or insulin-like protein sequences. Only these real queries are used (synthetic queries are used in Chapter 7), so that query quality can be measured as follows. Since the ideal answer set is unknown, the quality of the query result is measured by the number of returning sequences containing the word “insulin” in their annotations. 95 nnn I l v I l v r :53: entire set of3165 motifs — unique motifs — level 2 - unique rnotrls — level 3 Number of motifs * 7 8 9 10 Length of motif Figure 6.9: Motif length distribution using MEX from 7000 enzyme sequences 96 This approximation is based on a reasonable assumption that sequences having “insulin” in their annotations would most likely to be homologous. 6.5.1 Closeness of HD-tree Queries The HD-tree uses closeness as a query parameter. The closeness is a percentage value to measure the similarity between two sequences. The usage of the closeness is as follows. Given a query sequence and a scoring matrix, an exact match achieves the maximum score. The maximum score multiplied by the closeness is the minimum score that each query answer must achieve, so that these query answers will satisfy the similarity defined by the closeness. For example, for a given query “ADEGBC”, using the PAM30 substitution matrix (see Section 5.3), the maximum score achieved by an exact match is 50 (i.e., score( “ADEGBC” , “ADEGBC”) = 50). If the closeness is 80%, only those alignments that achieve the score of 40 (50 :1: 80% = 40) and above are considered as the query answers. 6.5.2 Index Size and Construction Time Since the database is given, the index size and construction time of the HD-tree is affected by the length of the overlapping words. Table 6.2 presents the index sizes and construction times of the HD-tree for different lengths of overlapping words. It is shown that as word length increases, the construction time and the index size increases accordingly. For the given GenBank protein sequence database, BLAST takes 26 minutes to process the raw data. It creates 7 files and the total size is approximately 1.45GB. The processing time of BLAST is not to build index, but to re—organize the raw data so that BLAST can search the database more efficiently. On the other hand, for the same database using overlapping word length of 20, it takes the HD—tree approximately 2.43 hours to sort the overlapping words, and one hour to create the 97 Table 6.2: Index Size, construction time and storage utilization of the HD-tree for the GenBank protein sequence database Word RAM Disk SortTime MergeTime TotalTime ASU* Length (MB) (GB) (hour) (hour) (hour) (%) 12 10.29 6.60 1.73 0.62 2.35 87.83 15 12.54 8.02 2.00 0.73 2.73 87.40 18 14.85 9.47 2.21 0.88 3.10 87.03 20 16.42 10.46 2.43 0.98 3.41 86.79 ASU: average storage utilization tree from these sorted overlapping words. Although the construction time of the HD-tree is about three hours longer than that of BLAST, if a large amount of queries are conducted after the index is built, the amortized construction time is not significant. The length of overlapping words affects the query performance of the HD-tree. If the query length is close to or longer than that of a overlapping word, the HD-tree may not be able to determine if the word is the query answer or not, since appending more residues at the end of the word may achieve higher score. In this situation, the HD-tree must follow the position information associated with the word to continue the search in the original database sequences. This operation is expensive and not recommended in the HD-tree (see Section 8.2 for the future work related this issue). Therefore, the HD-tree requires the query length to be shorter than the length of overlapping words. Figure 6.10 shows the level distribution of the HD-tree in the RAM for different overlapping word lengths. It is shown that increasing the overlapping word length leads to an increased total HD-tree levels (i.e., the area under a curve) in the RAM. This in turn increases the possibility to reduce the number of disk accesses, since more sub—trees may be abandoned during the search. However, as shown in Table 6.2, increasing the overlapping word length results in larger index size and longer construction time. Therefore, it is recommended to choose the longest overlapping 98 word according to available resources. In the following experiments, overlapping words of length 20 are used. l.56+06 _ v 1 I l ' l ' l ’ l ' I ' T ' l . C 1 ~ X-X word=12 . Z A-A word=15 : ~ H word=18 - w : V-V word=20 j “>9 le+06 :' j 3" - - ,H I I o . _ H - .. .8 : I g _ - z 5e+05 - - 0l HD-tree Level Figure 6.10: Distribution of HD-tree Levels in the RAM. Levels greater than 10 are not shown. 6.5.3 Quality of Query Results In Table 6.3, the query result of the HD—tree is compared with that of BLAST. The column “Common” indicates the total number of insufin sequences found by both the HD-tree and the BLAST. The column “HD-only” and “BLAST-only” indicate the total number of insulin sequences found only by the HD—tree and BLAST, respectively. As shown in the table, the quality of the HD-tree query results increases as the corresponding closeness decreases. This is because the minimum score required for a matching decreases as closeness decreases. Therefore, alignments with lower matching score (i.e., distantly related) are returned as query answers. When the closeness is low enough, the HD-tree not only finds all the insulin sequences found by BLAST, but also finds some insulin sequences that 99 BLAST cannot find. For example, for query length of 14, at the closeness level of 40%, the HD-tree returns 119 extra insulin sequences that BLAST cannot find. Table 6.3: Query quality comparison QueryLen Closeness (%) Common HD-only BLAST-only 10 90 8 0 O 10 80 8 0 0 10 70 8 0 O 10 60 8 1 0 10 50 8 1 0 10 40 8 1 0 10 3O 8 72 O 14 90 28 4 197 14 80 128 4 97 14 70 179 12 46 14 60 209 22 16 14 50 219 29 6 14 40 225 119 0 14 30 225 206 0 14 20 225 280 0 18 90 17 2 16 18 80 25 3 8 18 70 26 4 7 18 60 33 4 0 18 50 33 67 0 18 40 33 69 0 18 30 33 1 12 O 18 20 33 131 O In Figure 6.11, the number of insulin sequences among the top N results is compared between the HD-tree and BLAST. These query results are sorted by their matching scores. It is shown that the HD—tree consistently outperforms the BLAST. For example, among the top 1000 results, the HD—tree returns 35 more insulin sequences than that of the BLAST. Again, this shows that the HD-tree achieves better query quality than BLAST. 100 8 l I x—X HD-tree A—A BLAST Number of Insulin N 8 I 8' l 0“1 . 1 . . 1 1 1 . 7 l l O 200 400 600 800 1000 Number of Top Results Figure 6.11: Number of insulin sequences among top N results. Query length is 14. 6.6 Query Time Table 6.4: BLAST query time Query Length Query Time (seconds) 10 37 14 46.25 18 45.75 As shown in Table 6.4, the query time of BLAST is relatively stable. This is because each query has to scan through the entire database. The query time of the HD-tree increases as the closeness increases. However, as shown in Figure 6.12, the HD-tree is consistently faster than BLAST. Even at the closeness level of 40%, where the quality of the HD-tree surpasses that of the BLAST, the HD-tree is still four times faster than BLAST. The query performace of the HD-tree is affected by query length. According to HD-tree search algorithm (Algorithm 6) described in Section 6.4, for a given closeness, as the query length increases, the value of msc[z'] increases. Consequently, the possibility to abandon a sub-tree decreases, and more 101 l T I I l l r l 14 ~ -— b c x—x Length=10 ' ‘2 7 ° MLength=14 ‘ ’5? ’ GOLength=I8 " 'O 1: 10 — — 8 _ e . Q) 3 o 8 — \‘ q .§ ~ E... 6 - .. E: 0 - . 5 O 4 l— .— 2 — \ _ l I l l l l l I l 1 l v l 030 40 50 6O 70 80 '1 100 Closeness (%) Figure 6.12: HD-tree query time leaf nodes are accessed, which increases the query time. According to the experiment results shown above, it can be concluded that, for short queries (e.g., query length < 20) using appropriate closeness, the HD-tree outperforms BLAST not only in speed, but also in quality of query results. The potential of the HD-tree is not limited to this achievement. In the next chapter, the HD-tree is applied to more complex sequence search tasks, where each position of a query sequence uses different scoring criteria. 102 Chapter 7: Sequence Search Using the Profile Hidden Markov Model The Profile Hidden Markov Model (PHMM) has received increasing attention in the field of protein homology detection, since profile-based methods are much more sensitive in detecting distant homologous relationships than pairwise methods [92, 93, 94, 95]. Because of the computational complexity involved in PHMM searches, heuristic alignment algorithms, such as BLAST and FASTA, have not been successfully applied in this area. Pure dynamic-programming—based systems are often used for PHMM searches. However, these dynamic-programming—based systems are very time consuming. For instance, it may take approximately 15 minutes to search a short model of length 12 in the GenBank protein sequence database. The HD-tree is able to reduce the PHMM search time significantly without reducing the quality of search results. In this chapter, sequence analysis using PHMM is introduced. Algorithms for searching the HD-tree using PHMM are proposed. Finally, the HD—tree is compared with HMMER [96], a popular implementation of PHMM for protein sequence analysis. It is shown that the HD-tree is orders of magnitude faster than HMMER for short queries. 7.1 Profile Analysis Profile analysis has long been a useful tool in finding and aligning distantly related sequences [97]. A profile is a description of the consensus (e.g., probability of a residue) of a multi-sequence alignment (see Section 5.1) from a group or “family” of homologous sequences. It uses a position-specific scoring system to capture the information of conservation at various positions in a multi-sequence alignment. This 103 makes it a much more sensitive method for searching genomic sequences than pairwise methods (e.g., BLAST or FASTA) that use a position-independent scoring system. Profile-based methods need mathematical theory to support the meaning and derivation of the scores [94]. The Hidden Markov Model (HMM) is a type of probabilistic models that is generally applicable to time series or linear sequences. It has been widely applied to speech recognition since the 19703 [98]. In 1994, HMM was first introduced into profile-based sequence analysis as a coherent theory [99]. 7 .2 Markov Chain A Markov chain is a sequence of random variables (X 1, X2, X3, ..), having the property that, given the present, the future is conditionally independent of the past [100]. In other words, P(Xt = leo =i0,X1=i1,---,Xt—1=it—1)= P(Xt = jIXt—l =it—1)- (7-1) Therefore, the probability of a Markov sequence, a = a1a2...am, is P(a) = P(am,am_1,...,a1) = P(am|am_1,...,a1)P(am_1[am_2,...,al)...P(a1) (7-2) = Pfamlam—llpmm—llam—2)---P(all 7 .3 The Hidden Markov Model The hidden Markov model contains a finite set of states. Transitions among the states are governed by a set of probabilities, called transition probabilities. The state sequence is called the path, it. The path itself follows a Markov chain, so that the probability of a state depends only on the previous state. The transition probability 104 from state k to state I is written as: th = P(7T,i = ll7fi_1 = k). (7.3) To mark the beginning and end of the model, t0,k and tl,0 are used to represent the beginning and end transition probability, respectively. Each state of HMM can produce a symbol from a distribution over all possible symbols. Therefore, the probability of producing a symbol, a, at state k is defined as: ek(a) = P(a.,' = a|7ri = k). (7.4) This probability is called emission probability. Assume an HMM is used to generate a sequence. At any state, a residue is emitted from the state’s emission probability distribution. The next state is chosen according to the state’s transition probability distribution. The model then generates two strings of information. One is the underlying state path, produced by transmitting from state to state. The other is the observed sequence, where each residue is emitted from one state in the state path. The name, “hidden Markov model”, comes from the fact that the state sequence is a Markov chain and is “hidden” from observers. Only the symbol sequence is directly observed. The joint probability of an observed sequence, a, and a state path, it, is: m P(C¥, W) = t0,7r1 II €7r(ai)t7ri,7ri+1, (7.5) i=1 where 7rm+1 equals to 0, and represents the end of the model. Figure 7.1 shows an example of a simple HMM. Starting in the initial state, 1, the next state is chosen with transition probability 151,1 (i.e., staying in state 1) or 1:13 (i.e., moving to state 2). Then a residue is generated with an emission probability associated with the current state (e.g., generate a G with p1(G)). The 105 transition or emission process is repeated until the end state is reached. In this way, a hidden state path, and an observed symbol sequence are produced. In summary, an HMM specifies the following four properties: (1) the symbol alphabet, A, containing |A| different symbols (e.g., A = {A,C,G,T} for DNA, [A] = 4); (2) the number of states, N, in the model; (3) emission probabilities, ei(a), for each state, i, that sum to one over [A] symbols, a: Ea ei(a) = 1; and (4) transition probabilities, t“, for each state, i, going to any other state (including itself), j, that sum to one over N states: Zj ti,j = 1. t1, end hidden state path: 1: = 1—2—2 "l """" l “““““ l """""""""""""""""" G C G observed sequence: on = GCG Figure 7.1: A simply HMM. The joint probability P(a,7r) = t1,1 t1,2 t2,end p1(G) p2(C) p3(G). Note that another state path (1-1-2) could have generated the same symbol sequence with a probably different joint probability. 106 7 .4 The Most Probable Path Given a particular sequence, oz, and a HMM, there are many state paths, it, may generate (1. However, the probabilities of generating a from each path are very different. The path having the highest probability to generate the given sequence, a, is called the most probable path: 7r* = argmaa: P(a, it), (7.6) (I where argmaa: returns the maximum parameter, it, which generates the maximum value of P(a, 7r). The most probable path, 7r*, is what people are usually interested in. It can be found recursively by the Viterbi algorithm [36]. Suppose the probability, vk(i), of the most probable path ending in state k with observation a; is known. Then the probability, vl(i + 1), of the most probable path ending in state I can be calculated for observation ai+1 as: v10+ 1) = 61(a1+1) mg$(vk(i)ak,z)- (7-7) Since all sequences have to start in the beginning state, 0, the initial condition is that v0(0) = 1. By keeping backward pointers, the actual state sequence (i.e., the path) can be found by backtracking. Assume the transition probability to end state is tk,01 the Viterbi algorithm is: ALGORITHM 7 : Viterbi Input: (1) a HMM of length L; (2) a sequence a = a1...am. Output: the most probable path 7r* . Initialization (i = 0): v0(0) = l, vk(0) = 0 for k > 0. Recursion (i = 1 L): 107 7)10') = 610%) mgfivkfi - llth); ptr2~(l) = argmazr(vk(i — llthl- Termination: k P(=v,7r*) = mgdvkmtkp); «*L = argmax(vk(L)tk,0). 'Ikaceback (i =kL 1): «2:1 = ptrz-(vrg‘). 7 .5 The Profile Hidden Markov Model Functional biological sequences typically come in families. Just as a pairwise alignment captures the relationship between two sequences, a multi-sequence alignment can show how the sequences in a family relate to each other. It is desirable to provide a consensus model for a multi-sequence alignment, so that the relationship between an new sequence and the family can be identified. In [99], a particular type of HMMs is introduced. This type of HMMs is well suited for representing profiles of multi-sequence alignments, and is called the Profile Hidden Markov Model (PHMM). Unlike the general HMMs, PHMMS are strongly linear, left-right models. There are three states at each consensus column of a multi-sequence alignment: “match”, “insert”, and “delete”. A “match” state models the distribution of residues allowed in a column. An “insert” and “delete” state at each column models insertion and deletion of one or more residues between this column and the next, respectively. A small PHMM corresponding to a short multi-sequence alignment is shown in Figure 7.2. Algorithms for generating a PHMM from a multi-sequence alignment are described in [36]. In order to increase computing speed (since sum operation is usually faster 108 Q .3 D1 > D2 D3 The Multi—Sequence Alignment Emission Probability: M1: C=6l25, other=1l25 M2: A=4l25, G=2/25, K=2l25, other=1l25 M3: D=3l25, E=2/25, Y=3/25, other=1l25 OOOOOH x>>m>m KUKFJUUJ Figure 7.2: A small PHMM (right) representing a small multiple alignment of five protein sequences (left) with three consensus columns. Squares represent match states (Ml-M3). The 20 emission probabilities are calculated using Laplace’s rule (i.e., each missing residue is counted one). Insert states (diamonds IO—I3) also have 20 emission probabilities (assume to be the same as the background distribution). Delete states (circles labeled DO—D3) are “mute” states that have no emission probabilities. State transition probabilities are shown as arrows. 109 than product) and resolve underflow problems due to small values of probabilities, the probability parameters in a PHMM are usually converted to additive log-odds scores [92, 101]. If the emission probability of a match state is pa for residue a, and the expected background frequency of residue a in the sequence database is fa, the score for residue a at this match state is log(pa/fa). Therefore, the scores for aligning a residue to a profile match state are comparable to that of the traditional position-independent scoring system (see Section 5.3). In position-independent scoring systems, an insertion or deletion of an residue, a, is scored with the affine gap penalty: g(l) = —o — (l — 1)e, where l is the gap length, 0 is the gap-open penalty, and e is the gap-extension penalty. In a PHMM, for an insertion of length I, there is one state transition for entering into an insert state (called M-I transition), I — 1 state transitions for each subsequent insert state (called I-I transition), and one state transition for leaving the insert state (called I-M transition). The cost of the M-1 transition is log t M, I, where t M, I is the state transition probability for moving from the match state to the insert state. In the same way, the costs of the I-1 transition and M-I transition are log t I, I and log t I, M1 respectively. This is akin to the affine gap penalty, with the gap-open penalty as log t M, I + log t], M1 and the gap-extension penalty as log t], I- However, in a PHMM, these gap costs are not arbitrary numbers. Since PHMMS have a cost for the transition from a match state to a match state that has no counterpart in position-independent scoring systems, the probability of a transition to an insert state is linked to the probability of a transition to a match state, If the gap cost is reduced by raising the transition probability, t M, I: toward 1.0, the probability of the M-M transition, t M, M1 falls toward zero, and thus the cost for sequences without an insertion approaches negative infinity. Therefore, there is a trade-off in choosing the state transition probabilities so that the cost for the sequences having an insertion is balanced against the cost for the sequences without insertion. This is 110 an example of why PHMMS are useful and non~trivial. Additionally, in PHMMS, an inserted residue is associated with the emission probabilities of an insert state. If these emission probabilities are the same as the background residue frequency, the score of the inserted residue is log fa/fa = 0. In position-independent scoring systems, inserted residues have no cost besides the affine gap penalty. This zero cost assume that insertions in protein structures have the same residue distribution as proteins in general. This assumption is usually wrong, since insertions tend to be seen most often in surface loops of protein structures, and so have a bias toward hydrophilic residues [92]. PHMMS can capture this information in emission distributions of the insert state, which increases the sensitivity of sequence searching using PHMMS. 7 .6 Viterbi Equations An important usage of PHMMS is to detect potential membership in a family by obtaining significant matches of a sequence to a given PHMM. The search can be done by the Viterbi equations [36], which are related to Algorithm 7 described in Section 7.4. Let VjM (i) be the log-odds score of the best path of matching a sub-sequence, “Luz“: and the sub-model up to state j, ending with a, being emitted by state M j- Similarly, V]! (i) and VjD (i) are the scores of the best path ending in a,- being emitted by I j and Dj, respectively. Then the Viterbi equations are: 111 VM1(i—1)+l0gtM. M. M . eM-(ai) )1 3-1, J Vj (Z) = log fi— + marl: Vj_1(z _ 1)+10gt1j_1,Mj D -_ Vj_1(z 1) +10gtDj_1,Mj eI.(a-) VM(i—1)+logt . . (7.8) VjI(i) = log—fi—f—+max J MJ’IJ a, V-I(i— l)+logt1. I- ] 3’] VJ! (i—1)+logt . . V3.0“) — max 3 1 MJ‘I’DJ D -_ Vj_1(z 1) +10gtDj—11Dj In order to allow an alignment to start and end in a delete or insert state, in Figure 7.2, the “Begin” state is represented as M0, and VOM(O) is set to 0. The “End” state is represented as M L+11 and the VjM (i) without the emission term is used to calculate Vfifln) as the final score for matching the HMM with the sequence, a = al...n- 7 .7 Searching PHMM Using the HD-tree The algorithm to search PHMM in the HD-tree is related to the algorithm described in Section 6.4. Assume a query model, q = ‘11... k1 and a minimum matching score, S. Starting from the root, the algorithm descends recursively by every branch of the HD-tree. When descending by a branch labeled by the letter, a, the algorithm adds a to the current string, (1'. Assume phmm(q, a’) is the highest achievable score by matching q and (1’. There are three possibilities for the given a’: (1) If the phmm(q, a’) Z S, all the leaves of the current sub-tree are reported as answers. (2) If phmm(q, a’fl’) < S for any string, 6’, the branch is abandoned 112 immediately. (3) Otherwise, the algorithm continues recursively descending through the HD-tree. The score of the best path to match the sub-sequence, a’ = al...i1 and the query model, q = q1...qk, is computed using the Viterbi Equation 7.8. Program 1 is the pseudo code (C++ style) of the PHMM search algorithm at each tree node in the HD-tree. The above three possibilities (1), (2), and (3) are determined at lines 59, 61, and 63, respectively. Lines 11 to 21, 28 to 32, and 41 to 49 correspond to the computations of VjM (i), VjM (i), and VjM (i) in the Viterbi equations, respectively. The variables used in the program are explained as follows: a is the symbol (amino acid) at the current tree node; k is the length of the model; -INFTY is the negative infinity value; so is the current best log—odds score; gtsc(XXX, j) is the given log-odds score of the transition probability (TMM: match to match, TIM: insert to match, TDM: delete to match, etc.) from (j — 1)th state to j state of the model; gmsc(aa, j) and gisc(aa, j) are the given log-odds scores of the emission probabilities of aa at jth match and insert state of the model, respectively; gmmx(i, j), gimx(i, j), and gdmx(i, j) are the best scores at the jth match, insert, and delete state for sub-sequence a’ = a1...a,-, respectively (i.e., VJ-M(i), VjM(i), and VJ-M(i) in Equation 7.8). max is the maximum matching score of sub—sequence, a', to the given PHMM; 113 max_j is the jth position where the maximum score is achieved; maxscLi] is the maximum achievable score of any sub-sequence to the sub-model, qj...qk. It is the counterpart of the msc[i] in Algorithm 6 (see Section 6.4). Similar to the msc[i], the maxsch] is used to abandon a sub-tree which does not contain a potential match (see lines 57 to 59). The maxsc[j] is precomputed using the Program 1, except that gmsc(aa, j) (line 18) and gisc(aa, j) (line 46) are replaced by the maximum log-odds scores of the emission probabilities at jth match and insert state of the model, respectively. PROGRAM 1 : PHMM Search 01: thmSearchCuchar aa, int i) 02: { 03: int sc = O; 04: int rt = UNKNOWN; 05: int max = -INFTY; 06: int max_j = 0; 07: gmmx(i, 0) gimx(i, O) = gdmx(i, 0) = -INFTY; 08: for (int j = 1; j <= k; j++) 09: { 10: /* match state */ 11: gmmx(i, j) = -INFTY; 12: if ((sc = gmmx(i-1, j-1) + gtsc(TMM, j-1)) > gmmx(i, j)) 13: gmmXCi, j) = so; 14: if ((sc = gimx(i-l, j-1) + gtsc(TIM, j-1)) > gmmx(i, j)) 15: gmmXCi, j) = sc; 16: if ((sc = gdmx(i-l, j-1) + gtscCTDM, j-1)) > gmmx(i, j)) 17: gmmx(i, j) = sc; 114 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 28: 29: 30: 31: 32: 33: 34: 35: 36: 37: 38: 39: 40: 41: 42: 43: 44: if (gmsc(aa, j) != -INFTY) gmmx(i, j) += gmsc(aa, j); else gmmx(i, j) = -INFTY; if (gmmx(i, j) > max) { max = gmmXCi. j); max_j = j; } /* delete state */ gdmx(i, j) = -INFTY; if ((sc = gmmx(i, j-1) + gtsc(TMD, j-1)) > gdmx(i, j)) gdmx(i, j) = so; if ((sc = gdmx(i, j-1) + gtsc(TDD, j-1)) > gdmx(i, j)) gdmx(i, j) = so; if (gdmx(i, j) > max) max = gde(i, j); max-j = 1'; /* insert state */ if (j < k) gimx(i, j) = -INFTY; if ((sc = gmmx(i-1, j) + gtsc(TMI, j)) > gimx(i, j)) gimx(i, j) = sc; if ((sc = gimx(i-l, j) + gtsc(TII, j)) > gimx(i, j)) 115 45: gimx(i, j) = so; 46: if (gisc(aa, j) != -INFTY) 47: gimx(i, j) += gisc(aa, j); 48: else 49: gimx(i, j) = -INFTY; 50: if (gimx(i, j) > max) 51: { 52: max = gimx(i, j); 53: max_j = j; 54: } 55: l 56: } 57: curmax = max + maxsc[max_j]; 58: if (curmax < S) 59: rt = ABANDDN; 60: else if (sc >= 8) 61: rt = FOUND; 62: else 63: It = CONTINUE; 64: 65: return rt; 66: } 7 .8 The HMMER Package HMMER is a freely distributable package (current version is 2.3.2) for protein sequence analysis using PHMM [96]. It contains a set of programs useful for 116 building and searching PHMMS. HMMER is hosted at Washington University at St. Louis, and is one of the most popular packages used by biologists to detect distant homologous relationships using PHMMS. HMMER is used to search for sequences that belong to a known protein family constructed from a multi-sequence alignment. The protein family, like most protein families, is so diverse that a BLAST search may fail to report even the known members in the family. HMMER is also used in automated annotation of the domain structure of proteins, and automated construction and maintenance of large multi-sequence alignment databases [102]. Figure 7.3: The “Plan 7” architecture of HMMER. Squares indicate match states; diamonds indicate insert states; circles indicate delete states and special states; arrows indicate state transitions. Figure 7.3 shows the current HMMER “Plan 7” model architecture [102]. There are 7 transitions per node in the main model. Unlike the standard model in Figure 7.2, Plan 7 has five special states: ‘8’, ‘N’, ‘C’, ‘T’, and ‘J’. When combined with entry probabilities from ‘B’ state and exit probabilities to ‘E’ state, these special states control unique features of the model. For instance, how likely the model is to generate various sorts of local or multi-hit alignments. The abbreviations used in Figure 7.3 are explained as follows [102]: Mx: Match state x. Has K (K = 20 for protein sequences) emission 117 probabilities. Dx: Delete state m. Non-emitter. Ix: Insert state cc, Has K emission probabilities. S: Start state. Non-emitter. N: N-terminal unaligned sequence state. Emits on transition with K emission probabilities. B: Begin state (for entering main model). Non-emitter. E: End state (for exiting main model). Non-emitter. C: C—terminal unaligned sequence state. Emits on transition with K emission probabilities. J : Joining segment unaligned sequence state. Emits on transition with K emission probabilities. In traditional pairwise alignments, distinction is made between global Needleman-Wunsch and local Smith-Waterman algorithms. However, in the Plan 7 architecture, local versus global alignment in HMMER is controlled by transition probabilities. For example, local alignments with respect to the model are achieved by non—zero state transition probabilities from the begin state, ‘B’, to internal match states, and from internal match states to the end state, ‘E’ (see dotted lines in Figure 7.3). Local alignments with respect to the sequence are achieved by non-zero state transitions on the flanking insert states, ‘N’ and ‘C’. More than one hit to the PHMM per sequence is achieved by a cycle of non—zero transitions through the special insert state, ‘J’. 118 7 .9 HMMER Plan 7 in the HD-tree The HD—tree adopts the Plan 7 architecture of the HMMER package, so that local, global, and multi-hit alignments can be easily controlled by transition probabilities in PHMMS. Using the same architecture also allows better comparison between HMMER and the HD-tree. To implement the Plan 7 architecture, Program 1 has to be modified to include the extra states. Therefore, Program 2 is inserted between line 15 and line 16 of Program 1, and Program 3 is inserted between line 56 and line 57 of Program 1. The variables used in Program 2 and Program 3 are explained as follows: bsc [j] is the given log-odds score of the transition probability ‘E’ state to jth state; esc [j] is the given log-odds score of the transition probability from jth state to ‘E’ state; gxsc (XXX, MOVEILDDP) is the given log-odds scores of the transition probability at the state, ‘N’, ‘E’, ‘C’, or ‘J’, where “MOVE” refers to the transition N-B, E—C, C-T, or J-B; and “LOOP” refers to the transition N-N, E—J, C—C, or J-J; gxmx(i , XXX) is the best log-odds score of the transition probability from ith state to the state, ‘B’, ‘E’, ‘C’, ‘J’, or ‘N’ (e.g., XMN represents the transition probability to ‘B’ state). PROGRAM 2 : PHMM Search for HMMER Plan 7, A 01: if ((sc = gxmx(i-1, XMB) + bsc[j]) > gmmx(i, j)) 02: gmmx(i, j) = sc; PROGRAM 3 : PHMM Search for HMMER Plan 7, B 119 01: 02: 03: O4: 05: 06: 07: 08: 09: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: /* N state */ gxmx(i, XMN) = -INFTY; if ((sc = gxmx(i-l, XMN) + gxsc(XTN, LO0P)) > -INFTY) gxmx(i, XMN) = sc; /* E state */ gxmx(i, XME) = -INFTY; for (int j 1; j <= k; j++) if ((sc gmmx(i, j) + esc[j]) > gxmx(i, XME)) gme(i, XME) = sc; /* J state */ gxmx(i, XMJ) = -INFTY; if ((SC = gmeCi-l, XMJ) + gxsc(XTJ, LDDP)) > -INFTY) gxmx(i, XMJ) = sc; if ((sc = gxmx(i, XME) + gxsc(XTE, LOOP)) > gxmx(i, XMJ)) gxmx(i, XMJ) = sc; /* B state */ gxmx(i, XMB) = -INFTY; if ((sc = gxmx(i, XMN) + gxsc(XTN, MOVE)) > -INFTY) gxmx(i, XMB) = sc; if ((sc = gxmx(i, XMJ) + gxsccer, MOVE)) > gxmx(i, XMB)) gxmx(i, XMB) = sc; /* C state */ gxmx(i, XMC) = -INFTY; if ((SC = gme(i-1, XMC) + gxsc -INFTY) gme(i, XMC) = sc; if ((sc = gxmx(i, XME) + gxsc(XTE, MOVE)) > gxmx(i, XMC)) gxmx(i, XMC) = sc; 120 28: sc = gxmx(i, XMC) + gxsc(XTC, MOVE); 7 .10 Comparisons In this section, the HD-tree is compared with HMMER for sequence searching using PHMMS. The entire GenBank protein database is served as the sample database. The HD-tree is created using overlapping words of length 20. Experiments are conducted on a Linux PC with 512MB RAM and 1.8GHz Pentium 4 processor. Queries are generated from the popular PFAM database. Both synthetic and real queries are used in the experiments. Since PHMMS provide the parameters to computer E-value (see Section 5.5), the HD-tree is able to accept E-value as a search criterion besides the closeness defined in Section 6.5.1. 7.10.1 PFAM PFAM (Protein FAMilies) is a large collection of multi-sequence alignments and PHMMS, covering many common protein families [103]. Genome projects, including both the human and fly, have used PFAM extensively for large scale functional annotation of genomic data [104]. PFAM version 18.0 (August 2005) is used in the experiments. It contains alignments and models for 7973 protein families, based on the Swissprot 47.0 and SP—TrEMBL 30.0 protein sequence databases [105]. PFAM is constructed by first distinguishing a stable curated “seed” alignment of a small number of representative sequences, then using HMMER to make a model of the seed, then searching the database for homologues using the model, and automatically producing the full alignment by aligning every sequence to the seed [103]. 121 7.10.2 Synthetic Queries The first set of experiments is conducted using synthetic queries, which are generated from long PHMMS in PFAM. The procedure of generating the synthetic queries is as follows. Assume a long PHMM, Mr, of length L. Transition and emission probabilities related to match, insert, and delete states of a synthetic query, Ms, of length l are copied from the ith to (i + l)th state of Mr. Transition probabilities from ‘B’ state to any state in Ms are copied from the first I states of Mr, which transition probabilities from any state to ‘E’ state in Ms are copied from the last 1 states of Mr. These synthetic queries are then run through a program, hmmcalibrate, provided by HMMER package. hmmcalibrate takes a PHMM and empirically determines parameters (such as A in Equation 5.11) that are used to make searches more sensitive by calculating more accurate E—values (see Section 5.5). The HD—tree is not a heuristic search algorithm such as BLAST. Therefore, all results are found as long as the search criterion (e.g., the E-value) is satisfied. Experiments have shown that the HD-tree finds all the results returned by HMMER. Therefore, the query quality of the HD-tree is the same as that of HMMER. Table 7.1 shows the average query time with respect to query lengths and E—values using the HD-tree and HMMER. For the HD-tree, the results are the averages of 100 synthetic queries unless clearly stated. However, since HMMER is very slow (10-15 minutes per query), only 10 synthetic queries are used for HMMER to generate the average. It is shown that the HD-tree is orders of magnitude faster than HMMER for queries shorter than 12. As the query length increases, the performance of the HD-tree decreases faster than that of HMMER. Therefore, it is not recommended to use the current HD-tree for queries longer than 12 (see Section 8.2 for potential solutions for long queries). Besides the E—value, the HD-tree also uses closeness (see Section 6.5.1) to 122 Table 7.1: HD-tree versus HMMER, E—value = 10 Qlen 7 8 9 10 11 12 HMMER 605.1 723.1 672.0 741.6 859.4 878.9 HD-tree 0.007 0.012 0.927 5.631 38.760 154.970 Table value: Query time in seconds. evaluate the degree of similarity between two sequences. E—value and closeness are exchangeable for a given query. Table 7.2 shows the relationship between the closeness and the E—value for the synthetic queries. It is shown that for the query of length 7, even a near-exact match (i.e., closeness = 98%) may result in E—value of at least 10. This is one reason why short queries are executed very fast using the HD-tree. In order to return more results for short queries, either the E—value is to be increased or the closeness has to be decreased. Table 7.2: Closeness versus E—value Qlen Eval Ang Eval Ang Closeness Angval Closeness Angval 7 10 98.0 100 93.8 60 3019.0 90 674.7 8 10 96.4 100 86.0 60 1348.0 90 209.2 9 10 90.6 100 75.1 60 628.6 90 69.8 10 10 82.8 100 65.0 60 295.5 90 21.5 11 10 72.8 100 55.2 60 134.5 90 5.8 12 10 63.2 100 47.5 60 64.8 90 1.8 Qlen: Query Length; Eval: E—value; Angval: Average E-value; Ang: Average Closeness. Figure 7 .4 illustrates the relationships between query time and query length for given closenesses. It is shown that the query time decreases as the query length decreases or closeness increases. The trend is similar to that of the HD-tree in pairwise alignments using position-independent scoring matrices (see Section 6.6). In order to show the relationship between query time and disk accesses, Table 7.3 provides the statistics for queries of length 10. It is shown that the query time is closely related to the number of disk accesses, which reflects the pruning power (i.e., the ability to abandon a sub-tree in a search) of the HD-tree. Table 7 .4 presents the number of dynamic programming columns computed in sequence searches using the 123 HD—tree, where “Ratio” is the number of columns computed by the HD-tree divided by the number of columns computed by regular dynamic programming. Similar comparison is conducted in [84] to show that using suffix index can significantly reduce the number of columns to be computed. Same conclusion can be made from Table 7.4. 130 I I l I I I 120 H Closeness=80 A-A Closeness=70 H Closeness=60 110 IITIFIIIIIIITII l 8 I Query Time (seconds) 8 'Tfil’I—TU'IITII 50 40 .- € 30 :— -: 20 :— “Z 10 I- -E o : 1* _ 6 7 8 9 10 11 12 13 Query Length Figure 7.4: HD-tree query time for synthetic PHMMS Table 7.3: HD-tree query time for synthetic PHMMS; Query length = 10 Closeness Qtime DiskAcc AccPerc Ang-value (%) (seconds) (%) 90 0.14 48 0.001 21.46 80 0.88 299 0.011 51.77 70 5.04 1989 0.072 125.4 60 25.29 11153 0.407 295.5 50 100.43 57781 2.107 684.8 Qtime: query time; DiskAcc: the number of disk accesses; AccPerc: the percentage of accessed leaves. 124 Table 7 .4: Computation of dynamic programming table columns using synthetic PH- MMs, Closeness = 60% PHMM Length Column Computed Ratio 7 10103 1.52844E—05 8 35335 5.34569E—05 9 117850 0.000177882 10 348271 0.000526885 11 899827 0.001361312 12 2573057 0.003892673 7.10.3 Analyzing Query Time To further analyze the performance difference between the HD-tree and HMMER, the query time is divided into CPU time and disk access time. Table 7.5 shows the statistics of the query performance using 10 synthetic queries with different query lengths for both the HD-tree and HMMER. The original GenBank protein database, which includes sequence annotations, contains 239732 disk blocks (4KB each block). The total length of the sequences is approximately 661 million. The HD-tree generated from the database contains 2741767 leaf nodes (i.e., disk blocks). In Table 7 .5, DiskAchum is the average number of disk blocks accessed by the HD-tree. DiskAccTime is the time spent on reading data from disks. DiskAccPl is the DiskAchum as the percentage of the total number of leaf nodes. DiskAccP2 is the DiskAchum as the percentage of the number of disk blocks occupied by the database, which HMMER accesses sequentially. DptColumn is the number of Dynamic Programming Table (DPT) column computed by the HD-tree, which includes the DPT columns computed for searching strings within leaf nodes. DptRatio is DptColumn divided by 661 million (i.e., the number of DPT column computed by HMMER). As shown in Table 7.5, for different query lengths, the majority of HD-tree query time (approximately 90% or above) is spent on reading disks. However, more than 93% of HMMER query time is spent on CPU computation. Therefore, it can 125 be concluded that the HD-tree is an I/O-bound approach and HMMER is a CPU-bound approach. HMMER is a pure dynamic-programming—based method. It sequentially reads all database sequences, and compute one DPT column for each residue. For any query, reading the database takes approximately 40 seconds. The computation of DPT (approximately 661 million columns) dominates the query time. Therefore, the query time is relatively consistent, except some variations due to the post-processing of query results. The computational complexity of HMMER is 0(mn), where m is the query length and n is the database size. Table 7.5: Analyzing query time. Both the HD-tree and HMMER use 10 synthetic queries, and E-value = 10. QueryLen [ 7 8 9 10 11 12 HD-tree CpuTime 0.001 0.001 0.004 0.028 1.258 9.094 CpuTime% 10.0 8.3 5.6 4.9 7.9 10.5 DiskAccTime 0.009 0.011 0.067 0.543 14.720 77.420 DiskAccTime% 90.0 91.7 94.4 95.1 92.1 89.5 DiskAchum 1 1 21 229 11304 76975 DiskAccPl (%) 0.000 0.000 0.001 0.008 0.412 2.807 DiskAccP2 (%) 0.000 0.000 0.009 0.096 4.715 32.109 DptColumn 101 132 721 7855 395785 2697946 DptRatio 1.528E—8 1.997E—8 1.091E—7 1.188E—6 5.988E—5 4.082E—3 Closeness 99.0 97.0 88.7 77.1 67.3 59.5 MaxError 0.19 0.32 2.28 4.29 6.95 10.36 HMMER CpuTime 567.45 681.41 632.01 700.70 819.72 839.34 CpuTime% 93.78 94.23 94.05 94.48 95.38 95.50 DiskAccTime 37.65 41.69 39.99 40.90 39.68 39.56 DiskAccTime% 6.22 5.77 5.95 5.52 4.62 4.50 On the other hand, as an index-based method, the HD-tree significantly reduces DPT computation. This reduction is the result of two reasons. First, since the HD-tree indexes all overlapping word using trie structure, it avoids the repeated DPT computations for the subsequences (at different positions) having the same 126 prefix. Second, during the tree traversal, a sub-tree may be abandoned (i.e., pruned) if a possible match is not possible within the sub-tree. For example, for query length of 7, the DPT column computed by the HD-tree is only 101, which is negligible compared to 661 million columns computed by HMMER. As the query length increases, the computation of DPT increases. However, even when the query length is 12, the DPT columns computed by the HD-tree is still less than 0.1% of the DPT columns computed by HAMMER. Therefore, the CPU time of the HD-tree is not the major factor of the the query performance. However, since the HD-tree stores leaf nodes on disks, the query time is dominated by the disk access time, which is proportional to the number of disk accesses (i.e., leaf-node accesses). Since the internal nodes of the HD-tree is a trie, each tree node represents a string (i.e., a prefix of overlapping words). A sub-tree can be pruned (i.e., do not access the leaf nodes in this subtree) if the best alignment between the query and the prefix string representing the sub-tree has exceeded the maximum error (i.e., the maximum score minus the minimum score. See Section 6.5.1.). The maximum error is determined by both the closeness and the length of query sequence, and the closeness plays a more important role. As shown in Table 7.5, the average closeness decreases as the query length increases, hence the maximum error (MaxError in Table 7.5) increases. The increased maximum error leads to decreased pruning power. For a given query, assume a sub-tree, T, is pruned for a maximum error of e. If e is increased, the search has to continue within T. Therefore, only sub-trees of T may be pruned, and the pruning power is reduced. It is observed from the experimental results that for a given HD—tree and a fixed E—value, the number of disk accesses increases exponentially as the query length increases. Therefore, as the query length increases, HMMER will sooner or later outperform the HD-tree. 127 7 .10.4 Real Queries In order to compare the HD-tree with HMMER using real PHMMS, experiments are conducted on PHMMS obtained from the PFAM database directly. Table 7.6 shows the query time of the HD-tree and HMMER using these real PHMMS, where “Seq” is a given number to distinguish different queries having the same length. Although the overall performance trend is similar to that in synthetic queries, it is observed that the running time varies among queries having the same length. This may due to the fact that the HD—tree uses the prefix of a query to reduce the search space. Therefore, the composition of a query affects the performance of the HD-tree. For example, if the first a few states in a PHMM has more pruning power, the HD-tree tends to be faster. In summery, according to the results from both synthetic and real data, the HD-tree is shown to be much faster than HMMER in PHMM search for short queries. In Chapter 6, the HD-tree also outperforms BLAST for pairwise alignment using position-independent scoring matrices. As genomic sequence databases continue to grow, the benefit of using an index-based approach, such as the HD-tree, is more and more appealing than the linear-scan-based approach, such as HMMER and BLAST, especially for complex sequence analysis tasks such as PHMM searches. 128 Table 7.6: Query time for real PHMMS; E—value = 10 Qlen Seq HD—tree Closeness HMMER (seconds) (seconds) 7 1 0.386 74 563 8 1 0.028 79 741 8 2 0.363 67 704 8 3 0.036 99 951 8 4 0.002 99 741 8 5 0.128 72 709 9 1 4.661 57 660 9 2 0.001 99 692 9 3 4.871 58 671 9 4 0.001 99 757 10 1 0.168 81 740 10 2 8.909 53 731 10 3 0.612 76 724 10 4 9.464 61 951 10 5 3. 168 70 728 10 6 0.001 99 717 1 1 1 5.685 47 809 1 l 2 0.312 81 856 1 1 3 5.312 66 821 l 1 4 7.000 64 786 1 1 5 1.112 66 844 12 1 36.252 45 898 12 2 160.322 48 875 12 3 52.374 53 905 12 4 229.911 41 843 12 5 211.286 34 892 Qlen: query length; Seq: query sequence number. 129 Chapter 8: Conclusions and Future Work 8.1 Conclusions There is an increasing demand for eflicient indexing techniques to support various types of queries (e. g., prefix searches and approximate string matching) on large string databases, such as genomic sequence databases. Most existing string indexing techniques are either RAM-based or disk-based. RAM-based index structures are not suitable for large databases when only a limited amount of RAM is available. Disk-based structures, on the other hand, can index large databases but usually do not fully utilize the available RAM. In this dissertation, a novel hybrid RAM / disk-based structure, the Hybrid Digital tree (HD—tree), is proposed. The HD-tree takes advantage of the strengths of both RAM-based and disk-based structures. It contains two parts: the RAM-index and the disk-index. The RAM-index uses the trie structure, and resides in the RAM to minimize the disk accesses; while the disk-index maintains the rest of the index on disks so that large databases can be indexed. Algorithms for constructing and searching the HD-tree are developed. The HD—tree not only scales well with the size of the RAM and the database, but also is efficient for various types of queries. Experiments are conducted to compare the HD-tree with existing techniques. In comparisons with the Prefix B-tree for prefix searches, the HD-tree not only reduces I/O operations by more than 60%, but also reduces the total query processing time by one order of magnitude. The HD-tree also outperforms the linear scan and the M-tree for approximate string matching based on the Hamming distance. The HD-tree is very useful in solving real-world applications, such as searching 130 genomic sequence databases. The proposed Sort-Merge method successfully reduced the standard HD-tree construction time (the Brute-Force method) by an order of magnitude, so that the entire GenBank protein sequence database can be indexed in few hours. Algorithms are developed to perform sequence searches using the HD-tree. In the application of searching homologous sequences using position-independent scoring matrices, the HD-tree is not only four times faster than BLAST, a popular heuristic sequence search tool, but also able to find more homologous sequences for short queries. The speed improvement of the HD-tree is even more impressive in sequence search using Profile Hidden Markov Models (PHMMS), where heuristic algorithms are not applicable. The HD-tree is shown to be orders of magnitude faster than HMMER, a popular PHMM search tool, for short queries, while maintaining the same query quality as that of HMMER. The major contribution of this dissertation is the application of index-based approximate string matching for genomic sequence databases. This is a prominent research area in the field of bio-informatics. Due to the complexity of advanced sequence searches, such as the profile hidden Markov model (PHMM) search, dynamic programming over the entire sequence database is used by popular search tools such as HMMER. However, as the genomic sequence databases continually grow rapidly, index-based approaches will sooner or later replace linear-scan-based approaches. Traditional Disk-based string index structures, such as Prefix B—trees and String B—trees, are not applicable in approximate string matching, although they may index large string databases. This is because these Disk-based structures use string as the index unit (unlike the trie which decomposes a string into letters). An internal node contains a set of strings and represents sub-ranges of the entire search space. Depending on which sub-range the query string belongs to, the search continues in a child node having smaller sub-ranges. Such tree structure is effective 131 for exact matches or prefix matches, since the query results belong to one sub-range. However, for approximate string matching, the query results may belong to many sub-ranges across the entire search space. The sub—ranges defined by internal nodes of Prefix B-trees or String B—trees are not small enough to rule out the possibility of finding a match within a sub—range (i.e., a sub—tree). Therefore, the tree loses its pruning power, and the entire tree may have to be accessed for a query. On the other hand, trie-based index structures decompose a string into a sequence of letters and build index letter by letter. Therefore, the sub-ranges are much more refined than those in B—trees, and a search may be able to prune a sub-tree if a match is not possible within the sub-tree. Trie-based structures, such as suffix trees, are effective in approximate string matching by significantly reducing the computation of dynamic programming table [84]. However, the suflix tree is a RAM-based structure and requires a large amount of RAM, which makes it infeasible to index the entire GenBank protein database in the RAM. In [84], a method is proposed for creating suffix trees (Hunt’s suffix tree) in excess of available RAM size. Disk space is served as the image of RAM for the suffix tree and accessing the suffix tree relies on the operating system to page in / out the disk image. Since the suffix tree stores pointers (i.e., suffix positions) in leaf nodes, a search may not be completed unless it follows the pointer to the original sequence database, which is costly. The HD-tree, is a combination of both the RAM-based and disk-based structures. It uses trie-based structure to index overlapping words, so that it is as efficient as the suffix tree in reducing the computation of dynamic programming table. The HD-tree stores partial overlapping words in leaf nodes, so that a search of short queries (shorter than the length of overlapping words) does not need to access original sequence databases. Unlike the disk-based suffix tree, the HD-tree uses a small amount of RAM to store internal nodes of the tree, and groups strings 132 having the same prefix in leaf nodes (i.e., generate some clustering information). These RAM-based internal nodes can prune the leaf nodes (i.e., disk blocks) that do not contain the query answer. Hence the number of disk accesses is significantly reduced. Since Hunt’s suffix tree does not have a mechanism to group strings into disk blocks (i.e., lack of clustering information), it requires more disk accesses than the HD—tree. Therefore, a hybrid RAM / disk-based index structure such as the HD—tree is a promising approach for indexing and searching large string databases, especially genomic sequence databases. 8.2 Future Work The success of the HD-tree in genomic sequence searches using PHMM encourages continual research in this area. One of the biggest challenges is to support long queries. As shown in previous chapters, performance of the HD-tree degrades as query length increases. Therefore, a new method for employing the HD-tree needs to be developed for long queries. One promising solution for long queries is to use two search stages: filtering and alignment. In the filtering stage, the HD-tree is served as a filter to locate the potential homologous regions in the database using short sub-queries (e.g., length of 7 to 11 residues) obtained from a long query. In the alignment stage, these potential regions can be extended in both directions and be searched against the long query using dynamic programming to find the final results. This strategy is similar to that of heuristic algorithms, such as BLAST. However, most heuristic algorithms rely on hashing techniques for filtering, which makes PHMM searches very difficult. The HD-tree, on the other hand, provides an index structure that supports complex PHMM searches. At the same time, a potential homologous region returned by HD-tree searches can be more informative than that of existing heuristic algorithms, 133 due to longer sub—query length (7 to 11 residues versus 3 to 4 residues). This will likely reduce the number of candidate regions in the alignment stage. In the above two-stage approach, how to generate sub-queries effectively and efficiently is an important research issue. In a genomic sequence (or PHMM), some sub-regions are more informative than others. These highly informative sub-regions have been used to look for motifs (i.e., recurring patterns) in DNA and protein sequences [89, 90]. According to the Shannon theory, the entropy (i.e., mean amount of information) in a message can be computed as: H = — 2pm.) Iogp, (8.1) i=1 where p(m,-) is the probability of the message component, m,- [106]. For a PHMM, the emission probabilities of each symbol at each state (position) is known. Therefore, the information content (also called relative entropy) of a PHMM of length L can be computed as: L [M fi j I = Z me- log —p,—, (8.2) j=1i=1 where fi,j is the emission probability of ith symbol in A, and p,- is the background frequency of the ith symbol [89, 107, 108]. The information content can then be used to find the highly informative sub-regions in a long PHMM. However, the number and length of sub-queries, and the search criteria for these sub-queries (such as closeness and E-value) all affect query time and quality. Therefore, developing the appropriate method to generate sub-queries will be an important part of future work for long-query searches using the HD-tree. Once sub-queries are generated and query results are produced, the next step is to construct candidate homologous regions from the search results of sub—queries. 134 One approach is to consider each search result as a candidate region, similar to that in BLAST. Another approach is to form candidate regions based on the position of initial search results, similar to that in FASTA. The second approach is likely to be faster, but may sacrifice sensitivity, since it employs a heuristic method for selecting candidate regions from initial sub-query search results. Algorithms need to be developed and experiments need to be conducted to find the best approach to generate candidate regions, which is another important part of the future work. In summary, the HD—tree developed in this dissertation is shown to be a valuable approach for indexing and searching large string databases. It is successfully applied to genomic sequence databases for short queries, especially in the profile hidden Markov model searches. Yet, the potential for using the HD-tree for long queries needs to be explored further so that it can be a more powerful tool for the growing field of genomic sequence analysis. 135 APPENDIX 136 Appendix A: Approximate Q-gram Matching in Genomic Sequence Databases Searching a genomic sequence database usually begins with selecting a set of candidate regions, a stage called filtering. Local alignments on these candidates are then performed to find true homologous regions. Exact matching of q-grams (substrings or words of length q) has been used by popular systems, such as the BLAST, for the filtering stage. However, if a smaller q value is used, exact matching can result in a very large candidate set, leading to low search efficiency. On the other hand, with a larger q value for exact matching, search efficiency improves but the accuracy of the search is significantly reduced. As the size of genomic sequence databases increases, the situation may get even worse. In this appendix, the application of approximate q-gram matching based on the Hamming distance (see Section 4.5) is analyzed for the filtering stage. According to the experimental results on GenBank nucleotide databases, it is concluded that approximate matching based on a combination of larger q value and longer Hamming distance, is much more efficient and effective than exact matching. A theoretical model is developed to further analyze the performance of approximate matching. A. 1 Introduction genomic sequence databases are widely used to assist molecular biologists in understanding the biochemical function, chemical structure and evolutionary history of organisms. Given a query sequence, a basic operation on these databases is to 137 locate the homologous regions within existing genomic sequences. During the past decade, genomic sequence databases have been growing rapidly in size. The size of GenBank [109], a popular collection of publicly available DNA sequences, increased from 217,102,462 residues (base pairs) and 215,273 sequences in 1994, to 44,575,745,176 residues and 40,604,319 sequences in 2004 [110]. At the same time, there is an increasing demand for searches on genomic sequence databases. To support efficient searches on genomic sequence databases, many algorithms (systems) have been developed in the past decade. Based on how the search is conducted, these systems can be divided into two categories: linear-scan-based and index-based. Linear-scan—based systems, such as FASTA [73, 74] and BLAST [3, 42], compare a query sequence with all the sequences in the database. FASTA is considered as the most accurate (sensitive) system, while BLAST is more popular and faster but less sensitive. Index-based systems, such as BLAT [79], CAFE [80, 81], and Suffix Sequoia [111, 86] perform a query using a pre—built index of the database. Although linear-scan-based systems are faster than index-based systems for smaller databases, as the size of the genomic sequence databases continually increases, index-based systems are more and more appealing for their efficiency. Searching homologous regions in a genomic sequence database is usually conducted in two stages: the filtering stage and the alignment stage. The filtering stage detects candidate regions which are likely to be homologous. The alignment stage then examines these regions in detail and reports the regions which are indeed homologous according to some criteria. The resulting homologous regions are also called “high-scoring segment pairs” or HSPs [42]. Because of the unstructured nature of genomic sequences, the q-gram (substring or word of length q) is often used as a basic indexing/ search unit in the filtering stage [42, 82, 79, 74]. Q-grams can be either overlapping (i.e., a fixed window size of L shifting from the beginning to the end of a sequence by one letter at a time) or non-overlapping. Hits (q—gram 138 positions in the genomic sequence) are located by matching query q-grams with database q-grams. Each hit may produce a candidate region. Dynamic programming and scoring matrices are used in the alignment stage to find the true homologous regions among the candidate regions [83, 71]. These algorithms are so costly that the alignment stage may take more than 90% of the total processing time. Therefore, the quantity and the quality of candidate regions produced by the alignment stage are very important for improving the overall efliciency of the searching algorithm. Heuristics, such as the two-hit method in [42], have been developed to reduce the number of candidate regions based on original hits. Besides q-grams, suffix keys are also used as an index/ search unit in the filtering stage [112]. Compared to q-gram based approaches, these suffix-key-based methods take more time and space; however, they are more sensitive in finding matches that have relatively low similarity to the query [112, 111]. In the filtering stage, both FASTA and BLAST use a hashing technique to sequentially search overlapping query q-grams against overlapping database q-grams. On the other hand, BLAT builds an index based on non—overlapping q-grams in memory. It uses either exact matching or approximate matching with at most one mismatch in the filtering stage to locate candidate regions. BLAT was shown to be faster than popular existing tools, such as BLAST; however, it is less sensitive. CAFE uses inverted files [7] to index genome databases. Overlapping q-grams are used in CAFE. Heuristics, such as FRAMES [80], are used to reduce the number of candidate regions passed to the alignment stage. Compared with FASTA, CAFE is shown to be faster in searching GenBank nucleotide databases with comparable accuracy [80]. This appendix focuses on the performance of the filtering stage using q-grams. Most existing systems use exact matching of q-grams to find the hits. For example, BLAST provides the options of using different word length q. The shorter q, the 139 higher the sensitivity is; however, the number of resulting candidate regions also increases dramatically. For example, for a typical BLAST search, 61,926,143 hits might be found using q = 7, while only 271,083 hits would be found using q = 11. On the other hand, homologous regions that would be found using q = 7 can be missed using q = 11. Using exact matching, it is difficult to further reduce the number of hits while increasing the sensitivity. Since HSPS can be viewed as the results of approximate matching, using approximate matching based on the Hamming distance in the filtering stage may produce lower number of hits as well as higher sensitivity. Approximate matching on string databases has been studied extensively [9]. Index-based q-gram methods were proposed to reduce the search cost for large genome databases [113, 114]. In BLAT, one mismatch has been shown to be effective in finding low similarity HSPS; however, the system does not provide the option to use more than one mismatch. Approximate q-gram matching with more than one mismatch has not been adopted widely in searching genome databases. The goal of this appendix is to investigate the effect of applying larger word length (q) and higher Hamming distance (i.e., the number of mismatches) to the filtering stage. Experiments were conducted on both the E. coli and the entire GenBank nucleotide databases to investigate the performance (cost and sensitivity) of various combinations of word length and Hamming distance. A theoretical model is developed to further analyze the performance. Both experimental results and theoretical analysis show that approximate matching using longer word length and larger Hamming distance can achieve both lower cost and higher sensitivity than exact matching for the filtering stage. The rest of this appendix is organized as follows. Methodology and experimental results are discussed in Section A.2.2, the theoretical model is presented in Section A.3, and conclusions and future work are given in Section A4 140 A.2 Filtering Based on Approximate Matching A.2.1 Motivation The motivation for using approximate matching of q-grams for the filtering stage is inspired by the following observations. Since HSPs, especially ones with relatively low similarity, may have evenly distributed mismatches, it is likely that exact matching may not be able to find a hit (using larger q) within such HSPS. Let q/ h represent approximate matching of q-grams within Hamming distance of h. Exact matching of q-grams is represented as q/O. For example, in Figure A.1(a), the HSP cannot be found by word size of 7 or larger. However, the HSP can be found by approximate matching of 13/ 1. On the other hand, for a region (assume it is not an HSP) in Figure A.1(b), two hits will be reported by 7/0. However, the region will be passed by 13/ 1. Therefore, it is possible that lower cost (number of hits) and higher sensitivity can be achieved by using proper q/ h combination. Since approximate matching has two adjustable variables, q and h, it is more flexible than exact marching, where only word length q can be tuned. ttgatgatgtcatagtatgc attgatgatgtcatctta lIIIIIlIIlIIIIllII IIIIIII Illllll ttgatgctgtcatcgtatgc attgatggctacatctta “0 (b) Figure A.1: Example alignments A.2.2 Methodology BLAST is the most commonly used tool in bio-informatics. Although there are more sensitive algorithms, nucleotide BLAST (BLASTN) is much faster and, in practice, has proved sensitive enough to detect the moderate sequence similarities 141 that imply homology. In addition to its more formal use in detecting evolutionarily related sequences, due to its speed and availability, local BLASTN is used as a core component in several primer and probe design packages, where the intent is not to find related genes or gene segments, but to find regions of sequence similarity that might cause cross-priming or cross-hybridization. For these uses, even relatively short regions of similarity will be of concern. BLASTN (version 2.2.6 for Linux, downloaded from the NCBI web site) is used to generate the standard HSP answer set by searching a database with word size of 7 and mismatch penalty of -1. Queries are 30 probes from an actual oligonucleotide micro-array; each of them has 70 residues. Sequences in standard FASTA format were downloaded from the GenBank in May 2003. The non-redundant nucleotide database, contained 1,751,987 sequences and 8,542,465,976 residues; the E. coli database contained 400 sequences and 4,662,239 residues. A program is developed to simulate the filtering stage. Overlapping q-grams from a query sequence are compared with overlapping q-grams from the database sequences. Given a Hamming distance h, a hit is recorded if the Hamming distance between a query q-gram and a database q-gram is within distance h. Assume the standard HSPS are provided. If a hit lies within a standard HSP, it is a true hit, otherwise, it is a false hit. An HSP is considered to befound by the program if at least one hit lies within a standard HSP. Each false hit generates a candidate region, while any number of true hits within a HSP generates only one candidate region. Therefore, the search cost measured by alignments (i.e., the number of candidate regions) is computed by the number of false hits plus the number of HSPS found. Sensitivity is measured by the number of HSPS found divided by the number of standard HSPS. 142 A.2.3 Experimental Results Two sets of experiments are conducted using various q and h combinations on both the E. coli database and the entire GenBank nucleotide database. To fully understand the behavior of approximate matching for different HSPS, four HSP categories are defined for the E. coli database and two HSP categories for the GenBank nucleotide database based on the length of the HSPs and their similarity to the query. The categories are shown in Table A.1. Table A.1: Categories of HSPS returned by Probes E. coli Category I II III IV total L*220 L*Z40 L*Z4O 20$L*$40 8+ 3 80 S+ 3 80 H S P N umber 482 387 52 43 192 ActualS [63, 100: :63, 95: :63, 89] [63, 80: :69, 80: ActualL [15, 70: :20, 70: :40, 70: :40, 70: :25, 40: GenBank Category V VI total L* 2 40 20 g L* _<_ 40 5+ 3 80 5+ 3 80 H S PN umber 2690 205 ActualS [65, 100: [65, 80: :76, 80: ActualL [21, 70 :40, 70] :35, 40: L*: HSP length; 8+: Similarity in percentage. The sensitivities of each q/ h combination are calculated based on these HSP categories. Figures A.2—A.6 show the experimental results on the cost-sensitivity relationships for some typical q/ h combinations. Note that only q-grams shorter then the minimum HSP length in a HSP category are examined. Since 11/0 is the default word size for BLASTN, it is used as the benchmark in the following discussions. In Figures A.2-A6, all q/ h combinations that have a better sensitivity and cost than 11/0 are identified within the dotted lines at the upper-left corner of the figures. In Figure A.2, it is shown that 11/0 only finds 22.0% of the total HSPS that are longer than 20, while 20/ 4 is able to find 77.8% 143 HSPS with a similar cost. 16 / 3 is able to find 90.7% of the HSPS with 10 times more cost than 11/0. However, its cost is still about 10 times less than that of 7/0. In each figure, there always exist some combinations of q/ h perform better than 11/0 both in cost and sensitivity, such as 17/ 2 in Figure A.2 and 24/ 5 in Figure A.5. Moreover, some combinations of q/ h such as 20/ 3 are consistently better than 11/0 in all HSP categories. Such phenomenon indicates that approximate matching is able to achieve better performance in both sensitivity and cost than exact matching. Table A.2 shows three combinations which are always better than 11/0 in cost and sensitivity for all categories. For example, compared with 11/0, on the average, 20/ 3 improves the sensitivity by 55%, while reduces the cost by 90%. Table A.2 also shows an interesting trend: with properly increasing word length and Hamming distance, sensitivities and costs are improved continuely. It is observed that with a fixed word length, both cost and sensitivity increases as the Hamming distance increases. With a fixed Hamming distance, both cost and sensitivity decreases as the word length increases. This trend is consistent among all HSP categories; however, the relationship of sensitivity may not be consistent if both Hamming distance and word length changes. For example, the sensitivity of 16/ 2 is a little better then 13/1 in Figure A.2; however, it is worse than 13/1 in Figure A.3. This indicates that different HSP categories may affect the performance of approximate matching. In order to achieve certain sensitivity requirement, both q and h have to be adjusted in order to reduce the cost. Table A.3 shows the q / h combinations that achieve the least cost under corresponding sensitivity requirements. Since 7 / 0 is used to construct the standard answer set, the cost of approximate matching is compared with the cost of 7/0 at different sensitivity levels. It is shown that approximate matching is able to achieve high sensitivity with relatively low cost. For example, in category II, the sensitivity of 30/ 10 is 98%; however, the cost is 144 Sensitivity (%) Sensitivity (%) 1— I I IIIIIII I I IIIIIII I I IIIIIII I IIIIIIII] I I IIIIII] I I IIIIIII I I IIIIIL I h I d 100 — . 7’0 — _ .1814 . : .1914 016/3 : 1 1412 1111 - 4 o o 2 80 __ ' ’20/ .1713 _ . 1 .1813 .15’2 . . E .1711 - ~ .1913: ~ 60 _ r160 — .. -l b .2013 g .1311 . _ .1712 I a I 40 — l — _ .1812 I _ _ .2113 E j ' 019/2 .1511 I ‘ 20 : __________________________________ 111/0 ‘ - 02°” .1611 a : .1711 Z 0 1 l l llLlll’lsl/O 1 1111111 1 IJJJllll l 1 1111111 L I [111111 1 l I 111111 1 1 lil 10 100 1000 10000 le+05 le+06 le+07 lc+08 Cost (Allgnments) Figure A.2: E. coli, HSP Category I - III I I I IIIIII I I I IIIIII I I I IIIIII I I II‘IIII I I IIIIIII I I I IIIIII I I I IIIIII _ 5 - ._ ' 7/0 .— 100 1 JO/lo I473 ‘ 24o 1513 ° ‘ _ ’ .1814 ° . .. ' 020/5 - 1914 - 1 ’ .1613 ‘ 80 — — .1412.1111 ' ’20“ 1512 1211 - 30l8 27n __ o o p133 _1 ' 2415 1311 7 60 __ o | o _ .1913 : I _ 2013 1712 . __________’__-_°.-___.'1110 .1812 F .1511 ‘ 3015 1912 40 '— ° .2012 9 _ _ 0‘6“ - .2413 .1711 _ .2713 _ 2011 - ° .1510 - 20 [111' l I l I [“11 l l l LLIJJJ I 1 111111] 1 l l Llllll L Ll Lillll l 1 1 111111 1 I. l 1 11111 l 10 100 1000 10000 le+05 le+06 16407 Cost (Alignments) Figure A.3: E. coli, HSP Category II 145 Sensitivity (%) Sensitivity (%) 100 80 20 100 80 8 3 N C O _ I I I IIIII' I I I IIIIII I I IIIIIII I I II‘IIII I I I IIIIII I I I IIIIII I I IIIIIL I 1— l -1 ' 2719 710 _ — = 4m ° ..7. . , . _ 2417 1513 - I ’ .1814 ° 4 - I .2015 7 - l -1 1914 _ E ° .15/3 ._ 1412 1111 . ' .2014 ’ ’ _ .1512 .1211 - ' 3018 27n - _ O 0 p183 _ : .2415 .1311 Z _ 0‘9’3 .1 ‘ 2013 1712 ‘ ..._ _______ ’ ’ ____..1110 _ .1812 - - 1912 1511 - - ’ .1611 ° _ .1711 I- n: p— _- 27 I- . —1 . .1510 . q I llIIlIIl I LlIlIIIl I I ILIIIII l l 1111“] I LIIIIIII I I Illllll I l IIIlll l 10 100 1000 10000 le+05 le+06 le+07 Cost (Allgnments) Figure A.4: E. coli. HSP Categorv III p- I I IIIIIII I I IIIIIII I IIIIIIII I I III'IIII I I IIIIIII I I IIIIIII I IHIIIIL F- 7” -1 — . d 1714 _ .205 .1513 o A I 0163 I — .1914 0‘3’2 — _ .1412 .1111 _ r 1713 j _ .2014 0 _ I- -1 _ 1211 ., .1512 0 .- I .1 _ .1813 _ : .1311 : _ .1913 g . r- | -1 I — 0 — I t .1712 : ‘ t- - .. __ .4203. ________ ..1110 ‘ L .1711 .190 _‘ [- l l IllIlll I I I LIIIIl l I l LIIIIL J l lIlIlIl l Llllllll l I 111111] I I 11111 1 10 100 1000 10000 le+05 le+06 le+07 Cost (Alignments) Figure A.5: E. coli, HSP Category IV 146 Sensitivity (%) Sensitivity (%) _ N I IIIIIII I I I IIIIII I I FIUIIII I I IIIIIII I I IrIIlll I I IIIIIL I b I - 100 — ' 07’0 -~ _ I “6,3 .1312 _ 7 I .1111 7 I- I u: I b : -1 80 — A . I .1813 . I ~ I 1311 - _ .2013 I ’ _ 1712 60 ._ O I _. 1. l _ I L I -1 l .1812 , — I -[ 40 g ______________________________ .‘lI/O __‘ 1. .l 20 — — _ .1510 q 1 1111111l L 1111111l 1 1111111l 1 1111111l 1 11111111 1 1111111 18000 le+05 le+06 le+07 le+08 1e+09 le+10 Cost (Alignments) Figure A.6: GenBank. HSP Categorv V F I I I IIIIII I I I IIIIII I I TI'IIII I I I IIIIII I I I IIIIIr I I rIITIL I h ' d I 100 7' I 916/3 .1312 97/0 — 1— : .188 -I I .1111 D ' q 1. I ., I 80 '— : _ F l -1 I. I 16,2 - .20/3 : h | d 60 - I _ _ . .1311 . . I ~ 7 .1712 I ‘1 I- | d 40 - I '1 ' .1812 I 1- ' cl _ I a _ _______________________________ .‘ll/O q 20 — — '- .15/0 .1 I I IIIIIII l I IIllIlI I I IlllllI I LIIIIIII I I I IIIIII I I IIIIII 10000 le+05 le+06 1e+07 le+08 1e+09 le+10 Cost (Allgnments) Figure A.7: GenBank, HSP Category VI 147 only 2% of that using 7 / 0. In category V, 16/ 3 finds 96% of HSPS using only 8% cost of 7 / 0. Note that, since the HSPS returned by 7/0 are used as the standard HSP set to calculate the sensitivities, the sensitivity of 7/0 is always 100%. In real application, there could exist HSPS that cannot be found by 7/0, but can be hit by approximate matching. Table A.2: Combinations better than 11/0 in both cost and sensitivity CT1 I 11 III IV q/h 0' S+ C" S+ 6" S+ C" 5+ 14/1 0.61 1.35 0.61 1.08 0.60 1.12 0.60 1.89 17/2 0.24 1.53 0.24 1.04 0.22 1.06 0.22 1.37 20/3 0.08 1.84 0.08 1.04 0.06 1.06 0.06 1.05 0T1i v VI Avg. q/h C“ 3+ 0- 5+ C" 5+ 14/1 0.90 1.44 0.90 1.58 0.70 1.41 17/2 0.42 1.50 0.42 1.90 0.29 1.40 20/3 0.16 1.58 0.16 2.72 0.10 1.55 CT“: HSP Category; 0‘: Cost / (Cost of 11/0); 5+: Sensitivity / (Sensitivity of 11/0). As shown in Figure A.2 (HSP 2 20) and A.3 (HSP 2 40), the sensitivities of all combinations decrease when HSPS are shorter; however, 11/0 decreases faster than other combinations. It is concluded that approximate matching is less affected by HSP length than exact matching. For relatively low similarities and shorter HSPS, approximate matching has a greater advantage. For example, in Figure A.5, 11/0 only finds 9.9% of the HSPS while 20/ 4 is able to find 60.4% of the HSPS with similar cost. To measure the consistency of above observations, experiments are carried out using some typical q/ h combinations in a much larger database (i.e., the GenBank nucleotide database). As shown in Figure A6 and A.7, the performances of the selected q/ h combinations have a very similar trend to that of the E. coli database. The sensitivities of all combinations are higher than those in the E. coli database, 148 Table A.3: Combinations satisfy minimum sensitivity. C'Tr I II 111 MS‘ q/h C‘ 5+ q/h C‘ 5+ q/h C" 5+ 40 20 / 3 0.00037 0.40 30 / 5 0.000018 0.40 20 / 3 0.00029 0.42 60 18/3 0.0053 0.64 30/8 0.00051 0.69 30/8 0.0005 0.63 70 20/4 0.0058 0.78 20/4 0.0058 0.75 19/4 0.02 0.81 80 19/4 0.02 0.88 19/4 0.02 0.85 19/4 0.02 0.81 90 16/3 0.06 0.91 30/10 0.02 0.98 30/10 0.02 0.98 CT“ VI V VI MS‘ q/h C" 5+ q/h C' 3+ q/h C“ 5+ 40 18/3 0.0051 0.41 18/2 0.00044 0.46 20/3 0.0006 0.66 60 20/4 0.0057 0.60 20/3 0.00059 0.64 20/3 0.0006 0.66 70 19/4 0.02 0.79 16/2 0.0052 0.71 18/3 0.0071 0.94 80 20/5 0.06 0.92 16/3 0.08 0.96 18/3 0.0071 0.94 90 20/5 0.06 0.92 16/3 0.08 0.96 18/3 0.0071 0.94 CT”: HSP Category; M 5“ Minimum sensitivity; 0‘: Cost / (Cost of 7/0); 5+: Sensitivity / (Sensitivity of 7/0). because the GenBank nucleotide database contains more HSPS with higher similarities. A.3 Theoretical Analysis A theoretical model is developed to further analyze the performance of approximate and exact matching for the filtering stage. The problem is studied in a probabilistic framework where artificial genomic sequences are generated according to the Bernoulli model, which is also used by BLAST for their scoring system. In the Bernoulli model, every symbol of a finite alphabet A is created independently of the other symbols with different probabilities (identical independent distribution, i.i.d.). The following notations are used in the description of our theoretical model: 3: The similarity ratio between two regions. A: the nucleotide alphabet of size |A| = 4. PD(a): The distribution of a letter a E A in database sequences. PQ (a): The distribution of a letter a E A in query sequences. 149 H Dist(fil, 62): The Hamming distance between two q—grams: Bl and ,32. A.3.1 False Hit Probability (FHP) According to the Bernoulli model, if one letter is drawn from a query sequence and one letter is drawn from database sequences, the probability that the two letters are identical is Pm = Zae A PD(a) :1: PQ (a). The probability that two q-grams are identical is M (q) = (Pm)q. Therefore, the probability of h mismatches between two q—grams is: M’(q. h) = (z) x (PM—ha — an)". (A.1) where (’ql) is the number of combinations of q taken h at a time. The probability that two q-grams are within Hamming distance h is: h . . Mm. h) = 2(3) >< (Pm14"(1— Pm)‘. (A.2) i=0 M f (q, h), called False Hits Probability (FHP), is the probability of matching a query q-gram by chance among database q-grams. A.3.2 True Hit Probability (THP) Assume the similarity between two homologous regions, HRl and H122, is 3. Similar to M f (q, h), for a q—gram, 61, from HRI, the probability of having a q—gram 62 in HR2 and HDist(fil,flg) g h is: h MM. 12) = 2(3) x sq—iu - s11 (A.3) i=0 Mt(q, h), called True Hit Probability (THP), is the probability of matching a q-gram within a homologous region. The higher the value of 114t(q, h), the better is 150 the chance that the homologous regions in the database will be selected in the filtering stage, hence, the higher is the sensitivity. A.3.3 Verifying FHP and THP To verify the correctness of the theoretical model, the theoretical F HP and THP are compared with corresponding experimental values. The experimental FHP is computed as the number of total hits divided by the search space, which is the number of query q-grams times the number of database q-grams. Table A.4 shows the alphabet distributions of the queries and databases used in our experiments. Table A.5 shows the comparison between theoretical and experimental results of FHP. Query q-grams in this experiment are generated from the same probes as mentioned in Section A.2.2. It is shown that the experimental results are close to the corresponding theoretical FHP values. The errors between theoretical and experimental values are caused by the fact that the alphabet distribution in genomic sequences is not totally independent. When the theoretical FHP is very low, such as that of 20/ 2, the non-independent factor, which is not included in the model, starts to dominate the number of false hits. To verify the correctness of the theoretical THP, the HSPS returned by the BLASTN are used as the answer set. The experimental THP is the number of true hits divided by the total number of q—grams within all the HSPS. Since the similarity value varies among HSPS, the final theoretical THP value is normalized by the percentage of each valid similarity in the answer set. As shown in Table A6, the experimental results are close to the theoretical values, which verifies the correctness of the model. 151 Table A.4: Alphabet Distribution A T C G Probes 0.26571 0.27571 0.21333 0.24524 E .coli 0.24639 0.2461 1 0.25404 0.25347 GenBank 0.28837 0.28408 0.21293 0.21462 Table A.5: False Hit Probability (x 1e—10) q / h Theoretical Experimental 7/0 6051012 7580745 11/0 2352 3679.3 12 / 1 24369.3 30950.8 13/1 5728.4 8692.1 13/2 1090089 1467631 16/1 109.8 88.5 16/2 2583.3 3716.1 16/3 37269.7 52180.9 20/1 0.5 0 20/ 2 15.8 2.1 20/3 290.2 246.4 20 / 4 3795.5 5326.1 Table A.6: True Hit Probability q/ h I Theoretical Experimental 7/0 0.137 0.154 11 /0 0.0485 0.0594 13/1 0.13 0.16 16/2 0.195 0.239 18/3 0.287 0.345 20/1 0.0325 0.0475 20/2 0.107 0.135 20/3 0223 0.273 20/4 0372 0.443 Query: Probes; DB: GenBank 152 A.3.4 Discussion Based on Theoretical THP and FHP According to Equations A.2 and A.3, fixing the word length, both THP and F HP increase as the Hamming distance increases; fixing the Hamming distances, both THP and FHP decrease as the word length increases. This trend is observed in the experimental results. Since THP is the probability of matching one query q-gram with one HSP q-gram, the value of THP is much lower than that of the actual sensitivity since there are multiple trials within a HSP; however, using THP can approximate the trend of sensitivity. Therefore, it is possible to estimate the cost and sensitivity of approximate matching using the model. Note that, for a given database and sensitivity requirement, very low F HP may not be necessary. For example, if the database size is 1e+7 and query sequence is le+3, the search space is about 2e+10 (since both strands must be considered), theoretically, any FHP less than 2e—10 generates less than one hit (2e+10 * 2e—10 = 1). Therefore, it is not necessary to further reduce the FHP. Since approximate matching is more expensive than exact matching, among all combinations which satisfy given complexity and sensitivity criteria, it is cost-effective to chose the shortest word length and hamming distance combination. The model can be used to find the best choice of q/h to satisfy user specified system cost and sensitivity. A.4 Conclusion Searching genomic sequence databases becomes increasingly challenging as the rate of increase in database sizes continues to accelerate. Approaches based on q—grams are widely used. Most existing q-gram—based systems use exact matching to locate candidate regions in the filtering stage. Approximate matching beyond one mismatch has not been adopted in most existing q-gram-based systems. In this 153 appendix, it is shown that searching genomic sequence databases using longer word length and larger Hamming distance in the filtering stage provides an excellent opportunity for optimizing the search cost while improving the quality of the search. The encouraging results could be a motivation for researches in efficient calculations of Hamming distance. Moreover, the significant improvement in performance achieved by the Hamming distance-based search opens the possibilities of creating more advanced indexing schemes for large genomic sequence databases, where the number of the Hamming distance computations are minimal, and the cost of the Hamming distance computation in main memory is negligible compared to the cost of secondary memory accesses. Finally, a theoretical model is developed to further analyze the performance of approximate matching. The model can be used to find the best choice of q-gram length and the Hamming distance to satisfy user specified system cost and sensitivity. 154 Bibliography [1] E. W. Myers, “An overview of sequence comparison algorithms in molecular biology,” Tech. Rep. TR—91-29, University of Arizona, November 1991. [2] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge: Cambridge University Press, 1997. [3] S. F. Altschula, W. Gisha, W. Millerb, E. W. Meyersc, and D. J. Lipman, “Basic local alignment search tool,” Journal of Molecular Biology, vol. 215, no. 3, pp. 403-410, 1990. [4] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” Doklady Akademii Nauk SSSR, vol. 163, no. 4, pp. 845-848, 1965 (Russian). [5] N. R. Dixon and T. B. Martin, Automatic Speech and Speaker Recognition. New York, NY, USA: John Wiley & Sons, Inc., 1979. [6] J. Zobel and P. Dart, “Phonetic string matching: lessons from information retrieval,” in SIGIR .96: Proceedings of the 19th annual international ACM SI CIR conference on Research and development in information retrieval, (New York, NY, USA), pp. 166—172, ACM Press, 1996. [7] R. Baeza-Yates and B. Ribiero-Neto, Modern Information Retrieval. Addison Wesley Longman Publishing Co. Inc, 1999. [8] J. C. French, A. L. Powell, and E. Schulman, “Applications of approximate word matching in information retriev ,” in CIKM .97: Proceedings of the sixth international conference on Information and knowledge management, (New York, NY, USA), pp. 9—15, ACM Press, 1997. [9] G. Navarro, “A guided tour to approximate string matching,” ACM Computing Surveys, vol. 33, no. 1, pp. 31—88, 2001. [10] K. Auerbach, “Largest commercial database in winter corp. topten survey tops one hundred terabytes.” Web, September 14, 2005. http: //www.wintercorp.com/PressReleases/ttp2005_pressrelease-091405.htm. [11] M. Wang and X. S. Wang, “Optimizing relational store for e-catalog queries: a data mining approac ,” in Proceedings of the 2002 ACM symposium on Applied computing, pp. 1147—1152, ACM Press, 2002. [12] F. Tian, D. J. DeWitt, J. Chen, and C. Zhang, “The design and performance evaluation of alternative xml storage strategies,” SI GM 0D Rec., vol. 31, no. 1, pp. 5—10, 2002. [13] D. A. Benson, 1. Karsch—Mizrachi, D. J. Lipman, and J. Ostell, “Cenbank,” Nucleic Acids Research, vol. 32, no. 1, pp. 23—26, 2004. 155 [14] A. Bairoch and R. preiler, “The swiss—prot protein sequence database and its supplement trembl in 2000,” Nucleic Acids Research, vol. 28, no. 1, pp. 45—48, 2000. [15] C. Kanz, P. Aldebert, N. Althorpe, W. Baker, A. Baldwin, K. Bates, P. Browne, A. van den Broek, M. Castro, G. Cochrane, K. Duggan, R. Eberhardt, N. Faruque, J. Gamble, F. G. Diez, N. Harte, T. Kulikova, Q. Lin, V. Lombard, R. Lopez, R. Mancuso, M. McHale, F. Nardone, V. Silventoinen, S. Sobhany, P. Stoehr, M. A. Tuli, K. Tzouvara, R. Vaughan, D. Wu, W. Zhu, and R. preiler, “The embl nucleotide sequence database,” Nucleic Acids Research 2005, vol. 33, 2005. [16] “Dna data bank of japan,” April 21, 2005. http://www.ddbj.nig.ac.jp/. [17] “Human genome project,” April 21, 2005. http: / / www.0rnl. gov / sci / techresources / Human-Genome / home.shtml. [18] “Scalable i/o project,” Lawrence Livermore National Laboratory, May 16, 2005. http: / / www.11nl.gov/ icc/ lc / siop/ . [19] R. Bayer and E. M. McCreight, “Organization and maintenance of large ordered indexes,” Acta Informatica, vol. 1, no. 3, pp. 173—189, 1972. [20] D. R. Morrison, “Patricia practical algorithm to retrieve information coded in alphanumeric,” J. ACM, vol. 15, no. 4, pp. 514—534, 1968. [21] R. Bayer and K. Unterauer, “Prefix b-trees,” ACM Trans. Database Syst., vol. 2, no. 1, pp. 11—26, 1977. [22] P. Ferragina and R. Grossi, “Fast string searching in secondary storage: theoretical developments and experimental results,” in Proceedings of the seventh annual ACM-SIAM symposium on Discrete algorithms, pp. 373—382, Society for Industrial and Applied Mathematics, 1996. [23] P. Ferragina, Dynamic Data Structures for String Matching Problems. PhD thesis, Univerisity of Pisa-Genova—Udine, Pisa, Italy, 1997. [24] P. Ferragina and R. Grossi, “The string b-tree: A new data structure for string search in external memory and its applications,” J. Assoc. Comput. Mach, vol. 46, no. 2, pp. 236—280, 1999. [25] E. M. McCreight, “A space-economical suffix tree construction algorithm,” J. ACM, vol. 23, no. 2, pp. 262—272, 1976. [26] P. Weiner, “Linear pattern matching algorithms,” in 14th Annual Symposium on Switching and Automata Theory, pp. 1—11, IEEE, October 1973. 156 [27] U. Manber and G. Myers, “Suffix arrays: a new method for on—line string searches,” in Proceedings of the first annual ACM-SIAM symposium on Discrete algorithms, pp. 319—327, Society for Industrial and Applied Mathematics, 1990. [28] G. H. Gonnet, R. A. Baeza-Yates, and T. Snider, “Lexicographical indices for text: Inverted files vs. pat trees,” Tech. Rep. OED-91-01, University of Waterloo, 1991. [29] D. Comer, “Ubiquitous b-tree,” ACM Comput. Suru., vol. 11, no. 2, pp. 121—137, 1979. [30] J. Cl’ement, P. Flajolet, and B. Vall’ee, “The analysis of hybrid trie structures,” in Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms, pp. 531-539, Society for Industrial and Applied Mathematics, 1998. [31] T. H. Merrett, H. Shang, and X. Zhao, “Database structures, based on tries, for text, spatial, and general data,” in International Symposium on Cooperative Database Systems for Advanced Applications, pp. 316—324, December 1996. [32] W. B. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures 6 Algorithms. Prentice Hall, 1992. [33] D. R. Clark and J. I. Munro, “Efficient suffix trees on secondary storage,” in Proceedings of the seventh annual ACM-SIAM symposium on Discrete algorithms, (Atlanta, Georgia, United States), pp. 383—391, Society for Industrial and Applied Mathematics, 1996. [34] R. W. Hamming, “Error-detecting and error-correcting codes,” Bell System Technical Journal, vol. 29, no. 2, pp. 147—160, 1950. [35] S. F. Altschula, “Amino acid substitution matrices from an information theoretic perspective,” Journal of Molecular Biology, vol. 219, pp. 555—665, 1991. [36] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998. [37] E. M. Voorhees and D. Harman, “Overview of the sixth text retrieval conference (tree—6),” in Proceedings of the Sixth Text REtrieval Conference, pp. 1-24, NIST Special Publication, 1997. [38] P. Ciaccia, M. Patella, and P. Zezula, “M-tree: an efficient access method for similarity search in metric spaces,” in Proceedings of the 23rd Very Large Data Bases, (Athens, Greece), pp. 426—435, Auguest 1997. 157 [39] K. Chakrabarti and S. Mehrotra, “The hybrid tree: An index structure for high dimensional feature spaces,” in Proceedings of the 15th I CDE, IEEE Computer Society, 1999. [40] G. Qian, Q. Zhu, Q. Xue, and S. Pramanik, “The nd-tree: A dynamic indexing technique for multidimensional non-ordered discrete data spaces,” in Proceedings of 29th VLDB Conference, pp. 620—631, 2003. [41] S. Henikoff and J. G. Henikoff, “Amino acid substitution matrices from protein blocks,” Proceedings of the National Academy of Science USA, vol. 89, pp. 10915—10919, November 1992. [42] S. F. Altschul, T. L. Madden, A. A. Schaeffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, “Gapped blast and psi-blast : a new generation of protein database search programs,” Nucleic Acids Research, vol. 25, pp. 3389—3402, September 1997. http: / /www.ncbi.nlm.nih.gov/ BLAST. [43] R. Fagin, J. Nievergelt, N. Pippenger, and H. R. Strong, “Extendible hashing a fast access method for dynamic files,” ACM Trans. Database Syst., vol. 4, no. 3, pp. 315—344, 1979. [44] E. H. Fredkin, “Trie memory,” Communications of the ACM, vol. 3, no. 9, pp. 490—499, 1960. [45] P. Flajolet, “On the performance evaluation of extendible hashing and trie searching,” Acta Inf., vol. 20, pp. 345—369, 1983. [46] H. Mendelson, “Analysis of extendible hashing,” IEEE Trans. Software Eng, vol. 8, no. 6, pp. 611—619, 1982. [47] P.-A. Larson, “Performance analysis of linear hashing with partial expansions,” ACM Trans. Database Syst., vol. 7, no. 4, pp. 566-587, 1982. [48] J. K. Mullin, “Spiral storage: Efficient dynamic hashing with constant performance,” Comput. J., vol. 28, no. 3, pp. 330—334, 1985. [49] R. A. Baeza-Yates and H. Soza-Pollman, “Analysis of linear hashing revisited,” Nord. J. Comput., vol. 5, no. 1, 1998. [50] R. J. Enbody and H. C. Du, “Dynamic hashing schemes,” ACM Comput. Surv., vol. 20, no. 2, pp. 85—113, 1988. [51] D. E. Knuth, The art of computer programming: sorting and searching, vol. 3. Addison Wesley, 1998. [52] A. C.-C. Yao, “On random 2-3 trees,” Acta Inf, vol. 9, pp. 159—170, 1978. [53] R. A. Baeza-Yates and P.-A. Larson, “Performance of b+-trees with partial expansions,” IEEE Trans. Knowl. Data Eng, vol. 1, no. 2, pp. 248—257, 1989. 158 [54] D. Lomet, “The evolution of effective b—tree: page organization and techniques: a personal account,” SIGM OD Rec., vol. 30, no. 3, pp. 64—69, 2001. [55] P. J. Weinberger, “Unix b—trees,” Tech. Rep. TM-81-11272-1, AT&T Bell Laboratories, 1981. [56] L. Devroye, “A note on the average depth of tries,” Computing 28, pp. 367—371, 1982. [57] E. Ukkonen, “On-line construction of suffix-trees,” Algorithmica, vol. 14, pp. 249—260, 1995. [58] M. Farach, “Optimal suffix tree construction with large alphabets,” in F OCS, pp. 137—143, 1997. [59] G. Jacobson, “Succinct static data structures,” Tech. Rep. CMU-CS-89—112, Carnegie Mellon University, 1982. [60] Q. Wang, “Performance projection for disk-based indexing of the genomic databases,” Master’s thesis, Department of CSE, Michigan State University, 2002. [61] Sleepycat, “Berkeley db.” http://www.sleepycat.com/. [62] D. Sankoff and J. Kruskal, eds., Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, 1983. [63] D. Adam, “Draft human genome sequence published,” Nature, February 2001. [64] F. S. Collins, E. D. Green, A. E. Guttmacher, and M. S. Guyer, “A vision for the future of genomics research — a blueprint for the genomic era,” Nature, vol. 422, pp. 835—847, April 2003. [65] M. Margulies, M. Egholm, W. E. Altman, and etc., “Genome sequencing in microfabri- cated high-density picolitre reactors,” Nature advanced online publication, 2005. http: / / www.nature.com / nature / j ournal / vaop / ncurrent / abs / nature03959.html. [66] G. Fuellen, “Multiple alignment,” Complexity International, vol. 4, 1997. [67] M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, Atlas of Protein Sequence and Structure, vol. 5 of 3, ch. A model of evolutionary change in proteins, pp. 345 — 352. National Biomedical Research Foundation, 1978. [68] S. Henikoff and J. G. Henikoff, “Automated assembly of protein blocks for database searching,” Nucleic Acids Res, vol. 19, pp. 6565—6572, 1991. 159 [69] “Blast substitution matrices,” October 19, 2005. http: / / www.ncbi.nlm.nih. gov / blast / html / sub_matrix.html. [70] S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” Journal of Molecular Biology, vol. 48, pp. 443-453, 1970. [71] T. F. Smith and M. S. Waterman, “Identification of common molecular subsequence,” Journal of Molecular Biology, vol. 147, pp. 195——197, 1981. [72] O. Gotoh, “An improved algorithm for matching biological sequences,” Journal of Molecular Biology, vol. 162, pp. 705—708, 1982. [73] W. R. Pearson and D. J. Lipman, “Improved tools for biological sequence comparison,” Proceedings of the National Academy of Science USA, vol. 85, no. 8, pp. 2444-2448, 1988. [74] W. R. Pearson, “Flexible sequence similarity searching with the fasta3 program package,” Methods Mol. Biol, vol. 132, pp. 185—219, 2000. [75] E. J. Gumbel, Statistics of Extremes. Columbia University Press, 1958. [76] S. F. Altschul and W. Gish, “Local alignment statistics,” Methods in Enzymology, vol. 266, pp. 460—480, 1996. [77] W. R. Pearson, “Empirical statistical estimates for sequence similarity searches,” Journal of Molecular Biology, vol. 276, pp. 71—84, 1998. [78] R. Mehnert and K. Cravedi, “Public collections of dna and ma sequence reach 100 gigabases,” August 22, 2005. http: / / www.nlm.nih.gov / news / press-releases / dna.rna_1 00_gig.html. [79] W. J. Kent, “Blat the blast-like alignment tool,” Genome Res., vol. 12, pp. 656-664, April 2002. [80] H. E. Williams and J. Zobel, “Indexing and retrieval for genomic databases,” IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 1, pp. 63—78, 2002. [81] H. E. Williams, “Compressed indexing for genomic retrieval,” J. Mathematical Modelling and Scientific Computing, vol. 9, no. 2, pp. 144-154, 1998. [82] S. Burkhardt and J. Krkkinen, “Better filtering with gapped q—grams,” Fundam. Inf., vol. 56, no. 1,2, pp. 51—70, 2003. [83] D. J. States, W. Gish, and S. F. Altschul, “Improved sensitivity of nucleic acid database searches using application-specific scoring matrices,” Methods: A companion to Methods in Enzymology, vol. 3, no. 1, pp. 66—77, 1991. 160 [84] E. Hunt, M. P. Atkinson, and R. W. Irving, “Database indexing for large dna and protein sequence collections,” The VLDB Journal, vol. 11, no. 3, pp. 256—271, 2002. [85] G. Navarro and R. Baeza—Yates, “A hybrid indexing method for approximate string matching,” Journal of Discrete Algorithms (JDA), vol. 1, no. 1, pp. 205—239, 2000. Special issue on Matching Patterns. [86] E. Hunt, “Indexed searching on proteins using a suffix sequoia,” Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, vol. 27, pp. 24—31, 2004. [87] S. Burkhardt, A. Crauser, P. Ferragina, H.—P. Lenhof, E. Rivals, and M. Vingron, “q-gram based database searching using a suffix array (quasar), in Proceedings of the third annual international conference on Computational molecular biology, pp. 77—83, ACM Press, 1999. 7, [88] E. Ukkonen, “Approximate string-matching over suffix trees,” in Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching, pp. 228—242, Springer-Verlag, 1993. [89] G. Z. Hertz and G. D. Stormo, “Identifying dna and protein patterns with statistically significant alignments of multiple sequences,” Bioinformatics, vol. 15, no. 7, pp. 563—577, 1999. [90] K. J. Kechris, E. van Zwet, P. J. Bickel, and M. B. Eisen, “Detecting dna regulatory motifs by incorporating positional trends in information conten ,” Genome Biology, vol. 5, no. 7, 2004. [91] V. Kunik, Z. Solan, S. Edelman, E. Ruppin, and D. Horn, “Motif extraction and protein classification,” in Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference (CSB05), 2005. [92] S. R. Eddy, “Profile hidden markov models,” Bioinformatics, vol. 14, no. 9, pp. 755—763, 1998. [93] J. Park, K. Karplus, C. Barrett, R. Hughey, D. Haussler, T. Hubbard, and C. Chothia, “Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods,” Journal of Molecular Biology, vol. 284, no. 4, pp. 1201—1210, 1998. [94] S. Henikoff, “Scores for sequence searches and alignments,” Curr Opin Struct Biol, vol. 6, no. 3, pp. 353—360, 1996. [95] M. Wistrand and E. L. Sonnhammer, “Improved profile hmm performance by assessment of critical algorithmic features in sam and hmmer,” BM C Bioinformatics, vol. 6, no. 1, pp. 99—109, 2005. 161 [96] “Hmmer: profile hmms for protein sequence analysis,” October 27, 2005. http://hmmer.wustl.edu/. [97] M. Gribskov, A. D. McLachlan, and D. Eisenberg, “Profile analysis: Detection of distantly related proteins,” vol. 84, pp. 4355-4358, 1987. [98] L. R. Rabiner, “A tutorial on hidden markov models and selected applicaitons in speech recognition,” in Proceedings of the IEEE, vol. 77, pp. 257—286, 1989. [99] A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler, “Hidden markov models in computational biology: Applications to protein modeling,” Journal of Molecular Biology, vol. 235, no. 1501-1531, 1994. [100] A. Papoulis, Probability, Random Variables, and Stochastic Processes, 2nd ed, ch. Brownian Movement and Markoff Processes, pp. 515—553. McGraw-Hill, 1984. [101] C. Barrett, R. Hughey, and K. Karplus, “Scoring hidden markov models,” Comput. Applic. Biosci., vol. 13, pp. 191—199, 1997. [102] S. R. Eddy, HMMER Users’ Guide, October 2003. ftp: / / ftp. geneticswustledu / pub / eddy / hmmer / CURRENT / Userguide.pdf. [103] A. Bateman, L. Coin, R. Durbin, R. D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E. L. L. Sonnhammer, D. J. Studholme, C. Yeats, and S. R. Eddy, “The pfam protein families database,” Nucleic Acids Res, vol. 32, pp. D138—D141, 2004. [104] A. Bateman, E. Birney, L. Cerruti, R. Durbin, L. Etwiller, S. R. Eddy, S. Griffiths-Jones, K. L. Howe, M. Marshall, and E. L. L. Sonnhammer, “The pfam protein families database,” Nucleic Acids Res, vol. 30, no. 1, p. 276280, 2002. [105] “The pfam database of protein families and hmms,” August, 2005. http: //pfam.wustl.edu/ . [106] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, pp. 379—423, 1948. [107] T. D. Schneider, G. D. Stormo, L. Gold, and A. Ehrenfeucht, “Information content of binding sites on nucleotide sequences,” Journal of Molecular Biology, vol. 188, pp. 415—431, 1986. [108] M. Tompa, “An exact method for finding short motifs in sequences, with application to the ribosome binding site problem,” in Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pp. 262—271, AAAI Press, 1999. 162 [109] D. A. Benson, 1. Karsch-Mizrachi, D. J. Lipman, J. Ostell, B. A. Rapp, and D. L. Wheeler, “Genbank,” Nucleic Acids Research, vol. 28, no. 1, pp. 15-18, 2000. [110] “Genbank statistics,” April 21, 2005. http: / / www.ncbi.nlm.nih. gov / Genbank / genbankstatshtml. [111] E. Hunt, “The suffix sequoia index for approximate string matching,” Tech. Rep. TR-2003-135, Department of Computing Science, Glasgow University, March 2003. [112] E. Hunt, R. W. Irving, and M. Atkinson, “Persistent suffix trees and suffix binary search trees as dna sequence indexes,” Tech. Rep. TR—2000—63, Department of Computing Science, Glasgow University, October 2000. [113] H. Hyyr6 and G. Navarro, “A practical index for genome searching,” in Proceedings of the 10th International Symposium on String Processing and Information Retrieval (SPIRE 2003), LNCS 2857, pp. 341—349, Springer, 2003. [114] G. Navarro, R. Baeza-Yates, E. Sutinen, and J. Tarhio, “Indexing methods for approximate string matching,” IEEE Data Engineering Bulletin, vol. 24, no. 4, pp. 19—27, 2001. Special issue on Managing Text Natively and in DBMSs Invited paper. 163 IIIIIIIIIIIIIIIIIIIIIII 111]1111]1]]1]1111]1]]][1