INDEXING OF MULTIDIMENSIONAL DISCRETE DATA SPACES AND HYBRID EXTENSIONS

By

Changqing Chen

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Computer Science

2009

ABSTRACT

INDEXING OF MULTIDIMENSIONAL DISCRETE DATA SPACES AND HYBRID EXTENSIONS

By Changqing Chen

In this thesis various indexing techniques are developed and evaluated to support efficient queries in different vector data spaces. Various indexing techniques have been introduced for the (ordered) Continuous Data Space (CDS) and the Non-ordered Discrete Data Space (NDDS). All these techniques rely on special properties of the CDS or the NDDS to optimize data accesses and storage in their corresponding structures. Besides conventional exact match queries, similarity queries and box queries are two types of fundamental operations widely supported by modern indexing techniques. A box query differs from a similarity query in that a box query in a multidimensional space looks up indexed data which meet query conditions on each and every dimension. The difference between similarity queries and box queries suggests that indexing techniques which work well for similarity queries may not necessarily support efficient box queries. In this thesis, we propose the BoND-tree, a new indexing technique designed for supporting box queries in an NDDS. Both our theoretical analysis and experimental results demonstrate that the new heuristics proposed for the BoND-tree improve the performance of box queries in an NDDS significantly.

The Hybrid Data Space (HDS) is a multidimensional data space which contains both (ordered) continuous and non-ordered discrete dimensions. In this thesis a novel indexing structure, the C-ND tree, has been developed to support efficient similarity queries in HDSs. To do so, some geometric concepts in the HDS are introduced. Novel node splitting heuristics which exploit characteristics of both the CDS and the NDDS are proposed. Our extensive experimental results show that the C-ND tree is quite promising in supporting similarity queries in HDSs.
To support box queries in the HDS, we extended the original ND-tree to the HDS to evaluate the effectiveness of the ND-tree heuristics on supporting box queries in an HDS. A novel power value adjustment strategy is used to make the continuous and discrete dimensions comparable and controllable in the HDS. An estimation model is developed to predict the box query performance of the hybrid indexing. Our experimental results show that the original ND-tree heuristics are effective in supporting box queries in an HDS, and could be further improved with our power adjustment strategies to address the characteristics of the HDS.

To my parents, Zhenguo and Yuzhi, my wife, Hua, and my sister, Qinghua

ACKNOWLEDGMENTS

I would like to acknowledge the guidance, support, and encouragement of many people during my PhD study, without whom I would not have been able to complete this thesis.

First, I would like to thank my advisor Dr. Sakti Pramanik for giving me the opportunity of studying at Michigan State University, which has changed my life totally. Dr. Pramanik gave me a great deal of help to improve my research work in the past few years. Under his excellent guidance, I learned how to conduct research in the academic area. My PhD study would not have been successful without the support and guidance from him. And I am very glad to have completed my study in his research group.

I would like to take the opportunity to express my appreciation to Dr. Qiang Zhu for his encouragement and direction throughout my PhD study. Dr. Zhu gave me valuable suggestions on my research. I appreciate all that he has done for me in support of my research work.

My sincere gratitude also goes to my committee members, Dr. James Cole, Dr. Charles Owen, and Dr. Juyang Weng. They were very generous with their time, ideas and assistance. Their advice and guidance have enriched my research work significantly.

Last but not the least, I would like to thank my parents and my wife for giving me the freedom to pursue my own dreams. My thanks go to my whole family for their years of important and indispensable support and encouragement of my PhD study. For this, I owe them a lifetime of gratitude.

This research work was supported by the US National Science Foundation (under grants #IIS-0414576 and #IIS-0414594), Michigan State University and the University of Michigan.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
1.1 Data Spaces
1.2 The Vector Space Model
1.3 Disk-based Dynamic Indexing Structures
1.4 Similarity Queries and Box Queries
1.5 Overview of the Dissertation
2 Related Work
2.1 Indexing in the Metric Space
2.2 Indexing Techniques of the CDS
2.2.1 Data-partitioning Approaches in the CDS
2.2.2 Space-partitioning Approaches in the CDS
2.2.3 Hybrid Approaches in the CDS
2.3 Indexing Techniques of the NDDS
2.3.1 The Bitmap Indexing Method
2.3.2 The String Indexing Method
2.3.3 The ND-tree and the NSP-tree
2.4 Challenges in Indexing the HDS
3 Box Queries in the NDDS
3.1 Introduction
3.2 Basic Concepts
3.3 Optimization of Indexing Trees in the NDDS
3.3.1 Indexing Problem for Box Queries
3.3.2 Expected I/O for Box Queries
3.3.3 The Splitting Problem for Box Queries
3.3.4 Optimal Splitting Heuristics for Box Queries in the NDDS
3.4 Construction of the BoND-tree
3.4.1 Insertion in the BoND-tree
3.4.2 Deletion in the BoND-tree
3.4.3 Box Query on the BoND-tree
3.5 Compression to Improve Node Fanout
3.5.1 The Compressed BoND-tree Structure
3.5.2 Effect of Compression on Splitting Overflow Nodes
3.6 Experimental Results
3.6.1 Effect of Different Database Sizes
3.6.2 Effect of Different Number of Dimensions
3.6.3 Effect of Alphabet Size
3.6.4 Effect of Different Query Box Sizes
3.6.5 Performance on Real Data Set
3.6.6 Performance of the Compressed BoND-tree
4 Range Queries in the HDS
4.1 Motivations and Challenges
4.2 Concepts and Notations for the HDS
4.3 The C-ND Tree
4.3.1 The Tree Structure
4.3.2 Normalized Geometric Measures
4.3.3 Choose Leaf Node for Insertion
4.3.4 Splitting an Overflow Node of the C-ND Tree
4.3.5 Processing Range Query Using the C-ND Tree
4.4 Experimental Results
4.4.1 Experimental Setup
4.4.2 Performance of Heuristics Used to Split Overflow Nodes
4.4.3 Performance Comparisons with the 10% Linear Scan, ND-tree and R*-tree
5 Box Queries in the HDS
5.1 Motivations and Challenges
5.2 Extended Hybrid Indexing
5.2.1 Hybrid Geometric Concepts and Normalization
5.2.2 Extending the ND-tree to the HDS
5.2.3 Enhanced Strategy for Prioritizing Discrete/Continuous Dimensions
5.3 Experimental Results
5.3.1 Experimental Setup
5.3.2 Performance Comparisons
5.3.3 Performance Gain with Increasing Database Sizes
5.3.4 Performance for Various Additional Dimensions
5.3.5 Performance for Different Alphabet Sizes
5.3.6 Performance for Different Query Box Sizes
5.3.7 Performance on Real Data
5.3.8 Effect of Enhanced Strategy with Power Value Adjustment
5.4 Performance Estimation Model
6 Conclusion
APPENDICES
A Proof of Theorems for Box Queries in the NDDS
A.1 Proof of Lemma A.0.1
A.2 Proof of Lemma A.0.2
A.3 Proof of Lemma A.0.3
B Algorithms for Performance Estimation Model
BIBLIOGRAPHY

LIST OF TABLES

3.1 Different values of indexed vectors on dimension i
3.2 Different overlap-free partitions from a continuous dimension i
3.3 Different overlap-free partitions from a discrete dimension i
3.4 Different component sets of non-leaf entries on dimension i
3.5 Different component sets for non-leaf entries E1-E12
3.6 Grouping of non-leaf entries
3.7 The item weights and values in the 0-1 knapsack problem
3.8 The candidate partition for an overflow node N found by solving the 0-1 knapsack problem
3.9 Probability of having a full dimension after indexing X vectors
3.10 Percentage of full dimension at non-leaf levels of the BoND-tree with different alphabet sizes
3.11 Percentage of full dimension at each level of the BoND-tree
4.1 Effectiveness of heuristics used to generate candidate partitions
4.2 Effectiveness of heuristics used to redistribute uncommon entries
5.1 Number of accumulated discrete and continuous splits at leaf level
5.2 Number of accumulated discrete and continuous splits at non-leaf level

LIST OF FIGURES

3.1 Effect of different database sizes in the basic BoND-tree
3.2 Effect of number of dimensions in the basic BoND-tree
3.3 Effect of different alphabet sizes in the basic BoND-tree
3.4 Effect of different query box sizes in the basic BoND-tree
3.5 Performance of indexing genome sequence data
3.6 Performance of the compressed BoND-tree with various dimensions
3.7 Performance of the compressed BoND-tree with various alphabet sizes
4.1 An example of the C-ND tree in a 2-dimensional HDS with discrete domain {A, G, T, C} and continuous domain [-10, 10]
4.2 An example of generating candidate partitions in the C-ND tree
4.3 A combination of discrete and continuous candidate partitions
4.4 An example of redistributing uncommon entries in a hybrid splitting procedure
4.5 Effect of database size in the C-ND tree
4.6 Effect of dimension mix in the C-ND tree
4.7 Effect of alphabet size in the C-ND tree
4.8 Effect of query ranges in the C-ND tree
4.9 Performance comparison of indexing GSOD data
5.1 Effect of various database sizes in the HDS
5.2 Effect of additional dimensions in the HDS
5.3 Effect of different alphabet sizes in the HDS
5.4 Effect of different query box sizes in the HDS
5.5 Performance comparison of indexing GSOD data
5.6 Performance for enhanced strategy with power value adjustment
5.7 Verification of performance model for given trees and given HDSs
5.8 Verification of performance model for given HDSs

CHAPTER 1

Introduction

In this chapter we will review fundamental concepts related to our research.
These concepts include: different vector data spaces and their properties involved in indexing techniques; the vector space model used to represent indexed data spaces; disk-based dynamic indexing structures widely supported in modern DBMSs (database management systems); and various query types popular in multidimensional space indexing.

1.1 Data Spaces

Much research work has been conducted in the area of indexing the CDS[89]. A CDS contains either a finite or an infinite number of elements which can be ordered based on their values. For example, the real number space is a 1-dimensional CDS and an n-dimensional Euclidean space is an n-dimensional CDS. In the database indexing area, attributes such as the height of people and wages of employees are typically treated as continuous data. Indexing of the multidimensional CDS relies on efficient organization of indexed data and the partitioning of spaces to limit the search space at query time. This in turn involves utilization of certain geometric concepts in the CDS[46, 42], such as rectangles, spheres, areas, etc.

On the other hand, the NDDS contains elements which can be compared as equal but cannot be ordered. For example, the genome data space has a 4-letter domain {'A', 'G', 'T', 'C'}. The letter 'A' in the genome domain is different from the letter 'G' or 'T'. But since there is no ordering property in the domain, 'A' is neither larger nor smaller than the other three letters. NDDS indexing is a relatively new topic in the database area. Recently the ND-tree[80, 79, 81] and the NSP-tree[81] have been proposed to support indexing in the NDDS. As in CDS indexing methods, basic geometric concepts such as area, overlap, span, etc. are defined and utilized in these indexing structures to organize data points in the NDDS for efficient partitioning of the indexed space.

A Hybrid Data Space (HDS) is a combination of the CDS and the NDDS (i.e., an HDS has one or more non-ordered discrete dimensions and one or more continuous dimensions). HDS indexing is a new topic and no indexing technique has been reported in the literature. In this thesis we will develop and evaluate indexing structures for the HDS. Like CDS and NDDS indexing, our techniques exploit certain geometric concepts to organize vectors indexed in the HDS. The geometric concepts in the hybrid space are defined in chapter 4.

1.2 The Vector Space Model

The vector space model [86, 105] is widely used in areas such as information filtering, information retrieval, and data mining to represent objects mathematically. It overcomes the disadvantages of a simpler data model, the Boolean model[45]. The Boolean model is based on Boolean logic and set theory. Indexed features in the Boolean model are set to either true or false. Queries are specified as Boolean expressions consisting of features connected by AND, OR, NOT operators. Because of this binary notion, an indexed object either matches a query object or does not match it (i.e., query results in the Boolean model cannot be ranked [68]). As a result, indexes based on the Boolean model usually return either too few or too many results when answering user queries. In the vector space model important properties of objects are represented using feature vectors (data points). Feature vectors in vector spaces usually have real number domains (in contrast to the binary domain of the Boolean model) and can be linearly combined. In the scenario of similarity queries, query objects are also transformed into feature vectors.
An example application of the vector space model is to use vectors to represent the contents of documents. Each document is represented using a feature vector containing the frequency of terms used in that document. The whole vector space consists of all the vectors used to represent a large collection of documents. When a user looks for certain documents, a query vector is created based on the user's interests and the document vector space is searched (probably with the help of certain indexing techniques to avoid a linear search of all documents). To support similarity queries in the vector space, the distance (range) value between vectors can be calculated through certain distance measures in the vector space (e.g., the Euclidean distance[99]). The closeness between a document vector and the query vector is then evaluated based on their distance value.

It is worth mentioning that not all spatial objects can be represented naturally by vectors. For example, in the area of multimedia processing, contents of objects such as video clips and images are stored in their original format and need non-trivial feature extraction methods for efficient representation, searching and retrieval of database objects. Extraction of image features (e.g., color, texture, shape, shadow) is performed as a preprocessing step to obtain image signatures (feature vectors): the color feature could be generated from average gray scale or average RGB format, and the texture feature could be obtained using the dyadic wavelet transform [66], the wavelet frame transform[97], Gabor filters [54] and so forth. There has been much research work done in the area of feature (content) extraction and dimensionality reduction. Discussions and comparisons of these techniques can be found in [87, 84, 4].
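To make the document-retrieval example concrete, the sketch below builds term-frequency feature vectors for a few documents and ranks them by Euclidean distance to a query vector. It is a minimal illustration of the vector space model, not code from this thesis; the vocabulary construction and the sample documents are hypothetical.

```python
import math
from collections import Counter

def term_frequency_vector(text, vocabulary):
    """Map a document to a feature vector of term frequencies over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def euclidean_distance(u, v):
    """Euclidean distance between two feature vectors of equal dimensionality."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

documents = ["indexing discrete data spaces",
             "indexing continuous data spaces",
             "genome sequence similarity search"]
vocabulary = sorted({term for doc in documents for term in doc.split()})

# Represent every document and the query in the same vector space.
doc_vectors = [term_frequency_vector(doc, vocabulary) for doc in documents]
query_vector = term_frequency_vector("discrete data indexing", vocabulary)

# Rank documents by closeness (smaller distance means more similar).
ranked = sorted(range(len(documents)),
                key=lambda i: euclidean_distance(doc_vectors[i], query_vector))
for i in ranked:
    print(euclidean_distance(doc_vectors[i], query_vector), documents[i])
```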
1.3 Disk-based Dynamic Indexing Structures

Theoretically the sequential scan[13] could be applied to support any kind of database query. Compared to other methods a sequential scan is easier to implement. And it works effectively if the following two conditions are satisfied: (1) the data set size is small; (2) the selectivity of the query is high. In other situations a sequential scan has poor performance because it always looks up the whole data space to answer a query. In this case more efficient indexing techniques are required to answer database queries, and much research work has been done in this area.

In modern database applications, the amount of data to be stored can be huge. Many real world spatial databases contain records counted in millions or even more [28]. As the number of tuples to be indexed grows larger and larger, the indexing structure itself will eventually become so big that it can only be stored on a secondary storage medium (e.g., the hard disk). In-memory indexing techniques cannot be utilized in this situation. Given the improvement of computer hardware technologies to speed up secondary storage access [76] and techniques introduced on the software side to minimize the number of storage media accesses[98], we focus on the study of disk-based indexing techniques, which are more practical in supporting large scale data sets.

A dynamic indexing method[5] supports random insert and delete operations on indexed data efficiently. By dynamic we mean that insert and delete operations are mixed with queries in the indexing structure. Unless the indexed data set never changes (or changes rarely), a dynamic indexing method is preferable to a static one because the latter requires rebuilding the whole indexing structure from scratch every time the data set is modified, and the reconstruction can be an expensive operation, especially for large data sets.

In the past different disk-based dynamic indexing structures have been proposed to support efficient database queries. The main idea behind these techniques is to organize the indexed objects carefully so that at query time, only parts of the indexing structure need to be visited. The rest of the index can be pruned away based on given query conditions. The performance of disk-based indexing structures is typically evaluated through the number of disk accesses required to support database queries. This is because the cost of a disk access is significantly higher than that of a memory access (usually by orders of magnitude). Thus it is important to verify the performance of a new disk-based indexing structure against existing ones (and the sequential scan if appropriate) in terms of the disk I/Os needed to answer database queries.

In this thesis we will focus on the development of disk-based dynamic indexing structures and evaluate their performance by comparing their query I/Os against those of existing indexing techniques.

1.4 Similarity Queries and Box Queries

Exact queries (i.e., given a query vector q, determine if q is stored in the database or not) are conventional query types supported by all database management systems. But in certain circumstances exact queries cannot meet user requirements well. For example, given a database storing a huge number of web pages from the internet, when looking for a web page similar to a given web page (the query web page), with exact queries all the user could do is to guess some query conditions and check if the returned query result is satisfying. If not, the user needs to start another exact query with different conditions. This process will continue until the user gets the desired results or gives up.

To solve this problem, more powerful query types have been supported by modern database indexing techniques. Similarity queries[90, 15, 92] and box/window queries[3, 78] are two common operations widely supported by modern database indexing techniques. In practice these two queries frequently occur on database systems, and have become important criteria for evaluating the efficiency of indexing techniques[94].

Similarity queries are used to search indexed data points which are 'similar' to a given query vector. They are important query types because they can be applied to retrieve information in ways that cannot be implemented using conventional exact match queries. For example, similarity queries could solve the web page lookup problem we described earlier. There are two different kinds of similarity queries, which are described as follows.

(1) Range query. A range query retrieves indexed data points which are within a given range from a query vector. When the range is 0, a range query becomes an exact match query. Given a data set DS, a query vector q and a query range r, a range query on DS is formally defined as:

Range_Q(q, r) = {v ∈ DS | distance(v, q) ≤ r}    (1.1)

where distance(v, q) represents the distance between vectors q and v.

(2) k-Nearest Neighbor query (k-NN query). The k-Nearest Neighbor query returns the top k indexed data points which are the closest to the query vector. The smaller the distance between two data points, the closer they are considered to be. Given a data set DS and a query vector q, a k-Nearest Neighbor query on DS can be formally expressed as:

KNN_Q(q, k) = {v1, v2, ..., vk ∈ DS | ∀vj ∈ {v1, v2, ..., vk}, ∀v' ∉ {v1, v2, ..., vk}, distance(q, v') ≥ distance(q, vj)}    (1.2)

Here distance(q, vj) and distance(q, v') are defined the same way as in (1.1).

Box queries are used to return indexed data points which are within a given query window. A box query is fundamentally different from similarity queries because it returns vectors which satisfy query conditions on every dimension, while similarity queries are interested in vectors within a given distance (from the query vector), which in turn is calculated based on information from all dimensions. Given a data set DS and a query rectangle QR, a box query on DS can be formally expressed as:

Box_Q(QR) = {v ∈ DS | v ∈ QR}    (1.3)
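The three query types in definitions (1.1)-(1.3) can be stated directly as a linear scan over the data set, which also serves as the sequential-scan baseline mentioned in section 1.3. The sketch below is illustrative only (the function names and the use of Euclidean distance are our assumptions); an indexing structure would answer the same queries while visiting only part of the data.

```python
import math

def distance(u, v):
    # Euclidean distance, one possible distance measure for (1.1) and (1.2).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def range_query(DS, q, r):
    # Formula (1.1): all vectors within distance r of the query vector q.
    return [v for v in DS if distance(v, q) <= r]

def knn_query(DS, q, k):
    # Formula (1.2): the k vectors closest to q.
    return sorted(DS, key=lambda v: distance(v, q))[:k]

def box_query(DS, QR):
    # Formula (1.3): QR gives a (low, high) condition per dimension, and a
    # vector must satisfy the condition on each and every dimension.
    return [v for v in DS
            if all(lo <= x <= hi for x, (lo, hi) in zip(v, QR))]

DS = [(1.0, 2.0), (3.0, 4.0), (0.5, 1.5)]
print(range_query(DS, (1.0, 2.0), 1.0))
print(knn_query(DS, (1.0, 2.0), 2))
print(box_query(DS, [(0.0, 2.0), (1.0, 3.0)]))
```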
1.5 Overview of the Dissertation

The rest of the dissertation is organized as follows. In chapter 2, we review existing research work on multidimensional CDS and NDDS indexing, and discuss challenges in supporting HDS indexing. In chapter 3, we provide a theoretical analysis of box queries in the NDDS and introduce new algorithms to improve box query performance based on our analysis. A new indexing structure, the BoND-tree, is developed to support efficient box queries in the NDDS. A compression idea is then introduced for the BoND-tree to further improve its query performance. The C-ND tree indexing structure, which is designed to support efficient range queries in the HDS, is introduced in chapter 4. In this chapter, we extend the geometric concepts in the CDS to the HDS. A normalization idea in the HDS is introduced to make discrete and continuous measures comparable and controllable. And extensive experiments are conducted to compare the C-ND tree performance with other indexing techniques. In chapter 5, we extend the original ND-tree heuristics to the HDS to evaluate the efficiency of the ND-tree heuristics in the HDS for box queries. A novel power adjustment idea is proposed in this chapter to adjust the contribution of the discrete and continuous subspaces in the HDS. An accurate evaluation model is presented to estimate the performance of box queries in the HDS. Chapter 6 provides a summarization of this thesis and future research work. A detailed theoretical proof of the optimized splitting heuristics in the NDDS for box queries is provided in appendix A. And the algorithms used to estimate box query I/Os in the HDS are given in appendix B of this thesis.

CHAPTER 2

Related Work

Numerous indexing structures have been proposed to support efficient access, insertion and deletion of database objects. Some of them are only suitable for single-dimensional indexing, such as the binary tree[77], the B-tree[7], and the AVL-tree[41]. Discussions of these indexing structures are not included in this thesis because they cannot support queries on multidimensional indexing keys. For the reason provided in section 1.3, in-memory indexing techniques like the k-d tree[12] (multidimensional search tree), the quadtree[88] and the grid file[72] are also excluded from our discussion.

We start our discussion with related work on indexing techniques in the metric space[25, 59, 111]. Since the vector space is a special case of the metric space, theoretically all metric space indexing methods could be applied to the vector space.
However, for the same reason (the metric space is a general case of the vector space), special properties of the vector space cannot be utilized by metric space indexing techniques to optimize the organization of indexed data objects.

After the discussion of metric space indexing methods, we introduce existing indexing techniques in multidimensional vector spaces, including the CDS and the NDDS. These techniques generally fall into one of two categories: the data-partitioning approach and the space-partitioning approach. Both approaches split an overflow node N into two new nodes, N1 and N2, by dividing the indexed vectors in N into two disjoint sets for the new nodes. The data-partitioning approach splits a node N subject to the minimum utilization requirement of the indexing structure. That is, both N1 and N2 are guaranteed to have a certain number of vectors after the split. Although data-partitioning cannot guarantee overlap-free partitions (which increases the chance of searching multiple paths of the indexing structure during query time), the minimum utilization criterion ensures reasonable storage utilization in the indexing structure. On the other hand, the space-partitioning approach focuses on how to split the data space represented by an overflow node N into subspaces. The distribution of indexed vectors depends on how the indexed space is split. Once the space is partitioned, indexed vectors are put into new nodes based on which subspace they belong to. Space-partitioning methods always generate overlap-free splits at the cost of a possible sacrifice in storage utilization.

At the end of this chapter, we will discuss challenges and ideas to index the HDS, based on our analysis of existing CDS and NDDS indexing techniques.

2.1 Indexing in the Metric Space

The metric space is a mathematical model consisting of a domain of objects S and a distance function d between two objects in S. The distance function d in a metric space has the following 4 properties. For all x, y, z ∈ S:

(1) d(x, y) = 0 ⇔ x = y (identity)
(2) d(x, y) ≥ 0 and d(x, x) = 0 (non-negativity)
(3) d(x, y) = d(y, x) (symmetry)
(4) d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)

Unlike in the vector model, the only available information in the metric space is the relative distance among objects. The distance information is utilized by indexing techniques in the metric space to organize indexed objects and prune the search space at query time (through the triangle inequality property).

Many indexing schemes have been developed to support similarity queries in the metric space. Some typical ones are the vantage-point tree [109, 50], the generalized-hyperplane tree (GHT) [96], the multi-vantage-point tree (MVPT) [18] and the balanced monotonous bisector tree (MBT) [36]. All these techniques use one or more reference points (vantage points) to partition the indexed data spaces into subspaces and create hierarchical indexing structures based on the partition. Although these techniques support similarity queries in the metric space effectively [29, 20], they use memory-based structures which focus on reducing the number of distance calculations. The reason is that, unlike in the vector space, distance evaluation in the metric space can be an expensive operation.
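The pruning enabled by the triangle inequality can be shown in a few lines. If the distances d(p, x) from a vantage point p to every indexed object x are precomputed, then for a range query (q, r) the inequality implies d(q, x) ≥ |d(q, p) − d(p, x)|, so any x with |d(q, p) − d(p, x)| > r can be discarded without computing d(q, x). The sketch below is our illustration, not code from the cited structures; it assumes some metric dist() is supplied.

```python
def range_query_with_pivot(objects, pivot_dist, pivot, q, r, dist):
    """Range query using one vantage point to avoid distance computations.

    objects    : list of indexed objects
    pivot_dist : precomputed distances d(pivot, x) for each object x
    dist       : the metric distance function (assumed expensive)
    """
    d_qp = dist(q, pivot)  # one distance computation for the query
    results = []
    for x, d_px in zip(objects, pivot_dist):
        # Triangle inequality: d(q, x) >= |d(q, pivot) - d(pivot, x)|.
        if abs(d_qp - d_px) > r:
            continue  # x is pruned; d(q, x) is never computed
        if dist(q, x) <= r:
            results.append(x)
    return results
```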
A disk-based metric indexing structure which pays attention to reducing disk I/Os is the M-tree[30]. An internal node entry of an M-tree consists of a routing feature object, a covering radius (to partition the space), a pointer to its child node, and the distance from its parent (except for the root) to the routing object. A leaf node entry contains a feature object, the distance value from its parent to the feature object and an object identifier. Unlike other metric trees, the M-tree is created in a bottom-up manner by splitting fixed-size tree nodes. Each node covers a sphere-like region of the metric space by recording a pivot object and a radius. New objects are inserted into the M-tree by minimizing the covering radius and the distance to the pivot object. The split operation in the M-tree is implemented by first selecting two pivot objects for the two new nodes, then redistributing the rest of the objects using a greedy approach. Queries on the M-tree are performed by using the covering radius and the distance between routing objects. According to [53], more distance calculations are required in the M-tree than in other metric space indexing techniques because bounding spheres in the M-tree are likely to overlap with each other.

Since the metric space is a superset of the vector space, all metric indexing structures could be applied to vector spaces. However, metric indexing methods do not necessarily work well in vector spaces, because even when indexing the vector space, metric indexing methods consider only the relative distance between indexed objects, while vector space indexing methods can take advantage of extra properties of the indexed space, such as the occurrences and distributions of indexed values on every dimension [79].

2.2 Indexing Techniques of the CDS

CDS indexing methods have been studied for a long time. Various techniques have been proposed to support efficient indexing in the CDS. To avoid searching the whole indexing structure, all these techniques organize the indexing structure in a way that certain parts of the index can be pruned away at query time. This is achieved by decomposing the indexed data spaces recursively (similar to the idea in divide and conquer methods[58]) using some geometric concepts like bounding rectangles[47, 9], bounding spheres[102], and bounding polygons[56].

2.2.1 Data-partitioning Approaches in the CDS

As a typical data-partitioning indexing structure, the R-tree [47] was introduced by Guttman in 1984. Data organization in the R-tree is similar to that of the B+-tree[33]: indexed vectors reside in the leaf nodes of the tree; insertion and splitting heuristics are implemented in a way that spatially adjacent data are likely to be stored in the same node to reduce disk accesses at query time. The R-tree has the following properties: (1) Every node has between m and M entries unless it is the root (m and M are the minimum and maximum numbers of entries allowed in a node). (2) The root node has at least two entries. (3) Each leaf node entry in the R-tree has a (key, pointer) structure representing the indexing key and the pointer to the actual tuple in the database; each non-leaf node entry contains the minimum bounding rectangle (MBR) of its child node as well as a pointer to that child node. (4) All leaf nodes are at the same level of the tree.

The insert operation on the R-tree is a recursive process. It starts with the root node, and the next node in the insertion path is selected as the child node which needs the least area enlargement to accommodate the new data. In case of ties on area enlargement, a child node with the least area is picked.
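The R-tree's choose-subtree rule can be written down compactly. The sketch below is our illustration (with MBRs represented as per-dimension (low, high) intervals and children as hypothetical objects carrying an mbr attribute); it computes the area enlargement needed to absorb a new point and applies the tie-breaking order just described: least enlargement first, then least area.

```python
def area(mbr):
    # An MBR is a list of (low, high) intervals, one per dimension.
    result = 1.0
    for lo, hi in mbr:
        result *= (hi - lo)
    return result

def enlarge(mbr, point):
    # Smallest MBR covering both the original MBR and the new point.
    return [(min(lo, x), max(hi, x)) for (lo, hi), x in zip(mbr, point)]

def choose_child(children, point):
    """Pick the child whose MBR needs the least area enlargement;
    break ties by the least area (Guttman's insertion heuristic)."""
    return min(children,
               key=lambda c: (area(enlarge(c.mbr, point)) - area(c.mbr),
                              area(c.mbr)))
```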
In case of ties on area enlargement, a child node with least area is picked. An overflowing R—tree node could be split using 3 different algorithms: linear, quadratic and exponential. All the three methods try to minimize the total areas of the two new nodes’ MBRS. The exponential splitting approach examines all possible can- didate splits to find the most desirable one. The quadratic approach first finds a pair of seed entries which would waste the most space if they had resided in the same node and separates them into two new nodes. Then it distributes the rest of the entries be- tween the two new nodes using a greedy approach. The linear splitting approach picks two seed entries based on their lower bounds and upper bounds in linear time and dis- 16 tribute the rest of the entries in the same way as the quadratic approach. Performance of the R—tree is heavily affected by its splitting algorithms. Although Guttman found only small difference among linear and quadratic splits, Beckmann reports a significant performance difference between the linear and quadratic splitting algorithms [9] Numerous variants of the R—tree have been developed to improve the performance of the original R-tree. Among them the R*-tree[9] is regarded as a classical optimiza- tion of the R—tree. The R*-tree gets better query performance by optimizing various geometric criteria including area, overlap and margins (summation of a node’s MBR’s edge lengths) at the tree construction time. And it forces reinsertion of indexed vectors to reorganize tree structure and reduce overlap among sibling nodes. Other examples of data-partitioning methods include the SS— tree[102], the X-tree[14], the R+-tree[93], the SR-tree[57] and the TV-tree[52]. All of them try to address cer- tain properties in CD83 indexing. For example, the SS-tree uses bounding spheres to partition the indexed space. A bounding sphere is represented by the centroid point of all indexed vectors and a minimum radius which is the maximum distance among the centroid point and every vector indexed. The X-tree records node splitting history (i.e., on which dimension a node has been split) and utilizes this information for later splitting in order to achieve low overlap partitions. The R+-tree forces node splitting to get an overlap free indexing structure. Note that overlap free partition is generally a characteristic of space-partitioning methods. The main problem of this idea is that 17 it increases both time complexity and space complexity. Time complexity is increased because splitting a node which is an obstacle for an overlap free indexing structure might need to split its child node as well. The number of nodes needed to be split could increase exponentially as the forced split are applied to the lower levels. On the other hand, too many splitting operations in an indexing structure lead to deterioration of space utilization. 2.2.2 Space-partitioning Approaches in the CDS Typical space-partitioning methods of the CDS include the K-D-B tree[82], the LSD tree[49] and the LSDh tree[48]. The K-D-B tree is a combination of the K-D tree[10](multidimensional binary search tree) and the B-tree[8] (l-dimensional multiway search tree). As a memory-based multidimensional binary tree, the K-D tree partitions the indexed space by splitting each dimension in repeating sequences. Its node fanout is constant because each record occupies a node of the K-D tree. The B-tree is a popular 1-dimensional balanced indexing structure. 
Each node in the B-tree contains multiple child nodes, which is a property suited for the disk-based indexing structure. The K-D-B tree extends the B—tree by supporting multidimensional space indexing. And it extends the K-D tree by allowing multiple entries being stored in one node. The K-D-B tree stores each node in one disk block. It is a balanced structure consisting of region pages (non-leaf nodes) and 18 point pages(leaf nodes). Region page entries in the K-D-B tree record (region, page-id) pairs where region represents the indexed data space which contains all the vectors in the subtree with root node page_id. Point page entries contain (vector, vect0r_id) pairs where vector is the indexed feature vector and vector_id points to the actual tuple in the database. This is a typical structure for multidimensional disk-based indexing methods. As a space-partitioning method, all sibling regions in the K—D-B tree are disjoint, and the region in a non-leaf entry is the union of all regions in its child node’s entries. The insert operation in the K-D-B tree is performed by searching the tree and find the leaf node whose region contains the new vector. Splitting a node in the K—D-B tree is conducted by splitting the node region on one particular dimension. The splitting dimension is picked cyclically among all dimensions of the space or, in certain cases such as the distribution of indexed data is known, with priorities for certain dimensions. Splitting operation on a region page may lead to recursive splitting of its child pages, which increases the computational complexity of constructing the K—D—B tree. As in other space-partitioning methods, storage utilization in the K-D-B tree can not be guaranteed. The LSDh tree is derived from the LSD tree (the local split decision tree), which again derives from the K-D tree (memory-based indexing). Like the LSD tree, initially the whole LSDh tree structure is in the main memory. As more vectors are indexed by the tree, the tree structure will eventually be too large to be kept in memory, in which case the LSDh tree is split into two parts: the top part close to the root node remains in 19 the memory and the rest part is saved on the secondary storage media. Unlike the LSD tree whose internal node records a splitting dimension and the splitting position on that dimension, which would lead to large dead spaces, the LSDh tree combines ideas of the LSD tree, the R-tree and the buddy tree[91]. It also introduces the idea of coded actual data regions in order to overcome the potentially large dead spaces introduced by space-partitioning methods and at the same time keep large node fanouts, which is a typical advantage of the space-partitioning methods. 2.2.3 Hybrid Approaches in the CDS As we have seen that both data-partitioning and space-partitioning approaches have their advantages and disadvantages. And there are some data-partitioning methods that take advantages of space-partitioning approaches and vice versa. For example, the R*-tree (a data-partitioning approach) tries to minimize overlap between sibling nodes during tree construction (although it cannot guarantee overlap free splits). And the coded actual data regions idea in the LSDh tree (a space-partitioning approach) is actually the minimum bounding rectangle idea from data-partitioning approaches. In this section we introduce the hybrid-tree [22], an indexing methods which combines concepts in both data—partitioning and space-partitioning approaches to achieve better query performances. 
The hybrid-tree is more close to space-partitioning methods with some relaxations on the strict ’overlap free’ requirement, which excludes it from the 20 category of pure space-partitioning. Note that although this indexing structure has ’hybrid’ in its name, it has nothing to do with the Hybrid Data Space. It is called so because it tries to make use of ideas from both data-partitioning and space-partitioning approaches. The hybrid-tree is used to index vectors in the CDS only. To keep a low fanout in the node as in space-partitioning methods, the hybrid tree always partitions the space into two subspaces at node splitting time based on in- formation from one single dimension. This is different from the idea of considering information from all dimensions in the data-partitioning methods. However, it relaxes the overlap free partition requirement in space-partitioning methods to improve the storage utilization. The hybrid-tree adopts the K-D tree structure and records one more split position than the K-D tree in its non-leaf node. So each internal node in a hybrid-tree contains two split positions lsp and rsp: Lsp represent the upper bound of the left side partition in the original K-D tree and rsp represent the lower bound of the right side partition. If rsp, < V3, V4, V5 >, < V6, V7 > and < V3 >. Vectors in the same group have the same letter on dimension i. Based on the grouping, we have 7 different overlap free partitions as shown in table 3.3: New nodes Node 1 Node 2 Vectors in new nodes V1, V2 V3, V4, V5, Vs, V7, V3 Vectors in new nodes V3, V4, V5 V1, V2, V3, V7, V3 Vectors in new nodes V6, V7 V1, V2, V;2,, V4, V5, V3 Vectors in new nodes V3 V1 , V2, V3, V4, V5, V5, V7 Vectors in new nodes V1, V2 , V3, V4, V5 V3, V7, V3 Vectors in new nodes V1, V2, V3, V7 V3, V4, V5, V3 Vectors in new nodes V1, V2 , V3 V3, V4, V5, V3, V7 Table 3.3. Different overlap free partitions from a discrete dimension i. From the above example we could see that more overlap free partitions could be generated in the NDDS than in the CDS. Since we use only one among all overlap free partitions to split a node, we need effective heuristics to select the candidate partition W121i ch could give us the best filtering power at query time. In this section we introduce two new heuristics for evaluating overlap free candidate partitions of an overflow node N in the NDDS. Without loss of generality and only for the purpose of simplicity, we consider box 48 queries which are uniform. A box query Q is said to be uniform if edge lengths of the query window are the same along all dimensions (i.e. if q,- is the cardinality of the component set of query window W along dimension i, then q1 = Q2 = = qd). The common edge length is said to be the boa: size of the uniform box query Q. The theoretical analysis provided in this thesis could be extended to more complex situations where box queries are not uniform. Consider an d-dimensional NDDS 52d, we have the following theorem to find a split- ting dimension for box queries: Theorem 3.3.3. Given an overflowing node N whose DMBR has edge length EL,- along dimension i and a uniform boa: query Q with query window W, the probability of overlapping between W and the DMBRs of newly created nodes N1 and N2 is minimum 2f 1V is split along a dimension u which has the least edge length among all dimensions having edge lengths larger than 1 in N ’s DMBR. In other words: splitting dimension :2 {u|ELu > 1;Vi E [1,d],EL,- 2 EL“ or ELz- = 1}. 
In this section we introduce two new heuristics for evaluating overlap-free candidate partitions of an overflow node N in the NDDS.

Without loss of generality and only for the purpose of simplicity, we consider box queries which are uniform. A box query Q is said to be uniform if the edge lengths of the query window are the same along all dimensions (i.e., if q_i is the cardinality of the component set of query window W along dimension i, then q_1 = q_2 = ... = q_d). The common edge length is said to be the box size of the uniform box query Q. The theoretical analysis provided in this thesis could be extended to more complex situations where box queries are not uniform.

Consider a d-dimensional NDDS. We have the following theorem to find a splitting dimension for box queries:

Theorem 3.3.3. Given an overflowing node N whose DMBR has edge length EL_i along dimension i and a uniform box query Q with query window W, the probability of overlap between W and the DMBRs of the newly created nodes N1 and N2 is minimum if N is split along a dimension u which has the least edge length among all dimensions having edge lengths larger than 1 in N's DMBR. In other words: splitting dimension ∈ {u | EL_u > 1; ∀i ∈ [1, d], EL_i ≥ EL_u or EL_i = 1}.

Theorem 3.3.3 suggests splitting an overflow node on a dimension which has a shorter span in the node's DMBR. This is opposite to the heuristic (NS2) used by the ND-tree splitting algorithm.

Next we evaluate candidate partitions generated from the same dimension when splitting an overflow node N. Given a splitting dimension u and 2 candidate partitions CP1 and CP2 along u, CP1 distributes the entries in N between two new nodes N1 and N2. Similarly CP2 splits N into two new nodes N1' and N2'. Suppose the edge length on dimension u is x in N1's DMBR and it is y = EL_u - x in N2's DMBR. And suppose the edge lengths on dimension u in the DMBRs of N1' and N2' are x' and y' = EL_u - x' correspondingly. Here we assume x < y and x' < y', otherwise we simply swap x and y (or x' and y'). The filtering power of the new nodes generated from CP1 and CP2 can be evaluated using the following theorem:

Theorem 3.3.4. If x < x', the probability of overlap between the query window W of a uniform box query Q and the DMBRs of N1 and N2 is smaller than the probability of overlap between W and the DMBRs of N1' and N2'.

Again in theorem 3.3.4 we see that to support box queries in the NDDS, there could be better ways to select candidate partitions compared to the heuristic (NS3) used by the ND-tree. Detailed proofs of theorems 3.3.3 and 3.3.4 can be found in appendix A of this thesis.

Note that a data-partitioning indexing tree has a minimum utilization criterion, which enforces that a certain percentage of the disk block occupied by a tree node should always be filled. When applying theorem 3.3.4, the minimum utilization criterion needs to be considered. This means that the most unbalanced candidate partition which satisfies the minimum utilization criterion should be selected, because it has the least overlap probability (among all candidate partitions generated from a splitting dimension u which satisfy the minimum utilization criterion) based on theorem 3.3.4.

Given theorems 3.3.3 and 3.3.4, we propose the following heuristics for splitting an overflow node. The heuristics are applied in the order they are specified (i.e., if there are ties on one heuristic then the next heuristic is used to break the tie). It is possible that even after applying all the heuristics there remain more than one candidate split. In such cases a candidate split is chosen randomly from the tied ones.

R1: Minimum Overlap

Of all the candidate partitions, heuristic R1 selects the one that results in minimum overlap between the DMBRs of the newly created nodes. This heuristic is the same as the one used by some of the existing works [79, 9].

R2: Minimum Span

Among all the candidate partitions generated from splitting dimensions which have spans larger than 1 and are tied on R1, heuristic R2 selects the partition generated from a splitting dimension which has the smallest span. This heuristic prefers to split shorter dimensions because, according to theorem 3.3.3, these dimensions are likely to generate candidate partitions which will later answer box queries in the NDDS using fewer disk I/Os.

R3: Minimum Balance

Given a splitting dimension u, heuristic R3 chooses the most unbalanced candidate partition (i.e., the one that puts as few characters as possible in one node's DMBR and as many characters as possible in the other node's DMBR on dimension u) among all candidate partitions which satisfy the minimum utilization criterion and are tied on R1 and R2. We make use of theorem 3.3.4 in this heuristic.
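Since R1-R3 are applied in order, selecting a split amounts to ranking candidate partitions by a composite key: overlap first, then the span of the splitting dimension, then how unbalanced the split is on that dimension. A minimal sketch of that ranking is shown below (our illustration; each candidate is assumed to carry its overlap, its splitting dimension's span, and the shorter new edge length x on that dimension, and all candidates are assumed to satisfy the minimum utilization criterion).

```python
import random

def pick_partition(candidates):
    """Rank candidate partitions by heuristics R1 -> R2 -> R3.

    Each candidate is a tuple (overlap, span, x, partition), where
    x <= span - x is the shorter new edge length on the splitting dimension.
    """
    # R1: minimum overlap; R2: minimum span (among spans > 1);
    # R3: most unbalanced split, i.e. minimum x.
    best_key = min((o, s, x) for o, s, x, _ in candidates)
    tied = [p for o, s, x, p in candidates if (o, s, x) == best_key]
    return random.choice(tied)  # remaining ties are broken randomly
```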
Heuristics R2 and R3 are quite counter-intuitive at first glance (e.g., binary search has been proved to be an efficient searching algorithm in the CDS, which implies a balanced partition of the indexed data space). But these heuristics try to exploit properties exclusive to box queries in the NDDS. It is the nature of the data space that makes seemingly unintuitive splitting heuristics perform better than the ones used in the CDS. Both our theoretical analysis provided in the appendix of this thesis and the experimental results presented in section 3.6 support this fact.

3.4 Construction of the BoND-tree

In this section, we formally describe the data structure and important algorithms for constructing our proposed BoND-tree indexing technique. A BoND-tree is an indexing structure which has the following properties:

(1) Each tree node occupies one disk block.

(2) All nodes must have a minimum amount of space filled by indexed entries unless it is the root node. This property specifies the minimum space utilization requirement in the tree nodes to ensure reasonable storage utilization.

(3) The root node has at least 2 indexed entries.
This is achieved by invoking the select-node algorithm recursively, starting from the root R of the tree, until a leaf node is selected. The Splitting Algorithm As discussed in section 3.3.4, a better way to split an overflowing node N into N1 and N2 is to get an overlap-free and unbalanced split along a dimension i, which has the minimum span among all dimensions whose spans are larger than 1. Of the heuristics proposed in section 3.3.4, R2 could be achieved by comparing the span of each dimension in node N’s DMBR. However, implementation of R3 in the BOND-tree is a more complex procedure, especially at non-leaf levels of the tree. When splitting a leaf node, overlap free candidate partitions could be obtained by first grouping node entries based on their values on a certain dimension and then distributing t11<:)se groups to the new nodes. Suppose an overflow leaf node has 8 entries (V1 ~ V3), tElwt>le 3.3 in section 3.3.4 gives an example of generating candidate partitions along a dimension i for a leaf node. After we get all the overlap free partitions (subject to the IIlillimum utilization requirement, which is one entry per node in our example), each of them could be evaluated using heuristic R3 (we will provide a better solution using a. greedy approach to apply R3 in leaf node splitting at the end of this section). As we have mentioned, implementation of R3 on a non-leaf node is much more 55 complex because the component sets of non-leaf node entries could have more than one letter on a dimension. Table 3.4 shows an example of different component sets from 8 non-leaf node entries (E1, E2, . . . , E3) on a dimension i which has the alphabet {a7 b) c) d? 67 f, g}' Non-leaf entry E1 E2 E3 'E4 E5 E5 E7 E3 Component set {a,b} {b,c} {a,c} {a,b,c} {a,b,c} {e} {e,f,g} {f} Table 3.4. Different component sets of non-leaf entries on dimension i. When generating candidate partitions on dimension i, we could have a component set which is a proper subset of other sets like {e} and {e, f, g} ; sets which are disjoint or partly overlapped like {a, b}, {e} and {a, b, e}; sets which overlap with each other in a chain like {a, b}, {b, c} and {a, c}; sets whose union is only part of the domain or the whole domain such as {a, b, c}, {f} and {e, f, g}; or a single component set which contains all the letters from the domain. As the number of distinct component sets increases, the relationship among component sets could be even more complex. From the above discussion we can see that heuristic R3 describes the most desirable partition when splitting an overflow node but does not provide a way to achieve it. And it is a challenging problem to get desired partition required by R3. One brute force way to apply R3 is to compute all permutations of the entries in an overflow node, and then put splitting points tentatively between adjacent entries in each permutation to generate candidate partitions. But this clearly demands a heavy computation time. 56 Even for a small number of entries it would be impractical to evaluate all permutations (e.g., for 10 entries, the number of candidate partitions would be more than one million). To get the most desirable partition (according to R3) of indexed entries in an overflow node, we mapped the problem of splitting a BOND-tree node using heuristic R3 to a well-known problem, the 0—1 knapsack problem. 
This allows the BOND-tree splitting algorithm to make use of existing solutions of the 0—1 knapsack problem to distribute indexed node entries and get the desired partition based on heuristic R3. Formally a stande 0-1 knapsack problem is defined as follows: Problem Definition 3.4.1. 0-1 Knapsack Problem : Given a set of items 11,12, . . .,In each with value I V,- and weight 1W,- (1 S i S n) respectively. The problem is to put a set of items into a knapsack K such that, (1) The total value of items in the set is maximum. (2) The total weight Wtotal of items in the set satisfies the constraint Wtotal S C, where C is referred as the capacity of the knapsack K. To map the node splitting problem to the knapsack problem, we first analyze how an overflow node N is split using heuristic R3 in the BOND-tree. Let u be the dimension along which we will generate candidate partitions for a node N. There may be several entries in N which share one or more common letters along dimension u (i.e., the intersection of these entries’ component sets on dimension u is not empty). In order to get overlap-free partitions, we first group all entries which share common letters along dimension u. Each group is then treated as a single item when splitting the node. Grouping entries this way avoids distributing entries with 57 the same character(s) along dimension u into two different nodes (in which case an non-overlap—free partition is generated). After the grouping process, a group i has a certain number of characters along dimension u which is treated as the value of the corresponding group. Fhrther, group i requires a certain amount of space (depending on the space requirement of each entry in that group) which becomes the weight of group i. We use G1, G2, .. ., Gn to represent these groups. The values and weights of each group are represented by GV1, GV2, ..., CV” and GWl, 0W2, ..., GWn correspondingly. To split an overflow node according to heuristic R3, we want to distribute these entry groups (G1, G2, . . . , Gn) between the two new nodes in the following way. One node has as many values (number of characters along the splitting dimension) as possible and the other node has as few values as possible. And the distribution must satisfy the minimum utilization requirement on both nodes. We use 5,1 to denote the disk block size occupied by each tree node. Suppose the minimum node utilization criterion specifies that a certain size Smin of each node (each disk block) must be filled. Based on our discussion, the BOND-tree node splitting problem using heuristic R3 could be defined as follows. Problem Definition 3.4.2. Node Splitting Problem of the BOND-tree Using Heuristic R3: Given entry groups G1, G2, ..., Gn in an overflow node N, suppose the values (number of char- acters) and weights (required storage space) of each group are GVl, GVZ, ..., GVn and GWl, GWQ, .. ., GWn correspondingly. Further suppose the total weights of en- try groups distributed to N1 and N2 are N W1 and N W2 respectively. The BOND-tree splitting algorithm distributes the entry groups to two new nodes N1 and N2 such that, (1) The total value of entry groups in node N1 is the maximum. 58 (2) Both N W1 and N W2 satisfy the minimum space utilization criterion of the tree (i.e., NW1 2 Smin and NW2 2 8min)- (3) Both N W1 and NW2 are less than the disk block size (i.e., N W1 g 33 and NW2 S 5d)- Problem definition 3.4.2 has the minimum weight requirement and the maximum weight requirement on the two newly generated nodes. 
To map the two problems, we further analyze the node splitting problem as follows. We use S_e to represent the size of each node entry. Recall that the minimum space utilization criterion requires that a certain storage space S_min of each node must be filled. We have

S_e ≤ S_min ≤ (⌊S_d/S_e⌋ × S_e) / 2,

which means S_min should be no smaller than the space required by a single node entry and no larger than half of the maximum space that a node N could use to accommodate indexed entries. The maximum storage space S_max that could be utilized by a new node is calculated as:

S_max = (⌊S_d/S_e⌋ + 1 − ⌈S_min/S_e⌉) × S_e    (3.3)

For example, consider a node N containing 4 entries, each entry using S_e = 90 bytes; the total space occupied by these 4 entries is 90 × 4 = 360 bytes. Suppose the disk block size S_d is 400 bytes; N will overflow if the 5th entry is inserted into it. Further suppose the minimum utilization criterion specifies that at least 100 bytes of each node must be filled (S_min = 100). If N is split into two new nodes, each new node must have at least ⌈S_min/S_e⌉ = 2 entries distributed to it. As a result, each of the new nodes could have at most ⌊S_d/S_e⌋ + 1 − ⌈S_min/S_e⌉ = 3 entries after the splitting. Thus a new node could use at most S_max = 3 × S_e = 270 bytes to store indexed entries distributed to it.

Formula 3.3 gives the maximum amount of space which could be utilized in each of the newly generated nodes to store indexed entries (so the remaining entries will be put in the other node). From formula 3.3 we get two properties of the maximum space S_max for a new node:

(P-1) S_max ≤ S_d.

Proof. From S_min ≥ S_e, we have:
S_max = (⌊S_d/S_e⌋ + 1 − ⌈S_min/S_e⌉) × S_e
      ≤ (⌊S_d/S_e⌋ + 1 − ⌈S_e/S_e⌉) × S_e
      = ⌊S_d/S_e⌋ × S_e
      ≤ S_d.

(P-2) S_max ≤ (⌊S_d/S_e⌋ + 1) × S_e − S_min.

Proof.
S_max = (⌊S_d/S_e⌋ + 1 − ⌈S_min/S_e⌉) × S_e
      = (⌊S_d/S_e⌋ + 1) × S_e − ⌈S_min/S_e⌉ × S_e
      ≤ (⌊S_d/S_e⌋ + 1) × S_e − (S_min/S_e) × S_e
      = (⌊S_d/S_e⌋ + 1) × S_e − S_min.

From P-2 we know that (⌊S_d/S_e⌋ + 1) × S_e − S_max ≥ S_min, which means that by allowing one new node to use no more than S_max space for storing node entries, the other node is guaranteed to have at least S_min space filled by entries distributed to it.

Given the maximum space S_max defined in formula 3.3, we tackle the node splitting problem defined in 3.4.2 in the following way. When a node N is split into nodes N1 and N2, the splitting algorithm tries to distribute as many entries as possible to N1, but the space utilized in N1 is no more than S_max. Suppose the spaces occupied by entries distributed to N1 and N2 are S_1 and S_2 correspondingly. Clearly S_1 is no less than S_2 (since the splitting algorithm tries to put more entries into N1). From property P-1 we know that S_1 will not exceed the block size S_d. Since S_1 ≥ S_2, we have S_2 ≤ S_d. And property P-2 guarantees that S_2 is no smaller than S_min. Since S_1 ≥ S_2, S_1 will be no less than S_min too.

Based on our analysis above, we provide an alternative definition of the node splitting problem using heuristic R3, which is equivalent to problem definition 3.4.2. Note that in both definitions we distribute entry groups instead of entries in order to get overlap-free partitions. The redefined node splitting problem in 3.4.3 will later be mapped to the 0-1 knapsack problem.
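As a quick sanity check of formula 3.3 and the two properties, the following throwaway program (a sketch we added for illustration, not thesis code) evaluates them on the numbers of the example above:

#include <cstdio>

int main() {
    int sd = 400, se = 90, smin = 100;
    // S_max = (floor(Sd/Se) + 1 - ceil(Smin/Se)) * Se, ceil via integer math
    int smax = (sd / se + 1 - (smin + se - 1) / se) * se;
    std::printf("S_max = %d bytes\n", smax);                       // 270
    std::printf("P-1: %s\n", smax <= sd ? "holds" : "fails");      // 270 <= 400
    std::printf("P-2: %s\n",
                smax <= (sd / se + 1) * se - smin ? "holds" : "fails"); // 270 <= 350
    return 0;
}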
Problem Definition 3.4.3. Redefined Node Splitting Problem of the BoND-tree Using Heuristic R3: Given entry groups G_1, G_2, ..., G_n in an overflow node N, suppose the values (number of characters) and weights (required storage space) of each group are GV_1, GV_2, ..., GV_n and GW_1, GW_2, ..., GW_n correspondingly. The BoND-tree splitting algorithm distributes the entry groups to two new nodes N1 and N2 such that:
(1) The total value of entry groups in node N1 is the maximum.
(2) The total weight W_total of entry groups in node N1 satisfies the constraint W_total ≤ S_max, where S_max is calculated from formula 3.3.

Note that in problem definition 3.4.3, we use the maximum space constraint S_max on a single node N1 to guarantee both the minimum weight requirement and the maximum weight requirement specified in problem definition 3.4.2. And our discussion above already shows that both the requirements on N1 and N2 defined in 3.4.2 will be satisfied by enforcing the maximum space constraint S_max on the node N1.

Now consider the 0-1 knapsack problem defined in 3.4.1 and the node splitting problem defined in 3.4.3. If we let m = n, I_i = G_i, IW_i = GW_i, IV_i = GV_i (1 ≤ i ≤ m), and let C = (⌊S_d/S_e⌋ + 1 − ⌈S_min/S_e⌉) × S_e, where S_d is the disk block size, S_e is the node entry size, and S_min is the minimum utilization requirement on tree nodes, then the problem of splitting an overflow node N (i.e., generating and finding the most suitable candidate partition according to heuristic R3) is the same as solving the 0-1 knapsack problem. That is, instead of enumerating all the possible partitions of an overflow node as discussed at the beginning of this section, we can get the most suitable solution of the node splitting problem by exploiting existing algorithms for the 0-1 knapsack problem.

As the node splitting problem is mapped to the 0-1 knapsack problem, a dynamic programming solution [58, 83] can be used to solve it optimally and efficiently. After the items (entry groups) to be put into the knapsack (node N1) are decided, the remaining items (entry groups) are put into node N2.

Mapping the splitting problem defined in 3.4.3 not only provides an efficient way to find the most suitable partition for an overflow node. It also allows the freedom of using different ways to build the BoND-tree based on the particular requirements and purpose of indexing. For example, when both the query performance and the time needed to generate the indexing structure are critical, parallel algorithms [63, 37] for the 0-1 knapsack problem could be applied to build the BoND-tree efficiently and quickly. On the other hand, when the BoND-tree is created as a temporary indexing structure, the query I/O is usually not the only (or the most important) consideration: sometimes people want to build indexing trees quickly and discard them after performing a limited number of queries. In such cases, the BoND-tree could be generated using the algorithms introduced in [60] and [85], which provide approximate solutions with guaranteed closeness to the optimal solution at much lower time complexity and with fewer system resource requirements.
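A textbook dynamic-programming solver is all that is needed for this mapping. The sketch below (our own C++, not thesis code) maximizes the total value of the groups placed in N1 subject to the capacity S_max and backtracks one optimal selection. On the weights and values of the example that follows (table 3.7, where formula 3.3 gives S_max = (11 + 1 − 3) × 1 = 9 units) it attains the optimal value 8, although it may report a different, equally good selection than table 3.8 when several optima tie:

#include <algorithm>
#include <cstdio>
#include <vector>

// 0-1 knapsack by dynamic programming: items are entry groups, values are
// letter counts on the splitting dimension, weights are storage costs, and
// the capacity is S_max from formula 3.3. Returns the groups placed in N1.
std::vector<int> knapsack(const std::vector<int>& v, const std::vector<int>& w,
                          int cap) {
    int n = v.size();
    // best[i][c] = max value using the first i groups with total weight <= c
    std::vector<std::vector<int>> best(n + 1, std::vector<int>(cap + 1, 0));
    for (int i = 1; i <= n; ++i)
        for (int c = 0; c <= cap; ++c) {
            best[i][c] = best[i - 1][c];                 // group i-1 goes to N2
            if (w[i - 1] <= c)                           // or to N1
                best[i][c] = std::max(best[i][c],
                                      best[i - 1][c - w[i - 1]] + v[i - 1]);
        }
    std::vector<int> inN1;                               // backtrack one optimum
    for (int i = n, c = cap; i >= 1; --i)
        if (best[i][c] != best[i - 1][c]) { inN1.push_back(i - 1); c -= w[i - 1]; }
    return inN1;
}

int main() {  // weights/values of groups G1..G6 from table 3.7 below; cap = 9
    std::vector<int> value = {3, 1, 2, 2, 1, 1}, weight = {3, 1, 3, 2, 2, 1};
    for (int g : knapsack(value, weight, 9)) std::printf("G%d -> N1\n", g + 1);
    return 0;
}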
We illustrate the BoND-tree splitting algorithm using the following example. Let the entries in the overflowing non-leaf node be E1 ... E12. Further, suppose the DMBRs of these entries have component sets along a splitting dimension u as shown in table 3.5.

Entry            E1     E2     E3           E4     E5     E6
Component set    {a}    {b}    {a, b, c}    {d}    {e}    {e, f}

Entry            E7     E8        E9     E10    E11    E12
Component set    {f}    {h, i}    {i}    {j}    {j}    {k}

Table 3.5. Different component sets for non-leaf entries E1 ~ E12.

After the grouping process we obtain the following 6 groups, as shown in table 3.6.

Group      G1              G2      G3              G4          G5            G6
Entries    {E1, E2, E3}    {E4}    {E5, E6, E7}    {E8, E9}    {E10, E11}    {E12}

Table 3.6. Grouping of non-leaf entries.

Each group G_i has a set of characters GS_i on the splitting dimension (obtained by applying the set union operation to the component sets of all group members' DMBRs on dimension u). Here we use GV_i to represent the number of characters in GS_i. Also, each group requires a certain amount of space to store the entries in that group. Let the amount of space required for each entry be one unit and the capacity of the node be 11 units. Further suppose the minimum utilization criterion requires that each new node utilize at least 3 units. We use GW_i to represent the space required by G_i. Table 3.7 shows the item weights and values of the 0-1 knapsack problem mapped from the node splitting problem.

Item    Weight    Value
G1      3         3
G2      1         1
G3      3         2
G4      2         2
G5      2         1
G6      1         1

Table 3.7. The item weights and values in the 0-1 knapsack problem.

According to R3, after splitting a node N into N1 and N2, we want one node to have the maximum number of characters on the splitting dimension in its DMBR, while the other node has the minimum number of characters. And both new nodes must satisfy the minimum utilization criterion in our example. If we solve the 0-1 knapsack problem as mentioned above, it will give us the best candidate partition (according to the proposed heuristic R3) for splitting the node N, as shown in table 3.8.

Entries in node N1    G1, G2, G4, G5, G6
Entries in node N2    G3

Table 3.8. The candidate partition for an overflow node N found by solving the 0-1 knapsack problem.

Note that in a leaf node the optimal solution to this splitting problem is even simpler, since all the entries in an overflow leaf node have only a single letter on a splitting dimension. In other words, the values of all the items (entry groups) are the same (more specifically, the value is 1 for all items). Hence the problem could be solved using a greedy algorithm (instead of dynamic programming). The greedy approach used in the BoND-tree is sketched as follows. To solve the 0-1 knapsack problem at the leaf level, we first sort all items based on their weights. Then we put the sorted items into a knapsack K1 (new tree node N1) one by one, starting from the items with smaller weights, until no more items can be put into K1. All the remaining items are put into the knapsack K2 (tree node N2). This distribution approach is guaranteed to obtain the best partition of entries in an overflow leaf node as required by R3.
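Because every item has value 1 at the leaf level, maximizing the value in N1 reduces to maximizing the number of items that fit, which the weight-ascending greedy pass achieves. A small illustrative sketch (our own C++ with made-up group weights, not thesis code):

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> weight = {4, 1, 3, 2, 2, 1};   // hypothetical group weights
    int capacity = 9;                               // S_max for the new node N1
    std::vector<int> order(weight.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),           // ascending weight
              [&](int a, int b) { return weight[a] < weight[b]; });
    int used = 0;
    for (int g : order) {                           // fill N1 first, rest to N2
        bool toN1 = used + weight[g] <= capacity;
        if (toN1) used += weight[g];
        std::printf("G%d -> %s\n", g + 1, toN1 ? "N1" : "N2");
    }
    return 0;
}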
By mapping the node splitting problem to the 0-1 knapsack problem, our proposed BoND-tree splitting algorithms are guaranteed to find an overlap-free partition satisfying the minimum utilization criterion as long as such a partition exists. However, theoretically there may be cases (although they are very rare because of the nature of the NDDS, as we described in section 3.3.4) where it is simply impossible to get any overlap-free split without affecting node utilization. We use the ND-tree's auxiliary tree approach to safeguard against this situation. First we generate a set of candidate partitions S (none of the candidate partitions in S is overlap free) using the auxiliary tree. Among all partitions in S, the one which has the least overlap between the two new nodes' DMBRs is chosen. If there are ties, the candidate partition (among the tied ones) which has the least area summation of the two new nodes' DMBRs is chosen. If there are ties again, a random one among the tied ones is chosen.

Algorithm 3.4.1 briefly summarizes all the important steps involved in inserting a new entry in a node.

Algorithm 3.4.1. insert_entry(N, E)
Input: A node N and an entry E to be inserted in N.
Output: Modified tree structure that accommodates entry E.
Method:
1.  if N has space for E
2.    Insert E in the list of entries in N
3.    Update DMBRs of N and all its parent nodes as needed
4.  else // we need to split N
5.    Put all dimensions whose span is larger than 1 into list L
6.    Sort L based on dimension span in ascending order
7.    for every dimension i in L do
8.      Group entries in N based on their component sets on dimension i
9.      Calculate each entry group's weight and value // mapped to the 0-1 knapsack problem
10.     if N is a leaf node
11.       Solve the 0-1 knapsack problem using the greedy approach
12.     else
13.       Solve the 0-1 knapsack problem using dynamic programming
14.     end if
15.     if the 0-1 knapsack problem is solved
16.       return the solution
17.     end if
18.   end for
19.   if no solution is found // no overlap-free partition exists
20.     Find a solution using the auxiliary tree approach
21.     return the solution
22.   end if
23. end if

3.4.2 Deletion in the BoND-tree

If removing a vector from a leaf node L does not cause any underflow (i.e., the minimum utilization requirement on L is satisfied after the deletion), the vector is directly removed and the DMBRs of L's ancestor nodes are adjusted as needed. If an underflow occurs on L, the procedure is as follows.

Node L is removed from its parent node N, and if N underflows in turn, N is removed from its parent node. The procedure propagates toward the root node until no underflow occurs. Then the subtree represented by the underflow node closest to the root node is removed, its ancestor nodes' DMBRs are adjusted as needed, and all the remaining vectors in the subtree are reinserted. In the worst case, if the root node has only two children and one of them is removed, the remaining child node becomes the new root of the tree (i.e., the tree height decreases by one).

Deletion of a vector is a rare event in a database. However, for the sake of completeness, we present a relatively simple algorithm for deletion. The main idea in this algorithm is that any deletion that violates the minimum utilization criterion is followed by a series of reinsertions. Algorithm 3.4.2 outlines the procedure for deletion.

Algorithm 3.4.2. delete_entry(R, E)
Input: Root of the tree R and an entry E to be deleted from the tree.
Output: Modified tree structure with E removed.
Method:
1.  if E is absent in the tree
2.    return // nothing to do
3.  else
4.    Let L be the leaf holding E
5.    if removal of E does not violate the minimum utilization criterion
6.      Remove E and update DMBRs
7.      return
8.    else
9.      Remove L and, if underflow occurs in L's ancestor nodes, remove these underflow nodes recursively until a node N' is found whose removal does not violate the constraint
10.     if N' is not the root node
11.       Remove the subtree rooted at N'
12.       Adjust DMBRs
13.       Reinsert the other entries in the subtree
14.     else // N' is the root node with only 2 children
15.       Remove the subtree in N' which contains L
16.       The remaining child of N' becomes the new root of the tree
17.       Adjust DMBRs
18.       Reinsert the vectors indexed in the subtree which was removed
19.     end if
20.   end if
21. end if
22. return
3.4.3 Box Query on the BoND-tree

The algorithm for executing box queries on the BoND-tree is implemented as follows. Let w be the query box and N be a node in the tree (which is initialized to the root R of the tree). For each entry E in N, if the query window w overlaps with the DMBR of E, then that entry is searched; otherwise the subtree rooted at E is pruned. Algorithm 3.4.3 illustrates the basic steps in executing a box query with query window w on a BoND-tree. Initially N is set to the root node R of the tree.

Algorithm 3.4.3. box_query(N, w)
Input: Query box w and a node N.
Output: A set of vectors that are inside query box w.
Method:
1.  A = ∅ // A is the result set of the box query
2.  if N is a leaf node
3.    for each entry E in N
4.      if E is inside w
5.        A = A ∪ {E}
6.      end if
7.    end for
8.  else // N is a directory node
9.    for each entry E in N
10.     if the DMBR of E overlaps with w
11.       A = A ∪ box_query(E, w)
12.     end if
13.   end for
14. end if
15. return A

3.5 Compression to Improve Node Fanout

In the CDS, the MBR information on a continuous dimension is stored by recording the lower bound value and upper bound value of that dimension. Since the number of available values in a continuous domain is usually unlimited (or very large), in a hierarchical indexing structure (e.g., the R*-tree) the MBR information on a continuous dimension i is unlikely to cover the whole domain of i. However, in the NDDS the number of letters in a discrete domain is limited (and usually quite small). This means a discrete dimension in a DMBR will become full (i.e., all letters in the domain have appeared on that dimension) much faster than a continuous dimension.

Consider a set S which contains letters from a non-ordered discrete domain D with domain size |D| = A. The Markov transition matrix [70] describing the probability of S's size after adding one random letter from D to S is shown in (3.4).

              1      2        3        4      ...    A
       1  (  1/A   1-1/A     0        0      ...    0  )
       2  (   0     2/A    1-2/A      0      ...    0  )
P =    3  (   0      0      3/A     1-3/A    ...    0  )     (3.4)
      ... (  ...                                   ...  )
       A  (   0      0       0        0      ...   1.0 )

Now suppose we are creating an indexing structure for an NDDS with domain D for dimension i. Further suppose the size of D is 10. Using the Markov transition matrix in (3.4), we can calculate the probability of a node N having all 10 letters of D on dimension i after indexing X vectors, as shown in table 3.9.

X              20        40        60        80        100
Probability    21.47%    85.81%    98.21%    99.78%    99.97%

Table 3.9. Probability of having a full dimension after indexing X vectors.

As we can see from the table, after indexing 100 vectors, the probability that all 10 letters of D have appeared in node N's DMBR on dimension i is 99.97%. And it becomes even higher for smaller alphabet sizes (i.e., |D| < 10) or larger numbers of vectors (X > 100). Note that the calculation applies to the BoND-tree as well as to other hierarchical NDDS indexing techniques such as the ND-tree.
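The probabilities in table 3.9 can be reproduced by iterating the chain of (3.4) numerically. The following short program (our own illustrative C++, not thesis code) tracks the distribution of the number of distinct letters seen after each insertion and prints P(full) at X = 20, 40, ..., 100:

#include <cstdio>
#include <vector>

// State k = number of distinct letters seen so far. Inserting one random
// letter stays at k with probability k/A and moves to k+1 with probability
// 1 - k/A, exactly the transition matrix in (3.4).
int main() {
    const int A = 10;                       // alphabet size
    std::vector<double> p(A + 1, 0.0);      // p[k] = P(k distinct letters)
    p[1] = 1.0;                             // after the first insertion
    for (int x = 2; x <= 100; ++x) {
        std::vector<double> q(A + 1, 0.0);
        for (int k = 1; k <= A; ++k) {
            q[k] += p[k] * k / A;                                 // repeated letter
            if (k < A) q[k + 1] += p[k] * (1.0 - (double)k / A);  // new letter
        }
        p = q;
        if (x % 20 == 0)
            std::printf("X = %3d  P(full) = %6.2f%%\n", x, 100.0 * p[A]);
    }
    return 0;
}

With A = 10 this prints 21.47%, 85.81%, 98.21%, 99.78% and 99.97%, matching table 3.9.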
The splitting heuristics of the BoND-tree prefer an overlap-free candidate partition generated from a shorter dimension. This leads to more full dimensions in the DMBRs of non-leaf nodes of the BoND-tree (especially at higher levels of the tree) compared to the ND-tree. Table 3.10 shows the percentage of full dimensions at non-leaf levels of the BoND-tree when indexing 5 million vectors from 16-dimensional NDDSs with varying alphabet sizes.

Alphabet size                    8         10        12        14
Percentage of full dimensions    75.30%    75.62%    75.62%    75.58%

Table 3.10. Percentage of full dimensions at non-leaf levels of the BoND-tree with different alphabet sizes.

Table 3.11 lists the percentage of full dimensions by level in a BoND-tree built for a 16-dimensional NDDS with alphabet size 10. The total number of vectors indexed is 5 million and the tree height is 6. Level 1 represents the leaf level and level 6 stands for the root level of the tree.

Level of tree                    6          5         4         3         2         1
Percentage of full dimensions    100.00%    93.75%    87.50%    81.25%    75.00%    44.35%

Table 3.11. Percentage of full dimensions at each level of the BoND-tree.

From the above statistics we see that a large percentage of the dimensions recorded in the DMBRs of non-leaf nodes are full in the BoND-tree. To improve the space utilization of the BoND-tree in non-leaf nodes, we could simply record that a dimension is full instead of recording the occurrence of every letter from the alphabet. This compression idea can be exploited to reduce the amount of space required to store the DMBR, which increases the fanout of the node. We call this modified indexing structure with increased node fanout the compressed BoND-tree, and we use the term basic BoND-tree for a BoND-tree that does not use the compression idea. In the following section we explain our compression scheme and its effect on the node splitting algorithm.

3.5.1 The Compressed BoND-tree Structure

The original BoND-tree records every dimension of a DMBR using a bitmap representation. Let A be the alphabet size of all the dimensions (for simplicity we assume the same alphabet size for all dimensions, but the discussion can easily be modified for varying alphabet sizes). As we need one bit per letter to store information about a dimension, each dimension in a DMBR needs ⌈A/8⌉ bytes of storage. So for a DMBR with d dimensions, we need d × ⌈A/8⌉ bytes in the basic BoND-tree.

In the compressed BoND-tree structure we explicitly store only the dimensions which do not have all the characters. However, we need to record, in some way, the dimensions which do have all the characters. We do this by maintaining a bitmap of ⌈d/8⌉ bytes over all dimensions in a d-dimensional NDDS. A bit value of 1 tells us that the corresponding dimension has all the characters and is not explicitly recorded. Thus, the compressed structure requires an overhead of ⌈d/8⌉ bytes. If d_full is the number of dimensions that have all the characters, then the space required for the so-called compressed DMBR is

⌈d/8⌉ + (d − d_full) × ⌈A/8⌉ = d × ⌈A/8⌉ + (⌈d/8⌉ − d_full × ⌈A/8⌉).

It has been observed that, in general, ⌈d/8⌉ < d_full × ⌈A/8⌉. Hence, this compressed structure requires less space per DMBR. As the memory requirement of a single DMBR is reduced, the fanout of the node increases. This higher fanout results in a reduction in the height of the tree and reduced I/O at query time. Note that the compression of DMBRs applies only to non-leaf nodes, because leaf node entries in the BoND-tree contain only indexed feature vectors. Thus the performance gain of the compressed BoND-tree is achieved through a more effective representation of DMBRs in the non-leaf nodes, especially nodes at higher levels of the tree.
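To make the saving concrete, the following toy computation (ours, added for illustration) compares the basic and compressed DMBR sizes for d = 16 and A = 10 at several values of d_full; with d_full = 12, for example, the DMBR shrinks from 32 bytes to 10 bytes:

#include <cstdio>

int main() {
    int d = 16, A = 10;                          // dimensions, alphabet size
    int perDim = (A + 7) / 8;                    // ceil(A/8) bytes per dimension
    int basic = d * perDim;                      // basic BoND-tree DMBR
    for (int dfull = 0; dfull <= d; dfull += 4) {
        int compressed = (d + 7) / 8 + (d - dfull) * perDim;
        std::printf("d_full=%2d  basic=%d  compressed=%d bytes\n",
                    dfull, basic, compressed);
    }
    return 0;
}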
3.5.2 Effect of Compression on Splitting Overflow Nodes

When the entries in an overflow node N of the compressed BoND-tree are distributed to two nodes N1 and N2 during splitting, one important consideration is whether it can be assured that those entries fit in the two new nodes. It is possible that some DMBRs have fewer full dimensions after the split and hence need more space to store them. This may lead to candidate partitions in which the two new nodes N1 and N2 are insufficient (in terms of space required) to hold all the entries. In other words, we are interested in the question of whether it is possible that all the candidate partitions cause either N1 or N2 (or both) to overflow. In this section we show that this situation cannot arise from compressing DMBRs of the BoND-tree.

The first thing to note is that such a problem may occur only at the non-leaf levels of the tree. When splitting a leaf node, the entries are simply redistributed without any change in the space utilization. Hence, the two new nodes are always sufficient to accommodate the entries in any candidate partition.

In a non-leaf node, the need for splitting arises only when one of its entries gets replaced with two new entries (due to the split of a child node). Suppose a node N has entries E_1, E_2, ..., E_n. Also, suppose a child node represented by entry E_j (1 ≤ j ≤ n) splits into two new nodes, which results in two new entries (E'_j and E''_j) and causes N to overflow. If B is the disk block size (and also the maximum size of a node, assuming that one node occupies one disk block of the system), then we have Σ_{i=1}^{n} sizeof(E_i) ≤ B, where sizeof(E_i) represents the amount of space required to store entry E_i. After the child node represented by E_j is split, the disk block occupied by node N will be insufficient to hold all the entries if Σ_{i=1, i≠j}^{n} sizeof(E_i) + sizeof(E'_j) + sizeof(E''_j) > B. Thus, the extra space required (denoted by ES) to store all the entries in N is

ES = sizeof(E'_j) + sizeof(E''_j) − sizeof(E_j).

Note that this calculation applies to the BoND-tree with or without compressed DMBRs, as well as to the original ND-tree.

We first consider the original BoND-tree without compression, in which case sizeof(E'_j) = sizeof(E''_j) = sizeof(E_j). This means that ES equals the space for storing one non-leaf node entry in the BoND-tree. And since a new disk block is allocated due to the overflow of N, as long as B is larger than the size of one non-leaf node entry, we can always find a way to distribute all the entries in N to the two new nodes N1 and N2 without causing either of them to overflow. Note that this requirement is trivially satisfied, because if B were less than the space required to store two non-leaf node entries, a non-leaf node could have at most one child node and the whole indexing structure would be meaningless. The analysis and conclusion for the original BoND-tree also apply to the ND-tree, because both trees use the same data structure in their implementations.

Now consider the BoND-tree with compressed DMBRs. The largest ES comes from the situation where all dimensions in the DMBR of E_j are compressed and no dimension in E'_j or E''_j is compressed. In this case we can put E_1, E_2, ..., E_{j−1}, E_{j+1}, ..., E_n in node N1, and put E'_j and E''_j in node N2. Clearly N1 will not overflow: since E_1 ~ E_n could be stored in one disk block, with one entry (E_j) removed the remaining entries require less storage space and thus are guaranteed to fit in one disk block. N2 will not overflow as long as B is larger than or equal to the space required to store two uncompressed DMBRs, which is the same requirement needed to build a meaningful ND-tree or BoND-tree without compression.

3.6 Experimental Results

The BoND-tree was implemented in C++. Tests were conducted on Intel Xeon quad-core 5345 processors with 8 GB ECC DDR2 RAM running SuSE Enterprise Linux 10 in a high-performance computing cluster system.
We tested the performance of the proposed BoND-tree (with and without compression) for various query box sizes, numbers of dimensions, alphabet sizes and database sizes. Besides randomly generated synthetic data sets, we also used human genome data for our experiments in section 3.6.5. In all the experiments, 200 random box queries were executed and the average number of I/Os was counted. The query I/O of the BoND-tree was compared with that of the ND-tree and the 10% linear scan [22]. A block size of 1K is used for all the experiments.

It should be noted that the ND-tree is fine-tuned for similarity searches, but so far there has been no indexing technique designed to support efficient box queries in the NDDS. On the other hand, the ND-tree is reported to be a robust indexing technique compared to other known indexing methods in the NDDS [79]. We also experimented with the M-tree as a candidate for comparison, but in most of the experiments the M-tree performed worse than the 10% linear scan. Hence, in this section, we present comparison results with the ND-tree and the 10% linear scan only.

In sections 3.6.1 ~ 3.6.4 we report the performance comparison among the basic BoND-tree (i.e., the BoND-tree without the compression idea), the ND-tree and the 10% linear scan. The comparison between the compressed BoND-tree (using the compression idea) and the basic BoND-tree is provided in section 3.6.6.

3.6.1 Effect of Different Database Sizes

In this group of tests we evaluate the performance of the basic BoND-tree with different database sizes. We varied the number of data points from 5 million to 10 million in steps of 1 million. The average query I/O at each step is shown in figure 3.1. It can be seen that, as the number of indexed data points increases, the query I/O increases for all the techniques in our tests. However, the basic BoND-tree is a clear winner for all database sizes. The average query I/O for the basic BoND-tree is several orders of magnitude smaller than that of the ND-tree.

Figure 3.1. Effect of different database sizes in the basic BoND-tree.

3.6.2 Effect of Different Numbers of Dimensions

This group of tests evaluates the performance of the proposed tree when indexing data sets with different numbers of dimensions. In this set of experiments, the number of dimensions was varied from 8 to 20 in steps of 4. The other parameters (database size = 5 million, alphabet size = 10) were kept constant. With an increasing number of dimensions, more space is required to store the DMBR information in the basic BoND-tree as well as in the ND-tree. This results in a reduction of the fanout of tree nodes and a subsequent increase in the height of the tree. Thus the I/O for both trees (as well as for the linear scan) increases. The absolute I/O of the basic BoND-tree is much less than that of both the ND-tree and the 10% linear scan. Further, as figure 3.2 shows, the basic BoND-tree is much less affected by the increased number of dimensions than the ND-tree.

Figure 3.2. Effect of the number of dimensions in the basic BoND-tree.

3.6.3 Effect of Alphabet Size

In this test set, the alphabet size was varied from 8 to 14 in steps of 2. Figure 3.3 shows the performance of the basic BoND-tree, the ND-tree and the 10% linear scan for various alphabet sizes.
As the alphabet size increases, the ability of the tree to find a non-overlapping partition increases, which results in a decrease in the I/O.

Figure 3.3. Effect of different alphabet sizes in the basic BoND-tree.

3.6.4 Effect of Different Query Box Sizes

This group of tests compares the performance of the new tree with that of the ND-tree and the 10% linear scan for different box sizes. The number of dimensions and the alphabet size were fixed at 16 and 10, respectively. The database size was fixed at 5 million records. When the query box size increases, both the basic BoND-tree and the ND-tree require more I/Os, while the I/O for the 10% linear scan remains constant. As we can see from figure 3.4, the performance gain of the basic BoND-tree is significant for all the box sizes given. Further, at a moderate box size of 3, the ND-tree loses to the 10% linear scan. Our proposed basic BoND-tree maintains its performance even at a box size of 5. Note that at this point we are already querying a large portion of the space, which justifies the large amount of I/O required by all the indexing schemes. At higher box sizes, however, the 10% linear scan proves to be the best method. This is expected, as the amount of space corresponding to the result set is huge.

Figure 3.4. Effect of different query box sizes in the basic BoND-tree.

3.6.5 Performance on Real Data Set

To evaluate the effectiveness of the BoND-tree on indexing real data sets, we compared the performance of the BoND-tree, the ND-tree, and the 10% linear scan on indexing a 15-dimensional human genome sequence data set from the National Center for Biotechnology Information. Experimental results are shown in figure 3.5. The total number of genome vectors indexed is 5 million and the query box size is 2. From the figure we can see that, just as in the case of indexing synthetic data, the BoND-tree works much better than the other approaches on indexing real data.

Figure 3.5. Performance of indexing genome sequence data.

3.6.6 Performance of the Compressed BoND-Tree

In all the previous experiments the compressed BoND-tree outperforms the basic BoND-tree. However, the performance gain is not as large as that of the basic BoND-tree over the ND-tree. In this section we report the test results comparing the compressed BoND-tree and the basic BoND-tree. Note that in this section the results are plotted without using a logarithmic scale for the y-axis (number of query I/Os). First we show the performance comparison for a varying number of dimensions. The database size used for this group of tests is 5 million. The query box size and alphabet size are set to 2 and 10, respectively.
As we can see from figure 3.6, for all the test cases the compressed version yields less I/O than the original BoND-tree, and the average performance improvement is 10.15%.

Figure 3.6. Performance of the compressed BoND-tree with various dimensions.

Figure 3.7 shows the performance gain from using the compression idea when indexing NDDSs with different alphabet sizes. The number of vectors indexed is fixed at 5 million and the query box size is 2. This group of tests demonstrates the effectiveness of the compression idea when indexing NDDSs with different alphabet sizes. Although both indexing methods yield less I/O as the alphabet size grows, the compressed BoND-tree performs better than the basic BoND-tree for all the alphabet sizes tested in the experiments.

Figure 3.7. Performance of the compressed BoND-tree with various alphabet sizes.

CHAPTER 4

Range Queries in the HDS

4.1 Motivations and Challenges

In many contemporary application areas like machine learning, information retrieval, and data mining, people often need to deal with hybrid data, where vectors can have both discrete and continuous attributes. For example, the feature vectors used to retrieve/analyze images from the World Wide Web typically contain both continuous and discrete features: the text describing them can be regarded as discrete-valued features, while statistics like pixel frequencies can be treated as continuous-valued features. Another application domain which may deal with hybrid vectors is intrusion detection. One of the main objectives of an intrusion detection system is to classify a user as a malicious or a normal user. Such a judgment should be based on information about the user's behavioral statistics, such as the physical location, the time frame within which specific commands are executed, and the amount of time the user remains connected to a server. Clearly, some of this information, like the time for which the user remains connected, is continuous, while other information, like the physical location, can be discrete.

To support efficient queries on vectors in an HDS, a multidimensional indexing scheme is required. However, to the best of our knowledge, no indexing scheme that directly indexes vectors in an HDS has been reported in the literature. Next we briefly review existing indexing techniques for the CDS and the NDDS.

As we introduced in section 2.2, most multidimensional indexing schemes proposed in the literature are for the CDS, where values along each dimension are continuous (ordered). These techniques are either based on data partitioning, such as the R-tree, the R*-tree, the R+-tree, the SS-tree, the SR-tree and the X-tree, or based on space partitioning, such as the K-D-B-tree and the LSDh-tree. The data-partitioning-based techniques split an overflow node by dividing the set of its indexed vectors according to the data distribution, while the space-partitioning-based methods split an overflow node by partitioning its corresponding data space. However, these indexing techniques cannot be directly applied to the NDDS, since they rely on the ordering property of data values in each dimension.
If these techniques are used for an HDS, they can only index the continuous subspace of the HDS.

The vectors in a discrete space can be considered as fixed-length strings when the alphabets for all dimensions are the same. In this case, string indexing methods such as Tries, the Prefix B-tree and the String B-tree can be utilized. However, the Trie structure and its derivatives are memory-based techniques which do not have effective storage utilization [34]. As a result they cannot be used to index large-scale data sets. The Prefix B-tree and the String B-tree are derived from the B-tree structure. They are disk-based techniques, but they rely on the ordering property of letters. To deal with more general cases and overcome the limitations of string indexing methods, two multidimensional indexing techniques specially designed for NDDSs, i.e., the ND-tree and the NSP-tree, have recently been proposed. As discussed in section 2.3.3, the ND-tree is based on data partitioning while the NSP-tree is based on space partitioning. Both techniques exploit the unique characteristics of an NDDS. If they are used for an HDS, they can only index the discrete subspace of the given HDS.

From our discussion above we can see that indexing techniques designed for the CDS and the NDDS exploit characteristics exclusive to their corresponding data space. Thus they cannot be used to support indexing operations in the HDS, which involve features from both the CDS and the NDDS. One way to apply these techniques in the HDS is to transform data from one space to the other. An example is the discretization method [39, 61], which is used to transform continuous attributes to discrete ones through certain supervised (e.g., using the Chi-square-based criteria [95, 17] or the entropy-based criteria [103, 38]) or unsupervised (based on dividing continuous data into equal-width or equal-frequency intervals) methods. The discretization technique is widely used in data mining and information retrieval to reduce the number of values to be processed, simplify the rules of the corresponding algorithms and improve the accuracy of results [35]. However, discretization is developed for certain problems and domain areas only. It is not applicable in other areas such as database indexing, where accurate retrieval of indexed data is required.

In this chapter, we present a new indexing technique, inspired by the R*-tree and the ND-tree, to directly index vectors (without conversion/transformation) in an HDS. As we discussed in section 2.4, the new indexing structure is a disk-based technique which allows queries mixed with insert and delete operations on the data set. We first define some essential geometric concepts such as rectangle, area, edge length, and overlap in an HDS. Based on these concepts, a multidimensional hybrid tree, called the C-ND tree, and its relevant algorithms are developed. A normalization idea is developed in order to control the contributions of discrete and continuous features in the HDS when building the C-ND tree. To deal with the unique characteristics of an HDS, a novel node-splitting strategy called the 'hybrid split', which combines two splits (one on a non-ordered discrete dimension and the other on a continuous dimension) into one, is proposed. We conducted extensive experiments to compare the query performance of the C-ND tree with that of the ND-tree and the R*-tree. We also compared the C-ND tree's performance with the 10% linear scan for the given HDS.
Our experimental results demonstrate that the C-ND tree is generally more efficient than the other three existing methods.

The rest of the chapter is organized as follows. Section 4.2 introduces the relevant concepts and notations. Section 4.3 presents the C-ND tree, including its tree structure and relevant algorithms. Section 4.4 reports our experimental results.

4.2 Concepts and Notations for the HDS

An HDS is a multidimensional vector data space which contains both (ordered) continuous and non-ordered discrete dimensions. In the past, many indexing techniques were developed for the CDS. An NDDS, as opposed to the CDS, is a data space in which all elements/values along each dimension are discrete and have no natural ordering among them. An example of non-ordered discrete data is colors like red, green and blue. Every color is unique, but there is no natural ordering among them. To develop an indexing technique, like the R*-tree for the CDS and the ND-tree for the NDDS, some essential geometric concepts such as rectangles and areas in the CDS need to be extended to the HDS. The rest of this section introduces and defines such extended geometric concepts to be used by indexing structures in the HDS.

Let d be the total number of dimensions and let D_i (1 ≤ i ≤ d) be the domain of the i-th dimension in an HDS, which can be either continuous or non-ordered discrete. For a continuous dimension, D_i is an interval/range of real numbers. Let min(D_i) and max(D_i) denote the smallest and the largest numbers in the range, such that D_i = [min(D_i), max(D_i)]. We define the domain size (or length) of a continuous dimension as Length(D_i) = max(D_i) − min(D_i). For a discrete dimension, its domain D_i is a set of non-ordered discrete values/elements/letters. The domain size (or length) of a discrete dimension is defined as the alphabet size of that dimension. Thus, for a discrete dimension i, we have Length(D_i) = |D_i|.

A d-dimensional HDS Ω_d is defined as the Cartesian product of the d domains: Ω_d = D_1 × D_2 × ... × D_d. α = (a_1, a_2, ..., a_d) is a vector in Ω_d if a_i ∈ D_i (1 ≤ i ≤ d). For simplicity, in the rest of this chapter, we assume that all the discrete domains are identical and that all the continuous domains are identical. Similar to [80], the discussion can easily be extended to an HDS with domains of various sizes.

A hybrid (hyper-)rectangle R in Ω_d is defined as the Cartesian product R = S_1 × S_2 × ... × S_d, where S_i is called the component set or range of R on the i-th dimension. If the i-th dimension is discrete (i.e., D_i is a discrete domain), then S_i is a set of discrete elements such that S_i ⊆ D_i. On the other hand, if the i-th dimension is continuous (i.e., D_i is a continuous domain), S_i is a range [min(S_i), max(S_i)] such that min(D_i) ≤ min(S_i) ≤ max(S_i) ≤ max(D_i).

The area of a hybrid rectangle R = S_1 × S_2 × ... × S_d is defined as the product of the lengths of all the component sets. Mathematically, Area(R) = Π_{i=1}^{d} Length(S_i). The perimeter of R is defined as Perimeter(R) = Σ_{i=1}^{d} Length(S_i).

Given two hybrid rectangles R = S_1 × S_2 × ... × S_d and R' = S'_1 × S'_2 × ... × S'_d, the overlap of R and R' is Area(R ∩ R') = Area((S_1 ∩ S'_1) × (S_2 ∩ S'_2) × ... × (S_d ∩ S'_d)).

Given a set of hybrid rectangles {R_1, R_2, ..., R_n}, where R_1 = S_{1,1} × S_{1,2} × ... × S_{1,d}, R_2 = S_{2,1} × S_{2,2} × ... × S_{2,d}, ..., R_n = S_{n,1} × S_{n,2} × ... × S_{n,d}, the hybrid minimum bounding rectangle (HMBR) of {R_1, R_2, ..., R_n} is defined as the Cartesian product:

HMBR = (∪_{i=1}^{n} S_{i,1}) × (∪_{i=1}^{n} S_{i,2}) × ... × (∪_{i=1}^{n} S_{i,d}).

The component set of the HMBR on the i-th dimension is S_{1,i} ∪ S_{2,i} ∪ ... ∪ S_{n,i}. The edge length Length(HMBR, i) of the HMBR on the i-th dimension is |S_{1,i} ∪ S_{2,i} ∪ ... ∪ S_{n,i}| if the dimension is discrete, and it is max{max(S_{1,i}), max(S_{2,i}), ..., max(S_{n,i})} − min{min(S_{1,i}), min(S_{2,i}), ..., min(S_{n,i})} if the dimension is continuous.

Example 4.2.1. Consider an HDS Ω_2 with one discrete dimension with domain D_1 and one continuous dimension with domain D_2. D_1 has letters {a, b, c, ..., j}; D_2 has range [0, 10]. Two data points (vectors) in Ω_2 are P_1 = (a, 2.5) and P_2 = (e, 9.0). Two rectangles in Ω_2 are R_1 = {a, c} × [1.0, 5.5] and R_2 = {a, h} × [1.5, 8.5]. The edge lengths of R_1 on the discrete and continuous dimensions are 2 and (5.5 − 1.0) = 4.5, respectively. The area of R_1 is 2 × 4.5 = 9. The overlap of R_1 and R_2 is 1 × (5.5 − 1.5) = 4. The area of the HMBR of R_1 and R_2 is 3 × (8.5 − 1.0) = 22.5.
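The following sketch (our own C++ representation of a 2-dimensional hybrid rectangle, not thesis code) implements the edge length, area, overlap and HMBR definitions above and reproduces the numbers of example 4.2.1:

#include <algorithm>
#include <cstdio>
#include <iterator>
#include <set>
#include <vector>

struct HybridRect {
    std::set<char> s1;      // component set on the discrete dimension
    double lo, hi;          // range on the continuous dimension
};

double discLen(const HybridRect& r) { return (double)r.s1.size(); }
double contLen(const HybridRect& r) { return std::max(0.0, r.hi - r.lo); }
double area(const HybridRect& r)    { return discLen(r) * contLen(r); }

HybridRect overlapRect(const HybridRect& a, const HybridRect& b) {
    HybridRect o;                                   // intersect per dimension
    std::set_intersection(a.s1.begin(), a.s1.end(), b.s1.begin(), b.s1.end(),
                          std::inserter(o.s1, o.s1.begin()));
    o.lo = std::max(a.lo, b.lo);
    o.hi = std::min(a.hi, b.hi);
    return o;
}

HybridRect hmbr(const HybridRect& a, const HybridRect& b) {
    HybridRect m;                                   // union per dimension
    std::set_union(a.s1.begin(), a.s1.end(), b.s1.begin(), b.s1.end(),
                   std::inserter(m.s1, m.s1.begin()));
    m.lo = std::min(a.lo, b.lo);
    m.hi = std::max(a.hi, b.hi);
    return m;
}

int main() {
    HybridRect r1{{'a', 'c'}, 1.0, 5.5};
    HybridRect r2{{'a', 'h'}, 1.5, 8.5};
    std::printf("Area(R1)  = %.1f\n", area(r1));                  // 9.0
    std::printf("Overlap   = %.1f\n", area(overlapRect(r1, r2))); // 4.0
    std::printf("HMBR area = %.1f\n", area(hmbr(r1, r2)));        // 22.5
    return 0;
}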
4.3 The C-ND tree

In this section, we discuss the C-ND tree structure and the algorithms used to construct it. We also present an algorithm that processes range queries using the C-ND tree. The hybrid geometric concepts discussed in section 4.2 are utilized by the heuristics in this section to organize vectors in the C-ND tree structure.

4.3.1 The Tree Structure

A leaf node of the C-ND tree has entries which keep the information on indexed vectors (i.e., database keys) and their associated data in the underlying database. Hence each leaf node entry stores the discrete and continuous components of an indexed vector on all dimensions and keeps a pointer to the actual data associated with the vector in the database. Each entry of a non-leaf node stores its child node's HMBR as well as a pointer to that child node. The discrete subspace information and the continuous subspace information are recorded differently in non-leaf entries. Information for each discrete dimension is represented using a bitmap representation, while information for each continuous dimension is stored by recording the lower and upper bounds of that dimension.

Like the ND-tree and the R*-tree, a C-ND tree is a balanced tree which meets the following requirements: (1) the root node has at least two children unless it is a leaf; (2) every node has between m and M children unless it is the root, where 2 ≤ m ≤ M/2; (3) all leaves appear at the same level. Figure 4.1 is an example of the C-ND tree, which shows the actual tree structure and the HMBRs of the tree.

Figure 4.1. An example of the C-ND tree in a 2-dimensional HDS with discrete domain {A, G, T, C} and continuous domain [−10, 10].

The C-ND tree is a dynamic indexing structure which allows insertion, deletion and query operations mixed with each other. When inserting a new vector V into a C-ND tree, a proper insertion path is created from the root node R down to a leaf node L. Then V is stored in L, which is the last node added to the insertion path. If L overflows, a hybrid splitting procedure is invoked to split L into two new leaf nodes. The splitting process may propagate to nodes at higher levels (if those nodes overflow) along the insertion path until R is reached, in which case R is split into two new nodes and the C-ND tree grows one level higher.

The delete operation on the C-ND tree can be implemented simply as follows.
When a vector is removed from a leaf node L, L's HMBR is recalculated based on the rest of the vectors inside L, and its parent nodes' HMBRs are updated if necessary. If the removal operation causes an underflow on node L, L is removed from the tree and all remaining vectors in L are reinserted. Note that removing node L may cause its parent node L' to underflow too, in which case L' is also removed from the tree and all the indexed vectors in L' are reinserted. Removing L' may again cause an underflow on its parent node; thus the removal process may continue recursively upward until no underflow occurs or the root node is reached. In the latter case the root node is removed and the tree height is reduced by one. In the rest of this section we focus on the insertion and query algorithms of the C-ND tree.

4.3.2 Normalized Geometric Measures

To obtain an efficient C-ND tree, we adopt a number of heuristics which utilize the geometric concepts, such as hybrid rectangles and their areas, introduced in section 4.2. One issue in applying these heuristics is to make sure both the discrete and continuous subspaces contribute their information fairly to the geometric measures in the HDS. For example, consider an HMBR in a two-dimensional HDS (one discrete dimension D and one continuous dimension C). Assume the edge length on dimension D is 5 (i.e., the discrete component set contains 5 letters/elements), and the edge length on dimension C is 100 (i.e., the continuous component set/interval has a length/size of 100). The perimeter is then calculated as 5 + 100 = 105, which is clearly dominated by the absolute value of the continuous edge length and treats the discrete dimension unfairly.

To solve this problem, we adopt normalized measures. Specifically, when we calculate the edge length of an HMBR on a discrete dimension, we divide the number of elements of the HMBR on that dimension by the size of the domain on that dimension; when we calculate a continuous edge length of the HMBR, we divide the (total) length of the interval(s) of the HMBR on that dimension by the length of the corresponding domain. With the normalized measures, edge lengths are relative lengths with respect to the domain size, which lie between 0 and 1. These normalized edge lengths are then used to calculate the other relevant geometric measures, such as areas and perimeters. Suppose in the above example the domain of discrete dimension D contains 10 letters/elements, and the domain of dimension C has range length 1000. The normalized edge length of the discrete component set with 5 letters/elements is calculated as 5/10 = 0.5, and the normalized edge length of the continuous component set/interval with a length/size of 100 is calculated as 100/1000 = 0.1. The normalized perimeter is 0.5 + 0.1 = 0.6, which reflects information from the discrete and continuous subspaces fairly. In the rest of this chapter, we always use the normalized measures unless stated otherwise.

4.3.3 Choosing a Leaf Node for Insertion

Given a new data point/vector P to be inserted into a C-ND tree, a leaf node L needs to be found to accommodate P. Node L is found in a top-down manner, i.e., starting from the root node and descending the tree until a leaf node is picked. All the nodes picked during this procedure constitute the insertion path, and the leaf node L at the end of the insertion path is the one that accommodates P.
If a non-leaf node N on the insertion path has one or more child nodes whose HMBRs contain P, the child node with the smallest HMBR area is chosen for the insertion. If P does not belong to any of N's child nodes' HMBRs, the following two heuristics are applied.

HC-1: Minimum Overlap Enlargement

As we know, a large overlap among nodes at the same level would increase the chance of searching multiple paths in the C-ND tree during query processing. To avoid large overlap, the child node that yields the least overlap enlargement after the insertion is selected to accommodate P. If there is a tie, the following heuristic is applied.

HC-2: Minimum Area Enlargement

The child node that yields the least area enlargement after the insertion is chosen to accommodate P. If there is a tie again, a child is randomly chosen from the tied ones.
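A compact way to express this selection logic is shown below (our own illustrative C++; the containment test and the enlargement values are assumed to be computed by the caller using the normalized measures of section 4.3.2, and ties are broken by first occurrence rather than randomly):

#include <cstddef>
#include <vector>

// Per-child statistics for choosing the subtree that will accommodate P.
struct ChildStats {
    bool   containsP;   // does the child's HMBR contain P?
    double area;        // current (normalized) HMBR area
    double overlapEnl;  // HC-1: overlap enlargement if P is inserted here
    double areaEnl;     // HC-2: area enlargement if P is inserted here
};

std::size_t chooseChild(const std::vector<ChildStats>& c) {
    std::size_t best = 0;
    bool haveContainer = false;
    for (std::size_t i = 0; i < c.size(); ++i) {    // containing child with
        if (!c[i].containsP) continue;              // the smallest area wins
        if (!haveContainer || c[i].area < c[best].area) {
            best = i;
            haveContainer = true;
        }
    }
    if (haveContainer) return best;
    best = 0;
    for (std::size_t i = 1; i < c.size(); ++i)      // HC-1, then HC-2 on ties
        if (c[i].overlapEnl < c[best].overlapEnl ||
            (c[i].overlapEnl == c[best].overlapEnl && c[i].areaEnl < c[best].areaEnl))
            best = i;
    return best;
}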
4.3.4 Splitting an Overflow Node of the C-ND Tree

In this subsection, we discuss the steps and heuristics used to split an overflowing C-ND tree node. Splitting a node is accomplished in three steps: (1) generating candidate partitions from each subspace of the HDS, (2) finding the best combination of candidate partitions, and (3) redistributing the uncommon entries.

One straightforward way to generate candidate partitions is to permute all (M + 1) entries and then, for each permutation (of the (M + 1)! ones), put splitting points between the entries to separate them into two groups. However, even for small values of M it is computationally expensive to generate all possible splits. We propose a novel concept called the 'hybrid split' for generating candidate partitions. As the C-ND tree indexes vectors involving both discrete and continuous component values, it is important that the splitting procedure considers both subspaces for an optimized split. In the proposed splitting algorithm, we first generate discrete and continuous candidate partitions for the M + 1 entries from the discrete subspace and the continuous subspace separately, as described below. Then we find the combination of discrete and continuous candidate partitions which agrees the most on how to distribute entries in their respective subspaces; this step is described later in this subsection. As the last step of the hybrid split, common entries in both partitions are put into N1 and N2 directly, and uncommon entries are redistributed using heuristics also discussed later in this subsection.

Generating Candidate Partitions for Subspaces

As the first step of the hybrid split, we generate candidate partitions for the discrete and continuous subspaces, respectively. In fact, each dimension in a subspace is considered when generating candidate partitions. If the current dimension is discrete, we sort the entries in the splitting node using the auxiliary tree as described in [79]. If the current dimension is continuous, the entries in the splitting node are first sorted by the lower bound of the continuous component set/interval on that dimension and then by the upper bound of the continuous component set/interval on that dimension (if there are ties according to the first sort), resulting in a complete sorted entry list. Positions where a split point can be placed are constrained by the minimum space utilization.

Consider the example shown in figure 4.2. Suppose we have an overflowing node with twelve entries and one sorted entry list of these entries is E1, E2, E3, ..., E12. Let the minimum utilization constraint require at least four entries per node. We then have five possible candidate partitions S1, S2, S3, S4 and S5. Each candidate partition S_i (1 ≤ i ≤ 5) divides the entries into two groups: the first group is <E1, ..., E_{i+3}> and the second is <E_{i+4}, ..., E12>.

Figure 4.2. An example of generating candidate partitions in the C-ND tree.

To efficiently choose a set of good partitions among all the candidate ones for a given overflow node, some heuristics are needed. From extensive experiments, we have found that the following heuristics (named HS-1 through HS-3) are effective in choosing good candidate partitions for splitting an overflow node in the C-ND tree. Note that since candidate partitions are generated from the discrete subspace and the continuous subspace separately, the MBRs in HS-1 ~ HS-3 should be treated as only the discrete part or only the continuous part of the whole HMBR, depending on the subspace under consideration.

HS-1: Minimum Overlap

Among all the candidate partitions, the one yielding the least overlap is most preferred for the subspace under consideration. If there is a tie, the following heuristic is applied.

HS-2: Maximum Span

Among all the candidate partitions that are tied for HS-1, the one with the maximum span is chosen. Here the span means the length/size of the discrete component set on a discrete dimension (or the continuous set/interval on a continuous dimension) of the overflow node's MBR before splitting. If there is still a tie, the following heuristic is applied.

HS-3: Maximum Balance

Among all the candidate partitions that are tied for HS-1 and HS-2, the one with the maximum balance is picked. Here we measure the balance by the difference between the lengths/sizes of the corresponding discrete component sets on the discrete dimension (or the corresponding continuous sets/intervals on the continuous dimension) of the two new nodes' MBRs after splitting the overflow node using the candidate partition under consideration. This heuristic tries to balance the lengths of the two new nodes' MBRs on the splitting dimension. In other words, the smaller the difference, the more balanced the partition.

When generating candidate partitions at a non-leaf level in the discrete subspace, we noticed that it is very hard to group entries in a non-leaf node based on the discrete component sets' information, due to the fact that each entry in a non-leaf node may have multiple elements on a given discrete dimension, and those sets vary largely in cardinality and membership. The more distinct sets we have, the harder it is to separate them into two groups. This suggests picking a discrete dimension which has fewer distinct discrete component sets. Thus at a non-leaf level, when generating candidate partitions from the discrete subspace, we choose the dimension on which the splitting node's entries have fewer distinct discrete component sets.

After applying HS-1 ~ HS-3, there may still be ties. Hence, in general, we get two sets of partitions CPd and CPc, representing the candidate partitions generated from the discrete and continuous subspaces, respectively.
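For a sorted entry list, enumerating the candidate partitions that respect the minimum utilization is a one-line loop; the sketch below (ours, not thesis code) reproduces the five partitions S1 ~ S5 of the example in figure 4.2 above:

#include <cstdio>

int main() {
    int total = 12, minUtil = 4, s = 0;   // M + 1 entries, min entries per node
    for (int k = minUtil; k <= total - minUtil; ++k)   // legal split positions
        std::printf("S%d: <E1..E%d> | <E%d..E%d>\n", ++s, k, k + 1, total);
    return 0;
}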
Choosing the Best Combination of Discrete and Continuous Partitions

The next step of the hybrid split is to find the best combination of candidate partitions from the sets CPd and CPc generated above. That is, suppose CPd has candidate partitions D1, D2, ..., Di and CPc has candidate partitions C1, C2, ..., Cj. For all combinations {Dm, Cn}, where 1 ≤ m ≤ i and 1 ≤ n ≤ j, the following heuristic HS-4 is applied to pick the best combination.

Given a partition Dm, the whole entry set of the overflow node is separated into subsets Dm1 and Dm2. Similarly, Cn divides it into two subsets Cn1 and Cn2. When combining Dm and Cn, the whole entry set is divided into 4 subsets: Dm1 ∩ Cn1, Dm1 ∩ Cn2, Dm2 ∩ Cn1 and Dm2 ∩ Cn2, as illustrated in figure 4.3. These four subsets determine the common entries between partitions Dm and Cn. Since each partition has to include its two subsets, we have the following two combined common sets between Dm and Cn: (Dm1 ∩ Cn1) ∪ (Dm2 ∩ Cn2), and (Dm1 ∩ Cn2) ∪ (Dm2 ∩ Cn1).

Figure 4.3. A combination of discrete and continuous candidate partitions.

HS-4: Maximum Number of Common Entries

Given a combination of candidate partitions Dm and Cn, let M11 be the number of entries in Dm1 ∩ Cn1 and M12 be the number of entries in Dm2 ∩ Cn2. Similarly, let M21 be the number of entries in Dm2 ∩ Cn1 and M22 be the number of entries in Dm1 ∩ Cn2. Let M1 = M11 + M12 and M2 = M21 + M22. The maximum number of common entries M_max for the combination of Dm and Cn is defined as: M_max = max{M1, M2}. Among all combinations of candidate partitions, the one with the maximum M_max is picked as the best combination of candidate partitions. If there is a tie, a random one is picked.

Note that during the hybrid split we are looking for a combination of splits which has as many common entries as possible, and the lower bound on the number of common entries is guaranteed by the following theorem:

Theorem 4.3.1. The number of common entries M_max is at least 50% of the total number of entries to be distributed.

Proof. Suppose a splitting overflow node contains P entries. Let:
S1 = (Dm1 ∩ Cn1) ∪ (Dm2 ∩ Cn2),
S2 = (Dm1 ∩ Cn2) ∪ (Dm2 ∩ Cn1).
Here S1 has M1 entries and S2 has M2 entries. Since S1 ∪ S2 is the set of all entries in this splitting node and S1 ∩ S2 = ∅, we have M1 + M2 = P (M1 ≥ 0, M2 ≥ 0). If M2 (respectively M1) is smaller than or equal to ⌊P/2⌋, then M1 (respectively M2) must be larger than or equal to ⌈P/2⌉. □

The property proved above shows that the hybrid split will have a combination of candidate partitions which agree on the distribution of at least 50% of the total entries. This ensures that our split algorithm can always take advantage of both the discrete and continuous candidate partitions. For the chosen combination of candidate partitions, we place the common entries as suggested by S1 or S2, depending on which one is larger. The following subsection describes how to place the uncommon entries.
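Counting the common entries of a combination is straightforward once each partition is represented as a per-entry membership flag; a minimal sketch of HS-4's count (our own C++, not thesis code):

#include <algorithm>
#include <cstddef>
#include <vector>

// inD[i] / inC[i] tell whether entry i falls in Dm1 / Cn1 under the discrete
// and continuous partitions, respectively. Returns M_max = max{M1, M2},
// which by theorem 4.3.1 is at least half the number of entries.
int commonEntries(const std::vector<bool>& inD, const std::vector<bool>& inC) {
    int m1 = 0, m2 = 0;                    // |S1| and |S2| from the proof above
    for (std::size_t i = 0; i < inD.size(); ++i)
        (inD[i] == inC[i] ? m1 : m2)++;    // agreement goes to S1, the rest to S2
    return std::max(m1, m2);
}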
Redistributing Uncommon Entries

Once the best combination of candidate partitions from CP_d and CP_c is determined and their common entries are placed, the next step of the hybrid split is to redistribute the uncommon entries into the two common groups. Given a combination of candidate partitions D_m and C_n, without loss of generality, suppose the common groups picked are CG_1 = D_m1 ∩ C_n1 and CG_2 = D_m2 ∩ C_n2; the uncommon groups are UG_1 = D_m1 ∩ C_n2 and UG_2 = D_m2 ∩ C_n1. All the entries that remain in UG_1 or UG_2 and are not within the HMBR of CG_1 or CG_2 are put into a common group CG_1 or CG_2 using the following heuristics. After all the entries are distributed, the node splitting is determined.

Heuristics HS-5 and HS-6 are used to decide whether an uncommon entry E_i (1 ≤ i ≤ n) in UG_1 or UG_2 should go to CG_1 or CG_2. Let x represent the dimension from which D_m is generated and let y be the dimension for C_n. In HS-5 and HS-6, the geometric concepts are restricted to the 2-dimensional space composed of x and y. Let CG'_1 = CG_1 ∪ {E_i} and CG'_2 = CG_2 ∪ {E_i} (1 ≤ i ≤ n). First we apply HS-5 to E_i:

HS-5: Minimum Overlap of Common Groups
An entry E_i (1 ≤ i ≤ n) is put into CG_1 or CG_2 tentatively; then the common group which yields the smaller overlap after accommodating E_i is chosen. Let:

    O^i_1 = Area((HMBR of CG'_1) ∩ (HMBR of CG_2)),
    O^i_2 = Area((HMBR of CG'_2) ∩ (HMBR of CG_1)).

If O^i_1 < O^i_2, E_i is put into group CG_1; if O^i_1 > O^i_2, E_i is put into group CG_2.

Heuristic HS-6 is applied if O^i_1 equals O^i_2.

HS-6: Minimum Ratio of Perimeter Enlargement and Area Enlargement
Similar to HS-5, E_i (1 ≤ i ≤ n) is tentatively put into CG_1 or CG_2. Both the perimeter and the area are considered in this heuristic because we want to keep the area enlargement as large as possible while keeping the perimeter enlargement as small as possible, which improves the pruning power of the C-ND tree. The perimeter and the area are two important geometric measurements in optimizing tree performance; by taking the ratio of the two, we have both of them considered when redistributing uncommon entries. Let:

    A_1 = Area(HMBR of CG_1),  A_2 = Area(HMBR of CG_2),
    P_1 = Perimeter(HMBR of CG_1),  P_2 = Perimeter(HMBR of CG_2),

and

    A^i_1 = Area(HMBR of CG'_1),  A^i_2 = Area(HMBR of CG'_2),
    P^i_1 = Perimeter(HMBR of CG'_1),  P^i_2 = Perimeter(HMBR of CG'_2),

where 1 ≤ i ≤ n. Now let T^i_1 = (P^i_1 - P_1)/(A^i_1 - A_1) and T^i_2 = (P^i_2 - P_2)/(A^i_2 - A_2). If T^i_1 < T^i_2, E_i is put into group CG_1; if T^i_1 > T^i_2, E_i is put into group CG_2. If T^i_1 and T^i_2 are equal, E_i is put into the group with the smaller number of entries.

Figure 4.4 shows an example of redistributing uncommon entries to common groups.

[Figure 4.4. An example of redistributing uncommon entries in a hybrid splitting procedure.]

The performance of the heuristics used in the hybrid split algorithm is reported in Section 4.4.2.

4.3.5 Processing Range Queries Using the C-ND Tree

In this chapter, we consider range queries, which are popular in many application domains. Given a query vector q and a distance r, a range query retrieves all the indexed vectors which are within distance r from q.

To perform a range query in an HDS, we need a distance measure for the similarity between two vectors in the HDS (note that a distance measure is not needed when building the C-ND tree; it is only used for range queries in computing the range). The distance measure in HDSs is still an open issue; there is no well-known HDS distance measure available. In this chapter, we extend the Hamming distance measure to the HDS. The extended Hamming distance measure (EHDM) is defined as follows. Given two data points P = (p_1, p_2, ..., p_d) and P' = (p'_1, p'_2, ..., p'_d) in a d-dimensional HDS, we define:

    EHDM(P, P') = Σ_{i=1}^{d} F(p_i, p'_i),    (4.1)

where function F(p_i, p'_i) is defined as:

    F(p_i, p'_i) = 0  if i is a discrete dimension and p_i equals p'_i,
                      or i is a continuous dimension and |p_i - p'_i| ≤ t;
    F(p_i, p'_i) = 1  otherwise.

As can be seen from the definition above, when the i-th dimension is continuous, the threshold t determines the closeness of p_i and p'_i. In all our experiments we set t = 0.001.
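Equation (4.1) translates directly into code. The following C++ sketch is a minimal illustration of EHDM, assuming each hybrid vector stores one double per dimension (discrete letters as small integer codes) together with a per-dimension discrete/continuous flag; this representation and the names are ours, not the tree's actual storage format.

```cpp
#include <cmath>
#include <vector>

// A d-dimensional hybrid vector: component[i] holds either a discrete
// letter code (stored as a double) or a continuous value.
struct HybridVector {
    std::vector<double> component;
};

// EHDM of equation (4.1): each dimension contributes 0 on a match and 1
// otherwise. On a continuous dimension, values within threshold t match.
int ehdm(const HybridVector& p, const HybridVector& q,
         const std::vector<bool>& isDiscrete, double t = 0.001) {
    int dist = 0;
    for (std::size_t i = 0; i < p.component.size(); ++i) {
        bool match = isDiscrete[i]
                         ? (p.component[i] == q.component[i])
                         : (std::fabs(p.component[i] - q.component[i]) <= t);
        if (!match) ++dist;  // F(p_i, p'_i) = 1 for a mismatching dimension
    }
    return dist;
}
```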
As we mentioned above, the construction of a C-ND tree does not rely on any distance measure, and the proposed EHDM is only used for the purpose of testing range queries using the indexing tree. There could be distance measures for HDSs other than EHDM, but the extended Hamming distance provides a reasonable distance measure for an HDS, due to its simplicity and its origin in the Hamming distance.

Based on the extended Hamming distance measure on two vectors, the distance between a hybrid rectangle R and a query vector q can be defined as follows. Given a d-dimensional HDS Ω_d, a hybrid rectangle R = S_1 × S_2 × ... × S_d in Ω_d and a query vector q = (q_1, q_2, ..., q_d), the distance between R and q is calculated as:

    dist(R, q) = Σ_{i=1}^{d} f(S_i, q_i),    (4.2)

where

    f(S_i, q_i) = 0  if i is a discrete dimension and q_i ∈ S_i,
                     or i is a continuous dimension and (min(S_i) - t) ≤ q_i ≤ (max(S_i) + t);
    f(S_i, q_i) = 1  otherwise.

Using the extended Hamming distance measure, based on equations (4.1) and (4.2), a range query in an HDS can be defined as {v | EHDM(v, q) ≤ r}, where v represents a vector in the result set, q represents the query vector and r represents a search distance (range). An exact query is the special case of a range query with r = 0.

The C-ND tree can be utilized to efficiently process range queries based on EHDM. The query processing algorithm is implemented by invoking the following function on the root node of the C-ND tree:

Algorithm 4.3.1. RangeQuery(N, q, r) // Processing range queries
Input: node N in a given C-ND tree, query vector q and distance r
Output: all data vectors indexed by the C-ND tree within distance r from q
Method:
1.  let S = ∅
2.  if N is a leaf then
3.      for each vector v in N do
4.          if EHDM(v, q) ≤ r then
5.              S = S ∪ {v}
6.          end if
7.      end for
8.  else
9.      for each child node N' of node N do
10.         if dist(HMBR of N', q) ≤ r then
11.             S = S ∪ RangeQuery(N', q, r)
12.         end if
13.     end for
14. end if
15. return S

4.4 Experimental Results

Extensive experiments were conducted to evaluate the performance of the C-ND tree in comparison with some well-known indexing schemes. In this section we present our experimental results.

4.4.1 Experimental Setup

The experiment programs were implemented in C++. Tests were conducted on Sun Fire v20z servers with two AMD Opteron 250 2.4 GHz 64-bit processors and 4 GB RAM running Linux, and on Intel Xeon quad-core 5345 processors with 8 GB ECC DDR2 RAM running SuSE Enterprise Linux 10 in a high-performance computing cluster system.

Both synthetic and real data were used for our experiments. The synthetic data sets were generated randomly, consisting of both continuous and discrete dimensions. Given a domain D_i (1 ≤ i ≤ d), d being the number of dimensions, a random integer between 0 and |D_i| - 1 is generated for each discrete dimension. For each continuous dimension, its value is generated as a random decimal number between min(D_i) and max(D_i) (refer to Section 4.2). The same method is used to generate the query vectors for range queries. The query performance is measured by the number of I/Os (i.e., the number of tree nodes accessed) and is computed by averaging the I/Os over 200 queries.

4.4.2 Performance of Heuristics Used to Split Overflow Nodes

To determine the effectiveness of the heuristics on splitting an overflow node, we conducted a group of experiments on 10 different C-ND trees built from 10 different data sets.
Each data set contains 1 million randomly generated vectors with 8 discrete dimensions and 8 continuous dimensions, where the discrete dimensions have an alphabet size of 10 and the continuous dimensions range from 0 to 1. The range query performance for each C-ND tree is measured by the average I/Os of 200 different queries, and we take the average query performance over these 10 C-ND trees.

The first group of experiments was conducted to evaluate the effectiveness of the heuristics used to generate candidate partitions (HS-1, HS-2 and HS-3). We compared the following versions of different combinations:

Version 1: using HS-1 only.
Version 2: using HS-1 and HS-2.
Version 3: using HS-1, HS-2 and HS-3.

The experiment results are reported in Table 4.1, where rq denotes the query range used in the experiments.

Version | rq=1  | rq=2   | rq=3    | rq=4
1       | 68.57 | 370.64 | 1380.08 | 3716.94
2       | 57.18 | 286.28 | 1015.75 | 2706.43
3       | 28.15 | 145.76 | 544.68  | 1551.17

Table 4.1. Effectiveness of heuristics used to generate candidate partitions.

The second group of experiments was conducted to evaluate the effectiveness of the heuristics used to redistribute uncommon entries into common entry groups (HS-5 and HS-6). We considered the following versions:

Version 4: using HS-5 only.
Version 5: using both HS-5 and HS-6.

The results for the second group of experiments are shown in Table 4.2.

Version | rq=1  | rq=2   | rq=3    | rq=4
4       | 34.58 | 224.77 | 1019.48 | 3247.12
5       | 28.15 | 145.76 | 544.68  | 1551.17

Table 4.2. Effectiveness of heuristics used to redistribute uncommon entries.

From both groups of experiments we can see that, when additional heuristics are applied, the performance of the C-ND tree is positively affected, which demonstrates the effectiveness of the heuristics used when building the C-ND tree.

4.4.3 Performance Comparisons with the 10% Linear Scan, ND-tree and R*-tree

We have compared the performance of the C-ND tree with that of the 10% linear scan, the ND-tree and the R*-tree. Note that a linear scan of a disk sequentially reads all the pages and, therefore, is faster than random access of a disk page by a factor of about ten [22]. To have a fair comparison, we consider only 10% of all the pages, which is termed "the 10% linear scan". As mentioned in Section 4.1, the ND-tree and the R*-tree were not designed for indexing hybrid data; they can index only the discrete subspace or the continuous subspace of an HDS. We have adapted the ND-tree and the R*-tree for hybrid data by storing the whole vector in leaf nodes, because both discrete and continuous dimensions are required for the range computation in the HDS.

Various parameters such as the database size, range size, alphabet size, and various mixes of discrete/continuous dimensions were considered in our experiments. We have also conducted experiments on real climate data from the National Climatic Data Center. The results on both synthetic and real data show that, in general, the C-ND tree outperforms the other three approaches.

Effect of Database Sizes on Performance of the C-ND Tree

We conducted experiments using data sets of sizes ranging from 10 million vectors to 19 million vectors, each with 8 discrete and 8 continuous dimensions. The alphabet size for each of the discrete dimensions was 10. Figure 4.5 shows the number of I/Os for each of the methods for range queries using range 2. As expected, with increasing database sizes, the number of I/Os increases for each of the indexing schemes. However, the C-ND tree outperforms all the other three methods.
For a database size of 19 million, the C-ND tree is three times more efficient than its nearest contender, the ND-tree.

[Figure 4.5. Effect of database size in the C-ND tree (alphabet size 10, range 2, 8 discrete + 8 continuous dimensions): I/Os of the C-ND tree, ND-tree, R*-tree and 10% linear scan for database sizes 10M to 19M.]

Effect of Varying Mix of Discrete and Continuous Dimensions

In this set of experiments, we varied the mix of discrete and continuous dimensions while keeping the total number of dimensions fixed at 16. The alphabet size and the database size were set to 10 and 10 million, respectively. The results are shown in Figure 4.6. From the figure, it is observed that the C-ND tree outperforms the R*-tree and the 10% linear scan. When the number of discrete dimensions is high, the ND-tree performance is close to that of the C-ND tree. When the number of discrete dimensions is very low, the ND-tree is even worse than the 10% linear scan for the data sets considered: an effective ND-tree cannot be created because the number of duplicate discrete subvectors in the discrete subspace is very high for the given data sets.

[Figure 4.6. Effect of dimension mix in the C-ND tree (database size 10M, range 2, alphabet size 10): I/Os for dimension mixes from 4+12 to 12+4 (discrete + continuous).]

Effect of Alphabet Sizes

Figure 4.7 shows the performance of the various indexing schemes with increasing alphabet size. Here again, the C-ND tree is a clear winner. As the alphabet size increases, both the C-ND tree and the ND-tree gain pruning power, which is reflected in the decreasing number of I/Os.

[Figure 4.7. Effect of alphabet size in the C-ND tree (database size 10M, range 2, 8+8 dimensions): I/Os for alphabet sizes 10 to 16.]

Effect of Different Range Sizes for Queries

Figure 4.8 shows the performance effect of the query range. From the results we see that, as the range size increases, the number of I/Os also increases for all the indexing methods, which is as expected. However, we also see that the C-ND tree outperforms the others for all the ranges shown.

[Figure 4.8. Effect of query ranges in the C-ND tree (database size 10M, alphabet size 10, 8+8 dimensions): I/Os for ranges 1 to 4.]

Performance on Real Data

We used the Global Surface Summary of Day (GSOD) data of year 2007 from the National Climatic Data Center (http://www.ncdc.noaa.gov) to compare the performance of the C-ND tree with the 10% linear scan, the ND-tree and the R*-tree. We chose 6 discrete features (occurrence of fog, rain or drizzle, snow or ice pellets, hail, thunder, tornado or funnel cloud) and 3 continuous features (maximum temperature, minimum temperature and precipitation) from each data record. Records with missing features were removed from the data set.
Thus, each of the discrete dimensions has an alphabet of size 2, and the ranges of the three continuous dimensions are: maximum temperature -108.6 to +128.7, minimum temperature -114.3 to +105.8, and precipitation 0 to 18.98. These values are normalized based on the idea mentioned in Section 4.3.2. We indexed up to 1 million records and the query range was set to 1. The performance of the different techniques is reported in Figure 4.9. Note that the ND-tree performance shown in this figure is partly affected by the duplicate records in the discrete subspace, which is due to the nature of the NDDS. As can be seen from the figure, the C-ND tree performs much better than the 10% linear scan, the ND-tree and the R*-tree.

[Figure 4.9. Performance comparison of indexing GSOD data (database size 1M, 6 discrete + 3 continuous dimensions): I/Os of the C-ND tree, ND-tree, R*-tree and 10% linear scan for 100K to 1,000K vectors indexed.]

CHAPTER 5

Box Queries in the HDS

5.1 Motivations and Challenges

In many contemporary database applications, indexing of hybrid data, which contains both continuous and discrete dimensions, is required. For example, when indexing weather data of different locations, the daily temperature, precipitation and humidity should be treated as continuous information, while other information, such as the type of precipitation, is typically regarded as discrete.

Different indexing techniques have been proposed for either the CDS or the NDDS. Examples of CDS indexing methods are the R-tree, R*-tree, K-D-B-tree and LSDh-tree. NDDS indexing techniques include the ND-tree and the NSP-tree. Not surprisingly, none of these indexing methods can be applied to the HDS directly, because they rely on domain-specific characteristics (e.g., the ordering of data in the CDS) of their own data spaces.

One way of applying the CDS/NDDS indexing techniques to the HDS is to transform data from one space to the other. For example, discretization methods [21, 43, 64] could be utilized to convert data from a continuous space to a discrete space. If we discretize the weather data mentioned before, daily temperature could be converted to three discrete values: cold, warm and hot. However, this approach clearly changes the semantics of the original data. The C-ND tree [26] was recently proposed to create indexes for the hybrid data space, and is optimized to support range queries in the HDS.

In this chapter, we evaluate the effectiveness of the extended ND-tree structure and heuristics for box queries in the HDS. The ND-tree structure and building algorithms/heuristics are extended to handle both the continuous and the discrete information of the HDS. A novel strategy using power adjustment values to balance the preference for continuous and discrete dimensions is presented to handle the unique characteristics of the HDS, and an effective cost model to predict the performance of HDS indexing is introduced. Our experimental results show that the extended heuristics are effective in supporting box queries in the HDS, and that the cost estimates from the presented performance model are quite accurate. The research work presented in this chapter is reported in [27].

The rest of the chapter is organized as follows.
In Section 5.2 the ND-tree data structures and heuristics are extended to the HDS, and an approach of using different power (exponent) values to handle the unique characteristics of the HDS is presented. Section 5.3 reports our experimental results, which demonstrate that hybrid space indexing is quite promising in supporting efficient box queries in HDSs. Section 5.4 outlines a model to predict the performance of hybrid space indexing.

5.2 Extended Hybrid Indexing

5.2.1 Hybrid Geometric Concepts and Normalization

To efficiently build indexes for the HDS, some geometric concepts need to be extended from the NDDS to the HDS. The hybrid geometric concepts used in this chapter are the hybrid (hyper-)rectangle in the HDS, the edge length of a hybrid rectangle on a given dimension, the area of a hybrid rectangle, the overlap between two hybrid rectangles and the hybrid minimum bounding rectangle (HMBR) of a set of hybrid rectangles. Detailed definitions of these concepts can be found in [26] and are omitted here due to the page limit.

One challenge in applying hybrid geometric concepts is how to make the measures for the discrete and continuous dimensions comparable. For example, how do we compare the size of 2 for a discrete component set (i.e., 2 letters/elements) of an HMBR with the length of 500 for a continuous component interval in the same HMBR? To solve the problem, we adopt the normalized measures for hybrid geometric concepts introduced in [26]. In the rest of this chapter, we always use normalized hybrid geometric measures unless stated otherwise.

5.2.2 Extending the ND-tree to the HDS

Each non-leaf node entry in the hybrid indexing tree keeps an HMBR of a child node and a pointer to that child node. Information for the discrete dimensions of the HMBR is stored using a bitmap representation, and information for the continuous dimensions is stored by recording the corresponding lower and upper bounds of each dimension. Each leaf node entry stores the discrete and continuous components of every dimension as well as a pointer to the actual data associated with the key in the database.

Two critical tasks, namely choosing a leaf node to insert a new vector and overflow treatment, are extended to the HDS by using the corresponding geometric concepts introduced in Section 5.2.1. Next we briefly introduce the implementations of these two tasks.

A leaf node L to accommodate an indexed vector V is found recursively in a top-down manner: starting from the root node N, a child node N' is selected using certain heuristics; then one of N''s child nodes is picked using these heuristics again. This procedure continues until a leaf node is found. The heuristics used in finding the best child node are as follows (a code sketch is given after this paragraph). If V is contained by one or more child nodes' HMBRs, select the containing child node which has the smallest HMBR. If V is not within any child node's HMBR, choose the child node which causes the least overlap enlargement among sibling nodes' HMBRs after accommodating V. In case of ties, select the child node which needs the least HMBR enlargement. If there are ties again, pick the one which has the smallest HMBR. The heuristics are applied in the order they are presented here. In case there are still ties after applying all these heuristics, a candidate node is selected among the tied ones.
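The child-selection heuristics just listed can be sketched as follows. The ChildSummary fields (containment, HMBR area, overlap enlargement and HMBR enlargement with respect to the vector V being inserted) are assumed to be computed with the normalized hybrid geometric measures of Section 5.2.1; the structure itself is illustrative, not the tree's actual layout.

```cpp
#include <vector>
#include <cstddef>

// Hypothetical per-child summary, computed against the vector V being inserted.
struct ChildSummary {
    bool   contains;        // is V inside this child's HMBR?
    double area;            // area of the child's HMBR
    double overlapEnlarge;  // overlap enlargement among siblings after adding V
    double hmbrEnlarge;     // enlargement of this child's HMBR after adding V
};

// Applies the heuristics in order: (1) among containing children, the one
// with the smallest HMBR; otherwise (2) least overlap enlargement, then
// (3) least HMBR enlargement, then (4) smallest HMBR. Remaining ties: first.
std::size_t chooseChild(const std::vector<ChildSummary>& cs) {
    std::size_t best = cs.size();
    for (std::size_t i = 0; i < cs.size(); ++i)          // pass 1: containment
        if (cs[i].contains && (best == cs.size() || cs[i].area < cs[best].area))
            best = i;
    if (best != cs.size()) return best;
    best = 0;
    for (std::size_t i = 1; i < cs.size(); ++i) {        // pass 2: tie-break chain
        if (cs[i].overlapEnlarge != cs[best].overlapEnlarge) {
            if (cs[i].overlapEnlarge < cs[best].overlapEnlarge) best = i;
        } else if (cs[i].hmbrEnlarge != cs[best].hmbrEnlarge) {
            if (cs[i].hmbrEnlarge < cs[best].hmbrEnlarge) best = i;
        } else if (cs[i].area < cs[best].area) best = i;
    }
    return best;
}
```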
When splitting an overflow node, candidate partitions are generated along each dimension in the HDS. The auxiliary tree approach described in [79] is used to generate candidate partitions for the discrete dimensions. For each continuous dimension, candidate partitions are generated by sorting all entries' lower bound values and upper bound values on that dimension and then placing split positions in the sorted entry list. Four heuristics are utilized to choose a partition among all candidates: select the one with the least overlap between the two new nodes; pick the one generated along a dimension having the longest span in the overflow node's HMBR; find the partition whose two new nodes' HMBRs have the closest spans on the splitting dimension; and choose a partition which has the smallest HMBRs after splitting. The heuristics are applied in the order presented, i.e., if there are ties on one heuristic, the next one is used to break the ties. A random candidate partition is chosen among the tied ones if all heuristics have been utilized. A detailed discussion of the splitting algorithm can be found in [79]; the difference is that the heuristics presented in this section are based on geometric concepts in the HDS.

Given the extended insert operation in the HDS, the delete operation is implemented as follows. If no underflow occurs after an entry is removed from a leaf node, only the ancestor nodes' HMBRs are adjusted. In case of an underflow, the whole node is removed and all the remaining entries in that node are reinserted. If removing a node causes an underflow in its parent node, the parent node is also removed and all the vectors it indexes are reinserted. This removal operation propagates toward the root node until no underflow occurs or the root node is reached. In the latter case the root node is deleted and its remaining child node becomes the new root node of the tree. From our discussion we can see that the idea of our delete operation is similar to that of most hierarchical indexing structures, such as the R*-tree and the ND-tree.

A query box in the HDS is a hybrid rectangle containing a query range/set on each dimension. For a continuous dimension, a query range is specified by its upper and lower bounds. For a discrete dimension, a query set is specified by a subset of letters/elements from its domain. A traditional depth-first search algorithm is implemented for the hybrid indexing tree to evaluate its query performance. If a node's HMBR does not overlap with the query window (a rectangle in the HDS), the node (and all its child nodes) is pruned away from the query path (the pruning test is sketched below). Indexed vectors which are within the query window are returned as the query result. If no vector can be found inside the query window, an empty set is returned.
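The pruning test of the depth-first search can be sketched as follows, assuming a discrete dimension of an HMBR or query box is stored as a bitmap over the alphabet (as in Section 5.2.2) and a continuous dimension as a closed interval; the struct and names are ours.

```cpp
#include <cstdint>
#include <vector>

// One dimension of an HMBR or of a query box: either a letter bitmap
// (discrete, alphabet size <= 32 here) or an interval [low, high].
struct HybridRange {
    bool          discrete;
    std::uint32_t letters;    // bit s set <=> letter s present
    double        low, high;  // used when !discrete
};

// A query box overlaps an HMBR iff they overlap on every dimension:
// a shared letter on each discrete dimension and intersecting intervals
// on each continuous dimension. Used to prune subtrees during the search.
bool overlaps(const std::vector<HybridRange>& hmbr,
              const std::vector<HybridRange>& queryBox) {
    for (std::size_t i = 0; i < hmbr.size(); ++i) {
        if (hmbr[i].discrete) {
            if ((hmbr[i].letters & queryBox[i].letters) == 0) return false;
        } else {
            if (hmbr[i].high < queryBox[i].low ||
                queryBox[i].high < hmbr[i].low) return false;
        }
    }
    return true;  // no dimension separates them, so the node must be visited
}
```

A node whose HMBR fails this test is pruned together with its entire subtree.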
5.2.3 Enhanced Strategy for Prioritizing Discrete/Continuous Dimensions

As mentioned in Section 5.2.1, to make a fair comparison between discrete and continuous dimensions, we have employed normalized edge lengths when calculating geometric measures for an HDS. This strategy allows each dimension to make suitable contributions relative to its domain size in the HDS.

We also notice that the (non-ordered) discrete and continuous dimensions usually have different impacts on the performance of hybrid indexing due to their different domain properties. For example, a discrete dimension is more flexible when splitting its corresponding component set of an HMBR due to the non-ordering property, resulting in a higher chance to obtain a better (smaller-overlap) partition of the relevant overflow node. Assume S1 = {a, g, t, c} is the component set of an HMBR on a discrete dimension; the letters in S1 can be combined into two groups arbitrarily (subject to the minimum space utilization constraint of the node) to form a split of the set, e.g., {g}/{a, t, c}, {a, t}/{g, c}, etc. On the other hand, for the component set on a continuous dimension, say S2 = [0.2, 0.7], the way to distribute a value in S2 depends entirely on the splitting point. If the splitting point is 0.5 (i.e., the split is [0.2, 0.5]/(0.5, 0.7]), the values less than or equal to 0.5 have to be in the first group, and the others belong to the second group, because of the ordering property of a continuous dimension. The challenge is how to make use of this observation to balance the preference for discrete and continuous dimensions in the HDS.

One might suggest adopting different weights for the (normalized) edge lengths of an HMBR on the discrete and continuous dimensions, respectively. Unfortunately, this approach cannot work. Two important measures used in the tree construction algorithms are the area of an HMBR (or the overlap between HMBRs) and the span of an HMBR on a particular dimension (i.e., the edge length of the HMBR on the given dimension). Suppose we have HMBRs (or overlaps) R1 and R2 in a two-dimensional HDS with the following relationship: Area(R1) = L11 × L12 < Area(R2) = L21 × L22, where Lij is the (normalized) edge length of Ri on the j-th dimension (j = 1, 2). Assume that the first dimension is discrete and the second one is continuous. If we assign a weight wd to the discrete edge length and another weight wc to the continuous edge length when calculating the area, we will still have Area(R1) = (wd × L11) × (wc × L12) < Area(R2) = (wd × L21) × (wc × L22), because the weight factors on both sides of the inequality cancel each other. The same observation holds for spans.

To overcome the above problem, we adopt another approach: assigning different power (exponent) values pd and pc to the discrete and continuous edge lengths, respectively, when calculating area values. With a normalization, we can assume pc = (1 - pd). For R1 and R2 in the above example, if L11 = 0.1, L12 = 0.3, L21 = 0.2, L22 = 0.2, pc = 0.1, pd = 0.9, we have Area(R1) = 0.1^0.1 × 0.3^0.9 ≈ 0.27 > Area(R2) = 0.2^0.1 × 0.2^0.9 = 0.20, while the original area values have the relationship Area(R1) = 0.1 × 0.3 = 0.03 < Area(R2) = 0.2 × 0.2 = 0.04. Hence, we can change the outcome of an area comparison by using different power adjustment values.

Since the edge length is normalized to be between 0 and 1, the larger the power value is, the smaller the adjusted length would be (unless the edge length is 1 or 0). For the heuristics involving area comparisons during tree construction, we always prefer a smaller area. Therefore, if we increase power pd (i.e., reduce pc) for the discrete dimensions, we make the discrete edge lengths contribute less to the area calculation while making the continuous edge lengths contribute more. In this sense, we make the discrete dimensions more preferred. The way to make the continuous dimensions more preferred is similar.
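The numeric example above can be reproduced directly. The following minimal C++ program recomputes the plain and power-adjusted areas of R1 and R2 (the exponent assignment follows the worked numbers in the example) and illustrates why a multiplicative weight, unlike an exponent, cancels out of the comparison.

```cpp
#include <cmath>
#include <cstdio>

// Power-adjusted area of a two-dimensional HMBR with normalized edge
// lengths l1 and l2; the two exponents sum to 1.
double adjustedArea(double l1, double l2, double e1, double e2) {
    return std::pow(l1, e1) * std::pow(l2, e2);
}

int main() {
    // R1 = 0.1 x 0.3, R2 = 0.2 x 0.2 (normalized edge lengths).
    double plainR1 = 0.1 * 0.3, plainR2 = 0.2 * 0.2;   // 0.03 < 0.04
    // Exponents 0.1 and 0.9, as in the worked example above: the adjusted
    // comparison flips relative to the plain one.
    double adjR1 = adjustedArea(0.1, 0.3, 0.1, 0.9);   // ~0.27
    double adjR2 = adjustedArea(0.2, 0.2, 0.1, 0.9);   //  0.20
    std::printf("plain:    Area(R1)=%.2f < Area(R2)=%.2f\n", plainR1, plainR2);
    std::printf("adjusted: Area(R1)=%.2f > Area(R2)=%.2f\n", adjR1, adjR2);
    // A multiplicative weight cancels out of the comparison instead:
    // (wd*0.1)*(wc*0.3) < (wd*0.2)*(wc*0.2) for any positive wd, wc.
    return 0;
}
```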
We can also assign power adjustment values qd and qc = (1 - qd) to the discrete and continuous edge lengths, respectively, when calculating the span value of every dimension. However, during tree construction a dimension with a larger span is more preferred. Therefore, if we want to make a discrete dimension more preferable, we need to decrease the power value qd for that dimension, which is the opposite of the situation when calculating area values. The discussion for the span value on continuous dimensions is similar. If we want to make discrete dimensions consistently more preferred during tree construction, we need to increase pd and, at the same time, decrease qd. To reduce the number of parameters of the algorithm, we simply let qd = (1 - pd). Hence, we only need to set a value for the single parameter pd; the other parameters (i.e., pc, qd, qc) are derived from pd.

In Tables 5.1 and 5.2 we present the accumulated numbers of discrete splits and continuous splits at the leaf level and the non-leaf levels of the indexing structure. The data were collected after indexing 10 million vectors from an HDS with 4 discrete dimensions and 4 continuous dimensions.

pd at leaf level             | 0.1   | 0.3   | 0.5   | 0.7   | 0.9
Number of discrete splits    | 15    | 6558  | 36392 | 64924 | 81717
Number of continuous splits  | 71989 | 65520 | 51049 | 15198 | 1062

Table 5.1. Number of accumulated discrete and continuous splits at the leaf level.

pd at non-leaf level         | 0.1   | 0.3   | 0.5   | 0.7   | 0.9
Number of discrete splits    | 30    | 30    | 728   | 1045  | 1071
Number of continuous splits  | 1033  | 1039  | 612   | 249   | 242

Table 5.2. Number of accumulated discrete and continuous splits at non-leaf levels.

As expected, discrete dimensions become more and more preferred at node splitting time as pd increases, which is clearly reflected in both tables. The experimental results in Section 5.3 show that this power adjustment strategy further improves the performance of the hybrid indexing.

5.3 Experimental Results

To evaluate the efficiency of hybrid indexing, we conducted extensive experiments. Typical experimental results are reported in this section.

5.3.1 Experimental Setup

The experimental programs were implemented in C++. Tests were conducted on Sun Fire v20z servers with two AMD Opteron 250 2.4 GHz 64-bit processors and 4 GB RAM running Linux.

The synthetic data sets used for our experiments were randomly generated and consist of both continuous and discrete dimensions. For a discrete dimension with alphabet size A, a discrete value was created by generating a random integer between 0 and A - 1. For a continuous dimension with range A, the possible values are decimal numbers between 0 and A. In Section 5.3.7 we also provide a performance comparison among the different techniques using real climate data.

For a test query, a box size X is used to define the volume of its query box: given a query box with box size X, each discrete dimension has X letters and each continuous dimension has length X. The query performance is measured by the number of I/Os (i.e., the number of index tree nodes accessed, assuming each node occupies one disk block) and is computed by averaging the I/Os over 200 queries.

We have compared the performance of indexing the HDS using the hybrid indexing with that of the 10% linear scan [22], the ND-tree and the R*-tree. As mentioned in Section 5.1, the ND-tree and the R*-tree were not originally designed for indexing hybrid data; they can index only the discrete subspace or the continuous subspace of an HDS.
One approach to process a query in the HDS using an ND-tree is to locate the potential vectors from the database based on the discrete part of the query and then search through the hits (possibly a very large set) for further filtering based on the continuous part of the query. In this case, to avoid requiring a large number of I/Os for searching the continuous data, we keep both the continuous and the discrete data in the leaf nodes of the ND-tree. For the same reason, we keep both continuous and discrete data in the leaf nodes of the R*-tree.

5.3.2 Performance Comparisons

In the following subsections we compare the performance of the hybrid indexing tree, the ND-tree, the R*-tree and the 10% linear scan [22]. Various parameters such as database sizes and alphabet sizes are considered in our experiments. A symbol δ is used to represent the additional dimensions utilized by the hybrid indexing approach. For example, given an HDS with i continuous dimensions and j discrete dimensions, by indexing the whole HDS we have δ (δ = j) extra dimensions to use when compared to the R*-tree approach, and δ (δ = i) extra dimensions to use when compared to the ND-tree approach.

In our experiments we create the R*-tree for a 4-dimensional continuous subspace, which is a typical number of dimensions for the R*-tree to avoid the dimensionality curse problem. For the discrete subspace we use 8 dimensions, because an effective ND-tree cannot be built if the number of dimensions is too low (there are too many duplicate vectors in the subspace). From the experimental results we see that the hybrid indexing outperforms the other three approaches; in some of the cases, the performance gain is quite significant.

5.3.3 Performance Gain with Increasing Database Sizes

In this group of tests the performance of hybrid indexing is compared with that of the ND-tree, the R*-tree and the 10% linear scan for various database sizes (i.e., the number of vectors indexed). The number of additional dimensions δ is set to 2. That is, we use the hybrid indexing approach to index the 4 continuous dimensions used by the R*-tree plus 2 additional discrete dimensions, and compare the query I/O with that of the R*-tree. Similarly, we compare the performance of indexing 8 discrete dimensions and 2 continuous dimensions against the ND-tree approach, which indexes only the 8 discrete dimensions. The query I/O of hybrid indexing is also compared with that of the 10% linear scan approach, which utilizes all the dimensions as the hybrid indexing does. The alphabet size for each of the discrete dimensions is set to 10.

[Figure 5.1. Effect of various database sizes in the HDS: I/O ratio of the hybrid indexing vs. the ND-tree (8 dsc, δ=2), the R*-tree (4 cnt, δ=2) and the 10% linear scan (8 dsc + 2 cnt, and 2 dsc + 4 cnt), for database sizes 1M to 16M.]

Figure 5.1 shows that the hybrid indexing approach reduces box query I/Os, and the performance gain generally increases with growing database sizes.

5.3.4 Performance for Various Additional Dimensions

In the following experiments, we vary the number of additional dimensions (i.e., the δ value) used by the hybrid indexing. The alphabet size and database size in these experiments are set to 10 and 10 million, respectively. Our experimental results are reported in Figure 5.2. Again, from these results we see that the hybrid indexing outperforms all the other approaches.
In some cases (e.g., compared with the R*-tree) the performance improvement is significant.

[Figure 5.2. Effect of additional dimensions in the HDS: I/O ratio of the hybrid indexing vs. the ND-tree, the R*-tree and the 10% linear scan as the number of additional dimensions δ varies.]

[Figure 5.3. Effect of different alphabet sizes in the HDS: I/O ratio of the hybrid indexing vs. the ND-tree (8 dsc, δ=2), the R*-tree (4 cnt, δ=2) and the 10% linear scan, for alphabet sizes 8 to 16.]

5.3.5 Performance for Different Alphabet Sizes

As we see from Figure 5.3, the hybrid indexing is much more efficient when compared to the R*-tree and the 10% linear scan. It also does better than the ND-tree approach. With increasing alphabet sizes the ND-tree performance gets closer to that of the hybrid indexing. However, in real-world applications most NDDS domains are small; for example, genome data has a domain size of 4 (i.e., {a, g, t, c}). The database size used for this group of tests is 10 million.

5.3.6 Performance for Different Query Box Sizes

All of the above experiments show the performance comparisons for query box size 2. This group of tests evaluates the effect of different query box sizes. Query results for box sizes 1 ~ 3 are reported here because, as box sizes become larger, the 10% linear scan approach becomes more preferable; not surprisingly, all indexing trees will eventually lose to the linear scan when the query selectivity is high. The results in Figure 5.4 show that indexing the hybrid data space improves box query performance for all the box sizes given. A database size of 10 million and an alphabet size of 10 are used for this group of tests.

5.3.7 Performance on Real Data

We used the GSOD data introduced in Section 4.4.3 to compare the performance of the extended ND-tree heuristics with the 10% linear scan, the ND-tree and the R*-tree. The performance of the different techniques is shown in Figure 5.5. The total number of vectors indexed is 1 million and the query box size is set to 1. As seen from the figure, indexing the HDS using our extended heuristics performs better than the 10% linear scan, the ND-tree and the R*-tree. It is worth pointing out that, since the ND-tree can index only the discrete subspace, its performance is partly affected by the duplicates of the discrete features.

[Figure 5.4. Effect of different query box sizes in the HDS: I/O ratio of the hybrid indexing vs. the ND-tree (8 dsc, δ=2), the R*-tree (4 cnt, δ=2) and the 10% linear scan, for box sizes 1 to 3.]

[Figure 5.5. Performance comparison of indexing GSOD data: I/O ratio of the hybrid indexing vs. the ND-tree (6 dsc, δ=3), the R*-tree (3 cnt, δ=6) and the 10% linear scan (6 dsc + 3 cnt), for 100K to 1M vectors indexed.]

5.3.8 Effect of Enhanced Strategy with Power Value Adjustment

To examine the effectiveness of using exponent (power) values to adjust the edge lengths of an HMBR, as discussed in Section 5.2.3, we conducted relevant experiments. The query I/Os obtained using this enhanced strategy are shown in Figure 5.6.
The x-axis indicates the different power values (pd) used for the discrete dimensions. To eliminate the possible effect that different numbers of discrete and continuous dimensions might have on the enhanced strategy, we use an HDS with 4 discrete and 4 continuous dimensions. The alphabet size is set to 10 and the number of vectors indexed is 10 million. The results in Figure 5.6 show that the new enhanced strategy can further improve the performance of the hybrid indexing using the extended ND-tree heuristics. The exponent value of 0.5 corresponds to the situation of not applying any power value adjustment, because both the discrete and continuous dimensions then have the same exponent value (0.5).

[Figure 5.6. Performance for the enhanced strategy with power value adjustment: query I/Os for box sizes 1 ~ 3 under power values pd from 0.1 to 0.9.]

5.4 Performance Estimation Model

To predict the performance behavior of the hybrid indexing tree, we have developed a performance estimation model. Our model is divided into two parts: the first part estimates the key parameters of the tree (e.g., the characteristics of the HMBRs of tree nodes) for a given HDS, and the second part estimates the number of I/Os for arbitrary query box sizes based on the key parameters. If the tree is given and we want to predict its performance behavior, only the second part is needed. The two parts of our model are sketched as follows.

Part I: Estimating key parameters of a tree

In this part of the performance model, we are given an HDS and the number of vectors (to be indexed) in the HDS. Hence we have the following input parameters: the system disk block size, the number of continuous and discrete dimensions, the domain/alphabet size of the discrete domains, and the number of vectors to be indexed. We assume that the components of the vectors are uniformly distributed in their respective dimensions. Using the above information, we can first determine the maximum number of entries Mn in a non-leaf node and the maximum number of entries Ml in a leaf node.

From the heuristics extended to the HDS, we notice that the numbers of splits that different nodes (at the same level of the tree) have gone through usually either equal each other or differ by 1. The tree growing process involves two time windows. During any period of the tree's growth, when a node at a certain level splits, all the other nodes at the same level split around the same time; we call this time period the splitting (time) window. After a node is split, the new nodes start to accumulate incoming data for some time; we call this period the accumulating (time) window. Since a new node created from a split is about half full and takes quite some time before it becomes full again, the accumulating window is typically much larger than the splitting window. Thus we focus on capturing the performance behavior for the accumulating window in our performance model.

At the beginning of an accumulating window the node space utilization is about 50%, because each overflow node has split into two half-full nodes. When the accumulating window ends, the node space utilization is close to 100%; on average the utilization would be 75%. However, we notice that, in real-world situations, although most splits occur in the splitting window, some splits happen during the accumulating window.
As a result, the number of nodes during the accumulating window is slightly larger than what is estimated under the assumption that all splits happen in the splitting window. Hence the actual average node space utilization (around 70% in our experiments) is also lower than expected. Therefore, our estimates for the average numbers of entries in a non-leaf node and a leaf node during the accumulating window are En = 0.7 × Mn and El = 0.7 × Ml, respectively.

The height h of the tree indexing V vectors is estimated as:

    h = ⌈log⌈V/El⌉ / log(En)⌉ + 1    (5.1)

The estimated number of nodes n_i at level i of the tree is computed recursively as in formula (5.2):

    n_i = ⌈n_{i-1}/En⌉   for 1 ≤ i ≤ h (non-leaf levels),
    n_0 = ⌈V/El⌉         for i = 0 (leaf level).    (5.2)

Using w_i to represent the number of rounds of splits that every node at level i has gone through, it can be estimated as:

    w_i = ⌊log_2(n_i)⌋    (5.3)

When a new vector is indexed, it can cause at most one node to split at each level of the tree. In other words, each time a vector is inserted, at most one node at level i of the tree will split. This causes some nodes at a given level of the tree to split one more time than the rest of the nodes at the same level. Using v_i to represent the number of nodes at level i which have gone through one more round of splits, it can be estimated as:

    v_i = (n_i - 2^{w_i}) × 2    (5.4)

Once we know the height of the tree, the number of nodes at each level and the number of splits each node has gone through, we can estimate the parameters (e.g., edge lengths) of the HMBRs of the nodes at each level. The details of the derivation are omitted here and can be found in the appendix of this thesis.

Part II: Estimating query performance based on key parameters of a tree

The second part of our model estimates the number of I/Os needed for a hybrid indexing tree (defined by the key parameters discussed in Part I) given arbitrary query box sizes. The estimation has three steps:

(ES-1) Estimating the overlapping probability for one discrete/continuous dimension

For a discrete dimension d, assume that the domain size of the dimension is Dd, the set size of a node N's HMBR on dimension d is Ld, and the set size of a query box on this dimension is Td. Clearly, Ld ≤ Dd and Td ≤ Dd. The probability of the query box overlapping with N's HMBR on dimension d is:

    1 - C^{Td}_{Dd-Ld} / C^{Td}_{Dd}    (5.5)

For a continuous dimension c, without loss of generality, assume that the domain range/interval is [0, Cc], and that the lower and upper bounds of a node N's HMBR on dimension c are Lc and Uc, respectively. Further suppose the edge length of a query box on this dimension is Tc. We have 0 ≤ Lc, Uc ≤ Cc and Tc ≤ Cc. The probability of the query box overlapping with N's HMBR on dimension c is:

    (b - a)/(Cc - Tc)    (5.6)

where a = max{Lc - Tc, 0} and b = min{Uc, Cc - Tc}. This probability calculation is based on the lower bound value p of the query box on dimension c. Clearly, p is within the range [0, Cc - Tc]. If the query box has an overlap with interval [Lc, Uc] on dimension c, p must be within [Lc - Tc, Uc]. The terms a and b handle the boundary conditions when (Lc - Tc) < 0 and (Uc + Tc) > Cc.
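To make the model concrete, the following C++ sketch implements formulas 5.1 ~ 5.4 of Part I and the per-dimension overlapping probabilities 5.5 and 5.6 of ES-1. The function names are ours, and binomial coefficients are computed in double precision; this illustrates the formulas only, not the full estimation program (which also needs the HMBR estimates of Appendix B). Steps ES-2 and ES-3 below combine these per-dimension probabilities.

```cpp
#include <cmath>
#include <algorithm>

// C(n, k): binomial coefficient in double precision; 0 when k > n.
double choose(int n, int k) {
    if (k < 0 || k > n) return 0.0;
    double c = 1.0;
    for (int i = 1; i <= k; ++i) c *= double(n - k + i) / i;
    return c;
}

// Formula 5.1: tree height for V vectors with average fanouts En and El.
int treeHeight(long long V, double En, double El) {
    return int(std::ceil(std::log(std::ceil(V / El)) / std::log(En))) + 1;
}
// Formula 5.2: number of nodes at level i (level 0 = leaf level).
long long nodesAtLevel(long long V, double En, double El, int i) {
    long long n = (long long)std::ceil(V / El);
    for (int l = 1; l <= i; ++l) n = (long long)std::ceil(n / En);
    return n;
}
// Formula 5.3: rounds of splits every node at the level has gone through.
int roundsOfSplits(long long ni) { return int(std::floor(std::log2(double(ni)))); }
// Formula 5.4: nodes that have gone through one extra round of splits.
long long extraSplitNodes(long long ni, int wi) { return (ni - (1LL << wi)) * 2; }

// Formula 5.5: probability that a query set of size Td intersects an HMBR
// component set of size Ld on a discrete dimension with domain size Dd.
double discreteOverlapPro(int Dd, int Ld, int Td) {
    return 1.0 - choose(Dd - Ld, Td) / choose(Dd, Td);
}
// Formula 5.6: probability that a query interval of length Tc intersects
// the HMBR interval [Lc, Uc] on a continuous dimension with domain [0, Cc].
double continuousOverlapPro(double Cc, double Lc, double Uc, double Tc) {
    double a = std::max(Lc - Tc, 0.0), b = std::min(Uc, Cc - Tc);
    return (b - a) / (Cc - Tc);
}
```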
(ES-2) Estimating the overlapping probability for one tree node

The probability of a tree node N overlapping with an arbitrary query box Q is the product of the overlapping probabilities of N and Q on all dimensions, which are calculated by ES-1.

(ES-3) Estimating the I/O number for the tree

The number of I/Os for the tree to process a box query is estimated as the summation of the overlapping probabilities between the query box and every tree node, which are calculated by ES-2.

We conducted experiments to verify the above performance model. Two sets of typical experimental results are shown in Figures 5.7 and 5.8. Each observed performance number was measured as the average number of I/Os over 200 random queries. The HDS used in the experiments has 4 continuous dimensions and 4 discrete dimensions with an alphabet size of 10.

Figure 5.7 shows the comparison between our estimated and observed I/O numbers for queries with box sizes 1 ~ 3. The number of indexed vectors ranges from 1 million to 10 million. Since the trees are given, these experimental results demonstrate the accuracy of the second part of our performance model. The experimental results show an average relative error of only 2.45% in this case.

[Figure 5.7. Verification of the performance model for given trees and given HDSs: actual vs. estimated I/Os for box sizes 1 ~ 3, 1M to 10M vectors indexed.]

Figure 5.8 shows the comparison between our observed I/O numbers (from queries on actual trees) and estimated I/O numbers (for the given HDS without building any tree). In this case, the tree parameters for the given HDS also need to be estimated using the first part of our model. From the figure, we can see that our performance model still gives quite good estimates, although the accuracy degrades a little because more parameters need to be estimated. The average relative error is 5.76%.

[Figure 5.8. Verification of the performance model for given HDSs: actual vs. estimated I/Os for box sizes 1 ~ 3, 1M to 10M vectors indexed.]

CHAPTER 6

Conclusion

In this thesis we introduced a new indexing structure, the BOND-tree, to support efficient box queries in the NDDS. Box queries and range queries are two widely used search operations in the indexing area. However, they require different indexing schemes because of the fundamental difference in the way they search an indexed NDDS: a box query applies a stand-alone query condition to each indexed dimension, while a range query checks a combined condition over all the indexed dimensions. Inspired by the ND-tree, which is optimized for range queries in the NDDS, the BOND-tree exploits exclusive properties of the NDDS and adopts new heuristics designed to support box queries in the NDDS. Insert, delete and query operations on the BOND-tree are defined to support indexing of dynamically updated data sets in the NDDS. Based on the observation that DMBRs in non-leaf nodes contain all letters in most of the dimensions, we developed a compression strategy to reduce the space utilization of the BOND-tree. The concept of compression may be of significance for some of the other NDDS indexing structures as well. Both our experimental results and the theoretical proof given for the improved splitting strategies in the NDDS demonstrate that the BOND-tree improves box query performance in the NDDS significantly, and the use of compression further improves the query performance of the BOND-tree.
There are numerous modern applications requiring search operations on large-scale data sets in the hybrid data space, which contains both continuous and discrete dimensions. To support efficient range query processing for hybrid data, a robust multidimensional indexing technique is required. In this thesis, we introduced a new indexing technique, the C-ND tree, to support indexing of vectors in the HDS. To develop the C-ND tree, we first introduced some essential geometric concepts, such as hybrid bounding (hyper-)rectangles in the HDS. These geometric concepts are important for the efficient organization of the indexed vectors in the HDS. The C-ND tree structure and the building algorithms were then presented, based on heuristics that we found to be effective in the HDS. A novel length normalization idea is employed to make the measures on continuous and discrete dimensions comparable and controllable.

We conducted extensive experiments to evaluate the performance of the C-ND tree in the HDS with various database sizes, mixes of continuous and discrete dimensions, and different alphabet sizes. Our experimental results demonstrate that the C-ND tree is generally more efficient than the linear scan approach, and than indexing the corresponding continuous subspace using the R*-tree or the discrete subspace using the ND-tree. As expected, when the number of continuous dimensions in an HDS increases, the performance of the C-ND tree gets closer to that of an R*-tree; when the number of discrete dimensions increases, the performance of the C-ND tree gets closer to that of an ND-tree. The reason why the C-ND tree generally outperforms the R*-tree and the ND-tree is that it can make use of the given query conditions on the additional dimensions, which the latter two methods cannot utilize to prune unnecessary search paths in the tree.

We have also extended the original ND-tree building heuristics to support box queries in the HDS. To make the measures on continuous and discrete dimensions comparable and controllable, two unique strategies, the edge length normalization and the power value adjustment, are employed. A theoretical model is also developed to predict the performance of hybrid indexing in HDSs. Our experimental results demonstrate that the ND-tree's heuristics are still effective after being extended to the HDS: using these heuristics to index the HDS is more efficient than the traditional linear scan method, than indexing the continuous subspace of the HDS using the R*-tree, and than indexing the discrete subspace using the ND-tree.

Our future work includes further study of the characteristics of the HDS, the applicability of the ideas in the BOND-tree to the HDS, and the development of more effective strategies to build indexes in the HDS to support various query types.

APPENDICES

APPENDIX A

Proof of Theorems for Box Queries in the NDDS

In this appendix, we give the proofs of the theorems proposed in Chapter 3. These theorems are used in the BOND-tree splitting heuristics to support box queries in the NDDS. For the sake of simplicity, we assume our discussion is in an NDDS Ω_d with an alphabet size A for all dimensions in Ω_d and that the query box has edge length b on all dimensions. The discussion can be extended to more general situations and is omitted in this thesis. The optimal splitting criteria (Theorem 3.3.3 and Theorem 3.3.4) defined and applied in Section 3.3.4 can be proved using the following three lemmas.
Lemma A.0.1. When splitting on a dimension with edge length x on the MBR, if the two newly generated nodes have edge lengths 1 and x - 1, the two new nodes have better filtering power than splitting the node into any other pair of edge lengths.

Lemma A.0.2. Splitting the node on a dimension with edge length x gives more filtering power than splitting on a dimension with edge length x + 1.

Lemma A.0.3. Splitting the node on a dimension with edge length x gives more filtering power than splitting on a dimension with edge length x + n, where n is an integer greater than or equal to 1.

A.1 Proof of Lemma A.0.1

When splitting a dimension with edge length x, one candidate partition generates two new nodes which have edge lengths 1 and x - 1 on this dimension; the other candidate partition yields two new nodes with edge lengths t and x - t on this dimension, where 1 < t < x - 1. Without loss of generality, we further assume that t ≤ x - t. We want to show:

    C^b_{A-x+1}/C^b_A + C^b_{A-1}/C^b_A ≥ C^b_{A-x+t}/C^b_A + C^b_{A-t}/C^b_A    (1 < t < x - 1)    (A.1)

⇔

    C^b_{A-x+1} + C^b_{A-1} ≥ C^b_{A-x+t} + C^b_{A-t}    (1 < t < x - 1)    (A.2)

Let α = A - x + t, β = A - x + 1 and δ = x - 1 - t; then α + δ = A - 1 and β + δ = A - t, so formula (A.2) is equivalent to:

    C^b_{α+δ} - C^b_{β+δ} ≥ C^b_α - C^b_β    (A.3)

When b = 1, both sides of (A.3) equal α - β, so (A.3) holds. For b > 1, by Pascal's rule, C^b_{n+1} - C^b_n = C^{b-1}_n, so the two sides of (A.3) telescope into sums of binomial coefficients:

    C^b_{α+δ} - C^b_α = Σ_{k=0}^{δ-1} C^{b-1}_{α+k}    (A.4)

    C^b_{β+δ} - C^b_β = Σ_{k=0}^{δ-1} C^{b-1}_{β+k}    (A.5)

Since t > 1 implies α > β, and C^{b-1}_n is non-decreasing in n, each term of (A.4) is at least the corresponding term of (A.5). Hence C^b_{α+δ} - C^b_α ≥ C^b_{β+δ} - C^b_β, which rearranges to (A.3). Proof of Lemma A.0.1 is complete. □

A.2 Proof of Lemma A.0.2

Consider two dimensions with edge lengths x and x + 1. From Lemma A.0.1 we already know that the best way to split either of them is the 1-and-the-rest-of-the-letters way.

When splitting the dimension with edge length x, the overlapping probability contributed by the two dimensions is:

    (1 - C^b_{A-1}/C^b_A + 1 - C^b_{A-x+1}/C^b_A)(1 - C^b_{A-x-1}/C^b_A),

where the first factor accounts for the two new nodes (with component sizes 1 and x - 1) on the split dimension and the second factor for the untouched dimension of length x + 1. Similarly, the overlapping probability when splitting the dimension with edge length x + 1 is:

    (1 - C^b_{A-1}/C^b_A + 1 - C^b_{A-x}/C^b_A)(1 - C^b_{A-x}/C^b_A).

We want to show:

    (2 - C^b_{A-1}/C^b_A - C^b_{A-x+1}/C^b_A)(1 - C^b_{A-x-1}/C^b_A) ≤ (2 - C^b_{A-1}/C^b_A - C^b_{A-x}/C^b_A)(1 - C^b_{A-x}/C^b_A)    (A.7)

Substituting C^b_{A-x+1} = ((A-x+1)/(A-x+1-b))·C^b_{A-x} and C^b_{A-x-1} = ((A-x-b)/(A-x))·C^b_{A-x} into (A.7), and multiplying both sides by (C^b_A)^2, we get:

    (2C^b_A - C^b_{A-1} - ((A-x+1)/(A-x+1-b))·C^b_{A-x})(C^b_A - ((A-x-b)/(A-x))·C^b_{A-x}) ≤ (2C^b_A - C^b_{A-1} - C^b_{A-x})(C^b_A - C^b_{A-x})    (A.8)

Expanding both sides, cancelling the common terms and dividing by C^b_{A-x} > 0, (A.8) reduces to:

    (2C^b_A - C^b_{A-1})·(b/(A-x)) - C^b_A·(b/(A-x+1-b)) - C^b_{A-x}·(b/((A-x+1-b)(A-x))) ≤ 0    (A.9)

Dividing by b and multiplying by (A-x)(A-x+1-b) > 0, this becomes:

    (2C^b_A - C^b_{A-1})(A-x+1-b) - C^b_A(A-x) ≤ C^b_{A-x}    (A.12)

Substituting C^b_{A-1} = (1 - b/A)·C^b_A into (A.12), we get:

    ((A+b)(A-x+1-b)/A - (A-x))·C^b_A ≤ C^b_{A-x}    (A.14)

⇔

    ((A - bx + b - b^2)/A)·C^b_A ≤ C^b_{A-x}    (A.16)

⇔

    (A - bx + b - b^2)/A ≤ C^b_{A-x}/C^b_A    (A.17)

When b = 1, both sides of (A.17) equal (A-x)/A, so (A.17) holds. We use mathematical induction: suppose (A.17) holds when b = b'; we want to show that it holds when b = b' + 1.

Let a = (A - b'x + b' - b'^2)/A and β = C^{b'}_{A-x}/C^{b'}_A; we know a ≤ β.
A 091’ When b = b’ + 1,1eft side of formula (A.17) becomes A—b’x—x+b’+1—b’2 —2b’—1 _ A—b’x+b’—b’2 —:c-—2b’ _ :I:+2b’ A " A _ a A and right side of formula (A.17) becomes b’+1 b’ A - a: — b’ CA a: CA—a: (1’ +1 18(1 _ :1: ) Cbl+1 Cb, A — b, A — b, A b’ + 1 So we want to Show that a: + 2b’ :1: __ < _ . a A _B(1 A—b’) (A 18) 156 Since a S 3, we only need to show: x+2b’ a: > A —flA—b’ <=> (A—b’)(x+2j:)_>fl ’ Arc " 4:) m+2b’A—b’ > (A—x)(A—a:—1)...(A—:v—b’+_1_ :1: A “ A(A—1)...(A—b’+1) <==> x+2b’ > (A—x)(A—x-1)...(A—a:—b’+1) a: - (A—l)...(A—b’) Left side of (A22) has I I x+2b =1+2_b_21 H 8 On the right side of (A22), since :1: > 1, (A—:1:)(A—:1:—1)...(A—:1:—b"+#1)_<1 (A—Du(A—M Thus we know (A22) holds. Proof of Lemma A.0.2 is complete. 157 (Am) (Am) (Am) (An) A.3 Proof of Lemma A.0.3 Suppose the filtering power gained from splitting a node on a dimension with edge length l is Pr(l). From Lemma A.0.2, we know: Pr(:c)>Pr(x+1)>...>Pr(a:+n) (7121) Thus we have Pr(a:) > Pr(a: + n) (n 2 1). Proof of Lemma A.0.3 is complete. By proving the correctness of Lemma A.0.1 ~ Lemma A.0.3, the optimal splitting CIiteria proposed in section 3.3.4 are proved. 158 APPENDIX B Algorithms for Performance Estimation Model Given the system disk block size, the number of continuous and discrete dimensions, the domain for each dimension of the HDS and the number of vectors to be indexed, in section 5.4 we have estimated the height h of the tree, the number of nodes n,- at level 2' of the tree, the number of rounds 10,- that each node at level i has gone through and the number of nodes 22, at level i which have had more than one round of split than the rest of nodes at the same level. In appendix B we give the algorithms to estimate the box query I/O given a query box B. We first provide the algorithm Estimate_QueryIO. It uses formulas 5.1 ~ 5.4 to eStilnate the height h of the indexing tree, the number of nodes n,- at each level 2’, r(NJ-Ends of splits w,- that every node at level 2' has gone through, and the number of nodes which have had one more round of split at level 2'. These estimation are based on the number of vectors indexed (V), the average numbers of entries in a non-leaf 159 node (En) and a leaf node (E). It calls algorithm Estimate_0verlappingPro to get the summation of overlapping probabilities at each level of the tree and calculates the estimated box query I/O for the whole indexing tree. Algorithm B.0.1. Estimate_QueryIO(En, El, V,B) //estimate box query I/O of an indexing tree. Input: En is the average number of entries in a non-leaf node, E1 is the average number of entries in a leaf node, V is number of indexed vectors, B is the query box szze. Output: the estimated box query I/O. Method: 1. boxqueryJO=0 2. estimate tree height h based on En, El and V through formula 5.1 3. for ifrom 1 to h do 4. estimate number of nodes n,- at level i through formula 5.2 5. estimate round of splits w,- at level i through formula 5.3 6. estimate number of nodes v,- having one more round of splits at level i through formula 5.4 7. boxqueryJO += Estimate-0verlappingPro(n,-, 11),, vi, B ) 8. end for 9. return boxquergJO Algorithm Estimate_0verlappingPro is used to estimate the number of box query I/O for a certain level i of the tree. Based on number of nodes ni, number of rounds of splits w, and number of nodes v,- which have gone through one more round of splits, it first calls algorithm EstimateJ-IMBR to estimate HMBRs of nodes at level i of the tree. 
Algorithm Estimate_OverlappingPro is used to estimate the box query I/O for a certain level i of the tree. Based on the number of nodes n_i, the number of rounds of splits w_i, and the number of nodes v_i which have gone through one more round of splits, it first calls algorithm Estimate_HMBR to estimate the HMBRs of nodes at level i of the tree. Since at each level there are v_i nodes which have gone through one more round of splits than the rest of the nodes, algorithm Estimate_HMBR is called twice to estimate HMBRs for the nodes which have different rounds of splits separately. Two lists (LH and LN) are returned by algorithm Estimate_HMBR: LH records the HMBRs estimated after a certain round of splits, and LN records the number of nodes for each corresponding HMBR in LH. This information is applied to formula 5.5 and formula 5.6 to get the overlapping probability between each node's HMBR and the query box.

Algorithm B.0.2. Estimate_OverlappingPro(n_i, w_i, v_i, B) //estimate overlapping probabilities for nodes at a certain level.
Input: n_i is the total number of nodes, w_i is the number of rounds of splits all nodes have gone through, v_i is the number of nodes which have had one more round of splits, B is the query box size.
Output: the summation of overlapping probabilities between nodes' HMBRs and the query box.
Method:
1. initialize LH
2. initialize LN
3. let H = HMBR of the whole data space
4. result = 0
5. (LH, LN) = Estimate_HMBR(H, n_i − v_i, w_i)
6. for i from 1 to size of LH do
7.   calculate overlapping probability p between LH[i] and B using formulas 5.5 and 5.6
8.   result += p × LN[i]
9. end for
10. (LH, LN) = Estimate_HMBR(H, v_i, w_i + 1)
11. for i from 1 to size of LH do
12.   calculate overlapping probability p between LH[i] and B using formulas 5.5 and 5.6
13.   result += p × LN[i]
14. end for
15. return result

Algorithm Estimate_HMBR is used to estimate all HMBRs occurring at level i of the indexing tree after certain rounds of splits. Note that if there are n_i nodes at level i, the data space is partitioned n_i − 1 times. Algorithm Estimate_HMBR starts from an HMBR which covers the whole indexed data space S. It then estimates the number of nodes and their corresponding HMBRs after a certain number of partitions of the original data space. After algorithm Estimate_HMBR, the original HMBR which covers the whole data space is partitioned R times (R is the number of rounds of splits that nodes at this level have gone through), LH is returned as a list containing the HMBRs of nodes at this level, and LN is a list of the numbers of nodes for each corresponding HMBR in LH.

Algorithm B.0.3. Estimate_HMBR(S, N, R) //estimate HMBRs for nodes at a certain level.
Input: S is the originally indexed data space, N is the number of nodes, R is the number of rounds of splits that have occurred at a certain level of the tree.
Output: a list of HMBRs LH and a list LN of the numbers of nodes for each corresponding HMBR in LH.
Method:
1. add S to list LH
2. add N to list LN
3. for i from 1 to R do
4.   sz = size of list LH
5.   for k from 1 to sz do
6.     longestSpan = LH[k]'s span on dimension 1
7.     longestDim = 1
8.     for j from 2 to num_dim do //num_dim is the total number of dimensions
9.       if LH[k] has a span on dimension j larger than longestSpan
10.        longestSpan = LH[k]'s span on dimension j
11.        longestDim = j
12.      end if
13.    end for
14.    if longestDim is a discrete dimension
15.      if LH[k]'s span on dimension longestDim is an even number
16.        divide LH[k]'s span on dimension longestDim by 2
17.      else
18.        let x be a copy of LH[k]
19.        set LH[k]'s span on dimension longestDim = ⌊(LH[k]'s span on dimension longestDim)/2⌋
20.        set x's span on dimension longestDim = ⌈(x's span on dimension longestDim)/2⌉
21.        append x to LH
22.        y = LN[k]
23.        LN[k] = ⌊y/2⌋
24.        y = ⌈y/2⌉
25.        append y to LN
26.      end if
27.    else //longestDim is a continuous dimension
28.      divide LH[k]'s span on dimension longestDim by 2
29.    end if
30.  end for
31. end for
32. return (LH, LN)
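Algorithm B.0.3 is self-contained enough to run directly; below is a Python sketch of it in which an HMBR is represented simply as a list of spans (one per dimension) and is_discrete marks the non-ordered discrete dimensions. The representation and the names are illustrative assumptions, not part of the dissertation's implementation.

# Sketch of Algorithm B.0.3 (Estimate_HMBR): repeatedly halve each HMBR on its
# longest dimension; an odd discrete span splits into floor and ceil halves,
# creating a new HMBR shape and dividing the node count between the two shapes.

def estimate_hmbr(S, N, R, is_discrete):
    """Return (LH, LN): HMBR span vectors and node counts after R rounds of splits."""
    LH = [list(S)]     # start from the HMBR of the whole data space
    LN = [N]
    for _ in range(R):                      # one round of splits
        for k in range(len(LH)):            # snapshot of the current list size
            spans = LH[k]
            j = max(range(len(spans)), key=lambda d: spans[d])  # longest dimension
            if is_discrete[j]:
                if spans[j] % 2 == 0:
                    spans[j] //= 2
                else:
                    other = list(spans)     # the ceil half becomes a new shape
                    spans[j] = spans[j] // 2
                    other[j] = other[j] // 2 + 1
                    LH.append(other)
                    n = LN[k]
                    LN[k] = n // 2
                    LN.append(n - n // 2)   # ceil(n / 2)
            else:
                spans[j] /= 2               # continuous dimension
    return LH, LN

# For example, estimate_hmbr([5, 4.0], 10, 1, [True, False]) splits the odd
# discrete span 5 into 2 and 3, giving LH = [[2, 4.0], [3, 4.0]], LN = [5, 5].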
Algorithms B.0.1 ~ B.0.3 provided in this section complete our theoretical model proposed in section 5.4. Note that these three algorithms are used to first estimate the indexing tree structure and then estimate the box query I/O based on that tree structure. In case the indexing tree already exists, we do not need to estimate the tree structure (i.e., the tree height h, the rounds of splits w_i, the number of nodes n_i at level i of the tree and the number of nodes v_i which have had one more round of splits). Instead we can find the HMBRs of each node in the tree and apply the HMBR information directly in algorithm B.0.2.
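As a final illustration, Algorithm B.0.2 can be sketched in Python on top of the estimate_hmbr function above; the per-node overlap probability of formulas 5.5 and 5.6 is left as a stub, the whole data space and the dimension types are passed explicitly, and all names here are again hypothetical.

# Sketch of Algorithm B.0.2 (Estimate_OverlappingPro), reusing estimate_hmbr.
# overlap_probability is a stub for formulas 5.5 and 5.6 (hypothetical name).

def overlap_probability(hmbr, B):
    raise NotImplementedError("formulas 5.5 and 5.6")

def estimate_overlapping_pro(n_i, w_i, v_i, B, space, is_discrete):
    """Sum the overlap probabilities between level-i HMBRs and the query box B."""
    result = 0.0
    # nodes that have gone through w_i rounds of splits
    LH, LN = estimate_hmbr(space, n_i - v_i, w_i, is_discrete)
    result += sum(overlap_probability(h, B) * n for h, n in zip(LH, LN))
    # the v_i nodes that have gone through one extra round of splits
    LH, LN = estimate_hmbr(space, v_i, w_i + 1, is_discrete)
    result += sum(overlap_probability(h, B) * n for h, n in zip(LH, LN))
    return result

When the indexing tree already exists, LH and LN would instead be read directly off the actual nodes, as noted above.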