This is to certify that the dissertation entitled Techniques for Efficient k-Nearest Neighbor Searching in Non-Ordered Discrete and Hybrid Data Spaces, presented by Dashiell Matthews Kolbe, has been accepted towards fulfillment of the requirements for the Doctoral degree in Computer Science.

Major Professor's Signature
Date

TECHNIQUES FOR EFFICIENT K-NEAREST NEIGHBOR SEARCHING IN NON-ORDERED DISCRETE AND HYBRID DATA SPACES

By

Dashiell Matthews Kolbe

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Computer Science

2010

ABSTRACT

TECHNIQUES FOR EFFICIENT K-NEAREST NEIGHBOR SEARCHING IN NON-ORDERED DISCRETE AND HYBRID DATA SPACES

By

Dashiell Matthews Kolbe

Similarity searches/queries in Non-Ordered Discrete Data Spaces (NDDS) and Hybrid Data Spaces (HDS) are becoming increasingly useful in areas such as bioinformatics, multimedia, text retrieval, audio and video compression, data mining, and E-commerce. The objective of this dissertation is to develop and analyze novel methods to support similarity searches in such spaces.

In this dissertation, we first discuss performing k-Nearest Neighbor (k-NN) searches in NDDSs. Performing such searches in NDDSs raises new challenges. Due to the coarse granularity of the commonly used Hamming distance measure, a nearest neighbor query in an NDDS may lead to a large set of candidate solutions, creating a high degree of non-determinism. We propose a new distance measure that reduces the number of candidate solutions for a query while preserving the essential properties of the Hamming distance. We have also implemented nearest neighbor queries using multidimensional database indexing in NDDSs. We use the properties of our multidimensional NDDS index to derive the probability of encountering new neighbors within specific regions of the index. This probability is used to develop a new search ordering heuristic. Our experiments demonstrate that our nearest neighbor algorithm is efficient in finding nearest neighbors in NDDSs and that our heuristics are effective in improving the performance of such queries.

We then discuss our work on providing a generalization of our GEH distance. This generalized form allows our distance measure to be applied to a broad range of applications. Of these, we discuss a new rank-based implementation well suited to applications with heavily skewed data distributions.
Our experiments demonstrate the benefits of an adaptable distance metric by presenting scenarios that demonstrate performance changes depending upon the distance measure used.

Finally, we discuss extending k-NN searching to HDS. We consider the challenges of exploiting both the CDS and NDDS properties of HDS for optimizing potential search algorithms. In particular, we consider how key search information is maintained in HDS data structures and what rules must be observed to guarantee the correctness of search results in such spaces. Further, the concept of search execution stages is introduced to develop efficient k-NN search algorithms for HDS. Lastly, a theoretical performance model is developed for HDS searching to validate our experimental results.

To my parents, who always believed in me.

ACKNOWLEDGMENTS

I would like to first acknowledge my thesis adviser, Dr. Sakti Pramanik, who has provided a strong guiding hand throughout my graduate career at Michigan State University. My growth as a researcher would not have been possible without Dr. Pramanik.

I would also like to provide special acknowledgement for Dr. Qiang Zhu of the University of Michigan. Many of the ideas presented in this thesis are the result of discussions with Dr. Zhu and Dr. Pramanik. The level of collaboration that was achieved in these discussions is something that I continue to strive for in my daily life.

I also extend my sincere gratitude to my thesis committee, Dr. Mark Dykman, Dr. James Cole, and Dr. Rong Jin. They provided both their time and expertise to improve both the depth and breadth of this thesis.

Lastly, I would like to thank my family and friends for being my village. My parents, Nancy, Chris, and Ted, receive my deepest gratitude for their unending love and support.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

1 Introduction
  1.1 Motivating Applications
    1.1.1 Multimedia Objects
    1.1.2 Computational Biology
    1.1.3 Text Retrieval
    1.1.4 Document Retrieval
  1.2 Space Partitioning Concepts
    1.2.1 Metric Space
    1.2.2 Non-Ordered Discrete Space
    1.2.3 Hybrid Space
    1.2.4 Vector Space Distance Measurements
  1.3 Overview of the dissertation

2 Previous Work
  2.1 Nearest Neighbor Algorithms
    2.1.1 Exact Searching Methods
    2.1.2 Approximate Searching Methods
    2.1.3 Unique Searching Methods
  2.2 General Metric Space and CDS Index Structures
    2.2.1 KD-Tree
    2.2.2 LSD-Tree
    2.2.3 R-Tree and R*-Tree
    2.2.4 Burkhard-Keller Tree
    2.2.5 Fixed Query Tree
    2.2.6 Fixed Queries Array
    2.2.7 M-Tree
  2.3 NDDS Models in Vector Space
    2.3.1 The ND-tree
    2.3.2 NSP-Tree
  2.4 HDS Models in Vector Space
    2.4.1 NDh-tree
    2.4.2 CND-tree
  2.5 Determining Distance
3 k-Nearest Neighbor Searching in NDDS
  3.1 Motivations and Challenges
  3.2 k-Nearest Neighbors in NDDS
    3.2.1 Definition of k-NN in NDDS
    3.2.2 Extended Hamming Distance
    3.2.3 Probability of Valid Neighbors
  3.3 A k-NN Algorithm for NDDS
    3.3.1 Heuristics
    3.3.2 Algorithm Description
    3.3.3 Performance Model
  3.4 Experimental Results
    3.4.1 Effectiveness of GEH Distance
    3.4.2 Efficiency of k-NN Algorithm on Uniform Data
    3.4.3 Efficiency of k-NN Algorithm on Skewed Data
    3.4.4 Efficiency of k-NN Algorithm on Non-Homogeneous Data
    3.4.5 Verification of Performance Model

4 Understanding Distance in NDDS
  4.1 Motivations and Challenges
  4.2 Generalized GEH Distance
  4.3 Ranking Based GEH Instantiation

5 k-Nearest Neighbor in Hybrid Data Spaces
  5.1 Motivation and Challenges
  5.2 Nearest Neighbor Search Stages
  5.3 Search Algorithm
    5.3.1 Match Likelihood
    5.3.2 Algorithm Description
    5.3.3 Performance Model
  5.4 Experimental Results
    5.4.1 Effects of Heuristics and Datasets
    5.4.2 Performance Model Verification

6 Conclusion

APPENDICES

A Intrinsic Dimensionality in Non-Ordered Discrete Data Spaces
  A.1 Overview
  A.2 Distribution of NDDS Datasets
  A.3 Distribution Effects on Search Performance
  A.4 Experimental Results

B Triangular Property of GEH - Extended Proof

C MinMaxDistance Discussion
  C.1 Overview
  C.2 Proof

BIBLIOGRAPHY

LIST OF TABLES

4.1 Varying Dimensionality
4.2 Varying Zipf Distribution
5.1 Performance Model Variables

LIST OF FIGURES

3.1 Example of NDDS data points distributed by distance
3.2 Comparison of Δk values for the Hamming distance
3.3 Comparison of Δk values for the GEH and Hamming distances
3.4 Effects of heuristics in the k-NN algorithm using ND-tree with k = 10
3.5 Performance of the k-NN algorithm using ND-tree vs. the linear scan on synthetic datasets with various sizes
3.6 Number of disk accesses comparison of the k-NN algorithm using ND-tree vs. the k-NN searching based on M-tree
3.7 Performance of the k-NN algorithm using ND-tree vs. the linear scan on genomic datasets with various sizes for k = 10
3.8 Performance of the k-NN algorithm using ND-tree vs. the linear scan on genomic datasets with various dimensions for k = 10
3.9 Performance of the k-NN algorithm using ND-tree vs. the linear scan on synthetic datasets with various dimensions for k = 10 and d = 10
3.10 Performance comparison for the k-NN searching using ND-tree based on GEH and Hamming distances
3.11 Performance of the k-NN algorithm using ND-tree vs. the linear scan on synthetic datasets with various sizes and zipf distributions
3.12 Performance of the k-NN algorithm using ND-tree on datasets with various misleading dimensions (k = 1)
3.13 Performance of the k-NN algorithm using ND-tree on datasets with various misleading dimensions (k = 5)
3.14 Performance of the k-NN algorithm using ND-tree on datasets with various misleading dimensions (k = 10)
3.15 Estimated and actual performance of the k-NN algorithm vs. the linear scan on synthetic datasets with various sizes (k = 1)
3.16 Estimated and actual performance of the k-NN algorithm vs. the linear scan on synthetic datasets with various sizes (k = 5)
3.17 Estimated and actual performance of the k-NN algorithm vs. the linear scan on synthetic datasets with various sizes (k = 10)
5.1 Search stage I/O with variable number of continuous dimensions
5.2 Search stage I/O with variable number of discrete dimensions
5.3 Search stage I/O with variable database size
5.4 Performance I/O with variable number of non-native dimensions
5.5 Performance I/O with variable database size
5.6 Performance I/O with variable database size (CND-tree and ND-tree only)
5.7 Performance I/O comparing ordering methods
5.8 Performance model comparison with variable database size
5.9 Performance model comparison with variable number of non-native dimensions
A.1 Histogram of Hamming distance values
A.2 Histogram of GEH distance values
A.3 GEH_Rank, zipf = [0.0 - 1.5]
A.4 GEH_Freq, zipf = [0.0 - 1.5]

CHAPTER 1

Introduction

Nearest neighbor searches/queries in Non-Ordered Discrete Data Spaces (NDDS) and Hybrid Data Spaces (HDS) are becoming increasingly useful in areas such as bioinformatics, multimedia, text retrieval, audio and video compression, information security, data mining, and E-commerce. This form of searching may be stated as the following problem: given a set S of n data points in a dataset X (with |X| ≥ k) and a query point q ∈ X, return a set of k > 0 objects A ⊆ X where ∀u ∈ A, v ∈ S − A : D(q, u) ≤ D(q, v) and |A| = k. Examples of such queries are "finding the k closest restaurants to the intersection of Fifth and Main," and "finding the k fixed-length words that differ the least from the word 'near'."
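To make this definition concrete, the following is a minimal brute-force sketch (not part of the original text): a linear scan that returns the k smallest distances to the query. The distance function, sample data, and tie-breaking behavior are illustrative assumptions only.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

def brute_force_knn(data: Sequence, q, k: int,
                    dist: Callable) -> List[Tuple[float, object]]:
    """Return the k objects of `data` with the smallest dist(q, x).

    Linear-scan baseline for the k-NN definition above; ties beyond the
    k-th distance are broken arbitrarily, which is exactly the source of
    non-determinism the dissertation later addresses for NDDSs.
    """
    # heapq.nsmallest keeps only k candidates at a time while scanning.
    return heapq.nsmallest(k, ((dist(q, x), x) for x in data),
                           key=lambda pair: pair[0])

def hamming(u: Sequence, v: Sequence) -> int:
    """Hamming distance: number of dimensions whose elements differ."""
    return sum(1 for a, b in zip(u, v) if a != b)

if __name__ == "__main__":
    db = ["agtca", "agtcc", "ttgca", "agacc", "cgtca"]
    print(brute_force_knn(db, "agtca", k=2, dist=hamming))
```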
Numerous techniques have been proposed in the literature to support efficient nearest neighbor queries in continuous (ordered) data spaces (CDS) and general metric spaces. Excellent surveys of both the theoretical and practical aspects of nearest neighbor searching in such spaces have been presented by Chavez et al. [14] and Hjaltason and Samet [27], respectively. Little work has been reported on supporting efficient similarity searches in either NDDSs or HDSs.

A d-dimensional NDDS is a Cartesian product of d domains/alphabets consisting of finite non-ordered elements/letters. For example, when searching genome DNA sequences, consisting of letters 'a', 'g', 't', 'c', each sequence is often divided into intervals/strings of a fixed length d (q-gram). These intervals can be considered as vectors from a d-dimensional NDDS with alphabet {a, g, t, c} for each dimension. Other examples of non-ordered discrete dimensions are color, gender, and profession. A d-dimensional HDS is a Cartesian product of m (m ≤ d) dimensions/alphabets consisting of finite non-ordered elements/letters and n = d − m continuous dimensions. For example, consider sports-related data containing match win statistics and match locations. Match win statistics could be considered continuous data, while match locations could be considered non-ordered discrete data.

The remainder of this chapter is organized as follows: Section 1.1 discusses motivating applications; Section 1.2 introduces space partitioning concepts considered throughout this dissertation; Section 1.3 provides an overview of the research presented in this dissertation.

1.1 Motivating Applications

This section presents a sample of applications that rely upon performing efficient k-NN similarity searches in CDSs, NDDSs, and HDSs. Both fixed-length and variable-length data applications are presented for completeness.

1.1.1 Multimedia Objects

Facial recognition algorithms are used by a wide variety of applications, such as law enforcement and security, to identify an object by a live or still image. This is achieved by comparing a selected set of features from the image against similar feature sets stored in a database. This is equivalent to creating a feature vector composed of HDS objects. Fingerprint matching and voice recognition are both handled in a similar fashion. For each, a feature vector is established, containing values representing key structures within the object, such as whorls in a fingerprint, which is then compared with an indexed set of feature vectors to determine the closest matches. Here, it is common practice to try to determine the single closest match, ideally considered an exact match. However, due to the approximation involved in creating the feature vectors, nearest matching is a much more practical and efficient approach.

1.1.2 Computational Biology

Biological data, particularly genome sequences, may be represented as a d-dimensional vector α, with each dimension i (1 ≤ i ≤ d) containing a value from the set of possible elements A_i for the i-th dimension. In genome sequencing, the alphabet over all dimensions is the same, A = {a, g, t, c}, such that a 25-dimensional segment of a genome strand could be represented in an NDDS Ω_25 as α = "agtcaagtcaaatccagtcaatcca". Initially, searching in genome sequence databases employed edit distance metrics or substitution matrices, such as BLOSUM, to determine the similarity between genome sequences. Lately, however, the Hamming distance (discussed in Section 1.2) has become more popular for searching in large genome sequence databases [29], [54]. Unfortunately, the Hamming distance has a poor level of granularity in measurement, which can lead to unclear results. We consider this issue in depth in Chapter 3.
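As an illustration of the q-gram view described above, the short sketch below (not from the original text; the segment length, the example strand, and the use of non-overlapping intervals are assumptions) splits a DNA string into fixed-length intervals and compares two of them with the Hamming distance.

```python
def qgrams(sequence: str, d: int = 25) -> list:
    """Split a genome string into non-overlapping intervals of length d.

    Each interval is a vector in the d-dimensional NDDS over {a, g, t, c};
    trailing characters that do not fill a full interval are dropped.
    """
    return [sequence[i:i + d] for i in range(0, len(sequence) - d + 1, d)]

if __name__ == "__main__":
    strand = "agtcaagtcaaatccagtcaatccattgcaggctttgcaaggctttgcag"
    vectors = qgrams(strand, d=25)
    print(vectors)                                        # two 25-dimensional vectors
    # Coarse Hamming comparison of the two intervals:
    print(sum(a != b for a, b in zip(vectors[0], vectors[1])))
```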
1.1.3 Text Retrieval

Text objects are typically represented as strings of varying length. Several index methods have been proposed to support efficient searches over such data. Examples of this approach are Tries [30], [18], the Prefix B-tree [5], and the String B-tree [19]. Most indexing methods in this category, such as the Prefix B-tree and String B-tree, are designed specifically for exact searches rather than similarity searches. The Trie model does support similarity searches, but is difficult to apply to large databases due to its memory-based nature.

Text retrieval operations are, however, fundamentally different from those in the other applications listed in this section. Datasets of the previously listed applications, such as genome sequences, are composed of objects each represented by a string/vector of fixed length/size. Due to this, geometric properties such as area and range can be applied to the dataset. Text retrieval operations, however, cannot take advantage of these geometric properties because text objects are composed of varying-length strings. This dissertation focuses upon searching in datasets of fixed dimensionality and thus does not consider applications towards efficient text retrieval. The issue of text retrieval is covered in detail in [45].

1.1.4 Document Retrieval

Document retrieval is used extensively in World Wide Web and database applications. Traditionally, a test document d_i is classified by creating a feature vector containing the percent composition of certain key words throughout the document. For example, if a document is being classified by only three key words 'river', 'bank', and 'sediment', a feature vector V = {0.5, 0.3, 0.2} might be generated to identify d_i, meaning that of the key words found in the document, 50% are 'river', 30% are 'bank', and 20% are 'sediment'. This vector would then be compared to a dataset U_T composed of similar feature vectors generated from a training/representative set of documents.

Most work in this field has focused upon improving the construction of feature vectors [36] and choosing an optimal number of neighbors to search for [4]. However, [4] showed how the current standard method of cosine similarity measurement sometimes provides an intuitively incorrect similarity match between feature vectors. This phenomenon results from the feature vectors themselves being only an approximation of key word occurrences within a document, paying little attention to how those words are related to each other. A possible solution to this problem is to calculate the similarity between two documents based upon a discrete representation of their key words.

1.2 Space Partitioning Concepts

This section discusses common data space environments in which k-NN searches are performed. Continuous Data Space, in the form of general metric space, is discussed first. This is followed by a discussion of Non-Ordered Discrete Data Space. Next, the combination of NDDS and CDS is presented as Hybrid Data Space. Finally, the concept of vector space distance measurement is introduced.

1.2.1 Metric Space

Let U denote the universe of valid objects being searched. The function D : U × U → R denotes a measure of distance between objects. To be used in search and indexing applications, such distance functions generally must satisfy the following properties:

Positiveness: ∀x, y ∈ U, D(x, y) ≥ 0
Symmetry: ∀x, y ∈ U, D(x, y) = D(y, x)
Reflexivity: ∀x ∈ U, D(x, x) = 0
Strict Positiveness: ∀x, y ∈ U, x ≠ y ⇒ D(x, y) > 0
Triangular Inequality: ∀x, y, z ∈ U, D(x, y) ≤ D(x, z) + D(z, y)

If a distance measurement D satisfies the Triangular Inequality property, then the pair (U, D) is called a metric space.
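The following sketch (illustrative only; the sampling scheme, alphabet, and use of the Hamming distance as the example metric are assumptions, not taken from the text) empirically checks the properties listed above for a candidate distance function over a small NDDS sample.

```python
import itertools
import random

def hamming(u, v):
    """Number of dimensions on which two equal-length vectors differ."""
    return sum(1 for a, b in zip(u, v) if a != b)

def check_metric_axioms(dist, points):
    """Report whether `dist` satisfies the metric properties on `points`."""
    ok = True
    for x, y in itertools.product(points, repeat=2):
        ok &= dist(x, y) >= 0                        # positiveness
        ok &= dist(x, y) == dist(y, x)               # symmetry
        ok &= (dist(x, y) > 0) == (x != y)           # reflexivity + strict positiveness
    for x, y, z in itertools.product(points, repeat=3):
        ok &= dist(x, y) <= dist(x, z) + dist(z, y)  # triangular inequality
    return ok

if __name__ == "__main__":
    random.seed(0)
    alphabet, d = "agtc", 10
    sample = ["".join(random.choice(alphabet) for _ in range(d)) for _ in range(20)]
    print(check_metric_axioms(hamming, sample))      # True: (U, Hamming) is a metric space
```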
If D does not satisfy the Strict Positiveness property, the space is called a pseudo-metric space. If the Symmetry property is not fulfilled, the space is generally described as a quasi-metric space; the case of a city map containing one-way streets is a common example of such a space.

From the above definitions, it can be inferred that most spaces are in fact specialized implementations of the general metric space. Specifically, if the elements in the metric space are tuples of real numbers, then the space may be described as a finite-dimensional vector space [40]. Such spaces generally make use of geometric distance measures, such as Euclidean or Manhattan. When such tuples represent numbers from a continuous (ordered) dataset, the pair (U, D) is considered a CDS.

1.2.2 Non-Ordered Discrete Space

Discrete space is based upon the concept of all objects in the universe of discourse U being discrete and non-ordered along each dimension. Ordered discrete objects typically demonstrate the same properties as CDS objects and thus are not considered in this section. Examples of such non-ordered discrete data are multimedia objects, profession, gender, bioinformatics, and user-defined types. Each of these examples may be represented as a feature vector in a d-dimensional space. Consider that, in genome sequence databases such as GenBank, sequences with alphabet A = {a, g, t, c} are broken into substrings of fixed length d for similarity searches [29], [54]. Each substring can be considered as a vector in d-dimensional space. For example, substring aggctttgcaaggctttgcagcact is a vector in the 25-dimensional data space, where the i-th character is a letter chosen from alphabet A in the i-th dimension. In this example, a is no closer to c than it is to t, and so forth. Thus, mapping discrete space into continuous space by applying a form of ordering changes the semantics of the dataset.

A formal definition of an NDDS is as follows. Let A_i (1 ≤ i ≤ d) be an alphabet consisting of a finite number of non-ordered elements. A d-dimensional NDDS Ω_d is defined as the Cartesian product of d alphabets:

\[ \Omega_d = A_1 \times A_2 \times \cdots \times A_d . \]

A_i represents the i-th dimension alphabet of Ω_d. The area of space Ω_d is defined as:

\[ \mathrm{area}(\Omega_d) = \prod_{i=1}^{d} |A_i| . \]

This formula also indicates the number of possible unique vectors in Ω_d. A vector α is defined by the tuple α = (α[1], α[2], ..., α[d]), where α[i] ∈ A_i. A discrete rectangle R in Ω_d is defined as the Cartesian product:

\[ R = S_1 \times S_2 \times \cdots \times S_d , \]

where S_i is a set of elements/letters from the alphabet of the i-th dimension of the given d-dimensional NDDS. The length of the i-th dimension edge of R is defined as:

\[ \mathrm{length}(R, i) = |S_i| . \]

The area of R follows the formula for the area of the data space:

\[ \mathrm{area}(R) = \prod_{i=1}^{d} |S_i| . \]
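The sketch below is an illustrative reading of these definitions (not code from the dissertation): it represents a discrete rectangle as a list of per-dimension element sets and computes its edge lengths, its area, and whether it contains a given vector.

```python
from math import prod

# A discrete rectangle R = S_1 x S_2 x ... x S_d is modeled as a list of sets,
# one set of alphabet elements per dimension.
def length(R, i):
    """Edge length of R on dimension i: the number of elements in S_i."""
    return len(R[i])

def area(R):
    """Area of R: the product of its edge lengths over all dimensions."""
    return prod(len(S_i) for S_i in R)

def contains(R, alpha):
    """True if vector alpha lies inside R (alpha[i] is in S_i for every i)."""
    return all(a in S_i for a, S_i in zip(alpha, R))

if __name__ == "__main__":
    R = [{"a", "g"}, {"t"}, {"a", "c", "g"}]        # a rectangle in a 3-dimensional NDDS
    print(length(R, 0), area(R))                    # 2, 2*1*3 = 6
    print(contains(R, ("g", "t", "c")))             # True
    print(contains(R, ("t", "t", "c")))             # False
```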
1.2.3 Hybrid Space

Hybrid space is composed of a combination of NDDS elements and CDS elements. Consider domain D_i (1 ≤ i ≤ d). If D_i is comprised of non-ordered discrete elements, it corresponds to an alphabet A_i. If D_i is comprised of continuous elements, it corresponds to some continuous span. For the purposes of our discussion, we will consider this span to be normalized within [0, 1]. A d-dimensional hybrid data space (HDS) Ω_d is defined as the Cartesian product of d such domains:

\[ \Omega_d = D_1 \times D_2 \times \cdots \times D_d . \]

As described by Qian [41], a vector in Ω_d is comprised of the tuple α = (α_1, α_2, ..., α_d), where α_i ∈ D_i (1 ≤ i ≤ d). Subdomains within D_i are defined as S_i ⊆ D_i (1 ≤ i ≤ d), where S_i ⊆ A_i = D_i if D_i is a non-ordered discrete domain, or S_i is a range [min_i, max_i] if D_i is a continuous domain. A hybrid rectangle is then defined as the Cartesian product of subdomains S_i such that:

\[ R = S_1 \times S_2 \times \cdots \times S_d . \]

The length of the i-th dimension edge of R is defined as:

\[ \mathrm{length}(R, i) = \begin{cases} \dfrac{|S_i|}{|A_i|} & \text{if dimension } i \text{ is non-ordered discrete} \\ \max_i - \min_i & \text{otherwise} \end{cases} \]

The area of R is defined as:

\[ \mathrm{area}(R) = \prod_{i=1}^{d} \mathrm{length}(R, i) . \]
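Continuing the illustrative sketches above (again an assumption-laden reading of the definitions, not the dissertation's code), a hybrid rectangle can be modeled by tagging each dimension as discrete or continuous:

```python
from math import prod

# Each dimension is ("d", S_i, alphabet_size) for a discrete dimension, or
# ("c", (min_i, max_i)) for a continuous dimension normalized to [0, 1].
def length(R, i):
    """Edge length of a hybrid rectangle R on dimension i."""
    if R[i][0] == "d":
        _, S_i, alphabet_size = R[i]
        return len(S_i) / alphabet_size
    _, (lo, hi) = R[i]
    return hi - lo

def area(R):
    """Area of R: product of the edge lengths over all dimensions."""
    return prod(length(R, i) for i in range(len(R)))

if __name__ == "__main__":
    R = [("d", {"a", "g"}, 4),      # discrete dimension over {a, g, t, c}
         ("c", (0.25, 0.75)),       # continuous dimension
         ("d", {"t"}, 4)]
    print([length(R, i) for i in range(3)])   # [0.5, 0.5, 0.25]
    print(area(R))                            # 0.0625
```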
1.2.4 Vector Space Distance Measurements

Determining an efficient distance measurement is still an open problem. As discussed by Qian [41], the vector model is widely used to support similarity queries. Additional models, such as the Boolean model and the probabilistic model [2], have also been proposed to support similarity queries. However, as discussed by Baeza et al. [2], the vector model is considered to be more effective for similarity searches. To perform searches, each object in the database, as well as the query object, is represented as a fixed-length vector. For example, consider security applications tracking intrusion attempts. Each attempt in the database can be transformed into a vector based upon time, frequency, intruder ID, intrusion method, and other intrusion characteristics. Each intrusion can now be considered a point in a multidimensional vector space (either CDS, NDDS, or HDS, dependent upon the feature generation methods). The distance values between each pair of vector objects may now be calculated utilizing a distance metric most suitable for such a space. Typically, a pair of objects with a minimal distance value between them is considered more similar than a pair of objects with a maximal distance value between them.

The focus of our research is on supporting efficient similarity queries using the vector space model. It should be noted that not all applications have objects that may be efficiently represented as natural vectors. Qian [41] notes that some forms of multimedia data, such as video clips, are often stored in their native format due to a loss of precision when generating a feature vector to represent them. Although not considered in detail here, the issue of designing an effective feature generation algorithm is still an open problem. Excellent surveys concerning feature generation for multimedia objects are presented in [1] and [49].

1.3 Overview of the dissertation

The remainder of this dissertation is organized as follows: Chapter 2 discusses previous work performed in this and related fields, including multidimensional indexing and similarity search algorithm development; Chapter 3 presents our research in developing novel similarity search methods for NDDSs; Chapter 4 presents our work in developing a non-application-specific distance measure for NDDSs; Chapter 5 presents our research in extending our work on similarity searches in NDDSs to HDSs; Chapter 6 summarizes the contributions of this dissertation and provides directions for future work.

CHAPTER 2

Previous Work

This chapter presents previous research work related to this dissertation. We first discuss the evolution of common similarity searching algorithms. We then present an overview of popular index structures used to support efficient similarity searches in vector spaces. These index structures are used to maintain data in CDSs and NDDSs, as well as in recently proposed work to support HDSs. Lastly, we present work concerning distance measure comparisons.

2.1 Nearest Neighbor Algorithms

Most similarity searching algorithms may be distilled to a simple formula applicable to CDS, NDDS, or HDS index structures. Further, range queries may be viewed as a special case of nearest neighbor queries where the final search radius is known at the start of the search. As such, this section focuses upon nearest neighbor search algorithms. The following algorithms perform similarity searches for the nearest neighbor to a query based on an index structure of fixed dimensionality. Generally, the search for a single nearest neighbor may be expanded to find k nearest neighbors by maintaining a list of neighbors found and using the distance between the neighbor that occupies the k-th distance-related slot and the query point as a search range/radius value to search within. Each of the index structures described in the following sections may generally be used in conjunction with one of the following search methods. However, developers will typically modify the algorithm to better suit the applicable structure.

2.1.1 Exact Searching Methods

The most basic method to perform a k-NN similarity search involves using a range search algorithm. Begin with radius r = α (α > 0) centered at query point q. Increase α until at least k elements lie within the radius. The cost, in terms of page accesses, of this algorithm is similar to that of performing a range search. This cost, however, is greatly affected by the amount the value α is adjusted by every time the desired number of elements is not yet found. Too small, and the performance cost will quickly grow; too large, and the number of points returned will far exceed the desired number, thus decreasing the usefulness of the solution.

A more elegant approach was proposed for both general metric spaces and CDSs [12, 3, 47]: begin the search on any data structure using r = ∞. The search begins at the anchor/root of the data structure. Each time the query point q is compared to some element p, the search radius is updated such that r = min(r, D(q, p)). The search then backtracks to the nearest split in the index where a new path has not yet been traversed and continues down the new path using the newly calculated radius. As the search continues, the possibility increases that an entire path of the index structure need not be searched, due to the decreasing size of the radius r. Pruning heuristics are employed to determine if a certain route is no longer necessary to search. Roussopoulos et al. [47] provided two heuristics, named MINDIST and MINMAXDIST, for CDS-based index structures (namely R-trees) that invoke this pruning. MINMAXDIST is used in the breadth search of the covering rectangle of the current search path by determining the minimum of the maximum distances of all covering rectangles contained in the current one. This distance is used to update the current search radius such that r = min(r, MINMAXDIST). MINDIST is then used in the depth traversal to determine if any point of a covering rectangle associated with a new search path is within the search radius. If no point is within the search radius, that search path is pruned from the remainder of the search, and thus the effective size of the database to search is decreased.
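A compact sketch of this backtracking scheme is shown below. It is a generic illustration under assumed node and MINDIST definitions (axis-aligned bounding boxes over numeric vectors), not the exact algorithm of any cited paper: the search keeps a shrinking radius and skips any subtree whose minimum possible distance to the query already meets or exceeds it.

```python
import heapq

class Node:
    """A tiny index node: leaves store data points, internal nodes store children.

    mindist(q) lower-bounds the distance from q to anything under the node;
    here an axis-aligned bounding box over numeric vectors is used as an example.
    """
    def __init__(self, points=None, children=None):
        self.points = points or []
        self.children = children or []
        items = self.points + [c for ch in self.children for c in ch._corners()]
        self.lo = [min(v) for v in zip(*items)]
        self.hi = [max(v) for v in zip(*items)]

    def _corners(self):
        return [self.lo, self.hi]

    def mindist(self, q):
        return sum(max(l - x, 0, x - h) ** 2
                   for x, l, h in zip(q, self.lo, self.hi)) ** 0.5

def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def nearest(root, q):
    """Best-first branch-and-bound: prune subtrees whose bound >= current radius."""
    best, radius = None, float("inf")
    heap = [(root.mindist(q), id(root), root)]
    while heap:
        bound, _, node = heapq.heappop(heap)
        if bound >= radius:                 # nothing closer can be found here
            continue
        for p in node.points:
            d = dist(q, p)
            if d < radius:
                best, radius = p, d         # shrink the search radius
        for child in node.children:
            heapq.heappush(heap, (child.mindist(q), id(child), child))
    return best, radius

if __name__ == "__main__":
    leaf1 = Node(points=[(1.0, 1.0), (2.0, 2.0)])
    leaf2 = Node(points=[(8.0, 8.0), (9.0, 7.0)])
    root = Node(children=[leaf1, leaf2])
    print(nearest(root, (2.5, 2.5)))        # ((2.0, 2.0), ~0.707); leaf2 is pruned
```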
This range reduction method may be improved further by attaining a smaller radius value earlier in the search. Several techniques have been used in both CDSs and general metric spaces [51, 28, 16]. The underlying idea of each technique is that certain paths may be identified by their high-level statistics that will yield a closer point to the query point q earlier in the search. The most common application of this idea is to order potential search paths by either their MINDIST or MINMAXDIST values to q. The MINDIST ordering takes the optimistic view that a lower MINDIST value is caused by a relatively closer object in the index structure. This may not always prove true in spatial index structures. Commonly, some point of a search path only exists at the topmost layers. At higher levels within an index structure, points may actually be the result of the intersection of lines drawn from several points lower in the index structure. When this technique appears to suffer from this problem, the pessimistic approach using MINMAXDIST may be used instead. Here, search paths are ordered by the increasing value of their furthest point. Thus a search may correctly assume that it will at least not encounter any points further away than the MINMAXDIST.

2.1.2 Approximate Searching Methods

Relaxing the precision of the query results may lead to even further reductions in time complexity. This is a reasonable procedure for many applications, due to some approximation in the modeling of feature vectors for both general metric and CDS indexes. In addition to the query itself, a user specifies some query parameter ε to control how far away from the query point the search may progress. In this manner, the algorithm avoids the costly initial stages of a similarity search. On subsequent searches of similar databases, ε may decrease to approach zero. As ε decreases, the time complexity, along with the precision of the result, decreases as well.

A probabilistic algorithm was given for vector spaces by Yianilos et al. [55], using a method described as aggressive pruning to improve the performance. Here, the idea is to increase the number of branches that are pruned at the expense of possible points in the set of nearest neighbors. This process is controlled such that the probability of success is always known. Unfortunately, the data structure used is only useful within a very limited radius; in databases or with searches that could result in neighbors with distances beyond the possible radius, the algorithm is not able to guarantee a result of the true nearest neighbors. This form of similarity searching is described as Approximate Similarity Searching. This topic is not covered in detail in this dissertation, but is mentioned here for completeness. An in-depth coverage may be found in [53].

2.1.3 Unique Searching Methods

The techniques described thus far cover universal proposals for performing k-NN similarity searches. There are, however, examples of k-NN search algorithms developed for specific indexes that are inapplicable in a generic sense. Such algorithms depend upon the structure developed to support them and are unable to be incorporated with common indexing techniques. Clarkson [17] proposes a method that alleviates the need to perform extensive backtracking by creating a GNAT-like data structure where points are inserted into multiple subtrees. The tree is constructed by first selecting representatives for the root(s) and then inserting each element u into not only the subtree of its closest representative p, but also into the subtree of any other representative p′ such that D(u, p′) ≤ 3·D(u, p). During a query on object q, the search enters all subtrees such that D(q, p′) ≤ 3·D(q, p).
p’) S 3 * D((}q. p). As shown by Clarkson, this is enough to guarantee the retrieval of the nearest neighbor and could be extended to determine the set of k-NN. 2.2 General Metric Space and CDS Index Struc- tures Numerous data indexing structures have been proposed in the literature that support. efficient similarity searches in CDSs and general metric spaces. Most. CDS data 18 index structures may be classified into two categories: data-partitioning and space- partitioning. Examples of the former split an overflow node by partitioning the set of its indexed data objects. The later, split an overflow node by partitioning its representative data space (typically via a splitting point in a dimension). Most of these methods are not. aI.)plicable in either NDDS or HDS due to some essential geometric concepts such as a bounding box no longer being valid in such spaces. An in depth discussion of continuous multidimensional indexing structures may be found in [21]. Metric trees represent a wholly different. approach to indexing data. Here, data points are not reli)resented in any type of dimensional space, rather metric trees implement structures using only the distance information between data points and some center origin / root. Such non-geon’ietric trees are generally not optimized toward NDDSs or HDSs, such as CDS implementations are toward CDSs. However, they do provide a prevalent counter solution to implementing a vector space model for either an NDDS or HDS. An in depth discussion of searching in metric space may be found in [40]. This section presents an overview of commonly employed index structures for both CDS and general metric space. Chapter 5 discusses some of the issues that. arise when applying CDS based index structures to an HDS dataset.. 19 2.2.1 KD-Tree The KD-tree [8, 7], was one of the first proposed d—dimensional data index structures. Structured as a binary tree. the KD-tree recursively sul.‘)divides a dataset along ((1—1)— dimensional hyperplanes whose direction alternates among the (1 possibilities. A simple example is for d = 3, the splitting hyperplanes would be perpendicular to the :r, y, and z axes. Each hyperplane must contain at least one point where interior nodes have one or two descendants. Searching and insertion of new points are simple procedures that. use the interior nodes as guide posts. Deletion however, is much more complicated and invariably results in a. costly reorganization of the tree structure. The KD-tree structure is highly sensitive to the order in which the points are inserted. A poor insertion order generally results in data points being scattered all over the tree. A solution to this problem was presented by Bentley et al. as the adaptive KD-tree [9]. By relaxing the requirements that hyperplanes have to contain a data point as well as being strictly alternating, the adaptive KD—tree was able to choose splits that resulted in a much more even data distribution on both sides of a particular branch. Conceptually. this resulted in pushing all data points out to the leaf nodes leaving interior nodes to only contain dimensional information. The adaptive KD—tree is unfortunately very st atic in nature; frequent insertions and deletions require costly reorganizations to keep the tree balanced. However, the adaptive KD—tree does perform well when the dat aset is known prior to construction of the tree. 
2.2.2 LSD-Tree

The Local Split Decision (LSD) tree [24] is organized as an adaptive KD-tree, partitioning a dataset into disjoint cells of various sizes. Better adapted to the distribution of a dataset than fixed binary partitioning, the LSD-tree uses a special paging algorithm to preserve the external balancing property; i.e., the heights of its external subtrees differ by at most one. This is accomplished by paging out subtrees when the structure becomes too large to fit in main memory. While this strategy increases the efficiency of the tree structure, it prevents the LSD-tree from being used in general-purpose database systems, where such extensive paging is not available.

The split strategy of the LSD-tree tries to accommodate skewed data by combining data-dependent and distribution-dependent split strategies, SP1 and SP2 respectively. The former attempts to achieve the most balanced structure by trying to keep an equal number of data objects on both sides of the split. The latter is performed at a fixed dimension and position where the underlying data is in a known distribution pattern. The split position SP is determined as a linear combination of the two strategies: SP = α·SP1 + (1 − α)·SP2, where α ∈ (0, 1). The factor α may vary as objects are inserted or deleted from the tree. This property increases the efficiency of the LSD-tree, but makes integration with generic database systems of unknown data distributions difficult.

2.2.3 R-Tree and R*-Tree

The R-tree [22] is a height-balanced tree with index records/pointers in its leaf nodes, similar to the adaptive KD-tree. Each node v represents a disk page along with a d-dimensional interval I_d(v). If the node is an interior node, all descendants v_i of v are contained in the interval I_d(v). If the node is a leaf node, then I_d(v) represents the d-dimensional bounding box of all the objects stored in the node. Nodes at the same level may overlap in area. Each node contains between m and M entries unless it is the root. The lower bound m is used to prevent the degeneration of trees and to ensure efficient storage utilization. If the number of entries in a node drops below m, the node is deleted and the descendants of the node are reinserted into the tree (tree condensation). The upper bound M is used to maintain each node's size to that of one disk page. Being a height-balanced tree, all of the leaves are at the same level and the height is at most ⌈log_m(N)⌉ for N (N > 1) index records.

Objects in an R-tree are represented by their minimum bounding interval I_d(o). To insert an object, the tree begins at the root and traverses a single depth-first path to a leaf node. At each depth layer, heuristics are used to determine which descendant path to follow. The first is a calculation of which path would require the least enlargement of area of the interval I_d(v_j) representing it.
Deletion is handled in a similar manner to insertion. First an exact match search is performed to determine if the selected element for deletion is contained in the database. If so, the element is deleted. and the containing leaf node is checked to see if the area interval may be reduced. If the area is reduced, the change is propagated upwards toward the root node. If this deletion causes the number of elements to drop below the lower bound m, the remaining elements in the leaf node are copied to temporary storage and then deleted from the database, the changes caused by this deletion are again propagated upwards which generally results in the adjustment of several intermediate nodes. Once this is completed, the elements in temporary storage are inserted back into the index structure following the insert method described above. 23 Searching an R—Tree may be performed in a similar manner to the first stages of insertion. At each index node 'v. all index entries are checked to see if they intersect with the search interval. If v is an interior node, the search continues to all the descendant nodes 13,: who were found to intersect the search interval. If v is a leaf node, the search adds all the entries that intersected the search interval to the solution set. Due to interior nodes overlapping area. it is common for a search to include multiple descendants from interior nodes. In the worst case scenario, this will lead to every single node in the tree having to be accessed. Several weaknesses in the original tree construction algorithms were identified through careful analysis of R-tree behavior under different data distributions. It was identified that the insertion phase was a critical step in building an R-tree towards good search performance [48]. The result of this study was the R*-tree. The R*-tree provides a more controlled insertion performance by introduced a policy called forced rei'nsert: if a. node overflows, it is not split immediately, rather p entries are removed from the node and reinserted into the tree. Through testing, it was proposed in [6] that p should be about 30% of the maximum number of entries per node, AI. Additionally, the R*-tree further addresses the issue of node—splitting by adding more heuristics to avoid making random choices. In Guttman et al.’s original R— tree algorithm, node—splitting policy was based solely on trying to minimize the 24 area covered by the two new nodes. This lead to many possible splits, whereby a random split was selected. The R*-tree breaks these ties by taking three more factors into account: overlap, perimeter values, and storage utilization of the two new nodes. The reduction of the overlap between two nodes reduces the probability that a search will have to follow multiple paths. Reduction in the perimeter of a bounding box increases the density of descendants, which in turn increases the storage utilization of the selected node. Increased storage utilization allows for a greater area to be created before a split is necessary, thereby decreasing the probability of a search needing to follow multiple paths. These additions lead to a marked improvement in performance of the R*-tree over the R—tree [48]. 2.2.4 Burkhard-Keller Tree The Burkhard-Keller Tree (BKT) [12] is considered one of the first metric trees and may be seen as the basis for many of the metric trees proposed after. The BKT is defined as follows. Let U represent the universe of discourse for all valid objects within the database. 
An arbitrary element p ∈ U is selected as the root of the tree. For each distance i > 0, define U_i = {u ∈ U : d(u, p) = i} as the set of all elements at distance i from the root p. For each nonempty U_i, build a child p_i, hereafter labeled a pivot, where the BKT is recursively built for U_i. This process is repeated until no more elements are left for processing or there are only b elements left, which may then be placed within a bucket of size b.

The BKT supports range queries in discrete space effectively. Given a query q and a distance r, the search begins at the root and enters all children i such that d(p, q) − r ≤ i ≤ d(p, q) + r, and proceeds recursively. Whenever a leaf/bucket is encountered, the items are compared sequentially. This guarantees an accurate solution due to the BKT satisfying the Triangular Inequality property.
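A minimal sketch of this construction and range query follows (illustrative only; bucket handling is simplified away and duplicates of a pivot are dropped, which are assumptions rather than details from the text), again using the Hamming distance as the discrete metric.

```python
def hamming(u, v):
    """Discrete metric used by this example: count of differing positions."""
    return sum(1 for a, b in zip(u, v) if a != b)

class BKTree:
    """Burkhard-Keller tree: children are keyed by their distance to the pivot."""
    def __init__(self, pivot):
        self.pivot = pivot
        self.children = {}          # distance i -> subtree of elements at distance i

    def insert(self, item):
        d = hamming(item, self.pivot)
        if d == 0:
            return                  # duplicate of the pivot (simplification)
        if d in self.children:
            self.children[d].insert(item)
        else:
            self.children[d] = BKTree(item)

    def range_query(self, q, r):
        """Return all indexed items within distance r of q."""
        d = hamming(q, self.pivot)
        result = [self.pivot] if d <= r else []
        # Triangle inequality: only children with d - r <= i <= d + r can hold answers.
        for i, child in self.children.items():
            if d - r <= i <= d + r:
                result.extend(child.range_query(q, r))
        return result

if __name__ == "__main__":
    words = ["agtca", "agtcc", "ttgca", "agacc", "cgtca", "ttgcc"]
    tree = BKTree(words[0])
    for w in words[1:]:
        tree.insert(w)
    print(sorted(tree.range_query("agtcg", r=1)))   # ['agtca', 'agtcc']
```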
Upon reaching a leaf, the insertion of the element may cause overflmv, (i.e., the number of elements equals M + 1). In such a case. the node is split in two and the elements are partitioned between the resulting nodes. One node element is promoted upwards to become a representative: this change propagates to the root of the tree. Searches are performed by comparing the search radius 7‘3 with each re1.)resentative‘s covering radius rc in a node. For all representative in the node where r3 < re, the search continues recursively through the subtrees of the effective representative. As shown by Ciaccia et al. [16]. the M—tree shows impressive performance results against CDS geometric indexes. 2.3 NDDS Models in Vector Space Currently, NDDS indexing techniques utilizing vector space representations of data points are fairly limited. A11 exhaustive search of the literature yields only two meth- ods specifically applicable towards indexing such a space representation; the ND—tree and the NSP—tree. This work appears to provide the most. significant. results toward performing lit-nearest neighbor queries within an NDDS. As such, both methods are discussed in detail in the following sul.)sections. 28 2.3.1 The ND-tree The ND-tree [43] is inspired by popular CDS multidimensional indexing techniques such as R-tree and its variants (the. R*—tree in particular). Like many of the tech- niques that inspired it, the ND—tree is a balanced tree satisfying three conditions: (1) the root has at least two children. unless it is a leaf. and at most AI children; (2) every non leaf and leaf node has between m and AI children or entries respec- tively, unless it is the root; (3) all leaves a.1‘)pear at the smile level. Here, AI and m. represent the upper and lower bounds set on the number of children / entries. where 2 g m g [AI/2]. Unlike previous examples of balanced trees, the ND-tree is designed specifically for NDDSs and as such is based upon the NDDS concepts such as discrete rectangles and their areas of overlap defined in Section 1.2.2. Due to this design consideration, the ND-tree is able to take special characteristics of NDDSs into consideration that metric trees are unable to utilize. The N D-tree is a structure of indexed vectors from an NDDS Qd over an alj')habet A, where A,- represents the alplial:)et of the ifh dimension. Leaf nodes consist of tuples of the form (op. key), where key is a vector from (2,1 representing an object, pointed to by op, in the database. A non leaf node N also consists of tuples. of the form (cp, DMBR), where cp is a pointer to a child node N ’ of N and Dr’l-IBR 29 represents the discrete minimum bounding rectai‘igle. described in section 1.2.2, of N’. The DAIBR of a leaf node N". consists of all the vectors indexed in N ”. The DA! B R of a non leaf node N ’ is the set of all DAIBRs of the child nodes of N’. To build an ND-tree, the algorithm ChooseLeaf is implemented to determine the most suitable leaf node for inserting a new vector a into the tree. Starting at the root and progressing to the appropriate leaf, the algorithm ChooseLeaf must. determine which descendant path to follow at each non leaf node it encounters. Decisions are based upon three heuristics used in ascending order for tie breaks, such that if 1H1 results in two or more possible paths, [Hg is used to narrow the field further, and so on until a child must be chosen at random. 
These heuristics are presented as follows: 1H1: Choose a child node corresponding to the entry with the least enlargement of overlap(Ek.DAIBR) after the insertion. [43] 1H1 chooses a child node Ek from entries E1.2,...,Ep, m. S p 3 AI and 1 g k S p, such that the insertion of vector (1 results in the least enlargement of overlap(Ek.DAIBR), defined as: 30 p oecrlap(Ek.DJIIBR) = Z area(Ek.DAIBR fl EluDr'lIBR) (2.1) i:1,i;£k 1H1 results from the observation of the ND-tree experiencing similar retrieval performance degradation due to the increasing amount of overlap between bounding regions as seen for CDSs [10], [‘38]. This increase in overlap is a major concern for high dimensional indexing methods and as such has been described as the high dimensionality curse. Unlike multidimensional index trees in CDS, the number of possible values for the overlap of an entry in an NDDS are limited, implying ties may arise frequently. As such IH2 and 1H3, two area based heuristics, are given to break possible ties: 1H2: Choose a child node corresponding to the entry EA. with the least enlargement of area(Ek.D]lIBR) after the insertion.[43] I H3: Choose a child node corresponding to the entry Ek with the minimum area(Ek.D]\rIBR) . [43] Once any node contains more entries than the maximum allowable value, the overflowing node is split into two new nodes N1 and N2 whose entry sets are from a partition defined as follows: let N be an overflow node containing AI + 1 entries ES 2 {E1, E2, . . . , EM+1}- Partition P of N is a pair of entry sets P 2 {E51, ESQ} 31 such that: (1)E51UESQ =2 ES; (2) E81 flESg = (D; and (3) m S [ESll,m S [ESQ]. Partition P is determined through the algorithm SplitNode, which takes an overflow node N as the input and splits it into two new nodes Nl and N; whose entry sets come from a partition as defined above. This is a critical step in the creation of an ND-tree as many split possibilities exist and a good partition may lead to an efficient tree. Qian et al. [43] proposes handling this in two phases: (1)determine the set of all possible partitions; (2) select the partition most likely to lead to an efficient tree structure. The exhaustive approach to implementing this process is very costly. As shown by Qian et al., even for relatively small. values of ill, for example 50, an exhaustive approach would have to consider a result so large as to make the operation impractical: here, 51! z 1.6 x 1066 permutations. Thus, a more efficient method of generating a (smaller) set of candidate partitions is required. A more efficient method of generating candidate partition sets stems from the property that the size of an alphabet A for an NDDS is usually small; i.e., in genome se uence exam ,)les A = 4. Let I .19.. .. ,l be the elements of A, in this case Q I 1 - A] { a, g, t, c}. The number of 1:)ermutations on the elements of an alphabet is thus also relatively small, here 4! = 24. For example, < a,c.g,t > and < g,c,a.t > are both permutations of the set A. Using these 0lf)SGI‘V‘dthIlS Qian et al. proposes an algorithm to choose a set of candidate partitions consisting of d * (AI — 2m + 2) * (IAl!) candidates. For example, if d = 25, AI = 50, m. = 10, and |A| = 4, the 32 alphabet permutation based algorithm only considers 1.92 x 104. Further, because a permutation and its reverse yield the same set of candidate partitions [43], only half of the aforen'ientioned candidates need be considered; a significant improvement over the exhaustive method. 
It is possible that alphabet A for some NDDS is large. In this case, the aforemen- tioned method no longer 1')ro\«'ides as significant an improxr'emcnt over the exhaustive method. If such a case arises. Qian et al. propose determining one ordering of entries A in the OV'erflow node for each dimension rather than consider ! orderings on each dimension. This is accomplished by creating an auxiliary tree for each dimension Ti and sorting the component sets generated from T,. Once a candidate. set. of partitions has been generated, SplitNode selects an ap- propriate partition based upon four heuristics, as follows: S H 1: Choose a. partition that generates a minimum overlap of the DMBRs of the two new nodes after splitting.[43] S H2: Choose a partition that splits on the dimension where the edge length of the DMBR of the overflow node is the largest.[43] S H3: Choose a partition that has the closest edge lengths of the DMBRs of the two new nodes on the splitting dimension after splitting.[43] 33 SH4: Choose a partition that minimizes the total area of the DMBRs of the two new nodes after splitting.[43] Heuristics S H 1 through S H4 are applied in ascending order until there are no ties present. If a tie still exists after the application of S H4, a partition is chosen at random. Searching an ND—tree is performed similarly to searching an R—tree. Starting at the root, each child nodes DMBR is compared to the search radius to determine if there is an intersection. If the node intersects the search continues recursively. 2.3.2 NSP-Tree A common problem among data partitioning based index structures, such as R*-tree in CDS and ND-tree in N DDS, is that of regional overlap. For such index structures an overflow node is split by grouping its indexed vectors into two sets D81 and DSQ for two new tree nodes such that the new nodes meet a minimum space utilization requirement. Commonly, as the dimensionality of such a space grows, the probability of large overlapping regions increases dramatically. This overlap causes a drastic reduction in the efficiency of searching within such a structure. A solution to this problem was proposed for CDSs in the form of s15)ace—partitioning data indexes. The LSD tree discussed earlier is an example of such methods, where an overflow node is 34 split by partitioning its corresponding data space into two non overlapping s1.il.)s1‘)aces for two new tree nodes. The indexed vectors are then distributed among the new nodes based upon which subspace they belong. \Vhile this method does lead to high search performances in such structures, the nodes in the tree generally no longer guarantee a minimum space utilization. Unfortunately. as was the case for the IND-tree. the methods used in creating CDS implen'ientations of a space—pentitioning index structure do not directly apply to an NDDS. Thus. a new structure. labeled the NSP-tree [44], was proposed. The NSF—tree utilizes many of the concepts described for the ND—tree. For example, the methods used to represent the universe of discourse and minin'nnn bounding rectangles remain the same between the two tree structures. The key difference between an ND-tree and an NSP-trce lies in the method of partitioning the data points into smaller subspaces. The NSP-tree splits an overflow node based upon the frequency distribution of the vectors indexed within the node, such that an even distribution is seen between the two new tree nodes. 
This distribution method differs from CDS models, where a split is performed upon a chosen dimension and the data points are distributed in relation to the split point. That method no longer applies in a space where no ordering exists among the data points: in an NDDS it is impossible to describe some point x_d as being less than or greater than some point y_d without violating the semantics of the dataset.

To increase search efficiency, the NSP-tree utilizes multiple bounding boxes within each subspace to help eliminate the amount of dead space included within the subspace; dead space is any area covered that does not contain any indexed vectors. This is similar to techniques used in CDSs; however, an interesting property of an NDDS is able to be exploited by the NSP-tree. Consider two 2-dimensional points P_1(x_1, y_1) and P_2(x_2, y_2). The MBR necessary to cover such points in a CDS would be a rectangle with points P_1 and P_2 representing corners along one of the diagonals. Such a representation includes a rather large portion of dead space (roughly all of the area is dead space). In an NDDS, however, the MBR may be represented as the Cartesian product of the two points, {x_1, x_2} x {y_1, y_2}, which contains a very small dead space, {(x_1, y_2), (x_2, y_1)}. As shown by Qian et al., the NSP-tree shows favorable performance in comparison with the ND-tree, particularly for skewed data [44].

2.4 HDS Models in Vector Space

This section describes currently available methods for vector space indexing of HDS objects. As with the NDDS indexing methods described in the previous section, only a limited number of HDS indexing methods exist. Although it should be noted that metric space models such as the M-tree could also be used to index such objects, research by Qian et al. [43] and Chen et al. [15] suggests that this is not as efficient as indexing methods tailored toward HDSs. In this section we focus upon two more recently proposed HDS indexing methods: the NDh-tree and the CND-tree.

2.4.1 NDh-tree

The NDh-tree, as proposed by Qian [41], is an extension of the ND-tree. The key difference is that instead of discrete minimum bounding rectangles, the NDh-tree utilizes hybrid minimum bounding rectangles. Initial results reported by Qian [41] demonstrate the effectiveness of utilizing an index structure for HDS objects specifically designed for such a space. The NDh-tree serves as an inspiration for the CND-tree introduced by Chen et al. [15]. In this dissertation we focus upon the more recent contribution and will describe key differences as necessary.

2.4.2 CND-tree

The CND-tree [15] is similar in structure and function to the R*-tree and the ND-tree. As such, the CND-tree is a balanced tree with leaf nodes containing the indexed vectors. The vectors are reached by traversing a set of branches starting at the root and becoming more refined as one traverses toward the leaves. Each vector is inserted into the tree after an appropriate position is found. The relevant minimum bounding rectangle may be either enlarged or split to accommodate the insertion.

The key difference between the CND-tree and related non-hybrid trees (the R*-tree for continuous space and the ND-tree for discrete space) is the way in which a minimum bounding rectangle is defined and utilized. In the CND-tree, a minimum bounding rectangle contains both continuous and non-ordered discrete dimensions.
A hybrid minimum bounding rectangle (HMBR), with d_D discrete dimensions and d_C continuous dimensions, for a set of hybrid rectangles G = {R_1, R_2, ..., R_n} with discrete sub-rectangles R_i^D = S_{i,1} x ... x S_{i,d_D} and continuous sub-rectangles R_i^C = S_{i,d_D+1} x ... x S_{i,d_D+d_C} (i.e., R_i = R_i^D x R_i^C) is defined as follows:

HMBR(G) = \{\bigcup_{i=1}^{n} S_{i,1}\} \times \dots \times \{\bigcup_{i=1}^{n} S_{i,d_D}\} \times [\min_i S_{i,d_D+1}, \max_i S_{i,d_D+1}] \times \dots \times [\min_i S_{i,d_D+d_C}, \max_i S_{i,d_D+d_C}]    (2.2)

where S_{i,j} (1 <= j <= d_D) is a set of elements/letters from the alphabet of the j-th dimension and S_{i,k} (d_D + 1 <= k <= d_D + d_C) is an interval from the k-th dimension.

2.5 Determining Distance

An integral part of any similarity search algorithm is the distance measure employed. Because we are interested in how "close" one object is to another, the selection of a distance measure provides the semantic definition of what "close" means for the current application.

For NDDSs, the inability to order values along an axis renders standard forms of distance measurement, such as Euclidean or Manhattan, inapplicable. In turn, a common method of calculating the distance between two discrete objects is to apply the Hamming measurement. Essentially, this measurement represents the number of dimensions on which two d-dimensional vectors contain different elements. This is described formally as follows:

D_{Hamm}(V_1, V_2) = \sum_{i=1}^{d} \begin{cases} 0 & \text{if } V_1[i] = V_2[i] \\ 1 & \text{otherwise} \end{cases}    (2.3)

This distance is useful in discrete spaces due to its non-reliance upon dataset semantics, particularly for equality measurements. However, its usefulness declines rapidly when applied to other operations, such as grouping, due to its limited cardinality. The cardinality of an NDDS, Card(U), for a d-dimensional dataset U with an alphabet of size |A_i| for each dimension i, is calculated as the product of the alphabet sizes from each dimension:

Card(U) = \prod_{i=1}^{d} |A_i|.

Using the aforementioned genome sequence example, a 25-dimensional dataset with an alphabet size of 4 for each dimension would have a cardinality of 1,125,899,906,842,624; that is, there are over 10^{15} possible distinct elements in the dataset. However, if the Hamming distance formula is used to calculate the distance between the objects, there are only d + 1 (i.e., 26) different possible distances between any two objects.

For HDSs, Chen et al. [15] utilized a non-Euclidean measurement for calculating the distance between a vector alpha = (alpha[1], alpha[2], ..., alpha[d_D + d_C]) and an HMBR R = S_1 x S_2 x ... x S_{d_D + d_C} to perform range queries:

dist(R, \alpha) = \sum_{i=1}^{d_D + d_C} f(S_i, \alpha[i])    (2.4)

where

f(S_i, \alpha[i]) = \begin{cases} 0 & \text{if } i \text{ is a discrete dimension and } \alpha[i] \in S_i, \text{ or } i \text{ is a continuous dimension and } \min(S_i) - \delta_t \le \alpha[i] \le \max(S_i) + \delta_t \\ 1 & \text{otherwise} \end{cases}

Equation 2.4 essentially discretizes the data of the continuous dimensions utilizing a tolerance value delta_t determined by an application expert. This method is similar to that used in several machine learning techniques [13, 20, 39].
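As a concrete illustration of Equations 2.3 and 2.4, the C++ sketch below computes the Hamming distance between two discrete vectors and the discretized hybrid distance between a hybrid vector and an HMBR. The data-structure layout (element sets plus intervals) and every identifier are illustrative assumptions, not the CND-tree's actual code.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Hamming distance between two d-dimensional discrete vectors (Equation 2.3).
int hammingDistance(const std::vector<char>& v1, const std::vector<char>& v2) {
    int dist = 0;
    for (std::size_t i = 0; i < v1.size(); ++i)
        if (v1[i] != v2[i]) ++dist;
    return dist;
}

// A hybrid vector: d_D leading discrete components and d_C trailing continuous ones.
struct HybridVector {
    std::vector<char> discrete;
    std::vector<double> continuous;
};

// An HMBR: element sets on the discrete dimensions, [min, max] intervals on the rest.
struct Hmbr {
    std::vector<std::set<char>> discreteSets;
    std::vector<std::pair<double, double>> intervals;  // (min, max) per continuous dim
};

// Hybrid distance of Equation 2.4; deltaT is the application-supplied tolerance
// used to discretize the continuous dimensions.
int hybridDist(const Hmbr& r, const HybridVector& a, double deltaT) {
    int dist = 0;
    for (std::size_t i = 0; i < a.discrete.size(); ++i)
        if (r.discreteSets[i].count(a.discrete[i]) == 0) ++dist;   // mismatch adds 1
    for (std::size_t i = 0; i < a.continuous.size(); ++i) {
        double lo = r.intervals[i].first - deltaT;
        double hi = r.intervals[i].second + deltaT;
        if (a.continuous[i] < lo || a.continuous[i] > hi) ++dist;  // outside the tolerance band
    }
    return dist;
}
```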
CHAPTER 3

k-Nearest Neighbor Searching in NDDS

In this chapter, we consider k-Nearest Neighbor (k-NN) searching in NDDSs. Searching in NDDSs presents several new challenges that we discuss. Additionally, we present a formal definition of a k-NN query/search and introduce a new distance measure suitable for searches in NDDSs. A generalized form of this distance measure (and the benefits inherited from this generalized form) is presented in Chapter 4.

3.1 Motivations and Challenges

Numerous techniques have been proposed in the literature to support efficient similarity searches in (ordered) continuous data spaces. A majority of them utilize a multidimensional index structure such as the R-tree [22], the R*-tree [6], the X-tree [10], the K-D-B tree [46], and the LSDh-tree [25]. These techniques rely on some essential geometric properties/concepts such as bounding rectangles in CDSs. Much work has centered around a filter-and-refinement process. Roussopoulos et al. presented a branch-and-bound algorithm for finding k-NNs to a query point. Korn et al. furthered this work by presenting a multi-step k-NN searching algorithm [35], which was then optimized by Seidl and Kriegel [50]. In [31], a Voronoi-based approach was presented to address k-NN searching in spatial network databases.

Little work has been reported on supporting efficient similarity searches in non-ordered discrete data spaces. Limited existing work on index-based similarity searches in NDDSs has utilized either metric trees such as the M-tree [16] or the ND-tree and the NSP-tree recently proposed by Qian et al. [42, 43, 44]. Unlike the M-tree, the ND-tree and the NSP-tree indexing techniques were designed specifically for NDDSs. It has been shown that these two techniques outperform the linear scan and typical metric trees such as the M-tree for range queries in NDDSs. Metric trees generally do not perform well in NDDSs because they are too generic and do not take the special characteristics of an NDDS into consideration. On the other hand, Qian et al.'s work in [42, 43, 44] primarily focused on handling range queries. Although a procedure for finding the nearest neighbor (i.e., 1-NN) to a query point was outlined in [43], no empirical evaluation was given.

The issue of k-NN searching in NDDSs is in fact not a trivial extension of earlier work. NDDSs raise new challenges for this problem. First, we observe that, unlike a k-NN query in a CDS, a k-NN query in an NDDS based on the conventional Hamming distance [23] often has a large number of alternative solution sets, making the results of the k-NN query non-deterministic. This non-determinism is mainly due to the coarse granularity of the Hamming distance and can sharply reduce the clarity/usefulness of the query results. Second, existing index-based k-NN searching algorithms for CDSs cannot be directly applied to an NDDS due to a lack of relevant geometric concepts/measures. On the other hand, the algorithms using metric trees for a CDS are suboptimal because of their generic nature and ignorance of the special characteristics of an NDDS. Third, the information maintained by an NDDS index structure may become very misleading for traditional CDS search ordering strategies, such as those presented by Roussopoulos et al. [47]. This scenario can occur as the distribution of data within the index structure shifts over time.

To tackle the first challenge, we introduce a new extended Hamming distance, called the Granularity-Enhanced Hamming (GEH) distance. The GEH distance improves the semantics of k-NN searching in NDDSs by greatly increasing the determinism of the results. To address the second challenge, we propose a k-NN searching algorithm utilizing the ND-tree. Our algorithm extends the notion of incremental range-based search of Roussopoulos et al. [47]
(generalized for metric spaces by Hjaltason and Samet [26]) to NDDSs by introducing suitable pruning metrics and relevant searching heuristics based on our new distance measure and the characteristics of NDDSs. Some preliminary results for uniformly distributed datasets were presented in [32]. Our study shows that the new GEH distance provides the greatly improved semantic discriminating power that is needed for k-NN searching in NDDSs, and that our searching algorithm is very efficient in supporting k-NN searches in NDDSs. In this dissertation, we demonstrate that our k-NN searching algorithm is efficient on both uniformly distributed datasets and non-uniformly distributed datasets, using zipf distributions as an example. Further, we present a theoretical performance model and demonstrate that the performance of our algorithm matches very closely what is predicted by this model. To address the third issue, we introduce a method for determining the probability of a vector's existence within any subtree of an ND-tree. We demonstrate that this probability information can be used to provide a new search ordering strategy that significantly increases the performance of our search algorithm when the information maintained by the index structure is misleading.

The rest of this chapter is organized as follows. Section 3.2 formally defines the problem of k-NN searching, derives the probability of a vector existing within an ND-tree, introduces the new GEH distance in NDDSs, and discusses its properties. Section 3.3 presents our index-based k-NN searching algorithm for NDDSs, including its pruning metrics and heuristics and a theoretical performance model. Section 3.4 discusses experimental results.

3.2 k-Nearest Neighbors in NDDS

In this section, we formally define a k-NN search/query and identify a major problem associated with k-NN searches in NDDSs. To overcome the problem, we propose a new extended Hamming distance and discuss its properties. Additionally, we introduce a method for determining the probability of a vector existing within any subtree of an ND-tree.

Definition 1. k-Nearest-Neighbor Set: A set of k-nearest neighbors kNNS = {A_1, A_2, ..., A_k} for a query point q is a set of k objects from the database such that

\sum_{i=1}^{k} D(q, A_i) \le \sum_{i=1}^{k} D(q, B_i) \quad \text{for every set } \{B_1, \dots, B_k\} \text{ of } k \text{ objects in the database.}    (3.1)

Equation 3.1 essentially says that the k objects/neighbors A_1, A_2, ..., A_k in kNNS have the minimum total distance to q out of all possible sets of k objects in the database. This definition is in fact valid for both continuous and discrete data spaces. Consider Figure 3.1: if k = 3, there are three possible sets of neighbors that satisfy Equation 3.1: {o_1, o_2, o_3}, {o_1, o_2, o_4}, and {o_1, o_3, o_4}. Each candidate solution set is found within a range of r when a range of r - 1 would yield fewer than k neighbors (here, r = 3). Thus, each candidate solution set is a size-k subset of the set of neighbors that would be found using the above minimum distance r. Since kNNS is a set of neighbors, there is no ordering implied among the neighbors. In the following recursive definition we provide a procedural semantics of a candidate k-th nearest neighbor, which is based on an ordered ranking of the neighbors in the database for a query point q.

Definition 2. Candidate k-th-Nearest Neighbor: Let the universe of discourse for variables A and B be the set of all objects in the database. Let A_k denote a candidate k-th nearest neighbor in the database for a query point q. We recursively define A_k as follows:

A_1 = A \Rightarrow \forall B \, (D(q, A) \le D(q, B)),

A_k = A \Rightarrow \forall B \, \big( (B \notin \{A_1, A_2, \dots, A_{k-1}\} \wedge A \notin \{A_1, A_2, \dots, A_{k-1}\}) \rightarrow D(q, A) \le D(q, B) \big) \quad \text{for } k \ge 2.    (3.2)
Definition 2 can be used to produce the candidate kNNSs given by Definition 1, as stated in the following lemma.

Lemma 3.2.1. Each candidate kNNS produced by Definition 2 is contained in the set of candidate kNNSs given by Definition 1.

Proof. The claim is that the set of candidate kNNSs given by Definition 1 can be produced by Definition 2. This leads to the hypothesis that if a solution set NN = {A_1, A_2, ..., A_{k-1}} satisfies Equation 3.1 for k - 1 objects, then Equation 3.2 will select a k-th neighbor A_k such that NN \cup {A_k} will satisfy Equation 3.1 for k objects and thus be consistent with Definition 1. We first consider a base case where k = 1. Equation 3.2 yields the following neighbor A_1:

A_1 \in \{ A : \forall B \, (D(q, A) \le D(q, B)) \}.

The solution set {A_1} satisfies Equation 3.1 for k = 1 and thus is consistent with Definition 1. Equation 3.2 may then be used again to yield neighbor A_k:

A_k \in \{ A : \forall B \, \big( (B \notin \{A_1, \dots, A_{k-1}\} \wedge A \notin \{A_1, \dots, A_{k-1}\}) \rightarrow D(q, A) \le D(q, B) \big) \} \quad \text{for } k \ge 2.

The above function returns the object in the dataset that has a minimum distance from the query object out of all objects in the dataset not currently included in the solution set. Thus the addition of a k-th neighbor for k >= 2 will result in a minimal total distance to the query point for k objects if the set of neighbors A_1, ..., A_{k-1} has a minimal total distance for k - 1 objects. Our base case shows that this is true for one object; thus the hypothesis is true for all values of k. []

From the above definitions (and Figure 3.1), we can see that there may be multiple possible kNNSs for a given query. Therefore, kNNS is generally not unique. The non-uniqueness/non-determinism of kNNS has an impact on the semantics of the k-nearest neighbors. We define the degree of non-determinism of k-nearest neighbors by the number of possible kNNSs that exist for the query. This degree of non-determinism is computed by the following lemma.

Lemma 3.2.2. The number Delta_k of candidate kNNSs is given by

\Delta_k = \binom{N'}{t} = \frac{N'!}{t!\,(N'-t)!}    (3.3)

where t is defined by D(q, A_{k-t}) \ne D(q, A_{k-t+1}) = D(q, A_{k-t+2}) = \dots = D(q, A_k); A_j (1 <= j <= k) denotes the j-th nearest neighbor; and N' is the number of nearest neighbors in the database that have the same distance as D(q, A_k).

Proof. We derive the number Delta_k of candidate kNNSs, which may be interpreted as the number of possible solution sets from a dataset that satisfy Equation 3.1 for k objects. The value of Delta_k is largely influenced by the number of objects within a given solution set that have the same distance to the query object as the k-th neighbor. This value, represented by t, is formally defined as follows:

D(q, A_{k-t}) \ne D(q, A_{k-t+1}) = D(q, A_{k-t+2}) = \dots = D(q, A_k).

Each neighbor A_{k-t+1}, ..., A_k may be replaced by any other potential neighbor a from the dataset where D(q, a) = D(q, A_k), and the solution set will still satisfy Equation 3.1. We denote the number of all such potential neighbors as N'. Thus, Delta_k is the number of t-element subsets that can be chosen from these N' potential neighbors, i.e., the binomial coefficient of N' and t, which gives Equation 3.3. []

Note that t denotes the number of objects with distance D(q, A_k) that have to be included in a kNNS. If t = k, all the neighbors in a kNNS are of the same distance as D(q, A_k); in this case A_{k-t} is inapplicable. The values of N' and t depend on parameters such as the dimensionality, the database size, and the query point.
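To illustrate Equation 3.3, the following hypothetical C++ helper takes the distances of all database objects to q in non-decreasing order and returns Delta_k: it counts t (how many of the k nearest share the k-th distance), counts N' (how many objects in the whole database have that distance), and evaluates the binomial coefficient. The function names and the sorted-distance input are illustrative assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Binomial coefficient C(n, t), computed iteratively (exact at each step).
std::uint64_t binomial(std::uint64_t n, std::uint64_t t) {
    if (t > n) return 0;
    if (t > n - t) t = n - t;
    std::uint64_t result = 1;
    for (std::uint64_t i = 1; i <= t; ++i)
        result = result * (n - t + i) / i;
    return result;
}

// Delta_k of Equation 3.3. sortedDistances holds D(q, .) for every database object,
// sorted in non-decreasing order; k must satisfy 1 <= k <= sortedDistances.size().
std::uint64_t numCandidateKnns(const std::vector<double>& sortedDistances, std::size_t k) {
    double dk = sortedDistances[k - 1];                  // distance of the k-th nearest neighbor
    std::size_t t = 0;                                   // neighbors among the first k at distance dk
    while (t < k && sortedDistances[k - 1 - t] == dk) ++t;
    std::size_t nPrime = 0;                              // all database objects at distance dk (N')
    for (double d : sortedDistances)
        if (d == dk) ++nPrime;
    return binomial(nPrime, t);
}
```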
For a k-NN query on a database in a continuous data space based on the Euclidean distance, kNNS is typically unique (i.e., Delta_k = 1), since the chance of two objects having the same distance to the query point is usually very small. As a result, non-determinism is usually not an issue for k-NN searches in continuous data spaces. However, non-determinism is a common occurrence in an NDDS. As pointed out in [42, 43], the Hamming distance is typically used for NDDSs. Due to the insufficient semantic discrimination between objects provided by the Hamming distance and the limited number of elements available for each dimension in an NDDS, Delta_k for a k-NN query in an NDDS is usually large. For example, for a dataset of 2M vectors in a 10-dimensional NDDS with uniform distribution, the average Delta_k values for 100 random k-NN queries with k = 1, 5, 10 are about 8.0, 19.0K, and 45.5M, respectively, as shown in Figure 3.2. This demonstrates a high degree of non-determinism for k-NN searches based on the Hamming distance in an NDDS, especially for large k values. To mitigate the problem, we extend the Hamming distance to provide more semantic discrimination between the neighbors of a k-NN query point in an NDDS.

3.2.2 Extended Hamming Distance

Intuitively, the Hamming distance indicates the number of dimensions on which the corresponding components of alpha and beta differ. As discussed by Qian et al. [42, 43], the Hamming distance is well suited for search applications in NDDSs. In particular, we note that applications with different alphabet sets for different dimensions or with no known similarity matrix (needed for many edit distances) present strong cases for using the Hamming distance when searching. Additionally, recent work has applied the Hamming distance when searching in large genome sequence databases [37, 29].

[Figure 3.2. Comparison of Delta_k values for the Hamming distance (number of solution sets vs. number of vectors in database x 10^5)]

Although the Hamming distance is very useful for exact matches and range queries in NDDSs, it does not provide an effective semantics for k-NN queries in NDDSs, due to the high degree of non-determinism mentioned previously. We notice that the Hamming distance does not distinguish equalities for different elements. For example, it treats the equality a = a the same as the equality b = b by assigning 0 to the distance measure in both cases. In many applications such as genome sequence searches, some matches (equalities) may be considered to be more important than others. Based on this observation, we extend the Hamming distance to capture the semantics of different equalities in the distance measure.

Several constraints have to be considered for such an extension. First, the extended distance should enhance the granularity level of the Hamming distance so that its semantic discriminating power is increased. Second, the semantics of the traditional Hamming distance needs to be preserved: for example, from a given distance value, one should be able to tell how many dimensions are distinct (and how many dimensions are equal) between two vectors. Third, the extended distance should possess the triangular property so that pruning during an index-based search is possible.
We observe that matching two vectors on a dimension with a frequently occurring element is usually more important than matching the two vectors on a dimension with an uncommon (infrequent) element. Based on this observation, we utilize the frequencies of the elements to extend the Hamming distance as follows:

D_{GEH}(\alpha, \beta) = \sum_{i=1}^{d} \begin{cases} 1 & \text{if } \alpha[i] \ne \beta[i] \\ \frac{1}{d} f(\alpha[i]) & \text{otherwise} \end{cases}    (3.4)

where f(alpha[i]) = 1 - frequency(alpha[i]).

This extension starts with the traditional Hamming distance, adding one to the total distance for each dimension that does not match between the two vectors. The difference is that, when the two vectors match on a particular dimension, the frequency of the common element (i.e., alpha[i] = beta[i]) occurring in the underlying database on that dimension is obtained from a lookup table generated by performing an initial scan of the dataset. This frequency value is then subtracted from one and added to the distance measure. Thus, the more frequently an element occurs, the more it subtracts from one and the less it adds to the distance measure, thereby indicating that the two vectors are closer than if they had matched on a very uncommon element. This frequency-based adjustment results in the possibility of fractional distance values rather than just the integer distance values seen with the traditional Hamming distance.

The factor of 1/d is used to ensure that the frequency-based adjustments to the distance measure do not become more significant than the original Hamming distance. This guarantees that the solution set (kNNS) returned using this distance will be among the candidate solution sets returned if the Hamming distance were used instead. We also note that the function f(alpha[i]) is not restricted to the definition given in Equation 3.4. So long as the values of f(alpha[i]) are within the range [0, 1), the factor of 1/d guarantees that the semantics of the Hamming distance is maintained. In Chapter 4, we explore this concept more thoroughly and present a generalized form of Equation 3.4.

From the distance definition, we can see that if m <= D_{GEH}(alpha, beta) < m + 1 (m = 0, 1, ..., d), then vectors alpha and beta mismatch on m dimensions (i.e., match on d - m dimensions). Additionally, the function f(alpha[i]) plays a role in preserving the triangular property of Equation 3.4, as shown in Appendix B. Clearly, unlike the traditional Hamming distance, which has at most d + 1 (integer) values, resulting in a quite coarse granularity, this new extended distance allows many more possible values, leading to a refined granularity. We call this extended Hamming distance the Granularity-Enhanced Hamming (GEH) distance.

Due to its enhanced granularity, the GEH distance can dramatically reduce Delta_k in Lemma 3.2.2, leading to more deterministic k-NN searches in NDDSs. As an example, consider again Figure 3.1. If we assume that vectors o_2, o_3, and o_4 each match query vector q in only one dimension, such that o_2[1] = q[1], o_3[2] = q[2], and o_4[3] = q[3], and we also assume f(q[2]) < f(q[3]) < f(q[1]), then the use of the GEH distance resolves the non-determinism seen earlier when k = 3. Here, the solution set would be {o_1, o_3, o_4} (one of the candidate kNNSs when the Hamming distance was used), since D_{GEH}(q, o_1) < D_{GEH}(q, o_3) < D_{GEH}(q, o_4) < D_{GEH}(q, o_2). On a larger scale, for the aforementioned dataset of 2M vectors in a 10-dimensional NDDS under the uniform distribution, the average Delta_k values for 100 random k-NN queries with k = 1, 5, 10 are about 1.09, 1.11, and 1.06, respectively (see Figure 3.3 in Section 3.4.1).
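A minimal C++ sketch of Equation 3.4 follows. The per-dimension frequency tables are assumed to have been built by the single preliminary scan of the dataset described above; the container choices and names are illustrative only.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// frequency[i][e] holds the fraction of database vectors whose i-th component is e,
// gathered in one preliminary scan of the dataset.
using FrequencyTables = std::vector<std::unordered_map<char, double>>;

// GEH distance of Equation 3.4: 1 per mismatching dimension, plus (1/d) * (1 - frequency)
// for each matching dimension, so matches on common elements count as "closer".
double gehDistance(const std::vector<char>& alpha, const std::vector<char>& beta,
                   const FrequencyTables& frequency) {
    const double d = static_cast<double>(alpha.size());
    double dist = 0.0;
    for (std::size_t i = 0; i < alpha.size(); ++i) {
        if (alpha[i] != beta[i]) {
            dist += 1.0;                                   // Hamming contribution
        } else {
            auto it = frequency[i].find(alpha[i]);
            double freq = (it == frequency[i].end()) ? 0.0 : it->second;
            dist += (1.0 - freq) / d;                      // GEH adjustment, always below 1/d
        }
    }
    return dist;
}
```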
In fact, the Euclidean distance measurement can be considered to have the finest (continuous) granularity at one end, while the Hamming distance measurement has a very coarse (discrete, integer) granularity at the other end. The GEH distance measurement provides an advantage in bringing discrete and continuous distance measurements closer to each other.

3.2.3 Probability of Valid Neighbors

In many scenarios, it is useful to know the probability/likelihood of encountering vectors within an index tree that are within the current search radius of a given query vector. For the purposes of our discussion, we label each such encountered vector alpha as a valid neighbor, where D(q, alpha) <= r, q is the query vector, and r is the current search radius. To derive this probability, we first consider the Hamming distance and then extend our solutions to benefit from the enhancements provided by the GEH distance.

For an initial case, we can assume that our index tree has maintained a relatively uniform distribution of elements within its subtrees. In a well-balanced tree (ND-tree, M-tree, etc.), this may prove to be a very reasonable assumption, as most indexing methods will attempt to evenly distribute elements within their subtrees. When we consider an ND-tree as our indexing method, the probability that accessing a subtree with associated DMBR R = S_1 x S_2 x ... x S_d will yield a particular element a in any dimension may be estimated as the reciprocal of the magnitude of the alphabet set on that dimension represented by R. Therefore, the probability of a specific element a occurring in dimension i is estimated as:

p(a)_{R,i} \approx \frac{1}{|S_i|}.

This calculation proves to be fairly accurate so long as the assumption of uniform distribution holds. The accuracy, and therefore effectiveness, of this calculation begins to degrade as the distribution of elements per dimension within a subtree becomes non-uniform. The true probability p(a)_{R,i} may be estimated far more accurately by determining the local frequency ratio of element a within a subtree, with associated DMBR R, on dimension i, as follows:

p(a)_{R,i} = f_l(a)_{R,i}    (3.5)

where

f_l(a)_{R,i} = \frac{\#\text{ of vectors } \alpha \text{ in } R\text{'s subtree where } \alpha[i] = a}{\text{total } \#\text{ of vectors in } R\text{'s subtree}}.

This method is not reliant upon the indexing method providing an even distribution: Equation 3.5 remains accurate even for indexes with heavily skewed distributions.

The probability of encountering valid neighbors when examining any particular subtree of an ND-tree is analogous to the probability of such neighbors existing in that subtree. Each dimension in an NDDS is assumed to be independent; therefore, the probability of encountering specific elements over all dimensions may be determined by the product of the probabilities of encountering a specific element in each dimension. Thus the probability of selecting any particular vector alpha = (alpha[1], alpha[2], ..., alpha[d]) at random from the subtree with associated DMBR R is the following:

P_E(\alpha, R) = \prod_{i=1}^{d} p(\alpha[i])_{R,i}.    (3.6)

As defined in Section 3.2.2, the Hamming distance represents the number of non-matching dimensions between any two vectors. The probability of a subtree containing a vector alpha where D_{Hamm}(q, alpha) = 0 may be determined using Equation 3.6. However, because at most one vector within an ND-tree will satisfy D_{Hamm}(q, alpha) = 0, we also have to consider the probability of a subtree containing vectors beta = (beta[1], beta[2], ..., beta[d]) where D_{Hamm}(q, beta) = z (z in {1, 2, ..., d}).
Lemma 3.2.3. Let Y represent the set of dimensions where beta[Y_j] != q[Y_j] and let X represent the set of dimensions where beta[X_j] = q[X_j]. The probability of selecting a particular vector beta at random from the subtree with associated DMBR R, where D_{Hamm}(q, beta) = z (0 <= z <= d), is given as the following (note that z = |Y|):

P_{NE}(\beta, R) = \prod_{j=1}^{|X|} p(\beta[X_j])_{R,X_j} \times \prod_{j=1}^{|Y|} \left( 1 - p(\beta[Y_j])_{R,Y_j} \right)    (3.7)

Proof. As described in Equation 3.6, the probability of a specific vector existing in a subtree, represented by R, is the product of the probabilities of each element of the vector occurring in the corresponding dimension of the subtree; the probability of the element is estimated by p(beta[X_j])_{R,X_j} (Equation 3.5). The probability of anything except the specified element is 1 - p(beta[Y_j])_{R,Y_j}. Thus, the probability of a specific vector, which does not match the query vector in z dimensions, existing in a subtree with DMBR R is the product of two terms: the product of p(beta[X_j])_{R,X_j} over the matching dimensions and the product of 1 - p(beta[Y_j])_{R,Y_j} over the non-matching dimensions. []

Lemma 3.2.3 describes the method for determining the probability of encountering a vector that matches the query vector on a particular set of dimensions X. An example would be determining the probability of encountering a vector beta in a 10-dimensional dataset that matched a query vector q in dimensions 1, 3, 8, and 9. In this example, X = {1, 3, 8, 9} and Y = {2, 4, 5, 6, 7, 10}, resulting in z = D_{Hamm}(q, beta) = 6. However, to determine the probability of encountering a vector at a distance z, we are not only interested in this one particular partial-matching vector, but also in all possible partial-matching vectors that may be found at a distance z from the query vector.

Proposition 1. Let B represent the set of all vectors from the given dataset in an NDDS where beta in B : [D_{Hamm}(q, beta) = z]. The probability that a subtree with associated DMBR R = S_1 x S_2 x ... x S_d will contain a vector beta where D_{Hamm}(q, beta) = z is given as the following:

PS_z(q, R) = \sum_{\beta \in B} P_{NE}(\beta, R).    (3.8)

Proof. The probability of a subset of independent objects existing in a set is the summation of the probabilities of each of the individual objects within the subset existing in the set. Thus, the probability of a subset of vectors existing within a subtree, each of which has a specified number of dimensions matching a query vector, is the summation of the probabilities of all vectors within that subset existing in the subtree. []

For example, in a three-dimensional space the probability of selecting a vector beta at random from a subtree with associated DMBR R, where D_{Hamm}(q, beta) = 1, is given as follows:

PS_1(q, R) = p(q[1])_{R,1}\, p(q[2])_{R,2}\, (1 - p(q[3])_{R,3}) + p(q[1])_{R,1}\, (1 - p(q[2])_{R,2})\, p(q[3])_{R,3} + (1 - p(q[1])_{R,1})\, p(q[2])_{R,2}\, p(q[3])_{R,3}.

We note that as the dimensionality of the dataset increases, this calculation can become costly, i.e., on the order of O(d \cdot \binom{d}{z}). One possible solution is to create a hash structure, in a pre-processing step, that stores binary arrays representing the different combinations of matches and non-matches for each particular distance value. Here, the set of keys is the set of integers {0, 1, ..., d}, and the values are the sets of binary arrays with a corresponding number of 0's. For example, the key 3 would retrieve the set of binary arrays containing all possible permutations with exactly three 0's. The set B for a particular distance may be determined by retrieving the corresponding set of binary arrays, where for each array a value of 1 corresponds to a dimension in X and a 0 to a dimension in Y (see Lemma 3.2.3). This method can drastically reduce CPU time.
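The C++ sketch below evaluates Equations 3.7 and 3.8 directly for one subtree by enumerating every pattern of z mismatching dimensions (the binary arrays just described), without the hashing optimization. The per-dimension probability input and the function name are illustrative assumptions.

```cpp
#include <algorithm>
#include <vector>

// p[i] estimates p(q[i])_{R,i}: the probability, per Equation 3.5, that a vector in the
// subtree with DMBR R carries the query element q[i] on dimension i.
// psZ evaluates Equation 3.8: the probability that the subtree contains a vector at
// Hamming distance exactly z from q, summing Equation 3.7 over every choice of z
// mismatching dimensions.
double psZ(const std::vector<double>& p, int z) {
    const int d = static_cast<int>(p.size());
    if (z < 0 || z > d) return 0.0;
    std::vector<int> mismatch(d, 0);                       // 1 marks a mismatching dimension
    std::fill(mismatch.begin(), mismatch.begin() + z, 1);  // start at the largest arrangement
    double total = 0.0;
    do {
        double term = 1.0;
        for (int i = 0; i < d; ++i)
            term *= mismatch[i] ? (1.0 - p[i]) : p[i];     // Equation 3.7 for this pattern
        total += term;
    } while (std::prev_permutation(mismatch.begin(), mismatch.end()));
    return total;
}
```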
The summation of the probability values given by Equation 3.8 for each integer distance z in {0, 1, ..., r} yields the probability of the subtree containing a vector beta that is within range r of the query vector (i.e., D_{Hamm}(q, beta) <= r). This is expressed formally as follows:

P_{NN_{Hamm}}(q, R, r) = \sum_{z=0}^{r} PS_z(q, R).    (3.9)

Equation 3.9 may therefore be used to give an accurate measure of the likelihood that searching within any subtree will update a solution set when using the Hamming distance.

Enhancing the granularity of the Hamming distance leads to an enhancement of the neighbor probability calculated in Equation 3.9. When using the GEH distance, r is no longer strictly an integer. Thus, it is possible for a valid neighbor to exist at a distance floor(r) < D_{GEH}(q, beta) <= r, where r is a real number. An adjustment to Equation 3.8 is needed to properly account for possible neighbors within this range.

Proposition 2. Let B' represent the set of all vectors from the given dataset in an NDDS where beta' in B' : [D_{Hamm}(q, beta') = floor(r)]. The probability that a subtree with DMBR R = S_1 x S_2 x ... x S_d will contain a record beta' where floor(r) < D_{GEH}(q, beta') <= r is given as the following:

PR(q, R, r) = \sum_{\beta' \in B'} P_{NE}(\beta', R) \cdot g(\beta', r, q)    (3.10)

where

g(\beta', r, q) = \begin{cases} 1 & \text{if } R_{adj}(\beta', q) \le r - \lfloor r \rfloor \\ 0 & \text{otherwise} \end{cases}, \qquad
R_{adj}(\beta', q) = \sum_{i=1}^{d} a(\beta'[i], q[i]), \qquad
a(\beta'[i], q[i]) = \begin{cases} \frac{1}{d} f(q[i]) & \text{if } \beta'[i] = q[i] \\ 0 & \text{otherwise} \end{cases}

Proof. The proof of Proposition 1 shows that Equation 3.8 yields the probability of a vector having z mismatching dimensions with the query vector existing in a particular subtree. If z = floor(r), this equation determines the probability of a vector with floor(r) mismatching dimensions existing in the subtree. The set of these vectors is represented by B'. The function g creates a subset of these vectors by removing all vectors u from the set B' where D_{GEH}(u, q) > r. Equation 3.10 is the summation over each of the remaining vectors u, where for all u in B' : D_{GEH}(u, q) <= r. Thus Equation 3.10 yields the probability that a vector whose distance to the query vector is between floor(r) and r exists in a particular subtree. []

The additional granularity provided by Equation 3.10 allows us to refine Equation 3.9 to make use of our GEH distance as follows:

P_{NN_{GEH}}(q, R, r) = PR(q, R, r) + \sum_{z=0}^{\lfloor r \rfloor - 1} PS_z(q, R).    (3.11)

Equation 3.11 may be used to give an accurate estimate of the likelihood that searching within any subtree will update a solution set when our enhanced distance measure is used. We use this measure in Section 3.3.1 to develop an ordering heuristic that provides a conservative assessment of whether or not to visit a particular subtree, which is beneficial to search performance in non-uniformly distributed databases.
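Assuming the per-layer probabilities PS_z and the partial-layer term PR of Equation 3.10 have already been evaluated for a subtree, combining them as in Equations 3.9 and 3.11 is a short sum. The sketch below is illustrative only, and follows the reading of Equation 3.11 given above, in which the Hamming layers strictly below floor(r) are counted in full and the layer at floor(r) is admitted through PR.

```cpp
#include <cmath>
#include <vector>

// psByZ[z] holds PS_z(q, R) for z = 0, 1, ..., d (Equation 3.8); pr holds PR(q, R, r)
// from Equation 3.10 for the fractional part of the GEH search radius r.
double pNnGeh(const std::vector<double>& psByZ, double pr, double r) {
    const int floorR = static_cast<int>(std::floor(r));
    double total = pr;                                   // partially admitted layer at floor(r)
    for (int z = 0; z < floorR && z < static_cast<int>(psByZ.size()); ++z)
        total += psByZ[z];                               // complete layers below floor(r)
    return total;                                        // Equation 3.11
}
```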
3.3 A k-NN Algorithm for NDDS

To efficiently process k-NN queries in NDDSs, we introduce an index-based k-NN searching algorithm. This algorithm utilizes properties of the ND-tree recently proposed by Qian et al. [42, 43] for NDDSs. The basic idea of this algorithm is as follows. It descends the ND-tree from the root following a depth-first search strategy. When k possible neighbors have been retrieved, the searching algorithm uses the distance information about the neighbors already collected to start pruning search paths that can be proven not to include any vectors that are closer to the query vector than any of the current neighbors. Our algorithm is inspired by earlier incremental range-based implementations presented for CDSs by Roussopoulos et al. [47] and Hjaltason and Samet [26] (generalized for metric spaces). Our algorithm extends these implementations to NDDSs by introducing metrics and heuristics suitable for such a space. The details of this algorithm are discussed in the following subsections.

3.3.1 Heuristics

In the worst-case scenario, the search would encompass the entire tree structure. However, our extensive experiments have shown that the use of the following heuristics is able to eliminate most search paths before they need to be traversed.

MINDIST Pruning: Similar to [47], we utilize the minimum distance (MINDIST) between a query vector q and a DMBR R = S_1 x S_2 x ... x S_d, denoted by mindist(q, R), to prune useless paths. Based on the GEH distance, MINDIST is formally defined as follows:

mindist(q, R) = \sum_{i=1}^{d} \begin{cases} 1 & \text{if } q[i] \notin S_i \\ \frac{1}{d} f(q[i]) & \text{otherwise} \end{cases}    (3.12)

where f(q[i]) = 1 - frequency(q[i]). This calculation is then used with the Range of the current k-nearest neighbors (with respect to the query vector) to prune subtrees. Specifically, the heuristic for pruning subtrees is:

H1: If mindist(q, R) >= Range, then prune the subtree associated with R.

By taking the closest possible distance between a DMBR and q, we guarantee that no vectors included in the DMBR's subtree are closer than the current Range, and thus they need not be included in the continuing search.

MINMAXDIST Pruning: We also utilize the minimum value (MINMAXDIST) of all the maximum distances between a query vector q and a DMBR R along each dimension, denoted by mmdist(q, R), for pruning useless paths. In simple terms, mmdist(q, R) represents the shortest distance from the vector q at which another vector can be guaranteed to be found in R's subtree. For a vector q and DMBR R = S_1 x S_2 x ... x S_d, MINMAXDIST is formally defined as follows:

mmdist(q, R) = \min_{1 \le k \le d} \left( f_m(q[k], S_k) + \sum_{1 \le i \le d,\, i \ne k} f_M(q[i], S_i) \right)    (3.13)

where

f_m(q[k], S_k) = \begin{cases} \frac{1}{d} f(q[k]) & \text{if } q[k] \in S_k \\ 1 & \text{otherwise} \end{cases}, \qquad
f_M(q[i], S_i) = \begin{cases} \frac{1}{d} f(q[i]) & \text{if } \{q[i]\} = S_i \\ 1 & \text{otherwise} \end{cases}

and f() on the right-hand side of the last two equations is defined as in Equation 3.12.

In general terms, the summation of f_M determines the number of dimensions in which every vector in the associated subtree is guaranteed to have a matching element with the query vector (since the component set S_i on the corresponding dimension contains only the corresponding element q[i] of the query vector). In these cases, a value of (1/d) f(q[i]) is added to the distance (i.e., the GEH adjustment for a matching dimension). The value of f_m determines whether there is another dimension (not among those checked for f_M) in which at least one vector in the associated subtree will match the query vector. In this case, a value of (1/d) f(q[k]) is added to the distance. A value of 1 is added for all other cases in f_M and f_m. The summation of these values yields the minimum distance (adjusted for GEH) at which a vector can be guaranteed to be located from the query vector in the associated subtree, based upon the information in the DMBR. For example, given a query vector q = (a, b, c) and a DMBR = {a, d} x {b} x {c, a}, we have f_m(a, {a, d}) + f_M(b, {b}) + f_M(c, {c, a}) = (1/3) f(a) + (1/3) f(b) + 1, which indicates that a vector (i.e., (a, b, ?)) matching q on the first two dimensions is guaranteed to exist in the corresponding subtree. The minimum of such distances over all choices of k is sought in Equation 3.13. If Range >= mmdist(q, R), it is guaranteed that at least one valid neighbor can be found in the corresponding subtree.
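The two pruning quantities can be sketched in C++ as follows, under the same illustrative DMBR and frequency-table representations used in the earlier examples of this chapter; this is a sketch of Equations 3.12 and 3.13, not the dissertation's implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <unordered_map>
#include <vector>

using Dmbr = std::vector<std::set<char>>;                       // per-dimension element sets
using Freq = std::vector<std::unordered_map<char, double>>;     // global element frequencies

// The per-dimension GEH adjustment (1/d) * (1 - frequency(q[i])).
static double fAdj(char q, const Freq& freq, std::size_t i, std::size_t d) {
    auto it = freq[i].find(q);
    double fr = (it == freq[i].end()) ? 0.0 : it->second;
    return (1.0 - fr) / static_cast<double>(d);
}

// MINDIST of Equation 3.12: 1 for each dimension whose component set excludes q[i],
// otherwise the GEH adjustment for a possible match.
double minDist(const std::vector<char>& q, const Dmbr& r, const Freq& freq) {
    double dist = 0.0;
    for (std::size_t i = 0; i < q.size(); ++i)
        dist += (r[i].count(q[i]) == 0) ? 1.0 : fAdj(q[i], freq, i, q.size());
    return dist;
}

// MINMAXDIST of Equation 3.13: for each pivot dimension k, f_m on k plus f_M on all
// other dimensions; the minimum over k is returned.
double minMaxDist(const std::vector<char>& q, const Dmbr& r, const Freq& freq) {
    const std::size_t d = q.size();
    double best = static_cast<double>(d) + 1.0;
    for (std::size_t k = 0; k < d; ++k) {
        double sum = (r[k].count(q[k]) > 0) ? fAdj(q[k], freq, k, d) : 1.0;      // f_m
        for (std::size_t i = 0; i < d; ++i) {
            if (i == k) continue;
            bool singletonMatch = (r[i].size() == 1 && r[i].count(q[i]) > 0);    // {q[i]} == S_i
            sum += singletonMatch ? fAdj(q[i], freq, i, d) : 1.0;                // f_M
        }
        best = std::min(best, sum);
    }
    return best;
}
```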
To process k-NN searches in our algorithm, mmdist() is calculated for each non-leaf node of the ND-tree using the query vector q and all the DMBRs (for subtrees) contained in the current node. Once each of these MINMAXDIST values (for subtrees) has been calculated, they are sorted in ascending order and the k-th value is selected as MINMAXDIST_k for the current node. The k-th value is selected to guarantee that at least k vectors will be found in searching the current node. This selected MINMAXDIST_k is then used in the following heuristic:

H2: If MINMAXDIST_k(node) < Range, then let Range = MINMAXDIST_k(node).

Optimistic Search Ordering: For those subtrees that are not pruned by heuristic H1 or H2, we need to decide an order in which to access them. Two search orders were suggested in [47]: one is based on the ordering of MINDIST values, and the other is based on the ordering of MINMAXDIST values. The MINMAXDIST ordering is too pessimistic to be practically useful: accessing the subtrees based on such an ordering is almost the same as a random access in NDDSs. From an extensive empirical study, we found that accessing subtrees in the optimistic order of MINDIST values during a k-NN search in an NDDS provided the more promising results. This study was performed under the assumption that the ND-tree is well structured. This access order is given formally as follows:

H3: Access subtrees ordered in ascending value of mindist(q, R). In the event of a tie, choose a subtree at random.

Conservative Search Ordering: A problem associated with search ordering heuristic H3 is that it optimistically assumes that a vector with a distance equal to the MINDIST value exists in the subtree associated with the relevant DMBR. Typically this is not the case in an NDDS: the set of elements on each dimension from different vectors often yields a combination that is not an indexed vector in the corresponding subtree. In some instances, the actual distribution of elements per dimension within a subtree may be significantly different from what is expressed in the representing DMBR. As discussed in Section 3.2.3, this can be estimated by calculating the difference between the assumed uniform distribution, 1/|S_i|, and the actual distribution, estimated by the frequency f_l in Equation 3.5. When the difference between the assumed distribution and the actual distribution becomes large for multiple elements or multiple dimensions of a query, the likelihood of a vector with a distance of MINDIST existing in the relevant DMBR greatly decreases. When this occurs, it is more appropriate to order the access of subtrees by the calculated probability of the subtree containing a vector whose distance to the query vector is less than or equal to the current range, as shown in Equation 3.11. This access order is given formally as follows:

H4: Access subtrees in the descending order of the probability of containing a vector alpha where D_{GEH}(q, alpha) <= Range. This probability is calculated by P_{NN_{GEH}}.

Heuristic H4 may be considered a conservative approach to ordering, while heuristic H3 may be seen as an optimistic approach to ordering.
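The difference between the two ordering heuristics amounts to the comparator used when sorting the surviving subtrees, as the short C++ sketch below illustrates; the struct and field names are hypothetical, and the random tie-breaking of H3 is omitted for brevity.

```cpp
#include <algorithm>
#include <vector>

// One candidate subtree of the current non-leaf node, with the two quantities the
// ordering heuristics compare.
struct CandidateSubtree {
    int childId;        // which child entry this is
    double minDist;     // mindist(q, R) of Equation 3.12
    double pNeighbor;   // P_NN_GEH(q, R, Range) of Equation 3.11
};

// H3 (optimistic): visit subtrees in ascending order of MINDIST.
void orderByH3(std::vector<CandidateSubtree>& c) {
    std::sort(c.begin(), c.end(), [](const CandidateSubtree& a, const CandidateSubtree& b) {
        return a.minDist < b.minDist;
    });
}

// H4 (conservative): visit subtrees in descending order of the probability that they
// contain a vector within the current Range.
void orderByH4(std::vector<CandidateSubtree>& c) {
    std::sort(c.begin(), c.end(), [](const CandidateSubtree& a, const CandidateSubtree& b) {
        return a.pNeighbor > b.pNeighbor;
    });
}
```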
3.3.2 Algorithm Description

Our k-NN algorithm adopts a depth-first traversal of the ND-tree and applies the aforementioned heuristics to prune non-productive subtrees and determine the access order of the subtrees. The description of the algorithm is given as follows.

k-NN Query Algorithm: Given an ND-tree with root node T, Algorithm k-NN Query finds a set of k-nearest neighbors, maintained in queue kNNS, to query vector q that satisfies Equation 3.1 in Definition 1. It invokes two functions: ChooseSubtree and RetrieveNeighbors. The former chooses a subtree of a non-leaf node to descend, while the latter updates a list of k-nearest neighbors using vectors in a leaf node.

Algorithm 1 k-NN Query
Input: (1) query vector q; (2) the desired number k of nearest neighbors; (3) an ND-tree with root node T for a given database.
Output: a set kNNS of k-nearest neighbors to query vector q.
1:  let kNNS = empty, N = T, Range = infinity, Parent = NULL
2:  while N != NULL do
3:    if N is a non-leaf node then
4:      [NN, Range] = ChooseSubtree(N, q, k, Range)
5:      if NN != NULL then
6:        Parent = N
7:        N = NN
8:      else
9:        N = Parent
10:     end if
11:   else
12:     [kNNS, Range] = RetrieveNeighbors(N, q, k, Range, kNNS)
13:     N = Parent
14:   end if
15: end while
16: return kNNS

In the algorithm, step 1 initializes the relevant variables. Steps 2-15 traverse the ND-tree. Steps 3-10 deal with non-leaf nodes by either invoking ChooseSubtree to decide a descending path or backtracking to the ancestors when there are no more subtrees to explore. Steps 11-14 deal with leaf nodes by invoking RetrieveNeighbors to update the list of current k-nearest neighbors. Step 16 returns the result (kNNS). Note that ChooseSubtree not only returns a chosen subtree but also may update Range using heuristic H2. If there are no more subtrees to choose, it returns NULL for NN at step 4. Similarly, RetrieveNeighbors not only updates kNNS but also may update Range if a closer neighbor(s) is found.

Function ChooseSubtree: The effective use of pruning is the most efficient way to reduce the I/O cost of finding a set of k-nearest neighbors. To this end, the heuristics discussed in Section 3.3.1 are employed in function ChooseSubtree.

Function 2 ChooseSubtree(N, q, k, Range)
1:  if a list L of the not-yet-visited subtrees of N does not exist then
2:    use heuristic H2 to update Range
3:    use heuristic H1 to prune subtrees of N
4:    use heuristic H3 or H4, based upon user criteria, to sort the remaining subtrees not pruned by H1 and create a list L to hold them
5:  else
6:    use heuristic H1 to prune subtrees from list L
7:  end if
8:  if L != empty then
9:    remove the first subtree NN from L
10:   return [NN, Range]
11: else
12:   return [NULL, Range]
13: end if

In this function, steps 1-4 handle the case in which the non-leaf node is visited for the first time. In this case, the function applies heuristics H1-H4 to update Range, prune useless subtrees, and order the remaining subtrees (their root nodes) in a list L. The subtrees that are not in this list are those that have already been processed or pruned. Step 6 applies heuristic H1 and the current Range to prune useless subtrees for a non-leaf node that was visited before. Steps 8-12 return a chosen subtree (if any) and the updated Range.
that heuristics H3 and H4 are suitable for different datasets. Their effects on performance and practical selection guidance will be discussed in Section 3.4.4. Function RetriezterN’eighbors: The main task of Ret‘rieeeNe'z'ghbo-r is to examine the vectors in a given leaf node and update the current k—nearest neighbors and Range. Function 3 Ret'r'ie'z'erN'eighb0'rs(N. q. 1:. Range. hN’A'S) 1: for each vector v in N do 2: if DGEH(q.-v) < Range then 3: kzNNS = laNNS U {n} 4: if |kNNS| > k then 5: remove vector 11’ from [ANS such that 'v' has the largest DG EH01: 'v’ ) in 1311\5 6: Range 2 DGEH((1- U”) such that v” has the largest DGEH((1~. n”) in ANNE 7: end if 8: end if 9: end for 10: return [1.51%. Range] A vector is collected in law only if its distance to the query vector is smaller than current. Range (steps 2 - 3). A vector has to be removed from ANNS if it has more than k. neighbors after a new one is added (steps 4 - 7). The vector to be removed has the largest distance to the query vector. If there is a tie. a random furthest vector is chosen. 75 3.3.3 Performance Model To analyze the performance of our k-NN search algorithm. presented in Section 3.3.2, we conducted both empirical and theoretical studies. The results of our empirical studies are presented in Section 3.4. In this section. we present a theoretical model for estimating the performance of our search algorithm. For our presentation. we assume that our algorithm employs both heuristics H1 and H2. We also assume an optimistic ND-tree structure. where a subtree's associated DMBR is reasonalgily representative of the vectors within the subtree; that is Va.i€{l.2 _____ d} : f[(a.)R,,j ~ 51‘ . With this ' ‘ l 7. assumption. our search algorithm employs H3 as its search ordering heuristic.1 Because of the unique properties of an \ DDS. there is no defined center for the data. space. This may also be interpreted as any point may be considered to be at. the center. Thus. we can define a bounding hyper-sphere around any point within the data space and determine the likely number of objects contained within. Definition 3. The area unthin a distance .3 from point p in a d-dimensional NDDS with alphabet set A for each dimension. is the total nmnber of possible unique points contained within the hyper sphere of radius 3 centered at. point p. This value is fOTmally calculated as the summation. of the number of points earisting in Spherical layers as follows: \ 1The assumption of a reasonably optimistic tree structure covers the majority of ND-trees gen— erated in our empirical studies. Non-optimistic tree. structures, where our fourth heuristic would PFOVide a more beneficial ordering method. are considered empirically in Section 3.4.4. 76 ~ 4v 1 Are-(1(2):: (i ( i:(_) A| —1)'i. (3.14) Note that Area(:) is independent of point p under the uniform distribution as- sumption. Equation 3.14 may be used to calculate the total area of the data space by setting :: = (1. However. this value 111ay be calculated directly as follows: Arrow) = IAId. (3.15) The probability of an object existing within a distance of z from any point p is the quotient of Equations 3.14 and 3.15, as follows: Area(:) P--'s s 3 Z _. (“M ) Areflidl (3.16) Proposition 3. The number of likely points contained within a distance .3 from any point p is the product of the number of points within the dataset N and the probability of a point existing within. a distance of z from p. 
The lower/optimal search bound for our performance model is determined as a reasonable distance needed to assure a specific number of objects. It is reasonable to assume that the minimum distance that needs to be searched is one that is likely to yield at least k neighbors. Thus a lower bound d_l = z is found by solving Equation 3.17 such that L(d_l) >= k and L(d_l - 1) < k. The lower bound on performance I/O may then be estimated as the number of pages touched by a range query of radius d_l. The range query performance is derived similarly to the model provided by Qian et al. [43]:

IO_r = 1 + \sum_{i=0}^{H-1} (n_i \cdot P_{i,z})    (3.18)

where n_i represents the estimated number of nodes within the ND-tree at a height of i, P_{i,z} represents the probability that a node at height i will be accessed in the ND-tree with a search range of z, and H denotes the height of the index tree. However, because a k-NN query generally begins with a search range equal to the theoretical upper search bound, an adjustment must be made to account for the I/O incurred while searching with a non-optimal search range. We have estimated this value as the number of nodes within each level of the ND-tree raised to a power inversely proportional to the height of that level:

Adj = \sum_{i=1}^{H-1} n_i^{(1/i)}.    (3.19)

Adding this adjustment to the range query performance model yields the following:

IO_{NN} = 1 + (n_0 \cdot P_{0,z}) + \sum_{i=1}^{H-1} \left( n_i \cdot P_{i,z} + n_i^{(1/i)} \right).    (3.20)

The performance of our search algorithm can be estimated by using Equation 3.20 with z set to d_l.

3.4 Experimental Results

To evaluate the effectiveness of our GEH distance and the efficiency of our k-NN searching algorithm, we conducted extensive experiments. The experimental results are presented in this section. Our k-NN searching algorithm was implemented using an ND-tree in the C++ programming language. For comparison purposes, we also implemented k-NN searching using an M-tree in the C++ programming language for a set of experiments. All experiments were run on a PC under OS Windows XP. The I/O block size was set at 4K bytes for both trees.

Both synthetic and genomic datasets were used in our experiments. The synthetic datasets consist of uniformly distributed 10-dimensional vectors with values in each dimension of a vector drawn from an alphabet of size 6; other special-case synthetic datasets are listed in the following subsections. The genomic datasets were created from E. coli DNA data (with alphabet {a, g, t, c}) extracted from GenBank. Each experimental data point reported here is the average over 100 random queries.

3.4.1 Effectiveness of GEH Distance

The first set of experiments was conducted to show the effectiveness of the GEH distance over the Hamming distance by comparing their values of Delta_k, as defined in Lemma 3.2.2 in Section 3.2.1. Figure 3.3 gives the relationship between Delta_k and database size for both the GEH and Hamming distances when k = 1, 5, and 10. The figure shows a significant decrease in the values of Delta_k using the GEH distance over those using the Hamming distance. This significant improvement for the GEH distance is observed for all the database sizes and k values considered. Figure 3.3 shows that when the GEH distance is used, Delta_k values are very close to 1, indicating a promising behavior close to that in CDSs.

[Figure 3.3. Comparison of Delta_k values for the GEH and Hamming distances (number of solution sets vs. number of vectors in database x 10^5)]
3.4.2 Efficiency of k-NN Algorithm on Uniform Data

One set of experiments was conducted to examine the effects of heuristics H1-H3, presented in Section 3.3.1, on the performance of our k-NN searching algorithm (Section 3.3.2) on uniform data. We considered the following three versions of our pruning strategies in the experiments.

Version V1: only heuristic H1 is used.
Version V2: heuristics H1 and H2 are used.
Version V3: heuristics H1, H2, and H3 are used.

[Figure 3.4. Effects of heuristics in the k-NN algorithm using ND-tree with k = 10 (disk accesses vs. number of vectors in database x 10^5)]

Figure 3.4 shows that V2 provides only a small improvement in the number of disk accesses over V1. However, V2 does make good performance improvements over V1 for some of the queries. Thus, we have included H2 in version V3. As seen from the figure, V3 provides the best performance improvement among the three versions for all database sizes tested. Hence V3 is adopted in our k-NN searching algorithm and used in all the remaining experiments, except where noted.

Another set of experiments was conducted to compare the disk I/O performance of our k-NN searching algorithm using an ND-tree, the k-NN searching based on an M-tree, and the linear scan for databases of various sizes. Figure 3.5 shows the performance comparison between our k-NN searching algorithm using an ND-tree and the linear scan.
Although Chavez et al. focus primarily on searching in general metric spaces. the same principles apply here. 83 25.0% - 7 . , , , (D 3 t1 8 20.0% 1 ...; k=1 8 a“: <1: ‘x -9- k=5 i 15.0% . __ F10 o ‘ "" ..A.\ \\ ””9 O 10 00/ ‘ ‘ CD - O i \‘\\ \\ \T‘*~».,.K_ 8) \\\ \\ El\\\x» E \\ lift, ——————— TEL ~~~~~~~ 3::\i\\~i~_1 s 5.0% . ‘-~-\v.—\\mg ........ ~ L _ _m . (D m; o. 0.0% ~ ~#—— 7 —~ ~ ”WWW -..___.____.___. a 4 8 12 16 20 Number of Vectors in Database x 10"5 Figure 3.5. Performance of the k—NN algorithm using ND-tree vs. the linear scan on synthetic datasets with various sizes genomic datasets is comparable with that we. observed for the synthetic datasets. To observe. the performance inmroveinent of our k-NN searching algorithm over various dimensions. we ran random k-NN queries (with l.‘ = 10) 011 two series of genomic datasets: one contains 250K vectors for each set and the other contains 1 million vectors for each set. As seen from Figure 3.8. the performance gain of our algorithm over the linear scan is quite significant for lower dimensions. However. the amount of this improvement decreases with an increasing number of dimensions. This phenomenon of deteriorating performance with an increasing number of dimensions is also true in continuous data spaces due to the well-known dimensionality curse problem. Additionally. we have observed the 1‘)crforn'1ance i1nprovemcnt of our k- 84 i_+_M-Tree (k=1) -I— M-Tree (k=5) 8 2500 (+M-Tree(k=10) 8 [ «a—ND-Tree (k=1) g 2000 i-Ef‘i-ND-Tree(k=5) < l tND-Tree (k=10) a 5 1500 - “6 1 b 1000 - .o 1 E ': :3 i z 500 . l a + :1: :3 r; 0 m- __f ,_____*'.-_ ___. - ._ /_ ___ _ \' . j) 4 8 12 16 20 Number of Vectors in Database x 10"5 Figure 3.6. Number of Disk Accesses comparison of the k-NN algorithm using ND- tree vs. the k-NN searching based on l\l—tree 35.0% V 8 x -a<-- 15 Dimensions g; 300% —:2— 11 Dimensions 0) \ :3 25.0% x x. 8 20.0% I \x g 15.0% x. m \T\‘\. g 10.0% a. *\%\ 8 . Til“ ““““ x ~-- . CT.) 5. 00/0 bt\\ é\\ T . F T 0.0% g ‘“ 2.5 5 7.5 10 12.5 15 17.5 20 Database Size x 10"5 Figure 3.7. Performance of the k-NN algorithm using ND-tree vs. the linear scan on genomic datasets with various sizes for A'zll) QC CH 120.0% .r (D .. i 3 100.0% . 8 ‘ ~5— 250K Vectors o < 80.0% . +1.0M Vectors E i D i u.- 60.0% 0 °’ i 0) £9 4OIY% C . a) e i g; 20.0% “L i 0.0% i 91011121314151617181920 2122 23 24 25 Number of Dimensions Figure 3.8. Performance of the k-NN algorithm using ND-trce vs. the linear scan on genomic datasets with various dimensions for #210 NN searching algorithm over various alphabet sizes. We performed random k-NN queries (with k = 10 and d = 10) on databases of 2A] vectors. Figure 3.9 shows that, the effects of increasing alphabet. size are similar to the effects of the dimensionality CHI‘SG. Further, we have compared the disk I/O performance of the kf-NN searching al— gorithm using the GEH distance with that for the At-NN searching algorithm using the Hamming distance. Figure 3.10 shows the percentage I / Os for the GEH distance versus the Hamming distance for various database sizes and k values. From the fig- ure. we can see that the number of disk accesses decreases for all test cases when the GEH distance is used as opposed to the Hamming distance. In fact. the algorithm 86 35.0% a, o -—k=1 8 30.0/0 'B'k=5 a“; -:';-.-k=10 8 25.0% 0 <1: E 20.0% cu . “5 15.0% 4 CD CD ’. 2 g 10.0% 5 CD 8 (D 0 . 0. 5.0/o ; 0.0% 4 6 8 10 12 Alphabet Size Figure 3.9. Perfori‘nance of the k-NN algorithm using ND-tree vs. 
In fact, the algorithm using the GEH distance needs only 50% - 70% of the I/Os that the algorithm using the Hamming distance needs for all test cases. We feel this is due to an increase in the pruning power of heuristic H1 for the GEH distance. These results indicate that the use of the GEH distance will cost less in disk accesses while providing a far more deterministic result than that using the Hamming distance for a k-NN query.

Figure 3.10. Performance comparison for the k-NN searching using ND-tree based on GEH and Hamming distances

3.4.3 Efficiency of k-NN Algorithm on Skewed Data

Experiments were also conducted to examine the I/O performance of our algorithm upon datasets of varying levels of skewness as compared to that of a linear scan. We applied our algorithm, with heuristic version V3 from Section 3.4.2, to ND-trees constructed from datasets with zipf distributions of 0.0, 0.5, 1.0, and 1.5. Figure 3.11 shows a significant reduction in the number of disk accesses for our k-NN searching algorithm over the linear scan for all database sizes tested. Similar to the performance gains for uniform data (see Section 3.4.2), our k-NN searching algorithm provides an increased reduction of disk accesses as the database size increases. Figure 3.11 also shows that our k-NN searching algorithm provides increased performance gains as the level of skewness increases (i.e., as the zipf distribution level increases). These results indicate that our searching heuristics (see Section 3.3.1) are able to identify and prune more useless search paths as the data becomes more skewed.

Figure 3.11. Performance of the k-NN algorithm using ND-tree vs. the linear scan on synthetic datasets with various sizes and zipf distributions

3.4.4 Efficiency of k-NN Algorithm on Non-Homogeneous Data

Experiments were conducted to show the effectiveness of our heuristic using the probability formulas presented in Section 3.2.3. We compared the I/O performance between the k-NN algorithm using our probability-based subtree ordering heuristic H4 and the k-NN algorithm using our traditional MINDIST subtree ordering heuristic H3, both of which utilize the ND-tree. We observed that, although the two heuristics often yield a comparable performance, there are cases in which our probability-based heuristic significantly outperformed the MINDIST one. These cases can occur when the distribution of the dataset shifts over time. For instance, dimensions that are highly relevant to the partitioning of vectors into subtrees early in the construction of an ND-tree may no longer be relevant at later stages of the construction. These dimensions may become misleading when searching for the records inserted into the tree during these later stages.

Figure 3.12. Performance of the k-NN algorithm using ND-tree on datasets with various misleading dimensions (k = 1)

Figures 3.12, 3.13, and 3.14 show our results when searching for 1, 5, and 10 neighbors, respectively.
Each search was performed on an ND-tree containing 5M vectors using each of the following heuristic combinations:

Version S1: heuristics H1, H2, and H3 are used;
Version S2: heuristics H1, H2, and H4 are used.

The ND-trees constructed from these datasets are known to contain misleading DMBRs in regards to what vectors are present in the relevant subtrees. For our selection of heuristic H4, we compared the values of 1/|b_i| and f(a)_{R,i} for each node at one level below the root node and labeled a misleading dimension as one in which there was a discrepancy greater than 3:1 between the two values compared. The number of misleading dimensions indicates the known number of dimensions in each of the DMBRs at one level below the root node that meet this criterion; that is, for a particular dimension i,

1/|b_i| > 3 · f(a)_{R,i}   ∨   1/|b_i| < (1/3) · f(a)_{R,i}.

Figure 3.13. Performance of the k-NN algorithm using ND-tree on datasets with various misleading dimensions (k = 5)
Figure 3.14. Performance of the k-NN algorithm using ND-tree on datasets with various misleading dimensions (k = 10)

The results in Figure 3.12 show that the use of heuristic version S2 provides benefits for most cases tested when searching for only a single neighbor. The cases where the number of misleading dimensions was either very large or very small still show better I/O performance when using heuristic version S1. The results in Figure 3.13 show that the reduction of I/O when heuristic version S2 is used is much larger for all cases tested when searching for five neighbors rather than a single neighbor. The results in Figure 3.14 show that the reduction of I/O continues to grow when searching for ten neighbors when using heuristic version S2. These results show that, in general, as the number of neighbors being searched for increases, the performance benefits when using heuristic version S2 increase as well. Additionally, we notice that in Figures 3.13 and 3.14, the performance when using heuristic version S2 becomes similar to the performance when using S1 as the number of misleading dimensions approaches the total number of dimensions. This is likely due to the reduction of non-misleading paths available. As the number of non-misleading paths approaches 0, heuristic version S2 will be forced to choose similar paths to heuristic version S1.

3.4.5 Verification of Performance Model

Our theoretical performance estimation model was also verified using uniform synthetic experimental data. We conducted experiments using 10-dimensional data with an alphabet size of 6. The minimum leaf node utilization of our ND-tree was set at 0.4 and the minimum non-leaf node utilization was set at 0.3. We compared our theoretical values to the observed ND-tree performances for databases varying in size from 400K vectors to 2.0M vectors in 400K increments. We also varied the value of k to observe its effects upon the results. Figures 3.15 through 3.17 show the estimated number of I/Os predicted by our performance model, together with the actual I/O. The results indicate that our model is quite accurate, estimating the performance within 2% of the actual performance for most test cases. The greatest disparity between estimated and actual performance values occurs in the test cases with small datasets,
particularly when searching for only a single neighbor. However, as the size of the dataset increases or as the number of neighbors searched for increases, the performance estimation becomes increasingly accurate.

Figure 3.15. Estimated and actual performance of the k-NN algorithm vs. the linear scan on synthetic datasets with various sizes (k = 1)
Figure 3.16. Estimated and actual performance of the k-NN algorithm vs. the linear scan on synthetic datasets with various sizes (k = 5)
Figure 3.17. Estimated and actual performance of the k-NN algorithm vs. the linear scan on synthetic datasets with various sizes (k = 10)

The above results show that, for both synthetic and genomic uniform data, our k-NN searching algorithm based on the GEH distance far outperforms the linear scan. Additionally, our algorithm outperforms the linear scan for synthetic skewed data. Only when the dimensionality of the underlying NDDS begins to grow excessively do the benefits of our algorithm start to become less significant. This is a result of the well-known dimensionality curse problem. Further, our performance model provides an accurate estimation of the number of I/Os incurred while performing a k-NN search of a large database.

CHAPTER 4

Understanding Distance in NDDS

In this chapter we consider in more detail the issue of distance measurement in Non-Ordered Discrete Data Spaces. To efficiently handle a much broader array of applications than those presented in the previous chapter, we present a generalized form of our Granularity Enhanced Hamming (GEH) distance. We then provide a new implementation of this distance.

4.1 Motivations and Challenges

A major problem with k-NN searching in NDDSs, as discussed in Chapter 3, is the non-determinism of the solution. That is, there is usually a large number of candidate solutions available, which may obscure the result. This is mainly caused by the coarse granularity of the commonly used distance metric, the Hamming distance. An extension to the Hamming distance, termed the Granularity Enhanced Hamming (GEH) distance, was introduced in [32] as a solution to this problem. We demonstrated that the GEH distance greatly reduced the non-determinism of the solution set, as well as provided performance benefits, while maintaining the semantics of the original Hamming distance [32, 33]. However, the GEH distance introduced in [32] was tied directly to data point frequency values in a manner that may not be ideal in all scenarios. Applications/scenarios with other more relevant dataset characteristics (distribution, clustering, etc.) may not experience the same performance benefits seen in [32]. To address this issue, we introduced a generalized form of the GEH distance in [34].
This form may be optimized to a much broader set of applications than the original GEH distance presented in [32]. Conditions/constraints are presented that maintain the necessary distance metric properties for use as a pruning metric. We show that the original GEH distance is, in fact, an instantiation of this generalized form. Further, we present a new instantiation of the generalized GEH form that demonstrates the benefits of adapting the generalized form for specific scenarios.

The rest of this chapter is organized as follows. Section 4.2 presents the generalized form of the GEH distance. Section 4.3 introduces a new ranking based GEH instantiation derived from the generalized form.

4.2 Generalized GEH Distance

The GEH distance, originally presented in [32], expanded upon the Hamming distance to provide more granularity while maintaining all of the semantics of the Hamming distance. This was accomplished by adding an adjustment value to the Hamming distance between two vectors based upon their matching elements. This is presented formally as follows:

D_GEH(α, β) = Σ_{i=1}^{d} { 1, if α[i] ≠ β[i];  (1/d)·f(α[i]), otherwise },    (4.1)

where f(α[i]) = 1 − f_q(α[i]). The value of f_q(α[i]) is the number of occurrences of α[i] in the i-th dimension of the dataset, divided by the number of vectors in the dataset; essentially, a global frequency value.

While this does provide a dramatic increase in the determinism of result sets when used in a similarity search, this distance metric may not provide an ideal distance semantic for all applications. Equation 4.1 is limited to applications where the global frequency of elements has some significance in the dataset.
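To make the computation in Equation 4.1 concrete, the following sketch (in Python, with hypothetical helper names) shows one way the frequency-based GEH distance could be evaluated; the per-dimension frequency tables are assumed to have been precomputed from the dataset. It is a minimal illustration of the formula, not the implementation used in our experiments.

    def geh_frequency(alpha, beta, freq):
        """Frequency-based GEH distance of Equation 4.1 (illustrative sketch).

        alpha, beta : equal-length sequences of discrete elements.
        freq        : list of dicts; freq[i][e] is the fraction of database
                      vectors whose i-th element equals e, i.e. the global
                      frequency value f_q(e) for dimension i.
        """
        d = len(alpha)
        dist = 0.0
        for i in range(d):
            if alpha[i] != beta[i]:
                dist += 1.0                      # Hamming contribution for a mismatch
            else:
                f_q = freq[i].get(alpha[i], 0.0)
                dist += (1.0 - f_q) / d          # adjustment: (1/d) * (1 - f_q(alpha[i]))
        return dist

The integer part of the returned value is still the Hamming distance; the fractional part breaks ties among vectors that match on the same number of dimensions, using the global frequencies of the matching elements.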
Applications where other dataset characteristics provide a better semantic may not be able to benefit from using Equation 4.1 to the same degree as the results shown in [32]. To address this issue, we propose a generalization of the GEH distance that may be optimized to a much broader set of applications.

We observe that the Hamming distance assumes that a worst case match (i.e., a non-match) between two elements is represented by a distance of 1, while all other matches are represented by a distance of 0. We expand upon this by adding more granularity to the values assigned to different types of matches. We propose the following generalized form of the GEH distance to accomplish this goal:

D_GEH(α, β) = D_Hamm(α, β) + (1/C) · Σ_{i=1}^{d} f_geh(α[i], β[i]),    (4.2)

where

Constraint 1: ∀ α, β : C ≥ d − D_Hamm(α, β);
Constraint 2: ∀ α[i], β[i] : 0 ≤ f_geh(α[i], β[i]) < 1;
Constraint 3: ∀ α[i], β[i] : f_geh(α[i], β[i]) = f_geh(β[i], α[i]);
Constraint 4: ∀ α[i], β[i] : α[i] ≠ β[i] → f_geh(α[i], β[i]) = 0.

Here, f_geh represents some function, chosen by an application expert, that will provide an adjustment to the Hamming distance for each dimension. The variable C is a pseudo-constant used to guarantee that the adjustment values of f_geh do not become more dominant than the original Hamming distance. Constraint 1 indicates that the value of C must be greater than or equal to the number of matching dimensions between two vectors. Constraint 2 indicates that the result of function f_geh for the i-th elements of the two vectors being considered must lie in the range [0, 1). Constraint 3 indicates that function f_geh must be symmetric. Constraint 4 indicates that the result of f_geh for the i-th elements of the two vectors being considered must equal 0 if these two elements do not match.

Footnote 1: The term pseudo-constant is used to indicate that C is not strictly a constant, and may vary as long as Constraint 1 of Equation 4.2 is maintained.
Footnote 2: Note that both variables α and β are passed to the adjustment function. This enables the adjustment function to be fully expressed, whereby Constraint 4 may be verified.

From Equation 4.2, we can see that if m ≤ D_GEH(α, β) < m + 1 (m = 0, 1, ..., d), then vectors α and β mis-match on m dimensions (i.e., match on d − m dimensions), therefore preserving the original semantics of the Hamming distance. Further, the four provided constraints allow the generalized GEH distance to maintain the metric properties necessary for use as a pruning metric in similarity searches, as described in the following lemmas.

Footnote 3: Many application specific solutions, such as BLOSUM, employed in bioinformatics, reduce the non-determinism of solution sets by utilizing a cost matrix as a direct form of distance measurement. This is similar in theory to utilizing the adjustment function as a distance measure directly. Unfortunately, these methods do not preserve the semantics of the original Hamming distance and thus lose a level of portability between application environments. However, solutions such as BLOSUM may be incorporated into Equation 4.2 by utilizing the diagonal of the cost matrix for the adjustment function.

Lemma 4.2.1. The generalized GEH distance maintains the Positiveness property (i.e., ∀ x, y : D_GEH(x, y) ≥ 0).

Proof. By maintaining the Hamming distance within the GEH distance, we are guaranteed a positive value if any elements between the two vectors do not match. Constraint 2 indicates that all values resulting from matching elements will be non-negative. □

Lemma 4.2.2. The generalized GEH distance maintains the Strict Positiveness property (i.e., ∀ x, y : x ≠ y → D_GEH(x, y) > 0).

Proof. This property is inherited by maintaining the Hamming distance within the generalized GEH distance, whereby any two vectors that are not equal will have a distance greater than 0, based upon a 1 being added to the distance for each non-matching dimension. Constraint 2 guarantees that the values added from function f_geh will all be non-negative. □

Lemma 4.2.3. The generalized GEH distance maintains the Symmetry property (i.e., ∀ x, y : D_GEH(x, y) = D_GEH(y, x)).

Proof. The Hamming distance is known to maintain symmetry between vectors. In addition, Constraint 3 guarantees that the values provided by the function f_geh will maintain symmetry as well. □

Lemma 4.2.4. The generalized GEH distance maintains the Pseudo-Reflexivity property (i.e., ∀ x, y : D_GEH(x, x) < 1 ∧ x ≠ y → D_GEH(x, y) ≥ 1).

Proof. This property is maintained due to Constraints 1 and 2, which stipulate that the additional value added to the Hamming distance will always be in the range [0, 1). Thus the distance between two vectors that exactly match will have a distance value less than 1. Any vectors that are different in one or more dimensions will have a distance greater than or equal to 1. □

Footnote 4: Note that the traditional property of Reflexivity (i.e., ∀ x : D(x, x) = 0) is replaced by the property of Pseudo-Reflexivity. This is a reasonable substitution in an NDDS due to two vectors exactly matching each other still being identifiable from all other pairs of vectors based only upon the distance between them.

Lemma 4.2.5. The generalized GEH distance possesses the Triangular Inequality property (i.e., ∀ x, y, z : D_GEH(x, y) + D_GEH(y, z) ≥ D_GEH(x, z)).

Proof. We first consider the Hamming portion of the generalized GEH distance.
For any dimension i ∈ [1, d], if x_i ≠ z_i, then either exactly one of x_i ≠ y_i and z_i ≠ y_i holds, or x_i ≠ y_i ∧ z_i ≠ y_i. Thus, for each dimension i where the right side of the inequality (i.e., D_GEH(x, z)) would be incremented by an integer value of 1, the left side of the inequality (i.e., D_GEH(x, y) + D_GEH(y, z)) would be incremented by an integer value of either 1 or 2, thus maintaining the inequality. Next, we consider the adjustment portion of the GEH distance (i.e., (1/C)·f_geh()). For any dimension i ∈ [1, d], if x_i = z_i, then either x_i = y_i ∧ z_i = y_i or x_i ≠ y_i ∧ z_i ≠ y_i. Thus, due to Constraints 2 and 3, for all dimensions where this is the case, the left side will either be incremented by twice as much as the right side or be incremented by an integer value of 2 while the right side is incremented by some value less than 1. Constraint 4 indicates that no additions will be made if the values in a dimension do not match, leaving the Hamming component to be dominant. Thus the adjustment values maintain the inequality. □

4.3 Ranking Based GEH Instantiation

As described in [14], many search algorithms demonstrate better performance when the distances between data points are distributed evenly throughout the distance range. We note that the original GEH distance, Equation 4.1, is likely to result in a heavily skewed distribution of possible distances. As the alphabet size grows, the likely values of f_q(α[i]) trend closer to 0, leading to a clumping of distance values close to the floor value. Additionally, setting C = d results in C having a dominant role in the distance value as the dimensionality of the dataset grows larger. To address these issues, we propose a new GEH distance instantiation:

f_geh(α[i], β[i]) = rank_i(α[i]) / (|A_i| + 1),
C = d − D_Hamm(α, β) + 1.    (4.3)

Here, the term rank_i(α[i]) indicates the global rank of element α[i] among the alphabet set in dimension i. The ranking mechanism employed should be set by an application expert on the condition that it results in the different elements of the alphabet receiving ranking values of [1, |A_i|] inclusive. The value of C tracks the number of matching dimensions between vectors α and β.

As an example ranking mechanism, we consider the frequency of elements within a dimension, applying a higher rank (lower integer value) to elements that occur more frequently, and a lower rank (higher integer value) to elements that occur less frequently. For example, if the alphabet set in dimension i consists of {a, b, c}, where a appears in 20% of the vectors, b appears in 50% of the vectors, and c appears in 30% of the vectors in dimension i, the rank of each of the elements in dimension i would be as follows: a → 3, b → 1, and c → 2. Although an element's frequency within a dimension still plays a role in the determination of the GEH distance (in this example), the ranking mechanism maintains a uniform distribution of distance values over varying alphabet sizes. Additionally, having the value of C track the number of matching dimensions rather than the dimensionality of the dataset reduces the dominance of C as the dimensionality of the dataset grows larger.
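As an illustration only, the following sketch (Python, with hypothetical helper structures) plugs the rank-based adjustment of Equation 4.3 into the generalized form of Equation 4.2. The rank tables are assumed to have been produced by whatever ranking mechanism the application expert selects, with ranks in [1, |A_i|].

    def geh_rank(alpha, beta, rank, alphabet_size):
        """Rank-based GEH instantiation (Equations 4.2 and 4.3) -- illustrative sketch.

        alpha, beta   : equal-length sequences of discrete elements.
        rank          : list of dicts; rank[i][e] is the rank (1 .. |A_i|) of
                        element e in dimension i.
        alphabet_size : list; alphabet_size[i] = |A_i| for dimension i.
        """
        d = len(alpha)
        hamm = sum(1 for i in range(d) if alpha[i] != beta[i])
        # Pseudo-constant of Equation 4.3: number of matching dimensions plus one,
        # which satisfies Constraint 1 (C >= d - D_Hamm).
        C = d - hamm + 1
        adjust = 0.0
        for i in range(d):
            if alpha[i] == beta[i]:                       # Constraint 4: 0 on mismatch
                adjust += rank[i][alpha[i]] / (alphabet_size[i] + 1.0)
        return hamm + adjust / C

Because every per-dimension adjustment is strictly less than 1 and there are exactly d − D_Hamm matching dimensions, the summed adjustment divided by C stays below 1, so the integer part of the returned value is still the Hamming distance.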
To evaluate the benefits of an adaptable distance metric, we performed a series of k-NN queries utilizing the GEH distance implementations in Equations 4.1 and 4.3, as well as the Hamming distance. Table 4.1 shows a comparison of I/O results while searching uniformly distributed datasets of varying dimensionality. These results demonstrate a scenario where the frequency based GEH implementation provides slightly better search performance than the rank based GEH implementation. Further, our results agree with those in [52] linking decreasing performance with increasing dimensionality.

Table 4.1. Varying Dimensionality (disk accesses)
          Hamm.   Freq.   Rank
  d = 5      36       7      6
  d = 10    968     472    480
  d = 15   5914    4591   4675
  d = 20   8294    8228   8232

Footnote 5: Note that for the largest dimensionality tested, d = 20, the I/O results when using the ranking based and frequency based GEH implementations begin to approach each other. We attribute this to the dimensionality of the dataset playing a less dominant role in Equation 4.3 than in Equation 4.1.

Table 4.2 shows a comparison of the I/O results while searching 10-dimensional datasets of varying zipf distribution. For this scenario, use of the new ranking based GEH implementation provides a strong performance improvement over the frequency based GEH distance implementation. This is in agreement with the results shown in [14] concerning search performance and distance value distribution. These results highlight scenarios where Equations 4.1 and 4.3 provide search performance improvements specific to each case, thus demonstrating the benefits of an adaptable distance metric.

Table 4.2. Varying Zipf Distribution (disk accesses)
            Hamm.   Freq.   Rank
  zipf 0.0    968     472    480
  zipf 0.5    693     399    301
  zipf 1.0    381     233    126
  zipf 1.5    105      73     30

CHAPTER 5

k-Nearest Neighbor in Hybrid Data Spaces

In this chapter, we consider k-Nearest Neighbor (k-NN) searching in Hybrid Data Spaces. Searching in HDSs presents several new challenges not present in either CDS or NDDS searching applications. We examine these issues and discuss methods to resolve them. Further, we extend the theoretical performance model presented in Chapter 3 to HDSs.

5.1 Motivation and Challenges

Nearest neighbor searches/queries in Hybrid Data Spaces (HDS) are becoming increasingly useful in many contemporary applications such as machine learning, information retrieval and security, bioinformatics, multimedia, and data-mining. Consider the following information retrieval task. Given a set of network sites and a range of times, determine the set of k network intrusions that match a set of query criteria most closely. When examining records of intrusions, the sites in which they occurred could be considered discrete data, in that an intrusion either did or did not occur at that site. The times active in a particular site could be considered continuous data, in that an intrusion may have been active over only a period of time.

Several techniques have been proposed in the literature to support such searches in both continuous (ordered) data spaces and non-ordered discrete data spaces. A majority of these techniques utilize a multidimensional index structure such as the R*-tree [6] or the ND-tree [43]. Little work has been reported in the literature on supporting efficient nearest neighbor searches in hybrid data spaces.
Efficient index based nearest neighbor searching is dependent upon the usefulness of information maintained in the search index structure. When searching an index containing hybrid data, a difficult scenario occurs when one set of dimensional data is unknown. This scenario is analogous to using a non-hybrid index, such as the R*-tree or ND-tree, to maintain hybrid data based on their continuous or discrete subspace. If this structure remains unmodified, performing nearest neighbor searches becomes impractical due to the values of the non-native dimensions (discrete for the R*-tree, continuous for the ND-tree).

To guarantee all valid neighbors are found in these scenarios, additional considerations must be taken into account. First, when examining the current set of objects/records found to determine a search range, it must be assumed that all non-native dimensions that are not maintained in the index structure differ in value from the query vector, i.e.:

D(q, NN_k) = D_N(q, NN_k) + d_nn,    (5.1)

where D_N is the distance in native dimensions between the query vector q and the k-th neighbor NN_k, and d_nn is the maximum distance possible between the non-native dimensions in the dataset and q. Second, when comparing a bounding box or possible new neighbor, the opposite assumption must be made, i.e.:

D(q, X) = D_N(q, X) + 0,    (5.2)

where X is either a bounding box or an object. This is due to the possibility of some vector within X (or X itself) exactly matching the query vector on all non-native dimensions that are not maintained in the index structure. Examining these facts, it is clear that as the number of non-native dimensions grows, it becomes increasingly difficult to exclude any portion of the index from the search.
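The following fragment (Python; the distance helpers are hypothetical placeholders) sketches how Equations 5.1 and 5.2 force a non-hybrid index to be pessimistic about the search range while remaining optimistic when deciding whether a subtree can be pruned. It is meant only to make the asymmetry explicit, assuming the native-subspace distances are already available.

    def search_range(native_dist_to_kth, d_nn):
        # Equation 5.1: the k-th neighbor found so far might still differ from the
        # query on every non-native dimension, so the range must include d_nn.
        return native_dist_to_kth + d_nn

    def lower_bound(native_dist_to_box):
        # Equation 5.2: a subtree (or candidate object) might match the query exactly
        # on every non-native dimension, so nothing can be added to its bound.
        return native_dist_to_box + 0

    def can_prune(native_dist_to_box, native_dist_to_kth, d_nn):
        # A subtree is safely pruned only if even its optimistic bound exceeds the
        # pessimistic search range; with many non-native dimensions this rarely holds.
        return lower_bound(native_dist_to_box) > search_range(native_dist_to_kth, d_nn)

Here d_nn is the maximum possible contribution of the non-native dimensions, so the gap between the two bounds grows with every dimension the index does not maintain, which is why pruning degrades so quickly.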
To address these issues, we consider performing k-Nearest Neighbor (k-NN) searches utilizing the CND-tree, a recently proposed multidimensional index for HDS [15]. When considering k-NN searches utilizing an HDS index, we present a best-first searching algorithm that utilizes characteristics of HDS in its heuristics to reduce the I/O cost of determining a valid set of k neighbors. Further, we present a theoretical performance model for this algorithm based on the characteristics of an HDS and the hybrid index.

The rest of this chapter is organized as follows. Section 5.2 presents a brief analysis of the different stages of a k-NN search. Section 5.3 presents our best first algorithm, its heuristics, and our theoretical performance model. Section 5.4 discusses experimental results.

5.2 Nearest Neighbor Search Stages

In this section, we present a discussion of the different stages of a k-NN search. By properly categorizing these stages, we are able to develop heuristics that improve search performance with more accuracy. The reader is recommended to review Chapter 2 for an overview of the CND-tree.

When performing k-NN queries using a multi-dimensional index structure, the act of traversing the tree (assuming a minimum distance based ordering) can be broken into three distinct stages: range reduction, overlap, and exhaustive. In this section we consider the first two stages in more detail, while the exhaustive stage will be revisited in Section 5.4.2.

The range reduction stage occurs while the current search range, r_c, is greater than the final search range value. During this stage, new neighbors/objects that are found are likely to result in reducing the current search range. The order in which nodes are visited has a direct impact upon the I/O cost of this stage.

The overlap stage occurs when the search range has been reduced to its final value, but there still exist nodes R in the search path whose bounding box is such that D(q, R) < r_c. The amount of overlap between nodes within data organization structures directly affects the I/O cost of this stage. Data organization structures with a minimal amount of overlap within their internal nodes are likely to incur less I/O cost when searching during this stage than data organizational structures with a greater amount of area overlap.

Figures 5.1, 5.2 and 5.3 break down the I/O cost due to the range reduction and overlap stages when searching a CND-tree. Figures 5.1 and 5.2 show the effects when either the number of continuous or discrete dimensions is held constant (at 6) and the number of the other dimensions varies. Figure 5.3 shows the effects when the number of both continuous and discrete dimensions is held constant (6 continuous, 3 discrete) and the number of records in the database varies. As seen from these figures, the I/O costs of the overlap stage rise dramatically as the number of dimensions increases. This stage has a much less dramatic increase in I/O cost as the size of the database increases. Additionally, the I/O cost of the range reduction stage actually decreases as the database size grows, while increasing as the number of dimensions increases.

Figure 5.1. Search stage I/O with variable number of continuous dimensions
Figure 5.2. Search stage I/O with variable number of discrete dimensions
Figure 5.3. Search stage I/O with variable database size

Footnote 1: This is a similar phenomenon to what was reported by Chavez et al. [14]. Although Chavez et al. focus primarily on searching in general metric spaces, the same principles apply here.

5.3 Search Algorithm

To efficiently process k-NN queries in HDSs, we present a priority backtracking index-based searching algorithm that utilizes properties of the CND-tree for HDS. This algorithm initiates a search in the root node and then visits each subsequent node based upon a "best-first" criterion that is defined in the heuristics. When k possible neighbors are retrieved, the searching algorithm uses the distance information of the neighbors collected to start pruning the search paths that can be proven to not include any vectors that are closer to the query vector than any of the current neighbors. Our algorithm is inspired by earlier priority backtracking based implementations presented for CDS, NDDS, and generic metric spaces [33, 27]. Our algorithm extends these implementations to HDSs by utilizing metrics and heuristics suitable for such a space. In particular, we introduce the notion of using an estimated match likelihood value to help prioritize ordering. Additionally, we present a performance model to accurately estimate the I/O cost of executing our algorithm.

5.3.1 Match Likelihood

Consider a query vector q and a bounding box R.
If the total number of dimensions in the dataset is d_t = d_D + d_C and the distance between q and R is D(q, R) = x, then it is possible that there exists an object in the subtree associated with R that matches q on d_t − x dimensions (assuming a distance metric similar to Equation 2.4 is utilized). In a multidimensional index structure such as the ND-tree, R*-tree, or CND-tree, this is not the most likely scenario. More likely, there exist several objects in the subtree associated with R that match q on a subset of the dimensions that are represented in R. The projections of these objects at higher levels in the index tree can create a misleading picture of the vectors in its associated subtree, similar to the concept discussed in Chapter 3 for NDDS searching. We may infer from the bounding box of R how likely the information that R contains is to represent the vectors in its subtree. For the purposes of this discussion, we will assume that our index tree has maintained a relatively uniform distribution of elements within its subtrees.

When we consider the CND-tree as our indexing method, the probability that accessing a subtree with associated bounding box R = S_{D_1} × ... × S_{D_{d_D}} × S_{C_1} × ... × S_{C_{d_C}} will yield a particular element a in dimension i may be estimated as follows (assuming a ∈ R_i):

p(a)_{R,i} = { 1/|S_i|, if i is discrete;  δ_t/(max S_i − min S_i), if i is continuous },    (5.3)

and the match likelihood of encountering a vector α = (α_1, α_2, ..., α_{d_t}) in the subtree associated with R may be approximated as follows:

P(α)_R = Σ_{i=1}^{d_t} { p(α[i])_{R,i}, if α[i] ∈ R_i;  0, otherwise }.    (5.4)

Equation 5.3 calculates the reciprocal of the magnitude of the alphabet set on dimension i of R for discrete dimensions, and the quotient of the threshold value and the range of the set on dimension i of R for continuous dimensions. Equation 5.4 then determines the summation of these values for all elements in vector α that are represented in R. It should be noted that more in-depth methods were presented for estimating this likelihood value for NDDS in [33]. However, we are only interested in using this value to break ties in cases where the minimum distance between a subset of the nodes and the query vector is the same. Thus, the generalizations of Equations 5.3 and 5.4 are sufficient.
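The following sketch (Python; the bounding-box representation is a hypothetical stand-in for the CND-tree's HMBR structure) illustrates the estimate of Equations 5.3 and 5.4 for a single node. Discrete components are modelled as sets of alphabet elements and continuous components as (min, max) intervals, with delta_t standing for the threshold of Equation 2.4.

    def match_likelihood(alpha, box, is_discrete, delta_t):
        """Approximate match likelihood of vector alpha in the subtree of box
        (Equations 5.3 and 5.4) -- illustrative sketch.

        alpha       : candidate/query vector (one value per dimension).
        box         : per-dimension HMBR components; a set for a discrete
                      dimension, a (lo, hi) pair for a continuous one.
        is_discrete : list of booleans, one per dimension.
        delta_t     : matching threshold used for continuous dimensions.
        """
        total = 0.0
        for i, comp in enumerate(box):
            if is_discrete[i]:
                if alpha[i] in comp:                      # alpha[i] in R_i
                    total += 1.0 / len(comp)              # Equation 5.3, discrete case
            else:
                lo, hi = comp
                if lo <= alpha[i] <= hi:
                    span = max(hi - lo, delta_t)          # guard against a zero-width interval
                    total += delta_t / span               # Equation 5.3, continuous case
        return total

In the algorithm of the next subsection this value is used only to break ties among subtrees that share the same minimum distance to the query (heuristic H3), so the rough estimate is sufficient.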
It utilizes a priority queue, labeled Q. of CND-tree nodes that is sorted based upon heuristics H2 and H3. It invokes two functions: F i ndS'ubtrces and RetriecenN’e-ighbors. The former finds all subtrees of the current node N that are within Range of the query vector and adds their nodes to Q. The latter updates the list of k-nearest neighbors. NN, using vectors in the current leaf node. In the algorithm. step 1 initializes the range variable. Step 2 starts the search at the root: node by inserting it into Q. Steps 3 - 14 traverse the CND-tree. where steps 4 and 5 select the next node to be visited and remove it from Q. Steps 6 — 8 deal with non-leaf nodes. Step 7 invokes FindSubtrees to collect all subtrees of the current node that are within Range. Step 8 sorts Q according to heuristic H2 116 Algorithm 4 Priority h-NN Query 1: Range = dt 2: Q.Insert( T ) 3: while lQ.Empty() do 4: N = Q.Top() 5: Q.Pop() 6: if N.Height > 1 then 7: FindSubtrees( N. Range. g. Q ) 8: Q.SOI‘t() 9: else 10: RetrieveNeighbors( N. Range. q. NN ) 11: Range = NN[A‘].Dist() 12: Q.Prune( Range ) 13: end if 14: end while 15: Return NN and H 3. Steps 9 - 12 deal with leaf nodes. Step 10 invokes RetrieeeNeighbors to update the list of current [if-nearest neighlmrs. Step 11 updates Range to equal the distance of the current hf," neighbor from the query vector. Step 12 prunes nodes from Q according to heuristic H 1. Finally. step 15 returns the result. Note that F indS ubtrees only updates Q with subtrees that are currently within Range. If no subtrees of the current node are within range. no new nodes will be added to Q. The W'HILE loop in steps 3 - 14 is terminated when all subtrees that are within the current. range have been visited. 5.3.3 Performance Model To analyze the performance of our k-VN search algorithm, we conducted both em- pirical and theoretical studies. The results of our empirical studies are presented 117 in Section 5.4. In this section. we present a theoretical model for estimating the performance of our search algorithm. To accomplish this. we first derive an estimate of the likely search range that will yield 1: objects/ neighbors. We then derive an estimate of how many tree nodes/ pages will be accessed by a search of a. variable range. For our presentation. we assume that our algorithm employs heuristics H1. H 2. and H3 and that our dataset is reasonably uniform. For reference, many of the variables used throughout this section are maintained in Table 1. The likely distance of the hm neighbor from the query point represents the final search range that can be used to prune nodes from the search path. To determine this distance. we estimate the ratio of area within a specific distance and the total possible area of the dataset. similar to the method e1111')loyed for NDDSs in [33]. The area within a distance r: from a point p in a d D + (IC dimensional HDS with alphabet A for each discrete dimension. span S for each continuous dimension, and threshold value (it for Equation 2.4 is analogous to the number of unique points within a. hyper sphere of radius 3 centered at point p: A7‘6’(l.(.‘:) = 2;:0 [Zillill(r.dm) ) [faf-T- LIN] a (55) g:max(0.;r—dA[ where 118 Table n,- 11.12- (1.11 m (11‘! (1711. d111 d112 dm 1 (1711.2 3M2 31711 B7712 mum of nodes at. layer i num of vectors in a node at layer i 5.1. Performance Model Variables height of tree = [£171 2 max(dD, (1C) 2 min(dD. dc) _{R IAI if (1311' = (ID "' otherwise IRAI 1f (1m— — dD otI her w ise ll i’d'IDi) max(d:D max(dCI (ICI‘. ;) I min(dCII, (1C1?) 
max(dDII (1D ,j) I II max((l'ICII-, d/C'Ij) min(d:DI- (1D ,) l } if d” = d D otherwise } if (1111 = d D otherwise } if (in, = dD } otherwise :{{R :{ min(dDI dIIIIDi) if (1772 = dD min1dc I (IF I.) otherwise if (111— — (11) Adi—111 _ dDi if (111 = (1D A (1.111: dIDflT if d1.” = (10 /\ (1.111 = (10.2" otherwise . r 1f (11; = (ID /\ (1.112 = (113.1 . I 1de11 2 d0 /\ (1.112 = (10.1 I’ if (1.11 = dc /\ (1.112 = d(7,7: otherwise if dm = (10 /\ dml = (1,0,1 _ if dm 2 d D /\ dml = (131 _ if (in = dc A dml = (1a: otherwise if (1m, = do A dm2 = (113,1 . II If (1771,. = (ID A (1711.2 : dDa' . I If (1m 2 (1C A (11112 : dC’j otherwise 119 I . 1, . _ fafffill) : ((111)1(1111— 1)U(i‘l[y) ((1.1! — 11$ 31 . The probability of an object existing within a distance 2 from any point 1), may be calculated as follows: .4rea(z) P 's S‘ 3 Z . (.rz.,.t. ( ) ArgafllD + (1C) (5.6) The product of the number of points N within the dataset and the probability of a point existing within a distance 3 from any point. p yields the number of points likely to exist in that subspace: [4(3) : Peristsk) * *N- (5'7) It is reasonable to assume that a. minimum distance that needs to be searched is one less than the distance that is likely to yield at least A: neighbors. Thus a. search range r = z, is found be solving Equation 5.7 such that L(r + 1) _>_ k and L(r) < k. The reason for this is that a search range that yields more than k neighbors is likely to extend beyond the. overlap stage (Section 5.2). Next, we determine how many pages are likely to be accessed by a. search with 120 a range of r. We derive this value in a similar fashion to the method derived by Qian et al. [43] for N DDS. As new records are inserted into the CND-tree. some dimensions 111ay be split and others may not be. Assuming a reasonably uniform distribution and independence among dimensions. the probability for the length of an unsplit edge of an HMBR to be ‘1’ is as followsgz T- = 1 D .—.———r as T _. z 1 (12.1 Wr‘ Using Equation 5.8. the probability for ant hebibliography edge length of an HMBR to be j is as follows: A . j—l ; 141‘“: (ij I1 'l’-ZA.=1(1)IT.fiI—*Ti.k TDIZ'IJ' : [‘4 U'Z' k 7 I? I—1 j Hui (J) u’-Z/._1(1:)T*Tzk T .. __ (kt) (7.2.1 — Rwi Hence. the expected edge length of an HMBR of a node at layer i is: |A| I. SDJ = ijl J * TD,l-,j3 R . . SC]? = ijl J * Tam- VVe can assume that a sequence of n node splits will split reasonably evenly 2For clarity. equations with a D or C subscript are relevant to discrete or continuous dimensions only. respectively. 121 t. hroughout the d D + (1C dimensions. Each split. will divide the component set of t he HMBR on the relevant dimei'ision into two equal sized eonn'xment subsets. To Obtain 172- nodes at layer i of the CND-tree, the expected number of splits needed is logg ni. As such. at any given time. it is likely that some nodes will be split one Inore time than others. This can be determined as follows: (If, 2 [(log-Zni) mod ((ID + dC-‘ll: I , ,, (5.10) (1,- = ((11) + (1C) — di, 1 N . . . . Where (1?. represents those dimensions that have been spht an extra. time. If we assume these splits have been distributed evenly among continuous and discrete dimensions. we have the following: II II dD-d. I ll dC'di C' [mi (511) [I (ID-d. ' (‘Dri — [dB-ta l SO, the HMBR of a node at. layer i has the following expected edge length on d” I d-‘lld d dimensions respectively: 122 ll SDH s- . 
: ___—- L DJ log) n ’ 21351751 II 5C H SCJ _ log 2 n, ' lcY—D +(Cl 513,- 7%?— 2 ’ 5 ”DJ _ ——1————{D H otherwise ’ og) n] 2(1)+(C"I (5.12) lotJr g9 7) <1 sC I ifmév< 5c H l()(T——)—- 71" D-i-(ICl s - — (“.2 otherwise Thus. for any node, the probability for a component of a. query vector to be covered by the corresponding component of the HMBR of that node for discrete and continuous dimensions. resI’)ectively. is given as follows: I I 30.7” BDJ — my I/ N SD-z' B . z “ D.2 A ’ I l, I (5.13) / 8027 Bar : I? , II II 5'02“ Bar 2 R ' Usitlg Equation 5.13. the probability for a node to be accessed by a query of range / distance It can be evaluated recursively as follows: h 5f mf Pf PM, = Z Z 2 f * Z 9 ’ (514) [C20 3:30 17127710 1)po 123 vvh ere so = max((). k — dMl~ sf = min(k. dm ). 7720 1nax(0, s — (1172.1)- 'mf = min(s, (1mg). p0 = max(0. k. — s — (1AM ) pf = min(k —— 8 (131-2). 1" (1'. ~ ~11: . . _1, d —s+m L_ . f : (('iii2)Bni212 (1 _ B1712)?” * (£13711) Buii21 (1 — 87711)9 171, - __ (1 d]\['2_p - (1‘1 dAIl—k-i—S-f-p k_ _, 9 — ( ‘lpmlBM‘z (1 '— 3M2)” * (k;3_1p)BM1 (1— Ban) 8 p- Thus. the expected number of node/ page accesses for performing a query with search range 7‘ can be estimated as: H—1 10 = 1+ 201,- * Pi.rl- (5.15) i=0 5.4 Experimental Results OUT [C—N N searching algorithm was implemented using a CND-tree, an ND-tree, an R*‘tFG-e, and a linear scan. For our experiments. the linear scan is considered twice: once VVith native non-ordered discrete dimensions and once with native continuous dimensions. Both the ND—tree and the R*-tree were modified to store hybrid data in their 1Eaves. This modification affects the shape of each of these trees but does not 124 incur a change in either of their insertion/split algorithms.3 All experiments were ran on a PC under OS Linux Ubuntu. The 1/0 block size was set at 4K for all trees. Two series of datasets were used. The first consists of 1M vectors with six native dimensions and a variable number of non-native dimensions (1 - 6). The second set has six native dimensions and three non-native dimensions and a variable number of vectors.4 Each experimental data reported here is the average over 100 random queries with A: = 10. 5.4.1 Effects of Heuristics and Datasets The first set of experiments was ccniducted to show the effects of the dimensionality and database (DB) size on the query performance. Figures 5.4 and 5.5 show this data. As both Figure 5.4 and 5.5 show. the CND-tree provides by far the most promising results in all tests.5 Due to this. the remainder of the experimental data considers only the CND-tree. It should be noted in Figure 5.5 that the ND-tree 3Each leaf node is structured to hold data with native and non-native dimensions. The non- native dimensions play no part in determining the composition of the covering hyper rectangle. However. the extra dimensional information requires more space for each object than what would be needed for native dimensional data only. While this extra space requirement does decreases the amount of objects that a leaf node can contain before overflowing, the extra information maintained negates the need to maintain the search constant d“ (Equation 5.1). However, because no changes have been made to the internal nodes of these trees, Equation 5.2 must. still be observed. 4This configuration is chosen. as it is likely the amount of non-native dimensional information will be significantly smaller than the amount of native dimensional information in most real world scenarios. 
5Note that for the linear scan results. “NatzD” indicates a scan of hybrid data with native non- ordered discrete dimensions and “NatzC” indicates a scan of hybrid data with native continuous dimensions. The datasets used for each of these experiments are identical to those used for the ND-t ree and R*-tree based searching. respectively. 1 ”...—n. _ ll'lu- 14000 l ,, * "c ’ i e r” ’" “‘ 12000 . - -x- -ND-tree +CND-tree m l —a—Ls (Nat=D) §1000O l +R‘-tree g l +1.3 (Nat=C) < 8000 i X .52 o l '5 6000 ‘ b a: ‘E l :5 4000 3 z s i 2000 g L l 0 a; 4 ' Number of Non-Native Dimensions Figure 5.4. Performance 1/0 with variable number of non—native dimensions appears to provide somewhat similar results to the CND-tree as the size of the database increases. Figure 5.6 shows that. when viewed independantly from other search results. the CND—tree still provides significant performance benefits over the ND-tree as the number of vectors in the database increases. The second set of experiments was conducted to show the effectiveness of our search heuristics compared to similar heuristics utilized in a. non-hybrid space (i.e. a continuous or discrete space). Figure 5.7 shows the I/O comparison of the first search stage when searching the CND-tree with and without the use of heuristic H 3. It can clearly be seen that using H3 decreases the number of I/O incurred in the first stage of searching over all dimensional combinations tested. Figure 5.6. Performance I/O with variable database size (CND-tree and ND-tree only) 14000 12000 : 10000 8000 6000 4000 Number of Disk Accesses 2000 - -><- -ND-tree —0- CND-tree —B~— LS (Nat=D) + R'-tree + LS (Nat=C) 400K 800K 1.2M 1.6M 2.0M Number of Vectors in Database Figure 5.5. Performance 1/0 with variable database size 450 400 ‘ 350 250 200 150 Number of Disk Accesses 100 - 50 300 « - -><- -ND-Tree _ ---------- X +CND-Tree , 400K 800K 1.2M 1.6M 2.0M Number of Vectors in Database T‘ifir - 600' 77 -- ,, ~ a v” v 500 f + CND-C with H3 f i + CND-D with H3 55". l 400 l ' <2 ~CND-C without H3 .‘ ‘ - a ~CND-D without H3 ,5 l Num of IIO co 0 o 200 i NN=1 NN=2 NN=3 NN=4 NN=5 NN=6 Num of Non-Native DIM Figure 5.7. Performance I/O comparing ordering methods 5.4.2 Performance Model Verification Our theoretical performance estimation model was verified against our synthetic experimental data. \Ne conducted experiments using data with (ID = 6. dc = 3. .4| 2 6. and R = 100. The maximum number of leaf node objects was set at 227 and the maximum number of non-leaf node objects was set to 127.6 We compared our theoretical estimates to the observed CND-tree performances for databases varying in size from 400K vectors to 2.0111 vectors in 400K increments. Figure 5.8 shows the estimated number of I / O predicted by our performance model 6As described in Qian et. al [43]. the values of 127 and 227 were chosen for similar search structures to maximize search performance. ...—___..— 30% - - — l -l- CND-tree 1 25% l +CND-tree 2 l - 5?" Performance Model 20% l 15% . Percentage of Disk Accesses 10% 1 “xx. 5% 0% L - — 400K 800K 1.2M 1.6M 2.0M Number of Vectors in Database Figure 5.8. Performance model comparison with variable database size 160% * 7 W ’ 1400/ V—i—VVCND-tree 1 ° ’ +CND-tree 2 :X - cx- -Performance Model 120% x 100% l Percentage of Disk Accesses 1 2 3 4 5 6 Number of Non-Native Dimensions Figure 5.9. Performance model comparison with variable number of non-native di- mensions 129 as well as the actual observed 1/0 as a percentage of the I/O incurred by a linear scan. 
Two lines are drawn for the observed I/O when searching the CND-tree. CND-tree 1 represents the percentage of I/O observed when the search is stopped as soon as stage 2 (overlap) has ended. CND-tree 2 represents the percentage of I/O observed when the search is stopped as soon as stage 3 (exhaustive) has ended. The Performance Model line represents the predicted I/O percentage when z = r − 1. As shown in Figure 5.8, our theoretical performance model does a quite accurate job of predicting the number of I/Os that would be incurred by using our algorithm on databases of varying sizes.

We also performed a comparison of our theoretical performance model and the observed I/O when varying the number of continuous dimensions present in the CND-tree. For this set of experiments, d_D = 6, |A| = 6, R = 100, and d_C varies from 1 to 6. The number of vectors in the database is 1M and the maximum numbers of leaf node and non-leaf node objects are again set at 227 and 127, respectively. Figure 5.9 shows the estimated number of I/Os predicted by our performance model as well as the actual observed I/O of these experiments, again as a percentage of the I/O incurred by a linear scan. Again, two lines are drawn for the CND-tree to represent the number of I/Os at the end of the second and third search stages. As shown in Figure 5.9, our performance model does a very good job of predicting the number of I/Os for performing a k-NN search when the number of continuous dimensions is less than or equal to the number of discrete dimensions. However, as the number of continuous dimensions grows, the observed CND-tree results begin to outperform the theoretical results predicted by our performance model. We believe this phenomenon may be related to the discretizing of the continuous dimensions by Equation 2.4.

Footnote 7: The exhaustive stage occurs when searching nodes R whose bounding box is the same distance from the query point as the final search range. This stage introduces no closer neighbors than what have already been found, but may be required if multiple sets of k objects form valid solution sets, typical of k-NN searches in discrete databases as discussed in [32, 33].

CHAPTER 6

Conclusion

Similarity searches in NDDSs and HDSs are becoming increasingly important in application areas such as bioinformatics, biometrics, E-commerce and data mining. Unfortunately, the prevalent searching techniques based on multidimensional indexes such as the R-tree and the K-D-B tree in CDSs cannot be directly applied to either NDDSs or HDSs. On the other hand, recent work [42, 43, 44, 41, 15] on similarity searching in NDDSs and HDSs focused on range queries. Nearest neighbor searches were not developed to the same extent. In particular, k-nearest neighbor searching is still an open issue. We observe that the issue of k-NN searching in NDDSs and HDSs is not a simple extension of its counterpart in CDSs.

A major problem with a k-NN search in NDDSs using the conventional Hamming distance is the non-determinism of its solution. That is, there usually is a large number of candidate solutions available. This is mainly caused by the coarse granularity of measurement offered by the Hamming distance. To tackle this problem,
it is proven that the GEH distance possesses the triangular property and therefore may be used in index based pruning heuristics. To support efficient Af-NN searches in NDDSs. we propose a searching algorithm utilizing the ND—tree [42, 43]. Based on the characteristics of NDDSS. three effec- tive searching heuristics are incorporated into the algorithm. A fourth heuristic is provided that implements a new strategy for probability based search ordering in conservative search scenarios. Further. we provide a. performance model to predict the number of I/O incurred during a k-NN search using our algorithm that is based upon the number of neighbors desired and the din'iensionality and alphabet size of the dat aset. Our extensive ex1,)eriments demonstrate that our GEH distance measure provides an effective semantic discriminating power among the vectors to mitigate the non— determinism for k-NN searches in NDDSs. Experiments also show that the lit-NN searching algorithm is efficient in finding k—NNs in NDDSs. compared to the linear scan method. The algorithm is scalable with respect to the database size and also 133 WT performs well over non-uniform data distributions. However. when the number of dimensions is high. our algoritlnn seems to suffer the same dimensionality curse problem as the similar techniques in continuous data spaces. \V'e have demonstrated that k-NN searches in HDSs greatly benefit from the use of a multidimensional index t'lt‘weloped for such a space. As discussed in Chapter 5. the use of HDS based index eliminates the need for maintaining costly search constants to guarantee correct results. Further. our experimental results confirm that the use of a. HDS index vastly outperforms searches utilizing non-hyln'id space indexes. sometimes by a factor of 10 to 1. as shown in Section 5.4. \\'e have also shown that the use of our newly introduced searching heuristic provides excellent benefits in reducing the I/O costs associated with the first stage of If-NN searcl‘iii'tg. Our experimental results in Section 5.4 demonstrate a performance benefit of almost 33%. Additionally. we have presented a theoretical 1')erformance model that accurately predicts the I/O cost of performing a k—NN search using our algoritlnn presented in Section 5.3 Our future work involves the study of the underlying clnu‘acteristics of NDDSs and HDSs that may be applied to optimizing data index structures for such spaces. Similar to [52]. such a study could provide an optimization in index structure con- struction by determining the relationship between the dimensionality of a dataset and the estimated performance for data retrieval. Additionally. search performance in 134 ...“.l' 1'... ' ,P NDDSS and HDSs are also affected by the cardinality of the alphabet of the dataset.. \Vliile much work been reported on understanding these relationships and optimiz- ing an index structure through that kn(,)wledge for CDSs. it remains an open issue for NDDSS and HDSs. Atilditionally. we will continue to investigate the underlying characteristics of NDDSs and HDSs that can be used in future search heuristics. Fi- nally. our theoretical performance model assumes a uniform node splitting policy of the underlying index structure. We would like to expand upon this to accommodate more potential node split policies. APPENDIX A Intrinsic Dimensionality in Non-Ordered Discrete Data Spaces In this appendix. we present a discussion of the effects of intrinsic dimensionality in NDDSs. 
We first provide an overview of the concepts of this topic, followed by a discussion of the differences in NDDS and CDS dataset distribution. We then discuss the effect of intrinsic dimensionality on search performance when using the GEH distance. This appendix provides additional rationale for the development of the rank based GEH implementation from Chapter 4.

A.1 Overview

In both NDDSs and CDSs, designers of search techniques are hampered by the curse of dimensionality [32, 52]. As discussed by Chavez et al. [14], traditional indexing techniques for vector spaces (e.g., the kd-tree and R*-tree) have an exponential dependency on the representational dimension of the data space. Many recent indexing techniques attempt to avoid the problems associated with this relationship by removing the representational dimension of the space. A common technique is to translate the vector space data to a generic metric space. Unfortunately, these techniques are unable to completely resolve the curse of dimensionality due to the existence of the intrinsic dimensionality of the dataset. Chavez et al. [14] defined the intrinsic dimensionality ρ of a dataset by the quotient of the mean μ and variance σ² of a histogram of distance values between objects in the dataset, as shown below:

ρ = μ² / (2σ²).    (A.1)

Equation A.1 indicates that a dataset's intrinsic dimensionality grows in line with the mean and inversely with the variance of distance values. The result of this equation may be used to indicate the performance potential of searching within a dataset. As shown in [14], the potential performance for searching a dataset is inversely proportional to the intrinsic dimensionality of the dataset.
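As a small illustration of Equation A.1, the sketch below (Python; the dataset and distance function are placeholders) estimates ρ from a sample of pairwise distances. Sampling is used only to keep the example short; the definition itself is over the full histogram of pairwise distances.

    import random

    def hamming(u, v):
        """Hamming distance between two equal-length discrete vectors."""
        return sum(1 for a, b in zip(u, v) if a != b)

    def intrinsic_dimensionality(vectors, dist=hamming, samples=10000, seed=0):
        """Estimate rho = mu^2 / (2 * sigma^2) of Equation A.1 from sampled pairs."""
        rng = random.Random(seed)
        dists = []
        for _ in range(samples):
            u, v = rng.sample(vectors, 2)
            dists.append(dist(u, v))
        n = len(dists)
        mu = sum(dists) / n
        var = sum((x - mu) ** 2 for x in dists) / n
        return mu * mu / (2 * var) if var > 0 else float("inf")

A larger ρ (high mean, low spread of distances) signals a dataset in which distance values discriminate poorly between near and far objects, which is the situation the following histograms illustrate for the Hamming distance.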
Figure A.1. Histogram of Hamming distance values (number of occurrences versus Hamming distance) for 10-dimensional datasets of 10K vectors with alphabet sizes 6, 8, 10, and 12.

A.2 Distribution of NDDS Datasets

The histogram of distances between points in either a CDS or a generic metric space will usually result in a semi-continuous curve. As shown in [14], the mean for such histograms in either space is likely to trend equally toward either end of the available range. For an NDDS, this no longer holds. Consider Figure A.1, which shows the histogram of Hamming distance values for a 10-dimensional NDDS of 10K vectors with variable alphabet sizes. We notice three points. First, the mean for each dataset in Figure A.1 appears to be highly dependent upon the cardinality of the alphabet set, such that as the alphabet size grows larger, so does the mean of the distance values. Second, the distance values between points appear to clump disproportionately toward the high end of the distance range. Third, the possible distance values between points are restricted due to the use of the Hamming distance.

Each of these points agrees with our understanding of NDDSs discussed in Chapters 3 and 4. In particular, we note that in an NDDS there is no defined center of the data space. Thus, available distance metrics, such as Hamming or GEH, have difficulty considering point distance relationships. As shown in [33], the number of points likely to exist within an integer distance z of another point grows exponentially with the alphabet size (this value was described as a hyper-spherical area in [33]), such that:

    Area(z) = Σ_{i=0}^{z} C(d, i) (|A| − 1)^i,    (A.2)

where C(d, i) denotes the binomial coefficient. Additionally, in [33] we showed that the probability of a point existing within distance z of another point in an NDDS increases dramatically as z increases:

    P_dist(z) = [ Σ_{i=0}^{z} C(d, i) (|A| − 1)^i ] / |A|^d.    (A.3)

Equation A.2 explains the dependence between the mean of the distance values and the alphabet size of the dataset. Equation A.3 explains the disproportionate amount of distance values between points in an NDDS close to d (as seen in Figure A.1). The third point is easily explained by the discussion of the non-granularity of the Hamming distance in Section 4.2.
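To make Equations A.2 and A.3 concrete, the sketch below counts the points of a d-dimensional NDDS with alphabet size a that lie within Hamming distance z of a fixed vector, and the corresponding probability for a uniformly random vector. The function names are chosen for this example only.

    from math import comb

    def hyper_spherical_area(d, a, z):
        # Equation A.2: number of vectors within Hamming distance z of a
        # fixed vector in a d-dimensional space over an alphabet of size a.
        return sum(comb(d, i) * (a - 1) ** i for i in range(z + 1))

    def p_dist(d, a, z):
        # Equation A.3: probability that a uniformly random vector lies
        # within Hamming distance z of a fixed vector.
        return hyper_spherical_area(d, a, z) / a ** d

    # For d = 10 and a = 10, most of the space lies near distance d:
    # p_dist(10, 10, 8) is roughly 0.26, while p_dist(10, 10, 9) is about 0.65.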
When we again consider the distances between points in an NDDS, but use the GEH distance metric presented in Section 4.2 instead of the Hamming distance metric, we obtain the histogram shown in Figure A.2 (we have used a scatter plot instead of a column plot to help illustrate the differences with Figure A.1 more clearly). We notice that the restriction on distance values has been greatly reduced. However, we still do not have a continuous curve as would be expected in a CDS or generic metric space. Instead, we see local peaks between each integer distance value, each with its own local mean and variance.

A.3 Distribution Effects on Search Performance

We define the local variance for an integer i as the variance of the distance values falling in [i, i + 1). A local mean may be defined in the same manner. To account for these local values, we modify the performance indicator given earlier, distinguishing a global term ρ_0 from an averaged local term ρ_1:

    ρ_0 = μ² / (2σ²),    ρ_1 = (1/d) Σ_{i=0}^{d−1} μ_i² / (2σ_i²),    (A.4)

where μ_i and σ_i² are the local mean and variance for the interval [i, i + 1).

Figure A.2. Histogram of GEH distance values (number of occurrences versus GEH distance) for various alphabet sizes.

The value of ρ_1 in Equation A.4 captures the expected performance between each integer value. This value may be interpreted as the average performance indicator based on the normalized local means and variances of a dataset. When using an extension to the Hamming distance, such as GEH, this value gives an indication of the likely number of pathways that may be pruned during the refinement stages of a k-NN search. However, because one of the goals when extending the Hamming distance is to retain its original semantics, the value of ρ_1 is less dominant in determining the overall performance measure than the value of ρ_0. This is because the effects of a Hamming extension typically only become a factor when comparing distances in the same integer range.

To help illustrate this point, consider a random query q for a 10-dimensional dataset U. In this case, Figure A.2 can represent the distribution of distances between q and a possible pivot point p ∈ U (in our case, p is a DMBR), for various alphabet sizes. Assuming our distance metric D() maintains the triangular inequality property, we can eliminate from our search any point u such that D(p, u) ∉ [D(p, q) − r, D(p, q) + r], where r is the current search radius [14]. As the variance of distances between integer values increases, more points (search paths) may be discarded when searching within that range. However, this increase in variance has no effect on the number of points that may be pruned outside the current integer range.

A.4 Experimental Results

We have compared the distance distribution histograms of the ranking-based GEH distance implementation from Section 4.2 with those of the frequency-based GEH distance implementation from [32] over datasets of various zipf distributions. In these instances, all data are from 10-dimensional datasets of 10K vectors with |A| = 10.

Figure A.3. GEH_Rank distance histograms, zipf = [0.0 – 1.5].

Figure A.4. GEH_Freq distance histograms, zipf = [0.0 – 1.5].

Comparing Figures A.4 and A.3, we see that the ranking-based implementation shows large improvements, in terms of distribution characteristics, over the frequency-based distances as the zipf level increases. This indicates that as the underlying dataset becomes more skewed, the performance benefits of using the ranking-based implementation of the GEH distance over the frequency-based implementation will become greater as well. This agrees with the search performance results from Chapter 4.
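The preceding sections lean on the way an extended Hamming distance such as GEH spreads distance values between integer boundaries. The following minimal sketch shows one such distance; it is not the exact GEH definition from Chapter 4. It assumes that each matching dimension contributes an adjustment in [0, 1) scaled by 1/d, so the integer semantics of the Hamming distance are preserved, and it includes one illustrative, rank-style way of assigning those adjustments.

    def extended_hamming(u, v, adj):
        # Hamming distance plus a fractional penalty for each matching
        # dimension.  adj(value) returns an adjustment in [0, 1); dividing
        # by d keeps the total added for all matches below 1, so the
        # integer part of the result is still the Hamming distance.
        d = len(u)
        dist = 0.0
        for a, b in zip(u, v):
            dist += 1.0 if a != b else adj(a) / d
        return dist

    def make_rank_adjustment(frequencies):
        # One possible adjustment scheme: rank values by how often they
        # occur, so matches on rare values are treated as slightly closer.
        ordered = sorted(frequencies, key=frequencies.get)     # rarest first
        rank = {value: i for i, value in enumerate(ordered)}
        n = max(len(ordered), 1)
        return lambda value: rank.get(value, n - 1) / n        # in [0, 1)

Whether rarer matches should receive smaller or larger adjustments, and how ranks or frequencies are obtained, are application choices; the point here is only that each per-dimension adjustment stays strictly below one.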
APPENDIX B

Triangular Property of GEH - Extended Proof

To maintain the inherent mathematical correctness associated with building and searching index trees in NDDSs, our newly introduced GEH distance must maintain the triangular property. In Chapter 4, we presented a short proof of this property. In this appendix, we provide a second proof in extended form.

Definition 4. Triangular Property for Vectors: For any three vectors V_A, V_B, and V_C, the addition of the distances between any two pairs is greater than or equal to the distance between the third pair, namely:

    D(V_A, V_B) + D(V_B, V_C) ≥ D(V_A, V_C).    (B.1)

The long-form proof that the GEH distance maintains this inequality is handled in two steps: first, a base case is established where the property holds; second, the base case is evolved into the generic case.

Step 1: Consider three d-dimensional vectors V_A, V_B, and V_C. Assume that D(V_A, V_C) is a maximal value; thus every element in V_A and V_C must be different, or:

    D(V_A, V_C) = d.

Next assume D(V_A, V_B) is a minimal value, where every element in V_B equals the corresponding element in V_A, i.e., V_A = V_B. Using the GEH distance yields:

    D(V_A, V_B) = x, where 0 < x < 1.

The term x represents the adjustment values obtained using some method defined by an application expert (see Chapter 4). Because V_A = V_B, the distance between V_B and V_C is the following:

    D(V_B, V_C) = d.

Thus we have the following inequality:

    D(V_A, V_B) + D(V_B, V_C) ≥ D(V_A, V_C)  ⟹  x + d ≥ d.

Step 2: The second step is divided into three sub-steps, one for each vector V_A, V_B, and V_C.

Step 2.1: First, we evolve V_B into a generic vector. From Step 1, we have the following distance values:

    D(V_A, V_B) = x
    D(V_B, V_C) = d
    D(V_A, V_C) = d,

where 0 < x < 1. To make V_B generic, we apply k changes, where each change represents switching an element in V_B away from its original value. After this has been done, we are left with the following distances:

    D(V_A, V_B) = x + k − c_1
    D(V_B, V_C) = d − k_c + c_2
    D(V_A, V_C) = d.

Here, c_1 represents the culmination of the adjustment values from each of the k elements switched, k_c represents the number of switched elements that now equal their corresponding element in V_C, and c_2 represents the culmination of the adjustment values to be added due to these newly matching values. Because k ≥ k_c, k ≥ c_1, and c_2 ≥ 0, the GEH distance between V_A, V_B, and V_C still maintains the inequality from Definition 4.

Step 2.2: Next, we evolve V_C into a generic vector. We start with the final vectors from Step 2.1 and apply j changes to V_C. We now have the following distance values:

    D(V_A, V_B) = x + k − c_1
    D(V_B, V_C) = d − k_c + c_2 + (j_1 − c_3) − (j_2 − c_4)
    D(V_A, V_C) = d − j_3 + c_5.

Here, j_1 and j_2 represent the integer values that D(V_B, V_C) increases and decreases by, respectively, as elements are switched; c_3 and c_4 represent the adjustment values due to those changes; and j_3 and c_5 represent the integer and adjustment changes to D(V_A, V_C) due to element changes. It is important to note that every time j_2 is incremented there are two possibilities: either the value being switched in V_C becomes a value in V_B that still matches V_A, in which case j_3 is incremented by one and both c_4 and c_5 are incremented by the same amount, or it becomes a value in V_B that does not match V_A, which means that k ≥ k_c + 1. This leaves us to note that j_2 + j_3 ≤ j_1 + k and that c_4 ≥ c_5. Finally, with j_1 ≥ c_3, it can be shown that these distance measures still maintain the triangular inequality property from Definition 4.

Step 2.3: Finally, we evolve V_A into a generic vector. Because of our initial conditions, this is actually a trivial step. Since V_A is only defined in relation to the original vectors V_C and V_B, and since V_C and V_B can be manipulated into any general vectors from their starting points, we can start V_A as any vector we wish. Thus V_A is a generic vector and the triangular inequality holds true for any three vectors V_A, V_B, and V_C.

APPENDIX C

MinMaxDistance Discussion

Much of the work presented in this dissertation has focused upon improving search performance in terms of reducing the number of I/Os required to perform a search. In this appendix, we prove that many search algorithms may be improved in computational performance by removing the MINMAXDIST heuristic while suffering no loss of I/O performance.

C.1 Overview

For the purposes of this discussion we define the MINDIST for a tree node N and a query point q as follows:

    MINDIST_N(q) = min_{N_S} D(q, N_S),    (C.1)

where N_S represents a subtree/object of N and D is a valid distance metric for the data space being used. We define the MINMAXDIST in a similar manner:

    MINMAXDIST_N(q) = min_{N_S} ( max D(q, N_S) ).    (C.2)

Using Equations C.1 and C.2, the MINDIST pruning, MINMAXDIST range reduction, and MINDIST ordering heuristics, hereafter referred to as H1, H2, and H3 respectively, are as follows:

H1: For all subtrees N_S of N, remove/prune any subtree whose MINDIST value to q is greater than the current search range.

H2: If the MINMAXDIST of a subtree N_S of node N is less than the current search range, reduce the current search range such that it equals the value of the MINMAXDIST of that subtree.

H3: Order those subtrees N_S not pruned by heuristic H1 in increasing order of their MINDIST value to q.

For the remainder of this discussion we will hold the following assumptions to be true:

Assumption 1. MINDIST node ordering is being used.

Assumption 2. MINDIST node pruning is being used.

Assumption 3. A depth-first search strategy is being employed.

A schematic traversal that follows these assumptions is sketched below.
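The sketch is not the search algorithm developed in this dissertation; it is a generic depth-first k-NN traversal written against an assumed node interface (children, objects) and assumed callbacks mindist and dist, and it applies only H3 ordering and H1 pruning, in line with Assumptions 1 through 3.

    import heapq, itertools

    def knn_depth_first(root, q, k, mindist, dist):
        # best is a max-heap (by negated distance) of the k candidates found
        # so far; its top defines the current search range r.
        best = []
        seq = itertools.count()   # tie-breaker so heap entries never compare objects

        def search_range():
            return -best[0][0] if len(best) == k else float("inf")

        def visit(node):
            if node.children is None:                    # leaf node
                for obj in node.objects:
                    d = dist(q, obj)
                    if d < search_range():
                        heapq.heappush(best, (-d, next(seq), obj))
                        if len(best) > k:
                            heapq.heappop(best)
            else:                                        # non-leaf node
                # H3: consider subtrees in increasing MINDIST order.
                for child in sorted(node.children, key=lambda c: mindist(q, c)):
                    # H1: prune subtrees whose MINDIST exceeds the range.
                    if mindist(q, child) <= search_range():
                        visit(child)

        visit(root)
        results = [(-neg, obj) for neg, _, obj in best]
        results.sort(key=lambda pair: pair[0])
        return results

Note that the search range shrinks only as actual objects are found; adding an H2-style range reduction at each non-leaf node is exactly the step that the proof below argues provides no I/O benefit under these assumptions.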
C.2 Proof

Consider a d-dimensional non-leaf node N that represents the local root for a branch in an index tree. The search range of the k-Nearest Neighbor search is represented by r. If heuristic H2 is employed, the search range is updated by the MINMAXDIST values of the subtrees of N as follows:

    r = min(r_0, MINMAXDIST_N(q)),    (C.3)

where r_0 acts as a placeholder for the search range before it is updated. For future use we will label the minimal MINMAXDIST value of the subtrees of node N as D_MM. The subtrees of N can be categorized into three groups, N_S1, N_S2, and N_S3, where the following holds true:

    0 ≤ MINDIST(N_S1) ≤ r
    r < MINDIST(N_S2) ≤ r_0    (C.4)
    r_0 < MINDIST(N_S3) ≤ D_MAX,

where D_MAX represents the maximum distance value possible between a subtree/object and a query point for the data space being used.

Lemma C.2.1. Of the three categories of subtrees, a k-Nearest Neighbor search will access these groups in the following order: N_S1 first, N_S2 second, and N_S3 third.

Proof. This is a result of Assumption 1.

Lemma C.2.2. The subtree of N with the minimum MINMAXDIST value is contained in the subtree group N_S1.

Proof. Equation C.3 indicates that this particular subtree will be used to set the value of r when heuristic H2 is employed. This subtree will be in N_S1 due to its MINDIST value being less than or equal to its MINMAXDIST value.

Lemma C.2.3. If heuristic H2 is employed, heuristic H1 will prune subtree groups N_S2 and N_S3, due to r = D_MM.

Proof. Heuristic H2 and Equation C.3 indicate that the updated search range will be equal to r. Equation C.4 classifies subtree groups N_S2 and N_S3 as having a MINDIST value greater than r; they will thus be pruned by heuristic H1.

Lemma C.2.4. If heuristic H2 is not employed, heuristic H1 will prune subtree group N_S3.

Proof. Similar to the preceding proof, we are only guaranteed that the current search range will be equal to r_0. Equation C.4 only classifies subtree group N_S3 as having a MINDIST value greater than r_0, and it is thus the only group guaranteed to be pruned by heuristic H1.

Lemma C.2.5. If heuristic H2 is not employed, the value of r will be less than or equal to D_MM before visiting any sub-nodes in group N_S2.

Proof. Due to Assumptions 1 and 3, the search algorithm will visit the subtrees of group N_S1 before returning to the current node N and considering subtrees in the groups N_S2 and N_S3. According to Lemma C.2.2, the subtree with MINMAXDIST value D_MM is contained in subtree group N_S1. Thus the search is guaranteed to visit an object with a distance value less than or equal to D_MM before returning to N.

Lemma C.2.6. Heuristic H2 provides no I/O benefits when Assumptions 1 through 3 are true.

Proof. Lemma C.2.5 indicates that the value of r will be less than or equal to D_MM before the search algorithm considers visiting any subtrees from group N_S2. Thus heuristic H1 will prune these subtrees before they are visited regardless of whether heuristic H2 is employed.
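As a concrete illustration of Lemma C.2.6, the sketch below tallies which subtrees of a single node are visited with and without the H2 range reduction. The subtree tuples and distance values are invented for the example, and each visit is assumed to immediately surface that subtree's best object, as Lemma C.2.5 guarantees for group N_S1.

    def visited_subtrees(subtrees, r0, use_h2):
        # subtrees: (name, mindist, minmaxdist, best_object_distance) tuples.
        # Applies H3 (MINDIST ordering), H1 (MINDIST pruning) and, when
        # use_h2 is True, the H2 range reduction of Equation C.3.
        r = r0
        if use_h2:
            r = min(r, min(mm for _, _, mm, _ in subtrees))
        visited = []
        for name, md, _, best in sorted(subtrees, key=lambda s: s[1]):
            if md > r:
                continue                 # pruned by H1
            visited.append(name)
            r = min(r, best)             # range shrinks as objects are found
        return visited

    groups = [("NS1", 1, 3, 2), ("NS2", 4, 9, 5), ("NS3", 8, 9, 9)]
    print(visited_subtrees(groups, r0=6, use_h2=True))    # ['NS1']
    print(visited_subtrees(groups, r0=6, use_h2=False))   # ['NS1']

Both runs touch the same subtrees, matching the claim that H2 can reduce computation only, not I/O, once MINDIST ordering, MINDIST pruning, and a depth-first traversal are in place.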
BIBLIOGRAPHY

[1] Y. A. Aslandogan and C. T. Yu. Techniques and systems for image and video retrieval. IEEE TKDE, 11:50–63, 1999.

[2] Ricardo A. Baeza-Yates. Searching: An algorithmic tour. In Encyclopedia of Computer Science and Technology, Vol. 37, pages 331–359, New York, New York, 1997. CRC Press.

[3] Ricardo A. Baeza-Yates, Walter Cunto, Udi Manber, and Sun Wu. Proximity matching using fixed-queries trees. In CPM '94: Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching, pages 198–212, London, UK, 1994. Springer-Verlag.

[4] L. Baoli, L. Qin, and Y. Shiwen. An adaptive k-nearest neighbor text categorization strategy. ACM Transactions on Asian Language Information Processing, 13:215–226, 2004.

[5] R. Bayer and K. Unterauer. Prefix B-trees. ACM Transactions on Database Systems, 2:11–26, 1977.

[6] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. The R*-tree: An efficient and robust access method for points and rectangles. In Hector Garcia-Molina and H. V. Jagadish, editors, Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, Atlantic City, NJ, May 23-25, 1990, pages 322–331, Atlantic City, NJ, U.S.A., 1990. ACM Press.

[7] J. L. Bentley. Multidimensional binary search trees in database applications. IEEE Trans. Softw. Eng., 5(4):333–340, 1979.

[8] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, 18(9):509–517, 1975.

[9] Jon Louis Bentley and Jerome H. Friedman. Data structures for range searching. ACM Comput. Surv., 11(4):397–409, 1979.

[10] Stefan Berchtold, Daniel A. Keim, and Hans-Peter Kriegel. The X-tree: An index structure for high-dimensional data. In T. M. Vijayaraman, Alejandro P. Buchmann, C. Mohan, and Nandlal L. Sarda, editors, Proceedings of the 22nd International Conference on Very Large Databases, pages 28–39, San Francisco, U.S.A., 1996. Morgan Kaufmann Publishers.

[11] Sergey Brin. Near neighbor search in large metric spaces. In VLDB '95: Proceedings of the 21st International Conference on Very Large Data Bases, pages 574–584, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.

[12] W. A. Burkhard and R. M. Keller. Some approaches to best-match file searching. Commun. ACM, 16(4):230–236, 1973.

[13] J. Catlett. On changing continuous attributes into ordered discrete attributes. In Proceedings of the European Working Session on Machine Learning, pages 164–178, 1991.

[14] Edgar Chavez, Gonzalo Navarro, Ricardo Baeza-Yates, and Jose Luis Marroquin. Searching in metric spaces. ACM Computing Surveys, 33(3):273–321, 2001.

[15] Changqing Chen, Sakti Pramanik, Qiang Zhu, Alok Watve, and Gang Qian. The C-ND tree: A multidimensional index for hybrid continuous and non-ordered discrete data spaces. In Proceedings of EDBT, 2009.

[16] Paolo Ciaccia, Marco Patella, and Pavel Zezula. M-tree: An efficient access method for similarity search in metric spaces. In VLDB '97: Proceedings of the 23rd International Conference on Very Large Data Bases, pages 426–435, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.

[17] Kenneth L. Clarkson. Nearest neighbor queries in metric spaces. In STOC '97: Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, pages 609–617, New York, NY, USA, 1997. ACM.

[18] J. Clement, P. Flajolet, and B. Vallee. Dynamic sources in information theory: A general analysis of trie structures. Algorithmica, 29, 2001.

[19] P. Ferragina and R. Grossi. The string B-tree: A new data structure for string search in external memory and its applications. Journal of the ACM, 46:236–280, 1999.

[20] A. Freitas. A Survey of Evolutionary Algorithms for Data Mining and Knowledge Discovery. ACM, 2003.

[21] Volker Gaede and Oliver Gunther. Multidimensional access methods. ACM Computing Surveys, 30:170–231, 1998.
[22] Antonin Guttman. R-trees: A dynamic index structure for spatial searching. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.

[23] R. Hamming. Error-detecting and error-correcting codes. Bell System Technical Journal, 29(2):147–160, 1950.

[24] A. Henrich, H. W. Six, and P. Widmayer. The LSD tree: Spatial access to multidimensional and non-point objects. In VLDB '89: Proceedings of the 15th International Conference on Very Large Data Bases, pages 45–53, San Francisco, CA, USA, 1989. Morgan Kaufmann Publishers Inc.

[25] Andreas Henrich. The LSDh-tree: An access structure for feature vectors. In ICDE '98: Proceedings of the Fourteenth International Conference on Data Engineering, pages 362–369, Washington, DC, USA, 1998. IEEE Computer Society.

[26] G. Hjaltason and H. Samet. Incremental similarity search in multimedia databases, 2000.

[27] G. Hjaltason and H. Samet. Index-driven similarity search in metric spaces. ACM Transactions on Database Systems, 28:517–580, 2003.

[28] Gisli R. Hjaltason and Hanan Samet. Ranking in spatial databases. In SSD '95: Proceedings of the 4th International Symposium on Advances in Spatial Databases, pages 83–95, London, UK, 1995. Springer-Verlag.

[29] W. J. Kent. BLAT - the BLAST-like alignment tool. Genome Research, 12(4):656–664, April 2002.

[30] D. E. Knuth. The Art of Computer Programming, Vol. 3. Addison-Wesley, Reading, MA, USA, 1973.

[31] Mohammad Kolahdouzan and Cyrus Shahabi. Voronoi-based k nearest neighbor search for spatial network databases. In VLDB '04: Proceedings of the Thirtieth International Conference on Very Large Data Bases, pages 840–851, Toronto, Canada, 2004. VLDB Endowment.

[32] Dashiell Kolbe, Qiang Zhu, and Sakti Pramanik. On k-nearest neighbor searching in non-ordered discrete data spaces. In ICDE, pages 426–435, Istanbul, Turkey, 2007. IEEE.

[33] Dashiell Kolbe, Qiang Zhu, and Sakti Pramanik. Efficient k-nearest neighbor searching in non-ordered discrete data spaces. ACM Transactions on Information Systems, 28, 2010.

[34] Dashiell Kolbe, Qiang Zhu, and Sakti Pramanik. Reducing non-determinism of k-NN searching in non-ordered discrete data spaces. Information Processing Letters, 2010.

[35] Flip Korn, Nikolaos Sidiropoulos, Christos Faloutsos, Eliot Siegel, and Zenon Protopapas. Fast nearest neighbor search in medical image databases. In VLDB '96: Proceedings of the 22nd International Conference on Very Large Data Bases, pages 215–226, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc.

[36] O. W. Kwon and J. H. Lee. Web page classification based on k-nearest neighbor approach. In Proceedings of the 5th International Workshop on Information Retrieval with Asian Languages, 2000.

[37] F. Lewis, Gareth J. Hughes, Andrew Rambaut, Anton Pozniak, and Andrew J. Leigh Brown. Episodic sexual transmission of HIV revealed by molecular phylodynamics. PLoS Medicine, 5(3), 2008.

[38] Jinhua Li. Efficient Similarity Search Based on Data Distribution Properties in High Dimension. PhD thesis, Michigan State University, East Lansing, Michigan, United States, 2001.

[39] A. Macskassy, H. Hirsh, A. Banerjee, and A. Dayanik. Converting numerical classification into text classification. Artificial Intelligence, 143(1):51–77, 2003.

[40] Gonzalo Navarro and Ricardo Baeza-Yates. Searching in metric spaces. ACM Computing Surveys, 33:273–321, 2001.

[41] Gang Qian. Principles and applications for supporting similarity queries in non-ordered-discrete and continuous data spaces. PhD thesis, Michigan State University, East Lansing, Michigan, United States, 2004.
[42] Gang Qian, Qiang Zhu, Qiang Xue, and Sakti Pramanik. The ND-tree: A dynamic indexing technique for multidimensional non-ordered discrete data spaces. In VLDB '03: Proceedings of the 29th International Conference on Very Large Data Bases, pages 620–631, Berlin, Germany, 2003. VLDB Endowment.

[43] Gang Qian, Qiang Zhu, Qiang Xue, and Sakti Pramanik. Dynamic indexing for multidimensional non-ordered discrete data spaces using a data-partitioning approach. ACM Trans. Database Syst., 31(2):439–484, 2006.

[44] Gang Qian, Qiang Zhu, Qiang Xue, and Sakti Pramanik. A space-partitioning-based indexing method for multidimensional non-ordered discrete data spaces. ACM Trans. Inf. Syst., 24(1):79–110, 2006.

[45] E. Riloff and L. Hollaar. Text databases and information retrieval. ACM Computing Surveys, 28, 1996.

[46] John T. Robinson. The K-D-B-tree: A search structure for large multidimensional dynamic indexes. In SIGMOD '81: Proceedings of the 1981 ACM SIGMOD International Conference on Management of Data, pages 10–18, New York, NY, USA, 1981. ACM.

[47] Nick Roussopoulos, Stephen Kelley, and Frederic Vincent. Nearest neighbor queries. In Michael J. Carey and Donovan A. Schneider, editors, Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, California, May 22-25, 1995, pages 71–79, San Jose, California, U.S.A., 1995. ACM Press.

[48] Nick Roussopoulos and Daniel Leifker. Direct spatial search on pictorial databases using packed R-trees. SIGMOD Rec., 14(4):17–31, 1985.

[49] Y. Rui, T. S. Huang, and S. Chang. Image retrieval: Current techniques, promising directions, and open issues. J. Visual Communication and Image Representation, 10:39–62, 1999.

[50] Thomas Seidl and Hans-Peter Kriegel. Optimal multi-step k-nearest neighbor search. SIGMOD Rec., 27(2):154–165, 1998.

[51] J. Uhlmann. Implementing metric trees to satisfy general proximity/similarity queries, 1991.

[52] Roger Weber, Hans-Jorg Schek, and Stephen Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB '98: Proceedings of the 24th International Conference on Very Large Data Bases, pages 194–205, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.

[53] D. White and R. Jain. Algorithms and strategies for similarity retrieval, 1996.

[54] Q. Xue, G. Qian, J. R. Cole, and S. Pramanik. Investigation on approximate q-gram matching in genome sequence databases, 2004.

[55] Peter N. Yianilos. Locally lifting the curse of dimensionality for nearest neighbor search (extended abstract). In SODA '00: Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 361–370, Philadelphia, PA, USA, 2000. Society for Industrial and Applied Mathematics.