This is to certify that the dissertation entitled "Efficient Similarity Search Based on Data Distribution Properties in High Dimensions" presented by Jinhua Li has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science. Major professor. Date. MSU is an Affirmative Action/Equal Opportunity Institution.

EFFICIENT SIMILARITY SEARCH BASED ON DATA DISTRIBUTION PROPERTIES IN HIGH DIMENSIONS

By Jinhua Li

A DISSERTATION submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY, Department of Computer Science and Engineering, 2001.

ABSTRACT

EFFICIENT SIMILARITY SEARCH BASED ON DATA DISTRIBUTION PROPERTIES IN HIGH DIMENSIONS

By Jinhua Li

The characteristics of data distribution in high dimensional Euclidean space play a fundamental role in many areas of computer science, including database systems, pattern recognition, and multimedia. We have studied and formalized a few data distribution properties in high dimensional Euclidean space, such as the distance properties and the angle property. We have exploited these basic properties in developing efficient applications, such as nearest neighbor queries in high dimensions. Nearest neighbor search is a fundamental task in many applications. At present, state-of-the-art approaches to nearest neighbor search are not efficient in high dimensions. In this dissertation we present an efficient Angle based Balanced index structure called the AB-tree, which uses heuristics to decide whether or not to access a node in the index tree based on the estimated angle and the weight of the node. Extensive experiments on synthetic data as well as real data demonstrate that the AB-tree outperforms other index structures such as the SS-tree by orders of magnitude while maintaining 90% true nearest neighbors.

To Hui and Draco.

ACKNOWLEDGEMENTS

The author wishes to thank Dr. Sakti Pramanik for his great advisorship and management. I am grateful for the many discussions and invaluable comments he provided. I would like to thank my guidance committee members, Dr. George C. Stockman, Dr. John J. Forsyth, and Dr. Ramanathapuram V. Ramamoorthi, for their help and guidance.

TABLE OF CONTENTS

List of Tables viii
List of Figures ix
Introduction 1
Chapter 1: Related Work 6
1.1 One-dimensional Access Methods 7
1.2 Main Memory Access Structures 8
1.3 Multimedia Indexing Methods 10
1.3.1 Point Access Methods 11
1.3.2 Spatial Access Methods 15
1.4 Approximate Nearest Neighbor Search 23
Chapter 2: High Dimensional Data Distribution Properties 26
2.1 Distance Properties 27
2.2 Radius Properties 30
2.2.1 Radius of the Minimum Bounding Hypersphere 30
2.2.2 Nonuniform Distribution of Data Points inside the MBH 31
2.3 Angle Property 32
2.4 Interpretation of the Distribution of High Dimensional Data in Bounding Hyperspheres Based on the Properties 34
2.5 Experimental Results with Cluster Data 35
2.6 Related Research on Data Distribution Properties 39
Chapter 3: Angle based Balanced index tree: AB-tree 41
3.1 Computing Better Bounding Hyperspheres 41
3.2 Criteria for Node Access 42
3.2.1 Impact of Angle Property 43
3.2.2 Impact of Distance Property and Weight of Nodes 44
3.2.3 Criteria Based on Multiple Threshold Angles 45
3.2.4 Criteria Based on Distance 47
3.3 Search Algorithm for AB-tree 48
3.4 Clustered AB-tree and Cached AB-tree 53
3.4.1 Clustering Approach 53
3.4.2 Implementation of Clustered AB-tree 55
3.4.3 Cached AB-tree 58
3.4.4 Implementation of Cached AB-tree 59
Chapter 4: Parameter Estimation 61
4.1 Introduction 62
4.2 Parameter Autotuning 64
4.2.1 Complexity 66
4.2.2 Algorithm for Parameter Autotuning 68
4.3 Tradeoff Between Performance and Accuracy 71
4.4 Stability of Parameters 73
4.5 Maintenance Cost 74
4.5.1 Effect of Parameter Reuse on Maintenance Cost 76
4.5.2 Maintenance Cost for Clustered AB-Tree and Cached AB-Tree 77
Chapter 5: Experimental Results 80
5.1 Synthetic Data 81
5.1.1 Uniform Data 81
5.1.2 Clustered Data 83
5.2 Performance Issues 84
5.2.1 Effect of Database Size 84
5.2.2 Effect of Dimension 85
5.2.3 Effect of Number of Nearest Neighbors 86
5.2.4 Effect of Block Size 87
5.3 Performance Gain of Clustered AB-tree 87
5.3.1 Performance Sensitivity and the Number of Clusters 88
5.3.2 Performance Comparison of Clustered AB-tree with Standard AB-tree, SS-tree and VA-file 89
5.4 Performance Gain of Cached AB-tree 90
5.4.1 Effect of Cache Size 91
5.4.2 Performance Gain of Cached AB-tree Over AB-tree 93
5.5 Performance Comparison Using Real Data 94
5.6 Discussions on Approximate Nearest Neighbor Search 95
Chapter 6: PicFinder: A Prototype Image Database System 98
6.1 Introduction 98
6.2 The Design of the PicFinder System 100
6.3 The PicFinder interface 103
Chapter 7: Conclusions and Future Work 108
Bibliography 110

List of Tables

4.1 Performance of AB-tree Using Different Number of Threshold Angles 64
4.2 Performance for Varying Database Sizes 73
4.3 Calibration Cost (Database Size: 800,000) 74
4.4 Effect of Initial Values of Threshold Angles 75
4.5 Detailed Calibration Cost Without Parameter Reuse 78
4.6 Detailed Calibration Cost With Parameter Reuse 78
5.1 Number of Node Accesses for Different Cache Level 92
5.2 Relative Error Bound in Distance for 90% Accuracy 95

List of Figures

2.1 Effects of Dimension and Number of Points on dmax/dmin 29
2.2 Probability Distribution Function of Distance (dimension: 100) 30
2.3 Effects of Dimension and Number of Points on Radius of MBH 31
2.4 The Angle θ of Point P 32
2.5 Distribution of d-Dimensional Data With Respect to the Angle θ 33
2.6 Probability of d-dimensional Data Points With Angle Less Than θ 34
2.7 Distribution of High Dimensional Data in Bounding Hyperspheres 35
2.8 Properties in High Dimensions 36
2.8 Properties in High Dimensions (con't) 37
2.8 Properties in High Dimensions (con't) 38
3.1 Illustration of Angle Property 43
3.2 Special Case: Undefined Intersecting Angle 47
3.3 Guarantee vs. Nonguarantee 49
3.4 Better Ordering 52
4.1 Accuracy vs. Number of Node Accesses 72
4.2 Parameters For Varying Database Sizes (dimension: 16) 77
5.1 Uniform Data
5.2 Clustered Data (database size: 800,000)
5.3 Effect of Database Size With Various Accuracies
5.4 Clustered Data: Effect of Database Size (dimension: 16)
5.5 Effect of Dimension
5.6 Effect of Number of NN's
5.7 Effect of Block Size
5.8 Performance of Clustered AB-tree for Varying Number of Subtrees
5.9 Performance Comparison for Different Number of Clusters
5.10 Performance Comparison for Varying Dimensions
5.11 Performance of Cached AB-tree for 800,000 32-dimensional Database
5.12 Performance of Cached AB-tree for 800,000 64-dimensional Database
5.13 Performance Gain of Cached AB-tree over AB-tree
5.14 Real Data: 44,300 32-dimensional color histogram feature vectors
5.15 Result of Query Using Exact NN-search
5.16 Result of Query Using Approximate NN-search
6.1 A Diagram of PicFinder
6.2 The PicFinder Interface
6.3 The QUERY window
6.4 The HELP window
6.5 The FEATURES and INDEXES window
6.6 The RESULTS window

Introduction

Understanding the distribution of data points in high dimensional Euclidean space has been a focus of computer science research for years. Much of this research deals with the development of efficient algorithms that exploit the spatial relationships of data points in high dimensions. For example, in building index structures for databases with high dimensional data, bounding hyper-rectangles and bounding hyperspheres are often used. Various nearest neighbor search algorithms have been proposed in the past that try to minimize the overlap between these hyperspheres and/or the hyper-rectangles. However, the relationship between the amount of overlapping space and the number of data points inside the overlapping space needs to be understood for developing efficient algorithms for applications involving high dimensional data. The focus of this research has been to study the properties of data distribution in high dimensional Euclidean space and exploit these properties for efficient application development. One of the targeted applications for this research is nearest neighbor queries in high dimensional Euclidean space. Nearest neighbor queries are used for databases with non-traditional data such as image and video.
With the rapidly increasing wealth of computational power available to developers, applications have begun to shift away from the storage and manipulation of only simple data types such as numbers and strings. Business, industry, and computer science research have moved toward applications that require very large databases that store and manipulate non-traditional data. One of the most common types of queries on these types of data is a similarity, or nearest neighbor (NN), query. The goal of this type of query is to retrieve an object that is most similar to a given object. One variation of a similarity query is a K nearest neighbor (KNN) query. In this type of query, the system is asked to return the K objects most similar to a given object.

In order to determine similarity between non-traditional data objects such as images and video, the objects must be represented in a manner that conveys some sort of semantics. Typically, this is done with high-dimensional feature vectors where the features are defined by a domain expert. For example, a feature vector for an image may represent the color contained in the image, the texture inherent in the image, or objects present in the image.

A problem with implementing NN search queries is that the notion of NN in high dimensions becomes less meaningful for certain classes of data, and the search complexity increases rapidly with dimension. Beyer et al. [12] have identified a broad class of workloads (in terms of data and query distributions) for which the difference in distance between the nearest neighbors and other points in the data set becomes negligible in high dimensions. For these classes of workload the notion of NN is not very meaningful. However, there are classes of workload, such as data with a set of distinct clusters, where NN may be meaningful in high dimensions.

There are two broad classes of indexing schemes, namely, those using space partitioning methods, such as the KDB-tree [66], hB-tree [52], and LSDh-tree [35], and those using data partitioning methods, such as R-trees [34, 5], the SS-tree [79], SR-tree [42], M-tree [18], X-tree [11], and TV-tree [50]. All these indexing schemes do not perform better than linear scans [77] for exact NN search queries in high dimensional databases. A hybrid approach [16] has been proposed for indexing in high dimensional databases. It performs significantly better than previously proposed indexing schemes [52, 42] for feature based queries but performs worse than linear scan for NN search queries. Weber et al. [77] have shown that under the assumption of uniformity and independence, and when the number of dimensions is sufficiently high, a sequential scan may outperform indexing schemes for exact NN search queries. Consequently, they proposed the VA-file, based on approximations, to make the sequential scan faster.

Another approach that has recently been used to avoid the curse of dimensionality in NN search is approximate nearest neighbor search [80]. An exact nearest neighbor search is not necessary and may be an overkill, since the features and the similarity measure themselves are chosen based on heuristics. In fact, in multimedia database systems, such as IBM's QBIC [29] or MIT's Photobook [60], the mapping of attributes of objects to coordinates of vectors is heuristic, and so is the choice of metric. Recently, Zezula et al. [80] have proposed an indexing scheme for approximate NN search queries which performs well in high dimensions.
This dissertation presents an indexing scheme called the AB-tree [61] that efficiently finds approximate nearest neighbors in high dimensional databases. The AB-tree is based on a few interesting data distribution properties [62], such as the distance properties and the angle property, which were observed in our experiments for large databases with high dimensional data. We have applied this indexing technique to a content-based image-retrieval system [64]. Here, we aim to index images based on descriptions derived automatically from the image. Such descriptions are adapted to the image and are usually more effective than external descriptors in characterizing the visual information illustrated in the image. The similarity between two images can then be determined by comparing their descriptions. The system is designed to administer a heterogeneous collection of natural images. For administering heterogeneous collections, where all kinds of images are likely, we are restricted to low level image-descriptions, which do not involve interpretation.

Extensive experiments on synthetic and real data demonstrate that the AB-tree outperforms the SS-tree by a factor of 86 in 64 dimensions while maintaining 90% accuracy [63]. We have also shown that the AB-tree performs better than the VA-file, the best known sequential access method, by a factor of 8. Furthermore, we have explored other heuristics such as clustering and caching to enhance the performance of the AB-tree and have developed two variants of the AB-tree for very large high-dimensional databases: the clustered AB-tree and the cached AB-tree, which outperform the standard AB-tree by a factor of 1.5 to 4.

This dissertation is organized as follows: Chapter 1 discusses related work in multimedia indexing schemes and approximate nearest neighbor search; Chapter 2 describes some distribution properties of high dimensional data; Chapter 3 introduces the AB-tree and associated search algorithms; Chapter 4 discusses parameter estimation for the AB-tree algorithms; Chapter 5 presents experimental results; Chapter 6 presents PicFinder, a prototype image database system that uses the AB-tree indexing scheme described in this dissertation; and Chapter 7 provides a summary of this thesis and presents a list of contributions as well as future work.

Chapter 1: Related Work

Indexing mechanisms must be incorporated into databases in order to perform queries efficiently. This is true for conventional databases as well as for multimedia databases, where multimedia data objects can be represented as a set of high-dimensional vectors (or points) in feature space. Numerous proposals have been given in recent years for designing index structures to accelerate queries in high dimensions. This chapter gives an overview of multimedia indexing methods.

Multimedia databases are spatial databases. Thus multimedia indexing methods often refer to spatial index structures or multidimensional access methods. There exist many multimedia indexing structures. As more new indexing schemes come out, it becomes more and more difficult to recognize their merits and faults, since every new method seems to claim superiority to at least one access method that has been published previously. However, there are many different criteria to define optimality and many parameters that determine performance. Both time and space efficiency of an access method strongly depend on the data to be processed and the queries to be asked. At present no access method has proven itself to be superior to all its competitors.
This chapter does not try to resolve this problem but rather to give an overview of the pros and cons of a variety of access structures. The performance of exact NN search queries in high dimensions is not good with the existing indexing methods. This chapter also discusses the need for approximate NN search in high dimensions.

1.1 One-dimensional Access Methods

Classical one-dimensional access methods are an important foundation for almost all multidimensional access methods. Although the survey on hashing functions by Knott [44] is somewhat dated, it represents a good coverage of the different approaches. In practice, the most common one-dimensional structures include linear hashing [49, 51], extendible hashing [25], and the B-tree [3].

Other than hashing schemes, the B-tree and its variants [20] organize the data in a hierarchical manner. B-trees are balanced trees that correspond to a nesting of intervals. Each node n corresponds to a disk page D(n) and an interval I(n). If n is an interior node, then the intervals corresponding to the immediate descendants of n are mutually disjoint subsets of I(n). Leaf nodes contain pointers to data items; depending on the type of B-tree, interior nodes may do so as well. B-trees have an upper and lower bound for the number of descendants of a node. The lower bound prevents the degeneration of trees and leads to efficient storage utilization. Nodes whose number of descendants drops below the lower bound are deleted, and their contents are distributed among the adjacent nodes at the same tree level. The upper bound follows from the fact that each tree node corresponds to exactly one disk page. If during an insertion a node reaches its capacity, it is split into two. Splits may propagate up the tree. As the size of the intervals depends on the given data (and the insertion sequence), the B-tree is an adaptive data structure.

Hierarchical access methods such as the B-tree are scalable and behave well in the case of skewed input; they are nearly independent of the distribution of the input data. This is not necessarily true for hashing techniques, whose performance may degenerate depending on the given input data and hash function. This problem is aggravated by the use of order-preserving hash functions [55, 30] that try to preserve neighborhood relationships between data items, in order to support range queries. As a result, highly skewed data keeps accumulating at a few selected locations in image space. For uniformly distributed data, however, extendible as well as linear hashing outperform the B-tree on the average for exact match queries, insertions, and deletions.

1.2 Main Memory Access Structures

Early multidimensional access methods did not account for paged secondary memory and are therefore less suited for large spatial databases. In this section, we review two fundamental data structures, the KD-tree [7, 6] and the quadtree [68], which are adapted and incorporated in several access methods for disk-based data.

One of the most prominent d-dimensional data structures is the KD-tree [7]. The KD-tree is a binary search tree that represents a recursive subdivision of the universe into subspaces by means of (d-1)-dimensional hyperplanes. The hyperplanes are iso-oriented, and their direction alternates between the d possibilities. Each splitting hyperplane has to contain at least one data point, which is used for its representation in the tree.
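As a concrete illustration of the structure just described, the following minimal sketch (written in Python; it is not part of the original text, and the names KDNode and kd_insert are invented for the example) inserts points into such a KD-tree, with the splitting dimension alternating with the depth of the node and every node storing the data point that defines its splitting hyperplane.

class KDNode:
    def __init__(self, point, axis):
        self.point = point   # data point that defines the splitting hyperplane
        self.axis = axis     # dimension along which this node splits
        self.left = None     # subtree with smaller coordinate along axis
        self.right = None    # subtree with larger or equal coordinate along axis

def kd_insert(node, point, depth=0):
    axis = depth % len(point)          # alternate the split direction among the d possibilities
    if node is None:
        return KDNode(point, axis)
    if point[node.axis] < node.point[node.axis]:
        node.left = kd_insert(node.left, point, depth + 1)
    else:
        node.right = kd_insert(node.right, point, depth + 1)
    return node

# Example: build a 2-dimensional KD-tree from a handful of points.
root = None
for p in [(3, 6), (17, 15), (13, 15), (6, 12), (9, 1)]:
    root = kd_insert(root, p)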
Interior nodes have one or two descendants each and function as a discriminator to guide the search. Searching and insertion of new points are straightforward operations. Deletion is somewhat more complicated and may cause a reorganization of the subtree below the data point to be deleted.

One disadvantage of the KD-tree is that the structure is sensitive to the order in which the points are inserted. The adaptive KD-tree [6] mitigates these problems by choosing a split such that one finds about the same number of elements on both sides. While the splitting hyperplanes are still parallel to the axes, they do not have to contain a data point and their directions do not have to be strictly alternating anymore. As a result, the split points are not part of the input data; all data points are stored in the leaves. Interior nodes contain the dimension (e.g., x or y) and the coordinate of the corresponding split. Splitting is continued recursively until each subspace contains only a certain number of points. The adaptive KD-tree is a rather static structure; it is obviously difficult to keep the tree balanced in the presence of frequent insertions and deletions. The structure works best if all the data is known a priori and if updates are rare.

Another variant of the KD-tree is the bintree [75]. This structure partitions the universe recursively into d-dimensional boxes of equal size until each one contains only a certain number of points. Even though this kind of partitioning is less adaptive, it has several advantages, such as the implicit knowledge of the partitioning hyperplanes. A disadvantage common to all KD-trees is that for certain distributions no hyperplane can be found which splits the data points evenly [53].

The quadtree with its many variants is a close relative of the KD-tree. For an extensive discussion of this structure, see [68, 69, 70]. While the term quadtree usually refers to the two-dimensional variant, the basic idea applies to arbitrary d. Like the KD-tree, the quadtree decomposes the universe by means of iso-oriented hyperplanes. An important difference, however, is the fact that quadtrees are not binary trees anymore. In d dimensions, the interior nodes of a quadtree have 2^d descendants, each corresponding to an interval-shaped partition of the given subspace. These partitions do not have to be of equal size, although that is often the case. For d = 2, for example, each interior node has four descendants, each corresponding to a rectangle. These rectangles are typically referred to as the NW, NE, SW, and SE quadrants. The decomposition into subspaces is usually continued until the number of objects in each partition is below a given threshold. Quadtrees are therefore not necessarily balanced; subtrees corresponding to densely populated regions may be deeper than others. Searching in a quadtree is similar to searching in an ordinary binary search tree. At each level, one has to decide which of the four subtrees need to be included in the future search. In the case of a point query, typically only one subtree qualifies, whereas for range queries there are often several. We repeat this search step recursively until we reach the leaves of the tree.

1.3 Multimedia Indexing Methods

The multidimensional data structures presented in the previous section do not explicitly take secondary storage management into account. They have originally been designed for main memory applications where all the data is available without accessing the disk.
Despite growing main memories, this is of course not always the case. For large multimedia databases, it is necessary for an index structure to be disk-based as opposed to being memory-based. The access methods presented in this section have been designed with secondary storage management in mind. Their operations are closely coordinated with the operating system to ensure that overall performance is optimized. From the spatial database viewpoint, we can distinguish between two major classes of access methods [72].

- Point Access Methods (PAMs), which are used to organize multidimensional point objects (e.g., cities, where each city is represented by three co-ordinates: longitude, latitude, and altitude).
- Spatial Access Methods (SAMs), which are used to organize point objects as well as arbitrarily shaped objects (e.g., street segments, land plots).

In this section, first we will present a selection of point access methods. Then, we will present a selection of spatial access methods as well as some new promising indexing methods in high dimensions.

1.3.1 Point Access Methods

Usually, the points in the database are organized in a number of buckets, each of which corresponds to a disk page and to some subspace of the universe. The subspaces (often referred to as data regions, bucket regions, or simply regions, even though their dimension may be greater than two) need not be rectilinear, although they often are. The buckets are accessed by means of a search tree or some d-dimensional hash function. The grid file [54], for example, uses a directory and a grid-like partition of the universe to answer an exact match query with exactly two disk accesses. Furthermore, there are multidimensional hashing schemes [74, 46, 47], multilevel grid files [78, 40], and hash trees [59, 58], which organize the directory as a tree structure. Tree-based access methods are usually a generalization of the B-tree to higher dimensions, such as the KDB-tree [66] or the hB-tree [53].

Although there is no total order for objects in two- and higher-dimensional space that completely preserves spatial proximity, there have been numerous attempts to construct hashing functions that preserve proximity at least to some extent. The goal of all these heuristics is that objects that are located close to each other in original space are likely to be stored close together on the disk. This could contribute substantially to minimizing the number of disk accesses per range query.

The Grid File. As a typical representative for an access method based on hashing, we will first discuss the grid file and some of its variants [36, 59, 78, 24, 13]. The grid file superimposes a d-dimensional orthogonal grid on the universe. Because the grid is not necessarily regular, the resulting cells may be of different shapes and sizes. A grid directory associates one or more of these cells with data buckets, which are stored on one disk page each. Each cell is associated with one bucket, but a bucket may contain several adjacent cells. Since the directory may grow large, it is usually kept on secondary storage. To guarantee that data items are always found with no more than two disk accesses for exact match queries, the grid itself is kept in main memory, represented by d one-dimensional arrays called scales. However, the grid file suffers from two disadvantages:

- It does not work well if the attribute values are correlated.
- It might need a large directory if the dimensionality of the address space is high.
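To make the two-disk-access guarantee concrete, here is a minimal sketch (Python; not from the original text, and all names and sample values are invented) of an exact match lookup in a grid file: the in-memory scales map each coordinate to a grid slot, the directory maps the resulting cell to a bucket, and reading the directory entry and the bucket accounts for the two disk accesses.

from bisect import bisect_right

def cell_of(point, scales):
    # scales[i] is the sorted list of split positions along dimension i (kept in main memory)
    return tuple(bisect_right(scales[i], x) for i, x in enumerate(point))

def exact_match(point, scales, directory, read_bucket):
    cell = cell_of(point, scales)     # main-memory computation only
    bucket_id = directory[cell]       # first disk access: the grid directory
    bucket = read_bucket(bucket_id)   # second disk access: the data bucket
    return [record for record in bucket if record == point]

# Example with a 2-dimensional grid and in-memory stand-ins for the disk pages.
scales = [[0.5], [0.3, 0.7]]
directory = {(0, 0): 0, (0, 1): 0, (0, 2): 1, (1, 0): 2, (1, 1): 2, (1, 2): 3}
pages = {0: [(0.2, 0.1)], 1: [(0.4, 0.9)], 2: [(0.8, 0.2)], 3: [(0.9, 0.8)]}
print(exact_match((0.4, 0.9), scales, directory, pages.get))   # -> [(0.4, 0.9)]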
For a theoretical analysis of the grid file and some of its variants, see [65, 4].

Multidimensional Linear Hashing. Unlike multidimensional extendible hashing, multidimensional linear hashing uses no or only a very small directory. It therefore occupies relatively little storage compared to extendible hashing, and it is usually possible to keep all relevant information in main memory. Several different strategies have been proposed to perform the required address computation. Kriegel and Seeger [46] proposed a variant of linear hashing called multidimensional order-preserving linear hashing with partial expansions (MOLHPE). Another variant that has better order-preserving properties than MOLHPE has been reported by Hutflesz, Six, and Widmayer [23]. Their dynamic z-hashing uses a space-filling technique called z-ordering [57] to guarantee that points that are located close to each other are also stored close together on the disk. One disadvantage of z-hashing is that a number of useless data blocks will be generated, similar to the interpolation-based grid file [59]. Widmayer (1991) later noted, however, that both z-hashing and MOLHPE are of limited use in practice, due to their inability to adapt to different data distributions.

The KDB-tree. The KDB-tree combines some of the properties of the adaptive KD-tree and the B-tree to handle multidimensional points. It partitions the universe in the manner of an adaptive KD-tree and associates the resulting subspaces with tree nodes. Each interior node corresponds to an interval-shaped region. Regions corresponding to nodes at the same tree level are mutually disjoint; their union is the complete universe. The leaf nodes store the data points that are located in the corresponding partition. Like the B-tree, the KDB-tree is a perfectly balanced tree that adapts well to the distribution of the data. Other than for B-trees, however, no minimum space utilization can be guaranteed.

The hB-Tree. The hB-tree (holey brick tree) [53, 52] is related to the KDB-tree in that it utilizes KD-trees to organize the space represented by its interior nodes. One of the most noteworthy differences is that node splitting is based on multiple attributes. As a result, nodes no longer correspond to d-dimensional intervals, but to intervals from which smaller intervals have been excised. Similar to the BANG file, the result is a somewhat fractal structure (a holey brick) with an external enclosing region and several cavities called extracted regions. As we will see later, this technique avoids the cascading of splits that is typical for many other structures.

Space-Filling Curves for Point Data. We already mentioned the main reason why the design of multidimensional access methods is so difficult compared to the one-dimensional case: there is no total order that preserves spatial proximity. One way out of this dilemma is to find heuristic solutions, i.e., to look for total orders that preserve spatial proximity at least to some extent. The idea is that if two objects are located close together in original space, there should at least be a high probability that they are close together in the total order, i.e., in the one-dimensional image space. For the organization of this total order one could then use a one-dimensional access method (such as a B+-tree), which may provide good performance at least for some spatial queries. One thing all proposals have in common is that they first partition the universe with a grid.
Each of the grid cells is labelled with a unique number that defines its position in the total order (the space-filling curve). The points in the given data set are then sorted and indexed according to the grid cell they are contained in. Note that while the labelling is independent of the given data, it is obviously critical for the preservation of proximity in one-dimensional address space. That is, the way we label the cells determines how clustered adjacent cells are stored on secondary memory. Based on several experiments, Abel and Mark [1] conclude that z-ordering [57] and the Hilbert curve [27] are most suitable as a multidimensional access method. Jagadish [39] and Faloutsos and Rong [26] both prefer the Hilbert curve among those two. Z-ordering is one of the few access methods that has found its way into commercial database products. In particular, Oracle has adapted and integrated the technique into its database system [37].

An important advantage of all space-filling curves is that they are practically insensitive to the number of dimensions if the one-dimensional keys can be arbitrarily large. Everything is mapped into one-dimensional space, and one's favorite one-dimensional access method can be applied to manage the data. An obvious disadvantage of space-filling curves is that incompatible index partitions cannot be joined without recomputing the codes of at least one of the two indexes.

1.3.2 Spatial Access Methods

All multidimensional access methods presented in Section 1.3.1 have been designed to handle sets of data points and to support spatial searches on them. None of those methods is directly applicable to databases containing objects with a spatial extension. Typical examples include geographic databases, containing mostly polygons, or mechanical CAD data, consisting of three-dimensional polyhedra. In order to handle such extended objects, the proposed methods in the literature form the following two classes [72]:

1. Methods that use space-filling curves
2. Methods that use treelike structures

In the following, we will first discuss transformation to higher-dimensional space, then describe space-filling curves for extended objects, and finally present several promising treelike indexing structures and some new indexing methods in high dimensions.

Mapping to Higher-Dimensional Space. Simple geometric shapes can be represented as points in higher-dimensional space. The original and final spaces are called native space and parameter space, respectively [56]. A range query in native space can also be translated to a range query in parameter space [28]. The strong point of this idea is that we can turn any PAM into a SAM with very little effort. This approach has been used or suggested in several settings, for example, with grid files, B-trees, and hB-trees as the underlying PAM. The weak points are the following: First, the parameter space has high dimensionality, inviting "dimensionality curse" problems earlier on. Second, except for range queries, there are no published algorithms for nearest neighbor and spatial join queries. Third, if the original database contains more complex objects, they have to be approximated, e.g., by a rectangle or a sphere, before transformation. In this case, the point access method can only lead to a partial solution.

Space-Filling Curves for Extended Objects. Space-filling curves are a very different type of transformation approach that seems to suffer less from some of the drawbacks listed in the previous subsection.
Space-filling curves can be used to represent extended objects by a list of grid cells or, equivalently, a list of one-dimensional intervals that define the position of the grid cells concerned. In other words, a complex spatial object is approximated not by only one simpler object, but by the union of several such objects. The z-ordering approach by Orenstein and Merrett [57] is one of the more popular approaches of this kind. A region typically breaks into one or more pieces (blocks), each of which can be described by a z-value. Since a z-ordering is based on an underlying grid, the resulting set of regions is usually only an approximation of the original object. The termination criterion depends on the accuracy or granularity (maximum number of bits) desired. More regions obviously yield more accuracy, but they also increase the overhead, which affects the overall performance of the resulting data structure. As pointed out by Orenstein [55], there are two possibly conflicting objectives: first, the number of regions to approximate the object should be small, since this results in fewer index entries; second, the accuracy of the approximation should be high, since this reduces the expected number of false drops (i.e., objects that are paged in from secondary memory, only to find out that they do not satisfy the search predicate).

R-Trees. The R-tree proposed by Guttman [34] can be thought of as an extension of the B-tree for multidimensional objects. A spatial object is represented by its minimum bounding rectangle (MBR). Nonleaf nodes contain entries of the form (ptr, R), where ptr is a pointer to a child node in the R-tree and R is the MBR that covers all rectangles in the child node. Leaf nodes contain entries of the form (obj-id, R), where obj-id is a pointer to the object description and R is the MBR of the object. The main innovation in the R-tree is that parent nodes are allowed to overlap. This way, the R-tree can guarantee good space utilization and remain balanced. Search algorithms can be applied almost unchanged. The only differences are due to the fact that the overlap may increase the number of search paths we have to follow. Even a point query may require the investigation of multiple search paths because there may be several subspaces at any index level that include the search point. For range and region queries, the average number of search paths increases as well.

The R-tree inspired much subsequent work, whose main focus was to improve the search time. When the database is static, other optimizations can be incorporated into the index structures in order to enhance search performance. A packing technique is proposed in [67] to minimize the overlap between different nodes in the R-tree for static data. The idea was to order the data in, say, ascending x-low value, and scan the list, filling each leaf node to capacity. An improved packing technique based on the Hilbert curve is proposed in [41]: the idea is to sort the data rectangles on the Hilbert value of their centers. One of the most successful ideas in R-tree research is deferred splitting: Beckmann et al. proposed the R*-tree [5], which was reported to outperform Guttman's
A class of variations considers more general minimum bounding shapes, trying to minimize the dead space that an MBR may cover. Gunther proposed the cell trees [15], which introduce diagonal cuts in arbitrary orientation. Jagadish proposed the polygon tress (P-trees) [38], where the minimum bounding shapes are polygons with slopes of sides 0, 45, 90, and 135 degrees. Minimum bounding shapes that are concave or even have holes have been suggested (e.g., in the hB—tree [53]). However, as dimension rises, the increasing overlap among the bounding hyper- rectangles deteriorates search efficiency. In addition, the bounding hyper-rectangles become less efficient in dividing the space into neighborhoods as dimension increases. A neighborhood exists when many of the nearest neighbors to a point reside within the same bounding region. Because the R—trees do not create neighborhoods, nearest neighbor searching is inefficient with R—trees in high dimensions. X-tree The X-tree [11] is a modified R-tree that attempts to decrease the overlap of bounding hyper-rectangles by using two techniques: First, the X-tree introduces an overlap-free split algorithm that is based on the split history of the tree. Second, if the overlap-free split algorithm would lead to an unbalanced directory, the X-tree omits the Split and the according directory node becomes a supernode. These supernodes Span more than one disk block and keep bounding regions disjoint. In this way, the 19 X—tree outperforms R*-tree by a factor of up to 400 for point queries. SS-tree On the other hand, the SS-tree [79] partitions the search space by utiliz— ing bounding hyperspheres instead of bounding hyperrectangles. This decreases the amount of storage space needed to represent a bounding region inside of an index node (the space needed to store a bounding hypersphere is nearly half the space needed to store a bounding hyperrectangle). As a result, the fanout of index nodes increases. The SS-tree improves the performance of NN searching in high dimensions over the R*-tree. However, the bounding hyperspheres cover much more volume in the search space than bounding hyper-rectangles. This causes a great deal of overlap among bounding hyperspheres in the tree. The result is that the performance of NN searching degrades significantly as dimension increases. SR—tree In order to partition the space into neighborhoods while minimizing the overlap among bounding regions, the SR—tree [42] has been proposed. The SR- tree utilizes the intersection of a hypersphere and a hyperrectangle as its bounding region. This decreases overlap among bounding regions and improves NN search performance. However, the SR-tree must store a bounding hyperrectangle and a bounding hypersphere for each node in the tree. This decreases the fanout of the index nodes and increases the total number of nodes in the tree. 20 M-tree M-tree [18] is another higher dimensional feature vector indexing method. The main idea behind the M—tree is to combine the advantages of balanced and dy- namic spatial access methods with the capabilities of static metric trees to index ob- jects using features and distance functions. Leaf nodes of an M-tree contains indexed objects, whereas non-leaf nodes store so-called routing objects. A routing object is a database object to which a routing role is assigned by a specific routing algorithm. Pyramid-Technique The Pyramid Technique [9] is a new indexing method for high-dimensional data spaces. 
The Pyramid-Technique is highly adapted to range query processing using the maximum metric Lmax. In contrast to all other index structures the performance of the Pyramid-Technique does not deteriorate when pro— cessing range queries on data of higher dimensionality. The Pyramid-Technique is based on a special partitioning strategy that is optimized for high—dimensional data. The basic idea is to divide the data space first into 2d pyramids sharing the center points of the space as a top. In a second step, the pyramids are cut into slices parallel to the basic of the pyramid. These slices form the data pages. Furthermore, this partition provides a mapping from the given d-dimensional space to a 1-dimensional space. Therefore, a B+-tree can be used to manage the transformed data. Pyramid- technique clearly outperforms other index structures for range queries. However, for queries having a bad selectivity, i.e. a high number of answers, or extremely skewed queries, a linear scan of the database is faster than the Pyramid-Technique. 21 Voronoi Cells In [10], a technique is proposed based on the precomputation of the solution space of any arbitrary nearest-neighbor search. This corresponds to the com- putation of the voronoi cells of data points. Since voronoi cells may become rather complex when going to higher dimensions, the authors presented a new algorithm for the approximation of high-dimensional voronoi cells using a set of minimum bounding hyperrectangles. The voronoi cells are stored in an index structure efficient for high- dimensional data spaces. As a result, nearest neighbor search corresponds to a simple point query on the index structure. Although the technique is based on a precom- putation of the solution space, it supports insertions of new data points. However, this precomputation technique is only suitable for searching one nearest neighbor of a query point. Multi—Step k-Nearest Neighbor Search While algorithms that directly based on index work well for simple medium-dimensional similarity distance functions, they do not perform efficiently in high dimensions. A multi-step query processing strategy can be used to improve the performance. In a multi-step query processing environ- ment, one or more filter steps produce sets of candidates that are exactly evaluated in one or more subsequent refinement steps. The crucial correctness requirement is to prevent the system from producing false drops. This means that no actual result may be dismissed from the set of candidates. The number of candidates is a funda- mental efficiency parameter. A multi-step algorithm for k-nearest neighbor search has already been developed and successfully been applied to similarity search in 3-D med- ical image databases [45]. Seidl and Kriegel [71] present a novel multi-step algorithm 22 that is guaranteed to produce the minimum number of candidates. Experimental evaluations demonstrate the significant performance gain over the previous solution. VA-file Weber, et. a1. [77] have shown that under the assumption of uniformity and independence, and when the number of dimensions is sufficiently high, sequential scan may outperform indexing schemes for exact NN search queries. Consequently, they proposed the Vector Approximation file (VA-file) based on approximations to make the sequential scan as fast as possible. The VA-file divides the data space into 2" rectangular cells where b denotes a user specified number of bits (e. g. some number of bits per dimension). 
Instead of hierarchically organizing these cells like in grid-files or R-trees, the VA-file allocates a unique bit-string of length b for each cell, and approximates data points that fall into a cell by that bit-string. A VA-file is an array of these compact, geometric approximations. By scanning the entire approximation file first (filtering step), NN queries need only visit a fraction of the vectors. The VA-file outperforms R-trees and X-tree if dimensionality becomes large. However, Performance of VA-file method deteriorates linearly with the size of databases due to their sequential nature. 1.4 Approximate Nearest Neighbor Search The number of features involved in real world applications may approach thousands. Dimension reduction techniques have been proposed to reduce the number of features. The TV-tree [50] attempts to use only a subset of the dimensions for indexing. The most significant features are used at higher levels of the tree, while more features get 23 utilized as the tree is descended. The TV-tree gets impressive results; however, it can only be used in certain types of applications [79]. Despite the use of dimension reduc- tion techniques such as principal component analysis, vector spaces of several hundred dimensions are typical. Unfortunately, this relaxation of goals has not removed the curse of dimensionality. Another approach that has recently been used to avoid the curse of dimensional- ity is approximate nearest neighbor search. An exact nearest neighbor search is not necessary and may be an overkill, since the features and the similarity measure them- selves are chosen based on heuristics, and are not mathematically precise parameters. In fact, in multimedia database systems, such as IBM’s QBIC [29] or MIT’s Photo- book [60], the mapping of attributes of objects to coordinates of vectors is heuristic, and so is the choice of metric. Recent work in computational geometry has produced substantial results on near- est neighbor search in main memory. Dobkin and Lipton [22] were the first to provide an algorithm for nearest neighbors in d dimensional Euclidean space, with query time 2("‘Ll’). Clarkson [19] showed that query 0(2“ * log(n)) and preprocessing cost 0(n time complexity could be reduced to 0((1/e)(d/2)log(n)) with 0((1/€)(d/2) * log(p) :1: n) space, where e is the relative error in distance and p is the ratio between the furthest- pair and closest-pair interpoint distances. Later Chan [17] showed that the factor of log(p) could be removed from the space complexity. Kleinberg [43] showed that it is possible to eliminate exponential dependencies on dimension in query time, but with 0(n * log(d))(2*d) space. Recently, Indyk and Motwani [38] and independently Kushilevitz, Ostrovsky and Rabani [48], have announced algorithms that eliminate 24 all exponential dependencies in dimension, yielding a query time 0(d * (logo(l)(dn)) and space 0((dn)(0(1)). The O-notation hides constant factors that depend exponen- tially on 6, but not on dimension. Arya, Mount, Netanyahu and Silverman [2] gave an algorithm with query time 0(c* log(n)) and space 0(dn), where c < d * [1 + 6 * d/e]d. The exponential factors in query time do imply that the algorithm is not practical for large values of d. Most of the methods mentioned above used data structures that are stored in main memory. Much of the query time includes the computation cost in main memory. Our focus is on nearest neighbor search in databases. Thus I/ O cost is the main source of query time. 
We [61] propose a fast approximate nearest neighbor search in high dimensional databases. The fast approximate search index structure, called the AB-tree, is based on the SS-tree. The AB-tree uses heuristics to decide whether or not to access a node in the index tree based on the intersecting angle and the weight of the node. The heuristic is motivated by an interesting observation: high dimensional data points inside a bounding hypersphere usually fall within a small interval of angle around 90 degrees, which is justified by sound theoretical as well as experimental arguments. In contrast to the algorithms that have been proposed in earlier works, the performance of the AB-tree algorithm does not deteriorate when processing nearest neighbor queries as dimension increases.

Chapter 2: High Dimensional Data Distribution Properties

Modern database applications such as multimedia databases and data warehouses are dealing with high dimensional data. The distribution of data points in high dimensional Euclidean space needs to be understood for developing efficient algorithms for applications involving high dimensional data. This chapter discusses the properties of data distribution in high dimensional Euclidean space. The basic model for characterizing the properties of data distribution in high dimensional Euclidean space is based on the positions of the points in the hyperspace relative to various high dimensional geometrical shapes such as hyperspheres, hypercubes, hyperplanes, and hyperangles. For example, if the maximum and the minimum pairwise distances of a set of random points are denoted by dmax and dmin, then an important property is that the ratio of dmax and dmin for a given set of points approaches the value 1 with increasing dimensionality. This property implies that the data points are distributed close to the surface of a hypersphere. We are representing this relative distance property using the maximum and the minimum pairwise distances of a set of points. There are other ways to capture this relative distance property, which will be discussed later.

We investigated several such properties that govern the behavior of data distribution in high dimensional Euclidean space. Some of these properties that we investigated in this research are described in the following sections.

- Distance Properties
  - Distance Property 1 (DP1): For a given dimension, the ratio dmax/dmin increases slowly as the number of points increases.
  - Distance Property 2 (DP2): For a fixed number of points, dmax/dmin decreases and becomes close to 1 as the dimension increases.
- Radius Properties
  - Radius Property 1 (RP1): The radius of a hypersphere that bounds a set of points increases very slowly with the number of points.
  - Radius Property 2 (RP2): The radius of a hypersphere that bounds a set of points increases rapidly with an increase in dimensionality.
  - Radius Property 3 (RP3): Even for a small number of uniformly distributed points (e.g., 5 points) within a hypercube, much of the bounding hypersphere for this set of points lies outside the hypercube.
- Angle Property
  - Angle Property 1 (AP1): Given a set of points within a bounding hypersphere, and some reference point P, we define an angle θ between the reference point P and each point Q in the hypersphere, with respect to the center C of the hypersphere. The distribution of points with respect to the angles θ is proportional to sin^d(θ), where d is the dimension of the points. As dimensionality gets larger, for any given reference point, the value of the angle θ falls inside a decreasing interval of angles around π/2 (90 degrees).
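The following small experiment (a Python/NumPy sketch, not part of the original text; it uses the centroid of the sample as a stand-in for the center of the bounding hypersphere, and all names are invented) illustrates the flavor of DP2 and AP1: as the dimension grows, the ratio dmax/dmin computed over all pairs of uniform random points shrinks toward 1, and the angles measured at the center between a reference point and the remaining points concentrate around 90 degrees.

import numpy as np

def distance_ratio_and_angles(n=500, d=100, seed=0):
    rng = np.random.default_rng(seed)
    pts = rng.random((n, d))                     # uniform points in [0, 1)^d
    diff = pts[:, None, :] - pts[None, :, :]     # all pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    iu = np.triu_indices(n, k=1)                 # each unordered pair once
    ratio = dist[iu].max() / dist[iu].min()      # dmax / dmin
    c = pts.mean(axis=0)                         # centroid as an approximate center
    r = pts[0] - c                               # reference point relative to the center
    v = pts[1:] - c                              # the other points relative to the center
    cosines = (v @ r) / (np.linalg.norm(v, axis=1) * np.linalg.norm(r))
    angles = np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
    return ratio, angles.mean(), angles.std()

for d in (2, 25, 100, 300):
    print(d, distance_ratio_and_angles(d=d))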
2.1 Distance Properties

In this section we discuss the distance properties (DP1 and DP2). These properties describe how close the points are to being equidistant from each other. For example, when the ratio dmax/dmin is 1, all data points are equidistant from each other, and the points lie on the surface of a hypersphere of radius

r = [D / (2 × (D + 1))]^(1/2) × d    (2.1.1)

where D is the dimension and d is the distance between a pair of points. When the ratio is not 1 but close to 1, the data points are close to equidistant from each other.

The experimental results shown in Figure 2.1 illustrate the distance properties. The experiments were performed on uniform data sets that were created for each dimension depicted. Each data set consisted of data points that were randomly generated in the range [0,1) for each dimension. All experimental results have been averaged over 1000 random trials. For a fixed number of points in a fixed dimension, the distribution of the ratio dmax/dmin is similar to a sharp Gaussian curve with a standard deviation less than 5 percent of the average. As the dimension and/or the number of points increase, the standard deviation becomes smaller. Figure 2.1(a) shows that for a 100-dimension dataset, as the number of points increases from 5 to 50000, dmax/dmin increases slowly from 1.19 to 1.76. It demonstrates that in high dimensions, a dataset of 50000 points has nearly the same characteristics as that of 5 points. In Figure 2.1(b), we show the effect of dimension on dmax/dmin for a dataset with 500 data points. It is clear that dmax/dmin decreases as the dimension increases and becomes close to 1. This means that as the dimension increases, the data points lie within a narrow band of the surface of a hypersphere.

Figure 2.1: Effects of Dimension and Number of Points on dmax/dmin. (a) dmax/dmin vs. Number of Points; (b) dmax/dmin vs. Dimension.

This property of data points in high dimensions could also be described by probabilistic models. In the following we give a brief description of a probabilistic model describing the property that the points lie within a narrow band of a hypersphere.

Points Lie Near the Surface of a Hypersphere: A Probabilistic Approach. Assume that each coordinate of the d dimensions has a uniform distribution (taking m + 1 equally likely values). The probability density function of the distance from a center with coordinates c_i, i = 1 to d, to a random point is given by the following generating function:

G(s) = 1 / (m + 1)^d × ∏_{i=1}^{d} [ Σ_{j=0}^{m} s^((j - c_i)^2) ]    (2.1.2)

The coefficient of s^(n^2) in the above generating function gives the probability density of points at a distance n. The plot of this probability distribution function for 100 dimensions is shown in Figure 2.2. Figure 2.2 shows that the distances between a given center and the random points fall within a narrow band. Superimposed on this figure are the results of experiments for 100-dimension datasets with 100,000 points uniformly distributed inside a unit hypercube. The experimental results follow the above theoretical distribution more closely as the number of points is increased. As the number of dimensions increases, the width of this band relative to the distance continues to decrease.
Figure 2.2: Probability Distribution Function of Distance (dimension: 100). The experimental and theoretical probability density curves are plotted against distance.

2.2 Radius Properties

Radius properties play a significant role in describing the relative positions of groups of data points inside a hypersphere. For example, a bounding hypersphere for a set of points encloses other bounding hyperspheres containing subsets of these points. By property RP1, these enclosed hyperspheres are almost as big as their bounding hypersphere in high dimensions. This indicates that all these hyperspheres are almost totally overlapping with each other in high dimensions.

2.2.1 Radius of the Minimum Bounding Hypersphere

In this subsection, we study the effects of dimension and the number of points on the radius of Minimum Bounding Hyperspheres (MBH). The experimental results are shown in Figure 2.3. The experiments were performed on uniform data sets that were created for each dimension depicted. Each data set consisted of data points that were randomly generated in the range [0,1) for each dimension. All experimental results have been averaged over 1000 random trials. First, we have tested the effect of the number of points on the radius of the MBH. For a 100-dimension dataset, as the number of points increases from 5 to 50000, the radius of the MBH increases slowly from 4.54 to 5.32. This is shown in Figure 2.3(a). It is clear that the radius of the MBH is not sensitive to the number of points in high dimensions. The radius of the MBH of 50 points is very close to that of 50000 points in high dimensions. On the other hand, for a given set of 500 points the radius of the MBH increases significantly as the dimension increases. This is shown in Figure 2.3(b).

Figure 2.3: Effects of Dimension and Number of Points on Radius of MBH. (a) Radius vs. Number of Points; (b) Radius vs. Dimension.

2.2.2 Nonuniform Distribution of Data Points inside the MBH

In this subsection, we give a justification that a set of uniformly distributed data inside a hypercube lies near the surface of the hypersphere symmetrically around the center and nonuniformly in high dimensions. Radius property RP3 shows that even for a small number of points, much of the MBH lies outside the hypercube. This is because the maximum radius of a hypersphere enclosed inside a unit hypercube is 0.5, and the radius of the MBH is greater than 4 (Figure 2.3(a)) even for a data set of 5 points. Parts of the hypersphere outside the hypercube are always empty. This results in the nonuniform distribution of data points inside the hypersphere. Because the radius of the MBH increases significantly as the number of dimensions increases, more and more of the hypersphere lies outside the hypercube as the dimension increases, which means the data points inside the hypersphere become more and more nonuniform (e.g., clustered) as the dimension increases.

2.3 Angle Property

In this section we present an angle-based representation of high-dimensional Euclidean space, and compare its properties to those of the distance-based properties discussed above. We found that the angle-based property has characteristics similar to the distance-based properties.
We can map the whole hypersphere into the angle range [0, π]. The volume of a d-dimensional hypersphere with radius r is Vol_d(r), given by

    Vol_d(r) = \int_{x_1^2 + x_2^2 + \cdots + x_d^2 \le r^2} dx_1 dx_2 \cdots dx_d

    W_i \le N < W_{i+1}  and  θ > θ_i        (3.2.2)

where n ≥ i ≥ 1, N is the weight of the node, and θ is the estimated angle. For a node with weight N such that W_i ≤ N < W_{i+1}, where n ≥ i ≥ 1: if the estimated angle θ > θ_i, then the search algorithm accesses the node; otherwise it discards the node. The different weight ranges correspond to different levels of the tree, i.e., (W_1, W_2) is for leaf nodes, (W_2, W_3) is for the direct parents of leaf nodes, and so on. The predefined threshold angles depend on the desired accuracy of the results. For a given required accuracy, we calibrate the threshold angles for each individual database.

3.2.4 Criteria Based on Distance

In some cases, the possible number of points inside the overlap area cannot be correctly estimated based on the angles. One example of such a case is when the candidate set hypersphere falls inside a bounding hypersphere. In this case, the intersecting angle between the candidate set hypersphere and the bounding hypersphere is not defined, but it is desirable to examine the hypersphere that overlaps with the candidate set hypersphere. If the hypersphere is not examined, the quality of the resulting NNs may suffer. This case is illustrated in Figure 3.2.

An additional criterion based on distance is introduced to ensure that nodes with a high probability of containing NNs to a given query point are not discarded by the heuristic. The criterion based on distance is: if the distance from the query point Q to the center of the bounding hypersphere is less than the distance from Q to the current K-th NN, then the node is examined.

Figure 3.2: Special Case: Undefined Intersecting Angle

3.3 Search Algorithm for AB-tree

In exact NN search algorithms, a node is always accessed if it overlaps with the candidate set hypersphere, i.e., the hypersphere with the query point Q as its center and the distance from Q to the current K-th NN as its radius. In our approximate NN search algorithm, if the candidate set hypersphere overlaps with a node, the node is accessed only if it satisfies one of two conditions: (1) the radius of the candidate set hypersphere is greater than the distance from Q to the center C of the node's bounding hypersphere (i.e., C falls within the candidate set hypersphere), as described in Section 3.2.4, or (2) the estimated angle and the weight of the node satisfy the criteria for node access, as described in Section 3.2.3. The following Algorithm 3.3.1 utilizes the criteria for node access in the search for NNs to a query point Q.

Algorithm 3.3.1 K Nearest Neighbor Search for AB-tree {
    // This algorithm begins by invoking ExamineNode
    // on the root node of the index tree.
    ExamineNode(rootNode);
}
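The two node-access tests above (the weight-dependent threshold angles of Section 3.2.3 and the distance fallback of Section 3.2.4) can be expressed as a single predicate. The sketch below is a minimal illustration; the weight boundaries W and threshold angles THETA are assumed example values, not calibrated parameters from the dissertation.

from bisect import bisect_right

W = [1, 40, 1200, 36000]           # weight boundaries W_1 .. W_n (assumed example values)
THETA = [55.0, 62.0, 70.0, 78.0]   # threshold angle, in degrees, per weight range (assumed)

def should_access(node_weight, estimated_angle, dist_q_to_center, dist_q_to_kth_nn):
    """Return True if the node must be examined for the current query."""
    # Distance criterion (Section 3.2.4): the node's center falls inside the
    # candidate set hypersphere, so the intersecting angle is not defined.
    if dist_q_to_center < dist_q_to_kth_nn:
        return True
    # Angle criterion (Section 3.2.3): pick the threshold angle for the weight
    # range W_i <= N < W_{i+1}; access the node only if the estimated angle exceeds it.
    i = bisect_right(W, node_weight) - 1
    i = max(0, min(i, len(THETA) - 1))
    return estimated_angle > THETA[i]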
AB-tree is derived from the SS-tree [79]. Besides the angle criteria, the AB-tree uses a different ordering strategy than the SS-tree, as explained below. In the SS-tree, the children of a node are first sorted in ascending order by the distance from the query point to the surfaces of the children's bounding hyperspheres. In the second step, each child in the sorted list is accessed if the candidate set hypersphere overlaps with its bounding hypersphere. The second step repeats until it reaches a child whose bounding hypersphere does not overlap with the candidate set hypersphere. Because the rest of the children remaining in the sorted list cannot overlap with the candidate set hypersphere, these two steps guarantee that a child of the node is always accessed if it overlaps with the candidate set hypersphere. These two steps are applied to all the nodes accessed during the search. Therefore, the SS-tree guarantees that a node is always accessed if it overlaps with the candidate set hypersphere.

Algorithm 3.3.2 ExamineNode(Node) {
    if (Node is an index node)
        Invoke ExamineIndexNode(Node);
    else
        Invoke ExamineLeafNode(Node);
}

Algorithm 3.3.3 ExamineIndexNode(Node) {
    1. Sort the children of Node in ascending order by distance from the query point Q to the centers of the children;
    2. For each child in the sorted list (set CurrentChild to the current child) {
           if (K candidate neighbors have not yet been found)
               Invoke ExamineNode(CurrentChild);
           else if (the candidate set hypersphere, i.e., the hypersphere centered at Q with the distance from Q to the current K-th NN as its radius, overlaps with the bounding hypersphere of CurrentChild) {
               if ((the distance from Q to the center of CurrentChild is less than the distance from Q to the current K-th NN) or
                   (the estimated angle θ between the candidate set hypersphere and the bounding hypersphere of CurrentChild and the weight N of CurrentChild satisfy the criteria for node access: W_i ≤ N < W_{i+1} and θ > θ_i, where 1 ≤ i ≤ n))
                   Invoke ExamineNode(CurrentChild);
               else break;
           }
           else break;
       }
}

Algorithm 3.3.4 ExamineLeafNode(Node) {
    For each point in the leaf (set CurrentPoint to the current point) {
        Compute the distance D from the query point Q to CurrentPoint;
        if (K candidate neighbors have not yet been found) {
            Add CurrentPoint to the candidate set of neighbors;
            Sort the candidate set;
        }
        else if (D is less than the distance from Q to the current K-th NN) {
            Remove the current K-th NN;
            Insert CurrentPoint into the candidate set of neighbors;
            Sort the candidate set;
        }
    }
}

Instead of using the distance to the surface, the AB-tree uses the distance from the query point to the center of the bounding hypersphere, as shown in Algorithm 3.3.3. When Algorithm 3.3.3 terminates because it reaches a child whose bounding hypersphere does not overlap with the candidate set hypersphere, it is possible that some children remaining in the sorted list overlap with the candidate set hypersphere, as illustrated in Figure 3.3. As shown in Figure 3.3, a search using order by center checks child node C1 first, then stops and does not check child node C2, since there is no overlap between child node C1 and the candidate set hypersphere. However, child node C2 does intersect the candidate set hypersphere. Thus, order by center does not guarantee that a child node is always accessed if it overlaps with the candidate set hypersphere. For exact nearest neighbor search, only order by surface can be used. However, even when a discarded child in the remaining sorted list, such as child C2 in Figure 3.3, overlaps with the candidate set hypersphere in the order-by-center case, the probability of finding points inside the overlap area is very small. Hence, there is no need to access these child nodes in approximate nearest neighbor search.

Order by center: |QC1| > |QC2|; order by surface: |QS1| < |QS2|.
Figure 3.3: Guarantee vs. Nonguarantee

Furthermore, order by center gives a better ordering with respect to the possible number of points inside the overlap area, as illustrated in Figure 3.4. As shown in Figure 3.4, the ratio of the volume of the overlap area to the volume of child node C2 is larger than that of child node C1, so the expected number of points inside the overlap area of child node C2 is larger than that of child node C1, under the assumption that child node C2 contains the same number of data points as child node C1. For approximate nearest neighbor search, this ordering change alone provides about a 3 times speedup when searching a database of 800,000 64-dimensional data points with 100 clusters while still obtaining 100% accuracy. This 100% accuracy is not guaranteed; however, we always obtained near 100% accuracy when we used only ordering by center in our experiments.

Figure 3.4: Better Ordering
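The difference between the two orderings can be seen in a few lines. The sketch below uses a made-up node layout, not the example of Figure 3.3: ordering by distance to the bounding-hypersphere surface is the SS-tree rule that preserves the exact-search guarantee, while ordering by distance to the center is the AB-tree rule used for approximate search, and the two orderings can disagree.

import numpy as np

def order_children(query, centers, radii, by="center"):
    centers = np.asarray(centers, dtype=float)
    d_center = np.linalg.norm(centers - query, axis=1)
    key = d_center if by == "center" else np.maximum(d_center - np.asarray(radii), 0.0)
    return np.argsort(key)          # indices of children in ascending order

query = np.array([0.0, 0.0])
centers = [[2.0, 0.0], [4.5, 0.0]]  # child C1: small sphere with a close center;
radii = [0.5, 3.5]                  # child C2: large sphere whose surface is closer
print("order by center :", order_children(query, centers, radii, by="center"))
print("order by surface:", order_children(query, centers, radii, by="surface"))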
3.4 Clustered AB-tree and Cached AB-tree

The AB-tree is efficient for large high-dimensional databases. Within a certain range, as the size of the database increases, the performance gain of the AB-tree over other existing methods increases. We discuss the effect of database size on performance in Section 5.2.1. However, for very large databases, better performance can be obtained by partitioning the dataset into many smaller subdatasets and accessing only a few of them during search. There are three ways to use the AB-tree in this situation.

• Semantic Clustering: Partition the dataset into many clusters using special clustering methods. Each cluster has a specific semantic meaning. Build an AB-tree for each cluster. During search only one AB-tree is accessed, since each cluster has a distinguished semantic meaning.

• Generic Clustering (Clustered AB-tree): Partition the dataset into many clusters using generic clustering methods [32]. The clusters have no special semantic meaning. Build an AB-tree for each cluster. Multiple AB-trees are accessed during search.

• Cached AB-tree: Build one AB-tree for all the data. Treat the subtrees at a certain level of the AB-tree as the clusters of the Generic Clustering approach.

Semantic Clustering is similar to the standalone AB-tree case discussed previously, and is not our focus here. In the following subsections we discuss the Clustered AB-tree and the Cached AB-tree.

3.4.1 Clustering Approach

In order to create index structures for high-dimensional databases, one approach is to partition the data set into groups using bounding hyper-polygons, such as hyperspheres and hyper-rectangles. The search algorithms for these indexing schemes generally use the partition information to avoid accessing nodes that do not overlap with the bounding hyper-polygon of the query (candidate set). Therefore, good partitioning is critical to efficient searching of the index trees. In most high-dimensional index trees, the partitioning is achieved mainly by splitting an overflowing node. Several index schemes use reinsertion to refine the partitioning; however, they are greatly limited by the factors described below.

Since data is inserted one point at a time, the partitioning is not globally optimized. Reinsertion may alleviate the problem, but it cannot eliminate it altogether, because reinsertions are done only for the data of a full node; all data points in other nodes remain untouched. We have compared the performance of the proposed index tree with other index trees that use reinsertion.
Another problem is that data partitioning by bounding hyper-polygons does not work well for higher-level nodes. At the leaf nodes, partitioning works directly on the data points. In higher-level nodes, however, the partitioning depends on the partitioning of the nodes below, and as the level of a node gets higher, the partitioning becomes less and less precise. Yet it is the partitioning of the higher-level nodes that most affects the search performance of tree-based indexing, because taking the wrong branches at higher levels leads to accessing the wrong nodes at the lower levels.

We propose a new approach to obtain a better partitioning of the data set by exploiting a clustering technique. We first cluster the whole data set into several groups, and then create an index tree for each group. Conceptually, the index tree for the whole data set can be created by combining all the root nodes of these trees. This approach of creating multiple trees, along with the proposed algorithm to filter root nodes based on the angle property described before, provides an efficient search for high-dimensional data.

We use the random sampling technique from CURE [32] for the basic clustering algorithm. In CURE, in order to capture the semantics of the data set, the sample has to be large enough to guarantee some probability that at least one data point is drawn from each cluster. After the sampling phase, a hierarchical clustering algorithm is run over the sample. Because our focus is on getting better search performance rather than capturing the semantics of the data set, we do not need to draw as large a sample as CURE does. In fact, our experimental results show that performance remains almost the same beyond some threshold on the number of subtrees. We also do not need the hierarchical clustering technique used in CURE; instead, we create an AB-tree for each cluster.

3.4.2 Implementation of Clustered AB-tree

We use the random sampling technique described in [32] to cluster the data sets. In our method, a recursive phase is used to refine the result. In the first step, a random sample is drawn from the data set and serves as the initial centers of the groups. In the second step, each data point is put into the closest group based on its distance to the group centers. In the third step, new group centers are computed by averaging the data points within each group; groups with insufficient data points are removed to achieve higher tree utilization, and their data points are inserted into other groups in the next recursive step. We repeat the second and third steps until the group centers become stable. The detailed algorithm is given in Algorithm 3.4.1.

Algorithm 3.4.1 Clustering(D: a set of data points; N: number of subtrees) {
    Randomly pick N data points, W_0, W_1, ..., W_{N-1}, from data set D;
    Repeat {
        C_i = W_i, for 0 <= i < N;
        Assign each data point in D to the group whose center C_i is closest;
        Compute W_i as the average of the data points assigned to group i;
        Remove groups with insufficient data points, so that their points are
        reassigned to other groups in the next iteration;
    } Until the group centers become stable;
}
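A runnable sketch of this refinement loop is shown below, assuming the data fit in memory; the minimum group size, iteration limit, and stopping tolerance are illustrative choices rather than values from the dissertation.

import numpy as np

def cluster(points, n_groups, min_size=10, max_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    # first step: random sample as initial group centers
    centers = points[rng.choice(len(points), n_groups, replace=False)]
    for _ in range(max_iter):
        # second step: assign each point to the closest center
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # third step: recompute centers; drop underfull groups so their points
        # are reassigned to other groups in the next iteration
        new_centers = np.array([
            points[labels == g].mean(axis=0)
            for g in range(len(centers))
            if np.count_nonzero(labels == g) >= min_size
        ])
        if new_centers.shape == centers.shape and np.allclose(new_centers, centers, atol=1e-6):
            break
        centers = new_centers
    # final assignment of every point to one of the kept groups
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
    return centers, d.argmin(axis=1)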
3.4.3 Cached AB-tree

As shown later in Section 5.3, the Clustered AB-tree outperforms the AB-tree by up to 4 times. However, the major drawback of the Clustered AB-tree is that the whole data set is required in advance in order to partition it into well-grouped clusters using the clustering algorithm. As databases evolve, the clusters may deteriorate and become unbalanced due to deletions and insertions. Thus one central balanced AB-tree is more suitable for dynamic databases.

For very large databases, we can cache one whole level of nodes of an AB-tree in memory to reduce the number of node accesses. As mentioned in Section 1.2, main memory access structures are not suitable for huge databases. However, we can still exploit the speed of main memory by caching, since the cost of main memory keeps decreasing. The hierarchical structure of the AB-tree makes it easy to cache the nodes of different levels, depending on the available cache size. As the cache size increases, the number of node accesses decreases; we address the effect of cache size in Section 5.4.

Because a whole level of nodes is in memory, the cached nodes can be sorted in ascending order by the distance from the query point Q to the center of each node without any node accesses. This ordering at the cached level of the AB-tree provides more detailed ordering information than is available at the same level of the original AB-tree, because it is computed over all cached nodes for each query point, whereas the ordering obtained while traversing the tree reflects only the local ordering within the current subtree. This ordering makes the radius of the candidate set decrease more quickly during search than in the AB-tree, and thus increases performance. This ordering, along with the proposed algorithm to filter nodes based on the angle property, provides a more efficient search for huge high-dimensional databases. We discuss the search algorithm for the Cached AB-tree in Section 3.4.4, and present the performance gain of the Cached AB-tree over the AB-tree in Section 5.4.

3.4.4 Implementation of Cached AB-tree

In the Cached AB-tree, we cache all the nodes at a particular level of the AB-tree by storing only the center coordinates, the radius, and the weight of each node. Which level to cache depends on the trade-off between cache size and performance; the effect of cache size is addressed in Section 5.4. The subtree rooted at a cached node is accessed only if the angle filtering condition is satisfied. The angle filtering condition is given in Algorithm 3.4.3.

The search algorithm for the Cached AB-tree has two steps. We first sort the cached nodes in ascending order by the distance from the query point Q to the centers of the nodes. Then the same search technique as that of the AB-tree, given in Section 3.3, is applied. Algorithm 3.4.3 is the detailed approximate nearest neighbor search algorithm for the Cached AB-tree. The algorithm ExamineNode(Node) invoked in Algorithm 3.4.3 is the same as Algorithm 3.3.2 given in Section 3.3.

Algorithm 3.4.3 K Nearest Neighbor Search for Cached AB-tree {
    Sort the cached nodes in ascending order by distance from the query point Q to the centers of the nodes;
    For each node in the sorted list (set CurrentNode to the current node) {
        if (K candidate neighbors have not yet been found)
            Invoke ExamineNode(CurrentNode);
        else if (the candidate set hypersphere overlaps with the bounding hypersphere of CurrentNode) {
            if ((the distance from Q to the center of CurrentNode is less than the distance from Q to the current K-th NN) or
                (the estimated angle θ between the candidate set hypersphere and the bounding hypersphere of CurrentNode and the weight N of CurrentNode satisfy the criteria for node access: W_i ≤ N < W_{i+1} and θ > θ_i, where 1 ≤ i ≤ n))
                Invoke ExamineNode(CurrentNode);
            else break;
        }
        else break;
    }
}
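The in-memory phase of this search can be sketched compactly. In the sketch below, estimate_angle, should_access, and examine_node are assumed callbacks (the angle estimate and the descent into a subtree are not shown), the dictionary layout of a cached node is an illustrative choice, and the bootstrap case in which fewer than K candidates have been found so far is omitted.

import numpy as np

def search_cached_level(query, cached_nodes, kth_nn_dist,
                        estimate_angle, should_access, examine_node):
    # cached_nodes: iterable of dicts {"center": ndarray, "radius": float, "weight": int}
    ordered = sorted(cached_nodes,
                     key=lambda n: np.linalg.norm(n["center"] - query))
    for node in ordered:
        d_center = np.linalg.norm(node["center"] - query)
        overlaps = d_center - node["radius"] <= kth_nn_dist
        angle = estimate_angle(query, node, kth_nn_dist)
        if overlaps and (d_center < kth_nn_dist
                         or should_access(node["weight"], angle, d_center, kth_nn_dist)):
            # descend into the subtree rooted at this cached node; the call returns
            # the (possibly smaller) distance to the current K-th nearest neighbor
            kth_nn_dist = examine_node(node, query, kth_nn_dist)
        else:
            break
    return kth_nn_dist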
Chapter 4

Parameter Estimation

The basic idea of the AB-tree is to reduce the number of node accesses during search using the angle filtering conditions based on the Angle Property presented in Section 2.3. The parameter set (threshold angles, etc.) of the angle filtering conditions controls the accuracy of the results. As the threshold angles decrease, the accuracy increases, but the efficiency decreases. So the best threshold angles for a particular application are a tradeoff between efficiency and accuracy.

One important requirement in the AB-tree is to derive the right parameter set for a desired accuracy. This chapter discusses how to automatically generate a set of threshold angles that maximizes the performance of the AB-tree for a given database with some desired accuracy. First we present the need for parameter autotuning in Section 4.1 and the implementation of parameter autotuning in Section 4.2. Then we discuss the tradeoff between performance and accuracy in Section 4.3. Finally, we discuss the stability of the derived parameters with respect to database updates in Section 4.4 and the maintenance cost in Section 4.5.

4.1 Introduction

As discussed in Section 3.2, we can use a single threshold angle in the AB-tree. The threshold angle is initialized to some value based on the desired accuracy and the probability of d-dimensional data points having an angle less than θ (0 ≤ θ ≤ 180 degrees), discussed in Section 2.3. We then use binary search to find the right threshold angle for the desired accuracy. Because the final threshold angle is found by the binary search, we initialize the threshold angle to 60 degrees based on experience. We discuss the effect of the initial values of the threshold angles on maintenance cost in Section 4.5. The binary search for one threshold angle is detailed in Algorithm 4.1.1.

Algorithm 4.1.1 BinarySearch(desiredAccuracy) {
    Initialize {
        threshold angle: θ_T = 60;
        maximum angle: θ_max = 90;
        minimum angle: θ_min = 0;
    }
    Repeat {
        Do 100 trials of k-nearest neighbor searches based on the threshold angle;
        Compute the accuracy of the results;
        If (computedAccuracy < desiredAccuracy) {
            θ_T = (θ_T + θ_min) / 2;
            θ_max = θ_T;
        } else {
            θ_T = (θ_T + θ_max) / 2;
            θ_min = θ_T;
        }
    } Until (θ_max − θ_min) ≤ 1;
    Return θ_T;
}

We first tuned the threshold angle manually based on Algorithm 4.1.1 BinarySearch(desiredAccuracy). With experience, we mixed different approaches, such as linear interpolation and binary search, to find the threshold angle quickly. We found that the AB-tree using one threshold angle tuned for 90% accuracy outperforms the AB-tree with one threshold angle tuned for 99% accuracy by several times. However, we can increase the performance of the AB-tree further by using multiple threshold angles, as discussed in Section 3.2.3, because a single threshold angle cannot capture the different effects of the nodes in the higher levels and those in the lower levels of the AB-tree. Table 4.1 shows the performance of the AB-tree using different numbers of threshold angles for an 800,000-point 16-dimensional database with 90% accuracy (the accuracy is 100% for 0 threshold angles, i.e., exact search). The results in Table 4.1 are the averages of the numbers of node accesses over 1000 random 20-nearest-neighbor queries.

Number of Threshold Angles              Number of Node Accesses    Accuracy
0 (exact search)                        1392                       100%
1 (Granularity Level 1)                 377                        90%
3 (Granularity Level 2)                 41                         90%
one per node (Granularity Level 3)      136                        90%
Table 4.1: Performance of AB-tree Using Different Numbers of Threshold Angles

However, as the number of threshold angles increases, it is no longer practical to derive all the threshold angles manually. Thus, it is necessary to be able to generate the threshold angles automatically.
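A runnable rendering of Algorithm 4.1.1 is sketched below. run_trials is an assumed callback that executes a batch of k-NN queries with the given threshold angle and returns the measured accuracy; the 60-degree starting point and the 1-degree stopping width follow the text.

def binary_search_threshold(desired_accuracy, run_trials,
                            theta=60.0, theta_min=0.0, theta_max=90.0):
    while theta_max - theta_min > 1.0:
        accuracy = run_trials(theta)          # e.g. 100 random k-NN queries
        if accuracy < desired_accuracy:
            theta = (theta + theta_min) / 2.0  # too aggressive: lower the angle
            theta_max = theta
        else:
            theta = (theta + theta_max) / 2.0  # accurate enough: try a larger angle
            theta_min = theta
    return theta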
4.2 Parameter Autotuning

The first issue in parameter autotuning of the AB-tree is to determine the number of threshold angles. We consider the following three levels of granularity to motivate our approach to autotuning.

• Granularity Level 1: one threshold angle for all nodes

• Granularity Level 2: one threshold angle for each index level of the AB-tree

• Granularity Level 3: one threshold angle per node

Granularity Level 1 is too coarse. One threshold angle cannot capture the different effects of the nodes in the higher levels and those in the lower levels of the AB-tree. The performance gain achieved with Granularity Level 1 is limited, as shown in Table 4.1.

Granularity Level 3 is too fine-grained. The total number of nodes in the AB-tree for a huge database is very large. It is not feasible to maintain so many parameters in a data structure separate from the AB-tree, and therefore we maintain each threshold angle with the corresponding node of the AB-tree. This built-in threshold angle is determined from the local information of the corresponding node, and is updated dynamically as the node is updated as a result of database updates. The advantage of this approach is that there is no need to retune the built-in angles separately as the database changes. One problem of this built-in angle approach is to come up with a generic algorithm to compute the built-in threshold angles. We have used several heuristics to estimate the built-in angles, such as using the minimum of the angles between any two data points inside the node, measured with respect to the center of the node. The performance of Granularity Level 3 is much better than that of Granularity Level 1, as shown in Table 4.1. However, one big disadvantage of Granularity Level 3 is that the accuracy and performance vary as the database changes, and there is no way to retune the accuracy to meet different accuracy requirements, since the accuracy for a given database is fixed once the algorithm for computing the built-in threshold angles is fixed. For example, two AB-trees would be needed, based on two different algorithms for computing the built-in threshold angles, if there were two different desired accuracy requirements for the same database. This can easily be achieved in Granularity Level 2 by using different sets of threshold angles to meet different accuracy requirements with one AB-tree.

Granularity Level 2 has the advantage of Granularity Level 1 in that the accuracy is tunable, and the advantage of Granularity Level 3 in that nodes in different levels of the AB-tree use different threshold angles. Thus, we choose Granularity Level 2 for its robustness and high performance. We discuss the complexity of parameter autotuning in Section 4.2.1, then present the algorithm for parameter autotuning in Section 4.2.2.

4.2.1 Complexity

For a given AB-tree, we know the dimension of the feature vector, the data type of the feature vector, the size of the record stored in the leaves, and the height, block size, and average block utility of the AB-tree. The maximum number of points stored in a leaf node, weight[0], is determined by Equation 4.2.1:

    weight[0] = blockSize / (dimension × typeSize + recordSize)        (4.2.1)

where typeSize is the size of one component of the feature vector, and recordSize is the size of a data record. The fanout of the AB-tree is determined by Equation 4.2.2.
    fanout = blockSize / (dimension × typeSize + pointerSize)        (4.2.2)

Taking into account the block utility of the AB-tree, the maximum number of points stored in a node i levels above the leaf nodes, weight[i], is determined by Equation 4.2.3:

    weight[i] = weight[i − 1] × fanout × blockUtility        (4.2.3)

where weight[i − 1] is the maximum number of points stored in a node (i − 1) levels above the leaf level. Noting that the root of the AB-tree is always accessed, the number of threshold angles is determined by Equation 4.2.4:

    NumberOfThresholdAngles = heightOfTree − 1        (4.2.4)

where heightOfTree is the height of the AB-tree, roughly estimated by Equation 4.2.5:

    heightOfTree ≈ log_fanout(numberOfPoints)        (4.2.5)

where numberOfPoints is the total number of data points in the AB-tree.

The possible value of a threshold angle is in the range [0, 90]. For simplicity, assume a threshold angle is an integer. Then it takes 90^NumberOfThresholdAngles iterations to find the globally optimal threshold angles such that the performance is best for the requested accuracy. Each iteration requires 100 trials of approximate k-nearest-neighbor search. Given this complexity, it is infeasible to find the globally optimal angles. Our approach is to find the best threshold angle for the leaf nodes first, using Algorithm 4.1.1 BinarySearch(desiredAccuracy) as discussed in Section 4.1, then find the best threshold angle for the parent nodes of the leaf nodes, and so on until all the threshold angles for the required accuracy are found. Therefore, it needs only (NumberOfThresholdAngles × log 90) iterations. Our approach does not guarantee finding globally optimal threshold angles, but it does find good threshold angles that yield high performance for a desired accuracy.

For the Clustered AB-tree and the Cached AB-tree, the parameter tuning is almost the same as that of the AB-tree, except that NumberOfThresholdAngles is smaller. For the Clustered AB-tree, the height of any subtree is smaller than that of the AB-tree, because the sizes of the subtrees are much smaller. For the Cached AB-tree, NumberOfThresholdAngles is equal to the height of the AB-tree minus the cache level. Thus the complexity and cost of finding threshold angles for the Clustered AB-tree and the Cached AB-tree are smaller than those of the AB-tree. We discuss the maintenance cost for the AB-tree in Section 4.5.

4.2.2 Algorithm for Parameter Autotuning

To tune the parameters, an angle parameter file is initialized based on the information of the AB-tree in the first step. In the second step, k-nearest-neighbor search is performed 100 times based on the angle parameter file. In the third step, the results are analyzed and the angle parameter file is updated based on the resulting accuracy and the required accuracy. We repeat the second and third steps until all the threshold angles are found. The detailed algorithm is given in Algorithm 4.2.1.

Algorithm 4.2.1 ParameterAutotuning(requiredAccuracy) {
    InitializeAngleParameterFile();
    GenerateAngleParameterFile(requiredAccuracy);
}

We can reduce the calibration cost by reusing the threshold angles for a higher accuracy as the initial values of the threshold angles for a lower accuracy, as shown in Algorithm 4.2.2 InitializeAngleParameterFile(). Experiments indicate that accuracy decreases monotonically as the threshold angles increase. Even if the monotonicity between accuracy and threshold angles did not hold in some very rare case, Algorithm 4.2.2 would still generate threshold angles at a similar cost due to the nature of binary search, except that the resulting threshold angles might not be good. However, this never happened in any of our experiments.
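The structural quantities used by the autotuning procedure (Equations 4.2.1 to 4.2.5) can be computed directly from the tree configuration. The sketch below is a minimal illustration; the example block size, type size, and utility are assumed values, not the configuration of the experiments.

import math

def tree_parameters(block_size, dimension, type_size, record_size,
                    pointer_size, block_utility, number_of_points):
    leaf_capacity = block_size // (dimension * type_size + record_size)   # eq. 4.2.1
    fanout = block_size // (dimension * type_size + pointer_size)         # eq. 4.2.2
    height = math.ceil(math.log(number_of_points, fanout))                # eq. 4.2.5 (rough)
    weights = [leaf_capacity]
    for _ in range(1, height):
        weights.append(int(weights[-1] * fanout * block_utility))         # eq. 4.2.3
    n_threshold_angles = height - 1                                       # eq. 4.2.4
    return leaf_capacity, fanout, height, weights, n_threshold_angles

print(tree_parameters(block_size=8192, dimension=16, type_size=4, record_size=8,
                      pointer_size=8, block_utility=0.7, number_of_points=800_000))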
Figure 4.2 shows the monotonic relationship between accuracy and the threshold angles.

Algorithm 4.2.2 InitializeAngleParameterFile() {
    if (an angle parameter file for an accuracy higher than the required accuracy exists) {
        // reuse the angle parameter file for the higher accuracy
        // do nothing here
    } else {
        // initialize the angle parameter file
        weight[0] = blockSize / (dimension × typeSize + recordSize);
        fanout = blockSize / (dimension × typeSize + pointerSize);
        θ_0 = 60;
        for (i = 1; i < NumberOfThresholdAngles; i++) {
            weight[i] = weight[i − 1] × fanout × blockUtility;
            θ_i = 60;
        }
    }
}

Most of these works used data structures that are stored in main memory, so the query time consists mostly of computation cost. Our focus here is on approximate nearest neighbor search in databases, where I/O cost is the primary source of query time. We define approximate nearest neighbor search in terms of accuracy: the percentage of the true nearest neighbors in the answer. To be comparable to (1 + ε)-approximate K-nearest-neighbor search, we measured the ε values of 1000 20-NN queries on the previous uniform and clustered databases. As shown in Table 5.2, for 90% accuracy, ε is less than 0.035 in 8 dimensions, decreases as the dimension increases, and is about 0.012 in 64 dimensions for clustered data. As mentioned in Section 1.4, approximate nearest neighbor search with 90% accuracy and a relative error bound in distance ε < 0.035 is just as good as exact nearest neighbor search.

                Uniform Data                 Clustered Data
Dimension       8        16       32        16       32       64
ε               0.035    0.0217   0.0139    0.0141   0.0139   0.0121
Table 5.2: Relative Error Bound in Distance for 90% Accuracy

Figures 5.15 and 5.16 show sample retrieval results using exact and approximate K-NN search based on the color histogram, respectively. Ten images were requested (k = 10). The first image is also the query image. With 90% accuracy, the first 9 images are exactly the same, so the approximate result is as good as the exact one.

Figure 5.15: Result of Query Using Exact NN-search

Figure 5.16: Result of Query Using Approximate NN-search

Chapter 6

PicFinder: A Prototype Image Database System

In order to provide a practical example for testing the AB-tree described in Chapter 3, we have designed and implemented a prototype system. PicFinder, a prototype image database system, is a content-based image retrieval system. It has over forty-four thousand color images and currently provides distance measures based on color and texture. It has two different indexing schemes: the SS-tree for exact search and the AB-tree for approximate search. The main purpose of this system is to demonstrate the superior performance of the AB-tree over that of the SS-tree while providing almost the same quality of hits.
6.1 Introduction

Besides the traditional high-volume sources of digital imagery, such as remote-sensing agencies and medical imaging facilities, a variety of new and diverse domains, such as art museums, news agencies, and the travel industry, have recently adopted digital technology for archiving images. In the face of this explosive proliferation of image repositories, we find ourselves ill equipped to efficiently retrieve images from large collections. For collections containing even several hundred images, manual browsing is practically impossible. Initial attempts at developing computerized image-management systems consisted of extending traditional databases to handle images. This approach has not been very successful, mainly because of the externally generated descriptors used to index images in database systems.

In recent years, a new field of research has emerged: content-based image retrieval. Here, we aim to index images based on descriptions, such as color and texture, derived automatically from the image. Such descriptions are adapted to the image and are usually more effective than external descriptors, such as keywords, in characterizing the visual information in the image. The similarity between two images can then be determined by comparing their descriptions.

Content-based image retrieval systems such as IBM's QBIC [29], Webseek [73], PicToSeek [31], and ImageRover [76] use the query-by-similar-images paradigm. They differ in how they find the initial set of images. QBIC displays an initial set of images; the user selects an image, and the search engine then ranks the database images by similarity to the selected image with respect to color, texture, shape, or all of these criteria. Webseek and ImageRover use text queries to narrow the initial set of images, and PicToSeek asks the user to supply an initial image. In our system, PicFinder, the user can either supply his/her own initial query image or choose one from an initial set of images provided by PicFinder.

As discussed in Section 1.4, features and similarity measures in content-based image retrieval systems are chosen based on heuristics and are not mathematically precise parameters. Approximate nearest neighbor searches are therefore well suited to these systems. The major contribution of PicFinder is to provide users with a fast content-based image retrieval system based on the AB-tree indexing presented in Chapter 3. In PicFinder users can choose the AB-tree indexing scheme for approximate search or the SS-tree for exact search, or compare the performance and the quality of nearest neighbors between the two structures.

6.2 The Design of the PicFinder System

The PicFinder system is a prototype content-based image retrieval system. It uses a client-server structure. The client is a Java applet written in Java version 1.1, and runs in any web browser supporting Java version 1.1.7 or higher. The server is written in C++ and consists of approximately 30,000 lines of source code. The images in the database consist of 44,300 color images taken from a commercial image gallery published by IMSI-USA. This image database is a heterogeneous collection of images, including animals, background objects, food, plants, people, military, religion, scenery, structures, transportation, etc. They are stored in JPEG format on a local hard drive, taking about 1 gigabyte of space. PicFinder has been tested on UNIX and MS-WINDOWS.
The URL of PicFinder is http://www.cse.msu.edu/~lijinhua/PicFinder. Figure 6.1 shows the system overview of PicFinder, including the relationship between client and server. When a user sends an image query from a Java-enabled web browser or client program to the server, the feature extraction module extracts the selected feature from the query image; the matcher module then uses the extracted feature to search the feature database using the selected index structure and sends the best-ranked hit images back to the browser.

Figure 6.1: A Diagram of PicFinder. The client interface creates a query image (by URL, sample images, etc.) and displays the results; the server contains the matcher, the index structures (AB-tree, SS-tree, etc.), and the image database.

The primary modules of PicFinder consist of:

• a Java client for visual query input and for displaying results,

• feature extraction,

• index structures,

• the matcher.

The feature extraction module is based on IBM's QBIC. It can currently extract the color feature and the texture feature from an image in the following commonly used image formats: bitmap, gif, jpeg, and tiff.

• The color feature is the color distribution of each image in a predetermined 32-color space. Image similarity is based on the similarity of the color distributions.

• The texture feature is the texture information of each image. Image similarity is based on several texture attributes such as directionality, coarseness, and contrast.

PicFinder has two different indexing schemes: the SS-tree for exact search and the AB-tree for approximate search. The matcher module compares the feature extracted from the query image to the feature database using the selected index structure and sends the best-ranked hit images back to the browser. We discuss the Java client, the user interface of PicFinder, in Section 6.3.

PicFinder allows the creation of a composite distance measure based on multiple features. The composite distance measure is a weighted average of the distance measures of the selected features. Systems such as QBIC [29] and Virage [33] offer the user the ability to take weighted combinations of color, texture, shape, and position measures, combined with keyword search. In a multi-feature query, QBIC searches through different types of feature data in the database in order to find images that closely resemble the query image. All features are treated equally during the database search, and all involved features are searched at the same time. An example of a multi-feature query would be finding images in the database that have a color distribution and texture similar to a query image. For multi-feature queries, the user can weight the features to specify their relative importance, which provides flexibility for advanced applications where the returned results must be fine-tuned. Results of PicFinder can be combined to form a new query based on the user's feedback; searching with this new composite query may yield more desirable results.
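As a concrete illustration of the composite measure, the sketch below combines per-feature distances with user-supplied weights. The Euclidean per-feature distance and the feature layouts are simplifying assumptions, not the exact similarity measures used by PicFinder.

import numpy as np

def composite_distance(query_feats, image_feats, weights):
    # query_feats, image_feats: dict feature_name -> ndarray; weights: dict feature_name -> float
    total_w = sum(weights.values())
    return sum(w * np.linalg.norm(query_feats[name] - image_feats[name])
               for name, w in weights.items()) / total_w

query = {"color": np.random.rand(32), "texture": np.random.rand(3)}
image = {"color": np.random.rand(32), "texture": np.random.rand(3)}
print(composite_distance(query, image, weights={"color": 0.7, "texture": 0.3}))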
6.3 The PicFinder Interface

[Screenshot of the PicFinder query interface in a web browser, showing the query image URL field, the CPU time for the query, and the first similar images retrieved.]