NEAR DUPLICATE IMAGE SEARCH

By

Fengjie Li

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy

2014

ABSTRACT

NEAR DUPLICATE IMAGE SEARCH

By Fengjie Li

Information retrieval addresses the fundamental problem of how to identify the objects in a database that satisfy the information needs of users. In the face of information overload, the major challenge in search algorithm design is to ensure that useful information can be found both accurately and efficiently in large databases. To address this challenge, different indexing and retrieval methods have been proposed for different types of data, namely sparse data (e.g., documents), dense data (e.g., dense feature vectors) and bags-of-features (e.g., images represented by local features). For sparse data, inverted indexes and document retrieval models have proved very effective for large scale retrieval problems. For dense data and bag-of-features data, however, there are still open problems. For example, Locality Sensitive Hashing, a state-of-the-art method for searching high dimensional vectors, often fails to make a good tradeoff between precision and recall: it tends to achieve high precision with low recall, or vice versa. The bag-of-words model, a popular approach for searching objects represented by bags-of-features, has limited performance because of the information lost during the quantization procedure. Since the general problem of searching objects represented as dense vectors or bags-of-features may be too challenging, this dissertation focuses on near duplicate search, in which the matched objects are almost identical to the query. By effectively exploiting the statistical properties of near duplicates, we can design more effective indexing schemes and search algorithms.
Thus, the focus of this dissertation is to design new indexing methods and retrieval algorithms for near duplicate search in large scale databases that accurately capture the data similarity and deliver more accurate and efficient search. Below, we summarize the main contributions of this dissertation. Our first contribution is a new algorithm for searching near duplicate bag-of-features data. The proposed algorithm, named random seeding quantization, is more efficient at generating bag-of-words representations for near duplicate images. The new scheme is motivated by approximating the optimal partial matching between bags-of-features, and thus produces a bag-of-words representation that captures the true similarities of the data, leading to more accurate and efficient retrieval of bag-of-features data. Our second contribution, termed Random Projection Filtering (RPF), is a search algorithm designed for efficient near duplicate vector search. By explicitly exploiting the statistical properties of near duplicates, the algorithm projects high dimensional vectors into a lower dimensional space and filters out irrelevant items. This effective filtering procedure makes RPF more accurate and efficient at identifying nearly duplicate objects in databases. Our third contribution is to develop and evaluate a new randomized range search algorithm for near duplicate vectors in high dimensional spaces, termed Random Projection Search (RPS). Different from RPF, this algorithm is suitable for a wider range of applications because it does not require sparsity constraints for high search accuracy. The key idea is to project both the data points and the query point into a one dimensional space by a random projection, and to perform a one dimensional range search, using binary search, to find the subset of data points that are within the range of a given query.
We prove the theoretical guarantee for the proposed algorithm and evaluate its empirical performance on a dataset of 1.1 billion image features.

To my family.

ACKNOWLEDGEMENTS

I consider completing my Ph.D. the first real challenge I have ever had in my life. I have evolved as much as my dissertation has been revised since the first day of my Ph.D. program. None of this would have been possible without the help of many people. First and foremost, I would like to express my great gratitude to my thesis advisor, Prof. Rong Jin. It was his course on "Information Retrieval" that led me to this research area. Over the years, he taught me tremendous knowledge about search and machine learning. He taught me how to find the right research problem and how to solve the problem right. He taught me the importance of mathematical rigor in research. He also gave me the most important, lifelong tools for achieving anything significant: perseverance and consistent hard work. Without Dr. Rong Jin, neither my dissertation nor my personal growth would have been possible. I would also like to express my appreciation to Dr. Anil Jain. The "image" part of my dissertation definitely goes to him. He taught me everything about pattern recognition and image processing. He also showed me the importance of keeping a very high standard while getting things done promptly, which will benefit me for the rest of my career. I am grateful to Dr. Eric Torng, who has been the most supportive person since I joined the program. His support in my hardest times is irreplaceable. I would also like to thank Dr. Aviyente for her insightful suggestions and discussions during my research. I would like to thank my colleagues: Wei Tong, Yang Zhou, Tianbao Yang, Mehrdad Mahdavi, Jinfeng Yi, Zheyun Feng, Lei Wu and Jainpeng Xu. I'd like to thank the people I have worked with at the PRIP lab, especially Abhishek Nagar, Pavan Mallapragada and Serhat Selcuk Bucak.
I appreciate all the help I got from Linda Moore, Norma Teague and Adam Pitcher. Personally, I'd like to thank all my teammates on the Michigan State University Chinese Soccer team. We had so many memorable moments together that I will never forget. In particular, I would like to thank Zheng Wang, for being supportive and thoughtful as my friend both on and off the field. Among all my friends at Michigan State University, I want to thank Yi Huang and his family in particular, for all the happy times we have had together. I am grateful to all my best friends in China, including Peng Xu, Zhe Wang, Shaoan Hou, Kai Lv, Xiaoming Ji, Xin Zhao, Yi Li and Lipei Duan. Their emotional support has never faded with the physical distance between us. My final and most grateful thanks go to my family. Deep in my heart, it is their love and support that motivates me. My mom and dad are the first teachers in my life, who taught me the most important principles in my life and continue to teach me every day. My only wish is to be a better son to them. I owe great appreciation to my parents-in-law, who spent so much time caring for us in Michigan. I want to express my most heartfelt appreciation and love to my wife Li, whose love and trust have been vital to me over the past ten years. I want to say "thank you and I love you" to my two dearest daughters, Siying and Sihan, who are the most precious gifts I have ever had in my life.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

CHAPTER 1  INTRODUCTION
  1.1 Large scale high dimensional data retrieval
    1.1.1 Retrieval Methods for Sparse Data
    1.1.2 Retrieval Methods for Dense Data
    1.1.3 Retrieval Methods for Bags-of-Features
  1.2 Thesis contributions
CHAPTER 2  LITERATURE REVIEW
  2.1 Retrieval Methods for Sparse Data
    2.1.1 Sparseness and Inverted Indexes
    2.1.2 Retrieval Models
    2.1.3 Comparison: Retrieval Models for Sparse Data
    2.1.4 Summary
  2.2 Retrieval Methods for Dense Data
    2.2.1 KD-Tree, Randomized KD-tree and Randomized Forests
    2.2.2 Locality Sensitive Hashing
    2.2.3 Comparison: Search Methods for Dense Data
    2.2.4 Summary
  2.3 Retrieval Methods for Bag of Features
    2.3.1 Pyramid Matching and Pyramid Match Hashing
    2.3.2 The Bag of Words Model
    2.3.3 Min-Hashing and Geometric Min-Hashing
    2.3.4 Comparison: Retrieval Methods for Bag-Of-Features
    2.3.5 Summary

CHAPTER 3  RANDOM SEEDING FOR KEYPOINT QUANTIZATION
  3.1 Introduction
  3.2 Related Work
  3.3 Random Seeding for Keypoint Quantization
    3.3.1 Distance Measure Based on Optimal Partial Matching
    3.3.2 Approximating the Optimal Partial Matching by A Smooth Similarity Function
    3.3.3 Random Seeding Algorithm
  3.4 Experiments
    3.4.1 Near Duplicate Image Retrieval
    3.4.2 Tattoo Image Retrieval
    3.4.3 Automatic Image Annotation
  3.5 Conclusion

CHAPTER 4  RANDOM PROJECTION FILTERING
  4.1 Introduction
  4.2 Problem Definition
  4.3 Random Projection Filtering Algorithm
  4.4 Experiments
    4.4.1 Experiments on Synthetic Datasets
    4.4.2 Experiments on image feature dataset
    4.4.3 Experiment on 80 million image dataset
  4.5 Conclusion

CHAPTER 5  EFFICIENT RANGE SEARCH IN HIGH DIMENSIONS
  5.1 Introduction
  5.2 Problem Definition
  5.3 Algorithm and Theory
  5.4 Experiments
    5.4.1 Experimental Setup
    5.4.2 Results and discussion
  5.5 Conclusion

CHAPTER 6  SUMMARY AND CONCLUSIONS
  6.1 Contributions
  6.2 Future Work
  6.3 Conclusions

APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 1.1 Real World Applications of Information Retrieval
Table 2.1 Complexities of Nearest Neighbor Search Algorithms
Table 2.2 Complexity of approximate nearest neighbor search algorithms. For the LSH complexity under L2 distance, we have f(1 + ε) < 1/(1 + ε).
Table 3.1 Summary of experiments
Table 3.2 Time for generating bag-of-words models for the Oxford5K dataset.
Table 3.3 Retrieval accuracy (mAP) and query time (seconds) for near duplicate image retrieval on the Oxford5K dataset. "Cluster Seeding" in the last row refers to range search quantization using cluster centers.
Table 3.4 Retrieval accuracy (mAP), query time, and quantization time (i.e., time for mapping keypoints to visual words) for the clustering-based method and the random seeding method.
Table 4.1 Effectiveness of the filtering procedure
Table 4.2 RPF Accuracy and Recall on 10 Million SIFT features
Table 4.3 Search time (seconds) comparison on 10 Million SIFT features
Table 4.4 Training/indexing time (seconds) comparison on 10 Million SIFT features
Table 4.5 Performance and scalability comparison, searching 100 million features
Table 4.6 Performance comparison, searching 80 million images
Table 5.1 Performance comparison, searching 10 Million Features
Table 5.2 Performance comparison, RPF and RPS algorithm on 10 million features
Table 5.3 Performance comparison, RPF and RPS algorithm on 80 million images
Table 5.4 Performance and scalability comparison, searching 1.1 Billion Features

LIST OF FIGURES

Figure 2.1 An example of KD-tree and resulting space partition.
Figure 2.2 An example of priority KD-tree search. Subtrees v1 through v4 are initially in the queue. v1, the closest node, is selected first. When descending the tree to leaf node w, we enqueue the siblings u1, u2, and u3 along the way [1].
Figure 2.3 Illustration of the difference between general hashing and locality sensitive hashing.
Figure 2.4 Illustration of (a) pyramid matching and (b) pyramid match hashing. The pyramid matching approach takes two sets of feature vectors and maps them to multi-resolution histograms, and intersects them to efficiently approximate the optimal partial matching between the original feature sets. The pyramid match hashing approach utilizes an embedding of the pyramid histogram and randomly hashes the new representation to allow sub-linear time indexing.
Finally, the pyramid match is applied only to a small portion of the database examples [2].
Figure 2.5 Illustration of the bag-of-words representation of images
Figure 2.6 The geometric ordering constraints for bundled features [3]. (a) matches preserving the relative ordering of the SIFT features inside a bundle; (b) matches not respecting geometric order are penalized.
Figure 3.1 The distribution of Euclidean distances between SIFT keypoints and their nearest visual words, computed based on the Oxford5K dataset (14,972,956 keypoints) with 10K clusters.
Figure 3.2 An example showing the relationship between the optimal partial matching similarity and the upper bound in Lemma 1. It shows that when two sets of keypoints (corresponding to two images) are similar, the upper bound in Lemma 1 provides a good approximation to the optimal partial matching.
Figure 3.3 The effect of radius r on the retrieval accuracy of the random seeding method for keypoint quantization. R is the average pairwise keypoint distance.
Figure 3.4 The quantization time of the proposed random seeding algorithm and the clustering based approach. The vocabulary is fixed to be 1M visual words generated from the Oxford5K dataset. Background images from the Flickr dataset are gradually added to the gallery. The vocabulary construction time is not shown here.
Figure 3.5 The CMC curves of the tattoo retrieval system using the proposed random seeding algorithm and the clustering algorithm for keypoint quantization. The number of visual words is set to 1 million.
Figure 3.6 Example query images and the top 5 images retrieved by both the random seeding algorithm (first row) and the clustering based approach (second row).
Figure 3.7 The results of a kNN based image annotation system using the proposed random seeding algorithm for keypoint quantization as well as the clustering based method. k is set to 50 based on cross validation, and the number of visual words is set to 1 million.
Figure 4.1 Performance of Random Projection Filtering (RPF) with different K and d
Figure 4.2 Offline tree building time and query time
Figure 4.3 Performance of Random Projection Filtering (RPF) with different δ and ε
Figure 4.4 Accuracy comparison on 10 Million SIFT Features
Figure 4.5 Examples of retrieved images for different algorithms. From top to bottom: RPF, LSH, SH, PCAH, LCH and STH.
Figure 5.1 Performance of Gaussian Projection Search with different numbers of projections
Figure 5.2 Performance of Rademacher Projection Search with different numbers of projections
Figure 5.3 Performance comparison for different coding lengths, searching 1 Million features
Figure 5.4 Examples of search results: RPS (first row) and RPF (second row) algorithm

LIST OF ALGORITHMS

Algorithm 1 KD-tree Construction Algorithm
Algorithm 2 KD-tree Search Algorithm
Algorithm 3 Procedure of LSH Indexing and Searching
Algorithm 4 Procedure of Geometric Min-Hashing
Algorithm 5 Random Seeding Approach for Keypoint Quantization
Algorithm 6 Efficient Duplicate Detection for High Dimensional Vectors
Algorithm 7 An Efficient Range Search Using Gaussian Random Variables
Algorithm 8 Efficient Range Search Using Rademacher Random Variables

CHAPTER 1
INTRODUCTION

We are facing the problem of information overload: vast amounts of data are being produced by different sources such as the Internet, research laboratories and personal digital devices. For example, the number of websites reached 238 million by June 2009 [4], a figure that has doubled since 2006. The number of web pages indexed by Google exceeded 1 trillion by 2008 [5]. Flickr, an image and video hosting website, hosts over 3 billion images [6]. At the same time, huge amounts of data have been generated by research communities. For example, databases like Omnibus and ArrayExpress provide billions of gene expression measurements [7]. In addition to its massive volume, data is also being generated faster than ever [8]. For instance, large sky surveys like the Large Synoptic Survey Telescope generate tens of terabytes of data every 24 hours [9]. It is expected that the overall volume of data will increase by 30% per year. Such a high growth rate is due to the fact that new technologies are becoming more accessible to people [10]. Besides data size, another challenge of data processing is high dimensionality, i.e., the content of each data point often has to be represented by a large number of attributes. For instance, a document is usually represented by a word histogram over hundreds of thousands of words; the dimensionality of microarray data can easily go beyond a thousand [11]; and an image is often represented by several hundred visual features, including shape, color, texture and other useful information.
Finally, to capture the relationship between data points, it is important to find proper similarity measurements, and it has been shown that non-linear similarity measurements, or even multiple similarity measurements, are desirable for real-world applications [12–14]. Different similarity measurements may introduce additional complexity into the search problem. For example, if non-linear kernel functions are used, efficient indexing methods such as the inverted list are no longer suitable, because the underlying linear similarity assumption behind inverted-index matching no longer holds.

1.1 Large scale high dimensional data retrieval

Information retrieval addresses the fundamental problem of how to efficiently and accurately identify the objects in a large database that match the information needs of individual users. The target objects can be documents, audio, pictures, or video clips. There are four major issues in the design of an information retrieval system:

1. How to effectively represent the diverse information needs of users?
2. How to represent objects in a database to support accurate and efficient retrieval?
3. Given the feature representation, how to measure the similarity between data points? Different similarity functions make a significant difference in the strategies for indexing and search.
4. Given the similarity measurement, how to efficiently and effectively identify the objects in a database that match the information needs of users?

In this thesis, we focus on the last three problems, with emphasis on effective representation schemes and near duplicate retrieval methods that can handle large scale high dimensional data. Table 1.1 shows some real world applications of large scale information retrieval.
Application                               Description                                          # Objects   # Features
Document Retrieval [15, 16]               Find relevant documents based on text queries        10^10       10^8
Text Based Image Retrieval [17]           Find relevant images based on text queries           10^9        10^4
Content Based Image Retrieval [18, 19]    Find relevant images based on image queries          10^9        10^3
Audio Retrieval [20]                      Find relevant audio tracks for sound based queries   10^6        10^2

Table 1.1 Real World Applications of Information Retrieval

The large scale and high dimensionality pose a significant challenge for efficient retrieval. This is because in many applications, finding the most relevant objects for a query is often reduced to the problem of finding the query's nearest neighbor (according to different similarity measurements, e.g., Euclidean distance, Hamming distance or kernel similarities). Unfortunately, it is well known that nearest neighbor search is essentially an NP-hard problem in high dimensional spaces [21]. Therefore, there is no known deterministic approach for efficient nearest neighbor search in high dimensional space. Furthermore, no matter how efficient a single match is, the very large candidate pool makes it impractical to perform any form of linear scan on large databases. For these reasons, most practical approaches more or less follow a two-stage scheme: the first stage is devoted to identifying a small subset of potential candidates that are likely to be similar to the given query, and in the second stage the exact similarity between the given query and the selected candidates is computed. Note that the first stage is often realized by well designed data structures that effectively exploit statistical properties of the data, such as sparsity. How exactly the two-stage system is implemented depends heavily on the properties of the specific data used in each application.
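The two-stage scheme just described can be sketched generically as follows. This is a minimal, illustrative sketch, not an algorithm from this thesis: the coarse sign-pattern key is a hypothetical stand-in for a real first-stage index (an inverted list, tree, or hash table), and all sizes and seeds are arbitrary.

```python
import numpy as np

def coarse_key(x, planes):
    """Stage-1 surrogate index: the sign pattern of a few random projections."""
    return tuple((planes @ x > 0).astype(int))

def two_stage_search(database, planes, query, k=3):
    # Stage 1: keep only candidates whose coarse key matches the query's.
    qkey = coarse_key(query, planes)
    cand = [i for i, x in enumerate(database) if coarse_key(x, planes) == qkey]
    if not cand:                              # degenerate case: fall back to full scan
        cand = list(range(len(database)))
    # Stage 2: exact distances on the (hopefully small) candidate set.
    dists = sorted((np.linalg.norm(database[i] - query), i) for i in cand)
    return [i for _, i in dists[:k]]

rng = np.random.default_rng(42)
db = rng.standard_normal((1000, 32))          # toy "database" of 1000 vectors
planes = rng.standard_normal((4, 32))         # 4 random hyperplanes for stage 1
assert two_stage_search(db, planes, db[17])[0] == 17   # a duplicate query finds itself
```

Stage 1 touches every item here only for simplicity; a real index precomputes the keys so that the candidate set is fetched in constant or logarithmic time.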
For example, in the case of document retrieval, where documents are represented by sparse vectors, an inverted index is created and the first stage is a search in the index. In the second stage, different retrieval models are applied to calculate the exact query-to-document similarity. If the data are dense and continuous, e.g., dense image features, we can build trees or use hashing to group data points with small distances together; the first stage is then a tree search or a hash table lookup, and in the second stage the exact point-to-point similarity is calculated. If the data are represented by bags of features, e.g., image retrieval based on bags of features, a common approach is to convert the bag-of-features representation into a vector representation, and then use an existing vector search method to identify a small subset of candidates. In the second stage, a more precise and expensive image matching algorithm can be applied to produce the final ranked results. In the following, we first briefly discuss the different types of data to be searched and the existing retrieval methods for these data. We then point out the limitations of existing methods and outline the contributions of this thesis.

1.1.1 Retrieval Methods for Sparse Data

In this case, we assume that the vectors used to represent objects are very sparse, namely that most of the elements in the vectors are zeros. An example of sparse vector representation is documents. Most text retrieval methods use the bag-of-words model for documents, in which each document is represented by a word histogram [22]. This representation is sparse because most documents focus on a specific topic and thus use only a small portion of the entire vocabulary. Given this sparseness, a special data structure, termed the inverted index [23], can be created and used to greatly improve retrieval efficiency.
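On a toy corpus, the inverted-index idea can be sketched as follows. The corpus, whitespace tokenization, and the unweighted candidate step are all illustrative simplifications; production systems add stemming, stop-word removal, and scored ranking.

```python
from collections import defaultdict

# Toy corpus: document id -> text.
docs = {
    0: "near duplicate image search",
    1: "document retrieval with an inverted index",
    2: "image retrieval with bag of words",
}

# Inverted index: word -> set of ids of the documents containing that word.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

def candidates(query):
    """Stage 1: the union of the inverted lists of the query words."""
    return set().union(*(index.get(w, set()) for w in query.split()))

print(candidates("image search"))   # -> {0, 2}: only documents sharing a query word
```

Only the documents appearing in some query word's inverted list are ever touched, which is what lets the first stage avoid a linear scan of the collection.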
The basic idea of the inverted index is to create mappings, often referred to as inverted lists, from each word to the set of documents that contain the word. Given the inverted index, in order to identify the candidate documents for a given query (usually consisting of only a small number of words), we only need to check the similarity of documents in the inverted lists of the query words, which allows us to avoid a linear scan of the entire collection. Given the identified document candidates, in the second stage, a retrieval model, such as the vector space model [24], the probabilistic model [25] or a language model [26], is applied to compute the relevance or similarity scores of the document candidates for the given query. Note that it is the sparsity of the vector representation that makes it possible to construct a compact inverted index, leading to efficient retrieval.

1.1.2 Retrieval Methods for Dense Data

We assume both queries and database items are represented by dense continuous vectors, in which almost all elements are non-zero and can take any continuous value. A good example of this case is image retrieval based on global features, where both query images and images in the database are represented by vectors of visual features, such as color, texture, and shape. The similarity between two images is often determined by the distance between the two corresponding vectors. As a result, the most similar images for a given query image are usually found by a nearest neighbor search. To facilitate the search, data structures like trees or hash tables are used to index the data. These indexing methods are useful since they efficiently group similar points together through either space partitioning or hashing. However, since there is no efficient algorithm for exact nearest neighbor search in high dimensions [21], most existing search algorithms for high dimensional dense vectors have to resort to approximate solutions.
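For reference, the exact nearest neighbor search that these approximate methods try to avoid is a linear scan. It is trivially correct but costs O(nd) per query, which is exactly what makes it impractical at scale (the data here is synthetic and the sizes are illustrative):

```python
import numpy as np

def exact_nn(database, query):
    """Exact nearest neighbor by linear scan: O(n * d) per query."""
    dists = np.linalg.norm(database - query, axis=1)   # distance to every item
    return int(np.argmin(dists))

rng = np.random.default_rng(0)
db = rng.standard_normal((10000, 128))              # 10k SIFT-like 128-d vectors
q = db[123] + 0.001 * rng.standard_normal(128)      # near duplicate of item 123
assert exact_nn(db, q) == 123
```

Every approximate index in the next paragraphs can be read as a way of shrinking the set over which this scan is performed.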
In general, there are two categories of approaches used for approximate nearest neighbor search in high dimensional spaces: tree based approaches [1, 27, 28] and hashing based approaches [29]. The key idea of tree based approaches is to partition the feature space such that if two vectors are close to each other, they are likely to be grouped into the same region (or, equivalently, to share the same partition coding). To improve the robustness of the index, some degree of randomization is often used when determining the branching point. In addition, multiple trees are constructed in order to improve both the retrieval accuracy and the recall. Given the constructed index tree, the candidate nearest neighbors to a given query are found by propagating the query point through the tree to a leaf node. In the final step, we compute the exact distance between the query point and the points in the leaf nodes identified in the first phase to find the nearest neighbors of the given query. The most popular tree based methods use K-means trees [27] and variants of the K-D tree [28]. Hashing based methods, on the other hand, are based on the idea that two data points are likely to share the same hash code if they are close to each other. One representative approach in this category is Locality Sensitive Hashing (LSH) [29]. The key concept behind this method is the locality sensitive hash function, under which two data points within a predefined distance r are more likely to be hashed to the same value than two points separated by a distance significantly larger than r (i.e., (1 + δ)r, where δ > 0 is a predefined factor). In order to ensure high precision and recall in identifying r-nearest neighbors, a large number of locality sensitive hash functions, combined into multiple groups, are used in practice. Similar to the previous case, these hash functions are used in the first stage to identify the candidates that are within distance r of a given query.
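A toy, single-table sketch of this first LSH stage, using the common random-projection hash family h(x) = ⌊(a·x + b)/w⌋; the dimension, bucket width, and number of hash functions below are illustrative choices, and a practical system would maintain many such tables to boost recall:

```python
import numpy as np

rng = np.random.default_rng(7)
d, w, n_hash = 32, 4.0, 8          # dimension, bucket width, hashes per table

# One table of hash functions of the form h(x) = floor((a . x + b) / w).
A = rng.standard_normal((n_hash, d))
B = rng.uniform(0.0, w, size=n_hash)

def lsh_key(x):
    """Concatenated hash values; nearby points tend to share the full key."""
    return tuple(np.floor((A @ x + B) / w).astype(int))

# Index the database by hash key.
db = rng.standard_normal((500, d))
table = {}
for i, x in enumerate(db):
    table.setdefault(lsh_key(x), []).append(i)

# Stage 1 for a query: a single bucket lookup. An exact duplicate of item 42
# is guaranteed to land in item 42's bucket; near duplicates usually do.
bucket = table.get(lsh_key(db[42]), [])
assert 42 in bucket
```

Concatenating several hash values sharpens precision (distant points rarely agree on all of them), while using several independent tables restores recall.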
In the second stage, the exact distance between the query point and the identified candidates is computed to find the nearest neighbors, and the candidate objects are presented to the user in ascending order of their distances to the query. Note that the key difference between the two families is that the tree based methods are designed to find the (approximate) nearest neighbors, while the hashing based methods are designed for range search, i.e., finding the data points that are in the r-neighborhood of a query point. We will discuss the advantages and drawbacks of the above methods in detail in Chapter 2.

1.1.3 Retrieval Methods for Bags-of-Features

In some applications, such as image retrieval, each object is represented by multiple feature vectors instead of a single vector. For example, local descriptors such as SIFT features [30] have been shown to be effective for near duplicate image retrieval and object recognition. In these approaches, each image is represented by a set of local descriptors, with each local descriptor being a vector of 128 attributes. This representation is often referred to as the bag-of-features representation [31–33]. The main challenge posed by this representation is that the correspondence between the local descriptors in two sets is unknown, leading to a high complexity in matching two images. Several algorithms, such as the pyramid matching kernel [32], have been proposed to efficiently compute the similarity between two bags-of-features. The main shortcoming of these approaches is that they require a linear scan of the entire database before the most similar objects are found, making them difficult to scale to large datasets. One way to speed up the retrieval process for objects with a bag-of-features representation is to convert the image matching problem into a text retrieval problem [18].
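The conversion rests on a quantization step: each local descriptor is mapped to its nearest entry in a "visual vocabulary" and the image is summarized as a histogram over that vocabulary. A minimal sketch (a random vocabulary stands in for the cluster centers that k-means would normally learn, and all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = rng.standard_normal((100, 128))   # 100 "visual words" (cluster centers)

def bag_of_words(descriptors, vocab):
    """Map each local descriptor to its nearest visual word; return the histogram."""
    # Squared distance between every descriptor and every visual word.
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)             # index of the nearest word per descriptor
    return np.bincount(words, minlength=len(vocab))

image = rng.standard_normal((50, 128))    # 50 SIFT-like descriptors of one image
hist = bag_of_words(image, vocab)
assert hist.sum() == 50                   # every descriptor maps to exactly one word
```

The resulting histogram is a sparse vector, so the inverted-index machinery of Section 1.1.1 applies to it directly.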
The key idea of this approach is to quantize each local descriptor to a “visual word” and represent each image by a histogram of the visual words. This bag-of-words representation allows us to directly exploit text retrieval techniques, particularly inverted indexing, to efficiently identify the candidate images that share similar visual content with the query image. In the second stage, more accurate similarities are computed between the query image and the candidate images identified in the first stage. It is worth noting that efficient quantization methods often require efficient nearest neighbor search, the topic explored in the retrieval methods for continuous and dense vectors. However, a problem with existing quantization approaches is that they fail to accurately capture the similarity between the original features [34, 35], and such coarse approximations often lead to less accurate retrieval results and a less efficient retrieval process. In this thesis, we propose new quantization methods to address this issue.

1.2 Thesis contributions

Among the three classes of retrieval problems discussed, sparse data retrieval has been well studied and successfully used in modern search engines (e.g. Google, Bing). Our focus is therefore to address the challenges in retrieving dense and bag-of-features data. As we have seen, most large scale search algorithms developed in the literature attempt to address the efficiency issue by approximation, but the existing approximation schemes either do not capture the true similarity well or are not effective at filtering out irrelevant items. For example, the widely used vector quantization scheme for bag-of-features data considers two data points similar when they are mapped into the same cluster, without taking into account the actual distance between the feature points [34, 35]. The hashing based methods often return a large number of false positives in order to ensure a high recall [21, 36].
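To make the quantization behavior just criticized concrete, the following minimal sketch maps each local descriptor to its nearest codeword and builds an L1-normalized visual-word histogram. The codebook and descriptors are hypothetical toy data, not from any real feature extractor; note that two descriptors assigned to the same codeword are treated as identical, regardless of their actual distance.

```python
def quantize(descriptors, codebook):
    """Assign each local descriptor to the index of its nearest codeword."""
    def nearest(x):
        return min(range(len(codebook)),
                   key=lambda k: sum((xi - ci) ** 2 for xi, ci in zip(x, codebook[k])))
    return [nearest(x) for x in descriptors]

def bag_of_words(descriptors, codebook):
    """L1-normalized visual-word histogram for one image."""
    words = quantize(descriptors, codebook)
    hist = [0.0] * len(codebook)
    for w in words:
        hist[w] += 1.0
    total = sum(hist)
    return [h / total for h in hist]

# toy example: a hypothetical 3-word codebook and 4 descriptors
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descs = [(0.1, -0.1), (0.9, 1.2), (5.1, 4.9), (4.8, 5.2)]
hist = bag_of_words(descs, codebook)
```

Once every image is reduced to such a histogram, standard text retrieval machinery applies directly, which is precisely where the information loss discussed above enters.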
The above behaviors are often undesirable: either truly similar items are missed by the algorithm due to an inappropriate approximation (low recall), or a large number of false positives are retained in the second stage (low precision), resulting in expensive second stage re-ranking. Thus, the goal of this thesis is to design representation schemes and algorithms that are not only computationally efficient but also more accurate in capturing the similarity between objects in the second and third classes of retrieval problems. The main contributions of this dissertation can be summarized as follows:
1. We present a new quantization scheme in chapter 3, named the random seeding quantization algorithm, that is more efficient in generating bag-of-words representations for bag-of-features data. The new scheme is motivated by approximating the optimal partial matching between bags-of-features, and thus produces a bag-of-words representation capturing the true similarities of the data, leading to more accurate and efficient retrieval of bag-of-features data.
2. We present a new search algorithm in chapter 4 that is designed for efficient near duplicate vector search. By explicitly exploiting the statistical properties of near duplicity, the algorithm projects high dimensional vectors into a lower dimensional space for exact search, leading to more accurate and efficient identification of nearly duplicate objects in databases.
3. In chapter 5, we propose a new randomized range search algorithm for near duplicate vectors in high dimensional spaces. Compared to the algorithm presented in chapter 4, it does not require the sparsity constraints and applies to a wider range of applications. It projects both the data points and the query point into a one dimensional space by a random projection, and uses binary search for efficient nearest neighbor search and filtering.
We first prove a theoretical guarantee for the proposed algorithm and then evaluate its empirical performance on a 1.1 billion image feature dataset.

CHAPTER 2 LITERATURE REVIEW

In this chapter, we review existing retrieval methods for sparse, dense and bag-of-features data.

2.1 Retrieval Methods for Sparse Data

A good example of high dimensional sparse data is the bag-of-words [22] representation for documents. To make our discussion concrete, we will illustrate the key ideas of discrete sparse data retrieval by presenting the most widely used document indexing and retrieval methods. We by no means try to exhaustively cover the topic of document retrieval. Rather, we focus our discussion on efficient indexing and retrieval models, since these techniques are pertinent to this thesis and can be applied to any sparse data, not only documents. For example, this indexing and retrieval method has been successfully applied to near duplicate image retrieval, as will be illustrated later. The indexing scheme to be introduced is the inverted index [15, 23], which serves as the first stage to effectively avoid a linear scan of the collection. Afterward, specific retrieval models [37, 38] are applied to precisely match the query against the candidate documents and decide their rankings, which constitutes the second stage.

2.1.1 Sparseness and Inverted Indexes

Documents are sparsely represented by word frequencies, which is one of the key advantages provided by the bag-of-words representation. However, a more subtle and important sparseness property actually comes from the other side: most words appear in only a small number of documents. This fact is known as Zipf’s law [39], which states that the frequency of any word is inversely proportional to its frequency ranking.
As a result, if we sort words by their frequency rankings, we will see that their frequencies decrease very quickly, and thus most words have very low frequencies across the collection. Such “word-to-document” sparsity directly leads to an efficient indexing scheme called the inverted index, which maps each word to the list of documents that contain the word. The list is referred to as an inverted list, and each list entry is called a posting. The amount of information stored in postings can vary significantly, and it usually depends on the specific retrieval model being used. For a simple model like the Boolean model, an index entry only needs to store the identifiers of words and a list of document IDs. For more complex models like the vector space model or language models, other statistical information such as document frequencies, word frequencies/weights and word positions may be encoded in the index to support their ranking algorithms. Equipped with the inverted index, one can efficiently identify the documents that contain a query word by simply looking up the index table, which takes only O(1) time. Furthermore, according to Zipf’s law, the number of postings for most words is small, so we end up with a small number of candidate documents in the first stage before the exact query-document similarity is computed in the second stage. This remains true as long as the number of query words is not too large. A different indexing structure called the signature file has also been used in document indexing. The basic idea is to create short signature bits for each document by hashing, and then match the hash codes to retrieve similar documents. In [40], however, Zobel et al. carefully compared inverted indexes and signature files in both efficiency and space requirements on synthetic and real datasets.
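The inverted-index mechanics described above can be sketched minimally as follows. The toy collection is hypothetical, and a real index would compress postings and store per-posting statistics; the sketch only shows the word-to-document mapping and a Boolean AND lookup.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.split():
            index[word].add(doc_id)
    return {w: sorted(ids) for w, ids in index.items()}

def conjunctive_query(index, words):
    """Boolean AND: return the documents containing every query word."""
    postings = [set(index.get(w, ())) for w in words]
    return sorted(set.intersection(*postings)) if postings else []

docs = ["the cat sat", "the dog sat", "a cat ran"]
index = build_inverted_index(docs)
```

A lookup such as `conjunctive_query(index, ["the", "sat"])` touches only the two relevant postings lists rather than scanning the whole collection, which is exactly the first-stage filtering role of the index.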
The experiments were done on two datasets: one is the “Wall Street Journal” dataset from TREC and the other is synthetic. Both datasets contain roughly 100,000 documents, and the average document length in both is about 430. To summarize this work, the inverted index is superior to the signature file for text indexing since it 1) is more efficient, 2) uses less space and 3) provides more functionality. More specifically, the studies in [40] showed that: the inverted index is more space efficient since it can be easily compressed; the inverted index is faster to build and easier to update, since signature files require a large number of hash operations; inverted files handle varying document lengths better, since signature files require different documents to be hashed into binary strings of the same length; the results returned from signature files are less precise, since documents sharing no words can have the same signature coding; and an inverted list can easily be extended to encode weighting information while signature files cannot. In conclusion, based on the current development of both inverted lists and signature files, the inverted list is undoubtedly the better solution for document indexing. In short, the inverted index is the key data structure for alleviating the computational challenge posed by large scale document retrieval [15, 16, 23, 41]. Other alternatives like signature files generally have severe disadvantages [16]. As a result, inverted indexes are at the core of all modern text search engines.

2.1.2 Retrieval Models

In the second stage of document retrieval, different retrieval models are used to precisely evaluate the similarities between the query and the candidate documents identified using the inverted list. In the following sections, we will introduce different retrieval models, including vector space models and probabilistic models. These models differ in their motivations, model complexities and effectiveness.
No model is universally superior to the others, since diverse application domains impose various requirements and thus different models may be preferred.

Vector Space Models

The vector space model was first proposed by Salton et al. [24], and it is still widely used in some IR systems. The key idea of the vector space model is to represent both the query and the documents as vectors in the term space, so that the similarity can be measured by the “closeness” between vectors (e.g. the angle between them). More specifically, a query Q is represented by a vector of index terms: Q = (q_1, q_2, ..., q_d), and similarly, a document D_i is represented by (d_{i1}, d_{i2}, ..., d_{id}). Here d is the number of indexed terms, and the elements of the vectors are the weights of the corresponding terms. The similarity between Q and D_i is measured by the cosine correlation defined as follows:

Sim(Q, D_i) = ⟨Q, D_i⟩ / (‖Q‖ ‖D_i‖)   (2.1)
            = (∑_{j=1}^{d} q_j d_{ij}) / (√(∑_{j=1}^{d} q_j²) √(∑_{j=1}^{d} d_{ij}²)).   (2.2)

The terms q_j and d_{ij} in (2.2) are usually determined by tf-idf weighting: tf stands for the normalized term frequency, which indicates how important a term is for a document, and idf is the inverse document frequency, which captures how discriminating a term is. The specific settings of tf and idf are mainly based on empirical observations.

Probabilistic Models

Probabilistic models are based on the probabilistic ranking principle [42], stating that a retrieval system should rank the retrieved documents by their probabilities of being relevant to a given query, calculated from the data available to the retrieval system. Here we introduce one of the most popular probabilistic models, the Okapi BM25 model, from the document classification point of view. We assume the document is represented by a binary vector. A document should be classified as relevant if Pr(R|D) > Pr(NR|D). Following the Bayes rule, we will classify a document as relevant if

Pr(D|R) / Pr(D|NR) > Pr(NR) / Pr(R).   (2.3)

Consequently, it makes sense for the retrieval system to rank documents according to Pr(D|R)/Pr(D|NR), which is called the likelihood ratio. If we follow the naive Bayes assumption and certain probabilistic assumptions about the document generation process, we have

log [Pr(D|R) / Pr(D|NR)] ∝ ∑_{i: d_i = q_i = 1} log [(N − n_i) / n_i],   (2.4)

which is very similar to the idf term. However, using only an idf-like weighting scheme cannot provide desirable retrieval performance, due to the absence of term frequency weightings. Robertson et al. proposed the series of Okapi BM25 models [43–45], which have proved to be among the most successful retrieval models and are widely used by many search engines.

2.1.3 Comparison: Retrieval Models for Sparse Data

In our discussion, we have covered the vector space model and the classic probabilistic model. There exist many other retrieval models, such as the Boolean models and the language models. Due to space limitations, we omit detailed discussions of these models; however, we provide a quick summary of all the retrieval models for the completeness of our discussion. The performance of these models depends highly on their weighting schemes. It is widely accepted in the information retrieval community that there is no single retrieval model that performs universally better than the others [16, 46, 47]. Instead, different retrieval models perform optimally under different conditions. The vector space model introduces tf-idf weights based on heuristics; the classic probabilistic model brings idf in by following the idea of relevance classification, and introduces some tf-like terms based on heuristics; the language modeling approach brings in the tf effects by using the estimated language models and obtains the idf weight through smoothing techniques. Despite the fundamental theoretical differences among them, all these methods ultimately arrive at weighting schemes similar in spirit and effect.
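A minimal sketch of the vector space scoring in (2.1)–(2.2) follows. The tf and idf variants here (raw term counts, smoothed idf) are just one common choice among the many empirical settings mentioned above, and the toy collection is hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(text, df, n_docs):
    """tf-idf weights for one document or query (raw tf, smoothed idf)."""
    tf = Counter(text.split())
    return {w: c * math.log((1 + n_docs) / (1 + df.get(w, 0)))
            for w, c in tf.items()}

def cosine(u, v):
    """Cosine correlation between two sparse tf-idf vectors, as in (2.2)."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats"]
df = Counter(w for d in docs for w in set(d.split()))  # document frequencies
vecs = [tfidf_vector(d, df, len(docs)) for d in docs]
q = tfidf_vector("cat mat", df, len(docs))
scores = [cosine(q, v) for v in vecs]
```

With this weighting, the first document scores highest (it matches both query terms, including the rare term "mat"), illustrating how idf rewards discriminating terms.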
Thus, it would be improper for us to claim theoretically which approach is the state-of-the-art. To compare them objectively, we need to refer to empirical results. To this end, we cite the results of the Ad hoc tracks from TREC-8, TREC-7 and TREC-6, since the results presented there are comprehensive, conclusive and well recognized. To summarize the results of the Ad hoc tracks, the classical probabilistic model (Okapi BM25) combined with pseudo relevance feedback delivered the top performance in most evaluations. The vector space model with feedback usually performs slightly worse than the probabilistic model, probably because it has fewer parameters to tune. It is reported [48] that combining a simple language model with the probabilistic model delivers very good performance. Studies in [26, 47–51] showed that language models with proper smoothing can give retrieval performance that is better than the vector space model and comparable to the probabilistic model. But in general, setting aside the issue of collection dependence, all these models provide very similar performance when well tuned, and we may say that the marginal difference between them is very likely just a result of parameter tuning.

2.1.4 Summary

Searching large scale high dimensional sparse data (e.g. document retrieval) clearly involves two stages. The first stage is to search against a pre-built index to quickly identify candidate objects that are likely to be relevant to the query, and the second stage is to apply a specific retrieval model to calculate the exact query-to-document similarity. The efficiency of index building and searching is a direct result of the sparse representation. In the second stage, different retrieval models such as vector space models, probabilistic models and language models can be used; they differ in their motivations, assumptions, model complexities and effectiveness.
Among all the retrieval methods that have been proposed, language modeling is the most principled approach and has proved to be very effective in practice.

2.2 Retrieval Methods for Dense Data

In this section we focus on retrieval methods for continuous and dense data. Specifically, both the queries and the objects in the database are represented by high dimensional dense vectors, and the elements of the vector representation usually take continuous values. The retrieval problem for such data naturally reduces to the nearest neighbor problem in high dimensional spaces. Similar to the retrieval methods for discrete data, efficient nearest neighbor search in high dimensional space also follows a two-stage scheme: a filtering stage followed by a precise matching stage. However, since the data is no longer sparse, the indexing techniques used here are fundamentally different from those for sparse discrete data. Instead of the inverted index, two categories of methods are commonly used for the indexing and filtering stage: tree based approaches and projection based approaches. Tree based approaches [1, 28, 52] recursively partition the feature space into a collection of high dimensional cells such that points with similar input patterns are likely to be assigned to the same cell. Given a query point, it is first propagated down to a leaf node, and precise matchings are performed between the query point and all the points inside the node to determine the candidate nearest neighbor. The candidate nearest neighbor identified so far is then used as a “pivot point” to check data points from other branches, and it is updated whenever a data point with a shorter distance to the query is found. When handling very large scale data, multiple randomized trees [53] are often used to speed up the process of nearest neighbor search.
Randomized trees essentially partition the feature space in a more or less randomized fashion to achieve robust expected search performance, and multiple trees are used to boost the search accuracy. The most popular tree based methods are the KD-tree and its variants. On the other hand, the projection based methods [29, 54] are based on the idea that, under a suitable projection, adjacent data points are more likely to be mapped to the same value than points that are far away. In this approach, random projection functions are used to create an auxiliary space, and the similarity between two data points is measured by how likely their projected images are to collide in the auxiliary space. A representative projection based method is Locality Sensitive Hashing [29], which relies on hash functions that are locality sensitive: two data points that are within distance r are more likely to be hashed to the same value than two points that are separated by distance (1 + ε)r. Applying one hash function to a data point produces a “hash bit”, and hash bits from multiple hash functions are then grouped together to form “hash strings”. It follows naturally that only really close points will be hashed to the same hash string. Meanwhile, the nearest neighbor candidates from different hash strings are considered altogether to reduce the chance of missing true near neighbors. Note that locality sensitive hashing is fundamentally different from the tree based methods in the sense that it does not solve the nearest neighbor problem directly. Instead, it is designed to solve the r-near neighbor problem: finding the data points that are within the r-neighborhood of the query point, which is often referred to as a range search problem. Another important aspect of nearest neighbor search in high dimensions is that we often resort to approximate solutions. The reason is that there is no known exact algorithm that achieves sub-linear search time with linear storage cost.
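The grouping of hash bits into hash strings described above can be sketched as follows, using quantized random projections of the form h(x) = ⌊(a·x + b)/w⌋ as the locality sensitive family for the L2 distance. The parameter values (k bits per string, L strings, width w) and the toy points are illustrative only, not tuned.

```python
import random

def make_hash(dim, k, w, rng):
    """One k-bit hash string: k random projections, each quantized to width w."""
    projs = [([rng.gauss(0, 1) for _ in range(dim)], rng.uniform(0, w))
             for _ in range(k)]
    def h(x):
        return tuple(int((sum(a_i * x_i for a_i, x_i in zip(a, x)) + b) // w)
                     for a, b in projs)
    return h

def build_lsh(points, dim, k=4, L=8, w=2.0, seed=0):
    """Index every point into L hash tables, one per k-bit hash string."""
    rng = random.Random(seed)
    hashes = [make_hash(dim, k, w, rng) for _ in range(L)]
    buckets = [dict() for _ in range(L)]
    for i, p in enumerate(points):
        for h, b in zip(hashes, buckets):
            b.setdefault(h(p), []).append(i)
    return hashes, buckets

def candidates(q, hashes, buckets):
    """Union of the colliding buckets over all L tables (first-stage filter)."""
    out = set()
    for h, b in zip(hashes, buckets):
        out.update(b.get(h(q), ()))
    return out

points = [[0.0, 0.0], [0.01, 0.01], [1000.0, 1000.0]]
hashes, buckets = build_lsh(points, dim=2)
cand = candidates([0.0, 0.0], hashes, buckets)
```

Requiring all k bits of a string to match keeps far points out of a bucket, while taking the union over L tables recovers near points that a single table might miss, exactly the precision/recall tradeoff discussed above.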
More specifically, for any given nearest neighbor search algorithm, we are concerned with both its time complexity and its space complexity as functions of the data set size n and the data dimensionality d. A naive algorithm stores all the points and, to process a query, linearly scans the entire data set to calculate all pair-wise distances to find the nearest neighbor. Such an algorithm clearly has both time and space complexity of O(nd) [21, 55]. The linear time complexity (w.r.t. n) is not desirable, and people have developed algorithms to achieve sub-linear time complexity. The current state-of-the-art result for exact nearest neighbor search is that, if we fix the dimensionality to be a constant, there are algorithms that solve the problem with O(G1(d) log n) time complexity and O(G2(d) n) space complexity [21, 28, 56–60]. Note that although these algorithms achieve sub-linear time complexity and linear space complexity w.r.t. n, there are functions G1(d) and G2(d) hidden in the complexity terms. Recent studies [1, 21, 55, 61] on high dimensional nearest neighbor search pointed out that either G1(d) or G2(d) is in fact an exponential function of d, which makes all these algorithms much less efficient in space or time as the dimensionality increases. To be more specific, the algorithms presented in [58, 59] have O(d^O(1) log n) time complexity, but they use roughly n^O(d) space [55, 62]. For the KD-tree based algorithms presented in [28, 57], the time complexity increases at least as fast as 2^d [1, 57], which is also observed in empirical studies [61]. The Ball-tree based algorithm [60] has time complexity O(d log n) only in low dimensions, and it becomes O(nd) in higher dimensions [56]. Table 2.1 summarizes important exact nearest neighbor search algorithms and their time and space complexities.
Algorithm                   Time Complexity       Space Complexity
Linear Scan                 O(nd)                 O(nd)
Voronoi [58], Planar [59]   O(d^O(1) log n)       n^O(d)
KD-Tree [28, 57]            Ω(2^d log n)          O(nd)
Ball-tree [60]              O(nd) in high dims    O(nd)

Table 2.1 Complexities of exact nearest neighbor search algorithms.

In summary, all known exact nearest neighbor search algorithms [21, 28, 54, 56–60, 63] are not appropriate for large scale high dimensional data. The reason is that they either have time complexity linear in n, or time complexity sub-linear in n but exponential in d, or time complexity sub-linear in n and linear in d but space complexity exponential in d. This implies that for large scale high dimensional data, we need to resort to approximate solutions. The difficulty of performing exact nearest neighbor search motivated people to study approximate nearest neighbor search algorithms, and the KD-tree and LSH can be used for this purpose. Representative approximate search algorithms based on the KD-tree are introduced in [1] and [64]. In [1], the authors modified the KD-tree structure and proposed a so-called Balanced Box Decomposition tree to support (1 + ε) approximate nearest neighbor search. In [64], the authors proposed to stop the tree search early after a fixed number of nodes has been checked, and empirically showed the effectiveness of this heuristic. LSH related work can be found in [29, 65, 66], where various locality sensitive hash function families are proposed for nearest neighbor search under different distance/similarity functions. Note that although some of these methods [1] do not actually remove the exponential complexity, they still provide considerable improvement in search efficiency over exact nearest neighbor search algorithms. We emphasize that the approximation in KD-trees is usually achieved by early stopping, whereas LSH is inherently a randomized algorithm. For a summary of the complexities of approximate nearest neighbor algorithms, refer to Table 2.2.
In the following sections, we will elaborate both the tree based and the hashing based methods for exact and approximate nearest neighbor search.

Algorithm       Time Complexity           Space Complexity
ANN [1]         O(d⌈1 + 6d/ε⌉^d log n)    O(nd)
BBFS [64, 67]   O(d log n)                O(nd)
LSH L1 [29]     O(dn^(1/(1+ε)))           O(dn^(1+1/(1+ε)))
LSH L2 [65]     O(dn^(f(1+ε)))            O(dn^(1+f(1+ε)))

Table 2.2 Complexities of approximate nearest neighbor search algorithms. For the LSH complexity under the L2 distance, we have f(1 + ε) < 1/(1 + ε).

2.2.1 KD-Tree, Randomized KD-tree and Randomized Forests

KD-Tree

Although the KD-tree was first introduced several decades ago [28, 57], it is still widely used in high dimensional nearest neighbor search. In short, a KD-tree is a generalized binary tree designed for efficient searching. In a KD-tree, every node essentially stores information about a space partition. For example, the root node represents the entire feature space, and the leaf nodes represent non-overlapping small sub-regions of the entire space (or equivalently, subsets of all examples). Meanwhile, the internal nodes record the intermediate steps (and partitions) of how the entire space is recursively divided into the finest sub-regions. See Figure 2.1 for an example of constructing a KD-tree based on space partitioning.

Figure 2.1 An example of a KD-tree and the resulting space partition.

Given a dataset, the first step is to construct a tree over the data. At each node, we pick the median of the most spread dimension to bisect the feature space, where the spread is defined as the value range or the variance. The split values are stored, since they determine the “bounding box” of each partition. This recursive process continues until the number of points within a node falls below some threshold. As a result, each node is represented by a hyper-rectangle whose boundaries are determined by the split values chosen by its ancestor nodes.
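The recursive construction just described (split on the most spread dimension at its median) can be sketched as follows. The dictionary-based node representation and the toy points are illustrative choices, not part of the original algorithm.

```python
import statistics

def build_kdtree(points, leaf_size=2):
    """Recursively split on the most spread dimension at its median value."""
    if len(points) <= leaf_size:
        return {"leaf": True, "points": points}
    dim = len(points[0])
    # pick the dimension with the largest value range (the "spread")
    split_dim = max(range(dim),
                    key=lambda j: max(p[j] for p in points) - min(p[j] for p in points))
    split_val = statistics.median(p[split_dim] for p in points)
    left = [p for p in points if p[split_dim] <= split_val]
    right = [p for p in points if p[split_dim] > split_val]
    if not left or not right:  # degenerate case: all points equal on this dimension
        return {"leaf": True, "points": points}
    return {"leaf": False, "dim": split_dim, "val": split_val,
            "left": build_kdtree(left, leaf_size),
            "right": build_kdtree(right, leaf_size)}

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points, leaf_size=2)
```

Each internal node stores its split dimension and split value; together with the splits of its ancestors, these determine the bounding hyper-rectangle of every node, which is exactly what the overlap test during search relies on.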
At the search stage, the query point is first descended down the tree by comparing its feature value against the corresponding split value (on the split dimension) chosen at each node. When the algorithm reaches a leaf node, the candidate nearest neighbor is determined by computing the exact query-to-data distances. The query point together with the shortest distance found so far defines a hypersphere centered at the query point. Afterward, we check whether the hypersphere overlaps with the hyper-rectangles defined by other unvisited nodes. If there is no overlap, the node and its children can be completely eliminated from consideration. If there is some overlap, the node and its children are checked for potential nearer neighbors. In the following subsections, we lay out the algorithms for tree construction and search, and also discuss some implementation details.

KD-tree Construction

Given a dataset, a KD-tree is constructed recursively as follows: pick a dimension and a split value to divide the dataset into two subsets, and keep dividing the subsets until the number of data points contained in a node drops below some threshold. The recursive procedure in Algorithm 1 constructs a KD-tree from a given dataset. There are several comments we would like to make regarding this procedure. The granularity, defined as the maximum number of data points within a leaf node, can affect the performance. If the granularity is too fine (e.g. each node contains only one data point), the performance will be suboptimal due to the computational overhead imposed on the search by the tree structure [57]. So, to maintain a reasonable granularity, a set of points is usually treated as a leaf node either when the number of data points is small or when the spread (the value range of the data points) of the set is small.
The function SelectSplit picks the dimension and the split value used to partition the feature space; the criterion can be either the spread or the variance of each dimension, depending on the application. Often the median value is chosen as the split value. By partitioning the feature space in this fashion, the tree is constructed to achieve the best expected performance over different queries [57].

Algorithm 1 KD-tree construction algorithm
Require: dataset {The data for tree construction}
1: if IsEmpty(dataset) then
2:   return NIL
3: else if Size(dataset) ≤ threshold then
4:   return MakeLeaf(dataset)
5: end if
6: (splitDim, splitVal) = SelectSplit(dataset)
7: leftData ← {x | x ∈ dataset, x_splitDim ≤ splitVal}
8: rightData ← {x | x ∈ dataset, x_splitDim > splitVal}
9: leftChild = KDtree(leftData)
10: rightChild = KDtree(rightData)
11: return MakeNode(splitDim, splitVal, leftChild, rightChild)

KD-Tree Search

Intuitively, the search algorithm first locates the leaf node that contains the query point, records the unvisited nodes along the path, and then recursively examines those unvisited nodes to improve the search result. Locating the leaf node is a single descent and can be done very efficiently, but the improvement step is recursive and can require a lot of computation. The key calculation in the improvement step is to check whether the hypersphere centered at the query point overlaps with the hyper-rectangles defined by the internal nodes. Here, we describe it in more detail. Let q be the query point and p be the nearest neighbor found so far. A hypersphere hs is defined centered at q with radius D(q, p), where D(q, p) is the inter-point distance defined over some metric. To determine whether there is an overlap between the hypersphere hs and some hyperrectangle hr, we need to find the point t in the hyperrectangle hr that is closest to q.
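Under the Euclidean distance, this closest-point computation is just a coordinate-wise clamp of q into the rectangle, which makes the overlap test a few lines of code. A minimal sketch (the rectangle is given by its per-dimension minimum and maximum bounds):

```python
def closest_point_in_rect(q, rect_min, rect_max):
    """Clamp q into the hyperrectangle [rect_min, rect_max], coordinate-wise."""
    return [min(max(qi, lo), hi) for qi, lo, hi in zip(q, rect_min, rect_max)]

def sphere_overlaps_rect(q, radius, rect_min, rect_max):
    """Does the hypersphere centered at q intersect the hyperrectangle?"""
    t = closest_point_in_rect(q, rect_min, rect_max)
    d2 = sum((qi - ti) ** 2 for qi, ti in zip(q, t))
    return d2 <= radius ** 2

t = closest_point_in_rect([0.0, 5.0], [1.0, 2.0], [3.0, 4.0])
```

If the clamped point t is already farther from q than the current best distance, the whole rectangle (and hence the subtree it bounds) can be pruned.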
If D(q, t) > D(q, p), then it is clear that the region defined by hr cannot contain any point closer than p, and thus the node associated with the region can be eliminated from consideration. Finding the closest point t given a query point q is equivalent to minimizing the distance D(t, q) subject to the constraint that t belongs to the hyperrectangle hr. If D(t, q) is defined as the Euclidean distance, minimizing the distance can be achieved by minimizing (t_i − q_i)² independently for each dimension i. That is, for every dimension we take the coordinate value in the hyper-rectangle that is closest to the corresponding coordinate of q. Specifically, given q and hr, the coordinate of t on the i-th dimension is

t_i = hr_i^min   if q_i < hr_i^min
t_i = hr_i^max   if q_i > hr_i^max
t_i = q_i        if hr_i^min ≤ q_i ≤ hr_i^max

The detailed algorithm for KD-tree search is given in Algorithm 2, where the overlap test, a key step in the search, is denoted by Overlap.

Algorithm 2 KD-tree search algorithm
Require: q {The query point}
Require: p, d {The nearest neighbor and the closest distance found so far}
Require: root {The root node of the KD-tree}
Require: Overlap(q, d, node) {An overlap testing function}
1: if IsLeaf(node) then
2:   candidateSet ← {x | x ∈ node}
3:   CalExactDists(q, candidateSet)
4:   Update(p, d)
5:   return
6: end if
7: sd ← node.splitDim
8: sv ← node.splitVal
9: if q_sd ≤ sv then
10:   closerNode = node.leftChild
11:   furtherNode = node.rightChild
12: else
13:   closerNode = node.rightChild
14:   furtherNode = node.leftChild
15: end if
16: KDtreeSearch(closerNode)
17: if Overlap(q, d, furtherNode) then
18:   KDtreeSearch(furtherNode)
19: end if

Priority KD-Tree Search

The search algorithm described above is essentially a depth-first tree search. This depth-first fashion may not be the most effective search order for some queries.
For example, if the query point is initially propagated down to the left half of the tree but the true nearest neighbor lies in the other half, depth-first search must traverse at least half of the tree before visiting the nearest neighbor. This leads to our first consideration: the order in which a search algorithm visits tree nodes should depend more on the query point than on the tree structure, since the tree structure is optimized for the best expected performance [57] rather than for any particular query [64]. Specifically, when descending the query point to its first leaf node, we should record the distances between the query point and all the other nodes along the path, and visit first the nodes that are closer to the query point. Another consideration is more practical. It is observed in experiments [68] that the true nearest neighbor is often encountered in the early stage of the search, which further confirms that it is preferable to first check the nodes that are more likely to contain the nearest neighbor. Last but not least, in the analysis [57] of KD-tree performance, Friedman et al. assumed (without justification) that the search algorithm examines the leaf nodes in an optimal order, which implies that leaf nodes should be sorted by their distances from the query point. All these considerations motivate the priority KD-tree search [68] (or best bin first search [64]), which essentially searches nodes based on their distances to the query point instead of based on the tree structure. Specifically, this modified search algorithm maintains a priority queue that stores the distances from the query point to the different nodes. Each time a leaf node has been visited, the next tree node to be visited is the one with the smallest query-to-node distance. Figure 2.2 illustrates how the priority KD-tree search works.
Despite the simplicity of the idea, experiments [53, 64, 68] have shown that with priority KD-tree search, the algorithm in general uses less time and examines a much smaller number of leaf nodes before reaching the true nearest neighbor. Furthermore, this priority-based search scheme has been extensively used in approximate KD-tree search, since it helps in both theoretical justification and performance improvement, as we will discuss in the next subsection.

Figure 2.2 An example of priority KD-tree search. Subtrees v1 through v4 are initially in the queue. v1, the closest node, is selected first. When descending the tree to leaf node w, we enqueue the siblings u1, u2, and u3 along the way [1].

Approximate KD-Tree Search As we mentioned before, large scale high dimensional data makes it computationally prohibitive to perform exact nearest neighbor search. We then ask whether it is possible to solve the nearest neighbor search problem efficiently, not exactly but with good approximation. The most recognized work on approximate KD-tree search is presented in [64] and [1]. Both methods utilize the priority KD-tree search, and essentially both use "early stopping" to avoid the high computational cost of exhaustive search. In [1], Arya et al. proposed a method to solve the so-called "(1 + ε)-approximate nearest neighbor problem", defined as follows:

Definition 1 ((1 + ε)-approximate nearest neighbor problem). Given a set D of points in R^d, construct a data structure that, given a query q ∈ R^d, reports any point p within distance at most (1 + ε)D(q, p*), where p* is the point in D closest to q.

They proposed a data structure called the balanced box decomposition (BBD) tree to support their algorithm and theoretical analysis. Unlike the KD-tree, which only targets reducing the cardinality of nodes, the BBD tree decomposes the feature space in such a way that both the cardinality and the geometric size of nodes decrease exponentially.
Since the nodes are ordered based on their distances to the query point, the search can stop as soon as the query-to-node distance is greater than (1 + ε)d, where d is the shortest distance seen so far. Together with the reduction of geometric node size, it is possible to upper-bound the number of leaf nodes visited during a search. To cite their results precisely, the authors proved that through the use of the BBD tree, the (1 + ε)-approximate nearest neighbor problem can be solved in O(C_{d,ε} log n) time, where C_{d,ε} ≤ d⌈1 + 6d/ε⌉^d. In their experiments, they showed that the algorithm can provide significant improvements over brute-force search in dimensions as high as 20. However, the exponential factors in the query time imply that this algorithm is not practical for high dimensional data. Motivated by practical considerations, Beis et al. [64] proposed a different stopping criterion that limits the number of leaf nodes to be examined. Although they do not provide any theoretical support for this practice, they showed that when combined with priority KD-tree search, the algorithm finds the true nearest neighbor more than 90% of the time, and in the cases where the true nearest neighbor is not found, the solution is a very good approximation. In [67] the authors even argue that this stopping criterion provides better performance than the (1 + ε)d cutoff. To summarize, the (1 + ε)d cutoff is better justified but may not work well in high dimensions without modification. Fixing the number of leaf nodes to be examined has been shown to be useful in solving practical problems, but a more solid analysis would be desirable.

Multiple Randomized KD-Trees We have seen that to solve the approximate nearest neighbor problem, a good practice is to use priority KD-tree search and limit the number of leaf nodes to be examined. Increasing the number of examinations increases the probability of finding the true nearest neighbor, but the improvement tends to saturate with more examinations.
To address this problem, multiple randomized KD-trees were introduced [53]. The basic idea is to create multiple independent KD-trees with different structures, perform simultaneous search in all the trees, and combine the results. To create different trees, the splitting dimension is chosen at random from a set of dimensions with large variances, and the splitting value is the median value with some random perturbation. To search these trees, a single priority queue is maintained over all the trees, so the node closest to the query is always visited first. Many experiments [69, 70] have shown that using multiple randomized KD-trees can greatly improve search efficiency and accuracy.

2.2.2 Locality Sensitive Hashing

Locality Sensitive Hashing (LSH) is a projection based method for approximate near neighbor search in high dimensions. The idea of using random projections for nearest neighbor search was first presented by Kleinberg in [54]. However, the two algorithms presented there use either exponential storage or linear query time and thus are not suitable for large scale high dimensional data. These limitations in fact motivated the research on LSH. There is a rich literature on LSH, and the most important results are presented in [29], [65] and [66]. In the following subsections, we first briefly overview the approximate near neighbor problem and the basic idea of LSH, and then present detailed results from the related work.

Approximate Nearest Neighbor As an inherently randomized algorithm, LSH is usually used to solve the randomized approximate near neighbor problem or the randomized r-near neighbor problem, instead of the nearest neighbor problem. A point p is an r-near neighbor of the query point q if, under some metric, the distance D(p, q) is at most r. Clearly, the nearest neighbor problem and the near neighbor problem are related, and the former is "harder".
This is because given the nearest neighbor p of the query q, the r-near neighbor problem can be solved trivially by checking whether D(p, q) ≤ r. Given a query point q, LSH either returns one of the r-near neighbors of q or claims that no such point exists for the radius r. Conversely, solving the nearest neighbor problem involves gradually increasing the radius r until some near neighbor is reported, which is equivalent to solving a series of r-near neighbor problems. More precisely, LSH can be used to solve the randomized (1 + ε)-approximate near neighbor problem or the randomized r-near neighbor reporting (range search) problem, defined as follows:

Definition 2 (randomized (1 + ε)-approximate near neighbor problem). Given a set D of points in R^d, a query q ∈ R^d and parameters ε > 0, δ > 0, construct a data structure that, if there exists an r-near neighbor of q in D, reports some (1 + ε)r-near neighbor of q with probability 1 − δ.

Some quick comments on Definition 2: 1) It does not specify the algorithm's behavior when there is no r-near neighbor of q; in that case, the algorithm's behavior is not deterministic. 2) The probability of failure δ can be reduced by building and searching multiple (e.g. k) independent data structures, in which case the probability of failure is reduced to δ^k.

Definition 3 (randomized r-near neighbor problem). Given a set D of points in R^d, a query q ∈ R^d and parameters ε > 0, δ > 0, construct a data structure that reports each r-near neighbor of q with probability 1 − δ.

Some quick comments on Definition 3: 1) There is no approximation involved. 2) The algorithm may return many points (if many points qualify), so the running time cannot be bounded. In addition, the returned points may not be independent.

Key Ideas of Locality Sensitive Hashing The key concept for LSH is locality sensitivity, and it is the key difference between LSH and a general hashing algorithm.
In other words, unlike general hashing, whose goal is to avoid collisions, the goal of LSH is not only to distinguish far-away points but also to map nearby points into the same bucket. This goal is achievable due to the existence of locality sensitive functions, defined as follows:

Definition 4 (Locality Sensitive Function). A family of functions H is called locality sensitive if for two points p, q ∈ R^d and 0 ≤ P2 < P1 ≤ 1,
• if D(p, q) ≤ r then Pr_H(h(q) = h(p)) ≥ P1
• if D(p, q) ≥ (1 + ε)r then Pr_H(h(q) = h(p)) ≤ P2

Figure 2.3 Illustration of the difference between general hashing and locality sensitive hashing.

With a family of locality sensitive hashing functions, we are ready to index the dataset and then search for near neighbors. However, note that the gap, which is typically measured by the ratio between some monotonic functions of P1 and P2, is often not large enough to clearly differentiate between an r-near neighbor and a neighbor with distance larger than (1 + ε)r. As a result, we need to amplify the gap by concatenating several functions. Specifically, we first group k hash functions h_{1,j}, h_{2,j}, ..., h_{k,j} together to form a function g_j, with each h_{i,j} being a locality sensitive hash function, so that only really close-by points are hashed to the same value by g_j. We also consider L different g_j's to reduce the chance of missing true near neighbors. To process a query point q, we first hash it into L buckets using the g_j's, and then retrieve all the points from these L buckets (points p such that g_j(p) = g_j(q)). We then compute the exact distance between the query q and the retrieved candidate points, and report any point that satisfies our needs. The LSH indexing and searching procedure is presented in Algorithm 3. Note that there are two possible strategies to examine the candidate points.
One strategy is to scan all the points from all buckets (exhaustive scanning), and the other is to stop scanning (early stopping) as soon as a fixed number of points, proportional to L, have been retrieved. These two strategies solve different problems: the first solves the randomized r-near neighbor problem and the second solves the randomized (1 + ε)-approximate near neighbor problem. Specifically, by using the exhaustive scanning approach and setting L and k properly, we can make the failure probability δ in the range search arbitrarily small. On the other hand, it has been proved [29] that with a properly selected limit on the number of points scanned (a small constant multiple of L), the early stopping strategy can be used to solve the randomized (1 + ε)-approximate near neighbor problem in sub-linear time O(n^ρ), where ρ = 1/(1 + ε).

Algorithm 3 Procedure of LSH Indexing and Searching
Require: dataset {The dataset to be indexed}
Require: q {The query point if search will be performed}
1: if IsEmpty(dataset) then
2:   EXIT
3: end if
4: Select parameters k and L.
5: Form L functions {g_1, g_2, ..., g_L}, where g_j = (h_{1,j}, h_{2,j}, ..., h_{k,j}) and the h_{i,j} are chosen independently at random from an LSH family.
6: Use the g_j's to hash all data points into L hash tables.
7: for j = 1, 2, ..., L
8:   Retrieve points p from bucket g_j(q) in the jth hash table.
9:   Evaluate the exact distance D(p, q), and report p if it is a valid answer to the problem.

LSH Families We have discussed the indexing and searching process of LSH without touching the key question of how exactly to choose the hashing functions. In this section we discuss several hashing families under different distance measures.

Hamming Distance In the case of the Hamming distance, where the distance between two vectors equals the number of coordinates in which they differ, we can intuitively form locality sensitive hash functions by projecting a vector onto one of its coordinates.
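Putting Algorithm 3 together with this coordinate-sampling family, a toy LSH index for binary vectors might look like the following. This is an illustrative sketch only; the class name, parameters, and seed are our own choices.

```python
import random

class BitSamplingLSH:
    """L hash tables; each g_j concatenates k coordinate projections h(p) = p[i]."""

    def __init__(self, dim, k, L, seed=0):
        rng = random.Random(seed)
        # g_j = (h_{1,j}, ..., h_{k,j}): k randomly sampled coordinates per table.
        self.coords = [[rng.randrange(dim) for _ in range(k)] for _ in range(L)]
        self.tables = [{} for _ in range(L)]
        self.points = []

    def _g(self, j, p):
        return tuple(p[i] for i in self.coords[j])

    def index(self, p):
        pid = len(self.points)
        self.points.append(p)
        for j, table in enumerate(self.tables):
            table.setdefault(self._g(j, p), []).append(pid)

    def query(self, q, r):
        # Gather colliding candidates, then verify the exact Hamming distance.
        candidates = set()
        for j, table in enumerate(self.tables):
            candidates.update(table.get(self._g(j, q), []))
        hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
        return [self.points[i] for i in candidates
                if hamming(self.points[i], q) <= r]

lsh = BitSamplingLSH(dim=8, k=3, L=10, seed=42)
near = [0, 1, 1, 0, 0, 1, 0, 1]   # differs from the query in one coordinate
far = [1, 0, 0, 1, 1, 0, 1, 0]    # differs in every coordinate
lsh.index(near)
lsh.index(far)
res = lsh.query([0, 1, 1, 0, 0, 1, 1, 1], r=2)
```

With these parameters the far point can never collide with the query (they differ in every coordinate), while the near duplicate collides in at least one of the L tables with very high probability; the final exact-distance check mirrors step 9 of Algorithm 3.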
More specifically, for any point p, we have hash functions h_i(p) = p_i, where 1 ≤ i ≤ d and d is the dimensionality of the data point. The h_i's are locality sensitive, since if D(p, q) ≤ r then Pr(h(q) = h(p)) ≥ 1 − r/d, and if D(p, q) ≥ (1 + ε)r then Pr(h(q) = h(p)) ≤ 1 − (1 + ε)r/d. Note that this statement holds not only for binary vectors but also for M-ary vectors.

L1 Distance Under the L1 norm, one way to form locality sensitive hash functions is to first embed the vector into a binary representation and then project the new representation onto one of its coordinates. For example, for an M-ary data point p, we can form its binary representation as follows: first convert each dimension of p into a binary string of p_i ones followed by M − p_i zeros, and then concatenate all dimensions together to form one long binary string. It is straightforward to see that the Hamming distance under this new binary representation equals the L1 distance under the original representation, and the previously mentioned coordinate-projection hashing functions are still locality sensitive. However, this intuitive method only works well when the value of M is small, and thus a more direct approach is needed. As presented in [71], for a range search problem with radius r, we can first pick a real number w ≫ r, then randomly sample d real numbers s_1, s_2, ..., s_d ∈ [0, w), and finally hash the original vector p onto the grid by setting h_{s_1,s_2,...,s_d}(p) = (⌊(p_1 − s_1)/w⌋, ⌊(p_2 − s_2)/w⌋, ..., ⌊(p_d − s_d)/w⌋).

L2 Distance Under the L2 distance, locality sensitive hashing functions can be formed by first randomly projecting the data point onto a one dimensional line, and then dividing the line into segments shifted by a random offset. Formally, for any data point p, we pick two real numbers w and b ∈ [0, w), followed by randomly generating a Gaussian vector a, to form the one dimensional representation h_{a,b}(p) = ⌊(a · p + b)/w⌋.
It has been shown [65] that this approach of generating hashing functions can be extended to the case of the L_s norm, where s ∈ [0, 2). To achieve this, the projection vector a is drawn not from a Gaussian distribution but from an s-stable distribution.

Jaccard Distance For two sets A and B, the Jaccard similarity is defined as S_J(A, B) = |A ∩ B| / |A ∪ B| and the resulting distance is defined as 1 − |A ∩ B| / |A ∪ B|. Under this distance metric, a family of functions called min-hash can be defined as follows: first define a random permutation π over the universe U, and then let h_π(A) = min{π(a) | a ∈ A}. It has been proved [72] that the probability of collision equals the similarity under the Jaccard distance, i.e., Pr(h_π(A) = h_π(B)) = S_J(A, B). More discussion and extensions of min-hash can be found in [73, 74].

2.2.3 Comparison: Search Methods for Dense Data

In this subsection we compare the empirical performance of KD-tree variants and Locality Sensitive Hashing on approximate nearest neighbor search. A comprehensive empirical study comparing different approximate nearest neighbor search algorithms is presented in [69]. In that experimental setup, the data set consists of 1,000,000 SIFT points. Three algorithms, LSH, ANN and Best Bin First Search, were compared on their probabilities of retrieving the true nearest neighbor from the data set. It was shown that Best Bin First Search outperformed both LSH and ANN by about an order of magnitude (in speed), at accuracy levels from 50% to 95%. This suggests that multiple randomized KD-tree search is an effective approach in practice. However, note that in this study, algorithms were evaluated on their performance in finding the true nearest neighbor. LSH is designed for approximate nearest neighbor search and thus is not optimal for this task. Furthermore, LSH's complexity depends on the error factor ε, so it approaches a linear scan as the accuracy requirement becomes high.
So it is not surprising that LSH is inefficient at finding the true nearest neighbor. To make the discussion complete, we cite another work. In [65], the authors showed that LSH is about 1 ∼ 2 orders of magnitude faster than exact KD-tree search for approximate nearest neighbor search (with ε = 1). They also claimed that the speed-up increases linearly with the dataset size. However, no significant additional speed-up was gained as the dimensionality kept increasing.

2.2.4 Summary

In this section, we have discussed two categories of methods for nearest neighbor search: partition-based methods and projection-based methods. Despite their apparent differences, both kinds of methods essentially utilize a two-stage scheme for efficient computation: an indexing step followed by an exact distance calculation step. In practice, these methods are often used for approximate nearest neighbor search to obtain near-optimal solutions in sub-linear time, with only linear storage. They have proved very effective for nearest neighbor search in large scale high dimensional data. However, despite their effectiveness, these methods do have drawbacks. The ANN algorithm has time complexity linear in the dataset size n but exponential in the dimensionality d. Best Bin First Search relies on heuristics for early stopping and lacks a theoretical analysis. Ball-tree search has complexity O(d log n) for moderate d, but it can be as bad as O(dn) for very large dimensionality. Finally, the complexity of the LSH algorithm depends on the approximation level: it approaches linear complexity as the solution becomes more precise. In a comprehensive empirical study of approximate nearest neighbor search algorithms [69], Best Bin First Search on multiple randomized KD-trees outperformed the LSH and ANN algorithms by about an order of magnitude, for many different precision settings.
This suggests that multiple randomized KD-tree search is an effective approach in practice, at the expense of using more memory for multiple trees.

2.3 Retrieval Methods for Bag of Features

In this section we focus on large scale retrieval methods for bag-of-features represented data. The concept of a bag is similar to that of a set, except that repeated elements are allowed. Bags usually vary in their cardinalities, and there is no ordering defined among the features within each bag. To match two objects, one essentially needs to find the best mapping from one set to another, which is computationally expensive. In our discussion, we focus on the topic of large scale image retrieval, since each image is naturally represented by a bag of descriptors. Image retrieval refers to the process of finding images that are relevant to a user's query. The query can be either textual or an image, corresponding to text based image retrieval [17] or content based image retrieval (CBIR) [18, 19], respectively. Instead of trying to solve the general content based image retrieval problem, here we focus on the problem of near duplicate image retrieval [75], where the goal is to identify gallery images that share high visual similarity with the query image. This is because general purpose CBIR performs poorly due to the semantic gap (the gap between pixel statistics and human interpretation) [76], whereas the problem of near duplicate image retrieval is better defined and more manageable. For the near duplicate image retrieval problem, a number of studies [27, 75, 77–79] have shown that local image features, e.g. SIFT descriptors [30], often referred to as keypoints, are significantly more effective than global image features such as color, texture and shape. The main idea of the keypoint based approach is to extract salient local patches from an image, and represent each local patch by a multi-dimensional vector.
As a result, each image is represented by a collection of multi-dimensional vectors, i.e., a bag-of-features [80]. A number of algorithms [81–86] have been proposed to measure the similarity between two images based on their bag-of-features representations, but these methods are difficult to scale to large datasets since they usually require a linear scan of the entire dataset. To address the computational efficiency issue, several methods have been proposed, including pyramid hashing [2], the bag-of-words model [27, 70, 77] and Min-Hash [87]. The pyramid hashing approach extends the work of pyramid matching [83], using a randomized hashing method to improve the efficiency of normalized partial matching. The bag-of-words model converts the image retrieval problem into a text retrieval problem and takes advantage of the inverted index to improve efficiency; various methods in this category differ in how the visual vocabulary is created. Min-hashing utilizes LSH to create an efficient representation for feature sets after keypoints are quantized. In the following subsections, we elaborate on all of these approaches.

2.3.1 Pyramid Matching and Pyramid Match Hashing

The Pyramid Matching Kernel [32, 83, 88] was proposed to generate kernels for objects represented by bags of features. The key idea is to map features to multi-resolution histograms and compute a similarity score based on weighted histogram intersections. In [32], the authors proposed to first divide the feature space into the finest bins (every data point belongs to its own bin) to make the first-level histogram, and then create subsequent histograms by doubling the bin sizes. If two points fall into the same histogram bin, they are considered a match. The final matching score between two objects is calculated as the weighted number of matches between the two point bags. Note that in this work, the division of the feature space is uniform; despite its implementation convenience, it has limited performance.
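A toy version of this uniform-bin scheme, with grid cells doubling in size per level and new matches at level l weighted by 1/2^l, can be sketched as follows. This is a simplified sketch for illustration, not the implementation of [32]; the function names are ours.

```python
from collections import Counter

def histogram(points, cell):
    # Map each feature to its grid cell at the given resolution.
    return Counter(tuple(int(x // cell) for x in p) for p in points)

def intersection(h1, h2):
    # Histogram intersection; Counter returns 0 for missing bins.
    return sum(min(h1[b], h2[b]) for b in h1)

def pyramid_match(X, Y, levels=4):
    """Weighted sum of *new* matches formed at each level: finer levels
    (smaller cells) get larger weights, mirroring the uniform-bin kernel."""
    score, prev = 0.0, 0
    for l in range(levels):
        matches = intersection(histogram(X, 2 ** l), histogram(Y, 2 ** l))
        score += (matches - prev) / (2 ** l)   # weight 1/2^l at level l
        prev = matches
    return score

X = [(0.5, 0.5), (3.2, 3.1)]
Y = [(0.6, 0.4), (9.0, 9.0)]
print(pyramid_match(X, Y))  # the pair near the origin matches at the finest level
```

The score only counts a match once, at the finest level where it first appears, which is how the kernel approximates the optimal partial matching between the two bags.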
[88] extends the original idea by creating the histograms based on the underlying structure of the feature space. Using hierarchical clustering, the algorithm creates a hierarchical decomposition of the feature space, and the resulting histograms are obtained from non-uniformly shaped clusters. It has been shown [32] that the similarity calculated with the pyramid matching kernel approximates the optimal partial matching, and empirical evaluations also demonstrated its effectiveness. A drawback of the pyramid matching kernel is that, given a query image, a linear scan is required to compute the similarity between the query image and every image in the database, which does not scale well to a large image database. In [2], the authors proposed a randomized hashing based method to improve the efficiency of pyramid matching. The pyramid match hashing algorithm first maps a bag-of-features representation to a single high dimensional vector via an embedding function; then a set of locality sensitive hash functions is used to generate hash strings from the vector. Thus, every image is essentially converted into a binary string. When a query arrives, the same embedding and hashing functions are applied to generate a binary string representation, and then an approximate nearest neighbor search algorithm is used to find similar strings in the dataset. Those similar strings correspond to candidate images that are likely to be similar to the query image. Finally, pyramid matching is applied to calculate the exact similarities between the query and those candidate images. Figure 2.4 shows the idea of pyramid matching and pyramid match hashing.

Figure 2.4 Illustration of (a) pyramid matching and (b) pyramid match hashing. The pyramid matching approach takes two sets of feature vectors and maps them to multi-resolution histograms, and intersects them to efficiently approximate the optimal partial matching between the original feature sets.
The pyramid match hashing approach utilizes an embedding of the pyramid histogram and randomly hashes the new representation to allow sub-linear time indexing. Finally, the pyramid match is applied to only a small portion of the database examples [2].

To summarize the above discussion, pyramid matching essentially serves as a precise matching step, since it approximates the optimal partial matching very well; however, it is difficult to scale to large datasets. The pyramid match hashing approach improves scalability by first representing a bag of features by a single high dimensional vector and then searching for its nearest neighbors using an approximate approach. The exact pyramid matching is only applied to a small subset of candidates identified in the first stage.

2.3.2 The Bag of Words Model

Among all approaches proposed for efficient large scale image retrieval, the bag-of-words model [77] is probably the most popular one due to its empirical success. It takes advantage of the inverted index, which has been successfully used by web search engines to index billions of documents. The key idea is to quantize the continuous high-dimensional vectors into a vocabulary of visual words, which is typically achieved by a clustering algorithm. By treating each cluster center as a word in a codebook, this approach maps each image feature onto its closest visual word, and then represents each image by a histogram of visual words. The above vector quantization process converts an image retrieval problem into a text retrieval problem, so that standard text retrieval methods can be used to efficiently retrieve relevant images given a query image. A number of studies have shown promising performance of this approach for image retrieval [27, 75, 77–79, 89] and object recognition [77, 90–95]. Figure 2.5 illustrates the idea of the bag-of-words approach.
Figure 2.5 Illustration of the bag-of-words representation of images

It is worth noting that the effectiveness of the bag-of-words approach is highly dependent on the quality of the nearest neighbor search and the clustering algorithm used. Nearest neighbor search is crucial since it is the most basic operation in both the clustering stage and the discretization stage. The clustering algorithm is important because it affects the quality and the size of the visual word vocabulary. In [77], exact nearest neighbor search together with naive K-means clustering is used. Due to the high computational cost of both nearest neighbor search and clustering, the authors only generate several thousand visual words, and the retrieval accuracy is suboptimal due to the limited vocabulary size. In [27], the authors proposed using hierarchical clustering to build a vocabulary tree, making it possible to create a much larger vocabulary and obtaining much better retrieval performance. In [70], approximate nearest neighbor search based on the best bin first method and multiple randomized KD-trees is used, and the reported retrieval accuracy is even better than that of [27]. Other approaches have been proposed to further improve the efficiency of bag-of-words based methods; the overall idea is to reduce the computational cost of large scale vector quantization. In [96], the authors proposed an algorithm that generates random histograms as a vector representation for bags of features, where LSH is used to create a high dimensional embedding for feature sets in order to simplify the matching problem. In [97], a random seeding approach is suggested to discretize each keypoint into a binary vector using a set of randomly generated hyper-spheres. The method is designed to approximate the similarity measure based on the optimal partial matching between two sets of keypoints, and it avoids the high computational cost of large scale high dimensional clustering.
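The quantize-then-index pipeline shared by these bag-of-words variants can be sketched as follows. This is a minimal sketch: in practice the codebook comes from k-means or a vocabulary tree, scoring uses tf-idf weights rather than raw counts, and all names here are illustrative.

```python
def quantize(feature, codebook):
    # Assign a feature vector to its nearest visual word (codebook row index).
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda w: d(feature, codebook[w]))

def bag_of_words(features, codebook):
    hist = [0] * len(codebook)
    for f in features:
        hist[quantize(f, codebook)] += 1
    return hist

def build_inverted_index(images, codebook):
    # visual word id -> list of (image id, term frequency)
    index = {}
    for img_id, feats in enumerate(images):
        for w, tf in enumerate(bag_of_words(feats, codebook)):
            if tf:
                index.setdefault(w, []).append((img_id, tf))
    return index

def search(query_feats, index, codebook):
    # Score images by accumulated matched-word counts via the inverted index,
    # touching only the postings of words that occur in the query.
    scores = {}
    for w, tf_q in enumerate(bag_of_words(query_feats, codebook)):
        if tf_q:
            for img_id, tf in index.get(w, []):
                scores[img_id] = scores.get(img_id, 0) + min(tf_q, tf)
    return sorted(scores, key=scores.get, reverse=True)

codebook = [[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]]   # stand-in for k-means centers
images = [[[0.1, 0.2], [9.8, 10.1]],                  # image 0
          [[19.7, 0.3], [20.2, 0.1]]]                 # image 1
index = build_inverted_index(images, codebook)
print(search([[0.0, 0.1], [10.2, 9.9]], index, codebook))  # image 0 ranks first
```

Only images sharing at least one visual word with the query are ever scored, which is why query time grows with the number of relevant images rather than the database size.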
All these methods make the bag-of-words model more efficient and sometimes even more effective. Although the bag-of-words model is the state-of-the-art method for large scale near duplicate image retrieval, it does have disadvantages. First, when quantizing a feature descriptor into a visual word, the discriminability of the feature decreases, since quite different descriptors can be mapped to the same visual word. Second, the quantization process does not take into account the geometric relationships between features. For instance, in the application of near duplicate image detection, near duplicate images are often the results of affine transformations of the original image. As a result, two descriptors that are close to each other are likely to match another two adjacent descriptors in the other image. However, quantized visual words do not respect geometric constraints like this. Both factors increase the possibility of false matches, which can hurt the retrieval accuracy. To address this problem, geometric verification is often utilized as a post-processing step to improve precision [30, 70, 77]. The basic idea is to estimate a transformation between the query image and each target image, based on how well the feature locations are predicted by the estimated transformation. The target images are then re-ranked based on the discriminability of the spatially verified visual words. However, since geometric verification is computationally expensive, in practice it is only applied to a small subset of the top-ranked candidate images. To summarize, the bag-of-words model is an efficient and effective approach for large scale image retrieval. It utilizes document indexing methods and retrieval models to retrieve relevant images. This process can be viewed as the first stage in a large scale retrieval system, because it quickly selects a subset of potentially relevant candidates.
A geometric verification step follows the first stage to re-rank the candidate images and improve precision. The verification step is essentially the second stage, since it provides more precise matching and is more computationally expensive. Some recent studies show a trend toward integrating geometric information directly into the matching algorithm instead of using it afterwards. For example, Wu et al. proposed to match keypoints in a bundled group instead of individually [3]. In particular, quantized keypoints are grouped based on MSER regions and the matching is performed at the region level. The similarity between two regions is determined by both the number of matched visual words and their geometric consistency. The geometric information is encoded by the relative positions of visual words within bundles. By enforcing geometric constraints, this algorithm not only asks for keypoint matches but also requires that the matches be in the right layout. Figure 2.6 illustrates the main idea of matching bundled features. It was shown in [3] that the bundled feature based approach achieves retrieval performance comparable to existing full geometric verification approaches while being much less (∼ 25%) computationally expensive. Another piece of work in a similar fashion is geometric min-hashing, as we will discuss in the next subsection.

Figure 2.6 The geometric ordering constraints for bundled features [3]. (a) matches preserving the relative ordering of the SIFT features inside a bundle; (b) matches not respecting geometric order are penalized.

2.3.3 Min-Hashing and Geometric Min-Hashing

Using min-hashing for image retrieval was proposed in [87]. The basic idea is to use a set-based similarity to measure the similarity between two images represented in bag-of-words form. Recall that for two sets A and B, the Jaccard similarity is defined as S_J(A, B) = |A ∩ B| / |A ∪ B| and the resulting distance is defined as 1 − |A ∩ B| / |A ∪ B|.
Under this distance metric, a family of functions called min-hash can be defined as follows: first define a random permutation π over the universe U, and then let h_π(A) = min{π(a) | a ∈ A}. It has been proved [72] that the probability that two sets have the same min-hash code equals the set-based similarity between them, i.e., Pr(h_π(A) = h_π(B)) = S_J(A, B). We can use min-hash to solve the image retrieval problem as follows. We first quantize all images into the bag-of-words representation. Then we define a permutation function over the entire vocabulary and hash the gallery images and the query image into min-hash bits. A gallery image is considered relevant if it shares the same hash bits with the query image. Note that the hashing is applied after vector quantization (e.g. clustering), so this method is somewhat related to the bag-of-words model. However, since it does not use document indexing and retrieval methods to retrieve images, we discuss it separately from the bag-of-words model. A recent extension of min-hashing takes geometric constraints into account [98]. Unlike the original min-hashing algorithm, where n random hash functions h_i are repeatedly applied to the entire image (the same visual word set for every h_i) to form an n-tuple, the input sets for geometric min-hashing (GmH) functions are no longer fixed. In particular, in GmH, the input set of h_{i+1} is the neighborhood set of the output of h_i. Furthermore, scale constraints are imposed when choosing subsequent input sets. To see how it works, imagine that when a keypoint match occurs, GmH further matches the region around that matching point. The image retrieval task benefits from GmH since region matching is more informative than matching a single visual word. Algorithm 4 describes how to generate the min-hash code given a bag-of-words representation of an image. Two images are considered to match each other only if they share the same hash string.
Algorithm 4 Procedure of Geometric Min-Hashing
Require: The bag-of-words representation of an image.
Require: The location information of the descriptors.
Require: k: the length of the hash string. m: neighborhood size parameter.
1: From the given image, select the descriptors that are mapped to different visual words and have at least m neighboring descriptors of similar scale, forming set S1.
2: Create each hash k-tuple as follows.
3: Apply a min-hash function on S1 obtained in step 1, which produces a central descriptor (represented by its visual word ID).
4: Find the neighboring descriptors that share a similar scale with the central descriptor, forming set S2. Apply min-hash on S2 to obtain the second descriptor (represented by its visual word ID).
5: Repeat until k descriptors are obtained.

2.3.4 Comparison: Retrieval Methods for Bag-of-Features

For the retrieval of bag-of-features data, we covered three categories of methods: pyramid matching, the bag-of-words model and min-hashing. The bag-of-words model is the state of the art for large scale image retrieval due to its efficiency and reasonable effectiveness (recall and accuracy). Compared with pyramid matching, its retrieval accuracy is sub-optimal, since pyramid matching approximates the optimal partial matching between two feature sets [32, 83]. In [83], pyramid matching was empirically compared against the bag-of-words model on an object recognition task based on the Caltech-101 data set, and pyramid matching was claimed to perform substantially better. As for retrieval speed, it has been reported [32] that a pair-wise pyramid matching between two images takes ∼ 0.05 seconds. Since pyramid matching requires a linear scan of the entire data set, the overall running time for processing an image is roughly 0.05N seconds, where N is the total number of images. The bag-of-words model is much more efficient.
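The geometric min-hashing procedure of Algorithm 4 can be sketched in a simplified toy form. The descriptor tuples, the neighborhood test, and the use of seeded random priorities to simulate a min-hash permutation over visual-word IDs are all illustrative assumptions, not the authors' implementation:

```python
import random

def gmh_sketch(descriptors, k, m, radius=1.0, scale_ratio=1.5, seed=0):
    """Toy geometric min-hash. descriptors: list of (word_id, x, y, scale) tuples.
    Returns a hash string of up to k visual-word IDs."""
    rng = random.Random(seed)

    def neighbors(d):
        # Spatially close descriptors of similar scale (simplified neighborhood test).
        return [e for e in descriptors if e is not d
                and abs(e[1] - d[1]) <= radius and abs(e[2] - d[2]) <= radius
                and max(e[3], d[3]) / min(e[3], d[3]) <= scale_ratio]

    # Step 1: keep descriptors with at least m similar-scale neighbors (set S1).
    pool = [d for d in descriptors if len(neighbors(d)) >= m]
    chosen = []
    # Steps 3-5: min-hash the pool to pick a central descriptor, then restrict the
    # pool to that descriptor's neighborhood, and repeat until k IDs are chosen.
    while pool and len(chosen) < k:
        priority = {d[0]: rng.random() for d in pool}   # simulated random permutation
        central = min(pool, key=lambda d: priority[d[0]])
        chosen.append(central[0])
        pool = [d for d in neighbors(central) if d[0] not in chosen]
    return "-".join(str(w) for w in chosen)

# Hypothetical descriptors: (visual word ID, x, y, scale).
descs = [(1, 0.0, 0.0, 1.0), (2, 0.5, 0.0, 1.0), (3, 0.0, 0.5, 1.0), (4, 0.2, 0.2, 1.0)]
print(gmh_sketch(descs, k=2, m=2))
```

As in the algorithm, two images are considered a match only if their hash strings are identical, so a collision implies a matched region rather than a single matched word.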
Based on reported experiments [3, 18] and our experience, it takes 10^-2 ∼ 10^-1 seconds to process a query against a large data set containing ∼ 100K images. Due to the use of the inverted index, the query time does not increase linearly in N; ideally, it should increase linearly w.r.t. the number of relevant images. To the best of our knowledge, there is no direct comparison between the min-hash approach and the other methods. But since it also relies on the bag-of-words model and only uses the hashing-based similarity in the second stage, we can assume its recall/accuracy and speed are close to those of the other bag-of-words based approaches. All these methods can be improved in their performance and/or speed with modifications. For instance, Pyramid Match Hashing [2] can greatly reduce the running time of pyramid matching at the cost of a minor accuracy drop: it usually needs to check only 2% ∼ 3% of the data set to achieve almost the same performance (∼ 99%) as the pyramid matching method. The bag-of-words model can be boosted with geometric constraints. By matching features based on regions and penalizing violations of geometric constraints, the retrieval accuracy can be improved by ∼ 50% at the cost of less than one second of additional processing time [3] (the experiments were done on a 1 million image dataset). For min-hashing methods, both speed and performance improvements were reported [98] when geometric constraints were taken into account.

2.3.5 Summary

To summarize our discussion of image retrieval based on the bag-of-features representation, we see that a large scale retrieval system still requires a two stage scheme to achieve both efficiency and reasonable accuracy. For the pyramid matching method, the matching accuracy is good enough but the efficiency is the bottleneck. To address this issue, pyramid match hashing was proposed to serve as a first stage that efficiently identifies candidate images.
On the other hand, for the bag-of-words approach the efficiency is not a big issue, but the retrieval accuracy suffers because of the information lost during vector quantization. As a result, a post-processing step, e.g. geometric verification, is proposed to serve as a second stage to improve the accuracy. Both approaches are augmented with methods that address either their efficiency or their accuracy issue as needed. Some recent approaches aim to unify the two stages; bundled features and geometric min-hashing are two of them. These two pieces of work, however, pose the geometric requirements differently: bundled features require the matches to be in the right layout, whereas geometric min-hashing asks for more matches within some region. In general, the idea of integrating the geometric information directly into the matching algorithm appears preferable.

CHAPTER 3 RANDOM SEEDING FOR KEYPOINT QUANTIZATION

3.1 Introduction

In Chapter 2, we covered various methods for near duplicate image retrieval based on the bag-of-features representation. Generally speaking, among all methods, the optimal partial matching based methods [81–83, 99] deliver superior performance. The basic idea is to measure the distance between two images by finding the mapping between the keypoints in the two images with the overall shortest distance. The main shortcoming of the optimal partial matching is its high computational cost: given a query image, a linear scan is required to compute the similarity between the query image and every image in the database, which does not scale well to a large image database. To improve the computational efficiency of optimal partial matching, several methods have been proposed [2, 77, 100]. Among them, the bag-of-words model [77] is the most popular one due to its empirical success. However, the bag-of-words model suffers from the following drawbacks:

• High computational cost in visual vocabulary construction.
One of the key steps in constructing the bag-of-words model is to cluster a large number of keypoints into a relatively smaller number of visual words. For large scale image retrieval, we often need to cluster billions of keypoints into millions of clusters. Although several efficient algorithms [69, 78, 89, 101–105] have been developed for large-scale clustering problems, it is still expensive to generate a vocabulary with millions of visual words.

• High computational cost in keypoint quantization. Given the constructed visual vocabulary, the next step in the bag-of-words model is to map each keypoint in a database image to a visual word, which requires finding the nearest visual word for every keypoint. Since the computational cost for keypoint quantization is linear in the number of keypoints, it is expensive to quantize the keypoints of a very large image database to a visual vocabulary. Even with the help of approximate nearest neighbor search algorithms, this step is still costly when a good approximation is desired.

• Inconsistent mapping of keypoints to visual words. The radius of a cluster (i.e., the maximum distance between the keypoints in the cluster and its center) can vary significantly from cluster to cluster. As a result, for clusters with a large radius, two keypoints can be mapped to the same visual word even if they differ significantly in visual features, leading to an inconsistent criterion for keypoint quantization and potentially poor performance in image matching.

• Lack of a theoretical analysis. Most published studies on the bag-of-words model are motivated by efficiency considerations and are primarily focused on its empirical performance. Although [35] showed that the similarity between two bag-of-features representations can be interpreted as a matching algorithm between descriptors, it did not establish the relationship between the bag-of-words model and the optimal partial matching.
Without a theoretical analysis, the success of the bag-of-words model can only be demonstrated based on empirical performance. As such, improvements to the bag-of-words model are mostly made by tuning parameters based on heuristics.

We present a new quantization scheme that explicitly addresses the shortcomings of the clustering-based approach for keypoint quantization. Unlike most studies on keypoint quantization that are motivated by computational efficiency, the proposed scheme aims to effectively approximate the optimal partial matching based similarity measure. The main idea is to discretize each keypoint into a binary vector by a set of randomly generated hyper-spheres. Each hyper-sphere is analogous to a cluster, and discretizes a keypoint into a binary bit: 1 when the keypoint is within the hyper-sphere and 0 otherwise. The bag-of-words representation for an image is then computed as a histogram over the hyper-spheres. Our theoretical analysis shows that despite its simplicity, the image similarity based on the proposed quantization scheme serves as an upper bound on the similarity based on optimal partial matching. To distinguish the proposed scheme from the clustering based approach for keypoint quantization, we refer to it as a random seeding approach for keypoint quantization. The computational advantage of the proposed approach arises from two aspects: first, the random seeding approach does not require clustering (e.g., the K-means algorithm) to identify visual words, leading to a significant saving in computation; second, given the constructed visual vocabulary, the complexity of quantizing m keypoints into n visual words is O(m ln n) for the clustering-based approach, and O(n ln m) for the proposed scheme.¹ Since m is usually at least one order of magnitude larger than n, the random seeding approach is significantly more efficient than the clustering-based approaches for keypoint quantization, which is confirmed by our empirical study.
Besides the computational efficiency, the random seeding approach deploys a more consistent criterion for keypoint quantization: a keypoint is mapped to a visual word if and only if the distance between the keypoint and the visual word is smaller than the radius r of the hyper-sphere. By choosing a relatively small r, the proposed quantization approach is able to ensure a large similarity between a keypoint and its mapped visual word. In contrast, the clustering-based approach maps a keypoint to its closest visual word even when they are separated by a large distance, leading to an inconsistent criterion for keypoint quantization and potentially poor performance in image matching.

¹The analysis here only considers the time for approximate nearest-neighbor/range search in keypoint quantization. It ignores the time for constructing KD-trees, which in practice is one order of magnitude smaller than approximate nearest-neighbor/range search.

Figure 3.1 The distribution of Euclidean distances between SIFT keypoints and their nearest visual words, computed based on the Oxford5K dataset (14,972,956 keypoints) with 10K clusters.

Figure 3.1 shows the distribution of distances between keypoints and their nearest cluster centers for the Oxford5K dataset with 10K clusters. It can be seen that these distances have a large variance, indicating a potentially inconsistent criterion for mapping keypoints to visual words. Furthermore, some point-to-center distances are very large, making the clustering-based quantization less accurate. We integrate the random seeding approach into a recognition/retrieval system for large-scale near duplicate image retrieval. We demonstrate the efficacy and efficiency of our system for image retrieval on a database of more than 1 million images.
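The hyper-sphere quantization just described can be sketched in a few lines. This is a toy brute-force version with made-up dimensions and radius; the actual system replaces the distance scan with approximate range search:

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 8, 2.5                                  # descriptor dimension and sphere radius (toy values)
keypoints = rng.normal(size=(200, d))          # all keypoints in the collection
# Random seeding: the N=16 "visual words" are simply sampled keypoints.
centers = keypoints[rng.choice(200, size=16, replace=False)]

def binary_code(x, centers, r):
    """b(x): one bit per hyper-sphere, 1 iff x lies inside it."""
    return (np.linalg.norm(centers - x, axis=1) <= r).astype(int)

def bag_of_words(image_keypoints, centers, r):
    """b(I): sum of the binary codes of an image's keypoints, a histogram over spheres."""
    return sum(binary_code(x, centers, r) for x in image_keypoints)

image = keypoints[:20]                         # pretend the first 20 keypoints form one image
hist = bag_of_words(image, centers, r)
print(hist.shape)                              # (16,) -- one count per visual word
```

Note how the mapping criterion plays out: a keypoint inside several spheres contributes to several words (soft membership), while a keypoint outside all spheres contributes nothing (outlier elimination).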
We further validate our system in the domains of tattoo image retrieval and automatic image annotation. The rest of this chapter is organized as follows: Section 2 reviews the related work on image retrieval and efficient methods for keypoint quantization; Section 3 describes the proposed scheme for keypoint quantization, together with a theoretical analysis that validates the proposed approach; Section 4 presents our empirical study of the proposed scheme through a large-scale near duplicate image retrieval experiment, and also validates our conclusions in two application domains, tattoo image retrieval and automatic image annotation; Section 5 makes concluding remarks about our contribution and suggests future research directions.

3.2 Related Work

Although the bag-of-words model has been shown to be effective for image/object retrieval, one of the major challenges, as pointed out in [101], is how to efficiently construct a vocabulary of millions of visual words for a very large image database. This requires developing efficient algorithms for large-scale data clustering. One of the early studies [77] used flat K-means clustering, but it is unable to handle large vocabularies. In [106], the authors showed that hierarchical clustering is able to greatly increase the size of the visual vocabulary and therefore is more suitable for large-scale image retrieval. Recent studies [69, 78, 89, 101] have shown that K-means based on approximate nearest neighbor search can scale to a very large vocabulary. The goal of these studies has been to improve the efficiency of approximate nearest neighbor search by a randomized algorithm, such as randomized trees [78] or Locality Sensitive Hashing (LSH) [102, 103]. Several algorithms have also been proposed for efficient keypoint quantization that avoid clustering [87, 96].
These methods essentially extend Locality Sensitive Hashing [103], which was originally designed for efficient nearest neighbor search in high dimensional spaces, to efficiently compute the matching between two sets of keypoints. The main objective of these algorithms is to improve the computational efficiency of keypoint quantization, but they have been shown to be less effective than the clustering based approach. In contrast, the proposed random seeding approach is motivated by the approximation of the similarity measure based on optimal partial matching. Our empirical studies show that the proposed approach not only has better retrieval performance but is also computationally more efficient than cluster-based quantization. Several approaches have been proposed to improve the overall retrieval performance of the bag-of-words model. In [107], relevance feedback is used to improve the retrieval accuracy of a bag-of-words model. In [108], the images returned by a bag-of-words model are reranked by taking into account the geometric constraints in computing the matching scores. The concept of bundled features is proposed in [3] to combine two sets of matching scores, one based on a bag-of-words model and the other based on the matching of image regions. In [109], the authors represented each image by two models, a bag-of-words model and a mixture model; the similarity between two images is calculated as a linear combination of matches based on the two models. In [110], multiple bag-of-words models are constructed and combined to improve the retrieval accuracy. While these approaches improve the retrieval performance, they are all built upon the basic bag-of-words model. Thus, by improving the efficiency and effectiveness of the bag-of-words model as proposed here, we essentially improve the performance of all these approaches for image retrieval.
Finally, since the proposed approach essentially represents each keypoint by a binary vector, it is also related to spectral hashing [111] and Hamming embedding [35], which aim to embed keypoints in a space of binary vectors. The proposed approach differs from these embedding approaches in that it produces a very sparse set of binary vectors, leading to efficient retrieval algorithms for large image databases.

3.3 Random Seeding for Keypoint Quantization

In this section, we first present the random seeding algorithm for keypoint quantization. We then establish the relationship between the proposed scheme for keypoint quantization and the similarity measure based on the optimal partial matching between two sets of keypoints. Let D = (I_1, ..., I_T) be a collection of T images. Each image I_i is represented by a set of n_i keypoints, denoted by X_i = (x_i^1, ..., x_i^{n_i}), where each keypoint x_i^j ∈ R^d is a vector of d dimensions. We define X as the set of all the keypoints from all the T images. The first step of the proposed scheme is to randomly sample N keypoints from this set X, denoted by z_1, ..., z_N. A hyper-sphere B_i is created centered at each sampled keypoint z_i, i.e., B_i = {x ∈ R^d : |x − z_i|_2 ≤ r}, where r is a predefined constant that is computed based on the average distance between any two keypoints in the image collection D. Using the hyper-spheres {B_i}_{i=1}^N, we quantize each keypoint x into a binary vector b(x) = (b_1(x), ..., b_N(x)), where each element b_i(x) is 1 if x ∈ B_i and 0 otherwise. The bag-of-words representation for image I_i, denoted by b(I_i) = (b_1(X_i), ..., b_N(X_i)), is computed by adding the binary vectors of all the keypoints in image I_i, i.e., b_k(X_i) = Σ_{j=1}^{n_i} b_k(x_i^j), k = 1, ..., N.

One of the key properties of the proposed algorithm is its criterion for mapping a keypoint to visual words: a keypoint x_i^j is mapped to a visual word z_k if and only if the distance between x_i^j and z_k is smaller than the threshold r. This is in contrast to the criterion deployed by the clustering approach for keypoint quantization, where a keypoint is always mapped to its closest visual word even if the closest cluster center is far from the keypoint. The advantages of this new mapping criterion are threefold:

• Consistent mapping. The mapping criterion implies that two keypoints will never be mapped to the same visual word if they are separated by a distance of more than 2r. By appropriately choosing the threshold r, we can ensure that only "similar" keypoints will be mapped to the same visual word, leading to a more consistent mapping criterion than the clustering-based approach.

• Mapping to multiple visual words. A keypoint can be mapped to multiple visual words if it is within distance r of more than one visual word. This is similar to the idea of soft membership [34], which allows us to capture the uncertainty in mapping keypoints to visual words. One problem of soft membership is that every keypoint is forced to be associated with multiple visual words even if it is much closer to one visual word than to the others.² In contrast, in the proposed scheme, soft membership is introduced for a keypoint only when it is close to multiple visual words simultaneously; a keypoint will be mapped to a single visual word if it is far away from the others.

• Eliminating outlier keypoints. A keypoint is ignored by the proposed algorithm if its distances to all the N visual words are larger than the threshold r.
The underlying rationale is that if a keypoint is far away from all the visual words, it is very likely to be an outlier and therefore should be skipped in the construction of the bag-of-words model. By eliminating the outlying keypoints, we not only improve the accuracy of image retrieval but also its efficiency, as a smaller number of keypoints need to be considered for image matching.

²Although this problem can be solved using a kernel function [112], this is computationally prohibitive when both the number of keypoints and the number of visual words are large.

Algorithm 5 Random Seeding Approach for Keypoint Quantization
Input:
• Image collection D = {I_1, ..., I_T}, with each image I_i represented by a set of keypoints
• Number of visual words N
• Distance threshold r
Visual Vocabulary Construction
• Generate visual words z_1, ..., z_N by randomly sampling N keypoints from X, the set of all keypoints of all T images in D
Keypoint Quantization
1. Construct a randomized KD-tree T for all the keypoints detected in D.
2. For i = 1, ..., N
• Use KD-tree T and approximate nearest neighbor search to efficiently find the keypoints within distance r of visual word z_i
• Compute the histogram of visual word z_i for all the images in D

To efficiently implement the proposed random seeding approach, we use the Fast Library for Approximate Nearest Neighbors (FLANN)³, which implements randomized KD-trees [105], to efficiently decide which subset of keypoints is within distance r of a given visual word. Algorithm 5 shows the major steps of the proposed algorithm. Compared to the clustering-based approach, there are two main features of Algorithm 5:

• It constructs the visual vocabulary by randomly sampling keypoints from the image collection D and thus avoids the high computational cost of clustering.
• It constructs a randomized KD-tree for all the keypoints in D, and iterates over the visual words to find the subset of keypoints that are within distance r of each visual word. As a result, the computational cost for quantizing m keypoints into N visual words is O(N log m). In contrast, the clustering-based approach constructs a randomized KD-tree for the visual words, and iterates over the keypoints to find the nearest visual word for each keypoint, leading to O(m log N) complexity for keypoint quantization. Since m is usually at least an order of magnitude larger than N, the proposed algorithm is more efficient than the clustering-based approach for keypoint quantization. The above analysis ignores the time for constructing the randomized KD-trees, which is usually an order of magnitude faster than mapping keypoints to visual words.

³http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN

The theoretical foundation of our quantization scheme is the relationship between the image similarity resulting from this quantization scheme and the image similarity resulting from optimal partial matching. We first show that the partial matching based similarity can be upper bounded by a smooth similarity function. We then derive the proposed scheme of keypoint quantization, which approximates this upper bound similarity function with a small error.

3.3.1 Distance Measure Based on Optimal Partial Matching

Let two images I_x and I_y be represented by the corresponding sets of keypoints, denoted by X = (x_1, ..., x_m) and Y = (y_1, ..., y_n), where each x_i and y_j is a vector in d-dimensional space. Among the various similarity measures that have been proposed, the similarity based on the optimal partial matching [81, 82, 113] is probably the most intuitive one, and has yielded state-of-the-art performance for both image retrieval and object recognition. Let π^1 : {1, ..., m} → {1, ..., n} map each point in set X to a point in Y, and π^2 : {1, ..., n} → {1, ..., m} map each point in set Y to a point in X. Then the distance between X and Y for the given mappings π^1 and π^2, denoted by d(X, Y; π^1, π^2), is computed as⁴

d(X, Y; π^1, π^2) = Σ_{k=1}^m |x_k − y_{π^1_k}|_2^2 + Σ_{k=1}^n |y_k − x_{π^2_k}|_2^2,   (3.1)

where π^1_k indicates the index of the point in set Y to which x_k is mapped; similarly, π^2_k indicates the point in set X to which y_k is mapped. Finally, the optimal partial matching distance between X and Y, denoted by d(X, Y), is obtained by minimizing d(X, Y; π^1, π^2) over the mappings π^1 and π^2.

⁴Note that we generalize the optimal partial matching in [113] by making it a symmetric distance measure.

3.3.2 Approximating the Optimal Partial Matching by a Smooth Similarity Function

It is in general difficult to derive an appropriate scheme of keypoint quantization directly from (3.1), because the distance function d(X, Y) is non-smooth and is defined implicitly through the minimization over the mappings π^1 and π^2. To address this difficulty, we approximate the distance function in (3.1) by a smooth similarity function, as shown in the following lemma.

Lemma 1. For any λ > 0, we have

−d(X, Y) + (m + n) d_0 ≤ (2/λ) Σ_{i=1}^m Σ_{j=1}^n exp(−λ [|x_i − y_j|_2^2 − d_0]),   (3.2)

where d_0 = min_{i,j} |x_i − y_j|_2^2.

Proof of this lemma can be found in the Appendix. Note that the equality in Lemma 1 is attained when d_0 = |x_k − y_{π^1_k}|_2^2 = |y_k − x_{π^2_k}|_2^2 for all k and as λ goes to infinity. The lemma significantly simplifies the similarity measure, leading to a potentially efficient implementation of image retrieval systems. In Figure 3.2 we show the difference between the similarity based on optimal partial matching and the upper bound presented in Lemma 1. In this example, we randomly generate a set X that contains 1,000 data points in R^128, where each feature is an integer uniformly drawn from the interval [1, 255].
The second bag Y is generated by perturbing the data points in X, adding scaled Gaussian noise γN(0, 1) to every feature, where γ > 0 controls the noise level and is varied from 0 to 50. We set λ = 25 in this study. We observe from Figure 3.2 that when the perturbation is small (i.e., X and Y are similar), the upper bound provides a good approximation to the similarity based on optimal partial matching (measured as −d(X, Y)); the gap between the two similarity measures becomes large when the similarity between X and Y is small. Since we are mostly interested in near duplicate image retrieval, it is much more important to make a good approximation when the optimal partial matching based distance is small, making the result of Lemma 1 valuable in our application domain.

Figure 3.2 An example showing the relationship between the optimal partial matching similarity and the upper bound in Lemma 1. It shows that when two sets of keypoints (corresponding to two images) are similar, the upper bound in Lemma 1 provides a good approximation to optimal partial matching.

Using Lemma 1, we introduce the following similarity measure between point sets X and Y:

s(X, Y) = (1/mn) Σ_{i=1}^m Σ_{j=1}^n exp(−λ |x_i − y_j|_2^2).   (3.3)

Note that compared to Lemma 1, we set d_0 = 0 in the above similarity measure and introduce the factor 1/mn, which is useful for simplifying our analysis as shown later. Instead of directly computing the distance d(X, Y) between a query image and a database image, we compute the similarity s(X, Y), and rank the images in the database in descending order of their similarity.

3.3.3 Random Seeding Algorithm

We now show that the similarity measure s(X, Y) defined in (3.3) can be computed efficiently by the proposed keypoint quantization scheme.
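Before developing that construction, note that both d(X, Y) and s(X, Y) can be evaluated brute-force on toy data: since the mappings π^1, π^2 are unconstrained, the optimal partial matching distance decomposes into per-point nearest-neighbor distances. A minimal sketch (toy sizes and a made-up λ, not the 1,000-point, 128-dimensional setup of Figure 3.2):

```python
import numpy as np

def partial_match_dist(X, Y):
    """d(X, Y): with unconstrained mappings, each point matches its nearest
    neighbor in the other set, which minimizes Eq. (3.1)."""
    D2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared distances, (m, n)
    return D2.min(axis=1).sum() + D2.min(axis=0).sum()

def smooth_sim(X, Y, lam):
    """s(X, Y) from Eq. (3.3)."""
    D2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-lam * D2).mean()

rng = np.random.default_rng(0)
lam = 0.01
X = rng.uniform(1, 255, size=(50, 16))
sims = {}
for gamma in (0.1, 5.0):                      # small vs. larger perturbation
    Y = X + gamma * rng.normal(size=X.shape)
    D2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    d0 = D2.min()
    lhs = -partial_match_dist(X, Y) + (X.shape[0] + Y.shape[0]) * d0
    rhs = (2 / lam) * np.exp(-lam * (D2 - d0)).sum()
    assert lhs <= rhs                         # the bound of Lemma 1 holds
    sims[gamma] = smooth_sim(X, Y, lam)
print(sims[0.1] > sims[5.0])                  # nearer duplicates score higher
```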
The overall idea is to first interpret the similarity measure s(X, Y) as an expectation over a hidden variable z. We then approximate s(X, Y) with high accuracy by replacing the computation of the expectation with an average over a finite number of samples. We finally show that the approximation over the finite number of samples leads to a bag-of-words model that can be computed efficiently.

Similarity Measure s(X, Y) as an Expectation

In order to interpret the similarity measure s(X, Y) as an expectation over a hidden variable z, we consider a probabilistic model for the similarity measure between two sets of keypoints. We assume that the given set of keypoints X = (x_1, ..., x_m) are i.i.d. samples from an unknown distribution, denoted by p(z|X). Using kernel density estimation [114], the estimate of p(z|X), denoted by p̂(z|X), is computed as follows:

p̂(z|X) = (1/m) Σ_{i=1}^m (µ^{d/2} / π^{d/2}) exp(−µ |z − x_i|_2^2),   (3.4)

where µ specifies the kernel width and d is the dimensionality. Similarly, the estimate of the unknown distribution for a set Y of n keypoints, denoted by p̂(z|Y), is computed as

p̂(z|Y) = (1/n) Σ_{i=1}^n (µ^{d/2} / π^{d/2}) exp(−µ |z − y_i|_2^2).   (3.5)

The following lemma shows that the similarity measure s(X, Y) can be expressed in terms of p̂(z|X) and p̂(z|Y).

Lemma 2. The following relationship holds if µ = 2λ:

s(X, Y) = ([2π]^{d/2} / µ^{d/2}) ∫ p̂(z|X) p̂(z|Y) dz.   (3.6)

Proof can be found in the Appendix.

Similarity Measure as an Expectation

To view ∫ p̂(z|X) p̂(z|Y) dz as an expectation, we need to introduce an appropriate distribution for z. Note that ∫ p̂(z|X) p̂(z|Y) dz is defined as an integration over the unbounded space of z. Hence, it is inappropriate to define a uniform distribution for z, which would be an improper prior in the Bayesian sense. To address this difficulty, we assume that |x|_2 ≤ R for any keypoint x, where R is a large constant.⁵
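Lemma 2 can be sanity-checked numerically in one dimension, where the integral ∫ p̂(z|X) p̂(z|Y) dz is easy to evaluate on a grid. A sketch with toy data and a made-up λ, assuming only NumPy:

```python
import numpy as np

lam, d = 0.5, 1
mu = 2 * lam                                  # the condition of Lemma 2

X = np.array([0.0, 1.0, 2.5])                 # toy 1-D "keypoints"
Y = np.array([0.2, 1.1, 3.0])

def kde(z, pts):
    """p_hat(z | pts): the kernel density estimate of Eq. (3.4) with d = 1."""
    return np.mean((mu / np.pi) ** 0.5 * np.exp(-mu * (z[:, None] - pts[None, :]) ** 2),
                   axis=1)

z = np.linspace(-20.0, 20.0, 400001)          # a grid covering effectively all of R
dz = z[1] - z[0]
integral = (kde(z, X) * kde(z, Y)).sum() * dz # numerical integral of p_hat * p_hat

s = np.mean(np.exp(-lam * (X[:, None] - Y[None, :]) ** 2))   # s(X, Y) from Eq. (3.3)
rhs = (2 * np.pi / mu) ** (d / 2) * integral                 # right side of Eq. (3.6)
print(abs(s - rhs))                           # near zero: the identity holds
```

The agreement follows from the closed form of the product of two Gaussian kernels, which is what the proof of Lemma 2 exploits.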
We introduce a domain Q for z, defined as

Q = {z : |z|_2 ≤ γR},

where γ ≥ 1 is a constant that will be determined empirically.⁵ We then approximate ∫ p̂(z|X) p̂(z|Y) dz by s_1(X, Y), defined as an integration over the domain Q, i.e.,

s_1(X, Y) = ∫_{z∈Q} p̂(z|X) p̂(z|Y) dz.   (3.7)

⁵This assumption is reasonable since each dimension of the SIFT descriptor is upper-bounded.

The following lemma bounds the difference between ∫ p̂(z|X) p̂(z|Y) dz and s_1(X, Y).

Lemma 3. The following inequality holds for any γ > 0:

|s(X, Y) − ζ s_1(X, Y)| ≤ s(X, Y) exp(−2µγ(γ − 2)R^2) (π^{1/2}/(2µ)^{1/2} + γR)^{d−1},   (3.8)

where ζ = µ^{d/2} / [2π]^{d/2}. Proof can be found in Appendix 6.3. As indicated by Lemma 3, a large value of γ leads to a small value of |∫ p̂(z|X) p̂(z|Y) dz − s_1(X, Y)|, implying that s_1(X, Y) is close to ∫ p̂(z|X) p̂(z|Y) dz. This is shown by the following corollary.

Corollary 1. If

γ ≥ max{4, π^{1/2} / [(2µ)^{1/2} R], [(d − 1) + √((d − 1)^2 + 4µ ln(1/δ))] / (2µR)},

we have

|s(X, Y) − ζ s_1(X, Y)| ≤ δ s(X, Y).

The result follows immediately from Lemma 3. With the above analysis, we can now focus on the similarity measure s_1(X, Y). Since the domain of z in s_1(X, Y) is bounded, we can introduce a uniform distribution for z, i.e.,

q(z) = I(z ∈ Q) / vol(Q),

where vol(Q) stands for the volume of domain Q and I(x) is an indicator function that outputs 1 when x is true and 0 otherwise. Using the distribution q(z), we can interpret s_1(X, Y) as the expectation of p̂(z|X) p̂(z|Y) over the random variable z, i.e.,

s_1(X, Y) = vol(Q) E_z[p̂(z|X) p̂(z|Y)].   (3.9)

In the remaining analysis, we drop the factor vol(Q) for the sake of simplicity.

Keypoint Quantization by Random Samples

We can further approximate the expectation in (3.9) by replacing the distribution q(z) with a finite number of samples. Let z_1, z_2, ..., z_N be N samples randomly drawn from the distribution q(z).
Using the samples {z_k}_{k=1}^N, we approximate s_1(X, Y) by s_2(X, Y), defined as follows:

s_2(X, Y) = (µ^d / [mnNπ^d]) Σ_{k=1}^N Σ_{i=1}^m Σ_{j=1}^n exp(−µ|z_k − x_i|_2^2 − µ|z_k − y_j|_2^2)
          = (µ^d / [mnNπ^d]) Σ_{k=1}^N (Σ_{i=1}^m exp(−µ|z_k − x_i|_2^2)) (Σ_{j=1}^n exp(−µ|z_k − y_j|_2^2)).   (3.10)

It is important to note that, in the expression of s_2(X, Y), by replacing the expectation over the distribution q(z) with the average over the N samples, we essentially represent each point set X = (x_1, ..., x_m) by a vector

f(X) = (f_1(X), ..., f_N(X)),   (3.11)

where f_k(X), k = 1, ..., N, is defined as

f_k(X) = (1/m) Σ_{i=1}^m exp(−µ|z_k − x_i|_2^2).

Using the notation f(X) and f(Y), we can write s_2(X, Y) = f(X) · f(Y), i.e., the similarity s_2(X, Y) can be computed as the dot product between two histograms. The theorem below shows that with a sufficiently large number of samples N, s_1(X, Y) and s_2(X, Y) differ only by a small value.

Theorem 1. With probability 1 − δ, we have

|s_2(X, Y) − s_1(X, Y)| ≤ √((1/(2N)) ln(2/δ)).

Proof can be found in Appendix 6.3. The result in Theorem 1 only applies to two images; we further generalize it to a collection of images. Let D = (I_1, ..., I_T) be a collection of T images and X_i = (x_i^1, ..., x_i^{n_i}) be the set of keypoints associated with image I_i. We have the following corollary, showing that for any two images I_i and I_j in D, their similarities s_1(X_i, X_j) and s_2(X_i, X_j) are close to each other.

Corollary 2. Given a collection D of T images, with probability 1 − δ, the following inequality holds for any two images I_i and I_j in D:

|s_2(X_i, X_j) − s_1(X_i, X_j)| ≤ √((1/(2N)) ln(T(T − 1)/δ)).

The above corollary is easily proved by a union bound. Although (3.11) provides a vector representation for each keypoint set, it is in general not a sparse representation, which makes it difficult to implement an efficient retrieval algorithm that scales well to a large image database.
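The finite-sample feature map of (3.11) can be sketched directly, together with a thresholded variant that drops entries below a cutoff η to obtain a sparse histogram, as developed next. Toy data throughout; the centers z_k are drawn at random here purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, mu, eta = 16, 64, 0.5, 1e-3             # toy dimensions and parameters

X = rng.normal(size=(30, d))                  # keypoints of one image
Y = X + 0.05 * rng.normal(size=X.shape)       # a near-duplicate image
Z = rng.normal(size=(N, d))                   # centers z_1..z_N

def f(P):
    """f_k(P) = (1/m) * sum_i exp(-mu * |z_k - p_i|^2), the map of Eq. (3.11)."""
    D2 = ((Z[:, None, :] - P[None, :, :]) ** 2).sum(-1)   # shape (N, m)
    return np.exp(-mu * D2).mean(axis=1)

def h(P):
    """Thresholded histogram: sum_i max(exp(-mu * |z_k - p_i|^2) - eta, 0)."""
    D2 = ((Z[:, None, :] - P[None, :, :]) ** 2).sum(-1)
    return np.maximum(np.exp(-mu * D2) - eta, 0.0).sum(axis=1)

s2 = f(X) @ f(Y)                              # s_2(X, Y) as a dot product of histograms
sparsity = float((h(X) == 0).mean())          # fraction of entries zeroed by the cutoff
print(s2 > 0.0, 0.0 <= sparsity <= 1.0)       # True True
```

The zeroed entries are what make an inverted-index implementation possible, at a small cost in approximation accuracy.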
To address this challenge, we introduce a threshold η > 0 and set the coefficient exp(−µ|z_k − x_i|_2^2) to zero whenever it is smaller than the threshold. This is equivalent to replacing the coefficient exp(−µ|z_k − x_i|_2^2) with max(exp(−µ|z_k − x_i|_2^2) − η, 0). As a result, we represent each keypoint set X = (x_1, ..., x_m) by a vector

h(X) = (h_1(X), ..., h_N(X)),    (3.12)

where h_k(X), k = 1, ..., N, is defined as

h_k(X) = Σ_{i=1}^m max(exp(−µ|z_k − x_i|_2^2) − η, 0).    (3.13)

Note that by introducing the threshold η into the bag-of-words model, the result in Corollary 2 is relaxed by a factor ηm. The threshold η essentially balances the tradeoff between efficiency and efficacy: a large η results in a sparse bag-of-words representation of images, leading to efficient retrieval of gallery images; on the other hand, a small η results in a more accurate approximation of the optimal partial matching similarity, leading to accurate retrieval of visually similar images.

Implementation

To generate the bag-of-words model h(X) in (3.12), the first step is to sample centers z_i from the domain Q. In practice, sampling z_i uniformly from a high dimensional domain is a nontrivial problem. In our implementation, we acquire the samples z_1, ..., z_N by first sampling N keypoints from all the keypoints in the image database D, and then scaling the sampled keypoints by a factor γ. This practice essentially assumes that all the keypoints are uniformly distributed within a hyper-sphere. Although this assumption may not be true, it significantly reduces the computational cost of the proposed algorithm and still yields good performance. We set γ = 1 in our experiments: although our theoretical result requires γ to be large, γ = 1 does yield desirable performance in our empirical study. The second step in building the bag-of-words model is to compute each element h_k(X) of h(X).
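The thresholded representation of (3.13) can be sketched directly (a minimal NumPy illustration with hypothetical names, not the production implementation):

```python
import numpy as np

def sparse_bow(X, Z, mu, eta):
    """h(X) of Eq. (3.13): h_k(X) = sum_i max(exp(-mu*|z_k - x_i|_2^2) - eta, 0).

    Coefficients below the threshold eta are zeroed, so a larger eta yields a
    sparser, more index-friendly bag-of-words vector.
    """
    d2 = ((Z[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # (N, m)
    return np.maximum(np.exp(-mu * d2) - eta, 0.0).sum(axis=1)
```

Raising eta can only shrink the support of h(X), which is exactly the efficiency/efficacy tradeoff described above.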
Note that h_k(X) defined in (3.13) is a real number, making it difficult to implement with a typical text search engine, which requires integer elements in a bag-of-words model. We thus simplify the function max(exp(−µ|z_k − y_j|_2^2) − η, 0) to the indicator I(exp(−µ|z_k − y_j|_2^2) >= η), where I(z) is an indicator function that outputs 1 if z is true and 0 otherwise. Note that the function I(exp(−µ|z_k − y_j|_2^2) >= η) is equivalent to I(|z_k − y_j|_2 <= r), with r = [ln(1/η)/µ]^{1/2}. In practice, we set r to be proportional to the average distance between any two keypoints in the image collection D. To estimate the average pairwise distance, we randomly sample 10,000 keypoint pairs from the entire collection. Computing I(|z_k − y_j|_2 <= r) efficiently requires identifying the subset of keypoints in the image collection D that are within range r of each center z_k; for this we use the Fast Library for Approximate Nearest Neighbors (FLANN) [6] in Algorithm 6.

[6] http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN

3.4 Experiments

To verify both the efficiency and efficacy of the random seeding based quantization scheme, we conduct experiments on near duplicate image retrieval on the Oxford5K dataset. Following the experimental settings in [108], we introduce into the experiment a distractor (background) database consisting of 1 million images downloaded from Flickr. We further validate the proposed quantization scheme by conducting two additional experiments in the domains of tattoo image retrieval and automatic image annotation. The tattoo image retrieval experiment is performed on a dataset of 100,000 tattoo images, where the objective is to find images with a similar tattoo pattern to the one given in the query. For the experiment on automatic image annotation, we used the ESP dataset [7], which consists of over 100,000 manually annotated images; the objective is to automatically annotate images with appropriate keywords that reflect their visual content.
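The quantization rule just described, together with the sampling-based estimate of the average pairwise distance, can be sketched as follows (a brute-force stand-in for the FLANN range search, for illustration only; names are ours):

```python
import numpy as np

def estimate_avg_distance(keypoints, n_pairs=10000, seed=0):
    """Estimate the average pairwise keypoint distance from randomly sampled pairs."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(keypoints), n_pairs)
    j = rng.integers(0, len(keypoints), n_pairs)
    return float(np.linalg.norm(keypoints[i] - keypoints[j], axis=1).mean())

def range_quantize(X, Z, r):
    """Integer bag-of-words: word k counts the keypoints x_i of X with
    |z_k - x_i|_2 <= r; keypoints far from every center are simply ignored."""
    d = np.linalg.norm(Z[:, None, :] - X[None, :, :], axis=2)  # (N, m)
    return (d <= r).sum(axis=1)
```

In the actual system the range queries are answered by FLANN rather than by the dense distance matrix above, but the resulting integer counts are the same.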
Table 3.1 summarizes the three experiments.

Experiment                     | Query Set        | Gallery Set                             | SIFT points   | Evaluation
Near duplicate image retrieval | 55 Oxford images | Oxford5K + Flickr1M (~1 million images) | 14.97M + 798M | mAP
Tattoo image retrieval         | 995 tattoos      | Tattoo60K + Esp40K (~100,000 images)    | 10.8M         | CMC
Automatic image annotation     | 1,000 ESP images | Esp110K (~110,000 images)               | 12.8M         | F1

Table 3.1 Summary of experiments

In all three experiments, we obtain the features based on the Harris detector and the SIFT descriptor, using the feature extractor in [115] with the default parameters. In the baseline method, hierarchical K-means is applied to cluster keypoints into visual words, and each keypoint is then mapped to its nearest visual word. We used the implementation of large-scale data clustering in [69], which has been optimized for efficient SIFT keypoint matching. Since both the baseline method and the proposed random seeding quantization depend on efficient large scale nearest neighbor search, we used FLANN [69] in both methods. The parameter setting for hierarchical K-means is tuned using the auto-tune feature provided in the library, using 10% of the dataset and setting the accuracy level to 95%. The search parameters for the proposed method are set accordingly to make the results comparable. For the implementation of the random seeding algorithm, unless otherwise specified, the radius r is set to 50% of the average pairwise distance between any two keypoints; we estimate the average pairwise distance by a sampling approach. After obtaining the bag-of-words representations (either by the clustering-based method or by random seeding), Okapi BM25, a state-of-the-art text retrieval model [116], was used to compute the similarity between a query and gallery images.
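Once images are reduced to bags of visual words, scoring a query against a gallery image is exactly text retrieval. A minimal sketch of Okapi BM25 scoring over visual-word lists (our own simplified implementation with the common defaults k1 = 1.2, b = 0.75, not the exact engine used in these experiments):

```python
import math
from collections import Counter

def bm25_score(query_words, doc_words, corpus, k1=1.2, b=0.75):
    """Okapi BM25 similarity between a query and one document, where both are
    lists of (visual) word ids and `corpus` is the list of all documents."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_words)
    score = 0.0
    for w in set(query_words):
        n_w = sum(1 for d in corpus if w in d)            # document frequency
        idf = math.log((N - n_w + 0.5) / (n_w + 0.5) + 1)  # smoothed IDF
        f = tf[w]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_words) / avgdl))
    return score
```

A real system would precompute document frequencies in an inverted index rather than scanning the corpus per query; the scoring formula is the same.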
For automatic image annotation, our retrieval system is first used to identify the gallery images that are visually similar to a test image; a kNN classifier [117] is then applied to determine the annotation words for the test image. All the experiments were performed on a Xeon 2.8G server with 144GB of memory [8]. In the following sections, we present the results of the individual experiments.

[7] http://www.gwap.com/gwap/gamesPreview/espgame/
[8] We note that the peak memory usage for all experiments is no more than 20GB.

3.4.1 Near Duplicate Image Retrieval

In this experiment, we first perform retrieval experiments on the Oxford5K dataset, a benchmark dataset for image retrieval. We then expand the database with 1 million Flickr images (i.e., Flickr1M) and report retrieval results on this large image database. We also evaluate the scalability of the random seeding approach for keypoint quantization by varying the number of gallery images.

Dataset

The Oxford5K dataset [118] is comprised of 5,062 images, among which 55 images serve as queries. We extracted a total of 14.97 million keypoints from this dataset. The distractor Flickr dataset (i.e., Flickr1M) contains 999,771 images that are represented by 798 million keypoints. In the large scale image retrieval experiment, these two datasets are combined, resulting in a dataset with a total of 1,004,833 images.

Evaluation

Following [35, 108], we use the mean average precision (mAP) as our evaluation metric. It computes the average precision for each query based on its precision-recall curve, and then averages the average precision over all the queries.

Experimental Results

We first report the experimental results on the Oxford5K dataset, followed by the experiments using the Oxford5K+Flickr1M dataset. Both experiments use the same set of 55 query images from the Oxford5K dataset.

Results for the Oxford5K dataset

We now present the experimental results on the Oxford5K dataset.
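The mAP metric described above can be computed from per-query relevance judgments; a minimal sketch (our helper names; when the full relevant set is known its size can be passed as the denominator, the usual convention):

```python
def average_precision(ranked_relevance, n_relevant=None):
    """AP for one query. `ranked_relevance` is the 0/1 relevance of the ranked
    results; precision is accumulated at each relevant position. If the total
    number of relevant items n_relevant is given, it is used as denominator."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    denom = n_relevant if n_relevant is not None else hits
    return precision_sum / denom if denom else 0.0

def mean_average_precision(rankings):
    """mAP: the mean of the per-query average precisions."""
    return sum(average_precision(r) for r in rankings) / len(rankings)
```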
We generate visual words either by clustering or by the random seeding approach, setting the number of visual words to 10K, 100K, and 1 million. Table 3.2 reports the computation time for generating bag-of-words models by the clustering-based method and by the random seeding method. To facilitate the comparison of the two methods, we divide the running time into two parts: the time for constructing the visual vocabulary (vocabulary construction time) and the time for mapping keypoints to visual words (quantization time). Note that since the random seeding algorithm finds visual words by randomly sampling a set of keypoints, it is basically free of vocabulary construction cost. We observe that not only is it computationally expensive to find visual words by clustering, it also takes the clustering-based method considerably longer to map keypoints to visual words than the random seeding approach. We also observe that the running time for the clustering-based method remains almost constant as the number of visual words increases, while it increases noticeably for the random seeding approach. This is because of the specific procedure used in Algorithm 6 for keypoint quantization, which takes O(N log m) time to quantize m keypoints into N visual words; in contrast, the clustering-based method takes O(m log N) time to accomplish the same quantization task. The weaker dependency of the clustering method on N makes its running time less sensitive to the number of visual words.

                                    | Vocabulary 10K | Vocabulary 100K | Vocabulary 1M
Clustering      Voc. Construction  | 9,178s [9]     | 9,941s          | 10,405s
                Quantization       | 42,984s [10]   | 43,726s         | 44,128s
Random Seeding  Voc. Construction  | 5s             | 7s              | 8s
                Quantization       | 1,425s         | 2,232s          | 3,721s

Table 3.2 Time for generating bag-of-words models for the Oxford5K dataset.
[9] Vocabulary construction takes more time than quantization because hierarchical clustering is used.
[10] It could be significantly larger than the time reported in other studies due to differences in the NN search parameter settings and/or the library used.

Table 3.3 summarizes the retrieval accuracy (mAP) and query time (i.e., the time for running text retrieval to identify the gallery images that are visually similar to the query image) for different numbers of visual words. We observe that for both methods, the retrieval accuracy improves from ~30% to over 50% as the number of visual words is increased from 10K to 1 million, which is consistent with previous studies [35, 108]. In all cases, we observe that the random seeding algorithm is more accurate than the clustering based method, although the gap narrows as the number of visual words increases. We attribute this performance difference to the inconsistent mapping of the clustering-based method. In particular, when the number of visual words is small, a keypoint is more likely to be mapped to a visual word even if they are separated by a large distance. The occurrence of inconsistent mapping in the clustering-based method is reduced as the number of visual words increases, leading to a smaller gap in retrieval accuracy. The query time results show that the random seeding algorithm is more efficient in retrieving visually similar images than the clustering-based approach, although the advantage diminishes as the number of visual words increases. We attribute this phenomenon to the elimination of outlier keypoints. In particular, when the number of visual words is small, it is very likely that a keypoint will be far away from all the visual words; those outlying keypoints are ignored by the proposed method in the quantization step, leading to efficient and effective retrieval.
As the number of visual words increases, the number of outlier keypoints decreases, leading to a longer retrieval time. In fact, we found that 84% of the keypoints were ignored when the number of visual words is 10K [11], and only 46% of the keypoints were ignored when the number of visual words is increased to 1 million. To further examine the mapping criterion employed by the random seeding algorithm, we perform the retrieval experiments using the random seeding algorithm but with the visual words identified by the clustering algorithm, referred to as cluster seeding. The retrieval performance of this approach is reported in the last row of Table 3.3. We observe that the cluster seeding method performs worse than the random seeding approach under all settings. This result can be explained by the fact that some clusters have large variance (cover a widely spread set of keypoints), which contradicts the mapping criterion (a range search that maps a keypoint to a visual word only when their distance is smaller than a threshold) used by the random seeding method.

[11] We also found that when such down-sampling is applied to the clustering-based approach, its bag-of-words model has a significantly worse performance.

                 mAP (query time)
                 Vocabulary 10K | Vocabulary 100K | Vocabulary 1M
Clustering       0.28 (0.39s)   | 0.41 (0.19s)    | 0.52 (0.08s)
Random Seeding   0.35 (0.05s)   | 0.45 (0.06s)    | 0.54 (0.08s)
Cluster Seeding  0.32 (0.06s)   | 0.38 (0.06s)    | 0.50 (0.09s)

Table 3.3 Retrieval accuracy (mAP) and query time (seconds) for near duplicate image retrieval on the Oxford5K dataset. "Cluster Seeding" in the last row refers to range search quantization using cluster centers.

Robustness

We examine the sensitivity of the random seeding method w.r.t. the radius r, the only parameter in the proposed keypoint quantization. Figure 3.3 shows how the retrieval accuracy of the random seeding algorithm is affected by r. The retrieval accuracy appears to be stable when r is sufficiently large (i.e., r >= 0.5R, where R is the estimated average pairwise keypoint distance).
We attribute the poor performance at small r to the effect of eliminating too many outlier keypoints: the smaller the radius r, the more keypoints are ignored in the quantization step, leading to a potentially poor bag-of-words representation of the images.

Figure 3.3 The effect of radius r (from 0.25R to 1.25R, for vocabularies of 10K, 100K, and 1M visual words) on the retrieval accuracy of the random seeding method for keypoint quantization. R is the average pairwise keypoint distance.

Oxford5K+Flickr1M Retrieval

We now test the random seeding method on the large database Oxford5K+Flickr1M that combines the two datasets. Following the setup in [108], we use a vocabulary of 1 million visual words generated from the Oxford5K dataset. Table 3.4 shows the retrieval accuracy, query time, and time for keypoint quantization for the Oxford5K+Flickr1M dataset. Note that we exclude the clustering time from the keypoint quantization time since the vocabulary has already been constructed from the Oxford5K dataset. We observe that the random seeding algorithm significantly outperforms the clustering method. We also examine the scalability of the random seeding algorithm in keypoint quantization. In this experiment, we fix the vocabulary to be 1 million visual words generated from the Oxford5K dataset, and gradually add Flickr images to the Oxford5K dataset. Figure 3.4 reports the quantization time of both methods for varying numbers of gallery images. As the size of the image database increases, the quantization time of the random seeding method increases at a slower rate than that of the clustering-based method, making it more suitable for handling large-scale dynamic image databases.
Figure 3.4 The quantization time of the proposed random seeding algorithm and the clustering based approach, as background images from the Flickr dataset are gradually added to the gallery (Oxford5K, +Flickr100K, +Flickr200K, +Flickr500K, +Flickr1000K). The vocabulary is fixed to be 1M visual words generated from the Oxford5K dataset. The vocabulary construction time is not shown here.

                 mAP  | Query Time (seconds) | Quantization Time (hours)
Clustering       0.33 | 0.84                 | 591.2
Random Seeding   0.45 | 0.33                 | 78.7

Table 3.4 Retrieval accuracy (mAP), query time, and quantization time (i.e., time for mapping keypoints to visual words) for the clustering-based method and the random seeding method.

3.4.2 Tattoo Image Retrieval

We now present our results on a large scale tattoo image dataset. Body tattoos have been used by individuals for over 5,000 years to differentiate themselves from others. Historically, the practice of tattooing was limited to particular groups such as motorcycle bikers, soldiers, sailors, and members of criminal gangs. A recent study published in the Journal of the American Academy of Dermatology in 2006 [12] showed the rising popularity of tattoos amongst the younger section of the population: about 36% of Americans in the age group 18 to 29 have at least one tattoo. Law enforcement agencies routinely photograph and catalog tattoo patterns for the purpose of identifying victims and convicts. The ANSI/NIST-ITL 1-2000 standard [119] defines eight major classes (e.g., human face, animal, symbols, ...) and 80 subclasses for categorizing tattoos. Given a query image of a tattoo, a search is performed by matching the class label of the query tattoo with the labels of tattoos in a database. This matching process based on human-assigned class labels is subjective and has poor retrieval performance. This motivated Lee et al.

[12] Tattoo Facts and Statistics, Oct. 2006, http://www.vanishingtattoo.com/tattoo_facts.htm
[120] to develop an image retrieval system for automatic tattoo image matching. Given a query tattoo image, the system attempts to retrieve all the tattoo images from a database that are photos of the same tattoo as the query.

Dataset

The database used here consists of 101,745 images, among which 61,745 are tattoo images that were provided by the Michigan Forensics Lab [120]; the remaining 40,000 images were randomly selected from the ESP game dataset [13]. The purpose of adding images from the ESP game dataset is to verify the capability of the tattoo image retrieval system in distinguishing tattoo images from non-tattoo images. On average, about 108 keypoints are detected from each image, leading to a total of more than 10 million keypoints for the entire image collection.

[13] http://www.gwap.com/gwap/gamesPreview/espgame/

Evaluation

To evaluate the retrieval performance of the tattoo image retrieval system, a duplicate image of the same tattoo is used as a query image to retrieve all of its visually similar image(s) in the database. For the tattoo image database used here, we used 995 tattoo images as queries, and manually identified the tattoo images in the database that are visually similar to each query. The cumulative matching characteristics (CMC) curve [121], which accumulates the correct number of retrieved images as the rank increases, was used as the evaluation metric. For a given rank position k, the CMC score is computed as the percentage of queries whose matched images are found in the first k retrieved images. We use the CMC curve, instead of the precision-recall curve, because it is one of the most widely used evaluation metrics in biometric and forensic analysis.

Experimental Results

Figure 3.5 shows the CMC curve of the tattoo retrieval system using the proposed random seeding algorithm and the clustering algorithm for keypoint quantization.
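The CMC score just defined can be computed directly from the rank at which each query first hits a correct match (a minimal sketch; the function name is ours):

```python
def cmc_curve(first_hit_ranks, max_rank):
    """CMC score at each rank k = 1..max_rank: the fraction of queries whose
    first correctly matched image appears within the top-k retrieved images.
    `first_hit_ranks` holds the 1-based rank of the first match per query."""
    n = len(first_hit_ranks)
    return [sum(1 for r in first_hit_ranks if r <= k) / n
            for k in range(1, max_rank + 1)]
```

By construction the curve is non-decreasing in k, which is why CMC plots always rise toward 1.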
Compared to the clustering approach, we observe an overall 5% improvement in the CMC score of the proposed random seeding algorithm. Figure 3.6 shows some tattoo search results for both the random seeding and the clustering based methods.

Figure 3.5 The CMC curves of the tattoo retrieval system using the proposed random seeding algorithm and the clustering algorithm for keypoint quantization. The number of visual words is set to be 1 million.

Figure 3.6 Example query images and the top 5 images retrieved by both the random seeding algorithm (first row) and the clustering based approach (second row).

Next, we evaluate the efficiency of the proposed scheme for keypoint quantization. It takes the proposed algorithm 3,476 seconds to quantize all the keypoints (i.e., 10.8 million keypoints in total) in the tattoo image collection into one million visual words (i.e., hyper-spheres). In contrast, the running time of the clustering-based method is 31,305 seconds, which is about 9 times longer than the proposed random seeding algorithm. This result shows that the proposed scheme is not only more effective but also more efficient than the clustering-based algorithm for keypoint quantization. We also observe that both methods have a similar retrieval time of 0.05 seconds per query.

3.4.3 Automatic Image Annotation

We also validate the proposed method on an automatic image annotation task. We implement a kNN classification based approach for automatic image annotation, which was shown to be particularly effective for large-scale image annotation [117]. More specifically, to identify the annotation words for a given test image, we first search for the images in the training set that are visually similar to the test image.
We then annotate the test image with the keywords that are most frequently used by the first k retrieved images.

Dataset

The ESP game dataset was used in our study. It consists of 109,720 images. On average, 117 keypoints were extracted per image, leading to a total of 12.8 million keypoints for the entire database. We remove the rare annotation words that are used by fewer than ten images in the database. This gives us a vocabulary of 5,395 unique annotation keywords. On average, each image is annotated by 7.5 words. We randomly selected 1,000 images from the dataset to form the test set, and used the remaining 108,720 images for training.

Evaluation

We evaluate the performance of our system by comparing the predicted annotation words to the ground truth. The F1 metric, which is widely used in multi-label learning [?], is used for evaluation. It is computed as follows: for a given test image, we first rank all the keywords in descending order of their occurrences in the first k images retrieved by the system. The first n_l words (with a random tie-breaker) in the ranking list, denoted by P = {w_1, ..., w_{n_l}}, are the predicted annotation words for the test image. Let Q = {w_1, ..., w_{n_t}} be the set of true annotations for the given test image. The F1 measure for this test image is computed as the harmonic mean of precision p and recall r, i.e., F1 = 2pr/(p + r), where p = |P ∩ Q|/|P| and r = |P ∩ Q|/|Q|. Given multiple test images, we first compute F1 for each test image and then average the F1 scores over all the test images as the final evaluation measure. Since the F1 measure depends on the number of predicted annotation words n_l, which is usually difficult to determine, we report the F1 measures for different numbers of predicted annotation words.

Experimental Results

First, we evaluate the annotation accuracy of our system based on the different methods for keypoint quantization.
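The per-image F1 measure defined above is a one-liner once the predicted set P and the ground-truth set Q are available (our helper name, assuming each word list has no duplicates):

```python
def annotation_f1(predicted, truth):
    """Per-image F1 = 2pr/(p+r), with p = |P∩Q|/|P| and r = |P∩Q|/|Q|."""
    overlap = len(set(predicted) & set(truth))
    if overlap == 0:
        return 0.0  # both p and r are zero; F1 is defined as 0
    p = overlap / len(predicted)
    r = overlap / len(truth)
    return 2 * p * r / (p + r)
```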
One of the key parameters used by the kNN based image annotation system is k, the number of nearest neighbor images used for predicting the annotation words for a test image. We estimate this parameter by a cross validation procedure that uses 20% of the training images to validate k. The optimal k found by cross validation for both quantization methods is 50. Figure 3.7 shows the F1 values of the kNN based image annotation system using the two different quantization methods, with n_l varied from 1 to 10. We observe that the proposed random seeding algorithm significantly outperforms the clustering based approach in annotation accuracy.

Figure 3.7 The results of a kNN based image annotation system using the proposed random seeding algorithm for keypoint quantization as well as the clustering based method. k is set to be 50 based on cross validation, and the number of visual words is set to be 1 million.

Finally, we evaluate the efficiency of the proposed random seeding algorithm for keypoint quantization. It takes 3,376 seconds for the random seeding method to quantize all the keypoints, while the quantization time for the clustering based method is 45,008 seconds. This again confirms that the proposed approach is more efficient than the clustering based method for large-scale keypoint quantization.

3.5 Conclusion

We have proposed an efficient algorithm for the keypoint quantization needed in the bag-of-words model. It is designed to approximate the similarity measure based on the optimal partial matching between two sets of keypoints. The main idea is to quantize each keypoint into a sparse vector of real numbers by a set of randomly generated hyper-spheres.
Our theoretical analysis shows that the similarity based on the proposed quantization algorithm provides an upper bound on the optimal partial matching based similarity. The proposed random seeding algorithm is more effective than the clustering-based approach because it deploys a more consistent criterion for keypoint quantization, which ensures that only similar keypoints are mapped to the same visual words. It is also more efficient than the clustering-based approach because (1) it avoids the high computational cost of large-scale clustering, and (2) its quantization time for mapping keypoints to visual words is sublinear in the number of keypoints, making it more suitable for large image databases than the clustering-based approach. Experimental results on large-scale near duplicate image retrieval, tattoo image retrieval, and automatic image annotation confirm that the proposed random seeding algorithm is more effective than the state-of-the-art clustering based approach for keypoint quantization. Besides efficacy, our empirical studies also indicate that the proposed scheme is more efficient than the clustering based method both for keypoint quantization and for retrieving visually similar images.

CHAPTER 4 RANDOM PROJECTION FILTERING

4.1 Introduction

As we discussed in Chapter 1, nearest neighbor search is a fundamental problem and has found applications in many domains. The challenge comes when both the number of data points and the dimensionality are high: classical exact search algorithms such as KD-tree search [28] often have time or space complexities that are prohibitive. Various approximate search algorithms have been proposed to address this efficiency issue, but they all have evident drawbacks.
For example, ANN [68] has been shown to be effective in moderate dimensions, but it is not suitable for dimensionality higher than about 20. The randomized KD-tree forest [122] shows promising results in some studies, but it does not have any theoretical guarantee on its search accuracy, partly because its parameters are often selected by heuristics rather than grounded theory. In addition, the memory footprint of a randomized KD-tree forest grows linearly with the number of points in the dataset, which poses a strong memory requirement and makes it impractical for efficient large scale memory-based search. Random projection based approaches such as LSH [71] do address the range search problem, but they usually need both a large number of random projections and a large number of hash tables to ensure high search accuracy and recall, making them inefficient and ineffective for many practical uses. In this work, we focus on nearly duplicate vector search, where we assume that nearest neighbors differ from the query object only by a small number of attributes. Our contribution is to propose a novel algorithm that is both efficient and accurate for this application domain by explicitly exploiting the statistical properties of near duplicity. The proposed hashing algorithm, termed Random Projection with Filtering (RPF for short), addresses the tradeoff between efficiency and accuracy by introducing a filtering procedure into the hashing algorithm. Based on compressed sensing theory [123] and the key property of nearly duplicate search, we show that most objects in the returned list are within a short distance of the query. Our empirical study on both synthesized and real world data sets demonstrates the effectiveness of the proposed approach compared to state-of-the-art hashing methods.

4.2 Problem Definition

We consider the case where each object is represented by a vector in a space of very high dimensionality. More specifically, let D = {x_1, ...
, x_N} be the collection of N objects to be searched, with each object represented by a vector x_i ∈ R^m of m dimensions. We assume both the dimensionality m and the number of objects N in the database are very large (e.g., m is a few thousand and N is a couple million). Given a query vector q ∈ R^m represented in the same space, our goal is to find the objects in D that are nearly duplicates of query q. In particular, we assume the target objects for query q should satisfy the following two objectives:

• Short distance: the matched object x should be close to query q in Euclidean distance, namely |x − q|_2 <= R, where R is a predefined distance threshold.

• Sparsity: the matched object x should differ from q only by a relatively small number of attributes, i.e., |x − q|_0 = Σ_{i=1}^m I(x_i ≠ q_i) <= S.

It is important to note that our definition of nearly duplicate goes beyond the simple ℓ2 distance between a query object and objects in the database. The sparsity criterion plays a very important role in designing an efficient search algorithm. It is also the sparsity criterion that distinguishes our work from most of the existing work on efficient nearest neighbor or range search in very high dimensional spaces.

4.3 Random Projection Filtering Algorithm

We first introduce a few notations that will be used throughout this chapter. We denote by M(q, r) the subset of objects in D that are within distance r of query q, and by M^c(q, r) the subset of objects in D that are at least distance r from q, i.e.,

M(q, r) = {x ∈ D : |x − q|_2 <= r}    (4.1)
M^c(q, r) = {x ∈ D : |x − q|_2 >= r}    (4.2)

We denote by M(q, R, S) the nearly duplicate objects in D for a given query q that satisfy the two objectives mentioned above, i.e.,

M(q, R, S) = {x ∈ D : |x − q|_2 <= R, |x − q|_0 <= S}    (4.3)

Given a matrix U = (u_1, ..., u_d) ∈ R^{m×d}, we denote by P_U = (U^T U)^{−1} U^T : R^m → R^d the projection operator that maps a vector in R^m into the subspace spanned by the column vectors of U.
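The two membership criteria in (4.3) can be checked directly for any candidate vector (a trivial NumPy sketch with our own function name):

```python
import numpy as np

def is_near_duplicate(x, q, R, S):
    """Check the two criteria of Eq. (4.3): |x - q|_2 <= R (short distance)
    and |x - q|_0 <= S (at most S attributes differ)."""
    diff = x - q
    return bool(np.linalg.norm(diff) <= R and np.count_nonzero(diff) <= S)
```

The point of the chapter, of course, is to avoid evaluating this test against every object in D; the random projections below prune the candidates first.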
Our work is motivated by the Johnson-Lindenstrauss theorem, which states that a low dimensional embedding of a very high dimensional space, generated by random projection, is likely to preserve the Euclidean distance between any two points in the original high dimensional space.

Theorem 2 (Johnson-Lindenstrauss). Let U ∈ R^{m×d} be a random matrix, with each element U_{i,j}, i = 1, ..., m, j = 1, ..., d, drawn independently from a Gaussian distribution N(0, 1). Let P_U : R^m → R^d be the projection operator that projects a vector in R^m into the subspace spanned by the column vectors of U. For any two fixed objects x, q ∈ R^m in the original m dimensional space and for a given ε ∈ (0, 1), we have

Pr[ |P_U x − P_U q|_2^2 <= (d(1 − ε)/m) |x − q|_2^2 ] <= exp(−dε^2/4),
Pr[ |P_U x − P_U q|_2^2 >= (d(1 + ε)/m) |x − q|_2^2 ] <= exp(−dε^2[1 − 2ε/3]/4).

The most straightforward approach for exploiting the Johnson-Lindenstrauss theorem is to project vectors in a high dimensional space into a low dimensional space generated by a random matrix, and then apply methods such as KD trees for efficient range or nearest neighbor search in low dimensions. The main problem with such an approach is that, in order to ensure all the nearest neighbors are included in the returned list with a large probability, d, the dimensionality of the low dimensional embedding, could be significantly large. To make our point clear, consider the objective of finding objects in D that are within distance r of a query q. We follow the direct approach of low dimensional embedding, and assume an object is returned if its distance to query q in the low dimensional embedding is less than or equal to r' = r[d(1 + ε)/m]^{1/2}. Then, for a fixed point x with |x − q|_2 <= r, the probability of it being included in the returned list is at least 1 − exp(−dε^2[1 − 2ε/3]/4).
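The distance-preserving behavior in the theorem can be checked numerically. The sketch below uses the scaled projection Pv = U^T v/√m, one convention for which the expected squared projected distance equals (d/m) times the original, matching the scaling in the theorem; it is an illustration, not the chapter's exact operator:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, trials = 1000, 50, 200
x = rng.normal(size=m)
q = x + rng.normal(scale=0.1, size=m)   # a nearby point
true_sq = np.sum((x - q) ** 2)

ratios = []
for _ in range(trials):
    U = rng.normal(size=(m, d))                      # Gaussian random matrix
    px, pq = U.T @ x / np.sqrt(m), U.T @ q / np.sqrt(m)
    ratios.append(np.sum((px - pq) ** 2) / true_sq)  # projected / original

mean_ratio = float(np.mean(ratios))  # concentrates around d/m = 0.05
```

Individual ratios fluctuate around d/m, and the fluctuation shrinks as d grows, which is the content of the two tail bounds above.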
Hence, the probability of returning all objects within distance r of query q is at least 1 − N exp(−dε^2[1 − 2ε/3]/4), implying that, in order to ensure all the objects within distance r of query q are returned with a large probability, d has to be on the order of O(4[ε^2(1 − 2ε/3)]^{−1} log N). Given a collection of one million objects, the required dimensionality of the embedding is at least 50, which is too high for any efficient range search algorithm to be effective. An alternative approach is to run multiple range searches, each in a very low dimensional space, and then assemble the results together by either anding or oring the results returned by each range search. In this way, we avoid the challenge of range search in a relatively high dimensional space. The extreme case is to run each range search in a one dimensional space and then merge the objects returned by each range search. In some sense, the hashing approaches can be thought of as a special case of this strategy, with the exception that only an approximate range search is implemented by a hashing function. This approach does guarantee that, with a large probability, all the objects within distance r of query q are found in the union of the sets returned by the multiple range searches. On the other hand, however, many of the returned objects may be far away from the query q, leaving too many returned objects whose distances to the given query need further inspection. To make our discussion concrete, consider the case where we run d one dimensional range searches, with each one dimensional embedding generated by a random vector. Similar to the previous example, we set the distance threshold to be r′ = r√((1 + ε)/m). Similar to the previous analysis, the probability of including all the objects within distance r of q in the lists returned by the low dimensional range searches is 1 − N exp(−dε^2[1 − 2ε/3]/4), identical to the previous analysis.
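The recall/precision behavior of merged one dimensional range searches can be illustrated on toy data. In the sketch below (all sizes, the seed, and the per-projection threshold convention are illustrative assumptions), OR-merging the d one dimensional searches keeps the planted near duplicate but also admits many far-away points, while AND-merging admits far fewer:

```python
import numpy as np

rng = np.random.default_rng(1)
m, N, r, eps = 500, 2000, 1.0, 0.5
data = rng.standard_normal((N, m))
q = data[0] + 0.01 * rng.standard_normal(m)  # plant index 0 as a near duplicate

# d one dimensional embeddings, each a random direction with N(0, 1/m) entries.
d = 10
dirs = rng.normal(0.0, np.sqrt(1.0 / m), size=(d, m))
thresh = r * np.sqrt((1 + eps) / m)          # per-projection range threshold
passed = np.abs(dirs @ (data - q).T) <= thresh  # boolean, shape (d, N)

union = set(np.flatnonzero(passed.any(axis=0)))         # OR-merge: high recall
intersection = set(np.flatnonzero(passed.all(axis=0)))  # AND-merge: high precision
print(len(union), len(intersection))
```

With this setup the planted duplicate survives the AND-merge, while the OR-merge returns hundreds of background points that would all need an exact distance check.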
However, the problem is that the probability of including objects far away from the query also increases significantly. In particular, the probability of including any object at distance r√((1 + ε)/(1 − ε)) may become min(1, d exp(−ε^2/4)). We refer to this challenge as the Precision-Recall tradeoff: in order to improve recall for a search, i.e., the chance of including all the nearby objects of a given query, we inherently have to lower the bar for precision, namely increase the probability of including objects far away from the given query, and vice versa. We could evidently make a better tradeoff between precision and recall by being more careful in designing the combination strategy. For instance, we could create K groups of d one dimensional embeddings: within each group, we merge the objects returned by the d one dimensional range searches by oring; we then combine the objects returned by the K groups by anding. With K such groups, the probability of including all objects within distance r of the query (i.e., recall) is reduced to 1 − NK exp(−dε^2[1 − 2ε/3]/4), while the probability of including any object at distance at least r√((1 + ε)/(1 − ε)) (i.e., precision) is improved to (1 − [1 − exp(−ε^2/4)]^d)^K. The main shortcoming of this strategy is that it requires both d and K to be large, leading to a very large number (i.e., dK) of one dimensional search structures, which could significantly lower the efficiency of the algorithm. This is in fact the case for locality sensitive hashing, which often requires thousands of bits before accurate search results can be found. In this chapter, we aim to address the challenge of the precision-recall tradeoff by exploiting the sparsity requirement of near duplicate detection, namely that only a small number of elements in the target object x differ from the given query q.
Before we discuss the exploitation of the sparsity requirement, we need to extend the Johnson–Lindenstrauss theorem, which holds for the subspace spanned by the column vectors of a random matrix U ∈ R^{m×d}. Instead of using the projection operator P_U, we will extend the result to use U directly. More specifically, we assume each U_{i,j} in matrix U is randomly drawn from a Gaussian distribution N(x|0, m^{−1}). Note that the difference between P_U and U is the matrix [U^⊤U]^{−1}. First, we have the following concentration inequality for the random matrix U.

Theorem 3. Let U ∈ R^{m×d} be a random matrix, with each element generated from a Gaussian distribution N(x|0, m^{−1}). We assume m > d. Let σ_max(U) and σ_min(U) be the maximum and minimum singular values of the matrix U, respectively. We have

Pr( σ_max(U) > 1 + √(d/m) + η + t ) ≤ exp(−mt^2/2)  (4.4)

Pr( σ_min(U) < 1 − √(d/m) − η − t ) ≤ exp(−mt^2/2)  (4.5)

where

η = γ^{1/6}(1 + √γ)^{2/3} / (2m^{1/3})

with γ ∈ (0, 1). Using the above theorem, we have the following corollary.

Corollary 3. Let U ∈ R^{m×d} be a random matrix with each element randomly drawn from N(x|0, 1/m). With a probability 1 − δ, for any vector x, we have

(1 − √(d/m) − η − √(2 ln(2/δ)/m))^2 |P_U x|_2 ≤ |U^⊤x|_2 ≤ (1 + √(d/m) + η + √(2 ln(2/δ)/m))^2 |P_U x|_2

With sufficiently large m and δ = e^{−d/2}/2, with probability at least 1 − e^{−d/2}/2, we have

(1 − 2√(d/m))^2 |P_U x|_2 ≤ |U^⊤x|_2 ≤ (1 + 3√(d/m))^2 |P_U x|_2

Theorem 4. (Modified Johnson–Lindenstrauss Theorem) Let U ∈ R^{m×d} be a random matrix, with each element U_{i,j}, i = 1, . . . , m, j = 1, . . . , d generated by randomly drawing from a Gaussian distribution N(x|0, 1/m). Assume m is sufficiently large. With probability 1 − δ, for any two fixed objects x, q ∈ R^m and for a given ε ∈ (0, 1), we have

Pr( |U^⊤x − U^⊤q|_2^2 ≤ (1 − 2√(d/m))^4 (d(1 − ε)/m) |x − q|_2^2 ) ≤ exp(−dε^2/4)

Pr( |U^⊤x − U^⊤q|_2^2 ≥ (1 + 3√(d/m))^4 (d(1 + ε)/m) |x − q|_2^2 ) ≤ exp(−dε^2[1 − 2ε/3]/4)

The above theorem is proved directly by combining Theorem 2 and Corollary 3.
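The singular value concentration in Theorem 3 can be checked empirically. The sketch below (sizes and seed are illustrative assumptions) draws U with N(0, 1/m) entries and verifies that its extreme singular values fall near 1 ± √(d/m):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 4000, 50
# Entries N(0, 1/m); for m >> d the singular values of U concentrate
# around 1, within roughly +/- sqrt(d/m) ~ 0.11 here.
U = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, d))
svals = np.linalg.svd(U, compute_uv=False)
smax, smin = svals[0], svals[-1]
print(smax, smin)
```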
As indicated by the above theorem, we can directly use the random matrix U to map a high dimensional vector into a low dimensional space while preserving the Euclidean distance among objects, suffering only a small discount factor when m is very large and d is significantly smaller than m. In order to improve the tradeoff between precision and recall, we aim to exploit the sparsity requirement in the problem of near duplicate detection. In particular, we will explore the Restricted Isometry Property (RIP) for sparse vectors, stated in the following theorem.

Theorem 5. (Restricted Isometry Property) Assume d < m and let U be a random matrix of size m × d whose entries are i.i.d. Gaussian with mean zero and variance 1/m. If S/m is small enough, then with overwhelming probability 1 − e^{−βm}, there exists a positive constant δ_S < 1 such that the following inequality holds for any c ∈ R^m with at most S non-zero elements:

(1 − δ_S)|c|_2^2 ≤ (m/d)|U^⊤c|_2^2 ≤ (1 + δ_S)|c|_2^2

It is important to distinguish the RIP condition from the Johnson–Lindenstrauss theorem, even though both claim that we can preserve the Euclidean distance among objects by mapping them to a low dimensional embedding using random projections. The probability in the JL theorem is attached to each vector: for any vector, there is a small but non-zero probability for the property to fail. It is this per-vector uncertainty that leads to the tradeoff between precision and recall. The probability in the RIP condition, on the other hand, is attached to the choice of the random matrix: nearly all random matrices generated with i.i.d. Gaussian entries will preserve the Euclidean distance for all sparse vectors simultaneously. Thus, we only suffer the probability of violation once, when choosing the random projection matrix, and if we choose a good matrix, the RIP condition guarantees the Euclidean geometry to be preserved. We now state our algorithm for efficiently identifying near duplicates of queries in high dimensions.
Let d be a modest integer such that (i) for a given ε, d is large enough to make exp(−dε^2/4) a small value, and (ii) d is small enough to allow for efficient range search in a space of d dimensions. Choose K as K = (CS log m)/d, where C > 1 is a parameter that needs to be adjusted in practice. We generate K random matrices U1, U2, . . . , UK of size m × d whose entries are randomly drawn from a Gaussian distribution N(x|0, 1/m). For each random matrix U_j, we map objects x from R^m to R^d by U_j x. We denote by D_j = {U_j x1, U_j x2, . . . , U_j xN} the collection of objects mapped to the low dimensional space using random matrix U_j. We then build a search structure (e.g., a KD-tree) for each D_j and assume there exists a range search algorithm that efficiently finds the subset of objects in D_j within distance r of the query U_j q, where q is the query in the original space R^m. We denote by M_j(q, r) = {i ∈ [N] : |U_j(x_i − q)|_2 ≤ r} the subset of objects whose projection under U_j is within distance r of the query q. The key difference between this algorithm and the strategy stated before is that we have a filtering procedure after merging the objects returned by the range search in each d dimensional space. This filtering procedure throws out the objects whose cumulative squared distance to the query (i.e., ∑_{j′=1}^{j} |U_{j′}(x − q)|_2^2) is larger than the threshold (1 + δ_S)R^2[Kd]/m. The following theorem states the guarantee of this algorithm.

Theorem 6. Given a query q, after running Algorithm 6, we have the following guarantees:

• with a high probability, all the objects that are near duplicates of query q will be included in the returned set M, i.e.,

Pr( M(q, R, S) ⊆ M ) ≥ 1 − N exp(−dKε^2[1 − 2ε/3]/4)

• all the returned objects are within a short distance of the query, i.e.,

max_{i∈M} |x_i − q|_2 ≤ √((1 + δ_S)/(1 − δ_S)) R

Proof. The first guarantee follows directly from the previous analysis and the fact that M(q, R, S) ⊆ M(q, R). The second guarantee follows from the fact that U = (U1, . . .
, UK) will result in a sufficiently large number of columns (compared to the sparsity S) and lead to the preservation of the Euclidean distance with an overwhelming probability.

Algorithm 6 Efficient Duplicate Detection for High Dimensional Vectors
1: Input:
   • R: distance threshold
   • S: the maximum number of entries in which the target object differs from a given query
   • q: a query
2: Initialization: M = ∅ and d_q = 0
3: for j = 1, 2, . . . , K do
4:    Run an efficient range search over D_j with query U_j q and threshold r set as
      r = R √(d(1 − ε)/m) / (1 − 2√(d/m))
5:    Return the indices of the objects satisfying the criterion in M_j(q, r) along with the distance information, i.e., for i ∈ M_j(q, r), compute d_i^j = |U_j(x_i − q)|_2^2
6:    for i ∈ M do
7:       if i ∉ M_j then
8:          d_i = d_i + |U_j q|_2^2
9:       else
10:         d_i = d_i + d_i^j, M_j = M_j \ {i}
11:      end if
12:      if m d_i/[dK] ≥ R^2(1 + δ_S) then
13:         M = M \ {i}
14:      end if
15:   end for
16:   for i ∈ M_j do
17:      d_i = d_q + d_i^j
18:      if m d_i/[dK] < R^2(1 + δ_S) then
19:         M = M ∪ {i}
20:      end if
21:   end for
22:   d_q = d_q + |U_j q|_2^2
23: end for
24: Return M

4.4 Experiments

We evaluate the random projection filtering algorithm with two different sets of experiments. In the first set of experiments, we generate a synthetic dataset where both the short distance and the sparsity constraints are satisfied. We study different aspects of our algorithm and show how RPF can be effective in solving the near duplicate search problem. In the second experiment, we test our algorithm on an image feature dataset where “duplicity” is often defined based on L2 distance and the sparsity constraint may not be perfectly met. We compare RPF with other existing hashing methods and verify that RPF is still effective in less ideal cases and real applications.

4.4.1 Experiments on Synthetic Datasets

The purpose of this experiment is to examine different aspects of the RPF algorithm. We examine its accuracy, efficiency and sensitivity to parameters on a 100K dataset.
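Before turning to the experimental results, the core loop of the method can be sketched in a few lines of Python. This is a simplified illustration, not the exact Algorithm 6: a brute-force scan stands in for the per-projection KD-tree range search, candidates are merged by intersection, and the radius and filtering constants are illustrative simplifications.

```python
import numpy as np

def rpf_search(data, q, R, K=10, d=20, eps=0.5, delta_s=0.1, seed=0):
    # Sketch of random projection filtering: project into K random
    # d-dimensional spaces, range-search each one (brute force here,
    # a KD-tree in the real algorithm), and filter candidates by their
    # cumulative projected squared distance.
    rng = np.random.default_rng(seed)
    N, m = data.shape
    radius2 = R ** 2 * d * (1 + eps) / m      # per-projection squared radius
    cum = np.zeros(N)                         # running sum of |U_j (x - q)|^2
    candidates = set(range(N))
    for _ in range(K):
        U = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, d))
        pd2 = np.sum(((data - q) @ U) ** 2, axis=1)
        cum += pd2
        candidates &= set(np.flatnonzero(pd2 <= radius2))
    # keep only points whose rescaled cumulative distance passes the
    # RIP-style threshold (1 + delta_s) R^2
    return {i for i in candidates
            if (m / (d * K)) * cum[i] <= (1 + delta_s) * R ** 2}
```

For instance, planting an exact duplicate of a database item as the query makes that item the only survivor on well-separated random data.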
For each query, the first 100 data points with the shortest distance to the query are considered the true nearest neighbors. We use average precision and average recall as metrics, following the general evaluation protocols of search algorithms [41]. In addition, we also measure the filtering percentage, which is the percentage of noise points that are filtered out by the algorithm. For efficiency, we report both the offline index building time and the online search time.

Dataset The synthetic dataset we generated includes three parts: background data, query data, and distractor data. The background data are 100K vectors of 500 dimensions, and each dimension is generated independently from a uniform distribution over the range [0, 100]. For the query dataset, we first generate 100 zero vectors, and then we randomly select P dimensions and set their values uniformly from [0, β]. In our experiments, we set P = 25 and β = 20. In this way the 100 query points are (2Pβ, 2P) near duplicates of each other and they are far away from the background data points. The distractor dataset is constructed similarly to the query data points, but with P non-zero dimensions generated from [0, β + δ]. In our experiments, we set P = 30 and δ = 2. Then, each of the 100 query data points is used to find its near duplicates, and we report the performance over all queries.

Algorithm Performance, Effects of K and d We first examine the algorithm's sensitivity to parameter settings. The major parameters in the proposed algorithm are K, d, δs and ε. Based on our experience, we fix δs = 0.05 and ε = 0.5. We test how K and d affect the efficiency and accuracy of the algorithm. We vary K over 1, 5, 10, 20, 50, 100 and d over 5, 10, 20, 50, 100, respectively. In figure 4.1 we show the average precision, recall, and filtering percentage under different parameter settings.
As we can see, the algorithm delivers very high precision as long as the values of K and d are not too small (d ≥ 10 and K ≥ 20). Meanwhile, over 95% of the target points can be recalled. We also observe that when d is very small (e.g. 5) compared with the original dimensionality (e.g. 500), a relatively larger number of projections is needed for good performance. These observations are consistent with the compressive sensing theory, which requires K = (CS log m)/d. Furthermore, we can see that for all the different settings, almost all noise points are filtered out by RPF, which validates the effectiveness of the filtering procedure. Based on the above results, we set K = 10 and d = 20 in later experiments.

RPF Efficiency From figure 4.2, we can see that both the offline tree building and the online query can be done quite efficiently, and that both scale linearly w.r.t. K.¹ Specifically, search on the 100K dataset can be completed within 0.1 seconds (with d = 20, K = 10), indicating the algorithm is useful in real applications. We also see that when d is too large, the algorithm becomes much less efficient. This is because exact KD-tree search in high dimensionality is not efficient. Based on our experience, d can be set on the order of √m, which keeps the algorithm very practical for high dimensional search. If a very large d is still needed, one can use approximate KD-tree search, which has been used on data with hundreds of dimensions. In the next subsection, we show that setting d = 50 can effectively search 5000 dimensional data.

Sensitivity and Effects of δs and ε We now show how the algorithm's performance changes as δs and ε vary. δs controls the filtering process: the smaller δs is, the stronger the filtering. ε controls the radius of the range search (through the factor 1 − ε): the larger ε is, the fewer data points are returned by the KD-tree searches. In figure 4.3, we see that when δs increases, the precision of the algorithm drops, as we loosen the filter.
However, when ε is very large, δs has very small effects on precision. This is because the KD-tree search only returns points that are really close to the query.

¹For illustration purposes, we set the y-axis limit to 2 in the figure. When d = 100, the query times are [0.4466, 2.3349, 4.3455, 7.9626, 17.9605, 35.4678].

[Figure 4.1 Performance of Random Projection Filtering (RPF) with different K and d: (a) average precision, (b) average recall, (c) average filtering percentage, each plotted against K for d = 5, 10, 20, 50, 100.]

[Figure 4.2 Offline tree building time and query time: (a) average tree building time (s), (b) average query time (s), each plotted against d for K = 1, 5, 10, 20, 50, 100.]

dataset size        5K      10K     20K     50K     100K
RP Precision        0.1188  0.1872  0.0171  0.0076  0.0058
RP Recall           1       1       1       1       1
RP Candidate Size   1078    620     6651    15356   21450
RPF Precision       1       1       1       1       1
RPF Recall          0.9929  0.9934  0.9880  0.9899  0.9940
RPF Candidate Size  99.29   99.34   98.80   98.99   99.40

Table 4.1 Effectiveness of the filtering procedure

For the recall, we generally see an inverse trend compared with the precision, which is the effect of the precision-recall tradeoff. We also see that when ε is 0.9, the recall is much lower than for other ε values, which indicates that the KD-tree search with a very small radius missed a number of true positives. Finally, we observe that the filtering percentage varies similarly to the precision, and the analysis for the precision applies similarly. Based on this result, we set ε = 0.5, δs = 0.1 in the following experiments.
Effectiveness of the Filtering Procedure In order to investigate the importance of the filtering process, we compare the performance of the search algorithm with and without (RP for short) the filtering procedure. By the filtering procedure we mean steps 12 and 18 of the algorithm, which conditionally add or remove a candidate point from the candidate pool. In this experiment, we set K, d, ε and δs as suggested by the earlier results, and varied the dataset size. Table 4.1 shows the average precision, recall and candidate set size of the algorithm with and without the filtering step. As we can see, the filtering procedure improves the precision and reduces the candidate set size very significantly, at the cost of about 1% recall. This shows the filtering procedure properly addresses the tradeoff between precision, recall, and search efficiency.

4.4.2 Experiments on an Image Feature Dataset

Here we compare the performance of the RPF algorithm with other baseline methods on a SIFT point dataset extracted from the Holidays and Copydays datasets [35].

[Figure 4.3 Performance of Random Projection Filtering (RPF) with different δs and ε: (a) average precision, (b) average recall, (c) average filtering percentage, each plotted against δs for ε = 0.1, 0.3, 0.5, 0.7, 0.9.]

Dataset The Holidays dataset includes 1491 holiday photos, including images attacked through rotations, viewpoint changes and illumination changes. The Copydays dataset extends the Holidays dataset by including cropped, blurred, painted, printed, scanned, and compressed images. We extract 10 million SIFT features [30] using the VLFeat SIFT feature detector. We then randomly sample 200 SIFT points as queries.
For these queries, we treat their 100 nearest neighbors as the true neighbors. In general, the distance between a data point and its 100th nearest neighbor is within ∼10% of the average pairwise distance in the dataset, which is a good indicator of whether two points are near duplicates of each other.

Settings We compare the search accuracy of RPF and the baseline methods based on the number of true duplicates found in the top 100 returned results. As we treat the closest 100 points as true near duplicates, this measure tells us exactly how many duplicates can be successfully found by an algorithm. The order of the points in this short list is less important because re-ranking can easily be done based on exact distance. This criterion has been used in recent comprehensive studies of hashing methods [124] to measure their search accuracy. For algorithm efficiency, we evaluate both the offline training time and the online search time. The training time of RPF includes both the data projection time and the tree building time. The training time of the baseline methods includes the projection time and the time to learn the hash codes. The parameters of both the proposed algorithm and the baseline algorithms are optimized based on the search results on an independent set of 100K SIFT features. We vary the number of bits of the hashing algorithms and the number of projections of the RPF algorithm. During the comparison, we fix K = 4 in RPF and make sure K·d equals the number of bits of the hashing algorithms, so that the total number of projections used by the different algorithms is the same. We used the following algorithms in our comparison: Locality Sensitive Hashing [66], Spectral Hashing [125], Laplacian Co-Hashing [126], PCA Hashing [127] and Self-Taught Hashing [128].

Results and discussions In figure 4.4 we show the F1 comparison of the RPF algorithm and the other baselines.
We observe that RPF performs significantly better than the other nearest neighbor search algorithms on the F1 measure, showing the effectiveness of both the searching procedure (which affects recall) and the filtering procedure (which affects precision). At the same time, we also notice the performance difference between the results here and the results on the synthetic dataset. By comparing table 4.1 and table 4.2, we see that although both the accuracy and the recall decrease on the real dataset, the accuracy is affected more than the recall. This is because, unlike the synthetic data, near duplicate SIFT points with small L2 distance may not perfectly satisfy the sparsity constraint assumed by the algorithm. For example, on each dimension, two feature points could differ slightly from each other, resulting in an overall small L2 distance with a potentially large L1 distance. To address this problem, we design another algorithm in chapter 5. In Table 4.3 we compare the search efficiency. RPF's online search time is comparable to that of the other search algorithms. Different from the other algorithms, RPF's search time is less sensitive to the total number of bits used. This is because RPF's search time is mainly determined by K (i.e. the number of low dimensional searches) but not d (the dimensionality on which the search is performed). As we fixed K and increased d, we improved the search quality without using more time. In Table 4.4 we compare the training/indexing efficiency. RPF takes more time for indexing compared with simple hashing methods like LSH and PCAH, but it is more efficient than SH and STH, which include more complex data-dependent training steps.

Num of Bits/Projections  8       16      24      32      40      48
RPF's Mean Accuracy      0.0020  0.1085  0.2320  0.2719  0.3125  0.3282
RPF's Mean Recall        0.0041  0.2148  0.4640  0.5438  0.6249  0.6564

Table 4.2 RPF Accuracy and Recall on 10 Million SIFT features

Finally, we compare the performance of RPF, Locality Sensitive Hashing and Spectral Hashing on a dataset of 100 million features.
We choose LSH and SH because they are the most representative methods and they both deliver comparable results against the other baselines. We fix the number of projections to 48 in this experiment. In Table 4.5 we report both the F1 score and the search time for all three methods. In both performance and efficiency, RPF scales better to much larger datasets than the baseline methods.

[Figure 4.4 Accuracy comparison on 10 Million SIFT Features: average F1 measure vs. dimensions/encoding bits for LSH, PCAH, LCH, SH, STH and RPF.]

Num of Bits/Projections  LSH     PCAH    LCH     SH      STH     RPF
8                        1.9667  2.1439  2.0513  2.0553  0.5619  3.9871
16                       2.2914  2.2846  2.3237  2.3215  0.6981  4.0055
24                       2.5390  2.7205  2.5613  2.6747  0.8446  4.0525
32                       2.7784  2.8670  2.8150  2.7542  0.9911  4.0748
40                       3.0225  3.0313  3.0432  3.1757  1.1030  4.1760
48                       3.3465  3.3747  3.2828  3.2417  1.2838  4.1781

Table 4.3 Search time (seconds) comparison on 10 Million SIFT features

Num of Bits/Projections  LSH      PCAH     LCH      SH        STH     RPF
8                        12.3969  12.8401  41.1827  26.8050   1037.1  177.7398
16                       18.1463  14.8493  39.5128  70.4104   2831.8  216.6184
24                       17.6880  18.0228  45.7640  142.8409  3131.7  251.8006
32                       23.5655  23.0640  49.3907  244.0592  4417.5  290.5117
40                       22.8688  31.2931  54.8485  396.7967  6564.9  318.5319
48                       25.5480  29.8792  62.3453  553.1427  7806.7  346.2411

Table 4.4 Training/indexing time (seconds) comparison on 10 Million SIFT features

Method                       F1 Score  Search Time (seconds)
Random Projection Filtering  0.4318    42.9208
LSH                          0.1821    49.3101
SH                           0.2153    50.2250

Table 4.5 Performance and scalability comparison, searching 100 million features

4.4.3 Experiment on the 80 Million Image Dataset

In this experiment, we compare our algorithm and the baseline algorithms on the 80 million tiny images dataset [129]. This dataset contains 79,302,017 color images collected from the Web. Each image is 32x32 and represented by 384-dimensional GIST features.
We randomly sample 200 images as queries and treat dataset images as their true nearest neighbors if they are within 10% of the average pairwise image distance. The pairwise image distance is estimated using 1 million image pairs. We fix the number of projections to 48 in this experiment. The baseline algorithms' parameters are trained using 10000 randomly sampled image features. In table 4.6, we compare the indexing time, search time and F1 measure of the different algorithms. In figure 4.5, we show examples of the image search results of the different algorithms. As we can see from the results, RPF and the baseline algorithms have comparable efficiency, but RPF's search results are more accurate.

Method           RPF     LSH    SH      PCAH   STH       LCH
F1               0.573   0.457  0.488   0.381  0.402     0.363
Index Time (s)   2777.1  316.6  5303.2  588.1  109200.0  727.9
Search Time (s)  35.9    38.3   40.5    37.7   25.7      41.4

Table 4.6 Performance comparison, searching 80 million images

[Figure 4.5 Examples of retrieved images of different algorithms for two queries ((a) Map, (b) Abraham Lincoln). From top to bottom: RPF, LSH, SH, PCAH, LCH and STH.]

4.5 Conclusion

In this chapter, we proposed an algorithm for large scale high dimensional near duplicate search. In particular, we utilized the properties of near duplicity and designed a random projection filtering algorithm supported by compressive sensing theory. Experiments on large scale high dimensional datasets show that the RPF algorithm successfully addresses the trade-offs between the three key factors in search, namely precision, recall, and efficiency, making it a more accurate, efficient and scalable algorithm for near duplicate search than state-of-the-art hashing algorithms.

CHAPTER 5
EFFICIENT RANGE SEARCH IN HIGH DIMENSIONS

5.1 Introduction

In Chapter 4, we proposed the Random Projection Filtering algorithm for near duplicate search. Although the algorithm shows encouraging results in empirical studies, it does have some limitations.
Recall that the random projection filtering algorithm has two assumptions, namely the short distance assumption and the sparsity assumption. In some applications, the sparsity assumption may not hold due to the large value range of the feature vectors. For example, the "duplicity" of image features like SIFT points is often defined based on L2 distance. A pair of duplicate points with a small L2 distance could have a big L1 distance if they differ just slightly on many dimensions. This is actually often the case because each dimension of a SIFT descriptor can take any value between 0 and 255. This motivates us to develop another search algorithm that does not require the sparsity assumption and is applicable to a wider range of problems. In this chapter, we introduce a new range search algorithm. Given a query q and a distance threshold r, the goal is to efficiently identify the subset of data points from a database that are within distance r of q. Such problems arise naturally in many areas such as database search and image search. Existing work on range search includes both tree-based methods and locality sensitive hashing. When data points are represented by low dimensional vectors, a number of efficient solutions based on pre-built tree structures have been proposed (e.g., KD-trees [28, 57] or R-trees [130]). However, these algorithms' complexity increases exponentially with the dimensionality, making them very inefficient for high dimensional search [1, 57]. Hashing based methods, such as locality sensitive hashing and its variants [29, 65, 66, 124], do not directly address the exact range search problem, and thus deliver unsatisfactory accuracy for range searches. Interestingly, despite the challenges presented above, we find that both the search efficiency and the search accuracy can be greatly improved if the range search is done in the near duplicate search scenario.
In this chapter, we present a new randomized approach for efficient range search for near duplicate vectors in high dimensional spaces. It projects both the data points and the query point into a one dimensional space by a random projection, and performs a one dimensional range search, using binary search, to find the subset of data points that are within the range of a given query. By choosing an appropriate threshold, we can guarantee a high recall, namely that, with a high probability, the proposed approach returns all of the data points within the given range of the query. This is in contrast to other random projection approaches [131] that require projecting the data points into a space of multiple dimensions and performing a range search over the projected space. For instance, according to the JL lemma [131], to preserve the pairwise distances, the number of needed random projections is O(ln N/ε^3), where N is the number of data points and ε is the target error in the distance approximation. For compressive sensing based approaches [123], at least O(k log d) random projections are needed to preserve the pairwise distance when the difference between the data points and the query point is sparse, where d is the dimension of the data points. We note that for real-world problems, both approaches may require many random projections, leading to difficulty in searching over the projected space. On the contrary, we project each data point onto a single dimension, which allows us to perform an efficient range search using binary search. To ensure a high precision of the search result, i.e., that most of the returned data points are within the given range of the query, we perform multiple one dimensional searches, and only return the subset of data points that pass all of the one dimensional search tests. The main contribution of this work is to develop and evaluate a new efficient range search algorithm for large scale and high dimensional data.
We prove theoretical guarantees for the proposed algorithm and evaluate its empirical performance on a dataset of 1.1 billion image features [132].

5.2 Problem Definition

The range search problem is defined as follows. Let D = {x1, . . . , xN} be a collection of vectors (i.e., the database), where xi ∈ R^d and d ≫ 1 is the dimension of the space. Let q ∈ R^d be a query point. The goal of range search is to find the subset of data points in D that are within distance r of q, where r is the range specified by the user. Define D(r, q) as the subset of data points in D that are within distance r of the query q, i.e.,

D(r, q) = {x ∈ D : |x − q|_2 ≤ r}

Here, we assume both the data points in D and the query q have bounded length, i.e., |x|_2 ≤ 1, ∀x ∈ D and |q|_2 ≤ 1.

5.3 Algorithm and Theory

To speed up the search, we propose to convert the high dimensional range search into a sequence of one dimensional range searches. More specifically, we randomly sample multiple vectors from a Gaussian distribution, denoted by u1, . . . , um. For each randomly sampled vector ui, we project both the query q and the data points in D along the direction of ui, and find the subset of data points in D whose projections are within a certain threshold ρ (not r, but depending on r) of the projection of the query q, denoted by Di. To implement an efficient one dimensional range search, we sort the projections of the data points in D in descending order, which enables a binary search for the subset of data points whose projections are within a given range. It is worth noting that the sorting is done offline and

Algorithm 7 Efficient Range Search using Gaussian Random Variables
1: Input:
   • D = {x1, . . . , xN}: the database
   • r > 0: the specified range
   • τ ≥ 1: threshold factor
   • m: the number of one dimensional range searches
   • q: the query point
2: // Offline processing
3: Randomly sample U = (u1, . . . , um), where uk ∼ N(0, (1/d)I), k ∈ [m]
4: for i = 1, . . .
, N do 5: Compute zi = xi U 6: end for 7: //Online processing q q 8: Compute the projection zq = (z1 , . . . , zm ) = q U for query q 9: for k = 1, 2, . . . , m do 10: if k = 1 then q 11: Compute the set D1 (r, q) as D1 (r, q) = i ∈ [N] : |zi,k − zk | ≤ τ √r d 12: else q 13: Update the set Dk (r, q) as Dk (r, q) = i ∈ Dk−1 (r, q) : |zi,k − zk | ≤ τ √r d 14: end if 15: end for 16: Output the set Dm (r, q). thus will not affect the online search time. The intersection of the data points returned by all one dimensional range search is used to form the final result, i.e., D(r, q) = ∩m i=1 Di . Algorithm 7 gives the detailed steps for the proposed algorithm. We note that although the proposed algorithm is also based on random projection, it is different from the existing approaches in that it does not try to approximate the pairwise distance by random projection. Instead, it tries to approximate the binary decision, i.e., if a data point is within certain range of a query, by a sequence of binary decisions based on the one dimensional projections of both the data point and the query. As the first step of our analysis, we show that for one dimension range search with appropriately chosen threshold ρ, it is able to ensure a high recall, namely with a high probability, a data point within the specified range will be returned by the one dimension range search. The center to our analysis is the Bernstein inequality [133] given in the following theorem. Theorem 7. (Bernstein inequality) Let X1 , . . . , Xm be independent copies of X, with E[X] = 0, 92 |X| ≤ M, and E[X 2 ] = σ 2 . With a probability 1 − δ , we have 1 m 4M ln(1/δ ) Xi − E[X] ≤ + ∑ n i=1 3m 2σ 2 ln(1/δ ) m Theorem 8. Let u drawn from a Gaussian distribution N (0, d1 I). With a probability 1 − 1/[2d 4 ], for a fixed data point x, we have √ |x − q| 6 ln d √ (x − q) u ≤ d Proof. Since we fix |x − q|, to bound |(x − q) u| is equivalent to bound |ui | for any direction i. 
Using the property of the error function, i.e., erfc(x) ≤ 2e^{−x²}/√π, we have
$$\Pr\left(|u_i| \ge \gamma\sqrt{\frac{\ln d}{d}}\right) = \operatorname{erfc}\left(\gamma\sqrt{\frac{\ln d}{2}}\right) \le \frac{2\exp(-\gamma^2\ln d/2)}{\gamma\sqrt{\pi\ln d}}$$
We complete the proof by setting γ = √6.

As indicated by Theorem 8, when the dimensionality d is very high, with a high probability, |(x − q)ᵀu| is upper bounded by O(|x − q|√(ln d)/√d). The problem with Theorem 8 is that it does not provide a lower bound on |(x − q)ᵀu|. Without a lower bound, the one-dimensional range search may result in a high recall but a very poor precision, as many of the returned data points can be far away from the query q. We address this problem by performing multiple one-dimensional range searches, and only the data points found by all one-dimensional range searches are returned. This enables us to provide a lower bound for our search algorithm.

Theorem 9. Let u_1, ..., u_m be drawn independently from N(0, (1/d) I). With a probability at least 1 − m/[2d³], we have
$$\max_{1\le i\le m} \left|(x-q)^\top u_i\right| \le \frac{|x-q|\sqrt{6\ln d}}{\sqrt{d}}$$
If m ≤ (2/25) d³, then with a probability 1 − 3d⁻², we have
$$\frac{1}{m}\sum_{k=1}^{m} \left|(x-q)^\top u_k\right|^2 \ge \frac{|x-q|^2}{d}\left(1 - \frac{m}{2d^2} - \frac{18[\ln d]^2}{m} - \sqrt{\frac{25[\ln d]^2}{m}}\right)$$
By setting m = m₀ = 100[ln d]², under the assumption d ≥ 20 ln d, we have, with a probability 1 − 3d⁻²,
$$\max_{1\le k\le m} \left|(x-q)^\top u_k\right| \ge \frac{|x-q|}{2\sqrt{2d}}$$
The proof can be found in Appendix 6.3. As indicated by Theorem 9, given a sufficiently large m, with a high probability, we have
$$\frac{|x-q|}{2\sqrt{2d}} \le \max_{1\le k\le m} \left|(x-q)^\top u_k\right| \le \frac{|x-q|\sqrt{6\ln d}}{\sqrt{d}}$$
Hence, if we set τ in Algorithm 7 to be √(6 ln d), with a high probability, we will be able to guarantee
• high recall, i.e., any x satisfying |x − q| ≤ r will be returned, and
• high precision, i.e., any x satisfying |x − q| > 4√3 r √(ln d) will NOT be returned.
In the case when the number of dimensions is very high, sampling u_1, ..., u_m from a Gaussian distribution can be computationally expensive.
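Algorithm 7, with τ = √(6 ln d) as suggested by the theory above, can be sketched end to end in NumPy. This is a toy sketch under our own assumptions: synthetic unit-norm data stands in for image features, and the sizes and seeds are ours.

```python
import numpy as np

def build_index(X, m, seed=0):
    """Offline phase of Algorithm 7: sample U with columns ~ N(0, I/d),
    project the database, and sort each projection column once."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    U = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, m))
    Z = X @ U                              # (N, m) one-dimensional projections
    order = np.argsort(Z, axis=0)          # per-column sorted point ids
    return U, Z, order

def range_search(X, index, q, r, tau):
    """Online phase: intersect the m one-dimensional range searches."""
    U, Z, order = index
    d = X.shape[1]
    rho = tau * r / np.sqrt(d)
    zq = q @ U
    surviving = None
    for k in range(U.shape[1]):
        col = order[:, k]
        zk = Z[col, k]                     # sorted projections for direction k
        lo = np.searchsorted(zk, zq[k] - rho, side="left")
        hi = np.searchsorted(zk, zq[k] + rho, side="right")
        Dk = set(col[lo:hi].tolist())
        # Algorithm 7 updates D_k from D_{k-1}; intersecting the full
        # per-direction sets yields the same output.
        surviving = Dk if surviving is None else surviving & Dk
        if not surviving:                  # early exit once nothing survives
            break
    return surviving

rng = np.random.default_rng(1)
d, N = 64, 5000
X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
q = X[7] + 0.005 * rng.normal(size=d)      # near duplicate of x_7

index = build_index(X, m=16)
result = range_search(X, index, q, r=0.3, tau=np.sqrt(6 * np.log(d)))
assert 7 in result                          # the near duplicate is recalled
```

Each one-dimensional test is cheap (two binary searches), and the intersection is what restores precision, mirroring the two bullet guarantees above.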
We address this challenge by replacing the Gaussian random variables in Algorithm 7 with Rademacher random variables (a Rademacher random variable takes the values −1 and +1 with equal probability). More specifically, to construct each random vector u_i, we first sample d Rademacher variables σ_{i1}, ..., σ_{id}, with Pr(σ_{ik} = −1) = Pr(σ_{ik} = +1) = 1/2, and form u_i as u_i = (σ_{i1}, ..., σ_{id}). The details are given in Algorithm 8. Below we show that the Rademacher-based approach yields performance similar to the Gaussian-based one. As in the analysis for Algorithm 7, we first analyze the performance of the one-dimensional range search based on Rademacher random variables.

Theorem 10. Let u = (1/√d)(u_1, ..., u_d) be a random vector with u_i drawn from a Bernoulli distribution Pr(u_i = 1) = Pr(u_i = −1) = 1/2. Then, with a probability 1 − 2/d³, for a fixed data point x, we have
$$\left|(x-q)^\top u\right| \le \frac{|x-q|\sqrt{6\ln d}}{\sqrt{d}}$$

Algorithm 8 Efficient Range Search using Rademacher Random Variables
 1: Input:
    • D = {x_1, ..., x_N}: the database
    • r > 0: specified range
    • τ ≥ 1: threshold factor
    • m: the number of one-dimensional range searches
    • q: the query point
 2: // Offline processing
 3: Randomly sample U = (1/√d)(u_1, ..., u_m), where U_{i,j} is a Rademacher variable with Pr(U_{i,j} = −1) = Pr(U_{i,j} = +1) = 1/2
 4: for i = 1, ..., N do
 5:    Compute z_i = x_i^T U
 6: end for
 7: // Online processing
 8: Compute the projection z^q = (z^q_1, ..., z^q_m) = q^T U for query q
 9: for k = 1, 2, ..., m do
10:    if k = 1 then
11:       Compute the set D_1(r, q) = { i ∈ [N] : |z_{i,k} − z^q_k| ≤ τ r / √d }
12:    else
13:       Update the set D_k(r, q) = { i ∈ D_{k−1}(r, q) : |z_{i,k} − z^q_k| ≤ τ r / √d }
14:    end if
15: end for
16: Output the set D_m(r, q).

Proof. We write (x − q)ᵀu = (1/√d) ∑_{i=1}^{d} (x_i − q_i) u_i. Since E[(x − q)ᵀu] = 0, using the Bernstein inequality, we have, with a probability 1 − δ,
$$\left|(x-q)^\top u\right| \le \frac{|x-q|\sqrt{2\ln(2/\delta)}}{\sqrt{d}}$$
We complete the proof by setting δ = 2/d³.
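The Rademacher construction can be checked numerically: since E[uuᵀ] = I/d for both constructions, the squared one-dimensional gaps concentrate around |x − q|²/d in either case. A small sketch, with synthetic data and sizes of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, N = 256, 16, 2000

# Synthetic stand-ins for feature vectors; names and sizes are ours.
X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
q = X[0] + 0.01 * rng.normal(size=d)

# Rademacher directions: entries +-1/sqrt(d). No Gaussian sampling is
# needed, and X @ U could even be computed with additions/subtractions only.
U = rng.choice([-1.0, 1.0], size=(d, m)) / np.sqrt(d)

gaps = np.abs((X - q) @ U)                      # (N, m) one-dimensional gaps
true_sq = np.linalg.norm(X - q, axis=1) ** 2

# Mean squared gap over m directions should hover around |x - q|^2 / d,
# up to the constants appearing in Theorems 10 and 11.
ratio = (gaps ** 2).mean(axis=1) * d / np.maximum(true_sq, 1e-12)
assert 0.5 < np.median(ratio) < 2.0
```

Because generating and applying ±1 entries is cheaper than Gaussian sampling, this is the variant favored in high dimensions, consistent with the theory below.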
We then show the performance guarantee for multiple one-dimensional searches based on Rademacher random variables.

Theorem 11. Let U = (1/√d)(u_1, ..., u_m) be random variables with U_{i,j} having equal probability of being +1 and −1. With a probability at least 1 − 2m/[d³], we have
$$\max_{1\le i\le m} \left|(x-q)^\top u_i\right| \le \frac{|x-q|\sqrt{6\ln d}}{\sqrt{d}}$$
If m ≤ d³/50, with a probability 1 − 3d⁻², we have
$$\frac{1}{m}\sum_{k=1}^{m}\left|(x-q)^\top u_k\right|^2 \ge \frac{|x-q|^2}{d}\left(1 - \frac{2m}{d^2} - \frac{18[\ln d]^2}{m} - \sqrt{\frac{25[\ln d]^2}{m}}\right)$$
By setting m = m₀ = 100[ln d]², under the assumption d ≥ 40 ln d, we have, with a probability 1 − 3d⁻²,
$$\max_{1\le k\le m}\left|(x-q)^\top u_k\right| \ge \frac{|x-q|}{2\sqrt{2d}}$$
The proof of Theorem 11 is similar to that of Theorem 9; for completeness, we include it in Appendix 6.3. The theoretical guarantee provided by Theorem 11 is almost identical to that of Theorem 9. In both Theorems 9 and 11, we derive m₀, the minimum number of random projections needed to achieve both high precision and high recall. We note that this number appears to be very large. However, in our experiments, we found that only a small number of random projections is sufficient to achieve both high precision and high recall. This is because the analysis of Theorem 11 is a worst-case analysis and likely overestimates the required number of random projections significantly.

5.4 Experiments

In this section, we validate the proposed search algorithm on large-scale image feature datasets.

5.4.1 Experimental Setup

Dataset: The algorithm is evaluated on the Holidays and Copydays datasets [35]. The Holidays dataset includes 1491 holiday photos, some of which were deliberately taken under attacks such as rotations, viewpoint changes, and illumination changes. The Copydays dataset extends the Holidays dataset by introducing additional attacks, such as cropping, blurring, painting, printing, scanning, and compression.
Feature Extraction: For different experiments, we extract 1 million and 10 million SIFT features [30] using the current version of the VLFeat [134] covariant feature detectors. We randomly sample 0.02% of the SIFT points as queries. For these 200 query points, we treat their 100 nearest neighbors as the true neighbors. In general, the distance between a data point and its 100th nearest neighbor is within roughly 10% of the average pairwise distance in the dataset. Finally, when comparing the algorithms' performance on large-scale data, we extend the dataset with 1.1 billion additional SIFT features [132].

Settings: Our theory showed that the algorithm based on Gaussian random variables and the algorithm based on Rademacher random variables share the same idea and have similar theoretical guarantees. In our experiments, we first examine the performance of both algorithms and then compare their performance with other nearest neighbor search algorithms. To measure performance, we use precision, recall, and search time. Given a query q, let D(r, q) be the subset of SIFT points within distance r from q, and D̂(r, q) be the subset of SIFT points returned by a given algorithm. Then the precision and recall are defined as
$$\text{Prec} = \frac{|D(r,q) \cap \hat{D}(r,q)|}{|\hat{D}(r,q)|}, \qquad \text{Recall} = \frac{|D(r,q) \cap \hat{D}(r,q)|}{|D(r,q)|}$$
A high precision implies that most of the returned data points are within the range r of the query q, and a high recall implies that most of the data points within distance r from the query q are returned by the algorithm. As precision and recall are a pair of competing factors, the goal of an effective search algorithm is to achieve a good balance between them, i.e., both decent precision and decent recall. This is evaluated by the F1 measure, defined as follows:
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
In our algorithm, there are two important parameters: the number of one-dimensional searches m and the window width factor τ.
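The evaluation metrics above are straightforward set operations; a small helper (our own, for illustration) makes the definitions concrete:

```python
def precision_recall_f1(true_set, returned_set):
    """Precision, recall and F1 for a range-search result, per the
    definitions above: true_set = D(r,q), returned_set = D-hat(r,q)."""
    if not returned_set or not true_set:
        return 0.0, 0.0, 0.0
    hits = len(true_set & returned_set)
    if hits == 0:
        return 0.0, 0.0, 0.0
    prec = hits / len(returned_set)
    rec = hits / len(true_set)
    return prec, rec, 2 * prec * rec / (prec + rec)

# Toy example: 4 true neighbors, 5 returned points, 3 of them correct.
prec, rec, f1 = precision_recall_f1({1, 2, 3, 4}, {2, 3, 4, 5, 6})
assert abs(prec - 0.6) < 1e-12 and abs(rec - 0.75) < 1e-12
```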
m determines the precision, because if a candidate misses any of the one-dimensional searches, it is removed from the final result. m also directly affects the search time, as more one-dimensional searches require more time. The parameter τ determines the recall, as a larger τ results in a larger search window in the low dimensional space. The goal of our experiments is to verify that the proposed algorithm can achieve good precision and recall within a short response time.

5.4.2 Results and Discussion

Here we study the algorithm's performance and its sensitivity to parameter settings (Figure 5.1), based on Gaussian random variables, w.r.t. different values of m and τ.

Sensitivity to the number of projections, m: First, the precision increases with the number of projections m, as every one-dimensional projection search filters out some false positives. Consistently, we also see that the recall decreases as m increases, because as the number of filters increases, the algorithm may also filter out some true positives. This reflects the general trade-off between precision and recall in information retrieval, so we use the F1 measure to help us decide on a good value of m. When m is small, increasing m greatly boosts the F1 measure, but once m is greater than 16, the F1 measure tends to stabilize and varying m has much less effect on F1, which indicates that our algorithm achieves a good balance between precision and recall when m ≥ 16. In general, we conclude that the algorithm's search quality is not sensitive to the number of projections, as long as the number of projections is not too small. For the running time, we can see that it is roughly linear in m when m is small, but becomes sub-linear when the number of projections is larger (e.g., m ≥ 16). This is because the search time has two parts: the search time in low dimensions and the time to update the candidate set.
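The precision trend described above (more one-dimensional filters remove more false positives) can be reproduced on synthetic data. A sketch under our own assumptions: random unit vectors stand in for SIFT features, and τ is deliberately set smaller than the theory's √(6 ln d) so the filtering effect is visible at this toy scale.

```python
import numpy as np

rng = np.random.default_rng(4)
d, N = 64, 4000
X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
q = X[0] + 0.005 * rng.normal(size=d)        # x_0 is a near duplicate of q
r = 0.6
true_set = set(np.flatnonzero(np.linalg.norm(X - q, axis=1) <= r).tolist())

tau = 1.0                                    # illustrative, smaller than sqrt(6 ln d)
precisions = []
for m in (2, 8, 32):
    U = rng.choice([-1.0, 1.0], size=(d, m)) / np.sqrt(d)
    gaps = np.abs((X - q) @ U)
    keep = np.all(gaps <= tau * r / np.sqrt(d), axis=1)  # must pass all m tests
    returned = set(np.flatnonzero(keep).tolist())
    hits = len(true_set & returned)
    precisions.append(hits / max(len(returned), 1))

# Each extra one-dimensional filter removes false positives, so precision
# does not decrease as m grows (while the near duplicate keeps passing).
assert precisions[0] <= precisions[-1]
```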
When the number of projections is small, both parts contribute significantly to the running time, but as the number of searches increases, the candidate set becomes much smaller, so the updates become much less expensive. As a result, the running time does not increase linearly with the number of projections. We can see that setting m = 16 yields a good trade-off between accuracy, efficiency, and running time. For this reason, we fix m = 16 in our later experiments.

Sensitivity to the search window width, τ: First, we see that the algorithm's precision initially increases and then gradually decreases as τ increases. This is because when the search window is too small, no results are returned; as we increase the window width to a reasonable size, our algorithm returns results with decent precision; and if the window width is set too large, the precision suffers due to the large number of false positives. Second, we note that increasing the search window width leads to a higher recall, as expected. Examining the F1 measure, we see that a good balance between precision and recall is achieved for a wide range of values of τ, such as 15 ≤ τ ≤ 30. The running time is not affected much by τ, for two reasons: 1) each search consists of only two binary searches, for the upper and the lower bound, whose cost is independent of τ; 2) although τ does affect the number of candidates returned, the dominating factor is the number of projections (i.e., the number of filters); τ serves more of a fine-tuning purpose that affects the search quality, but it has less impact on the sheer candidate set size. To conclude, our algorithm delivers good search quality and efficiency as long as τ is not set to extreme values; thus, our algorithm is not sensitive to the parameter τ. For the reasons above, we set τ = 20 in our later experiments.

In Figure 5.2, we also report experiments using Rademacher random variables.
As we can see from the results, Gaussian random variables and Rademacher random variables yield similar performance in all aspects, as our theory predicts. As Rademacher random variables are more computationally feasible in high dimensions, we use them in the subsequent analysis. In general, we conclude that both Algorithm 7 and Algorithm 8 deliver good search precision and recall within a fraction of a second, using only a few projections.

[Figure 5.1: Performance of Gaussian Projection Search with different numbers of projections (m = 4 to 48) as a function of τ: (a) average precision, (b) average recall, (c) average F1 measure, (d) average running time.]

Performance comparison, 1 million Copydays SIFT features: Here, the 1 million Copydays [35] SIFT features used in the previous experiment are used to compare our algorithm's performance with other established methods. We fixed τ = 20. For the baseline methods, the parameters are learned from 100,000 SIFT points extracted from the same dataset. For all methods, we vary the number of projections (i.e., the code length) from 8 to 48. When the number of projections is high, our algorithm's running time increases because of the increased number of one-dimensional searches and set operations. However, as shown in Figure 5.3, the proposed algorithm consistently delivers significantly better search results under the same time constraint and with a more compact representation.
[Figure 5.2: Performance of Rademacher Projection Search with different numbers of projections (m = 4 to 48) as a function of τ: (a) average precision, (b) average recall, (c) average F1 measure, (d) average running time.]

Performance comparison, 10 million SIFT features: In this experiment, we extract a total of 10 million features from Copydays [35] to expand the dataset to a larger scale, and then compare the different algorithms' performance. For the proposed algorithm, we use 16 projections and set τ = 20. For all other methods, the parameters are again learned from 100,000 SIFT points. We set the baseline methods' code length to 48 bits for comparable search results. The proposed algorithm outperformed the baseline methods with a more compact representation and showed better robustness when scaled to larger datasets.

Method              F1      Search Time (s)
Projection Search   0.576   3.66
LSH                 0.205   3.52
SH                  0.235   3.28
Table 5.1 Performance comparison, searching 10 million features

[Figure 5.3: Performance comparison for different code lengths (8 to 48 bits), searching 1 million features: (a) average F1 measure and (b) running time for LSH, SH, and the proposed method.]

Retrospective performance comparison, RPF and Projection Search: At the beginning of the chapter, we mentioned that one of the motivations of the Random Projection Search (RPS) algorithm is to avoid relying on a sparsity assumption for accurate search results.
Indeed, the RPS algorithm does not use the Restricted Isometry Property to filter out noisy points; rather, it addresses this issue with a series of binary decisions: only a point that passes all of the binary decisions is considered a true nearest neighbor. As we can see in Table 5.2, although the RPS algorithm gives lower recall than the RPF algorithm, it delivers much higher search accuracy. This validates our hypothesis that, in this application, the decision procedure in RPS is more effective than the RIP-based filtering procedure in RPF.

Method                        Accuracy   Recall   F1
Random Projection Filtering   0.3282     0.6564   0.4376
Random Projection Search      0.6182     0.5482   0.5761
Table 5.2 Performance comparison, RPF and RPS algorithms on 10 million features

Searching 80 million images: We also compare the performance of RPS and RPF on the 80 million tiny images dataset. We use 8 projections and set τ = 20 for RPS. The results are summarized in Table 5.3. As the RPS algorithm does not need to build KD-trees, it requires much less indexing time. Searching is also more efficient because binary search is very fast, and the search quality is improved by the more effective decision-based search procedure. In Figure 5.4 we present examples of the search results of the two algorithms. To avoid repetition, we do not include the results of the baseline algorithms; those results are presented in Chapter 4.

Method                        Index Time (s)   Search Time (s)   F1
Random Projection Filtering   2777.1           35.9              0.573
Random Projection Search      318.64           19.7              0.671
Table 5.3 Performance comparison, RPF and RPS algorithms on 80 million images

Figure 5.4 Examples of search results: RPS (first row) and RPF (second row)

Scalability evaluation, 1.1 billion SIFT features: To compare the scalability of the proposed and baseline methods, we expanded the background data with 1.1 billion SIFT features, a dataset used for large-scale approximate nearest neighbor search [132].
The algorithm parameters were set according to the 10-million search experiment, because those settings delivered the best performance. As we can see, on very large-scale datasets, the baseline algorithms are more likely to be affected by noisy data because of the sheer volume of the data. Our search algorithm, on the other hand, is much more resilient: it effectively removes false positives through a multi-step filtering process and thus delivers more accurate results more efficiently.

Method              F1      Search Time (s)
Projection Search   0.571   442.07
LSH                 0.155   748.59
SH                  0.178   743.20
Table 5.4 Performance and scalability comparison, searching 1.1 billion features

5.5 Conclusion

In this chapter, we proposed a new range search algorithm for large-scale, high-dimensional data. We proved a theoretical guarantee for the proposed algorithm and empirically evaluated it on 80 million images and 1.1 billion image features. Our study showed that the proposed algorithm offers better accuracy and efficiency for large-scale, high-dimensional near duplicate search.

CHAPTER 6

SUMMARY AND CONCLUSIONS

In this thesis, we designed three search algorithms motivated by different kinds of data representations, namely, the bag-of-features representation, the vector representation with sparsity constraints, and the dense vector representation. The random seeding quantization algorithm is suitable for image search based on bags of local features, where a large number of features is affordable to capture the visual details of images for highly accurate search; random seeding addresses the search efficiency issue and enables such accurate search to be applied to large-scale data. Both the random projection filtering algorithm and the random projection search algorithm, on the other hand, are designed for image search based on vectors.
In this scenario, a feature vector is generated for each image based on trusted domain knowledge, so a compact but information-rich vector represents the original image. Given the vector representation, the random projection search algorithm (RPS) is designed for general near-duplicate search, whereas the random projection filtering algorithm (RPF) specializes in searching sparse vectors. Because most vector representations are more space efficient than bag-of-features representations, both the RPS and RPF algorithms are preferable for large-scale search under memory constraints. In the following section, we summarize the major contributions of these three algorithms.

6.1 Contributions

The random seeding algorithm presented in Chapter 3 makes the following contributions:
• New image similarity measurement and its theoretical foundation: From a probabilistic perspective, a new image similarity measurement is proposed to approximate the optimal partial matching between bags of features. Our analysis not only establishes a theoretical foundation for this similarity measurement, but also enables the design of a new keypoint quantization scheme for image indexing and search.
• Efficient visual vocabulary construction: A new visual vocabulary construction scheme is derived for calculating the similarity between two bags of features. The new scheme avoids large-scale clustering for vocabulary construction and scales well to huge image databases.
• New feature point quantization scheme: A new feature point quantization method is designed to accurately approximate the similarity measurement over a finite number of samples, which also leads to efficient computation of the bag-of-words representation of images.
• Sparse vector representation of images: A sparse vector representation is essential for efficient indexing and search.
Proper thresholding is introduced to create sparse vector representations of images without sacrificing representation accuracy, allowing the algorithm to scale to large image search.
• New criteria for mapping feature points to visual words: The new mapping criteria allow consistent mapping, the elimination of outlier feature points, and one-to-many mapping during the quantization stage. These new criteria not only create more accurate representations of images, but also make the search more efficient.
• New feature point search algorithm with KD-trees: It is important to utilize KD-trees for efficient feature point quantization. By constructing KD-trees of the keypoints in the dataset and performing range search using the visual words, the computational cost of quantizing N keypoints into m cluster centers is reduced to O(m log N) from the O(N log m) of the clustering-based approach, leading to a more efficient quantization method for large-scale feature point datasets.

The Random Projection Filtering algorithm presented in Chapter 4 makes the following contributions:
• Near duplicity with sparsity constraints: While several approaches have been proposed to use the L2 distance constraint for near duplicate search, the sparsity constraint (L0 distance) is often neglected. We showed that by explicitly exploiting the statistical property of sparsity, a new search algorithm can be designed to achieve higher efficiency and accuracy.
• Extended Johnson-Lindenstrauss theorem for search: We extended the Johnson-Lindenstrauss theorem to directly use random Gaussian matrices for projection. This extension enables the design of a filtering process based on the RIP condition.
• Multi-step search scheme: Instead of using a single low-dimensional range search, we proposed performing multiple such searches and assembling their partial results to recall most nearest neighbors into the final result.
This scheme directly avoids the dimensionality limitation of applying the J-L theorem to large-scale high-dimensional search, and enables us to use tree search algorithms in low dimensions.
• Filtering scheme based on the RIP condition: By exploiting the sparsity constraints, we proposed an effective filtering procedure to address the precision-recall tradeoff. False positives can be efficiently removed to achieve high accuracy, while a high recall rate is maintained for overall better retrieval performance.

The Random Projection Search algorithm presented in Chapter 5 makes the following contributions:
• Algorithm for range search: Although some new hashing algorithms have recently been proposed, they all address the nearest neighbor search problem. Our algorithm, on the other hand, directly addresses the range search problem and allows users to specify the search range in applications.
• Search algorithm without sparsity constraints: While sparseness has shown its effectiveness for designing accurate range search algorithms, this property may not hold in some applications. A new search algorithm without a sparseness assumption is proposed so as to apply to a wide range of search scenarios.
• Low dimensional embedding independent of d and N: For existing approaches, the number of projections required to preserve the original similarity is related either to the dataset size N (JL theorem) or to the data dimensionality d (compressive sensing theory), leading to difficulties in searching over the projected space. Our new algorithm projects the data onto a one-dimensional space and utilizes binary search, making it more suitable for larger-scale and higher-dimensional data.
• Effectiveness of one-dimensional search: Using Talagrand's inequality, we theoretically proved that a single one-dimensional range search with an appropriately chosen threshold ensures a high-recall search.
• High accuracy with multiple one-dimensional searches: We proved theoretically that using multiple one-dimensional searches removes false positives and returns highly accurate results.
• Efficient sampling for high dimensions: We proved theoretically and verified empirically that using Rademacher random variables is as effective as using Gaussian random variables. This makes the algorithm more suitable for searching high dimensional data.

6.2 Future Work

The studies presented in this dissertation lead to several important research questions:
– In Chapter 3, we used a random sampling approach to choose the centers of hyper-spheres; these centers and hyper-spheres are then used to generate sparse vector representations of images. We may investigate different approaches for choosing the centers of the hyper-spheres, which may lead to better representations that capture image similarities.
– In our studies of the Random Projection Filtering algorithm, we assumed that the similarity between the query and the objects in a database is captured by the Euclidean distance. We plan to extend the RPF algorithm to the case where the similarity between objects is assessed by a kernel similarity function. Given that the embedding space for a kernel similarity could be of infinite dimension, we cannot directly apply the Random Projection Filtering method; we thus need to design new indexing schemes as well as search algorithms for efficient kernel-based similarity search.
– In Chapter 5, our Random Projection Search algorithm uses either randomly sampled Gaussian variables or Rademacher variables. In the future, we plan to improve the efficiency and effectiveness of the proposed approach by exploring data-dependent sampling approaches, such as sampling the random vectors based on the covariance structure of the data.
– It is also possible to extend the Random Projection Search algorithm to kernel distances.
Similar to the Euclidean distance case, we would first need to establish the performance guarantee of a single one-dimensional range search, and then extend it to multiple one-dimensional range searches.

6.3 Conclusions

This thesis presents three new search algorithms for data represented in its different fundamental forms, including the bag-of-features representation, the vector representation with sparsity constraints, and the dense vector representation. The conducted research makes significant contributions to both the theoretical foundations of utilizing randomization for search and their applications to large-scale, high-dimensional data. The algorithms presented here advance the state of the art in searching million-scale image and image feature data. Several future research directions for new large-scale search algorithms are identified.

APPENDIX

Proof of Lemma 1

Proof. We have
$$-d(X,Y) \le \sum_{k=1}^{m}\max_{1\le i\le n}\left(-|x_k-y_i|_2^2\right) + \sum_{k=1}^{n}\max_{1\le i\le m}\left(-|y_k-x_i|_2^2\right)$$
$$\le \sum_{k=1}^{m}\frac{1}{\lambda}\ln\sum_{i=1}^{n}\exp\left(-\lambda|x_k-y_i|_2^2\right) + \sum_{k=1}^{n}\frac{1}{\lambda}\ln\sum_{i=1}^{m}\exp\left(-\lambda|y_k-x_i|_2^2\right)$$
$$\le \frac{2}{\lambda}\sum_{i=1}^{m}\sum_{j=1}^{n}\exp\left(-\lambda\left[|x_i-y_j|_2^2 - d_0\right]\right) - d_0$$
where $d_0 = \min_{k,i}|x_k-y_i|_2^2$.

Proof of Lemma 2

Proof. We have
$$\int dz\, p(z|X)\,p(z|Y) = \frac{\mu^d}{mn\pi^d}\sum_{i=1}^{m}\sum_{j=1}^{n}\int dz\,\exp\left(-\mu|z-x_i|_2^2-\mu|z-y_j|_2^2\right)$$
$$= \frac{\mu^d}{mn\pi^d}\sum_{i=1}^{m}\sum_{j=1}^{n}\int dz\,\exp\left(-\frac{\mu}{2}|x_i-y_j|_2^2 - 2\mu\left|z-\frac{x_i+y_j}{2}\right|_2^2\right)$$
$$= \frac{\mu^{d/2}}{mn[2\pi]^{d/2}}\sum_{i=1}^{m}\sum_{j=1}^{n}\exp\left(-\frac{\mu}{2}|x_i-y_j|_2^2\right) = \frac{\mu^{d/2}}{mn[2\pi]^{d/2}}\, s(X,Y)$$

Proof of Lemma 3

Proof. We compute the difference $\Delta_{i,j}$ between $\int dz\,\exp(-\mu[|z-x_i|_2^2+|z-y_j|_2^2])$ and $\int_{z\in Q} dz\,\exp(-\mu[|z-x_i|_2^2+|z-y_j|_2^2])$, i.e., in polar coordinates,
$$\Delta_{i,j} = \int_{\gamma R}^{+\infty} r^{d-1}\,dr\,d\Omega\,\exp\left(-\mu\left[|z-x_i|_2^2+|z-y_j|_2^2\right]\right)$$
$$\le \exp\left(-2\mu(\gamma-1)^2R^2\right)\int_{\gamma R}^{+\infty} r^{d-1}\,dr\,d\Omega\,\exp\left(-2\mu(r-\gamma R)^2\right)$$
$$= \exp\left(-2\mu(\gamma-1)^2R^2\right)\int_{0}^{+\infty} (r+\gamma R)^{d-1}\,dr\,d\Omega\,\exp\left(-2\mu r^2\right)$$
$$\le \exp\left(-2\mu(\gamma-1)^2R^2\right)\left(\sqrt{\frac{\pi}{2\mu}}+\gamma R\right)^{d-1}$$
where the first inequality uses $|z-x_i| \ge (r-\gamma R) + (\gamma-1)R$ for $|z| = r \ge \gamma R$, since the data points lie within a ball of radius $R$. Using the $\Delta_{i,j}$ computed above, we obtain the result in the lemma by noticing that $|x_i-y_j|_2^2 \le 4R^2$.
Proof of Corollary 1

Proof. Since $\gamma \ge \max\left(4, \pi^{1/2}/[(2\mu)^{1/2}R]\right)$, we have
$$|s(X,Y)-s_1(X,Y)| \le \exp\left(-\mu\gamma^2R^2\right)(2\gamma R)^{d-1}\, s(X,Y)$$
To ensure that the difference is bounded by $\delta$, it is sufficient to ensure
$$-\mu\gamma^2R^2 + (d-1)\ln(2\gamma R) \le \ln\delta$$
Using $\ln(\gamma R) \le \gamma R - 1$, it suffices to show
$$-\mu\gamma^2R^2 + (d-1)\gamma R \le \ln\delta$$
which implies
$$\gamma \ge \frac{1}{2\mu R}\left[(d-1) + \sqrt{(d-1)^2 + 4\mu\ln[1/\delta]}\right]$$

Proof of Theorem 2

Proof. Since $0 \le p(z|X)\,p(z|Y) \le 1$, according to the Hoeffding inequality [135], we have
$$\Pr\left(\left|\mathbb{E}_N[p(z|X)p(z|Y)] - \mathbb{E}_z[p(z|X)p(z|Y)]\right| \ge \varepsilon\right) \le 2\exp(-2N\varepsilon^2)$$
where $\mathbb{E}_N[\cdot]$ stands for the expectation over the $N$ random samples $\{z_k\}_{k=1}^N$. By setting $\delta = 2\exp(-2N\varepsilon^2)$, we have, with probability $1-\delta$,
$$|s_2(X,Y) - s_1(X,Y)| \le \sqrt{\frac{1}{2N}\ln\frac{2}{\delta}}$$

Proof of Theorem 9

Proof. The upper bound follows directly from the union bound over the $m$ random projections. To show the lower bound, we first bound the maximum value by
$$\max_{1\le k\le m}\left|(x-q)^\top u_k\right| \ge \sqrt{\frac{1}{m}\sum_{k=1}^{m}\left|(x-q)^\top u_k\right|^2}$$
It is thus sufficient to bound $\frac{1}{m}\sum_{k=1}^{m}|(x-q)^\top u_k|^2$. Define the event $Z$ as
$$Z = \left\{U=(u_1,\ldots,u_m) : \max_{1\le k\le m}\left|(x-q)^\top u_k\right| \le \frac{|x-q|\sqrt{6\ln d}}{\sqrt d}\right\} \tag{7.1}$$
Using Theorem 8, we have $\Pr(U\in Z) \ge 1 - m/[2d^3]$. According to the Bernstein inequality, we have, with a probability $1-2d^{-2}$, for any $U\in Z$,
$$\left|\frac{1}{m}\sum_{k=1}^{m}\left|(x-q)^\top u_k\right|^2 - \mathbb{E}\left[\left|(x-q)^\top u\right|^2\,\middle|\,u\in Z\right]\right| \le \frac{3V\ln d}{m} + \sqrt{\frac{2\sigma^2\ln d}{m}}$$
where
$$V = \max_{U\in Z}\max_{1\le k\le m}\left|(x-q)^\top u_k\right|^2, \qquad \sigma^2 m = \mathbb{E}\left[\sum_{k=1}^{m}\left|(x-q)^\top u_k\right|^4\,\middle|\,U\in Z\right]$$
Using the definition of $Z$, we have
$$V \le \frac{6|x-q|^2\ln d}{d}$$
and $\sigma^2 \le V\,\mathbb{E}\left[|(x-q)^\top u|^2\,|\,u\in Z\right]$. To bound the expectation, we have
$$\mathbb{E}\left[\left|(x-q)^\top u\right|^2\right] \le \Pr(u\in Z)\,\mathbb{E}\left[\left|(x-q)^\top u\right|^2\,\middle|\,u\in Z\right] + \Pr(u\notin Z)\,|x-q|^2 \le \Pr(u\in Z)\,\mathbb{E}\left[\left|(x-q)^\top u\right|^2\,\middle|\,u\in Z\right] + \frac{m}{2d^3}|x-q|^2$$
Since
$$\mathbb{E}\left[\left|(x-q)^\top u\right|^2\right] = (x-q)^\top\mathbb{E}[uu^\top](x-q) = \frac{|x-q|^2}{d}$$
we have
$$\mathbb{E}\left[\left|(x-q)^\top u\right|^2\,\middle|\,u\in Z\right] \ge \frac{|x-q|^2}{d}\left(1-\frac{m}{2d^2}\right)$$
On the other hand, since $\mathbb{E}[|(x-q)^\top u|^2] \ge \Pr(u\in Z)\,\mathbb{E}[|(x-q)^\top u|^2\,|\,u\in Z]$, we have
$$\mathbb{E}\left[\left|(x-q)^\top u\right|^2\,\middle|\,u\in Z\right] \le \frac{1}{1-\frac{m}{2d^3}}\cdot\frac{|x-q|^2}{d} \le \frac{25|x-q|^2}{24d}$$
where the last step follows from the fact that $25m \le 2d^3$.
As a result, with probability $1 - 3d^{-2}$, we have
\[
\frac{1}{m}\sum_{k=1}^{m}|(x-q)^{\top}u_k|^2 \ge \frac{|x-q|^2}{d}\left[1 - \frac{m}{2d^2} - \frac{18[\ln d]^2}{m} - \sqrt{\frac{25[\ln d]^2}{m}}\right].
\]
By setting $m = 100[\ln d]^2$, we have, with probability $1 - 3d^{-3}$,
\[
\frac{1}{m}\sum_{k=1}^{m}|(x-q)^{\top}u_k|^2 \ge \frac{|x-q|^2}{8d}.
\]

Proof of Theorem 11

Proof. Similar to the proof of Theorem 9, the upper bound follows directly from the union bound over the $m$ random projections. To show the lower bound, we aim to bound $\frac{1}{m}\sum_{k=1}^{m}|(x-q)^{\top}u_k|^2$. Using the event $Z$ defined in (7.1), we have $\Pr(U\in Z) \ge 1 - 2m/d^3$. Following the same proof as for Theorem 9, we have, with probability $1 - 2d^{-2}$, for any $U\in Z$,
\[
\left|\frac{1}{m}\sum_{k=1}^{m}|(x-q)^{\top}u_k|^2 - \mathrm{E}\left[|(x-q)^{\top}u|^2 \mid u\in Z\right]\right| \le \frac{3V\ln d}{m} + \sqrt{\frac{2\sigma^2\ln d}{m}},
\]
where
\[
V = \max_{U\in Z}\max_{1\le k\le m}|(x-q)^{\top}u_k|^2 \le \frac{6|x-q|^2\ln d}{d}, \qquad
\sigma^2 = \mathrm{E}\left[\sum_{k=1}^{m}|(x-q)^{\top}u_k|^4 \,\Big|\, U\in Z\right] \le V\,\mathrm{E}\left[\sum_{k=1}^{m}|(x-q)^{\top}u_k|^2 \,\Big|\, U\in Z\right].
\]
Following the same argument as in the proof of Theorem 9, we have
\[
\begin{aligned}
\mathrm{E}\left[|(x-q)^{\top}u|^2 \mid u\in Z\right] &\ge \mathrm{E}[|(x-q)^{\top}u|^2] - \Pr(u\notin Z)|x-q|^2 \\
&\ge (x-q)^{\top}\mathrm{E}[uu^{\top}](x-q) - \frac{2m}{d^3}|x-q|^2 = \frac{|x-q|^2}{d}\left(1 - \frac{2m}{d^2}\right),
\end{aligned}
\]
where in the last step we use the property $\mathrm{E}[uu^{\top}] = I/d$. The upper bound of $\mathrm{E}[|(x-q)^{\top}u|^2 \mid u\in Z]$ is given by
\[
\mathrm{E}\left[|(x-q)^{\top}u|^2 \mid u\in Z\right] \le \frac{|x-q|^2}{d\left(1 - \Pr(u\notin Z)\right)} \le \frac{25|x-q|^2}{24d},
\]
where the last step follows from the fact that $50m \le d^3$. As a result, with probability $1 - 3d^{-2}$, we have
\[
\frac{1}{m}\sum_{k=1}^{m}|(x-q)^{\top}u_k|^2 \ge \frac{|x-q|^2}{d}\left[1 - \frac{2m}{d^2} - \frac{18[\ln d]^2}{m} - \sqrt{\frac{25[\ln d]^2}{m}}\right].
\]
By setting $m = 100[\ln d]^2$, under the assumption that $d \ge 40\ln d$, we have, with probability $1 - 3d^{-3}$,
\[
\frac{1}{m}\sum_{k=1}^{m}|(x-q)^{\top}u_k|^2 \ge \frac{|x-q|^2}{8d}.
\]

BIBLIOGRAPHY

[1] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu, “An optimal algorithm for approximate nearest neighbor searching fixed dimensions,” J. ACM, vol. 45, no. 6, pp. 891–923, 1998.

[2] K. Grauman and T. Darrell, “Pyramid match hashing: Sub-linear time indexing over partial correspondences,” in CVPR, pp. 1–8, 2007.

[3] Z. Wu, Q. Ke, M. Isard, and J.
Sun, “Bundling features for large scale partial-duplicate web image search,” (Los Alamitos, CA, USA), pp. 25–32, IEEE Computer Society, 2009.

[4] Netcraft, “June 2009 web server survey.” http://news.netcraft.com/archives/2009/06/17/june_2009_web_server_survey.html, 2009.

[5] Google, “We knew the web was big.” http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html, 2008.

[6] Flickr, “3 billion!.” http://blog.flickr.net/en/2008/11/03/3-billion/, 2008.

[7] T. Barrett, D. Troup, S. Wilhite, et al., “NCBI GEO: archive for high-throughput functional genomic data,” vol. D885-90, 2009.

[8] D. Sifry, “State of the blogosphere, April 2006 part 1: On blogosphere growth.” http://www.sifry.com/alerts/archives/000432.html, 2006.

[9] H. Ferguson, P. Greenfield, T. Axelrod, et al., “Astronomical data reduction and analysis for the next decade,” 2009.

[10] M. M. Group, “World internet users and population stats.” http://www.internetworldstats.com/stats.htm, 2009.

[11] L. Li and H. Li, “Dimension reduction methods for microarrays with application to censored survival data,” Bioinformatics, vol. 20, no. 18, pp. 3406–3412, 2004.

[12] B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press, 2001.

[13] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf, “Large scale multiple kernel learning,” J. Mach. Learn. Res., vol. 7, pp. 1531–1565, December 2006.

[14] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, “Multiple kernels for object detection,” in Proceedings of the International Conference on Computer Vision, September 2009.

[15] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison Wesley, May 1999.

[16] B. Croft, D. Metzler, and T. Strohman, Search Engines: Information Retrieval in Practice. Addison Wesley, 1 ed., February 2009.

[17] JISC Digital Media, “Review of image search engines.” http://www.jiscdigitalmedia.
ac.uk/stillimages/advice/review-of-image-search-engines/, 2008.

[18] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to object matching in videos,” in Proceedings of the International Conference on Computer Vision, vol. 2, pp. 1470–1477, Oct. 2003.

[19] R. Datta, D. Joshi, J. Li, and J. Z. Wang, “Image retrieval: Ideas, influences, and trends of the new age,” ACM Comput. Surv., vol. 40, pp. 1–60, April 2008.

[20] N. Orio, “Music retrieval: A tutorial and review,” Foundations and Trends in Information Retrieval, vol. 1, no. 1, 2006.

[21] G. Shakhnarovich, T. Darrell, and P. Indyk, Nearest-Neighbor Methods in Learning and Vision: Theory and Practice (Neural Information Processing). The MIT Press, 2006.

[22] D. D. Lewis, “Naive (Bayes) at forty: The independence assumption in information retrieval,” in Proceedings of ECML-98, 10th European Conference on Machine Learning (C. Nédellec and C. Rouveirol, eds.), no. 1398, (Chemnitz, DE), pp. 4–15, Springer Verlag, Heidelberg, DE, 1998.

[23] J. Zobel and A. Moffat, “Inverted files for text search engines,” ACM Comput. Surv., vol. 38, no. 2, p. 6, 2006.

[24] G. Salton, A. Wong, and C. S. Yang, “A vector space model for automatic indexing,” Commun. ACM, vol. 18, no. 11, pp. 613–620, 1975.

[25] K. S. Jones, S. Walker, and S. E. Robertson, “A probabilistic model of information retrieval: development and comparative experiments,” Inf. Process. Manage., vol. 36, no. 6, pp. 779–808, 2000.

[26] J. M. Ponte and W. B. Croft, “A language modeling approach to information retrieval,” in SIGIR ’98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, (New York, NY, USA), pp. 275–281, ACM, 1998.

[27] D. Nister and H. Stewenius, “Scalable recognition with a vocabulary tree,” in CVPR ’06: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, (Washington, DC, USA), pp.
2161–2168, IEEE Computer Society, 2006.

[28] J. L. Bentley, “Multidimensional binary search trees used for associative searching,” Commun. ACM, vol. 18, no. 9, pp. 509–517, 1975.

[29] A. Gionis, P. Indyk, and R. Motwani, “Similarity search in high dimensions via hashing,” in VLDB ’99: Proceedings of the 25th International Conference on Very Large Data Bases, (San Francisco, CA, USA), pp. 518–529, Morgan Kaufmann Publishers Inc., 1999.

[30] D. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, pp. 91–110, 2004.

[31] C. Wallraven, B. Caputo, and A. Graf, “Recognition with local features: the kernel recipe,” in ICCV ’03: Proceedings of the Ninth IEEE International Conference on Computer Vision, (Washington, DC, USA), p. 257, IEEE Computer Society, 2003.

[32] K. Grauman and T. Darrell, “The pyramid match kernel: Discriminative classification with sets of image features,” in ICCV ’05: Proceedings of the Tenth IEEE International Conference on Computer Vision, (Washington, DC, USA), pp. 1458–1465, IEEE Computer Society, 2005.

[33] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in CVPR ’06: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, (Washington, DC, USA), pp. 2169–2178, IEEE Computer Society, 2006.

[34] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Lost in quantization: Improving particular object retrieval in large scale image databases,” in CVPR, 2008.

[35] H. Jegou, M. Douze, and C. Schmid, “Hamming embedding and weak geometric consistency for large scale image search,” in ECCV ’08: Proceedings of the 10th European Conference on Computer Vision, (Berlin, Heidelberg), pp. 304–317, Springer-Verlag, 2008.

[36] C. Cotton and D. Ellis, “Finding similar acoustic events using matching pursuit and locality-sensitive hashing,” 2009.

[37] G. Salton, E.
A. Fox, and H. Wu, “Extended boolean information retrieval,” Commun. ACM, vol. 26, no. 11, pp. 1022–1036, 1983.

[38] D. Z. Zanger, “Interpolation of the extended boolean retrieval model,” Inf. Process. Manage., vol. 38, no. 6, pp. 743–748, 2002.

[39] G. Zipf, “The psycho-biology of language,” Boston, MA: Houghton Mifflin, 1935.

[40] J. Zobel, A. Moffat, and K. Ramamohanarao, “Inverted files versus signature files for text indexing,” ACM Trans. Database Syst., vol. 23, no. 4, pp. 453–490, 1998.

[41] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, July 2008.

[42] S. E. Robertson, “The probability ranking principle in IR,” pp. 281–286, 1997.

[43] S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford, “Okapi at TREC-3,” in TREC, pp. 109–126, 1994.

[44] S. E. Robertson, S. Walker, M. Hancock-Beaulieu, M. Gatford, and A. Payne, “Okapi at TREC-4,” in TREC, 1995.

[45] S. E. Robertson, S. Walker, and M. Hancock-Beaulieu, “Okapi at TREC-7: Automatic ad hoc, filtering, VLC and interactive,” in TREC, pp. 199–210, 1998.

[46] W. B. Croft, “Incorporating different search models into one document retrieval system,” SIGIR Forum, vol. 16, no. 1, pp. 40–45, 1981.

[47] C. Zhai, Statistical Language Models for Information Retrieval. Hanover, MA, USA: Now Publishers Inc., 2008.

[48] E. M. Voorhees and D. Harman, “Overview of the eighth text retrieval conference (TREC-8),” in Proceedings of the Seventh Text REtrieval Conference (TREC-7), pp. 1–24, 2000.

[49] E. M. Voorhees and D. Harman, “Overview of the sixth text retrieval conference (TREC-6),” in The Fifth Text REtrieval Conference (TREC-5). NIST Special Publication 500-238, National Institute of Standards and Technology, pp. 3–28, 1997.

[50] E. M. Voorhees and D. Harman, “Overview of the seventh text retrieval conference (TREC-7),” in Proceedings of the Seventh Text REtrieval Conference (TREC-7), pp. 1–24, 1998.

[51] J. Lafferty and C.
Zhai, “Document language models, query models, and risk minimization for information retrieval,” in SIGIR ’01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, (New York, NY, USA), pp. 111–119, ACM, 2001.

[52] S. M. Omohundro, “Five balltree construction algorithms,” tech. rep., International Computer Science Institute, December 1989.

[53] C. Silpa-Anan and R. Hartley, “Optimised kd-trees for fast image descriptor matching,” in CVPR, pp. 1–8, 2008.

[54] J. M. Kleinberg, “Two algorithms for nearest-neighbor search in high dimensions,” in STOC ’97: Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, (New York, NY, USA), pp. 599–608, ACM, 1997.

[55] C. Böhm, S. Berchtold, and D. A. Keim, “Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases,” ACM Comput. Surv., vol. 33, no. 3, pp. 322–373, 2001.

[56] T. Liu, A. W. Moore, and A. Gray, “New algorithms for efficient high-dimensional nonparametric classification,” J. Mach. Learn. Res., vol. 7, pp. 1135–1158, 2006.

[57] J. H. Friedman, J. L. Bentley, and R. A. Finkel, “An algorithm for finding best matches in logarithmic expected time,” ACM Trans. Math. Softw., vol. 3, no. 3, pp. 209–226, 1977.

[58] M. I. Shamos and D. Hoey, “Closest-point problems,” in SFCS ’75: Proceedings of the 16th Annual Symposium on Foundations of Computer Science, (Washington, DC, USA), pp. 151–162, IEEE Computer Society, 1975.

[59] R. J. Lipton and R. E. Tarjan, “Applications of a planar separator theorem,” in SFCS ’77: Proceedings of the 18th Annual Symposium on Foundations of Computer Science, (Washington, DC, USA), pp. 162–170, IEEE Computer Society, 1977.

[60] S. Omohundro, “Bumptrees for efficient function, constraint, and classification learning,” in NIPS-3, (San Francisco, CA, USA), pp. 693–699, Morgan Kaufmann Publishers Inc., 1990.

[61] R. F.
Sproull, “Refinements to nearest-neighbor searching in k-dimensional trees,” Algorithmica, vol. 6, pp. 579–589, 1991.

[62] S. Meiser, “Point location in arrangements of hyperplanes,” Inf. Comput., vol. 106, no. 2, pp. 286–303, 1993.

[63] K. L. Clarkson, “A randomized algorithm for closest-point queries,” SIAM J. Comput., vol. 17, no. 4, pp. 830–847, 1988.

[64] J. S. Beis and D. G. Lowe, “Shape indexing using approximate nearest-neighbour search in high-dimensional spaces,” Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, vol. 0, p. 1000, 1997.

[65] M. Datar, N. Immorlica, and P. Indyk, “Locality-sensitive hashing scheme based on p-stable distributions,” in SCG ’04: Proceedings of the twentieth annual symposium on Computational geometry, (New York, NY, USA), pp. 253–262, ACM, 2004.

[66] A. Andoni and P. Indyk, “Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions,” in FOCS, 2006.

[67] M. Muja and D. G. Lowe, “Fast approximate nearest neighbors with automatic algorithm configuration,” in VISSAPP (1), pp. 331–340, 2009.

[68] S. Arya, D. M. Mount, N. S. Netanyahu, and A. Y. Wu, “An optimal algorithm for approximate nearest neighbor searching in fixed dimensions,” 1994.

[69] M. Muja and D. G. Lowe, “Fast approximate nearest neighbors with automatic algorithm configuration,” in VISSAPP (1), pp. 331–340, 2009.

[70] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in CVPR, pp. 1–8, 2007.

[71] A. Andoni and P. Indyk, “Efficient algorithms for substring near neighbor problem,” in SODA ’06: Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, (New York, NY, USA), pp. 1203–1212, ACM, 2006.

[72] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig, “Syntactic clustering of the web,” in Selected papers from the sixth international conference on World Wide Web, (Essex, UK), pp.
1157–1166, Elsevier Science Publishers Ltd., 1997.

[73] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher, “Min-wise independent permutations (extended abstract),” in STOC ’98: Proceedings of the thirtieth annual ACM symposium on Theory of computing, (New York, NY, USA), pp. 327–336, ACM, 1998.

[74] A. Broder, “On the resemblance and containment of documents,” in SEQUENCES ’97: Proceedings of the Compression and Complexity of Sequences 1997, (Washington, DC, USA), p. 21, IEEE Computer Society, 1997.

[75] Y. Ke, R. Sukthankar, and L. Huston, “Efficient near-duplicate detection and sub-image retrieval,” in Proceedings of ACM Multimedia, pp. 869–876, 2004.

[76] T. Pavlidis, “Limitations of CBIR,” in ICPR, 2008.

[77] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to object matching in videos,” in ICCV, pp. 1470–1477, 2003.

[78] V. Lepetit, P. Lagger, and P. Fua, “Randomized trees for real-time keypoint recognition,” in CVPR, pp. 775–781, 2005.

[79] G. Shakhnarovich, T. Darrell, and P. Indyk, Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. MIT Press, 2006.

[80] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray, “Visual categorization with bags of keypoints,” in Workshop on Statistical Learning in Computer Vision, ECCV, pp. 1–22, 2004.

[81] S. Lyu, “Mercer kernels for object recognition with local features,” in CVPR, pp. 223–229, 2005.

[82] S. Boughorbel, J.-P. Tarel, and F. Fleuret, “Non-Mercer kernels for SVM object recognition,” in British Machine Vision Conference, pp. 137–146, 2004.

[83] K. Grauman and T. Darrell, “The pyramid match kernel: Efficient learning with sets of features,” Journal of Machine Learning Research, vol. 8, pp. 725–760, 2007.

[84] L. Wolf and A. Shashua, “Learning over sets using kernel principal angles,” Journal of Machine Learning Research, vol. 4, pp. 913–931, 2003.

[85] R. Kondor and T.
Jebara, “A kernel between sets of vectors,” in Proceedings of the International Conference on Machine Learning, pp. 361–368, 2003.

[86] P. Moreno, P. Ho, and N. Vasconcelos, “A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications,” in Neural Information Processing Systems, 2003.

[87] O. Chum, J. Philbin, and A. Zisserman, “Near duplicate image detection: min-hash and tf-idf weighting,” in Proceedings of the British Machine Vision Conference, 2008.

[88] K. Grauman and T. Darrell, “Approximate correspondences in high dimensions,” in Advances in Neural Information Processing Systems 19, pp. 505–512, MIT Press, 2007.

[89] C. Silpa-Anan and R. Hartley, “Localization using an imagemap,” in Proceedings of the 2004 Australasian Conference on Robotics & Automation, 2004.

[90] G. Csurka, C. Bray, C. Dance, and L. Fan, “Visual categorization with bags of keypoints,” in Workshop on Statistical Learning in Computer Vision, ECCV, 2004.

[91] L. Fei-Fei and P. Perona, “A Bayesian hierarchical model for learning natural scene categories,” in CVPR, pp. 524–531, 2005.

[92] J. Winn, A. Criminisi, and T. Minka, “Object categorization by learned universal visual dictionary,” in ICCV, pp. 1800–1807, 2005.

[93] F. Perronnin, C. Dance, G. Csurka, and M. Bressan, “Adapted vocabularies for generic visual categorization,” in European Conference on Computer Vision, 2006.

[94] P. Tirilly, V. Claveau, and P. Gros, “Language modeling for bag-of-visual words image categorization,” in Proceedings of International Conference on Content-based Image and Video Retrieval (CIVR), 2008.

[95] L. Wu, M. Li, Z. Li, W. Ma, and N. Yu, “Visual language modeling for image classification,” in Proceedings of International Conference on Content-based Image and Video Retrieval (CIVR), 2007.

[96] W. Dong, Z. Wang, M. Charikar, and K. Li, “Efficiently matching sets of features with random histograms,” in Proceedings of ACM Multimedia, pp. 179–188, 2008.

[97] F. Li, W. Tong, R.
Jin, A. K. Jain, and J.-E. Lee, “An efficient key point quantization algorithm for large scale image retrieval,” in LS-MMRM ’09: Proceedings of the First ACM workshop on Large-scale multimedia retrieval and mining, (New York, NY, USA), pp. 89–96, ACM, 2009.

[98] O. Chum, M. Perdoch, and J. Matas, “Geometric min-hashing: Finding a (thick) needle in a haystack,” in CVPR, 2009.

[99] K. Grauman and T. Darrell, “Approximate correspondences in high dimensions,” in NIPS, 2006.

[100] P. F. Felzenszwalb and D. P. Huttenlocher, “Pictorial structures for object recognition,” Int. J. Comput. Vision, vol. 61, no. 1, pp. 55–79, 2005.

[101] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in CVPR, pp. 1–8, 2007.

[102] A. Gionis, P. Indyk, and R. Motwani, “Similarity search in high dimensions via hashing,” in Proceedings of the 25th International Conference on Very Large Data Bases, 1999.

[103] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proceedings of the twentieth annual symposium on Computational geometry, 2004.

[104] T. Liu, A. Moore, A. Gray, and K. Yang, “An investigation of practical approximate nearest neighbor algorithms,” in Neural Information Processing Systems, pp. 825–832, 2004.

[105] C. Silpa-Anan and R. Hartley, “Optimised kd-trees for fast image descriptor matching,” in CVPR, pp. 1–8, 2008.

[106] D. Nister and H. Stewenius, “Scalable recognition with a vocabulary tree,” in CVPR, pp. 2161–2168, 2006.

[107] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman, “Total recall: Automatic query expansion with a generative feature model for object retrieval,” CVPR, pp. 1–8, 2007.

[108] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in CVPR, 2007.

[109] X. Zhang, Z. Li, L. Zhang, W.
Ma, and H.-Y. Shum, “Efficient indexing for large scale visual search,” IEEE Computer Society, 2009.

[110] Z. Wu, Q. Ke, J. Sun, and H.-Y. Shum, “A multi-sample, multi-tree approach to bag-of-words image representation for image retrieval,” IEEE Computer Society, 2009.

[111] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in NIPS, 2008.

[112] J. C. Gemert, J.-M. Geusebroek, C. J. Veenman, and A. W. Smeulders, “Kernel codebooks for scene categorization,” in ECCV ’08: Proceedings of the 10th European Conference on Computer Vision, (Berlin, Heidelberg), pp. 696–709, Springer-Verlag, 2008.

[113] K. Grauman and T. Darrell, “The pyramid match kernel: Efficient learning with sets of features,” J. Mach. Learn. Res., vol. 8, pp. 725–760, 2007.

[114] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. Wiley-Interscience Publication, 2000.

[115] “Affine covariant features,” June 2007. http://www.robots.ox.ac.uk/~vgg/research/affine/detectors.html.

[116] S. E. Robertson, S. Walker, and M. Hancock-Beaulieu, “Okapi at TREC-7,” in Proceedings of the Seventh Text REtrieval Conference, 1998.

[117] X.-J. Wang, L. Zhang, X. Li, and W.-Y. Ma, “Annotating images by mining image search results,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, pp. 1919–1932, 2008.

[118] J. Philbin and A. Zisserman, “The Oxford buildings dataset.” World Wide Web electronic publication, Nov 2007.

[119] ANSI/NIST-ITL 1-2007 standard: Data Format for the Interchange of Fingerprint, Facial, & Other Biometric Information. 2007.

[120] J.-E. Lee, A. K. Jain, and R. Jin, “Scars, marks and tattoos (SMT): Soft biometric for suspect and victim identification,” in Biometric Consortium Conference and Technology Expo, 2008.

[121] H. Moon and P. J. Phillips, “Computational and performance aspects of PCA-based face recognition algorithms,” Perception, vol. 30, pp. 303–321, 2001.

[122] M. Muja and D. G.
Lowe, “Fast matching of binary features,” in Computer and Robot Vision (CRV), pp. 404–410, 2012.

[123] E. J. Candes, J. Romberg, and T. Tao, “Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information,” IEEE Transactions on Information Theory, vol. 52, pp. 489–509, Feb. 2006.

[124] L. Paulevé, H. Jégou, and L. Amsaleg, “Locality sensitive hashing: A comparison of hash function types and querying mechanisms,” Pattern Recognition Letters, vol. 31, no. 11, pp. 1348–1358, 2010.

[125] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in NIPS, pp. 1753–1760, 2008.

[126] D. Zhang, J. Wang, D. Cai, and J. Lu, “Laplacian co-hashing of terms and documents,” in Advances in Information Retrieval (C. Gurrin, Y. He, G. Kazai, U. Kruschwitz, S. Little, T. Roelleke, S. Rüger, and K. Rijsbergen, eds.), vol. 5993 of Lecture Notes in Computer Science, pp. 577–580, Springer Berlin Heidelberg, 2010.

[127] X. Yu, S. Zhang, B. Liu, L. Zhong, and D. N. Metaxas, “Large scale medical image search via unsupervised PCA hashing,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2013.

[128] D. Zhang, J. Wang, D. Cai, and J. Lu, “Self-taught hashing for fast similarity search,” in Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’10, (New York, NY, USA), pp. 18–25, ACM, 2010.

[129] A. Torralba, R. Fergus, and W. Freeman, “80 million tiny images: A large data set for nonparametric object and scene recognition,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 30, pp. 1958–1970, Nov 2008.

[130] A. Guttman, “R-trees: A dynamic index structure for spatial searching,” in Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, (New York, NY, USA), pp. 47–57, ACM, 1984.

[131] W. Johnson and J.
Lindenstrauss, “Extensions of Lipschitz mappings into a Hilbert space,” in Conference in modern analysis and probability (New Haven, Conn., 1982), vol. 26 of Contemporary Mathematics, pp. 189–206, American Mathematical Society, 1984.

[132] H. Jégou, R. Tavenard, M. Douze, and L. Amsaleg, “Searching in one billion vectors: re-rank with source coding,” CoRR, vol. abs/1102.3828, 2011.

[133] S. N. Bernstein, Theory of Probability. 1927.

[134] A. Vedaldi and B. Fulkerson, “VLFeat: An open and portable library of computer vision algorithms.” http://www.vlfeat.org/, 2008.

[135] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” Journal of the American Statistical Association, vol. 58, no. 301, pp. 13–30, 1963.