 Title
 Near duplicate image search
 Creator
 Li, Fengjie
 Date
 2014
 Collection
 Electronic Theses & Dissertations
 Description

Information retrieval addresses the fundamental problem of how to identify the objects in a database that satisfy the information needs of users. In the face of information overload, the major challenge in search algorithm design is to ensure that useful information can be found both accurately and efficiently in large databases.

To address this challenge, different indexing and retrieval methods have been proposed for different types of data, namely sparse data (e.g. documents), dense data (e.g. dense feature vectors) and bag-of-features data (e.g. images represented by local features). For sparse data, inverted indexes and document retrieval models have proved very effective for large-scale retrieval problems. For dense data and bag-of-features data, however, there are still open problems. For example, Locality Sensitive Hashing, a state-of-the-art method for searching high-dimensional vectors, often fails to make a good tradeoff between precision and recall: it tends to achieve high precision with low recall, or vice versa. The bag-of-words model, a popular approach for searching objects represented as bags of features, has limited performance because of the information lost during the quantization procedure.

Since the general problem of searching objects represented as dense vectors and bags of features may be too challenging, in this dissertation we focus on near duplicate search, in which the matched objects are almost identical to the query. By effectively exploring the statistical properties of near duplicates, we are able to design more effective indexing schemes and search algorithms. Thus, the focus of this dissertation is to design new indexing methods and retrieval algorithms for near duplicate search in large-scale databases that accurately capture the data similarity and deliver more accurate and efficient search.
Below, we summarize the main contributions of this dissertation:

Our first contribution is a new algorithm for searching near duplicate bag-of-features data. The proposed algorithm, named random seeding quantization, is more efficient in generating bag-of-words representations for near duplicate images. The new scheme is motivated by approximating the optimal partial matching between bags of features, and thus produces a bag-of-words representation that captures the true similarities of the data, leading to more accurate and efficient retrieval of bag-of-features data.

Our second contribution, termed Random Projection Filtering (RPF), is a search algorithm designed for efficient near duplicate vector search. By explicitly exploiting the statistical properties of near duplicity, the algorithm projects high-dimensional vectors into a lower-dimensional space and filters out irrelevant items. This filtering procedure makes RPF more accurate and efficient at identifying near duplicate objects in databases.

Our third contribution is to develop and evaluate a new randomized range search algorithm for near duplicate vectors in high-dimensional spaces, termed Random Projection Search. Unlike RPF, the algorithm presented in this chapter is suitable for a wider range of applications because it does not require the sparsity constraints for high search accuracy. The key idea is to project both the data points and the query point into a one-dimensional space by a random projection, and then to perform a one-dimensional range search, using binary search, to find the subset of data points that are within the range of a given query. We prove a theoretical guarantee for the proposed algorithm and evaluate its empirical performance on a dataset of 1.1 billion image features.
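
The key idea of the third contribution (project onto one random direction, then binary-search a sorted list) can be illustrated with a minimal Python sketch. It is not the dissertation's algorithm: the class name `ProjectionIndex`, the single unit-length direction, and the final exact-distance check are assumptions made for illustration. The sketch relies only on the fact that, for a unit vector u, |u·x − u·q| ≤ ‖x − q‖, so every point within `radius` of the query has a projection within `radius` of the query's projection.

```python
import bisect
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ProjectionIndex:
    """Hypothetical index: data sorted by projection onto one random direction."""

    def __init__(self, points, seed=0):
        rng = random.Random(seed)
        self.points = points
        self.direction = random_unit_vector(len(points[0]), rng)
        # Sort (projection, point id) pairs once so queries can binary-search.
        pairs = sorted((dot(self.direction, p), i) for i, p in enumerate(points))
        self.keys = [k for k, _ in pairs]
        self.ids = [i for _, i in pairs]

    def range_search(self, query, radius):
        """Return sorted ids of points within Euclidean `radius` of `query`."""
        q = dot(self.direction, query)
        # One-dimensional range search: two binary searches bound the candidates.
        lo = bisect.bisect_left(self.keys, q - radius)
        hi = bisect.bisect_right(self.keys, q + radius)
        # Assumed verification step: keep candidates that pass the true distance test.
        return sorted(
            i for i in self.ids[lo:hi]
            if math.dist(self.points[i], query) <= radius
        )
```

For example, `ProjectionIndex(data).range_search(query, radius=0.05)` returns the ids of all points in `data` within 0.05 of `query`. With a single projection the candidate set can still be large, so in practice the saving depends on how tight the radius is relative to the spread of the projected data.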