NEAR DUPLICATE IMAGE SEARCH

By

Fengjie Li

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy

2014

ABSTRACT

NEAR DUPLICATE IMAGE SEARCH

By Fengjie Li

Information retrieval addresses the fundamental problem of how to identify the objects in a database that satisfy the information needs of users. In the face of information overload, the major challenge in search algorithm design is to ensure that useful information can be found both accurately and efficiently in large databases. To address this challenge, different indexing and retrieval methods have been proposed for different types of data, namely sparse data (e.g., documents), dense data (e.g., dense feature vectors) and bags-of-features (e.g., images represented by local features). For sparse data, inverted indexes and document retrieval models have proved very effective for large scale retrieval problems. For dense data and bag-of-features data, however, there are still open problems. For example, Locality Sensitive Hashing, a state-of-the-art method for searching high dimensional vectors, often fails to make a good tradeoff between precision and recall: it tends to achieve high precision with low recall, or vice versa. The bag-of-words model, a popular approach for searching objects represented by bags-of-features, has limited performance because of the information lost during the quantization procedure. Since the general problem of searching objects represented as dense vectors or bags-of-features may be too challenging, this dissertation focuses on near duplicate search, in which the matched objects are almost identical to the query. By effectively exploiting the statistical properties of near duplicates, we can design more effective indexing schemes and search algorithms.
Thus, the focus of this dissertation is to design new indexing methods and retrieval algorithms for near duplicate search in large scale databases that accurately capture the data similarity and deliver more accurate and efficient search. Below, we summarize the main contributions of this dissertation. Our first contribution is a new algorithm for searching near duplicate bag-of-features data. The proposed algorithm, named random seeding quantization, is more efficient at generating bag-of-words representations for near duplicate images. The new scheme is motivated by approximating the optimal partial matching between bags-of-features, and thus produces a bag-of-words representation that captures the true similarities of the data, leading to more accurate and efficient retrieval of bag-of-features data. Our second contribution, termed Random Projection Filtering (RPF), is a search algorithm designed for efficient near duplicate vector search. By explicitly exploiting the statistical properties of near duplicates, the algorithm projects high dimensional vectors into a lower dimensional space and filters out irrelevant items. This effective filtering procedure makes RPF more accurate and efficient at identifying nearly duplicate objects in databases. Our third contribution is to develop and evaluate a new randomized range search algorithm for near duplicate vectors in high dimensional spaces, termed Random Projection Search (RPS). Different from RPF, this algorithm is suitable for a wider range of applications because it does not require sparsity constraints for high search accuracy. The key idea is to project both the data points and the query point into a one dimensional space by a random projection, and to perform a one dimensional range search, using binary search, to find the subset of data points that are within the range of a given query.
We prove the theoretical guarantee for the proposed algorithm and evaluate its empirical performance on a dataset of 1.1 billion image features.

To my family.

ACKNOWLEDGEMENTS

I consider completing my Ph.D. the first real challenge I have ever had in my life. I have evolved as much as my dissertation has been revised since the first day of my Ph.D. program. None of this would have been possible without the help of many people. First and foremost, I would like to express my great gratitude to my thesis advisor, Prof. Rong Jin. It was his course on "Information Retrieval" that led me to this research area. Over the years, he taught me tremendous knowledge about search and machine learning. He taught me how to find the right research problem and how to solve the problem right. He taught me the importance of mathematical rigor in research. He also gave me the most important, lifelong tools for achieving anything significant: perseverance and consistent hard work. Without Dr. Rong Jin, neither my dissertation nor my personal growth would have been possible. I would also like to express my appreciation to Dr. Anil Jain. The "image" part of my dissertation definitely goes to him. He taught me everything about pattern recognition and image processing. He also showed me the importance of keeping a very high standard while getting things done promptly, which will benefit me for the rest of my career. I am grateful to Dr. Eric Torng, who has been the most supportive person since I joined the program. His support in my hardest times is irreplaceable. I would also like to thank Dr. Aviyente for her insightful suggestions and discussions during my research. I would like to thank my colleagues: Wei Tong, Yang Zhou, Tianbao Yang, Mehrdad Mahdavi, Jinfeng Yi, Zheyun Feng, Lei Wu and Jainpeng Xu. I'd like to thank the people I have worked with at the PRIP lab, especially Abhishek Nagar, Pavan Mallapragada and Serhat Selcuk Bucak.
I appreciate all the help I got from Linda Moore, Norma Teague and Adam Pitcher. Personally, I'd like to thank all my teammates on the Michigan State University Chinese Soccer team. We had so many memorable moments together that I will never forget. In particular, I would like to thank Zheng Wang, for being supportive and thoughtful as my friend both on and off the field. Among all my friends at Michigan State University, I want to thank Yi Huang and his family in particular, for all the happy times we have had together. I am grateful to all my best friends in China, including Peng Xu, Zhe Wang, Shaoan Hou, Kai Lv, Xiaoming Ji, Xin Zhao, Yi Li and Lipei Duan. Their emotional support has never faded with the physical distance between us. My final and most grateful thanks go to my family. Deep in my heart, it is their love and support that motivates me. My mom and dad are the first teachers in my life, who taught me the most important principles in my life and continue to teach me every day. My only wish is to be a better son to them. I owe great appreciation to my parents-in-law, who spent so much time caring for us in Michigan. I want to express my most heartfelt appreciation and love to my wife Li, whose love and trust have been vital to me over the past ten years. I want to say "thank you and I love you" to my two dearest daughters, Siying and Sihan, who are the most precious gifts I have ever had in my life.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

CHAPTER 1  INTRODUCTION
  1.1 Large scale high dimensional data retrieval
    1.1.1 Retrieval Methods for Sparse Data
    1.1.2 Retrieval Methods for Dense Data
    1.1.3 Retrieval Methods for Bags-of-Features
  1.2 Thesis contributions
CHAPTER 2  LITERATURE REVIEW
  2.1 Retrieval Methods for Sparse Data
    2.1.1 Sparseness and Inverted Indexes
    2.1.2 Retrieval Models
    2.1.3 Comparison: Retrieval Models for Sparse Data
    2.1.4 Summary
  2.2 Retrieval Methods for Dense Data
    2.2.1 KD-Tree, Randomized KD-tree and Randomized Forests
    2.2.2 Locality Sensitive Hashing
    2.2.3 Comparison: Search Methods for Dense Data
    2.2.4 Summary
  2.3 Retrieval Methods for Bag of Features
    2.3.1 Pyramid Matching and Pyramid Match Hashing
    2.3.2 The Bag of Words Model
    2.3.3 Min-Hashing and Geometric Min-Hashing
    2.3.4 Comparison: Retrieval Methods for Bag-Of-Features
    2.3.5 Summary

CHAPTER 3  RANDOM SEEDING FOR KEYPOINT QUANTIZATION
  3.1 Introduction
  3.2 Related Work
  3.3 Random Seeding for Keypoint Quantization
    3.3.1 Distance Measure Based on Optimal Partial Matching
    3.3.2 Approximating the Optimal Partial Matching by A Smooth Similarity Function
    3.3.3 Random Seeding Algorithm
  3.4 Experiments
    3.4.1 Near Duplicate Image Retrieval
    3.4.2 Tattoo Image Retrieval
    3.4.3 Automatic Image Annotation
  3.5 Conclusion

CHAPTER 4  RANDOM PROJECTION FILTERING
  4.1 Introduction
  4.2 Problem Definition
  4.3 Random Projection Filtering Algorithm
  4.4 Experiments
    4.4.1 Experiments on Synthetic Datasets
    4.4.2 Experiments on image feature dataset
    4.4.3 Experiment on 80 million image dataset
  4.5 Conclusion

CHAPTER 5  EFFICIENT RANGE SEARCH IN HIGH DIMENSIONS
  5.1 Introduction
  5.2 Problem Definition
  5.3 Algorithm and Theory
  5.4 Experiments
    5.4.1 Experimental Setup
    5.4.2 Results and discussion
  5.5 Conclusion

CHAPTER 6  SUMMARY AND CONCLUSIONS
  6.1 Contributions
  6.2 Future Work
  6.3 Conclusions

APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 1.1 Real World Applications of Information Retrieval
Table 2.1 Complexities of Nearest Neighbor Search Algorithms
Table 2.2 Complexity of approximate nearest neighbor search algorithms. For the LSH complexity under L2 distance, we have f(1 + ε) < 1/(1 + ε).
Table 3.1 Summary of experiments
Table 3.2 Time for generating bag-of-words models for the Oxford5K dataset.
Table 3.3 Retrieval accuracy (mAP) and query time (seconds) for near duplicate image retrieval on the Oxford5K dataset. "Cluster Seeding" in the last row refers to range search quantization using cluster centers.
Table 3.4 Retrieval accuracy (mAP), query time, and quantization time (i.e., time for mapping keypoints to visual words) for the clustering-based method and the random seeding method.
Table 4.1 Effectiveness of the filtering procedure
Table 4.2 RPF Accuracy and Recall on 10 Million SIFT features
Table 4.3 Search time (seconds) comparison on 10 Million SIFT features
Table 4.4 Training/indexing time (seconds) comparison on 10 Million SIFT features
Table 4.5 Performance and scalability comparison, searching 100 million features
Table 4.6 Performance comparison, searching 80 million images
Table 5.1 Performance comparison, searching 10 Million Features
Table 5.2 Performance comparison, RPF and RPS algorithm on 10 million features
Table 5.3 Performance comparison, RPF and RPS algorithm on 80 million images
Table 5.4 Performance and scalability comparison, searching 1.1 Billion Features

LIST OF FIGURES

Figure 2.1 An example of KD-tree and resulting space partition.
Figure 2.2 An example of priority KD-tree search. Subtrees v1 through v4 are initially in the queue. v1, the closest node, is selected first. When descending the tree to leaf node w, we enqueue the siblings u1, u2, and u3 along the way [1].
Figure 2.3 Illustration of the difference between general hashing and locality sensitive hashing.
Figure 2.4 Illustration of (a) pyramid matching and (b) pyramid match hashing. The pyramid matching approach takes two sets of feature vectors and maps them to multi-resolution histograms, and intersects them to efficiently approximate the optimal partial matching between the original feature sets. The pyramid match hashing approach utilizes an embedding of the pyramid histogram and randomly hashes the new representation to allow sub-linear time indexing.
Finally, the pyramid match is applied only to a small portion of the database examples [2].
Figure 2.5 Illustration of the bag-of-words representation of images
Figure 2.6 The geometric ordering constraints for bundled features [3]. (a) matches preserving the relative ordering of the SIFT features inside a bundle; (b) matches not respecting geometric order are penalized.
Figure 3.1 The distribution of Euclidean distances between SIFT keypoints and their nearest visual words, computed based on the Oxford5K dataset (14,972,956 keypoints) with 10K clusters.
Figure 3.2 An example showing the relationship between the optimal partial matching similarity and the upper bound in Lemma 1. It shows that when two sets of keypoints (corresponding to two images) are similar, the upper bound in Lemma 1 provides a good approximation to the optimal partial matching.
Figure 3.3 The effect of radius r on the retrieval accuracy of the random seeding method for keypoint quantization. R is the average pairwise keypoint distance.
Figure 3.4 The quantization time of the proposed random seeding algorithm and the clustering based approach. The vocabulary is fixed to be 1M visual words generated from the Oxford5K dataset. Background images from the Flickr dataset are gradually added to the gallery. The vocabulary construction time is not shown here.
Figure 3.5 The CMC curves of the tattoo retrieval system using the proposed random seeding algorithm and the clustering algorithm for keypoint quantization. The number of visual words is set to 1 million.
Figure 3.6 Example query images and the top 5 images retrieved by both the random seeding algorithm (first row) and the clustering based approach (second row).
Figure 3.7 The results of a kNN based image annotation system using the proposed random seeding algorithm for keypoint quantization as well as the clustering based method. k is set to 50 based on cross validation, and the number of visual words is set to 1 million.
Figure 4.1 Performance of Random Projection Filtering (RPF) with different K and d
Figure 4.2 Offline tree building time and query time
Figure 4.3 Performance of Random Projection Filtering (RPF) with different δ and ε
Figure 4.4 Accuracy comparison on 10 Million SIFT Features
Figure 4.5 Examples of retrieved images for different algorithms. From top to bottom: RPF, LSH, SH, PCAH, LCH and STH.
Figure 5.1 Performance of Gaussian Projection Search with different numbers of projections
Figure 5.2 Performance of Rademacher Projection Search with different numbers of projections
Figure 5.3 Performance comparison for different coding lengths, searching 1 Million features
Figure 5.4 Examples of search results: RPS (first row) and RPF (second row) algorithm

LIST OF ALGORITHMS

Algorithm 1 KD-tree Construction Algorithm
Algorithm 2 KD-tree Search Algorithm
Algorithm 3 Procedure of LSH Indexing and Searching
Algorithm 4 Procedure of Geometric Min-Hashing
Algorithm 5 Random Seeding Approach for Keypoint Quantization
Algorithm 6 Efficient Duplicate Detection for High Dimensional Vectors
Algorithm 7 An Efficient Range Search Using Gaussian Random Variables
Algorithm 8 Efficient Range Search Using Rademacher Random Variables

CHAPTER 1
INTRODUCTION

We are facing the problem of information overload: vast amounts of data are being produced by different sources such as the Internet, research laboratories and personal digital devices. For example, the number of websites reached 238 million by June 2009 [4], a figure that has doubled since 2006. The number of web pages indexed by Google exceeded 1 trillion by 2008 [5]. Flickr, an image and video hosting website, hosts over 3 billion images [6]. At the same time, huge amounts of data have been generated by research communities. For example, databases like Omnibus and ArrayExpress provide billions of gene expression measurements [7]. In addition to its massive volume, data is also being generated faster than ever [8]. For instance, large sky surveys like the Large Synoptic Survey Telescope generate tens of terabytes of data every 24 hours [9]. It is expected that the overall volume of data will increase by 30% per year. Such a high growth rate is due to the fact that new technologies are becoming more accessible to people [10]. Besides data size, another challenge of data processing is high dimensionality, i.e., the content of each data point often has to be represented by a large number of attributes. For instance, a document is usually represented by a word histogram over hundreds of thousands of words; the dimensionality of microarray data can easily go beyond a thousand [11]; and an image is often represented by several hundred visual features, including shape, color, texture and other useful information.
Finally, to capture the relationship between data points, it is important to find proper similarity measurements, and it has been shown that non-linear similarity measurements, or even multiple similarity measurements, are desirable for real-world applications [12–14]. Different similarity measurements may introduce additional complexity into the search problem. For example, if non-linear kernel functions are used, efficient indexing methods such as the inverted list are no longer suitable, because the underlying linear similarity assumption behind inverted-index matching no longer holds.

1.1 Large scale high dimensional data retrieval

Information retrieval addresses the fundamental problem of how to efficiently and accurately identify the objects in a large database that match the information needs of individual users. The target objects can be documents, audio, pictures, or video clips. There are four major issues in the design of an information retrieval system:

1. How to effectively represent the diverse information needs of users?
2. How to represent objects in a database to support accurate and efficient retrieval?
3. Given the feature representation, how to measure the similarity between data points? Different similarity functions make a significant difference in the strategies for indexing and search.
4. Given the similarity measurement, how to efficiently and effectively identify the objects in a database that match the information needs of users?

In this thesis, we focus on the last three problems, with emphasis on effective representation schemes and near duplicate retrieval methods that can handle large scale high dimensional data. Table 1.1 shows some real world applications of large scale information retrieval.
Application                               Description                                          # Objects   # Features
Document Retrieval [15, 16]               Find relevant documents based on text queries        10^10       10^8
Text Based Image Retrieval [17]           Find relevant images based on text queries           10^9        10^4
Content Based Image Retrieval [18, 19]    Find relevant images based on image queries          10^9        10^3
Audio Retrieval [20]                      Find relevant audio tracks for sound based queries   10^6        10^2

Table 1.1 Real World Applications of Information Retrieval

The large scale and high dimensionality pose a significant challenge for efficient retrieval. This is because in many applications, finding the most relevant objects for a query is often reduced to the problem of finding the query's nearest neighbor (according to different similarity measurements, e.g., Euclidean distance, Hamming distance or kernel similarities). Unfortunately, it is well known that nearest neighbor search is essentially an NP-hard problem in high dimensional spaces [21]. Therefore, there is no known deterministic approach for efficient nearest neighbor search in high dimensional space. Furthermore, no matter how efficient a single match is, the very large candidate pool makes it impractical to perform any form of linear scan on large databases. For these reasons, most practical approaches more or less follow a two-stage scheme: the first stage is devoted to identifying a small subset of potential candidates that are likely to be similar to the given query, and in the second stage the exact similarity between the given query and the selected candidates is computed. Note that the first stage is often realized by well designed data structures that effectively exploit statistical properties of the data, such as sparsity. How exactly the two-stage system is implemented depends heavily on the properties of the specific data used in each application.
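The two-stage scheme just described can be sketched generically as follows. This is a minimal, illustrative sketch, not an algorithm from this thesis: the coarse sign-pattern key is a hypothetical stand-in for a real first-stage index (an inverted list, tree, or hash table), and all sizes and seeds are arbitrary.

```python
import numpy as np

def coarse_key(x, planes):
    """Stage-1 surrogate index: the sign pattern of a few random projections."""
    return tuple((planes @ x > 0).astype(int))

def two_stage_search(database, planes, query, k=3):
    # Stage 1: keep only candidates whose coarse key matches the query's.
    qkey = coarse_key(query, planes)
    cand = [i for i, x in enumerate(database) if coarse_key(x, planes) == qkey]
    if not cand:                              # degenerate case: fall back to full scan
        cand = list(range(len(database)))
    # Stage 2: exact distances on the (hopefully small) candidate set.
    dists = sorted((np.linalg.norm(database[i] - query), i) for i in cand)
    return [i for _, i in dists[:k]]

rng = np.random.default_rng(42)
db = rng.standard_normal((1000, 32))          # toy "database" of 1000 vectors
planes = rng.standard_normal((4, 32))         # 4 random hyperplanes for stage 1
assert two_stage_search(db, planes, db[17])[0] == 17   # a duplicate query finds itself
```

Stage 1 touches every item here only for simplicity; a real index precomputes the keys so that the candidate set is fetched in constant or logarithmic time.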
For example, in the case of document retrieval, where documents are represented by sparse vectors, an inverted index is created and the first stage is a search in the index. In the second stage, different retrieval models are applied to calculate the exact query-to-document similarity. If the data are dense and continuous, e.g., dense image features, we can build trees or use hashing to group data points with small distances together; the first stage is then a tree search or a hash table lookup, and in the second stage the exact point-to-point similarity is calculated. If the data are represented by bags of features, e.g., image retrieval based on bags of features, a common approach is to convert the bag-of-features representation into a vector representation, and then use an existing vector search method to identify a small subset of candidates. In the second stage, a more precise and expensive image matching algorithm can be applied to produce the final ranked results. In the following, we first briefly discuss the different types of data to be searched and the existing retrieval methods for these data. We then point out the limitations of existing methods and outline the contributions of this thesis.

1.1.1 Retrieval Methods for Sparse Data

In this case, we assume that the vectors used to represent objects are very sparse, namely that most of the elements in the vectors are zeros. An example of sparse vector representation is documents. Most text retrieval methods use the bag-of-words model for documents, in which each document is represented by a word histogram [22]. This representation is sparse because most documents focus on a specific topic and thus use only a small portion of the entire vocabulary. Given this sparseness, a special data structure, termed the inverted index [23], can be created and used to greatly improve retrieval efficiency.
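On a toy corpus, the inverted-index idea can be sketched as follows. The corpus, whitespace tokenization, and the unweighted candidate step are all illustrative simplifications; production systems add stemming, stop-word removal, and scored ranking.

```python
from collections import defaultdict

# Toy corpus: document id -> text.
docs = {
    0: "near duplicate image search",
    1: "document retrieval with an inverted index",
    2: "image retrieval with bag of words",
}

# Inverted index: word -> set of ids of the documents containing that word.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

def candidates(query):
    """Stage 1: the union of the inverted lists of the query words."""
    return set().union(*(index.get(w, set()) for w in query.split()))

print(candidates("image search"))   # -> {0, 2}: only documents sharing a query word
```

Only the documents appearing in some query word's inverted list are ever touched, which is what lets the first stage avoid a linear scan of the collection.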
The basic idea of the inverted index is to create mappings, often referred to as inverted lists, from each word to the set of documents that contain the word. Given the inverted index, in order to identify the candidate documents for a given query (usually consisting of only a small number of words), we only need to check the similarity of documents in the inverted lists of the query words, which allows us to avoid a linear scan of the entire collection. Given the identified document candidates, in the second stage, a retrieval model, such as the vector space model [24], the probabilistic model [25] or a language model [26], is applied to compute the relevance or similarity scores of the document candidates for the given query. Note that it is the sparsity of the vector representation that makes it possible to construct a compact inverted index, leading to efficient retrieval.

1.1.2 Retrieval Methods for Dense Data

We assume both queries and database items are represented by dense continuous vectors, in which almost all elements are non-zero and can take any continuous value. A good example of this case is image retrieval based on global features, where both query images and images in the database are represented by vectors of visual features, such as color, texture, and shape. The similarity between two images is often determined by the distance between the two corresponding vectors. As a result, the most similar images for a given query image are usually found by a nearest neighbor search. To facilitate the search, data structures like trees or hash tables are used to index the data. These indexing methods are useful since they efficiently group similar points together through either space partitioning or hashing. However, since there is no efficient algorithm for exact nearest neighbor search in high dimensions [21], most existing search algorithms for high dimensional dense vectors have to resort to approximate solutions.
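For reference, the exact nearest neighbor search that these approximate methods try to avoid is a linear scan. It is trivially correct but costs O(nd) per query, which is exactly what makes it impractical at scale (the data here is synthetic and the sizes are illustrative):

```python
import numpy as np

def exact_nn(database, query):
    """Exact nearest neighbor by linear scan: O(n * d) per query."""
    dists = np.linalg.norm(database - query, axis=1)   # distance to every item
    return int(np.argmin(dists))

rng = np.random.default_rng(0)
db = rng.standard_normal((10000, 128))              # 10k SIFT-like 128-d vectors
q = db[123] + 0.001 * rng.standard_normal(128)      # near duplicate of item 123
assert exact_nn(db, q) == 123
```

Every approximate index in the next paragraphs can be read as a way of shrinking the set over which this scan is performed.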
In general, there are two categories of approaches used for approximate nearest neighbor search in high dimensional spaces: tree based approaches [1, 27, 28] and hashing based approaches [29]. The key idea of tree based approaches is to partition the feature space such that if two vectors are close to each other, they are likely to be grouped into the same region (or, equivalently, to share the same partition coding). To improve the robustness of the index, some degree of randomization is often used when determining the branching point. In addition, multiple trees are constructed in order to improve both the retrieval accuracy and the recall. Given the constructed index tree, the candidate nearest neighbors to a given query are found by propagating the query point through the tree to a leaf node. In the final step, we compute the exact distance between the query point and the points in the leaf nodes identified in the first phase to find the nearest neighbors of the given query. The most popular tree based methods use K-means trees [27] and variants of the K-D tree [28]. Hashing based methods, on the other hand, are based on the idea that two data points are likely to share the same hash code if they are close to each other. One representative approach in this category is Locality Sensitive Hashing (LSH) [29]. The key concept behind this method is the locality sensitive hash function, under which two data points within a predefined distance r are more likely to be hashed to the same value than two points separated by a distance significantly larger than r (i.e., (1 + δ)r, where δ > 0 is a predefined factor). In order to ensure high precision and recall in identifying r-nearest neighbors, a large number of locality sensitive hash functions, combined into multiple groups, are used in practice. Similar to the previous case, these hash functions are used in the first stage to identify the candidates that are within distance r of a given query.
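A toy, single-table sketch of this first LSH stage, using the common random-projection hash family h(x) = ⌊(a·x + b)/w⌋; the dimension, bucket width, and number of hash functions below are illustrative choices, and a practical system would maintain many such tables to boost recall:

```python
import numpy as np

rng = np.random.default_rng(7)
d, w, n_hash = 32, 4.0, 8          # dimension, bucket width, hashes per table

# One table of hash functions of the form h(x) = floor((a . x + b) / w).
A = rng.standard_normal((n_hash, d))
B = rng.uniform(0.0, w, size=n_hash)

def lsh_key(x):
    """Concatenated hash values; nearby points tend to share the full key."""
    return tuple(np.floor((A @ x + B) / w).astype(int))

# Index the database by hash key.
db = rng.standard_normal((500, d))
table = {}
for i, x in enumerate(db):
    table.setdefault(lsh_key(x), []).append(i)

# Stage 1 for a query: a single bucket lookup. An exact duplicate of item 42
# is guaranteed to land in item 42's bucket; near duplicates usually do.
bucket = table.get(lsh_key(db[42]), [])
assert 42 in bucket
```

Concatenating several hash values sharpens precision (distant points rarely agree on all of them), while using several independent tables restores recall.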
In the second stage, the exact distance between the query point and the identified candidates is computed to find the nearest neighbors, and the candidate objects are presented to the user in ascending order of their distances to the query. Note that the key difference between the two families is that the tree based methods are designed to find the (approximate) nearest neighbors, while the hashing based methods are designed for range search, i.e., finding the data points that are in the r-neighborhood of a query point. We will discuss the advantages and drawbacks of the above methods in detail in Chapter 2.

1.1.3 Retrieval Methods for Bags-of-Features

In some applications, such as image retrieval, each object is represented by multiple feature vectors instead of a single vector. For example, local descriptors such as SIFT features [30] have been shown to be effective for near duplicate image retrieval and object recognition. In these approaches, each image is represented by a set of local descriptors, with each local descriptor being a vector of 128 attributes. This representation is often referred to as the bag-of-features representation [31–33]. The main challenge posed by this representation is that the correspondence between the local descriptors in two sets is unknown, leading to a high complexity in matching two images. Several algorithms, such as the pyramid matching kernel [32], have been proposed to efficiently compute the similarity between two bags-of-features. The main shortcoming of these approaches is that they require a linear scan of the entire database before the most similar objects are found, making them difficult to scale to large datasets. One way to speed up the retrieval process for objects with a bag-of-features representation is to convert the image matching problem into a text retrieval problem [18].
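The conversion rests on a quantization step: each local descriptor is mapped to its nearest entry in a "visual vocabulary" and the image is summarized as a histogram over that vocabulary. A minimal sketch (a random vocabulary stands in for the cluster centers that k-means would normally learn, and all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = rng.standard_normal((100, 128))   # 100 "visual words" (cluster centers)

def bag_of_words(descriptors, vocab):
    """Map each local descriptor to its nearest visual word; return the histogram."""
    # Squared distance between every descriptor and every visual word.
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)             # index of the nearest word per descriptor
    return np.bincount(words, minlength=len(vocab))

image = rng.standard_normal((50, 128))    # 50 SIFT-like descriptors of one image
hist = bag_of_words(image, vocab)
assert hist.sum() == 50                   # every descriptor maps to exactly one word
```

The resulting histogram is a sparse vector, so the inverted-index machinery of Section 1.1.1 applies to it directly.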
The key idea of this approach is to quantize each local descriptor to a “visual word” and represent each image by a histogram of the visual words. This bag-of-words representation allows us to directly exploit text retrieval techniques, particularly inverted indexing, to efficiently identify the candidate images that share similar visual content with the query image. In the second stage, more accurate similarities are computed between the query image and the candidate images identified in the first stage. It is worth noting that efficient quantization methods often require efficient nearest neighbor search, the topic explored in the retrieval methods for continuous and dense vectors. However, a problem with existing quantization approaches is that they fail to accurately capture the similarity between the original features [34, 35], and such coarse approximations often lead to less accurate retrieval results and a less efficient retrieval process. In this thesis, we propose new quantization methods to address this issue.

1.2 Thesis contributions

Among the three classes of retrieval problems discussed, sparse data retrieval has been well studied and successfully used in modern search engines (e.g. Google, Bing). Our focus is therefore to address the challenges in retrieving dense and bag-of-features data. As we have seen, most large scale search algorithms developed in the literature attempt to address the efficiency issue by approximation, but the existing approximation schemes either do not capture the true similarity well or are not effective at filtering out irrelevant items. For example, the widely used vector quantization scheme for bag-of-features data considers two data points similar when they are mapped into the same cluster, without taking into account the actual distance between the feature points [34, 35]. The hashing based methods often return a large number of false positives in order to ensure a high recall [21, 36].
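To make the quantization behavior just criticized concrete, the following minimal sketch maps each local descriptor to its nearest codeword and builds an L1-normalized visual-word histogram. The codebook and descriptors are hypothetical toy data, not from any real feature extractor; note that two descriptors assigned to the same codeword are treated as identical, regardless of their actual distance.

```python
def quantize(descriptors, codebook):
    """Assign each local descriptor to the index of its nearest codeword."""
    def nearest(x):
        return min(range(len(codebook)),
                   key=lambda k: sum((xi - ci) ** 2 for xi, ci in zip(x, codebook[k])))
    return [nearest(x) for x in descriptors]

def bag_of_words(descriptors, codebook):
    """L1-normalized visual-word histogram for one image."""
    words = quantize(descriptors, codebook)
    hist = [0.0] * len(codebook)
    for w in words:
        hist[w] += 1.0
    total = sum(hist)
    return [h / total for h in hist]

# toy example: a hypothetical 3-word codebook and 4 descriptors
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descs = [(0.1, -0.1), (0.9, 1.2), (5.1, 4.9), (4.8, 5.2)]
hist = bag_of_words(descs, codebook)
```

Once every image is reduced to such a histogram, standard text retrieval machinery applies directly, which is precisely where the information loss discussed above enters.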
The above behaviors are often undesirable: either truly similar items are missed by the algorithm due to an inappropriate approximation (low recall), or a large number of false positives are retained in the second stage (low precision), resulting in expensive second stage re-ranking. Thus, the goal of this thesis is to design representation schemes and algorithms that are not only computationally efficient but also more accurate in capturing the similarity between objects in the second and third classes of retrieval problems. The main contributions of this dissertation can be summarized as follows:
1. We present a new quantization scheme in chapter 3, named the random seeding quantization algorithm, that is more efficient in generating bag-of-words representations for bag-of-features data. The new scheme is motivated by approximating the optimal partial matching between bags-of-features, and thus produces a bag-of-words representation capturing the true similarities of the data, leading to more accurate and efficient retrieval of bag-of-features data.
2. We present a new search algorithm in chapter 4 that is designed for efficient near duplicate vector search. By explicitly exploiting the statistical properties of near duplicity, the algorithm projects high dimensional vectors into a lower dimensional space for exact search, leading to more accurate and efficient identification of nearly duplicate objects in databases.
3. In chapter 5, we propose a new randomized range search algorithm for near duplicate vectors in high dimensional spaces. Compared to the algorithm presented in chapter 4, it does not require the sparsity constraints and applies to a wider range of applications. It projects both the data points and the query point into a one dimensional space by a random projection, and uses binary search for efficient nearest neighbor search and filtering.
We first prove a theoretical guarantee for the proposed algorithm and then evaluate its empirical performance on a 1.1 billion image feature dataset.

CHAPTER 2 LITERATURE REVIEW

In this chapter, we review existing retrieval methods for sparse, dense and bag-of-features data.

2.1 Retrieval Methods for Sparse Data

A good example of high dimensional sparse data is the bag-of-words [22] representation for documents. To make our discussion concrete, we will illustrate the key ideas of discrete sparse data retrieval by presenting the most widely used document indexing and retrieval methods. We by no means try to exhaustively cover the topic of document retrieval. Rather, we focus our discussion on efficient indexing and retrieval models, since these techniques are pertinent to this thesis and can be applied to any sparse data, not only documents. For example, this indexing and retrieval method has been successfully applied to near duplicate image retrieval, as will be illustrated later. The indexing scheme to be introduced is the inverted index [15, 23], which serves as the first stage to effectively avoid a linear scan of the collection. Afterward, specific retrieval models [37, 38] are applied to precisely match the query against the candidate documents and decide their rankings, which constitutes the second stage.

2.1.1 Sparseness and Inverted Indexes

Documents are sparsely represented by word frequencies, which is one of the key advantages provided by the bag-of-words representation. However, a more subtle and important sparseness property actually comes from the other side: most words appear in only a small number of documents. This fact is known as Zipf’s law [39], which states that the frequency of any word is inversely proportional to its frequency ranking.
As a result, if we sort words by their frequency rankings, we will see that their frequencies decrease very quickly, and thus most words have very low frequencies across the collection. Such “word-to-document” sparsity directly leads to an efficient indexing scheme called the inverted index, which maps each word to the list of documents that contain the word. The list is referred to as an inverted list, and each list entry is called a posting. The amount of information stored in postings can vary significantly, and it usually depends on the specific retrieval model being used. For a simple model like the Boolean model, an index entry only needs to store the identifiers of words and a list of document IDs. For more complex models like the vector space model or language models, other statistical information such as document frequencies, word frequencies/weights and word positions may be encoded in the index to support their ranking algorithms. Equipped with the inverted index, one can efficiently identify the documents that contain a query word by simply looking up the index table, which takes only O(1) time. Furthermore, according to Zipf’s law, the number of postings for most words is small, so we end up with a small number of candidate documents in the first stage before the exact query-document similarity is computed in the second stage. This remains true as long as the number of query words is not too large. A different indexing structure called the signature file has also been used in document indexing. The basic idea is to create short signature bits for each document by hashing, and then match the hash codes to retrieve similar documents. In [40], however, Zobel et al. carefully compared inverted indexes and signature files in both efficiency and space requirements on synthetic and real datasets.
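The inverted-index mechanics described above can be sketched minimally as follows. The toy collection is hypothetical, and a real index would compress postings and store per-posting statistics; the sketch only shows the word-to-document mapping and a Boolean AND lookup.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.split():
            index[word].add(doc_id)
    return {w: sorted(ids) for w, ids in index.items()}

def conjunctive_query(index, words):
    """Boolean AND: return the documents containing every query word."""
    postings = [set(index.get(w, ())) for w in words]
    return sorted(set.intersection(*postings)) if postings else []

docs = ["the cat sat", "the dog sat", "a cat ran"]
index = build_inverted_index(docs)
```

A lookup such as `conjunctive_query(index, ["the", "sat"])` touches only the two relevant postings lists rather than scanning the whole collection, which is exactly the first-stage filtering role of the index.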
The experiments were done on two datasets: one is the “Wall Street Journal” dataset from TREC and the other is synthetic. Both datasets contain roughly 100,000 documents, and the average document length in both is about 430. To summarize this work, the inverted index is superior to the signature file for text indexing since it 1) is more efficient, 2) uses less space and 3) provides more functionality. More specifically, the studies in [40] showed that: the inverted index is more space efficient since it can be easily compressed; the inverted index is faster to build and easier to update, since signature files require a large number of hash operations; inverted files handle varying document lengths better, since signature files require different documents to be hashed into binary strings of the same length; the results returned from signature files are less precise, since documents sharing no words can have the same signature coding; and an inverted list can easily be extended to encode weighting information while signature files cannot. In conclusion, based on the current development of both inverted lists and signature files, the inverted list is undoubtedly the better solution for document indexing. In short, the inverted index is the key data structure for alleviating the computational challenge posed by large scale document retrieval [15, 16, 23, 41]. Other alternatives like signature files generally have severe disadvantages [16]. As a result, inverted indexes are at the core of all modern text search engines.

2.1.2 Retrieval Models

In the second stage of document retrieval, different retrieval models are used to precisely evaluate the similarities between the query and the candidate documents identified using the inverted list. In the following sections, we will introduce different retrieval models, including vector space models and probabilistic models. These models differ in their motivations, model complexities and effectiveness.
No model is universally superior to the others, since diverse application domains impose various requirements and thus different models may be preferred.

Vector Space Models

The vector space model was first proposed by Salton et al. [24], and it is still widely used in some IR systems. The key idea of the vector space model is to represent both the query and the documents as vectors in the term space, so that the similarity can be measured by the “closeness” between vectors (e.g. the angle between them). More specifically, a query Q is represented by a vector of index terms: Q = (q_1, q_2, ..., q_d), and similarly, a document D_i is represented by (d_{i1}, d_{i2}, ..., d_{id}). Here d is the number of indexed terms, and the elements of the vectors are the weights of the corresponding terms. The similarity between Q and D_i is measured by the cosine correlation defined as follows:

Sim(Q, D_i) = ⟨Q, D_i⟩ / (‖Q‖ ‖D_i‖)   (2.1)
            = (∑_{j=1}^{d} q_j d_{ij}) / (√(∑_{j=1}^{d} q_j²) √(∑_{j=1}^{d} d_{ij}²)).   (2.2)

The terms q_j and d_{ij} in (2.2) are usually determined by tf-idf weighting: tf stands for the normalized term frequency, which indicates how important a term is for a document, and idf is the inverse document frequency, which captures how discriminating a term is. The specific settings of tf and idf are mainly based on empirical observations.

Probabilistic Models

Probabilistic models are based on the probabilistic ranking principle [42], stating that a retrieval system should rank the retrieved documents by their probabilities of being relevant to a given query, calculated from the data available to the retrieval system. Here we introduce one of the most popular probabilistic models, the Okapi BM25 model, from the document classification point of view. We assume the document is represented by a binary vector. A document should be classified as relevant if Pr(R|D) > Pr(NR|D). Following the Bayes rule, we will classify a document as relevant if

Pr(D|R) / Pr(D|NR) > Pr(NR) / Pr(R).   (2.3)

Consequently, it makes sense for the retrieval system to rank documents according to Pr(D|R)/Pr(D|NR), which is called the likelihood ratio. If we follow the naive Bayes assumption and certain probabilistic assumptions about the document generation process, we have

log [Pr(D|R) / Pr(D|NR)] ∝ ∑_{i: d_i = q_i = 1} log [(N − n_i) / n_i],   (2.4)

which is very similar to the idf term. However, using only an idf-like weighting scheme cannot provide desirable retrieval performance, due to the absence of term frequency weightings. Robertson et al. proposed the series of Okapi BM25 models [43–45], which have proved to be among the most successful retrieval models and are widely used by many search engines.

2.1.3 Comparison: Retrieval Models for Sparse Data

In our discussion, we have covered the vector space model and the classic probabilistic model. There exist many other retrieval models, such as the Boolean models and the language models. Due to space limitations, we omit detailed discussions of these models; however, we provide a quick summary of all the retrieval models for the completeness of our discussion. The performance of these models depends highly on their weighting schemes. It is widely accepted in the information retrieval community that there is no single retrieval model that performs universally better than the others [16, 46, 47]. Instead, different retrieval models perform optimally under different conditions. The vector space model introduces tf-idf weights based on heuristics; the classic probabilistic model brings idf in by following the idea of relevance classification, and introduces some tf-like terms based on heuristics; the language modeling approach brings in the tf effects by using the estimated language models and obtains the idf weight through smoothing techniques. Despite the fundamental theoretical differences among them, all these methods ultimately arrive at weighting schemes similar in spirit and effect.
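A minimal sketch of the vector space scoring in (2.1)–(2.2) follows. The tf and idf variants here (raw term counts, smoothed idf) are just one common choice among the many empirical settings mentioned above, and the toy collection is hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(text, df, n_docs):
    """tf-idf weights for one document or query (raw tf, smoothed idf)."""
    tf = Counter(text.split())
    return {w: c * math.log((1 + n_docs) / (1 + df.get(w, 0)))
            for w, c in tf.items()}

def cosine(u, v):
    """Cosine correlation between two sparse tf-idf vectors, as in (2.2)."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats"]
df = Counter(w for d in docs for w in set(d.split()))  # document frequencies
vecs = [tfidf_vector(d, df, len(docs)) for d in docs]
q = tfidf_vector("cat mat", df, len(docs))
scores = [cosine(q, v) for v in vecs]
```

With this weighting, the first document scores highest (it matches both query terms, including the rare term "mat"), illustrating how idf rewards discriminating terms.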
Thus, it would be improper for us to claim theoretically which approach is the state-of-the-art. To compare them objectively, we need to refer to empirical results. To this end, we cite the results of the Ad hoc tracks from TREC-8, TREC-7 and TREC-6, since the results presented there are comprehensive, conclusive and well recognized. To summarize the results of the Ad hoc tracks, the classical probabilistic model (Okapi BM25) combined with pseudo relevance feedback delivered the top performance in most evaluations. The vector space model with feedback usually performs slightly worse than the probabilistic model, probably because it has fewer parameters to tune. It is reported [48] that combining a simple language model with the probabilistic model delivers very good performance. Studies in [26, 47–51] showed that language models with proper smoothing can give retrieval performance that is better than the vector space model and comparable to the probabilistic model. But in general, setting aside the issue of collection dependence, all these models provide very similar performance when well tuned, and we may say that the marginal difference between them is very likely just a result of parameter tuning.

2.1.4 Summary

Searching large scale high dimensional sparse data (e.g. document retrieval) clearly involves two stages. The first stage is to search against a pre-built index to quickly identify candidate objects that are likely to be relevant to the query, and the second stage is to apply a specific retrieval model to calculate the exact query-to-document similarity. The efficiency of index building and searching is a direct result of the sparse representation. In the second stage, different retrieval models such as vector space models, probabilistic models and language models can be used; they differ in their motivations, assumptions, model complexities and effectiveness.
Among all the retrieval methods that have been proposed, language modeling is the most principled approach and has proved to be very effective in practice.

2.2 Retrieval Methods for Dense Data

In this section we focus on retrieval methods for continuous and dense data. Specifically, both the queries and the objects in the database are represented by high dimensional dense vectors, and the elements of the vector representation usually take continuous values. The retrieval problem for such data naturally reduces to the nearest neighbor problem in high dimensional spaces. Similar to the retrieval methods for discrete data, efficient nearest neighbor search in high dimensional space also follows a two-stage scheme: a filtering stage followed by a precise matching stage. However, since the data is no longer sparse, the indexing techniques used here are fundamentally different from those for sparse discrete data. Instead of the inverted index, two categories of methods are commonly used for the indexing and filtering stage: tree based approaches and projection based approaches. Tree based approaches [1, 28, 52] recursively partition the feature space into a collection of high dimensional cells such that points with similar input patterns are likely to be assigned to the same cell. Given a query point, it is first propagated down to a leaf node, and precise matchings are performed between the query point and all the points inside the node to determine the candidate nearest neighbor. The candidate nearest neighbor identified so far is then used as a “pivot point” to check data points from other branches, and it is updated whenever a data point with a shorter distance to the query is found. When handling very large scale data, multiple randomized trees [53] are often used to speed up the process of nearest neighbor search.
Randomized trees essentially partition the feature space in a more or less randomized fashion to achieve robust expected search performance, and multiple trees are used to boost the search accuracy. The most popular tree based methods are the KD-tree and its variants. On the other hand, the projection based methods [29, 54] are based on the idea that, under a suitable projection, adjacent data points are more likely to be mapped to the same value than points that are far away. In this approach, random projection functions are used to create an auxiliary space, and the similarity between two data points is measured by how likely their projected images are to collide in the auxiliary space. A representative projection based method is Locality Sensitive Hashing [29], which relies on hash functions that are locality sensitive: two data points that are within distance r are more likely to be hashed to the same value than two points that are separated by distance (1 + ε)r. Applying one hash function to a data point produces a “hash bit”, and hash bits from multiple hash functions are then grouped together to form “hash strings”. It follows naturally that only really close points will be hashed to the same hash string. Meanwhile, the nearest neighbor candidates from different hash strings are considered altogether to reduce the chance of missing true near neighbors. Note that locality sensitive hashing is fundamentally different from the tree based methods in the sense that it does not solve the nearest neighbor problem directly. Instead, it is designed to solve the r-near neighbor problem: finding the data points that are within the r-neighborhood of the query point, which is often referred to as a range search problem. Another important aspect of nearest neighbor search in high dimensions is that we often resort to approximate solutions. The reason is that there is no known exact algorithm that achieves sub-linear search time with linear storage cost.
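The grouping of hash bits into hash strings described above can be sketched as follows, using quantized random projections of the form h(x) = ⌊(a·x + b)/w⌋ as the locality sensitive family for the L2 distance. The parameter values (k bits per string, L strings, width w) and the toy points are illustrative only, not tuned.

```python
import random

def make_hash(dim, k, w, rng):
    """One k-bit hash string: k random projections, each quantized to width w."""
    projs = [([rng.gauss(0, 1) for _ in range(dim)], rng.uniform(0, w))
             for _ in range(k)]
    def h(x):
        return tuple(int((sum(a_i * x_i for a_i, x_i in zip(a, x)) + b) // w)
                     for a, b in projs)
    return h

def build_lsh(points, dim, k=4, L=8, w=2.0, seed=0):
    """Index every point into L hash tables, one per k-bit hash string."""
    rng = random.Random(seed)
    hashes = [make_hash(dim, k, w, rng) for _ in range(L)]
    buckets = [dict() for _ in range(L)]
    for i, p in enumerate(points):
        for h, b in zip(hashes, buckets):
            b.setdefault(h(p), []).append(i)
    return hashes, buckets

def candidates(q, hashes, buckets):
    """Union of the colliding buckets over all L tables (first-stage filter)."""
    out = set()
    for h, b in zip(hashes, buckets):
        out.update(b.get(h(q), ()))
    return out

points = [[0.0, 0.0], [0.01, 0.01], [1000.0, 1000.0]]
hashes, buckets = build_lsh(points, dim=2)
cand = candidates([0.0, 0.0], hashes, buckets)
```

Requiring all k bits of a string to match keeps far points out of a bucket, while taking the union over L tables recovers near points that a single table might miss, exactly the precision/recall tradeoff discussed above.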
More specifically, for any given nearest neighbor search algorithm, we are concerned with both its time complexity and its space complexity as functions of the data set size n and the data dimensionality d. A naive algorithm stores all the points and, to process a query, linearly scans the entire data set to calculate all pair-wise distances to find the nearest neighbor. Such an algorithm clearly has both time and space complexity of O(nd) [21, 55]. The linear time complexity (w.r.t. n) is not desirable, and people have developed algorithms to achieve sub-linear time complexity. The current state-of-the-art result for exact nearest neighbor search is that, if we fix the dimensionality to be a constant, there are algorithms that solve the problem with O(G1(d) log n) time complexity and O(G2(d) n) space complexity [21, 28, 56–60]. Note that although these algorithms achieve sub-linear time complexity and linear space complexity w.r.t. n, there are functions G1(d) and G2(d) hidden in the complexity terms. Recent studies [1, 21, 55, 61] on high dimensional nearest neighbor search pointed out that either G1(d) or G2(d) is in fact an exponential function of d, which makes all these algorithms much less efficient in space or time as the dimensionality increases. To be more specific, the algorithms presented in [58, 59] have O(d^O(1) log n) time complexity, but they use roughly n^O(d) space [55, 62]. For the KD-tree based algorithms presented in [28, 57], the time complexity increases at least as fast as 2^d [1, 57], which is also observed in empirical studies [61]. The Ball-tree based algorithm [60] has time complexity O(d log n) only in low dimensions, and it becomes O(nd) in higher dimensions [56]. Table 2.1 summarizes important exact nearest neighbor search algorithms and their time and space complexities.
Algorithm                   Time Complexity       Space Complexity
Linear Scan                 O(nd)                 O(nd)
Voronoi [58], Planar [59]   O(d^O(1) log n)       n^O(d)
KD-Tree [28, 57]            Ω(2^d log n)          O(nd)
Ball-tree [60]              O(nd) in high dims    O(nd)

Table 2.1 Complexities of exact nearest neighbor search algorithms.

In summary, all known exact nearest neighbor search algorithms [21, 28, 54, 56–60, 63] are not appropriate for large scale high dimensional data. The reason is that they either have time complexity linear in n, or time complexity sub-linear in n but exponential in d, or time complexity sub-linear in n and linear in d but space complexity exponential in d. This implies that for large scale high dimensional data, we need to resort to approximate solutions. The difficulty of performing exact nearest neighbor search motivated people to study approximate nearest neighbor search algorithms, and the KD-tree and LSH can be used for this purpose. Representative approximate search algorithms based on the KD-tree are introduced in [1] and [64]. In [1], the authors modified the KD-tree structure and proposed a so-called Balanced Box Decomposition tree to support (1 + ε) approximate nearest neighbor search. In [64], the authors proposed to stop the tree search early after a fixed number of nodes has been checked, and empirically showed the effectiveness of this heuristic. LSH related work can be found in [29, 65, 66], where various locality sensitive hash function families are proposed for nearest neighbor search under different distance/similarity functions. Note that although some of these methods [1] do not actually remove the exponential complexity, they still provide considerable improvement in search efficiency over exact nearest neighbor search algorithms. We emphasize that the approximation in KD-trees is usually achieved by early stopping, whereas LSH is inherently a randomized algorithm. For a summary of the complexities of approximate nearest neighbor algorithms, refer to Table 2.2.
In the following sections, we will elaborate both the tree based and the hashing based methods for exact and approximate nearest neighbor search.

Algorithm       Time Complexity           Space Complexity
ANN [1]         O(d⌈1 + 6d/ε⌉^d log n)    O(nd)
BBFS [64, 67]   O(d log n)                O(nd)
LSH L1 [29]     O(dn^(1/(1+ε)))           O(dn^(1+1/(1+ε)))
LSH L2 [65]     O(dn^(f(1+ε)))            O(dn^(1+f(1+ε)))

Table 2.2 Complexities of approximate nearest neighbor search algorithms. For the LSH complexity under the L2 distance, we have f(1 + ε) < 1/(1 + ε).

2.2.1 KD-Tree, Randomized KD-tree and Randomized Forests

KD-Tree

Although the KD-tree was first introduced several decades ago [28, 57], it is still widely used in high dimensional nearest neighbor search. In short, a KD-tree is a generalized binary tree designed for efficient searching. In a KD-tree, every node essentially stores information about a space partition. For example, the root node represents the entire feature space, and the leaf nodes represent non-overlapping small sub-regions of the entire space (or equivalently, subsets of all examples). Meanwhile, the internal nodes record the intermediate steps (and partitions) of how the entire space is recursively divided into the finest sub-regions. See Figure 2.1 for an example of constructing a KD-tree based on space partitioning.

Figure 2.1 An example of a KD-tree and the resulting space partition.

Given a dataset, the first step is to construct a tree over the data. At each node, we pick the median of the most spread dimension to bisect the feature space, where the spread is defined as the value range or the variance. The split values are stored, since they determine the “bounding box” of each partition. This recursive process continues until the number of points within a node falls below some threshold. As a result, each node is represented by a hyper-rectangle whose boundaries are determined by the split values chosen by its ancestor nodes.
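The recursive construction just described (split on the most spread dimension at its median) can be sketched as follows. The dictionary-based node representation and the toy points are illustrative choices, not part of the original algorithm.

```python
import statistics

def build_kdtree(points, leaf_size=2):
    """Recursively split on the most spread dimension at its median value."""
    if len(points) <= leaf_size:
        return {"leaf": True, "points": points}
    dim = len(points[0])
    # pick the dimension with the largest value range (the "spread")
    split_dim = max(range(dim),
                    key=lambda j: max(p[j] for p in points) - min(p[j] for p in points))
    split_val = statistics.median(p[split_dim] for p in points)
    left = [p for p in points if p[split_dim] <= split_val]
    right = [p for p in points if p[split_dim] > split_val]
    if not left or not right:  # degenerate case: all points equal on this dimension
        return {"leaf": True, "points": points}
    return {"leaf": False, "dim": split_dim, "val": split_val,
            "left": build_kdtree(left, leaf_size),
            "right": build_kdtree(right, leaf_size)}

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points, leaf_size=2)
```

Each internal node stores its split dimension and split value; together with the splits of its ancestors, these determine the bounding hyper-rectangle of every node, which is exactly what the overlap test during search relies on.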
At the search stage, the query point is first descended down the tree by comparing its feature value against the corresponding split value (on the split dimension) chosen at each node. When the algorithm reaches a leaf node, the candidate nearest neighbor is determined by computing the exact query-to-data distances. The query point together with the shortest distance found so far defines a hypersphere centered at the query point. Afterward, we check whether the hypersphere overlaps with the hyper-rectangles defined by other unvisited nodes. If there is no overlap, the node and its children can be completely eliminated from consideration. If there is some overlap, the node and its children are checked for potential nearer neighbors. In the following subsections, we lay out the algorithms for tree construction and search, and also discuss some implementation details.

KD-tree Construction

Given a dataset, a KD-tree is constructed recursively as follows: pick a dimension and a split value to divide the dataset into two subsets, and keep dividing the subsets until the number of data points contained in a node drops below some threshold. The recursive procedure in Algorithm 1 constructs a KD-tree from a given dataset. There are several comments we would like to make regarding this procedure. The granularity, defined as the maximum number of data points within a leaf node, can affect the performance. If the granularity is too fine (e.g. each node contains only one data point), the performance will be suboptimal due to the computational overhead imposed on the search by the tree structure [57]. So, to maintain a reasonable granularity, a set of points is usually treated as a leaf node either when the number of data points is small or when the spread (the value range of the data points) of the set is small.
The function SelectSplit picks the dimension and the split value used to partition the feature space; the criterion can be either the spread or the variance of each dimension, depending on the application. Often the median value is chosen as the split value. By partitioning the feature space in this fashion, the tree is constructed to achieve the best expected performance over different queries [57].

Algorithm 1 KD-tree construction algorithm
Require: dataset {The data for tree construction}
1: if IsEmpty(dataset) then
2:   return NIL
3: else if Size(dataset) ≤ threshold then
4:   return MakeLeaf(dataset)
5: end if
6: (splitDim, splitVal) = SelectSplit(dataset)
7: leftData ← {x | x ∈ dataset, x_splitDim ≤ splitVal}
8: rightData ← {x | x ∈ dataset, x_splitDim > splitVal}
9: leftChild = KDtree(leftData)
10: rightChild = KDtree(rightData)
11: return MakeNode(splitDim, splitVal, leftChild, rightChild)

KD-Tree Search

Intuitively, the search algorithm first locates the leaf node that contains the query point, records the unvisited nodes along the path, and then recursively examines those unvisited nodes to improve the search result. Locating the leaf node is a single descent and can be done very efficiently, but the improvement step is recursive and can require a lot of computation. The key calculation in the improvement step is to check whether the hypersphere centered at the query point overlaps with the hyper-rectangles defined by the internal nodes. Here, we describe it in more detail. Let q be the query point and p be the nearest neighbor found so far. A hypersphere hs is defined centered at q with radius D(q, p), where D(q, p) is the inter-point distance defined over some metric. To determine whether there is an overlap between the hypersphere hs and some hyperrectangle hr, we need to find the point t in the hyperrectangle hr that is closest to q.
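Under the Euclidean distance, this closest-point computation is just a coordinate-wise clamp of q into the rectangle, which makes the overlap test a few lines of code. A minimal sketch (the rectangle is given by its per-dimension minimum and maximum bounds):

```python
def closest_point_in_rect(q, rect_min, rect_max):
    """Clamp q into the hyperrectangle [rect_min, rect_max], coordinate-wise."""
    return [min(max(qi, lo), hi) for qi, lo, hi in zip(q, rect_min, rect_max)]

def sphere_overlaps_rect(q, radius, rect_min, rect_max):
    """Does the hypersphere centered at q intersect the hyperrectangle?"""
    t = closest_point_in_rect(q, rect_min, rect_max)
    d2 = sum((qi - ti) ** 2 for qi, ti in zip(q, t))
    return d2 <= radius ** 2

t = closest_point_in_rect([0.0, 5.0], [1.0, 2.0], [3.0, 4.0])
```

If the clamped point t is already farther from q than the current best distance, the whole rectangle (and hence the subtree it bounds) can be pruned.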
If D(q, t) > D(q, p), then it is clear that the region defined by hr cannot contain any point closer than p, and thus the node associated with the region can be eliminated from consideration. Finding the closest point t given a query point q is equivalent to minimizing the distance D(t, q) subject to the constraint that t belongs to the hyperrectangle hr. If D(t, q) is defined as the Euclidean distance, minimizing the distance can be achieved by minimizing (t_i − q_i)² independently for each dimension i. That is, for every dimension we take the coordinate value in the hyper-rectangle that is closest to the corresponding coordinate of q. Specifically, given q and hr, the coordinate of t on the i-th dimension is

t_i = hr_i^min   if q_i < hr_i^min
t_i = hr_i^max   if q_i > hr_i^max
t_i = q_i        if hr_i^min ≤ q_i ≤ hr_i^max

The detailed algorithm for KD-tree search is given in Algorithm 2, where the overlap test, a key step in the search, is denoted by Overlap.

Algorithm 2 KD-tree search algorithm
Require: q {The query point}
Require: p, d {The nearest neighbor and the closest distance found so far}
Require: root {The root node of the KD-tree}
Require: Overlap(q, d, node) {An overlap testing function}
1: if IsLeaf(node) then
2:   candidateSet ← {x | x ∈ node}
3:   CalExactDists(q, candidateSet)
4:   Update(p, d)
5:   return
6: end if
7: sd ← node.splitDim
8: sv ← node.splitVal
9: if q_sd ≤ sv then
10:   closerNode = node.leftChild
11:   furtherNode = node.rightChild
12: else
13:   closerNode = node.rightChild
14:   furtherNode = node.leftChild
15: end if
16: KDtreeSearch(closerNode)
17: if Overlap(q, d, furtherNode) then
18:   KDtreeSearch(furtherNode)
19: end if

Priority KD-Tree Search

The search algorithm described above is essentially a depth-first tree search. This depth-first fashion may not be the most effective search order for some queries.
For example, if the query point is initially propagated down to the left half of the tree but the true nearest neighbor lies in the other half, depth-first search must traverse at least half of the tree before visiting the nearest neighbor. This leads to our first consideration: the order in which a search algorithm visits tree nodes should depend more on the query point than on the tree structure, since the tree structure is optimized for the best expected performance [57] rather than for any particular query [64]. Specifically, when descending the query point to its first leaf node, we should record the distances between the query point and all the other nodes along the path, and visit first the nodes that are closer to the query point. Another consideration is more practical. It is observed in experiments [68] that the true nearest neighbor is often encountered in the early stage of the search, which further confirms that it is preferable to first check the nodes that are more likely to contain the nearest neighbor. Last but not least, in the analysis [57] of KD-tree performance, Friedman et al. assumed (without justification) that the search algorithm examines the leaf nodes in an optimal order, which implies that leaf nodes should be sorted by their distances from the query point. All these considerations motivate the priority KD-tree search [68] (or best bin first search [64]), which essentially searches nodes based on their distances to the query point instead of based on the tree structure. Specifically, this modified search algorithm maintains a priority queue that stores the distances from the query point to the different nodes. Each time a leaf node has been visited, the next tree node to be visited is the one with the smallest query-to-node distance. Figure 2.2 illustrates how the priority KD-tree search works.
Despite the simplicity of the idea, experiments [53, 64, 68] have shown that with priority KD-tree search, the algorithm in general uses less time and examines a much smaller number of leaf nodes before reaching the true nearest neighbor. Furthermore, this priority-based search scheme has been extensively used in approximate KD-tree search, since it helps in both theoretical justification and performance improvement, as we will discuss in the next subsection.

Figure 2.2 An example of priority KD-tree search. Subtrees v1 through v4 are initially in the queue. v1, the closest node, is selected first. When descending the tree to leaf node w, we enqueue the siblings u1, u2, and u3 along the way [1].

Approximate KD-Tree Search As we mentioned before, large scale high dimensional data makes it computationally prohibitive to perform exact nearest neighbor search. We then ask whether it is possible to solve the nearest neighbor search problem efficiently, not exactly but with good approximation. The most recognized work on approximate KD-tree search is presented in [64] and [1]. Both methods utilize the priority KD-tree search, and essentially both use "early stopping" to avoid the high computational cost of exhaustive search. In [1], Arya et al. proposed a method to solve the so-called "(1 + ε)-approximate nearest neighbor problem", defined as follows:

Definition 1 ((1 + ε)-approximate nearest neighbor problem). Given a set D of points in R^d, construct a data structure that, given a query q ∈ R^d, reports any point p within distance at most (1 + ε)D(q, p*), where p* is the point in D closest to q.

They proposed a data structure called the balanced box decomposition (BBD) tree to support their algorithm and theoretical analysis. Unlike the KD-tree, which only targets reducing the cardinality of nodes, the BBD tree decomposes the feature space in such a way that both the cardinality and the geometric size of nodes decrease exponentially.
Since the nodes are ordered based on their distances to the query point, the search can stop as soon as the query-to-node distance is greater than (1 + ε)d, where d is the shortest distance seen so far. Together with the reduction of geometric node size, it is possible to upper-bound the number of leaf nodes visited during a search. To cite their results precisely, the authors proved that through the use of the BBD tree, the (1 + ε)-approximate nearest neighbor problem can be solved in O(C_{d,ε} log n) time, where C_{d,ε} ≤ d⌈1 + 6d/ε⌉^d. In their experiments, they showed that the algorithm can provide significant improvements over brute-force search in dimensions as high as 20. However, the exponential factors in the query time imply that this algorithm is not practical for high dimensional data. Motivated by practical considerations, Beis et al. [64] proposed a different stopping criterion that limits the number of leaf nodes to be examined. Although they do not provide any theoretical support for this practice, they showed that when combined with priority KD-tree search, the algorithm finds the true nearest neighbor more than 90% of the time, and in the cases where the true nearest neighbor is not found, the solution is a very good approximation. In [67] the authors even argue that this stopping criterion provides better performance than the (1 + ε)d cutoff. To summarize, the (1 + ε)d cutoff is better justified but may not work well in high dimensions without modification. Fixing the number of leaf nodes to be examined has been shown to be useful in solving practical problems, but a more solid analysis would be desirable.

Multiple Randomized KD-Trees We have seen that to solve the approximate nearest neighbor problem, a good practice is to use priority KD-tree search and limit the number of leaf nodes to be examined. Increasing the number of examinations increases the probability of finding the true nearest neighbor, but the improvement tends to saturate with more examinations.
To address this problem, multiple randomized KD-trees were introduced [53]. The basic idea is to create multiple independent KD-trees with different structures, perform simultaneous search in all the trees, and combine the results. To create different trees, the splitting dimension is chosen at random from a set of dimensions with large variances, and the splitting value is the median value with some random perturbation. To search these trees, a single priority queue is maintained over all the trees, so the node closest to the query is always visited first. Many experiments [69, 70] have shown that using multiple randomized KD-trees can greatly improve search efficiency and accuracy.

2.2.2 Locality Sensitive Hashing

Locality Sensitive Hashing (LSH) is a projection based method for approximate near neighbor search in high dimensions. The idea of using random projections for nearest neighbor search was first presented by Kleinberg in [54]. However, the two algorithms presented there use either exponential storage or linear query time and thus are not suitable for large scale high dimensional data. These limitations in fact motivated the research on LSH. There is a rich literature on LSH, and the most important results are presented in [29], [65] and [66]. In the following subsections, we first briefly overview the approximate near neighbor problem and the basic idea of LSH, and then present detailed results from the related work.

Approximate Nearest Neighbor As an inherently randomized algorithm, LSH is usually used to solve the randomized approximate near neighbor problem or the randomized r-near neighbor problem, instead of the nearest neighbor problem. A point p is an r-near neighbor of the query point q if, under some metric, the distance D(p, q) is at most r. Clearly, the nearest neighbor problem and the near neighbor problem are related, and the former is "harder".
This is because given the nearest neighbor p of the query q, the r-near neighbor problem can be solved trivially by checking whether D(p, q) ≤ r. Given a query point q, LSH either returns one of the r-near neighbors of q or claims that no such point exists for the radius r. Conversely, solving the nearest neighbor problem involves gradually increasing the radius r until some near neighbor is reported, which is equivalent to solving a series of r-near neighbor problems. More precisely, LSH can be used to solve the randomized (1 + ε)-approximate near neighbor problem or the randomized r-near neighbor reporting (range search) problem, defined as follows:

Definition 2 (randomized (1 + ε)-approximate near neighbor problem). Given a set D of points in R^d, a query q ∈ R^d and parameters ε > 0, δ > 0, construct a data structure that, if there exists an r-near neighbor of q in D, reports some (1 + ε)r-near neighbor of q with probability 1 − δ.

Some quick comments on Definition 2: 1) It does not specify the algorithm's behavior when there is no r-near neighbor of q; in that case, the algorithm's behavior is not deterministic. 2) The probability of failure δ can be reduced by building and searching multiple (e.g. k) independent data structures, in which case the probability of failure is reduced to δ^k.

Definition 3 (randomized r-near neighbor problem). Given a set D of points in R^d, a query q ∈ R^d and parameters ε > 0, δ > 0, construct a data structure that reports each r-near neighbor of q with probability 1 − δ.

Some quick comments on Definition 3: 1) There is no approximation involved. 2) The algorithm may return many points (if many points qualify), so the running time cannot be bounded. In addition, the returned points may not be independent.

Key Ideas of Locality Sensitive Hashing The key concept for LSH is locality sensitivity, and it is the key difference between LSH and a general hashing algorithm.
In other words, unlike general hashing, whose goal is to avoid collisions, the goal of LSH is not only to distinguish far-away points but also to map nearby points into the same bucket. This goal is achievable due to the existence of locality sensitive functions, defined as follows:

Definition 4 (Locality Sensitive Function). A family of functions H is called locality sensitive if for two points p, q ∈ R^d and 0 ≤ P2 < P1 ≤ 1,
• if D(p, q) ≤ r then Pr_H(h(q) = h(p)) ≥ P1
• if D(p, q) ≥ (1 + ε)r then Pr_H(h(q) = h(p)) ≤ P2

Figure 2.3 Illustration of the difference between general hashing and locality sensitive hashing.

With a family of locality sensitive hashing functions, we are ready to index the dataset and then search for near neighbors. However, note that the gap, which is typically measured by the ratio between some monotonic functions of P1 and P2, is often not large enough to clearly differentiate between an r-near neighbor and a neighbor with distance larger than (1 + ε)r. As a result, we need to amplify the gap by concatenating several functions. Specifically, we first group k hash functions h_{1,j}, h_{2,j}, ..., h_{k,j} together to form a function g_j, with each h_{i,j} being a locality sensitive hash function, so that only really close-by points are hashed to the same value by g_j. We also consider L different g_j's to reduce the chance of missing true near neighbors. To process a query point q, we first hash it into L buckets using the g_j's, and then retrieve all the points from these L buckets (points p such that g_j(p) = g_j(q)). We then compute the exact distance between the query q and the retrieved candidate points, and report any point that satisfies our needs. The LSH indexing and searching procedure is presented in Algorithm 3. Note that there are two possible strategies to examine the candidate points.
One strategy is to scan all the points from all buckets (exhaustive scanning), and the other is to stop scanning (early stopping) as soon as a fixed number of points, proportional to L, have been retrieved. These two strategies solve different problems: the first solves the randomized r-near neighbor problem and the second solves the randomized (1 + ε)-approximate near neighbor problem. Specifically, by using the exhaustive scanning approach and setting L and k properly, we can make the failure probability δ in the range search arbitrarily small. On the other hand, it has been proved [29] that with a properly selected limit on the number of points scanned (a small constant multiple of L), the early stopping strategy can be used to solve the randomized (1 + ε)-approximate near neighbor problem in sub-linear time O(n^ρ), where ρ = 1/(1 + ε).

Algorithm 3 Procedure of LSH Indexing and Searching
Require: dataset {The dataset to be indexed}
Require: q {The query point if search will be performed}
1: if IsEmpty(dataset) then
2:   EXIT
3: end if
4: Select parameters k and L.
5: Form L functions {g_1, g_2, ..., g_L}, where g_j = (h_{1,j}, h_{2,j}, ..., h_{k,j}) and the h_{i,j} are chosen independently at random from an LSH family.
6: Use the g_j's to hash all data points into L hash tables.
7: for j = 1, 2, ..., L
8:   Retrieve points p from bucket g_j(q) in the jth hash table.
9:   Evaluate the exact distance D(p, q), and report p if it is a valid answer to the problem.

LSH Families We have discussed the indexing and searching process of LSH without touching the key question of how exactly to choose the hashing functions. In this section we discuss several hashing families under different distance measures.

Hamming Distance In the case of the Hamming distance, where the distance between two vectors equals the number of coordinates in which they differ, we can intuitively form locality sensitive hash functions by projecting a vector onto one of its coordinates.
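Putting Algorithm 3 together with this coordinate-sampling family, a toy LSH index for binary vectors might look like the following. This is an illustrative sketch only; the class name, parameters, and seed are our own choices.

```python
import random

class BitSamplingLSH:
    """L hash tables; each g_j concatenates k coordinate projections h(p) = p[i]."""

    def __init__(self, dim, k, L, seed=0):
        rng = random.Random(seed)
        # g_j = (h_{1,j}, ..., h_{k,j}): k randomly sampled coordinates per table.
        self.coords = [[rng.randrange(dim) for _ in range(k)] for _ in range(L)]
        self.tables = [{} for _ in range(L)]
        self.points = []

    def _g(self, j, p):
        return tuple(p[i] for i in self.coords[j])

    def index(self, p):
        pid = len(self.points)
        self.points.append(p)
        for j, table in enumerate(self.tables):
            table.setdefault(self._g(j, p), []).append(pid)

    def query(self, q, r):
        # Gather colliding candidates, then verify the exact Hamming distance.
        candidates = set()
        for j, table in enumerate(self.tables):
            candidates.update(table.get(self._g(j, q), []))
        hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
        return [self.points[i] for i in candidates
                if hamming(self.points[i], q) <= r]

lsh = BitSamplingLSH(dim=8, k=3, L=10, seed=42)
near = [0, 1, 1, 0, 0, 1, 0, 1]   # differs from the query in one coordinate
far = [1, 0, 0, 1, 1, 0, 1, 0]    # differs in every coordinate
lsh.index(near)
lsh.index(far)
res = lsh.query([0, 1, 1, 0, 0, 1, 1, 1], r=2)
```

With these parameters the far point can never collide with the query (they differ in every coordinate), while the near duplicate collides in at least one of the L tables with very high probability; the final exact-distance check mirrors step 9 of Algorithm 3.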
More specifically, for any point p, we have hash functions h_i(p) = p_i, where 1 ≤ i ≤ d and d is the dimensionality of the data point. The h_i's are locality sensitive, since if D(p, q) ≤ r then Pr(h(q) = h(p)) ≥ 1 − r/d, and if D(p, q) ≥ (1 + ε)r then Pr(h(q) = h(p)) ≤ 1 − (1 + ε)r/d. Note that this statement holds not only for binary vectors but also for M-ary vectors.

L1 Distance Under the L1 norm, one way to form locality sensitive hash functions is to first embed the vector into a binary representation and then project the new representation onto one of its coordinates. For example, for an M-ary data point p, we can form its binary representation as follows: first convert each dimension of p into a binary string of p_i ones followed by M − p_i zeros, and then concatenate all dimensions together to form one long binary string. It is straightforward to see that the Hamming distance under this new binary representation equals the L1 distance under the original representation, and the previously mentioned coordinate-projection hashing functions are still locality sensitive. However, this intuitive method only works well when the value of M is small, and thus a more direct approach is needed. As presented in [71], for a range search problem with radius r, we can first pick a real number w ≫ r, then randomly sample d real numbers s_1, s_2, ..., s_d ∈ [0, w), and finally hash the original vector p onto the grid by setting h_{s_1,s_2,...,s_d}(p) = (⌊(p_1 − s_1)/w⌋, ⌊(p_2 − s_2)/w⌋, ..., ⌊(p_d − s_d)/w⌋).

L2 Distance Under the L2 distance, locality sensitive hashing functions can be formed by first randomly projecting the data point onto a one dimensional line, and then dividing the line into segments shifted by a random offset. Formally, for any data point p, we pick two real numbers w and b ∈ [0, w), followed by randomly generating a Gaussian vector a, to form the one dimensional representation h_{a,b}(p) = ⌊(a · p + b)/w⌋.
It has been shown [65] that this approach of generating hashing functions can be extended to the case of the L_s norm, where s ∈ [0, 2). To achieve this, the projection vector a is drawn not from a Gaussian distribution but from an s-stable distribution.

Jaccard Distance For two sets A and B, the Jaccard similarity is defined as S_J(A, B) = |A ∩ B| / |A ∪ B| and the resulting distance is defined as 1 − |A ∩ B| / |A ∪ B|. Under this distance metric, a family of functions called min-hash can be defined as follows: first define a random permutation π over the universe U, and then let h_π(A) = min{π(a) | a ∈ A}. It has been proved [72] that the probability of collision equals the similarity under the Jaccard distance, i.e., Pr(h_π(A) = h_π(B)) = S_J(A, B). More discussion and extensions of min-hash can be found in [73, 74].

2.2.3 Comparison: Search Methods for Dense Data

In this subsection we compare the empirical performance of KD-tree variants and Locality Sensitive Hashing on approximate nearest neighbor search. A comprehensive empirical study comparing different approximate nearest neighbor search algorithms is presented in [69]. In that experimental setup, the data set consists of 1,000,000 SIFT points. Three algorithms, LSH, ANN and Best Bin First Search, were compared on their probabilities of retrieving the true nearest neighbor from the data set. It was shown that Best Bin First Search outperformed both LSH and ANN by about an order of magnitude (in speed), at accuracy levels from 50% to 95%. This suggests that multiple randomized KD-tree search is an effective approach in practice. However, note that in this study, algorithms were evaluated on their performance in finding the true nearest neighbor. LSH is designed for approximate nearest neighbor search and thus is not optimal for this task. Furthermore, LSH's complexity depends on the error factor ε, so it approaches a linear scan as the accuracy requirement becomes high.
So it is not surprising that LSH is inefficient at finding the true nearest neighbor. To make the discussion complete, we cite another work. In [65], the authors showed that LSH is about 1 ∼ 2 orders of magnitude faster than exact KD-tree search for approximate nearest neighbor search (with ε = 1). They also claimed that the speed-up increases linearly with the dataset size. However, no significant additional speed-up was gained as the dimensionality kept increasing.

2.2.4 Summary

In this section, we have discussed two categories of methods for nearest neighbor search: partition-based methods and projection-based methods. Despite their apparent differences, both kinds of methods essentially utilize a two-stage scheme for efficient computation: an indexing step followed by an exact distance calculation step. In practice, these methods are often used for approximate nearest neighbor search to obtain near-optimal solutions in sub-linear time, with only linear storage. They have proved very effective for nearest neighbor search in large scale high dimensional data. However, despite their effectiveness, these methods do have drawbacks. The ANN algorithm has time complexity linear in the dataset size n but exponential in the dimensionality d. Best Bin First Search relies on heuristics for early stopping and lacks a theoretical analysis. Ball-tree search has complexity O(d log n) for moderate d, but it can be as bad as O(dn) for very large dimensionality. Finally, the complexity of the LSH algorithm depends on the approximation level: it approaches linear complexity as the solution becomes more precise. In a comprehensive empirical study of approximate nearest neighbor search algorithms [69], Best Bin First Search on multiple randomized KD-trees outperformed the LSH and ANN algorithms by about an order of magnitude, for many different precision settings.
This suggests that multiple randomized KD-tree search is an effective approach in practice, at the expense of using more memory for multiple trees.

2.3 Retrieval Methods for Bag of Features

In this section we focus on large scale retrieval methods for bag-of-features represented data. The concept of a bag is similar to that of a set, except that repeated elements are allowed. Bags usually vary in their cardinalities, and there is no ordering defined among the features within each bag. To match two objects, one essentially needs to find the best mapping from one set to another, which is computationally expensive. In our discussion, we focus on the topic of large scale image retrieval, since each image is naturally represented by a bag of descriptors. Image retrieval refers to the process of finding images that are relevant to a user's query. The query can be either textual or an image, corresponding to text based image retrieval [17] or content based image retrieval (CBIR) [18, 19], respectively. Instead of trying to solve the general content based image retrieval problem, here we focus on the problem of near duplicate image retrieval [75], where the goal is to identify gallery images that share high visual similarity with the query image. This is because general purpose CBIR performs poorly due to the semantic gap (the gap between pixel statistics and human interpretation) [76], whereas the problem of near duplicate image retrieval is better defined and more manageable. For the near duplicate image retrieval problem, a number of studies [27, 75, 77–79] have shown that local image features, e.g. SIFT descriptors [30], often referred to as keypoints, are significantly more effective than global image features such as color, texture and shape. The main idea of the keypoint based approach is to extract salient local patches from an image, and represent each local patch by a multi-dimensional vector.
As a result, each image is represented by a collection of multi-dimensional vectors, i.e., a bag-of-features [80]. A number of algorithms [81–86] have been proposed to measure the similarity between two images based on their bag-of-features representations, but these methods are difficult to scale to large datasets since they usually require a linear scan of the entire dataset. To address the computational efficiency issue, several methods have been proposed, including pyramid hashing [2], the bag-of-words model [27, 70, 77] and Min-Hash [87]. The pyramid hashing approach extends the work of pyramid matching [83], using a randomized hashing method to improve the efficiency of normalized partial matching. The bag-of-words model converts the image retrieval problem into a text retrieval problem and takes advantage of the inverted index to improve efficiency; various methods in this category differ in how the visual vocabulary is created. Min-hashing utilizes LSH to create an efficient representation for feature sets after keypoints are quantized. In the following subsections, we elaborate on all of these approaches.

2.3.1 Pyramid Matching and Pyramid Match Hashing

The Pyramid Matching Kernel [32, 83, 88] was proposed to generate kernels for objects represented by bags of features. The key idea is to map features to multi-resolution histograms and compute a similarity score based on weighted histogram intersections. In [32], the authors proposed to first divide the feature space into the finest bins (every data point belongs to its own bin) to make the first-level histogram, and then create subsequent histograms by doubling the bin sizes. If two points fall into the same histogram bin, they are considered a match. The final matching score between two objects is calculated as the weighted number of matches between the two point bags. Note that in this work, the division of the feature space is uniform; despite its implementation convenience, it has limited performance.
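A toy version of this uniform-bin scheme, with grid cells doubling in size per level and new matches at level l weighted by 1/2^l, can be sketched as follows. This is a simplified sketch for illustration, not the implementation of [32]; the function names are ours.

```python
from collections import Counter

def histogram(points, cell):
    # Map each feature to its grid cell at the given resolution.
    return Counter(tuple(int(x // cell) for x in p) for p in points)

def intersection(h1, h2):
    # Histogram intersection; Counter returns 0 for missing bins.
    return sum(min(h1[b], h2[b]) for b in h1)

def pyramid_match(X, Y, levels=4):
    """Weighted sum of *new* matches formed at each level: finer levels
    (smaller cells) get larger weights, mirroring the uniform-bin kernel."""
    score, prev = 0.0, 0
    for l in range(levels):
        matches = intersection(histogram(X, 2 ** l), histogram(Y, 2 ** l))
        score += (matches - prev) / (2 ** l)   # weight 1/2^l at level l
        prev = matches
    return score

X = [(0.5, 0.5), (3.2, 3.1)]
Y = [(0.6, 0.4), (9.0, 9.0)]
print(pyramid_match(X, Y))  # the pair near the origin matches at the finest level
```

The score only counts a match once, at the finest level where it first appears, which is how the kernel approximates the optimal partial matching between the two bags.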
[88] extends the original idea by creating the histograms based on the underlying structure of the feature space. Using hierarchical clustering, the algorithm creates a hierarchical decomposition of the feature space, and the resulting histograms are obtained from non-uniformly shaped clusters. It has been shown [32] that the similarity calculated with the pyramid matching kernel approximates the optimal partial matching, and empirical evaluations also demonstrated its effectiveness. A drawback of the pyramid matching kernel is that, given a query image, a linear scan is required to compute the similarity between the query image and every image in the database, which does not scale well to a large image database. In [2], the authors proposed a randomized hashing based method to improve the efficiency of pyramid matching. The pyramid match hashing algorithm first maps a bag-of-features representation to a single high dimensional vector via an embedding function; then a set of locality sensitive hash functions is used to generate hash strings from the vector. Thus, every image is essentially converted into a binary string. When a query arrives, the same embedding and hashing functions are applied to generate a binary string representation, and then an approximate nearest neighbor search algorithm is used to find similar strings in the dataset. Those similar strings correspond to candidate images that are likely to be similar to the query image. Finally, pyramid matching is applied to calculate the exact similarities between the query and those candidate images. Figure 2.4 shows the idea of pyramid matching and pyramid match hashing.

Figure 2.4 Illustration of (a) pyramid matching and (b) pyramid match hashing. The pyramid matching approach takes two sets of feature vectors and maps them to multi-resolution histograms, and intersects them to efficiently approximate the optimal partial matching between the original feature sets.
The pyramid match hashing approach utilizes an embedding of the pyramid histogram and randomly hashes the new representation to allow sub-linear time indexing. Finally, the pyramid match is applied to only a small portion of the database examples [2].

To summarize the above discussion, pyramid matching essentially serves as a precise matching step, since it approximates the optimal partial matching very well; however, it is difficult to scale to large datasets. The pyramid match hashing approach improves scalability by first representing a bag of features by a single high dimensional vector and then searching for its nearest neighbors using an approximate approach. The exact pyramid matching is only applied to a small subset of candidates identified in the first stage.

2.3.2 The Bag of Words Model

Among all approaches proposed for efficient large scale image retrieval, the bag-of-words model [77] is probably the most popular one due to its empirical success. It takes advantage of the inverted index, which has been successfully used by web search engines to index billions of documents. The key idea is to quantize the continuous high-dimensional vectors into a vocabulary of visual words, which is typically achieved by a clustering algorithm. By treating each cluster center as a word in a codebook, this approach maps each image feature onto its closest visual word, and then represents each image by a histogram of visual words. The above vector quantization process converts an image retrieval problem into a text retrieval problem, so that standard text retrieval methods can be used to efficiently retrieve relevant images given a query image. A number of studies have shown promising performance of this approach for image retrieval [27, 75, 77–79, 89] and object recognition [77, 90–95]. Figure 2.5 illustrates the idea of the bag-of-words approach.
Figure 2.5 Illustration of the bag-of-words representation of images

It is worth noting that the effectiveness of the bag-of-words approach is highly dependent on the quality of the nearest neighbor search and the clustering algorithm used. Nearest neighbor search is crucial since it is the most basic operation in both the clustering stage and the discretization stage. The clustering algorithm is important because it affects the quality and the size of the visual word vocabulary. In [77], exact nearest neighbor search together with naive K-means clustering is used. Due to the high computational cost of both nearest neighbor search and clustering, the authors only generate several thousand visual words, and the retrieval accuracy is suboptimal due to the limited vocabulary size. In [27], the authors proposed using hierarchical clustering to build a vocabulary tree, making it possible to create a much larger vocabulary and obtaining much better retrieval performance. In [70], approximate nearest neighbor search based on the best bin first method and multiple randomized KD-trees is used, and the reported retrieval accuracy is even better than that of [27]. Other approaches have been proposed to further improve the efficiency of bag-of-words based methods; the overall idea is to reduce the computational cost of large scale vector quantization. In [96], the authors proposed an algorithm that generates random histograms as a vector representation for bags of features, where LSH is used to create a high dimensional embedding for feature sets in order to simplify the matching problem. In [97], a random seeding approach is suggested to discretize each keypoint into a binary vector using a set of randomly generated hyper-spheres. The method is designed to approximate the similarity measure based on the optimal partial matching between two sets of keypoints, and it avoids the high computational cost of large scale high dimensional clustering.
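The quantize-then-index pipeline shared by these bag-of-words variants can be sketched as follows. This is a minimal sketch: in practice the codebook comes from k-means or a vocabulary tree, scoring uses tf-idf weights rather than raw counts, and all names here are illustrative.

```python
def quantize(feature, codebook):
    # Assign a feature vector to its nearest visual word (codebook row index).
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda w: d(feature, codebook[w]))

def bag_of_words(features, codebook):
    hist = [0] * len(codebook)
    for f in features:
        hist[quantize(f, codebook)] += 1
    return hist

def build_inverted_index(images, codebook):
    # visual word id -> list of (image id, term frequency)
    index = {}
    for img_id, feats in enumerate(images):
        for w, tf in enumerate(bag_of_words(feats, codebook)):
            if tf:
                index.setdefault(w, []).append((img_id, tf))
    return index

def search(query_feats, index, codebook):
    # Score images by accumulated matched-word counts via the inverted index,
    # touching only the postings of words that occur in the query.
    scores = {}
    for w, tf_q in enumerate(bag_of_words(query_feats, codebook)):
        if tf_q:
            for img_id, tf in index.get(w, []):
                scores[img_id] = scores.get(img_id, 0) + min(tf_q, tf)
    return sorted(scores, key=scores.get, reverse=True)

codebook = [[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]]   # stand-in for k-means centers
images = [[[0.1, 0.2], [9.8, 10.1]],                  # image 0
          [[19.7, 0.3], [20.2, 0.1]]]                 # image 1
index = build_inverted_index(images, codebook)
print(search([[0.0, 0.1], [10.2, 9.9]], index, codebook))  # image 0 ranks first
```

Only images sharing at least one visual word with the query are ever scored, which is why query time grows with the number of relevant images rather than the database size.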
All these methods make the bag-of-words model more efficient and sometimes even more effective. Although the bag-of-words model is the state-of-the-art method for large scale near duplicate image retrieval, it does have disadvantages. First, when quantizing a feature descriptor into a visual word, the discriminability of the feature decreases, since quite different descriptors can be mapped to the same visual word. Second, the quantization process does not take into account the geometric relationships between features. For instance, in the application of near duplicate image detection, near duplicate images are often the results of affine transformations of the original image. As a result, two descriptors that are close to each other are likely to match another two adjacent descriptors in the other image. However, quantized visual words do not respect geometric constraints like this. Both factors increase the possibility of false matches, which can hurt the retrieval accuracy. To address this problem, geometric verification is often utilized as a post-processing step to improve precision [30, 70, 77]. The basic idea is to estimate a transformation between the query image and each target image, based on how well the feature locations are predicted by the estimated transformation. The target images are then re-ranked based on the discriminability of the spatially verified visual words. However, since geometric verification is computationally expensive, in practice it is only applied to a small subset of the top-ranked candidate images. To summarize, the bag-of-words model is an efficient and effective approach for large scale image retrieval. It utilizes document indexing methods and retrieval models to retrieve relevant images. This process can be viewed as the first stage in a large scale retrieval system, because it quickly selects a subset of potentially relevant candidates.
A geometric verification step follows the first stage to re-rank the candidate images and improve precision. The verification step is essentially the second stage, since it provides more precise matching and is more computationally expensive. Some recent studies show a trend toward integrating geometric information directly into the matching algorithm instead of using it afterwards. For example, Wu et al. proposed to match keypoints in a bundled group instead of individually [3]. In particular, quantized keypoints are grouped based on MSER regions and the matching is performed at the region level. The similarity between two regions is determined by both the number of matched visual words and their geometric consistency. The geometric information is encoded by the relative positions of visual words within bundles. By enforcing geometric constraints, this algorithm not only asks for keypoint matches but also requires that the matches be in the right layout. Figure 2.6 illustrates the main idea of matching bundled features. It was shown in [3] that the bundled feature based approach achieves retrieval performance comparable to existing full geometric verification approaches while being much less (∼ 25%) computationally expensive. Another piece of work in a similar fashion is geometric min-hashing, as we will discuss in the next subsection.

Figure 2.6 The geometric ordering constraints for bundled features [3]. (a) matches preserving the relative ordering of the SIFT features inside a bundle; (b) matches not respecting geometric order are penalized.

2.3.3 Min-Hashing and Geometric Min-Hashing

Using min-hashing for image retrieval was proposed in [87]. The basic idea is to use a set-based similarity to measure the similarity between two images represented in bag-of-words form. Recall that for two sets A and B, the Jaccard similarity is defined as S_J(A, B) = |A ∩ B| / |A ∪ B| and the resulting distance is defined as 1 − |A ∩ B| / |A ∪ B|.
Under this distance metric, a family of functions called min-hash can be defined as follows: first define a random permutation π over the universe U, and then let h_π(A) = min{π(a) | a ∈ A}. It has been proved [72] that the probability that two sets have the same min-hash code equals the set-based similarity between them, i.e., Pr(h_π(A) = h_π(B)) = S_J(A, B). We can use min-hash to solve the image retrieval problem as follows. We first quantize all images into the bag-of-words representation. Then we define a permutation function over the entire vocabulary and hash the gallery images and the query image into min-hash bits. A gallery image is considered relevant if it shares the same hash bits with the query image. Note that the hashing is applied after vector quantization (e.g. clustering), so this method is somewhat related to the bag-of-words model. However, since it does not use document indexing and retrieval methods to retrieve images, we discuss it separately from the bag-of-words model. A recent extension of min-hashing takes geometric constraints into account [98]. Unlike the original min-hashing algorithm, where n random hash functions h_i are repeatedly applied to the entire image (the same visual word set for every h_i) to form an n-tuple, the input sets for geometric min-hashing (GmH) functions are no longer fixed. In particular, in GmH, the input set of h_{i+1} is the neighborhood set of the output of h_i. Furthermore, scale constraints are imposed when choosing subsequent input sets. To see how it works, imagine that when a keypoint match occurs, GmH further matches the region around that matching point. The image retrieval task benefits from GmH since region matching is more informative than matching a single visual word. Algorithm 4 describes how to generate the min-hash code given a bag-of-words representation of an image. Two images are considered to match each other only if they share the same hash string.
Algorithm 4 Procedure of Geometric Min-Hashing
Require: The bag-of-words representation of an image.
Require: The location information of the descriptors.
Require: k: the length of the hash string. m: neighborhood size parameter.
1: From the given image, select the descriptors that are mapped to different visual words and have at least m neighboring descriptors of similar scale, forming set S1.
2: Create each hash k-tuple as follows.
3: Apply a min-hash function on S1 obtained in step 1, which produces a central descriptor (represented by its visual word ID).
4: Find the neighboring descriptors that share a similar scale with the central descriptor, forming set S2. Apply min-hash on S2 to obtain the second descriptor (represented by its visual word ID).
5: Repeat until k descriptors are obtained.

2.3.4 Comparison: Retrieval Methods for Bag-of-Features

For the retrieval of bag-of-features data, we covered three categories of methods: pyramid matching, the bag-of-words model and min-hashing. The bag-of-words model is the state of the art for large scale image retrieval due to its efficiency and reasonable effectiveness (recall and accuracy). Compared with pyramid matching, its retrieval accuracy is sub-optimal, since pyramid matching approximates the optimal partial matching between two feature sets [32, 83]. In [83], pyramid matching was empirically compared against the bag-of-words model on an object recognition task based on the Caltech-101 data set, and pyramid matching was claimed to perform substantially better. As for retrieval speed, it has been reported [32] that a pair-wise pyramid matching between two images takes ∼ 0.05 seconds. Since pyramid matching requires a linear scan of the entire data set, the overall running time for processing an image is roughly 0.05N seconds, where N is the total number of images. The bag-of-words model is much more efficient.
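The geometric min-hashing procedure of Algorithm 4 can be sketched in a simplified toy form. The descriptor tuples, the neighborhood test, and the use of seeded random priorities to simulate a min-hash permutation over visual-word IDs are all illustrative assumptions, not the authors' implementation:

```python
import random

def gmh_sketch(descriptors, k, m, radius=1.0, scale_ratio=1.5, seed=0):
    """Toy geometric min-hash. descriptors: list of (word_id, x, y, scale) tuples.
    Returns a hash string of up to k visual-word IDs."""
    rng = random.Random(seed)

    def neighbors(d):
        # Spatially close descriptors of similar scale (simplified neighborhood test).
        return [e for e in descriptors if e is not d
                and abs(e[1] - d[1]) <= radius and abs(e[2] - d[2]) <= radius
                and max(e[3], d[3]) / min(e[3], d[3]) <= scale_ratio]

    # Step 1: keep descriptors with at least m similar-scale neighbors (set S1).
    pool = [d for d in descriptors if len(neighbors(d)) >= m]
    chosen = []
    # Steps 3-5: min-hash the pool to pick a central descriptor, then restrict the
    # pool to that descriptor's neighborhood, and repeat until k IDs are chosen.
    while pool and len(chosen) < k:
        priority = {d[0]: rng.random() for d in pool}   # simulated random permutation
        central = min(pool, key=lambda d: priority[d[0]])
        chosen.append(central[0])
        pool = [d for d in neighbors(central) if d[0] not in chosen]
    return "-".join(str(w) for w in chosen)

# Hypothetical descriptors: (visual word ID, x, y, scale).
descs = [(1, 0.0, 0.0, 1.0), (2, 0.5, 0.0, 1.0), (3, 0.0, 0.5, 1.0), (4, 0.2, 0.2, 1.0)]
print(gmh_sketch(descs, k=2, m=2))
```

As in the algorithm, two images are considered a match only if their hash strings are identical, so a collision implies a matched region rather than a single matched word.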
Based on reported experiments [3, 18] and our experience, it takes 10^-2 ∼ 10^-1 seconds to process a query against a large data set containing ∼ 100K images. Due to the use of the inverted index, the query time does not increase linearly in N; ideally, it should increase linearly w.r.t. the number of relevant images. To the best of our knowledge, there is no direct comparison between the min-hash approach and the other methods. But since it also relies on the bag-of-words model and only uses the hashing-based similarity in the second stage, we can assume its recall/accuracy and speed are close to those of the other bag-of-words based approaches. All these methods can be improved in their performance and/or speed with modifications. For instance, Pyramid Match Hashing [2] can greatly reduce the running time of pyramid matching at the cost of a minor accuracy drop: it usually needs to check only 2% ∼ 3% of the data set to achieve almost the same performance (∼ 99%) as the pyramid matching method. The bag-of-words model can be boosted with geometric constraints. By matching features based on regions and penalizing violations of geometric constraints, the retrieval accuracy can be improved by ∼ 50% at the cost of less than one second of additional processing time [3] (the experiments were done on a 1 million image dataset). For min-hashing methods, both speed and performance improvements were reported [98] when geometric constraints were taken into account.

2.3.5 Summary

To summarize our discussion of image retrieval based on the bag-of-features representation, we see that a large scale retrieval system still requires a two stage scheme to achieve both efficiency and reasonable accuracy. For the pyramid matching method, the matching accuracy is good enough but the efficiency is the bottleneck. To address this issue, pyramid match hashing was proposed to serve as a first stage that efficiently identifies candidate images.
On the other hand, for the bag-of-words approach the efficiency is not a big issue, but the retrieval accuracy suffers because of the information lost during vector quantization. As a result, a post-processing step, e.g. geometric verification, is proposed to serve as a second stage to improve the accuracy. Both approaches are augmented with methods that address either their efficiency or their accuracy issue as needed. Some recent approaches aim to unify the two stages; bundled features and geometric min-hashing are two of them. These two pieces of work, however, pose the geometric requirements differently: bundled features require the matches to be in the right layout, whereas geometric min-hashing asks for more matches within some region. In general, the idea of integrating the geometric information directly into the matching algorithm appears preferable.

CHAPTER 3 RANDOM SEEDING FOR KEYPOINT QUANTIZATION

3.1 Introduction

In Chapter 2, we covered various methods for near duplicate image retrieval based on the bag-of-features representation. Generally speaking, among all methods, the optimal partial matching based methods [81–83, 99] deliver superior performance. The basic idea is to measure the distance between two images by finding the mapping between the keypoints in the two images with the overall shortest distance. The main shortcoming of the optimal partial matching is its high computational cost: given a query image, a linear scan is required to compute the similarity between the query image and every image in the database, which does not scale well to a large image database. To improve the computational efficiency of optimal partial matching, several methods have been proposed [2, 77, 100]. Among them, the bag-of-words model [77] is the most popular one due to its empirical success. However, the bag-of-words model suffers from the following drawbacks:

• High computational cost in visual vocabulary construction.
One of the key steps in constructing the bag-of-words model is to cluster a large number of keypoints into a relatively smaller number of visual words. For large scale image retrieval, we often need to cluster billions of keypoints into millions of clusters. Although several efficient algorithms [69, 78, 89, 101–105] have been developed for large-scale clustering problems, it is still expensive to generate a vocabulary with millions of visual words.

• High computational cost in keypoint quantization. Given the constructed visual vocabulary, the next step in the bag-of-words model is to map each keypoint in a database image to a visual word, which requires finding the nearest visual word for every keypoint. Since the computational cost for keypoint quantization is linear in the number of keypoints, it is expensive to quantize the keypoints of a very large image database to a visual vocabulary. Even with the help of approximate nearest neighbor search algorithms, this step is still costly when a good approximation is desired.

• Inconsistent mapping of keypoints to visual words. The radius of a cluster (i.e., the maximum distance between the keypoints in the cluster and its center) can vary significantly from cluster to cluster. As a result, for clusters with a large radius, two keypoints can be mapped to the same visual word even if they differ significantly in visual features, leading to an inconsistent criterion for keypoint quantization and potentially poor performance in image matching.

• Lack of a theoretical analysis. Most published studies on the bag-of-words model are motivated by efficiency considerations and are primarily focused on its empirical performance. Although [35] showed that the similarity between two bag-of-features representations can be interpreted as a matching algorithm between descriptors, it did not establish the relationship between the bag-of-words model and the optimal partial matching.
Without a theoretical analysis, the success of the bag-of-words model can only be demonstrated based on empirical performance. As such, improvements to the bag-of-words model are mostly made by tuning parameters based on heuristics.

We present a new quantization scheme that explicitly addresses the shortcomings of the clustering-based approach for keypoint quantization. Unlike most studies on keypoint quantization that are motivated by computational efficiency, the proposed scheme aims to effectively approximate the optimal partial matching based similarity measure. The main idea is to discretize each keypoint into a binary vector by a set of randomly generated hyper-spheres. Each hyper-sphere is analogous to a cluster, and discretizes a keypoint into a binary bit: 1 when the keypoint is within the hyper-sphere and 0 otherwise. The bag-of-words representation for an image is then computed as a histogram over the hyper-spheres. Our theoretical analysis shows that despite its simplicity, the image similarity based on the proposed quantization scheme serves as an upper bound on the similarity based on optimal partial matching. To distinguish the proposed scheme from the clustering based approach for keypoint quantization, we refer to it as a random seeding approach for keypoint quantization. The computational advantage of the proposed approach arises from two aspects: first, the random seeding approach does not require clustering (e.g., the K-means algorithm) to identify visual words, leading to a significant saving in computation; second, given the constructed visual vocabulary, the complexity of quantizing m keypoints into n visual words is O(m ln n) for the clustering-based approach, and O(n ln m) for the proposed scheme.¹ Since m is usually at least one order of magnitude larger than n, the random seeding approach is significantly more efficient than the clustering-based approaches for keypoint quantization, which is confirmed by our empirical study.
Besides the computational efficiency, the random seeding approach deploys a more consistent criterion for keypoint quantization: a keypoint is mapped to a visual word if and only if the distance between the keypoint and the visual word is smaller than the radius r of the hyper-sphere. By choosing a relatively small r, the proposed quantization approach is able to ensure a large similarity between a keypoint and its mapped visual word. In contrast, the clustering-based approach maps a keypoint to its closest visual word even when they are separated by a large distance, leading to an inconsistent criterion for keypoint quantization and potentially poor performance in image matching.

¹The analysis here only considers the time for approximate nearest-neighbor/range search in keypoint quantization. It ignores the time for constructing KD-trees, which in practice is one order of magnitude smaller than approximate nearest-neighbor/range search.

Figure 3.1 The distribution of Euclidean distances between SIFT keypoints and their nearest visual words, computed based on the Oxford5K dataset (14,972,956 keypoints) with 10K clusters.

Figure 3.1 shows the distribution of distances between keypoints and their nearest cluster centers for the Oxford5K dataset with 10K clusters. It can be seen that these distances have a large variance, indicating a potentially inconsistent criterion for mapping keypoints to visual words. Furthermore, some point-to-center distances are very large, making the clustering-based quantization less accurate. We integrate the random seeding approach into a recognition/retrieval system for large-scale near duplicate image retrieval. We demonstrate the efficacy and efficiency of our system for image retrieval on a database of more than 1 million images.
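The hyper-sphere quantization just described can be sketched in a few lines. This is a toy brute-force version with made-up dimensions and radius; the actual system replaces the distance scan with approximate range search:

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 8, 2.5                                  # descriptor dimension and sphere radius (toy values)
keypoints = rng.normal(size=(200, d))          # all keypoints in the collection
# Random seeding: the N=16 "visual words" are simply sampled keypoints.
centers = keypoints[rng.choice(200, size=16, replace=False)]

def binary_code(x, centers, r):
    """b(x): one bit per hyper-sphere, 1 iff x lies inside it."""
    return (np.linalg.norm(centers - x, axis=1) <= r).astype(int)

def bag_of_words(image_keypoints, centers, r):
    """b(I): sum of the binary codes of an image's keypoints, a histogram over spheres."""
    return sum(binary_code(x, centers, r) for x in image_keypoints)

image = keypoints[:20]                         # pretend the first 20 keypoints form one image
hist = bag_of_words(image, centers, r)
print(hist.shape)                              # (16,) -- one count per visual word
```

Note how the mapping criterion plays out: a keypoint inside several spheres contributes to several words (soft membership), while a keypoint outside all spheres contributes nothing (outlier elimination).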
We further validate our system in the domains of tattoo image retrieval and automatic image annotation. The rest of this chapter is organized as follows: Section 2 reviews the related work on image retrieval and efficient methods for keypoint quantization; Section 3 describes the proposed scheme for keypoint quantization, together with a theoretical analysis that validates the proposed approach; Section 4 presents our empirical study of the proposed scheme through a large-scale near duplicate image retrieval experiment, and also validates our conclusions in two application domains, tattoo image retrieval and automatic image annotation; Section 5 makes concluding remarks about our contribution and suggests future research directions.

3.2 Related Work

Although the bag-of-words model has been shown to be effective for image/object retrieval, one of the major challenges, as pointed out in [101], is how to efficiently construct a vocabulary of millions of visual words for a very large image database. This requires developing efficient algorithms for large-scale data clustering. One of the early studies [77] used flat K-means clustering, but it is unable to handle large vocabularies. In [106], the authors showed that hierarchical clustering is able to greatly increase the size of the visual vocabulary and therefore is more suitable for large-scale image retrieval. Recent studies [69, 78, 89, 101] have shown that K-means based on approximate nearest neighbor search can scale to a very large vocabulary. The goal of these studies has been to improve the efficiency of approximate nearest neighbor search by a randomized algorithm, such as randomized trees [78] or Locality Sensitive Hashing (LSH) [102, 103]. Several algorithms have also been proposed for efficient keypoint quantization that avoid clustering [87, 96].
These methods essentially extend Locality Sensitive Hashing [103], which was originally designed for efficient nearest neighbor search in high dimensional spaces, to efficiently compute the matching between two sets of keypoints. The main objective of these algorithms is to improve the computational efficiency of keypoint quantization, but they have been shown to be less effective than the clustering based approach. In contrast, the proposed random seeding approach is motivated by the approximation of the similarity measure based on optimal partial matching. Our empirical studies show that the proposed approach not only has better retrieval performance but is also computationally more efficient than cluster-based quantization. Several approaches have been proposed to improve the overall retrieval performance of the bag-of-words model. In [107], relevance feedback is used to improve the retrieval accuracy of a bag-of-words model. In [108], the images returned by a bag-of-words model are reranked by taking into account the geometric constraints in computing the matching scores. The concept of bundled features is proposed in [3] to combine two sets of matching scores, one based on a bag-of-words model and the other based on the matching of image regions. In [109], the authors represented each image by two models, a bag-of-words model and a mixture model; the similarity between two images is calculated as a linear combination of matches based on the two models. In [110], multiple bag-of-words models are constructed and combined to improve the retrieval accuracy. While these approaches improve the retrieval performance, they are all built upon the basic bag-of-words model. Thus, by improving the efficiency and effectiveness of the bag-of-words model as proposed here, we essentially improve the performance of all these approaches for image retrieval.
Finally, since the proposed approach essentially represents each keypoint by a binary vector, it is also related to spectral hashing [111] and Hamming embedding [35], which aim to embed keypoints in a space of binary vectors. The proposed approach differs from these embedding approaches in that it produces a very sparse set of binary vectors, leading to efficient retrieval algorithms for large image databases.

3.3 Random Seeding for Keypoint Quantization

In this section, we first present the random seeding algorithm for keypoint quantization. We then establish the relationship between the proposed scheme for keypoint quantization and the similarity measure based on the optimal partial matching between two sets of keypoints. Let D = (I_1, ..., I_T) be a collection of T images. Each image I_i is represented by a set of n_i keypoints, denoted by X_i = (x_i^1, ..., x_i^{n_i}), where each keypoint x_i^j ∈ R^d is a vector of d dimensions. We define X as the set of all the keypoints from all the T images. The first step of the proposed scheme is to randomly sample N keypoints from this set X, denoted by z_1, ..., z_N. A hyper-sphere B_i is created centered at each sampled keypoint z_i, i.e., B_i = {x ∈ R^d : |x − z_i|_2 ≤ r}, where r is a predefined constant that is computed based on the average distance between any two keypoints in the image collection D. Using the hyper-spheres {B_i}_{i=1}^N, we quantize each keypoint x into a binary vector b(x) = (b_1(x), ..., b_N(x)), where each element b_i(x) is 1 if x ∈ B_i and 0 otherwise. The bag-of-words representation for image I_i, denoted by b(I_i) = (b_1(X_i), ..., b_N(X_i)), is computed by adding the binary vectors of all the keypoints in image I_i, i.e., b_k(X_i) = Σ_{j=1}^{n_i} b_k(x_i^j), k = 1, ..., N.

One of the key properties of the proposed algorithm is its criterion for mapping a keypoint to visual words: a keypoint x_i^j is mapped to a visual word z_k if and only if the distance between x_i^j and z_k is smaller than the threshold r. This is in contrast to the criterion deployed by the clustering approach for keypoint quantization, where a keypoint is always mapped to its closest visual word even if the closest cluster center is far from the keypoint. The advantages of this new mapping criterion are threefold:

• Consistent mapping. The mapping criterion implies that two keypoints will never be mapped to the same visual word if they are separated by a distance of more than 2r. By appropriately choosing the threshold r, we can ensure that only "similar" keypoints will be mapped to the same visual word, leading to a more consistent mapping criterion than the clustering-based approach.

• Mapping to multiple visual words. A keypoint can be mapped to multiple visual words if it is within distance r of more than one visual word. This is similar to the idea of soft membership [34], which allows us to capture the uncertainty in mapping keypoints to visual words. One problem of soft membership is that every keypoint is forced to be associated with multiple visual words even if it is much closer to one visual word than to the others.² In contrast, in the proposed scheme, soft membership is introduced for a keypoint only when it is close to multiple visual words simultaneously; a keypoint will be mapped to a single visual word if it is far away from the others.

• Eliminating outlier keypoints. A keypoint is ignored by the proposed algorithm if its distances to all the N visual words are larger than the threshold r.
The underlying rationale is that if a keypoint is far away from all the visual words, it is very likely to be an outlier and therefore should be skipped in the construction of the bag-of-words model. By eliminating the outlying keypoints, we not only improve the accuracy of image retrieval but also its efficiency, as a smaller number of keypoints need to be considered for image matching.

²Although this problem can be solved using a kernel function [112], this is computationally prohibitive when both the number of keypoints and the number of visual words are large.

Algorithm 5 Random Seeding Approach for Keypoint Quantization
Input:
• Image collection D = {I_1, ..., I_T}, with each image I_i represented by a set of keypoints
• Number of visual words N
• Distance threshold r
Visual Vocabulary Construction
• Generate visual words z_1, ..., z_N by randomly sampling N keypoints from X, the set of all keypoints of all T images in D
Keypoint Quantization
1. Construct a randomized KD-tree T for all the keypoints detected in D.
2. For i = 1, ..., N
• Use KD-tree T and approximate nearest neighbor search to efficiently find the keypoints within distance r of visual word z_i
• Compute the histogram of visual word z_i for all the images in D

To efficiently implement the proposed random seeding approach, we use the Fast Library for Approximate Nearest Neighbors (FLANN)³, which implements randomized KD-trees [105], to efficiently decide which subset of keypoints is within distance r of a given visual word. Algorithm 5 shows the major steps of the proposed algorithm. Compared to the clustering-based approach, there are two main features of Algorithm 5:

• It constructs the visual vocabulary by randomly sampling keypoints from the image collection D and thus avoids the high computational cost of clustering.
• It constructs a randomized KD-tree for all the keypoints in D, and iterates over the visual words to find the subset of keypoints that are within distance r of each visual word. As a result, the computational cost for quantizing m keypoints into N visual words is O(N log m). In contrast, the clustering-based approach constructs a randomized KD-tree for the visual words, and iterates over the keypoints to find the nearest visual word for each keypoint, leading to O(m log N) complexity for keypoint quantization. Since m is usually at least an order of magnitude larger than N, the proposed algorithm is more efficient than the clustering-based approach for keypoint quantization. The above analysis ignores the time for constructing the randomized KD-trees, which is usually an order of magnitude faster than mapping keypoints to visual words.

³http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN

The theoretical foundation of our quantization scheme is the relationship between the image similarity resulting from this quantization scheme and the image similarity resulting from optimal partial matching. We first show that the partial matching based similarity can be upper bounded by a smooth similarity function. We then derive the proposed scheme of keypoint quantization, which approximates this upper bound similarity function with a small error.

3.3.1 Distance Measure Based on Optimal Partial Matching

Let two images I_x and I_y be represented by the corresponding sets of keypoints, denoted by X = (x_1, ..., x_m) and Y = (y_1, ..., y_n), where each x_i and y_j is a vector in d-dimensional space. Among the various similarity measures that have been proposed, the similarity based on the optimal partial matching [81, 82, 113] is probably the most intuitive one, and has yielded state-of-the-art performance for both image retrieval and object recognition. Let π^1 : {1, ..., m} → {1, ..., n} map each point in set X to a point in Y, and π^2 : {1, ..., n} → {1, ..., m} map each point in set Y to a point in X. Then the distance between X and Y for the given mappings π^1 and π^2, denoted by d(X, Y; π^1, π^2), is computed as⁴

d(X, Y; π^1, π^2) = Σ_{k=1}^m |x_k − y_{π^1_k}|_2^2 + Σ_{k=1}^n |y_k − x_{π^2_k}|_2^2,   (3.1)

where π^1_k indicates the index of the point in set Y to which x_k is mapped; similarly, π^2_k indicates the point in set X to which y_k is mapped. Finally, the optimal partial matching distance between X and Y, denoted by d(X, Y), is obtained by minimizing d(X, Y; π^1, π^2) over the mappings π^1 and π^2.

⁴Note that we generalize the optimal partial matching in [113] by making it a symmetric distance measure.

3.3.2 Approximating the Optimal Partial Matching by a Smooth Similarity Function

It is in general difficult to derive an appropriate scheme of keypoint quantization directly from (3.1), because the distance function d(X, Y) is non-smooth and is defined implicitly through the minimization over the mappings π^1 and π^2. To address this difficulty, we approximate the distance function in (3.1) by a smooth similarity function, as shown in the following lemma.

Lemma 1. For any λ > 0, we have

−d(X, Y) + (m + n) d_0 ≤ (2/λ) Σ_{i=1}^m Σ_{j=1}^n exp(−λ [|x_i − y_j|_2^2 − d_0]),   (3.2)

where d_0 = min_{i,j} |x_i − y_j|_2^2.

Proof of this lemma can be found in the Appendix. Note that the equality in Lemma 1 is attained when d_0 = |x_k − y_{π^1_k}|_2^2 = |y_k − x_{π^2_k}|_2^2 for all k and as λ goes to infinity. The lemma significantly simplifies the similarity measure, leading to a potentially efficient implementation of image retrieval systems. In Figure 3.2 we show the difference between the similarity based on optimal partial matching and the upper bound presented in Lemma 1. In this example, we randomly generate a set X that contains 1,000 data points in R^128, where each feature is an integer uniformly drawn from the interval [1, 255].
The second bag Y is generated by perturbing the data points in X, adding scaled Gaussian noise γN(0, 1) to every feature, where γ > 0 controls the noise level and is varied from 0 to 50. We set λ = 25 in this study. We observe from Figure 3.2 that when the perturbation is small (i.e., X and Y are similar), the upper bound provides a good approximation to the similarity based on optimal partial matching (measured as −d(X, Y)); the gap between the two similarity measures becomes large when the similarity between X and Y is small. Since we are mostly interested in near duplicate image retrieval, it is much more important to make a good approximation when the optimal partial matching based distance is small, making the result of Lemma 1 valuable in our application domain.

Figure 3.2 An example showing the relationship between the optimal partial matching similarity and the upper bound in Lemma 1. It shows that when two sets of keypoints (corresponding to two images) are similar, the upper bound in Lemma 1 provides a good approximation to optimal partial matching.

Using Lemma 1, we introduce the following similarity measure between point sets X and Y:

s(X, Y) = (1/mn) Σ_{i=1}^m Σ_{j=1}^n exp(−λ |x_i − y_j|_2^2).   (3.3)

Note that compared to Lemma 1, we set d_0 = 0 in the above similarity measure and introduce the factor 1/mn, which is useful for simplifying our analysis as shown later. Instead of directly computing the distance d(X, Y) between a query image and a database image, we compute the similarity s(X, Y), and rank the images in the database in descending order of their similarity.

3.3.3 Random Seeding Algorithm

We now show that the similarity measure s(X, Y) defined in (3.3) can be computed efficiently by the proposed keypoint quantization scheme.
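Before developing that construction, note that both d(X, Y) and s(X, Y) can be evaluated brute-force on toy data: since the mappings π^1, π^2 are unconstrained, the optimal partial matching distance decomposes into per-point nearest-neighbor distances. A minimal sketch (toy sizes and a made-up λ, not the 1,000-point, 128-dimensional setup of Figure 3.2):

```python
import numpy as np

def partial_match_dist(X, Y):
    """d(X, Y): with unconstrained mappings, each point matches its nearest
    neighbor in the other set, which minimizes Eq. (3.1)."""
    D2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared distances, (m, n)
    return D2.min(axis=1).sum() + D2.min(axis=0).sum()

def smooth_sim(X, Y, lam):
    """s(X, Y) from Eq. (3.3)."""
    D2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-lam * D2).mean()

rng = np.random.default_rng(0)
lam = 0.01
X = rng.uniform(1, 255, size=(50, 16))
sims = {}
for gamma in (0.1, 5.0):                      # small vs. larger perturbation
    Y = X + gamma * rng.normal(size=X.shape)
    D2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    d0 = D2.min()
    lhs = -partial_match_dist(X, Y) + (X.shape[0] + Y.shape[0]) * d0
    rhs = (2 / lam) * np.exp(-lam * (D2 - d0)).sum()
    assert lhs <= rhs                         # the bound of Lemma 1 holds
    sims[gamma] = smooth_sim(X, Y, lam)
print(sims[0.1] > sims[5.0])                  # nearer duplicates score higher
```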
The overall idea is to first interpret the similarity measure s(X, Y) as an expectation over a hidden variable z. We then approximate s(X, Y) with high accuracy by replacing the computation of the expectation with an average over a finite number of samples. We finally show that the approximation over the finite number of samples leads to a bag-of-words model that can be computed efficiently.

Similarity Measure s(X, Y) as an Expectation

In order to interpret the similarity measure s(X, Y) as an expectation over a hidden variable z, we consider a probabilistic model for the similarity measure between two sets of keypoints. We assume that the given set of keypoints X = (x_1, ..., x_m) are i.i.d. samples from an unknown distribution, denoted by p(z|X). Using kernel density estimation [114], the estimate of p(z|X), denoted by p̂(z|X), is computed as follows:

p̂(z|X) = (1/m) Σ_{i=1}^m (µ^{d/2} / π^{d/2}) exp(−µ |z − x_i|_2^2),   (3.4)

where µ specifies the kernel width and d is the dimensionality. Similarly, the estimate of the unknown distribution for a set Y of n keypoints, denoted by p̂(z|Y), is computed as

p̂(z|Y) = (1/n) Σ_{i=1}^n (µ^{d/2} / π^{d/2}) exp(−µ |z − y_i|_2^2).   (3.5)

The following lemma shows that the similarity measure s(X, Y) can be expressed in terms of p̂(z|X) and p̂(z|Y).

Lemma 2. The following relationship holds if µ = 2λ:

s(X, Y) = ([2π]^{d/2} / µ^{d/2}) ∫ p̂(z|X) p̂(z|Y) dz.   (3.6)

Proof can be found in the Appendix.

Similarity Measure as an Expectation

To view ∫ p̂(z|X) p̂(z|Y) dz as an expectation, we need to introduce an appropriate distribution for z. Note that ∫ p̂(z|X) p̂(z|Y) dz is defined as an integration over the unbounded space of z. Hence, it is inappropriate to define a uniform distribution for z, which would be an improper prior in the Bayesian sense. To address this difficulty, we assume that |x|_2 ≤ R for any keypoint x, where R is a large constant.⁵
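Lemma 2 can be sanity-checked numerically in one dimension, where the integral ∫ p̂(z|X) p̂(z|Y) dz is easy to evaluate on a grid. A sketch with toy data and a made-up λ, assuming only NumPy:

```python
import numpy as np

lam, d = 0.5, 1
mu = 2 * lam                                  # the condition of Lemma 2

X = np.array([0.0, 1.0, 2.5])                 # toy 1-D "keypoints"
Y = np.array([0.2, 1.1, 3.0])

def kde(z, pts):
    """p_hat(z | pts): the kernel density estimate of Eq. (3.4) with d = 1."""
    return np.mean((mu / np.pi) ** 0.5 * np.exp(-mu * (z[:, None] - pts[None, :]) ** 2),
                   axis=1)

z = np.linspace(-20.0, 20.0, 400001)          # a grid covering effectively all of R
dz = z[1] - z[0]
integral = (kde(z, X) * kde(z, Y)).sum() * dz # numerical integral of p_hat * p_hat

s = np.mean(np.exp(-lam * (X[:, None] - Y[None, :]) ** 2))   # s(X, Y) from Eq. (3.3)
rhs = (2 * np.pi / mu) ** (d / 2) * integral                 # right side of Eq. (3.6)
print(abs(s - rhs))                           # near zero: the identity holds
```

The agreement follows from the closed form of the product of two Gaussian kernels, which is what the proof of Lemma 2 exploits.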
We introduce a domain Q for z, defined as

Q = {z : |z|_2 ≤ γR},

where γ ≥ 1 is a constant that will be determined empirically.⁵ We then approximate ∫ p̂(z|X) p̂(z|Y) dz by s_1(X, Y), defined as an integration over the domain Q, i.e.,

s_1(X, Y) = ∫_{z∈Q} p̂(z|X) p̂(z|Y) dz.   (3.7)

⁵This assumption is reasonable since each dimension of the SIFT descriptor is upper-bounded.

The following lemma bounds the difference between ∫ p̂(z|X) p̂(z|Y) dz and s_1(X, Y).

Lemma 3. The following inequality holds for any γ > 0:

|s(X, Y) − ζ s_1(X, Y)| ≤ s(X, Y) exp(−2µγ(γ − 2)R^2) (π^{1/2}/(2µ)^{1/2} + γR)^{d−1},   (3.8)

where ζ = µ^{d/2} / [2π]^{d/2}. Proof can be found in Appendix 6.3. As indicated by Lemma 3, a large value of γ leads to a small value of |∫ p̂(z|X) p̂(z|Y) dz − s_1(X, Y)|, implying that s_1(X, Y) is close to ∫ p̂(z|X) p̂(z|Y) dz. This is shown by the following corollary.

Corollary 1. If

γ ≥ max{4, π^{1/2} / [(2µ)^{1/2} R], [(d − 1) + √((d − 1)^2 + 4µ ln(1/δ))] / (2µR)},

we have

|s(X, Y) − ζ s_1(X, Y)| ≤ δ s(X, Y).

The result follows immediately from Lemma 3. With the above analysis, we can now focus on the similarity measure s_1(X, Y). Since the domain of z in s_1(X, Y) is bounded, we can introduce a uniform distribution for z, i.e.,

q(z) = I(z ∈ Q) / vol(Q),

where vol(Q) stands for the volume of domain Q and I(x) is an indicator function that outputs 1 when x is true and 0 otherwise. Using the distribution q(z), we can interpret s_1(X, Y) as the expectation of p̂(z|X) p̂(z|Y) over the random variable z, i.e.,

s_1(X, Y) = vol(Q) E_z[p̂(z|X) p̂(z|Y)].   (3.9)

In the remaining analysis, we drop the factor vol(Q) for the sake of simplicity.

Keypoint Quantization by Random Samples

We can further approximate the expectation in (3.9) by replacing the distribution q(z) with a finite number of samples. Let z_1, z_2, ..., z_N be N samples randomly drawn from the distribution q(z).
Using the samples {z_k}_{k=1}^N, we approximate s_1(X, Y) by s_2(X, Y), defined as follows:

s_2(X, Y) = (µ^d / [mnNπ^d]) Σ_{k=1}^N Σ_{i=1}^m Σ_{j=1}^n exp(−µ|z_k − x_i|_2^2 − µ|z_k − y_j|_2^2)
          = (µ^d / [mnNπ^d]) Σ_{k=1}^N (Σ_{i=1}^m exp(−µ|z_k − x_i|_2^2)) (Σ_{j=1}^n exp(−µ|z_k − y_j|_2^2)).   (3.10)

It is important to note that, in the expression of s_2(X, Y), by replacing the expectation over the distribution q(z) with the average over the N samples, we essentially represent each point set X = (x_1, ..., x_m) by a vector

f(X) = (f_1(X), ..., f_N(X)),   (3.11)

where f_k(X), k = 1, ..., N, is defined as

f_k(X) = (1/m) Σ_{i=1}^m exp(−µ|z_k − x_i|_2^2).

Using the notation f(X) and f(Y), we can write s_2(X, Y) = f(X) · f(Y), i.e., the similarity s_2(X, Y) can be computed as the dot product between two histograms. The theorem below shows that with a sufficiently large number of samples N, s_1(X, Y) and s_2(X, Y) differ only by a small value.

Theorem 1. With probability 1 − δ, we have

|s_2(X, Y) − s_1(X, Y)| ≤ √((1/(2N)) ln(2/δ)).

Proof can be found in Appendix 6.3. The result in Theorem 1 only applies to two images; we further generalize it to a collection of images. Let D = (I_1, ..., I_T) be a collection of T images and X_i = (x_i^1, ..., x_i^{n_i}) be the set of keypoints associated with image I_i. We have the following corollary, showing that for any two images I_i and I_j in D, their similarities s_1(X_i, X_j) and s_2(X_i, X_j) are close to each other.

Corollary 2. Given a collection D of T images, with probability 1 − δ, the following inequality holds for any two images I_i and I_j in D:

|s_2(X_i, X_j) − s_1(X_i, X_j)| ≤ √((1/(2N)) ln(T(T − 1)/δ)).

The above corollary is easily proved by a union bound. Although (3.11) provides a vector representation for each keypoint set, it is in general not a sparse representation, which makes it difficult to implement an efficient retrieval algorithm that scales well to a large image database.
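The finite-sample feature map of (3.11) can be sketched directly, together with a thresholded variant that drops entries below a cutoff η to obtain a sparse histogram, as developed next. Toy data throughout; the centers z_k are drawn at random here purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, mu, eta = 16, 64, 0.5, 1e-3             # toy dimensions and parameters

X = rng.normal(size=(30, d))                  # keypoints of one image
Y = X + 0.05 * rng.normal(size=X.shape)       # a near-duplicate image
Z = rng.normal(size=(N, d))                   # centers z_1..z_N

def f(P):
    """f_k(P) = (1/m) * sum_i exp(-mu * |z_k - p_i|^2), the map of Eq. (3.11)."""
    D2 = ((Z[:, None, :] - P[None, :, :]) ** 2).sum(-1)   # shape (N, m)
    return np.exp(-mu * D2).mean(axis=1)

def h(P):
    """Thresholded histogram: sum_i max(exp(-mu * |z_k - p_i|^2) - eta, 0)."""
    D2 = ((Z[:, None, :] - P[None, :, :]) ** 2).sum(-1)
    return np.maximum(np.exp(-mu * D2) - eta, 0.0).sum(axis=1)

s2 = f(X) @ f(Y)                              # s_2(X, Y) as a dot product of histograms
sparsity = float((h(X) == 0).mean())          # fraction of entries zeroed by the cutoff
print(s2 > 0.0, 0.0 <= sparsity <= 1.0)       # True True
```

The zeroed entries are what make an inverted-index implementation possible, at a small cost in approximation accuracy.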
To address this challenge, we introduce a threshold η > 0 and set the coefficient exp(−µ|z_k − x_i|_2^2) to zero whenever it is smaller than the threshold. This is equivalent to replacing the coefficient exp(−µ|z_k − x_i|_2^2) with max(exp(−µ|z_k − x_i|_2^2) − η, 0). As a result, we represent each keypoint set X = (x_1, ..., x_m) by a vector

h(X) = (h_1(X), ..., h_N(X)),    (3.12)

where h_k(X), k = 1, ..., N, is defined as

h_k(X) = Σ_{i=1}^m max(exp(−µ|z_k − x_i|_2^2) − η, 0).    (3.13)

Note that by introducing the threshold η into the bag-of-words model, the result in Corollary 2 is relaxed by a factor ηm. The threshold η essentially balances the tradeoff between efficiency and efficacy: a large η results in a sparse bag-of-words representation of images, leading to efficient retrieval of gallery images; on the other hand, a small η results in a more accurate approximation of the optimal partial matching similarity, leading to accurate retrieval of visually similar images.

Implementation

To generate the bag-of-words model h(X) in (3.12), the first step is to sample centers z_i from the domain Q. In practice, sampling z_i uniformly from a high dimensional domain is a nontrivial problem. In our implementation, we acquire the samples z_1, ..., z_N by first sampling N keypoints from all the keypoints in the image database D, and then scaling the sampled keypoints by a factor γ. This practice essentially assumes that all the keypoints are uniformly distributed within a hyper-sphere. Although this assumption may not be true, it significantly reduces the computational cost of the proposed algorithm and still yields good performance. We set γ = 1 in our experiments: although our theoretical result requires γ to be large, γ = 1 does yield desirable performance in our empirical study. The second step in building the bag-of-words model is to compute each element h_k(X) of h(X).
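The thresholded representation of (3.13) can be sketched directly (a minimal NumPy illustration with hypothetical names, not the production implementation):

```python
import numpy as np

def sparse_bow(X, Z, mu, eta):
    """h(X) of Eq. (3.13): h_k(X) = sum_i max(exp(-mu*|z_k - x_i|_2^2) - eta, 0).

    Coefficients below the threshold eta are zeroed, so a larger eta yields a
    sparser, more index-friendly bag-of-words vector.
    """
    d2 = ((Z[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # (N, m)
    return np.maximum(np.exp(-mu * d2) - eta, 0.0).sum(axis=1)
```

Raising eta can only shrink the support of h(X), which is exactly the efficiency/efficacy tradeoff described above.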
Note that h_k(X) defined in (3.13) is a real number, making it difficult to implement with a typical text search engine, which requires integer elements in a bag-of-words model. We thus simplify the function max(exp(−µ|z_k − y_j|_2^2) − η, 0) to the indicator I(exp(−µ|z_k − y_j|_2^2) >= η), where I(z) is an indicator function that outputs 1 if z is true and 0 otherwise. Note that the function I(exp(−µ|z_k − y_j|_2^2) >= η) is equivalent to I(|z_k − y_j|_2 <= r), with r = [ln(1/η)/µ]^{1/2}. In practice, we set r to be proportional to the average distance between any two keypoints in the image collection D. To estimate the average pairwise distance, we randomly sample 10,000 keypoint pairs from the entire collection. Computing I(|z_k − y_j|_2 <= r) efficiently requires identifying the subset of keypoints in the image collection D that are within range r of each center z_k; for this we use the Fast Library for Approximate Nearest Neighbors (FLANN) [6] in Algorithm 6.

[6] http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN

3.4 Experiments

To verify both the efficiency and efficacy of the random seeding based quantization scheme, we conduct experiments on near duplicate image retrieval on the Oxford5K dataset. Following the experimental settings in [108], we introduce into the experiment a distractor (background) database consisting of 1 million images downloaded from Flickr. We further validate the proposed quantization scheme by conducting two additional experiments in the domains of tattoo image retrieval and automatic image annotation. The tattoo image retrieval experiment is performed on a dataset of 100,000 tattoo images, where the objective is to find images with a similar tattoo pattern to the one given in the query. For the experiment on automatic image annotation, we used the ESP dataset [7], which consists of over 100,000 manually annotated images; the objective is to automatically annotate images with appropriate keywords that reflect their visual content.
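The quantization rule just described, together with the sampling-based estimate of the average pairwise distance, can be sketched as follows (a brute-force stand-in for the FLANN range search, for illustration only; names are ours):

```python
import numpy as np

def estimate_avg_distance(keypoints, n_pairs=10000, seed=0):
    """Estimate the average pairwise keypoint distance from randomly sampled pairs."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(keypoints), n_pairs)
    j = rng.integers(0, len(keypoints), n_pairs)
    return float(np.linalg.norm(keypoints[i] - keypoints[j], axis=1).mean())

def range_quantize(X, Z, r):
    """Integer bag-of-words: word k counts the keypoints x_i of X with
    |z_k - x_i|_2 <= r; keypoints far from every center are simply ignored."""
    d = np.linalg.norm(Z[:, None, :] - X[None, :, :], axis=2)  # (N, m)
    return (d <= r).sum(axis=1)
```

In the actual system the range queries are answered by FLANN rather than by the dense distance matrix above, but the resulting integer counts are the same.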
Table 3.1 summarizes the three experiments.

Experiment                     | Query Set        | Gallery Set                             | SIFT points   | Evaluation
Near duplicate image retrieval | 55 Oxford images | Oxford5K + Flickr1M (~1 million images) | 14.97M + 798M | mAP
Tattoo image retrieval         | 995 tattoos      | Tattoo60K + Esp40K (~100,000 images)    | 10.8M         | CMC
Automatic image annotation     | 1,000 ESP images | Esp110K (~110,000 images)               | 12.8M         | F1

Table 3.1 Summary of experiments

In all three experiments, we obtain the features based on the Harris detector and the SIFT descriptor, using the feature extractor in [115] with the default parameters. In the baseline method, hierarchical K-means is applied to cluster keypoints into visual words, and each keypoint is then mapped to its nearest visual word. We used the implementation of large-scale data clustering in [69], which has been optimized for efficient SIFT keypoint matching. Since both the baseline method and the proposed random seeding quantization depend on efficient large scale nearest neighbor search, we used FLANN [69] in both methods. The parameter setting for hierarchical K-means is tuned using the auto-tune feature provided in the library, using 10% of the dataset and setting the accuracy level to 95%. The search parameters for the proposed method are set accordingly to make the results comparable. For the implementation of the random seeding algorithm, unless otherwise specified, the radius r is set to 50% of the average pairwise distance between any two keypoints; we estimate the average pairwise distance by a sampling approach. After obtaining the bag-of-words representations (either by the clustering-based method or by random seeding), Okapi BM25, a state-of-the-art text retrieval model [116], was used to compute the similarity between a query and gallery images.
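Once images are reduced to bags of visual words, scoring a query against a gallery image is exactly text retrieval. A minimal sketch of Okapi BM25 scoring over visual-word lists (our own simplified implementation with the common defaults k1 = 1.2, b = 0.75, not the exact engine used in these experiments):

```python
import math
from collections import Counter

def bm25_score(query_words, doc_words, corpus, k1=1.2, b=0.75):
    """Okapi BM25 similarity between a query and one document, where both are
    lists of (visual) word ids and `corpus` is the list of all documents."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_words)
    score = 0.0
    for w in set(query_words):
        n_w = sum(1 for d in corpus if w in d)            # document frequency
        idf = math.log((N - n_w + 0.5) / (n_w + 0.5) + 1)  # smoothed IDF
        f = tf[w]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_words) / avgdl))
    return score
```

A real system would precompute document frequencies in an inverted index rather than scanning the corpus per query; the scoring formula is the same.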
For automatic image annotation, our retrieval system is first used to identify the gallery images that are visually similar to a test image; a kNN classifier [117] is then applied to determine the annotation words for the test image. All the experiments were performed on a Xeon 2.8G server with 144GB of memory [8]. In the following sections, we present the results of the individual experiments.

[7] http://www.gwap.com/gwap/gamesPreview/espgame/
[8] We note that the peak memory usage for all experiments is no more than 20GB.

3.4.1 Near Duplicate Image Retrieval

In this experiment, we first perform retrieval experiments on the Oxford5K dataset, a benchmark dataset for image retrieval. We then expand the database with 1 million Flickr images (i.e., Flickr1M) and report retrieval results on this large image database. We also evaluate the scalability of the random seeding approach for keypoint quantization by varying the number of gallery images.

Dataset

The Oxford5K dataset [118] is comprised of 5,062 images, among which 55 images serve as queries. We extracted a total of 14.97 million keypoints from this dataset. The distractor Flickr dataset (i.e., Flickr1M) contains 999,771 images that are represented by 798 million keypoints. In the large scale image retrieval experiment, these two datasets are combined, resulting in a dataset with a total of 1,004,833 images.

Evaluation

Following [35, 108], we use the mean average precision (mAP) as our evaluation metric. It computes the average precision for each query based on its precision-recall curve, and then averages the average precision over all the queries.

Experimental Results

We first report the experimental results on the Oxford5K dataset, followed by the experiments using the Oxford5K+Flickr1M dataset. Both experiments use the same set of 55 query images from the Oxford5K dataset.

Results for the Oxford5K dataset

We now present the experimental results on the Oxford5K dataset.
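The mAP metric described above can be computed from per-query relevance judgments; a minimal sketch (our helper names; when the full relevant set is known its size can be passed as the denominator, the usual convention):

```python
def average_precision(ranked_relevance, n_relevant=None):
    """AP for one query. `ranked_relevance` is the 0/1 relevance of the ranked
    results; precision is accumulated at each relevant position. If the total
    number of relevant items n_relevant is given, it is used as denominator."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    denom = n_relevant if n_relevant is not None else hits
    return precision_sum / denom if denom else 0.0

def mean_average_precision(rankings):
    """mAP: the mean of the per-query average precisions."""
    return sum(average_precision(r) for r in rankings) / len(rankings)
```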
We generate visual words either by clustering or by the random seeding approach, setting the number of visual words to 10K, 100K, and 1 million. Table 3.2 reports the computation time for generating bag-of-words models by the clustering-based method and by the random seeding method. To facilitate the comparison of the two methods, we divide the running time into two parts: the time for constructing the visual vocabulary (vocabulary construction time) and the time for mapping keypoints to visual words (quantization time). Note that since the random seeding algorithm finds visual words by randomly sampling a set of keypoints, it is basically free of vocabulary construction cost. We observe that not only is it computationally expensive to find visual words by clustering, it also takes the clustering-based method considerably longer to map keypoints to visual words than the random seeding approach. We also observe that the running time for the clustering-based method remains almost constant as the number of visual words increases, while it increases noticeably for the random seeding approach. This is because of the specific procedure used in Algorithm 6 for keypoint quantization, which takes O(N log m) time to quantize m keypoints into N visual words; in contrast, the clustering-based method takes O(m log N) time to accomplish the same quantization task. The weaker dependency of the clustering method on N makes its running time less sensitive to the number of visual words.

                                    | Vocabulary 10K | Vocabulary 100K | Vocabulary 1M
Clustering      Voc. Construction  | 9,178s [9]     | 9,941s          | 10,405s
                Quantization       | 42,984s [10]   | 43,726s         | 44,128s
Random Seeding  Voc. Construction  | 5s             | 7s              | 8s
                Quantization       | 1,425s         | 2,232s          | 3,721s

Table 3.2 Time for generating bag-of-words models for the Oxford5K dataset.
[9] Vocabulary construction takes more time than quantization because hierarchical clustering is used.
[10] It could be significantly larger than the time reported in other studies due to differences in the NN search parameter settings and/or the library used.

Table 3.3 summarizes the retrieval accuracy (mAP) and query time (i.e., the time for running text retrieval to identify the gallery images that are visually similar to the query image) for different numbers of visual words. We observe that for both methods, the retrieval accuracy improves from ~30% to over 50% as the number of visual words is increased from 10K to 1 million, which is consistent with previous studies [35, 108]. In all cases, we observe that the random seeding algorithm is more accurate than the clustering based method, although the gap narrows as the number of visual words increases. We attribute this performance difference to the inconsistent mapping of the clustering-based method. In particular, when the number of visual words is small, a keypoint is more likely to be mapped to a visual word even if they are separated by a large distance. The occurrence of inconsistent mapping in the clustering-based method is reduced as the number of visual words increases, leading to a smaller gap in retrieval accuracy. The query time results show that the random seeding algorithm is more efficient in retrieving visually similar images than the clustering-based approach, although the advantage diminishes as the number of visual words increases. We attribute this phenomenon to the elimination of outlier keypoints. In particular, when the number of visual words is small, it is very likely that a keypoint will be far away from all the visual words; those outlying keypoints are ignored by the proposed method in the quantization step, leading to efficient and effective retrieval.
As the number of visual words increases, the number of outlier keypoints decreases, leading to a longer retrieval time. In fact, we found that 84% of the keypoints were ignored when the number of visual words is 10K [11], and only 46% of the keypoints were ignored when the number of visual words is increased to 1 million. To further examine the mapping criterion employed by the random seeding algorithm, we perform the retrieval experiments using the random seeding algorithm but with the visual words identified by the clustering algorithm, referred to as cluster seeding. The retrieval performance of this approach is reported in the last row of Table 3.3. We observe that the cluster seeding method performs worse than the random seeding approach under all settings. This result can be explained by the fact that some clusters have large variance (cover a widely spread set of keypoints), which contradicts the mapping criterion (a range search that maps a keypoint to a visual word only when their distance is smaller than a threshold) used by the random seeding method.

[11] We also found that when such down-sampling is applied to the clustering-based approach, its bag-of-words model has a significantly worse performance.

                 mAP (query time)
                 Vocabulary 10K | Vocabulary 100K | Vocabulary 1M
Clustering       0.28 (0.39s)   | 0.41 (0.19s)    | 0.52 (0.08s)
Random Seeding   0.35 (0.05s)   | 0.45 (0.06s)    | 0.54 (0.08s)
Cluster Seeding  0.32 (0.06s)   | 0.38 (0.06s)    | 0.50 (0.09s)

Table 3.3 Retrieval accuracy (mAP) and query time (seconds) for near duplicate image retrieval on the Oxford5K dataset. "Cluster Seeding" in the last row refers to range search quantization using cluster centers.

Robustness

We examine the sensitivity of the random seeding method w.r.t. the radius r, the only parameter in the proposed keypoint quantization. Figure 3.3 shows how the retrieval accuracy of the random seeding algorithm is affected by r. The retrieval accuracy appears to be stable when r is sufficiently large (i.e., r >= 0.5R, where R is the estimated average pairwise keypoint distance).
We attribute the poor performance at small r to the effect of eliminating too many outlier keypoints: the smaller the radius r, the more keypoints are ignored in the quantization step, leading to a potentially poor bag-of-words representation of the images.

Figure 3.3 The effect of radius r (from 0.25R to 1.25R, for vocabularies of 10K, 100K, and 1M visual words) on the retrieval accuracy of the random seeding method for keypoint quantization. R is the average pairwise keypoint distance.

Oxford5K+Flickr1M Retrieval

We now test the random seeding method on the large database Oxford5K+Flickr1M that combines the two datasets. Following the setup in [108], we use a vocabulary of 1 million visual words generated from the Oxford5K dataset. Table 3.4 shows the retrieval accuracy, query time, and time for keypoint quantization for the Oxford5K+Flickr1M dataset. Note that we exclude the clustering time from the keypoint quantization time since the vocabulary has already been constructed from the Oxford5K dataset. We observe that the random seeding algorithm significantly outperforms the clustering method. We also examine the scalability of the random seeding algorithm in keypoint quantization. In this experiment, we fix the vocabulary to be 1 million visual words generated from the Oxford5K dataset, and gradually add Flickr images to the Oxford5K dataset. Figure 3.4 reports the quantization time of both methods for varying numbers of gallery images. As the size of the image database increases, the quantization time of the random seeding method increases at a slower rate than that of the clustering-based method, making it more suitable for handling large-scale dynamic image databases.
Figure 3.4 The quantization time of the proposed random seeding algorithm and the clustering based approach, as background images from the Flickr dataset are gradually added to the gallery (Oxford5K, +Flickr100K, +Flickr200K, +Flickr500K, +Flickr1000K). The vocabulary is fixed to be 1M visual words generated from the Oxford5K dataset. The vocabulary construction time is not shown here.

                 mAP  | Query Time (seconds) | Quantization Time (hours)
Clustering       0.33 | 0.84                 | 591.2
Random Seeding   0.45 | 0.33                 | 78.7

Table 3.4 Retrieval accuracy (mAP), query time, and quantization time (i.e., time for mapping keypoints to visual words) for the clustering-based method and the random seeding method.

3.4.2 Tattoo Image Retrieval

We now present our results on a large scale tattoo image dataset. Body tattoos have been used by individuals for over 5,000 years to differentiate themselves from others. Historically, the practice of tattooing was limited to particular groups such as motorcycle bikers, soldiers, sailors, and members of criminal gangs. A recent study published in the Journal of the American Academy of Dermatology in 2006 [12] showed the rising popularity of tattoos amongst the younger section of the population: about 36% of Americans in the age group 18 to 29 have at least one tattoo. Law enforcement agencies routinely photograph and catalog tattoo patterns for the purpose of identifying victims and convicts. The ANSI/NIST-ITL 1-2000 standard [119] defines eight major classes (e.g., human face, animal, symbols, ...) and 80 subclasses for categorizing tattoos. Given a query image of a tattoo, a search is performed by matching the class label of the query tattoo with the labels of tattoos in a database. This matching process based on human-assigned class labels is subjective and has poor retrieval performance. This motivated Lee et al.

[12] Tattoo Facts and Statistics, Oct. 2006, http://www.vanishingtattoo.com/tattoo_facts.htm
[120] to develop an image retrieval system for automatic tattoo image matching. Given a query tattoo image, the system attempts to retrieve all the tattoo images from a database that are photos of the same tattoo as the query.

Dataset

The database used here consists of 101,745 images, among which 61,745 are tattoo images that were provided by the Michigan Forensics Lab [120]; the remaining 40,000 images were randomly selected from the ESP game dataset [13]. The purpose of adding images from the ESP game dataset is to verify the capability of the tattoo image retrieval system in distinguishing tattoo images from non-tattoo images. On average, about 108 keypoints are detected from each image, leading to a total of more than 10 million keypoints for the entire image collection.

[13] http://www.gwap.com/gwap/gamesPreview/espgame/

Evaluation

To evaluate the retrieval performance of the tattoo image retrieval system, a duplicate image of the same tattoo is used as a query image to retrieve all of its visually similar image(s) in the database. For the tattoo image database used here, we used 995 tattoo images as queries, and manually identified the tattoo images in the database that are visually similar to each query. The cumulative matching characteristics (CMC) curve [121], which accumulates the correct number of retrieved images as the rank increases, was used as the evaluation metric. For a given rank position k, the CMC score is computed as the percentage of queries whose matched images are found in the first k retrieved images. We use the CMC curve, instead of the precision-recall curve, because it is one of the most widely used evaluation metrics in biometric and forensic analysis.

Experimental Results

Figure 3.5 shows the CMC curve of the tattoo retrieval system using the proposed random seeding algorithm and the clustering algorithm for keypoint quantization.
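The CMC score just defined can be computed directly from the rank at which each query first hits a correct match (a minimal sketch; the function name is ours):

```python
def cmc_curve(first_hit_ranks, max_rank):
    """CMC score at each rank k = 1..max_rank: the fraction of queries whose
    first correctly matched image appears within the top-k retrieved images.
    `first_hit_ranks` holds the 1-based rank of the first match per query."""
    n = len(first_hit_ranks)
    return [sum(1 for r in first_hit_ranks if r <= k) / n
            for k in range(1, max_rank + 1)]
```

By construction the curve is non-decreasing in k, which is why CMC plots always rise toward 1.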
Compared to the clustering approach, we observe an overall 5% improvement in the CMC score of the proposed random seeding algorithm. Figure 3.6 shows some tattoo search results for both the random seeding and the clustering based methods.

Figure 3.5 The CMC curves of the tattoo retrieval system using the proposed random seeding algorithm and the clustering algorithm for keypoint quantization. The number of visual words is set to be 1 million.

Figure 3.6 Example query images and the top 5 images retrieved by both the random seeding algorithm (first row) and the clustering based approach (second row).

Next, we evaluate the efficiency of the proposed scheme for keypoint quantization. It takes the proposed algorithm 3,476 seconds to quantize all the keypoints (i.e., 10.8 million keypoints in total) in the tattoo image collection into one million visual words (i.e., hyper-spheres). In contrast, the running time of the clustering-based method is 31,305 seconds, which is about 9 times longer than the proposed random seeding algorithm. This result shows that the proposed scheme is not only more effective but also more efficient than the clustering-based algorithm for keypoint quantization. We also observe that both methods have a similar retrieval time of 0.05 seconds per query.

3.4.3 Automatic Image Annotation

We also validate the proposed method on an automatic image annotation task. We implement a kNN classification based approach for automatic image annotation, which was shown to be particularly effective for large-scale image annotation [117]. More specifically, to identify the annotation words for a given test image, we first search for the images in the training set that are visually similar to the test image.
We then annotate the test image with the keywords that are most frequently used by the first k retrieved images.

Dataset

The ESP game dataset was used in our study. It consists of 109,720 images. On average, 117 keypoints were extracted per image, leading to a total of 12.8 million keypoints for the entire database. We remove the rare annotation words that are used by fewer than ten images in the database. This gives us a vocabulary of 5,395 unique annotation keywords. On average, each image is annotated by 7.5 words. We randomly selected 1,000 images from the dataset to form the test set, and used the remaining 108,720 images for training.

Evaluation

We evaluate the performance of our system by comparing the predicted annotation words to the ground truth. The F1 metric, which is widely used in multi-label learning [?], is used for evaluation. It is computed as follows: for a given test image, we first rank all the keywords in descending order of their occurrences in the first k images retrieved by the system. The first n_l words (with a random tie-breaker) in the ranking list, denoted by P = {w_1, ..., w_{n_l}}, are the predicted annotation words for the test image. Let Q = {w_1, ..., w_{n_t}} be the set of true annotations for the given test image. The F1 measure for this test image is computed as the harmonic mean of precision p and recall r, i.e., F1 = 2pr/(p + r), where p = |P ∩ Q|/|P| and r = |P ∩ Q|/|Q|. Given multiple test images, we first compute F1 for each test image and then average the F1 scores over all the test images as the final evaluation measure. Since the F1 measure depends on the number of predicted annotation words n_l, which is usually difficult to determine, we report the F1 measures for different numbers of predicted annotation words.

Experimental Results

First, we evaluate the annotation accuracy of our system based on the different methods for keypoint quantization.
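The per-image F1 measure defined above is a one-liner once the predicted set P and the ground-truth set Q are available (our helper name, assuming each word list has no duplicates):

```python
def annotation_f1(predicted, truth):
    """Per-image F1 = 2pr/(p+r), with p = |P∩Q|/|P| and r = |P∩Q|/|Q|."""
    overlap = len(set(predicted) & set(truth))
    if overlap == 0:
        return 0.0  # both p and r are zero; F1 is defined as 0
    p = overlap / len(predicted)
    r = overlap / len(truth)
    return 2 * p * r / (p + r)
```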
One of the key parameters used by the kNN based image annotation system is k, the number of nearest neighbor images used for predicting the annotation words for a test image. We estimate this parameter by a cross validation procedure that uses 20% of the training images to validate k. The optimal k found by cross validation for both quantization methods is 50. Figure 3.7 shows the F1 values of the kNN based image annotation system using the two different quantization methods, with n_l varied from 1 to 10. We observe that the proposed random seeding algorithm significantly outperforms the clustering based approach in annotation accuracy.

Figure 3.7 The results of a kNN based image annotation system using the proposed random seeding algorithm for keypoint quantization as well as the clustering based method. k is set to be 50 based on cross validation, and the number of visual words is set to be 1 million.

Finally, we evaluate the efficiency of the proposed random seeding algorithm for keypoint quantization. It takes 3,376 seconds for the random seeding method to quantize all the keypoints, while the quantization time for the clustering based method is 45,008 seconds. This again confirms that the proposed approach is more efficient than the clustering based method for large-scale keypoint quantization.

3.5 Conclusion

We have proposed an efficient algorithm for the keypoint quantization needed in the bag-of-words model. It is designed to approximate the similarity measure based on the optimal partial matching between two sets of keypoints. The main idea is to quantize each keypoint into a sparse vector of real numbers by a set of randomly generated hyper-spheres.
Our theoretical analysis shows that the similarity based on the proposed quantization algorithm provides an upper bound on the optimal partial matching based similarity. The proposed random seeding algorithm is more effective than the clustering-based approach because it deploys a more consistent criterion for keypoint quantization, which ensures that only similar keypoints are mapped to the same visual words. It is also more efficient than the clustering-based approach because (1) it avoids the high computational cost of large-scale clustering, and (2) its quantization time for mapping keypoints to visual words is sublinear in the number of keypoints, making it more suitable for large image databases than the clustering-based approach. Experimental results on large-scale near duplicate image retrieval, tattoo image retrieval, and automatic image annotation confirm that the proposed random seeding algorithm is more effective than the state-of-the-art clustering based approach for keypoint quantization. Besides efficacy, our empirical studies also indicate that the proposed scheme is more efficient than the clustering based method both for keypoint quantization and for retrieving visually similar images.

CHAPTER 4 RANDOM PROJECTION FILTERING

4.1 Introduction

As we discussed in Chapter 1, nearest neighbor search is a fundamental problem and has found applications in many domains. The challenge comes when both the number of data points and the dimensionality are high: classical exact search algorithms such as KD-tree search [28] often have time or space complexities that are prohibitive. Various approximate search algorithms have been proposed to address this efficiency issue, but they all have evident drawbacks.
For example, ANN [68] has been shown to be effective in moderate dimensions, but it is not suitable for dimensionality higher than about 20. The randomized KD-tree forest [122] shows promising results in some studies, but it does not have any theoretical guarantee on its search accuracy, partly because its parameters are often selected by heuristics rather than grounded theory. In addition, the memory footprint of a randomized KD-tree forest grows linearly with the number of points in the dataset, which poses a strong memory requirement and makes it impractical for efficient large scale memory-based search. Random projection based approaches such as LSH [71] do address the range search problem, but they usually need both a large number of random projections and a large number of hash tables to ensure high search accuracy and recall, making them inefficient and ineffective for many practical uses. In this work, we focus on nearly duplicate vector search, where we assume that nearest neighbors differ from the query object only by a small number of attributes. Our contribution is to propose a novel algorithm that is both efficient and accurate for this application domain by explicitly exploiting the statistical properties of near duplicity. The proposed hashing algorithm, termed Random Projection with Filtering (RPF for short), addresses the tradeoff between efficiency and accuracy by introducing a filtering procedure into the hashing algorithm. Based on compressed sensing theory [123] and the key property of nearly duplicate search, we show that most objects in the returned list are within a short distance of the query. Our empirical study on both synthesized and real world data sets demonstrates the effectiveness of the proposed approach compared to state-of-the-art hashing methods.

4.2 Problem Definition

We consider the case where each object is represented by a vector in a space of very high dimensionality. More specifically, let D = {x_1, ...
, x_N} be the collection of N objects to be searched, with each object represented by a vector x_i ∈ R^m of m dimensions. We assume both the dimensionality m and the number of objects N in the database are very large (e.g., m is a few thousand and N is a couple million). Given a query vector q ∈ R^m represented in the same space, our goal is to find the objects in D that are nearly duplicates of query q. In particular, we assume the target objects for query q should satisfy the following two objectives:

• Short distance: the matched object x should be close to query q in Euclidean distance, namely |x − q|_2 <= R, where R is a predefined distance threshold.

• Sparsity: the matched object x should differ from q only by a relatively small number of attributes, i.e., |x − q|_0 = Σ_{i=1}^m I(x_i ≠ q_i) <= S.

It is important to note that our definition of nearly duplicate goes beyond the simple ℓ2 distance between a query object and objects in the database. The sparsity criterion plays a very important role in designing an efficient search algorithm. It is also the sparsity criterion that distinguishes our work from most of the existing work on efficient nearest neighbor or range search in very high dimensional spaces.

4.3 Random Projection Filtering Algorithm

We first introduce a few notations that will be used throughout this chapter. We denote by M(q, r) the subset of objects in D that are within distance r of query q, and by M^c(q, r) the subset of objects in D that are at least distance r from q, i.e.,

M(q, r) = {x ∈ D : |x − q|_2 <= r}    (4.1)
M^c(q, r) = {x ∈ D : |x − q|_2 >= r}    (4.2)

We denote by M(q, R, S) the nearly duplicate objects in D for a given query q that satisfy the two objectives mentioned above, i.e.,

M(q, R, S) = {x ∈ D : |x − q|_2 <= R, |x − q|_0 <= S}    (4.3)

Given a matrix U = (u_1, ..., u_d) ∈ R^{m×d}, we denote by P_U = (U^T U)^{−1} U^T : R^m → R^d the projection operator that maps a vector in R^m into the subspace spanned by the column vectors of U.
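The two membership criteria in (4.3) can be checked directly for any candidate vector (a trivial NumPy sketch with our own function name):

```python
import numpy as np

def is_near_duplicate(x, q, R, S):
    """Check the two criteria of Eq. (4.3): |x - q|_2 <= R (short distance)
    and |x - q|_0 <= S (at most S attributes differ)."""
    diff = x - q
    return bool(np.linalg.norm(diff) <= R and np.count_nonzero(diff) <= S)
```

The point of the chapter, of course, is to avoid evaluating this test against every object in D; the random projections below prune the candidates first.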
Our work is motivated by the Johnson-Lindenstrauss theorem, which states that a low dimensional embedding of a very high dimensional space, generated by random projection, is likely to preserve the Euclidean distance between any two points in the original high dimensional space.

Theorem 2 (Johnson-Lindenstrauss). Let U ∈ R^{m×d} be a random matrix, with each element U_{i,j}, i = 1, ..., m, j = 1, ..., d, drawn independently from a Gaussian distribution N(0, 1). Let P_U : R^m → R^d be the projection operator that projects a vector in R^m into the subspace spanned by the column vectors of U. For any two fixed objects x, q ∈ R^m in the original m dimensional space and for a given ε ∈ (0, 1), we have

Pr[ |P_U x − P_U q|_2^2 <= (d(1 − ε)/m) |x − q|_2^2 ] <= exp(−dε^2/4),
Pr[ |P_U x − P_U q|_2^2 >= (d(1 + ε)/m) |x − q|_2^2 ] <= exp(−dε^2[1 − 2ε/3]/4).

The most straightforward approach for exploiting the Johnson-Lindenstrauss theorem is to project vectors in a high dimensional space into a low dimensional space generated by a random matrix, and then apply methods such as KD trees for efficient range or nearest neighbor search in low dimensions. The main problem with such an approach is that, in order to ensure all the nearest neighbors are included in the returned list with a large probability, d, the dimensionality of the low dimensional embedding, could be significantly large. To make our point clear, consider the objective of finding objects in D that are within distance r of a query q. We follow the direct approach of low dimensional embedding, and assume an object is returned if its distance to query q in the low dimensional embedding is less than or equal to r' = r[d(1 + ε)/m]^{1/2}. Then, for a fixed point x with |x − q|_2 <= r, the probability of it being included in the returned list is at least 1 − exp(−dε^2[1 − 2ε/3]/4).
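The distance-preserving behavior in the theorem can be checked numerically. The sketch below uses the scaled projection Pv = U^T v/√m, one convention for which the expected squared projected distance equals (d/m) times the original, matching the scaling in the theorem; it is an illustration, not the chapter's exact operator:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, trials = 1000, 50, 200
x = rng.normal(size=m)
q = x + rng.normal(scale=0.1, size=m)   # a nearby point
true_sq = np.sum((x - q) ** 2)

ratios = []
for _ in range(trials):
    U = rng.normal(size=(m, d))                      # Gaussian random matrix
    px, pq = U.T @ x / np.sqrt(m), U.T @ q / np.sqrt(m)
    ratios.append(np.sum((px - pq) ** 2) / true_sq)  # projected / original

mean_ratio = float(np.mean(ratios))  # concentrates around d/m = 0.05
```

Individual ratios fluctuate around d/m, and the fluctuation shrinks as d grows, which is the content of the two tail bounds above.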
Hence, the probability of returning all objects within distance r of query q is at least 1 − N exp(−dε^2[1 − 2ε/3]/4), implying that, in order to ensure all the objects within distance r of query q are returned with a large probability, d has to be on the order of O(4[ε^2(1 − 2ε/3)]^{−1} log N). Given a collection of one million objects, the required dimensionality of the embedding is at least 50, which is too high for any efficient range search algorithm to be effective. An alternative approach is to run multiple range searches, each in a very low dimensional space, and then assemble the results together by either anding or oring the results returned by each range search. In this way, we avoid the challenge of range search in a relatively high dimensional space. The extreme case is to run each range search in a one dimensional space and then merge the objects returned by each range search. In some sense, the hashing approaches can be thought of as a special case of this strategy, with the exception that only an approximate range search is implemented by a hashing function. This approach does guarantee that, with a large probability, all the objects within distance r of query q are found in the union of the sets returned by the multiple range searches. On the other hand, however, many of the returned objects may be far away from the query q, leaving too many returned objects whose distances to the given query need further inspection. To make our discussion concrete, consider the case where we run d one dimensional range searches, with each one dimensional embedding generated by a random vector. Similar to the previous example, we set the distance threshold to be r′ = r√((1 + ε)/m). Similar to the previous analysis, the probability of including all the objects within distance r of q in the lists returned by the low dimensional range searches is 1 − N exp(−dε^2[1 − 2ε/3]/4), identical to the previous analysis.
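The recall/precision behavior of merged one dimensional range searches can be illustrated on toy data. In the sketch below (all sizes, the seed, and the per-projection threshold convention are illustrative assumptions), OR-merging the d one dimensional searches keeps the planted near duplicate but also admits many far-away points, while AND-merging admits far fewer:

```python
import numpy as np

rng = np.random.default_rng(1)
m, N, r, eps = 500, 2000, 1.0, 0.5
data = rng.standard_normal((N, m))
q = data[0] + 0.01 * rng.standard_normal(m)  # plant index 0 as a near duplicate

# d one dimensional embeddings, each a random direction with N(0, 1/m) entries.
d = 10
dirs = rng.normal(0.0, np.sqrt(1.0 / m), size=(d, m))
thresh = r * np.sqrt((1 + eps) / m)          # per-projection range threshold
passed = np.abs(dirs @ (data - q).T) <= thresh  # boolean, shape (d, N)

union = set(np.flatnonzero(passed.any(axis=0)))         # OR-merge: high recall
intersection = set(np.flatnonzero(passed.all(axis=0)))  # AND-merge: high precision
print(len(union), len(intersection))
```

With this setup the planted duplicate survives the AND-merge, while the OR-merge returns hundreds of background points that would all need an exact distance check.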
However, the problem is that the probability of including objects far away from the query also increases significantly. In particular, the probability of including any object at distance r√((1 + ε)/(1 − ε)) may become min(1, d exp(−ε^2/4)). We refer to this challenge as the Precision-Recall tradeoff: in order to improve recall for a search, i.e., the chance of including all the nearby objects of a given query, we inherently have to lower the bar for precision, namely increase the probability of including objects far away from the given query, and vice versa. We could evidently make a better tradeoff between precision and recall by being more careful in designing the combination strategy. For instance, we could create K groups of d one dimensional embeddings: within each group, we merge the objects returned by the d one dimensional range searches by oring; we then combine the objects returned by the K groups by anding. With K such groups, the probability of including all objects within distance r of the query (i.e., recall) is reduced to 1 − NK exp(−dε^2[1 − 2ε/3]/4), while the probability of including any object at distance at least r√((1 + ε)/(1 − ε)) (i.e., precision) is improved to (1 − [1 − exp(−ε^2/4)]^d)^K. The main shortcoming of this strategy is that it requires both d and K to be large, leading to a very large number (i.e., dK) of one dimensional search structures, which could significantly lower the efficiency of the algorithm. This is in fact the case for locality sensitive hashing, which often requires thousands of bits before accurate search results can be found. In this chapter, we aim to address the challenge of the precision-recall tradeoff by exploiting the sparsity requirement of near duplicate detection, namely that only a small number of elements in the target object x differ from the given query q.
Before we discuss the exploitation of the sparsity requirement, we need to extend the Johnson–Lindenstrauss theorem, which holds for the subspace spanned by the column vectors of a random matrix U ∈ R^{m×d}. Instead of using the projection operator P_U, we will extend the result to use U directly. More specifically, we assume each U_{i,j} in matrix U is randomly drawn from a Gaussian distribution N(x|0, m^{−1}). Note that the difference between P_U and U is the matrix [U^⊤U]^{−1}. First, we have the following concentration inequality for the random matrix U.

Theorem 3. Let U ∈ R^{m×d} be a random matrix, with each element generated from a Gaussian distribution N(x|0, m^{−1}). We assume m > d. Let σ_max(U) and σ_min(U) be the maximum and minimum singular values of the matrix U, respectively. We have

Pr( σ_max(U) > 1 + √(d/m) + η + t ) ≤ exp(−mt^2/2)  (4.4)

Pr( σ_min(U) < 1 − √(d/m) − η − t ) ≤ exp(−mt^2/2)  (4.5)

where

η = γ^{1/6}(1 + √γ)^{2/3} / (2m^{1/3})

with γ ∈ (0, 1). Using the above theorem, we have the following corollary.

Corollary 3. Let U ∈ R^{m×d} be a random matrix with each element randomly drawn from N(x|0, 1/m). With a probability 1 − δ, for any vector x, we have

(1 − √(d/m) − η − √(2 ln(2/δ)/m))^2 |P_U x|_2 ≤ |U^⊤x|_2 ≤ (1 + √(d/m) + η + √(2 ln(2/δ)/m))^2 |P_U x|_2

With sufficiently large m and δ = e^{−d/2}/2, with probability at least 1 − e^{−d/2}/2, we have

(1 − 2√(d/m))^2 |P_U x|_2 ≤ |U^⊤x|_2 ≤ (1 + 3√(d/m))^2 |P_U x|_2

Theorem 4. (Modified Johnson–Lindenstrauss Theorem) Let U ∈ R^{m×d} be a random matrix, with each element U_{i,j}, i = 1, . . . , m, j = 1, . . . , d generated by randomly drawing from a Gaussian distribution N(x|0, 1/m). Assume m is sufficiently large. With probability 1 − δ, for any two fixed objects x, q ∈ R^m and for a given ε ∈ (0, 1), we have

Pr( |U^⊤x − U^⊤q|_2^2 ≤ (1 − 2√(d/m))^4 (d(1 − ε)/m) |x − q|_2^2 ) ≤ exp(−dε^2/4)

Pr( |U^⊤x − U^⊤q|_2^2 ≥ (1 + 3√(d/m))^4 (d(1 + ε)/m) |x − q|_2^2 ) ≤ exp(−dε^2[1 − 2ε/3]/4)

The above theorem is proved directly by combining Theorem 2 and Corollary 3.
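The singular value concentration in Theorem 3 can be checked empirically. The sketch below (sizes and seed are illustrative assumptions) draws U with N(0, 1/m) entries and verifies that its extreme singular values fall near 1 ± √(d/m):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 4000, 50
# Entries N(0, 1/m); for m >> d the singular values of U concentrate
# around 1, within roughly +/- sqrt(d/m) ~ 0.11 here.
U = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, d))
svals = np.linalg.svd(U, compute_uv=False)
smax, smin = svals[0], svals[-1]
print(smax, smin)
```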
As indicated by the above theorem, we can directly use the random matrix U to map a high dimensional vector into a low dimensional space while preserving the Euclidean distance among objects, suffering only a small discount factor when m is very large and d is significantly smaller than m. In order to improve the tradeoff between precision and recall, we aim to exploit the sparsity requirement in the problem of near duplicate detection. In particular, we will explore the Restricted Isometry Property (RIP) for sparse vectors, stated in the following theorem.

Theorem 5. (Restricted Isometry Property) Assume d < m and let U be a random matrix of size m × d whose entries are i.i.d. Gaussian with mean zero and variance 1/m. If S/m is small enough, then with overwhelming probability 1 − e^{−βm}, there exists a positive constant δ_S < 1 such that the following inequality holds for any c ∈ R^m with at most S non-zero elements:

(1 − δ_S)|c|_2^2 ≤ (m/d)|U^⊤c|_2^2 ≤ (1 + δ_S)|c|_2^2

It is important to distinguish the RIP condition from the Johnson–Lindenstrauss theorem, even though both claim that we can preserve the Euclidean distance among objects by mapping them to a low dimensional embedding using random projections. The probability in the JL theorem is attached to each vector: for any vector, there is a small but non-zero probability for the property to fail. It is this per-vector uncertainty that leads to the tradeoff between precision and recall. The probability in the RIP condition, on the other hand, is attached to the choice of the random matrix: nearly all random matrices generated with i.i.d. Gaussian entries will preserve the Euclidean distance for all sparse vectors simultaneously. Thus, we only suffer the probability of violation once, when choosing the random projection matrix, and if we choose a good matrix, the RIP condition guarantees the Euclidean geometry to be preserved. We now state our algorithm for efficiently identifying near duplicates of queries in high dimensions.
Let d be a modest integer such that (i) for a given ε, d is large enough to make exp(−dε^2/4) a small value, and (ii) d is small enough to allow for efficient range search in a space of d dimensions. Choose K as K = (CS log m)/d, where C > 1 is a parameter that needs to be adjusted in practice. We generate K random matrices U1, U2, . . . , UK of size m × d whose entries are randomly drawn from a Gaussian distribution N(x|0, 1/m). For each random matrix U_j, we map objects x from R^m to R^d by U_j x. We denote by D_j = {U_j x1, U_j x2, . . . , U_j xN} the collection of objects mapped to the low dimensional space using random matrix U_j. We then build a search structure (e.g., a KD-tree) for each D_j and assume there exists a range search algorithm that efficiently finds the subset of objects in D_j within distance r of the query U_j q, where q is the query in the original space R^m. We denote by M_j(q, r) = {i ∈ [N] : |U_j(x_i − q)|_2 ≤ r} the subset of objects whose projection under U_j is within distance r of the query q. The key difference between this algorithm and the strategy stated before is that we have a filtering procedure after merging the objects returned by the range search in each d dimensional space. This filtering procedure throws out the objects whose cumulative squared distance to the query (i.e., ∑_{j′=1}^{j} |U_{j′}(x − q)|_2^2) is larger than the threshold (1 + δ_S)R^2[Kd]/m. The following theorem states the guarantee of this algorithm.

Theorem 6. Given a query q, after running Algorithm 6, we have the following guarantees:

• with a high probability, all the objects that are near duplicates of query q will be included in the returned set M, i.e.,

Pr( M(q, R, S) ⊆ M ) ≥ 1 − N exp(−dKε^2[1 − 2ε/3]/4)

• all the returned objects are within a short distance of the query, i.e.,

max_{i∈M} |x_i − q|_2 ≤ √((1 + δ_S)/(1 − δ_S)) R

Proof. The first guarantee follows directly from the previous analysis and the fact that M(q, R, S) ⊆ M(q, R). The second guarantee follows from the fact that U = (U1, . . .
, UK) will result in a sufficiently large number of columns (compared to the sparsity S) and lead to the preservation of the Euclidean distance with an overwhelming probability.

Algorithm 6 Efficient Duplicate Detection for High Dimensional Vectors
1: Input:
   • R: distance threshold
   • S: the maximum number of entries in which the target object differs from a given query
   • q: a query
2: Initialization: M = ∅ and d_q = 0
3: for j = 1, 2, . . . , K do
4:    Run an efficient range search over D_j with query U_j q and threshold r set as
      r = R √(d(1 − ε)/m) / (1 − 2√(d/m))
5:    Return the indices of the objects satisfying the criterion in M_j(q, r) along with the distance information, i.e., for i ∈ M_j(q, r), compute d_i^j = |U_j(x_i − q)|_2^2
6:    for i ∈ M do
7:       if i ∉ M_j then
8:          d_i = d_i + |U_j q|_2^2
9:       else
10:         d_i = d_i + d_i^j, M_j = M_j \ {i}
11:      end if
12:      if m d_i/[dK] ≥ R^2(1 + δ_S) then
13:         M = M \ {i}
14:      end if
15:   end for
16:   for i ∈ M_j do
17:      d_i = d_q + d_i^j
18:      if m d_i/[dK] < R^2(1 + δ_S) then
19:         M = M ∪ {i}
20:      end if
21:   end for
22:   d_q = d_q + |U_j q|_2^2
23: end for
24: Return M

4.4 Experiments

We evaluate the random projection filtering algorithm with two different sets of experiments. In the first set of experiments, we generate a synthetic dataset where both the short distance and the sparsity constraints are satisfied. We study different aspects of our algorithm and show how RPF can be effective in solving the near duplicate search problem. In the second experiment, we test our algorithm on an image feature dataset where “duplicity” is often defined based on L2 distance and the sparsity constraint may not be perfectly met. We compare RPF with other existing hashing methods and verify that RPF is still effective in less ideal cases and real applications.

4.4.1 Experiments on Synthetic Datasets

The purpose of this experiment is to examine different aspects of the RPF algorithm. We examine its accuracy, efficiency and sensitivity to parameters on a 100K dataset.
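Before turning to the experimental results, the core loop of the method can be sketched in a few lines of Python. This is a simplified illustration, not the exact Algorithm 6: a brute-force scan stands in for the per-projection KD-tree range search, candidates are merged by intersection, and the radius and filtering constants are illustrative simplifications.

```python
import numpy as np

def rpf_search(data, q, R, K=10, d=20, eps=0.5, delta_s=0.1, seed=0):
    # Sketch of random projection filtering: project into K random
    # d-dimensional spaces, range-search each one (brute force here,
    # a KD-tree in the real algorithm), and filter candidates by their
    # cumulative projected squared distance.
    rng = np.random.default_rng(seed)
    N, m = data.shape
    radius2 = R ** 2 * d * (1 + eps) / m      # per-projection squared radius
    cum = np.zeros(N)                         # running sum of |U_j (x - q)|^2
    candidates = set(range(N))
    for _ in range(K):
        U = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, d))
        pd2 = np.sum(((data - q) @ U) ** 2, axis=1)
        cum += pd2
        candidates &= set(np.flatnonzero(pd2 <= radius2))
    # keep only points whose rescaled cumulative distance passes the
    # RIP-style threshold (1 + delta_s) R^2
    return {i for i in candidates
            if (m / (d * K)) * cum[i] <= (1 + delta_s) * R ** 2}
```

For instance, planting an exact duplicate of a database item as the query makes that item the only survivor on well-separated random data.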
For each query, the first 100 data points with the shortest distance to the query are considered the true nearest neighbors. We use average precision and average recall as metrics, following the general evaluation protocols of search algorithms [41]. In addition, we also measure the filtering percentage, which is the percentage of noise points that are filtered out by the algorithm. For efficiency, we report both the offline index building time and the online search time.

Dataset The synthetic dataset we generated includes three parts: background data, query data, and distractor data. The background data are 100K vectors of 500 dimensions, and each dimension is generated independently from a uniform distribution over the range [0, 100]. For the query dataset, we first generate 100 zero vectors, and then we randomly select P dimensions and set their values uniformly from [0, β]. In our experiments, we set P = 25 and β = 20. In this way the 100 query points are (2Pβ, 2P) near duplicates of each other and they are far away from the background data points. The distractor dataset is constructed similarly to the query data points, but with P non-zero dimensions generated from [0, β + δ]. In our experiments, we set P = 30 and δ = 2. Then, each of the 100 query data points is used to find its near duplicates, and we report the performance over all queries.

Algorithm Performance, Effects of K and d We first examine the algorithm's sensitivity to parameter settings. The major parameters in the proposed algorithm are K, d, δs and ε. Based on our experience, we fix δs = 0.05 and ε = 0.5. We test how K and d affect the efficiency and accuracy of the algorithm. We vary K over 1, 5, 10, 20, 50, 100 and d over 5, 10, 20, 50, 100, respectively. In figure 4.1 we show the average precision, recall, and filtering percentage under different parameter settings.
As we can see, the algorithm delivers very high precision as long as the values of K and d are not too small (d ≥ 10 and K ≥ 20). Meanwhile, over 95% of the target points can be recalled. We also observe that when d is very small (e.g. 5) compared with the original dimensionality (e.g. 500), a relatively larger number of projections is needed for good performance. These observations are consistent with the compressive sensing theory, which requires K = (CS log m)/d. Furthermore, we can see that for all the different settings, almost all noise points are filtered out by RPF, which validates the effectiveness of the filtering procedure. Based on the above results, we set K = 10 and d = 20 in later experiments.

RPF Efficiency From figure 4.2, we can see that both the offline tree building and the online query can be done quite efficiently, and that both scale linearly w.r.t. K.¹ Specifically, search on the 100K dataset can be completed within 0.1 seconds (with d = 20, K = 10), indicating the algorithm is useful in real applications. We also see that when d is too large, the algorithm becomes much less efficient. This is because exact KD-tree search in high dimensionality is not efficient. Based on our experience, d can be set on the order of √m, which keeps the algorithm very practical for high dimensional search. If a very large d is still needed, one can use approximate KD-tree search, which has been used on data with hundreds of dimensions. In the next subsection, we show that setting d = 50 can effectively search 5000 dimensional data.

Sensitivity and Effects of δs and ε We now show how the algorithm's performance changes as δs and ε vary. δs controls the filtering process: the smaller δs is, the stronger the filtering. ε controls the radius of the range search (through the factor 1 − ε): the larger ε is, the fewer data points are returned by the KD-tree searches. In figure 4.3, we see that when δs increases, the precision of the algorithm drops, as we loosen the filter.
However, when ε is very large, δs has very small effects on precision. This is because the KD-tree search only returns points that are really close to the query.

¹For illustration purposes, we set the y-axis limit to 2 in the figure. When d = 100, the query times are [0.4466, 2.3349, 4.3455, 7.9626, 17.9605, 35.4678].

[Figure 4.1 Performance of Random Projection Filtering (RPF) with different K and d: (a) average precision, (b) average recall, (c) average filtering percentage, each plotted against K for d = 5, 10, 20, 50, 100.]

[Figure 4.2 Offline tree building time and query time: (a) average tree building time (s), (b) average query time (s), each plotted against d for K = 1, 5, 10, 20, 50, 100.]

dataset size        5K      10K     20K     50K     100K
RP Precision        0.1188  0.1872  0.0171  0.0076  0.0058
RP Recall           1       1       1       1       1
RP Candidate Size   1078    620     6651    15356   21450
RPF Precision       1       1       1       1       1
RPF Recall          0.9929  0.9934  0.9880  0.9899  0.9940
RPF Candidate Size  99.29   99.34   98.80   98.99   99.40

Table 4.1 Effectiveness of the filtering procedure

For the recall, we generally see an inverse trend compared with the precision, which is the effect of the precision-recall tradeoff. We also see that when ε is 0.9, the recall is much lower than for other ε values, which indicates that the KD-tree search with a very small radius missed a number of true positives. Finally, we observe that the filtering percentage varies similarly to the precision, and the analysis for the precision applies similarly. Based on this result, we set ε = 0.5, δs = 0.1 in the following experiments.
Effectiveness of the Filtering Procedure In order to investigate the importance of the filtering process, we compare the performance of the search algorithm with and without (RP for short) the filtering procedure. By the filtering procedure we mean steps 12 and 18 of the algorithm, which conditionally add or remove a candidate point from the candidate pool. In this experiment, we set K, d, ε and δs as suggested by the earlier results, and varied the dataset size. Table 4.1 shows the average precision, recall and candidate set size of the algorithm with and without the filtering step. As we can see, the filtering procedure improves the precision and reduces the candidate set size very significantly, at the cost of about 1% recall. This shows the filtering procedure properly addresses the tradeoff between precision, recall, and search efficiency.

4.4.2 Experiments on an Image Feature Dataset

Here we compare the performance of the RPF algorithm with other baseline methods on a SIFT point dataset extracted from the Holidays and Copydays datasets [35].

[Figure 4.3 Performance of Random Projection Filtering (RPF) with different δs and ε: (a) average precision, (b) average recall, (c) average filtering percentage, each plotted against δs for ε = 0.1, 0.3, 0.5, 0.7, 0.9.]

Dataset The Holidays dataset includes 1491 holiday photos, including images attacked through rotations, viewpoint changes and illumination changes. The Copydays dataset extends the Holidays dataset by including cropped, blurred, painted, printed, scanned, and compressed images. We extract 10 million SIFT features [30] using the VLFeat SIFT feature detector. We then randomly sample 200 SIFT points as queries.
For these queries, we treat their 100 nearest neighbors as the true neighbors. In general, the distance between a data point and its 100th nearest neighbor is within ∼10% of the average pairwise distance in the dataset, which is a good indicator of whether two points are near duplicates of each other.

Settings We compare the search accuracy of RPF and the baseline methods based on the number of true duplicates found in the top 100 returned results. As we treat the closest 100 points as true near duplicates, this measure tells us exactly how many duplicates can be successfully found by an algorithm. The order of the points in this short list is less important because re-ranking can easily be done based on exact distance. This criterion has been used in recent comprehensive studies of hashing methods [124] to measure their search accuracy. For algorithm efficiency, we evaluate both the offline training time and the online search time. The training time of RPF includes both the data projection time and the tree building time. The training time of the baseline methods includes the projection time and the time to learn the hash codes. The parameters of both the proposed algorithm and the baseline algorithms are optimized based on the search results on an independent set of 100K SIFT features. We vary the number of bits of the hashing algorithms and the number of projections of the RPF algorithm. During the comparison, we fix K = 4 in RPF and make sure K·d equals the number of bits of the hashing algorithms, so that the total number of projections used by the different algorithms is the same. We used the following algorithms in our comparison: Locality Sensitive Hashing [66], Spectral Hashing [125], Laplacian Co-Hashing [126], PCA Hashing [127] and Self-Taught Hashing [128].

Results and discussions In figure 4.4 we show the F1 comparison of the RPF algorithm and the other baselines.
We observe that RPF performs significantly better than the other nearest neighbor search algorithms on the F1 measure, showing the effectiveness of both the searching procedure (which affects recall) and the filtering procedure (which affects precision). At the same time, we also notice the performance difference between the results here and the results on the synthetic dataset. By comparing table 4.1 and table 4.2, we see that although both the accuracy and the recall decrease on the real dataset, the accuracy is affected more than the recall. This is because, unlike the synthetic data, near duplicate SIFT points with small L2 distance may not perfectly satisfy the sparsity constraint assumed by the algorithm. For example, on each dimension, two feature points could differ slightly from each other, resulting in an overall small L2 distance with a potentially large L1 distance. To address this problem, we design another algorithm in chapter 5. In Table 4.3 we compare the search efficiency. RPF's online search time is comparable to that of the other search algorithms. Different from the other algorithms, RPF's search time is less sensitive to the total number of bits used. This is because RPF's search time is mainly determined by K (i.e. the number of low dimensional searches) but not d (the dimensionality on which the search is performed). As we fixed K and increased d, we improved the search quality without using more time. In Table 4.4 we compare the training/indexing efficiency. RPF takes more time for indexing compared with simple hashing methods like LSH and PCAH, but it is more efficient than SH and STH, which include more complex data-dependent training steps.

Num of Bits/Projections  8       16      24      32      40      48
RPF's Mean Accuracy      0.0020  0.1085  0.2320  0.2719  0.3125  0.3282
RPF's Mean Recall        0.0041  0.2148  0.4640  0.5438  0.6249  0.6564

Table 4.2 RPF Accuracy and Recall on 10 Million SIFT features

Finally, we compare the performance of RPF, Locality Sensitive Hashing and Spectral Hashing on a dataset of 100 million features.
We choose LSH and SH because they are the most representative methods and they both deliver comparable results against the other baselines. We fix the number of projections to 48 in this experiment. In Table 4.5 we report both the F1 score and the search time for all three methods. In both performance and efficiency, RPF scales better to much larger datasets than the baseline methods.

[Figure 4.4 Accuracy comparison on 10 Million SIFT Features: average F1 measure vs. dimensions/encoding bits for LSH, PCAH, LCH, SH, STH and RPF.]

Num of Bits/Projections  LSH     PCAH    LCH     SH      STH     RPF
8                        1.9667  2.1439  2.0513  2.0553  0.5619  3.9871
16                       2.2914  2.2846  2.3237  2.3215  0.6981  4.0055
24                       2.5390  2.7205  2.5613  2.6747  0.8446  4.0525
32                       2.7784  2.8670  2.8150  2.7542  0.9911  4.0748
40                       3.0225  3.0313  3.0432  3.1757  1.1030  4.1760
48                       3.3465  3.3747  3.2828  3.2417  1.2838  4.1781

Table 4.3 Search time (seconds) comparison on 10 Million SIFT features

Num of Bits/Projections  LSH      PCAH     LCH      SH        STH     RPF
8                        12.3969  12.8401  41.1827  26.8050   1037.1  177.7398
16                       18.1463  14.8493  39.5128  70.4104   2831.8  216.6184
24                       17.6880  18.0228  45.7640  142.8409  3131.7  251.8006
32                       23.5655  23.0640  49.3907  244.0592  4417.5  290.5117
40                       22.8688  31.2931  54.8485  396.7967  6564.9  318.5319
48                       25.5480  29.8792  62.3453  553.1427  7806.7  346.2411

Table 4.4 Training/indexing time (seconds) comparison on 10 Million SIFT features

Method                       F1 Score  Search Time (seconds)
Random Projection Filtering  0.4318    42.9208
LSH                          0.1821    49.3101
SH                           0.2153    50.2250

Table 4.5 Performance and scalability comparison, searching 100 million features

4.4.3 Experiment on the 80 Million Image Dataset

In this experiment, we compare our algorithm and the baseline algorithms on the 80 million tiny images dataset [129]. This dataset contains 79,302,017 color images collected from the Web. Each image is 32x32 and represented by 384-dimensional GIST features.
We randomly sample 200 images as queries and treat dataset images as their true nearest neighbors if they are within 10% of the average pairwise image distance. The pairwise image distance is estimated using 1 million image pairs. We fix the number of projections to 48 in this experiment. The baseline algorithms' parameters are trained using 10000 randomly sampled image features. In table 4.6, we compare the indexing time, search time and F1 measure of the different algorithms. In figure 4.5, we show examples of the image search results of the different algorithms. As we can see from the results, RPF and the baseline algorithms have comparable efficiency, but RPF's search results are more accurate.

Method           RPF     LSH    SH      PCAH   STH       LCH
F1               0.573   0.457  0.488   0.381  0.402     0.363
Index Time (s)   2777.1  316.6  5303.2  588.1  109200.0  727.9
Search Time (s)  35.9    38.3   40.5    37.7   25.7      41.4

Table 4.6 Performance comparison, searching 80 million images

[Figure 4.5 Examples of retrieved images of different algorithms for two queries ((a) Map, (b) Abraham Lincoln). From top to bottom: RPF, LSH, SH, PCAH, LCH and STH.]

4.5 Conclusion

In this chapter, we proposed an algorithm for large scale high dimensional near duplicate search. In particular, we utilized the properties of near duplicity and designed a random projection filtering algorithm supported by compressive sensing theory. Experiments on large scale high dimensional datasets show that the RPF algorithm successfully addresses the trade-offs between the three key factors in search, namely precision, recall, and efficiency, making it a more accurate, efficient and scalable algorithm for near duplicate search than state-of-the-art hashing algorithms.

CHAPTER 5
EFFICIENT RANGE SEARCH IN HIGH DIMENSIONS

5.1 Introduction

In Chapter 4, we proposed the Random Projection Filtering algorithm for near duplicate search. Although the algorithm shows encouraging results in empirical studies, it does have some limitations.
Recall that the random projection filtering algorithm has two assumptions, namely the short distance assumption and the sparsity assumption. In some applications, the sparsity assumption may not hold due to the large value range of the feature vectors. For example, the "duplicity" of image features like SIFT points is often defined based on L2 distance. A pair of duplicate points with a small L2 distance could have a big L1 distance if they differ just slightly on many dimensions. This is actually often the case because each dimension of a SIFT descriptor can take any value between 0 and 255. This motivates us to develop another search algorithm that does not require the sparsity assumption and is applicable to a wider range of problems. In this chapter, we introduce a new range search algorithm. Given a query q and a distance threshold r, the goal is to efficiently identify the subset of data points from a database that are within distance r of q. Such problems arise naturally in many areas such as database search and image search. Existing work on range search includes both tree-based methods and locality sensitive hashing. When data points are represented by low dimensional vectors, a number of efficient solutions based on pre-built tree structures have been proposed (e.g., KD-trees [28, 57] or R-trees [130]). However, these algorithms' complexity increases exponentially with the dimensionality, making them very inefficient for high dimensional search [1, 57]. Hashing based methods, such as locality sensitive hashing and its variants [29, 65, 66, 124], do not directly address the exact range search problem, and thus deliver unsatisfactory accuracy for range searches. Interestingly, despite the challenges presented above, we find that both the search efficiency and the search accuracy can be greatly improved if the range search is done in the near duplicate search scenario.
In this chapter, we present a new randomized approach for efficient range search for near duplicate vectors in high dimensional spaces. It projects both the data points and the query point into a one dimensional space by a random projection, and performs a one dimensional range search, using binary search, to find the subset of data points that are within the range of a given query. By choosing an appropriate threshold, we can guarantee a high recall, namely that, with a high probability, the proposed approach returns all of the data points within the given range of the query. This is in contrast to other random projection approaches [131] that require projecting the data points into a space of multiple dimensions and performing a range search over the projected space. For instance, according to the JL lemma [131], to preserve the pairwise distances, the number of needed random projections is O(ln N/ε^3), where N is the number of data points and ε is the target error in the distance approximation. For compressive sensing based approaches [123], at least O(k log d) random projections are needed to preserve the pairwise distance when the difference between the data points and the query point is sparse, where d is the dimension of the data points. We note that for real-world problems, both approaches may require many random projections, leading to difficulty in searching over the projected space. On the contrary, we project each data point onto a single dimension, which allows us to perform an efficient range search using binary search. To ensure a high precision of the search result, i.e., that most of the returned data points are within the given range of the query, we perform multiple one dimensional searches, and only return the subset of data points that pass all of the one dimensional search tests. The main contribution of this work is to develop and evaluate a new efficient range search algorithm for large scale and high dimensional data.
We prove theoretical guarantees for the proposed algorithm and evaluate its empirical performance on a dataset of 1.1 billion image features [132].

5.2 Problem Definition

The range search problem is defined as follows. Let D = {x1, . . . , xN} be a collection of vectors (i.e., the database), where xi ∈ R^d and d ≫ 1 is the dimension of the space. Let q ∈ R^d be a query point. The goal of range search is to find the subset of data points in D that are within distance r of q, where r is the range specified by the user. Define D(r, q) as the subset of data points in D that are within distance r of the query q, i.e.,

D(r, q) = {x ∈ D : |x − q|_2 ≤ r}

Here, we assume both the data points in D and the query q have bounded length, i.e., |x|_2 ≤ 1, ∀x ∈ D and |q|_2 ≤ 1.

5.3 Algorithm and Theory

To speed up the search, we propose to convert the high dimensional range search into a sequence of one dimensional range searches. More specifically, we randomly sample multiple vectors from a Gaussian distribution, denoted by u1, . . . , um. For each randomly sampled vector ui, we project both the query q and the data points in D along the direction of ui, and find the subset of data points in D whose projections are within a certain threshold ρ (not r, but depending on r) of the projection of the query q, denoted by Di. To implement an efficient one dimensional range search, we sort the projections of the data points in D in descending order, which enables a binary search for the subset of data points whose projections are within a given range. It is worth noting that the sorting is done offline and

Algorithm 7 Efficient Range Search using Gaussian Random Variables
1: Input:
   • D = {x1, . . . , xN}: the database
   • r > 0: the specified range
   • τ ≥ 1: threshold factor
   • m: the number of one dimensional range searches
   • q: the query point
2: // Offline processing
3: Randomly sample U = (u1, . . . , um), where uk ∼ N(0, (1/d)I), k ∈ [m]
4: for i = 1, . . .
, N do 5: Compute zi = xi U 6: end for 7: //Online processing q q 8: Compute the projection zq = (z1 , . . . , zm ) = q U for query q 9: for k = 1, 2, . . . , m do 10: if k = 1 then q 11: Compute the set D1 (r, q) as D1 (r, q) = i ∈ [N] : |zi,k − zk | ≤ τ √r d 12: else q 13: Update the set Dk (r, q) as Dk (r, q) = i ∈ Dk−1 (r, q) : |zi,k − zk | ≤ τ √r d 14: end if 15: end for 16: Output the set Dm (r, q). thus will not affect the online search time. The intersection of the data points returned by all one dimensional range search is used to form the final result, i.e., D(r, q) = ∩m i=1 Di . Algorithm 7 gives the detailed steps for the proposed algorithm. We note that although the proposed algorithm is also based on random projection, it is different from the existing approaches in that it does not try to approximate the pairwise distance by random projection. Instead, it tries to approximate the binary decision, i.e., if a data point is within certain range of a query, by a sequence of binary decisions based on the one dimensional projections of both the data point and the query. As the first step of our analysis, we show that for one dimension range search with appropriately chosen threshold ρ, it is able to ensure a high recall, namely with a high probability, a data point within the specified range will be returned by the one dimension range search. The center to our analysis is the Bernstein inequality [133] given in the following theorem. Theorem 7. (Bernstein inequality) Let X1 , . . . , Xm be independent copies of X, with E[X] = 0, 92 |X| ≤ M, and E[X 2 ] = σ 2 . With a probability 1 − δ , we have 1 m 4M ln(1/δ ) Xi − E[X] ≤ + ∑ n i=1 3m 2σ 2 ln(1/δ ) m Theorem 8. Let u drawn from a Gaussian distribution N (0, d1 I). With a probability 1 − 1/[2d 4 ], for a fixed data point x, we have √ |x − q| 6 ln d √ (x − q) u ≤ d Proof. Since we fix |x − q|, to bound |(x − q) u| is equivalent to bound |ui | for any direction i. 
Using the property of the error function, i.e., erfc(x) ≤ 2e^{−x²}/√π, we have
$$\Pr\left(|u_i| \ge \gamma\sqrt{\frac{\ln d}{d}}\right) = \operatorname{erfc}\left(\gamma\sqrt{\frac{\ln d}{2}}\right) \le \frac{2\exp(-\gamma^2\ln d/2)}{\gamma\sqrt{\pi\ln d}}$$
We complete the proof by setting γ = √6.

As indicated by Theorem 8, when the dimensionality d is very high, with a high probability, |(x − q)ᵀu| is upper bounded by O(|x − q|√(ln d)/√d). The problem with Theorem 8 is that it does not provide a lower bound on |(x − q)ᵀu|. Without a lower bound, the one-dimensional range search may result in a high recall but a very poor precision, as many of the returned data points can be far away from the query q. We address this problem by performing multiple one-dimensional range searches, and only the data points found by all one-dimensional range searches are returned. This enables us to provide a lower bound for our search algorithm.

Theorem 9. Let u_1, ..., u_m be drawn independently from N(0, (1/d) I). With a probability at least 1 − m/[2d³], we have
$$\max_{1\le i\le m} \left|(x-q)^\top u_i\right| \le \frac{|x-q|\sqrt{6\ln d}}{\sqrt{d}}$$
If m ≤ (2/25) d³, then with a probability 1 − 3d⁻², we have
$$\frac{1}{m}\sum_{k=1}^{m} \left|(x-q)^\top u_k\right|^2 \ge \frac{|x-q|^2}{d}\left(1 - \frac{m}{2d^2} - \frac{18[\ln d]^2}{m} - \sqrt{\frac{25[\ln d]^2}{m}}\right)$$
By setting m = m₀ = 100[ln d]², under the assumption d ≥ 20 ln d, we have, with a probability 1 − 3d⁻²,
$$\max_{1\le k\le m} \left|(x-q)^\top u_k\right| \ge \frac{|x-q|}{2\sqrt{2d}}$$
The proof can be found in Appendix 6.3. As indicated by Theorem 9, given a sufficiently large m, with a high probability, we have
$$\frac{|x-q|}{2\sqrt{2d}} \le \max_{1\le k\le m} \left|(x-q)^\top u_k\right| \le \frac{|x-q|\sqrt{6\ln d}}{\sqrt{d}}$$
Hence, if we set τ in Algorithm 7 to be √(6 ln d), with a high probability, we will be able to guarantee
• high recall, i.e., any x satisfying |x − q| ≤ r will be returned, and
• high precision, i.e., any x satisfying |x − q| > 4√3 r √(ln d) will NOT be returned.
In the case when the number of dimensions is very high, sampling u_1, ..., u_m from a Gaussian distribution can be computationally expensive.
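Algorithm 7, with τ = √(6 ln d) as suggested by the theory above, can be sketched end to end in NumPy. This is a toy sketch under our own assumptions: synthetic unit-norm data stands in for image features, and the sizes and seeds are ours.

```python
import numpy as np

def build_index(X, m, seed=0):
    """Offline phase of Algorithm 7: sample U with columns ~ N(0, I/d),
    project the database, and sort each projection column once."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    U = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, m))
    Z = X @ U                              # (N, m) one-dimensional projections
    order = np.argsort(Z, axis=0)          # per-column sorted point ids
    return U, Z, order

def range_search(X, index, q, r, tau):
    """Online phase: intersect the m one-dimensional range searches."""
    U, Z, order = index
    d = X.shape[1]
    rho = tau * r / np.sqrt(d)
    zq = q @ U
    surviving = None
    for k in range(U.shape[1]):
        col = order[:, k]
        zk = Z[col, k]                     # sorted projections for direction k
        lo = np.searchsorted(zk, zq[k] - rho, side="left")
        hi = np.searchsorted(zk, zq[k] + rho, side="right")
        Dk = set(col[lo:hi].tolist())
        # Algorithm 7 updates D_k from D_{k-1}; intersecting the full
        # per-direction sets yields the same output.
        surviving = Dk if surviving is None else surviving & Dk
        if not surviving:                  # early exit once nothing survives
            break
    return surviving

rng = np.random.default_rng(1)
d, N = 64, 5000
X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
q = X[7] + 0.005 * rng.normal(size=d)      # near duplicate of x_7

index = build_index(X, m=16)
result = range_search(X, index, q, r=0.3, tau=np.sqrt(6 * np.log(d)))
assert 7 in result                          # the near duplicate is recalled
```

Each one-dimensional test is cheap (two binary searches), and the intersection is what restores precision, mirroring the two bullet guarantees above.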
We address this challenge by replacing the Gaussian random variables in Algorithm 7 with Rademacher random variables (a Rademacher random variable takes the values −1 and +1 with equal probability). More specifically, to construct each random vector u_i, we first sample d Rademacher variables σ_{i1}, ..., σ_{id}, with Pr(σ_{ik} = −1) = Pr(σ_{ik} = +1) = 1/2, and form u_i as u_i = (σ_{i1}, ..., σ_{id}). The details are given in Algorithm 8. Below we show that the Rademacher-based approach yields performance similar to the Gaussian-based one. As in the analysis for Algorithm 7, we first analyze the performance of the one-dimensional range search based on Rademacher random variables.

Theorem 10. Let u = (1/√d)(u_1, ..., u_d) be a random vector with u_i drawn from a Bernoulli distribution Pr(u_i = 1) = Pr(u_i = −1) = 1/2. Then, with a probability 1 − 2/d³, for a fixed data point x, we have
$$\left|(x-q)^\top u\right| \le \frac{|x-q|\sqrt{6\ln d}}{\sqrt{d}}$$

Algorithm 8 Efficient Range Search using Rademacher Random Variables
 1: Input:
    • D = {x_1, ..., x_N}: the database
    • r > 0: specified range
    • τ ≥ 1: threshold factor
    • m: the number of one-dimensional range searches
    • q: the query point
 2: // Offline processing
 3: Randomly sample U = (1/√d)(u_1, ..., u_m), where U_{i,j} is a Rademacher variable with Pr(U_{i,j} = −1) = Pr(U_{i,j} = +1) = 1/2
 4: for i = 1, ..., N do
 5:    Compute z_i = x_i^T U
 6: end for
 7: // Online processing
 8: Compute the projection z^q = (z^q_1, ..., z^q_m) = q^T U for query q
 9: for k = 1, 2, ..., m do
10:    if k = 1 then
11:       Compute the set D_1(r, q) = { i ∈ [N] : |z_{i,k} − z^q_k| ≤ τ r / √d }
12:    else
13:       Update the set D_k(r, q) = { i ∈ D_{k−1}(r, q) : |z_{i,k} − z^q_k| ≤ τ r / √d }
14:    end if
15: end for
16: Output the set D_m(r, q).

Proof. We write (x − q)ᵀu = (1/√d) ∑_{i=1}^{d} (x_i − q_i) u_i. Since E[(x − q)ᵀu] = 0, using the Bernstein inequality, we have, with a probability 1 − δ,
$$\left|(x-q)^\top u\right| \le \frac{|x-q|\sqrt{2\ln(2/\delta)}}{\sqrt{d}}$$
We complete the proof by setting δ = 2/d³.
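The Rademacher construction can be checked numerically: since E[uuᵀ] = I/d for both constructions, the squared one-dimensional gaps concentrate around |x − q|²/d in either case. A small sketch, with synthetic data and sizes of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, N = 256, 16, 2000

# Synthetic stand-ins for feature vectors; names and sizes are ours.
X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
q = X[0] + 0.01 * rng.normal(size=d)

# Rademacher directions: entries +-1/sqrt(d). No Gaussian sampling is
# needed, and X @ U could even be computed with additions/subtractions only.
U = rng.choice([-1.0, 1.0], size=(d, m)) / np.sqrt(d)

gaps = np.abs((X - q) @ U)                      # (N, m) one-dimensional gaps
true_sq = np.linalg.norm(X - q, axis=1) ** 2

# Mean squared gap over m directions should hover around |x - q|^2 / d,
# up to the constants appearing in Theorems 10 and 11.
ratio = (gaps ** 2).mean(axis=1) * d / np.maximum(true_sq, 1e-12)
assert 0.5 < np.median(ratio) < 2.0
```

Because generating and applying ±1 entries is cheaper than Gaussian sampling, this is the variant favored in high dimensions, consistent with the theory below.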
We then show the performance guarantee for multiple one-dimensional searches based on Rademacher random variables.

Theorem 11. Let U = (1/√d)(u_1, ..., u_m) be random variables with U_{i,j} having equal probability of being +1 and −1. With a probability at least 1 − 2m/[d³], we have
$$\max_{1\le i\le m} \left|(x-q)^\top u_i\right| \le \frac{|x-q|\sqrt{6\ln d}}{\sqrt{d}}$$
If m ≤ d³/50, with a probability 1 − 3d⁻², we have
$$\frac{1}{m}\sum_{k=1}^{m}\left|(x-q)^\top u_k\right|^2 \ge \frac{|x-q|^2}{d}\left(1 - \frac{2m}{d^2} - \frac{18[\ln d]^2}{m} - \sqrt{\frac{25[\ln d]^2}{m}}\right)$$
By setting m = m₀ = 100[ln d]², under the assumption d ≥ 40 ln d, we have, with a probability 1 − 3d⁻²,
$$\max_{1\le k\le m}\left|(x-q)^\top u_k\right| \ge \frac{|x-q|}{2\sqrt{2d}}$$
The proof of Theorem 11 is similar to that of Theorem 9; for completeness, we include it in Appendix 6.3. The theoretical guarantee provided by Theorem 11 is almost identical to that of Theorem 9. In both Theorems 9 and 11, we derive m₀, the minimum number of random projections needed to achieve both high precision and high recall. We note that this number appears to be very large. However, in our experiments, we found that only a small number of random projections is sufficient to achieve both high precision and high recall. This is because the analysis of Theorem 11 is a worst-case analysis and likely overestimates the required number of random projections significantly.

5.4 Experiments

In this section, we validate the proposed search algorithm on large-scale image feature datasets.

5.4.1 Experimental Setup

Dataset: The algorithm is evaluated on the Holidays and Copydays datasets [35]. The Holidays dataset includes 1491 holiday photos, some of which were deliberately taken under attacks such as rotations, viewpoint changes, and illumination changes. The Copydays dataset extends the Holidays dataset by introducing additional attacks, such as cropping, blurring, painting, printing, scanning, and compression.
Feature Extraction: For different experiments, we extract 1 million and 10 million SIFT features [30] using the current version of the VLFeat [134] covariant feature detectors. We randomly sample 0.02% of the SIFT points as queries. For these 200 query points, we treat their 100 nearest neighbors as the true neighbors. In general, the distance between a data point and its 100th nearest neighbor is within roughly 10% of the average pairwise distance in the dataset. Finally, when comparing the algorithms' performance on large-scale data, we extend the dataset with 1.1 billion additional SIFT features [132].

Settings: Our theory showed that the algorithm based on Gaussian random variables and the algorithm based on Rademacher random variables share the same idea and have similar theoretical guarantees. In our experiments, we first examine the performance of both algorithms and then compare their performance with other nearest neighbor search algorithms. To measure performance, we use precision, recall, and search time. Given a query q, let D(r, q) be the subset of SIFT points within distance r from q, and D̂(r, q) be the subset of SIFT points returned by a given algorithm. Then the precision and recall are defined as
$$\text{Prec} = \frac{|D(r,q) \cap \hat{D}(r,q)|}{|\hat{D}(r,q)|}, \qquad \text{Recall} = \frac{|D(r,q) \cap \hat{D}(r,q)|}{|D(r,q)|}$$
A high precision implies that most of the returned data points are within the range r of the query q, and a high recall implies that most of the data points within distance r from the query q are returned by the algorithm. As precision and recall are a pair of competing factors, the goal of an effective search algorithm is to achieve a good balance between them, i.e., both decent precision and decent recall. This is evaluated by the F1 measure, defined as follows:
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
In our algorithm, there are two important parameters: the number of one-dimensional searches m and the window width factor τ.
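The evaluation metrics above are straightforward set operations; a small helper (our own, for illustration) makes the definitions concrete:

```python
def precision_recall_f1(true_set, returned_set):
    """Precision, recall and F1 for a range-search result, per the
    definitions above: true_set = D(r,q), returned_set = D-hat(r,q)."""
    if not returned_set or not true_set:
        return 0.0, 0.0, 0.0
    hits = len(true_set & returned_set)
    if hits == 0:
        return 0.0, 0.0, 0.0
    prec = hits / len(returned_set)
    rec = hits / len(true_set)
    return prec, rec, 2 * prec * rec / (prec + rec)

# Toy example: 4 true neighbors, 5 returned points, 3 of them correct.
prec, rec, f1 = precision_recall_f1({1, 2, 3, 4}, {2, 3, 4, 5, 6})
assert abs(prec - 0.6) < 1e-12 and abs(rec - 0.75) < 1e-12
```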
m determines the precision, because if a candidate misses any of the one-dimensional searches, it is removed from the final result. m also directly affects the search time, as more one-dimensional searches require more time. The parameter τ determines the recall, as a larger τ results in a larger search window in the low dimensional space. The goal of our experiments is to verify that the proposed algorithm can achieve good precision and recall within a short response time.

5.4.2 Results and Discussion

Here we study the algorithm's performance and its sensitivity to parameter settings (Figure 5.1), based on Gaussian random variables, w.r.t. different values of m and τ.

Sensitivity to the number of projections, m: First, the precision increases with the number of projections m, as every one-dimensional projection search filters out some false positives. Consistently, we also see that the recall decreases as m increases, because as the number of filters increases, the algorithm may also filter out some true positives. This reflects the general trade-off between precision and recall in information retrieval, so we use the F1 measure to help us decide on a good value of m. When m is small, increasing m greatly boosts the F1 measure, but once m is greater than 16, the F1 measure tends to stabilize and varying m has much less effect on F1, which indicates that our algorithm achieves a good balance between precision and recall when m ≥ 16. In general, we conclude that the algorithm's search quality is not sensitive to the number of projections, as long as the number of projections is not too small. For the running time, we can see that it is roughly linear in m when m is small, but becomes sub-linear when the number of projections is larger (e.g., m ≥ 16). This is because the search time has two parts: the search time in low dimensions and the time to update the candidate set.
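The precision trend described above (more one-dimensional filters remove more false positives) can be reproduced on synthetic data. A sketch under our own assumptions: random unit vectors stand in for SIFT features, and τ is deliberately set smaller than the theory's √(6 ln d) so the filtering effect is visible at this toy scale.

```python
import numpy as np

rng = np.random.default_rng(4)
d, N = 64, 4000
X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
q = X[0] + 0.005 * rng.normal(size=d)        # x_0 is a near duplicate of q
r = 0.6
true_set = set(np.flatnonzero(np.linalg.norm(X - q, axis=1) <= r).tolist())

tau = 1.0                                    # illustrative, smaller than sqrt(6 ln d)
precisions = []
for m in (2, 8, 32):
    U = rng.choice([-1.0, 1.0], size=(d, m)) / np.sqrt(d)
    gaps = np.abs((X - q) @ U)
    keep = np.all(gaps <= tau * r / np.sqrt(d), axis=1)  # must pass all m tests
    returned = set(np.flatnonzero(keep).tolist())
    hits = len(true_set & returned)
    precisions.append(hits / max(len(returned), 1))

# Each extra one-dimensional filter removes false positives, so precision
# does not decrease as m grows (while the near duplicate keeps passing).
assert precisions[0] <= precisions[-1]
```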
When the number of projections is small, both parts contribute significantly to the running time, but as the number of searches increases, the candidate set becomes much smaller, so the updates become much less expensive. As a result, the running time does not increase linearly with the number of projections. We can see that setting m = 16 yields a good trade-off between accuracy, efficiency, and running time. For this reason, we fix m = 16 in our later experiments.

Sensitivity to the search window width, τ: First, we see that the algorithm's precision initially increases and then gradually decreases as τ increases. This is because when the search window is too small, no results are returned; as we increase the window width to a reasonable size, our algorithm returns results with decent precision; and if the window width is set too large, the precision suffers due to the large number of false positives. Second, we note that increasing the search window width leads to a higher recall, as expected. Examining the F1 measure, we see that a good balance between precision and recall is achieved for a wide range of values of τ, such as 15 ≤ τ ≤ 30. The running time is not affected much by τ, for two reasons: 1) each search consists of only two binary searches, for the upper and the lower bound, whose cost is independent of τ; 2) although τ does affect the number of candidates returned, the dominating factor is the number of projections (i.e., the number of filters); τ serves more of a fine-tuning purpose that affects the search quality, but it has less impact on the sheer candidate set size. To conclude, our algorithm delivers good search quality and efficiency as long as τ is not set to extreme values; thus, our algorithm is not sensitive to the parameter τ. For the reasons above, we set τ = 20 in our later experiments.

In Figure 5.2, we also report experiments using Rademacher random variables.
As we can see from the results, Gaussian random variables and Rademacher random variables yield similar performance in all aspects, as our theory predicts. As Rademacher random variables are more computationally feasible in high dimensions, we use them in the subsequent analysis. In general, we conclude that both Algorithm 7 and Algorithm 8 deliver good search precision and recall within a fraction of a second, using only a few projections.

[Figure 5.1: Performance of Gaussian Projection Search with different numbers of projections (m = 4 to 48) as a function of τ: (a) average precision, (b) average recall, (c) average F1 measure, (d) average running time.]

Performance comparison, 1 million Copydays SIFT features: Here, the 1 million Copydays [35] SIFT features used in the previous experiment are used to compare our algorithm's performance with other established methods. We fixed τ = 20. For the baseline methods, the parameters are learned from 100,000 SIFT points extracted from the same dataset. For all methods, we vary the number of projections (i.e., the code length) from 8 to 48. When the number of projections is high, our algorithm's running time increases because of the increased number of one-dimensional searches and set operations. However, as shown in Figure 5.3, the proposed algorithm consistently delivers significantly better search results under the same time constraint and with a more compact representation.
[Figure 5.2: Performance of Rademacher Projection Search with different numbers of projections (m = 4 to 48) as a function of τ: (a) average precision, (b) average recall, (c) average F1 measure, (d) average running time.]

Performance comparison, 10 million SIFT features: In this experiment, we extract a total of 10 million features from Copydays [35] to expand the dataset to a larger scale, and then compare the different algorithms' performance. For the proposed algorithm, we use 16 projections and set τ = 20. For all other methods, the parameters are again learned from 100,000 SIFT points. We set the baseline methods' code length to 48 bits for comparable search results. The proposed algorithm outperformed the baseline methods with a more compact representation and showed better robustness when scaled to larger datasets.

Method              F1      Search Time (s)
Projection Search   0.576   3.66
LSH                 0.205   3.52
SH                  0.235   3.28
Table 5.1 Performance comparison, searching 10 million features

[Figure 5.3: Performance comparison for different code lengths (8 to 48 bits), searching 1 million features: (a) average F1 measure and (b) running time for LSH, SH, and the proposed method.]

Retrospective performance comparison, RPF and Projection Search: At the beginning of the chapter, we mentioned that one of the motivations of the Random Projection Search (RPS) algorithm is to avoid relying on a sparsity assumption for accurate search results.
Indeed, the RPS algorithm does not use the Restricted Isometry Property to filter out noisy points; rather, it addresses this issue with a series of binary decisions: only a point that passes all of the binary decisions is considered a true nearest neighbor. As we can see in Table 5.2, although the RPS algorithm gives lower recall than the RPF algorithm, it delivers much higher search accuracy. This validates our hypothesis that, in this application, the decision procedure in RPS is more effective than the RIP-based filtering procedure in RPF.

Method                        Accuracy   Recall   F1
Random Projection Filtering   0.3282     0.6564   0.4376
Random Projection Search      0.6182     0.5482   0.5761
Table 5.2 Performance comparison, RPF and RPS algorithms on 10 million features

Searching 80 million images: We also compare the performance of RPS and RPF on the 80 million tiny images dataset. We use 8 projections and set τ = 20 for RPS. The results are summarized in Table 5.3. As the RPS algorithm does not need to build KD-trees, it requires much less indexing time. Searching is also more efficient because binary search is very fast, and the search quality is improved by the more effective decision-based search procedure. In Figure 5.4 we present examples of the search results of the two algorithms. To avoid repetition, we do not include the results of the baseline algorithms; those results are presented in Chapter 4.

Method                        Index Time (s)   Search Time (s)   F1
Random Projection Filtering   2777.1           35.9              0.573
Random Projection Search      318.64           19.7              0.671
Table 5.3 Performance comparison, RPF and RPS algorithms on 80 million images

Figure 5.4 Examples of search results: RPS (first row) and RPF (second row)

Scalability evaluation, 1.1 billion SIFT features: To compare the scalability of the proposed and baseline methods, we expanded the background data with 1.1 billion SIFT features, a dataset used for large-scale approximate nearest neighbor search [132].
The algorithm parameters were set according to the 10-million search experiment, because those settings delivered the best performance. As we can see, on very large-scale datasets, the baseline algorithms are more likely to be affected by noisy data because of the sheer volume of the data. Our search algorithm, on the other hand, is much more resilient: it effectively removes false positives through a multi-step filtering process and thus delivers more accurate results more efficiently.

Method              F1      Search Time (s)
Projection Search   0.571   442.07
LSH                 0.155   748.59
SH                  0.178   743.20
Table 5.4 Performance and scalability comparison, searching 1.1 billion features

5.5 Conclusion

In this chapter, we proposed a new range search algorithm for large-scale, high-dimensional data. We proved a theoretical guarantee for the proposed algorithm and empirically evaluated it on 80 million images and 1.1 billion image features. Our study showed that the proposed algorithm offers better accuracy and efficiency for large-scale, high-dimensional near duplicate search.

CHAPTER 6

SUMMARY AND CONCLUSIONS

In this thesis, we designed three search algorithms motivated by different kinds of data representations, namely, the bag-of-features representation, the vector representation with sparsity constraints, and the dense vector representation. The random seeding quantization algorithm is suitable for image search based on bags of local features, where a large number of features is affordable to capture the visual details of images for highly accurate search; random seeding addresses the search efficiency issue and enables such accurate search to be applied to large-scale data. Both the random projection filtering algorithm and the random projection search algorithm, on the other hand, are designed for image search based on vectors.
In this scenario, a feature vector is generated for each image based on trusted domain knowledge, so a compact but information-rich vector represents the original image. Given the vector representation, the random projection search algorithm (RPS) is designed for general near-duplicate search, whereas the random projection filtering algorithm (RPF) specializes in searching sparse vectors. Because most vector representations are more space efficient than bag-of-features representations, both the RPS and RPF algorithms are preferable for large-scale search under memory constraints. In the following section, we summarize the major contributions of these three algorithms.

6.1 Contributions

The random seeding algorithm presented in Chapter 3 makes the following contributions:
• New image similarity measurement and its theoretical foundation: From a probabilistic perspective, a new image similarity measurement is proposed to approximate the optimal partial matching between bags of features. Our analysis not only establishes a theoretical foundation for this similarity measurement, but also enables the design of a new keypoint quantization scheme for image indexing and search.
• Efficient visual vocabulary construction: A new visual vocabulary construction scheme is derived for calculating the similarity between two bags of features. The new scheme avoids large-scale clustering for vocabulary construction and scales well to huge image databases.
• New feature point quantization scheme: A new feature point quantization method is designed to accurately approximate the similarity measurement over a finite number of samples, which also leads to efficient computation of the bag-of-words representation of images.
• Sparse vector representation of images: A sparse vector representation is essential for efficient indexing and search.
Proper thresholding is introduced to create sparse vector representations of images without sacrificing representation accuracy, allowing the algorithm to scale to large image search.
• New criteria for mapping feature points to visual words: The new mapping criteria allow consistent mapping, the elimination of outlier feature points, and one-to-many mapping during the quantization stage. These new criteria not only create more accurate representations of images, but also make the search more efficient.
• New feature point search algorithm with KD-trees: It is important to utilize KD-trees for efficient feature point quantization. By constructing KD-trees of the keypoints in the dataset and performing range search using the visual words, the computational cost of quantizing N keypoints into m cluster centers is reduced to O(m log N) from the O(N log m) of the clustering-based approach, leading to a more efficient quantization method for large-scale feature point datasets.

The Random Projection Filtering algorithm presented in Chapter 4 makes the following contributions:
• Near duplicity with sparsity constraints: While several approaches have been proposed to use the L2 distance constraint for near duplicate search, the sparsity constraint (L0 distance) is often neglected. We showed that by explicitly exploiting the statistical property of sparsity, a new search algorithm can be designed to achieve higher efficiency and accuracy.
• Extended Johnson-Lindenstrauss theorem for search: We extended the Johnson-Lindenstrauss theorem to directly use random Gaussian matrices for projection. This extension enables the design of a filtering process based on the RIP condition.
• Multi-step search scheme: Instead of using a single low-dimensional range search, we proposed performing multiple such searches and assembling their partial results to recall most nearest neighbors into the final result.
This scheme directly avoids the dimensionality limitation of applying the J-L theorem to large-scale high-dimensional search, and enables us to use tree search algorithms in low dimensions.
• Filtering scheme based on the RIP condition: By exploiting the sparsity constraints, we proposed an effective filtering procedure to address the precision-recall tradeoff. False positives can be efficiently removed to achieve high accuracy, while a high recall rate is maintained for overall better retrieval performance.

The Random Projection Search algorithm presented in Chapter 5 makes the following contributions:
• Algorithm for range search: Although some new hashing algorithms have recently been proposed, they all address the nearest neighbor search problem. Our algorithm, on the other hand, directly addresses the range search problem and allows users to specify the search range in applications.
• Search algorithm without sparsity constraints: While sparseness has shown its effectiveness for designing accurate range search algorithms, this property may not hold in some applications. A new search algorithm without a sparseness assumption is proposed so as to apply to a wide range of search scenarios.
• Low dimensional embedding independent of d and N: For existing approaches, the number of projections required to preserve the original similarity is related either to the dataset size N (JL theorem) or to the data dimensionality d (compressive sensing theory), leading to difficulties in searching over the projected space. Our new algorithm projects the data onto a one-dimensional space and utilizes binary search, making it more suitable for larger-scale and higher-dimensional data.
• Effectiveness of one-dimensional search: Using Talagrand's inequality, we theoretically proved that a single one-dimensional range search with an appropriately chosen threshold ensures a high-recall search.
• High accuracy with multiple one-dimensional searches: We proved theoretically that using multiple one-dimensional searches removes false positives and returns highly accurate results.
• Efficient sampling for high dimensions: We proved theoretically and verified empirically that using Rademacher random variables is as effective as using Gaussian random variables. This makes the algorithm more suitable for searching high dimensional data.

6.2 Future Work

The studies presented in this dissertation lead to several important research questions:
– In Chapter 3, we used a random sampling approach to choose the centers of hyper-spheres; these centers and hyper-spheres are then used to generate sparse vector representations of images. We may investigate different approaches for choosing the centers of the hyper-spheres, which may lead to better representations that capture image similarities.
– In our studies of the Random Projection Filtering algorithm, we assumed that the similarity between the query and the objects in a database is captured by the Euclidean distance. We plan to extend the RPF algorithm to the case where the similarity between objects is assessed by a kernel similarity function. Given that the embedding space for a kernel similarity could be of infinite dimension, we cannot directly apply the Random Projection Filtering method; we thus need to design new indexing schemes as well as search algorithms for efficient kernel-based similarity search.
– In Chapter 5, our Random Projection Search algorithm uses either randomly sampled Gaussian variables or Rademacher variables. In the future, we plan to improve the efficiency and effectiveness of the proposed approach by exploring data-dependent sampling approaches, such as sampling the random vectors based on the covariance structure of the data.
– It is also possible to extend the Random Projection Search algorithm to kernel distances.
Similar to the Euclidean distance case, we would first need to establish the performance guarantee of a single one-dimensional range search, and then extend it to multiple one-dimensional range searches.

6.3 Conclusions

This thesis presents three new search algorithms for data represented in its different fundamental forms, including the bag-of-features representation, the vector representation with sparsity constraints, and the dense vector representation. The conducted research makes significant contributions to both the theoretical foundations of utilizing randomization for search and their applications to large-scale, high-dimensional data. The algorithms presented here advance the state of the art in searching million-scale image and image feature data. Several future research directions for new large-scale search algorithms are identified.

APPENDIX

Proof of Lemma 1

Proof. We have
$$-d(X,Y) \le \sum_{k=1}^{m}\max_{1\le i\le n}\left(-|x_k-y_i|_2^2\right) + \sum_{k=1}^{n}\max_{1\le i\le m}\left(-|y_k-x_i|_2^2\right)$$
$$\le \sum_{k=1}^{m}\frac{1}{\lambda}\ln\sum_{i=1}^{n}\exp\left(-\lambda|x_k-y_i|_2^2\right) + \sum_{k=1}^{n}\frac{1}{\lambda}\ln\sum_{i=1}^{m}\exp\left(-\lambda|y_k-x_i|_2^2\right)$$
$$\le \frac{2}{\lambda}\sum_{i=1}^{m}\sum_{j=1}^{n}\exp\left(-\lambda\left[|x_i-y_j|_2^2 - d_0\right]\right) - d_0$$
where $d_0 = \min_{k,i}|x_k-y_i|_2^2$.

Proof of Lemma 2

Proof. We have
$$\int dz\, p(z|X)\,p(z|Y) = \frac{\mu^d}{mn\pi^d}\sum_{i=1}^{m}\sum_{j=1}^{n}\int dz\,\exp\left(-\mu|z-x_i|_2^2-\mu|z-y_j|_2^2\right)$$
$$= \frac{\mu^d}{mn\pi^d}\sum_{i=1}^{m}\sum_{j=1}^{n}\int dz\,\exp\left(-\frac{\mu}{2}|x_i-y_j|_2^2 - 2\mu\left|z-\frac{x_i+y_j}{2}\right|_2^2\right)$$
$$= \frac{\mu^{d/2}}{mn[2\pi]^{d/2}}\sum_{i=1}^{m}\sum_{j=1}^{n}\exp\left(-\frac{\mu}{2}|x_i-y_j|_2^2\right) = \frac{\mu^{d/2}}{mn[2\pi]^{d/2}}\, s(X,Y)$$

Proof of Lemma 3

Proof. We compute the difference $\Delta_{i,j}$ between $\int dz\,\exp(-\mu[|z-x_i|_2^2+|z-y_j|_2^2])$ and $\int_{z\in Q} dz\,\exp(-\mu[|z-x_i|_2^2+|z-y_j|_2^2])$, i.e., in polar coordinates,
$$\Delta_{i,j} = \int_{\gamma R}^{+\infty} r^{d-1}\,dr\,d\Omega\,\exp\left(-\mu\left[|z-x_i|_2^2+|z-y_j|_2^2\right]\right)$$
$$\le \exp\left(-2\mu(\gamma-1)^2R^2\right)\int_{\gamma R}^{+\infty} r^{d-1}\,dr\,d\Omega\,\exp\left(-2\mu(r-\gamma R)^2\right)$$
$$= \exp\left(-2\mu(\gamma-1)^2R^2\right)\int_{0}^{+\infty} (r+\gamma R)^{d-1}\,dr\,d\Omega\,\exp\left(-2\mu r^2\right)$$
$$\le \exp\left(-2\mu(\gamma-1)^2R^2\right)\left(\sqrt{\frac{\pi}{2\mu}}+\gamma R\right)^{d-1}$$
where the first inequality uses $|z-x_i| \ge (r-\gamma R) + (\gamma-1)R$ for $|z| = r \ge \gamma R$, since the data points lie within a ball of radius $R$. Using the $\Delta_{i,j}$ computed above, we obtain the result in the lemma by noticing that $|x_i-y_j|_2^2 \le 4R^2$.
Proof of Corollary 1

Proof. Since $\gamma \ge \max\left(4, \pi^{1/2}/[(2\mu)^{1/2}R]\right)$, we have
$$|s(X,Y)-s_1(X,Y)| \le \exp\left(-\mu\gamma^2R^2\right)(2\gamma R)^{d-1}\, s(X,Y)$$
To ensure that the difference is bounded by $\delta$, it is sufficient to ensure
$$-\mu\gamma^2R^2 + (d-1)\ln(2\gamma R) \le \ln\delta$$
Using $\ln(\gamma R) \le \gamma R - 1$, it suffices to show
$$-\mu\gamma^2R^2 + (d-1)\gamma R \le \ln\delta$$
which implies
$$\gamma \ge \frac{1}{2\mu R}\left[(d-1) + \sqrt{(d-1)^2 + 4\mu\ln[1/\delta]}\right]$$

Proof of Theorem 2

Proof. Since $0 \le p(z|X)\,p(z|Y) \le 1$, according to the Hoeffding inequality [135], we have
$$\Pr\left(\left|\mathbb{E}_N[p(z|X)p(z|Y)] - \mathbb{E}_z[p(z|X)p(z|Y)]\right| \ge \varepsilon\right) \le 2\exp(-2N\varepsilon^2)$$
where $\mathbb{E}_N[\cdot]$ stands for the expectation over the $N$ random samples $\{z_k\}_{k=1}^N$. By setting $\delta = 2\exp(-2N\varepsilon^2)$, we have, with probability $1-\delta$,
$$|s_2(X,Y) - s_1(X,Y)| \le \sqrt{\frac{1}{2N}\ln\frac{2}{\delta}}$$

Proof of Theorem 9

Proof. The upper bound follows directly from the union bound over the $m$ random projections. To show the lower bound, we first bound the maximum value by
$$\max_{1\le k\le m}\left|(x-q)^\top u_k\right| \ge \sqrt{\frac{1}{m}\sum_{k=1}^{m}\left|(x-q)^\top u_k\right|^2}$$
It is thus sufficient to bound $\frac{1}{m}\sum_{k=1}^{m}|(x-q)^\top u_k|^2$. Define the event $Z$ as
$$Z = \left\{U=(u_1,\ldots,u_m) : \max_{1\le k\le m}\left|(x-q)^\top u_k\right| \le \frac{|x-q|\sqrt{6\ln d}}{\sqrt d}\right\} \tag{7.1}$$
Using Theorem 8, we have $\Pr(U\in Z) \ge 1 - m/[2d^3]$. According to the Bernstein inequality, we have, with a probability $1-2d^{-2}$, for any $U\in Z$,
$$\left|\frac{1}{m}\sum_{k=1}^{m}\left|(x-q)^\top u_k\right|^2 - \mathbb{E}\left[\left|(x-q)^\top u\right|^2\,\middle|\,u\in Z\right]\right| \le \frac{3V\ln d}{m} + \sqrt{\frac{2\sigma^2\ln d}{m}}$$
where
$$V = \max_{U\in Z}\max_{1\le k\le m}\left|(x-q)^\top u_k\right|^2, \qquad \sigma^2 m = \mathbb{E}\left[\sum_{k=1}^{m}\left|(x-q)^\top u_k\right|^4\,\middle|\,U\in Z\right]$$
Using the definition of $Z$, we have
$$V \le \frac{6|x-q|^2\ln d}{d}$$
and $\sigma^2 \le V\,\mathbb{E}\left[|(x-q)^\top u|^2\,|\,u\in Z\right]$. To bound the expectation, we have
$$\mathbb{E}\left[\left|(x-q)^\top u\right|^2\right] \le \Pr(u\in Z)\,\mathbb{E}\left[\left|(x-q)^\top u\right|^2\,\middle|\,u\in Z\right] + \Pr(u\notin Z)\,|x-q|^2 \le \Pr(u\in Z)\,\mathbb{E}\left[\left|(x-q)^\top u\right|^2\,\middle|\,u\in Z\right] + \frac{m}{2d^3}|x-q|^2$$
Since
$$\mathbb{E}\left[\left|(x-q)^\top u\right|^2\right] = (x-q)^\top\mathbb{E}[uu^\top](x-q) = \frac{|x-q|^2}{d}$$
we have
$$\mathbb{E}\left[\left|(x-q)^\top u\right|^2\,\middle|\,u\in Z\right] \ge \frac{|x-q|^2}{d}\left(1-\frac{m}{2d^2}\right)$$
On the other hand, since $\mathbb{E}[|(x-q)^\top u|^2] \ge \Pr(u\in Z)\,\mathbb{E}[|(x-q)^\top u|^2\,|\,u\in Z]$, we have
$$\mathbb{E}\left[\left|(x-q)^\top u\right|^2\,\middle|\,u\in Z\right] \le \frac{1}{1-\frac{m}{2d^3}}\cdot\frac{|x-q|^2}{d} \le \frac{25|x-q|^2}{24d}$$
where the last step follows from the fact that $25m \le 2d^3$.
As a result, with probability $1 - 3d^{-2}$, we have
\[
\frac{1}{m}\sum_{k=1}^{m}|(x-q)^{\top}u_k|^2 \ge \frac{|x-q|^2}{d}\left[1 - \frac{m}{2d^2} - \frac{18[\ln d]^2}{m} - \sqrt{\frac{25[\ln d]^2}{m}}\right].
\]
By setting $m = 100[\ln d]^2$, we have, with probability $1 - 3d^{-3}$,
\[
\frac{1}{m}\sum_{k=1}^{m}|(x-q)^{\top}u_k|^2 \ge \frac{|x-q|^2}{8d}.
\]

Proof of Theorem 11

Proof. Similar to the proof of Theorem 9, the upper bound follows directly from the union bound over the $m$ random projections. To show the lower bound, we aim to bound $\frac{1}{m}\sum_{k=1}^{m}|(x-q)^{\top}u_k|^2$. Using the event $Z$ defined in (7.1), we have $\Pr(U\in Z) \ge 1 - 2m/d^3$. Following the same proof as for Theorem 9, we have, with probability $1 - 2d^{-2}$, for any $U\in Z$,
\[
\left|\frac{1}{m}\sum_{k=1}^{m}|(x-q)^{\top}u_k|^2 - \mathrm{E}\left[|(x-q)^{\top}u|^2 \mid u\in Z\right]\right| \le \frac{3V\ln d}{m} + \sqrt{\frac{2\sigma^2\ln d}{m}},
\]
where
\[
V = \max_{U\in Z}\max_{1\le k\le m}|(x-q)^{\top}u_k|^2 \le \frac{6|x-q|^2\ln d}{d}, \qquad
\sigma^2 = \mathrm{E}\left[\sum_{k=1}^{m}|(x-q)^{\top}u_k|^4 \,\Big|\, U\in Z\right] \le V\,\mathrm{E}\left[\sum_{k=1}^{m}|(x-q)^{\top}u_k|^2 \,\Big|\, U\in Z\right].
\]
Following the same argument as in the proof of Theorem 9, we have
\[
\begin{aligned}
\mathrm{E}\left[|(x-q)^{\top}u|^2 \mid u\in Z\right] &\ge \mathrm{E}[|(x-q)^{\top}u|^2] - \Pr(u\notin Z)|x-q|^2 \\
&\ge (x-q)^{\top}\mathrm{E}[uu^{\top}](x-q) - \frac{2m}{d^3}|x-q|^2 = \frac{|x-q|^2}{d}\left(1 - \frac{2m}{d^2}\right),
\end{aligned}
\]
where in the last step we use the property $\mathrm{E}[uu^{\top}] = I/d$. The upper bound of $\mathrm{E}[|(x-q)^{\top}u|^2 \mid u\in Z]$ is given by
\[
\mathrm{E}\left[|(x-q)^{\top}u|^2 \mid u\in Z\right] \le \frac{|x-q|^2}{d\left(1 - \Pr(u\notin Z)\right)} \le \frac{25|x-q|^2}{24d},
\]
where the last step follows from the fact that $50m \le d^3$. As a result, with probability $1 - 3d^{-2}$, we have
\[
\frac{1}{m}\sum_{k=1}^{m}|(x-q)^{\top}u_k|^2 \ge \frac{|x-q|^2}{d}\left[1 - \frac{2m}{d^2} - \frac{18[\ln d]^2}{m} - \sqrt{\frac{25[\ln d]^2}{m}}\right].
\]
By setting $m = 100[\ln d]^2$, under the assumption that $d \ge 40\ln d$, we have, with probability $1 - 3d^{-3}$,
\[
\frac{1}{m}\sum_{k=1}^{m}|(x-q)^{\top}u_k|^2 \ge \frac{|x-q|^2}{8d}.
\]

BIBLIOGRAPHY

[1] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu, “An optimal algorithm for approximate nearest neighbor searching fixed dimensions,” J. ACM, vol. 45, no. 6, pp. 891–923, 1998.

[2] K. Grauman and T. Darrell, “Pyramid match hashing: Sub-linear time indexing over partial correspondences,” in CVPR, pp. 1–8, 2007.

[3] Z. Wu, Q. Ke, M. Isard, and J.
Sun, “Bundling features for large scale partial-duplicate web image search,” (Los Alamitos, CA, USA), pp. 25–32, IEEE Computer Society, 2009.

[4] Netcraft, “June 2009 web server survey.” http://news.netcraft.com/archives/2009/06/17/june_2009_web_server_survey.html, 2009.

[5] Google, “We knew the web was big.” http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html, 2008.

[6] Flickr, “3 billion!.” http://blog.flickr.net/en/2008/11/03/3-billion/, 2008.

[7] T. Barrett, D. Troup, S. Wilhite, et al., “NCBI GEO: archive for high-throughput functional genomic data,” vol. D885-90, 2009.

[8] D. Sifry, “State of the blogosphere, April 2006 part 1: On blogosphere growth.” http://www.sifry.com/alerts/archives/000432.html, 2006.

[9] H. Ferguson, P. Greenfield, T. Axelrod, et al., “Astronomical data reduction and analysis for the next decade,” 2009.

[10] M. M. Group, “World internet users and population stats.” http://www.internetworldstats.com/stats.htm, 2009.

[11] L. Li and H. Li, “Dimension reduction methods for microarrays with application to censored survival data,” Bioinformatics, vol. 20, no. 18, pp. 3406–3412, 2004.

[12] B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press, 2001.

[13] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf, “Large scale multiple kernel learning,” J. Mach. Learn. Res., vol. 7, pp. 1531–1565, December 2006.

[14] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, “Multiple kernels for object detection,” in Proceedings of the International Conference on Computer Vision, September 2009.

[15] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison Wesley, May 1999.

[16] B. Croft, D. Metzler, and T. Strohman, Search Engines: Information Retrieval in Practice. Addison Wesley, 1 ed., February 2009.

[17] JISC Digital Media, “Review of image search engines.” http://www.jiscdigitalmedia.
ac.uk/stillimages/advice/review-of-image-search-engines/, 2008.

[18] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to object matching in videos,” in Proceedings of the International Conference on Computer Vision, vol. 2, pp. 1470–1477, Oct. 2003.

[19] R. Datta, D. Joshi, J. Li, and J. Z. Wang, “Image retrieval: Ideas, influences, and trends of the new age,” ACM Comput. Surv., vol. 40, pp. 1–60, April 2008.

[20] N. Orio, “Music retrieval: A tutorial and review,” Foundations and Trends in Information Retrieval, vol. 1, no. 1, 2006.

[21] G. Shakhnarovich, T. Darrell, and P. Indyk, Nearest-Neighbor Methods in Learning and Vision: Theory and Practice (Neural Information Processing). The MIT Press, 2006.

[22] D. D. Lewis, “Naive (Bayes) at forty: The independence assumption in information retrieval,” in Proceedings of ECML-98, 10th European Conference on Machine Learning (C. Nédellec and C. Rouveirol, eds.), no. 1398, (Chemnitz, DE), pp. 4–15, Springer Verlag, Heidelberg, DE, 1998.

[23] J. Zobel and A. Moffat, “Inverted files for text search engines,” ACM Comput. Surv., vol. 38, no. 2, p. 6, 2006.

[24] G. Salton, A. Wong, and C. S. Yang, “A vector space model for automatic indexing,” Commun. ACM, vol. 18, no. 11, pp. 613–620, 1975.

[25] K. S. Jones, S. Walker, and S. E. Robertson, “A probabilistic model of information retrieval: development and comparative experiments,” Inf. Process. Manage., vol. 36, no. 6, pp. 779–808, 2000.

[26] J. M. Ponte and W. B. Croft, “A language modeling approach to information retrieval,” in SIGIR ’98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, (New York, NY, USA), pp. 275–281, ACM, 1998.

[27] D. Nister and H. Stewenius, “Scalable recognition with a vocabulary tree,” in CVPR ’06: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, (Washington, DC, USA), pp.
2161–2168, IEEE Computer Society, 2006.

[28] J. L. Bentley, “Multidimensional binary search trees used for associative searching,” Commun. ACM, vol. 18, no. 9, pp. 509–517, 1975.

[29] A. Gionis, P. Indyk, and R. Motwani, “Similarity search in high dimensions via hashing,” in VLDB ’99: Proceedings of the 25th International Conference on Very Large Data Bases, (San Francisco, CA, USA), pp. 518–529, Morgan Kaufmann Publishers Inc., 1999.

[30] D. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, pp. 91–110, 2004.

[31] C. Wallraven, B. Caputo, and A. Graf, “Recognition with local features: the kernel recipe,” in ICCV ’03: Proceedings of the Ninth IEEE International Conference on Computer Vision, (Washington, DC, USA), p. 257, IEEE Computer Society, 2003.

[32] K. Grauman and T. Darrell, “The pyramid match kernel: Discriminative classification with sets of image features,” in ICCV ’05: Proceedings of the Tenth IEEE International Conference on Computer Vision, (Washington, DC, USA), pp. 1458–1465, IEEE Computer Society, 2005.

[33] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in CVPR ’06: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, (Washington, DC, USA), pp. 2169–2178, IEEE Computer Society, 2006.

[34] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Lost in quantization: Improving particular object retrieval in large scale image databases,” in CVPR, 2008.

[35] H. Jegou, M. Douze, and C. Schmid, “Hamming embedding and weak geometric consistency for large scale image search,” in ECCV ’08: Proceedings of the 10th European Conference on Computer Vision, (Berlin, Heidelberg), pp. 304–317, Springer-Verlag, 2008.

[36] C. Cotton and D. Ellis, “Finding similar acoustic events using matching pursuit and locality-sensitive hashing,” 2009.

[37] G. Salton, E.
A. Fox, and H. Wu, “Extended boolean information retrieval,” Commun. ACM, vol. 26, no. 11, pp. 1022–1036, 1983.

[38] D. Z. Zanger, “Interpolation of the extended boolean retrieval model,” Inf. Process. Manage., vol. 38, no. 6, pp. 743–748, 2002.

[39] G. Zipf, “The psycho-biology of language,” Boston, MA: Houghton Mifflin, 1935.

[40] J. Zobel, A. Moffat, and K. Ramamohanarao, “Inverted files versus signature files for text indexing,” ACM Trans. Database Syst., vol. 23, no. 4, pp. 453–490, 1998.

[41] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, July 2008.

[42] S. E. Robertson, “The probability ranking principle in IR,” pp. 281–286, 1997.

[43] S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford, “Okapi at TREC-3,” in TREC, pp. 109–126, 1994.

[44] S. E. Robertson, S. Walker, M. Hancock-Beaulieu, M. Gatford, and A. Payne, “Okapi at TREC-4,” in TREC, 1995.

[45] S. E. Robertson, S. Walker, and M. Hancock-Beaulieu, “Okapi at TREC-7: Automatic ad hoc, filtering, VLC and interactive,” in TREC, pp. 199–210, 1998.

[46] W. B. Croft, “Incorporating different search models into one document retrieval system,” SIGIR Forum, vol. 16, no. 1, pp. 40–45, 1981.

[47] C. Zhai, Statistical Language Models for Information Retrieval. Hanover, MA, USA: Now Publishers Inc., 2008.

[48] E. M. Voorhees and D. Harman, “Overview of the eighth text retrieval conference (TREC-8),” in Proceedings of the Seventh Text REtrieval Conference (TREC-7), pp. 1–24, 2000.

[49] E. M. Voorhees and D. Harman, “Overview of the sixth text retrieval conference (TREC-6),” in The Fifth Text REtrieval Conference (TREC-5). NIST Special Publication 500-238, National Institute of Standards and Technology, pp. 3–28, 1997.

[50] E. M. Voorhees and D. Harman, “Overview of the seventh text retrieval conference (TREC-7),” in Proceedings of the Seventh Text REtrieval Conference (TREC-7), pp. 1–24, 1998.

[51] J. Lafferty and C.
Zhai, “Document language models, query models, and risk minimization for information retrieval,” in SIGIR ’01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, (New York, NY, USA), pp. 111–119, ACM, 2001.

[52] S. M. Omohundro, “Five balltree construction algorithms,” tech. rep., International Computer Science Institute, December 1989.

[53] C. Silpa-Anan and R. Hartley, “Optimised kd-trees for fast image descriptor matching,” in CVPR, pp. 1–8, 2008.

[54] J. M. Kleinberg, “Two algorithms for nearest-neighbor search in high dimensions,” in STOC ’97: Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, (New York, NY, USA), pp. 599–608, ACM, 1997.

[55] C. Böhm, S. Berchtold, and D. A. Keim, “Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases,” ACM Comput. Surv., vol. 33, no. 3, pp. 322–373, 2001.

[56] T. Liu, A. W. Moore, and A. Gray, “New algorithms for efficient high-dimensional nonparametric classification,” J. Mach. Learn. Res., vol. 7, pp. 1135–1158, 2006.

[57] J. H. Friedman, J. L. Bentley, and R. A. Finkel, “An algorithm for finding best matches in logarithmic expected time,” ACM Trans. Math. Softw., vol. 3, no. 3, pp. 209–226, 1977.

[58] M. I. Shamos and D. Hoey, “Closest-point problems,” in SFCS ’75: Proceedings of the 16th Annual Symposium on Foundations of Computer Science, (Washington, DC, USA), pp. 151–162, IEEE Computer Society, 1975.

[59] R. J. Lipton and R. E. Tarjan, “Applications of a planar separator theorem,” in SFCS ’77: Proceedings of the 18th Annual Symposium on Foundations of Computer Science, (Washington, DC, USA), pp. 162–170, IEEE Computer Society, 1977.

[60] S. Omohundro, “Bumptrees for efficient function, constraint, and classification learning,” in NIPS-3, (San Francisco, CA, USA), pp. 693–699, Morgan Kaufmann Publishers Inc., 1990.

[61] R. F.
Sproull, “Refinements to nearest-neighbor searching in k-dimensional trees,” Algorithmica, vol. 6, pp. 579–589, 1991.

[62] S. Meiser, “Point location in arrangements of hyperplanes,” Inf. Comput., vol. 106, no. 2, pp. 286–303, 1993.

[63] K. L. Clarkson, “A randomized algorithm for closest-point queries,” SIAM J. Comput., vol. 17, no. 4, pp. 830–847, 1988.

[64] J. S. Beis and D. G. Lowe, “Shape indexing using approximate nearest-neighbour search in high-dimensional spaces,” Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, vol. 0, p. 1000, 1997.

[65] M. Datar, N. Immorlica, and P. Indyk, “Locality-sensitive hashing scheme based on p-stable distributions,” in SCG ’04: Proceedings of the twentieth annual symposium on Computational geometry, (New York, NY, USA), pp. 253–262, ACM, 2004.

[66] A. Andoni and P. Indyk, “Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions,” in FOCS, 2006.

[67] M. Muja and D. G. Lowe, “Fast approximate nearest neighbors with automatic algorithm configuration,” in VISSAPP (1), pp. 331–340, 2009.

[68] S. Arya, D. M. Mount, N. S. Netanyahu, and A. Y. Wu, “An optimal algorithm for approximate nearest neighbor searching in fixed dimensions,” 1994.

[69] M. Muja and D. G. Lowe, “Fast approximate nearest neighbors with automatic algorithm configuration,” in VISSAPP (1), pp. 331–340, 2009.

[70] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in CVPR, pp. 1–8, 2007.

[71] A. Andoni and P. Indyk, “Efficient algorithms for substring near neighbor problem,” in SODA ’06: Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, (New York, NY, USA), pp. 1203–1212, ACM, 2006.

[72] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig, “Syntactic clustering of the web,” in Selected papers from the sixth international conference on World Wide Web, (Essex, UK), pp.
1157–1166, Elsevier Science Publishers Ltd., 1997.

[73] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher, “Min-wise independent permutations (extended abstract),” in STOC ’98: Proceedings of the thirtieth annual ACM symposium on Theory of computing, (New York, NY, USA), pp. 327–336, ACM, 1998.

[74] A. Broder, “On the resemblance and containment of documents,” in SEQUENCES ’97: Proceedings of the Compression and Complexity of Sequences 1997, (Washington, DC, USA), p. 21, IEEE Computer Society, 1997.

[75] Y. Ke, R. Sukthankar, and L. Huston, “Efficient near-duplicate detection and sub-image retrieval,” in Proceedings of ACM Multimedia, pp. 869–876, 2004.

[76] T. Pavlidis, “Limitations of CBIR,” in ICPR, 2008.

[77] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to object matching in videos,” in ICCV, pp. 1470–1477, 2003.

[78] V. Lepetit, P. Lagger, and P. Fua, “Randomized trees for real-time keypoint recognition,” in CVPR, pp. 775–781, 2005.

[79] G. Shakhnarovich, T. Darrell, and P. Indyk, Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. MIT Press, 2006.

[80] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray, “Visual categorization with bags of keypoints,” in Workshop on Statistical Learning in Computer Vision, ECCV, pp. 1–22, 2004.

[81] S. Lyu, “Mercer kernels for object recognition with local features,” in CVPR, pp. 223–229, 2005.

[82] S. Boughorbel, J.-P. Tarel, and F. Fleuret, “Non-Mercer kernels for SVM object recognition,” in British Machine Vision Conference, pp. 137–146, 2004.

[83] K. Grauman and T. Darrell, “The pyramid match kernel: Efficient learning with sets of features,” Journal of Machine Learning Research, vol. 8, pp. 725–760, 2007.

[84] L. Wolf and A. Shashua, “Learning over sets using kernel principal angles,” Journal of Machine Learning Research, vol. 4, pp. 913–931, 2003.

[85] R. Kondor and T.
Jebara, “A kernel between sets of vectors,” in Proceedings of the International Conference on Machine Learning, pp. 361–368, 2003.

[86] P. Moreno, P. Ho, and N. Vasconcelos, “A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications,” in Neural Information Processing Systems, 2003.

[87] O. Chum, J. Philbin, and A. Zisserman, “Near duplicate image detection: min-hash and tf-idf weighting,” in Proceedings of the British Machine Vision Conference, 2008.

[88] K. Grauman and T. Darrell, “Approximate correspondences in high dimensions,” in Advances in Neural Information Processing Systems 19, pp. 505–512, MIT Press, 2007.

[89] C. Silpa-Anan and R. Hartley, “Localization using an imagemap,” in Proceedings of the 2004 Australasian Conference on Robotics & Automation, 2004.

[90] G. Csurka, C. Bray, C. Dance, and L. Fan, “Visual categorization with bags of keypoints,” in Workshop on Statistical Learning in Computer Vision, ECCV, 2004.

[91] L. Fei-Fei and P. Perona, “A Bayesian hierarchical model for learning natural scene categories,” in CVPR, pp. 524–531, 2005.

[92] J. Winn, A. Criminisi, and T. Minka, “Object categorization by learned universal visual dictionary,” in ICCV, pp. 1800–1807, 2005.

[93] F. Perronnin, C. Dance, G. Csurka, and M. Bressan, “Adapted vocabularies for generic visual categorization,” in European Conference on Computer Vision, 2006.

[94] P. Tirilly, V. Claveau, and P. Gros, “Language modeling for bag-of-visual words image categorization,” in Proceedings of International Conference on Content-based Image and Video Retrieval (CIVR), 2008.

[95] L. Wu, M. Li, Z. Li, W. Ma, and N. Yu, “Visual language modeling for image classification,” in Proceedings of International Conference on Content-based Image and Video Retrieval (CIVR), 2007.

[96] W. Dong, Z. Wang, M. Charikar, and K. Li, “Efficiently matching sets of features with random histograms,” in Proceedings of ACM Multimedia, pp. 179–188, 2008.

[97] F. Li, W. Tong, R.
Jin, A. K. Jain, and J.-E. Lee, “An efficient key point quantization algorithm for large scale image retrieval,” in LS-MMRM ’09: Proceedings of the First ACM workshop on Large-scale multimedia retrieval and mining, (New York, NY, USA), pp. 89–96, ACM, 2009.

[98] O. Chum, M. Perdoch, and J. Matas, “Geometric min-hashing: Finding a (thick) needle in a haystack,” in CVPR, 2009.

[99] K. Grauman and T. Darrell, “Approximate correspondences in high dimensions,” in NIPS, 2006.

[100] P. F. Felzenszwalb and D. P. Huttenlocher, “Pictorial structures for object recognition,” Int. J. Comput. Vision, vol. 61, no. 1, pp. 55–79, 2005.

[101] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in CVPR, pp. 1–8, 2007.

[102] A. Gionis, P. Indyk, and R. Motwani, “Similarity search in high dimensions via hashing,” in Proceedings of the 25th International Conference on Very Large Data Bases, 1999.

[103] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proceedings of the twentieth annual symposium on Computational geometry, 2004.

[104] T. Liu, A. Moore, A. Gray, and K. Yang, “An investigation of practical approximate nearest neighbor algorithms,” in Neural Information Processing Systems, pp. 825–832, 2004.

[105] C. Silpa-Anan and R. Hartley, “Optimised kd-trees for fast image descriptor matching,” in CVPR, pp. 1–8, 2008.

[106] D. Nister and H. Stewenius, “Scalable recognition with a vocabulary tree,” in CVPR, pp. 2161–2168, 2006.

[107] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman, “Total recall: Automatic query expansion with a generative feature model for object retrieval,” CVPR, pp. 1–8, 2007.

[108] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in CVPR, 2007.

[109] X. Zhang, Z. Li, L. Zhang, W.
Ma, and H.-Y. Shum, “Efficient indexing for large scale visual search,” IEEE Computer Society, 2009.

[110] Z. Wu, Q. Ke, J. Sun, and H.-Y. Shum, “A multi-sample, multi-tree approach to bag-of-words image representation for image retrieval,” IEEE Computer Society, 2009.

[111] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in NIPS, 2008.

[112] J. C. Gemert, J.-M. Geusebroek, C. J. Veenman, and A. W. Smeulders, “Kernel codebooks for scene categorization,” in ECCV ’08: Proceedings of the 10th European Conference on Computer Vision, (Berlin, Heidelberg), pp. 696–709, Springer-Verlag, 2008.

[113] K. Grauman and T. Darrell, “The pyramid match kernel: Efficient learning with sets of features,” J. Mach. Learn. Res., vol. 8, pp. 725–760, 2007.

[114] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. Wiley-Interscience Publication, 2000.

[115] “Affine covariant features,” June 2007. http://www.robots.ox.ac.uk/~vgg/research/affine/detectors.html.

[116] S. E. Robertson, S. Walker, and M. Hancock-Beaulieu, “Okapi at TREC-7,” in Proceedings of the Seventh Text REtrieval Conference, 1998.

[117] X.-J. Wang, L. Zhang, X. Li, and W.-Y. Ma, “Annotating images by mining image search results,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, pp. 1919–1932, 2008.

[118] J. Philbin and A. Zisserman, “The Oxford buildings dataset.” World Wide Web electronic publication, Nov 2007.

[119] ANSI/NIST-ITL 1-2007 standard: Data Format for the Interchange of Fingerprint, Facial, & Other Biometric Information. 2007.

[120] J.-E. Lee, A. K. Jain, and R. Jin, “Scars, marks and tattoos (SMT): Soft biometric for suspect and victim identification,” in Biometric Consortium Conference and Technology Expo, 2008.

[121] H. Moon and P. J. Phillips, “Computational and performance aspects of PCA-based face recognition algorithms,” Perception, vol. 30, pp. 303–321, 2001.

[122] M. Muja and D. G.
Lowe, “Fast matching of binary features,” in Computer and Robot Vision (CRV), pp. 404–410, 2012.

[123] E. J. Candes, J. Romberg, and T. Tao, “Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information,” IEEE Transactions on Information Theory, vol. 52, pp. 489–509, Feb. 2006.

[124] L. Paulevé, H. Jégou, and L. Amsaleg, “Locality sensitive hashing: A comparison of hash function types and querying mechanisms,” Pattern Recognition Letters, vol. 31, no. 11, pp. 1348–1358, 2010.

[125] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in NIPS, pp. 1753–1760, 2008.

[126] D. Zhang, J. Wang, D. Cai, and J. Lu, “Laplacian co-hashing of terms and documents,” in Advances in Information Retrieval (C. Gurrin, Y. He, G. Kazai, U. Kruschwitz, S. Little, T. Roelleke, S. Rüger, and K. Rijsbergen, eds.), vol. 5993 of Lecture Notes in Computer Science, pp. 577–580, Springer Berlin Heidelberg, 2010.

[127] X. Yu, S. Zhang, B. Liu, L. Zhong, and D. N. Metaxas, “Large scale medical image search via unsupervised PCA hashing,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2013.

[128] D. Zhang, J. Wang, D. Cai, and J. Lu, “Self-taught hashing for fast similarity search,” in Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’10, (New York, NY, USA), pp. 18–25, ACM, 2010.

[129] A. Torralba, R. Fergus, and W. Freeman, “80 million tiny images: A large data set for nonparametric object and scene recognition,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 30, pp. 1958–1970, Nov 2008.

[130] A. Guttman, “R-trees: A dynamic index structure for spatial searching,” in Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, (New York, NY, USA), pp. 47–57, ACM, 1984.

[131] W. Johnson and J.
Lindenstrauss, “Extensions of Lipschitz mappings into a Hilbert space,” in Conference in modern analysis and probability (New Haven, Conn., 1982), vol. 26 of Contemporary Mathematics, pp. 189–206, American Mathematical Society, 1984.

[132] H. Jégou, R. Tavenard, M. Douze, and L. Amsaleg, “Searching in one billion vectors: re-rank with source coding,” CoRR, vol. abs/1102.3828, 2011.

[133] S. N. Bernstein, Theory of Probability. 1927.

[134] A. Vedaldi and B. Fulkerson, “VLFeat: An open and portable library of computer vision algorithms.” http://www.vlfeat.org/, 2008.

[135] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” Journal of the American Statistical Association, vol. 58, no. 301, pp. 13–30, 1963.