This is to certify that the dissertation entitled Techniques for Efficient k-Nearest Neighbor Searching in Non-Ordered Discrete and Hybrid Data Spaces, presented by Dashiell Matthews Kolbe, has been accepted towards fulfillment of the requirements for the Doctoral degree in Computer Science.

Major Professor's Signature
Date

TECHNIQUES FOR EFFICIENT K-NEAREST NEIGHBOR SEARCHING IN NON-ORDERED DISCRETE AND HYBRID DATA SPACES

By

Dashiell Matthews Kolbe

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Computer Science

2010

ABSTRACT

TECHNIQUES FOR EFFICIENT K-NEAREST NEIGHBOR SEARCHING IN NON-ORDERED DISCRETE AND HYBRID DATA SPACES

By

Dashiell Matthews Kolbe

Similarity searches/queries in Non-Ordered Discrete Data Spaces (NDDS) and Hybrid Data Spaces (HDS) are becoming increasingly useful in areas such as bioinformatics, multimedia, text retrieval, audio and video compression, data mining, and E-commerce. The objective of this dissertation is to develop and analyze novel methods to support similarity searches in such spaces.

In this dissertation, we first discuss performing k-Nearest Neighbor (k-NN) searches in NDDSs. Performing such searches in NDDSs raises new challenges. Due to the coarse granularity of the commonly used Hamming distance measure, a nearest neighbor query in an NDDS may lead to a large set of candidate solutions, creating a high degree of non-determinism. We propose a new distance measure that reduces the number of candidate solutions for a query while preserving the essential properties of the Hamming distance. We have also implemented nearest neighbor queries using multidimensional database indexing in NDDSs. We use the properties of our multidimensional NDDS index to derive the probability of encountering new neighbors within specific regions of the index. This probability is used to develop a new search ordering heuristic. Our experiments demonstrate that our nearest neighbor algorithm is efficient in finding nearest neighbors in NDDSs and that our heuristics are effective in improving the performance of such queries.

We then discuss our work on providing a generalization of our GEH distance. This generalized form allows our distance measure to be applied to a broad range of applications. Of these, we discuss a new rank-based implementation well suited to applications with heavily skewed data distributions.
Our experiments demonstrate the benefits of an adaptable distance metric by presenting scenarios that demonstrate performance changes depending upon the distance measure used.

Finally, we discuss extending k-NN searching to HDS. We consider the challenges of exploiting both the CDS and NDDS properties of HDS for optimizing potential search algorithms. In particular, we consider how key search information is maintained in HDS data structures and what rules must be observed to guarantee the correctness of search results in such spaces. Further, the concept of search execution stages is introduced to develop efficient k-NN search algorithms for HDS. Lastly, a theoretical performance model is developed for HDS searching to validate our experimental results.

To my parents, who always believed in me.

ACKNOWLEDGMENTS

I would like to first acknowledge my thesis adviser, Dr. Sakti Pramanik, who has provided a strong guiding hand throughout my graduate career at Michigan State University. My growth as a researcher would not have been possible without Dr. Pramanik.

I would also like to provide special acknowledgement for Dr. Qiang Zhu of the University of Michigan. Many of the ideas presented in this thesis are the result of discussions with Dr. Zhu and Dr. Pramanik. The level of collaboration that was achieved in these discussions is something that I continue to strive for in my daily life.

I also extend my sincere gratitude to my thesis committee, Dr. Mark Dykman, Dr. James Cole, and Dr. Rong Jin. They provided both their time and expertise to improve both the depth and breadth of this thesis.

Lastly, I would like to thank my family and friends for being my village. My parents, Nancy, Chris, and Ted, receive my deepest gratitude for their unending love and support.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

1 Introduction
  1.1 Motivating Applications
    1.1.1 Multimedia Objects
    1.1.2 Computational Biology
    1.1.3 Text Retrieval
    1.1.4 Document Retrieval
  1.2 Space Partitioning Concepts
    1.2.1 Metric Space
    1.2.2 Non-Ordered Discrete Space
    1.2.3 Hybrid Space
    1.2.4 Vector Space Distance Measurements
  1.3 Overview of the dissertation

2 Previous Work
  2.1 Nearest Neighbor Algorithms
    2.1.1 Exact Searching Methods
    2.1.2 Approximate Searching Methods
    2.1.3 Unique Searching Methods
  2.2 General Metric Space and CDS Index Structures
    2.2.1 KD-Tree
    2.2.2 LSD-Tree
    2.2.3 R-Tree and R*-Tree
    2.2.4 Burkhard-Keller Tree
    2.2.5 Fixed Query Tree
    2.2.6 Fixed Queries Array
    2.2.7 M-Tree
  2.3 NDDS Models in Vector Space
    2.3.1 The ND-tree
    2.3.2 NSP-Tree
  2.4 HDS Models in Vector Space
    2.4.1 NDh-tree
    2.4.2 CND-tree
  2.5 Determining Distance
3 k-Nearest Neighbor Searching in NDDS
  3.1 Motivations and Challenges
  3.2 k-Nearest Neighbors in NDDS
    3.2.1 Definition of k-NN in NDDS
    3.2.2 Extended Hamming Distance
    3.2.3 Probability of Valid Neighbors
  3.3 A k-NN Algorithm for NDDS
    3.3.1 Heuristics
    3.3.2 Algorithm Description
    3.3.3 Performance Model
  3.4 Experimental Results
    3.4.1 Effectiveness of GEH Distance
    3.4.2 Efficiency of k-NN Algorithm on Uniform Data
    3.4.3 Efficiency of k-NN Algorithm on Skewed Data
    3.4.4 Efficiency of k-NN Algorithm on Non-Homogeneous Data
    3.4.5 Verification of Performance Model

4 Understanding Distance in NDDS
  4.1 Motivations and Challenges
  4.2 Generalized GEH Distance
  4.3 Ranking Based GEH Instantiation

5 k-Nearest Neighbor in Hybrid Data Spaces
  5.1 Motivation and Challenges
  5.2 Nearest Neighbor Search Stages
  5.3 Search Algorithm
    5.3.1 Match Likelihood
    5.3.2 Algorithm Description
    5.3.3 Performance Model
  5.4 Experimental Results
    5.4.1 Effects of Heuristics and Datasets
    5.4.2 Performance Model Verification

6 Conclusion

APPENDICES

A Intrinsic Dimensionality in Non-Ordered Discrete Data Spaces
  A.1 Overview
  A.2 Distribution of NDDS Datasets
  A.3 Distribution Effects on Search Performance
  A.4 Experimental Results

B Triangular Property of GEH - Extended Proof

C MinMaxDistance Discussion
  C.1 Overview
  C.2 Proof

BIBLIOGRAPHY

LIST OF TABLES

4.1 Varying Dimensionality
4.2 Varying Zipf Distribution
5.1 Performance Model Variables

LIST OF FIGURES

3.1 Example of NDDS data points distributed by distance
3.2 Comparison of Δk values for the Hamming distance
3.3 Comparison of Δk values for the GEH and Hamming distances
3.4 Effects of heuristics in the k-NN algorithm using ND-tree with k = 10
3.5 Performance of the k-NN algorithm using ND-tree vs. the linear scan on synthetic datasets with various sizes
3.6 Number of disk accesses comparison of the k-NN algorithm using ND-tree vs. the k-NN searching based on M-tree
3.7 Performance of the k-NN algorithm using ND-tree vs. the linear scan on genomic datasets with various sizes for k = 10
3.8 Performance of the k-NN algorithm using ND-tree vs. the linear scan on genomic datasets with various dimensions for k = 10
3.9 Performance of the k-NN algorithm using ND-tree vs. the linear scan on synthetic datasets with various dimensions for k = 10 and d = 10
3.10 Performance comparison for the k-NN searching using ND-tree based on GEH and Hamming distances
3.11 Performance of the k-NN algorithm using ND-tree vs. the linear scan on synthetic datasets with various sizes and zipf distributions
3.12 Performance of the k-NN algorithm using ND-tree on datasets with various misleading dimensions (k = 1)
3.13 Performance of the k-NN algorithm using ND-tree on datasets with various misleading dimensions (k = 5)
3.14 Performance of the k-NN algorithm using ND-tree on datasets with various misleading dimensions (k = 10)
3.15 Estimated and actual performance of the k-NN algorithm vs. the linear scan on synthetic datasets with various sizes (k = 1)
3.16 Estimated and actual performance of the k-NN algorithm vs. the linear scan on synthetic datasets with various sizes (k = 5)
3.17 Estimated and actual performance of the k-NN algorithm vs. the linear scan on synthetic datasets with various sizes (k = 10)
5.1 Search stage I/O with variable number of continuous dimensions
5.2 Search stage I/O with variable number of discrete dimensions
5.3 Search stage I/O with variable database size
5.4 Performance I/O with variable number of non-native dimensions
5.5 Performance I/O with variable database size
5.6 Performance I/O with variable database size (CND-tree and ND-tree only)
5.7 Performance I/O comparing ordering methods
5.8 Performance model comparison with variable database size
5.9 Performance model comparison with variable number of non-native dimensions
A.1 Histogram of Hamming distance values
A.2 Histogram of GEH distance values
A.3 GEH_Rank, zipf = [0.0 - 1.5]
A.4 GEH_Freq, zipf = [0.0 - 1.5]

CHAPTER 1

Introduction

Nearest neighbor searches/queries in Non-Ordered Discrete Data Spaces (NDDS) and Hybrid Data Spaces (HDS) are becoming increasingly useful in areas such as bioinformatics, multimedia, text retrieval, audio and video compression, information security, data mining, and E-commerce. This form of searching may be stated as the following problem: given a set S of n data points in a dataset X (with |X| ≥ k) and a query point q ∈ X, return a set of k > 0 objects A ⊆ X where ∀u ∈ A, v ∈ S − A : D(q, u) ≤ D(q, v) and |A| = k. Examples of such queries are "finding the k closest restaurants to the intersection of Fifth and Main," and "finding the k fixed-length words that differ the least from the word 'near'."
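To make this definition concrete, the following is a minimal brute-force sketch (not part of the original text): a linear scan that returns the k smallest distances to the query. The distance function, sample data, and tie-breaking behavior are illustrative assumptions only.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

def brute_force_knn(data: Sequence, q, k: int,
                    dist: Callable) -> List[Tuple[float, object]]:
    """Return the k objects of `data` with the smallest dist(q, x).

    Linear-scan baseline for the k-NN definition above; ties beyond the
    k-th distance are broken arbitrarily, which is exactly the source of
    non-determinism the dissertation later addresses for NDDSs.
    """
    # heapq.nsmallest keeps only k candidates at a time while scanning.
    return heapq.nsmallest(k, ((dist(q, x), x) for x in data),
                           key=lambda pair: pair[0])

def hamming(u: Sequence, v: Sequence) -> int:
    """Hamming distance: number of dimensions whose elements differ."""
    return sum(1 for a, b in zip(u, v) if a != b)

if __name__ == "__main__":
    db = ["agtca", "agtcc", "ttgca", "agacc", "cgtca"]
    print(brute_force_knn(db, "agtca", k=2, dist=hamming))
```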
Numerous techniques have been proposed in the literature to support efficient nearest neighbor queries in continuous (ordered) data spaces (CDS) and general metric spaces. Excellent surveys of both the theoretical and practical aspects of nearest neighbor searching in such spaces have been presented by Chavez et al. [14] and Hjaltason and Samet [27], respectively. Little work has been reported on supporting efficient similarity searches in either NDDSs or HDSs.

A d-dimensional NDDS is a Cartesian product of d domains/alphabets consisting of finite non-ordered elements/letters. For example, when searching genome DNA sequences, consisting of letters 'a', 'g', 't', 'c', each sequence is often divided into intervals/strings of a fixed length d (q-gram). These intervals can be considered as vectors from a d-dimensional NDDS with alphabet {a, g, t, c} for each dimension. Other examples of non-ordered discrete dimensions are color, gender, and profession. A d-dimensional HDS is a Cartesian product of m (m ≤ d) dimensions/alphabets consisting of finite non-ordered elements/letters and n = d − m continuous dimensions. For example, consider sports-related data containing match win statistics and match locations. Match win statistics could be considered continuous data, while match locations could be considered non-ordered discrete data.

The remainder of this chapter is organized as follows: Section 1.1 discusses motivating applications; Section 1.2 introduces space partitioning concepts considered throughout this dissertation; Section 1.3 provides an overview of the research presented in this dissertation.

1.1 Motivating Applications

This section presents a sample of applications that rely upon performing efficient k-NN similarity searches in CDSs, NDDSs, and HDSs. Both fixed-length and variable-length data applications are presented for completeness.

1.1.1 Multimedia Objects

Facial recognition algorithms are used by a wide variety of applications, such as law enforcement and security, to identify an object by a live or still image. This is achieved by comparing a selected set of features from the image against similar feature sets stored in a database. This is equivalent to creating a feature vector composed of HDS objects. Fingerprint matching and voice recognition are both handled in a similar fashion. For each, a feature vector is established, containing values representing key structures within the object, such as whorls in a fingerprint, which is then compared with an indexed set of feature vectors to determine the closest matches. Here, it is common practice to try to determine the single closest match, ideally considered an exact match. However, due to the approximation involved in creating the feature vectors, nearest matching is a much more practical and efficient approach.

1.1.2 Computational Biology

Biological data, particularly genome sequences, may be represented as a d-dimensional vector α, with each dimension i (1 ≤ i ≤ d) containing a value from the set of possible elements A_i for the i-th dimension. In genome sequencing, the alphabet over all dimensions is the same, A = {a, g, t, c}, such that a 25-dimensional segment of a genome strand could be represented in an NDDS Ω_25 as α = "agtcaagtcaaatccagtcaatcca". Initially, searching in genome sequence databases employed edit distance metrics or substitution matrices, such as BLOSUM, to determine the similarity between genome sequences. Lately, however, the Hamming distance (discussed in Section 1.2) has become more popular for searching in large genome sequence databases [29], [54]. Unfortunately, the Hamming distance has a poor level of granularity in measurement, which can lead to unclear results. We consider this issue in depth in Chapter 3.
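As an illustration of the q-gram view described above, the short sketch below (not from the original text; the segment length, the example strand, and the use of non-overlapping intervals are assumptions) splits a DNA string into fixed-length intervals and compares two of them with the Hamming distance.

```python
def qgrams(sequence: str, d: int = 25) -> list:
    """Split a genome string into non-overlapping intervals of length d.

    Each interval is a vector in the d-dimensional NDDS over {a, g, t, c};
    trailing characters that do not fill a full interval are dropped.
    """
    return [sequence[i:i + d] for i in range(0, len(sequence) - d + 1, d)]

if __name__ == "__main__":
    strand = "agtcaagtcaaatccagtcaatccattgcaggctttgcaaggctttgcag"
    vectors = qgrams(strand, d=25)
    print(vectors)                                        # two 25-dimensional vectors
    # Coarse Hamming comparison of the two intervals:
    print(sum(a != b for a, b in zip(vectors[0], vectors[1])))
```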
1.1.3 Text Retrieval

Text objects are typically represented as strings of varying length. Several index methods have been proposed to support efficient searches over such data. Examples of this approach are Tries [30], [18], the Prefix B-tree [5], and the String B-tree [19]. Most indexing methods in this category, such as the Prefix B-tree and String B-tree, are designed specifically for exact searches rather than similarity searches. The Trie model does support similarity searches, but is difficult to apply to large databases due to its memory-based nature.

Text retrieval operations are, however, fundamentally different from those in the other applications listed in this section. Datasets of the previously listed applications, such as genome sequences, are composed of objects each represented by a string/vector of fixed length/size. Due to this, geometric properties such as area and range can be applied to the dataset. Text retrieval operations, however, cannot take advantage of these geometric properties because text objects are composed of varying-length strings. This dissertation focuses upon searching in datasets of fixed dimensionality and thus does not consider applications towards efficient text retrieval. The issue of text retrieval is covered in detail in [45].

1.1.4 Document Retrieval

Document retrieval is used extensively in World Wide Web and database applications. Traditionally, a test document d_i is classified by creating a feature vector containing the percent composition of certain key words throughout the document. For example, if a document is being classified by only three key words 'river', 'bank', and 'sediment', a feature vector V = {0.5, 0.3, 0.2} might be generated to identify d_i, meaning that of the key words found in the document, 50% are 'river', 30% are 'bank', and 20% are 'sediment'. This vector would then be compared to a dataset U_T composed of similar feature vectors generated from a training/representative set of documents.

Most work in this field has focused upon improving the construction of feature vectors [36] and choosing an optimal number of neighbors to search for [4]. However, [4] showed how the current standard method of cosine similarity measurement sometimes provides an intuitively incorrect similarity match between feature vectors. This phenomenon results from the feature vectors themselves being only an approximation of key word occurrences within a document, paying little attention to how those words are related to each other. A possible solution to this problem is to calculate the similarity between two documents based upon a discrete representation of their key words.

1.2 Space Partitioning Concepts

This section discusses common data space environments in which k-NN searches are performed. Continuous Data Space, in the form of general metric space, is discussed first. This is followed by a discussion of Non-Ordered Discrete Data Space. Next, the combination of NDDS and CDS is presented as Hybrid Data Space. Finally, the concept of vector space distance measurement is introduced.

1.2.1 Metric Space

Let U denote the universe of valid objects being searched. The function D : U × U → R denotes a measure of distance between objects. To be used in search and indexing applications, such distance functions generally must satisfy the following properties:

Positiveness: ∀x, y ∈ U, D(x, y) ≥ 0
Symmetry: ∀x, y ∈ U, D(x, y) = D(y, x)
Reflexivity: ∀x ∈ U, D(x, x) = 0
Strict Positiveness: ∀x, y ∈ U, x ≠ y ⇒ D(x, y) > 0
Triangular Inequality: ∀x, y, z ∈ U, D(x, y) ≤ D(x, z) + D(z, y)

If a distance measurement D satisfies the Triangular Inequality property, then the pair (U, D) is called a metric space.
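The following sketch (illustrative only; the sampling scheme, alphabet, and use of the Hamming distance as the example metric are assumptions, not taken from the text) empirically checks the properties listed above for a candidate distance function over a small NDDS sample.

```python
import itertools
import random

def hamming(u, v):
    """Number of dimensions on which two equal-length vectors differ."""
    return sum(1 for a, b in zip(u, v) if a != b)

def check_metric_axioms(dist, points):
    """Report whether `dist` satisfies the metric properties on `points`."""
    ok = True
    for x, y in itertools.product(points, repeat=2):
        ok &= dist(x, y) >= 0                        # positiveness
        ok &= dist(x, y) == dist(y, x)               # symmetry
        ok &= (dist(x, y) > 0) == (x != y)           # reflexivity + strict positiveness
    for x, y, z in itertools.product(points, repeat=3):
        ok &= dist(x, y) <= dist(x, z) + dist(z, y)  # triangular inequality
    return ok

if __name__ == "__main__":
    random.seed(0)
    alphabet, d = "agtc", 10
    sample = ["".join(random.choice(alphabet) for _ in range(d)) for _ in range(20)]
    print(check_metric_axioms(hamming, sample))      # True: (U, Hamming) is a metric space
```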
If D does not satisfy the Strict Positiveness property, the space is called a pseudo-metric space. If the Symmetry property is not fulfilled, the space is generally described as a quasi-metric space; the case of a city map containing one-way streets is a common example of such a space.

From the above definitions, it can be inferred that most spaces are in fact specialized implementations of the general metric space. Specifically, if the elements in the metric space are tuples of real numbers, then the space may be described as a finite-dimensional vector space [40]. Such spaces generally make use of geometric distance measures, such as Euclidean or Manhattan. When such tuples represent numbers from a continuous (ordered) dataset, the pair (U, D) is considered a CDS.

1.2.2 Non-Ordered Discrete Space

Discrete space is based upon the concept of all objects in the universe of discourse U being discrete and non-ordered along each dimension. Ordered discrete objects typically demonstrate the same properties as CDS objects and thus are not considered in this section. Examples of such non-ordered discrete data are multimedia objects, profession, gender, bioinformatics, and user-defined types. Each of these examples may be represented as a feature vector in a d-dimensional space. Consider that, in genome sequence databases such as GenBank, sequences with alphabet A = {a, g, t, c} are broken into substrings of fixed length d for similarity searches [29], [54]. Each substring can be considered as a vector in d-dimensional space. For example, substring aggctttgcaaggctttgcagcact is a vector in the 25-dimensional data space, where the i-th character is a letter chosen from alphabet A in the i-th dimension. In this example, a is no closer to c than it is to t, and so forth. Thus, mapping discrete space into continuous space by applying a form of ordering changes the semantics of the dataset.

A formal definition of an NDDS is as follows. Let A_i (1 ≤ i ≤ d) be an alphabet consisting of a finite number of non-ordered elements. A d-dimensional NDDS Ω_d is defined as the Cartesian product of d alphabets:

\[ \Omega_d = A_1 \times A_2 \times \cdots \times A_d . \]

A_i represents the i-th dimension alphabet of Ω_d. The area of space Ω_d is defined as:

\[ \mathrm{area}(\Omega_d) = \prod_{i=1}^{d} |A_i| . \]

This formula also indicates the number of possible unique vectors in Ω_d. A vector α is defined by the tuple α = (α[1], α[2], ..., α[d]), where α[i] ∈ A_i. A discrete rectangle R in Ω_d is defined as the Cartesian product:

\[ R = S_1 \times S_2 \times \cdots \times S_d , \]

where S_i is a set of elements/letters from the alphabet of the i-th dimension of the given d-dimensional NDDS. The length of the i-th dimension edge of R is defined as:

\[ \mathrm{length}(R, i) = |S_i| . \]

The area of R follows the formula for the area of the data space:

\[ \mathrm{area}(R) = \prod_{i=1}^{d} |S_i| . \]
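The sketch below is an illustrative reading of these definitions (not code from the dissertation): it represents a discrete rectangle as a list of per-dimension element sets and computes its edge lengths, its area, and whether it contains a given vector.

```python
from math import prod

# A discrete rectangle R = S_1 x S_2 x ... x S_d is modeled as a list of sets,
# one set of alphabet elements per dimension.
def length(R, i):
    """Edge length of R on dimension i: the number of elements in S_i."""
    return len(R[i])

def area(R):
    """Area of R: the product of its edge lengths over all dimensions."""
    return prod(len(S_i) for S_i in R)

def contains(R, alpha):
    """True if vector alpha lies inside R (alpha[i] is in S_i for every i)."""
    return all(a in S_i for a, S_i in zip(alpha, R))

if __name__ == "__main__":
    R = [{"a", "g"}, {"t"}, {"a", "c", "g"}]        # a rectangle in a 3-dimensional NDDS
    print(length(R, 0), area(R))                    # 2, 2*1*3 = 6
    print(contains(R, ("g", "t", "c")))             # True
    print(contains(R, ("t", "t", "c")))             # False
```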
1.2.3 Hybrid Space

Hybrid space is composed of a combination of NDDS elements and CDS elements. Consider domain D_i (1 ≤ i ≤ d). If D_i is comprised of non-ordered discrete elements, it corresponds to an alphabet A_i. If D_i is comprised of continuous elements, it corresponds to some continuous span. For the purposes of our discussion, we will consider this span to be normalized within [0, 1]. A d-dimensional hybrid data space (HDS) Ω_d is defined as the Cartesian product of d such domains:

\[ \Omega_d = D_1 \times D_2 \times \cdots \times D_d . \]

As described by Qian [41], a vector in Ω_d is comprised of the tuple α = (α_1, α_2, ..., α_d), where α_i ∈ D_i (1 ≤ i ≤ d). Subdomains within D_i are defined as S_i ⊆ D_i (1 ≤ i ≤ d), where S_i ⊆ A_i = D_i if D_i is a non-ordered discrete domain, or S_i is a range [min_i, max_i] if D_i is a continuous domain. A hybrid rectangle is then defined as the Cartesian product of subdomains S_i such that:

\[ R = S_1 \times S_2 \times \cdots \times S_d . \]

The length of the i-th dimension edge of R is defined as:

\[ \mathrm{length}(R, i) = \begin{cases} \dfrac{|S_i|}{|A_i|} & \text{if dimension } i \text{ is non-ordered discrete} \\ \max_i - \min_i & \text{otherwise} \end{cases} \]

The area of R is defined as:

\[ \mathrm{area}(R) = \prod_{i=1}^{d} \mathrm{length}(R, i) . \]
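Continuing the illustrative sketches above (again an assumption-laden reading of the definitions, not the dissertation's code), a hybrid rectangle can be modeled by tagging each dimension as discrete or continuous:

```python
from math import prod

# Each dimension is ("d", S_i, alphabet_size) for a discrete dimension, or
# ("c", (min_i, max_i)) for a continuous dimension normalized to [0, 1].
def length(R, i):
    """Edge length of a hybrid rectangle R on dimension i."""
    if R[i][0] == "d":
        _, S_i, alphabet_size = R[i]
        return len(S_i) / alphabet_size
    _, (lo, hi) = R[i]
    return hi - lo

def area(R):
    """Area of R: product of the edge lengths over all dimensions."""
    return prod(length(R, i) for i in range(len(R)))

if __name__ == "__main__":
    R = [("d", {"a", "g"}, 4),      # discrete dimension over {a, g, t, c}
         ("c", (0.25, 0.75)),       # continuous dimension
         ("d", {"t"}, 4)]
    print([length(R, i) for i in range(3)])   # [0.5, 0.5, 0.25]
    print(area(R))                            # 0.0625
```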
1.2.4 Vector Space Distance Measurements

Determining an efficient distance measurement is still an open problem. As discussed by Qian [41], the vector model is widely used to support similarity queries. Additional models, such as the Boolean model and the probabilistic model [2], have also been proposed to support similarity queries. However, as discussed by Baeza et al. [2], the vector model is considered to be more effective for similarity searches. To perform searches, each object in the database, as well as the query object, is represented as a fixed-length vector. For example, consider security applications tracking intrusion attempts. Each attempt in the database can be transformed into a vector based upon time, frequency, intruder ID, intrusion method, and other intrusion characteristics. Each intrusion can now be considered a point in a multidimensional vector space (either CDS, NDDS, or HDS, dependent upon the feature generation methods). The distance values between each pair of vector objects may now be calculated utilizing a distance metric most suitable for such a space. Typically, a pair of objects with a minimal distance value between them is considered more similar than a pair of objects with a maximal distance value between them.

The focus of our research is on supporting efficient similarity queries using the vector space model. It should be noted that not all applications have objects that may be efficiently represented as natural vectors. Qian [41] notes that some forms of multimedia data, such as video clips, are often stored in their native format due to a loss of precision when generating a feature vector to represent them. Although not considered in detail here, the issue of designing an effective feature generation algorithm is still an open problem. Excellent surveys concerning feature generation for multimedia objects are presented in [1] and [49].

1.3 Overview of the dissertation

The remainder of this dissertation is organized as follows: Chapter 2 discusses previous work performed in this and related fields, including multidimensional indexing and similarity search algorithm development; Chapter 3 presents our research in developing novel similarity search methods for NDDSs; Chapter 4 presents our work in developing a non-application-specific distance measure for NDDSs; Chapter 5 presents our research in extending our work on similarity searches in NDDSs to HDSs; Chapter 6 summarizes the contributions of this dissertation and provides directions for future work.

CHAPTER 2

Previous Work

This chapter presents previous research work related to this dissertation. We first discuss the evolution of common similarity searching algorithms. We then present an overview of popular index structures used to support efficient similarity searches in vector spaces. These index structures are used to maintain data in CDSs and NDDSs, as well as in recently proposed work to support HDSs. Lastly, we present work concerning distance measure comparisons.

2.1 Nearest Neighbor Algorithms

Most similarity searching algorithms may be distilled to a simple formula applicable to CDS, NDDS, or HDS index structures. Further, range queries may be viewed as a special case of nearest neighbor queries where the final search radius is known at the start of the search. As such, this section focuses upon nearest neighbor search algorithms. The following algorithms perform similarity searches for the nearest neighbor to a query based on an index structure of fixed dimensionality. Generally, the search for a single nearest neighbor may be expanded to find k nearest neighbors by maintaining a list of neighbors found and using the distance between the neighbor that occupies the k-th distance-related slot and the query point as a search range/radius value to search within. Each of the index structures described in the following sections may generally be used in conjunction with one of the following search methods. However, developers will typically modify the algorithm to better suit the applicable structure.

2.1.1 Exact Searching Methods

The most basic method to perform a k-NN similarity search involves using a range search algorithm. Begin with radius r = α (α > 0) centered at query point q. Increase α until at least k elements lie within the radius. The cost, in terms of page accesses, of this algorithm is similar to that of performing a range search. This cost, however, is greatly affected by the amount the value α is adjusted by every time the desired number of elements is not yet found. Too small, and the performance cost will quickly grow; too large, and the number of points returned will far exceed the desired number, thus decreasing the usefulness of the solution.

A more elegant approach was proposed for both general metric spaces and CDSs [12, 3, 47]: begin the search on any data structure using r = ∞. The search begins at the anchor/root of the data structure. Each time the query point q is compared to some element p, the search radius is updated such that r = min(r, D(q, p)). The search then backtracks to the nearest split in the index where a new path has not yet been traversed and continues down the new path using the newly calculated radius. As the search continues, the possibility increases that an entire path of the index structure need not be searched, due to the decreasing size of the radius r. Pruning heuristics are employed to determine if a certain route is no longer necessary to search. Roussopoulos et al. [47] provided two heuristics, named MINDIST and MINMAXDIST, for CDS-based index structures (namely R-trees) that invoke this pruning. MINMAXDIST is used in the breadth search of the covering rectangle of the current search path by determining the minimum of the maximum distances of all covering rectangles contained in the current one. This distance is used to update the current search radius such that r = min(r, MINMAXDIST). MINDIST is then used in the depth traversal to determine if any point of a covering rectangle associated with a new search path is within the search radius. If no point is within the search radius, that search path is pruned from the remainder of the search, and thus the effective size of the database to search is decreased.
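A compact sketch of this backtracking scheme is shown below. It is a generic illustration under assumed node and MINDIST definitions (axis-aligned bounding boxes over numeric vectors), not the exact algorithm of any cited paper: the search keeps a shrinking radius and skips any subtree whose minimum possible distance to the query already meets or exceeds it.

```python
import heapq

class Node:
    """A tiny index node: leaves store data points, internal nodes store children.

    mindist(q) lower-bounds the distance from q to anything under the node;
    here an axis-aligned bounding box over numeric vectors is used as an example.
    """
    def __init__(self, points=None, children=None):
        self.points = points or []
        self.children = children or []
        items = self.points + [c for ch in self.children for c in ch._corners()]
        self.lo = [min(v) for v in zip(*items)]
        self.hi = [max(v) for v in zip(*items)]

    def _corners(self):
        return [self.lo, self.hi]

    def mindist(self, q):
        return sum(max(l - x, 0, x - h) ** 2
                   for x, l, h in zip(q, self.lo, self.hi)) ** 0.5

def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def nearest(root, q):
    """Best-first branch-and-bound: prune subtrees whose bound >= current radius."""
    best, radius = None, float("inf")
    heap = [(root.mindist(q), id(root), root)]
    while heap:
        bound, _, node = heapq.heappop(heap)
        if bound >= radius:                 # nothing closer can be found here
            continue
        for p in node.points:
            d = dist(q, p)
            if d < radius:
                best, radius = p, d         # shrink the search radius
        for child in node.children:
            heapq.heappush(heap, (child.mindist(q), id(child), child))
    return best, radius

if __name__ == "__main__":
    leaf1 = Node(points=[(1.0, 1.0), (2.0, 2.0)])
    leaf2 = Node(points=[(8.0, 8.0), (9.0, 7.0)])
    root = Node(children=[leaf1, leaf2])
    print(nearest(root, (2.5, 2.5)))        # ((2.0, 2.0), ~0.707); leaf2 is pruned
```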
This range reduction method may be improved further by attaining a smaller radius value earlier in the search. Several techniques have been used in both CDSs and general metric spaces [51, 28, 16]. The underlying idea of each technique is that certain paths may be identified by their high-level statistics that will yield a closer point to the query point q earlier in the search. The most common application of this idea is to order potential search paths by either their MINDIST or MINMAXDIST values to q. The MINDIST ordering takes the optimistic view that a lower MINDIST value is caused by a relatively closer object in the index structure. This may not always prove true in spatial index structures. Commonly, some point of a search path only exists at the topmost layers. At higher levels within an index structure, points may actually be the result of the intersection of lines drawn from several points lower in the index structure. When this technique appears to suffer from this problem, the pessimistic approach using MINMAXDIST may be used instead. Here, search paths are ordered by the increasing value of their furthest point. Thus a search may correctly assume that it will at least not encounter any points further away than the MINMAXDIST.

2.1.2 Approximate Searching Methods

Relaxing the precision of the query results may lead to even further reductions in time complexity. This is a reasonable procedure for many applications, due to some approximation in the modeling of feature vectors for both general metric and CDS indexes. In addition to the query itself, a user specifies some query parameter ε to control how far away from the query point the search may progress. In this manner, the algorithm avoids the costly initial stages of a similarity search. On subsequent searches of similar databases, ε may decrease to approach zero. As ε decreases, the time complexity, along with the precision of the result, decreases as well.

A probabilistic algorithm was given for vector spaces by Yianilos et al. [55], using a method described as aggressive pruning to improve the performance. Here, the idea is to increase the number of branches that are pruned at the expense of possible points in the set of nearest neighbors. This process is controlled such that the probability of success is always known. Unfortunately, the data structure used is only useful within a very limited radius; in databases or with searches that could result in neighbors with distances beyond the possible radius, the algorithm is not able to guarantee a result of the true nearest neighbors. This form of similarity searching is described as Approximate Similarity Searching. This topic is not covered in detail in this dissertation, but is mentioned here for completeness. An in-depth coverage may be found in [53].

2.1.3 Unique Searching Methods

The techniques described thus far cover universal proposals for performing k-NN similarity searches. There are, however, examples of k-NN search algorithms developed for specific indexes that are inapplicable in a generic sense. Such algorithms depend upon the structure developed to support them and are unable to be incorporated with common indexing techniques. Clarkson [17] proposes a method that alleviates the need to perform extensive backtracking by creating a GNAT-like data structure where points are inserted into multiple subtrees. The tree is constructed by first selecting representatives for the root(s) and then inserting each element u into not only the subtree of its closest representative p, but also into the subtree of any other representative p′ such that D(u, p′) ≤ 3·D(u, p). During a query on object q, the search enters all subtrees such that D(q, p′) ≤ 3·D(q, p).
p’) S 3 * D((}q. p). As shown by Clarkson, this is enough to guarantee the retrieval of the nearest neighbor and could be extended to determine the set of k-NN. 2.2 General Metric Space and CDS Index Struc- tures Numerous data indexing structures have been proposed in the literature that support. efficient similarity searches in CDSs and general metric spaces. Most. CDS data 18 index structures may be classified into two categories: data-partitioning and space- partitioning. Examples of the former split an overflow node by partitioning the set of its indexed data objects. The later, split an overflow node by partitioning its representative data space (typically via a splitting point in a dimension). Most of these methods are not. aI.)plicable in either NDDS or HDS due to some essential geometric concepts such as a bounding box no longer being valid in such spaces. An in depth discussion of continuous multidimensional indexing structures may be found in [21]. Metric trees represent a wholly different. approach to indexing data. Here, data points are not reli)resented in any type of dimensional space, rather metric trees implement structures using only the distance information between data points and some center origin / root. Such non-geon’ietric trees are generally not optimized toward NDDSs or HDSs, such as CDS implementations are toward CDSs. However, they do provide a prevalent counter solution to implementing a vector space model for either an NDDS or HDS. An in depth discussion of searching in metric space may be found in [40]. This section presents an overview of commonly employed index structures for both CDS and general metric space. Chapter 5 discusses some of the issues that. arise when applying CDS based index structures to an HDS dataset.. 19 2.2.1 KD-Tree The KD-tree [8, 7], was one of the first proposed d—dimensional data index structures. Structured as a binary tree. the KD-tree recursively sul.‘)divides a dataset along ((1—1)— dimensional hyperplanes whose direction alternates among the (1 possibilities. A simple example is for d = 3, the splitting hyperplanes would be perpendicular to the :r, y, and z axes. Each hyperplane must contain at least one point where interior nodes have one or two descendants. Searching and insertion of new points are simple procedures that. use the interior nodes as guide posts. Deletion however, is much more complicated and invariably results in a. costly reorganization of the tree structure. The KD-tree structure is highly sensitive to the order in which the points are inserted. A poor insertion order generally results in data points being scattered all over the tree. A solution to this problem was presented by Bentley et al. as the adaptive KD-tree [9]. By relaxing the requirements that hyperplanes have to contain a data point as well as being strictly alternating, the adaptive KD—tree was able to choose splits that resulted in a much more even data distribution on both sides of a particular branch. Conceptually. this resulted in pushing all data points out to the leaf nodes leaving interior nodes to only contain dimensional information. The adaptive KD—tree is unfortunately very st atic in nature; frequent insertions and deletions require costly reorganizations to keep the tree balanced. However, the adaptive KD—tree does perform well when the dat aset is known prior to construction of the tree. 
2.2.2 LSD-Tree

The Local Split Decision (LSD) tree [24] is organized as an adaptive KD-tree, partitioning a dataset into disjoint cells of various sizes. Better adapted to the distribution of a dataset than fixed binary partitioning, the LSD-tree uses a special paging algorithm to preserve the external balancing property; i.e., the heights of its external subtrees differ by at most one. This is accomplished by paging out subtrees when the structure becomes too large to fit in main memory. While this strategy increases the efficiency of the tree structure, it prevents the LSD-tree from being used in general-purpose database systems, where such extensive paging is not available.

The split strategy of the LSD-tree tries to accommodate skewed data by combining data-dependent and distribution-dependent split strategies, SP1 and SP2 respectively. The former attempts to achieve the most balanced structure by trying to keep an equal number of data objects on both sides of the split. The latter is performed at a fixed dimension and position where the underlying data is in a known distribution pattern. The split position SP is determined as a linear combination of the two strategies: SP = α·SP1 + (1 − α)·SP2, where α ∈ (0, 1). The factor α may vary as objects are inserted or deleted from the tree. This property increases the efficiency of the LSD-tree, but makes integration with generic database systems of unknown data distributions difficult.

2.2.3 R-Tree and R*-Tree

The R-tree [22] is a height-balanced tree with index records/pointers in its leaf nodes, similar to the adaptive KD-tree. Each node v represents a disk page along with a d-dimensional interval I_d(v). If the node is an interior node, all descendants v_i of v are contained in the interval I_d(v). If the node is a leaf node, then I_d(v) represents the d-dimensional bounding box of all the objects stored in the node. Nodes at the same level may overlap in area. Each node contains between m and M entries unless it is the root. The lower bound m is used to prevent the degeneration of trees and to ensure efficient storage utilization. If the number of entries in a node drops below m, the node is deleted and the descendants of the node are reinserted into the tree (tree condensation). The upper bound M is used to maintain each node's size to that of one disk page. Being a height-balanced tree, all of the leaves are at the same level and the height is at most ⌈log_m(N)⌉ for N (N > 1) index records.

Objects in an R-tree are represented by their minimum bounding interval I_d(o). To insert an object, the tree begins at the root and traverses a single depth-first path to a leaf node. At each depth layer, heuristics are used to determine which descendant path to follow. The first is a calculation of which path would require the least enlargement of area of the interval I_d(v_j) representing it.
Deletion is handled in a similar manner to insertion. First an exact match search is performed to determine if the selected element for deletion is contained in the database. If so, the element is deleted. and the containing leaf node is checked to see if the area interval may be reduced. If the area is reduced, the change is propagated upwards toward the root node. If this deletion causes the number of elements to drop below the lower bound m, the remaining elements in the leaf node are copied to temporary storage and then deleted from the database, the changes caused by this deletion are again propagated upwards which generally results in the adjustment of several intermediate nodes. Once this is completed, the elements in temporary storage are inserted back into the index structure following the insert method described above. 23 Searching an R—Tree may be performed in a similar manner to the first stages of insertion. At each index node 'v. all index entries are checked to see if they intersect with the search interval. If v is an interior node, the search continues to all the descendant nodes 13,: who were found to intersect the search interval. If v is a leaf node, the search adds all the entries that intersected the search interval to the solution set. Due to interior nodes overlapping area. it is common for a search to include multiple descendants from interior nodes. In the worst case scenario, this will lead to every single node in the tree having to be accessed. Several weaknesses in the original tree construction algorithms were identified through careful analysis of R-tree behavior under different data distributions. It was identified that the insertion phase was a critical step in building an R-tree towards good search performance [48]. The result of this study was the R*-tree. The R*-tree provides a more controlled insertion performance by introduced a policy called forced rei'nsert: if a. node overflows, it is not split immediately, rather p entries are removed from the node and reinserted into the tree. Through testing, it was proposed in [6] that p should be about 30% of the maximum number of entries per node, AI. Additionally, the R*-tree further addresses the issue of node—splitting by adding more heuristics to avoid making random choices. In Guttman et al.’s original R— tree algorithm, node—splitting policy was based solely on trying to minimize the 24 area covered by the two new nodes. This lead to many possible splits, whereby a random split was selected. The R*-tree breaks these ties by taking three more factors into account: overlap, perimeter values, and storage utilization of the two new nodes. The reduction of the overlap between two nodes reduces the probability that a search will have to follow multiple paths. Reduction in the perimeter of a bounding box increases the density of descendants, which in turn increases the storage utilization of the selected node. Increased storage utilization allows for a greater area to be created before a split is necessary, thereby decreasing the probability of a search needing to follow multiple paths. These additions lead to a marked improvement in performance of the R*-tree over the R—tree [48]. 2.2.4 Burkhard-Keller Tree The Burkhard-Keller Tree (BKT) [12] is considered one of the first metric trees and may be seen as the basis for many of the metric trees proposed after. The BKT is defined as follows. Let U represent the universe of discourse for all valid objects within the database. 
An arbitrary element p ∈ U is selected as the root of the tree. For each distance i > 0, define U_i = {u ∈ U : d(u, p) = i} as the set of all elements at distance i from the root p. For each nonempty U_i, build a child p_i, hereafter labeled a pivot, where the BKT is recursively built for U_i. This process is repeated until no more elements are left for processing or there are only b elements left, which may then be placed within a bucket of size b.

The BKT supports range queries in discrete space effectively. Given a query q and a distance r, the search begins at the root and enters all children i such that d(p, q) − r ≤ i ≤ d(p, q) + r, and proceeds recursively. Whenever a leaf/bucket is encountered, the items are compared sequentially. This guarantees an accurate solution due to the BKT satisfying the Triangular Inequality property.
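A minimal sketch of this construction and range query follows (illustrative only; bucket handling is simplified away and duplicates of a pivot are dropped, which are assumptions rather than details from the text), again using the Hamming distance as the discrete metric.

```python
def hamming(u, v):
    """Discrete metric used by this example: count of differing positions."""
    return sum(1 for a, b in zip(u, v) if a != b)

class BKTree:
    """Burkhard-Keller tree: children are keyed by their distance to the pivot."""
    def __init__(self, pivot):
        self.pivot = pivot
        self.children = {}          # distance i -> subtree of elements at distance i

    def insert(self, item):
        d = hamming(item, self.pivot)
        if d == 0:
            return                  # duplicate of the pivot (simplification)
        if d in self.children:
            self.children[d].insert(item)
        else:
            self.children[d] = BKTree(item)

    def range_query(self, q, r):
        """Return all indexed items within distance r of q."""
        d = hamming(q, self.pivot)
        result = [self.pivot] if d <= r else []
        # Triangle inequality: only children with d - r <= i <= d + r can hold answers.
        for i, child in self.children.items():
            if d - r <= i <= d + r:
                result.extend(child.range_query(q, r))
        return result

if __name__ == "__main__":
    words = ["agtca", "agtcc", "ttgca", "agacc", "cgtca", "ttgcc"]
    tree = BKTree(words[0])
    for w in words[1:]:
        tree.insert(w)
    print(sorted(tree.range_query("agtcg", r=1)))   # ['agtca', 'agtcc']
```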
Upon reaching a leaf, the insertion of the element may cause overflmv, (i.e., the number of elements equals M + 1). In such a case. the node is split in two and the elements are partitioned between the resulting nodes. One node element is promoted upwards to become a representative: this change propagates to the root of the tree. Searches are performed by comparing the search radius 7‘3 with each re1.)resentative‘s covering radius rc in a node. For all representative in the node where r3 < re, the search continues recursively through the subtrees of the effective representative. As shown by Ciaccia et al. [16]. the M—tree shows impressive performance results against CDS geometric indexes. 2.3 NDDS Models in Vector Space Currently, NDDS indexing techniques utilizing vector space representations of data points are fairly limited. A11 exhaustive search of the literature yields only two meth- ods specifically applicable towards indexing such a space representation; the ND—tree and the NSP—tree. This work appears to provide the most. significant. results toward performing lit-nearest neighbor queries within an NDDS. As such, both methods are discussed in detail in the following sul.)sections. 28 2.3.1 The ND-tree The ND-tree [43] is inspired by popular CDS multidimensional indexing techniques such as R-tree and its variants (the. R*—tree in particular). Like many of the tech- niques that inspired it, the ND—tree is a balanced tree satisfying three conditions: (1) the root has at least two children. unless it is a leaf. and at most AI children; (2) every non leaf and leaf node has between m and AI children or entries respec- tively, unless it is the root; (3) all leaves a.1‘)pear at the smile level. Here, AI and m. represent the upper and lower bounds set on the number of children / entries. where 2 g m g [AI/2]. Unlike previous examples of balanced trees, the ND-tree is designed specifically for NDDSs and as such is based upon the NDDS concepts such as discrete rectangles and their areas of overlap defined in Section 1.2.2. Due to this design consideration, the ND-tree is able to take special characteristics of NDDSs into consideration that metric trees are unable to utilize. The N D-tree is a structure of indexed vectors from an NDDS Qd over an alj')habet A, where A,- represents the alplial:)et of the ifh dimension. Leaf nodes consist of tuples of the form (op. key), where key is a vector from (2,1 representing an object, pointed to by op, in the database. A non leaf node N also consists of tuples. of the form (cp, DMBR), where cp is a pointer to a child node N ’ of N and Dr’l-IBR 29 represents the discrete minimum bounding rectai‘igle. described in section 1.2.2, of N’. The DAIBR of a leaf node N". consists of all the vectors indexed in N ”. The DA! B R of a non leaf node N ’ is the set of all DAIBRs of the child nodes of N’. To build an ND-tree, the algorithm ChooseLeaf is implemented to determine the most suitable leaf node for inserting a new vector a into the tree. Starting at the root and progressing to the appropriate leaf, the algorithm ChooseLeaf must. determine which descendant path to follow at each non leaf node it encounters. Decisions are based upon three heuristics used in ascending order for tie breaks, such that if 1H1 results in two or more possible paths, [Hg is used to narrow the field further, and so on until a child must be chosen at random. 
These heuristics are presented as follows: 1H1: Choose a child node corresponding to the entry with the least enlargement of overlap(Ek.DAIBR) after the insertion. [43] 1H1 chooses a child node Ek from entries E1.2,...,Ep, m. S p 3 AI and 1 g k S p, such that the insertion of vector (1 results in the least enlargement of overlap(Ek.DAIBR), defined as: 30 p oecrlap(Ek.DJIIBR) = Z area(Ek.DAIBR fl EluDr'lIBR) (2.1) i:1,i;£k 1H1 results from the observation of the ND-tree experiencing similar retrieval performance degradation due to the increasing amount of overlap between bounding regions as seen for CDSs [10], [‘38]. This increase in overlap is a major concern for high dimensional indexing methods and as such has been described as the high dimensionality curse. Unlike multidimensional index trees in CDS, the number of possible values for the overlap of an entry in an NDDS are limited, implying ties may arise frequently. As such IH2 and 1H3, two area based heuristics, are given to break possible ties: 1H2: Choose a child node corresponding to the entry EA. with the least enlargement of area(Ek.D]lIBR) after the insertion.[43] I H3: Choose a child node corresponding to the entry Ek with the minimum area(Ek.D]\rIBR) . [43] Once any node contains more entries than the maximum allowable value, the overflowing node is split into two new nodes N1 and N2 whose entry sets are from a partition defined as follows: let N be an overflow node containing AI + 1 entries ES 2 {E1, E2, . . . , EM+1}- Partition P of N is a pair of entry sets P 2 {E51, ESQ} 31 such that: (1)E51UESQ =2 ES; (2) E81 flESg = (D; and (3) m S [ESll,m S [ESQ]. Partition P is determined through the algorithm SplitNode, which takes an overflow node N as the input and splits it into two new nodes Nl and N; whose entry sets come from a partition as defined above. This is a critical step in the creation of an ND-tree as many split possibilities exist and a good partition may lead to an efficient tree. Qian et al. [43] proposes handling this in two phases: (1)determine the set of all possible partitions; (2) select the partition most likely to lead to an efficient tree structure. The exhaustive approach to implementing this process is very costly. As shown by Qian et al., even for relatively small. values of ill, for example 50, an exhaustive approach would have to consider a result so large as to make the operation impractical: here, 51! z 1.6 x 1066 permutations. Thus, a more efficient method of generating a (smaller) set of candidate partitions is required. A more efficient method of generating candidate partition sets stems from the property that the size of an alphabet A for an NDDS is usually small; i.e., in genome se uence exam ,)les A = 4. Let I .19.. .. ,l be the elements of A, in this case Q I 1 - A] { a, g, t, c}. The number of 1:)ermutations on the elements of an alphabet is thus also relatively small, here 4! = 24. For example, < a,c.g,t > and < g,c,a.t > are both permutations of the set A. Using these 0lf)SGI‘V‘dthIlS Qian et al. proposes an algorithm to choose a set of candidate partitions consisting of d * (AI — 2m + 2) * (IAl!) candidates. For example, if d = 25, AI = 50, m. = 10, and |A| = 4, the 32 alphabet permutation based algorithm only considers 1.92 x 104. Further, because a permutation and its reverse yield the same set of candidate partitions [43], only half of the aforen'ientioned candidates need be considered; a significant improvement over the exhaustive method. 
It is possible that alphabet A for some NDDS is large. In this case, the aforemen- tioned method no longer 1')ro\«'ides as significant an improxr'emcnt over the exhaustive method. If such a case arises. Qian et al. propose determining one ordering of entries A in the OV'erflow node for each dimension rather than consider ! orderings on each dimension. This is accomplished by creating an auxiliary tree for each dimension Ti and sorting the component sets generated from T,. Once a candidate. set. of partitions has been generated, SplitNode selects an ap- propriate partition based upon four heuristics, as follows: S H 1: Choose a. partition that generates a minimum overlap of the DMBRs of the two new nodes after splitting.[43] S H2: Choose a partition that splits on the dimension where the edge length of the DMBR of the overflow node is the largest.[43] S H3: Choose a partition that has the closest edge lengths of the DMBRs of the two new nodes on the splitting dimension after splitting.[43] 33 SH4: Choose a partition that minimizes the total area of the DMBRs of the two new nodes after splitting.[43] Heuristics S H 1 through S H4 are applied in ascending order until there are no ties present. If a tie still exists after the application of S H4, a partition is chosen at random. Searching an ND—tree is performed similarly to searching an R—tree. Starting at the root, each child nodes DMBR is compared to the search radius to determine if there is an intersection. If the node intersects the search continues recursively. 2.3.2 NSP-Tree A common problem among data partitioning based index structures, such as R*-tree in CDS and ND-tree in N DDS, is that of regional overlap. For such index structures an overflow node is split by grouping its indexed vectors into two sets D81 and DSQ for two new tree nodes such that the new nodes meet a minimum space utilization requirement. Commonly, as the dimensionality of such a space grows, the probability of large overlapping regions increases dramatically. This overlap causes a drastic reduction in the efficiency of searching within such a structure. A solution to this problem was proposed for CDSs in the form of s15)ace—partitioning data indexes. The LSD tree discussed earlier is an example of such methods, where an overflow node is 34 split by partitioning its corresponding data space into two non overlapping s1.il.)s1‘)aces for two new tree nodes. The indexed vectors are then distributed among the new nodes based upon which subspace they belong. \Vhile this method does lead to high search performances in such structures, the nodes in the tree generally no longer guarantee a minimum space utilization. Unfortunately. as was the case for the IND-tree. the methods used in creating CDS implen'ientations of a space—pentitioning index structure do not directly apply to an NDDS. Thus. a new structure. labeled the NSP-tree [44], was proposed. The NSF—tree utilizes many of the concepts described for the ND—tree. For example, the methods used to represent the universe of discourse and minin'nnn bounding rectangles remain the same between the two tree structures. The key difference between an ND-tree and an NSP-trce lies in the method of partitioning the data points into smaller subspaces. The NSP-tree splits an overflow node based upon the frequency distribution of the vectors indexed within the node, such that an even distribution is seen between the two new tree nodes. 
This distribution method differs from CDS models, where a split is performed upon a chosen dimension and the data points are distributed in relation to the split point. That method no longer applies in a space where no ordering exists among the data points: in an NDDS it is impossible to describe some point x_d as being less than or greater than some point y_d without violating the semantics of the dataset.

To increase search efficiency, the NSP-tree utilizes multiple bounding boxes within each subspace to help eliminate the amount of dead space included within the subspace; dead space is any area covered that does not contain any indexed vectors. This is similar to techniques used in CDSs; however, an interesting property of an NDDS is able to be exploited by the NSP-tree. Consider two 2-dimensional points P_1(x_1, y_1) and P_2(x_2, y_2). The MBR necessary to cover such points in a CDS would be a rectangle with points P_1 and P_2 representing corners along one of the diagonals. Such a representation includes a rather large portion of dead space (roughly all of the area is dead space). In an NDDS, however, the MBR may be represented as the Cartesian product of the two points, {x_1, x_2} x {y_1, y_2}, which contains a very small dead space, {(x_1, y_2), (x_2, y_1)}. As shown by Qian et al., the NSP-tree shows favorable performance in comparison with the ND-tree, particularly for skewed data [44].

2.4 HDS Models in Vector Space

This section describes currently available methods for vector space indexing of HDS objects. As with the NDDS indexing methods described in the previous section, only a limited number of HDS indexing methods exist. Although it should be noted that metric space models such as the M-tree could also be used to index such objects, research by Qian et al. [43] and Chen et al. [15] suggests that this is not as efficient as indexing methods tailored toward HDSs. In this section we focus upon two more recently proposed HDS indexing methods: the NDh-tree and the CND-tree.

2.4.1 NDh-tree

The NDh-tree, as proposed by Qian [41], is an extension of the ND-tree. The key difference is that instead of discrete minimum bounding rectangles, the NDh-tree utilizes hybrid minimum bounding rectangles. Initial results reported by Qian [41] demonstrate the effectiveness of utilizing an index structure for HDS objects specifically designed for such a space. The NDh-tree serves as an inspiration for the CND-tree introduced by Chen et al. [15]. In this dissertation we focus upon the more recent contribution and will describe key differences as necessary.

2.4.2 CND-tree

The CND-tree [15] is similar in structure and function to the R*-tree and the ND-tree. As such, the CND-tree is a balanced tree with leaf nodes containing the indexed vectors. The vectors are reached by traversing a set of branches starting at the root and becoming more refined as one traverses toward the leaves. Each vector is inserted into the tree after an appropriate position is found. The relevant minimum bounding rectangle may be either enlarged or split to accommodate the insertion.

The key difference between the CND-tree and related non-hybrid trees (the R*-tree for continuous space and the ND-tree for discrete space) is the way in which a minimum bounding rectangle is defined and utilized. In the CND-tree, a minimum bounding rectangle contains both continuous and non-ordered discrete dimensions.
A hybrid minimum bounding rectangle (HMBR), with d_D discrete dimensions and d_C continuous dimensions, for a set of hybrid rectangles G = {R_1, R_2, ..., R_n} with discrete sub-rectangles R_i^D = S_{i,1} x ... x S_{i,d_D} and continuous sub-rectangles R_i^C = S_{i,d_D+1} x ... x S_{i,d_D+d_C} (i.e., R_i = R_i^D x R_i^C) is defined as follows:

HMBR(G) = \{\bigcup_{i=1}^{n} S_{i,1}\} \times \dots \times \{\bigcup_{i=1}^{n} S_{i,d_D}\} \times [\min_i S_{i,d_D+1}, \max_i S_{i,d_D+1}] \times \dots \times [\min_i S_{i,d_D+d_C}, \max_i S_{i,d_D+d_C}]    (2.2)

where S_{i,j} (1 <= j <= d_D) is a set of elements/letters from the alphabet of the j-th dimension and S_{i,k} (d_D + 1 <= k <= d_D + d_C) is an interval from the k-th dimension.

2.5 Determining Distance

An integral part of any similarity search algorithm is the distance measure employed. Because we are interested in how "close" one object is to another, the selection of a distance measure provides the semantic definition of what "close" means for the current application.

For NDDSs, the inability to order values along an axis renders standard forms of distance measurement, such as Euclidean or Manhattan, inapplicable. In turn, a common method of calculating the distance between two discrete objects is to apply the Hamming measurement. Essentially, this measurement represents the number of dimensions on which two d-dimensional vectors contain different elements. This is described formally as follows:

D_{Hamm}(V_1, V_2) = \sum_{i=1}^{d} \begin{cases} 0 & \text{if } V_1[i] = V_2[i] \\ 1 & \text{otherwise} \end{cases}    (2.3)

This distance is useful in discrete spaces due to its non-reliance upon dataset semantics, particularly for equality measurements. However, its usefulness declines rapidly when applied to other operations, such as grouping, due to its limited cardinality. The cardinality of an NDDS, Card(U), for a d-dimensional dataset U with an alphabet of size |A_i| for each dimension i, is calculated as the product of the alphabet sizes from each dimension:

Card(U) = \prod_{i=1}^{d} |A_i|.

Using the aforementioned genome sequence example, a 25-dimensional dataset with an alphabet size of 4 for each dimension would have a cardinality of 1,125,899,906,842,624; that is, there are over 10^{15} possible distinct elements in the dataset. However, if the Hamming distance formula is used to calculate the distance between the objects, there are only d + 1 (i.e., 26) different possible distances between any two objects.

For HDSs, Chen et al. [15] utilized a non-Euclidean measurement for calculating the distance between a vector alpha = (alpha[1], alpha[2], ..., alpha[d_D + d_C]) and an HMBR R = S_1 x S_2 x ... x S_{d_D + d_C} to perform range queries:

dist(R, \alpha) = \sum_{i=1}^{d_D + d_C} f(S_i, \alpha[i])    (2.4)

where

f(S_i, \alpha[i]) = \begin{cases} 0 & \text{if } i \text{ is a discrete dimension and } \alpha[i] \in S_i, \text{ or } i \text{ is a continuous dimension and } \min(S_i) - \delta_t \le \alpha[i] \le \max(S_i) + \delta_t \\ 1 & \text{otherwise} \end{cases}

Equation 2.4 essentially discretizes the data of the continuous dimensions utilizing a tolerance value delta_t determined by an application expert. This method is similar to that used in several machine learning techniques [13, 20, 39].
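As a concrete illustration of Equations 2.3 and 2.4, the C++ sketch below computes the Hamming distance between two discrete vectors and the discretized hybrid distance between a hybrid vector and an HMBR. The data-structure layout (element sets plus intervals) and every identifier are illustrative assumptions, not the CND-tree's actual code.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Hamming distance between two d-dimensional discrete vectors (Equation 2.3).
int hammingDistance(const std::vector<char>& v1, const std::vector<char>& v2) {
    int dist = 0;
    for (std::size_t i = 0; i < v1.size(); ++i)
        if (v1[i] != v2[i]) ++dist;
    return dist;
}

// A hybrid vector: d_D leading discrete components and d_C trailing continuous ones.
struct HybridVector {
    std::vector<char> discrete;
    std::vector<double> continuous;
};

// An HMBR: element sets on the discrete dimensions, [min, max] intervals on the rest.
struct Hmbr {
    std::vector<std::set<char>> discreteSets;
    std::vector<std::pair<double, double>> intervals;  // (min, max) per continuous dim
};

// Hybrid distance of Equation 2.4; deltaT is the application-supplied tolerance
// used to discretize the continuous dimensions.
int hybridDist(const Hmbr& r, const HybridVector& a, double deltaT) {
    int dist = 0;
    for (std::size_t i = 0; i < a.discrete.size(); ++i)
        if (r.discreteSets[i].count(a.discrete[i]) == 0) ++dist;   // mismatch adds 1
    for (std::size_t i = 0; i < a.continuous.size(); ++i) {
        double lo = r.intervals[i].first - deltaT;
        double hi = r.intervals[i].second + deltaT;
        if (a.continuous[i] < lo || a.continuous[i] > hi) ++dist;  // outside the tolerance band
    }
    return dist;
}
```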
CHAPTER 3

k-Nearest Neighbor Searching in NDDS

In this chapter, we consider k-Nearest Neighbor (k-NN) searching in NDDSs. Searching in NDDSs presents several new challenges that we discuss. Additionally, we present a formal definition of a k-NN query/search and introduce a new distance measure suitable for searches in NDDSs. A generalized form of this distance measure (and the benefits inherited from this generalized form) is presented in Chapter 4.

3.1 Motivations and Challenges

Numerous techniques have been proposed in the literature to support efficient similarity searches in (ordered) continuous data spaces. A majority of them utilize a multidimensional index structure such as the R-tree [22], the R*-tree [6], the X-tree [10], the K-D-B tree [46], and the LSDh-tree [25]. These techniques rely on some essential geometric properties/concepts such as bounding rectangles in CDSs. Much work has centered around a filter-and-refinement process. Roussopoulos et al. presented a branch-and-bound algorithm for finding k-NNs to a query point. Korn et al. furthered this work by presenting a multi-step k-NN searching algorithm [35], which was then optimized by Seidl and Kriegel [50]. In [31], a Voronoi-based approach was presented to address k-NN searching in spatial network databases.

Little work has been reported on supporting efficient similarity searches in non-ordered discrete data spaces. Limited existing work on index-based similarity searches in NDDSs has utilized either metric trees such as the M-tree [16] or the ND-tree and the NSP-tree recently proposed by Qian et al. [42, 43, 44]. Unlike the M-tree, the ND-tree and the NSP-tree indexing techniques were designed specifically for NDDSs. It has been shown that these two techniques outperform the linear scan and typical metric trees such as the M-tree for range queries in NDDSs. Metric trees generally do not perform well in NDDSs because they are too generic and do not take the special characteristics of an NDDS into consideration. On the other hand, Qian et al.'s work in [42, 43, 44] primarily focused on handling range queries. Although a procedure for finding the nearest neighbor (i.e., 1-NN) to a query point was outlined in [43], no empirical evaluation was given.

The issue of k-NN searching in NDDSs is in fact not a trivial extension of earlier work. NDDSs raise new challenges for this problem. First, we observe that, unlike a k-NN query in a CDS, a k-NN query in an NDDS based on the conventional Hamming distance [23] often has a large number of alternative solution sets, making the results of the k-NN query non-deterministic. This non-determinism is mainly due to the coarse granularity of the Hamming distance and can sharply reduce the clarity/usefulness of the query results. Second, existing index-based k-NN searching algorithms for CDSs cannot be directly applied to an NDDS due to a lack of relevant geometric concepts/measures. On the other hand, the algorithms using metric trees for a CDS are suboptimal because of their generic nature and ignorance of the special characteristics of an NDDS. Third, the information maintained by an NDDS index structure may become very misleading for traditional CDS search ordering strategies, such as those presented by Roussopoulos et al. [47]. This scenario can occur as the distribution of data within the index structure shifts over time.

To tackle the first challenge, we introduce a new extended Hamming distance, called the Granularity-Enhanced Hamming (GEH) distance. The GEH distance improves the semantics of k-NN searching in NDDSs by greatly increasing the determinism of the results. To address the second challenge, we propose a k-NN searching algorithm utilizing the ND-tree. Our algorithm extends the notion of incremental range-based search of Roussopoulos et al. [47]
(generalized for metric spaces by Hjaltason and Samet [26]) to NDDSs by introducing suitable pruning metrics and relevant searching heuristics based on our new distance measure and the characteristics of NDDSs. Some preliminary results for uniformly distributed datasets were presented in [32]. Our study shows that the new GEH distance provides the greatly improved semantic discriminating power that is needed for k-NN searching in NDDSs, and that our searching algorithm is very efficient in supporting k-NN searches in NDDSs. In this dissertation, we demonstrate that our k-NN searching algorithm is efficient on both uniformly distributed datasets and non-uniformly distributed datasets, using zipf distributions as an example. Further, we present a theoretical performance model and demonstrate that the performance of our algorithm matches very closely what is predicted by this model. To address the third issue, we introduce a method for determining the probability of a vector's existence within any subtree of an ND-tree. We demonstrate that this probability information can be used to provide a new search ordering strategy that significantly increases the performance of our search algorithm when the information maintained by the index structure is misleading.

The rest of this chapter is organized as follows. Section 3.2 formally defines the problem of k-NN searching, derives the probability of a vector existing within an ND-tree, introduces the new GEH distance in NDDSs, and discusses its properties. Section 3.3 presents our index-based k-NN searching algorithm for NDDSs, including its pruning metrics and heuristics and a theoretical performance model. Section 3.4 discusses experimental results.

3.2 k-Nearest Neighbors in NDDS

In this section, we formally define a k-NN search/query and identify a major problem associated with k-NN searches in NDDSs. To overcome the problem, we propose a new extended Hamming distance and discuss its properties. Additionally, we introduce a method for determining the probability of a vector existing within any subtree of an ND-tree.

Definition 1. k-Nearest-Neighbor Set: A set of k-nearest neighbors kNNS = {A_1, A_2, ..., A_k} for a query point q is a set of k objects from the database such that

\sum_{i=1}^{k} D(q, A_i) \le \sum_{i=1}^{k} D(q, B_i) \quad \text{for every set } \{B_1, \dots, B_k\} \text{ of } k \text{ objects in the database.}    (3.1)

Equation 3.1 essentially says that the k objects/neighbors A_1, A_2, ..., A_k in kNNS have the minimum total distance to q out of all possible sets of k objects in the database. This definition is in fact valid for both continuous and discrete data spaces. Consider Figure 3.1: if k = 3, there are three possible sets of neighbors that satisfy Equation 3.1: {o_1, o_2, o_3}, {o_1, o_2, o_4}, and {o_1, o_3, o_4}. Each candidate solution set is found within a range of r when a range of r - 1 would yield fewer than k neighbors (here, r = 3). Thus, each candidate solution set is a size-k subset of the set of neighbors that would be found using the above minimum distance r. Since kNNS is a set of neighbors, there is no ordering implied among the neighbors. In the following recursive definition we provide a procedural semantics of a candidate k-th nearest neighbor, which is based on an ordered ranking of the neighbors in the database for a query point q.

Definition 2. Candidate k-th-Nearest Neighbor: Let the universe of discourse for variables A and B be the set of all objects in the database. Let A_k denote a candidate k-th nearest neighbor in the database for a query point q. We recursively define A_k as follows:

A_1 = A \Rightarrow \forall B \, (D(q, A) \le D(q, B)),

A_k = A \Rightarrow \forall B \, \big( (B \notin \{A_1, A_2, \dots, A_{k-1}\} \wedge A \notin \{A_1, A_2, \dots, A_{k-1}\}) \rightarrow D(q, A) \le D(q, B) \big) \quad \text{for } k \ge 2.    (3.2)
Definition 2 can be used to produce the candidate kNNSs given by Definition 1, as stated in the following lemma.

Lemma 3.2.1. Each candidate kNNS produced by Definition 2 is contained in the set of candidate kNNSs given by Definition 1.

Proof. The claim is that the set of candidate kNNSs given by Definition 1 can be produced by Definition 2. This leads to the hypothesis that if a solution set NN = {A_1, A_2, ..., A_{k-1}} satisfies Equation 3.1 for k - 1 objects, then Equation 3.2 will select a k-th neighbor A_k such that NN \cup {A_k} will satisfy Equation 3.1 for k objects and thus be consistent with Definition 1. We first consider a base case where k = 1. Equation 3.2 yields the following neighbor A_1:

A_1 \in \{ A : \forall B \, (D(q, A) \le D(q, B)) \}.

The solution set {A_1} satisfies Equation 3.1 for k = 1 and thus is consistent with Definition 1. Equation 3.2 may then be used again to yield neighbor A_k:

A_k \in \{ A : \forall B \, \big( (B \notin \{A_1, \dots, A_{k-1}\} \wedge A \notin \{A_1, \dots, A_{k-1}\}) \rightarrow D(q, A) \le D(q, B) \big) \} \quad \text{for } k \ge 2.

The above function returns the object in the dataset that has a minimum distance from the query object out of all objects in the dataset not currently included in the solution set. Thus the addition of a k-th neighbor for k >= 2 will result in a minimal total distance to the query point for k objects if the set of neighbors A_1, ..., A_{k-1} has a minimal total distance for k - 1 objects. Our base case shows that this is true for one object; thus the hypothesis is true for all values of k. []

From the above definitions (and Figure 3.1), we can see that there may be multiple possible kNNSs for a given query. Therefore, kNNS is generally not unique. The non-uniqueness/non-determinism of kNNS has an impact on the semantics of the k-nearest neighbors. We define the degree of non-determinism of k-nearest neighbors by the number of possible kNNSs that exist for the query. This degree of non-determinism is computed by the following lemma.

Lemma 3.2.2. The number Delta_k of candidate kNNSs is given by

\Delta_k = \binom{N'}{t} = \frac{N'!}{t!\,(N'-t)!}    (3.3)

where t is defined by D(q, A_{k-t}) \ne D(q, A_{k-t+1}) = D(q, A_{k-t+2}) = \dots = D(q, A_k); A_j (1 <= j <= k) denotes the j-th nearest neighbor; and N' is the number of nearest neighbors in the database that have the same distance as D(q, A_k).

Proof. We derive the number Delta_k of candidate kNNSs, which may be interpreted as the number of possible solution sets from a dataset that satisfy Equation 3.1 for k objects. The value of Delta_k is largely influenced by the number of objects within a given solution set that have the same distance to the query object as the k-th neighbor. This value, represented by t, is formally defined as follows:

D(q, A_{k-t}) \ne D(q, A_{k-t+1}) = D(q, A_{k-t+2}) = \dots = D(q, A_k).

Each neighbor A_{k-t+1}, ..., A_k may be replaced by any other potential neighbor a from the dataset where D(q, a) = D(q, A_k), and the solution set will still satisfy Equation 3.1. We denote the number of all such potential neighbors as N'. Thus, Delta_k is the number of t-element subsets that can be chosen from these N' potential neighbors, i.e., the binomial coefficient of N' and t, which gives Equation 3.3. []

Note that t denotes the number of objects with distance D(q, A_k) that have to be included in a kNNS. If t = k, all the neighbors in a kNNS are of the same distance as D(q, A_k); in this case A_{k-t} is inapplicable. The values of N' and t depend on parameters such as the dimensionality, the database size, and the query point.
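To illustrate Equation 3.3, the following hypothetical C++ helper takes the distances of all database objects to q in non-decreasing order and returns Delta_k: it counts t (how many of the k nearest share the k-th distance), counts N' (how many objects in the whole database have that distance), and evaluates the binomial coefficient. The function names and the sorted-distance input are illustrative assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Binomial coefficient C(n, t), computed iteratively (exact at each step).
std::uint64_t binomial(std::uint64_t n, std::uint64_t t) {
    if (t > n) return 0;
    if (t > n - t) t = n - t;
    std::uint64_t result = 1;
    for (std::uint64_t i = 1; i <= t; ++i)
        result = result * (n - t + i) / i;
    return result;
}

// Delta_k of Equation 3.3. sortedDistances holds D(q, .) for every database object,
// sorted in non-decreasing order; k must satisfy 1 <= k <= sortedDistances.size().
std::uint64_t numCandidateKnns(const std::vector<double>& sortedDistances, std::size_t k) {
    double dk = sortedDistances[k - 1];                  // distance of the k-th nearest neighbor
    std::size_t t = 0;                                   // neighbors among the first k at distance dk
    while (t < k && sortedDistances[k - 1 - t] == dk) ++t;
    std::size_t nPrime = 0;                              // all database objects at distance dk (N')
    for (double d : sortedDistances)
        if (d == dk) ++nPrime;
    return binomial(nPrime, t);
}
```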
For a k-NN query on a database in a continuous data space based on the Euclidean distance, kNNS is typically unique (i.e., Delta_k = 1), since the chance of two objects having the same distance to the query point is usually very small. As a result, non-determinism is usually not an issue for k-NN searches in continuous data spaces. However, non-determinism is a common occurrence in an NDDS. As pointed out in [42, 43], the Hamming distance is typically used for NDDSs. Due to the insufficient semantic discrimination between objects provided by the Hamming distance and the limited number of elements available for each dimension in an NDDS, Delta_k for a k-NN query in an NDDS is usually large. For example, for a dataset of 2M vectors in a 10-dimensional NDDS with uniform distribution, the average Delta_k values for 100 random k-NN queries with k = 1, 5, 10 are about 8.0, 19.0K, and 45.5M, respectively, as shown in Figure 3.2. This demonstrates a high degree of non-determinism for k-NN searches based on the Hamming distance in an NDDS, especially for large k values. To mitigate the problem, we extend the Hamming distance to provide more semantic discrimination between the neighbors of a k-NN query point in an NDDS.

3.2.2 Extended Hamming Distance

Intuitively, the Hamming distance indicates the number of dimensions on which the corresponding components of alpha and beta differ. As discussed by Qian et al. [42, 43], the Hamming distance is well suited for search applications in NDDSs. In particular, we note that applications with different alphabet sets for different dimensions or with no known similarity matrix (needed for many edit distances) present strong cases for using the Hamming distance when searching. Additionally, recent work has applied the Hamming distance when searching in large genome sequence databases [37, 29].

[Figure 3.2. Comparison of Delta_k values for the Hamming distance (number of solution sets vs. number of vectors in database x 10^5)]

Although the Hamming distance is very useful for exact matches and range queries in NDDSs, it does not provide an effective semantics for k-NN queries in NDDSs, due to the high degree of non-determinism mentioned previously. We notice that the Hamming distance does not distinguish equalities for different elements. For example, it treats the equality a = a the same as the equality b = b by assigning 0 to the distance measure in both cases. In many applications such as genome sequence searches, some matches (equalities) may be considered to be more important than others. Based on this observation, we extend the Hamming distance to capture the semantics of different equalities in the distance measure.

Several constraints have to be considered for such an extension. First, the extended distance should enhance the granularity level of the Hamming distance so that its semantic discriminating power is increased. Second, the semantics of the traditional Hamming distance needs to be preserved: for example, from a given distance value, one should be able to tell how many dimensions are distinct (and how many dimensions are equal) between two vectors. Third, the extended distance should possess the triangular property so that pruning during an index-based search is possible.
We observe that matching two vectors on a dimension with a frequently occurring element is usually more important than matching the two vectors on a dimension with an uncommon (infrequent) element. Based on this observation, we utilize the frequencies of the elements to extend the Hamming distance as follows:

D_{GEH}(\alpha, \beta) = \sum_{i=1}^{d} \begin{cases} 1 & \text{if } \alpha[i] \ne \beta[i] \\ \frac{1}{d} f(\alpha[i]) & \text{otherwise} \end{cases}    (3.4)

where f(alpha[i]) = 1 - frequency(alpha[i]).

This extension starts with the traditional Hamming distance, adding one to the total distance for each dimension that does not match between the two vectors. The difference is that, when the two vectors match on a particular dimension, the frequency of the common element (i.e., alpha[i] = beta[i]) occurring in the underlying database on that dimension is obtained from a lookup table generated by performing an initial scan of the dataset. This frequency value is then subtracted from one and added to the distance measure. Thus, the more frequently an element occurs, the more it subtracts from one and the less it adds to the distance measure, thereby indicating that the two vectors are closer than if they had matched on a very uncommon element. This frequency-based adjustment results in the possibility of fractional distance values rather than just the integer distance values seen with the traditional Hamming distance.

The factor of 1/d is used to ensure that the frequency-based adjustments to the distance measure do not become more significant than the original Hamming distance. This guarantees that the solution set (kNNS) returned using this distance will be among the candidate solution sets returned if the Hamming distance were used instead. We also note that the function f(alpha[i]) is not restricted to the definition given in Equation 3.4. So long as the values of f(alpha[i]) are within the range [0, 1), the factor of 1/d guarantees that the semantics of the Hamming distance is maintained. In Chapter 4, we explore this concept more thoroughly and present a generalized form of Equation 3.4.

From the distance definition, we can see that if m <= D_{GEH}(alpha, beta) < m + 1 (m = 0, 1, ..., d), then vectors alpha and beta mismatch on m dimensions (i.e., match on d - m dimensions). Additionally, the function f(alpha[i]) plays a role in preserving the triangular property of Equation 3.4, as shown in Appendix B. Clearly, unlike the traditional Hamming distance, which has at most d + 1 (integer) values, resulting in a quite coarse granularity, this new extended distance allows many more possible values, leading to a refined granularity. We call this extended Hamming distance the Granularity-Enhanced Hamming (GEH) distance.

Due to its enhanced granularity, the GEH distance can dramatically reduce Delta_k in Lemma 3.2.2, leading to more deterministic k-NN searches in NDDSs. As an example, consider again Figure 3.1. If we assume that vectors o_2, o_3, and o_4 each match query vector q in only one dimension, such that o_2[1] = q[1], o_3[2] = q[2], and o_4[3] = q[3], and we also assume f(q[2]) < f(q[3]) < f(q[1]), then the use of the GEH distance resolves the non-determinism seen earlier when k = 3. Here, the solution set would be {o_1, o_3, o_4} (one of the candidate kNNSs when the Hamming distance was used), since D_{GEH}(q, o_1) < D_{GEH}(q, o_3) < D_{GEH}(q, o_4) < D_{GEH}(q, o_2). On a larger scale, for the aforementioned dataset of 2M vectors in a 10-dimensional NDDS under the uniform distribution, the average Delta_k values for 100 random k-NN queries with k = 1, 5, 10 are about 1.09, 1.11, and 1.06, respectively (see Figure 3.3 in Section 3.4.1).
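A minimal C++ sketch of Equation 3.4 follows. The per-dimension frequency tables are assumed to have been built by the single preliminary scan of the dataset described above; the container choices and names are illustrative only.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// frequency[i][e] holds the fraction of database vectors whose i-th component is e,
// gathered in one preliminary scan of the dataset.
using FrequencyTables = std::vector<std::unordered_map<char, double>>;

// GEH distance of Equation 3.4: 1 per mismatching dimension, plus (1/d) * (1 - frequency)
// for each matching dimension, so matches on common elements count as "closer".
double gehDistance(const std::vector<char>& alpha, const std::vector<char>& beta,
                   const FrequencyTables& frequency) {
    const double d = static_cast<double>(alpha.size());
    double dist = 0.0;
    for (std::size_t i = 0; i < alpha.size(); ++i) {
        if (alpha[i] != beta[i]) {
            dist += 1.0;                                   // Hamming contribution
        } else {
            auto it = frequency[i].find(alpha[i]);
            double freq = (it == frequency[i].end()) ? 0.0 : it->second;
            dist += (1.0 - freq) / d;                      // GEH adjustment, always below 1/d
        }
    }
    return dist;
}
```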
In fact, the Euclidean distance measurement can be considered to have the finest (continuous) granularity at one end, while the Hamming distance measurement has a very coarse (discrete, integer) granularity at the other end. The GEH distance measurement provides an advantage in bringing discrete and continuous distance measurements closer to each other.

3.2.3 Probability of Valid Neighbors

In many scenarios, it is useful to know the probability/likelihood of encountering vectors within an index tree that are within the current search radius of a given query vector. For the purposes of our discussion, we label each such encountered vector alpha as a valid neighbor, where D(q, alpha) <= r, q is the query vector, and r is the current search radius. To derive this probability, we first consider the Hamming distance and then extend our solutions to benefit from the enhancements provided by the GEH distance.

For an initial case, we can assume that our index tree has maintained a relatively uniform distribution of elements within its subtrees. In a well-balanced tree (ND-tree, M-tree, etc.), this may prove to be a very reasonable assumption, as most indexing methods will attempt to evenly distribute elements within their subtrees. When we consider an ND-tree as our indexing method, the probability that accessing a subtree with associated DMBR R = S_1 x S_2 x ... x S_d will yield a particular element a in any dimension may be estimated as the reciprocal of the magnitude of the alphabet set on that dimension represented by R. Therefore, the probability of a specific element a occurring in dimension i is estimated as:

p(a)_{R,i} \approx \frac{1}{|S_i|}.

This calculation proves to be fairly accurate so long as the assumption of uniform distribution holds. The accuracy, and therefore effectiveness, of this calculation begins to degrade as the distribution of elements per dimension within a subtree becomes non-uniform. The true probability p(a)_{R,i} may be estimated far more accurately by determining the local frequency ratio of element a within a subtree, with associated DMBR R, on dimension i, as follows:

p(a)_{R,i} = f_l(a)_{R,i}    (3.5)

where

f_l(a)_{R,i} = \frac{\#\text{ of vectors } \alpha \text{ in } R\text{'s subtree where } \alpha[i] = a}{\text{total } \#\text{ of vectors in } R\text{'s subtree}}.

This method is not reliant upon the indexing method providing an even distribution: Equation 3.5 remains accurate even for indexes with heavily skewed distributions.

The probability of encountering valid neighbors when examining any particular subtree of an ND-tree is analogous to the probability of such neighbors existing in that subtree. Each dimension in an NDDS is assumed to be independent; therefore, the probability of encountering specific elements over all dimensions may be determined by the product of the probabilities of encountering a specific element in each dimension. Thus the probability of selecting any particular vector alpha = (alpha[1], alpha[2], ..., alpha[d]) at random from the subtree with associated DMBR R is the following:

P_E(\alpha, R) = \prod_{i=1}^{d} p(\alpha[i])_{R,i}.    (3.6)

As defined in Section 3.2.2, the Hamming distance represents the number of non-matching dimensions between any two vectors. The probability of a subtree containing a vector alpha where D_{Hamm}(q, alpha) = 0 may be determined using Equation 3.6. However, because at most one vector within an ND-tree will satisfy D_{Hamm}(q, alpha) = 0, we also have to consider the probability of a subtree containing vectors beta = (beta[1], beta[2], ..., beta[d]) where D_{Hamm}(q, beta) = z (z in {1, 2, ..., d}).
Lemma 3.2.3. Let Y represent the set of dimensions where beta[Y_j] != q[Y_j] and let X represent the set of dimensions where beta[X_j] = q[X_j]. The probability of selecting a particular vector beta at random from the subtree with associated DMBR R, where D_{Hamm}(q, beta) = z (0 <= z <= d), is given as the following (note that z = |Y|):

P_{NE}(\beta, R) = \prod_{j=1}^{|X|} p(\beta[X_j])_{R,X_j} \times \prod_{j=1}^{|Y|} \left( 1 - p(\beta[Y_j])_{R,Y_j} \right)    (3.7)

Proof. As described in Equation 3.6, the probability of a specific vector existing in a subtree, represented by R, is the product of the probabilities of each element of the vector occurring in the corresponding dimension of the subtree; the probability of the element is estimated by p(beta[X_j])_{R,X_j} (Equation 3.5). The probability of anything except the specified element is 1 - p(beta[Y_j])_{R,Y_j}. Thus, the probability of a specific vector, which does not match the query vector in z dimensions, existing in a subtree with DMBR R is the product of two terms: the product of p(beta[X_j])_{R,X_j} over the matching dimensions and the product of 1 - p(beta[Y_j])_{R,Y_j} over the non-matching dimensions. []

Lemma 3.2.3 describes the method for determining the probability of encountering a vector that matches the query vector on a particular set of dimensions X. An example would be determining the probability of encountering a vector beta in a 10-dimensional dataset that matched a query vector q in dimensions 1, 3, 8, and 9. In this example, X = {1, 3, 8, 9} and Y = {2, 4, 5, 6, 7, 10}, resulting in z = D_{Hamm}(q, beta) = 6. However, to determine the probability of encountering a vector at a distance z, we are not only interested in this one particular partial-matching vector, but also in all possible partial-matching vectors that may be found at a distance z from the query vector.

Proposition 1. Let B represent the set of all vectors from the given dataset in an NDDS where beta in B : [D_{Hamm}(q, beta) = z]. The probability that a subtree with associated DMBR R = S_1 x S_2 x ... x S_d will contain a vector beta where D_{Hamm}(q, beta) = z is given as the following:

PS_z(q, R) = \sum_{\beta \in B} P_{NE}(\beta, R).    (3.8)

Proof. The probability of a subset of independent objects existing in a set is the summation of the probabilities of each of the individual objects within the subset existing in the set. Thus, the probability of a subset of vectors existing within a subtree, each of which has a specified number of dimensions matching a query vector, is the summation of the probabilities of all vectors within that subset existing in the subtree. []

For example, in a three-dimensional space the probability of selecting a vector beta at random from a subtree with associated DMBR R, where D_{Hamm}(q, beta) = 1, is given as follows:

PS_1(q, R) = p(q[1])_{R,1}\, p(q[2])_{R,2}\, (1 - p(q[3])_{R,3}) + p(q[1])_{R,1}\, (1 - p(q[2])_{R,2})\, p(q[3])_{R,3} + (1 - p(q[1])_{R,1})\, p(q[2])_{R,2}\, p(q[3])_{R,3}.

We note that as the dimensionality of the dataset increases, this calculation can become costly, i.e., on the order of O(d \cdot \binom{d}{z}). One possible solution is to create a hash structure, in a pre-processing step, that stores binary arrays representing the different combinations of matches and non-matches for each particular distance value. Here, the set of keys is the set of integers {0, 1, ..., d}, and the values are the sets of binary arrays with a corresponding number of 0's. For example, the key 3 would retrieve the set of binary arrays containing all possible permutations with exactly three 0's. The set B for a particular distance may be determined by retrieving the corresponding set of binary arrays, where for each array a value of 1 corresponds to a dimension in X and a 0 to a dimension in Y (see Lemma 3.2.3). This method can drastically reduce CPU time.
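The C++ sketch below evaluates Equations 3.7 and 3.8 directly for one subtree by enumerating every pattern of z mismatching dimensions (the binary arrays just described), without the hashing optimization. The per-dimension probability input and the function name are illustrative assumptions.

```cpp
#include <algorithm>
#include <vector>

// p[i] estimates p(q[i])_{R,i}: the probability, per Equation 3.5, that a vector in the
// subtree with DMBR R carries the query element q[i] on dimension i.
// psZ evaluates Equation 3.8: the probability that the subtree contains a vector at
// Hamming distance exactly z from q, summing Equation 3.7 over every choice of z
// mismatching dimensions.
double psZ(const std::vector<double>& p, int z) {
    const int d = static_cast<int>(p.size());
    if (z < 0 || z > d) return 0.0;
    std::vector<int> mismatch(d, 0);                       // 1 marks a mismatching dimension
    std::fill(mismatch.begin(), mismatch.begin() + z, 1);  // start at the largest arrangement
    double total = 0.0;
    do {
        double term = 1.0;
        for (int i = 0; i < d; ++i)
            term *= mismatch[i] ? (1.0 - p[i]) : p[i];     // Equation 3.7 for this pattern
        total += term;
    } while (std::prev_permutation(mismatch.begin(), mismatch.end()));
    return total;
}
```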
The summation of the probability values given by Equation 3.8 for each integer distance z in {0, 1, ..., r} yields the probability of the subtree containing a vector beta that is within range r of the query vector (i.e., D_{Hamm}(q, beta) <= r). This is expressed formally as follows:

P_{NN_{Hamm}}(q, R, r) = \sum_{z=0}^{r} PS_z(q, R).    (3.9)

Equation 3.9 may therefore be used to give an accurate measure of the likelihood that searching within any subtree will update a solution set when using the Hamming distance.

Enhancing the granularity of the Hamming distance leads to an enhancement of the neighbor probability calculated in Equation 3.9. When using the GEH distance, r is no longer strictly an integer. Thus, it is possible for a valid neighbor to exist at a distance floor(r) < D_{GEH}(q, beta) <= r, where r is a real number. An adjustment to Equation 3.8 is needed to properly account for possible neighbors within this range.

Proposition 2. Let B' represent the set of all vectors from the given dataset in an NDDS where beta' in B' : [D_{Hamm}(q, beta') = floor(r)]. The probability that a subtree with DMBR R = S_1 x S_2 x ... x S_d will contain a record beta' where floor(r) < D_{GEH}(q, beta') <= r is given as the following:

PR(q, R, r) = \sum_{\beta' \in B'} P_{NE}(\beta', R) \cdot g(\beta', r, q)    (3.10)

where

g(\beta', r, q) = \begin{cases} 1 & \text{if } R_{adj}(\beta', q) \le r - \lfloor r \rfloor \\ 0 & \text{otherwise} \end{cases}, \qquad
R_{adj}(\beta', q) = \sum_{i=1}^{d} a(\beta'[i], q[i]), \qquad
a(\beta'[i], q[i]) = \begin{cases} \frac{1}{d} f(q[i]) & \text{if } \beta'[i] = q[i] \\ 0 & \text{otherwise} \end{cases}

Proof. The proof of Proposition 1 shows that Equation 3.8 yields the probability of a vector having z mismatching dimensions with the query vector existing in a particular subtree. If z = floor(r), this equation determines the probability of a vector with floor(r) mismatching dimensions existing in the subtree. The set of these vectors is represented by B'. The function g creates a subset of these vectors by removing all vectors u from the set B' where D_{GEH}(u, q) > r. Equation 3.10 is the summation over each of the remaining vectors u, where for all u in B' : D_{GEH}(u, q) <= r. Thus Equation 3.10 yields the probability that a vector whose distance to the query vector is between floor(r) and r exists in a particular subtree. []

The additional granularity provided by Equation 3.10 allows us to refine Equation 3.9 to make use of our GEH distance as follows:

P_{NN_{GEH}}(q, R, r) = PR(q, R, r) + \sum_{z=0}^{\lfloor r \rfloor - 1} PS_z(q, R).    (3.11)

Equation 3.11 may be used to give an accurate estimate of the likelihood that searching within any subtree will update a solution set when our enhanced distance measure is used. We use this measure in Section 3.3.1 to develop an ordering heuristic that provides a conservative assessment of whether or not to visit a particular subtree, which is beneficial to search performance in non-uniformly distributed databases.
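Assuming the per-layer probabilities PS_z and the partial-layer term PR of Equation 3.10 have already been evaluated for a subtree, combining them as in Equations 3.9 and 3.11 is a short sum. The sketch below is illustrative only, and follows the reading of Equation 3.11 given above, in which the Hamming layers strictly below floor(r) are counted in full and the layer at floor(r) is admitted through PR.

```cpp
#include <cmath>
#include <vector>

// psByZ[z] holds PS_z(q, R) for z = 0, 1, ..., d (Equation 3.8); pr holds PR(q, R, r)
// from Equation 3.10 for the fractional part of the GEH search radius r.
double pNnGeh(const std::vector<double>& psByZ, double pr, double r) {
    const int floorR = static_cast<int>(std::floor(r));
    double total = pr;                                   // partially admitted layer at floor(r)
    for (int z = 0; z < floorR && z < static_cast<int>(psByZ.size()); ++z)
        total += psByZ[z];                               // complete layers below floor(r)
    return total;                                        // Equation 3.11
}
```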
3.3 A k-NN Algorithm for NDDS

To efficiently process k-NN queries in NDDSs, we introduce an index-based k-NN searching algorithm. This algorithm utilizes properties of the ND-tree recently proposed by Qian et al. [42, 43] for NDDSs. The basic idea of this algorithm is as follows. It descends the ND-tree from the root following a depth-first search strategy. When k possible neighbors have been retrieved, the searching algorithm uses the distance information about the neighbors already collected to start pruning search paths that can be proven not to include any vectors that are closer to the query vector than any of the current neighbors. Our algorithm is inspired by earlier incremental range-based implementations presented for CDSs by Roussopoulos et al. [47] and Hjaltason and Samet [26] (generalized for metric spaces). Our algorithm extends these implementations to NDDSs by introducing metrics and heuristics suitable for such a space. The details of this algorithm are discussed in the following subsections.

3.3.1 Heuristics

In the worst-case scenario, the search would encompass the entire tree structure. However, our extensive experiments have shown that the use of the following heuristics is able to eliminate most search paths before they need to be traversed.

MINDIST Pruning: Similar to [47], we utilize the minimum distance (MINDIST) between a query vector q and a DMBR R = S_1 x S_2 x ... x S_d, denoted by mindist(q, R), to prune useless paths. Based on the GEH distance, MINDIST is formally defined as follows:

mindist(q, R) = \sum_{i=1}^{d} \begin{cases} 1 & \text{if } q[i] \notin S_i \\ \frac{1}{d} f(q[i]) & \text{otherwise} \end{cases}    (3.12)

where f(q[i]) = 1 - frequency(q[i]). This calculation is then used with the Range of the current k-nearest neighbors (with respect to the query vector) to prune subtrees. Specifically, the heuristic for pruning subtrees is:

H1: If mindist(q, R) >= Range, then prune the subtree associated with R.

By taking the closest possible distance between a DMBR and q, we guarantee that no vectors included in the DMBR's subtree are closer than the current Range, and thus they need not be included in the continuing search.

MINMAXDIST Pruning: We also utilize the minimum value (MINMAXDIST) of all the maximum distances between a query vector q and a DMBR R along each dimension, denoted by mmdist(q, R), for pruning useless paths. In simple terms, mmdist(q, R) represents the shortest distance from the vector q at which another vector can be guaranteed to be found in R's subtree. For a vector q and DMBR R = S_1 x S_2 x ... x S_d, MINMAXDIST is formally defined as follows:

mmdist(q, R) = \min_{1 \le k \le d} \left( f_m(q[k], S_k) + \sum_{1 \le i \le d,\, i \ne k} f_M(q[i], S_i) \right)    (3.13)

where

f_m(q[k], S_k) = \begin{cases} \frac{1}{d} f(q[k]) & \text{if } q[k] \in S_k \\ 1 & \text{otherwise} \end{cases}, \qquad
f_M(q[i], S_i) = \begin{cases} \frac{1}{d} f(q[i]) & \text{if } \{q[i]\} = S_i \\ 1 & \text{otherwise} \end{cases}

and f() on the right-hand side of the last two equations is defined as in Equation 3.12.

In general terms, the summation of f_M determines the number of dimensions in which every vector in the associated subtree is guaranteed to have a matching element with the query vector (since the component set S_i on the corresponding dimension contains only the corresponding element q[i] of the query vector). In these cases, a value of (1/d) f(q[i]) is added to the distance (i.e., the GEH adjustment for a matching dimension). The value of f_m determines whether there is another dimension (not among those checked for f_M) in which at least one vector in the associated subtree will match the query vector. In this case, a value of (1/d) f(q[k]) is added to the distance. A value of 1 is added for all other cases in f_M and f_m. The summation of these values yields the minimum distance (adjusted for GEH) at which a vector can be guaranteed to be located from the query vector in the associated subtree, based upon the information in the DMBR. For example, given a query vector q = (a, b, c) and a DMBR = {a, d} x {b} x {c, a}, we have f_m(a, {a, d}) + f_M(b, {b}) + f_M(c, {c, a}) = (1/3) f(a) + (1/3) f(b) + 1, which indicates that a vector (i.e., (a, b, ?)) matching q on the first two dimensions is guaranteed to exist in the corresponding subtree. The minimum of such distances over all choices of k is sought in Equation 3.13. If Range >= mmdist(q, R), it is guaranteed that at least one valid neighbor can be found in the corresponding subtree.
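The two pruning quantities can be sketched in C++ as follows, under the same illustrative DMBR and frequency-table representations used in the earlier examples of this chapter; this is a sketch of Equations 3.12 and 3.13, not the dissertation's implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <unordered_map>
#include <vector>

using Dmbr = std::vector<std::set<char>>;                       // per-dimension element sets
using Freq = std::vector<std::unordered_map<char, double>>;     // global element frequencies

// The per-dimension GEH adjustment (1/d) * (1 - frequency(q[i])).
static double fAdj(char q, const Freq& freq, std::size_t i, std::size_t d) {
    auto it = freq[i].find(q);
    double fr = (it == freq[i].end()) ? 0.0 : it->second;
    return (1.0 - fr) / static_cast<double>(d);
}

// MINDIST of Equation 3.12: 1 for each dimension whose component set excludes q[i],
// otherwise the GEH adjustment for a possible match.
double minDist(const std::vector<char>& q, const Dmbr& r, const Freq& freq) {
    double dist = 0.0;
    for (std::size_t i = 0; i < q.size(); ++i)
        dist += (r[i].count(q[i]) == 0) ? 1.0 : fAdj(q[i], freq, i, q.size());
    return dist;
}

// MINMAXDIST of Equation 3.13: for each pivot dimension k, f_m on k plus f_M on all
// other dimensions; the minimum over k is returned.
double minMaxDist(const std::vector<char>& q, const Dmbr& r, const Freq& freq) {
    const std::size_t d = q.size();
    double best = static_cast<double>(d) + 1.0;
    for (std::size_t k = 0; k < d; ++k) {
        double sum = (r[k].count(q[k]) > 0) ? fAdj(q[k], freq, k, d) : 1.0;      // f_m
        for (std::size_t i = 0; i < d; ++i) {
            if (i == k) continue;
            bool singletonMatch = (r[i].size() == 1 && r[i].count(q[i]) > 0);    // {q[i]} == S_i
            sum += singletonMatch ? fAdj(q[i], freq, i, d) : 1.0;                // f_M
        }
        best = std::min(best, sum);
    }
    return best;
}
```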
To process k-NN searches in our algorithm, mmdist() is calculated for each non-leaf node of the ND-tree using the query vector q and all the DMBRs (for subtrees) contained in the current node. Once each of these MINMAXDIST values (for subtrees) has been calculated, they are sorted in ascending order and the k-th value is selected as MINMAXDIST_k for the current node. The k-th value is selected to guarantee that at least k vectors will be found in searching the current node. This selected MINMAXDIST_k is then used in the following heuristic:

H2: If MINMAXDIST_k(node) < Range, then let Range = MINMAXDIST_k(node).

Optimistic Search Ordering: For those subtrees that are not pruned by heuristic H1 or H2, we need to decide an order in which to access them. Two search orders were suggested in [47]: one is based on the ordering of MINDIST values, and the other is based on the ordering of MINMAXDIST values. The MINMAXDIST ordering is too pessimistic to be practically useful: accessing the subtrees based on such an ordering is almost the same as a random access in NDDSs. From an extensive empirical study, we found that accessing subtrees in the optimistic order of MINDIST values during a k-NN search in an NDDS provided the more promising results. This study was performed under the assumption that the ND-tree is well structured. This access order is given formally as follows:

H3: Access subtrees ordered in ascending value of mindist(q, R). In the event of a tie, choose a subtree at random.

Conservative Search Ordering: A problem associated with search ordering heuristic H3 is that it optimistically assumes that a vector with a distance equal to the MINDIST value exists in the subtree associated with the relevant DMBR. Typically this is not the case in an NDDS: the set of elements on each dimension from different vectors often yields a combination that is not an indexed vector in the corresponding subtree. In some instances, the actual distribution of elements per dimension within a subtree may be significantly different from what is expressed in the representing DMBR. As discussed in Section 3.2.3, this can be estimated by calculating the difference between the assumed uniform distribution, 1/|S_i|, and the actual distribution, estimated by the frequency f_l in Equation 3.5. When the difference between the assumed distribution and the actual distribution becomes large for multiple elements or multiple dimensions of a query, the likelihood of a vector with a distance of MINDIST existing in the relevant DMBR greatly decreases. When this occurs, it is more appropriate to order the access of subtrees by the calculated probability of the subtree containing a vector whose distance to the query vector is less than or equal to the current range, as shown in Equation 3.11. This access order is given formally as follows:

H4: Access subtrees in the descending order of the probability of containing a vector alpha where D_{GEH}(q, alpha) <= Range. This probability is calculated by P_{NN_{GEH}}.

Heuristic H4 may be considered a conservative approach to ordering, while heuristic H3 may be seen as an optimistic approach to ordering.
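The difference between the two ordering heuristics amounts to the comparator used when sorting the surviving subtrees, as the short C++ sketch below illustrates; the struct and field names are hypothetical, and the random tie-breaking of H3 is omitted for brevity.

```cpp
#include <algorithm>
#include <vector>

// One candidate subtree of the current non-leaf node, with the two quantities the
// ordering heuristics compare.
struct CandidateSubtree {
    int childId;        // which child entry this is
    double minDist;     // mindist(q, R) of Equation 3.12
    double pNeighbor;   // P_NN_GEH(q, R, Range) of Equation 3.11
};

// H3 (optimistic): visit subtrees in ascending order of MINDIST.
void orderByH3(std::vector<CandidateSubtree>& c) {
    std::sort(c.begin(), c.end(), [](const CandidateSubtree& a, const CandidateSubtree& b) {
        return a.minDist < b.minDist;
    });
}

// H4 (conservative): visit subtrees in descending order of the probability that they
// contain a vector within the current Range.
void orderByH4(std::vector<CandidateSubtree>& c) {
    std::sort(c.begin(), c.end(), [](const CandidateSubtree& a, const CandidateSubtree& b) {
        return a.pNeighbor > b.pNeighbor;
    });
}
```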
3.3.2 Algorithm Description

Our k-NN algorithm adopts a depth-first traversal of the ND-tree and applies the aforementioned heuristics to prune non-productive subtrees and determine the access order of the subtrees. The description of the algorithm is given as follows.

k-NN Query Algorithm: Given an ND-tree with root node T, Algorithm k-NN Query finds a set of k-nearest neighbors, maintained in queue kNNS, to query vector q that satisfies Equation 3.1 in Definition 1. It invokes two functions: ChooseSubtree and RetrieveNeighbors. The former chooses a subtree of a non-leaf node to descend, while the latter updates a list of k-nearest neighbors using vectors in a leaf node.

Algorithm 1 k-NN Query
Input: (1) query vector q; (2) the desired number k of nearest neighbors; (3) an ND-tree with root node T for a given database.
Output: a set kNNS of k-nearest neighbors to query vector q.
1:  let kNNS = empty, N = T, Range = infinity, Parent = NULL
2:  while N != NULL do
3:    if N is a non-leaf node then
4:      [NN, Range] = ChooseSubtree(N, q, k, Range)
5:      if NN != NULL then
6:        Parent = N
7:        N = NN
8:      else
9:        N = Parent
10:     end if
11:   else
12:     [kNNS, Range] = RetrieveNeighbors(N, q, k, Range, kNNS)
13:     N = Parent
14:   end if
15: end while
16: return kNNS

In the algorithm, step 1 initializes the relevant variables. Steps 2-15 traverse the ND-tree. Steps 3-10 deal with non-leaf nodes by either invoking ChooseSubtree to decide a descending path or backtracking to the ancestors when there are no more subtrees to explore. Steps 11-14 deal with leaf nodes by invoking RetrieveNeighbors to update the list of current k-nearest neighbors. Step 16 returns the result (kNNS). Note that ChooseSubtree not only returns a chosen subtree but also may update Range using heuristic H2. If there are no more subtrees to choose, it returns NULL for NN at step 4. Similarly, RetrieveNeighbors not only updates kNNS but also may update Range if a closer neighbor(s) is found.

Function ChooseSubtree: The effective use of pruning is the most efficient way to reduce the I/O cost of finding a set of k-nearest neighbors. To this end, the heuristics discussed in Section 3.3.1 are employed in function ChooseSubtree.

Function 2 ChooseSubtree(N, q, k, Range)
1:  if a list L of the not-yet-visited subtrees of N does not exist then
2:    use heuristic H2 to update Range
3:    use heuristic H1 to prune subtrees of N
4:    use heuristic H3 or H4, based upon user criteria, to sort the remaining subtrees not pruned by H1 and create a list L to hold them
5:  else
6:    use heuristic H1 to prune subtrees from list L
7:  end if
8:  if L != empty then
9:    remove the first subtree NN from L
10:   return [NN, Range]
11: else
12:   return [NULL, Range]
13: end if

In this function, steps 1-4 handle the case in which the non-leaf node is visited for the first time. In this case, the function applies heuristics H1-H4 to update Range, prune useless subtrees, and order the remaining subtrees (their root nodes) in a list L. The subtrees that are not in this list are those that have already been processed or pruned. Step 6 applies heuristic H1 and the current Range to prune useless subtrees for a non-leaf node that was visited before. Steps 8-12 return a chosen subtree (if any) and the updated Range.
that heuristics H3 and H4 are suitable for different datasets. Their effects on performance and practical selection guidance will be discussed in Section 3.4.4. Function RetriezterN’eighbors: The main task of Ret‘rieeeNe'z'ghbo-r is to examine the vectors in a given leaf node and update the current k—nearest neighbors and Range. Function 3 Ret'r'ie'z'erN'eighb0'rs(N. q. 1:. Range. hN’A'S) 1: for each vector v in N do 2: if DGEH(q.-v) < Range then 3: kzNNS = laNNS U {n} 4: if |kNNS| > k then 5: remove vector 11’ from [ANS such that 'v' has the largest DG EH01: 'v’ ) in 1311\5 6: Range 2 DGEH((1- U”) such that v” has the largest DGEH((1~. n”) in ANNE 7: end if 8: end if 9: end for 10: return [1.51%. Range] A vector is collected in law only if its distance to the query vector is smaller than current. Range (steps 2 - 3). A vector has to be removed from ANNS if it has more than k. neighbors after a new one is added (steps 4 - 7). The vector to be removed has the largest distance to the query vector. If there is a tie. a random furthest vector is chosen. 75 3.3.3 Performance Model To analyze the performance of our k-NN search algorithm. presented in Section 3.3.2, we conducted both empirical and theoretical studies. The results of our empirical studies are presented in Section 3.4. In this section. we present a theoretical model for estimating the performance of our search algorithm. For our presentation. we assume that our algorithm employs both heuristics H1 and H2. We also assume an optimistic ND-tree structure. where a subtree's associated DMBR is reasonalgily representative of the vectors within the subtree; that is Va.i€{l.2 _____ d} : f[(a.)R,,j ~ 51‘ . With this ' ‘ l 7. assumption. our search algorithm employs H3 as its search ordering heuristic.1 Because of the unique properties of an \ DDS. there is no defined center for the data. space. This may also be interpreted as any point may be considered to be at. the center. Thus. we can define a bounding hyper-sphere around any point within the data space and determine the likely number of objects contained within. Definition 3. The area unthin a distance .3 from point p in a d-dimensional NDDS with alphabet set A for each dimension. is the total nmnber of possible unique points contained within the hyper sphere of radius 3 centered at. point p. This value is fOTmally calculated as the summation. of the number of points earisting in Spherical layers as follows: \ 1The assumption of a reasonably optimistic tree structure covers the majority of ND-trees gen— erated in our empirical studies. Non-optimistic tree. structures, where our fourth heuristic would PFOVide a more beneficial ordering method. are considered empirically in Section 3.4.4. 76 ~ 4v 1 Are-(1(2):: (i ( i:(_) A| —1)'i. (3.14) Note that Area(:) is independent of point p under the uniform distribution as- sumption. Equation 3.14 may be used to calculate the total area of the data space by setting :: = (1. However. this value 111ay be calculated directly as follows: Arrow) = IAId. (3.15) The probability of an object existing within a distance of z from any point p is the quotient of Equations 3.14 and 3.15, as follows: Area(:) P--'s s 3 Z _. (“M ) Areflidl (3.16) Proposition 3. The number of likely points contained within a distance .3 from any point p is the product of the number of points within the dataset N and the probability of a point existing within. a distance of z from p. 
The lower/optimal search bound for our performance model is determined as a reasonable distance needed to assure a specific number of objects. It is reasonable to assume that the minimum distance that needs to be searched is one that is likely to yield at least k neighbors. Thus a lower bound d_l = z is found by solving Equation 3.17 such that L(d_l) >= k and L(d_l - 1) < k. The lower bound on performance I/O may then be estimated as the number of pages touched by a range query of radius d_l. The range query performance is derived similarly to the model provided by Qian et al. [43]:

IO_r = 1 + \sum_{i=0}^{H-1} (n_i \cdot P_{i,z})    (3.18)

where n_i represents the estimated number of nodes within the ND-tree at a height of i, P_{i,z} represents the probability that a node at height i will be accessed in the ND-tree with a search range of z, and H denotes the height of the index tree. However, because a k-NN query generally begins with a search range equal to the theoretical upper search bound, an adjustment must be made to account for the I/O incurred while searching with a non-optimal search range. We have estimated this value as the number of nodes within each level of the ND-tree raised to a power inversely proportional to the height of that level:

Adj = \sum_{i=1}^{H-1} n_i^{(1/i)}.    (3.19)

Adding this adjustment to the range query performance model yields the following:

IO_{NN} = 1 + (n_0 \cdot P_{0,z}) + \sum_{i=1}^{H-1} \left( n_i \cdot P_{i,z} + n_i^{(1/i)} \right).    (3.20)

The performance of our search algorithm can be estimated by using Equation 3.20 with z set to d_l.

3.4 Experimental Results

To evaluate the effectiveness of our GEH distance and the efficiency of our k-NN searching algorithm, we conducted extensive experiments. The experimental results are presented in this section. Our k-NN searching algorithm was implemented using an ND-tree in the C++ programming language. For comparison purposes, we also implemented k-NN searching using an M-tree in the C++ programming language for a set of experiments. All experiments were run on a PC under OS Windows XP. The I/O block size was set at 4K bytes for both trees.

Both synthetic and genomic datasets were used in our experiments. The synthetic datasets consist of uniformly distributed 10-dimensional vectors with values in each dimension of a vector drawn from an alphabet of size 6; other special-case synthetic datasets are listed in the following subsections. The genomic datasets were created from E. coli DNA data (with alphabet {a, g, t, c}) extracted from GenBank. Each experimental data point reported here is the average over 100 random queries.

3.4.1 Effectiveness of GEH Distance

The first set of experiments was conducted to show the effectiveness of the GEH distance over the Hamming distance by comparing their values of Delta_k, as defined in Lemma 3.2.2 in Section 3.2.1. Figure 3.3 gives the relationship between Delta_k and database size for both the GEH and Hamming distances when k = 1, 5, and 10. The figure shows a significant decrease in the values of Delta_k using the GEH distance over those using the Hamming distance. This significant improvement for the GEH distance is observed for all the database sizes and k values considered. Figure 3.3 shows that when the GEH distance is used, Delta_k values are very close to 1, indicating a promising behavior close to that in CDSs.

[Figure 3.3. Comparison of Delta_k values for the GEH and Hamming distances (number of solution sets vs. number of vectors in database x 10^5)]
3.4.2 Efficiency of k-NN Algorithm on Uniform Data

One set of experiments was conducted to examine the effects of heuristics H1-H3, presented in Section 3.3.1, on the performance of our k-NN searching algorithm (Section 3.3.2) on uniform data. We considered the following three versions of our pruning strategies in the experiments.

Version V1: only heuristic H1 is used.
Version V2: heuristics H1 and H2 are used.
Version V3: heuristics H1, H2, and H3 are used.

[Figure 3.4. Effects of heuristics in the k-NN algorithm using ND-tree with k = 10 (disk accesses vs. number of vectors in database x 10^5)]

Figure 3.4 shows that V2 provides only a small improvement in the number of disk accesses over V1. However, V2 does make good performance improvements over V1 for some of the queries. Thus, we have included H2 in version V3. As seen from the figure, V3 provides the best performance improvement among the three versions for all database sizes tested. Hence V3 is adopted in our k-NN searching algorithm and used in all the remaining experiments, except where noted.

Another set of experiments was conducted to compare the disk I/O performance of our k-NN searching algorithm using an ND-tree, the k-NN searching based on an M-tree, and the linear scan for databases of various sizes. Figure 3.5 shows the performance comparison between our k-NN searching algorithm using an ND-tree and the linear scan.
Although Chavez et al. focus primarily on searching in general metric spaces. the same principles apply here. 83 25.0% - 7 . , , , (D 3 t1 8 20.0% 1 ...; k=1 8 a“: <1: ‘x -9- k=5 i 15.0% . __ F10 o ‘ "" ..A.\ \\ ””9 O 10 00/ ‘ ‘ CD - O i \‘\\ \\ \T‘*~».,.K_ 8) \\\ \\ El\\\x» E \\ lift, ——————— TEL ~~~~~~~ 3::\i\\~i~_1 s 5.0% . ‘-~-\v.—\\mg ........ ~ L _ _m . (D m; o. 0.0% ~ ~#—— 7 —~ ~ ”WWW -..___.____.___. a 4 8 12 16 20 Number of Vectors in Database x 10"5 Figure 3.5. Performance of the k—NN algorithm using ND-tree vs. the linear scan on synthetic datasets with various sizes genomic datasets is comparable with that we. observed for the synthetic datasets. To observe. the performance inmroveinent of our k-NN searching algorithm over various dimensions. we ran random k-NN queries (with l.‘ = 10) 011 two series of genomic datasets: one contains 250K vectors for each set and the other contains 1 million vectors for each set. As seen from Figure 3.8. the performance gain of our algorithm over the linear scan is quite significant for lower dimensions. However. the amount of this improvement decreases with an increasing number of dimensions. This phenomenon of deteriorating performance with an increasing number of dimensions is also true in continuous data spaces due to the well-known dimensionality curse problem. Additionally. we have observed the 1‘)crforn'1ance i1nprovemcnt of our k- 84 i_+_M-Tree (k=1) -I— M-Tree (k=5) 8 2500 (+M-Tree(k=10) 8 [ «a—ND-Tree (k=1) g 2000 i-Ef‘i-ND-Tree(k=5) < l tND-Tree (k=10) a 5 1500 - “6 1 b 1000 - .o 1 E ': :3 i z 500 . l a + :1: :3 r; 0 m- __f ,_____*'.-_ ___. - ._ /_ ___ _ \' . j) 4 8 12 16 20 Number of Vectors in Database x 10"5 Figure 3.6. Number of Disk Accesses comparison of the k-NN algorithm using ND- tree vs. the k-NN searching based on l\l—tree 35.0% V 8 x -a<-- 15 Dimensions g; 300% —:2— 11 Dimensions 0) \ :3 25.0% x x. 8 20.0% I \x g 15.0% x. m \T\‘\. g 10.0% a. *\%\ 8 . Til“ ““““ x ~-- . CT.) 5. 00/0 bt\\ é\\ T . F T 0.0% g ‘“ 2.5 5 7.5 10 12.5 15 17.5 20 Database Size x 10"5 Figure 3.7. Performance of the k-NN algorithm using ND-tree vs. the linear scan on genomic datasets with various sizes for A'zll) QC CH 120.0% .r (D .. i 3 100.0% . 8 ‘ ~5— 250K Vectors o < 80.0% . +1.0M Vectors E i D i u.- 60.0% 0 °’ i 0) £9 4OIY% C . a) e i g; 20.0% “L i 0.0% i 91011121314151617181920 2122 23 24 25 Number of Dimensions Figure 3.8. Performance of the k-NN algorithm using ND-trce vs. the linear scan on genomic datasets with various dimensions for #210 NN searching algorithm over various alphabet sizes. We performed random k-NN queries (with k = 10 and d = 10) on databases of 2A] vectors. Figure 3.9 shows that, the effects of increasing alphabet. size are similar to the effects of the dimensionality CHI‘SG. Further, we have compared the disk I/O performance of the kf-NN searching al— gorithm using the GEH distance with that for the At-NN searching algorithm using the Hamming distance. Figure 3.10 shows the percentage I / Os for the GEH distance versus the Hamming distance for various database sizes and k values. From the fig- ure. we can see that the number of disk accesses decreases for all test cases when the GEH distance is used as opposed to the Hamming distance. In fact. the algorithm 86 35.0% a, o -—k=1 8 30.0/0 'B'k=5 a“; -:';-.-k=10 8 25.0% 0 <1: E 20.0% cu . “5 15.0% 4 CD CD ’. 2 g 10.0% 5 CD 8 (D 0 . 0. 5.0/o ; 0.0% 4 6 8 10 12 Alphabet Size Figure 3.9. Perfori‘nance of the k-NN algorithm using ND-tree vs. 
In fact, the algorithm using the GEH distance needs only 50% - 70% of the I/Os that the algorithm using the Hamming distance needs for all test cases. We feel this is due to an increase in the pruning power of heuristic H1 for the GEH distance. These results indicate that the use of the GEH distance will cost less in disk accesses while providing a far more deterministic result than that using the Hamming distance for a k-NN query.

Figure 3.10. Performance comparison for the k-NN searching using ND-tree based on GEH and Hamming distances

3.4.3 Efficiency of k-NN Algorithm on Skewed Data

Experiments were also conducted to examine the I/O performance of our algorithm upon datasets of varying levels of skewness as compared to that of a linear scan. We applied our algorithm, with heuristic version V3 from Section 3.4.2, to ND-trees constructed from datasets with zipf distributions of 0.0, 0.5, 1.0, and 1.5. Figure 3.11 shows a significant reduction in the number of disk accesses for our k-NN searching algorithm over the linear scan for all database sizes tested. Similar to the performance gains for uniform data (see Section 3.4.2), our k-NN searching algorithm provides an increased reduction of disk accesses as the database size increases. Figure 3.11 also shows that our k-NN searching algorithm provides increased performance gains as the level of skewness increases (i.e., as the zipf distribution level increases). These results indicate that our searching heuristics (see Section 3.3.1) are able to identify and prune more useless search paths as the data becomes more skewed.

Figure 3.11. Performance of the k-NN algorithm using ND-tree vs. the linear scan on synthetic datasets with various sizes and zipf distributions

3.4.4 Efficiency of k-NN Algorithm on Non-Homogeneous Data

Experiments were conducted to show the effectiveness of our heuristic using the probability formulas presented in Section 3.2.3. We compared the I/O performance between the k-NN algorithm using our probability-based subtree ordering heuristic H4 and the k-NN algorithm using our traditional MINDIST subtree ordering heuristic H3, both of which utilize the ND-tree. We observed that, although the two heuristics often yield a comparable performance, there are cases in which our probability-based heuristic significantly outperformed the MINDIST one. These cases can occur when the distribution of the dataset shifts over time. For instance, dimensions that are highly relevant to the partitioning of vectors into subtrees early in the construction of an ND-tree may no longer be relevant at later stages of the construction. These dimensions may become misleading when searching for the records inserted into the tree during these later stages.

Figure 3.12. Performance of the k-NN algorithm using ND-tree on datasets with various misleading dimensions (k = 1)

Figures 3.12, 3.13, and 3.14 show our results when searching for 1, 5, and 10 neighbors, respectively.
Each search was performed on an ND-tree containing 5M vectors using each of the following heuristic combinations:

Version S1: heuristics H1, H2, and H3 are used;
Version S2: heuristics H1, H2, and H4 are used.

The ND-trees constructed from these datasets are known to contain misleading DMBRs in regards to what vectors are present in the relevant subtrees. For our selection of heuristic H4, we compared the values of 1/|b_i| and f(a)_{R,i} for each node at one level below the root node and labeled a misleading dimension as one in which there was a discrepancy greater than 3:1 between the two values compared. The number of misleading dimensions indicates the known number of dimensions in each of the DMBRs at one level below the root node that meet this criterion; that is, for a particular dimension i,

1/|b_i| > 3 · f(a)_{R,i}   ∨   1/|b_i| < (1/3) · f(a)_{R,i}.

Figure 3.13. Performance of the k-NN algorithm using ND-tree on datasets with various misleading dimensions (k = 5)
Figure 3.14. Performance of the k-NN algorithm using ND-tree on datasets with various misleading dimensions (k = 10)

The results in Figure 3.12 show that the use of heuristic version S2 provides benefits for most cases tested when searching for only a single neighbor. The cases where the number of misleading dimensions was either very large or very small still show better I/O performance when using heuristic version S1. The results in Figure 3.13 show that the reduction of I/O when heuristic version S2 is used is much larger for all cases tested when searching for five neighbors rather than a single neighbor. The results in Figure 3.14 show that the reduction of I/O continues to grow when searching for ten neighbors when using heuristic version S2. These results show that, in general, as the number of neighbors being searched for increases, the performance benefits when using heuristic version S2 increase as well. Additionally, we notice that in Figures 3.13 and 3.14, the performance when using heuristic version S2 becomes similar to the performance when using S1 as the number of misleading dimensions approaches the total number of dimensions. This is likely due to the reduction of non-misleading paths available. As the number of non-misleading paths approaches 0, heuristic version S2 will be forced to choose similar paths to heuristic version S1.

3.4.5 Verification of Performance Model

Our theoretical performance estimation model was also verified using uniform synthetic experimental data. We conducted experiments using 10-dimensional data with an alphabet size of 6. The minimum leaf node utilization of our ND-tree was set at 0.4 and the minimum non-leaf node utilization was set at 0.3. We compared our theoretical values to the observed ND-tree performances for databases varying in size from 400K vectors to 2.0M vectors in 400K increments. We also varied the value of k to observe its effects upon the results. Figures 3.15 through 3.17 show the estimated number of I/Os predicted by our performance model, together with the actual I/O. The results indicate that our model is quite accurate, estimating the performance within 2% of the actual performance for most test cases. The greatest disparity between estimated and actual performance values occurs in the test cases with small datasets,
particularly when searching for only a single neighbor. However, as the size of the dataset increases or as the number of neighbors searched for increases, the performance estimation becomes increasingly accurate.

Figure 3.15. Estimated and actual performance of the k-NN algorithm vs. the linear scan on synthetic datasets with various sizes (k = 1)
Figure 3.16. Estimated and actual performance of the k-NN algorithm vs. the linear scan on synthetic datasets with various sizes (k = 5)
Figure 3.17. Estimated and actual performance of the k-NN algorithm vs. the linear scan on synthetic datasets with various sizes (k = 10)

The above results show that, for both synthetic and genomic uniform data, our k-NN searching algorithm based on the GEH distance far outperforms the linear scan. Additionally, our algorithm outperforms the linear scan for synthetic skewed data. Only when the dimensionality of the underlying NDDS begins to grow excessively do the benefits of our algorithm start to become less significant. This is a result of the well-known dimensionality curse problem. Further, our performance model provides an accurate estimation of the number of I/Os incurred while performing a k-NN search of a large database.

CHAPTER 4

Understanding Distance in NDDS

In this chapter we consider in more detail the issue of distance measurement in Non-Ordered Discrete Data Spaces. To efficiently handle a much broader array of applications than those presented in the previous chapter, we present a generalized form of our Granularity Enhanced Hamming (GEH) distance. We then provide a new implementation of this distance.

4.1 Motivations and Challenges

A major problem with k-NN searching in NDDSs, as discussed in Chapter 3, is the non-determinism of the solution. That is, there is usually a large number of candidate solutions available, which may obscure the result. This is mainly caused by the coarse granularity of the commonly used distance metric, the Hamming distance. An extension to the Hamming distance, termed the Granularity Enhanced Hamming (GEH) distance, was introduced in [32] as a solution to this problem. We demonstrated that the GEH distance greatly reduced the non-determinism of the solution set, as well as provided performance benefits, while maintaining the semantics of the original Hamming distance [32, 33]. However, the GEH distance introduced in [32] was tied directly to data point frequency values in a manner that may not be ideal in all scenarios. Applications/scenarios with other more relevant dataset characteristics (distribution, clustering, etc.) may not experience the same performance benefits seen in [32]. To address this issue, we introduced a generalized form of the GEH distance in [34].
This form may be optimized to a much broader set of applications than the original GEH distance presented in [32]. Conditions/constraints are presented that maintain the necessary distance metric properties for use as a pruning metric. We show that the original GEH distance is, in fact, an instantiation of this generalized form. Further, we present a new instantiation of the generalized GEH form that demonstrates the benefits of adapting the generalized form for specific scenarios.

The rest of this chapter is organized as follows. Section 4.2 presents the generalized form of the GEH distance. Section 4.3 introduces a new ranking based GEH instantiation derived from the generalized form.

4.2 Generalized GEH Distance

The GEH distance, originally presented in [32], expanded upon the Hamming distance to provide more granularity while maintaining all of the semantics of the Hamming distance. This was accomplished by adding an adjustment value to the Hamming distance between two vectors based upon their matching elements. This is presented formally as follows:

D_GEH(α, β) = Σ_{i=1}^{d} { 1, if α[i] ≠ β[i];  (1/d)·f(α[i]), otherwise },    (4.1)

where f(α[i]) = 1 − f_q(α[i]). The value of f_q(α[i]) is the number of occurrences of α[i] in the i-th dimension of the dataset, divided by the number of vectors in the dataset; essentially, a global frequency value.

While this does provide a dramatic increase in the determinism of result sets when used in a similarity search, this distance metric may not provide an ideal distance semantic for all applications. Equation 4.1 is limited to applications where the global frequency of elements has some significance in the dataset.
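To make the computation in Equation 4.1 concrete, the following sketch (in Python, with hypothetical helper names) shows one way the frequency-based GEH distance could be evaluated; the per-dimension frequency tables are assumed to have been precomputed from the dataset. It is a minimal illustration of the formula, not the implementation used in our experiments.

    def geh_frequency(alpha, beta, freq):
        """Frequency-based GEH distance of Equation 4.1 (illustrative sketch).

        alpha, beta : equal-length sequences of discrete elements.
        freq        : list of dicts; freq[i][e] is the fraction of database
                      vectors whose i-th element equals e, i.e. the global
                      frequency value f_q(e) for dimension i.
        """
        d = len(alpha)
        dist = 0.0
        for i in range(d):
            if alpha[i] != beta[i]:
                dist += 1.0                      # Hamming contribution for a mismatch
            else:
                f_q = freq[i].get(alpha[i], 0.0)
                dist += (1.0 - f_q) / d          # adjustment: (1/d) * (1 - f_q(alpha[i]))
        return dist

The integer part of the returned value is still the Hamming distance; the fractional part breaks ties among vectors that match on the same number of dimensions, using the global frequencies of the matching elements.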
Applications where other dataset characteristics provide a better semantic may not be able to benefit from using Equation 4.1 to the same degree as the results shown in [32]. To address this issue, we propose a generalization of the GEH distance that may be optimized to a much broader set of applications.

We observe that the Hamming distance assumes that a worst case match (i.e., a non-match) between two elements is represented by a distance of 1, while all other matches are represented by a distance of 0. We expand upon this by adding more granularity to the values assigned to different types of matches. We propose the following generalized form of the GEH distance to accomplish this goal:

D_GEH(α, β) = D_Hamm(α, β) + (1/C) · Σ_{i=1}^{d} f_geh(α[i], β[i]),    (4.2)

where

Constraint 1: ∀ α, β : C ≥ d − D_Hamm(α, β);
Constraint 2: ∀ α[i], β[i] : 0 ≤ f_geh(α[i], β[i]) < 1;
Constraint 3: ∀ α[i], β[i] : f_geh(α[i], β[i]) = f_geh(β[i], α[i]);
Constraint 4: ∀ α[i], β[i] : α[i] ≠ β[i] → f_geh(α[i], β[i]) = 0.

Here, f_geh represents some function, chosen by an application expert, that will provide an adjustment to the Hamming distance for each dimension. The variable C is a pseudo-constant used to guarantee that the adjustment values of f_geh do not become more dominant than the original Hamming distance. Constraint 1 indicates that the value of C must be greater than or equal to the number of matching dimensions between two vectors. Constraint 2 indicates that the result of function f_geh for the i-th elements of the two vectors being considered must lie in the range [0, 1). Constraint 3 indicates that function f_geh must be symmetric. Constraint 4 indicates that the result of f_geh for the i-th elements of the two vectors being considered must equal 0 if these two elements do not match.

Footnote 1: The term pseudo-constant is used to indicate that C is not strictly a constant, and may vary as long as Constraint 1 of Equation 4.2 is maintained.
Footnote 2: Note that both variables α and β are passed to the adjustment function. This enables the adjustment function to be fully expressed, whereby Constraint 4 may be verified.

From Equation 4.2, we can see that if m ≤ D_GEH(α, β) < m + 1 (m = 0, 1, ..., d), then vectors α and β mis-match on m dimensions (i.e., match on d − m dimensions), therefore preserving the original semantics of the Hamming distance. Further, the four provided constraints allow the generalized GEH distance to maintain the metric properties necessary for use as a pruning metric in similarity searches, as described in the following lemmas.

Footnote 3: Many application specific solutions, such as BLOSUM, employed in bioinformatics, reduce the non-determinism of solution sets by utilizing a cost matrix as a direct form of distance measurement. This is similar in theory to utilizing the adjustment function as a distance measure directly. Unfortunately, these methods do not preserve the semantics of the original Hamming distance and thus lose a level of portability between application environments. However, solutions such as BLOSUM may be incorporated into Equation 4.2 by utilizing the diagonal of the cost matrix for the adjustment function.

Lemma 4.2.1. The generalized GEH distance maintains the Positiveness property (i.e., ∀ x, y : D_GEH(x, y) ≥ 0).

Proof. By maintaining the Hamming distance within the GEH distance, we are guaranteed a positive value if any elements between the two vectors do not match. Constraint 2 indicates that all values resulting from matching elements will be non-negative. □

Lemma 4.2.2. The generalized GEH distance maintains the Strict Positiveness property (i.e., ∀ x, y : x ≠ y → D_GEH(x, y) > 0).

Proof. This property is inherited by maintaining the Hamming distance within the generalized GEH distance, whereby any two vectors that are not equal will have a distance greater than 0, based upon a 1 being added to the distance for each non-matching dimension. Constraint 2 guarantees that the values added from function f_geh will all be non-negative. □

Lemma 4.2.3. The generalized GEH distance maintains the Symmetry property (i.e., ∀ x, y : D_GEH(x, y) = D_GEH(y, x)).

Proof. The Hamming distance is known to maintain symmetry between vectors. In addition, Constraint 3 guarantees that the values provided by the function f_geh will maintain symmetry as well. □

Lemma 4.2.4. The generalized GEH distance maintains the Pseudo-Reflexivity property (i.e., ∀ x, y : D_GEH(x, x) < 1 ∧ x ≠ y → D_GEH(x, y) ≥ 1).

Proof. This property is maintained due to Constraints 1 and 2, which stipulate that the additional value added to the Hamming distance will always be in the range [0, 1). Thus the distance between two vectors that exactly match will have a distance value less than 1. Any vectors that are different in one or more dimensions will have a distance greater than or equal to 1. □

Footnote 4: Note that the traditional property of Reflexivity (i.e., ∀ x : D(x, x) = 0) is replaced by the property of Pseudo-Reflexivity. This is a reasonable substitution in an NDDS due to two vectors exactly matching each other still being identifiable from all other pairs of vectors based only upon the distance between them.

Lemma 4.2.5. The generalized GEH distance possesses the Triangular Inequality property (i.e., ∀ x, y, z : D_GEH(x, y) + D_GEH(y, z) ≥ D_GEH(x, z)).

Proof. We first consider the Hamming portion of the generalized GEH distance.
For any dimension i ∈ [1, d], if x_i ≠ z_i, then either exactly one of x_i ≠ y_i and z_i ≠ y_i holds, or x_i ≠ y_i ∧ z_i ≠ y_i. Thus, for each dimension i where the right side of the inequality (i.e., D_GEH(x, z)) would be incremented by an integer value of 1, the left side of the inequality (i.e., D_GEH(x, y) + D_GEH(y, z)) would be incremented by an integer value of either 1 or 2, thus maintaining the inequality. Next, we consider the adjustment portion of the GEH distance (i.e., (1/C)·f_geh()). For any dimension i ∈ [1, d], if x_i = z_i, then either x_i = y_i ∧ z_i = y_i or x_i ≠ y_i ∧ z_i ≠ y_i. Thus, due to Constraints 2 and 3, for all dimensions where this is the case, the left side will either be incremented by twice as much as the right side or be incremented by an integer value of 2 while the right side is incremented by some value less than 1. Constraint 4 indicates that no additions will be made if the values in a dimension do not match, leaving the Hamming component to be dominant. Thus the adjustment values maintain the inequality. □

4.3 Ranking Based GEH Instantiation

As described in [14], many search algorithms demonstrate better performance when the distances between data points are distributed evenly throughout the distance range. We note that the original GEH distance, Equation 4.1, is likely to result in a heavily skewed distribution of possible distances. As the alphabet size grows, the likely values of f_q(α[i]) trend closer to 0, leading to a clumping of distance values close to the floor value. Additionally, setting C = d results in C having a dominant role in the distance value as the dimensionality of the dataset grows larger. To address these issues, we propose a new GEH distance instantiation:

f_geh(α[i], β[i]) = rank_i(α[i]) / (|A_i| + 1),
C = d − D_Hamm(α, β) + 1.    (4.3)

Here, the term rank_i(α[i]) indicates the global rank of element α[i] among the alphabet set in dimension i. The ranking mechanism employed should be set by an application expert on the condition that it results in the different elements of the alphabet receiving ranking values of [1, |A_i|] inclusive. The value of C tracks the number of matching dimensions between vectors α and β.

As an example ranking mechanism, we consider the frequency of elements within a dimension, applying a higher rank (lower integer value) to elements that occur more frequently, and a lower rank (higher integer value) to elements that occur less frequently. For example, if the alphabet set in dimension i consists of {a, b, c}, where a appears in 20% of the vectors, b appears in 50% of the vectors, and c appears in 30% of the vectors in dimension i, the rank of each of the elements in dimension i would be as follows: a → 3, b → 1, and c → 2. Although an element's frequency within a dimension still plays a role in the determination of the GEH distance (in this example), the ranking mechanism maintains a uniform distribution of distance values over varying alphabet sizes. Additionally, having the value of C track the number of matching dimensions rather than the dimensionality of the dataset reduces the dominance of C as the dimensionality of the dataset grows larger.
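As an illustration only, the following sketch (Python, with hypothetical helper structures) plugs the rank-based adjustment of Equation 4.3 into the generalized form of Equation 4.2. The rank tables are assumed to have been produced by whatever ranking mechanism the application expert selects, with ranks in [1, |A_i|].

    def geh_rank(alpha, beta, rank, alphabet_size):
        """Rank-based GEH instantiation (Equations 4.2 and 4.3) -- illustrative sketch.

        alpha, beta   : equal-length sequences of discrete elements.
        rank          : list of dicts; rank[i][e] is the rank (1 .. |A_i|) of
                        element e in dimension i.
        alphabet_size : list; alphabet_size[i] = |A_i| for dimension i.
        """
        d = len(alpha)
        hamm = sum(1 for i in range(d) if alpha[i] != beta[i])
        # Pseudo-constant of Equation 4.3: number of matching dimensions plus one,
        # which satisfies Constraint 1 (C >= d - D_Hamm).
        C = d - hamm + 1
        adjust = 0.0
        for i in range(d):
            if alpha[i] == beta[i]:                       # Constraint 4: 0 on mismatch
                adjust += rank[i][alpha[i]] / (alphabet_size[i] + 1.0)
        return hamm + adjust / C

Because every per-dimension adjustment is strictly less than 1 and there are exactly d − D_Hamm matching dimensions, the summed adjustment divided by C stays below 1, so the integer part of the returned value is still the Hamming distance.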
To evaluate the benefits of an adaptable distance metric, we performed a series of k-NN queries utilizing the GEH distance implementations in Equations 4.1 and 4.3, as well as the Hamming distance. Table 4.1 shows a comparison of I/O results while searching uniformly distributed datasets of varying dimensionality. These results demonstrate a scenario where the frequency based GEH implementation provides slightly better search performance than the rank based GEH implementation. Further, our results agree with those in [52] linking decreasing performance with increasing dimensionality.

Table 4.1. Varying Dimensionality (disk accesses)
          Hamm.   Freq.   Rank
  d = 5      36       7      6
  d = 10    968     472    480
  d = 15   5914    4591   4675
  d = 20   8294    8228   8232

Footnote 5: Note that for the largest dimensionality tested, d = 20, the I/O results when using the ranking based and frequency based GEH implementations begin to approach each other. We attribute this to the dimensionality of the dataset playing a less dominant role in Equation 4.3 than in Equation 4.1.

Table 4.2 shows a comparison of the I/O results while searching 10-dimensional datasets of varying zipf distribution. For this scenario, use of the new ranking based GEH implementation provides a strong performance improvement over the frequency based GEH distance implementation. This is in agreement with the results shown in [14] concerning search performance and distance value distribution. These results highlight scenarios where Equations 4.1 and 4.3 provide search performance improvements specific to each case, thus demonstrating the benefits of an adaptable distance metric.

Table 4.2. Varying Zipf Distribution (disk accesses)
            Hamm.   Freq.   Rank
  zipf 0.0    968     472    480
  zipf 0.5    693     399    301
  zipf 1.0    381     233    126
  zipf 1.5    105      73     30

CHAPTER 5

k-Nearest Neighbor in Hybrid Data Spaces

In this chapter, we consider k-Nearest Neighbor (k-NN) searching in Hybrid Data Spaces. Searching in HDSs presents several new challenges not present in either CDS or NDDS searching applications. We examine these issues and discuss methods to resolve them. Further, we extend the theoretical performance model presented in Chapter 3 to HDSs.

5.1 Motivation and Challenges

Nearest neighbor searches/queries in Hybrid Data Spaces (HDS) are becoming increasingly useful in many contemporary applications such as machine learning, information retrieval and security, bioinformatics, multimedia, and data-mining. Consider the following information retrieval task. Given a set of network sites and a range of times, determine the set of k network intrusions that match a set of query criteria most closely. When examining records of intrusions, the sites in which they occurred could be considered discrete data, in that an intrusion either did or did not occur at that site. The times active in a particular site could be considered continuous data, in that an intrusion may have been active over only a period of time.

Several techniques have been proposed in the literature to support such searches in both continuous (ordered) data spaces and non-ordered discrete data spaces. A majority of these techniques utilize a multidimensional index structure such as the R*-tree [6] or the ND-tree [43]. Little work has been reported in the literature on supporting efficient nearest neighbor searches in hybrid data spaces.
Efficient index based nearest neighbor searching is dependent upon the usefulness of information maintained in the search index structure. When searching an index containing hybrid data, a difficult scenario occurs when one set of dimensional data is unknown. This scenario is analogous to using a non-hybrid index, such as the R*-tree or ND-tree, to maintain hybrid data based on their continuous or discrete subspace. If this structure remains unmodified, performing nearest neighbor searches becomes impractical due to the values of the non-native dimensions (discrete for the R*-tree, continuous for the ND-tree).

To guarantee all valid neighbors are found in these scenarios, additional considerations must be taken into account. First, when examining the current set of objects/records found to determine a search range, it must be assumed that all non-native dimensions that are not maintained in the index structure differ in value from the query vector, i.e.:

D(q, NN_k) = D_N(q, NN_k) + d_nn,    (5.1)

where D_N is the distance in native dimensions between the query vector q and the k-th neighbor NN_k, and d_nn is the maximum distance possible between the non-native dimensions in the dataset and q. Second, when comparing a bounding box or possible new neighbor, the opposite assumption must be made, i.e.:

D(q, X) = D_N(q, X) + 0,    (5.2)

where X is either a bounding box or an object. This is due to the possibility of some vector within X (or X itself) exactly matching the query vector on all non-native dimensions that are not maintained in the index structure. Examining these facts, it is clear that as the number of non-native dimensions grows, it becomes increasingly difficult to exclude any portion of the index from the search.
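The following fragment (Python; the distance helpers are hypothetical placeholders) sketches how Equations 5.1 and 5.2 force a non-hybrid index to be pessimistic about the search range while remaining optimistic when deciding whether a subtree can be pruned. It is meant only to make the asymmetry explicit, assuming the native-subspace distances are already available.

    def search_range(native_dist_to_kth, d_nn):
        # Equation 5.1: the k-th neighbor found so far might still differ from the
        # query on every non-native dimension, so the range must include d_nn.
        return native_dist_to_kth + d_nn

    def lower_bound(native_dist_to_box):
        # Equation 5.2: a subtree (or candidate object) might match the query exactly
        # on every non-native dimension, so nothing can be added to its bound.
        return native_dist_to_box + 0

    def can_prune(native_dist_to_box, native_dist_to_kth, d_nn):
        # A subtree is safely pruned only if even its optimistic bound exceeds the
        # pessimistic search range; with many non-native dimensions this rarely holds.
        return lower_bound(native_dist_to_box) > search_range(native_dist_to_kth, d_nn)

Here d_nn is the maximum possible contribution of the non-native dimensions, so the gap between the two bounds grows with every dimension the index does not maintain, which is why pruning degrades so quickly.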
To address these issues, we consider performing k-Nearest Neighbor (k-NN) searches utilizing the CND-tree, a recently proposed multidimensional index for HDS [15]. When considering k-NN searches utilizing an HDS index, we present a best-first searching algorithm that utilizes characteristics of HDS in its heuristics to reduce the I/O cost of determining a valid set of k neighbors. Further, we present a theoretical performance model for this algorithm based on the characteristics of an HDS and the hybrid index.

The rest of this chapter is organized as follows. Section 5.2 presents a brief analysis of the different stages of a k-NN search. Section 5.3 presents our best first algorithm, its heuristics, and our theoretical performance model. Section 5.4 discusses experimental results.

5.2 Nearest Neighbor Search Stages

In this section, we present a discussion of the different stages of a k-NN search. By properly categorizing these stages, we are able to develop heuristics that improve search performance with more accuracy. The reader is recommended to review Chapter 2 for an overview of the CND-tree.

When performing k-NN queries using a multi-dimensional index structure, the act of traversing the tree (assuming a minimum distance based ordering) can be broken into three distinct stages: range reduction, overlap, and exhaustive. In this section we consider the first two stages in more detail, while the exhaustive stage will be revisited in Section 5.4.2.

The range reduction stage occurs while the current search range, r_c, is greater than the final search range value. During this stage, new neighbors/objects that are found are likely to result in reducing the current search range. The order in which nodes are visited has a direct impact upon the I/O cost of this stage.

The overlap stage occurs when the search range has been reduced to its final value, but there still exist nodes R in the search path whose bounding box is such that D(q, R) < r_c. The amount of overlap between nodes within data organization structures directly affects the I/O cost of this stage. Data organization structures with a minimal amount of overlap within their internal nodes are likely to incur less I/O cost when searching during this stage than data organizational structures with a greater amount of area overlap.

Figures 5.1, 5.2 and 5.3 break down the I/O cost due to the range reduction and overlap stages when searching a CND-tree. Figures 5.1 and 5.2 show the effects when either the number of continuous or discrete dimensions is held constant (at 6) and the number of the other dimensions varies. Figure 5.3 shows the effects when the number of both continuous and discrete dimensions is held constant (6 continuous, 3 discrete) and the number of records in the database varies. As seen from these figures, the I/O costs of the overlap stage rise dramatically as the number of dimensions increases. This stage has a much less dramatic increase in I/O cost as the size of the database increases. Additionally, the I/O cost of the range reduction stage actually decreases as the database size grows, while increasing as the number of dimensions increases.

Figure 5.1. Search stage I/O with variable number of continuous dimensions
Figure 5.2. Search stage I/O with variable number of discrete dimensions
Figure 5.3. Search stage I/O with variable database size

Footnote 1: This is a similar phenomenon to what was reported by Chavez et al. [14]. Although Chavez et al. focus primarily on searching in general metric spaces, the same principles apply here.

5.3 Search Algorithm

To efficiently process k-NN queries in HDSs, we present a priority backtracking index-based searching algorithm that utilizes properties of the CND-tree for HDS. This algorithm initiates a search in the root node and then visits each subsequent node based upon a "best-first" criterion that is defined in the heuristics. When k possible neighbors are retrieved, the searching algorithm uses the distance information of the neighbors collected to start pruning the search paths that can be proven to not include any vectors that are closer to the query vector than any of the current neighbors. Our algorithm is inspired by earlier priority backtracking based implementations presented for CDS, NDDS, and generic metric spaces [33, 27]. Our algorithm extends these implementations to HDSs by utilizing metrics and heuristics suitable for such a space. In particular, we introduce the notion of using an estimated match likelihood value to help prioritize ordering. Additionally, we present a performance model to accurately estimate the I/O cost of executing our algorithm.

5.3.1 Match Likelihood

Consider a query vector q and a bounding box R.
If the total number of dimensions in the dataset is d_t = d_D + d_C and the distance between q and R is D(q, R) = x, then it is possible that there exists an object in the subtree associated with R that matches q on d_t − x dimensions (assuming a distance metric similar to Equation 2.4 is utilized). In a multidimensional index structure such as the ND-tree, R*-tree, or CND-tree, this is not the most likely scenario. More likely, there exist several objects in the subtree associated with R that match q on a subset of the dimensions that are represented in R. The projections of these objects at higher levels in the index tree can create a misleading picture of the vectors in its associated subtree, similar to the concept discussed in Chapter 3 for NDDS searching. We may infer from the bounding box of R how likely the information that R contains is to represent the vectors in its subtree. For the purposes of this discussion, we will assume that our index tree has maintained a relatively uniform distribution of elements within its subtrees.

When we consider the CND-tree as our indexing method, the probability that accessing a subtree with associated bounding box R = S_{D_1} × ... × S_{D_{d_D}} × S_{C_1} × ... × S_{C_{d_C}} will yield a particular element a in dimension i may be estimated as follows (assuming a ∈ R_i):

p(a)_{R,i} = { 1/|S_i|, if i is discrete;  δ_t/(max S_i − min S_i), if i is continuous },    (5.3)

and the match likelihood of encountering a vector α = (α_1, α_2, ..., α_{d_t}) in the subtree associated with R may be approximated as follows:

P(α)_R = Σ_{i=1}^{d_t} { p(α[i])_{R,i}, if α[i] ∈ R_i;  0, otherwise }.    (5.4)

Equation 5.3 calculates the reciprocal of the magnitude of the alphabet set on dimension i of R for discrete dimensions, and the quotient of the threshold value and the range of the set on dimension i of R for continuous dimensions. Equation 5.4 then determines the summation of these values for all elements in vector α that are represented in R. It should be noted that more in-depth methods were presented for estimating this likelihood value for NDDS in [33]. However, we are only interested in using this value to break ties in cases where the minimum distance between a subset of the nodes and the query vector is the same. Thus, the generalizations of Equations 5.3 and 5.4 are sufficient.
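The following sketch (Python; the bounding-box representation is a hypothetical stand-in for the CND-tree's HMBR structure) illustrates the estimate of Equations 5.3 and 5.4 for a single node. Discrete components are modelled as sets of alphabet elements and continuous components as (min, max) intervals, with delta_t standing for the threshold of Equation 2.4.

    def match_likelihood(alpha, box, is_discrete, delta_t):
        """Approximate match likelihood of vector alpha in the subtree of box
        (Equations 5.3 and 5.4) -- illustrative sketch.

        alpha       : candidate/query vector (one value per dimension).
        box         : per-dimension HMBR components; a set for a discrete
                      dimension, a (lo, hi) pair for a continuous one.
        is_discrete : list of booleans, one per dimension.
        delta_t     : matching threshold used for continuous dimensions.
        """
        total = 0.0
        for i, comp in enumerate(box):
            if is_discrete[i]:
                if alpha[i] in comp:                      # alpha[i] in R_i
                    total += 1.0 / len(comp)              # Equation 5.3, discrete case
            else:
                lo, hi = comp
                if lo <= alpha[i] <= hi:
                    span = max(hi - lo, delta_t)          # guard against a zero-width interval
                    total += delta_t / span               # Equation 5.3, continuous case
        return total

In the algorithm of the next subsection this value is used only to break ties among subtrees that share the same minimum distance to the query (heuristic H3), so the rough estimate is sufficient.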
It utilizes a priority queue, labeled Q. of CND-tree nodes that is sorted based upon heuristics H2 and H3. It invokes two functions: F i ndS'ubtrces and RetriecenN’e-ighbors. The former finds all subtrees of the current node N that are within Range of the query vector and adds their nodes to Q. The latter updates the list of k-nearest neighbors. NN, using vectors in the current leaf node. In the algorithm. step 1 initializes the range variable. Step 2 starts the search at the root: node by inserting it into Q. Steps 3 - 14 traverse the CND-tree. where steps 4 and 5 select the next node to be visited and remove it from Q. Steps 6 — 8 deal with non-leaf nodes. Step 7 invokes FindSubtrees to collect all subtrees of the current node that are within Range. Step 8 sorts Q according to heuristic H2 116 Algorithm 4 Priority h-NN Query 1: Range = dt 2: Q.Insert( T ) 3: while lQ.Empty() do 4: N = Q.Top() 5: Q.Pop() 6: if N.Height > 1 then 7: FindSubtrees( N. Range. g. Q ) 8: Q.SOI‘t() 9: else 10: RetrieveNeighbors( N. Range. q. NN ) 11: Range = NN[A‘].Dist() 12: Q.Prune( Range ) 13: end if 14: end while 15: Return NN and H 3. Steps 9 - 12 deal with leaf nodes. Step 10 invokes RetrieeeNeighbors to update the list of current [if-nearest neighlmrs. Step 11 updates Range to equal the distance of the current hf," neighbor from the query vector. Step 12 prunes nodes from Q according to heuristic H 1. Finally. step 15 returns the result. Note that F indS ubtrees only updates Q with subtrees that are currently within Range. If no subtrees of the current node are within range. no new nodes will be added to Q. The W'HILE loop in steps 3 - 14 is terminated when all subtrees that are within the current. range have been visited. 5.3.3 Performance Model To analyze the performance of our k-VN search algorithm, we conducted both em- pirical and theoretical studies. The results of our empirical studies are presented 117 in Section 5.4. In this section. we present a theoretical model for estimating the performance of our search algorithm. To accomplish this. we first derive an estimate of the likely search range that will yield 1: objects/ neighbors. We then derive an estimate of how many tree nodes/ pages will be accessed by a search of a. variable range. For our presentation. we assume that our algorithm employs heuristics H1. H 2. and H3 and that our dataset is reasonably uniform. For reference, many of the variables used throughout this section are maintained in Table 1. The likely distance of the hm neighbor from the query point represents the final search range that can be used to prune nodes from the search path. To determine this distance. we estimate the ratio of area within a specific distance and the total possible area of the dataset. similar to the method e1111')loyed for NDDSs in [33]. The area within a distance r: from a point p in a d D + (IC dimensional HDS with alphabet A for each discrete dimension. span S for each continuous dimension, and threshold value (it for Equation 2.4 is analogous to the number of unique points within a. hyper sphere of radius 3 centered at point p: A7‘6’(l.(.‘:) = 2;:0 [Zillill(r.dm) ) [faf-T- LIN] a (55) g:max(0.;r—dA[ where 118 Table n,- 11.12- (1.11 m (11‘! (1711. d111 d112 dm 1 (1711.2 3M2 31711 B7712 mum of nodes at. layer i num of vectors in a node at layer i 5.1. Performance Model Variables height of tree = [£171 2 max(dD, (1C) 2 min(dD. dc) _{R IAI if (1311' = (ID "' otherwise IRAI 1f (1m— — dD otI her w ise ll i’d'IDi) max(d:D max(dCI (ICI‘. ;) I min(dCII, (1C1?) 
max(dDII (1D ,j) I II max((l'ICII-, d/C'Ij) min(d:DI- (1D ,) l } if d” = d D otherwise } if (1111 = d D otherwise } if (in, = dD } otherwise :{{R :{ min(dDI dIIIIDi) if (1772 = dD min1dc I (IF I.) otherwise if (111— — (11) Adi—111 _ dDi if (111 = (1D A (1.111: dIDflT if d1.” = (10 /\ (1.111 = (10.2" otherwise . r 1f (11; = (ID /\ (1.112 = (113.1 . I 1de11 2 d0 /\ (1.112 = (10.1 I’ if (1.11 = dc /\ (1.112 = d(7,7: otherwise if dm = (10 /\ dml = (1,0,1 _ if dm 2 d D /\ dml = (131 _ if (in = dc A dml = (1a: otherwise if (1m, = do A dm2 = (113,1 . II If (1771,. = (ID A (1711.2 : dDa' . I If (1m 2 (1C A (11112 : dC’j otherwise 119 I . 1, . _ fafffill) : ((111)1(1111— 1)U(i‘l[y) ((1.1! — 11$ 31 . The probability of an object existing within a distance 2 from any point 1), may be calculated as follows: .4rea(z) P 's S‘ 3 Z . (.rz.,.t. ( ) ArgafllD + (1C) (5.6) The product of the number of points N within the dataset and the probability of a point existing within a distance 3 from any point. p yields the number of points likely to exist in that subspace: [4(3) : Peristsk) * *N- (5'7) It is reasonable to assume that a. minimum distance that needs to be searched is one less than the distance that is likely to yield at least A: neighbors. Thus a. search range r = z, is found be solving Equation 5.7 such that L(r + 1) _>_ k and L(r) < k. The reason for this is that a search range that yields more than k neighbors is likely to extend beyond the. overlap stage (Section 5.2). Next, we determine how many pages are likely to be accessed by a. search with 120 a range of r. We derive this value in a similar fashion to the method derived by Qian et al. [43] for N DDS. As new records are inserted into the CND-tree. some dimensions 111ay be split and others may not be. Assuming a reasonably uniform distribution and independence among dimensions. the probability for the length of an unsplit edge of an HMBR to be ‘1’ is as followsgz T- = 1 D .—.———r as T _. z 1 (12.1 Wr‘ Using Equation 5.8. the probability for ant hebibliography edge length of an HMBR to be j is as follows: A . j—l ; 141‘“: (ij I1 'l’-ZA.=1(1)IT.fiI—*Ti.k TDIZ'IJ' : [‘4 U'Z' k 7 I? I—1 j Hui (J) u’-Z/._1(1:)T*Tzk T .. __ (kt) (7.2.1 — Rwi Hence. the expected edge length of an HMBR of a node at layer i is: |A| I. SDJ = ijl J * TD,l-,j3 R . . SC]? = ijl J * Tam- VVe can assume that a sequence of n node splits will split reasonably evenly 2For clarity. equations with a D or C subscript are relevant to discrete or continuous dimensions only. respectively. 121 t. hroughout the d D + (1C dimensions. Each split. will divide the component set of t he HMBR on the relevant dimei'ision into two equal sized eonn'xment subsets. To Obtain 172- nodes at layer i of the CND-tree, the expected number of splits needed is logg ni. As such. at any given time. it is likely that some nodes will be split one Inore time than others. This can be determined as follows: (If, 2 [(log-Zni) mod ((ID + dC-‘ll: I , ,, (5.10) (1,- = ((11) + (1C) — di, 1 N . . . . Where (1?. represents those dimensions that have been spht an extra. time. If we assume these splits have been distributed evenly among continuous and discrete dimensions. we have the following: II II dD-d. I ll dC'di C' [mi (511) [I (ID-d. ' (‘Dri — [dB-ta l SO, the HMBR of a node at. layer i has the following expected edge length on d” I d-‘lld d dimensions respectively: 122 ll SDH s- . 
: ___—- L DJ log) n ’ 21351751 II 5C H SCJ _ log 2 n, ' lcY—D +(Cl 513,- 7%?— 2 ’ 5 ”DJ _ ——1————{D H otherwise ’ og) n] 2(1)+(C"I (5.12) lotJr g9 7) <1 sC I ifmév< 5c H l()(T——)—- 71" D-i-(ICl s - — (“.2 otherwise Thus. for any node, the probability for a component of a. query vector to be covered by the corresponding component of the HMBR of that node for discrete and continuous dimensions. resI’)ectively. is given as follows: I I 30.7” BDJ — my I/ N SD-z' B . z “ D.2 A ’ I l, I (5.13) / 8027 Bar : I? , II II 5'02“ Bar 2 R ' Usitlg Equation 5.13. the probability for a node to be accessed by a query of range / distance It can be evaluated recursively as follows: h 5f mf Pf PM, = Z Z 2 f * Z 9 ’ (514) [C20 3:30 17127710 1)po 123 vvh ere so = max((). k — dMl~ sf = min(k. dm ). 7720 1nax(0, s — (1172.1)- 'mf = min(s, (1mg). p0 = max(0. k. — s — (1AM ) pf = min(k —— 8 (131-2). 1" (1'. ~ ~11: . . _1, d —s+m L_ . f : (('iii2)Bni212 (1 _ B1712)?” * (£13711) Buii21 (1 — 87711)9 171, - __ (1 d]\['2_p - (1‘1 dAIl—k-i—S-f-p k_ _, 9 — ( ‘lpmlBM‘z (1 '— 3M2)” * (k;3_1p)BM1 (1— Ban) 8 p- Thus. the expected number of node/ page accesses for performing a query with search range 7‘ can be estimated as: H—1 10 = 1+ 201,- * Pi.rl- (5.15) i=0 5.4 Experimental Results OUT [C—N N searching algorithm was implemented using a CND-tree, an ND-tree, an R*‘tFG-e, and a linear scan. For our experiments. the linear scan is considered twice: once VVith native non-ordered discrete dimensions and once with native continuous dimensions. Both the ND—tree and the R*-tree were modified to store hybrid data in their 1Eaves. This modification affects the shape of each of these trees but does not 124 incur a change in either of their insertion/split algorithms.3 All experiments were ran on a PC under OS Linux Ubuntu. The 1/0 block size was set at 4K for all trees. Two series of datasets were used. The first consists of 1M vectors with six native dimensions and a variable number of non-native dimensions (1 - 6). The second set has six native dimensions and three non-native dimensions and a variable number of vectors.4 Each experimental data reported here is the average over 100 random queries with A: = 10. 5.4.1 Effects of Heuristics and Datasets The first set of experiments was ccniducted to show the effects of the dimensionality and database (DB) size on the query performance. Figures 5.4 and 5.5 show this data. As both Figure 5.4 and 5.5 show. the CND-tree provides by far the most promising results in all tests.5 Due to this. the remainder of the experimental data considers only the CND-tree. It should be noted in Figure 5.5 that the ND-tree 3Each leaf node is structured to hold data with native and non-native dimensions. The non- native dimensions play no part in determining the composition of the covering hyper rectangle. However. the extra dimensional information requires more space for each object than what would be needed for native dimensional data only. While this extra space requirement does decreases the amount of objects that a leaf node can contain before overflowing, the extra information maintained negates the need to maintain the search constant d“ (Equation 5.1). However, because no changes have been made to the internal nodes of these trees, Equation 5.2 must. still be observed. 4This configuration is chosen. as it is likely the amount of non-native dimensional information will be significantly smaller than the amount of native dimensional information in most real world scenarios. 
5Note that for the linear scan results. “NatzD” indicates a scan of hybrid data with native non- ordered discrete dimensions and “NatzC” indicates a scan of hybrid data with native continuous dimensions. The datasets used for each of these experiments are identical to those used for the ND-t ree and R*-tree based searching. respectively. 1 ”...—n. _ ll'lu- 14000 l ,, * "c ’ i e r” ’" “‘ 12000 . - -x- -ND-tree +CND-tree m l —a—Ls (Nat=D) §1000O l +R‘-tree g l +1.3 (Nat=C) < 8000 i X .52 o l '5 6000 ‘ b a: ‘E l :5 4000 3 z s i 2000 g L l 0 a; 4 ' Number of Non-Native Dimensions Figure 5.4. Performance 1/0 with variable number of non—native dimensions appears to provide somewhat similar results to the CND-tree as the size of the database increases. Figure 5.6 shows that. when viewed independantly from other search results. the CND—tree still provides significant performance benefits over the ND-tree as the number of vectors in the database increases. The second set of experiments was conducted to show the effectiveness of our search heuristics compared to similar heuristics utilized in a. non-hybrid space (i.e. a continuous or discrete space). Figure 5.7 shows the I/O comparison of the first search stage when searching the CND-tree with and without the use of heuristic H 3. It can clearly be seen that using H3 decreases the number of I/O incurred in the first stage of searching over all dimensional combinations tested. Figure 5.6. Performance I/O with variable database size (CND-tree and ND-tree only) 14000 12000 : 10000 8000 6000 4000 Number of Disk Accesses 2000 - -><- -ND-tree —0- CND-tree —B~— LS (Nat=D) + R'-tree + LS (Nat=C) 400K 800K 1.2M 1.6M 2.0M Number of Vectors in Database Figure 5.5. Performance 1/0 with variable database size 450 400 ‘ 350 250 200 150 Number of Disk Accesses 100 - 50 300 « - -><- -ND-Tree _ ---------- X +CND-Tree , 400K 800K 1.2M 1.6M 2.0M Number of Vectors in Database T‘ifir - 600' 77 -- ,, ~ a v” v 500 f + CND-C with H3 f i + CND-D with H3 55". l 400 l ' <2 ~CND-C without H3 .‘ ‘ - a ~CND-D without H3 ,5 l Num of IIO co 0 o 200 i NN=1 NN=2 NN=3 NN=4 NN=5 NN=6 Num of Non-Native DIM Figure 5.7. Performance I/O comparing ordering methods 5.4.2 Performance Model Verification Our theoretical performance estimation model was verified against our synthetic experimental data. \Ne conducted experiments using data with (ID = 6. dc = 3. .4| 2 6. and R = 100. The maximum number of leaf node objects was set at 227 and the maximum number of non-leaf node objects was set to 127.6 We compared our theoretical estimates to the observed CND-tree performances for databases varying in size from 400K vectors to 2.0111 vectors in 400K increments. Figure 5.8 shows the estimated number of I / O predicted by our performance model 6As described in Qian et. al [43]. the values of 127 and 227 were chosen for similar search structures to maximize search performance. ...—___..— 30% - - — l -l- CND-tree 1 25% l +CND-tree 2 l - 5?" Performance Model 20% l 15% . Percentage of Disk Accesses 10% 1 “xx. 5% 0% L - — 400K 800K 1.2M 1.6M 2.0M Number of Vectors in Database Figure 5.8. Performance model comparison with variable database size 160% * 7 W ’ 1400/ V—i—VVCND-tree 1 ° ’ +CND-tree 2 :X - cx- -Performance Model 120% x 100% l Percentage of Disk Accesses 1 2 3 4 5 6 Number of Non-Native Dimensions Figure 5.9. Performance model comparison with variable number of non-native di- mensions 129 as well as the actual observed 1/0 as a percentage of the I/O incurred by a linear scan. 
Two lines are drawn for the observed I/O when searching the CND-tree. CND-tree 1 represents the percentage of I/O observed when the search is stopped as soon as stage 2 (overlap) has ended. CND-tree 2 represents the percentage of I/O observed when the search is stopped as soon as stage 3 (exhaustive) has ended. The Performance Model line represents the predicted I/O percentage when z = r − 1. As shown in Figure 5.8, our theoretical performance model does a quite accurate job of predicting the number of I/Os that would be incurred by using our algorithm on databases of varying sizes.

We also performed a comparison of our theoretical performance model and the observed I/O when varying the number of continuous dimensions present in the CND-tree. For this set of experiments, d_D = 6, |A| = 6, R = 100, and d_C varies from 1 to 6. The number of vectors in the database is 1M and the maximum numbers of leaf node and non-leaf node objects are again set at 227 and 127, respectively. Figure 5.9 shows the estimated number of I/Os predicted by our performance model as well as the actual observed I/O of these experiments, again as a percentage of the I/O incurred by a linear scan. Again, two lines are drawn for the CND-tree to represent the number of I/Os at the end of the second and third search stages. As shown in Figure 5.9, our performance model does a very good job of predicting the number of I/Os for performing a k-NN search when the number of continuous dimensions is less than or equal to the number of discrete dimensions. However, as the number of continuous dimensions grows, the observed CND-tree results begin to outperform the theoretical results predicted by our performance model. We believe this phenomenon may be related to the discretizing of the continuous dimensions by Equation 2.4.

Footnote 7: The exhaustive stage occurs when searching nodes R whose bounding box is the same distance from the query point as the final search range. This stage introduces no closer neighbors than what have already been found, but may be required if multiple sets of k objects form valid solution sets, typical of k-NN searches in discrete databases as discussed in [32, 33].

CHAPTER 6

Conclusion

Similarity searches in NDDSs and HDSs are becoming increasingly important in application areas such as bioinformatics, biometrics, E-commerce and data mining. Unfortunately, the prevalent searching techniques based on multidimensional indexes such as the R-tree and the K-D-B tree in CDSs cannot be directly applied to either NDDSs or HDSs. On the other hand, recent work [42, 43, 44, 41, 15] on similarity searching in NDDSs and HDSs focused on range queries. Nearest neighbor searches were not developed to the same extent. In particular, k-nearest neighbor searching is still an open issue. We observe that the issue of k-NN searching in NDDSs and HDSs is not a simple extension of its counterpart in CDSs.

A major problem with a k-NN search in NDDSs using the conventional Hamming distance is the non-determinism of its solution. That is, there usually is a large number of candidate solutions available. This is mainly caused by the coarse granularity of measurement offered by the Hamming distance. To tackle this problem,
it is proven that the GEH distance possesses the triangular property and therefore may be used in index based pruning heuristics. To support efficient Af-NN searches in NDDSs. we propose a searching algorithm utilizing the ND—tree [42, 43]. Based on the characteristics of NDDSS. three effec- tive searching heuristics are incorporated into the algorithm. A fourth heuristic is provided that implements a new strategy for probability based search ordering in conservative search scenarios. Further. we provide a. performance model to predict the number of I/O incurred during a k-NN search using our algorithm that is based upon the number of neighbors desired and the din'iensionality and alphabet size of the dat aset. Our extensive ex1,)eriments demonstrate that our GEH distance measure provides an effective semantic discriminating power among the vectors to mitigate the non— determinism for k-NN searches in NDDSs. Experiments also show that the lit-NN searching algorithm is efficient in finding k—NNs in NDDSs. compared to the linear scan method. The algorithm is scalable with respect to the database size and also 133 WT performs well over non-uniform data distributions. However. when the number of dimensions is high. our algoritlnn seems to suffer the same dimensionality curse problem as the similar techniques in continuous data spaces. \V'e have demonstrated that k-NN searches in HDSs greatly benefit from the use of a multidimensional index t'lt‘weloped for such a space. As discussed in Chapter 5. the use of HDS based index eliminates the need for maintaining costly search constants to guarantee correct results. Further. our experimental results confirm that the use of a. HDS index vastly outperforms searches utilizing non-hyln'id space indexes. sometimes by a factor of 10 to 1. as shown in Section 5.4. \\'e have also shown that the use of our newly introduced searching heuristic provides excellent benefits in reducing the I/O costs associated with the first stage of If-NN searcl‘iii'tg. Our experimental results in Section 5.4 demonstrate a performance benefit of almost 33%. Additionally. we have presented a theoretical 1')erformance model that accurately predicts the I/O cost of performing a k—NN search using our algoritlnn presented in Section 5.3 Our future work involves the study of the underlying clnu‘acteristics of NDDSs and HDSs that may be applied to optimizing data index structures for such spaces. Similar to [52]. such a study could provide an optimization in index structure con- struction by determining the relationship between the dimensionality of a dataset and the estimated performance for data retrieval. Additionally. search performance in 134 ...“.l' 1'... ' ,P NDDSS and HDSs are also affected by the cardinality of the alphabet of the dataset.. \Vliile much work been reported on understanding these relationships and optimiz- ing an index structure through that kn(,)wledge for CDSs. it remains an open issue for NDDSS and HDSs. Atilditionally. we will continue to investigate the underlying characteristics of NDDSs and HDSs that can be used in future search heuristics. Fi- nally. our theoretical performance model assumes a uniform node splitting policy of the underlying index structure. We would like to expand upon this to accommodate more potential node split policies. APPENDIX A Intrinsic Dimensionality in Non-Ordered Discrete Data Spaces In this appendix. we present a discussion of the effects of intrinsic dimensionality in NDDSs. 
We first provide an overview of the concepts of this topic, followed by a discussion of the differences in NDDS and CDS dataset distribution. We then discuss the effect of intrinsic dimensionality on search performance when using the GEH distance. This appendix provides additional rationale for the development of the rank based GEH implementation from Chapter 4.

A.1 Overview

In both NDDSs and CDSs, designers of search techniques are hampered by the curse of dimensionality [32, 52]. As discussed by Chavez et al. [14], traditional indexing techniques for vector spaces (e.g., the kd-tree and R*-tree) have an exponential dependency on the representational dimension of the data space. Many recent indexing techniques attempt to avoid the problems associated with this relationship by removing the representational dimension of the space. A common technique is to translate the vector space data to a generic metric space. Unfortunately, these techniques are unable to completely resolve the curse of dimensionality due to the existence of the intrinsic dimensionality of the dataset. Chavez et al. [14] defined the intrinsic dimensionality ρ of a dataset by the quotient of the mean μ and variance σ² of a histogram of distance values between objects in the dataset, as shown below:

ρ = μ² / (2σ²).    (A.1)

Equation A.1 indicates that a dataset's intrinsic dimensionality grows in line with the mean and inversely with the variance of distance values. The result of this equation may be used to indicate the performance potential of searching within a dataset. As shown in [14], the potential performance for searching a dataset is inversely proportional to the intrinsic dimensionality of the dataset.
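As a small illustration of Equation A.1, the sketch below (Python; the dataset and distance function are placeholders) estimates ρ from a sample of pairwise distances. Sampling is used only to keep the example short; the definition itself is over the full histogram of pairwise distances.

    import random

    def hamming(u, v):
        """Hamming distance between two equal-length discrete vectors."""
        return sum(1 for a, b in zip(u, v) if a != b)

    def intrinsic_dimensionality(vectors, dist=hamming, samples=10000, seed=0):
        """Estimate rho = mu^2 / (2 * sigma^2) of Equation A.1 from sampled pairs."""
        rng = random.Random(seed)
        dists = []
        for _ in range(samples):
            u, v = rng.sample(vectors, 2)
            dists.append(dist(u, v))
        n = len(dists)
        mu = sum(dists) / n
        var = sum((x - mu) ** 2 for x in dists) / n
        return mu * mu / (2 * var) if var > 0 else float("inf")

A larger ρ (high mean, low spread of distances) signals a dataset in which distance values discriminate poorly between near and far objects, which is the situation the following histograms illustrate for the Hamming distance.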
Figure A.1. Histogram of Hamming distance values (number of occurrences versus Hamming distance) for 10-dimensional datasets of 10K vectors with alphabet sizes 6, 8, 10, and 12.

A.2 Distribution of NDDS Datasets

The histogram of distances between points in either a CDS or a generic metric space will usually result in a semi-continuous curve. As shown in [14], the mean for such histograms in either space is likely to trend equally toward either end of the available range. For an NDDS, this no longer holds. Consider Figure A.1, which shows the histogram of Hamming distance values for a 10-dimensional NDDS of 10K vectors with variable alphabet sizes. We notice three points. First, the mean for each dataset in Figure A.1 appears to be highly dependent upon the cardinality of the alphabet set, such that as the alphabet size grows larger, so does the mean of the distance values. Second, the distance values between points appear to clump disproportionately toward the high end of the distance range. Third, the possible distance values between points are restricted due to the use of the Hamming distance.

Each of these points agrees with our understanding of NDDSs discussed in Chapters 3 and 4. In particular, we note that in an NDDS there is no defined center of the data space. Thus, available distance metrics, such as Hamming or GEH, have difficulty considering point distance relationships. As shown in [33], the number of points likely to exist within an integer distance z of another point grows exponentially with the alphabet size (this value was described as a hyper-spherical area in [33]), such that:

    Area(z) = Σ_{i=0}^{z} C(d, i) (|A| − 1)^i,    (A.2)

where C(d, i) denotes the binomial coefficient. Additionally, in [33] we showed that the probability of a point existing within distance z of another point in an NDDS increases dramatically as z increases:

    P_dist(z) = [ Σ_{i=0}^{z} C(d, i) (|A| − 1)^i ] / |A|^d.    (A.3)

Equation A.2 explains the dependence between the mean of the distance values and the alphabet size of the dataset. Equation A.3 explains the disproportionate amount of distance values between points in an NDDS close to d (as seen in Figure A.1). The third point is easily explained by the discussion of the non-granularity of the Hamming distance in Section 4.2.
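To make Equations A.2 and A.3 concrete, the sketch below counts the points of a d-dimensional NDDS with alphabet size a that lie within Hamming distance z of a fixed vector, and the corresponding probability for a uniformly random vector. The function names are chosen for this example only.

    from math import comb

    def hyper_spherical_area(d, a, z):
        # Equation A.2: number of vectors within Hamming distance z of a
        # fixed vector in a d-dimensional space over an alphabet of size a.
        return sum(comb(d, i) * (a - 1) ** i for i in range(z + 1))

    def p_dist(d, a, z):
        # Equation A.3: probability that a uniformly random vector lies
        # within Hamming distance z of a fixed vector.
        return hyper_spherical_area(d, a, z) / a ** d

    # For d = 10 and a = 10, most of the space lies near distance d:
    # p_dist(10, 10, 8) is roughly 0.26, while p_dist(10, 10, 9) is about 0.65.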
When we again consider the distances between points in an NDDS, but use the GEH distance metric presented in Section 4.2 instead of the Hamming distance metric, we obtain the histogram shown in Figure A.2 (we have used a scatter plot instead of a column plot to help illustrate the differences with Figure A.1 more clearly). We notice that the restriction on distance values has been greatly reduced. However, we still do not have a continuous curve as would be expected in a CDS or generic metric space. Instead, we see local peaks between each integer distance value, each with its own local mean and variance.

A.3 Distribution Effects on Search Performance

We define the local variance for an integer i as the variance of the distance values falling in [i, i + 1). A local mean may be defined in the same manner. To account for these local values, we modify the performance indicator given earlier, distinguishing a global term ρ_0 from an averaged local term ρ_1:

    ρ_0 = μ² / (2σ²),    ρ_1 = (1/d) Σ_{i=0}^{d−1} μ_i² / (2σ_i²),    (A.4)

where μ_i and σ_i² are the local mean and variance for the interval [i, i + 1).

Figure A.2. Histogram of GEH distance values (number of occurrences versus GEH distance) for various alphabet sizes.

The value of ρ_1 in Equation A.4 captures the expected performance between each integer value. This value may be interpreted as the average performance indicator based on the normalized local means and variances of a dataset. When using an extension to the Hamming distance, such as GEH, this value gives an indication of the likely number of pathways that may be pruned during the refinement stages of a k-NN search. However, because one of the goals when extending the Hamming distance is to retain its original semantics, the value of ρ_1 is less dominant in determining the overall performance measure than the value of ρ_0. This is because the effects of a Hamming extension typically only become a factor when comparing distances in the same integer range.

To help illustrate this point, consider a random query q for a 10-dimensional dataset U. In this case, Figure A.2 can represent the distribution of distances between q and a possible pivot point p ∈ U (in our case, p is a DMBR), for various alphabet sizes. Assuming our distance metric D() maintains the triangular inequality property, we can eliminate from our search any point u such that D(p, u) ∉ [D(p, q) − r, D(p, q) + r], where r is the current search radius [14]. As the variance of distances between integer values increases, more points (search paths) may be discarded when searching within that range. However, this increase in variance has no effect on the number of points that may be pruned outside the current integer range.

A.4 Experimental Results

We have compared the distance distribution histograms of the ranking-based GEH distance implementation from Section 4.2 with those of the frequency-based GEH distance implementation from [32] over datasets of various zipf distributions. In these instances, all data are from 10-dimensional datasets of 10K vectors with |A| = 10.

Figure A.3. GEH_Rank distance histograms, zipf = [0.0 – 1.5].

Figure A.4. GEH_Freq distance histograms, zipf = [0.0 – 1.5].

Comparing Figures A.4 and A.3, we see that the ranking-based implementation shows large improvements, in terms of distribution characteristics, over the frequency-based distances as the zipf level increases. This indicates that as the underlying dataset becomes more skewed, the performance benefits of using the ranking-based implementation of the GEH distance over the frequency-based implementation will become greater as well. This agrees with the search performance results from Chapter 4.
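The preceding sections lean on the way an extended Hamming distance such as GEH spreads distance values between integer boundaries. The following minimal sketch shows one such distance; it is not the exact GEH definition from Chapter 4. It assumes that each matching dimension contributes an adjustment in [0, 1) scaled by 1/d, so the integer semantics of the Hamming distance are preserved, and it includes one illustrative, rank-style way of assigning those adjustments.

    def extended_hamming(u, v, adj):
        # Hamming distance plus a fractional penalty for each matching
        # dimension.  adj(value) returns an adjustment in [0, 1); dividing
        # by d keeps the total added for all matches below 1, so the
        # integer part of the result is still the Hamming distance.
        d = len(u)
        dist = 0.0
        for a, b in zip(u, v):
            dist += 1.0 if a != b else adj(a) / d
        return dist

    def make_rank_adjustment(frequencies):
        # One possible adjustment scheme: rank values by how often they
        # occur, so matches on rare values are treated as slightly closer.
        ordered = sorted(frequencies, key=frequencies.get)     # rarest first
        rank = {value: i for i, value in enumerate(ordered)}
        n = max(len(ordered), 1)
        return lambda value: rank.get(value, n - 1) / n        # in [0, 1)

Whether rarer matches should receive smaller or larger adjustments, and how ranks or frequencies are obtained, are application choices; the point here is only that each per-dimension adjustment stays strictly below one.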
APPENDIX B

Triangular Property of GEH - Extended Proof

To maintain the inherent mathematical correctness associated with building and searching index trees in NDDSs, our newly introduced GEH distance must maintain the triangular property. In Chapter 4, we presented a short proof of this property. In this appendix, we provide a second proof in extended form.

Definition 4. Triangular Property for Vectors: For any three vectors V_A, V_B, and V_C, the addition of the distances between any two pairs is greater than or equal to the distance between the third pair, namely:

    D(V_A, V_B) + D(V_B, V_C) ≥ D(V_A, V_C).    (B.1)

The long-form proof that the GEH distance maintains this inequality is handled in two steps: first, a base case is established where the property holds; second, the base case is evolved into the generic case.

Step 1: Consider three d-dimensional vectors V_A, V_B, and V_C. Assume that D(V_A, V_C) is a maximal value; thus every element in V_A and V_C must be different, or:

    D(V_A, V_C) = d.

Next assume D(V_A, V_B) is a minimal value, where every element in V_B equals the corresponding element in V_A, i.e., V_A = V_B. Using the GEH distance yields:

    D(V_A, V_B) = x, where 0 < x < 1.

The term x represents the adjustment values obtained using some method defined by an application expert (see Chapter 4). Because V_A = V_B, the distance between V_B and V_C is the following:

    D(V_B, V_C) = d.

Thus we have the following inequality:

    D(V_A, V_B) + D(V_B, V_C) ≥ D(V_A, V_C)  ⟹  x + d ≥ d.

Step 2: The second step is divided into three sub-steps, one for each vector V_A, V_B, and V_C.

Step 2.1: First, we evolve V_B into a generic vector. From Step 1, we have the following distance values:

    D(V_A, V_B) = x
    D(V_B, V_C) = d
    D(V_A, V_C) = d,

where 0 < x < 1. To make V_B generic, we apply k changes, where each change represents switching an element in V_B away from its original value. After this has been done, we are left with the following distances:

    D(V_A, V_B) = x + k − c_1
    D(V_B, V_C) = d − k_c + c_2
    D(V_A, V_C) = d.

Here, c_1 represents the culmination of the adjustment values from each of the k elements switched, k_c represents the number of switched elements that now equal their corresponding element in V_C, and c_2 represents the culmination of the adjustment values to be added due to these newly matching values. Because k ≥ k_c, k ≥ c_1, and c_2 ≥ 0, the GEH distance between V_A, V_B, and V_C still maintains the inequality from Definition 4.

Step 2.2: Next, we evolve V_C into a generic vector. We start with the final vectors from Step 2.1 and apply j changes to V_C. We now have the following distance values:

    D(V_A, V_B) = x + k − c_1
    D(V_B, V_C) = d − k_c + c_2 + (j_1 − c_3) − (j_2 − c_4)
    D(V_A, V_C) = d − j_3 + c_5.

Here, j_1 and j_2 represent the integer values that D(V_B, V_C) increases and decreases by, respectively, as elements are switched; c_3 and c_4 represent the adjustment values due to those changes; and j_3 and c_5 represent the integer and adjustment changes to D(V_A, V_C) due to element changes. It is important to note that every time j_2 is incremented there are two possibilities: either the value being switched in V_C becomes a value in V_B that still matches V_A, in which case j_3 is incremented by one and both c_4 and c_5 are incremented by the same amount, or it becomes a value in V_B that does not match V_A, which means that k ≥ k_c + 1. This leaves us to note that j_2 + j_3 ≤ j_1 + k and that c_4 ≥ c_5. Finally, with j_1 ≥ c_3, it can be shown that these distance measures still maintain the triangular inequality property from Definition 4.

Step 2.3: Finally, we evolve V_A into a generic vector. Because of our initial conditions, this is actually a trivial step. Since V_A is only defined in relation to the original vectors V_C and V_B, and since V_C and V_B can be manipulated into any general vectors from their starting points, we can start V_A as any vector we wish. Thus V_A is a generic vector and the triangular inequality holds true for any three vectors V_A, V_B, and V_C.

APPENDIX C

MinMaxDistance Discussion

Much of the work presented in this dissertation has focused upon improving search performance in terms of reducing the number of I/Os required to perform a search. In this appendix, we prove that many search algorithms may be improved in computational performance by removing the MINMAXDIST heuristic while suffering no loss of I/O performance.

C.1 Overview

For the purposes of this discussion we define the MINDIST for a tree node N and a query point q as follows:

    MINDIST_N(q) = min_{N_S} D(q, N_S),    (C.1)

where N_S represents a subtree/object of N and D is a valid distance metric for the data space being used. We define the MINMAXDIST in a similar manner:

    MINMAXDIST_N(q) = min_{N_S} ( max D(q, N_S) ).    (C.2)

Using Equations C.1 and C.2, the MINDIST pruning, MINMAXDIST range reduction, and MINDIST ordering heuristics, hereafter referred to as H1, H2, and H3 respectively, are as follows:

H1: For all subtrees N_S of N, remove/prune any subtree whose MINDIST value to q is greater than the current search range.

H2: If the MINMAXDIST of a subtree N_S of node N is less than the current search range, reduce the current search range such that it equals the value of the MINMAXDIST of that subtree.

H3: Order those subtrees N_S not pruned by heuristic H1 in increasing order of their MINDIST value to q.

For the remainder of this discussion we will hold the following assumptions to be true:

Assumption 1. MINDIST node ordering is being used.

Assumption 2. MINDIST node pruning is being used.

Assumption 3. A depth-first search strategy is being employed.

A schematic traversal that follows these assumptions is sketched below.
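The sketch is not the search algorithm developed in this dissertation; it is a generic depth-first k-NN traversal written against an assumed node interface (children, objects) and assumed callbacks mindist and dist, and it applies only H3 ordering and H1 pruning, in line with Assumptions 1 through 3.

    import heapq, itertools

    def knn_depth_first(root, q, k, mindist, dist):
        # best is a max-heap (by negated distance) of the k candidates found
        # so far; its top defines the current search range r.
        best = []
        seq = itertools.count()   # tie-breaker so heap entries never compare objects

        def search_range():
            return -best[0][0] if len(best) == k else float("inf")

        def visit(node):
            if node.children is None:                    # leaf node
                for obj in node.objects:
                    d = dist(q, obj)
                    if d < search_range():
                        heapq.heappush(best, (-d, next(seq), obj))
                        if len(best) > k:
                            heapq.heappop(best)
            else:                                        # non-leaf node
                # H3: consider subtrees in increasing MINDIST order.
                for child in sorted(node.children, key=lambda c: mindist(q, c)):
                    # H1: prune subtrees whose MINDIST exceeds the range.
                    if mindist(q, child) <= search_range():
                        visit(child)

        visit(root)
        results = [(-neg, obj) for neg, _, obj in best]
        results.sort(key=lambda pair: pair[0])
        return results

Note that the search range shrinks only as actual objects are found; adding an H2-style range reduction at each non-leaf node is exactly the step that the proof below argues provides no I/O benefit under these assumptions.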
C.2 Proof

Consider a d-dimensional non-leaf node N that represents the local root for a branch in an index tree. The search range of the k-Nearest Neighbor search is represented by r. If heuristic H2 is employed, the search range is updated by the MINMAXDIST values of the subtrees of N as follows:

    r = min(r_0, MINMAXDIST_N(q)),    (C.3)

where r_0 acts as a placeholder for the search range before it is updated. For future use we will label the minimal MINMAXDIST value of the subtrees of node N as D_MM. The subtrees of N can be categorized into three groups, N_S1, N_S2, and N_S3, where the following holds true:

    0 ≤ MINDIST(N_S1) ≤ r
    r < MINDIST(N_S2) ≤ r_0    (C.4)
    r_0 < MINDIST(N_S3) ≤ D_MAX,

where D_MAX represents the maximum distance value possible between a subtree/object and a query point for the data space being used.

Lemma C.2.1. Of the three categories of subtrees, a k-Nearest Neighbor search will access these groups in the following order: N_S1 first, N_S2 second, and N_S3 third.

Proof. This is a result of Assumption 1.

Lemma C.2.2. The subtree of N with the minimum MINMAXDIST value is contained in the subtree group N_S1.

Proof. Equation C.3 indicates that this particular subtree will be used to set the value of r when heuristic H2 is employed. This subtree will be in N_S1 due to its MINDIST value being less than or equal to its MINMAXDIST value.

Lemma C.2.3. If heuristic H2 is employed, heuristic H1 will prune subtree groups N_S2 and N_S3, due to r = D_MM.

Proof. Heuristic H2 and Equation C.3 indicate that the updated search range will be equal to r. Equation C.4 classifies subtree groups N_S2 and N_S3 as having a MINDIST value greater than r; they will thus be pruned by heuristic H1.

Lemma C.2.4. If heuristic H2 is not employed, heuristic H1 will prune subtree group N_S3.

Proof. Similar to the preceding proof, we are only guaranteed that the current search range will be equal to r_0. Equation C.4 only classifies subtree group N_S3 as having a MINDIST value greater than r_0, and it is thus the only group guaranteed to be pruned by heuristic H1.

Lemma C.2.5. If heuristic H2 is not employed, the value of r will be less than or equal to D_MM before visiting any sub-nodes in group N_S2.

Proof. Due to Assumptions 1 and 3, the search algorithm will visit the subtrees of group N_S1 before returning to the current node N and considering subtrees in the groups N_S2 and N_S3. According to Lemma C.2.2, the subtree with MINMAXDIST value D_MM is contained in subtree group N_S1. Thus the search is guaranteed to visit an object with a distance value less than or equal to D_MM before returning to N.

Lemma C.2.6. Heuristic H2 provides no I/O benefits when Assumptions 1 through 3 are true.

Proof. Lemma C.2.5 indicates that the value of r will be less than or equal to D_MM before the search algorithm considers visiting any subtrees from group N_S2. Thus heuristic H1 will prune these subtrees before they are visited regardless of whether heuristic H2 is employed.
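As a concrete illustration of Lemma C.2.6, the sketch below tallies which subtrees of a single node are visited with and without the H2 range reduction. The subtree tuples and distance values are invented for the example, and each visit is assumed to immediately surface that subtree's best object, as Lemma C.2.5 guarantees for group N_S1.

    def visited_subtrees(subtrees, r0, use_h2):
        # subtrees: (name, mindist, minmaxdist, best_object_distance) tuples.
        # Applies H3 (MINDIST ordering), H1 (MINDIST pruning) and, when
        # use_h2 is True, the H2 range reduction of Equation C.3.
        r = r0
        if use_h2:
            r = min(r, min(mm for _, _, mm, _ in subtrees))
        visited = []
        for name, md, _, best in sorted(subtrees, key=lambda s: s[1]):
            if md > r:
                continue                 # pruned by H1
            visited.append(name)
            r = min(r, best)             # range shrinks as objects are found
        return visited

    groups = [("NS1", 1, 3, 2), ("NS2", 4, 9, 5), ("NS3", 8, 9, 9)]
    print(visited_subtrees(groups, r0=6, use_h2=True))    # ['NS1']
    print(visited_subtrees(groups, r0=6, use_h2=False))   # ['NS1']

Both runs touch the same subtrees, matching the claim that H2 can reduce computation only, not I/O, once MINDIST ordering, MINDIST pruning, and a depth-first traversal are in place.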
BIBLIOGRAPHY

[1] Y. A. Aslandogan and C. T. Yu. Techniques and systems for image and video retrieval. IEEE TKDE, 11:50–63, 1999.

[2] Ricardo A. Baeza-Yates. Searching: An algorithmic tour. In Encyclopedia of Computer Science and Technology, Vol. 37, pages 331–359, New York, New York, 1997. CRC Press.

[3] Ricardo A. Baeza-Yates, Walter Cunto, Udi Manber, and Sun Wu. Proximity matching using fixed-queries trees. In CPM '94: Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching, pages 198–212, London, UK, 1994. Springer-Verlag.

[4] L. Baoli, L. Qin, and Y. Shiwen. An adaptive k-nearest neighbor text categorization strategy. ACM Transactions on Asian Language Information Processing, 13:215–226, 2004.

[5] R. Bayer and K. Unterauer. Prefix B-trees. ACM Transactions on Database Systems, 2:11–26, 1977.

[6] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. The R*-tree: An efficient and robust access method for points and rectangles. In Hector Garcia-Molina and H. V. Jagadish, editors, Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, Atlantic City, NJ, May 23-25, 1990, pages 322–331, Atlantic City, NJ, U.S.A., 1990. ACM Press.

[7] J. L. Bentley. Multidimensional binary search trees in database applications. IEEE Trans. Softw. Eng., 5(4):333–340, 1979.

[8] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, 18(9):509–517, 1975.

[9] Jon Louis Bentley and Jerome H. Friedman. Data structures for range searching. ACM Comput. Surv., 11(4):397–409, 1979.

[10] Stefan Berchtold, Daniel A. Keim, and Hans-Peter Kriegel. The X-tree: An index structure for high-dimensional data. In T. M. Vijayaraman, Alejandro P. Buchmann, C. Mohan, and Nandlal L. Sarda, editors, Proceedings of the 22nd International Conference on Very Large Databases, pages 28–39, San Francisco, U.S.A., 1996. Morgan Kaufmann Publishers.

[11] Sergey Brin. Near neighbor search in large metric spaces. In VLDB '95: Proceedings of the 21st International Conference on Very Large Data Bases, pages 574–584, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.

[12] W. A. Burkhard and R. M. Keller. Some approaches to best-match file searching. Commun. ACM, 16(4):230–236, 1973.

[13] J. Catlett. On changing continuous attributes into ordered discrete attributes. In Proceedings of the European Working Session on Machine Learning, pages 164–178, 1991.

[14] Edgar Chavez, Gonzalo Navarro, Ricardo Baeza-Yates, and Jose Luis Marroquin. Searching in metric spaces. ACM Computing Surveys, 33(3):273–321, 2001.

[15] Changqing Chen, Sakti Pramanik, Qiang Zhu, Alok Watve, and Gang Qian. The C-ND tree: A multidimensional index for hybrid continuous and non-ordered discrete data spaces. In Proceedings of EDBT, 2009.

[16] Paolo Ciaccia, Marco Patella, and Pavel Zezula. M-tree: An efficient access method for similarity search in metric spaces. In VLDB '97: Proceedings of the 23rd International Conference on Very Large Data Bases, pages 426–435, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.

[17] Kenneth L. Clarkson. Nearest neighbor queries in metric spaces. In STOC '97: Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, pages 609–617, New York, NY, USA, 1997. ACM.

[18] J. Clement, P. Flajolet, and B. Vallee. Dynamic sources in information theory: A general analysis of trie structures. Algorithmica, 29, 2001.

[19] P. Ferragina and R. Grossi. The string B-tree: A new data structure for string search in external memory and its applications. Journal of the ACM, 46:236–280, 1999.

[20] A. Freitas. A Survey of Evolutionary Algorithms for Data Mining and Knowledge Discovery. ACM, 2003.

[21] Volker Gaede and Oliver Gunther. Multidimensional access methods. ACM Computing Surveys, 30:170–231, 1998.
[22] Antonin Guttman. R-trees: A dynamic index structure for spatial searching. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.

[23] R. Hamming. Error-detecting and error-correcting codes. Bell System Technical Journal, 29(2):147–160, 1950.

[24] A. Henrich, H. W. Six, and P. Widmayer. The LSD tree: Spatial access to multidimensional and non-point objects. In VLDB '89: Proceedings of the 15th International Conference on Very Large Data Bases, pages 45–53, San Francisco, CA, USA, 1989. Morgan Kaufmann Publishers Inc.

[25] Andreas Henrich. The LSDh-tree: An access structure for feature vectors. In ICDE '98: Proceedings of the Fourteenth International Conference on Data Engineering, pages 362–369, Washington, DC, USA, 1998. IEEE Computer Society.

[26] G. Hjaltason and H. Samet. Incremental similarity search in multimedia databases, 2000.

[27] G. Hjaltason and H. Samet. Index-driven similarity search in metric spaces. ACM Transactions on Database Systems, 28:517–580, 2003.

[28] Gisli R. Hjaltason and Hanan Samet. Ranking in spatial databases. In SSD '95: Proceedings of the 4th International Symposium on Advances in Spatial Databases, pages 83–95, London, UK, 1995. Springer-Verlag.

[29] W. J. Kent. BLAT - the BLAST-like alignment tool. Genome Research, 12(4):656–664, April 2002.

[30] D. E. Knuth. The Art of Computer Programming, Vol. 3. Addison-Wesley, Reading, MA, USA, 1973.

[31] Mohammad Kolahdouzan and Cyrus Shahabi. Voronoi-based k nearest neighbor search for spatial network databases. In VLDB '04: Proceedings of the Thirtieth International Conference on Very Large Data Bases, pages 840–851, Toronto, Canada, 2004. VLDB Endowment.

[32] Dashiell Kolbe, Qiang Zhu, and Sakti Pramanik. On k-nearest neighbor searching in non-ordered discrete data spaces. In ICDE, pages 426–435, Istanbul, Turkey, 2007. IEEE.

[33] Dashiell Kolbe, Qiang Zhu, and Sakti Pramanik. Efficient k-nearest neighbor searching in non-ordered discrete data spaces. ACM Transactions on Information Systems, 28, 2010.

[34] Dashiell Kolbe, Qiang Zhu, and Sakti Pramanik. Reducing non-determinism of k-NN searching in non-ordered discrete data spaces. Information Processing Letters, 2010.

[35] Flip Korn, Nikolaos Sidiropoulos, Christos Faloutsos, Eliot Siegel, and Zenon Protopapas. Fast nearest neighbor search in medical image databases. In VLDB '96: Proceedings of the 22nd International Conference on Very Large Data Bases, pages 215–226, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc.

[36] O. W. Kwon and J. H. Lee. Web page classification based on k-nearest neighbor approach. In Proceedings of the 5th International Workshop on Information Retrieval with Asian Languages, 2000.

[37] F. Lewis, Gareth J. Hughes, Andrew Rambaut, Anton Pozniak, and Andrew J. Leigh Brown. Episodic sexual transmission of HIV revealed by molecular phylodynamics. PLoS Medicine, 5(3), 2008.

[38] Jinhua Li. Efficient Similarity Search Based on Data Distribution Properties in High Dimension. PhD thesis, Michigan State University, East Lansing, Michigan, United States, 2001.

[39] A. Macskassy, H. Hirsh, A. Banerjee, and A. Dayanik. Converting numerical classification into text classification. Artificial Intelligence, 143(1):51–77, 2003.

[40] Gonzalo Navarro and Ricardo Baeza-Yates. Searching in metric spaces. ACM Computing Surveys, 33:273–321, 2001.

[41] Gang Qian. Principles and applications for supporting similarity queries in non-ordered-discrete and continuous data spaces. PhD thesis, Michigan State University, East Lansing, Michigan, United States, 2004.
[42] Gang Qian, Qiang Zhu, Qiang Xue, and Sakti Pramanik. The ND-tree: A dynamic indexing technique for multidimensional non-ordered discrete data spaces. In VLDB '03: Proceedings of the 29th International Conference on Very Large Data Bases, pages 620–631, Berlin, Germany, 2003. VLDB Endowment.

[43] Gang Qian, Qiang Zhu, Qiang Xue, and Sakti Pramanik. Dynamic indexing for multidimensional non-ordered discrete data spaces using a data-partitioning approach. ACM Trans. Database Syst., 31(2):439–484, 2006.

[44] Gang Qian, Qiang Zhu, Qiang Xue, and Sakti Pramanik. A space-partitioning-based indexing method for multidimensional non-ordered discrete data spaces. ACM Trans. Inf. Syst., 24(1):79–110, 2006.

[45] E. Riloff and L. Hollaar. Text databases and information retrieval. ACM Computing Surveys, 28, 1996.

[46] John T. Robinson. The K-D-B-tree: A search structure for large multidimensional dynamic indexes. In SIGMOD '81: Proceedings of the 1981 ACM SIGMOD International Conference on Management of Data, pages 10–18, New York, NY, USA, 1981. ACM.

[47] Nick Roussopoulos, Stephen Kelley, and Frederic Vincent. Nearest neighbor queries. In Michael J. Carey and Donovan A. Schneider, editors, Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, California, May 22-25, 1995, pages 71–79, San Jose, California, U.S.A., 1995. ACM Press.

[48] Nick Roussopoulos and Daniel Leifker. Direct spatial search on pictorial databases using packed R-trees. SIGMOD Rec., 14(4):17–31, 1985.

[49] Y. Rui, T. S. Huang, and S. Chang. Image retrieval: Current techniques, promising directions, and open issues. J. Visual Communication and Image Representation, 10:39–62, 1999.

[50] Thomas Seidl and Hans-Peter Kriegel. Optimal multi-step k-nearest neighbor search. SIGMOD Rec., 27(2):154–165, 1998.

[51] J. Uhlmann. Implementing metric trees to satisfy general proximity/similarity queries, 1991.

[52] Roger Weber, Hans-Jorg Schek, and Stephen Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB '98: Proceedings of the 24th International Conference on Very Large Data Bases, pages 194–205, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.

[53] D. White and R. Jain. Algorithms and strategies for similarity retrieval, 1996.

[54] Q. Xue, G. Qian, J. R. Cole, and S. Pramanik. Investigation on approximate q-gram matching in genome sequence databases, 2004.

[55] Peter N. Yianilos. Locally lifting the curse of dimensionality for nearest neighbor search (extended abstract). In SODA '00: Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 361–370, Philadelphia, PA, USA, 2000. Society for Industrial and Applied Mathematics.