FASTER ALGORITHMS FOR MACHINE LEARNING PROBLEMS IN HIGH DIMENSION

By

Mingquan Ye

A THESIS

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

Computer Science – Master of Science

2019

ABSTRACT

FASTER ALGORITHMS FOR MACHINE LEARNING PROBLEMS IN HIGH DIMENSION

By

Mingquan Ye

When dealing with high-dimensional datasets, existing machine learning algorithms often become impractical. Fortunately, most real-world data has low intrinsic dimension; for example, data often lies on a low-dimensional manifold or has a low doubling dimension. Inspired by this phenomenon, this thesis improves the time complexities of two fundamental problems in machine learning by using techniques from computational geometry.

In Chapter 2, we propose a bi-criteria approximation algorithm for minimum enclosing ball with outliers and extend it to the outlier recognition problem. By virtue of the "core-set" idea and the Random Gradient Descent Tree, we obtain an efficient algorithm that is linear in the number of points n and the dimensionality d, and that comes with a probability bound. In experiments, compared with several existing outlier recognition algorithms, our method is shown to be efficient and robust across different outlier ratios.

In Chapter 3, we adopt the "doubling dimension" to characterize the intrinsic dimension of a point set. Using the property of doubling dimension, we can approximate the geometric alignment between two point sets by executing existing alignment algorithms on their subsets, which achieves a much smaller time complexity. More importantly, the proposed approximation has a theoretical upper bound and can serve as the preprocessing step of any alignment algorithm.

ACKNOWLEDGEMENTS

Firstly, I want to express my thanks to my parents and my brother for supporting me to go abroad and do research. I also thank my advisor Prof. Hu Ding for leading me to research on computational geometry and theoretical computer science. Thanks to Prof. Eric Torng and Prof. Yiying Tong for serving on my thesis committee. I thank our department secretary Steven Smith for helping me a great deal during my master's study. Finally, I would also like to thank my fellow students Liyang Xie, Kaixiang Lin, Qinhao Zhang, Jun Guo and my friends Qianqian, Zhengfa, Yue Wang and Shuo Wang.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
KEY TO ABBREVIATIONS
CHAPTER 1  INTRODUCTION
CHAPTER 2  A NOVEL GEOMETRIC APPROACH FOR OUTLIER RECOGNITION IN HIGH DIMENSION
  2.1 Introduction
    2.1.1 Related Work
    2.1.2 Preliminaries
  2.2 Our Algorithm and Analyses
    2.2.1 Algorithm for MEB with Outliers
    2.2.2 Parameter Settings and Quality Guarantee
    2.2.3 Complexity Analyses
    2.2.4 Boosting
  2.3 Experiments
    2.3.1 Datasets and Methods to Be Compared
    2.3.2 Random Datasets
    2.3.3 Benchmark Image Datasets
    2.3.4 Comparisons of Time Complexities
  2.4 Summary
CHAPTER 3  GEOMETRIC ALIGNMENT IN LOW DOUBLING DIMENSION
  3.1 Introduction
    3.1.1 Related Work
    3.1.2 Outline of Our Work
  3.2 The Proposed Method
  3.3 Time Complexity Analysis
  3.4 Experiments
    3.4.1 Random dataset
    3.4.2 Real dataset
  3.5 Summary
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: The F1 scores for the high-dimensional random dataset.
Table 2.2: The F1 scores of the four methods on MNIST by using PCA-grayscale; the three columns for each γ correspond to PCA-0.95, PCA-0.5 and PCA-0.1, respectively.
Table 2.3: The F1 scores of the four methods on MNIST by using autoencoder feature.
Table 2.4: The F1 scores of the four methods on Caltech-256 by using VGG net feature.
Table 2.5: The F1 scores of the four methods on Caltech-256 by using PCA-VGG feature; the three columns for each γ correspond to PCA-0.95, PCA-0.5 and PCA-0.1, respectively.
Table 3.1: Random dataset: EMD and running time of different compression levels.
Table 3.2: Linguistic datasets: EMD and running time of different compression levels.
Table 3.3: PPI network datasets: EMD and running time of different compression levels.

LIST OF FIGURES

Figure 2.1: The blue links represent the path from root r to node v, and P^r_v contains the four points along the path. The point set S_v corresponds to the child nodes of v.
Figure 2.2: The illustration of MEB(P^r_v) and MEB(P^r_{v'}); the blue and red points represent the inliers and outliers, respectively.
Figure 2.3: The illustration of our algorithm on a 2-dimensional point set.
Figure 3.1: The illustration of alignment of two geometric patterns.
Figure 3.2: The illustration of EMD between two weighted point sets.
Figure 3.3: This image is excerpted from (Zhang et al. (2017)). After obtaining Chinese and English word embeddings, the problem becomes finding the optimal geometric alignment to minimize their EMD.
Figure 3.4: The illustration of PPI networks alignment (Liu et al. (2017)).
Figure 3.5: Illustration of Algorithm 4.
Figure 3.6: The EMDs over baseline for different noise levels.

LIST OF ALGORITHMS

Algorithm 1: (ε, δ)-approximation Algorithm of Outlier Recognition Problem
Algorithm 2: Approximation Algorithm of MEB
Algorithm 3: 2-approximation k-center clustering
Algorithm 4: Geometric Alignment

KEY TO ABBREVIATIONS

PCA   principal component analysis
MDA   multiple discriminant analysis
SVM   support vector machine
MEB   minimum enclosing ball
RGD   random gradient descent
EMD   earth mover's distance
PPI   protein-protein interaction
PTAS  polynomial-time approximation scheme
ICP   iterative closest point
OP    orthogonal procrustes

CHAPTER 1

INTRODUCTION

In the era of big data, we expect to replace some polynomial time algorithms with linear or even sub-linear time algorithms. Generally speaking, we focus on two aspects: data quantity and data dimensionality.

1. For the former case, intuitively we only employ a part of the dataset by sampling some "important" data points. A relevant concept is called the "core-set". Specifically, given a point set P and a class of functions F, for any function f ∈ F we denote by cost(f, P) the cost of function f on dataset P; then a (weighted) subset S of P is called a core-set of P if for a given ε > 0 and any f ∈ F we have

(1 − ε) · cost(f, P) ≤ cost(f, S) ≤ (1 + ε) · cost(f, P).

For core-sets, we refer readers to several surveys (Phillips (2016); Clarkson (2010); Munteanu and Schwiegelshohn (2018)).

2. For the second case, dimension reduction is the most commonly used technique in practice. There are many classical methods from different fields including machine learning, statistics and theoretical computer science, such as principal component analysis (PCA), multiple discriminant analysis (MDA) and the Johnson-Lindenstrauss lemma (Achlioptas (2003)). In practice, the choice of method is determined by the specific problem. For a survey, we refer readers to (Song et al. (2018)).

In this thesis, we concentrate on two machine learning problems in high dimension. Our motivation is the observation that many real-world datasets often have low intrinsic dimensions (Belkin (2003)). For example, the core-set size of minimum enclosing ball is ⌈1/ε⌉ (Badoiu and Clarkson (2003)), which is independent of the number of points n and the dimensionality d; in computational geometry, the intrinsic dimension of a metric space is often characterized by the concept of "doubling dimension", which is also independent of the dimensionality of the space. Based on this key observation, in this thesis we improve the time complexities of two important problems in machine learning. Specifically, the contributions of this thesis are summarized as follows.

1. We present a novel geometric approach for outlier recognition. An approximate minimum enclosing ball is built to cover the inliers but exclude the outliers.
However, the existence of outliers makes the problem to be not only non-convex but also highly combinatorial, and the high dimensionality makes the problem more difficult. To tackle these challenges, we develop a randomized algorithmic framework using the core-set idea. Comparing with existing methods for outlier recognition, our method has a quality guarantee and lower time complexity. 2. We are the first to develop efficient algorithmic framework for the geometric alignment problem in large-scale and high dimension. By virtue of the property of doubling dimension, we prove that the given geometric patterns with low doubling dimension can be substantially compressed and thus save a lot running time when computing alignment. More importantly, our method is an independent step, therefore it can serve as the preprocessing for various alignment algorithms. 2 CHAPTER 2 A NOVEL GEOMETRIC APPROACH FOR OUTLIER RECOGNITION IN HIGH DIMENSION 2.1 Introduction In this big data era, we are confronted with an extremely large amount of data and it is important to develop efficient algorithmic techniques to handle the arising realistic issues. Due to its recent rapid development, deep learning (Hinton et al. (2012)) becomes a powerful tool for many emerging applications; meanwhile, the quality of training dataset often plays a key role and seriously affects the final learning result. For example, we can collect tremendous data (e.g., texts or images) through the internet, however, the obtained dataset often contains a significant amount of outliers. Since manually removing outliers will be very costly, it is very necessary to develop some efficient algorithms for recognizing outliers automatically in many scenarios. Outlier recognition is a typical unsupervised learning problem and its counterpart in supervised learning is usually called anomaly detection (Tan, Steinbach, and Kumar (2006)). In anomaly detection, the given training data are always positive and the task is to generate a model to depict the positive samples. Therefore, any new data can be distinguished to be positive or negative (i.e., anomaly) based on the obtained model. Several existing methods, especially for image data, include autoencoder (Sakurada and Yairi (2014)) and sparse coding (Lu, Shi, and Jia (2013)). Unlike anomaly detection, the given data for outlier recognition are unlabeled; thus we can only model it as an optimization problem based on some reasonable assumption in practice. For instance, it is very natural to assume that the inliers (i.e., normal data) locate in some dense region while the outliers are scattered in the feature space. Actually, many well known outlier recognition methods are based on this assumption (Breunig et al. (2000); Ester et al. (1996)). However, most of the density-based methods are only for low-dimensional space and quite limited for large-scale high-dimensional data that are very common in computer vision problems (note that several high- 3 dimensional approaches often are of heuristic natures and need strong assumptions (Kriegel et al. (2009); Kriegel, Schubert, and Zimek (2008); Aggarwal and Yu (2001)). Recently, (Liu, Hua, and Smith (2014)) applied the one-class support vector machine (SVM) method (Schölkopf et al. (1999)) to high-dimensional outlier recognition. Further, (Xia et al. (2015)) introduced a new unsupervised model of autoencoder inspired by the observation that inliers usually have smaller reconstruction errors than outliers. 
Although the aforementioned methods could efficiently solve the problem of outlier recognition to a certain extent, they still suffer from several issues such as high time and space complexities or unstable performances for different datasets. In this chapter, we present a novel geometric approach for outlier recognition. Roughly speaking, we try to build an approximate minimum enclosing ball (MEB) to cover the inliers but exclude the outliers. This model is seemed to be very simple but involves a couple of computational challenges. For example, the existence of outliers makes the problem to be not only non-convex but also highly combinatorial. Also, the high dimensionality makes the problem more difficult. To tackle these challenges, we develop a randomized algorithmic framework using a popular geometric concept called “core-set”. Comparing with existing results for outlier recognition, we provide a thorough analysis on the complexities and quality guarantee. Finally, we test our algorithm on both random and benchmark datasets and the experimental results reveal the advantages of our approach over various existing methods. 2.1.1 Related Work Besides the aforementioned existing results, many other methods for outlier recognition/anomaly detection were developed previously and the readers can refer to several excellent surveys (Kriegel, Kröger, and Zimek (2009); Chandola, Banerjee, and Kumar (2009); Gupta et al. (2014)). In computational geometry, a core-set (Agarwal, Har-Peled, and Varadarajan (2005)) is a small set of points that approximates the shape of a much larger point set, and thus can be used to significantly reduce the time complexities for many optimization problems (please refer to a recent survey (Phillips (2016))). In particular, a core-set can be applied to efficiently compute an 4 approximate MEB for a set of points in high-dimensional space (Badoiu, Har-Peled, and Indyk (2002); Kumar, Mitchell, and Yildirim (2003)). Moreover, (Bădoiu and Clarkson (2008); Badoiu and Clarkson (2003)) showed that it is possible to find a core-set of size (cid:100)1/(cid:101) that yields a (1 + )- approximate MEB, with an important advantage that the size is independent of the original size and dimensionality of the dataset. In fact, the algorithm for computing the core-set of MEB is a Frank-Wolfe style algorithm (Frank and Wolfe (1956)), which has been systematically studied by Clarkson (Clarkson (2010)). The problem of MEB with outliers also falls under the umbrella of the topic robust shape fitting (Har-Peled and Wang (2004); Agarwal, Har-Peled, and Yu (2008)), but most of the approaches cannot be applied to high-dimensional data. (Zarrabi-Zadeh and Mukhopadhyay (2009)) studied MEB with outliers in high dimension, however, the resulting approximation is a constant 2 that is not fit enough for the applications proposed in this chapter. Actually, our idea is inspired by a recent work about removing outliers for SVM (Ding and Xu (2015)), where they proposed a novel combinatorial approach called Random Gradient Descent (RGD) Tree. It is known that SVM is equivalent to finding the polytope distance from the origin to the Minkowski Difference of the given two labeled point sets. Gilbert algorithm (Gilbert (1966); Gärtner and Jaggi (2009)) is an efficient Frank-Wolfe algorithm for computing polytope distance, but a significant drawback is that the performance is too sensitive to outliers. To remedy this issue, RGD Tree accommodates the idea of randomization to Gilbert algorithm. 
Namely, it selects a small random sample in each step by a carefully designed strategy to overcome the adverse effect of outliers.

2.1.2 Preliminaries

As mentioned before, we model outlier recognition as a problem of MEB with outliers in high dimension. Here we first introduce several definitions that are used throughout this chapter.

Definition 1 (Minimum Enclosing Ball (MEB)). Given a set P of points in R^d, the MEB is the ball covering all the points with the smallest radius. The MEB is denoted by MEB(P).

Definition 2 (MEB with Outliers). Given a set P of n points in R^d and a small parameter γ ∈ (0, 1), MEB with outliers is to find the smallest ball that covers at least (1 − γ)n points. Namely, the task is to find a subset of P having at least (1 − γ)n points such that the resulting MEB is the smallest among all possible choices; the induced ball is denoted by MEB(P, γ).

From Definition 2 we can see that the major challenge is to determine the subset of P, which makes the problem a challenging combinatorial optimization. Therefore we relax our goal to its approximation as follows. For the sake of convenience, we always use P_opt to denote the optimal subset of P, that is,

P_opt = argmin_{P'} { the radius of MEB(P') | P' ⊂ P, |P'| ≥ (1 − γ)n },

and r_opt to denote the radius of MEB(P_opt).

Definition 3 (Bi-criteria Approximation). Given an instance (P, γ) for MEB with outliers and two small parameters 0 < ε, δ < 1, an (ε, δ)-approximation is a ball that covers at least (1 − (1 + δ)γ)n points and has radius at most (1 + ε) r_opt.

When both ε and δ are small, the bi-criteria approximation is very close to the optimal solution, with only a slight violation on the number of covered points and the radius.

2.2 Our Algorithm and Analyses

In this section, we present our method in detail. For the sake of completeness, we first briefly introduce the core-set for MEB based on the idea of (Badoiu and Clarkson (2003)). The algorithm is a simple iterative procedure with an elegant analysis: initially, it selects an arbitrary point and places it into a set S that is empty at the beginning; in each of the following ⌈2/ε⌉ steps, the algorithm finds the farthest point to the center of MEB(S) and adds it to S; finally, the center of MEB(S) induces a (1 + ε)-approximation for the MEB of the whole input point set. The selected ⌈2/ε⌉ points are also called the core-set of MEB. To ensure that a certain extent of improvement is achieved in each iteration, (Badoiu and Clarkson (2003)) showed that the following two inequalities hold if the algorithm always selects the farthest point to the temporary center of MEB(S):

r_{i+1} ≥ (1 + ε) r_opt − L_i,                                  (2.1)
r_{i+1} ≥ sqrt(r_i^2 + L_i^2),                                  (2.2)

where r_i and r_{i+1} are the radii of the i-th and the (i + 1)-th iterations respectively, r_opt is the optimal radius of the MEB, and L_i is the shifting distance of the center of MEB(S).

Algorithm 1 (ε, δ)-approximation Algorithm of Outlier Recognition Problem
Input: A point set P with n points in R^d, the fraction of outliers γ ∈ (0, 1) and four parameters 0 < ε, δ, µ < 1, h ∈ Z^+.
Output: A tree in which the attached point of each node is a candidate for the (ε, δ)-approximation solution.
1: Each node v in the tree is associated with a point (with a slight abuse of notation, we also use v to denote the point). Initially, randomly pick a point from P as root node r.
2: Starting with the root, grow each node as follows:
  (1) Let v be the current node.
  (2) If the height of v is h, v becomes a leaf node. Otherwise, perform the following steps:
    (a) Let P^r_v denote the set of points along the path from root r to node v, and c_v denote the center of MEB(P^r_v). We say that c_v is the attached point of v.
    (b) Let k = (1 + δ)γn. Compute the point set P_v containing the top k points which have the largest distances to c_v.
    (c) Take a random sample S_v of size (1 + 1/δ) ln(h/µ) from P_v, and let each point v' ∈ S_v be a child node of v.

2.2.1 Algorithm for MEB with Outliers

We present our (ε, δ)-approximation algorithm for MEB with outliers in this section. Although the outlier recognition problem belongs to unsupervised learning, we can estimate the fraction of outliers in the given data before executing our algorithm. In practice, we can randomly collect a small set of samples from the given data, manually identify the outliers, and estimate the outlier ratio γ. Therefore, in this chapter we assume that the outlier ratio is known.

To better understand our algorithm, we first illustrate the high-level idea. Taking a more careful look at the previously mentioned core-set construction algorithm (Badoiu and Clarkson (2003)), we can find that it is not necessary to select the farthest point to the center of MEB(S) in each step. Instead, as long as the selected point has a distance larger than (1 + ε) r_opt, the minimal extent of improvement is always guaranteed (Ding and Xu (2011)). As a consequence, we investigate the following approach. We denote the ball centered at point c with radius r > 0 as Ball(c, r). Recall that P_opt is the subset of P yielding the optimal MEB with outliers, and r_opt is the radius of MEB(P_opt) (see Section 2.1.2). In the i-th step, we add an arbitrary point from P_opt \ Ball(c_i, (1 + ε) r_opt) to S, where c_i is the current center of S. Based on the above observation, we know that a (1 + ε)-approximation is obtained after at most ⌈2/ε⌉ steps, that is, |P ∩ Ball(c_i, (1 + ε) r_opt)| ≥ (1 − γ)n when i ≥ ⌈2/ε⌉. However, in order to carry out the above approach we need to solve two key issues: how to determine the value of r_opt, and how to select a point belonging to P_opt \ Ball(c_i, (1 + ε) r_opt). Actually, we can implicitly avoid the first issue by replacing the radius (1 + ε) r_opt with the k-th largest distance from the points of P to c_i, where k is some appropriate number that will be determined in our following analysis. For the second issue, we have to take a small random sample instead of a single point from P_opt \ Ball(c_i, (1 + ε) r_opt) and try each of the sampled points; this operation results in a tree structure that is similar to the RGD Tree introduced by (Ding and Xu (2015)) for SVM. We present the algorithm in Algorithm 1 and place the detailed parameter settings, proof of correctness, and complexity analyses in Sections 2.2.2 & 2.2.3. We illustrate Step 2(2)(a-c) of Algorithm 1 in Figure 2.1.
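To make the construction concrete, the following is a minimal Python sketch of Algorithm 1; all function and variable names are ours, and `approx_meb_center` is only a stand-in for the (1 + ε)-approximate MEB routine of Section 2.2.3.

```python
import numpy as np

def approx_meb_center(points, eps=0.1):
    # Frank-Wolfe style stand-in for the (1 + eps)-approximate MEB center
    # (see Algorithm 2 in Section 2.2.3).
    points = np.asarray(points, dtype=float)
    c = points[0].copy()
    for t in range(1, int(np.ceil(1.0 / eps ** 2))):
        q = points[np.argmax(np.linalg.norm(points - c, axis=1))]
        c = c + (q - c) / (t + 1)
    return c

def rgd_tree_candidates(P, gamma, eps=0.1, delta=0.2, mu=0.1, seed=0):
    """A sketch of Algorithm 1: returns the candidate centers attached to the
    nodes of the random gradient descent tree (parameter names as in the text)."""
    P = np.asarray(P, dtype=float)
    n = len(P)
    h = int(np.ceil(2.0 / eps)) + 1                       # tree height (Theorem 1)
    k = int(np.ceil((1 + delta) * gamma * n))             # |P_v|, the farthest points
    s = int(np.ceil((1 + 1.0 / delta) * np.log(h / mu)))  # sample size |S_v|
    rng = np.random.default_rng(seed)

    candidates = []
    queue = [[int(rng.integers(n))]]                      # each entry is a path P^r_v
    while queue:
        path = queue.pop(0)
        c_v = approx_meb_center(P[path], eps)             # attached point of node v
        candidates.append(c_v)
        if len(path) >= h:                                # leaf node, stop growing
            continue
        dist = np.linalg.norm(P - c_v, axis=1)
        P_v = np.argsort(dist)[-k:]                       # top-k farthest points from c_v
        S_v = rng.choice(P_v, size=min(s, k), replace=False)
        for v_child in S_v:                               # every sample spawns a child
            queue.append(path + [int(v_child)])
    return candidates
```

Note that the number of nodes grows like |S_v|^{h−1}, which is exactly the hidden constant C discussed in Section 2.2.3, so ε, δ and µ have to be chosen to keep it moderate.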
2.2.2 Parameter Settings and Quality Guarantee

We denote the tree constructed by Algorithm 1 as H. The following theorem shows the success probability of Algorithm 1.

Theorem 1. If we set h = ⌈2/ε⌉ + 1, then with probability at least (1 − µ)(1 − γ) there exists at least one node of H yielding an (ε, δ)-approximation for the problem of MEB with outliers.

Before proving Theorem 1, we need to introduce several important lemmas.

Figure 2.1: The blue links represent the path from root r to node v, and P^r_v contains the four points along the path. The point set S_v corresponds to the child nodes of v.

Lemma 1 (Ding and Xu (2014)). Let Q be a set of elements, and Q' be a subset of Q with size |Q'| = β|Q| for some β ∈ (0, 1). If one randomly samples (1/β) ln(1/η) elements from Q, then with probability at least 1 − η the sample contains at least one element of Q', for any 0 < η < 1.

Lemma 2. For each node v, the set S_v in Algorithm 1 contains at least one point from P_opt with probability 1 − µ/h.

Proof. Since |P_v| = (1 + δ)nγ and |P \ P_opt| = nγ, we have

|P_v ∩ P_opt| / |P_v| = 1 − |P_v \ P_opt| / |P_v| ≥ 1 − |P \ P_opt| / |P_v| = δ / (1 + δ).    (2.3)

Note that the size of S_v is (1 + 1/δ) ln(h/µ). If we apply Lemma 1 with β = δ/(1 + δ) and η = µ/h, it is easy to see that S_v contains at least one point from P_opt with probability 1 − µ/h. □

Lemma 3. With probability (1 − γ)(1 − µ), there exists a leaf node u ∈ H such that the corresponding set P^r_u ⊂ P_opt.

Proof. Lemma 2 indicates that each node v has a child node corresponding to a point from P_opt with probability 1 − µ/h. In addition, the probability of root r belonging to P_opt is 1 − γ (recall that γ is the fraction of outliers). Note that the height of H is h; then with probability at least

(1 − γ)(1 − µ/h)^h > (1 − γ)(1 − µ),    (2.4)

there exists one leaf node u ∈ H satisfying P^r_u ⊂ P_opt. □

In the remaining analyses, we always assume that such a root-to-leaf path P^r_u described in Lemma 3 exists and only focus on the nodes along this path. We denote R̂ = (1 + ε) r_opt, where r_opt is the optimal radius of the MEB with outliers. Let Ball(c_v, r_v) be the MEB covering P \ P_v centered at c_v, and let the radii of MEB(P^r_v) and MEB(P^r_{v'}) be r̃_v and r̃_{v'} respectively. Readers can refer to Figure 2.2. The following lemma is a key observation for MEB.

Lemma 4 (Badoiu and Clarkson (2003)). Given a set P of points in R^d, let r_P and c_P be the radius and center of MEB(P) respectively. Then for any point p ∈ R^d with a distance K ≥ 0 to c_P, there exists a point q ∈ P such that ‖p − q‖ ≥ sqrt(r_P^2 + K^2).

The following lemma is the key for proving the quality guarantee of Algorithm 1. As mentioned in Section 2.2.1, the main idea follows the previous works (Badoiu and Clarkson (2003); Ding and Xu (2011)). For the sake of completeness, we present the detailed proof here.

Lemma 5. For each node v ∈ P^r_u, at least one of the following two events happens: (1) c_v is an (ε, δ)-approximation; (2) its child v' on the path P^r_u satisfies

r̃_{v'} ≥ R̂/2 + r̃_v^2 / (2R̂).    (2.5)

Proof. If r_v ≤ R̂, then we are done; that is, Ball(c_v, r_v) covers (1 − (1 + δ)γ)n points and r_v ≤ (1 + ε) r_opt. Otherwise, r_v > R̂ and we consider the second case. By the triangle inequality and the fact that v' (i.e., the point associated with the node v') lies outside Ball(c_v, r_v), we have

‖c_v − c_{v'}‖ + ‖c_{v'} − v'‖ ≥ ‖c_v − v'‖ > r_v > R̂.    (2.6)

Let ‖c_v − c_{v'}‖ = K_v. Combining this with the fact that ‖c_{v'} − v'‖ ≤ r̃_{v'}, we have

r̃_{v'} > R̂ − K_v.    (2.7)

Figure 2.2: The illustration of MEB(P^r_v) and MEB(P^r_{v'}); the blue and red points represent the inliers and outliers, respectively.

By Lemma 4, we know that there exists one point q (see Figure 2.2) in MEB(P^r_v) satisfying ‖q − c_{v'}‖ ≥ sqrt(r̃_v^2 + K_v^2). Since q is also inside MEB(P^r_{v'}), ‖q − c_{v'}‖ ≤ r̃_{v'}. Then, we have

r̃_{v'} ≥ sqrt(r̃_v^2 + K_v^2).    (2.8)

Combining (2.7) and (2.8), we obtain

r̃_{v'} ≥ max{ R̂ − K_v, sqrt(r̃_v^2 + K_v^2) }.    (2.9)

Because R̂ − K_v and sqrt(r̃_v^2 + K_v^2) are decreasing and increasing in K_v respectively, we set R̂ − K_v = sqrt(r̃_v^2 + K_v^2) to achieve the lower bound (i.e., K_v = (R̂^2 − r̃_v^2)/(2R̂)). Substituting this value of K_v into (2.9), we have r̃_{v'} ≥ R̂/2 + r̃_v^2/(2R̂). As a consequence, the second event happens and the proof is completed. □

Now we prove Theorem 1 by the idea from (Badoiu and Clarkson (2003)). Suppose no node in P^r_u makes the first event of Lemma 5 occur. As a consequence, we obtain a series of inequalities for each pair of radii r̃_{v'} and r̃_v (see (2.5)). For ease of analysis, we denote r̃_v = λ_i R̂ if the height of v is i in H. By Inequality (2.5), we have

λ_{i+1} ≥ (1 + λ_i^2) / 2.    (2.10)

Combining the initial case λ_1 = 0 and (2.10), we obtain

λ_i ≥ 1 − 2/(i + 1)    (2.11)

by induction (Badoiu and Clarkson (2003)). Note that the equality in (2.11) holds only when i = 1; therefore,

λ_h > 1 − 2/(h + 1) = 1 − 2/(⌈2/ε⌉ + 2) ≥ 1 − 2/(2/ε + 2) = 1/(1 + ε).    (2.12)

Then, r̃_u = λ_h R̂ > r_opt (recall that u is the leaf node on the path P^r_u), which is a contradiction to our assumption P^r_u ⊂ P_opt. The success probability directly comes from Lemma 3. Overall, we obtain Theorem 1.

2.2.3 Complexity Analyses

We analyze the time and space complexities of Algorithm 1 in this section.

Time Complexity. For each node v, we need to compute the corresponding MEB(P^r_v). To avoid the cost of computing the exact MEB, we apply the approximation algorithm proposed by (Badoiu and Clarkson (2003)). See Algorithm 2 for details.

Algorithm 2 Approximation Algorithm of MEB
Input: A point set Q in R^d, and N ∈ Z^+.
1: Start with an arbitrary point c_1 ∈ Q, t ← 1.
2: while t < N do
3:   Find the point q ∈ Q farthest away from c_t.
4:   c_{t+1} ← c_t + (q − c_t)/(t + 1).
5:   t ← t + 1.
6: end while
7: return c_t.

For Algorithm 2, we have the following theorem.

Theorem 2 (Badoiu and Clarkson (2003)). Let the center and radius of MEB(Q) be c_Q and r_Q respectively; then for all t, ‖c_Q − c_t‖ ≤ r_Q / √t.

From Theorem 2, we know that a (1 + ε)-approximation for MEB can be obtained when N = 1/ε^2, with the time complexity O(|Q| d / ε^2).
Suppose the height of node v is i; then the complexity for computing the corresponding approximate MEB(P^r_v) is O(i d / ε^2). Further, in order to obtain the point set P_v, we need to find the pivot point that has the (n − k)-th smallest distance to c_v. Here we apply the PICK algorithm (Blum et al. (1973)), which can find the l-th smallest of a set of n (l ≤ n) numbers in linear time. Consequently, the complexity for each node v at the i-th layer is O((n + i/ε^2) d). Recall that there are |S_v|^{i−1} nodes at the i-th layer of H. In total, the time complexity of our algorithm is

T = Σ_{i=1}^{h} ((1 + 1/δ) ln(h/µ))^{i−1} (n + i/ε^2) d.    (2.13)

If we assume 1/ε is a constant, the complexity T = O(nd) is linear in n and d, where the hidden constant is C = ((1 + 1/δ) ln(h/µ))^{h−1}. In our experiments, we can carefully choose the parameters δ, ε, µ so as to keep the value of C not too large.

Space Complexity. In our implementation, we use a queue Q to store the nodes in the tree. When the head of Q is popped, its |S_v| child nodes are pushed into Q. In other words, we simulate breadth first search on the tree H. Therefore, Q always keeps its size at most C = ((1 + 1/δ) ln(h/µ))^{h−1}. Note that each node v needs to store P^r_v to compute its corresponding MEB, but actually we only need to record the pointers linking the points in P^r_v. Therefore, the space complexity of Q is O(Ch). Together with the space complexity of the input data, the total space complexity of our algorithm is O(Ch + nd).

2.2.4 Boosting

By Theorem 1, we know that with probability at least (1 − µ)(1 − γ) there exists an (ε, δ)-approximation in the resulting tree. However, when the outlier ratio is high, say γ = 0.5, the success probability (1 − γ)(1 − µ) becomes small. To further improve the performance of our algorithm, we introduce the following two boosting methods.

1. Constructing a forest. Instead of building a single tree, we randomly initialize several root nodes and grow each root node into a tree. Suppose the number of root nodes is κ. The probability that there exists an (ε, δ)-approximation in the forest is at least 1 − (1 − (1 − γ)(1 − µ))^κ, which is much larger than (1 − γ)(1 − µ).

2. Sequentialization. First, initialize one root node and build a tree. Then select the best node in the tree and set it to be the root node for the next tree. After iteratively performing this procedure for several rounds, we can obtain a much more robust solution.

2.3 Experiments

From our analysis in Section 2.2.2, we know that Algorithm 1 results in a tree H where each node v has a candidate c_v for the desired (ε, δ)-approximation for the problem of MEB with outliers. For each candidate, we identify the nearest (1 − (1 + δ)γ)n points to c_v as the inliers. To determine the final solution, we select the candidate that has the smallest variance of the inliers.

2.3.1 Datasets and Methods to Be Compared

In our experiments, we test the algorithms on two random datasets and two benchmark image datasets. For the random datasets, we generate the data points based on normal and uniform distributions under the assumption that the inliers usually locate in dense regions while the outliers are scattered in the space. The benchmark image datasets include the popular MNIST (LeCun et al. (1998)) and Caltech (Fei-Fei, Fergus, and Perona (2007)). All of the experiments are performed on a Windows workstation with 2.4GHz Intel Xeon CPU and 32GB DDR4 Memory.

To make our experiments more convincing, we compare our algorithm with three well known methods for outlier recognition: angle-based outlier detection (ABOD) (Kriegel, Schubert, and Zimek (2008)), one-class SVM (OCSVM) (Schölkopf et al. (1999)), and discriminative reconstructions in an autoencoder (DRAE) (Xia et al. (2015)). Specifically, ABOD distinguishes the inliers and outliers by assessing the distribution of the angles determined by each 3-tuple of data points; OCSVM models the problem of outlier recognition as a soft-margin one-class SVM; DRAE applies an autoencoder to separate the inliers and outliers based on their reconstruction errors.

The performances of the algorithms are measured by the commonly used metric F1 score = 2 · Precision · Recall / (Precision + Recall), where precision is the proportion of the correctly identified positives relative to the total number of identified positives, and recall is the proportion of the correctly identified positives relative to the total number of positives in the dataset.
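A small helper matching these definitions (the naming is ours) could be:

```python
import numpy as np

def f1_score(pred_inlier, true_inlier):
    """Precision, recall and F1 for a predicted inlier mask, as defined above."""
    pred_inlier = np.asarray(pred_inlier, dtype=bool)
    true_inlier = np.asarray(true_inlier, dtype=bool)
    tp = np.sum(pred_inlier & true_inlier)            # correctly identified positives
    precision = tp / max(np.sum(pred_inlier), 1)      # share of identified positives
    recall = tp / max(np.sum(true_inlier), 1)         # share of all true positives found
    return 2 * precision * recall / max(precision + recall, 1e-12)
```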
2.3.2 Random Datasets

We validate our algorithm on the following two random datasets.

A toy example in 2D. To better illustrate the intuition of our algorithm, we first run it on a random dataset in 2D. We generate an instance of 10,000 points with the outlier ratio γ = 0.4. The inliers are generated by a normal distribution; the outliers consist of four groups, where the first three are generated by normal distributions and the last is generated by a uniform distribution. The four groups of outliers contain 800, 1200, 800, and 1200 points, respectively. See Figure 2.3. The red circle obtained by our algorithm is the boundary distinguishing the inliers and outliers, where the resulting F1 score is 0.944. From this case, we can see that our algorithm can efficiently recognize the densest region even if the outlier ratio is high and the outliers also form some dense regions in the space.

High-Dimensional Points. We further test our algorithm and the other three methods on a high-dimensional dataset. Similar to the previous 2D case, we generate 20,000 points with four groups of outliers in R^100; the outlier ratio γ varies from 0.1 to 0.5. The F1 scores are displayed in Table 2.1, from which we can see that our algorithm significantly outperforms the other three methods for all levels of outlier ratio.

Figure 2.3: The illustration of our algorithm on a 2-dimensional point set.

Table 2.1: The F1 scores for the high-dimensional random dataset.
γ 0.1 0.2 0.3 0.4 0.5
Algs ABOD OCSVM DRAE Ours
0.984 0.965 0.939 0.938 0.898
0.967 0.926 0.880 0.827 0.745
0.951 0.889 0.809 0.709 0.572
0.907 0.815 0.705 0.586 0.419

2.3.3 Benchmark Image Datasets

In this section, we evaluate all four methods on two benchmark image datasets.

1. MNIST Dataset

MNIST contains 70,000 handwritten digits (0 to 9) composed of both training and test datasets. For each of the 10 digits, we add the outliers by randomly selecting images from the other 9 digits. For each outlier ratio γ, we compute the average F1 score over all 10 digits. To map the images to a feature (Euclidean) space, we use two kinds of image features: PCA-grayscale and autoencoder feature.

(1) PCA-grayscale Feature. Each image in MNIST has a 28 × 28 grayscale which is represented by a 784-dimensional vector. Note that the images of MNIST have massive redundancy. For example, the digits often locate in the middle of the images and all the background pixels have the value 0. Therefore, we apply principal component analysis (PCA) to reduce the redundancy by trying multiple projection matrices which preserve 95%, 50%, and 10% energy of the original grayscale features. These three features are denoted as PCA-0.95, PCA-0.5 and PCA-0.1, respectively. The results are shown in Table 2.2. We notice that our F1 scores always achieve the highest by PCA-0.5; this is due to the fact that PCA-0.5 can significantly reduce the redundancy as well as preserve the most useful information (compared with PCA-0.95 and PCA-0.1).

Table 2.2: The F1 scores of the four methods on MNIST by using PCA-grayscale; the three columns for each γ correspond to PCA-0.95, PCA-0.5 and PCA-0.1, respectively.
γ 0.1 0.2 0.3 0.4 0.5 Algs ABOD 0.898, 0.895, 0.892 0.775, 0.774, 0.771 0.648, 0.617, 0.642 0.500, 0.470, 0.496 0.346, 0.329, 0.364 OCSVM 0.937, 0.941, 0.934 0.874, 0.883, 0.867 0.804, 0.817, 0.798 0.725, 0.740, 0.713 0.648, 0.639, 0.605 DRAE 0.913, 0.908, 0.911 0.822, 0.818, 0.816 0.726, 0.722, 0.711 0.620, 0.617, 0.602 0.531, 0.501, 0.488 Ours 0.939, 0.941, 0.936 0.881, 0.891, 0.880 0.822, 0.853, 0.823 0.760, 0.778, 0.773 0.633, 0.658, 0.651 (2) Autoencoder Feature. Autoencoder (Rumelhart, Hinton, and Williams (1988)) is often adopted to extract the features of grayscale images. The autoencoder model trained in our experiment has seven symmetrical hidden layers (1000-500-250-60-250-500-1000), and the input layer is a 784-dimensional grayscale. We use the features output by the middle hidden layer. The results are shown in Table 2.3 and our method achieves the best for most of the cases. 2. Caltech Dataset The Caltech-256 dataset1 includes 256 image sets. We choose 11 concepts as the inliers in 1http://www.vision.caltech.edu/Image_Datasets/Caltech256/ 17 Table 2.3: The F1 scores of the four methods on MNIST by using autoencoder feature. γ Algs ABOD OCSVM DRAE Ours 0.932 0.885 0.831 0.770 0.694 0.933 0.883 0.819 0.737 0.625 0.894 0.778 0.637 0.479 0.313 0.906 0.807 0.706 0.598 0.496 0.1 0.2 0.3 0.4 0.5 our experiment, which are airplane, binocular, bonsai, cup, face, ketch, laptop, motorbike, sneaker, t-shirt, and watch. We apply VGG net (Simonyan and Zisserman (2014)) to extract the image features, which have 4096 dimensions output by the second fully-connected layer. The results are shown in Table 2.4. Table 2.4: The F1 scores of the four methods on Caltech-256 by using VGG net feature. γ Algs ABOD OCSVM DRAE Ours 0.964 0.948 0.932 0.924 0.906 0.945 0.838 0.707 0.499 0.233 0.930 0.885 0.839 0.783 0.739 0.955 0.937 0.930 0.927 0.912 0.1 0.2 0.3 0.4 0.5 Unlike the random data, the distribution of real data in the space is much more complicated. To alleviate this problem, we try to capture the key parts of the original VGG net feature. Similar to MNIST dataset, we apply PCA to reduce the redundancy of VGG net feature. Three matrices are obtained to preserve 95%, 50%, and 10% energy respectively. The results are shown in Table 2.5, from which we can see that our method achieves the best for all the cases, especially when using PCA-0.5 (marked by underlines). More importantly, PCA-0.5 considerably improves the results that use the original VGG net features (compare Table 2.5 with Table 2.4), and has only 50 dimensions, which significantly reduces the running time. 18 Table 2.5: The F1 scores of the four methods on Caltech-256 by using PCA-VGG feature; the three columns for each γ correspond to PCA-0.95, PCA-0.5 and PCA-0.1, respectively. γ Algs ABOD 0.1 0.2 0.3 0.4 0.5 0.944, 0.942, 0.941 0.837, 0.832, 0.869 0.707, 0.708, 0.715 0.497, 0.489, 0.525 0.223, 0.199, 0.288 OCSVM 0.932, 0.914, 0.921 0.884, 0.894, 0.867 0.837, 0.869, 0.827 0.782, 0.830, 0.771 0.717, 0.790, 0.699 DRAE 0.955, 0.947, 0.928 0.918, 0.924, 0.878 0.873, 0.914, 0.835 0.873, 0.902, 0.773 0.869, 0.887, 0.692 Ours 0.966, 0.986, 0.949 0.950, 0.984, 0.923 0.934, 0.978, 0.897 0.916, 0.973, 0.871 0.899, 0.958, 0.844 2.3.4 Comparisons of Time Complexities From Section 2.3.2 and Section 2.3.3 we know that our method achieves the robust and competitive performances in terms of accuracy. In this section, we compare the time complexities of all the four algorithms. ABOD has the time complexity O(n3d). 
In the experiments, we use its speed-up edition FastABOD, which has the reduced time complexity O((n^2 + nk^2)d), where k is some specified parameter.

OCSVM is formulated as a quadratic program with the time complexity O(n^3).

DRAE alternately executes the following two steps: discriminative labeling and reconstruction learning. Suppose it runs N1 rounds; the two inner steps are also iterative procedures which both run N2 iterations. Thus, the total time complexity of DRAE is O(N1 N2 h d n), where h is the number of hidden layer nodes, which can generally be expressed as d/m (m is a constant); then the total time complexity becomes O(nd^2), where the hidden constant C̃ is a large constant depending on N1, N2 and m.

When the number of points n is large, FastABOD, OCSVM, and DRAE will be very time-consuming. On the contrary, our algorithm takes only linear running time (see Section 2.2.3) and usually runs much faster in practice.

2.4 Summary

In this chapter, we present a new approach for outlier recognition in high dimension. Most existing methods have high time and space complexities or cannot achieve a quality guaranteed solution. On the contrary, our method yields a nearly optimal solution with time and space complexities linear in the input size and dimensionality. Furthermore, the experimental results indicate that our approach outperforms several popular methods in terms of accuracy.

CHAPTER 3

GEOMETRIC ALIGNMENT IN LOW DOUBLING DIMENSION

3.1 Introduction

Given two geometric patterns, the problem of alignment is to find their appropriate spatial positions so as to minimize their difference. In general, a geometric pattern is represented by a set of (weighted) points in a space, and their difference is often measured by some objective function. Figure 3.1 presents an illustration of alignment.

Figure 3.1: The illustration of alignment of two geometric patterns.

In this chapter, we focus on geometric alignment. Before introducing the definition of geometric alignment, we first define the Earth Mover's Distance (EMD) and rigid transformation.

Definition 4 (EMD). Let A = {a_1, · · · , a_{n1}} and B = {b_1, · · · , b_{n2}} be two sets of weighted points in R^d with non-negative weights α_i and β_j for each a_i ∈ A and b_j ∈ B respectively, and let W_A = Σ_{i=1}^{n1} α_i, W_B = Σ_{j=1}^{n2} β_j be their respective total weights. The EMD between A and B is defined as

EMD(A, B) = (1 / min{W_A, W_B}) · min_F Σ_{i=1}^{n1} Σ_{j=1}^{n2} f_ij ‖a_i − b_j‖^2,    (3.1)

where F = { f_ij, i = 1, · · · , n1, j = 1, · · · , n2 } is a feasible flow from A to B, i.e., f_ij ≥ 0, Σ_{i=1}^{n1} f_ij ≤ β_j, Σ_{j=1}^{n2} f_ij ≤ α_i, and Σ_{i=1}^{n1} Σ_{j=1}^{n2} f_ij = min{W_A, W_B}.

See Figure 3.2 for an illustration of EMD. Intuitively, EMD can be viewed as the minimum transportation cost between A and B, where the weights of A and B are the "supplies" and "demands"
We consider rigid transformation for alignment, because it is very natural to interpret in real- world problems. Given the definitions of EMD and rigid transformation, here we formally define geometric alignment. Definition 6 (Geometric Alignment). Given two weighted point sets A and B as described in Definition 4, the geometric alignment between A and B under rigid transformation is to determine a rigid transformation T for B so as to minimize the EMD EMD(A, T(B)). In practice, geometric alignment has many applications in the field of computer vision such as image retrieval and matching (Cohen and Guibas (1999)), fingerprint recognition (Maltoni et al. (2009)) and face alignment (Cao et al. (2014)). Besides the computer vision applications in 2D or 3D, there are a lot of high dimensional problems that can be solved by geometric alignment. Here we briefly introduce several examples. 22 (1) Natural language processing Some research on natural language processing has indicated that different languages often share some similar structure at the word level (Youn et al. (2016)); particularly, recent study on word semantics embedding has also shown that there exists structural isomorphism across languages (Mikolov, Le, and Sutskever (2013)), and EMD is a good distance metric for languages and documents (Zhang et al. (2017); Kusner et al. (2015)). Therefore, (Zhang et al. (2017)) proposed to learn the transformation between different languages without any cross-lingual supervision, and the problem is reduced to minimizing the EMD by finding the optimal geometric alignment in high dimension. Figure 3.3 shows us an illustration of geometric alignment of word embeddings. Figure 3.3: This image is excerpted from (Zhang et al. (2017)). After obtaining Chinese and English word embeddings, the problem becomes finding the optimal geometric alignment to minimize their EMD. (2) Protein-Protein interaction (PPI) network A PPI network is a graph representing the interactions among proteins. Given two PPI networks, it is a fundamental bioinformatics problem to find their alignment in order to understand the correspondences between different species (Malod-Dognin, Ban, and Pržulj (2017)). However, most existing methods need to solve the subgraph isomorphism problem which is NP-hard, and often suffer from a high computational complexity. To circumvent this issue, (Liu et al. (2017)) 23 developed a new framework based on geometric alignment in Euclidean space. See Figure 3.4 for an illustration of PPI networks alignment. Figure 3.4: The illustration of PPI networks alignment (Liu et al. (2017)). Despite of a number of applications in reality, the research on geometric alignment algorithms is quite limited and far from being satisfactory. Basically, we need to take into account high dimensionality and large number of points of the geometric patterns simultaneously. Unfortunately, even for the 2D and 3D cases, it is quite challenging to solve the geometric alignment problem. In this chapter, we develop an efficient algorithmic framework for the geometric alignment prob- lem with large size and high dimensionality. Our key observation is that many real-world datasets often manifest low intrinsic dimensions (Belkin (2003)). For example, human handwriting images can be well embedded into some low dimensional manifold though their Euclidean dimensions can be very high (Tenenbaum, De Silva, and Langford (2000)). 
Inspired by the observation, we adopt the widely used concept “doubling dimension” (Krauthgamer and Lee (2004); Talwar (2004); Karger and Ruhl (2002); Har-Peled and Mendel (2006); Dasgupta and Sinha (2013)) to charac- terize the intrinsic dimension in geometric alignment problems. We prove that when computing geometric alignment, the geometric patterns with low doubling dimensions can be substantially compressed and thus save a lot running time. What’s more, our proposed compression method is an independent algorithm which can be utilized for various alignment approaches. 24 3.1.1 Related Work In this section, we recall some related work. tively and each edge connecting ai and bj has the weight(cid:13)(cid:13)ai − bj Given a bipartite graph where the two sets of nodes correspond to the points of A and B respec- (cid:13)(cid:13)2, then computing EMD(A, B) is actually a min-cost flow problem. A lot of min-cost flow algorithms have been proposed in the past decades, such as network simplex algorithm (Magnanti and Orlin (1993)) which is a specialized version of the simplex algorithm. Computing EMD is an instance of min-cost flow problem in Euclidean space, and a lot of faster algorithms have been proposed in computational geometry (Indyk (2007); Cabello et al. (2008); Agarwal et al. (2019)). However, most of these methods only work in low dimensional cases. (cid:16) 2d−1(cid:17) Actually, it is more challenging to find the geometric alignment between two sets because it requires one rigid transformation and EMD flows simultaneously. Moreover, because of the variety of rigid transformations, we can not employ the EMD embedding techniques (Indyk and Thaper (2003); Andoni et al. (2009)) to alleviate these issues. Specifically, the embedding can only preserve the EMD between A and B, however, since we do not know the rigid transformation T in advance, there are infinite T for set B, therefore, it is difficult to preserve the EMD between A and T(B). In addition, some theoretical methods have been proposed. (Klein and Veltkamp (2005)) presented an O -approximation algorithm in d-dimensional, then (Cabello et al. (2008)) proposed a (2 + )-approximation solution to the 2D case. Recently, (Ding and Xu (2017)) developed a polynomial-time approximation scheme (PTAS) for constant dimensionality. Unfortunately, these theoretical methods are impractical when dimensionality is not constant. (Ding and Xu (2017)) also proved that any constant factor approximation has a time complexity at least nΩ(d) based on some reasonable assumptions in the computational complexity theory, where n = max{|A|, |B|}, i.e., it is unlikely to obtain a (1 + )-approximation within practical running time, especially when n is large. 25 (cid:110) (cid:110) (cid:111) (cid:111) 3.1.2 Outline of Our Work In this section, we briefly introduce our work. Our work is mainly based on (Cohen and Guibas (1999)), which proposed an alternating minimization method to obtain the geometric alignment between two point sets. Roughly speaking, this method is similar to the Iterative Closest Point (ICP) method (Besl and McKay (1992)) which alternatively updates the rigid transformation and EMD flows in each iteration. To update the rigid transformation, we utilize the Orthogonal Procrustes (OP) analysis (Schönemann (1966)). Suppose the EMD flows F = { fi j} are known and we are updating the rigid transformation. 
Suppose that the two new sets of weighted points are

Â = { a_1^1, · · · , a_1^{n2}; a_2^1, · · · , a_2^{n2}; · · · ; a_{n1}^1, · · · , a_{n1}^{n2} },    (3.2)
B̂ = { b_1^1, · · · , b_{n2}^1; b_1^2, · · · , b_{n2}^2; · · · ; b_1^{n1}, · · · , b_{n2}^{n1} },    (3.3)

where each a_i^j (resp. b_j^i) has the weight f_ij and the same spatial position as a_i (resp. b_j). With a slight abuse of notation, a_i^j is also used to denote the corresponding column vector in R^d.

(1) We obtain a translation vector v such that the weighted centroids of Â and B̂ + v coincide with each other, which can be easily proved to be optimal because the objective function employs squared distances (Cohen and Guibas (1999)).

(2) Through the OP analysis, we obtain an orthogonal matrix R for B̂ + v that minimizes its weighted squared ℓ_2 distance to Â. For this purpose, we generate two d × (n1·n2) matrices M_A and M_B, where each point in Â (resp. B̂ + v) corresponds to one column in M_A (resp. M_B). Specifically, the point a_i^j ∈ Â (resp. b_j^i + v ∈ B̂ + v) corresponds to the column √f_ij · a_i^j (resp. √f_ij · (b_j^i + v)) in M_A (resp. M_B). Let the singular value decomposition (SVD) of M_A M_B^T be UΣV^T; then by the OP analysis, the optimal orthogonal matrix R equals UV^T. In fact, we do not need to explicitly construct the matrices M_A and M_B; instead, we can compute M_A M_B^T in O(n1 n2 d + min{n1, n2} · d^2) time (see Lemma 6 for details). Therefore, the time complexity of obtaining the optimal R is O(n1 n2 d + min{n1, n2} · d^2 + d^3).

Lemma 6. The multiplication M_A M_B^T can be computed in O(n1 n2 d + min{n1, n2} · d^2) time.

Proof. Denote by F the n1 × n2 matrix of the EMD flows where each entry is f_ij, and let F_{i,:} represent the i-th row of F. For a vector f, √f is the vector whose entries are the square roots of the corresponding entries of f. Also, diag(f) is the diagonal matrix whose i-th diagonal entry is the i-th entry of f. Following the constructions of M_A and M_B, we have

M_A = [ a_1 √F_{1,:}, a_2 √F_{2,:}, · · · , a_{n1} √F_{n1,:} ],
M_B = [ B diag(√F_{1,:}), B diag(√F_{2,:}), · · · , B diag(√F_{n1,:}) ],

and then

M_A M_B^T = Σ_{i=1}^{n1} (a_i √F_{i,:}) × (diag(√F_{i,:}) B^T) = Σ_{i=1}^{n1} a_i F_{i,:} B^T = A F B^T,

which takes O(n1 n2 d + min{n1, n2} · d^2) time. □

Based on the above analyses, we summarize as follows:

Proposition 1. In (Cohen and Guibas (1999)), each iteration costs

Γ(n1, n2, d) + O(n1 n2 d + min{n1, n2} · d^2 + d^3),

where Γ(n1, n2, d) is the time complexity of computing the flow matrix. Assume n1, n2 = O(n) and n ≥ d; then the time complexity simplifies to Γ(n, d) + O(n^2 d).

We notice that the bottleneck in each iteration is large n and d. To reduce the time complexity, we utilize the property of doubling dimension to construct two subsets S_A and S_B of the two original point sets A and B, and run the same alignment algorithm on S_A and S_B instead. Since n is decreased, the total time complexity is evidently reduced.
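To make the transformation update concrete, here is a minimal Python sketch of the standard weighted orthogonal Procrustes step described above; the function name and the point-as-column convention are our own, and the cross term is computed as A_c F B_c^T in the spirit of Lemma 6 rather than by materializing M_A and M_B.

```python
import numpy as np

def procrustes_update(A, B, F):
    """One transformation update given the EMD flows F (n1 x n2).
    A: d x n1, B: d x n2 (points as columns). Returns (R, t) so that the
    aligned copy of B is R @ B + t[:, None]."""
    W = F.sum()
    a_bar = A @ F.sum(axis=1) / W              # weighted centroid of A under F
    b_bar = B @ F.sum(axis=0) / W              # weighted centroid of B under F
    Ac, Bc = A - a_bar[:, None], B - b_bar[:, None]
    # cross term A_c F B_c^T (Lemma 6): O(n1*n2*d + min(n1, n2)*d^2)
    M = Ac @ (F @ Bc.T)
    U, _, Vt = np.linalg.svd(M)
    R = U @ Vt                                  # orthogonal; reflections are allowed
    t = a_bar - R @ b_bar                       # makes the weighted centroids coincide
    return R, t
```

The exact bookkeeping in (Cohen and Guibas (1999)) may differ, but alternating this update with a recomputation of the flows F gives the ICP-style iteration outlined above.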
Moreover, the proposed compression method is independent of the alignment algorithm provided that the alignment algorithm has the same objective function as Definition 6. 3.2 The Proposed Method In this section, we present the proposed method in detail. Our method starts with k-center clustering. Given an integer k ≥ 1 and a point set P in some metric space, k-center clustering is to partition P into k clusters and each cluster is covered by a ball such that the maximum radius of the k balls is minimized. It is known that k-center clustering is NP-hard. Fortunately, (Gonzalez (1985)) proposed an elegant 2-approximation algorithm, which is shown in Algorithm 3, where dis(p, S) = mins∈S (cid:107)p − s(cid:107). Algorithm 3 2-approximation k-center clustering Input: Point set P and parameter k Output: Point set S 1: Choose an arbitrary point c1 from P and S ← {c1} 2: for i = 2 to k do ci ← arg maxp∈P dis(p, S) 3: S ← S ∪ {ci} 4: 5: end for Initially, arbitrarily choose one point c1 from P and let S = {c1}, then iteratively select the point ci that has the largest distance to S from P and add ci to S until |S| = k. At first, we denote (X, dX) as a metric space, where dX is a metric of set X. A ball centered at point x ∈ X with radius r > 0 is formulated as Ball(x, r) = {p ∈ X|dX(x, p) ≤ r}, which is a subset of X. Claim 1. Suppose S = {c1, · · · , ck} and r = min(cid:8)(cid:13)(cid:13)ci − cj by the k balls, Ball(c1, r), Ball(c2, r), · · · , Ball(ck, r). (cid:13)(cid:13) , 1 ≤ i, j ≤ k(cid:9), then P can be covered 28 Proof. Suppose that there exists such one point p ∈ P that cannot be covered by any Ball(ci, r), 1 ≤ i ≤ k, then (cid:107)p − ci(cid:107) > r and p should be inside S, which is a contradiction. (cid:3) Now we introduce the definition of doubling dimension. Definition 7 (Doubling Dimension). The doubling dimension of a metric space (X, dX) is the smallest ρ such that for any x ∈ X and r ≥ 0, Ball(x, r) can be always covered by the union of at most 2ρ balls with radius r. For doubling dimension, we have the following claim. Claim 2. Let (X, dX) be a metric space with doubling dimension ρ and Y ⊂ X. If the aspect ratio of Y is upper bounded by α, then |Y| ≤ (2α)ρ. Here the aspect ratio is the ratio of the maximum and the minimum inter-point distances in one point set. Then we have the following lemma. Lemma 7. Let P be a point set in Rd with doubling dimension ρ (cid:28) d. The diameter of P is denoted as ∆ = max{(cid:107)p − q(cid:107) | p, q ∈ P}. Given a small parameter  > 0, if one runs Algorithm 3 with k = (cid:17) ρ, then the radius of each resulting ball is at most ∆. (cid:16) 2  Proof. Let S be the output of Algorithm 3 and r be the resulting radius, then obviously the aspect ratio of S ∪ {ck+1} is at most ∆ r ≤ ∆. r . By Claim 2, we have that k + 1 = (cid:3) r  (cid:16) 2 (cid:17) ρ + 1 ≤ (cid:16) 2∆ (cid:17) ρ and (cid:110) (cid:111) (cid:16) 2 (cid:17) ρ, run Algorithm 3 on A and B and return SA and SB respectively. Let A and B be two given point sets, and the maximum of their diameters be ∆. We also assume that their doubling dimension is ρ. The proposed algorithm for compressing the two original sets A and B is as follows. Set k = 1 , · · · , cA cA Denote SA = i ), we assign it a weight which equals the total weights of the points belong to this cluster. After that, we obtain a new instance (SA, SB) for geometric alignment. Note that the total weights of SA (resp. SB) equal WA (resp. WB). Our method is shown in Algorithm 4. . 
Algorithm 4 Geometric Alignment
Input: An instance (A, B) of the geometric alignment problem in Definition 6 with bounded doubling dimension ρ in R^d; a parameter ε ∈ (0, 1).
Output: A rigid transformation ˜T on B and the EMD flows between A and ˜T(B).
1: k ← (2/ε)^ρ.
2: Run Algorithm 3 on A and B, and obtain their subsets S_A and S_B respectively.
3: Apply the existing alignment algorithm, e.g., (Cohen and Guibas (1999)), on (S_A, S_B) and obtain the rigid transformation ˜T.
4: Compute the EMD flows between A and ˜T(B).

Algorithm 4 is illustrated in Figure 3.5. The proposed algorithm is not only efficient in practice but also has a theoretical upper bound.

Figure 3.5: Illustration of Algorithm 4. (a) Point sets A and B; (b) subsets S_A and S_B returned by Algorithm 3; (c) the rigid transformation ˜T computed on (S_A, S_B); (d) ˜T applied to (A, B).

Theorem 3. Suppose ε > 0 is the small parameter from Lemma 7. Given a constant c ≥ 1, let ˜T be a rigid transformation yielding a c-approximation of minimizing EMD(S_A, T(S_B)) in Definition 6. Then we have

EMD(A, ˜T(B)) ≤ O(c) · min_T EMD(A, T(B)) + O(cε)∆^2.   (3.4)

Proof. Firstly, we denote T_opt = arg min_T EMD(A, T(B)). Since ˜T yields a c-approximation of minimizing EMD(S_A, T(S_B)), we have

EMD(S_A, ˜T(S_B)) ≤ c · min_T EMD(S_A, T(S_B))   (3.5)
                 ≤ c · EMD(S_A, T_opt(S_B)).   (3.6)

Recall that the weight of each point c_i^A (resp. c_i^B) equals the total weight of the points belonging to its cluster. For instance, if the corresponding cluster of c_i^A is {a_{i_1}, · · · , a_{i_m}}, the weight of c_i^A equals Σ_{l=1}^{m} α_{i_l}. Actually, we can view c_i^A as m overlapping points a'_{i_1}, · · · , a'_{i_m}, with each a'_{i_l}, 1 ≤ l ≤ m, having the weight α_{i_l}. Therefore, for the sake of convenience, we reformulate S_A and S_B as

S_A = { a'_1, · · · , a'_{n_1} },   S_B = { b'_1, · · · , b'_{n_2} },   (3.7)

where each a'_i (resp. b'_j) has the weight α_i (resp. β_j). By Lemma 7, we have that for 1 ≤ i ≤ n_1 and 1 ≤ j ≤ n_2,

‖a'_i − a_i‖ ≤ ε∆,   ‖b'_j − b_j‖ ≤ ε∆,

and these bounds hold under any rigid transformation in the space, since rigid transformations preserve distances. Consequently, for any rigid transformation T and 1 ≤ i ≤ n_1, 1 ≤ j ≤ n_2, by applying the triangle inequality we have that

‖a_i − T(b_j)‖^2 ≤ ( ‖a_i − a'_i‖ + ‖a'_i − T(b'_j)‖ + ‖T(b'_j) − T(b_j)‖ )^2
               ≤ ( ‖a'_i − T(b'_j)‖ + 2ε∆ )^2
               = ‖a'_i − T(b'_j)‖^2 + 4ε∆ · ‖a'_i − T(b'_j)‖ + 4ε^2 ∆^2
               ≤ ‖a'_i − T(b'_j)‖^2 + 2ε( ∆^2 + ‖a'_i − T(b'_j)‖^2 ) + 4ε^2 ∆^2
               = (1 + 2ε) ‖a'_i − T(b'_j)‖^2 + (2ε + 4ε^2)∆^2,   (3.8)

where the second inequality uses ‖a_i − a'_i‖ ≤ ε∆ and ‖T(b'_j) − T(b_j)‖ = ‖b'_j − b_j‖ ≤ ε∆, and the last inequality uses 2∆‖a'_i − T(b'_j)‖ ≤ ∆^2 + ‖a'_i − T(b'_j)‖^2. Using the same idea, we have that

‖a'_i − T(b'_j)‖^2 ≤ (1 + 2ε) ‖a_i − T(b_j)‖^2 + (2ε + 4ε^2)∆^2.   (3.9)

Based on Definition 4, we denote ˜F = { ˜f_ij } as the induced flows of EMD(S_A, ˜T(S_B)) (using the expression (3.7) for S_A and S_B). Since ˜F is also a feasible set of flows between A and ˜T(B), (3.8) directly implies that

EMD(A, ˜T(B)) ≤ (1 / min{W_A, W_B}) Σ_{i=1}^{n_1} Σ_{j=1}^{n_2} ˜f_ij ‖a_i − ˜T(b_j)‖^2
             ≤ ((1 + 2ε) / min{W_A, W_B}) Σ_{i=1}^{n_1} Σ_{j=1}^{n_2} ˜f_ij ‖a'_i − ˜T(b'_j)‖^2 + (2ε + 4ε^2)∆^2
             = (1 + 2ε) · EMD(S_A, ˜T(S_B)) + (2ε + 4ε^2)∆^2.   (3.10)

Similarly, let F = { f_ij } be the induced flows of EMD(A, T_opt(B)); since F is also feasible between S_A and T_opt(S_B), (3.9) directly implies that

EMD(S_A, T_opt(S_B)) ≤ (1 / min{W_A, W_B}) Σ_{i=1}^{n_1} Σ_{j=1}^{n_2} f_ij ‖a'_i − T_opt(b'_j)‖^2
                     ≤ ((1 + 2ε) / min{W_A, W_B}) Σ_{i=1}^{n_1} Σ_{j=1}^{n_2} f_ij ‖a_i − T_opt(b_j)‖^2 + (2ε + 4ε^2)∆^2
                     = (1 + 2ε) · EMD(A, T_opt(B)) + (2ε + 4ε^2)∆^2.   (3.11)

Combining (3.6), (3.10) and (3.11), we obtain that

EMD(A, ˜T(B)) ≤ (1 + 2ε) · EMD(S_A, ˜T(S_B)) + (2ε + 4ε^2)∆^2
             ≤ (1 + 2ε) · c · EMD(S_A, T_opt(S_B)) + (2ε + 4ε^2)∆^2
             ≤ c(1 + 2ε)^2 · EMD(A, T_opt(B)) + (8cε^3 + 8cε^2 + 4ε^2 + 2ε + 2cε)∆^2
             = O(c) · EMD(A, T_opt(B)) + O(cε)∆^2,   (3.12)

which completes the proof. □
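Putting the pieces together, a compact Python sketch of Algorithm 4 can be written as follows. The helper compress is the k-center sketch given earlier, while align and compute_emd_flows stand for an alignment routine with the objective of Definition 6 (e.g., the iterative method of (Cohen and Guibas (1999))) and an EMD/min-cost-flow solver; both are treated as assumed black boxes, and all names are mine rather than the thesis's.

import numpy as np

def algorithm4(A, alpha, B, beta, eps, rho, align, compute_emd_flows):
    """A: n1 x d, B: n2 x d point sets (rows); alpha, beta: their weight vectors."""
    k = int(np.ceil((2.0 / eps) ** rho))        # Step 1: k = (2/eps)^rho
    SA, wA = compress(A, alpha, k)              # Step 2: compress both sets by k-center clustering
    SB, wB = compress(B, beta, k)
    v, R = align(SA, wA, SB, wB)                # Step 3: rigid transformation ~T found on the subsets
    B_aligned = (B + v) @ R.T                   # apply ~T(b) = R(b + v) to the original set B
    flows = compute_emd_flows(A, alpha, B_aligned, beta)   # Step 4: EMD flows between A and ~T(B)
    return v, R, flows

Note that only Step 4 touches the full instance; the iterative alignment itself runs on the compressed sets.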
3.3 Time Complexity Analysis

In this section we analyze the time complexity of Algorithm 4, considering Steps 2–4 individually. To simplify the analysis, we let n = max{n_1, n_2}, and in Step 3 we assume that the alignment algorithm (Cohen and Guibas (1999)) takes λ iterations.

(1) Step 2. A straightforward implementation of Algorithm 3 selects the k cluster centers iteratively, with a resulting running time of O(knd). Several faster implementations for the high-dimensional case with low doubling dimension have been studied before; their idea is to maintain appropriate data structures that reduce the amortized complexity of each iteration. We refer readers to (Har-Peled and Mendel (2006)) for more details.

(2) Step 3. We run the alignment algorithm (Cohen and Guibas (1999)) on the instance (S_A, S_B) instead of (A, B); by Proposition 1, the time complexity of Step 3 is O(λ(Γ(k, d) + k^2 d + kd^2 + d^3)).

(3) Step 4. In this step, we compute ˜T(B) first and then EMD(A, ˜T(B)), which cost O(nd^2) and Γ(n, d) respectively.

Note that the complexity Γ(n, d) is usually O(n^2 d), which dominates the complexity of Step 2 and the remaining cost of Step 4. Based on the above analyses, we have the following theorem.

Theorem 4. Suppose n = max{n_1, n_2} ≥ d and the alignment algorithm (Cohen and Guibas (1999)) takes λ iterations. The running time of Algorithm 4 is

O(λ(Γ(k, d) + k^2 d + kd^2 + d^3)) + Γ(n, d).

As a contrast, the time complexity of running the same alignment algorithm on the original instance (A, B) is O(λ(Γ(n, d) + n^2 d)) by Proposition 1. When k ≪ n, Algorithm 4 achieves a significant reduction in the running time.

3.4 Experiments

In this section we implement the proposed method and test its performance on both random and real datasets. All of the experiments are performed on a Windows workstation with a 2.4GHz Intel Xeon CPU and 32GB DDR4 memory. For each dataset, we run 20 trials and report the average results. We set the iterative approach (Cohen and Guibas (1999)) to terminate when the change of the objective value is less than 10^{-3}.
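For concreteness, the termination rule above can be wrapped around the per-iteration update sketched earlier (rotation_update); the sketch below is only an assumed implementation under my own naming, with compute_emd_flows standing for a black-box EMD/min-cost-flow solver that returns both the flow matrix and the objective value.

import numpy as np

def iterative_alignment(A, alpha, B, beta, compute_emd_flows, tol=1e-3, max_iter=100):
    """Alternate between flow computation and the rigid-motion update
    (rotation_update above) until the objective changes by less than tol."""
    B_cur = B.copy()
    prev_obj = np.inf
    for _ in range(max_iter):
        F, obj = compute_emd_flows(A, alpha, B_cur, beta)
        if abs(prev_obj - obj) < tol:           # the 10^{-3} stopping criterion used above
            break
        prev_obj = obj
        v, R = rotation_update(A, B_cur, F)     # translation + OP rotation for the current flows
        B_cur = (B_cur + v) @ R.T               # apply the updated rigid transformation
    return B_cur, prev_obj

Composing the per-iteration pairs (v, R) yields the overall rigid transformation ˜T if it is needed explicitly.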
3.4.1 Random dataset

To construct a random dataset, we randomly generate two manifolds in R^500, which are represented by polynomial functions of low dimension (≤ 50); from the manifolds, we take two sets of randomly sampled points with sizes n_1 = 2 × 10^4 and n_2 = 3 × 10^4, respectively. Also, for each sampled point, we randomly assign a positive weight, and we finally obtain two weighted point sets as an instance of geometric alignment.

For each instance, we try different compression levels. In Algorithm 4, we set the size of each compressed point set to be k. In the experiments, we set k = γ × max{n_1, n_2}, where γ ∈ {1/50, 1/40, 1/30, 1/20, 1/10, 1}; in particular, γ = 1 indicates that we directly run the alignment algorithm (Cohen and Guibas (1999)) without compression. The purpose of our proposed approach is to design a compression method such that the resulting quantities on the original and compressed datasets are close.

The results on the random dataset are shown in Table 3.1. The obtained EMDs on the compressed instances are only slightly higher than the one of γ = 1, while the running time is significantly reduced. For example, the running time of γ = 1/50 is less than 5% of that of γ = 1.

Table 3.1: Random dataset: EMD and running time of different compression levels.

γ          1/50    1/40    1/30    1/20    1/10    1
EMD        0.948   0.946   0.945   0.943   0.941   0.933
Time (s)   48.7    54.2    61.0    80.6    144.6   1418.2

To further show the robustness of our method, we additionally add Gaussian noise to the random dataset and study the change of the objective value by varying the noise level. The standard deviation of the Gaussian noise is set to be η × ∆, where ∆ is the maximum diameter of the point sets and η varies from 0.5 × 10^{-2} to 2.5 × 10^{-2}. Figure 3.6 shows that the obtained EMDs over the baseline remain very stable (slightly higher than 1).

Figure 3.6: The EMDs over baseline for different noise levels (x-axis: compression level γ from 1/50 to 1/10; y-axis: EMD over baseline, from 1 to 1.1; one curve for each η ∈ {0.005, 0.010, 0.015, 0.020, 0.025}).

3.4.2 Real dataset

For real datasets, we consider the two applications mentioned in Section 3.1: unsupervised bilingual lexicon induction and PPI network alignment.

(1) In the first application, we have 5 pairs of languages: Chinese-English, Spanish-English, Italian-English, Japanese-Chinese, and Turkish-English. Provided by (Zhang et al. (2017)), each language has a vocabulary list that consists of 3000 to 13000 words. We follow their preprocessing idea and represent all the words by vectors in R^50 through the embedding technique (Mikolov, Le, and Sutskever (2013)). Each vocabulary list is thus represented by a distribution in the space, where each vector has a weight equal to the corresponding word frequency in the language.

(2) In the second application, we use the popular benchmark dataset NAPAbench (Sahraeian and Yoon (2012)) of PPI networks, which contains 3 pairs of PPI networks; each network is a graph of 3000–10000 nodes. As the preprocessing step, we apply the node2vec technique (Grover and Leskovec (2016)) to represent each network by a group of vectors in R^100, and assign a unit weight to each vector (Liu et al. (2017)).

On the two real datasets we obtain performances similar to those on the random dataset; the results are shown in Table 3.2 and Table 3.3, respectively. The EMD for each compression level is always at most 1.2 times the baseline with γ = 1, while the corresponding running time is dramatically reduced.

Table 3.2: Linguistic datasets: EMD and running time of different compression levels.

              es-en            it-en            ja-zh            tr-en            zh-en
γ          EMD    Time (s)   EMD    Time (s)   EMD    Time (s)   EMD    Time (s)   EMD    Time (s)
1/50       0.989  3.4        0.983  5.4        0.991  1.6        0.982  10.1       1.014  1.9
1/40       0.969  3.6        0.966  5.9        0.975  2.1        0.960  10.4       1.012  1.7
1/30       0.955  3.7        0.951  6.1        0.962  2.2        0.946  11.2       0.990  2.2
1/20       0.931  3.9        0.935  6.5        0.941  2.0        0.922  11.9       0.962  2.4
1/10       0.892  5.3        0.899  8.8        0.910  3.0        0.889  15.9       0.923  2.7
1          0.820  100.5      0.847  162.6      0.836  67.0       0.839  287.0      0.842  51.6

Table 3.3: PPI network datasets: EMD and running time of different compression levels.

              CG               DMC              DMR              dm-mm            hs-mm
γ          EMD     Time (s)  EMD     Time (s)  EMD     Time (s)  EMD     Time (s)  EMD     Time (s)
1/50       0.0645  2.3       0.0642  3.0       0.0567  2.3       0.1344  0.6       0.0774  1.4
1/40       0.0644  1.8       0.0641  2.9       0.0566  2.1       0.1343  0.6       0.0773  1.5
1/30       0.0643  2.3       0.0641  3.0       0.0564  2.4       0.1343  0.6       0.0773  1.5
1/20       0.0643  1.9       0.0639  3.0       0.0564  2.1       0.1342  0.7       0.0773  1.6
1/10       0.0642  2.5       0.0639  3.1       0.0563  2.7       0.1342  0.8       0.0772  1.8
1          0.0642  7.6       0.0638  11.4      0.0562  13.7      0.1341  2.0       0.0772  4.3
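As a small illustration of the experimental setup described in this section, the mapping from a compression level γ to the subset size k and the noise injection with standard deviation η∆ might look as follows. This is only a sketch with assumed names, and the diameter ∆ is replaced by a cheap bounding-box upper bound.

import numpy as np

rng = np.random.default_rng(0)

def compression_size(gamma, n1, n2):
    """Map a compression level gamma to the subset size k used in Algorithm 4;
    gamma = 1 corresponds to running the alignment on the uncompressed instance."""
    return int(round(gamma * max(n1, n2)))

def add_noise(P, eta):
    """Perturb each point with Gaussian noise whose standard deviation is eta * Delta;
    here Delta is approximated by the diagonal of the bounding box of P."""
    delta = np.linalg.norm(P.max(axis=0) - P.min(axis=0))
    return P + rng.normal(scale=eta * delta, size=P.shape)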
3.5 Summary

In this chapter, we propose a novel framework for compressing point sets in high dimension while approximately preserving the quality of geometric alignment. This work is motivated by several emerging applications in the fields of machine learning, bioinformatics and wireless networks. We propose an efficient compression algorithm by virtue of the property of low doubling dimension. Moreover, our method has a theoretical upper bound and achieves a much smaller time complexity. Experimental results indicate that our method can significantly speed up the existing alignment algorithms while preserving the alignment quality.

BIBLIOGRAPHY

Achlioptas, D. 2003. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences 66(4):671–687.

Agarwal, P. K.; Fox, K.; Panigrahi, D.; Varadarajan, K. R.; and Xiao, A. 2019. Faster algorithms for the geometric transportation problem. arXiv preprint arXiv:1903.08263.

Agarwal, P. K.; Har-Peled, S.; and Varadarajan, K. R. 2005. Geometric approximation via coresets. Combinatorial and Computational Geometry 52:1–30.

Agarwal, P. K.; Har-Peled, S.; and Yu, H. 2008. Robust shape fitting via peeling and grating coresets. Discrete & Computational Geometry 39(1-3):38–58.

Aggarwal, C. C., and Yu, P. S. 2001. Outlier detection for high dimensional data. ACM SIGMOD Record 30(2):37–46.

Andoni, A.; Do Ba, K.; Indyk, P.; and Woodruff, D. 2009. Efficient sketches for earth-mover distance, with applications. In 2009 50th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 324–330.

Badoiu, M., and Clarkson, K. L. 2003. Smaller core-sets for balls. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), 801–802.

Bădoiu, M., and Clarkson, K. L. 2008. Optimal core-sets for balls. Computational Geometry 40(1):14–22.

Badoiu, M.; Har-Peled, S.; and Indyk, P. 2002. Approximate clustering via core-sets. In Proceedings of the ACM Symposium on Theory of Computing (STOC), 250–257.

Belkin, M. 2003. Problems of learning on manifolds.

Besl, P. J., and McKay, N. D. 1992. Method for registration of 3-D shapes. In Sensor Fusion IV: Control Paradigms and Data Structures, volume 1611, 586–607.

Blum, M.; Floyd, R. W.; Pratt, V.; Rivest, R. L.; and Tarjan, R. E. 1973. Time bounds for selection. Journal of Computer and System Sciences 7(4):448–461.

Breunig, M. M.; Kriegel, H.-P.; Ng, R. T.; and Sander, J. 2000. LOF: Identifying density-based local outliers. ACM SIGMOD Record 29(2):93–104.
Cabello, S.; Giannopoulos, P.; Knauer, C.; and Rote, G. 2008. Matching point sets with respect to the earth mover's distance. Computational Geometry 39(2):118–133.

Cao, X.; Wei, Y.; Wen, F.; and Sun, J. 2014. Face alignment by explicit shape regression. International Journal of Computer Vision 107(2):177–190.

Chandola, V.; Banerjee, A.; and Kumar, V. 2009. Anomaly detection: A survey. ACM Computing Surveys (CSUR) 41(3):15.

Clarkson, K. L. 2010. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms 6(4):63.

Cohen, S., and Guibas, L. 1999. The earth mover's distance under transformation sets. In Proceedings of the Seventh IEEE International Conference on Computer Vision (ICCV), volume 2, 1076–1083.

Dasgupta, S., and Sinha, K. 2013. Randomized partition trees for exact nearest neighbor search. In Conference on Learning Theory (COLT), 317–337.

Ding, H., and Xu, J. 2011. Solving the chromatic cone clustering problem via minimum spanning sphere. In Proceedings of the International Colloquium on Automata, Languages, and Programming (ICALP), 773–784.

Ding, H., and Xu, J. 2014. Sub-linear time hybrid approximations for least trimmed squares estimator and related problems. In Proceedings of the International Symposium on Computational Geometry (SoCG), 110.

Ding, H., and Xu, J. 2015. Random gradient descent tree: A combinatorial approach for SVM with outliers. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2561–2567.

Ding, H., and Xu, J. 2017. FPTAS for minimizing the earth mover's distance under rigid transformations and related problems. Algorithmica 78(3):741–770.

Ester, M.; Kriegel, H.-P.; Sander, J.; and Xu, X. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 226–231.

Fei-Fei, L.; Fergus, R.; and Perona, P. 2007. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding 106(1):59–70.

Frank, M., and Wolfe, P. 1956. An algorithm for quadratic programming. Naval Research Logistics Quarterly 3(1-2):95–110.

Gärtner, B., and Jaggi, M. 2009. Coresets for polytope distance. In Proceedings of the International Symposium on Computational Geometry (SoCG), 33–42.

Gilbert, E. G. 1966. An iterative procedure for computing the minimum of a quadratic form on a convex set. SIAM Journal on Control 4(1):61–80.

Gonzalez, T. F. 1985. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science 38:293–306.

Grover, A., and Leskovec, J. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 855–864.

Gupta, M.; Gao, J.; Aggarwal, C.; and Han, J. 2014. Outlier detection for temporal data. Synthesis Lectures on Data Mining and Knowledge Discovery 5(1):1–129.

Har-Peled, S., and Mendel, M. 2006. Fast construction of nets in low-dimensional metrics and their applications. SIAM Journal on Computing 35(5):1148–1184.

Har-Peled, S., and Wang, Y. 2004. Shape fitting with outliers. SIAM Journal on Computing 33(2):269–285.

Hinton, G. E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. R. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
Indyk, P., and Thaper, N. 2003. Fast image retrieval via embeddings.

Indyk, P. 2007. A near linear time constant factor approximation for Euclidean bichromatic matching (cost). In SODA, volume 7, 39–42.

Karger, D. R., and Ruhl, M. 2002. Finding nearest neighbors in growth-restricted metrics. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing (STOC), 741–750.

Klein, O., and Veltkamp, R. C. 2005. Approximation algorithms for computing the earth mover's distance under transformations. In International Symposium on Algorithms and Computation (ISAAC), 1019–1028.

Krauthgamer, R., and Lee, J. R. 2004. Navigating nets: Simple algorithms for proximity search. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 798–807.

Kriegel, H.-P.; Kröger, P.; Schubert, E.; and Zimek, A. 2009. Outlier detection in axis-parallel subspaces of high dimensional data. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 831–838.

Kriegel, H.-P.; Kröger, P.; and Zimek, A. 2009. Outlier detection techniques. Tutorial at PAKDD.

Kriegel, H.-P.; Schubert, M.; and Zimek, A. 2008. Angle-based outlier detection in high-dimensional data. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 444–452.

Kumar, P.; Mitchell, J. S. B.; and Yildirim, E. A. 2003. Approximate minimum enclosing balls in high dimensions using core-sets. ACM Journal of Experimental Algorithmics 8.

Kusner, M.; Sun, Y.; Kolkin, N.; and Weinberger, K. 2015. From word embeddings to document distances. In International Conference on Machine Learning (ICML), 957–966.

LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.

Liu, Y.; Ding, H.; Chen, D.; and Xu, J. 2017. Novel geometric approach for global alignment of PPI networks. In Thirty-First AAAI Conference on Artificial Intelligence (AAAI).

Liu, W.; Hua, G.; and Smith, J. R. 2014. Unsupervised one-class learning for automatic outlier removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3826–3833.

Lu, C.; Shi, J.; and Jia, J. 2013. Abnormal event detection at 150 FPS in MATLAB. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2720–2727.

Magnanti, T., and Orlin, J. 1993. Network Flows: Theory, Algorithms and Applications.

Malod-Dognin, N.; Ban, K.; and Pržulj, N. 2017. Unified alignment of protein-protein interaction networks. Scientific Reports 7(1):953.

Maltoni, D.; Maio, D.; Jain, A. K.; and Prabhakar, S. 2009. Handbook of Fingerprint Recognition. Springer Science & Business Media.

Mikolov, T.; Le, Q. V.; and Sutskever, I. 2013. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.

Munteanu, A., and Schwiegelshohn, C. 2018. Coresets-methods and history: A theoreticians design pattern for approximation and streaming algorithms. KI-Künstliche Intelligenz 32(1):37–53.

Phillips, J. M. 2016. Coresets and sketches. Computing Research Repository.

Rumelhart, D. E.; Hinton, G. E.; and Williams, R. J. 1988. Learning representations by back-propagating errors. Cognitive Modeling 5(3):1.

Sahraeian, S. M. E., and Yoon, B.-J. 2012. A network synthesis model for generating protein interaction network families. PLoS ONE 7(8):e41474.
Sakurada, M., and Yairi, T. 2014. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the Workshop on Machine Learning for Sensory Data Analysis (MLSDA), 4.

Schölkopf, B.; Williamson, R.; Smola, A.; Shawe-Taylor, J.; and Platt, J. 1999. Support vector method for novelty detection. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), 582–588.

Schönemann, P. H. 1966. A generalized solution of the orthogonal Procrustes problem. Psychometrika 31(1):1–10.

Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Song, L.; Ma, H.; Wu, M.; Zhou, Z.; and Fu, M. 2018. A brief survey of dimension reduction. In International Conference on Intelligent Science and Big Data Engineering (IScIDE), 189–200.

Talwar, K. 2004. Bypassing the embedding: Algorithms for low dimensional metrics. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing (STOC), 281–290.

Tan, P.-N.; Steinbach, M.; and Kumar, V. 2006. Introduction to Data Mining.

Tenenbaum, J. B.; De Silva, V.; and Langford, J. C. 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290(5500):2319–2323.

Xia, Y.; Cao, X.; Wen, F.; Hua, G.; and Sun, J. 2015. Learning discriminative reconstructions for unsupervised outlier removal. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 1511–1519.

Youn, H.; Sutton, L.; Smith, E.; Moore, C.; Wilkins, J. F.; Maddieson, I.; Croft, W.; and Bhattacharya, T. 2016. On the universal structure of human lexical semantics. Proceedings of the National Academy of Sciences 113(7):1766–1771.

Zarrabi-Zadeh, H., and Mukhopadhyay, A. 2009. Streaming 1-center with outliers in high dimensions. In Proceedings of the Canadian Conference on Computational Geometry (CCCG), 83–86.

Zhang, M.; Liu, Y.; Luan, H.; and Sun, M. 2017. Earth mover's distance minimization for unsupervised bilingual lexicon induction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1934–1945.